Image Classification

In this tutorial, we assign labels to an image with confidence scores. The following figure (source) shows an example:

Get the source code for the tutorial from GitHub.


To train models on a particular dataset, use the corresponding training script. For example:

  • To train an MLP on mnist, use this command:
  python train_mnist.py
  • To save the models for each epoch, use this command:
  mkdir model; python train_mnist.py --model-prefix model/mnist
  • To restart training from the model saved at epoch 8, use this command:
  python train_mnist.py --model-prefix model/mnist --load-epoch 8
  • To choose another initial learning rate, and decay it by 0.9 every half epoch, use this command (a rough Python equivalent is sketched after this list):
  python train_mnist.py --lr .1 --lr-factor .9 --lr-factor-epoch .5
  • To train a convolutional neural network on mnist by using GPU 0, use this command:
  python train_mnist.py --network lenet --gpus 0
  • To use multiple GPUs, specify the list; for example: --gpus 0,1,3.
  • To see more options, use --help.
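
As a rough guide, the checkpoint and learning-rate options above map onto MXNet's Python API roughly as follows. This is a minimal sketch, not part of the training scripts; epoch_size is an assumed placeholder:

    import mxnet as mx

    # --model-prefix model/mnist --load-epoch 8: load the checkpoint saved at epoch 8
    model = mx.model.FeedForward.load('model/mnist', epoch=8)

    # --lr .1 --lr-factor .9 --lr-factor-epoch .5: start at lr 0.1 and multiply it
    # by 0.9 every half epoch; epoch_size (number of batches per epoch) is assumed
    epoch_size = 600
    scheduler = mx.lr_scheduler.FactorScheduler(step=int(0.5 * epoch_size), factor=0.9)
    optimizer = mx.optimizer.SGD(learning_rate=0.1, lr_scheduler=scheduler)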

Distributed Training

To speed up training, you can train a model using multiple computers.

  • Quickly test distributed training on your local computer by using two workers:
  ../../tools/launch.py -n 2 python train_mnist.py --kv-store dist_sync

You can use either synchronous SGD (`dist_sync`) or asynchronous SGD (`dist_async`).
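
To make the difference concrete, with dist_sync the gradients from all workers are aggregated before every weight update, while with dist_async the weights are updated as soon as any worker pushes its gradients. A minimal Python sketch of creating the store (it only initializes correctly when the job is started by the launcher):

    import mxnet as mx

    # 'dist_sync' = synchronous SGD across workers; 'dist_async' = asynchronous SGD
    kv = mx.kvstore.create('dist_sync')
    print('worker %d of %d' % (kv.rank, kv.num_workers))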

  • If you have several computers that you can connect to over SSH, and the mxnet folder is accessible on all of them (for example, mounted as an NFS share; see a tutorial for Ubuntu), you can run a job across them. First save their hostnames in a file (an example hosts file is sketched after this list), for example:
  $ cat hosts
  • Then pass this file using -H:
  ../../tools/launch.py -n 2 -H hosts python train_mnist.py --kv-store dist_sync
  • If the mxnet folder isn’t available on the other computers, copy the mxnet library to this example folder:
  cp -r ../../python/mxnet .
  cp -r ../../lib/libmxnet.so mxnet

Then synchronize this folder to /tmp/mxnet on the other computers before running:

  ../../tools/launch.py -n 2 -H hosts --sync-dir /tmp/mxnet python train_mnist.py --kv-store dist_sync
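
For reference, the hosts file is simply a list of hostnames or IP addresses, one per line, each reachable over password-less SSH. The addresses below are placeholders, not values from this tutorial:

    192.0.2.11
    192.0.2.12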

For more launch options (for example, using YARN) and information about how to write a distributed training program, see this tutorial.

Generating Predictions

You have several options for generating predictions.
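
One straightforward option is to load a saved checkpoint and call its predict method. The following is a minimal sketch using the FeedForward API; the checkpoint prefix, epoch, and input data are assumptions made for illustration:

    import mxnet as mx
    import numpy as np

    # load the symbol and weights saved with --model-prefix model/mnist at epoch 10
    model = mx.model.FeedForward.load('model/mnist', epoch=10)

    # any DataIter over the images to classify works; here, random placeholder data
    images = np.random.rand(10, 784).astype('float32')
    data_iter = mx.io.NDArrayIter(images, batch_size=10)

    prob = model.predict(data_iter)        # shape: (num_examples, num_classes)
    pred_label = np.argmax(prob, axis=1)   # label with the highest confidence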

Using Your Own Datasets

There are two ways to feed data into MXNet:

  • Pack all examples into one or more compact recordio files. For more information, see this step-by-step tutorial and the documentation. Avoid the common mistake of neglecting to shuffle the image list during packing; this causes training to fail, for example, the accuracy stays at 0.001 for several rounds. (A sketch of reading a packed file back appears after this list.)

    Note: We automatically download the small datasets, such as mnist and cifar10.

  • For small datasets that can be easily loaded into memory, here is an example:

        import mxnet as mx
        from sklearn.datasets import fetch_mldata
        from sklearn.utils import shuffle
        mnist = fetch_mldata('MNIST original', data_home="./mnist")
        # shuffle data
        X, y = shuffle(mnist.data, mnist.target)
        # split dataset
        train_data = X[:50000, :].astype('float32')
        train_label = y[:50000]
        val_data = X[50000:60000, :].astype('float32')
        val_label = y[50000:60000]
        # normalize data
        train_data[:] /= 256.0
        val_data[:] /= 256.0
        # create numpy iterators
        batch_size = 100
        train_iter = mx.io.NDArrayIter(train_data, train_label, batch_size=batch_size, shuffle=True)
        val_iter = mx.io.NDArrayIter(val_data, val_label, batch_size=batch_size)
        # create the model as usual: model = mx.model.FeedForward(...)
        model.fit(X=train_iter, eval_data=val_iter)
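
To read back a packed recordio file (the first approach above), you typically use mx.io.ImageRecordIter. This is a hedged sketch; the file path, image shape, and augmentation settings are assumptions, not values from this tutorial:

    import mxnet as mx

    train_iter = mx.io.ImageRecordIter(
        path_imgrec="data/train.rec",   # the packed recordio file
        data_shape=(3, 227, 227),       # channels, height, width
        batch_size=128,
        rand_crop=True,                 # light training-time augmentation
        rand_mirror=True,
        preprocess_threads=4)           # number of CPU decoding threads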

Improving Performance

The following factors can significantly improve performance:

  • A fast back end. A fast BLAS library (e.g., openblas, atlas, or mkl) is necessary only if you are training on a CPU. For Nvidia GPUs, we strongly recommend using CUDNN.
  • Input data:
    • Data format. Use the rec format.
    • Number of threads used for decoding. By default, MXNet uses four CPU threads to decode images, which can often decode more than 1K images per second. If you are using a low-end CPU or very powerful GPUs, you can increase the number of threads.

    • Data storage location. Any local or distributed file system (HDFS, Amazon S3) should be fine. However, if multiple computers read the data from a network shared file system (NFS) at the same time, you might encounter a problem.

    • Batch size. We recommend using the largest batch size that the GPU memory can accommodate. A value that is too large might slow down convergence. A safe batch size for CIFAR 10 is approximately 200; for ImageNet 1K, the batch size can exceed 1,000.

  • The right kvstore, if you are using more than one GPU or computer. For more information, see this guide. (A short Python sketch of passing the kvstore to training follows this list.)
    • For a single computer, the default `local` is often sufficient. For models bigger than 100 MB, such as AlexNet and VGG, you might want to use `local_allreduce_device`. Note that `local_allreduce_device` uses more GPU memory than the other options.
    • For multiple computers, we recommend trying `dist_sync` first. If the model is very large or if you use a large number of computers, you might want to use `dist_async`.
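
The kvstore choice is passed to training either through --kv-store, as in the commands above, or directly in Python. A rough sketch (the model and data iterators are assumed to exist, as in the earlier examples):

    import mxnet as mx

    # pick one of: 'local', 'local_allreduce_device', 'dist_sync', 'dist_async'
    kv = mx.kvstore.create('local_allreduce_device')
    # model.fit(X=train_iter, eval_data=val_iter, kvstore=kv)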


  • Computers
    name        hardware                                        software
    GTX980      Xeon E5-1650 v3, 4 x GTX 980                    GCC 4.8, CUDA 7.5, CUDNN 3
    TitanX      dual Xeon E5-2630 v3, 4 x GTX Titan X           GCC 4.8, CUDA 7.5, CUDNN 3
    EC2-g2.8x   Xeon E5-2670, 2 x GRID K520, 10G Ethernet       GCC 4.8, CUDA 7.5, CUDNN 3
  • Datasets
    name        classes   image size       training    testing
    CIFAR 10    10        28 × 28 × 3      60,000      10,000
    ILSVRC 12   1,000     227 × 227 × 3    1,281,167   50,000


  • Command
python train_cifar10.py --batch-size 128 --lr 0.1 --lr-factor .94 --num-epoch 50
  • Performance:
    1 GTX 980     2 GTX 980      4 GTX 980
    842 img/sec   1640 img/sec   2943 img/sec


VGG with `--network vgg`

  • Performance
    Cluster   # machines   # GPUs   batch size   kvstore                  epoch time
    TitanX    1            1        96           none                     14,545
                                                 local                    19,692
                                                 local_allreduce_device    9,142

Inception with Batch Normalization with `--network inception-bn`

  • Performance
    Cluster     # machines   # GPUs   batch size   kvstore      epoch time
    GTX980      1            1        32           local        13,210
                             2        64
                             3        128
    TitanX      1            1        128          none         10,666
                                                   local         5,161
    EC2-g2.8x   1            4        144          local        14,203
                10           40       144          dist_sync     1,422
  • Convergence
    • Single machine:
  python train_imagenet.py --batch-size 144 --lr 0.05 --lr-factor .94 \
      --gpus 0,1,2,3 --num-epoch 60 --network inception-bn \
      --data-dir ilsvrc12/ --model-prefix model/ilsvrc12
  • 10 x g2.8x: the hosts file contains the private IPs of the 10 computers
  ../../tools/launch.py -H hosts -n 10 --sync-dir /tmp/mxnet  \
      python train_imagenet.py --batch-size 144 --lr 0.05 --lr-factor .94 \
        --gpus 0,1,2,3 --num-epoch 60 --network inception-bn \
        --kv-store dist_sync \
        --data-dir s3://dmlc/ilsvrc12/  --model-prefix s3://dmlc/model/ilsvrc12

Note: Occasional instability in Amazon S3 might cause training to hang or generate frequent errors. If that happens, consider downloading the data to /mnt first.