Handwritten Digit Recognition

This tutorial guides you through a classic computer vision application: identify hand written digits with neural networks.

Load data

We first fetch the MNIST dataset, which is a commonly used dataset for handwritten digit recognition. Each image in this dataset has been resized into 28x28 with grayscale value between 0 and 254. The following codes download and load the images and the according labels into numpy.

import numpy as np
import os
import urllib
import gzip
import struct
def download_data(url, force_download=True): 
    fname = url.split("/")[-1]
    if force_download or not os.path.exists(fname):
        urllib.urlretrieve(url, fname)
    return fname

def read_data(label_url, image_url):
    with gzip.open(download_data(label_url)) as flbl:
        magic, num = struct.unpack(">II", flbl.read(8))
        label = np.fromstring(flbl.read(), dtype=np.int8)
    with gzip.open(download_data(image_url), 'rb') as fimg:
        magic, num, rows, cols = struct.unpack(">IIII", fimg.read(16))
        image = np.fromstring(fimg.read(), dtype=np.uint8).reshape(len(label), rows, cols)
    return (label, image)

(train_lbl, train_img) = read_data(
    path+'train-labels-idx1-ubyte.gz', path+'train-images-idx3-ubyte.gz')
(val_lbl, val_img) = read_data(
    path+'t10k-labels-idx1-ubyte.gz', path+'t10k-images-idx3-ubyte.gz')

We plot the first 10 images and print their labels.

%matplotlib inline
import matplotlib.pyplot as plt
for i in range(10):
    plt.imshow(train_img[i], cmap='Greys_r')
print('label: %s' % (train_lbl[0:10],))


label: [5 0 4 1 9 2 1 3 1 4]

Next we create data iterators for MXNet. The data iterator, which is similar the iterator, returns a batch of data in each next() call. A batch contains several images with its according labels. These images are stored in a 4-D matrix with shape (batch_size, num_channels, width, height). For the MNIST dataset, there is only one color channel, and both width and height are 28. In addition, we often shuffle the images used for training, which accelerates the training progress.

import mxnet as mx

def to4d(img):
    return img.reshape(img.shape[0], 1, 28, 28).astype(np.float32)/255

batch_size = 100
train_iter = mx.io.NDArrayIter(to4d(train_img), train_lbl, batch_size, shuffle=True)
val_iter = mx.io.NDArrayIter(to4d(val_img), val_lbl, batch_size)

Multilayer Perceptron

A multilayer perceptron contains several fully-connected layers. A fully-connected layer, with an n x m input matrix X outputs a matrix Y with size n x k, where k is often called as the hidden size. This layer has two parameters, the m x k weight matrix W and the m x 1 bias vector b. It compute the outputs with

$$Y = W X + b.$$

The output of a fully-connected layer is often feed into an activation layer, which performs element-wise operations. Two common options are the sigmoid function, or the rectifier (or “relu”) function, which outputs the max of 0 and the input.

The last fully-connected layer often has the hidden size equals to the number of classes in the dataset. Then we stack a softmax layer, which map the input into a probability score. Again assume the input X has size n x m:

$$ \left[\frac{\exp(x_{i1})}{\sum_{j=1}^m \exp(x_{ij})},\ldots, \frac{\exp(x_{im})}{\sum_{j=1}^m \exp(x_{ij})}\right] $$

Defining the multilayer perceptron in MXNet is straightforward, which has shown as following.

# Create a place holder variable for the input data
data = mx.sym.Variable('data')
# Flatten the data from 4-D shape (batch_size, num_channel, width, height) 
# into 2-D (batch_size, num_channel*width*height)
data = mx.sym.Flatten(data=data)

# The first fully-connected layer
fc1  = mx.sym.FullyConnected(data=data, name='fc1', num_hidden=128)
# Apply relu to the output of the first fully-connnected layer
act1 = mx.sym.Activation(data=fc1, name='relu1', act_type="relu")

# The second fully-connected layer and the according activation function
fc2  = mx.sym.FullyConnected(data=act1, name='fc2', num_hidden = 64)
act2 = mx.sym.Activation(data=fc2, name='relu2', act_type="relu")

# The thrid fully-connected layer, note that the hidden size should be 10, which is the number of unique digits
fc3  = mx.sym.FullyConnected(data=act2, name='fc3', num_hidden=10)
# The softmax and loss layer
mlp  = mx.sym.SoftmaxOutput(data=fc3, name='softmax')

# We visualize the network structure with output size (the batch_size is ignored.)
shape = {"data" : (batch_size, 1, 28, 28)}
mx.viz.plot_network(symbol=mlp, shape=shape)


Now both the network definition and data iterators are ready. We can start training.

import logging

model = mx.model.FeedForward(
    symbol = mlp,       # network structure
    num_epoch = 10,     # number of data passes for training 
    learning_rate = 0.1 # learning rate of SGD 
    X=train_iter,       # training data
    eval_data=val_iter, # validation data
    batch_end_callback = mx.callback.Speedometer(batch_size, 200) # output progress for each 200 data batches
WARNING:root:[Deprecation Warning] mxnet.model.FeedForward has been deprecated. Please use mxnet.mod.Module instead.
INFO:root:Start training with [cpu(0)]
INFO:root:Epoch[0] Batch [200]  Speed: 21765.89 samples/sec Train-accuracy=0.111300
INFO:root:Epoch[0] Batch [400]  Speed: 21178.05 samples/sec Train-accuracy=0.113800
INFO:root:Epoch[0] Batch [600]  Speed: 22749.74 samples/sec Train-accuracy=0.145050
INFO:root:Epoch[0] Resetting Data Iterator
INFO:root:Epoch[0] Time cost=2.770
INFO:root:Epoch[0] Validation-accuracy=0.245700
INFO:root:Epoch[1] Batch [200]  Speed: 22469.74 samples/sec Train-accuracy=0.401450
INFO:root:Epoch[1] Batch [400]  Speed: 22132.54 samples/sec Train-accuracy=0.752800
INFO:root:Epoch[1] Batch [600]  Speed: 22298.21 samples/sec Train-accuracy=0.830450
INFO:root:Epoch[1] Resetting Data Iterator
INFO:root:Epoch[1] Time cost=2.697
INFO:root:Epoch[1] Validation-accuracy=0.851800
INFO:root:Epoch[2] Batch [200]  Speed: 22641.03 samples/sec Train-accuracy=0.863100
INFO:root:Epoch[2] Batch [400]  Speed: 22535.41 samples/sec Train-accuracy=0.889600
INFO:root:Epoch[2] Batch [600]  Speed: 20554.33 samples/sec Train-accuracy=0.904700
INFO:root:Epoch[2] Resetting Data Iterator
INFO:root:Epoch[2] Time cost=2.751
INFO:root:Epoch[2] Validation-accuracy=0.910900
INFO:root:Epoch[3] Batch [200]  Speed: 22567.61 samples/sec Train-accuracy=0.919100
INFO:root:Epoch[3] Batch [400]  Speed: 21676.39 samples/sec Train-accuracy=0.926950
INFO:root:Epoch[3] Batch [600]  Speed: 22239.15 samples/sec Train-accuracy=0.935050
INFO:root:Epoch[3] Resetting Data Iterator
INFO:root:Epoch[3] Time cost=2.715
INFO:root:Epoch[3] Validation-accuracy=0.937000
INFO:root:Epoch[4] Batch [200]  Speed: 22807.12 samples/sec Train-accuracy=0.940900
INFO:root:Epoch[4] Batch [400]  Speed: 22676.59 samples/sec Train-accuracy=0.946550
INFO:root:Epoch[4] Batch [600]  Speed: 22396.59 samples/sec Train-accuracy=0.949450
INFO:root:Epoch[4] Resetting Data Iterator
INFO:root:Epoch[4] Time cost=2.659
INFO:root:Epoch[4] Validation-accuracy=0.948200
INFO:root:Epoch[5] Batch [200]  Speed: 21642.09 samples/sec Train-accuracy=0.954500
INFO:root:Epoch[5] Batch [400]  Speed: 22713.46 samples/sec Train-accuracy=0.957200
INFO:root:Epoch[5] Batch [600]  Speed: 22707.05 samples/sec Train-accuracy=0.958150
INFO:root:Epoch[5] Resetting Data Iterator
INFO:root:Epoch[5] Time cost=2.692
INFO:root:Epoch[5] Validation-accuracy=0.957600
INFO:root:Epoch[6] Batch [200]  Speed: 22707.33 samples/sec Train-accuracy=0.961650
INFO:root:Epoch[6] Batch [400]  Speed: 21229.35 samples/sec Train-accuracy=0.963350
INFO:root:Epoch[6] Batch [600]  Speed: 21281.46 samples/sec Train-accuracy=0.964600
INFO:root:Epoch[6] Resetting Data Iterator
INFO:root:Epoch[6] Time cost=2.770
INFO:root:Epoch[6] Validation-accuracy=0.962200
INFO:root:Epoch[7] Batch [200]  Speed: 22256.75 samples/sec Train-accuracy=0.966750
INFO:root:Epoch[7] Batch [400]  Speed: 22260.44 samples/sec Train-accuracy=0.969000
INFO:root:Epoch[7] Batch [600]  Speed: 22725.88 samples/sec Train-accuracy=0.970050
INFO:root:Epoch[7] Resetting Data Iterator
INFO:root:Epoch[7] Time cost=2.684
INFO:root:Epoch[7] Validation-accuracy=0.964600
INFO:root:Epoch[8] Batch [200]  Speed: 22429.85 samples/sec Train-accuracy=0.971950
INFO:root:Epoch[8] Batch [400]  Speed: 22352.06 samples/sec Train-accuracy=0.972100
INFO:root:Epoch[8] Batch [600]  Speed: 20139.16 samples/sec Train-accuracy=0.974400
INFO:root:Epoch[8] Resetting Data Iterator
INFO:root:Epoch[8] Time cost=2.786
INFO:root:Epoch[8] Validation-accuracy=0.966900
INFO:root:Epoch[9] Batch [200]  Speed: 21898.75 samples/sec Train-accuracy=0.975250
INFO:root:Epoch[9] Batch [400]  Speed: 22031.53 samples/sec Train-accuracy=0.974700
INFO:root:Epoch[9] Batch [600]  Speed: 22043.60 samples/sec Train-accuracy=0.977300
INFO:root:Epoch[9] Resetting Data Iterator
INFO:root:Epoch[9] Time cost=2.736
INFO:root:Epoch[9] Validation-accuracy=0.968200

After training is done, we can predict a single image.

plt.imshow(val_img[0], cmap='Greys_r')
prob = model.predict(val_img[0:1].astype(np.float32)/255)[0]
assert max(prob) > 0.99, "Low prediction accuracy."
print 'Classified as %d with probability %f' % (prob.argmax(), max(prob))


Classified as 7 with probability 0.999391

We can also evaluate the accuracy given a data iterator.

valid_acc = model.score(val_iter)
print 'Validation accuracy: %f%%' % (valid_acc *100,)
assert valid_acc > 0.95, "Low validation accuracy."
Validation accuracy: 96.820000%

Even more, we can recognizes the digit written on the below box.

from IPython.display import HTML
import cv2
import numpy as np

def classify(img):
    img = img[len('data:image/png;base64,'):].decode('base64')
    img = cv2.imdecode(np.fromstring(img, np.uint8), -1)
    img = cv2.resize(img[:,:,3], (28,28))
    img = img.astype(np.float32).reshape((1,1,28,28))/255.0
    return model.predict(img)[0].argmax()

To see the model in action, run the demo notebook at
Sorry, your browser doesn't support canvas technology.

<button id="classify" onclick="classify()">

<button id="clear" onclick="myClear()">
<input type="text" id="result_output" size="5" value="">

Convolutional Neural Networks

Note that the previous fully-connected layer simply reshapes the image into a vector during training. It ignores the spatial information that pixels are correlated on both horizontal and vertical dimensions. The convolutional layer aims to improve this drawback by using a more structural weight $W$. Instead of simply matrix-matrix multiplication, it uses 2-D convolution to obtain the output.

We can also have multiple feature maps, each with their own weight matrices, to capture different features:

Besides the convolutional layer, another major change of the convolutional neural network is the adding of pooling layers. A pooling layer reduce a $n\times m$ (often called kernal size) image patch into a single value to make the network less sensitive to the spatial location.

data = mx.symbol.Variable('data')
# first conv layer
conv1 = mx.sym.Convolution(data=data, kernel=(5,5), num_filter=20)
tanh1 = mx.sym.Activation(data=conv1, act_type="tanh")
pool1 = mx.sym.Pooling(data=tanh1, pool_type="max", kernel=(2,2), stride=(2,2))
# second conv layer
conv2 = mx.sym.Convolution(data=pool1, kernel=(5,5), num_filter=50)
tanh2 = mx.sym.Activation(data=conv2, act_type="tanh")
pool2 = mx.sym.Pooling(data=tanh2, pool_type="max", kernel=(2,2), stride=(2,2))
# first fullc layer
flatten = mx.sym.Flatten(data=pool2)
fc1 = mx.symbol.FullyConnected(data=flatten, num_hidden=500)
tanh3 = mx.sym.Activation(data=fc1, act_type="tanh")
# second fullc
fc2 = mx.sym.FullyConnected(data=tanh3, num_hidden=10)
# softmax loss
lenet = mx.sym.SoftmaxOutput(data=fc2, name='softmax')
mx.viz.plot_network(symbol=lenet, shape=shape)


Note that LeNet is more complex than the previous multilayer perceptron, so we use GPU instead of CPU for training.

model = mx.model.FeedForward(
    ctx = mx.gpu(0),     # use GPU 0 for training, others are same as before
    symbol = lenet,       
    num_epoch = 10,     
    learning_rate = 0.1)
    batch_end_callback = mx.callback.Speedometer(batch_size, 200)
assert model.score(val_iter) > 0.98, "Low validation accuracy."
WARNING:root:[Deprecation Warning] mxnet.model.FeedForward has been deprecated. Please use mxnet.mod.Module instead.
INFO:root:Start training with [gpu(0)]
INFO:root:Epoch[0] Batch [200]  Speed: 21246.94 samples/sec Train-accuracy=0.111550
INFO:root:Epoch[0] Batch [400]  Speed: 23678.45 samples/sec Train-accuracy=0.113800
INFO:root:Epoch[0] Batch [600]  Speed: 23676.68 samples/sec Train-accuracy=0.110600
INFO:root:Epoch[0] Resetting Data Iterator
INFO:root:Epoch[0] Time cost=2.639
INFO:root:Epoch[0] Validation-accuracy=0.113500
INFO:root:Epoch[1] Batch [200]  Speed: 23831.23 samples/sec Train-accuracy=0.149400
INFO:root:Epoch[1] Batch [400]  Speed: 23667.46 samples/sec Train-accuracy=0.780650
INFO:root:Epoch[1] Batch [600]  Speed: 23671.74 samples/sec Train-accuracy=0.912300
INFO:root:Epoch[1] Resetting Data Iterator
INFO:root:Epoch[1] Time cost=2.535
INFO:root:Epoch[1] Validation-accuracy=0.935700
INFO:root:Epoch[2] Batch [200]  Speed: 23709.24 samples/sec Train-accuracy=0.942450
INFO:root:Epoch[2] Batch [400]  Speed: 23602.33 samples/sec Train-accuracy=0.957550
INFO:root:Epoch[2] Batch [600]  Speed: 23586.94 samples/sec Train-accuracy=0.964650
INFO:root:Epoch[2] Resetting Data Iterator
INFO:root:Epoch[2] Time cost=2.545
INFO:root:Epoch[2] Validation-accuracy=0.970200
INFO:root:Epoch[3] Batch [200]  Speed: 23784.07 samples/sec Train-accuracy=0.971200
INFO:root:Epoch[3] Batch [400]  Speed: 23692.30 samples/sec Train-accuracy=0.975100
INFO:root:Epoch[3] Batch [600]  Speed: 23679.82 samples/sec Train-accuracy=0.976250
INFO:root:Epoch[3] Resetting Data Iterator
INFO:root:Epoch[3] Time cost=2.535
INFO:root:Epoch[3] Validation-accuracy=0.979500
INFO:root:Epoch[4] Batch [200]  Speed: 23785.40 samples/sec Train-accuracy=0.979900
INFO:root:Epoch[4] Batch [400]  Speed: 23686.64 samples/sec Train-accuracy=0.981900
INFO:root:Epoch[4] Batch [600]  Speed: 23704.04 samples/sec Train-accuracy=0.980700
INFO:root:Epoch[4] Resetting Data Iterator
INFO:root:Epoch[4] Time cost=2.535
INFO:root:Epoch[4] Validation-accuracy=0.982500
INFO:root:Epoch[5] Batch [200]  Speed: 23740.93 samples/sec Train-accuracy=0.984100
INFO:root:Epoch[5] Batch [400]  Speed: 23670.94 samples/sec Train-accuracy=0.986350
INFO:root:Epoch[5] Batch [600]  Speed: 23675.28 samples/sec Train-accuracy=0.984400
INFO:root:Epoch[5] Resetting Data Iterator
INFO:root:Epoch[5] Time cost=2.538
INFO:root:Epoch[5] Validation-accuracy=0.984600
INFO:root:Epoch[6] Batch [200]  Speed: 23825.26 samples/sec Train-accuracy=0.986750
INFO:root:Epoch[6] Batch [400]  Speed: 23705.14 samples/sec Train-accuracy=0.988550
INFO:root:Epoch[6] Batch [600]  Speed: 23710.37 samples/sec Train-accuracy=0.986350
INFO:root:Epoch[6] Resetting Data Iterator
INFO:root:Epoch[6] Time cost=2.532
INFO:root:Epoch[6] Validation-accuracy=0.986400
INFO:root:Epoch[7] Batch [200]  Speed: 23811.82 samples/sec Train-accuracy=0.988800
INFO:root:Epoch[7] Batch [400]  Speed: 23713.21 samples/sec Train-accuracy=0.990200
INFO:root:Epoch[7] Batch [600]  Speed: 23710.62 samples/sec Train-accuracy=0.988350
INFO:root:Epoch[7] Resetting Data Iterator
INFO:root:Epoch[7] Time cost=2.532
INFO:root:Epoch[7] Validation-accuracy=0.987700
INFO:root:Epoch[8] Batch [200]  Speed: 23741.21 samples/sec Train-accuracy=0.990350
INFO:root:Epoch[8] Batch [400]  Speed: 23655.82 samples/sec Train-accuracy=0.991600
INFO:root:Epoch[8] Batch [600]  Speed: 23653.22 samples/sec Train-accuracy=0.989900
INFO:root:Epoch[8] Resetting Data Iterator
INFO:root:Epoch[8] Time cost=2.539
INFO:root:Epoch[8] Validation-accuracy=0.987600
INFO:root:Epoch[9] Batch [200]  Speed: 23753.31 samples/sec Train-accuracy=0.991800
INFO:root:Epoch[9] Batch [400]  Speed: 23663.07 samples/sec Train-accuracy=0.992850
INFO:root:Epoch[9] Batch [600]  Speed: 23688.74 samples/sec Train-accuracy=0.990850
INFO:root:Epoch[9] Resetting Data Iterator
INFO:root:Epoch[9] Time cost=2.537
INFO:root:Epoch[9] Validation-accuracy=0.988300

Note that, with the same hyper-parameters, LeNet achieves 98.7% validation accuracy, which improves on the previous multilayer perceptron accuracy of 96.6%.

Because we rewrite the model parameters in mod, now we can try the previous digit recognition box again to check if or not the new CNN model improves the classification accuracy.