Optimization: initialize and update weights

Overview

This document summarizes the APIs used to initialize and update the model weights during training,

mxnet.initializer Weight initializer.
mxnet.optimizer Weight updating functions
mxnet.lr_scheduler Scheduling learning rate.

and how to develop a new optimization algorithm in MXNet.

Assume there is a pre-defined Symbol and a Module created for it:

>>> data = mx.symbol.Variable('data')
>>> label = mx.symbol.Variable('softmax_label')
>>> fc = mx.symbol.FullyConnected(data, name='fc', num_hidden=10)
>>> loss = mx.symbol.SoftmaxOutput(fc, label, name='softmax')
>>> mod = mx.mod.Module(loss)
>>> mod.bind(data_shapes=[('data', (128,20))], label_shapes=[('softmax_label', (128,))])

Next we can initialize the weights with values sampled uniformly from [-1,1]:

>>> mod.init_params(mx.initializer.Uniform(scale=1.0))

Then we will train the model with standard SGD, which decreases the learning rate by a factor of 0.9 every 100 batches:

>>> lr_sch = mx.lr_scheduler.FactorScheduler(step=100, factor=0.9)
>>> mod.init_optimizer(
...     optimizer='sgd', optimizer_params=(('learning_rate', 0.1), ('lr_scheduler', lr_sch)))

Finally, run mod.fit(...) to start training.
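
For example, with a small iterator built from random NumPy arrays (purely illustrative; in practice use your own DataIter whose batches match the shapes bound above):

>>> import numpy as np
>>> train_iter = mx.io.NDArrayIter(np.random.rand(1280, 20).astype('float32'),
...                                np.random.randint(0, 10, (1280,)),
...                                batch_size=128)
>>> mod.fit(train_iter, num_epoch=5)  # reuses the parameters and optimizer initialized above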

The mxnet.initializer package

The base class Initializer defines the default behaviors for initializing the various parameters other than the weight, such as setting the bias to 0. The other classes then define how to initialize the weight.

Initializer The base class of an initializer.
Uniform Initialize the weight with value uniformly sampled from [-scale, scale].
Normal Initialize the weight with value sampled according to normal(0, sigma).
Load Initialize by loading data from file or dict.
Mixed Initialize with multiple initializers.
Zero Initialize the weight to 0.
One Initialize the weight to 1.
Constant Initialize the weight to a scalar value.
Orthogonal Initialize weight as orthogonal matrix.
Xavier Initialize the weight with Xavier or other similar schemes.
MSRAPrelu Initialize the weight according to a MSRA paper.
Bilinear Initialize weight for upsampling layers.
FusedRNN Initialize parameters for fused rnn layers.

The mxnet.optimizer package

The base class Optimizer accepts commonly shared arguments such as learning_rate and defines the interface. Each of the other classes in this package implements one weight updating function.

Optimizer The base class inherited by all optimizers.
SGD The SGD optimizer with momentum and weight decay.
NAG Nesterov accelerated SGD.
RMSProp The RMSProp optimizer.
Adam The Adam optimizer.
AdaGrad AdaGrad optimizer
AdaDelta The AdaDelta optimizer.
DCASGD The DCASGD optimizer
SGLD Stochastic Gradient Riemannian Langevin Dynamics.

The mxnet.lr_scheduler package

The base class LRScheduler defines the interface, while other classes implement various schemes to change the learning rate during training.

LRScheduler Base class of a learning rate scheduler.
FactorScheduler Reduce the learning rate by a factor for every n steps.
MultiFactorScheduler Reduce the learning rate by given a list of steps.

Implement a new algorithm

Most classes listed in this document are implemented in Python using NDArray, so implementing new weight updating or initialization functions is straightforward.

For an initializer, create a subclass of Initializer and define the _init_weight method. We can also override methods such as _init_bias to change the default behavior for other parameters. See initializer.py for examples.
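
As a minimal sketch (the class name and constant value below are illustrative only), a custom initializer that fills every weight with a constant could look like:

>>> @mx.initializer.register
... class CustomInit(mx.initializer.Initializer):
...     def __init__(self, value=0.5):
...         super(CustomInit, self).__init__(value=value)
...         self.value = value
...     def _init_weight(self, name, arr):
...         arr[:] = self.value   # fill the weight NDArray in place
>>> mod.init_params(CustomInit(0.5), force_init=True)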

For an optimizer, create a subclass of Optimizer and implement the two methods create_state and update. Also add the @mx.optimizer.Optimizer.register decorator before the class. See optimizer.py for examples.
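
A sketch of a plain, momentum-free SGD written this way (the class name MySGD is hypothetical) might look as follows; it relies only on helpers provided by the base class such as _get_lr and _get_wd:

>>> @mx.optimizer.Optimizer.register
... class MySGD(mx.optimizer.Optimizer):
...     def create_state(self, index, weight):
...         return None                      # plain SGD keeps no auxiliary state
...     def update(self, index, weight, grad, state):
...         self._update_count(index)        # track num_update for the lr_scheduler
...         lr = self._get_lr(index)
...         wd = self._get_wd(index)
...         grad = grad * self.rescale_grad
...         if self.clip_gradient is not None:
...             grad = mx.nd.clip(grad, -self.clip_gradient, self.clip_gradient)
...         weight[:] = weight - lr * (grad + wd * weight)
>>> opt = mx.optimizer.create('MySGD', learning_rate=0.1)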

For a learning rate scheduler, create a subclass of LRScheduler and implement the __call__ method. See lr_scheduler.py for examples.
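
For instance, a hypothetical scheduler that halves the learning rate after every given number of updates could be sketched as:

>>> class HalvingScheduler(mx.lr_scheduler.LRScheduler):
...     def __init__(self, step, base_lr=0.01):
...         super(HalvingScheduler, self).__init__(base_lr)
...         self.step = step
...     def __call__(self, num_update):
...         # halve base_lr once for every `step` updates performed so far
...         return self.base_lr * (0.5 ** (num_update // self.step))
>>> sched = HalvingScheduler(step=1000, base_lr=0.1)
>>> sched(2500)
0.025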

API Reference

Weight updating functions

class mxnet.optimizer.Optimizer(rescale_grad=1.0, param_idx2name=None, wd=0.0, clip_gradient=None, learning_rate=0.01, lr_scheduler=None, sym=None, begin_num_update=0)

The base class inherited by all optimizers.

Parameters:
  • rescale_grad (float, optional) – Multiply the gradient with rescale_grad before updating. Often set to 1.0/batch_size.
  • param_idx2name (dict from int to string, optional) – A dictionary that maps int index to string name.
  • clip_gradient (float, optional) – Clip the gradient by projecting onto the box [-clip_gradient, clip_gradient].
  • learning_rate (float, optional) – The initial learning rate.
  • lr_scheduler (LRScheduler, optional) – The learning rate scheduler.
  • wd (float, optional) – The weight decay (or L2 regularization) coefficient. Modifies objective by adding a penalty for having large weights.
  • sym (Symbol, optional) – The Symbol this optimizer is applying to.
  • begin_num_update (int, optional) – The initial number of updates
static register(klass)

Register a new optimizer.

Once an optimizer is registered, we can create an instance of this optimizer with create_optimizer later.

Examples

>>> @mx.optimizer.Optimizer.register
... class MyOptimizer(mx.optimizer.Optimizer):
...     pass
>>> optim = mx.optimizer.Optimizer.create_optimizer('MyOptimizer')
>>> print(type(optim))
<class '__main__.MyOptimizer'>
static create_optimizer(name, **kwargs)

Instantiate an optimizer with a given name and kwargs.

Notes

We can use the alias create for Optimizer.create_optimizer

Parameters:
  • name (str) – Name of the optimizer. Should be the name of a subclass of Optimizer. Case insensitive.
  • kwargs (dict) – Parameters for the optimizer.
Returns:

An instantiated optimizer.

Return type:

Optimizer

Examples

>>> sgd = mx.optimizer.Optimizer.create_optimizer('sgd')
>>> type(sgd)
<class 'mxnet.optimizer.SGD'>
>>> adam = mx.optimizer.create('adam', learning_rate=.1)
>>> type(adam)
<class 'mxnet.optimizer.Adam'>
create_state(index, weight)

Create auxiliary state for a given weight.

Some optimizers require additional state, e.g. momentum, in addition to gradients in order to update weights. This function creates the state for a given weight, which will be used in update. It is called only once for each weight.

Parameters:
  • index (int) – A unique index to identify the weight.
  • weight (NDArray) – The weight.
Returns:

state – The state associated with the weight.

Return type:

any obj

update(index, weight, grad, state)

Update the weight given the corresponding gradient and state.

Parameters:
  • index (int) – A unique index to identify the weight.
  • weight (NDArray) – The weight.
  • grad (NDArray) – The gradient of the objective with respect to this weight.
  • state (any obj) – The state associated with this weight.
set_lr_scale(args_lrscale)

[DEPRECATED] set lr scale. Use set_lr_mult instead.

set_lr_mult(args_lr_mult)

Set individual learning rate for each weight.

Parameters:args_lr_mult (dict of string/int to float) – Set the lr multiplier for name/index to float. Setting the multiplier by index is supported for backward compatibility, but we recommend using name and symbol.
set_wd_mult(args_wd_mult)

Set individual weight decay for each weight.

By default the wd multiplier is 0 for all parameters whose name doesn’t end with _weight, if param_idx2name is provided.

Parameters:args_wd_mult (dict of string/int to float) – Set the wd multiplier for name/index to float. Setting the multiplier by index is supported for backward compatibility, but we recommend using name and symbol.
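
For example, using the fc layer defined earlier, the bias could be given twice the learning rate and no weight decay. Note that name-based lookup only takes effect when the optimizer knows the parameter names, e.g. via param_idx2name or sym:

>>> opt = mx.optimizer.SGD(learning_rate=0.1)
>>> opt.set_lr_mult({'fc_bias': 2.0})   # double the learning rate of the fc bias
>>> opt.set_wd_mult({'fc_bias': 0.0})   # disable weight decay for the fc bias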
mxnet.optimizer.register(klass)

Register a new optimizer.

Once an optimizer is registered, we can create an instance of this optimizer with create_optimizer later.

Examples

>>> @mx.optimizer.Optimizer.register
... class MyOptimizer(mx.optimizer.Optimizer):
...     pass
>>> optim = mx.optimizer.Optimizer.create_optimizer('MyOptimizer')
>>> print(type(optim))
<class '__main__.MyOptimizer'>
class mxnet.optimizer.SGD(momentum=0.0, **kwargs)

The SGD optimizer with momentum and weight decay.

The optimizer updates the weight by:

state = momentum * state + lr * rescale_grad * clip(grad, clip_gradient) + wd * weight
weight = weight - state

This optimizer accepts the following parameters in addition to those accepted by Optimizer:

Parameters:momentum (float, optional) – The momentum value.
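
For example, an SGD instance configured for the 128-sample batches used earlier can be passed directly to Module.init_optimizer:

>>> sgd = mx.optimizer.SGD(learning_rate=0.1, momentum=0.9, wd=0.0001,
...                        rescale_grad=1.0/128)
>>> mod.init_optimizer(optimizer=sgd, force_init=True)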
class mxnet.optimizer.DCASGD(momentum=0.0, lamda=0.04, **kwargs)

The DCASGD optimizer

This class implements the optimizer described in Asynchronous Stochastic Gradient Descent with Delay Compensation for Distributed Deep Learning, available at https://arxiv.org/abs/1609.08326

This optimizer accepts the following parameters in addition to those accepted by Optimizer:

Parameters:
  • momentum (float, optional) – The momentum value.
  • lamda (float, optional) – Scale DC value.
class mxnet.optimizer.NAG(**kwargs)

Nesterov accelerated SGD.

This optimizer updates each weight by:

state = momentum * state + grad + wd * weight
weight = weight - (lr * (grad + momentum * state))

This optimizer accepts the same arguments as SGD.

class mxnet.optimizer.SGLD(**kwargs)

Stochastic Gradient Riemannian Langevin Dynamics.

This class implements the optimizer described in the paper Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex, available at https://papers.nips.cc/paper/4883-stochastic-gradient-riemannian-langevin-dynamics-on-the-probability-simplex.pdf

class mxnet.optimizer.ccSGD(*args, **kwargs)

[Deprecated] Same as sgd. Left here for backward compatibility.

class mxnet.optimizer.Adam(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, **kwargs)

The Adam optimizer.

This class implements the optimizer described in Adam: A Method for Stochastic Optimization, available at http://arxiv.org/abs/1412.6980

This optimizer accepts the following parameters in addition to those accepted by Optimizer:

Parameters:
  • beta1 (float, optional) – Exponential decay rate for the first moment estimates.
  • beta2 (float, optional) – Exponential decay rate for the second moment estimates.
  • epsilon (float, optional) – Small value to avoid divided by 0.
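
For example, an Adam instance (the hyper-parameters below simply spell out the defaults) can be passed to Module.init_optimizer in the same way as SGD above:

>>> adam = mx.optimizer.Adam(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8)
>>> mod.init_optimizer(optimizer=adam, force_init=True)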
class mxnet.optimizer.AdaGrad(eps=1e-07, **kwargs)

AdaGrad optimizer

This class implements the AdaGrad optimizer described in Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, available at http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf

This optimizer accepts the following parameters in addition to those accepted by Optimizer:

Parameters:eps (float, optional) – Small value to avoid division by 0.
class mxnet.optimizer.RMSProp(learning_rate=0.001, gamma1=0.9, gamma2=0.9, epsilon=1e-08, centered=False, clip_weights=None, **kwargs)

The RMSProp optimizer.

Two versions of RMSProp are implemented:

If centered=False, we follow http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf by Tieleman & Hinton, 2012.

If centered=True, we follow http://arxiv.org/pdf/1308.0850v5.pdf (38)-(45) by Alex Graves, 2013.

This optimizer accepts the following parameters in addition to those accepted by Optimizer:

Parameters:
  • gamma1 (float, optional) – Decay factor of moving average for gradient^2.
  • gamma2 (float, optional) – A “momentum” factor. Only used if centered=True.
  • epsilon (float, optional) – Small value to avoid division by 0.
  • centered (bool, optional) – Use Graves’ or Tieleman & Hinton’s version of RMSProp.
  • clip_weights (float, optional) – Clip weights into the range [-clip_weights, clip_weights].
class mxnet.optimizer.AdaDelta(rho=0.9, epsilon=1e-05, **kwargs)

The AdaDelta optimizer.

This class implements AdaDelta, an optimizer described in ADADELTA: An adaptive learning rate method, available at https://arxiv.org/abs/1212.5701

This optimizer accepts the following parameters in addition to those accepted by Optimizer:

Parameters:
  • rho (float) – Decay rate for both squared gradients and delta.
  • epsilon (float) – Small value to avoid division by 0.
mxnet.optimizer.create(name, **kwargs)

Instantiate an optimizer with a given name and kwargs.

Notes

We can use the alias create for Optimizer.create_optimizer

Parameters:
  • name (str) – Name of the optimizer. Should be the name of a subclass of Optimizer. Case insensitive.
  • kwargs (dict) – Parameters for the optimizer.
Returns:

An instantiated optimizer.

Return type:

Optimizer

Examples

>>> sgd = mx.optimizer.Optimizer.create_optimizer('sgd')
>>> type(sgd)
<class 'mxnet.optimizer.SGD'>
>>> adam = mx.optimizer.create('adam', learning_rate=.1)
>>> type(adam)
<class 'mxnet.optimizer.Adam'>
class mxnet.optimizer.Updater(optimizer)

Updater for kvstore.

set_states(states)

Set updater states.

get_states()

Get updater states.

mxnet.optimizer.get_updater(optimizer)

Return a closure of the updater needed for kvstore.

Parameters:optimizer (Optimizer) – The optimizer.
Returns:updater – The closure of the updater.
Return type:function
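
A minimal sketch of calling such a closure directly (outside of kvstore), with toy NDArrays:

>>> opt = mx.optimizer.SGD(learning_rate=0.1)
>>> updater = mx.optimizer.get_updater(opt)
>>> weight = mx.nd.ones((2, 2))
>>> grad = mx.nd.ones((2, 2))
>>> updater(0, grad, weight)   # argument order is (index, grad, weight); weight is updated in place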

Scheduling learning rate.

class mxnet.lr_scheduler.LRScheduler(base_lr=0.01)

Base class of a learning rate scheduler.

A scheduler returns a new learning rate based on the number of updates that have been performed.

Parameters:base_lr (float, optional) – The initial learning rate.
class mxnet.lr_scheduler.FactorScheduler(step, factor=1, stop_factor_lr=1e-08)

Reduce the learning rate by a factor for every n steps.

It returns a new learning rate by:

base_lr * pow(factor, floor(num_update/step))
Parameters:
  • step (int) – The number of updates between changes of the learning rate.
  • factor (float, optional) – The factor by which to multiply the learning rate.
  • stop_factor_lr (float, optional) – Stop updating the learning rate if it is less than this value.
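
The scheduler is stateful and expects to be called with non-decreasing update counts; base_lr is normally set by the optimizer that owns the scheduler. A small illustration:

>>> sched = mx.lr_scheduler.FactorScheduler(step=100, factor=0.9)
>>> sched.base_lr = 0.1
>>> [round(sched(i), 4) for i in (50, 150, 250)]
[0.1, 0.09, 0.081]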
class mxnet.lr_scheduler.MultiFactorScheduler(step, factor=1)

Reduce the learning rate according to a given list of steps.

Assume there exists k such that:

step[k] <= num_update and num_update < step[k+1]

Then calculate the new learning rate by:

base_lr * pow(factor, k+1)
Parameters:
  • step (list of int) – The list of update counts at which to change the learning rate.
  • factor (float) – The factor by which to multiply the learning rate.
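
For example, to halve the learning rate after 100, 200, and 400 updates (base_lr is again normally set by the owning optimizer):

>>> sched = mx.lr_scheduler.MultiFactorScheduler(step=[100, 200, 400], factor=0.5)
>>> sched.base_lr = 0.1
>>> [round(sched(i), 4) for i in (50, 150, 300, 500)]
[0.1, 0.05, 0.025, 0.0125]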

Weight initializer.

class mxnet.initializer.InitDesc

Descriptor for the initialization pattern.

Attributes:
  • name (str) – Name of the variable.
  • attrs (dict of str to str) – Attributes of this variable taken from Symbol.attr_dict.
  • global_init (Initializer) – The global initializer to fall back to.
mxnet.initializer.register(klass)

Register an initializer to the initializer factory.

class mxnet.initializer.Initializer(**kwargs)

The base class of an initializer.

dumps()

Save the initializer to a string.

class mxnet.initializer.Load(param, default_init=None, verbose=False)

Initialize by loading data from file or dict.

Parameters:
  • param (str or dict of str->NDArray) – Parameter file or dict mapping name to NDArray.
  • default_init (Initializer) – Default initializer when name is not found in param.
  • verbose (bool) – Log source when initializing.
class mxnet.initializer.Mixed(patterns, initializers)

Initialize with multiple initializers.

Parameters:
  • patterns (list of str) – List of regular expression patterns to match parameter names.
  • initializers (list of Initializer) – List of Initializer instances corresponding to the patterns.
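
For example, to zero all biases while sampling every other parameter uniformly (patterns are matched in the order given, so the catch-all '.*' comes last):

>>> init = mx.initializer.Mixed(['.*bias', '.*'],
...                             [mx.initializer.Zero(), mx.initializer.Uniform(0.1)])
>>> mod.init_params(init, force_init=True)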
class mxnet.initializer.Zero

Initialize the weight to 0.

class mxnet.initializer.One

Initialize the weight to 1.

class mxnet.initializer.Constant(value)

Initialize the weight to a scalar value.

class mxnet.initializer.Uniform(scale=0.07)

Initialize the weight with value uniformly sampled from [-scale, scale].

Parameters:scale (float, optional) – The scale of uniform distribution.
class mxnet.initializer.Normal(sigma=0.01)

Initialize the weight with value sampled according to normal(0, sigma).

Parameters:sigma (float, optional) – Standard deviation of the Gaussian distribution.
class mxnet.initializer.Orthogonal(scale=1.414, rand_type='uniform')

Initialize weight as orthogonal matrix.

This initializer implements Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, available at https://arxiv.org/abs/1312.6120.

Parameters:
  • scale (float, optional) – Scaling factor of the weight.
  • rand_type (str, optional) – Use “uniform” or “normal” random numbers to initialize the weight.
class mxnet.initializer.Xavier(rnd_type='uniform', factor_type='avg', magnitude=3)

Initialize the weight with Xavier or other similar schemes.

Parameters:
  • rnd_type (str, optional) – Random generator type; can be “gaussian” or “uniform”.
  • factor_type (str, optional) – Can be avg, in, or out.
  • magnitude (float, optional) – Scale of random number range.
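
For example, a Gaussian, “in”-scaled Xavier initializer can be passed to init_params like any other initializer:

>>> init = mx.initializer.Xavier(rnd_type='gaussian', factor_type='in', magnitude=2)
>>> mod.init_params(init, force_init=True)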
class mxnet.initializer.MSRAPrelu(factor_type='avg', slope=0.25)

Initialize the weight according to an MSRA paper.

This initializer implements Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, available at https://arxiv.org/abs/1502.01852.

Parameters:
  • factor_type (str, optional) – Can be avg, in, or out.
  • slope (float, optional) – Initial slope of any PReLU (or similar) nonlinearities.
class mxnet.initializer.Bilinear

Initialize weight for upsampling layers.

class mxnet.initializer.LSTMBias(forget_bias)

Initialize all biases of an LSTMCell to 0.0 except for the forget gate, whose bias is set to a custom value.

Parameters:forget_bias (float) – Bias for the forget gate. Jozefowicz et al. (2015) recommend setting this to 1.0.
class mxnet.initializer.FusedRNN(init, num_hidden, num_layers, mode, bidirectional=False, forget_bias=1.0)

Initialize parameters for fused rnn layers.

Parameters:
  • init (Initializer) – Initializer applied to unpacked weights. Falls back to the global initializer if None.
  • num_hidden (int) – Should be the same as the argument passed to FusedRNNCell.
  • num_layers (int) – Should be the same as the argument passed to FusedRNNCell.
  • mode (str) – Should be the same as the argument passed to FusedRNNCell.
  • bidirectional (bool) – Should be the same as the argument passed to FusedRNNCell.
  • forget_bias (float) – Should be the same as the argument passed to FusedRNNCell.