# Gluon Package

Warning

This package is currently experimental and may change in the near future.

## Overview

The Gluon package is a high-level interface for MXNet designed to be easy to use while keeping most of the flexibility of the low-level API. Gluon supports both imperative and symbolic programming, making it easy to train complex models imperatively in Python and then deploy them as a symbolic graph in C++ and Scala.

## Parameter

class mxnet.gluon.Parameter(name, grad_req='write', shape=None, dtype=<type 'numpy.float32'>, lr_mult=1.0, wd_mult=1.0, init=None, allow_deferred_init=False)

A Container holding parameters (weights) of Blocks.

Parameter holds a copy of the parameter on each Context after it is initialized with Parameter.initialize(...). If grad_req is not 'null', it will also hold a gradient array on each Context:

import mxnet as mx

ctx = mx.gpu(0)
x = mx.nd.zeros((16, 100), ctx=ctx)
w = mx.gluon.Parameter('fc_weight', shape=(64, 100), init=mx.init.Xavier())
b = mx.gluon.Parameter('fc_bias', shape=(64,), init=mx.init.Zero())
# allocate the value (and gradient) arrays on the chosen context
w.initialize(ctx=ctx)
b.initialize(ctx=ctx)
out = mx.nd.FullyConnected(x, w.data(ctx), b.data(ctx), num_hidden=64)

Parameters:

- name (str) – Name of this parameter.
- grad_req ({'write', 'add', 'null'}, default 'write') – Specifies how to update gradient to grad arrays.
  - ‘write’ means the gradient is written to the grad NDArray every time.
  - ‘add’ means the gradient is added to the grad NDArray every time. You need to manually call zero_grad() to clear the gradient buffer before each iteration when using this option.
  - ‘null’ means gradient is not requested for this parameter. Gradient arrays will not be allocated.
- shape (tuple of int, default None) – Shape of this parameter. By default shape is not specified. A Parameter with unknown shape can be used with the Symbol API, but init will throw an error when using the NDArray API.
- dtype (numpy.dtype or str, default 'float32') – Data type of this parameter. For example, numpy.float32 or ‘float32’.
- lr_mult (float, default 1.0) – Learning rate multiplier. The learning rate will be multiplied by lr_mult when updating this parameter with the optimizer.
- wd_mult (float, default 1.0) – Weight decay multiplier (L2 regularizer coefficient). Works similarly to lr_mult.
- init (Initializer, default None) – Initializer of this parameter. Will use the global initializer by default.

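The three grad_req modes can be pictured with a small pure-Python sketch. It is only an illustration of the semantics described above, not MXNet code: `apply_grad` is a hypothetical helper and plain lists stand in for gradient NDArrays.

```python
def apply_grad(grad_buffer, new_grad, grad_req):
    """Return the buffer contents after one backward pass (hypothetical helper)."""
    if grad_req == 'write':
        # the new gradient overwrites whatever the buffer held
        return list(new_grad)
    if grad_req == 'add':
        # the new gradient is accumulated into the buffer
        return [g + n for g, n in zip(grad_buffer, new_grad)]
    if grad_req == 'null':
        # no gradient is requested, so no buffer exists
        return None
    raise ValueError('unknown grad_req: %r' % (grad_req,))


# With 'add', gradients from successive passes pile up until cleared.
buf = [0.0, 0.0]
for step_grad in ([1.0, 2.0], [3.0, 4.0]):
    buf = apply_grad(buf, step_grad, 'add')
print(buf)  # [4.0, 6.0]
```

This is why ‘add’ requires an explicit zero_grad() between iterations, while ‘write’ does not.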
initialize(init=None, ctx=None, default_init=<mxnet.initializer.Uniform object>)

Initializes parameter and gradient arrays. Only used for NDArray API.

Parameters:

- init (Initializer) – The initializer to use. Overrides Parameter.init and default_init.
- ctx (Context or list of Context, defaults to context.current_context()) – Initialize Parameter on given context. If ctx is a list of Context, a copy will be made for each context.

  Note: Copies are independent arrays. The user is responsible for keeping their values consistent when updating. Normally gluon.Trainer does this for you.
- default_init (Initializer) – Default initializer, used when both init and Parameter.init are None.

Examples

>>> weight = mx.gluon.Parameter('weight', shape=(2, 2))
>>> weight.initialize(ctx=mx.cpu(0))
>>> weight.data()
[[-0.01068833  0.01729892]
[ 0.02042518 -0.01618656]]
<NDArray 2x2 @cpu(0)>
>>> weight.grad()
[[ 0.  0.]
[ 0.  0.]]
<NDArray 2x2 @cpu(0)>
>>> weight.initialize(ctx=[mx.gpu(0), mx.gpu(1)])
>>> weight.data(mx.gpu(0))
[[-0.00873779 -0.02834515]
[ 0.05484822 -0.06206018]]
<NDArray 2x2 @gpu(0)>
>>> weight.data(mx.gpu(1))
[[-0.00873779 -0.02834515]
[ 0.05484822 -0.06206018]]
<NDArray 2x2 @gpu(1)>

set_data(data)

Sets this parameter’s value on all contexts to data.

data(ctx=None)

Returns a copy of this parameter on one context. Must have been initialized on this context before.

Parameters: ctx (Context) – Desired context.

Returns: NDArray on ctx.

list_data()

Returns copies of this parameter on all contexts, in the same order as creation.

grad(ctx=None)

Returns a gradient buffer for this parameter on one context.

Parameters: ctx (Context) – Desired context.
list_grad()

Returns gradient buffers on all contexts, in the same order as values.

list_ctx()

Returns a list of contexts this parameter is initialized on.

zero_grad()

Sets gradient buffer on all contexts to 0. No action is taken if parameter is uninitialized or doesn’t require gradient.

var()

Returns a symbol representing this parameter.

class mxnet.gluon.ParameterDict(prefix='', shared=None)

A dictionary managing a set of parameters.

Parameters:

- prefix (str, default '') – The prefix to be prepended to the names of all Parameters created by this dict.
- shared (ParameterDict or None) – If not None, when this dict’s get method creates a new parameter, it will first try to retrieve it from the shared dict. Usually used for sharing parameters with another Block.

prefix

Prefix of this dict. It will be prepended to the names of Parameters created with get.

get(name, **kwargs)

Retrieves a Parameter with name self.prefix+name. If not found, get will first try to retrieve it from the shared dict. If still not found, get will create a new Parameter with the given keyword arguments and insert it into self.

Parameters:

- name (str) – Name of the desired Parameter. It will be prepended with this dictionary’s prefix.
- **kwargs – The remaining keyword arguments for the created Parameter.

Returns: The created or retrieved Parameter.

Return type: Parameter

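The lookup order described for get (own entries, then the shared dict, then creation) can be sketched in plain Python. `SimpleParamDict` is a hypothetical stand-in written for illustration, and plain dicts take the place of real Parameter objects; it is not the ParameterDict implementation.

```python
class SimpleParamDict:
    """Hypothetical stand-in that mimics the lookup order of ParameterDict.get."""

    def __init__(self, prefix='', shared=None):
        self.prefix = prefix
        self.shared = shared
        self._params = {}

    def get(self, name, **kwargs):
        full_name = self.prefix + name
        # 1. already managed by this dict
        if full_name in self._params:
            return self._params[full_name]
        # 2. available in the shared dict
        if self.shared is not None and full_name in self.shared._params:
            param = self.shared._params[full_name]
        else:
            # 3. create a new entry from the keyword arguments
            param = dict(name=full_name, **kwargs)
        self._params[full_name] = param
        return param


base = SimpleParamDict(prefix='net_')
weight = base.get('weight', shape=(2, 2))

# A second dict that shares `base` gets the very same object back.
other = SimpleParamDict(prefix='net_', shared=base)
print(other.get('weight') is weight)  # True
```

Returning the same object, rather than a copy, is what makes weight sharing between Blocks work in this picture.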
update(other)

Copies all Parameters in other to self.

initialize(init=<mxnet.initializer.Uniform object>, ctx=None, verbose=False)

Initializes all Parameters managed by this dictionary to be used for NDArray API. It has no effect when using Symbol API.

Parameters:

- init (Initializer) – Global default Initializer to be used when Parameter.init is None. Otherwise, Parameter.init takes precedence.
- ctx (Context or list of Context) – Keeps a copy of Parameters on one or many context(s).

zero_grad()

Sets all Parameters’ gradient buffer to 0.

save(filename, strip_prefix='')

Save parameters to file.

filename : str
Path to parameter file.
strip_prefix : str, default ‘’
Strip prefix from parameter names before saving.
load(filename, ctx, allow_missing=False, ignore_extra=False, restore_prefix='')

filename : str
Path to parameter file.
ctx : Context or list of Context
allow_missing : bool, default False
ignore_extra : bool, default False
Whether to silently ignore parameters from the file that are not present in this ParameterDict.
restore_prefix : str, default ‘’

## Containers

class mxnet.gluon.Block(prefix=None, params=None)

Base class for all neural network layers and models. Your models should subclass this class.

Block can be nested recursively in a tree structure. You can create and assign child Block as regular attributes:

import mxnet as mx
from mxnet.gluon import Block, nn
from mxnet import ndarray as F

class Model(Block):
    def __init__(self, **kwargs):
        super(Model, self).__init__(**kwargs)
        # use name_scope to give child Blocks appropriate names.
        # It also allows sharing Parameters between Blocks recursively.
        with self.name_scope():
            self.dense0 = nn.Dense(20)
            self.dense1 = nn.Dense(20)

    def forward(self, x):
        x = F.relu(self.dense0(x))
        return F.relu(self.dense1(x))

model = Model()
model.initialize(ctx=mx.cpu(0))
model(F.zeros((10, 10), ctx=mx.cpu(0)))


Child Blocks assigned this way will be registered, and collect_params will collect their Parameters recursively.

Parameters:

- prefix (str) – Prefix acts like a name space. It will be prepended to the names of all Parameters and child Blocks in this Block's name_scope. Prefix should be unique within one model to prevent name collisions.
- params (ParameterDict or None) – ParameterDict for sharing weights with the new Block. For example, if you want dense1 to share dense0's weights, you can do:

  dense0 = nn.Dense(20)
  dense1 = nn.Dense(20, params=dense0.collect_params())

forward(*args)

Overrides to implement forward computation using NDArray. Only accepts positional arguments.

Parameters: *args – Input tensors.
__setattr__(name, value)

Registers parameters.

prefix

Prefix of this Block.

name

Name of this Block, without the trailing ‘_’.

name_scope()

Returns a name space object managing a child Block and parameter names. Should be used within a with statement:

with self.name_scope():
    self.dense = nn.Dense(20)

params

Returns this Block's parameter dictionary (does not include its children’s parameters).

collect_params()

Returns a ParameterDict containing the Parameters of this Block and all of its children.

save_params(filename)

Save parameters to file.

filename : str
Path to file.
load_params(filename, ctx, allow_missing=False, ignore_extra=False)

filename : str
Path to parameter file.
ctx : Context or list of Context
allow_missing : bool, default False
ignore_extra : bool, default False
Whether to silently ignore parameters from the file that are not present in this Block.
register_child(block)

Registers block as a child of self. Blocks assigned to self as attributes will be registered automatically.

initialize(init=<mxnet.initializer.Uniform object>, ctx=None, verbose=False)

Initializes Parameters of this Block and its children.

Equivalent to block.collect_params().initialize(...)

hybridize(active=True)

Activates or deactivates HybridBlocks recursively. Has no effect on non-hybrid children.

Parameters: active (bool, default True) – Whether to turn hybrid on or off.
__call__(*args)

Calls forward. Only accepts positional arguments.

class mxnet.gluon.HybridBlock(prefix=None, params=None)

HybridBlock supports forwarding with both Symbol and NDArray.

Forward computation in HybridBlock must be static to work with Symbols, i.e. you cannot call .asnumpy(), .shape, .dtype, etc. on tensors. Also, you cannot use branching or loop logic that is based on non-constant expressions like random numbers or intermediate results, since they change the graph structure for each iteration.

Before activation with hybridize(), HybridBlock works just like a normal Block. After activation, HybridBlock will create a symbolic graph representing the forward computation and cache it. On subsequent forwards, the cached graph will be used instead of hybrid_forward.

Refer to the Hybrid tutorial to see the end-to-end usage.

hybrid_forward(F, x, *args, **kwargs)

Overrides to construct symbolic graph for this Block.

Parameters: x (Symbol or NDArray) – The first input tensor. *args – Additional input tensors.
__setattr__(name, value)

Registers parameters.

infer_shape(*args)

Infers shape of Parameters from inputs.

forward(x, *args)

Defines the forward computation. Arguments can be either NDArray or Symbol.


## Neural Network Layers

### Containers

class mxnet.gluon.nn.Sequential(prefix=None, params=None)

Stacks Blocks sequentially.

Example:

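net = nn.Sequential()
# use net's name_scope to give child Blocks appropriate names.
with net.name_scope():
    # the layer choices below are illustrative
    net.add(nn.Dense(10, activation='relu'))
    net.add(nn.Dense(20))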

add(block)

Adds block on top of the stack.

class mxnet.gluon.nn.HybridSequential(prefix=None, params=None)

Stacks HybridBlocks sequentially.

Example:

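net = nn.HybridSequential()
# use net's name_scope to give child Blocks appropriate names.
with net.name_scope():
    # the layer choices below are illustrative
    net.add(nn.Dense(10, activation='relu'))
    net.add(nn.Dense(20))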

add(block)

Adds block on top of the stack.

### Basic Layers

class mxnet.gluon.nn.Dense(units, activation=None, use_bias=True, weight_initializer=None, bias_initializer='zeros', in_units=0, **kwargs)

Just your regular densely-connected NN layer.

Dense implements the operation: output = activation(dot(input, weight) + bias) where activation is the element-wise activation function passed as the activation argument, weight is a weights matrix created by the layer, and bias is a bias vector created by the layer (only applicable if use_bias is True).

Note: the input must be a tensor with rank 2. Use flatten to convert it to rank 2 manually if necessary.

Parameters:

- units (int) – Dimensionality of the output space.
- activation (str) – Activation function to use. See help on Activation layer. If you don’t specify anything, no activation is applied (i.e. “linear” activation: a(x) = x).
- use_bias (bool) – Whether the layer uses a bias vector.
- weight_initializer (str or Initializer) – Initializer for the kernel weights matrix.
- bias_initializer (str or Initializer) – Initializer for the bias vector.
- in_units (int, optional) – Size of the input data. If not specified, initialization will be deferred to the first time forward is called and in_units will be inferred from the shape of input data.
- prefix (str or None) – See document of Block.
- params (ParameterDict or None) – See document of Block.

Input shape:
A 2D input with shape (batch_size, in_units).
Output shape:
The output would have shape (batch_size, units).
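The operation output = activation(dot(input, weight) + bias) can be traced by hand with a small pure-Python sketch. `dense_forward` is a hypothetical helper, nested lists stand in for tensors, and the weight is assumed to be laid out as (in_units, units) so that it matches the formula as written; the layout MXNet stores internally may differ.

```python
def dense_forward(inputs, weight, bias=None, activation=None):
    """Compute activation(dot(inputs, weight) + bias) on nested lists.

    inputs: (batch_size, in_units); weight: (in_units, units), an assumed layout.
    """
    units = len(weight[0])
    outputs = []
    for row in inputs:
        out_row = []
        for j in range(units):
            value = sum(row[i] * weight[i][j] for i in range(len(row)))
            if bias is not None:
                value += bias[j]
            out_row.append(activation(value) if activation else value)
        outputs.append(out_row)
    return outputs


def relu(value):
    return max(0.0, value)


x = [[1.0, 2.0]]                          # batch_size=1, in_units=2
w = [[1.0, -1.0, 0.5], [0.0, 1.0, 0.5]]   # in_units=2, units=3
b = [0.0, -2.0, 1.0]
print(dense_forward(x, w, b, relu))  # [[1.0, 0.0, 2.5]]
```

The result has shape (batch_size, units), matching the output shape stated above.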
class mxnet.gluon.nn.Activation(activation, **kwargs)

Applies an activation function to input.

Parameters: activation (str) – Name of activation function to use. See Activation() for available choices.
Input shape:
Arbitrary.
Output shape:
Same shape as input.
class mxnet.gluon.nn.Dropout(rate, **kwargs)

Applies Dropout to the input.

Dropout randomly sets a fraction rate of the input units to 0 at each update during training, which helps prevent overfitting.

Parameters: rate (float) – Fraction of the input units to drop. Must be a number between 0 and 1.
Input shape:
Arbitrary.
Output shape:
Same shape as input.

References

Dropout: A Simple Way to Prevent Neural Networks from Overfitting

class mxnet.gluon.nn.BatchNorm(axis=1, momentum=0.9, epsilon=0.001, center=True, scale=True, beta_initializer='zeros', gamma_initializer='ones', running_mean_initializer='zeros', running_variance_initializer='ones', in_channels=0, **kwargs)

Batch normalization layer (Ioffe and Szegedy, 2014). Normalizes the input at each batch, i.e. applies a transformation that maintains the mean activation close to 0 and the activation standard deviation close to 1.

Parameters:

- axis (int, default 1) – The axis that should be normalized. This is typically the channels (C) axis. For instance, after a Conv2D layer with layout=’NCHW’, set axis=1 in BatchNorm. If layout=’NHWC’, then set axis=3.
- momentum (float, default 0.9) – Momentum for the moving average.
- epsilon (float, default 1e-3) – Small float added to variance to avoid dividing by zero.
- center (bool, default True) – If True, add offset of beta to normalized tensor. If False, beta is ignored.
- scale (bool, default True) – If True, multiply by gamma. If False, gamma is not used. When the next layer is linear (also e.g. nn.relu), this can be disabled since the scaling will be done by the next layer.
- beta_initializer (str or Initializer, default ‘zeros’) – Initializer for the beta weight.
- gamma_initializer (str or Initializer, default ‘ones’) – Initializer for the gamma weight.
- running_mean_initializer (str or Initializer, default ‘zeros’) – Initializer for the moving mean.
- running_variance_initializer (str or Initializer, default ‘ones’) – Initializer for the moving variance.
- in_channels (int, default 0) – Number of channels (feature maps) in input data. If not specified, initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.

Input shape:
Arbitrary.
Output shape:
Same shape as input.
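As a rough illustration of what "mean close to 0 and standard deviation close to 1" means, the sketch below normalizes one feature over a single batch using the standard formula gamma * (x - mean) / sqrt(var + epsilon) + beta. `batch_norm_1d` is a hypothetical helper; it covers only the training-time batch statistics and ignores the moving averages controlled by momentum.

```python
import math


def batch_norm_1d(values, gamma=1.0, beta=0.0, epsilon=1e-3):
    """Normalize one feature across a batch (hypothetical helper)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    # epsilon keeps the denominator away from zero for constant inputs
    scale = gamma / math.sqrt(var + epsilon)
    return [scale * (v - mean) + beta for v in values]


batch = [1.0, 2.0, 3.0, 4.0]
normalized = batch_norm_1d(batch)
new_mean = sum(normalized) / len(normalized)
new_std = math.sqrt(sum((v - new_mean) ** 2 for v in normalized) / len(normalized))
print(round(new_mean, 6), round(new_std, 3))
```

Setting center=False or scale=False corresponds to leaving beta at 0 or gamma at 1 in this sketch.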
class mxnet.gluon.nn.LeakyReLU(alpha, **kwargs)

Leaky version of a Rectified Linear Unit.

It allows a small gradient when the unit is not active:

f(x) = alpha * x for x < 0,
f(x) = x for x >= 0.

Parameters: alpha (float) – Slope coefficient for the negative half axis. Must be >= 0.
Input shape:
Arbitrary.
Output shape:
Same shape as input.
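The piecewise definition of f(x) given above translates directly into a scalar function. This is a minimal sketch for a single value, with `leaky_relu` as a hypothetical helper name; the layer itself applies the same rule element-wise.

```python
def leaky_relu(x, alpha):
    """f(x) = alpha * x for x < 0, f(x) = x for x >= 0."""
    if alpha < 0:
        raise ValueError('alpha must be >= 0')
    return x if x >= 0 else alpha * x


print(leaky_relu(3.0, 0.1))   # 3.0
print(leaky_relu(-3.0, 0.0))  # -0.0, i.e. a plain ReLU when alpha is 0
```

With alpha=0 the negative half is flattened entirely, which recovers the ordinary ReLU.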
class mxnet.gluon.nn.Embedding(input_dim, output_dim, dtype='float32', weight_initializer=None, **kwargs)

Turns non-negative integers (indexes/tokens) into dense vectors of fixed size, e.g. [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]

Parameters:

- input_dim (int) – Size of the vocabulary, i.e. maximum integer index + 1.
- output_dim (int) – Dimension of the dense embedding.
- dtype (str or np.dtype, default 'float32') – Data type of output embeddings.
- weight_initializer (Initializer) – Initializer for the embeddings matrix.

Input shape:
2D tensor with shape: (N, M).
Output shape:
3D tensor with shape: (N, M, output_dim).
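Conceptually, the layer is a table lookup: each integer index selects one row of an (input_dim, output_dim) matrix. The sketch below shows that idea with nested lists; `embed` and the table values are made up for illustration and are not taken from MXNet.

```python
def embed(indices, table):
    """Replace every integer index with the matching row of `table`."""
    return [[list(table[i]) for i in row] for row in indices]


# input_dim=3 (valid indices 0..2), output_dim=2
table = [[0.0, 0.0], [0.25, 0.1], [0.6, -0.2]]
indices = [[1, 2], [2, 0]]  # shape (N=2, M=2)

vectors = embed(indices, table)
print(vectors)
# [[[0.25, 0.1], [0.6, -0.2]], [[0.6, -0.2], [0.0, 0.0]]]
```

The nesting of the result, (N, M, output_dim), mirrors the input and output shapes listed above.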

### Convolutional Layers

class mxnet.gluon.nn.Conv1D(channels, kernel_size, strides=1, padding=0, dilation=1, groups=1, layout='NCW', activation=None, use_bias=True, weight_initializer=None, bias_initializer='zeros', in_channels=0, **kwargs)

1D convolution layer (e.g. temporal convolution).

This layer creates a convolution kernel that is convolved with the layer input over a single spatial (or temporal) dimension to produce a tensor of outputs. If use_bias is True, a bias vector is created and added to the outputs. Finally, if activation is not None, it is applied to the outputs as well.

If in_channels is not specified, Parameter initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.

Parameters:

- channels (int) – The dimensionality of the output space, i.e. the number of output channels (filters) in the convolution.
- kernel_size (int or tuple/list of 1 int) – Specifies the dimensions of the convolution window.
- strides (int or tuple/list of 1 int) – Specifies the strides of the convolution.
- padding (int or tuple/list of 1 int) – If padding is non-zero, the input is implicitly zero-padded on both sides by padding points.
- dilation (int or tuple/list of 1 int) – Specifies the dilation rate to use for dilated convolution.
- groups (int) – Controls the connections between inputs and outputs. At groups=1, all inputs are convolved to all outputs. At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels and producing half the output channels, with both outputs subsequently concatenated.
- layout (str, default 'NCW') – Dimension ordering of data and weight. Can be ‘NCW’, ‘NWC’, etc. ‘N’, ‘C’, ‘W’ stand for batch, channel, and width (time) dimensions respectively. Convolution is applied on the ‘W’ dimension.
- in_channels (int, default 0) – The number of input channels to this layer. If not specified, initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
- activation (str) – Activation function to use. See Activation(). If you don’t specify anything, no activation is applied (i.e. “linear” activation: a(x) = x).
- use_bias (bool) – Whether the layer uses a bias vector.
- weight_initializer (str or Initializer) – Initializer for the weight matrix.
- bias_initializer (str or Initializer) – Initializer for the bias vector.

Input shape:
This depends on the layout parameter. Input is 3D array of shape (batch_size, in_channels, width) if layout is NCW.
Output shape:

This depends on the layout parameter. Output is 3D array of shape (batch_size, channels, out_width) if layout is NCW. out_width is calculated as:

out_width = floor((width+2*padding-dilation*(kernel_size-1)-1)/strides)+1
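The out_width formula is easy to check with integer arithmetic. The helper below, `conv1d_out_width`, is a hypothetical name that simply evaluates the expression above; floor division plays the role of floor.

```python
def conv1d_out_width(width, kernel_size, strides=1, padding=0, dilation=1):
    """Evaluate the Conv1D out_width formula with integer arithmetic."""
    effective = width + 2 * padding - dilation * (kernel_size - 1) - 1
    return effective // strides + 1


print(conv1d_out_width(10, kernel_size=3))                        # 8
print(conv1d_out_width(10, kernel_size=3, padding=1))             # 10
print(conv1d_out_width(10, kernel_size=3, strides=2, padding=1))  # 5
print(conv1d_out_width(10, kernel_size=3, dilation=2))            # 6
```

The second call shows the usual "same width" setting for a kernel of size 3, and the last one shows how dilation widens the effective window from 3 to 5 points.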

class mxnet.gluon.nn.Conv2D(channels, kernel_size, strides=(1, 1), padding=(0, 0), dilation=(1, 1), groups=1, layout='NCHW', activation=None, use_bias=True, weight_initializer=None, bias_initializer='zeros', in_channels=0, **kwargs)

2D convolution layer (e.g. spatial convolution over images).

This layer creates a convolution kernel that is convolved with the layer input to produce a tensor of outputs. If use_bias is True, a bias vector is created and added to the outputs. Finally, if activation is not None, it is applied to the outputs as well.

If in_channels is not specified, Parameter initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.

Parameters:

- channels (int) – The dimensionality of the output space, i.e. the number of output channels (filters) in the convolution.
- kernel_size (int or tuple/list of 2 int) – Specifies the dimensions of the convolution window.
- strides (int or tuple/list of 2 int) – Specifies the strides of the convolution.
- padding (int or tuple/list of 2 int) – If padding is non-zero, the input is implicitly zero-padded on both sides by padding points.
- dilation (int or tuple/list of 2 int) – Specifies the dilation rate to use for dilated convolution.
- groups (int) – Controls the connections between inputs and outputs. At groups=1, all inputs are convolved to all outputs. At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels and producing half the output channels, with both outputs subsequently concatenated.
- layout (str, default 'NCHW') – Dimension ordering of data and weight. Can be ‘NCHW’, ‘NHWC’, etc. ‘N’, ‘C’, ‘H’, ‘W’ stand for batch, channel, height, and width dimensions respectively. Convolution is applied on the ‘H’ and ‘W’ dimensions.
- in_channels (int, default 0) – The number of input channels to this layer. If not specified, initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
- activation (str) – Activation function to use. See Activation(). If you don’t specify anything, no activation is applied (i.e. “linear” activation: a(x) = x).
- use_bias (bool) – Whether the layer uses a bias vector.
- weight_initializer (str or Initializer) – Initializer for the weight matrix.
- bias_initializer (str or Initializer) – Initializer for the bias vector.

Input shape:
This depends on the layout parameter. Input is 4D array of shape (batch_size, in_channels, height, width) if layout is NCHW.
Output shape:

This depends on the layout parameter. Output is 4D array of shape (batch_size, channels, out_height, out_width) if layout is NCHW.

out_height and out_width are calculated as:

out_height = floor((height+2*padding[0]-dilation[0]*(kernel_size[0]-1)-1)/strides[0])+1
out_width = floor((width+2*padding[1]-dilation[1]*(kernel_size[1]-1)-1)/strides[1])+1
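For a full NCHW output shape, the same expression is applied once per spatial axis. The sketch below assumes that out_width uses index 1 of each argument in the same way out_height uses index 0; `conv2d_output_shape` and `_pair` are hypothetical helpers, and ints are expanded to pairs because the layer accepts either form.

```python
def _pair(value):
    """Expand an int to a 2-tuple; pass tuples/lists through."""
    return (value, value) if isinstance(value, int) else tuple(value)


def conv2d_output_shape(input_shape, channels, kernel_size,
                        strides=(1, 1), padding=(0, 0), dilation=(1, 1)):
    """Output shape of a 2D convolution for layout NCHW (hypothetical helper)."""
    batch_size, _, height, width = input_shape
    k, s, p, d = (_pair(v) for v in (kernel_size, strides, padding, dilation))
    out = []
    for axis, size in enumerate((height, width)):
        effective = size + 2 * p[axis] - d[axis] * (k[axis] - 1) - 1
        out.append(effective // s[axis] + 1)
    return (batch_size, channels, out[0], out[1])


print(conv2d_output_shape((1, 3, 32, 32), 16, 3, strides=2, padding=1))
# (1, 16, 16, 16)
print(conv2d_output_shape((8, 3, 32, 32), 4, (3, 5)))
# (8, 4, 30, 28)
```

Note that the channel dimension of the output comes from the channels argument alone; in_channels does not appear in the result.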

class mxnet.gluon.nn.Conv3D(channels, kernel_size, strides=(1, 1, 1), padding=(0, 0, 0), dilation=(1, 1, 1), groups=1, layout='NCDHW', activation=None, use_bias=True, weight_initializer=None, bias_initializer='zeros', in_channels=0, **kwargs)

3D convolution layer (e.g. spatial convolution over volumes).

This layer creates a convolution kernel that is convolved with the layer input to produce a tensor of outputs. If use_bias is True, a bias vector is created and added to the outputs. Finally, if activation is not None, it is applied to the outputs as well.

If in_channels is not specified, Parameter initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.

Parameters:

- channels (int) – The dimensionality of the output space, i.e. the number of output channels (filters) in the convolution.
- kernel_size (int or tuple/list of 3 int) – Specifies the dimensions of the convolution window.
- strides (int or tuple/list of 3 int) – Specifies the strides of the convolution.
- padding (int or tuple/list of 3 int) – If padding is non-zero, the input is implicitly zero-padded on both sides by padding points.
- dilation (int or tuple/list of 3 int) – Specifies the dilation rate to use for dilated convolution.
- groups (int) – Controls the connections between inputs and outputs. At groups=1, all inputs are convolved to all outputs. At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels and producing half the output channels, with both outputs subsequently concatenated.
- layout (str, default 'NCDHW') – Dimension ordering of data and weight. Can be ‘NCDHW’, ‘NDHWC’, etc. ‘N’, ‘C’, ‘H’, ‘W’, ‘D’ stand for batch, channel, height, width and depth dimensions respectively. Convolution is applied on the ‘D’, ‘H’ and ‘W’ dimensions.
- in_channels (int, default 0) – The number of input channels to this layer. If not specified, initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
- activation (str) – Activation function to use. See Activation(). If you don’t specify anything, no activation is applied (i.e. “linear” activation: a(x) = x).
- use_bias (bool) – Whether the layer uses a bias vector.
- weight_initializer (str or Initializer) – Initializer for the weight matrix.
- bias_initializer (str or Initializer) – Initializer for the bias vector.

Input shape:
This depends on the layout parameter. Input is 5D array of shape (batch_size, in_channels, depth, height, width) if layout is NCDHW.
Output shape:

This depends on the layout parameter. Output is 5D array of shape (batch_size, channels, out_depth, out_height, out_width) if layout is NCDHW.

out_depth, out_height and out_width are calculated as:

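out_depth = floor((depth+2*padding[0]-dilation[0]*(kernel_size[0]-1)-1)/strides[0])+1
out_height = floor((height+2*padding[1]-dilation[1]*(kernel_size[1]-1)-1)/strides[1])+1
out_width = floor((width+2*padding[2]-dilation[2]*(kernel_size[2]-1)-1)/strides[2])+1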

class mxnet.gluon.nn.Conv1DTranspose(channels, kernel_size, strides=1, padding=0, output_padding=0, dilation=1, groups=1, layout='NCW', activation=None, use_bias=True, weight_initializer=None, bias_initializer='zeros', in_channels=0, **kwargs)

Transposed 1D convolution layer (sometimes called Deconvolution).

The need for transposed convolutions generally arises from the desire to use a transformation going in the opposite direction of a normal convolution, i.e., from something that has the shape of the output of some convolution to something that has the shape of its input while maintaining a connectivity pattern that is compatible with said convolution.

If in_channels is not specified, Parameter initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.

Parameters:

- channels (int) – The dimensionality of the output space, i.e. the number of output channels (filters) in the convolution.
- kernel_size (int or tuple/list of 1 int) – Specifies the dimensions of the convolution window.
- strides (int or tuple/list of 1 int) – Specifies the strides of the convolution.
- padding (int or tuple/list of 1 int) – If padding is non-zero, the input is implicitly zero-padded on both sides by padding points.
- dilation (int or tuple/list of 1 int) – Specifies the dilation rate to use for dilated convolution.
- groups (int) – Controls the connections between inputs and outputs. At groups=1, all inputs are convolved to all outputs. At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels and producing half the output channels, with both outputs subsequently concatenated.
- layout (str, default 'NCW') – Dimension ordering of data and weight. Can be ‘NCW’, ‘NWC’, etc. ‘N’, ‘C’, ‘W’ stand for batch, channel, and width (time) dimensions respectively. Convolution is applied on the ‘W’ dimension.
- in_channels (int, default 0) – The number of input channels to this layer. If not specified, initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
- activation (str) – Activation function to use. See Activation(). If you don’t specify anything, no activation is applied (i.e. “linear” activation: a(x) = x).
- use_bias (bool) – Whether the layer uses a bias vector.
- weight_initializer (str or Initializer) – Initializer for the weight matrix.
- bias_initializer (str or Initializer) – Initializer for the bias vector.

Input shape:
This depends on the layout parameter. Input is 3D array of shape (batch_size, in_channels, width) if layout is NCW.
Output shape:

This depends on the layout parameter. Output is 3D array of shape (batch_size, channels, out_width) if layout is NCW.

out_width is calculated as:

out_width = (width-1)*strides-2*padding+kernel_size+output_padding
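One way to see the role of output_padding is to run the forward and transposed width formulas back to back. The sketch below uses two hypothetical helpers that evaluate the expressions from this page (with dilation fixed at 1, since the transposed formula above does not include it).

```python
def conv1d_out_width(width, kernel_size, strides=1, padding=0):
    """Forward Conv1D width formula with dilation=1."""
    return (width + 2 * padding - kernel_size) // strides + 1


def conv1d_transpose_out_width(width, kernel_size, strides=1, padding=0,
                               output_padding=0):
    """Transposed Conv1D width formula as given above."""
    return (width - 1) * strides - 2 * padding + kernel_size + output_padding


# With strides=2, input widths 9 and 10 both shrink to 5 ...
print(conv1d_out_width(9, 3, strides=2, padding=1))   # 5
print(conv1d_out_width(10, 3, strides=2, padding=1))  # 5

# ... so the transposed layer needs output_padding to pick which one to restore.
print(conv1d_transpose_out_width(5, 3, strides=2, padding=1))                    # 9
print(conv1d_transpose_out_width(5, 3, strides=2, padding=1, output_padding=1))  # 10
```

Because the floor in the forward formula discards information when strides > 1, output_padding supplies the missing choice on the way back.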

class mxnet.gluon.nn.Conv2DTranspose(channels, kernel_size, strides=(1, 1), padding=(0, 0), output_padding=(0, 0), dilation=(1, 1), groups=1, layout='NCHW', activation=None, use_bias=True, weight_initializer=None, bias_initializer='zeros', in_channels=0, **kwargs)

Transposed 2D convolution layer (sometimes called Deconvolution).

The need for transposed convolutions generally arises from the desire to use a transformation going in the opposite direction of a normal convolution, i.e., from something that has the shape of the output of some convolution to something that has the shape of its input while maintaining a connectivity pattern that is compatible with said convolution.

If in_channels is not specified, Parameter initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.

Parameters:

- channels (int) – The dimensionality of the output space, i.e. the number of output channels (filters) in the convolution.
- kernel_size (int or tuple/list of 2 int) – Specifies the dimensions of the convolution window.
- strides (int or tuple/list of 2 int) – Specifies the strides of the convolution.
- padding (int or tuple/list of 2 int) – If padding is non-zero, the input is implicitly zero-padded on both sides by padding points.
- dilation (int or tuple/list of 2 int) – Specifies the dilation rate to use for dilated convolution.
- groups (int) – Controls the connections between inputs and outputs. At groups=1, all inputs are convolved to all outputs. At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels and producing half the output channels, with both outputs subsequently concatenated.
- layout (str, default 'NCHW') – Dimension ordering of data and weight. Can be ‘NCHW’, ‘NHWC’, etc. ‘N’, ‘C’, ‘H’, ‘W’ stand for batch, channel, height, and width dimensions respectively. Convolution is applied on the ‘H’ and ‘W’ dimensions.
- in_channels (int, default 0) – The number of input channels to this layer. If not specified, initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
- activation (str) – Activation function to use. See Activation(). If you don’t specify anything, no activation is applied (i.e. “linear” activation: a(x) = x).
- use_bias (bool) – Whether the layer uses a bias vector.
- weight_initializer (str or Initializer) – Initializer for the weight matrix.
- bias_initializer (str or Initializer) – Initializer for the bias vector.

Input shape:
This depends on the layout parameter. Input is 4D array of shape (batch_size, in_channels, height, width) if layout is NCHW.
Output shape:

This depends on the layout parameter. Output is 4D array of shape (batch_size, channels, out_height, out_width) if layout is NCHW.

out_height and out_width are calculated as:

out_height = (height-1)*strides[0]-2*padding[0]+kernel_size[0]+output_padding[0]
out_width = (width-1)*strides[1]-2*padding[1]+kernel_size[1]+output_padding[1]

class mxnet.gluon.nn.Conv3DTranspose(channels, kernel_size, strides=(1, 1, 1), padding=(0, 0, 0), output_padding=(0, 0, 0), dilation=(1, 1, 1), groups=1, layout='NCDHW', activation=None, use_bias=True, weight_initializer=None, bias_initializer='zeros', in_channels=0, **kwargs)

Transposed 3D convolution layer (sometimes called Deconvolution).

The need for transposed convolutions generally arises from the desire to use a transformation going in the opposite direction of a normal convolution, i.e., from something that has the shape of the output of some convolution to something that has the shape of its input while maintaining a connectivity pattern that is compatible with said convolution.

If in_channels is not specified, Parameter initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.

Parameters:

- channels (int) – The dimensionality of the output space, i.e. the number of output channels (filters) in the convolution.
- kernel_size (int or tuple/list of 3 int) – Specifies the dimensions of the convolution window.
- strides (int or tuple/list of 3 int) – Specifies the strides of the convolution.
- padding (int or tuple/list of 3 int) – If padding is non-zero, the input is implicitly zero-padded on both sides by padding points.
- dilation (int or tuple/list of 3 int) – Specifies the dilation rate to use for dilated convolution.
- groups (int) – Controls the connections between inputs and outputs. At groups=1, all inputs are convolved to all outputs. At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels and producing half the output channels, with both outputs subsequently concatenated.
- layout (str, default 'NCDHW') – Dimension ordering of data and weight. Can be ‘NCDHW’, ‘NDHWC’, etc. ‘N’, ‘C’, ‘H’, ‘W’, ‘D’ stand for batch, channel, height, width and depth dimensions respectively. Convolution is applied on the ‘D’, ‘H’, and ‘W’ dimensions.
- in_channels (int, default 0) – The number of input channels to this layer. If not specified, initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
- activation (str) – Activation function to use. See Activation(). If you don’t specify anything, no activation is applied (i.e. “linear” activation: a(x) = x).
- use_bias (bool) – Whether the layer uses a bias vector.
- weight_initializer (str or Initializer) – Initializer for the weight matrix.
- bias_initializer (str or Initializer) – Initializer for the bias vector.

Input shape:
This depends on the layout parameter. Input is 5D array of shape (batch_size, in_channels, depth, height, width) if layout is NCDHW.
Output shape:

This depends on the layout parameter. Output is 5D array of shape (batch_size, channels, out_depth, out_height, out_width) if layout is NCDHW. out_depth, out_height and out_width are calculated as:

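out_depth = (depth-1)*strides[0]-2*padding[0]+kernel_size[0]+output_padding[0]
out_height = (height-1)*strides[1]-2*padding[1]+kernel_size[1]+output_padding[1]
out_width = (width-1)*strides[2]-2*padding[2]+kernel_size[2]+output_padding[2]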
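The output-size formula can be checked by hand with a small helper. This is a minimal sketch in plain Python, not library code: the function names are hypothetical, it assumes the height and width axes follow the same pattern as depth (with indices 1 and 2), and it ignores dilation because the formula above does not include it.

```python
def conv_transpose_out_size(size, kernel_size, stride=1, padding=0, output_padding=0):
    """Output length along one spatial axis of a transposed convolution."""
    return (size - 1) * stride - 2 * padding + kernel_size + output_padding


def conv3d_transpose_out_shape(dhw, kernel_size, strides, padding, output_padding):
    """Apply the per-axis formula to a (depth, height, width) triple."""
    return tuple(
        conv_transpose_out_size(s, k, st, p, op)
        for s, k, st, p, op in zip(dhw, kernel_size, strides, padding, output_padding)
    )


# A stride-2 layer with kernel 3, padding 1 and output_padding 1 doubles each axis.
shape = conv3d_transpose_out_shape((4, 8, 8), (3, 3, 3), (2, 2, 2), (1, 1, 1), (1, 1, 1))
print(shape)  # (8, 16, 16)
```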


### Pooling Layers¶

class mxnet.gluon.nn.MaxPool1D(pool_size=2, strides=None, padding=0, layout='NCW', ceil_mode=False, **kwargs)

Max pooling operation for one dimensional data.

Parameters: pool_size (int) – Size of the max pooling windows. strides (int, or None) – Factor by which to downscale. E.g. 2 will halve the input size. If None, it will default to pool_size. padding (int) – If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points. layout (str, default 'NCW') – Dimension ordering of data and weight. Can be ‘NCW’, ‘NWC’, etc. ‘N’, ‘C’, ‘W’ stands for batch, channel, and width (time) dimensions respectively. Pooling is applied on the W dimension. ceil_mode (bool, default False) – When True, will use ceil instead of floor to compute the output shape.
Input shape:
This depends on the layout parameter. Input is 3D array of shape (batch_size, channels, width) if layout is NCW.
Output shape:

This depends on the layout parameter. Output is 3D array of shape (batch_size, channels, out_width) if layout is NCW.

out_width is calculated as:

out_width = floor((width+2*padding-pool_size)/strides)+1


When ceil_mode is True, ceil will be used instead of floor in this equation.
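The effect of ceil_mode is easiest to see on a width that the stride does not divide evenly. The helper below is a hypothetical plain-Python sketch of the formula above, with strides falling back to pool_size when it is None, as the parameter description states.

```python
import math


def pool_out_size(size, pool_size, strides=None, padding=0, ceil_mode=False):
    """Output length along one pooled axis, following the formula above."""
    if strides is None:
        strides = pool_size  # documented default
    rounding = math.ceil if ceil_mode else math.floor
    return rounding((size + 2 * padding - pool_size) / strides) + 1


print(pool_out_size(10, pool_size=3, strides=2))                  # 4
print(pool_out_size(10, pool_size=3, strides=2, ceil_mode=True))  # 5
```

With ceil_mode=True the last, partially covered window still produces an output element.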

class mxnet.gluon.nn.MaxPool2D(pool_size=(2, 2), strides=None, padding=0, layout='NCHW', ceil_mode=False, **kwargs)

Max pooling operation for two dimensional (spatial) data.

Parameters: pool_size (int or list/tuple of 2 ints,) – Size of the max pooling windows. strides (int, list/tuple of 2 ints, or None.) – Factor by which to downscale. E.g. 2 will halve the input size. If None, it will default to pool_size. padding (int or list/tuple of 2 ints,) – If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points. layout (str, default 'NCHW') – Dimension ordering of data and weight. Can be ‘NCHW’, ‘NHWC’, etc. ‘N’, ‘C’, ‘H’, ‘W’ stands for batch, channel, height, and width dimensions respectively. padding is applied on ‘H’ and ‘W’ dimension. ceil_mode (bool, default False) – When True, will use ceil instead of floor to compute the output shape.
Input shape:
This depends on the layout parameter. Input is 4D array of shape (batch_size, channels, height, width) if layout is NCHW.
Output shape:

This depends on the layout parameter. Output is 4D array of shape (batch_size, channels, out_height, out_width) if layout is NCHW.

out_height and out_width are calculated as:

out_height = floor((height+2*padding[0]-pool_size[0])/strides[0])+1
out_width = floor((width+2*padding[1]-pool_size[1])/strides[1])+1


When ceil_mode is True, ceil will be used instead of floor in this equation.

class mxnet.gluon.nn.MaxPool3D(pool_size=(2, 2, 2), strides=None, padding=0, ceil_mode=False, layout='NCDHW', **kwargs)

Max pooling operation for 3D data (spatial or spatio-temporal).

Parameters: pool_size (int or list/tuple of 3 ints,) – Size of the max pooling windows. strides (int, list/tuple of 3 ints, or None.) – Factor by which to downscale. E.g. 2 will halve the input size. If None, it will default to pool_size. padding (int or list/tuple of 3 ints,) – If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points. layout (str, default 'NCDHW') – Dimension ordering of data and weight. Can be ‘NCDHW’, ‘NDHWC’, etc. ‘N’, ‘C’, ‘H’, ‘W’, ‘D’ stands for batch, channel, height, width and depth dimensions respectively. padding is applied on ‘D’, ‘H’ and ‘W’ dimension. ceil_mode (bool, default False) – When True, will use ceil instead of floor to compute the output shape.
Input shape:
This depends on the layout parameter. Input is 5D array of shape (batch_size, channels, depth, height, width) if layout is NCDHW.
Output shape:

This depends on the layout parameter. Output is 5D array of shape (batch_size, channels, out_depth, out_height, out_width) if layout is NCDHW.

out_depth, out_height and out_width are calculated as

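out_depth = floor((depth+2*padding[0]-pool_size[0])/strides[0])+1
out_height = floor((height+2*padding[1]-pool_size[1])/strides[1])+1
out_width = floor((width+2*padding[2]-pool_size[2])/strides[2])+1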


When ceil_mode is True, ceil will be used instead of floor in this equation.

class mxnet.gluon.nn.AvgPool1D(pool_size=2, strides=None, padding=0, layout='NCW', ceil_mode=False, **kwargs)

Average pooling operation for temporal data.

Parameters: pool_size (int) – Size of the average pooling windows. strides (int, or None) – Factor by which to downscale. E.g. 2 will halve the input size. If None, it will default to pool_size. padding (int) – If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points. layout (str, default 'NCW') – Dimension ordering of data and weight. Can be ‘NCW’, ‘NWC’, etc. ‘N’, ‘C’, ‘W’ stands for batch, channel, and width (time) dimensions respectively. padding is applied on ‘W’ dimension. ceil_mode (bool, default False) – When True, will use ceil instead of floor to compute the output shape.
Input shape:
This depends on the layout parameter. Input is 3D array of shape (batch_size, channels, width) if layout is NCW.
Output shape:

This depends on the layout parameter. Output is 3D array of shape (batch_size, channels, out_width) if layout is NCW.

out_width is calculated as:

out_width = floor((width+2*padding-pool_size)/strides)+1


When ceil_mode is True, ceil will be used instead of floor in this equation.
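To make the windowing concrete, here is a minimal plain-Python sketch of average pooling over a single channel. The function name is hypothetical, and it assumes padding=0 and floor rounding so that no padded values enter the averages.

```python
def avg_pool_1d(values, pool_size=2, strides=None):
    """Average pooling over one channel of width len(values), without padding."""
    if strides is None:
        strides = pool_size  # documented default
    out_width = (len(values) - pool_size) // strides + 1
    return [
        sum(values[i * strides : i * strides + pool_size]) / pool_size
        for i in range(out_width)
    ]


print(avg_pool_1d([1.0, 3.0, 5.0, 7.0, 9.0]))                          # [2.0, 6.0]
print(avg_pool_1d([1.0, 3.0, 5.0, 7.0, 9.0], pool_size=3, strides=1))  # [3.0, 5.0, 7.0]
```

In the first call the trailing element 9.0 is dropped because a full window no longer fits.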

class mxnet.gluon.nn.AvgPool2D(pool_size=(2, 2), strides=None, padding=0, ceil_mode=False, layout='NCHW', **kwargs)

Average pooling operation for spatial data.

Parameters: pool_size (int or list/tuple of 2 ints) – Size of the average pooling windows. strides (int, list/tuple of 2 ints, or None) – Factor by which to downscale. E.g. 2 will halve the input size. If None, it will default to pool_size. padding (int or list/tuple of 2 ints) – If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points. layout (str, default 'NCHW') – Dimension ordering of data and weight. Can be ‘NCHW’, ‘NHWC’, etc. ‘N’, ‘C’, ‘H’, ‘W’ stands for batch, channel, height, and width dimensions respectively. padding is applied on ‘H’ and ‘W’ dimension. ceil_mode (bool, default False) – When True, will use ceil instead of floor to compute the output shape.
Input shape:
This depends on the layout parameter. Input is 4D array of shape (batch_size, channels, height, width) if layout is NCHW.
Output shape:

This depends on the layout parameter. Output is 4D array of shape (batch_size, channels, out_height, out_width) if layout is NCHW.

out_height and out_width are calculated as:

out_height = floor((height+2*padding[0]-pool_size[0])/strides[0])+1
out_width = floor((width+2*padding[1]-pool_size[1])/strides[1])+1


When ceil_mode is True, ceil will be used instead of floor in this equation.

class mxnet.gluon.nn.AvgPool3D(pool_size=(2, 2, 2), strides=None, padding=0, ceil_mode=False, layout='NCDHW', **kwargs)

Average pooling operation for 3D data (spatial or spatio-temporal).

Parameters: pool_size (int or list/tuple of 3 ints) – Size of the average pooling windows. strides (int, list/tuple of 3 ints, or None) – Factor by which to downscale. E.g. 2 will halve the input size. If None, it will default to pool_size. padding (int or list/tuple of 3 ints) – If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points. layout (str, default 'NCDHW') – Dimension ordering of data and weight. Can be ‘NCDHW’, ‘NDHWC’, etc. ‘N’, ‘C’, ‘H’, ‘W’, ‘D’ stands for batch, channel, height, width and depth dimensions respectively. padding is applied on ‘D’, ‘H’ and ‘W’ dimension. ceil_mode (bool, default False) – When True, will use ceil instead of floor to compute the output shape.
Input shape:
This depends on the layout parameter. Input is 5D array of shape (batch_size, channels, depth, height, width) if layout is NCDHW.
Output shape:

This depends on the layout parameter. Output is 5D array of shape (batch_size, channels, out_depth, out_height, out_width) if layout is NCDHW.

out_depth, out_height and out_width are calculated as

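out_depth = floor((depth+2*padding[0]-pool_size[0])/strides[0])+1
out_height = floor((height+2*padding[1]-pool_size[1])/strides[1])+1
out_width = floor((width+2*padding[2]-pool_size[2])/strides[2])+1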


When ceil_mode is True, ceil will be used instead of floor in this equation.

class mxnet.gluon.nn.GlobalMaxPool1D(layout='NCW', **kwargs)

Global max pooling operation for temporal data.

class mxnet.gluon.nn.GlobalMaxPool2D(layout='NCHW', **kwargs)

Global max pooling operation for spatial data.

class mxnet.gluon.nn.GlobalMaxPool3D(layout='NCDHW', **kwargs)

Global max pooling operation for 3D data.

class mxnet.gluon.nn.GlobalAvgPool1D(layout='NCW', **kwargs)

Global average pooling operation for temporal data.

class mxnet.gluon.nn.GlobalAvgPool2D(layout='NCHW', **kwargs)

Global average pooling operation for spatial data.

class mxnet.gluon.nn.GlobalAvgPool3D(layout='NCDHW', **kwargs)

Global average pooling operation for 3D data.
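Global pooling reduces every spatial position of a channel to a single value. The sketch below illustrates this on a nested list in NCW layout. It is plain Python with hypothetical helper names, and it assumes the pooled axis is kept with length 1, which this page does not state.

```python
def global_pool_1d(batch, reduce_fn):
    """Reduce the width axis of an NCW nested list to a single element."""
    return [[[reduce_fn(channel)] for channel in sample] for sample in batch]


def mean(values):
    return sum(values) / len(values)


x = [[[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]]]  # shape (1, 2, 3)
print(global_pool_1d(x, max))   # [[[3.0], [8.0]]]
print(global_pool_1d(x, mean))  # [[[2.0], [6.0]]]
```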

## Recurrent Layers¶

class mxnet.gluon.rnn.RecurrentCell(prefix=None, params=None)

Abstract base class for RNN cells.

Parameters: prefix (str, optional) – Prefix for names of Blocks (this prefix is also used for names of weights if params is None, i.e. if params are being created and not reused). params (Parameter or None, optional) – Container for weight sharing between cells. A new Parameter container is created if params is None.
__call__(*args)

Calls forward. Only accepts positional arguments.

reset()

Reset before re-using the cell for another graph.

state_info(batch_size=0)

Shape and layout information of states.

begin_state(batch_size=0, func=<function zeros>, **kwargs)

Initial state for this cell.

Parameters: func (callable, default symbol.zeros) – Function for creating initial state. For Symbol API, func can be symbol.zeros, symbol.uniform, symbol.var etc. Use symbol.var if you want to directly feed input as states. For NDArray API, func can be ndarray.zeros, ndarray.ones, etc. batch_size (int, default 0) – Only required for NDArray API. Size of the batch (‘N’ in layout) dimension of input. **kwargs – Additional keyword arguments passed to func. For example mean, std, dtype, etc.
Returns: states (nested list of Symbol) – Starting states for the first RNN step.
unroll(length, inputs, begin_state=None, layout='NTC', merge_outputs=None)

Unrolls an RNN cell across time steps.

Parameters: length (int) – Number of steps to unroll. inputs (Symbol, list of Symbol, or None) – If inputs is a single Symbol (usually the output of Embedding symbol), it should have shape (batch_size, length, ...) if layout is ‘NTC’, or (length, batch_size, ...) if layout is ‘TNC’. If inputs is a list of symbols (usually output of previous unroll), they should all have shape (batch_size, ...). begin_state (nested list of Symbol, optional) – Input states created by begin_state() or output state of another cell. Created from begin_state() if None. layout (str, optional) – Layout of input symbol. Only used if inputs is a single Symbol. merge_outputs (bool, optional) – If False, returns outputs as a list of Symbols. If True, concatenates output across time steps and returns a single symbol with shape (batch_size, length, ...) if layout is ‘NTC’, or (length, batch_size, ...) if layout is ‘TNC’. If None, output whatever is faster.
Returns: outputs (list of Symbol or Symbol) – Symbol (if merge_outputs is True) or list of Symbols (if merge_outputs is False) corresponding to the output from the RNN from this unrolling. states (list of Symbol) – The new state of this RNN after this unrolling. The type of this symbol is same as the output of begin_state().
forward(inputs, states)

Unrolls the recurrent cell for one time step.

Parameters: inputs (sym.Variable) – Input symbol, 2D, of shape (batch_size * num_units). states (list of sym.Variable) – RNN state from previous step or the output of begin_state().
Returns: output (Symbol) – Symbol corresponding to the output from the RNN when unrolling for a single time step. states (list of Symbol) – The new state of this RNN after this unrolling. The type of this symbol is same as the output of begin_state(). This can be used as an input state to the next time step of this RNN.

begin_state()
This function can provide the states for the first time step.
unroll()
This function unrolls an RNN for a given number of (>=1) time steps.
class mxnet.gluon.rnn.RNN(hidden_size, num_layers=1, activation='relu', layout='TNC', dropout=0, bidirectional=False, i2h_weight_initializer=None, h2h_weight_initializer=None, i2h_bias_initializer='zeros', h2h_bias_initializer='zeros', input_size=0, **kwargs)

Applies a multi-layer Elman RNN with tanh or ReLU non-linearity to an input sequence.

For each element in the input sequence, each layer computes the following function:

$h_t = \tanh(w_{ih} * x_t + b_{ih} + w_{hh} * h_{(t-1)} + b_{hh})$

where $$h_t$$ is the hidden state at time t, and $$x_t$$ is the hidden state of the previous layer at time t or $$input_t$$ for the first layer. If activation=’relu’, then ReLU is used instead of tanh.
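The recurrence can be traced by hand with a few lines of plain Python. This is an illustrative sketch for a single layer and a single sample, not the library implementation; the helper names and the list-of-lists weight layout are assumptions.

```python
import math


def matvec(w, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in w]


def relu(v):
    return max(0.0, v)


def rnn_step(x_t, h_prev, w_ih, b_ih, w_hh, b_hh, activation=math.tanh):
    """One Elman step: h_t = act(w_ih x_t + b_ih + w_hh h_prev + b_hh)."""
    i2h = matvec(w_ih, x_t)
    h2h = matvec(w_hh, h_prev)
    return [activation(a + b + c + d) for a, b, c, d in zip(i2h, b_ih, h2h, b_hh)]


# Input size 1, hidden size 2, zero recurrent weights and biases.
w_ih, w_hh = [[1.0], [-1.0]], [[0.0, 0.0], [0.0, 0.0]]
zeros = [0.0, 0.0]
h_tanh = rnn_step([0.5], zeros, w_ih, zeros, w_hh, zeros)
h_relu = rnn_step([0.5], zeros, w_ih, zeros, w_hh, zeros, activation=relu)
print(h_relu)  # [0.5, 0.0]
```

Feeding the returned vector back in as h_prev for the next input element reproduces the unrolled sequence.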

Parameters: hidden_size (int) – The number of features in the hidden state h. num_layers (int, default 1) – Number of recurrent layers. activation ({'relu' or 'tanh'}, default 'relu') – The activation function to use. layout (str, default 'TNC') – The format of input and output tensors. T, N and C stand for sequence length, batch size, and feature dimensions respectively. dropout (float, default 0) – If non-zero, introduces a dropout layer on the outputs of each RNN layer except the last layer. bidirectional (bool, default False) – If True, becomes a bidirectional RNN. i2h_weight_initializer (str or Initializer) – Initializer for the input weights matrix, used for the linear transformation of the inputs. h2h_weight_initializer (str or Initializer) – Initializer for the recurrent weights matrix, used for the linear transformation of the recurrent state. i2h_bias_initializer (str or Initializer) – Initializer for the bias vector. h2h_bias_initializer (str or Initializer) – Initializer for the bias vector. input_size (int, default 0) – The number of expected features in the input x. If not specified, it will be inferred from input. prefix (str or None) – Prefix of this Block. params (ParameterDict or None) – Shared Parameters for this Block.
Input shapes:
The input shape depends on layout. For layout=’TNC’, the input has shape (sequence_length, batch_size, input_size)
Output shape:
The output shape depends on layout. For layout=’TNC’, the output has shape (sequence_length, batch_size, hidden_size). If bidirectional is True, output shape will instead be (sequence_length, batch_size, 2*hidden_size).
Recurrent state shape:
The recurrent state’s shape is (num_layers, batch_size, hidden_size). If bidirectional is True, state shape will instead be (2*num_layers, batch_size, hidden_size).

Examples

>>> layer = mx.gluon.rnn.RNN(100, 3)
>>> layer.initialize()
>>> input = mx.nd.random_uniform(shape=(5, 3, 10))
>>> h0 = mx.nd.random_uniform(shape=(3, 3, 100))
>>> output, hn = layer(input, h0)

class mxnet.gluon.rnn.LSTM(hidden_size, num_layers=1, layout='TNC', dropout=0, bidirectional=False, input_size=0, i2h_weight_initializer=None, h2h_weight_initializer=None, i2h_bias_initializer='zeros', h2h_bias_initializer='zeros', **kwargs)

Applies a multi-layer long short-term memory (LSTM) RNN to an input sequence.

For each element in the input sequence, each layer computes the following function:

$\begin{split}\begin{array}{ll} i_t = sigmoid(W_{ii} x_t + b_{ii} + W_{hi} h_{(t-1)} + b_{hi}) \\ f_t = sigmoid(W_{if} x_t + b_{if} + W_{hf} h_{(t-1)} + b_{hf}) \\ g_t = \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{(t-1)} + b_{hg}) \\ o_t = sigmoid(W_{io} x_t + b_{io} + W_{ho} h_{(t-1)} + b_{ho}) \\ c_t = f_t * c_{(t-1)} + i_t * g_t \\ h_t = o_t * \tanh(c_t) \end{array}\end{split}$

where $$h_t$$ is the hidden state at time t, $$c_t$$ is the cell state at time t, $$x_t$$ is the hidden state of the previous layer at time t or $$input_t$$ for the first layer, and $$i_t$$, $$f_t$$, $$g_t$$, $$o_t$$ are the input, forget, cell, and out gates, respectively.
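The gate equations can be followed step by step in a scalar sketch, with one input feature and one hidden unit. This is plain Python for illustration only: the function name and dictionary keys are hypothetical, and the recurrent weight in the g_t line is called W_hg here.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step for a single scalar unit; p maps names like 'W_ii' to floats."""
    i_t = sigmoid(p["W_ii"] * x_t + p["b_ii"] + p["W_hi"] * h_prev + p["b_hi"])
    f_t = sigmoid(p["W_if"] * x_t + p["b_if"] + p["W_hf"] * h_prev + p["b_hf"])
    g_t = math.tanh(p["W_ig"] * x_t + p["b_ig"] + p["W_hg"] * h_prev + p["b_hg"])
    o_t = sigmoid(p["W_io"] * x_t + p["b_io"] + p["W_ho"] * h_prev + p["b_ho"])
    c_t = f_t * c_prev + i_t * g_t   # new cell state
    h_t = o_t * math.tanh(c_t)       # new hidden state
    return h_t, c_t


# All-zero parameters: every sigmoid gate is 0.5 and the cell gate g_t is 0.
zero_params = {
    f"{kind}_{src}{gate}": 0.0
    for kind, src in (("W", "i"), ("W", "h"), ("b", "i"), ("b", "h"))
    for gate in "ifgo"
}
h_t, c_t = lstm_step(x_t=1.0, h_prev=0.0, c_prev=2.0, p=zero_params)
print(c_t)  # 1.0
```

With all parameters at zero the cell state is simply halved by the forget gate, which makes the data flow easy to verify.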

Parameters: hidden_size (int) – The number of features in the hidden state h. num_layers (int, default 1) – Number of recurrent layers. layout (str, default 'TNC') – The format of input and output tensors. T, N and C stand for sequence length, batch size, and feature dimensions respectively. dropout (float, default 0) – If non-zero, introduces a dropout layer on the outputs of each RNN layer except the last layer. bidirectional (bool, default False) – If True, becomes a bidirectional RNN. i2h_weight_initializer (str or Initializer) – Initializer for the input weights matrix, used for the linear transformation of the inputs. h2h_weight_initializer (str or Initializer) – Initializer for the recurrent weights matrix, used for the linear transformation of the recurrent state. i2h_bias_initializer (str or Initializer, default 'lstmbias') – Initializer for the bias vector. By default, bias for the forget gate is initialized to 1 while all other biases are initialized to zero. h2h_bias_initializer (str or Initializer) – Initializer for the bias vector. input_size (int, default 0) – The number of expected features in the input x. If not specified, it will be inferred from input. prefix (str or None) – Prefix of this Block. params (ParameterDict or None) – Shared Parameters for this Block.
Input shapes:
The input shape depends on layout. For layout=’TNC’, the input has shape (sequence_length, batch_size, input_size)
Output shape:
The output shape depends on layout. For layout=’TNC’, the output has shape (sequence_length, batch_size, hidden_size). If bidirectional is True, output shape will instead be (sequence_length, batch_size, 2*hidden_size).
Recurrent state shape:
The recurrent state is a list of two NDArrays. Both have shape (num_layers, batch_size, hidden_size). If bidirectional is True, state shape will instead be (2*num_layers, batch_size, hidden_size).

Examples

>>> layer = mx.gluon.rnn.LSTM(100, 3)
>>> layer.initialize()
>>> input = mx.nd.random_uniform(shape=(5, 3, 10))
>>> h0 = mx.nd.random_uniform(shape=(3, 3, 100))
>>> c0 = mx.nd.random_uniform(shape=(3, 3, 100))
>>> output, hn = layer(input, [h0, c0])

class mxnet.gluon.rnn.GRU(hidden_size, num_layers=1, layout='TNC', dropout=0, bidirectional=False, input_size=0, i2h_weight_initializer=None, h2h_weight_initializer=None, i2h_bias_initializer='zeros', h2h_bias_initializer='zeros', **kwargs)

Applies a multi-layer gated recurrent unit (GRU) RNN to an input sequence.

For each element in the input sequence, each layer computes the following function:

$\begin{split}\begin{array}{ll} r_t = sigmoid(W_{ir} x_t + b_{ir} + W_{hr} h_{(t-1)} + b_{hr}) \\ i_t = sigmoid(W_{ii} x_t + b_{ii} + W_{hi} h_{(t-1)} + b_{hi}) \\ n_t = \tanh(W_{in} x_t + b_{in} + r_t * (W_{hn} h_{(t-1)}+ b_{hn})) \\ h_t = (1 - i_t) * n_t + i_t * h_{(t-1)} \\ \end{array}\end{split}$

where $$h_t$$ is the hidden state at time t, $$x_t$$ is the hidden state of the previous layer at time t or $$input_t$$ for the first layer, and $$r_t$$, $$i_t$$, $$n_t$$ are the reset, input, and new gates, respectively.
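A scalar sketch makes the role of each GRU gate visible, again with one input feature and one hidden unit. The function name and dictionary keys are hypothetical, and this is an illustration of the equations rather than the library code.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(x_t, h_prev, p):
    """One GRU step for a single scalar unit; p maps names like 'W_ir' to floats."""
    r_t = sigmoid(p["W_ir"] * x_t + p["b_ir"] + p["W_hr"] * h_prev + p["b_hr"])
    i_t = sigmoid(p["W_ii"] * x_t + p["b_ii"] + p["W_hi"] * h_prev + p["b_hi"])
    # The reset gate scales only the recurrent contribution to the new gate.
    n_t = math.tanh(p["W_in"] * x_t + p["b_in"] + r_t * (p["W_hn"] * h_prev + p["b_hn"]))
    return (1 - i_t) * n_t + i_t * h_prev


zero_params = {
    f"{kind}_{src}{gate}": 0.0
    for kind, src in (("W", "i"), ("W", "h"), ("b", "i"), ("b", "h"))
    for gate in "rin"
}
h_t = gru_step(x_t=1.0, h_prev=0.8, p=zero_params)
print(h_t)  # 0.4
```

With zero parameters i_t is 0.5 and n_t is 0, so the new state is an even blend of 0 and the previous state.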

Parameters: hidden_size (int) – The number of features in the hidden state h. num_layers (int, default 1) – Number of recurrent layers. layout (str, default 'TNC') – The format of input and output tensors. T, N and C stand for sequence length, batch size, and feature dimensions respectively. dropout (float, default 0) – If non-zero, introduces a dropout layer on the outputs of each RNN layer except the last layer. bidirectional (bool, default False) – If True, becomes a bidirectional RNN. i2h_weight_initializer (str or Initializer) – Initializer for the input weights matrix, used for the linear transformation of the inputs. h2h_weight_initializer (str or Initializer) – Initializer for the recurrent weights matrix, used for the linear transformation of the recurrent state. i2h_bias_initializer (str or Initializer) – Initializer for the bias vector. h2h_bias_initializer (str or Initializer) – Initializer for the bias vector. input_size (int, default 0) – The number of expected features in the input x. If not specified, it will be inferred from input. prefix (str or None) – Prefix of this Block. params (ParameterDict or None) – Shared Parameters for this Block.
Input shapes:
The input shape depends on layout. For layout=’TNC’, the input has shape (sequence_length, batch_size, input_size)
Output shape:
The output shape depends on layout. For layout=’TNC’, the output has shape (sequence_length, batch_size, hidden_size). If bidirectional is True, output shape will instead be (sequence_length, batch_size, 2*hidden_size).
Recurrent state shape:
The recurrent state’s shape is (num_layers, batch_size, hidden_size). If bidirectional is True, state shape will instead be (2*num_layers, batch_size, hidden_size).

Examples

>>> layer = mx.gluon.rnn.GRU(100, 3)
>>> layer.initialize()
>>> input = mx.nd.random_uniform(shape=(5, 3, 10))
>>> h0 = mx.nd.random_uniform(shape=(3, 3, 100))
>>> output, hn = layer(input, h0)

class mxnet.gluon.rnn.RNNCell(hidden_size, activation='tanh', i2h_weight_initializer=None, h2h_weight_initializer=None, i2h_bias_initializer='zeros', h2h_bias_initializer='zeros', input_size=0, prefix=None, params=None)

Simple recurrent neural network cell.

Parameters: hidden_size (int) – Number of units in output symbol. activation (str or Symbol, default 'tanh') – Type of activation function. i2h_weight_initializer (str or Initializer) – Initializer for the input weights matrix, used for the linear transformation of the inputs. h2h_weight_initializer (str or Initializer) – Initializer for the recurrent weights matrix, used for the linear transformation of the recurrent state. i2h_bias_initializer (str or Initializer) – Initializer for the bias vector. h2h_bias_initializer (str or Initializer) – Initializer for the bias vector. prefix (str, default ‘rnn_‘) – Prefix for name of Blocks (and name of weight if params is None). params (Parameter or None) – Container for weight sharing between cells. Created if None.
class mxnet.gluon.rnn.LSTMCell(hidden_size, i2h_weight_initializer=None, h2h_weight_initializer=None, i2h_bias_initializer='zeros', h2h_bias_initializer='zeros', input_size=0, prefix=None, params=None)

Long-Short Term Memory (LSTM) network cell.

Parameters: hidden_size (int) – Number of units in output symbol. i2h_weight_initializer (str or Initializer) – Initializer for the input weights matrix, used for the linear transformation of the inputs. h2h_weight_initializer (str or Initializer) – Initializer for the recurrent weights matrix, used for the linear transformation of the recurrent state. i2h_bias_initializer (str or Initializer, default 'lstmbias') – Initializer for the bias vector. By default, bias for the forget gate is initialized to 1 while all other biases are initialized to zero. h2h_bias_initializer (str or Initializer) – Initializer for the bias vector. prefix (str, default ‘lstm_‘) – Prefix for name of Blocks (and name of weight if params is None). params (Parameter or None) – Container for weight sharing between cells. Created if None.
class mxnet.gluon.rnn.GRUCell(hidden_size, i2h_weight_initializer=None, h2h_weight_initializer=None, i2h_bias_initializer='zeros', h2h_bias_initializer='zeros', input_size=0, prefix=None, params=None)

Gated Recurrent Unit (GRU) network cell. Note: this is an implementation of the cuDNN version of GRUs (slight modification compared to Cho et al. 2014).

Parameters: hidden_size (int) – Number of units in output symbol. i2h_weight_initializer (str or Initializer) – Initializer for the input weights matrix, used for the linear transformation of the inputs. h2h_weight_initializer (str or Initializer) – Initializer for the recurrent weights matrix, used for the linear transformation of the recurrent state. i2h_bias_initializer (str or Initializer) – Initializer for the bias vector. h2h_bias_initializer (str or Initializer) – Initializer for the bias vector. prefix (str, default ‘gru_‘) – prefix for name of Blocks (and name of weight if params is None). params (Parameter or None) – Container for weight sharing between cells. Created if None.
class mxnet.gluon.rnn.SequentialRNNCell(prefix=None, params=None)

Sequentially stacking multiple RNN cells.

add(cell)

Appends a cell into the stack.

Parameters: cell (rnn cell) – The cell to append.
class mxnet.gluon.rnn.BidirectionalCell(l_cell, r_cell, output_prefix='bi_')

Bidirectional RNN cell.

Parameters: l_cell (RecurrentCell) – Cell for forward unrolling. r_cell (RecurrentCell) – Cell for backward unrolling.
class mxnet.gluon.rnn.DropoutCell(dropout, prefix=None, params=None)

Applies dropout on input.

Parameters: dropout (float) – Percentage of elements to drop out, which is 1 - percentage to retain.
class mxnet.gluon.rnn.ZoneoutCell(base_cell, zoneout_outputs=0.0, zoneout_states=0.0)

Applies Zoneout on base cell.

class mxnet.gluon.rnn.ResidualCell(base_cell)

Adds residual connection as described in Wu et al, 2016 (https://arxiv.org/abs/1609.08144). Output of the cell is output of the base cell plus input.
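The residual rule (cell output plus input) can be sketched as a wrapper around any single-step function. This is plain Python with hypothetical names: `base_step` stands in for a cell's one-step forward, taking and returning `(outputs, states)`, and the output size is assumed to equal the input size so the two can be added.

```python
def residual(base_step):
    """Wrap a step function so that its output has the input added to it."""
    def step(inputs, states):
        outputs, new_states = base_step(inputs, states)
        return [o + x for o, x in zip(outputs, inputs)], new_states
    return step


def doubling_step(inputs, states):
    """Hypothetical base cell: doubles its input and passes states through."""
    return [2.0 * x for x in inputs], states


step = residual(doubling_step)
outputs, states = step([1.0, 2.0], [])
print(outputs)  # [3.0, 6.0]
```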

## Trainer¶

class mxnet.gluon.Trainer(params, optimizer, optimizer_params, kvstore='device')

Applies an Optimizer on a set of Parameters. Trainer should be used together with autograd.

Parameters: params (ParameterDict) – The set of parameters to optimize. optimizer (str or Optimizer) – The optimizer to use. optimizer_params (dict) – Keyword arguments to be passed to optimizer constructor. For example, {‘learning_rate’: 0.1}. kvstore (str or KVStore) – kvstore type for multi-gpu and distributed training.
step(batch_size, ignore_stale_grad=False)

Makes one step of parameter update. Should be called after autograd.compute_gradient and outside of record() scope.

Parameters: batch_size (int) – Batch size of data processed. Gradient will be normalized by 1/batch_size. Set this to 1 if you normalized loss manually with loss = mean(loss). ignore_stale_grad (bool, optional, default=False) – If true, ignores Parameters with stale gradient (gradient that has not been updated by backward after last step) and skips the update.
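The role of batch_size is easiest to see with numbers. The sketch below is plain Python, not the Trainer implementation: it assumes a plain SGD update without momentum or weight decay, and the function name is hypothetical.

```python
def sgd_step(params, grads, learning_rate, batch_size):
    """Plain SGD update with gradients rescaled by 1/batch_size."""
    return [w - learning_rate * g / batch_size for w, g in zip(params, grads)]


# Gradients accumulated as a sum over a batch of 4 samples.
summed_grads = [4.0, -8.0]
updated = sgd_step([1.0, 2.0], summed_grads, learning_rate=0.1, batch_size=4)

# The same update when the loss was already averaged over the batch.
mean_grads = [1.0, -2.0]
updated_mean = sgd_step([1.0, 2.0], mean_grads, learning_rate=0.1, batch_size=1)
```

Both calls yield the same parameters, which is why batch_size should be 1 when the loss has already been averaged.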

## Loss functions¶

class mxnet.gluon.loss.L2Loss(weight=1.0, batch_axis=0, **kwargs)

Calculates the mean squared error between output and label:

$L = \frac{1}{2}\sum_i \Vert {output}_i - {label}_i \Vert^2.$

Output and label can have arbitrary shape as long as they have the same number of elements.
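As a worked example, the formula can be evaluated per sample in plain Python. This sketch follows the formula as printed, summing over the elements of each sample. The function name is hypothetical, applying weight as a multiplier is an assumption, and the library may average over non-batch elements rather than sum.

```python
def l2_loss(outputs, labels, weight=1.0):
    """Per-sample loss 0.5 * weight * sum((output - label)^2)."""
    return [
        0.5 * weight * sum((o - l) ** 2 for o, l in zip(out_row, label_row))
        for out_row, label_row in zip(outputs, labels)
    ]


losses = l2_loss([[1.0, 2.0], [0.0, 0.0]], [[0.0, 0.0], [3.0, 4.0]])
print(losses)  # [2.5, 12.5]
```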

Parameters: weight (float or None) – Global scalar weight for loss. sample_weight (Symbol or None) – Per sample weighting. Must be broadcastable to the same shape as loss. For example, if loss has shape (64, 10) and you want to weight each sample in the batch, sample_weight should have shape (64, 1). batch_axis (int, default 0) – The axis that represents mini-batch.
class mxnet.gluon.loss.L1Loss(weight=None, batch_axis=0, **kwargs)

Calculates the mean absolute error between output and label:

$L = \sum_i \vert {output}_i - {label}_i \vert.$

Output and label must have the same shape.

Parameters: weight (float or None) – Global scalar weight for loss. sample_weight (Symbol or None) – Per sample weighting. Must be broadcastable to the same shape as loss. For example, if loss has shape (64, 10) and you want to weight each sample in the batch, sample_weight should have shape (64, 1). batch_axis (int, default 0) – The axis that represents mini-batch.
class mxnet.gluon.loss.SoftmaxCrossEntropyLoss(axis=-1, sparse_label=True, from_logits=False, weight=None, batch_axis=0, **kwargs)

Computes the softmax cross entropy loss.

If sparse_label is True, label should contain integer category indicators:

$p = {softmax}({output})$

$L = -\sum_i {log}(p_{i,{label}_i})$

Label’s shape should be output’s shape without the axis dimension. i.e. for output.shape = (1,2,3,4) and axis = 2, label.shape should be (1,2,4).
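For the sparse-label case, the computation can be sketched in plain Python for a 2D output of shape (batch, classes). The function names are hypothetical, and the sketch returns one loss value per sample. Summing them gives L as written above.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_softmax_cross_entropy(outputs, labels):
    """Per-sample loss -log(p[label]) for integer class labels."""
    return [-math.log(softmax(row)[label]) for row, label in zip(outputs, labels)]


# Uniform scores over two classes give a loss of log(2) regardless of the label.
losses = sparse_softmax_cross_entropy([[0.0, 0.0], [5.0, 5.0]], [1, 0])
```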

If sparse_label is False, label should contain probability distribution with the same shape as output:

$p = {softmax}({output})$

$L = -\sum_i \sum_j {label}_j {log}(p_{ij})$
Parameters: axis (int, default -1) – The axis to sum over when computing softmax and entropy. sparse_label (bool, default True) – Whether label is an integer array instead of probability distribution. from_logits (bool, default False) – Whether input is a log probability (usually from log_softmax) instead of unnormalized numbers. weight (float or None) – Global scalar weight for loss. sample_weight (Symbol or None) – Per sample weighting. Must be broadcastable to the same shape as loss. For example, if loss has shape (64, 10) and you want to weight each sample in the batch, sample_weight should have shape (64, 1). batch_axis (int, default 0) – The axis that represents mini-batch.
class mxnet.gluon.loss.KLDivLoss(from_logits=True, weight=None, batch_axis=0, **kwargs)

The Kullback-Leibler divergence loss.

KL divergence is a useful distance measure for continuous distributions and is often useful when performing direct regression over the space of (discretely sampled) continuous output distributions.

$L = 1/n \sum_i (label_i * (log(label_i) - output_i))$

Label’s shape should be the same as output’s.
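The formula can be checked on a single sample in plain Python. This sketch assumes from_logits=True, so `log_probs` already holds log probabilities. The function name is hypothetical, and terms with a zero label are skipped, which treats 0 * log(0) as 0.

```python
import math


def kl_div_loss(log_probs, label):
    """L = 1/n * sum_i label_i * (log(label_i) - output_i) for one sample."""
    n = len(label)
    return sum(l * (math.log(l) - o) for l, o in zip(label, log_probs) if l > 0) / n


probs = [0.25, 0.75]
log_probs = [math.log(p) for p in probs]
same = kl_div_loss(log_probs, probs)        # identical distributions
diff = kl_div_loss(log_probs, [0.5, 0.5])   # mismatched label distribution
```

The loss is zero when the label matches the predicted distribution and positive otherwise.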

Parameters: from_logits (bool, default is True) – Whether the input is log probability (usually from log_softmax) instead of unnormalized numbers. weight (float or None) – Global scalar weight for loss. sample_weight (Symbol or None) – Per sample weighting. Must be broadcastable to the same shape as loss. For example, if loss has shape (64, 10) and you want to weight each sample in the batch, sample_weight should have shape (64, 1). batch_axis (int, default 0) – The axis that represents mini-batch.

## Utilities¶

utils.split_data(data, num_slice, batch_axis=0, even_split=True)

Splits an NDArray into num_slice slices along batch_axis. Usually used for data parallelism where each slice is sent to one device (e.g. a GPU).
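The slicing rule can be sketched on an ordinary Python list standing in for the batch axis. The function name is hypothetical, and letting the last slice absorb the remainder when even_split is False is an assumption that this page does not confirm.

```python
def split_list(data, num_slice, even_split=True):
    """Split a list into num_slice contiguous slices."""
    size = len(data)
    if even_split and size % num_slice != 0:
        raise ValueError(
            "data of size %d cannot be evenly split into %d slices" % (size, num_slice)
        )
    step = size // num_slice
    slices = [data[i * step:(i + 1) * step] for i in range(num_slice - 1)]
    slices.append(data[(num_slice - 1) * step:])  # last slice takes any remainder
    return slices


print(split_list(list(range(6)), 3))                    # [[0, 1], [2, 3], [4, 5]]
print(split_list(list(range(5)), 2, even_split=False))  # [[0, 1], [2, 3, 4]]
```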

Parameters: data (NDArray) – A batch of data. num_slice (int) – Number of desired slices. batch_axis (int, default 0) – The axis along which to slice. even_split (bool, default True) – Whether to force all slices to have the same number of elements. If True, an error will be raised when num_slice does not evenly divide data.shape[batch_axis].
Returns: list of NDArray – Return value is a list even if num_slice is 1.
utils.split_and_load(data, ctx_list, batch_axis=0, even_split=True)

Splits an NDArray into len(ctx_list) slices along batch_axis and loads each slice to one context in ctx_list.

Parameters: data (NDArray) – A batch of data. ctx_list (list of Context) – A list of Contexts. batch_axis (int, default 0) – The axis along which to slice. even_split (bool, default True) – Whether to force all slices to have the same number of elements.
Returns: list of NDArray – Each corresponds to a context in ctx_list.
utils.clip_global_norm(arrays, max_norm)

Rescales NDArrays so that the sum of their 2-norm is smaller than max_norm.
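A plain-Python sketch of gradient clipping by global norm follows. It assumes the common convention that the global norm is the square root of the sum of squared entries across all arrays. The exact definition is not spelled out on this page, and the function name and the small epsilon are illustrative.

```python
import math


def clip_global_norm_sketch(arrays, max_norm):
    """Rescale lists of floats so their combined 2-norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(v * v for arr in arrays for v in arr))
    scale = max_norm / (total_norm + 1e-8)  # epsilon guards against division by zero
    if scale < 1.0:
        arrays = [[v * scale for v in arr] for arr in arrays]
    return arrays, total_norm


clipped, norm = clip_global_norm_sketch([[3.0], [4.0]], max_norm=1.0)
print(norm)  # 5.0
```

All arrays share one scale factor, so the direction of the combined gradient is preserved.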