Gluon Package
Warning
This package is currently experimental and may change in the near future.
Overview
The Gluon package is a high-level interface for MXNet, designed to be easy to use while keeping most of the flexibility of the low-level API. Gluon supports both imperative and symbolic programming, making it easy to train complex models imperatively in Python and then deploy them as a symbolic graph in C++ and Scala.
Parameter

class mxnet.gluon.Parameter(name, grad_req='write', shape=None, dtype=<type 'numpy.float32'>, lr_mult=1.0, wd_mult=1.0, init=None, allow_deferred_init=False)
A container holding parameters (weights) of `Block`s.
Parameter holds a copy of the parameter on each Context after it is initialized with Parameter.initialize(...). If grad_req is not 'null', it will also hold a gradient array on each Context:
import mxnet as mx
ctx = mx.gpu(0)
x = mx.nd.zeros((16, 100), ctx=ctx)
w = mx.gluon.Parameter('fc_weight', shape=(64, 100), init=mx.init.Xavier())
b = mx.gluon.Parameter('fc_bias', shape=(64,), init=mx.init.Zero())
w.initialize(ctx=ctx)
b.initialize(ctx=ctx)
out = mx.nd.FullyConnected(x, w.data(ctx), b.data(ctx), num_hidden=64)
Parameters:  name (str) – Name of this parameter.
 grad_req ({'write', 'add', 'null'}, default 'write') – Specifies how to update gradient to grad arrays.
 ‘write’ means the gradient is written to the grad NDArray every time.
 ‘add’ means the gradient is added to the grad NDArray every time. You need to manually call zero_grad() to clear the gradient buffer before each iteration when using this option.
 ‘null’ means no gradient is requested for this parameter. Gradient arrays will not be allocated.
 shape (tuple of int, default None) – Shape of this parameter. By default, shape is not specified. A Parameter with unknown shape can be used with the Symbol API, but init will throw an error when using the NDArray API.
 dtype (numpy.dtype or str, default 'float32') – Data type of this parameter. For example, numpy.float32 or ‘float32’.
 lr_mult (float, default 1.0) – Learning rate multiplier. Learning rate will be multiplied by lr_mult when updating this parameter with optimizer.
 wd_mult (float, default 1.0) – Weight decay multiplier (L2 regularizer coefficient). Works similar to lr_mult.
 init (Initializer, default None) – Initializer of this parameter. Will use the global initializer by default.
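The three grad_req modes can be pictured with a small plain-Python model. This is only a sketch of the documented semantics, not MXNet code: the accumulate helper and the list-based buffers are hypothetical stand-ins for the gradient NDArrays.

```python
# Plain-Python model of the documented grad_req modes (not MXNet code).
def accumulate(grad_buffer, new_grad, grad_req):
    """Return the gradient buffer after one backward pass."""
    if grad_req == 'write':
        return list(new_grad)  # overwrite the previous gradient
    if grad_req == 'add':
        return [g + n for g, n in zip(grad_buffer, new_grad)]  # accumulate
    if grad_req == 'null':
        return None  # no gradient buffer is kept
    raise ValueError('unknown grad_req: %r' % grad_req)

buf_write = [0.0, 0.0]
buf_add = [0.0, 0.0]
for step_grad in ([1.0, 2.0], [3.0, 4.0]):
    buf_write = accumulate(buf_write, step_grad, 'write')
    buf_add = accumulate(buf_add, step_grad, 'add')

print(buf_write)  # only the latest gradient survives
print(buf_add)    # gradients from both passes are summed
```

With 'add', the buffer keeps growing across iterations, which is why zero_grad() has to be called manually between them.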

initialize(init=None, ctx=None, default_init=<mxnet.initializer.Uniform object>)
Initializes parameter and gradient arrays. Only used for NDArray API.
Parameters:  init (Initializer) – The initializer to use. Overrides Parameter.init and default_init.
 ctx (Context or list of Context, defaults to context.current_context()) – Initialize Parameter on given context. If ctx is a list of Context, a copy will be made for each context.
Note
Copies are independent arrays. The user is responsible for keeping their values consistent when updating. Normally gluon.Trainer does this for you.
 default_init (Initializer) – Default initializer, used when both init and Parameter.init are None.
Examples
>>> weight = mx.gluon.Parameter('weight', shape=(2, 2))
>>> weight.initialize(ctx=mx.cpu(0))
>>> weight.data()
[[0.01068833 0.01729892]
 [ 0.02042518 0.01618656]]
<NDArray 2x2 @cpu(0)>
>>> weight.grad()
[[ 0. 0.]
 [ 0. 0.]]
<NDArray 2x2 @cpu(0)>
>>> weight.initialize(ctx=[mx.gpu(0), mx.gpu(1)])
>>> weight.data(mx.gpu(0))
[[0.00873779 0.02834515]
 [ 0.05484822 0.06206018]]
<NDArray 2x2 @gpu(0)>
>>> weight.data(mx.gpu(1))
[[0.00873779 0.02834515]
 [ 0.05484822 0.06206018]]
<NDArray 2x2 @gpu(1)>

set_data(data)
Sets this parameter’s value on all contexts to data.

data(ctx=None)
Returns a copy of this parameter on one context. The parameter must have been initialized on this context beforehand.
Parameters: ctx (Context) – Desired context.
Returns: NDArray on ctx.

list_data()
Returns copies of this parameter on all contexts, in the same order as creation.

grad(ctx=None)
Returns a gradient buffer for this parameter on one context.
Parameters: ctx (Context) – Desired context.

list_grad()
Returns gradient buffers on all contexts, in the same order as values.

list_ctx()
Returns a list of contexts this parameter is initialized on.

zero_grad()
Sets gradient buffer on all contexts to 0. No action is taken if the parameter is uninitialized or doesn’t require gradient.

var()
Returns a symbol representing this parameter.

class mxnet.gluon.ParameterDict(prefix='', shared=None)
A dictionary managing a set of parameters.
Parameters:  prefix (str, default '') – The prefix to be prepended to the names of all Parameters created by this dict.
 shared (ParameterDict or None) – If not None, when this dict’s get method creates a new parameter, it will first try to retrieve it from the shared dict. Usually used for sharing parameters with another Block.

prefix
Prefix of this dict. It will be prepended to the names of Parameters created with get.

get(name, **kwargs)
Retrieves a Parameter with name self.prefix+name. If not found, get will first try to retrieve it from the shared dict. If still not found, get will create a new Parameter with the keyword arguments and insert it into self.
Parameters:  name (str) – Name of the desired Parameter. It will be prepended with this dictionary’s prefix.
 **kwargs – The rest of the keyword arguments for the created Parameter.
Returns: The created or retrieved Parameter.
Return type: Parameter
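The lookup order described for get can be sketched with ordinary dictionaries. ToyParameterDict below is a hypothetical stand-in, not the MXNet class; it stores plain dicts in place of Parameter objects, and it assumes that an entry found in the shared dict is also recorded in the local dict.

```python
class ToyParameterDict:
    """Plain-Python stand-in for the documented lookup order of get()."""

    def __init__(self, prefix='', shared=None):
        self.prefix = prefix
        self.shared = shared
        self._params = {}

    def get(self, name, **kwargs):
        full_name = self.prefix + name
        if full_name in self._params:
            return self._params[full_name]          # 1. already in this dict
        if self.shared is not None and full_name in self.shared._params:
            param = self.shared._params[full_name]  # 2. found in the shared dict
        else:
            param = dict(name=full_name, **kwargs)  # 3. create a new entry
        self._params[full_name] = param
        return param

base = ToyParameterDict(prefix='dense0_')
weight = base.get('weight', shape=(20, 10))

# A second dict that shares with the first gets the very same object back.
tied = ToyParameterDict(prefix='dense0_', shared=base)
print(tied.get('weight') is weight)
```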

update(other)
Copies all Parameters in other to self.

initialize(init=<mxnet.initializer.Uniform object>, ctx=None, verbose=False)
Initializes all Parameters managed by this dictionary to be used for NDArray API. It has no effect when using Symbol API.
Parameters:  init (Initializer) – Global default Initializer to be used when Parameter.init is None. Otherwise, Parameter.init takes precedence.
 ctx (Context or list of Context) – Keeps a copy of Parameters on one or many context(s).
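Taken together with Parameter.initialize, the documented precedence is: an explicit init argument first, then the Parameter's own init, then the global default. A minimal sketch of that rule follows, with strings standing in for Initializer objects; pick_initializer is a hypothetical helper, not part of MXNet.

```python
def pick_initializer(init=None, param_init=None, default_init='uniform'):
    """Mirror the documented precedence: init, then Parameter.init, then default_init."""
    if init is not None:
        return init
    if param_init is not None:
        return param_init
    return default_init

print(pick_initializer(init='xavier', param_init='zeros'))  # explicit init wins
print(pick_initializer(param_init='zeros'))                 # Parameter.init is used
print(pick_initializer())                                   # falls back to the default
```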

zero_grad()
Sets all Parameters’ gradient buffer to 0.

save(filename, strip_prefix='')
Save parameters to file.
Parameters:  filename (str) – Path to parameter file.
 strip_prefix (str, default '') – Strip prefix from parameter names before saving.

load(filename, ctx, allow_missing=False, ignore_extra=False, restore_prefix='')
Load parameters from file.
Parameters:  filename (str) – Path to parameter file.
 ctx (Context or list of Context) – Context(s) to initialize loaded parameters on.
 allow_missing (bool, default False) – Whether to silently skip loading parameters not represented in the file.
 ignore_extra (bool, default False) – Whether to silently ignore parameters from the file that are not present in this ParameterDict.
 restore_prefix (str, default '') – Prepend prefix to names of stored parameters before loading.
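One way to read the load flags is as a name-matching step between the ParameterDict and the file. The sketch below works on plain name lists only. plan_load is a hypothetical helper, and raising KeyError on a mismatch is an assumption, since the text above only says which mismatches can be silenced.

```python
def plan_load(own_names, file_names, allow_missing=False,
              ignore_extra=False, restore_prefix=''):
    """Return the parameter names that would be loaded, or raise on a mismatch."""
    own = set(own_names)
    stored = {restore_prefix + name for name in file_names}
    missing = sorted(own - stored)  # in the dict but not in the file
    extra = sorted(stored - own)    # in the file but not in the dict
    if missing and not allow_missing:
        raise KeyError('missing from file: %s' % missing)
    if extra and not ignore_extra:
        raise KeyError('not in this ParameterDict: %s' % extra)
    return sorted(stored & own)

# The file was saved with the prefix stripped; restore_prefix puts it back.
print(plan_load(['net_weight', 'net_bias'], ['weight', 'bias'],
                restore_prefix='net_'))
```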
Containers

class mxnet.gluon.Block(prefix=None, params=None)
Base class for all neural network layers and models. Your models should subclass this class.
Blocks can be nested recursively in a tree structure. You can create and assign child Blocks as regular attributes:
import mxnet as mx
from mxnet.gluon import Block, nn
from mxnet import ndarray as F

class Model(Block):
    def __init__(self, **kwargs):
        super(Model, self).__init__(**kwargs)
        # use name_scope to give child Blocks appropriate names.
        # It also allows sharing Parameters between Blocks recursively.
        with self.name_scope():
            self.dense0 = nn.Dense(20)
            self.dense1 = nn.Dense(20)

    def forward(self, x):
        x = F.relu(self.dense0(x))
        return F.relu(self.dense1(x))

model = Model()
model.initialize(ctx=mx.cpu(0))
model(F.zeros((10, 10), ctx=mx.cpu(0)))
Child Blocks assigned this way are registered, and collect_params will collect their Parameters recursively.
Parameters:  prefix (str) – Prefix acts like a name space. It will be prepended to the names of all Parameters and child Blocks in this Block’s name_scope. Prefix should be unique within one model to prevent name collisions.
 params (ParameterDict or None) – ParameterDict for sharing weights with the new Block. For example, if you want dense1 to share dense0’s weights, you can do:
dense0 = nn.Dense(20)
dense1 = nn.Dense(20, params=dense0.collect_params())

forward(*args)
Overrides to implement forward computation using NDArray. Only accepts positional arguments.
Parameters: *args – Input tensors.

__setattr__(name, value)
Registers parameters.

prefix
Prefix of this Block.

name
Name of this Block, without the trailing ‘_’.

name_scope()
Returns a name space object managing child Block and parameter names. Should be used within a with statement:
with self.name_scope():
    self.dense = nn.Dense(20)

params
Returns this Block’s parameter dictionary (does not include its children’s parameters).

collect_params()
Returns a ParameterDict containing the Parameters of this Block and all of its children.

save_params(filename)
Save parameters to file.
Parameters: filename (str) – Path to file.

load_params(filename, ctx, allow_missing=False, ignore_extra=False)
Load parameters from file.
Parameters:  filename (str) – Path to parameter file.
 ctx (Context or list of Context) – Context(s) to initialize loaded parameters on.
 allow_missing (bool, default False) – Whether to silently skip loading parameters not represented in the file.
 ignore_extra (bool, default False) – Whether to silently ignore parameters from the file that are not present in this Block.

register_child(block)
Registers block as a child of self. `Block`s assigned to self as attributes will be registered automatically.

initialize(init=<mxnet.initializer.Uniform object>, ctx=None, verbose=False)
Initializes the Parameters of this Block and its children.
Equivalent to block.collect_params().initialize(...)

hybridize(active=True)
Activates or deactivates `HybridBlock`s recursively. Has no effect on non-hybrid children.
Parameters: active (bool, default True) – Whether to turn hybrid on or off.

__call__(*args)
Calls forward. Only accepts positional arguments.


class mxnet.gluon.HybridBlock(prefix=None, params=None)
HybridBlock supports forwarding with both Symbol and NDArray.
Forward computation in HybridBlock must be static to work with Symbols, i.e. you cannot call .asnumpy(), .shape, .dtype, etc. on tensors. Also, you cannot use branching or loop logic based on non-constant expressions like random numbers or intermediate results, since they change the graph structure for each iteration.
Before activating with hybridize(), HybridBlock works just like a normal Block. After activation, HybridBlock will create a symbolic graph representing the forward computation and cache it. On subsequent forwards, the cached graph will be used instead of hybrid_forward.
Refer to the Hybrid tutorial for end-to-end usage.

hybrid_forward(F, x, *args, **kwargs)
Overrides to construct symbolic graph for this Block.
Parameters:  x (Symbol or NDArray) – The first input tensor.
 *args – Additional input tensors.

__setattr__(name, value)
Registers parameters.

infer_shape(*args)
Infers shape of Parameters from inputs.

forward(x, *args)
Defines the forward computation. Arguments can be either NDArray or Symbol.


Neural Network Layers
Containers

class mxnet.gluon.nn.Sequential(prefix=None, params=None)
Stacks `Block`s sequentially.
Example:
from mxnet.gluon import nn
net = nn.Sequential()
# use net's name_scope to give child Blocks appropriate names.
with net.name_scope():
    net.add(nn.Dense(10, activation='relu'))
    net.add(nn.Dense(20))

add(block)
Adds block on top of the stack.


class mxnet.gluon.nn.HybridSequential(prefix=None, params=None)
Stacks `HybridBlock`s sequentially.
Example:
from mxnet.gluon import nn
net = nn.HybridSequential()
# use net's name_scope to give child Blocks appropriate names.
with net.name_scope():
    net.add(nn.Dense(10, activation='relu'))
    net.add(nn.Dense(20))

add(block)
Adds block on top of the stack.

Basic Layers

class mxnet.gluon.nn.Dense(units, activation=None, use_bias=True, weight_initializer=None, bias_initializer='zeros', in_units=0, **kwargs)
Just your regular densely-connected NN layer.
Dense implements the operation: output = activation(dot(input, weight) + bias), where activation is the element-wise activation function passed as the activation argument, weight is a weights matrix created by the layer, and bias is a bias vector created by the layer (only applicable if use_bias is True).
Note: the input must be a tensor with rank 2. Use flatten to convert it to rank 2 manually if necessary.
Parameters:  units (int) – Dimensionality of the output space.
 activation (str) – Activation function to use. See help on Activation layer. If you don’t specify anything, no activation is applied (i.e. “linear” activation: a(x) = x).
 use_bias (bool) – Whether the layer uses a bias vector.
 weight_initializer (str or Initializer) – Initializer for the kernel weights matrix.
 bias_initializer (str or Initializer) – Initializer for the bias vector.
 in_units (int, optional) – Size of the input data. If not specified, initialization will be deferred to the first time forward is called and in_units will be inferred from the shape of input data.
 prefix (str or None) – See document of Block.
 params (ParameterDict or None) – See document of Block.
 Input shape:
 A 2D input with shape (batch_size, in_units).
 Output shape:
 The output would have shape (batch_size, units).
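The Dense operation can be written out on nested lists to make the shapes concrete. This is a plain-Python sketch rather than MXNet code. dense_forward is a hypothetical helper, and the weight is laid out as (in_units, units) so that it matches dot(input, weight) as written above; the layout the layer uses internally may differ.

```python
def dense_forward(inputs, weight, bias=None, activation=None):
    """Compute activation(dot(input, weight) + bias) on nested lists."""
    units = len(weight[0])
    outputs = []
    for row in inputs:
        values = []
        for j in range(units):
            total = sum(x * weight[i][j] for i, x in enumerate(row))
            if bias is not None:
                total += bias[j]
            values.append(activation(total) if activation else total)
        outputs.append(values)
    return outputs

def relu(value):
    return max(0.0, value)

x = [[1.0, 2.0]]                          # (batch_size=1, in_units=2)
w = [[1.0, 0.0, -3.0], [0.5, 1.0, 1.0]]   # (in_units=2, units=3), assumed layout
b = [0.0, 1.0, 0.0]                       # (units=3,)
y = dense_forward(x, w, b, relu)
print(y)  # shape (batch_size=1, units=3)
```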

class mxnet.gluon.nn.Activation(activation, **kwargs)
Applies an activation function to input.
Parameters: activation (str) – Name of activation function to use. See Activation() for available choices.
 Input shape:
 Arbitrary.
 Output shape:
 Same shape as input.

class mxnet.gluon.nn.Dropout(rate, **kwargs)
Applies Dropout to the input.
Dropout consists of randomly setting a fraction rate of input units to 0 at each update during training time, which helps prevent overfitting.
Parameters: rate (float) – Fraction of the input units to drop. Must be a number between 0 and 1.
 Input shape:
 Arbitrary.
 Output shape:
 Same shape as input.
References
Dropout: A Simple Way to Prevent Neural Networks from Overfitting

class mxnet.gluon.nn.BatchNorm(axis=1, momentum=0.9, epsilon=0.001, center=True, scale=True, beta_initializer='zeros', gamma_initializer='ones', running_mean_initializer='zeros', running_variance_initializer='ones', in_channels=0, **kwargs)
Batch normalization layer (Ioffe and Szegedy, 2014). Normalizes the input at each batch, i.e. applies a transformation that maintains the mean activation close to 0 and the activation standard deviation close to 1.
Parameters:  axis (int, default 1) – The axis that should be normalized. This is typically the channels (C) axis. For instance, after a Conv2D layer with layout=’NCHW’, set axis=1 in BatchNorm. If layout=’NHWC’, then set axis=3.
 momentum (float, default 0.9) – Momentum for the moving average.
 epsilon (float, default 1e-3) – Small float added to variance to avoid dividing by zero.
 center (bool, default True) – If True, add offset of beta to normalized tensor. If False, beta is ignored.
 scale (bool, default True) – If True, multiply by gamma. If False, gamma is not used. When the next layer is linear (also e.g. nn.relu), this can be disabled since the scaling will be done by the next layer.
 beta_initializer (str or Initializer, default ‘zeros’) – Initializer for the beta weight.
 gamma_initializer (str or Initializer, default ‘ones’) – Initializer for the gamma weight.
 running_mean_initializer (str or Initializer, default ‘zeros’) – Initializer for the moving mean.
 running_variance_initializer (str or Initializer, default ‘ones’) – Initializer for the moving variance.
 in_channels (int, default 0) – Number of channels (feature maps) in input data. If not specified, initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
 Input shape:
 Arbitrary.
 Output shape:
 Same shape as input.
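The roles of epsilon, gamma, beta and momentum can be illustrated on a single channel with plain Python. This sketch is not MXNet code. The helper names are hypothetical, and the moving-average update shown is the common convention, assumed here rather than taken from the text above.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, epsilon=1e-3):
    """Normalize one channel of a batch, then scale by gamma and shift by beta."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + epsilon) + beta for v in values]

def update_moving(moving, batch_stat, momentum=0.9):
    """Assumed moving-average update for the running mean or variance."""
    return momentum * moving + (1.0 - momentum) * batch_stat

out = batch_norm([1.0, 2.0, 3.0, 4.0])
out_mean = sum(out) / len(out)
out_std = math.sqrt(sum((v - out_mean) ** 2 for v in out) / len(out))
print(out_mean, out_std)  # mean close to 0, standard deviation close to 1
```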

class mxnet.gluon.nn.LeakyReLU(alpha, **kwargs)
Leaky version of a Rectified Linear Unit.
It allows a small gradient when the unit is not active:
`f(x) = alpha * x for x < 0`, `f(x) = x for x >= 0`.
Parameters: alpha (float) – Slope coefficient for the negative half axis. Must be >= 0.
 Input shape:
 Arbitrary.
 Output shape:
 Same shape as input.
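The piecewise definition translates directly into a scalar function. The following is a plain-Python sketch of the formula, not the layer itself; leaky_relu is a hypothetical helper name.

```python
def leaky_relu(x, alpha):
    """f(x) = alpha * x for x < 0, f(x) = x for x >= 0."""
    return x if x >= 0 else alpha * x

activations = [leaky_relu(v, alpha=0.5) for v in (-2.0, 0.0, 3.0)]
print(activations)  # negative inputs are scaled by alpha, the rest pass through
```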

class mxnet.gluon.nn.Embedding(input_dim, output_dim, dtype='float32', weight_initializer=None, **kwargs)
Turns non-negative integers (indexes/tokens) into dense vectors of fixed size, e.g. [[4], [20]] -> [[0.25, 0.1], [0.6, 0.2]]
Parameters:  input_dim (int) – Size of the vocabulary, i.e. maximum integer index + 1.
 output_dim (int) – Dimension of the dense embedding.
 dtype (str or np.dtype, default 'float32') – Data type of output embeddings.
 weight_initializer (Initializer) – Initializer for the embeddings matrix.
 Input shape:
 2D tensor with shape: (N, M).
 Output shape:
 3D tensor with shape: (N, M, output_dim).
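An embedding is a table lookup, which explains the (N, M) to (N, M, output_dim) shape change. Below is a plain-Python sketch with a hypothetical embed helper and a made-up table; the real layer learns the table as its weight.

```python
def embed(indices, table):
    """Replace every integer index with the corresponding row of the table."""
    return [[list(table[i]) for i in row] for row in indices]

# Made-up table with input_dim=5 rows and output_dim=2 columns.
table = [[0.1 * r, 0.2 * r] for r in range(5)]
indices = [[0, 4], [2, 2], [1, 3]]  # shape (N=3, M=2)
out = embed(indices, table)         # shape (N=3, M=2, output_dim=2)
print(len(out), len(out[0]), len(out[0][0]))
```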
Convolutional Layers

class mxnet.gluon.nn.Conv1D(channels, kernel_size, strides=1, padding=0, dilation=1, groups=1, layout='NCW', activation=None, use_bias=True, weight_initializer=None, bias_initializer='zeros', in_channels=0, **kwargs)
1D convolution layer (e.g. temporal convolution).
This layer creates a convolution kernel that is convolved with the layer input over a single spatial (or temporal) dimension to produce a tensor of outputs. If use_bias is True, a bias vector is created and added to the outputs. Finally, if activation is not None, it is applied to the outputs as well.
If in_channels is not specified, Parameter initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
Parameters:  channels (int) – The dimensionality of the output space, i.e. the number of output channels (filters) in the convolution.
 kernel_size (int or tuple/list of 1 int) – Specifies the dimensions of the convolution window.
 strides (int or tuple/list of 1 int) – Specifies the strides of the convolution.
 padding (int or tuple/list of 1 int) – If padding is non-zero, the input is implicitly zero-padded on both sides by padding points.
 dilation (int or tuple/list of 1 int) – Specifies the dilation rate to use for dilated convolution.
 groups (int) – Controls the connections between inputs and outputs. At groups=1, all inputs are convolved to all outputs. At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels, and producing half the output channels, and both subsequently concatenated.
 layout (str, default 'NCW') – Dimension ordering of data and weight. Can be ‘NCW’, ‘NWC’, etc. ‘N’, ‘C’, ‘W’ stands for batch, channel, and width (time) dimensions respectively. Convolution is applied on the ‘W’ dimension.
 in_channels (int, default 0) – The number of input channels to this layer. If not specified, initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
 activation (str) – Activation function to use. See Activation(). If you don’t specify anything, no activation is applied (i.e. “linear” activation: a(x) = x).
 use_bias (bool) – Whether the layer uses a bias vector.
 weight_initializer (str or Initializer) – Initializer for the weights matrix.
 bias_initializer (str or Initializer) – Initializer for the bias vector.
 Input shape:
 This depends on the layout parameter. Input is 3D array of shape (batch_size, in_channels, width) if layout is NCW.
 Output shape:
This depends on the layout parameter. Output is 3D array of shape (batch_size, channels, out_width) if layout is NCW. out_width is calculated as:
out_width = floor((width+2*padding-dilation*(kernel_size-1)-1)/stride)+1
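The out_width formula can be checked quickly in plain Python. The function below is a sketch of the standard convolution output-size arithmetic, with integer floor division standing in for floor; conv_out_size is a hypothetical helper name.

```python
def conv_out_size(size, kernel_size, stride=1, padding=0, dilation=1):
    """floor((size + 2*padding - dilation*(kernel_size - 1) - 1) / stride) + 1"""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

print(conv_out_size(10, 3))                       # no padding: width shrinks
print(conv_out_size(10, 3, padding=1))            # padding of 1 preserves width
print(conv_out_size(7, 3, stride=2, padding=1))   # stride 2 roughly halves it
print(conv_out_size(10, 3, dilation=2))           # dilation widens the window
```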

class mxnet.gluon.nn.Conv2D(channels, kernel_size, strides=(1, 1), padding=(0, 0), dilation=(1, 1), groups=1, layout='NCHW', activation=None, use_bias=True, weight_initializer=None, bias_initializer='zeros', in_channels=0, **kwargs)
2D convolution layer (e.g. spatial convolution over images).
This layer creates a convolution kernel that is convolved with the layer input to produce a tensor of outputs. If use_bias is True, a bias vector is created and added to the outputs. Finally, if activation is not None, it is applied to the outputs as well.
If in_channels is not specified, Parameter initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
Parameters:  channels (int) – The dimensionality of the output space, i.e. the number of output channels (filters) in the convolution.
 kernel_size (int or tuple/list of 2 int) – Specifies the dimensions of the convolution window.
 strides (int or tuple/list of 2 int) – Specifies the strides of the convolution.
 padding (int or tuple/list of 2 int) – If padding is non-zero, the input is implicitly zero-padded on both sides by padding points.
 dilation (int or tuple/list of 2 int) – Specifies the dilation rate to use for dilated convolution.
 groups (int) – Controls the connections between inputs and outputs. At groups=1, all inputs are convolved to all outputs. At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels, and producing half the output channels, and both subsequently concatenated.
 layout (str, default 'NCHW') – Dimension ordering of data and weight. Can be ‘NCHW’, ‘NHWC’, etc. ‘N’, ‘C’, ‘H’, ‘W’ stands for batch, channel, height, and width dimensions respectively. Convolution is applied on the ‘H’ and ‘W’ dimensions.
 in_channels (int, default 0) – The number of input channels to this layer. If not specified, initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
 activation (str) – Activation function to use. See Activation(). If you don’t specify anything, no activation is applied (i.e. “linear” activation: a(x) = x).
 use_bias (bool) – Whether the layer uses a bias vector.
 weight_initializer (str or Initializer) – Initializer for the weights matrix.
 bias_initializer (str or Initializer) – Initializer for the bias vector.
 Input shape:
 This depends on the layout parameter. Input is 4D array of shape (batch_size, in_channels, height, width) if layout is NCHW.
 Output shape:
This depends on the layout parameter. Output is 4D array of shape (batch_size, channels, out_height, out_width) if layout is NCHW.
out_height and out_width are calculated as:
out_height = floor((height+2*padding[0]-dilation[0]*(kernel_size[0]-1)-1)/stride[0])+1
out_width = floor((width+2*padding[1]-dilation[1]*(kernel_size[1]-1)-1)/stride[1])+1
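The effect of groups described above shows up clearly in the number of learnable values. The sketch below counts weights for a 2D convolution under the "side-by-side layers" reading of groups; conv2d_param_count is a hypothetical helper, and the divisibility check is an assumption about valid configurations.

```python
def conv2d_param_count(channels, in_channels, kernel_size, groups=1, use_bias=True):
    """Count weights when each group sees in_channels/groups input channels."""
    if channels % groups or in_channels % groups:
        raise ValueError('channels and in_channels must be divisible by groups')
    kernel_h, kernel_w = kernel_size
    weights = channels * (in_channels // groups) * kernel_h * kernel_w
    return weights + (channels if use_bias else 0)

full = conv2d_param_count(64, 32, (3, 3))
grouped = conv2d_param_count(64, 32, (3, 3), groups=2)
print(full, grouped)  # two groups need half the kernel weights
```

With groups=2 the count matches two independent layers of 32 filters over 16 input channels each, which is the equivalence the groups description states.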

class mxnet.gluon.nn.Conv3D(channels, kernel_size, strides=(1, 1, 1), padding=(0, 0, 0), dilation=(1, 1, 1), groups=1, layout='NCDHW', activation=None, use_bias=True, weight_initializer=None, bias_initializer='zeros', in_channels=0, **kwargs)
3D convolution layer (e.g. spatial convolution over volumes).
This layer creates a convolution kernel that is convolved with the layer input to produce a tensor of outputs. If use_bias is True, a bias vector is created and added to the outputs. Finally, if activation is not None, it is applied to the outputs as well.
If in_channels is not specified, Parameter initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
Parameters:  channels (int) – The dimensionality of the output space, i.e. the number of output channels (filters) in the convolution.
 kernel_size (int or tuple/list of 3 int) – Specifies the dimensions of the convolution window.
 strides (int or tuple/list of 3 int) – Specifies the strides of the convolution.
 padding (int or tuple/list of 3 int) – If padding is non-zero, the input is implicitly zero-padded on both sides by padding points.
 dilation (int or tuple/list of 3 int) – Specifies the dilation rate to use for dilated convolution.
 groups (int) – Controls the connections between inputs and outputs. At groups=1, all inputs are convolved to all outputs. At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels, and producing half the output channels, and both subsequently concatenated.
 layout (str, default 'NCDHW') – Dimension ordering of data and weight. Can be ‘NCDHW’, ‘NDHWC’, etc. ‘N’, ‘C’, ‘D’, ‘H’, ‘W’ stand for batch, channel, depth, height, and width dimensions respectively. Convolution is applied on the ‘D’, ‘H’ and ‘W’ dimensions.
 in_channels (int, default 0) – The number of input channels to this layer. If not specified, initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
 activation (str) – Activation function to use. See Activation(). If you don’t specify anything, no activation is applied (i.e. “linear” activation: a(x) = x).
 use_bias (bool) – Whether the layer uses a bias vector.
 weight_initializer (str or Initializer) – Initializer for the weights matrix.
 bias_initializer (str or Initializer) – Initializer for the bias vector.
 Input shape:
 This depends on the layout parameter. Input is 5D array of shape (batch_size, in_channels, depth, height, width) if layout is NCDHW.
 Output shape:
This depends on the layout parameter. Output is 5D array of shape (batch_size, channels, out_depth, out_height, out_width) if layout is NCDHW.
out_depth, out_height and out_width are calculated as:
out_depth = floor((depth+2*padding[0]-dilation[0]*(kernel_size[0]-1)-1)/stride[0])+1
out_height = floor((height+2*padding[1]-dilation[1]*(kernel_size[1]-1)-1)/stride[1])+1
out_width = floor((width+2*padding[2]-dilation[2]*(kernel_size[2]-1)-1)/stride[2])+1
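For the NCDHW layout, the full output shape follows from applying the same arithmetic to each spatial axis in turn. Here is a plain-Python sketch; conv3d_out_shape is a hypothetical helper, and it assumes every argument is given as a 3-tuple.

```python
def conv3d_out_shape(in_shape, channels, kernel_size, strides=(1, 1, 1),
                     padding=(0, 0, 0), dilation=(1, 1, 1)):
    """Map (batch_size, in_channels, depth, height, width) to the output shape."""
    batch_size = in_shape[0]
    spatial = in_shape[2:]
    out = [
        (size + 2 * p - d * (k - 1) - 1) // s + 1
        for size, k, s, p, d in zip(spatial, kernel_size, strides, padding, dilation)
    ]
    return (batch_size, channels) + tuple(out)

shape = conv3d_out_shape((2, 3, 8, 16, 16), channels=4, kernel_size=(3, 3, 3),
                         strides=(1, 2, 2), padding=(1, 1, 1))
print(shape)  # depth is kept, height and width are halved
```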

class mxnet.gluon.nn.Conv1DTranspose(channels, kernel_size, strides=1, padding=0, output_padding=0, dilation=1, groups=1, layout='NCW', activation=None, use_bias=True, weight_initializer=None, bias_initializer='zeros', in_channels=0, **kwargs)
Transposed 1D convolution layer (sometimes called Deconvolution).
The need for transposed convolutions generally arises from the desire to use a transformation going in the opposite direction of a normal convolution, i.e., from something that has the shape of the output of some convolution to something that has the shape of its input while maintaining a connectivity pattern that is compatible with said convolution.
If in_channels is not specified, Parameter initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
Parameters:  channels (int) – The dimensionality of the output space, i.e. the number of output channels (filters) in the convolution.
 kernel_size (int or tuple/list of 1 int) – Specifies the dimensions of the convolution window.
 strides (int or tuple/list of 1 int) – Specifies the strides of the convolution.
 padding (int or tuple/list of 1 int) – If padding is non-zero, the input is implicitly zero-padded on both sides by padding points.
 dilation (int or tuple/list of 1 int) – Specifies the dilation rate to use for dilated convolution.
 groups (int) – Controls the connections between inputs and outputs. At groups=1, all inputs are convolved to all outputs. At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels, and producing half the output channels, and both subsequently concatenated.
 layout (str, default 'NCW') – Dimension ordering of data and weight. Can be ‘NCW’, ‘NWC’, etc. ‘N’, ‘C’, ‘W’ stands for batch, channel, and width (time) dimensions respectively. Convolution is applied on the ‘W’ dimension.
 in_channels (int, default 0) – The number of input channels to this layer. If not specified, initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
 activation (str) – Activation function to use. See Activation(). If you don’t specify anything, no activation is applied (i.e. “linear” activation: a(x) = x).
 use_bias (bool) – Whether the layer uses a bias vector.
 weight_initializer (str or Initializer) – Initializer for the weights matrix.
 bias_initializer (str or Initializer) – Initializer for the bias vector.
 Input shape:
 This depends on the layout parameter. Input is 3D array of shape (batch_size, in_channels, width) if layout is NCW.
 Output shape:
This depends on the layout parameter. Output is 3D array of shape (batch_size, channels, out_width) if layout is NCW.
out_width is calculated as:
out_width = (width-1)*strides-2*padding+kernel_size+output_padding
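The transposed out_width formula is plain integer arithmetic and can be sketched as follows. conv_transpose_out_size is a hypothetical helper; like the formula above, it does not account for dilation.

```python
def conv_transpose_out_size(size, kernel_size, stride=1, padding=0, output_padding=0):
    """(size - 1) * stride - 2 * padding + kernel_size + output_padding"""
    return (size - 1) * stride - 2 * padding + kernel_size + output_padding

print(conv_transpose_out_size(4, 3))                       # stride 1 grows by kernel_size - 1
print(conv_transpose_out_size(4, 3, stride=2, padding=1))  # stride 2 roughly doubles
print(conv_transpose_out_size(4, 3, stride=2, padding=1, output_padding=1))
```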

class mxnet.gluon.nn.Conv2DTranspose(channels, kernel_size, strides=(1, 1), padding=(0, 0), output_padding=(0, 0), dilation=(1, 1), groups=1, layout='NCHW', activation=None, use_bias=True, weight_initializer=None, bias_initializer='zeros', in_channels=0, **kwargs)
Transposed 2D convolution layer (sometimes called Deconvolution).
The need for transposed convolutions generally arises from the desire to use a transformation going in the opposite direction of a normal convolution, i.e., from something that has the shape of the output of some convolution to something that has the shape of its input while maintaining a connectivity pattern that is compatible with said convolution.
If in_channels is not specified, Parameter initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
Parameters:  channels (int) – The dimensionality of the output space, i.e. the number of output channels (filters) in the convolution.
 kernel_size (int or tuple/list of 2 int) – Specifies the dimensions of the convolution window.
 strides (int or tuple/list of 2 int) – Specifies the strides of the convolution.
 padding (int or tuple/list of 2 int) – If padding is non-zero, the input is implicitly zero-padded on both sides by padding points.
 dilation (int or tuple/list of 2 int) – Specifies the dilation rate to use for dilated convolution.
 groups (int) – Controls the connections between inputs and outputs. At groups=1, all inputs are convolved to all outputs. At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels, and producing half the output channels, and both subsequently concatenated.
 layout (str, default 'NCHW') – Dimension ordering of data and weight. Can be ‘NCHW’, ‘NHWC’, etc. ‘N’, ‘C’, ‘H’, ‘W’ stands for batch, channel, height, and width dimensions respectively. Convolution is applied on the ‘H’ and ‘W’ dimensions.
 in_channels (int, default 0) – The number of input channels to this layer. If not specified, initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
 activation (str) – Activation function to use. See Activation(). If you don’t specify anything, no activation is applied (i.e. “linear” activation: a(x) = x).
 use_bias (bool) – Whether the layer uses a bias vector.
 weight_initializer (str or Initializer) – Initializer for the weights matrix.
 bias_initializer (str or Initializer) – Initializer for the bias vector.
 Input shape:
 This depends on the layout parameter. Input is 4D array of shape (batch_size, in_channels, height, width) if layout is NCHW.
 Output shape:
This depends on the layout parameter. Output is 4D array of shape (batch_size, channels, out_height, out_width) if layout is NCHW.
out_height and out_width are calculated as:
out_height = (height-1)*strides[0]-2*padding[0]+kernel_size[0]+output_padding[0]
out_width = (width-1)*strides[1]-2*padding[1]+kernel_size[1]+output_padding[1]
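A short shape calculation shows why output_padding exists. With stride 2, a forward convolution maps both a height of 7 and a width of 8 to 4, so the transposed layer needs an extra hint to recover the larger size. The helpers below are hypothetical plain-Python sketches of the two shape formulas, with dilation fixed at 1.

```python
def conv2d_out_hw(hw, kernel_size, strides, padding):
    """Forward convolution output size per axis (dilation assumed to be 1)."""
    return tuple((size + 2 * p - (k - 1) - 1) // s + 1
                 for size, k, s, p in zip(hw, kernel_size, strides, padding))

def conv2d_transpose_out_hw(hw, kernel_size, strides, padding, output_padding=(0, 0)):
    """Transposed convolution output size per axis."""
    return tuple((size - 1) * s - 2 * p + k + op
                 for size, k, s, p, op in zip(hw, kernel_size, strides, padding, output_padding))

down = conv2d_out_hw((7, 8), (3, 3), (2, 2), (1, 1))
up_plain = conv2d_transpose_out_hw(down, (3, 3), (2, 2), (1, 1))
up_fixed = conv2d_transpose_out_hw(down, (3, 3), (2, 2), (1, 1), output_padding=(0, 1))
print(down, up_plain, up_fixed)
```

Only the call with output_padding=(0, 1) restores the original (7, 8) size; without it the width comes back as 7.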

class mxnet.gluon.nn.Conv3DTranspose(channels, kernel_size, strides=(1, 1, 1), padding=(0, 0, 0), output_padding=(0, 0, 0), dilation=(1, 1, 1), groups=1, layout='NCDHW', activation=None, use_bias=True, weight_initializer=None, bias_initializer='zeros', in_channels=0, **kwargs)
Transposed 3D convolution layer (sometimes called Deconvolution).
The need for transposed convolutions generally arises from the desire to use a transformation going in the opposite direction of a normal convolution, i.e., from something that has the shape of the output of some convolution to something that has the shape of its input while maintaining a connectivity pattern that is compatible with said convolution.
If in_channels is not specified, Parameter initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
Parameters:  channels (int) – The dimensionality of the output space, i.e. the number of output channels (filters) in the convolution.
 kernel_size (int or tuple/list of 3 int) – Specifies the dimensions of the convolution window.
 strides (int or tuple/list of 3 int) – Specifies the strides of the convolution.
 padding (int or tuple/list of 3 int) – If padding is nonzero, then the input is implicitly zero-padded on both sides for padding number of points.
 dilation (int or tuple/list of 3 int) – Specifies the dilation rate to use for dilated convolution.
 groups (int) – Controls the connections between inputs and outputs. At groups=1, all inputs are convolved to all outputs. At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels, and producing half the output channels, and both subsequently concatenated.
 layout (str, default 'NCDHW') – Dimension ordering of data and weight. Can be ‘NCDHW’, ‘NDHWC’, etc. ‘N’, ‘C’, ‘H’, ‘W’, ‘D’ stands for batch, channel, height, width and depth dimensions respectively. Convolution is applied on the ‘D’, ‘H’, and ‘W’ dimensions.
 in_channels (int, default 0) – The number of input channels to this layer. If not specified, initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
 activation (str) – Activation function to use. See Activation(). If you don’t specify anything, no activation is applied (i.e. “linear” activation: a(x) = x).
 use_bias (bool) – Whether the layer uses a bias vector.
 weight_initializer (str or Initializer) – Initializer for the weights matrix.
 bias_initializer (str or Initializer) – Initializer for the bias vector.
 Input shape:
 This depends on the layout parameter. Input is 5D array of shape (batch_size, in_channels, depth, height, width) if layout is NCDHW.
 Output shape:
This depends on the layout parameter. Output is 5D array of shape (batch_size, channels, out_depth, out_height, out_width) if layout is NCDHW. out_depth, out_height and out_width are calculated as:
out_depth = (depth-1)*strides[0]-2*padding[0]+kernel_size[0]+output_padding[0]
out_height = (height-1)*strides[1]-2*padding[1]+kernel_size[1]+output_padding[1]
out_width = (width-1)*strides[2]-2*padding[2]+kernel_size[2]+output_padding[2]
Pooling Layers¶

class
mxnet.gluon.nn.
MaxPool1D
(pool_size=2, strides=None, padding=0, layout='NCW', ceil_mode=False, **kwargs)¶ Max pooling operation for one dimensional data.
Parameters:  pool_size (int) – Size of the max pooling windows.
 strides (int, or None) – Factor by which to downscale. E.g. 2 will halve the input size. If None, it will default to pool_size.
 padding (int) – If padding is nonzero, then the input is implicitly zero-padded on both sides for padding number of points.
 layout (str, default 'NCW') – Dimension ordering of data and weight. Can be ‘NCW’, ‘NWC’, etc. ‘N’, ‘C’, ‘W’ stands for batch, channel, and width (time) dimensions respectively. Pooling is applied on the W dimension.
 ceil_mode (bool, default False) – When True, will use ceil instead of floor to compute the output shape.
 Input shape:
 This depends on the layout parameter. Input is 3D array of shape (batch_size, channels, width) if layout is NCW.
 Output shape:
This depends on the layout parameter. Output is 3D array of shape (batch_size, channels, out_width) if layout is NCW.
out_width is calculated as:
out_width = floor((width+2*padding-pool_size)/strides)+1
When ceil_mode is True, ceil will be used instead of floor in this equation.
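The effect of ceil_mode is easiest to see numerically. The sketch below evaluates the rule as floor((width + 2*padding - pool_size) / strides) + 1, switching to ceil when requested. `pool_out_size` is a hypothetical helper written for this illustration, not a Gluon function.

```python
import math


def pool_out_size(size, pool_size, stride=None, padding=0, ceil_mode=False):
    """Output length along one pooled axis (hypothetical helper).

    Mirrors the documented rule; strides defaults to pool_size when None.
    """
    if stride is None:
        stride = pool_size
    rounding = math.ceil if ceil_mode else math.floor
    return rounding((size + 2 * padding - pool_size) / stride) + 1


print(pool_out_size(10, pool_size=3, stride=2))                  # 4
print(pool_out_size(10, pool_size=3, stride=2, ceil_mode=True))  # 5
```

The two modes differ only when the window does not tile the padded input exactly, as in this example.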

class
mxnet.gluon.nn.
MaxPool2D
(pool_size=(2, 2), strides=None, padding=0, layout='NCHW', ceil_mode=False, **kwargs)¶ Max pooling operation for two dimensional (spatial) data.
Parameters:  pool_size (int or list/tuple of 2 ints) – Size of the max pooling windows.
 strides (int, list/tuple of 2 ints, or None) – Factor by which to downscale. E.g. 2 will halve the input size. If None, it will default to pool_size.
 padding (int or list/tuple of 2 ints) – If padding is nonzero, then the input is implicitly zero-padded on both sides for padding number of points.
 layout (str, default 'NCHW') – Dimension ordering of data and weight. Can be ‘NCHW’, ‘NHWC’, etc. ‘N’, ‘C’, ‘H’, ‘W’ stand for batch, channel, height, and width dimensions respectively. Pooling is applied on the ‘H’ and ‘W’ dimensions.
 ceil_mode (bool, default False) – When True, will use ceil instead of floor to compute the output shape.
 Input shape:
 This depends on the layout parameter. Input is 4D array of shape (batch_size, channels, height, width) if layout is NCHW.
 Output shape:
This depends on the layout parameter. Output is 4D array of shape (batch_size, channels, out_height, out_width) if layout is NCHW.
out_height and out_width are calculated as:
out_height = floor((height+2*padding[0]-pool_size[0])/strides[0])+1
out_width = floor((width+2*padding[1]-pool_size[1])/strides[1])+1
When ceil_mode is True, ceil will be used instead of floor in this equation.
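To make the operation itself concrete, here is a minimal pure-Python sketch of 2D max pooling over a single channel. It assumes no padding and floor mode, and `max_pool_2d` is an illustrative function, not the Gluon implementation.

```python
def max_pool_2d(image, pool_size=(2, 2), strides=None):
    """Max-pool one channel given as a list of rows (no padding, floor mode)."""
    if strides is None:
        strides = pool_size  # same default as the documented layer
    height, width = len(image), len(image[0])
    out_height = (height - pool_size[0]) // strides[0] + 1
    out_width = (width - pool_size[1]) // strides[1] + 1
    return [
        [
            # Maximum over the window whose top-left corner is (i*s0, j*s1).
            max(
                image[i * strides[0] + di][j * strides[1] + dj]
                for di in range(pool_size[0])
                for dj in range(pool_size[1])
            )
            for j in range(out_width)
        ]
        for i in range(out_height)
    ]


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
pooled = max_pool_2d(image)
print(pooled)  # [[6, 8], [14, 16]]
```

A 4x4 input with the default 2x2 window and stride yields a 2x2 output, matching the output-shape rule with padding 0.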

class
mxnet.gluon.nn.
MaxPool3D
(pool_size=(2, 2, 2), strides=None, padding=0, ceil_mode=False, layout='NCDHW', **kwargs)¶ Max pooling operation for 3D data (spatial or spatiotemporal).
Parameters:  pool_size (int or list/tuple of 3 ints) – Size of the max pooling windows.
 strides (int, list/tuple of 3 ints, or None) – Factor by which to downscale. E.g. 2 will halve the input size. If None, it will default to pool_size.
 padding (int or list/tuple of 3 ints) – If padding is nonzero, then the input is implicitly zero-padded on both sides for padding number of points.
 layout (str, default 'NCDHW') – Dimension ordering of data and weight. Can be ‘NCDHW’, ‘NDHWC’, etc. ‘N’, ‘C’, ‘H’, ‘W’, ‘D’ stand for batch, channel, height, width and depth dimensions respectively. Pooling is applied on the ‘D’, ‘H’ and ‘W’ dimensions.
 ceil_mode (bool, default False) – When True, will use ceil instead of floor to compute the output shape.
 Input shape:
 This depends on the layout parameter. Input is 5D array of shape (batch_size, channels, depth, height, width) if layout is NCDHW.
 Output shape:
This depends on the layout parameter. Output is 5D array of shape (batch_size, channels, out_depth, out_height, out_width) if layout is NCDHW.
out_depth, out_height and out_width are calculated as:
out_depth = floor((depth+2*padding[0]-pool_size[0])/strides[0])+1
out_height = floor((height+2*padding[1]-pool_size[1])/strides[1])+1
out_width = floor((width+2*padding[2]-pool_size[2])/strides[2])+1
When ceil_mode is True, ceil will be used instead of floor in this equation.

class
mxnet.gluon.nn.
AvgPool1D
(pool_size=2, strides=None, padding=0, layout='NCW', ceil_mode=False, **kwargs)¶ Average pooling operation for temporal data.
Parameters:  pool_size (int) – Size of the average pooling windows.
 strides (int, or None) – Factor by which to downscale. E.g. 2 will halve the input size. If None, it will default to pool_size.
 padding (int) – If padding is nonzero, then the input is implicitly zero-padded on both sides for padding number of points.
 layout (str, default 'NCW') – Dimension ordering of data and weight. Can be ‘NCW’, ‘NWC’, etc. ‘N’, ‘C’, ‘W’ stand for batch, channel, and width (time) dimensions respectively. Pooling is applied on the ‘W’ dimension.
 ceil_mode (bool, default False) – When True, will use ceil instead of floor to compute the output shape.
 Input shape:
 This depends on the layout parameter. Input is 3D array of shape (batch_size, channels, width) if layout is NCW.
 Output shape:
This depends on the layout parameter. Output is 3D array of shape (batch_size, channels, out_width) if layout is NCW.
out_width is calculated as:
out_width = floor((width+2*padding-pool_size)/strides)+1
When ceil_mode is True, ceil will be used instead of floor in this equation.

class
mxnet.gluon.nn.
AvgPool2D
(pool_size=(2, 2), strides=None, padding=0, ceil_mode=False, layout='NCHW', **kwargs)¶ Average pooling operation for spatial data.
Parameters:  pool_size (int or list/tuple of 2 ints) – Size of the average pooling windows.
 strides (int, list/tuple of 2 ints, or None) – Factor by which to downscale. E.g. 2 will halve the input size. If None, it will default to pool_size.
 padding (int or list/tuple of 2 ints) – If padding is nonzero, then the input is implicitly zero-padded on both sides for padding number of points.
 layout (str, default 'NCHW') – Dimension ordering of data and weight. Can be ‘NCHW’, ‘NHWC’, etc. ‘N’, ‘C’, ‘H’, ‘W’ stand for batch, channel, height, and width dimensions respectively. Pooling is applied on the ‘H’ and ‘W’ dimensions.
 ceil_mode (bool, default False) – When True, will use ceil instead of floor to compute the output shape.
 Input shape:
 This depends on the layout parameter. Input is 4D array of shape (batch_size, channels, height, width) if layout is NCHW.
 Output shape:
This depends on the layout parameter. Output is 4D array of shape (batch_size, channels, out_height, out_width) if layout is NCHW.
out_height and out_width are calculated as:
out_height = floor((height+2*padding[0]-pool_size[0])/strides[0])+1
out_width = floor((width+2*padding[1]-pool_size[1])/strides[1])+1
When ceil_mode is True, ceil will be used instead of floor in this equation.

class
mxnet.gluon.nn.
AvgPool3D
(pool_size=(2, 2, 2), strides=None, padding=0, ceil_mode=False, layout='NCDHW', **kwargs)¶ Average pooling operation for 3D data (spatial or spatiotemporal).
Parameters:  pool_size (int or list/tuple of 3 ints) – Size of the average pooling windows.
 strides (int, list/tuple of 3 ints, or None) – Factor by which to downscale. E.g. 2 will halve the input size. If None, it will default to pool_size.
 padding (int or list/tuple of 3 ints) – If padding is nonzero, then the input is implicitly zero-padded on both sides for padding number of points.
 layout (str, default 'NCDHW') – Dimension ordering of data and weight. Can be ‘NCDHW’, ‘NDHWC’, etc. ‘N’, ‘C’, ‘H’, ‘W’, ‘D’ stand for batch, channel, height, width and depth dimensions respectively. Pooling is applied on the ‘D’, ‘H’ and ‘W’ dimensions.
 ceil_mode (bool, default False) – When True, will use ceil instead of floor to compute the output shape.
 Input shape:
 This depends on the layout parameter. Input is 5D array of shape (batch_size, channels, depth, height, width) if layout is NCDHW.
 Output shape:
This depends on the layout parameter. Output is 5D array of shape (batch_size, channels, out_depth, out_height, out_width) if layout is NCDHW.
out_depth, out_height and out_width are calculated as:
out_depth = floor((depth+2*padding[0]-pool_size[0])/strides[0])+1
out_height = floor((height+2*padding[1]-pool_size[1])/strides[1])+1
out_width = floor((width+2*padding[2]-pool_size[2])/strides[2])+1
When ceil_mode is True, ceil will be used instead of floor in this equation.

class
mxnet.gluon.nn.
GlobalMaxPool1D
(layout='NCW', **kwargs)¶ Global max pooling operation for temporal data.

class
mxnet.gluon.nn.
GlobalMaxPool2D
(layout='NCHW', **kwargs)¶ Global max pooling operation for spatial data.

class
mxnet.gluon.nn.
GlobalMaxPool3D
(layout='NCDHW', **kwargs)¶ Global max pooling operation for 3D data.

class
mxnet.gluon.nn.
GlobalAvgPool1D
(layout='NCW', **kwargs)¶ Global average pooling operation for temporal data.

class
mxnet.gluon.nn.
GlobalAvgPool2D
(layout='NCHW', **kwargs)¶ Global average pooling operation for spatial data.

class
mxnet.gluon.nn.
GlobalAvgPool3D
(layout='NCDHW', **kwargs)¶ Global average pooling operation for 3D data.
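Global pooling reduces every spatial position of a channel to a single value. The sketch below shows the idea for one 2D channel in plain Python. The function names are illustrative and not part of Gluon, and the batch and channel axes are left out for brevity.

```python
def global_avg_pool_2d(feature_map):
    """Average over every spatial position of one channel (list of rows)."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


def global_max_pool_2d(feature_map):
    """Maximum over every spatial position of one channel (list of rows)."""
    return max(v for row in feature_map for v in row)


channel = [[1.0, 2.0], [3.0, 6.0]]
print(global_avg_pool_2d(channel))  # 3.0
print(global_max_pool_2d(channel))  # 6.0
```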
Recurrent Layers¶

class
mxnet.gluon.rnn.
RecurrentCell
(prefix=None, params=None)¶ Abstract base class for RNN cells
Parameters:  prefix (str, optional) – Prefix for names of `Block`s (this prefix is also used for names of weights if `params` is `None`, i.e. if params are being created and not reused).
 params (Parameter or None, optional) – Container for weight sharing between cells. A new Parameter container is created if params is None.

__call__
(*args)¶ Calls forward. Only accepts positional arguments.

reset
()¶ Reset before reusing the cell for another graph.

state_info
(batch_size=0)¶ shape and layout information of states

begin_state
(batch_size=0, func=<function zeros>, **kwargs)¶ Initial state for this cell.
Parameters:  func (callable, default symbol.zeros) –
Function for creating initial state.
For Symbol API, func can be symbol.zeros, symbol.uniform, symbol.var etc. Use symbol.var if you want to directly feed input as states.
For NDArray API, func can be ndarray.zeros, ndarray.ones, etc.
 batch_size (int, default 0) – Only required for NDArray API. Size of the batch (‘N’ in layout) dimension of input.
 **kwargs –
Additional keyword arguments passed to func. For example mean, std, dtype, etc.
Returns: states – Starting states for the first RNN step.
Return type: nested list of Symbol

unroll
(length, inputs, begin_state=None, layout='NTC', merge_outputs=None)¶ Unrolls an RNN cell across time steps.
Parameters:  length (int) – Number of steps to unroll.
 inputs (Symbol, list of Symbol, or None) –
If inputs is a single Symbol (usually the output of Embedding symbol), it should have shape (batch_size, length, ...) if layout is ‘NTC’, or (length, batch_size, ...) if layout is ‘TNC’.
If inputs is a list of symbols (usually output of previous unroll), they should all have shape (batch_size, ...).
 begin_state (nested list of Symbol, optional) – Input states created by begin_state() or output state of another cell. Created from begin_state() if None.
 layout (str, optional) – layout of input symbol. Only used if inputs is a single Symbol.
 merge_outputs (bool, optional) – If False, returns outputs as a list of Symbols. If True, concatenates output across time steps and returns a single symbol with shape (batch_size, length, ...) if layout is ‘NTC’, or (length, batch_size, ...) if layout is ‘TNC’. If None, output whatever is faster.
Returns:  outputs (list of Symbol or Symbol) – Symbol (if merge_outputs is True) or list of Symbols (if merge_outputs is False) corresponding to the output from the RNN from this unrolling.
 states (list of Symbol) – The new state of this RNN after this unrolling. The type of this symbol is same as the output of begin_state().

forward
(inputs, states)¶ Unrolls the recurrent cell for one time step.
Parameters:  inputs (sym.Variable) – Input symbol, 2D, of shape (batch_size * num_units).
 states (list of sym.Variable) – RNN state from previous step or the output of begin_state().
Returns:  output (Symbol) – Symbol corresponding to the output from the RNN when unrolling for a single time step.
 states (list of Symbol) – The new state of this RNN after this unrolling. The type of this symbol is same as the output of begin_state(). This can be used as an input state to the next time step of this RNN.
See also
begin_state()
 This function can provide the states for the first time step.
unroll()
 This function unrolls an RNN for a given number of (>=1) time steps.
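The relationship between begin_state(), forward() and unroll() can be sketched without MXNet. Below, `ToyCell` is a hypothetical stand-in whose state is a running sum, and `unroll` is a simplified loop written for this illustration. It ignores layout and merge_outputs and is not the Gluon implementation.

```python
class ToyCell:
    """Hypothetical cell: its single state is the running sum of the inputs."""

    def begin_state(self):
        return [0.0]

    def forward(self, inputs, states):
        new_state = states[0] + inputs
        # Return the step output and the list of new states.
        return new_state, [new_state]


def unroll(cell, length, inputs, begin_state=None):
    """Call cell.forward once per time step, threading the states through."""
    states = cell.begin_state() if begin_state is None else begin_state
    outputs = []
    for t in range(length):
        output, states = cell.forward(inputs[t], states)
        outputs.append(output)
    return outputs, states


outputs, states = unroll(ToyCell(), 3, [1.0, 2.0, 3.0])
print(outputs, states)  # [1.0, 3.0, 6.0] [6.0]
```

The states returned by one unrolling can be passed as begin_state to the next, which is how a long sequence can be processed in chunks.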

class
mxnet.gluon.rnn.
RNN
(hidden_size, num_layers=1, activation='relu', layout='TNC', dropout=0, bidirectional=False, i2h_weight_initializer=None, h2h_weight_initializer=None, i2h_bias_initializer='zeros', h2h_bias_initializer='zeros', input_size=0, **kwargs)¶ Applies a multi-layer Elman RNN with tanh or ReLU nonlinearity to an input sequence.
For each element in the input sequence, each layer computes the following function:
\[h_t = \tanh(w_{ih} * x_t + b_{ih} + w_{hh} * h_{(t-1)} + b_{hh})\]where \(h_t\) is the hidden state at time t, and \(x_t\) is the hidden state of the previous layer at time t or \(input_t\) for the first layer. If activation=’relu’, then ReLU is used instead of tanh.
Parameters:  hidden_size (int) – The number of features in the hidden state h.
 num_layers (int, default 1) – Number of recurrent layers.
 activation ({'relu' or 'tanh'}, default 'relu') – The activation function to use.
 layout (str, default 'TNC') – The format of input and output tensors. T, N and C stand for sequence length, batch size, and feature dimensions respectively.
 dropout (float, default 0) – If nonzero, introduces a dropout layer on the outputs of each RNN layer except the last layer.
 bidirectional (bool, default False) – If True, becomes a bidirectional RNN.
 i2h_weight_initializer (str or Initializer) – Initializer for the input weights matrix, used for the linear transformation of the inputs.
 h2h_weight_initializer (str or Initializer) – Initializer for the recurrent weights matrix, used for the linear transformation of the recurrent state.
 i2h_bias_initializer (str or Initializer) – Initializer for the bias vector.
 h2h_bias_initializer (str or Initializer) – Initializer for the bias vector.
 input_size (int, default 0) – The number of expected features in the input x. If not specified, it will be inferred from input.
 prefix (str or None) – Prefix of this Block.
 params (ParameterDict or None) – Shared Parameters for this Block.
 Input shapes:
 The input shape depends on layout. For layout=’TNC’, the input has shape (sequence_length, batch_size, input_size)
 Output shape:
 The output shape depends on layout. For layout=’TNC’, the output has shape (sequence_length, batch_size, num_hidden). If bidirectional is True, output shape will instead be (sequence_length, batch_size, 2*num_hidden)
 Recurrent state shape:
 The recurrent state’s shape is (num_layers, batch_size, num_hidden). If bidirectional is True, state shape will instead be (num_layers, batch_size, 2*num_hidden)
Examples
>>> layer = mx.gluon.rnn.RNN(100, 3)
>>> layer.initialize()
>>> input = mx.nd.random_uniform(shape=(5, 3, 10))
>>> h0 = mx.nd.random_uniform(shape=(3, 3, 100))
>>> output, hn = layer(input, h0)
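The per-step recurrence can also be followed by hand. This is a minimal scalar sketch of h_t = tanh(w_ih * x_t + b_ih + w_hh * h_(t-1) + b_hh) with one hidden unit and made-up weights. `rnn_step` and `relu` are illustrative helpers, not Gluon functions.

```python
import math


def relu(v):
    return max(0.0, v)


def rnn_step(x_t, h_prev, w_ih, b_ih, w_hh, b_hh, activation=math.tanh):
    """One Elman RNN step for a single hidden unit (scalar sketch)."""
    return activation(w_ih * x_t + b_ih + w_hh * h_prev + b_hh)


# Run three time steps with assumed weights, starting from a zero state.
h = 0.0
for x_t in [1.0, 0.5, -0.25]:
    h = rnn_step(x_t, h, w_ih=0.5, b_ih=0.0, w_hh=0.1, b_hh=0.0)
print(h)
```

Passing `activation=relu` to `rnn_step` gives the ReLU variant of the same recurrence.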

class
mxnet.gluon.rnn.
LSTM
(hidden_size, num_layers=1, layout='TNC', dropout=0, bidirectional=False, input_size=0, i2h_weight_initializer=None, h2h_weight_initializer=None, i2h_bias_initializer='zeros', h2h_bias_initializer='zeros', **kwargs)¶ Applies a multi-layer long short-term memory (LSTM) RNN to an input sequence.
For each element in the input sequence, each layer computes the following function:
\[\begin{split}\begin{array}{ll} i_t = sigmoid(W_{ii} x_t + b_{ii} + W_{hi} h_{(t-1)} + b_{hi}) \\ f_t = sigmoid(W_{if} x_t + b_{if} + W_{hf} h_{(t-1)} + b_{hf}) \\ g_t = \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{(t-1)} + b_{hg}) \\ o_t = sigmoid(W_{io} x_t + b_{io} + W_{ho} h_{(t-1)} + b_{ho}) \\ c_t = f_t * c_{(t-1)} + i_t * g_t \\ h_t = o_t * \tanh(c_t) \end{array}\end{split}\]where \(h_t\) is the hidden state at time t, \(c_t\) is the cell state at time t, \(x_t\) is the hidden state of the previous layer at time t or \(input_t\) for the first layer, and \(i_t\), \(f_t\), \(g_t\), \(o_t\) are the input, forget, cell, and out gates, respectively.
Parameters:  hidden_size (int) – The number of features in the hidden state h.
 num_layers (int, default 1) – Number of recurrent layers.
 layout (str, default 'TNC') – The format of input and output tensors. T, N and C stand for sequence length, batch size, and feature dimensions respectively.
 dropout (float, default 0) – If nonzero, introduces a dropout layer on the outputs of each RNN layer except the last layer.
 bidirectional (bool, default False) – If True, becomes a bidirectional RNN.
 i2h_weight_initializer (str or Initializer) – Initializer for the input weights matrix, used for the linear transformation of the inputs.
 h2h_weight_initializer (str or Initializer) – Initializer for the recurrent weights matrix, used for the linear transformation of the recurrent state.
 i2h_bias_initializer (str or Initializer, default 'lstmbias') – Initializer for the bias vector. By default, bias for the forget gate is initialized to 1 while all other biases are initialized to zero.
 h2h_bias_initializer (str or Initializer) – Initializer for the bias vector.
 input_size (int, default 0) – The number of expected features in the input x. If not specified, it will be inferred from input.
 prefix (str or None) – Prefix of this Block.
 params (ParameterDict or None) – Shared Parameters for this Block.
 Input shapes:
 The input shape depends on layout. For layout=’TNC’, the input has shape (sequence_length, batch_size, input_size)
 Output shape:
 The output shape depends on layout. For layout=’TNC’, the output has shape (sequence_length, batch_size, num_hidden). If bidirectional is True, output shape will instead be (sequence_length, batch_size, 2*num_hidden)
 Recurrent state shape:
 The recurrent state is a list of two NDArrays. Both have shape (num_layers, batch_size, num_hidden). If bidirectional is True, state shape will instead be (num_layers, batch_size, 2*num_hidden).
Examples
>>> layer = mx.gluon.rnn.LSTM(100, 3)
>>> layer.initialize()
>>> input = mx.nd.random_uniform(shape=(5, 3, 10))
>>> h0 = mx.nd.random_uniform(shape=(3, 3, 100))
>>> c0 = mx.nd.random_uniform(shape=(3, 3, 100))
>>> output, hn = layer(input, [h0, c0])
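For readers who want to trace the gate equations, the following scalar sketch evaluates one LSTM step with a single hidden unit. The dictionary keys ('ii', 'hi', and so on) are this sketch's own naming for the weights and biases, and `lstm_step` is a hypothetical helper, not the Gluon or cuDNN implementation.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step for a single hidden unit (scalar sketch)."""
    i_t = sigmoid(W['ii'] * x_t + b['ii'] + W['hi'] * h_prev + b['hi'])    # input gate
    f_t = sigmoid(W['if'] * x_t + b['if'] + W['hf'] * h_prev + b['hf'])    # forget gate
    g_t = math.tanh(W['ig'] * x_t + b['ig'] + W['hg'] * h_prev + b['hg'])  # cell candidate
    o_t = sigmoid(W['io'] * x_t + b['io'] + W['ho'] * h_prev + b['ho'])    # output gate
    c_t = f_t * c_prev + i_t * g_t
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t


keys = ['ii', 'hi', 'if', 'hf', 'ig', 'hg', 'io', 'ho']
zeros = dict.fromkeys(keys, 0.0)
# With all-zero weights every gate is 0.5 and the candidate is 0,
# so the cell state is simply halved.
h_t, c_t = lstm_step(x_t=1.0, h_prev=0.0, c_prev=2.0, W=zeros, b=zeros)
print(h_t, c_t)
```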

class
mxnet.gluon.rnn.
GRU
(hidden_size, num_layers=1, layout='TNC', dropout=0, bidirectional=False, input_size=0, i2h_weight_initializer=None, h2h_weight_initializer=None, i2h_bias_initializer='zeros', h2h_bias_initializer='zeros', **kwargs)¶ Applies a multi-layer gated recurrent unit (GRU) RNN to an input sequence.
For each element in the input sequence, each layer computes the following function:
\[\begin{split}\begin{array}{ll} r_t = sigmoid(W_{ir} x_t + b_{ir} + W_{hr} h_{(t-1)} + b_{hr}) \\ i_t = sigmoid(W_{ii} x_t + b_{ii} + W_{hi} h_{(t-1)} + b_{hi}) \\ n_t = \tanh(W_{in} x_t + b_{in} + r_t * (W_{hn} h_{(t-1)}+ b_{hn})) \\ h_t = (1 - i_t) * n_t + i_t * h_{(t-1)} \\ \end{array}\end{split}\]where \(h_t\) is the hidden state at time t, \(x_t\) is the hidden state of the previous layer at time t or \(input_t\) for the first layer, and \(r_t\), \(i_t\), \(n_t\) are the reset, input, and new gates, respectively.
Parameters:  hidden_size (int) – The number of features in the hidden state h
 num_layers (int, default 1) – Number of recurrent layers.
 layout (str, default 'TNC') – The format of input and output tensors. T, N and C stand for sequence length, batch size, and feature dimensions respectively.
 dropout (float, default 0) – If nonzero, introduces a dropout layer on the outputs of each RNN layer except the last layer
 bidirectional (bool, default False) – If True, becomes a bidirectional RNN.
 i2h_weight_initializer (str or Initializer) – Initializer for the input weights matrix, used for the linear transformation of the inputs.
 h2h_weight_initializer (str or Initializer) – Initializer for the recurrent weights matrix, used for the linear transformation of the recurrent state.
 i2h_bias_initializer (str or Initializer) – Initializer for the bias vector.
 h2h_bias_initializer (str or Initializer) – Initializer for the bias vector.
 input_size (int, default 0) – The number of expected features in the input x. If not specified, it will be inferred from input.
 prefix (str or None) – Prefix of this Block.
 params (ParameterDict or None) – Shared Parameters for this Block.
 Input shapes:
 The input shape depends on layout. For layout=’TNC’, the input has shape (sequence_length, batch_size, input_size)
 Output shape:
 The output shape depends on layout. For layout=’TNC’, the output has shape (sequence_length, batch_size, num_hidden). If bidirectional is True, output shape will instead be (sequence_length, batch_size, 2*num_hidden)
 Recurrent state shape:
 The recurrent state’s shape is (num_layers, batch_size, num_hidden). If bidirectional is True, state shape will instead be (num_layers, batch_size, 2*num_hidden)
Examples
>>> layer = mx.gluon.rnn.GRU(100, 3)
>>> layer.initialize()
>>> input = mx.nd.random_uniform(shape=(5, 3, 10))
>>> h0 = mx.nd.random_uniform(shape=(3, 3, 100))
>>> output, hn = layer(input, h0)
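The GRU equations can be traced the same way. This scalar sketch evaluates one step for a single hidden unit, reading the final update as h_t = (1 - i_t) * n_t + i_t * h_(t-1). The dictionary keys and the helper `gru_step` are assumptions made for illustration, not part of Gluon.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(x_t, h_prev, W, b):
    """One GRU step for a single hidden unit (scalar sketch)."""
    r_t = sigmoid(W['ir'] * x_t + b['ir'] + W['hr'] * h_prev + b['hr'])  # reset gate
    i_t = sigmoid(W['ii'] * x_t + b['ii'] + W['hi'] * h_prev + b['hi'])  # input gate
    # The reset gate scales the recurrent contribution to the new gate.
    n_t = math.tanh(W['in'] * x_t + b['in'] + r_t * (W['hn'] * h_prev + b['hn']))
    return (1.0 - i_t) * n_t + i_t * h_prev


keys = ['ir', 'hr', 'ii', 'hi', 'in', 'hn']
zeros = dict.fromkeys(keys, 0.0)
# With all-zero weights: i_t = 0.5 and n_t = 0, so the state is halved.
h_t = gru_step(x_t=1.0, h_prev=2.0, W=zeros, b=zeros)
print(h_t)
```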

class
mxnet.gluon.rnn.
RNNCell
(hidden_size, activation='tanh', i2h_weight_initializer=None, h2h_weight_initializer=None, i2h_bias_initializer='zeros', h2h_bias_initializer='zeros', input_size=0, prefix=None, params=None)¶ Simple recurrent neural network cell.
Parameters:  hidden_size (int) – Number of units in output symbol
 activation (str or Symbol, default 'tanh') – Type of activation function.
 i2h_weight_initializer (str or Initializer) – Initializer for the input weights matrix, used for the linear transformation of the inputs.
 h2h_weight_initializer (str or Initializer) – Initializer for the recurrent weights matrix, used for the linear transformation of the recurrent state.
 i2h_bias_initializer (str or Initializer) – Initializer for the bias vector.
 h2h_bias_initializer (str or Initializer) – Initializer for the bias vector.
 prefix (str, default 'rnn_') – Prefix for name of `Block`s (and name of weight if `params` is `None`).
 params (Parameter or None) – Container for weight sharing between cells. Created if None.

class
mxnet.gluon.rnn.
LSTMCell
(hidden_size, i2h_weight_initializer=None, h2h_weight_initializer=None, i2h_bias_initializer='zeros', h2h_bias_initializer='zeros', input_size=0, prefix=None, params=None)¶ Long Short-Term Memory (LSTM) network cell.
Parameters:  hidden_size (int) – Number of units in output symbol.
 i2h_weight_initializer (str or Initializer) – Initializer for the input weights matrix, used for the linear transformation of the inputs.
 h2h_weight_initializer (str or Initializer) – Initializer for the recurrent weights matrix, used for the linear transformation of the recurrent state.
 i2h_bias_initializer (str or Initializer, default 'lstmbias') – Initializer for the bias vector. By default, bias for the forget gate is initialized to 1 while all other biases are initialized to zero.
 h2h_bias_initializer (str or Initializer) – Initializer for the bias vector.
 prefix (str, default 'lstm_') – Prefix for name of `Block`s (and name of weight if `params` is `None`).
 params (Parameter or None) – Container for weight sharing between cells. Created if None.

class
mxnet.gluon.rnn.
GRUCell
(hidden_size, i2h_weight_initializer=None, h2h_weight_initializer=None, i2h_bias_initializer='zeros', h2h_bias_initializer='zeros', input_size=0, prefix=None, params=None)¶ Gated Recurrent Unit (GRU) network cell. Note: this is an implementation of the cuDNN version of GRUs (slight modification compared to Cho et al. 2014).
Parameters:  hidden_size (int) – Number of units in output symbol.
 i2h_weight_initializer (str or Initializer) – Initializer for the input weights matrix, used for the linear transformation of the inputs.
 h2h_weight_initializer (str or Initializer) – Initializer for the recurrent weights matrix, used for the linear transformation of the recurrent state.
 i2h_bias_initializer (str or Initializer) – Initializer for the bias vector.
 h2h_bias_initializer (str or Initializer) – Initializer for the bias vector.
 prefix (str, default 'gru_') – Prefix for name of `Block`s (and name of weight if `params` is `None`).
 params (Parameter or None) – Container for weight sharing between cells. Created if None.

class
mxnet.gluon.rnn.
SequentialRNNCell
(prefix=None, params=None)¶ Sequentially stacking multiple RNN cells.

add
(cell)¶ Appends a cell into the stack.
Parameters: cell (rnn cell) –


class
mxnet.gluon.rnn.
BidirectionalCell
(l_cell, r_cell, output_prefix='bi_')¶ Bidirectional RNN cell.
Parameters:  l_cell (RecurrentCell) – Cell for forward unrolling
 r_cell (RecurrentCell) – Cell for backward unrolling

class
mxnet.gluon.rnn.
DropoutCell
(dropout, prefix=None, params=None)¶ Applies dropout on input.
Parameters: dropout (float) – Percentage of elements to drop out, which is 1 - percentage to retain.

class
mxnet.gluon.rnn.
ZoneoutCell
(base_cell, zoneout_outputs=0.0, zoneout_states=0.0)¶ Applies Zoneout on base cell.

class
mxnet.gluon.rnn.
ResidualCell
(base_cell)¶ Adds residual connection as described in Wu et al, 2016 (https://arxiv.org/abs/1609.08144). Output of the cell is output of the base cell plus input.
Trainer¶

class
mxnet.gluon.
Trainer
(params, optimizer, optimizer_params, kvstore='device')¶ Applies an Optimizer on a set of Parameters. Trainer should be used together with autograd.
Parameters:  params (ParameterDict) – The set of parameters to optimize.
 optimizer (str or Optimizer) – The optimizer to use.
 optimizer_params (dict) – Keyword arguments to be passed to optimizer constructor. For example, {‘learning_rate’: 0.1}
 kvstore (str or KVStore) – kvstore type for multi-GPU and distributed training.

step
(batch_size, ignore_stale_grad=False)¶ Makes one step of parameter update. Should be called after autograd.compute_gradient and outside of record() scope.
Parameters:  batch_size (int) – Batch size of data processed. Gradient will be normalized by 1/batch_size. Set this to 1 if you normalized loss manually with loss = mean(loss).
 ignore_stale_grad (bool, optional, default=False) – If true, ignores Parameters with stale gradient (gradient that has not been updated by backward after last step) and skip update.
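The role of batch_size in step can be illustrated with a plain SGD update in which the gradient is scaled by 1/batch_size before being applied. This is a sketch of the normalization only. `sgd_step` is a hypothetical function, plain SGD is an assumed optimizer, and the real Trainer delegates the update to the configured Optimizer.

```python
def sgd_step(weights, grads, learning_rate, batch_size):
    """Plain SGD update with gradients normalized by 1/batch_size (sketch)."""
    scale = 1.0 / batch_size
    return [w - learning_rate * scale * g for w, g in zip(weights, grads)]


# Assumed values: gradients summed over a batch of 4 samples.
weights = [1.0, 2.0]
summed_grads = [4.0, 8.0]
updated = sgd_step(weights, summed_grads, learning_rate=0.5, batch_size=4)
print(updated)  # [0.5, 1.0]
```

If the loss has already been averaged over the batch, passing batch_size=1 avoids dividing by the batch size a second time.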
Loss functions¶

class
mxnet.gluon.loss.
L2Loss
(weight=1.0, batch_axis=0, **kwargs)¶ Calculates the mean squared error between output and label:
\[L = \frac{1}{2}\sum_i \Vert {output}_i - {label}_i \Vert^2.\]Output and label can have arbitrary shape as long as they have the same number of elements.
Parameters:  weight (float or None) – Global scalar weight for loss.
 sample_weight (Symbol or None) – Per sample weighting. Must be broadcastable to the same shape as loss. For example, if loss has shape (64, 10) and you want to weight each sample in the batch, sample_weight should have shape (64, 1).
 batch_axis (int, default 0) – The axis that represents minibatch.

class
mxnet.gluon.loss.
L1Loss
(weight=None, batch_axis=0, **kwargs)¶ Calculates the mean absolute error between output and label:
\[L = \frac{1}{2}\sum_i \vert {output}_i - {label}_i \vert.\]Output and label must have the same shape.
Parameters:  weight (float or None) – Global scalar weight for loss.
 sample_weight (Symbol or None) – Per sample weighting. Must be broadcastable to the same shape as loss. For example, if loss has shape (64, 10) and you want to weight each sample in the batch, sample_weight should have shape (64, 1).
 batch_axis (int, default 0) – The axis that represents minibatch.

class
mxnet.gluon.loss.
SoftmaxCrossEntropyLoss
(axis=1, sparse_label=True, from_logits=False, weight=None, batch_axis=0, **kwargs)¶ Computes the softmax cross entropy loss.
If sparse_label is True, label should contain integer category indicators:
\[p = {softmax}({output})\]\[L = -\sum_i {log}(p_{i,{label}_i})\]Label’s shape should be output’s shape without the axis dimension, i.e. for output.shape = (1,2,3,4) and axis = 2, label.shape should be (1,2,4).
If sparse_label is False, label should contain probability distribution with the same shape as output:
\[p = {softmax}({output})\]\[L = -\sum_i \sum_j {label}_j {log}(p_{ij})\]
Parameters:  axis (int, default 1) – The axis to sum over when computing softmax and entropy.
 sparse_label (bool, default True) – Whether label is an integer array instead of probability distribution.
 from_logits (bool, default False) – Whether input is a log probability (usually from log_softmax) instead of unnormalized numbers.
 weight (float or None) – Global scalar weight for loss.
 sample_weight (Symbol or None) – Per sample weighting. Must be broadcastable to the same shape as loss. For example, if loss has shape (64, 10) and you want to weight each sample in the batch, sample_weight should have shape (64, 1).
 batch_axis (int, default 0) – The axis that represents minibatch.
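The two label formats can be compared on a single sample. The sketch below takes the loss as the negative log-probability of the true class (sparse label) or the negative label-weighted sum of log-probabilities (dense label). The function names are illustrative, and per-sample weighting and batching are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of unnormalized scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_cross_entropy(scores, label, sparse_label=True):
    """Cross entropy for one sample (sketch, not the Gluon implementation)."""
    p = softmax(scores)
    if sparse_label:
        # label is an integer class index
        return -math.log(p[label])
    # label is a probability distribution with the same length as scores
    return -sum(l * math.log(q) for l, q in zip(label, p))


scores = [1.0, 2.0, 3.0]
sparse = softmax_cross_entropy(scores, 2)
dense = softmax_cross_entropy(scores, [0.0, 0.0, 1.0], sparse_label=False)
print(sparse, dense)  # the two values agree for a one-hot label
```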

class
mxnet.gluon.loss.
KLDivLoss
(from_logits=True, weight=None, batch_axis=0, **kwargs)¶ The Kullback-Leibler divergence loss.
KL divergence is a useful distance measure for continuous distributions and is often useful when performing direct regression over the space of (discretely sampled) continuous output distributions.
\[L = \frac{1}{n} \sum_i {label}_i \left( {log}({label}_i) - {output}_i \right)\]
Label’s shape should be the same as output’s.
Parameters:  from_logits (bool, default is True) – Whether the input is log probability (usually from log_softmax) instead of unnormalized numbers.
 weight (float or None) – Global scalar weight for loss.
 sample_weight (Symbol or None) – Per sample weighting. Must be broadcastable to the same shape as loss. For example, if loss has shape (64, 10) and you want to weight each sample in the batch, sample_weight should have shape (64, 1).
 batch_axis (int, default 0) – The axis that represents minibatch.
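As a worked illustration of the formula above, the following pure-Python sketch evaluates the loss for a single distribution. The names kl_div_loss and log_softmax are hypothetical stand-ins, not the library's code. Skipping terms whose label is zero is an assumption of this sketch (the limit of x log x as x goes to 0), and n is taken to be the number of elements.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one row of scores."""
    m = max(row)
    log_total = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_total for v in row]


def kl_div_loss(output, label, from_logits=True):
    """L = 1/n * sum_i label_i * (log(label_i) - output_i).

    With from_logits=True, output is expected to hold log probabilities
    already; otherwise log_softmax is applied to it first.
    """
    if not from_logits:
        output = log_softmax(output)
    n = len(label)
    # Assumption: entries with label_i == 0 contribute nothing.
    return sum(l * (math.log(l) - o)
               for l, o in zip(label, output) if l > 0) / n


label = [0.5, 0.5]
log_pred = [math.log(0.25), math.log(0.75)]

loss = kl_div_loss(log_pred, label)
same = kl_div_loss([math.log(0.5), math.log(0.5)], label)
print(loss, same)
```

The loss is zero when the predicted log probabilities match the label distribution and positive otherwise, which is the behaviour expected of a divergence.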
Utilities¶

utils.
split_data
(data, num_slice, batch_axis=0, even_split=True)¶ Splits an NDArray into num_slice slices along batch_axis. Usually used for data parallelism where each slice is sent to one device (e.g., a GPU).
Parameters:  data (NDArray) – A batch of data.
 num_slice (int) – Number of desired slices.
 batch_axis (int, default 0) – The axis along which to slice.
 even_split (bool, default True) – Whether to force all slices to have the same number of elements. If True, an error will be raised when num_slice does not evenly divide data.shape[batch_axis].
Returns: Return value is a list even if num_slice is 1.
Return type: list of NDArray
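The slicing contract described above can be mimicked on a plain Python list, which stands in for the batch axis of an NDArray. The helper split_rows is hypothetical and only follows the documented behaviour: a list is always returned, and even_split=True raises an error when the batch does not divide evenly. How leftover samples are distributed when even_split=False is an assumption of this sketch (the last slice takes the remainder).

```python
def split_rows(data, num_slice, even_split=True):
    """Split a list of samples into num_slice consecutive slices."""
    size = len(data)
    if num_slice < 1 or size < num_slice:
        raise ValueError(
            "cannot split %d samples into %d slices" % (size, num_slice))
    if even_split and size % num_slice != 0:
        raise ValueError(
            "%d samples cannot be evenly split into %d slices; "
            "pass even_split=False to allow uneven slices" % (size, num_slice))
    step = size // num_slice
    slices = [data[i * step:(i + 1) * step] for i in range(num_slice - 1)]
    # Assumption: the final slice absorbs any remainder.
    slices.append(data[(num_slice - 1) * step:])
    return slices


batch = list(range(8))
print(split_rows(batch, 4))                    # four slices of two samples
print(split_rows(batch, 1))                    # still a list, with one slice
print(split_rows(batch, 3, even_split=False))  # uneven slices allowed
```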

utils.
split_and_load
(data, ctx_list, batch_axis=0, even_split=True)¶ Splits an NDArray into len(ctx_list) slices along batch_axis and loads each slice to one context in ctx_list.
Parameters:  data (NDArray) – A batch of data.
 ctx_list (list of Context) – A list of Contexts.
 batch_axis (int, default 0) – The axis along which to slice.
 even_split (bool, default True) – Whether to force all slices to have the same number of elements.
Returns: A list of slices, each loaded onto the corresponding context in ctx_list.
Return type: list of NDArray

utils.
clip_global_norm
(arrays, max_norm)¶ Rescales NDArrays so that the sum of their 2-norm is smaller than max_norm.
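The rescaling idea can be sketched in pure Python on flat lists of floats. This is not the library's code: the name clip_global_norm_sketch, the small epsilon in the denominator, and the returned tuple are all hypothetical. The sketch also assumes that the quantity being bounded is the 2-norm of all arrays taken together (the square root of the sum of their squared 2-norms), which is the usual convention for gradient clipping.

```python
import math


def clip_global_norm_sketch(arrays, max_norm):
    """Rescale all arrays by one common factor if their joint norm is too large.

    arrays: list of flat lists of floats (stand-ins for NDArrays).
    Returns the (possibly rescaled) arrays and the norm before clipping.
    """
    total_norm = math.sqrt(sum(v * v for arr in arrays for v in arr))
    # Assumption: a small epsilon guards against division by zero.
    scale = max_norm / (total_norm + 1e-8)
    if scale < 1.0:
        arrays = [[v * scale for v in arr] for arr in arrays]
    return arrays, total_norm


grads = [[3.0], [4.0]]  # joint 2-norm is 5.0
clipped, norm_before = clip_global_norm_sketch(grads, max_norm=1.0)
untouched, _ = clip_global_norm_sketch(grads, max_norm=10.0)
print(clipped, norm_before, untouched)
```

Because every array is multiplied by the same factor, the direction of the combined gradient is preserved and only its magnitude shrinks.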