Survey of Existing Interfaces and Implementations

Commonly used deep learning libraries with good RNN/LSTM support include Theano and its wrappers Lasagne and Keras; CNTK; TensorFlow; and various implementations in Torch, like char-rnn, this, and this.


RNN support in Theano is provided by its scan operator, which allows construction of a loop where the number of iterations is specified as a runtime value of a symbolic variable. You can find an official example of an LSTM implementation with scan here.
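
The looping pattern that scan expresses can be sketched in plain Python. This is only an illustration of the idea, not Theano's API: the function name `scan`, its signature, and the `running_sum` step function are all hypothetical, and real Theano builds a symbolic graph rather than running eagerly.

```python
def scan(fn, sequence, init_state, n_steps):
    """Apply fn once per step, threading the state through; return all states.

    Hypothetical stand-in for the scan idea, not Theano's real signature.
    """
    state = init_state
    outputs = []
    for t in range(n_steps):  # n_steps is an ordinary runtime value
        state = fn(sequence[t], state)
        outputs.append(state)
    return outputs


def running_sum(x_t, total):
    """Step function: combine the current input with the carried state."""
    return total + x_t


# Only the first three elements are consumed because n_steps is 3.
result = scan(running_sum, [1, 2, 3, 4], 0, n_steps=3)
```

The point of the sketch is that the iteration count is a value supplied at run time, so the same definition handles sequences of different lengths.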


I’m not very familiar with the Theano internals, but it seems from theano/scan_module/ that the scan operator is implemented with a loop in Python that performs one iteration at a time:

    fn = self.fn.fn

    while (i < n_steps) and cond:
        # ...

The grad function in Theano constructs a symbolic graph for computing gradients. So the grad for the scan operator is actually implemented by constructing another scan operator:

    local_op = Scan(inner_gfn_ins, inner_gfn_outs, info)
    outputs = local_op(*outer_inputs)
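
To see why the gradient of a loop is naturally another loop, here is a minimal sketch in plain Python. It assumes a toy recurrence h_t = w * h_{t-1} + x_t and a loss equal to the final state; the names `forward` and `backward` are hypothetical and nothing here is taken from Theano's code.

```python
def forward(w, xs, h0=0.0):
    """Forward loop: h_t = w * h_{t-1} + x_t. Returns every state, h0 first."""
    hs = [h0]
    for x in xs:
        hs.append(w * hs[-1] + x)
    return hs


def backward(w, xs, hs):
    """Reverse loop for loss = final state. Returns (dL/dw, [dL/dx_t])."""
    grad_h = 1.0                      # dL/dh_T, since the loss is h_T itself
    grad_w = 0.0
    grad_xs = [0.0] * len(xs)
    for t in reversed(range(len(xs))):
        grad_xs[t] = grad_h           # dh_t/dx_t = 1
        grad_w += grad_h * hs[t]      # hs[t] is the state fed into step t
        grad_h = grad_h * w           # propagate to the previous state
    return grad_w, grad_xs


w, xs = 0.5, [1.0, 2.0, 3.0]
hs = forward(w, xs)
grad_w, grad_xs = backward(w, xs, hs)
```

The backward pass walks the same steps in reverse order and needs the stored forward states, which is the structure a second scan would capture.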

The performance guide for Theano’s scan operator suggests minimizing the usage of scan. This might be due to the fact that the loop is executed in Python, which might be a bit slow (due to context switching and the performance of Python itself). Moreover, because no unrolling is performed, the graph optimizer can’t see the big picture.

If I understand correctly, when multiple RNN/LSTM layers are stacked, each layer runs its own scan loop over the whole sequence, one layer after another, rather than a single loop in which each iteration computes one time step through the entire stack. If all of the intermediate values are stored to support computing the gradients, this is fine. Otherwise, using a single loop could be more memory efficient.
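
The two loop orderings can be compared with a small sketch. It uses a toy scalar cell in place of a real RNN/LSTM layer, and the names `cell`, `layer_by_layer`, and `single_loop` are hypothetical; it only illustrates the ordering, not how Theano schedules anything.

```python
def cell(weight, x, h):
    """A toy recurrent step; stands in for a real RNN/LSTM cell."""
    return weight * h + x


def layer_by_layer(weights, xs):
    """One full loop over time per layer; each layer's whole output is stored."""
    seq = list(xs)
    for w in weights:
        h, out = 0.0, []
        for x in seq:
            h = cell(w, x, h)
            out.append(h)
        seq = out                      # becomes the input of the next layer
    return seq


def single_loop(weights, xs):
    """One loop over time; each iteration runs the whole stack for that step."""
    hs = [0.0] * len(weights)          # one carried state per layer
    out = []
    for x in xs:
        inp = x
        for k, w in enumerate(weights):
            hs[k] = cell(w, inp, hs[k])
            inp = hs[k]
        out.append(inp)
    return out


weights, xs = [0.5, 0.25], [1.0, 2.0, 3.0]
same = layer_by_layer(weights, xs) == single_loop(weights, xs)
```

Both orderings compute the same outputs. The difference is that the layer-by-layer version materializes a full intermediate sequence per layer, while the single loop only has to keep one state per layer when intermediates are not needed for gradients.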


The documentation for RNN in Lasagne can be found here. In Lasagne, a recurrent layer is just like a standard layer, except that the input shape is expected to be (batch_size, sequence_length, feature_dimension). The output shape is then (batch_size, sequence_length, output_dimension).

Both batch_size and sequence_length can be specified as None, in which case they are inferred from the data. Alternatively, when memory is sufficient and the (maximum) sequence length is known beforehand, you can set unroll_scan to True. Then Lasagne will unroll the graph explicitly, instead of using the Theano scan operator. Explicitly unrolling is implemented in

The recurrent layer also accepts a mask_input to support variable-length sequences (e.g., when sequences within a mini-batch have different lengths). The mask has the shape (batch_size, sequence_length).
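
One common way to apply such a mask is to carry the previous state forward wherever the mask is 0. The sketch below shows that idea in plain Python with scalar features for brevity; `masked_rnn` and `add_step` are hypothetical names, and this is an assumption about the general technique rather than a copy of Lasagne's implementation.

```python
def masked_rnn(batch, mask, step, h0=0.0):
    """Run a recurrence over padded sequences.

    batch, mask: nested lists of shape (batch_size, sequence_length).
    Where mask is 0 the previous state is carried forward unchanged.
    """
    outputs = []
    for xs, ms in zip(batch, mask):
        h, row = h0, []
        for x, m in zip(xs, ms):
            h = step(x, h) if m else h
            row.append(h)
        outputs.append(row)
    return outputs


def add_step(x, h):
    """Toy step function: accumulate the inputs."""
    return h + x


batch = [[1.0, 2.0, 3.0], [4.0, 5.0, 0.0]]   # second sequence is padded
mask = [[1, 1, 1], [1, 1, 0]]
outputs = masked_rnn(batch, mask, add_step)
```

With this convention the last column of the output holds the final valid state of every sequence, regardless of its true length.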


The documentation for RNN in Keras can be found here. The interface in Keras is similar to the interface in Lasagne. The input is expected to be of shape (batch_size, sequence_length, feature_dimension), and the output shape (if return_sequences is True) is (batch_size, sequence_length, output_dimension).

Keras currently supports both a Theano and a TensorFlow back end. RNN for the Theano back end is implemented with the scan operator. For TensorFlow, it seems to be implemented via explicit unrolling. The documentation says that for the TensorFlow back end, the sequence length must be specified beforehand, and masking is currently not working (because tf.reduce_any is not functioning yet).


karpathy/char-rnn is implemented by explicit unrolling. In contrast, Element-Research/rnn runs the sequence iteration in Lua. It actually has a very modular design:

  • The basic RNN/LSTM modules run only one time step per call of forward (and accumulate/store the information needed to support backward computation, if required). Using this API directly gives you detailed control.
  • A collection of Sequencers is defined to model common scenarios, like forward sequence, bi-directional sequence, attention models, etc.
  • There are other utility modules, like masking to support variable length sequences, etc.
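
The split between a one-step module and a sequence wrapper can be sketched as follows. This is a Python stand-in for the design idea, not the Lua API: the classes `StepRNN` and `Sequencer` and the toy scalar recurrence are hypothetical and only mirror the structure described in the list above.

```python
class StepRNN:
    """Processes exactly one time step per forward call and keeps its own state."""

    def __init__(self, weight):
        self.weight = weight
        self.state = 0.0

    def forward(self, x):
        self.state = self.weight * self.state + x
        return self.state

    def forget(self):
        """Reset the carried state before starting a new sequence."""
        self.state = 0.0


class Sequencer:
    """Wraps a step module and applies it to every element of a sequence."""

    def __init__(self, module):
        self.module = module

    def forward(self, xs):
        self.module.forget()
        return [self.module.forward(x) for x in xs]


seq_outputs = Sequencer(StepRNN(0.5)).forward([1.0, 2.0, 3.0])
```

Calling the step module directly gives per-step control, while the wrapper covers the common case of feeding a whole sequence at once.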


CNTK looks quite different from other common deep learning libraries. I don’t understand it very well. I will talk with Yu to get more details.

It seems that the basic data types are matrices (although there is also a TensorView utility class). The mini-batch for sequence data is packed into a matrix whose number of rows is feature_dimension and whose number of columns is sequence_length * batch_size (see Figure 2.9 on page 50 of the CNTKBook).

Recurrent networks are first-class citizens in CNTK. In section of the CNTKBook, you can see an example of a customized computation node. The node needs to explicitly define the functions for standard forward and forward with a time index, which is used for RNN evaluation:

    virtual void EvaluateThisNode()
    {
        EvaluateThisNodeS(FunctionValues(), Inputs(0)->FunctionValues(),
            Inputs(1)->FunctionValues());
    }

    virtual void EvaluateThisNode(const size_t timeIdxInSeq)
    {
        // Slice out the columns that belong to this time step.
        Matrix<ElemType> sliceInput1Value = Inputs(1)->FunctionValues().ColumnSlice(
            timeIdxInSeq * m_samplesInRecurrentStep, m_samplesInRecurrentStep);
        Matrix<ElemType> sliceOutputValue = m_functionValues.ColumnSlice(
            timeIdxInSeq * m_samplesInRecurrentStep, m_samplesInRecurrentStep);
        EvaluateThisNodeS(sliceOutputValue, Inputs(0)->FunctionValues(),
            sliceInput1Value);
    }

The function ColumnSlice(start_col, num_col) takes out the packed data for that time index, as described above (here m_samplesInRecurrentStep must be the mini-batch size).
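
The packing layout implied by that slicing can be sketched in plain Python. The column order used here (all samples of time step 0, then all samples of time step 1, and so on) is inferred from the ColumnSlice arguments above and is an assumption, and the helpers `pack` and `column_slice` are hypothetical, not CNTK functions.

```python
def pack(batch):
    """Pack batch[b][t] (a feature vector) into a matrix stored as a list of rows.

    The result has feature_dimension rows and sequence_length * batch_size
    columns; column t * batch_size + b holds batch[b][t].
    """
    batch_size, seq_len = len(batch), len(batch[0])
    feat_dim = len(batch[0][0])
    columns = [batch[b][t] for t in range(seq_len) for b in range(batch_size)]
    return [[col[r] for col in columns] for r in range(feat_dim)]


def column_slice(matrix, start_col, num_col):
    """Return num_col consecutive columns starting at start_col."""
    return [row[start_col:start_col + num_col] for row in matrix]


batch = [
    [[1, 10], [2, 20], [3, 30]],   # sequence 0: three steps, two features
    [[4, 40], [5, 50], [6, 60]],   # sequence 1
]
packed = pack(batch)
batch_size = len(batch)
# All samples for time index 1, mirroring ColumnSlice(t * n, n).
step1 = column_slice(packed, 1 * batch_size, batch_size)
```

Under this layout, one contiguous column slice yields a feature_dimension by batch_size matrix for a single time step, which is what the per-step evaluation function operates on.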

The low-level API for recurrent connections seems to be a delay node, but I’m not sure how to use it. The example PTB language model uses a very high-level API (simply setting recurrentLayer = 1 in the config).


The current RNNLM example in TensorFlow uses explicit unrolling for a predefined number of time steps. The whitepaper mentions that an advanced control flow API (similar to Theano’s scan) is planned.