Train LSTM with Multiple GPUs Using Model Parallelism

LSTM evaluation is inherently hard to parallelize because of its complex data dependencies. LSTM training, which has even more data dependencies, running in reverse order during the backpropagation phase, is harder still. For general information about LSTM, see the excellent introduction by Christopher. For an example of LSTM training with model parallelism, see example/model-parallelism-lstm/.

Model Parallelism: Using Multiple GPUs As a Pipeline

Recently, there’s been a great deal of heated discussion about model parallelism in applied machine learning. It was originally designed for the super large convolutional layer in GoogLeNet. We borrowed the idea of placing each layer on one GPU. The basic unit of model parallelism is a layer of the neural network model. The benefit is that no single GPU has to keep the parameters of all the layers in memory. This eases the memory limitation for large-scale tasks such as machine translation.

[Figure: LSTM layers assigned to different GPUs, forming a pipeline]

In the preceding figure, different LSTM layers are assigned to different GPUs. After GPU 1 finishes computing layer 1 for the first sentence, it passes the output to GPU 2. At the same time, GPU 1 fetches the next sentence and starts training on it. This differs significantly from data parallelism: there is no contention to update a shared model at the end of each iteration, and most of the communication happens when intermediate results are pipelined between GPUs.
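The pipelining described above can be sketched in plain Python. This is only an illustration, not code from the example directory: the function name pipeline_schedule is hypothetical, it covers the forward pass only, and it assumes every stage takes one time step per sentence.

```python
def pipeline_schedule(num_gpus, num_sentences):
    """Return one row per time step; row[g] is the index of the sentence
    that GPU g is processing at that step, or None if the GPU is idle."""
    total_steps = num_sentences + num_gpus - 1
    schedule = []
    for t in range(total_steps):
        row = []
        for gpu in range(num_gpus):
            # GPU g receives sentence s exactly g steps after GPU 0 started it.
            s = t - gpu
            row.append(s if 0 <= s < num_sentences else None)
        schedule.append(row)
    return schedule


schedule = pipeline_schedule(num_gpus=3, num_sentences=4)
for step, row in enumerate(schedule):
    print(step, row)
```

Once the pipeline is full (step 2 in this run), every GPU is busy with a different sentence. The idle slots at the start and end are the cost of filling and draining the pipeline.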

In the current implementation, the layers are defined in lstm_unroll().

Workload Partitioning

Implementing model parallelism requires good knowledge of the training task in order to partition the network across the GPUs. Although this requires detailed analysis that is beyond the scope of a course project, we found that you can apply some general principles:

  • To avoid data transmission, place neighboring layers on the same GPU.
  • To avoid bottlenecks in a pipeline, balance the workload between GPUs.
  • Remember that different kinds of layers have different computation-memory properties.

[Figure: two ways to partition an eight-layer pipeline across GPUs]

Let’s take a quick look at the two pipelines in the preceding diagram. They both have eight layers with a decoder and an encoder layer. Based on our first principle, it’s unwise to place all neighboring layers on separate GPUs. We also want to balance the workload across GPUs. Although the LSTM layers consume less memory than the decoder/encoder layers, they consume more computation time because of the dependency of the unrolled LSTM. Thus, the partition on the left will be faster than the one on the right because its workload is more evenly distributed across the GPUs.
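To make the balancing principle concrete, here is a minimal sketch that searches for the contiguous partition with the smallest bottleneck. The per-layer costs are made-up relative numbers, not measurements, and best_contiguous_partition is a hypothetical helper. Keeping each GPU's layers contiguous follows the first principle (neighboring layers stay together), and minimizing the heaviest GPU follows the second.

```python
from itertools import combinations


def best_contiguous_partition(costs, num_gpus):
    """Split layers 0..n-1 into num_gpus contiguous groups so that the
    heaviest group (the pipeline bottleneck) is as light as possible.
    Brute force over cut points; fine for a handful of layers."""
    n = len(costs)
    best = None
    for cuts in combinations(range(1, n), num_gpus - 1):
        bounds = (0,) + cuts + (n,)
        groups = [list(range(bounds[i], bounds[i + 1]))
                  for i in range(num_gpus)]
        load = max(sum(costs[j] for j in group) for group in groups)
        if best is None or load < best[0]:
            best = (load, groups)
    return best


# Assumed relative compute costs: a cheap first and last layer around
# six more expensive LSTM layers (illustrative values only).
costs = [1, 3, 3, 3, 3, 3, 3, 1]
load, groups = best_contiguous_partition(costs, num_gpus=4)
print(load, groups)
```

With these assumed costs, the best split puts two layers on each GPU, and the slowest GPU carries a load of 6. A split such as one layer, three layers, three layers, one layer would leave two GPUs with a load of 9 while the other two sit mostly idle.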

Currently, the layer partition is configured using the group2ctx option.

Apply Bucketing to Model Parallelism

To run model parallelism with bucketing, you need to unroll an LSTM model for each bucket to obtain an executor for each. For details about how the model is bound, see

On the other hand, because model parallelism partitions the model/layers, the input data has to be transformed/transposed to the agreed shape. For more details, see bucket_io.
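As a rough illustration of bucketing and of the reshaping step, the sketch below pads each sentence to the smallest bucket that fits it and then transposes a batch from one row per sentence to one row per time step, which is the layout an unrolled LSTM consumes. It is not the bucket_io code: the function names, the padding id 0, and the choice to skip sentences that fit no bucket are all assumptions made for this example.

```python
def assign_bucket(length, buckets):
    """Return the smallest bucket size that fits the sentence, or None."""
    for size in sorted(buckets):
        if length <= size:
            return size
    return None


def make_batches(sentences, buckets, pad_id=0):
    """Group sentences by bucket and pad each one to its bucket length."""
    grouped = {size: [] for size in buckets}
    for sentence in sentences:
        size = assign_bucket(len(sentence), buckets)
        if size is not None:  # sentences longer than every bucket are skipped
            grouped[size].append(sentence + [pad_id] * (size - len(sentence)))
    return grouped


def to_time_major(batch):
    """Transpose [sentence][time step] into [time step][sentence]."""
    return [list(step) for step in zip(*batch)]


sentences = [[5, 6], [7, 8, 9], [1, 2, 3, 4, 5], [4]]
batches = make_batches(sentences, buckets=[3, 5])
time_major = to_time_major(batches[3])
print(batches[3])
print(time_major)
```

Each bucket would then get its own unrolled model and executor, and every batch is routed to the executor whose unrolled length matches its bucket size.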