Deep Neural Networks (DNNs) are a family of powerful machine-learning models used for tasks such as computer vision, natural language processing, and image generation. DNN models can become very large: the well-known GPT-3 model from OpenAI contains 175 billion weights, which translates into more than half a terabyte of memory just to store the model. DNN models are trained by adjusting the weights with gradient-based methods. For large DNN models, this training cannot be completed on a single machine in reasonable time, so an HPC infrastructure is needed for DNN training.
Most commonly, DNN training is distributed across the compute nodes by assigning each node a portion of the training dataset. After each node has computed its local gradient for the model update, the gradients of all nodes are combined into one global update for the shared model. The operation that combines the gradients is a well-known collective in distributed computing called allreduce. Although this data-parallel scheme with allreduce is conceptually simple, it quickly becomes a performance bottleneck when scaled to a larger number of nodes.
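To make the data-parallel update concrete, the following Python sketch simulates one training loop with several workers: each worker computes a gradient on its own data shard, an allreduce combines the gradients, and every worker applies the same averaged update. The toy model, the worker count, and the allreduce helper are purely illustrative and not part of Comprex.

```python
import numpy as np

def local_gradient(w, x_shard, y_shard):
    """Least-squares gradient on this worker's data shard (toy model)."""
    residual = x_shard @ w - y_shard
    return x_shard.T @ residual / len(y_shard)

def allreduce(grads):
    """Stand-in for a real allreduce: sum the per-worker gradients.
    In a distributed setting this would be a collective over the network."""
    return np.sum(grads, axis=0)

num_workers, lr = 4, 0.1
rng = np.random.default_rng(0)
x = rng.normal(size=(400, 8))
y = x @ rng.normal(size=8)
shards = np.array_split(np.arange(400), num_workers)
w = np.zeros(8)

for step in range(100):
    # Every worker computes a gradient on its own shard of the data ...
    grads = [local_gradient(w, x[idx], y[idx]) for idx in shards]
    # ... then all gradients are combined into one global update.
    global_grad = allreduce(grads) / num_workers
    w -= lr * global_grad
```

The cost of the `allreduce` call grows with the gradient size and the number of nodes, which is exactly the volume Comprex reduces.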
Comprex is a communication library that lowers the communication overhead by reducing the volume of data that needs to be exchanged in the allreduce operation. The library builds on top of our GPI and GaspiCxx communication libraries. For DNN training, Comprex compresses the gradients and at the same time keeps track of the compression error locally. This local error is used to correct the data in the following send operation. The allreduce operations we developed with Comprex exploit the unique features of the underlying GPI communication library, allowing us to build very fast, asynchronous communication patterns. Figure 1 shows a ring-like communication scheme in which the gradient information is passed around in a round-robin fashion.
Figure 1: Comprex ring-allreduce communication pattern
Comprex compresses the gradients before each send and decompresses them after each receive. The unique feature is that all nodes work and communicate without synchronization, thereby allowing fast and scalable DNN training.
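The compression-with-error-feedback idea can be sketched as follows. This is a generic top-k sparsifier with a local residual buffer, shown only to illustrate the principle; it is not the exact compression scheme used inside Comprex, and the class and method names are illustrative.

```python
import numpy as np

class ErrorFeedbackCompressor:
    """Top-k sparsification with local error feedback (illustrative).

    Values dropped by the compressor are not lost: they are kept in a
    local residual and added back before the next compression, so the
    error is corrected in the following send operation."""

    def __init__(self, size, k):
        self.residual = np.zeros(size)
        self.k = k

    def compress(self, gradient):
        corrected = gradient + self.residual           # add back previous error
        idx = np.argsort(np.abs(corrected))[-self.k:]  # keep the k largest entries
        values = corrected[idx]
        self.residual = corrected.copy()
        self.residual[idx] = 0.0                       # remember what was dropped
        return idx, values                             # this is what gets sent

    @staticmethod
    def decompress(idx, values, size):
        dense = np.zeros(size)
        dense[idx] = values                            # rebuild a dense gradient
        return dense
```

Only the selected indices and values travel over the network, which can shrink the communicated volume by orders of magnitude, while the residual buffer ensures that the dropped information still influences later updates.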
Our allreduce operations are integrated into TensorFlow, a popular framework for DNN training. In addition, a dedicated Comprex optimizer augments the Comprex allreduce with techniques that smooth out the effects of gradient compression on training. The difference between training with and without Comprex can be seen in Figure 2: the training on 32 nodes finishes in much less time while keeping the final accuracy of the model high.
Figure 2: Training of ResNet-32 with and without Comprex
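One way such an integration could look on the TensorFlow side is a thin wrapper that routes every gradient through a compressed allreduce before handing it to a standard optimizer. The class and the `compressed_allreduce` callable below are hypothetical and only illustrate the control flow, not the actual Comprex API.

```python
import tensorflow as tf

class CompressedAllreduceOptimizer:
    """Hypothetical wrapper: exchange gradients via a compressed allreduce,
    then delegate the weight update to a regular TensorFlow optimizer."""

    def __init__(self, optimizer, compressed_allreduce):
        self._optimizer = optimizer             # e.g. tf.keras.optimizers.SGD(0.1)
        self._allreduce = compressed_allreduce  # callable: gradient -> reduced gradient

    def apply_gradients(self, grads_and_vars):
        reduced = [(self._allreduce(g), v) for g, v in grads_and_vars]
        return self._optimizer.apply_gradients(reduced)

# Example with a trivial "allreduce" that just returns the local gradient;
# in a real distributed setup this callable would compress, communicate
# with the other nodes, and decompress.
opt = CompressedAllreduceOptimizer(tf.keras.optimizers.SGD(0.1),
                                   compressed_allreduce=lambda g: g)

v = tf.Variable(1.0)
with tf.GradientTape() as tape:
    loss = (v - 3.0) ** 2
opt.apply_gradients(zip(tape.gradient(loss, [v]), [v]))
```

Keeping the wrapper separate from the underlying optimizer means the compression and communication strategy can be changed without touching the model or training loop.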