Tensor Movement Orchestration in Multi-GPU Training Systems
Lin, Shao Fu; Chen, Yi Jung; Cheng, Hsiang Yun
Proceedings - International Symposium on High-Performance Computer Architecture
As deep neural network (DNN) models grow deeper and wider, one of the main challenges in training large-scale neural networks is the limited GPU memory capacity. A common solution is to use host memory as external memory, swapping tensors in and out of GPU memory. However, the effectiveness of such tensor swapping can be impaired in data-parallel training systems due to contention on the shared PCIe channel to the host. In this paper, we propose the first large-model support framework that coordinates tensor movements among GPUs to alleviate PCIe channel contention. We design two types of coordination mechanisms. In the first mechanism, PCIe channel accesses from different GPUs are interleaved by selecting disjoint sets of swapped-out tensors for each GPU. In the second mechanism, swap commands are orchestrated in time to avoid contention. The effectiveness of these two mechanisms depends on the model size and how often the GPUs synchronize on gradients. Experimental results show that, compared to large-model support that is oblivious to channel contention, the proposed solution achieves average speedups of 38.3% and 31.8% when the memory footprint is 1.33 and 2 times the GPU memory size, respectively.
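The two coordination ideas in the abstract can be sketched in a few lines. The code below is an illustrative toy model, not the paper's implementation: `disjoint_partition` mimics the first mechanism by assigning each GPU a disjoint set of tensors to swap out, and `orchestrate_swaps` mimics the second by serializing swap commands into time slots so no two GPUs hit the shared PCIe channel at once. All function and variable names are hypothetical.

```python
from itertools import zip_longest

def disjoint_partition(tensors, num_gpus):
    """Mechanism 1 (sketch): assign each GPU a disjoint subset of tensors
    to swap out, so per-GPU PCIe traffic targets different data and can
    be interleaved instead of duplicated."""
    parts = [[] for _ in range(num_gpus)]
    for i, t in enumerate(sorted(tensors)):
        parts[i % num_gpus].append(t)
    return parts

def orchestrate_swaps(per_gpu_cmds):
    """Mechanism 2 (sketch): interleave each GPU's swap commands into
    sequential time slots, granting the shared channel to at most one
    GPU per slot."""
    schedule = []  # list of (slot, gpu_id, command)
    slot = 0
    for round_cmds in zip_longest(*per_gpu_cmds):
        for gpu, cmd in enumerate(round_cmds):
            if cmd is not None:
                schedule.append((slot, gpu, cmd))
                slot += 1
    return schedule

if __name__ == "__main__":
    print(disjoint_partition(["t0", "t1", "t2", "t3"], 2))
    print(orchestrate_swaps([["swap_out:t0", "swap_in:t0"], ["swap_out:t1"]]))
```

In a real system the choice between the two mechanisms would hinge on the trade-off the abstract notes: disjoint tensor selection changes *what* each GPU swaps, while command orchestration changes *when* swaps occur relative to gradient synchronization points.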