Tensor Movement Orchestration in Multi-GPU Training Systems

Lin, Shao Fu; Chen, Yi Jung; Cheng, Hsiang Yun; CHIA-LIN YANG

Title:	Tensor Movement Orchestration in Multi-GPU Training Systems
Authors:	Lin, Shao Fu; Chen, Yi Jung; Cheng, Hsiang Yun; CHIA-LIN YANG
Issue Date:	1-Jan-2023
Volume:	2023-February
Source:	Proceedings - International Symposium on High-Performance Computer Architecture
Abstract:	As deep neural network (DNN) models grow deeper and wider, one of the main challenges for training large-scale neural networks is overcoming limited GPU memory capacity. One common solution is to utilize the host memory as the external memory for swapping tensors in and out of GPU memory. However, the effectiveness of such tensor swapping can be impaired in data-parallel training systems due to contention on the shared PCIe channel to the host. In this paper, we propose the first large-model support framework that coordinates tensor movements among GPUs to alleviate PCIe channel contention. We design two types of coordination mechanisms. In the first mechanism, PCIe channel accesses from different GPUs are interleaved by selecting disjoint swapped-out tensors for each GPU. In the second method, swap commands are orchestrated to avoid contention. The effectiveness of these two methods depends on the model size and how often the GPUs synchronize on gradients. Experimental results show that compared to large-model support that is oblivious to channel contention, the proposed solution achieves average speedups of 38.3% to 31.8% when the memory footprint size is 1.33 to 2 times the GPU memory size.
URI:	https://scholars.lib.ntu.edu.tw/handle/123456789/630476
ISBN:	9781665476522
ISSN:	15300897
DOI:	10.1109/HPCA56546.2023.10071043
Appears in Collections:	Department of Computer Science and Information Engineering
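
The first mechanism named in the abstract (giving each GPU a disjoint set of swapped-out tensors so that their PCIe accesses interleave) can be illustrated with a small sketch. This record does not describe the paper's actual selection algorithm, so everything below is an assumption made for illustration only: the round-robin assignment policy, the function name select_disjoint_swap_sets, and the parameters tensors, num_gpus, and bytes_to_free are all hypothetical. The sketch also assumes every GPU holds the same tensors in the same order, as in data-parallel training.

```python
from typing import Dict, List, Tuple


def select_disjoint_swap_sets(
    tensors: List[Tuple[str, int]],
    num_gpus: int,
    bytes_to_free: int,
) -> Dict[int, List[str]]:
    """Assign each GPU its own, non-overlapping list of tensors to swap out.

    Illustrative assumption, not the paper's algorithm: candidates are dealt
    to GPUs round-robin in layer order, and each GPU stops accepting tensors
    once it has freed at least `bytes_to_free` bytes.

    tensors:       (name, size_in_bytes) swap candidates, in layer order.
    num_gpus:      number of data-parallel GPUs sharing the PCIe channel.
    bytes_to_free: memory each GPU must release to fit the model.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")

    plan: Dict[int, List[str]] = {gpu: [] for gpu in range(num_gpus)}
    freed: Dict[int, int] = {gpu: 0 for gpu in range(num_gpus)}

    for index, (name, size) in enumerate(tensors):
        gpu = index % num_gpus  # each candidate belongs to exactly one GPU
        if freed[gpu] < bytes_to_free:
            plan[gpu].append(name)
            freed[gpu] += size

    # A disjoint assignment is only possible if every share is large enough.
    short = [gpu for gpu in plan if freed[gpu] < bytes_to_free]
    if short:
        raise ValueError(f"GPUs {short} cannot free {bytes_to_free} bytes")
    return plan
```

For example, with four 4-byte candidates act0 to act3, two GPUs, and 8 bytes to free per GPU, this sketch gives GPU 0 the tensors act0 and act2 and GPU 1 the tensors act1 and act3. Because the two GPUs swap different tensors, their transfers are needed at different points of the training step, which is the interleaving effect the abstract refers to.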
