Tensor Movement Orchestration in Multi-GPU Training Systems
Journal
Proceedings - International Symposium on High-Performance Computer Architecture
Journal Volume
2023-February
ISBN
9781665476522
Date Issued
2023-01-01
Author(s)
Abstract
As deep neural network (DNN) models grow deeper and wider, one of the main challenges for training large-scale neural networks is overcoming limited GPU memory capacity. One common solution is to use host memory as external memory for swapping tensors in and out of GPU memory. However, the effectiveness of such tensor swapping can be impaired in data-parallel training systems due to contention on the shared PCIe channel to the host. In this paper, we propose the first large-model support framework that coordinates tensor movements among GPUs to alleviate PCIe channel contention. We design two types of coordination mechanisms. In the first mechanism, PCIe channel accesses from different GPUs are interleaved by selecting disjoint swapped-out tensors for each GPU. In the second mechanism, swap commands are orchestrated to avoid contention. The effectiveness of these two mechanisms depends on the model size and on how often the GPUs synchronize on gradients. Experimental results show that, compared to large-model support that is oblivious to channel contention, the proposed solution achieves average speedups ranging from 38.3% down to 31.8% as the memory footprint grows from 1.33 to 2 times the GPU memory size.
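The abstract does not give implementation details, but the second mechanism amounts to serializing swap commands so that GPUs take turns on the shared PCIe channel. The following Python sketch is only an illustration of that idea, not the paper's implementation; the names SwapCoordinator and request_swap, the round-robin token scheme, and the threading-based demo are all assumptions made for this example.

```python
# Illustrative sketch (hypothetical, not from the paper): swap commands from
# multiple GPUs are orchestrated so only one GPU at a time uses the shared
# PCIe channel to host memory, passing a round-robin token between GPUs.
import threading
from contextlib import contextmanager


class SwapCoordinator:
    """Serializes tensor swap transfers over a shared PCIe channel."""

    def __init__(self, num_gpus: int):
        self.num_gpus = num_gpus
        self._cond = threading.Condition()
        self._turn = 0  # round-robin token: which GPU may transfer next

    @contextmanager
    def request_swap(self, gpu_id: int):
        # Block until it is this GPU's turn to use the PCIe channel.
        with self._cond:
            while self._turn != gpu_id:
                self._cond.wait()
        try:
            yield  # caller performs its host<->GPU copy here
        finally:
            # Hand the token to the next GPU and wake any waiters.
            with self._cond:
                self._turn = (self._turn + 1) % self.num_gpus
                self._cond.notify_all()


def worker(coord: SwapCoordinator, gpu_id: int, tensor_names):
    # Each worker stands in for one GPU's swap engine; the demo assumes every
    # GPU issues the same number of swap commands.
    for name in tensor_names:
        with coord.request_swap(gpu_id):
            # Placeholder for the actual host<->device copy of the tensor.
            print(f"GPU {gpu_id} swapping {name} over PCIe")


if __name__ == "__main__":
    coord = SwapCoordinator(num_gpus=2)
    threads = [
        threading.Thread(
            target=worker, args=(coord, g, [f"layer{i}" for i in range(3)])
        )
        for g in range(2)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

In this toy scheduler the transfers from the two GPUs interleave one-for-one; a real system would also have to overlap transfers with computation and handle GPUs that issue unequal numbers of swap commands.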
Type
conference paper