End-to-End Performance Optimization for Training Streaming Convolutional Neural Networks using Billion-Pixel Whole-Slide Images
Journal
Proceedings - 2021 IEEE International Conference on Big Data, Big Data 2021
Pages
1127-1137
Date Issued
2021
Author(s)
Abstract
The combination of digital pathology and artificial intelligence has attracted increasing attention. To capture both global context and fine textural detail, recent work has proposed training artificial neural networks on whole-slide images (WSIs) of more than 10 million pixels in pursuit of high-precision models. However, such an approach faces new technical challenges. When a neural network is trained on a graphics processing unit (GPU), the extremely high spatial resolution of a WSI and the large volume of intermediate data generated during training exceed the memory capacity of the GPU. The streaming convolutional neural network (SCNN) was therefore proposed to decompose a large image into multiple smaller patches that can be forward- and backward-propagated within the GPU memory limit during training. However, in a multi-GPU high-performance computing (HPC) system, when the central processing unit (CPU) performs data preprocessing for distributed training across multiple GPUs, decoding large images and applying data augmentation consume substantial memory and CPU time, creating a performance bottleneck. This paper first presents a multithreaded image decoder optimized for multi-core CPUs to speed up WSI loading. We also propose a patch-level data augmentation algorithm and implement it on a dedicated GPU, which distributes the augmented data to the other GPUs for training the SCNN. Using these methods, we train a ResNet-50 SCNN model on the TCGA-LUAD and TCGA-LUSC datasets published by The Cancer Genome Atlas. Compared with training the SCNN on multiple GPUs using regular distributed data parallelism, our method saves up to 92.8% of memory and increases training speed by up to 242.5% when training with 8 A100 GPUs on the NVIDIA DGX-A100 system. © 2021 IEEE.
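For orientation, a minimal sketch of the multithreaded decoding idea follows, assuming an OpenSlide-readable slide file. The paper's decoder is a custom multi-core implementation not detailed in the abstract; the tile size, worker count, and function names here are hypothetical, and the sketch only illustrates splitting the decode work by region across CPU threads.

```python
import concurrent.futures as cf
import openslide

def decode_wsi_parallel(path, tile=4096, workers=8, level=0):
    """Decode a whole-slide image as a grid of tiles across a thread pool.

    Tile reads can overlap on a multi-core CPU because the underlying C
    library calls release the GIL. This illustrates the work-splitting
    idea only; it is not the paper's decoder.
    """
    slide = openslide.OpenSlide(path)  # OpenSlide handles are thread-safe
    w, h = slide.level_dimensions[level]
    boxes = [(x, y, min(tile, w - x), min(tile, h - y))
             for y in range(0, h, tile) for x in range(0, w, tile)]

    def read(box):
        x, y, tw, th = box
        # read_region decodes only the requested rectangle, not the slide.
        return (x, y), slide.read_region((x, y), level, (tw, th))

    with cf.ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(read, boxes))
```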
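The abstract does not specify the patch-level augmentation algorithm, so the following sketch assumes simple random flips, applied once per slide on a dedicated augmentation GPU (cuda:0) before the patches are scattered to the training GPUs; the device layout mirrors an 8-GPU node such as the DGX-A100, and all names are illustrative.

```python
import torch

def augment_patches_on_gpu(patches, gen):
    """Apply one random flip consistently to every patch of a WSI.

    Transforming at patch granularity on the GPU avoids decoding and
    flipping the full billion-pixel image on the CPU. Reassembling a
    flipped slide would also require reordering the patch grid, which is
    omitted here for brevity.
    """
    if torch.rand((), generator=gen, device=patches.device) < 0.5:
        patches = torch.flip(patches, dims=[-1])  # horizontal flip
    if torch.rand((), generator=gen, device=patches.device) < 0.5:
        patches = torch.flip(patches, dims=[-2])  # vertical flip
    return patches

# Hypothetical pipeline: cuda:0 is the dedicated augmentation GPU; the
# augmented patches are pushed to the training GPUs cuda:1..cuda:7.
aug_device = torch.device("cuda:0")
train_devices = [torch.device(f"cuda:{i}") for i in range(1, 8)]
gen = torch.Generator(device=aug_device)

patches = torch.rand(64, 3, 512, 512, device=aug_device)  # tiles of one WSI
patches = augment_patches_on_gpu(patches, gen)

# Round-robin distribution; a real system would overlap these copies with
# compute (e.g., CUDA streams or torch.distributed point-to-point sends).
shards = [s.to(dev, non_blocking=True)
          for s, dev in zip(patches.chunk(len(train_devices)), train_devices)]
```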
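Finally, a simplified PyTorch sketch of the patch-streaming training idea: one oversized image is processed tile by tile with gradient accumulation, so only a single tile's activations occupy GPU memory at a time. The real SCNN streams tiles through the network and reconstructs the exact full-image gradient; this accumulation scheme, and names such as train_step_streaming and tile_image, are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torchvision.models as models

def tile_image(image, tile_size):
    """Split a (C, H, W) tensor into non-overlapping (C, t, t) tiles.

    H and W are assumed to be multiples of tile_size for brevity.
    """
    c, h, w = image.shape
    return [image[:, top:top + tile_size, left:left + tile_size]
            for top in range(0, h, tile_size)
            for left in range(0, w, tile_size)]

def train_step_streaming(model, optimizer, loss_fn, image, label,
                         tile_size, device):
    """One optimizer step on a single huge image, processed tile by tile.

    Gradients are accumulated across tiles so only one tile's activations
    live in GPU memory at a time; this only approximates the exact
    full-image gradient that the SCNN computes.
    """
    model.train()
    optimizer.zero_grad()
    tiles = tile_image(image, tile_size)
    for tile in tiles:
        x = tile.unsqueeze(0).to(device)        # (1, C, t, t)
        loss = loss_fn(model(x), label.to(device)) / len(tiles)
        loss.backward()                         # frees this tile's activations
    optimizer.step()

# Hypothetical usage: a 2-class ResNet-50 on one oversized image.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.resnet50(num_classes=2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
image = torch.rand(3, 4096, 4096)               # stand-in for a decoded WSI
label = torch.tensor([0])
train_step_streaming(model, optimizer, nn.CrossEntropyLoss(),
                     image, label, tile_size=512, device=device)
```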
Subjects
Data preprocessing; Distributed computing; Machine learning; Performance analysis
Other Subjects
Computer graphics; Convolution; Convolutional neural networks; Decoding; Distributed computer systems; Graphics processing unit; Machine learning; Pixels; Program processors; Convolutional neural network; Data preprocessing; Digital pathologies; End-to-end performance; Large images; Multiple GPUs; Performance optimizations; Performances analysis; Training process; Whole slide images; Computer graphics equipment
Type
conference paper
