Efficient video captioning on heterogeneous system architectures
Journal
Proceedings - 2021 IEEE 35th International Parallel and Distributed Processing Symposium, IPDPS 2021
Pages
1035-1045
Date Issued
2021
Author(s)
Abstract
Video captioning is the core technology driving the development of many important multidisciplinary applications, such as AI-assisted medical diagnosis, storytelling through videos, video question answering, and lip-reading, to name just a few. Video captioning employs a hybrid CNN+RNN neural network model to translate video scenes into natural language descriptions. For deep learning inference, a typical approach is to run both the CNN and the RNN on a GPU. Such a GPU-only approach often suffers from long inference times because it underutilizes the computing power offered by the CPU+GPU heterogeneous system architecture, which is common in modern computers. This work is an early effort to tackle the performance issue of performing deep learning inference with a hybrid CNN+RNN model on a heterogeneous system with a CPU and a GPU. This task is challenging because (1) the CNN and the RNN exhibit very different computing behaviors, which raises the question of how to split the two models into computing tasks and properly assign those tasks to the CPU and the GPU to minimize the inference time for a video frame, and (2) data dependencies exist between the CNN and the RNN within a video frame, as well as between the RNN executions of adjacent video frames; these dependencies prohibit full parallelization of the hybrid model. To solve these two problems, we propose two optimizations: a fine-grained scheduling scheme that maps computations to devices within a video frame, and a pipeline scheduling scheme that exploits the maximum parallelism across the execution of video frames. To facilitate these optimizations, we also develop an accurate regression-based cost model to predict the computation time of CNN/RNN operations and the communication time for moving data between the CPU and the GPU. Experimental results show that our optimization improves the performance of video captioning by up to 3.24× on the CPU+GPU system, compared with GPU-only execution. © 2021 IEEE.
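The regression-based cost model and CPU/GPU task assignment described in the abstract can be illustrated with a minimal sketch. The Python snippet below is not the authors' implementation; it assumes a simple linear model (time ≈ a·size + b) fitted by least squares over hypothetical profiling samples, plus a greedy per-operation choice between CPU and GPU that accounts for data-transfer cost. All function names and numbers are illustrative.

import numpy as np

def fit_cost_model(sizes_mb, times_ms):
    # Fit time ≈ a * size + b by ordinary least squares.
    A = np.vstack([sizes_mb, np.ones_like(sizes_mb)]).T
    coeffs, *_ = np.linalg.lstsq(A, times_ms, rcond=None)
    return coeffs  # (slope a, intercept b)

def predict_time(coeffs, size_mb):
    a, b = coeffs
    return a * size_mb + b

# Hypothetical profiled (input size in MB, measured time in ms) samples.
sizes = np.array([1.0, 2.0, 4.0, 8.0])
cpu_model  = fit_cost_model(sizes, np.array([3.1, 6.0, 12.2, 24.5]))
gpu_model  = fit_cost_model(sizes, np.array([1.2, 1.9,  3.3,  6.1]))
xfer_model = fit_cost_model(sizes, np.array([0.4, 0.7,  1.3,  2.5]))

def assign_device(op_size_mb, data_on_gpu=False):
    # Predicted total time = compute time + transfer time if the data
    # currently resides on the other device.
    t_cpu = predict_time(cpu_model, op_size_mb) + (
        predict_time(xfer_model, op_size_mb) if data_on_gpu else 0.0)
    t_gpu = predict_time(gpu_model, op_size_mb) + (
        0.0 if data_on_gpu else predict_time(xfer_model, op_size_mb))
    return "GPU" if t_gpu < t_cpu else "CPU"

print(assign_device(4.0, data_on_gpu=False))  # e.g. "GPU"

The same predicted per-operation and transfer times could then feed a frame-level scheduler (the paper mentions dynamic programming and pipelining across frames), but that part is beyond this sketch.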
Subjects
Dynamic programming
Heterogeneous system architectures
Model scheduling
Pipelining
Video captioning
Computer architecture
Diagnosis
Digital storage
Natural language processing systems
Network architecture
Recurrent neural networks
Scheduling
Common architecture
Communication time
Computing behavior
Heterogeneous systems
Neural network model
Performance issues
Pipeline scheduling
Scheduling schemes
Graphics processing unit
Type
conference paper