Development of a High-Performance Inter-processor Communication Library for Multicore Platforms (為多核心平台開發高效能處理機通訊程式庫)

Contributor: 洪士灝
Affiliation: National Taiwan University: Graduate Institute of Computer Science and Information Engineering
Author: 楊文隆 (Yang, Wen-Long)
Record dates: 2010-05-17; 2018-07-05
Year: 2009
Identifier: U0001-1908200916260800
URI: http://ntur.lib.ntu.edu.tw//handle/246246/183384

Abstract (translated from Chinese)

In recent years, multi-core processors have been regarded as an effective design for improving processor performance and have been widely adopted. On embedded multi-core platforms, inter-core communication mechanisms are developed by different chip vendors and lack a unified standard, which makes application development very difficult. A standard, portable, and efficient inter-process communication library is therefore badly needed. In high-performance parallel computing, MPI (Message Passing Interface) is a very successful and widely used parallel library. For resource-constrained platforms such as embedded systems, however, a complete MPI uses too much memory and is not suitable.

This thesis proposes an inter-process communication library, the MSG (Message Passing) library. It contains the indispensable subset of MPI needed for inter-process communication and synchronization. The design of the MSG library also takes compatibility, portability, and communication performance into account.

In this thesis, we choose IBM's high-performance Cell multi-core processor as a case study, develop the MSG library on it, and compare its performance with existing Cell communication libraries. Compared with the point-to-point transfers of IBM DaCS and CML, the MSG library improves performance by 5% to 29.7%. Compared with the collective operations of DaCS, the MSG library is 2 to 22 times faster. In addition, our experience building the MSG library on the Cell platform will help port it to other multi-core platforms in the future.

Abstract (English)

In recent years, multi-core systems have been proposed as a solution for speeding up processor performance. For embedded multi-core platforms, the diverse communication mechanisms developed by vendors have led to difficulties in developing applications on those platforms. Thus, a standard, portable, and efficient inter-process communication mechanism for embedded multicore platforms is needed to remove these difficulties.

In the domain of high-performance and parallel computing, MPI has become a successful and prevalent message-passing scheme. But a full MPI library would require a significant amount of memory space for code and message buffers. It is not suitable to adopt the complete MPI specification on embedded systems with limited memory capacity.

We propose to build a communication library, called the MSG library, for multicore platforms. The MSG library contains an essential subset of the MPI standard, including blocking and non-blocking point-to-point communications, one-sided communications, and a subset of collective operations. In addition to providing the communication and synchronization mechanisms, the MSG library is designed for compatibility, portability, and performance.

In our work, the IBM Cell platform is chosen to implement our design because of its high-performance characteristics. As a case study, we developed several parallel applications with the MSG library and evaluated its performance against the other communication libraries. Our experience on the Cell platform should help design and implement the MSG library on other platforms.

Table of Contents

Abstract (Chinese)
Abstract
Acknowledgements
List of Tables
List of Figures
1 Introduction
  1.1 Communication Libraries for Multi-core Platforms
    1.1.1 Overview of MPI
    1.1.2 Overview of MCAPI
    1.1.3 Comparison of MPI and MCAPI
  1.2 Thesis Organization
2 Backgrounds and Related Work
  2.1 Overview of the Cell Processor
  2.2 Communication Mechanisms on Cell Processors
    2.2.1 Direct Memory Access (DMA)
      2.2.1.1 SPE-initiated DMA
      2.2.1.2 PPE-initiated DMA
      2.2.1.3 SPE-SPE DMA transfers
      2.2.1.4 Atomic Operations
    2.2.2 Mailbox
    2.2.3 Signal
  2.3 Programming Frameworks for Cell Broadband Engine Architecture
    2.3.1 Programming with the Shared Memory Model
    2.3.2 Programming with the Message Passing Model
    2.3.3 Programming Frameworks with Other Models
  2.4 Communication Libraries for Other Heterogeneous Multi-core Platforms
    2.4.1 ICPC
    2.4.2 Streaming RPC
3 Primitive Communication Mechanisms Performance of Cell
  3.1 Experimental Platform and Methodologies
  3.2 Performance Results
    3.2.1 DMA and memory access instructions
      3.2.1.1 PPE side
      3.2.1.2 SPE side
      3.2.1.3 Comparison and Summary
    3.2.2 Mailbox and Signal Performance
      3.2.2.1 Mailbox
      3.2.2.2 Signal
4 Design and Implementation of the MSG Library
  4.1 Unaligned Data Transmission Handling
  4.2 Point-to-point Communication
    4.2.1 Non-blocking Message-Passing Scheme
      4.2.1.1 Design
      4.2.1.2 Implementation
    4.2.2 Blocking Message-Passing Scheme
      4.2.2.1 Design
      4.2.2.2 Implementation
    4.2.3 Comparison of Non-blocking and Blocking Message Passing Scheme
  4.3 One-Sided Communication
  4.4 Collective Communication
    4.4.1 Design
    4.4.2 Implementation
  4.5 Mapping Between MSG Library Interface and MPI Standard
5 Experiments and Performance Evaluation
  5.1 Latency of Point-to-point Communication in MSG Library
    5.1.1 Latencies between PPE and SPE
    5.1.2 SPE-SPE communication
  5.2 Case Study of Data Parallel Applications: RC5 Encryption and Decryption
  5.3 Case Study of Pipelined Applications: AES Decryption and SHA-1 Hashing
  5.4 Performance of Collective Operations
    5.4.1 Barrier
    5.4.2 Scatter and Gather
6 Conclusion and Future Work
  6.1 Conclusions
  6.2 Future Work
Bibliography

Format: application/pdf
Size: 731914 bytes
Language: en-US
Keywords (translated from Chinese): multi-core system; embedded system; Cell processor; inter-core communication; MPI; parallel program
Keywords (English): multicore; embedded system; Cell processor; intercore communication
Title: 為多核心平台開發高效能處理機通訊程式庫
Alternative title: Development of a High-Performance Inter-processor Communication Library for Multicore Platforms
Type: thesis
Full text: http://ntur.lib.ntu.edu.tw/bitstream/246246/183384/1/ntu-98-R96922116-1.pdf