Title: Implementation of a Fault Tolerant Cluster with Error Recovery for Scientific Computation (容錯及錯誤回復叢集系統在科學計算之實作)
Author: 鄒易達 (Tzou, I-ta)
Contributor: 郭斯彥
Institution: National Taiwan University, Graduate Institute of Electrical Engineering
Year: 2007
Record dates: 2007-11-26; 2018-07-06
Type: thesis
Language: en-US
Format: application/pdf, 559700 bytes
URI: http://ntur.lib.ntu.edu.tw//handle/246246/53431
File: http://ntur.lib.ntu.edu.tw/bitstream/246246/53431/1/ntu-96-J93921042-1.pdf

Abstract

Parallel computing has recently become one of the main techniques for enhancing computer performance. High-performance computers are applied in many fields, including commerce, national defense, and science. Numerical simulation is an important method in modern science. A simulation will fail if it is interrupted while running, so fault tolerance is an important issue. Fault-tolerant techniques fall into two main categories: 1) automatic and 2) non-automatic. Basic automatic fault-tolerant techniques applied to clusters will be discussed, including coordinated and uncoordinated checkpointing and pessimistic and optimistic message logging. An automatic fault-tolerant cluster for a scientific computing environment will be implemented with coordinated checkpointing. A storage backup strategy will also be implemented with a network file server backed by a level 5 redundant array of inexpensive disks (RAID 5).

Table of Contents

Oral Examination Committee Certification
Acknowledgements
Chinese Abstract
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1 INTRODUCTION
CHAPTER 2 BACKGROUND
  2.1 PARALLEL HARDWARE ARCHITECTURES
    2.1.1 Interconnection Networks
    2.1.2 SIMD Systems
    2.1.3 Shared Memory MIMD
    2.1.4 Distributed-Memory MIMD
    2.1.5 Strengths and Weakness
  2.2 PARALLEL SOFTWARE ARCHITECTURE
    2.2.1 Message Passing Interface
    2.2.2 MPICH
CHAPTER 3 FAULT TOLERANT ARCHITECTURES
  3.1 BASIC FAULT TOLERANT METHODS
    3.1.1 Checkpointing
    3.1.2 Message Logging
  3.2 FAULT TOLERANT MPI
    3.2.1 MPICH – V1
    3.2.2 MPICH – V2
    3.2.3 MPICH – CL
  3.3 FAULT TOLERANT STORAGE
CHAPTER 4 IMPLEMENTATION
  4.1 HARDWARE ARCHITECTURE
  4.2 SOFTWARE COMPONENTS
  4.3 BACKUP STRATEGY
  4.4 EVALUATION
CHAPTER 5 CONCLUSION
REFERENCE

Keywords: Fault Tolerant; cluster; checkpoint; message log; redundant array of inexpensive disks; network file server