Implementation of MPI-Based Fault Tolerant Middleware with Non-Blocking Message Logging Protocol
Date Issued
2005
Date
2005
Author(s)
Jiang, Chung-Ching
Abstract
In recent years, parallel computing has become one of the main ways to increase computer performance. High-performance parallel computers are used in commerce, defense, and science, where high-performance computing benefits numerical simulation, a major means of accelerating scientific progress.
Many researchers have begun to study and develop distributed systems that perform parallel computing. Designing a distributed system is complicated and difficult. Among the many characteristics that deserve particular design attention, fault tolerance is an important one. Because any computer in a distributed system may fail, fault tolerance gives the system the capability to deal with such failures. How to keep a system running correctly despite failures during execution is therefore an important topic in fault tolerance research.
Rollback-recovery methods are divided into checkpointing and message logging, each with its own algorithms. To date, no single algorithm is acknowledged as the most efficient, so a different algorithm has to be chosen for each environment or circumstance to get the best efficiency.
The goal of this paper is to discuss the differences among fault tolerance methods in MPI-based distributed systems. We implement an MPI-based fault-tolerant middleware with a non-blocking message logging protocol, measure its performance, and share practical experience with others.
Subjects
平行式計算
容錯
檢查點
訊息紀錄
parallel computing
fault tolerance
MPI
checkpoint
message log
Type
thesis
File(s)
Name
ntu-94-R92921082-1.pdf
Size
23.31 KB
Format
Adobe PDF
Checksum
(MD5):435c0266b28a7b746e51687ed8bee6cd
