Study of Algorithm and Hardware Design for Genome Reassembly by Short Reads
Date Issued
2012
Date
2012
Author(s)
Chang, Bo-Yi
Abstract
Genome governs the behavior and traits of a being. It is the DNA sequence in cell nuclei. DNA sequences consist of four types of nucleobases - A, C, G and T. A DNA sequence can be regarded as a long string constructed by these four alphabets. By the DNA sequences of an individual, we can analyze potential hereditary diseases it may have. Human DNA sequence is surely the most popular topic.
DNA sequencing machine can decode the alphabet on a DNA sequence. However, contemporary DNA sequencing machines have a length limit for incoming DNA sequences(about 20~1000 nucleobases). The total length of human DNA sequences is about 3×〖10〗^9 nucleobases, which can not be directly decoded by DNA sequencing machines.
So, to decode the alphabet of the DNA sequence of a human, we need to extract multiple copies of the DNA sequences from him/her, and then randomly cut the sequences into small pieces with length under 1000 nucleobases for DNA sequencing machines to decode. These small pieces are called short reads. After the alphabet on short reads is decoded, we reconstruct the original DNA sequence by these short reads.
Since the DNA sequences between two different human are very similar, we can refer to a reconstructed DNA sequence of another human(which is called reference sequence). For each short read, find the most similar subsequence and the location of the most similar subsequence (which is called best mapping location) on the reference sequence. The original DNA sequence can be reconstructed by putting each short read to the best mapping location. The process of finding the most similar subsequence on the reference sequence is called short read alignment.
Currently, short read alignment algorithms are mostly done in software. After genome reassembly become a popular medical service in hospitals, the speed of software is not enough for huge amount of needs. Short read alignment contains candidate mapping location(CML) generator and similarity score with actual alignment computation. State-of-the-art CML generators consumes too much memory when implemented in hardware. We propose Bloom filter-based CML generator, which has much lower memory consumption than state-of-the-art solutions. Similarity score computation and actual alignment computation are usually done with Smith-Waterman algorithm, which is implemented in systolic array for hardware implementations. We propose a modified processing element in the systolic array to have a smaller area and shorter critical path.
Subjects
Genome reassembly
short read alignment
Bloom filter
candidate mapping location
Smith-Waterman algorithm
SDGs
Type
thesis
File(s)![Thumbnail Image]()
Loading...
Name
ntu-101-R99943005-1.pdf
Size
23.32 KB
Format
Adobe PDF
Checksum
(MD5):5212dd9fca221827ba064c13ab760686
