Study of Algorithm and Hardware Design for Genome Reassembly by Short Reads

Chang,  Bo-Yi

Study of Algorithm and Hardware Design for Genome Reassembly by Short Reads

Date Issued

2012

Date

2012

Author(s)

Chang, Bo-Yi

URI

http://ntur.lib.ntu.edu.tw//handle/246246/256668

Abstract

Genome governs the behavior and traits of a being. It is the DNA sequence in cell nuclei. DNA sequences consist of four types of nucleobases - A, C, G and T. A DNA sequence can be regarded as a long string constructed by these four alphabets. By the DNA sequences of an individual, we can analyze potential hereditary diseases it may have. Human DNA sequence is surely the most popular topic. DNA sequencing machine can decode the alphabet on a DNA sequence. However, contemporary DNA sequencing machines have a length limit for incoming DNA sequences(about 20~1000 nucleobases). The total length of human DNA sequences is about 3×〖10〗^9 nucleobases, which can not be directly decoded by DNA sequencing machines. So, to decode the alphabet of the DNA sequence of a human, we need to extract multiple copies of the DNA sequences from him/her, and then randomly cut the sequences into small pieces with length under 1000 nucleobases for DNA sequencing machines to decode. These small pieces are called short reads. After the alphabet on short reads is decoded, we reconstruct the original DNA sequence by these short reads. Since the DNA sequences between two different human are very similar, we can refer to a reconstructed DNA sequence of another human(which is called reference sequence). For each short read, find the most similar subsequence and the location of the most similar subsequence (which is called best mapping location) on the reference sequence. The original DNA sequence can be reconstructed by putting each short read to the best mapping location. The process of finding the most similar subsequence on the reference sequence is called short read alignment. Currently, short read alignment algorithms are mostly done in software. After genome reassembly become a popular medical service in hospitals, the speed of software is not enough for huge amount of needs. Short read alignment contains candidate mapping location(CML) generator and similarity score with actual alignment computation. State-of-the-art CML generators consumes too much memory when implemented in hardware. We propose Bloom filter-based CML generator, which has much lower memory consumption than state-of-the-art solutions. Similarity score computation and actual alignment computation are usually done with Smith-Waterman algorithm, which is implemented in systolic array for hardware implementations. We propose a modified processing element in the systolic array to have a smaller area and shorter critical path.

Subjects

Genome reassembly

short read alignment

Bloom filter

candidate mapping location

Smith-Waterman algorithm

SDGs

[SDGs]SDG3

Type

thesis

File(s)

Name

ntu-101-R99943005-1.pdf

Size

23.32 KB

Format

Adobe PDF

Checksum

(MD5):5212dd9fca221827ba064c13ab760686

Study of Algorithm and Hardware Design for Genome Reassembly by Short Reads

關於 (About)

聯絡資訊 (Contact Us)

相關網站 (Useful Links)

關於開放取用 (Open Access, OA)

出版社期刊論文授權政策 (Copyright)

使用說明 (Instructions)

登入說明 (Sign-in)

匯入著作 (Submission)