DNAC: A Compression Algorithm for DNA Sequences by Non-overlapping Approximate Repeats
Date Issued
2004
Date
2004
Author(s)
Chang, Chun-Ho
DOI
en-US
Abstract
The repetitive structure of genomic DNA includes many secrets to be explored. In this thesis, we apply this repetitive structure to compress the DNA sequences. The existing compression tools are not suitable because its alphabet size is not large enough. In addition, the time and space complexity of DNA compression algorithm must be as low as possible because there can be millions of bases in a DNA sequence.
We present a lossless compression algorithm, DNA Compression (DNAC), for genetic sequences by searching for non-overlapping approximate repeats. DNAC has 4 computation phases. In the first phase, it builds a suffix tree to locate exact repeats. In the second phase, all the exact repeats are extended into approximate repeats by dynamic programming in order to save more bits. In the third phase, we extract the optimal non-overlapping repeats from overlapping ones. In the last phase, we use the Fibonacci series to encode the repeats in a self-delimited way. According to our experimental results, DNAC not only achieves better compression ratio but also runs in almost linear time.
Subjects
不重疊重覆區段
費氏序列
近似重覆區段
無失真壓縮
去氧核醣核酸序列
Fibonacci series
non-overlapping repeats
DNA sequence
lossless compression
approximate repeats
Type
other
File(s)![Thumbnail Image]()
Loading...
Name
ntu-93-R91725026-1.pdf
Size
23.31 KB
Format
Adobe PDF
Checksum
(MD5):47dbf28e4db6cc73db5599015da0d334
