Efficient Filtration Methods for Biological Sequence Databases
Date Issued
2007
Author(s)
Lin, Chao-Wen
Abstract
In this dissertation, we propose two filtration methods for two sequence similarity queries on biological databases: the DNA sequence similarity query and the siRNA off-target query. Both queries retrieve similar sequences from biological sequence databases. The DNA sequence similarity query retrieves all sequences in a database whose edit distance to the query is less than a user-specified threshold; such queries are often long, and the query is mostly used to retrieve highly similar sequences. A small interfering RNA (siRNA), also called silencing RNA, is used to knock down a gene through an artificial mechanism. Although an siRNA is designed to silence a specific gene, many researchers have shown that genes highly similar to the siRNA are also silenced or have their expression reduced. An siRNA off-target query finds those highly similar genes in a database, where the Hamming distance between the query and each retrieved sequence is less than a user-specified threshold. Its query length is usually short (normally 19 to 25 base pairs). For both queries, a filtration method can screen out most unqualified data sequences from the database, leaving only a small number of candidate sequences for sequence comparison. For the DNA sequence similarity query, we propose the Transformation-based Database Filtration (TDF) method. In the TDF method, the data sequences are first divided into several blocks, each of which is transformed into a feature vector by the Haar wavelet transform and stored in an index file. We then search the index and extract the candidate blocks whose edit distance to the feature vector of the query sequence is less than a user-specified threshold. For the siRNA off-target query, we propose the Common Prefix Filtration (CPF) method.
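As an illustration of the block transformation step in the TDF method, the following is a minimal sketch of turning one fixed-length DNA block into a feature vector with a Haar wavelet transform. The abstract does not say how nucleotides are mapped to numbers, how many coefficients are kept, or how the index file is laid out, so the per-base indicator encoding, the `keep` parameter, and the function names `haar_transform` and `block_feature_vector` are all assumptions made for this example and are not taken from the dissertation.

```python
def haar_transform(values):
    """Full Haar decomposition by repeated pairwise averaging and differencing.

    Uses the unnormalized (average / half-difference) form; the input length
    must be a power of two.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = [float(v) for v in values]
    length = n
    while length > 1:
        half = length // 2
        avgs = [(out[2 * i] + out[2 * i + 1]) / 2 for i in range(half)]
        diffs = [(out[2 * i] - out[2 * i + 1]) / 2 for i in range(half)]
        # Averages go to the front, detail coefficients follow them.
        out[:length] = avgs + diffs
        length = half
    return out


def block_feature_vector(block, keep=2):
    """Hypothetical feature vector for one DNA block.

    Assumption: each base in "ACGT" gets a 0/1 indicator sequence, which is
    Haar-transformed and truncated to its first `keep` (coarsest) coefficients.
    """
    features = []
    for base in "ACGT":
        indicator = [1 if ch == base else 0 for ch in block]
        features.extend(haar_transform(indicator)[:keep])
    return features


if __name__ == "__main__":
    print(block_feature_vector("AACC"))
```

Keeping only the coarsest coefficients gives a short, fixed-size summary per block, which is what makes an index over the blocks cheap to scan; how the dissertation actually compares these vectors is not reproduced here.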
In the CPF method, the data sequences are first sorted and stored in an index tree according to their common prefixes. We then search the index tree and filter out the sequences whose Hamming distance to the query sequence is greater than a user-specified threshold. The remaining candidate sequences are extracted and passed to the verification stage. The experimental results show that both filtration methods filter out most unqualified sequences while guaranteeing no false negatives. They also show that the TDF method outperforms the QUASAR and YM methods, and that the CPF method outperforms the YM and SoS methods.
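The idea of filtering by common prefixes can be illustrated with a small prefix tree searched under a Hamming-distance bound. This is only a sketch of the general technique, not the CPF index itself: the dictionary-based trie, the names `build_trie` and `search_within_hamming`, the `"$"` end marker, and the inclusive `max_mismatches` bound are assumptions introduced for the example, and it assumes the stored sequences have the same length as the query.

```python
def build_trie(sequences):
    """Store sequences in a nested-dict prefix tree; "$" marks a complete sequence."""
    root = {}
    for seq in sequences:
        node = root
        for ch in seq:
            node = node.setdefault(ch, {})
        node["$"] = seq
    return root


def search_within_hamming(root, query, max_mismatches):
    """Return (sequence, mismatches) pairs with Hamming distance <= max_mismatches.

    A whole subtree is skipped as soon as its shared prefix already exceeds
    the mismatch budget, so sequences with a common prefix are rejected together.
    """
    hits = []
    stack = [(root, 0, 0)]  # (node, depth, mismatches so far)
    while stack:
        node, depth, mismatches = stack.pop()
        if depth == len(query):
            if "$" in node:
                hits.append((node["$"], mismatches))
            continue
        for ch, child in node.items():
            if ch == "$":
                continue
            cost = mismatches + (ch != query[depth])
            if cost <= max_mismatches:
                stack.append((child, depth + 1, cost))
    return hits


if __name__ == "__main__":
    trie = build_trie(["ACGT", "ACGA", "TTTT", "ACCA"])
    print(sorted(search_within_hamming(trie, "ACGT", 1)))
```

Because a branch is abandoned only once its prefix alone exceeds the budget, no sequence within the bound can be missed, which matches the no-false-negative property claimed for the filtration stage.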
Subjects
sequence query
biological sequence database
filtration method
File(s)
Name
ntu-96-D89725003-1.pdf
Size
23.32 KB
Format
Adobe PDF
Checksum
(MD5):9b993fa34918ad4b4d485e0490a7c4a2
