Efficient Filtration Methods for Biological Sequence Databases

Lin, Chao-Wen

Efficient Filtration Methods for Biological Sequence Databases

Date Issued

2007

Date

2007

Author(s)

Lin, Chao-Wen

URI

http://ntur.lib.ntu.edu.tw//handle/246246/179829

Abstract

In this dissertation, we propose two filtration methods for two sequence similarity queries of biological databases: DNA sequence similarity query and siRNA off-target query. Both queries are used to retrieve similar sequences from biological sequence databases. The DNA sequence similarity query is used to retrieve all sequences in a database so that the edit distance between the query and a sequence retrieved is less than a user-specified threshold, where the length of such a query is often long. It is mostly used to retrieve highly similar sequences. A small interfering RNA (siRNA), also called silencing RNA, is used to knock a gene down by an artificial mechanism. Although an siRNA is designed to silence a specific gene, many researchers have shown that the genes highly similar to the siRNA are also silenced or their expressions are depressed. An siRNA off-target query can be used to find those highly similar genes in a database, where the Hamming distance between the query and the sequence retrieved is less than a user-specified threshold. Its query length is usually short (normally 19~25 base pairs). For both queries, a filtration method can be used to screen out most unqualified data sequences from the database and leave only a small number of candidate sequences for sequence comparisons. For the DNA sequence similarity retrieval query, we propose a method called Transformation-based Database Filtration method (TDF). In the TDF method, the data sequences are first divided into several blocks, each of which is transformed into a feature vector by Haar wavelet transform and stored in an index file. Then, we search the index and extract those candidate blocks whose edit distance to the feature vector of the query sequence is less than a user-specified threshold. For the siRNA off-target query, we propose a method called Common Prefix Filtration method (CPF). In the CPF method, the data sequences are first sorted and stored in an index tree according to their common prefixes. Then, we search the index tree and filter out those sequences whose Hamming distance to the query sequence is greater than a user-specified threshold. We extract those possible candidate sequences and pass them to the verification stage. The experiment results show that our both filtration methods can filter out most unqualified sequences and guarantee no false negatives. The experimental results show that the TDF method outperforms the QUASAR and YM methods while the CPF method outperforms the YM and SoS methods.

Subjects

sequence query

biological sequence database

filtration method

File(s)

Name

ntu-96-D89725003-1.pdf

Size

23.32 KB

Format

Adobe PDF

Checksum

(MD5):9b993fa34918ad4b4d485e0490a7c4a2

Efficient Filtration Methods for Biological Sequence Databases

關於 (About)

聯絡資訊 (Contact Us)

相關網站 (Useful Links)

關於開放取用 (Open Access, OA)

出版社期刊論文授權政策 (Copyright)

使用說明 (Instructions)

登入說明 (Sign-in)

匯入著作 (Submission)