Investigating Variations of Transcription Factor Binding Sites by 1000 Genomes Data
Date Issued
2015
Date
2015
Author(s)
Wu, Po-Chun
Abstract
Gene regulation is essential for maintaining cellular functions, so how biological systems regulate gene expression is an important research topic. The regulation of cell function can be divided into many parts, including gene expression, mRNA transcription and splicing, post-translational modification, etc. This study aims to explore the activation and inactivation of gene expression through the interaction between transcription factors and double-stranded DNA. Among the three billion base pairs of the human genome, biologically significant fragments such as genes or transcription factor binding sites account for only a small portion of the DNA. Transcription factor binding motifs are about 5 to 15 nucleotides long. Accordingly, how to identify transcription factor binding sites and how they achieve gene regulation are important research questions. Meanwhile, the binding strength between transcription factors and their binding sites may also affect the regulation of gene expression. In the 1990s, the Human Genome Project was launched. Limited by the technology of the time, the project consumed a great deal of money and manpower. In 2001, a draft sequence covering the 23 human chromosomes, about three billion bases in total, was published. This was a considerable milestone in human genome research. With the development of biotechnology and the falling cost of computation, genome sequencing technology began to advance rapidly. In 2008, the 1000 Genomes Project started, planning to use faster and easier sequencing technology to sequence more than a thousand human genomes within three years. In 2012, a total of 1,092 human genomes were published. To date, the latest dataset of the project contains 2,504 human genomes. The completion of the human genome sequence allows researchers to perform high-throughput screening of transcription factor binding sites.
The growing number of individual genome datasets provides a wealth of research topics and offers a glimpse of the differences between individuals in transcription factor binding sites. The objective of this study is to use the data of the 1000 Genomes Project to explore individual variations in transcription factor binding sites and the possibility of applying them to genetic tests. This study collected the binding site data of 34 human transcription factors from the JASPAR database and combined this information with the variant data of the 1000 Genomes Project. The analysis shows that only about 3% of the positions in JASPAR-annotated transcription factor binding sites carry individual variations. Furthermore, the positions with individual variations are not consistent with the original motifs of the transcription factor binding sites: some individual variations occur at positions where the corresponding motif implies that variation is not allowed. To further investigate the rationale behind this inconsistency, this study used an online tool named PiDNA, which predicts the binding motif of a DNA-binding protein from protein-DNA complex structures. This study employed such binding motifs to explore potential minor forms that might previously have been overlooked. Finally, this study discusses the future application of personal genetic diagnosis and how to use existing bioinformatics tools and public databases to assess the importance of variants observed in transcription factor binding sites. It is expected that this study can provide novel insights for individual genetic tests in personalized medicine.
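The step of combining binding-site coordinates with variant positions can be illustrated with a minimal sketch. It is not the thesis's actual pipeline: the function name `fraction_variable_positions`, the tuple layouts, and the toy coordinates are all hypothetical, and the sketch assumes 1-based inclusive site intervals that do not overlap one another.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def fraction_variable_positions(sites, variants):
    """Fraction of binding-site positions that carry at least one variant.

    sites:    iterable of (chrom, start, end), 1-based inclusive (assumed
              non-overlapping, so no position is counted twice).
    variants: iterable of (chrom, pos), 1-based.
    Both layouts are assumptions made for this illustration.
    """
    # Collect distinct variant positions per chromosome, then sort them
    # so each site can be queried with binary search.
    by_chrom = defaultdict(set)
    for chrom, pos in variants:
        by_chrom[chrom].add(pos)
    sorted_pos = {chrom: sorted(p) for chrom, p in by_chrom.items()}

    total = 0
    variable = 0
    for chrom, start, end in sites:
        total += end - start + 1
        positions = sorted_pos.get(chrom, [])
        # Count variant positions falling inside [start, end].
        variable += bisect_right(positions, end) - bisect_left(positions, start)

    return variable / total if total else 0.0


# Toy data (hypothetical coordinates, not taken from JASPAR or 1000 Genomes).
toy_sites = [("chr1", 100, 109), ("chr1", 200, 209), ("chr2", 50, 59)]
toy_variants = [("chr1", 105), ("chr1", 150), ("chr2", 59), ("chr2", 60)]
toy_fraction = fraction_variable_positions(toy_sites, toy_variants)
print(f"{toy_fraction:.3f}")
```

In the toy data, two of the thirty binding-site positions carry a variant. On real data, the site intervals would come from the JASPAR annotations and the positions from the 1000 Genomes variant calls, after both are placed on the same reference assembly.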
Subjects
transcription factor
transcription factor binding site
transcription factor binding motif
1000 Genomes Project
Type
thesis
