Abstract: This project studies hybrid hierarchical clustering algorithms for protein sequences. It aims to give biologists a protein hierarchy whose levels match protein families of different sizes. In our recent study, we used statistical models to improve the efficiency of traditional hierarchical clustering algorithms for protein family analysis. The proposed statistical-model-based algorithm also gives users a summarized hierarchy that is much smaller than the original binary tree generated by traditional hierarchical clustering algorithms.
Protein sequence clustering still poses several challenges. In this project, we will continue our recent study by designing a hybrid hierarchical clustering algorithm based on statistical models. To meet the needs of protein family analysis, the first problem to tackle is that some multi-function proteins should be placed at more than one position in the protein hierarchy. Second, protein families of different sizes have different properties: smaller families require homogeneity, while larger families must exploit transitivity in order to find remote homology. The hierarchical clustering algorithm should therefore combine different criteria for controlling the formation of new clusters.
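As an illustration only, the idea of switching the merging criterion with cluster size can be sketched as follows. This is not the project's algorithm: it assumes a precomputed symmetric similarity matrix, and it stands in complete linkage (minimum pairwise similarity) for homogeneity and single linkage (maximum pairwise similarity) for transitivity. The names `hybrid_clustering`, `size_threshold`, `min_similarity`, and the matrix `SIM` are hypothetical.

```python
from itertools import combinations


def cluster_similarity(a, b, sim, size_threshold):
    """Score a candidate merge of clusters a and b with a size-dependent rule."""
    pair_sims = [sim[i][j] for i in a for j in b]
    if len(a) + len(b) <= size_threshold:
        # Small merged cluster: require homogeneity, so the weakest pair decides.
        return min(pair_sims)
    # Large merged cluster: allow transitivity, so one strong link is enough.
    return max(pair_sims)


def hybrid_clustering(sim, size_threshold=3, min_similarity=0.5):
    """Greedy agglomerative clustering; stops when no merge reaches min_similarity."""
    clusters = [frozenset([i]) for i in range(len(sim))]
    while len(clusters) > 1:
        a, b = max(
            combinations(clusters, 2),
            key=lambda pair: cluster_similarity(pair[0], pair[1], sim, size_threshold),
        )
        if cluster_similarity(a, b, sim, size_threshold) < min_similarity:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return sorted(sorted(c) for c in clusters)


# Made-up similarities (assumption): proteins 0-2 form a tight family,
# protein 3 is linked only through protein 2, protein 4 is unrelated.
SIM = [
    [1.00, 0.90, 0.80, 0.20, 0.10],
    [0.90, 1.00, 0.85, 0.20, 0.10],
    [0.80, 0.85, 1.00, 0.70, 0.10],
    [0.20, 0.20, 0.70, 1.00, 0.10],
    [0.10, 0.10, 0.10, 0.10, 1.00],
]
```

With `size_threshold=3`, protein 3 joins the family through its single strong link to protein 2 once the cluster is large enough for the transitive rule to apply. With a threshold larger than the data set, the homogeneity rule applies throughout and protein 3 stays separate.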
The duration of this project is one year. In the first half of the year, we plan to identify the proteins that should be duplicated at the bottom level of the hierarchy by examining the distribution of similarities between a given protein and all other proteins. In the second half of the year, we will design different controlling criteria and apply them at different stages of the clustering process to generate a hierarchy that better matches the protein families.
protein sequence clustering