https://scholars.lib.ntu.edu.tw/handle/123456789/549056
Title: | Predicting pathogenic non-coding variants on imbalanced data set using cluster ensemble sampling | Authors: | Chuang, K.-W.; CHIEN-YU CHEN |
Keywords: | Imbalanced; Machine learning; Non-coding; Pathogenic variants |
Publication date: | 2019 |
Pages: | 850-855 |
Source publication: | Proceedings - 2019 IEEE 19th International Conference on Bioinformatics and Bioengineering, BIBE 2019 |
Abstract: | In the past few years, many variants in the non-coding regions of the human genome have been reported by personal whole genome sequencing. It is a challenge to distinguish pathogenic non-coding variants from such a large number of benign non-coding variants. Many machine learning methods for predicting pathogenic non-coding variants have been proposed. However, the precision and recall rates of the currently existing methods decline rapidly when the number of negative samples in the data increases. Both under- and over-sampling techniques have been employed in the field of machine learning to resolve the poor performance of classification methods on imbalanced data. Even so, we observed that a more sophisticated method with better performance is still much needed for the problem of predicting pathogenic non-coding variants. In this regard, this study aims at presenting a general framework for imbalanced data learning, CE-SMURF, which incorporates both Cluster Ensemble (CE) sampling and hyper-ensemble techniques to further improve the prediction accuracy of detecting pathogenic non-coding variants. The results demonstrate that the final setting of CE-SMURF (f = 0, r = 0.1) is superior in training, and outperforms other existing methods on the testing data, providing a valuable insight to tackle the imbalanced learning issue for many future applications in the field of genomic precision medicine. © 2019 IEEE. |
URI: | https://www.scopus.com/inward/record.url?eid=2-s2.0-85078574342&partnerID=40&md5=c2f9a7779bd341c84577dea62d811c2b https://scholars.lib.ntu.edu.tw/handle/123456789/549056 |
DOI: | 10.1109/BIBE.2019.00158 |
SDG/Keywords: | Bioinformatics; Learning systems; Machine learning; Classification methods; Imbalanced; Imbalanced Data-sets; Machine learning methods; Non-coding; Pathogenic variants; Precision and recall; Whole genome sequencing; Forecasting |
Appears in collections: | Department of Biomechatronics Engineering |
Items in the IR system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.