Predicting pathogenic non-coding variants on imbalanced data set using cluster ensemble sampling

Chuang, K.-W.; CHIEN-YU CHEN

doi:10.1109/BIBE.2019.00158

Journal

Proceedings - 2019 IEEE 19th International Conference on Bioinformatics and Bioengineering, BIBE 2019

Pages

850-855

Date Issued

2019

Author(s)

Chuang, K.-W.

CHIEN-YU CHEN

DOI

10.1109/BIBE.2019.00158

URI

https://www.scopus.com/inward/record.url?eid=2-s2.0-85078574342&partnerID=40&md5=c2f9a7779bd341c84577dea62d811c2b

https://scholars.lib.ntu.edu.tw/handle/123456789/549056

Abstract

In the past few years, personal whole-genome sequencing has reported many variants in the non-coding regions of the human genome. Distinguishing pathogenic non-coding variants from the far larger number of benign ones is a challenge. Many machine learning methods for predicting pathogenic non-coding variants have been proposed. However, the precision and recall of existing methods decline rapidly as the number of negative samples in the data increases. Both under- and over-sampling techniques have been employed in machine learning to address the poor performance of classification methods on imbalanced data. Even so, we observed that a more sophisticated method with better performance is still needed for predicting pathogenic non-coding variants. This study therefore presents CE-SMURF, a general framework for imbalanced data learning that incorporates both Cluster Ensemble (CE) sampling and hyper-ensemble techniques to further improve the accuracy of detecting pathogenic non-coding variants. The results demonstrate that the final setting of CE-SMURF (f = 0, r = 0.1) is superior in training and outperforms other existing methods on the testing data, providing valuable insight into the imbalanced learning problem for future applications in genomic precision medicine. © 2019 IEEE.
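The general idea of combining cluster-based under-sampling with an ensemble can be illustrated with a minimal sketch. This is not the CE-SMURF implementation: the record does not describe the algorithm's details, and the parameters f and r from the abstract are not modeled here. All function names, the k-means clustering step, the nearest-centroid base learner, and the synthetic data are assumptions chosen only to keep the example self-contained.

```python
import random
from statistics import mean

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(mean(col) for col in zip(*points))

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means; returns the non-empty clusters as lists of points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[j].append(p)
        # Keep the previous center if a cluster ended up empty.
        centers = [centroid(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return [c for c in clusters if c]

def balanced_subsets(positives, negatives, n_members=5, k=3, seed=0):
    """Yield one (positives, sampled_negatives) pair per ensemble member.

    Negatives are drawn cluster by cluster, so each region of the
    majority class is represented in every roughly balanced subset.
    """
    rng = random.Random(seed)
    clusters = kmeans(negatives, k, seed=seed)
    per_cluster = max(1, len(positives) // len(clusters))
    for _ in range(n_members):
        sampled = []
        for c in clusters:
            sampled.extend(rng.sample(c, min(per_cluster, len(c))))
        yield positives, sampled

def train_member(positives, negatives):
    """Hypothetical base learner: stores one centroid per class."""
    return centroid(positives), centroid(negatives)

def predict(models, x):
    """Fraction of members that place x closer to the positive centroid."""
    votes = sum(sq_dist(x, pos_c) < sq_dist(x, neg_c)
                for pos_c, neg_c in models)
    return votes / len(models)

# Synthetic, heavily imbalanced data (10 positives vs. 500 negatives).
rng = random.Random(1)
positives = [(rng.gauss(3, 0.5), rng.gauss(3, 0.5)) for _ in range(10)]
negatives = [(rng.gauss(0, 1.0), rng.gauss(0, 1.0)) for _ in range(500)]
models = [train_member(p, n) for p, n in balanced_subsets(positives, negatives)]
```

Each ensemble member sees all positives but only a small, cluster-stratified sample of negatives, so no single model is trained on the full imbalance; `predict(models, x)` then averages the members' votes for a new point `x`.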

Subjects

Imbalanced; Machine learning; Non-coding; Pathogenic variants

SDGs

SDG1

SDG3

Other Subjects

Bioinformatics; Learning systems; Machine learning; Classification methods; Imbalanced; Imbalanced Data-sets; Machine learning methods; Non-coding; Pathogenic variants; Precision and recall; Whole genome sequencing; Forecasting

Type

conference paper
