Predicting pathogenic non-coding variants on imbalanced data set using cluster ensemble sampling

Authors: Chuang, K.-W.; Chien-Yu Chen
Year: 2019
Record date: 2021-02-20
Scopus: https://www.scopus.com/inward/record.url?eid=2-s2.0-85078574342&partnerID=40&md5=c2f9a7779bd341c84577dea62d811c2b
Handle: https://scholars.lib.ntu.edu.tw/handle/123456789/549056

Abstract: In the past few years, personal whole genome sequencing has reported many variants in the non-coding regions of the human genome. Distinguishing pathogenic non-coding variants from the far larger number of benign ones is a challenge. Many machine learning methods for predicting pathogenic non-coding variants have been proposed. However, the precision and recall of existing methods decline rapidly as the number of negative samples in the data increases. Both under- and over-sampling techniques have been employed in machine learning to address the poor performance of classification methods on imbalanced data. Even so, we observed that a more sophisticated, better-performing method is still needed for predicting pathogenic non-coding variants. This study therefore presents a general framework for imbalanced data learning, CE-SMURF, which incorporates both Cluster Ensemble (CE) sampling and hyper-ensemble techniques to further improve the accuracy of detecting pathogenic non-coding variants. The results demonstrate that the final setting of CE-SMURF (f = 0, r = 0.1) is superior in training and outperforms other existing methods on the testing data, providing valuable insight into the imbalanced learning problem for future applications in genomic precision medicine. © 2019 IEEE.

Keywords: Imbalanced; Machine learning; Non-coding; Pathogenic variants
SDGs: SDG1; SDG3
Index terms: Bioinformatics; Learning systems; Machine learning; Classification methods; Imbalanced; Imbalanced Data-sets; Machine learning methods; Non-coding; Pathogenic variants; Precision and recall; Whole genome sequencing; Forecasting
Type: conference paper
DOI: 10.1109/BIBE.2019.00158
Scopus ID: 2-s2.0-85078574342