2024, Issue 05, Vol. 52, pp. 1-9
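Machine learning-based pathogenicity prediction for nonsynonymous variants and feature importance analysis
Funding (Foundation): Guangzhou Science and Technology Program Project (2023A03J0540)
Email: zb-yushihui@kingmed.com.cn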
DOI:
Abstract:

Objective: This study aimed to assess how well several machine learning models integrate existing tools for predicting the pathogenicity of nonsynonymous variants, and to validate the contribution and performance of each prediction tool through feature importance analysis and validation on multiple datasets.

Methods: Twenty-seven pathogenicity prediction tools were used to score nonsynonymous variants in the ClinVar dataset and three external validation sets. Missing values were handled with three imputation methods: mean, median, and random forest imputation. Four classical machine learning models (random forest, neural network, naive Bayes, and extreme gradient boosting tree) were used to integrate the prediction tools; combined with the three imputation methods, this yielded 12 models. The best imputation method was selected from the accuracy and kappa values on the internal validation set, and the four models using that imputation method were then compared on multiple performance metrics. The importance of each prediction tool in the ensemble model was assessed with feature importance scores and verified on the internal and external validation sets.

Results: Random forest imputation handled missing values best, with a mean accuracy of 0.9080 and a mean kappa of 0.8087. Among the four machine learning algorithms, the extreme gradient boosting tree model showed the best overall performance across the metrics. The neural network and random forest models did not differ appreciably from the extreme gradient boosting tree model, while the naive Bayes model had the highest specificity and the shortest runtime but a lower kappa. Feature importance scores indicated that AlphaMissense, VEST4, and MVP were the core features of the extreme gradient boosting tree model. In the internal validation set and in all three external validation sets, the AUC values of AlphaMissense, VEST4, and DEOGEN2 each ranked in the top five. The ensemble extreme gradient boosting tree model built in this study reached an AUC of 0.9763 on the internal validation set, higher than any single prediction score, and its AUC was above 0.96 on every external validation set.

Conclusions: The extreme gradient boosting tree model with random forest imputation of missing values performed best at predicting the pathogenicity of nonsynonymous variants, and it can be considered when integrating multiple prediction tools. Prediction tools such as AlphaMissense and VEST4 contributed substantially to the ensemble model, showed high reliability and accuracy, and can provide reliable pathogenicity predictions for nonsynonymous variants.
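The evaluation metrics named in the abstract (accuracy, kappa, AUC) and the simpler imputation strategies (mean, median) can be illustrated with a minimal standard-library sketch. This is not the authors' pipeline: the toy labels and scores are invented, the 0.5 decision threshold is an assumption, and the helper names (`impute`, `cohen_kappa`, `auc`) are hypothetical. Random forest imputation and the ensemble models themselves are not shown.

```python
from statistics import mean, median


def impute(column, strategy="median"):
    """Fill None entries in one predictor-score column with the column mean or median."""
    observed = [v for v in column if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in column]


def accuracy(y_true, y_pred):
    """Fraction of variants whose predicted class matches the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa for binary labels: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)  # observed agreement
    # Chance agreement: product of the marginal class frequencies, summed over classes.
    p_e = sum(
        (sum(t == c for t in y_true) / n) * (sum(p == c for p in y_pred) / n)
        for c in (0, 1)
    )
    return (p_o - p_e) / (1 - p_e)


def auc(y_true, scores):
    """AUC as the probability that a pathogenic variant outscores a benign one.

    Ties between a pathogenic and a benign score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy data (invented): 1 = pathogenic, 0 = benign; None marks a missing tool score.
labels = [1, 1, 1, 0, 0, 0]
tool_score = [0.9, None, 0.7, 0.4, 0.2, None]

filled = impute(tool_score, "median")
calls = [1 if s >= 0.5 else 0 for s in filled]  # assumed threshold of 0.5

print("accuracy:", accuracy(labels, calls))
print("kappa:", cohen_kappa(labels, calls))
print("AUC:", auc(labels, filled))
```

Kappa corrects accuracy for agreement expected by chance, which is why a model can combine high specificity with a comparatively low kappa, as the abstract reports for naive Bayes.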


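Basic information:

DOI:

CLC number: R440; TP181

Citation:

[1] Shen Maoting, Lin Junwei, Fan Xijie, et al. Machine learning-based pathogenicity prediction for nonsynonymous variants and feature importance analysis[J]. Journal of Guangzhou Medical University (广州医科大学学报), 2024, 52(05): 1-9.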

