MutPred-Indel is a machine learning-based method and software package that integrates genetic and molecular data to reason probabilistically about the pathogenicity of nonframeshifting indel variants. The model provides both pathogenicity prediction and a ranked list of molecular alterations potentially affecting phenotype. It is trained on a set of pathogenic and unlabeled (putatively neutral) variants obtained from the Human Gene Mutation Database (HGMD) [1] and ExAC [2]. MutPred-Indel is a bagged ensemble of 100 feed-forward neural networks, each trained on a balanced subset of pathogenic and putatively neutral variants.
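The bagging scheme described above can be illustrated with a minimal sketch. It is not the MutPred-Indel implementation: each ensemble member here is a single logistic unit standing in for a feed-forward neural network, the ensemble has 10 members rather than 100, the data are synthetic two-feature points, and drawing the balanced subsets with replacement is an assumption. All function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def balanced_subsets(positives, unlabeled, n_members, rng):
    """Yield one class-balanced bootstrap sample per ensemble member."""
    k = min(len(positives), len(unlabeled))
    for _ in range(n_members):
        pos = [(rng.choice(positives), 1.0) for _ in range(k)]
        neg = [(rng.choice(unlabeled), 0.0) for _ in range(k)]
        yield pos + neg


def train_member(samples, n_features, rng, epochs=50, lr=0.1):
    """Stand-in base learner (assumption): one logistic unit fitted by SGD."""
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        rng.shuffle(samples)
        for x, y in samples:
            err = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            weights = [w - lr * err * v for w, v in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def ensemble_score(models, x):
    """Average the member outputs into a single score between 0 and 1."""
    scores = [sigmoid(sum(w * v for w, v in zip(ws, x)) + b) for ws, b in models]
    return sum(scores) / len(scores)


rng = random.Random(0)
# Synthetic toy data; real inputs would be per-variant feature vectors.
pathogenic = [[rng.gauss(2, 1), rng.gauss(2, 1)] for _ in range(20)]
unlabeled = [[rng.gauss(-2, 1), rng.gauss(-2, 1)] for _ in range(60)]

models = [train_member(s, 2, rng)
          for s in balanced_subsets(pathogenic, unlabeled, 10, rng)]
print(ensemble_score(models, [2.0, 2.0]), ensemble_score(models, [-2.0, -2.0]))
```

Because every member sees an equal number of pathogenic and putatively neutral examples, the larger unlabeled class does not dominate any single model, and averaging over members smooths out the variation introduced by resampling.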
MutPred-Indel was developed by Kymberleigh Pagel at Indiana University Bloomington, and was a joint project of the Mooney group at the University of Washington and the Radivojac group at Indiana University.
Pagel, Kymberleigh A., et al. "Pathogenicity and functional impact of non-frameshifting insertion/deletion variation in the human genome." PLoS Computational Biology 15.6 (2019): e1007112.
This work is funded by:
NIH R01LM009722 (PI: Mooney)
NIH R01MH105524 (PI: Iakoucheva and Radivojac)
Indiana University Precision Health Initiative (PI: Radivojac)
Neutral training data available here.
Pathogenic training data available here.
ASD variants available here.
Code to generate structural and functional features available here.
Beyond MutPred-Indel, several other tools have been developed as part of the MutPred project. They are listed below in chronological order of their development:
MutPred: A random forest model to predict the effect of single amino acid substitutions. Compared with MutPred2, it was trained on a smaller training set with fewer features and fewer predicted molecular mechanisms. Proof-of-concept predecessor to MutPred2.
MutPred2: A bagged ensemble of 30 feed-forward neural networks that integrates genetic and molecular data to reason probabilistically about the pathogenicity of amino acid substitutions.
Functional regulatory SNP predictor: An ensemble of decision trees for the prediction of SNPs in regulatory regions that impact gene expression. A newer, enhanced method called RSVP has since been developed; it uses an expanded feature set that includes information from ENCODE.
MutPred-Splice: A random forest-based approach to prioritize exonic variants (missense or samesense) that are likely to disrupt pre-mRNA splicing in whole-genome sequencing data sets.
MutPred-LOF: A bagged ensemble of 100 neural networks to predict the effect of frameshifting insertion/deletion and stop-gain variants.
1. Stenson PD, Mort M, Ball EV, Evans K, Hayden M, Heywood S, Hussain M, Phillips AD, Cooper DN. The Human Gene Mutation Database: towards a comprehensive repository of inherited mutation data for medical research, genetic diagnosis and next-generation sequencing studies. Hum Genet (2017)
2. Lek M, Karczewski KJ, Minikel EV, Samocha KE, Banks E, Fennell T, O'Donnell-Luria AH, Ware JS, Hill AJ, Cummings BB, Tukiainen T, Birnbaum DP, Kosmicki JA, Duncan LE, Estrada K, Zhao F, Zou J, Pierce-Hoffman E, Berghout J, Cooper DN, Deflaux N, DePristo M, Do R, Flannick J, Fromer M, Gauthier L, Goldstein J, Gupta N, Howrigan D, Kiezun A, Kurki MI, Moonshine AL, Natarajan P, Orozco L, Peloso GM, Poplin R, Rivas MA, Ruano-Rubio V, Rose SA, Ruderfer DM, Shakir K, Stenson PD, Stevens C, Thomas BP, Tiao G, Tusie-Luna MT, Weisburd B, Won HH, Yu D, Altshuler DM, Ardissino D, Boehnke M, Danesh J, Donnelly S, Elosua R, Florez JC, Gabriel SB, Getz G, Glatt SJ, Hultman CM, Kathiresan S, Laakso M, McCarroll S, McCarthy MI, McGovern D, McPherson R, Neale BM, Palotie A, Purcell SM, Saleheen D, Scharf JM, Sklar P, Sullivan PF, Tuomilehto J, Tsuang MT, Watkins HC, Wilson JG, Daly MJ, MacArthur DG, Exome Aggregation Consortium. Analysis of protein-coding genetic variation in 60,706 humans. Nature (2016) 536(7616):285-291.