Abstract
Deep neural networks (DNNs) have been successfully utilized in many scientific problems for their high prediction accuracy, but their application to genetic studies remains challenging due to their poor interpretability. Here we consider the problem of scalable, robust variable selection in DNNs for the identification of putative causal genetic variants in genome sequencing studies. We identified a pronounced randomness in feature selection in DNNs due to their stochastic nature, which may hinder interpretability and give rise to misleading results. We propose an interpretable neural network model, stabilized using ensembling, with controlled variable selection for genetic studies. The merits of the proposed method include: flexible modelling of the nonlinear effect of genetic variants to improve statistical power; multiple knockoffs in the input layer to rigorously control the false discovery rate; hierarchical layers to substantially reduce the number of weight parameters and activations, and improve computational efficiency; and stabilized feature selection to reduce the randomness in identified signals. We evaluate the proposed method in extensive simulation studies and apply it to the analysis of Alzheimer’s disease genetics. We show that the proposed method, when compared with conventional linear and nonlinear methods, can lead to substantially more discoveries.
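As an illustration of the false-discovery-rate control and stabilization described in the abstract, the following is a minimal sketch and not the authors' implementation. It applies the standard single-knockoff "knockoff+" selection rule of Barber and Candès to per-feature importance statistics W_j (positive when the original variant looks more important than its knockoff copy), whereas the article uses multiple knockoffs. The function names `knockoff_plus_threshold`, `select_features` and `stabilized_importance` are hypothetical, and plain averaging across runs is an assumed stand-in for the article's ensembling step, whose exact aggregation rule is not given in this section.

```python
import numpy as np


def knockoff_plus_threshold(w, q=0.1):
    """Return the knockoff+ threshold for importance statistics w.

    The threshold is the smallest t among the nonzero |w_j| such that
    (1 + #{j: w_j <= -t}) / max(1, #{j: w_j >= t}) <= q.
    If no such t exists, np.inf is returned (nothing is selected).
    """
    w = np.asarray(w, dtype=float)
    candidates = np.unique(np.abs(w[w != 0]))  # np.unique returns sorted values
    for t in candidates:
        n_negative = np.sum(w <= -t)  # proxy count of false discoveries
        n_positive = np.sum(w >= t)   # number of features that would be selected
        if (1 + n_negative) / max(1, n_positive) <= q:
            return float(t)
    return np.inf


def select_features(w, q=0.1):
    """Indices of features whose statistic reaches the knockoff+ threshold."""
    w = np.asarray(w, dtype=float)
    return np.flatnonzero(w >= knockoff_plus_threshold(w, q))


def stabilized_importance(runs):
    """Average importance statistics over runs (rows = runs, columns = features).

    Assumption: simple averaging stands in for the ensembling step.
    """
    return np.mean(np.asarray(runs, dtype=float), axis=0)


if __name__ == "__main__":
    # Hypothetical statistics from three training runs with different seeds.
    rng = np.random.default_rng(0)
    signal = np.r_[np.full(10, 3.0), np.zeros(40)]  # 10 true signals, 40 nulls
    runs = [signal + rng.normal(size=signal.size) for _ in range(3)]
    w_bar = stabilized_importance(runs)
    print("selected feature indices:", select_features(w_bar, q=0.2))
```

Averaging the statistics before thresholding means that a feature must look important consistently across runs to be selected, which is the intuition behind reducing run-to-run randomness; the formal false-discovery-rate guarantee of the knockoff filter applies to the original procedure and is not claimed for this simplified averaging.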
Data availability
Alzheimer’s disease genetic cohort data can be obtained for approved research (see the description in the work by Le Guen and colleagues (ref. 62)). Simulation datasets are available on our GitHub repository: https://github.com/Peyman-HK/De-randomized-HiDe-MK (ref. 70).
Code availability
The code for the generation and reproduction of the simulation studies of SKAT haplotype data is written in R. The code for HiDe-MK training, prediction and evaluation is written in Python with Keras and TensorFlow. The code is freely available at https://github.com/Peyman-HK/De-randomized-HiDe-MK. The DOI of the code is https://doi.org/10.5281/zenodo.6872386 (ref. 70). The pseudocode for the simulation studies can be found in Supplementary Section 4.
References
Sierksma, A., Escott-Price, V. & De Strooper, B. Translating genetic risk of Alzheimer’s disease into mechanistic insight and drug targets. Science 370, 61–66 (2020).
Visscher, P. M. et al. 10 Years of GWAS discovery: biology, function, and translation. Am. J. Hum. Genet. 101, 5–22 (2017).
Lee, S., Abecasis, G. R., Boehnke, M. & Lin, X. Rare-variant association analysis: study designs and statistical tests. Am. J. Hum. Genet. 95, 5–23 (2014).
Zuk, O., Hechter, E., Sunyaev, S. R. & Lander, E. S. The mystery of missing heritability: genetic interactions create phantom heritability. Proc. Natl Acad. Sci. USA 109, 1193–1198 (2012).
Ma, Y. et al. Analysis of whole-exome sequencing data for Alzheimer disease stratified by APOE Genotype. JAMA Neurol. 76, 1099–1108 (2019).
Jun, G. R. et al. Transethnic genome-wide scan identifies novel Alzheimer’s disease loci. Alzheimers. Dement. 13, 727–738 (2017).
Belloy, M. E. et al. Association of klotho-VS heterozygosity with risk of Alzheimer disease in individuals who carry APOE4. JAMA Neurol. 77, 849–862 (2020).
He, L. et al. Exome-wide age-of-onset analysis reveals exonic variants in ERN1 and SPPL2C associated with Alzheimer’s disease. Transl. Psychiatry 11, 146 (2021).
Sims, R., Hill, M. & Williams, J. The multiplex model of the genetics of Alzheimer’s disease. Nat. Neurosci. 23, 311–322 (2020).
Costanzo, M. et al. A global genetic interaction network maps a wiring diagram of cellular function. Science 353, aaf1420 (2016).
Kuzmin, E. et al. Systematic analysis of complex genetic interactions. Science 360, eaao1729 (2018).
Phillips, P. C. Epistasis — the essential role of gene interactions in the structure and evolution of genetic systems. Nat. Rev. Genet. 9, 855–867 (2008).
Moore, J. H. & Williams, S. M. Epistasis and its implications for personal genetics. Am. J. Hum. Genet. 85, 309–320 (2009).
Cordell, H. J. Detecting gene–gene interactions that underlie human diseases. Nat. Rev. Genet. 10, 392–404 (2009).
Scarselli, F. & Chung Tsoi, A. Universal approximation using feedforward neural networks: a survey of some existing methods, and some new results. Neural Netw. 11, 15–37 (1998).
Koo, P. K. & Ploenzke, M. Improving representations of genomic sequence motifs in convolutional networks with exponential activations. Nat. Mach. Intell. 3, 258–266 (2021).
Cao, Y., Geddes, T. A., Yang, J. Y. H. & Yang, P. Ensemble deep learning in bioinformatics. Nat. Mach. Intell. 2, 500–508 (2020).
Manifold, B., Men, S., Hu, R. & Fu, D. A versatile deep learning architecture for classification and label-free prediction of hyperspectral images. Nat. Mach. Intell. 3, 306–315 (2021).
Song, Z. & Li, J. Variable selection with false discovery rate control in deep neural networks. Nat. Mach. Intell. 3, 426–433 (2021).
Ghorbani, A., Abid, A. & Zou, J. Y. Interpretation of neural networks is fragile. In Proc. AAAI Conference on Artificial Intelligence Vol. 33 3681–3688 (AAAI, 2019); https://doi.org/10.1609/aaai.v33i01.33013681
Barber, R. F. & Candès, E. J. Controlling the false discovery rate via knockoffs. Ann. Stat. 43, 2055–2085 (2015).
Candès, E., Fan, Y., Janson, L. & Lv, J. Panning for gold: ‘Model-X’ knockoffs for high dimensional controlled variable selection. J. R. Stat. Soc. B 80, 551–577 (2018).
Tibshirani, R. Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. B 58, 267–288 (1996).
Sesia, M., Katsevich, E., Bates, S., Candès, E. & Sabatti, C. Multi-resolution localization of causal variants across the genome. Nat. Commun. 11, 1093 (2020).
Lu, Y. Y., Fan, Y., Lv, J. & Noble, W. S. DeepPINK: reproducible feature selection in deep neural networks. In Proc. 32nd International Conference on Neural Information Processing Systems 8690–8700 (Curran Associates, 2018).
He, Z. et al. Identification of putative causal loci in whole-genome sequencing data via knockoff statistics. Nat. Commun. 12, 3512 (2021).
Lu, L., Shin, Y., Su, Y. & Karniadakis, G. E. Dying ReLU and initialization: theory and numerical examples. Commun. Comput. Phys. 5, 1671–1706 (2020).
Clevert, D.-A., Unterthiner, T. & Hochreiter, S. Fast and accurate deep network learning by exponential linear units (ELUs). In International Conference on Learning Representations (ICLR, 2016).
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
He, Z., Xu, B., Buxbaum, J. & Ionita-Laza, I. A genome-wide scan statistic framework for whole-genome sequence data analysis. Nat. Commun. 10, 3018 (2019).
Li, X. et al. Dynamic incorporation of multiple in silico functional annotations empowers rare variant association analysis of large whole-genome sequencing studies at scale. Nat. Genet. 52, 969–983 (2020).
Dai, C., Lin, B., Xing, X. & Liu, J. False discovery rate control via data splitting. J. Am. Stat. Assoc. https://doi.org/10.1080/01621459.2022.2060113 (2020).
Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1–22 (2010).
Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R. & Lin, C.-J. LIBLINEAR: a library for large linear classification. J. Mach. Learn. Res. 9, 1871–1874 (2008).
Lee, S., Zhao, Z., Miropolsky, L. & Wu, M. SKAT: SNP-Set (Sequence) Kernel Association Test. R package version 2.2.4 (2022).
Gimenez, J. R. & Zou, J. Improving the stability of the knockoff procedure: multiple simultaneous knockoffs and entropy maximization. In Proc. 22nd International Conference on Artificial Intelligence and Statistics (AISTATS) (PMLR, 2018).
Ren, Z., Wei, Y. & Candès, E. Derandomizing knockoffs. J. Am. Stat. Assoc. https://doi.org/10.1080/01621459.2021.196272 (2021).
He, Z. et al. Genome-wide analysis of common and rare variants via multiple knockoffs at biobank scale, with an application to Alzheimer disease genetics. Am. J. Hum. Genet. 108, 2336–2353 (2021).
Andrews, S. J., Fulton-Howard, B. & Goate, A. Interpretation of risk loci from genome-wide association studies of Alzheimer’s disease. Lancet Neurol. 19, 326–335 (2020).
Ferkingstad, E. et al. Large-scale integration of the plasma proteome with genetics and disease. Nat. Genet. 53, 1712–1721 (2021).
Sesia, M., Bates, S., Candès, E., Marchini, J. & Sabatti, C. False discovery rate control in genome-wide association studies with population structure. Proc. Natl Acad. Sci. USA 118, e2105841118 (2021).
Schaffner, S. F. et al. Calibrating a coalescent simulation of human genome sequence variation. Genome Res. 15, 1576–1583 (2005).
Sesia, M., Sabatti, C. & Candès, E. J. Gene hunting with hidden Markov model knockoffs. Biometrika 106, 1–18 (2019).
Plassman, B. L. et al. Prevalence of dementia in the United States: the aging, demographics, and memory study. Neuroepidemiology 29, 125–132 (2007).
Escott-Price, V., Shoai, M., Pither, R., Williams, J. & Hardy, J. Polygenic score prediction captures nearly all common genetic risk for Alzheimer’s disease. Neurobiol. Aging 49, 214.e7–214.e11 (2017).
Le Guen, Y. et al. A novel age-informed approach for genetic association analysis in Alzheimer’s disease. Alzheimers. Res. Ther. 13, 72 (2021).
Das, S. et al. Next-generation genotype imputation service and methods. Nat. Genet. 48, 1284–1287 (2016).
Beecham, G. W. et al. The Alzheimer’s disease sequencing project: study design and sample selection. Neurol. Genet. 3, e194–e194 (2017).
Weiner, M. W. et al. The Alzheimer’s disease neuroimaging initiative: progress report and future plans. Alzheimers. Dement. 6, 202–211.e7 (2010).
Bennett, D. A. et al. Overview and findings from the rush memory and aging project. Curr. Alzheimer Res. 9, 646–663 (2012).
Kunkle, B. W. et al. Genetic meta-analysis of diagnosed Alzheimer’s disease identifies new risk loci and implicates Aβ, tau, immunity and lipid processing. Nat. Genet. 51, 414–430 (2019).
Kunkle, B. W. et al. Novel Alzheimer disease risk loci and pathways in African American individuals using the African genome resources panel: a meta-analysis. JAMA Neurol. 78, 102–113 (2021).
Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006).
Chen, C.-Y. et al. Improved ancestry inference using weights from external reference panels. Bioinformatics 29, 1399–1406 (2013).
Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Taliun, D. et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590, 290–299 (2021).
Andrews, S. J., Fulton-Howard, B. & Goate, A. Interpretation of risk loci from genome-wide association studies of Alzheimer’s disease. Lancet Neurol. 19, 326–335 (2020).
Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
Hechtlinger, Y. Interpretation of prediction models using the input gradient. Preprint at https://arxiv.org/abs/1611.07634 (2016).
Le Guen, Y. et al. A novel age-informed approach for genetic association analysis in Alzheimer’s disease. Alzheimers. Res. Ther. 13, 72 (2021).
Saha, S. et al. Hierarchical deep learning neural network (HiDeNN): an artificial intelligence (AI) framework for computational science and engineering. Comput. Methods Appl. Mech. Eng. 373, 113452 (2021).
Roy, D., Panda, P. & Roy, K. Tree-CNN: a hierarchical deep convolutional neural network for incremental learning. Neural Netw. 121, 148–160 (2020).
Kim, J., Kim, B., Roy, P. P. & Jeong, D. Efficient facial expression recognition algorithm based on hierarchical deep neural network structure. IEEE Access 7, 41273–41285 (2019).
Xu, Y. et al. A hierarchical deep learning approach with transparency and interpretability based on small samples for glaucoma diagnosis. npj Digit. Med. 4, 48 (2021).
Glorot, X. & Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proc. 13th International Conference on Artificial Intelligence and Statistics (AISTATS) Vol. 9, 249–256 (JMLR, 2010).
LeCun, Y. A., Bottou, L., Orr, G. B. & Müller, K.-R. in Neural Networks: Tricks of the Trade (eds. Müller, K.-R. et al.) 2nd edn, 9–48 (Springer, 2012); https://doi.org/10.1007/978-3-642-35289-8_3
Jha, N. K., Mittal, S. & Mattela, G. The ramifications of making deep neural networks compact. Preprint at https://arxiv.org/abs/2006.15098 (2020).
Peyman-HK/Stabilized-HiDe-MK: Stabilized HiDe-MK (Zenodo, 2022); https://doi.org/10.5281/zenodo.6872386
Acknowledgements
This research was supported by NIH/NIA award AG066206 (Z.H.).
Author information
Contributions
P.H.K. and Z.H. developed the concepts for the manuscript and proposed the method. P.H.K., F.L., Y.L.G. and Z.H. designed the analyses and applications and discussed the results. P.H.K., Z.H. and F.L. conducted the analyses. Z.H., Y.L.G. and M.E.B. helped interpret the results of the real data analyses. P.H.K., Z.H., F.L. and Y.L.G. prepared the manuscript and contributed to editing the paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Yue Cao and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Figs. 1–7, Tables 1–4 and discussions of ‘Notes on the real data preparation’ and ‘Model configurations’.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Kassani, P.H., Lu, F., Le Guen, Y. et al. Deep neural networks with controlled variable selection for the identification of putative causal genetic variants. Nat Mach Intell 4, 761–771 (2022). https://doi.org/10.1038/s42256-022-00525-0
DOI: https://doi.org/10.1038/s42256-022-00525-0