Strauch, Y. orcid.org/0000-0003-0820-8319, Lord, J. orcid.org/0000-0002-0539-9343, Niranjan, M. orcid.org/0000-0001-7021-140X et al. (1 more author) (2022) CI-SpliceAI—Improving machine learning predictions of disease causing splicing variants using curated alternative splice sites. PLOS ONE, 17 (6). e0269159. ISSN 1932-6203
Abstract
Background
It is estimated that up to 50% of all disease-causing variants disrupt splicing. Because splicing is complex, our ability to predict which variants disrupt it is limited, which leads to missed diagnoses for patients. The emergence of machine learning for targeted medicine holds great potential to improve prediction of splice-disrupting variants. The recently published SpliceAI algorithm utilises deep neural networks and has been reported to have greater accuracy than other commonly used methods.
Methods and findings
The original SpliceAI was trained on splice sites included in primary isoforms combined with novel junctions observed in GTEx data, which might introduce noise and de-correlate the machine learning input from its output. Limiting the training data to validated, manually annotated primary and alternatively spliced GENCODE sites may improve predictive ability. All of these gene isoforms were collapsed (aggregated into one pseudo-isoform) and the SpliceAI architecture was retrained (CI-SpliceAI). Predictive performance on a newly curated dataset of 1,316 functionally validated variants from the literature was compared with the original SpliceAI, alongside MMSplice, MaxEntScan, and SQUIRLS. Both SpliceAI algorithms outperformed the other methods, with the original SpliceAI achieving an accuracy of ∼91% and CI-SpliceAI showing an improvement to ∼92% overall. Predictive accuracy increased in the majority of curated variants.
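The collapsing step can be pictured with a minimal sketch. This is an illustration only, not the authors' preprocessing code: it assumes that "collapsed" means taking the union of donor and acceptor positions across all annotated isoforms of a gene. The function name `collapse_isoforms`, the exon coordinate convention (inclusive start and end on the forward strand), and the toy gene are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical representation: an isoform is a list of exons, each given as
# (start, end) with inclusive coordinates on the forward strand.
Isoform = List[Tuple[int, int]]


def collapse_isoforms(isoforms: List[Isoform]) -> Dict[str, Set[int]]:
    """Aggregate the splice sites of several isoforms into one pseudo-isoform.

    Assumption: collapsing is modelled as the union of all donor and
    acceptor positions seen in any isoform of the gene.
    """
    donors: Set[int] = set()
    acceptors: Set[int] = set()
    for exons in isoforms:
        ordered = sorted(exons)
        # Every exon except the last one ends at a donor site.
        for _, end in ordered[:-1]:
            donors.add(end)
        # Every exon except the first one starts at an acceptor site.
        for start, _ in ordered[1:]:
            acceptors.add(start)
    return {"donor": donors, "acceptor": acceptors}


# Toy gene (made-up coordinates): a primary isoform and an alternative
# isoform that uses a different donor and skips the middle exon.
primary = [(100, 200), (300, 400), (500, 600)]
alternative = [(100, 220), (500, 600)]

sites = collapse_isoforms([primary, alternative])
print(sorted(sites["donor"]), sorted(sites["acceptor"]))
```

In this toy case the pseudo-isoform carries both the primary donor at 200 and the alternative donor at 220, so a model trained on it would see both positions labelled as splice sites.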
Conclusions
We show that including only manually annotated alternatively spliced sites in training data improves prediction of clinically relevant variants, and highlight avenues for further performance improvements.
Metadata
Item Type: | Article |
---|---|
Copyright, Publisher and Additional Information: | © 2022 Strauch et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
Keywords: | Alternative Splicing; Humans; Machine Learning; Mutation; Neural Networks, Computer; RNA Splice Sites; RNA Splicing |
Institution: | The University of Sheffield |
Academic Units: | The University of Sheffield > Faculty of Medicine, Dentistry and Health (Sheffield) > School of Medicine and Population Health |
Depositing User: | Symplectic Sheffield |
Date Deposited: | 25 Jan 2024 08:39 |
Last Modified: | 25 Jan 2024 08:39 |
Status: | Published |
Publisher: | Public Library of Science (PLoS) |
Refereed: | Yes |
Identification Number: | 10.1371/journal.pone.0269159 |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:207761 |