Evidence for 28 genetic disorders discovered by combining healthcare and research data

Summary De novo mutations (DNMs) in protein-coding genes are a well-established cause of developmental disorders (DD)1. However, known DD-associated genes only account for a minority of the observed excess of such DNMs1,2. To identify novel DD-associated genes, we integrated healthcare and research exome sequences on 31,058 DD parent-offspring trios, and developed a simulation-based statistical test to identify gene-specific enrichments of DNMs. We identified 285 significantly DD-associated genes, including 28 not previously robustly associated with DDs. Despite detecting more DD-associated genes, much of the excess of DNMs of protein-coding genes remains unaccounted for. Modelling suggests that over 1,000 novel DD-associated genes await discovery, many of which are likely to be less penetrant than the currently known genes. Research access to clinical diagnostic datasets will be critical for completing the map of dominant DDs.


Multiple Testing Correction 22
Comparing DeNovoWEST to previous enrichment test 24 Synonymous-only enrichment test 25 Analysis of undiagnosed cases 26 Removing synonymous outlier genes 27 Withholding the 10 most mutated genes from weight creation 28 Evaluating potential cryptic splice site variants 29

Estimating the fraction of cases explained by de novo mutations 31
Splitting by cohort 31 Splitting by sex 31

Comparing the new and known genes 33
Phenotypic similarity between new and known genes 33 Functional similarity between new and known genes 35 Evaluating potentially smaller mutational targets in the novel genes 37 Comparison of constraint metrics 37

GeneDx Data Sharing
The American College of Genetics and Genomics (ACMG) position statement on genomic data sharing helps inform GeneDx's philosophy regarding data use: "To ensure that our patients receive the most informed care possible, the American College of Medical Genetics and Genomics advocates for extensive sharing of laboratory and clinical data from individuals who have undergone genomic testing. Information that underpins health-care service delivery should be treated neither as intellectual property nor as a trade secret when other patients may benefit from the knowledge being widely available" 1 . With this in mind, as part of its clinical testing program, per patient consent and without sharing any non-deidentified data, and as described on the website (https://www.genedx.com/all-forms/use-of-specimen-and-genetic-information/).
GeneDx uses resources such as GeneMatcher 1,2 (https://genematcher.org) in order to help identify evidence that variants in a given gene may contribute to an observed phenotype.
Similarly, the endeavor described in this work allows GeneDx analysts and scientists to identify affected patients in other cohorts, which translates to the ability to better clinically interpret variants in specific genes of interest and provide reports to referring clinicians. This process is also useful for updating variants from those binned as candidate genes to more clearly causative disease explanations; GeneDx has updated over 200 clinical reports regarding such genes since 2000, affecting more than 3,000 individual cases.
joint-genotyped in two separate batches, one with 10,138 trios and the other 8651 trios. GATK VariantRecalibrator (VQSR) was applied for both SNPs and INDELs, with known SNPs from 1000 Genomes phase 1 high confidence set and "gold standard" INDELs from Mills et al 9 .
Variants in VQSR VCF files were annotated with Ensembl Variant Effect Predictor (VEP) 10  Families gave informed consent for both the diagnostic procedure as well as for forthcoming research that could result in the identification of new genes underlying ID by meta-analysis, as presented here.
The exomes of 2418 patient-parent trios were sequenced, using DNA isolated from blood, at the Beijing Genomics Institute (BGI) in Copenhagen. Exome capture was performed using Agilent SureSelect v4 and v5 and samples were sequenced on an Illumina HiSeq 4000 instrument with paired-end reads to a median target coverage of 112x. Sequence reads were aligned to the hg19 reference genome using BWA version v0.7.12 and duplicate marking by Picard v1.90.
Variants were subsequently called by the GATK haplotypecaller (version v3. . The diagnostic WES process as outlined above only reports (de novo) variants that can be linked to the individuals' phenotype. In this study, we systematically collected all DNMs in regions in or close to (200bp) a capture target. DNMs were called as described previously 12 .
Briefly, variants called within parental samples were removed from the variants called in the child. For the remaining variants pileups were generated from the alignments of the child and both parents. Based on pileup results variants were then classified into the following categories: "maternal (for identified in the mother only)", "paternal (for identified in the father only)", "low coverage" (for insufficient read depth in either parent), "shared" (for identified in both parents)", and "possibly de novo" (for absent in the parents). Variants classified as possibly de novo were included in this study.
We applied various quality filters to ensure that only the most reliable calls were included in the study, these are described in the quality control section below.

Definition of diagnostic lists
For various analyses in this work, we wanted a list of known developmental disorder genes. In order to define this list, we collected diagnostic lists from each center and created sets of "consensus" and "discordant" genes. For the initial submission and preprint, we used diagnostic lists from September 2018. While under revision, a number of genes were added to each center's diagnostic list. To more accurately reflect the current state of the field, we updated these diagnostic gene lists to versions curated in September 2019 to January 2020.
For the DDD cohort, we used the Developmental Disorders Genotype-Phenotype Database (DDG2P) list, which is a curated list of genes specifically associated with developmental disorders. For every gene on the list, DDG2P provides the level of certainty, consequence of the mutation, and allelic status of variants associated with developmental disorders. We downloaded the DDG2P list on 22 September 2019 from https://www.ebi.ac.uk/gene2phenotype/downloads. In order to define diagnostic genes that act in a dominant fashion, we selected only genes that were considered "probable", "confirmed", or "both RD and IF" (i.e. high levels of certainty of being a true DD-associated gene) and had an allelic status of "monoallelic", "x-linked dominant", "hemizygous", or "imprinted".
GeneDx maintains a continually curated list of genes, used to define reporting categories for clinical exome and genome testing, which have been definitively or putatively implicated in human Mendelian disease, with modes of inheritance noted for each gene. Starting with the January 2020 curation list, those genes with dominant modes of inheritance and definitive implications in disease were manually reviewed to remove any genes with no association to developmental disorders either because of no phenotypic overlap with the inclusion criteria for this study or because the relevant phenotypes were adult onset.
For the list from RUMC, gene panels for intellectual disability, epilepsy, and craniofacial anomalies/Multiple congenital anomalies were designed by multidisciplinary expert teams consisting of a clinical laboratory geneticist, a molecular geneticist, and a clinical specialist.
After mapping to HGNC IDs and symbols, any gene that was considered diagnostic by all three centers was designated as a "consensus" gene (n=380). For genes on one or two of the diagnostic lists, we considered them "discordant" genes (n=607). The following filters were applied specifically to each center. These were chosen to minimise the number of false positive DNMs. The filters were applied in this specific order. ■ Filter out mutations in sites with more than one mutation with VAF < 0.3 ○ X chromosome ■ We ran DeNovoGear as previously described, but with a different set of hard filters to account for the lower coverage in males and to maximise sensitivity and specificity. We examined all candidate DNMs in males and a large subset of those in females manually in IGV 16 , and used this to settled on the following set of filters:

Joint quality control of dataset
• We removed DNMs in the pseudoautosomal regions.
• The variant had to be called heterozygous or, for males, hemizygous in the child in the original GATK calls, and called homozygous reference in the parents.
• We removed variants in segmental duplications.
• For male probands, we required the following: in the child, alternate allele depth > 2 and RD > 2; in the mother, RD > 5 • For female probands, we required the following: in the child, alternate allele depth > 2, RD > 7; in the mother, RD > 5; in the father, RD > 1 • For single nucleotide variants, we required p > 10 -3 on a Fisher's exact test for strand bias, pooling across trios (ignoring fathers of male probands) where a de novo was called at the same site by DeNovoGear.
• For female probands, we removed indels < 5bp if they had VAF < 0.3 or MAF > 0, since these were vastly over-represented and seemed to be a common error mode.
• Removed DNMs if any two of the following conditions were met (these conditions were applied separately for males and females): Removing variants from siblings Siblings will sometimes share DNMs that arose as the same mutational event in one of their parental germlines. To avoid double counting these shared DNMs, we identified siblings in each cohort and randomly removed one of each pair of shared variants. In total, we removed 11 DNMs found in siblings.
Additionally, we removed 20 false positive variants identified in genes identified as novel in our initial submission (see below). After filtering we had a total set of 45,221 coding DNMs (Sup  Fig 2b).
Specifically, as described in McRae et al 3 and mentioned above, the DDD study selected a ppDNM (posterior probability of a de novo mutation) threshold such that the observed number of synonymous DNMs matched the expected number.
All three cohorts have far more carriers of nonsynonymous DNMs in the consensus genes than expected based on a null mutational model (Sup Fig 2c).

Confirming variants in novel genes
We wanted to confirm all DNMs in the significant genes not previously considered diagnostic by any center (the "novel" genes). Between the initial submission and revision, variants were evaluated in 67 genes. Each center validated every DNM in these genes either via Sanger sequencing or manual inspection of IGV plots. Across the three centers, only 21 variants appeared to be false positives; these variants were removed from subsequent analyses. The final set of 45,221 de novo mutations is included as Sup Table 1.

DeNovoWEST
DeNovoWEST (De Novo Weighted Enrichment Simulation Test) is the testing framework we used to assess gene-wise de novo mutation enrichment (https://github.com/queenjobo/DeNovoWEST). Each observed DNM in our dataset was assigned a mutation severity score. This severity score is a proxy for how deleterious we expect the mutation to be. Details of how these were calculated are given below. For each gene we then calculated a gene severity score which is the sum of all mutation severity scores that fall into that gene. There are two components to DeNovoWEST: the overall enrichment test which includes all variant consequences and the gain-of-function specific test which assess enrichment and clustering of missense variants only.

Enrichment Test
We used a simulation-based approach to evaluate whether these observed gene severity scores are higher than what we would expect under the null hypothesis of no de novo mutation enrichment. To calculate the probability of observing a gene score that is as or more extreme than the one that we observe for this gene (the p-value) we considered the case of observing k number of DNMs in the gene where k ranged from 0 to 250. This upper limit was chosen as it is far above the number of DNMs we see in any individual gene in our dataset and the probability of observing more than that number of DNMs for our cohort in a single gene is negligible with respect to our significance threshold. The enrichment p-value, pEnrich, was then calculated as the sum of these probabilities as summarized by the following equation where S denotes the gene score, s is the observed gene score and K is the number of DNMs in the gene: These probabilities were calculated as follows: • We calculated which is the probability of observing 0 mutations in this gene.
This was calculated based on our sample size (N) and the mutation rate of the gene assuming mutations follow a Poisson distribution with . This was adjusted to for genes that fall on the X chromosome.
• We also analytically calculated , the probability that the severity score of 1 mutation was greater than what we observed. This was calculated using the mutability of each position and the annotated score for that mutation.
• We then used a simulation-based approach to calculate , the probability of observing an as or more extreme gene score if we see 2 to 250 DNMs in this gene. We calculated analytically under the null assuming the DNMs followed a poisson distribution (as above). We then multiplied this with a simulationbased estimate for , the probability of observing a gene score greater than or equal to the one we observe. To calculate the latter probability we simulated the distribution of gene scores as follows: ○ Simulate k DNMs across the gene. The probability of mutation was weighted by the trinucleotide sequence context at every base position in that gene.
○ Assign the simulated de novo mutations a mutation severity score ○ Sum the simulated mutation severity scores to get the simulated gene severity score ○ We performed 10 9 x simulations for every k. This was the equivalent of 10 9 simulations per gene and meant that our p-value was robust to stochasticity far below our p-value threshold.

Determination of Weights
We calculated the weights used in the DeNovoWEST test from observed enrichments across consequence classes, shet values, missense constraint information and, for some, CADD score bins (version 1.0) 18 . shet refers to the estimated selective effect of heterozygous PTVs 19 . We stratified genes into two groups of 'high' shet and 'low' shet. We defined high shet genes as those with a shet ≥0.15 and low shet genes as those with an estimate <0. 15. This threshold was one suggested by Cassa et al to define selectively disadvantageous genes. Enrichments were calculated by dividing the number of observed DNMs by the number of expected DNMsacross all sites in the exome that fell into a specific strata. We calculated the number of expected mutations given our sample size and the triplet context mutation rates at the sites 17 . Details for the weight calibration for each consequence class are given below: • Missense mutations were stratified based on whether they fell in a low or high shet gene 19 , whether or not they fell into a region of missense constraint 20 , and finally into CADD score bins of size 6. We fit four LOESS lines on the enrichments for mutations that were in high shet genes + missense constrained regions, high shet genes + not in a missense constrained region, low shet genes + missense constrained regions, low shet genes + not in a missense constrained region.
• Nonsense mutations were stratified based on whether they fell in a low or high shet gene, and then into CADD score bins of size 15 for the high shet genes and CADD score bins of 7.5 for low shet genes. We fit two LOESS lines on the enrichments for mutations that were in high shet genes vs those in low shet genes.
• Synonymous mutations were stratified based on whether they fell in a low or high shet gene.
• Canonical splice site variants were stratified based on whether they fell in a low or high shet gene.
• Inframe indels were assigned weights based on the overall enrichment of missense mutations as an appropriate approximation for their deleteriousness. These were stratified by whether they fell in a low or high shet gene but not stratified by CADD score bins.
• Frameshift indels were assigned the same weights as nonsense mutations with a CADD score ≥ 45 and whether they fell in a low or high shet gene.

Missense enrichment and clustering test
This test is designed to detect genes that may be acting via a gain-of-function mechanism and consists of two parts. The first is implementing the enrichment test (as described above) but only considering missense variants to obtain a pMisEnrich p-value. The second part consists of a missense clustering test. We assessed clustering of missense DNMs within genes to identify genes where DNMs may be acting through dominant negative or activating mechanisms. For this we used DeNovoNear, which has been described previously 3 and is available on Github (https://github.com/jeremymcrae/denovonear). We refer to this clustering p-value as  pClustering. We combined pMisEnrich and pClustering using Fisher's method to obtain pMEC.
To ensure Fisher's method was appropriate we confirmed that these two p-values were independent by simulating DNMs under the null for >20,000 tests and found the p-values were uncorrelated.

Combining Tests
We combined the results from the overall enrichment test and the missense enrichment/clustering test by taking the minimum of the two p-values as follows: Column headers are described within the file. Table 3. Novel genes. For each of the 28 novel genes in this analysis, we determined if it had an associated phenotype in OMIM, any publications about an association between mutations in that gene and developmental disorders, and whether it was significant in a study of inherited and de novo mutations in autism spectrum disorders 21 ("sig_ASD") or a metaanalysis of de novo mutations in individuals with neurodevelopmental disorders 22 ("sig_meta"). Step 2: Missense enrichment and clustering test

Supplementary
Step 1: Overall enrichment test

Comparing DeNovoWEST to previous enrichment test
To evaluate our new statistical framework we ran DeNovoWEST on the DNMs from the ~4k DDD trios from the 2017 publication 3 (not including those from the meta-analysis). We compared the results of the old and new test (Sup Fig 5). Using the previous enrichment test, 82 genes were found to be significantly associated with DD while DeNovoWEST identified 91 significant genes. Seventy-eight of these genes overlapped with the previous results, 13 genes were significant in DeNovoWEST results that were not previously and 4 genes that were previously significant were no longer significant. These three genes that we are lost are consensus genes (AUTS2, BRPF1, EEF1A2, IQSEC2) that were close to the threshold. For the 13 genes that were gained with DeNovoWEST, 10 were consensus, 1 was discordant and 2 were not on any of the diagnostic lists. This demonstrates that we gain substantially more genes than we lose when applying our new enrichment framework. threshold.

Synonymous-only enrichment test
As part of quality control, we repeated the enrichment framework using only synonymous DNMs. While synonymous mutations can be pathogenic, we expect that-as a class-they will not be significantly enriched in any gene. There were 6029 genes with a de novo synonymous mutation, but none of these genes was significantly enriched (enrichment p < 2.66 x 10 -6 , Bonferroni corrected for 18,762 tests; Sup Fig 6). Here, we are using the p-value provided after the simulation segment of DeNovoWEST.
Of note, the gene with the highest synonymous enrichment p-value from DeNovoWEST is has been Bonferroni corrected.

Removing synonymous outlier genes
We noticed that some of the putatively novel genes had significantly more synonymous variants than expected in the gnomAD 24 database. For example, HIST1H4E has 88 observed synonymous variants in gnomAD when only 25 were expected (observed/expected = 3.51, synonymous Z = -9.89). These outliers are indicative that the underlying mutational model is not properly modeling these genes, and may be underpredicting the number of expected variants.
We do not want to highlight false positive diagnostic genes (e.g. those that are significant in our test because the underlying model is wrong and are therefore not truly associated with developmental disorders). To address this, we selected the 11 of 33 putatively novel genes that had significantly elevated rates of synonymous variants in gnomAD compared to expectation (defined as synonymous Z ≤ -1.65, one tailed p ~ 0.05), multiplied all of their mutation rates by the observed/expected synonymous value (oe_syn) in gnomAD, and re-ran DeNovoWEST using these modified mutation rates. Five of these 11 genes (EEF2, HIST1H2AC, HIST1H4E, SPTBN1, ZFHX) failed to pass the exome-wide significance threshold (p < 0.025/18762) and were subsequently removed from our novel list, leaving 28 novel genes.
Withholding the 10 most mutated genes from weight creation One potential weakness of our approach is the circularity in defining weights based on mutation enrichments in our cohort and then applying those weights to the same data. Ideally, we would generate weights with each gene held out and then apply weights to the held out gene.
However, we hypothesized that doing so would have a minimal impact on the final enrichment results. To test this, we removed the mutations in the 10 most mutated genes (DDX3X, ARID1B, ANKRD11, KMT2A, MECP2, DYRK1A, SCN2A, STXBP1, MED13L, CREBBP), regenerated weights, and applied those weights to the full dataset.
We found that the DeNovoWEST p-values for the full analysis vs this analysis were highly correlated (r =0.99), including nearly identical p-values for the 10 genes withheld from weight creation. There was 1 gene that was significant in the full analysis but not in this analysis (PCBP2, p= 1.1 x 10 -6 in full analysis; p = 1.8 x 10 -6 in analysis with the top 10 genes removed from weight calculation). There was one gene that was significant in this analysis but not in the full analysis (SLC39A8, p= 1.4 x 10 -6 in full analysis; p = 1.1 x 10 -6 in analysis with the top 10 genes removed from weight calculation). For both differing genes the p-values were very similar and close to the significance threshold. These minimal differences support that our approach is robust to this form of circularity.

Evaluating potential cryptic splice site variants
While it is known that the terminal exonic guanine before a splice site can have a strong impact on splicing and be involved in developmental disorders 25,26 , there are too few of them as a class to treat as a separate group when determining weights in the enrichment framework. To address these bases and others that may act as cryptic splice sites, we downloaded all publicly available SpliceAI scores 27 , which were provided for single nucleotide variants with a maximum delta score ≥ 0.1. We annotated our DNMs with the largest SpliceAI delta score of the four possible scores (acceptor gain, acceptor loss, donor gain, and donor loss). Only 8% (n = 3329/40992) of our single nucleotide DNMs could be annotated with a SpliceAI delta score, but 92% of single nucleotide splice acceptor and donor variants (n = 948/1032) received a score.
We selected a threshold of 0. Additionally, we extracted the mutational context for all sites with a SpliceAI delta score ≥ 0.8 and used the sample size and triplet context mutation rates 17  For "All non-splice variants", DNMs with the consequences of "splice_acceptor_variant" and "splice_donor_variant" were removed.

Estimating the fraction of cases explained by de novo mutations
We previously estimated that the proportion of DDD individuals explained by diagnostic de novo coding mutations in known or as-yet-undiscovered genes was between 0.461 and 0.486 15 Fig 1b).

Splitting by sex
When we split the cohorts by the sex of the proband, we find that a significantly higher fraction of females have a nonsynonymous DNM in an autosomal diagnostic genes (as above, all consensus genes, significant discordant genes, and novel genes) when compared to males (OR

Functional similarity between new and known genes
To compare the functional similarity between the consensus and novel genes we looked across various properties that have been known to be important in classifying haploinsufficiency 29 . The details of each variable we have used are detailed below: • Somatic driver gene: a binary variable of whether the gene is a known somatic driver gene 30 .
• Median RPKM fetal brain: the median RPKM in the fetal brain taken from BrainSpan 31 .
• Relevant GO term: a binary variable of whether the gene was annotated with one of twenty GO terms that were enriched in consensus DD genes. To select these terms we annotated all genes with GO terms and looked at the enrichment of each GO term between consensus DD genes and non-DD genes (genes that are not significant in our analysis and are not on either the discordant or consensus genes lists). We defined relevant GO terms as the top 20 most enriched terms that appear in at least 20 of the 380 consensus genes. This was to ensure that we were picking terms that were generalisable to the entire set were not specific to only a few genes. The terms selected are detailed in Sup Table 5. At least one of these 20 terms were present in 237 (71%) of consensus genes and 2,874 (16%) of non-DD genes.
• Network distance to consensus genes: As in Huang et al 29 , we created a protein-protein interaction network by integrating information from the Human Protein Reference Database 32 , STRING 33 , and Reactome 34 . We then calculated the shortest path distance (a measure of proximity) between each gene and consensus genes.
• Promoter GERP and Coding GERP: These were calculated as described in Huang et al. 29 • pLI: the gnomAD pLI scores were used here 24 .
We compared the mean of these variables across consensus and novel genes compared to non-DD genes (Figure 2b) and found that the novel and consensus genes had very similar enrichments across most variables. The only significant difference was found in the network distance to another consensus DD gene (p-value 0.004, Wilcoxon test).

Evaluating potentially smaller mutational targets in the novel genes
We hypothesize that the novel genes may have a smaller mutational target than the previously established genes. In the following set of analyses, we only consider significant consensus genes (n = 196) and compare them to the significant discordant and novel genes (n = 94).
Many of the known developmental disorder genes act via haploinsufficiency (e.g. > 50% of the monoallelic DDG2P list have "loss of function" as their listed mutation consequence) and could therefore be caused by almost all true loss-of-function variants within a gene. Disorders due to gain-of-function mutations, however, would be caused by a much smaller number of mutations within a gene and would therefore require large sample sizes to be significantly burdened by DNMs. To investigate if the new genes (here both the discordant and novel) were more likely to act via a gain-of-function mechanism, we compared the ratio of missense to PTV DNMs in the gene sets and found that the consensus genes have a significantly lower missense:PTV ratio than the discordant and novel genes (ratio 1.20 vs 2.51; chi-squared p = 1.22 x 10 -25 ).
Alternatively, a smaller mutational target could be due to a shorter gene length. We would expect this to be most obvious for genes predicted to act via haploinsufficiency. To define such likely haploinsufficient genes, we subsetted the 285 genes to 149 with significant PTV enrichment (p < 1 x 10 -6 ). We, however, do not find significant differences in gene length or PTV mutability between the consensus genes compared to discordant and novel genes (Wilcox p = 0.9404 and 0.8534, respectively).

Comparison of constraint metrics
Additionally, we compared various constraint metrics between the significant consensus genes and the significant discordant and novel genes. To correct for mutability, for each of these metrics, we assigned missing genes the median value. We then ran a regression of the probability of mutation on the constraint metric and used the residuals in Wilcoxon rank sum tests. Using these residuals therefore corrects for systematic differences in length in between gene sets. When correcting for mutability in this way, we find no significant differences between the two gene sets for shet 19 (Wilcox p = 0.1185), pLI 24 (Wilcox p = 0.6039), or LOEUF (Wilcox p = 0.8010). We also find no significant difference between the known and novel genes for the rate of observed/expected missense variants in gnomAD (oe_mis; Wilcox p = 0.9228).
We also wanted to search for clustering of de novo PTVs specifically at the terminal end of genes since transcripts with such PTVs can sometimes escape from nonsense mediated decay (NMD). To test this, we selected all genes with > 2 de novo PTVs in our cohort and determined the probability of a PTV mutation in regions likely to escape from NMD (the last exon and 50bp into the penultimate exon 36     The number of protein domain occurrences is another aspect we correct for. The number of occurrences of family is given by^ and is compared to the total number of protein domain occurrences: The mutation rate of the protein in which a protein domain occurs might be biased. To correct for this we count the total number of mutations found in all the proteins that contain a domain of family and define that as _ _ NOL ( ), then we divide that by the total protein sequence length of all of these proteins ℎ_ ( )to obtain the mutation rate: genes shared between the somatic driver gene list and the 285 significant genes in our full analysis (56 consensus, 9 discordant, and 5 novel). This overlap between the two lists is significant when correcting for shet 19 (logistic regression, p = 1.7 x 10 -49 ).
To determine the enrichment of DNMs in the somatic driver genes, we used the previouslymentioned sequence context-based mutational model to predict the expected number DNMs in the gene set 13,17 . The somatic driver genes are significantly enriched for nonsynonymous DNMs compared to expectation (missense and PTVs; Poisson p < 10 -50 for both categories; Sup Table   9A). This enrichment remains when removing the 70 somatic driver genes that are significant in the de novo analysis (Poisson p = 8.0 x 10 -21 for missense and 2.8 x 10 -27 for PTVs; Sup Table   9B). Overall, 9.4% of cases carry a nonsynonymous DNM in a somatic driver gene; in 6.7% of cases, the somatic driver gene impacted is also significant in the de novo analyses. Supplementary Table 9. Enrichment of de novo mutations in somatic driver genes. Using a sequence-context based mutational model 13,17 and the size of the cohort, we predicted the expected number of de novo mutations (DNMs) by mutational consequence in 369 somatic driver genes 30 (a) and the somatic driver genes that were not significant in the de novo analysis (b). These expectations were compared to the observed number of DNMs with a Poisson test.

A) All somatic driver genes
Overall, we find 117 DNMs at 76 unique sites in our cohort that are recurrently mutated in the TCGA data in at least three patients. To determine enrichment at these recurrently mutated sites in TCGA, we focused on single nucleotide variants only. We extracted the local sequence context and used triplet context mutation rates 17

Recurrent Mutations
We identified 773 recurrent mutations in our combined dataset (736 single nucleotide variants and 37 indels). The most recurrently mutated sites are listed in Extended Data Table 1.

Germline selection genes
We found 39 mutations that fell in 12 known germline selection genes (FGFR2, FGFR3, PTPN11, HRAS, KRAS, RET, BRAF, CBL, MAP2K1, MAP2K2, RAF1, SOS1) 57,58 . To determine if the observed mutations in germline selection genes are known to be activating, we first confirmed that these were all genes known to be acting through gain-of-function mechanisms.
All of these genes are known monoallelic DD-associated genes annotated as having activating mutation consequences according to DDG2P 4 . We then confirmed that all these recurrent mutations are listed as pathogenic or likely pathogenic variants in ClinVar 59 .

DNM burden in non-significant genes
We calculated the remaining DNM burden in the genes that were not significantly associated with developmental disorders (DD) in our analysis and were not a consensus DD genes. There were 2,172 genes that were not associated with DD and had a pLI ≥ 0.9 (high pLI) and 10,472 genes with a pLI < 0.9 (low pLI) 24 . The burden was calculated by calculating the observed/expected number of DNMs across the four groups of genes categorised by both missense/PTV mutations and low/high pLI. While there was a remaining burden across all groups, we found that the highest remaining enrichment was in PTV DNMs in high pLI genes.
We then repeated the analysis removing nominally significant genes (unadjusted p-value > 0.05). We saw that the PTV enrichment dropped substantially more than the missense enrichment in both low and high pLI genes (Sup Fig 12). This suggests that our test is still not as well positioned to identify genes with mutations acting via gain-of-function mechanisms that tend to be driven by missense mutations. Impact of pre/perinatal death on power A structural malformation (detected via ultrasound) during pregnancy is associated with pre/perinatal death, which here encompasses spontaneous fetal loss, stillbirth, early neonatal death, and termination of pregnancy for fetal anomaly. To understand the impact that pre/perinatal death may have had on our power to detect DD-associated genes we asked a clinician to assign each consensus DD gene (n=380) a low, medium or high likelihood of a patient with a pathogenic mutation in the gene presenting with a structural malformation on ultrasound. We subsetted these to those known to be haploinsufficient according to DDG2P (n= 217).We then calculated the ratio of the number of observed de novo PTVs in each group to the total number of expected de novo PTVs across each group. We found that genes with a medium/high likelihood were significantly less PTV enriched in our cohort compared to genes with a low likelihood (p =4.6 x 10 -5 , Poisson test). The difference between medium and low was significantly different (p = 0.00012, Poisson test) but the difference between low and high was not (p = 0.08, Poisson test).
For the DDD data, we also had information on whether there had been an abnormal ultrasound during pregnancy for the patients. To verify that the classification was valid, we compared the proportion of individuals in the DDD study with nonsynonymous mutations in the three gene groups that were reported to have an abnormal scan during pregnancy. We found that the proportion of patients with an abnormal ultrasound for those with a nonsynonymous DNM in a gene with a high or medium likelihood of abnormal ultrasound was significantly higher than for those with a DNM in a gene with a low likelihood classification (12.8% (low) vs 28.0% (medium/high), chi-sq test p = 7.34 x 10 -13 ). We also looked at DNMs called in 640 trios from the Prenatal Assessment of Genomes and Exomes (PAGE) 61 study and found that there was a higher enrichment of nonsynonymous DNMs in genes with a medium/high likelihood of presenting with an ultrasound abnormality compared to the genes with a low likelihood but this was not significant (2.3 (low) vs 4.8 (medium/high) enrichment, p = 0.052, Poisson test).
We regressed the proportion of individuals with a nonsynonymous mutation in a consensus HI gene who had an abnormal ultrasound against the observed PTV enrichment in that gene. We performed a generalized linear model with a quasipoisson family on the count of PTVs per genes with the expected count and the proportion of abnormal ultrasounds as independent variables. We did not find a significant regression coefficient for proportion of abnormal ultrasounds (p = 0.33).
To assess whether mutations in novel genes are more likely to be associated with pre/perinatal death than consensus genes we compared the nonsynonymous de novo enrichment of these two groups in PAGE. We did not find a significant difference between the enrichment in novel genes (3.4) compared to consensus genes (2.9) (p = 0.44, Poisson test). We are not well powered here, and would need to include more samples to detect a significant difference. We also did not find that individuals in DDD with mutations in novel genes were significantly more likely to have an abnormal ultrasound compared to consensus genes (proportion abnormal for novel 0.24, proportion abnormal for consensus 0.17; chi-sq test p = 0.08).

Penetrance
To begin to explore potential variable penetrance, we focused on 172 genes with significant PTV enrichment (p < 1 x 10 -6 ) in our cohort (based solely on the number of PTV mutations without any variant weighting, and thus independent of gnomAD data) and extracted their observed PTV and synonymous counts in gnomAD 24 from the released gene-level constraint file (gnomad.v2.1.1.lof_metrics.by_gene.txt.bgz). Since we planned to compare the PTV to synonymous ratio in these genes, we wanted to remove any genes that were outliers in gnomAD or would have higher-than-expected PTV counts due to biological reasons. We therefore removed the 13 genes that had a constraint flag in the gnomAD file (10 with "syn_outlier", 2 with "mis_too_many|syn_outlier", and 1 with "lof_too_many") as well as three genes associated with clonal hematopoiesis of indeterminate potential (DNMT3A, ASXL1, and PPM1D) 62,63 that were in the 172 PTV enriched genes.
The remaining 156 genes were divided into five bins based on their log10(PTV enrichment).
Specifically, the bins were equally spaced in log10(PTV enrichment) space, and therefore vary in the number of genes in each bin (range 9 to 62). For each bin, we determined the PTV to synonymous ratio in gnomAD (Fig 3c). Then, we regressed the PTV to synonymous ratio on the PTV enrichment bin and weighted it by the number of genes in each bin. The slope of this regression was nominally significant (p = 0.03069).

Power analysis
Calculating power tailored specifically according to our de novo enrichment test was challenging as we would be required to make assumptions about the distribution of mutation consequences according to undiscovered DD-associated genes. Even with these assumptions, the calculation would be computationally intensive given the simulation based framework. We decided to calculate power in the context of the enrichment of PTV mutations. Therefore this power measure is specifically the power to detect haploinsufficient genes. Power was defined as the power to detect the median PTV enrichment in known monoallelic DD-associated genes. The set of known monoallelic DD-associated genes was defined as 163 genes in the consensus gene list that had at least one observed de novo PTV in our dataset. This set of genes were chosen as we wanted the selected genes to be defined independently of our test results.
Conditioning on at least 1 PTV was to remove most genes that are not likely to be ascertained into our cohorts (for example genes with very distinctive clinical phenotypes that are not typically diagnosed by exome sequencing). We calculated the PTV enrichment by dividing the observed

Downsampling
To assess if our discovery of DD-associated genes via DNM enrichment has begun to plateau, we downsampled the cohort to 5k, 10k, 15k, 20k and 25k and ran the DeNovoWEST framework to identify the number of significant genes. As is depicted in Extended Data Figure 1a, we do not see evidence of a plateau, indicating there is still much to be gained from DNM enrichment analyses.

Modelling remaining DNM burden
We modelled the remaining DNM burden separately for PTV and missense mutations.

Model for PTV DNM burden
We simulated the number of de novo PTVs across a range of numbers of remaining haploinsufficient genes and the PTV enrichment we would observe in these genes. Our PTV model made the following assumptions: • PTV enrichment was the same across all remaining undiscovered HI DD-associated genes.
• All undiscovered HI DD-associated genes have the same level of penetrance This captures three essential parts of the distribution of observed de novo PTVs: the total number of de novo PTVs, the number of genes that are currently significantly enriched for PTVs in our cohort (regardless of whether they appear in our consensus or discordant known gene groups) and the number of genes that are not significant but have an elevated PTV enrichment ( > 2).
In Extended Data Figure 1b we have also annotated the median PTV enrichment in significant HI genes which is 39.7. This was calculated as the median PTV enrichment for genes that are significant in our analysis and annotated as monoallelic with a loss of function mechanism according to DDG2P.

Model for missense DNM burden
The set up for the model for missense mutations was very similar to the PTV model with a few key differences. We simulated the number of de novo missense variants across a range of numbers of genes with a pathogenic DD-associated variant, mean missense enrichments in these genes and distributions of missense enrichment. We included a third dimension to the likelihood space which allowed us to model the distribution of missense enrichment across the genes with a pathogenic DD-associated missense variant.
We modelled the distribution of missense enrichment using the gamma distribution. We used 6 different shape parameters to represent different scenarios (0.1,0.5,1,510,20) (Sup Fig 13(a)). We found that the most likely scenario was that the shape for the missense enrichment distribution was 20. According to this shape there are ~1000 genes with pathogenic DD missense mutations with ~3 fold enrichment of missense mutations. This result complements Missense Enrichment Shape = 20

Expression in fetal brain
We defined expression in the fetal brain for a gene as having a median RPKM > 0 in the BrainSpan dataset 31 . We observed that the genes that were not significant in our analysis were significantly less likely to be expressed in the fetal brain (p = 1.05 x 10 -29 , proportion test). We then subsetted genes into those with a high pLI (pLI ≥ 0.9). We found that there was no significant difference between the proportion of the unassociated high pLI genes expressed in the fetal brain compared to the significant high pLI genes (p = 0.11 , proportion test; Sup Fig   14). This suggests that a number of these high pLI genes might be associated with DD as sample size increases.
Supplementary Figure 14: Comparison of proportion of genes expressed in fetal brain.
The proportion of genes between significant and non-significant genes in our analysis split by low (<0.9) and high (>0.9) pLI. The set of significant genes also includes all genes on the consensus diagnostic list (331 significant high pLI genes, 2372 non-significant high pLI genes, 112 significant low pLI genes, 10,687 non-significant high pLI genes). Exact binomial 95% confidence intervals shown. Genes that were not significant in our analysis were significantly less likely to be expressed in the fetal brain (p = 1.05 x 10 -29 , two-sided proportion test)