The sound of a gene breaking

Rivas MA#*, Pirinen M, Conrad DF, Lek M, Tsang EK, Karczewski KJ, Maller JB, Kukurba KR, DeLuca DS, Fromer M, Ferreira PG, Smith KS, Zhang R, Zhao F, Banks E, Poplin R, Ruderfer DM, Purcell SM, Tukiainen T, Minikel EV, Stenson PD, Cooper DN, Huang KH, Sullivan TJ, Nedzel J; GTEx Consortium; Geuvadis Consortium, Bustamante CD, Li JB, Daly MJ, Guigo R, Donnelly P, Ardlie K, Sammeth M, Dermitzakis ET, McCarthy MI, Montgomery SB, Lappalainen T*#, MacArthur DG#* (2015) Effect of predicted protein-truncating genetic variants on the human transcriptome. Science 348:666-9

#* Equal contribution; corresponding authors

If you listened to your cells very carefully, would you be able to hear when a gene breaks? In a new paper in Science, we show that this is, in a way, possible: by analysis of huge RNA-sequencing data sets, we can detect cellular effects of protein-truncating variants in the genome.

Trending: rare variants in human disease

Like fashion and culinarism and all other areas of life, genetic research has its trends. One of the hottest things right now is rare variant association analysis, enabled by cost-effective exome sequencing. In these studies, sequence data from several thousands of individuals is compared to find disease-associated genes where cases have a higher load of genetic variants that are likely to disturb gene function, compared to the controls. This approach has identified genes that are relevant to for example autism, as well as individual mutations that substantially increase or decrease the risk for e.g. cardiovascular disease.

A key step in the analysis of rare variants in disease is to distinguish genetic variants that are likely to disturb gene function. This allows to look for the culprits from a smaller group of potentially important variants, instead of drowning the signal in the noise of all the variants in our genomes, most of which have no impact at all. The most common approach is to focus on protein-truncating variants or PTVs (sometimes also called loss-of-function variants*) that terminate or substantially alter the protein sequence. These are typically determined from DNA sequence data alone based on the known genetic code and gene models.

In this study, we wanted to better interpret the functional effects of this important class of variants by analyzing their effects on the transcriptome. We integrated DNA and RNA sequencing data from hundreds of individuals from two projects, Geuvadis that we published in 2013, and the new GTEx project that has multiple tissue samples from the same individuals. The total 653 individuals were from the normal population without any particular disease, but still, as expected, we found as many as 16,286 PTVs – earlier studies have shown that all of us carry some broken genes without it causing any apparent harm. We then studied how the impact of these variants can be

captured in the RNA sequencing data, and what we can learn of their function based on the patterns that we find.


Getting rid of the nonsense

The first type of variants that we studied included SNPs that introduce a premature stop sign for the translation machinery (“nonsense SNPs”), and small insertions or deletions that scramble the downstream protein code (“frameshift indels”). Both of these give the translation machinery an early stop sign leading to a truncated protein, which is in the best case just nonfunctional (loss-of-function), but sometimes these pieces of protein can be actively harmful (gain-of-function). To avoid the latter risky situation, nature has developed an incredibly smart cellular pathway called nonsense-mediated decay (NMD) that identifies RNA transcripts that have a stop codon too early in the gene, and decay it before it gets translated into potentially harmful proteins. This amazing mechanism is easy to detect in RNA sequencing data. We look at each nonsense SNP or a frameshift indel in an individual that is heterozygous for the variant, and if the “stop” allele is observed at a substantially lower level than the expected 50-50 ratio, it’s a likely case of active nonsense-mediated decay. Matti Pirinen and Manny Rivas developed a neat statistical model for this, published at the same time in Bioinformatics.

So what did our analysis ASE_NMDteach us about nonsense-mediated decay? First, we looked at how the current gold-standard method for predicting NMD works. According to the 50 base pair rule formulated in the 90’s, transcripts where the premature stop is within 50bp from the last splice junction should not trigger NMD, and our data in this and earlier studies indeed shows a substantially lower levels of NMD for such variants. However, the prediction still fails for about 25% of the variants. Having the biggest data set of this type thus far, we trained a new model based on the properties of the variant, gene, and sequence, and improved the ability to predict NMD to about 80%. This is still far from perfect, but we hope that larger data sets and improving mechanistic understanding of NMD will improve this in the future.

The multi-tissue GTEx data allowed us to show a previously uncharacterized phenomenon: the same variant in the same individual can have different levels of nonsense-mediated decay in different tissues. This happens for about 20% of the variants in our data set, but given the relatively small set of tissues from most individuals, the true value might be even higher. This demonstrates that the effect of PTVs depends on the cellular context and can vary between tissues, like many other types of genetic effects. I really look forward to seeing future studies of how important this is for manifestation of clinical traits in different parts of the body.

Splicing it right

Another type of variants that we analyzed were variants close to the splice junction where exons of the gene are joined together to make a full mature transcript. Variants close to the junctions can disturb the process and lead to abnormal splicing, which often destroys the gene function. Most existing variant annotation tools classify variants very roughly to those in the 2-bp canonical splice site that is essential for proper splicing, and to those in the proximity with putative effects. We developed a new model for measuring if and how variants in different positions respective to the splice site affect splicing. We found widespread effects outside the canonical splice sites, which was not exactly a surprise, but the substantial variation between the different positions demonstrates the need for much more refined annotation of splice-affecting variants.

Buffers against gene loss

Finally, we studied large deletions that essentially get rid of the whole gene, with a specific question in mind: do individuals heterozygous for the deletion have gene expression levels half from the normal – as expected from only one copy – or is the normal allele able to compensate for the missing one, leading to expression levels close to normal. The latter situation, called dosage compensation, has been a debated topic, and while our data is not sufficient to say that it never happens, we can say that it is not common. This is a particularly interesting finding given the fact that heterozygous PTVs of all types are widespread and well-tolerated in the normal population (a phenomenon called haplosufficiency). Apparently the buffering mechanisms that make our cells tolerant to heterozygous variants operate at higher cellular levels, rather than compensating for the expression levels of the gene itself.

The lessons learned

One general observation in our paper was the different effects of rare versus common PTVs, with common variants having less NMD and little enrichment in canonical splice sites. Common variants generally represent variants that have less detrimental functional effects – they have been tolerated by natural selection – and our data shows that they tell a slightly different story of how variants affect molecular function. While analysis of common variants in the context of GWAS, eQTLs, and other molecular *QTLs are important and informative, analyzing rare variant effects as well, as we have done in this study, is important for understanding the full spectrum of impact of genetic variants and especially the effect of new mutations. Our paper is just one of the first steps in this direction, and our sample size is still too low to capture truly rare variants well, and thus I expect many interesting discoveries in this field.

Our paper is a comprehensive analysis of how protein-truncating variants can affect the transcriptome – or from another angle, how transcriptome data can be used to better understand PTVs. Even though it’s an awful cliché to say that the conclusion of your study is “it’s complicated”, in this particular case that is a slightly less trivial answer. Of all genetic variants, PTVs are thought to be the easy ones to predict and assign function to, just computationally based on the sequence. But even supposedly simple things can be pretty complicated when you look closer, as we did here. Our results show context-specificity and point to currently unknown mechanisms of molecular function of PTVs, to such extent that it warrants more attention to achieve an informed and sophisticated classifications of these variants.

The increasing size of the GTEx data set will allow us to dig deeper into these questions, investigating the causes and consequences of e.g. tissue-specificity of nonsense-mediated decay, and test whether more refined partitioning of PTVs could improve disease-related analysis. One major topic for future research is to extend these types of analyses to individuals with disease to obtain better understanding of molecular function of disease-associated variants. Finally, while exome sequencing is quickly becoming standard clinical practice, transcriptome sequencing is not there yet. We believe that our study not only establishes some of the crucial data analysis practices and approaches, but also demonstrates the value of the transcriptome in interpreting the genome.


*We made a conscious choice to use the term protein-truncating variants, instead of the quite widely used loss-of-function variants, because PTV refers to the factual changes that these variants cause by truncating the gene (splice variants are a borderline case and some of them are not strictly PTVs). The downstream effect of protein truncation can be either a loss-of-function or a gain-of-function, as discussed above, although this classification is currently usually unknown. In my opinion, using the term LoF when you have analyzed PTVs is simply incorrect, and even though it’s still common, I would be extremely happy to see the community switching to more accurate terminology. Otherwise we will end up in a huge mess where some variants labeled as loss-of-function actually have gain-of-function effects. [/rant]