The sound of a gene breaking

Rivas MA#*, Pirinen M, Conrad DF, Lek M, Tsang EK, Karczewski KJ, Maller JB, Kukurba KR, DeLuca DS, Fromer M, Ferreira PG, Smith KS, Zhang R, Zhao F, Banks E, Poplin R, Ruderfer DM, Purcell SM, Tukiainen T, Minikel EV, Stenson PD, Cooper DN, Huang KH, Sullivan TJ, Nedzel J; GTEx Consortium; Geuvadis Consortium, Bustamante CD, Li JB, Daly MJ, Guigo R, Donnelly P, Ardlie K, Sammeth M, Dermitzakis ET, McCarthy MI, Montgomery SB, Lappalainen T*#, MacArthur DG#* (2015) Effect of predicted protein-truncating genetic variants on the human transcriptome. Science 348:666-9

#* Equal contribution; corresponding authors

If you listened to your cells very carefully, would you be able to hear when a gene breaks? In a new paper in Science, we show that this is, in a way, possible: by analysis of huge RNA-sequencing data sets, we can detect cellular effects of protein-truncating variants in the genome.

Trending: rare variants in human disease

Like fashion and culinarism and all other areas of life, genetic research has its trends. One of the hottest things right now is rare variant association analysis, enabled by cost-effective exome sequencing. In these studies, sequence data from several thousands of individuals is compared to find disease-associated genes where cases have a higher load of genetic variants that are likely to disturb gene function, compared to the controls. This approach has identified genes that are relevant to for example autism, as well as individual mutations that substantially increase or decrease the risk for e.g. cardiovascular disease.

A key step in the analysis of rare variants in disease is to distinguish genetic variants that are likely to disturb gene function. This allows to look for the culprits from a smaller group of potentially important variants, instead of drowning the signal in the noise of all the variants in our genomes, most of which have no impact at all. The most common approach is to focus on protein-truncating variants or PTVs (sometimes also called loss-of-function variants*) that terminate or substantially alter the protein sequence. These are typically determined from DNA sequence data alone based on the known genetic code and gene models.

In this study, we wanted to better interpret the functional effects of this important class of variants by analyzing their effects on the transcriptome. We integrated DNA and RNA sequencing data from hundreds of individuals from two projects, Geuvadis that we published in 2013, and the new GTEx project that has multiple tissue samples from the same individuals. The total 653 individuals were from the normal population without any particular disease, but still, as expected, we found as many as 16,286 PTVs – earlier studies have shown that all of us carry some broken genes without it causing any apparent harm. We then studied how the impact of these variants can be

captured in the RNA sequencing data, and what we can learn of their function based on the patterns that we find.

Gif1

Getting rid of the nonsense

The first type of variants that we studied included SNPs that introduce a premature stop sign for the translation machinery (“nonsense SNPs”), and small insertions or deletions that scramble the downstream protein code (“frameshift indels”). Both of these give the translation machinery an early stop sign leading to a truncated protein, which is in the best case just nonfunctional (loss-of-function), but sometimes these pieces of protein can be actively harmful (gain-of-function). To avoid the latter risky situation, nature has developed an incredibly smart cellular pathway called nonsense-mediated decay (NMD) that identifies RNA transcripts that have a stop codon too early in the gene, and decay it before it gets translated into potentially harmful proteins. This amazing mechanism is easy to detect in RNA sequencing data. We look at each nonsense SNP or a frameshift indel in an individual that is heterozygous for the variant, and if the “stop” allele is observed at a substantially lower level than the expected 50-50 ratio, it’s a likely case of active nonsense-mediated decay. Matti Pirinen and Manny Rivas developed a neat statistical model for this, published at the same time in Bioinformatics.

So what did our analysis ASE_NMDteach us about nonsense-mediated decay? First, we looked at how the current gold-standard method for predicting NMD works. According to the 50 base pair rule formulated in the 90’s, transcripts where the premature stop is within 50bp from the last splice junction should not trigger NMD, and our data in this and earlier studies indeed shows a substantially lower levels of NMD for such variants. However, the prediction still fails for about 25% of the variants. Having the biggest data set of this type thus far, we trained a new model based on the properties of the variant, gene, and sequence, and improved the ability to predict NMD to about 80%. This is still far from perfect, but we hope that larger data sets and improving mechanistic understanding of NMD will improve this in the future.

The multi-tissue GTEx data allowed us to show a previously uncharacterized phenomenon: the same variant in the same individual can have different levels of nonsense-mediated decay in different tissues. This happens for about 20% of the variants in our data set, but given the relatively small set of tissues from most individuals, the true value might be even higher. This demonstrates that the effect of PTVs depends on the cellular context and can vary between tissues, like many other types of genetic effects. I really look forward to seeing future studies of how important this is for manifestation of clinical traits in different parts of the body.

Splicing it right

Another type of variants that we analyzed were variants close to the splice junction where exons of the gene are joined together to make a full mature transcript. Variants close to the junctions can disturb the process and lead to abnormal splicing, which often destroys the gene function. Most existing variant annotation tools classify variants very roughly to those in the 2-bp canonical splice site that is essential for proper splicing, and to those in the proximity with putative effects. We developed a new model for measuring if and how variants in different positions respective to the splice site affect splicing. We found widespread effects outside the canonical splice sites, which was not exactly a surprise, but the substantial variation between the different positions demonstrates the need for much more refined annotation of splice-affecting variants.

Buffers against gene loss

Finally, we studied large deletions that essentially get rid of the whole gene, with a specific question in mind: do individuals heterozygous for the deletion have gene expression levels half from the normal – as expected from only one copy – or is the normal allele able to compensate for the missing one, leading to expression levels close to normal. The latter situation, called dosage compensation, has been a debated topic, and while our data is not sufficient to say that it never happens, we can say that it is not common. This is a particularly interesting finding given the fact that heterozygous PTVs of all types are widespread and well-tolerated in the normal population (a phenomenon called haplosufficiency). Apparently the buffering mechanisms that make our cells tolerant to heterozygous variants operate at higher cellular levels, rather than compensating for the expression levels of the gene itself.

The lessons learned

One general observation in our paper was the different effects of rare versus common PTVs, with common variants having less NMD and little enrichment in canonical splice sites. Common variants generally represent variants that have less detrimental functional effects – they have been tolerated by natural selection – and our data shows that they tell a slightly different story of how variants affect molecular function. While analysis of common variants in the context of GWAS, eQTLs, and other molecular *QTLs are important and informative, analyzing rare variant effects as well, as we have done in this study, is important for understanding the full spectrum of impact of genetic variants and especially the effect of new mutations. Our paper is just one of the first steps in this direction, and our sample size is still too low to capture truly rare variants well, and thus I expect many interesting discoveries in this field.

Our paper is a comprehensive analysis of how protein-truncating variants can affect the transcriptome – or from another angle, how transcriptome data can be used to better understand PTVs. Even though it’s an awful cliché to say that the conclusion of your study is “it’s complicated”, in this particular case that is a slightly less trivial answer. Of all genetic variants, PTVs are thought to be the easy ones to predict and assign function to, just computationally based on the sequence. But even supposedly simple things can be pretty complicated when you look closer, as we did here. Our results show context-specificity and point to currently unknown mechanisms of molecular function of PTVs, to such extent that it warrants more attention to achieve an informed and sophisticated classifications of these variants.

The increasing size of the GTEx data set will allow us to dig deeper into these questions, investigating the causes and consequences of e.g. tissue-specificity of nonsense-mediated decay, and test whether more refined partitioning of PTVs could improve disease-related analysis. One major topic for future research is to extend these types of analyses to individuals with disease to obtain better understanding of molecular function of disease-associated variants. Finally, while exome sequencing is quickly becoming standard clinical practice, transcriptome sequencing is not there yet. We believe that our study not only establishes some of the crucial data analysis practices and approaches, but also demonstrates the value of the transcriptome in interpreting the genome.

 

*We made a conscious choice to use the term protein-truncating variants, instead of the quite widely used loss-of-function variants, because PTV refers to the factual changes that these variants cause by truncating the gene (splice variants are a borderline case and some of them are not strictly PTVs). The downstream effect of protein truncation can be either a loss-of-function or a gain-of-function, as discussed above, although this classification is currently usually unknown. In my opinion, using the term LoF when you have analyzed PTVs is simply incorrect, and even though it’s still common, I would be extremely happy to see the community switching to more accurate terminology. Otherwise we will end up in a huge mess where some variants labeled as loss-of-function actually have gain-of-function effects. [/rant]

Allelic mapping bias doesn’t confound eQTL analysis

Panousis N, Gutierrez-Arcelus M, Dermitzakis ET, Lappalainen T: Allelic mapping bias in RNA-sequencing is not a major confounder in eQTL studies. Genome Biology 2014 15:467

It was winter 2012, and I couldn’t sleep. All was well, I was living in Geneva, and when I wasn’t skiing in the Alps, I was working as a postdoc in Manolis Dermitzakis’s lab. I had been working on several projects to understand how genetic affects gene expression, and I was just about to dive into the largest eQTL analysis of that time. But I had a nagging concern that was keeping me awake at night. I wasn’t sure if all our eQTLs were real.

My concern was allelic mapping bias, which arises when alignment differs between RNA-sequencing reads that carry different alleles of genetic variants. Typically, a read derived from the nonreference allele has a lower probability of mapping due to the mismatch it has to the reference genome. Thus, when this technical bias occurs, it distorts read counts making the reference allele seem higher expressed. It’s a pretty serious issue in allele-specific expression analysis that focuses specifically on comparing the read counts over heterozygous sites. I had done quite a bit of work to understand how much this bias affected ASE analysis, and how to deal with it. I was pretty comfortable with that.

But the bread and butter of population-scale RNA-seq analysis is not allele-specific expression analysis. It’s eQTL analysis, which aims to find an association between genotypes of genetic variants and gene expression levels in a population sample, thus marking regulatory variants in the genome. My concern was that if mapping bias indeed distorts reads so that individuals carrying nonreference alleles sometimes end up with lower read quantification of the surrounding gene, this could result in false eQTL associations. The analogous problem with expression microarray probes has been a major problem. Yet, no one had analyzed this for RNA-seq data. What if many of the published RNA-seq based eQTLs were false? What if the Geuvadis eQTL analysis that I was working on would be biased?

mapping_bias

Despite my concerns, I was not really sure if this was likely to be a very common problem, but that the gravity of the potential error – given our and others’ investment in RNA-seq eQTL studies – warranted a proper analysis. Luckily, Nikos Panousis joined the lab as a new PhD student, and he took on the project. He had to learn the ins and outs of the whole eQTL pipeline, comb through hundreds of lines of my perl code for processing 1000 Genomes data and simulating RNA-seq reads (that must have been the most painful part), run tens of thousands of jobs to simulate reads and align them, and of course do all the downstream analysis.

Screen Shot 2014-09-21 at 4.12.36 PMOur approach for tackling this question was pretty straightforward. The first thing was to simulate all possible RNA-seq reads overlapping common variants in 1000 Genomes Europeans, using the full haplotype information to take flanking variants into account. All variant loci were simulated with single-end reads using the genome sequence to build the reads, and for coding variants we simulated splice junctions and paired-end reads as well. After mapping these reads, we could simply ask if a read with reference and nonreference alleles both aligned to the correct locus. We found out that usually they did, but especially reads carrying nonreference alleles of indels often did not map correctly. These results were little affected with the choice of the aligner, or by adding splicing information to the picture, but with paired-end reads the bias was a bit less than with single-end reads, as expected. These mapping bias statistics per variant are released with the paper for others to use and analyze. Having thus created a list of loci with likely allelic mapping bias, we were able to tackle the main question of our paper: whether these biased reads give rise to biased quantifications and false eQTLs.

Screen Shot 2014-09-21 at 4.07.55 PM

In order to analyze this, we took real RNA-seq eQTL data that we later published in Gutierrez-Arcelus et al. and removed all the reads in the positions that were biased in simulations, thus getting rid of the potentially dodgy data. We then re-ran quantifications and the eQTL analysis. The comparison of the original results and the new ones, with filtering of the potential biases, showed a comforting pattern: there was only a handful of strong eQTL associations that disappeared in the filtering (blue dots in the figure), thus suggesting that they were false positives in the original data. But these were a tiny fraction of thousands of significant eQTLs. The vast majority of eQTLs were OK. We – and everyone else doing eQTL analysis with RNA-seq – were OK.

So, who should care about this negative result? If you are working on RNA-seq and eQTLs, it is important to know if a signal in your data is driven by true biology or a technical bias. Importantly, the bias we analyzed here is of the most dangerous type, since it mimics the biological signal that we’re looking for in eQTL studies – association between genotype and expression. However, the results of this paper do not mean that allelic mapping bias is not an issue at all. For example, analysis of allele-specific expression (or chromatin state or TF binding) is particularly sensitive to mapping biases, and in such analyses additional care is needed (we use stringent filters). And what about much larger eQTL studies, where increased statistical power means that even tiny biases can become significant? This warrants future analysis either with simulations or real data.

There is still work to be done. Thus far, I haven’t heard of a computationally feasible alignment method that would correct allelic mapping bias perfectly, taking fully into account indels, splicing, rare variants, variants that you may not have (correctly) genotyped, and multiple flanking variants on the same read. I know that people are working on this, and I hope that an elegant, scalable solution will be available in the near future. In the meanwhile, I hope that the RNA-seq eQTL community finds this paper useful, and that we can all sleep a little bit better at night, knowing that the vast majority of our eQTLs are not affected by allelic mapping bias.