Panousis N, Gutierrez-Arcelus M, Dermitzakis ET, Lappalainen T: Allelic mapping bias in RNA-sequencing is not a major confounder in eQTL studies. Genome Biology 2014 15:467
It was winter 2012, and I couldn’t sleep. All was well, I was living in Geneva, and when I wasn’t skiing in the Alps, I was working as a postdoc in Manolis Dermitzakis’s lab. I had been working on several projects to understand how genetic affects gene expression, and I was just about to dive into the largest eQTL analysis of that time. But I had a nagging concern that was keeping me awake at night. I wasn’t sure if all our eQTLs were real.
My concern was allelic mapping bias, which arises when alignment differs between RNA-sequencing reads that carry different alleles of genetic variants. Typically, a read derived from the nonreference allele has a lower probability of mapping due to the mismatch it has to the reference genome. Thus, when this technical bias occurs, it distorts read counts making the reference allele seem higher expressed. It’s a pretty serious issue in allele-specific expression analysis that focuses specifically on comparing the read counts over heterozygous sites. I had done quite a bit of work to understand how much this bias affected ASE analysis, and how to deal with it. I was pretty comfortable with that.
But the bread and butter of population-scale RNA-seq analysis is not allele-specific expression analysis. It’s eQTL analysis, which aims to find an association between genotypes of genetic variants and gene expression levels in a population sample, thus marking regulatory variants in the genome. My concern was that if mapping bias indeed distorts reads so that individuals carrying nonreference alleles sometimes end up with lower read quantification of the surrounding gene, this could result in false eQTL associations. The analogous problem with expression microarray probes has been a major problem. Yet, no one had analyzed this for RNA-seq data. What if many of the published RNA-seq based eQTLs were false? What if the Geuvadis eQTL analysis that I was working on would be biased?
Despite my concerns, I was not really sure if this was likely to be a very common problem, but that the gravity of the potential error – given our and others’ investment in RNA-seq eQTL studies – warranted a proper analysis. Luckily, Nikos Panousis joined the lab as a new PhD student, and he took on the project. He had to learn the ins and outs of the whole eQTL pipeline, comb through hundreds of lines of my perl code for processing 1000 Genomes data and simulating RNA-seq reads (that must have been the most painful part), run tens of thousands of jobs to simulate reads and align them, and of course do all the downstream analysis.
Our approach for tackling this question was pretty straightforward. The first thing was to simulate all possible RNA-seq reads overlapping common variants in 1000 Genomes Europeans, using the full haplotype information to take flanking variants into account. All variant loci were simulated with single-end reads using the genome sequence to build the reads, and for coding variants we simulated splice junctions and paired-end reads as well. After mapping these reads, we could simply ask if a read with reference and nonreference alleles both aligned to the correct locus. We found out that usually they did, but especially reads carrying nonreference alleles of indels often did not map correctly. These results were little affected with the choice of the aligner, or by adding splicing information to the picture, but with paired-end reads the bias was a bit less than with single-end reads, as expected. These mapping bias statistics per variant are released with the paper for others to use and analyze. Having thus created a list of loci with likely allelic mapping bias, we were able to tackle the main question of our paper: whether these biased reads give rise to biased quantifications and false eQTLs.
In order to analyze this, we took real RNA-seq eQTL data that we later published in Gutierrez-Arcelus et al. and removed all the reads in the positions that were biased in simulations, thus getting rid of the potentially dodgy data. We then re-ran quantifications and the eQTL analysis. The comparison of the original results and the new ones, with filtering of the potential biases, showed a comforting pattern: there was only a handful of strong eQTL associations that disappeared in the filtering (blue dots in the figure), thus suggesting that they were false positives in the original data. But these were a tiny fraction of thousands of significant eQTLs. The vast majority of eQTLs were OK. We – and everyone else doing eQTL analysis with RNA-seq – were OK.
So, who should care about this negative result? If you are working on RNA-seq and eQTLs, it is important to know if a signal in your data is driven by true biology or a technical bias. Importantly, the bias we analyzed here is of the most dangerous type, since it mimics the biological signal that we’re looking for in eQTL studies – association between genotype and expression. However, the results of this paper do not mean that allelic mapping bias is not an issue at all. For example, analysis of allele-specific expression (or chromatin state or TF binding) is particularly sensitive to mapping biases, and in such analyses additional care is needed (we use stringent filters). And what about much larger eQTL studies, where increased statistical power means that even tiny biases can become significant? This warrants future analysis either with simulations or real data.
There is still work to be done. Thus far, I haven’t heard of a computationally feasible alignment method that would correct allelic mapping bias perfectly, taking fully into account indels, splicing, rare variants, variants that you may not have (correctly) genotyped, and multiple flanking variants on the same read. I know that people are working on this, and I hope that an elegant, scalable solution will be available in the near future. In the meanwhile, I hope that the RNA-seq eQTL community finds this paper useful, and that we can all sleep a little bit better at night, knowing that the vast majority of our eQTLs are not affected by allelic mapping bias.