The sound of a gene breaking

Rivas MA#*, Pirinen M, Conrad DF, Lek M, Tsang EK, Karczewski KJ, Maller JB, Kukurba KR, DeLuca DS, Fromer M, Ferreira PG, Smith KS, Zhang R, Zhao F, Banks E, Poplin R, Ruderfer DM, Purcell SM, Tukiainen T, Minikel EV, Stenson PD, Cooper DN, Huang KH, Sullivan TJ, Nedzel J; GTEx Consortium; Geuvadis Consortium, Bustamante CD, Li JB, Daly MJ, Guigo R, Donnelly P, Ardlie K, Sammeth M, Dermitzakis ET, McCarthy MI, Montgomery SB, Lappalainen T*#, MacArthur DG#* (2015) Effect of predicted protein-truncating genetic variants on the human transcriptome. Science 348:666-9

#* Equal contribution; corresponding authors

If you listened to your cells very carefully, would you be able to hear when a gene breaks? In a new paper in Science, we show that this is, in a way, possible: by analysis of huge RNA-sequencing data sets, we can detect cellular effects of protein-truncating variants in the genome.

Trending: rare variants in human disease

Like fashion, cooking, and all other areas of life, genetic research has its trends. One of the hottest things right now is rare variant association analysis, enabled by cost-effective exome sequencing. In these studies, sequence data from several thousands of individuals are compared to find disease-associated genes where cases have a higher load of genetic variants that are likely to disturb gene function, compared to the controls. This approach has identified genes that are relevant to, for example, autism, as well as individual mutations that substantially increase or decrease the risk for e.g. cardiovascular disease.

A key step in the analysis of rare variants in disease is to distinguish genetic variants that are likely to disturb gene function. This allows us to look for the culprits among a smaller group of potentially important variants, instead of drowning the signal in the noise of all the variants in our genomes, most of which have no impact at all. The most common approach is to focus on protein-truncating variants or PTVs (sometimes also called loss-of-function variants*) that terminate or substantially alter the protein sequence. These are typically determined from DNA sequence data alone based on the known genetic code and gene models.

In this study, we wanted to better interpret the functional effects of this important class of variants by analyzing their effects on the transcriptome. We integrated DNA and RNA sequencing data from hundreds of individuals from two projects: Geuvadis, which we published in 2013, and the new GTEx project, which has multiple tissue samples from the same individuals. The 653 individuals in total were from the normal population without any particular disease, but still, as expected, we found as many as 16,286 PTVs – earlier studies have shown that all of us carry some broken genes without it causing any apparent harm. We then studied how the impact of these variants can be captured in the RNA sequencing data, and what we can learn of their function based on the patterns that we find.


Getting rid of the nonsense

The first type of variants that we studied included SNPs that introduce a premature stop sign for the translation machinery (“nonsense SNPs”), and small insertions or deletions that scramble the downstream protein code (“frameshift indels”). Both of these give the translation machinery an early stop sign leading to a truncated protein, which is in the best case just nonfunctional (loss-of-function), but sometimes these pieces of protein can be actively harmful (gain-of-function). To avoid the latter risky situation, nature has developed an incredibly smart cellular pathway called nonsense-mediated decay (NMD) that identifies RNA transcripts that have a stop codon too early in the gene, and degrades them before they get translated into potentially harmful proteins. This amazing mechanism is easy to detect in RNA sequencing data. We look at each nonsense SNP or frameshift indel in an individual who is heterozygous for the variant, and if the “stop” allele is observed at a substantially lower level than the expected 50-50 ratio, it’s a likely case of active nonsense-mediated decay. Matti Pirinen and Manny Rivas developed a neat statistical model for this, published at the same time in Bioinformatics.
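To make the 50-50 comparison concrete, here is a minimal sketch of the idea, not the statistical model from the Bioinformatics paper: a one-sided exact binomial test asking whether the “stop” allele has fewer reads than expected at a heterozygous site. The function name and the read counts are hypothetical, and a real analysis would also need to handle overdispersion and mapping bias, which this sketch ignores.

```python
from math import comb

def binom_lower_tail(alt_reads: int, total_reads: int, p: float = 0.5) -> float:
    """P(X <= alt_reads) for X ~ Binomial(total_reads, p)."""
    return sum(
        comb(total_reads, k) * p**k * (1 - p) ** (total_reads - k)
        for k in range(alt_reads + 1)
    )

# Hypothetical read counts at one heterozygous nonsense SNP
stop_reads, ref_reads = 5, 35
total = stop_reads + ref_reads

stop_ratio = stop_reads / total              # fraction of reads carrying the stop allele
p_value = binom_lower_tail(stop_reads, total)  # small value: stop allele is depleted

print(f"stop allele ratio = {stop_ratio:.3f}, one-sided p = {p_value:.2e}")
```

A stop allele ratio far below 0.5 with a small p-value is the pattern that would be consistent with active NMD under these simplifying assumptions.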

So what did our analysis teach us about nonsense-mediated decay? First, we looked at how well the current gold-standard method for predicting NMD works. According to the 50 base pair rule formulated in the 90’s, transcripts where the premature stop is within 50bp from the last splice junction should not trigger NMD, and our data in this and earlier studies indeed show substantially lower levels of NMD for such variants. However, the prediction still fails for about 25% of the variants. Having the biggest data set of this type thus far, we trained a new model based on the properties of the variant, gene, and sequence, and improved the ability to predict NMD to about 80%. This is still far from perfect, but we hope that larger data sets and improving mechanistic understanding of NMD will improve this in the future.
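The 50 base pair rule itself is simple enough to write down. The sketch below is one possible reading of the rule in spliced-transcript coordinates, not the new model trained in the paper; the function name, the coordinate convention, and the example exon lengths are all assumptions made for illustration.

```python
from typing import Sequence

def predicts_nmd(stop_pos: int, exon_lengths: Sequence[int], rule_bp: int = 50) -> bool:
    """Apply the 50 bp rule in transcript (spliced mRNA) coordinates.

    stop_pos: 0-based transcript position of the premature stop codon (assumed convention).
    exon_lengths: exon lengths in transcript order.
    Returns True if NMD is predicted, False if the variant is predicted to escape.
    """
    if len(exon_lengths) < 2:
        return False  # no exon-exon junction, so the rule predicts no NMD
    last_junction = sum(exon_lengths[:-1])  # transcript position where the last exon starts
    if stop_pos >= last_junction:
        return False  # premature stop lies in the last exon
    # NMD is predicted only if the stop is more than rule_bp upstream of the last junction
    return (last_junction - stop_pos) > rule_bp

exons = [200, 150, 300]  # hypothetical three-exon transcript, last junction at position 350
print(predicts_nmd(100, exons))  # far upstream of the last junction
print(predicts_nmd(320, exons))  # within 50 bp of the last junction
print(predicts_nmd(400, exons))  # in the last exon
```

As the text notes, a rule of this kind misclassifies a substantial fraction of variants, which is why a richer model is worth training.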

The multi-tissue GTEx data allowed us to show a previously uncharacterized phenomenon: the same variant in the same individual can have different levels of nonsense-mediated decay in different tissues. This happens for about 20% of the variants in our data set, but given the relatively small set of tissues from most individuals, the true value might be even higher. This demonstrates that the effect of PTVs depends on the cellular context and can vary between tissues, like many other types of genetic effects. I really look forward to seeing future studies of how important this is for manifestation of clinical traits in different parts of the body.

Splicing it right

Another type of variants that we analyzed were variants close to the splice junction where exons of the gene are joined together to make a full mature transcript. Variants close to the junctions can disturb the process and lead to abnormal splicing, which often destroys the gene function. Most existing variant annotation tools classify variants very roughly into those in the 2-bp canonical splice site that is essential for proper splicing, and those in the proximity with putative effects. We developed a new model for measuring if and how variants in different positions relative to the splice site affect splicing. We found widespread effects outside the canonical splice sites, which was not exactly a surprise, but the substantial variation between the different positions demonstrates the need for much more refined annotation of splice-affecting variants.

Buffers against gene loss

Finally, we studied large deletions that essentially get rid of the whole gene, with a specific question in mind: do individuals heterozygous for the deletion have gene expression levels that are half of normal – as expected from only one copy – or is the normal allele able to compensate for the missing one, leading to expression levels close to normal? The latter situation, called dosage compensation, has been a debated topic, and while our data is not sufficient to say that it never happens, we can say that it is not common. This is a particularly interesting finding given the fact that heterozygous PTVs of all types are widespread and well-tolerated in the normal population (a phenomenon called haplosufficiency). Apparently the buffering mechanisms that make our cells tolerant to heterozygous variants operate at higher cellular levels, rather than compensating for the expression levels of the gene itself.
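The question boils down to a ratio: with no compensation, heterozygous deletion carriers should sit near 0.5 of the non-carrier expression level, and with full compensation near 1. A minimal sketch of that comparison follows; the function name and the expression values are made up for illustration and this is not the analysis used in the paper.

```python
from statistics import median

def carrier_expression_ratio(carrier_expr, noncarrier_expr):
    """Median expression in heterozygous deletion carriers relative to non-carriers."""
    return median(carrier_expr) / median(noncarrier_expr)

# Hypothetical normalized expression values for one gene
noncarriers = [98.0, 102.0, 100.0, 105.0, 95.0]
carriers = [49.0, 52.0, 51.0]

ratio = carrier_expression_ratio(carriers, noncarriers)
print(f"carrier / non-carrier expression = {ratio:.2f}")  # near 0.5 means no compensation
```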

The lessons learned

One general observation in our paper was the different effects of rare versus common PTVs, with common variants having less NMD and little enrichment in canonical splice sites. Common variants generally represent variants that have less detrimental functional effects – they have been tolerated by natural selection – and our data shows that they tell a slightly different story of how variants affect molecular function. While analysis of common variants in the context of GWAS, eQTLs, and other molecular *QTLs is important and informative, analyzing rare variant effects as well, as we have done in this study, is important for understanding the full spectrum of impact of genetic variants and especially the effect of new mutations. Our paper is just one of the first steps in this direction, and our sample size is still too low to capture truly rare variants well, and thus I expect many interesting discoveries in this field.

Our paper is a comprehensive analysis of how protein-truncating variants can affect the transcriptome – or from another angle, how transcriptome data can be used to better understand PTVs. Even though it’s an awful cliché to say that the conclusion of your study is “it’s complicated”, in this particular case that is a slightly less trivial answer. Of all genetic variants, PTVs are thought to be the easy ones to predict and assign function to, just computationally based on the sequence. But even supposedly simple things can be pretty complicated when you look closer, as we did here. Our results show context-specificity and point to currently unknown mechanisms of molecular function of PTVs, to such an extent that more attention is warranted to achieve an informed and sophisticated classification of these variants.

The increasing size of the GTEx data set will allow us to dig deeper into these questions, investigating the causes and consequences of e.g. tissue-specificity of nonsense-mediated decay, and test whether more refined partitioning of PTVs could improve disease-related analysis. One major topic for future research is to extend these types of analyses to individuals with disease to obtain better understanding of molecular function of disease-associated variants. Finally, while exome sequencing is quickly becoming standard clinical practice, transcriptome sequencing is not there yet. We believe that our study not only establishes some of the crucial data analysis practices and approaches, but also demonstrates the value of the transcriptome in interpreting the genome.


*We made a conscious choice to use the term protein-truncating variants, instead of the quite widely used loss-of-function variants, because PTV refers to the factual changes that these variants cause by truncating the gene (splice variants are a borderline case and some of them are not strictly PTVs). The downstream effect of protein truncation can be either a loss-of-function or a gain-of-function, as discussed above, although this classification is currently usually unknown. In my opinion, using the term LoF when you have analyzed PTVs is simply incorrect, and even though it’s still common, I would be extremely happy to see the community switching to more accurate terminology. Otherwise we will end up in a huge mess where some variants labeled as loss-of-function actually have gain-of-function effects. [/rant]

Unequal inheritance of gene function: parental gene expression across human tissues

Baran Y, Subramaniam M, Biton A, Tukiainen T, Tsang EK, Rivas MA, Pirinen M, Gutierrez-Arcelus M, Smith KS, Kukurba KR, Zhang R, Eng C, Torgerson DG, Urbanek C; GTEx Consortium, Li JB, Rodriguez-Santana JR, Burchard EG, Seibold MA, MacArthur DG, Montgomery SB, Zaitlen NA*, Lappalainen T* (2015) The landscape of genomic imprinting across diverse adult human tissues. Genome Research DOI: 10.1101/gr.192278.115

* Equal contribution

Our genomes carry about one hundred genes that rebel against one of the basic rules of genetics: equal contribution of both parents. This small group of genes is expressed and functional only from the copy inherited from our mother or only from the copy inherited from our father, in contrast to the over 20,000 genes expressed from alleles inherited from both parents. This happens by epigenetic silencing of the other copy, where specific marks on the DNA carry a memory of whether a gene sitting on a copy of a chromosome came from the sperm or the egg. These marks can lead the cell to inactivate the gene from one of the parental copies. This phenomenon, called genomic imprinting, is the topic of our new paper in Genome Research.

Why do we care about a phenomenon that is clearly a rare exception from the normal pattern? First of all, imprinting is famous for its role in an intriguing set of human diseases, where the effect of a genetic mutation depends on whether the affected individual inherited it from the mother or from the father. The best known example is an imprinted locus on chromosome 15, where the exact same genetic mutations cause either the Prader-Willi syndrome if the mutation comes from the father, or the Angelman syndrome if it comes from the mother – with essentially opposite sets of symptoms. The same parental effect can apply to weak genetic risk for common diseases. These effects are part of an interesting general phenomenon where the impact of an individual’s genetic variants cannot be predicted from the genetic code alone, without knowing their genetic context or environment. In the case of imprinted genes, the context of crucial importance is the parental ancestry of those variants, and in recent years there has been increasing emphasis on our need to understand how imprinting modifies genetic associations to disease.

Additionally – putting my geek hat on – this little quirk of nature is just a tremendously cool biological phenomenon and we don’t even know why it exists. Imprinting happens in mammals and some plants, but the evolutionary fitness benefits of giving up two functional chromosomes in specific loci are not known. There are a few theories, from parental conflict to mother-offspring coadaptation, but the situation is terribly unclear. A persistent problem has been that the field has been lacking comprehensive, systematic data sets to empirically test the mathematical models of the evolution of imprinting.

In our new paper, we have characterized imprinting across a diverse set of human tissues, using a systematic genome-wide approach. This was made possible by the data set of the Genotype Tissue Expression (GTEx) project pilot phase, with genotype and RNA-sequencing data across 33 tissues and 178 individuals. From these data, we first measured allele-specific expression, where a heterozygous site in a gene can be used to distinguish gene expression from the two copies of a gene. Usually, both copies are expressed in roughly equal amounts, but under imprinting, one gene copy is silenced and thus we will see expression of only one allele.

The challenge in analyzing imprinting from GTEx data comes from it being a population sample, without any information on the parents of each individual. Thus, we can’t directly observe expression of only the maternally or paternally inherited allele in a given gene. However, a strong signal of imprinting will lead to expression of only one allele in all individuals in a manner that is independent of genetic effects. Detecting this signal is not an easy task due to various technical and biological confounding factors, but together with Yael Baran and Noah Zaitlen we developed a sophisticated statistical method that finds genes where the pattern of monoallelic expression is consistent with imprinting and not with other processes. These results are supported by several validation data sets and careful curation. When possible, previously published resources were used to classify maternally and paternally expressed genes.
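The signal described above can be illustrated with a deliberately crude heuristic, which is not the statistical method developed for the paper. It assumes that under imprinting every heterozygous individual is strongly monoallelic, and that which allele is expressed (reference or non-reference) differs between individuals because it follows parental origin rather than genotype. The function name, the cutoffs, and the read counts are hypothetical.

```python
def imprinting_like(ref_alt_counts, mono_cutoff=0.9, min_reads=10):
    """Crude check for an imprinting-like pattern at one gene.

    ref_alt_counts: list of (ref_reads, alt_reads), one pair per heterozygous individual.
    Returns True only if every sufficiently covered individual is strongly monoallelic
    and both the reference and the non-reference allele are seen as the expressed one.
    """
    directions = []
    for ref, alt in ref_alt_counts:
        total = ref + alt
        if total < min_reads:
            continue  # too few reads to judge this individual
        major_fraction = max(ref, alt) / total
        if major_fraction < mono_cutoff:
            return False  # at least one individual is clearly biallelic
        directions.append("ref" if ref > alt else "alt")
    # Monoallelic expression in one direction only looks like a genetic effect instead
    return len(set(directions)) == 2

print(imprinting_like([(40, 1), (0, 35), (2, 50), (60, 3)]))  # monoallelic, both directions
print(imprinting_like([(20, 22), (30, 28)]))                  # biallelic expression
print(imprinting_like([(40, 1), (50, 2)]))                    # always the same allele
```

A real method has to deal with sequencing noise, mapping bias, and other sources of monoallelic expression, which is exactly why a proper statistical model is needed.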

Applying our method to the GTEx data set, we discovered 42 genes with a solid pattern of imprinted gene expression. This included 30 genes with some previous evidence of imprinting, and also 12 new genes, mostly discovered in rarely studied tissues. This alone highlights the importance of analyzing diverse tissues. This is still not the full list of imprinted genes – we intentionally chose a conservative approach to avoid false positives – but for the first time, we can now take a look at how imprinting of this confident set of genes varies between human tissues.

About half of these genes are imprinted in all the tissues where they are expressed, but several genes show tissue-specificity of imprinting by being expressed from both alleles in some tissues. Most tissues have similar overall levels of imprinting, but testis came out as an outlier, having significantly less imprinting. This makes perfect sense: up to 60-70% of the cells in the testis samples are from the male germline, and in these cells the imprinting marks of these individuals’ parents are erased and replaced with paternal imprinting marks for the next generation.

The real surprise of our study came from looking at whether tissue-specificity of imprinting could also manifest in a gene having maternal expression in some tissues and paternal in others. Usually this is not the case, but we found two exceptions to this rule – a previously known case of GRB10, and something that made me fall off the chair: IGF2. If you google this gene, you will see that it’s one of the best known imprinted genes, has a role in many diseases, and is expressed from the paternal allele. Except that in the brain it’s not. Our data shows that the brain expresses the opposite allele from other tissues, i.e. the maternal one. The functional consequences, evolutionary origins, and molecular mechanisms of this flip of the imprinted allele in IGF2 need to be figured out by future studies. But our discovery demonstrates tremendous plasticity in imprinting across tissues, and the power of our approach to detect novel phenomena even in very well studied genes.


Many human traits – from gene expression levels to disease risk – vary between individuals. But what about imprinting? We find several genes with evidence of inter-individual differences in how tightly imprinted the gene is, suggesting that imprinting can be a variable trait not only between tissues but also between individuals. Interestingly, some of this variation seems to be driven by sex in skeletal muscle – a sexually dimorphic tissue – where a few maternally imprinted growth repressors have a lower level of imprinting in females. One can’t help but speculate whether this is an attempt by mothers to suppress muscle growth in their daughters but not sons. An interesting question for the future will be the quest for the potential genetic variants that may regulate imprinting levels – iQTLs, analogously to eQTLs that affect gene expression levels.

We’ve made an effort to make everything accessible to the community by making the paper open-access, releasing the software, R code for the analysis and figures, and publishing a comprehensive set of statistics and data in the supplement and soon also in the GTEx portal. We are convinced of the power and robustness of our method, and look forward to applying it to future data sets. We hope that our method and results will be a useful resource for imprinting researchers and for the wider genomics community.

But our study is much more than just a resource paper. We take an important step towards systematic, statistically rigorous high-throughput analysis of imprinting in humans. In many other areas of genomics, the shift from candidate gene analysis to genome-wide approaches has enabled a major boost in biological and medical discovery, and the same is now happening for imprinting. While family-based study designs have many benefits in analysis of imprinting, these samples are often very difficult to collect from humans, and population-based study designs such as GTEx provide access to an unparalleled diversity of tissues that is otherwise not available. By providing a map of imprinting across human tissues, we have refined previous catalogs of imprinted genes and discovered new patterns of how imprinting varies between tissues and individuals. We look forward to even deeper analyses with the growing GTEx data set, as well as integration of our findings with research of molecular mechanisms and disease relevance.




Allelic mapping bias doesn’t confound eQTL analysis

Panousis N, Gutierrez-Arcelus M, Dermitzakis ET, Lappalainen T: Allelic mapping bias in RNA-sequencing is not a major confounder in eQTL studies. Genome Biology 2014 15:467

It was winter 2012, and I couldn’t sleep. All was well, I was living in Geneva, and when I wasn’t skiing in the Alps, I was working as a postdoc in Manolis Dermitzakis’s lab. I had been working on several projects to understand how genetic variation affects gene expression, and I was just about to dive into the largest eQTL analysis of that time. But I had a nagging concern that was keeping me awake at night. I wasn’t sure if all our eQTLs were real.

My concern was allelic mapping bias, which arises when alignment differs between RNA-sequencing reads that carry different alleles of genetic variants. Typically, a read derived from the nonreference allele has a lower probability of mapping due to the mismatch it has to the reference genome. Thus, when this technical bias occurs, it distorts read counts, making the reference allele seem more highly expressed. It’s a pretty serious issue in allele-specific expression analysis that focuses specifically on comparing the read counts over heterozygous sites. I had done quite a bit of work to understand how much this bias affected ASE analysis, and how to deal with it. I was pretty comfortable with that.

But the bread and butter of population-scale RNA-seq analysis is not allele-specific expression analysis. It’s eQTL analysis, which aims to find an association between genotypes of genetic variants and gene expression levels in a population sample, thus marking regulatory variants in the genome. My concern was that if mapping bias indeed distorts reads so that individuals carrying nonreference alleles sometimes end up with lower read quantification of the surrounding gene, this could result in false eQTL associations. The analogous problem with expression microarray probes has been a major problem. Yet, no one had analyzed this for RNA-seq data. What if many of the published RNA-seq based eQTLs were false? What if the Geuvadis eQTL analysis that I was working on would be biased?
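To see why this worry is reasonable, consider a toy example. The sketch below fits the simplest possible eQTL model, a least-squares slope of expression on non-reference allele dosage (0, 1, or 2), and shows how a mapping loss that scales with dosage could create an association where there is no true effect. The 5% loss per non-reference allele copy and all values are invented for illustration and do not come from the paper.

```python
def eqtl_slope(genotypes, expression):
    """Least-squares slope of expression on non-reference allele dosage (0, 1, 2)."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_e = sum(expression) / n
    cov = sum((g - mean_g) * (e - mean_e) for g, e in zip(genotypes, expression))
    var = sum((g - mean_g) ** 2 for g in genotypes)
    return cov / var

genotypes = [0, 0, 1, 1, 2, 2]
true_expression = [100.0] * 6  # no real genetic effect on expression

# Hypothetical mapping loss: 5% of the gene's reads lost per non-reference allele copy
observed = [e * (1 - 0.05 * g) for g, e in zip(genotypes, true_expression)]

print(eqtl_slope(genotypes, true_expression))  # no association in the true values
print(eqtl_slope(genotypes, observed))         # spurious negative slope from the bias
```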


Despite my concerns, I was not really sure if this was likely to be a very common problem, but the gravity of the potential error – given our and others’ investment in RNA-seq eQTL studies – warranted a proper analysis. Luckily, Nikos Panousis joined the lab as a new PhD student, and he took on the project. He had to learn the ins and outs of the whole eQTL pipeline, comb through hundreds of lines of my perl code for processing 1000 Genomes data and simulating RNA-seq reads (that must have been the most painful part), run tens of thousands of jobs to simulate reads and align them, and of course do all the downstream analysis.

Our approach for tackling this question was pretty straightforward. The first thing was to simulate all possible RNA-seq reads overlapping common variants in 1000 Genomes Europeans, using the full haplotype information to take flanking variants into account. All variant loci were simulated with single-end reads using the genome sequence to build the reads, and for coding variants we simulated splice junctions and paired-end reads as well. After mapping these reads, we could simply ask if reads with the reference and nonreference alleles both aligned to the correct locus. We found out that usually they did, but especially reads carrying nonreference alleles of indels often did not map correctly. These results were little affected by the choice of the aligner, or by adding splicing information to the picture, but with paired-end reads the bias was a bit less than with single-end reads, as expected. These mapping bias statistics per variant are released with the paper for others to use and analyze. Having thus created a list of loci with likely allelic mapping bias, we were able to tackle the main question of our paper: whether these biased reads give rise to biased quantifications and false eQTLs.
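The core of the simulation step can be sketched in a few lines. This is a simplified illustration for a single SNP and single-end reads, not the perl pipeline used in the study: it ignores haplotypes with flanking variants, indels, splice junctions, and paired-end reads. The function name, the toy sequence, and the read length are assumptions.

```python
def reads_overlapping_snp(sequence, snp_pos, alt_base, read_len):
    """Return (ref_reads, alt_reads): every read_len window that covers snp_pos.

    sequence: reference sequence as a string; snp_pos: 0-based SNP position.
    """
    alt_sequence = sequence[:snp_pos] + alt_base + sequence[snp_pos + 1:]
    first_start = max(0, snp_pos - read_len + 1)          # leftmost read still covering the SNP
    last_start = min(snp_pos, len(sequence) - read_len)   # rightmost read that fits in the sequence
    starts = range(first_start, last_start + 1)
    ref_reads = [sequence[s:s + read_len] for s in starts]
    alt_reads = [alt_sequence[s:s + read_len] for s in starts]
    return ref_reads, alt_reads

# Toy example: a C>T SNP at position 5 and 4 bp reads
ref_reads, alt_reads = reads_overlapping_snp("ACGTACGTACGT", 5, "T", 4)
for ref, alt in zip(ref_reads, alt_reads):
    print(ref, alt)
```

Each pair of reads would then be aligned to the reference genome, and a variant flagged as biased when the two alleles do not both map to the correct locus.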


In order to analyze this, we took real RNA-seq eQTL data that we later published in Gutierrez-Arcelus et al. and removed all the reads in the positions that were biased in simulations, thus getting rid of the potentially dodgy data. We then re-ran quantifications and the eQTL analysis. The comparison of the original results and the new ones, with filtering of the potential biases, showed a comforting pattern: there was only a handful of strong eQTL associations that disappeared in the filtering (blue dots in the figure), thus suggesting that they were false positives in the original data. But these were a tiny fraction of thousands of significant eQTLs. The vast majority of eQTLs were OK. We – and everyone else doing eQTL analysis with RNA-seq – were OK.

So, who should care about this negative result? If you are working on RNA-seq and eQTLs, it is important to know if a signal in your data is driven by true biology or a technical bias. Importantly, the bias we analyzed here is of the most dangerous type, since it mimics the biological signal that we’re looking for in eQTL studies – association between genotype and expression. However, the results of this paper do not mean that allelic mapping bias is not an issue at all. For example, analysis of allele-specific expression (or chromatin state or TF binding) is particularly sensitive to mapping biases, and in such analyses additional care is needed (we use stringent filters). And what about much larger eQTL studies, where increased statistical power means that even tiny biases can become significant? This warrants future analysis either with simulations or real data.

There is still work to be done. Thus far, I haven’t heard of a computationally feasible alignment method that would correct allelic mapping bias perfectly, taking fully into account indels, splicing, rare variants, variants that you may not have (correctly) genotyped, and multiple flanking variants on the same read. I know that people are working on this, and I hope that an elegant, scalable solution will be available in the near future. In the meantime, I hope that the RNA-seq eQTL community finds this paper useful, and that we can all sleep a little bit better at night, knowing that the vast majority of our eQTLs are not affected by allelic mapping bias.

Uncovering functional variation in humans by genome and transcriptome sequencing

In a paper published in Nature in September 2013, we describe results of the largest study to date integrating RNA and genome sequencing data from multiple human populations, and provide a comprehensive map of how genetic variation affects the transcriptome. This was achieved by RNA-sequencing of individuals that are part of the 1000 Genomes sample set, thus adding a functional dimension to the most important catalogue of human genomes. In this blog post, previously published in Genomes Unzipped, and in the Stanford CEGH blog, I will discuss how our findings shed light on genetic associations to disease.

As genome-wide studies are providing an increasingly comprehensive catalog of genetic variants that predispose to various diseases, we are faced with a huge challenge: what do these variants actually do in the cell? Understanding the biological mechanisms underlying diseases is essential to develop interventions, but traditional molecular biology follow-up is not really feasible for the thousands of discovered GWAS loci. Thus, we need high-throughput approaches for measuring genetic effects at the cellular level, which is an intermediate between the genome and the disease. The cellular trait most amenable for such analysis is the transcriptome, which we can now measure reliably and robustly by RNA-sequencing (as shown by our companion paper in Nature Biotechnology).

In this project, several European institutes of the Geuvadis Consortium sequenced mRNA and small RNA from lymphoblast cell lines from 465 individuals that are in the 1000 Genomes sample set. The idea of gene expression analysis of genetic reference samples is not new (see e.g. papers by Stranger et al., Pickrell et al. and Montgomery et al.), but the bigger scale and better quality enables discovery of exciting new biology, as demonstrated by other recent RNA-seq papers as well (e.g. Battle et al., Gutierrez-Arcelus et al.).

Regulatory variants underlying GWAS signals

Our first striking observation was that over one half of measured genes are affected by common genetic variation in human populations – called expression quantitative trait loci or eQTLs. Regulatory associations are not like GWAS studies where you are lucky to find a handful of significant hits; regulatory variation is literally (almost) everywhere – it’s the rule, not the exception.

The vast majority of these regulatory variants won’t have any effect on the phenotype at the individual level, but some of them do. The first obvious question was how many known GWAS variants are eQTLs in our study, and indeed pretty many of them are – 16%. So does this prove that in all these GWAS regions we have identified the regulatory change as the cellular mechanism that drives the disease? Unfortunately the answer is no. Regulatory associations are so common that the expected overlap just by chance is as high as 11%. This means that your favorite GWAS variant having a significant regulatory association is very far from sufficient proof of it being the biological mechanism of the disease or trait. The same applies to overlap with for example ENCODE annotations, by the way. This is not overcautious small print. We’ve basically reversed the problem of having hardly any clue of functional mechanisms to having too many putative functions. We’ve found the haystack.
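The arithmetic behind this caution is worth spelling out. Using only the two percentages quoted above, the snippet below computes the fold enrichment over chance and, as a rough back-of-the-envelope reading rather than a result from the paper, the share of the observed overlap that exceeds the chance expectation. The variable names are my own.

```python
observed_overlap = 0.16  # fraction of GWAS variants that are eQTLs in the study
chance_overlap = 0.11    # overlap expected by chance

fold_enrichment = observed_overlap / chance_overlap
excess = observed_overlap - chance_overlap
share_beyond_chance = excess / observed_overlap

print(f"fold enrichment over chance: {fold_enrichment:.2f}")
print(f"share of the overlap beyond chance: {share_beyond_chance:.0%}")
```

In other words, under this crude reading most of the overlapping GWAS variants would be expected to overlap an eQTL even with no causal link, which is why overlap alone proves little.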

How can we solve this problem? Luckily, there are statistical methods to analyze the two association signals in the same genomic region to find out if the gene expression association is likely to be causal to the disease association. You still can’t be 100% sure, but that is much smaller print. And we do find an enrichment of such a signal, as in previous studies – telling us that regulatory changes are enriched for being causal biological mechanisms underlying GWAS signals.

From associated regions to causal variants

How can we use eQTLs to understand GWAS associations? In this eQTL association landscape, the red rs838705 variant is a GWAS variant associated to calcium levels. But the most likely causal variant for the change in calcium levels is the top eQTL variant rs201966773 – a 2bp insertion that overlaps several regulatory elements of the DGKD gene.

We can take this analysis an important step further to pinpoint likely causal variants. Thus far, nearly all association studies have used data from SNP arrays that measure only a subset of all the common variants. This works fine for identifying more or less broad regions of the genome that have a variant somewhere that changes the function of the genome such that it predisposes to the trait in question. However, usually there’s no clue what the precise causal variant is and what its exact properties are.

The first step in finding the causal variants is getting genome sequencing data, which is what we have in our study. We show that we have pretty good power to pinpoint causal regulatory variants in many of the loci, which is great news for understanding mechanisms of genome regulation. This has a cool application for dozens of GWAS loci that are driven by a regulatory association: by discovering the putative causal regulatory variant from our association data, we’re at the same time pinpointing the likely causal GWAS variant as well. Thus, combining genome sequencing and cellular phenotype data can give us information not only of the biological mechanisms underlying GWAS associations, but also identify the likely causal variants.

Where do we go next?

In this study we have integrated genome and transcriptome sequencing data to understand the landscape of functional variation in human populations. In addition to our scientific discoveries, this is an extremely valuable open-access data set for the human genetics community, as it links directly to the 1000 Genomes data that is used by nearly all human genetics projects. Since our pre-publication data release in November 2012, the data set has already been downloaded thousands of times, and we’ve put a lot of effort into open data sharing by having a browser and even opening our project wiki for the public.

This paper is a big step forward, but we’re still far from a full understanding of how genetic variation affects the transcriptome and how this affects human disease. One important challenge is the cellular effects of rare and loss-of-function variants, which we address only briefly in this paper. Furthermore, other projects such as GTEx are describing transcriptome variation and its genetic causes in a large variety of human tissues. We are part of that effort too, so stay tuned.

This study and other projects analyzing cellular phenotypes in the general human population are providing the baseline of the general population spectrum of functional genetic variation and transcriptome variation, which is essential to be able to distinguish the cases where things go wrong and cause disease. At the same time as we move forward with basic research, it is important to push for clinical applications to target cellular perturbations leading to disease, and develop approaches for personalized transcriptomics to better interpret personalized genomes.