In a paper published in Nature in September 2013, we describe results of the largest study to date integrating RNA and genome sequencing data from multiple human populations, and provide a comprehensive map of how genetic variation affects the transcriptome. This was achieved by RNA-sequencing of individuals who are part of the 1000 Genomes sample set, thus adding a functional dimension to the most important catalogue of human genomes. In this blog post, previously published in Genomes Unzipped and in the Stanford CEGH blog, I will discuss how our findings shed light on genetic associations to disease.
As genome-wide studies are providing an increasingly comprehensive catalog of genetic variants that predispose to various diseases, we are faced with a huge challenge: what do these variants actually do in the cell? Understanding the biological mechanisms underlying diseases is essential for developing interventions, but traditional molecular biology follow-up is not really feasible for the thousands of discovered GWAS loci. Thus, we need high-throughput approaches for measuring genetic effects at the cellular level, which is an intermediate between the genome and the disease. The cellular trait most amenable to such analysis is the transcriptome, which we can now measure reliably and robustly by RNA-sequencing (as shown by our companion paper in Nature Biotechnology).
In this project, several European institutes of the Geuvadis Consortium sequenced mRNA and small RNA from lymphoblastoid cell lines from 465 individuals that are in the 1000 Genomes sample set. The idea of gene expression analysis of genetic reference samples is not new (see e.g. papers by Stranger et al., Pickrell et al. and Montgomery et al.), but the bigger scale and better quality enable discovery of exciting new biology, as demonstrated by other recent RNA-seq papers as well (e.g. Battle et al., Gutierrez-Arcelus et al.).
Regulatory variants underlying GWAS signals
Our first striking observation was that over one half of measured genes are affected by common genetic variants in human populations, called expression quantitative trait loci or eQTLs. Regulatory associations are not like GWAS, where you are lucky to find a handful of significant hits; regulatory variation is literally (almost) everywhere. It’s the rule, not the exception.
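For readers who have not run an eQTL analysis, here is a minimal sketch of the basic idea: test whether a gene's expression level changes with the number of copies of an allele that an individual carries. This is not the pipeline used in the paper. The data are simulated, and the allele frequency, effect size and function name are all made up for illustration. Only the sample size of 465 comes from the study.

```python
import random

def eqtl_slope_and_r(dosages, expression):
    """Least-squares slope and Pearson correlation of expression on dosage."""
    n = len(dosages)
    mean_g = sum(dosages) / n
    mean_e = sum(expression) / n
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(dosages, expression))
    sxx = sum((g - mean_g) ** 2 for g in dosages)
    syy = sum((e - mean_e) ** 2 for e in expression)
    slope = sxy / sxx               # expression change per extra allele copy
    r = sxy / (sxx * syy) ** 0.5    # strength of the linear association
    return slope, r

rng = random.Random(0)
n_individuals = 465       # sample size from the study
allele_freq = 0.3         # hypothetical allele frequency
true_effect = 0.5         # hypothetical effect per allele copy

# Genotype dosage: 0, 1 or 2 copies of the alternative allele (simulated).
dosages = [sum(rng.random() < allele_freq for _ in range(2))
           for _ in range(n_individuals)]
# Simulated expression: genetic effect plus unit-variance noise.
expression = [true_effect * g + rng.gauss(0, 1) for g in dosages]

slope, r = eqtl_slope_and_r(dosages, expression)
print(f"estimated effect per allele: {slope:.2f}, correlation: {r:.2f}")
```

A real analysis repeats a test like this for many variants near each gene, adjusts for covariates and corrects for multiple testing, none of which is shown here.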
The vast majority of these regulatory variants won’t have any effect on the phenotype at the individual level, but some of them do. The first obvious question was how many known GWAS variants are eQTLs in our study, and indeed quite a few are: 16%. So does this prove that in all these GWAS regions we have identified the regulatory change as the cellular mechanism that drives the disease? Unfortunately the answer is no. Regulatory associations are so common that the expected overlap just by chance is as high as 11%. This means that your favorite GWAS variant having a significant regulatory association is very far from sufficient proof of it being the biological mechanism of the disease or trait. The same applies to overlap with, for example, ENCODE annotations, by the way. This is not overcautious small print. We’ve basically gone from having hardly any clue about functional mechanisms to having too many putative functions. We’ve found the haystack.
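To make the 16% versus 11% comparison concrete, here is a rough sketch of how a chance expectation of this kind can be estimated: draw random variant sets of the same size as the GWAS set and see how often they hit an eQTL. This is a toy simulation, not the paper's procedure. The variant counts and function names are invented, and in practice null variants are usually matched to the GWAS variants on properties such as allele frequency.

```python
import random

def overlap_fraction(variants, eqtl_set):
    """Fraction of the given variants that are also eQTLs."""
    return sum(v in eqtl_set for v in variants) / len(variants)

def chance_overlap(background, eqtl_set, n_draw, n_rounds, rng):
    """Mean eQTL overlap of random variant sets drawn from the background."""
    fractions = [overlap_fraction(rng.sample(background, n_draw), eqtl_set)
                 for _ in range(n_rounds)]
    return sum(fractions) / n_rounds

rng = random.Random(1)

# Toy background: 10,000 hypothetical variant IDs, 11% of them eQTLs.
background = list(range(10_000))
eqtl_set = set(rng.sample(background, 1_100))

# Null expectation for a hypothetical GWAS set of 500 variants.
null_overlap = chance_overlap(background, eqtl_set,
                              n_draw=500, n_rounds=200, rng=rng)

# The two percentages reported in the post.
observed, expected = 0.16, 0.11
fold_enrichment = observed / expected

print(f"simulated chance overlap: {null_overlap:.3f}")
print(f"fold enrichment (16% vs 11%): {fold_enrichment:.2f}")
```

The observed overlap is only about 1.45 times the chance expectation, so most of any single overlap could be coincidence.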
How can we solve this problem? Luckily, there are statistical methods to analyze the two association signals in the same genomic region to find out if the gene expression association is likely to be causal to the disease association. You still can’t be 100% sure, but that is much smaller print. And we do find an enrichment of such a signal, as in previous studies – telling us that regulatory changes are enriched for being causal biological mechanisms underlying GWAS signals.
From associated regions to causal variants
We can take this analysis an important step further to pinpoint likely causal variants. Thus far, nearly all association studies have used data from SNP arrays that measure only a subset of all the common variants. This works fine for identifying more or less broad regions of the genome that have a variant somewhere that changes the function of the genome such that it predisposes to the trait in question. However, usually there’s no clue what the precise causal variant is and what its exact properties are.
The first step in finding the causal variants is getting genome sequencing data, which is what we have in our study. We show that we have pretty good power to pinpoint causal regulatory variants in many of the loci, which is great news for understanding mechanisms of genome regulation. This has a cool application for dozens of GWAS loci that are driven by a regulatory association: by discovering the putative causal regulatory variant from our association data, we’re at the same time pinpointing the likely causal GWAS variant as well. Thus, combining genome sequencing and cellular phenotype data can not only tell us about the biological mechanisms underlying GWAS associations, but also identify the likely causal variants.
Where do we go next?
In this study we have integrated genome and transcriptome sequencing data to understand the landscape of functional variation in human populations. In addition to our scientific discoveries, this is an extremely valuable open-access data set for the human genetics community, as it links directly to the 1000 Genomes data that is used by nearly all human genetics projects. Since our pre-publication data release in November 2012, the data set has already been downloaded thousands of times, and we’ve put a lot of effort into open data sharing by providing a browser and even opening our project wiki to the public.
This paper is a big step forward, but we’re still far from a full understanding of how genetic variation affects the transcriptome and how this affects human disease. One important challenge is the cellular effects of rare and loss-of-function variants, which we address only briefly in this paper. Furthermore, other projects such as GTEx are describing transcriptome variation and its genetic causes in a large variety of human tissues. We are part of that effort too, so stay tuned.
This study and other projects analyzing cellular phenotypes in the general human population are providing the baseline of the general population spectrum of functional genetic variation and transcriptome variation, which is essential to be able to distinguish the cases where things go wrong and cause disease. At the same time as we move forward with basic research, it is important to push for clinical applications to target cellular perturbations leading to disease, and develop approaches for personalized transcriptomics to better interpret personalized genomes.