Unequal inheritance of gene function: parental gene expression across human tissues

Baran Y, Subramaniam M, Biton A, Tukiainen T, Tsang EK, Rivas MA, Pirinen M, Gutierrez-Arcelus M, Smith KS, Kukurba KR, Zhang R, Eng C, Torgerson DG, Urbanek C; GTEx Consortium, Li JB, Rodriguez-Santana JR, Burchard EG, Seibold MA, MacArthur DG, Montgomery SB, Zaitlen NA*, Lappalainen T* (2015) The landscape of genomic imprinting across diverse adult human tissues. Genome Research DOI: 10.1101/gr.192278.115

* Equal contribution

Our genomes carry about one hundred genes that rebel against one of the basic rules of genetics: equal contribution of both parents. This small group of genes is expressed and functional only from the copy inherited from the mother, or only from the copy inherited from the father, in contrast to the over 20,000 genes expressed from alleles inherited from both parents. This happens by epigenetic silencing of the other copy, where specific marks on the DNA carry a memory of whether a gene sitting on a copy of a chromosome came from the sperm or the egg. These marks can lead the cell to inactivate the gene on one of the parental copies. This phenomenon, called genomic imprinting, is the topic of our new paper in Genome Research.

Why do we care about a phenomenon that is clearly a rare exception from the normal pattern? First of all, imprinting is famous for its role in an intriguing set of human diseases, where the effect of a genetic mutation depends on whether the affected individual inherited it from the mother or from the father. The best known example is an imprinted locus on chromosome 15, where the exact same genetic mutations cause either the Prader-Willi syndrome if the mutation comes from the father, or the Angelman syndrome if it comes from the mother – with essentially opposite sets of symptoms. The same parental effect can apply to weak genetic risk for common diseases. These effects are part of an interesting general phenomenon where the impact of an individual’s genetic variants cannot be predicted from the genetic code alone, without knowing their genetic context or environment. In the case of imprinted genes, the context of crucial importance is the parental origin of those variants, and in recent years there has been increasing emphasis on our need to understand how imprinting modifies genetic associations to disease.

Additionally – putting my geek hat on – this little quirk of nature is just a tremendously cool biological phenomenon and we don’t even know why it exists. Imprinting happens in mammals and some plants, but the evolutionary fitness benefits of giving up two functional chromosomes in specific loci are not known. There are a few theories, from parental conflict to mother-offspring coadaptation, but the situation is terribly unclear. A persistent problem has been that the field has been lacking comprehensive, systematic data sets to empirically test the mathematical models of the evolution of imprinting.

In our new paper, we have characterized imprinting across a diverse set of human tissues, using a systematic genome-wide approach. This was made possible by the data set of the Genotype-Tissue Expression (GTEx) project pilot phase, with genotype and RNA-sequencing data across 33 tissues and 178 individuals. From these data, we first measured allele-specific expression, where a heterozygous site in a gene can be used to distinguish gene expression from the two copies of a gene. Usually, both copies are expressed in roughly the same amounts, but under imprinting, one gene copy is silenced and thus we will see expression of only one allele.
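To make the idea of allele-specific expression concrete, here is a minimal Python sketch of how one might summarize a single heterozygous site. It is not the code used in the paper: the function names are hypothetical, the read counts are made up for illustration, and a plain binomial test against a 50/50 expectation is an assumption that ignores overdispersion and mapping bias in real RNA-seq data.

```python
from math import comb


def allelic_ratio(ref_reads: int, alt_reads: int) -> float:
    """Fraction of RNA-seq reads carrying the reference allele at one heterozygous site."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    return ref_reads / total


def binomial_two_sided_p(ref_reads: int, alt_reads: int, p: float = 0.5) -> float:
    """Exact two-sided binomial test against balanced expression (assumed p = 0.5).

    Sums the probability of every outcome that is no more likely than the
    observed one. Intended for modest read depths (up to a few hundred reads).
    """
    n = ref_reads + alt_reads
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = pmf[ref_reads]
    # Small tolerance so that outcomes tied with the observed one are included.
    return min(1.0, sum(x for x in pmf if x <= observed * (1 + 1e-9)))


# Hypothetical counts: a balanced site and a fully monoallelic site.
print(allelic_ratio(48, 52), binomial_two_sided_p(48, 52))
print(allelic_ratio(20, 0), binomial_two_sided_p(20, 0))
```

A ratio near 0.5 with a large p-value is what both-copies expression looks like; a ratio near 0 or 1 with a tiny p-value is monoallelic expression, which on its own does not yet say whether imprinting is the cause.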

The challenge in analyzing imprinting from GTEx data comes from it being a population sample, without any information on the parents of each individual. Thus, we can’t directly observe expression of only the maternally or paternally inherited allele in a given gene. However, a strong signal of imprinting will lead to expression of only one allele in all individuals in a manner that is independent of genetic effects. Detecting this signal is not an easy task due to various technical and biological confounding factors, but together with Yael Baran and Noah Zaitlen we developed a sophisticated statistical method that finds genes where the pattern of monoallelic expression is consistent with imprinting and not with other processes. These results are supported by several validation data sets and careful curation. When possible, previously published resources were used to classify maternally and paternally expressed genes.
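The population-level signature can be illustrated with a toy heuristic in Python. This is emphatically not our statistical method, which models technical and biological confounders; it is a simplified sketch under stated assumptions. The function name, the labels, and all cutoffs are hypothetical. The one biological assumption it encodes is that in an imprinted gene the expressed allele is set by parental origin, so across unrelated heterozygous individuals the reference allele should be the expressed one in some people and the alternative allele in others, whereas a genetic or technical effect would favor the same allele in everyone.

```python
def classify_gene(ref_ratios, mono_cutoff=0.9, min_mono_fraction=0.9, min_individuals=6):
    """Toy classification of one gene from per-individual reference-allele ratios.

    ref_ratios: one value per heterozygous individual, between 0 and 1.
    All thresholds are arbitrary illustrative choices, not values from the paper.
    """
    if len(ref_ratios) < min_individuals:
        return "insufficient data"

    # Count individuals expressing (almost) only one allele, in either direction.
    ref_only = sum(r >= mono_cutoff for r in ref_ratios)
    alt_only = sum(r <= 1 - mono_cutoff for r in ref_ratios)
    mono_fraction = (ref_only + alt_only) / len(ref_ratios)

    if mono_fraction < min_mono_fraction:
        return "biallelic"
    if ref_only == 0 or alt_only == 0:
        # Everyone expresses the same allele: looks genetic or technical.
        return "allele-biased"
    # Monoallelic in nearly everyone, with the expressed allele varying by person.
    return "imprinting-like"


print(classify_gene([0.99, 0.01, 0.98, 0.02, 0.97, 0.03]))  # imprinting-like
print(classify_gene([0.52, 0.47, 0.50, 0.55, 0.49, 0.51]))  # biallelic
print(classify_gene([0.99, 0.98, 0.97, 0.99, 0.96, 0.98]))  # allele-biased
```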

Applying our method to the GTEx data set, we discovered 42 genes with a solid pattern of imprinted gene expression. This included 30 genes with some previous evidence of imprinting, and also 12 new genes, mostly discovered in rarely studied tissues. This alone highlights the importance of analyzing diverse tissues. This is still not the full list of imprinted genes – we intentionally chose a conservative approach to avoid false positives – but for the first time, we can now take a look at how imprinting of this confident set of genes varies between human tissues.

About half of these genes are imprinted in all the tissues where they are expressed, but several genes show tissue-specificity of imprinting by being expressed from both alleles in some tissues. Most tissues have similar overall levels of imprinting, but testis came out as an outlier, having significantly less imprinting. This makes perfect sense: up to 60-70% of the cells in the testis samples are from the male germline, and in these cells the imprinting marks of these individuals’ parents are erased and replaced with paternal imprinting marks for the next generation.

The real surprise of our study came from asking whether tissue-specificity of imprinting could also manifest in a gene having maternal expression in some tissues and paternal in others. Usually this is not the case, but we found two exceptions to this rule – a previously known case of GRB10, and something that made me fall off the chair: IGF2. If you google this gene, you will see that it’s one of the best known imprinted genes, has a role in many diseases, and is expressed from the paternal allele. Except that in the brain it’s not. Our data show that the brain expresses the opposite allele from other tissues, i.e. the maternal one. The functional consequences, evolutionary origins, and molecular mechanisms of this flip of the imprinted allele in IGF2 need to be figured out by future studies. But our discovery demonstrates tremendous plasticity in imprinting across tissues, and the power of our approach to detect novel phenomena even in very well studied genes.


Many human traits – from gene expression levels to disease risk – vary between individuals. But what about imprinting? We find several genes with evidence of inter-individual differences in how tightly imprinted the gene is, suggesting that imprinting can be a variable trait not only between tissues but also between individuals. Interestingly, some of this variation seems to be driven by sex in skeletal muscle – a sexually dimorphic tissue – where a few maternally imprinted growth repressors have a lower level of imprinting in females. One can’t help but speculate whether this is an attempt by mothers to suppress muscle growth in their daughters but not sons. An interesting question for the future will be the quest for the potential genetic variants that may regulate imprinting levels – iQTLs, analogous to the eQTLs that affect gene expression levels.

We’ve made an effort to make everything accessible to the community by making the paper open-access, releasing the software, R code for the analysis and figures, and publishing a comprehensive set of statistics and data in the supplement and soon also in the GTEx portal. We are convinced of the power and robustness of our method, and look forward to applying it to future data sets. We hope that our method and results will be a useful resource for imprinting researchers and for the wider genomics community.

But our study is much more than just a resource paper. We take an important step towards systematic, statistically rigorous high-throughput analysis of imprinting in humans. In many other areas of genomics, the shift from candidate gene analysis to genome-wide approaches has enabled a major boost in biological and medical discovery, and the same is now happening for imprinting. While family-based study designs have many benefits in analysis of imprinting, these samples are often very difficult to collect from humans, and population-based study designs such as GTEx provide access to an unparalleled diversity of tissues that is otherwise not available. By providing a map of imprinting across human tissues, we have refined previous catalogs of imprinted genes and discovered new patterns of how imprinting varies between tissues and individuals. We look forward to even deeper analyses with the growing GTEx data set, as well as integration of our findings with research of molecular mechanisms and disease relevance.

Uncovering functional variation in humans by genome and transcriptome sequencing

In a paper published in Nature in September 2013, we describe results of the largest study to date integrating RNA and genome sequencing data from multiple human populations, and provide a comprehensive map of how genetic variation affects the transcriptome. This was achieved by RNA-sequencing of individuals that are part of the 1000 Genomes sample set, thus adding a functional dimension to the most important catalogue of human genomes. In this blog post, previously published in Genomes Unzipped, and in the Stanford CEGH blog, I will discuss how our findings shed light on genetic associations to disease.

As genome-wide studies are providing an increasingly comprehensive catalog of genetic variants that predispose to various diseases, we are faced with a huge challenge: what do these variants actually do in the cell? Understanding the biological mechanisms underlying diseases is essential to develop interventions, but traditional molecular biology follow-up is not really feasible for the thousands of discovered GWAS loci. Thus, we need high-throughput approaches for measuring genetic effects at the cellular level, which is an intermediate between the genome and the disease. The cellular trait most amenable to such analysis is the transcriptome, which we can now measure reliably and robustly by RNA-sequencing (as shown by our companion paper in Nature Biotechnology).

In this project, several European institutes of the Geuvadis Consortium sequenced mRNA and small RNA from lymphoblastoid cell lines from 465 individuals that are in the 1000 Genomes sample set. The idea of gene expression analysis of genetic reference samples is not new (see e.g. papers by Stranger et al., Pickrell et al. and Montgomery et al.), but the bigger scale and better quality enable discovery of exciting new biology, as demonstrated by other recent RNA-seq papers as well (e.g. Battle et al., Gutierrez-Arcelus et al.).

Regulatory variants underlying GWAS signals

Our first striking observation was that over one half of measured genes are affected by common genetic variants in human populations – so-called expression quantitative trait loci, or eQTLs. Regulatory associations are not like GWAS, where you are lucky to find a handful of significant hits; regulatory variation is literally (almost) everywhere – it’s the rule, not the exception.

The vast majority of these regulatory variants won’t have any effect on the phenotype at the individual level, but some of them do. The first obvious question was how many known GWAS variants are eQTLs in our study, and indeed pretty many of them are – 16%. So does this prove that in all these GWAS regions we have identified the regulatory change as the cellular mechanism that drives the disease? Unfortunately the answer is no. Regulatory associations are so common that the expected overlap just by chance is as high as 11%. This means that your favorite GWAS variant having a significant regulatory association is very far from sufficient proof of it being the biological mechanism of the disease or trait. The same applies to overlap with for example ENCODE annotations, by the way. This is not overcautious small print. We’ve basically gone from the problem of having hardly any clue of functional mechanisms to having too many putative functions. We’ve found the haystack.
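The logic of comparing an observed overlap with the overlap expected by chance can be sketched in a few lines of Python. This is only an illustration, not the analysis from the paper: the function names and variant identifiers are hypothetical, and drawing the null sets uniformly at random from a background list is a simplifying assumption (real analyses typically match null variants to the GWAS variants on properties such as allele frequency).

```python
import random


def overlap_fraction(variants, eqtl_variants):
    """Share of `variants` that are also significant eQTLs."""
    variants = list(variants)
    return sum(v in eqtl_variants for v in variants) / len(variants)


def chance_overlap(background, eqtl_variants, sample_size, n_draws=1000, seed=0):
    """Mean eQTL overlap of random variant sets drawn from `background`.

    Assumption: uniform sampling without replacement, with no matching on
    variant properties.
    """
    rng = random.Random(seed)
    background = list(background)
    draws = [
        overlap_fraction(rng.sample(background, sample_size), eqtl_variants)
        for _ in range(n_draws)
    ]
    return sum(draws) / n_draws


# With the figures quoted above: 16% observed versus 11% expected by chance.
fold_enrichment = 0.16 / 0.11
print(round(fold_enrichment, 2))  # about 1.45
```

An enrichment of roughly 1.45-fold over chance is real, but it also means that most individual overlaps would be expected even if the eQTL had nothing to do with the disease.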

How can we solve this problem? Luckily, there are statistical methods to analyze the two association signals in the same genomic region to find out if the gene expression association is likely to be causal to the disease association. You still can’t be 100% sure, but that is much smaller print. And we do find an enrichment of such a signal, as in previous studies – telling us that regulatory changes are enriched for being causal biological mechanisms underlying GWAS signals.

From associated regions to causal variants

How can we use eQTLs to understand GWAS associations? In this eQTL association landscape, the red rs838705 variant is a GWAS variant associated with calcium levels. But the most likely causal variant for the change in calcium levels is the top eQTL variant rs201966773 – a 2 bp insertion that overlaps several regulatory elements of the DGKD gene.

We can take this analysis an important step further to pinpoint likely causal variants. Thus far, nearly all association studies have used data from SNP arrays that measure only a subset of all the common variants. This works fine for identifying more or less broad regions of the genome that have a variant somewhere that changes the function of the genome such that it predisposes to the trait in question. However, usually there’s no clue what the precise causal variant is and what its exact properties are.

The first step in finding the causal variants is getting genome sequencing data, which is what we have in our study. We show that we have pretty good power to pinpoint causal regulatory variants in many of the loci, which is great news for understanding mechanisms of genome regulation. This has a cool application for dozens of GWAS loci that are driven by a regulatory association: by discovering the putative causal regulatory variant from our association data, we’re at the same time pinpointing the likely causal GWAS variant as well. Thus, combining genome sequencing and cellular phenotype data can not only give us information on the biological mechanisms underlying GWAS associations, but also identify the likely causal variants.

Where do we go next?

In this study we have integrated genome and transcriptome sequencing data to understand the landscape of functional variation in human populations. In addition to our scientific discoveries, this is an extremely valuable open-access data set for the human genetics community, as it links directly to the 1000 Genomes data that is used by nearly all human genetics projects. Since our pre-publication data release in November 2012, the data set has already been downloaded thousands of times, and we’ve put a lot of effort into open data sharing by having a browser and even opening our project wiki for the public.

This paper is a big step forward, but we’re still far from a full understanding of how genetic variation affects the transcriptome and how this affects human disease. One important challenge is the cellular effects of rare and loss-of-function variants, which we address only briefly in this paper. Furthermore, other projects such as GTEx are describing transcriptome variation and its genetic causes in a large variety of human tissues. We are part of that effort too, so stay tuned.

This study and other projects analyzing cellular phenotypes in the general human population are providing the baseline of the general population spectrum of functional genetic variation and transcriptome variation, which is essential to be able to distinguish the cases where things go wrong and cause disease. At the same time as we move forward with basic research, it is important to push for clinical applications to target cellular perturbations leading to disease, and develop approaches for personalized transcriptomics to better interpret personalized genomes.