Chapter 6 GGHH 2023 - notes

Genes, Genomics And Human Health (300820)

Academic year: 2023/2024
Western Sydney University

Chapter 6. Expression Quantitative Trait Loci (eQTL)

Learning Outcomes

  1. Define an eQTL
  2. Summarise the methodology of RNAseq
  3. Understand the reason for expressing RNAseq outcomes as transcripts per million (TPM)
  4. Explain why patterns of H3K4me3 and H3K27ac can be used as markers of transcriptionally active genes
  5. Incorporate this data into a systematic strategy for identifying functional alleles identified in GWAS

6.1 Introduction

As we learnt in Chapter 5, techniques such as ChIPseq and DNaseSeq, together with information on DNA conservation (homology), have been critical in mapping active regulatory DNA elements across the human genome. There is a complex vocabulary of histone subunit modifications that correlate with the remodelling of chromatin and the activation or repression of gene transcription. However, given that there are estimated to be up to one million enhancers and just 20,000-23,000 protein-coding genes across the human genome, assigning enhancers to the regulation of specific genes in different cell types and in response to different signals is one of the major challenges in functional genomics.

Most alleles identified in GWAS that are associated with common human traits and illnesses map to non-coding DNA. While the location of associated SNPs and the limits of linkage disequilibrium (LD) allow us to assign SNPs to regulatory DNA elements, correlating enhancer activation with gene transcription is more difficult. Many studies have therefore sought to correlate specific alleles with changes in gene transcription. This approach treats gene expression as a measurable trait, in much the same way as we use measurements such as height or blood pressure as phenotypic traits. Because gene transcription is a quantifiable trait, a GWAS can identify loci containing alleles that are associated with variation in the expression of a gene, hence the term expression quantitative trait loci (eQTL).
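Treating expression as a quantitative trait can be illustrated with a small calculation. The sketch below assumes the simplest additive model, in which each individual's genotype at a SNP is coded as the number of copies of the alternate allele (0, 1 or 2) and expression is regressed on that dosage. The function name `eqtl_slope` and the data are hypothetical and purely illustrative; real eQTL studies use dedicated software and also adjust for covariates and multiple testing.

```python
from math import sqrt


def eqtl_slope(dosages, expression):
    """Least-squares slope and Pearson r of expression on allele dosage."""
    n = len(dosages)
    if n != len(expression) or n < 3:
        raise ValueError("need matched lists with at least 3 samples")
    mean_x = sum(dosages) / n
    mean_y = sum(expression) / n
    # Sums of squares and cross-products around the means
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    syy = sum((y - mean_y) ** 2 for y in expression)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, expression))
    if sxx == 0 or syy == 0:
        raise ValueError("no variation in genotype or expression")
    slope = sxy / sxx              # change in expression per alternate allele
    r = sxy / sqrt(sxx * syy)      # strength of the linear association
    return slope, r


# Hypothetical data: six individuals, dosage of the alternate allele and
# normalised expression of one gene in one tissue (invented values).
dosages = [0, 0, 1, 1, 2, 2]
expression = [10.0, 12.0, 15.0, 17.0, 20.0, 22.0]

slope, r = eqtl_slope(dosages, expression)
print(f"slope per allele = {slope:.2f}, r = {r:.3f}")
```

A slope that differs clearly from zero suggests that the SNP, or a variant in LD with it, is associated with expression of the gene in that tissue.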

Figure 6. A major challenge in genomics is to assign risk alleles to variation in the transcription of one or more genes. Gene A and Gene B are two genes that show the same pattern of transcription, as seen by the appearance of H3K4me3 and H3K27ac sites at the promoters of both genes in the same cell type. An enhancer is identified by H3K27ac and DNase HS sites in the intergenic DNA between the two genes, and a second enhancer is located in intron 4 of Gene B. A GWAS has produced a significant association that maps to the intergenic enhancer. The challenge is to determine whether alternate alleles of the intergenic enhancer correlate with variation in the transcription of Gene A and/or Gene B (as indicated by the arrows and question marks).

In this chapter we will consider methods that can measure gene transcription, and how this information can be integrated with GWAS and other information to identify eQTLs (Fig. 6).

6 Sequencing RNA (RNAseq)

Our cells contain a highly complex mixture of RNA species that can be broadly divided into protein-coding RNA (mRNA) and non-coding RNA. The complete set of RNA in a cell is collectively referred to as the transcriptome. Non-coding RNA encompasses structural RNAs that contribute to translation (ribosomal RNA, rRNA, and transfer RNA, tRNA) as well as regulatory microRNA (miRNA) and long non-coding RNA (lncRNA), which is thought to play an important role in the regulation of gene transcription. More recently, it was found that enhancers can be transcribed into short and transient RNA species (eRNA). While eRNA is thought to contribute to enhancer-dependent gene transcription, much remains to be done to understand the details of eRNA activity. As each cell type in the human body is the product of a specialised program of development and differentiation that reflects specialised networks of gene transcription, the identity and abundance of RNA varies between different cell types, and in response to different stimuli. An additional layer of complexity is that the transcripts of many human genes undergo alternative splicing, resulting in more than one transcript being produced from a single gene. Not surprisingly, identifying and quantifying the abundance and diversity of RNA transcripts in a cell is highly complex and places great demands on the sequencing methodology and on the post-sequencing bioinformatic analysis of sequencing products. There are several

Figure 6. Sequencing alternate RNA isoforms. A. A gene has five exons (colour coded) that can be expressed in three combinations. Beneath the top image of the gene, fragments of RNA are shown that have been sequenced and aligned to the human genome, with the abundance of fragments aligned to each exon of a gene being a measure of the expression of that exon in a cell. Note that in addition to short fragments that map to a single exon, some fragments span two exons (as indicated by fragments joined by a horizontal line); these are important in identifying isoforms. B. Alignment of the sequenced fragments and calculation of the abundance of the aligned fragments can identify individual isoforms and the abundance of each isoform.
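The counting idea described in the figure can be sketched in a few lines. The example below is a minimal illustration, not how production aligners or quantifiers work: exon coordinates, fragment coordinates and the function name `count_fragments` are all hypothetical, and each fragment is represented as a list of aligned blocks, with two blocks meaning that the fragment spans a splice junction.

```python
def count_fragments(exons, fragments):
    """Count fragments per exon and per exon-exon junction.

    exons:     dict of exon name -> (start, end), inclusive coordinates
    fragments: list of fragments, each a list of (start, end) aligned blocks
    """
    exon_counts = {name: 0 for name in exons}
    junction_counts = {}

    def exon_of(block):
        # Return the exon that fully contains this aligned block, if any
        start, end = block
        for name, (e_start, e_end) in exons.items():
            if e_start <= start and end <= e_end:
                return name
        return None

    for blocks in fragments:
        hits = [exon_of(b) for b in blocks]
        hits = [h for h in hits if h is not None]
        # A fragment contributes once to every exon it touches
        for name in set(hits):
            exon_counts[name] += 1
        # Consecutive blocks in different exons define a spliced junction
        for left, right in zip(hits, hits[1:]):
            if left != right:
                key = (left, right)
                junction_counts[key] = junction_counts.get(key, 0) + 1
    return exon_counts, junction_counts


# Hypothetical three-exon gene and four aligned fragments
exons = {"E1": (100, 200), "E2": (300, 400), "E3": (500, 600)}
fragments = [
    [(120, 180)],              # within E1
    [(150, 200), (300, 340)],  # spans the E1-E2 junction
    [(160, 200), (500, 530)],  # spans E1-E3, so E2 was skipped
    [(310, 390)],              # within E2
]

exon_counts, junction_counts = count_fragments(exons, fragments)
print(exon_counts)
print(junction_counts)
```

In this toy data the fragment joining E1 directly to E3 is the evidence for an isoform that skips E2, which is why junction-spanning fragments matter for isoform identification.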

The TPM calculation is a way of normalising the number of RNAseq reads that map to a single gene transcript by accounting for the length of the gene transcript and the total number of mapped sequencing reads in an RNAseq experiment. The calculation itself is not difficult, although as each RNAseq experiment maps millions of fragments to several tens of thousands of transcripts, it requires specialised software to complete the calculations.

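  1. Firstly, a length-normalised expression value is calculated for each transcript: this is the number of reads mapped to the transcript divided by the length of the transcript in kilobases. The resulting number is defined as reads per kilobase (RPK).
  2. Next, the sum of ALL RPK values in a sample is calculated and this number is divided by 1,000,000 (this is the “per million” scaling factor).
  3. Finally, the RPK value of each transcript is divided by the “per million” scaling factor to give its TPM. As a result, the TPM values of all transcripts in a sample add up to one million, which makes samples directly comparable.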
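The TPM calculation can be written out as a short sketch. It follows the standard definition of TPM: reads per kilobase for each transcript, a scaling factor equal to the sum of all RPK values divided by 1,000,000, and then each RPK value divided by that scaling factor. The transcript names, read counts and lengths are invented for illustration, and the function name `tpm` is an assumption rather than part of any particular package.

```python
def tpm(read_counts, lengths_bp):
    """Convert per-transcript read counts to transcripts per million (TPM).

    read_counts: dict of transcript -> number of mapped reads
    lengths_bp:  dict of transcript -> transcript length in base pairs
    """
    # Reads per kilobase (RPK): counts divided by length in kilobases
    rpk = {t: read_counts[t] / (lengths_bp[t] / 1000) for t in read_counts}
    # "Per million" scaling factor: sum of all RPK values / 1,000,000
    scaling = sum(rpk.values()) / 1_000_000
    if scaling == 0:
        raise ValueError("no reads mapped to any transcript")
    # TPM: each RPK divided by the scaling factor
    return {t: value / scaling for t, value in rpk.items()}


# Hypothetical sample with three transcripts
counts = {"tx_A": 100, "tx_B": 300, "tx_C": 600}
lengths = {"tx_A": 1000, "tx_B": 3000, "tx_C": 2000}

tpm_values = tpm(counts, lengths)
for name, value in tpm_values.items():
    print(name, round(value))
```

Note that tx_A and tx_B receive the same TPM even though tx_B has three times as many reads, because tx_B is also three times as long. This is the reason for normalising by transcript length before scaling.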

These calculations are incorporated into the UCSC Genome Browser using the data generated by a large consortium called the Genotype-Tissue Expression (GTEx) project (gtexportal/home/). This project seeks to correlate variation in gene transcription in multiple tissues with genetic variation, and so to identify expression quantitative trait loci, or eQTLs.

Figure 6. Histone H3K4me3 and Histone H3K27ac at the promoter are correlated with RNA expression. Top: Shown is the gene GATA3, a key transcription factor in skin epithelial cells; exons are indicated as blue boxes, introns are horizontal lines with arrows indicating the direction of transcription. Conservation is shown beneath the GATA3 gene as a series of green peaks. Middle: H3K4me3 and H3K27ac peaks are shown, with the highest and broadest peaks found in the skin epithelial cells (keratinocytes). Bottom: Coverage shows the depth (number) of sequenced RNA fragments aligned to each exon. TPM is the normalised expression (transcripts per million) in each tissue, which is highest in skin epithelial cells. Note the correlation between large peaks of H3K4me3 and H3K27ac in keratinocytes (human skin cells) and high RNA expression in the same tissue. Figure adapted from the UCSC Human Genome Browser.

As we saw in the previous chapter, Next Generation sequencing is used to map and quantify the number (depth) of DNA fragments that are associated with post-translationally modified histone subunits, with the results being displayed as peaks in a human genome browser. Similarly, Next Generation DNA sequencing is used to sequence short cDNA fragments to identify and quantify the expression of RNA from genes across the human genome. A combined display of the transformed data as peaks in the UCSC Human Genome Browser clearly identifies the cells or tissues that show the highest RNA expression of a specific gene, and shows that this pattern of expression is correlated with robust peaks of H3K4me3 and H3K27ac at the promoter of the gene (Fig. 6). This is also a very good example of the transformation of complex data into an easily understood visual format.

6 Expression Quantitative Trait Loci (eQTL) and Genome-wide Association Studies (GWAS)

Approximately 2% of the human genome is coding DNA - the DNA sequences that encode proteins. It is therefore not surprising that when a GWAS scans the human genome for alleles that are associated with a specific phenotype, the great majority of the identified alleles map to non-coding regions of the human genome. However, each GWAS samples only a fraction of the genetic variants that are present in the genome of an individual by using tagging SNPs, usually with minor allele frequencies of at least 1-5%. As a consequence of this, the identification of haplotypes associated with a specific phenotype in most cases

6 Combining GWAS with eQTL data: the GTEx Project

As already mentioned, GTEx is a large project that combines data from genotyping, whole genome sequencing, and RNA sequencing to provide an open-access resource that will assist in identifying genetic variants that affect gene expression in different tissues.

When considering how genetic variation might affect gene expression, each of the following possibilities can be considered: 1) alternate alleles uniformly affect the expression of a gene in all tissues that express the gene; 2) alternate alleles affect gene expression in some but not all tissues that express the gene; 3) the effect of alternate alleles is only seen in the presence of an environmental change, such as an infection or chronic disease. Furthermore, since the relationship between genotype and gene expression is a statistical association, multiple alleles in a single haplotype can show an association with the expression of a given gene. The gene ACTN3 is expressed in skeletal muscle fibres and is involved in cross-linking actin-containing thin filaments. The GTEx project has identified an eQTL in ACTN3 that is associated with variation in the expression of this gene in skeletal muscle cells (Fig. 6).
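The point that several alleles on one haplotype can show the same association follows from linkage disequilibrium, which is commonly summarised as r² between two SNPs. The sketch below assumes phased haplotypes coded as 0 (reference) or 1 (alternate) at each of two SNPs; the haplotype counts are invented and the function name `ld_r_squared` is hypothetical.

```python
def ld_r_squared(haplotypes):
    """r-squared between two biallelic SNPs from phased haplotypes.

    haplotypes: list of (allele_at_snp1, allele_at_snp2), alleles coded 0/1
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n   # alternate allele freq, SNP 1
    p_b = sum(b for _, b in haplotypes) / n   # alternate allele freq, SNP 2
    # Frequency of haplotypes carrying the alternate allele at both SNPs
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("at least one SNP is monomorphic in this sample")
    return d * d / denom


# Hypothetical sample of ten phased haplotypes
haps = [(1, 1)] * 4 + [(0, 0)] * 4 + [(1, 0)] + [(0, 1)]

r2 = ld_r_squared(haps)
print(f"r^2 = {r2:.2f}")
```

When r² between two SNPs is high, an association between one SNP and expression will usually be mirrored by the other, so the statistical signal alone cannot tell us which variant is functional.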

Figure 6. Association of alleles in the gene ACTN3 with variation in gene expression in muscle. A. RNAseq analysis of ACTN3 showing robust expression in skeletal muscle cells but not whole blood. B. The ACTN3 gene (exons shown as blue boxes, introns as horizontal lines with arrows indicating the direction of transcription) and histone modifications in skeletal myoblasts associated with promoter (H3K4me3 and H3K27ac) or enhancer (H3K4me1 and H3K27ac) elements. C. The association between ACTN3 expression and the homozygous and heterozygous genotypes of three SNPs in the ACTN3 gene (rs679228, rs509556, rs1815739), with the location of each SNP in ACTN3 indicated by the arrows. Note that there is no association between genotype and expression in whole blood (rs679228 shown as an example). Gene expression is normalised and plotted as violin plots. There is moderate to strong linkage disequilibrium between the SNPs rs679228, rs509556 and rs1815739. Figure adapted from data available at genome.ucsc/ and gtexportal/home/.

There is some evidence for the presence of transcription regulatory sequences at this locus (Fig. 6, H3K4me1). However, the T-allele of rs1815739 introduces a premature stop codon into the ACTN3 coding sequence, which could lower ACTN3 RNA levels through a mechanism other than a change in gene transcription, for example degradation of the transcript by nonsense-mediated decay.

Conclusion

The GTEx project aims to correlate variation in gene expression with genetic variation in multiple tissues. The identification of eQTLs is an important resource for the functional identification of non-coding DNA alleles identified in GWAS projects.
