A Symphony of Genes: The GTEx Dataset

Why do we care about gene expression?

There are about 25,000 unique proteins in your body right now, performing all the tasks you need to survive, everything from keeping your heart beating to reading and processing this blog post! Protein expression can be thought of as a symphony: not every protein is needed at every moment or in the same quantity, just as musical notes are played at specific points and at varying volumes during a song.

Gene expression, through the transcription of DNA into mRNA, is how your body’s symphony is orchestrated. Measuring the levels of different mRNA transcripts gives us an idea of how these instructions to the orchestra vary between genes and between individuals. It can also help us study disease: if a gene is over- or underexpressed, the music cannot be played as the DNA has written it, and disease may result.

Adapted from khanacademy.com

What is the Genotype-Tissue Expression (GTEx) Project?

Naqvi et al. (2019) examined this orchestra in a new and exciting way, studying how gene expression differs between the sexes and which patterns in these differences appear across species, including humans. For the human data, they drew on an existing resource: the Genotype-Tissue Expression (GTEx) Project. GTEx is a massive undertaking that catalogues gene expression in many different tissues from recently deceased, non-diseased donors who allowed their bodies to be used for science. Naqvi et al. reformatted this data so that it could be compared with data from other species.

Get to know our dataset

In this dataset, we have four variables: gene name, human sample ID (assigned by GTEx), TPM (transcripts per million), and count. As a measure of expression level, we are only interested in TPM, a normalized measure of each transcript's share of the mRNA in a sample. Count is the raw, unnormalized measurement of transcripts. This subset was chosen to give a manageable number of subjects (50, compared to the original 740) for students to use in smaller projects, or for taking an initial look at relationships in expression between genes.
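To make the difference between count and TPM concrete, here is a minimal Python sketch of the standard TPM calculation. It rests on assumptions that go beyond our dataset: the gene names, counts, and transcript lengths are made up for illustration, and transcript length is not one of the four variables in the package, so you cannot reproduce the TPM column from the count column alone.

```python
def counts_to_tpm(counts, lengths_kb):
    """Convert raw counts to TPM for a single sample.

    counts     -- dict mapping gene name to raw count
    lengths_kb -- dict mapping gene name to transcript length in kilobases
                  (hypothetical input; not part of the GTEx subset)
    """
    # Step 1: divide each count by transcript length (reads per kilobase),
    # so that long transcripts are not favored simply for being long.
    rpk = {gene: counts[gene] / lengths_kb[gene] for gene in counts}
    # Step 2: rescale so that the values in this sample sum to one million.
    total_rpk = sum(rpk.values())
    return {gene: value / total_rpk * 1_000_000 for gene, value in rpk.items()}


# Made-up toy sample with three placeholder genes (not real GTEx values).
toy_counts = {"geneA": 100, "geneB": 200, "geneC": 300}
toy_lengths_kb = {"geneA": 1.0, "geneB": 2.0, "geneC": 3.0}

toy_tpm = counts_to_tpm(toy_counts, toy_lengths_kb)
print(toy_tpm)
```

In this toy sample the three genes have different raw counts but the same TPM, because the counts grow in step with transcript length. Note that TPM is scaled over all genes measured in a sample, so the TPM values of a handful of selected genes will not add up to one million.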

Glimpse of data subset in our package.

An interesting case: the FoxP orchestra

Consider the FoxP subfamily of genes. FOXP1, FOXP2, and FOXP4 are highly similar in structure (homologous) and are often co-expressed in the brain with overlapping patterns (Hannenhalli & Kaestner, 2009). FOXP2 is an important transcription factor involved in language development (Enard et al., 2002). Abnormal expression profiles of FOXP2 during brain development correspond to speech and language disorders (Lai et al., 2003). The biological relevance of FOXP2 and the homology of the FoxP subfamily of genes inspired us to create a heatmap comparing their expression profiles across 50 healthy human subjects.

To make this graph, we filtered our GTEx data subset for the three genes of interest: FOXP1, FOXP2, and FOXP4. The resulting heatmap shows the general relationships within the FoxP subfamily: FOXP2 tends to be expressed at lower levels than either FOXP1 or FOXP4, both of which vary more between subjects.
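The filtering step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code behind our figure: the rows, the sample IDs (S1, S2), and the TPM values are invented, and the layout (one row per gene and sample) is assumed from the four variables described above. The result is a gene-by-subject matrix, which is the shape a heatmap function typically expects.

```python
from collections import defaultdict

# Hypothetical long-format rows: (gene, sample_id, tpm). All values are made up.
rows = [
    ("FOXP1", "S1", 12.0), ("FOXP2", "S1", 1.5), ("FOXP4", "S1", 20.0),
    ("FOXP1", "S2", 8.0), ("FOXP2", "S2", 0.9), ("FOXP4", "S2", 31.0),
    ("ACTB", "S1", 900.0), ("ACTB", "S2", 870.0),
]

genes_of_interest = ["FOXP1", "FOXP2", "FOXP4"]


def build_matrix(rows, genes):
    """Keep only the requested genes and pivot to one row per gene,
    one column per sample. Missing gene/sample pairs become None."""
    wanted = set(genes)
    by_gene = defaultdict(dict)
    for gene, sample, tpm in rows:
        if gene in wanted:
            by_gene[gene][sample] = tpm
    # Collect every sample ID seen for the kept genes, in sorted order.
    samples = sorted({s for values in by_gene.values() for s in values})
    matrix = [[by_gene[g].get(s) for s in samples] for g in genes]
    return samples, matrix


samples, matrix = build_matrix(rows, genes_of_interest)
print(samples)
for gene, values in zip(genes_of_interest, matrix):
    print(gene, values)
```

Rows for any other gene (here the placeholder ACTB rows) are dropped by the filter, so only the three FoxP genes reach the matrix.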

Some fun statistics!

This dataset can also be used for statistical inference. In the scatterplot below, TPM for FOXP1 is plotted against TPM for FOXP2 for each human subject. The two are positively correlated, with a correlation coefficient of 0.603.

This means that as expression of one of these genes increases, expression of the other tends to increase as well. A coefficient of 0.603 indicates a moderate positive association rather than a tight one: individual subjects still deviate noticeably from the overall trend. This dataset can be used to look at relationships of mRNA transcript abundance between genes, between people, or both. This could help you explore questions about differences in height, immune system function, disease, or anything else between people of different sexes, ages, or ethnicities.
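For readers who want to compute a correlation coefficient themselves, here is a minimal Python sketch. It assumes Pearson's correlation, which is a common default but is not stated above as the method used, and the TPM values are invented for illustration, so the result will not match 0.603.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Made-up TPM values for five hypothetical subjects, paired by position.
foxp1_tpm = [10.0, 12.0, 15.0, 11.0, 18.0]
foxp2_tpm = [1.0, 1.4, 1.2, 0.9, 2.0]

r = pearson_r(foxp1_tpm, foxp2_tpm)
print(round(r, 3))
```

The two lists must be paired by subject: the first FOXP1 value and the first FOXP2 value have to come from the same person, and so on down the list.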


Enard, W., et al. (2002). Molecular evolution of FOXP2, a gene involved in speech and language. Nature, 418, 869–872. https://doi.org/10.1038/nature01025

Hannenhalli, S., & Kaestner, K. H. (2009). The evolution of Fox genes and their role in development and disease. Nature Reviews Genetics, 10(4), 233–240. https://doi.org/10.1038/nrg2523

Lai, C. S., Gerrelli, D., Monaco, A. P., Fisher, S. E., & Copp, A. J. (2003). FOXP2 expression during brain development coincides with adult sites of pathology in a severe speech and language disorder. Brain, 126, 2455–2462. https://doi.org/10.1093/brain/awg247

Naqvi, S., Godfrey, A. K., Hughes, J. F., Goodheart, M. L., Mitchell, R. N., & Page, D. C. (2019). Conservation, acquisition, and functional impact of sex-biased gene expression in mammals. Science, 365(6450), eaaw7317. https://doi.org/10.1126/science.aaw7317

The Genotype-Tissue Expression (GTEx) Project was supported by the Common Fund of the Office of the Director of the National Institutes of Health, and by NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS.