BCH339N: Predicting Tumor Phenotypes from Gene Expression

The objective of our research is to predict various tumor phenotypes from gene expression data.

Our presentation today will start out by outlining our two project objectives.

After that, we’ll visualize how our patient samples cluster, then perform a differential expression analysis and gene set enrichment analysis.

Next, we’ll use random forest predictions to assign tumor phenotypes.

And lastly we’ll discuss the broader impacts of our findings.

Let’s start out by outlining our objectives and explaining where we sourced our data.

We started out by searching The Cancer Genome Atlas for transcriptomic data of tumors that we predicted would be biologically diverse.

Our analyses are based around gene expression data for breast cancer samples, skin melanoma samples, low grade glioma samples (which is a type of tumor that occurs in non-neuronal nervous system cells, like the spinal cord), and lastly, samples from mesothelioma tumors (which is usually caused by asbestos exposure).

Our first objective was to determine which underlying biological processes characterize each of the four tumor types.

Our second objective was to determine whether gene expression profiles could be used to predict the identity of a tumor.

I’ll now go through the steps that we took to address the first objective. The first step in working with high dimensional data like transcriptomic profiles is to cluster the information into interpretable dimensions. For this, I used the UMAP method for dimensionality reduction.

UMAP is a non-linear algorithm that projects high dimensional data into two dimensions. So for example, the gene expression matrix with 60k genes and 1200 samples was able to be simplified into two dimensions using this algorithm.

Why is dimensionality reduction useful? It allows us to verify that the overarching differences between patient sample data is caused by biological groups like tumor type, as opposed to other underlying variables, perhaps like age or sex.

By verifying that clusters correspond to biological groups, the design of our differential gene expression experiment is more robust.

Here is the two dimensional projection of expression counts. We can see that for the most part, samples cluster based on their tumor identity. It’s important to note that the distances between clusters do not have any meaning using UMAP, since the algorithm is non-linear.

Now that we validated that tumor identity is a justifiable way to group our samples, we performed a differential expression analysis followed by a gene set enrichment analysis.

Within this portion of the analysis, the first goal was to determine which genes were over-expressed in each tumor type.

Four DESeq experiments were designed so that log2Fold changes of a particular tumor type were compared to the remaining samples.

Once DESeq provided log2Fold changes and test statistics for each gene, a gene set enrichment analysis was performed to determine which biological pathways were over-expressed in each tumor type.

We used the Hallmark set of 50 well defined biological pathways for this analysis.

Statistics from DESeq were used to rank genes by their importance in each of the biological pathways.

Here are the gene set enrichment results for breast cancer tumors.

The y axis defines the pathways expressed in the data, and the x axis describes the normalized enrichment scores. Negative enrichment scores indicate that the pathway was over-expressed in breast cancer samples.

Our results support that the most over-expressed genes belong to estrogen response pathways.

Based on our data, we characterize breast cancer samples by their expression of estrogen response pathways.

This is a reasonable pathway since estrogen is responsible for female sex characteristics.

Next, we look at the gene set enrichment results for skin cancer tumors.

Again, negative enrichment scores indicate that the pathway was over-expressed in skin cancer samples.

Our results support that the most over-expressed genes belong to MYC pathways.

Based on our data, we characterize skin cancer samples by their expression of MYC pathways, which are oncogenic transcription factors.

Next, we look at the gene set enrichment results for low grade glioma tumors.

Again, negative enrichment scores indicate that the pathway was over-expressed in glioma samples.

Our results support that the most over-expressed genes belong to hedgehog signaling pathways.

Based on our data, we characterize low grade glioma samples by their expression of hedgehog signaling pathways, which play important roles in stem cell regulation.

This is a reasonable pathway since low grade gliomas occur in the spinal cord, which houses lots of stem cells.

Next, we look at the gene set enrichment results for mesothelioma tumors.

Again, negative enrichment scores indicate that the pathway was over-expressed in mesothelioma samples.

Our results support that the most over-expressed genes belong to interferon response pathways.

Based on our data, we characterize mesothelioma samples by their expression of interferon response pathways, which play important roles in cell immune response.