Use Gene Set Enrichment Analysis to identify common signatures between gene sets
Added by GenomeSpaceTeam on 2015.05.06
Last updated over 3 years ago.
How do I create a custom-generated gene set? Are there any commonalities between custom-generated gene sets and MSigDB hallmark gene sets?
This recipe provides a method for identifying and visualizing similarities between diverse gene sets relevant to a study. An example use of this recipe is a case where an investigator may want to compare two phenotypes, such as two types of cancer, to determine which gene sets may be similar between these phenotypes.
Background information: What is Gene Set Enrichment Analysis, and why should I use it?
Gene sets are lists of genes that share similar functions, transcriptional regulation, chromosomal positions, pathways, or other biological processes. It is possible to identify gene sets that are enriched or over-represented in a particular phenotype, such as a specific disease. Gene Set Enrichment Analysis (GSEA) is a computational method which determines whether an a priori defined set of genes shows statistically significant, concordant differences between two phenotypes. GSEA can be used with a custom gene set generated by the user, or with the annotated, standardized gene sets which are available in the Molecular Signatures Database (MSigDB) collection. Completing GSEA on a gene expression dataset will identify those gene sets which are significantly enriched in a particular phenotype. Comparing similarities between the top gene sets following GSEA can yield unique insights into the mechanisms associated with a specific phenotype, which cannot be observed using a single-gene analysis.
Use case: Targeting MYCN in Neuroblastoma by BET Bromodomain Inhibition (Puissant et al., Cancer Discov. 2013).
This study analyzed gene expression data generated from primary neuroblastoma tumors of two genetic classes: tumors harboring MYCN amplification (“MYCN amplified”) and tumors without MYCN amplification (“MYCN non-amplified”). MYCN amplified neuroblastoma is exquisitely dependent on the bromodomain and extra-terminal (BET) family of proteins. As such, treatment of MYCN amplified cell lines or tumors with JQ1, a small-molecule inhibitor of BET proteins, leads to dramatic transcriptional changes and induces cell death.
A training set of gene expression data was analyzed using GenePattern, and custom gene sets were generated representing the MYCN amplified and MYCN non-amplified datasets. The custom-generated gene sets were then concatenated with the Hallmark gene set from MSigDB using tools in Galaxy. Subsequently, a test gene expression dataset of neuroblastoma cell lines treated with JQ1 (treatment) or DMSO (control) was used to rank this collection of gene sets using single-sample Gene Set Enrichment Analysis (ssGSEA).
This analysis reveals that MYCN-associated gene sets are enriched in JQ1-associated datasets, and suggests that JQ1 functions to suppress transcriptional programs mediated by MYCN amplification. The resulting similarities of the top-ranked gene sets are visualized using ConstellationMap, a module available in GenePattern. This helps to highlight similarities and overlaps between gene sets.
To complete this recipe, we will need a gene expression dataset describing two conditions or phenotypes. In this example, we will use microarray samples from primary neuroblastoma tumors that have been categorized into tumors with amplification of MYCN, and tumors without amplification of MYCN. We will use this data to derive MYCN-associated gene sets. Then, we will also need a test dataset to complete gene set enrichment analysis of our MYCN-associated gene sets. We will use a test microarray dataset of neuroblastoma cell lines, which have either been treated with JQ1 (treatment), or with DMSO (control). We will need the following datasets, which can be downloaded from the following GenomeSpace Public folder:
A neuroblastoma primary tumor dataset (GSE12460):
Public > RecipeData > ExpressionData > MYCN.gct: This file contains gene expression data of two phenotypes: MYCN-amplified and MYCN-nonamplified. The file is available in GenePattern's GCT format.
Public > RecipeData > ExpressionData > MYCN.cls: This file contains class assignments (MYCN+ or MYCN-) for all the samples in the GCT file, as identified by the GenePattern CLS format.
A test gene expression dataset (GSE43392):
Public > RecipeData > ExpressionData > GSE43392null.collapsed.gct: This file contains gene expression data of neuroblastoma cell lines treated with either JQ1 (treatment) or with DMSO (control). The file is available in GenePattern's GCT format. The file has already been preprocessed and normalized, and probe IDs have been collapsed to HUGO Gene Symbols.
Public > RecipeData > ExpressionData > GSE43392.cls: This file contains class assignments (treatment or control) for all the samples in the GCT file, as identified by the GenePattern CLS format.
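For readers who want to inspect these files outside GenePattern, the following is a minimal sketch of reading the GCT and CLS formats in Python. It assumes the common GCT v1.2 layout (a version line, a dimensions line, a header row, then one row per gene) and a categorical CLS file (a counts line, a class-name line starting with #, and a label line). The function names and the toy data are hypothetical and are not taken from the recipe files.

```python
import io


def read_gct(handle):
    """Parse a GCT v1.2 file into (sample_names, {gene: [values]})."""
    version = handle.readline().strip()
    if version != "#1.2":
        raise ValueError(f"unexpected GCT version line: {version!r}")
    n_rows, n_cols = (int(x) for x in handle.readline().split()[:2])
    header = handle.readline().rstrip("\n").split("\t")
    samples = header[2:]  # the first two columns are Name and Description
    data = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        # Assumes gene names are unique, as in a collapsed dataset.
        data[fields[0]] = [float(v) for v in fields[2:]]
    if len(data) != n_rows or len(samples) != n_cols:
        raise ValueError("row/column counts do not match the GCT header")
    return samples, data


def read_cls(handle):
    """Parse a categorical CLS file into (class_names, per-sample labels)."""
    n_samples, n_classes = (int(x) for x in handle.readline().split()[:2])
    class_names = handle.readline().lstrip("#").split()
    labels = handle.readline().split()
    if len(labels) != n_samples or len(class_names) != n_classes:
        raise ValueError("counts do not match the CLS header")
    return class_names, labels


# Toy data (hypothetical values, not the recipe datasets).
toy_gct = (
    "#1.2\n2\t3\n"
    "Name\tDescription\tS1\tS2\tS3\n"
    "GENE_A\tna\t1.0\t2.0\t3.0\n"
    "GENE_B\tna\t4.0\t5.0\t6.0\n"
)
toy_cls = "3 2 1\n# MYCN+ MYCN-\n0 0 1\n"

samples, data = read_gct(io.StringIO(toy_gct))
class_names, labels = read_cls(io.StringIO(toy_cls))
```

To read a real file, pass an open file handle in place of the io.StringIO object.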
First, we will identify genes that are differentially expressed in the MYCN-amplified phenotype, compared to the MYCN-nonamplified phenotype. We will accomplish this using the ComparativeMarkerSelection module in GenePattern.
Go to the Modules tab, and search for "ComparativeMarkerSelection".
input file: MYCN.gct (found in the Public > RecipeData > ExpressionData folder)
cls file: MYCN.cls (found in the Public > RecipeData > ExpressionData folder)
log transformed data: yes
Click Run to run ComparativeMarkerSelection.
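To make the idea of ranking genes by differential expression concrete, here is a small sketch that scores each gene with a two-sample t-statistic and sorts the genes by that score. This is not the ComparativeMarkerSelection implementation: the choice of Welch's unequal-variance form is an assumption, the function names are hypothetical, and the expression values are invented toy data.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Two-sample t-statistic with unequal variances (Welch's form).

    Raises ZeroDivisionError if both groups have zero variance.
    """
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


def rank_markers(expression, labels, class_a):
    """Return (gene, t) pairs sorted from most up in class_a to most down."""
    scores = {}
    for gene, values in expression.items():
        a = [v for v, lab in zip(values, labels) if lab == class_a]
        b = [v for v, lab in zip(values, labels) if lab != class_a]
        scores[gene] = welch_t(a, b)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy log-scale expression values (hypothetical, three samples per class).
expression = {
    "GENE_UP": [8.0, 9.0, 8.5, 2.0, 2.5, 3.0],
    "GENE_DOWN": [1.0, 1.5, 2.0, 7.0, 7.5, 8.0],
    "GENE_FLAT": [5.0, 5.5, 4.5, 5.0, 4.5, 5.5],
}
labels = ["MYCN+"] * 3 + ["MYCN-"] * 3

ranked = rank_markers(expression, labels, class_a="MYCN+")
for gene, t in ranked:
    print(f"{gene}\t{t:.2f}")
```

Genes at the top of the ranking are higher in the first class and genes at the bottom are higher in the second class, which is the distinction the filtering steps that follow rely on.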
Next, we will convert the gene expression signatures associated with MYCN amplification and MYCN nonamplification into custom gene sets. We will filter the data and create two new datasets: genes that are upregulated in the MYCN amplified phenotype, and genes that are upregulated in the MYCN nonamplified phenotype. We will use ComparativeMarkerSelectionViewer in GenePattern.
Go to the Modules tab, and search for "ComparativeMarkerSelectionViewer".
comparative marker selection filename: MYCN.comp.marker.odf
dataset filename: MYCN.gct (found in the Public > RecipeData > ExpressionData folder)
Click Run to run ComparativeMarkerSelectionViewer.
Click the Launch button to launch the visualizer. Once the visualizer loads the file, select the genes which are significantly up-regulated in the MYCN amplified phenotype, as follows:
Navigate to Edit > Filter Features > Create/Edit Filter.
Select T-Test from the dropdown menu.
Enter the cutoff value in the <= box to the right of T-Test.
Click Add Filter.
Select FDR(BH) from the dropdown menu.
Enter the cutoff value in the <= box to the right of FDR(BH).
Click OK to filter the dataset.
Save the filtered dataset by clicking the Rank header to select all rows, then navigating to the following menu: File > Save Dataset(.gct).
Save the file as MYCN_amplified_up.gct.
Click the Download button, then Create. This will save the file to your local directory.
Click Close.
Repeat the filtering to select the genes which are up-regulated in the MYCN nonamplified phenotype:
Navigate to Edit > Filter Features > Create/Edit Filter.
Enter the cutoff value in the >= box to the right of T-Test.
Click OK to filter the dataset.
Save the filtered dataset by clicking the Rank header to select all rows, then navigating to the following menu: File > Save Dataset(.gct).
Save the file as MYCN_nonamplified_up.gct.
Click the Download button, then Create. This will save the file to your local directory.
Click Close.
Convert each saved GCT file to GMT format: choose Convert, select gmt, then click Convert on Server.
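As a rough illustration of what a GMT gene set looks like after this conversion, the sketch below writes a list of gene names as one GMT line: a set name, a description, and the member genes, all tab-separated. It does not reproduce the server-side converter. The helper names are hypothetical and the gene names are placeholders.

```python
def gmt_line(set_name, genes, description="na"):
    """Format one gene set as a single tab-separated GMT line."""
    unique = list(dict.fromkeys(genes))  # drop duplicates, keep input order
    return "\t".join([set_name, description, *unique])


def parse_gmt_line(line):
    """Split a GMT line back into (name, description, genes)."""
    name, description, *genes = line.rstrip("\n").split("\t")
    return name, description, genes


# Placeholder gene names; a real set would list the row names of the saved GCT.
line = gmt_line("MYCN_amplified_up", ["GENE_A", "GENE_B", "GENE_A"])
print(line)
name, description, genes = parse_gmt_line(line)
```

Because each set occupies exactly one line, several GMT files can later be combined simply by stacking their lines.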
Next, we will download a specific gene set collection from MSigDB, the Hallmark gene sets (H). The Hallmark gene sets summarize and represent specific well-defined biological states or processes as sets of coherently expressed genes. They are a good benchmark for evaluating which biological functions or processes are enriched in a list of genes.
Go to the Downloads tab on the MSigDB website and download the Hallmark gene set file (h.all.v5.1.symbols.gmt).
Next, we will merge the MYCN-associated gene sets with the MSigDB Hallmark gene set, to create one file that can be used for GSEA.
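The merge itself is a plain head-to-tail concatenation of GMT lines. A minimal offline sketch, assuming the three GMT files have already been read into strings, might look like the following. The duplicate-name check is an addition of this sketch rather than something the Galaxy tool is known to do, and the gene names are placeholders.

```python
def concatenate_gmt(*gmt_texts):
    """Stack the lines of several GMT texts into one GMT text."""
    seen = set()
    lines = []
    for text in gmt_texts:
        for line in text.splitlines():
            if not line.strip():
                continue  # skip blank lines between or inside files
            name = line.split("\t", 1)[0]
            if name in seen:
                raise ValueError(f"duplicate gene set name: {name}")
            seen.add(name)
            lines.append(line)
    return "\n".join(lines) + "\n"


# Placeholder contents standing in for the three files used in this recipe.
amplified = "MYCN_amplified_up\tna\tGENE_A\tGENE_B\n"
nonamplified = "MYCN_nonamplified_up\tna\tGENE_C\tGENE_D\n"
hallmark = "HALLMARK_MYC_TARGETS_V1\tna\tGENE_A\tGENE_E\n"

merged = concatenate_gmt(amplified, nonamplified, hallmark)
set_names = [row.split("\t")[0] for row in merged.splitlines()]
print(set_names)
```

The order of the inputs mirrors the order used in the Galaxy steps: the two custom sets first, then the Hallmark collection.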
In Galaxy, navigate to Get Data > GenomeSpace Import.
Select the following files:
h.all.v5.1.symbols.gmt
MYCN_amplified_up.gmt
MYCN_nonamplified_up.gmt
Click Send to Galaxy to upload them to Galaxy.
Navigate to Text Manipulation > Concatenate Datasets. Once the tool is loaded, change the following parameters:
Concatenate Dataset: MYCN_amplified_up.gmt
Click Insert Dataset to add a new dataset.
Concatenate Dataset: MYCN_nonamplified_up.gmt
Click Insert Dataset to add a new dataset.
Concatenate Dataset: h.all.v5.1.symbols.gmt
Click Execute to run the job.
Navigate to Send Data > GenomeSpace Exporter. Once the tool is loaded, change the following parameters:
Send this dataset to GenomeSpace: concatenate datasets…
Choose Target Directory: your directory
Filename: concatenated_genesets.gmt
Click Execute to run the job.
Next, we run single-sample gene set enrichment analysis (ssGSEA) on the test expression dataset, using the custom GMT file of MYCN-associated gene sets and the MSigDB Hallmark gene set. ssGSEA calculates a separate enrichment score for each pairing of a sample and gene set; each score represents the degree to which the genes in a particular gene set are coordinately up- or down-regulated within a single sample. We will use the ssGSEAProjection module in GenePattern.
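To give a feel for what a per-sample enrichment score measures, here is a simplified sketch in the spirit of ssGSEA: genes are ranked by expression within one sample, and the score sums the difference between a rank-weighted running fraction of gene-set members and a running fraction of non-members. This is an illustrative approximation, not the ssGSEAProjection code. The function name, the alpha default, and the toy values are all assumptions.

```python
def ssgsea_score(sample_expression, gene_set, alpha=0.75):
    """Simplified single-sample enrichment score for one gene set.

    sample_expression maps gene -> expression in ONE sample.
    alpha is an assumed rank-weighting exponent, chosen for illustration.
    """
    # Rank genes from highest to lowest expression within this sample.
    ordered = sorted(sample_expression, key=sample_expression.get, reverse=True)
    n = len(ordered)
    in_set = [gene in gene_set for gene in ordered]
    n_hits = sum(in_set)
    if n_hits == 0 or n_hits == n:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    # A member at position i (0 = top) is weighted by its rank value (n - i).
    weights = [(n - i) ** alpha if hit else 0.0 for i, hit in enumerate(in_set)]
    total_weight = sum(weights)
    miss_step = 1.0 / (n - n_hits)
    p_hit = p_miss = score = 0.0
    for weight, hit in zip(weights, in_set):
        if hit:
            p_hit += weight / total_weight
        else:
            p_miss += miss_step
        score += p_hit - p_miss  # accumulate the gap between the two curves
    return score


# Toy sample (hypothetical values): G1 and G2 are highest, G5 and G6 lowest.
sample = {"G1": 10.0, "G2": 9.0, "G3": 8.0, "G4": 2.0, "G5": 1.0, "G6": 0.5}
top_score = ssgsea_score(sample, {"G1", "G2"})
bottom_score = ssgsea_score(sample, {"G5", "G6"})
print(top_score, bottom_score)
```

A set whose members sit near the top of the sample's ranking gets a positive score, and a set whose members sit near the bottom gets a negative one. Repeating the calculation for every sample and gene set yields a matrix with gene sets as rows and samples as columns.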
Go to the Modules tab, and search for "ssGSEAProjection". Once the module has loaded, change the following parameters:
input gct file: GSE43392null.collapsed.gct
output file prefix: GSE43392.ssGSEA.output
gene sets database file: concatenated_genesets.gmt
combine mode: combine.off
Click Run to run the job. This will produce a new GCT file of the ssGSEA projection with gene sets as rows, samples as columns, and enrichment scores as elements.
Finally, we will visualize the results from the ssGSEA projection using the ConstellationMap module in GenePattern. ConstellationMap is a downstream visualization and analysis tool with an interactive web visualizer, which identifies commonalities between high-scoring gene sets. You can find more information at the GParc ConstellationMap module page.
Go to the Modules tab, and search for "ConstellationMap". Once the module has loaded, change the following parameters:
input gct file: GSE43392.ssGSEA.output.gct
input cls file: GSE43392.cls
gene sets file: concatenated_genesets.gmt
top n: 20
direction: negative
jaccard threshold: 0.05
target class: JQ1
Click Run to run the job. This will produce several output files.
Click Visualizer.html, then Open Link, to launch a new webpage. This webpage is an interactive, in-browser radial plot.
This is an example interpretation of the results from this recipe. First, we identified sets of genes that are significantly up- or down-regulated in MYCN amplified vs. MYCN non-amplified neuroblastoma tumors. We built a personalized collection of gene sets by combining these two new gene sets, MYCN_amplified_up and MYCN_nonamplified_up, with the Hallmark gene set collection from MSigDB.
Next, we investigated whether any of the gene sets in our combined collection were significantly enriched in neuroblastoma cell lines treated with DMSO (control) compared with those treated with JQ1. We used the ssGSEAProjection module to get per-sample enrichment values for each gene set. Then, we compared these gene set enrichment profiles against each other, and against the JQ1 treatment vs. control phenotype, using the ConstellationMap module.
Given the effects of JQ1, we expect that MYC- and MYCN-associated gene sets, including our new signature-based gene sets, will have enrichment profiles that are significantly associated with controls compared to JQ1-treated samples.
Our results seem to suggest that MYC and MYCN-amplification programs (highlighted) are indeed affected by JQ1 treatment, being highly anti-correlated to JQ1-treated samples and having similar enrichment profiles. Moreover, these gene sets share 12 genes that may be further investigated as markers for JQ1 susceptibility. As investigators, we could continue to investigate clusters of any of the other highly associated gene sets for potential leads in the search for downstream effects of JQ1 treatment. As these are investigative steps, the significance of any results would need further confirmation.
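The overlap between two gene sets can be checked by hand with a Jaccard index, which is the quantity the jaccard threshold parameter refers to. The sketch below uses placeholder gene names and hypothetical helper names. It illustrates the measure only and does not reproduce how ConstellationMap computes or draws overlaps.

```python
def jaccard(set_a, set_b):
    """Jaccard index: size of the intersection divided by size of the union."""
    a, b = set(set_a), set(set_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def shared_genes(set_a, set_b):
    """Genes present in both sets, sorted for stable output."""
    return sorted(set(set_a) & set(set_b))


# Placeholder gene sets (not the real MYCN or Hallmark sets).
custom_set = {"GENE_A", "GENE_B", "GENE_C"}
hallmark_set = {"GENE_B", "GENE_C", "GENE_D"}

overlap = jaccard(custom_set, hallmark_set)
common = shared_genes(custom_set, hallmark_set)
passes_threshold = overlap >= 0.05  # the jaccard threshold used in this recipe
print(overlap, common, passes_threshold)
```

Applied to the real gene sets, the list returned by shared_genes would correspond to the genes shared between two sets, which are the candidates for follow-up.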