Identify and validate a consensus signature using gene expression data
Added by GenomeSpaceTeam on 2015.04.22
Last updated over 1 year ago.
Do phenotypically different expression datasets share a common signature? Can the signature distinguish phenotypes in an independent dataset?
This recipe provides one method for identifying a consensus gene signature from a training set of several phenotypically distinct gene expression datasets. The recipe then validates the ability of the consensus signature to accurately distinguish phenotypes by using an independent test gene expression dataset. An example use case is an investigator who wants to develop a gene expression signature to predict a specific phenotype, such as cancer or another disease.
Background information: What is a consensus gene expression signature?
A gene expression signature is the pattern of expression in a specific group of genes, usually ones that are related by function, position or other biological process. A consensus gene signature is an expression pattern for a specific group of genes, which is shared among different samples or across different phenotypes. For example, a group of genes regulating immune response could be similarly up-regulated during many different, unrelated infections. There are several types of consensus signatures; those that can be derived from gene expression data are called transcriptional consensus signatures. Consensus signatures can be created by overlapping individual gene signatures derived from multiple datasets. Compared to individual gene expression signatures, consensus signatures may be more accurate at distinguishing different phenotypes, such as diseased vs. normal samples.
Use case: Targeting MYCN in Neuroblastoma by BET Bromodomain Inhibition (Puissant et al., Cancer Discov. 2013).
This study analyzed gene expression data generated from primary neuroblastoma tumors of two genetic classes: tumors harboring MYCN amplification (“MYCN amplified”) and tumors without MYCN amplification (“MYCN non-amplified”). MYCN amplified neuroblastoma is exquisitely dependent on the bromodomain and extra-terminal (BET) family of proteins. As such, treatment of MYCN amplified cell lines or tumors with JQ1, a small-molecule inhibitor of BET proteins, leads to dramatic transcriptional changes and induces cell death.
To identify a consensus signature to predict sensitivity to JQ1 treatment, two training datasets and one test dataset were obtained from InSilico DB. The training datasets included acute myeloid leukemia (AML) and multiple myeloma (MM) cell lines, which had been treated with either DMSO (control) or JQ1 (treatment). The test dataset included MYCN amplified and MYCN nonamplified neuroblastoma primary tumor samples. GenePattern was used to analyze the AML and MM cell lines; for each dataset, a gene expression signature was derived to identify JQ1 response in the cell line. Using Galaxy, the two signatures were then overlapped to determine the consensus signature shared by the two datasets.
GenePattern was used to validate the ability of this JQ1-associated consensus signature to differentiate between phenotypes, by using the signature to hierarchically cluster the test dataset (neuroblastoma). Since the MYCN amplified and MYCN non-amplified neuroblastoma samples should have differing expression profiles, it was hypothesized that the consensus signature would be able to separate the samples by phenotype. Indeed, the consensus signature was able to cluster the MYCN-amplified and MYCN-nonamplified samples separately, revealing that the consensus signature accurately distinguishes the sensitivity-to-JQ1 phenotype.
To complete this recipe, we will need several gene expression datasets, which will be obtained from InSilico DB; thus, we do not need any additional files from the GenomeSpace Public Folder.
In this step, we use InSilico DB to retrieve three gene expression datasets. GenomeSpace will automatically convert the expression dataset files to a form that is readable by GenePattern. If you are using your own data, make sure that your input includes both a GCT file (the expression matrix) and a CLS file (the phenotype label for each sample).
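For readers preparing their own input, the two file types can be sketched as follows. This is a minimal reader, assuming the tab-delimited GCT version 1.2 layout and a categorical CLS file; the function names `read_gct` and `read_cls` and the toy values are illustrative and are not part of GenePattern or GenomeSpace.

```python
import io


def read_gct(handle):
    """Parse a GCT v1.2 stream into (gene_names, sample_names, matrix)."""
    version = handle.readline().strip()
    if version != "#1.2":
        raise ValueError(f"unsupported GCT version: {version!r}")
    # Second line: number of data rows, then number of sample columns.
    n_rows, n_cols = (int(x) for x in handle.readline().split()[:2])
    header = handle.readline().rstrip("\n").split("\t")
    samples = header[2:]  # the first two columns are Name and Description
    genes, matrix = [], []
    for line in handle:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        genes.append(fields[0])
        matrix.append([float(v) for v in fields[2:]])
    if len(genes) != n_rows or len(samples) != n_cols:
        raise ValueError("dimensions do not match the GCT header")
    return genes, samples, matrix


def read_cls(handle):
    """Parse a categorical CLS stream into (class_names, per-sample labels)."""
    # First line: number of samples, number of classes, and the constant 1.
    n_samples, n_classes = (int(x) for x in handle.readline().split()[:2])
    class_names = handle.readline().lstrip("#").split()
    labels = handle.readline().split()
    if len(labels) != n_samples or len(class_names) != n_classes:
        raise ValueError("counts do not match the CLS header")
    return class_names, labels


# Toy data (made up for illustration): 2 genes, 4 samples, 2 classes.
GCT_TEXT = (
    "#1.2\n2\t4\n"
    "Name\tDescription\tS1\tS2\tS3\tS4\n"
    "GENE_A\tna\t7.1\t7.3\t9.0\t9.2\n"
    "GENE_B\tna\t5.0\t5.1\t5.0\t4.9\n"
)
CLS_TEXT = "4 2 1\n# DMSO JQ1\n0 0 1 1\n"

genes, samples, matrix = read_gct(io.StringIO(GCT_TEXT))
class_names, labels = read_cls(io.StringIO(CLS_TEXT))
```

The dimension checks are worth keeping in any real reader: a GCT file whose sample count disagrees with its CLS file is a common cause of module errors.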
In this example, we use datasets that were already normalized by the original authors with the Robust Multiarray Averaging (RMA) method of microarray normalization and summarization. We use the following datasets:
Go to the Public tab, and search for "GSE29799".
Click the Analyze button for the entry with 'HuGene-1_0-st-v1' under Technology. Click the following buttons in the pop-up:
Normalization options: Original normalization
Gene/probe options: genes
Click Open in GenomeSpace to download the files to GenomeSpace.
The output files should appear in your GenomeSpace home directory.
We will use the ComparativeMarkerSelection module to identify genes which are differentially expressed and can distinguish between two phenotypes (e.g. normal vs. JQ1-treated), separately for the acute myeloid leukemia (AML) and multiple myeloma (MM) datasets. This module uses the GCT file and the CLS file.
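To make the idea of marker selection concrete, the sketch below scores each gene by a Welch t statistic and a log2 fold change between control and treated samples. It is a rough stand-in for illustration only, not the ComparativeMarkerSelection implementation; the function names, the class labels, and the toy expression values are assumptions.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(control, treated):
    """Welch's t statistic (treated minus control) for two groups of values."""
    se = sqrt(variance(control) / len(control) + variance(treated) / len(treated))
    return (mean(treated) - mean(control)) / se if se > 0 else 0.0


def score_genes(genes, matrix, labels, control, treated):
    """Return (gene, log2_fold_change, t) tuples sorted by |t|, largest first."""
    ctrl_idx = [i for i, lab in enumerate(labels) if lab == control]
    trt_idx = [i for i, lab in enumerate(labels) if lab == treated]
    scored = []
    for gene, row in zip(genes, matrix):
        ctrl = [row[i] for i in ctrl_idx]
        trt = [row[i] for i in trt_idx]
        # On log2 data the fold change is a difference of means, not a ratio.
        log2_fc = mean(trt) - mean(ctrl)
        scored.append((gene, log2_fc, welch_t(ctrl, trt)))
    return sorted(scored, key=lambda s: abs(s[2]), reverse=True)


# Toy log2 expression values (made up): 3 genes, 3 DMSO and 3 JQ1 samples.
genes = ["GENE_A", "GENE_B", "GENE_C"]
labels = ["DMSO", "DMSO", "DMSO", "JQ1", "JQ1", "JQ1"]
matrix = [
    [8.0, 8.2, 8.1, 6.0, 6.1, 5.9],  # strongly down in JQ1
    [5.0, 5.2, 4.9, 5.1, 5.0, 5.0],  # essentially unchanged
    [3.0, 3.1, 2.9, 4.0, 4.2, 4.1],  # up in JQ1
]

ranked = score_genes(genes, matrix, labels, control="DMSO", treated="JQ1")
for gene, log2_fc, t in ranked:
    print(f"{gene}\tlog2FC={log2_fc:+.2f}\tt={t:+.2f}")
```

This is also why the "log transformed data" parameter matters: a tool that takes a ratio of means on data that is already on a log scale would report misleading fold changes.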
NOTE: If you have not yet associated your GenomeSpace account with your GenePattern account, you will be asked to do so. If you do not yet have a GenePattern account, you can automatically generate a new account that will be associated with your GenomeSpace account.
Go to the Modules tab, and search for "ComparativeMarkerSelection". Once the module is loaded, change the following parameters:
input file: load the AML GCT file, e.g., GSE29799GPL6244_RNA_ORIGINALGENE_30916.gct. To do this, use the GenomeSpace tab to navigate to the GSE29799_AML directory containing the AML GCT file, then drag the file to the input box.
cls file: load the AML treatment CLS file, e.g., treatment.cls, also located in the GSE29799_AML directory.
log transformed data: yes
Run ComparativeMarkerSelection on the AML dataset.
Save to GenomeSpace.
input file: load the MM GCT file, e.g., GSE31365GPL6244_RNA_ORIGINALGENE_30917.gct. To do this, use the GenomeSpace tab to navigate to the GSE31365_MM directory containing the MM GCT file, then drag the file to the input box.
cls file: load the MM treatment CLS file, e.g., treatment.cls, also located in the GSE31365_MM directory.
log transformed data: yes
Run ComparativeMarkerSelection on the MM dataset.
Save to GenomeSpace.
We will load the two sets of differentially expressed genes from the AML and MM datasets into Galaxy. Then, we will use a pre-built GenomeSpace workflow to process the datasets, filtering and removing features that do not pass certain cutoffs. Finally, we will create a consensus signature and send a list of gene symbols back to GenomeSpace for additional analysis.
NOTE: If you have not yet associated your GenomeSpace account with your Galaxy account, you will be asked to do so. If you do not yet have a Galaxy account, you can automatically generate a new account that will be associated with your GenomeSpace account.
Get Data > GenomeSpace import
Send to Galaxy.
New type: tabular
Save. Ignore any warnings which may pop up.
We will use a pre-built GenomeSpace workflow to identify the consensus gene signature. This pre-built GenomeSpace workflow uses several steps to determine the overlap between the AML and MM datasets. First, we filter the AML and MM datasets to the top genes using the following cutoffs: (1) >= 1.5 differential expression; and (2) FDR < 0.05 as calculated by
Step 1: Input Dataset:
Step 2: Input Dataset:
Choose Target Directory: choose a directory to save the file to, e.g. your home directory.
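The filter-then-overlap logic of the workflow can be sketched as below. This is not the GenomeSpace workflow itself: it assumes that the ">= 1.5 differential expression" cutoff refers to a linear-scale fold change in either direction, that each input row already carries a fold change and an FDR value, and that the consensus is a plain intersection of gene symbols. The field names and toy values are hypothetical.

```python
def filter_markers(rows, min_fold=1.5, max_fdr=0.05):
    """Return the set of genes passing both the fold-change and FDR cutoffs."""
    kept = set()
    for row in rows:
        fold = row["fold_change"]  # assumed positive ratio, treated / control
        # Treat a 0.5-fold decrease the same as a 2-fold increase.
        magnitude = max(fold, 1.0 / fold)
        if magnitude >= min_fold and row["fdr"] < max_fdr:
            kept.add(row["gene"])
    return kept


def consensus_signature(*signatures):
    """Genes shared by every individual signature, as a sorted list."""
    return sorted(set.intersection(*signatures))


# Toy marker tables (made up for illustration).
aml_rows = [
    {"gene": "GENE_A", "fold_change": 0.4, "fdr": 0.001},  # passes
    {"gene": "GENE_B", "fold_change": 2.0, "fdr": 0.010},  # passes
    {"gene": "GENE_C", "fold_change": 1.2, "fdr": 0.001},  # fold too small
    {"gene": "GENE_D", "fold_change": 3.0, "fdr": 0.200},  # FDR too high
]
mm_rows = [
    {"gene": "GENE_A", "fold_change": 0.5, "fdr": 0.002},  # passes
    {"gene": "GENE_B", "fold_change": 1.1, "fdr": 0.300},  # fails both
    {"gene": "GENE_D", "fold_change": 2.2, "fdr": 0.010},  # passes
]

aml_signature = filter_markers(aml_rows)
mm_signature = filter_markers(mm_rows)
consensus = consensus_signature(aml_signature, mm_signature)
print(consensus)
```

A stricter variant could additionally require that a gene changes in the same direction in both datasets; whether the pre-built workflow does so is not stated here.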
We will use several GenePattern modules to extract the relevant information from our test dataset, which is the MYCN-amplified and MYCN-nonamplified neuroblastoma dataset. Then, we will project the consensus signature onto the neuroblastoma dataset and evaluate its ability to distinguish the two phenotypes (MYCN-amplified and MYCN-nonamplified) by clustering the resulting dataset.
We will use SelectFeaturesColumns to filter the neuroblastoma dataset to only those samples that are MYCN-amplified or MYCN-nonamplified. There is a third group of samples (called 'NILL'), in which MYCN amplification status was not determined; therefore, we filter these samples out and work only with the annotated data.
Go to the Modules tab, and search for "SelectFeaturesColumns". Once the module is loaded, change the following parameters:
input filename: load the neuroblastoma GCT file, e.g., GSE12460GPL750_RNA_ORIGINALGENE_31813.gct. To do this, use the GenomeSpace tab to navigate to the GSE12460_MYCN directory containing the neuroblastoma GCT file, then drag the file to the input box.
columns: 0-2, 4-6, 8-16, 18-19, 21-25, 28, 30-36, 38-54
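The columns value above is a comma-separated list of single indices and inclusive ranges. A minimal sketch of how such a specification expands is shown below, assuming zero-based column indices as the leading `0` suggests; `parse_columns` and `take` are illustrative helpers, not GenePattern functions.

```python
def parse_columns(spec):
    """Expand a spec such as "0-2, 4-6, 28" into a list of column indices."""
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-"))
            indices.extend(range(start, end + 1))  # ranges are inclusive
        else:
            indices.append(int(part))
    return indices


def take(values, indices):
    """Keep only the entries of a sample-ordered list at the given indices."""
    return [values[i] for i in indices]


SPEC = "0-2, 4-6, 8-16, 18-19, 21-25, 28, 30-36, 38-54"
kept = parse_columns(SPEC)

# Indices skipped by the spec, up to the largest index it mentions.
dropped = sorted(set(range(max(kept) + 1)) - set(kept))

print(len(kept), "columns kept; dropped:", dropped)
```

The same index list has to be applied to the class labels in the CLS file as well, so that the expression columns and the phenotype labels stay aligned.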
Go to the Jobs tab, and reload the SelectFeaturesColumns module by clicking on the job and choosing
input filename: load the neuroblastoma CLS file, e.g., Myc.Expression.cls. To do this, click the icon next to the input filename parameter to remove the GCT file from the module. Then, use the GenomeSpace tab to navigate to the GSE12460_MYCN directory containing the neuroblastoma CLS file, and drag the file to the input box.
We will use SelectFeaturesRows to filter the neuroblastoma dataset to only those gene symbols which are in the consensus signature.
Go to the Modules tab, and search for "SelectFeaturesRows". Once the module is loaded, change the following parameters:
MYCN.gene.exp.gct, the previously filtered neuroblastoma GCT file. To do this, use the Jobs tab to find the previous job results, then drag the file to the input box.
consensus.genelist.txt, the consensus signature gene list. To do this, use the GenomeSpace tab to navigate to the directory containing the consensus signature gene list, then drag the file to the input box.
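Conceptually, this row filter keeps only the expression rows whose gene symbol appears in the signature list. The sketch below illustrates that idea on toy data; `select_rows` is a hypothetical helper, and reporting the signature genes that are absent from the dataset is an addition of this sketch rather than documented module behavior.

```python
def select_rows(genes, matrix, gene_list):
    """Keep rows whose gene symbol is in gene_list, in dataset order.

    Returns (kept_genes, kept_matrix, missing), where missing lists the
    signature genes that do not occur in the dataset.
    """
    wanted = set(gene_list)
    kept_genes, kept_matrix = [], []
    for gene, row in zip(genes, matrix):
        if gene in wanted:
            kept_genes.append(gene)
            kept_matrix.append(row)
    present = set(kept_genes)
    missing = [g for g in gene_list if g not in present]
    return kept_genes, kept_matrix, missing


# Toy dataset and signature (made up for illustration).
genes = ["GENE_A", "GENE_B", "GENE_C"]
matrix = [[7.1, 9.0], [5.0, 5.1], [3.2, 4.8]]
signature = ["GENE_C", "GENE_A", "GENE_X"]

kept_genes, kept_matrix, missing = select_rows(genes, matrix, signature)
print(kept_genes, missing)
```

Checking the missing list is useful in practice because the training and test datasets may come from different array platforms, so some signature genes may have no matching row.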
We will use GENE-E to view the filtered neuroblastoma dataset, and to cluster the data by phenotype (MYCN-amplified vs. MYCN-nonamplified), to determine how well the consensus signature can distinguish between phenotypes.
Go to the Modules tab, and search for "GENE_E".
Jobs tab, then change the following parameters:
sample information or class file:
Click the Launch button to prompt a download of GENE-E. You may have to enter your GenePattern or GenomeSpace login credentials.
Once GENE-E has been loaded, we will perform hierarchical clustering on the filtered neuroblastoma dataset:
Click the Hierarchical Clustering icon, or navigate to the following menu:
Tools > Clustering > Hierarchical Clustering...
Column distance metric: Euclidean distance
Linkage method: Complete Linkage
Click OK to run the clustering algorithm.
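The two clustering settings above can be illustrated with a small pure-Python sketch of agglomerative clustering: Euclidean distance between sample profiles, and complete linkage, where the distance between two clusters is the distance between their farthest members. This is only a sketch of the technique, not how GENE-E is implemented; GENE-E draws a full dendrogram, whereas this example stops at a chosen number of clusters. The sample profiles are made up.

```python
from itertools import combinations
from math import dist


def complete_linkage(points, n_clusters):
    """Merge clusters bottom-up until n_clusters remain.

    points: one expression profile (list of floats) per sample.
    Returns clusters as sorted lists of sample indices.
    """
    clusters = [[i] for i in range(len(points))]

    def gap(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest complete-linkage gap.
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: gap(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # x < y, so index x is unaffected by the deletion
    return [sorted(c) for c in clusters]


# Toy profiles over two signature genes (made up): samples 0-1 vs. samples 2-3.
profiles = [
    [1.0, 1.2],
    [1.1, 1.0],
    [5.0, 5.1],
    [5.2, 4.9],
]

groups = complete_linkage(profiles, n_clusters=2)
print(groups)
```

Complete linkage tends to produce compact clusters, which suits the question asked here: whether samples of the same MYCN status form tight groups under the consensus signature.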
This is an example interpretation of the results from this recipe. First, we identified a consensus gene signature of JQ1 activity by finding genes that became differentially expressed due to JQ1 treatment in both acute myeloid leukemia (AML) and multiple myeloma (MM). Then, we projected this consensus signature on a test dataset of neuroblastoma cells which were not treated with JQ1, but were either MYCN-amplified or MYCN-nonamplified. Since MYCN amplification is associated with an increased sensitivity to BET bromodomain inhibitors, such as JQ1, we expected that a signature of JQ1 activity would be able to separate MYCN-amplified and MYCN-nonamplified phenotypes.
These results suggest that the JQ1 consensus signature is capable of differentiating between MYCN-amplified and MYCN-nonamplified neuroblastoma samples. In particular, hierarchical clustering yields three distinct groups of samples: (1) the majority of the MYCN-amplified samples (left cluster, light blue); (2) MYCN-nonamplified samples that are similar to MYCN-amplified samples (middle cluster, dark blue); and (3) MYCN-nonamplified samples which are distinct from MYCN-amplified samples (right cluster, dark blue). The significance of this result would need further confirmation.