Identify and validate a consensus signature using gene expression data
Added by GenomeSpaceTeam on 2015.04.22
Last updated over 3 years ago.
Do phenotypically different expression datasets share a common signature? Can the signature distinguish phenotypes in an independent dataset?
This recipe provides one method for identifying a consensus gene signature from a training set of several phenotypically distinct gene expression datasets. The recipe then validates the ability of the consensus signature to accurately distinguish phenotypes by using an independent test gene expression dataset. An example use case is an investigator who wants to develop a gene expression signature to predict a specific phenotype, such as cancer or another disease.
Background information: What is a consensus gene expression signature?
A gene expression signature is the pattern of expression in a specific group of genes, usually ones that are related by function, position or other biological process. A consensus gene signature is an expression pattern for a specific group of genes, which is shared among different samples or across different phenotypes. For example, a group of genes regulating immune response could be similarly up-regulated during many different, unrelated infections. There are several types of consensus signatures; those that can be derived from gene expression data are called transcriptional consensus signatures. Consensus signatures can be created by overlapping individual gene signatures derived from multiple datasets. Compared to individual gene expression signatures, consensus signatures may be more accurate at distinguishing different phenotypes, such as diseased vs. normal samples.
Use case: Targeting MYCN in Neuroblastoma by BET Bromodomain Inhibition (Puissant et al., Cancer Discov. 2013).
This study analyzed gene expression data generated from primary neuroblastoma tumors of two genetic classes: tumors harboring MYCN amplification (“MYCN amplified”) and tumors without MYCN amplification (“MYCN non-amplified”). MYCN amplified neuroblastoma is exquisitely dependent on the bromodomain and extra-terminal (BET) family of proteins. As such, treatment of MYCN amplified cell lines or tumors with JQ1, a small-molecule inhibitor of BET proteins, leads to dramatic transcriptional changes and induces cell death.
To identify a consensus signature to predict sensitivity to JQ1 treatment, two training datasets and one test dataset were used. The training datasets comprised acute myeloid leukemia (AML) and multiple myeloma (MM) cell lines, which had been treated with either DMSO (control) or JQ1 (treatment). The test dataset included MYCN amplified and MYCN non-amplified neuroblastoma primary tumor samples. GenePattern was used to analyze the AML and MM cell lines; for each dataset, a gene expression signature was derived to identify JQ1 response in the cell line. Using Galaxy, the two signatures were then overlapped to determine the consensus signature shared by the two datasets.
GenePattern was used to validate the ability of this JQ1-associated consensus signature to differentiate between phenotypes, by using the signature to hierarchically cluster the test dataset (neuroblastoma). Since the MYCN amplified and MYCN non-amplified neuroblastoma samples should have differing expression profiles, it was hypothesized that the consensus signature would be able to separate the samples by phenotype. Indeed, the consensus signature was able to cluster the MYCN-amplified and MYCN-nonamplified samples separately, revealing that the consensus signature accurately distinguishes the sensitivity-to-JQ1 phenotype.
To complete this recipe, we will need several gene expression datasets:
Public > RecipeData > ExpressionData > GSE29799_AML: a JQ1-treated acute myeloid leukemia (AML) expression dataset (6 samples)
Public > RecipeData > ExpressionData > GSE31365_MM: a JQ1-treated multiple myeloma (MM) expression dataset (12 samples)
Public > RecipeData > ExpressionData > GSE12460_NB: a neuroblastoma (NB) expression dataset which has MYCN-amplified and MYCN-nonamplified samples (64 samples)
We will use the ComparativeMarkerSelection module to identify genes which are differentially expressed and can distinguish between two phenotypes (e.g. normal vs. JQ1-treated), separately for the acute myeloid leukemia (AML) and multiple myeloma (MM) datasets. This module takes a GCT file (the expression data) and a CLS file (the phenotype label of each sample) as input.
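For readers unfamiliar with the two input formats, the following is a minimal Python sketch of how a GCT expression matrix and a CLS class file are laid out, assuming the standard GCT version 1.2 and categorical CLS layouts. The helper names `parse_gct` and `parse_cls`, the sample names, and the expression values are illustrative only and are not part of GenePattern or of the recipe data.

```python
def parse_gct(text):
    """Parse GCT v1.2 text into (sample_names, {gene_name: [values]})."""
    lines = text.strip().splitlines()
    # Line 1 is the version tag; line 2 holds the row and column counts.
    assert lines[0].strip() == "#1.2"
    n_rows, n_cols = (int(x) for x in lines[1].split("\t")[:2])
    # Line 3 is the header: Name, Description, then one column per sample.
    samples = lines[2].split("\t")[2:]
    data = {}
    for line in lines[3:3 + n_rows]:
        fields = line.split("\t")
        data[fields[0]] = [float(v) for v in fields[2:2 + n_cols]]
    return samples, data


def parse_cls(text):
    """Parse categorical CLS text into (class_names, per-sample labels)."""
    lines = text.strip().splitlines()
    # Line 1: <number of samples> <number of classes> 1
    n_samples, n_classes = (int(x) for x in lines[0].split()[:2])
    # Line 2: '#' followed by the class names, in label order.
    class_names = lines[1].lstrip("#").split()
    # Line 3: one label per sample, in the same order as the GCT columns.
    labels = lines[2].split()
    assert len(labels) == n_samples and len(class_names) == n_classes
    return class_names, labels


# Hypothetical two-gene, four-sample example (all values are made up).
EXAMPLE_GCT = (
    "#1.2\n2\t4\n"
    "Name\tDescription\tS1\tS2\tS3\tS4\n"
    "GENE_A\tna\t8.1\t7.9\t5.2\t5.0\n"
    "GENE_B\tna\t3.3\t3.1\t6.8\t7.0\n"
)
EXAMPLE_CLS = "4 2 1\n# DMSO JQ1\n0 0 1 1\n"

samples, expression = parse_gct(EXAMPLE_GCT)
class_names, labels = parse_cls(EXAMPLE_CLS)
```

The key point is that the CLS labels must appear in the same order as the sample columns of the GCT file, which is why the two files are always supplied together.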
NOTE: If you have not yet associated your GenomeSpace account with your GenePattern account, you will be asked to do so. If you do not yet have a GenePattern account, you can automatically generate a new account that will be associated with your GenomeSpace account.
Open the Modules tab, and search for "ComparativeMarkerSelection". Once the module is loaded, change the following parameters:
input file: load the AML GCT file, e.g., GSE29799GPL6244_RNA_ORIGINALGENE_XXXXX.gct. To do this, use the GenomeSpace tab to navigate to Public > RecipeData > ExpressionData > GSE29799_AML, then drag the file to the input box.
cls file: load the AML treatment CLS file, e.g., treatment.cls, also located in the GSE29799_AML directory.
log transformed data: yes
output filename: AML_genes.comp.marker.odf
Click Run to run ComparativeMarkerSelection on the AML dataset.
Save the result: choose Save to GenomeSpace, then click Save.
Repeat for the MM dataset, changing the following parameters:
input file: load the MM GCT file, e.g., GSE31365GPL6244_RNA_ORIGINALGENE_XXXXX.gct. To do this, use the GenomeSpace tab to navigate to Public > RecipeData > ExpressionData > GSE31365_MM, then drag the file to the input box.
cls file: load the MM treatment CLS file, e.g., treatment.cls, also located in the GSE31365_MM directory.
log transformed data: yes
output filename: MM_genes.comp.marker.odf
Click Run to run ComparativeMarkerSelection on the MM dataset.
Save the result: choose Save to GenomeSpace, then click Save.
We will load the two sets of differentially expressed genes from the AML and MM datasets into Galaxy. Then, we will use a pre-built GenomeSpace workflow to process the datasets, filtering out features that do not pass certain cutoffs. Finally, we will create a consensus signature and send a list of gene symbols back to GenomeSpace for additional analysis.
NOTE: If you have not yet associated your GenomeSpace account with your Galaxy account, you will be asked to do so. If you do not yet have a Galaxy account, you can automatically generate a new account that will be associated with your GenomeSpace account.
In Galaxy, choose Get Data > GenomeSpace import.
Select the AML_genes.comp.marker.odf and MM_genes.comp.marker.odf files.
Click Send to Galaxy.
For each imported file, open the Datatype tab.
New type: tabular
Click Save. Ignore any warnings which may pop up.
We will use a pre-built GenomeSpace workflow to identify the consensus gene signature. This workflow uses several steps to determine the overlap between the AML and MM datasets. First, we filter the AML and MM datasets to the top genes using the following cutoffs: (1) >= 1.5 differential expression; and (2) FDR < 0.05 as calculated by ComparativeMarkerSelection (GenePattern).
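The filter-then-overlap logic can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual Galaxy workflow: each marker result is represented as a dictionary with the hypothetical keys `gene`, `fold_change` and `fdr`, the 1.5 cutoff is interpreted as a fold change of at least 1.5 in either direction, and all gene names and values are made up.

```python
def passes_cutoffs(row, min_fold=1.5, max_fdr=0.05):
    """Return True if a marker row meets both cutoffs.

    Assumption: 'fold_change' is a ratio, and a change of at least
    min_fold in either direction (up or down) counts as differential.
    """
    fold = row["fold_change"]
    changed = fold >= min_fold or fold <= 1.0 / min_fold
    return changed and row["fdr"] < max_fdr


def signature(rows):
    """Set of gene symbols that pass the cutoffs in one dataset."""
    return {row["gene"] for row in rows if passes_cutoffs(row)}


def consensus_signature(*datasets):
    """Genes that pass the cutoffs in every dataset (set intersection)."""
    signatures = [signature(rows) for rows in datasets]
    return sorted(set.intersection(*signatures))


# Made-up rows standing in for the two ComparativeMarkerSelection results.
aml_rows = [
    {"gene": "GENE_A", "fold_change": 2.4, "fdr": 0.01},
    {"gene": "GENE_B", "fold_change": 0.5, "fdr": 0.03},
    {"gene": "GENE_C", "fold_change": 1.2, "fdr": 0.01},
]
mm_rows = [
    {"gene": "GENE_A", "fold_change": 1.8, "fdr": 0.02},
    {"gene": "GENE_B", "fold_change": 0.4, "fdr": 0.20},
    {"gene": "GENE_D", "fold_change": 3.0, "fdr": 0.001},
]

consensus = consensus_signature(aml_rows, mm_rows)
```

In this toy example only GENE_A survives: GENE_B fails the FDR cutoff in the second dataset, and GENE_C and GENE_D each pass in at most one dataset. A stricter variant could additionally require that a gene changes in the same direction in both datasets.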
Click Run.
Step 1: Input Dataset: AML_genes.comp.marker.odf
Step 2: Input Dataset: MM_genes.comp.marker.odf
Choose Target Directory: choose a directory to save the file to, e.g. your home directory.
Click Run workflow.
We will use several GenePattern modules to extract the relevant information from our test dataset, which is the MYCN-amplified and MYCN-nonamplified neuroblastoma dataset. Then, we will project the consensus signature onto the neuroblastoma dataset and evaluate its ability to distinguish the two phenotypes (MYCN-amplified and MYCN-nonamplified) by clustering the resulting dataset.
We will use SelectFeaturesColumns to filter the neuroblastoma dataset to only those samples that are MYCN-amplified or MYCN-nonamplified. There is a third group of samples (called 'NILL'), in which MYCN amplification status was not determined; therefore, we filter these samples out and work only with the annotated data.
Open the Modules tab, and search for SelectFeaturesColumns. Once the module is loaded, change the following parameters:
input filename: use the GenomeSpace tab to navigate to Public > RecipeData > ExpressionData > GSE12460, then click and drag to load the neuroblastoma GCT file, e.g., GSE12460GPL750_RNA_ORIGINALGENE_XXXX.gct.
columns: 0-2, 4-6, 8-16, 18-19, 21-25, 28, 30-36, 38-54
output: MYCN.gene.exp.gct
Click Run.
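To make the columns parameter concrete, here is a small Python sketch that expands such a specification into individual column positions. It assumes the positions are zero-based and the ranges are inclusive, which the leading "0-2" suggests but which should be checked against the module documentation; the function names are hypothetical.

```python
def parse_column_spec(spec):
    """Expand a spec such as '0-2, 4-6, 28' into a list of column indices."""
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-"))
            indices.extend(range(start, end + 1))  # assumed inclusive ranges
        else:
            indices.append(int(part))
    return indices


def select_columns(samples, spec):
    """Keep only the sample names whose zero-based position is in the spec."""
    return [samples[i] for i in parse_column_spec(spec)]


# The specification used in this recipe.
recipe_spec = "0-2, 4-6, 8-16, 18-19, 21-25, 28, 30-36, 38-54"
kept = parse_column_spec(recipe_spec)
```

Under these assumptions the specification keeps 47 columns and skips positions such as 3, 7 and 17, which would correspond to samples without a determined MYCN status.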
Open the Jobs tab, and reload the SelectFeaturesColumns module by clicking on the job and choosing Reload Job.
input filename: load the neuroblastoma CLS file, e.g., Myc.Expression.cls. To do this, first remove the GCT file from the module using the control next to the input filename parameter. Then, use the GenomeSpace tab to navigate to the GSE12460_MYCN directory: Public > RecipeData > ExpressionData > GSE12460, then drag the file to the input box.
output: MYCN.gene.exp.cls
Click Run.
Choose Save File to download a copy of this file to your local computer. We will need the local copy for a later step.
We will use SelectFeaturesRows to filter the neuroblastoma dataset to only those gene symbols which are in the consensus signature.
Open the Modules tab, and search for "SelectFeaturesRows". Once the module is loaded, change the following parameters:
input filename: MYCN.gene.exp.gct, the previously filtered neuroblastoma GCT file. To do this, use the Jobs tab to find the previous job results, then drag the file to the input box.
list filename: consensus.genelist.txt, the consensus signature gene list. To do this, use the GenomeSpace tab to navigate to the directory containing the consensus signature gene list, then drag the file to the input box.
output: MYCN.consensus.gct
Click Run.
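Conceptually, this row-filtering step keeps only the expression rows whose gene symbol appears in the gene list. A minimal Python sketch of that idea follows; the function name `select_rows`, the dictionary representation of the expression matrix, and the example values are assumptions made for illustration and do not reflect how the GenePattern module is implemented.

```python
def select_rows(expression, gene_list):
    """Keep only the rows whose gene symbol appears in gene_list.

    expression: dict mapping gene symbol -> list of expression values.
    gene_list: iterable of gene symbols (for example, lines of a list file).
    Returns (filtered_expression, missing_symbols).
    """
    wanted = {g.strip() for g in gene_list if g.strip()}
    kept = {gene: values for gene, values in expression.items() if gene in wanted}
    # Symbols in the signature that the test dataset does not measure.
    missing = sorted(wanted - set(expression))
    return kept, missing


# Made-up expression values and a made-up three-gene signature.
test_expression = {
    "GENE_A": [8.1, 7.9, 5.2],
    "GENE_B": [3.3, 3.1, 6.8],
    "GENE_C": [4.0, 4.2, 4.1],
}
signature_lines = ["GENE_A\n", "GENE_C\n", "GENE_X\n", "\n"]

filtered, not_found = select_rows(test_expression, signature_lines)
```

Reporting the symbols that were not found is worth doing in practice, because signature genes can be absent from the test platform or named differently there.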
We will use HierarchicalClustering in GenePattern to create a dendrogram of the filtered neuroblastoma dataset, clustering the samples to determine how well the consensus signature can distinguish between phenotypes (MYCN-amplified vs. MYCN-nonamplified). Then, we will use HierarchicalClusteringViewer to view the results of the clustering algorithm and to label samples by phenotype.
Open the Modules tab, and search for HierarchicalClustering.
Load the input file from the Jobs tab, then change the following parameters:
input file: MYCN.consensus.gct (output from SelectFeaturesRows).
column distance measure: Euclidean distance
row distance measure: No row clustering
clustering method: Pairwise complete-linkage
Click Run.
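To illustrate what these settings mean, the following Python sketch clusters samples (columns) with Euclidean distance and complete linkage. It is a naive, standard-library-only illustration and not the GenePattern implementation: it stops at a requested number of groups instead of building a full dendrogram, and the profiles are made-up values.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def complete_linkage(cluster_a, cluster_b, profiles):
    """Complete linkage: distance between the two farthest members."""
    return max(euclidean(profiles[i], profiles[j])
               for i in cluster_a for j in cluster_b)


def cluster_samples(profiles, n_clusters):
    """Agglomerative clustering of sample profiles down to n_clusters groups."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = complete_linkage(clusters[a], clusters[b], profiles)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Made-up profiles: each inner list is one sample's values
# over the genes of the signature.
sample_profiles = [
    [8.0, 7.5, 2.0],
    [7.8, 7.9, 2.2],
    [3.0, 2.5, 6.9],
    [3.2, 2.8, 7.1],
]
groups = cluster_samples(sample_profiles, n_clusters=2)
```

Here the first two samples end up in one group and the last two in another. With "No row clustering" selected, only the samples are grouped in this way, and the gene order is left unchanged.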
Open HierarchicalClusteringViewer.
cdt file: MYCN.consensus.cdt
atr file: MYCN.consensus.atr
Click Run.
Once HierarchicalClusteringViewer has loaded, you should see a heatmap of the original MYCN.consensus.gct file. To color the samples by phenotype, load the MYCN.consensus.cls file you downloaded to your local directory from Step 5.
This is an example interpretation of the results from this recipe. First, we identified a consensus gene signature of JQ1 activity by finding genes that became differentially expressed due to JQ1 treatment in both acute myeloid leukemia (AML) and multiple myeloma (MM). Then, we projected this consensus signature on a test dataset of neuroblastoma cells which were not treated with JQ1, but were either MYCN-amplified or MYCN-nonamplified. Since MYCN amplification is associated with an increased sensitivity to BET bromodomain inhibitors, such as JQ1, we expected that a signature of JQ1 activity would be able to separate MYCN-amplified and MYCN-nonamplified phenotypes.
These results suggest that the JQ1 consensus signature is capable of differentiating between MYCN-amplified neuroblastoma and MYCN-nonamplified neuroblastoma samples. In particular, hierarchical clustering yields three distinct groups of samples: (1) the majority of the MYCN-amplified samples (left cluster, light blue); (2) MYCN-nonamplified samples that are similar to MYCN-amplified samples (middle cluster, dark blue); and (3) MYCN-nonamplified samples which are distinct from MYCN-amplified samples (right cluster, dark blue). The significance of this possible result would need further confirmation.