Identify TCGA data that have specific mutational profiles
Added by GenomeSpaceTeam on 2016.04.26
Last updated over 3 years ago.
How do I obtain and analyze data from The Cancer Genome Atlas (TCGA)? Which TCGA datasets have specific mutations in my gene of interest?
This recipe provides a method for identifying and obtaining specific datasets of interest from The Cancer Genome Atlas (TCGA), through a web-based tool called FireBrowse. An example use of this recipe is a case where an investigator may have a gene they are interested in, such as ERCC2, and would like to know if there are mutations in this gene in specific datasets of interest, such as bladder cancer.
Tumors arise from mutational changes to healthy cells, and are frequently deficient in one or more DNA repair pathways. The accumulation of mutations in a tumor can be described by its “mutational signature”, a pattern of genetic mutations found in tumor DNA that reflects different mutation events. Mutational signatures can be specific to certain tissues or cancer types. Many of these mutational signatures are associated with DNA repair pathways.
An in-depth study of urothelial carcinoma, which causes ~150,000 deaths annually, by Kim et al. (Nature Genetics, 2016) identified a mutational signature in bladder cancer involving the nucleotide-excision repair (NER) pathway. The signature involves ERCC2, a gene encoding a DNA helicase that plays a critical role in the NER pathway. Somatic mutations in this gene may prevent proper functioning of the NER pathway, allowing mutations to accumulate. Urothelial cancer is the only tumor type known to date in which ERCC2 is significantly mutated.
Kim et al. used the collection of bladder carcinoma (BLCA) samples in The Cancer Genome Atlas (TCGA) to complete their analysis. Data were downloaded from the Broad Institute TCGA Genome Data Analysis Center, and samples were categorized based on mutational status. Tumors with somatic, missense mutations in ERCC2 were compared to non-mutated (wild-type) samples to identify the comprehensive mutational landscape of bladder cancer (also described in this TCGA paper).
This recipe provides a method for processing data from The Cancer Genome Atlas (TCGA), to identify samples which have mutations in specific genes. The purpose of this recipe is to categorize data by mutational status, for further downstream analysis (e.g. comparing tumors of different mutational status, etc.). Data is collected from FireBrowse; Galaxy and GenePattern are used to categorize samples by mutational status and generate GCT and CLS files. The RNA-seq datasets are gene-level normalized RSEM expression estimates.
TCGA Barcodes
The Cancer Genome Atlas labels its datasets with the TCGA barcode, an identifier that describes the metadata associated with a sample. You can learn more about the TCGA Barcodes on the NIH National Cancer Institute Wiki page (see also: working with TCGA data).
TCGA barcodes adhere to a certain format: TCGA-00-1111-22A-33B-4444-55. For this recipe, we are interested in the Sample type, indicated by the 22A section of the barcode. Specifically, we are interested in samples with designation 01 (solid tumor, or TP) or 11 (solid tissue normal, or NT), which are paired tumor-normal samples.
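As a small illustration of the barcode layout described above, the sample type can be read from the fourth dash-separated field. This is a minimal sketch; the function name and the lookup table are hypothetical and cover only the two sample types used in this recipe.

```python
# Sample type codes used in this recipe (other TCGA codes exist but are not listed here).
SAMPLE_TYPES = {"01": "TP", "11": "NT"}


def sample_type_code(barcode: str) -> str:
    """Return the two-digit sample type code from a TCGA barcode.

    The code is the numeric part of the fourth dash-separated field,
    e.g. "01" in "TCGA-4Z-AA84-01A-11D-A391-08".
    """
    fields = barcode.split("-")
    if len(fields) < 4 or fields[0] != "TCGA":
        raise ValueError(f"not a sample-level TCGA barcode: {barcode!r}")
    return fields[3][:2]


code = sample_type_code("TCGA-4Z-AA84-01A-11D-A391-08")
print(code, SAMPLE_TYPES.get(code, "other"))
```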
To complete this recipe, we will need several datasets. First, we need example Bladder Cancer (BLCA) data from RNA-seq reads which have been processed and had normalized RSEM values calculated for each gene. In this dataset, we should have bladder cancer tumor samples, in addition to paired normal tissue. For this recipe we are trying to identify the bladder cancer samples that have mutations in the gene ERCC2. Therefore, we also need a list of which samples in the RNA-seq data have mutation annotation files (MAFs), and additionally we need to identify the list of bladder cancer samples that have ERCC2 mutations. We will be using the following datasets, which we will obtain from FireBrowse:
BLCA.illuminahiseq_rnaseqv2-RSEM_genes_normalized.tar.gz: This is a zipped file containing the normalized RSEM values for bladder cancer Illumina HiSeq RNA-seq data.
BLCA.Mutation_Packager_Calls.tar.gz: This is a zipped archive of mutation annotation files (MAFs) for all bladder cancer samples with mutations. Specifically, we will only be using the MANIFEST.txt file within this zipped archive, which just lists the samples with MAFs.
ERCC2mutSamples.txt: This text file lists the bladder cancer samples that have mutations in the ERCC2 gene. This file lists the specific type of mutation that occurs in the sample and whether it causes a missense or silent mutation of the protein product. This file also contains samples which do not have RNA-seq data, and must be filtered out.
The purpose of this recipe is to filter an RNA-seq dataset of bladder cancer samples, in order to find samples which have mutations in the ERCC2 gene, and compare them against samples that do not have these mutations. The output of this recipe will be a filtered GCT file (BLCA.rnaseq.processed.gct), and an accompanying CLS file (BLCA.rnaseq.processed.cls).
Use FireBrowse to obtain the three files necessary to complete this recipe: (1) a file of gene RSEM values for BLCA RNA-seq data; (2) a list of BLCA samples with MAF annotations; and, (3) a list of BLCA samples with ERCC2 mutations, and the mutation types.
Under SELECT COHORT, choose BLCA.
Click mRNASeq (dark red bar). This will open a window called BLCA mRNASeq Archives.
Click Send To.
Select illuminahiseq_rnaseqv2-RSEM_genes_normalized.
Click the GenomeSpace Upload button, which will open the GenomeSpace upload/download manager.
Click Submit to send the data to GenomeSpace. This will create a new archived file in your GenomeSpace account, called gdac.broadinstitute.org_BLCA.Merge_rnaseqv2_...[date/time]...tar.gz. Once the upload is complete, close the data manager window and the BLCA mRNASeq Archives popup window.
Click Mutation Annotation File (grey bar). This will open a window called BLCA MAF Archives.
Click Send To.
Select Mutation_Packager_Calls.
Click the GenomeSpace Upload button, which will open the GenomeSpace upload/download manager.
Click Submit to send the data to GenomeSpace. This will create a new archived file in your GenomeSpace account, called gdac.broadinstitute.org_BLCA.Mutation_Packager_Calls...[date/time]...tar.gz. Once the upload is complete, close the data manager window and the BLCA MAF Archives popup window.
Click the WEB-API link in the navigation menu.
Click Analyses: Fine grained retrieval of analysis pipeline results.
Click GET /Analyses/Mutation/MAF.
Set the following parameters:
  format: tsv
  cohort: BLCA
  gene: ERCC2
Click Perform Query. This will return several results, such as the "Request URL", "Response Body", "Response Code", and "Response Headers".
In the Request URL section, copy the URL.

Upload the ERCC2 mutation data file to GenomeSpace. Then un-tar and un-zip the archive of RNA-seq data, the BLCA RNA-seq RSEM normalized file. This will create a .txt file of the normalized RSEM values of the RNA-seq data. Un-tar, un-zip and process the mutation annotation file list, the BLCA Mutation Packager Calls file. This will create a folder of the MAF annotations for each sample. The only file we will use for this recipe is the MANIFEST.txt file.
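For reference, the URL that FireBrowse reports in its Request URL field, and that is imported into GenomeSpace in this step, can be approximated by combining the endpoint path with the three query parameters from the recipe. The sketch below only builds the URL and does not contact any server. The base address `http://firebrowse.org/api/v1` and the function name are assumptions; the URL shown by FireBrowse itself is authoritative and may carry additional default parameters.

```python
from urllib.parse import urlencode

# Assumed base address of the FireBrowse web API; check it against the
# "Request URL" that FireBrowse displays after running the query.
FIREBROWSE_API = "http://firebrowse.org/api/v1"


def maf_query_url(cohort: str, gene: str, fmt: str = "tsv",
                  base: str = FIREBROWSE_API) -> str:
    """Build a GET /Analyses/Mutation/MAF query URL from the recipe's parameters."""
    params = urlencode({"format": fmt, "cohort": cohort, "gene": gene})
    return f"{base}/Analyses/Mutation/MAF?{params}"


print(maf_query_url("BLCA", "ERCC2"))
```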
In GenomeSpace, choose File > Import from URL.
Paste the URL copied from FireBrowse, then click Go. This will open the GenomeSpace upload/download manager.
Give the file a name (e.g. ERCC2mutSamples.txt).
Click Submit to send the data to GenomeSpace.
Right-click the gdac.broadinstitute.org_BLCA.Merge_rnaseqv2_...[date/time]...tar.gz file.
Choose Expand Archive. Give the unzipped folder an easy-to-remember name, e.g. "BLCA_RNAseq". Then click Expand Archive. This should create a new folder, BLCA_RNAseq.
Navigate through BLCA_RNAseq until you find the RSEM file, BLCA.rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.data.txt. This is the file we will be working with. Feel free to copy this file to a different part of your directory, or to rename the file.
Right-click the gdac.broadinstitute.org_BLCA.Mutation_Packager_Calls...[date/time]...tar.gz file.
Choose Expand Archive. Give the unzipped folder an easy-to-remember name, e.g. "BLCA_MAF". Then click Expand Archive. This should create a new folder, BLCA_MAF.
Navigate through BLCA_MAF until you find a MANIFEST.txt file.
Right-click the MANIFEST.txt file, and choose Extract rows / cols. Change the following parameters:
  delimiter: Space
Click Save. The resulting file should be called MANIFEST.slice.txt.

Use the Galaxy GenomeSpace Importer interface to transfer the RNA-seq data file and the MAF Manifest file to Galaxy.
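The MANIFEST handling around this point in the recipe (extracting the file-name column from MANIFEST.txt in GenomeSpace, then removing the -01.maf.txt pattern in Galaxy) could also be sketched in plain Python. The sketch assumes each MANIFEST line is a space-delimited pair whose last field is a MAF file name; the checksums and the second ID in the example are made up. Unlike the Galaxy tool, it drops lines that do not end in the suffix.

```python
def maf_sample_ids(manifest_text: str, suffix: str = "-01.maf.txt") -> list:
    """Return the IDs left after stripping `suffix` from each MAF file name."""
    ids = []
    for line in manifest_text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        file_name = parts[-1]  # assumed layout: "<checksum> <file name>"
        if file_name.endswith(suffix):
            ids.append(file_name[: -len(suffix)])
    return ids


# Hypothetical MANIFEST content, for illustration only.
example = "abc123 TCGA-4Z-AA84-01.maf.txt\ndef456 TCGA-XX-0000-01.maf.txt\n"
print(maf_sample_ids(example))
```

Note that removing -01.maf.txt leaves a 12-character ID such as TCGA-4Z-AA84, which no longer includes the sample type.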
Choose Get Data > GenomeSpace import.
Select BLCA.rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.data.txt.
Click Send to Galaxy.
Go to the Datatype column, then change the following parameters:
  New Type: tabular
Click Save.
Go to the Attributes column, and change the following parameters:
  Name: BLCA.rnaseq
Click Save.
Import the MAF manifest file, MANIFEST.slice.txt, to Galaxy.
Choose Get Data > GenomeSpace import.
Find the MANIFEST.slice.txt file. Select it.
Click Send to Galaxy.
Process the MANIFEST.slice.txt file.
Choose Text Manipulation > Replace Text (in entire line). Change the following parameters:
  file to process: GenomeSpace import on [Manifest.slice.txt]
  Find pattern: -01.maf.txt
  Replace with: (leave empty)
Click Execute.
Go to the Attributes section. Change the following parameters:
  Name: MAF Sample IDs
Click Save.
Go to the Datatype column, then change the following parameters:
  New Type: tabular
Click Save.

Process the RNA-seq file, BLCA.rnaseq (from Step 2). In this step, we will be removing unnecessary information from the file (e.g. 'normalized count' information), in order to turn the file into a GCT file later.
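The reshaping performed in this step (discard the first column, transpose, then discard the second column) can be sketched with the Python standard library. The sketch assumes the RSEM file has a first header row of sample barcodes and a second row that repeats "normalized_count"; the tiny input table, including its sample and gene names, is made up for illustration.

```python
import csv
import io


def samples_by_genes(rsem_text: str) -> list:
    """Turn a genes-by-samples table into rows of [sample ID, value, value, ...]."""
    rows = list(csv.reader(io.StringIO(rsem_text), delimiter="\t"))
    without_gene_ids = [row[1:] for row in rows]                # discard column 1
    transposed = [list(col) for col in zip(*without_gene_ids)]  # transpose
    return [[row[0]] + row[2:] for row in transposed]           # discard column 2


# Hypothetical miniature input with two samples and two genes.
example = (
    "Hybridization REF\tS1\tS2\n"
    "gene_id\tnormalized_count\tnormalized_count\n"
    "A|1\t1.0\t2.0\n"
    "B|2\t3.0\t4.0\n"
)
for row in samples_by_genes(example):
    print(row)
```

After this step each row starts with a sample barcode, which is what makes the later join against the MAF sample list possible.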
Choose Text Manipulation > Cut columns from a table (cut). Change the following parameters:
  file to cut: [BLCA.rnaseq]
  Operation: Discard
  Delimited by: Tab
  Cut by: fields
  List of Fields: click on the box and select Column: 1 from the drop-down menu.
Click Execute.
Choose Datamash > Transpose. Change the following parameters:
  Input tabular dataset: Cut on [BLCA.rnaseq]
Click Execute.
Choose Text Manipulation > Cut columns from a table (cut). Change the following parameters:
  file to cut: Transpose on [Cut on BLCA.rnaseq]
  Operation: Discard
  Delimited by: Tab
  Cut by: fields
  List of Fields: click on the box and select Column: 2 from the drop-down menu.
Click Execute.
Go to the Attributes section. Change the following parameters:
  Name: BLCA.rnaseq Data
Click Save.

Next, we want to identify the samples in the RNA-seq data that have been annotated with mutations, and therefore have an associated MAF. We will do this by intersecting the list of samples with RNA-seq data with the list of samples with MAFs.
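The overlap described here amounts to a join on a shortened sample ID. The sketch below assumes the join key is the first 12 characters of each RNA-seq barcode (for example TCGA-4Z-AA84), matching what remains of a MAF file name once -01.maf.txt has been removed. The function name and the example rows are hypothetical. Unlike Galaxy's join tool, it does not prepend the key columns to the output.

```python
def filter_by_maf(sample_rows: list, maf_ids: list, key_length: int = 12) -> list:
    """Keep rows whose barcode prefix (first `key_length` characters) has a MAF."""
    wanted = set(maf_ids)
    return [row for row in sample_rows if row[0][:key_length] in wanted]


# Hypothetical rows of [barcode, expression value].
rows = [
    ["TCGA-4Z-AA84-01A-11R-A39I-07", "1.0"],
    ["TCGA-4Z-AA84-11A-11R-A39I-07", "2.0"],
    ["TCGA-ZZ-0000-01A-11R-A39I-07", "3.0"],
]
for row in filter_by_maf(rows, ["TCGA-4Z-AA84"]):
    print(row)
```

Because the key omits the sample type, both the tumor (01) and the normal (11) sample of a matching participant are retained.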
Choose Text Manipulation > Cut columns from a table (cut). Change the following parameters:
  file to cut: [BLCA.rnaseq Data]
  Operation: Keep
  Delimited by: Tab
  Cut by: fields
  List of Fields: click on the box and select Column: 1 from the drop-down menu.
Click Execute.
Choose Text Manipulation > Trim. Change the following parameters:
  this dataset: Cut on [BLCA.rnaseq Data]
  Trim this column only: 0
  Trim from the beginning up to this position: 1
  Remove everything from this position to the end: -16
  Is input data in fastq format?: No
Click Execute.
Choose Text Manipulation > Paste. Change the following parameters:
  Paste: Trim on [Cut on BLCA.rnaseq Data]
  and: [BLCA.rnaseq Data]
  Delimit by: Tab
Click Execute.
Go to the Attributes column. Change the following parameters:
  Name: BLCA.rnaseq Data, combined IDs
Click Save.
Choose Join, Subtract and Group > Join two Datasets. Change the following parameters:
  Join: [MAF Sample IDs]
  using column: Column: 1
  with: [BLCA.rnaseq Data, combined IDs]
  and column: Column: 1
  Keep lines of first input that do not join with second input: No
  Keep lines of first input that are incomplete: No
  Fill empty columns: No
Click Execute.
Go to the Attributes column. Change the following parameters:
  Name: BLCA.rnaseq Data, filtered
Click Save.
Go to the Datatype column, then change the following parameters:
  New Type: tabular
Click Save.

Now that we have identified the RNA-seq samples that are also annotated with mutation information, we need to re-process the filtered RNA-seq dataset and turn it into a GCT file.
Process the BLCA.rnaseq Data, filtered file.
Choose Text Manipulation > Cut columns from a table (cut). Change the following parameters:
  File to cut: [BLCA.rnaseq Data, filtered]
  Operation: Discard
  Delimited by: Tab
  Cut by: fields
  List of fields: click on the box and select Column: 1 from the drop-down menu. Click again to select Column: 2 from the drop-down menu.
Click Execute.
Choose Datamash > Transpose. Change the following parameters:
  Input tabular dataset: Cut on [BLCA.rnaseq Data, filtered]
Click Execute.
Choose Text Manipulation > Cut columns from a table (cut). Change the following parameters:
  file to cut: [BLCA.rnaseq]
  Operation: Keep
  Delimited by: Tab
  Cut by: fields
  List of Fields: click on the box and select Column: 1 from the drop-down menu.
Click Execute.
Go to the Attributes section. Change the following parameters:
  Name: BLCA.rnaseq Gene IDs
Click Save.
Choose Text Manipulation > Remove beginning. Change the following parameters:
  Remove first: 1
  from: [BLCA.rnaseq Gene IDs]
Click Execute.
Choose Text Manipulation > Add column. Change the following parameters:
  Add this value: No description
  to Dataset: Remove beginning on [BLCA.rnaseq Gene IDs]
  Iterate?: NO
Click Execute.
Choose Text Manipulation > Paste. Change the following parameters:
  Paste: Add column on [Remove beginning on [BLCA.rnaseq Gene IDs]]
  and: Transpose on [Cut on [BLCA.rnaseq Data, filtered]]
  Delimit by: Tab
Click Execute.
Go to the Attributes section. Change the following parameters:
  Name: BLCA.rnaseq Final
Click Save.
Choose Send Data > GenomeSpace Exporter. Change the following parameters:
  Send this dataset to GenomeSpace: [BLCA.rnaseq Final]
  Choose Target Directory: select a directory to save the file to
  Filename: give the file a name, e.g. BLCA.rnaseq.processed.tab. Make sure to use the .tab extension when saving the file.
Click Execute.

We have found the overlap between samples with RNA-seq data and samples with MAF annotations. This process resulted in a tab-delimited matrix file containing the RSEM values for the genes. In order to proceed with further analysis, this tab-delimited file needs to be turned into a GCT file. We can accomplish this using the simple conversion tools built into GenomeSpace.
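To make the target format concrete, here is a minimal sketch of what a GCT file contains: a version line (#1.2), a line with the number of rows and the number of sample columns, a header line, and one line per gene. The function name and the example values are hypothetical, and the output of the GenomeSpace converter may differ in details such as column naming.

```python
def to_gct(header: list, rows: list) -> str:
    """Serialize a table as GCT text.

    header: ["Name", "Description", sample_1, sample_2, ...]
    rows:   [name, description, value_1, value_2, ...] for each gene
    """
    n_samples = len(header) - 2  # the first two columns are not samples
    lines = ["#1.2", f"{len(rows)}\t{n_samples}", "\t".join(header)]
    lines += ["\t".join(str(value) for value in row) for row in rows]
    return "\n".join(lines) + "\n"


# Hypothetical one-gene, two-sample table.
print(to_gct(["Name", "Description", "S1", "S2"],
             [["A|1", "No description", "1.0", "3.0"]]), end="")
```

This layout is why the previous step prepends the gene ID column and a "No description" column before exporting the table.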
In GenomeSpace, select the BLCA.rnaseq.processed.tab file.
Choose Convert.
Under Convert to: choose GCT.
Click Convert on Server. The output file should be BLCA.rnaseq.processed.gct.

Now that we have created a GCT file of the RNA-seq samples with mutation annotations, we need an accompanying CLS file. To create this CLS file, we will use our newly created GCT file, and we will also use the list of samples with mutations in the ERCC2 gene. This list is found in the ERCC2mutSamples.txt file. We will create this CLS file in GenePattern, using the ClsFileCreator module.
In the Modules section, search for the ClsFileCreator module. Click the ClsFileCreator module to load the module.
Click the GenomeSpace tab to be able to access your files in GenomeSpace. Change the following parameters:
  input file: BLCA.rnaseq.processed.gct
Click Run. This will launch a visualizer which will walk you through the CLS file creation process.
Go to the Samples tab.
Click the Check All button.
Click Next.
In the Define Classes tab, change the following parameters:
Type each of the following class names in the Enter class name box, then click the Add class button.
  mutated_normal: this is ERCC2 mutated, solid tissue normal (Sample Code: 11)
  mutated_tumor: this is ERCC2 mutated, primary solid tumor (Sample Code: 01)
  wildtype_normal: this is non-mutated, solid tissue normal (Sample Code: 11)
  wildtype_tumor: this is non-mutated, primary solid tumor (Sample Code: 01)
Click Next.
In GenomeSpace, locate the ERCC2mutSamples.txt file.
Right-click the ERCC2mutSamples.txt file, then choose Preview. This will allow you to view the file. Keep this window open so that you can reference the contents of the ERCC2mutSamples.txt file in later steps.
Return to the ClsFileCreator module.
In the Assign Classes tab, assign each sample ID to a class.
Find a sample in the ERCC2mutSamples.txt file (in GenomeSpace) and determine its ID (e.g. "TCGA-4Z-AA84-01A-11D-A391-08"). Next, compare this to the ClsFileCreator window (in GenePattern), and look for the same IDs.
Make sure the Class: parameter is set to the appropriate class, based on the following criteria:
  mutated_normal: samples with sample class "11" and "missense mutation" (under "Variant_classification")
  mutated_tumor: samples with sample class "01" and "missense mutation" (under "Variant_classification")
  wildtype_normal: all remaining samples with sample class "11"
  wildtype_tumor: samples with sample class "01" and "silent mutation" (under "Variant_classification"), and all remaining samples with sample class "01" that are not assigned to mutated_tumor
With the Class: chosen, assign the sample ID to the class by clicking on the arrow.
Click Next.
The ERCC2mutSamples.txt file includes 1 metadata line. The remaining 42 lines allow us to directly identify 2 mutated_normal samples, 16 mutated_tumor samples, and 2 wildtype_tumor samples with silent mutations. The 13 wildtype_normal samples are identified by their Sample Code ("11"), which remains once the previous samples are removed from the list. The remaining 111 wildtype_tumor samples are those left over after all previously listed samples are classified.
Review the Summary section. You should have the following assignments:
  mutated_normal: 2
  mutated_tumor: 16
  wildtype_normal: 13
  wildtype_tumor: 113
Once you have reviewed everything, scroll down and click Next.
In the Save tab, change the following parameters:
  Enter file name: Keep the filename that was generated, BLCA.rnaseq.processed.cls
Click the Save file to GenePattern Files Tab button.
Click the Select directory button.
In the Select Directory from Files pop-up, click on the directory you want to upload the file to.
Click OK.
In the Save tab, click Save to save the file to GenePattern.
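The class assignment rules above, and the CLS file they produce, can be sketched as follows. This is not what ClsFileCreator does internally. It assumes the set `missense_samples` holds the first 15 characters of every barcode that ERCC2mutSamples.txt lists with a missense mutation; how that set is extracted from the file is left out because the file's column layout is not shown here. The function names and the RNA-seq barcodes in the example are hypothetical. Class names are written in order of first appearance so that the numeric labels on the third line match the names on the second line.

```python
TISSUE = {"01": "tumor", "11": "normal"}  # sample type codes used in this recipe


def classify(barcode: str, missense_samples: set) -> str:
    """Assign one of the four recipe classes to an RNA-seq barcode."""
    tissue = TISSUE[barcode.split("-")[3][:2]]
    status = "mutated" if barcode[:15] in missense_samples else "wildtype"
    return f"{status}_{tissue}"


def to_cls(labels: list) -> str:
    """Serialize per-sample class labels as CLS text (three lines)."""
    classes = list(dict.fromkeys(labels))  # unique names, first-appearance order
    index = {name: i for i, name in enumerate(classes)}
    lines = [
        f"{len(labels)} {len(classes)} 1",
        "# " + " ".join(classes),
        " ".join(str(index[label]) for label in labels),
    ]
    return "\n".join(lines) + "\n"


# Hypothetical input: one missense-mutated tumor sample is known.
missense = {"TCGA-4Z-AA84-01"}
barcodes = [
    "TCGA-ZZ-0000-01A-11R-A39I-07",
    "TCGA-4Z-AA84-01A-11R-A39I-07",
    "TCGA-ZZ-0000-11A-11R-A39I-07",
]
print(to_cls([classify(b, missense) for b in barcodes]), end="")
```

Samples with only silent mutations fall into the wildtype classes simply by being absent from `missense_samples`, which mirrors the criteria listed above.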
(Optional) Save the file back to GenomeSpace.
Go to the Files tab.
Click BLCA.rnaseq.processed.cls, and choose Save to GenomeSpace.
Click Save.

The completion of this recipe should result in two complementary files: BLCA.rnaseq.processed.gct, and the accompanying BLCA.rnaseq.processed.cls. The GCT file will contain gene-level RSEM values for the BLCA samples; the CLS file will define the phenotype labels (e.g. normal tissue vs. tumor) for each sample in the GCT file.
These two files can be used together to complete other types of analyses. In particular, to recapitulate the research shown by Kim et al. (Nature Genetics, 2016), the best course of action is to compare the RSEM values for each of the phenotypes, in order to identify what the overall gene expression profile looks like for samples that have ERCC2 mutations. By comparing normal and tumor tissues that do or do not have the ERCC2 mutations, a mutational signature can be determined.
Using the processed, normalized gene-level expression data, there are several follow-up analyses to consider: