Galaxy gp genomica

Build and visualize a module network using putative aberrant regions and expression data

Added by GenomeSpaceTeam on 2015.04.21
Last updated over 3 years ago.


Summary

Which genes lie in my copy number variation regions? Are there any sets of co-regulated genes in these aberrant regions?

This recipe provides a method for identifying and visualizing a network of co-regulated genes that are associated with aberrant regions identified by single nucleotide polymorphism (SNP) arrays. An example use of this recipe is a case where an investigator may want to find which genes are located in regions that exhibit significant changes (e.g. amplification or deletion) in cancer cells.

 

This recipe provides one method for identifying and visualizing aberrant regions in Diffuse Large B-Cell Lymphoma (DLBCL) cancer cells. It uses copy-number variation (CNV) data from SNP arrays and evaluates the expression of genes in the aberrant regions using a microarray dataset. Regions that are significantly changed (e.g., amplified or deleted) in cancer cells are defined by the GISTIC algorithm. In particular, this recipe uses several Galaxy tools to find the overlap between the aberrant regions and reference genes, and uses GenePattern to process the microarray dataset. Genomica is then used to find module networks of co-regulated genes associated with these aberrant regions. A module network is a model that identifies regulatory modules from gene expression data, especially modules of co-regulated genes and their regulators. The model also identifies the conditions under which the regulation can occur.

Why analyze copy number variation regions? Copy number variations (CNVs) are large alterations to genomes, such as duplication or deletion of large segments of a chromosome. These variations in the genome have been associated with different conditions, such as cancer. In this recipe, we explore the scenario in which CNVs are elevated in a cancer cell line, and our goal is to determine the function of these duplicated genes.

Inputs

To complete this recipe, we will need a file describing the locations of CNV regions in a specific condition. In this example, we use CNV regions identified in DLBCL cancer cell lines using the GISTIC algorithm, which identifies regions of a genome that are significantly amplified or deleted across a set of samples. For this particular recipe, the GISTIC file will need to have the .BED extension. In addition, we will need the accompanying gene expression dataset to evaluate the expression of these aberrant genes. We will need the following datasets, which can be downloaded from the following GenomeSpace Public folders:

Public  >   RecipeData  >   GenomicFeatureData  >  GISTIC_regions.bed: This file lists CNV regions in the genome, organized by chromosome, and listing start and end positions in the genome for each CNV region.

Public  >   RecipeData  >   ExpressionData  >  mrna_orig.gct: This is a GCT file of the microarray data which accompanies the DLBCL SNP array data.
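
For readers who want to inspect the two input files outside the GenomeSpace tools, the following is a minimal Python sketch of how they could be read. It assumes the BED file is tab-delimited with chromosome, start, and end in the first three columns, and that the GCT file follows the standard GCT 1.2 layout (a `#1.2` line, a dimensions line, then a `Name`/`Description` header). The function names `read_bed` and `read_gct` are hypothetical and are not part of the recipe.

```python
import io


def read_bed(handle):
    """Return (chrom, start, end) tuples from a tab-delimited BED-like file."""
    regions = []
    for line in handle:
        line = line.rstrip("\n")
        # Skip blank lines and the optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        regions.append((fields[0], int(fields[1]), int(fields[2])))
    return regions


def read_gct(handle):
    """Return (sample_names, {row_name: [values]}) from a GCT 1.2 file."""
    version = handle.readline().strip()
    if version != "#1.2":
        raise ValueError("expected a GCT 1.2 file, got %r" % version)
    handle.readline()  # dimensions line (rows, columns); not needed here
    header = handle.readline().rstrip("\n").split("\t")
    samples = header[2:]  # columns after Name and Description
    rows = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        rows[fields[0]] = [float(v) for v in fields[2:]]
    return samples, rows


# Tiny made-up inputs, for illustration only.
demo_bed = io.StringIO("chr1\t100\t200\nchr2\t5\t50\n")
demo_gct = io.StringIO("#1.2\n1\t2\nName\tDescription\tS1\tS2\n5205\tna\t1.5\t2.0\n")
print(read_bed(demo_bed))
print(read_gct(demo_gct))
```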

Outputs

Recipe steps

  • Galaxy
    1. Loading data into Galaxy
    2. Obtaining a reference genome using Galaxy
    3. Finding genes in CNV regions
    4. Extracting information from files
    5. Removing duplicate Entrez Gene IDs
    6. Downloading the gene list to GenomeSpace
  • GenePattern
    1. Loading data
    2. Filtering genes by expression value
    3. Saving the files to GenomeSpace
  • Genomica
    1. Loading files into Genomica
    2. Creating a module network
    3. Identifying modules of differentially expressed genes

1: Loading data into Galaxy

NOTE: If you have not yet associated your GenomeSpace account with your Galaxy account, you will be asked to do so. If you do not yet have a Galaxy account, you can automatically generate a new account that will be associated with your GenomeSpace account.

  1. Open Galaxy from GenomeSpace, navigate to the Get Data tool, then click on GenomeSpace import from file browser, then navigate to the example data file: Public  >   RecipeData  >   GenomicFeatureData  >  GISTIC_regions.bed  (or to your personal directory)
  2. Once GISTIC_regions.bed is loaded into Galaxy, click the pencil icon to edit the attributes. Change the attributes to the following parameters:
    1. Database: Human Feb. 2009 (GRCh37/hg19) (hg19)
    2. Ensure that the chrom, start, and end attributes are pointing to the correct columns in the GISTIC file.

Alternatively, launch Galaxy on the file directly from GenomeSpace in one of two ways:

1. Click on the file (e.g., GISTIC_regions.bed) in GenomeSpace, then use the Galaxy context menu and click Launch on File.

OR

2. Click on the file (e.g., GISTIC_regions.bed) in GenomeSpace, then drag it to the Galaxy icon to launch.


2: Obtaining a reference genome using Galaxy

  1. Navigate to the following menu: Get Data > UCSC Main
  2. In the dialog box, enter the following parameters:
    1. clade: mammal
    2. genome: Human
    3. assembly: Feb. 2009 (GRCh37/hg19)
    4. group: Genes and Gene Predictions
    5. track: UCSC genes
    6. table: knownGene
    7. region: genome
    8. output format: selected fields from primary and related tables
    9. send to: Galaxy

  3. Click get output. This will load a new page from which you can select specific annotations. Change the following parameters:
    1. Scroll down the page. Under Linked Tables, make sure hg19: knownToLocusLink is checked.
    2. Click allow selection from checked tables to update the page.
    3. Under Select Fields from hg19.knownGene, make sure the following parameters are checked:
      1. chrom: Reference sequence chromosome or scaffold
      2. cdsStart: Coding region start
      3. cdsEnd: Coding region end
    4. Under the hg19.knownToLocusLink fields, make sure the following parameters are checked:
      1. value: Entrez Gene ID (formerly known as LocusLink)
    5. Click done with selection to load the final output page.

    6. Click Send query to Galaxy to run the job. This will generate a new file of reference gene annotations.
    7. When the job has finished running, click the pencil icon to edit the attributes. Change the attributes to the following parameters:
      1. Under the Datatype tab, set New Type: interval OR bed
      2. Under the Attributes tab, set Chrom column: 1
      3. Start column: 2
      4. End column: 3
      5. Check the Name/Identifier column (click box & select) parameter, and set it equal to 4. This will set the identifier values to the Entrez IDs.
      6. Click Save to save the new attributes. It may take some time to save the new attributes.

3: Finding genes in CNV regions

  1. Navigate to the following menu: Operate on Genomic Intervals > Intersect
  2. In the dialog box, enter the following parameters:
    1. Return: Overlapping intervals
    2. of: the reference genome annotations, e.g., from the UCSC Main on Human: knownGene (genome) job.
      NOTE: Make sure that the first file listed is the reference genome annotation file, as it contains the Entrez ID annotations for the gene regions, which we will need to extract later. If you make this file the second file, the annotations will be discarded.
    3. that intersect: the GenomeSpace import on GISTIC_regions.bed
    4. for at least: 1
  3. Click Execute to submit your job. This will generate a processed BED file.
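
The Intersect step above can be sketched in plain Python to make its logic explicit. This is only an illustration under stated assumptions: coordinates are treated as half-open BED-style intervals, the function name `overlapping_intervals` is hypothetical, and the example intervals are made up. As in the Galaxy step, whole intervals from the first dataset (the gene annotations) are returned, so the Entrez ID in the fourth column is kept.

```python
def overlapping_intervals(genes, regions, min_overlap=1):
    """Return gene intervals that overlap any region by at least min_overlap bases.

    genes:   iterable of (chrom, start, end, gene_id)
    regions: iterable of (chrom, start, end)
    """
    # Index the aberrant regions by chromosome for quicker lookup.
    by_chrom = {}
    for chrom, start, end in regions:
        by_chrom.setdefault(chrom, []).append((start, end))

    hits = []
    for chrom, start, end, gene_id in genes:
        for r_start, r_end in by_chrom.get(chrom, []):
            overlap = min(end, r_end) - max(start, r_start)
            if overlap >= min_overlap:
                # Keep the whole gene interval, including its identifier.
                hits.append((chrom, start, end, gene_id))
                break
    return hits


# Made-up example data, for illustration only.
example_genes = [
    ("chr1", 100, 200, "111"),
    ("chr1", 500, 600, "222"),
    ("chr2", 100, 200, "333"),
]
example_regions = [("chr1", 150, 300), ("chr2", 200, 400)]
print(overlapping_intervals(example_genes, example_regions))
```

The linear scan per chromosome is fine for a few hundred regions; a sorted sweep or interval tree would be the usual choice for larger inputs.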

4: Extracting information from files

NOTE: There are two versions of a "cut" tool in Galaxy. In this example we use the first version, also known as "Cut columns from a table". The second, similarly named tool is called "Cut columns from a table (cut)". Please make sure to match the parameters described in the screenshots to the tool parameters you see in Galaxy.

  1. Navigate to the following menu: Text Manipulation > Cut
  2. In the dialog box, enter the following parameters:
    1. Cut columns: c4
    2. Delimited by: tab
    3. From: the intersect output (e.g., the intersect of GISTIC_regions.bed and the reference gene annotations).
  3. Click Execute to submit your job. This will generate a processed TXT file. However, this file has duplicate entries which will be removed in the next step.
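
As a rough equivalent of the Cut step, the sketch below pulls the fourth tab-delimited column (the Entrez Gene ID, under the column layout set earlier in this recipe) out of each line. The function name `cut_column` and the sample lines are hypothetical.

```python
def cut_column(lines, column=4, delimiter="\t"):
    """Return the given 1-based column from each delimited line."""
    index = column - 1
    values = []
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        # Emit an empty string when a line has too few columns.
        values.append(fields[index] if index < len(fields) else "")
    return values


# Made-up intersect output: chrom, start, end, Entrez Gene ID.
example_lines = ["chr1\t100\t200\t5205\n", "chr1\t150\t260\t5205\n", "chr2\t10\t90\t825\n"]
print(cut_column(example_lines))
```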

5: Removing duplicate Entrez Gene IDs

  1. Navigate to the following menu: Join, Subtract and Group > Group
  2. In the dialog box, enter the following parameters:
    1. Select data: the list of genes from the Cut job
    2. Group by column: column 1
  3. Click Execute to submit your job. This will generate a processed TXT file of only unique Entrez IDs.
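
The effect of grouping on column 1 here is simply to collapse repeated IDs. A minimal Python sketch of that deduplication follows; the function name `unique_ids` is hypothetical, and the output order (first occurrence) is an assumption that may differ from the order Galaxy produces.

```python
def unique_ids(ids):
    """Return the distinct, non-empty IDs in order of first appearance."""
    cleaned = (value.strip() for value in ids)
    # dict.fromkeys keeps insertion order while dropping repeats.
    return list(dict.fromkeys(value for value in cleaned if value))


print(unique_ids(["5205", "825", "5205", "1455"]))
```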

6: Downloading the gene list to GenomeSpace

  1. Navigate to the following menu: Send Data > GenomeSpace Exporter
  2. In the dialog box, enter the following parameters:
    1. Send this dataset to GenomeSpace: the list of Entrez Gene IDs with duplicates removed
    2. Choose target directory: navigate to your GenomeSpace directory
    3. filename: choose a filename, e.g., GISTIC_genes.txt
  3. Click Execute to submit your job. This will send your data to GenomeSpace.

  4. OPTIONAL: close Galaxy.

7: Loading data

NOTE: If you have not yet associated your GenomeSpace account with your GenePattern account, you will be asked to do so. If you do not yet have a GenePattern account, you can automatically generate a new account that will be associated with your GenomeSpace account.

  1. Open GenePattern from GenomeSpace, navigate to the GenomeSpace tab, then navigate to the example data file: Public  >   RecipeData  >   ExpressionData  >  mrna_orig.gct (or to your personal directory).

Alternatively, launch GenePattern on the file directly from GenomeSpace in one of two ways:

1. Click on the file (e.g., mrna_orig.gct) in GenomeSpace, then use the GenePattern context menu and click Launch on File.

OR

2. Click on the file (e.g., mrna_orig.gct) in GenomeSpace, then drag it to the GenePattern icon to launch.


8: Filtering genes by expression value

  1. Change to the Modules tab, and search for "PreprocessDataset".
  2. When the module is loaded, change the following parameters:
    1. input filename: load the GCT file, e.g., mrna_orig.gct. To do this, navigate to the GenomeSpace tab, then to the folder containing the GCT file. Load the file into the input filename parameter by dragging it into the input filename box.

    2. floor: 0

    3. min fold change: 1
    4. min delta: 0

    5. row normalization: yes

  3. Click Run to run PreprocessDataset. This will generate a processed GCT file.
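
To give a feel for what these parameter values do, here is a hedged Python sketch of a floor, variation filter, and row normalization. It is not the PreprocessDataset implementation: it assumes the variation filter drops a row when max/min is below the minimum fold change or max - min is below the minimum delta, and that row normalization rescales each row to mean 0 and (population) standard deviation 1. The function name `preprocess` is hypothetical.

```python
from statistics import mean, pstdev


def preprocess(rows, floor=0.0, min_fold_change=1.0, min_delta=0.0, row_normalize=True):
    """Floor, filter, and optionally normalize {gene: [values]} rows."""
    kept = {}
    for gene, values in rows.items():
        # Raise every value below the floor up to the floor.
        floored = [max(v, floor) for v in values]
        lo, hi = min(floored), max(floored)

        # Fold change is max/min; guard against a non-positive minimum.
        if lo > 0:
            fold = hi / lo
        else:
            fold = float("inf") if hi > lo else 1.0

        # Assumed variation filter: drop rows that vary too little.
        if fold < min_fold_change or (hi - lo) < min_delta:
            continue

        if row_normalize:
            mu, sd = mean(floored), pstdev(floored)
            floored = [(v - mu) / sd if sd > 0 else 0.0 for v in floored]
        kept[gene] = floored
    return kept


# Made-up expression rows, for illustration only.
example_rows = {"A": [-5.0, 10.0, 20.0], "B": [3.0, 3.0, 3.0]}
print(preprocess(example_rows))
```

Under these assumptions, a minimum fold change of 1 and a minimum delta of 0 let every row through, so with the settings used in this recipe the visible effects are the floor and the row normalization.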

9: Saving the files to GenomeSpace

  1. From the job processing view, click the context menu (blue arrow) next to the dataset (e.g., mrna_orig.preprocessed.gct), then choose Save to GenomeSpace. Save the file to your folder.
  2. OPTIONAL: close GenePattern.

  1. Alternatively, from the Modules and Pipeline start page, navigate to Jobs. Click on the file, then choose Send to GenomeSpace. Save the file to your folder.

10: Loading files into Genomica

NOTE: Genomica requires a Java security exception. In order to open the Genomica JNLP, add https://genie.weizmann.ac.il to your Java exception site list. For more information on how to do so, go HERE

  1. Launch Genomica from GenomeSpace by clicking on the Genomica icon. This will prompt the download of a .jnlp file. Double-click the file to launch Genomica.
  2. Navigate to the following menu: GenomeSpace > Open From GenomeSpace.... This will generate a small window with a file menu. Navigate to the location of the mrna_orig.preprocessed.gct file.
  3. Choose the file and click Select to load the file. It may take several seconds to load the file. When the file is loaded, a heatmap should appear.

11: Creating a module network

  1. Navigate to the following menu: Algorithms > Create a Module Network....
  2. When the dialog box appears, change the following parameters:
    1. Under the General tab, set Max number of iterations to 5.
    2. Under the Regulation tab, load the GISTIC_genes.txt file into the Candidate regulator genes parameter. To do this, click the GenomeSpace Load... button, navigate to the folder containing your GISTIC_genes.txt file, click it, and click Select.
  3. Click Run to run the algorithm. A new dialog box will appear, showing the progress of the algorithm. This may take several minutes to run.

NOTE: Depending on how many modules you are interested in finding, you can change the Max number of modules parameter to a different number. In this example, we limit our regulator gene set to the 95 genes extracted using Galaxy. We then look for a maximum of 50 modules whose gene expression can be categorized by the expression of these candidate regulator genes. To save time, we only run 5 iterations to identify modules.

12: Identifying modules of differentially expressed genes

  1. To view and browse the modules of differentially expressed genes, you can try the different data views in Genomica:
    • Cluster: A view of the raw data as arranged in a heatmap. The heatmap view can change depending on which tree branches are selected in the Tree view, and which modules are selected in the BirdsEye view.
    • Tree: A view of the data organized into a tree.
    • BirdsEye: This re-arranges the heatmap into strips of modules (horizontally). For each module, samples are sorted by the predicted regulation program into smaller expression profiles. Clicking on a module will re-arrange the Cluster heatmap view, so that the samples highlighted are now the left-most samples. A blue border will appear around the selected dataset in the BirdsEye view, and the samples will be labeled in blue in the Cluster view.
  2. To more clearly view the regulation program that defines each module, you can display the gene regulation on the tree using the following steps:
    1. Click on the Cluster view, then under the Sort Experiments option, choose By Descendents from the drop-down menu.
    2. Navigate to View > Cluster, and make sure Experiments Tree is checked.
    3. Finally, to view the gene regulation program, click on a module in the Cluster view, then click the button in the left-side menu to move upward in the regulation program.

Results Interpretation

This is an example interpretation of the results from this recipe.

First, we identified the overlap between reference gene annotations (Entrez ID format) and the aberrant regions in cancer cells, using Galaxy. This resulted in a list of annotated genes that are located in the aberrant regions associated with cancer. In this example, we find roughly 650 genes. Independently, we processed and normalized the microarray dataset so that the expression of genes associated with the aberrant regions could be observed. In this example, the microarray has roughly 4000 genes. Using Genomica, we view the expression of the genes identified in aberrant regions, and try to find modules of similarly regulated genes, based on the list of candidate regulator genes from Galaxy. Roughly 100 of the genes identified with Galaxy are also present in the microarray; these are our 'candidate regulator genes'.
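
The narrowing from the Galaxy gene list to the candidate regulators is a set intersection, which can be sketched as follows. This assumes the row names in the expression dataset are Entrez Gene IDs in the same form as the Galaxy list; the function name `candidate_regulators` and the example IDs are for illustration only.

```python
def candidate_regulators(region_gene_ids, expression_gene_ids):
    """Return region genes that also appear in the expression dataset."""
    on_array = set(expression_gene_ids)
    # Preserve the order of the region gene list and drop repeats.
    return [g for g in dict.fromkeys(region_gene_ids) if g in on_array]


region_genes = ["5205", "825", "9999", "1455", "825"]
array_genes = ["1455", "5205", "825", "42"]
print(candidate_regulators(region_genes, array_genes))
```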


To view the regulatory program of a specific module, click on a part of the cluster heatmap and then use the triangle button in the left-side menu to view up-stream regulators of expression. In this example, we view one of the clusters and observe the regulatory program:

In the left panel is the BirdsEye view, which displays all the modules found by the algorithm. Each module is separated by horizontal yellow lines, and the splits within a module are separated by vertical yellow lines. Splits indicate that regulation of an up-stream gene changes from one group of arrays to the next. In this example, the module we are examining is highlighted in blue; note that only one part of the module is highlighted (we are examining only a few splits within the module). The right panel is the Cluster view of the microarray, organized according to the BirdsEye view. For example, the clusters we are examining (highlighted in blue) have been moved to the left side of the Cluster view, even though they are the rightmost clusters in the BirdsEye view. The top part of the second panel indicates the gene regulation program; the middle indicates the arrays grouped according to expression, and the bottom indicates the actual expression itself (as a microarray).

The regulation program illustrates which genes are thought to be associated with the regulation observed in the microarrays. Gene names are displayed along arrows indicating how the gene is regulated in different modules. Arrays are sorted to the left and right of the split by a boolean answer to the question, "Is GeneA up-regulated?" Arrays in which the gene is up-regulated fall on the right-hand side; all other arrays fall on the left-hand side. Let's examine the regulation program more closely. We see that Gene 5205 is up-regulated in Groups 2, 3, and 4, but not in Group 1. Gene 825 is up-regulated in Group 4, but not in Groups 2 and 3. Finally, Gene 1455 is up-regulated in Group 3, but not in Group 2. Therefore, we could say that Group 2 is associated with up-regulation of Gene 5205, but no up-regulation of Genes 825 or 1455.
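
The example regulation program can be read as a small decision tree. The sketch below encodes it in Python, assuming the splits are nested in the order described (Gene 5205 first, then 825, then 1455) and that each array already carries a yes/no call for whether a gene is up-regulated. How that call is made (for example, a threshold on normalized expression) is not specified here, and the function name `assign_group` is hypothetical.

```python
def assign_group(up):
    """Map per-gene up-regulation calls for one array to a group number.

    up: dict from Entrez Gene ID (str) to bool, True if up-regulated.
    """
    if not up["5205"]:
        return 1  # Gene 5205 not up-regulated
    if up["825"]:
        return 4  # 5205 and 825 up-regulated
    if up["1455"]:
        return 3  # 5205 and 1455 up-regulated, 825 not
    return 2      # only 5205 up-regulated


print(assign_group({"5205": True, "825": False, "1455": False}))
```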

However, the results in this example are just one interpretation and are only a simple representation of possible results.


Posted by AlexTall on April 03, 2016 07:51

Hi everyone, I have some trouble in "10: Loading files into Genomica". When I launched Genomica and chose "Open From GenomeSpace", a new window did not appear for choosing the "mrna_orig.preprocessed.gct" file. Can anyone help me please?

Posted by sgaramsz on April 05, 2016 13:51

Hi AlexTall, thank you for contacting us! I just gave this a try and was not able to replicate the issue. For me there is a small pop-up window that gives me access to my GenomeSpace files. Please email us at gs-help[at]broadinstitute.org, so that we can open a ticket and help troubleshoot this issue.
