
Identify enriched biological functions in single-nucleotide polymorphisms (SNPs)

Added by GenomeSpaceTeam on 2015.05.06
Last updated over 1 year ago.


Summary

How are SNP-related genes regulated in an expression dataset? Are these genes enriched for particular biological functions?

This recipe provides one method for identifying enriched biological functions among genes associated with single-nucleotide polymorphisms (SNPs). For example, an investigator who has completed a genome-wide association study (GWAS) may want to know which genes lie at the genomic coordinates of the associated SNPs, and whether those genes share particular biological functions.

In this example, we imagine a scenario in which an investigator completes a GWAS, obtaining a list of genomic coordinates that are associated with SNPs. However, simply knowing these genomic coordinates is not always informative; the investigator also wants to know which genes lie in these regions, and what kinds of biological functions these genes may have. We are interested in answering two questions:

  1. Are the genes near these SNPs enriched with particular biological functions?
  2. Are the genes near these SNPs significantly up- or down-regulated in some other biological conditions, such as cancer?

To answer the first question, we will find Gene Ontology functional annotations using Genomica. To answer the second question, we will use the Gene Set Enrichment Analysis (GSEA) module in GenePattern, comparing the SNP-associated genes to a gene expression dataset from a study of epithelial cancer stem cells. This study evaluated the ability of oncogenes to activate an embryonic stem cell program in differentiated adult tumor cells, by transforming human keratinocytes into squamous cell carcinomas using oncogenic Ras and IκBα, plus one of three genes: c-Myc, E2F3, or GFP (Wong et al. 2009. Cell Stem Cell). Comparisons between these three genes showed that c-Myc could re-activate the embryonic stem cell program. Comparing the SNP-associated genes to this gene expression dataset can determine whether this gene set is differentially regulated in c-Myc samples, when compared to samples expressing the other genes, GFP and E2F3.

Inputs

To complete this recipe, we will first need a file describing the genomic coordinates of SNPs, as well as a gene expression dataset against which to compare genes associated with SNPs. In addition, to complete functional annotation in Genomica, we will need special files of the human Gene Ontology (GO) annotations. All of the following datasets can be downloaded from the GenomeSpace Public folder:

Public > RecipeData > GenomicFeatureData > SNPanalysis_data.bed: A file of the genomic coordinates of SNPs from a GWAS study. The SNPs are listed in a BED file format, with a chromosome annotation and starting and ending positions for coding regions which have SNPs.
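
As a rough illustration of the BED layout described above, the following Python sketch reads the first three tab-separated columns (chromosome, start, end). The function name parse_bed is hypothetical, and the sample rows are invented for illustration; they are not taken from SNPanalysis_data.bed.

```python
from typing import List, Tuple

def parse_bed(text: str) -> List[Tuple[str, int, int]]:
    """Parse BED-formatted text into (chrom, start, end) tuples."""
    regions = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and optional header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        regions.append((chrom, start, end))
    return regions

# Invented example rows (assumption: not real data from the recipe).
example = "chr1\t1000\t2000\nchr2\t500\t900\n"
regions = parse_bed(example)
print(regions)
```

Any additional BED columns (name, score, strand, and so on) are simply ignored by this sketch.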

 

Gene Expression Dataset:

Public > RecipeData > ExpressionData > GSE10423 > GSE10423.gct: A gene expression dataset of human keratinocytes transformed by Ras, IκBα, and one of three genes: c-Myc, E2F3, or GFP. The file is available in GenePattern's GCT format.

Public > RecipeData > ExpressionData > GSE10423 > GSE10423.cls: A class file containing assignments for all the samples in the GCT file (+Myc or -Myc). The file is in GenePattern CLS format.

Public > RecipeData > ExpressionData > GSE10423 > GSE10423.chip: A file containing the microarray chip information for this gene expression dataset.
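
For readers unfamiliar with the GCT and CLS layouts, here is a minimal Python sketch of how the two files fit together. It assumes the standard GCT 1.2 layout (a version line, a dimensions line, a header row, then one row per gene) and the categorical CLS layout (a counts line, a class-name line, a label line). The function names and the tiny example contents are invented for illustration and are not taken from GSE10423.

```python
def parse_gct(text):
    """Return (sample_names, {gene_name: [values]}) from GCT 1.2 text."""
    lines = text.strip().splitlines()
    if not lines[0].startswith("#1.2"):
        raise ValueError("expected a GCT file starting with #1.2")
    n_rows = int(lines[1].split()[0])   # number of data rows
    header = lines[2].split("\t")       # Name, Description, then sample names
    samples = header[2:]
    data = {}
    for line in lines[3:3 + n_rows]:
        fields = line.split("\t")
        data[fields[0]] = [float(v) for v in fields[2:]]
    return samples, data

def parse_cls(text):
    """Return (class_names, per_sample_labels) from categorical CLS text."""
    lines = text.strip().splitlines()
    class_names = lines[1].lstrip("#").split()  # second line: "# name0 name1"
    labels = lines[2].split()                   # third line: one label per sample
    return class_names, labels

# Invented example contents (assumption: not the real dataset).
gct_text = ("#1.2\n2\t3\nName\tDescription\tS1\tS2\tS3\n"
            "GENE_A\tna\t1.0\t2.0\t3.0\nGENE_B\tna\t4.0\t5.0\t6.0\n")
cls_text = "3 2 1\n# plusMyc minusMyc\n0 0 1\n"

samples, data = parse_gct(gct_text)
class_names, labels = parse_cls(cls_text)
print(samples, class_names, labels)
```

The labels in the CLS file are listed in the same order as the sample columns in the GCT file, which is how GSEA knows which phenotype each sample belongs to.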

 

Genomica Database Files:

Public > RecipeData > Tool_Genomica > human_GO.gxa: A file containing the annotations of genes by pathways, functions, and biological processes from the Gene Ontology.

Public > RecipeData > Tool_Genomica > Human_RefSeq_skeleton_expression.tab: A file containing RefSeq IDs for parsing in Genomica.

Outputs

Recipe steps

  • Galaxy
    1. Loading data into Galaxy
    2. Obtaining a reference genome using Galaxy
    3. Finding genes in SNP regions
  • GenePattern
    1. Perform Gene Set Enrichment Analysis
  • Genomica
    1. Saving Genomica files locally
    2. Running gene set analysis in Genomica

Loading data into Galaxy

NOTE: If you have not yet associated your GenomeSpace account with your Galaxy account, you will be asked to do so. If you do not yet have a Galaxy account, you can automatically generate a new account that will be associated with your GenomeSpace account.

  1. Click on the Galaxy icon to launch the tool.
  2. Navigate to the following menu: Get Data > GenomeSpace import.
  3. Select the SNPanalysis_data.bed file, which can be found under the following directory: Public > RecipeData > GenomicFeatureData.
  4. Click Send to Galaxy.

Obtaining a reference genome using Galaxy

  1. Navigate to the following menu: Get Data > UCSC Main
  2. In the dialog box, enter the following parameters:
    1. clade: Mammal
    2. genome: Human
    3. assembly: Feb. 2009 (GRCh37/hg19)
    4. group: Genes and Gene Predictions
    5. track: UCSC genes
    6. table: knownGene
    7. region: genome
    8. output format: selected fields from primary and related tables
    9. send to: Galaxy
  3. Click get output. This will load a new page from which you can select specific annotations.

  4. Under Linked Tables, make sure the hg19: kgXref is checked.
  5. Click allow selection from checked tables to update the page, then change the following parameters:
    1. Under Select Fields from hg19.knownGene, make sure the following parameters are checked:
      1. chrom: Reference sequence chromosome or scaffold
      2. cdsStart: Coding region start
      3. cdsEnd: Coding region end
    2. Under hg19.kgXref, make sure the following parameters are checked:
      1. geneSymbol: Gene Symbol
      2. refseq: RefSeq ID
    3. Click done with selection to load the final output page.

  6. Click Send query to Galaxy to run the job. This will generate a new file of reference gene annotations.

Finding genes in SNP regions

  1. Click on the following link: Official GenomeSpace Galaxy Workflow: Identify Enriched Biological Functions in SNPs.
  2. Click the icon in the upper right corner to import the workflow.
  3. Click start using this workflow.
  4. Click on the workflow drop-down menu (e.g., imported: Identify Enriched Biological Functions in SNPs), then choose Run.

  5. Load the files into the correct fields. The input fields should have annotation indicating which file should be loaded:
    1. Step 1: Input Dataset (UCSC Reference Genome Annotations): UCSC Main on Human: knownGene (genome)
    2. Step 2: Input Dataset (SNP Genomic Coordinates, e.g. SNPanalysis_data.bed): GenomeSpace import on SNPanalysis_data.bed

  6. Click Run workflow.
  7. Once the workflow has finished running, save the output of the first cut operation to GenomeSpace by navigating to the following menu: Send Data > GenomeSpace Exporter. Change the following parameters:
    1. Send this dataset to GenomeSpace: 4: Cut on data 3, the output file of Gene Symbols, e.g. "ACTB".
    2. Choose Target Directory: choose the directory to save the file in
    3. filename: SNPanalysis_GeneSymbol.grp
    4. Click Execute.

  8. Save the output of the second cut operation by navigating again to: Send Data > GenomeSpace Exporter. Change the following parameters:
    1. Send this dataset to GenomeSpace: 5: Cut on data 3, the output file of RefSeq IDs, e.g. "NM_123456789".
    2. Choose Target Directory: choose the directory to save the file in
    3. filename: SNPanalysis_RefSeq.gse
    4. Click Execute.
  9. Optional: close Galaxy.
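
The source does not describe what the Galaxy workflow does internally. As a hedged sketch of the general idea, the Python below assumes the core operation is an interval overlap between SNP regions and gene coding regions, followed by writing the matching gene symbols one per line (the layout of a GRP gene set file). All names and coordinates here are invented for illustration.

```python
def genes_overlapping(snp_regions, genes):
    """Return sorted gene symbols whose coding region overlaps any SNP region.

    snp_regions: iterable of (chrom, start, end)
    genes: iterable of (chrom, cds_start, cds_end, symbol)
    """
    hits = set()
    for g_chrom, g_start, g_end, symbol in genes:
        for s_chrom, s_start, s_end in snp_regions:
            # Two half-open intervals overlap when each starts before the other ends.
            if g_chrom == s_chrom and g_start < s_end and s_start < g_end:
                hits.add(symbol)
                break
    return sorted(hits)

# Invented coordinates and symbols (assumption: not real recipe data).
snps = [("chr1", 150, 160), ("chr2", 10, 20)]
genes = [
    ("chr1", 100, 200, "GENE_A"),
    ("chr1", 300, 400, "GENE_B"),
    ("chr2", 15, 50, "GENE_C"),
]

symbols = genes_overlapping(snps, genes)
grp_text = "\n".join(symbols) + "\n"  # GRP layout: one identifier per line
print(grp_text)
```

The nested loop is quadratic and only suitable for small inputs; the actual workflow runs inside Galaxy and may treat overlaps and duplicates differently.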

Perform Gene Set Enrichment Analysis

NOTE: If you have not yet associated your GenomeSpace account with your GenePattern account, you will be asked to do so. If you do not yet have a GenePattern account, you can automatically generate a new account that will be associated with your GenomeSpace account.

  1. Click on the GenePattern icon to launch the tool.
  2. Change to the Modules tab, and search for "GSEA". Change to the GenomeSpace tab to navigate to the directory containing the GSE10423 gene expression dataset files. Once the module is loaded, change the following parameters:
    1. expression dataset: GSE10423.gct, found in the following directory: Public > RecipeData > ExpressionData > GSE10423.
    2. gene sets database file: SNPanalysis_GeneSymbol.grp
    3. phenotype labels: GSE10423.cls, found in the following directory: Public > RecipeData > ExpressionData > GSE10423.
    4. chip platform file:
      1. Click the Upload your own file button, which will create a new drag-and-drop input box.
      2. Add the GSE10423.chip file, found in the following directory: Public > RecipeData > ExpressionData > GSE10423.
  3. Click the button to run GSEA.
  4. Once the job has finished running, the result will be a .ZIP file containing the report from GSEA. The report will include HTML files, PNG images, etc. You can save the resulting file back to GenomeSpace:
    1. Click on the ZIP file (e.g. GSE10423.zip), and choose Save to GenomeSpace.
    2. Navigate to a directory of your choice and choose Save.
    3. Un-zip the ZIP file by right-clicking on the file, then choosing Expand Archive. This should create a new folder in GenomeSpace that you can access, containing all the GSEA results.
    To interpret the results of the GSEA, please continue to the section called Results Interpretation.
  5. Optional: close GenePattern.
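
To give a feel for what the GSEA module computes, the sketch below implements a simplified version of the weighted running-sum enrichment score: walking down a ranked gene list, the score rises at genes in the set (in proportion to their ranking metric) and falls at genes outside it, and the enrichment score is the largest deviation from zero. This is an illustration only, not the GenePattern implementation; it omits the permutation test and normalization, and the gene names and scores are invented.

```python
def enrichment_score(ranked, gene_set, p=1.0):
    """Simplified GSEA-style enrichment score.

    ranked: list of (gene, score) pairs sorted by score, highest first.
    gene_set: set of gene names to test.
    p: exponent weighting each hit by |score| ** p.
    """
    in_set = [gene in gene_set for gene, _ in ranked]
    n_miss = len(ranked) - sum(in_set)
    hit_total = sum(abs(s) ** p for (_, s), hit in zip(ranked, in_set) if hit)
    if hit_total == 0 or n_miss == 0:
        raise ValueError("need at least one weighted hit and one miss")

    running, best = 0.0, 0.0
    for (_, score), hit in zip(ranked, in_set):
        if hit:
            running += abs(score) ** p / hit_total  # step up at a set member
        else:
            running -= 1.0 / n_miss                 # step down otherwise
        if abs(running) > abs(best):
            best = running                          # keep the largest deviation
    return best

# Invented ranked list (assumption: scores stand in for a ranking metric).
ranked = [("A", 3.0), ("B", 2.0), ("C", 1.0), ("D", -1.0), ("E", -2.0)]
es_top = enrichment_score(ranked, {"A", "B"})
es_bottom = enrichment_score(ranked, {"D", "E"})
print(es_top, es_bottom)
```

A positive score means the gene set is concentrated near the top of the ranked list (up-regulated in the first phenotype), and a negative score means it is concentrated near the bottom.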

Saving Genomica files locally

First, we must save files to the local drive in order for Genomica to run the multiple gene set analysis.

  1. Download the two Genomica database files listed under Inputs from the following directory: Public > RecipeData > Tool_Genomica.
  2. Once these two files are downloaded to your local computer, navigate to your Genomica home directory on your local computer:
    • Windows: C:/Users/USERNAME/
    • OS X: /Users/USERNAME/
  3. Create a new folder, "Genomica", in this directory. Place the two files, human_GO.gxa and Human_RefSeq_skeleton.tab, in this folder.
    • e.g., human_GO.gxa on a Windows computer will be located in: C:/Users/USERNAME/Genomica/human_GO.gxa
    • e.g., human_GO.gxa on a Mac computer will be located in: /Users/USERNAME/Genomica/human_GO.gxa
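
If you prefer to script the folder setup, the following Python sketch creates a "Genomica" folder under a given home directory and reports which expected files are not yet in it. The helper names are hypothetical, and the demonstration runs in a temporary directory so that it does not touch your real home folder; to use it for real, pass Path.home() instead.

```python
import tempfile
from pathlib import Path

def ensure_genomica_dir(home: Path) -> Path:
    """Create <home>/Genomica if needed and return its path."""
    folder = home / "Genomica"
    folder.mkdir(parents=True, exist_ok=True)
    return folder

def missing_files(folder: Path, names):
    """Return the names that do not yet exist as files in the folder."""
    return [name for name in names if not (folder / name).is_file()]

# Demonstration in a throwaway directory (assumption: stands in for the home directory).
with tempfile.TemporaryDirectory() as tmp:
    folder = ensure_genomica_dir(Path(tmp))
    folder_name = folder.name
    (folder / "human_GO.gxa").touch()  # pretend this file was already copied
    # "example_missing.tab" is a placeholder name, not a file from the recipe.
    missing = missing_files(folder, ["human_GO.gxa", "example_missing.tab"])
    print(folder_name, missing)
```
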

Running gene set analysis in Genomica

  1. In GenomeSpace, click on the file of SNP-associated RefSeq IDs, e.g. SNPanalysis_RefSeq.gse.
  2. Launch Genomica on this file by clicking on the Genomica icon. This will prompt the download of a .jnlp file. Double-click the file to launch Genomica. The gene expression analysis program should automatically begin running.
  3. Once the analysis program starts, you should see a new dialog box titled, Analyze Gene Sets. Change the following parameters:
    1. Under Sets to Analyze, choose Human GO.
    2. Under Sets to Find Enrichment For, choose the SNPanalysis_RefSeq.gse file. It may have additional text and numbers in front of the file name due to the Genomica import.
  4. Click Analyze. The output should be a table of gene functional annotation enrichments.
    NOTE: If you don't see the Analyze button, try making the window larger by clicking and dragging the bottom edge of the window downward.

  5. Windows users: after launching Genomica through the .jnlp file, the Analyze button may not appear. If this happens:
    1. Close the initial pop-up window, click Analyze in the top bar, then Gene Sets.
    2. After resizing the window to be larger, the Analyze and Cancel buttons should appear.

Results Interpretation

This is an example interpretation of the results from this recipe. First, we performed Gene Set Enrichment Analysis (GSEA) to determine if the SNP-associated genes are up- or down-regulated in a biological condition such as cancer. To interpret our GSEA results, we navigate to the un-zipped folder of GSEA results.

This folder contains the results of the GSEA, including: tables of results; plots such as enrichment plots, heatmaps, butterfly plots, distribution plots, and correlation profiles; and HTML pages designed to display and explain the accompanying results. Clicking on the index.html file will bring up a local HTML webpage linking to the remaining results, in an easy-to-navigate format. In this example, the results indicate that our SNP-associated gene set was up-regulated in the skin tumor phenotype, but these results are not significant. For more help interpreting the results from GSEA analyses, visit the GSEA User Guide page for interpreting GSEA results.

Next, we used Genomica to determine whether SNP-associated genes are enriched for particular biological functions using multiple gene set analysis. The results from this analysis are (1) a table of significantly enriched GO terms, and (2) a graphical representation of the enrichment for each GO term in the gene set being analyzed.

The Genomica results suggest that there is significant overlap between the SNP-associated gene set and GO terms such as "regulation of nuclear mRNA splicing, via spliceosome" and "negative regulation of mRNA processing". These GO term enrichment results have low p-values (see table above, column 4), and show a large amount of overlap between the SNP-associated gene set and the GO term gene set (see graphic below). The significance of this possible result would need further confirmation.
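
The source does not state which statistical test Genomica uses. As a hedged illustration of how overlap between a gene set and a GO term can be scored, the sketch below computes a hypergeometric tail probability, a common choice for this kind of enrichment analysis. The function name and all counts are invented for illustration.

```python
from math import comb

def hypergeom_pvalue(population, annotated, sample, overlap):
    """P(X >= overlap) for a hypergeometric draw.

    population: total number of genes considered
    annotated:  genes in the population carrying the GO term
    sample:     size of the gene set being tested
    overlap:    genes in the set that carry the GO term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Invented counts (assumption: not results from the recipe).
p_value = hypergeom_pvalue(population=20, annotated=5, sample=4, overlap=2)
print(p_value)
```

A small p-value means an overlap this large would rarely occur by chance. When many GO terms are tested at once, the resulting p-values should also be corrected for multiple testing.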

