GenomeSpace Recipe: Identify enriched biological functions in single-nucleotide polymorphisms (SNPs)

Added by GenomeSpaceTeam on 2015.05.06
Last updated over 3 years ago.

SNP array, genomic loci, gene set enrichment analysis, functional annotation

Summary

How are SNP-related genes regulated in an expression dataset? Are these genes enriched for particular biological functions?

This recipe provides one method for identifying enriched biological functions among genes associated with single-nucleotide polymorphisms (SNPs). For example, an investigator who has completed a genome-wide association study (GWAS) may want to know which genes lie at the genomic coordinates of the associated SNPs, and whether those genes share particular biological functions.

In this example, we imagine that an investigator completes a GWAS and obtains a list of genomic coordinates associated with SNPs. Knowing these coordinates alone is not always informative; the investigator also wants to know which genes lie in these regions and what kinds of biological functions those genes may have. We are interested in answering two questions:

Are the genes near these SNPs enriched with particular biological functions?
Are the genes near these SNPs significantly up- or down-regulated in some other biological conditions, such as cancer?

To answer the first question, we will find Gene Ontology functional annotations using Genomica. To answer the second question, we will use the Gene Set Enrichment Analysis (GSEA) module in GenePattern, comparing the SNP-associated genes to a gene expression dataset from a study of epithelial cancer stem cells. This study evaluated the ability of oncogenes to activate an embryonic stem cell program in differentiated adult tumor cells, by transforming human keratinocytes into squamous cell carcinomas using oncogenic Ras and IκBα, plus one of three genes: c-Myc, E2F3, or GFP (Wong et al. 2009. Cell Stem Cell). Comparisons between these three genes showed that c-Myc could re-activate the embryonic stem cell program. Comparing the SNP-associated genes to this gene expression dataset can determine whether this gene set is differentially regulated in c-Myc samples, compared with samples expressing the other genes, GFP and E2F3.

Inputs

To complete this recipe, we will first need a file describing the genomic coordinates of SNPs, as well as a gene expression dataset against which to compare genes associated with SNPs. In addition, in order to complete functional annotation in Genomica, we will need special files of the human Gene Ontology (GO) annotations. We will need the following datasets, which can be downloaded from the following GenomeSpace Public folder:

Public > RecipeData > GenomicFeatureData > SNPanalysis_data.bed: A file of the genomic coordinates of SNPs from a GWAS study. The SNPs are listed in a BED file format, with a chromosome annotation and starting and ending positions for coding regions which have SNPs.

Gene Expression Dataset:

Public > RecipeData > ExpressionData > GSE10423 > GSE10423.gct: A gene expression dataset of human keratinocytes transformed by Ras, IκBα, and one of three genes: c-Myc, E2F3, and GFP. The file is available in GenePattern's GCT format.

Public > RecipeData > ExpressionData > GSE10423 > GSE10423.cls: A class file containing assignments for all the samples in the GCT file (+Myc or -Myc). The file is in GenePattern CLS format.

Public > RecipeData > ExpressionData > GSE10423 > GSE10423.chip: A file containing the microarray chip information for this gene expression dataset.

Genomica Database Files:

Public > RecipeData > Tool_Genomica > human_GO.gxa: A file containing the annotations of genes by pathways, functions, and biological processes from the Gene Ontology.

Public > RecipeData > Tool_Genomica > Human_RefSeq_skeleton_expression.tab: A file containing RefSeq IDs for parsing in Genomica.
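
To make the GCT and CLS inputs more concrete, the following is a minimal sketch of how the two formats can be read. It assumes the standard GenePattern layouts: a GCT file has a version line, a line giving the number of rows and columns, a header line (Name, Description, then sample names), and one row per probe; a categorical CLS file has a counts line, a line of class names starting with "#", and a line with one label per sample. The function names and the tiny inline data are invented for illustration and are not taken from GSE10423.

```python
import io


def read_gct(handle):
    """Parse a GCT file: version line, dimensions line, header, then data rows."""
    handle.readline()  # version line, e.g. "#1.2" (not used further here)
    n_rows, n_cols = map(int, handle.readline().split()[:2])
    header = handle.readline().rstrip("\n").split("\t")
    samples = header[2:]  # columns after Name and Description
    rows = {}
    for line in handle:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        rows[fields[0]] = [float(x) for x in fields[2:]]
    if len(samples) != n_cols or len(rows) != n_rows:
        raise ValueError("GCT dimensions line does not match the data")
    return samples, rows


def read_cls(handle):
    """Parse a categorical CLS file into class names and one label per sample."""
    n_samples, n_classes = map(int, handle.readline().split()[:2])
    class_names = handle.readline().lstrip("#").split()
    labels = handle.readline().split()
    if len(labels) != n_samples or len(class_names) != n_classes:
        raise ValueError("CLS counts line does not match the labels")
    return class_names, labels


# Made-up example data (hypothetical probes, samples, and class names).
gct_text = (
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tS1\tS2\tS3\n"
    "probe_a\tna\t1.0\t2.0\t3.0\n"
    "probe_b\tna\t4.0\t5.0\t6.0\n"
)
cls_text = "3 2 1\n# Myc noMyc\n0 0 1\n"

samples, rows = read_gct(io.StringIO(gct_text))
class_names, labels = read_cls(io.StringIO(cls_text))
```

The number of labels in the CLS file must equal the number of sample columns in the GCT file, which is why both readers check their counts.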

Outputs

Recipe steps

Galaxy

Loading data into Galaxy
Obtaining a reference genome using Galaxy
Finding genes in SNP regions

GenePattern

Perform Gene Set Enrichment Analysis

Genomica

Saving Genomica files locally
Running gene set analysis in Genomica

1: Loading data into Galaxy

We will load the file containing SNPs into Galaxy. Then, we will use a pre-built GenomeSpace workflow to process the dataset, mapping the SNPs to reference gene annotations and extracting the gene IDs in two formats (Gene Symbol and RefSeq ID). Finally, we will send the gene IDs back to GenomeSpace for additional analysis.

NOTE: If you have not yet associated your GenomeSpace account with your Galaxy account, you will be asked to do so. If you do not yet have a Galaxy account, you can automatically generate a new account that will be associated with your GenomeSpace account.

Click on the Galaxy icon to launch the tool.
Navigate to the following menu: Get Data > GenomeSpace import.
Select the SNPanalysis_data.bed file, which can be found under the following directory: Public > RecipeData > GenomicFeatureData.
Click Send to Galaxy.

2: Obtaining a reference genome using Galaxy

We will use the UCSC Main tool to obtain a reference genome from the UCSC Table Browser, through Galaxy.

Navigate to the following menu: Get Data > UCSC Main
In the dialog box, enter the following parameters:
1. clade: Mammal
2. genome: Human
3. assembly: Feb. 2009 (GRCh37/hg19)
4. group: Genes and Gene Predictions
5. track: UCSC genes
6. table: knownGene
7. region: genome
8. output format: selected fields from primary and related tables
9. send to: Galaxy
Click get output. This will load a new page from which you can select specific annotations.
Under Linked Tables, make sure the hg19: kgXref is checked.
Click allow selection from checked tables to update the page, then change the following parameters:
1. Under Select Fields from hg19.knownGene, make sure the following parameters are checked:
  1. chrom: Reference sequence chromosome or scaffold
  2. cdsStart: Coding region start
  3. cdsEnd: Coding region end
2. Under hg19.kgXref, make sure the following parameters are checked:
  1. geneSymbol: Gene Symbol
  2. refseq: RefSeq ID
3. Click done with selection to load the final output page.
Click Send query to Galaxy to run the job. This will generate a new file of reference gene annotations.
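
For readers who want to inspect the reference annotation file outside Galaxy, the following is a minimal sketch of reading it. It assumes the output is tab-delimited with the five fields selected above in the order chrom, cdsStart, cdsEnd, geneSymbol, refseq, and that any header line starts with "#"; check the actual file before relying on this. The record type, function name, and example rows are hypothetical.

```python
import csv
import io
from collections import namedtuple

# One record per annotation row; field order is an assumption (see lead-in).
Gene = namedtuple("Gene", "chrom cds_start cds_end symbol refseq")


def read_gene_table(handle):
    """Read tab-delimited annotation rows, skipping blank and '#' header lines."""
    genes = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        chrom, start, end, symbol, refseq = row[:5]
        genes.append(Gene(chrom, int(start), int(end), symbol, refseq))
    return genes


# Made-up rows; gene names, IDs, and coordinates are placeholders.
example = (
    "#chrom\tcdsStart\tcdsEnd\tgeneSymbol\trefseq\n"
    "chr1\t1000\t2000\tGENE_A\tNM_EXAMPLE1\n"
    "chr1\t100\t200\tGENE_B\t\n"
)
genes = read_gene_table(io.StringIO(example))
```

Some rows may have an empty RefSeq field, as in the second placeholder row, so downstream code should not assume every gene has a RefSeq ID.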

3: Finding genes in SNP regions

We will use a pre-built GenomeSpace workflow to identify genes in the SNP regions, and to extract the gene IDs.

Click on the following link: Official GenomeSpace Galaxy Workflow: Identify Enriched Biological Functions in SNPs.
Click the icon in the upper right corner to import the workflow.
Click start using this workflow.
Click on the workflow drop-down menu (e.g., imported: Identify Enriched Biological Functions in SNPs), then choose Run.
Load the files into the correct fields. The input fields should have annotation indicating which file should be loaded:
1. Step 1: Input Dataset (UCSC Reference Genome Annotations): UCSC Main on Human: knownGene (genome)
2. Step 2: Input Dataset (SNP Genomic Coordinates, e.g. SNPanalysis_data.bed): GenomeSpace import on SNPanalysis_data.bed
Click Run workflow.
Once the workflow has finished running, save the output of the first cut operation to GenomeSpace by navigating to the following menu: Send Data > GenomeSpace Exporter. Change the following parameters:
1. Send this dataset to GenomeSpace: 4: Cut on data 3, the output file of Gene Symbols, e.g. "ACTB".
2. Choose Target Directory: choose the directory to save the file in
3. filename: SNPanalysis_GeneSymbol.grp
4. Click Execute.
Save the output of the second cut operation by navigating again to: Send Data > GenomeSpace Exporter. Change the following parameters:
1. Send this dataset to GenomeSpace: 5: Cut on data 3, the output file of RefSeq IDs, e.g. "NM_123456789".
2. Choose Target Directory: choose the directory to save the file in
3. filename: SNPanalysis_RefSeq.gse
4. Click Execute.
Optional: close Galaxy.
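
The internals of the workflow are not shown in this recipe, but the idea behind this step can be sketched as an interval overlap between SNP regions and gene coding regions, followed by collecting the unique gene IDs. The sketch below assumes 0-based, half-open coordinates (as in BED), treats a GRP file as plain text with one gene identifier per line, and uses invented function names and data throughout.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True when two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def genes_near_snps(snp_regions, genes):
    """Collect unique gene symbols and RefSeq IDs for genes overlapping any SNP region."""
    symbols, refseqs = set(), set()
    for chrom, start, end in snp_regions:
        for g_chrom, g_start, g_end, symbol, refseq in genes:
            if chrom == g_chrom and overlaps(start, end, g_start, g_end):
                if symbol:
                    symbols.add(symbol)
                if refseq:
                    refseqs.add(refseq)
    return sorted(symbols), sorted(refseqs)


# Made-up SNP regions (chrom, start, end) and gene records
# (chrom, cdsStart, cdsEnd, geneSymbol, refseq).
snp_regions = [("chr1", 150, 151), ("chr2", 500, 501)]
genes = [
    ("chr1", 100, 200, "GENE_A", "NM_EXAMPLE1"),
    ("chr1", 300, 400, "GENE_B", "NM_EXAMPLE2"),
    ("chr2", 450, 600, "GENE_C", ""),
]

symbols, refseqs = genes_near_snps(snp_regions, genes)

# One identifier per line, as assumed for the exported gene list.
grp_text = "\n".join(symbols) + "\n"
```

The nested loop is quadratic and only suitable for small inputs; a sorted sweep or an interval tree would be the usual choice for genome-scale data.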

4: Perform Gene Set Enrichment Analysis

We will use the Gene Set Enrichment Analysis (GSEA) module to determine whether the SNP-associated genes are up- or down-regulated in biological conditions such as cancer. We will use the SNP-associated genes as a custom 'gene set' to evaluate its enrichment on a gene expression dataset of human keratinocytes transformed into squamous cell carcinomas. This module uses the GCT file and the CLS file, as well as the list of Gene Symbols from the SNP analysis in Galaxy.

NOTE: If you have not yet associated your GenomeSpace account with your GenePattern account, you will be asked to do so. If you do not yet have a GenePattern account, you can automatically generate a new account that will be associated with your GenomeSpace account.

Click on the GenePattern icon to launch the tool.
Change to the Modules tab, and search for "GSEA". Change to the GenomeSpace tab to navigate to the directory containing the GSE10423 gene expression dataset files. Once the module is loaded, change the following parameters:
1. expression dataset: GSE10423.gct, found in the following directory: Public > RecipeData > ExpressionData > GSE10423.
2. gene sets database file: SNPanalysis_GeneSymbol.grp
3. phenotype labels: GSE10423.cls, found in the following directory: Public > RecipeData > ExpressionData > GSE10423.
4. chip platform file:
  1. Click the Upload your own file button, which will create a new drag-and-drop input box.
  2. Add the GSE10423.chip file, found in the following directory: Public > RecipeData > ExpressionData > GSE10423.
Click the button to run GSEA.
Once the job has finished running, the result will be a .ZIP file containing the report from GSEA. The report will include HTML files, PNG images, etc. You can save the resulting file back to GenomeSpace:
1. Click on the ZIP file (e.g. GSE10423.zip), and choose Save to GenomeSpace.
2. Navigate to a directory of your choice and choose Save.
3. Un-zip the ZIP file by right-clicking on the file, then choosing Expand Archive. This should create a new folder in GenomeSpace that you can access, containing all the GSEA results.
To interpret the results of the GSEA, please continue to the section called Results Interpretation.
Optional: close GenePattern.
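
To give a feel for what the GSEA module computes, the following is a simplified sketch of the running-sum enrichment score. It is not the GenePattern implementation: it ranks genes by a plain difference of class means and uses the unweighted form of the score, whereas GSEA as published weights each hit by its ranking metric and estimates significance by permutation. All names and data are invented for illustration.

```python
def rank_by_mean_difference(expression, labels, positive):
    """Rank genes from most up-regulated to most down-regulated in the positive class."""
    def mean(values):
        return sum(values) / len(values)

    scores = {}
    for gene, values in expression.items():
        pos = [v for v, lab in zip(values, labels) if lab == positive]
        neg = [v for v, lab in zip(values, labels) if lab != positive]
        scores[gene] = mean(pos) - mean(neg)
    return sorted(scores, key=scores.get, reverse=True)


def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum score: step up on set members, down on all other genes."""
    members = set(gene_set) & set(ranked_genes)
    n_hits = len(members)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranked list without covering it")
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        running += 1.0 / n_hits if gene in members else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running  # keep the largest deviation from zero
    return best


# Made-up expression values for four genes across four samples.
expression = {
    "g1": [5.0, 5.0, 1.0, 1.0],
    "g2": [4.0, 4.0, 1.0, 1.0],
    "g3": [2.0, 2.0, 2.0, 2.0],
    "g4": [1.0, 1.0, 3.0, 3.0],
}
labels = ["Myc", "Myc", "ctrl", "ctrl"]

ranked = rank_by_mean_difference(expression, labels, positive="Myc")
es_up = enrichment_score(ranked, {"g1", "g2"})
es_down = enrichment_score(ranked, {"g4"})
```

A positive score indicates that the gene set is concentrated at the top of the ranked list (up-regulated in the positive class), and a negative score indicates concentration at the bottom.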

5: Saving Genomica files locally

We will use Genomica to determine whether the SNP-associated genes are enriched for particular biological functions by analyzing multiple gene sets.

First, we must save files to the local drive in order for Genomica to run the multiple gene set analysis.

Download the following files:
- Public > RecipeData > Tool_Genomica > human_GO.gxa
- Public > RecipeData > Tool_Genomica > Human_RefSeq_skeleton_expression.tab
Once these two files are downloaded to your local computer, navigate to your Genomica home directory on your local computer:
- Windows: C:/Users/USERNAME/
- OS X: /Users/USERNAME/
Create a new folder, "Genomica", in this directory. Place the two files, human_GO.gxa and Human_RefSeq_skeleton_expression.tab, in this folder.
- e.g., human_GO.gxa on a Windows computer will be located in: C:/Users/USERNAME/Genomica/human_GO.gxa
- e.g., human_GO.gxa on a Mac computer will be located in: /Users/USERNAME/Genomica/human_GO.gxa
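
The same file placement can be scripted. The following is a minimal sketch that creates the "Genomica" folder in the home directory and copies the two downloaded files into it. The function name and the download location are assumptions; the demonstration at the end uses temporary directories and placeholder files so that it does not touch a real home directory.

```python
import shutil
import tempfile
from pathlib import Path

GENOMICA_FILES = ("human_GO.gxa", "Human_RefSeq_skeleton_expression.tab")


def stage_genomica_files(download_dir, home_dir=None):
    """Copy the two Genomica files from download_dir into <home>/Genomica."""
    home = Path(home_dir) if home_dir is not None else Path.home()
    target = home / "Genomica"
    target.mkdir(exist_ok=True)
    staged = []
    for name in GENOMICA_FILES:
        source = Path(download_dir) / name
        if not source.is_file():
            raise FileNotFoundError(f"expected downloaded file: {source}")
        staged.append(Path(shutil.copy2(source, target / name)))
    return staged


# Demonstration with temporary directories and placeholder file contents.
with tempfile.TemporaryDirectory() as tmp:
    downloads = Path(tmp) / "Downloads"
    home = Path(tmp) / "home"
    downloads.mkdir()
    home.mkdir()
    for name in GENOMICA_FILES:
        (downloads / name).write_text("placeholder\n")
    staged = stage_genomica_files(downloads, home)
    staged_names = sorted(p.name for p in staged)
    staged_parent = staged[0].parent.name
```

In real use, calling stage_genomica_files with only the download folder would place the files under the current user's home directory.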

6: Running gene set analysis in Genomica

Next, we will complete an enrichment analysis by comparing multiple gene sets in Genomica.

In GenomeSpace, click on the file of SNP-associated RefSeq IDs, e.g. SNPanalysis_RefSeq.gse.
Launch Genomica on this file by clicking on the Genomica icon. This will prompt the download of a .jnlp file. Double-click the file to launch Genomica. The gene expression analysis program should automatically begin running.
Once the analysis program starts, you should see a new dialog box titled, Analyze Gene Sets. Change the following parameters:
1. Under Sets to Analyze, choose Human GO.
2. Under Sets to Find Enrichment For, choose the SNPanalysis_RefSeq.gse file. It may have additional text and numbers in front of the file name due to the Genomica import.
Click Analyze. The output should be a table of gene functional annotation enrichments.
NOTE: If you don't see the Analyze button, try making the window larger by clicking and dragging the bottom edge of the window downward.
NOTE for Windows users: When Genomica is launched through the .jnlp file, the Analyze button may not appear. In that case:
1. Close the initial pop-up window, then click Analyze in the top bar and choose Gene Sets.
2. Resize the window to make it larger; the Analyze and Cancel buttons should then appear.
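
Enrichment of a GO term in a gene set is commonly scored with a hypergeometric test. Whether Genomica uses exactly this test is an assumption here, so the following is only a sketch of the general technique: given a background of genes, some of which carry a GO term, it computes the probability of seeing at least the observed overlap by chance. The function name and the numbers are invented for illustration.

```python
from math import comb


def hypergeom_pvalue(population, annotated, sample, overlap):
    """P(X >= overlap) when `sample` genes are drawn from `population`,
    of which `annotated` genes carry the GO term."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Made-up numbers: 10 background genes, 4 annotated with the term,
# 3 genes in the SNP-associated set, 2 of which carry the term.
p_value = hypergeom_pvalue(population=10, annotated=4, sample=3, overlap=2)

# An overlap of zero is always at least as likely as observed, so p = 1.
p_trivial = hypergeom_pvalue(population=10, annotated=4, sample=3, overlap=0)
```

Smaller p-values indicate that the overlap is less likely to arise by chance; when many GO terms are tested at once, a multiple-testing correction is normally applied as well.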

Results Interpretation

This is an example interpretation of the results from this recipe. First, we performed Gene Set Enrichment Analysis (GSEA) to determine if the SNP-associated genes are up- or down-regulated in a biological condition such as cancer. To interpret our GSEA results, we navigate to the un-zipped file of GSEA results.

This folder contains the results of the GSEA, including: tables of results; plots such as enrichment plots, heatmaps, butterfly plots, distribution plots, and correlation profiles; and HTML pages designed to display and explain the accompanying results. Clicking on the index.html file will bring up a local HTML webpage linking to the remaining results, in an easy-to-navigate format. In this example, the results indicate that our SNP-associated gene set was up-regulated in the skin tumor phenotype, but these results are not significant. For more help interpreting the results from GSEA analyses, visit the GSEA User Guide page for interpreting GSEA results.

Next, we used Genomica to determine whether SNP-associated genes are enriched for particular biological functions using multiple gene set analysis. The results from this analysis are (1) a table of significantly enriched GO terms, and (2) a graphical representation of the enrichment for each GO term in the gene set being analyzed.

The Genomica results suggest that there is significant overlap between the SNP-associated gene set and GO terms such as "regulation of nuclear mRNA splicing, via spliceosome" and "negative regulation of mRNA processing". These GO term enrichments have highly significant p-values (see table above, column 4) and show a large amount of overlap between the SNP-associated gene set and the GO term gene set (see graphic below). The significance of this possible result would need further confirmation.