Find differentially expressed genes in RNA-seq data |
Added by GenomeSpaceTeam on 2015.04.21
Last updated on over 3 years ago.
Which genes are differentially expressed between my two phenotypes, based on my RNA-seq data?
This recipe provides one method to identify differentially expressed genes in RNA-seq read data. An example use of this recipe is a case where an investigator may want to compare two phenotypes, such as two types of cancer, to determine which genes are up- or down-regulated differently between these phenotypes.

In particular, this recipe uses the UCSC Table Browser to retrieve a reference genome to align RNA-seq reads against. We also used several modules in GenePattern to align the reads against the reference genome, and to identify differentially expressed genes when comparing two conditions. Finally, we use IGV to visualize the differentially expressed genes.
Why differential expression analysis? We assume that most genes are not expressed all the time, but rather are expressed in specific tissues, stages of development, or under certain conditions. Genes which are expressed in one condition, such as cancerous tissue, are said to be differentially expressed when compared to normal conditions. To identify which genes change in response to specific conditions (e.g. cancer), we must filter or process the dataset to remove genes which are not informative.
To complete this recipe, we will need RNA-seq reads, and reference gene annotations to align the reads against. In this example, we are validating a Saccharomyces cerevisiae knockout of the SFP1/YLR403W gene using paired 3’ Digital Gene Expression (DGE) RNA-seq reads. The RNA-seq reads are in FASTQ/FASTA format, and can be gzipped. We will need the following datasets, which can be downloaded from the GenomeSpace Public folder:
Public > RecipeData > SequenceData > WT.fastq.gz: This file contains RNA-seq reads for the wild-type S. cerevisiae genome.
Public > RecipeData > SequenceData > SFP1KO.fastq.gz: This file contains RNA-seq reads for the mutant S. cerevisiae genome with the SFP1/YLR403W gene knocked out.
Public > RecipeData > GenomicFeatureData > SacCer3.gtf: This file contains reference gene annotations for the wild-type S. cerevisiae genome.
1: Getting reference gene annotations
In this step, we use the UCSC Table Browser to retrieve reference gene annotations corresponding to the reference genome for our example data. If you are using your own data, you may already have a reference gene annotation file, or you may need to search for one matching your reference genome here.
clade: Othergenome: S. cerevisiaeassembly: Apr. 2011 (SacCer_Apr2011/sacCer3)group: Genes and Gene Predictionstrack: SGD Genestable: sgdGeneregion: genomeoutput format: GTF - gene transfer formatSend output to: Select the GenomeSpace checkbox.output file: Give the output a name. For the example data, we name it sacCer3_SGD.gtf.

NOTE: If you have not yet associated your GenomeSpace account with your GenePattern account, you will be asked to do so. If you do not yet have a GenePattern account, you can automatically generate a new account that will be associated with your GenomeSpace account.
GenomeSpace tab, then navigate to the example data file, e.g. Public > RecipeData > SequenceData > WT.fastq.gz (or to your personal directory).
Alternative: other ways to load data into GenePattern
Tool: GenePattern
1. Click on the file (e.g., WT.fastq.gz) in GenomeSpace, then use the GenePattern context menu and click Launch on File.
OR
2. Click on the file (e.g., WT.fastq.gz) in GenomeSpace, then drag it to the GenePattern icon to launch.
3: Aligning RNA-Seq reads
We will use TopHat in GenePattern to align the RNA-seq raw reads to a reference genome. In this recipe, we specifically want to identify existing splice junctions, and not novel splice junctions. This module uses the FASTA/FASTQ files (e.g., WT.fastq.gz and SFP1KO.fastq.gz), the reference gene annotations (e.g., SacCer3.gtf) and a reference genome with an index (pre-built in GenePattern). If you are using your own data, you can load your own reference genome instead.
Modules tab, and search for "TopHat". Change the following parameters:bowtie index: choose a genome from the drop-down menu. In this example we use Saccharomyces_cerevisiae_sacCer3_UCSCreads pair 1: load the raw RNA-seq data, e.g., Public > RecipeData > SequenceData > SFP1KO.fastq.gz. To do this, use the GenomeSpace tab to navigate to the directory containing your FASTQ file, then drag the file to the input box.
GTF file: load the GTF file. To do this, first click the Upload your own file button. This creates a new area to drag-and-drop files. Next, navigate to the GenomeSpace tab, and navigate to the folder containing the GTF file: Public > RecipeData > GenomicFeatureData > SacCer3.gtf. Load the file into the GTF file parameter by clicking on the file and choosing Send to GTF, or by dragging the file to the GTF file input box.
output prefix: choose a filename prefix, e.g., sfp1ko.reads
Miscellaneous Advanced Parameters > find novel junctions : no
Run to run TopHat. This will generate several files; in this recipe, we will use the *.accepted_hits.bam file.TopHat with the wild type FASTA/FASTQ file (or any other comparative FASTA/FASTQ file), using the following steps:
TopHat job number to bring up the context menu for jobs.Reload job, which will reload the job with the same parameters.
next to the file in the reads pair 1 parameter.GenomeSpace tab, navigate to the second FASTA/FASTQ file (e.g., WT.fastq.gz). Load it into the reads pair 1 parameter by dragging the file to the input box.wt.readsTopHat on WT.fastq.gz first, then again on another dataset, e.g., SFP1KO.fastq.gz. It may take several hours to run TopHat, depending on the size of the files and whether you are using the GenePattern public server.
4: Identifying differential expression in RNA-seq reads
We will use the Cufflinks.cuffdiff module to test differential expression between the two samples (e.g., WT.fastq.gz and SFP1KO.fastq.gz). This module uses the aligned data output from TopHat, and the reference gene annotation GTF file.
Then, we convert the FPKM file to a GCT file using the Fpkm_trackingToGct module in GenePattern.
Cuffdiff
Modules tab, and search for "Cuffdiff". Once the module is loaded, change the following parameters:aligned files: Create a new condition for the first FASTA/FASTQ file, e.g., wt. Load the appropriate file (e.g., wt.reads.accepted_hits.bam) into the aligned files parameter by clicking and dragging the file into the input box.Add Another Condition button.
aligned files: Create a second condition for the first FASTA/FASTQ file, e.g., sfp1ko. Load the appropriate file (e.g., sfp1ko.accepted_hits.bam) into the aligned files parameter by clicking and dragging the file into the input box.GTF file: load the GTF file. To do this, first click the Upload your own file button. This creates a new area to drag-and-drop files. Next, navigate to the GenomeSpace tab, and navigate to the folder containing the GTF file: Public > RecipeData > GenomicFeatureData > SacCer3.gtf. Load the file into the GTF file parameter by clicking on the file and choosing Send to GTF, or by dragging the file to the GTF file input box.
library type: fr-unstranded
advanced input parameters and options and then set min alignment count: 0Run to submit your job. This will generate several files; in this recipe, we will use the genes.fpkm_tracking file.
Fpkm_trackingToGct
Modules tab, and search for "Fpkm_trackingToGct".input file: load the genes.fpkm_tracking file. To do this, navigate to the GenomeSpace tab, and navigating to the folder containing the file. Load the file into the input file parameter by clicking on the file and choosing Send to Input, or by dragging the file to the input file input box.Run to submit your job. This will generate a genes.gct file.
5: Creating indices for aligned read files
We will use the Picard.SortSam module to prepare the aligned read data for viewing in IGV. In order to visualize BAM alignment files in IGV, we need a companion index file (BAI) which is located in the same directory. The Picard.SortSam module will sort the file and generate the index. This module uses the *.accepted_hits.bam files.
Modules tab, and search for "Picard.SortSam".input file: load one of the TopHat aligned reads outputs into the parameter, e.g. wt.reads.accepted_hits.bam. To do this, use the Jobs tab, navigate to one of the previous TopHat jobs, and load the file into the Picard.SortSam module by dragging the file into the input file parameter.output format: BAMRun to submit your job. This will generate a new sorted BAM file, as well as an index BAI file.

Picard.SortSam with the other file using these steps:
Picard.SortSam job number to bring up the context menu for jobs.Reload job, which will reload the job with the same parameters.
next to the file in the input file parameter.Jobs tab, navigate to other TopHat job, then load the other aligned reads file, e.g. sfp1ko.reads.accepted_hits.bam, into the input file parameter by dragging the file to the input box.
6: Saving the files to GenomeSpace
Save the genes.gct file, and the sorted BAM files and the BAI files GenomeSpace using one of the following methods.
genes.gct), then choose Save to GenomeSpace. Save the file to your folder.Modules and Pipeline start page, navigate to Jobs. Click on the file, then choose Send to GenomeSpace. Save the file to your folder.Repeat this process for the remaining files. In total, you should save the following five files back to your GenomeSpace account:
genes.gctwt.reads.accepted_hits.sorted.bamwt.reads.accepted_hits.sorted.baisfp1ko.reads.accepted_hits.sorted.bamsfp1ko.reads.accepted_hits.sorted.baiOptional: when you are finished, close GenePattern.
7: Loading RNA-seq data into IGV and visualize differential expression and mapped alignments
We will use IGV to visualize the differential expression and abundance levels in the genes.gct file, and check the quality of the mapped alignments from the BAM files. In this recipe, we will also validate that the gene of interest (SFP1) was knocked-out in the SFP1KO mutant strain of S. cerevisiae.
NOTE: For Macintosh Users: JNLP files from the internet are labeled insecure. In order to open the JNLP, find the file in your Finder, right-click the file, and press open. This will open a window that will ask for permission to open the file. Press open to access the JNLP.
Launch, prompting the download of a .jnlp file. Double-click the file to launch IGV.S. cerevisiae (sacCer3) genome, as in this example.GenomeSpace tab to import files by clicking on the Load File from GenomeSpace... option.

Open. Load the following files into IGV:wt.reads.accepted_hits.sorted.bamsfp1ko.reads.accepted_hits.sorted.bamgenes.gctNow we can visualize the knock-out gene in our datasets. If you are using the example data, type "YLR403W" into the search box, and click Go or hit enter. Note that there are no reads aligned to the gene in the SFP1KO data and zero expression visible in the sfp1ko_FPKM track. There are reads aligned in the WT sample (although mostly at the 3' end, as expected with 3' DGE data) and a high expression level is visible in the wt_FPKM track.
This is an example interpretation of the results from this recipe. First, we used TopHat in GenePattern to align raw reads from RNA-seq to a reference genome (pre-built in Bowtie). After aligning both a wild-type and mutant strain of S. cerevisiae, we used Cufflinks to determine the genes which were differentially expressed when comparing the two phenotypes. We then use IGV to visualize the difference between these two datasets; in particular, we validate that the SFP1KO mutant strain of S. cerevisiae lacks the SFP1 gene.

The wt_FPKM track is red in the region of the SFP1 gene, whereas the sfp1ko_FPKM track is white. This is an indicator of gene expression, and shows that the SFP1 gene is not being transcribed in the SFP1KO mutant strain. We can see an additional indicator of this when we compare the wt.reads.accepted_hits.sorted.bam file to the sfp1ko.reads.accepted_hits.sorted.bam file. These are the tracks which have red and blue boxes to indicate where reads aligned against the reference genome. We can see there are several reads aligned in the WT strain to the SFP1 gene, but the area is blank in the SPF1KO mutant, showing that no reads from that gene aligned against the reference genome. These results confirm that the SFP1 gene was knocked-out in the SFP1KO mutant strain of S. cerevisiae. However, the results in this example are not necessarily significant and are only a simple representation of possible results.
Is GenePattern working ok??? I'm trying to use it but I keep getting an error message... "GenePattern Server error preparing job 1314526 for execution.org.genepattern.server.executor.CommandExecutorException: Error adding job to queue, gpJobNo=1314526, DrmaaException=denied: parallel environment "smp" does not exist job rejected: the requested parallel environment "smp" does not exist" Can someone help me???
I forwarded your comment to the GenePattern team at gp-help@broadinstitute.org but they could not find your job on the public server. Could you contact them directly to see if they can help sort this out?