Find differentially expressed genes in RNA-seq data |
Added by GenomeSpaceTeam on 2015.04.21
Last updated on over 3 years ago.
Which genes are differentially expressed between my two phenotypes, based on my RNA-seq data?
This recipe provides one method to identify differentially expressed genes in RNA-seq read data. An example use of this recipe is a case where an investigator may want to compare two phenotypes, such as two types of cancer, to determine which genes are up- or down-regulated differently between these phenotypes.
In particular, this recipe uses the UCSC Table Browser to retrieve a reference genome to align RNA-seq reads against. We also used several modules in GenePattern to align the reads against the reference genome, and to identify differentially expressed genes when comparing two conditions. Finally, we use IGV to visualize the differentially expressed genes.
Why differential expression analysis? We assume that most genes are not expressed all the time, but rather are expressed in specific tissues, stages of development, or under certain conditions. Genes which are expressed in one condition, such as cancerous tissue, are said to be differentially expressed when compared to normal conditions. To identify which genes change in response to specific conditions (e.g. cancer), we must filter or process the dataset to remove genes which are not informative.
To complete this recipe, we will need RNA-seq reads, and reference gene annotations to align the reads against. In this example, we are validating a Saccharomyces cerevisiae knockout of the SFP1/YLR403W gene using paired 3’ Digital Gene Expression (DGE) RNA-seq reads. The RNA-seq reads are in FASTQ/FASTA format, and can be gzipped. We will need the following datasets, which can be downloaded from the GenomeSpace Public folder:
Public
> RecipeData
> SequenceData
> WT.fastq.gz
: This file contains RNA-seq reads for the wild-type S. cerevisiae genome.
Public
> RecipeData
> SequenceData
> SFP1KO.fastq.gz
: This file contains RNA-seq reads for the mutant S. cerevisiae genome with the SFP1/YLR403W gene knocked out.
Public
> RecipeData
> GenomicFeatureData
> SacCer3.gtf
: This file contains reference gene annotations for the wild-type S. cerevisiae genome.
In this step, we use the UCSC Table Browser to retrieve reference gene annotations corresponding to the reference genome for our example data. If you are using your own data, you may already have a reference gene annotation file, or you may need to search for one matching your reference genome here.
clade
: Othergenome
: S. cerevisiaeassembly
: Apr. 2011 (SacCer_Apr2011/sacCer3)group
: Genes and Gene Predictionstrack
: SGD Genestable
: sgdGeneregion
: genomeoutput format
: GTF - gene transfer formatSend output to
: Select the GenomeSpace checkbox.output file
: Give the output a name. For the example data, we name it sacCer3_SGD.gtf
.
NOTE: If you have not yet associated your GenomeSpace account with your GenePattern account, you will be asked to do so. If you do not yet have a GenePattern account, you can automatically generate a new account that will be associated with your GenomeSpace account.
GenomeSpace
tab, then navigate to the example data file, e.g. Public
> RecipeData
> SequenceData
> WT.fastq.gz
(or to your personal directory).Tool: GenePattern
1. Click on the file (e.g., WT.fastq.gz
) in GenomeSpace, then use the GenePattern context menu and click Launch on File
.
OR
2. Click on the file (e.g., WT.fastq.gz
) in GenomeSpace, then drag it to the GenePattern icon to launch.
We will use TopHat
in GenePattern to align the RNA-seq raw reads to a reference genome. In this recipe, we specifically want to identify existing splice junctions, and not novel splice junctions. This module uses the FASTA/FASTQ files (e.g., WT.fastq.gz
and SFP1KO.fastq.gz
), the reference gene annotations (e.g., SacCer3.gtf
) and a reference genome with an index (pre-built in GenePattern). If you are using your own data, you can load your own reference genome instead.
Modules
tab, and search for "TopHat". Change the following parameters:bowtie index
: choose a genome from the drop-down menu. In this example we use Saccharomyces_cerevisiae_sacCer3_UCSC
reads pair 1
: load the raw RNA-seq data, e.g., Public
> RecipeData
> SequenceData
> SFP1KO.fastq.gz
. To do this, use the GenomeSpace
tab to navigate to the directory containing your FASTQ file, then drag the file to the input box.GTF file
: load the GTF file. To do this, first click the Upload your own file
button. This creates a new area to drag-and-drop files. Next, navigate to the GenomeSpace
tab, and navigate to the folder containing the GTF file: Public
> RecipeData
> GenomicFeatureData
> SacCer3.gtf
. Load the file into the GTF file
parameter by clicking on the file and choosing Send to GTF
, or by dragging the file to the GTF file
input box.output prefix
: choose a filename prefix, e.g., sfp1ko.reads
Miscellaneous Advanced Parameters
> find novel junctions
: noRun
to run TopHat
. This will generate several files; in this recipe, we will use the *.accepted_hits.bam
file.TopHat
with the wild type FASTA/FASTQ file (or any other comparative FASTA/FASTQ file), using the following steps:
TopHat
job number to bring up the context menu for jobs.Reload job
, which will reload the job with the same parameters.reads pair 1
parameter.GenomeSpace
tab, navigate to the second FASTA/FASTQ file (e.g., WT.fastq.gz
). Load it into the reads pair 1
parameter by dragging the file to the input box.wt.reads
TopHat
on WT.fastq.gz
first, then again on another dataset, e.g., SFP1KO.fastq.gz
. It may take several hours to run TopHat
, depending on the size of the files and whether you are using the GenePattern public server.
We will use the Cufflinks.cuffdiff
module to test differential expression between the two samples (e.g., WT.fastq.gz
and SFP1KO.fastq.gz
). This module uses the aligned data output from TopHat
, and the reference gene annotation GTF file.
Then, we convert the FPKM file to a GCT file using the Fpkm_trackingToGct
module in GenePattern.
Cuffdiff
Modules
tab, and search for "Cuffdiff". Once the module is loaded, change the following parameters:aligned files
: Create a new condition for the first FASTA/FASTQ file, e.g., wt. Load the appropriate file (e.g., wt.reads.accepted_hits.bam
) into the aligned files
parameter by clicking and dragging the file into the input box.Add Another Condition
button.aligned files
: Create a second condition for the first FASTA/FASTQ file, e.g., sfp1ko. Load the appropriate file (e.g., sfp1ko.accepted_hits.bam
) into the aligned files
parameter by clicking and dragging the file into the input box.GTF file
: load the GTF file. To do this, first click the Upload your own file
button. This creates a new area to drag-and-drop files. Next, navigate to the GenomeSpace
tab, and navigate to the folder containing the GTF file: Public
> RecipeData
> GenomicFeatureData
> SacCer3.gtf
. Load the file into the GTF file
parameter by clicking on the file and choosing Send to GTF
, or by dragging the file to the GTF file
input box.library type
: fr-unstrandedadvanced input parameters and options
and then set min alignment count
: 0Run
to submit your job. This will generate several files; in this recipe, we will use the genes.fpkm_tracking
file.Fpkm_trackingToGct
Modules
tab, and search for "Fpkm_trackingToGct".input file
: load the genes.fpkm_tracking
file. To do this, navigate to the GenomeSpace tab, and navigating to the folder containing the file. Load the file into the input file
parameter by clicking on the file and choosing Send to Input
, or by dragging the file to the input file
input box.Run
to submit your job. This will generate a genes.gct
file.
We will use the Picard.SortSam
module to prepare the aligned read data for viewing in IGV. In order to visualize BAM alignment files in IGV, we need a companion index file (BAI) which is located in the same directory. The Picard.SortSam
module will sort the file and generate the index. This module uses the *.accepted_hits.bam
files.
Modules
tab, and search for "Picard.SortSam".input file
: load one of the TopHat
aligned reads outputs into the parameter, e.g. wt.reads.accepted_hits.bam
. To do this, use the Jobs
tab, navigate to one of the previous TopHat
jobs, and load the file into the Picard.SortSam
module by dragging the file into the input file
parameter.output format
: BAMRun
to submit your job. This will generate a new sorted BAM file, as well as an index BAI file.
Picard.SortSam
with the other file using these steps:
Picard.SortSam
job number to bring up the context menu for jobs.Reload job
, which will reload the job with the same parameters.Jobs
tab, navigate to other TopHat
job, then load the other aligned reads file, e.g. sfp1ko.reads.accepted_hits.bam
, into the input file
parameter by dragging the file to the input box.Save the genes.gct
file, and the sorted BAM files and the BAI files GenomeSpace using one of the following methods.
genes.gct
), then choose Save to GenomeSpace
. Save the file to your folder.Modules and Pipeline
start page, navigate to Jobs
. Click on the file, then choose Send to GenomeSpace
. Save the file to your folder.Repeat this process for the remaining files. In total, you should save the following five files back to your GenomeSpace account:
genes.gct
wt.reads.accepted_hits.sorted.bam
wt.reads.accepted_hits.sorted.bai
sfp1ko.reads.accepted_hits.sorted.bam
sfp1ko.reads.accepted_hits.sorted.bai
Optional: when you are finished, close GenePattern.
We will use IGV to visualize the differential expression and abundance levels in the genes.gct file, and check the quality of the mapped alignments from the BAM files. In this recipe, we will also validate that the gene of interest (SFP1) was knocked-out in the SFP1KO mutant strain of S. cerevisiae.
NOTE: For Macintosh Users: JNLP files from the internet are labeled insecure. In order to open the JNLP, find the file in your Finder, right-click the file, and press open. This will open a window that will ask for permission to open the file. Press open to access the JNLP.
Launch
, prompting the download of a .jnlp
file. Double-click the file to launch IGV.S. cerevisiae (sacCer3)
genome, as in this example.GenomeSpace
tab to import files by clicking on the Load File from GenomeSpace...
option.
Open
. Load the following files into IGV:wt.reads.accepted_hits.sorted.bam
sfp1ko.reads.accepted_hits.sorted.bam
genes.gct
Now we can visualize the knock-out gene in our datasets. If you are using the example data, type "YLR403W" into the search box, and click Go
or hit enter. Note that there are no reads aligned to the gene in the SFP1KO data and zero expression visible in the sfp1ko_FPKM track. There are reads aligned in the WT sample (although mostly at the 3' end, as expected with 3' DGE data) and a high expression level is visible in the wt_FPKM track.
This is an example interpretation of the results from this recipe. First, we used TopHat
in GenePattern to align raw reads from RNA-seq to a reference genome (pre-built in Bowtie). After aligning both a wild-type and mutant strain of S. cerevisiae, we used Cufflinks
to determine the genes which were differentially expressed when comparing the two phenotypes. We then use IGV to visualize the difference between these two datasets; in particular, we validate that the SFP1KO mutant strain of S. cerevisiae lacks the SFP1 gene.
The wt_FPKM track is red in the region of the SFP1 gene, whereas the sfp1ko_FPKM track is white. This is an indicator of gene expression, and shows that the SFP1 gene is not being transcribed in the SFP1KO mutant strain. We can see an additional indicator of this when we compare the wt.reads.accepted_hits.sorted.bam
file to the sfp1ko.reads.accepted_hits.sorted.bam
file. These are the tracks which have red and blue boxes to indicate where reads aligned against the reference genome. We can see there are several reads aligned in the WT strain to the SFP1 gene, but the area is blank in the SPF1KO mutant, showing that no reads from that gene aligned against the reference genome. These results confirm that the SFP1 gene was knocked-out in the SFP1KO mutant strain of S. cerevisiae. However, the results in this example are not necessarily significant and are only a simple representation of possible results.
Is GenePattern working ok??? I'm trying to use it but I keep getting an error message... "GenePattern Server error preparing job 1314526 for execution.org.genepattern.server.executor.CommandExecutorException: Error adding job to queue, gpJobNo=1314526, DrmaaException=denied: parallel environment "smp" does not exist job rejected: the requested parallel environment "smp" does not exist" Can someone help me???
I forwarded your comment to the GenePattern team at gp-help@broadinstitute.org but they could not find your job on the public server. Could you contact them directly to see if they can help sort this out?