Identify and visualize expressed transcripts in RNA-seq data |
Added by GenomeSpaceTeam on 2015.04.21
Last updated on over 3 years ago.
Which genes are highly expressed, based on my RNA-seq data? Are any of the highly expressed genes also differentially expressed?
This recipe provides an outline of one method to identify and visualize genes and isoforms that are highly expressed in RNA-seq data. In particular, this recipe utilizes an analysis pipeline, allowing a user to chain together multiple analysis steps into one workflow that can be run in one step. An example use of this recipe is a case where an investigator wants to process several datasets in the same way, in which case the pipeline will allow the investigator to re-use the same modules and parameters, over and over again.
Given a set of raw RNA-seq reads, the goal is to align the reads to a reference genome, estimate expression abundance levels for reference genes and isoforms, filter out low-expressed genes and isoforms, and visualize the read alignments and their expression levels. In particular, this recipe uses the UCSC Table Browser to retrieve a reference genome to align RNA-seq reads against. We also uses several modules in GenePattern to align the reads against the reference genome, and to identify differentially expressed genes when comparing two conditions. Finally, we use IGV to visualize the differentially expressed genes.
Why differential expression analysis? We assume that most genes are not expressed all the time, but rather are expressed in specific tissues, stages of development, or under certain conditions. Genes which are expressed in one condition, such as cancerous tissue, are said to be differentially expressed when compared to normal conditions. To identify which genes change in response to specific conditions (e.g. cancer), we must filter or process the dataset to remove genes which are not informative.
To complete this recipe, we will need RNA-seq reads, and reference gene annotations to align the reads against. In this example, we examine human RNA-seq data. The RNA-seq reads are in FASTQ/FASTA > format, and can be gzipped. We will need the following datasets, which can be downloaded from the GenomeSpace Public folder.
Public
> RecipeData
> SequenceData
> RNA_seq.r1.fastq
: this is human RNA-seq data, forward reads.
Public
> RecipeData
> SequenceData
> RNA_seq.r2.fastq
: this is human RNA-seq data, reverse reads.
In this step, we use the UCSC Table Browser to retrieve reference gene annotations corresponding to the reference genome for our example data. If you are using your own data, you may already have a reference gene annotation file, or you may need to search for one matching your reference genome here.
clade
: Mammalgenome
: Humanassembly
: Feb. 2009 (GRCh37/hg19)group
: Genes and Gene Predictionstrack
: UCSC Genesregion
: genomeoutput format
: GTF - gene transfer formatSend output to
: Select the GenomeSpace checkbox.output file
: Give the output a name. For the example data, we name it UCSC_hg19.gtf
.get output
to retrieve the file. This will take you to a new page which loads the file to GenomeSpace. The output file should appear in your GenomeSpace home directory.In this recipe, we want to perform multiple analyses using GenePattern tools, specifically TopHat
and Cufflinks
. We will use TopHat
in GenePattern to align the RNA-seq raw reads to a reference genome. Next, we will use Cufflinks
to assemble transcripts and estimate the expression levels. We can combine these modules into a pipeline to perform the analysis in an efficient manner. Once saved, a pipeline may be reused or re-purposed for subsequent analyses.
NOTE: If you have not yet associated your GenomeSpace account with your GenePattern account, you will be asked to do so. If you do not yet have a GenePattern account, you can automatically generate a new account that will be associated with your GenomeSpace account.
Modules & Pipelines
tab on the GenePattern toolbar. Do not click; instead, hover your cursor over the tab and click New Pipeline
in the drop-down menu to load the Pipeline Designer interface.Pipeline Name
to, e.g., RNA_Exp. Other descriptive information (Description, Author, etc.) may be added by the user as desired.TopHat
to the pipelineSearch Modules
field or the Browse Modules
button, select the TopHat
module. A purple box titled TopHat
will appear in the designer space while the right-hand information panel will change to display TopHat
-specific parameters.GTF.file
checkbox.prebuilt.bowtie.index
context menu, choose "Homo_sapiens_hg19_UCSC".reads.pair.1
checkbox.reads.pair.2
checkbox.library.type
context menu, choose "Standard Illumina (fr-unstranded)".
Picard.SortSam
to the pipelineSearch Modules
field or the Browse Modules
button, select the Picard.SortSam
module. A purple box titled Picard.SortSam
will appear in the designer space while the right-hand information panel will change to display Picard.SortSam
-specific parameters.output.format
context menu, choose BAM.
TopHat
and Picard.SortSam
modules by clicking on the arrow button in the bottom-right corner of the TopHat
box (output) and dragging to the arrow button at the top-left corner of the Picard.SortSam
box next to input.file. A purple arrow should appear, connecting the two modules.Picard.SortSam
output box and choose bam.Cufflinks
to the pipelineSearch Modules
field or the Browse Modules
button, select the Cufflinks
module. A purple box titled Cufflinks
will appear in the designer space while the right-hand information panel will change to display Cufflinks
-specific parameters.GTF.file
to have the pipeline prompt the user to provide this file when run.
Picard.SortSam
and Cufflinks
modules by clicking on the arrow button in the bottom-right corner of the Picard.SortSam
box (output) and dragging to the arrow button at the top-left corner of the Cufflinks
box next to input.file. A purple arrow should appear, connecting the two modules.Cufflinks
output box and choose genes.fpkm_tracking.Fpkm_trackingToGct
to the pipelineSearch Modules
field or the Browse Modules
button, select the Fpkm_trackingToGct
module. A purple box titled Fpkm_trackingToGct
will appear in the designer space while the right-hand information panel will change to display Fpkm_trackingToGct
-specific parameters.Cufflinks
and Fpkm_trackingToGct
modules by clicking on the arrow button in the bottom-right corner of the Cufflinks
box (output) and dragging to the arrow button at the top-left corner of the Fpkm_trackingToGct
box next to input.file. A purple arrow should appear, connecting the two buttons.Save
button to save the pipeline.Run Pipeline
button in the dialog box to run the pipeline.NOTE: A Pipeline Issues dialog box may appear, displaying warnings found in the pipeline. In the case of our model recipe, these warnings may be ignored.
GenomeSpace
tab in the GenePattern menu, navigate to the location of your input files. In this example recipe, we will use files RNA_seq.r1.fastq
and RNA_seq.r2.fastq
.GTF file
parameter, click the Upload your own file
button. Navigate to Your Home Directory
, then click and drag the GTF file, UCSC_hg19.gtf
, to the input box.Public
> RecipeData
> SequenceData
and drag the first FASTQ file, RNA_seq.r1.fastq
, to the reads pair 1
parameter.RNA_seq.r2.fastq
, to the reads pair 2
parameter.GTF file
parameter, click the Upload your own file
button. Navigate to Your Home Directory
, then click and drag the GTF file, UCSC_hg19.gtf
, to the input box.Run
to submit the pipeline job.
NOTE: It may take several hours to run this pipeline, depending on the size of the files and whether you are using the GenePattern public server.
Example output from the pipeline:
In this recipe step, we will save the results from the GenePattern pipeline back to GenomeSpace, for further use..
TopHat1.reads.pair.1.accepted_hits.sorted.bam
TopHat1.reads.pair.1.accepted_hits.sorted.bai
genes.gct
genes.gct
), then choose Save to GenomeSpace
.Modules & Pipelines
start page, navigate to Jobs
. Click on the file, then choose Send to GenomeSpace
.We will use IGV to visualize the differential expression and abundance levels in the genes.gct file, and check the quality of the mapped alignments from the BAM files.
Launch
, prompting the download of a .jnlp file. Double-click the file to launch IGV.File > Load from Server...
.Annotations > Genes
. Choose the UCSC Genes checkbox.OK
to load the dataset.
GenomeSpace
tab to import files by clicking on the Load File from GenomeSpace...
option.Open
. Load the following files into IGV:RNA_seq.r1.accepted_hits.sorted.bam
genes.gct
To visualize the FPKM counts for specific genes, we use the genes.gct
file and the UCSC gene annotation information.
genes.gct
file. Select Preview
from the context menu.tracking_id
). In this example, we use uc001aal.Go
button. A visual representation of the mapped reads will appear.This is an example interpretation of the results from this recipe. First, we used GenePattern to create and run a pipeline which aligned raw reads from RNA-seq to a reference genome (pre-built in Bowtie) using TopHat
, then identified differentially expressed genes using Cufflinks
. We then used IGV to visualize the FPKM counts of the aligned reads for an example gene, uc001aal.
The wt_FPKM track is red in the region of the uc001aal gene, which indicates a high FPKM count for that gene. If we examine the *.accepted_hits.bam track in IGV, we can see that we have several reads aligning to this particular gene in the genome. However, the results in this example are not necessarily significant and are only a simple representation of possible results.