Identify and visualize expressed transcripts in RNA-seq data |
Added by GenomeSpaceTeam on 2015.04.21
Last updated on over 3 years ago.
Which genes are highly expressed, based on my RNA-seq data? Are any of the highly expressed genes also differentially expressed?
This recipe provides an outline of one method to identify and visualize genes and isoforms that are highly expressed in RNA-seq data. In particular, this recipe utilizes an analysis pipeline, allowing a user to chain together multiple analysis steps into one workflow that can be run in one step. An example use of this recipe is a case where an investigator wants to process several datasets in the same way, in which case the pipeline will allow the investigator to re-use the same modules and parameters, over and over again.

Given a set of raw RNA-seq reads, the goal is to align the reads to a reference genome, estimate expression abundance levels for reference genes and isoforms, filter out low-expressed genes and isoforms, and visualize the read alignments and their expression levels. In particular, this recipe uses the UCSC Table Browser to retrieve a reference genome to align RNA-seq reads against. We also uses several modules in GenePattern to align the reads against the reference genome, and to identify differentially expressed genes when comparing two conditions. Finally, we use IGV to visualize the differentially expressed genes.
Why differential expression analysis? We assume that most genes are not expressed all the time, but rather are expressed in specific tissues, stages of development, or under certain conditions. Genes which are expressed in one condition, such as cancerous tissue, are said to be differentially expressed when compared to normal conditions. To identify which genes change in response to specific conditions (e.g. cancer), we must filter or process the dataset to remove genes which are not informative.
To complete this recipe, we will need RNA-seq reads, and reference gene annotations to align the reads against. In this example, we examine human RNA-seq data. The RNA-seq reads are in FASTQ/FASTA > format, and can be gzipped. We will need the following datasets, which can be downloaded from the GenomeSpace Public folder.
Public > RecipeData > SequenceData > RNA_seq.r1.fastq: this is human RNA-seq data, forward reads.
Public > RecipeData > SequenceData > RNA_seq.r2.fastq: this is human RNA-seq data, reverse reads.
1: Getting reference gene annotations
In this step, we use the UCSC Table Browser to retrieve reference gene annotations corresponding to the reference genome for our example data. If you are using your own data, you may already have a reference gene annotation file, or you may need to search for one matching your reference genome here.
clade: Mammalgenome: Humanassembly: Feb. 2009 (GRCh37/hg19)group: Genes and Gene Predictionstrack: UCSC Genesregion: genomeoutput format: GTF - gene transfer formatSend output to: Select the GenomeSpace checkbox.output file: Give the output a name. For the example data, we name it UCSC_hg19.gtf.get output to retrieve the file. This will take you to a new page which loads the file to GenomeSpace. The output file should appear in your GenomeSpace home directory.
2: Creating a pipeline of modules in GenePattern
In this recipe, we want to perform multiple analyses using GenePattern tools, specifically TopHat and Cufflinks. We will use TopHat in GenePattern to align the RNA-seq raw reads to a reference genome. Next, we will use Cufflinks to assemble transcripts and estimate the expression levels. We can combine these modules into a pipeline to perform the analysis in an efficient manner. Once saved, a pipeline may be reused or re-purposed for subsequent analyses.
NOTE: If you have not yet associated your GenomeSpace account with your GenePattern account, you will be asked to do so. If you do not yet have a GenePattern account, you can automatically generate a new account that will be associated with your GenomeSpace account.
Modules & Pipelines tab on the GenePattern toolbar. Do not click; instead, hover your cursor over the tab and click New Pipeline in the drop-down menu to load the Pipeline Designer interface.Pipeline Name to, e.g., RNA_Exp. Other descriptive information (Description, Author, etc.) may be added by the user as desired.TopHat to the pipelineSearch Modules field or the Browse Modules button, select the TopHat module. A purple box titled TopHat will appear in the designer space while the right-hand information panel will change to display TopHat-specific parameters.GTF.file checkbox.prebuilt.bowtie.index context menu, choose "Homo_sapiens_hg19_UCSC".reads.pair.1 checkbox.reads.pair.2 checkbox.library.type context menu, choose "Standard Illumina (fr-unstranded)".

Picard.SortSam to the pipelineSearch Modules field or the Browse Modules button, select the Picard.SortSam module. A purple box titled Picard.SortSam will appear in the designer space while the right-hand information panel will change to display Picard.SortSam-specific parameters.output.format context menu, choose BAM.

TopHat and Picard.SortSam modules by clicking on the arrow button in the bottom-right corner of the TopHat box (output) and dragging to the arrow button at the top-left corner of the Picard.SortSam box next to input.file. A purple arrow should appear, connecting the two modules.Picard.SortSam output box and choose bam.Cufflinks to the pipelineSearch Modules field or the Browse Modules button, select the Cufflinks module. A purple box titled Cufflinks will appear in the designer space while the right-hand information panel will change to display Cufflinks-specific parameters.GTF.file to have the pipeline prompt the user to provide this file when run.

Picard.SortSam and Cufflinks modules by clicking on the arrow button in the bottom-right corner of the Picard.SortSam box (output) and dragging to the arrow button at the top-left corner of the Cufflinks box next to input.file. A purple arrow should appear, connecting the two modules.Cufflinks output box and choose genes.fpkm_tracking.Fpkm_trackingToGct to the pipelineSearch Modules field or the Browse Modules button, select the Fpkm_trackingToGct module. A purple box titled Fpkm_trackingToGct will appear in the designer space while the right-hand information panel will change to display Fpkm_trackingToGct-specific parameters.Cufflinks and Fpkm_trackingToGct modules by clicking on the arrow button in the bottom-right corner of the Cufflinks box (output) and dragging to the arrow button at the top-left corner of the Fpkm_trackingToGct box next to input.file. A purple arrow should appear, connecting the two buttons.Save button to save the pipeline.Run Pipeline button in the dialog box to run the pipeline.
NOTE: A Pipeline Issues dialog box may appear, displaying warnings found in the pipeline. In the case of our model recipe, these warnings may be ignored.
GenomeSpace tab in the GenePattern menu, navigate to the location of your input files. In this example recipe, we will use files RNA_seq.r1.fastq and RNA_seq.r2.fastq.GTF file parameter, click the Upload your own file button. Navigate to Your Home Directory, then click and drag the GTF file, UCSC_hg19.gtf, to the input box.Public > RecipeData > SequenceData and drag the first FASTQ file, RNA_seq.r1.fastq, to the reads pair 1 parameter.RNA_seq.r2.fastq, to the reads pair 2 parameter.GTF file parameter, click the Upload your own file button. Navigate to Your Home Directory, then click and drag the GTF file, UCSC_hg19.gtf, to the input box.Run to submit the pipeline job.

NOTE: It may take several hours to run this pipeline, depending on the size of the files and whether you are using the GenePattern public server.
Example output from the pipeline:

4: Saving results back to GenomeSpace
In this recipe step, we will save the results from the GenePattern pipeline back to GenomeSpace, for further use..
TopHat1.reads.pair.1.accepted_hits.sorted.bamTopHat1.reads.pair.1.accepted_hits.sorted.baigenes.gctgenes.gct), then choose Save to GenomeSpace.Modules & Pipelines start page, navigate to Jobs. Click on the file, then choose Send to GenomeSpace.
5: Loading data into IGV
We will use IGV to visualize the differential expression and abundance levels in the genes.gct file, and check the quality of the mapped alignments from the BAM files.
Launch, prompting the download of a .jnlp file. Double-click the file to launch IGV.File > Load from Server....Annotations > Genes. Choose the UCSC Genes checkbox.OK to load the dataset.

GenomeSpace tab to import files by clicking on the Load File from GenomeSpace... option.Open. Load the following files into IGV:RNA_seq.r1.accepted_hits.sorted.bamgenes.gct
6: Visualizing transcripts
To visualize the FPKM counts for specific genes, we use the genes.gct file and the UCSC gene annotation information.
genes.gct file. Select Preview from the context menu.tracking_id). In this example, we use uc001aal.Go button. A visual representation of the mapped reads will appear.
This is an example interpretation of the results from this recipe. First, we used GenePattern to create and run a pipeline which aligned raw reads from RNA-seq to a reference genome (pre-built in Bowtie) using TopHat, then identified differentially expressed genes using Cufflinks. We then used IGV to visualize the FPKM counts of the aligned reads for an example gene, uc001aal.
The wt_FPKM track is red in the region of the uc001aal gene, which indicates a high FPKM count for that gene. If we examine the *.accepted_hits.bam track in IGV, we can see that we have several reads aligning to this particular gene in the genome. However, the results in this example are not necessarily significant and are only a simple representation of possible results.