Document originally produced by Chet Birger and Marc-Danie Nazaire (2014). Updated by Sara Garamszegi (2017).
The computational workflow supporting the differential expression (DE) analysis of RNA-Seq data typically consists of several high-level steps; GenePattern supports modules for conducting each of these steps. Today’s exercise makes use of GenomeSpace, GenePattern and IGV. The steps necessary to complete today’s DE analysis of RNA-Seq data are illustrated below:
Slides for this workshop can be found on the public GenomeSpace user account,
SharedData, by navigating to the following folder:
Public > SharedData > Presentations > CEGS_2017-01-31.
This exercise recapitulates and extends (employing gene set enrichment analysis) some of the findings in the following paper by Himes et al. (2014), "RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates Cytokine Function in Airway Smooth Muscle Cells". The authors, using an in vitro model, employed RNA-Seq to comprehensively characterize changes in the transcriptome of airway smooth muscle (ASM) cells in response to treatment with Glucocorticoids (GCs), a common class of medications used to treat inflammatory diseases, including asthma. One of the goals of the study was to increase understanding of how GCs work in ASM cells to alleviate asthma symptoms.
ASM cells were isolated and cultured from lung tissue of four donors. Total RNA was extracted from control and treated ASM cells from each donor culture, yielding 8 total samples, 4 untreated and 4 treated. Treatment consisted of exposure to dexamethasone, a synthetic glucocorticoid (1 for 18 hours).
Sequencing of 75 bp paired-end reads was performed with an Illumina HiSeq 2000 instrument at the Partners HealthCare Center for Personalized Genetic Medicine (Boston, MA). Preliminary processing of raw reads was performed using Casava 1.8 (Illumina, Inc., San Diego, CA).
Data from the study is publicly available from NCBI/GEO/SRA: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE52778
The data the authors posted to NCBI/SRA had already undergone some preliminary trimming. Using the FastQC tool to obtain QC metrics, the authors observed sequence bias in the initial 12 bases on the 5’ ends of reads. The first 12 bases of all reads were trimmed (using the FASTX toolkit), and the trimmed data was submitted to SRA. Thus, the reported read length is 63 bp rather than 75 bp.
Below is the entry in NCBI’s Sequence Read Archive (SRA) for one of the untreated samples, which illustrates the difference in read length:
Description of samples:
|SRA run accession||Donor ID||Phenotype||Base Read Count (Gbp)|
This document contains detailed instructions on how to run through each step in the computational workflow. The GenePattern part of the computational workflow has been previously run on the public GenePattern server. Public read access has been granted to all these jobs and links have been provided for accessing each of these completed jobs, e.g.,
http://bit.ly/ZxbsqV: job status page for TopHat alignment job on sample SRR1039508
Enter the link in your browser’s address bar (make sure you have already logged into the public GenePattern server), and the job status page from the previously run job will appear, e.g.,
To re-run a GenePattern job from its job status page, click on the module name (e.g., TopHat), and a slide-out menu will appear. Clicking on the Reload Job selection from that menu will allow you to re-run that job.
Since we are working with the complete dataset from an actual study, the processing time required to run through the workflow (hours) far exceeds the time allotted to us in the workshop. We will be examining the results of the previously run jobs. You may re-run this workflow, outside of the workshop, by following the detailed instructions below.
We begin by downloading the raw sequence data. This data was originally obtained by using GenePattern’s SraToFastQ module to retrieve the RNA-Seq read data from NCBI’s Sequence Read Archive (SRA) website, and converting it from the compressed SRA format to the FASTQ format. However, in this exercise we provide the converted, paired FASTQ files on a GenomeSpace public folder for users. Each sample has two gzipped files. For example, sample SRR1039508 has files SRR1039508_1.fastq.gz, and SRR1039508_2.fastq.gz.
Links to the files on GenomeSpace (NOTE: clicking the links will begin downloading the files!):
|SRA run accession||Donor ID||Phenotype||Link to read_1||Link to read _2|
We will run quality control checks on each of the FASTQ files before alignment. We utilize a GenePattern module that runs the FastQC quality control tool for high-throughput sequence data (Babraham Institute, 2012). This tool detects irregularities in the short reads data and generates a report that guides subsequent data trimming and filtering. Typically, one would run the FastQC module on each RNA-Seq reads input file. For paired-end experiments, each mate of a pair of read files undergoes FastQC analysis separately. We will use GenePattern’s batch feature to launch two FastQC jobs per sample, one for each of the two mated read files.
Below is an example output from running FastQC, for the sample SRR1039508_1.fastq.gz:
Links to the job status pages for runs of FastQC on all eight samples:
|SRA run accession||Donor ID||Phenotype||FastQC Job Status page for read _1||FastQC Job Status page for read _2|
To view the results of the job, click on any of the …fastqc/fastqc_report.html output files. For example, to view the results for the first mate in the paired sample SRR1039508, click on the file, SRR1039508_1_fastqc/fastqc_report.html. A menu will slide out. Select Open Link to open the FastQC report in a new tab.
Below is the FastQC report generated from running the module on SRR109508_1.fastq.gz:
Three areas are flagged as requiring some level of review: Sequence Duplication Levels, Overrepresented sequences and Kmer Content.
An error flag is raised for Sequence Duplication Levels if non-unique sequences make up more than 50% of the total number of reads. A warning or error flag in this FastQC module is simply a statement that you have exhausted the diversity in at least part of your library and are re-sequencing the same sequences repeatedly. Sequences from different transcripts will be present at a wide range of different levels in the starting population. In order to be able to observe lowly expressed transcripts it is therefore common to greatly over-sequence highly expressed transcripts, leading to large sets of duplicates. Thus moderate to high levels of duplication may indicate a very high level of coverage of highly expressed transcripts. A high level of duplication may also indicate some kind of enrichment bias (e.g., PCR over-amplification).
The Overrepresented sequences report indicates the presence of a contaminant TruSeq adapter sequence at relatively high level. Contaminant sequences may be removed using GenePattern’s Trimmomatic module (see STEP 3). An example of an overrepresented sequence from a contaminating TruSeq adapter is shown below:
Reviewing the FastQC reports for all 16 fastq files (two per sample) shows that a handful of the fastq files have evidence of adapter contamination, and most have reports of sequence duplication levels at the warning or error levels. Most of the fastq files were also flagged at the warning level for Kmer content.
In the next step we will run the Trimmomatic module on each sample’s reads to eliminate contaminant sequences.
Here, we will run GenePattern’s Trimmomatic module on each sample’s paired end reads. The module directly exposes through its GUI six of the most frequently used trimming/filtering operations on raw RNA-Seq data. These are:
We will enable Trimming Illumina Adapters and Minimum-length Read Filtering. Short reads are more likely to align to multiple sites on a reference genome, and thus provide ambiguous information. After adapter clipping operations have been completed, you should filter the reads so that they all satisfy a specific minimum length. To enable read length filtering, set the min read length parameter. In this exercise, we set the min read length parameter to 50 bp. After completing the trimming operation, we re-run FastQC on the trimmed forward reads for sample SRR1039508 (i.e. SRR1039508_1.fastq.gz) and observe the elimination of the contaminating TruSeq adapter sequences.
Links to the job status pages for runs of Trimmomatic on all eight samples:
|SRA run accession||Donor ID||Phenotype||Trimmomatic Job Status page|
For paired-end input, Trimmomatic creates four FASTQ output files:
For paired-end inputs, it is Trimmomatic’s paired output files (containing _#P in their names) that should undergo read alignment. For example, for sample SRR1039508, the SRR1039508_1P.fastq.gz and SRR1039508_2P.fastq.gz files would be further analyzed.
Once the trimming and filtering steps are completed, we can run the FastQC module on the Trimmomatic output files to observe the elimination of contaminant adapter sequences. For example, we can run SRR1039508_1P.fastq.gz and look at resulting fastqc_report.html.
To view the results of the job, click on the …fastqc/fastqc_report.html output file. For example, for the trimmed sample SRR1039508_1P, the file is SRR1039508_1P_fastqc/fastqc_report.html. A menu will slide out, select Open Link to open the FastQC report in a new tab.
Below is the FastQC report generated from running the module on SRR109508_1P.fastq.gz:
We can see that the Overrepresented sequences section is no longer flagged as having issues. You can see the full FastQC report for sample SRR1039508_1P at the following link:
http://bit.ly/1p61J0P: job status page for FastQC job on SRR1039508_1P.fastq.gz
Note: TopHat’s alignment engine (Bowtie) takes into account base read quality scores when aligning reads. Thus, if using TopHat, there is no need to conduct quality base trimming before aligning the reads.
TopHat is an alignment algorithm using the Bowtie alignment engine. It is typically run in one of the following three modes:
TopHat records read alignments to a BAM (binary version of SAM) formatted file. It attaches a score to each alignment. This alignment score takes into account the read quality scores captured in the input FASTQ files as well as mismatches and indels.
In this exercise, we will align to the known transcript sequences only (Mode #3), relying on the UCSC hg19 reference annotation.
We must provide TopHat with either a genome annotation file (the GTF file parameter) or a transcriptome index (the transcriptome index parameter). If we provide a GTF file, TopHat builds a transcriptome sequence and a corresponding Bowtie index. Creating the Bowtie index for the transcriptome can be time-consuming. TopHat allows the reuse of a previously created transcriptome index to avoid its costly regeneration in each TopHat run. An initial run of TopHat can be made whose sole purpose is the creation of the transcriptome index: the bowtie index and GTF file parameters should be set, but the transcriptome index and read pairs parameters left blank. In subsequent runs, the directory containing the generated transcriptome index can be provided via the transcriptome index parameter.
First, we run TopHat to generate a transcriptome index for UCSC’s hg19 reference annotation, which we will use in subsequent TopHat runs to align each sample’s trimmed reads to the reference transcriptome.
The result of this job will be a new Bowtie reference transcriptome index for UCSC’s hg19 reference annotation. The result of the job can be found at the following link:
http://bit.ly/1BTrDwL: job status page for TopHat job generating a trancriptome index
Links to the job status pages for runs of TopHat on all eight samples:
|SRA run accession||Donor ID||Phenotype||Link to Tophat Job Status page|
The goal of summarization/quantitation is to accurately derive gene and transcript expression values from the aligned reads and a genome annotation. The annotation may have been constructed from the aligned reads or be an externally generated reference annotation. In the summarization/quantitation step, reads are assigned to the annotation’s transcripts that are most likely to have served as the transcription template for the sequenced cDNA.
Summarization/quantitation is not a simplistic accounting of the number of reads that map to each feature. Early methods for quantitating gene expression from RNA-Seq data counted the reads that mapped to the exons of each gene and divided each count by a scaling factor that was based on the total length of a gene’s exons. These “raw count” methods were shown to be less accurate than “isoform deconvolution” methods that estimate the expression levels of a gene’s alternative isoforms (transcripts) and then sum those transcript-level estimates to derive gene-level expression estimates.
Estimating transcript-level expression levels is complicated by the fact that a gene’s multiple isoforms often share exons and hence there is uncertainty in assigning a read to a specific isoform if the read maps to a shared exon. In other words, splicing structure introduces uncertainty that must be accounted for when deriving expression estimates from mapped reads. Many eukaryotic genes have multiple isoforms that share exons, leading to uncertainty around the assignment of a read to a particular transcript.
Normalization of the expression estimates permits accurate comparisons of features’ (transcripts or genes) expression levels within a single sample, and comparison of a single feature’s expression across different samples (phenotypes or conditions).
Following summarization/quantitation and normalization, differences in estimated transcript and gene expression levels across conditions can be analyzed. Differences in a feature’s estimated expression values across two conditions will be impacted by both true differences in expression levels and the inherent noise in the experimental data. Sources of experimental variability include sampling variance, technical variance, and biological variance. In addition to these three sources of experimental variance, RNA-Seq expression estimates are also subject to variability resulting from read assignment uncertainties. If there is a lot of similarity between a gene’s isoforms (e.g., shared exons) there will be a great deal of uncertainty attached to the expression estimates for each of those isoforms, which in turn will lead to uncertainty in the gene-level expression estimates.
Thus, differential expression analysis of RNA-Seq data requires the modeling of both experimental variation and read assignment uncertainty in order to assess the statistical significance of any observed difference in a transcript’s or gene’s expression levels across two conditions. In other words ,without a means of accurately assessing the amount of variability and uncertainty present in a feature’s expression estimates, one cannot determine whether observed differences in that feature’s expression levels across two conditions is the result of true differential expression, or an artifact of the inherent noise in the underlying expression estimates. Differential expression analysis tools, like Cuffdiff, calculate a variance for each transcript’s or gene’s expression estimates and uses that variance to estimate the variance of whatever statistic is employed to measure differential expression, e.g., log-fold change.
By conducting both quantification and differential analysis, Cuffdiff takes into account the uncertainty of mapping reads to isoforms when calculating variances of transcript-level expression estimates. This leads to more accurate assessments of the statistical significances of the observed changes in expression levels across conditions.
This job will create one output representing the comparison of untreated and dexamethasone-treated samples, for all 8 samples:
http://bit.ly/1poBEt9: job status page for Cuffdiff job.
CummeRbund is an R/Bioconductor package that supports analysis and visualization of the data produced by the Cuffdiff tool, both the expression datasets coming out of the quantitation and normalization steps and the test results from Cuffdiff’s differential expression analysis.
GenePattern supports four modules that utilize the CummeRbund package, which creates sets of plots and visualizations to better understand experimental results.
In this exercise, we will use the CummeRbund.QcReport module to obtain global statistics and plots, and to identify the top differentially expressed genes, based on specific thresholds and cut-off values. In particular, this module lists those genes that CuffDiff found to be differentially expressed with a Q-value (FDR) less than the specified FDR threshold, which by default is set to 0.05. These files can be found in the output file, QC_sig_diffExp_genes.txt (see below).
This job will create one output representing the comparison of untreated and dexamethasone-treated samples, for all 8 samples:
http://bit.ly/1uz8Vpt: job status page for CummeRbund.QcReport job
Finally, we are interested in visualizing the results of our alignment. To do this, we will use the Integrative Genomics Viewer (IGV), a high-performance visualization tool for interactive exploration of large, integrated genomic datasets (Robinson et al. 2011). It supports a wide variety of data types, including array-based and next-generation sequence data, and genomic annotations.
We will load data into IGV directly from GenomeSpace, and visualize three sets of data:
We can now explore in IGV and visualize our results. You should see something that looks similar to the image below:
Here, we have a genome-level view of our results. To look at gene-specific results, you can enter the name of a gene in the search bar. From our results, we know that certain genes are differentially expressed when comparing the untreated and dex-treated phenotypes. We predict that these genes may have interesting patterns of aligned reads or expression.
Try searching for some of the following genes in IGV: C7, CCDC69, DUSP1, FKBP5, GPX3, KLF15, MAOA, SAMHD1, SERPINA3, SPARCL3, TSC22D3, CRISPLD2, C13orf15, PER1, KCTD12, ERRFI1, STEAP4.
For an example, see the expression profile of the gene C7. Here, we see that there is higher expression in C7 in the dex phenotype (q01) than the untreated phenotype (q00). If we look at some BAM files representing these phenotypes, we similarly observe that there are fewer reads aligned against this gene in the untreated phenotype (SRR1039508) than the dex-treated phenotype (SRR1039509). The difference in the number of reads aligning to the gene location complements the difference observed in the gene-level expression profile of the untreated and dex-treated phenotypes.