Content galaxy gp

Preprocess and quality check RNA-seq data

Added by GenomeSpaceTeam on 2015.04.21 Official logo
Last updated on over 1 year ago.


Summary

What is the quality of my RNA-seq reads? Are my RNA-seq reads of good quality or are there artifacts and errors I should remove and/or be aware of?

This recipe provides an outline of one method to preprocess and determine the quality of RNA-seq libraries. An example use of this recipe is a case where an investigator has completed sequencing of samples, and wants to evaluate and check the quality of the sequencing results before going through the time-consuming process of aligning the reads against a genome.

 

Given a set of raw RNA-seq reads, the goal is to align the reads to a reference genome and assess the quality of the read alignments by obtaining metrics such as depth of coverage, rRNA contamination, continuity of coverage, and GC bias. The purpose of this recipe is to process raw RNA-seq reads prior to aligning them against a reference genome. In particular, this recipe uses a dataset which has RNA-seq reads contaminated with adapter sequences. First we identify and remove adapter sequences using Galaxy, then we process the data and align it to a reference genome using GenePattern.

Why remove adapter sequences? Adapter sequences are short pieces of DNA with known sequence, which are used to link DNA molecules together. In particular, these are used to link an unknown DNA sequence to a sequencing primer, in order to sequence the unknown DNA strand. Most of the time adapter sequences are removed during pre-processing when data is sent to a researcher; however, sometimes some adapter sequences remain. It is important to remove these sequences before aligning reads to a genome.

 

Inputs

To complete this recipe, we will need raw RNA-seq reads that we will process and align to a reference genome. In this example, we will use RNA-seq data of a S. cerevisiae wild-type strain. In this recipe we will focus on pre-processing and aligning reads. We will need the following datasets, which can be downloaded from the following GenomeSpace Public folder: Public > RecipeData.

SequenceData > RNA-Seq.fastq: This is an RNA-seq file containing reads from the wild-type strain of S. cerevisiae.

SequenceData > SacCer3.fa: This is a FASTA file containing sequences for a reference genome for S. cerevisiae.

GenomicFeatureData > SacCer3.gtf: This is a GTF file which contains reference gene annotations for the S. cerevisiae genome.

Outputs

Recipe steps

  • Galaxy
    1. Loading data
    2. Assessing the quality of RNA-seq reads using FastQC
    3. Removing adapter sequences from RNA-seq data
    4. Transferring the clipped FASTQ files to GenomeSpace
  • GenePattern
    1. Loading data
    2. Aligning RNA-seq reads
    3. Processing the reference genome
    4. Reordering and annotating an RNA-seq dataset
    5. Identifying duplicate reads
    6. Creating indices for aligned read files

NOTE: If you have not yet associated your GenomeSpace account with your Galaxy account, you will be asked to do so. If you do not yet have a Galaxy account, you can automatically generate a new account that will be associated with your GenomeSpace account.

  1. Open Galaxy from GenomeSpace, navigate to the Get Data tool, then click on GenomeSpace import, then navigate to your personal directory (eg. SequenceData > RNA-Seq.fastq)

1. Click on the file (e.g., RNA-Seq.fastq) in GenomeSpace, then use the Galaxy context menu and click Launch on File.

OR

2. Click on the file (e.g., RNA-Seq.fastq) in GenomeSpace, then drag it to the Galaxy icon to launch.

  1. Navigate to the following menu: NGS: QC and manipulation > FastQC:Read Quality reports
  2. In the dialog box, enter the following parameters:
    1. Short read data from your current history: Your fastq file (e.g., RNA-Seq.fastq)
      NOTE: If you used the GenomeSpace importer, the file will be prefaced with, e.g. 1: GenomeSpace importer on RNA-Seq.fastq
  3. Click the Execute button to submit your job. This will generate a FastQC report as an HTML document.

Find Adapter Sequences

  1. Open your FastQC report by clicking on the eye icon next to the job titled FastQC on data 1: RawData.
  2. Scroll down to or click on the Overrepresented sequences link. You should see a table similar to the one below:
    Sequence Count Percentage Possible Source
    GATCGGAAGAGCGGTTCAGCAG... 8122 8.122 Illumina Paired End PCR Primer 2 (100% over 40bp)
    GATCGGAAGAGCGGTTCAGCAG... 5086 5.086 Illumina Paired End PCR Primer 2 (97% over 36bp)
    AATGATACGGCGACCACCGAGA... 1085 1.085 Illumina Single End PCR Primer 1 (100% over 40bp)
    . . . . . . . . . . . .
    CGGTTCAGCAGGAATGCCGAGA... 113 0.1129 Illumina Paired End PCR Primer 2 (96% over 25bp)
  3. You can identify adapter sequences by looking for the most overrepresented sequences in the reads. In the example above, there is one sequence which occurs in 8.122% of the reads in our dataset. This sequence maps to the adapter "Illumina Paired End PCR Primer 2 (100% over 40bp)".
  4. If an adapter sequence is identified, highlight the sequence and copy it to the clipboard using ctrl+c.
    NOTE:Don't use ctrl+c until the adapter sequence is pasted into the Clip tool or else the adapter sequence will be overwritten.
  5. Next we will use the FASTQ Groomer tool to convert the dataset from one FASTQ format to another. Then, we will use the Clip tool to remove the adapter sequences from the RNA-seq dataset. This tool uses the original FASTQ files.

 

FASTQ Groomer

  1. Navigate to the following menu: NGS: QC and manipulation > FASTQ Groomer
  2. In the dialog box, enter the following parameters:
    1. File to groom: Choose your fastq file, e.g., RNA-Seq.fastq
  3. Click Execute to submit your job. This will generate a new FASTQ file which has been converted into a different format.

Clip

Now, we remove the adapter sequences from the RNA-seq data by using Galaxy's Clip tool.

  1. Once the job has finished running, navigate to the following menu: NGS: QC and manipulation > Clip
  2. In the dialog box, enter the following parameters:
    1. Library to clip: Choose the FASTQ output from the FASTQ Groomer tool
    2. Source: Enter custom sequence
    3. Enter custom clipping sequence: Paste the previously copied overrepresented adapter sequence
    4. Output option: Output both clipped and non-clipped sequences
      NOTE: This parameter choice is specific to this recipe and need not be changed in personalized analyses.
  3. Click Execute to submit your job. This will generate a new FASTQ file which has the adapter sequences removed.

OPTIONAL: To observe the improved quality of the FASTQ after adapter sequences are removed, re-run FastQC on the clipped FASTQ file. Under Basic statistics, you should see that the sequence length is now variable, because some reads did not have adapter sequences and so were not clipped; other sequences contained the adapter sequence, and are therefore much shorter. You should also notes that the rest of metrics have improved in score; previously we had many bad quality scores (e.g. WARNING or FAIL), now we have more scores that pass the quality threshold (OK). If we examine the Overrepresented sequences statistic, we see that only 1.4% of our reads now contain an adapter sequence. If desired, you may again copy and clip this adapter sequence from your RNA-seq dataset.

  1. Navigate to the following menu: Send Data > GenomeSpace Exporter
  2. In the dialog box, enter the following parameters:
    1. Send this dataset to GenomeSpace: Your clipped FASTQ file
    2. Choose target directory: navigate to your GenomeSpace directory
    3. filename: choose a filename, e.g., S.cerevisiae_wt_clipped.fastq
  3. Click Execute to submit your job. This will send your data to GenomeSpace.

  4. OPTIONAL: close Galaxy.

NOTE: If you have not yet associated your GenomeSpace account with your GenePattern account, you will be asked to do so. If you do not yet have a GenePattern account, you can automatically generate a new account that will be associated with your GenomeSpace account.

  1. Open GenePattern from GenomeSpace, navigate to the GenomeSpace tab, then navigate to your personal directory. (e.g., S.cerevisiae_wt_clipped.fastq)

1. Click on the file (e.g., S.cerevisiae_wt_clipped.fastq) in GenomeSpace, then use the GenePattern context menu and click Launch on File.

OR

2. Click on the file (e.g., S.cerevisiae_wt_clipped.fastq) in GenomeSpace, then drag it to the GenePattern icon to launch.

  1. Change to the Modules tab, and search for "TopHat". For this example recipe, we use TopHat Version 8. Once the module is loaded, make sure the Version is set to 8. Change the following parameters:
  2. bowtie index: choose a genome from the drop-down menu. In this example we use Saccharomyces_cerevisiae_sacCer3_UCSC.
  3. reads pair 1: load the raw RNA-seq data, e.g., S.cerevisiae_wt_clipped.fastq. To do this, use the GenomeSpace tab to navigate to the directory containing your FASTQ file, then drag the file to the input box.
  4. GTF file: load the GTF file. To do this, navigate to the GenomeSpace tab, and navigating to the folder containing the GTF file. Load the file into the GTF file parameter by clicking on the file and choosing Send to GTF, or by dragging the file to the GTF file input box.
  5. output prefix: choose a filename prefix, e.g., wt_reads
  6. Click Run to run TopHat. This will generate several files; in this recipe, we will use the *.accepted_hits.bam file.
    NOTE: Each file of reads must be aligned separately; i.e., run TopHat on S.cerevisiae_wt_clipped.fastq first, then again on another dataset, e.g., S.cerevisiae_mutant_clipped.fastq. It may take several hours to run TopHat, depending on the size of the files and whether you are using the GenePattern public server.

  7. (optional, not done in default tutorial) To run another FASTA/FASTQ file, use the following steps:
    1. Click on the TopHat job number to bring up the context menu for jobs.
    2. Click on Reload job, which will reload the job with the same parameters.
    3. Remove the original FASTA/FASTQ reads file from the job, by clicking on the  next to the file in the reads pair 1 parameter.
    4. Using the GenomeSpace tab, navigate to the second FASTA/FASTQ file (e.g., S.cerevisiae_mutant_clipped.fastq). Load it into the reads pair 1 parameter by dragging the file to the input box.

Create Sequence Dictionary

  1. Change to the Modules tab, and search for "Picard.CreateSequenceDictionary".
  2. Change to the GenomeSpace tab, navigate to the directory containing your reference genome FASTA file, and drag the file to the reference sequence file parameter box.
  3. Click Run to run the module. This will generate a dictionary file for the reference genome, e.g., sacCer3.dict

  4. OPTIONAL: Save the file to GenomeSpace by clicking on the file, then choosing Save to GenomeSpace, navigating to the directory containing your other files (e.g., sacCer3.fa), and then clicking Save.

FASTA Index

  1. Change to the Modules tab, and search for "SAMtools.FastaIndex".
  2. Change to the GenomeSpace tab, navigate to the directory containing your reference genome FASTA file, and drag the file to the fasta file parameter box.
  3. In the output prefix parameter, enter a filename, e.g., sacCer3_indexed. This will prevent GenePattern from overwriting the original file.
  4. Click Run to run the module. This will generate a new fasta file (.fa) and a fasta index file (.fai) for the reference genome, e.g., sacCer3_indexed.fa, and sacCer3_indexed.fa.fai.

  5. OPTIONAL: Save each file to GenomeSpace by clicking on the file, then choosing Save to GenomeSpace, navigating to the directory containing your other files (e.g., sacCer3.fa), and then clicking Save. To save the second file, navigate to the Jobs tab, find the file in the list, then repeat the process.

Reorder SAM

  1. Change to the Modules tab, and search for "Picard.ReorderSAM".
  2. Change to the Jobs tab, and find the outputs from your previous TopHat job, then drag the BAM file (e.g., wt_reads.accepted_hits.bam) to the input file parameter.
  3. Change to the GenomeSpace tab, and navigate to the location of your original reference genome file, and the reference dictionary file. Drag the reference genome file (e.g., sacCer3.fa) to the reference file parameter.
  4. Drag the reference genome dictionary file (e.g., sacCer3.dict) to the reference sequence dictionary parameter.
  5. Click Run to run the module. This will generate a reordered BAM file, e.g., wt_reads.accepted_hits.reorder.bam

  6. OPTIONAL: Save the file to GenomeSpace by clicking on the file, then choosing Save to GenomeSpace, navigating to the directory containing your other files (e.g., sacCer3.fa), and then clicking Save

Add or Replace Read Groups

  1. Change to the Modules tab, and search for "Picard.AddorReplaceReadGroups".
  2. Change to the Jobs tab, and find the output from the Picard.ReorderSam job, then drag the reordered BAM file (e.g., wt_reads.accepted_hits.reorder.bam) to the input file parameter
  3. Change the following parameters:
    1. read group id: Test
    2. read group library: Testing
    3. read group platform: Illumina
    4. read group platform unit: T1
    5. read group sample name: T2
  4. Click Run to run the module. This will generate a re-grouped BAM file, e.g., wt_reads.accepted_hits.reorder.rgroup.bam

  5. OPTIONAL: Save the file to GenomeSpace by clicking on the file, then choosing Save to GenomeSpace, navigating to the directory containing your other files (e.g., sacCer3.fa), and then clicking Save.
  1. Change to the Modules tab, and search for "Picard.MarkDuplicates".
  2. Change to the Jobs tab, and find the outputs from the Picard.AddOrReplaceReadGroups job, then drag the modified BAM file (e.g., wt_reads.accepted_hits.reorder.rgroup.bam) to the input file parameter
  3. Click Run to run the module. This will generate a new BAM file containing the marked duplicates (e.g., wt_reads.accepted_hits.reorder.rgroup.mdup.bam), and a TXT file of metrics.

  1. Change to the Modules tab, and search for "Picard.SortSam". Click on the Jobs tab, then navigate to one of the Picard.MarkDuplicates job.
  2. Load the BAM file into the Picard.SortSam module by dragging the file into the input file parameter.
  3. Set the output format parameter to: BAM
  4. Click Run to submit your job. This will generate a new sorted BAM file, as well as an index BAI file.

  5. Save the sorted BAM files and the BAI files to GenomeSpace using one of the following methods:
    1. From the job processing view, on the text file (e.g., wt_reads.accepted_hits.reorder.rgroup.mdup.sorted.bam), then choose Save to GenomeSpace.
    2. From the Modules & Pipeline start page, navigate to Jobs. Click on the file, then choose Send to GenomeSpace.
  6. OPTIONAL: close GenePattern.

Results Interpretation

This is an example interpretation of the results from this recipe. In particular, we will focus on interpreting the results from the FastQC tool in Galaxy.

The Per base sequence quality graph shows the quality of each base from the start to the end of a read. In general, areas in green are considered good scores (>30). We can see that toward the end of our reads we have some bad scores, e.g. at position 38, our reads have large error bars which potentially reflect bad scores (>20). The Per sequence GC content graph shows the distribution of GC content for our reads. The blue line indicates the predicted, normal distribution, and the red line indicates the distribution from our read. There is an obvious spike around 55-59% GC content; this spike can indicate contamination from adapter sequences. The Overrepresented sequences table shows that there is one sequence which is observed in over 8% of the reads in our dataset. In addition, FastQC checks the sequences against a list of known adapter sequences, and in this case our sequence matches an Illumina paired end PCR primer. Therefore, we know this is a definite source of error in our reads.

After removing the adapter sequences, we can run FastQC again to see that our data is much improved. The Per base sequence quality graph shows that all the bases in the read have high quality, across the read. In particular, no positions in the read have a quality score which <20 on the y-axis. The Per sequence GC content graph shows that the distribution of GC content in our dataset (red line) is now comparable to the predicted distribution (blue line). This suggests the cause of the GC spike in the original dataset is due to adapter sequences, which we have now removed. Finally, the Overrepresented sequences table shows that only 1.4% of our reads have an adapter sequence. This is a significant improvement, compared to the 8% we had previously.

These results suggest that our RNA-seq dataset was contaminated with adapter sequences and was of poor quality, compared with the dataset after adapter sequences were removed. However, the results in this example are not necessarily significant and are only a simple representation of possible results.


Submit a Comment

History