GenomeSpace Recipe: Identify Regions of Protein-DNA Binding and Their Influence on Associated Gene Expression

Identify Regions of Protein-DNA Binding and Their Influence on Associated Gene Expression

Added by kosnicki on 2016.12.06 Official logo
Last updated on over 3 years ago.

peak calling gene sets gene expression analysis RNA-seq ChIP-seq differential gene expression

Summary

Which genes are differentially expressed between my two phenotypes, based on my RNA-seq data?

This recipe provides one method to identify and visualize gene expression in different diseases and during cell differentiation and development. In collecting ChIP-seq data, we can obtain genome-wide maps of transcription factor occupancies or histone modifications between a treatment and control. In locating these regions, we can integrate ChIP-seq and RNA-seq data to better understand how these binding events regulate associated gene expression of nearby genes. An example use case of this recipe is when Laurent et al. observed how the binding of the Prep1 transcription factor influences gene regulation in mouse embryonic stem cells. The integration of both RNA-seq and Chip-seq data allows a user to identify target genes that are directly regulated by transcription factor binding or any other epigenetic occupancy in the genome.

What is Model-based Analysis of ChiP-seq (MACS)?

Model-based Analysis of ChIP-seq (MACS) is a computational algorithm that identifies genome-wide locations of transcription/chromatin factor binding or histone modifications. It is often preferred over other peak calling algorithms due to its consistency in reporting fewer false positives and its finer spatial resolution. First, it removes redundant reads to account for possible over-amplification of ChIP-DNA, which may affect peak-calling downstream. Then it shifts read positions based on the fragment size distribution to better represent the original ChIP-DNA fragment positions. Once read positions are adjusted, peak enrichment is calculated by identifying regions that are significantly enriched relative to the genomic background. MACS empirically estimates the FDR for experiments with controls for each peak, which can be used as a cutoff to filter enriched peaks. The treatment and control samples are swapped and any enriched peaks found in the control sample are regarded as false positives.

Why differential expression analysis?

We assume that most genes are not expressed all the time, but rather are expressed in specific tissues, stages of development, or under certain conditions. Genes which are expressed in one condition, such as cancerous tissue, are said to be differentially expressed when compared to normal conditions.

Use Case: ChIP-Seq and RNA-Seq Analyses Identify Components of the Wnt and Fgf Signaling Pathways as Prep1 Target Genes in Mouse Embryonic Stem Cells (Laurent et al., PLoS ONE, 2015)

The sample datatset, Series GSE6328, used for this recipe are from NCBI's GEO. We identify the interplay between epigentics and transcriptomics mouse embryonic stems cells by observing how the binding of the transcription factor, Prep1, influences gene expression. Prep1 is predominantly known for its contribution in embryonic development. In comparing genome-wide maps of mouse embryonic cells experiencing Prep1 binding to those that do not, we can identify potential target genes that are being differentially regulated by these binding events.

Inputs

To complete this recipe, we will need a sample dataset, Series GSE6328. We have already aligned and performed quality control analysis on the data, prior to this recipe. We will need the following datasets, which can be downloaded from the following folders:

Public > RecipeData > GenomicFeatureData > Mouse_wGeneID.gtf: this file contains the reference annotations for the mouse genome that is needed to analyze these datasets.

Public > RecipeData > SequenceData > GSE6328 > Prep1_WT_RNA.bam: this file contains aligned RNA-seq data from normal mouse embryonic stem cells (mESCs).

Public > RecipeData > SequenceData > GSE6328 > Prep1_KO_RNA.bam: this file contains aligned RNA-seq data from mESCs in which Prep1 was knocked out.

Public > RecipeData > SequenceData > GSE6328 > Prep1_WT_ChIP.bam: this file contains aligned input control ChIP-seq data from normal mESCs.

Public > RecipeData > SequenceData > GSE6328 > Prep1_KO_ChIP.bam: this file contains aligned ChIP-seq data from mESCs for Prep1 binding.

Outputs

Recipe steps

GenePattern

Loading data
Identifying differential gene expression in RNA-seq reads
Saving the files to GenomeSpace

Galaxy

Loading RNA-seq and ChIP-seq data into Galaxy
Prepare signficantly different expression dataset for view in IGV
Extracting intervals of significantly differential genes
Convert filtered BAM files to Bigwig
Identifying transcription factor binding sites
Convert files to Bigwig format
Sending data to GenomeSpace for viewing in IGV

IGV

Visualizing transcription factor binding sites and corresponding gene regulation

Expand All Steps

Collapse All Steps

1: Loading data

This is one method that can be used to load data into GenePattern.

NOTE: If you have not yet associated your GenomeSpace account with your GenePattern account, you will be asked to do so. If you do not yet have a GenePattern account, you can automatically generate a new account that will be associated with your GenomeSpace account.

Open GenePattern from GenomeSpace, and navigate to the GenomeSpace tab, then navigate to the folder containing the files (Public > RecipeData > SequenceData > GSE6328).

Alternative: other ways to load data into GenePattern

Tool: GenePattern

1. Click on the file (e.g., Prep1_WT_RNA.bam) in GenomeSpace, then use the GenePattern context menu and click Launch on File.

2. Click on the file (e.g., Prep1_WT_RNA.bam) in GenomeSpace, then drag it to the GenePattern icon to launch.

2: Identifying differential gene expression in RNA-seq reads

We will use the Cufflinks.cuffdiff module to test differential expression between the two samples (e.g., Prep1_WT_RNA.bam and Prep1_KO_RNA.bam). This module uses the aligned data from GenomeSpace, and the reference gene annotation GTF file (Mouse_wGeneID.gtf).

Change to the Modules tab, and search for "Cuffdiff". Once the module is loaded, change the following parameters:
aligned files: Create a new condition for the first FASTA/FASTQ file, e.g., "WT". Navigate to the directory containing the appropriate file (e.g., Public > RecipeData > SequenceData > GSE6328 > Prep1_WT_RNA.bam), then load the file into the aligned files parameter by clicking and dragging the file into the input box.
Click the Add Another Condition button.
â€‹
aligned files: Create a second condition for the first FASTA/FASTQ file, e.g., "KO". Navigate to the directory containing the appropriate file (e.g., Public > RecipeData > SequenceData > GSE6328 > Prep1_KO_RNA.bam), then load the file into the aligned files parameter by clicking and dragging the file into the input box.
GTF file: Click on the Upload your own file button, then upload GCT file into the Cufflinks.cuffdiff module using the GenomeSpace tab. Navigate to the directory containing the GCT file (e.g. Public > RecipeData > GenomicFeatureData > Mouse_wGeneID.gtf). Load it into the Cufflinks.cuffdiff module by dragging the file to the GTF file parameter.
library type: fr-unstranded
Click advanced input parameters and options and then set min alignment count: 0
Click Run to submit your job. This will generate several files; in this recipe, we will use the genes_exp.diff file.

Alternative: Alternative methods to identify differential gene expression in RNA-seq reads

Tool: GenePattern

The DESeq2 tool (GenePattern or Galaxy) can be used to replace the differential expression analysis. (NOTE: Preprocessing would be required if the analysis starts with BAM files

3: Saving the files to GenomeSpace

Save the gene_exp.diff file to GenomeSpace using one of the following methods.

From the job processing view, click on the text file (e.g., genes_exp.diff), then choose Save to GenomeSpace. Save the file to your folder.

4: Loading RNA-seq and ChIP-seq data into Galaxy

We will load RNA-seq, ChIP-seq, and the differential gene expression data into Galaxy for future analyses, using one the following method.

Open Galaxy from GenomeSpace, navigate to the Get Data > GenomeSpace import menu, then navigate to your personal directory to select your file.
Click on each file (e.g., gene_exp.diff, Prep1_WT_RNA.bam, Prep1_KO_RNA.bam, Prep1_WT_ChIP.bam, and Prep1_KO_ChIP.bam).
Click Send to Galaxy.

Alternative: other ways to load data into Galaxy

Tool: Galaxy

1. Click on the files (e.g., gene_exp.diff, Prep1_WT_RNA.bam, Prep1_KO_RNA.bam, Prep1_WT_ChIP.bam, and Prep1_KO_ChIP.bam) in GenomeSpace, then use the GenePattern context menu and click Launch on File.

2. Click on the file (e.g., gene_exp.diff, Prep1_WT_RNA.bam, Prep1_KO_RNA.bam, Prep1_WT_ChIP.bam, and Prep1_KO_ChIP.bam) in GenomeSpace, then drag it to the GenePattern icon to launch.

5: Prepare signficantly different expression dataset for view in IGV

We will use a pre-built GenomeSpace workflow to prepare the expression datasets for principal component analysis. This is a basic preprocessing workflow that uses column selection and matrix transposition tools.

Click on the the workflow link: Creating a BED file of significantly different genes.
Once directed to the Galaxy page, click the green import button in the top right corner, , to import the workflow into your Galaxy environment.
Click on the workflow drop-down menu (e.g.,Imported: Creating a BED file of significantly different genes), then choose Run.
Load the files into the correct fields. The input fields should have annotations indicating which file should be loaded:
1. Step 1: Input Dataset: gene_exp.diff
Click Run workflow.

6: Extracting intervals of significantly differential genes

This step consults the original gene expression data to extract gene regions that were found to be signficantly different between samples.

Navigate to the following menu: NGS:SAMtools > Filter SAM or BAM, output SAM or BAM.
Select the Multiple datasets tab () to input both the WT and KO aligned reads.
Input the following parameters:
1. SAM or BAM file to filter : Prep1_WT_RNA.bam and Prep1_KO_RNA.bam
2. Header in output : Include Header
3. Output alignment overlapping the regions in the BED FILE : SigDiff_genes.bed
Click Execute.

7: Convert filtered BAM files to Bigwig

This step will convert filtered BAM files, containing signifcantly different genes and their expression, to a Bigwig format that can be viewed in IGV

Complete the following steps for each file that was created from the NGS:SAMtools > Filter SAM or BAM, output SAM or BAM tool in the previous step, i.e. for both Prep1_WT_RNA.bam and Prep1_KO_RNA.bam.

Once the files have been output, change the attributes for each file, by clicking the pencil () icon.
Select the Attributes tab and select the following paramater:
1. Database/Build : Mouse July 2007 (NCBI37/mm9)
Select Save.
Select the Convert Format tab and input the following parameter:
1. Convert to new format : Bam to Bigwig
Select Convert.

8: Identifying transcription factor binding sites

This step will identify read-enriched regions in our ChIP-seq data using the MACS2 algorithm. In comparing the epigenetic landscape between samples, we can identify the regions in where transcription factor binding is occurring.

Ensure that both your control and ChIP-seq BAM files have been loaded into Galaxy.
Navigate to the following menu: NGS:Peak Calling > MACS2 callpeak
In the dialog box, enter the following parameters:
1. ChIP-Seq Treatment File: ChIP-Seq Tag File: GenomeSpace import on Prep1_KO_ChIP.bam
2. ChIP-Seq Control File: GenomeSpace import on Prep1_WT_ChIP.bam
3. Effective genome size : Mouse (2,150,570,000)
4. Outputs: Scores in bedGraph files (--bdg) (You may select additional outputs for your own analysis as we have done in this example)
Select Execute.

9: Convert files to Bigwig format

This step will convert treatment (KO) and control (WT) files, containing peak-enrichment data, to a Bigwig format so they can be viewed in IGV

Complete the following steps for each file that was created from the NGS:Peak Calling > MACS2 callpeak tool in the previous step, i.e. for both Prep1_WT_ChIP.bam and Prep1_KO_ChIP.bam.

Once the files have been output, change the attributes for each file, by clicking the pencil () icon.
Select the Attributes tab and select the following paramater:
1. Database/Build: Mouse July 2007 (NCBI37/mm9)
Select Save.
Select the Convert Format tab and input the following parameter:
1. Convert to new format: Convert Bedgraph to BigWig
Select Convert.

Complete these steps for the following files:

MACS2 callpeak (Bedgraph Treatment)
MACS2 callpeak (Bedgraph Control)

10: Sending data to GenomeSpace for viewing in IGV

This step will use GenomeSpace Exporter to send differential expression data and peak files to GenomeSpace.

Navigate to the following menu: Send Data > GenomeSpace Exporter
In the dialog box, enter the following parameters:
1. Send this dataset to GenomeSpace: the converted treatment bedgraph file, e.g. convert BedGraph to BigWig on data XXX (Bedgraph Treatment).
2. Choose target directory: navigate to your GenomeSpace directory
3. filename: choose a filename, e.g. Prep1_KO_peaks.bw
Click Execute to submit your job. This will send your data to GenomeSpace.

Repeat for the following files:

The converted control bedgraph file e.g. convert BedGraph to BigWig on data XXX (Bedgraph Control) called e.g. Prep1_WT_peaks.bw
The narrow Peaks output from MACS2, called e.g. Prep1_narrowPeaks.bed
The WT file of significantly different genes with another filename, called e.g. Prep1_WT_sigDiff.bw
The KO file of significantly different genes with another filename, called e.g. Prep1_KO_sigDiff.bw

11: Visualizing transcription factor binding sites and corresponding gene regulation

This step will launch files containing peak enrichment data and differential gene expression data in IGV to view how regulation is influenced by transcription factor binding.

Launch IGV from GenomeSpace by clicking on the context menu and choosing Launch, prompting the download of a .jnlp file. Double-click the file to launch IGV.
Load a reference genome into IGV by using the IGV genome selection drop-down menu in the upper left corner of the tool. Make sure the reference genome matches the samples you are using, e.g., Mouse samples should use the Mouse mm9 genome, as in this example.
Use the GenomeSpace tab to import files by clicking on the Load File from GenomeSpace... option.
To load files, navigate to the directory in GenomeSpace which contains the file(s), then click the file(s) to select, and then click Open. Load the following files into IGV:
- Prep1_narrowPeaks.bed
- Prep1_KO_peaks.bw
- Prep1_WT_peaks.bw
- Prep1_KO_sigDiff.bw
- Prep1_WT_sigDiff.bw
NOTE: Each file is loaded separately.

Results Interpretation

This is an example interpretation of the results from this recipe. We used MACS2 in Galaxy to identify differentially read-enriched regions in mouse embryonic stems cells possessing the Prep1 transcription factor binding to those that did not. The specific configuration of MACS2 we use calls narrow peaks, in other words enriched regions highly associated with transcription factor binding sites, as opposed to broad peaks, which are more highly associated with histone modifications. Additionally, we have portrayed the landscape of binding events in both samples (Knockout and Wild-type). In conjunction, we wanted to identify how this transcription factor binding regulated nearby genes. Using the GenePattern module, Cufflinks, were were able to identify genes that were significantly (p<0.05) differentially expressed between samples. Below is an example of how this data can be visualized in IGV.

Below highlights the narrow peak identified in comparing transcription factor occupancies between embryonic stem cells with the knockout (Prep1_KO_peaks.bw) and those with Prep1 (Prep1_WT_peaks.bw) binding in the chr7:149,889,286-151114351 region.

In the top two frames are the gene expression profiles of the wild-type (first) and the knock-out (second) samples. By zooming in to the chr7:149,832,786-149,849,289 region, we are able to compare the gene expression profiles, especially when we overlay the two samples (seen far below). In overlaying, we can identify whether the type of regulation (up or down) that occurs in relation to the Prep1 transcription factor binding.