Identify Regions of Protein-DNA Binding and Their Influence on Associated Gene Expression |
Added by kosnicki on 2016.12.06
Last updated on over 3 years ago.
Which genes are differentially expressed between my two phenotypes, based on my RNA-seq data?
This recipe provides one method to identify and visualize gene expression in different diseases and during cell differentiation and development. In collecting ChIP-seq data, we can obtain genome-wide maps of transcription factor occupancies or histone modifications between a treatment and control. In locating these regions, we can integrate ChIP-seq and RNA-seq data to better understand how these binding events regulate associated gene expression of nearby genes. An example use case of this recipe is when Laurent et al. observed how the binding of the Prep1 transcription factor influences gene regulation in mouse embryonic stem cells. The integration of both RNA-seq and Chip-seq data allows a user to identify target genes that are directly regulated by transcription factor binding or any other epigenetic occupancy in the genome.
What is Model-based Analysis of ChiP-seq (MACS)?
Model-based Analysis of ChIP-seq (MACS) is a computational algorithm that identifies genome-wide locations of transcription/chromatin factor binding or histone modifications. It is often preferred over other peak calling algorithms due to its consistency in reporting fewer false positives and its finer spatial resolution. First, it removes redundant reads to account for possible over-amplification of ChIP-DNA, which may affect peak-calling downstream. Then it shifts read positions based on the fragment size distribution to better represent the original ChIP-DNA fragment positions. Once read positions are adjusted, peak enrichment is calculated by identifying regions that are significantly enriched relative to the genomic background. MACS empirically estimates the FDR for experiments with controls for each peak, which can be used as a cutoff to filter enriched peaks. The treatment and control samples are swapped and any enriched peaks found in the control sample are regarded as false positives.
Why differential expression analysis?
We assume that most genes are not expressed all the time, but rather are expressed in specific tissues, stages of development, or under certain conditions. Genes which are expressed in one condition, such as cancerous tissue, are said to be differentially expressed when compared to normal conditions.
Use Case: ChIP-Seq and RNA-Seq Analyses Identify Components of the Wnt and Fgf Signaling Pathways as Prep1 Target Genes in Mouse Embryonic Stem Cells (Laurent et al., PLoS ONE, 2015)
The sample datatset, Series GSE6328, used for this recipe are from NCBI's GEO. We identify the interplay between epigentics and transcriptomics mouse embryonic stems cells by observing how the binding of the transcription factor, Prep1, influences gene expression. Prep1 is predominantly known for its contribution in embryonic development. In comparing genome-wide maps of mouse embryonic cells experiencing Prep1 binding to those that do not, we can identify potential target genes that are being differentially regulated by these binding events.
To complete this recipe, we will need a sample dataset, Series GSE6328. We have already aligned and performed quality control analysis on the data, prior to this recipe. We will need the following datasets, which can be downloaded from the following folders:
Public
> RecipeData
> GenomicFeatureData
> Mouse_wGeneID.gtf
: this file contains the reference annotations for the mouse genome that is needed to analyze these datasets.
Public
> RecipeData
> SequenceData
> GSE6328
> Prep1_WT_RNA.bam
: this file contains aligned RNA-seq data from normal mouse embryonic stem cells (mESCs).
Public
> RecipeData
> SequenceData
> GSE6328
> Prep1_KO_RNA.bam
: this file contains aligned RNA-seq data from mESCs in which Prep1 was knocked out.
Public
> RecipeData
> SequenceData
> GSE6328
> Prep1_WT_ChIP.bam
: this file contains aligned input control ChIP-seq data from normal mESCs.
Public
> RecipeData
> SequenceData
> GSE6328
> Prep1_KO_ChIP.bam
: this file contains aligned ChIP-seq data from mESCs for Prep1 binding.
NOTE: If you have not yet associated your GenomeSpace account with your GenePattern account, you will be asked to do so. If you do not yet have a GenePattern account, you can automatically generate a new account that will be associated with your GenomeSpace account.
GenomeSpace
tab, then navigate to the folder containing the files (Public
> RecipeData
> SequenceData
> GSE6328
).Tool: GenePattern
1. Click on the file (e.g., Prep1_WT_RNA.bam
) in GenomeSpace, then use the GenePattern context menu and click Launch on File
.
OR
2. Click on the file (e.g., Prep1_WT_RNA.bam
) in GenomeSpace, then drag it to the GenePattern icon to launch.
We will use the Cufflinks.cuffdiff
module to test differential expression between the two samples (e.g., Prep1_WT_RNA.bam
and Prep1_KO_RNA.bam
). This module uses the aligned data from GenomeSpace, and the reference gene annotation GTF file (Mouse_wGeneID.gtf
).
Modules
tab, and search for "Cuffdiff". Once the module is loaded, change the following parameters:aligned files
: Create a new condition for the first FASTA/FASTQ file, e.g., "WT". Navigate to the directory containing the appropriate file (e.g., Public
> RecipeData
> SequenceData
> GSE6328
> Prep1_WT_RNA.bam
), then load the file into the aligned files
parameter by clicking and dragging the file into the input box.Add Another Condition
button.
​
aligned files
: Create a second condition for the first FASTA/FASTQ file, e.g., "KO". Navigate to the directory containing the appropriate file (e.g., Public
> RecipeData
> SequenceData
> GSE6328
> Prep1_KO_RNA.bam
), then load the file into the aligned files
parameter by clicking and dragging the file into the input box.GTF file
: Click on the Upload your own file
button, then upload GCT file into the Cufflinks.cuffdiff
module using the GenomeSpace
tab. Navigate to the directory containing the GCT file (e.g. Public
> RecipeData
> GenomicFeatureData
> Mouse_wGeneID.gtf
). Load it into the Cufflinks.cuffdiff
module by dragging the file to the GTF file
parameter.
library type
: fr-unstranded
advanced input parameters and options
and then set min alignment count
: 0Run
to submit your job. This will generate several files; in this recipe, we will use the genes_exp.diff
file.
Tool: GenePattern
The DESeq2 tool (GenePattern or Galaxy) can be used to replace the differential expression analysis. (NOTE: Preprocessing would be required if the analysis starts with BAM files
Save the gene_exp.diff
file to GenomeSpace using one of the following methods.
genes_exp.diff
), then choose Save to GenomeSpace
. Save the file to your folder.We will load RNA-seq, ChIP-seq, and the differential gene expression data into Galaxy for future analyses, using one the following method.
Get Data
> GenomeSpace import
menu, then navigate to your personal directory to select your file.gene_exp.diff
, Prep1_WT_RNA.bam
, Prep1_KO_RNA.bam
, Prep1_WT_ChIP.bam
, and Prep1_KO_ChIP.bam
).Send to Galaxy
.Tool: Galaxy
1. Click on the files (e.g., gene_exp.diff
, Prep1_WT_RNA.bam
, Prep1_KO_RNA.bam
, Prep1_WT_ChIP.bam
, and Prep1_KO_ChIP.bam
) in GenomeSpace, then use the GenePattern context menu and click Launch on File
.
OR
2. Click on the file (e.g., gene_exp.diff
, Prep1_WT_RNA.bam
, Prep1_KO_RNA.bam
, Prep1_WT_ChIP.bam
, and Prep1_KO_ChIP.bam
) in GenomeSpace, then drag it to the GenePattern icon to launch.
We will use a pre-built GenomeSpace workflow to prepare the expression datasets for principal component analysis. This is a basic preprocessing workflow that uses column selection and matrix transposition tools.
Run
.Step 1: Input Dataset
: gene_exp.diff
Run workflow
.
This step consults the original gene expression data to extract gene regions that were found to be signficantly different between samples.
NGS:SAMtools > Filter SAM or BAM, output SAM or BAM
.SAM or BAM file to filter
: Prep1_WT_RNA.bam
and Prep1_KO_RNA.bam
Header in output
: Include HeaderOutput alignment overlapping the regions in the BED FILE
: SigDiff_genes.bedExecute
.
This step will convert filtered BAM files, containing signifcantly different genes and their expression, to a Bigwig format that can be viewed in IGV
Complete the following steps for each file that was created from the NGS:SAMtools > Filter SAM or BAM, output SAM or BAM
tool in the previous step, i.e. for both Prep1_WT_RNA.bam
and Prep1_KO_RNA.bam
.
Attributes
tab and select the following paramater:
Database/Build
: Mouse July 2007 (NCBI37/mm9)Save
.
Convert Format
tab and input the following parameter:
Convert to new format
: Bam to BigwigConvert
.
This step will identify read-enriched regions in our ChIP-seq data using the MACS2 algorithm. In comparing the epigenetic landscape between samples, we can identify the regions in where transcription factor binding is occurring.
NGS:Peak Calling > MACS2 callpeak
ChIP-Seq Treatment File
: ChIP-Seq Tag File: GenomeSpace import on Prep1_KO_ChIP.bam
ChIP-Seq Control File
: GenomeSpace import on Prep1_WT_ChIP.bam
Effective genome size
: Mouse (2,150,570,000)Outputs
: Scores in bedGraph files (--bdg) (You may select additional outputs for your own analysis as we have done in this example)Execute
.This step will convert treatment (KO) and control (WT) files, containing peak-enrichment data, to a Bigwig format so they can be viewed in IGV
Complete the following steps for each file that was created from the NGS:Peak Calling > MACS2 callpeak
tool in the previous step, i.e. for both Prep1_WT_ChIP.bam
and Prep1_KO_ChIP.bam
.
Attributes
tab and select the following paramater:
Database/Build
: Mouse July 2007 (NCBI37/mm9)Save
.
Convert Format
tab and input the following parameter:
Convert to new format
: Convert Bedgraph to BigWigConvert
.
Complete these steps for the following files:
MACS2 callpeak (Bedgraph Treatment)
MACS2 callpeak (Bedgraph Control)
This step will use GenomeSpace Exporter
to send differential expression data and peak files to GenomeSpace.
Send Data > GenomeSpace Exporter
Send this dataset to GenomeSpace
: the converted treatment bedgraph file, e.g. convert BedGraph to BigWig on data XXX (Bedgraph Treatment).Choose target directory
: navigate to your GenomeSpace directoryfilename
: choose a filename, e.g. Prep1_KO_peaks.bw
Execute
to submit your job. This will send your data to GenomeSpace.
Repeat for the following files:
Prep1_WT_peaks.bw
Prep1_narrowPeaks.bed
Prep1_WT_sigDiff.bw
Prep1_KO_sigDiff.bw
This step will launch files containing peak enrichment data and differential gene expression data in IGV to view how regulation is influenced by transcription factor binding.
Launch
, prompting the download of a .jnlp
file. Double-click the file to launch IGV.Load File from GenomeSpace...
option.Prep1_narrowPeaks.bed
Prep1_KO_peaks.bw
Prep1_WT_peaks.bw
Prep1_KO_sigDiff.bw
Prep1_WT_sigDiff.bw
NOTE: Each file is loaded separately.
This is an example interpretation of the results from this recipe. We used MACS2
in Galaxy to identify differentially read-enriched regions in mouse embryonic stems cells possessing the Prep1 transcription factor binding to those that did not. The specific configuration of MACS2
we use calls narrow peaks, in other words enriched regions highly associated with transcription factor binding sites, as opposed to broad peaks, which are more highly associated with histone modifications. Additionally, we have portrayed the landscape of binding events in both samples (Knockout and Wild-type). In conjunction, we wanted to identify how this transcription factor binding regulated nearby genes. Using the GenePattern module, Cufflinks
, were were able to identify genes that were significantly (p<0.05) differentially expressed between samples. Below is an example of how this data can be visualized in IGV.
Below highlights the narrow peak identified in comparing transcription factor occupancies between embryonic stem cells with the knockout (Prep1_KO_peaks.bw
) and those with Prep1 (Prep1_WT_peaks.bw
) binding in the chr7:149,889,286-151114351 region.
In the top two frames are the gene expression profiles of the wild-type (first) and the knock-out (second) samples. By zooming in to the chr7:149,832,786-149,849,289 region, we are able to compare the gene expression profiles, especially when we overlay the two samples (seen far below). In overlaying, we can identify whether the type of regulation (up or down) that occurs in relation to the Prep1 transcription factor binding.