Identify essential genes and associated subnetworks from Genome-Scale CRISPR-Cas9 knockout screens
Added by forrestkim on 2017.05.25
Last updated on about 2 years ago.
What genes are essential to a cell’s survival in a specific environment?
This recipe provides a way to process the results of genome-wide CRISPR-Cas9 knockout screens. In these screens, single guide RNAs (sgRNAs) are designed to direct Cas9 to specific target DNA sequences in genes and disrupt them. Multiple sgRNAs may target the same gene to increase knockout efficiency. In positive screens, essential genes are identified by sequencing the cells that survive selection. The loss of these ‘winning’ genes creates cells that are resistant to the selective pressure. In negative screens, essential genes are identified by measuring which genes are lower in abundance post-selection. These screens require a non-selected control, which is used to find which genes are essential to survival under the given selective pressures (Miles et al., 2016).
Since a large number of sgRNAs can be introduced in a single screen, many genes can be tested against a selection criterion. However, there are many factors to consider when processing the sequenced reads: multiple sgRNAs in a library often target the same gene but with different specificities and efficiencies, and read count distributions vary depending on library and study design. Additionally, positive selection screens often result in relatively few sgRNAs that dominate the total sequenced reads. The MAGeCK (Li et al., 2014) method was specifically developed for CRISPR screen analyses with these conditions in mind.
How can we find the molecular mechanism responsible for resistance?
By looking at how the hits in the screen aggregate on an interaction network, we can get an idea of the mechanisms that are essential for the organism to survive an environmental challenge. The network neighborhood that contains a high concentration of essential genes is strongly implicated as the molecular mechanism by which an organism handles the challenge.
We can find the network neighborhood that is enriched for the screen hits through an algorithm called network propagation (Carlin et al., in press), which is implemented as a feature of the popular network analysis program Cytoscape. This algorithm finds the closely clustered hits and their network neighbors to build a network diagram of the resistance mechanism. We can then use the GeneMANIA plugin to find the enriched biological terms that summarize the diagram.
What is Model-based Analysis of Genome-wide CRISPR/Cas9 Knockout (MAGeCK)?
Model-based Analysis of Genome-wide CRISPR/Cas9 Knockout (MAGeCK) is an algorithm for identifying both positively and negatively selected sgRNAs and genes from genome-scale CRISPR/Cas9 knockout screens. The MAGeCK method can be summarized by the following steps:
1. sgRNA read counts are median-ratio normalized.
2. Mean-variance modeling is then used to model each replicate. The statistical significance of each sgRNA is calculated using the learned mean-variance model.
3. Essential genes are determined by looking for genes with consistently highly significant sgRNAs using robust rank aggregation.
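The first of these steps can be illustrated with a small sketch. This is not MAGeCK's own code; it is a minimal pure-Python version of median-ratio normalization as the technique is commonly defined (size factor per sample = median of each sgRNA's count divided by that sgRNA's geometric mean across samples). The function name, the input layout, and the choice to skip sgRNAs with a zero count in any sample are assumptions made for illustration.

```python
import math
from statistics import median


def median_ratio_normalize(counts):
    """Median-ratio normalization sketch (hypothetical helper, not MAGeCK code).

    counts: dict mapping sample name -> list of raw read counts, with the
    sgRNAs in the same order in every sample.
    Returns (normalized counts, size factors).
    """
    samples = list(counts)
    n_guides = len(counts[samples[0]])

    # Geometric mean of each sgRNA across samples. sgRNAs with a zero count
    # in any sample are skipped here (an assumption of this sketch).
    geo_means = []
    for i in range(n_guides):
        values = [counts[s][i] for s in samples]
        if all(v > 0 for v in values):
            log_mean = sum(math.log(v) for v in values) / len(values)
            geo_means.append(math.exp(log_mean))
        else:
            geo_means.append(None)

    # Size factor of a sample: median ratio of its counts to the geometric means.
    size_factors = {}
    for s in samples:
        ratios = [counts[s][i] / g for i, g in enumerate(geo_means) if g is not None]
        size_factors[s] = median(ratios)

    normalized = {s: [c / size_factors[s] for c in counts[s]] for s in samples}
    return normalized, size_factors


# Toy example: the treatment library was sequenced twice as deeply as the control.
toy = {"control": [100, 200, 300, 400], "treatment": [200, 400, 600, 800]}
normalized, factors = median_ratio_normalize(toy)
```

After normalization the two toy samples have matching counts, because the only difference between them was sequencing depth.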
Use Case: MAGeCK enables robust identification of essential genes from genome-scale CRISPR/Cas9 knockout screens (Li et al. 2014).
For this recipe, we will need two datasets, an experimental screen dataset and a control screen dataset, as well as an sgRNA library that provides the sgRNA-to-gene relationships. The sample CRISPR/Cas9 knockout screen treatment, control, and sgRNA library used in this recipe are from the paper "Genome-wide recessive genetic screening in mammalian cells with a lentiviral CRISPR-guide RNA library" (Koike-Yusa et al. 2014). These datasets can be downloaded from the following GenomeSpace Public folders:
Treatment dataset: Public > RecipeData > SequenceData > MAGeCK > ERR376999.fastq.gz: This file contains sequence data of the Cas9-expressing mouse ESCs after they were transfected with the targeted sgRNA expression vectors. The cells were then treated with alpha-toxin for 2 days for selection. The surviving cells were pooled and genomic DNA was extracted for PCR and sequencing.
Control dataset: Public > RecipeData > SequenceData > MAGeCK > ERR376998.fastq.gz: This file contains sequence data of the Cas9-expressing mouse ESCs after they were transfected with the pBluescript (control) vector. The cells were grown and pooled, and genomic DNA was extracted for sequencing in the same way as for the treated cells.
sgRNA library: Public > RecipeData > SequenceData > MAGeCK > yusa_library.csv: This file contains the sgRNAs and their corresponding target sequences.
NCI Pathway Interaction Database (in Cytoscape): NDEx > NCI Pathway Interaction Database - Diffusion Demo Copy: This file contains a network derived from the latest BioPAX3 version of the Pathway Interaction Database (PID) curated by NCI/Nature.
Note: For help loading your own data into GenomeSpace, see: "Upload Data To GenomeSpace".
NOTE: If you have not yet associated your GenomeSpace account with your GenePattern account, you will be asked to do so. If you do not yet have a GenePattern account, you can automatically generate a new account that will be associated with your GenomeSpace account.
Select the GenomeSpace tab, then navigate to the folder containing the files (Public > RecipeData > SequenceData).
Tool: GenePattern
Click on the file (e.g. ERR376999.fastq.gz) in GenomeSpace, then use the GenePattern context menu and click Launch on File.
OR
Click on the file (e.g. ERR376999.fastq.gz) in GenomeSpace, then drag it to the GenePattern icon to launch.
In this step, we will map sequenced reads to the corresponding target sequences in the provided sgRNA library to determine read counts. Read counts are then median-normalized to adjust for the effect of library sizes and read count distributions. We will accomplish this using the MAGeCK.Count module in GenePattern.
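To make the counting step concrete, here is a minimal Python sketch of mapping reads to an sgRNA library by exact match after trimming a fixed number of 5' bases. It is not the MAGeCK.Count implementation: the function names, the in-memory library layout (target sequence mapped to sgRNA id), and the exact-match rule are all assumptions made for illustration, and the toy reads use a short trim length rather than the value used in this recipe.

```python
from collections import Counter
from itertools import islice


def fastq_sequences(lines):
    """Yield the sequence line (2nd of every 4 lines) of each FASTQ record."""
    for line in islice(lines, 1, None, 4):
        yield line.strip()


def count_sgrnas(sequences, library, trim_5prime):
    """Count reads per sgRNA by exact match (hypothetical helper).

    library: dict mapping target sequence -> sgRNA id; all target
    sequences are assumed to have the same length.
    Returns (Counter of reads per sgRNA id, number of unmapped reads).
    """
    guide_len = len(next(iter(library)))
    counts = Counter({sgrna_id: 0 for sgrna_id in library.values()})
    unmapped = 0
    for seq in sequences:
        # Drop the 5' bases that precede the guide, then take one guide length.
        candidate = seq[trim_5prime:trim_5prime + guide_len].upper()
        sgrna_id = library.get(candidate)
        if sgrna_id is None:
            unmapped += 1
        else:
            counts[sgrna_id] += 1
    return counts, unmapped


# Toy data: three reads, each with a 3-base prefix before the guide sequence.
toy_fastq = [
    "@r1\n", "AAACGTAC\n", "+\n", "IIIIIIII\n",
    "@r2\n", "AAATTTTG\n", "+\n", "IIIIIIII\n",
    "@r3\n", "AAAGGGGG\n", "+\n", "IIIIIIII\n",
]
toy_library = {"CGTAC": "sgA", "TTTTG": "sgB"}
toy_counts, toy_unmapped = count_sgrnas(fastq_sequences(toy_fastq), toy_library, trim_5prime=3)
```

For a real file, the same generator could be fed a handle opened with `gzip.open(path, "rt")` from the Python standard library.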
Open the Modules tab, and search for MAGeCK.Count.
Group name: esc1, fastq file: ERR376999.fastq.gz (found in Public > RecipeData > SequenceData)
Group name: plasmid, fastq file: ERR376998.fastq.gz (found in Public > RecipeData > SequenceData)
sgRNA list: yusa_library.csv (found in Public > RecipeData > SequenceData)
output prefix: escneg
trim 5 prime: 23
normalization method: median
Click Run to run MAGeCK.Count.
In the MAGeCK.Count outputs in the Jobs tab, click on the text file (e.g. escneg.count_normalized.txt), then choose Save to GenomeSpace and save the file to your desired directory.
Next, we will apply the MAGeCK algorithm to the resulting count file, which ranks sgRNAs and identifies essential genes based on the rank of their corresponding sgRNAs. The plasmid dataset is the control and the embryonic stem cell dataset is the treatment. We will accomplish this using the MAGeCK.Test module in GenePattern.
Open the Modules tab, and search for MAGeCK.Test.
count table: escneg.count_normalized.txt (from MAGeCK.Count job output)
treatment id: esc1.1
control id: plasmid.1
normalization method: median
output prefix: esccp
Click Run to run MAGeCK.Test.
In the MAGeCK.Test outputs in the Jobs tab, click on the text file (e.g. esccp.gene_summary.txt), then choose Save to GenomeSpace and save the file to your desired directory.
NOTE: The results of this module can also be used with the MAGeCK Pathways.Analysis module to test whether a pathway is enriched in one particular gene ranking using robust rank aggregation (RRA).
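For readers who want to inspect the gene summary outside of Cytoscape, the sketch below pulls out the negatively selected genes using the same 0.05 cutoff on the neg|p-value column that this recipe applies later as a Cytoscape column filter. It assumes the file is tab-delimited and that the gene identifier column is named id; both are assumptions to check against your own output, and the function name is hypothetical.

```python
import csv


def negatively_selected_genes(lines, p_column="neg|p-value", id_column="id", threshold=0.05):
    """Return (gene, p-value) pairs with p <= threshold, smallest p first.

    lines: any iterable of text lines, e.g. an open file handle.
    Assumes a tab-delimited table with a header row (an assumption about
    the gene summary layout, not a documented guarantee).
    """
    reader = csv.DictReader(lines, delimiter="\t")
    hits = []
    for row in reader:
        p_value = float(row[p_column])
        if p_value <= threshold:
            hits.append((row[id_column], p_value))
    return sorted(hits, key=lambda hit: hit[1])


# Toy table standing in for a gene summary file; values are made up.
toy_summary = [
    "id\tnum\tneg|p-value\n",
    "GeneA\t5\t0.001\n",
    "GeneB\t5\t0.5\n",
    "GeneC\t4\t0.05\n",
]
toy_hits = negatively_selected_genes(toy_summary)
```

With a saved file, the call would look like `negatively_selected_genes(open("esccp.gene_summary.txt"))`, assuming the file sits in the working directory.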
"Several existing algorithms, although not specifically designed for CRISPR/Cas9 knockout screens, can be also be used to identify significantly selected sgRNAs or genes. For example, edgeR, DESeq, baySeq and NBPSeq are commonly used algorithms for differential RNA-Seq expression analysis. These algorithms are able to evaluate the statistical significance of hits in CRISPR/Cas9 knockout screens, although only at the sgRNA level. Algorithms designed to rank genes in genome-scale short interfering RNA (siRNA) or short hairpin RNA (shRNA) screens can also be used for CRISPR/ Cas9 knockout screening data, including RNAi Gene Enrichment Ranking (RIGER) and Redundant siRNA Activity (RSA). However, these methods are designed to identify essential genes mostly from oligonucleotide barcode microarray data, and a new algorithm is needed to prioritize sgRNAs, as well as gene and pathway hits from high-throughput sequencing data" (Li et al., 2014).
Launch Cytoscape and load the NCI Pathway Interaction Database, a highly-structured, curated collection of information of known bio-molecular interactions and key cellular processes assembled into signaling pathways.
NOTE: For Macintosh Users: JNLP files from the internet are labeled insecure. In order to open the JNLP, find the file in your Finder, right-click the file, and press open. This will open a window that will ask for permission to open the file. Press open to access the JNLP.
Find the cytoscape.jnlp file. Double-click this file to launch Cytoscape.
Install the CyNDEx app. To install this, use the following steps:
Go to Apps > App Manager
Search for CyNDEx. Click on the app and click Install to install it.
Go to Apps > NDEx > Import Networks from NDEx
Search for Diffusion Demo. Make sure the full network title matches.
Click Load Network
Click Done Loading Network to view the loaded network.
Tool: Cytoscape
If you are using Cytoscape version 3.6.0+, you will have to use the newer version of the NDEx App, CyNDEx-2, to import the NCI Pathway Interaction Database. Once you have installed CyNDEx-2, you can search for and download the network by selecting the icon and typing "final revision" in the search bar.
Import the list of essential genes discovered with the MAGeCK algorithm into Cytoscape for network analysis.
Click the Import Table from File button to load our gene summary output (esccp.gene_summary.txt) from Step 3.
Select To a Network Collection for the Where to Import Table Data parameter. Make sure the Network Collection indicated is the NCI Pathway Interaction Database - Diffusion Demo Copy. Use the default parameters for the remaining options.
Network diffusion is a technique for discovering genes and genetic modules that a list of initial genes may interact with. Here we use the technique with the Diffusion app to understand associated networks and pathways from the list of essential genes from our MAGeCK analysis.
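The idea behind network diffusion can be sketched in a few lines of Python. The example below is a random walk with restart on a toy undirected graph, which is one common way to propagate scores from seed genes; it is not the Diffusion app's implementation, and the function name, the restart value, and the toy network are assumptions made for illustration. Nodes close to the seeds end up with higher scores, which is what allows a top-ranked subnetwork to be selected afterwards.

```python
def propagate(adjacency, seeds, restart=0.5, iterations=50):
    """Random walk with restart sketch (hypothetical helper).

    adjacency: dict mapping node -> set of neighbours; the graph is
    assumed undirected, so every edge appears in both directions.
    seeds: nodes that receive the initial score (e.g. screen hits).
    Returns a dict mapping node -> propagated score.
    """
    nodes = list(adjacency)
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    scores = dict(start)
    for _ in range(iterations):
        updated = {}
        for n in nodes:
            # Each neighbour passes its score on, split evenly across its edges.
            incoming = sum(scores[m] / len(adjacency[m]) for m in adjacency[n])
            updated[n] = restart * start[n] + (1 - restart) * incoming
        scores = updated
    return scores


# Toy path network A - B - C - D with a single seed gene A.
toy_network = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C"},
}
toy_scores = propagate(toy_network, seeds={"A"})
ranked = sorted(toy_scores, key=toy_scores.get, reverse=True)
```

In the toy network the ranking follows distance from the seed, so taking the top-ranked nodes recovers the seed's neighborhood.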
Install the Diffusion app (NOTE: for Cytoscape version 3.6.0, the Diffusion app is already installed). To install this, use the following steps:
Go to Apps > App Manager
Search for Diffusion. Select the app and click Install to install it.
Open the Select tab under the Control Panel.
Add a Column Filter
neg|p-value: between 0 and 0.05 inclusive
Go to Tools -> Diffuse -> Selected Nodes
Current Rank: 200. Press Set.
Click Create to create a new network from the selection.
Apply Layout -> yFiles Layouts -> Organic (Note: Cytoscape 3.6.0 uses a new yLayout App. To use these layouts, download the yFiles Layouts through the Cytoscape portal and select "yFiles Organic Layout" in the Layout dropdown menu)
Fill color, Map. column:
Column: diffusion_input
Mapping type: discrete mapping
Color: yellow
We will use the GeneMANIA plugin to find the network of interacting genes associated with our gene list. GeneMANIA can find genes related to our set of input genes by using a very large set of functional association data, which includes protein and genetic interactions, pathways, co-expression, co-localization and protein domain similarity. GeneMANIA can find new members of a pathway or complex, find additional genes that may have been missed in a screen, or find new genes with a specific function, such as protein kinases.
NOTE: Resulting subnetworks should look similar to the following, but the shape of the networks may differ.
Install the GeneMANIA app. To install this, use the following steps:
Go to Apps > App Manager
Search for GeneMANIA. Click on the app and click Install to install it.
Select the gene names in the Node Table. Copy the selected gene names (Windows: Ctrl + C, Mac: Command + C).
Go to Apps -> GeneMANIA -> Search...
Paste the gene names copied from the Node Table by pressing the keyboard shortcut for paste (Windows: Ctrl + V, Mac: Command + V).
Make sure M. musculus (mouse) is selected under Organism.
Click Start.
As a result, we have a network with associated interactions and functions that can be further explored in the Result Panel.
Given sgRNA read counts for our treatment and control, we use the MAGeCK algorithm on GenePattern to determine the essential genes: the negatively selected genes needed for ESC proliferation in the presence of alpha toxin. Cytoscape then allows us to understand the network interactions between them. We use network diffusion to discover modules that the essential genes interact with. With the resulting subnetworks, GeneMANIA provides functional annotation by searching across a large catalogue of gene sets to find the known processes that are enriched.
GeneMANIA provides a set of networks, genes, and functional annotations from its analysis that we can explore further. We can see that many of the top “Functions” listed are essential biological processes and include many DNA repair genes, which are essential for ESC proliferation in the presence of alpha toxin. These results are consistent with the results reported in Koike-Yusa et al. 2014.