Expand all recipe descriptions

Found 18 recipes

Which genes lie in my copy number variation regions? Are there any sets of co-regulated genes in these aberrant regions?

This recipe provides a method for identifying and visualizing a network of co-regulated genes that are associated with aberrant regions identified by single nucleotide polymorphism (SNP) arrays. An example use of this recipe is a case where an investigator may want to find which genes are located in regions that exhibit significant changes (e.g. amplification or deletion) in cancer cells.

 

This recipe provides one method for identifying and visualizing aberrant regions in Diffuse Large B-Cell Lymphoma (DLBCL) cancer cells. This recipe uses copy-number variation (CNV) data from SNP arrays, and evaluates the expression of aberrant regions using a microarray dataset. Regions that are significantly changed (e.g., amplified or deleted) in cancer cells are defined by the GISTIC algorithm. In particular, this recipe makes use of several Galaxy tools to find the overlap between the aberrant regions and reference genes, and uses GenePattern to process the microarray dataset. Genomica is used to find module networks of co-regulated genes associated with these aberrant regions. A module network is a model which identifies regulatory modules from gene expression data, especially modules of co-regulated genes and their regulators. The module also identifies the conditions under which the regulation can occur.

Why analyze copy number variation regions? Copy number variations (CNVs) are large alterations to genomes, such as duplication or deletion of large segments of a chromosome. These variations in the genome have been associated with different conditions, such as cancer. In this recipe, we explore the scenario in which CNVs are elevated in a cancer cell line, and our goal is to determine the function of these duplicated genes.

Does my gene expression dataset contain a module network of regulatory genes? Does the network have any special features?

This recipe provides one method for creating and visualizing a module network of regulatory genes. An example use of this recipe is a case where an investigator may want to evaulate an expression dataset to find regulatory genes such as transcription factors, and then determine if they are connected in a network.

 

In particular, the regulatory genes of interest are genes which regulate other genes associated with an embryonic stem cell (ESC) state. This 'stemness signature' is a feature common to ESCs, as well as induced pluripotent stem cells (iPSCs), and also in a compendium of human cancers, such as breast cancer. This recipe recapitulates research by Wong et al., in Cell Stem Cell (2008), "Module map of stem cell genes guides creation of epithelial cancer stem cells."

We use a gene expression dataset of primary human breast cancer tumor samples (described in Chin, K. et al, Cancer Cell, 2006), and create a module network by projecting a set of stemness regulators onto the gene expression dataset, using Genomica. A module network is a model which identifies regulatory modules from gene expression data, especially modules of co-regulated genes and their regulators. The module also identifies the conditions under which the regulation can occur.

After obtaining the module network, we visualize it using Cytoscape. Since the network is very large, we then filter it to just a subnetwork of stemness regulators and their connections, again using Cytoscape. This provides us with a visual representation of the stemness regulators as they appear projected onto a breast cancer tumor dataset.

 

Does my RNA-Seq data contain noticeable batch effects? Can I remove any batch effects I find?

This recipe provides one method to assess RNA-seq count data for potential batch effects. Identified batch effects are removed using the ComBat algorithm (Johnson, Rabinovic, and Li). An example use case of this recipe is when an investigator wants to correct data and account for confounding variables generated by effects such as using different reagents or processing samples at different times.

 

This recipe uses count data and sample annotations derived from two separate RNA-seq experiments and relies upon Principal Component Analysis (PCA) as a primary means of evaluating the dataset for batch effects. In particular, this recipe utilizes a number of GenePattern modules to filter and preprocess the RNA-seq count data and to normalize the dataset using the ComBat algorithm. We then use Galaxy to view the influence of batch effects in our pre- and post-normalization datasets.

Why consider batch effects (a.k.a. confounders)? A "batch" or "confounding" variable can be generally defined as an extraneous variable that may correlate or anti-correlate with our variables of interest. Failing to account for confounding variables or batch effects can lead us to draw spurious conclusions from our data. For example:

Albert is comparing the hardiness of two different E. coli strains, X and Y, by measuring their production of various stress proteins under a mild bactericide. However, Albert's bactericide stock is running low and he ends up using two different stocks of the same bactericide over the course of this experiment. Here, bactericide stock is a confounding or batch variable. Perhaps one bactericide stock is more potent than the other. Albert will need to be careful in how he analyzes his data. He must somehow ensure that differences in stress protein abundances between his X and Y strains are driven by the biological differences of these two strains rather than differences in the bactericide stocks.

Efforts can be undertaken to minimize batch effects by ensuring proper experimental design, e.g., simplifying protocols to eliminate potential confounders. If batches are unavoidable, it is also important to randomize samples across these batches such that differences between batches can be attributed to batch effects rather than biological differences. But despite these efforts, unforeseen batch effects may still be present in any experiment. Thus, we should be aware of the degree to which these unwanted factors influence our results and be prepared to use various statistical methods to remove these unwanted factors if needed.

How do we identify batch effects? There are a variety of methods for identifying batch effects. Sometimes, potential batches can be identified at the outset of a project based on its experimental design. For example, if samples must be processed in separate groups due to the limiting capacity of a sequencer, then these processing groups are likely to introduce batch effects. If batches can be pre-identified, then the scientist can include identical control samples in each batch. Any variation in measurements from these control samples can then be attributed to and used to quantify a batch effect.

Another common method is to visualize data using a PCA projection. PCA uses a linear transformation to transform a dataset into a space of linearly uncorrelated, orthogonal "principal components". More simply, PCA is a way of visualizing data such that underlying structures are revealed. We encourage users to learn more about PCA (Ma and Dai's 2011 article in Briefings in Bioinformatics is a good starting point). By re-projecting with PCA, we can identify those variables, including batch variables, that are contributing to large variances in the dataset.

Which genes are differentially expressed between my two phenotypes, based on my RNA-seq data?

This recipe provides one method to identify differentially expressed genes in RNA-seq read data. An example use of this recipe is a case where an investigator may want to compare two phenotypes, such as two types of cancer, to determine which genes are up- or down-regulated differently between these phenotypes.

 

In particular, this recipe uses the UCSC Table Browser to retrieve a reference genome to align RNA-seq reads against. We also used several modules in GenePattern to align the reads against the reference genome, and to identify differentially expressed genes when comparing two conditions. Finally, we use IGV to visualize the differentially expressed genes.

Why differential expression analysis? We assume that most genes are not expressed all the time, but rather are expressed in specific tissues, stages of development, or under certain conditions. Genes which are expressed in one condition, such as cancerous tissue, are said to be differentially expressed when compared to normal conditions. To identify which genes change in response to specific conditions (e.g. cancer), we must filter or process the dataset to remove genes which are not informative.

 

 

What subnetworks of differentially expressed genes are enriched in my samples? What biological functions are they related to?

This recipe provides a method for identifying differentially expressed genes between two phenotypes, such as tumor and normal, to find subnetworks of interacting proteins and determine their functional annotations. An example use of this recipe is a case where an investigator may want to compare two phenotypes to determine which gene networks are similar between phenotypes, and to determine how functional annotation changes between phenotypes.

 

In particular, this recipe makes use of several GenePattern modules to identify differentially regulated genes, then uses several Cytoscape plugins to identify potential interactions between gene products, and to visualize the resulting network.

Why differential expression analysis? We assume that most genes are not expressed all the time, but rather are expressed in specific tissues, stages of development, or under certain conditions. Genes which are expressed in one condition, such as cancer tissue, are said to be differentially expressed when compared to normal conditions. To identify which genes change in response to specific conditions (e.g. cancer), we must filter or process the dataset to remove genes which are not informative.

Why protein interaction network analysis? Gene expression analysis results in a list of differentially expressed genes, but it does not explain whether these genes are connected biologically in a pathway or network. To better understanding the underlying biology that drives changes in gene expression analysis, we can perform network analysis to determine whether gene products (e.g. proteins) are reported to interact. To identify potential networks or pathways, we search for highly interconnected subnetworks within a large interaction network.

Which genes are differentially expressed in my microarray data? Are these genes enriched for certain biological pathways?

This recipe provides an outline of one method to identify known biological functions for genes that are differentially expressed between two conditions or phenotypes, using microarray data. An example use of this recipe is a case where an investigator may want to determine if a specific cancer phenotype is associated with expression of certain pathways.

 

Given a set of differentially expressed genes, the goal is to infer which biological functions (for example, Gene Ontology biological processes) are overrepresented in the set of reference genes found to be differentially expressed. In particular, this recipe uses a gene expression dataset which has two conditions: normal and mild hyperthermia. Then, GenePattern is used to identify differentially expressed genes, and finally MSigDB is used to identify biological functions and pathways that are enriched in the gene set.

Why differential expression analysis? We assume that most genes are not expressed all the time, but rather are expressed in specific tissues, stages of development, or under certain conditions. Genes which are expressed in one condition, such as cancerous tissue, are said to be differentially expressed when compared to normal conditions. To identify which genes change in response to specific conditions (e.g. cancer), we must filter or process the dataset to remove genes which are not informative.

Why perform functional annotation? Many analyses end with the retrieval of a gene list, e.g. gene expression analysis identifies a list of genes which are differentially expressed when comparing multiple conditions. However, often times a researcher has additional questions about the function or relatedness of genes in a gene list: Are the genes a part of the same pathway? Do the gene products interact physically? Do the gene products localize to a specific part of the cell? Are the genes only expressed during a certain stage of development? These questions, and others like them, can be answered by performing functional annotation on gene lists, to better understand the underlying connections between genes.

Do phenotypically different expression datasets share a common signature? Can the signature distinguish phenotypes in an independent dataset?

This recipe provides one method for identifying a consensus gene signature from a training set of several phenotypically distinct gene expression dataset. The recipe then validates the ability of the consensus signature to accurately distinguish phenotypes by using an independent test gene expression dataset. An example use case of this recipe is when an investigator may want to develop a gene expression signature to predict a specific phenotype, such as cancer or another disease.

 

Background information: What is a consensus gene expression signature?

A gene expression signature is the pattern of expression in a specific group of genes, usually ones that are related by function, position or other biological process. A consensus gene signature is an expression pattern for a specific group of genes, which is shared among different samples or across different phenotypes. For example, a group of genes regulating immune response could be similarly up-regulated during many different, unrelated infections. There are several types of consensus signatures; those that can be derived from gene expression data are called transcriptional consensus signatures. Consensus signatures can be created by overlapping individual gene signatures derived from multiple datasets. Compared to individual gene expression signatures, consensus signatures may be more accurate at distinguishing different phenotypes, such as diseased vs. normal samples.

 

Use case: Targeting MYCN in Neuroblastoma by BET Bromodomain Inhibition (Puissant et al. , Cancer Discov. 2013).

This study analyzed gene expression data generated from primary neuroblastoma tumors of two genetic classes: tumors harboring MYCN amplification (“MYCN amplified”) and tumors without MYCN amplification (“MYCN non-amplified”). MYCN amplified neuroblastoma is exquisitely dependent on the bromodomain and extra-terminal (BET) family of proteins. As such, treatment of MYCN amplified cell lines or tumors with JQ1, a small-molecule inhibitor of BET proteins, leads to dramatic transcriptional changes and induces cell death.

To identify a consensus signature to predict sensitivity to JQ1 treatment, two training datasets and one test dataset were used. The training dataset included acute myeloid leukemia (AML) and a multiple myeloid leukemia (MM) cell lines, which had been treated with either DMSO (control) or with JQ1 (treatment). The test dataset included MYCN amplified and MYCN nonamplified neuroblastoma primary tumor samples. GenePattern was used to analyze the AML and MM cell lines; for each dataset, a gene expression signature was derived to identify JQ1 response in the cell line. Using Galaxy, the two signatures were then overlapped to determine the consensus signature between the two phenotypes.

GenePattern was used to validate the ability of this JQ1-associated consensus signature to differentiate between phenotypes, by using the signature to hierarchically cluster the test dataset (neuroblastoma). Since the MYCN amplified and MYCN non-amplified neuroblastoma samples should have differing expression profiles, it was hypothesized that the consensus signature would be able to separate the samples by phenotype. Indeed, the consensus signature was able to cluster the MYCN-amplified and MYCN-nonamplified samples separately, revealing that the consensus signature accurately distinguishes the sensitivity-to-JQ1 phenotype.

What coding variants do my exome sequences contain, and are any of them phenotypically interesting?

This recipe provides a method to identify coding variants from exome sequencing data. An example use of this recipe is a case where an investigator may want to identify variants from their alignment data and evaluate the phenotypes.

 

In particular, this recipe uses the variant detector, FreeBayes, in Galaxy to identify potential variants from short-read alignments and assign them quality scores. We also use other Galaxy tools to pare these potential variants down to a high quality subset and to compare them against a catalog of gold-standard variants. Finally, we use IGV to visualize these variants alongside known pathogenic variants.

Why identify genetic variants? Genetic variants are DNA sequence differences that occur with relative frequency in a population. These variants include single nucleotide polymorphisms (SNPs), insertions and deletions (indels), multi-nucleotide polymorphisms (MNPs), and complex events that are combinations of indels and polymorphisms. With the advent of next generation sequencing technologies, it has become common practice to identify variants using sequence data. This is usually accomplished by comparing sequence data from many samples or individuals against a reference. Identified variants can have many practical applications. For instance, variants that correlate strongly with disease risk may provide insight into the disease's genetic underpinnings. Variants may also be used as biomarkers for disease risk or to DNA "fingerprint" individuals.

Why use exome sequencing data? The human exome comprises all the exonic regions of the human genome. That is, it includes all the DNA sequences that, after transcription into RNA, remain after RNA-splicing, which amounts to about 1% of the entire genome. Whole exome sequencing (WES) thus generates much less data than whole genome sequencing (WGS). Furthermore, since WES focuses almost exclusively on protein-coding regions it is an effective method for identifying coding variants.

NOTE: Working with DNA sequencing data is a computationally and resource intensive process. Files containing raw sequence reads can be many gigabytes in size and can be cumbersome to manage. To maintain the usability of this recipe, we have chosen to work with only a partial region of the human genome, specifically the exonic regions of a chromosome 20.

Which genes are highly expressed, based on my RNA-seq data? Are any of the highly expressed genes also differentially expressed?

This recipe provides an outline of one method to identify and visualize genes and isoforms that are highly expressed in RNA-seq data. In particular, this recipe utilizes an analysis pipeline, allowing a user to chain together multiple analysis steps into one workflow that can be run in one step. An example use of this recipe is a case where an investigator wants to process several datasets in the same way, in which case the pipeline will allow the investigator to re-use the same modules and parameters, over and over again.

 

Given a set of raw RNA-seq reads, the goal is to align the reads to a reference genome, estimate expression abundance levels for reference genes and isoforms, filter out low-expressed genes and isoforms, and visualize the read alignments and their expression levels. In particular, this recipe uses the UCSC Table Browser to retrieve a reference genome to align RNA-seq reads against. We also uses several modules in GenePattern to align the reads against the reference genome, and to identify differentially expressed genes when comparing two conditions. Finally, we use IGV to visualize the differentially expressed genes.

Why differential expression analysis? We assume that most genes are not expressed all the time, but rather are expressed in specific tissues, stages of development, or under certain conditions. Genes which are expressed in one condition, such as cancerous tissue, are said to be differentially expressed when compared to normal conditions. To identify which genes change in response to specific conditions (e.g. cancer), we must filter or process the dataset to remove genes which are not informative.

 

Which genes lie in my copy number variation regions? Are these genes enriched for any biological functions or pathways?

This recipe provides a method for identifying biological functions for genes lying in copy number variation (CNV) regions. An example use of this recipe is a case where an investigator may want to examine their list of CNVs to see which genetic regions are amplified or deleted, and determine the biological functions or pathways associated with these amplified or deleted regions.

 

Copy number variations (CNVs) are large alterations to genomes, such as amplification or deletion of large segments of a chromosome. They can range in size from a focal aberration in a single gene to aberrations covering entire chromosome arms. These variations in the genome have been associated with different conditions, such as cancer. Many genomic analyses produce a set of genes which are assumed to be relevant to an underlying biological mechanism or phenotype. Thus, an investigator often has additional questions about the function or relatedness of these genes: Are they part of the same pathway? Do the gene products interact physically? Do the gene products localize to a specific part of the cell? Are the genes associated with certain stages of development? These questions, and others like them, can be answered by performing functional annotation of gene lists to better understand the underlying connections between genes.

In this particular example, we imagine a scenario in which an investigator identifies CNV regions that are amplified or deleted in glioblastoma multiforme (GBM) tumor samples, using a method called Genomic Identification of Significant Targets in Cancer (GISTIC, Mermel et al. (2011) Genome Biol.). Given a set of CNV regions, the goal is to infer which biological functions (e.g., metabolic and regulatory pathways, chemical perturbation signatures, etc.) are overrepresented in the set of reference genes that overlap with these regions. In particular, this recipe uses several Galaxy tools to find the overlap between CNV regions and reference genes obtained from the UCSC Table Browser. Then it uses the Molecular Signatures Database (MSigDB) to identify biological functions of the overlapped genes.

How can I use this recipe? This recipe may be modified to analyze CNV regions derived from any organism for which an annotated reference genome exists. Nor does the recipe depend on the algorithm used to identify these regions (GISTIC); any source of CNV data can be used. Once an investigator has pinpointed the CNV regions believed to be influencing their phenotype (disease state, cell type, etc.) of study he/she can use this recipe to identify functional pathways that may be affected by these copy number changes and draw closer to an understanding of the mechanisms behind these CNV effects.

Filter by analysis type

Filter by data type

Filter by all available tags

Filter by tool