Identify and validate coding variants from exome sequencing data
Added by GenomeSpaceTeam on 2015.05.19
Last updated on about 2 years ago.
What coding variants do my exome sequences contain, and are any of them phenotypically interesting?
This recipe provides a method to identify coding variants from exome sequencing data. An example use of this recipe is a case where an investigator may want to identify variants from their alignment data and evaluate the phenotypes.
In particular, this recipe uses the variant detector, FreeBayes, in Galaxy to identify potential variants from short-read alignments and assign them quality scores. We also use other Galaxy tools to pare these potential variants down to a high quality subset and to compare them against a catalog of gold-standard variants. Finally, we use IGV to visualize these variants alongside known pathogenic variants.
Why identify genetic variants? Genetic variants are DNA sequence differences that occur with relative frequency in a population. These variants include single nucleotide polymorphisms (SNPs), insertions and deletions (indels), multi-nucleotide polymorphisms (MNPs), and complex events that are combinations of indels and polymorphisms. With the advent of next generation sequencing technologies, it has become common practice to identify variants using sequence data. This is usually accomplished by comparing sequence data from many samples or individuals against a reference. Identified variants can have many practical applications. For instance, variants that correlate strongly with disease risk may provide insight into the disease's genetic underpinnings. Variants may also be used as biomarkers for disease risk or to DNA "fingerprint" individuals.
Why use exome sequencing data? The human exome comprises all the exonic regions of the human genome. That is, it includes all the DNA sequences that, after transcription into RNA, remain after RNA-splicing, which amounts to about 1% of the entire genome. Whole exome sequencing (WES) thus generates much less data than whole genome sequencing (WGS). Furthermore, since WES focuses almost exclusively on protein-coding regions it is an effective method for identifying coding variants.
NOTE: Working with DNA sequencing data is a computationally and resource intensive process. Files containing raw sequence reads can be many gigabytes in size and can be cumbersome to manage. To maintain the usability of this recipe, we have chosen to work with only a partial region of the human genome, specifically the exonic regions of a chromosome 20.
To complete this recipe, we will need a number of datasets of various types. First, we will need a set of exonic sequence reads that have been pre-processed for quality and duplicates and aligned to the genome. In this example, we will use aligned reads generated by the 1000 Genomes Project that have been mapped to the exonic regions of chromosome 20. These samples are derived from 10 individuals in the project's Great Britain population cohort. We also require a reference genome in FASTA format, which we will use to identify variants. Finally, we will need a set of gold-standard variants as well as a set of variant annotations to compare against our discovered variants. We will need the following datasets, which can be downloaded from the following GenomeSpace Public folder:
1000GP.GBR.chrom20.exome.tar.gz: This is an archive of aligned, sorted, and duplicate-marked chromosome 20 exome reads in BAM format. Specifically, the archive includes the alignments of 10 individuals drawn from the Great Britain population in Phase 3 of the 1000 Genome Project.
hs_ref_GRCh37.p5_chr20.fa: This is a FASTA file containing sequences for a human reference genome, chromosome 20 only.
hapmap_3.3.b37.chr20.vcf: This is a VCF formatted file containing genetic variants identified by the International HapMap Project (version 3.3) filtered to include only variants on chromosome 20.
dbSNP_135.no1000GProduction.chr20.pathogenic.vcf: This is a VCF formatted file containing genetic variants identified by the NCBI Short Genetic Variations database, dbSNP.
This recipe identifies a set of high confidence genetic variants in the exonic regions of chromosome 20.
Load the files into Galaxy using the following method.
NOTE: If you have not yet associated your GenomeSpace account with your Galaxy account, you will be asked to do so. If you do not yet have a Galaxy account, you can automatically generate a new account that will be associated with your GenomeSpace account.
Un-zip the archive of BAM files.
Expand Archive. Choose your own directory.
Expand Archivebutton to unzip the
tar.gzfile, and extract all the files to a new folder in your home directory,
Send the files to Galaxy.
Get Data > GenomeSpace import. Navigate to the newly created directory,
Send to Galaxy.
hs_ref_GRCh37.p5_chr20.fa file.file to Galaxy.
SequenceData. Select the
Send to Galaxy.
hapmap_3.3.b37.chr20.vcf filefile to Galaxy.
VariantData. Select the
Send to Galaxy.
NOTE: While GenomeSpace provide multiple methods for sending data to tools, in this instance we highly suggest using the above method. Users may encounter errors if using other methods due to the relatively large size and number of files.
We will use a pre-built Galaxy workflow to call and filter variants in the exome regions of chromosome 20. This workflow first uses the haplotype-based genetic variant detector,
FreeBayes, to call variants from the short-read sequencing alignments, generating a VCF file that lists the discovered variants, genotypes, and various quality metrics. (The VCF or Variant Call Format is a standard file format for reporting genetic variants.) This workflow also uses
VCFintersect, from the
VCFlib toolkit, to filter the results from
FreeBayes for high quality, known variants. The resulting Galaxy workflow output is thus a VCF file containing high quality variants identified by
FreeBayes and extant in the gold-standard HapMap catalog.
start using this workflow.
1: Input Dataset -
2: Input Dataset -
3: Input Dataset -
4: Input Dataset -
5: Input Dataset -
6: Input Dataset -
7: Input Dataset -
8: Input Dataset -
9: Input Dataset -
10: Input Dataset -
11: Input Dataset -
12: Input Dataset -
16: GenomeSpace Exporter:
Choose Target Directoryparameter.
In this step, we use IGV to visualize high-confidence genetic variants, which we have identified in our British population sample, alongside known pathogenic variants defined in dbSNP.
Launch, prompting the download of a .jnlp file. Double-click the file to launch IGV.
GenomeSpacetab to import files by navigating to the following menu:
Load File from GenomeSpace....
Open. Load the following files into IGV:
Your Home Directory>
This is an example interpretation of the results from this recipe. In our Galaxy workflow we first used
FreeBayes to compare our short-read exome sequencing alignments to a reference sequence in order to determine likely genetic variants in our sample cohort. Next we used
VCFfilter to filter these results for high quality variants based on various metrics (e.g., read depth, quality score). We also used
VCFintersect to further narrow our results down to those variants with equivalent alleles in the HapMap catalog. Finally, we turned to IGV to view our filtered variants alongside a set of known pathogenic variants identified in the dbSNP database. Specifically, we would like to see if any of our identified variants may be associated with an increased risk for disease.
We zoom in on a specific variant located in the q13.31 cytoband of chromosome 20 by typing the coordinates 20:54961541 into the regions navigation bar and hitting enter. Both our results track (upper) and dbSNP track (lower) display a variant at this site, as indicated by the blue and red bar (the proportion of red in the bar corresponds to the allele frequency of the variant). The group of 10 tracks display the genotypes of the 10 individuals whose sequences we used in this recipe. Specifically, 4 individuals are heterozygous for the variant (blue), 2 individuals are homozygous (cyan), and the remaining 4 individuals are homozygous (grey) for the reference allele.
Clicking on the variants in either track brings up info-boxes that display various statistics for the variant. By comparing to the dbSNP track, we notice that our variant analysis has identified an A/T SNP that corresponds to the pathogenic variant, rs2273535, in the AURKA gene. Moreover, our set of 10 samples carried the minor allele with a frequency of 0.4, which corresponds closely to the allele frequency of 0.316 reported in the dbSNP annotations. Given this correspondence, we reason that we have identified the pathogenic SNP rs2273535 in our sample set of 10 British individuals. If we search for SNP id rs2273535 in ClinVar (link), we see that it has been identified as a risk factor for colon cancer. However, note that due to our small sample size these variant discoveries may not be generalizable to the whole population; this is only a simple representation of possible results.