Usage examples.
In this protocol we show how to analyze genomic variants using the SnpEff pipeline.
Computer hardware: The materials required for this protocol are:

# Move to home directory
cd
# Download and install SnpEff
curl -v -L http://sourceforge.net/projects/snpeff/files/snpEff_latest_core.zip > snpEff_latest_core.zip
unzip snpEff_latest_core.zip
cd snpEff
java -jar snpEff.jar download -v GRCh37.75

A list of pre-built databases for all other species is available by running the following command:
java -jar snpEff.jar databases
We show how to use SnpEff & SnpSift to annotate, prioritize and filter coding variants.
Dataset: In this genomic annotation example, we use a simulated dataset to show how to find genetic variants of a Mendelian recessive disease, Cystic fibrosis, caused by a high impact coding variant, a nonsense mutation in the CFTR gene (G542*). The data files come from the publicly available "CEPH_1463" dataset, sequenced by Complete Genomics, and contain sequencing information for a family consisting of 4 grandparents, 2 parents and 11 siblings.
Although these are healthy individuals, we artificially introduced a known Cystic fibrosis mutation in three siblings (cases) in a manner that was consistent with the underlying haplotype structure.
We now download and uncompress the example data used in this protocol, which, for reasons of space and time, is limited to chromosomes 7 and 17 only:

# Go to SnpEff's dir
cd ~/snpEff
# Download sample data
curl -v -L http://sourceforge.net/projects/snpeff/files/protocols.zip > protocols.zip
unzip protocols.zip
-stats
):
java -Xmx4g -jar snpEff.jar -v -stats ex1.html GRCh37.75 protocols/ex1.vcf > protocols/ex1.ann.vcf
java -Xmx1g -jar SnpSift.jar \
    caseControl \
    -v \
    -tfam protocols/pedigree.tfam \
    protocols/ex1.ann.vcf \
    > protocols/ex1.ann.cc.vcf

For example, the annotations Cases=1,1,3 and Controls=8,6,22 correspond to the number of homozygous non-reference, heterozygous and total allele counts in cases and controls for each variant.
The program also calculates basic statistics for each variant based on the allele frequencies in the two groups using different models, which can be useful as a starting point for more in-depth statistical analysis.
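To make the meaning of these triples concrete, here is a minimal Python sketch of how such counts could be tallied for one group of samples. It is not SnpSift's implementation: the function name `count_genotypes` and the genotype strings are hypothetical, and it assumes that the third number is the count of non-reference alleles (which is consistent with the examples above, since 2*1 + 1 = 3 and 2*8 + 6 = 22).

```python
def count_genotypes(genotypes):
    """Tally (homozygous non-reference, heterozygous, non-reference allele count).

    `genotypes` is a list of diploid VCF-style genotype strings such as
    "0/1" or "1|1". Genotypes with missing alleles (".") are skipped.
    This is an illustrative assumption, not SnpSift's actual code.
    """
    hom = 0
    het = 0
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles:
            continue  # missing call: not counted
        n_alt = sum(1 for allele in alleles if allele != "0")
        if n_alt == 2:
            hom += 1
        elif n_alt == 1:
            het += 1
    return hom, het, 2 * hom + het


# Hypothetical group: one homozygous variant, one heterozygous, one reference
print(count_genotypes(["1/1", "0/1", "0/0"]))  # (1, 1, 3)
```

Running the same function separately on the case samples and on the control samples would give one triple per group, analogous to the Cases and Controls fields.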
We use the SnpSift filter command to reduce the number of candidate loci based on alleles in cases and controls.
SnpSift filter allows users to create powerful filters that select variants using Boolean expressions containing data from the VCF fields.
The expression we use to filter the VCF file "ex1.ann.vcf" is developed as follows.
(Cases[0] = 3) & (Controls[0] = 0)
The full command line is:
cat protocols/ex1.ann.cc.vcf | java -jar SnpSift.jar filter \
    "(Cases[0] = 3) & (Controls[0] = 0)" \
    > protocols/ex1.filtered.hom.vcf

The filtered output file, ex1.filtered.hom.vcf, contains over 400 variants satisfying our criteria. This is still too many to analyze by hand, so we can add another filter to see if any of these variants is expected to have a high impact. To identify variants where any of these impacts is classified as either HIGH or MODERATE, we add the condition (ANN[*].IMPACT = 'HIGH') | (ANN[*].IMPACT = 'MODERATE').
The new filtering commands become:
cat protocols/ex1.ann.cc.vcf \
    | java -jar SnpSift.jar filter \
    "(Cases[0] = 3) & (Controls[0] = 0) & ((ANN[*].IMPACT = 'HIGH') | (ANN[*].IMPACT = 'MODERATE'))" \
    > protocols/ex1.filtered.vcf
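The logic of the combined filter expression can be sketched in plain Python. This is only an approximation for illustration, not how SnpSift evaluates expressions: the helper names `parse_info` and `passes_filter` are hypothetical, the INFO strings are abbreviated, and the sketch assumes the impact is the third '|'-separated sub-field of each ANN entry.

```python
def parse_info(info):
    """Split a VCF INFO string into {key: [values]}; flags map to True."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value.split(",") if sep else True
    return fields


def passes_filter(info):
    """Approximate: (Cases[0] = 3) & (Controls[0] = 0) &
    ((ANN[*].IMPACT = 'HIGH') | (ANN[*].IMPACT = 'MODERATE'))"""
    fields = parse_info(info)
    cases_hom = int(fields["Cases"][0])        # Cases[0]
    controls_hom = int(fields["Controls"][0])  # Controls[0]
    # ANN[*] means "any annotation"; impact is sub-field index 2
    impacts = {ann.split("|")[2] for ann in fields.get("ANN", [])}
    return (cases_hom == 3
            and controls_hom == 0
            and bool(impacts & {"HIGH", "MODERATE"}))


# Abbreviated, hypothetical INFO strings
kept = "AC=14;AN=22;ANN=T|stop_gained|HIGH|CFTR,T|sequence_feature|LOW|CFTR;Cases=3,0,6;Controls=0,8,8"
dropped = "AC=6;AN=22;ANN=A|synonymous_variant|LOW|GENE1;Cases=3,0,6;Controls=0,0,0"
print(passes_filter(kept), passes_filter(dropped))
```

Note that `ANN[*]` matches if any single annotation of the variant satisfies the condition, which is why the sketch collects the impacts of all annotations into a set before testing.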
One of the remaining variants is a stop_gained loss of function variant, whereas the other one is a missense_variant amino acid change. The first one is a known Cystic fibrosis variant.
$ cat protocols/ex1.filtered.vcf | ./scripts/vcfInfoOnePerLine.pl
7 117227832 . G T . . AC 14
    AN 22
    ANN T|stop_gained|HIGH|CFTR|ENSG00000001626|transcript|ENST00000003084|protein_coding|12/27|c.1624G>T|p.Gly542*|1756/6128|1624/4443|542/1480||
    ANN T|stop_gained|HIGH|CFTR|ENSG00000001626|transcript|ENST00000454343|protein_coding|11/26|c.1441G>T|p.Gly481*|1573/5949|1441/4260|481/1419||
    ANN T|stop_gained|HIGH|CFTR|ENSG00000001626|transcript|ENST00000426809|protein_coding|11/26|c.1534G>T|p.Gly512*|1534/4316|1534/4316|512/1437||WARNING_TRANSCRIPT_INCOMPLETE
    ANN T|sequence_feature|LOW|CFTR|ENSG00000001626|topological_domain:Cytoplasmic|ENST00000003084|protein_coding||c.1624G>T||||||
    ANN T|sequence_feature|LOW|CFTR|ENSG00000001626|domain:ABC_transporter_1|ENST00000003084|protein_coding||c.1624G>T||||||
    ANN T|sequence_feature|LOW|CFTR|ENSG00000001626|beta_strand|ENST00000003084|protein_coding|12/27|c.1624G>T||||||
    ANN T|sequence_feature|LOW|CFTR|ENSG00000001626|beta_strand|ENST00000454343|protein_coding|11/26|c.1441G>T||||||
    ANN T|upstream_gene_variant|MODIFIER|AC000111.5|ENSG00000234001|transcript|ENST00000448200|processed_pseudogene||n.-1C>A|||||1362|
    ANN T|downstream_gene_variant|MODIFIER|CFTR|ENSG00000001626|transcript|ENST00000472848|processed_transcript||n.*148G>T|||||29|
    LOF (CFTR|ENSG00000001626|11|0.27)
    NMD (CFTR|ENSG00000001626|11|0.27)
    Cases 3
    Cases 0
    Cases 6
    Controls 0
    Controls 8
    Controls 8
    CC_TREND 9.111e-04
    CC_GENO NaN
    CC_ALL 4.025e-02
    CC_DOM 6.061e-03
    CC_REC 1.000e+00
17 39135205 . ACA GCA,GCG . . AC 16
    AC 8
    AN 31
    ANN GCG|missense_variant|MODERATE|KRT40|ENSG00000204889|transcript|ENST00000377755|protein_coding||c.1045_1047delTGTinsCGC|p.Cys349Arg|1082/1812|1045/1296|349/431||
    ANN GCG|missense_variant|MODERATE|KRT40|ENSG00000204889|transcript|ENST00000398486|protein_coding||c.1045_1047delTGTinsCGC|p.Cys349Arg|1208/1772|1045/1296|349/431||
    ANN GCA|synonymous_variant|LOW|KRT40|ENSG00000204889|transcript|ENST00000377755|protein_coding|6/7|c.1047T>C|p.Cys349Cys|1082/1812|1047/1296|349/431||
    ANN GCA|synonymous_variant|LOW|KRT40|ENSG00000204889|transcript|ENST00000398486|protein_coding|8/9|c.1047T>C|p.Cys349Cys|1208/1772|1047/1296|349/431||
    ANN GCA|sequence_feature|LOW|KRT40|ENSG00000204889|region_of_interest:Coil_2|ENST00000398486|protein_coding|6/9|c.1047T>C||||||
    ANN GCG|sequence_feature|LOW|KRT40|ENSG00000204889|region_of_interest:Coil_2|ENST00000398486|protein_coding|7/9|c.1045_1047delTGTinsCGC||||||
    ANN GCA|sequence_feature|LOW|KRT40|ENSG00000204889|region_of_interest:Rod|ENST00000398486|protein_coding|3/9|c.1047T>C||||||
    ANN GCG|sequence_feature|LOW|KRT40|ENSG00000204889|region_of_interest:Rod|ENST00000398486|protein_coding|3/9|c.1045_1047delTGTinsCGC||||||
    ANN GCA|3_prime_UTR_variant|MODIFIER|KRT40|ENSG00000204889|transcript|ENST00000461923|nonsense_mediated_decay|8/9|n.*509T>C|||||2348|
    ANN GCG|3_prime_UTR_variant|MODIFIER|KRT40|ENSG00000204889|transcript|ENST00000461923|nonsense_mediated_decay|8/9|n.*507_*509delTGTinsCGC|||||2346|
    ANN GCA|downstream_gene_variant|MODIFIER|AC004231.2|ENSG00000234477|transcript|ENST00000418393|antisense||n.*815A>G|||||3027|
    ANN GCG|downstream_gene_variant|MODIFIER|AC004231.2|ENSG00000234477|transcript|ENST00000418393|antisense||n.*815_*815delACAinsGCG|||||3027|
    ANN GCA|non_coding_exon_variant|MODIFIER|KRT40|ENSG00000204889|transcript|ENST00000461923|nonsense_mediated_decay|8/9|n.*509T>C||||||
    ANN GCG|non_coding_exon_variant|MODIFIER|KRT40|ENSG00000204889|transcript|ENST00000461923|nonsense_mediated_decay|8/9|n.*507_*509delTGTinsCGC||||||
    Cases 3
    Cases 0
    Cases 6
    Controls 0
    Controls 12
    Controls 18
    CC_TREND 7.008e-02
    CC_GENO NaN
    CC_ALL 1.700e-01
    CC_DOM 1.231e-01
    CC_REC 1.000e+00
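The one-entry-per-line layout above comes from the vcfInfoOnePerLine.pl helper script. As a rough idea of the transformation, judging only from the output shown here, the INFO column appears to be split on ';' into key/value pairs and multi-valued fields on ',', with flag fields printed as "true". The Python sketch below mimics that behavior under those assumptions; the function name `info_one_per_line` is hypothetical and the real script may differ in details.

```python
def info_one_per_line(info):
    """Expand a VCF INFO string into a list of (key, value) pairs.

    Assumed behavior (inferred from the output shown, not from the script):
    - "KEY=a,b" becomes ("KEY", "a"), ("KEY", "b")
    - a flag "KEY" without "=" becomes ("KEY", "true")
    """
    pairs = []
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        if not sep:
            pairs.append((key, "true"))
            continue
        for single_value in value.split(","):
            pairs.append((key, single_value))
    return pairs


# Abbreviated, hypothetical INFO string
for key, value in info_one_per_line("AC=14;AN=22;Cases=3,0,6;ASP"):
    print(key, value)
```

This explains why a field such as Cases, which holds three comma-separated numbers, shows up as three consecutive "Cases" lines in the listing.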
java -jar SnpSift.jar pedShow \
    protocols/pedigree.tfam \
    protocols/ex1.filtered.vcf \
    protocols/chart

java -Xmx1g -jar SnpSift.jar \
    annotate \
    -v \
    protocols/db/clinvar_00-latest.vcf \
    protocols/ex1.ann.cc.vcf \
    > protocols/ex1.ann.cc.clinvar.vcf
stop_gained
annotation):
$ cat protocols/ex1.ann.cc.clinvar.vcf \
    | java -jar SnpSift.jar filter \
    "(exists CLNDBN) & (ANN[*].EFFECT has 'stop_gained') & (ANN[*].GENE = 'CFTR')" \
    > protocols/ex1.ann.cc.clinvar.filtered.vcf

$ cat protocols/ex1.ann.cc.clinvar.filtered.vcf | ./scripts/vcfInfoOnePerLine.pl
7 117227832 rs113993959 G T . . AC 14
    AN 22
    ANN T|stop_gained|HIGH|CFTR|ENSG00000001626|transcript|ENST00000003084|protein_coding|12/27|c.1624G>T|p.Gly542*|1756/6128|1624/4443|542/1480||
    ANN T|stop_gained|HIGH|CFTR|ENSG00000001626|transcript|ENST00000454343|protein_coding|11/26|c.1441G>T|p.Gly481*|1573/5949|1441/4260|481/1419||
    ANN T|stop_gained|HIGH|CFTR|ENSG00000001626|transcript|ENST00000426809|protein_coding|11/26|c.1534G>T|p.Gly512*|1534/4316|1534/4316|512/1437||WARNING_TRANSCRIPT_INCOMPLETE
    ANN T|sequence_feature|LOW|CFTR|ENSG00000001626|topological_domain:Cytoplasmic|ENST00000003084|protein_coding||c.1624G>T||||||
    ANN T|sequence_feature|LOW|CFTR|ENSG00000001626|domain:ABC_transporter_1|ENST00000003084|protein_coding||c.1624G>T||||||
    ANN T|sequence_feature|LOW|CFTR|ENSG00000001626|beta_strand|ENST00000003084|protein_coding||c.1624G>T||||||
    ANN T|sequence_feature|LOW|CFTR|ENSG00000001626|beta_strand|ENST00000454343|protein_coding||c.1441G>T||||||
    ANN T|upstream_gene_variant|MODIFIER|AC000111.5|ENSG00000234001|transcript|ENST00000448200|processed_pseudogene||n.-1C>A|||||1362|
    ANN T|downstream_gene_variant|MODIFIER|CFTR|ENSG00000001626|transcript|ENST00000472848|processed_transcript||n.*148G>T|||||29|
    LOF (CFTR|ENSG00000001626|11|0.27)
    NMD (CFTR|ENSG00000001626|11|0.27)
    Cases 3
    Cases 0
    Cases 6
    Controls 0
    Controls 8
    Controls 8
    CC_TREND 9.111e-04
    CC_GENO NaN
    CC_ALL 4.025e-02
    CC_DOM 6.061e-03
    CC_REC 1.000e+00
    ASP true
    CLNACC RCV000007535.6|RCV000058931.3|RCV000119041.1
    CLNALLE 1
    CLNDBN Cystic_fibrosis|not_provided|Hereditary_pancreatitis
    CLNDSDB GeneReviews:MedGen:OMIM:Orphanet:SNOMED_CT|MedGen|GeneReviews:MedGen:OMIM:Orphanet:SNOMED_CT
    CLNDSDBID NBK1250:C0010674:219700:ORPHA586:190905008|CN221809|NBK84399:C0238339:167800:ORPHA676:68072000
    CLNHGVS NC_000007.13:g.117227832G>T
    CLNORIGIN 1
    CLNREVSTAT prof|single|single
    CLNSIG 5|5|5
    CLNSRC CFTR2|HGMD|OMIM_Allelic_Variant|OMIM_Allelic_Variant
    CLNSRCID G542X|CM900049|602421.0009|602421.0095
    GENEINFO CFTR:1080
    LSD true
    NSN true
    OM true
    PM true
    PMC true
    REF true
    RS 113993959
    RSPOS 117227832
    S3D true
    SAO 1
    SSR 0
    VC SNV
    VLD true
    VP 0x050268000605040002110100
    WGT 1
    dbSNPBuildID 132
java -Xmx4g -jar snpEff.jar \
    -v \
    -o gatk \
    GRCh37.75 \
    protocols/ex1.vcf \
    > protocols/ex1.ann.gatk.vcf

java -Xmx4g -jar $HOME/tools/gatk/GenomeAnalysisTK.jar \
    -T VariantAnnotator \
    -R $HOME/genomes/GRCh37.75.fa \
    -A SnpEff \
    --variant protocols/ex1.vcf \
    --snpEffFile protocols/ex1.ann.gatk.vcf \
    -L protocols/ex1.vcf \
    -o protocols/ex1.gatk.vcf
We show how to use SnpEff & SnpSift to annotate, prioritize and filter non-coding variants.
Dataset: This example shows how to perform basic annotation of non-coding variants.
It is based on a short list of 20 non-coding variants that were identified by sequencing a 700 kb region surrounding the gene T-box transcription factor (TBX5) in 260 patients with congenital heart disease [67].
TBX5 is a transcription factor that plays a well-established dosage-dependent role in heart and limb development.
Coding mutations in TBX5 have been frequently identified in patients with Holt–Oram syndrome, which is associated with abnormal hand, forearm and cardiac development.
Data source: Regulatory variation in a TBX5 enhancer leads to isolated congenital heart disease.
java -Xmx4g -jar snpEff.jar \
    -v \
    -motif \
    GRCh37.75 \
    protocols/ex2.vcf \
    > protocols/ex2.ann.basic.vcf

java -Xmx4g -jar snpEff.jar \
    -v \
    -motif \
    -interval protocols/ex2_regulatory.bed \
    GRCh37.75 \
    protocols/ex2.vcf \
    > protocols/ex2.ann.vcf

java -Xmx1g -jar SnpSift.jar \
    phastCons \
    -v \
    protocols/phastcons \
    protocols/ex2.ann.vcf \
    > protocols/ex2.ann.cons.vcf

cat protocols/ex2.ann.cons.vcf \
    | java -jar SnpSift.jar filter \
    "(ANN[*].EFFECT = 'CUSTOM[ex2_regulatory]') & (exists PhastCons) & (PhastCons > 0.9)" \
    > protocols/ex2.filtered.vcf
Here we show an example of how to get from sequencing data to an annotated variants file.
# Download the genome, uncompress and rename file
wget ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r5.34_FB2011_02/fasta/dmel-all-chromosome-r5.34.fasta.gz
gunzip dmel-all-chromosome-r5.34.fasta.gz
mv dmel-all-chromosome-r5.34.fasta dm5.34.fasta

# Create a genome index (we assume you installed BWA http://bio-bwa.sourceforge.net/)
bwa index -a bwtsw dm5.34.fasta

# Map sequences to the genome: Create SAI file
bwa aln dm5.34.fasta s.fastq > s.sai

# Map sequences to the genome: Create SAM file
bwa samse dm5.34.fasta s.sai s.fastq > s.sam

# Create BAM file (we assume you installed SamTools http://samtools.sourceforge.net/)
samtools view -S -b s.sam > s.bam

# Sort BAM file (will create s_sort.bam)
samtools sort s.bam s_sort

# Create VCF file (BcfTools is part of samtools distribution)
samtools mpileup -uf dm5.34.fasta s_sort.bam | bcftools view -vcg - > s.vcf

# Analyze variants using snpEff
java -Xmx4g -jar snpEff.jar dm5.34 s.vcf > s.ann.vcf

This highly simplified sequencing data analysis pipeline has these basic steps: index the reference genome, map the reads to it, convert the alignments to a sorted BAM file, call variants, and annotate the variants with SnpEff.
These are slightly more advanced examples. Here we'll try to show how to perform specific tasks. If you want to filter out SNPs from dbSnp, you can do it using SnpSift. You can download SnpSift from the "Downloads" page.
You can download the file for this example here.
Here is how to do it:
Annotate the ID field using SnpSift annotate and dbSnp:

# Annotate ID field using dbSnp
# Note: SnpSift will automatically download and uncompress dbSnp database if not locally available.
java -jar SnpSift.jar annotate -dbsnp file.vcf > file.dbSnp.vcf
java -Xmx4g -jar snpEff.jar eff -v GRCh37.75 file.dbSnp.vcf > file.ann.vcf
java -jar SnpSift.jar filter -f file.ann.vcf "! exists ID" > file.ann.not_in_dbSnp.vcf

The expression used to filter the file is "! exists ID". This means that the ID field does not exist (i.e. the value is empty), which is represented as a dot (".") in a VCF file.
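To illustrate what this condition selects, here is a small Python sketch that keeps header lines and only those variant lines whose ID column (the third tab-separated column of a VCF line) is a dot. It is a simplified stand-in for the idea behind "! exists ID", not SnpSift's implementation; the function name `not_in_dbsnp` and the example records are made up.

```python
def not_in_dbsnp(vcf_lines):
    """Yield header lines plus variant lines whose ID column is '.'."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # keep VCF header lines untouched
            continue
        columns = line.rstrip("\n").split("\t")
        if columns[2] == ".":  # ID is the third column; '.' means empty
            yield line


# Hypothetical records: the second one carries a dbSnp identifier
example = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\t.\tG\tA\t.\t.\t.",
    "1\t2000\trs0000001\tC\tT\t.\t.\t.",
]
for kept_line in not_in_dbsnp(example):
    print(kept_line)
```

Only the header and the variant at position 1000 are printed, since the variant at position 2000 has a non-empty ID.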
"-"
java -jar SnpSift.jar annotate -dbsnp file.vcf \
    | java -Xmx4g -jar snpEff.jar eff -v GRCh37.75 - \
    | java -jar SnpSift.jar filter "! exists ID" \
    > file.ann.not_in_dbSnp.vcf

Here is an example of some entries in the annotated output file. You can see the 'ANN' field was added, predicting STOP_GAINED protein changes:

$ cat demo.1kg.snpeff.vcf | grep stop_gained
1 889455 . G A 100.0 PASS ...;ANN=A|stop_gained|HIGH|...
1 897062 . C T 100.0 PASS ...;ANN=T|stop_gained|HIGH|...
1 900375 . G A 100.0 PASS ...;ANN=A|stop_gained|HIGH|...

Note: The real output was edited for readability reasons.
SnpEff can annotate using user specified (custom) genomic intervals, allowing you to add any kind of annotations you want.
In this example, we are analyzing using a specific version of the Yeast genome (we will assume that the database is not available, just to show a more complete example). We also want to add annotations of genomic regions known as 'ARS', which are defined in a GFF file. This turns out to be quite easy, thanks to SnpEff's "custom intervals" feature. SnpEff allows you to add "custom" annotations from intervals in several formats: TXT, BED, BigBed, VCF, GFF.

#---
# Download data
#---
$ cd ~/snpEff
$ mkdir data/sacCer
$ cd data/sacCer
$ wget http://downloads.yeastgenome.org/curation/chromosomal_feature/saccharomyces_cerevisiae.gff
$ mv saccharomyces_cerevisiae.gff genes.gff

Now that we've downloaded the reference genome, we can build the database:

#---
# Build
#---
$ cd ../..
# Add entry to config file
$ echo "sacCer.genome : Yeast" >> snpEff.config
# Build database
$ java -Xmx1G -jar snpEff.jar build -gff3 sacCer
#---
# Create a features file
#---
# GFF files have both genomic records and sequences, we need to know
# where the 'records' section ends (it is delimited by a "##FASTA" line)
$ grep -n "^#" data/sacCer/genes.gff | tail -n 1
22994:##FASTA

# Note that I'm cutting the INFO column (only for readability reasons)
$ head -n 22994 data/sacCer/genes.gff \
    | grep -v "^#" \
    | grep ARS \
    | cut -f 1 -d ";" \
    > sacCer_features.gff

So now we have a custom file ready to be used.
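For readers who prefer to see the extraction logic spelled out, the shell pipeline above can be approximated in Python as follows. This is a sketch under the same assumptions as the shell version (records end at the "##FASTA" line, ARS features are found by a plain substring match, and everything after the first ';' is dropped); the function name `extract_ars_features` and the sample lines are hypothetical.

```python
def extract_ars_features(gff_lines):
    """Return ARS feature records from the records section of a GFF file.

    Mirrors: head (up to ##FASTA) | grep -v "^#" | grep ARS | cut -f 1 -d ";"
    """
    features = []
    for line in gff_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # sequence section starts here; stop reading records
        if line.startswith("#"):
            continue  # skip comment and directive lines
        if "ARS" not in line:
            continue  # keep only lines mentioning ARS
        features.append(line.split(";", 1)[0])  # drop text after first ';'
    return features


# Made-up GFF content, for illustration only
sample = [
    "##gff-version 3",
    "chrI\tSGD\tgene\t335\t649\t.\t+\t.\tID=GENE1;Name=GENE1",
    "chrI\tSGD\tARS\t650\t1791\t.\t.\t.\tID=ARS000;Name=ARS000",
    "##FASTA",
    ">chrI",
    "CCACACCACACCCACACACCARS",
]
print(extract_ars_features(sample))
```

Stopping at "##FASTA" matters because sequence lines could otherwise be picked up by the substring match, as the last sample line shows.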
#---
# Features annotations example
#---
# Create a fake VCF file (one line), this is just an example to show that it works
$ echo -e "chrI\t700\t.\tA\tT\t.\t.\t." > my.vcf
$ java -jar snpEff.jar -interval sacCer_features.gff sacCer my.vcf > my.ann.vcf

If we take a look at the results, we can see that the "ARS" feature is annotated (see last line):

$ cat my.ann.vcf | grep -v "^#" | cut -f 8 | tr ",;" "\n\n"
EFF=missense_variant(LOW|MISSENSE|Cca/Tca|p.Pro55Ser/c.163A>T|84|YAL068W-A|protein_coding|CODING|YAL068W-A_mRNA|1|1|WARNING_REF_DOES_NOT_MATCH_GENOME)
upstream_gene_variant(MODIFIER||1780||75|YAL067W-A|protein_coding|CODING|YAL067W-A_mRNA||1)
downstream_gene_variant(MODIFIER||1107||120|YAL068C|protein_coding|CODING|YAL068C_mRNA||1)
downstream_gene_variant(MODIFIER||51||104|YAL069W|protein_coding|CODING|YAL069W_mRNA||1)
custom[sacCer_features](MODIFIER||||||ARS102||||1)