Documentation
This SnpEff version implements the new VCF annotation standard 'ANN' field.
This new format specification has been created by the developers of the most widely used variant annotation programs (SnpEff, ANNOVAR and ENSEMBL's VEP).
The old 'EFF' field, compatible with older versions, is still available through the -formatEff command line option.
SnpEff is a variant annotation and effect prediction tool. It annotates and predicts the effects of genetic variants (such as amino acid changes).
For an older version of this page, see: Manual page for SnpEff version 4.0
Downloading and installing SnpEff is pretty easy; take a look at the download page.
Take a look at the "Source code" section.
Variants
By genetic variant we mean a difference between a genome and a "reference" genome.
As an example, imagine we are sequencing a "sample".
Here "sample" can mean anything that you are interested in studying, from a cell culture, to a mouse or a cancer patient.
It is a standard procedure to compare your sample sequences against the corresponding "reference genome".
For instance you may compare the cancer patient genome against the "reference genome".
In a typical sequencing experiment, you will find many places in the genome where your sample differs from the reference genome.
These are called "genomic variants" or just "variants".
Typically, variants are categorized as follows:
Type | What it means | Example |
---|---|---|
SNP | Single-Nucleotide Polymorphism | Reference = 'A', Sample = 'C' |
Ins | Insertion | Reference = 'A', Sample = 'AGT' |
Del | Deletion | Reference = 'AC', Sample = 'C' |
MNP | Multiple-nucleotide polymorphism | Reference = 'ATA', Sample = 'GTC' |
MIXED | Multiple-nucleotide and an InDel | Reference = 'ATA', Sample = 'GTCAGT' |
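The categories in the table can be illustrated with a small sketch that classifies a REF/ALT pair. This is a simplified heuristic written for this page, not SnpEff's own implementation; the function name `classify_variant` is hypothetical.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Return SNP, MNP, INS, DEL or MIXED for a REF/ALT pair (simplified heuristic)."""
    if len(ref) == len(alt):
        # Same length: a single base change is a SNP, several bases an MNP
        return "SNP" if len(ref) == 1 else "MNP"
    longer, shorter = (alt, ref) if len(alt) > len(ref) else (ref, alt)
    # Assumption: a pure InDel keeps the shorter allele intact at one end of the longer one
    if longer.startswith(shorter) or longer.endswith(shorter):
        return "INS" if len(alt) > len(ref) else "DEL"
    # Length change combined with base changes
    return "MIXED"


# The five examples from the table above
examples = [("A", "C"), ("A", "AGT"), ("AC", "C"), ("ATA", "GTC"), ("ATA", "GTCAGT")]
for ref, alt in examples:
    print(ref, alt, classify_variant(ref, alt))
```

Real variant callers normalize alleles in different ways, so a production classifier would need to handle more cases than this sketch does.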
Annotations
So, you have a huge file describing all the differences between your sample and the reference genome.
But you want to know more about these variants than just their genetic coordinates.
E.g.: Are they in a gene? In an exon? Do they change protein coding? Do they cause premature stop codons?
SnpEff can help you answer all these questions.
The process of adding this information about the variants is called "Annotation".
SnpEff provides several degrees of annotations, from simple (e.g. which gene is each variant affecting) to extremely complex annotations (e.g. will this non-coding variant affect the expression of a gene?).
It should be noted that the more complex the annotations, the more they rely on computational predictions.
Such computational predictions can be incorrect, so results from SnpEff (or any prediction algorithm) cannot be trusted blindly; they must be analyzed and independently validated by corresponding wet-lab experiments.
Feature | Comment |
---|---|
Local install | SnpEff can be installed in your local computer or servers. Local installations are preferred for processing genomic data. As opposed to remote web-based services, running a program locally has many advantages. |
Multi platform | SnpEff is written in Java. It runs on Unix / Linux, OS X and Windows. |
Simple installation | Installation is as simple as downloading a ZIP file and double clicking on it. |
Genomes | Human genome, as well as all model organisms are supported. Over 2,500 genomes are supported, which includes most mammalian, plant, bacterial and fungal genomes with published genomic data. |
Speed | SnpEff is really fast. It can annotate up to 1,000,000 variants per minute. |
GATK integration | SnpEff can be easily integrated with GATK pipelines. |
GKNO integration | SnpEff is integrated with GKNO pipelines. |
GUI | Web based user interface via Galaxy project |
Input and Output formats | SnpEff accepts input files in VCF or BED format and writes output in VCF, GATK, BED or BedAnn format. |
Variants supported | SnpEff can annotate SNPs, MNPs, insertions and deletions. Support for mixed variants and structural variants is available (although sometimes limited). |
Effect supported | Many effects are calculated: such as SYNONYMOUS_CODING, NON_SYNONYMOUS_CODING, FRAME_SHIFT, STOP_GAINED just to name a few. |
Variant impact | SnpEff provides a simple assessment of the putative impact of the variant (e.g. HIGH, MODERATE or LOW impact). |
Cancer tissue analysis | Somatic vs Germline mutations can be calculated on the fly. This is very useful for the cancer researcher community. |
Loss of Function (LOF) assessment | SnpEff can estimate if a variant is deemed to have a loss of function on the protein. |
Nonsense mediated decay (NMD) assessment | Some mutations may cause mRNA to be degraded and thus not translated into a protein. NMD analysis marks mutations that are estimated to trigger nonsense mediated decay. |
HGVS notation | SnpEff can provide output in HGVS notation, which is quite popular in clinical and translational research environments. |
User annotations | A user can provide custom annotations (by means of BED files). |
Public databases | SnpEff can annotate using publicly available data from well known databases. |
Common variants (dbSnp) | Annotating "common" variants from dbSnp and 1,000 Genomes can be easily done (see SnpSift annotate ). |
Gwas catalog | Support for GWAS catalog annotations (see SnpSift gwasCat ) |
Conservation scores | PhastCons conservation score annotations support (see SnpSift phastCons ) |
DbNsfp | A comprehensive database providing many annotations and scores, such as: SIFT, Polyphen2, GERP++, PhyloP, MutationTaster, SiPhy, Interpro, Haploinsufficiency, etc. (via SnpSift). |
Non-coding annotations | Regulatory and non-coding annotations are supported for different tissues and cell lines. Annotations supported include PolII, H3K27ac, H3K4me2, H3K4me3, H3K27me3, CTCF, H3K36me3, just to name a few. |
Gene Sets annotations | Gene sets (MSigDb, GO, BioCarta, KEGG, Reactome, etc.) can be used to annotate via SnpSift geneSets command. |
The 'databases' command shows the currently available databases:
$ java -jar snpEff.jar databases | less
Note that a genome may be listed under its assembly name, e.g. GRCh37 instead of hg19, or GRCm38 instead of mm10, and so on.
$ java -jar snpEff.jar databases | grep -i musculus
GRCm38.68 Mus_musculus OK http://sourceforge.net/projects/snpeff/files/databases/v3.4/snpEff_v3.4_Mus_musculus.zip
GRCm38.69 Mus_musculus http://sourceforge.net/projects/snpeff/files/databases/v3.4/snpEff_v3.4_Mus_musculus.zip
GRCm38.70 Mus_musculus http://sourceforge.net/projects/snpeff/files/databases/v3.4/snpEff_v3.4_Mus_musculus.zip
GRCm38.71 Mus_musculus http://sourceforge.net/projects/snpeff/files/databases/v3.4/snpEff_v3.4_Mus_musculus.zip
GRCm38.72 Mus_musculus http://sourceforge.net/projects/snpeff/files/databases/v3.4/snpEff_v3.4_Mus_musculus.zip
GRCm38.73 Mus_musculus http://sourceforge.net/projects/snpeff/files/databases/v3.4/snpEff_v3.4_Mus_musculus.zip
GRCm38.74 Mus_musculus http://sourceforge.net/projects/snpeff/files/databases/v3.4/snpEff_v3.4_Mus_musculus.zip
NCBIM37.64 Mus_musculus http://sourceforge.net/projects/snpeff/files/databases/v3.4/snpEff_v3.4_Mus_musculus.zip
NCBIM37.65 Mus_musculus http://sourceforge.net/projects/snpeff/files/databases/v3.4/snpEff_v3.4_Mus_musculus.zip
NCBIM37.66 Mus_musculus http://sourceforge.net/projects/snpeff/files/databases/v3.4/snpEff_v3.4_Mus_musculus.zip
In this example, the most recent version listed is GRCm38.74.
Again, these are the version numbers at the time of writing this paragraph; in the future there will be other releases and you should update to the corresponding version.
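Picking the most recent release from such a listing can be sketched as follows. This is an illustration only: it assumes whitespace-separated lines whose first column is 'ASSEMBLY.RELEASE', as in the excerpt above (the real 'databases' output may contain header lines or other columns), and the function name `latest_release` is hypothetical. The sample listing is shortened and the URLs are omitted.

```python
def latest_release(lines, assembly):
    """Return the database name with the highest numeric release for an assembly."""
    best = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        name = fields[0]
        # 'GRCm38.74' -> prefix 'GRCm38', release '74'
        prefix, _, release = name.rpartition(".")
        if prefix != assembly or not release.isdigit():
            continue
        if best is None or int(release) > best[0]:
            best = (int(release), name)
    return best[1] if best else None


# Shortened sample of the listing shown above (URLs omitted)
listing = """\
GRCm38.68 Mus_musculus OK
GRCm38.74 Mus_musculus
NCBIM37.66 Mus_musculus
GRCm38.69 Mus_musculus
""".splitlines()

print(latest_release(listing, "GRCm38"))
```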
Here we show some basic examples of how to use SnpEff.
Obviously the first step to use the program is to install it (for details, take a look at the download page). You have to download the core program and then uncompress the ZIP file. In Windows systems, you can just double click and copy the contents of the ZIP file to wherever you want the program installed. If you have a Unix or a Mac system, the command line would be:
# Download using wget
$ wget http://sourceforge.net/projects/snpeff/files/snpEff_latest_core.zip

# If you prefer to use 'curl' instead of 'wget', you can type:
# curl -L http://sourceforge.net/projects/snpeff/files/snpEff_latest_core.zip > snpEff_latest_core.zip

# Install
$ unzip snpEff_latest_core.zip
The example below uses the Variant Call Format (VCF) file examples/test.chr22.vcf, available in SnpEff's examples directory (this data is from the 1000 Genomes project, so the reference genome is the human genome GRCh37).
You can annotate the file by running the following command:
$ java -Xmx4g -jar snpEff.jar GRCh37.75 examples/test.chr22.vcf > test.chr22.ann.vcf

# Here is what the output looks like
$ head test.chr22.ann.vcf
##SnpEffVersion="4.1 (build 2015-01-07), by Pablo Cingolani"
##SnpEffCmd="SnpEff GRCh37.75 examples/test.chr22.vcf "
##INFO=
##INFO=
##INFO=
#CHROM POS ID REF ALT QUAL FILTER INFO
22 17071756 . T C . . ANN=C|3_prime_UTR_variant|MODIFIER|CCT8L2|ENSG00000198445|transcript|ENST00000359963|protein_coding|1/1|c.*11A>G|||||11|,C|downstream_gene_variant|MODIFIER|FABP5P11|ENSG00000240122|transcript|ENST00000430910|processed_pseudogene||n.*397A>G|||||4223|
22 17072035 . C T . . ANN=T|missense_variant|MODERATE|CCT8L2|ENSG00000198445|transcript|ENST00000359963|protein_coding|1/1|c.1406G>A|p.Gly469Glu|1666/2034|1406/1674|469/557||,T|downstream_gene_variant|MODIFIER|FABP5P11|ENSG00000240122|transcript|ENST00000430910|processed_pseudogene||n.*397G>A|||||3944|
22 17072258 . C A . . ANN=A|missense_variant|MODERATE|CCT8L2|ENSG00000198445|transcript|ENST00000359963|protein_coding|1/1|c.1183G>T|p.Gly395Cys|1443/2034|1183/1674|395/557||,A|downstream_gene_variant|MODIFIER|FABP5P11|ENSG00000240122|transcript|ENST00000430910|processed_pseudogene||n.*397G>T|||||3721|
22 17072674 . G A . . ANN=A|missense_variant|MODERATE|CCT8L2|ENSG00000198445|transcript|ENST00000359963|protein_coding|1/1|c.767C>T|p.Pro256Leu|1027/2034|767/1674|256/557||,A|downstream_gene_variant|MODIFIER|FABP5P11|ENSG00000240122|transcript|ENST00000430910|processed_pseudogene||n.*397C>T|||||3305|

As you can see, SnpEff added functional annotations in the ANN
info field (eighth column in the VCF output file).
Verbose mode ('-v') makes SnpEff show a lot of information, which can be useful for debugging.
$ java -Xmx4g -jar snpEff.jar -v GRCh37.75 examples/test.chr22.vcf > test.chr22.ann.vcf
00:00:00.000 Reading configuration file 'snpEff.config'. Genome: 'GRCh37.75'
00:00:00.434 done
00:00:00.434 Reading database for genome version 'GRCh37.75' from file '/home/pcingola/snpEff_v4_0/./data/GRCh37.75/snpEffectPredictor.bin' (this might take a while)
00:00:00.434 Database not installed
Attempting to download and install database 'GRCh37.75'
00:00:00.435 Reading configuration file 'snpEff.config'. Genome: 'GRCh37.75'
00:00:00.653 done
00:00:00.654 Downloading database for 'GRCh37.75'
00:00:00.655 Connecting to http://downloads.sourceforge.net/project/snpeff/databases/v4_0/snpEff_v4_0_GRCh37.75.zip
00:00:01.721 Local file name: 'snpEff_v4_0_GRCh37.75.zip'
.............................................
00:01:31.595 Donwload finished. Total 177705174 bytes.
00:01:31.597 Extracting file 'data/GRCh37.75/motif.bin' to '/home/pcingola/snpEff_v4_0/./data/GRCh37.75/motif.bin'
00:01:31.597 Creating local directory: '/home/pcingola/snpEff_v4_0/./data/GRCh37.75'
00:01:31.652 Extracting file 'data/GRCh37.75/nextProt.bin'
00:01:31.707 Extracting file 'data/GRCh37.75/pwms.bin'
00:01:31.707 Extracting file 'data/GRCh37.75/regulation_CD4.bin'
...
00:01:32.038 Extracting file 'data/GRCh37.75/snpEffectPredictor.bin'
00:01:32.881 Unzip: OK
00:01:32.881 Done
00:01:32.881 Database installed.
00:01:58.779 done
00:01:58.813 Reading NextProt database from file '/home/pcingola/snpEff_v4_0/./data/GRCh37.75/nextProt.bin'
00:02:01.448 NextProt database: 523361 markers loaded.
00:02:01.448 Adding transcript info to NextProt markers.
00:02:02.180 NextProt database: 706289 markers added.
00:02:02.181 Loading Motifs and PWMs
00:02:02.181 Loading PWMs from : /home/pcingola/snpEff_v4_0/./data/GRCh37.75/pwms.bin
00:02:02.203 Loading Motifs from file '/home/pcingola/snpEff_v4_0/./data/GRCh37.75/motif.bin'
00:02:02.973 Motif database: 284122 markers loaded.
00:02:02.973 Building interval forest
00:02:41.857 done.
00:02:41.858 Genome stats :
#-----------------------------------------------
# Genome name : 'Homo_sapiens'
# Genome version : 'GRCh37.75'
# Has protein coding info : true
# Genes : 63677
# Protein coding genes : 23172
#-----------------------------------------------
# Transcripts : 215170
# Avg. transcripts per gene : 3.38
#-----------------------------------------------
# Checked transcripts :
# AA sequences : 104254 ( 114.79% )
# DNA sequences : 179360 ( 83.36% )
#-----------------------------------------------
# Protein coding transcripts : 90818
# Length errors : 14349 ( 15.80% )
# STOP codons in CDS errors : 39 ( 0.04% )
# START codon errors : 8721 ( 9.60% )
# STOP codon warnings : 21788 ( 23.99% )
# UTR sequences : 87724 ( 40.77% )
# Total Errors : 21336 ( 23.49% )
#-----------------------------------------------
# Cds : 792087
# Exons : 1306656
# Exons with sequence : 1306656
# Exons without sequence : 0
# Avg. exons per transcript : 6.07
# WARNING! : Mitochondrion chromosome 'MT' does not have a mitochondrion codon table (codon table = 'Standard'). You should update the config file.
#-----------------------------------------------
# Number of chromosomes : 297
# Chromosomes names [sizes] :
# 'HG1292_PATCH' [250051446]
# 'HG1287_PATCH' [249964560]
# 'HG1473_PATCH' [249272860]
# 'HG1471_PATCH' [249269426]
# 'HSCHR1_1_CTG31' [249267852]
# 'HSCHR1_2_CTG31' [249266025]
# 'HSCHR1_3_CTG31' [249262108]
# 'HG999_2_PATCH' [249259300]
# 'HG989_PATCH' [249257867]
# 'HG999_1_PATCH' [249257505]
# 'HG1472_PATCH' [249251918]
# '1' [249250621]
# 'HG1293_PATCH' [249140837]
# 'HG686_PATCH' [243297375]
# 'HSCHR2_1_CTG12' [243216362]
# 'HSCHR2_2_CTG12' [243205453]
# 'HSCHR2_1_CTG1' [243205406]
# 'HG953_PATCH' [243199374]
# '2' [243199373]
.....
.....
#-----------------------------------------------
00:02:59.416 Predicting variants
WARNINGS: Some warning were detected
Warning type Number of warnings
WARNING_TRANSCRIPT_INCOMPLETE 8215
WARNING_TRANSCRIPT_NO_START_CODON 3483
00:03:04.327 Creating summary file: snpEff_summary.html
00:03:04.891 Creating genes file: snpEff_genes.txt
00:03:17.334 done.
00:03:17.336 Logging
00:03:18.337 Checking for updates...
Notice how SnpEff automatically downloads and installs the database.
Next time SnpEff will use the local version, so the installation step is only done once.
The annotated variants will be in the new file "test.chr22.ann.vcf".
SnpEff creates a file called "snpEff_summary.html" showing basic statistics about the analyzed variants. Take a quick look at it.
We used the java parameter -Xmx4g to increase the memory available to the Java Virtual Machine to 4G. SnpEff's human genome database is large and it has to be loaded into memory. If your computer doesn't have at least 4G of memory, you probably won't be able to run this example.
If you are running SnpEff from a directory different from the one where it was installed, you will have to specify where the config file is. This is done using the '-c' command line option:
java -Xmx4g -jar snpEff.jar -c path/to/snpEff/snpEff.config -v GRCh37.75 test.chr22.vcf > test.chr22.ann.vcf
java -Xmx4g -jar path/to/snpEff/snpEff.jar -c path/to/snpEff/snpEff.config GRCh37.75 path/to/snps.vcf
Since version 4.1B, you can use the -configOption command line option to override any value in the config file.
# Run using 4 Gb of memory
java -Xmx4G -jar snpEff.jar hg19 path/to/your/files/snps.vcf
Note: There is no space between "-Xmx" and "4G".
$ ssh -i ./aws_amazon/pcingola_aws.pem ec2-user@ec2-54-234-14-244.compute-1.amazonaws.com

       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|

[ec2-user@ip-10-2-202-163 ~]$ wget http://sourceforge.net/projects/snpeff/files/snpEff_latest_core.zip
[ec2-user@ip-10-2-202-163 ~]$ unzip snpEff_latest_core.zip
[ec2-user@ip-10-2-202-163 ~]$ cd snpEff/
[ec2-user@ip-10-2-202-163 snpEff]$ java -jar snpEff.jar download -v hg19
00:00:00.000 Downloading database for 'hg19'
...
00:00:36.340 Done
[ec2-user@ip-10-2-202-163 snpEff]$ java -Xmx4G -jar snpEff.jar dump -v hg19 > /dev/null
00:00:00.000 Reading database for genome 'hg19' (this might take a while)
00:00:20.688 done
00:00:20.688 Building interval forest
00:00:33.110 Done.
As you can see, it's very simple.
SnpEff has several 'commands' that can be used for different annotations.
The default command is 'eff' (or 'ann', they mean the same), used to annotate variants.
# This will show a 'help' message
java -jar snpEff.jar
Here is a list of what each command does:
Command | Meaning |
---|---|
[eff|ann] | This is the default command. It is used for annotating variant files (e.g. VCF files). |
build | Build a SnpEff database from reference genome files (FASTA, GTF, etc.). |
buildNextProt | Build NextProt database using XML files |
cds | Compare CDS sequences calculated from a SnpEff database to the one in a FASTA file. Used for checking database correctness (invoked automatically when building a database). |
closest | Annotate the closest genomic region. |
count | Count how many intervals (from a BAM, BED or VCF file) overlap with each genomic interval. |
databases | Show currently available databases (from local config file). |
download | Download a SnpEff database. |
dump | Dump to STDOUT a SnpEff database (mostly used for debugging). |
genes2bed | Create a bed file from a genes list. |
len | Calculate total genomic length for each marker type. |
protein | Compare protein sequences calculated from a SnpEff database to the one in a FASTA file. Used for checking database correctness (invoked automatically when building a database). |
spliceAnalysis | Perform an analysis of splice sites. Experimental feature. |
$ java -jar snpEff.jar
SnpEff version SnpEff 4.1 (build 2015-01-07), by Pablo Cingolani
Usage: snpEff [command] [options] [files]

Run 'java -jar snpEff.jar command' for help on each specific command

Available commands:
[eff|ann] : Annotate variants / calculate effects (you can use either 'ann' or 'eff', they mean the same). Default: ann (no command or 'ann').
build : Build a SnpEff database.
buildNextProt : Build a SnpEff for NextProt (using NextProt's XML files).
cds : Compare CDS sequences calculated form a SnpEff database to the one in a FASTA file. Used for checking databases correctness.
closest : Annotate the closest genomic region.
count : Count how many intervals (from a BAM, BED or VCF file) overlap with each genomic interval.
databases : Show currently available databases (from local config file).
download : Download a SnpEff database.
dump : Dump to STDOUT a SnpEff database (mostly used for debugging).
genes2bed : Create a bed file from a genes list.
len : Calculate total genomic length for each marker type.
protein : Compare protein sequences calculated form a SnpEff database to the one in a FASTA file. Used for checking databases correctness.
spliceAnalysis : Perform an analysis of splice sites. Experimental feature.

Generic options:
-c , -config : Specify config file
-d , -debug : Debug mode (very verbose).
-dataDir : Override data_dir parameter from config file.
-download : Download a SnpEff database, if not available locally. Default: true
-nodownload : Do not download a SnpEff database, if not available locally.
-noShiftHgvs : Do not shift variants towards most 3-prime position (as required by HGVS).
-h , -help : Show this help and exit
-noLog : Do not report usage statistics to server
-t : Use multiple threads (implies '-noStats'). Default 'off'
-q , -quiet : Quiet mode (do not show any messages or errors)
-v , -verbose : Verbose mode

Database options:
-canon : Only use canonical transcripts.
-interval : Use a custom intervals in TXT/BED/BigBed/VCF/GFF file (you may use this option many times)
-motif : Annotate using motifs (requires Motif database).
-nextProt : Annotate using NextProt (requires NextProt database).
-noGenome : Do not load any genomic database (e.g. annotate using custom files).
-noMotif : Disable motif annotations.
-noNextProt : Disable NextProt annotations.
-onlyReg : Only use regulation tracks.
-onlyProtein : Only use protein coding transcripts. Default: false
-onlyTr : Only use the transcripts in this file. Format: One transcript ID per line.
-reg : Regulation track to use (this option can be used add several times).
-ss , -spliceSiteSize : Set size for splice sites (donor and acceptor) in bases. Default: 2
-strict : Only use 'validated' transcripts (i.e. sequence has been checked). Default: false

Help (command specific): In order to see a help message for a particular command, you can run the command without any arguments or use the -help command line option:
# This will show a 'help' message for the 'ann' (aka 'eff') command
$ java -jar snpEff.jar ann
snpEff version SnpEff 4.1 (build 2015-01-07), by Pablo Cingolani
Usage: snpEff [eff] [options] genome_version [input_file]

variants_file : Default is STDIN

Options:
-chr: Prepend 'string' to chromosome name (e.g. 'chr1' instead of '1'). Only on TXT output.
-classic : Use old style annotaions instead of Sequence Ontology and Hgvs.
-download : Download reference genome if not available. Default: true
-i : Input format [ vcf, bed ]. Default: VCF.
-fileList : Input actually contains a list of files to process.
-o : Ouput format [ vcf, gatk, bed, bedAnn ]. Default: VCF.
-s , -stats : Name of stats file (summary). Default is 'snpEff_summary.html'
-noStats : Do not create stats (summary) file
-csvStats : Create CSV summary file instead of HTML

Results filter options:
-fi , -filterInterval : Only analyze changes that intersect with the intervals specified in this file (you may use this option many times)
-no-downstream : Do not show DOWNSTREAM changes
-no-intergenic : Do not show INTERGENIC changes
-no-intron : Do not show INTRON changes
-no-upstream : Do not show UPSTREAM changes
-no-utr : Do not show 5_PRIME_UTR or 3_PRIME_UTR changes
-no EffectType : Do not show 'EffectType'. This option can be used several times.

Annotations options:
-cancer : Perform 'cancer' comparisons (Somatic vs Germline). Default: false
-cancerSamples : Two column TXT file defining 'oringinal \t derived' samples.
-formatEff : Use 'EFF' field compatible with older versions (instead of 'ANN').
-geneId : Use gene ID instead of gene name (VCF output). Default: false
-hgvs : Use HGVS annotations for amino acid sub-field. Default: true
-lof : Add loss of function (LOF) and Nonsense mediated decay (NMD) tags.
-noHgvs : Do not add HGVS annotations.
-noLof : Do not add LOF and NMD annotations.
-noShiftHgvs : Do not shift variants according to HGVS notation (most 3prime end).
-oicr : Add OICR tag in VCF file. Default: false
-sequenceOntology : Use Sequence Ontology terms. Default: true

Generic options:
-c , -config : Specify config file
-d , -debug : Debug mode (very verbose).
-dataDir : Override data_dir parameter from config file.
-download : Download a SnpEff database, if not available locally. Default: true
-nodownload : Do not download a SnpEff database, if not available locally.
-noShiftHgvs : Do not shift variants towards most 3-prime position (as required by HGVS).
-h , -help : Show this help and exit
-noLog : Do not report usage statistics to server
-t : Use multiple threads (implies '-noStats'). Default 'off'
-q , -quiet : Quiet mode (do not show any messages or errors)
-v , -verbose : Verbose mode

Database options:
-canon : Only use canonical transcripts.
-interval : Use a custom intervals in TXT/BED/BigBed/VCF/GFF file (you may use this option many times)
-motif : Annotate using motifs (requires Motif database).
-nextProt : Annotate using NextProt (requires NextProt database).
-noGenome : Do not load any genomic database (e.g. annotate using custom files).
-noMotif : Disable motif annotations.
-noNextProt : Disable NextProt annotations.
-onlyReg : Only use regulation tracks.
-onlyProtein : Only use protein coding transcripts. Default: false
-onlyTr : Only use the transcripts in this file. Format: One transcript ID per line.
-reg : Regulation track to use (this option can be used add several times).
-ss , -spliceSiteSize : Set size for splice sites (donor and acceptor) in bases. Default: 2
-strict : Only use 'validated' transcripts (i.e. sequence has been checked). Default: false
-ud , -upDownStreamLen : Set upstream downstream interval length (in bases)
In order to speed up the annotation process, there are two options that can be activated:
The -noStats option disables the statistics and may result in a significant speedup.
Multithreading can be activated using the -t command line option (which implies '-noStats').
SnpEff uses HGVS notation, which is somewhat popular amongst clinicians.
You can de-activate HGVS notation (to use the old annotation style) using the command line option -classic.
SnpEff will try to log usage statistics to our "log server".
This is useful for us to understand users' needs and have some statistics on what users are doing with the program (e.g. decide whether a command or option is useful or not).
Logging can be deactivated by using the -noLog command line option.
SnpEff supports filtering of output results by using combinations of the command line options listed in the table below.
More flexible and complex output filters can be implemented using SnpSift filter.
Command line option | Meaning |
---|---|
-no-downstream | Do not show DOWNSTREAM changes |
-no-intergenic | Do not show INTERGENIC changes |
-no-intron | Do not show INTRON changes |
-no-upstream | Do not show UPSTREAM changes |
-no-utr | Do not show 5_PRIME_UTR or 3_PRIME_UTR changes |
-no EffectType | Do not show 'EffectType' (it can be used several times) e.g: -no INTERGENIC -no SPLICE_SITE_REGION |
To only analyze variants that intersect a set of intervals, you can use the -fi intervals.bed command line option (filterInterval). For instance, let's assume you have an interval file 'intervals.bed':
2L 10000 10999
2L 12000 12999
2L 14000 14999
2L 16000 16999
2L 18000 18999
$ java -Xmx4G -jar snpEff.jar -fi intervals.bed GRCh38.76 test.chr22.vcf
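The idea behind interval filtering can be sketched in a few lines. This is an illustration of the concept, not SnpEff's implementation: it assumes standard BED coordinates (0-based, end-exclusive) and 1-based VCF positions, and the helper names `load_bed` and `intersects` are hypothetical.

```python
from collections import defaultdict


def load_bed(lines):
    """Read BED lines into {chromosome: [(start, end), ...]} (0-based, end-exclusive)."""
    intervals = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, start, end = line.split()[:3]
        intervals[chrom].append((int(start), int(end)))
    return intervals


def intersects(intervals, chrom, pos, ref):
    """Check whether a variant (1-based VCF position, REF allele) overlaps any interval."""
    var_start = pos - 1              # convert to 0-based
    var_end = var_start + len(ref)   # end-exclusive
    return any(var_start < end and start < var_end
               for start, end in intervals.get(chrom, []))


bed = load_bed(["2L 10000 10999", "2L 12000 12999"])
print(intersects(bed, "2L", 10500, "A"))   # inside the first interval
print(intersects(bed, "2L", 11500, "A"))   # between the two intervals
```

A linear scan is enough for a handful of intervals; for large interval files an interval tree or a sorted index would be a more suitable structure.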
SnpEff can annotate using only canonical transcripts via the -canon command line option.
The canonical transcript of a gene is defined as the one with the longest CDS amongst the protein coding transcripts in that gene.
If none of the transcripts in a gene is protein coding, then it is the one with the longest cDNA.
Although this seems to be the standard definition of "canonical transcript", there is no guarantee that what SnpEff considers a canonical transcript will match exactly what UCSC or ENSEMBL consider a canonical transcript.
$ java -Xmx4G -jar snpEff.jar -v -canon GRCh37.75 examples/test.chr22.vcf > file.ann.canon.vcf
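The selection rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not SnpEff's code: the `Transcript` class, its fields and the transcript IDs are hypothetical, and tie-breaking between equally long transcripts is not addressed.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    transcript_id: str
    protein_coding: bool
    cds_length: int    # 0 for non-coding transcripts
    cdna_length: int


def canonical(transcripts):
    """Pick the longest CDS among protein coding transcripts, else the longest cDNA."""
    coding = [t for t in transcripts if t.protein_coding]
    if coding:
        return max(coding, key=lambda t: t.cds_length)
    return max(transcripts, key=lambda t: t.cdna_length)


gene = [
    Transcript("TR_A", True, 903, 2500),
    Transcript("TR_B", True, 1200, 1800),
    Transcript("TR_C", False, 0, 4000),
]
print(canonical(gene).transcript_id)
```

Note that the non-coding transcript 'TR_C' has the longest cDNA but is not selected, because protein coding transcripts take precedence.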
To see which transcripts SnpEff considers canonical, you can use the -d (debug) command line option. E.g.:
$ java -Xmx4G -jar snpEff.jar -d -v -canon GRCh37.75 test.vcf
00:00:00.000 Reading configuration file 'snpEff.config'
00:00:00.173 done
00:00:00.173 Reading database for genome version 'GRCh37.66'
00:00:02.834 done
00:00:02.845 Filtering out non-canonical transcripts.
00:00:03.219 Canonical transcripts:
geneName geneId transcriptId cdsLength
GGPS1 ENSG00000152904 ENST00000488594 903
RP11-628K18.1.1 ENSG00000235112 ENST00000430808 296
MIPEPP2 ENSG00000224783 ENST00000422560 1819
FEN1P1 ENSG00000215873 ENST00000401028 1145
AL591704.7.1 ENSG00000224784 ENST00000421658 202
CAPNS1P1 ENSG00000215874 ENST00000401029 634
ST13P20 ENSG00000215875 ENST00000447996 1061
NCDN ENSG00000020129 ENST00000373243 2190
RP11-99H8.1.1 ENSG00000226208 ENST00000423187 432
AL391001.1 ENSG00000242652 ENST00000489859 289
...
SnpEff allows you to provide a list of transcripts to use for annotations by using the -onlyTr file.txt command line option and providing a file with one transcript ID per line.
Any other transcript will be ignored.
$ java -Xmx4G -jar snpEff.jar -onlyTr my_transcripts.txt GRCh37.75 test.chr22.vcf > test.chr22.ann.vcf
You can change the default upstream and downstream interval size (default is 5K) using the -ud size_in_bases option.
This also allows you to eliminate any upstream and downstream effect by using "-ud 0".
Example: Make upstream and downstream size zero (i.e. do not report any upstream or downstream effect).
$ java -Xmx4G -jar snpEff.jar -ud 0 GRCh37.75 test.chr22.vcf > test.chr22.ann.vcf
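To see why "-ud 0" removes all upstream and downstream effects, consider this rough sketch of a distance check. It is an illustration only: the function `flank` is hypothetical, coordinates are assumed to be 1-based and inclusive, and whether SnpEff treats the boundary as inclusive is an assumption.

```python
def flank(pos, tr_start, tr_end, strand, ud_len=5000):
    """Return ('upstream'|'downstream', distance) or None for a position near a transcript."""
    if tr_start <= pos <= tr_end:
        return None  # inside the transcript: not an upstream/downstream effect
    if pos < tr_start:
        distance = tr_start - pos
        # On the minus strand the transcript is read right to left, so sides swap
        side = "upstream" if strand == "+" else "downstream"
    else:
        distance = pos - tr_end
        side = "downstream" if strand == "+" else "upstream"
    # Assumption: positions farther away than ud_len are not reported
    return (side, distance) if distance <= ud_len else None


print(flank(900, 1000, 2000, "+"))            # within the default 5K window
print(flank(900, 1000, 2000, "+", ud_len=0))  # nothing is reported with a zero window
```

Any position outside the transcript is at least one base away, so with a window of zero no upstream or downstream effect can be reported.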
You can change the default splice site size (default is 2 bases) using the -spliceSiteSize size_in_bases option.
Example: Make splice sites four bases long
$ java -Xmx4G -jar snpEff.jar -spliceSiteSize 4 GRCh37.75 test.chr22.vcf > test.chr22.ann.vcf
SnpEff allows user defined intervals to be annotated.
This is achieved using the -interval file.bed command line option, which can be used multiple times in the same command line (it accepts files in TXT, BED, BigBed, VCF, GFF formats).
Any variant that intersects an interval defined in those files will be annotated using the "name" field (fourth column) in the input bed file.
Example: We create our own annotations in my_annotations.bed
$ cat my_annotations.bed
1 10000 20000 MY_ANNOTATION

$ cat test.vcf
1 10469 . C G 365.78 PASS AC=30;AF=0.0732

# Annotate (output edited for readability)
$ java -Xmx4g -jar snpEff.jar -interval my_annotations.bed GRCh37.66 test.vcf
1 10469 . C G 365.78 PASS AC=30;AF=0.0732;
ANN=G|upstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|transcript|ENST00000456328|processed_transcript||n.-1C>G|||||1400|
...
G|custom|MODIFIER|||CUSTOM&my_annotations|MY_ANNOTATION|||||||||
Notice that the variant was annotated using "MY_ANNOTATION" in the ANN field.
You can obtain gene IDs instead of gene names by using the command line option -geneId.
Note: This is only for the old 'EFF' field ('ANN' field always shows both gene name and gene ID).
Example:
$ java -Xmx4g -jar snpEff.jar -geneId GRCh37.66 test.vcf
1 902128 3617 C T . PASS AC=80;EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|gCt/gTt|A43V|576|ENSG00000187583|protein_coding|CODING|ENST00000379407|2|1),...
Note: The gene 'PLEKHN1' was annotated as 'ENSG00000187583'.
# Create compressed version of the examples files
cp examples/test.chr22.vcf my.vcf

# Compress it
gzip my.vcf

# Annotate (note that it doesn't require the ending '.gz')
java -Xmx4g -jar snpEff.jar GRCh37.75 my.vcf > my.ann.vcf
# These three commands are the same

# Using STDIN (pipe), implicit (no input file name)
cat test.chr22.vcf | java -Xmx4g -jar snpEff.jar hg19 > test.chr22.ann.vcf

# Using STDIN (pipe), explicit '-' input file name
cat test.chr22.vcf | java -Xmx4g -jar snpEff.jar hg19 - > test.chr22.ann.vcf

# Using explicit file name
java -Xmx4g -jar snpEff.jar hg19 test.chr22.vcf > test.chr22.ann.vcf
Files used as input to SnpEff must comply with standard formats. Here we describe supported input data formats.
#CHROM POS ID REF ALT QUAL FILTER INFO
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017
Note that the first line is header information. Header lines start with '#'.
#CHROM POS ID REF ALT QUAL FILTER INFO
1 889455 . G A 100.0 PASS AF=0.0005
1 897062 . C T 100.0 PASS AF=0.0005
VCF file after being annotated using SnpEff:
#CHROM POS ID REF ALT QUAL FILTER INFO
1 889455 . G A 100.0 PASS AF=0.0005;EFF=STOP_GAINED(HIGH|NONSENSE|Cag/Tag|Q236*|749|NOC2L||CODING|NM_015658|)
1 897062 . C T 100.0 PASS AF=0.0005;EFF=STOP_GAINED(HIGH|NONSENSE|Cag/Tag|Q141*|642|KLHL17||CODING|NM_198317|)
As you can see, SnpEff added an 'EFF' tag to the INFO field (eighth column).
##SnpEffVersion="SnpEff 3.1m (build 2013-02-08)"
##SnpEffCmd="SnpEff hg19 demo.1kg.vcf "
##INFO=<ID=EFF,Number=.,Type=String,Description="Predicted effects for this variant.Format: 'Effect ( Effect_Impact | Functional_Class | Codon_Change | Amino_Acid_change| Amino_Acid_length | Gene_Name | Gene_BioType | Coding | Transcript | Exon [ | ERRORS | WARNINGS ] )' \">
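Given the format in the header line above, the 'EFF' tag can be split into named sub-fields with a short script. This is a minimal sketch, not part of SnpEff or SnpSift: the function name `parse_eff` is hypothetical, and any sub-fields beyond the ten named in the header (such as ERRORS or WARNINGS) are simply collected under an 'Extra' key.

```python
import re

# Sub-field names taken from the EFF header description above
EFF_FIELDS = [
    "Effect_Impact", "Functional_Class", "Codon_Change", "Amino_Acid_change",
    "Amino_Acid_length", "Gene_Name", "Gene_BioType", "Coding", "Transcript", "Exon",
]
EFF_PATTERN = re.compile(r"^([^()]+)\((.*)\)$")


def parse_eff(info):
    """Parse the EFF entry of a VCF INFO string into a list of dictionaries."""
    for entry in info.split(";"):
        if entry.startswith("EFF="):
            value = entry[len("EFF="):]
            break
    else:
        return []  # no EFF tag in this INFO field
    effects = []
    for item in value.split(","):  # multiple effects are comma separated
        match = EFF_PATTERN.match(item.strip())
        if not match:
            continue
        parts = match.group(2).split("|")
        record = {"Effect": match.group(1)}
        record.update(zip(EFF_FIELDS, parts))
        record["Extra"] = parts[len(EFF_FIELDS):]  # anything past the ten named fields
        effects.append(record)
    return effects


info = "AF=0.0005;EFF=STOP_GAINED(HIGH|NONSENSE|Cag/Tag|Q236*|749|NOC2L||CODING|NM_015658|)"
for effect in parse_eff(info):
    print(effect["Effect"], effect["Effect_Impact"], effect["Gene_Name"], effect["Amino_Acid_change"])
```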
SnpEff adds annotations to the INFO field using the 'ANN' tag.
The 'ANN' field looks like this (see the full annotation standard specification for details):
ANN=T|missense_variant|MODERATE|CCT8L2|ENSG00000198445|transcript|ENST00000359963|protein_coding|1/1|c.1406G>A|p.Gly469Glu|1666/2034|1406/1674|469/557||,T|downstream_gene_variant|MODIFIER|FABP5P11|ENSG00000240122|transcript|ENST00000430910|processed_pseudogene||n.*397G>A|||||3944|
A variant can have (and usually has) more than one annotation. Multiple annotations are separated by commas. In the previous example there were two annotations corresponding to different genes (CCT8L2 and FABP5P11).
Annotation      : T|missense_variant|MODERATE|CCT8L2|ENSG00000198445|transcript|ENST00000359963|protein_coding|1/1|c.1406G>A|p.Gly469Glu|1666/2034|1406/1674|469/557| |
SubField number : 1| 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |15| 16
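The sub-field numbering above can be turned into named values with a few lines of Python. This is a minimal sketch, not SnpEff code: the function name `parse_ann` is hypothetical, the sub-field names follow the ANN format line given in this manual, and the sketch assumes that no sub-field contains an unescaped ',' or '|'.

```python
ANN_FIELDS = [
    "Allele", "Annotation", "Annotation_Impact", "Gene_Name", "Gene_ID",
    "Feature_Type", "Feature_ID", "Transcript_BioType", "Rank",
    "HGVS.c", "HGVS.p", "cDNA.pos/cDNA.length", "CDS.pos/CDS.length",
    "AA.pos/AA.length", "Distance", "ERRORS_WARNINGS_INFO",
]


def parse_ann(value: str) -> list:
    """Parse the text after 'ANN=' into one dict per annotation."""
    annotations = []
    for entry in value.split(","):          # annotations are comma-separated
        subfields = [s.strip() for s in entry.split("|")]
        # Pad short entries so that every sub-field name gets a value
        subfields += [""] * (len(ANN_FIELDS) - len(subfields))
        annotations.append(dict(zip(ANN_FIELDS, subfields)))
    return annotations


example = (
    "T|missense_variant|MODERATE|CCT8L2|ENSG00000198445|transcript|"
    "ENST00000359963|protein_coding|1/1|c.1406G>A|p.Gly469Glu|1666/2034|"
    "1406/1674|469/557||,"
    "T|downstream_gene_variant|MODIFIER|FABP5P11|ENSG00000240122|transcript|"
    "ENST00000430910|processed_pseudogene||n.*397G>A|||||3944|"
)

for ann in parse_ann(example):
    print(ann["Gene_Name"], ann["Annotation"], ann["Annotation_Impact"])
```

Empty sub-fields are kept as empty strings, so the position of each value always matches the sub-field number.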
#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 123456 . C A . . ANN=A|...
chr1 234567 . A G,T . . ANN=G|... , T|...

For cancer samples, when comparing somatic versus germline using a non-standard reference (e.g. one of the ALTs is the reference), the format should be ALT-REFERENCE. E.g.:

#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 123456 . A C,G . . ANN=G-C|...

Compound variants: two or more variants affecting the annotations (e.g. two consecutive SNPs forming an MNP, two consecutive frame_shift variants that “recover” the frame). In this case, the Allele field should include a reference to the other variant/s included in the annotation:

#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 123456 . A T . . ANN=T|...
chr1 123457 . C G . . ANN=C-chr1:123456_A>T|...

#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 123456 . C A . . ANN=A|intron_variant&nc_transcript_variant|...
Code | Message type | Description / Notes |
---|---|---|
E1 | ERROR_CHROMOSOME_NOT_FOUND | Chromosome does not exist in reference genome database. Typically indicates a mismatch between the chromosome names in the input file and the chromosome names used in the reference genome. |
E2 | ERROR_OUT_OF_CHROMOSOME_RANGE | The variant’s genomic coordinate is greater than chromosome's length. |
W1 | WARNING_REF_DOES_NOT_MATCH_GENOME | This means that the ‘REF’ field in the input VCF file does not match the reference genome. This warning may indicate a conflict between input data and data from reference genome (for instance if the input VCF was aligned to a different reference genome). |
W2 | WARNING_SEQUENCE_NOT_AVAILABLE | Reference sequence is not available, thus no inference could be performed. |
W3 | WARNING_TRANSCRIPT_INCOMPLETE | A protein coding transcript having a non-multiple of 3 length. It indicates that the reference genome has missing information about this particular transcript. |
W4 | WARNING_TRANSCRIPT_MULTIPLE_STOP_CODONS | A protein coding transcript has two or more STOP codons in the middle of the coding sequence (CDS). This should not happen and it usually means the reference genome may have an error in this transcript. |
W5 | WARNING_TRANSCRIPT_NO_START_CODON | A protein coding transcript does not have a proper START codon. It is rare that a real transcript does not have a START codon, so this probably indicates an error or missing information in the reference genome. |
I1 | INFO_REALIGN_3_PRIME | Variant has been realigned to the most 3-prime position within the transcript. This is usually done to comply with HGVS specification to always report the most 3-prime annotation. |
I2 | INFO_COMPOUND_ANNOTATION | This effect is a result of combining more than one variant (e.g. two consecutive SNPs that form an MNP, or two consecutive frame_shift variants that compensate frame). |
I3 | INFO_NON_REFERENCE_ANNOTATION | An alternative reference sequence was used to calculate this annotation (e.g. cancer sample comparing somatic vs. germline). |
Consistency between HGVS and functional annotations: In some cases ‘annotation’ and HGVS may be reported inconsistently. This happens because VCF recommends aligning to the leftmost coordinate, whereas HGVS recommends aligning to the "most 3-prime coordinate". For instance, an InDel on the edge of an exon, which has an ‘intronic’ annotation according to the VCF alignment recommendation, can lead to a ‘stop_gained’ when aligned using HGVS’s recommendation (using the most 3-prime possible alignment). The 'annotation' sub-field would then report 'intron' whereas the HGVS sub-field would report a 'stop_gained'. This inconsistency must be avoided. In order to report annotations that are consistent with HGVS notation, variants must be re-aligned according to each transcript’s strand (i.e. align the variant according to the transcript’s most 3-prime coordinate). Annotations are calculated after this re-alignment, so the reported annotations are consistent with HGVS notation. Annotation software should have a command line option to override this behaviour (e.g. ‘-no_shift_hgvs’).
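To make the idea of "most 3-prime alignment" concrete, the following Python sketch shifts a deletion to the right-most equivalent position in a toy sequence. It is only an illustration under simplifying assumptions: the function names are hypothetical, coordinates are 0-based, only deletions on the forward strand are handled, and it is not the algorithm implemented in SnpEff.

```python
def shift_deletion_3prime(seq: str, start: int, length: int) -> int:
    """Return the right-most 0-based start of a deletion of `length` bases
    that produces the same resulting sequence (forward strand only)."""
    end = start + length
    # Sliding the deleted window one base to the right leaves the result
    # unchanged only if the base leaving the window equals the base entering it.
    while end < len(seq) and seq[start] == seq[end]:
        start += 1
        end += 1
    return start


def apply_deletion(seq: str, start: int, length: int) -> str:
    """Remove `length` bases starting at 0-based position `start`."""
    return seq[:start] + seq[start + length:]


seq = "GCACACAT"          # toy sequence containing a 'CA' repeat
left = 1                  # left-most placement of a 2-base deletion ('CA')
right = shift_deletion_3prime(seq, left, 2)

# Both placements describe the same change to the sequence
print(left, right, apply_deletion(seq, left, 2) == apply_deletion(seq, right, 2))
```

For a transcript on the reverse strand the shift would go in the opposite genomic direction, which is why the re-alignment has to be done per transcript.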
-formatEff
command line option.
As of version 4.1 SnpEff uses the 'ANN' field by default.
-classic
command line option.
EFF= Effect ( Effect_Impact | Functional_Class | Codon_Change | Amino_Acid_Change| Amino_Acid_Length | Gene_Name | Transcript_BioType | Gene_Coding | Transcript_ID | Exon_Rank | Genotype_Number [ | ERRORS | WARNINGS ] )
EFF Sub-field | Meaning |
---|---|
Effect | Effect of this variant. See details here. |
Effect impact | Effect impact {High, Moderate, Low, Modifier}. See details here. |
Functional Class | Functional class {NONE, SILENT, MISSENSE, NONSENSE}. |
Codon_Change / Distance | Codon change: old_codon/new_codon OR distance to transcript (in case of upstream / downstream) |
Amino_Acid_Change | Amino acid change: old_AA AA_position/new_AA (e.g. 'E30K') |
Amino_Acid_Length | Length of protein in amino acids (actually, transcription length divided by 3). |
Gene_Name | Gene name |
Transcript_BioType | Transcript bioType, if available. |
Gene_Coding | [CODING | NON_CODING]. This field is 'CODING' if any transcript of the gene is marked as protein coding. |
Transcript_ID | Transcript ID (usually ENSEMBL IDs) |
Exon/Intron Rank | Exon rank or Intron rank (e.g. '1' for the first exon, '2' for the second exon, etc.) |
Genotype_Number | Genotype number corresponding to this effect (e.g. '2' if the effect corresponds to the second ALT) |
Warnings / Errors | Any warnings or errors (not shown if empty). |
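The EFF layout described in the table above can be parsed with a short Python sketch. The names `EFF_FIELDS` and `parse_eff_entry` are hypothetical, and the sketch assumes a single EFF entry of the form `Effect(sub|fields|...)`; older SnpEff versions emit fewer sub-fields, so only the sub-fields actually present are returned.

```python
import re

EFF_FIELDS = [
    "Effect_Impact", "Functional_Class", "Codon_Change", "Amino_Acid_Change",
    "Amino_Acid_Length", "Gene_Name", "Transcript_BioType", "Gene_Coding",
    "Transcript_ID", "Exon_Rank", "Genotype_Number", "ERRORS", "WARNINGS",
]

# 'Effect' is everything before the opening parenthesis
EFF_PATTERN = re.compile(r"^(?P<effect>[^()]+)\((?P<body>.*)\)$")


def parse_eff_entry(entry: str) -> dict:
    """Parse one EFF entry into a dict keyed by sub-field name."""
    match = EFF_PATTERN.match(entry.strip())
    if match is None:
        raise ValueError(f"Not an EFF entry: {entry!r}")
    subfields = [s.strip() for s in match.group("body").split("|")]
    result = {"Effect": match.group("effect").strip()}
    # zip() stops at the shorter list, so missing optional sub-fields are skipped
    result.update(zip(EFF_FIELDS, subfields))
    return result


entry = ("START_LOST(HIGH|MISSENSE|Atg/Gtg|M1V|305|OR4F5|protein_coding|"
         "CODING|ENST00000335137|1|G)")
eff = parse_eff_entry(entry)
print(eff["Effect"], eff["Effect_Impact"], eff["Amino_Acid_Change"])
```

Multiple EFF entries in one INFO field are separated by commas, so a full parser would first split the value on ',' and then call the function on each entry.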
ANN
(or EFF
) field.
#CHROM POS ID REF ALT QUAL FILTER INFO
1 889455 . G A . . .

In this case SnpEff will report the effect of each variant on each gene and each transcript (output edited for readability):

#CHROM POS ID REF ALT QUAL FILTER INFO
1 889455 . G A . . ANN=A|stop_gained|HIGH|NOC2L|ENSG00000188976|transcript|ENST00000327044|protein_coding|7/19|c.706C>T|p.Gln236*|756/2790|706/2250|236/749||
    ,A|downstream_gene_variant|MODIFIER|NOC2L|ENSG00000188976|transcript|ENST00000487214|processed_transcript||n.*865C>T|||||351|
    ,A|downstream_gene_variant|MODIFIER|NOC2L|ENSG00000188976|transcript|ENST00000469563|retained_intron||n.*878C>T|||||4171|
    ,A|non_coding_exon_variant|MODIFIER|NOC2L|ENSG00000188976|transcript|ENST00000477976|retained_intron|5/17|n.2153C>T||||||
    ;LOF=(NOC2L|ENSG00000188976|6|0.17);NMD=(NOC2L|ENSG00000188976|6|0.17)
#CHROM POS ID REF ALT QUAL FILTER INFO
1 889455 . G A,T . . .

In this case SnpEff will report the effect of each ALT on each gene and each transcript. Notice that ENST00000327044 has a stop_gained variant (ALT = 'A') and a missense_variant (ALT = 'T'):

#CHROM POS ID REF ALT QUAL FILTER INFO
1 889455 . G A,T . . ANN=A|stop_gained|HIGH|NOC2L|ENSG00000188976|transcript|ENST00000327044|protein_coding|7/19|c.706C>T|p.Gln236*|756/2790|706/2250|236/749||
    ,T|missense_variant|MODERATE|NOC2L|ENSG00000188976|transcript|ENST00000327044|protein_coding|7/19|c.706C>A|p.Gln236Lys|756/2790|706/2250|236/749||
    ,A|downstream_gene_variant|MODIFIER|NOC2L|ENSG00000188976|transcript|ENST00000487214|processed_transcript||n.*865C>T|||||351|
    ,T|downstream_gene_variant|MODIFIER|NOC2L|ENSG00000188976|transcript|ENST00000487214|processed_transcript||n.*865C>A|||||351|
    ,A|downstream_gene_variant|MODIFIER|NOC2L|ENSG00000188976|transcript|ENST00000469563|retained_intron||n.*878C>T|||||4171|
    ,T|downstream_gene_variant|MODIFIER|NOC2L|ENSG00000188976|transcript|ENST00000469563|retained_intron||n.*878C>A|||||4171|
    ,A|non_coding_exon_variant|MODIFIER|NOC2L|ENSG00000188976|transcript|ENST00000477976|retained_intron|5/17|n.2153C>T||||||
    ,T|non_coding_exon_variant|MODIFIER|NOC2L|ENSG00000188976|transcript|ENST00000477976|retained_intron|5/17|n.2153C>A||||||
    ;LOF=(NOC2L|ENSG00000188976|6|0.17);NMD=(NOC2L|ENSG00000188976|6|0.17)
Detailed description of the effects predicted by SnpEff in the Effect and Effect_Impact sub-fields.
-classic
command line option.
Effect Seq. Ontology | Effect Classic | Note & Example | Impact |
---|---|---|---|
coding_sequence_variant | CDS | The variant hits a CDS. | MODIFIER |
chromosome | CHROMOSOME_LARGE_DELETION | A large part (over 1% or 1,000,000 bases) of the chromosome was deleted. | HIGH |
duplication | CHROMOSOME_LARGE_DUPLICATION | Duplication of a large chromosome segment (over 1% or 1,000,000 bases). | HIGH |
inversion | CHROMOSOME_LARGE_INVERSION | Inversion of a large chromosome segment (over 1% or 1,000,000 bases). | HIGH |
coding_sequence_variant | CODON_CHANGE | One or many codons are changed e.g.: An MNP of size multiple of 3 | LOW |
inframe_insertion | CODON_INSERTION | One or many codons are inserted e.g.: An insert multiple of three in a codon boundary | MODERATE |
disruptive_inframe_insertion | CODON_CHANGE_PLUS CODON_INSERTION | One codon is changed and one or many codons are inserted e.g.: An insert of size multiple of three, not at codon boundary | MODERATE |
inframe_deletion | CODON_DELETION | One or many codons are deleted e.g.: A deletion multiple of three at codon boundary | MODERATE |
disruptive_inframe_deletion | CODON_CHANGE_PLUS CODON_DELETION | One codon is changed and one or more codons are deleted e.g.: A deletion of size multiple of three, not at codon boundary | MODERATE |
downstream_gene_variant | DOWNSTREAM | Downstream of a gene (default length: 5K bases) | MODIFIER |
exon_variant | EXON | The variant hits an exon (from a non-coding transcript) or a retained intron. | MODIFIER |
exon_loss_variant | EXON_DELETED | A deletion removes the whole exon. | HIGH |
exon_loss_variant | EXON_DELETED_PARTIAL | Deletion affecting part of an exon. | HIGH |
duplication | EXON_DUPLICATION | Duplication of an exon. | HIGH |
duplication | EXON_DUPLICATION_PARTIAL | Duplication affecting part of an exon. | HIGH |
inversion | EXON_INVERSION | Inversion of an exon. | HIGH |
inversion | EXON_INVERSION_PARTIAL | Inversion affecting part of an exon. | HIGH |
frameshift_variant | FRAME_SHIFT | Insertion or deletion causes a frame shift e.g.: An indel whose size is not a multiple of 3 | HIGH |
gene_variant | GENE | The variant hits a gene. | MODIFIER |
feature_ablation | GENE_DELETED | Deletion of a gene. | HIGH |
duplication | GENE_DUPLICATION | Duplication of a gene. | MODERATE |
gene_fusion | GENE_FUSION | Fusion of two genes. | HIGH |
gene_fusion | GENE_FUSION_HALF | Fusion of one gene and an intergenic region. | HIGH |
bidirectional_gene_fusion | GENE_FUSION_REVERESE | Fusion of two genes in opposite directions. | HIGH |
rearranged_at_DNA_level | GENE_REARRANGEMENT | Rearrangement affecting one or more genes. | HIGH |
intergenic_region | INTERGENIC | The variant is in an intergenic region | MODIFIER |
conserved_intergenic_variant | INTERGENIC_CONSERVED | The variant is in a highly conserved intergenic region | MODIFIER |
intragenic_variant | INTRAGENIC | The variant hits a gene, but no transcripts within the gene | MODIFIER |
intron_variant | INTRON | Variant hits an intron. Technically, hits no exon in the transcript. | MODIFIER |
conserved_intron_variant | INTRON_CONSERVED | The variant is in a highly conserved intronic region | MODIFIER |
miRNA | MICRO_RNA | Variant affects an miRNA | MODIFIER |
missense_variant | NON_SYNONYMOUS_CODING | Variant causes a codon that produces a different amino acid e.g.: Tgg/Cgg, W/R | MODERATE |
initiator_codon_variant | NON_SYNONYMOUS_START | Variant causes start codon to be mutated into another start codon (the new codon produces a different AA). e.g.: Atg/Ctg, M/L (ATG and CTG can be START codons) | LOW |
stop_retained_variant | NON_SYNONYMOUS_STOP | Variant causes stop codon to be mutated into another stop codon (the new codon produces a different AA). e.g.: Atg/Ctg, M/L (ATG and CTG can be START codons) | LOW |
protein_protein_contact | PROTEIN_PROTEIN_INTERACTION_LOCUS | Protein-protein interaction loci. | HIGH |
structural_interaction_variant | PROTEIN_STRUCTURAL_INTERACTION_LOCUS | Within protein interaction loci (e.g. two AA that are in contact within the same protein, possibly helping structural conformation). | HIGH |
rare_amino_acid_variant | RARE_AMINO_ACID | The variant hits a rare amino acid thus is likely to produce protein loss of function | HIGH |
splice_acceptor_variant | SPLICE_SITE_ACCEPTOR | The variant hits a splice acceptor site (defined as two bases before exon start, except for the first exon). | HIGH |
splice_donor_variant | SPLICE_SITE_DONOR | The variant hits a Splice donor site (defined as two bases after coding exon end, except for the last exon). | HIGH |
splice_region_variant | SPLICE_SITE_REGION | A sequence variant in which a change has occurred within the region of the splice site, either within 1-3 bases of the exon or 3-8 bases of the intron. | LOW |
splice_region_variant | SPLICE_SITE_BRANCH | A variant affecting a putative (Lariat) branch point, located in the intron. | LOW |
splice_region_variant | SPLICE_SITE_BRANCH_U12 | A variant affecting a putative (Lariat) branch point from U12 splicing machinery, located in the intron. | MODERATE |
stop_lost | STOP_LOST | Variant causes stop codon to be mutated into a non-stop codon e.g.: Tga/Cga, */R | HIGH |
5_prime_UTR_premature_start_codon_gain_variant | START_GAINED | A variant in 5'UTR region produces a three base sequence that can be a START codon. | LOW |
start_lost | START_LOST | Variant causes start codon to be mutated into a non-start codon. e.g.: aTg/aGg, M/R | HIGH |
stop_gained | STOP_GAINED | Variant causes a STOP codon e.g.: Cag/Tag, Q/* | HIGH |
synonymous_variant | SYNONYMOUS_CODING | Variant causes a codon that produces the same amino acid e.g.: Ttg/Ctg, L/L | LOW |
start_retained | SYNONYMOUS_START | Variant causes start codon to be mutated into another start codon. e.g.: Ttg/Ctg, L/L (TTG and CTG can be START codons) | LOW |
stop_retained_variant | SYNONYMOUS_STOP | Variant causes stop codon to be mutated into another stop codon. e.g.: taA/taG, */* | LOW |
transcript_variant | TRANSCRIPT | The variant hits a transcript. | MODIFIER |
feature_ablation | TRANSCRIPT_DELETED | Deletion of a transcript. | HIGH |
regulatory_region_variant | REGULATION | The variant hits a known regulatory feature (non-coding). | MODIFIER |
upstream_gene_variant | UPSTREAM | Upstream of a gene (default length: 5K bases) | MODIFIER |
3_prime_UTR_variant | UTR_3_PRIME | Variant hits 3'UTR region | MODIFIER |
3_prime_UTR_truncation + exon_loss | UTR_3_DELETED | The variant deletes an exon which is in the 3'UTR of the transcript | MODERATE |
5_prime_UTR_variant | UTR_5_PRIME | Variant hits 5'UTR region | MODIFIER |
5_prime_UTR_truncation + exon_loss_variant | UTR_5_DELETED | The variant deletes an exon which is in the 5'UTR of the transcript | MODERATE |
sequence_feature + exon_loss_variant | NEXT_PROT | A 'NextProt' based annotation. Details are provided in the 'feature type' sub-field (ANN), or in the effect details (EFF). | MODIFIER |
Details about Rare amino acid effect
These are amino acids that occur very rarely in an organism.
For instance, humans are supposed to use 20 amino acids, but
there is also one rare AA. Selenocysteine, single letter
code 'U', appears roughly 100 times in the whole genome.
The amino acid is so rare that usually it does not appear
in codon translation tables. It is encoded as UGA, which usually
means a STOP codon. Secondary RNA
structures are assumed to enable this special translation.
A variant in one of these sites is likely to cause a loss of
function in the protein. E.g. in case of a Selenocysteine, a
loss of a selenium molecule is likely to cause loss of function.
Put simply, the assumption is that there is a great deal of trouble
to get that non-standard amino acid there, so it must be important.
RARE_AMINO_ACID mark is used to show that special attention should
be paid in these cases.
When the variant hits a RARE_AMINO_ACID mark, it is likely that
the 'old_AA/new_AA' field will be incorrect. This may happen because
the amino acid is not predictable using a codon table.
Details about Protein interaction effects
Protein interactions are calculated from PDB. There are two main types of interactions:
HIGH
impact or a LOW
impact variant is the one producing a phenotype of interest.
Impact | Meaning | Example |
---|---|---|
HIGH | The variant is assumed to have high (disruptive) impact in the protein, probably causing protein truncation, loss of function or triggering nonsense mediated decay. | stop_gained, frameshift_variant |
MODERATE | A non-disruptive variant that might change protein effectiveness. | missense_variant, inframe_deletion |
LOW | Assumed to be mostly harmless or unlikely to change protein behavior. | synonymous_variant |
MODIFIER | Usually non-coding variants or variants affecting non-coding genes, where predictions are difficult or there is no evidence of impact. | exon_variant, downstream_gene_variant |
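When a variant has several annotations, a common first step is to find the most severe impact category among them. The sketch below assumes the ordering HIGH, MODERATE, LOW, MODIFIER from the table above; the function name `most_severe` is hypothetical. Keep in mind that impact categories are only a rough guide, so this ranking should not be read as a measure of biological relevance.

```python
# Impact categories from most to least severe, as listed in the table above
IMPACT_ORDER = ["HIGH", "MODERATE", "LOW", "MODIFIER"]


def most_severe(impacts):
    """Return the most severe impact in `impacts`, or None if none is recognized."""
    ranked = [impact for impact in impacts if impact in IMPACT_ORDER]
    if not ranked:
        return None
    return min(ranked, key=IMPACT_ORDER.index)


print(most_severe(["MODIFIER", "MODERATE", "LOW"]))
```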
-lof
, but as of version 4.0, it is activated by default.
Some details on how these variants work, can be found in these slides
# Note: From version 4.0 onwards, the '-lof' command line option is not required
java -Xmx4g -jar snpEff.jar -v \
    -lof \
    GRCh37.75 \
    test.chr22.vcf > test.chr22.ann.vcf

SnpEff adds 'LOF' and 'NMD' tags to INFO fields (column 8 in VCF format). LOF and NMD tags have the following format:

Gene | ID | num_transcripts | percent_affected

Where:
Field | Description |
---|---|
Gene | Gene name |
ID | Gene ID (usually ENSEMBL) |
Num_transcripts | Number of transcripts in this gene |
percent_affected | Percentage of transcripts affected by this variant. |
EFF=stop_gained(LOW|NONSENSE|Gga/Tga|p.Gly163*/c.487G>T|574|GAB4|protein_coding|CODING|ENST00000400588|3|1),...

and the corresponding LOF and NMD tags are

LOF=(GAB4|ENSG00000215568|4|0.25);NMD=(GAB4|ENSG00000215568|4|0.25)

The meaning of the LOF tag is:
Field | Description |
---|---|
Gene | GAB4 |
ID | ENSG00000215568 |
Num_transcripts | There are 4 transcripts in this gene |
percent_affected | 25% of transcripts are affected by this variant. |
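The LOF and NMD values can be unpacked with a short Python sketch. The function name `parse_lof` is hypothetical; the sketch assumes that each record is wrapped in parentheses, that multiple records (if any) are separated by commas, and that the last sub-field is a fraction (0.25 meaning 25%).

```python
def parse_lof(value: str) -> list:
    """Parse the text after 'LOF=' or 'NMD=' into a list of dicts."""
    records = []
    for entry in value.split(","):
        gene, gene_id, num_transcripts, fraction = entry.strip().strip("()").split("|")
        records.append({
            "gene": gene,
            "id": gene_id,
            "num_transcripts": int(num_transcripts),
            "percent_affected": float(fraction),   # stored as a fraction of 1
        })
    return records


lof = parse_lof("(GAB4|ENSG00000215568|4|0.25)")
record = lof[0]
print(record["gene"], record["num_transcripts"], f"{record['percent_affected']:.0%}")
```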
Error | Meaning and possible solutions |
---|---|
ERROR_CHROMOSOME_NOT_FOUND | Chromosome does not exist in reference database. See this FAQ for more details. |
ERROR_OUT_OF_CHROMOSOME_RANGE | This means that the position is higher than chromosome's length. Probably an indicator that your data is not from this reference genome. |
ERROR_OUT_OF_EXON | Exonic information not matching the coordinates. Indicates a problem (or even a bug?) in the database |
ERROR_MISSING_CDS_SEQUENCE | Transcript has no CDS info. Indicates a problem (or even a bug?) in the database |
Warning | Meaning and possible solutions |
---|---|
WARNING_REF_DOES_NOT_MATCH_GENOME | This means that the REF field does not match the reference genome. This warning probably indicates there is something really wrong with your data! This happens when your data was aligned to a different reference genome than the one used to create SnpEff's database. If there are many of these warnings, it's a strong indicator that the data doesn't match and all the annotations will be garbage (because you are using the wrong database). Solution: Use the right database to annotate! Due to performance and memory optimizations, SnpEff only checks reference sequence on Exons. |
WARNING_SEQUENCE_NOT_AVAILABLE | For some reason the exon sequence is not available, so we cannot calculate effects. |
WARNING_TRANSCRIPT_INCOMPLETE | A protein coding transcript whose length is not a multiple of 3. This means that information is missing for one or more amino acids. This is usually due to errors in the genomic information (e.g. the genomic databases provided by UCSC or ENSEMBL). Genomic information databases are constantly being improved and are getting more accurate, but some errors still remain. |
WARNING_TRANSCRIPT_MULTIPLE_STOP_CODONS | A protein coding transcript has two or more STOP codons in the middle of the coding sequence (CDS). This should not happen and it usually means the genomic information may have an error in this transcript. This is usually due to errors in the genomic information (e.g. the genomic databases provided by UCSC or ENSEMBL). Genomic information databases are constantly being improved and are getting more accurate, but some errors still remain. |
WARNING_TRANSCRIPT_NO_START_CODON | A protein coding transcript does not have a proper START codon. It is rare that a real transcript does not have a START codon, so this probably indicates errors in genomic information for this transcript (e.g. the genomic databases provided by UCSC or ENSEMBL). Genomic information databases are constantly being improved and are getting more accurate, but some errors still remain. |
$ java -Xmx4g -jar snpEff.jar -i bed BDGP5.69 chipSeq_peaks.bed

# SnpEff version 3.3 (build 2013-05-15), by Pablo Cingolani
# Command line: SnpEff -i bed BDGP5.69 /home/pcingola/fly_pvuseq/chipSeq/Sample_w1118_IP_w_5hmC/w1118_IP_w_5hmC_peaks.bed
# Chromo Start End Name;Effect|Gene|BioType Score
2L 189463 190154 MACS_peak_1;Exon|exon_6_12_RETAINED|FBtr0078122|protein_coding|spen|protein_coding;Exon|exon_5_10_RETAINED|FBtr0078123|protein_coding|spen|protein_coding;Exon|exon_7_13_RETAINED|FBtr0306341|protein_coding|spen|protein_coding;Exon|exon_6_11_RETAINED|FBtr0078121|protein_coding|spen|protein_coding 245.41
2L 195607 196120 MACS_peak_2;Exon|exon_6_12_RETAINED|FBtr0078122|protein_coding|spen|protein_coding;Exon|exon_5_10_RETAINED|FBtr0078123|protein_coding|spen|protein_coding;Exon|exon_7_13_RETAINED|FBtr0306341|protein_coding|spen|protein_coding;Exon|exon_6_11_RETAINED|FBtr0078121|protein_coding|spen|protein_coding 51.22
2L 527253 527972 MACS_peak_3;Intron|intron_2_RETAINED-RETAINED|FBtr0078063|protein_coding|ush|protein_coding 55.97
2L 711439 711764 MACS_peak_4;Intron|intron_1_RETAINED-RETAINED|FBtr0078045|protein_coding|ds|protein_coding 61.16
2L 1365255 1365556 MACS_peak_5;Upstream|FBtr0077927|protein_coding|CG14346|protein_coding;Upstream|FBtr0077926|protein_coding|CG14346|protein_coding;Intergenic|NLaz...CG14346;Upstream|FBtr0077942|protein_coding|NLaz|protein_coding 62.78
2L 1970199 1970405 MACS_peak_6;Upstream|FBtr0077813|protein_coding|Der-1|protein_coding;Intergenic|tRNA:CR31942...Der-1;Downstream|FBtr0077812|tRNA|tRNA:CR31942|tRNA 110.34
2L 3345637 3346152 MACS_peak_7;Intron|intron_2_ALTTENATIVE_3SS-ALTTENATIVE_3SS|FBtr0089979|protein_coding|E23|protein_coding;Intron|intron_3_ALTTENATIVE_3SS-ALTTENATIVE_3SS|FBtr0089981|protein_coding|E23|protein_coding 65.49
2L 4154734 4155027 MACS_peak_8;Intergenic|CG2955...Or24a;Downstream|FBtr0077468|protein_coding|CG2955|protein_coding 76.92
2L 4643232 4643531 MACS_peak_9;Downstream|FBtr0110769|protein_coding|BG642163|protein_coding;Exon|exon_2_2_RETAINED|FBtr0300354|protein_coding|CG15635|protein_coding 76.92

When a peak intersects multiple transcripts or even multiple genes, each annotation is separated by a semicolon. So if you look into the previous results in more detail, the first line looks like this (format edited for readability purposes):
2L 189463 190154 MACS_peak_1;Exon|exon_6_12_RETAINED|FBtr0078122|protein_coding|spen|protein_coding
    ;Exon|exon_5_10_RETAINED|FBtr0078123|protein_coding|spen|protein_coding
    ;Exon|exon_7_13_RETAINED|FBtr0306341|protein_coding|spen|protein_coding
    ;Exon|exon_6_11_RETAINED|FBtr0078121|protein_coding|spen|protein_coding

This peak is hitting four transcripts (FBtr0078122, FBtr0078123, FBtr0306341, FBtr0078121) in gene 'spen'.
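The semicolon-separated layout of the name column can be split programmatically. Below is a minimal Python sketch; the function name `parse_bed_annotations` is hypothetical, and the sketch assumes that the original peak name contains no ';' and that annotation sub-fields contain no '|'.

```python
def parse_bed_annotations(name_column: str):
    """Split an annotated BED name column into (peak_name, list of annotations).

    Each annotation is returned as a list of its '|'-separated sub-fields.
    """
    peak_name, *entries = name_column.split(";")
    annotations = [entry.strip().split("|") for entry in entries if entry.strip()]
    return peak_name, annotations


name_column = (
    "MACS_peak_1;Exon|exon_6_12_RETAINED|FBtr0078122|protein_coding|spen|protein_coding"
    ";Exon|exon_5_10_RETAINED|FBtr0078123|protein_coding|spen|protein_coding"
)
peak, annotations = parse_bed_annotations(name_column)

# Collect the transcript IDs (third sub-field of each 'Exon' annotation)
transcripts = [a[2] for a in annotations if a[0] == "Exon"]
print(peak, transcripts)
```

Annotation types differ in the number of sub-fields (for example 'Intergenic' entries are shorter), so code that indexes into the sub-fields should check the annotation type first, as done here.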
exon_Rank_Total_Type, where:
rank is the exon rank in the transcript (position in the transcript)
total is the total number of exons in that transcript
type is the exon splice type.
exon_5_10_RETAINED would be the fifth exon in a 10 exon transcript. This exon is type "RETAINED", which means it is not spliced out.
NONE: Not spliced
RETAINED: All transcripts have this exon
SKIPPED: Some transcripts skip it
ALTTENATIVE_3SS: Some transcripts have an alternative 3' exon start
ALTTENATIVE_5SS: Some transcripts have an alternative 5' exon end
MUTUALLY_EXCLUSIVE: Mutually exclusive (with respect to another exon)
ALTTENATIVE_PROMOMOTER: The first exon is different in some transcripts.
ALTTENATIVE_POLY_A: The last exon.
intron_Rank_ExonTypeBefore-ExonTypeAfter, where:
Rank: the rank number for this intron in the transcript
ExonTypeBefore: the splicing type of the exon preceding this intron (see exon naming convention for details).
ExonTypeAfter: the splicing type of the exon after this intron (see exon naming convention for details).
intron_9_SKIPPED-RETAINED would be the ninth intron of the transcript. The intron is preceded by a SKIPPED exon and followed by a RETAINED exon.
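The exon and intron naming conventions above can be decoded with a few lines of Python. This is a sketch with hypothetical function names; it relies on the fact that splice type names may contain underscores (e.g. ALTTENATIVE_3SS) but no hyphens, so the number of splits is limited explicitly.

```python
def parse_exon_id(exon_id: str) -> dict:
    """Decode 'exon_Rank_Total_Type', e.g. 'exon_5_10_RETAINED'."""
    # Limit to 3 splits: the splice type itself may contain underscores
    prefix, rank, total, splice_type = exon_id.split("_", 3)
    if prefix != "exon":
        raise ValueError(f"Not an exon identifier: {exon_id!r}")
    return {"rank": int(rank), "total": int(total), "type": splice_type}


def parse_intron_id(intron_id: str) -> dict:
    """Decode 'intron_Rank_ExonTypeBefore-ExonTypeAfter'."""
    prefix, rank, types = intron_id.split("_", 2)
    if prefix != "intron":
        raise ValueError(f"Not an intron identifier: {intron_id!r}")
    before, after = types.split("-", 1)
    return {"rank": int(rank), "before": before, "after": after}


print(parse_exon_id("exon_5_10_RETAINED"))
print(parse_intron_id("intron_9_SKIPPED-RETAINED"))
```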
Here we describe details about annotating cancer samples.
Using the -cancer command line option, you can compare somatic vs germline samples.
$ java -Xmx4g -jar snpEff.jar -v -cancer GRCh37.75 cancer.vcf > cancer.ann.vcf
In a typical cancer sequencing experiment, we want to measure and annotate differences between germline (healthy) and somatic (cancer) tissue samples from the same patient.
The complication is that germline is not always the same as the reference genome, so a typical annotation does not work.
For instance, let's assume that at a given genomic position (e.g. chr1:69091), reference genome is 'A', germline is 'C' and somatic is 'G'.
This should be represented in a VCF file as:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Patient_01_Germline Patient_01_Somatic
1 69091 . A C,G . PASS AC=1 GT 1/0 2/1
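To see why the germline sample differs from the reference in this record, the genotype indices can be translated back into bases: index 0 is REF and indices 1, 2, ... are the ALT alleles in order. The following Python sketch does that translation; the function name `decode_genotype` is hypothetical.

```python
def decode_genotype(ref: str, alt: str, gt: str) -> list:
    """Translate a VCF GT value (e.g. '2/1') into the corresponding alleles."""
    alleles = [ref] + alt.split(",")       # index 0 = REF, 1.. = ALTs
    separator = "|" if "|" in gt else "/"  # phased or unphased genotype
    return [alleles[int(i)] if i != "." else "." for i in gt.split(separator)]


germline = decode_genotype("A", "C,G", "1/0")
somatic = decode_genotype("A", "C,G", "2/1")
print("Germline:", germline, "Somatic:", somatic)
```

In this record the germline sample carries 'C' and the reference base 'A', while the somatic sample carries 'G' and 'C', so the somatic change of interest is relative to the germline 'C' rather than to the reference 'A'.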
-cancer
command line option.
-cancerSamples
command line option.
PEDIGREE
meta information or a TXT samples file, SnpEff will not know which somatic samples derive from which germline samples.
Thus it will be unable to perform cancer effect analysis.
Patient_01_Germline Patient_01_Somatic
Patient_02_Germline Patient_02_Somatic
Patient_03_Germline Patient_03_Somatic
Patient_04_Germline Patient_04_Somatic

Then you have to specify this TXT file when invoking SnpEff, using the -cancerSamples command line option.
$ cat examples/samples_cancer_one.txt
Patient_01_Germline Patient_01_Somatic

$ java -Xmx4g -jar snpEff.jar -v \
    -cancer \
    -cancerSamples examples/samples_cancer_one.txt \
    GRCh37.75 \
    examples/cancer.vcf \
    > cancer.ann.vcf
PEDIGREE
header with the appropriate information to your VCF file.
Obviously this requires you to edit your VCF file's header.
vcf-annotate
from VCFtools). But if you find adding PEDIGREE information to your VCF file difficult, just use the TXT file method described in the previous sub-section.
$ cat examples/cancer_pedigree.vcf
##PEDIGREE=<Derived=Patient_01_Somatic,Original=Patient_01_Germline>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Patient_01_Germline Patient_01_Somatic
1 69091 . A C,G . PASS AF=0.1122 GT 1/0 2/1
1 69849 . G A,C . PASS AF=0.1122 GT 1/0 2/1
1 69511 . A C,G . PASS AF=0.3580 GT 1/1 2/2

$ java -Xmx4g -jar snpEff.jar -v -cancer GRCh37.75 examples/cancer_pedigree.vcf > examples/cancer_pedigree.ann.vcf

Here we say that the sample called Patient_01_Somatic is derived from the sample called Patient_01_Germline.
In this context, this means that the cancer sample is derived from the healthy tissue.
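To check which germline/somatic pairs a VCF header declares, the PEDIGREE lines can be read with a short Python sketch. The function name `pedigree_pairs` is hypothetical, and the sketch only handles the `Derived=...,Original=...` form shown above; other PEDIGREE keys allowed by the VCF specification are ignored.

```python
import re

PEDIGREE_RE = re.compile(r"^##PEDIGREE=<(?P<body>.*)>\s*$")


def pedigree_pairs(header_lines):
    """Return (original, derived) sample name pairs found in PEDIGREE header lines."""
    pairs = []
    for line in header_lines:
        match = PEDIGREE_RE.match(line)
        if match is None:
            continue
        # Turn 'Derived=X,Original=Y' into {'Derived': 'X', 'Original': 'Y'}
        keys = dict(
            item.split("=", 1) for item in match.group("body").split(",") if "=" in item
        )
        if "Derived" in keys and "Original" in keys:
            pairs.append((keys["Original"], keys["Derived"]))
    return pairs


header = [
    "##PEDIGREE=<Derived=Patient_01_Somatic,Original=Patient_01_Germline>",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tPatient_01_Germline\tPatient_01_Somatic",
]
print(pedigree_pairs(header))
```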
ANN
field cancer sample relies on 'Allele' sub-field.
Just as a reminder, the ANN field has the following format:

ANN = Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length | Distance | ERRORS_WARNINGS_INFO

The Allele field tells you which effect relates to which genotype.
More importantly, it tells you the genotype difference between Somatic and Germline.
$ java -Xmx4g -jar snpEff.jar -v -cancer -cancerSamples examples/samples_cancer_one.txt GRCh37.75 examples/cancer.vcf > examples/cancer.eff.vcf

1 69091 . A C,G . PASS AF=0.1122;
    ANN=G|start_lost|HIGH|OR4F5|ENSG00000186092|transcript|ENST00000335137|protein_coding|1/1|c.1A>G|p.Met1?|1/918|1/918|1/305||
    ,G-C|start_lost|HIGH|OR4F5|ENSG00000186092|transcript|ENST00000335137|protein_coding|1/1|c.1A>G|p.Leu1?|1/918|1/918|1/305||
    ,C|initiator_codon_variant|LOW|OR4F5|ENSG00000186092|transcript|ENST00000335137|protein_coding|1/1|c.1A>C|p.Met1?|1/918|1/918|1/305||
    GT 1/0 2/1

What does it mean:

C|initiator_codon_variant|LOW|OR4F5|ENSG00000186092|transcript|ENST00000335137|protein_coding|1/1|c.1A>C|p.Met1?|1/918|1/918|1/305||
Note that the first field (Allele) is 'C', indicating this is produced by the first ALT.

G|start_lost|HIGH|OR4F5|ENSG00000186092|transcript|ENST00000335137|protein_coding|1/1|c.1A>G|p.Met1?|1/918|1/918|1/305||
Note that the first field (Allele) is 'G', indicating this is produced by the second ALT.
G-C|start_lost|HIGH|OR4F5|ENSG00000186092|transcript|ENST00000335137|protein_coding|1/1|c.1A>G|p.Leu1?|1/918|1/918|1/305||
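The three forms of the Allele sub-field seen in this manual (plain ALT, ALT-REFERENCE for cancer comparisons, and ALT followed by a reference to another variant for compound annotations) can be told apart with a small Python sketch. The function name `describe_allele` and the returned labels are hypothetical, and the sketch assumes that a plain allele never contains a '-' character.

```python
def describe_allele(allele: str) -> dict:
    """Classify the Allele sub-field of an ANN annotation."""
    alt, separator, other = allele.partition("-")
    if not separator:
        # Plain allele: effect of ALT relative to the reference genome
        return {"alt": alt, "kind": "vs_reference"}
    if ":" in other:
        # e.g. 'C-chr1:123456_A>T': compound annotation with another variant
        return {"alt": alt, "kind": "compound", "other_variant": other}
    # e.g. 'G-C': effect of 'G' using 'C' (the germline allele) as reference
    return {"alt": alt, "kind": "vs_alternative_reference", "reference_allele": other}


for allele in ["C", "G", "G-C", "C-chr1:123456_A>T"]:
    print(allele, describe_allele(allele))
```

For the record above, 'G-C' therefore describes the somatic 'G' compared against the germline 'C', which is why its protein change (p.Leu1?) differs from the plain 'G' annotation (p.Met1?).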
EFF
field cancer sample relies on 'Genotype' sub-field.
Just as a reminder, the EFF field has the following format:

EFF = Effect ( Effect_Impact | Functional_Class | Codon_Change | Amino_Acid_Change| Amino_Acid_Length | Gene_Name | Transcript_BioType | Gene_Coding | Transcript_ID | Exon_Rank | Genotype_Number [ | ERRORS | WARNINGS ] )

For the previous example, we get (edited for readability):
$ java -Xmx4g -jar snpEff.jar -v -classic -cancer -cancerSamples examples/samples_cancer_one.txt GRCh37.75 examples/cancer.vcf > examples/cancer.eff.vcf

1 69091 . A C,G . PASS AC=1;
    EFF=START_LOST(HIGH|MISSENSE|Atg/Gtg|M1V|305|OR4F5|protein_coding|CODING|ENST00000335137|1|G)
    ,START_LOST(HIGH|MISSENSE|Ctg/Gtg|L1V|305|OR4F5|protein_coding|CODING|ENST00000335137|1|G-C)
    ,NON_SYNONYMOUS_START(LOW|MISSENSE|Atg/Ctg|M1L|305|OR4F5|protein_coding|CODING|ENST00000335137|1|C)

The GenotypeNum field tells you which effect relates to which genotype.
More importantly, it tells you the genotype difference between Somatic and Germline.
SnpEff can also provide non-coding and regulatory annotations. Here we show how to annotate them.
First of all, you need to see if your organism has a regulatory database.
You can just look into the database directory to see if regulation_*.bin files are there.
For instance, for human genome:
$ cd ~/snpEff
$ cd data/GRCh37.75/
$ ls -al
drwxrwxr-x 2 pcingola pcingola 4096 Aug 26 19:51 .
drwxrwxr-x 3 pcingola pcingola 4096 Aug 26 19:51 ..
-rw-rw-r-- 1 pcingola pcingola 5068097 Aug 26 19:51 motif.bin
-rw-rw-r-- 1 pcingola pcingola 5469036 Aug 26 19:51 nextProt.bin
-rw-rw-r-- 1 pcingola pcingola 38000 Aug 26 19:51 pwms.bin
-rw-rw-r-- 1 pcingola pcingola 6399582 Aug 26 19:51 regulation_CD4.bin
-rw-rw-r-- 1 pcingola pcingola 2516472 Aug 26 19:51 regulation_GM06990.bin
-rw-rw-r-- 1 pcingola pcingola 8064939 Aug 26 19:51 regulation_GM12878.bin
-rw-rw-r-- 1 pcingola pcingola 6309932 Aug 26 19:51 regulation_H1ESC.bin
-rw-rw-r-- 1 pcingola pcingola 5247586 Aug 26 19:51 regulation_HeLa-S3.bin
-rw-rw-r-- 1 pcingola pcingola 7506893 Aug 26 19:51 regulation_HepG2.bin
-rw-rw-r-- 1 pcingola pcingola 4064952 Aug 26 19:51 regulation_HMEC.bin
-rw-rw-r-- 1 pcingola pcingola 4644239 Aug 26 19:51 regulation_HSMM.bin
-rw-rw-r-- 1 pcingola pcingola 5641615 Aug 26 19:51 regulation_HUVEC.bin
-rw-rw-r-- 1 pcingola pcingola 5617233 Aug 26 19:51 regulation_IMR90.bin
-rw-rw-r-- 1 pcingola pcingola 546871 Aug 26 19:51 regulation_K562b.bin
-rw-rw-r-- 1 pcingola pcingola 8542718 Aug 26 19:51 regulation_K562.bin
-rw-rw-r-- 1 pcingola pcingola 3119671 Aug 26 19:51 regulation_NH-A.bin
-rw-rw-r-- 1 pcingola pcingola 5721741 Aug 26 19:51 regulation_NHEK.bin
-rw-rw-r-- 1 pcingola pcingola 94345546 Aug 26 19:51 snpEffectPredictor.bin

So we can annotate using any of those tracks.
$ java -Xmx4g -jar snpEff.jar -v -reg HeLa-S3 -reg NHEK GRCh37.75 examples/test.1KG.vcf > test.1KG.ann_reg.vcf
00:00:00.000 Reading configuration file 'snpEff.config'. Genome: 'GRCh37.75'
00:00:00.377 done
00:00:00.377 Reading database for genome version 'GRCh37.75' from file '/home/pcingola/snpEff_v4_0/./data/GRCh37.75/snpEffectPredictor.bin' (this might take a while)
00:00:25.845 done
00:00:25.878 Reading regulation track 'NHEK'
00:00:30.137 Reading regulation track 'HeLa-S3'
...

# Show one example of "regulatory_region" (output edited for readability)
$ grep -i regulatory_region test.1KG.ann_reg.vcf | head -n 1 | ./scripts/vcfInfoOnePerLine.pl
1 10291 . C T 2373.79 . ANN=T|regulatory_region_variant|MODIFIER|||REGULATION&H3K36me3:NHEK|NHEK_H3K36me3_5|||||||||
    ,T|regulatory_region_variant|MODIFIER|||REGULATION&H3K27me3:NHEK|NHEK_H3K27me3_4|||||||||
    ,T|regulatory_region_variant|MODIFIER|||REGULATION&Max:HeLa-S3|HeLa-S3_Max_26|||||||||
    ,T|regulatory_region_variant|MODIFIER|||REGULATION&Cfos:HeLa-S3|HeLa-S3_Cfos_30|||||||||
    ,T|regulatory_region_variant|MODIFIER|||REGULATION&FAIRE:HeLa-S3|HeLa-S3_FAIRE_49|||||||||
    ,T|regulatory_region_variant|MODIFIER|||REGULATION&H3K27ac:HeLa-S3|HeLa-S3_H3K27ac_88|||||||||
    ,T|regulatory_region_variant|MODIFIER|||REGULATION&PolII:NHEK|NHEK_PolII_59|||||||||
    ,T|regulatory_region_variant|MODIFIER|||REGULATION&CTCF:NHEK|NHEK_CTCF_42|||||||||
    ,T|regulatory_region_variant|MODIFIER|||REGULATION&Cmyc:HeLa-S3|HeLa-S3_Cmyc_16|||||||||
    ,T|regulatory_region_variant|MODIFIER|||REGULATION&H4K20me1:NHEK|NHEK_H4K20me1_122|||||||||
    ,T|regulatory_region_variant|MODIFIER|||REGULATION&H3K4me3:NHEK|NHEK_H3K4me3_133|||||||||
    ,T|regulatory_region_variant|MODIFIER|||REGULATION&DNase1:HeLa-S3|HeLa-S3_DNase1_108|||||||||
    ,T|regulatory_region_variant|MODIFIER|||REGULATION&DNase1:NHEK|NHEK_DNase1_63|||||||||
    ,T|regulatory_region_variant|MODIFIER|||REGULATION&FAIRE:NHEK|NHEK_FAIRE_149|||||||||
The ENCODE project's goal is to find all functional elements in the human genome. You can perform annotations using ENCODE's data.
The ENCODE project has produced huge amounts of data (see also Nature's portal).
This information is available for download and can be used to annotate genomic variants or regions.
An overview of all the data available from ENCODE is shown as an experimental data matrix.
The download site is here.
Data is available in "BigBed" format, which can be fed into SnpEff using the -interval command line option (you can add many -interval options).
Here is a simple example:
# Create a directory for ENCODE files
mkdir -p db/encode
# Download ENCODE experimental results (BigBed file)
cd db/encode
wget "http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/integration_data_jan2011/byDataType/openchrom/jan2011/fdrPeaks/wgEncodeDukeDnase8988T.fdr01peaks.hg19.bb"
# Go back to SnpEff's directory (the paths below are relative to it)
cd ../..
# Annotate using ENCODE's data:
java -Xmx4g -jar snpEff.jar -v -interval db/encode/wgEncodeDukeDnase8988T.fdr01peaks.hg19.bb GRCh37.75 examples/test.1KG.vcf > test.1KG.ann_encode.vcf
# Annotations are added as "CUSTOM" intervals:
$ grep CUSTOM test.1KG.ann_encode.vcf | head
1 564672 . A C 812.29 . ANN=|custom|MODIFIER|||CUSTOM&wgEncodeDukeDnase8988T|wgEncodeDukeDnase8988T:564666_564815|||||||||
1 564687 . C T 308.21 . ANN=T|custom|MODIFIER|||CUSTOM&wgEncodeDukeDnase8988T|wgEncodeDukeDnase8988T:564666_564815|||||||||
...
1 956676 . G A 120.88 . ANN=A|custom|MODIFIER|||CUSTOM&wgEncodeDukeDnase8988T|wgEncodeDukeDnase8988T:956646_956795|||||||||
...
The Epigenome Roadmap Project has produced large amounts of information that can be used by SnpEff.
The Epigenome Roadmap Project's goal is
"to map DNA methylation, histone modifications, chromatin accessibility and small RNA transcripts
in stem cells and primary ex vivo tissues selected to represent the normal counterparts of tissues
and organ systems frequently involved in human disease".
A data matrix shows the experimental setups currently available.
Unfortunately the project is not (currently) providing result files that can be used directly by annotation software, such as SnpEff.
They will be available later in the project.
So, for the time being, data has to be downloaded and pre-processed.
We'll be processing this information and making it available (as SnpEff databases) as soon as we can.
The latest processed Epigenome project information can be found here.
This includes genomic intervals for high confidence peaks in the form of BED files.
To annotate you can do:
# Download Epigenome project database (pre-processed as BED files)
# Note: '-O' sets the local file name, otherwise wget would save the file as 'download'
wget -O epigenome_latest.tgz http://sourceforge.net/projects/snpeff/files/databases/epigenome_latest.tgz/download
# Open tar file
tar -xvzf epigenome_latest.tgz
# Annotate using SnpEff and "-interval" command line
java -Xmx4g -jar snpEff.jar -v -interval db/epigenome/BI_Pancreatic_Islets_H3K4me3.peaks.bed GRCh37.75 test.vcf > test.ann.vcf
# See the data represented as "CUSTOM" EFF fields
$ grep CUSTOM test.ann.vcf
1 894573 . G A . PASS AC=725;EFF=CUSTOM[BI_Pancreatic_Islets_H3K4me3](MODIFIER||||||MACS_peak_8||||1),INTRON(MODIFIER||||749|NOC2L|protein_coding|CODING|ENST00000327044|1|1),INTRON(MODIFIER|||||NOC2L|processed_transcript|CODING|ENST00000487214|1|1),INTRON(MODIFIER|||||NOC2L|retained_intron|CODING|ENST00000469563|1|1),UPSTREAM(MODIFIER||||642|KLHL17|protein_coding|CODING|ENST00000338591||1),UPSTREAM(MODIFIER|||||KLHL17|nonsense_mediated_decay|CODING|ENST00000466300||1),UPSTREAM(MODIFIER|||||KLHL17|retained_intron|CODING|ENST00000463212||1),UPSTREAM(MODIFIER|||||KLHL17|retained_intron|CODING|ENST00000481067||1),UPSTREAM(MODIFIER|||||NOC2L|retained_intron|CODING|ENST00000477976||1)
1 948692 . G A . PASS AC=896;EFF=CUSTOM[BI_Pancreatic_Islets_H3K4me3](MODIFIER||||||MACS_peak_9||||1),INTERGENIC(MODIFIER||||||||||1),UPSTREAM(MODIFIER||||165|ISG15|protein_coding|CODING|ENST00000379389||1),UPSTREAM(MODIFIER|||||RP11-54O7.11|antisense|NON_CODING|ENST00000458555||1)
1 948921 . T C . PASS AC=904;EFF=CUSTOM[BI_Pancreatic_Islets_H3K4me3](MODIFIER||||||MACS_peak_9||||1),UPSTREAM(MODIFIER|||||RP11-54O7.11|antisense|NON_CODING|ENST00000458555||1),UTR_5_PRIME(MODIFIER||||165|ISG15|protein_coding|CODING|ENST00000379389|1|1)
1 1099342 . A C . PASS AC=831;EFF=CUSTOM[BI_Pancreatic_Islets_H3K4me3](MODIFIER||||||MACS_peak_10||||1),INTERGENIC(MODIFIER||||||||||1),UPSTREAM(MODIFIER|||||MIR200A|miRNA|NON_CODING|ENST00000384875||1),UPSTREAM(MODIFIER|||||MIR200B|miRNA|NON_CODING|ENST00000384997||1)
NextProt has useful proteomic annotations that can help to identify variants causing reduced protein functionality or even loss of function.
The NextProt project provides proteomic information that can be used for genomic annotations.
NextProt provides only human data.
Starting from SnpEff version 4.0, these annotations are automatically added if the database is available for the genome version you are using (in older SnpEff versions the -nextprot command line option was used).
NextProt databases are available for some GRCh37 genomes (e.g. file data/GRCh37.75/nextProt.bin).
Annotations example:
$ java -Xmx4g -jar snpEff.jar -v GRCh37.75 examples/test.chr22.vcf > test.chr22.ann.vcf 00:00:00.000 Reading configuration file 'snpEff.config'. Genome: 'GRCh37.75' 00:00:00.374 done 00:00:00.374 Reading database for genome version 'GRCh37.75' from file '/home/pcingola/snpEff_v4_0/./data/GRCh37.75/snpEffectPredictor.bin' (this might take a while) 00:00:25.880 done 00:00:25.913 Reading NextProt database from file '/home/pcingola/snpEff_v4_0/./data/GRCh37.75/nextProt.bin' ... # Show some results (edited for readibility) $ cat test.chr22.ann.vcf ... 22 17280941 . T C . . ANN=C|sequence_feature|LOW|XKR3|ENSG00000172967|transmembrane_region:Transmembrane_region|ENST00000331428|protein_coding|2/4|c.336-27A>G||||||,C|sequence_feature|LOW|XKR3|ENSG00000172967|transmembrane_region:Transmembrane_region|ENST00000331428|protein_coding|3/4|c.336-27A>G||||||,C|intron_variant|MODIFIER|XKR3|ENSG00000172967|transcript|ENST00000331428|protein_coding|2/3|c.336-27A>G|||||| ... 22 17472785 . G A . . ANN=A|sequence_feature|LOW|GAB4|ENSG00000215568|domain:PH|ENST00000400588|protein_coding|2/10|c.456C>T||||||,A|sequence_feature|LOW|GAB4|ENSG00000215568|domain:PH|ENST00000400588|protein_coding|1/10|c.456C>T||||||,A|non_coding_exon_variant|MODIFIER|GAB4|ENSG00000215568|transcript|ENST00000465611|nonsense_mediated_decay|2/9|n.339C>T||||||,A|non_coding_exon_variant|MODIFIER|GAB4|ENSG00000215568|transcript|ENST00000523144|processed_transcript|2/4|n.341C>T|||||| ... 22 50722408 . T C . . 
ANN=C|sequence_feature|MODERATE|PLXNB2|ENSG00000196576|glycosylation_site:N-linked__GlcNAc..._|ENST00000359337|protein_coding|14/37|c.2275A>G||||||,C|sequence_feature|LOW|PLXNB2|ENSG00000196576|topological_domain:Extracellular|ENST00000359337|protein_coding|22/37|c.2275A>G||||||,C|sequence_feature|LOW|PLXNB2|ENSG00000196576|topological_domain:Extracellular|ENST00000359337|protein_coding|8/37|c.2275A>G||||||,C|sequence_feature|LOW|PLXNB2|ENSG00000196576|topological_domain:Extracellular|ENST00000359337|protein_coding|21/37|c.2275A>G||||||,C|sequence_feature|LOW|PLXNB2|ENSG00000196576|topological_domain:Extracellular|ENST00000359337|protein_coding|16/37|c.2275A>G||||||,C|sequence_feature|LOW|PLXNB2|ENSG00000196576|topological_domain:Extracellular|ENST00000359337|protein_coding|15/37|c.2275A>G||||||,C|sequence_feature|LOW|PLXNB2|ENSG00000196576|topological_domain:Extracellular|ENST00000359337|protein_coding|17/37|c.2275A>G||||||,C|sequence_feature|LOW|PLXNB2|ENSG00000196576|topological_domain:Extracellular|ENST00000359337|protein_coding|11/37|c.2275A>G||||||,C|sequence_feature|LOW|PLXNB2|ENSG00000196576|topological_domain:Extracellular|ENST00000359337|protein_coding|14/37|c.2275A>G||||||,C|sequence_feature|LOW|PLXNB2|ENSG00000196576|topological_domain:Extracellular|ENST00000359337|protein_coding|19/37|c.2275A>G||||||,C|sequence_feature|LOW|PLXNB2|ENSG00000196576|topological_domain:Extracellular|ENST00000359337|protein_coding|18/37|c.2275A>G||||||,C|sequence_feature|LOW|PLXNB2|ENSG00000196576|topological_domain:Extracellular|ENST00000359337|protein_coding|9/37|c.2275A>G||||||,C|sequence_feature|LOW|PLXNB2|ENSG00000196576|topological_domain:Extracellular|ENST00000359337|protein_coding|20/37|c.2275A>G||||||,C|sequence_feature|LOW|PLXNB2|ENSG00000196576|topological_domain:Extracellular|ENST00000359337|protein_coding|6/37|c.2275A>G||||||,C|sequence_feature|LOW|PLXNB2|ENSG00000196576|topological_domain:Extracellular|ENST00000359337|protein_coding|3/37|c.2275A>G||||||,C|sequen
ce_feature|LOW|PLXNB2|ENSG00000196576|topological_domain:Extracellular|ENST00000359337|protein_coding|12/37|c.2275A>G||||||,C|sequence_feature|LOW|PLXNB2|ENSG00000196576|topological_domain:Extracellular|ENST00000359337|protein_coding|7/37|c.2275A>G||||||,C|sequence_feature|LOW|PLXNB2|ENSG00000196576|topological_domain:Extracellular|ENST00000359337|protein_coding|10/37|c.2275A>G||||||,C|sequence_feature|LOW|PLXNB2|ENSG00000196576|topological_domain:Extracellular|ENST00000359337|protein_coding|4/37|c.2275A>G||||||,C|sequence_feature|LOW|PLXNB2|ENSG00000196576|topological_domain:Extracellular|ENST00000359337|protein_coding|5/37|c.2275A>G||||||,C|sequence_feature|LOW|PLXNB2|ENSG00000196576|topological_domain:Extracellular|ENST00000359337|protein_coding|13/37|c.2275A>G||||||,C|upstream_gene_variant|MODIFIER|PLXNB2|ENSG00000196576|transcript|ENST00000479701|retained_intron||n.-1A>G|||||1417|,C|upstream_gene_variant|MODIFIER|PLXNB2|ENSG00000196576|transcript|ENST00000463165|retained_intron||n.-1A>G|||||2045|,C|upstream_gene_variant|MODIFIER|PLXNB2|ENSG00000196576|transcript|ENST00000492578|retained_intron||n.-1A>G|||||1973|,C|upstream_gene_variant|MODIFIER|PLXNB2|ENSG00000196576|transcript|ENST00000427829|protein_coding||c.-3A>G|||||1099|WARNING_TRANSCRIPT_INCOMPLETE,C|downstream_gene_variant|MODIFIER|PLXNB2|ENSG00000196576|transcript|ENST00000434732|protein_coding||c.*253A>G|||||188|WARNING_TRANSCRIPT_INCOMPLETE,C|downstream_gene_variant|MODIFIER|PLXNB2|ENSG00000196576|transcript|ENST00000432455|protein_coding||c.*1620A>G|||||4008|WARNING_TRANSCRIPT_NO_STOP_CODON,C|intron_variant|MODIFIER|PLXNB2|ENSG00000196576|transcript|ENST00000411680|protein_coding|2/5|c.202+5007A>G||||||WARNING_TRANSCRIPT_INCOMPLETE,C|non_coding_exon_variant|MODIFIER|PLXNB2|ENSG00000196576|transcript|ENST00000496720|processed_transcript|6/8|n.510A>G||||||The last line in the example shows a
glycosylation_site marked as MODERATE impact, since a modification of such a site might impair protein function.
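Since most NextProt features are reported as LOW and only a few (like the glycosylation site) get a higher impact, it can be handy to pull out just the latter. The following is a minimal sketch, assuming the 'ANN' sub-field order seen in the examples of this page (2 = annotation, 3 = impact, 4 = gene name, 6 = feature type); the function name nextprot_hits is hypothetical.

```shell
# Sketch: print CHROM, POS, gene and feature for every 'sequence_feature'
# annotation whose impact is MODERATE or HIGH.
# Reads an annotated VCF (ANN format) on standard input.
nextprot_hits() {
    awk -F'\t' '
        /^#/ { next }                                  # skip VCF header lines
        {
            n = split($8, info, ";")                   # INFO column -> key=value items
            for (i = 1; i <= n; i++) {
                if (info[i] !~ /^ANN=/) continue
                m = split(substr(info[i], 5), ann, ",")    # one entry per annotation
                for (j = 1; j <= m; j++) {
                    split(ann[j], f, "|")              # sub-fields of one annotation
                    if (f[2] == "sequence_feature" && (f[3] == "MODERATE" || f[3] == "HIGH"))
                        print $1 "\t" $2 "\t" f[4] "\t" f[6]
                }
            }
        }'
}
```

Usage would be something like nextprot_hits < test.chr22.ann.vcf, which prints one tab separated line per matching annotation.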
Motif annotations provided by ENSEMBL and Jaspar can be added to the standard annotations.
ENSEMBL provides transcription factor binding site predictions, for human and mouse genomes, using the Jaspar motif database.
As of SnpEff version 4.0, these annotations are added automatically, if the database is available for the genome version you are using (files motif.bin and pwms.bin).
Older versions require using the -motif command line option.
Example of transcription factor binding site predictions:
$ java -Xmx4g -jar snpEff.jar -v GRCh37.75 examples/test.chr22.vcf > test.chr22.ann.vcf
00:00:00.000 Reading configuration file 'snpEff.config'. Genome: 'GRCh37.75'
00:00:00.393 done
00:00:00.394 Reading database for genome version 'GRCh37.75' from file '/home/pcingola/snpEff_v4_0/./data/GRCh37.75/snpEffectPredictor.bin' (this might take a while)
00:00:26.214 done
00:00:26.248 Reading NextProt database from file '/home/pcingola/snpEff_v4_0/./data/GRCh37.75/nextProt.bin'
00:00:27.386 NextProt database: 523361 markers loaded.
00:00:27.387 Adding transcript info to NextProt markers.
00:00:28.072 NextProt database: 706289 markers added.
00:00:28.072 Loading Motifs and PWMs
00:00:28.072 Loading PWMs from : /home/pcingola/snpEff_v4_0/./data/GRCh37.75/pwms.bin
00:00:28.103 Loading Motifs from file '/home/pcingola/snpEff_v4_0/./data/GRCh37.75/motif.bin'
00:00:28.862 Motif database: 284122 markers loaded.
...
# Show some examples (output edited for readability)
$ cat test.chr22.ann.vcf
...
22 18301084 . G A . . ANN=A|TF_binding_site_variant|MODIFIER|||Nrsf|MA0138.2|||||||||,A|TF_binding_site_variant|MODIFIER|||Nrsf|MA0138.1|||||||||,...
...
22 23523309 . C T . . ANN=T|TF_binding_site_variant|LOW|||Gabp|MA0062.2|||||||||,...
...
22 36629223 . G C . . ANN=C|TF_binding_site_variant|LOW|||SP1|MA0079.1|||||||||,...
...
The last line shows TF_binding_site_variant|LOW|||SP1|MA0079.1, corresponding to motif MA0079.1, which you can look up in Jaspar.
SnpEff creates an additional output file showing overall statistics. This "stats" file is an HTML file which can be opened using a web browser. You can find an example of a 'stats' file here.
The program performs some statistics and saves them to the file 'snpEff_summary.html' in the directory where snpEff is being executed. You can see the file by opening it in your browser.
You can change the default location by using the '-stats' command line option. This also changes the location of the TXT summary file.
A summary can be created in CSV format using the command line option -csvStats. This allows easy downstream processing.
E.g.: In the stats file, you can see coverage histogram plots like this one
"Effects by type" vs "Effects by region"
SnpEff annotates variants.
Variants produce effects of different "types" (e.g. NON_SYNONYMOUS_CODING, STOP_GAINED).
These variants affect regions of the genome (e.g. EXON, INTRON).
The two tables count how many effects exist for each type and for each region.
E.g.: In an EXON region, you can have all the following effect types: NON_SYNONYMOUS_CODING, SYNONYMOUS_CODING, FRAME_SHIFT, STOP_GAINED, etc.
The complicated part is that some effect types affect a region that has the same name (yes, I know, this is confusing).
E.g.: In a UTR_5_PRIME region you can have UTR_5_PRIME and START_GAINED effect type.
This means that the numbers in the two tables are not exactly the same, because the labels don't mean the same thing.
See the next figure as an example
So the number of effects that affect a UTR_5_PRIME region is 206. Of those, 57 are effects type START_GAINED and 149 are effects type UTR_5_PRIME.
How exactly are effect type and effect region related? See the following table
Effect Type | Region |
---|---|
NONE, CHROMOSOME, CUSTOM, CDS | NONE |
INTERGENIC, INTERGENIC_CONSERVED | INTERGENIC |
UPSTREAM | UPSTREAM |
UTR_5_PRIME, UTR_5_DELETED, START_GAINED | UTR_5_PRIME |
SPLICE_SITE_ACCEPTOR | SPLICE_SITE_ACCEPTOR |
SPLICE_SITE_DONOR | SPLICE_SITE_DONOR |
SPLICE_SITE_REGION | SPLICE_SITE_REGION |
INTRAGENIC, START_LOST, SYNONYMOUS_START, NON_SYNONYMOUS_START, GENE, TRANSCRIPT | EXON or NONE |
EXON, EXON_DELETED, NON_SYNONYMOUS_CODING, SYNONYMOUS_CODING, FRAME_SHIFT, CODON_CHANGE, CODON_INSERTION, CODON_CHANGE_PLUS_CODON_INSERTION, CODON_DELETION, CODON_CHANGE_PLUS_CODON_DELETION, STOP_GAINED, SYNONYMOUS_STOP, STOP_LOST, RARE_AMINO_ACID | EXON |
INTRON, INTRON_CONSERVED | INTRON |
UTR_3_PRIME, UTR_3_DELETED | UTR_3_PRIME |
DOWNSTREAM | DOWNSTREAM |
REGULATION | REGULATION |
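The relation between the two tables can also be written down as a lookup. This is only a sketch that transcribes the table above into a shell 'case' statement; the function names effect_region and count_region are made up for this example and are not SnpEff commands.

```shell
# Sketch: map an effect type to the region it is counted under.
effect_region() {
    case "$1" in
        NONE|CHROMOSOME|CUSTOM|CDS)             echo NONE ;;
        INTERGENIC|INTERGENIC_CONSERVED)        echo INTERGENIC ;;
        UPSTREAM)                               echo UPSTREAM ;;
        UTR_5_PRIME|UTR_5_DELETED|START_GAINED) echo UTR_5_PRIME ;;
        SPLICE_SITE_ACCEPTOR|SPLICE_SITE_DONOR|SPLICE_SITE_REGION) echo "$1" ;;
        INTRAGENIC|START_LOST|SYNONYMOUS_START|NON_SYNONYMOUS_START|GENE|TRANSCRIPT)
                                                echo "EXON or NONE" ;;
        EXON|EXON_DELETED|NON_SYNONYMOUS_CODING|SYNONYMOUS_CODING|FRAME_SHIFT|CODON_CHANGE|CODON_INSERTION|CODON_CHANGE_PLUS_CODON_INSERTION|CODON_DELETION|CODON_CHANGE_PLUS_CODON_DELETION|STOP_GAINED|SYNONYMOUS_STOP|STOP_LOST|RARE_AMINO_ACID)
                                                echo EXON ;;
        INTRON|INTRON_CONSERVED)                echo INTRON ;;
        UTR_3_PRIME|UTR_3_DELETED)              echo UTR_3_PRIME ;;
        DOWNSTREAM)                             echo DOWNSTREAM ;;
        REGULATION)                             echo REGULATION ;;
        *)                                      echo UNKNOWN; return 1 ;;
    esac
}

# Count how many effect types (one per line on stdin) fall into a region.
count_region() {
    target="$1"
    n=0
    while IFS= read -r eff; do
        if [ "$(effect_region "$eff")" = "$target" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}
```

With the numbers from the example above, 57 START_GAINED effects plus 149 UTR_5_PRIME effects both map to the UTR_5_PRIME region, which gives the 206 shown in the "by region" table.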
SnpEff also generates a TXT (tab separated) file with counts of the number of variants affecting each transcript and gene.
By default, the file name is snpEff_genes.txt, but it can be changed using the -stats command line option.
Here is an example of this file:
$ head snpEff_genes.txt
# The following table is formatted as tab separated values.
#GeneName GeneId TranscriptId BioType variants_impact_HIGH variants_impact_LOW variants_impact_MODERATE variants_impact_MODIFIER variants_effect_3_prime_UTR_variant variants_effect_5_prime_UTR_premature_start_codon_gain_variant variants_effect_5_prime_UTR_variant variants_effect_downstream_gene_variant variants_effect_intron_variant variants_effect_missense_variant variants_effect_non_coding_exon_variant variants_effect_splice_acceptor_variant variants_effect_splice_donor_variant variants_effect_splice_region_variant variants_effect_start_lost variants_effect_stop_gained variants_effect_stop_lost variants_effect_synonymous_variant variants_effect_upstream_gene_variant bases_affected_DOWNSTREAM total_score_DOWNSTREAM length_DOWNSTREAM bases_affected_EXON total_score_EXON length_EXON bases_affected_INTRON total_score_INTRON length_INTRON bases_affected_SPLICE_SITE_ACCEPTOR total_score_SPLICE_SITE_ACCEPTOR length_SPLICE_SITE_ACCEPTOR bases_affected_SPLICE_SITE_DONOR total_score_SPLICE_SITE_DONOR length_SPLICE_SITE_DONOR bases_affected_SPLICE_SITE_REGION total_score_SPLICE_SITE_REGION length_SPLICE_SITE_REGION bases_affected_TRANSCRIPT total_score_TRANSCRIPT length_TRANSCRIPT bases_affected_UPSTREAM total_score_UPSTREAM length_UPSTREAM bases_affected_UTR_3_PRIME total_score_UTR_3_PRIME length_UTR_3_PRIME bases_affected_UTR_5_PRIME total_score_UTR_5_PRIME length_UTR_5_PRIME
AC000029.1 ENSG00000221069 ENST00000408142 miRNA 0 0 0 2 0 0 0 2 0 0 0 0 0 0 0 0 5000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
AC000068.5 ENSG00000185065 ENST00000431090 antisense 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 5000 0 0 0 0 0 0
AC000081.2 ENSG00000230194 ENST00000433141 processed_pseudogene 0 0 0 8 0 0 0 3 0 0 0 0 0 0 5000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 0 5000 0 0
AC000089.3 ENSG00000235776 ENST00000424559 processed_pseudogene 0 0 0 1 0 0 0 0 0 0 0 0 0 0 5000 0 0 0 0 0 0
AC002472.1 ENSG00000269103 ENST00000547793 protein_coding 0 0 0 6 0 0 0 5 0 0 0 0 0 0 0 5000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 5000 0 0
AC002472.11 ENSG00000226872 ENST00000450652 antisense 0 0 0 13 0 0 0 5 2 0 0 0 0 0 0 5000 0 0 0 2 0 11199 0 0 0 0 0 0 0 0 0 0 0 0 6 0 5000 0 0
AC002472.13 ENSG00000187905 ENST00000342608 protein_coding 0 1 6 1 0 0 0 0 1 6 0 0 0 1 0 116 1 0 934 0 0 0 0 0 0 1 0 3 0 0 0 0 0 0 0 0 0 0 0
AC002472.13 ENSG00000187905 ENST00000442047 protein_coding 0 1 6 1 0 0 0 0 1 6 0 0 0 1 0 116 1 0 934 0 0 0 0 0 0 1 0 3 0 0 0 0 0 0 0 0 0 0 0
Column name | Meaning |
---|---|
GeneName | Gene name (usually HUGO) |
GeneId | Gene's ID |
TranscriptId | Transcript's ID |
BioType | Transcript's bio-type (if available) |
The following column is repeated for each impact {HIGH, MODERATE, LOW, MODIFIER} | |
variants_impact_* | Count number of variants for each impact category |
The following column is repeated for each annotated effect (e.g. missense_variant, synonymous_variant, stop_lost, etc.) | |
variants_effect_* | Count number of variants for each effect type |
The following columns are repeated for several genomic regions (DOWNSTREAM, EXON, INTRON, UPSTREAM, etc.) | |
bases_affected_* | Number of bases that variants overlap genomic region |
total_score_* | Sum of scores overlapping this genomic region. Note: Scores are only available when input files are type 'BED' (e.g. when annotating ChipSeq experiments) |
length_* | Genomic region length |
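Because the file is tab separated and has a header line, it is easy to post-process. Here is a minimal sketch that lists transcripts with at least one HIGH impact variant. It assumes the header line starts with '#GeneName' and contains a column called 'variants_impact_HIGH', as in the example above; the function name genes_with_high_impact is hypothetical.

```shell
# Sketch: print gene name, transcript ID and HIGH impact count
# for every row of snpEff_genes.txt that has HIGH impact variants.
genes_with_high_impact() {
    awk -F'\t' '
        /^#GeneName/ {                       # header row: locate the column by name
            for (i = 1; i <= NF; i++) if ($i == "variants_impact_HIGH") col = i
            next
        }
        /^#/ { next }                        # other comment lines
        col && $col > 0 { print $1 "\t" $3 "\t" $col }
    ' "$1"
}
```

For instance, genes_with_high_impact snpEff_genes.txt prints nothing for the rows shown above, since all of them have a HIGH count of zero.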
Some common problems
Chromosome not found
This is by far the most common problem.
It means that the chromosome names in the input VCF file do not match the ones in SnpEff's database. Since SnpEff's databases are created using the reference genome's chromosome names, they don't match the reference genome either.
The solution is simple: fix your VCF file to use standard chromosome names.
You can see which chromosome names are used by SnpEff simply by using the -v (verbose) command line option.
This shows all chromosome names and their respective lengths. Notice the last line ("Chromosomes names [sizes]"):
$ java -Xmx4g -jar snpEff.jar -v GRCh37.75 examples/test.chr22.vcf > test.chr22.ann.vcf
00:00:00.000 Reading configuration file 'snpEff.config'. Genome: 'GRCh37.75'
...
# Number of chromosomes : 297
# Chromosomes names [sizes] :
# 'HG1292_PATCH' [250051446]
# 'HG1287_PATCH' [249964560]
# 'HG1473_PATCH' [249272860]
# 'HG1471_PATCH' [249269426]
# 'HSCHR1_1_CTG31' [249267852]
# 'HSCHR1_2_CTG31' [249266025]
# 'HSCHR1_3_CTG31' [249262108]
# 'HG999_2_PATCH' [249259300]
# 'HG989_PATCH' [249257867]
# 'HG999_1_PATCH' [249257505]
# 'HG1472_PATCH' [249251918]
# '1' [249250621]
# '2' [243199373]
# '3' [198022430]
# '4' [191154276]
# '5' [180915260]
# '6' [171115067]
# '7' [159138663]
# 'X' [155270560]
...
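To compare against that list, you can print the chromosome names used in your VCF file and, if needed, rewrite them. The sketch below covers one common case only and rests on two assumptions: that the mismatch is a 'chr' prefix in the VCF (e.g. 'chr1' instead of '1'), and that the mitochondrial chromosome should be renamed from 'chrM' to 'MT'. Check the verbose output for your genome before relying on it. The function names are made up for this example.

```shell
# Sketch: list the distinct chromosome names used in a VCF file.
vcf_chroms() {
    grep -v '^#' "$1" | cut -f 1 | sort -u
}

# Sketch: drop the 'chr' prefix from the CHROM column and from
# '##contig' header lines (assumption: 'chrM' becomes 'MT').
# Reads a VCF on stdin, writes the fixed VCF to stdout.
strip_chr_prefix() {
    awk 'BEGIN { FS = OFS = "\t" }
        /^##contig=<ID=chr/ {
            if ($0 ~ /ID=chrM[,>]/) sub(/ID=chrM/, "ID=MT")
            else sub(/ID=chr/, "ID=")
            print; next
        }
        /^#/ { print; next }                 # other header lines are kept as they are
        {
            if ($1 == "chrM") $1 = "MT"
            else sub(/^chr/, "", $1)
            print
        }'
}
```

A typical use would be strip_chr_prefix < input.vcf > input.fixed.vcf, followed by vcf_chroms input.fixed.vcf to confirm that the names now match the ones SnpEff reports.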
Apparent inconsistencies when using UCSC genome browser
WARNING: Usage of hg19 genome is deprecated and discouraged, you should use GRChXX.YY instead (e.g. the latest version at the time of writing is GRCh37.70)
Reference sequence and annotations are made for an organism version and sub-version.
For example, the human genome, version 37, sub-version 70, would be called GRCh37.70.
UCSC doesn't specify sub-version.
They just say hg19.
This annoying sub-version problem appeared often and, having reproducibility of results in mind, I dropped UCSC annotations in favor of ENSEMBL ones (they have clear versioning).
SnpEff reporting an effect that doesn't match ENSEMBL's web page
Please remember that databases are updated often (e.g. by ENSEMBL), so if you are using an old database, you might get different effects.
For example, this transcript ENST00000487462 changed from protein_coding in GRCh37.63
1 protein_coding exon 1655388 1655458 . - . gene_id "ENSG00000008128"; transcript_id "ENST00000487462"; exon_number "1"; gene_name "CDK11A"; transcript_name "CDK11A-013";
1 protein_coding exon 1653905 1654270 . - . gene_id "ENSG00000008128"; transcript_id "ENST00000487462"; exon_number "2"; gene_name "CDK11A"; transcript_name "CDK11A-013";
...to processed_transcript in GRCh37.64:
1 processed_transcript exon 1655388 1655458 . - . gene_id "ENSG00000008128"; transcript_id "ENST00000487462"; exon_number "1"; gene_name "CDK11A"; gene_biotype "protein_coding"; transcript_name "CDK11A-013";
1 processed_transcript exon 1653905 1654270 . - . gene_id "ENSG00000008128"; transcript_id "ENST00000487462"; exon_number "2"; gene_name "CDK11A"; gene_biotype "protein_coding"; transcript_name "CDK11A-013";
This means that you'll get different results for this transcript using sub-version 63 or 64. I assume that the latest versions are improved, so I always encourage upgrading.
SnpEff reports a SYNONYMOUS and a NON_SYNONYMOUS effect on the same gene
This is not a bug.
It is not uncommon for a gene to have more than one transcript (e.g. in human most genes have multiple transcripts).
A variant (e.g. a SNP) might affect different transcripts in different ways, as a result of different reading frames.
For instance:
chr5 137622242 . C T . . EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Gaa/Aaa|E/K|CDC25C|protein_coding|CODING|ENST00000514017|exon_5_137622186_137622319),
SYNONYMOUS_CODING(LOW|SILENT|caG/caA|Q|CDC25C|protein_coding|CODING|ENST00000323760|exon_5_137622186_137622319),
SYNONYMOUS_CODING(LOW|SILENT|caG/caA|Q|CDC25C|protein_coding|CODING|ENST00000348983|exon_5_137622186_137622319),
SYNONYMOUS_CODING(LOW|SILENT|caG/caA|Q|CDC25C|protein_coding|CODING|ENST00000356505|exon_5_137622186_137622319),
SYNONYMOUS_CODING(LOW|SILENT|caG/caA|Q|CDC25C|protein_coding|CODING|ENST00000357274|exon_5_137622186_137622319),
SYNONYMOUS_CODING(LOW|SILENT|caG/caA|Q|CDC25C|protein_coding|CODING|ENST00000415130|exon_5_137622186_137622319),
SYNONYMOUS_CODING(LOW|SILENT|caG/caA|Q|CDC25C|protein_coding|CODING|ENST00000513970|exon_5_137622186_137622319),
SYNONYMOUS_CODING(LOW|SILENT|caG/caA|Q|CDC25C|protein_coding|CODING|ENST00000514555|exon_5_137622186_137622319),
SYNONYMOUS_CODING(LOW|SILENT|caG/caA|Q|CDC25C|protein_coding|CODING|ENST00000534892|exon_5_137622186_137622319)
In this example (it was divided into multiple lines for legibility), the first transcript ENST00000514017 has a NON_SYNONYMOUS effect, but all other transcripts have a SYNONYMOUS effect.
Counting total number of effects of a given type
Some people try to count the number of effects in a file by doing (assuming we want to count how many MODIFIER effects we have):
grep -o MODIFIER output.ann.vcf | wc -l
A more robust way is to split the INFO column into individual effects and count those:
cat output.ann.vcf \
  | cut -f 8 \
  | tr ";" "\n" \
  | grep ^EFF= \
  | cut -f 2 -d = \
  | tr "," "\n" \
  | grep MODIFIER \
  | wc -l
Brief explanation:
cut -f 8 | Extract INFO fields |
tr ";" "\n" | Expand each field into one line |
grep ^EFF= | Only keep 'EFF' fields |
cut -f 2 -d = | Keep only the effect data (drop the 'EFF=' part) |
tr "," "\n" | Expand effects to multiple lines |
grep MODIFIER | wc -l | Count the ones you want (in this example 'MODIFIER') |
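The same idea works for files that use the 'ANN' field instead of 'EFF'. The following is a sketch under the assumption that the impact is the third '|' separated sub-field of each annotation, as in the 'ANN' examples on this page; it compares that sub-field exactly instead of searching for the word anywhere in the entry. The function name count_ann_impact is hypothetical.

```shell
# Sketch: count annotations with a given impact in an ANN-formatted VCF.
# Usage: count_ann_impact IMPACT < annotated.vcf
count_ann_impact() {
    grep -v '^#' \
      | cut -f 8 \
      | tr ';' '\n' \
      | grep '^ANN=' \
      | cut -d = -f 2- \
      | tr ',' '\n' \
      | awk -F'|' -v impact="$1" '$3 == impact { n++ } END { print n + 0 }'
}
```

For example, count_ann_impact MODIFIER < output.ann.vcf prints a single number. Note the use of '-f 2-' in the second cut, which keeps everything after 'ANN=' even if an annotation happens to contain another '=' character.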
SnpEff needs a database to perform genomic annotations. There are pre-built databases for over 2,500 genomes, so chances are that your organism of choice already has a SnpEff database available. In the (unlikely?) event that you need to build one yourself, here we describe how to do it.
You can find out which genomes are supported by running the following command:
$ java -jar snpEff.jar databases
SnpEff databases for the most popular genomes are already pre-built and available for you to download. So, chances are that you don't need to build a database yourself (this will save you a LOT of work).
By default SnpEff automatically downloads and installs the database for you, so you don't need to do it manually.
The following instructions are for people that want to pre-install databases manually (again, most people don't need to do this).
The easiest way to download and install a pre-built SnpEff database manually, is using the "download" command.
E.g. if you want to install the SnpEff database for the human genome, you can run the following command:
$ java -jar snpEff.jar download -v GRCh37.75
If you are running SnpEff from a directory other than the one where it was installed, you will have to specify where the config file is. This is done using the '-c' command line option:
$ java -Xmx4g -jar snpEff.jar download -c path/to/snpEff/snpEff.config -v GRCh37.75
snpEff.config
.
In order to tell SnpEff that there is a new genome available, you must update SnpEff's configuration file snpEff.config.
You must add a new genome entry to snpEff.config.
If your genome, or a chromosome, uses non-standard codon tables you must update snpEff.config accordingly.
A typical case is when you use mitochondrial DNA. Then you specify that chromosome 'MT' uses codon.Invertebrate_Mitochondrial codon table.
Another common case is when you are adding a bacterial genome, then you specify that the codon table is Bacterial_and_Plant_Plastid.
This example shows how to add a new genome to the config files. For this example we'll use the mouse genome (mm37.61):
vi snpEff.config
Add the following lines (you are editing snpEff.config)
# Mouse genome, version mm37.61
mm37.61.genome : Mouse
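If you prefer not to edit the file by hand, the same entry can be appended from the command line. This is a minimal sketch, assuming the config file uses 'GENOME_ID.genome : Description' lines as shown above; the function name add_genome_entry is made up for this example, and the function does nothing if an entry for that genome already exists.

```shell
# Sketch: append a genome entry to a SnpEff config file, unless it is already there.
# Usage: add_genome_entry CONFIG_FILE GENOME_ID DESCRIPTION
add_genome_entry() {
    config="$1"; genome="$2"; desc="$3"
    # Look for an existing 'GENOME_ID.genome :' key (exact match on the key)
    if awk -F'[ \t]*:' -v key="${genome}.genome" \
           '$1 == key { found = 1 } END { exit !found }' "$config"; then
        echo "Entry for '${genome}' already present in ${config}" >&2
        return 0
    fi
    printf '\n# %s genome, version %s\n%s.genome : %s\n' \
        "$desc" "$genome" "$genome" "$desc" >> "$config"
}
```

For the mouse example this would be add_genome_entry snpEff.config mm37.61 Mouse.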
cd /path/to/galaxy
cd tools/snpEffect/
vi snpEffect.xml
Add the following lines to the file
<param name="genomeVersion" type="select" label="Genome">
    <option value="hg37">Human (hg37)</option>
    <option value="mm37.61">Mouse (mm37.61)</option>
</param>
Codon tables are provided in the snpEff.config configuration file under the section codon.Name_of_your_codon_table.
The format is a comma separated list of CODON/AMINO_ACID.
E.g.:
codon.Invertebrate_Mitochondrial: TTT/F, TTC/F, TAC/Y, TAA/*, ATG/M+, ATG/M+, ACT/T, ...
Note that codons marked with '*' are STOP codons and codons marked with a '+' are START codons.
To use one of these tables, assign it to a chromosome. E.g. here chromosome 'M' of the fly genome (dm3) uses the Invertebrate_Mitochondrial codon table:
dm3.M.codonTable : Invertebrate_Mitochondrial
...of course, chromosome 'M' is not a real chromosome, it is just a way to mark the sequence as mitochondrial DNA in the reference genome.
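A quick way to double check a codon table entry is to list its START and STOP codons. The following is a minimal sketch, assuming the 'CODON/AMINO_ACID' comma separated format and the '*' and '+' markers described above; the function name codons_marked is hypothetical.

```shell
# Sketch: list the codons carrying a given mark in a codon table line.
# Usage: codons_marked MARK < line     (MARK is '*' for STOP, '+' for START)
codons_marked() {
    mark="$1"
    cut -d : -f 2- \
      | tr ',' '\n' \
      | tr -d ' \t' \
      | awk -F/ -v mark="$mark" 'NF == 2 && index($2, mark) { print $1 }'
}
```

For example, grep '^codon.Invertebrate_Mitochondrial' snpEff.config | codons_marked '*' prints the STOP codons of that table, one per line.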
As we previously mentioned, reference genome information can be in different formats: GTF, GFF, RefSeq or GenBank.
In the following sub-sections, we show how to build a database for each type of genomic information file.
GTF 2.2 files are supported by SnpEff (e.g. ENSEMBL releases genome annotations in this format).
# Create directory for this new genome
cd /path/to/snpEff/data/
mkdir mm37.61
cd mm37.61
# Get annotation files
wget ftp://ftp.ensembl.org/pub/current/gtf/mus_musculus/Mus_musculus.NCBIM37.61.gtf.gz
mv Mus_musculus.NCBIM37.61.gtf.gz genes.gtf.gz
# Get the genome
cd /path/to/snpEff/data/genomes
wget ftp://ftp.ensembl.org/pub/current/fasta/mus_musculus/dna/Mus_musculus.NCBIM37.61.dna.toplevel.fa.gz
mv Mus_musculus.NCBIM37.61.dna.toplevel.fa.gz mm37.61.fa.gz
cd /path/to/snpEff
java -jar snpEff.jar build -gtf22 -v mm37.61
This example shows how to create a database for a new genome using a GFF file (e.g. FlyBase, WormBase and BeeBase release GFF files). For this example we'll use the Drosophila melanogaster genome (dm5.31):
mkdir path/to/snpEff/data/dm5.31
cd path/to/snpEff/data/dm5.31
wget ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r5.31_FB2010_08/gff/dmel-all-r5.31.gff.gz
mv dmel-all-r5.31.gff.gz genes.gff.gz
cd /path/to/snpEff
java -jar snpEff.jar build -gff3 -v dm5.31
This example shows how to create a database for a new genome.
For this example we'll use the Human genome (hg19).
Warning: Using UCSC genome tables is highly discouraged, we recommend you use ENSEMBL versions instead.
Warning: UCSC tables sometimes change for different species.
This means that even if these instructions work for human genome, it might not work for other genomes.
Obviously creating a new parser for each genome is impractical, so working with UCSC genomes is highly discouraged.
We recommend to use ENSEMBL genomes instead.
Warning: UCSC genomes provide only major release version, but NOT sub-versions.
E.g. UCSC's "hg19" has major version 19 but there is no "sub-version", whereas ENSEMBL's GRCh37.70 clearly has major version 37 and minor version 70.
Not providing a minor version means that they might change the database, so two "hg19" genomes may actually be different.
This creates all sorts of consistency problems (e.g. the annotations may not be the same that you see in the UCSC genome browser, even though both of them are 'hg19' version).
Using UCSC genome tables is highly discouraged, we recommend you use ENSEMBL versions instead.
In order to build a genome using UCSC tables, you can follow these instructions:
cd /path/to/snpEff
java -jar snpEff.jar build -refSeq -v hg19
This example shows how to create a database for a new genome. For this example we'll use "Staphylococcus aureus":
cat NC_002505.1.gbk NC_002506.1.gbk > genes.gbk
Add the following entries in the config file:
# Vibrio Cholerae
vibrio.genome : Vibrio Cholerae
vibrio.chromosomes : NC_002505.1, NC_002506.1
vibrio.NC_002505.1.codonTable : Bacterial_and_Plant_Plastid
vibrio.NC_002506.1.codonTable : Bacterial_and_Plant_Plastid
cd /path/to/snpEff
java -jar snpEff.jar build -genbank -v CP000730
This is a full example of how to build the human genome database (using a GTF file from ENSEMBL). It includes support for regulatory features, sanity check, rare amino acids, etc.
# Go to SnpEff's install dir
cd ~/snpeff
# Create database dir
mkdir data/GRCh37.70
cd data/GRCh37.70
# Download annotated genes
wget ftp://ftp.ensembl.org/pub/release-70/gtf/homo_sapiens/Homo_sapiens.GRCh37.70.gtf.gz
mv Homo_sapiens.GRCh37.70.gtf.gz genes.gtf.gz
# Download proteins
# This is used for:
# - "Rare Amino Acid" annotations
# - Sanity check (checking protein predicted from DNA sequences match 'real' proteins)
wget ftp://ftp.ensembl.org/pub/release-70/fasta/homo_sapiens/pep/Homo_sapiens.GRCh37.70.pep.all.fa.gz
mv Homo_sapiens.GRCh37.70.pep.all.fa.gz protein.fa.gz
# Download CDSs
# Note: This is used as "sanity check" (checking that CDSs predicted from gene sequences match 'real' CDSs)
wget ftp://ftp.ensembl.org/pub/release-70/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh37.70.cdna.all.fa.gz
mv Homo_sapiens.GRCh37.70.cdna.all.fa.gz cds.fa.gz
# Download regulatory annotations
wget ftp://ftp.ensembl.org/pub/release-70/regulation/homo_sapiens/AnnotatedFeatures.gff.gz
mv AnnotatedFeatures.gff.gz regulation.gff.gz
# Uncompress
gunzip *.gz
# Download genome
cd ../genomes/
wget ftp://ftp.ensembl.org/pub/release-70/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.70.dna.toplevel.fa.gz
mv Homo_sapiens.GRCh37.70.dna.toplevel.fa.gz GRCh37.70.fa.gz
# Uncompress:
# Why do we need to uncompress?
# Because ENSEMBL compresses files using a block compress gzip which is not compatible with Java's library Gunzip
gunzip GRCh37.70.fa.gz
# Edit snpEff.config file
#
# WARNING! You must do this yourself. Just copying and pasting this into a terminal won't work.
#
# Add lines:
# GRCh37.70.genome : Homo_sapiens
# GRCh37.70.reference : ftp://ftp.ensembl.org/pub/release-70/gtf/
# Now we are ready to build the database
cd ~/snpeff
java -Xmx20g -jar snpEff.jar build -v GRCh37.70 2>&1 | tee GRCh37.70.build
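Since the build takes a while, it can save time to check that the input files are in place before launching it. This is a minimal sketch, assuming the directory layout used in the example above (data/GENOME/genes.gtf, data/genomes/GENOME.fa and a 'GENOME.genome' entry in snpEff.config); the function name check_build_inputs is hypothetical and the check is not something SnpEff provides.

```shell
# Sketch: check that the files needed by 'snpEff build' are in place.
# Usage: check_build_inputs SNPEFF_DIR GENOME_ID
# Returns 0 if everything was found, 1 otherwise.
check_build_inputs() {
    dir="$1"; genome="$2"; missing=0
    for f in "data/$genome/genes.gtf" "data/genomes/$genome.fa" "snpEff.config"; do
        if [ ! -s "$dir/$f" ]; then          # file must exist and be non-empty
            echo "Missing or empty: $dir/$f" >&2
            missing=1
        fi
    done
    # Escape the dots in the genome ID so grep matches them literally
    pattern=$(printf '%s' "$genome" | sed 's/\./\\./g')
    if [ -f "$dir/snpEff.config" ] && ! grep -q "^${pattern}\.genome" "$dir/snpEff.config"; then
        echo "No '$genome.genome' entry in $dir/snpEff.config" >&2
        missing=1
    fi
    return "$missing"
}
```

For the example above you could run check_build_inputs ~/snpeff GRCh37.70 && java -Xmx20g -jar snpEff.jar build -v GRCh37.70 so that the build only starts when the check passes.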
When I build the database using GFF 3 SnpEff reports that Exons don't have sequences
GFF3 files can have sequence information either in the same file or in a separate fasta file.
In order to add sequence information in the GFF file, you can do this:
    cat annotations.gff > genes.gff
    echo "###" >> genes.gff
    echo "##FASTA" >> genes.gff
    cat sequence.fa >> genes.gff
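The same concatenation can be scripted when it has to run inside a larger pipeline. This is a minimal Python sketch, not part of SnpEff; the function name `embed_fasta_in_gff` and the file names in the usage note are illustrative assumptions.

```python
import shutil


def embed_fasta_in_gff(gff_path, fasta_path, out_path):
    """Write the GFF annotations, then '###' and '##FASTA', then the sequences.

    Hypothetical helper mirroring the shell commands above; it is not a
    SnpEff API.
    """
    with open(out_path, "w") as out:
        last_line = "\n"
        with open(gff_path) as gff:
            for line in gff:
                out.write(line)
                last_line = line
        # Make sure the marker lines start on a fresh line.
        if not last_line.endswith("\n"):
            out.write("\n")
        out.write("###\n")
        out.write("##FASTA\n")
        with open(fasta_path) as fasta:
            shutil.copyfileobj(fasta, out)
```

For example, `embed_fasta_in_gff("annotations.gff", "sequence.fa", "genes.gff")` would produce the same `genes.gff` as the commands above.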
When building a database, I get zero protein coding genes
When building a database, snpEff tries to find which transcripts are protein coding. This is done using the 'bioType' information.
The bioType information is not a standard GFF or GTF feature. So I follow ENSEMBL's convention of using the second column ('source') for bioType, as well as the gene_biotype attribute.
If your file was not produced by ENSEMBL, it probably doesn't have this information. This means that snpEff doesn't know which genes are protein coding and which ones are not.
Having no information, snpEff will treat all genes as protein coding (assuming you have '-treatAllAsProteinCoding Auto' option in the command line, which is the default).
So you will get effects as if all genes were protein coding, and you can then filter out the irrelevant genes. Unfortunately, this is the best I can do if there is no 'bioType' information.
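Before building, it can help to check whether a GTF file carries any bioType information at all. The sketch below is an illustration only and is not how SnpEff itself decides: it looks for a `gene_biotype` attribute first and falls back to the 'source' column. The `KNOWN_BIOTYPES` set is an assumed, partial list of ENSEMBL biotype names, and both function names are hypothetical.

```python
import re

# Assumption: a small illustrative subset of ENSEMBL biotype names.
KNOWN_BIOTYPES = {"protein_coding", "pseudogene", "lincRNA", "miRNA", "snRNA", "rRNA"}

BIOTYPE_ATTR = re.compile(r'gene_biotype\s+"([^"]+)"')


def biotype_of(gtf_line):
    """Return the bioType found on one GTF line, or None if there is none."""
    if gtf_line.startswith("#") or not gtf_line.strip():
        return None
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return None
    # Prefer the explicit attribute in the ninth column.
    match = BIOTYPE_ATTR.search(fields[8])
    if match:
        return match.group(1)
    # Fall back to the second column ('source'), following ENSEMBL's convention.
    if fields[1] in KNOWN_BIOTYPES:
        return fields[1]
    return None


def has_biotype_info(gtf_lines):
    """True if at least one line carries recognizable bioType information."""
    return any(biotype_of(line) is not None for line in gtf_lines)
```

If `has_biotype_info` returns False for a file, the "zero protein coding genes" situation described above is the likely outcome.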
When building a database, I get too many warnings
There are plenty of GFF and GTF files that, unfortunately, do not follow the specification.
SnpEff usually complains about this, but tries hard to correct the problems.
So the database may be OK even after you see many warnings.
You can check the database to see if the features (genes, exons, UTRs) have been correctly incorporated, by taking a look at the database:
java -jar snpEff.jar dump myGenome | less
SnpEff supports regulatory and non-coding annotations. In this section we show how to build those databases. As in the previous section, most likely you will never have to do it yourself and can just use available pre-built databases.
There are two ways to add support for regulatory annotations (these are not mutually exclusive, you can use both at the same time):
    cd path/to/snpEff/data/GRCh37.65
    wget ftp://ftp.ensembl.org/pub/release-65/regulation/homo_sapiens/AnnotatedFeatures.gff.gz
    mv AnnotatedFeatures.gff.gz regulation.gff.gz
    cd /path/to/snpEff
    java -Xmx20G -jar snpEff.jar build -v -onlyReg GRCh37.65

The output looks like this:
    Reading regulation elements (GFF)
    Chromosome '11' line: 226964
    Chromosome '12' line: 493780
    ...
    Chromosome '9' line: 4832434
    Chromosome 'X' line: 5054301
    Chromosome 'Y' line: 5166958
    Done
    Total lines : 5176289
    Total annotation count : 3961432
    Percent : 76.5%
    Total annotated length : 3648200193
    Number of cell/annotations : 266
    Saving database 'HeLa-S3' in file '/path/to/snpEff/data/GRCh37.65/regulation_HeLa-S3.bin'
    Saving database 'HepG2' in file '/path/to/snpEff/data/GRCh37.65/regulation_HepG2.bin'
    Saving database 'NHEK' in file '/path/to/snpEff/data/GRCh37.65/regulation_NHEK.bin'
    Saving database 'GM12878' in file '/path/to/snpEff/data/GRCh37.65/regulation_GM12878.bin'
    Saving database 'HUVEC' in file '/path/to/snpEff/data/GRCh37.65/regulation_HUVEC.bin'
    Saving database 'H1ESC' in file '/path/to/snpEff/data/GRCh37.65/regulation_H1ESC.bin'
    Saving database 'CD4' in file '/path/to/snpEff/data/GRCh37.65/regulation_CD4.bin'
    Saving database 'GM06990' in file '/path/to/snpEff/data/GRCh37.65/regulation_GM06990.bin'
    Saving database 'IMR90' in file '/path/to/snpEff/data/GRCh37.65/regulation_IMR90.bin'
    Saving database 'K562' in file '/path/to/snpEff/data/GRCh37.65/regulation_K562.bin'
    Done.

As you can see, annotations for each cell type are saved in different files. This makes it easier to load annotations only for the desired cell types when analyzing data.
This example shows how to create a regulation database for human (GRCh37.65). We assume we have a file called "my_regulation.bed" which has information for H3K9me3 in Pancreatic Islets (for instance, as a result of a ChIP-Seq experiment and peak enrichment analysis).
    cd path/to/snpEff/data/GRCh37.65
    mkdir regulation.bed
    cd regulation.bed
    mv where/ever/your/bed/file/is/my_regulation.bed ./regulation.Pancreatic_Islets.H3K9me3.bed

Note: The name of the file must be 'regulation.CELL_TYPE.ANNOTATION_TYPE.bed'. In this case, 'CELL_TYPE=Pancreatic_Islets' and 'ANNOTATION_TYPE=H3K9me3'.
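The naming rule in the note can be checked before running the build. The following Python sketch is only an illustration; the function name `parse_regulation_bed_name` is hypothetical, and the pattern assumes that neither the cell type nor the annotation type contains a dot.

```python
import re

# Assumption: CELL_TYPE and ANNOTATION_TYPE contain no '.' characters.
REGULATION_BED_NAME = re.compile(
    r"^regulation\.(?P<cell_type>[^.]+)\.(?P<annotation_type>[^.]+)\.bed$"
)


def parse_regulation_bed_name(file_name):
    """Return (cell_type, annotation_type) or raise ValueError on a bad name."""
    match = REGULATION_BED_NAME.match(file_name)
    if match is None:
        raise ValueError(
            "Expected 'regulation.CELL_TYPE.ANNOTATION_TYPE.bed', got: " + file_name
        )
    return match.group("cell_type"), match.group("annotation_type")
```

For the file used in this example, `parse_regulation_bed_name("regulation.Pancreatic_Islets.H3K9me3.bed")` yields the cell type `Pancreatic_Islets` and the annotation type `H3K9me3`.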
    cd /path/to/snpEff
    java -Xmx20G -jar snpEff.jar build -v -onlyReg GRCh37.65

The output looks like this:
    Building database for 'GRCh37.65'
    Reading regulation elements (GFF)
    Cannot read regulation elements form file '/path/to/snpEff/data/GRCh37.65/regulation.gff'
    Directory has 1 bed files and 1 cell types
    Creating consensus for cellType 'Pancreatic_Islets', files: [/path/to/snpEff/data/GRCh37.65/regulation.bed/regulation.Pancreatic_Islets.H3K9me3.bed]
    Reading file '/path/to/snpEff/data/GRCh37.65/regulation.bed/regulation.Pancreatic_Islets.H3K9me3.bed'
    Chromosome '10' line: 5143
    Chromosome '11' line: 8521
    ...
    Chromosome 'X' line: 52481
    Chromosome 'Y' line: 53340
    Done
    Total lines : 53551
    Total annotation count : 53573
    Percent : 100.0%
    Total annotated length : 75489402
    Number of cell/annotations : 1
    Creating consensus for cell type: Pancreatic_Islets
    Sorting: Pancreatic_Islets , size: 53573
    Adding to final consensus
    Final consensus for cell type: Pancreatic_Islets , size: 53549
    Saving database 'Pancreatic_Islets' in file '/path/to/snpEff/data/GRCh37.65/regulation_Pancreatic_Islets.bin'
    Done
    Finishing up

Note: If there are many annotations, they are saved in one binary file for each cell type (i.e. several BED files for different cell types are collapsed together). This makes it easier to load annotations only for the desired cell types when analyzing data.
SnpEff is integrated with Broad Institute's Genome Analysis Toolkit (GATK) and Galaxy projects.
-o gatk
command line option.
The reason for using '-o gatk' is that, even though both GATK and SnpEff use VCF format, SnpEff has recently updated the EFF sub-field format and this might cause some trouble (since GATK still uses the original version).
snpEff/scripts/
directory of the distribution)
    #!/bin/sh

    #-------------------------------------------------------------------------------
    # Files
    #-------------------------------------------------------------------------------

    in=$1                                           # Input VCF file
    eff=`dirname $in`/`basename $in .vcf`.ann.vcf   # SnpEff annotated VCF file
    out=`dirname $in`/`basename $in .vcf`.gatk.vcf  # Output VCF file (annotated by GATK)

    ref=$HOME/snpEff/data/genomes/hg19.fa           # Reference genome file
    dict=`dirname $ref`/`basename $ref .fa`.dict    # Reference genome: Dictionary file

    #-------------------------------------------------------------------------------
    # Path to programs and libraries
    #-------------------------------------------------------------------------------

    gatk=$HOME/tools/gatk/GenomeAnalysisTK.jar
    picard=$HOME/tools/picard/
    snpeff=$HOME/snpEff/snpEff.jar

    #-------------------------------------------------------------------------------
    # Main
    #-------------------------------------------------------------------------------

    # Create genome index file
    echo
    echo "Indexing Genome reference FASTA file: $ref"
    samtools faidx $ref

    # Create dictionary
    echo
    echo "Creating Genome reference dictionary file: $dict"
    java -jar $picard/CreateSequenceDictionary.jar R= $ref O= $dict

    # Annotate
    echo
    echo "Annotate using SnpEff"
    echo " Input file : $in"
    echo " Output file : $eff"
    java -Xmx4G -jar $snpeff -c $HOME/snpEff/snpEff.config -v -o gatk hg19 $in > $eff

    # Use GATK
    echo
    echo "Annotating using GATK's VariantAnnotator:"
    echo " Input file : $in"
    echo " Output file : $out"
    java -Xmx4g -jar $gatk \
        -T VariantAnnotator \
        -R $ref \
        -A SnpEff \
        --variant $in \
        --snpEffFile $eff \
        -L $in \
        -o $out
    $ ~/snpEff/scripts/gatk.sh zzz.vcf

    Indexing Genome reference FASTA file: /home/pcingola/snpEff/data/genomes/hg19.fa

    Creating Genome reference dictionary file: /home/pcingola/snpEff/data/genomes/hg19.dict
    [Fri Apr 12 11:23:12 EDT 2013] net.sf.picard.sam.CreateSequenceDictionary REFERENCE=/home/pcingola/snpEff/data/genomes/hg19.fa OUTPUT=/home/pcingola/snpEff/data/genomes/hg19.dict TRUNCATE_NAMES_AT_WHITESPACE=true NUM_SEQUENCES=2147483647 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
    [Fri Apr 12 11:23:12 EDT 2013] Executing as pcingola@localhost.localdomain on Linux 3.6.11-4.fc16.x86_64 amd64; OpenJDK 64-Bit Server VM 1.6.0_24-b24; Picard version: 1.89(1408)
    [Fri Apr 12 11:23:12 EDT 2013] net.sf.picard.sam.CreateSequenceDictionary done. Elapsed time: 0.00 minutes.
    Runtime.totalMemory()=141164544
    To get help, see http://picard.sourceforge.net/index.shtml#GettingHelp
    Exception in thread "main" net.sf.picard.PicardException: /home/pcingola/snpEff/data/genomes/hg19.dict already exists. Delete this file and try again, or specify a different output file.
        at net.sf.picard.sam.CreateSequenceDictionary.doWork(CreateSequenceDictionary.java:114)
        at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177)
        at net.sf.picard.sam.CreateSequenceDictionary.main(CreateSequenceDictionary.java:93)

    Annotate using SnpEff
     Input file : zzz.vcf
     Output file : ./zzz.ann.vcf
    00:00:00.000 Reading configuration file '/home/pcingola/snpEff/snpEff.config'
    00:00:00.173 done
    00:00:00.173 Reading database for genome version 'hg19' from file '/home/pcingola//snpEff/data/hg19/snpEffectPredictor.bin' (this might take a while)
    00:00:11.860 done
    00:00:11.885 Building interval forest
    00:00:17.755 done.
    00:00:18.391 Genome stats :
    # Genome name : 'Homo_sapiens (USCS)'
    # Genome version : 'hg19'
    # Has protein coding info : true
    # Genes : 25933
    # Protein coding genes : 20652
    # Transcripts : 44253
    # Avg. transcripts per gene : 1.71
    # Protein coding transcripts : 36332
    # Cds : 365442
    # Exons : 429543
    # Exons with sequence : 409789
    # Exons without sequence : 19754
    # Avg. exons per transcript : 9.71
    # Number of chromosomes : 50
    # Chromosomes names [sizes] : '1' [249250621] '2' [243199373] '3' [198022430] '4' [191154276] '5' [180915260] '6' [171115067] '7' [159138663] 'X' [155270560] '8' [146364022] '9' [141213431] '10' [135534747] '11' [135006516] '12' [133851895] '13' [115169878] '14' [107349540] '15' [102531392] '16' [90354753] '17' [81195210] '18' [78077248] '20' [63025520] 'Y' [59373566] '19' [59128983] '22' [51304566] '21' [48129895] '6_ssto_hap7' [4905564] '6_mcf_hap5' [4764535] '6_cox_hap2' [4734611] '6_mann_hap4' [4679971] '6_qbl_hap6' [4609904] '6_dbb_hap3' [4572120] '6_apd_hap1' [4383650] '17_ctg5_hap1' [1574839] '4_ctg9_hap1' [582546] 'Un_gl000220' [156152] '19_gl000209_random' [145745] 'Un_gl000213' [139339] '17_gl000205_random' [119732] 'Un_gl000223' [119730] '4_gl000194_random' [115071] 'Un_gl000228' [114676] 'Un_gl000219' [99642] 'Un_gl000218' [97454] 'Un_gl000211' [93165] 'Un_gl000222' [89310] '4_gl000193_random' [88375] '7_gl000195_random' [86719] '1_gl000192_random' [79327] 'Un_gl000212' [60768] '1_gl000191_random' [50281] 'M' [16571]
    00:00:18.391 Predicting variants
    00:00:20.267 Creating summary file: snpEff_summary.html
    00:00:20.847 Creating genes file: snpEff_genes.txt
    00:00:25.026 done.
    00:00:25.036 Logging
    00:00:26.037 Checking for updates...

    Annotating using GATK's VariantAnnotator:
     Input file : zzz.vcf
     Output file : ./zzz.gatk.vcf
    INFO 11:23:41,316 ArgumentTypeDescriptor - Dynamically determined type of zzz.vcf to be VCF
    INFO 11:23:41,343 HelpFormatter - --------------------------------------------------------------------------------
    INFO 11:23:41,344 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.4-9-g532efad, Compiled 2013/03/19 07:35:36
    INFO 11:23:41,344 HelpFormatter - Copyright (c) 2010 The Broad Institute
    INFO 11:23:41,344 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
    INFO 11:23:41,347 HelpFormatter - Program Args: -T VariantAnnotator -R /home/pcingola/snpEff/data/genomes/hg19.fa -A SnpEff --variant zzz.vcf --snpEffFile ./zzz.ann.vcf -L zzz.vcf -o ./zzz.gatk.vcf
    INFO 11:23:41,347 HelpFormatter - Date/Time: 2013/04/12 11:23:41
    INFO 11:23:41,348 HelpFormatter - --------------------------------------------------------------------------------
    INFO 11:23:41,348 HelpFormatter - --------------------------------------------------------------------------------
    INFO 11:23:41,353 ArgumentTypeDescriptor - Dynamically determined type of zzz.vcf to be VCF
    INFO 11:23:41,356 ArgumentTypeDescriptor - Dynamically determined type of ./zzz.ann.vcf to be VCF
    INFO 11:23:41,399 GenomeAnalysisEngine - Strictness is SILENT
    INFO 11:23:41,466 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
    INFO 11:23:41,480 RMDTrackBuilder - Loading Tribble index from disk for file zzz.vcf
    INFO 11:23:41,503 RMDTrackBuilder - Loading Tribble index from disk for file ./zzz.ann.vcf
    WARN 11:23:41,505 RMDTrackBuilder - Index file /data/pcingola/Documents/projects/snpEff/gatk_test/./zzz.ann.vcf.idx is out of date (index older than input file), deleting and updating the index file
    INFO 11:23:41,506 RMDTrackBuilder - Creating Tribble index in memory for file ./zzz.ann.vcf
    INFO 11:23:41,914 RMDTrackBuilder - Writing Tribble index to disk for file /data/pcingola/Documents/projects/snpEff/gatk_test/./zzz.ann.vcf.idx
    INFO 11:23:42,076 IntervalUtils - Processing 33411 bp from intervals
    INFO 11:23:42,125 GenomeAnalysisEngine - Creating shard strategy for 0 BAM files
    INFO 11:23:42,134 GenomeAnalysisEngine - Done creating shard strategy
    INFO 11:23:42,134 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
    INFO 11:23:42,135 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining
    INFO 11:23:49,268 VariantAnnotator - Processed 9966 loci.
    INFO 11:23:49,280 ProgressMeter - done 3.34e+04 7.0 s 3.6 m 100.0% 7.0 s 0.0 s
    INFO 11:23:49,280 ProgressMeter - Total runtime 7.15 secs, 0.12 min, 0.00 hours
    INFO 11:23:49,953 GATKRunReport - Uploaded run statistics report to AWS S3
galaxy/*.xml
files provided in the main distribution.
    # Set variable to snpEff install dir (we only use it for this install script)
    export snpEffDir="$HOME/snpEff"

    # Go to your galaxy 'tools' dir
    cd galaxy-dist/tools

    # Create a directory and copy the XML config files from SnpEff's distribution
    mkdir snpEff
    cd snpEff/
    cp $snpEffDir/galaxy/* .

    # Create links to JAR files
    ln -s $snpEffDir/snpEff.jar
    ln -s $snpEffDir/SnpSift.jar

    # Link to config file
    ln -s $snpEffDir/snpEff.config

    # Allow scripts execution
    chmod a+x *.{pl,sh}

    # Copy genomes information
    cd ../..
    cp $snpEffDir/galaxy/tool-data/snpEff_genomes.loc tool-data/

    # Edit Galaxy's tool_conf.xml and add all the tools
    vi tool_conf.xml

    -------------------- Begin: Edit tool_conf.xml --------------------
    <!--
        Add this section to tool_conf.xml file in your galaxy distribution
        Note: The following lines should be added at the end of the file, right before "</toolbox>" line
    -->
    <section name="snpEff tools" id="snpEff_tools">
        <tool file="snpEff/snpEff.xml" />
        <tool file="snpEff/snpEff_download.xml" />
        <tool file="snpEff/snpSift_annotate.xml" />
        <tool file="snpEff/snpSift_caseControl.xml" />
        <tool file="snpEff/snpSift_filter.xml" />
        <tool file="snpEff/snpSift_int.xml" />
    </section>
    -------------------- End: Edit tool_conf.xml --------------------

    # Run galaxy and check that the new menus appear
    ./run.sh
SnpEff provides several other commands and utilities that can be useful for genomic data analysis.
Most of this manual is dedicated to the SnpEff eff and SnpEff build commands, which annotate effects and build databases, respectively.
Here we describe all the other commands and some scripts provided, that are useful for genomic data analysis.
Annotates using the closest genomic region (e.g. exon, transcript ID, gene name) and distance in bases.
Example:
    $ java -Xmx4g -jar snpEff.jar closest GRCh37.66 test.vcf
    ##INFO=<ID=CLOSEST,Number=4,Type=String,Description="Closest exon: Distance (bases), exons Id, transcript Id, gene name">
    1 12078 . G A 25.69 PASS AC=2;AF=0.048;CLOSEST=0,exon_1_11869_12227,ENST00000456328,DDX11L1
    1 16097 . T G 42.42 PASS AC=9;AF=0.0113;CLOSEST=150,exon_1_15796_15947,ENST00000423562,WASH7P
    1 40261 . C A 366.26 PASS AC=30;AF=0.484;CLOSEST=4180,exon_1_35721_36081,ENST00000417324,FAM138A
    1 63880 . C T 82.13 PASS AC=10;AF=0.0400;CLOSEST=0,exon_1_62948_63887,ENST00000492842,OR4G11P

For instance, in the third line (1:16097 T G), it added the tag CLOSEST=150,exon_1_15796_15947,ENST00000423562,WASH7P, which means that the variant is 150 bases away from exon "exon_1_15796_15947".
The exon belongs to transcript "ENST00000423562" of gene "WASH7P".
    $ snpeff closest -bed GRCh37.66 test.bed
    1 12077 12078 line_1;0,exon_1_11869_12227,ENST00000456328,DDX11L1
    1 16096 16097 line_2;150,exon_1_15796_15947,ENST00000423562,WASH7P
    1 40260 40261 line_3;4180,exon_1_35721_36081,ENST00000417324,FAM138A
    1 63879 63880 line_4;0,exon_1_62948_63887,ENST00000492842,OR4G11P
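When post-processing the annotated VCF, the CLOSEST tag can be unpacked into its four sub-fields (distance, exon ID, transcript ID, gene name), following the `##INFO` header shown above. This is a minimal Python sketch; the `Closest` tuple and the `parse_closest` function are illustrative names, not part of SnpEff.

```python
from typing import NamedTuple, Optional


class Closest(NamedTuple):
    distance: int
    exon_id: str
    transcript_id: str
    gene_name: str


def parse_closest(info: str) -> Optional[Closest]:
    """Extract the CLOSEST tag from a VCF INFO string, or return None."""
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        if key == "CLOSEST" and sep:
            # The header declares Number=4, so exactly four values are expected.
            distance, exon_id, transcript_id, gene_name = value.split(",")
            return Closest(int(distance), exon_id, transcript_id, gene_name)
    return None
```

Applied to the INFO column of the 1:16097 variant, this gives a distance of 150, exon `exon_1_15796_15947`, transcript `ENST00000423562` and gene `WASH7P`.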
As the name suggests, the snpEff count command counts how many reads and bases from a BAM file hit a gene, transcript, exon, intron, etc.
Input files can be in BAM, SAM, VCF, BED or BigBed formats.
A summary HTML file with charts is generated. Here are some examples:
If you need to count how many reads (and bases) from a BAM file hit each genomic region, you can use the 'count' utility.
The command line is quite simple. E.g. in order to count how many reads (from N BAM files) hit regions of the human genome, you simply run:
java -Xmx4g -jar snpEff.jar count GRCh37.68 readsFile_1.bam readsFile_2.bam ... readsFile_N.bam > countReads.txt
    chr start end type IDs Reads:readsFile_1.bam Bases:readsFile_1.bam Reads:readsFile_2.bam Bases:readsFile_2.bam ...
    1 1 11873 Intergenic DDX11L1 130 6631 50 2544
    1 1 249250621 Chromosome 1 2527754 251120400 2969569 328173439
    1 6874 11873 Upstream NR_046018;DDX11L1 130 6631 50 2544
    1 9362 14361 Downstream NR_024540;WASH7P 243 13702 182 9279
    1 11874 12227 Exon exon_1;NR_046018;DDX11L1 4 116 2 102
    1 11874 14408 Gene DDX11L1 114 7121 135 6792
    1 11874 14408 Transcript NR_046018;DDX11L1 114 7121 135 6792
    1 12228 12229 SpliceSiteDonor exon_1;NR_046018;DDX11L1 3 6 0 0
    1 12228 12612 Intron intron_1;NR_046018;DDX11L1 13 649 0 0
    1 12611 12612 SpliceSiteAcceptor exon_2;NR_046018;DDX11L1 0 0 0 0
    1 12613 12721 Exon exon_2;NR_046018;DDX11L1 3 24 1 51
    1 12722 12723 SpliceSiteDonor exon_2;NR_046018;DDX11L1 3 6 0 0
    1 12722 13220 Intron intron_2;NR_046018;DDX11L1 22 2110 20 987
    1 13219 13220 SpliceSiteAcceptor exon_3;NR_046018;DDX11L1 5 10 1 2
    1 13221 14408 Exon exon_3;NR_046018;DDX11L1 82 4222 113 5652
    1 14362 14829 Exon exon_11;NR_024540;WASH7P 37 1830 7 357
    1 14362 29370 Transcript NR_024540;WASH7P 704 37262 524 34377
    1 14362 29370 Gene WASH7P 704 37262 524 34377
    1 14409 19408 Downstream NR_046018;DDX11L1 122 7633 39 4254

The columns are:
-p
, you can calculate p-values based on a Binomial model.
For example (output edited for the sake of brevity):
    $ java -Xmx4g -jar snpEff.jar count -v BDGP5.69 fly.bam > countReads.txt
    00:00:00.000 Reading configuration file 'snpEff.config'
    ...
    00:00:12.148 Calculating probability model for read length 50
    ...
    type p.binomial reads.fly expected.fly pvalue.fly
    Chromosome 1.0 205215 205215 1.0
    Downstream 0.29531659795589793 59082 60603 1.0
    Exon 0.2030262729897713 53461 41664 0.0
    Gene 0.49282883664487515 110475 101136 0.0
    Intergenic 0.33995644860241336 54081 69764 0.9999999963234701
    Intron 0.3431415343615103 72308 70418 9.06236369003514E-19
    RareAminoAcid 9.245222303207472E-7 3 0 9.879186871519377E-4
    SpliceSiteAcceptor 0.014623209601955131 3142 3001 0.005099810118785825
    SpliceSiteDonor 0.015279075154423956 2998 3135 0.9937690786738507
    Transcript 0.49282883664487515 110475 101136 0.0
    Upstream 0.31499087549896493 64181 64641 0.9856950416729887
    Utr3prime 0.03495370828296416 8850 7173 1.1734134297889064E-84
    Utr5prime 0.02765432673262785 8146 5675 7.908406840800345E-215
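To make the table easier to read, the sketch below reproduces the arithmetic under two assumptions that are inferred from the numbers above rather than taken from SnpEff's source: the expected count is the total number of reads times `p.binomial`, and the p-value is the upper tail of a Binomial distribution (the probability of seeing at least the observed number of reads). The function names are hypothetical.

```python
import math


def expected_reads(total_reads, p):
    """Expected number of reads hitting a marker type (assumed: n * p, rounded)."""
    return round(total_reads * p)


def log_binomial_pmf(k, n, p):
    """Natural log of the Binomial(n, p) probability mass at k, for 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def upper_tail_pvalue(reads, total_reads, p):
    """P(X >= reads) for X ~ Binomial(total_reads, p).

    Assumption: an upper-tail test; the exact definition used by SnpEff
    may differ in detail.
    """
    if reads <= 0 or p >= 1.0:
        return 1.0
    if p <= 0.0:
        return 0.0
    total = sum(math.exp(log_binomial_pmf(k, total_reads, p))
                for k in range(reads, total_reads + 1))
    return min(1.0, total)
```

With the 205215 reads of the example, `expected_reads(205215, 0.2030262729897713)` gives 41664, matching the 'Exon' row. The summation is linear in the number of reads, which is fine for a quick check but not tuned for speed.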
-i file.bed
command line option.
The option can be used multiple times, thus allowing multiple BED files to be added.
java -Xmx4g -jar snpEff.jar count -i peaks.bed GRCh37.68 reads.bam
This command provides a list of configured databases, i.e. those available in the snpEff.config file.
Example:
    $ java -jar snpEff.jar databases
    Genome Organism Database download link
    ------ -------- ----------------------
    ADWO01 Prevotella bryantii B14 http://sourceforge.net/projects/snpeff/files/databases/v3.2/snpEff_v3.2_Prevotella bryantii B14.zip
    Acholeplasma_laidlawii_PG_8A_uid58901 Acholeplasma_laidlawii_PG_8A_uid58901 http://sourceforge.net/projects/snpeff/files/databases/v3.2/snpEff_v3.2_Acholeplasma_laidlawii_PG_8A_uid58901.zip
    Achromobacter_xylosoxidans_A8_uid59899 Achromobacter_xylosoxidans_A8_uid59899 http://sourceforge.net/projects/snpeff/files/databases/v3.2/snpEff_v3.2_Achromobacter_xylosoxidans_A8_uid59899.zip
    Acidaminococcus_fermentans_DSM_20731_uid43471 Acidaminococcus_fermentans_DSM_20731_uid43471 http://sourceforge.net/projects/snpeff/files/databases/v3.2/snpEff_v3.2_Acidaminococcus_fermentans_DSM_20731_uid43471.zip
    Acidaminococcus_intestini_RyC_MR95_uid74445 Acidaminococcus_intestini_RyC_MR95_uid74445 http://sourceforge.net/projects/snpeff/files/databases/v3.2/snpEff_v3.2_Acidaminococcus_intestini_RyC_MR95_uid74445.zip
    Acidianus_hospitalis_W1_uid66875 Acidianus_hospitalis_W1_uid66875 http://sourceforge.net/projects/snpeff/files/databases/v3.2/snpEff_v3.2_Acidianus_hospitalis_W1_uid66875.zip
    Acidilobus_saccharovorans_345_15_uid51395 Acidilobus_saccharovorans_345_15_uid51395 http://sourceforge.net/projects/snpeff/files/databases/v3.2/snpEff_v3.2_Acidilobus_saccharovorans_345_15_uid51395.zip
    Acidimicrobium_ferrooxidans_DSM_10331_uid59215 Acidimicrobium_ferrooxidans_DSM_10331_uid59215 http://sourceforge.net/projects/snpeff/files/databases/v3.2/snpEff_v3.2_Acidimicrobium_ferrooxidans_DSM_10331_uid59215.zip
    Acidiphilium_cryptum_JF_5_uid58447 Acidiphilium_cryptum_JF_5_uid58447 http://sourceforge.net/projects/snpeff/files/databases/v3.2/snpEff_v3.2_Acidiphilium_cryptum_JF_5_uid58447.zip
    Acidiphilium_multivorum_AIU301_uid63345 Acidiphilium_multivorum_AIU301_uid63345 http://sourceforge.net/projects/snpeff/files/databases/v3.2/snpEff_v3.2_Acidiphilium_multivorum_AIU301_uid63345.zip
    Acidithiobacillus_caldus_SM_1_uid70791 Acidithiobacillus_caldus_SM_1_uid70791 http://sourceforge.net/projects/snpeff/files/databases/v3.2/snpEff_v3.2_Acidithiobacillus_caldus_SM_1_uid70791.zip
    Acidithiobacillus_ferrivorans_SS3_uid67387 Acidithiobacillus_ferrivorans_SS3_uid67387 http://sourceforge.net/projects/snpeff/files/databases/v3.2/snpEff_v3.2_Acidithiobacillus_ferrivorans_SS3_uid67387.zip
    Acidithiobacillus_ferrooxidans_ATCC_23270_uid57649 Acidithiobacillus_ferrooxidans_ATCC_23270_uid57649 http://sourceforge.net/projects/snpeff/files/databases/v3.2/snpEff_v3.2_Acidithiobacillus_ferrooxidans_ATCC_23270_uid57649.zip
    Acidithiobacillus_ferrooxidans_ATCC_53993_uid58613 Acidithiobacillus_ferrooxidans_ATCC_53993_uid58613 http://sourceforge.net/projects/snpeff/files/databases/v3.2/snpEff_v3.2_Acidithiobacillus_ferrooxidans_ATCC_53993_uid58613.zip
    ...
    ...
This command downloads and installs a database.
Note that the database must be configured in snpEff.config and available at the download site.
Example: Download and install the C. elegans genome
    $ java -jar snpEff.jar download -v WBcel215.69
    00:00:00.000 Downloading database for 'WBcel215.69'
    00:00:00.002 Connecting to http://downloads.sourceforge.net/project/snpeff/databases/v3_1/snpEff_v3_1_WBcel215.69.zip
    00:00:00.547 Copying file (type: application/zip, modified on: Sat Dec 01 20:59:55 EST 2012)
    00:00:00.547 Local file name: 'snpEff_v3_1_WBcel215.69.zip'
    00:00:01.949 Downloaded 1049506 bytes
    00:00:03.624 Downloaded 2135266 bytes
    00:00:05.067 Downloaded 3185026 bytes
    00:00:06.472 Downloaded 4234786 bytes
    00:00:07.877 Downloaded 5284546 bytes
    00:00:09.580 Downloaded 6374626 bytes
    00:00:11.005 Downloaded 7424386 bytes
    00:00:12.410 Downloaded 8474146 bytes
    00:00:13.815 Downloaded 9523906 bytes
    00:00:15.358 Downloaded 10604226 bytes
    00:00:16.761 Downloaded 11653666 bytes
    00:00:18.168 Downloaded 12703426 bytes
    00:00:19.573 Downloaded 13753186 bytes
    00:00:21.198 Downloaded 14837506 bytes
    00:00:22.624 Downloaded 15887266 bytes
    00:00:24.029 Downloaded 16937026 bytes
    00:00:25.434 Downloaded 17986786 bytes
    00:00:26.864 Downloaded 19036546 bytes
    00:00:28.269 Downloaded 20086306 bytes
    00:00:29.155 Donwload finished. Total 20748168 bytes.
    00:00:29.156 Local file name: '/home/pcingola//snpEff/data/WBcel215.69/snpEffectPredictor.bin'
    00:00:29.156 Extracting file 'data/WBcel215.69/snpEffectPredictor.bin' to '/home/pcingola//snpEff/data/WBcel215.69/snpEffectPredictor.bin'
    00:00:29.157 Creating local directory: '/home/pcingola/snpEff/data/WBcel215.69'
    00:00:29.424 Unzip: OK
    00:00:29.424 Done
Dump the contents of a database to a text file, a BED file or a tab separated TXT file (that can be loaded into R).
BED file example:
    $ java -jar snpEff.jar download -v GRCh37.70
    $ java -Xmx4g -jar snpEff.jar dump -v -bed GRCh37.70 > GRCh37.70.bed
    00:00:00.000 Reading database for genome 'GRCh37.70' (this might take a while)
    00:00:32.476 done
    00:00:32.477 Building interval forest
    00:00:45.928 Done.

The output file looks like a typical BED file (chr \t start \t end \t name).
    $ head GRCh37.70.bed
    1 0 249250621 Chromosome_1
    1 111833483 111863188 Gene_ENSG00000134216
    1 111853089 111863002 Transcript_ENST00000489524
    1 111861741 111861861 Cds_CDS_1_111861742_111861861
    1 111861948 111862090 Cds_CDS_1_111861949_111862090
    1 111860607 111860731 Cds_CDS_1_111860608_111860731
    1 111861114 111861300 Cds_CDS_1_111861115_111861300
    1 111860305 111860427 Cds_CDS_1_111860306_111860427
    1 111862834 111863002 Cds_CDS_1_111862835_111863002
    1 111853089 111853114 Utr5prime_exon_1_111853090_111853114

TXT file example:
    $ java -Xmx4g -jar snpEff.jar dump -v -txt GRCh37.70 > GRCh37.70.txt
    00:00:00.000 Reading database for genome 'GRCh37.70' (this might take a while)
    00:00:31.961 done
    00:00:31.962 Building interval forest
    00:00:45.467 Done.

In this case, the output is a tab-separated TXT file with a header line:
    $ head GRCh37.70.txt
    chr start end strand type id geneName geneId numberOfTranscripts canonicalTranscriptLength transcriptId cdsLength numerOfExons exonRank exonSpliceType
    1 1 249250622 +1 Chromosome 1
    1 111833484 111863189 +1 Gene ENSG00000134216 CHIA ENSG00000134216 10 1431
    1 111853090 111863003 +1 Transcript ENST00000489524 CHIA ENSG00000134216 10 1431 ENST00000489524 862 9
    1 111861742 111861862 +1 Cds CDS_1_111861742_111861861 CHIA ENSG00000134216 10 1431 ENST00000489524 862 9
    1 111861949 111862091 +1 Cds CDS_1_111861949_111862090 CHIA ENSG00000134216 10 1431 ENST00000489524 862 9
    1 111853090 111853115 +1 Utr5prime exon_1_111853090_111853114 CHIA ENSG00000134216 10 1431 ENST00000489524 862 9 1 ALTTENATIVE_3SS
    1 111854311 111854341 +1 Utr5prime exon_1_111854311_111854340 CHIA ENSG00000134216 10 1431 ENST00000489524 862 9 2 SKIPPED
    1 111860608 111860732 +1 Exon exon_1_111860608_111860731 CHIA ENSG00000134216 10 1431 ENST00000489524 862 9 5 RETAINED
    1 111853090 111853115 +1 Exon exon_1_111853090_111853114 CHIA ENSG00000134216 10 1431 ENST00000489524 862 9 1 ALTTENATIVE_3SS
    1 111861742 111861862 +1 Exon exon_1_111861742_111861861 CHIA ENSG00000134216 10 1431 ENST00000489524 862 9 7 RETAINED

The format is:
Column | Meaning |
---|---|
chr | Chromosome name |
start | Marker start (one-based coordinate) |
end | Marker end (one-based coordinate) |
strand | Strand (positive or negative) |
type | Type of marker (e.g. exon, transcript, etc.) |
id | ID. E.g. if it's a Gene, then it may be ENSEMBL's gene ID |
geneName | Gene name, if marker is within a gene (exon, transcript, UTR, etc.), empty otherwise (e.g. intergenic) |
geneId | Gene ID, if marker is within a gene |
numberOfTranscripts | Number of transcripts in the gene |
canonicalTranscriptLength | CDS length of canonical transcript. |
transcriptId | Transcript ID, if marker is within a transcript |
cdsLength | CDS length of the transcript |
numerOfExons | Number of exons in this transcript |
exonRank | Exon rank, if marker is an exon |
exonSpliceType | Exon splice type, if marker is an exon |
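The text mentions loading the TXT dump into R; the same file can be read in Python with the standard `csv` module. The sketch below is an illustration under the assumption that the file is tab-separated with the header shown in the table above and that shorter rows (e.g. genes, which have no exon columns) simply omit trailing fields. The function names are hypothetical.

```python
import csv


def read_dump_txt(handle):
    """Yield one dict per marker from the TXT dump, keyed by header names."""
    # restval="" fills in columns that a shorter row does not provide.
    reader = csv.DictReader(handle, delimiter="\t", restval="")
    for row in reader:
        row["start"] = int(row["start"])
        row["end"] = int(row["end"])
        yield row


def markers_of_type(handle, marker_type):
    """Return all markers whose 'type' column equals marker_type (e.g. 'Exon')."""
    return [row for row in read_dump_txt(handle) if row["type"] == marker_type]
```

For instance, `markers_of_type(open("GRCh37.70.txt"), "Exon")` would collect the exon rows, each with its `exonRank` and `exonSpliceType` values.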
Dumps a selected set of genes as BED intervals.
The functionality of this command is a subset of SnpEff dump, so it is likely to be deprecated in the future.
Example:
    $ java -Xmx4g -jar snpEff.jar genes2bed GRCh37.66 DDX11L1 WASH7P
    #chr start end geneName;geneId
    1 11868 14411 DDX11L1;ENSG00000223972
    1 14362 29805 WASH7P;ENSG00000227232
These commands perform SnpEff database sanity checks.
They calculate CDS and protein sequences from a SnpEff database and then compare the results to a FASTA file (having the "correct" sequences).
The commands are invoked automatically when building databases, so there is no need for the user to invoke them manually.
Calculates the genomic length of every type of marker (Gene, Exon, Utr, etc.).
Length is calculated by overlapping all markers and counting the number of bases (e.g. a base is counted as 'Exon' if any exon falls onto that base).
This command also reports the probability of a Binomial model.
Parameter -r num adjusts the model for a read length of 'num' bases. That is, if two markers of the same type are closer than 'num' bases, it joins them by including the bases separating them.
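The joining rule can be pictured as an interval merge. The sketch below is not SnpEff's implementation; it assumes markers are given as inclusive (start, end) pairs on one chromosome, and that "closer than 'num' bases" means fewer than `num` bases lie between the end of one marker and the start of the next. The function names are hypothetical.

```python
def join_close_markers(markers, read_length):
    """Merge markers of one type that overlap or are closer than read_length.

    markers: iterable of (start, end) pairs, inclusive coordinates.
    Assumption: the gap is the number of bases strictly between two markers.
    """
    merged = []
    for start, end in sorted(markers):
        if merged and start - merged[-1][1] - 1 < read_length:
            # Close enough: extend the previous interval over the gap.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


def covered_bases(markers):
    """Total number of bases covered by non-overlapping inclusive intervals."""
    return sum(end - start + 1 for start, end in markers)
```

With `read_length=100`, the markers (1, 10), (50, 60) and (300, 310) become (1, 60) and (300, 310), covering 71 bases instead of the 32 bases covered by the original markers.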
E.g.:
    $ java -Xmx1g -jar snpEff.jar len -r 100 BDGP5.69
    marker size count raw_size raw_count binomial_p
    Cds 22767006 56955 45406378 122117 0.13492635563570918
    Chromosome 168736537 15 168736537 15 1.0
    Downstream 49570138 5373 254095562 50830 0.29377240330587084
    Exon 31275946 61419 63230008 138474 0.18535372691689175
    Gene 82599213 11659 87017182 15222 0.4895158717166277
    Intergenic 56792611 11637 56792611 11650 0.3365756581812509
    Intron 55813748 42701 168836797 113059 0.33077452573297744
    SpliceSiteAcceptor 97977 48983 226118 113059 5.806507691929223E-4
    SpliceSiteDonor 101996 50981 226118 113059 6.044689657225808E-4
    Transcript 82599213 11659 232066805 25415 0.4895158717166277
    Upstream 52874082 5658 254044876 50830 0.3133528928592389
    Utr3prime 5264120 13087 10828991 24324 0.031197274126824114
    Utr5prime 3729197 19324 6368070 33755 0.02210070839607192

Column meaning:
Several scripts are provided in the scripts directory.
Here we briefly describe their functionality:
Script | Functionality |
---|---|
sam2fastq.pl | Convert a SAM input to a FASTQ output. Example:
    samtools view test.bam | ./scripts/bam2fastq.pl | head -n 12
    @HWI-ST1220:76:D12CHACXX:7:2207:18986:95756
    CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATT
    +
    CCCFFFFFGHHHHIIJIJJIJIJJIJJIIIJIIHIJJJIJJIJJJJJJIJJ
    @HWI-ST1220:76:D12CHACXX:7:2206:4721:162268
    ATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATT
    +
    @@@DDD>DAB;?DGGEGGBCD>BFGI?FCFFBFGG@ |
fasta2tab.pl |
Convert a fasta file to a two column tab-separated TXT file (name \t sequence)
Example (output truncated for brevity)
    $ zcat ce6.fa.gz | ./scripts/fasta2tab.pl
    chrI gcctaagcctaagcctaagcctaagcctaagcctaagcctaagcct...
    chrV GAATTcctaagcctaagcctaagcctaagcctaagcctaagcctaa...
    chrII cctaagcctaagcctaagcctaagcctaagcctaagcctaagccta...
    chrM CAGTAAATAGTTTAATAAAAATATAGCATTTGGGTTGCTAAGATAT...
    chrX ctaagcctaagcctaagcctaagcctaagcctaagcctaagcctaa...
    chrIV cctaagcctaagcctaagcctaagcctaagcctaagcctaagccta...
    chrIII cctaagcctaagcctaagcctaagcctaagcctaagcctaagccta... |
fastaSplit.pl | Split a multiple sequence FASTA file into individual files.
Example: Creates one file per chromosome
    $ zcat ce6.fa.gz | ./scripts/fastaSplit.pl
    Writing to chrI.fa
    Writing to chrV.fa
    Writing to chrII.fa
    Writing to chrM.fa
    Writing to chrX.fa
    Writing to chrIV.fa
    Writing to chrIII.fa |
hist.pl | Given a list of numbers (one per line), shows a histogram. Note: It requires R.
Example: Extract the file sizes in a directory and show a histogram
    $ ls -al scripts/ | tr -s " " | cut -f 5 -d " " | ./scripts/hist.pl
This creates a histogram plot. |
maPlot.pl, plot.pl, qqplot.pl, smoothScatter.pl |
Similar to 'hist.pl', these create plots based on input from STDIN.
Note that in some cases, inputs are expected to be probabilities (qqplot.pl) or pairs of numbers (maPlot.pl).
Example:
$ ls -al scripts/ | tr -s " " | cut -f 5 -d " " | ./scripts/plot.pl
This creates a plot of the file sizes. |
queue.pl | Process a list of statements in parallel according to the number of CPUs available in the local machine |
splitChr.pl | Splits a file by chromosome. Works on any tab-separated file whose first column is the chromosome (CHR) field (e.g. BED, VCF, etc.).
Example:
$ cat large_test.vcf | ./scripts/splitChr.pl
Input line 28. Creating file 'chr1.txt'
Input line 13332. Creating file 'chr2.txt'
Input line 22097. Creating file 'chr3.txt'
Input line 29289. Creating file 'chr4.txt'
Input line 34236. Creating file 'chr5.txt'
Input line 39899. Creating file 'chr6.txt'
Input line 47120. Creating file 'chr7.txt'
Input line 53371. Creating file 'chr8.txt'
Input line 57810. Creating file 'chr9.txt'
Input line 63005. Creating file 'chr10.txt'
Input line 68080. Creating file 'chr11.txt'
Input line 76629. Creating file 'chr12.txt'
Input line 83071. Creating file 'chr13.txt'
Input line 85124. Creating file 'chr14.txt'
Input line 89281. Creating file 'chr15.txt'
Input line 93215. Creating file 'chr16.txt'
Input line 99081. Creating file 'chr17.txt'
Input line 106405. Creating file 'chr18.txt'
Input line 108330. Creating file 'chr19.txt'
Input line 118568. Creating file 'chr20.txt'
Input line 121795. Creating file 'chr21.txt'
Input line 123428. Creating file 'chr22.txt'
Input line 126520. Creating file 'chrX.txt'
Input line 129094. Creating file 'chrY.txt'
Input line 129113. Creating file 'chrMT.txt' |
uniqCount.pl | Counts how many times each unique line occurs. It's the same as doing cat lines.tst | sort | uniq -c, but much faster. Particularly useful for very large inputs. |
vcfEffOnePerLine.pl | Splits EFF fields in a VCF file, creating multiple lines, each one with only one effect.
Very useful for filtering with SnpSift. Example: |
$ cat test.stop.vcf
1 897062 . C T 100.0 PASS AC=1;EFF=STOP_GAINED(HIGH|NONSENSE|Cag/Tag|Q141*|642|KLHL17||CODING|NM_198317|),UPSTREAM(MODIFIER||||576|PLEKHN1||CODING|NM_001160184|),UPSTREAM(MODIFIER||||611|PLEKHN1||CODING|NM_032129|),UPSTREAM(MODIFIER||||749|NOC2L||CODING|NM_015658|)

$ cat test.stop.vcf | ./scripts/vcfEffOnePerLine.pl
1 897062 . C T 100.0 PASS AC=1;EFF=STOP_GAINED(HIGH|NONSENSE|Cag/Tag|Q141*|642|KLHL17||CODING|NM_198317|)
1 897062 . C T 100.0 PASS AC=1;EFF=UPSTREAM(MODIFIER||||576|PLEKHN1||CODING|NM_001160184|)
1 897062 . C T 100.0 PASS AC=1;EFF=UPSTREAM(MODIFIER||||611|PLEKHN1||CODING|NM_032129|)
1 897062 . C T 100.0 PASS AC=1;EFF=UPSTREAM(MODIFIER||||749|NOC2L||CODING|NM_015658|)
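
To make the transformation in the vcfEffOnePerLine.pl example easier to follow, here is a minimal Python sketch of the same idea. It is not the script's actual implementation: the function name `split_eff_lines` is hypothetical, and the sketch assumes tab-separated VCF lines, an `EFF=` entry in the INFO column (column 8), and effects separated by commas with no commas inside an individual effect.

```python
def split_eff_lines(vcf_line):
    """Return a list of VCF lines, one per effect in the EFF INFO entry.

    Hypothetical helper for illustration only; header lines and lines
    without an EFF entry are returned unchanged.
    """
    line = vcf_line.rstrip("\n")
    if line.startswith("#"):
        return [line]

    fields = line.split("\t")
    if len(fields) < 8:
        return [line]

    # INFO is the 8th column; its entries are separated by ';'
    info_items = fields[7].split(";")
    eff_index = None
    for i, item in enumerate(info_items):
        if item.startswith("EFF="):
            eff_index = i
            break
    if eff_index is None:
        return [line]

    # Assumption: effects are comma-separated and contain no commas themselves
    effects = info_items[eff_index][len("EFF="):].split(",")

    result = []
    for effect in effects:
        new_info = list(info_items)
        new_info[eff_index] = "EFF=" + effect
        new_fields = list(fields)
        new_fields[7] = ";".join(new_info)
        result.append("\t".join(new_fields))
    return result


# Shortened, made-up effect strings, used only to show the shape of the output
example = "1\t897062\t.\tC\tT\t100.0\tPASS\tAC=1;EFF=STOP_GAINED(HIGH|KLHL17),UPSTREAM(MODIFIER|NOC2L)"
for out_line in split_eff_lines(example):
    print(out_line)
```

Each output line keeps every column and every other INFO entry (here `AC=1`) unchanged, so downstream filters see one effect per line.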