Filter and manipulate annotated files
Documentation for version 4.0
SnpSift is a toolbox that allows you to filter and manipulate annotated files.
Once your genomic variants have been annotated, you need to filter them out in order to find the "interesting / relevant variants". Given the large data files, this is not a trivial task (e.g. you cannot load all the variants into XLS spreadsheet). SnpSift helps to perform this VCF file manipulation and filtering required at this stage in data processing pipelines.
SnpSift is part of SnpEff main distribution, so please click on here and follow the instructions on how to download and install SnpEff.
SnpSift is a collection of tools to manipulate VCF (variant call format) files.
Some examples of what you can do:
Filter | You can filter using arbitrary expressions, for instance "(QUAL > 30) | (exists INDEL) | ( countHet() > 2 )". The actual expressions can be quite complex, so it allows for a lot of flexibility. |
Annotate & AnnMem | You can add 'ID' and INFO fields from another "VCF database" (e.g. typically dbSnp database in VCF format). |
CaseControl | You can compare how many variants are in 'case' and in 'control' groups. Also calculates p-values (Fisher exact test). |
Intervals | Filter variants that intersect with intervals. |
Intervals (intidx) | Filter variants that intersect with intervals. Index the VCF file using memory mapped I/O to speed up the search. This is intended for huge VCF files and a small number of intervals to retrieve. |
Join | Join by generic genomic regions (intersecting or closest). |
RmRefGen | Remove reference genotype (i.e. replace '0/0' genotypes by '.') |
TsTv | Calculate transition to transversion ratio. |
Extract fields | Extract fields from a VCF file to a TXT (tab separated) format. |
Variant type | Adds SNP/MNP/INS/DEL to info field. It also adds "HOM/HET" if there is only one sample. |
GWAS Catalog | Annotate using GWAS Catalog. |
DbNSFP | Annotate using dbNSFP: The dbNSFP is an integrated database of functional predictions from multiple algorithms (SIFT, Polyphen2, LRT and MutationTaster, PhyloP and GERP++, etc.) |
SplitChr | Split a VCF file by chromosome |
In order to cite SnpSift, please use the following reference:
"Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift", Cingolani, P., et. al., Frontiers in Genetics, 3, 2012.BibTex entry:
@article{cingolani2012using, title={Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift}, author={Cingolani, P. and Patel, V.M. and Coon, M. and Nguyen, T. and Land, S.J. and Ruden, D.M. and Lu, X.}, journal={Frontiers in Genetics}, volume={3}, year={2012}, publisher={Frontiers Media SA} }
The project is hosted at SourceForge.
Here is the SVN command to check out the development version of the code:svn co https://snpeff.svn.sourceforge.net/svnroot/snpeff/SnpSift/trunk
SnpSift filter is one of the most useful SnpSift commands. Using SnpSift filter you can filter VCF files using arbitrary expressions, for instance "(QUAL > 30) | (exists INDEL) | ( countHet() > 2 )". The actual expressions can be quite complex, so it allows for a lot of flexibility.
Some examples for the impatient:
cat variants.vcf | java -jar SnpSift.jar filter " ( QUAL >= 30 )" > filtered.vcf
cat variants.vcf | java -jar SnpSift.jar filter "(( exists INDEL ) & (QUAL >= 20)) | (QUAL >= 30 )" > filtered.vcf
cat variants.vcf | java -jar SnpSift.jar filter "(countHom() > 3) | (( exists INDEL ) & (QUAL >= 20)) | (QUAL >= 30 )" > filtered.vcf
cat variants.vcf | java -jar SnpSift.jar filter "((countHet() > 0) && (DP >= 25)) | (countHom() > 3) | (( exists INDEL ) & (QUAL >= 20)) | (QUAL >= 30 )" > filtered.vcf
cat variants.vcf | java -jar SnpSift.jar filter "isHom( GEN[0] ) & isVariant( GEN[0] ) & isRef( GEN[1] )" > filtered.vcf
cat variants.vcf | java -jar SnpSift.jar filter --set my_rs.txt "ID in SET[0]" > filtered.vcfand the file my_rs.txt has one string per line, e.g.:
rs58108140 rs71262674 rs71262673
"( CHROM = 'chr1' )"
"( POS > 123456 ) & ( POS < 654321 )"
"(exists ID) & ( ID =~ 'rs' )"
"( REF = 'A' )"
"( ALT = 'T' )"
"( QUAL > 30 )"
"( na FILTER ) | (FILTER = 'PASS')"
( DP > 10 ) & ( AF1 = 0 )
"( CI95[0] > 0.1 ) & (CI95[1] <= 0.3)"
ANY or *: If you use 'ANY' as index, the expression will be true if any field satisfies the expression.
So, for instance, the following expressions
"( CI95[ANY] > 0.1 )"or
"( CI95[*] > 0.1 )"are equivalent to (in this case, there are only two values in the array)
"( CI95[0] > 0.1 ) | ( CI95[1] > 0.1 )"
ALL or ?: If you use 'ALL' as index, the expression will be true if all field satisfy the expression.
So, for instance, the following expressions
"( CI95[ALL] > 0.1 )"
"( CI95[?] > 0.1 )"are equivalent to (in this case, there are only two values in the array)
"( CI95[0] > 0.1 ) & ( CI95[1] > 0.1 )"
"( GEN[0].GQ > 60 ) & ( GEN[1].GQ > 90 )"You may use an asterisk to represent 'ANY' field
"( GEN[*].GQ > 60 )"
"( GEN[0].PL[2] = 0 )"You may use an asterisk to represent 'ANY' field
"( GEN[0].PL[*] = 0 )"...or even
"( GEN[*].PL[*] = 0 )"
snpSift_filter_sample_to_number.pl
converts sample names to sample numbers.
$ ./scripts/snpSift_filter_sample_to_number.pl test.1KG.vcf "( GEN[HG00096].GQ > 60 ) & ( GEN[HG00097].GQ > 90 )" ( GEN[0].GQ > 60 ) & ( GEN[1].GQ > 90 )You can use the resulting expression in SnpSift filter without having to worry about different versions of the VCF files having different sample order.
"( ID in SET[2] )"
EFF
sub-fields are:
"( EFF[0].EFFECT = 'NON_SYNONYMOUS_CODING' )"...but this probably doesn't make much sense. What you may really want are lines where ANY effect is NON_SYNONYMOUS:
"( EFF[*].EFFECT = 'NON_SYNONYMOUS_CODING' )"May be you want only the ones that affect gene 'TCF7L2'
"( EFF[*].EFFECT = 'NON_SYNONYMOUS_CODING' ) & ( EFF[*].GENE = 'TCF7L2' )"
LOF
and NMD
sub-fields are available:
$cat test.snpeff.vcf | java -Xmx1G -jar SnpSift.jar filter "(exists LOF[*].PERC) & (LOF[*].PERC > 0.9)"We assume that 'test.snpeff.vcf' was annotated with SnpEff using '-lof' command line option.
The following operators and functions are interpreted by SnpSift filter
:
Operand | Description | Data type | Example |
---|---|---|---|
= | Equality test | FLOAT, INT or STRING | (REF = 'A') |
> | Greater than | FLOAT or INT | (DP > 20) |
≥ | Greater or equal than | FLOAT or INT | (DP ≥ 20) |
< | Less than | FLOAT or INT | (DP < 20) |
≤ | Less or equal than | FLOAT or INT | (DP ≤ 20) |
=~ | Match regular expression | STRING | (REL =~ 'AC') |
!~ | Does not match regular expression | STRING | (REL !~ 'AC') |
& | AND operator | Boolean | (DP > 20) & (REF = 'A') |
| | OR operator | Boolean | (DP > 20) | (REF = 'A') |
! | NOT operator | Boolean | ! (DP > 20) |
exists | The variable exists (not missing) | Any | (exists INDEL) |
Function | Description | Data type | Example |
countHom() | Count number of homozygous genotypes | No arguments | (countHom() > 0) |
countHet() | Count number of heterozygous genotypes | No arguments | (countHet() > 2) |
countVariant() | Count number of genotypes that are variants (i.e. not reference 0/0) | No arguments | (countVariant() > 5) |
countRef() | Count number of genotypes that are NOT variants (i.e. reference 0/0) | No arguments | (countRef() < 1) |
Genotype Function | Description | Data type | Example |
isHom | Is homozygous genotype? | Genotype | isHom( GEN[0] ) |
isHet | Is heterozygous genotype? | Genotype | isHet( GEN[0] ) |
isVariant | Is genotype a variant? (i.e. not reference 0/0) | Genotype | isVariant( GEN[0] ) |
isRef | Is genotype a reference? (i.e. 0/0) | Genotype | isRef( GEN[0] ) |
Annotate using fields from another VCF file (e.g. dbSnp or 1000 genomes projects).
This is typically used to annotate IDs and INFO fields from a 'database' VCF file (e.g. dbSnp). Here is an example:java -jar SnpSift.jar annotate dbSnp132.vcf variants.vcf > variants_annotated.vcf
Important:
SnpSift annotate
command assumes that both the database and the input VCF files are sorted by position. Chromosome sort order can differ between files.
SnpSift annotateMem
command allows unsorted VCF files, but it loads the entire 'database' VCF file into memory (which may not be practical for large 'database' VCF files).
Note:
Example 1: Annotating ID
from dbSnp:
$ cat test.chr22.vcf #CHROM POS ID REF ALT QUAL FILTER INFO 22 16157571 . T G 0.0 FAIL NS=53 22 16346045 . T C 0.0 FAIL NS=244 22 16350245 . C A 0.0 FAIL NS=192 22 17054103 . G A 0.0 PASS NS=404 22 17071906 . A T 0.0 PASS NS=464 22 17072347 . C T 0.0 PASS NS=464 22 17072394 . C G 0.0 PASS NS=463 22 17072411 . G T 0.0 PASS NS=464 $ java -jar SnpSift.jar annotate -id db/dbSnp/dbSnp137.20120616.vcf test.chr22.vcf #CHROM POS ID REF ALT QUAL FILTER INFO 22 16157571 . T G 0.0 FAIL NS=53 22 16346045 rs56234788 T C 0.0 FAIL NS=244 22 16350245 rs2905295 C A 0.0 FAIL NS=192 22 17054103 rs4008588 G A 0.0 PASS NS=404 22 17071906 . A T 0.0 PASS NS=464 22 17072347 rs139948519 C T 0.0 PASS NS=464 22 17072394 . C G 0.0 PASS NS=463 22 17072411 rs41277596 G T 0.0 PASS NS=464Example 2: Annotatating
ID
and all INFO
fields from dbSnp (VCF headers not shown for brevity):
$ cat test.chr22.vcf #CHROM POS ID REF ALT QUAL FILTER INFO 22 16157571 . T G 0.0 FAIL NS=53 22 16346045 . T C 0.0 FAIL NS=244 22 16350245 . C A 0.0 FAIL NS=192 22 17054103 . G A 0.0 PASS NS=404 22 17071906 . A T 0.0 PASS NS=464 22 17072347 . C T 0.0 PASS NS=464 22 17072394 . C G 0.0 PASS NS=463 22 17072411 . G T 0.0 PASS NS=464 $ java -jar SnpSift.jar annotate db/dbSnp/dbSnp137.20120616.vcf test.chr22.vcf #CHROM POS ID REF ALT QUAL FILTER INFO 22 16157571 . T G 0.0 FAIL NS=53 22 16346045 rs56234788 T C 0.0 FAIL NS=244;RSPOS=16346045;GMAF=0.162248628884826;dbSNPBuildID=129;SSR=0;SAO=0;VP=050100000000000100000100;WGT=0;VC=SNV;SLO;GNO 22 16350245 rs2905295 C A 0.0 FAIL NS=192;RSPOS=16350245;GMAF=0.230804387568556;dbSNPBuildID=101;SSR=1;SAO=0;VP=050000000000000100000140;WGT=0;VC=SNV;GNO 22 17054103 rs4008588 G A 0.0 PASS NS=404;RSPOS=17054103;GMAF=0.123400365630713;dbSNPBuildID=108;SSR=0;SAO=0;VP=050100000000070010000100;WGT=0;VC=SNV;SLO;VLD;G5A;G5;KGPilot123 22 17071906 . A T 0.0 PASS NS=464 22 17072347 rs139948519 C T 0.0 PASS NS=464;RSPOS=17072347;dbSNPBuildID=134;SSR=0;SAO=0;VP=050200000004040010000100;WGT=0;VC=SNV;S3D;ASP;VLD;KGPilot123 22 17072394 . C G 0.0 PASS NS=463 22 17072411 rs41277596 G T 0.0 PASS NS=464;RSPOS=17072411;GMAF=0.00411334552102377;dbSNPBuildID=127;SSR=0;SAO=0;VP=050200000008040010000100;GENEINFO=CCT8L2:150160;WGT=0;VC=SNV;S3D;CFL;VLD;KGPilot123
Allows you to count how many samples are in 'case' and 'control' groups.
This command counts the number of 'homozygous', 'heterozygous' and 'total' variants in a case and control groups and performs some basic pValue calculation using Fisher exact test and Cochran-Armitage test.
Case and Control groups can be defined either by a command line string or a TFAM file (see PLINK's documentation).
Case/Control command line string containing plus and minus symbols {'+', '-', '0'} where '+' is case, '-' is control and '0' is neutral (ignored).
E.g. We have ten samples, which means ten genotype columns in the VCF file.
The first four are 'cases', the fifth one is 'neutral', and the last five are 'control'.
So the description string would be "++++0-----" (note that the following output has been edited, only counts are shown, no pValues):
$ java -jar SnpSift.jar caseControl "++++0-----" cc.vcf #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample_01 Sample_02 Sample_03 Sample_04 Sample_05 Sample_06 Sample_07 Sample_08 Sample_09 Sample_10 1 69496 . G A . PASS AF=0.01;Cases=1,2,4;Controls=2,2,6 GT 0/1 1/1 1/0 0/0 0/0 0/1 1/1 1/1 1/0 0/0Cases genotypes are samples 1 to 4 : 0/1, 1/1, 1/0 and 0/0. So there are 1 homozygous, 2 heterozygous, and a total of 4 variants (2 * 1 + 1 * 2 = 4). Thus the annotation is
Cases=1,2,4
Controls=2,2,6
-tfam
command line option to specify a TFAM file.
Case, control from are read from phenotype field of a TFAM file (6th column).
Phenotype order in TFAM files do not need to match VCF sample order (sample IDs are used).
Phenotype column should be coded as {0,1,2} meaning {Missing, Control, Case} respectively.
See PLINK's reference for details about TFAM file format.
-name nameString
command line option to add name to the INFO tags.
This can be used to count different case/control groups in the same dataset (e.g. multiple phenotypes)
$ java -jar SnpSift.jar caseControl -name "_MY_GROUP" "++++0-----" cc.vcf \ | java -jar SnpSift.jar caseControl -name "_ANOTHER_GROUP" "+-+-+-+-+-" - #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample_01 Sample_02 Sample_03 Sample_04 Sample_05 Sample_06 Sample_07 Sample_08 Sample_09 Sample_10 1 69496 . G A . PASS AF=0.01;Cases_MY_GROUP=1,2,4;Controls_MY_GROUP=2,2,6;Cases_ANOTHER_GROUP=1,3,5;Controls_ANOTHER_GROUP=2,1,5 GT 0/1 1/1 1/0 0/0 0/0 0/1 1/1 1/1 1/0 0/0
CC_DOM
): A 2 by 2 contingency table is created
Alt (A/a + a/a) | Ref (A/A) | |
---|---|---|
Cases | N11 | N12 |
Controls | N21 | N22 |
CC_REC
): A 2 by 2 contingency table is created
Alt (a/a) | Ref + Het (A/A + A/a) | |
---|---|---|
Cases | N11 | N12 |
Controls | N21 | N22 |
CC_ALL
): A 2 by 2 contingency table is created
Variants | References | |
---|---|---|
Cases | N11 | N12 |
Controls | N21 | N22 |
CC_GENO
): A 2 by 3 contingency table is created
A/A | a/A | a/a | |
---|---|---|---|
Cases | N11 | N12 | N13 |
Controls | N21 | N22 | N23 |
Weight | 0.0 | 1.0 | 2.0 |
This is used to extract variants that intersect any interval.
You must provide intervals as BED files.
Command line options:
E.g.:
cat variants.vcf | java -jar SnpSift.jar intervals my_intervals.bed > variants_intersecting_intervals.vcf
This is used to extract variants that intersect any interval.
This is similar to "SnpSift intervals", but intended for huge VCF files, and relatively small number of intervals.
This command indexes the VCF file, thus is optimized for huge VCF files.
You must provide intervals as BED files.
BED format is tab separated zero-based coordinates "chr \t start \t end " (for this application, all other fields in the BED file are ignored).
You can use command line option '-if 1' if you want one-based coordinates.
E.g.:
java -jar SnpSift.jar intidx variants.vcf my_intervals.bed > variants_intersecting_intervals.vcfYou can also have genomic coordinate in the command line. Note that in this case, coordinates are assumed to be one-based (instead of zero-based, like in BED files):
java -jar SnpSift.jar intidx -c variants.vcf chr1:12345-23456 chr2:3456789-4567890 > variants_intersecting_intervals.vcf
Join files by genomic regions (i.e. chr:start-end).
Files can be generic TXT (tab separated), VCF or BED.Usage: java -jar SnpSift.jar join [options] file1 file2 Note: It is assumed that both files fit in memory. Options: -if1 <num> : Offset for file1 (e.g. 1 if coordinates are one-based. Default: 1 -if2 <num> : Offset for file2 (e.g. 2 if coordinates are one-based. Default: 1 -cols1 <colDef> : Column definition for file 1. Format: chrCol,startCol,endCol (e.g. '1,2,3'). Shortcuts 'bed' or 'vcf' are allowed. Default: 'vcf -cols2 <colDef> : Column definition for file 2. Format: chrCol,startCol,endCol (e.g. '1,2,3'). Shortcuts 'bed' or 'vcf' are allowed. Default: 'vcf -all : For each interval, show all intersecting. Default: show only one (the largest intersection) -closest : Show closest intervals in file2 if none intersect. Default: off -empty : Show intervals in file1 even if they do not intersect with any other interval. Default: offExample: Join two bed files, showing intersecting or closest intervals
java -Xmx2G -jar SnpSift.jar join -v -cols1 bed -cols2 bed -closest file1.bed file2.bedExample: Join one bed file and another file having chr:start-end in columns 7,8 and 9 respectively. Showing intervals form file1 that do not intersect any interval from file2
java -Xmx2G -jar SnpSift.jar join -v -cols1 bed -cols2 7,8,9 -empty file.bed my_weird_file.txt
Remove reference genotypes.
Replaces genotype information for non-variant samples.E.g. If you have this file:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT M1 M2 X1 X2 2L 426906 . C G 53.30 . DP=169 GT:PL:GQ 0/1:7,0,255:4 0/1:7,0,255:4 0/0:0,0,0:6 0/0:0,30,255:35 2L 601171 . C A 999.00 . DP=154 GT:PL:GQ 0/1:81,0,141:78 0/1:42,0,251:39 0/0:0,0,0:4 0/0:0,33,255:36 2L 648611 . A T 999.00 . DP=225 GT:PL:GQ 0/1:52,0,42:47 1/1:75,21,0:14 0/0:0,0,0:3 0/0:0,60,255:61 2L 807373 . A G 106.00 . DP=349 GT:PL:GQ 0/1:14,0,65:12 0/1:60,0,42:50 0/0:0,0,0:4 0/0:0,69,255:72 2L 816766 . G T 999.00 . DP=411 GT:PL:GQ 0/1:108,0,45:53 0/1:7,0,255:6 0/0:0,0,0:4 0/0:0,57,255:59You can run:
cat file.vcf | java -jar SnpSift.jar rmRefGen > file_noref.vcfand you get this (notice the last two columns, that had '0/0' genotype):
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT M1 M2 X1 X2 2L 426906 . C G 53.30 . DP=169 GT:PL:GQ 0/1:7,0,255:4 0/1:7,0,255:4 . . 2L 601171 . C A 999.00 . DP=154 GT:PL:GQ 0/1:81,0,141:78 0/1:42,0,251:39 . . 2L 648611 . A T 999.00 . DP=225 GT:PL:GQ 0/1:52,0,42:47 1/1:75,21,0:14 . . 2L 807373 . A G 106.00 . DP=349 GT:PL:GQ 0/1:14,0,65:12 0/1:60,0,42:50 . . 2L 816766 . G T 999.00 . DP=411 GT:PL:GQ 0/1:108,0,45:53 0/1:7,0,255:6 . .
Calculate transition vs transversion ratios for each sample.
Usage example:$ java -jar SnpSift.jar tstv hom s.vcf Sample : 1 2 3 4 5 6 7 8 9 10 11 12 Total Transitions : 150488 150464 158752 156674 152936 160356 152276 155314 156484 149276 151182 153468 1847670 Transversions : 70878 70358 73688 72434 70828 76150 72030 71958 72960 69348 70180 71688 862500 Ts/Tv : 2.123 2.139 2.154 2.163 2.159 2.106 2.114 2.158 2.145 2.153 2.154 2.141 2.142
Extract fields from a VCF file to a TXT, tab separated format, that you can easily load in R, XLS, etc.
You can also use sub-fields and genotype fields / sub-fields such as:
vcfEffOnePerLine.pl
script with SnpSift extractFields
if you want to have each effect in a seprarate line.
-s
to specify multiple field separator and -e
to specify how to represent empty fields.
Example 1: Extracting chromosome, position, ID and allele frequency from a VCF file:
$ java -jar SnpSift.jar extractFields s.vcf CHROM POS ID AF | head #CHROM POS ID AF 1 69134 0.086 1 69496 rs150690004 0.001 1 69511 rs75062661 0.983 1 69569 0.538 1 721559 0.001 1 721757 0.011 1 846854 rs111957712 0.003 1 865584 rs148711625 0.001 1 865625 rs146327803 0.001
Example 2: Extracting genotype fields
$ java -jar SnpSift.jar extractFields file.vcf CHROM" "POS" "ID" "THETA" "GEN[0].GL[1]" "GEN[1].GL" "GEN[3].GL[*]" "GEN[*].GT"This means to extract:
#CHROM POS ID THETA GEN[0].GL[1] GEN[1].GL GEN[3].GL[*] GEN[*].GT 1 10583 rs58108140 0.0046 -0.47 -0.24,-0.44,-1.16 -0.48 -0.48 -0.48 0|0 0|0 0|0 0|1 0|0 0|1 0|0 0|0 0|1 1 10611 rs189107123 0.0077 -0.48 -0.24,-0.44,-1.16 -0.48 -0.48 -0.48 0|0 0|1 0|0 0|0 0|0 0|0 0|0 0|0 0|0 1 13302 rs180734498 0.0048 -0.58 -2.45,-0.00,-5.00 -0.48 -0.48 -0.48 0|0 0|1 0|0 0|0 0|0 1|0 0|0 0|1 0|0 1 13327 rs144762171 0.0204 -1.11 -1.97,-0.01,-2.51 -0.48 -0.48 -0.48 0|0 0|1 0|0 0|0 0|0 1|0 0|0 0|0 0|0 1 13957 rs201747181 0.0100 0 0,0,0 0 0 0 0|0 0|1 0|0 0|0 0|0 0|0 0|0 0|0 0|0 1 13980 rs151276478 0.0139 -0.48 -0.48,-0.48,-0.48 -0.48 -0.48 -0.48 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 1 30923 rs140337953 0.0162 -0.61 -0.10,-0.69,-2.81 -0.48 -0.48 -0.48 1|1 0|0 0|0 1|1 1|0 0|0 1|1 1|0 1|1 1 46402 rs199681827 0.0121 0 0,0,0 0 0 0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 1 47190 rs200430748 0.0153 0 0,0,0 0 0 0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0
-s
to specify multiple field separator and -e
to specify how to represent empty fields.
$ java -jar SnpSift.jar extractFields -s "," -e "." test.chr22.eff.vcf CHROM POS REF ALT "EFF[*].EFFECT" "EFF[*].AA"
-s ","
, and we use "." for empty fiels (option -e "."
).
The results is:
#CHROM POS REF ALT EFF[*].EFFECT EFF[*].AA 22 16346045 T C DOWNSTREAM,INTRON,SPLICE_SITE_DONOR .,.,. 22 16350245 C A INTRON,SPLICE_SITE_ACCEPTOR .,. 22 17054103 G A INTRON,SPLICE_SITE_DONOR .,. 22 17071906 A T DOWNSTREAM,NON_SYNONYMOUS_CODING .,L512Q 22 17072347 C T DOWNSTREAM,STOP_GAINED .,W365* 22 17072394 C G DOWNSTREAM,NON_SYNONYMOUS_CODING .,R349S 22 17072411 G T DOWNSTREAM,NON_SYNONYMOUS_CODING .,P344T 22 17072483 A G DOWNSTREAM,NON_SYNONYMOUS_CODING .,W320R 22 17072907 G T DOWNSTREAM,NON_SYNONYMOUS_CODING .,D178E 22 17073032 C T DOWNSTREAM,NON_SYNONYMOUS_CODING .,A137T 22 17073124 G T DOWNSTREAM,NON_SYNONYMOUS_CODING .,A106D 22 17073332 A T DOWNSTREAM,NON_SYNONYMOUS_CODING .,L37M
$ java -jar SnpSift.jar extractFields test.snpeff.vcf CHROM POS "EFF[*].EFFECT" #CHROM POS EFF[*].EFFECT 1 10303 DOWNSTREAM INTERGENIC UPSTREAMNote that since that variant has three effects, there are three "EFFECT" columns.
vcfEffOnePerLine.pl
provided with SnpEff distribution
$ cat test.snpeff.vcf | ./scripts/vcfEffOnePerLine.pl | java -jar SnpSift.jar extractFields - CHROM POS "EFF[*].EFFECT" #CHROM POS EFF[*].EFFECT 1 10303 DOWNSTREAM 1 10303 INTERGENIC 1 10303 UPSTREAMNow we obtain one effect per line, while all other parameters in the line are repeated across mutiple lines (e.g. there are three 1:10303 lines).
-
as input file name, which denotes STDIN.
SnpSift extractFields
can get confused if the VCF field has non-alphanumeric charaters in the name (e.g. dbNSFP_GERP++_RS
has two "+" signs).
A quick fix, it so is to change the field names in the VCF file.
Here is an example:
# Change field names in VCF $ cat kath.vcf | sed "s/dbNSFP_GERP++/dbNSFP_GERP/g" > kath.gerp.vcf # Use new names to extract fields $ java -jar SnpSift.jar extractFields kath.gerp.vcf CHROM POS REF ALT dbNSFP_GERP_RS dbNSFP_GERP_NS #CHROM POS REF ALT dbNSFP_GERP_RS 1 142827044 G A 2 132914561 G A 7 151933217 C A 7 151933251 T C 7 151933302 T C 7 151945101 G C -0.892 7 151945167 G T 7 151962176 T A 7 151970672 A T 7 151970856 T A 3.71 18 14183638 G C 18 14183710 A G 18 14542909 G A 18 14543039 T C -0.942
Adds an INFO field denoting variant type.
It adds "SNP/MNP/INS/DEL/MIXED" in the INFO field. It also adds "HOM/HET", but this last one works if there is only one sample (otherwise it doesn't make any sense).
$ java -jar SnpSift.jar varType test.vcf | grep -v "^#" | head 20 10469 . C G 100.0 PASS SNP;HOM GT:AP 0|0:0.075,0.060 20 10492 . C T 100.0 PASS SNP;HET GT:AP 0|1:0.180,0.345 20 10575 . C CG 100.0 PASS DEL;HET GT:AP 0|1:0.000,0.000 20 10611 . CG C 100.0 PASS INS;HET GT:AP 0|1:0.000,0.010 20 10618 . GT TA 100.0 PASS MNP;HET GT:AP 0|1:0.020,0.030
Annotate using GWAS catalog
$ java -jar SnpSift.jar gwasCat gwascatalog.txt test.vcf | tee test.gwas.vcf 1 1005806 rs3934834 C T . PASS AF=0.091;GWASCAT=Body_mass_index 1 2069172 rs425277 C T . PASS AF=0.400;GWASCAT=Height 1 2069681 rs3753242 C T . PASS AF=0.211;GWASCAT=Reasoning 1 2392648 rs2477686 G C . PASS AF=0.745;GWASCAT=Non_obstructive_azoospermia 1 2513216 rs734999 C T . PASS AF=0.547;GWASCAT=Ulcerative_colitis 1 2526746 rs3748816 A G . PASS AF=0.489;GWASCAT=Celiac_disease 1 3083712 rs2651899 T C . PASS AF=0.467;GWASCAT=Migraine 1 3280253 rs6658356 G A . PASS AF=0.070;GWASCAT=Response_to_statin_therapy 1 4315204 rs966321 G T . PASS AF=0.522;GWASCAT=Factor_VII 1 5170712 rs7513590 A G . PASS AF=0.256;GWASCAT=Anthropometric_traits 1 6279370 rs846111 G C . PASS AF=0.153;GWASCAT=QT_interval,QT_interval 1 6631431 rs11587438 C T . PASS AF=0.906;GWASCAT=White_blood_cell_types 1 7879063 rs2797685 C T . PASS AF=0.186;GWASCAT=Crohn_s_disease 1 8021973 rs35675666 G T . PASS AF=0.093;GWASCAT=Ulcerative_colitis 1 8046672 rs12727642 C A . PASS AF=0.101;GWASCAT=Celiac_disease 1 8422676 rs2252865 T C . PASS AF=0.771;GWASCAT=Schizophrenia 1 8526142 rs4908760 G A . PASS AF=0.630;GWASCAT=Vitiligo
The dbNSFP is an integrated database of functional predictions from multiple algorithms (SIFT, Polyphen2, LRT and MutationTaster, PhyloP and GERP++, etc.).
One of the main advantages is that you can annotate using multiple prediction tools with just one command. This allows for faster annotations. Here is the link to dbNSFP database website for more details.
Database: In order to annotate using dbNSFP, you need to download the dbNSFP database and the index file.
dbNSFP is large (several GB) so it might take a while to download it.
The database is compressed (block-gzip) and tabix indexed, so two files are required (the data *.gz file and the *.gz.tbi index file).
Automatic database download: As of SnpSift version 4.0, the dbNSFP databse will be downloaded and stored the first time you use SnpSift dbNSFP
.
You can also download it manually and specify the path to the dbNSFP files using the -db
command line option.
You can download the files from SnpEff's site:
Here is a full example how to perform annotations
# Annotate using dbNSFP # Note that the fisr time you run the command, it will attempt to download the dbNSFP database. java -jar SnpSift.jar dbnsfp -v myFile.vcf > myFile.annotated.vcf
Here is a command line example:
# Donload database and tabix index files (you need to do this only once): # WARNING: The database file is large (over 5Gb) wget http://sourceforge.net/projects/snpeff/files/databases/dbNSFP/dbNSFP2.4.txt.gz wget http://sourceforge.net/projects/snpeff/files/databases/dbNSFP/dbNSFP2.4.txt.gz.tbi
# Annotate using dbNSFP java -jar SnpSift.jar dbnsfp -v -db path/to/my/dbNSFP2.4.txt.gz myFile.vcf > myFile.annotated.vcf
# Download dbNSFP database $ wget http://dbnsfp.houstonbioinformatics.org/dbNSFPzip/dbNSFP2.4.zip # Uncompress $ unzip dbNSFP2.4.zip # Create a single file version $ (head -n 1 dbNSFP2.4_variant.chr1 ; cat dbNSFP2.4_variant.chr* | grep -v "^#" ) > dbNSFP2.4.txt # Compress using block-gzip algorithm bgzip dbNSFP2.4.txt # Create tabix index tabix -s 1 -b 2 -e 2 dbNSFP2.4.txt.gz
Simply split (or join) VCF files. Allows to create one file per chromosome or one file every N lines.
A typical usage for this command is to:SnpSift split huge.vcf
SnpSift split -j huge.000.vcf huge.001.vcf huge.002.vcf ... > huge.out.vcf
.
java -jar SnpSift.jar split myHugeVcf.vcf.gz
Will create files myHugeVcf.1.vcf, myHugeVcf.2.vcf, ... , myHugeVcf.22.vcf, myHugeVcf.X.vcf, myHugeVcf.Y.vcf
You can also specify '-l' command line option to split the file every N lines
E.g.: Split a VCF file every 10,000 lines
java -jar SnpSift.jar split -l 10000 myHugeVcf.vcf.gz
Will create files myHugeVcf.001.vcf, myHugeVcf.002.vcf, ...
VCF header will be added to each file, so resulting files will be more than 10,000 lines.-j
(join) command line option to join a set of VCF files.
java -jar SnpSift.jar split -j huge.000.vcf huge.001.vcf huge.002.vcf ... > huge.out.vcf
Annotate using PhastCons conservation scores.
You must download PhastCons files here
You also need a chromosome size file, which can be created using samtools faidx
, or you can download it from here
Full example.
Most of the example deals with downloading and installing PhastCons database, which is done only once.
The real annotation process is done in the last line.
# Create a dir for PhastCons database cd ~/snpEff mkdir -p db/phastCons/ # Download all PhastCons files cd db/phastCons/ wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr1.phastCons100way.wigFix.gz wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr2.phastCons100way.wigFix.gz wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr3.phastCons100way.wigFix.gz wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr4.phastCons100way.wigFix.gz wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr5.phastCons100way.wigFix.gz wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr6.phastCons100way.wigFix.gz wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr7.phastCons100way.wigFix.gz wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr8.phastCons100way.wigFix.gz wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr9.phastCons100way.wigFix.gz wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr10.phastCons100way.wigFix.gz wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr11.phastCons100way.wigFix.gz wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr12.phastCons100way.wigFix.gz wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr13.phastCons100way.wigFix.gz wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr14.phastCons100way.wigFix.gz wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr15.phastCons100way.wigFix.gz wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr16.phastCons100way.wigFix.gz wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr17.phastCons100way.wigFix.gz wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr18.phastCons100way.wigFix.gz wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr19.phastCons100way.wigFix.gz wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr20.phastCons100way.wigFix.gz wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr21.phastCons100way.wigFix.gz wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr22.phastCons100way.wigFix.gz wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chrM.phastCons100way.wigFix.gz wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chrX.phastCons100way.wigFix.gz wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chrY.phastCons100way.wigFix.gz # Create a chromosome size file and name it "genome.fai" samtools faidx path/to/genome/hg19.fa.gz cp path/to/genome/hg19.fa.gz.fai ./genome.fai # Now we are ready to annotate java -Xmx4g -jar SnpSift.jar phastCons ~/snpEff/db/phastCons file.vcf > file.phastCons.vcf
-bed
command line option.
In the output BED formatted intervals, the score column (fifth column), is the average conservation score of all bases within the interval.
-minScore score
and -extract len
command line options.
For instance, the following command:
java -jar SnpSift.jar phastCons -minScore 0.8 -extract 10 -bed path/to/phastCons/dir input.bedextracts all subintervals from each line in
input.bed
, that has at least 10
bases length and a conservation score of 0.8
Calculate concordance between two VCF files.
This is typically used when you want to calculate concordance between a genotyping experiment and a sequencing experiment.
For instance, you sequenced several samples and, as part of a related experiment or just as quality control, you also genotype the same samples using a genotyping array.
Now you want to compare the two experiments.
Ideally there would be no difference between the variants from genotyping and sequencing, but this is hardly the case in real world.
You can use SnpSift concordance
to measure the differences between the two experiments.
It is assumed that both VCF files are sorted by chromosome and position.
Sample names are defined in '#CHROM' line of the header section.
Concordance is calculated only if sample label matches in both files.
Example:
$ java -Xmx1g -jar SnpSift.jar concordance -v genotype.vcf sequencing.vcf > concordance.txt 00:00:00.000 Indexing file 'genotype.vcf' index: MT 460030998 index: 1 19705 1 / 2 45170805 / 45174315 2 / 3 77052081 / 77055591 3 / 4 104065531 / 104069041 4 / 5 124098372 / 124101881 5 / 6 146535292 / 146538802 6 / 7 184793526 / 184797035 7 / 8 206156508 / 206160018 8 / 9 223072816 / 223076326 9 / 10 242315995 / 242319505 10 / 11 261053789 / 261057299 11 / 12 290190553 / 290194063 12 / 13 312869636 / 312873146 13 / 14 321966539 / 321970049 14 / 15 336131317 / 336134827 15 / 16 350871669 / 350875179 16 / 17 368900523 / 368904032 17 / 18 391305860 / 391309369 18 / 19 398932237 / 398935747 19 / 20 425219198 / 425222708 20 / 21 437022008 / 437025517 21 / 22 442563678 / 442567188 22 / X 451783418 / 451786927 X / Y 459553691 / 459557200 Y / MT 459588787 / 459592296 00:00:01.137 Open VCF file 'genotype.vcf' 00:00:01.141 Open VCF file 'sequencing.vcf' 00:00:01.176 Chromosome: '1' 00:00:02.127 1:1550992 1:1528859 00:00:02.739 1:2426313 1:2389636 ... 00:02:13.780 1:248487058 1:248471945
SnpSif's concordance output is written to STDOUT and two files.
For instance the command java -jar SnpSift.jar concordance -v genotype.vcf sequencing.vcf
will write:
concordance_genotyping_sequencing.by_sample.txt
concordance_genotyping_sequencing.summary.txt
This sections is a table showing concordance details for every entry (chr:position) that both files have in common. E.g.:
chr pos ref alt change_0_0 change_0_1 change_0_2 change_1_0 change_1_1 change_1_2 change_2_0 change_2_1 change_2_2 missing_genotype_genotype missing_genotype_sequencing 1 865584 G A 508 0 0 0 2 0 0 0 0 0 5 1 865625 G A 512 0 0 0 1 0 0 0 0 0 1 1 865628 G A 511 0 0 0 2 0 0 0 0 0 1 1 865665 G A 495 0 0 0 4 0 0 0 0 0 17 1 865694 C T 428 0 0 0 82 0 0 0 4 0 0
This section shows details in the same format as the previous section. Here, concordance metrics are shown aggregated for each sample. E.g.:
# Totals by sample sample change_0_0 change_0_1 change_0_2 change_1_0 change_1_1 change_1_2 change_2_0 change_2_1 change_2_2 missing_genotype_genotype missing_genotype_sequencing ID_003 79 0 0 1 8 0 0 0 2 1 1 ID_004 83 0 0 1 2 0 0 0 5 0 1 ID_005 80 0 0 0 7 0 0 0 4 1 0 ID_006 79 0 0 0 5 0 0 0 6 0 2 ID_008 81 0 0 0 4 0 0 0 4 0 3 ID_009 80 0 0 0 7 0 0 0 3 0 2 ID_012 74 0 0 0 10 0 0 0 1 0 7 ID_013 79 1 0 0 4 0 0 0 5 0 3 ID_018 84 0 0 0 5 0 0 0 3 0 0 ...Summary
Summary file contains overall information and errors. Here is an example of a summary file:
$ cat concordance_genotyping_sequencing.summary.txt Number of samples: 929 File genotype.vcf 583 File sequencing.vcf 514 Both files Errors: ALT field does not match 19The header indicates that one file ('genotype.vcf') has 929 samples, the other file has 583 and there are 514 matching sample IDs in both files.
-v
).
Annotate if a variant is private to a family.
A Private=Family_ID
is added to a variant's INFO field, if the variant is only found in one family.
A TFAM file (see PLINK's documentation) specifies a mapping from sample IDs to family IDs.
E.g.:
$ java -jar SnpSift.jar private pheotypes.tfam imp.ann.vcf > imp.ann.private.vcf
1 1005806 rs3934834 C T . PASS AF=0.091;..;Private=Family_47This indicates that the variant is only found in members of
Family_47
, according to the definitions in pheotypes.tfam
.
Convert from VCF to PLINK's TPED file format.
The vcf2tped
command uses a VCF and a TFAP file as input, creating a TPED and a consolidated TFAM as outputs.
Command line options are:
$ java -jar SnpSift.jar vcf2tped SnpSift version 1.9d (build 2013-04-26), by Pablo Cingolani Usage: java -jar SnpSift.jar vcf2tped [options] file.tfam file.vcf outputName Options: -f : Force. Overwrite new files if they exist. Default: false -onlySnp : Use only SNPs when converting VCF to TPED. Default: false -onlyBiAllelic : Use only bi-allelic variants. Default: false -useMissing : Use entries with missing genotypes (otherwise they are filtered out). Default: false -useMissingRef : Use entries with missing genotypes marking them as 'reference' instead of 'missing'. Default: false Parameters: file.tfam : File with genotypes and groups information (in PLINK's TFAM format) file.vcf : A VCF file (variants and genotype data) outputName : Base name for the new TPED and TFAM files.
vcf2tped
command supports the following features:
-onlyBiAllelic
option filters out non bi-allelic variants.
-onlySnp
option filters out non SNP variants.
-useMissing
uses missing variants in TPED file.
-useMissingRef
Converts missing variants to reference genotype.
This command intersects several intervals files (e.g. BED, BigBed, TXT) and produces a result of all intersections.
A typical usage example is to create a consensus of peaks from several Chip-Seq experiments.
Algorithm: This command creates one interval forest for each input file.
For every interval in all input files, finds all intervals that intersect at least minOverlap
bases (default 1 base).
If there are at least cluster
number of intersecting intervals it creates a consensus interval from the intersections (or union
) of all intervals found.
The consensus interval, if any, is shown as result.
Command line options:
$ java -jar SnpSift.jar intersect SnpSift version 1.9d (build 2013-04-26), by Pablo Cingolani Usage: java -jar SnpSift.jar [options] file_1.bed file_2.bed ... file_N.bed Options: -minOverlap: Minimum number of bases that two intervals have to overlap. Default : 0 -cluster : An interval has to intersect at least 'num' intervals (from other files) to be considered. Default: 0 -intersect : Report the intersection of all intervals. Default: false -union : Report the union of all intervals. Default: true
This command removes INFO fields from a VCF file (i.e. removes annotations)
Removing INFO fields is usually done because you want to re-annotate the VCF file, thus removing old INFO fields in order to add new ones later.
SnpEff & SnpSift only add annotations and do not change current ones.
So, in order to re-annotate a file, you should first remove the old annotations and then re-annotate.
The reason for this behavior is simply because replacing annotation values is considered a bad practice.
Imagine that you have a VCF entry in your re-annotated file having the value "AA=1": How do you know if this is from the old annotations or from the new ones?
This confusion often leads to problems in downstream steps of your pipelines, so it's better to avoid the problem by first removing all the previous annotations and then adding the new ones.
Usage example:
$ cat test.snpeff.vcf #CHROM POS ID REF ALT QUAL FILTER INFO 1 734462 1032 G A . s50 AC=348;EFF=DOWNSTREAM(MODIFIER|||||RP11-206L10.8|processed_transcript|NON_CODING|ENST00000447500||1),INTRON(MODIFIER|||||RP11-206L10.6|processed_transcript|NON_CODING|ENST00000429505|1|1) $ java -jar SnpSift.jar rmInfo test.snpeff.vcf EFF #CHROM POS ID REF ALT QUAL FILTER INFO 1 734462 1032 G A . s50 AC=348
Annotating GeneSets, such as Gene Ontology (GO), KEGG, Reactome, etc.; can be quite useful to find significant variants.
Gene set annotations can be added to a SnpEff annotated file using SnpSift geneSets
command.
The VCF file must be annotated using SnpEff
before perforimg Gene Sets annotations.
This is because we must know which gene the variant affects).
You can download MSigDb from Broad Institute
Usage example:
$ java -jar SnpSift.jar geneSets -v db/msigDb/msigdb.v3.1.symbols.gmt test.eff.vcf > test.eff.geneSets.vcf 00:00:00.000 Reading MSigDb from file: 'db/msigDb/msigdb.v3.1.symbols.gmt' 00:00:01.168 Done. Total: 8513 gene sets 31847 genes 00:00:01.168 Annotating variants from: 'test.eff.vcf' 00:00:01.298 Done. # Summary # gene_set gene_set_size variants # ACEVEDO_METHYLATED_IN_LIVER_CANCER_DN 940 8 # CHR1P36 504 281 # KEGG_OLFACTORY_TRANSDUCTION 389 8 # REACTOME_GPCR_DOWNSTREAM_SIGNALING 805 8 # REACTOME_OLFACTORY_SIGNALING_PATHWAY 328 8 ... # REACTOME_SIGNALING_BY_GPCR 920 8 $ cat test.eff.geneSets.vcf ##INFO=<ID=MSigDb,Number=.,Type=String,Description="Gene set from MSigDB database (GSEA)"> 1 69849 . G A 454.73 PASS AC=33;EFF=STOP_GAINED(HIGH|NONSENSE|tgG/tgA|W253*|305|OR4F5|protein_coding|CODING|ENST00000335137|1|1);MSigDb=ACEVEDO_METHYLATED_IN_LIVER_CANCER_DN,CHR1P36,KEGG_OLFACTORY_TRANSDUCTION,REACTOME_GPCR_DOWNSTREAM_SIGNALING,REACTOME_OLFACTORY_SIGNALING_PATHWAY,REACTOME_SIGNALING_BY_GPCR
Compress genotype calls, reducing the overall size of the VCF file.
This is intended for compressing very large VCF in very large sequencing projects (e.g. thousands of samples).
For instance, we've reduced 1Tb (1,000 Gb) VCF file to roughly 1Gb in a project that has over 10,000 samples.
The underlying idea is quite simple.
In large re-sequenceing projects most of the variants are singletons.
This means that most variants are present in only one of the samples.
For those variants, you have thousands of samples that are homozygous reference (i.e. genotype entry is "0/0") and one that is a variant (e.g. '0/1' or '1/1').
A trivial way to compress these VCF entries is just to state which sample has non-reference information.
Intuitively, this is similar to the way used to represent sparse matrices (only store non-zero elements).
SnpSift gt
creates three INFO fields. These three files are composed of comma separated indexes of samples having:
-u
command line option to uncompress.
$ cat test.vcf #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6 Sample_7 Sample_8 Sample_9 Sample_10 Sample_11 Sample_12 Sample_13 Sample_14 Sample_15 1 861276 . A G . PASS AC=1 GT 0/0 1/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 #--- # Compress genotypes #--- $ java -jar SnpSift.jar gt test.vcf | tee test.gt.vcf ##INFO=<ID=HO,Number=.,Type=Integer,Description="List of sample indexes having homozygous ALT genotypes"> ##INFO=<ID=HE,Number=.,Type=Integer,Description="List of sample indexes having heterozygous ALT genotypes"> ##INFO=<ID=NA,Number=.,Type=Integer,Description="List of sample indexes having missing genotypes"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6 Sample_7 Sample_8 Sample_9 Sample_10 Sample_11 Sample_12 Sample_13 Sample_14 Sample_15 1 861276 . A G . PASS AC=1;HO=1 #--- # Uncompress genotypes (command line option '-u') #--- $ java -jar SnpSift.jar gt -u test.gt.vcf #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6 Sample_7 Sample_8 Sample_9 Sample_10 Sample_11 Sample_12 Sample_13 Sample_14 Sample_15 1 861276 . A G . PASS AC=1 GT 0/0 1/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0
Perform some basic check ups on VCF files to spot common problems.
SnpSift vcfCheck
checks for some common problems where VCF files are not following the specification.
Given that many common VCF problems cause analysis tools and pipelines to behave unexpectedly, this command is intended as a simple debugging tool.
E.g.:
$ java -jar SnpSift.jar vcfCheck bad.vcf WARNING: Malformed VCF entryfile 'bad.vcf', line 7: Entry : 3 148885779 . A ATT,AT 999.0 PASS UK10KWES_AC=0,0;MDV=94 Errors : INFO filed 'UK10KWES_AC' has 'Number=1' in header, but it contains '2' elements. Cannot find header for INFO field 'MDV' WARNING: Malformed VCF entryfile 'bad.vcf', line 14: Entry : 3 148890104 . TCA T . . . Errors : File is not sorted: Position '3:148890104' after position '3:148890105'