Internals
Overview

Parallel
bcbio calculates callable regions following alignment using goleft depth. These regions determine breakpoints for analysis, allowing parallelization by genomic regions during variant calling. Defining non-callable regions lets bcbio select breakpoints for parallelization within chromosomes at positions where splitting will not accidentally impact small variant detection. The callable regions also supplement the variant calls, defining positions where uncalled bases are homozygous reference, as opposed to true no-calls with no evidence. Callable regions plus variant calls are an alternative to gVCF output, which explicitly enumerates reference calls in the output variant file.
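For intuition, here is a hypothetical quantized callable-regions excerpt (the NO_COVERAGE/LOW_COVERAGE/CALLABLE labels come from the mosdepth quantize step shown later in this page); the uncovered gap inside chr1 is the kind of position bcbio can use as a within-chromosome breakpoint without affecting small variant detection:

# hypothetical quantized callable-regions BED
chr1    0         1500000   CALLABLE
chr1    1500000   1500500   NO_COVERAGE    # safe split point for parallelization
chr1    1500500   4000000   CALLABLE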

Figure: Overview of cluster types during parallel execution
Somatic tumor-only variant calling pipeline with UMIs, step by step
Minor steps (such as tabix indexing of VCFs and indexing of BAMs) and details (full paths) are omitted.
bcbio.yaml config:
details:
- algorithm:
    aligner: bwa
    trim_ends: [2,0,2,0]
    min_allele_fraction: 0.01
    correct_umis: /path/umi.whitelist.txt
    tools_off:
    - gemini
    umi_type: fastq_name
    variantcaller:
    - vardict
    coverage: target.bed
    variant_regions: target.bed
  analysis: variant2
  description: samplex
  files:
  - /path/samplex_1.fq.gz
  - /path/samplex_2.fq.gz
  genome_build: hg38
  metadata:
    phenotype: tumor
  resources:
    fgbio:
      options: [--min-reads, 3]
step 1: trim 2 bp from the 5' end of each read (trim_ends: [2,0,2,0]) and index the fastq files (input and output differ by path; full paths are omitted):
bgzip --threads 8 -c <(seqtk trimfq -b 2 -e 0 samplex_1.fq.gz) > samplex_1.fq.gz
bgzip --threads 8 -c <(seqtk trimfq -b 2 -e 0 samplex_2.fq.gz) > samplex_2.fq.gz
grabix index samplex_2.fq.gz
grabix index samplex_1.fq.gz
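For reference, the four trim_ends values map to (read 1 5', read 1 3', read 2 5', read 2 3'); seqtk trimfq -b trims from the 5' end and -e from the 3' end. A hypothetical read before and after:

# before: AGCTACGTACGT
# after:  CTACGTACGT    (-b 2 removed the first two bases; -e 0 left the 3' end intact)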
step 2: alignment with bwa mem, sort and mark duplicates, assign BAM tags (XS=sample, XC=cell, RX=UMI), input: fastq files, output: samplex-sort.bam:
unset JAVA_HOME && \
bwa mem -c 250 -M -t 16 -R '@RG\tID:samplex\tPL:illumina\tPU:samplex\tSM:samplex' \
  -v 1 /path/reference/genomes/Hsapiens/hg38/bwa/hg38.fa samplex_1.fq.gz samplex_2.fq.gz | \
bamsormadup tmpfile=samplex-sort-sorttmp-markdup inputformat=sam threads=16 \
  outputformat=bam level=0 SO=coordinate | \
/path/bcbio/anaconda/envs/python2/bin/python /path/bin/umis bamtag - | \
samtools view -b > samplex-sort.bam
step 3: correct UMIs with fgbio using the whitelist, input: samplex-sort.bam, output: samplex-sort-umis_corrected.bam:
unset JAVA_HOME && \
fgbio -Xms750m -Xmx30g -XX:+UseSerialGC --tmp-dir . --async-io=true --compression=0 \
  CorrectUmis \
  -t XC -m 3 -d 1 -x -U /path/umi.whitelist.txt -i samplex-sort.bam \
  -o samplex-sort-umis_corrected.bam
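Note that in this pipeline the -t XC flag points CorrectUmis (and later GroupReadsByUmi) at the BAM tag holding the raw UMI. A hypothetical before/after record, assuming the observed UMI is within the mismatch (-m 3) and distance (-d 1) limits of whitelist entry ACGTACGT:

# before correction
read1 ... XC:Z:ACGTACGA
# after CorrectUmis rewrites the tag to the matching whitelist UMI
read1 ... XC:Z:ACGTACGT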
step 4: calculate coverage with mosdepth:
mosdepth -t 16 -F 1804 --no-per-base --by target.bed samplex-rawumi \
  samplex-sort-umis_corrected.bam
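With --no-per-base and --by, mosdepth writes summary files next to the prefix; for this step the output set is roughly (exact files depend on the mosdepth version):

samplex-rawumi.mosdepth.global.dist.txt
samplex-rawumi.mosdepth.region.dist.txt
samplex-rawumi.mosdepth.summary.txt
samplex-rawumi.regions.bed.gz    # mean depth per target.bed interval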
step 5: fgbio GroupReadsByUmi, CallDuplexConsensusReads, and FilterConsensusReads with min-reads=3, then bamtofastq:
unset JAVA_HOME && \
fgbio -Xms750m -Xmx30g -XX:+UseSerialGC --tmp-dir . --async-io=true --compression=0 \
  GroupReadsByUmi \
  --edits=1 --min-map-q=1 -t XC -s paired -i samplex-sort-umis_corrected.bam | \
fgbio -Xms750m -Xmx30g -XX:+UseSerialGC --tmp-dir . --async-io=true --compression=0 \
  CallDuplexConsensusReads \
  --min-input-base-quality=2 --sort-order=:none: -i /dev/stdin -o /dev/stdout | \
fgbio -Xms750m -Xmx30g -XX:+UseSerialGC --tmp-dir . --async-io=true --compression=0 \
  FilterConsensusReads --min-reads=3 --min-base-quality=13 --max-base-error-rate=0.1 \
  -r /path/reference/genomes/Hsapiens/hg38/seq/hg38.fa -i /dev/stdin -o /dev/stdout | \
bamtofastq collate=1 T=samplex-sort-umis_corrected-cumi-1-bamtofastq-tmp \
  F=samplex-sort-umis_corrected-cumi-1.fq.gz F2=samplex-sort-umis_corrected-cumi-2.fq.gz \
  tags=cD,cM,cE gz=1
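The tags=cD,cM,cE argument asks bamtofastq to carry the fgbio per-read consensus tags (consensus depth, minimum depth, and error rate) into the fastq read comments; bwa mem -C in the next step then copies those comments back into the realigned BAM. A hypothetical consensus read header after conversion:

@consensus-read-1 cD:i:5 cM:i:3 cE:f:0.01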
step 6: align consensus reads:
unset JAVA_HOME && \
bwa mem -C -c 250 -M -t 16 -R '@RG\tID:samplex\tPL:illumina\tPU:samplex\tSM:samplex' \
  -v 1 /path/reference/genomes/Hsapiens/hg38/bwa/hg38.fa \
  samplex-sort-umis_corrected-cumi-1.fq.gz samplex-sort-umis_corrected-cumi-2.fq.gz | \
samtools sort -@ 16 -m 1G -T samplex-sort-cumi-sorttmp -o samplex-sort-cumi.bam /dev/stdin

samtools index -@ 16 samplex-sort-cumi.bam samplex-sort-cumi.bam.bai
step 7: clean variant_regions bed file:
cat target.bed | grep -v ^track | grep -v ^browser | grep -v ^@ | grep -v ^# | \
  bcbio_python -c 'from bcbio.variation import bedutils; bedutils.remove_bad()' | \
  sort -V -T . -k1,1 -k2,2n > cleaned-target.bed
cat cleaned-target.bed | bgzip --threads 16 -c > cleaned-target.bed.gz
tabix -f -p bed cleaned-target.bed.gz
bedtools merge -i cleaned-target.bed > cleaned-target-merged.bed
cat cleaned-target-merged.bed | bgzip --threads 16 -c > cleaned-target-merged.bed.gz
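The grep chain strips UCSC/Picard-style header lines before sorting; given a hypothetical target.bed like the one below, only the interval line survives:

track name=target description="capture targets"
browser position chr1:100-200
chr1	100	200	region1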
step 8: clean coverage bed file (the same file in this example):
cat target.bed | grep -v ^track | grep -v ^browser | grep -v ^@ | grep -v ^# | \
  iconv -c -f utf-8 -t ascii | sed 's/ //g' | \
  bcbio_python -c 'from bcbio.variation import bedutils; bedutils.remove_bad()' | \
  sort -V -T . -k1,1 -k2,2n > cov-target.bed
cat cov-target.bed | bgzip --threads 16 -c > cov-target.bed.gz
tabix -f -p bed cov-target.bed.gz
bedtools merge -i cov-target.bed > cov-target-merged.bed
cat cov-target-merged.bed | bgzip --threads 16 -c > cov-target-merged.bed.gz
step 9: clean sv regions bed file:
cat cleaned-target.bed | grep -v ^track | grep -v ^browser | grep -v ^@ | grep -v ^# | \
  bcbio_python -c 'from bcbio.variation import bedutils; bedutils.remove_bad()' | \
  sort -V -T . -k1,1 -k2,2n > svregions-cleaned-target.bed
cat svregions-cleaned-target.bed | bgzip --threads 16 -c > svregions-cleaned-target.bed.gz
step 10: calculate coverage for the 3 bed files with mosdepth:
export MOSDEPTH_Q0=NO_COVERAGE && export MOSDEPTH_Q1=LOW_COVERAGE && \
export MOSDEPTH_Q2=CALLABLE && \
mosdepth -t 16 -F 1804 -Q 1 --no-per-base --by cleaned-target.bed \
  --quantize 0:1:4: samplex-variant_regions samplex-sort-cumi.bam

mosdepth -t 16 -F 1804 --no-per-base --by svregions-cleaned-target.bed \
  samplex-sv_regions samplex-sort-cumi.bam

mosdepth -t 16 -F 1804 --no-per-base --by cov-target.bed samplex-coverage \
  samplex-sort-cumi.bam \
  -T 1,5,10,20,50,100,200,500,1000,2000,5000,10000,20000,50000,100000,200000,500000
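The -T/--thresholds list additionally produces samplex-coverage.thresholds.bed.gz, which records, per target region, how many bases are covered at or above each cutoff; the layout is roughly as follows (counts hypothetical):

#chrom	start	end	region	1X	5X	10X	20X	...
chr1	100	200	region1	100	98	95	80	...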
step 11: count reads with hts_nim_tools:
hts_nim_tools count-reads -t 16 -F 1804 /path/samplex/counts/fullgenome.bed \
  samplex-sort-cumi.bam > fullgenome-1804-counts.txt
hts_nim_tools count-reads -t 16 -F 1804 cleaned-target.bed \
  samplex-sort-cumi.bam > cleaned-target-merged-1804-counts.txt
step 12: samtools read statistics:
samtools stats -@ 16 samplex-sort-cumi.bam > samplex.txt
samtools idxstats samplex-sort-cumi.bam > samplex-idxstats.txt
step 13: variant calling with vardict (repeated for each alignment chunk):
unset R_HOME && unset R_LIBS && unset JAVA_HOME && \
export VAR_DICT_OPTS='-Xms750m -Xmx3500m -XX:+UseSerialGC -Djava.io.tmpdir=.' && \
vardict-java -G /path/reference/genomes/Hsapiens/hg38/seq/hg38.fa \
  -N samplex -b samplex-sort-cumi.bam -c 1 -S 2 -E 3 -g 4 --nosv --deldupvar -Q 10 -F 0x700 \
  samplex-chr5_0_x-unmerged-regions-regionlimit.bed -f 0.0025 | \
teststrandbias.R | \
var2vcf_valid.pl -A -N samplex -E -f 0.0025 | grep -v ^##contig | \
bcftools annotate -h samplex-chr5_0_x-contig_header.txt | \
bcftools filter -i 'QUAL >= 0' | \
bcftools filter --soft-filter 'LowFreqBias' --mode '+' \
  -e 'FORMAT/AF[0:0] < 0.02 && FORMAT/VD[0] < 30 && INFO/SBF < 0.1 && INFO/NM >= 2.0' | \
awk -F$'\t' -v OFS='\t' '{if ($0 !~ /^#/) gsub(/[KMRYSWBVHDXkmryswbvhdx]/, "N", $4) } {print}' | \
awk -F$'\t' -v OFS='\t' '{if ($0 !~ /^#/) gsub(/[KMRYSWBVHDXkmryswbvhdx]/, "N", $5) } {print}' | \
awk -F$'\t' -v OFS='\t' '$1!~/^#/ && $4 == $5 {next} {print}' | \
vcfstreamsort | bgzip -c > samplex-chr5_0_x.vcf.gz

zgrep ^# samplex-chr5_0_x.vcf.gz > samplex-chr5_0_x-fixheader-header.vcf

unset JAVA_HOME && \
picard FixVcfHeader HEADER=samplex-chr5_0_x-fixheader-header.vcf \
  INPUT=samplex-chr5_0_x.vcf.gz \
  OUTPUT=samplex-chr5_0_x-fixheader.vcf.gz
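The awk stages normalize ambiguous bases: IUPAC ambiguity codes in REF (column 4) and ALT (column 5) become N, and any record left with REF equal to ALT is dropped. A hypothetical record walked through the three stages:

# input:             chr5  12345  .  R  A  ...
# after gsub on $4:  chr5  12345  .  N  A  ...   (R is IUPAC A/G)
# a record ending up with $4 == $5 (e.g. N and N) is removed by the final awk stage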
step 14: gather per-region vcfs:
unset JAVA_HOME && \
gatk --java-options '-Xms681m -Xmx3181m -XX:+UseSerialGC -Djava.io.tmpdir=.' \
  GatherVcfs -I samplex-files.list -O samplex.vcf.gz
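samplex-files.list holds the per-region chunk VCFs from step 13, plausibly one path per line in genome order, e.g.:

samplex-chr1_0_x-fixheader.vcf.gz
samplex-chr2_0_x-fixheader.vcf.gz
samplex-chr5_0_x-fixheader.vcf.gz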
step 15: annotate with snpEff:
unset JAVA_HOME && \
snpEff -Xms750m -Xmx29g -Djava.io.tmpdir=. eff \
  -dataDir /path/reference/genomes/Hsapiens/hg38/snpeff \
  -hgvs -cancer -noLog -i vcf -o vcf -csvStats samplex-effects-stats.csv \
  -s samplex-effects-stats.html GRCh38.86 samplex.vcf.gz | \
bgzip --threads 16 -c > samplex-effects.vcf.gz
step 16: annotate with vcfanno:
vcfanno -p 16 dbsnp.conf samplex-effects.vcf.gz | \
  bcftools reheader -h samplex-effects-annotated-sample_header.txt | \
  bcftools view | bgzip -c > samplex-effects-annotated.vcf.gz
tabix -f -p vcf samplex-effects-annotated.vcf.gz

vcfanno -p 16 samplex-effects-annotated-annotated-somatic-combine.conf \
  samplex-effects-annotated.vcf.gz | bgzip -c > samplex-effects-annotated-annotated-somatic.vcf.gz
tabix -f -p vcf samplex-effects-annotated-annotated-somatic.vcf.gz

cat samplex-effects-annotated-annotated-somatic-priority.tsv | bgzip --threads 16 -c > \
  samplex-effects-annotated-annotated-somatic-priority.tsv.gz
tabix -f -0 -c '#' -s 1 -b 2 -e 3 samplex-effects-annotated-annotated-somatic-priority.tsv.gz
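dbsnp.conf is a vcfanno TOML configuration. A minimal sketch along the lines of the vcfanno README (the field and op choices here are illustrative, not bcbio's exact template):

[[annotation]]
file = "/path/reference/genomes/Hsapiens/hg38/variation/dbsnp.vcf.gz"
fields = ["ID"]
names = ["rs_ids"]
ops = ["concat"]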
scRNA-seq: Single cell RNA-seq barcode counting pipeline step by step
Minor steps and exact file locations are omitted. * = repeated for every sample in sample_barcodes.csv
bcbio.yaml config:
details:
- algorithm:
    cellular_barcode_correction: 1
    minimum_barcode_depth: 1000
    sample_barcodes: /path/project/config/barcodes.csv
    transcriptome_fasta: /path/genomes/Mmusculus/mm10/rnaseq/ref-transcripts.fa
    transcriptome_gtf: /path/genomes/Mmusculus/mm10/rnaseq/ref-transcripts.gtf
    umi_type: harvard-indrop-v3
  analysis: scRNA-seq
  description: project
  files:
  - /path/project_1.fq.gz
  - /path/project_2.fq.gz
  - /path/project_3.fq.gz
  - /path/project_4.fq.gz
  genome_build: mm10
fc_name: sc-mouse
upload:
  dir: /path/project/final
step 1: parse barcode information from reads 2, 3, and 4 into the fastq read name as CELL_[value]:UMI_[value]:sample_[value]. umis supports many protocols; the downside is speed, as this step can take up to 3-4 days:
python umis fastqtransform \
  --separate_cb umis/harvard-indrop-v3-transform.json --cores 16 \
  project_1.fq.gz project_2.fq.gz project_3.fq.gz project_4.fq.gz | \
seqtk seq -L 20 - | gzip > project.umitransformed.fq.gz
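A hypothetical transformed read, with the two cell barcode halves joined and the UMI and sample barcode appended to the name (the exact layout comes from the transform JSON):

@HWI-ST1000:1:C0PKG:1:1101:1225:2130:CELL_AAGCCTTG-GGATTACA:UMI_ACGTAC:sample_AATTTTT
AGCTAGCTAGCTAGCTAGCT
+
IIIIIIIIIIIIIIIIIIII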
step 2: create a fastq file for each sample:
python umis demultiplex_samples \
  --nedit 1 --barcodes sample_barcodes.csv \
  --out_dir demultiplexed project.umitransformed.fq.gz
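sample_barcodes.csv maps sample barcodes to sample names; assuming a simple barcode,name layout, something like:

AATTTTT,sampleA
GGTTTAA,sampleB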
step 3*: filter reads against the known cellular barcode lists:
python umis cb_filter \
  --cores 16 --bc1 harvard-indrop-v3-cb1.txt.gz --nedit 1 \
  --bc2 harvard-indrop-v3-cb2.txt.gz demultiplexed/[sample-barcodeAATTTTT].fq | \
gzip -c > project-sample_barcode.filtered.fq.gz
step 4*: create cellular barcode histogram:
python umis cb_histogram project-sample_barcode.filtered.fq.gz > cb-histogram.txt
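The histogram is a plain two-column table of cellular barcode and read count (values hypothetical); barcodes falling below minimum_barcode_depth (1000 in this config) are excluded downstream via the --cb_cutoff 1000 filter in step 7:

AAGCCTTG-GGATTACA	152345
AAGCCTTG-GGTAACCT	89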
step 5: create the transcriptome index for rapmap:
rapmap quasiindex -k 31 -i mm10 -t mm10/ref-transcripts.fa
step 6*: align reads with rapmap:
rapmap quasimap -t 16 -i mm10 \
  -r <(gzip -cd project-sample_barcode.filtered.fq.gz) | \
samtools sort -@ 16 -m 1G -T project-sample_barcode-sorttmp \
  -o project-sample_barcode.bam /dev/stdin

samtools index -@ 16 project-sample_barcode.bam project-sample_barcode.bam.bai
step 7*: count transcripts:
python umis fasttagcount --cb_cutoff 1000 \
  --genemap ref-transcripts-tx2gene.tsv --cb_histogram project-sample_barcode/cb-histogram.txt \
  --umi_matrix project-sample_barcode-dupes.mtx.full \
  project-sample_barcode.bam project-sample_barcode.mtx.full

python umis sparse project-sample_barcode.mtx.full project-sample_barcode.mtx

python umis sparse project-sample_barcode-dupes.mtx.full project-sample_barcode-dupes.mtx
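umis sparse rewrites the full counts table in sparse MatrixMarket format; the .mtx output begins roughly like this (dimensions and entries hypothetical):

%%MatrixMarket matrix coordinate integer general
25000 4000 1830000
1 1 5
1 7 2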
step 8: concatenate all cb-histogram-filtered.txt files:
cat project-[all-barcodes]/cb-histogram-filtered.txt > cb-histogram.txt