Bioinformatic Analysis pipeLine for SomAtic MutatIons in Cancer (v 9.0.0)
FastQ to Annotated VCF
BALSAMIC is basically a wrapper for its core workflow manager. The goal is to have a package with well defined cli to make it reproducible for user to run somatic calling regaradless of the workflow manger at its core. Right now, BALSAMIC is using Snakemake as its core. So one can run the sample using workflows available within this package and standard Snakemake cli given that there is a proper config file created.
Source code |
|
---|---|
Version |
|
Author |
Hassan Foroughi Asl |
Development model |
Gitflow |
Build status |
|
Container latest release status |
|
Container master and develop status |
|
Code coverage |
|
Documentation |
|
Dependencies |
|
Contributors |
@ashwini06, @ivadym, @khurrammaqbool, @keyvanelhami, @mropat, @imsarath |
Installation¶
This section describes steps to install BALSAMIC (version = 9.0.0)
Software Requirements¶
Conda >=version 4.5.0: For detailed software and python requirements please see
setup.py
andBALSAMIC/conda/balsamic.yaml
Singularity >=version 3.0.0: BALSAMIC uses singularity to run vairous parts of the workflow.
Python 3.6
BALSAMIC is dependent on third-party bioinformatics software
Sentieon-tools
. Example: for running wgs variant calling usingTNScope
, and to executeUMIworkflow
.
Note: Set Sentieon envionment variables in your ~/.bashrc file by adding following two lines
export SENTIEON_INSTALL_DIR=path_to_sentieon_install_dir
export SENTIEON_LICENSE=IP:Port
Step 1. Installing BALSAMIC¶
Create a conda environment:
conda create -c conda-forge -c defaults --name S_BALSAMIC python==3.7 pip pygraphviz
Activate environment:
conda activate S_BALSAMIC
Install BALSAMIC using
pip
within the newly created environment:
pip install --no-cache-dir -U git+https://github.com/Clinical-Genomics/BALSAMIC
Or if you have repository cloned and want it in editable mode:
pip install -e .
Step 2. generate BALSAMIC cache and pull containers¶
First generate your own COSMIC database key via: https://cancer.sanger.ac.uk/cosmic/help/file_download
The following commands will create and download reference directory at ~/balsamic_cache
(change this path if you
want it to be created in another location):
NOTE: This process can take couple of hours
# Note:
# 1. COSMIC key is in variable $COSMIC_KEY
# 2. For genome version hg38, set --genome-version to hg38
# 3. For using develop container version, set --container-version to develop
# 4. For submitting jobs to slurm cluster, use option --account
balsamic init --outdir ~/balsamic_cache \
--cosmic-key "${COSMIC_KEY}" \
--genome-version hg19 \
--run-analysis \
--account development
# Generate cache locally instead of slurm job submission
balsamic init --outdir ~/balsamic_cache \
--cosmic-key "${COSMIC_KEY}" \
--genome-version hg19 \
--run-analysis \
--run-mode local \
--snakemake-opt "--cores 16"
Short tutorial¶
Here a short tutorial is provided for BALSAMIC (version = 9.0.0).
Running a test sample¶
Given the
balsamic config case \
--tumor tests/test_data/fastq/S1_R_1.fastq.gz \
--normal tests/test_data/fastq/S2_R_1.fastq.gz \
--case-id demo_run_balsamic \
--analysis-dir demo/ \
--panel-bed tests/test_data/references/panel/panel.bed \
--balsamic-cache ~/balsamic_cache \
--quiet
Notes:
If you want to test tumor_only mode, remove the
--normal tests/test_data/fastq/S2_R_1.fastq.gz
line.--output-config demo_run_balsamic.json
is also optional
Let’s try a dry run and see everything is in place:
balsamic run analysis --sample-config demo/demo_run_balsamic/demo_run_balsamic.json
Command above should exit a similar output as below:
Job counts:
count jobs
1 BaseRecalibrator
1 CollectAlignmentSummaryMetrics
1 CollectHsMetrics
1 CollectInsertSizeMetrics
1 IndelRealigner
1 MarkDuplicates
1 RealignerTargetCreator
1 all
1 bwa_mem
1 cnvkit_single
1 fastp
1 fastqc
13 haplotypecaller
1 haplotypecaller_merge
1 manta_germline
1 manta_tumor_only
1 mergeBam_tumor
1 mergeBam_tumor_gatk
1 multiqc
1 mutect2_merge
13 mutect2_tumor_only
1 sambamba_exon_depth
1 sambamba_panel_depth
1 samtools_sort_index
1 somatic_snv_indel_vcf_merge
1 split_bed_by_chrom
1 strelka_germline
1 vardict_merge
13 vardict_tumor_only
7 vep
72
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
And now run balsamic through SLURM. Make sure you set your SLURM project account using --account
if your local
settings require it:
balsamic run analysis --sample-config demo/demo_run_balsamic/demo_run_balsamic.json \
--profile slurm --qos low --account development --run-analysis
And now run balsamic through QSUB. Make sure you set your QSUB project account using --account
if your local
settings require it:
balsamic run analysis --sample-config demo/demo_run_balsamic/demo_run_balsamic.json \
--profile qsub --qos low --account development --run-analysis
And running workflow without submitting jobs. Set number of cores by passing an argument to snakemake as seen below:
balsamic run analysis --sample-config demo/demo_run_balsamic/demo_run_balsamic.json \
--run-mode local --snakemake-opt "--cores 8" --run-analysis
BALSAMIC Annotation Resources¶
BALSAMIC annotates somatic single nucleotide variants (SNVs) using ensembl-vep
and vcfanno
. Somatic structural variants (SVs), somatic copy-number variants (CNVs) and germline single nucleotide variants are annotated using only ensembl-vep
. All SVs and CNVs are merged using SVDB
before annotating for Target Genome Analysis (TGA) or Whole Genome Sequencing (WGS) analyses.
BALSAMIC adds the following annotation from gnomAD database using vcfanno
.
VCF tag |
description |
---|---|
GNOMADAF_popmax |
maximum allele frequency across populations |
GNOMADAF |
fraction of the reads supporting the alternate allele, allelic frequency |
BALSAMIC adds the following annotation from ClinVar database using vcfanno
.
VCF tag |
description |
---|---|
CLNACC |
Variant Accession and Versions |
CLNREVSTAT |
ClinVar review status for the Variation ID |
CLNSIG |
Clinical significance for this single variant |
CLNVCSO |
Sequence Ontology id for variant type |
CLNVC |
Variant type |
ORIGIN |
Allele origin |
The values for ORIGIN are described below:
Value |
Annotation |
---|---|
0 |
unknown |
1 |
germline |
2 |
somatic |
4 |
inherited |
8 |
paternal |
16 |
maternal |
32 |
de-novo |
64 |
biparental |
128 |
uniparental |
256 |
not-tested |
512 |
tested-inconclusive |
1073741824 |
other |
BALSAMIC uses ensembl-vep to add the following annotation from COSMIC database.
VCF tag |
description |
---|---|
COSMIC_CDS |
CDS annotation |
COSMIC_GENE |
gene name |
COSMIC_STRAND |
strand |
COSMIC_CNT |
number of samples with this mutation in the COSMIC database |
COSMIC_AA |
peptide annotation |
Where relevant, BALSAMIC uses ensembl-vep to annotate somatic and germline SNVs and somatic SVs/CNVs from 1000genomes (phase3), ClinVar, ESP, HGMD-PUBLIC, dbSNP, gencode, gnomAD, polyphen, refseq, and sift databases. The following annotations are added by ensembl-vep.
Annotation |
description |
---|---|
Allele |
the variant allele used to calculate the consequence |
Gene |
Ensembl stable ID of affected gene |
Feature |
Ensembl stable ID of feature |
Feature type |
type of feature. Currently one of Transcript, RegulatoryFeature, MotifFeature. |
Consequence |
consequence type of this variant |
Position in cDNA |
relative position of base pair in cDNA sequence |
Position in CDS |
relative position of base pair in coding sequence |
Position in protein |
relative position of amino acid in protein |
Amino acid change |
only given if the variant affects the protein-coding sequence |
Codon change |
the alternative codons with the variant base in upper case |
Co-located variation |
identifier of any existing variants |
VARIANT_CLASS |
Sequence Ontology variant class |
SYMBOL |
the gene symbol |
SYMBOL_SOURCE |
the source of the gene symbol |
STRAND |
the DNA strand (1 or -1) on which the transcript/feature lies |
ENSP |
the Ensembl protein identifier of the affected transcript |
FLAGS |
transcript quality flags:
cds_start_NF: CDS 5’ incomplete
cds_end_NF: CDS 3’ incomplete
|
SWISSPROT |
Best match UniProtKB/Swiss-Prot accession of protein product |
TREMBL |
Best match UniProtKB/TrEMBL accession of protein product |
UNIPARC |
Best match UniParc accession of protein product |
HGVSc |
the HGVS coding sequence name |
HGVSp |
the HGVS protein sequence name |
HGVSg |
the HGVS genomic sequence name |
HGVS_OFFSET |
Indicates by how many bases the HGVS notations for this variant have been shifted |
SIFT |
the SIFT prediction and/or score, with both given as prediction(score) |
PolyPhen |
the PolyPhen prediction and/or score |
MOTIF_NAME |
The source and identifier of a transcription factor binding profile aligned at this position |
MOTIF_POS |
The relative position of the variation in the aligned TFBP |
HIGH_INF_POS |
A flag indicating if the variant falls in a high information position of a transcription factor binding profile (TFBP) |
MOTIF_SCORE_CHANGE |
The difference in motif score of the reference and variant sequences for the TFBP |
CANONICAL |
a flag indicating if the transcript is denoted as the canonical transcript for this gene |
CCDS |
the CCDS identifer for this transcript, where applicable |
INTRON |
the intron number (out of total number) |
EXON |
the exon number (out of total number) |
DOMAINS |
the source and identifer of any overlapping protein domains |
DISTANCE |
Shortest distance from variant to transcript |
AF |
Frequency of existing variant in 1000 Genomes |
AFR_AF |
Frequency of existing variant in 1000 Genomes combined African population |
AMR_AF |
Frequency of existing variant in 1000 Genomes combined American population |
EUR_AF |
Frequency of existing variant in 1000 Genomes combined European population |
EAS_AF |
Frequency of existing variant in 1000 Genomes combined East Asian population |
SAS_AF |
Frequency of existing variant in 1000 Genomes combined South Asian population |
AA_AF |
Frequency of existing variant in NHLBI-ESP African American population |
EA_AF |
Frequency of existing variant in NHLBI-ESP European American population |
gnomAD_AF |
Frequency of existing variant in gnomAD exomes combined population |
gnomAD_AFR_AF |
Frequency of existing variant in gnomAD exomes African/American population |
gnomAD_AMR_AF |
Frequency of existing variant in gnomAD exomes American population |
gnomAD_ASJ_AF |
Frequency of existing variant in gnomAD exomes Ashkenazi Jewish population |
gnomAD_EAS_AF |
Frequency of existing variant in gnomAD exomes East Asian population |
gnomAD_FIN_AF |
Frequency of existing variant in gnomAD exomes Finnish population |
gnomAD_NFE_AF |
Frequency of existing variant in gnomAD exomes Non-Finnish European population |
gnomAD_OTH_AF |
Frequency of existing variant in gnomAD exomes combined other combined populations |
gnomAD_SAS_AF |
Frequency of existing variant in gnomAD exomes South Asian population |
MAX_AF |
Maximum observed allele frequency in 1000 Genomes, ESP and gnomAD |
MAX_AF_POPS |
Populations in which maximum allele frequency was observed |
CLIN_SIG |
ClinVar clinical significance of the dbSNP variant |
BIOTYPE |
Biotype of transcript or regulatory feature |
APPRIS |
Annotates alternatively spliced transcripts as primary or alternate based on a range of computational methods. NB: not available for GRCh37 |
TSL |
Transcript support level. NB: not available for GRCh37 |
PUBMED |
Pubmed ID(s) of publications that cite existing variant |
SOMATIC |
Somatic status of existing variant(s); multiple values correspond to multiple values in the Existing_variation field |
PHENO |
Indicates if existing variant is associated with a phenotype, disease or trait; multiple values correspond to multiple values in the Existing_variation field |
GENE_PHENO |
Indicates if overlapped gene is associated with a phenotype, disease or trait |
BAM_EDIT |
Indicates success or failure of edit using BAM file |
GIVEN_REF |
Reference allele from input |
REFSEQ_MATCH |
the RefSeq transcript match status; contains a number of flags indicating whether this RefSeq transcript matches the underlying reference sequence and/or an Ensembl transcript (more information):
|
CHECK_REF |
Reports variants where the input reference does not match the expected reference |
HGNC_ID |
A unique ID provided by the HGNC for each gene with an approved symbol |
MANE |
indicating if the transcript is the MANE Select or MANE Plus Clinical transcript for the gene. |
miRNA |
Reports where the variant lies in the miRNA secondary structure. |
BALSAMIC Variant Calling Algorithms¶
In BALSAMIC, various bioinfo tools are integrated for reporting somatic and germline variants. Also, the choice of these tools differs between the type of analysis, e.g.: Target Genome Analysis (TGA) or analysis of Whole Genome Sequencing (WGS). Various filters (Pre-call and Post-call filtering) are applied at different levels to report high-confidence variant calls.
Pre-call filtering is where the variant-calling tool decides not to add a variant to the VCF file if the default filters of the variant-caller did not pass the filter criteria. The set of default filters differs between the various variant-calling algorithms.
To know more about the pre-call filters used by the variant callers, please have a look at the VCF header of the particular variant-calling results. For example:

Pre-call filters applied by the Vardict variant-caller is listed in the VCF header.¶
In the VCF file, the FILTER status is PASS if this position has passed all filters, i.e., a call is made at this position. Contrary, if the site has not passed any of the filters, a semicolon-separated list of those failed filter(s) will be appended to the FILTER column instead of PASS. E.g., p8;pSTD might indicate that at this site, the mean position in reads is less than 8, and the position in reads has a standard deviation of 0.
Note
In BALSAMIC, this VCF file is named as *.all.vcf.gz (eg: SNV.somatic.<CASE_ID>.vardict.all.vcf.gz)

Vardict Variant calls with different ‘FILTER’ status underlined in white line (NM4.5, PASS, p8;pSTD)¶
Post-call filtering is where a variant is further filtered with quality, depth, VAF, etc., with more stringent thresholds.
For Post-call filtering, in BALSAMIC we have applied various filtering criteria (Vardict_filtering, TNscope filtering (Tumor_normal) ) depending on the analysis-type (TGS/WGS) and sample-type(tumor-only/tumor-normal).
Note
In BALSAMIC, this VCF file is named as *.all.filtered.vcf.gz (eg: SNV.somatic.<CASE_ID>.vardict.all.filtered.vcf.gz)
Only those variants that fulfill the pre-call and post-call filters are scored as PASS in the STATUS column of the VCF file. We filter those PASS variants and deliver a final list of variants to the customer either via Scout or Caesar
Note
In BALSAMIC, this VCF file is named as *.all.filtered.pass.vcf.gz (eg: SNV.somatic.<CASE_ID>.vardict.all.filtered.pass.vcf.gz)
VCF file name |
Description |
Delivered to the customer |
---|---|---|
.vcf.gz |
Unannotated VCF file with pre-call filters included in the STATUS column |
Yes (Caesar) |
.all.vcf.gz |
Annotated VCF file with pre-call filters included in the STATUS column |
No |
.all.filtered.vcf.gz |
Annotated VCF file with pre-call and post-call filters included in the STATUS column |
No |
.all.filtered.pass.vcf.gz |
Annotated and filtered VCF file by excluding all filters that did not meet the pre and post-call filter criteria. Includes only variants with the PASS STATUS |
Yes (Caesar and Scout) |
Targeted Genome Analysis¶
Somatic Callers for reporting SNVs/INDELS¶
Vardict¶
Vardict is a sensitive variant caller used for both tumor-only and tumor-normal variant calling. The results of Vardict variant calling are further post-filtered based on several criteria (Vardict_filtering) to retrieve high-confidence variant calls. These high-confidence variant calls are the final list of variants uploaded to Scout or available in the delivered VCF file in Caesar.
Vardict_filtering¶
Following is the set of criteria applied for filtering vardict results. It is used for both tumor-normal and tumor-only samples.
Mean Mapping Quality (MQ): Refers to the root mean square (RMS) mapping quality of all the reads spanning the given variant site.
MQ >= 40
Total Depth (DP): Refers to the overall read depth supporting the called variant.
DP >= 100
Variant depth (VD): Total reads supporting the ALT allele
VD >= 5
Allelic Frequency (AF): Fraction of the reads supporting the alternate allele
Minimum AF >= 0.007
Maximum AF < 1
Attention
BALSAMIC <= v8.2.7 uses minimum AF 1% (0.01). From Balsamic v8.2.8, minimum VAF is changed to 0.7% (0.007)
GNOMADAF_POPMAX: Maximum Allele Frequency across populations
GNOMADAF_popmax <= 0.005 (or) GNOMADAF_popmax == "."
Important
Additionally, the variant is excluded for tumor-normal cases if marked as ‘germline’ in the STATUS column of the VCF file.
Whole Genome Sequencing (WGS)¶
Sentieon’s TNscope¶
BALSAMIC utilizes the TNscope algorithm for calling somatic SNVs and INDELS in WGS samples. The TNscope algorithm performs the somatic variant calling on the tumor-normal or the tumor-only samples, using a Haplotyper algorithm.
TNscope filtering (Tumor_normal)¶
The following filters are applied to the variants in TNscope raw VCF file (SNV.somatic.$CASE_ID.tnscope.all.vcf.gz). The variants scored as PASS are included in the final vcf file (SNV.somatic.$CASE_ID.tnscope.all.filtered.pass.vcf.gz).
Total Depth (DP): Refers to the overall read depth from all target samples supporting the variant call
DP(tumor) >= 10 (or) DP(normal) >= 10
Allelic Depth (AD): Total reads supporting the ALT allele in the tumor sample
AD(tumor) >= 3
Allelic Frequency (AF): Fraction of the reads supporting the alternate allele
Minimum AF(tumor) >= 0.05
Maximum AF(tumor) < 1
GNOMADAF_POPMAX: Maximum Allele Frequency across populations
GNOMADAF_popmax <= 0.001 (or) GNOMADAF_popmax == "."
TNscope filtering (tumor_only)¶
The somatic variants in TNscope raw VCF file (SNV.somatic.$CASE_ID.tnscope.all.vcf.gz) are filtered out for the genomic regions that are not reliable (eg: centromeric regions, non-chromosome contigs) to enhance the computation time. This WGS interval region file is collected from gatk_bundles gs://gatk-legacy-bundles/b37/wgs_calling_regions.v1.interval_list and following filters are applied. The variants that scored as PASS are considered for Merging of TNscope and TNhaplotyper results (tumor_only)
Total Depth (DP): Refers to the overall read depth supporting the variant call
DP(tumor) >= 10
Allelic Depth (AD): Total reads supporting the ALT allele in the tumor sample
AD(tumor) > 3
Allelic Frequency (AF): Fraction of the reads supporting the alternate allele
Minimum AF(tumor) > 0.05
Maximum AF(tumor) < 1
GNOMADAF_POPMAX: Maximum Allele Frequency across populations
GNOMADAF_popmax <= 0.001 (or) GNOMADAF_popmax == "."
Normalized base quality scores: The sum of base quality scores for each allele (QSS) is divided by the allelic depth of alt and ref alleles (AD)
SUM(QSS)/SUM(AD) >= 20
Read Counts: Count of reads in a given (F1R2, F2R1) pair orientation supporting the alternate allele and reference alleles
ALT_F1R2 > 0, ALT_F2R1 > 0
REF_F1R2 > 0, REF_F2R1 > 0
SOR: Symmetric Odds Ratio of 2x2 contingency table to detect strand bias
SOR < 3
TNhaplotyper filtering (tumor_only)¶
The somatic variants in TNhaplotyper raw VCF file (SNV.somatic.$CASE_ID.tnhaplotyper.all.vcf.gz) are filtered out for the genomic regions that are not reliable (eg: centromeric regions, non-chromosome contigs) to enhance the computation time. This WGS interval region file is collected from gatk_bundles gs://gatk-legacy-bundles/b37/wgs_calling_regions.v1.interval_list and following filters are applied. The variants that scored as PASS are considered for Merging of TNscope and TNhaplotyper results (tumor_only)
Total Depth (DP): Refers to the overall read depth from all target samples supporting the variant call
DP(tumor) >= 10 (or) DP(normal) >= 10
Allelic Depth (AD): Total reads supporting the ALT allele in the tumor sample
AD(tumor) >= 3
Allelic Frequency (AF): Fraction of the reads supporting the alternate allele
Minimum AF(tumor) >= 0.05
Maximum AF(tumor) < 1
GNOMADAF_POPMAX: Maximum Allele Frequency across populations
GNOMADAF_popmax <= 0.001 (or) GNOMADAF_popmax == "."
Normalized base quality scores: The sum of base quality scores for each allele (QSS) is divided by the allelic depth of alt and ref alleles (AD)
SUM(QSS)/SUM(AD) >= 20
Read Counts: Count of reads in a given (F1R2, F2R1) pair orientation supporting the alternate allele and reference alleles
ALT_F1R2 > 0, ALT_F2R1 > 0
REF_F1R2 > 0, REF_F2R1 > 0
Merging of TNscope and TNhaplotyper results (tumor_only)¶
The filtered somatic variants from TNscope filtering (tumor_only) and TNhaplotyper filtering (tumor_only) are merged using the bcftools intersect command to reduce the number of reported somatic variants for tumor-only samples. Next, the somatic variants that are called by both variant-callers are reported as the final filtered list of variants (SNV.somatic.{CASE_ID}.tnscope.all.filtered.pass.vcf.gz). The final VCF constitutes a high confidence set of somatic variants, which is delivered to the customer either by scout or caesar filesystem.
Target Genome Analysis with UMI’s into account¶
Sentieon’s TNscope¶
UMI workflow performs the variant calling of SNVs/INDELS using the TNscope algorithm from UMI consensus-called reads. The following filter applies for both tumor-normal and tumor-only samples.
Pre-call Filters
minreads: Filtering of consensus called reads based on the minimum reads supporting each UMI tag group
minreads = 3,1,1
It means that at least 3 UMI tag groups should be ideally considered from both DNA strands, where a minimum of at least 1 UMI tag group should exist in each of the single-stranded consensus reads.
min_init_tumor_lod: Log odds is the likelihood that the candidate mutation is real over the likelihood that the candidate mutation is a sequencing error before any read-based filters are applied. Minimum log-odds for the candidate selection. TNscope default: 4. In our UMI-workflow we reduced this setting to 0.5
min_init_tumor_lod = 0.5
min_tumor_lod: minimum log odds in the final call of variants. TNscope default: 6.3. In our UMI-workflow we reduced this setting to 4.0
min_tumor_lod = 4.0
min_tumor_allele_frac: Set the minimum tumor AF to be considered as potential variant site.
min_tumor_allele_frac = 0.0005
interval_padding: Adding an extra 100bp to each end of the target region in the bed file before variant calling.
interval_padding = 100
Post-call Filters
GNOMADAF_POPMAX: Maximum Allele Frequency across populations
GNOMADAF_popmax <= 0.02 (or) GNOMADAF_popmax == "."
Attention
BALSAMIC <= v8.2.10 uses GNOMAD_popmax <= 0.005. From Balsamic v9.0.0, this settings is changed to 0.02, to reduce the stringency.
BALSAMIC METHODS¶
Target Genome Analysis¶
BALSAMIC 1 (version = 8.2.10) was used to analyze the data from raw FASTQ files. We first quality controlled FASTQ files using FastQC v0.11.9 2. Adapter sequences and low-quality bases were trimmed using fastp v0.23.2 3. Trimmed reads were mapped to the reference genome hg19 using BWA MEM v0.7.17 4. The resulted SAM files were converted to BAM files and sorted using samtools v1.15.1 5. Duplicated reads were marked using Picard tools MarkDuplicate v2.27.1 6 and promptly quality controlled using CollectHsMetrics, CollectInsertSizeMetrics and CollectAlignmentSummaryMetrics functionalities. Results of the quality controlled steps were summarized by MultiQC v1.12 7. Small somatic mutations (SNVs and INDELs) were called for each sample using VarDict v2019.06.04 8. Apart from the Vardict filters to report the variants, the called-variants were also further second filtered using the criteria (MQ >= 40, DP >= 100, VD >= 5, Minimum AF >= 0.007, Maximum AF < 1, GNOMADAF_popmax <= 0.005). Only those variants that fulfilled the filtering criteria and scored as PASS in the VCF file were reported. Structural variants were called using Manta v1.6.0 9 and Delly v0.9.1 10. Copy number aberrations were called using CNVkit v0.9.9 11. The variant calls from CNVkit, Manta and Delly were merged using SVDB v2.6.0 12. All variants were annotated using Ensembl VEP v104.3 13. We used vcfanno v0.3.3 14 to annotate somatic variants for their population allele frequency from gnomAD v2.1.1 18.
Whole Genome Analysis¶
BALSAMIC 1 (version = 8.2.10) was used to analyze the data from raw FASTQ files. We first quality controlled FASTQ files using FastQC v0.11.9 2. Adapter sequences and low-quality bases were trimmed using fastp v0.23.2 3. Trimmed reads were mapped to the reference genome hg19 using sentieon-tools 15. The resulted SAM files were converted to BAM files and sorted using samtools v1.15.1 5. Duplicated reads were marked using Picard tools MarkDuplicate v2.27.1 6 and promptly quality controlled using CollectMultipleMetrics and CollectWgsMetrics functionalities. Results of the quality controlled steps were summarized by MultiQC v1.12 7. Small somatic mutations (SNVs and INDELs) were called for each sample using Sentieon TNscope and TNhaplotyper 16. The called-variants were also further second filtered using the criteria (DP(tumor,normal) >= 10; AD(tumor) >= 3; AF(tumor) >= 0.05, Maximum AF(tumor < 1; GNOMADAF_popmax <= 0.001; normalized base quality scores >= 20, read_counts of alt,ref alle > 0). The filtered variants from TNscope and TNhaplotyper were merged using bcftools isec functionality to reduce the number of variants for tumor-only samples. Structural variants were called using Manta v1.6.0 9 and Delly v0.9.1 10. Copy number aberrations were called using ascatNgs v4.5.0 17 for tumor-normal samples. The structural variant calls from Manta, Delly and ascatNgs were merged using SVDB v2.6.0 12 All variants were finally annotated using Ensembl VEP v104.3 13. We used vcfanno v0.3.3 14 to annotate somatic variants for their population allele frequency from gnomAD v2.1.1 18.
UMI Data Analysis¶
BALSAMIC 1 (version = 8.2.10) was used to analyze the data from raw FASTQ files. We first quality controlled FASTQ files using FastQC v0.11.9 2. Adapter sequences and low-quality bases were trimmed using fastp v0.23.2 3. UMI tag extraction and consensus generation were performed using Sentieon tools v202010.02 15. The alignment of UMI extracted and consensus called reads to the human reference genome (hg19) was done by bwa-mem and samtools using Sentieon utils. Consensus reads were filtered based on the number of minimum reads supporting each UMI tag group. We applied a criteria filter of minimum reads 3,1,1. It means that at least three UMI tag groups should be ideally considered from both DNA strands, where a minimum of at least one UMI tag group should exist in each single-stranded consensus read. The filtered consensus reads were quality controlled using Picard CollectHsMetrics v2.27.1 5. Results of the quality controlled steps were summarized by MultiQC v1.12 6. For each sample, somatic mutations were called using Sentieon TNscope 16, with non-default parameters for passing the final list of variants (–min_tumor_allele_frac 0.0005, –filter_t_alt_frac 0.0005, –min_init_tumor_lod 0.5, min_tumor_lod 4, –max_error_per_read 5 –pcr_indel_model NONE, GNOMADAF_popmax <= 0.001). All variants were finally annotated using Ensembl VEP v104.3 7. We used vcfanno v0.3.3 8 to annotate somatic variants for their population allele frequency from gnomAD v2.1.1 18. For exact parameters used for each software, please refer to https://github.com/Clinical-Genomics/BALSAMIC. We used three commercially available products from SeraCare [Material numbers: 0710-067110 19, 0710-067211 20, 0710-067312 21] for validating the efficiency of the UMI workflow in identifying 14 mutation sites at known allelic frequencies.
References¶
Foroughi-Asl, H., Jeggari, A., Maqbool, K., Ivanchuk, V., Elhami, K., & Wirta, V. BALSAMIC: Bioinformatic Analysis pipeLine for SomAtic MutatIons in Cancer (Version v8.2.10) [Computer software]. https://github.com/Clinical-Genomics/BALSAMIC
Babraham Bioinformatics - FastQC A Quality Control tool for High Throughput Sequence Data. Accessed June 22, 2020. https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018;34(17):i884-i890. doi:10.1093/bioinformatics/bty560
Li H. (2013) Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997v2 [q-bio.GN]
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. doi: 10.1093/bioinformatics/btp352
Picard Tools - By Broad Institute. Accessed June 22, 2020. https://broadinstitute.github.io/picard/
Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016;32(19):3047-3048. doi:10.1093/bioinformatics/btw354
Lai Z, Markovets A, Ahdesmaki M, Chapman B, Hofmann O, McEwen R, Johnson J, Dougherty B, Barrett JC, and Dry JR. VarDict: a novel and versatile variant caller for next-generation sequencing in cancer research. Nucleic Acids Res. 2016. https://doi.org/10.1093/nar/gkw227
Chen, X. et al. (2016) Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinformatics, 32, 1220-1222. doi:10.1093/bioinformatics/btv710
Tobias Rausch, Thomas Zichner, Andreas Schlattl, Adrian M. Stuetz, Vladimir Benes, Jan O. Korbel. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics. 2012 Sep 15;28(18):i333-i339. https://doi.org/10.1093/bioinformatics/bts378
Talevich, E, Shain, A.H, Botton, T, & Bastian, B.C. CNVkit: Genome-wide copy number detection and visualization from targeted sequencing. PLOS Computational Biology. 2016, 12(4):e1004873. https://doi.org/10.1371/journal.pcbi.1004873
Jesper Eisfeldt et.al. TIDDIT, an efficient and comprehensive structural variant caller for massive parallel sequencing data. F1000 research. 2017. doi: 10.12688/f1000research.11168.2
McLaren W, Gil L, Hunt SE, et al. The Ensembl Variant Effect Predictor. Genome Biology. 2016;17(1):122.
Pedersen BS, Layer RM, Quinlan AR. Vcfanno: fast, flexible annotation of genetic variants. Genome Biology. 2016;17(1):118. doi:10.1186/s13059-016-0973-5
Donald Freed, Rafael Aldana, Jessica A. Weber, Jeremy S. Edwards. The Sentieon Genomics Tools - A fast and accurate solution to variant calling from next-generation sequence data. Bioinformatics. 2016, Volume 32,Issue 8. https://doi.org/10.1093/bioinformatics/btv710
Donald Freed, Renke Pan, Rafael Aldana. TNscope: Accurate Detection of Somatic Mutations with Haplotype-based Variant Candidate Detection and Machine Learning Filtering. bioRvix. doi: https://doi.org/10.1101/250647
Keiran MR, Peter VL, David CW, David J, Andrew M, Adam PB , Jon WT, Patrick T, Serena Nik-Zainal, Peter J C. ascatNgs: Identifying Somatically Acquired Copy-Number Alterations from Whole-Genome Sequencing Data. Curr Protoc Bioinformatics. 2016. doi:https://doi.org/10.1002/cpbi.17
Karczewski, K.J., Francioli, L.C., Tiao, G. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020). https://doi.org/10.1038/s41586-020-2308-7
https://www.seracare.com/Seraseq-ctDNA-Complete-Reference-Material-AF1-0710-0671/
https://www.seracare.com/Seraseq-ctDNA-Complete-Reference-Material-AF05-0710-0672/
https://www.seracare.com/Seraseq-ctDNA-Complete-Reference-Material-AF01-0710-0673/
Workflow¶
BALSAMIC ( version = 9.0.0 ) uses myriad of tools and softwares to analyze fastq files. This section covers why each one is included: usage and parameters, and relevant external links.
ascatNgs¶
- Source code:
- Article:
- Version:
4.5.0
bcftools¶
- Source code:
- Article:
Bioinformatics https://pubmed.ncbi.nlm.nih.gov/21903627/
- Version:
>=1.10
bedtools¶
- Source code:
- Article:
Bioinformatics https://pubmed.ncbi.nlm.nih.gov/20110278/
- Version:
2.30.0
bwa¶
- Source code:
GitHub https://github.com/lh3/bwa
- Article:
Bioinformatics https://arxiv.org/abs/1303.3997
- Version:
0.7.17
cnvkit¶
- Source code:
- Article:
PLOS Computational Biology http://dx.doi.org/10.1371/journal.pcbi.1004873
- Version:
0.9.9
delly¶
- Source code:
- Article:
Bioinformatics https://academic.oup.com/bioinformatics/article/28/18/i333/245403
- Version:
0.9.1
ensembl-vep¶
- Source code:
- Article:
Genome Biology https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0974-4
- Version:
104.3
fastp¶
- Source code:
- Article:
Bioinformatics https://doi.org/10.1093/bioinformatics/bty560
- Version:
0.23.2
fastqc¶
- Source code:
- Article:
Babraham http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- Version:
0.11.9
gatk¶
- Source code:
- Article:
Current Protocols in Bioinformatics https://pubmed.ncbi.nlm.nih.gov/25431634/
- Version:
3.8
manta¶
- Source code:
- Article:
Bioinformatics https://doi.org/10.1093/bioinformatics/btv710
- Version:
1.6.0
multiqc¶
- Source code:
- Article:
Bioinformatics http://dx.doi.org/10.1093/bioinformatics/btw354
- Version:
1.12
mosdepth¶
- Source code:
- Article:
Bioinformatics https://academic.oup.com/bioinformatics/article/34/5/867/4583630?login=true
- Version:
0.3.3
picard¶
- Source code:
- Article:
-
- Version:
2.27.1
sambamba¶
- Source code:
- Article:
Bioinformatics https://pubmed.ncbi.nlm.nih.gov/25697820/
- Version:
0.8.2
samtools¶
- Source code:
- Article:
Bioinformatics https://pubmed.ncbi.nlm.nih.gov/19505943/
- Version:
>1.11
sentieon-tools¶
- Source code:
Commercial Tool https://www.sentieon.com/
- Article:
Bioinformatics https://www.biorxiv.org/content/10.1101/115717v2
- Version:
202010.02
svdb¶
- Source code:
- Article:
- Version:
2.6.0
tabix¶
- Source code:
- Article:
Bioinformatics https://academic.oup.com/bioinformatics/article/27/5/718/262743
- Version:
1.11
vardict¶
- Source code:
- Article:
Nucleic Acid Research https://pubmed.ncbi.nlm.nih.gov/27060149/
- Version:
2019.06.04
vcfanno¶
- Source code:
- Article:
Genome Biology https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0973-5/
- Version:
0.3.3
vcf2cytosure¶
- Source code:
- Article:
-
- Version:
0.7.1
CLI reference¶
CHANGELOG¶
[9.0.0]¶
Added:¶
Snakemake workflow to create canfam3 reference #843
Call umi variants using TNscope in bed defined regions #821
UMI duplication metrics to report in multiqc_picard_dups.json #844
Option to use PON reference in cnv calling for TGA tumor-only cases
QC default validation conditions (for not defined capture kits) #855
SVdb to the varcall_py36 container #872
SVdb to WGS workflow #873
Docker container for vcf2cytosure #858
Snakemake rule for creating .cgh files from CNVkit outputs #858
SVdb to TGA workflow #879
SVdb merge SV and CNV #886
Readthedocs for BALSAMIC method descriptions #892
Readthedocs for BALSAMIC variant filters for WGS somatic callers #892
bcftools counts to varcall filter rules #898
Additional WGS metrics to be stored in
<case>_metrics_deliverables.yaml
#907ascatNGS copynumber file #914
ReadtheDocs for BALSAMIC annotation resources #916
Delly CNV for tumor only workflow #923
Delly CNV Read-depth profiles for tumor only workflows #924
New metric to be extracted and validated:
NUMBER_OF_SITES
(bcftools
counts) #925
Changed:¶
Merge QC metric extraction workflows #833
Changed the base-image for balsamic container to 4.10.3-alpine #869
Updated SVdb to 2.6.0 #871
Upgrade black to 22.3.0
For UMI workflow, post filter gnomad_pop_freq value is changed from 0.005 to 0.02 #919
updated delly to 0.9.1 #920
container base_image (align_qc, annotate, coverage_qc, varcall_cnvkit, varcall_py36) to 4.10.3-alpine #921
update container (align_qc, annotate, coverage_qc, varcall_cnvkit,varcall_py36) bioinfo tool versions #921
update tool versions (align_qc, annotate, coverage_qc, varcall_cnvkit) in methods and softwares docs #921
Updated the list of files to be stored and delivered #848
Moved
collect_custom_qc_metrics
rule frommultiqc.rule
#925
Fixed:¶
Automate balsamic version for readthedocs install page #888
collect_qc_metrics.py
failing for WGS cases with emptycapture_kit
argument #850QC metric validation for different panel bed version #855
Fixed development version of
fpdf2
to2.4.6
#878Added missing svdb index file #848
Removed¶
--qc-metrics/--no-qc-metrics
flag from thebalsamic report deliver
command #833Unused pon option for SNV calling with TNhaplotyper tumor-only
SV and CNV callers from annotation and filtering #889
vcfanno and COSMIC from SV annotation #891
Removed MSK_impact and MSK_impact_noStrelka json files from config
Cleanup of strelka, pindel , mutect2 variables from BALSAMIC
bcftools_stats from vep #898
QC delivery report workflow (generating the
<case>_qc_report.html
file) #878--sample-id-map
and--case-id-map
flags from thebalsamic report deliver
command #878Removed gatk_haplotypecaller for reporting panel germline variants #918
[8.2.10]¶
Added:¶
libopenblas=0.3.20 dependency to annotate container for fixing bcftools #909
Fixes:¶
bcftools version locked at 1.10 #909
Changed:¶
base image of balsamic container to 4.10.3-alphine #909
Replaced annotate container tests with new code #909
Removed:¶
Removed failed vcf2cytosure installation from annotate container #909
[8.2.9]¶
Added:¶
Added slurm qos tag express #885
Included more text about UMI-workflow variant calling settings to the readthedocs #888
Extend QCModel to include n_base_limit which outputs in config json QC dict
Fixes:¶
Automate balsamic version for readthedocs install page #888
Changed:¶
Upgrade black to 22.3.0
fastp default setting of n_base_limit is changed to 50 from 5
[8.2.8]¶
Added:¶
Added the readthedocs page for BALSAMIC variant-calling filters #867
Project requirements (setup.py) to build the docs #874
Generate cram from umi-consensus called bam files #865
Changed:¶
Updated the bioinfo tools version numbers in BALSAMIC readthedocs #867
Sphinx version fixed to <0.18 #874
Sphinx GitHub action triggers only on master branch PRs
VAF filter for reporting somatic variants (Vardict) is minimised to 0.7% from 1% #876
Fixes:¶
cyvcf2 mock import for READTHEDOCS environment #874
[8.2.7]¶
Fixes:¶
Fixes fastqc timeout issues for wgs cases #861
Fix cluster configuration for vep and vcfanno #857
[8.2.6]¶
Fixes:¶
Set right qos in scheduler command #856
[8.2.5]¶
balsamic.sif container installation during cache generation #841
Fixed:¶
Execution of create_pdf python script inside the balsamic container #841
[8.2.4]¶
Added:¶
--hgvsg
annotation to VEP #830ascatNgs
PDF delivery (plots & statistics) #828
[8.2.3]¶
Fixed:¶
Add default for gender if
purecn
captures dual gender values #824
Changed:¶
Updated
purecn
and its dependencies to latest versions
[8.2.2]¶
Added:¶
ascatNGS
tumor normal delivery #810
Changed:¶
QC metrics delivery tag #820
Refactor tmb rule that contains redundant line #817
[8.2.1]¶
Fixed:¶
cnvkit
gender comparison operator bug #819
[8.2.0]¶
Added:¶
Added various basic filters to all variant callers irregardless of their delivery status #750
BALSAMIC container #728
BALSAMIC reference generation via cluster submission for both reference and container #686
Container specific tests #770
BALSAMIC quality control metrics extraction and validation #754
Delly is added as a submodule and removed from rest of the conda environments #787
Store research VCFs for all filtered and annotated VCF files
Added .,PASS to all structural variant filter rules to resolve the issues with missing calls in filtered file
Handling of QC metrics validation errors #783
Github Action workflow that builds the docs using Sphinx #809
Zenodo integration to create citable link #813
Panel BED specific QC conditions #800
Metric extraction to a YAML file for Vogue #802
Changed:¶
refactored main workflow with more readible organization #614
refactored conda envs within container to be on base and container definition is uncoupled #759
renamed umi output file names to fix issue with picard HSmetrics #804
locked requirements for graphviz io 0.16 #811
QC metric validation is performed across all metrics of each of the samples #800
Removed:¶
The option of running umiworkflow independently with balsamic command-line option “-a umi”
Removed source activate from reference and pon workflows #764
Fixed:¶
Pip installation failure inside balsamic container #758
Fixed issue #768 with missing
vep_install
command in containerFixed issue #765 with correct input bam files for SV rules
Continuation of CNVkit even if
PURECN
fails and fixPureCN
conda paths #774 #775Locked version for
cryptography
packageBumped version for
bcftools
in cnvkit containerFixed issues #776 and #777 with correct install paths for gatk and manta
Fixed issue #782 for missing AF in the vcf INFO field
Fixed issues #748 #749 with correct sample names
Fixed issue #767 for ascatngs hardcoded values
Fixed missing output option in bcftools filters for tnhaplotyper #793
Fixed issue #795 with increasing resources for vep and filter SV prior to vep
Building
wheel
forcryptography
bug inside BALSAMIC container #801Fixed badget for docker container master and develop status
ReadtheDocs building failure due to dependencies, fixed by locking versions #773
Dev requirements installation for Sphinx docs (Github Action) #812
Changed path for main Dockerfile version in
.bumpversion.cfg
[8.1.0]¶
Added:¶
Workflow to check PR tiltes to make easier to tell PR intents #724
bcftools stats
to calculate Ti/Tv for all post annotate germline and somatic calls #93Added reference download date to
reference.json
#726ascatngs
hg38 references to constants #683Added ClinVar as a source to download and to be annotated with VCFAnno #737
Changed:¶
Updated docs for git FAQs #731
Rename panel of normal filename Clinical-Genomics/cgp-cancer-cnvcall#10
Fixed:¶
Fixed bug with using varcall_py36 container with VarDict #739
Fixed a bug with VEP module in MultiQC by excluding #746
Fixed a bug with
bcftools stats
results failing in MultiQC #744
[8.0.2]¶
Fixed:¶
Fixed breaking shell command for VEP annotation rules #734
[8.0.1]¶
Fixed:¶
Fixed context for Dockerfile for release content #720
[8.0.0]¶
Added:¶
samtools
flagstats and stats to workflow and MultiQCdelly v0.8.7
somatic SV caller #644delly
containter #644bcftools v1.12
todelly
container #644tabix v0.2.6
todelly
container #644Passed SV calls from Manta to clinical delivery
An extra filter to VarDict tumor-normal to remove variants with STATUS=Germline, all other will still be around
Added
vcf2cytosure
to annotate containergit
to the container definitionprepare_delly_exclusion rule
Installation of
PureCN
rpackage incnvkit
containerCalculate tumor-purity and ploidy using
PureCN
forcnvkit
callascatngs
as a submodule #672GitHub action to build and test
ascatngs
containerReference section to
docs/FAQ.rst
ascatngs
download references from reference_file repository #672delly
tumor only rule #644ascatngs
download container #672Documentation update on setting sentieon env variables in
bashrc
ascatngs
tumor normal rule for wgs cases #672Individual rules (i.e. ngs filters) for cnv and sv callers. Only Manta will be delivered and added to the list of output files. #708
Added “targeted” and “wgs” tags to variant callers to provide another layer of separation. #708
manta
convert inversion #709Sentieon version to bioinformatic tool version parsing #685
added
CITATION.cff
to cite BALSAMIC
Changed:¶
Upgrade to latest sentieon version 202010.02
New name
MarkDuplicates
topicard_markduplicates
inbwa_mem
rule andcluster.json
New name rule
GATK_contest
togatk_contest
Avoid running pytest github actions workflow on
docs/**
andCHANGELOG.rst
changesUpdated
snakemake
tov6.5.3
#501Update
GNOMAD
URLSplit Tumor-only
cnvkit batch
into individual commandsImproved TMB calculation issue #51
Generalized ascat, delly, and manta result in workflow. #708
Generalized workflow to eliminate duplicate entries and code. #708
Split Tumor-Normal
cnvkit batch
into individual commandsMoved params that are used in multiple rules to constants #711
Changed the way conda and non-conda bioinfo tools version are parsed
Python code formatter changed from Black to YAPF #619
Fixed:¶
post-processing of the umi consensus in handling BI tags
vcf-filtered-clinical tag files will have all variants including PASS
Refactor snakemake
annotate
rules according to snakemake etiquette #636Refactor snakemake
align
rules according to snakemake etiquette #636Refactor snakemake
fastqc
vep
contest andmosdepth
rules according tosnakemake
etiquette #636Order of columns in QC and coverage report issue #601
delly
not showing in workflow at runtime #644ascatngs
documentation links inFAQs
#672varcall_py36
container build and push #703Wrong spacing in reference json issue #704
Refactor snakemake
quality control
rules according to snakemake etiquette #636
Removed:¶
Cleaned up unused container definitions and conda environment files
Remove cnvkit calling for WGS cases
Removed the install.sh script
[7.2.5]¶
Changed:¶
Updated COSMIC path to use version 94
[7.2.5]¶
Changed:¶
Updated path for gnomad and 1000genomes to a working path from Google Storage
[7.2.4]¶
Changed:¶
Updated sentieon util sort in umi to use Sentieon 20201002 version
[7.2.3]¶
Fixed:¶
Fixed memory issue with vcfanno in vep_somatic rule fixes #661
[7.2.2]¶
Fixed:¶
An error with Sentieon for better management of memory fixes #621
[7.2.1]¶
Changed:¶
Rename Github actions to reflect their content
[7.2.0]¶
Added:¶
Changelog reminder workflow to Github
Snakemake workflow for created PON reference
Balsamic cli config command(pon) for creating json for PON analysis
tumor lod option for passing tnscope-umi final variants
Git guide to make balsamic release in FAQ docs
Changed:¶
Expanded multiqc result search dir to whole analysis dir
Simple test for docker container
Fixed:¶
Correctly version bump for Dockerfile
Removed:¶
Removed unused Dockerfile releases
Removed redundant genome version from
reference.json
[7.1.10]¶
Fixed:¶
Bug in
ngs_filter
rule set for tumor-only WGSMissing delivery of tumor only WGS filter
[7.1.9]¶
Changed:¶
only pass variants are not part of delivery anymore
delivery tag file ids are properly matched with sample_name
tabix updated to 0.2.6
fastp updated to 0.20.1
samtools updated to 1.12
bedtools updated to 2.30.0
Removed:¶
sentieon-dedup rule from delivery
Removed all pre filter pass from delivery
[7.1.8]¶
Fixed:¶
Target coverage (Picard HsMetrics) for UMI files is now correctly calculated.
Changed:¶
TNscope calculated AF values are fetched and written to AFtable.txt.
[7.1.7]¶
Added:¶
ngs_filter_tnscope is also part of deliveries now
Changed:¶
rankscore is now a research tag instead of clinical
Some typo and fixes in the coverage and constant metrics
Delivery process is more verbose
Fixed:¶
CNVKit output is now properly imported in the deliveries and workflow
[7.1.6]¶
Fixed:¶
CSS style for qc coverage report is changed to landscape
[7.1.5]¶
Changed:¶
update download url for 1000genome WGS sites from ftp to http
[7.1.4]¶
Changed:¶
bump picard to version 2.25.0
[7.1.3]¶
Fixed:¶
assets
path is now added to bind path
[7.1.2]¶
Fixed:¶
umi_workflow config json is set as true for panel and wgs as false.
Rename umiconsensus bam file headers from {samplenames} to TUMOR/NORMAL.
Documentation autobuild on RTFD
[7.1.1]¶
Fixed:¶
Moved all requirements to setup.py, and added all package_data there. Clean up unused files.
[7.1.0]¶
Removed¶
tnsnv
removed from WGS analysis, both tumor-only and tumor-normalGATK-BaseRecalibrator is removed from all workflows
Fixed¶
Fixed issue 577 with missing
tumor.merged.bam
andnormal.merged.bam
Issue 448 with lingering tmp_dir. It is not deleted after analysis is properly finished.
Changed¶
All variant calling rules use proper
tumor.merged.bam
ornormal.merged.bam
as inputs
[7.0.2]¶
Added¶
Updated docs with FAQ for UMI workflow
Fixed¶
fix job scheduling bug for benchmarking
rankscore’s output is now a proper vcf.gz file
Manta rules now properly make a sample_name file
[7.0.1]¶
Added¶
github action workflow to autobuild release containers
[7.0.0]¶
Added¶
balsamic init
to download reference and related containers done in PRs #464 #538balsamic config case
now only take a cache path instead of container and reference #538UMI workflow added to main workflow in series of PRs #469 #477 #483 #498 #503 #514 #517
DRAGEN for WGS applications in PR #488
A framework for QC check PR #401
--quiet`
option forrun analysis
PR #491Benchmark SLURM jobs after the analysis is finished PR #534
One container per conda environment (i.e. decouple containers) PR #511 #525 #522
--disable-variant-caller
command forreport deliver
PR #439Added genmod and rankscore in series of two PRs #531 and #533
Variant filtering to Tumor-Normal in PR #534
Split SNV/InDels and SVs from TNScope variant caller PR #540
WGS Tumor only variant filters added in PR #548
Changed¶
Update Manta to 1.6.0 PR #470
Update FastQC to 0.11.9 PR #532
Update BCFTools to 1.11 PR #537
Update Samtools to 1.11 PR #537
Increase resources and runtime for various workflows in PRs #482
Python package dependenicies versions fixed in PR #480
QoL changes to workflow in series of PR #471
Series of documentation updates in PRs #489 #553
QoL changes to scheduler script PR #491
QoL changes to how temporary directories are handlded PR #516
TNScope model apply rule merged with TNScope variant calling for tumor-normal in WGS #540
Decoupled
fastp
rule into two rules to make it possible to use it for UMI runs #570
Fixed¶
A bug in Manta variant calling rules that didn’t name samples properly to TUMOR/NORMAL in the VCF file #572
[6.1.2]¶
Changed¶
Changed hk delivery tag for coverage-qc-report
[6.1.1]¶
Fixed¶
No UMI trimming for WGS applications #486
Fixed a bug where BALSAMIC was checking for sacct/jobid file in local mode PR #497
readlink
command invep_germline
,vep_somatic
,split_bed
, andGATK_popVCF
#533Fix various bugs for memory handling of Picardtools and its executable in PR #534
Fixed various issues with
gsutils
in PR #550
Removed¶
gatk-register
command removed from installing GATK PR #496
[6.1.1]¶
Fixed a bug with missing QC templates after
pip install
[6.1.0]¶
Added¶
CLI option to expand report generation for TGA and WES runs. Please see
balsamic report deliver --help
BALSAMIC now generates a custom HTML report for TGA and WES cases.
[6.0.4]¶
Changed¶
Reduces MQ cutoff from 50 to 40 to only remove obvious artifacts PR #535
Reduces AF cutoff from 0.02 to 0.01 PR #535
[6.0.3]¶
Added¶
config case
subcommand now has--tumor-sample-name
and--normal-sample-name
Fixed¶
Manta resource allocation is now properly set PR #523
VarDict resource allocation in cluster.json increased (both core and time allocation) PR #523
minimum memory request for GATK mutect2 and haplotypecaller is removed and max memory increased PR #523
[6.0.2]¶
Added¶
Document for Snakemake rule grammar PR #489
Fixed¶
removed
gatk3-register
command from Dockerfile(s) PR #508
[6.0.1]¶
Added¶
A secondary path for latest jobids submitted to cluster (slurm and qsub) PR #465
[6.0.0]¶
Added¶
UMI workflow using Sentieon tools. Analysis run available via balsamic run analysis –help command. PR #359
VCFutils to create VCF from flat text file. This is for internal purpose to generate validation VCF. PR #349
Download option for hg38 (not validated) PR #407
Option to disable variant callers for WES runs. PR #417
Fixed¶
Missing cyvcf2 dependency, and changed conda environment for base environment PR #413
Missing numpy dependency PR #426
Changed¶
COSMIC db for hg19 updated to v90 PR #407
Fastp trimming is now a two-pass trimming and adapter trimming is always enabled. This might affect coverage slightly PR #422
All containers start with a clean environment #425
All Sentieon environment variables are now added to config when workflow executes #425
Branching model will be changed to gitflow
[5.1.0]¶
Fixed¶
Vardict-java version fixed. This is due to bad dependency and releases available on conda. Anaconda is not yet update with vardict 1.8, but vardict-java 1.8 is there. This causes various random breaks with Vardict’s TSV output. #403
Changed¶
Refactored Docker files a bit, preparation for decoupling #403
Removed¶
In preparation for GATK4, IndelRealigner is removed #404
[5.0.1]¶
Added¶
Temp directory for various rules and workflow wide temp directory #396
Changed¶
Refactored tags for housekeeper delivery to make them unique #395
Increased core requirements for mutect2 #396
GATK3.8 related utils run via jar file instead of gatk3 #396
[5.0.0]¶
Added¶
Config.json and DAG draph included in Housekeeper report #372
New output names added to cnvkit_single and cnvkit_paired #372
New output names added to vep.rule #372
Delivery option to CLI and what to delivery with delivery params in rules that are needed to be delivered #376
Reference data model with validation #371
Added container path to install script #388
Changed¶
Delivery file format simplified #376
VEP rules have “all” and “pass” as output #376
Downloaded reference structure changed #371
genome/refseq.flat renamed to genome/refGene.flat #371
reverted CNVKit to version 0.9.4 #390
Fixed¶
Missing pygments to requirements.txt to fix travis CI #364
Wildcard resolve for deliveries of vep_germline #374
Missing index file from deliverables #383
Ambiguous deliveries in vep_somatic and ngs_filters #387
Updated documentation to match with installation #391
Removed¶
Temp files removed from list of outputs in vep.rule #372
samtools.rule and merged it with bwa_mem #375
[4.5.0]¶
Added¶
Models to build config case JSON. The models and descriptions of their contents can now be found in BALSAMIC/utils/models.py
Added analysis_type to report deliver command
Added report and delivery capability to Alignment workflow
run_validate.sh now has -d to handle path to analysis_dir (for internal use only) #361
Changed¶
Fastq files are no longer being copied as part of creation of the case config file. A symlink is now created at the destination path instead
Config structure is no longer contained in a collestion of JSON files. The config models are now built using Pydantic and are contained in BALSAMIC/utils/models.py
Removed¶
Removed command line option “–fastq-prefix” from config case command
Removed command line option “–config-path” from config case command. The config is now always saved with default name “case_id.json”
Removed command line option “–overwrite-config” from config-case command The command is now always executed with “–overwrite-config True” behavior
Refactored¶
Refactored BALSAMIC/commands/config/case.py: Utility functions are moved to BALSAMIC/utils/cli.py Models for config fields can be found at BALSAMIC/utils/models.py Context aborts and logging now contained in pilot function Tests created to support new architecture
Reduce analysis directory’s storage
Fixed¶
Report generation warnings supressed by adding workdirectory
Missing tag name for germline annotated calls #356
Bind path is not added as None if analysis type is wgs #357
Changes vardict to vardict-java #361
[4.4.0]¶
Added¶
pydantic to validate various models namely variant caller filters
Changed¶
Variant caller filters moved into pydantic
Install script and setup.py
refactored install script with more log output and added a conda env suffix option
refactored docker container and decoupled various parts of the workflow
[4.3.0]¶
Added¶
Added cram files for targeted sequencing runs fixes #286
Added mosdepth to calculate coverage for whole exome and targeted sequencing
Filter models added for tumor-only mode
Enabling adapter trim enables pe adapter trim option for fastp
Annotate germline variant calls
Baitset name to picard hsmetrics
Deprecated¶
Sambamba coverage and rules will be deprecated
Fixed¶
Fixed latest tag in install script
Fixed lack of naming final annotated VCF TUMOR/NORMAL
Changed¶
Increased run time for various slurm jobs fixes #314
Enabled SV calls for VarDict tumor-only
Updated ensembl-vep to v100.2
[4.2.4]¶
Fixed¶
Fixed sort issue with bedfiles after 100 slop
[4.2.3]¶
Added¶
Added Docker container definition for release and bumpversion
Changed¶
Quality of life change to rtfd docs
Fixed¶
Fix Docker container with faulty git checkout
[4.2.2]¶
Added¶
Add “SENTIEON_TMPDIR” to wgs workflow
[4.2.1]¶
Changed¶
Add docker container pull for correct version of install script
[4.2.0]¶
Added¶
CNV output as VCF
Vep output for PASSed variants
Report command with status and delivery subcommands
Changed¶
Bed files are slopped 100bp for variant calling fix #262
Disable vcfmerge
Picard markduplicate output moved from log to output
Vep upgraded to 99.1
Removed SVs from vardict
Refactored delivery plugins to produce a file with list of output files from workflow
Updated snakemake to 5.13
Fixed¶
Fixed a bug where threads were not sent properly to rules
Removed¶
Removed coverage annotation from mutect2
Removed source deactivate from rules to suppress conda warning
Removed
plugins delivery
subcommandRemoved annotation for germline caller results
[4.1.0]¶
Added¶
VEP now also produces a tab delimited file
CNVkit rules output genemetrics and gene break file
Added reference genome to be able to calculate AT/CG dropouts by Picard
coverage plot plugin part of issue #75
callable regions for CNV calling of tumor-only
Changed¶
Increased time for indel realigner and base recalib rules
decoupled vep stat from vep main rule
changed qsub command to match UGE
scout plugin updated
Fixed¶
WGS qc rules - updated with correct options (picard - CollectMultipleMetrics, sentieon - CoverageMetrics)
Log warning if WES workflow cannot find SENTIEON* env variables
Fixes issue with cnvkit and WGS samples #268
Fix #267 coverage issue with long deletions in vardict
[4.0.1] - 2019-11-08¶
Added¶
dependencies for workflow report
sentieon variant callers germline and somatic for wes cases
Changed¶
housekeeper file path changed from basename to absolute
scout template for sample location changed from delivery_report to scout
rule names added to benchmark files
[4.0.0] - 2019-11-04¶
SGE qsub support release
Added¶
install.sh
now also downloads latest containerDocker image for balsamic as part of ci
Support for qsub alongside with slurm on
run analysis --profile
Changed¶
Documentation updated
Test fastq data and test panel bed file with real but dummy data
[3.3.1] - 2019-10-28¶
Fixed¶
Various links for reference genome is updated with working URL
Config reference command now print correct output file
[3.3.0] - 2019-10-24¶
somatic vcfmerge release
Added¶
QC metrics for WGS workflow
refGene.txt download to reference.json and reference workflow
A new conda environment within container
A new base container built via Docker (centos7:miniconda3_4_6_14)
VCFmerge package as VCF merge rule (https://github.com/hassanfa/VCFmerge)
A container for develop branch
Benchmark rules to variant callers
Changed¶
SLURM resource allocation for various variancalling rules optimized
mergetype rule updated and only accepts one single tumor instead of multiple
[3.2.3] - 2019-10-24¶
Fixed¶
Removed unused output files from cnvkit which caused to fail on targetted analysis
[3.2.2] - 2019-10-23¶
Fixed¶
Removed target file from cnvkit batch
[3.2.1] - 2019-10-23¶
Fixed¶
CNVkit single missing reference file added
[3.2.0] - 2019-10-11¶
Adds:¶
CNVkit to WGS workflow
get_thread for runs
Changed:¶
Optimized resources for SLURM jobs
Removed:¶
Removed hsmetrics for non-mark duplicate bam files
[3.1.4] - 2019-10-08¶
Fixed¶
Fixes a bug where missing capture kit bed file error for WGS cases
[3.1.3] - 2019-10-07¶
Fixed¶
benchmark path bug issue #221
[3.1.2] - 2019-10-07¶
Fixed¶
libreadline.so.6 symlinking and proper centos version for container
[3.1.1] - 2019-10-03¶
Fixed¶
Proper tag retrieval for release ### Changed
BALSAMIC container change to latest and version added to help line
[3.1.0] - 2019-10-03¶
TL;DR:
QoL changes to WGS workflow
Simplified installation by moving all tools to a container
Added¶
Benchmarking using psutil
ML variant calling for WGS
--singularity
option toconfig case
andconfig reference
Fixed¶
Fixed a bug with boolean values in analysis.json
Changed¶
install.sh
simplified and will be depricatedSingularity container updated
Common somatic and germline variant callers are put in single file
Variant calling workflow and analysis config files merged together
Removed¶
balsamic install
is removedConda environments for py36 and py27 are removed
[3.0.1] - 2019-09-11¶
Fixed¶
Permissions on
analysis/qc
dir are 777 now
[3.0.0] - 2019-09-05¶
This is major release. TL;DR:
Major changes to CLI. See documentation for updates.
New additions to reference generation and reference config file generation and complete overhaul
Major changes to reposityory structure, conda environments.
Added¶
Creating and downloading reference files:
balsamic config reference
andbalsamic run reference
Container definitions for install and running BALSAMIC
Bunch of tests, setup coveralls and travis.
Added Mutliqc, fastp to rule utilities
Create Housekeeper and Scout files after analysis completes
Added Sentieon tumor-normal and tumor only workflows
Added trimming option while creating workflow
Added multiple tumor sample QC analysis
Added pindle for indel variant calling
Added Analysis finish file in the analysis directory
Fixed¶
Multiple fixes to snakemake rules
Changed¶
Running analysis through:
balsamic run analysis
Cluster account and email info added to
balsamic run analysis
umi
workflow through--umi
tag. [workflow still in evaluation]sample-id
replaced bycase-id
Plan to remove FastQC as well
Removed¶
balsamic config report
andbalsamic report
sample.config
andreference.json
from config directoryRemoved cutadapt from workflows
[2.9.8] - 2019-01-01¶
Fixed¶
picard hsmetrics now has 50000 cov max
cnvkit single wildcard resolve bug fixed
[2.9.7] - 2019-02-28¶
Fixed¶
Various fixes to umi_single mode
analysis_finish file does not block reruns anymore
Added missing single_umi to analysis workflow cli
Changed¶
vardict in single mode has lower AF threshold filter (0.005 -> 0.001)
[2.9.6] - 2019-02-25¶
Fixed¶
Reference to issue #141, fix for 3 other workflows
CNVkit rule update for refflat file
[2.9.5] - 2019-02-25¶
Added¶
An analysis finish file is generated with date and time inside (%Y-%M-%d T%T %:z)
[2.9.4] - 2019-02-13¶
Fixed¶
picard version update to 2.18.11 github.com/hassanfa/picard
[2.9.3] - 2019-02-12¶
Fixed¶
Mutect single mode table generation fix
Vardict single mode MVL annotation fix
[2.9.2] - 2019-02-04¶
Added¶
CNVkit single sample mode now in workflow
MVL list from cheng et al. 2015 moved to assets
[2.9.1] - 2019-01-22¶
Added¶
Simple table for somatic variant callers for single sample mode added
Fixed¶
Fixes an issue with conda that unset variables threw an error issue #141
[2.9.0] - 2019-01-04¶
Changed¶
Readme structure and example
Mutect2’s single sample output is similar to paired now
cli path structure update
Added¶
test data and sample inputs
A dag PDF will be generated when config is made
umi specific variant calling
[2.8.1] - 2018-11-28¶
Fixed¶
VEP’s perl module errors
CoverageRep.R now properly takes protein_coding transcatipts only
[2.8.0] - 2018-11-23¶
UMI single sample align and QC
Added¶
Added rules and workflows for UMI analysis: QC and alignment
[2.7.4] - 2018-11-23¶
Germline single sample
Added¶
Germline single sample addition ### Changed
Minor fixes to some rules to make them compatible with tumor mode
[2.7.3] - 2018-11-20¶
Fixed¶
Various bugs with DAG to keep popvcf and splitbed depending on merge bam file
install script script fixed and help added
[2.7.2] - 2018-11-15¶
Changed¶
Vardict, Strelka, and Manta separated from GATK best practice pipeline
[2.7.1] - 2018-11-13¶
Fixed¶
minro bugs with strelka_germline and freebayes merge ### Changed
removed ERC from haplotypecaller
[2.7.0] - 2018-11-08¶
Germline patch
Added¶
Germline caller tested and added to the paired analysis workflow: Freebayes, HaplotypeCaller, Strelka, Manta
Changed¶
Analysis config files updated
Output directory structure changed
vep rule is now a single rule
Bunch of rule names updated and shortened, specifically in Picard and GATK
Variant caller rules are all updated and changed
output vcf file names are now more sensible: {SNV,SV}.{somatic,germline}.sampleId.variantCaller.vcf.gz
Job limit increased to 300
Removed¶
removed bcftools.rule for var id annotation
Changed¶
Fixed¶
[2.6.3] - 2018-11-01¶
Changed¶
Ugly and godforsaken
runSbatch.py
is now dumping sacct files with job IDs. Yikes!
[2.6.2] - 2018-10-31¶
Fixed¶
added
--fastq-prefix
option forconfig sample
to set fastq prefix name. Linking is not changed.
[2.6.1] - 2018-10-29¶
Fixed¶
patched a bug for copying results for strelka and manta which was introduced in
2.5.0
[2.5.0] - 2018-10-22¶
Changed¶
variant_panel
changed tocapture_kit
sample config file takes balsamic version
bioinfo tool config moved bioinfotool to cli_utils from
config report
Added¶
bioinfo tool versions is now added to analysis config file
[2.4.0] - 2018-10-22¶
Changed¶
balsamic run
has 3 stop points: paired variant calling, single mode variant calling, and QC/Alignment mode.balsamic run [OPTIONS] -S ...
is depricated, but it supersedesanalysis_type
mode if provided.
[2.3.3] - 2018-10-22¶
Added¶
CSV output for variants in each variant caller based on variant filters
DAG image of workflow ### Changed
Input for variant filter has a default value
delivery_report
is no created during config generationVariant reporter R script cmd updated in
balsamic report
[2.3.2] - 2018-10-19¶
Changed¶
Fastq files are now always linked to
fastq
directory within the analysis directory
Added¶
balsamic config sample
now accepts individual files and paths. See README for usage.
[2.3.1] - 2018-09-25¶
Added¶
CollectHSmetric now run twice for before and after markduplicate
[2.3.0] - 2018-09-25¶
Changed¶
Sample config file now includes a list of chromosomes in the panel bed file
Fixed¶
Non-matching chrom won’t break the splitbed rule anymore
collectqc rules now properly parse tab delimited metric files
[2.2.0] - 2018-09-11¶
Added¶
Coverage plot to report
target coverage file to report json
post-cutadapt fastqc to collectqc
A header to report pdf
list of bioinfo tools used in the analysis added to report ### Changed
VariantRep.R now accepts multiple inputs for each parameter (see help)
AF values for MSKIMPACT config ### Fixed
Output figure for coverageplot is now fully square :-)
[2.1.0] - 2018-09-11¶
Added¶
normalized coverage plot script
fastq file IO check for config creation
added qos option to
balsamic run
### FixedSambamba depth coverage parameters
bug with picard markduplicate flag
[2.0.2] - 2018-09-11¶
Added¶
Added qos option for setting qos to run jobs with a default value of low
[2.0.1] - 2018-09-10¶
Fixed¶
Fixed package dependencies with vep and installation
[2.0.0] - 2018-09-05¶
Variant reporter patch and cli update
Added¶
Added
balsamic config sample
andbalsamic config report
to generate run analysis and reporting configAdded
VariantRep.R
script to information from merged variant table: variant summry, TMB, and much moreAdded a workflow for single sample mode alignment and QC only
Added QC skimming script to qccollect to generate nicely formatted information from picard ### Changed
Change to CLI for running and creating config
Major overhaul to coverage report script. It’s now simpler and more readable! ### Fixed
Fixed sambamba depth to include mapping quality
Markduplicate now is now by default on marking mode, and will NOT remove duplicates
Minor formatting and script beautification happened
[1.13.1] - 2018-08-17¶
Fixed¶
fixed a typo in MSKMVL config
fixed a bug in strelka_simple for correct column orders
[1.13.0] - 2018-08-10¶
Added¶
rule for all three variant callers for paired analysis now generate a simple VCF file
rule for all three variant callers for paired analysis to convert VCF into table format
MVL config file and MVL annotation to VCF calls for SNV/INDEL callers
CALLER annotation added to SNV/INDEL callers
exome specific option for strelka paired
create_config subcommand is now more granular, it accepts all enteries from sample.json as commandline arguments
Added tabQuery to the assets as a tool to query the tabulated output of summarized VCF
Added MQ annotation field to Mutect2 output see #67 ### Changed
Leaner VCF output from mutect2 with coverage and MQ annotation according to #64
variant ids are now updated from simple VCF file ### Fixed
Fixed a bug with sambamba depth coverage reporting wrong exon and panel coverage see #68
The json output is now properly formatted using yapf
Strelka rule doesn’t filter out PASS variants anymore fixes issue #63
[1.12.0] - 2018-07-06¶
Coverage report patch
Added¶
Added a new script to retrieve coverage report for a list of gene(s) and transcripts(s)
Added sambamba exon depth rule for coverage report
Added a new entry in reference json for exon bed file, this file generated using: https://github.com/hassanfa/GFFtoolkit ### Changed
sambamba_depth rule changed to sambama_panel_depth
sambamba depth now has fix-mate-overlaps parameter enabled
sambamba string filter changed to
unmapped or mate\_is\_unmapped) and not duplicate and not failed\_quality\_control
.sambamba depth for both panel and exon work on picard flag (rmdup or mrkdup). ### Fixed
Fixed sambamba panel depth rule for redundant coverage parameter
[1.11.0] - 2018-07-05¶
create config patch for single and paired mode
Changed¶
create_config is now accepting a paired|single mode instead of analysis json template (see help for changes). It is not backward compatible ### Added
analysis_{paired single}.json for creating config. Analysis.json is now obsolete. ### Fixed
A bug with writing output for analysis config, and creating the path if it doesn’t exist.
A bug with manta rule to correctly set output files in config.
A bug that strelka was still included in sample analysis.
[1.10.0] - 2018-06-07¶
Added¶
Markduplicate flag to analysis config
[1.9.0] - 2018-06-04¶
Added¶
Single mode for vardict, manta, and mutect.
merge type for tumor only ### Changed
Single mode variant calling now has all variant calling rules ### Fixed
run_analaysis now accepts workflows for testing pyrposes
[1.8.0] - 2018-06-01¶
Changed¶
picard create bed interval rule moved into collect hsmetric
split bed is dependent on bam merge rule
vardict env now has specific build rather than URL download (conda doesn’t support URLs anymore) ### Fixed
new logs and scripts dirs are not re-created if they are empty
[1.7.0] - 2018-05-31¶
Added¶
A source altered picard to generated more quality metrics output is added to installation and rules
[1.6.0] - 2018-05-30¶
Added¶
report subcommand for generating a pdf report from a json input file
Added fastqc after removing adapter ### Changed
Markduplicate now has both REMOVE and MARK (rmdup vs mrkdup)
CollectHSMetrics now has more steps on PCT_TARGET_BASES
[1.5.0] - 2018-05-28¶
Changed¶
New log and script directories are now created for each re-run ### Fixed
Picardtools’ memory issue addressed for large samples
[1.4.0] - 2018-05-18¶
Added¶
single sample analysis mode
alignment and insert size metrics are added to the workflow ### Changed
collectqc and contest have their own rule for paired (tumor vs normal) and single (tumor only) sample.
[1.3.0] - 2018-05-13¶
Added¶
bed file for panel analysis is now mandatory to create analaysis config
[1.2.3] - 2018-05-13¶
Changed¶
vep execution path
working directory for snakemake
[1.2.2] - 2018-05-04¶
Added¶
sbatch submitter and cluster config now has an mail field ### Changed
create_config
now only requires sample and output json. The rest are optional
[1.2.0] - 2018-05-02¶
Added¶
snakefile and cluster config in run analysis are now optional with a default value
[1.1.2] - 2018-04-27¶
Fixed¶
vardict installation was failing without conda-forge channel
gatk installation was failing without correct jar file
[1.1.1] - 2018-04-27¶
Fixed¶
gatk-register tmp directory
[1.1.0] - 2018-04-26¶
Added¶
create config sub command added as a new feature to create input config file
templates to generate a config file for analysis added
code style template for YAPF input created. see: https://github.com/google/yapf
vt conda env added
Changed¶
install script changed to create an output config
README updated with usage
Fixed¶
fastq location for analysis config is now fixed
lambda rules removed from cutadapt and fastq
[1.0.3-rc2] - 2018-04-18¶
Added¶
Added sbatch submitter to handle it outside snakemake ### Changed
sample config file structure changed
coding styles updated
[1.0.2-rc2] - 2018-04-17¶
Added¶
Added vt environment ### Fixed
conda envs are now have D prefix instead of P (develop vs production)
install_conda subcommand now accepts a proper conda prefix
[1.0.1-rc2] - 2018-04-16¶
Fixed¶
snakemake rules are now externally linked
[1.0.0-rc2] - 2018-04-16¶
Added¶
run_analysis subcommand
Mutational Signature R script with CLI
unittest to install_conda
a method to semi-dynamically retrieve suitable conda env for each rule
Fixed¶
install.sh updated with gatk and proper log output
conda environments updated
vardict now has its own environment and it should not raise anymore errors
[1.0.0-rc1] - 2018-04-05¶
Added¶
install.sh to install balsamic
balsamic barebone cli
subcommand to install required environments
README.md updated with basic installation instructions
Fixed¶
conda environment yaml files
Other resources¶
Resources¶
Main resources including knowledge base and databases necessary for pipeline development
MSK-Impact pipeline: https://www.mskcc.org/msk-impact
dbSNP: Database of single nucleotide polymorphisms (SNPs) and multiple small-scale variations that include insertions/deletions, microsatellites, and non-polymorphic variants. https://www.ncbi.nlm.nih.gov/snp/ Download link: ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b150_GRCh38p7/VCF/All_20170710.vcf.gz
ClinVar: ClinVar aggregates information about genomic variation and its relationship to human health. https://www.ncbi.nlm.nih.gov/clinvar/ Download link: ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20171029.vcf.gz
ExAC: The Exome Aggregation Consortium (ExAC) is a coalition of investigators seeking to aggregate and harmonize exome sequencing data from a wide variety of large-scale sequencing projects, and to make summary data available for the wider scientific community. http://exac.broadinstitute.org/ Download link: ftp://ftp.broadinstitute.org/pub/ExAC_release/release1/ExAC.r1.sites.vep.vcf.gz
GTEx: The Genotype-Tissue Expression (GTEx) project aims to provide to the scientific community a resource with which to study human gene expression and regulation and its relationship to genetic variation. https://gtexportal.org/static/ Download URL by applying through: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000424.v6.p1
OMIM: OMIM®, Online Mendelian Inheritance in Man®, An Online Catalog of Human Genes and Genetic Disorders. https://www.omim.org/ Download link: https://omim.org/downloads/ (registration required)
Drug resistance: An effort by Cosmic to annotate mutations identified in the literature as resistance mutations, including those conferring acquired resistance (after treatment) and intrinsic resistance (before treatment). Available through Cosmic: http://cancer.sanger.ac.uk/cosmic/drug_resistance
Mutational signatures: Signatures of Mutational Processes in Human Cancer. Available through Cosmic: http://cancer.sanger.ac.uk/cosmic/signatures
DGVa: The Database of Genomic Variants archive (DGVa) is a repository that provides archiving, accessioning and distribution of publicly available genomic structural variants, in all species. https://www.ebi.ac.uk/dgva
Cancer genomics workflow: MGI’s CWL Cancer Pipelines. https://github.com/genome/cancer-genomics-workflow/wiki
GIAB: The priority of GIAB is authoritative characterization of human genomes for use in analytical validation and technology development, optimization, and demonstration. http://jimb.stanford.edu/giab/ and https://github.com/genome-in-a-bottle Download links: http://jimb.stanford.edu/giab-resources
dbNSFP: dbNSFP is a database developed for functional prediction and annotation of all potential non-synonymous single-nucleotide variants (nsSNVs) in the human genome. https://sites.google.com/site/jpopgen/dbNSFP
1000Genomes: The goal of the 1000 Genomes Project was to find most genetic variants with frequencies of at least 1% in the populations studied. http://www.internationalgenome.org/ Download link: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/
HapMap3: The International HapMap Project was an organization that aimed to develop a haplotype map (HapMap) of the human genome, to describe the common patterns of human genetic variation. HapMap 3 is the third phase of the International HapMap project. http://www.sanger.ac.uk/resources/downloads/human/hapmap3.html Download link: ftp://ftp.ncbi.nlm.nih.gov/hapmap/
GRCh38.p11: GRCh38.p11 is the eleventh patch release for the GRCh38 (human) reference assembly. https://www.ncbi.nlm.nih.gov/grc/human Download link: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/
dbVar: dbVar is NCBI’s database of genomic structural variation – insertions, deletions, duplications, inversions, mobile element insertions, translocations, and complex chromosomal rearrangements https://www.ncbi.nlm.nih.gov/dbvar Download link: https://www.ncbi.nlm.nih.gov/dbvar/content/ftp_manifest/
Drug sensitivity in cancer: Identifying molecular features of cancers that predict response to anti-cancer drugs. http://www.cancerrxgene.org/ Download link: ftp://ftp.sanger.ac.uk/pub4/cancerrxgene/releases
VarSome: VarSome is a knowledge base and aggregator for human genomic variants. https://varsome.com/about/
Google Genomics Public Data: Google Genomics helps the life science community organize the world’s genomic information and make it accessible and useful. and http://googlegenomics.readthedocs.io
Sample datasets¶
TCRB: he Texas Cancer Research Biobank (TCRB) was created to bridge the gap between doctors and scientific researchers to improve the prevention, diagnosis and treatment of cancer. This work occurred with funding from the Cancer Prevention & Research Institute of Texas (CPRIT) from 2010-2014. http://txcrb.org/data.html Article: https://www.nature.com/articles/sdata201610
Relevant publications¶
Including methodological benchmarking
MSK-IMPACT:
Original pipeline: Cheng, D. T., Mitchell, T. N., Zehir, A., Shah, R. H., Benayed, R., Syed, A., … Berger, M. F. (2015). Memorial sloan kettering-integrated mutation profiling of actionable cancer targets (MSK-IMPACT): A hybridization capture-based next-generation sequencing clinical assay for solid tumor molecular oncology. Journal of Molecular Diagnostics, 17(3), 251–264. https://doi.org/10.1016/j.jmoldx.2014.12.006
Case study: Cheng, D. T., Prasad, M., Chekaluk, Y., Benayed, R., Sadowska, J., Zehir, A., … Zhang, L. (2017). Comprehensive detection of germline variants by MSK-IMPACT, a clinical diagnostic platform for solid tumor molecular oncology and concurrent cancer predisposition testing. BMC Medical Genomics, 10(1), 33. https://doi.org/10.1186/s12920-017-0271-4
Case study: Zehir, A., Benayed, R., Shah, R. H., Syed, A., Middha, S., Kim, H. R., … Berger, M. F. (2017). Mutational landscape of metastatic cancer revealed from prospective clinical sequencing of 10,000 patients. Nature Medicine, 23(6), 703–713. https://doi.org/10.1038/nm.4333
Application of MSK-IMPACT: Zehir, A., Benayed, R., Shah, R. H., Syed, A., Middha, S., Kim, H. R., … Berger, M. F. (2017). Mutational landscape of metastatic cancer revealed from prospective clinical sequencing of 10,000 patients. Nature Medicine, 23(6), 703–713. https://doi.org/10.1038/nm.4333
Review on bioinformatic pipelins: Leipzig, J. (2017). A review of bioinformatic pipeline frameworks. Briefings in Bioinformatics, 18(3), 530–536. https://doi.org/10.1093/bib/bbw020
Mutational signature reviews:
Helleday, T., Eshtad, S., & Nik-Zainal, S. (2014). Mechanisms underlying mutational signatures in human cancers. Nature Reviews Genetics, 15(9), 585–598. https://doi.org/10.1038/nrg3729
Alexandrov, L. B., & Stratton, M. R. (2014). Mutational signatures: The patterns of somatic mutations hidden in cancer genomes. Current Opinion in Genetics and Development, 24(1), 52–60. https://doi.org/10.1016/j.gde.2013.11.01
Review on structural variation detection tools:
Lin, K., Bonnema, G., Sanchez-Perez, G., & De Ridder, D. (2014). Making the difference: Integrating structural variation detection tools. Briefings in Bioinformatics, 16(5), 852–864. https://doi.org/10.1093/bib/bbu047
Tattini, L., D’Aurizio, R., & Magi, A. (2015). Detection of Genomic Structural Variants from Next-Generation Sequencing Data. Frontiers in Bioengineering and Biotechnology, 3(June), 1–8. https://doi.org/10.3389/fbioe.2015.00092
Two case studies and a pipeline (unpublished): Noll, A. C., Miller, N. A., Smith, L. D., Yoo, B., Fiedler, S., Cooley, L. D., … Kingsmore, S. F. (2016). Clinical detection of deletion structural variants in whole-genome sequences. Npj Genomic Medicine, 1(1), 16026. https://doi.org/10.1038/npjgenmed.2016.26
Review on driver gene methods: Tokheim, C. J., Papadopoulos, N., Kinzler, K. W., Vogelstein, B., & Karchin, R. (2016). Evaluating the evaluation of cancer driver genes. Proceedings of the National Academy of Sciences, 113(50), 14330–14335. https://doi.org/10.1073/pnas.1616440113
Resource, or general notable papers including resource and KB papers related to cancer genomics
GIAB: Zook, J. M., Catoe, D., McDaniel, J., Vang, L., Spies, N., Sidow, A., … Salit, M. (2016). Extensive sequencing of seven human genomes to characterize benchmark reference materials. Scientific Data, 3, 160025. https://doi.org/10.1038/sdata.2016.25
Methods and tools¶
Excluding multiple method comparison or benchmarking tools
BreakDancer: Chen, K., Wallis, J. W., Mclellan, M. D., Larson, D. E., Kalicki, J. M., Pohl, C. S., … Elaine, R. (2013). BreakDancer - An algorithm for high resolution mapping of genomic structure variation. Nature Methods, 6(9), 677–681. https://doi.org/10.1038/nmeth.1363.BreakDancer
Pindel: Ye, K., Schulz, M. H., Long, Q., Apweiler, R., & Ning, Z. (2009). Pindel: A pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics, 25(21), 2865–2871. https://doi.org/10.1093/bioinformatics/btp394
SVDetect: Zeitouni, B., Boeva, V., Janoueix-Lerosey, I., Loeillet, S., Legoix-né, P., Nicolas, A., … Barillot, E. (2010). SVDetect: A tool to identify genomic structural variations from paired-end and mate-pair sequencing data. Bioinformatics, 26(15), 1895–1896. https://doi.org/10.1093/bioinformatics/btq293
Purityest: Su, X., Zhang, L., Zhang, J., Meric-bernstam, F., & Weinstein, J. N. (2012). Purityest: Estimating purity of human tumor samples using next-generation sequencing data. Bioinformatics, 28(17), 2265–2266. https://doi.org/10.1093/bioinformatics/bts365
PurBayes: Larson, N. B., & Fridley, B. L. (2013). PurBayes: Estimating tumor cellularity and subclonality in next-generation sequencing data. Bioinformatics, 29(15), 1888–1889. https://doi.org/10.1093/bioinformatics/btt293
ANNOVAR: Wang, K., Li, M., & Hakonarson, H. (2010). ANNOVAR: Functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Research, 38(16), 1–7. https://doi.org/10.1093/nar/gkq603
ASCAT: Van Loo, P., Nordgard, S. H., Lingjaerde, O. C., Russnes, H. G., Rye, I. H., Sun, W., … Kristensen, V. N. (2010). Allele-specific copy number analysis of tumors. Proceedings of the National Academy of Sciences, 107(39), 16910–16915. https://doi.org/10.1073/pnas.1009843107
Treeomics: Reiter, J. G., Makohon-Moore, A. P., Gerold, J. M., Bozic, I., Chatterjee, K., Iacobuzio-Donahue, C. A., … Nowak, M. A. (2017). Reconstructing metastatic seeding patterns of human cancers. Nature Communications, 8, 14114. https://doi.org/10.1038/ncomms14114
deconstructSigs: Rosenthal, R., McGranahan, N., Herrero, J., Taylor, B. S., & Swanton, C. (2016). deconstructSigs: delineating mutational processes in single tumors distinguishes DNA repair deficiencies and patterns of carcinoma evolution. Genome Biology, 17(1), 31. https://doi.org/10.1186/s13059-016-0893-4
MutationalPatterns: Blokzijl, F., Janssen, R., van Boxtel, R., & Cuppen, E. (2017). MutationalPatterns: comprehensive genome-wide analysis of mutational processes. bioRxiv, 1–20. https://doi.org/https://doi.org/10.1101/071761
MaSuRCA: Zimin, A. V., Marçais, G., Puiu, D., Roberts, M., Salzberg, S. L., & Yorke, J. A. (2013). The MaSuRCA genome assembler. Bioinformatics, 29(21), 2669–2677. https://doi.org/10.1093/bioinformatics/btt476
VarDict: Lai, Z., Markovets, A., Ahdesmaki, M., Chapman, B., Hofmann, O., Mcewen, R., … Dry, J. R. (2016). VarDict: A novel and versatile variant caller for next-generation sequencing in cancer research. Nucleic Acids Research, 44(11), 1–11. https://doi.org/10.1093/nar/gkw227
vt: Tan, A., Abecasis, G. R., & Kang, H. M. (2015). Unified representation of genetic variants. Bioinformatics, 31(13), 2202–2204. https://doi.org/10.1093/bioinformatics/btv112
peddy: Pedersen, B. S., & Quinlan, A. R. (2017). Who’s Who? Detecting and Resolving Sample Anomalies in Human DNA Sequencing Studies with Peddy. American Journal of Human Genetics, 100(3), 406–413. https://doi.org/10.1016/j.ajhg.2017.01.017
GQT: Layer, R. M., Kindlon, N., Karczewski, K. J., & Quinlan, A. R. (2015). Efficient genotype compression and analysis of large genetic-variation data sets. Nature Methods, 13(1). https://doi.org/10.1038/nmeth.3654
Tool sets and softwares required at various steps of pipeline development
Teaser: NGS readmapping benchmarking.
FastQC: Quality control tool. https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Cutadapt: Adapter removal tool. https://cutadapt.readthedocs.io/en/stable/
Trim Galore!: FastQC and Cutadapt wrapper. https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/
Picardtools: BAM/SAM/VCF/CRAM manipulator. http://broadinstitute.github.io/picard/
MarkDuplicate: Mark duplicate reads and potentially remove them
LiftoverVcf: liftover VCF between builds
CollectHsMetric: Collects hybrid-selection (HS) metrics for a SAM or BAM file
CollectAlignmentSummaryMetrics: Produces a summary of alignment metrics from a SAM or BAM file
CollectGcBiasMetrics: Collect metrics regarding GC bias
CollectWgsMetrics: Collect metrics about coverage and performance of whole genome sequencing (WGS) experiments
GATK: A variant discovery tool: https://software.broadinstitute.org/gatk/
BaseRecalibrator: Detect systematic error in base quality score
Somatic Indel Realigner: Local Realignment around Indels
ContEst: Estimate cross sample contamination
DepthOfCoverage: Assess sequence coverage by sample, read group, or libraries
DuplicateReadFilter: remove duplicated from flag set by MarkDuplicates
Samtools: Reading/writing/editing/indexing/viewing SAM/BAM/CRAM format http://www.htslib.org/
Sambamba: Tools for working with SAM/BAM/CRAM data http://lomereiter.github.io/sambamba/
bcftools: Reading/writing BCF2/VCF/gVCF files and calling/filtering/summarising SNP and short indel sequence variants http://www.htslib.org/doc/bcftools.html
vcftools: VCFtools is a program package designed for working with VCF files, such as those generated by the 1000 Genomes Project. https://vcftools.github.io/index.html
Delly2: An integrated structural variant prediction method that can discover, genotype and visualize deletions, tandem duplications, inversions and translocations https://github.com/dellytools/delly
PLINK: PLINK: Whole genome data analysis toolset https://www.cog-genomics.org/plink2
freebayes: a haplotype-based variant detector. https://github.com/ekg/freebayes
AscatNGS: Allele-Specific Copy Number Analysis of Tumors, tumor purity and ploidy https://github.com/cancerit/ascatNgs
MutationalPatterns: R package for extracting and visualizing mutational patterns in base substitution catalogues https://github.com/UMCUGenetics/MutationalPatterns
desconstructSigs: identification of mutational signatures within a single tumor sample https://github.com/raerose01/deconstructSigs
treeOmics: Decrypting somatic mutation patterns to reveal the evolution of cancer https://github.com/johannesreiter/treeomics
controlFreeC: Copy number and allelic content caller http://boevalab.com/FREEC/
MuTect2: Call somatic SNPs and indels via local re-assembly of haplotypes https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_cancer_m2_MuTect2.php
Annovar: annotation of detected genetic variation http://annovar.openbioinformatics.org/en/latest/
Strelka: Small variant caller https://github.com/Illumina/strelka
Manta: Structural variant caller https://github.com/Illumina/manta
PurBayes: estimate tumor purity and clonality
VarDict: variant caller for both single and paired sample variant calling from BAM files https://github.com/AstraZeneca-NGS/VarDict
SNPeff/SNPSift: Genomic variant annotations and functional effect prediction toolbox. http://snpeff.sourceforge.net/ and http://snpeff.sourceforge.net/SnpSift.html
IGV: visualization tool for interactive exploration http://software.broadinstitute.org/software/igv/
SVDetect: a tool to detect genomic structural variations http://svdetect.sourceforge.net/Site/Home.html
GenomeSTRiP: A suite of tools for discovering and genotyping structural variations using sequencing data http://software.broadinstitute.org/software/genomestrip/
BreakDancer: SV detection from paired end reads mapping https://github.com/genome/breakdancer
pIndel: Detect breakpoints of large deletions, medium sized insertions, inversions, and tandem duplications https://github.com/genome/pindel
VarScan: Variant calling and somatic mutation/CNV detection https://github.com/dkoboldt/varscan
VEP: Variant Effect Predictor https://www.ensembl.org/info/docs/tools/vep/index.html
Probablistic2020: Simulates somatic mutations, and calls statistically significant oncogenes and tumor suppressor genes based on a randomization-based test https://github.com/KarchinLab/probabilistic2020
2020plus: Classifies genes as an oncogene, tumor suppressor gene, or as a non-driver gene by using Random Forests https://github.com/KarchinLab/2020plus
vtools: variant tools is a software tool for the manipulation, annotation, selection, simulation, and analysis of variants in the context of next-gen sequencing analysis. http://varianttools.sourceforge.net/Main/HomePage
vt: A variant tool set that discovers short variants from Next Generation Sequencing data. https://genome.sph.umich.edu/wiki/Vt and https://github.com/atks/vt
CNVnator: a tool for CNV discovery and genotyping from depth-of-coverage by mapped reads. https://github.com/abyzovlab/CNVnator
SvABA: Structural variation and indel detection by local assembly. https://github.com/walaj/svaba
indelope: find indels and SVs too small for structural variant callers and too large for GATK. https://github.com/brentp/indelope
peddy: peddy compares familial-relationships and sexes as reported in a PED/FAM file with those inferred from a VCF. https://github.com/brentp/peddy
cyvcf2: cyvcf2 is a cython wrapper around htslib built for fast parsing of Variant Call Format (VCF) files. https://github.com/brentp/cyvcf2
GQT: Genotype Query Tools (GQT) is command line software and a C API for indexing and querying large-scale genotype data sets. https://github.com/ryanlayer/gqt
LOFTEE: Loss-Of-Function Transcript Effect Estimator. A VEP plugin to identify LoF (loss-of-function) variation. Assesses variants that are: Stop-gained, Splice site disrupting, and Frameshift variants. https://github.com/konradjk/loftee
PureCN: copy number calling and SNV classification using targeted short read sequencing https://bioconductor.org/packages/release/bioc/html/PureCN.html
SVCaller: A structural variant caller. https://github.com/tomwhi/svcaller
SnakeMake: A workflow manager. http://snakemake.readthedocs.io/en/stable/index.html
BWA: BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM. http://bio-bwa.sourceforge.net/
wgsim: Wgsim is a small tool for simulating sequence reads from a reference genome. It is able to simulate diploid genomes with SNPs and insertion/deletion (INDEL) polymorphisms, and simulate reads with uniform substitution sequencing errors. https://github.com/lh3/wgsim
dwgsim: Whole genome simulation can be performed with dwgsim. dwgsim is based off of wgsim found in SAMtools. https://github.com/nh13/DWGSIM
ABSOLUTE: ABSOLUTE can estimate purity/ploidy, and from that compute absolute copy-number and mutation multiplicities. http://archive.broadinstitute.org/cancer/cga/absolute
THetA: Tumor Heterogeneity Analysis. This algorithm estimates tumor purity and clonal/subclonal copy number aberrations directly from high-throughput DNA sequencing data. https://github.com/raphael-group/THetA
Skewer: Adapter trimming, similar to cutadapt. https://github.com/relipmoc/skewer
Phylowgs: Application for inferring subclonal composition and evolution from whole-genome sequencing data. https://github.com/morrislab/phylowgs
superFreq: SuperFreq is an R package that analyses cancer exomes to track subclones. https://github.com/ChristofferFlensburg/superFreq
readVCF-r: Read VCFs into R and annotatte them. https://bioconductor.org/packages/release/bioc/html/VariantAnnotation.html
vcfr: Read VCFs into R. https://github.com/knausb/vcfR
msisensor: microsatellite instability detection using paired tumor-normal https://github.com/ding-lab/msisensor
MOSAIC: MicrOSAtellite Instability Classifier https://github.com/ronaldhause/mosaic
MANTIS: Microsatellite Analysis for Normal-Tumor InStability https://github.com/OSU-SRLab/MANTIS
SBDB: A toolkit for constricting and querying structural variant databases https://github.com/J35P312/SVDB
Git Etiquette¶
It is recommended to follow a system to standardize the commit messages loosely. Following up from commit messages discussed on https://github.com/Clinical-Genomics/development/pull/97 , the format below is recommended for commit messages:
Code formatting¶
BALSAMIC is using Black (https://github.com/psf/black) as code formatter.
Conventional commits and PRs¶
PRs should follow the following keywords in the title: https://www.conventionalcommits.org/en/v1.0.0/
Commit messages are recommended to following the following similar to PRs:
feat: Introducing a new features. This includes but not limited to workflows, SnakeMake rule, cli, and plugins. In other words, anything that is new and fundamental change will also go here. Enhancements and optimizations will go into refactor.
fix: This is essentially a patch. Included but not limited to: bug fixes, hotfixes, and any patch to address a known issue.
doc: Any changes to the documentation are part of doc subject line, included but not limited to docstrings, cli-help, readme, tutorial, documentation, CHANGELOG, and addition of ipython/jupyter notebook in the form of tutorial.
test: Any changes to the tests are part of test subject line. This includes adding, removing or updating of the following: unittests, validation/verification dataset, and test related configs.
refactor: Refactoring refers to a rather broad term. Any style changes, code enhancement, and analysis optimization.
version: Any changes to .bumpversion config and or change of version will be specified with this. This includes comments within .bumpversion, structure of .bumpversion, etc.
Scope¶
Scope is specified within parenthesis. It show the scope of the subject line. The following scope are valid:
cli
style
rule (refers to SnakeMake rules)
workflow (refer to SnakeMake workflows)
config (refers to configs that are either used or generated by BALSAMIC)
Relevant scopes that might fit into a scope description
Note: If scope is broad or matching with multiple (it shouldn’t, but if it does) one can leave out the scope.
Message¶
It’s better to start Git commit message with the following words:
added
removed
updated
Snakemake Etiquette¶
The bioinformatics core analysis in BALSAMIC is defined by set of rules written as a Snakemake rules (*.rule
) and Snakemake
workflow as (*.smk
). Main balsamic.smk
workflow uses these rules to create sets of output files from sets of input files.
Using {wildcards}
Snakemake can automatically determine the dependencies between the rules by matching file names. The
following guidelines describe the general conventions for naming and order of the rules, while writing a Snakemake file in
BALSAMIC. For further description of how Snakemake works, please refer to Snakemake official documentation: https://snakemake.readthedocs.io/
Structure of Snakemake rules¶
rule <program>_<function>_<optional_tag>_<optional_tag>:
input:
<named_input_1> = ...,
<named_input_2> = ...,
output:
<named_output_1> = ...,
benchmark:
Path(benchmark_dir, "<rule_name>_<{sample}/{case_name}>.tsv").as_posix()
singularity:
singularity_image
params:
<named_param_1> = ...,
<named_param_1> = ...,
threads:
get_threads(cluster_config, '<rule_name>')
message:
("Align fastq files with bwa-mem to reference genome and sort using samtools for sample: {sample}"
"<second line is allowed to cover more description>")
shell:
"""
<first_command> <options>;
<second_command> <options>;
<a_long_command> <--option-1> <value_1> \
<--option-2> <value_2> \
<--option-3> <value_3>;
"""
Descriptions¶
rulename: Rule name briefly should outline the program and functions utilized inside the rule. Each word is seperated by a underscore _
. First word is the bioinformatic tool or script’s name. The following words describe subcommand within that bioinformatic tool and then followed by workflow specific description. The word length shouldn’t exceed more than 3 or 4 words. Make sure rule names are updated within config/cluster.json
and it is all lowercase. Examples: picard_collecthsmetrics_umi
, bcftools_query_calculateaftable_umi
input: It is strongly recommended to set input and output files as named. Refrain from introducing new wildcards as much as possible.
output: This should follow the same instructions as input
.
benchmark: Benchmark name is prefixed with rule name and suffixed with ‘.tsv’ file extension.
singularity: Make sure the singularity image does contain a Conda environment with required bioinformatics tools. Do not use this field if run
is used instead of shell
.
params: If the defined parameter is a threshold or globally used constant; add it to utils/constants.py
. Respective class models need to be updated in utils/models.py
.
threads: Make sure for each rule, the correct number of threads are assigned in config/cluster.json
. Otherwise it will be assigned default values from config/cluster.json
. If there is no need for multithreading, this field can be removed from rule.
message: A short message describing the function of rule. Add any relevant wildcard to message to make it readable and understandable. It is also recommended to use params
to build a more descriptive message
shell (run): Code inside the shell/run command should be left indented. Shell lines no longer than 100 characters. Break the long commands with \
and followed by a new line. Avoid having long Python code within run
, instead add it to utils/
as a Python script and import the function.
Example:
java -jar \
-Djava.io.tmpdir=${{tmpdir}} \
-Xms8G -Xmx16G \
$CONDA_PREFIX/share/picard.jar \
MarkDuplicates \
{input.named_input_1} \
{output.named_output_1};
Example for external python scripts that can be saved as modules in utils/*.py
and can use them as definitions in rules as:
from BALSAMIC.utils.workflowscripts import get_densityplot
get_densityplot(input.named_input1, params.named_params_1, output.named_output1 )
Similarly awk
or R
external scripts can be saved in assets/scripts/*awk
and can be invoked using get_script_path as:
params:
consensusfilter_script = get_script_path("FilterDuplexUMIconsensus.awk")
shell:
"""
samtools view -h {input} | \
awk -v MinR={params.minreads} \
-v OFS=\'\\t\' -f {params.consensusfilter_script} | \
samtools view -bh - > {output}
"""
References¶
Build Doc¶
Following steps explains how to build documents locally.
Create a conda environment:
conda create -n balsamic_doc -c bioconda -c conda-forge python=3.6 pip
conda activate balsamic_doc
Install Sphinx and extensions:
python -m pip install --upgrade --upgrade-strategy eager --no-cache-dir .
cd docs
pip install -r requirements.txt -r ../requirements-dev.txt
Build docs:
sphinx-build -T -E -b html -d _build/doctrees-readthedocs -D language=en . _build/html
View docs (open
or similar command from your OS):
open _build/html/index.html
Semantic versioning¶
BALSAMIC is following Semantic Versioning.
Since October 24, 2018 the following changes were added in addition to SemVer also to cover Bioinformatic and data analysis aspect of it:
major:
Structural changes to the BALSAMIC workflow. This includes reordering of annotation softwares or sources, variant callers, aligners, quality trimmers, and/or anything other than QC reporting and rule all.
Addition of annotation softwares or sources, variant callers, aligners, quality trimmers, and/or anything other than QC reporting
minor:
Under the hood changes to rules that won’t affect output results of workflow.
Addition of new bioinfo tools for QC reporting.
Updating version of a Bioinformatic software or data resource (including annotation sources)
patch:
Any bug fix and under the hood changes that won’t impact end-users run.
Changes to resource allocation of Scheduler job submission
The rational for versioning is heavily inspired from BACTpipe: DOI: 10.5281/zenodo.1254248 and https://github.com/ctmrbio/BACTpipe)
Frequently Asked Questions (FAQs)¶
BALSAMIC¶
UMIworkflow¶
What are UMIs
Unique Molecular Identifiers (UMIs) are short random nucleotide sequences (3-20 bases) that are ligated to the ends of DNA fragments prior to sequencing to greatly reduce the impact of PCR duplicates and sequencing errors on the variant calling process.

Figure1: Design of UMI adapters in the library preparation. Ref¶
How is the UMIworkflow implemented
CG’s UMIworkflow is implemented using the commercial software Sentieon. The Sentieon tools provide functionality for extracting UMI tags from fastq reads and performing barcode-aware consensus generation. The workflow is as described:

Figure2: UMI workflow steps.¶
How is the UMI structure defined
Our pair-end sequencing read length is about 151 bp and the UMI structure is defined as`3M2S146T, 3M2S146T` where 3M represents 3 UMI bases, 2S represents 2 skipped bases, 146T represents 146 bases in the read.
Are there any differences in the UMI read extraction if the read structure is defined as `3M2S146T, 3M2S146T` or `3M2S+T, 3M2S+T`?
In theory, this should be the same if the read length is always 151bp. But the recommendation is to use 3M2S+T, 3M2S+T so that UMIworkflow can handle any unexpected input data.
How does the `umi extract` tool handle sequencing adapters? Do the input reads always need to be adapter removed fastq reads
The presence of 5’ adapter sequences can cause issues for the Sentieon umi extract tool, as the extract tool will not correctly identify the UMI sequence. If 5’ adapter contamination is found in the data, before processing with the umi extract tool, these adapter sequences needed to be removed with a third-party trimming tool. 3’ adapter contamination is much more common and can occur when the insert size is shorter than the sequence read length. The Sentieon umi consensus tool will correctly identify and handle 3’ adapter/barcode contamination during consensus read creation.
How does Sentieon `umi consensus` tool handles paired-end reads
The umi consensus tool will merge overlapping read pairs when it can, but it is not possible for reads with an insert size greater than 2x the read length as there is some unknown intervening sequence. In this case, umi consensus will output a consensus read pair where each consensus read in the pair is constructed separately, while other reads in the dataset are collapsed/merged to single-end reads.

Figure3: Figure taken from Sentieon document.¶
Purpose of consensus-filtering step in the UMIworkflow
Mainly to reduce the calling of false-positive variants. Consensus filtering is based on the setting of minimum raw reads (MinR) supporting each UMI group. By default, MinR is set as 3,1,1, meaning that the minimum number of raw reads in both strands should be greater than 1 and the sum of both strands is greater than 3. The default 3,1,1 is a good starting point at lower coverages. This setting can be further adjusted accordingly at higher coverages or if finding false-positive calls due to consensus reads with little read support.
How is the performance of other variant callers for analysing UMI datasets UMI workflow is validated with two datasets (SeraCare and HapMap). The Vardict failed to call the true reference variants while the TNscope performed better. A more detailed analysis is summarized here
We are still investigating other UMI-aware variant callers and maybe in the future, if something works better, additional varcallers will be added to the UMIworkflow.
References¶
How to generate reference files for ascatNGS
Detailed information is available from ascatNGS documentation
Briefly, ascatNGS needs gender loci file if gender information for the input sample is not available. The second file is SnpGcCorrections.tsv, which is prepared from the 1000 genome SNP panel.
Gender loci file:
GRCh37d5_Y.loci contains the following contents:
GC correction file:
First step is to download the 1000 genome snp file and convert it from .vcf to .tsv. The detailed procedure to for this step is available from ascatNGS-reference-files (Human reference files from 1000 genomes VCFs)
export TG_DATA=ftp://ftp.ensembl.org/pub/grch37/release-83/variation/vcf/homo_sapiens/1000GENOMES-phase_3.vcf.gz
Followed by:
curl -sSL $TG_DATA | zgrep -F 'E_Multiple_observations' | grep -F 'TSA=SNV' |\
perl -ane 'next if($F[0] !~ m/^\d+$/ && $F[0] !~ m/^[XY]$/);\
next if($F[0] eq $l_c && $F[1]-1000 < $l_p); $F[7]=~m/MAF=([^;]+)/;\
next if($1 < 0.05); printf "%s\t%s\t%d\n", $F[2],$F[0],$F[1];\
$l_c=$F[0]; $l_p=$F[1];' > SnpPositions_GRCh37_1000g.tsv
–or–
curl -sSL $TG_DATA | zgrep -F 'E_Multiple_observations' | grep -F 'TSA=SNV' |\
perl -ane 'next if($F[0] !~ m/^\d+$/ && $F[0] !~ m/^[XY]$/); $F[7]=~m/MAF=([^;]+)/;\
next if($1 < 0.05); next if($F[0] eq $l_c && $F[1]-1000 < $l_p);\
printf "%s\t%s\t%d\n", $F[2],$F[0],$F[1]; $l_c=$F[0]; $l_p=$F[1];'\
> SnpPositions_GRCh37_1000g.tsv
Second step is to use SnpPositions.tsv file and generate SnpGcCorrections.tsv file, more details see ascatNGS-convert-snppositions
ascatSnpPanelGcCorrections.pl genome.fa SnpPositions.tsv > SnpGcCorrections.tsv