Calling and filtering variants#
In BALSAMIC, various bioinfo tools are integrated for reporting somatic and germline variants summarized in the table below. The choice of these tools differs between the type of analysis; Whole Genome Sequencing (WGS), or Target Genome Analysis (TGA) and Target Genome Analysis (TGA) with UMI 3,1,1 filtering activated.
Variant caller |
Sequencing type |
Analysis type |
Somatic/Germline |
Variant type |
---|---|---|---|---|
DNAscope |
TGA, WGS |
tumor-normal, tumor-only |
germline |
SNV, InDel |
TNscope |
WGS, TGA, TGA with UMI 3,1,1 filtering applied |
tumor-normal, tumor-only |
somatic |
SNV, InDel |
VarDict |
TGA |
tumor-normal, tumor-only |
somatic |
SNV, InDel |
Various filters (Pre-call and Post-call filtering) are applied at different levels to report high-confidence variant calls.
Pre-call filtering is where the variant-calling tool decides not to add a variant to the VCF file if the default filters of the variant-caller did not pass the filter criteria. The set of default filters differs between the various variant-calling algorithms.
To know more about the pre-call filters and detailed arguments used by the variant callers, please have a look at the VCF header of the particular variant-calling results. For example:
In the VCF file, the FILTER status is PASS if this position has passed all filters, i.e., a call is made at this position. Contrary, if the site has not passed any of the filters, a semicolon-separated list of those failed filter(s) will be appended to the FILTER column instead of PASS. E.g., t_lod_fstar;alt_allele_in_normal might indicate that at this site there is little support that the alternative bases constitute a true somatic variant, and there is also evidence of those same bases in the normal sample.
Note: In BALSAMIC, this VCF file is often referred to the “raw” VCF because it is the most unfiltered VCF produced, and is named `SNV.somatic.<CASE_ID>.tnscope.vcf.gz`
Post-call quality filtering is where a variant is further filtered based on quality, depth, VAF, etc., with more stringent thresholds.
For Post-call filtering, in BALSAMIC we have applied various filtering criteria depending on the analysis-type (TGS/WGS) and sample-type (tumor-only/tumor-normal), and only variants with either PASS or triallelic_site are kept.
Note: In BALSAMIC, this VCF file is named as SNV.somatic.<CASE_ID>.research.tnscope.vcf.gz and is not delivered
Post-call variant-database filtering is where a variant is further filtered based on their presence in certain variant-databases such as Gnomad and local variant databases built with LoqusDB.
This is a two step process where variants are first filtered based on existing above a specified frequency in public available databases, and then based on local databases of previously observed variants.
At each step only variants with filters PASS and triallelic_site are kept and delivered as a final list of variants to the customer either via Scout or Caesar
Note: In BALSAMIC, the VCF file filtered only on public available databases is named as `*.research.filtered.pass.vcf.gz` (eg: for WGS `SNV.somatic.<CASE_ID>.tnscope.research.filtered.pass.vcf.gz`) In BALSAMIC, the VCF file filtered on both public and locally available databases is named as `*.clinical.filtered.pass.vcf.gz` (eg: for TGA `SNV.somatic.<CASE_ID>.merged.<research/clinical>.filtered.pass.vcf.gz`)
VCF file name |
Description |
Delivered to the customer |
---|---|---|
.vcf.gz |
Unannotated raw VCF file with pre-call filters included in the STATUS column |
Yes (Caesar) |
.research.filtered.pass.vcf.gz |
Annotated VCF file with quality and population based filters applied. |
Yes (Caesar) |
.clinical.filtered.pass.vcf.gz |
Annotated VCF file with quality, population and local database filters applied. |
Yes (Caesar and Scout) |
Targeted Genome Analysis#
Somatic Callers for reporting SNVs/INDELS#
For SNV/InDel calling in the TGA analyses of balsamic both VarDict and TNscope are used. Lists of variants are produced from both tools, which are then normalised and quality filtered before being merged.
Vardict#
Vardict is a sensitive variant caller used for both tumor-only and tumor-normal variant calling.
There are two slightly different post-processing filters activated depending on if the sample is an exome or a smaller panel as these tend to have very different sequencing depths.
Vardict_filtering#
Following is the set of criteria applied for filtering vardict results. It is used for both tumor-normal and tumor-only samples.
Post-call Quality Filters for panels
Mean Mapping Quality (MQ): Refers to the root mean square (RMS) mapping quality of all the reads spanning the given variant site.
MQ >= 30
Total Depth (DP): Refers to the overall read depth supporting the called variant.
DP >= 50
Variant depth (VD): Total reads supporting the ALT allele
VD >= 5
Allelic Frequency (AF): Fraction of the reads supporting the alternate allele
Minimum AF >= 0.005
Post-call Quality Filters for exomes
Mean Mapping Quality (MQ): Refers to the root mean square (RMS) mapping quality of all the reads spanning the given variant site.
MQ >= 30
Total Depth (DP): Refers to the overall read depth supporting the called variant.
DP >= 20
Variant depth (VD): Total reads supporting the ALT allele
VD >= 5
Allelic Frequency (AF): Fraction of the reads supporting the alternate allele
Minimum AF >= 0.005
Attention: BALSAMIC <= v8.2.7 uses minimum AF 1% (0.01). From Balsamic v8.2.8, minimum VAF is changed to 0.7% (0.007). From v16.0.0 minimum VAF is changed to 0.5% (0.005).
For normal matched analyses
Relative tumor AF in normal: Allows for maximum Tumor-In-Normal-Contamination of 30%.
excludes variant if: AF(normal) / AF(tumor) > 0.3
Note: Additionally, the variant is excluded for tumor-normal cases if marked as ‘germline’ in the `STATUS` column of the VCF file.
Sentieon’s TNscope#
The TNscope algorithm performs the somatic variant calling on the tumor-normal or the tumor-only samples.
TNscope filtering#
Pre-call Filters
min_init_tumor_lod: Initial Log odds for the that the variant exists in the tumor.
min_init_tumor_lod = 0.5
min_tumor_lod: Minimum log odds in the final call of variant in the tumor.
min_tumor_lod = 4
min_init_normal_lod: Initial Log odds for the that the variant exists in the normal.
min_init_normal_lod = 0.5
min_normal_lod: Minimum log odds in the final call of variant in the normal.
min_normal_lod = 2.2
min_dbnp_normal_lod: Minimum normalLOD at dbSNP site.
min_dbnp_normal_lod = 5.5
max_error_per_read: Maximum number of differences to reference per read.
max_error_per_read = 5
min_base_qual: Minimal base quality to consider in calling
- ::
min_base_qual = 15
min_tumor_allele_frac: Set the minimum tumor AF to be considered as potential variant site.
min_tumor_allele_frac = 0.0005
interval_padding: Adding an extra 100bp to each end of the target region in the bed file before variant calling.
interval_padding = 100
Post-call Filters
Total Depth (DP): Refers to the overall read depth supporting the called variant.
DP >= 50
Variant depth (VD): Total reads supporting the ALT allele
VD >= 5
Allelic Frequency (AF): Fraction of the reads supporting the alternate allele
Minimum AF >= 0.005
For tumor only analyses
Average base quality score
SUM(QSS)/SUM(AD) >= 20
SOR: Symmetric Odds Ratio of 2x2 contingency table to detect strand bias
SOR < 2.7
Note: Additionally, variants labeled with triallelic site filter are not filtered out
For normal matched analyses
alt_allele_in_normal: Default filter set by TNscope was considered too stringent in filtering tumor in normal and is removed.
bcftools annotate -x FILTER/alt_allele_in_normal
Relative tumor AF in normal: Allows for maximum Tumor-In-Normal-Contamination of 30%.
excludes variant if: AF(normal) / AF(tumor) > 0.3
Post-call Observation database Filters
GNOMADAF_POPMAX: Maximum Allele Frequency across populations
GNOMADAF_popmax <= 0.005 (or) GNOMADAF_popmax == "."
SWEGENAF: SweGen Allele Frequency
SWEGENAF <= 0.01 (or) SWEGENAF == "."
Frq: Frequency of observation of the variants from normal Clinical samples
Frq <= 0.01 (or) Frq == "."
ArtefactFrq: Frequency of observation of the variants from normal WGS samples merged to ~1200X coverage
ArtefactFrq <= 0.1 (or) ArtefactFrq == "."
This above corresponds to at least 4 observations in a database of 29 cases of merged WGS samples.
Target Genome Analysis with UMI’s into account#
Sentieon’s TNscope#
UMI workflow performs the variant calling of SNVs/INDELS using the TNscope algorithm from UMI consensus-called reads. The following filter applies for both tumor-normal and tumor-only samples.
Pre-call Filters
minreads: Filtering of consensus called reads based on the minimum reads supporting each UMI tag group
minreads = 3,1,1
It means that at least 3 read-pairs need to support the UMI-group (based on the UMI-tag and the aligned genomic positions), and with at least 1 read-pair from each strand (F1R2 and F2R1). NOTE: This filtering is performed on the bamfile before variant calling.
min_init_tumor_lod: Initial Log odds for the that the variant exists in the tumor.
min_init_tumor_lod = 0.5
min_tumor_lod: Minimum log odds in the final call of variant in the tumor.
min_tumor_lod = 4
min_init_normal_lod: Initial Log odds for the that the variant exists in the normal.
min_init_normal_lod = 0.5
min_normal_lod: Minimum log odds in the final call of variant in the normal.
min_normal_lod = 2.2
min_dbnp_normal_lod: Minimum normalLOD at dbSNP site.
min_dbnp_normal_lod = 5.5
max_error_per_read: Maximum number of differences to reference per read.
max_error_per_read = 5
min_base_qual: Minimal base quality to consider in calling
- ::
min_base_qual = 15
min_tumor_allele_frac: Set the minimum tumor AF to be considered as potential variant site.
min_tumor_allele_frac = 0.0005
interval_padding: Adding an extra 100bp to each end of the target region in the bed file before variant calling.
interval_padding = 100
Post-call Quality Filters
alt_allele_in_normal: Default filter set by TNscope was considered too stringent in filtering tumor in normal and is removed.
bcftools annotate -x FILTER/alt_allele_in_normal
Relative tumor AF in normal: Allows for maximum Tumor-In-Normal-Contamination of 30%.
excludes variant if: AF(normal) / AF(tumor) > 0.3
Post-call Observation database Filters
GNOMADAF_POPMAX: Maximum Allele Frequency across populations
GNOMADAF_popmax <= 0.02 (or) GNOMADAF_popmax == "."
SWEGENAF: SweGen Allele Frequency
SWEGENAF <= 0.01 (or) SWEGENAF == "."
Frq: Frequency of observation of the variants from normal Clinical samples
Frq <= 0.01 (or) Frq == "."
The variants scored as PASS or triallelic_sites are included in the final vcf file (SNV.somatic.<CASE_ID>.tnscope.<research/clinical>.filtered.pass.vcf.gz).
Whole Genome Sequencing (WGS)#
Sentieon’s TNscope#
BALSAMIC utilizes the TNscope algorithm for calling somatic SNVs and INDELS in WGS samples. The TNscope algorithm performs the somatic variant calling on the tumor-normal or the tumor-only samples.
TNscope filtering (Tumor_normal)#
Pre-call Filters
Apply TNscope trained MachineLearning Model: Sets MLrejected on variants with ML_PROB below 0.32.
- ::
ML model: SentieonTNscopeModel_GiAB_HighAF_LowFP-201711.05.model is applied
min_init_tumor_lod: Initial Log odds for the that the variant exists in the tumor.
min_init_tumor_lod = 1
min_tumor_lod: Minimum log odds in the final call of variant in the tumor.
min_tumor_lod = 8
min_init_normal_lod: Initial Log odds for the that the variant exists in the normal.
min_init_normal_lod = 0.5
min_normal_lod: Minimum log odds in the final call of variant in the normal.
min_normal_lod = 1
Post-call Quality Filters
Total Depth (DP): Refers to the overall read depth from all target samples supporting the variant call
DP(tumor) >= 10 (or) DP(normal) >= 10
Allelic Depth (AD): Total reads supporting the ALT allele in the tumor sample
AD(tumor) >= 3
Allelic Frequency (AF): Fraction of the reads supporting the alternate allele
Minimum AF(tumor) >= 0.05
alt_allele_in_normal: Default filter set by TNscope was considered too stringent in filtering tumor in normal and is removed.
bcftools annotate -x FILTER/alt_allele_in_normal
Relative tumor AF in normal: Allows for maximum Tumor-In-Normal-Contamination of 30%.
excludes variant if: AF(normal) / AF(tumor) > 0.3
Post-call Observation database Filters
GNOMADAF_POPMAX: Maximum Allele Frequency across populations
GNOMADAF_popmax <= 0.001 (or) GNOMADAF_popmax == "."
SWEGENAF <= 0.01 (or) SWEGENAF == "."
Frq: Frequency of observation of the variants from normal Clinical samples
Frq <= 0.01 (or) Frq == "."
The variants scored as PASS or triallelic_sites are included in the final vcf file (SNV.somatic.<CASE_ID>.tnscope.<research/clinical>.filtered.pass.vcf.gz).
TNscope filtering (tumor_only)#
Pre-call Filters
min_init_tumor_lod: Initial Log odds for the that the variant exists in the tumor.
min_init_tumor_lod = 1
min_tumor_lod: Minimum log odds in the final call of variant in the tumor.
min_tumor_lod = 8
The somatic variants in TNscope raw VCF file (SNV.somatic.<CASE_ID>.tnscope.all.vcf.gz) are filtered out for the genomic regions that are not reliable (eg: centromeric regions, non-chromosome contigs) to enhance the computation time. This WGS interval region file is collected from gatk_bundles gs://gatk-legacy-bundles/b37/wgs_calling_regions.v1.interval_list.
Post-call Quality Filters
Total Depth (DP): Refers to the overall read depth supporting the variant call
DP(tumor) >= 10
Allelic Depth (AD): Total reads supporting the ALT allele in the tumor sample
AD(tumor) > 3
Allelic Frequency (AF): Fraction of the reads supporting the alternate allele
Minimum AF(tumor) > 0.05
SUM(QSS)/SUM(AD) >= 20
Read Counts: Count of reads in a given (F1R2, F2R1) pair orientation supporting the alternate allele and reference alleles
ALT_F1R2 > 0, ALT_F2R1 > 0
REF_F1R2 > 0, REF_F2R1 > 0
SOR: Symmetric Odds Ratio of 2x2 contingency table to detect strand bias
SOR < 3
Post-call Observation database Filters
GNOMADAF_POPMAX: Maximum Allele Frequency across populations
GNOMADAF_popmax <= 0.001 (or) GNOMADAF_popmax == "."
Normalized base quality scores: The sum of base quality scores for each allele (QSS) is divided by the allelic depth of alt and ref alleles (AD)
SWEGENAF <= 0.01 (or) SWEGENAF == "."
Frq: Frequency of observation of the variants from normal Clinical samples
Frq <= 0.01 (or) Frq == "."
The variants scored as PASS or triallelic_sites are included in the final vcf file (SNV.somatic.<CASE_ID>.tnscope.<research/clinical>.filtered.pass.vcf.gz).
Attention: BALSAMIC <= v8.2.10 uses GNOMAD_popmax <= 0.005. From Balsamic v9.0.0, this settings is changed to 0.02, to reduce the stringency. BALSAMIC >= v11.0.0 removes unmapped reads from the bam and cram files for all the workflows. BALSAMIC >= v13.0.0 keeps unmapped reads in bam and cram files for all the workflows. BALSAMIC >= v16.0.0 uses UMIs for duplicate removal bam in standard TGA workflows.