Structural and Copy Number variants#

Depending on the sequencing type, BALSAMIC is currently running the following structural and copy number variant callers:

SV CNV callers#
Variant caller	Sequencing type	Analysis type	Somatic/Germline	Variant type
AscatNgs	WGS	tumor-normal	somatic	CNV
CNVkit	TGA, WES	tumor-normal, tumor-only	somatic	CNV
Delly	TGA, WES, WGS	tumor-normal, tumor-only	somatic	SV, CNV
Manta	TGA, WES, WGS	tumor-normal, tumor-only	somatic, germline	SV
VarDict	TGA, WES	tumor-only	somatic	SV
TIDDIT	WGS	tumor-normal, tumor-only	somatic	SV
CNVpytor	WGS	tumor-only	somatic	CNV
igh_dux4 (see note below)	WGS	tumor-normal, tumor-only	somatic	SV

Note: igh_dux4 is not a variant caller itself. This is a custom script that uses samtools to detect read pairs supporting IGH::DUX4 rearrangements.

From BALSAMIC version >= 18.0.0, if no gender information is provided to case configuration step, the gender used in CNV analysis will be derived from the predicted sex (see Sex check for more information.)

Further details about a specific caller can be found in the links for the repositories containing the documentation for SV and CNV callers along with the links for the articles are listed in bioinfo softwares.

Difficult to detect clinically relevant SVs#

IGH::DUX4 rearrangements

This is a custom script that uses samtools to detect read pairs supporting IGH::DUX4 rearrangements. In short, the command identifies discordant reads mapping to the IGH region and to either DUX4 or its homologous DUX4-like regions (see references for details). The inclusion of this feature aims to alleviate the failure of callers to detect this rearrangement.

It is important to note, however, that the reported breakpoints are fixed to the IGH and DUX4 coordinates and are, therefore, imprecise and uncertain. Therefore, we advise caution when interpreting this information.

The script used to detect this rearrangements can be found in: BALSAMIC/assets/scripts/igh_dux4_detection.sh (see references Detection of IGH::DUX4 rearrangement, for more information.)

FLT3-ITDs

FLT3 Internal Tandem Duplications are quite difficult to detect and can exist in sizes detectable by both small SNV and InDel callers such as VarDit and TNscope as well as Structural Variant callers like Manta, Delly and TIDDIT. In our experience however the variant is more commonly detected by SV callers and we strongly recommend that this variant is looked for in the SV results.

It has also been observed in release 16.0.0 of Balsamic a decline in the ability of SNV callers to detect this variant, possibly as a result of collapsing overlapping mates of read-pairs into singletons as a result of UMI post processing. This adds further reasons to searching for this variant in the SV results!

In the future we will aim to add further callers better able to detect this variant.

Pre-merge Filtrations#

The copy number variants, identified using ascatNgs and dellycnv, are converted to deletion and duplications before they are merged using SVDB with –bnd_distance = 5000 (distance between end points for the variants from different callers) and –overlap = 0.80 (percentage for overlapping bases for the variants from different callers).

Tumor and normal calls in TIDDIT are merged using SVDB with –bnd_distance 500 and –overlap = 0.80. Using a custom made script “filter_SVs.py”, soft-filters are added to the calls based on the presence of the variant in the normal, with the goal of retaining only somatic variants as PASS.

Manta calls are filtered using bcftools to only keep variants that have evidence from 3 or more reads.

SV filters#
Variant caller	Filter added	Filter expression
TIDDIT	high_normal_af_fraction	(AF_N_MAX / AF_T_MAX) > 0.25
TIDDIT	max_normal_allele_frequency	AF_N_MAX > 0.25
TIDDIT	normal_variant	AF_T_MAX == 0 and ctg_t == False
TIDDIT	in_normal	ctg_n == True and AF_N_MAX == 0 and AF_T_MAX <= 0.25
Manta	low_pr_sr_count	SUM(FORMAT/PR[0:1]+FORMAT/SR[0:1]) < 4.0
igh_dux4	samtools_igh_dux4	DV < 1

Further information regarding the TIDDIT tumor normal filtration: As translocation variants are represented by 2 BNDs in the VCF which allows for mixed assignment of soft-filters, a requirement for assigning soft-filters to translocations is that neither BND is PASS.

Post-merge Filtrations#

SVDB prioritizes the merging of variants from SV and CNV callers to fetch position and genotype information, in the following order:

SVDB merge caller priority order#
TGA, WES tumor-normal	TGA, WES tumor-only	WGS tumor-normal	WGS tumor-only
1. manta 2. vardict 3. dellysv 4. cnvkit 5. dellycnv	1. manta 2. vardict 3. dellysv 4. cnvkit 5. dellycnv	1. manta 2. dellysv 3. ascat 4. dellycnv 5. tiddit 6. igh_dux4	1. manta 2. dellysv 3. dellycnv 4. tiddit 5. cnvpytor 6. igh_dux4

The merged SV.somatic.<CASE_ID>.svdb.vcf.gz file retains all the information for the variants from the caller in which the variants are identified, which are then annotated using ensembl-vep. The SweGen and frequencies and the frequency of observed structural variants from clinical normal samples are annotated using SVDB.

The following filter applies for both tumor-normal and tumor-only samples in addition to caller specific filters.

ArtefactObs: Number of observations of the variants from somatic calls on high-coverage WGS normal samples (artificially created by mixing several samples)

ArtefactObs <= 4  (or) ArtefactObs == "."

Note

The above filter ArtefactObs is only active for the WGS workflow.

Above filter is applied for the clinical vcf (SV.somatic.<CASE_ID>.svdb.clinical.filtered.pass.vcf.gz).

SWEGENAF: SweGen Allele Frequency

SWEGENAF <= 0.02  (or) SWEGENAF == "."

Above filter is applied for both research and clinical vcf (SV.somatic.<CASE_ID>.svdb.<research/clinical>.filtered.pass.vcf.gz).

Frq: Frequency of observation of the variants from germline calls on normal Clinical samples

Frq <= 0.02  (or) Frq == "."

Above filter is applied for only the clinical vcf (SV.somatic.<CASE_ID>.svdb.clinical.filtered.pass.vcf.gz).

The following command can be used to fetch the variants identified by a specific caller from merged structural and copy number variants.

zgrep -E "#|<Caller>" <*.svdb.vcf.gz>

Using GENS for WGS#

GENS is a visualization tool similar to IGV, originally developed in Clinical Genomics Lund, and primarily for visualizing genomic copy number profiles from WGS samples.

To visualise the GENS-formatted files from BALSAMIC you need to have GENS installed, and to do this you can follow the instructions on the Clinical-Genomics-Lund GENS-repository:

Clinical-Genomics-Lund-GENS

Two files per sample are uploaded to GENS, one file with allele-frequencies from SNV & InDel germline-calls (BAF file) which can be used to aid the interpretation of the CN-profile, and one file with the Log2 copy number ratios normalized against a PON. Instructions for how to generate this PON using the BALSAMIC PON workflow can be found here:

Generate GENS PON.

There are three required arguments for creating the input files for GENS: 1. Genome interval file produced by GATK PreprocessIntervals (see instructions in GENS PON creation) 2. A gender specific PON (see instructions in GENS PON creation) 3. A population database VCF with variant positions to be reported in the BAF file.

We created the file mentioned in 3 using the file gnomad.genomes.r2.1.1.sites filtered with bcftools to only keep variants with an AF above 0.05.

bcftools view -i AF>=0.05 -Oz

To config BALSAMIC to run with GENS activated you supply these files like this:

balsamic config case \
  --case-id <CASE_ID>
  --balsamic-cache </path/reference_cache/>
  --analysis-dir </path/analysis/>
  --fastq-path </path/fastq/>
  --gender <[male/female]>
  --analysis-workflow balsamic
  --genome-version hg19
  --tumor-sample-name <TUMOR_NAME>
  --genome-interval </path/genome_interval>
  --gens-coverage-pon </path/pon_file>
  --gnomad-min-af5 </path/population_vcf.vcf.gz>

Genome Reference Files#

How to generate genome reference files for ascatNGS

Detailed information is available from ascatNGS documentation

The file SnpGcCorrections.tsv prepared from the 1000 genome SNP panel.

GC correction file:

First step is to download the 1000 genome snp file and convert it from .vcf to .tsv. The detailed procedure to for this step is available from ascatNGS-reference-files (Human reference files from 1000 genomes VCFs)

export TG_DATA=ftp://ftp.ensembl.org/pub/grch37/release-83/variation/vcf/homo_sapiens/1000GENOMES-phase_3.vcf.gz

Followed by:

curl -sSL $TG_DATA | zgrep -F 'E_Multiple_observations' | grep -F 'TSA=SNV' |\
perl -ane 'next if($F[0] !~ m/^\d+$/ && $F[0] !~ m/^[XY]$/);\
next if($F[0] eq $l_c && $F[1]-1000 < $l_p); $F[7]=~m/MAF=([^;]+)/;\
next if($1 < 0.05); printf "%s\t%s\t%d\n", $F[2],$F[0],$F[1];\
$l_c=$F[0]; $l_p=$F[1];' > SnpPositions_GRCh37_1000g.tsv

–or–

curl -sSL $TG_DATA | zgrep -F 'E_Multiple_observations' | grep -F 'TSA=SNV' |\
perl -ane 'next if($F[0] !~ m/^\d+$/ && $F[0] !~ m/^[XY]$/); $F[7]=~m/MAF=([^;]+)/;\
next if($1 < 0.05); next if($F[0] eq $l_c && $F[1]-1000 < $l_p);\
printf "%s\t%s\t%d\n", $F[2],$F[0],$F[1]; $l_c=$F[0]; $l_p=$F[1];'\
> SnpPositions_GRCh37_1000g.tsv

Second step is to use SnpPositions.tsv file and generate SnpGcCorrections.tsv file, more details see ascatNGS-convert-snppositions

ascatSnpPanelGcCorrections.pl genome.fa SnpPositions.tsv > SnpGcCorrections.tsv