Structural and Copy Number variants#

Depending on the sequencing type, BALSAMIC is currently running the following structural and copy number variant callers:

SV CNV callers#

Variant caller

Sequencing type

Analysis type

Somatic/Germline

Variant type

AscatNgs

WGS

tumor-normal

somatic

CNV

CNVkit

TGA, WES

tumor-normal, tumor-only

somatic

CNV

Delly

TGA, WES, WGS

tumor-normal, tumor-only

somatic

SV, CNV

Manta

TGA, WES, WGS

tumor-normal, tumor-only

somatic, germline

SV

TIDDIT

WGS

tumor-normal, tumor-only

somatic

SV

CNVpytor

WGS

tumor-only

somatic

CNV

igh_dux4 (see note below)

WGS

tumor-normal, tumor-only

somatic

SV

Note: igh_dux4 is not a variant caller itself. This is a custom script that uses samtools to detect read pairs supporting IGH::DUX4 rearrangements.

It is mandatory to provide the gender of the sample from BALSAMIC version >= 10.0.0 For CNV analysis.

Further details about a specific caller can be found in the links for the repositories containing the documentation for SV and CNV callers along with the links for the articles are listed in bioinfo softwares.

Difficult to detect clinically relevant SVs#

IGH::DUX4 rearrangements

This is a custom script that uses samtools to detect read pairs supporting IGH::DUX4 rearrangements. In short, the command identifies discordant reads mapping to the IGH region and to either DUX4 or its homologous DUX4-like regions (see references for details). The inclusion of this feature aims to alleviate the failure of callers to detect this rearrangement.

It is important to note, however, that the reported breakpoints are fixed to the IGH and DUX4 coordinates and are, therefore, imprecise and uncertain. Therefore, we advise caution when interpreting this information.

The script used to detect this rearrangements can be found in: BALSAMIC/assets/scripts/igh_dux4_detection.sh (see references Detection of IGH::DUX4 rearrangement, for more information.)

FLT3-ITDs

FLT3 Internal Tandem Duplications are quite difficult to detect and can exist in sizes detectable by both small SNV and InDel callers such as VarDit and TNscope as well as Structural Variant callers like Manta, Delly and TIDDIT. In our experience however the variant is more commonly detected by SV callers and we strongly recommend that this variant is looked for in the SV results.

It has also been observed in release 16.0.0 of Balsamic a decline in the ability of SNV callers to detect this variant, possibly as a result of collapsing overlapping mates of read-pairs into singletons as a result of UMI post processing. This adds further reasons to searching for this variant in the SV results!

In the future we will aim to add further callers better able to detect this variant.

Pre-merge Filtrations#

The copy number variants, identified using ascatNgs and dellycnv, are converted to deletion and duplications before they are merged using SVDB with –bnd_distance = 5000 (distance between end points for the variants from different callers) and –overlap = 0.80 (percentage for overlapping bases for the variants from different callers).

Tumor and normal calls in TIDDIT are merged using SVDB with –bnd_distance 500 and –overlap = 0.80. Using a custom made script “filter_SVs.py”, soft-filters are added to the calls based on the presence of the variant in the normal, with the goal of retaining only somatic variants as PASS.

Manta calls are filtered using bcftools to only keep variants that have evidence from 3 or more reads.

SV filters#

Variant caller

Filter added

Filter expression

TIDDIT

high_normal_af_fraction

(AF_N_MAX / AF_T_MAX) > 0.25

TIDDIT

max_normal_allele_frequency

AF_N_MAX > 0.25

TIDDIT

normal_variant

AF_T_MAX == 0 and ctg_t == False

TIDDIT

in_normal

ctg_n == True and AF_N_MAX == 0 and AF_T_MAX <= 0.25

Manta

low_pr_sr_count

SUM(FORMAT/PR[0:1]+FORMAT/SR[0:1]) < 4.0

igh_dux4

samtools_igh_dux4

DV < 1

Further information regarding the TIDDIT tumor normal filtration: As translocation variants are represented by 2 BNDs in the VCF which allows for mixed assignment of soft-filters, a requirement for assigning soft-filters to translocations is that neither BND is PASS.

Post-merge Filtrations#

SVDB prioritizes the merging of variants from SV and CNV callers to fetch position and genotype information, in the following order:

SVDB merge caller priority order#
TGA, WES

tumor-normal

TGA, WES

tumor-only

WGS

tumor-normal

WGS

tumor-only

1. manta
2. dellysv
3. cnvkit
4. dellycnv
1. manta
2. dellysv
3. cnvkit
4. dellycnv
1. manta
2. dellysv
3. ascat
4. dellycnv
5. tiddit
6. igh_dux4
1. manta
2. dellysv
3. dellycnv
4. tiddit
5. cnvpytor
6. igh_dux4

The merged SNV.somatic.<CASE_ID>.svdb.vcf.gz file retains all the information for the variants from the caller in which the variants are identified, which are then annotated using ensembl-vep. The SweGen and frequencies and the frequency of observed structural variants from clinical normal samples are annotated using SVDB.

The following filter applies for both tumor-normal and tumor-only samples in addition to caller specific filters.

SWEGENAF: SweGen Allele Frequency

SWEGENAF <= 0.02  (or) SWEGENAF == "."

Frq: Frequency of observation of the variants from normal Clinical samples

Frq <= 0.02  (or) Frq == "."

The variants scored as PASS are included in the final vcf file (SNV.somatic.<CASE_ID>.svdb.<research/clinical>.filtered.pass.vcf.gz).

The following command can be used to fetch the variants identified by a specific caller from merged structural and copy number variants.

zgrep -E "#|<Caller>" <*.svdb.vcf.gz>

Using GENS for WGS#

GENS is a visualization tool similar to IGV, originally developed in Clinical Genomics Lund, and primarily for visualizing genomic copy number profiles from WGS samples.

To visualise the GENS-formatted files from BALSAMIC you need to have GENS installed, and to do this you can follow the instructions on the Clinical-Genomics-Lund GENS-repository:

Two files per sample are uploaded to GENS, one file with allele-frequencies from SNV & InDel germline-calls (BAF file) which can be used to aid the interpretation of the CN-profile, and one file with the Log2 copy number ratios normalized against a PON. Instructions for how to generate this PON using the BALSAMIC PON workflow can be found here:

Generate GENS PON.

There are three required arguments for creating the input files for GENS: 1. Genome interval file produced by GATK PreprocessIntervals (see instructions in GENS PON creation) 2. A gender specific PON (see instructions in GENS PON creation) 3. A population database VCF with variant positions to be reported in the BAF file.

We created the file mentioned in 3 using the file gnomad.genomes.r2.1.1.sites filtered with bcftools to only keep variants with an AF above 0.05.

bcftools view -i AF>=0.05 -Oz

To config BALSAMIC to run with GENS activated you supply these files like this:

balsamic config case \
  --case-id <CASE_ID>
  --balsamic-cache </path/reference_cache/>
  --analysis-dir </path/analysis/>
  --fastq-path </path/fastq/>
  --gender <[male/female]>
  --analysis-workflow balsamic
  --genome-version hg19
  --tumor-sample-name <TUMOR_NAME>
  --genome-interval </path/genome_interval>
  --gens-coverage-pon </path/pon_file>
  --gnomad-min-af5 </path/population_vcf.vcf.gz>

Genome Reference Files#

How to generate genome reference files for ascatNGS

Detailed information is available from ascatNGS documentation

The file SnpGcCorrections.tsv prepared from the 1000 genome SNP panel.

GC correction file:

First step is to download the 1000 genome snp file and convert it from .vcf to .tsv. The detailed procedure to for this step is available from ascatNGS-reference-files (Human reference files from 1000 genomes VCFs)

export TG_DATA=ftp://ftp.ensembl.org/pub/grch37/release-83/variation/vcf/homo_sapiens/1000GENOMES-phase_3.vcf.gz

Followed by:

curl -sSL $TG_DATA | zgrep -F 'E_Multiple_observations' | grep -F 'TSA=SNV' |\
perl -ane 'next if($F[0] !~ m/^\d+$/ && $F[0] !~ m/^[XY]$/);\
next if($F[0] eq $l_c && $F[1]-1000 < $l_p); $F[7]=~m/MAF=([^;]+)/;\
next if($1 < 0.05); printf "%s\t%s\t%d\n", $F[2],$F[0],$F[1];\
$l_c=$F[0]; $l_p=$F[1];' > SnpPositions_GRCh37_1000g.tsv

–or–

curl -sSL $TG_DATA | zgrep -F 'E_Multiple_observations' | grep -F 'TSA=SNV' |\
perl -ane 'next if($F[0] !~ m/^\d+$/ && $F[0] !~ m/^[XY]$/); $F[7]=~m/MAF=([^;]+)/;\
next if($1 < 0.05); next if($F[0] eq $l_c && $F[1]-1000 < $l_p);\
printf "%s\t%s\t%d\n", $F[2],$F[0],$F[1]; $l_c=$F[0]; $l_p=$F[1];'\
> SnpPositions_GRCh37_1000g.tsv

Second step is to use SnpPositions.tsv file and generate SnpGcCorrections.tsv file, more details see ascatNGS-convert-snppositions

ascatSnpPanelGcCorrections.pl genome.fa SnpPositions.tsv > SnpGcCorrections.tsv