Structural and Copy Number variants#

Depending on the sequencing type, BALSAMIC is currently running the following structural and copy number variant callers:

SV CNV callers#
Variant caller	Sequencing type	Analysis type	Somatic/Germline	Variant type
AscatNgs	WGS	tumor-normal	somatic	CNV
CNVkit	TGA, WES	tumor-normal, tumor-only	somatic	CNV
Delly	TGA, WES, WGS	tumor-normal, tumor-only	somatic	SV, CNV
Manta	TGA, WES, WGS	tumor-normal, tumor-only	somatic, germline	SV
TIDDIT	WGS	tumor-normal, tumor-only	somatic	SV
CNVpytor	WGS	tumor-only	somatic	CNV

Further details about a specific caller can be found in the links for the repositories containing the documentation for SV and CNV callers along with the links for the articles are listed in bioinfo softwares.

It is mandatory to provide the gender of the sample from BALSAMIC version >= 10.0.0 For CNV analysis.

Pre-merge Filtrations#

The copy number variants, identified using ascatNgs and dellycnv, are converted to deletion and duplications before they are merged using SVDB with –bnd_distance = 5000 (distance between end points for the variants from different callers) and –overlap = 0.80 (percentage for overlapping bases for the variants from different callers).

Tumor and normal calls in TIDDIT are merged using SVDB with –bnd_distance 500 and –overlap = 0.80. Using a custom made script “filter_SVs.py”, soft-filters are added to the calls based on the presence of the variant in the normal, with the goal of retaining only somatic variants as PASS.

Manta calls are filtered using bcftools to only keep variants that have evidence from 3 or more reads.

SV filters#
Variant caller	Filter added	Filter expression
TIDDIT	high_normal_af_fraction	(AF_N_MAX / AF_T_MAX) > 0.25
TIDDIT	max_normal_allele_frequency	AF_N_MAX > 0.25
TIDDIT	normal_variant	AF_T_MAX == 0 and ctg_t == False
TIDDIT	in_normal	ctg_n == True and AF_N_MAX == 0 and AF_T_MAX <= 0.25
Manta	low_pr_sr_count	SUM(FORMAT/PR[0:1]+FORMAT/SR[0:1]) < 4.0

Further information regarding the TIDDIT tumor normal filtration: As translocation variants are represented by 2 BNDs in the VCF which allows for mixed assignment of soft-filters, a requirement for assigning soft-filters to translocations is that neither BND is PASS.

Post-merge Filtrations#

SVDB prioritizes the merging of variants from SV and CNV callers to fetch position and genotype information, in the following order:

SVDB merge caller priority order#
TGA, WES tumor-normal	TGA, WES tumor-only	WGS tumor-normal	WGS tumor-only
1. manta 2. dellysv 3. cnvkit 4. dellycnv	1. manta 2. dellysv 3. cnvkit 4. dellycnv	1. manta 2. dellysv 3. ascat 4. dellycnv 5. tiddit	1. manta 2. dellysv 3. dellycnv 4. tiddit 5. cnvpytor

The merged SNV.somatic.<CASE_ID>.svdb.vcf.gz file retains all the information for the variants from the caller in which the variants are identified, which are then annotated using ensembl-vep. The SweGen and frequencies and the frequency of observed structural variants from clinical normal samples are annotated using SVDB.

The following filter applies for both tumor-normal and tumor-only samples in addition to caller specific filters.

SWEGENAF: SweGen Allele Frequency

SWEGENAF <= 0.02  (or) SWEGENAF == "."

Frq: Frequency of observation of the variants from normal Clinical samples

Frq <= 0.02  (or) Frq == "."

The variants scored as PASS are included in the final vcf file (SNV.somatic.<CASE_ID>.svdb.<research/clinical>.filtered.pass.vcf.gz).

The following command can be used to fetch the variants identified by a specific caller from merged structural and copy number variants.

zgrep -E "#|<Caller>" <*.svdb.vcf.gz>

Using GENS for WGS#

GENS is a visualization tool similar to IGV, originally developed in Clinical Genomics Lund, and primarily for visualizing genomic copy number profiles from WGS samples.

To visualise the GENS-formatted files from BALSAMIC you need to have GENS installed, and to do this you can follow the instructions on the Clinical-Genomics-Lund GENS-repository:

Clinical-Genomics-Lund-GENS

Two files per sample are uploaded to GENS, one file with allele-frequencies from SNV & InDel germline-calls (BAF file) which can be used to aid the interpretation of the CN-profile, and one file with the Log2 copy number ratios normalized against a PON. Instructions for how to generate this PON using the BALSAMIC PON workflow can be found here:

Generate GENS PON.

There are three required arguments for creating the input files for GENS: 1. Genome interval file produced by GATK PreprocessIntervals (see instructions in GENS PON creation) 2. A gender specific PON (see instructions in GENS PON creation) 3. A population database VCF with variant positions to be reported in the BAF file.

We created the file mentioned in 3 using the file gnomad.genomes.r2.1.1.sites filtered with bcftools to only keep variants with an AF above 0.05.

bcftools view -i AF>=0.05 -Oz

To config BALSAMIC to run with GENS activated you supply these files like this:

balsamic config case \
  --case-id <CASE_ID>
  --balsamic-cache </path/reference_cache/>
  --analysis-dir </path/analysis/>
  --fastq-path </path/fastq/>
  --gender <[male/female]>
  --analysis-workflow balsamic
  --genome-version hg19
  --tumor-sample-name <TUMOR_NAME>
  --genome-interval </path/genome_interval>
  --gens-coverage-pon </path/pon_file>
  --gnomad-min-af5 </path/population_vcf.vcf.gz>

Genome Reference Files#

How to generate genome reference files for ascatNGS

Detailed information is available from ascatNGS documentation

The file SnpGcCorrections.tsv prepared from the 1000 genome SNP panel.

GC correction file:

First step is to download the 1000 genome snp file and convert it from .vcf to .tsv. The detailed procedure to for this step is available from ascatNGS-reference-files (Human reference files from 1000 genomes VCFs)

export TG_DATA=ftp://ftp.ensembl.org/pub/grch37/release-83/variation/vcf/homo_sapiens/1000GENOMES-phase_3.vcf.gz

Followed by:

curl -sSL $TG_DATA | zgrep -F 'E_Multiple_observations' | grep -F 'TSA=SNV' |\
perl -ane 'next if($F[0] !~ m/^\d+$/ && $F[0] !~ m/^[XY]$/);\
next if($F[0] eq $l_c && $F[1]-1000 < $l_p); $F[7]=~m/MAF=([^;]+)/;\
next if($1 < 0.05); printf "%s\t%s\t%d\n", $F[2],$F[0],$F[1];\
$l_c=$F[0]; $l_p=$F[1];' > SnpPositions_GRCh37_1000g.tsv

–or–

curl -sSL $TG_DATA | zgrep -F 'E_Multiple_observations' | grep -F 'TSA=SNV' |\
perl -ane 'next if($F[0] !~ m/^\d+$/ && $F[0] !~ m/^[XY]$/); $F[7]=~m/MAF=([^;]+)/;\
next if($1 < 0.05); next if($F[0] eq $l_c && $F[1]-1000 < $l_p);\
printf "%s\t%s\t%d\n", $F[2],$F[0],$F[1]; $l_c=$F[0]; $l_p=$F[1];'\
> SnpPositions_GRCh37_1000g.tsv

Second step is to use SnpPositions.tsv file and generate SnpGcCorrections.tsv file, more details see ascatNGS-convert-snppositions

ascatSnpPanelGcCorrections.pl genome.fa SnpPositions.tsv > SnpGcCorrections.tsv