Structural and Copy Number variants¶

Depending on the sequencing type, BALSAMIC is currently running the following structural and copy number variant callers:

SV CNV callers¶
Variant caller	Sequencing type	Analysis type	Somatic/Germline	Variant type
AscatNgs	WGS	tumor-normal	somatic	CNV
CNVkit	TGA, WES	tumor-normal, tumor-only	somatic	CNV
Delly	TGA, WES, WGS	tumor-normal, tumor-only	somatic	SV, CNV
Manta	TGA, WES, WGS	tumor-normal, tumor-only	somatic, germline	SV
TIDDIT	WGS	tumor-normal, tumor-only	somatic	SV

Further details about a specific caller can be found in the links for the repositories containing the documentation for SV and CNV callers along with the links for the articles are listed in bioinfo softwares.

It mandatory to provide the gender of the sample from BALSAMIC version >= 10.0.0 For CNV analysis.

The copy number variants, identified using ascatNgs and dellycnv, are converted to deletion and duplications before they are merged using SVDB with –bnd_distance = 5000 (distance between end points for the variants from different callers) and –overlap = 0.80 (percentage for overlapping bases for the variants from different callers). SVDB prioritizes the merging of variants from SV and CNV callers to fetch position and genotype information, in the following order:

SVDB merge caller priority order¶
TGA, WES tumor-normal	TGA, WES tumor-only	WGS tumor-normal	WGS tumor-only
1. manta 2. dellysv 3. cnvkit 4. dellycnv	1. manta 2. dellysv 3. cnvkit 4. dellycnv	1. manta 2. dellysv 3. ascat 4. dellycnv 5. tiddit	1. manta 2. dellysv 3. dellycnv 4. tiddit

The merged *.svdb.vcf.gz file retains all the information for the variants from the caller in which the variants are identified, which are then annotated using ensembl-vep.

The following command can be used to fetch the variants identified by a specific caller from merged structural and copy number variants.

zgrep -E "#|<Caller>" <*.svdb.vcf.gz>

Genome Reference Files¶

How to generate genome reference files for ascatNGS

Detailed information is available from ascatNGS documentation

The file SnpGcCorrections.tsv prepared from the 1000 genome SNP panel.

GC correction file:

First step is to download the 1000 genome snp file and convert it from .vcf to .tsv. The detailed procedure to for this step is available from ascatNGS-reference-files (Human reference files from 1000 genomes VCFs)

export TG_DATA=ftp://ftp.ensembl.org/pub/grch37/release-83/variation/vcf/homo_sapiens/1000GENOMES-phase_3.vcf.gz

Followed by:

curl -sSL $TG_DATA | zgrep -F 'E_Multiple_observations' | grep -F 'TSA=SNV' |\
perl -ane 'next if($F[0] !~ m/^\d+$/ && $F[0] !~ m/^[XY]$/);\
next if($F[0] eq $l_c && $F[1]-1000 < $l_p); $F[7]=~m/MAF=([^;]+)/;\
next if($1 < 0.05); printf "%s\t%s\t%d\n", $F[2],$F[0],$F[1];\
$l_c=$F[0]; $l_p=$F[1];' > SnpPositions_GRCh37_1000g.tsv

–or–

curl -sSL $TG_DATA | zgrep -F 'E_Multiple_observations' | grep -F 'TSA=SNV' |\
perl -ane 'next if($F[0] !~ m/^\d+$/ && $F[0] !~ m/^[XY]$/); $F[7]=~m/MAF=([^;]+)/;\
next if($1 < 0.05); next if($F[0] eq $l_c && $F[1]-1000 < $l_p);\
printf "%s\t%s\t%d\n", $F[2],$F[0],$F[1]; $l_c=$F[0]; $l_p=$F[1];'\
> SnpPositions_GRCh37_1000g.tsv

Second step is to use SnpPositions.tsv file and generate SnpGcCorrections.tsv file, more details see ascatNGS-convert-snppositions

ascatSnpPanelGcCorrections.pl genome.fa SnpPositions.tsv > SnpGcCorrections.tsv