Structural and Copy Number variants#
Depending on the sequencing type, BALSAMIC currently runs the following structural and copy number variant callers:
Variant caller | Sequencing type | Analysis type | Somatic/Germline | Variant type |
---|---|---|---|---|
AscatNgs | WGS | tumor-normal | somatic | CNV |
CNVkit | TGA, WES | tumor-normal, tumor-only | somatic | CNV |
Delly | TGA, WES, WGS | tumor-normal, tumor-only | somatic | SV, CNV |
Manta | TGA, WES, WGS | tumor-normal, tumor-only | somatic, germline | SV |
TIDDIT | WGS | tumor-normal, tumor-only | somatic | SV |
CNVpytor | WGS | tumor-only | somatic | CNV |
Further details about a specific caller can be found in its repository documentation. Links to the repositories and the articles for the SV and CNV callers are listed in bioinfo softwares.
From BALSAMIC version 10.0.0 onwards, it is mandatory to provide the gender of the sample for CNV analysis.
Pre-merge Filtrations#
The copy number variants identified using ascatNgs and dellycnv are converted to deletions and duplications before they are merged using SVDB with --bnd_distance 5000 (distance between the end points of variants from different callers) and --overlap 0.80 (fraction of overlapping bases between variants from different callers).
Tumor and normal calls in TIDDIT are merged using SVDB with --bnd_distance 500 and --overlap 0.80. Using the custom script filter_SVs.py, soft-filters are added to the calls based on the presence of the variant in the normal sample, with the goal of retaining only somatic variants as PASS.
Manta calls are filtered using bcftools to only keep variants that have evidence from 3 or more reads.
Variant caller | Filter added | Filter expression |
---|---|---|
TIDDIT | high_normal_af_fraction | (AF_N_MAX / AF_T_MAX) > 0.25 |
TIDDIT | max_normal_allele_frequency | AF_N_MAX > 0.25 |
TIDDIT | normal_variant | AF_T_MAX == 0 and ctg_t == False |
TIDDIT | in_normal | ctg_n == True and AF_N_MAX == 0 and AF_T_MAX <= 0.25 |
Manta | low_pr_sr_count | SUM(FORMAT/PR[0:1]+FORMAT/SR[0:1]) < 4.0 |
Further information regarding the TIDDIT tumor-normal filtration: translocation variants are represented by two BND records in the VCF, which allows the two records to receive different soft-filters. Soft-filters are therefore assigned to a translocation only if neither BND is PASS.
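The TIDDIT rows of the filter table above can be read as four independent tests on one call. The following is a minimal sketch of that logic, not the code in filter_SVs.py: the function name `tiddit_soft_filters`, its argument order, and the guard that skips the ratio test when AF_T_MAX is 0 are assumptions made for illustration. It also ignores the translocation rule about paired BND records.

```shell
# Hypothetical helper (not part of BALSAMIC): prints the soft-filters from the
# table that apply to one TIDDIT call, or PASS when none applies.
# Arguments: AF_T_MAX AF_N_MAX ctg_t ctg_n (ctg_* given as True or False).
tiddit_soft_filters() {
  awk -v af_t="$1" -v af_n="$2" -v ctg_t="$3" -v ctg_n="$4" 'BEGIN {
    n = 0
    # Assumption: skip the ratio test when AF_T_MAX is 0 to avoid dividing by zero.
    if (af_t > 0 && (af_n / af_t) > 0.25) f[++n] = "high_normal_af_fraction"
    if (af_n > 0.25)                      f[++n] = "max_normal_allele_frequency"
    if (af_t == 0 && ctg_t == "False")    f[++n] = "normal_variant"
    if (ctg_n == "True" && af_n == 0 && af_t <= 0.25) f[++n] = "in_normal"
    if (n == 0) { print "PASS"; exit }
    out = f[1]
    for (i = 2; i <= n; i++) out = out ";" f[i]
    print out
  }'
}

tiddit_soft_filters 0.40 0.02 True False   # prints PASS
tiddit_soft_filters 0.20 0 True True       # prints in_normal
```

A call can collect more than one soft-filter, which is why the sketch joins the matching names with a semicolon, as in a VCF FILTER column.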
Post-merge Filtrations#
SVDB prioritizes the merging of variants from SV and CNV callers to fetch position and genotype information, in the following order:
TGA, WES tumor-normal | TGA, WES tumor-only | WGS tumor-normal | WGS tumor-only |
---|---|---|---|
1. manta | 1. manta | 1. manta | 1. manta |
2. dellysv | 2. dellysv | 2. dellysv | 2. dellysv |
3. cnvkit | 3. cnvkit | 3. ascat | 3. dellycnv |
4. dellycnv | 4. dellycnv | 4. dellycnv | 4. tiddit |
 | | 5. tiddit | 5. cnvpytor |
The merged SNV.somatic.<CASE_ID>.svdb.vcf.gz file retains all the information for each variant from the caller that identified it. The variants are then annotated using ensembl-vep. The SweGen frequencies and the frequencies of structural variants observed in clinical normal samples are annotated using SVDB.
The following filter applies for both tumor-normal and tumor-only samples in addition to caller specific filters.
SWEGENAF: SweGen Allele Frequency
SWEGENAF <= 0.02 (or) SWEGENAF == "."
Frq: Frequency of observation of the variants from normal Clinical samples
Frq <= 0.02 (or) Frq == "."
The variants scored as PASS are included in the final vcf file (SNV.somatic.<CASE_ID>.svdb.<research/clinical>.filtered.pass.vcf.gz).
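As an illustration of the two frequency conditions above, here is a minimal sketch that applies them to uncompressed VCF text with awk. It is not the BALSAMIC implementation: the function names and the sample records are made up, records that fail are simply dropped instead of being soft-filtered, and a record that lacks one of the INFO keys is assumed to behave like ".".

```shell
# Hypothetical helper: reads VCF text on stdin and prints header lines plus
# records whose SWEGENAF and Frq values are each <= 0.02 or ".".
filter_population_frequency() {
  awk -F '\t' '
    # Return the value of KEY in a semicolon-separated INFO string, or "."
    # when the key is absent (assumption: absent is treated as missing).
    function info_value(info, key,    n, parts, i, kv) {
      n = split(info, parts, ";")
      for (i = 1; i <= n; i++) {
        split(parts[i], kv, "=")
        if (kv[1] == key) return kv[2]
      }
      return "."
    }
    function is_rare(v) { return v == "." || v + 0 <= 0.02 }
    /^#/ { print; next }
    is_rare(info_value($8, "SWEGENAF")) && is_rare(info_value($8, "Frq"))
  '
}

# Synthetic records for demonstration only.
sample_sv_vcf() {
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '1\t1000\tsv1\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;SWEGENAF=0.001;Frq=.\n'
  printf '1\t5000\tsv2\tN\t<DUP>\t.\tPASS\tSVTYPE=DUP;SWEGENAF=0.35;Frq=0.01\n'
  printf '2\t7000\tsv3\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;SWEGENAF=.;Frq=0.10\n'
}

sample_sv_vcf | filter_population_frequency
```

With these three records only sv1 is kept: sv2 exceeds the SweGen threshold and sv3 exceeds the clinical observation threshold.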
The following command can be used to fetch the variants identified by a specific caller from merged structural and copy number variants.
zgrep -E "#|<Caller>" <*.svdb.vcf.gz>
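To show what the pattern keeps, the sketch below builds a tiny gzipped stand-in for a merged VCF and runs the same zgrep command on it. The records and the EXAMPLE_ORIGIN key are invented for the demonstration and do not reflect how the real merged file records the caller.

```shell
# Build a small gzipped VCF with synthetic records in a temporary directory.
demo_dir="$(mktemp -d)"
demo_vcf="$demo_dir/demo.svdb.vcf.gz"
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '1\t1000\tsv1\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;EXAMPLE_ORIGIN=manta\n'
  printf '1\t9000\tsv2\tN\t<DUP>\t.\tPASS\tSVTYPE=DUP;EXAMPLE_ORIGIN=cnvkit\n'
} | gzip -c > "$demo_vcf"

# Keep header lines (they contain "#") and records that mention the caller.
zgrep -E "#|manta" "$demo_vcf"
```

The pattern matches the caller name anywhere on a line, so a record is also kept when the name appears in a field other than the one that records the calling tool.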
Using GENS for WGS#
GENS is a visualization tool similar to IGV, originally developed at Clinical Genomics Lund, primarily for visualizing genomic copy number profiles from WGS samples.
To visualise the GENS-formatted files from BALSAMIC you need to have GENS installed. Installation instructions are available in the Clinical-Genomics-Lund GENS repository.
Two files per sample are uploaded to GENS: a BAF file with allele frequencies from germline SNV and InDel calls, which can aid the interpretation of the CN profile, and a file with the log2 copy number ratios normalized against a PON. Instructions for generating this PON with the BALSAMIC PON workflow can be found in GENS PON creation.
There are three required arguments for creating the input files for GENS:
1. Genome interval file produced by GATK PreprocessIntervals (see instructions in GENS PON creation)
2. A gender-specific PON (see instructions in GENS PON creation)
3. A population database VCF with variant positions to be reported in the BAF file.
We created the file mentioned in 3 from the file gnomad.genomes.r2.1.1.sites, filtered with bcftools to keep only variants with an AF of at least 0.05.
bcftools view -i 'AF>=0.05' -Oz
To configure BALSAMIC to run with GENS activated, supply these files as follows:
balsamic config case \
  --case-id <CASE_ID> \
  --balsamic-cache </path/reference_cache/> \
  --analysis-dir </path/analysis/> \
  --fastq-path </path/fastq/> \
  --gender <[male/female]> \
  --analysis-workflow balsamic \
  --genome-version hg19 \
  --tumor-sample-name <TUMOR_NAME> \
  --genome-interval </path/genome_interval> \
  --gens-coverage-pon </path/pon_file> \
  --gnomad-min-af5 </path/population_vcf.vcf.gz>
Genome Reference Files#
How to generate genome reference files for ascatNGS
Detailed information is available from the ascatNGS documentation.
The file SnpGcCorrections.tsv is prepared from the 1000 Genomes SNP panel.
GC correction file:
The first step is to download the 1000 Genomes SNP file and convert it from .vcf to .tsv. The detailed procedure for this step is available from ascatNGS-reference-files (Human reference files from 1000 genomes VCFs).
export TG_DATA=ftp://ftp.ensembl.org/pub/grch37/release-83/variation/vcf/homo_sapiens/1000GENOMES-phase_3.vcf.gz
Followed by:
curl -sSL "$TG_DATA" | zgrep -F 'E_Multiple_observations' | grep -F 'TSA=SNV' |\
perl -ane 'next if($F[0] !~ m/^\d+$/ && $F[0] !~ m/^[XY]$/);
next if($F[0] eq $l_c && $F[1]-1000 < $l_p); $F[7]=~m/MAF=([^;]+)/;
next if($1 < 0.05); printf "%s\t%s\t%d\n", $F[2],$F[0],$F[1];
$l_c=$F[0]; $l_p=$F[1];' > SnpPositions_GRCh37_1000g.tsv
–or–
curl -sSL "$TG_DATA" | zgrep -F 'E_Multiple_observations' | grep -F 'TSA=SNV' |\
perl -ane 'next if($F[0] !~ m/^\d+$/ && $F[0] !~ m/^[XY]$/); $F[7]=~m/MAF=([^;]+)/;
next if($1 < 0.05); next if($F[0] eq $l_c && $F[1]-1000 < $l_p);
printf "%s\t%s\t%d\n", $F[2],$F[0],$F[1]; $l_c=$F[0]; $l_p=$F[1];' \
> SnpPositions_GRCh37_1000g.tsv
The second step is to use the SnpPositions.tsv file to generate the SnpGcCorrections.tsv file. For more details see ascatNGS-convert-snppositions.
ascatSnpPanelGcCorrections.pl genome.fa SnpPositions.tsv > SnpGcCorrections.tsv