Structural and Copy Number Variants
Depending on the sequencing type, BALSAMIC currently runs the following structural and copy number variant callers:
| Variant caller | Sequencing type | Analysis type | Somatic/Germline | Variant type |
|---|---|---|---|---|
| AscatNgs | WGS | tumor-normal | somatic | CNV |
| CNVkit | TGA, WES | tumor-normal, tumor-only | somatic | CNV |
| Delly | TGA, WES, WGS | tumor-normal, tumor-only | somatic | SV, CNV |
| Manta | TGA, WES, WGS | tumor-normal, tumor-only | somatic, germline | SV |
| TIDDIT | WGS | tumor-normal, tumor-only | somatic | SV |
Further details about each SV and CNV caller are available from the documentation repositories and articles linked in bioinfo softwares.
From BALSAMIC version 10.0.0 onward, it is mandatory to provide the gender of the sample for CNV analysis.
The copy number variants identified by ascatNgs and dellycnv are converted to deletions and duplications before they are merged using SVDB with --bnd_distance = 5000 (the maximum distance between the end points of variants from different callers) and --overlap = 0.80 (the minimum fraction of overlapping bases between variants from different callers). When merging, SVDB takes position and genotype information from the SV and CNV callers in the following order of priority:
| Priority | TGA, WES tumor-normal | TGA, WES tumor-only | WGS tumor-normal | WGS tumor-only |
|---|---|---|---|---|
| 1 | manta | manta | manta | manta |
| 2 | dellysv | dellysv | dellysv | dellysv |
| 3 | cnvkit | cnvkit | ascat | dellycnv |
| 4 | dellycnv | dellycnv | dellycnv | tiddit |
| 5 | | | tiddit | |
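To make the merge settings above concrete, the following is a minimal dry-run sketch that only prints an SVDB merge command and does not execute it. The helper name build_svdb_merge_cmd and the input file names (<caller>.vcf.gz) are hypothetical, and the exact command BALSAMIC runs may include further options; check `svdb --merge --help` before relying on it.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print (do not execute) an SVDB merge command.
# build_svdb_merge_cmd and the <caller>.vcf.gz file names are hypothetical.

build_svdb_merge_cmd() {
    # Arguments: caller names in priority order, highest priority first.
    local priority="" vcfs="" caller
    for caller in "$@"; do
        # Tag each input VCF with its caller name (file:tag) so that
        # --priority can refer to it.
        vcfs="${vcfs} ${caller}.vcf.gz:${caller}"
        priority="${priority:+${priority},}${caller}"
    done
    printf 'svdb --merge --bnd_distance 5000 --overlap 0.80 --vcf%s --priority %s\n' \
        "$vcfs" "$priority"
}

# The five-caller order listed above (the one that includes ascat).
build_svdb_merge_cmd manta dellysv ascat dellycnv tiddit
```

Printing the command instead of running it keeps the sketch usable on machines where SVDB is not installed; redirecting the output of the real command to a file would produce the merged VCF.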
The merged *.svdb.vcf.gz file retains, for each variant, all the information reported by the caller that identified it. The merged variants are then annotated using ensembl-vep.
The following command can be used to fetch the variants identified by a specific caller from the merged structural and copy number variant file:
zgrep -E "#|<Caller>" <*.svdb.vcf.gz>
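As a slightly more defensive variant of the command above, the sketch below wraps the same idea in a small function. The name filter_by_caller is hypothetical and not part of BALSAMIC; the only behavioural difference is that the header pattern is anchored to the start of the line (`^#`), so a record that happens to contain a `#` character is not kept by accident.

```shell
# Sketch: keep VCF header lines plus records that mention a caller name.
# filter_by_caller is a hypothetical helper, not part of BALSAMIC.
filter_by_caller() {
    caller="$1"   # e.g. manta, tiddit
    vcf="$2"      # path to a gzip-compressed merged VCF
    # ^# anchors the header match to the start of the line, so a '#'
    # inside a record does not count as a header line.
    gzip -dc "$vcf" | grep -E "^#|${caller}"
}

# Example call (the file name is a placeholder):
#   filter_by_caller manta case.svdb.vcf.gz > manta_only.vcf
```

Like the original command, this matches the caller name anywhere in a record, so it assumes the name does not also occur in unrelated fields.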
Genome Reference Files
How to generate genome reference files for ascatNGS
Detailed information is available from the ascatNGS documentation.
The file SnpGcCorrections.tsv is prepared from the 1000 Genomes SNP panel.
GC correction file:
The first step is to download the 1000 Genomes SNP file and convert it from .vcf to .tsv. The detailed procedure for this step is available from ascatNGS-reference-files (Human reference files from 1000 genomes VCFs).
export TG_DATA=ftp://ftp.ensembl.org/pub/grch37/release-83/variation/vcf/homo_sapiens/1000GENOMES-phase_3.vcf.gz
Followed by:
curl -sSL "$TG_DATA" | zgrep -F 'E_Multiple_observations' | grep -F 'TSA=SNV' |\
perl -ane 'next if($F[0] !~ m/^\d+$/ && $F[0] !~ m/^[XY]$/);
next if($F[0] eq $l_c && $F[1]-1000 < $l_p); $F[7]=~m/MAF=([^;]+)/;
next if($1 < 0.05); printf "%s\t%s\t%d\n", $F[2],$F[0],$F[1];
$l_c=$F[0]; $l_p=$F[1];' > SnpPositions_GRCh37_1000g.tsv
–or–
curl -sSL "$TG_DATA" | zgrep -F 'E_Multiple_observations' | grep -F 'TSA=SNV' |\
perl -ane 'next if($F[0] !~ m/^\d+$/ && $F[0] !~ m/^[XY]$/); $F[7]=~m/MAF=([^;]+)/;
next if($1 < 0.05); next if($F[0] eq $l_c && $F[1]-1000 < $l_p);
printf "%s\t%s\t%d\n", $F[2],$F[0],$F[1]; $l_c=$F[0]; $l_p=$F[1];' \
> SnpPositions_GRCh37_1000g.tsv
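Before moving on, it may be worth checking that the resulting file has the shape the filter is meant to produce. The sketch below assumes only what the Perl code above implies: three tab-separated columns (SNP ID, chromosome, position) and at least 1000 bp between consecutive SNPs kept on the same chromosome. The function name check_snp_positions is hypothetical.

```shell
# Sketch: sanity-check a SnpPositions file produced by either command above.
# check_snp_positions is a hypothetical helper name.
check_snp_positions() {
    awk -F '\t' '
        BEGIN { bad = 0 }
        # Every row should hold exactly three fields: SNP ID, chromosome, position.
        NF != 3 {
            printf "line %d: expected 3 fields, found %d\n", NR, NF
            bad = 1
            next
        }
        # Consecutive SNPs on the same chromosome should be >= 1000 bp apart.
        $2 == chrom && $3 - 1000 < pos {
            printf "line %d: within 1000 bp of previous SNP\n", NR
            bad = 1
        }
        { chrom = $2; pos = $3 }
        # Exit status 0 when no problem was reported, 1 otherwise.
        END { exit bad }
    ' "$1"
}

# Example call:
#   check_snp_positions SnpPositions_GRCh37_1000g.tsv && echo "format looks consistent"
```

The check does not verify the MAF threshold, because the allele frequency is not written to the output file.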
The second step is to generate the SnpGcCorrections.tsv file from the SnpPositions.tsv file (the SnpPositions_GRCh37_1000g.tsv file written in the first step). For more details, see ascatNGS-convert-snppositions.
ascatSnpPanelGcCorrections.pl genome.fa SnpPositions.tsv > SnpGcCorrections.tsv