Annotation resources
BALSAMIC annotates somatic single nucleotide variants (SNVs) using ensembl-vep and vcfanno. Somatic structural variants (SVs), somatic copy-number variants (CNVs) and germline single nucleotide variants are annotated using only ensembl-vep. All SVs and CNVs are merged using SVDB before annotating for Target Genome Analysis (TGA) or Whole Genome Sequencing (WGS) analyses.
gnomAD
BALSAMIC adds the following annotation from the gnomAD database using vcfanno.
VCF tag | description |
---|---|
GNOMADAF_popmax | maximum allele frequency across populations |
GNOMADAF | fraction of the reads supporting the alternate allele, allelic frequency |
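As a rough illustration of how these tags can be read downstream, the sketch below splits a VCF INFO column into a dictionary and pulls out GNOMADAF_popmax. The INFO string and the helper name `parse_info` are invented for this example; only the tag names come from the table above.

```python
def parse_info(info: str) -> dict:
    """Split a VCF INFO column into a {tag: value} dict; bare flags map to True."""
    fields = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue  # empty INFO column or stray separator
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


# Hypothetical INFO column, not taken from a real BALSAMIC run.
info = "DP=120;GNOMADAF_popmax=0.0004;GNOMADAF=0.0001;SOMATIC"
tags = parse_info(info)

# Values are kept as strings, so convert before comparing numerically.
popmax = float(tags.get("GNOMADAF_popmax", 0.0))
```

For production use, a dedicated VCF library is usually a safer choice than manual string splitting.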
ClinVar
BALSAMIC adds the following annotation from the ClinVar database using vcfanno.
VCF tag | description |
---|---|
CLNACC | Variant Accession and Versions |
CLNREVSTAT | ClinVar review status for the Variation ID |
CLNSIG | Clinical significance for this single variant |
CLNVCSO | Sequence Ontology id for variant type |
CLNVC | Variant type |
ORIGIN | Allele origin |
The values for ORIGIN are described below:
Value | Annotation |
---|---|
0 | unknown |
1 | germline |
2 | somatic |
4 | inherited |
8 | paternal |
16 | maternal |
32 | de-novo |
64 | biparental |
128 | uniparental |
256 | not-tested |
512 | tested-inconclusive |
1073741824 | other |
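The non-zero ORIGIN values are all powers of two, which suggests the field can be read as a bit field in which several origins are summed into one integer. The sketch below decodes a value under that assumption; the function name `decode_origin` is hypothetical, and the mapping is copied from the table above.

```python
# Bit value -> label, copied from the ORIGIN table above.
ORIGIN_FLAGS = {
    1: "germline",
    2: "somatic",
    4: "inherited",
    8: "paternal",
    16: "maternal",
    32: "de-novo",
    64: "biparental",
    128: "uniparental",
    256: "not-tested",
    512: "tested-inconclusive",
    1073741824: "other",
}


def decode_origin(value: int) -> list:
    """Return the labels whose bits are set in an ORIGIN value (assumed bit field)."""
    if value == 0:
        return ["unknown"]
    return [label for bit, label in ORIGIN_FLAGS.items() if value & bit]


# 3 = 1 + 2, i.e. reported as both germline and somatic.
labels = decode_origin(3)
```

Under this reading, a value such as 3 means the allele was reported with more than one origin rather than being an undocumented code.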
COSMIC
BALSAMIC uses ensembl-vep to add the following annotation from the COSMIC database.
VCF tag | description |
---|---|
COSMIC_CDS | CDS annotation |
COSMIC_GENE | gene name |
COSMIC_STRAND | strand |
COSMIC_CNT | number of samples with this mutation in the COSMIC database |
COSMIC_AA | peptide annotation |
CADD
BALSAMIC adds the following annotation for SNVs from the CADD database using vcfanno.
VCF tag | description |
---|---|
CADD | Combined Annotation Dependent Depletion |
LoqusDB somatic frequencies (cancer cases)
VCF tag | description | variant type |
---|---|---|
Cancer_Somatic_Frq | Frequency of observation for somatic mutations | SNV, SV |
Cancer_Somatic_Obs | allele counts of the somatic variant | SNV, SV |
Cancer_Somatic_Hom | allele counts of the homozygous somatic variant | SNV |
LoqusDB germline frequencies (cancer cases)
VCF tag | description | variant type |
---|---|---|
Cancer_Germline_Frq | Frequency of observation for germline mutations | SNV |
Cancer_Germline_Obs | allele counts of the germline variant | SNV |
Cancer_Germline_Hom | allele counts of the homozygous germline variant | SNV |
LoqusDB germline frequencies (non-cancer cases)
BALSAMIC adds the following annotation from a database of non-cancer clinical samples, using vcfanno for SNVs and SVDB for SVs.
VCF tag | description | variant type |
---|---|---|
Frq | Frequency of observation of the variants from normal non-cancer clinical samples | SNV, SV |
Obs | allele counts of the variant in normal non-cancer clinical samples | SNV |
Hom | allele counts of the homozygous variant in normal non-cancer clinical samples | SNV |
clin_obs | allele counts | SV |
SWEGEN
BALSAMIC adds the following annotation from the SWEGEN database, using vcfanno for SNVs and SVDB for SVs.
VCF tag | description | variant type |
---|---|---|
SWEGENAF | allele frequency from 1000 Swedish genomes project | SNV, SV |
SWEGENAAC_Hom | allele counts of homozygous variants | SNV |
SWEGENAAC_Het | allele counts of heterozygous variants | SNV |
SWEGENAAC_Hemi | allele counts of hemizygous variants | SNV |
swegen_obs | allele count | SV |
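The population-frequency tags from gnomAD, SWEGEN and the non-cancer LoqusDB database are typically used together to separate rare variants from common ones. The sketch below shows one way to combine them. It is not BALSAMIC's own filter: the threshold, the function name `is_rare`, and the choice to treat a missing tag as "not observed" are all assumptions made for illustration.

```python
# Tags documented in the tables above that carry a population frequency.
POPULATION_TAGS = ("GNOMADAF_popmax", "SWEGENAF", "Frq")

# Hypothetical cut-off; choose a value appropriate for the analysis at hand.
MAX_POP_FREQ = 0.01


def is_rare(info: dict, max_freq: float = MAX_POP_FREQ) -> bool:
    """Return True if no population-frequency tag exceeds max_freq."""
    for tag in POPULATION_TAGS:
        raw = info.get(tag)
        if raw is None:
            # Assumption: an absent tag means the variant was not seen in that database.
            continue
        if float(raw) > max_freq:
            return False
    return True


# Hypothetical parsed INFO fields for two variants.
common_variant = {"GNOMADAF_popmax": "0.002", "SWEGENAF": "0.2"}
rare_variant = {"GNOMADAF_popmax": "0.001", "Frq": "0.005"}
```

Here `is_rare(common_variant)` is False because SWEGENAF exceeds the cut-off, while `is_rare(rare_variant)` is True.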
MSI
Using MSIsensor-pro, BALSAMIC generates the following table, which is included in the CNV report. See https://github.com/xjtu-omics/msisensor-pro/wiki for more details.
Column | description |
---|---|
Total_Number_of_Sites | all detected microsatellites |
Number_of_Somatic_Sites | the unstable (somatic) microsatellites |
MSI | The MSI score |
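MSIsensor-pro reports its score as the percentage of microsatellite sites that are unstable, so the three columns can be cross-checked against each other. The sketch below does this for a small tab-separated table. The file layout and the numbers are hypothetical; only the column names are taken from the table above.

```python
import csv
import io

# Hypothetical table contents, assumed to be tab-separated with one data row.
table = "Total_Number_of_Sites\tNumber_of_Somatic_Sites\tMSI\n500\t25\t5.00\n"

row = next(csv.DictReader(io.StringIO(table), delimiter="\t"))
total = int(row["Total_Number_of_Sites"])
somatic = int(row["Number_of_Somatic_Sites"])

# Percentage of unstable sites; guard against an empty site list.
recomputed = 100.0 * somatic / total if total else 0.0
reported = float(row["MSI"])
```

If `recomputed` and `reported` disagree by more than rounding error, the table was probably produced with a different score definition than the one assumed here.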
ENSEMBL-VEP annotations
Where relevant, BALSAMIC uses ensembl-vep to annotate somatic and germline SNVs and somatic SVs/CNVs from 1000genomes (phase3), ClinVar, ESP, HGMD-PUBLIC, dbSNP, gencode, gnomAD, polyphen, refseq, and sift databases.
VEP has a setting for the maximum size of a structural variant that it will annotate. It is currently set to the size of chromosome 1 in hg19 (--max_sv_size 249250621).
The following annotations are added by ensembl-vep.
Annotation | description |
---|---|
Allele | the variant allele used to calculate the consequence |
Gene | Ensembl stable ID of affected gene |
Feature | Ensembl stable ID of feature |
Feature type | type of feature. Currently one of Transcript, RegulatoryFeature, MotifFeature. |
Consequence | consequence type of this variant |
Position in cDNA | relative position of base pair in cDNA sequence |
Position in CDS | relative position of base pair in coding sequence |
Position in protein | relative position of amino acid in protein |
Amino acid change | only given if the variant affects the protein-coding sequence |
Codon change | the alternative codons with the variant base in upper case |
Co-located variation | identifier of any existing variants |
VARIANT_CLASS | Sequence Ontology variant class |
SYMBOL | the gene symbol |
SYMBOL_SOURCE | the source of the gene symbol |
STRAND | the DNA strand (1 or -1) on which the transcript/feature lies |
ENSP | the Ensembl protein identifier of the affected transcript |
FLAGS | transcript quality flags: cds_start_NF (CDS 5' incomplete), cds_end_NF (CDS 3' incomplete) |
SWISSPROT | Best match UniProtKB/Swiss-Prot accession of protein product |
TREMBL | Best match UniProtKB/TrEMBL accession of protein product |
UNIPARC | Best match UniParc accession of protein product |
HGVSc | the HGVS coding sequence name |
HGVSp | the HGVS protein sequence name |
HGVSg | the HGVS genomic sequence name |
HGVS_OFFSET | Indicates by how many bases the HGVS notations for this variant have been shifted |
SIFT | the SIFT prediction and/or score, with both given as prediction(score) |
PolyPhen | the PolyPhen prediction and/or score |
MOTIF_NAME | The source and identifier of a transcription factor binding profile aligned at this position |
MOTIF_POS | The relative position of the variation in the aligned TFBP |
HIGH_INF_POS | A flag indicating if the variant falls in a high information position of a transcription factor binding profile (TFBP) |
MOTIF_SCORE_CHANGE | The difference in motif score of the reference and variant sequences for the TFBP |
CANONICAL | a flag indicating if the transcript is denoted as the canonical transcript for this gene |
CCDS | the CCDS identifier for this transcript, where applicable |
INTRON | the intron number (out of total number) |
EXON | the exon number (out of total number) |
DOMAINS | the source and identifier of any overlapping protein domains |
DISTANCE | Shortest distance from variant to transcript |
AF | Frequency of existing variant in 1000 Genomes |
AFR_AF | Frequency of existing variant in 1000 Genomes combined African population |
AMR_AF | Frequency of existing variant in 1000 Genomes combined American population |
EUR_AF | Frequency of existing variant in 1000 Genomes combined European population |
EAS_AF | Frequency of existing variant in 1000 Genomes combined East Asian population |
SAS_AF | Frequency of existing variant in 1000 Genomes combined South Asian population |
AA_AF | Frequency of existing variant in NHLBI-ESP African American population |
EA_AF | Frequency of existing variant in NHLBI-ESP European American population |
gnomAD_AF | Frequency of existing variant in gnomAD exomes combined population |
gnomAD_AFR_AF | Frequency of existing variant in gnomAD exomes African/American population |
gnomAD_AMR_AF | Frequency of existing variant in gnomAD exomes American population |
gnomAD_ASJ_AF | Frequency of existing variant in gnomAD exomes Ashkenazi Jewish population |
gnomAD_EAS_AF | Frequency of existing variant in gnomAD exomes East Asian population |
gnomAD_FIN_AF | Frequency of existing variant in gnomAD exomes Finnish population |
gnomAD_NFE_AF | Frequency of existing variant in gnomAD exomes Non-Finnish European population |
gnomAD_OTH_AF | Frequency of existing variant in gnomAD exomes other combined populations |
gnomAD_SAS_AF | Frequency of existing variant in gnomAD exomes South Asian population |
MAX_AF | Maximum observed allele frequency in 1000 Genomes, ESP and gnomAD |
MAX_AF_POPS | Populations in which maximum allele frequency was observed |
CLIN_SIG | ClinVar clinical significance of the dbSNP variant |
BIOTYPE | Biotype of transcript or regulatory feature |
APPRIS | Annotates alternatively spliced transcripts as primary or alternate based on a range of computational methods. NB: not available for GRCh37 |
TSL | Transcript support level. NB: not available for GRCh37 |
PUBMED | Pubmed ID(s) of publications that cite existing variant |
SOMATIC | Somatic status of existing variant(s); multiple values correspond to multiple values in the Existing_variation field |
PHENO | Indicates if existing variant is associated with a phenotype, disease or trait; multiple values correspond to multiple values in the Existing_variation field |
GENE_PHENO | Indicates if overlapped gene is associated with a phenotype, disease or trait |
BAM_EDIT | Indicates success or failure of edit using BAM file |
GIVEN_REF | Reference allele from input |
REFSEQ_MATCH | the RefSeq transcript match status; contains a number of flags indicating whether this RefSeq transcript matches the underlying reference sequence and/or an Ensembl transcript |
CHECK_REF | Reports variants where the input reference does not match the expected reference |
HGNC_ID | A unique ID provided by the HGNC for each gene with an approved symbol |
MANE | Indicates whether the transcript is the MANE Select or MANE Plus Clinical transcript for the gene |
miRNA | Reports where the variant lies in the miRNA secondary structure |
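In VCF output, VEP normally writes these annotations into a single INFO field (CSQ by default) whose sub-fields are pipe-separated, with the field order given in the "Format:" part of the header line. The sketch below turns such a value into one dictionary per annotated transcript. The header and CSQ value are shortened, hypothetical examples, and the helper names `csq_fields` and `parse_csq` are not part of VEP or BALSAMIC.

```python
import re


def csq_fields(header_line: str) -> list:
    """Extract the ordered sub-field names from a VEP CSQ header line."""
    match = re.search(r'Format: ([^"]+)"', header_line)
    if match is None:
        raise ValueError("no CSQ format found in header line")
    return match.group(1).split("|")


def parse_csq(csq_value: str, fields: list) -> list:
    """Split a CSQ value into one {field: value} dict per comma-separated entry."""
    return [dict(zip(fields, entry.split("|"))) for entry in csq_value.split(",")]


# Hypothetical, shortened header; a real one lists many more sub-fields.
header = (
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations '
    'from Ensembl VEP. Format: Allele|Consequence|SYMBOL|Gene|CANONICAL">'
)
fields = csq_fields(header)

# Hypothetical CSQ value with two transcript-level entries.
csq = "T|missense_variant|TP53|ENSG00000141510|YES,T|intron_variant|TP53|ENSG00000141510|"
records = parse_csq(csq, fields)
```

Empty sub-fields come back as empty strings, so a check such as `record["CANONICAL"] == "YES"` is enough to pick out canonical transcripts.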