Annotation resources#

BALSAMIC annotates somatic single nucleotide variants (SNVs) using ensembl-vep and vcfanno. Somatic structural variants (SVs), somatic copy-number variants (CNVs) and germline single nucleotide variants are annotated using only ensembl-vep. All SVs and CNVs are merged using SVDB before annotating for Target Genome Analysis (TGA) or Whole Genome Sequencing (WGS) analyses.

gnomAD#

BALSAMIC adds the following annotations from the gnomAD database using vcfanno.

gnomAD annotations#
VCF tag	description
GNOMADAF_popmax	maximum allele frequency across populations
GNOMADAF	allele frequency of the alternate allele in gnomAD
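
As an illustration of how these tags might be consumed downstream, the following minimal sketch reads `GNOMADAF_popmax` from the INFO column of a VCF record and flags the variant as rare. The helper names, the example INFO string and the 0.005 cut-off are assumptions made for this example and are not BALSAMIC defaults.

```python
def parse_info(info_field):
    """Turn a VCF INFO string into a dict; flag-style keys map to True."""
    info = {}
    for entry in info_field.split(";"):
        if not entry or entry == ".":
            continue
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True
    return info


def is_rare(info_field, max_popmax_af=0.005):
    """Return True when GNOMADAF_popmax is absent or below the threshold.

    The threshold is a hypothetical value chosen for illustration.
    """
    raw = parse_info(info_field).get("GNOMADAF_popmax")
    if raw is None or raw is True or raw == ".":
        return True  # no gnomAD value: treated as rare in this sketch
    # Multi-allelic records may carry comma-separated values; use the largest.
    values = [float(v) for v in raw.split(",") if v not in ("", ".")]
    return not values or max(values) < max_popmax_af


# Made-up INFO column for demonstration purposes.
example = "DP=120;GNOMADAF=0.0003;GNOMADAF_popmax=0.0011"
print(is_rare(example))
```

Treating a missing value as rare is a design choice of this sketch; a stricter filter could instead keep such variants in a separate category.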

ClinVar#

BALSAMIC adds the following annotations from the ClinVar database using vcfanno.

ClinVar annotations#
VCF tag	description
CLNACC	Variant Accession and Versions
CLNREVSTAT	ClinVar review status for the Variation ID
CLNSIG	Clinical significance for this single variant
CLNVCSO	Sequence Ontology id for variant type
CLNVC	Variant type
ORIGIN	Allele origin

The values for ORIGIN are described below:

ClinVar ORIGIN#
Value	Annotation
0	unknown
1	germline
2	somatic
4	inherited
8	paternal
16	maternal
32	de-novo
64	biparental
128	uniparental
256	not-tested
512	tested-inconclusive
1073741824	other
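
Every non-zero value in the table above is a power of two, which suggests that ORIGIN can be read as a bit field in which several origins are summed into a single number. Assuming that interpretation, a value could be decoded roughly as follows; the function name `decode_origin` is hypothetical.

```python
# Mapping taken from the ClinVar ORIGIN table; 0 ("unknown") is handled separately.
CLINVAR_ORIGIN = {
    1: "germline",
    2: "somatic",
    4: "inherited",
    8: "paternal",
    16: "maternal",
    32: "de-novo",
    64: "biparental",
    128: "uniparental",
    256: "not-tested",
    512: "tested-inconclusive",
    1073741824: "other",
}


def decode_origin(value):
    """Decode an ORIGIN value into labels, assuming it is a sum of bit flags."""
    value = int(value)
    if value == 0:
        return ["unknown"]
    # Keep every label whose bit is set in the value.
    return [name for bit, name in CLINVAR_ORIGIN.items() if value & bit]


print(decode_origin(3))
print(decode_origin("1073741824"))
```

Under this assumption a value of 3 would describe a variant reported with both germline and somatic origin.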

COSMIC#

BALSAMIC uses ensembl-vep to add the following annotations from the COSMIC database.

COSMIC annotations#
VCF tag	description
COSMIC_CDS	CDS annotation
COSMIC_GENE	gene name
COSMIC_STRAND	strand
COSMIC_CNT	number of samples with this mutation in the COSMIC database
COSMIC_AA	peptide annotation

CADD#

BALSAMIC adds the following annotation for SNVs from the CADD database using vcfanno.

CADD annotations#
VCF tag	description
CADD	Combined Annotation Dependent Depletion

LoqusDB somatic frequencies (cancer cases)#

LoqusDB Somatic Annotations#
VCF tag	description	variant type
Cancer_Somatic_Frq	Frequency of observation for somatic mutations	SNV, SV
Cancer_Somatic_Obs	allele counts of the somatic variant	SNV, SV
Cancer_Somatic_Hom	allele counts of the homozygous somatic variant	SNV

LoqusDB germline frequencies (cancer cases)#

loqusDB germline SNV annotations#
VCF tag	description	variant type
Cancer_Germline_Frq	Frequency of observation for germline mutations	SNV
Cancer_Germline_Obs	allele counts of the germline variant	SNV
Cancer_Germline_Hom	allele counts of the homozygous germline variant	SNV

LoqusDB germline frequencies (non-cancer cases)#

BALSAMIC adds the following annotations from a database of non-cancer clinical samples, using vcfanno for SNVs and SVDB for SVs.

LoqusDB germline (non-cancer) annotations#
VCF tag	description	variant type
Frq	Frequency of observation of the variants from normal non-cancer clinical samples	SNV, SV
Obs	allele counts of the variant in normal non-cancer clinical samples	SNV
Hom	allele counts of the homozygous variant in normal non-cancer clinical samples	SNV
clin_obs	allele counts	SV

SWEGEN#

BALSAMIC adds the following annotations from the SWEGEN database, using vcfanno for SNVs and SVDB for SVs.

SWEGEN annotations#
VCF tag	description	variant type
SWEGENAF	allele frequency from 1000 Swedish genomes project	SNV, SV
SWEGENAAC_Hom	allele counts of homozygous variants	SNV
SWEGENAAC_Het	allele counts of heterozygous variants	SNV
SWEGENAAC_Hemi	allele counts of hemizygous variants	SNV
swegen_obs	allele count	SV

ENSEMBL-VEP annotations#

Where relevant, BALSAMIC uses ensembl-vep to annotate somatic and germline SNVs and somatic SVs/CNVs with data from the 1000genomes (phase3), ClinVar, ESP, HGMD-PUBLIC, dbSNP, gencode, gnomAD, polyphen, refseq, and sift databases. The following annotations are added by ensembl-vep.

VEP limits the size of the structural variants that it will annotate. BALSAMIC currently sets this limit to the length of chromosome 1 in hg19 (--max_sv_size 249250621).

ensembl-vep#
Annotation	description
Allele	the variant allele used to calculate the consequence
Gene	Ensembl stable ID of affected gene
Feature	Ensembl stable ID of feature
Feature type	type of feature. Currently one of Transcript, RegulatoryFeature, MotifFeature.
Consequence	consequence type of this variant
Position in cDNA	relative position of base pair in cDNA sequence
Position in CDS	relative position of base pair in coding sequence
Position in protein	relative position of amino acid in protein
Amino acid change	only given if the variant affects the protein-coding sequence
Codon change	the alternative codons with the variant base in upper case
Co-located variation	identifier of any existing variants
VARIANT_CLASS	Sequence Ontology variant class
SYMBOL	the gene symbol
SYMBOL_SOURCE	the source of the gene symbol
STRAND	the DNA strand (1 or -1) on which the transcript/feature lies
ENSP	the Ensembl protein identifier of the affected transcript
FLAGS	transcript quality flags: cds_start_NF (CDS 5’ incomplete), cds_end_NF (CDS 3’ incomplete)
SWISSPROT	Best match UniProtKB/Swiss-Prot accession of protein product
TREMBL	Best match UniProtKB/TrEMBL accession of protein product
UNIPARC	Best match UniParc accession of protein product
HGVSc	the HGVS coding sequence name
HGVSp	the HGVS protein sequence name
HGVSg	the HGVS genomic sequence name
HGVS_OFFSET	Indicates by how many bases the HGVS notations for this variant have been shifted
SIFT	the SIFT prediction and/or score, with both given as prediction(score)
PolyPhen	the PolyPhen prediction and/or score
MOTIF_NAME	The source and identifier of a transcription factor binding profile aligned at this position
MOTIF_POS	The relative position of the variation in the aligned TFBP
HIGH_INF_POS	A flag indicating if the variant falls in a high information position of a transcription factor binding profile (TFBP)
MOTIF_SCORE_CHANGE	The difference in motif score of the reference and variant sequences for the TFBP
CANONICAL	a flag indicating if the transcript is denoted as the canonical transcript for this gene
CCDS	the CCDS identifier for this transcript, where applicable
INTRON	the intron number (out of total number)
EXON	the exon number (out of total number)
DOMAINS	the source and identifier of any overlapping protein domains
DISTANCE	Shortest distance from variant to transcript
AF	Frequency of existing variant in 1000 Genomes
AFR_AF	Frequency of existing variant in 1000 Genomes combined African population
AMR_AF	Frequency of existing variant in 1000 Genomes combined American population
EUR_AF	Frequency of existing variant in 1000 Genomes combined European population
EAS_AF	Frequency of existing variant in 1000 Genomes combined East Asian population
SAS_AF	Frequency of existing variant in 1000 Genomes combined South Asian population
AA_AF	Frequency of existing variant in NHLBI-ESP African American population
EA_AF	Frequency of existing variant in NHLBI-ESP European American population
gnomAD_AF	Frequency of existing variant in gnomAD exomes combined population
gnomAD_AFR_AF	Frequency of existing variant in gnomAD exomes African/American population
gnomAD_AMR_AF	Frequency of existing variant in gnomAD exomes American population
gnomAD_ASJ_AF	Frequency of existing variant in gnomAD exomes Ashkenazi Jewish population
gnomAD_EAS_AF	Frequency of existing variant in gnomAD exomes East Asian population
gnomAD_FIN_AF	Frequency of existing variant in gnomAD exomes Finnish population
gnomAD_NFE_AF	Frequency of existing variant in gnomAD exomes Non-Finnish European population
gnomAD_OTH_AF	Frequency of existing variant in gnomAD exomes other combined populations
gnomAD_SAS_AF	Frequency of existing variant in gnomAD exomes South Asian population
MAX_AF	Maximum observed allele frequency in 1000 Genomes, ESP and gnomAD
MAX_AF_POPS	Populations in which maximum allele frequency was observed
CLIN_SIG	ClinVar clinical significance of the dbSNP variant
BIOTYPE	Biotype of transcript or regulatory feature
APPRIS	Annotates alternatively spliced transcripts as primary or alternate based on a range of computational methods. NB: not available for GRCh37
TSL	Transcript support level. NB: not available for GRCh37
PUBMED	Pubmed ID(s) of publications that cite existing variant
SOMATIC	Somatic status of existing variant(s); multiple values correspond to multiple values in the Existing_variation field
PHENO	Indicates if existing variant is associated with a phenotype, disease or trait; multiple values correspond to multiple values in the Existing_variation field
GENE_PHENO	Indicates if overlapped gene is associated with a phenotype, disease or trait
BAM_EDIT	Indicates success or failure of edit using BAM file
GIVEN_REF	Reference allele from input
REFSEQ_MATCH	the RefSeq transcript match status; contains a number of flags indicating whether this RefSeq transcript matches the underlying reference sequence and/or an Ensembl transcript. The flags are:
rseq_3p_mismatch: signifies a mismatch between the RefSeq transcript and the underlying primary genome assembly sequence. Specifically, there is a mismatch in the 3’ UTR of the RefSeq model with respect to the primary genome assembly (e.g. GRCh37/GRCh38).
rseq_5p_mismatch: signifies a mismatch between the RefSeq transcript and the underlying primary genome assembly sequence. Specifically, there is a mismatch in the 5’ UTR of the RefSeq model with respect to the primary genome assembly.
rseq_cds_mismatch: signifies a mismatch between the RefSeq transcript and the underlying primary genome assembly sequence. Specifically, there is a mismatch in the CDS of the RefSeq model with respect to the primary genome assembly.
rseq_ens_match_cds: signifies that for the RefSeq transcript there is an overlapping Ensembl model that is identical across the CDS region only. A CDS match is defined as follows: the CDS and peptide sequences are identical and the genomic coordinates of every translatable exon match. Useful related attributes are: rseq_ens_match_wt and rseq_ens_no_match.
rseq_ens_match_wt: signifies that for the RefSeq transcript there is an overlapping Ensembl model that is identical across the whole transcript. A whole transcript match is defined as follows: 1) In the case that both models are coding, the transcript, CDS and peptide sequences are all identical and the genomic coordinates of every exon match. 2) In the case that both transcripts are non-coding, the transcript sequences and the genomic coordinates of every exon are identical. No comparison is made between a coding and a non-coding transcript. Useful related attributes are: rseq_ens_match_cds and rseq_ens_no_match.
rseq_ens_no_match: signifies that for the RefSeq transcript there is no overlapping Ensembl model that is identical across either the whole transcript or the CDS. This is caused by differences between the transcript, CDS or peptide sequences or between the exon genomic coordinates. Useful related attributes are: rseq_ens_match_wt and rseq_ens_match_cds.
rseq_mrna_match: signifies an exact match between the RefSeq transcript and the underlying primary genome assembly sequence (based on a match between the transcript stable id and an accession in the RefSeq mRNA file). An exact match occurs when the underlying genomic sequence of the model can be perfectly aligned to the mRNA sequence post polyA clipping.
rseq_mrna_nonmatch: signifies a non-match between the RefSeq transcript and the underlying primary genome assembly sequence. A non-match is deemed to have occurred if the underlying genomic sequence does not have a perfect alignment to the mRNA sequence post polyA clipping. It can also signify that no comparison was possible as the model stable id may not have had a corresponding entry in the RefSeq mRNA file (sometimes happens when accessions are retired or changed). When a non-match occurs one or several of the following transcript attributes will also be present to provide more detail on the nature of the non-match: rseq_5p_mismatch, rseq_cds_mismatch, rseq_3p_mismatch, rseq_nctran_mismatch, rseq_no_comparison.
rseq_nctran_mismatch: signifies a mismatch between the RefSeq transcript and the underlying primary genome assembly sequence. This is a comparison between the entire underlying genomic sequence of the RefSeq model to the mRNA in the case of RefSeq models that are non-coding.
rseq_no_comparison: signifies that no alignment was carried out between the underlying primary genome assembly sequence and a corresponding RefSeq mRNA. The reason for this is generally that no corresponding, unversioned accession was found in the RefSeq mRNA file for the transcript stable id. This sometimes happens when accessions are retired or replaced. A second possibility is that the sequences were too long and problematic to align (though this is rare).
CHECK_REF	Reports variants where the input reference does not match the expected reference
HGNC_ID	A unique ID provided by the HGNC for each gene with an approved symbol
MANE	indicates whether the transcript is the MANE Select or MANE Plus Clinical transcript for the gene
miRNA	Reports where the variant lies in the miRNA secondary structure.
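
When VEP writes VCF output, it conventionally packs its annotations into a single pipe-delimited `CSQ` INFO field whose sub-field order is declared in the VCF header. Assuming BALSAMIC output follows that convention, the annotations listed above could be unpacked roughly as in the sketch below. The header line and INFO string are shortened, made-up examples, the function names are hypothetical, and the sub-field names in a real header may be spelled differently from the labels used in the table.

```python
def csq_fields(header_line):
    """Extract CSQ sub-field names from the ##INFO=<ID=CSQ,...> header line."""
    description = header_line.split("Format: ", 1)[1]
    return description.rstrip('">\n ').split("|")


def parse_csq(info_field, fields):
    """Return one dict per transcript-level CSQ entry in an INFO string."""
    for entry in info_field.split(";"):
        if entry.startswith("CSQ="):
            # Entries for different transcripts are comma-separated.
            return [dict(zip(fields, item.split("|")))
                    for item in entry[len("CSQ="):].split(",")]
    return []  # record carries no CSQ annotation


# Shortened, made-up header and record for demonstration purposes.
header = ('##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence '
          'annotations from Ensembl VEP. Format: Allele|Consequence|SYMBOL|HGVSp">')
info = "DP=80;CSQ=T|missense_variant|TP53|p.Arg175His,T|intron_variant|TP53|"

fields = csq_fields(header)
records = parse_csq(info, fields)
for record in records:
    print(record["SYMBOL"], record["Consequence"], record["HGVSp"] or "-")
```

Reading the field order from the header rather than hard-coding it keeps the parser independent of which VEP options were enabled for a given run.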