Annotation resources#

BALSAMIC annotates somatic single nucleotide variants (SNVs) using ensembl-vep and vcfanno. Somatic structural variants (SVs), somatic copy-number variants (CNVs) and germline single nucleotide variants are annotated using only ensembl-vep. All SVs and CNVs are merged using SVDB before annotating for Target Genome Analysis (TGA) or Whole Genome Sequencing (WGS) analyses.

gnomAD#

BALSAMIC adds the following annotation from gnomAD database using vcfanno.

gnomAD annotations#

VCF tag

description

GNOMADAF_popmax

maximum allele frequency across populations

GNOMADAF

fraction of the reads supporting the alternate allele, allelic frequency

ClinVar#

BALSAMIC adds the following annotation from ClinVar database using vcfanno.

ClinVar annotations#

VCF tag

description

CLNACC

Variant Accession and Versions

CLNREVSTAT

ClinVar review status for the Variation ID

CLNSIG

Clinical significance for this single variant

CLNVCSO

Sequence Ontology id for variant type

CLNVC

Variant type

ORIGIN

Allele origin

The values for ORIGIN are described below:

ClinVar ORIGIN#

Value

Annotation

0

unknown

1

germline

2

somatic

4

inherited

8

paternal

16

maternal

32

de-novo

64

biparental

128

uniparental

256

not-tested

512

tested-inconclusive

1073741824

other

COSMIC#

BALSAMIC uses ensembl-vep to add the following annotation from COSMIC database.

COSMIC annotations#

VCF tag

description

COSMIC_CDS

CDS annotation

COSMIC_GENE

gene name

COSMIC_STRAND

strand

COSMIC_CNT

number of samples with this mutation in the COSMIC database

COSMIC_AA

peptide annotation

CADD#

BALSAMIC adds the following annotation for SNVs from CADD database using vcfanno.

CADD annotations#

VCF tag

description

CADD

Combined Annotation Dependent Depletion

LoqusDB somatic frequencies (cancer cases)#

LoqusDB Somatic Annotations#

VCF tag

description

variant type

Cancer_Somatic_Frq

Frequency of observation for somatic mutations

SNV, SV

Cancer_Somatic_Obs

allele counts of the somatic variant

SNV, SV

Cancer_Somatic_Hom

allele counts of the homozygous somatic variant

SNV

LoqusDB germline frequencies (cancer cases)#

loqusDB germline SNV annotations#

VCF tag

description

variant type

Cancer_Germline_Frq

Frequency of observation for germline mutations

SNV

Cancer_Germline_Obs

allele counts of the germline variant

SNV

Cancer_Germline_Hom

allele counts of the homozygous germline variant

SNV

LoqusDB germline frequencies (non-cancer cases)#

BALSAMIC adds the following annotation from database of non-cancer clinical samples using vcfanno for SNVs and SVDB for SVs.

loqusDB germline (non-cancer) SNV annotations#

VCF tag

description

variant type

Frq

Frequency of observation of the variants from normal non-cancer clinical samples

SNV, SV

Obs

allele counts of the variant in normal non-cancer clinical samples

SNV

Hom

allele counts of the homozygous variant in normal non-cancer clinical samples

SNV

clin_obs

allele counts

SV

SWEGEN#

BALSAMIC adds the following annotation from SWEGEN database using vcfanno for SNVs and SVDB for SVs.

Swegen SNV annotations#

VCF tag

description

variant type

SWEGENAF

allele frequency from 1000 Swedish genomes project

SNV, SV

SWEGENAAC_Hom

allele counts of homozygous variants

SNV

SWEGENAAC_Het

allele counts of heterozygous variants

SNV

SWEGENAAC_Hemi

allele counts of hemizygous variants

SNV

swegen_obs

allele count

SV

ENSEMBL-VEP annotations#

Where relevant, BALSAMIC uses ensembl-vep to annotate somatic and germline SNVs and somatic SVs/CNVs from 1000genomes (phase3), ClinVar, ESP, HGMD-PUBLIC, dbSNP, gencode, gnomAD, polyphen, refseq, and sift databases. The following annotations are added by ensembl-vep.

VEP has a setting for the maximum size of a structural variant that it will annotate, currently this is set to the size of the size of chromosome 1 (in hg19) (–max_sv_size 249250621).

ensembl-vep#

Annotation

description

Allele

the variant allele used to calculate the consequence

Gene

Ensembl stable ID of affected gene

Feature

Ensembl stable ID of feature

Feature type

type of feature. Currently one of Transcript, RegulatoryFeature, MotifFeature.

Consequence

consequence type of this variant

Position in cDNA

relative position of base pair in cDNA sequence

Position in CDS

relative position of base pair in coding sequence

Position in protein

relative position of amino acid in protein

Amino acid change

only given if the variant affects the protein-coding sequence

Codon change

the alternative codons with the variant base in upper case

Co-located variation

identifier of any existing variants

VARIANT_CLASS

Sequence Ontology variant class

SYMBOL

the gene symbol

SYMBOL_SOURCE

the source of the gene symbol

STRAND

the DNA strand (1 or -1) on which the transcript/feature lies

ENSP

the Ensembl protein identifier of the affected transcript

FLAGS

transcript quality flags:
cds_start_NF: CDS 5’ incomplete
cds_end_NF: CDS 3’ incomplete

SWISSPROT

Best match UniProtKB/Swiss-Prot accession of protein product

TREMBL

Best match UniProtKB/TrEMBL accession of protein product

UNIPARC

Best match UniParc accession of protein product

HGVSc

the HGVS coding sequence name

HGVSp

the HGVS protein sequence name

HGVSg

the HGVS genomic sequence name

HGVS_OFFSET

Indicates by how many bases the HGVS notations for this variant have been shifted

SIFT

the SIFT prediction and/or score, with both given as prediction(score)

PolyPhen

the PolyPhen prediction and/or score

MOTIF_NAME

The source and identifier of a transcription factor binding profile aligned at this position

MOTIF_POS

The relative position of the variation in the aligned TFBP

HIGH_INF_POS

A flag indicating if the variant falls in a high information position of a transcription factor binding profile (TFBP)

MOTIF_SCORE_CHANGE

The difference in motif score of the reference and variant sequences for the TFBP

CANONICAL

a flag indicating if the transcript is denoted as the canonical transcript for this gene

CCDS

the CCDS identifer for this transcript, where applicable

INTRON

the intron number (out of total number)

EXON

the exon number (out of total number)

DOMAINS

the source and identifer of any overlapping protein domains

DISTANCE

Shortest distance from variant to transcript

AF

Frequency of existing variant in 1000 Genomes

AFR_AF

Frequency of existing variant in 1000 Genomes combined African population

AMR_AF

Frequency of existing variant in 1000 Genomes combined American population

EUR_AF

Frequency of existing variant in 1000 Genomes combined European population

EAS_AF

Frequency of existing variant in 1000 Genomes combined East Asian population

SAS_AF

Frequency of existing variant in 1000 Genomes combined South Asian population

AA_AF

Frequency of existing variant in NHLBI-ESP African American population

EA_AF

Frequency of existing variant in NHLBI-ESP European American population

gnomAD_AF

Frequency of existing variant in gnomAD exomes combined population

gnomAD_AFR_AF

Frequency of existing variant in gnomAD exomes African/American population

gnomAD_AMR_AF

Frequency of existing variant in gnomAD exomes American population

gnomAD_ASJ_AF

Frequency of existing variant in gnomAD exomes Ashkenazi Jewish population

gnomAD_EAS_AF

Frequency of existing variant in gnomAD exomes East Asian population

gnomAD_FIN_AF

Frequency of existing variant in gnomAD exomes Finnish population

gnomAD_NFE_AF

Frequency of existing variant in gnomAD exomes Non-Finnish European population

gnomAD_OTH_AF

Frequency of existing variant in gnomAD exomes combined other combined populations

gnomAD_SAS_AF

Frequency of existing variant in gnomAD exomes South Asian population

MAX_AF

Maximum observed allele frequency in 1000 Genomes, ESP and gnomAD

MAX_AF_POPS

Populations in which maximum allele frequency was observed

CLIN_SIG

ClinVar clinical significance of the dbSNP variant

BIOTYPE

Biotype of transcript or regulatory feature

APPRIS

Annotates alternatively spliced transcripts as primary or alternate based on a range of computational methods. NB: not available for GRCh37

TSL

Transcript support level. NB: not available for GRCh37

PUBMED

Pubmed ID(s) of publications that cite existing variant

SOMATIC

Somatic status of existing variant(s); multiple values correspond to multiple values in the Existing_variation field

PHENO

Indicates if existing variant is associated with a phenotype, disease or trait; multiple values correspond to multiple values in the Existing_variation field

GENE_PHENO

Indicates if overlapped gene is associated with a phenotype, disease or trait

BAM_EDIT

Indicates success or failure of edit using BAM file

GIVEN_REF

Reference allele from input

REFSEQ_MATCH

the RefSeq transcript match status; contains a number of flags indicating whether this RefSeq transcript matches the underlying reference sequence and/or an Ensembl transcript (more information):
  • rseq_3p_mismatch: signifies a mismatch between the RefSeq transcript and the underlying primary genome assembly sequence. Specifically, there is a mismatch in the 3’ UTR of the RefSeq model with respect to the primary genome assembly (e.g. GRCh37/GRCh38).

  • rseq_5p_mismatch: signifies a mismatch between the RefSeq transcript and the underlying primary genome assembly sequence. Specifically, there is a mismatch in the 5’ UTR of the RefSeq model with respect to the primary genome assembly.

  • rseq_cds_mismatch: signifies a mismatch between the RefSeq transcript and the underlying primary genome assembly sequence. Specifically, there is a mismatch in the CDS of the RefSeq model with respect to the primary genome assembly.

  • rseq_ens_match_cds: signifies that for the RefSeq transcript there is an overlapping Ensembl model that is identical across the CDS region only. A CDS match is defined as follows: the CDS and peptide sequences are identical and the genomic coordinates of every translatable exon match. Useful related attributes are: rseq_ens_match_wt and rseq_ens_no_match.

  • rseq_ens_match_wt: signifies that for the RefSeq transcript there is an overlapping Ensembl model that is identical across the whole transcript. A whole transcript match is defined as follows: 1) In the case that both models are coding, the transcript, CDS and peptide sequences are all identical and the genomic coordinates of every exon match. 2) In the case that both transcripts are non-coding the transcript sequences and the genomic coordinates of every exon are identical. No comparison is made between a coding and a non-coding transcript. Useful related attributes are: rseq_ens_match_cds and rseq_ens_no_match.

  • rseq_ens_no_match: signifies that for the RefSeq transcript there is no overlapping Ensembl model that is identical across either the whole transcript or the CDS. This is caused by differences between the transcript, CDS or peptide sequences or between the exon genomic coordinates. Useful related attributes are: rseq_ens_match_wt and rseq_ens_match_cds.

  • rseq_mrna_match: signifies an exact match between the RefSeq transcript and the underlying primary genome assembly sequence (based on a match between the transcript stable id and an accession in the RefSeq mRNA file). An exact match occurs when the underlying genomic sequence of the model can be perfectly aligned to the mRNA sequence post polyA clipping.

  • rseq_mrna_nonmatch: signifies a non-match between the RefSeq transcript and the underlying primary genome assembly sequence. A non-match is deemed to have occurred if the underlying genomic sequence does not have a perfect alignment to the mRNA sequence post polyA clipping. It can also signify that no comparison was possible as the model stable id may not have had a corresponding entry in the RefSeq mRNA file (sometimes happens when accessions are retired or changed). When a non-match occurs one or several of the following transcript attributes will also be present to provide more detail on the nature of the non-match: rseq_5p_mismatch, rseq_cds_mismatch, rseq_3p_mismatch, rseq_nctran_mismatch, rseq_no_comparison

  • rseq_nctran_mismatch: signifies a mismatch between the RefSeq transcript and the underlying primary genome assembly sequence. This is a comparison between the entire underlying genomic sequence of the RefSeq model to the mRNA in the case of RefSeq models that are non-coding.

  • rseq_no_comparison: signifies that no alignment was carried out between the underlying primary genome assembly sequence and a corresponding RefSeq mRNA. The reason for this is generally that no corresponding, unversioned accession was found in the RefSeq mRNA file for the transcript stable id. This sometimes happens when accessions are retired or replaced. A second possibility is that the sequences were too long and problematic to align (though this is rare).

CHECK_REF

Reports variants where the input reference does not match the expected reference

HGNC_ID

A unique ID provided by the HGNC for each gene with an approved symbol

MANE

indicating if the transcript is the MANE Select or MANE Plus Clinical transcript for the gene.

miRNA

Reports where the variant lies in the miRNA secondary structure.