Annotation resources
BALSAMIC annotates somatic single nucleotide variants (SNVs) using ensembl-vep and vcfanno. Somatic structural variants (SVs), somatic copy-number variants (CNVs), and germline single nucleotide variants are annotated using only ensembl-vep. All SVs and CNVs are merged using SVDB before annotation for Target Genome Analysis (TGA) or Whole Genome Sequencing (WGS) analyses.
BALSAMIC adds the following annotations from the gnomAD database using vcfanno.
VCF tag | description |
---|---|
GNOMADAF_popmax | maximum allele frequency across populations |
GNOMADAF | frequency of the alternate allele in gnomAD (allele frequency) |
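As an illustration of how these tags might be used downstream, the sketch below reads `GNOMADAF_popmax` from the INFO column of a VCF record and flags variants above a frequency cutoff. The helper names, the example INFO strings, and the 0.005 threshold are assumptions made for this example, not BALSAMIC defaults.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag-only keys map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def is_common(info, threshold=0.005):
    """Return True when GNOMADAF_popmax exceeds the (hypothetical) threshold."""
    raw = parse_info(info).get("GNOMADAF_popmax")
    if not isinstance(raw, str):
        return False  # tag absent: treat the variant as not common
    # A tag may hold several comma-separated values; "." means missing.
    values = [float(v) for v in raw.split(",") if v not in ("", ".")]
    return bool(values) and max(values) > threshold


# Made-up INFO strings, for illustration only
rare = "DP=120;GNOMADAF=0.0001;GNOMADAF_popmax=0.0004"
common = "DP=98;GNOMADAF=0.03;GNOMADAF_popmax=0.08"
print(is_common(rare), is_common(common))
```

Treating an absent annotation as "not common" is a design choice of this sketch; a stricter filter could handle unannotated variants separately.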
BALSAMIC adds the following annotations from the ClinVar database using vcfanno.
VCF tag | description |
---|---|
CLNACC | Variant Accession and Versions |
CLNREVSTAT | ClinVar review status for the Variation ID |
CLNSIG | Clinical significance for this single variant |
CLNVCSO | Sequence Ontology id for variant type |
CLNVC | Variant type |
ORIGIN | Allele origin |
The values for ORIGIN are described below:
Value | Annotation |
---|---|
0 | unknown |
1 | germline |
2 | somatic |
4 | inherited |
8 | paternal |
16 | maternal |
32 | de-novo |
64 | biparental |
128 | uniparental |
256 | not-tested |
512 | tested-inconclusive |
1073741824 | other |
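Because the non-zero ORIGIN values are powers of two, ClinVar can report several origins at once by adding the values together (for example, 3 = germline + somatic). A minimal sketch of decoding such a combined value follows; the names `ORIGIN_FLAGS` and `decode_origin` are chosen here for illustration and are not part of BALSAMIC.

```python
# Bit flags taken from the ORIGIN table above
ORIGIN_FLAGS = {
    1: "germline",
    2: "somatic",
    4: "inherited",
    8: "paternal",
    16: "maternal",
    32: "de-novo",
    64: "biparental",
    128: "uniparental",
    256: "not-tested",
    512: "tested-inconclusive",
    1073741824: "other",
}


def decode_origin(value):
    """Return the list of origin labels encoded in a ClinVar ORIGIN value."""
    if value == 0:
        return ["unknown"]
    # Keep every label whose bit is set in the combined value
    return [name for bit, name in ORIGIN_FLAGS.items() if value & bit]


print(decode_origin(3))  # ['germline', 'somatic']
```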
BALSAMIC uses ensembl-vep to add the following annotations from the COSMIC database.
VCF tag | description |
---|---|
COSMIC_CDS | CDS annotation |
COSMIC_GENE | gene name |
COSMIC_STRAND | strand |
COSMIC_CNT | number of samples with this mutation in the COSMIC database |
COSMIC_AA | peptide annotation |
Where relevant, BALSAMIC uses ensembl-vep to annotate somatic and germline SNVs and somatic SVs/CNVs with data from the 1000genomes (phase3), ClinVar, ESP, HGMD-PUBLIC, dbSNP, gencode, gnomAD, polyphen, refseq, and sift databases. The following annotations are added by ensembl-vep.
Annotation | description |
---|---|
Allele | the variant allele used to calculate the consequence |
Gene | Ensembl stable ID of affected gene |
Feature | Ensembl stable ID of feature |
Feature type | type of feature. Currently one of Transcript, RegulatoryFeature, MotifFeature. |
Consequence | consequence type of this variant |
Position in cDNA | relative position of base pair in cDNA sequence |
Position in CDS | relative position of base pair in coding sequence |
Position in protein | relative position of amino acid in protein |
Amino acid change | only given if the variant affects the protein-coding sequence |
Codon change | the alternative codons with the variant base in upper case |
Co-located variation | identifier of any existing variants |
VARIANT_CLASS | Sequence Ontology variant class |
SYMBOL | the gene symbol |
SYMBOL_SOURCE | the source of the gene symbol |
STRAND | the DNA strand (1 or -1) on which the transcript/feature lies |
ENSP | the Ensembl protein identifier of the affected transcript |
FLAGS | transcript quality flags: cds_start_NF (CDS 5’ incomplete), cds_end_NF (CDS 3’ incomplete) |
SWISSPROT | Best match UniProtKB/Swiss-Prot accession of protein product |
TREMBL | Best match UniProtKB/TrEMBL accession of protein product |
UNIPARC | Best match UniParc accession of protein product |
HGVSc | the HGVS coding sequence name |
HGVSp | the HGVS protein sequence name |
HGVSg | the HGVS genomic sequence name |
HGVS_OFFSET | Indicates by how many bases the HGVS notations for this variant have been shifted |
SIFT | the SIFT prediction and/or score, with both given as prediction(score) |
PolyPhen | the PolyPhen prediction and/or score |
MOTIF_NAME | The source and identifier of a transcription factor binding profile aligned at this position |
MOTIF_POS | The relative position of the variation in the aligned TFBP |
HIGH_INF_POS | A flag indicating if the variant falls in a high information position of a transcription factor binding profile (TFBP) |
MOTIF_SCORE_CHANGE | The difference in motif score of the reference and variant sequences for the TFBP |
CANONICAL | a flag indicating if the transcript is denoted as the canonical transcript for this gene |
CCDS | the CCDS identifier for this transcript, where applicable |
INTRON | the intron number (out of total number) |
EXON | the exon number (out of total number) |
DOMAINS | the source and identifier of any overlapping protein domains |
DISTANCE | Shortest distance from variant to transcript |
AF | Frequency of existing variant in 1000 Genomes |
AFR_AF | Frequency of existing variant in 1000 Genomes combined African population |
AMR_AF | Frequency of existing variant in 1000 Genomes combined American population |
EUR_AF | Frequency of existing variant in 1000 Genomes combined European population |
EAS_AF | Frequency of existing variant in 1000 Genomes combined East Asian population |
SAS_AF | Frequency of existing variant in 1000 Genomes combined South Asian population |
AA_AF | Frequency of existing variant in NHLBI-ESP African American population |
EA_AF | Frequency of existing variant in NHLBI-ESP European American population |
gnomAD_AF | Frequency of existing variant in gnomAD exomes combined population |
gnomAD_AFR_AF | Frequency of existing variant in gnomAD exomes African/American population |
gnomAD_AMR_AF | Frequency of existing variant in gnomAD exomes American population |
gnomAD_ASJ_AF | Frequency of existing variant in gnomAD exomes Ashkenazi Jewish population |
gnomAD_EAS_AF | Frequency of existing variant in gnomAD exomes East Asian population |
gnomAD_FIN_AF | Frequency of existing variant in gnomAD exomes Finnish population |
gnomAD_NFE_AF | Frequency of existing variant in gnomAD exomes Non-Finnish European population |
gnomAD_OTH_AF | Frequency of existing variant in gnomAD exomes other combined populations |
gnomAD_SAS_AF | Frequency of existing variant in gnomAD exomes South Asian population |
MAX_AF | Maximum observed allele frequency in 1000 Genomes, ESP and gnomAD |
MAX_AF_POPS | Populations in which maximum allele frequency was observed |
CLIN_SIG | ClinVar clinical significance of the dbSNP variant |
BIOTYPE | Biotype of transcript or regulatory feature |
APPRIS | Annotates alternatively spliced transcripts as primary or alternate based on a range of computational methods. NB: not available for GRCh37 |
TSL | Transcript support level. NB: not available for GRCh37 |
PUBMED | Pubmed ID(s) of publications that cite existing variant |
SOMATIC | Somatic status of existing variant(s); multiple values correspond to multiple values in the Existing_variation field |
PHENO | Indicates if existing variant is associated with a phenotype, disease or trait; multiple values correspond to multiple values in the Existing_variation field |
GENE_PHENO | Indicates if overlapped gene is associated with a phenotype, disease or trait |
BAM_EDIT | Indicates success or failure of edit using BAM file |
GIVEN_REF | Reference allele from input |
REFSEQ_MATCH | the RefSeq transcript match status; contains a number of flags indicating whether this RefSeq transcript matches the underlying reference sequence and/or an Ensembl transcript |
CHECK_REF | Reports variants where the input reference does not match the expected reference |
HGNC_ID | A unique ID provided by the HGNC for each gene with an approved symbol |
MANE | Indicates whether the transcript is the MANE Select or MANE Plus Clinical transcript for the gene |
miRNA | Reports where the variant lies in the miRNA secondary structure |
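In VCF output, ensembl-vep typically writes these annotations into a single pipe-delimited `CSQ` INFO field, with the field order declared in the VCF header after `Format:`. The sketch below uses a shortened, made-up header and record to show one way of turning each transcript entry into a dictionary and of splitting a `prediction(score)` value such as SIFT. The helper names and example values are hypothetical; the field order of a real BALSAMIC VCF should be read from its own header rather than assumed.

```python
import re


def csq_fields(header_line):
    """Extract CSQ field names from the '##INFO=<ID=CSQ,...>' header line."""
    match = re.search(r'Format: ([^"]+)"', header_line)
    if match is None:
        raise ValueError("no 'Format:' section found in CSQ header")
    return match.group(1).strip().split("|")


def parse_csq(csq_value, fields):
    """Return one dict per comma-separated transcript annotation."""
    return [dict(zip(fields, entry.split("|"))) for entry in csq_value.split(",")]


def split_prediction(value):
    """Split 'prediction(score)' into (prediction, score); score may be None."""
    match = re.fullmatch(r"(\w+)\(([0-9.]+)\)", value)
    if match:
        return match.group(1), float(match.group(2))
    return (value or None), None


# Shortened, made-up header and CSQ value for illustration only
header = ('##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence '
          'annotations from Ensembl VEP. Format: Allele|Consequence|SYMBOL|Gene|SIFT">')
csq = ("T|missense_variant|GENE_A|ENSG_EXAMPLE_1|deleterious(0.01),"
       "T|upstream_gene_variant|GENE_B|ENSG_EXAMPLE_2|")

fields = csq_fields(header)
records = parse_csq(csq, fields)
for record in records:
    print(record["SYMBOL"], record["Consequence"], split_prediction(record["SIFT"]))
```

Reading the field names from the header keeps the parser independent of which ensembl-vep options were enabled for a given run.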