BALSAMIC.utils package

Submodules

BALSAMIC.utils.cli module

class BALSAMIC.utils.cli.CaptureStdout[source]

Bases: list

Captures stdout.
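
A list-based stdout capture is commonly written as a context manager. The following is a minimal sketch of that pattern, not the BALSAMIC source; the class name CaptureStdoutSketch is hypothetical.

```python
import sys
from io import StringIO


class CaptureStdoutSketch(list):
    """Hypothetical stand-in: collects printed lines into the list itself."""

    def __enter__(self):
        # Swap sys.stdout for an in-memory buffer while the block runs.
        self._original_stdout = sys.stdout
        sys.stdout = self._buffer = StringIO()
        return self

    def __exit__(self, *exc_info):
        # Store each captured line as a list element, then restore stdout.
        self.extend(self._buffer.getvalue().splitlines())
        sys.stdout = self._original_stdout
        del self._buffer


with CaptureStdoutSketch() as captured:
    print("first line")
    print("second line")
```

After the block exits, captured holds one list element per printed line.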

class BALSAMIC.utils.cli.SnakeMake[source]

Bases: object

To build a snakemake command using cli options

Params:

  • case_name – analysis case name
  • working_dir – working directory for snakemake
  • configfile – sample configuration file (json) output of balsamic-config-sample
  • run_mode – run mode: cluster or local shell run
  • cluster_config – cluster config json file
  • scheduler – slurm command constructor
  • log_path – log file path
  • script_path – file path for slurm scripts
  • result_path – result directory
  • qos – QOS for sbatch jobs
  • account – scheduler (e.g. slurm) account
  • mail_user – email to account to send job run status
  • forceall – To add '--forceall' option for snakemake
  • run_analysis – To run pipeline
  • use_singularity – To use singularity
  • singularity_bind – Singularity bind path
  • singularity_arg – Singularity arguments to pass to snakemake
  • sm_opt – snakemake additional options
  • disable_variant_caller – Disable variant caller

build_cmd()[source]
BALSAMIC.utils.cli.add_doc(docstring)[source]

A decorator for adding docstring. Taken shamelessly from stackexchange.
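
A docstring-adding decorator usually takes the text as an argument and assigns it to the function's __doc__. A minimal sketch of that idea, with the hypothetical name add_doc_sketch (the actual implementation may differ):

```python
def add_doc_sketch(docstring):
    """Return a decorator that sets __doc__ on the decorated function."""

    def decorator(func):
        func.__doc__ = docstring
        return func

    return decorator


@add_doc_sketch("Say hello.")
def greet():
    return "hello"
```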

BALSAMIC.utils.cli.convert_defaultdict_to_regular_dict(inputdict: dict)[source]

Recursively convert defaultdict to dict.
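
To illustrate the conversion, the sketch below first builds a nested defaultdict and then rebuilds it as plain dicts. Both function names are hypothetical and the code is an assumption about the general technique, not the BALSAMIC source.

```python
from collections import defaultdict


def recursive_default_dict_sketch():
    """Create a defaultdict whose missing keys are themselves defaultdicts."""
    return defaultdict(recursive_default_dict_sketch)


def to_regular_dict_sketch(value):
    """Recursively rebuild dict-like values (including defaultdict) as plain dicts."""
    if isinstance(value, dict):
        return {key: to_regular_dict_sketch(item) for key, item in value.items()}
    return value


nested = recursive_default_dict_sketch()
nested["QC"]["adapter_trim"] = True  # intermediate levels are created on access
plain = to_regular_dict_sketch(nested)
```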

BALSAMIC.utils.cli.createDir(path, interm_path=[])[source]

Creates a directory. If the path already exists, recursively increments a number in the directory name until an unused path is found.

Creates symlinks for provided files in analysis/fastq directory. Identifies file prefix pattern, and also creates symlinks for the second read file, if needed

BALSAMIC.utils.cli.find_file_index(file_path)[source]
BALSAMIC.utils.cli.generate_graph(config_collection_dict, config_path)[source]

Generate DAG graph using snakemake stdout output

BALSAMIC.utils.cli.get_bioinfo_tools_list(conda_env_path) → dict[source]

Parses the names and versions of bioinfo tools used by BALSAMIC from config YAML into a dict

BALSAMIC.utils.cli.get_config(config_name)[source]

Return a string path for config file.

BALSAMIC.utils.cli.get_fastq_bind_path(fastq_path: pathlib.Path) → [][source]

Takes a path with symlinked fastq files. Returns unique paths to parent directories for Singularity bind

BALSAMIC.utils.cli.get_file_extension(file_path)[source]
BALSAMIC.utils.cli.get_file_status_string(file_to_check)[source]

Checks if file exists, and returns a string with a checkmark if it exists or a red cross mark if it doesn’t. Always assumes the file doesn’t exist, unless proven otherwise.

BALSAMIC.utils.cli.get_from_two_key(input_dict, from_key, by_key, by_value, default=None)[source]

Given two keys whose values are lists of the same length, find the matching index of by_value in from_key and return the corresponding value from by_key.

from_key and by_key should both exist
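
One reading of this description is a paired-list lookup: locate by_value in the from_key list and return the item at the same position in the by_key list. The sketch below assumes that direction; the function name and the sample data are hypothetical.

```python
def get_from_two_key_sketch(input_dict, from_key, by_key, by_value, default=None):
    """Look up by_value in input_dict[from_key]; return the paired item of input_dict[by_key]."""
    if from_key not in input_dict or by_key not in input_dict:
        return default
    if by_value not in input_dict[from_key]:
        return default
    index = input_dict[from_key].index(by_value)
    return input_dict[by_key][index]


# Hypothetical input: two lists of equal length stored under two keys.
samples = {"type": ["tumor", "normal"], "file_prefix": ["S1_T", "S1_N"]}
prefix = get_from_two_key_sketch(samples, "type", "file_prefix", "normal")
missing = get_from_two_key_sketch(samples, "type", "file_prefix", "unknown", default="n/a")
```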

BALSAMIC.utils.cli.get_panel_chrom(panel_bed) → list[source]

Returns a set of chromosomes present in PANEL BED
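
Collecting chromosomes from a BED file amounts to reading the first column of every record. A minimal sketch under that assumption; the function name and the temporary test file are hypothetical, and header handling in the real function may differ.

```python
import tempfile
from pathlib import Path


def panel_chromosomes_sketch(panel_bed):
    """Collect the unique values of the first (chromosome) column of a BED file."""
    chromosomes = set()
    with open(panel_bed) as bed:
        for line in bed:
            # Skip blank lines and the header lines a BED file may carry.
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            chromosomes.add(line.split()[0])
    return chromosomes


with tempfile.TemporaryDirectory() as tmp_dir:
    bed_path = Path(tmp_dir) / "panel.bed"
    bed_path.write_text("1\t100\t200\n1\t300\t400\nX\t50\t60\n")
    chroms = panel_chromosomes_sketch(bed_path)
```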

BALSAMIC.utils.cli.get_sample_dict(tumor, normal) → dict[source]

Concatenates sample dicts for all provided files

BALSAMIC.utils.cli.get_sample_names(filename, sample_type)[source]

Creates a dict with sample prefix, sample type, and readpair suffix

BALSAMIC.utils.cli.get_schedulerpy()[source]

Returns a string path for scheduler.py

BALSAMIC.utils.cli.get_snakefile(analysis_type, sequencing_type='targeted')[source]

Return a string path for variant calling snakefile.

BALSAMIC.utils.cli.iterdict(dic)[source]

dictionary iteration - returns generator
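
A generator over a nested dictionary is often written recursively. The sketch below assumes that only leaf key/value pairs are yielded, which the description does not confirm; iterdict_sketch is a hypothetical name.

```python
def iterdict_sketch(dic):
    """Yield (key, value) pairs for every non-dict value, descending into nested dicts."""
    for key, value in dic.items():
        if isinstance(value, dict):
            yield from iterdict_sketch(value)
        else:
            yield key, value


leaves = list(iterdict_sketch({"a": 1, "b": {"c": 2, "d": {"e": 3}}}))
```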

BALSAMIC.utils.cli.merge_dict_on_key(dict_1, dict_2, by_key)[source]

Merge two lists of dictionaries based on a key
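
A possible way to merge on a key is to group the dictionaries from both lists by the value stored under by_key and update each group in turn. This is a sketch of that approach only; the function name, the sample records, and the list return type are assumptions.

```python
from collections import defaultdict


def merge_dict_on_key_sketch(dict_1, dict_2, by_key):
    """Combine dictionaries from both lists that share the same value for by_key."""
    merged = defaultdict(dict)
    for dict_list in (dict_1, dict_2):
        for item in dict_list:
            merged[item[by_key]].update(item)
    return list(merged.values())


# Hypothetical records sharing an "id" key.
left = [{"id": 1, "file": "a.vcf"}, {"id": 2, "file": "b.vcf"}]
right = [{"id": 1, "tag": "somatic"}]
combined = merge_dict_on_key_sketch(left, right, "id")
```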

BALSAMIC.utils.cli.merge_json(*args)[source]

Takes a list of JSON files and merges them together. Input: list of JSON files. Output: dictionary of merged JSON.
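
As an illustration, the sketch below loads each file and folds its top-level keys into one dictionary. It assumes a shallow merge in which later files override earlier ones; the actual merge strategy is not stated here, and the function name and file contents are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def merge_json_sketch(*json_files):
    """Load each JSON file and fold its top-level keys into one dictionary."""
    merged = {}
    for json_file in json_files:
        with open(json_file) as handle:
            merged.update(json.load(handle))
    return merged


with tempfile.TemporaryDirectory() as tmp_dir:
    first = Path(tmp_dir) / "first.json"
    second = Path(tmp_dir) / "second.json"
    first.write_text(json.dumps({"QC": {"adapter_trim": True}}))
    second.write_text(json.dumps({"panel": {"chrom": ["1", "X"]}}))
    merged = merge_json_sketch(first, second)
```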

BALSAMIC.utils.cli.recursive_default_dict()[source]

Recursively create defaultdict.

BALSAMIC.utils.cli.singularity(sif_path: str, cmd: str, bind_paths: list) → str[source]

Run within container

Executes input command string via a Singularity container image

Parameters:
  • sif_path – Path to singularity image file (sif)
  • cmd – A string for series of commands to run
  • bind_paths – a list of paths to bind within the container
Returns:

A sanitized Singularity cmd

Raises:

BalsamicError – An error occurred while creating cmd
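
For orientation, a command of this kind can be composed from the `singularity exec` subcommand and its `--bind` option. The sketch below only builds the string and does not run anything; the function name, the example paths, and the use of `bash -c` are assumptions, and the real function may add checks such as raising BalsamicError.

```python
import shlex


def singularity_cmd_sketch(sif_path, cmd, bind_paths):
    """Compose a `singularity exec` command string; nothing is executed here."""
    parts = ["singularity", "exec"]
    for bind_path in bind_paths:
        parts += ["--bind", bind_path]
    parts += [sif_path, "bash", "-c", cmd]
    # shlex.join (Python 3.8+) quotes every argument that needs quoting.
    return shlex.join(parts)


command = singularity_cmd_sketch(
    "balsamic.sif", "snakemake --version", ["/data/fastq", "/data/ref"]
)
```

Quoting the inner command keeps it as a single argument to `bash -c`, so it is interpreted inside the container rather than by the calling shell.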

BALSAMIC.utils.cli.validate_fastq_pattern(sample)[source]

Finds the correct filename prefix from file path, and returns it. An error is raised if sample name has invalid pattern

BALSAMIC.utils.cli.write_json(json_out, output_config)[source]

BALSAMIC.utils.exc module

exception BALSAMIC.utils.exc.BalsamicError(message)[source]

Bases: Exception

Base exception for BALSAMIC.

exception BALSAMIC.utils.exc.WorkflowRunError(message)[source]

Bases: BALSAMIC.utils.exc.BalsamicError

Exception for handling workflow errors. Raise this exception when a workflow or rule fails to run or execute.
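
Because WorkflowRunError derives from BalsamicError, a handler for the base class also catches workflow failures. The sketch below mirrors that hierarchy with hypothetical stand-in classes so it runs without BALSAMIC installed; the message attribute is an assumption.

```python
class BalsamicErrorSketch(Exception):
    """Hypothetical stand-in for BALSAMIC.utils.exc.BalsamicError."""

    def __init__(self, message):
        super().__init__(message)
        self.message = message


class WorkflowRunErrorSketch(BalsamicErrorSketch):
    """Hypothetical stand-in for BALSAMIC.utils.exc.WorkflowRunError."""


try:
    raise WorkflowRunErrorSketch("rule failed to execute")
except BalsamicErrorSketch as error:
    # The base-class handler also receives the more specific workflow error.
    caught = error
```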

BALSAMIC.utils.models module

class BALSAMIC.utils.models.AnalysisModel[source]

Bases: pydantic.main.BaseModel

Pydantic model containing workflow variables

case_id

Field(required); string case identifier

analysis_type

Field(required); string literal [single, paired]
  • single – if only tumor samples are provided
  • paired – if both tumor and normal samples are provided

sequencing_type

Field(required); string literal [targeted, wgs]
  • targeted – if capture kit was used to enrich specific genomic regions
  • wgs – if whole genome sequencing was performed

analysis_dir

Field(required); existing path where to save files

fastq_path

Field(optional); Path where fastq files will be stored

script

Field(optional); Path where snakemake scripts will be stored

log

Field(optional); Path where logs will be saved

result

Field(optional); Path where BALSAMIC output will be stored

benchmark

Field(optional); Path where benchmark report will be stored

dag

Field(optional); Path where DAG graph of workflow will be stored

BALSAMIC_version

Field(optional); Current version of BALSAMIC

config_creation_date

Field(optional); Timestamp when config was created

Raises:

ValueError – When analysis_type is set to any value other than [single, paired, qc], or when sequencing_type is set to any value other than [wgs, targeted]

class Config[source]

Bases: object

validate_all = True
classmethod analysis_type_literal(value) → str[source]
classmethod datetime_as_string(value)[source]
classmethod dirpath_always_abspath(value) → str[source]
classmethod parse_analysis_to_benchmark_path(value, values, **kwargs) → str[source]
classmethod parse_analysis_to_dag_path(value, values, **kwargs) → str[source]
classmethod parse_analysis_to_fastq_path(value, values, **kwargs) → str[source]
classmethod parse_analysis_to_log_path(value, values, **kwargs) → str[source]
classmethod parse_analysis_to_result_path(value, values, **kwargs) → str[source]
classmethod parse_analysis_to_script_path(value, values, **kwargs) → str[source]
classmethod sequencing_type_literal(value) → str[source]
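
The literal checks described for analysis_type and sequencing_type can be illustrated without pydantic. The sketch below uses a standard-library dataclass as a stand-in, so AnalysisSketch and the constant names are hypothetical, and the real validators may behave differently.

```python
from dataclasses import dataclass

# Accepted values as listed in the Raises entry for AnalysisModel.
VALID_ANALYSIS_TYPES = ("single", "paired", "qc")
VALID_SEQUENCING_TYPES = ("targeted", "wgs")


@dataclass
class AnalysisSketch:
    case_id: str
    analysis_type: str
    sequencing_type: str

    def __post_init__(self):
        # Reject any value outside the accepted literals.
        if self.analysis_type not in VALID_ANALYSIS_TYPES:
            raise ValueError(f"analysis_type must be one of {VALID_ANALYSIS_TYPES}")
        if self.sequencing_type not in VALID_SEQUENCING_TYPES:
            raise ValueError(f"sequencing_type must be one of {VALID_SEQUENCING_TYPES}")


analysis = AnalysisSketch(case_id="case_1", analysis_type="paired", sequencing_type="wgs")
```
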
class BALSAMIC.utils.models.BalsamicConfigModel[source]

Bases: pydantic.main.BaseModel

Summarizes config models in preparation for export

QC

Field(QCmodel); variables relevant for fastq preprocessing and QC

vcf

Field(VCFmodel); variables relevant for the variant calling pipeline

samples

Field(Dict); dictionary containing samples submitted for analysis

reference

Field(Dict); dictionary containing paths to reference genome files

panel

Field(PanelModel(optional)); variables relevant to PANEL BED if capture kit is used

bioinfo_tools

Field(BioinfoToolsModel); dictionary of bioinformatics software and their versions used for the analysis

singularity

Field(Path); path to singularity container of BALSAMIC

background_variants

Field(Path(optional)); path to BACKGROUND VARIANTS for UMI

conda_env_yaml

Field(Path(CONDA_ENV_YAML)); path where Balsamic configs can be found

rule_directory

Field(Path(RULE_DIRECTORY)); path where snakemake rules can be found

classmethod abspath_as_str(value)[source]
classmethod fl_abspath_as_str(value)[source]
classmethod transform_path_to_dict(value)[source]
class BALSAMIC.utils.models.BioinfoToolsModel[source]

Bases: pydantic.main.BaseModel

Holds versions of current bioinformatic tools used in analysis

class BALSAMIC.utils.models.PanelModel[source]

Bases: pydantic.main.BaseModel

Holds attributes of PANEL BED file if provided

capture_kit

Field(str(Path)); string representation of path to PANEL BED file

chrom

Field(list(str)); list of chromosomes in PANEL BED

Raises:

ValueError – When capture_kit argument is set, but is not a valid path

classmethod path_as_abspath_str(value)[source]
class BALSAMIC.utils.models.QCModel[source]

Bases: pydantic.main.BaseModel

Contains settings for quality control and pre-processing

picard_rmdup

Field(bool); whether duplicate removal is to be applied in the workflow

adapter

Field(str(AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT)); adapter sequence to trim

quality_trim

Field(bool); whether quality trimming is to be performed in the workflow

adapter_trim

Field(bool); whether adapter trimming is to be performed in the workflow

umi_trim

Field(bool); whether UMI trimming is to be performed in the workflow

min_seq_length

Field(str(int)); minimum sequence length cutoff for reads

umi_trim_length

Field(str(int)); length of UMI to be trimmed from reads

Raises:

ValueError – When the input for min_seq_length or umi_trim_length cannot be interpreted as an integer and coerced to a string

class Config[source]

Bases: object

validate_all = True
classmethod coerce_int_as_str(value)[source]
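
The coercion described for min_seq_length and umi_trim_length can be sketched as a plain function: parse the input as an integer, then return it as a string. The name coerce_int_as_str_sketch is hypothetical and the real validator may differ in detail.

```python
def coerce_int_as_str_sketch(value):
    """Return value as a string if it can be read as an integer, else raise ValueError."""
    try:
        return str(int(value))
    except (TypeError, ValueError) as error:
        raise ValueError(f"{value!r} cannot be interpreted as an integer") from error


min_seq_length = coerce_int_as_str_sketch(25)
umi_trim_length = coerce_int_as_str_sketch("5")
```
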
class BALSAMIC.utils.models.ReferenceMeta[source]

Bases: pydantic.main.BaseModel

Defines a basemodel for all reference files

This class defines a meta for various reference files. Only reference_genome is mandatory.

basedir

str for base directory which will be appended to all ReferenceUrlsModel fields

reference_genome

ReferenceUrlsModel. Required field for reference genome fasta file

dbsnp

ReferenceUrlsModel. Optional field for dbSNP vcf file

hc_vcf_1kg

ReferenceUrlsModel. Optional field for high confidence 1000Genome vcf

mills_1kg

ReferenceUrlsModel. Optional field for Mills’ high confidence indels vcf

known_indel_1kg

ReferenceUrlsModel. Optional field for 1000Genome known indel vcf

vcf_1kg

ReferenceUrlsModel. Optional field for 1000Genome all SNPs

wgs_calling

ReferenceUrlsModel. Optional field for wgs calling intervals

genome_chrom_size

ReferenceUrlsModel. Optional field for genome’s chromosome sizes

cosmicdb

ReferenceUrlsModel. Optional COSMIC database’s variants as vcf

refgene_txt

ReferenceUrlsModel. Optional refseq’s gene flat format from UCSC

refgene_sql

ReferenceUrlsModel. Optional refseq’s gene sql format from UCSC

classmethod validate_path(value, values, **kwargs)[source]

validate and append path in ReferenceUrlsModel fields with basedir

class BALSAMIC.utils.models.ReferenceUrlsModel[source]

Bases: pydantic.main.BaseModel

Defines a basemodel for reference urls

This class handles four attributes for each reference url. The attributes define the url, type of file, gzip status, and genome version.

url

defines the url to access file. Essentially it will be used to download file locally. It should match url_type://…

file_type

describes file type. Accepted values are VALID_REF_FORMAT constant

gzip

gzip status. Binary: True or False

genome_version

genome version matching the content of the file. Accepted values are VALID_GENOME_VER constant

Raises:

ValidationError – When it can’t validate values matching above attributes

classmethod check_file_type(value) → str[source]

Validate file format according to constants

classmethod check_genome_ver(value) → str[source]

Validate genome version according to constants

get_output_file

return output file full path

write_md5

calculate md5 for first 4kb of file and write to file_name.md5
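
A partial checksum like this can be sketched with hashlib. The code below assumes that "4kb" means the first 4096 bytes and that the .md5 file holds only the hex digest; the function name and the test file are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def write_partial_md5_sketch(file_path, chunk_size=4096):
    """Hash the first chunk_size bytes of file_path and write the digest next to it."""
    file_path = Path(file_path)
    with open(file_path, "rb") as handle:
        digest = hashlib.md5(handle.read(chunk_size)).hexdigest()
    md5_path = file_path.with_name(file_path.name + ".md5")
    md5_path.write_text(digest)
    return md5_path


with tempfile.TemporaryDirectory() as tmp_dir:
    reference = Path(tmp_dir) / "genome.fasta"
    reference.write_bytes(b"ACGT" * 2500)  # 10,000 bytes, more than one chunk
    md5_file = write_partial_md5_sketch(reference)
    stored_digest = md5_file.read_text()
    md5_file_name = md5_file.name
```

Hashing only the first block keeps the check fast for large reference files, at the cost of not detecting changes further into the file.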

class BALSAMIC.utils.models.SampleInstanceModel[source]

Bases: pydantic.main.BaseModel

Holds attributes for samples used in analysis

file_prefix

Field(str); basename of sample pair

sample_type

Field(str; alias=type); type of sample [tumor, normal]

readpair_suffix

Field(List); currently always set to [1, 2]

Raises:

ValueError – When sample_type is set to any value other than [tumor, normal]

classmethod sample_type_literal(value)[source]
class BALSAMIC.utils.models.VCFAttributes[source]

Bases: pydantic.main.BaseModel

General purpose filter to manage various VCF attributes

This class handles three parameters for the purpose of filtering variants: tag_value, filter_name, and the field in the VCF.

E.g. AD=VCFAttributes(tag_value=5, filter_name="balsamic_low_tumor_ad", field="INFO"): a value of 5 from the INFO field, and filter_name will be balsamic_low_tumor_ad

tag_value

float

filter_name

str

field

str

class BALSAMIC.utils.models.VCFModel[source]

Bases: pydantic.main.BaseModel

Contains VCF config

class BALSAMIC.utils.models.VarCallerFilter[source]

Bases: pydantic.main.BaseModel

General purpose for variant caller filters

This class handles attributes and filter for variant callers

AD

VCFAttributes (required); minimum allelic depth

AF_min

VCFAttributes (optional); minimum allelic fraction

AF_max

VCFAttributes (optional); maximum allelic fraction

MQ

VCFAttributes (optional); minimum mapping quality

DP

VCFAttributes (optional); minimum read depth

varcaller_name

str (required); variant caller name

filter_type

str (required); filter name for variant caller

analysis_type

str (required); analysis type e.g. tumor_normal or tumor_only

description

str (required); comment section for description

class BALSAMIC.utils.models.VarcallerAttribute[source]

Bases: pydantic.main.BaseModel

Holds variables for variant caller software

mutation

str of mutation class

mutation_type

str of mutation type

analysis_type

list of str for analysis types

workflow_solution

list of str for workflows

Raises:

ValueError – When a value other than [somatic, germline] is passed in the mutation field, or when a value other than [SNV, CNV, SV] is passed in the mutation_type field

classmethod annotation_type_literal(value) → str[source]

Validate analysis types

classmethod mutation_literal(value) → str[source]

Validate mutation class

classmethod mutation_type_literal(value) → str[source]

Validate mutation type

classmethod workflow_solution_literal(value) → str[source]

Validate workflow solution

BALSAMIC.utils.rule module

BALSAMIC.utils.rule.get_chrom(panelfile)[source]

Input: a panel bedfile. Output: list of chromosomes in the bedfile.

BALSAMIC.utils.rule.get_conda_env(yaml_file, pkg)[source]

Retrieve conda environment for package from a predefined yaml file

Input: balsamic_env. Output: string of the conda env that contains the package.

BALSAMIC.utils.rule.get_delivery_id(id_candidate: str, file_to_store: str, tags: list, output_file_wildcards: dict)[source]

resolve delivery id from file_to_store, tags, and output_file_wildcards

This function will get a filename, a list of tags, and an id_candidate. id_candidate should be in the form of an f-string.

Parameters:
  • id_candidate – an f-string-style format string, e.g. "{case_name}"
  • file_to_store – a filename to search for a resolved id
  • tags – a list of tags with a resolved id in it
  • output_file_wildcards – a dictionary of wildcards. Keys are wildcard names, and values are lists of wildcard values
Returns:

a resolved id string. If it can’t be resolved, it’ll return the id_candidate value

Return type:

delivery_id

BALSAMIC.utils.rule.get_picard_mrkdup(config)[source]

Input: sample config file output from BALSAMIC. Output: mrkdup or rmdup string.

BALSAMIC.utils.rule.get_reference_output_files(reference_files_dict: dict, file_type: str) → list[source]

Returns list of files matching a file_type from reference files

Parameters:
  • reference_files_dict – A validated dict model from reference
  • file_type – a file type string, e.g. vcf, fasta
Returns:

list of file_type files that are found in reference_files_dict

Return type:

ref_vcf_list

BALSAMIC.utils.rule.get_result_dir(config)[source]

Input: sample config file from BALSAMIC. Output: string of result directory path.

BALSAMIC.utils.rule.get_rule_output(rules, rule_name, output_file_wildcards)[source]

get list of existing output files from a given workflow

Parameters:
  • rule_name – rule name to query from the rules object
  • rules – snakemake rules object
Returns:

list of tuples (file, file_index, rule_name, tags, id, file_extension) for rules

Return type:

output_files

BALSAMIC.utils.rule.get_sample_type(sample, bio_type)[source]

Input: sample dictionary from BALSAMIC’s config file. Output: list of sample type ids.

BALSAMIC.utils.rule.get_script_path(script_name: str)[source]

Retrieves the path of the script whose name matches script_name.

BALSAMIC.utils.rule.get_threads(cluster_config, rule_name='__default__')[source]

Retrieves threads from the cluster config, or returns the default value of 8
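
The lookup-with-default behaviour can be sketched as two dictionary reads. The layout of cluster_config is an assumption here: each rule is taken to map to a dictionary with a thread count under the hypothetical key "n". The function name and rule names are also hypothetical.

```python
def get_threads_sketch(cluster_config, rule_name="__default__", default_threads=8):
    """Return the thread count for rule_name, or default_threads if none is set."""
    rule_settings = cluster_config.get(rule_name, {})
    # "n" is an assumed key name for the thread count.
    return rule_settings.get("n", default_threads)


cluster_config = {"__default__": {"n": 4}, "align_reads": {"n": 16}}
align_threads = get_threads_sketch(cluster_config, "align_reads")
fallback_threads = get_threads_sketch(cluster_config, "unlisted_rule")
```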

BALSAMIC.utils.rule.get_variant_callers(config, mutation_type: str, mutation_class: str, analysis_type: str, workflow_solution: str)[source]

Get list of variant callers for a given list of input

Parameters:
  • config – A validated dictionary of case_config
  • mutation_type – A mutation type string, e.g. SNV
  • mutation_class – A mutation class string, e.g. somatic
  • analysis_type – An analysis type string, e.g. paired
  • workflow_solution – A workflow type string, e.g. BALSAMIC
Returns:

A list of variant caller names extracted from config

Raises:

WorkflowRunError if mutation_type, mutation_class, analysis_type, or workflow_solution does not have a valid value
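
The selection can be pictured as a filter over caller attributes. The sketch below assumes a config layout in which config["vcf"] maps each caller name to the VarcallerAttribute fields; that layout, the caller names, and the stand-in exception are hypothetical. Only mutation_type and mutation_class are validated, since accepted values for the other two arguments are not listed in this reference.

```python
class WorkflowRunErrorSketch(Exception):
    """Hypothetical stand-in for BALSAMIC.utils.exc.WorkflowRunError."""


VALID_MUTATION_CLASSES = ("somatic", "germline")
VALID_MUTATION_TYPES = ("SNV", "CNV", "SV")


def get_variant_callers_sketch(config, mutation_type, mutation_class,
                               analysis_type, workflow_solution):
    """Return caller names whose attributes match all four selection criteria."""
    if mutation_type not in VALID_MUTATION_TYPES:
        raise WorkflowRunErrorSketch(f"{mutation_type} is not a valid mutation type")
    if mutation_class not in VALID_MUTATION_CLASSES:
        raise WorkflowRunErrorSketch(f"{mutation_class} is not a valid mutation class")
    callers = []
    for name, attributes in config["vcf"].items():
        if (attributes["mutation"] == mutation_class
                and attributes["mutation_type"] == mutation_type
                and analysis_type in attributes["analysis_type"]
                and workflow_solution in attributes["workflow_solution"]):
            callers.append(name)
    return callers


# Assumed config layout with made-up caller names.
config = {"vcf": {
    "caller_a": {"mutation": "somatic", "mutation_type": "SNV",
                 "analysis_type": ["paired", "single"], "workflow_solution": ["BALSAMIC"]},
    "caller_b": {"mutation": "germline", "mutation_type": "SNV",
                 "analysis_type": ["paired"], "workflow_solution": ["BALSAMIC"]},
}}
somatic_snv = get_variant_callers_sketch(config, "SNV", "somatic", "paired", "BALSAMIC")
```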

BALSAMIC.utils.rule.get_vcf(config, var_caller, sample)[source]

Input: BALSAMIC config file. Output: list of vcf files.

Module contents