BALSAMIC.utils package¶
Submodules¶
BALSAMIC.utils.cli module¶
-
class
BALSAMIC.utils.cli.
SnakeMake
[source]¶ Bases:
object
To build a snakemake command using cli options
Params:
case_name - analysis case name
working_dir - working directory for snakemake
configfile - sample configuration file (json) output of balsamic-config-sample
run_mode - run mode: cluster or local shell run
cluster_config - cluster config json file
scheduler - slurm command constructor
log_path - log file path
script_path - file path for slurm scripts
result_path - result directory
qos - QOS for sbatch jobs
account - scheduler (e.g. slurm) account
mail_user - email to account to send job run status
forceall - To add ‘--forceall’ option for snakemake
run_analysis - To run pipeline
use_singularity - To use singularity
singularity_bind - Singularity bind path
singularity_arg - Singularity arguments to pass to snakemake
sm_opt - snakemake additional options
disable_variant_caller - Disable variant caller
-
BALSAMIC.utils.cli.
add_doc
(docstring)[source]¶ A decorator for adding docstring. Taken shamelessly from stackexchange.
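A minimal sketch of how a docstring-attaching decorator of this kind might look. The body is an assumption for illustration, not the BALSAMIC source; `add_doc_sketch` and `greet` are hypothetical names.

```python
def add_doc_sketch(docstring):
    """Return a decorator that sets the wrapped function's __doc__."""

    def document(func):
        # Attach the supplied text and hand back the same function object.
        func.__doc__ = docstring
        return func

    return document


@add_doc_sketch("Say hello to the given name.")
def greet(name):
    return f"Hello, {name}"
```

A decorator like this is handy when the same help text is shared between a function and a command-line option.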
-
BALSAMIC.utils.cli.
convert_defaultdict_to_regular_dict
(inputdict: dict)[source]¶ Recursively convert defaultdict to dict.
Replaces values of file_prefix with sample_name in deliverables dict
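The recursive conversion can be sketched as follows. This is an illustrative stand-in, not the BALSAMIC implementation; `to_regular_dict` and the sample data are hypothetical.

```python
from collections import defaultdict


def to_regular_dict(value):
    """Recursively replace every defaultdict (and nested dict) with a plain dict."""
    if isinstance(value, dict):
        # defaultdict is a dict subclass, so this branch covers both.
        return {key: to_regular_dict(item) for key, item in value.items()}
    return value


nested = defaultdict(lambda: defaultdict(list))
nested["sample_1"]["files"].append("a.fastq.gz")
plain = to_regular_dict(nested)
```

Converting before export avoids the side effect where merely reading a missing key on a defaultdict silently creates it.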
-
BALSAMIC.utils.cli.
createDir
(path, interm_path=[])[source]¶ Creates a directory, recursively checking whether it already exists; if it does, the number in the directory name is incremented until an unused name is found
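A rough sketch of the "increment until unused" idea, assuming the number is kept as a `name.N` suffix. The suffix format and the function name `create_dir_sketch` are assumptions made for illustration only.

```python
import os
import tempfile


def create_dir_sketch(path):
    """Create path; if it already exists, try name.1, name.2, ... instead."""
    path = os.path.abspath(path)
    if not os.path.isdir(path):
        os.makedirs(path)
        return path
    parent, base = os.path.split(path)
    stem, _, number = base.partition(".")
    # Assumed naming scheme: a trailing ".N" counter on the directory name.
    next_number = int(number) + 1 if number.isdigit() else 1
    return create_dir_sketch(os.path.join(parent, f"{stem}.{next_number}"))


with tempfile.TemporaryDirectory() as tmp:
    first = create_dir_sketch(os.path.join(tmp, "run"))
    second = create_dir_sketch(os.path.join(tmp, "run"))
    third = create_dir_sketch(os.path.join(tmp, "run"))
    created = [os.path.basename(p) for p in (first, second, third)]
```

Each call returns the directory that was actually created, so repeated runs never overwrite an earlier result directory.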
-
BALSAMIC.utils.cli.
create_fastq_symlink
(casefiles, symlink_dir: pathlib.Path)[source]¶ Creates symlinks for provided files in analysis/fastq directory. Identifies file prefix pattern, and also creates symlinks for the second read file, if needed
-
BALSAMIC.utils.cli.
generate_graph
(config_collection_dict, config_path)[source]¶ Generate DAG graph using snakemake stdout output
-
BALSAMIC.utils.cli.
get_bioinfo_tools_list
(conda_env_path) → dict[source]¶ Parses the names and versions of bioinfo tools used by BALSAMIC from config YAML into a dict
-
BALSAMIC.utils.cli.
get_fastq_bind_path
(fastq_path: pathlib.Path) → [][source]¶ Takes a path with symlinked fastq files. Returns unique paths to parent directories for Singularity bind
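One way to collect such bind paths is to resolve every symlink and keep the distinct parent directories. The sketch below assumes a POSIX filesystem; `fastq_bind_paths_sketch` and the file names are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def fastq_bind_paths_sketch(fastq_path: Path) -> list:
    """Resolve each symlink in fastq_path and collect the unique parent directories."""
    parents = set()
    for entry in Path(fastq_path).iterdir():
        if entry.is_symlink():
            parents.add(entry.resolve().parent)
    return sorted(parents)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    source_dir = root / "raw"
    link_dir = root / "fastq"
    source_dir.mkdir()
    link_dir.mkdir()
    for name in ("tumor_R_1.fastq.gz", "tumor_R_2.fastq.gz"):
        (source_dir / name).touch()
        os.symlink(source_dir / name, link_dir / name)
    bind_paths = fastq_bind_paths_sketch(link_dir)
    expected = [source_dir]
```

Binding the resolved parents matters because a container only sees the symlink target if the target's directory is mounted as well.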
-
BALSAMIC.utils.cli.
get_file_status_string
(file_to_check)[source]¶ Checks if a file exists and returns a string with a checkmark if it exists or a red cross mark if it doesn’t. Always assumes the file doesn’t exist unless proven otherwise.
-
BALSAMIC.utils.cli.
get_from_two_key
(input_dict, from_key, by_key, by_value, default=None)[source]¶ Given two keys, each holding a list of values of the same length, find the matching index of by_value in from_key from by_key.
Both from_key and by_key should exist in input_dict
-
BALSAMIC.utils.cli.
get_panel_chrom
(panel_bed) → list[source]¶ Returns a set of chromosomes present in PANEL BED
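Extracting chromosomes from a BED file amounts to reading the first tab-separated column. A small sketch under that assumption; `panel_chromosomes_sketch` and the example intervals are hypothetical.

```python
import os
import tempfile


def panel_chromosomes_sketch(panel_bed):
    """Collect the distinct values of the first tab-separated column of a BED file."""
    chromosomes = set()
    with open(panel_bed) as bed:
        for line in bed:
            line = line.strip()
            if line:  # skip blank lines
                chromosomes.add(line.split("\t")[0])
    return chromosomes


with tempfile.TemporaryDirectory() as tmp:
    bed_path = os.path.join(tmp, "panel.bed")
    with open(bed_path, "w") as handle:
        handle.write("1\t100\t200\n1\t300\t400\nX\t10\t50\n")
    chroms = panel_chromosomes_sketch(bed_path)
```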
-
BALSAMIC.utils.cli.
get_sample_dict
(tumor: str, normal: str, tumor_sample_name: str = None, normal_sample_name: str = None) → dict[source]¶ Concatenates sample dicts for all provided files
-
BALSAMIC.utils.cli.
get_sample_names
(filename, sample_type)[source]¶ Creates a dict with sample prefix, sample type, and readpair suffix
-
BALSAMIC.utils.cli.
get_snakefile
(analysis_type, sequencing_type='targeted')[source]¶ Return a string path for variant calling snakefile.
-
BALSAMIC.utils.cli.
merge_dict_on_key
(dict_1, dict_2, by_key)[source]¶ Merge two lists of dictionaries based on a key
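A merge of this kind can be sketched by grouping records on the shared key. The function name and sample records below are hypothetical, and the return type (a list) is an assumption.

```python
from collections import defaultdict


def merge_on_key_sketch(list_1, list_2, by_key):
    """Merge dictionaries from two lists whenever they share the same by_key value."""
    merged = defaultdict(dict)
    for records in (list_1, list_2):
        for record in records:
            # Records with the same key value are folded into one dictionary.
            merged[record[by_key]].update(record)
    return list(merged.values())


names = [{"id": "s1", "name": "tumor"}, {"id": "s2", "name": "normal"}]
files = [{"id": "s1", "file": "a.bam"}, {"id": "s2", "file": "b.bam"}]
merged = merge_on_key_sketch(names, files, "id")
```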
-
BALSAMIC.utils.cli.
merge_json
(*args)[source]¶ Takes a list of json files and merges them together. Input: list of json files. Output: dictionary of merged json.
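As an illustration, merging could be as simple as loading each file and updating one dictionary. This sketch assumes a shallow merge where later files win on key collisions, which may differ from the actual behaviour; `merge_json_sketch` is a hypothetical name.

```python
import json
import os
import tempfile


def merge_json_sketch(*paths):
    """Load each JSON file and fold its top-level keys into one dictionary."""
    merged = {}
    for path in paths:
        with open(path) as handle:
            merged.update(json.load(handle))  # shallow merge (assumption)
    return merged


with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "a.json")
    second = os.path.join(tmp, "b.json")
    with open(first, "w") as handle:
        json.dump({"QC": {"adapter_trim": True}}, handle)
    with open(second, "w") as handle:
        json.dump({"samples": {}}, handle)
    config = merge_json_sketch(first, second)
```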
-
BALSAMIC.utils.cli.
singularity
(sif_path: str, cmd: str, bind_paths: list) → str[source]¶ Run within container
Executes input command string via Singularity container image
Parameters: - sif_path – Path to singularity image file (sif)
- cmd – A string for series of commands to run
- bind_paths – a list of paths to bind within container
Returns: A sanitized Singularity cmd
Raises: BalsamicError – An error occurred while creating cmd
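To make the shape of the returned command concrete, here is a sketch that composes a `singularity exec` invocation with one `--bind` per path. Wrapping the command in `bash -c` and quoting with `shlex` are assumptions for illustration; `singularity_cmd_sketch` is a hypothetical name and does not check that Singularity is installed.

```python
import shlex


def singularity_cmd_sketch(sif_path: str, cmd: str, bind_paths: list) -> str:
    """Compose a 'singularity exec' command string with one --bind per path."""
    parts = ["singularity", "exec"]
    for bind_path in bind_paths:
        parts += ["--bind", bind_path]
    # Run the user command through a shell inside the container (assumption).
    parts += [sif_path, "bash", "-c", cmd]
    return " ".join(shlex.quote(part) for part in parts)


command = singularity_cmd_sketch(
    "balsamic.sif", "samtools --version", ["/data", "/ref"]
)
```

Quoting each part keeps paths with spaces or shell metacharacters from being split when the string is later executed.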
BALSAMIC.utils.exc module¶
-
exception
BALSAMIC.utils.exc.
BalsamicError
(message)[source]¶ Bases:
Exception
Base exception for BALSAMIC.
-
exception
BALSAMIC.utils.exc.
WorkflowRunError
(message)[source]¶ Bases:
BALSAMIC.utils.exc.BalsamicError
Exception for handling workflow errors. Raise this exception when a workflow or rule fails to run or execute
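The value of a shared base exception is that callers can catch every package error in one place. The classes below are stand-ins that mirror the hierarchy described here; their bodies and the `run_rule` helper are hypothetical.

```python
class BalsamicErrorSketch(Exception):
    """Stand-in for a package-wide base exception that carries a message."""

    def __init__(self, message):
        super().__init__(message)
        self.message = message


class WorkflowRunErrorSketch(BalsamicErrorSketch):
    """Stand-in for an error raised when a workflow or rule fails."""


def run_rule(succeeds: bool) -> str:
    # Hypothetical helper: fails on demand to show the raise site.
    if not succeeds:
        raise WorkflowRunErrorSketch("rule failed to execute")
    return "ok"


try:
    run_rule(succeeds=False)
except BalsamicErrorSketch as error:
    # Catching the base class also catches the workflow subclass.
    caught = error.message
```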
BALSAMIC.utils.models module¶
-
class
BALSAMIC.utils.models.
AnalysisModel
[source]¶ Bases:
pydantic.main.BaseModel
Pydantic model containing workflow variables
-
case_id
¶ Field(required); string case identifier
-
analysis_type
¶ Field(required); string literal [single, paired]
single : if only tumor samples are provided
paired : if both tumor and normal samples are provided
-
sequencing_type
¶ Field(required); string literal [targeted, wgs]
targeted : if capture kit was used to enrich specific genomic regions
wgs : if whole genome sequencing was performed
-
analysis_dir
¶ Field(required); existing path where to save files
-
fastq_path
¶ Field(optional); Path where fastq files will be stored
-
script
¶ Field(optional); Path where snakemake scripts will be stored
-
log
¶ Field(optional); Path where logs will be saved
-
result
¶ Field(optional); Path where BALSAMIC output will be stored
-
benchmark
¶ Field(optional); Path where benchmark report will be stored
-
dag
¶ Field(optional); Path where DAG graph of workflow will be stored
-
BALSAMIC_version
¶ Field(optional); Current version of BALSAMIC
-
config_creation_date
¶ Field(optional); Timestamp when config was created
Raises: ValueError – When analysis_type is set to any value other than [single, paired, qc], or when sequencing_type is set to any value other than [wgs, targeted]
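The validation behaviour can be illustrated without pydantic. The sketch below uses a standard-library dataclass as a stand-in, so it is not the model's real implementation; the allowed values are copied from the Raises clause, and `AnalysisSketch` is a hypothetical name.

```python
from dataclasses import dataclass

# Allowed values as listed in the Raises clause.
ANALYSIS_TYPES = ("single", "paired", "qc")
SEQUENCING_TYPES = ("wgs", "targeted")


@dataclass
class AnalysisSketch:
    case_id: str
    analysis_type: str
    sequencing_type: str

    def __post_init__(self):
        # Reject values outside the allowed sets at construction time.
        if self.analysis_type not in ANALYSIS_TYPES:
            raise ValueError(f"analysis_type must be one of {ANALYSIS_TYPES}")
        if self.sequencing_type not in SEQUENCING_TYPES:
            raise ValueError(f"sequencing_type must be one of {SEQUENCING_TYPES}")


valid = AnalysisSketch("case_1", "paired", "wgs")

try:
    AnalysisSketch("case_1", "trio", "wgs")
    rejected = False
except ValueError:
    rejected = True
```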
-
class
BALSAMIC.utils.models.
BalsamicConfigModel
[source]¶ Bases:
pydantic.main.BaseModel
Summarizes config models in preparation for export
-
QC
¶ Field(QCmodel); variables relevant for fastq preprocessing and QC
-
vcf
¶ Field(VCFmodel); variables relevant for variant calling pipeline
-
samples
¶ Field(Dict); dictionary containing samples submitted for analysis
-
reference
¶ Field(Dict); dictionary containing paths to reference genome files
-
panel
¶ Field(PanelModel(optional)); variables relevant to PANEL BED if capture kit is used
-
bioinfo_tools
¶ Field(BioinfoToolsModel); dictionary of bioinformatics software and their versions used for the analysis
-
singularity
¶ Field(Path); path to singularity container of BALSAMIC
-
background_variants
¶ Field(Path(optional)); path to BACKGROUND VARIANTS for UMI
-
conda_env_yaml
¶ Field(Path(CONDA_ENV_YAML)); path where Balsamic configs can be found
-
rule_directory
¶ Field(Path(RULE_DIRECTORY)); path where snakemake rules can be found
-
-
class
BALSAMIC.utils.models.
BioinfoToolsModel
[source]¶ Bases:
pydantic.main.BaseModel
Holds versions of current bioinformatic tools used in analysis
-
class
BALSAMIC.utils.models.
PanelModel
[source]¶ Bases:
pydantic.main.BaseModel
Holds attributes of PANEL BED file if provided
-
capture_kit
¶ Field(str(Path)); string representation of path to PANEL BED file
-
chrom
¶ Field(list(str)); list of chromosomes in PANEL BED
Raises: ValueError – When capture_kit argument is set, but is not a valid path
-
class
BALSAMIC.utils.models.
QCModel
[source]¶ Bases:
pydantic.main.BaseModel
Contains settings for quality control and pre-processing
-
picard_rmdup
¶ Field(bool); whether duplicate removal is to be applied in the workflow
-
adapter
¶ Field(str(AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT)); adapter sequence to trim
-
quality_trim
¶ Field(bool); whether quality trimming is to be performed in the workflow
-
adapter_trim
¶ Field(bool); whether adapter trimming is to be performed in the workflow
-
umi_trim
¶ Field(bool); whether UMI trimming is to be performed in the workflow
-
min_seq_length
¶ Field(str(int)); minimum sequence length cutoff for reads
-
umi_trim_length
¶ Field(str(int)); length of UMI to be trimmed from reads
Raises: ValueError – When the input to min_seq_length and umi_trim_length cannot be interpreted as an integer and coerced to a string
-
class
BALSAMIC.utils.models.
ReferenceMeta
[source]¶ Bases:
pydantic.main.BaseModel
Defines a basemodel for all reference files
This class defines a meta for various reference files. Only reference_genome is mandatory.
-
basedir
¶ str for basedirectory which will be appended to all ReferenceUrlsModel fields
-
reference_genome
¶ ReferenceUrlsModel. Required field for reference genome fasta file
-
dbsnp
¶ ReferenceUrlsModel. Optional field for dbSNP vcf file
-
hc_vcf_1kg
¶ ReferenceUrlsModel. Optional field for high confidence 1000Genome vcf
-
mills_1kg
¶ ReferenceUrlsModel. Optional field for Mills’ high confidence indels vcf
-
known_indel_1kg
¶ ReferenceUrlsModel. Optional field for 1000Genome known indel vcf
-
vcf_1kg
¶ ReferenceUrlsModel. Optional field for 1000Genome all SNPs
-
wgs_calling
¶ ReferenceUrlsModel. Optional field for wgs calling intervals
-
genome_chrom_size
¶ ReferenceUrlsModel. Optional field for genome’s chromosome sizes
-
cosmicdb
¶ ReferenceUrlsModel. Optional COSMIC database’s variants as vcf
-
refgene_txt
¶ ReferenceUrlsModel. Optional refseq’s gene flat format from UCSC
-
refgene_sql
¶ ReferenceUrlsModel. Optional refseq’s gene sql format from UCSC
-
-
class
BALSAMIC.utils.models.
ReferenceUrlsModel
[source]¶ Bases:
pydantic.main.BaseModel
Defines a basemodel for reference urls
This class handles four attributes for each reference url. The attributes define the url, the type of file, the gzip status, and the genome version.
-
url
¶ defines the url to access file. Essentially it will be used to download file locally. It should match url_type://…
-
file_type
¶ describes file type. Accepted values are VALID_REF_FORMAT constant
-
gzip
¶ gzip status. Binary: True or False
-
genome_version
¶ genome version matching the content of the file. Accepted values are VALID_GENOME_VER constant
Raises: ValidationError – When it can’t validate values matching above attributes
-
get_output_file
¶ return output file full path
-
write_md5
¶ calculate md5 for first 4kb of file and write to file_name.md5
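Hashing only the start of a file keeps the check cheap for large references. The sketch below assumes "4kb" means the first 4096 bytes and that the digest is written as plain hex text; `write_partial_md5_sketch` is a hypothetical name, not the model's method.

```python
import hashlib
import os
import tempfile


def write_partial_md5_sketch(file_name: str, chunk_size: int = 4096) -> str:
    """Hash the first chunk_size bytes of file_name and write the digest to file_name.md5."""
    with open(file_name, "rb") as handle:
        digest = hashlib.md5(handle.read(chunk_size)).hexdigest()
    md5_path = file_name + ".md5"
    with open(md5_path, "w") as handle:
        handle.write(digest + "\n")
    return md5_path


with tempfile.TemporaryDirectory() as tmp:
    reference = os.path.join(tmp, "genome.fa")
    with open(reference, "wb") as handle:
        handle.write(b"A" * 10000)
    md5_path = write_partial_md5_sketch(reference)
    with open(md5_path) as handle:
        stored = handle.read().strip()
    expected = hashlib.md5(b"A" * 4096).hexdigest()
```

Note that a partial hash only detects changes within the hashed prefix, so it works as a quick fingerprint rather than a full integrity check.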
-
-
class
BALSAMIC.utils.models.
SampleInstanceModel
[source]¶ Bases:
pydantic.main.BaseModel
Holds attributes for samples used in analysis
-
file_prefix
¶ Field(str); basename of sample pair
-
sample_type
¶ Field(str; alias=type); type of sample [tumor, normal]
-
sample_name
¶ Field(str); Internal ID of sample to use in deliverables
-
readpair_suffix
¶ Field(List); currently always set to [1, 2]
Raises: ValueError – When sample_type is set to any value other than [tumor, normal]
-
class
BALSAMIC.utils.models.
VCFAttributes
[source]¶ Bases:
pydantic.main.BaseModel
General purpose filter to manage various VCF attributes
This class handles three parameters for the purpose of filtering variants: a tag_value, a filter_name, and the VCF field to use.
E.g. AD=VCFAttributes(tag_value=5, filter_name="balsamic_low_tumor_ad", field="INFO") sets a value of 5 from the INFO field, and the filter_name will be balsamic_low_tumor_ad
-
tag_value
¶ float
-
filter_name
¶ str
-
field
¶ str
-
-
class
BALSAMIC.utils.models.
VarCallerFilter
[source]¶ Bases:
pydantic.main.BaseModel
General purpose for variant caller filters
This class handles attributes and filter for variant callers
-
AD
¶ VCFAttributes (required); minimum allelic depth
-
AF_min
¶ VCFAttributes (optional); minimum allelic fraction
-
AF_max
¶ VCFAttributes (optional); maximum allelic fraction
-
MQ
¶ VCFAttributes (optional); minimum mapping quality
-
DP
¶ VCFAttributes (optional); minimum read depth
-
varcaller_name
¶ str (required); variant caller name
-
filter_type
¶ str (required); filter name for variant caller
-
analysis_type
¶ str (required); analysis type e.g. tumor_normal or tumor_only
-
description
¶ str (required); comment section for description
-
-
class
BALSAMIC.utils.models.
VarcallerAttribute
[source]¶ Bases:
pydantic.main.BaseModel
Holds variables for variant caller software
-
mutation
¶ str of mutation class
-
mutation_type
¶ str of mutation type
-
analysis_type
¶ list of str for analysis types
-
workflow_solution
¶ list of str for workflows
Raises: ValueError – When a value other than [somatic, germline] is passed in the mutation field, or when a value other than [SNV, CNV, SV] is passed in the mutation_type field
BALSAMIC.utils.rule module¶
-
BALSAMIC.utils.rule.
get_chrom
(panelfile)[source]¶ input: a panel bedfile
output: list of chromosomes in the bedfile
-
BALSAMIC.utils.rule.
get_conda_env
(yaml_file, pkg)[source]¶ Retrieve conda environment for package from a predefined yaml file
input: balsamic_env
output: string of the conda env that contains the package
-
BALSAMIC.utils.rule.
get_delivery_id
(id_candidate: str, file_to_store: str, tags: list, output_file_wildcards: dict)[source]¶ resolve delivery id from file_to_store, tags, and output_file_wildcards
This function will get a filename, a list of tags, and an id_candidate. id_candidate should be in the form of an f-string.
Parameters: - id_candidate – an f-string format string, e.g. “{case_name}”
- file_to_store – a filename to search for a resolved id
- tags – a list of tags with a resolved id in it
- output_file_wildcards – a dictionary of wildcards. Keys are wildcard names, and values are list of wildcard values
Returns: a resolved id string. If it can’t be resolved, it’ll return the id_candidate value
Return type: delivery_id
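The resolution step can be sketched as trying every wildcard combination until one matches. The matching rule used here (the resolved id must appear in both the filename and the tags) is an assumption, and `itertools.product` stands in for whatever expansion the real function uses; all names and paths are hypothetical.

```python
from itertools import product


def resolve_delivery_id_sketch(id_candidate, file_to_store, tags, output_file_wildcards):
    """Try every wildcard combination; keep the first id found in both the file name and the tags."""
    names = list(output_file_wildcards)
    for values in product(*(output_file_wildcards[name] for name in names)):
        resolved = id_candidate.format(**dict(zip(names, values)))
        if resolved in file_to_store and resolved in tags:
            return resolved
    # Nothing matched: fall back to the unresolved candidate.
    return id_candidate


wildcards = {"sample": ["tumor_1", "normal_1"], "case_name": ["case_a"]}
delivery_id = resolve_delivery_id_sketch(
    "{sample}", "analysis/bam/tumor_1.bam", ["bam", "tumor_1"], wildcards
)
unresolved = resolve_delivery_id_sketch(
    "{sample}", "analysis/report.html", ["report"], wildcards
)
```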
-
BALSAMIC.utils.rule.
get_picard_mrkdup
(config)[source]¶ input: sample config file output from BALSAMIC
output: mrkdup or rmdup strings
-
BALSAMIC.utils.rule.
get_reference_output_files
(reference_files_dict: dict, file_type: str) → list[source]¶ Returns list of files matching a file_type from reference files
Parameters: - reference_files_dict – A validated dict model from reference
- file_type – a file type string, e.g. vcf, fasta
Returns: list of file_type files that are found in reference_files_dict
Return type: ref_vcf_list
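Filtering by file type could look like the sketch below. The dictionary layout, in particular the `output_file` key, is an assumption about the validated model and may not match the real structure; the function name and paths are hypothetical.

```python
def reference_files_of_type_sketch(reference_files_dict: dict, file_type: str) -> list:
    """Return the output_file of every entry whose file_type matches."""
    return [
        entry["output_file"]
        for entry in reference_files_dict.values()
        # Skip non-dict values such as a base directory string.
        if isinstance(entry, dict) and entry.get("file_type") == file_type
    ]


references = {
    "basedir": "/ref",
    "reference_genome": {"file_type": "fasta", "output_file": "/ref/genome.fa"},
    "dbsnp": {"file_type": "vcf", "output_file": "/ref/dbsnp.vcf"},
    "cosmicdb": {"file_type": "vcf", "output_file": "/ref/cosmic.vcf"},
}
ref_vcf_list = reference_files_of_type_sketch(references, "vcf")
```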
-
BALSAMIC.utils.rule.
get_result_dir
(config)[source]¶ input: sample config file from BALSAMIC
output: string of result directory path
-
BALSAMIC.utils.rule.
get_rule_output
(rules, rule_name, output_file_wildcards)[source]¶ get list of existing output files from a given workflow
Parameters: - rule_name – rule name to query from rules object
- rules – snakemake rules object
Returns: list of tuples (file, file_index, rule_name, tags, id, file_extension) for rules
Return type: output_files
-
BALSAMIC.utils.rule.
get_sample_type
(sample, bio_type)[source]¶ input: sample dictionary from BALSAMIC’s config file
output: list of sample type id
-
BALSAMIC.utils.rule.
get_script_path
(script_name: str)[source]¶ Retrieves script path where name is matching {{script_name}}.
-
BALSAMIC.utils.rule.
get_threads
(cluster_config, rule_name='__default__')[source]¶ To retrieve threads from cluster config or return default value of 8
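A lookup with a fallback of 8 can be expressed in one line. The key name `n` for the thread count is an assumption about the cluster config layout, and `threads_for_rule_sketch` is a hypothetical name.

```python
def threads_for_rule_sketch(cluster_config: dict, rule_name: str = "__default__") -> int:
    """Return the rule's thread count, or 8 when the rule or the key is missing."""
    # "n" as the thread-count key is an assumption for this sketch.
    return cluster_config.get(rule_name, {}).get("n", 8)


cluster_config = {"__default__": {"n": 1}, "bwa_mem": {"n": 16}}
bwa_threads = threads_for_rule_sketch(cluster_config, "bwa_mem")
default_threads = threads_for_rule_sketch(cluster_config)
fallback_threads = threads_for_rule_sketch(cluster_config, "unknown_rule")
```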
-
BALSAMIC.utils.rule.
get_variant_callers
(config, mutation_type: str, mutation_class: str, analysis_type: str, workflow_solution: str)[source]¶ Get list of variant callers for a given list of input
Parameters: - config – A validated dictionary of case_config
- mutation_type – A mutation type string, e.g. SNV
- mutation_class – A mutation class string, e.g. somatic
- analysis_type – An analysis type string, e.g. paired
- workflow_solution – A workflow type string, e.g. BALSAMIC
Returns: A list of variant caller names extracted from config
Raises: WorkflowRunError – If mutation_type, mutation_class, analysis_type, or workflow_solution does not have a valid value