References and other resources

Main resources including knowledge base and databases necessary for pipeline development

  1. MSK-Impact pipeline:

  2. TCGA:

  3. COSMIC:

  4. dbSNP: Database of single nucleotide polymorphisms (SNPs) and multiple small-scale variations that include insertions/deletions, microsatellites, and non-polymorphic variants. Download link:

  5. ClinVar: ClinVar aggregates information about genomic variation and its relationship to human health. Download link:

  6. SweGen: This dataset contains whole-genome variant frequencies for 1000 Swedish individuals generated within the SweGen project. Download link:

  7. ExAC: The Exome Aggregation Consortium (ExAC) is a coalition of investigators seeking to aggregate and harmonize exome sequencing data from a wide variety of large-scale sequencing projects, and to make summary data available for the wider scientific community. Download link:

  8. GTEx: The Genotype-Tissue Expression (GTEx) project aims to provide to the scientific community a resource with which to study human gene expression and regulation and its relationship to genetic variation. Download URL by applying through:

  9. OMIM: OMIM®, Online Mendelian Inheritance in Man®, An Online Catalog of Human Genes and Genetic Disorders. Download link: (registration required)

  10. Drug resistance: An effort by COSMIC to annotate mutations identified in the literature as resistance mutations, including those conferring acquired resistance (after treatment) and intrinsic resistance (before treatment). Available through COSMIC:

  11. Mutational signatures: Signatures of Mutational Processes in Human Cancer. Available through COSMIC:

  12. DGVa: The Database of Genomic Variants archive (DGVa) is a repository that provides archiving, accessioning and distribution of publicly available genomic structural variants, in all species.

  13. Cancer genomics workflow: MGI’s CWL Cancer Pipelines.

  14. GIAB: The priority of GIAB is authoritative characterization of human genomes for use in analytical validation and technology development, optimization, and demonstration.

  15. dbNSFP: dbNSFP is a database developed for functional prediction and annotation of all potential non-synonymous single-nucleotide variants (nsSNVs) in the human genome.

  16. 1000Genomes: The goal of the 1000 Genomes Project was to find most genetic variants with frequencies of at least 1% in the populations studied. Download link:

  17. HapMap3: The International HapMap Project was an organization that aimed to develop a haplotype map (HapMap) of the human genome, to describe the common patterns of human genetic variation. HapMap 3 is the third phase of the International HapMap project. Download link:

  18. GRCh38.p11: GRCh38.p11 is the eleventh patch release for the GRCh38 (human) reference assembly. Download link:

  19. dbVar: dbVar is NCBI’s database of genomic structural variation: insertions, deletions, duplications, inversions, mobile element insertions, translocations, and complex chromosomal rearrangements. Download link:

  20. Drug sensitivity in cancer: Identifying molecular features of cancers that predict response to anti-cancer drugs. Download link:

  21. VarSome: VarSome is a knowledge base and aggregator for human genomic variants.

  22. CADD: CADD is a tool for scoring the deleteriousness of single nucleotide variants as well as insertion/deletion variants in the human genome. CADD can quantitatively prioritize functional, deleterious, and disease-causal variants across a wide range of functional categories, effect sizes and genetic architectures, and can be used to prioritize causal variation in both research and clinical settings.

Sample datasets

  1. TCRB: The Texas Cancer Research Biobank (TCRB) was created to bridge the gap between doctors and scientific researchers to improve the prevention, diagnosis and treatment of cancer. This work occurred with funding from the Cancer Prevention & Research Institute of Texas (CPRIT) from 2010-2014. Article:

Relevant publications

Including methodological benchmarking


    • Original pipeline: Cheng, D. T., Mitchell, T. N., Zehir, A., Shah, R. H., Benayed, R., Syed, A., … Berger, M. F. (2015). Memorial Sloan Kettering-Integrated Mutation Profiling of Actionable Cancer Targets (MSK-IMPACT): A hybridization capture-based next-generation sequencing clinical assay for solid tumor molecular oncology. Journal of Molecular Diagnostics, 17(3), 251–264.

    • Case study: Cheng, D. T., Prasad, M., Chekaluk, Y., Benayed, R., Sadowska, J., Zehir, A., … Zhang, L. (2017). Comprehensive detection of germline variants by MSK-IMPACT, a clinical diagnostic platform for solid tumor molecular oncology and concurrent cancer predisposition testing. BMC Medical Genomics, 10(1), 33.

    • Case study: Zehir, A., Benayed, R., Shah, R. H., Syed, A., Middha, S., Kim, H. R., … Berger, M. F. (2017). Mutational landscape of metastatic cancer revealed from prospective clinical sequencing of 10,000 patients. Nature Medicine, 23(6), 703–713.

  2. Application of MSK-IMPACT: Zehir, A., Benayed, R., Shah, R. H., Syed, A., Middha, S., Kim, H. R., … Berger, M. F. (2017). Mutational landscape of metastatic cancer revealed from prospective clinical sequencing of 10,000 patients. Nature Medicine, 23(6), 703–713.

  3. Review on bioinformatic pipelines: Leipzig, J. (2017). A review of bioinformatic pipeline frameworks. Briefings in Bioinformatics, 18(3), 530–536.

  4. Mutational signature reviews:

    • Helleday, T., Eshtad, S., & Nik-Zainal, S. (2014). Mechanisms underlying mutational signatures in human cancers. Nature Reviews Genetics, 15(9), 585–598.

    • Alexandrov, L. B., & Stratton, M. R. (2014). Mutational signatures: The patterns of somatic mutations hidden in cancer genomes. Current Opinion in Genetics and Development, 24(1), 52–60.

  5. Review on structural variation detection tools:

    • Lin, K., Bonnema, G., Sanchez-Perez, G., & De Ridder, D. (2014). Making the difference: Integrating structural variation detection tools. Briefings in Bioinformatics, 16(5), 852–864.

    • Tattini, L., D’Aurizio, R., & Magi, A. (2015). Detection of Genomic Structural Variants from Next-Generation Sequencing Data. Frontiers in Bioengineering and Biotechnology, 3(June), 1–8.

  6. Two case studies and a pipeline (unpublished): Noll, A. C., Miller, N. A., Smith, L. D., Yoo, B., Fiedler, S., Cooley, L. D., … Kingsmore, S. F. (2016). Clinical detection of deletion structural variants in whole-genome sequences. Npj Genomic Medicine, 1(1), 16026.

  7. Review on driver gene methods: Tokheim, C. J., Papadopoulos, N., Kinzler, K. W., Vogelstein, B., & Karchin, R. (2016). Evaluating the evaluation of cancer driver genes. Proceedings of the National Academy of Sciences, 113(50), 14330–14335.

  8. Detection of IGH::DUX4 rearrangement: Rezayee, F., Eisfeldt, J., Skaftason, A., Öfverholm, I., Sayyab, S., Syvänen, A. C., … & Barbany, G. (2023). Feasibility to use whole-genome sequencing as a sole diagnostic method to detect genomic aberrations in pediatric B-cell acute lymphoblastic leukemia. Frontiers in Oncology, 13, 1217712.

Resource, or general notable papers including resource and KB papers related to cancer genomics

  1. GIAB: Zook, J. M., Catoe, D., McDaniel, J., Vang, L., Spies, N., Sidow, A., … Salit, M. (2016). Extensive sequencing of seven human genomes to characterize benchmark reference materials. Scientific Data, 3, 160025.

Methods and tools

Excluding multiple method comparison or benchmarking tools

  • BreakDancer: Chen, K., Wallis, J. W., McLellan, M. D., Larson, D. E., Kalicki, J. M., Pohl, C. S., … Mardis, E. R. (2009). BreakDancer: An algorithm for high-resolution mapping of genomic structural variation. Nature Methods, 6(9), 677–681.

  • Pindel: Ye, K., Schulz, M. H., Long, Q., Apweiler, R., & Ning, Z. (2009). Pindel: A pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics, 25(21), 2865–2871.

  • SVDetect: Zeitouni, B., Boeva, V., Janoueix-Lerosey, I., Loeillet, S., Legoix-né, P., Nicolas, A., … Barillot, E. (2010). SVDetect: A tool to identify genomic structural variations from paired-end and mate-pair sequencing data. Bioinformatics, 26(15), 1895–1896.

  • PurityEst: Su, X., Zhang, L., Zhang, J., Meric-Bernstam, F., & Weinstein, J. N. (2012). PurityEst: Estimating purity of human tumor samples using next-generation sequencing data. Bioinformatics, 28(17), 2265–2266.

  • PurBayes: Larson, N. B., & Fridley, B. L. (2013). PurBayes: Estimating tumor cellularity and subclonality in next-generation sequencing data. Bioinformatics, 29(15), 1888–1889.

  • ANNOVAR: Wang, K., Li, M., & Hakonarson, H. (2010). ANNOVAR: Functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Research, 38(16), 1–7.

  • ASCAT: Van Loo, P., Nordgard, S. H., Lingjaerde, O. C., Russnes, H. G., Rye, I. H., Sun, W., … Kristensen, V. N. (2010). Allele-specific copy number analysis of tumors. Proceedings of the National Academy of Sciences, 107(39), 16910–16915.

  • Treeomics: Reiter, J. G., Makohon-Moore, A. P., Gerold, J. M., Bozic, I., Chatterjee, K., Iacobuzio-Donahue, C. A., … Nowak, M. A. (2017). Reconstructing metastatic seeding patterns of human cancers. Nature Communications, 8, 14114.

  • deconstructSigs: Rosenthal, R., McGranahan, N., Herrero, J., Taylor, B. S., & Swanton, C. (2016). deconstructSigs: delineating mutational processes in single tumors distinguishes DNA repair deficiencies and patterns of carcinoma evolution. Genome Biology, 17(1), 31.

  • MutationalPatterns: Blokzijl, F., Janssen, R., van Boxtel, R., & Cuppen, E. (2017). MutationalPatterns: comprehensive genome-wide analysis of mutational processes. bioRxiv, 1–20.

  • MaSuRCA: Zimin, A. V., Marçais, G., Puiu, D., Roberts, M., Salzberg, S. L., & Yorke, J. A. (2013). The MaSuRCA genome assembler. Bioinformatics, 29(21), 2669–2677.

  • VarDict: Lai, Z., Markovets, A., Ahdesmaki, M., Chapman, B., Hofmann, O., McEwen, R., … Dry, J. R. (2016). VarDict: A novel and versatile variant caller for next-generation sequencing in cancer research. Nucleic Acids Research, 44(11), 1–11.

  • vt: Tan, A., Abecasis, G. R., & Kang, H. M. (2015). Unified representation of genetic variants. Bioinformatics, 31(13), 2202–2204.

  • peddy: Pedersen, B. S., & Quinlan, A. R. (2017). Who’s Who? Detecting and Resolving Sample Anomalies in Human DNA Sequencing Studies with Peddy. American Journal of Human Genetics, 100(3), 406–413.

  • GQT: Layer, R. M., Kindlon, N., Karczewski, K. J., & Quinlan, A. R. (2015). Efficient genotype compression and analysis of large genetic-variation data sets. Nature Methods, 13(1).

Tool sets and software required at various steps of pipeline development

  1. FastQC: Quality control tool.

  2. Cutadapt: Adapter removal tool.

  3. Trim Galore!: FastQC and Cutadapt wrapper.

  4. Picardtools: BAM/SAM/VCF/CRAM manipulator.

    • MarkDuplicates: Marks duplicate reads and optionally removes them

    • LiftoverVcf: Lifts over a VCF between reference builds

    • CollectHsMetrics: Collects hybrid-selection (HS) metrics for a SAM or BAM file

    • CollectAlignmentSummaryMetrics: Produces a summary of alignment metrics from a SAM or BAM file

    • CollectGcBiasMetrics: Collect metrics regarding GC bias

    • CollectWgsMetrics: Collect metrics about coverage and performance of whole genome sequencing (WGS) experiments

  5. GATK: A variant discovery tool:

    • BaseRecalibrator: Detect systematic error in base quality score

    • Somatic Indel Realigner: Local Realignment around Indels

    • ContEst: Estimate cross sample contamination

    • DepthOfCoverage: Assess sequence coverage by sample, read group, or libraries

    • DuplicateReadFilter: Filters out reads flagged as duplicates by MarkDuplicates

  6. Samtools: Reading/writing/editing/indexing/viewing SAM/BAM/CRAM format

  7. Sambamba: Tools for working with SAM/BAM/CRAM data

  8. bcftools: Reading/writing BCF2/VCF/gVCF files and calling/filtering/summarising SNP and short indel sequence variants

  9. vcftools: VCFtools is a program package designed for working with VCF files, such as those generated by the 1000 Genomes Project.

  10. Delly2: An integrated structural variant prediction method that can discover, genotype and visualize deletions, tandem duplications, inversions and translocations

  11. PLINK: Whole-genome data analysis toolset

  12. freebayes: a haplotype-based variant detector.

  13. AscatNGS: Allele-Specific Copy Number Analysis of Tumors, tumor purity and ploidy

  14. MutationalPatterns: R package for extracting and visualizing mutational patterns in base substitution catalogues

  15. deconstructSigs: identification of mutational signatures within a single tumor sample

  16. treeOmics: Decrypting somatic mutation patterns to reveal the evolution of cancer

  17. controlFreeC: Copy number and allelic content caller

  18. MuTect2: Call somatic SNPs and indels via local re-assembly of haplotypes

  19. Annovar: annotation of detected genetic variation

  20. Strelka: Small variant caller

  21. Manta: Structural variant caller

  22. PurBayes: estimate tumor purity and clonality

  23. VarDict: variant caller for both single and paired sample variant calling from BAM files

  24. SnpEff/SnpSift: Genomic variant annotations and functional effect prediction toolbox.

  25. IGV: visualization tool for interactive exploration

  26. SVDetect: a tool to detect genomic structural variations

  27. GenomeSTRiP: A suite of tools for discovering and genotyping structural variations using sequencing data

  28. BreakDancer: SV detection from paired end reads mapping

  29. Pindel: Detect breakpoints of large deletions, medium sized insertions, inversions, and tandem duplications

  30. VarScan: Variant calling and somatic mutation/CNV detection

  31. VEP: Variant Effect Predictor

  32. Probabilistic2020: Simulates somatic mutations, and calls statistically significant oncogenes and tumor suppressor genes based on a randomization-based test

  33. 2020plus: Classifies genes as an oncogene, tumor suppressor gene, or as a non-driver gene by using Random Forests

  34. vtools: variant tools is a software tool for the manipulation, annotation, selection, simulation, and analysis of variants in the context of next-gen sequencing analysis.

  35. vt: A variant tool set that discovers short variants from Next Generation Sequencing data.

  36. CNVnator: a tool for CNV discovery and genotyping from depth-of-coverage by mapped reads.

  37. CNVpytor: a tool for copy number variation detection and analysis from read depth and allele imbalance in whole-genome sequencing.

  38. SvABA: Structural variation and indel detection by local assembly.

  39. indelope: find indels and SVs too small for structural variant callers and too large for GATK.

  40. peddy: peddy compares familial-relationships and sexes as reported in a PED/FAM file with those inferred from a VCF.

  41. cyvcf2: cyvcf2 is a cython wrapper around htslib built for fast parsing of Variant Call Format (VCF) files.

  42. GQT: Genotype Query Tools (GQT) is command line software and a C API for indexing and querying large-scale genotype data sets.

  43. LOFTEE: Loss-Of-Function Transcript Effect Estimator. A VEP plugin to identify LoF (loss-of-function) variation. Assesses variants that are: Stop-gained, Splice site disrupting, and Frameshift variants.

  44. PureCN: copy number calling and SNV classification using targeted short read sequencing

  45. SVCaller: A structural variant caller.

  46. SnakeMake: A workflow manager.

  47. BWA: BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM.

  48. wgsim: Wgsim is a small tool for simulating sequence reads from a reference genome. It is able to simulate diploid genomes with SNPs and insertion/deletion (INDEL) polymorphisms, and simulate reads with uniform substitution sequencing errors.

  49. dwgsim: Whole genome simulation can be performed with dwgsim. dwgsim is based off of wgsim found in SAMtools.

  50. THetA: Tumor Heterogeneity Analysis. This algorithm estimates tumor purity and clonal/subclonal copy number aberrations directly from high-throughput DNA sequencing data.

  51. Skewer: Adapter trimming, similar to cutadapt.

  52. Phylowgs: Application for inferring subclonal composition and evolution from whole-genome sequencing data.

  53. superFreq: SuperFreq is an R package that analyses cancer exomes to track subclones.

  54. readVCF-r: Read VCFs into R and annotate them.

  55. vcfr: Read VCFs into R.

  56. msisensor: microsatellite instability detection using paired tumor-normal

  57. MOSAIC: MicrOSAtellite Instability Classifier

  58. MANTIS: Microsatellite Analysis for Normal-Tumor InStability

  59. SVDB: A toolkit for constructing and querying structural variant databases
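
Several entries above (MutationalPatterns, deconstructSigs) operate on a base substitution catalogue, that is, a count of single-nucleotide variants per substitution class and trinucleotide context. The following is a minimal sketch of how such a catalogue could be built, assuming the common convention of reporting each substitution on the pyrimidine (C or T) strand. The function names `sbs96_channel` and `build_catalogue` and the `A[C>T]G` label format are illustrative assumptions, not the API of either package.

```python
from collections import Counter

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def sbs96_channel(context: str, alt: str) -> str:
    """Label one SNV with its pyrimidine-centred trinucleotide channel.

    context: three reference bases with the mutated base in the middle.
    alt:     the alternate base observed at the middle position.
    Returns a label such as 'A[C>T]G' (hypothetical label format).
    """
    context, alt = context.upper(), alt.upper()
    if len(context) != 3 or len(alt) != 1:
        raise ValueError("expected a 3-base context and a 1-base alt allele")
    if not set(context + alt) <= set("ACGT"):
        raise ValueError("only unambiguous bases A, C, G, T are supported")
    ref = context[1]
    if ref == alt:
        raise ValueError("alt allele must differ from the reference base")
    # Purine reference bases (A, G) are flipped to the opposite strand so that
    # every substitution is expressed with a C or T reference base.
    if ref in "AG":
        context = reverse_complement(context)
        alt = reverse_complement(alt)
        ref = context[1]
    return f"{context[0]}[{ref}>{alt}]{context[2]}"


def build_catalogue(snvs) -> Counter:
    """Count SNVs per channel; snvs is an iterable of (context, alt) pairs."""
    return Counter(sbs96_channel(context, alt) for context, alt in snvs)
```

For example, `build_catalogue([("ACG", "T"), ("CGT", "A")])` places both variants in the same channel, because a G>A change in the context CGT is the reverse-strand view of a C>T change in ACG. In practice the trinucleotide context would be looked up in the reference genome from the VCF position; that step is omitted here.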