===============
Other resources
===============

Resources
---------

*Main resources, including knowledge bases and databases necessary for pipeline development*

#. **MSK-IMPACT pipeline**\ : https://www.mskcc.org/msk-impact
#. **TCGA**\ : https://cancergenome.nih.gov/
#. **COSMIC**\ : http://cancer.sanger.ac.uk/cosmic
#. **dbSNP**\ : Database of single nucleotide polymorphisms (SNPs) and multiple small-scale variations that include insertions/deletions, microsatellites, and non-polymorphic variants.
   https://www.ncbi.nlm.nih.gov/snp/
   Download link: ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b150_GRCh38p7/VCF/All_20170710.vcf.gz
#. **ClinVar**\ : ClinVar aggregates information about genomic variation and its relationship to human health.
   https://www.ncbi.nlm.nih.gov/clinvar/
   Download link: ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20171029.vcf.gz
#. **ExAC**\ : The Exome Aggregation Consortium (ExAC) is a coalition of investigators seeking to aggregate and harmonize exome sequencing data from a wide variety of large-scale sequencing projects, and to make summary data available for the wider scientific community.
   http://exac.broadinstitute.org/
   Download link: ftp://ftp.broadinstitute.org/pub/ExAC_release/release1/ExAC.r1.sites.vep.vcf.gz
#. **GTEx**\ : The Genotype-Tissue Expression (GTEx) project aims to provide the scientific community with a resource for studying human gene expression and regulation and its relationship to genetic variation.
   https://gtexportal.org/static/
   Download access requires applying through: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000424.v6.p1
#. **OMIM**\ : OMIM®, Online Mendelian Inheritance in Man®, an online catalog of human genes and genetic disorders.
   https://www.omim.org/
   Download link: https://omim.org/downloads/ (registration required)
#. **Drug resistance**\ : An effort by COSMIC to annotate mutations identified in the literature as resistance mutations, including those conferring acquired resistance (after treatment) and intrinsic resistance (before treatment).
   Available through COSMIC: http://cancer.sanger.ac.uk/cosmic/drug_resistance
#. **Mutational signatures**\ : Signatures of mutational processes in human cancer.
   Available through COSMIC: http://cancer.sanger.ac.uk/cosmic/signatures
#. **DGVa**\ : The Database of Genomic Variants archive (DGVa) is a repository that provides archiving, accessioning and distribution of publicly available genomic structural variants, in all species.
   https://www.ebi.ac.uk/dgva
#. **Cancer genomics workflow**\ : MGI's CWL Cancer Pipelines.
   https://github.com/genome/cancer-genomics-workflow/wiki
#. **GIAB**\ : The priority of GIAB is authoritative characterization of human genomes for use in analytical validation and technology development, optimization, and demonstration.
   http://jimb.stanford.edu/giab/ and https://github.com/genome-in-a-bottle
   Download links: http://jimb.stanford.edu/giab-resources
#. **dbNSFP**\ : dbNSFP is a database developed for functional prediction and annotation of all potential non-synonymous single-nucleotide variants (nsSNVs) in the human genome.
   https://sites.google.com/site/jpopgen/dbNSFP
#. **1000Genomes**\ : The goal of the 1000 Genomes Project was to find most genetic variants with frequencies of at least 1% in the populations studied.
   http://www.internationalgenome.org/
   Download link: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/
#. **HapMap3**\ : The International HapMap Project was an organization that aimed to develop a haplotype map (HapMap) of the human genome, to describe the common patterns of human genetic variation. HapMap 3 is the third phase of the International HapMap project.
   http://www.sanger.ac.uk/resources/downloads/human/hapmap3.html
   Download link: ftp://ftp.ncbi.nlm.nih.gov/hapmap/
#. **GRCh38.p11**\ : GRCh38.p11 is the eleventh patch release for the GRCh38 (human) reference assembly.
   https://www.ncbi.nlm.nih.gov/grc/human
   Download link: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/
#. **dbVar**\ : dbVar is NCBI's database of genomic structural variation: insertions, deletions, duplications, inversions, mobile element insertions, translocations, and complex chromosomal rearrangements.
   https://www.ncbi.nlm.nih.gov/dbvar
   Download link: https://www.ncbi.nlm.nih.gov/dbvar/content/ftp_manifest/
#. **Drug sensitivity in cancer**\ : Identifying molecular features of cancers that predict response to anti-cancer drugs.
   http://www.cancerrxgene.org/
   Download link: ftp://ftp.sanger.ac.uk/pub4/cancerrxgene/releases
#. **VarSome**\ : VarSome is a knowledge base and aggregator for human genomic variants.
   https://varsome.com/about/
#. **Google Genomics Public Data**\ : Google Genomics helps the life science community organize the world’s genomic information and make it accessible and useful.
   http://googlegenomics.readthedocs.io

Sample datasets
---------------

#. **TCRB**\ : The Texas Cancer Research Biobank (TCRB) was created to bridge the gap between doctors and scientific researchers to improve the prevention, diagnosis and treatment of cancer. This work occurred with funding from the Cancer Prevention & Research Institute of Texas (CPRIT) from 2010-2014.
   http://txcrb.org/data.html
   Article: https://www.nature.com/articles/sdata201610

Relevant publications
---------------------

*Including methodological benchmarking*

#. **MSK-IMPACT:**

   * **Original pipeline**\ : Cheng, D. T., Mitchell, T. N., Zehir, A., Shah, R. H., Benayed, R., Syed, A., … Berger, M. F. (2015). Memorial Sloan Kettering-Integrated Mutation Profiling of Actionable Cancer Targets (MSK-IMPACT): A hybridization capture-based next-generation sequencing clinical assay for solid tumor molecular oncology. Journal of Molecular Diagnostics, 17(3), 251–264. https://doi.org/10.1016/j.jmoldx.2014.12.006
   * **Case study**\ : Cheng, D. T., Prasad, M., Chekaluk, Y., Benayed, R., Sadowska, J., Zehir, A., … Zhang, L. (2017). Comprehensive detection of germline variants by MSK-IMPACT, a clinical diagnostic platform for solid tumor molecular oncology and concurrent cancer predisposition testing. BMC Medical Genomics, 10(1), 33. https://doi.org/10.1186/s12920-017-0271-4
   * **Case study**\ : Zehir, A., Benayed, R., Shah, R. H., Syed, A., Middha, S., Kim, H. R., … Berger, M. F. (2017). Mutational landscape of metastatic cancer revealed from prospective clinical sequencing of 10,000 patients. Nature Medicine, 23(6), 703–713. https://doi.org/10.1038/nm.4333

#. **Application of MSK-IMPACT:** Zehir, A., Benayed, R., Shah, R. H., Syed, A., Middha, S., Kim, H. R., … Berger, M. F. (2017). Mutational landscape of metastatic cancer revealed from prospective clinical sequencing of 10,000 patients. Nature Medicine, 23(6), 703–713. https://doi.org/10.1038/nm.4333
#. **Review on bioinformatic pipelines:** Leipzig, J. (2017). A review of bioinformatic pipeline frameworks. Briefings in Bioinformatics, 18(3), 530–536. https://doi.org/10.1093/bib/bbw020
#. **Mutational signature reviews:**

   * Helleday, T., Eshtad, S., & Nik-Zainal, S. (2014). Mechanisms underlying mutational signatures in human cancers. Nature Reviews Genetics, 15(9), 585–598. https://doi.org/10.1038/nrg3729
   * Alexandrov, L. B., & Stratton, M. R. (2014). Mutational signatures: The patterns of somatic mutations hidden in cancer genomes. Current Opinion in Genetics and Development, 24(1), 52–60. https://doi.org/10.1016/j.gde.2013.11.01

#. **Review on structural variation detection tools**\ :

   * Lin, K., Bonnema, G., Sanchez-Perez, G., & De Ridder, D. (2014). Making the difference: Integrating structural variation detection tools. Briefings in Bioinformatics, 16(5), 852–864. https://doi.org/10.1093/bib/bbu047
   * Tattini, L., D’Aurizio, R., & Magi, A. (2015). Detection of Genomic Structural Variants from Next-Generation Sequencing Data. Frontiers in Bioengineering and Biotechnology, 3(June), 1–8. https://doi.org/10.3389/fbioe.2015.00092

#. **Two case studies and a pipeline (unpublished)**\ : Noll, A. C., Miller, N. A., Smith, L. D., Yoo, B., Fiedler, S., Cooley, L. D., … Kingsmore, S. F. (2016). Clinical detection of deletion structural variants in whole-genome sequences. Npj Genomic Medicine, 1(1), 16026. https://doi.org/10.1038/npjgenmed.2016.26
#. **Review on driver gene methods**\ : Tokheim, C. J., Papadopoulos, N., Kinzler, K. W., Vogelstein, B., & Karchin, R. (2016). Evaluating the evaluation of cancer driver genes. Proceedings of the National Academy of Sciences, 113(50), 14330–14335. https://doi.org/10.1073/pnas.1616440113

*Resource or general notable papers, including resource and KB papers related to cancer genomics*

#. **GIAB**\ : Zook, J. M., Catoe, D., McDaniel, J., Vang, L., Spies, N., Sidow, A., … Salit, M. (2016). Extensive sequencing of seven human genomes to characterize benchmark reference materials. Scientific Data, 3, 160025. https://doi.org/10.1038/sdata.2016.25

Methods and tools
-----------------

*Excluding multiple method comparison or benchmarking tools*

* **BreakDancer**\ : Chen, K., Wallis, J. W., McLellan, M. D., Larson, D. E., Kalicki, J. M., Pohl, C. S., … Mardis, E. R. (2009). BreakDancer: An algorithm for high-resolution mapping of genomic structural variation. Nature Methods, 6(9), 677–681. https://doi.org/10.1038/nmeth.1363
* **Pindel**\ : Ye, K., Schulz, M. H., Long, Q., Apweiler, R., & Ning, Z. (2009). Pindel: A pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics, 25(21), 2865–2871. https://doi.org/10.1093/bioinformatics/btp394
* **SVDetect**\ : Zeitouni, B., Boeva, V., Janoueix-Lerosey, I., Loeillet, S., Legoix-né, P., Nicolas, A., … Barillot, E. (2010). SVDetect: A tool to identify genomic structural variations from paired-end and mate-pair sequencing data. Bioinformatics, 26(15), 1895–1896. https://doi.org/10.1093/bioinformatics/btq293
* **Purityest**\ : Su, X., Zhang, L., Zhang, J., Meric-Bernstam, F., & Weinstein, J. N. (2012). Purityest: Estimating purity of human tumor samples using next-generation sequencing data. Bioinformatics, 28(17), 2265–2266. https://doi.org/10.1093/bioinformatics/bts365
* **PurBayes**\ : Larson, N. B., & Fridley, B. L. (2013). PurBayes: Estimating tumor cellularity and subclonality in next-generation sequencing data. Bioinformatics, 29(15), 1888–1889. https://doi.org/10.1093/bioinformatics/btt293
* **ANNOVAR**\ : Wang, K., Li, M., & Hakonarson, H. (2010). ANNOVAR: Functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Research, 38(16), 1–7. https://doi.org/10.1093/nar/gkq603
* **ASCAT**\ : Van Loo, P., Nordgard, S. H., Lingjaerde, O. C., Russnes, H. G., Rye, I. H., Sun, W., … Kristensen, V. N. (2010). Allele-specific copy number analysis of tumors. Proceedings of the National Academy of Sciences, 107(39), 16910–16915. https://doi.org/10.1073/pnas.1009843107
* **Treeomics**\ : Reiter, J. G., Makohon-Moore, A. P., Gerold, J. M., Bozic, I., Chatterjee, K., Iacobuzio-Donahue, C. A., … Nowak, M. A. (2017). Reconstructing metastatic seeding patterns of human cancers. Nature Communications, 8, 14114. https://doi.org/10.1038/ncomms14114
* **deconstructSigs**\ : Rosenthal, R., McGranahan, N., Herrero, J., Taylor, B. S., & Swanton, C. (2016). deconstructSigs: delineating mutational processes in single tumors distinguishes DNA repair deficiencies and patterns of carcinoma evolution. Genome Biology, 17(1), 31. https://doi.org/10.1186/s13059-016-0893-4
* **MutationalPatterns**\ : Blokzijl, F., Janssen, R., van Boxtel, R., & Cuppen, E. (2017). MutationalPatterns: comprehensive genome-wide analysis of mutational processes. bioRxiv, 1–20. https://doi.org/10.1101/071761
* **MaSuRCA**\ : Zimin, A. V., Marçais, G., Puiu, D., Roberts, M., Salzberg, S. L., & Yorke, J. A. (2013). The MaSuRCA genome assembler. Bioinformatics, 29(21), 2669–2677. https://doi.org/10.1093/bioinformatics/btt476
* **VarDict**\ : Lai, Z., Markovets, A., Ahdesmaki, M., Chapman, B., Hofmann, O., McEwen, R., … Dry, J. R. (2016). VarDict: A novel and versatile variant caller for next-generation sequencing in cancer research. Nucleic Acids Research, 44(11), 1–11. https://doi.org/10.1093/nar/gkw227
* **vt**\ : Tan, A., Abecasis, G. R., & Kang, H. M. (2015). Unified representation of genetic variants. Bioinformatics, 31(13), 2202–2204. https://doi.org/10.1093/bioinformatics/btv112
* **peddy**\ : Pedersen, B. S., & Quinlan, A. R. (2017). Who’s Who? Detecting and Resolving Sample Anomalies in Human DNA Sequencing Studies with Peddy. American Journal of Human Genetics, 100(3), 406–413. https://doi.org/10.1016/j.ajhg.2017.01.017
* **GQT**\ : Layer, R. M., Kindlon, N., Karczewski, K. J., & Quinlan, A. R. (2015). Efficient genotype compression and analysis of large genetic-variation data sets. Nature Methods, 13(1). https://doi.org/10.1038/nmeth.3654

*Tool sets and software required at various steps of pipeline development*

#. **Teaser**\ : NGS read-mapping benchmarking.

   * http://teaser.cibiv.univie.ac.at/
   * https://github.com/Cibiv/Teaser

#. **FastQC**\ : Quality control tool. https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
#. **Cutadapt**\ : Adapter removal tool. https://cutadapt.readthedocs.io/en/stable/
#. **Trim Galore!**\ : FastQC and Cutadapt wrapper. https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/
#. **Picardtools**\ : BAM/SAM/VCF/CRAM manipulator. http://broadinstitute.github.io/picard/

   * **MarkDuplicates**\ : Marks duplicate reads and optionally removes them
   * **LiftoverVcf**\ : Lifts over a VCF between genome builds
   * **CollectHsMetrics**\ : Collects hybrid-selection (HS) metrics for a SAM or BAM file
   * **CollectAlignmentSummaryMetrics**\ : Produces a summary of alignment metrics from a SAM or BAM file
   * **CollectGcBiasMetrics**\ : Collects metrics regarding GC bias
   * **CollectWgsMetrics**\ : Collects metrics about coverage and performance of whole genome sequencing (WGS) experiments

#. **GATK**\ : A variant discovery tool: https://software.broadinstitute.org/gatk/

   * **BaseRecalibrator**\ : Detects systematic errors in base quality scores
   * **Somatic Indel Realigner**\ : Local realignment around indels
   * **ContEst**\ : Estimates cross-sample contamination
   * **DepthOfCoverage**\ : Assesses sequence coverage by sample, read group, or library
   * **DuplicateReadFilter**\ : Filters out reads flagged as duplicates by MarkDuplicates

#. **Samtools**\ : Reading/writing/editing/indexing/viewing SAM/BAM/CRAM format. http://www.htslib.org/
#. **Sambamba**\ : Tools for working with SAM/BAM/CRAM data. http://lomereiter.github.io/sambamba/
#. **bcftools**\ : Reading/writing BCF2/VCF/gVCF files and calling/filtering/summarising SNP and short indel sequence variants. http://www.htslib.org/doc/bcftools.html
#. **vcftools**\ : VCFtools is a program package designed for working with VCF files, such as those generated by the 1000 Genomes Project. https://vcftools.github.io/index.html
#. **Delly2**\ : An integrated structural variant prediction method that can discover, genotype and visualize deletions, tandem duplications, inversions and translocations. https://github.com/dellytools/delly
#. **PLINK**\ : Whole genome data analysis toolset. https://www.cog-genomics.org/plink2
#. **freebayes**\ : A haplotype-based variant detector. https://github.com/ekg/freebayes
#. **ASCAT**\ : Allele-Specific Copy Number Analysis of Tumors; estimates tumor purity and ploidy. https://github.com/Crick-CancerGenomics/ascat
#. **MutationalPatterns**\ : R package for extracting and visualizing mutational patterns in base substitution catalogues. https://github.com/UMCUGenetics/MutationalPatterns
#. **deconstructSigs**\ : Identification of mutational signatures within a single tumor sample. https://github.com/raerose01/deconstructSigs
#. **Treeomics**\ : Decrypting somatic mutation patterns to reveal the evolution of cancer. https://github.com/johannesreiter/treeomics
#. **Control-FREEC**\ : Copy number and allelic content caller. http://boevalab.com/FREEC/
#. **MuTect2**\ : Calls somatic SNPs and indels via local re-assembly of haplotypes. https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_cancer_m2_MuTect2.php
#. **ANNOVAR**\ : Annotation of detected genetic variation. http://annovar.openbioinformatics.org/en/latest/
#. **Strelka**\ : Small variant caller. https://github.com/Illumina/strelka
#. **Manta**\ : Structural variant caller. https://github.com/Illumina/manta
#. **PurBayes**\ : Estimates tumor purity and clonality.
#. **VarDict**\ : Variant caller for both single and paired sample variant calling from BAM files. https://github.com/AstraZeneca-NGS/VarDict
#. **SnpEff/SnpSift**\ : Genomic variant annotations and functional effect prediction toolbox. http://snpeff.sourceforge.net/ and http://snpeff.sourceforge.net/SnpSift.html
#. **IGV**\ : Visualization tool for interactive exploration. http://software.broadinstitute.org/software/igv/
#. **SVDetect**\ : A tool to detect genomic structural variations. http://svdetect.sourceforge.net/Site/Home.html
#. **GenomeSTRiP**\ : A suite of tools for discovering and genotyping structural variations using sequencing data. http://software.broadinstitute.org/software/genomestrip/
#. **BreakDancer**\ : SV detection from paired-end read mapping. https://github.com/genome/breakdancer
#. **Pindel**\ : Detects breakpoints of large deletions, medium sized insertions, inversions, and tandem duplications. https://github.com/genome/pindel
#. **VarScan**\ : Variant calling and somatic mutation/CNV detection. https://github.com/dkoboldt/varscan
#. **VEP**\ : Variant Effect Predictor. https://www.ensembl.org/info/docs/tools/vep/index.html
#. **Probabilistic2020**\ : Simulates somatic mutations, and calls statistically significant oncogenes and tumor suppressor genes based on a randomization-based test. https://github.com/KarchinLab/probabilistic2020
#. **2020plus**\ : Classifies genes as an oncogene, tumor suppressor gene, or as a non-driver gene by using Random Forests. https://github.com/KarchinLab/2020plus
#. **vtools**\ : variant tools is a software tool for the manipulation, annotation, selection, simulation, and analysis of variants in the context of next-gen sequencing analysis. http://varianttools.sourceforge.net/Main/HomePage
#. **vt**\ : A variant tool set that discovers short variants from Next Generation Sequencing data. https://genome.sph.umich.edu/wiki/Vt and https://github.com/atks/vt
#. **CNVnator**\ : A tool for CNV discovery and genotyping from depth-of-coverage by mapped reads. https://github.com/abyzovlab/CNVnator
#. **SvABA**\ : Structural variation and indel detection by local assembly. https://github.com/walaj/svaba
#. **indelope**\ : Finds indels and SVs too small for structural variant callers and too large for GATK. https://github.com/brentp/indelope
#. **peddy**\ : peddy compares familial relationships and sexes as reported in a PED/FAM file with those inferred from a VCF. https://github.com/brentp/peddy
#. **cyvcf2**\ : cyvcf2 is a cython wrapper around htslib built for fast parsing of Variant Call Format (VCF) files. https://github.com/brentp/cyvcf2
#. **GQT**\ : Genotype Query Tools (GQT) is command line software and a C API for indexing and querying large-scale genotype data sets. https://github.com/ryanlayer/gqt
#. **LOFTEE**\ : Loss-Of-Function Transcript Effect Estimator. A VEP plugin to identify LoF (loss-of-function) variation. Assesses stop-gained, splice-site-disrupting, and frameshift variants. https://github.com/konradjk/loftee
#. **PureCN**\ : Copy number calling and SNV classification using targeted short read sequencing. https://bioconductor.org/packages/release/bioc/html/PureCN.html
#. **SVCaller**\ : A structural variant caller. https://github.com/tomwhi/svcaller
#. **SnakeMake**\ : A workflow manager. http://snakemake.readthedocs.io/en/stable/index.html
#. **BWA**\ : BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM. http://bio-bwa.sourceforge.net/
#. **wgsim**\ : Wgsim is a small tool for simulating sequence reads from a reference genome. It is able to simulate diploid genomes with SNPs and insertion/deletion (INDEL) polymorphisms, and simulate reads with uniform substitution sequencing errors. https://github.com/lh3/wgsim
#. **dwgsim**\ : Whole genome simulation can be performed with dwgsim, which is based on the wgsim tool found in SAMtools. https://github.com/nh13/DWGSIM
#. **ABSOLUTE**\ : ABSOLUTE can estimate purity/ploidy, and from that compute absolute copy-number and mutation multiplicities. http://archive.broadinstitute.org/cancer/cga/absolute
#. **THetA**\ : Tumor Heterogeneity Analysis. This algorithm estimates tumor purity and clonal/subclonal copy number aberrations directly from high-throughput DNA sequencing data. https://github.com/raphael-group/THetA
#. **Skewer**\ : Adapter trimming, similar to Cutadapt. https://github.com/relipmoc/skewer
#. **Phylowgs**\ : Application for inferring subclonal composition and evolution from whole-genome sequencing data. https://github.com/morrislab/phylowgs
#. **superFreq**\ : SuperFreq is an R package that analyses cancer exomes to track subclones. https://github.com/ChristofferFlensburg/superFreq
#. **readVCF-r**\ : Read VCFs into R and annotate them. https://bioconductor.org/packages/release/bioc/html/VariantAnnotation.html
#. **vcfr**\ : Read VCFs into R. https://github.com/knausb/vcfR
#. **msisensor**\ : Microsatellite instability detection using paired tumor-normal samples. https://github.com/ding-lab/msisensor
#. **MOSAIC**\ : MicrOSAtellite Instability Classifier. https://github.com/ronaldhause/mosaic
#. **MANTIS**\ : Microsatellite Analysis for Normal-Tumor InStability. https://github.com/OSU-SRLab/MANTIS