Cook JH*, Melloni GEM*, Gulhan DC, Park PJ**, Haigis KM**. The origins and genetic interactions of KRAS mutations are allele- and tissue-specific [Internet]. Nature Communications 2021;12(1808) Publisher's VersionAbstract
Mutational activation of KRAS promotes the initiation and progression of cancers, especially in the colorectum, pancreas, lung, and blood plasma, with varying prevalence of specific activating missense mutations. Although epidemiological studies connect specific alleles to clinical outcomes, the mechanisms underlying the distinct clinical characteristics of mutant KRAS alleles are unclear. Here, we analyze 13,492 samples from these four tumor types to examine allele- and tissue-specific genetic properties associated with oncogenic KRAS mutations. The prevalence of known mutagenic mechanisms partially explains the observed spectrum of KRAS activating mutations. However, there are substantial differences between the observed and predicted frequencies for many alleles, suggesting that biological selection underlies the tissue-specific frequencies of mutant alleles. Consistent with experimental studies that have identified distinct signaling properties associated with each mutant form of KRAS, our genetic analysis reveals that each KRAS allele is associated with a distinct tissuespecific comutation network. Moreover, we identify tissue-specific genetic dependencies associated with specific mutant KRAS alleles. Overall, this analysis demonstrates that the genetic interactions of oncogenic KRAS mutations are allele- and tissue-specific, underscoring the complexity that drives their clinical consequences.
Bizzotto S*, Dou Y*, Ganz J*, Doan RN, Kwon M, Bohrson CL, Kim SN, Bae T, Abyzov A, Network NIMHBSM, Park PJ**, Walsh CA**. Landmarks of human embryonic development inscribed in somatic mutations [Internet]. Science 2021;371(6535):1249-1253. Publisher's VersionAbstract
Although cell lineage information is fundamental to understanding organismal development, very little direct information is available for humans. We performed high-depth (250×) whole-genome sequencing of multiple tissues from three individuals to identify hundreds of somatic single-nucleotide variants (sSNVs). Using these variants as "endogenous barcodes" in single cells, we reconstructed early embryonic cell divisions. Targeted sequencing of clonal sSNVs in different organs (about 25,000×) and in more than 1000 cortical single cells, as well as single-nucleus RNA sequencing and single-nucleus assay for transposase-accessible chromatin sequencing of ~100,000 cortical single cells, demonstrated asymmetric contributions of early progenitors to extraembryonic tissues, distinct germ layers, and organs. Our data suggest onset of gastrulation at an effective progenitor pool of about 170 cells and about 50 to 100 founders for the forebrain. Thus, mosaic mutations provide a permanent record of human embryonic development at very high resolution.
Kwon M, Lee S, Berselli M, Chu C, Park PJ. BamSnap: a lightweight viewer for sequencing reads in BAM files. Bioinformatics 2021;Abstract
SUMMARY: Despite the improvement in variant detection algorithms, visual inspection of the read-level data remains an essential step for accurate identification of variants in genome analysis. We developed BamSnap, an efficient BAM file viewer utilizing a graphics library and BAM indexing. In contrast to existing viewers, BamSnap can generate high-quality snapshots rapidly, with customized tracks and layout. As an example, we produced read-level images at 1000 genomic loci for >2500 whole-genomes. AVAILABILITY: BamSnap is freely available at SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Rodin RE*, Dou Y*, Kwon M, Sherman MA, D'Gama AM, Doan RN, Rento LM, Girskis KM, Bohrson CL, Kim SN, Nadig A, Luquette LJ, Gulhan DC, Brain Somatic Mosaicism Network BSM, Park PJ**, Walsh CA**. The landscape of somatic mutation in cerebral cortex of autistic and neurotypical individuals revealed by ultra-deep whole-genome sequencing. Nat Neurosci 2021;24(2):176-185.Abstract
We characterize the landscape of somatic mutations-mutations occurring after fertilization-in the human brain using ultra-deep (~250×) whole-genome sequencing of prefrontal cortex from 59 donors with autism spectrum disorder (ASD) and 15 control donors. We observe a mean of 26 somatic single-nucleotide variants per brain present in ≥4% of cells, with enrichment of mutations in coding and putative regulatory regions. Our analysis reveals that the first cell division after fertilization produces ~3.4 mutations, followed by 2-3 mutations in subsequent generations. This suggests that a typical individual possesses ~80 somatic single-nucleotide variants present in ≥2% of cells-comparable to the number of de novo germline mutations per generation-with about half of individuals having at least one potentially function-altering somatic mutation somewhere in the cortex. ASD brains show an excess of somatic mutations in neural enhancer sequences compared with controls, suggesting that mosaic enhancer mutations may contribute to ASD risk.
Jung YL, Kirli K, Alver BH, Park PJ. Resources and challenges for integrative analysis of nuclear architecture data. Curr Opin Genet Dev 2021;67:103-110.Abstract
A large amount of genomic data for profiling three-dimensional genome architecture have accumulated from large-scale consortium projects as well as from individual laboratories. In this review, we summarize recent landmark datasets and collections in the field. We describe the challenges in collection, annotation, and analysis of these data, particularly for integration of sequencing and microscopy data. We introduce efforts from consortia and independent groups to harmonize diverse datasets. As the resolution and throughput of sequencing and imaging technologies continue to increase, more efficient utilization and integration of collected data will be critical for a better understanding of nuclear architecture.
Cortés-Ciriano I, Lee JJK, Xi R, Jain D, Jung YL, Yang L, Gordenin D, Klimczak LJ, Zhang CZ, Pellman DS, Group PCAWGSVW, Park PJ, Consortium PCAWG. Comprehensive analysis of chromothripsis in 2,658 human cancers using whole-genome sequencing [Internet]. Nature Genetics 2020;52(3):331-341. Publisher's VersionAbstract
Chromothripsis is a mutational phenomenon characterized by massive, clustered genomic rearrangements that occurs in cancer and other diseases. Recent studies in selected cancer types have suggested that chromothripsis may be more common than initially inferred from low-resolution copy-number data. Here, as part of the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA), we analyze patterns of chromothripsis across 2,658 tumors from 38 cancer types using whole-genome sequencing data. We find that chromothripsis events are pervasive across cancers, with a frequency of more than 50% in several cancer types. Whereas canonical chromothripsis profiles display oscillations between two copy-number states, a considerable fraction of events involve multiple chromosomes and additional structural alterations. In addition to non-homologous end joining, we detect signatures of replication-associated processes and templated insertions. Chromothripsis contributes to oncogene amplification and to inactivation of genes such as mismatch-repair-related genes. These findings show that chromothripsis is a major process that drives genome evolution in human cancer.
Dou Y, Kwon M, Rodin RE, Cortés-Ciriano I, Doan R, J. Luquette L, Galor A, Bohrson C, Walsh CA, Park PJ. Accurate detection of mosaic variants in sequencing data without matched controls [Internet]. Nature Biotechnology 2020;38(3):314-319. Publisher's VersionAbstract

Detection of mosaic mutations that arise in normal development is challenging, as such mutations are typically present in only a minute fraction of cells and there is no clear matched control for removing germline variants and systematic artifacts. We present MosaicForecast, a machine-learning method that leverages read-based phasing and read-level features to accurately detect mosaic single-nucleotide variants and indels, achieving a multifold increase in specificity compared with existing algorithms. Using single-cell sequencing and targeted sequencing, we validated 80–90{\%} of the mosaic single-nucleotide variants and 60–80{\%} of indels detected in human brain whole-genome sequencing data. Our method should help elucidate the contribution of mosaic somatic mutations to the origin and development of disease.

Jain D, Chu C, Alver BH, Lee S, Lee EA, Park PJ. HiTea: a computational pipeline to identify non-reference transposable element insertions in Hi-C data. Bioinformatics 2020;Abstract
Hi-C is a common technique for assessing three-dimensional chromatin conformation. Recent studies have shown that long-range interaction information in Hi-C data can be used to generate chromosome-length genome assemblies and identify large-scale structural variations. Here, we demonstrate the use of Hi-C data in detecting mobile transposable element (TE) insertions genome-wide. Our pipeline HiTea (Hi-C based Transposable element analyzer) capitalizes on clipped Hi-C reads and is aided by a high proportion of discordant read pairs in Hi-C data to detect insertions of three major families of active human TEs. Despite the uneven genome coverage in Hi-C data, HiTea is competitive with the existing callers based on whole genome sequencing (WGS) data and can supplement the WGS-based characterization of the TE insertion landscape. We employ the pipeline to identify TE insertions from human cell-line Hi-C samples. HiTea is available at and as a Docker image. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Yun JW, Yang L, Park H-Y, Lee C-W, Cha H, Shin H-T, Noh K-W, Choi Y-L, Park W-Y**, Park PJ**. Dysregulation of cancer genes by recurrent intergenic fusions. Genome Biol 2020;21(1):166.Abstract
BACKGROUND: Gene fusions have been studied extensively, as frequent drivers of tumorigenesis as well as potential therapeutic targets. In many well-known cases, breakpoints occur at two intragenic positions, leading to in-frame gene-gene fusions that generate chimeric mRNAs. However, fusions often occur with intergenic breakpoints, and the role of such fusions has not been carefully examined. RESULTS: We analyze whole-genome sequencing data from 268 patients to catalog gene-intergenic and intergenic-intergenic fusions and characterize their impact. First, we discover that, in contrast to the common assumption, chimeric oncogenic transcripts-such as those involving ETV4, ERG, RSPO3, and PIK3CA-can be generated by gene-intergenic fusions through splicing of the intervening region. Second, we find that over-expression of an upstream or downstream gene by a fusion-mediated repositioning of a regulatory sequence is much more common than previously suspected, with enhancers sometimes located megabases away. We detect a number of recurrent fusions, such as those involving ANO3, RGS9, FUT5, CHI3L1, OR1D4, and LIPG in breast; IGF2 in colon; ETV1 in prostate; and IGF2BP3 and SIX2 in thyroid cancers. CONCLUSION: Our findings elucidate the potential oncogenic function of intergenic fusions and highlight the wide-ranging consequences of structural rearrangements in cancer genomes.
Chu C, Zhao B, Park PJ, Lee EA. Identification and Genotyping of Transposable Element Insertions From Genome Sequencing Data. Curr Protoc Hum Genet 2020;107(1):e102.Abstract
Transposable element (TE) mobilization is a significant source of genomic variation and has been associated with various human diseases. The exponential growth of population-scale whole-genome sequencing and rapid innovations in long-read sequencing technologies provide unprecedented opportunities to study TE insertions and their functional impact in human health and disease. Identifying TE insertions, however, is challenging due to the repetitive nature of the TE sequences. Here, we review computational approaches to detecting and genotyping TE insertions using short- and long-read sequencing and discuss the strengths and weaknesses of different approaches. © 2020 Wiley Periodicals LLC.
Goldman MJ*, Zhang J*, Fonseca NA*, Cortés-Ciriano I*, Xiang Q, Craft B, Piñeiro-Yáñez E, O'Connor BD, Bazant W, Barrera E, Muñoz-Pomer A, Petryszak R, Füllgrabe A, Al-Shahrour F, Keays M, Haussler D, Weinstein JN, Huber W, Valencia A, Park PJ, Papatheodorou I, Zhu J, Ferretti V, Vazquez M. A user guide for the online exploration and visualization of PCAWG data. Nat Commun 2020;11(1):3400.Abstract
The Pan-Cancer Analysis of Whole Genomes (PCAWG) project generated a vast amount of whole-genome cancer sequencing resource data. Here, as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, which aggregated whole genome sequencing data from 2658 cancers across 38 tumor types, we provide a user's guide to the five publicly available online data exploration and visualization tools introduced in the PCAWG marker paper. These tools are ICGC Data Portal, UCSC Xena, Chromothripsis Explorer, Expression Atlas, and PCAWG-Scout. We detail use cases and analyses for each tool, show how they incorporate outside resources from the larger genomics ecosystem, and demonstrate how the tools can be used together to understand the biology of cancers more deeply. Together, the tools enable researchers to query the complex genomic PCAWG data dynamically and integrate external information, enabling and enhancing interpretation.
Wang S, Lee S, Chu C, Jain D, Kerpedjiev P, Nelson GM, Walsh JM, Alver BH, Park PJ. HiNT: a computational method for detecting copy number variations and translocations from Hi-C data [Internet]. Genome Biology 2020;21(1):73. Publisher's VersionAbstract
The three-dimensional conformation of a genome can be profiled using Hi-C, a technique that combines chromatin conformation capture with high-throughput sequencing. However, structural variations often yield features that can be mistaken for chromosomal interactions. Here, we describe a computational method HiNT (Hi-C for copy Number variation and Translocation detection), which detects copy number variations and interchromosomal translocations within Hi-C data with breakpoints at single base-pair resolution. We demonstrate that HiNT outperforms existing methods on both simulated and real data. We also show that Hi-C can supplement whole-genome sequencing in structure variant detection by locating breakpoints in repetitive regions.
Horton CA, Alver B, Park PJ. GiniQC: a measure for quantifying noise in single-cell Hi-C data [Internet]. Bioinformatics 2020; Publisher's VersionAbstract
Single-cell Hi-C (scHi-C) allows the study of cell-to-cell variability in chromatin structure and dynamics. However, the high level of noise inherent in current scHi-C protocols necessitates careful assessment of data quality before biological conclusions can be drawn. Here we present GiniQC, which quantifies unevenness in the distribution of inter-chromosomal reads in the scHi-C contact matrix to measure the level of noise. Our examples show the utility of GiniQC in assessing the quality of scHi-C data as a complement to existing quality control measures. We also demonstrate how GiniQC can help inform the impact of various data processing steps on data quality.
Kim J, Hu C, Moufawad El Achkar C, Black LE, Douville J, Larson A, Pendergast MK, Goldkind SF, Lee EA, Kuniholm A, Soucy A, Vaze J, Belur NR, Fredriksen K, Stojkovska I, Tsytsykova A, Armant M, DiDonato RL, Choi J, Cornelissen L, Pereira LM, Augustine EF, Genetti CA, Dies K, Barton B, Williams L, Goodlett BD, Riley BL, Pasternak A, Berry ER, Pflock KA, Chu S, Reed C, Tyndall K, Agrawal PB, Beggs AH, Grant EP, Urion DK, Snyder RO, Waisbren SE, Poduri A, Park PJ, Patterson A, Biffi A, Mazzulli JR, Bodamer O, Berde CB, Yu TW. Patient-Customized Oligonucleotide Therapy for a Rare Genetic Disease. N Engl J Med 2019;Abstract
Genome sequencing is often pivotal in the diagnosis of rare diseases, but many of these conditions lack specific treatments. We describe how molecular diagnosis of a rare, fatal neurodegenerative condition led to the rational design, testing, and manufacture of milasen, a splice-modulating antisense oligonucleotide drug tailored to a particular patient. Proof-of-concept experiments in cell lines from the patient served as the basis for launching an "N-of-1" study of milasen within 1 year after first contact with the patient. There were no serious adverse events, and treatment was associated with objective reduction in seizures (determined by electroencephalography and parental reporting). This study offers a possible template for the rapid development of patient-customized treatments. (Funded by Mila's Miracle Foundation and others.).
Luquette JL, Bohrson CL, Sherman M, Park PJ. Identification of somatic mutations in single cell DNA sequencing data using a spatial model of allelic imbalance. Nature Communications 2019;10(1):3908.Abstract
Recent advances in single cell technology have enabled dissection of cellular heterogeneity in great detail. However, analysis of single cell DNA sequencing data remains challenging due to bias and artifacts that arise during DNA extraction and whole-genome amplification, including allelic imbalance and dropout. Here, we present a framework for statistical estimation of allele-specific amplification imbalance at any given position in single cell whole-genome sequencing data by utilizing the allele frequencies of heterozygous single nucleotide polymorphisms in the neighborhood. The resulting allelic imbalance profile is critical for determining whether the variant allele fraction of an observed mutation is consistent with the expected fraction for a true variant. This method, implemented in SCAN-SNV (Single Cell ANalysis of SNVs), substantially improves the identification of somatic variants in single cells. Our allele balance framework is broadly applicable to genotype analysis of any variant type in any data that might exhibit allelic imbalance.
Gulhan DC, Lee JJ-K, Melloni GEM, Cortés-Ciriano I, Park PJ. Detecting the mutational signature of homologous recombination deficiency in clinical samples. Nature Genetics 2019;51(5):912-919.Abstract
Mutations in BRCA1 and/or BRCA2 (BRCA1/2) are the most common indication of deficiency in the homologous recombination (HR) DNA repair pathway. However, recent genome-wide analyses have shown that the same pattern of mutations found in BRCA1/2-mutant tumors is also present in several other tumors. Here, we present a new computational tool called Signature Multivariate Analysis (SigMA), which can be used to accurately detect the mutational signature associated with HR deficiency from targeted gene panels. Whereas previous methods require whole-genome or whole-exome data, our method detects the HR-deficiency signature even from low mutation counts, by using a likelihood-based measure combined with machine-learning techniques. Cell lines that we identify as HR deficient show a significant response to poly (ADP-ribose) polymerase (PARP) inhibitors; patients with ovarian cancer whom we found to be HR deficient show a significantly longer overall survival with platinum regimens. By enabling panel-based identification of mutational signatures, our method substantially increases the number of patients that may be considered for treatments targeting HR deficiency.
Bohrson CL, Barton AR, Lodato MA, Rodin RE, Luquette LJ, Viswanadham VV, Gulhan DC, Cortés-Ciriano I, Sherman MA, Kwon M, Coulter ME, Galor A, Walsh CA, Park PJ. Linked-read analysis identifies mutations in single-cell DNA-sequencing data. Nature Genetics 2019;51:749-754.Abstract
Whole-genome sequencing of DNA from single cells has the potential to reshape our understanding of mutational heterogeneity in normal and diseased tissues. However, a major difficulty is distinguishing amplification artifacts from biologically derived somatic mutations. Here, we describe linked-read analysis (LiRA), a method that accurately identifies somatic singlenucleotide variants (sSNVs) by using read-level phasing with nearby germline heterozygous polymorphisms, thereby enabling the characterization of mutational signatures and estimation of somatic mutation rates in single cells.
Yang L, Wang S, Lee JJ-K, Lee S, Lee E, Shinbrot E, Wheeler DA, Kucherlapati R, Park PJ. An enhanced genetic model of colorectal cancer progression history. Genome Biology 2019;20(1):168.
Lee S, Johnson J, Vitzthum C, Kirli K, Alver BH, Park PJ. Tibanna: software for scalable execution of portable pipelines on the cloud [Internet]. Bioinformatics 2019; Publisher's VersionAbstract
We introduce Tibanna, an open-source software tool for automated execution of bioinformatics pipelines on Amazon Web Services (AWS). Tibanna accepts reproducible and portable pipeline standards including Common Workflow Language (CWL), Workflow Description Language (WDL) and Docker. It adopts a strategy of isolation and optimization of individual executions, combined with a serverless scheduling approach. Pipelines are executed and monitored using local commands or the Python Application Programming Interface (API) and cloud configuration is automatically handled. Tibanna is well suited for projects with a range of computational requirements, including those with large and widely fluctuating loads. Notably, it has been used to process terabytes of data for the 4D Nucleome (4DN) Network.
Lee JJ-K, Park S, Park H, Kim S, Lee J, Lee J, Youk J, Yi K, An Y, Park IK, Kang CH, Chung DH, Kim TM, Jeon YK, Hong D, Park PJ, Ju YS, Kim YT. Tracing Oncogene Rearrangements in the Mutational History of Lung Adenocarcinoma [Internet]. Cell 2019;177(7):1842-1857. Publisher's VersionAbstract
Mutational processes giving rise to lung adenocarcinomas (LADCs) in non-smokers remain elusive. We analyzed 138 LADC whole genomes, including 83 cases with minimal contribution of smoking-associated mutational signature. Genomic rearrangements were not correlated with smoking-associated mutations and frequently served as driver events of smoking-signature-low LADCs. Complex genomic rearrangements, including chromothripsis and chromoplexy, generated 74% of known fusion oncogenes, including EML4-ALK, CD74-ROS1, and KIF5B-RET. Unlike other collateral rearrangements, these fusion-oncogene-associated rearrangements were frequently copy-number-balanced, representing a genomic signature of early oncogenesis. Analysis of mutation timing revealed that fusions and point mutations of canonical oncogenes were often acquired in the early decades of life. During a long latency, cancer-related genes were disrupted or amplified by complex rearrangements. The genomic landscape was different between subgroups-EGFR-mutant LADCs had frequent whole-genome duplications with p53 mutations, whereas fusion-oncogene-driven LADCs had frequent SETD2 mutations. Our study highlights LADC oncogenesis driven by endogenous mutational processes.