WGS

2021
Chu C, Borges-Monroy R, Viswanadham VV, Lee S, Li H, Lee EA**, Park PJ**. Comprehensive identification of transposable element insertions using multiple sequencing technologies. Nat Commun 2021;12(1):3836.
Transposable elements (TEs) help shape the structure and function of the human genome. When inserted into some locations, TEs may disrupt gene regulation and cause diseases. Here, we present xTea (x-Transposable element analyzer), a tool for identifying TE insertions in whole-genome sequencing data. Whereas existing methods are mostly designed for short-read data, xTea can be applied to both short-read and long-read data. Our analysis shows that xTea outperforms other short read-based methods for both germline and somatic TE insertion discovery. With long-read data, we created a catalogue of polymorphic insertions with full assembly and annotation of insertional sequences for various types of retroelements, including pseudogenes and endogenous retroviruses. Notably, we find that individual genomes have an average of nine groups of full-length L1s in centromeres, suggesting that centromeres and other highly repetitive regions such as telomeres are a significant yet unexplored source of active L1s. xTea is available at https://github.com/parklab/xTea .
Kwon M, Lee S, Berselli M, Chu C, Park PJ. BamSnap: a lightweight viewer for sequencing reads in BAM files. Bioinformatics 2021;37(2):263-4.
SUMMARY: Despite the improvement in variant detection algorithms, visual inspection of the read-level data remains an essential step for accurate identification of variants in genome analysis. We developed BamSnap, an efficient BAM file viewer utilizing a graphics library and BAM indexing. In contrast to existing viewers, BamSnap can generate high-quality snapshots rapidly, with customized tracks and layout. As an example, we produced read-level images at 1000 genomic loci for >2500 whole-genomes. AVAILABILITY: BamSnap is freely available at https://github.com/parklab/bamsnap. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Sherman MA, Rodin RE, Genovese G, Dias C, Barton AR, Mukamel RE, Berger B, Park PJ**, Walsh CA**, Loh P-R**. Large mosaic copy number variations confer autism risk. Nat Neurosci 2021;24(2):197-203.
Although germline de novo copy number variants (CNVs) are known causes of autism spectrum disorder (ASD), the contribution of mosaic (early-developmental) copy number variants (mCNVs) has not been explored. In this study, we assessed the contribution of mCNVs to ASD by ascertaining mCNVs in genotype array intensity data from 12,077 probands with ASD and 5,500 unaffected siblings. We detected 46 mCNVs in probands and 19 mCNVs in siblings, affecting 2.8-73.8% of cells. Probands carried a significant burden of large (>4-Mb) mCNVs, which were detected in 25 probands but only one sibling (odds ratio = 11.4, 95% confidence interval = 1.5-84.2, P = 7.4 × 10). Event size positively correlated with severity of ASD symptoms (P = 0.016). Surprisingly, we did not observe mosaic analogues of the short de novo CNVs recurrently observed in ASD (eg, 16p11.2). We further experimentally validated two mCNVs in postmortem brain tissue from 59 additional probands. These results indicate that mCNVs contribute a previously unexplained component of ASD risk.
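The odds ratio quoted in the abstract above can be checked from the counts it reports (25 of 12,077 probands versus 1 of 5,500 siblings carrying a large mCNV). The following is a minimal sketch, not the authors' analysis: it assumes a plain 2x2 table and a Wald (Woolf) interval on the log odds ratio, and the function name `odds_ratio_wald_ci` is hypothetical. The abstract does not say which estimator was used.

```python
import math


def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Return (odds ratio, lower, upper) for the 2x2 table [[a, b], [c, d]].

    a, b: carriers and non-carriers in group 1
    c, d: carriers and non-carriers in group 2
    The interval is a Wald (Woolf) interval on the log odds ratio,
    which is an assumption made here for illustration.
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log_or)
    upper = math.exp(log_or + z * se_log_or)
    return odds_ratio, lower, upper


# Counts taken from the abstract: large (>4-Mb) mCNV carriers per group.
probands_total, siblings_total = 12077, 5500
probands_large, siblings_large = 25, 1

or_, lo, hi = odds_ratio_wald_ci(
    probands_large,
    probands_total - probands_large,
    siblings_large,
    siblings_total - siblings_large,
)
print(f"OR = {or_:.1f}, 95% CI = {lo:.1f}-{hi:.1f}")
```

Under these assumptions the printed values round to the odds ratio and confidence interval given in the abstract. The P value is not recomputed here.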
2020
Cortés-Ciriano I, Lee JJK, Xi R, Jain D, Jung YL, Yang L, Gordenin D, Klimczak LJ, Zhang CZ, Pellman DS, PCAWG Structural Variation Working Group, Park PJ, PCAWG Consortium. Comprehensive analysis of chromothripsis in 2,658 human cancers using whole-genome sequencing. Nature Genetics 2020;52(3):331-341.
Chromothripsis is a mutational phenomenon characterized by massive, clustered genomic rearrangements that occurs in cancer and other diseases. Recent studies in selected cancer types have suggested that chromothripsis may be more common than initially inferred from low-resolution copy-number data. Here, as part of the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA), we analyze patterns of chromothripsis across 2,658 tumors from 38 cancer types using whole-genome sequencing data. We find that chromothripsis events are pervasive across cancers, with a frequency of more than 50% in several cancer types. Whereas canonical chromothripsis profiles display oscillations between two copy-number states, a considerable fraction of events involve multiple chromosomes and additional structural alterations. In addition to non-homologous end joining, we detect signatures of replication-associated processes and templated insertions. Chromothripsis contributes to oncogene amplification and to inactivation of genes such as mismatch-repair-related genes. These findings show that chromothripsis is a major process that drives genome evolution in human cancer.
Dou Y, Kwon M, Rodin RE, Cortés-Ciriano I, Doan R, Luquette LJ, Galor A, Bohrson C, Walsh CA, Park PJ. Accurate detection of mosaic variants in sequencing data without matched controls. Nature Biotechnology 2020;38(3):314-319.

Detection of mosaic mutations that arise in normal development is challenging, as such mutations are typically present in only a minute fraction of cells and there is no clear matched control for removing germline variants and systematic artifacts. We present MosaicForecast, a machine-learning method that leverages read-based phasing and read-level features to accurately detect mosaic single-nucleotide variants and indels, achieving a multifold increase in specificity compared with existing algorithms. Using single-cell sequencing and targeted sequencing, we validated 80–90% of the mosaic single-nucleotide variants and 60–80% of indels detected in human brain whole-genome sequencing data. Our method should help elucidate the contribution of mosaic somatic mutations to the origin and development of disease.

Yun JW, Yang L, Park H-Y, Lee C-W, Cha H, Shin H-T, Noh K-W, Choi Y-L, Park W-Y**, Park PJ**. Dysregulation of cancer genes by recurrent intergenic fusions. Genome Biol 2020;21(1):166.
BACKGROUND: Gene fusions have been studied extensively, as frequent drivers of tumorigenesis as well as potential therapeutic targets. In many well-known cases, breakpoints occur at two intragenic positions, leading to in-frame gene-gene fusions that generate chimeric mRNAs. However, fusions often occur with intergenic breakpoints, and the role of such fusions has not been carefully examined. RESULTS: We analyze whole-genome sequencing data from 268 patients to catalog gene-intergenic and intergenic-intergenic fusions and characterize their impact. First, we discover that, in contrast to the common assumption, chimeric oncogenic transcripts-such as those involving ETV4, ERG, RSPO3, and PIK3CA-can be generated by gene-intergenic fusions through splicing of the intervening region. Second, we find that over-expression of an upstream or downstream gene by a fusion-mediated repositioning of a regulatory sequence is much more common than previously suspected, with enhancers sometimes located megabases away. We detect a number of recurrent fusions, such as those involving ANO3, RGS9, FUT5, CHI3L1, OR1D4, and LIPG in breast; IGF2 in colon; ETV1 in prostate; and IGF2BP3 and SIX2 in thyroid cancers. CONCLUSION: Our findings elucidate the potential oncogenic function of intergenic fusions and highlight the wide-ranging consequences of structural rearrangements in cancer genomes.
Chu C, Zhao B, Park PJ, Lee EA. Identification and Genotyping of Transposable Element Insertions From Genome Sequencing Data. Curr Protoc Hum Genet 2020;107(1):e102.
Transposable element (TE) mobilization is a significant source of genomic variation and has been associated with various human diseases. The exponential growth of population-scale whole-genome sequencing and rapid innovations in long-read sequencing technologies provide unprecedented opportunities to study TE insertions and their functional impact in human health and disease. Identifying TE insertions, however, is challenging due to the repetitive nature of the TE sequences. Here, we review computational approaches to detecting and genotyping TE insertions using short- and long-read sequencing and discuss the strengths and weaknesses of different approaches. © 2020 Wiley Periodicals LLC.
Goldman MJ*, Zhang J*, Fonseca NA*, Cortés-Ciriano I*, Xiang Q, Craft B, Piñeiro-Yáñez E, O'Connor BD, Bazant W, Barrera E, Muñoz-Pomer A, Petryszak R, Füllgrabe A, Al-Shahrour F, Keays M, Haussler D, Weinstein JN, Huber W, Valencia A, Park PJ, Papatheodorou I, Zhu J, Ferretti V, Vazquez M. A user guide for the online exploration and visualization of PCAWG data. Nat Commun 2020;11(1):3400.
The Pan-Cancer Analysis of Whole Genomes (PCAWG) project generated a vast amount of whole-genome cancer sequencing resource data. Here, as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, which aggregated whole genome sequencing data from 2658 cancers across 38 tumor types, we provide a user's guide to the five publicly available online data exploration and visualization tools introduced in the PCAWG marker paper. These tools are ICGC Data Portal, UCSC Xena, Chromothripsis Explorer, Expression Atlas, and PCAWG-Scout. We detail use cases and analyses for each tool, show how they incorporate outside resources from the larger genomics ecosystem, and demonstrate how the tools can be used together to understand the biology of cancers more deeply. Together, the tools enable researchers to query the complex genomic PCAWG data dynamically and integrate external information, enabling and enhancing interpretation.
2019
Luquette LJ, Bohrson CL, Sherman M, Park PJ. Identification of somatic mutations in single cell DNA sequencing data using a spatial model of allelic imbalance. Nature Communications 2019;10(1):3908.
Recent advances in single cell technology have enabled dissection of cellular heterogeneity in great detail. However, analysis of single cell DNA sequencing data remains challenging due to bias and artifacts that arise during DNA extraction and whole-genome amplification, including allelic imbalance and dropout. Here, we present a framework for statistical estimation of allele-specific amplification imbalance at any given position in single cell whole-genome sequencing data by utilizing the allele frequencies of heterozygous single nucleotide polymorphisms in the neighborhood. The resulting allelic imbalance profile is critical for determining whether the variant allele fraction of an observed mutation is consistent with the expected fraction for a true variant. This method, implemented in SCAN-SNV (Single Cell ANalysis of SNVs), substantially improves the identification of somatic variants in single cells. Our allele balance framework is broadly applicable to genotype analysis of any variant type in any data that might exhibit allelic imbalance.
Bohrson CL, Barton AR, Lodato MA, Rodin RE, Luquette LJ, Viswanadham VV, Gulhan DC, Cortés-Ciriano I, Sherman MA, Kwon M, Coulter ME, Galor A, Walsh CA, Park PJ. Linked-read analysis identifies mutations in single-cell DNA-sequencing data. Nature Genetics 2019;51:749-754.
Whole-genome sequencing of DNA from single cells has the potential to reshape our understanding of mutational heterogeneity in normal and diseased tissues. However, a major difficulty is distinguishing amplification artifacts from biologically derived somatic mutations. Here, we describe linked-read analysis (LiRA), a method that accurately identifies somatic single-nucleotide variants (sSNVs) by using read-level phasing with nearby germline heterozygous polymorphisms, thereby enabling the characterization of mutational signatures and estimation of somatic mutation rates in single cells.
Yang L, Wang S, Lee JJ-K, Lee S, Lee E, Shinbrot E, Wheeler DA, Kucherlapati R, Park PJ. An enhanced genetic model of colorectal cancer progression history. Genome Biology 2019;20(1):168.
2018
Sherman MA, Barton AR, Lodato MA, Vitzthum C, Coulter ME, Walsh CA, Park PJ. PaSD-qc: quality control for single cell whole-genome sequencing data using power spectral density estimation. Nucleic Acids Research 2018;46(4):e20.
Single cell whole-genome sequencing (scWGS) is providing novel insights into the nature of genetic heterogeneity in normal and diseased cells. However, the whole-genome amplification process required for scWGS introduces biases into the resulting sequencing that can confound downstream analysis. Here, we present a statistical method, with an accompanying package PaSD-qc (Power Spectral Density-qc), that evaluates the properties and quality of single cell libraries. It uses a modified power spectral density to assess amplification uniformity, amplicon size distribution, autocovariance and inter-sample consistency as well as to identify chromosomes with aberrant read-density profiles due either to copy alterations or poor amplification. These metrics provide a standard way to compare the quality of single cell samples as well as yield information necessary to improve variant calling strategies. We demonstrate the usefulness of this tool in comparing the properties of scWGS protocols, identifying potential chromosomal copy number variation, determining chromosomal and subchromosomal regions of poor amplification, and selecting high-quality libraries from low-coverage data for deep sequencing. The software is available free and open-source at https://github.com/parklab/PaSDqc.
2017
Cortés-Ciriano I*, Lee S*, Park W-Y, Kim T-M**, Park PJ**. A molecular portrait of microsatellite instability across multiple cancers. Nat Commun 2017;8:15180.
Microsatellite instability (MSI) refers to the hypermutability of short repetitive sequences in the genome caused by impaired DNA mismatch repair. Although MSI has been studied for decades, large amounts of sequencing data now available allows us to examine the molecular fingerprints of MSI in greater detail. Here, we analyse ∼8,000 exomes and ∼1,000 whole genomes of cancer patients across 23 cancer types. Our analysis reveals that the frequency of MSI events is highly variable within and across tumour types. We also identify genes in DNA repair and oncogenic pathways recurrently subject to MSI and uncover non-coding loci that frequently display MSI. Finally, we propose a highly accurate exome-based predictive model for the MSI phenotype. These results advance our understanding of the genomic drivers and consequences of MSI, and our comprehensive catalogue of tumour-type-specific MSI loci will enable panel-based MSI testing to identify patients who are likely to benefit from immunotherapy.
2016
Xi R, Lee S, Xia Y, Kim T-M, Park PJ. Copy number analysis of whole-genome data using BIC-seq2 and its application to detection of cancer susceptibility variants. Nucleic Acids Res 2016.

Whole-genome sequencing data allow detection of copy number variation (CNV) at high resolution. However, estimation based on read coverage along the genome suffers from bias due to GC content and other factors. Here, we develop an algorithm called BIC-seq2 that combines normalization of the data at the nucleotide level and Bayesian information criterion-based segmentation to detect both somatic and germline CNVs accurately. Analysis of simulation data showed that this method outperforms existing methods. We apply this algorithm to low coverage whole-genome sequencing data from peripheral blood of nearly a thousand patients across eleven cancer types in The Cancer Genome Atlas (TCGA) to identify cancer-predisposing CNV regions. We confirm known regions and discover new ones including those covering KMT2C, GOLPH3, ERBB2 and PLAG1. Analysis of colorectal cancer genomes in particular reveals novel recurrent CNVs including deletions at two chromatin-remodeling genes RERE and NPM2. This method will be useful to many researchers interested in profiling CNVs from whole-genome sequencing data.

2015
Bersani F, Lee E, Kharchenko PV, Xu AW, Liu M, Xega K, MacKenzie OC, Brannigan BW, Wittner BS, Jung H, Ramaswamy S, Park PJ, Maheswaran S, Ting DT, Haber DA. Pericentromeric satellite repeat expansions through RNA-derived DNA intermediates in cancer. Proc Natl Acad Sci U S A 2015;112(49):15148-53.

Aberrant transcription of the pericentromeric human satellite II (HSATII) repeat is present in a wide variety of epithelial cancers. In deriving experimental systems to study its deregulation, we observed that HSATII expression is induced in colon cancer cells cultured as xenografts or under nonadherent conditions in vitro, but it is rapidly lost in standard 2D cultures. Unexpectedly, physiological induction of endogenous HSATII RNA, as well as introduction of synthetic HSATII transcripts, generated cDNA intermediates in the form of DNA/RNA hybrids. Single molecule sequencing of tumor xenografts showed that HSATII RNA-derived DNA (rdDNA) molecules are stably incorporated within pericentromeric loci. Suppression of RT activity using small molecule inhibitors reduced HSATII copy gain. Analysis of whole-genome sequencing data revealed that HSATII copy number gain is a common feature in primary human colon tumors and is associated with a lower overall survival. Together, our observations suggest that cancer-associated derepression of specific repetitive sequences can promote their RNA-driven genomic expansion, with potential implications on pericentromeric architecture.

Evrony GD*, Lee E*, Mehta BK, Benjamini Y, Johnson RM, Cai X, Yang L, Haseley P, Lehmann HS, Park PJ**, Walsh CA**. Cell lineage analysis in human brain using endogenous retroelements. Neuron 2015;85(1):49-59.

Somatic mutations occur during brain development and are increasingly implicated as a cause of neurogenetic disease. However, the patterns in which somatic mutations distribute in the human brain are unknown. We used high-coverage whole-genome sequencing of single neurons from a normal individual to identify spontaneous somatic mutations as clonal marks to track cell lineages in human brain. Somatic mutation analyses in >30 locations throughout the nervous system identified multiple lineages and sublineages of cells marked by different LINE-1 (L1) retrotransposition events and subsequent mutation of poly-A microsatellites within L1. One clone contained thousands of cells limited to the left middle frontal gyrus, whereas a second distinct clone contained millions of cells distributed over the entire left hemisphere. These patterns mirror known somatic mutation disorders of brain development and suggest that focally distributed mutations are also prevalent in normal brains. Single-cell analysis of somatic mutation enables tracing of cell lineage clones in human brain.

2014
Parfenov M, Pedamallu CS, Gehlenborg N, Freeman SS, Danilova L, Bristow CA, Lee S, Hadjipanayis AG, Ivanova EV, Wilkerson MD, Protopopov A, Yang L, Seth S, Song X, Tang J, Ren X, Zhang J, Pantazi A, Santoso N, Xu AW, Mahadeshwar H, Wheeler DA, Haddad RI, Jung J, Ojesina AI, Issaeva N, Yarbrough WG, Hayes ND, Grandis JR, El-Naggar AK, Meyerson M, Park PJ, Chin L, Seidman JG, Hammerman PS, Kucherlapati R, Cancer Genome Atlas Network TCGA. Characterization of HPV and host genome interactions in primary head and neck cancers. Proc Natl Acad Sci U S A 2014;111(43):15544-9.

Previous studies have established that a subset of head and neck tumors contains human papillomavirus (HPV) sequences and that HPV-driven head and neck cancers display distinct biological and clinical features. HPV is known to drive cancer by the actions of the E6 and E7 oncoproteins, but the molecular architecture of HPV infection and its interaction with the host genome in head and neck cancers have not been comprehensively described. We profiled a cohort of 279 head and neck cancers with next generation RNA and DNA sequencing and show that 35 (12.5%) tumors displayed evidence of high-risk HPV types 16, 33, or 35. Twenty-five cases had integration of the viral genome into one or more locations in the human genome with statistical enrichment for genic regions. Integrations had a marked impact on the human genome and were associated with alterations in DNA copy number, mRNA transcript abundance and splicing, and both inter- and intrachromosomal rearrangements. Many of these events involved genes with documented roles in cancer. Cancers with integrated vs. nonintegrated HPV displayed different patterns of DNA methylation and both human and viral gene expressions. Together, these data provide insight into the mechanisms by which HPV interacts with the human genome beyond expression of viral oncoproteins and suggest that specific integration events are an integral component of viral oncogenesis.

Davis CF*, Ricketts CJ*, Wang M*, Yang L*, Cherniack AD, Shen H, Buhay C, Kang H, Kim SC, Fahey CC, Hacker KE, Bhanot G, Gordenin DA, Chu A, Gunaratne PH, Biehl M, Seth S, Kaipparettu BA, Bristow CA, Donehower LA, Wallen EM, Smith AB, Tickoo SK, Tamboli P, Reuter V, Schmidt LS, Hsieh JJ, Choueiri TK, Hakimi AA, Chin L, Meyerson M, Kucherlapati R, Park W-Y, Robertson GA, Laird PW, Henske EP, Kwiatkowski DJ, Park PJ, Morgan M, Shuch B, Muzny D, Wheeler DA, Linehan MW, Gibbs RA, Rathmell KW, Creighton CJ. The somatic genomic landscape of chromophobe renal cell carcinoma. Cancer Cell 2014;26(3):319-30.

We describe the landscape of somatic genomic alterations of 66 chromophobe renal cell carcinomas (ChRCCs) on the basis of multidimensional and comprehensive characterization, including mtDNA and whole-genome sequencing. The result is consistent that ChRCC originates from the distal nephron compared with other kidney cancers with more proximal origins. Combined mtDNA and gene expression analysis implicates changes in mitochondrial function as a component of the disease biology, while suggesting alternative roles for mtDNA mutations in cancers relying on oxidative phosphorylation. Genomic rearrangements lead to recurrent structural breakpoints within TERT promoter region, which correlates with highly elevated TERT expression and manifestation of kataegis, representing a mechanism of TERT upregulation in cancer distinct from previously observed amplifications and point mutations.

2013
Yang L, Luquette LJ, Gehlenborg N, Xi R, Haseley PS, Hsieh C-H, Zhang C, Ren X, Protopopov A, Chin L, Kucherlapati R, Lee C, Park PJ. Diverse mechanisms of somatic structural variations in human cancer genomes. Cell 2013;153(4):919-29.

Identification of somatic rearrangements in cancer genomes has accelerated through analysis of high-throughput sequencing data. However, characterization of complex structural alterations and their underlying mechanisms remains inadequate. Here, applying an algorithm to predict structural variations from short reads, we report a comprehensive catalog of somatic structural variations and the mechanisms generating them, using high-coverage whole-genome sequencing data from 140 patients across ten tumor types. We characterize the relative contributions of different types of rearrangements and their mutational mechanisms, find that ~20% of the somatic deletions are complex deletions formed by replication errors, and describe the differences between the mutational mechanisms in somatic and germline alterations. Importantly, we provide detailed reconstructions of the events responsible for loss of CDKN2A/B and gain of EGFR in glioblastoma, revealing that these alterations can result from multiple mechanisms even in a single genome and that both DNA double-strand breaks and replication errors drive somatic rearrangements.

2012
Lee E, Iskow R, Yang L, Gokcumen O, Haseley P, Luquette LJ, Lohr JG, Harris CC, Ding L, Wilson RK, Wheeler DA, Gibbs RA, Kucherlapati R, Lee C, Kharchenko PV**, Park PJ**, Cancer Genome Atlas Research Network TCGA. Landscape of somatic retrotransposition in human cancers. Science 2012;337(6097):967-71.

Transposable elements (TEs) are abundant in the human genome, and some are capable of generating new insertions through RNA intermediates. In cancer, the disruption of cellular mechanisms that normally suppress TE activity may facilitate mutagenic retrotranspositions. We performed single-nucleotide resolution analysis of TE insertions in 43 high-coverage whole-genome sequencing data sets from five cancer types. We identified 194 high-confidence somatic TE insertions, as well as thousands of polymorphic TE insertions in matched normal genomes. Somatic insertions were present in epithelial tumors but not in blood or brain cancers. Somatic L1 insertions tend to occur in genes that are commonly mutated in cancer, disrupt the expression of the target genes, and are biased toward regions of cancer-specific DNA hypomethylation, highlighting their potential impact in tumorigenesis.