The 4D Nucleome (4DN) Network aims to elucidate the complex structure and organization of chromosomes in the nucleus and the impact of their disruption in disease biology. We present the 4DN Data Portal ( https://data.4dnucleome.org/ ), a repository for datasets generated in the 4DN network and relevant external datasets. Datasets were generated with a wide range of experiments, including chromosome conformation capture assays such as Hi-C and other innovative sequencing and microscopy-based assays probing chromosome architecture. Altogether, the 4DN data portal hosts more than 1800 experiment sets and 36000 files. Results of sequencing-based assays from different laboratories are uniformly processed and quality-controlled. The portal interface allows easy browsing, filtering, and bulk downloads, and the integrated HiGlass genome browser allows interactive visualization and comparison of multiple datasets. The 4DN data portal represents a primary resource for chromosome contact and other nuclear architecture data for the scientific community.
Distilling biologically meaningful information from cancer genome sequencing data requires comprehensive identification of somatic alterations using rigorous computational methods. As the amount and complexity of sequencing data have increased, so has the number of tools for analysing them. Here, we describe the main steps involved in the bioinformatic analysis of cancer genomes, review key algorithmic developments and highlight popular tools and emerging technologies. These tools include those that identify point mutations, copy number alterations, structural variations and mutational signatures in cancer genomes. We also discuss issues in experimental design, the strengths and limitations of sequencing modalities and methodological challenges for the future.
SUMMARY: As the amount of three-dimensional chromosomal interaction data continues to increase, storing and accessing such data efficiently becomes paramount. We introduce Pairs, a block-compressed text file format for storing paired genomic coordinates from Hi-C data, and Pairix, an open-source C application to index and query Pairs files. Pairix (also available in Python and R) extends the functionalities of Tabix to paired coordinates data. We have also developed PairsQC, a collapsible HTML quality control report generator for Pairs files. AVAILABILITY: The format specification and source code are available at https://github.com/4dn-dcic/pairix, https://github.com/4dn-dcic/Rpairix and https://github.com/4dn-dcic/pairsqc.
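To make the idea of a paired-coordinate query concrete, here is a minimal Python sketch. It is not Pairix itself: it performs a plain linear scan over uncompressed text instead of using an index, and it assumes a seven-column, tab-separated layout (readID, chr1, pos1, chr2, pos2, strand1, strand2) with `#`-prefixed header lines, which should be checked against the format specification linked above. The names `Pair`, `read_pairs`, `query_2d` and the sample records are hypothetical.

```python
import io
from typing import Iterable, Iterator, List, NamedTuple, Tuple


class Pair(NamedTuple):
    """One contact record: two genomic positions and their strands."""
    read_id: str
    chr1: str
    pos1: int
    chr2: str
    pos2: int
    strand1: str
    strand2: str


def read_pairs(handle: Iterable[str]) -> Iterator[Pair]:
    """Yield records from a text handle, skipping '#' header lines and blanks."""
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        f = line.rstrip("\n").split("\t")
        yield Pair(f[0], f[1], int(f[2]), f[3], int(f[4]), f[5], f[6])


Region = Tuple[str, int, int]  # (chromosome, start, end), both ends inclusive


def query_2d(pairs: Iterable[Pair], region1: Region, region2: Region) -> List[Pair]:
    """Return pairs whose first mate lies in region1 and second mate in region2.

    This is a linear scan for illustration only; an indexed query avoids
    reading the whole file.
    """
    c1, s1, e1 = region1
    c2, s2, e2 = region2
    return [
        p for p in pairs
        if p.chr1 == c1 and s1 <= p.pos1 <= e1
        and p.chr2 == c2 and s2 <= p.pos2 <= e2
    ]


# Made-up records used only to exercise the functions above.
EXAMPLE_TEXT = (
    "## pairs format v1.0\n"
    "#columns: readID chr1 pos1 chr2 pos2 strand1 strand2\n"
    "r1\tchr1\t1000\tchr1\t50000\t+\t-\n"
    "r2\tchr1\t1500\tchr2\t300\t-\t-\n"
    "r3\tchr2\t200\tchr2\t90000\t+\t+\n"
)

if __name__ == "__main__":
    records = list(read_pairs(io.StringIO(EXAMPLE_TEXT)))
    for hit in query_2d(records, ("chr1", 0, 2000), ("chr1", 40000, 60000)):
        print(hit)
```

The two-region query is what distinguishes paired-coordinate data from the single-interval lookups that a one-dimensional index supports.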
Rajurkar M, Parikh AR, Solovyov A, You E, Kulkarni AS, Chu C, Xu KH, Jaicks C, Taylor MS, Wu C, Alexander KA, Good CR, Szabolcs A, Gerstberger S, Tran AV, Xu N, Ebright RY, Van Seventer EE, Vo KD, Tai EC, Lu C, Joseph-Chazan J, Raabe MJ, Nieman LT, Desai N, Arora KS, Ligorio M, Thapar V, Cohen L, Garden PM, Senussi Y, Zheng H, Allen JN, Blaszkowsky LS, Clark JW, Goyal L, Wo JY, Ryan DP, Corcoran RB, Deshpande V, Rivera MN, Aryee MJ, Hong TS, Berger SL, Walt DR, Burns KH, Park PJ, Greenbaum BD, Ting DT. Reverse Transcriptase Inhibition Disrupts Repeat Element Life Cycle in Colorectal Cancer. Cancer Discov 2022.
Altered RNA expression of repetitive sequences and retrotransposition are frequently seen in colorectal cancer (CRC) implicating a functional importance of repeat activity in cancer progression. We show the nucleoside reverse transcriptase inhibitor 3TC targets activities of these repeat elements in CRC pre-clinical models with a preferential effect in P53 mutant cell lines linked with direct binding of P53 to repeat elements. We translate these findings to a human Phase 2 trial of single agent 3TC treatment in metastatic CRC with demonstration of clinical benefit in 9 of 32 patients. Analysis of 3TC effects on CRC tumorspheres demonstrates accumulation of immunogenic RNA:DNA hybrids linked with induction of interferon response genes and DNA damage response. Epigenetic and DNA damaging agents induce repeat RNAs and have enhanced cytotoxicity with 3TC. These findings identify a vulnerability in CRC by targeting the viral mimicry of repeat elements.
Gene fusions can play important roles in tumor initiation and progression. While fusion detection so far has been from bulk samples, full-length single-cell RNA sequencing (scRNA-seq) offers the possibility of detecting gene fusions at the single-cell level. However, scRNA-seq data have a high noise level and contain various technical artifacts that can lead to spurious fusion discoveries. Here, we present a computational tool, scFusion, for gene fusion detection based on scRNA-seq. We evaluate the performance of scFusion using simulated and five real scRNA-seq datasets and find that scFusion can efficiently and sensitively detect fusions with a low false discovery rate. In a T cell dataset, scFusion detects the invariant TCR gene recombinations in mucosal-associated invariant T cells that many methods developed for bulk data fail to detect; in a multiple myeloma dataset, scFusion detects the known recurrent fusion IgH-WHSC1, which is associated with overexpression of the WHSC1 oncogene. Our results demonstrate that scFusion can be used to investigate cellular heterogeneity of gene fusions and their transcriptional impact at the single-cell level.
The structure of the human neocortex underlies species-specific traits and reflects intricate developmental programs. Here we sought to reconstruct processes that occur during early development by sampling adult human tissues. We analysed neocortical clones in a post-mortem human brain through a comprehensive assessment of brain somatic mosaicism, acting as neutral lineage recorders (refs. 1, 2). We combined the sampling of 25 distinct anatomic locations with deep whole-genome sequencing in a neurotypical deceased individual and confirmed results with 5 samples collected from each of three additional donors. We identified 259 bona fide mosaic variants from the index case, then deconvolved distinct geographical, cell-type and clade organizations across the brain and other organs. We found that clones derived after the accumulation of 90-200 progenitors in the cerebral cortex tended to respect the midline axis, well before the anterior-posterior or ventral-dorsal axes, representing a secondary hierarchy following the overall patterning of forebrain and hindbrain domains. Clones across neocortically derived cells were consistent with a dual origin from both dorsal and ventral cellular populations, similar to rodents, whereas the microglia lineage appeared distinct from other resident brain cells. Our data provide a comprehensive analysis of brain somatic mosaicism across the neocortex and demonstrate cellular origins and progenitor distribution patterns within the human brain.
For quality, interpretation, reproducibility and sharing value, microscopy images should be accompanied by detailed descriptions of the conditions that were used to produce them. Micro-Meta App is an intuitive, highly interoperable, open-source software tool that was developed in the context of the 4D Nucleome (4DN) consortium and is designed to facilitate the extraction and collection of relevant microscopy metadata as specified by the recent 4DN-BINA-OME tiered-system of Microscopy Metadata specifications. In addition to substantially lowering the burden of quality assurance, the visual nature of Micro-Meta App makes it particularly suited for training purposes.
BACKGROUND: Retrotransposons have been implicated as causes of Mendelian disease, but their role in autism spectrum disorder (ASD) has not been systematically defined, because they are only called with adequate sensitivity from whole genome sequencing (WGS) data and a large enough cohort for this analysis has only recently become available. RESULTS: We analyzed WGS data from a cohort of 2288 ASD families from the Simons Simplex Collection by establishing a scalable computational pipeline for retrotransposon insertion detection. We report 86,154 polymorphic retrotransposon insertions (including >60% not previously reported) and 158 de novo retrotransposition events. The overall burden of de novo events was similar between ASD individuals and unaffected siblings, with 1 de novo insertion per 29, 117, and 206 births for Alu, L1, and SVA respectively, and 1 de novo insertion per 21 births total. However, ASD cases showed more de novo L1 insertions than expected in ASD genes. Additionally, we observed exonic insertions in loss-of-function intolerant genes, including a likely pathogenic exonic insertion in CSDE1, only in ASD individuals. CONCLUSIONS: These findings suggest a modest, but important, impact of intronic and exonic retrotransposon insertions in ASD, show the importance of WGS for their analysis, and highlight the utility of specific bioinformatic tools for high-throughput detection of retrotransposon insertions.
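The per-family rates quoted in this abstract can be cross-checked against the combined figure with a few lines of arithmetic. The sketch below assumes that per-birth rates for the three families simply add, which is a simplification introduced here and not a statement of how the authors derived their total. The function name is hypothetical.

```python
def combined_births_per_event(births_per_event: dict) -> float:
    """Combine 'one event per N births' figures by summing per-birth rates."""
    total_rate = sum(1.0 / n for n in births_per_event.values())
    return 1.0 / total_rate


# Figures taken from the abstract: one de novo insertion per N births.
reported = {"Alu": 29, "L1": 117, "SVA": 206}

if __name__ == "__main__":
    combined = combined_births_per_event(reported)
    print(f"about one de novo insertion per {combined:.1f} births")
```

Under this additive assumption the combined value rounds to 21, matching the total reported in the abstract.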
The vertebrate retina is generated by retinal progenitor cells (RPCs), which produce >100 cell types. Although some RPCs produce many cell types, other RPCs produce restricted types of daughter cells, such as a cone photoreceptor and a horizontal cell (HC). We used genome-wide assays of chromatin structure to compare the profiles of a restricted cone/HC RPC and those of other RPCs in chicks. These data nominated regions of regulatory activity, which were tested in tissue, leading to the identification of many cis-regulatory modules (CRMs) active in cone/HC RPCs and developing cones. Two transcription factors, Otx2 and Oc1, were found to bind to many of these CRMs, including those near genes important for cone development and function, and their binding sites were required for activity. We also found that Otx2 has a predicted autoregulatory CRM. These results suggest that Otx2, Oc1 and possibly other Onecut proteins have a broad role in coordinating cone development and function. The many newly discovered CRMs for cones are potentially useful reagents for gene therapy of cone diseases.
Wang Y, Bae T, Thorpe J, Sherman MA, Jones AG, Cho S, Daily K, Dou Y, Ganz J, Galor A, Lobon I, Pattni R, Rosenbluh C, Tomasi S, Tomasini L, Yang X, Zhou B, Akbarian S, Ball LL, Bizzotto S, Emery SB, Doan R, Fasching L, Jang Y, Juan D, Lizano E, Luquette LJ, Moldovan JB, Narurkar R, Oetjens MT, Rodin RE, Sekar S, Shin JH, Soriano E, Straub RE, Zhou W, Chess A, Gleeson JG, Marquès-Bonet T, Park PJ, Peters MA, Pevsner J, Walsh CA, Weinberger DR, Vaccarino FM, Moran JV, Urban AE, Kidd JM, Mills RE, Abyzov A. Comprehensive identification of somatic nucleotide variants in human brain tissue. Genome Biol 2021;22(1):92.
BACKGROUND: Post-zygotic mutations incurred during DNA replication, DNA repair, and other cellular processes lead to somatic mosaicism. Somatic mosaicism is an established cause of various diseases, including cancers. However, detecting mosaic variants in DNA from non-cancerous somatic tissues poses significant challenges, particularly if the variants are present only in a small fraction of cells. RESULTS: Here, the Brain Somatic Mosaicism Network conducts a coordinated, multi-institutional study to examine the ability of existing methods to detect simulated somatic single-nucleotide variants (SNVs) in DNA mixing experiments, generate multiple replicates of whole-genome sequencing data from the dorsolateral prefrontal cortex, other brain regions, dura mater, and dural fibroblasts of a single neurotypical individual, devise strategies to discover somatic SNVs, and apply various approaches to validate somatic SNVs. These efforts lead to the identification of 43 bona fide somatic SNVs that range in variant allele fractions from ~0.005 to ~0.28. Guided by these results, we devise best practices for calling mosaic SNVs from 250× whole-genome sequencing data in the accessible portion of the human genome that achieve 90% specificity and sensitivity. Finally, we demonstrate that analysis of multiple bulk DNA samples from a single individual allows the reconstruction of early developmental cell lineage trees. CONCLUSIONS: This study provides a unified set of best practices to detect somatic SNVs in non-cancerous tissues. The data and methods are freely available to the scientific community and should serve as a guide to assess the contributions of somatic SNVs to neuropsychiatric diseases.
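A simple sampling calculation illustrates why variants at the low end of the reported allele-fraction range are hard to detect even at 250× depth. The sketch below is not the network's calling procedure. It assumes that reads at a site are independent draws, that depth is exactly 250, that there are no sequencing errors, and that a caller requires at least three supporting reads; that threshold (`min_alt=3`) is a hypothetical value chosen for illustration.

```python
from math import comb


def prob_at_least(depth: int, vaf: float, min_alt: int) -> float:
    """Binomial probability of seeing at least `min_alt` variant reads.

    depth   : number of reads covering the site
    vaf     : true variant allele fraction in the sample
    min_alt : minimum supporting reads required (hypothetical threshold)
    """
    p_fewer = sum(
        comb(depth, k) * vaf**k * (1.0 - vaf) ** (depth - k)
        for k in range(min_alt)
    )
    return 1.0 - p_fewer


if __name__ == "__main__":
    # Allele fractions spanning the range reported in the abstract.
    for vaf in (0.005, 0.02, 0.28):
        p = prob_at_least(depth=250, vaf=vaf, min_alt=3)
        print(f"VAF {vaf:<5} -> P(at least 3 supporting reads) = {p:.3f}")
```

Under these assumptions a variant at an allele fraction of 0.005 yields only about 1.25 supporting reads on average, so most sites would fall below a three-read threshold, whereas a variant at 0.28 is almost always sampled.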
Transposable elements (TEs) help shape the structure and function of the human genome. When inserted into some locations, TEs may disrupt gene regulation and cause diseases. Here, we present xTea (x-Transposable element analyzer), a tool for identifying TE insertions in whole-genome sequencing data. Whereas existing methods are mostly designed for short-read data, xTea can be applied to both short-read and long-read data. Our analysis shows that xTea outperforms other short read-based methods for both germline and somatic TE insertion discovery. With long-read data, we created a catalogue of polymorphic insertions with full assembly and annotation of insertional sequences for various types of retroelements, including pseudogenes and endogenous retroviruses. Notably, we find that individual genomes have an average of nine groups of full-length L1s in centromeres, suggesting that centromeres and other highly repetitive regions such as telomeres are a significant yet unexplored source of active L1s. xTea is available at https://github.com/parklab/xTea .
Histone chaperones are critical for controlling chromatin integrity during transcription, DNA replication, and DNA repair. Three conserved and essential chaperones, Spt6, Spn1/Iws1, and FACT, associate with elongating RNA polymerase II and interact with each other physically and/or functionally; however, there is little understanding of their individual functions or their relationships with each other. In this study, we selected for suppressors of a temperature-sensitive spt6 mutation that disrupts the Spt6-Spn1 physical interaction and that also causes both transcription and chromatin defects. This selection identified novel mutations in FACT. Surprisingly, suppression by FACT did not restore the Spt6-Spn1 interaction, based on coimmunoprecipitation, ChIP, and mass spectrometry experiments. Furthermore, suppression by FACT bypassed the complete loss of Spn1. Interestingly, the FACT suppressor mutations cluster along the FACT-nucleosome interface, suggesting that they alter FACT-nucleosome interactions. In agreement with this observation, we showed that the spt6 mutation that disrupts the Spt6-Spn1 interaction caused an elevated level of FACT association with chromatin, while the FACT suppressors reduced the level of FACT-chromatin association, thereby restoring a normal Spt6-FACT balance on chromatin. Taken together, these studies reveal previously unknown regulation between histone chaperones that is critical for their essential in vivo functions.
Negative elongation factor (NELF) is a critical transcriptional regulator that stabilizes paused RNA polymerase to permit rapid gene expression changes in response to environmental cues. Although NELF is essential for embryonic development, its role in adult stem cells remains unclear. In this study, through a muscle-stem-cell-specific deletion, we showed that NELF is required for efficient muscle regeneration and stem cell pool replenishment. Mechanistic studies using PRO-seq, single-cell trajectory analyses and myofiber cultures revealed that NELF works at a specific stage of regeneration whereby it modulates p53 signaling to permit massive expansion of muscle progenitors. Strikingly, transplantation experiments indicated that these progenitors are also necessary for stem cell pool repopulation, implying that they are able to return to quiescence. Thus, we identified a critical role for NELF in the expansion of muscle progenitors in response to injury and revealed that progenitors returning to quiescence are major contributors to the stem cell pool repopulation.
Mutational activation of KRAS promotes the initiation and progression of cancers, especially in the colorectum, pancreas, lung, and blood plasma, with varying prevalence of specific activating missense mutations. Although epidemiological studies connect specific alleles to clinical outcomes, the mechanisms underlying the distinct clinical characteristics of mutant KRAS alleles are unclear. Here, we analyze 13,492 samples from these four tumor types to examine allele- and tissue-specific genetic properties associated with oncogenic KRAS mutations. The prevalence of known mutagenic mechanisms partially explains the observed spectrum of KRAS activating mutations. However, there are substantial differences between the observed and predicted frequencies for many alleles, suggesting that biological selection underlies the tissue-specific frequencies of mutant alleles. Consistent with experimental studies that have identified distinct signaling properties associated with each mutant form of KRAS, our genetic analysis reveals that each KRAS allele is associated with a distinct tissue-specific comutation network. Moreover, we identify tissue-specific genetic dependencies associated with specific mutant KRAS alleles. Overall, this analysis demonstrates that the genetic interactions of oncogenic KRAS mutations are allele- and tissue-specific, underscoring the complexity that drives their clinical consequences.
Although cell lineage information is fundamental to understanding organismal development, very little direct information is available for humans. We performed high-depth (250×) whole-genome sequencing of multiple tissues from three individuals to identify hundreds of somatic single-nucleotide variants (sSNVs). Using these variants as "endogenous barcodes" in single cells, we reconstructed early embryonic cell divisions. Targeted sequencing of clonal sSNVs in different organs (about 25,000×) and in more than 1000 cortical single cells, as well as single-nucleus RNA sequencing and single-nucleus assay for transposase-accessible chromatin sequencing of ~100,000 cortical single cells, demonstrated asymmetric contributions of early progenitors to extraembryonic tissues, distinct germ layers, and organs. Our data suggest onset of gastrulation at an effective progenitor pool of about 170 cells and about 50 to 100 founders for the forebrain. Thus, mosaic mutations provide a permanent record of human embryonic development at very high resolution.
SUMMARY: Despite the improvement in variant detection algorithms, visual inspection of the read-level data remains an essential step for accurate identification of variants in genome analysis. We developed BamSnap, an efficient BAM file viewer utilizing a graphics library and BAM indexing. In contrast to existing viewers, BamSnap can generate high-quality snapshots rapidly, with customized tracks and layout. As an example, we produced read-level images at 1000 genomic loci for >2500 whole-genomes. AVAILABILITY: BamSnap is freely available at https://github.com/parklab/bamsnap. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Idiopathic normal pressure hydrocephalus (iNPH) is a neurological disorder that occurs in about 1% of individuals over age 60 and is characterized by enlarged cerebral ventricles, gait difficulty, incontinence, and cognitive decline. The cause and pathophysiology of iNPH are largely unknown. We performed whole exome sequencing of DNA obtained from 53 unrelated iNPH patients. Two recurrent heterozygous loss of function deletions in CWH43 were observed in 15% of iNPH patients and were significantly enriched 6.6-fold and 2.7-fold, respectively, when compared to the general population. Cwh43 modifies the lipid anchor of glycosylphosphatidylinositol-anchored proteins. Mice heterozygous for CWH43 deletion appeared grossly normal but displayed hydrocephalus, gait and balance abnormalities, decreased numbers of ependymal cilia, and decreased localization of glycosylphosphatidylinositol-anchored proteins to the apical surfaces of choroid plexus and ependymal cells. Our findings provide novel mechanistic insights into the origins of iNPH and demonstrate that it represents a distinct disease entity.
Homologous recombination (HR)-deficient cancers are sensitive to poly-ADP ribose polymerase inhibitors (PARPi), which have shown clinical efficacy in the treatment of high-grade serous cancers (HGSC). However, the majority of patients will relapse, and acquired PARPi resistance is emerging as a pressing clinical problem. Here we generated seven single-cell clones with acquired PARPi resistance derived from a PARPi-sensitive TP53 -/- and BRCA1 -/- epithelial cell line generated using CRISPR/Cas9. These clones showed diverse resistance mechanisms, and some clones presented with multiple mechanisms of resistance at the same time. Genomic analysis of the clones revealed unique transcriptional and mutational profiles and increased genomic instability in comparison with a PARPi-sensitive cell line. Clonal evolutionary analyses suggested that acquired PARPi resistance arose via clonal selection from an intrinsically unstable and heterogeneous cell population in the sensitive cell line, which contained preexisting drug-tolerant cells. Similarly, clonal and spatial heterogeneity in tumor biopsies from a clinical patient with BRCA1-mutant HGSC with acquired PARPi resistance was observed. In an imaging-based drug screening, the clones showed heterogeneous responses to targeted therapeutic agents, indicating that not all PARPi-resistant clones can be targeted with just one therapy. Furthermore, PARPi-resistant clones showed mechanism-dependent vulnerabilities to the selected agents, demonstrating that a deeper understanding of the mechanisms of resistance could lead to improved targeting and biomarkers for HGSC with acquired PARPi resistance. SIGNIFICANCE: This study shows that BRCA1-deficient cells can give rise to multiple genomically and functionally heterogeneous PARPi-resistant clones, which are associated with various vulnerabilities that can be targeted in a mechanism-specific manner.
Hi-C is a common technique for assessing 3D chromatin conformation. Recent studies have shown that long-range interaction information in Hi-C data can be used to generate chromosome-length genome assemblies and identify large-scale structural variations. Here, we demonstrate the use of Hi-C data in detecting mobile transposable element (TE) insertions genome-wide. Our pipeline Hi-C-based TE analyzer (HiTea) capitalizes on clipped Hi-C reads and is aided by a high proportion of discordant read pairs in Hi-C data to detect insertions of three major families of active human TEs. Despite the uneven genome coverage in Hi-C data, HiTea is competitive with the existing callers based on whole-genome sequencing (WGS) data and can supplement the WGS-based characterization of the TE-insertion landscape. We employ the pipeline to identify TE-insertions from human cell-line Hi-C samples. AVAILABILITY AND IMPLEMENTATION: HiTea is available at https://github.com/parklab/HiTea and as a Docker image. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
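As a rough illustration of what a "clipped" read means in practice, the Python sketch below extracts soft-clip lengths from a SAM CIGAR string, in which the `S` operation marks read bases that were not aligned at either end. This is not HiTea's implementation, and the abstract does not describe its filtering criteria; the 20-base threshold (`min_clip`) and the function names are hypothetical.

```python
import re
from typing import Tuple

# One CIGAR element: a length followed by an operation code from the SAM format.
_CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clip_lengths(cigar: str) -> Tuple[int, int]:
    """Return (left, right) soft-clip lengths for a CIGAR string.

    Hard clips (H) may sit outside soft clips, so they are dropped first.
    An unavailable CIGAR ('*') yields (0, 0).
    """
    elements = [
        (int(length), op)
        for length, op in _CIGAR_ELEMENT.findall(cigar)
        if op != "H"
    ]
    if not elements:
        return (0, 0)
    left = elements[0][0] if elements[0][1] == "S" else 0
    right = elements[-1][0] if len(elements) > 1 and elements[-1][1] == "S" else 0
    return (left, right)


def is_clipped(cigar: str, min_clip: int = 20) -> bool:
    """True if either end is soft-clipped by at least `min_clip` bases.

    The default threshold is an illustrative assumption.
    """
    return max(soft_clip_lengths(cigar)) >= min_clip


if __name__ == "__main__":
    for cigar in ("100M", "30S70M", "70M30S", "5H10S80M10S"):
        print(cigar, soft_clip_lengths(cigar), is_clipped(cigar))
```

The clipped portion of such a read is the part that can be compared against transposable element sequences, since it did not match the reference at the alignment position.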