Resources and challenges for integrative analysis of nuclear architecture data
Introduction
The rapid pace of technology development in genome and epigenome profiling has led to major advances in our understanding of genome architecture and function. The initial techniques for measuring three-dimensional interactions among genomic loci, based on chromosome conformation capture [1, 2, 3], have matured through protocol optimization and have led to numerous related techniques, for example, methods that enrich for interactions involving a protein of interest [4, 5]. Aided by decreasing sequencing cost, researchers can now produce high-quality data that allow for more sensitive detection of long-range interactions.
In addition to published data from individual laboratories, the US National Institutes of Health (NIH) and agencies of other governments have launched consortium efforts to systematically profile epigenomes across many cell lines and tissue types, generating a large amount of data, including 3D interaction data. These data provide an opportunity for researchers to engage in integrative analysis that combines their own DNA, RNA, and/or local epigenetic data with publicly available 3D interaction data.
In this review, we will first summarize the resources currently available for those interested in 3D data analysis. Then, we will describe several challenges in collection, curation, and integration of data, as well as steps that can be taken to maximize the value of the data resources for the scientific community. We will focus on nuclear architecture data, but the issues and approaches are also relevant for other data types.
Landmark nuclear architecture datasets
Here, we highlight several datasets that represent key advances in terms of data quality and resolution. For chromosome conformation capture assays, advances in experimental protocols improved the spatial resolution of long-range interactions. The first Hi-C maps with more than a billion reads, generated using in situ Hi-C, were published in 2014, providing resolution reaching 1 kb and identifying ∼10,000 loops anchored by CTCF [6]. A subsequent dataset with a similar resolution was generated in mouse, resolving dynamic
Databases for nuclear architecture and epigenomics data
The largest coordinated initiative focusing on 3D genome architecture is the 4D Nucleome Network (the authors are associated with the Data Coordination and Integration Center of this project) [19••]. This initiative aims to understand the principles underlying nuclear organization in space and time (hence the ‘4D’), the role of nuclear organization in gene expression and cellular function, and the impact of changing nuclear organization in various diseases. 4D Nucleome in Phase I (2015–2020)
Data visualization tools
Exploratory analysis of Hi-C or other 3D interaction data typically begins with visual inspection of the interaction matrix, which shows the estimated frequency of interactions between every pair of loci. These datasets are large: the minimum number of reads required for a Hi-C experiment in the 4D Nucleome consortium is 600 million, whereas a standard RNA-seq experiment may contain on the order of 10–40 million reads. Thus, a tool that allows visualization of the interaction maps quickly without having
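As a minimal sketch of how such an interaction matrix could be assembled, the Python snippet below bins pairwise contacts on a single chromosome into a symmetric count matrix. The function name `bin_contacts`, the 0-based coordinates, and the toy contact list are illustrative assumptions and are not taken from any specific pipeline. Real Hi-C tools typically store these matrices in sparse, multi-resolution formats rather than as dense lists.

```python
def bin_contacts(contacts, chrom_length, bin_size):
    """Bin intra-chromosomal contacts into a symmetric count matrix.

    contacts: iterable of (pos1, pos2) 0-based positions on one chromosome
              (hypothetical input format, chosen for illustration).
    chrom_length: chromosome length in bp; every position must be smaller.
    bin_size: bin width in bp (the map "resolution").
    """
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos1, pos2 in contacts:
        i, j = pos1 // bin_size, pos2 // bin_size
        matrix[i][j] += 1
        if i != j:
            # Mirror off-diagonal counts so the matrix stays symmetric.
            matrix[j][i] += 1
    return matrix


# Toy example: four contacts on a 3 kb chromosome at 1 kb resolution.
toy_contacts = [(100, 2500), (150, 900), (2100, 2900), (1200, 100)]
toy_matrix = bin_contacts(toy_contacts, chrom_length=3000, bin_size=1000)
for row in toy_matrix:
    print(row)
```

The number of cells grows with the square of the number of bins, which is one reason why dense storage becomes impractical at kilobase resolution for a whole genome.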
Challenges and best practices in the analysis of chromatin interaction data
To ensure the validity of a study based on chromatin interaction data, evaluation of data quality and reproducibility is essential. In addition to the common statistics on read alignments, several additional measures specific to 3D data are often informative, such as the fraction of valid pairs, the ratio between intra-chromosomal and inter-chromosomal contacts, and the fraction of short-range compared to long-range interactions [38]. Many Hi-C analysis pipelines, such as HiC-Pro [39], generate
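The 3D-specific quality measures listed above can be illustrated with a short Python sketch. It assumes that valid read pairs are available as `(chrom1, pos1, chrom2, pos2)` tuples; the function name `hic_qc_summary` and the 20 kb short-range cutoff are hypothetical choices made for this example, not values prescribed by any particular pipeline or consortium.

```python
def hic_qc_summary(valid_pairs, total_read_pairs, short_range_cutoff=20_000):
    """Summarize basic 3D-specific QC measures for a Hi-C library.

    valid_pairs: iterable of (chrom1, pos1, chrom2, pos2) tuples
                 (hypothetical input format).
    total_read_pairs: number of sequenced read pairs before filtering.
    short_range_cutoff: distance in bp separating short- from long-range
                        cis contacts (assumed value, for illustration only).
    """
    n_valid = n_cis = n_trans = n_short = 0
    for chrom1, pos1, chrom2, pos2 in valid_pairs:
        n_valid += 1
        if chrom1 != chrom2:
            n_trans += 1  # inter-chromosomal contact
            continue
        n_cis += 1  # intra-chromosomal contact
        if abs(pos2 - pos1) < short_range_cutoff:
            n_short += 1
    return {
        "valid_pair_fraction": n_valid / total_read_pairs,
        "cis_trans_ratio": n_cis / n_trans if n_trans else float("inf"),
        "short_range_fraction": n_short / n_cis if n_cis else float("nan"),
    }


toy_pairs = [
    ("chr1", 100, "chr1", 5_000),       # cis, short-range
    ("chr1", 100, "chr1", 50_000),      # cis, long-range
    ("chr1", 100, "chr2", 200),         # trans
    ("chr2", 10, "chr2", 1_000_000),    # cis, long-range
]
print(hic_qc_summary(toy_pairs, total_read_pairs=8))
```

Which values count as acceptable depends on the protocol and the organism, so the sketch reports the measures without applying thresholds.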
Opportunities and challenges for data reuse
The key datasets highlighted above and the hundreds of other published datasets present many opportunities for deriving new insights without the need to perform expensive experiments. For instance, a cancer biologist may have found a recurrent non-coding mutation in colorectal cancers that, based on the histone mark H3K27ac or H3K4me1, appears to be in an enhancer region. To identify which genes may be regulated by the enhancer, she could generate her own data. Alternatively, she could first
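One way such a lookup against existing loop calls might go is sketched below in Python. It assumes loops are given as pairs of anchors, each a `(chrom, start, end)` tuple with half-open coordinates; the function name `partner_anchors`, the coordinates, and the toy loop list are hypothetical and do not come from any published dataset.

```python
def partner_anchors(loops, chrom, position):
    """Return the anchors looped to any anchor that contains a position.

    loops: list of (anchor_a, anchor_b), where each anchor is a
           (chrom, start, end) tuple with half-open [start, end) coordinates
           (hypothetical input format).
    chrom, position: location of the variant of interest.
    """
    def contains(anchor):
        a_chrom, start, end = anchor
        return a_chrom == chrom and start <= position < end

    partners = []
    for anchor_a, anchor_b in loops:
        if contains(anchor_a):
            partners.append(anchor_b)
        if contains(anchor_b):
            partners.append(anchor_a)
    return partners


# Toy loop calls (made-up coordinates, for illustration only).
toy_loops = [
    (("chr8", 1_000, 2_000), ("chr8", 50_000, 51_000)),
    (("chr8", 30_000, 31_000), ("chr8", 1_500, 2_500)),
    (("chr1", 1_000, 2_000), ("chr1", 9_000, 10_000)),
]
print(partner_anchors(toy_loops, "chr8", 1_600))
```

The returned partner anchors could then be intersected with annotated promoters to nominate candidate target genes, ideally using loop calls from a cell type relevant to the disease.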
Importance of metadata collection for reproducible science
To take full advantage of the existing data, proper metadata (‘data about data’) must be available at the repositories. Lack of proper metadata is one of the main factors that hinder reproducibility of published results. To increase scientific rigor and transparency, NIH has implemented policies that emphasize the ‘FAIR’ principle: findability, accessibility, interoperability, and reusability [48••]. The idea behind this principle is to encourage data producers and publishers to provide
Collecting imaging data
Collection, curation, and re-analysis of microscopy data present challenges beyond those we have outlined above for genomic assays. Whereas sequencing experiments have common data formats (e.g. FASTQ) and common coordinate systems (genome builds), microscopy experiments are diverse in many aspects, including image resolutions, biological sample preparation methods, imaging modalities, and data formats. Imaging experiments are sometimes performed with extensive protocol variations even
Conclusion
In recent years, we have seen major advances in our understanding of nuclear architecture, aided by the increase in the resolution and throughput with which we can probe chromatin organization. An important byproduct of these advances is the set of high-quality datasets that have been generated. We have highlighted some datasets that provide the highest resolutions of genomic interactions to date. We have described how data portals such as those by 4DN and ENCODE increase the utility of datasets with
Conflict of interest statement
Nothing declared.
References and recommended reading
Papers of particular interest, published within the period of review, have been highlighted as:
• of special interest
•• of outstanding interest
Acknowledgement
This work was supported by the National Institutes of Health (U01CA200059).
References (50)
- et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell (2014)
- et al. Multiscale 3D genome rewiring during mouse neural development. Cell (2017)
- et al. Resolving the 3D landscape of transcription-linked mammalian chromatin folding. Mol Cell (2020)
- et al. Cell-cycle dynamics of chromosomal organization at single-cell resolution. Nature (2017)
- et al. ChromEMT: visualizing 3D chromatin structure and compaction in interphase and mitotic cells. Science (2017)
- et al. 3DIV: a 3D-genome interaction viewer and database. Nucleic Acids Res (2018)
- et al. The 3D genome browser: a web-based browser for visualizing 3D genome organization and long-range chromatin interactions. Genome Biol (2018)
- et al. Probabilistic modeling of Hi-C contact maps eliminates systematic biases to characterize global chromosomal architecture. Nat Genet (2011)
- et al. The hitchhiker’s guide to Hi-C analysis: practical guidelines. Methods (2015)
- et al. GiniQC: a measure for quantifying noise in single-cell Hi-C data. Bioinformatics (2020)
- Capturing chromosome conformation. Science
- Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science
- Single-cell Hi-C reveals cell-to-cell variability in chromosome structure. Nature
- ChIP-based methods for the identification of long-range chromatin interactions. J Cell Biochem
- HiChIP: efficient and sensitive analysis of protein-directed genome architecture. Nat Methods
- Transcriptionally active HERV-H retrotransposons demarcate topologically associating domains in human pluripotent stem cells. Nat Genet
- Ultrastructural details of mammalian chromosome architecture. Mol Cell
- Single-nucleus Hi-C reveals unique chromatin reorganization at oocyte-to-zygote transition. Nature
- Three-dimensional genome structures of single diploid human cells. Science
- Extensive heterogeneity and intrinsic variation in spatial genome organization. Cell
- Walking along chromosomes with super-resolution imaging, contact maps, and integrative modeling. PLoS Genet
- Spatial organization of chromatin domains and compartments in single chromosomes. Science
- 3D mapping and accelerated super-resolution imaging of the human genome using in situ sequencing. Nat Methods
- The 4D nucleome project. Nature
- Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature