Resources and challenges for integrative analysis of nuclear architecture data

https://doi.org/10.1016/j.gde.2020.12.009

A large amount of genomic data for profiling three-dimensional genome architecture has accumulated from large-scale consortium projects as well as from individual laboratories. In this review, we summarize recent landmark datasets and collections in the field. We describe the challenges in collection, annotation, and analysis of these data, particularly for integration of sequencing and microscopy data. We introduce efforts from consortia and independent groups to harmonize diverse datasets. As the resolution and throughput of sequencing and imaging technologies continue to increase, more efficient utilization and integration of collected data will be critical for a better understanding of nuclear architecture.

Introduction

The rapid pace of technology development in genome and epigenome profiling has led to major advances in our understanding of genome architecture and function. The initial techniques for measuring three-dimensional interactions among genomic loci based on chromosome conformation capture [1, 2, 3] have matured in terms of protocol optimization and have led to the development of numerous related techniques, for example, enriching for interactions with a protein of interest [4,5]. Aided by decreasing sequencing cost, researchers can now produce high-quality data that allow for more sensitive detection of long-range interactions.

In addition to published data from individual laboratories, the US National Institutes of Health (NIH) and agencies of other governments have launched consortium efforts to systematically profile epigenomes across many cell lines and tissue types, generating a large amount of data including 3D interaction data. These data provide an opportunity for researchers to engage in integrative analysis that combines their DNA, RNA, and/or local epigenetic data with publicly available 3D interaction data.

In this review, we will first summarize the resources currently available for those interested in 3D data analysis. Then, we will describe several challenges in collection, curation, and integration of data, as well as steps that can be taken to maximize the value of the data resources for the scientific community. We will focus on nuclear architecture data, but the issues and approaches are also relevant for other data types.

Landmark nuclear architecture datasets

Here, we highlight several datasets that represent key advances in terms of data quality and resolution. For chromosome conformation capture assays, advances in experimental protocols improved the spatial resolution of long-range interactions. The first Hi-C maps with more than a billion reads, generated using in situ Hi-C, appeared in 2014, providing resolution reaching 1 kb and identifying ∼10 k loops anchored by CTCF [6]. A subsequent dataset with a similar resolution was generated in mouse, resolving dynamic

Databases for nuclear architecture and epigenomics data

The largest coordinated initiative focusing on 3D genome architecture is the 4D Nucleome Network (the authors are associated with the Data Coordination and Integration Center of this project) [19••]. This initiative aims to understand the principles underlying nuclear organization in space and time (hence the ‘4D’), the role of nuclear organization in gene expression and cellular function, and the impact of changing nuclear organization in various diseases. 4D Nucleome in Phase I (2015–2020)

Data visualization tools

Exploratory analysis of Hi-C or other 3D interaction data typically begins with visual inspection of the interaction matrix, which shows the estimated frequency of interactions between every pair of loci. These datasets are large in size: the minimum number of reads required for a Hi-C experiment in the 4D Nucleome consortium is 600 million (a standard RNA-seq may contain on the order of 10–40 million reads). Thus, a tool that allows visualization of the interaction maps quickly without having
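The interaction matrix described in this section can be illustrated with a minimal sketch that bins read-pair positions on a single chromosome into a symmetric contact-count matrix. The function name `bin_contacts`, the 0-based positions, the toy coordinates, and the 10 kb bin size are all assumptions made for illustration; production tools store such matrices in sparse, multi-resolution formats rather than nested lists.

```python
def bin_contacts(contacts, chrom_length, bin_size):
    """Count contacts between fixed-size genomic bins on one chromosome.

    `contacts` is an iterable of (pos1, pos2) pairs of 0-based positions,
    each assumed to be smaller than `chrom_length` (hypothetical input).
    """
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos1, pos2 in contacts:
        i, j = pos1 // bin_size, pos2 // bin_size
        matrix[i][j] += 1
        if i != j:
            # Keep the matrix symmetric: a contact has no direction.
            matrix[j][i] += 1
    return matrix


# Toy example: four read pairs on a 30 kb chromosome, 10 kb bins.
contacts = [(1_000, 4_000), (2_000, 12_000), (11_000, 25_000), (3_000, 13_000)]
matrix = bin_contacts(contacts, chrom_length=30_000, bin_size=10_000)
for row in matrix:
    print(row)
```

A dense matrix like this grows quadratically with the number of bins, which is one reason why fast browsing of high-resolution maps calls for specialized storage and visualization tools.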

Challenges and best practices in the analysis of chromatin interaction data

To ensure the validity of a study based on chromatin interaction data, evaluation of data quality and reproducibility is essential. In addition to the common statistics on read alignments, several additional measures specific to 3D data are often informative, such as the fraction of valid pairs, the ratio between intra-chromosomal and inter-chromosomal contacts, and the fraction of short-range compared to long-range interactions [38]. Many Hi-C analysis pipelines, such as HiC-Pro [39], generate
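As a rough illustration of two of the 3D-specific quality measures named in this section, the sketch below computes the intra- to inter-chromosomal ratio and the short- versus long-range fractions from a list of read pairs. It assumes the pairs have already been filtered to valid pairs, so the fraction of valid pairs itself is not computed here. The function name `hic_qc_stats`, the tuple layout, and the 20 kb short-range cutoff are illustrative assumptions, not values from the text or from any particular pipeline.

```python
from collections import Counter


def hic_qc_stats(pairs, short_range_cutoff=20_000):
    """Summarize read pairs given as (chrom1, pos1, chrom2, pos2) tuples.

    The default cutoff separating short- from long-range contacts is an
    illustrative assumption; choose a value suited to the protocol.
    """
    counts = Counter()
    for chrom1, pos1, chrom2, pos2 in pairs:
        if chrom1 != chrom2:
            counts["inter"] += 1
        elif abs(pos2 - pos1) < short_range_cutoff:
            counts["intra_short"] += 1
        else:
            counts["intra_long"] += 1
    intra = counts["intra_short"] + counts["intra_long"]
    inter = counts["inter"]
    return {
        "total_pairs": intra + inter,
        "intra_to_inter_ratio": intra / inter if inter else float("inf"),
        "short_range_fraction": counts["intra_short"] / intra if intra else 0.0,
        "long_range_fraction": counts["intra_long"] / intra if intra else 0.0,
    }


# Toy input: three intra-chromosomal pairs and one inter-chromosomal pair.
example_pairs = [
    ("chr1", 1_000, "chr1", 5_000),
    ("chr1", 10_000, "chr1", 900_000),
    ("chr1", 50_000, "chr2", 70_000),
    ("chr2", 2_000, "chr2", 3_000_000),
]
stats = hic_qc_stats(example_pairs)
print(stats)
```

Comparing such summaries across replicates or against published datasets gives a quick first check before any downstream analysis.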

Opportunities and challenges for data reuse

The key datasets highlighted above and the hundreds of other published datasets present many opportunities for deriving new insights without the need to perform expensive experiments. For instance, a cancer biologist may have found a recurrent non-coding mutation in colorectal cancers that, based on the histone mark H3K27ac or H3K4me1, appears to be in an enhancer region. To identify which genes may be regulated by the enhancer, she could generate her own data. Alternatively, she could first

Importance of metadata collection for reproducible science

To take full advantage of the existing data, proper metadata (‘data about data’) must be available at the repositories. Lack of proper metadata is one of the main factors that hinder reproducibility of published results. To increase scientific rigor and transparency, NIH has implemented policies that emphasize the ‘FAIR’ principle: findability, accessibility, interoperability, and reusability [48••]. The idea behind this principle is to encourage data producers and publishers to provide

Collecting imaging data

Collection, curation, and re-analysis of microscopy data present challenges beyond those we have outlined above for genomic assays. Whereas sequencing experiments have common data formats (e.g. FASTQ) and common coordinate systems (genome builds), microscopy experiments are diverse in many aspects, including image resolutions, biological sample preparation methods, imaging modalities, and data formats. Imaging experiments are sometimes performed with extensive protocol variations even

Conclusion

In recent years, we have seen major advances in our understanding of nuclear architecture, aided by the increase in the resolution and throughput with which we can probe chromatin organization. An important byproduct of these advances is the high-quality datasets that have been generated. We have highlighted some datasets that provide the highest resolutions of genomic interactions to date. We have described how data portals such as those by 4DN and ENCODE increase the utility of datasets with

Conflict of interest statement

Nothing declared.

References and recommended reading

Papers of particular interest, published within the period of review, have been highlighted as:

  • • of special interest

  • •• of outstanding interest

Acknowledgement

This work was supported by the National Institutes of Health (U01CA200059).

References (50)

  • J. Dekker et al.

    Capturing chromosome conformation

    Science

    (2002)
  • E. Lieberman-Aiden et al.

    Comprehensive mapping of long-range interactions reveals folding principles of the human genome

    Science

    (2009)
  • T. Nagano et al.

    Single-cell Hi-C reveals cell-to-cell variability in chromosome structure

    Nature

    (2013)
  • M.J. Fullwood et al.

    ChIP-based methods for the identification of long-range chromatin interactions

    J Cell Biochem

    (2009)
  • M.R. Mumbach et al.

    HiChIP: efficient and sensitive analysis of protein-directed genome architecture

    Nat Methods

    (2016)
  • Y. Zhang et al.

    Transcriptionally active HERV-H retrotransposons demarcate topologically associating domains in human pluripotent stem cells

    Nat Genet

    (2019)
  • N. Krietenstein et al.

    Ultrastructural details of mammalian chromosome architecture

    Mol Cell

    (2020)
  • I.M. Flyamer et al.

    Single-nucleus Hi-C reveals unique chromatin reorganization at oocyte-to-zygote transition

    Nature

    (2017)
  • L. Tan et al.

    Three-dimensional genome structures of single diploid human cells

    Science

    (2018)
  • E.H. Finn et al.

    Extensive heterogeneity and intrinsic variation in spatial genome organization

    Cell

    (2019)
  • G. Nir et al.

    Walking along chromosomes with super-resolution imaging, contact maps, and integrative modeling

    PLoS Genet

    (2018)
  • S. Wang et al.

    Spatial organization of chromatin domains and compartments in single chromosomes

    Science

    (2016)
  • H.Q. Nguyen et al.

    3D mapping and accelerated super-resolution imaging of the human genome using in situ sequencing

    Nat Methods

    (2020)
  • J. Dekker et al.

    The 4D nucleome project

    Nature

    (2017)
  • F. Abascal et al.

    Expanded encyclopaedias of DNA elements in the human and mouse genomes

    Nature

    (2020)