Bioinformatics

2022
Lee S, Bakker C, Vitzthum C, Alver BH, Park PJ*. Pairs and Pairix: a file format and a tool for efficient storage and retrieval for Hi-C read pairs. Bioinformatics 2022;Abstract
SUMMARY: As the amount of three-dimensional chromosomal interaction data continues to increase, storing and accessing such data efficiently becomes paramount. We introduce Pairs, a block-compressed text file format for storing paired genomic coordinates from Hi-C data, and Pairix, an open-source C application to index and query Pairs files. Pairix (also available in Python and R) extends the functionalities of Tabix to paired coordinates data. We have also developed PairsQC, a collapsible HTML quality control report generator for Pairs files. AVAILABILITY: The format specification and source code are available at https://github.com/4dn-dcic/pairix, https://github.com/4dn-dcic/Rpairix and https://github.com/4dn-dcic/pairsqc.
pdf
2021
Kwon M, Lee S, Berselli M, Chu C, Park PJ. BamSnap: a lightweight viewer for sequencing reads in BAM files. Bioinformatics 2021;37(2):263-4.Abstract
SUMMARY: Despite the improvement in variant detection algorithms, visual inspection of the read-level data remains an essential step for accurate identification of variants in genome analysis. We developed BamSnap, an efficient BAM file viewer utilizing a graphics library and BAM indexing. In contrast to existing viewers, BamSnap can generate high-quality snapshots rapidly, with customized tracks and layout. As an example, we produced read-level images at 1000 genomic loci for >2500 whole-genomes. AVAILABILITY: BamSnap is freely available at https://github.com/parklab/bamsnap. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
pdf
Jain D, Chu C, Alver BH, Lee S, Lee EA, Park PJ*. HiTea: a computational pipeline to identify non-reference transposable element insertions in Hi-C data. Bioinformatics 2021;37(8):1045-1051.Abstract
Hi-C is a common technique for assessing 3D chromatin conformation. Recent studies have shown that long-range interaction information in Hi-C data can be used to generate chromosome-length genome assemblies and identify large-scale structural variations. Here, we demonstrate the use of Hi-C data in detecting mobile transposable element (TE) insertions genome-wide. Our pipeline Hi-C-based TE analyzer (HiTea) capitalizes on clipped Hi-C reads and is aided by a high proportion of discordant read pairs in Hi-C data to detect insertions of three major families of active human TEs. Despite the uneven genome coverage in Hi-C data, HiTea is competitive with the existing callers based on whole-genome sequencing (WGS) data and can supplement the WGS-based characterization of the TE-insertion landscape. We employ the pipeline to identify TE-insertions from human cell-line Hi-C samples. AVAILABILITY AND IMPLEMENTATION: HiTea is available at https://github.com/parklab/HiTea and as a Docker image. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
pdf
2020
Horton CA, Alver B, Park PJ. GiniQC: a measure for quantifying noise in single-cell Hi-C data. Bioinformatics 2020;Abstract
Single-cell Hi-C (scHi-C) allows the study of cell-to-cell variability in chromatin structure and dynamics. However, the high level of noise inherent in current scHi-C protocols necessitates careful assessment of data quality before biological conclusions can be drawn. Here we present GiniQC, which quantifies unevenness in the distribution of inter-chromosomal reads in the scHi-C contact matrix to measure the level of noise. Our examples show the utility of GiniQC in assessing the quality of scHi-C data as a complement to existing quality control measures. We also demonstrate how GiniQC can help inform the impact of various data processing steps on data quality.
pdf
2019
Lee S, Johnson J, Vitzthum C, Kirli K, Alver BH, Park PJ. Tibanna: software for scalable execution of portable pipelines on the cloud. Bioinformatics 2019;Abstract
We introduce Tibanna, an open-source software tool for automated execution of bioinformatics pipelines on Amazon Web Services (AWS). Tibanna accepts reproducible and portable pipeline standards including Common Workflow Language (CWL), Workflow Description Language (WDL) and Docker. It adopts a strategy of isolation and optimization of individual executions, combined with a serverless scheduling approach. Pipelines are executed and monitored using local commands or the Python Application Programming Interface (API) and cloud configuration is automatically handled. Tibanna is well suited for projects with a range of computational requirements, including those with large and widely fluctuating loads. Notably, it has been used to process terabytes of data for the 4D Nucleome (4DN) Network.
pdf
2015
Sohn K-A*, Ho JWK*, Djordjevic D, Jeong H-H, Park PJ**, Kim JH**. hiHMM: Bayesian non-parametric joint inference of chromatin state maps. Bioinformatics 2015;31(13):2066-74.Abstract

MOTIVATION: Genome-wide mapping of chromatin states is essential for defining regulatory elements and inferring their activities in eukaryotic genomes. A number of hidden Markov model (HMM)-based methods have been developed to infer chromatin state maps from genome-wide histone modification data for an individual genome. To perform a principled comparison of evolutionarily distant epigenomes, we must consider species-specific biases such as differences in genome size, strength of signal enrichment and co-occurrence patterns of histone modifications. RESULTS: Here, we present a new Bayesian non-parametric method called hierarchically linked infinite HMM (hiHMM) to jointly infer chromatin state maps in multiple genomes (different species, cell types and developmental stages) using genome-wide histone modification data. This flexible framework provides a new way to learn a consistent definition of chromatin states across multiple genomes, thus facilitating a direct comparison among them. We demonstrate the utility of this method using synthetic data as well as multiple modENCODE ChIP-seq datasets. CONCLUSION: The hierarchical and Bayesian non-parametric formulation in our approach is an important extension to the current set of methodologies for comparative chromatin landscape analysis. AVAILABILITY AND IMPLEMENTATION: Source codes are available at https://github.com/kasohn/hiHMM. Chromatin data are available at http://encode-x.med.harvard.edu/data_sets/chromatin/.

pdf
2013
Gehlenborg N, Noble MS, Getz G, Chin L, Park PJ. Nozzle: a report generation toolkit for data analysis pipelines. Bioinformatics 2013;29(8):1089-91.Abstract

SUMMARY: We have developed Nozzle, an R package that provides an Application Programming Interface to generate HTML reports with dynamic user interface elements. Nozzle was designed to facilitate summarization and rapid browsing of complex results in data analysis pipelines where multiple analyses are performed frequently on big datasets. The package can be applied to any project where user-friendly reports need to be created. AVAILABILITY: The R package is available on CRAN at http://cran.r-project.org/package=Nozzle.R1. Examples and additional materials are available at http://gdac.broadinstitute.org/nozzle. The source code is also available at http://www.github.com/parklab/Nozzle. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

pdf
2009
Park PJ, Kong SW, Tebaldi T, Lai WR, Kasif S, Kohane IS. Integration of heterogeneous expression data sets extends the role of the retinol pathway in diabetes and insulin resistance. Bioinformatics 2009;25(23):3121-7.Abstract

MOTIVATION: Type 2 diabetes is a chronic metabolic disease that involves both environmental and genetic factors. To understand the genetics of type 2 diabetes and insulin resistance, the DIabetes Genome Anatomy Project (DGAP) was launched to profile gene expression in a variety of related animal models and human subjects. We asked whether these heterogeneous models can be integrated to provide consistent and robust biological insights into the biology of insulin resistance. RESULTS: We perform integrative analysis of the 16 DGAP data sets that span multiple tissues, conditions, array types, laboratories, species, genetic backgrounds and study designs. For each data set, we identify differentially expressed genes compared with control. Then, for the combined data, we rank genes according to the frequency with which they were found to be statistically significant across data sets. This analysis reveals RetSat as a widely shared component of mechanisms involved in insulin resistance and sensitivity and adds to the growing importance of the retinol pathway in diabetes, adipogenesis and insulin resistance. Top candidates obtained from our analysis have been confirmed in recent laboratory studies.

pdf
2008
Lai W, Choudhary V, Park PJ. CGHweb: a tool for comparing DNA copy number segmentations from multiple algorithms. Bioinformatics 2008;24(7):1014-5.Abstract

UNLABELLED: Accurate estimation of DNA copy numbers from array comparative genomic hybridization (CGH) data is important for characterizing the cancer genome. An important part of this process is the segmentation of the log-ratios between the sample and control DNA along the chromosome into regions of different copy numbers. However, multiple algorithms are available in the literature for this procedure and the results can vary substantially among these. Thus, a visualization tool that can display the segmented profiles from a number of methods can be helpful to the biologist or the clinician to ascertain that a feature of interest did not arise as an artifact of the algorithm. Such a tool also allows the methodologist to easily contrast his method against others. We developed a web-based tool that applies a number of popular algorithms to a single array CGH profile entered by the user. It generates a heatmap panel of the segmented profiles for each method as well as a consensus profile. The clickable heatmap can be moved along the chromosome and zoomed in or out. It also displays the time that each algorithm took and provides numerical values of the segmented profiles for download. The web interface calls algorithms written in the statistical language R. We encourage developers of new algorithms to submit their routines to be incorporated into the website. AVAILABILITY: http://compbio.med.harvard.edu/CGHweb.

pdf
Lee H, Kong SW, Park PJ. Integrative analysis reveals the direct and indirect interactions between DNA copy number aberrations and gene expression changes. Bioinformatics 2008;24(7):889-96.Abstract

MOTIVATION: DNA copy number aberrations (CNAs) and gene expression (GE) changes provide valuable information for studying chromosomal instability and its consequences in cancer. While it is clear that the structural aberrations and the transcript levels are intertwined, their relationship is more complex and subtle than initially suspected. Most studies so far have focused on how a CNA affects the expression levels of those genes contained within that CNA. RESULTS: To better understand the impact of CNAs on expression, we investigated the correlation of each CNA to all other genes in the genome. The correlations are computed over multiple patients that have both expression and copy number measurements in brain, bladder and breast cancer data sets. We find that a CNA has a direct impact on the gene amplified or deleted, but it also has a broad, indirect impact elsewhere. To identify a set of CNAs that is coordinately associated with the expression changes of a set of genes, we used a biclustering algorithm on the correlation matrix. For each of the three cancer types examined, the aberrations in several loci are associated with cancer-type specific biological pathways that have been described in the literature: CNAs of chromosome (chr) 7p13 were significantly correlated with epidermal growth factor receptor signaling pathway in glioblastoma multiforme, chr 13q with NF-kappaB cascades in bladder cancer, and chr 11p with Reck pathway in breast cancer. In all three data sets, gene sets related to cell cycle/division such as M phase, DNA replication and cell division were also associated with CNAs. Our results suggest that CNAs are both directly and indirectly correlated with changes in expression and that it is beneficial to examine the indirect effects of CNAs. AVAILABILITY: The code is available upon request.

pdf
Tolstorukov MY**, Choudhary V, Olson WK, Zhurkin VB, Park PJ**. nuScore: a web-interface for nucleosome positioning predictions. Bioinformatics 2008;24(12):1456-8.Abstract

SUMMARY: Sequence-directed mapping of nucleosome positions is of major biological interest. Here, we present a web-interface for estimation of the affinity of the histone core to DNA and prediction of nucleosome arrangement on a given sequence. Our approach is based on assessment of the energy cost of imposing the deformations required to wrap DNA around the histone surface. The interface allows the user to specify a number of options such as selecting from several structural templates for threading calculations and adding random sequences to the analysis. AVAILABILITY: The nuScore interface is freely available for use at http://compbio.med.harvard.edu/nuScore. CONTACT: peter_park@harvard.edu; tolstorukov@gmail.com SUPPLEMENTARY INFORMATION: The site contains user manual, description of the methodology and examples.

pdf
2006
Kong SW, Pu WT, Park PJ. A multivariate approach for integrating genome-wide expression data and biological knowledge. Bioinformatics 2006;22(19):2373-80.Abstract

MOTIVATION: Several statistical methods that combine analysis of differential gene expression with biological knowledge databases have been proposed for a more rapid interpretation of expression data. However, most such methods are based on a series of univariate statistical tests and do not properly account for the complex structure of gene interactions. RESULTS: We present a simple yet effective multivariate statistical procedure for assessing the correlation between a subspace defined by a group of genes and a binary phenotype. A subspace is deemed significant if the samples corresponding to different phenotypes are well separated in that subspace. The separation is measured using Hotelling's T(2) statistic, which captures the covariance structure of the subspace. When the dimension of the subspace is larger than that of the sample space, we project the original data to a smaller orthonormal subspace. We use this method to search through functional pathway subspaces defined by Reactome, KEGG, BioCarta and Gene Ontology. To demonstrate its performance, we apply this method to the data from two published studies, and visualize the results in the principal component space.

pdf
2005
Lai WR, Johnson MD, Kucherlapati R, Park PJ. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 2005;21(19):3763-70.Abstract

MOTIVATION: Array Comparative Genomic Hybridization (CGH) can reveal chromosomal aberrations in the genomic DNA. These amplifications and deletions at the DNA level are important in the pathogenesis of cancer and other diseases. While a large number of approaches have been proposed for analyzing the large array CGH datasets, the relative merits of these methods in practice are not clear. RESULTS: We compare 11 different algorithms for analyzing array CGH data. These include both segment detection methods and smoothing methods, based on diverse techniques such as mixture models, Hidden Markov Models, maximum likelihood, regression, wavelets and genetic algorithms. We compute the Receiver Operating Characteristic (ROC) curves using simulated data to quantify sensitivity and specificity for various levels of signal-to-noise ratio and different sizes of abnormalities. We also characterize their performance on chromosomal regions of interest in a real dataset obtained from patients with Glioblastoma Multiforme. While comparisons of this type are difficult due to possibly sub-optimal choice of parameters in the methods, they nevertheless reveal general characteristics that are helpful to the biological investigator.

pdf
Kong SW, Hwang K-B, Kim RD, Zhang B-T, Greenberg SA, Kohane IS, Park PJ. CrossChip: a system supporting comparative analysis of different generations of Affymetrix arrays. Bioinformatics 2005;21(9):2116-7.Abstract

SUMMARY: To increase compatibility between different generations of Affymetrix GeneChip arrays, we propose a method of filtering probes based on their sequences. Our method is implemented as a web-based service for downloading necessary materials for converting the raw data files (*.CEL) for comparative analysis. The user can specify the appropriate level of filtering by setting the criteria for the minimum overlap length between probe sequences and the minimum number of usable probe pairs per probe set. Our website supports a within-species comparison for human and mouse GeneChip arrays. AVAILABILITY: http://www.crosschip.org

pdf
2002
Park PJ, Butte AJ, Kohane IS. Comparing expression profiles of genes with similar promoter regions. Bioinformatics 2002;18(12):1576-84.Abstract

MOTIVATION: Gene regulatory elements are often predicted by seeking common sequences in the promoter regions of genes that are clustered together based on their expression profiles. We consider the problem in the opposite direction: we seek to find the genes that have similar promoter regions and determine the extent to which these genes have similar expression profiles. RESULTS: We use the data sets from experiments on Saccharomyces cerevisiae. Our similarity measure for the promoter regions is based on the set of common mapped or putative transcription factor binding sites and other regulatory elements in the upstream region of the genes, as contained in the Saccharomyces cerevisiae Promoter Database. We pair up the genes with high similarity scores and compare their expression levels in time-course experiment data. We find that genes with similar promoter regions on the average have significantly higher correlation, but it can vary widely depending on the genes. This confirms that the presence of similar regulatory elements often does not correspond to similarity in expression profiles and indicates that finding transcription factor binding sites or other regulatory elements starting with the expression patterns may be limited in many cases. Regardless of the correlation, the degree to which the profiles agree under different experimental conditions can be examined to derive hypotheses concerning the role of common regulatory elements. Overall, we find that considering the relationship between the promoter regions and the expression profiles starting with the regulatory elements is a difficult but useful process that can provide valuable insights.

pdf
Park PJ, Tian L, Kohane IS. Linking gene expression data with patient survival times using partial least squares. Bioinformatics 2002;18 Suppl 1:S120-7.Abstract

There is an increasing need to link the large amount of genotypic data, gathered using microarrays for example, with various phenotypic data from patients. The classification problem in which gene expression data serve as predictors and a class label phenotype as the binary outcome variable has been examined extensively, but there has been less emphasis in dealing with other types of phenotypic data. In particular, patient survival times with censoring are often not used directly as a response variable due to the complications that arise from censoring. We show that the issues involving censored data can be circumvented by reformulating the problem as a standard Poisson regression problem. The procedure for solving the transformed problem is a combination of two approaches: partial least squares, a regression technique that is especially effective when there is severe collinearity due to a large number of predictors, and generalized linear regression, which extends standard linear regression to deal with various types of response variables. The linear combinations of the original variables identified by the method are highly correlated with the patient survival times and at the same time account for the variability in the covariates. The algorithm is fast, as it does not involve any matrix decompositions in the iterations. We apply our method to data sets from lung carcinoma and diffuse large B-cell lymphoma studies to verify its effectiveness.