BACKGROUND: RNA-seq has been widely used for genome-wide expression profiling. RNA-seq data typically consists of tens of millions of short sequenced reads from different transcripts. However, due to sequence similarity among genes and among isoforms, the source of a given read is often ambiguous. Existing approaches for estimating expression levels from RNA-seq reads tend to compromise between accuracy and computational cost. RESULTS: We introduce a new approach for quantifying transcript abundance from RNA-seq data. EMSAR (Estimation by Mappability-based Segmentation And Reclustering) groups reads according to the set of transcripts to which they are mapped and finds maximum likelihood estimates using a joint Poisson model for each optimal set of segments of transcripts. The method uses nearly all mapped reads, including those mapped to multiple genes. With an efficient transcriptome indexing based on modified suffix arrays, EMSAR minimizes the use of CPU time and memory while achieving accuracy comparable to the best existing methods. CONCLUSIONS: EMSAR is a method for quantifying transcripts from RNA-seq data with high accuracy and low computational cost. EMSAR is available at https://github.com/parklab/emsar.
BACKGROUND: Chromatin immunoprecipitation followed by microarray hybridization (ChIP-chip) is used to study protein-DNA interactions and histone modifications on a genome-scale. To ensure data quality, these experiments are usually performed in replicates, and a correlation coefficient between replicates is used often to assess reproducibility. However, the correlation coefficient can be misleading because it is affected not only by the reproducibility of the signal but also by the amount of binding signal present in the data. RESULTS: We develop the Quantized correlation coefficient (QCC) that is much less dependent on the amount of signal. This involves discretization of data into set of quantiles (quantization), a merging procedure to group the background probes, and recalculation of the Pearson correlation coefficient. This procedure reduces the influence of the background noise on the statistic, which then properly focuses more on the reproducibility of the signal. The performance of this procedure is tested in both simulated and real ChIP-chip data. For replicates with different levels of enrichment over background and coverage, we find that QCC reflects reproducibility more accurately and is more robust than the standard Pearson or Spearman correlation coefficients. The quantization and the merging procedure can also suggest a proper quantile threshold for separating signal from background for further analysis. CONCLUSIONS: To measure reproducibility of ChIP-chip data correctly, a correlation coefficient that is robust to the amount of signal present should be used. QCC is one such measure. The QCC statistic can also be applied in a variety of other contexts for measuring reproducibility, including analysis of array CGH data for DNA copy number and gene expression data.
BACKGROUND: Recent advances in sequencing technologies have enabled generation of large-scale genome sequencing data. These data can be used to characterize a variety of genomic features, including the DNA copy number profile of a cancer genome. A robust and reliable method for screening chromosomal alterations would allow a detailed characterization of the cancer genome with unprecedented accuracy. RESULTS: We develop a method for identification of copy number alterations in a tumor genome compared to its matched control, based on application of Smith-Waterman algorithm to single-end sequencing data. In a performance test with simulated data, our algorithm shows >90% sensitivity and >90% precision in detecting a single copy number change that contains approximately 500 reads for the normal sample. With 100-bp reads, this corresponds to a ~50 kb region for 1X genome coverage of the human genome. We further refine the algorithm to develop rSW-seq, (recursive Smith-Waterman-seq) to identify alterations in a complex configuration, which are commonly observed in the human cancer genome. To validate our approach, we compare our algorithm with an existing algorithm using simulated and publicly available datasets. We also compare the sequencing-based profiles to microarray-based results. CONCLUSION: We propose rSW-seq as an efficient method for detecting copy number changes in the tumor genome.
BACKGROUND: Chromatin immunoprecipitation on tiling arrays (ChIP-chip) has been widely used to investigate the DNA binding sites for a variety of proteins on a genome-wide scale. However, several issues in the processing and analysis of ChIP-chip data have not been resolved fully, including the effect of background (mock control) subtraction and normalization within and across arrays. RESULTS: The binding profiles of Drosophila male-specific lethal (MSL) complex on a tiling array provide a unique opportunity for investigating these topics, as it is known to bind on the X chromosome but not on the autosomes. These large bound and control regions on the same array allow clear evaluation of analytical methods.We introduce a novel normalization scheme specifically designed for ChIP-chip data from dual-channel arrays and demonstrate that this step is critical for correcting systematic dye-bias that may exist in the data. Subtraction of the mock (non-specific antibody or no antibody) control data is generally needed to eliminate the bias, but appropriate normalization obviates the need for mock experiments and increases the correlation among replicates. The idea underlying the normalization can be used subsequently to estimate the background noise level in each array for normalization across arrays. We demonstrate the effectiveness of the methods with the MSL complex binding data and other publicly available data. CONCLUSION: Proper normalization is essential for ChIP-chip experiments. The proposed normalization technique can correct systematic errors and compensate for the lack of mock control data, thus reducing the experimental cost and producing more accurate results.
BACKGROUND: One of the important challenges in microarray analysis is to take full advantage of previously accumulated data, both from one's own laboratory and from public repositories. Through a comparative analysis on a variety of datasets, a more comprehensive view of the underlying mechanism or structure can be obtained. However, as we discover in this work, continual changes in genomic sequence annotations and probe design criteria make it difficult to compare gene expression data even from different generations of the same microarray platform. RESULTS: We first describe the extent of discordance between the results derived from two generations of Affymetrix oligonucleotide arrays, as revealed in cluster analysis and in identification of differentially expressed genes. We then propose a method for increasing comparability. The dataset we use consists of a set of 14 human muscle biopsy samples from patients with inflammatory myopathies that were hybridized on both HG-U95Av2 and HG-U133A human arrays. We find that the use of the probe set matching table for comparative analysis provided by Affymetrix produces better results than matching by UniGene or LocusLink identifiers but still remains inadequate. Rescaling of expression values for each gene across samples and data filtering by expression values enhance comparability but only for few specific analyses. As a generic method for improving comparability, we select a subset of probes with overlapping sequence segments in the two array types and recalculate expression values based only on the selected probes. We show that this filtering of probes significantly improves the comparability while retaining a sufficient number of probe sets for further analysis. CONCLUSIONS: Compatibility between high-density oligonucleotide arrays is significantly affected by probe-level sequence information. With a careful filtering of the probes based on their sequence overlaps, data from different generations of microarrays can be combined more effectively.