2005
Kong SW, Hwang K-B, Kim RD, Zhang B-T, Greenberg SA, Kohane IS, Park PJ. CrossChip: a system supporting comparative analysis of different generations of Affymetrix arrays. Bioinformatics 2005;21(9):2116-7.

SUMMARY: To increase compatibility between different generations of Affymetrix GeneChip arrays, we propose a method of filtering probes based on their sequences. Our method is implemented as a web-based service for downloading necessary materials for converting the raw data files (*.CEL) for comparative analysis. The user can specify the appropriate level of filtering by setting the criteria for the minimum overlap length between probe sequences and the minimum number of usable probe pairs per probe set. Our website supports a within-species comparison for human and mouse GeneChip arrays. AVAILABILITY: http://www.crosschip.org
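
The two filtering criteria described above can be sketched roughly as follows. This is only an illustration, not the CrossChip implementation: the function names, the input layout, and the definition of "overlap" as the longest common contiguous substring are all assumptions made for the example.

```python
def overlap_length(a: str, b: str) -> int:
    """Length of the longest common contiguous substring of two probe
    sequences (an assumed definition of 'overlap' for this sketch)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def filter_probe_sets(old_sets, new_sets, matches, min_overlap, min_pairs):
    """Hypothetical helper.

    old_sets / new_sets: dict mapping probe set ID -> list of probe sequences.
    matches: iterable of (old_id, new_id) probe set pairs to compare.
    Returns a dict mapping (old_id, new_id) to the indices of the probes kept
    on each array; a probe set pair is dropped when either array keeps fewer
    than min_pairs probes.
    """
    kept = {}
    for old_id, new_id in matches:
        old_probes, new_probes = old_sets[old_id], new_sets[new_id]
        keep_old = [i for i, p in enumerate(old_probes)
                    if any(overlap_length(p, q) >= min_overlap for q in new_probes)]
        keep_new = [j for j, q in enumerate(new_probes)
                    if any(overlap_length(p, q) >= min_overlap for p in old_probes)]
        if len(keep_old) >= min_pairs and len(keep_new) >= min_pairs:
            kept[(old_id, new_id)] = (keep_old, keep_new)
    return kept
```

Raising `min_overlap` makes the retained probes more alike across array generations, while raising `min_pairs` discards probe sets that would be summarized from too few probes; the two thresholds trade comparability against coverage.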

2004
Park PJ, Hou TY. Multiscale Numerical Methods for Singularly Perturbed Convection-Diffusion Equations. International Journal of Computational Methods 2004;1(1):17-65.

We present an efficient and robust approach in the finite element framework for numerical solutions that exhibit multiscale behavior, with applications to singularly perturbed convection-diffusion problems. The first type of equation we study is the convection-dominated convection-diffusion equation, with periodic or random coefficients; the second type of equation is an elliptic equation with singularities due to discontinuous coefficients and non-smooth boundaries. In both cases, standard methods for purely hyperbolic or elliptic problems perform poorly due to sharp boundary and internal layers in the solution.

We propose a framework in which the finite element basis functions are designed to capture the local small-scale behavior correctly. When the structure of the layers can be determined locally, we apply the multiscale finite element method, in which we solve the corresponding homogeneous equation on each element to capture the small-scale features of the differential operator. We demonstrate the effectiveness of this method by computing the enhanced diffusivity scaling for a passive scalar in the cellular flow. We also carry out the asymptotic error analysis for its convergence rate and perform numerical experiments for verification. For a random flow with nonlocal layer structure, we use a variational principle to gain additional information in our attempt to design asymptotic basis functions. We also apply the same framework to elliptic equations with discontinuous coefficients or non-smooth boundaries. In that case, we construct local basis functions near singularities using the infinite element method in order to resolve extreme singularities. Numerical results on problems with various singularities confirm the efficiency and accuracy of this approach.

Hwang K-B, Kong SW, Greenberg SA, Park PJ. Combining gene expression data from different generations of oligonucleotide arrays. BMC Bioinformatics 2004;5:159.

BACKGROUND: One of the important challenges in microarray analysis is to take full advantage of previously accumulated data, both from one's own laboratory and from public repositories. Through a comparative analysis on a variety of datasets, a more comprehensive view of the underlying mechanism or structure can be obtained. However, as we discover in this work, continual changes in genomic sequence annotations and probe design criteria make it difficult to compare gene expression data even from different generations of the same microarray platform. RESULTS: We first describe the extent of discordance between the results derived from two generations of Affymetrix oligonucleotide arrays, as revealed in cluster analysis and in identification of differentially expressed genes. We then propose a method for increasing comparability. The dataset we use consists of a set of 14 human muscle biopsy samples from patients with inflammatory myopathies that were hybridized on both HG-U95Av2 and HG-U133A human arrays. We find that the use of the probe set matching table for comparative analysis provided by Affymetrix produces better results than matching by UniGene or LocusLink identifiers but still remains inadequate. Rescaling of expression values for each gene across samples and data filtering by expression values enhance comparability but only for a few specific analyses. As a generic method for improving comparability, we select a subset of probes with overlapping sequence segments in the two array types and recalculate expression values based only on the selected probes. We show that this filtering of probes significantly improves the comparability while retaining a sufficient number of probe sets for further analysis. CONCLUSIONS: Compatibility between high-density oligonucleotide arrays is significantly affected by probe-level sequence information. With a careful filtering of the probes based on their sequence overlaps, data from different generations of microarrays can be combined more effectively.

Park PJ, Cao YA, Lee SY, Kim J-W, Chang MS, Hart R, Choi S. Current issues for DNA microarrays: platform comparison, double linear amplification, and universal RNA reference. J Biotechnol 2004;112(3):225-45.

DNA microarray technology has been widely used to simultaneously determine the expression levels of thousands of genes. A variety of approaches have been used, both in the implementation of this technology and in the analysis of the large amount of expression data. However, several practical issues still have not been resolved in a satisfactory manner, and among the most critical is the lack of agreement in the results obtained in different array platforms. In this study, we present a comparison of several microarray platforms [Affymetrix oligonucleotide arrays, custom complementary DNA (cDNA) arrays, and custom oligo arrays printed with oligonucleotides from three different sources] as well as analysis of various methods used for microarray target preparation and the reference design. The results indicate that the pairwise correlations of expression levels between platforms are relatively low overall but that the log ratios of the highly expressed genes are strongly correlated, especially between Affymetrix and cDNA arrays. The microarray measurements were compared with quantitative real-time polymerase chain reaction (QRT-PCR) results for 23 genes, and the varying degrees of agreement for each platform were characterized. We have also developed and tested a double amplification method which allows the use of smaller amounts of starting material. The added round of amplification produced reproducible results as compared to the arrays hybridized with single-round amplified targets. Finally, the reliability of using a universal RNA reference for two-channel microarrays was tested and the results suggest that comparisons of multiple experimental conditions using the same control can be accurate.

Kim RD, Park PJ. Improving identification of differentially expressed genes in microarray studies using information from public databases. Genome Biol 2004;5(9):R70.

We demonstrate that the process of identifying differentially expressed genes in microarray studies with small sample sizes can be substantially improved by extracting information from a large number of datasets accumulated in public databases. The improvement comes from more reliable estimates of gene-specific variances based on other datasets. For a two-group comparison with two arrays in each group, for example, the result of our method was comparable to that of a t-test analysis with five samples in each group or to that of a regularized t-test analysis with three samples in each group. Our results are further improved by weighting the results of our approach with the regularized t-test results in a hybrid method.
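
The central idea, replacing a noisy within-study variance with a gene-specific variance estimated from other datasets, can be illustrated with a minimal sketch. The pooling rule (a degrees-of-freedom-weighted average of sample variances) and both function names are assumptions for illustration; the abstract does not specify how the published method combines the external datasets, and the hybrid weighting with the regularized t-test is not shown.

```python
from math import sqrt
from statistics import mean, variance


def external_variance(datasets):
    """Pool one gene's sample variance over several external datasets,
    weighting each by its degrees of freedom (an assumed pooling rule).
    Each dataset is a list of at least two expression values."""
    numerator = sum((len(d) - 1) * variance(d) for d in datasets)
    denominator = sum(len(d) - 1 for d in datasets)
    return numerator / denominator


def borrowed_variance_statistic(group1, group2, ext_var):
    """t-like statistic for a two-group comparison in which the variance
    comes from external data rather than from the few arrays at hand."""
    standard_error = sqrt(ext_var * (1 / len(group1) + 1 / len(group2)))
    return (mean(group1) - mean(group2)) / standard_error
```

With only two arrays per group, the within-study variance rests on two degrees of freedom, so a variance borrowed from many external arrays is far more stable; that stability is the source of the improvement the abstract reports.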

2003
Kulldorff M, Tango T, Park PJ. Power Comparisons for Disease Clustering Tests. Computational Statistics and Data Analysis 2003;42(4):665-684.

Many different methods have been proposed to test for geographical disease clustering, and more generally, for spatial clustering of any type of observations while adjusting for an inhomogeneous background population generating the observations. Despite the many proposed test statistics, few formal comparisons have been conducted. We present a collection of 1,220,000 simulated benchmark data sets generated under 51 different cluster models and the null hypothesis, to be used for power evaluations. We then use these data sets to compare the power of the spatial scan statistic, the maximized excess events test and the nonparametric M statistic. All have good power, the first having an advantage for localized hot-spot type clusters and the second for global clustering where randomly located cases generate other cases close by. By making the simulated data sets publicly available, new tests can easily be compared with previously evaluated tests by analyzing the same benchmark data.

Park PJ, Kohane IS, Kim JH. Rank-Based Nonlinear Normalization of Oligonucleotide Arrays. Genomics & Informatics 2003;1(2):94-100.

MOTIVATION: Many have observed a nonlinear relationship between the signal intensity and the transcript abundance in microarray data. The first step in analyzing the data is to normalize it properly, and this should include a correction for the nonlinearity. The commonly used linear normalization schemes do not address this problem. RESULTS: Nonlinearity is present in both cDNA and oligonucleotide arrays, but we concentrate on the latter in this paper. Across a set of chips, we identify those genes whose within-chip ranks are relatively constant compared to other genes of similar intensity. For each gene, we compute the sum of the squares of the differences in its within-chip ranks between every pair of chips as our statistic and we select a small fraction of the genes with the minimal changes in ranks at each intensity level. These genes are most likely to be non-differentially expressed and are subsequently used in the normalization procedure. This method is a generalization of the rank-invariant normalization (Li and Wong, 2001), using all available chips rather than two at a time to gather more information, while using the chip that is least likely to be affected by nonlinear effects as the reference chip. The assumption in our method is that there are at least a small number of nondifferentially expressed genes across the intensity range. The normalized expression values can be substantially different from the unnormalized values and may result in altered down-stream analysis.
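
The per-gene statistic described above (the sum of squared differences in within-chip rank over every pair of chips) can be written down directly. The sketch below is an illustration under simple assumptions: chips are plain lists of intensities in the same gene order, ties are broken by position, and the selection of low-scoring genes within each intensity level is left out. The function names are hypothetical.

```python
from itertools import combinations


def within_chip_ranks(chip):
    """Rank of each gene within one chip (0 = lowest intensity).
    Ties are broken by gene order, a simplification for this sketch."""
    order = sorted(range(len(chip)), key=lambda g: chip[g])
    ranks = [0] * len(chip)
    for rank, gene in enumerate(order):
        ranks[gene] = rank
    return ranks


def rank_change_statistic(chips):
    """For each gene, sum the squared differences of its within-chip ranks
    over every pair of chips. Small values mark genes whose rank is stable
    across chips, i.e. candidates for the normalization set."""
    ranks = [within_chip_ranks(chip) for chip in chips]
    n_genes = len(chips[0])
    return [
        sum((ra[g] - rb[g]) ** 2 for ra, rb in combinations(ranks, 2))
        for g in range(n_genes)
    ]
```

For example, with three chips `[[1.0, 2.0, 3.0], [1.5, 3.5, 2.5], [0.9, 2.1, 3.3]]` the first gene is the lowest on every chip and scores 0, while the other two swap ranks on the second chip and each score 2.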

Kuo WP, Mendez E, Chen C, Whipple ME, Farell G, Agoff N, Park PJ. Functional relationships between gene pairs in oral squamous cell carcinoma. AMIA Annu Symp Proc 2003;:371-5.

We developed a novel method for the discovery of functional relationships between pairs of genes based on gene expression profiles generated from microarrays. This approach examines all possible pairs of genes and identifies those in which the relationship between the two genes changes in different diseases or conditions. In contrast to previous methods that have focused on differentially expressed genes, this method attempts to find changes in the correlation between genes. These changes may be indicative of the functional relationships related to a disease mechanism. We demonstrate the utility of this approach by applying it to an oral squamous cell carcinoma (OSCC) microarray data set. Our results suggest new directions for future experimental investigations.
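
A minimal sketch of the pairwise idea, assuming the "relationship" between two genes is measured by the Pearson correlation and the "change" by the difference in that correlation between two conditions. Both choices, and the function names, are assumptions for illustration; the abstract does not state the exact measure used.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists with nonzero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_changes(cond_a, cond_b):
    """cond_a, cond_b: dict mapping gene name -> expression values measured
    under each condition (same genes in both). Returns (gene1, gene2, change)
    for every gene pair, largest absolute change in correlation first."""
    genes = sorted(cond_a)
    changes = []
    for g, h in combinations(genes, 2):
        change = pearson(cond_b[g], cond_b[h]) - pearson(cond_a[g], cond_a[h])
        changes.append((g, h, change))
    changes.sort(key=lambda item: abs(item[2]), reverse=True)
    return changes
```

Because every pair is examined, the number of comparisons grows quadratically with the number of genes, so in practice the ranked list would need a significance assessment that accounts for multiple testing.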

2002
Park PJ, Butte AJ, Kohane IS. Comparing expression profiles of genes with similar promoter regions. Bioinformatics 2002;18(12):1576-84.

MOTIVATION: Gene regulatory elements are often predicted by seeking common sequences in the promoter regions of genes that are clustered together based on their expression profiles. We consider the problem in the opposite direction: we seek to find the genes that have similar promoter regions and determine the extent to which these genes have similar expression profiles. RESULTS: We use the data sets from experiments on Saccharomyces cerevisiae. Our similarity measure for the promoter regions is based on the set of common mapped or putative transcription factor binding sites and other regulatory elements in the upstream region of the genes, as contained in the Saccharomyces cerevisiae Promoter Database. We pair up the genes with high similarity scores and compare their expression levels in time-course experiment data. We find that genes with similar promoter regions on the average have significantly higher correlation, but it can vary widely depending on the genes. This confirms that the presence of similar regulatory elements often does not correspond to similarity in expression profiles and indicates that finding transcription factor binding sites or other regulatory elements starting with the expression patterns may be limited in many cases. Regardless of the correlation, the degree to which the profiles agree under different experimental conditions can be examined to derive hypotheses concerning the role of common regulatory elements. Overall, we find that considering the relationship between the promoter regions and the expression profiles starting with the regulatory elements is a difficult but useful process that can provide valuable insights.
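
The first step, pairing genes whose promoter regions share regulatory elements, might be sketched as below. The Jaccard index is used here purely as a stand-in similarity measure; the abstract says only that the measure is based on the set of common elements, so the formula, the threshold, and the function names are assumptions.

```python
from itertools import combinations


def site_similarity(sites_a, sites_b):
    """Jaccard index of two sets of regulatory-element names
    (an assumed stand-in for the paper's similarity measure)."""
    union = sites_a | sites_b
    return len(sites_a & sites_b) / len(union) if union else 0.0


def similar_promoter_pairs(promoters, threshold):
    """promoters: dict mapping gene name -> set of element names found in
    its upstream region. Returns the gene pairs whose similarity reaches
    the threshold, as candidates for an expression-profile comparison."""
    return [
        (g, h)
        for g, h in combinations(sorted(promoters), 2)
        if site_similarity(promoters[g], promoters[h]) >= threshold
    ]
```

The expression profiles of each returned pair would then be compared, for instance by correlating their time-course measurements, to see how far promoter similarity carries over to expression similarity.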

Kuo WP, Jenssen T-K, Park PJ, Lingen MW, Hasina R, Ohno-Machado L. Gene expression levels in different stages of progression in oral squamous cell carcinoma. Proc AMIA Symp 2002;:415-9.

Oral squamous cell carcinoma (OSCC) is one of the most common cancer types worldwide. The prognosis for patients with this disease is generally poor and little is known about its progression. Gene expression studies may provide important insights into the molecular mechanisms of this disease. We analyzed gene expression data from a small panel of patients diagnosed with OSCC. Even with only 13 patient samples we were able to find genes with significant differences in expression levels between normal, dysplasia, and cancer samples. The largest differences in expression were generally found between normal and cancer samples, but significant differences were also found for several genes between dysplasia and the other two sample types. We also represent the significance levels of differentially expressed genes on the chromosome domain. The genes and genetic features we examine are potentially important factors on the molecular level in the progression of OSCC.

Park PJ, Tian L, Kohane IS. Linking gene expression data with patient survival times using partial least squares. Bioinformatics 2002;18 Suppl 1:S120-7.

There is an increasing need to link the large amount of genotypic data, gathered using microarrays for example, with various phenotypic data from patients. The classification problem in which gene expression data serve as predictors and a class label phenotype as the binary outcome variable has been examined extensively, but there has been less emphasis on dealing with other types of phenotypic data. In particular, patient survival times with censoring are often not used directly as a response variable due to the complications that arise from censoring. We show that the issues involving censored data can be circumvented by reformulating the problem as a standard Poisson regression problem. The procedure for solving the transformed problem is a combination of two approaches: partial least squares, a regression technique that is especially effective when there is severe collinearity due to a large number of predictors, and generalized linear regression, which extends standard linear regression to deal with various types of response variables. The linear combinations of the original variables identified by the method are highly correlated with the patient survival times and at the same time account for the variability in the covariates. The algorithm is fast, as it does not involve any matrix decompositions in the iterations. We apply our method to data sets from lung carcinoma and diffuse large B-cell lymphoma studies to verify its effectiveness.

2001
Park PJ, Pagano M, Bonetti M. A nonparametric scoring algorithm for identifying informative genes from microarray data. Pac Symp Biocomput 2001;6:52-63.

Microarray data routinely contain gene expression levels of thousands of genes. In the context of medical diagnostics, an important problem is to find the genes that are correlated with given phenotypes. These genes may reveal insights into biological processes and may be used to predict the phenotypes of new samples. In most cases, while the gene expression levels are available for a large number of genes, only a small fraction of these genes may be informative in classification with statistical significance. We introduce a nonparametric scoring algorithm that assigns a score to each gene based on samples with known classes. Based on these scores, we can find a small set of genes which are informative of their class, and subsequent analysis can be carried out with this set. This procedure is robust to outliers and different normalization schemes, and immediately reduces the size of the data with little loss of information. We study the properties of this algorithm and apply it to the data set from cancer patients. We quantify the information in a given set of genes by comparing its distribution of the score statistics to a set of distributions generated by permutations that preserve the correlation structure among the genes.
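
The abstract does not define the score itself, so the sketch below substitutes a simple rank-based stand-in: the number of (class 0, class 1) sample pairs that are out of order, folded so that 0 means perfect separation in either direction. The permutation step shuffles the class labels once per round and applies the same shuffled labels to every gene, which is one way to keep the correlation structure among genes intact. The function names and the score are assumptions, not the published algorithm.

```python
import random


def separation_score(values, labels):
    """Stand-in nonparametric score: count the (class-0, class-1) sample
    pairs in which the class-0 value is larger, then fold so that 0 means
    the two classes are perfectly separated in either direction.
    Labels are 0 or 1; the sketch assumes no tied expression values."""
    class0 = [v for v, lab in zip(values, labels) if lab == 0]
    class1 = [v for v, lab in zip(values, labels) if lab == 1]
    inversions = sum(1 for a in class0 for b in class1 if a > b)
    return min(inversions, len(class0) * len(class1) - inversions)


def permutation_null(data, labels, n_perm, seed=0):
    """data: dict mapping gene name -> expression values across samples.
    Each round shuffles the sample labels once and rescores every gene with
    the same shuffled labels, yielding one sorted list of scores per round."""
    rng = random.Random(seed)
    null_distributions = []
    for _ in range(n_perm):
        shuffled = list(labels)
        rng.shuffle(shuffled)
        scores = sorted(separation_score(v, shuffled) for v in data.values())
        null_distributions.append(scores)
    return null_distributions
```

Because the score depends only on the ordering of the values, it is unaffected by monotone rescaling and is little influenced by a single extreme value, in line with the robustness to outliers and normalization schemes noted above.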
