Classification Methods for Temporal Gene Expressions and High-Dimensional Data
Temporal gene expression data are of particular interest to researchers because they contain rich information for characterizing gene function and have been widely used in biomedical studies and early cancer detection. Dense temporal gene expression data in bacteria show that gene expression follows different patterns under different biological conditions. In contrast to the rich literature on estimating gene expression over time under a given condition, few researchers have considered identifying the different effects of multiple conditions on gene expression profiles. In this thesis we investigate the effects of multiple conditions on gene expression and then classify the conditions according to their effects on gene profiles. We propose a linear regression model and use properties of the log-normal distribution to characterize the variance function of genes under a given condition. Then, based on the estimated parameters, a chi-square test is proposed to test the equality of the variance functions across multiple conditions. Furthermore, the Mahalanobis distance is used to classify the conditions. Most temporal gene expressions vary with time. However, after an initial period, some genes exhibit a kind of stability: their expression stabilizes after a specific time point, called the threshold time point, after which measurements are nearly constant or fluctuate only slightly. The threshold time point of a gene expression can be used to decide how long the behaviour of the gene needs to be measured, and different threshold time points can be used to distinguish the behaviours of gene expressions. For this reason, we use the first and second relative change rates to reduce the dimension of time points. In addition, canonical correlation is another measure that can be used to classify gene expressions.
However, for this measure we usually have to deal with a large number of time points. The sample canonical correlations cannot be used because the sample covariance matrices are rank deficient. We therefore explore alternative estimators of the covariance matrices and study their accuracy. Simulation studies are performed to assess the performance of the methods introduced in this thesis. They indicate that the model-based variance function is accurate and that the test based on the Wald statistic performs well. They also indicate which of the introduced dimension reduction methods performs better on gene expression data. Finally, different scenarios are simulated to determine which of the model-based covariance structures performs better for estimating canonical correlations.
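
The rank deficiency mentioned above can be illustrated with a minimal sketch. It assumes simulated Gaussian data with more time points than samples. It also uses a ridge-regularized covariance as a stand-in for the alternative estimators; this is an illustrative assumption, not one of the model-based covariance structures studied in the thesis. The function names, the dimensions, and the value of alpha are hypothetical.

```python
import numpy as np


def canonical_correlations(s_xx, s_yy, s_xy):
    """Canonical correlations from covariance blocks via whitening and SVD."""
    # Cholesky factorization requires positive definite blocks, which is
    # exactly what fails for a rank-deficient sample covariance.
    l_x = np.linalg.cholesky(s_xx)
    l_y = np.linalg.cholesky(s_yy)
    # K = L_x^{-1} S_xy L_y^{-T}; its singular values are the canonical correlations.
    k = np.linalg.solve(l_x, s_xy)
    k = np.linalg.solve(l_y, k.T).T
    return np.linalg.svd(k, compute_uv=False)


def ridge_covariance(s, alpha):
    """Hypothetical stand-in estimator: add alpha to the diagonal of s."""
    return s + alpha * np.eye(s.shape[0])


# Simulated setting (assumed): n samples, p and q time points, with n < p = q.
rng = np.random.default_rng(0)
n, p, q = 10, 30, 30
x = rng.normal(size=(n, p))
y = rng.normal(size=(n, q))

# Joint sample covariance of (x, y), split into blocks.
s = np.cov(np.hstack([x, y]), rowvar=False)
s_xx, s_yy, s_xy = s[:p, :p], s[p:, p:], s[:p, p:]

# In exact arithmetic the rank of s_xx is at most n - 1, so s_xx is singular.
rank_xx = np.linalg.matrix_rank(s_xx)

# With the regularized blocks the canonical correlations become computable.
alpha = 0.1
rho = canonical_correlations(
    ridge_covariance(s_xx, alpha), ridge_covariance(s_yy, alpha), s_xy
)
```

The regularization keeps the joint covariance positive definite, so the resulting values lie strictly between 0 and 1. Their magnitude depends on the chosen alpha, which is why the accuracy of any such alternative estimator has to be checked by simulation.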