The University of Adelaide
December 2018

Ms Yasmine Flint

Honours graduate


Honours thesis

Clustering for large, messy data sets

The data that motivated this thesis come from samples of grapevines taken over 17 consecutive weeks. They consist of expression levels for 16,602 genes (obtained from microarray analyses), with one level per gene per week, so each observation has 17 dimensions. We can think of the data as 16,602 expression profiles, where the expression profile for each gene is a time course of 17 expression levels. Many processes occur as the grapevines grow, and we expect multiple genes to be associated with each process. Genes associated with a particular process should have similar expression profiles, so if we were able to group these genes together we might be able to reduce the complexity of the data. One way to group genes with similar expression profiles is to use clustering techniques. Applying standard methods such as K-Means and K-Medoids clustering did not produce useful results, most likely because of the high dimensionality of the data, the large number of genes and the possible noise in the data. A major issue was that some of the clusters produced by these algorithms contained very few genes, while clustering the data with fewer clusters gave clusters that were quite diffuse (that is, the observations within them were quite dissimilar to each other). In this thesis we consider a modified version of K-Means clustering, in which we create a special cluster that holds genes that simply do not fit into any other cluster. We then explore the effectiveness of this new algorithm on simulated data sets and on a subset of the vine data.
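
The idea of a special cluster for genes that do not fit anywhere can be illustrated with a minimal Python sketch. This is not the algorithm developed in the thesis, whose details are not given on this page. It assumes one simple rule: a profile joins its nearest centroid only if the Euclidean distance is at most a fixed cut-off, and is otherwise placed in the special cluster, labelled -1. The function name kmeans_with_leftover and the parameters threshold and init are hypothetical and chosen for illustration only.

```python
import math
import random


def kmeans_with_leftover(profiles, k, threshold, init=None, n_iter=50, seed=0):
    """K-Means variant with an extra 'leftover' cluster (label -1).

    Assumed rule (not taken from the thesis): a profile is assigned to its
    nearest centroid only if that distance is <= threshold; otherwise it
    goes to the leftover cluster and does not influence any centroid.
    """
    rng = random.Random(seed)
    if init is None:
        # Start from k distinct profiles picked at random.
        init = rng.sample(profiles, k)
    centroids = [list(c) for c in init]
    labels = None
    for _ in range(n_iter):
        new_labels = []
        for p in profiles:
            dists = [math.dist(p, c) for c in centroids]
            j = min(range(k), key=dists.__getitem__)
            new_labels.append(j if dists[j] <= threshold else -1)
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy data: two tight groups and one profile far from both.
toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, -50)]
labels, centroids = kmeans_with_leftover(
    toy, k=2, threshold=3.0, init=[(0, 0), (10, 10)]
)
print(labels)
```

In this toy run the distant profile is left out of both ordinary clusters, so it cannot pull a centroid towards itself or end up as a tiny cluster of its own. In practice the cut-off would have to be chosen with care, for example from the distribution of distances to the nearest centroid.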