Ms Yasmine Flint
Clustering for large, messy data sets
The data that motivated this thesis is a series of samples of grapevines taken over 17
consecutive weeks. It consists of expression levels for 16,602 genes (obtained from
microarray analyses) and has 17 dimensions. We can think of the data as 16,602
expression profiles, where the expression profile for each gene is a time course of 17
weekly measurements.
There are many processes occurring as the grapevines grow and we expect multiple
genes to be associated with each process. We would expect those genes associated
with a particular process to have similar expression profiles, so if we were able
to group these genes together we might be able to reduce the complexity of the
data. Clustering techniques offer one way to group together genes that have similar
expression profiles.
Applying standard methods such as K-Means and K-Medoids clustering did not
produce useful results, most likely because of the high dimensionality of the data,
the large number of genes and the possible noise in the data. A major issue was that
some of the clusters produced by these algorithms contained only a very small
number of genes. When we clustered the data with fewer clusters, the resulting
clusters were quite diffuse (that is, the observations within them were quite dissimilar
to each other).
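The two symptoms described above (very small clusters and diffuse clusters) can be checked with a short diagnostic. The following is a minimal sketch only, assuming plain Lloyd's K-Means with Euclidean distance; the function names `kmeans` and `cluster_summary` are hypothetical and are not taken from the thesis.

```python
import math
import random


def nearest(p, centroids):
    """Index of the centroid closest to point p (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's K-Means on a list of tuples; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct observations chosen at random.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        # Assignment step: each profile joins its nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # Final assignment so that labels match the returned centroids.
    labels = [nearest(p, centroids) for p in points]
    return labels, centroids


def cluster_summary(points, labels, centroids):
    """Per cluster: (size, mean distance of members to the centroid)."""
    summary = []
    for j, c in enumerate(centroids):
        dists = [math.dist(p, c) for p, lab in zip(points, labels) if lab == j]
        summary.append((len(dists), sum(dists) / len(dists) if dists else 0.0))
    return summary
```

In such a summary, a cluster with a very small size corresponds to the first problem, and a large mean distance to the centroid corresponds to a diffuse cluster.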
In this thesis we will consider a modified version of K-Means clustering, in which
we create a special cluster that holds genes that simply do not fit into any other
cluster. We will then explore the effectiveness of this new algorithm on simulated
data sets and on a subset of the vine data.
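The idea of a special cluster can be sketched as follows. This is an illustration only and not the algorithm developed in the thesis: the abstract does not state the rule for sending a gene to the special cluster, so the sketch assumes a simple distance threshold. The names `kmeans_with_noise_cluster`, `max_dist`, `init` and `NOISE` are all hypothetical.

```python
import math
import random

NOISE = -1  # hypothetical label for the special "does not fit" cluster


def kmeans_with_noise_cluster(points, k, max_dist, n_iter=50, seed=0, init=None):
    """K-Means variant with an extra cluster for poorly fitting points.

    Assumption (not from the source): a point goes to the special cluster
    when its nearest centroid is further away than max_dist.
    """
    rng = random.Random(seed)
    if init is None:
        centroids = [tuple(p) for p in rng.sample(points, k)]
    else:
        centroids = [tuple(p) for p in init]

    def assign(p):
        # Nearest centroid, or NOISE if even that one is too far away.
        j = min(range(k), key=lambda i: math.dist(p, centroids[i]))
        return j if math.dist(p, centroids[j]) <= max_dist else NOISE

    for _ in range(n_iter):
        labels = [assign(p) for p in points]
        # Centroids are updated from retained points only, so points in
        # the special cluster do not pull the centroids towards them.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    labels = [assign(p) for p in points]
    return labels, centroids


# Two tight groups plus one point that fits neither.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
          (50.0, 50.0)]
labels, centroids = kmeans_with_noise_cluster(
    points, k=2, max_dist=3.0, init=[(0.0, 0.0), (10.0, 10.0)])
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Under this assumed rule the choice of `max_dist` controls the trade-off: a small value keeps clusters tight but sends more genes to the special cluster, while a large value recovers ordinary K-Means.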