Defense Date

2006

Document Type

Dissertation

Degree Name

Doctor of Philosophy

Department

Biostatistics

First Advisor

Dr. Robert E. Johnson

Abstract

Methods for determining clusters of data under- specified constraints have recently gained popularity. Although general constraints may be used, we focus on clustering methods with the constraint of a minimal cluster size. In this dissertation, we propose two constrained k-means algorithms: Linear Programming Algorithm (LPA) and Genetic Constrained K-means Algorithm (GCKA). Linear Programming Algorithm modifies the k-means algorithm into a linear programming problem with constraints requiring that each cluster have m or more subjects. In order to achieve an acceptable clustering solution, we run the algorithm with a large number of random sets of initial seeds, and choose the solution with minimal Root Mean Squared Error (RMSE) as our final solution for a given data set. We evaluate LPA with both generic data and simulated data and the results indicate that LPA can obtain a reasonable clustering solution. Genetic Constrained K-Means Algorithm (GCKA) hybridizes the Genetic Algorithm with a constrained k-means algorithm. We define Selection Operator, Mutation Operator and Constrained K-means operator. Using finite Markov chain theory, we prove that the GCKA converges in probability to the global optimum. We test the algorithm with several datasets. The analysis shows that we can achieve a good clustering solution by carefully choosing parameters such as population size, mutation probability and generation. We also propose a Bi-Nelder algorithm to search for an appropriate cluster number with minimal RMSE.

Rights

© The Author

Is Part Of

VCU University Archives

Is Part Of

VCU Theses and Dissertations

Date of Submission

June 2008

Included in

Biostatistics Commons

Share

COinS