Clustering Assignment

This assignment is worth 17% of your overall Homework and Programming Assignments grade.

In this assignment, you are to implement both the k-means algorithm and the Hierarchical Agglomerative Clustering (HAC) algorithm. For both, you may assume that all features are real-valued and that no normalization of the features is needed. Use the Ln-norm (Minkowski distance) for distance calculations, with a default of n=2 (Euclidean distance).
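For reference, here is a minimal sketch of such a distance function in Python; the name ln_distance and its signature are illustrative only, not a requirement:

    import numpy as np

    def ln_distance(x, y, n=2):
        # L_n-norm (Minkowski) distance between two real-valued feature vectors.
        # n=2 gives the default Euclidean distance; n=1 gives the Manhattan distance.
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        return float(np.sum(np.abs(x - y) ** n) ** (1.0 / n))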

Note that the datasets on which you will test your algorithms contain labeled items. You will need to ignore the label (the target attribute, which is always the last attribute here) while clustering.
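A minimal loading sketch, assuming a comma-separated file whose last column is the label (the file name iris.data is illustrative):

    import numpy as np

    # Illustrative only: assumes comma-separated data with the label in the last column.
    raw = np.loadtxt("iris.data", delimiter=",", dtype=str)
    X = raw[:, :-1].astype(float)   # real-valued features used for clustering
    labels = raw[:, -1]             # target attribute -- ignored while clustering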

  1. Implement the k-means clustering algorithm and the HAC algorithm (using single linkage).
    • For k-means, your program should automatically try 2 <= k <= 7 and compute the squared error in each case. It should then return the value of k that produces the lowest squared error, together with that error (see the squared-error sketch after this list).
    • For HAC, you should compute (and store) the squared errors of all possible clusterings (as they are built from the bottom up). Upon completion, you should return the value of t (distance threshold) that produces the lowest squared error, the corresponding number of clusters, and the corresponding error.
  2. Use your algorithms on the Iris problem.
    • Compare the best number of clusters obtained by k-means and HAC. How do these also compare with the underlying structure of the dataset in which there are 3 classes of iris plants?
    • Experiment with your distance metric -- can you find a value of n for the Ln-norm that changes the number of clusters found?
    • Graph the value of the squared error for each clustering as HAC executes (with no threshold). What do you observe? Is this surprising?
  3. Use your algorithms on the Balance problem.
    • Run HAC on the full dataset. Record the threshold value that produces the lowest squared error, together with the corresponding number of clusters and the corresponding error. Call these tf, kf and ef, respectively.
    • Draw a random sample from the dataset and run HAC on it. Record the value of the best threshold (call it ts).
    • Run HAC on the full dataset again, this time using the threshold ts obtained above on the sample. Record the corresponding number of clusters (ks) and the corresponding error (es).
    • Compare tf, kf and ef with ts, ks and es. Discuss your findings.
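As a point of reference for the parts above, here is a minimal sketch of the squared-error computation and the k sweep for k-means. It assumes a hypothetical kmeans(X, k) function (your own implementation from part 1) that returns a cluster assignment for each row of X:

    import numpy as np

    def squared_error(X, assignments):
        # Sum of squared Euclidean distances from each point to its cluster centroid.
        total = 0.0
        for c in np.unique(assignments):
            members = X[assignments == c]
            centroid = members.mean(axis=0)
            total += np.sum((members - centroid) ** 2)
        return total

    def best_k(X, kmeans, k_range=range(2, 8)):
        # Try each k, score it with squared_error, and keep the best.
        errors = {k: squared_error(X, kmeans(X, k)) for k in k_range}
        k_best = min(errors, key=errors.get)
        return k_best, errors[k_best]

The same squared_error function can be used to score each clustering produced as HAC merges clusters from the bottom up.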

Turn in a thoughtful, well-written report (see the guidelines above) that details your experiments and addresses the questions posed above. Review each part carefully to make sure you have covered every question.