## Assignment Specifications

### Guidelines

Each written assignment is limited to no more than four single-sided, single-spaced pages (the first should require significantly less) using 12 point font, including all figures. Figures are not to be hand-drawn and should be large enough to be easily legible. Anything beyond the four page limit will not be graded! You must work to include all important information in the allotted four pages. Communicating clearly and concisely what you have to say is an important skill you will use throughout your career. NOTE: All written assignments are to be done on a word processor and be neat and professional.

Lab assignments will be graded 40% on positive completion, 50% on technical accuracy and insight and 10% on writing and presentation. Most parts of the assignments are fairly objective and for these, you are expected to be correct. A few questions are more open-ended and subjective, and in these cases you will be graded based on perceived effort and understanding. Good writing, grammar, punctuation, etc. are extremely important because of the great effect they have on the impact of your work.

## Wiki Assignment

This assignment is worth 5% of your overall Homework and Programming Assignments grade.

Do some surfing on the web and find at least two examples of recent, interesting machine learning news, applications, advances, etc. Make an entry on the course wiki that briefly describes each, gives your opinions, etc. and points interested readers to more information. Email the TAs so they can verify your work.

## Linear Models Assignment

This assignment is worth 17% of your overall Homework and Programming Assignments grade.

In this assignment, you are to implement the simple Logistic Regression algorithm for a single independent variable (IV) and the Perceptron algorithm for the general case of multiple independent variables. You may assume that data is provided to you in 2-column (n-column) format, where the first column(s) is(are) the IV(s) and the second(last) one is the dependent variable (DV). The first row contains the number of independent variables and the second row contains the name(s) of the variables. Each subsequent row corresponds to a single observation. Hence, you may need to aggregate the values of the DV (i.e., counts) for each value of the IV before doing the regression.

1. Implement Logistic Regression for a single IV.
• Output the values of the intercept (w0) and the slope (w1) of the model.
• Output the value of R2 from the probabilities (i.e., compute the SSE and TSS on probabilities rather than on odds).
2. Use your algorithm on this (modified) Coronary Heart Disease (CHD) problem.
• First, use simple linear regression (no logit function) to build a linear model of the data
• Report the parameters of the model (w0, w1 and R2).
• Plot the original (probability) data and graph the linear model your program produced.
• What does the model predict for the probability of someone 41 years old suffering of CHD?
• Next, use logistic regression to build a nonlinear model of the data.
• Report the parameters of the model (w0, w1 and R2).
• Plot the original (probability) data and graph the nonlinear model your program produced.
• What does the model predict for the probability of someone 41 years old suffering of CHD?
3. Generalize your program to handle multiple independent variables by implementing the Perceptron Algorithm.
• Output the values of the perceptron weights (w0, w1, ..., wn).
• Output the value of R2 (i.e., compute the SSE and TSS for the output of the perceptron)
• Use your algorithm on the CHD problem.
• Report the parameters of the model (weights and R2).
• Plot the SSE over time for 100 epochs of training for a learning rate that is too large (no convergence).
• Plot the SSE over time for 100 epochs of training for a learning rate that is small enough (asymptotic error).
• Plot the original (probability) data and graph the nonlinear model your program produced.
• How do the three models you produced for the CHD problem compare?
• Use your algorithm on this (modified) Iris problem.
• Report the parameters of the model (weights and R2).
• Plot the SSE over time for 1000 epochs of training for a learning rate that is too large (no convergence).
• Plot the SSE over time for 1000 epochs of training for a learning rate that is small enough (asymptotic error).
• Report the predictions your model makes for the following data (vectors independent variables).

Turn in a thoughtful, well-written written report (see the guidelines above) that details your experiments and addresses the questions posed above (look carefully at everything to make sure you've covered all the parts of each).

## Clustering Assignment

This assignment is worth 17% of your overall Homework and Programming Assignments grade.

In this assignment, you are to implement both the k-means algorithm and the Hierarchical Agglomerative Clustering (HAC) algorithm. For both, you may assume that all features are real-valued. You may also assume that there is no need for normalization of the features. Use the Ln-norm for distance calculations with a default value of n=2 (Euclidean).

Note that the datasets you are to test your algorithms with contain labeled items. You will need to ignore the label (target attribute, always last here) while clustering.

1. Implement the k-means clustering algorithm and the HAC algorithm (using single linkage).
• For k-means, your program should automatically try 2 <= k <= 7 and compute the squared error in each case. You should then return the value of k that produces the lowest squared error, together with that error.
• For HAC, you should compute (and store) the squared errors of all possible clusterings (as they are built from the bottom up). Upon completion, you should return the value of t (distance threshold) that produces the lowest squared error, the corresponding number of clusters, and the corresponding error.
2. Use your algorithms on the Iris problem.
• Compare the best number of clusters obtained by k-means and HAC. How do these also compare with the underlying structure of the dataset in which there are 3 classes of iris plants?
• Experiment with your distance metric -- can you find a value of n for the Ln-norm that changes the number of clusters found?
• Graph the value of the squared error for each clustering as HAC executes (with no threshold). What do you observe? Is this surprising?
3. Use your algorithms on the Balance problem.
• Run HAC on the full dataset. Record the threshold value that produces the lowest squared error, together with the corresponding number of clusters and the corresponding error. Call these tf, kf and ef, respectively.
• Run HAC on this sample. Record the value of the best threshold (call it ts).
• Run HAC on the full dataset again, but using the threshold (ts) obtained above on the sample. Record the values of the the corresponding number of clusters (ks) and the corresponding error (es).
• Compare tf, kf and ef with ts, ks and es. Discuss your findings.

Turn in a thoughtful, well-written written report (see the guidelines above) that details your experiments and addresses the questions posed above (look carefully at everything to make sure you've covered all the parts of each).

## Apriori Assignment

This assignment is worth 17% of your overall Homework and Programming Assignments grade.

In this assignment, you are to implement the Apriori algorithm. Your implementation should allow the user to specify a minimum support threshold (minsup), a minimum confidence threshold (minconf), and a maximum number of rules to display at a time (maxrules).

1. Implement the Apriori algorithm.
2. Use your algorithm on the Binarized Lenses problem.
• Run Apriori for 0.1 <= minsup <= 0.8 and 0.1 <= minconf <= 0.6, using increments of 0.1 (i.e., this means you should run the algorithm 48 times).
3. Use your algorithm on the Mirror Symmetry problem.
• Run Apriori for various combinations of minsup and minconf values.
• This is an artificial problem. Each attribute represents a bit position in a string of 30 bits: Lmost, Lmost1, ..., Lmost14, Rmost14, Rmost13, ..., Rmost1, Rmost and the attribute Symm is 1 if the pattern is symmetric about its center, and 0 otherwise. Given this interpretation, do any of the rules discovered by your Apriori algorithm make sense?
• Design your task so that it contains some simple associations you can check your algorithm against. List these associations.
• Run Apriori for various combinations of minsup and minconf values.
• Verify that the associations you designed into the task are discovered by your algorithm.
• Any surprises?

Turn in a thoughtful, well-written written report (see the guidelines above) that details your experiments and addresses the questions posed above (look carefully at everything to make sure you've covered all the parts of each).

## WEKA Exercise

This assignment is worth 10% of your overall Homework and Programming Assignments grade.

Perform the following activities:

2. In the Explorer application, open the CardiologyCategorical.csv data file and do the following:
1. Build a neural network (using the MLP algorithm) that predicts whether a patient has a heart condition. Record the 10-fold cross-validation accuracy of your model as A1.
2. Create a new attribute coarseBloodPressure, with values: Low if blood pressure is less than or equal to 120, Normal if blood pressure is greater than 120 but less than or equal to 150, and High if blood pressure is greater than 150. (You may need to be a little creative here in the use of (unsupervised) filters, possibly having to first create a copy of blood pressure with the new name and then transforming it as per the foregoing; Filters AddExpression and MathExpression may prove particularly useful).
3. Build a neural network (using the MLP algorithm) that predicts whether a patient has a heart condition, using the new attribute coarseBloodPressure instead of the original blood pressure. Record the 10-fold cross-validation accuracy of your model as A2.
4. Compare A1 and A2. Any comments?
5. Remove all records whose value of the attribute resting ecg is Abnormal.
6. Construct a decision tree (using the J48 Algorithm) that predicts whether a patient has a heart condition, given the attributes age, sex, chest pain type, coarseBloodPressure, angina, peak, and slope. Save the result buffer and insert the confusion matrix obtained with 10-fold cross-validation in your report.
3. In the Explorer application, open the cpu.arff data file and do the following:
1. Cluster the data (using the simple k-Means algorithm, with k=3) and report on the nature and composition of the extracted clusters.
2. Discretize the attributes MMIN, MMAX, CACH, CHMIN and CHMAX using 3 buckets in one step. Find associations among these attributes only (i.e., remove the other ones), using the Apriori algorithm, with support 0.2, confidence 0.95 and top 3 rules only being displayed. Save the result buffer and insert it in your report.
3. Using the original data, list the eigenvalues associated with the attributes selected by the Principal Components Analysis method, when the amount of variance you wish covered by the subset of attributes is 75%.

Turn in a thoughtful, well-written written report (see the guidelines above) that details your experiments and addresses the questions posed above (look carefully at everything to make sure you've covered all the parts of each).

## WEKA Lab 1

This assignment is worth 17% of your overall Homework and Programming Assignments grade.

1. Become familiar with the ionosphere data set and use it to perform the following experiments:
1. Use IB1 and IBk (for k up to 9) to classify the data using cross validation. Report your results. Which value of k is best?
2. Use PCA to try differing numbers of attributes (1-8). For each number of attributes, perform the same classification experiments that you did with the original data. Report your results (as an 8x9 matrix). Which value of k is best? How does performance change with different numbers of attributes? Which number of attributes is best? Can you explain which attributes these are?
3. Given the best number of (PCA) attributes, choose various subsets (of the same size) of the original attributes. Perform the IBk experiments on these subsets and report and compare results with those obtained using the PCA attributes. What do you see?
4. Show a visual representation of the decision surface with 2 attributes (derived from PCA) and k=1.
2. Become familiar with the vowel data set and use it to perform the following experiments:
1. Remove the first three attributes as well as the class attribute.
2. Cluster the data using the simple k-Means algorithm, with values of k from 1 to 12. What do you see?
3. Now add the class attribute back in and repeat the clusterings, comparing the clusterings with the class. How well does the clustering appear to correlate with class? What might this mean?
4. Choose several different classifiers and use them to classify the data. How does their performance compare with the clustering's "performance"? Is this something you might expect?
5. Does adding back in any of the original first three attributes have any effect on either the clustering or the classification performance?
3. Design your own experiment, perform it and report on it:
1. Find an interesting data set and briefly describe it. You might look here or here. Or, you can come up with your own some other way. A brief description of ARFF (a file format that WEKA knows) may be helpful.
2. Come up with an interesting experiment (experiments) to do using your data set and briefly describe it (them).
3. Perform your experiment(s), report your results and draw conclusions (what did you learn? what does it mean? does it make sense? what else could be done? etc).

Turn in a thoughtful, well-written written report (see the guidelines above) that details your experiments and addresses the questions posed above (look carefully at everything to make sure you've covered all the parts of each). What you say is important. Why you say it is even more important.

## WEKA Lab 2

This assignment is worth 17% of your overall Homework and Programming Assignments grade.

1. In this lab you will investigate the difference in model performance using statistical significance testing. We will compare three models (decision tree, multi-layer perceptron and SVM) on two different data sets (ionosphere and vowel), and perform a pairwise comparison of the models on each data set (so, a total of six experiments). For the vowel data, preprocess the data by removing the first three attributes as we did in the previous assignment.
2. For both data sets, compare the performance of a decision tree vs. a multi-layer perceptron, a decision tree vs. a support vector machine, and a multi-layer perceptron vs. a support vector machine using a paired permutation test. For each model, describe parameter settings/design decisions you make in acquiring your data (so that your experiments are replicable).
3. For the permutation tests, obtain 10 paired accuracy estimates (the best and easiest way to do this is 10-fold cross-validation) and perform an exhaustive permutation test.
4. You can collect accuracy estimates using the Experimenter in WEKA, dumping the results to a file and using the appropriate column of the file. You will need to implement the paired permutation test yourself.
5. Report the p value for each experiment and draw conclusions. Can you make any assertions about the accuracy of one model over another on the ionosphere data (that is, can you say that differences in model accuracy are statistically significant)? On the vowel data? Can you say anything about how these models might compare on other data sets? What can you say about differences in the performance of the models in general?

Turn in a thoughtful, well-written written report (see the guidelines above) that details your experiments and addresses the questions posed above (look carefully at everything to make sure you've covered all the parts of each).