CS 478 Projects and Assignments

 

Projects are to be done on a word processor and should be neat and professional. Good writing, grammar, punctuation, etc. are important, and points will be taken off if these are lacking.  Each written project report should be about 4-5 pages and is limited to no more than five single-sided, single-spaced pages in 12-point font, including all graphs and figures.  (For the backpropagation report you may use 6 pages if absolutely necessary.) Communicating clearly and concisely what you have to say is an important skill you will use throughout your career. Figures are not to be hand-drawn and should be large enough to be easily legible. All assignments should be e-mailed as PDF files to the TA e-mail address shown on the homepage.  Put the assignment name in the subject line.  If you have a question for the TAs, send it to the same address but put "Question" in the subject line so they can respond more readily. Except where specifically stated otherwise, all reports are due by midnight of the published due date.  By midnight we mean the very end of the published due date.

 

Programming projects will be graded primarily on positive completion and technical accuracy in the discussion.  Most parts of the projects require a clear, measurable result (e.g. a working program, graphs of results, etc.) and include some questions where you analyze your results.  Most of the questions have objective answers which you are expected to get right.  A few questions are more open-ended and subjective, and in these cases you will be graded on perceived effort and thought.

 

·          Instructions and source code for the toolkit can be found here.

·          A collection of data sets already in the ARFF format can be found here.

·          Details on ARFF are found here.

·          The complete and continually updated UC Irvine Machine Learning Repository is here.  This is a good place to go to study more about the problems used in your assignments and also to find new problems.

·          To help you debug the projects, we have included some small examples and other hints, with actual learned hypotheses, so that you can compare your code's results against them and verify that your code is working properly.

 

Intro Assignment

 

Answer problems 1.1, 1.2, and 1.4 from the text. This assignment is due before class on the published due date. 

 

Perceptron Project

 

1.  (25%) Correctly implement the perceptron learning algorithm and integrate it with the toolkit.  Before implementing your perceptron, review the remaining requirements for this project so that you implement it in a way that will support them.  In particular, note that you will be supporting tasks with non-binary outputs, which require multiple output nodes.  Include your source code as a separate attachment in this and all labs.
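
For reference, here is a minimal sketch of the perceptron weight update in Python, assuming a simple NumPy array representation rather than the toolkit's actual interface (the function name and parameters are illustrative only):

    import numpy as np

    def perceptron_epoch(X, targets, weights, lr=0.1):
        # One pass of the perceptron rule: w += lr * (t - z) * x.
        # X is (n_instances, n_features); a bias input of 1 is appended here.
        # Assumes binary 0/1 targets; multi-class tasks need one node per class.
        for x, t in zip(X, targets):
            x = np.append(x, 1.0)                       # bias input
            net = np.dot(weights, x)
            z = 1 if net > 0 else 0                     # threshold output
            weights = weights + lr * (t - z) * x        # no change when prediction is correct
        return weights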

 

2.  (5%) Create two ARFF files, each with 8 instances that use 2 real-valued inputs (ranging between -1 and 1), with 4 instances from each class.  One should be linearly separable and the other should not be.
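
As an illustration of the expected format, a linearly separable file might look something like the following (the attribute names and data values here are made up; choose your own points):

    @relation linearly-separable
    @attribute x1 real
    @attribute x2 real
    @attribute class {A, B}
    @data
    -0.8,  0.3, A
    -0.5, -0.6, A
    -0.2,  0.9, A
    -0.9, -0.1, A
     0.7,  0.2, B
     0.4, -0.5, B
     0.9,  0.8, B
     0.3, -0.9, B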

 

3.  (10%) Train on both sets (the entire sets) with the perceptron rule.  Try it with a couple of different learning rates and discuss the effect of the learning rate, including how many epochs are completed before stopping.  (For these situations the learning rate should have minimal effect, unlike in the backpropagation lab.)  Discuss the differences in outcome with the linearly separable vs. the non-linearly separable set.

 

The basic stopping criterion for these models is to stop training when you are no longer making significant progress; for example, when you have gone a number of epochs with no significant improvement in accuracy.  Describe your specific stopping criterion.  Do not simply stop at the first epoch in which no improvement occurs.  Use a learning rate of .1 for experiments 4-6 below.
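
One possible form for such a stopping criterion is sketched below (the window size and improvement threshold are arbitrary choices that you should pick and justify yourself):

    def should_stop(accuracy_history, window=5, min_improvement=0.01):
        # Stop when the best accuracy over the last `window` epochs is not
        # significantly better than the best accuracy seen before that window.
        if len(accuracy_history) <= window:
            return False
        best_before = max(accuracy_history[:-window])
        best_recent = max(accuracy_history[-window:])
        return best_recent - best_before < min_improvement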

 

4. (10%) Graph the instances and decision line for the two cases above (with LR=.1) and discuss the graphs.

 

5. (15%) Use the perceptron rule to learn this version of the voting task.  This particular task is an edited version of the standard voting set, where we have replaced all the don't know values with the most common value for the particular attribute.  Randomly split the data into a 70% training and 30% test set.  Try it five times with different random splits.  For each split report the final training and test set accuracy and the number of epochs required.  Also report the average of these values over the 5 trials.  You should update after every instance.  Remember to shuffle the data order after each epoch.  By looking at the weights, explain what the model has learned and how the individual input features affect the result.  Which specific features are most critical for the voting task, and which are least critical?  Create one graph of the average misclassification rate vs. epochs (first through final epoch) for the training set.  Discuss your results.

 

6. (20%) Use the perceptron rule to learn the iris task.  Randomly split the data into a 70% training and 30% test set.  Try it five times with different random splits.  For each split report the final training and test set accuracy and the number of epochs required.  Also report the average of these values over the 5 trials.  By looking at the weights, explain what the model has learned and how the individual input features affect the result.  Discuss your results. As a rough sanity check, typical perceptron accuracies for these data sets are: Vote: 90%-98%, Iris: 60%-95%.  Note that the iris data set has 3 output classes, while a perceptron node has only two possible outputs.  Two common ways to deal with this are:

 

a)  Create 1 perceptron for each output class.  Each perceptron has its own training set which considers its class positive and all other classes to be negative examples.  Run all three perceptrons on novel data and set the class to the label of the perceptron which outputs high.  If there is a tie, choose the perceptron with the highest net value.

b)  Create 1 perceptron for each pair of output classes, where the training set only contains examples from the 2 classes.  Run all perceptrons on novel data and set the class to the label with the most wins (votes) from the perceptrons.  In case of a tie, use the net values to decide.

 

Implement one of these and discuss your results.  For either of these approaches you can train up the models independently or simultaneously.  For testing you just execute the novel instance on each model and combine the overall results to see which output class wins.
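
For example, option (a) might be combined at prediction time roughly as follows (a sketch assuming each trained perceptron is stored as a weight vector with the bias weight last):

    import numpy as np

    def predict_one_vs_all(x, perceptrons):
        # `perceptrons` is a list of weight vectors, one per output class.
        x = np.append(x, 1.0)                        # bias input
        nets = np.array([np.dot(w, x) for w in perceptrons])
        firing = np.where(nets > 0)[0]               # perceptrons that output high
        if len(firing) == 1:
            return int(firing[0])
        return int(np.argmax(nets))                  # tie (or none firing): highest net wins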

 

7.  (15%) Do your own experiment with either the perceptron or delta rule.  Show and discuss your results, including what you learned.  Have fun and be creative!  Make sure you do something more than just try out the perceptron on a different data set.

 

Note:  To help you debug this and other projects, we have included some small examples and other hints, with actual learned hypotheses, so that you can compare your code's results against them and verify that your code is working properly.

 

Backpropagation Project

 

1. (30%) Implement the backpropagation algorithm and integrate it with the toolkit. Attach your source code. Your implementation should include:

    ◦      ability to create an arbitrary network structure (# of nodes, layers, etc.)

    ◦      random weight initialization (small random weights with 0 mean)

    ◦      on-line/stochastic weight update

    ◦      a reasonable stopping criterion

    ◦      training set randomization at each epoch

    ◦      an option to include a momentum term
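
The following is a minimal NumPy sketch of the on-line update for a single hidden layer of sigmoid nodes with an optional momentum term; it is illustrative only and is not tied to the toolkit's interface or to your chosen network structure:

    import numpy as np

    def train_instance(x, target, W_hidden, W_out, lr=0.1, momentum=0.0,
                       dW_hidden_prev=0.0, dW_out_prev=0.0):
        # One stochastic backpropagation update.  W_hidden is (n_hidden, n_inputs+1)
        # and W_out is (n_outputs, n_hidden+1); the +1 columns are bias weights.
        sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

        x_b = np.append(x, 1.0)                        # input plus bias
        hidden = sigmoid(W_hidden @ x_b)               # hidden activations
        h_b = np.append(hidden, 1.0)                   # hidden plus bias
        out = sigmoid(W_out @ h_b)                     # output activations

        delta_out = (target - out) * out * (1 - out)   # output error terms
        delta_hid = hidden * (1 - hidden) * (W_out[:, :-1].T @ delta_out)

        dW_out = lr * np.outer(delta_out, h_b) + momentum * dW_out_prev
        dW_hidden = lr * np.outer(delta_hid, x_b) + momentum * dW_hidden_prev
        return W_hidden + dW_hidden, W_out + dW_out, dW_hidden, dW_out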

 

2.  (10%) Use your backpropagation learner, with stochastic weight updates, for the iris classification problem. Use one layer of hidden nodes, with the number of hidden nodes being twice the number of inputs. Always include a bias weight to each hidden and output node.  Use a random 75/25 split of the data for the training/test set and a learning rate of .1. Use a validation set (VS) for your stopping criterion for this and the remaining experiments. Note that with a VS you do not stop at the first epoch in which the VS accuracy fails to improve.  Rather, you keep track of the best solution so far (bssf) on the VS and consider a window of epochs (e.g. 10); when there has been no improvement over the bssf for the length of the window, you stop. Create one graph with the MSE (mean squared error) on the training set (TS), the MSE on the VS, and the classification accuracy (% classified correctly) of the VS on the y-axis, and the number of epochs on the x-axis. (Note the two scales on the y-axis.) The results for the different measurables should be shown with different colors, line types, etc.  As a rough sanity check, typical backpropagation accuracies for these data sets are: Iris: 80% to mid 90%, Vowel: around 60%.
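
A sketch of that validation-set stopping logic is shown below (the model interface here, with train_epoch, accuracy, and get_weights/set_weights, is hypothetical; adapt it to however your learner is structured):

    def train_with_validation_stopping(model, train_set, val_set, window=10):
        # Keep the best-solution-so-far (bssf) on the VS; stop after `window`
        # consecutive epochs with no improvement, then restore the bssf weights.
        best_acc, bssf_weights, epochs_since_improvement = -1.0, None, 0
        while epochs_since_improvement < window:
            model.train_epoch(train_set)              # one pass, shuffled each epoch
            acc = model.accuracy(val_set)
            if acc > best_acc:
                best_acc, bssf_weights = acc, model.get_weights()
                epochs_since_improvement = 0
            else:
                epochs_since_improvement += 1
        model.set_weights(bssf_weights)
        return model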

 

3.  (20%) For experiments 3-5 you will use just the vowel data set, which is a more difficult task (what would the baseline accuracy be?).  Use your backpropagation learner, with stochastic weight updates, for the vowel classification problem. Consider carefully which of the given input features you should actually use.  Initially use one layer of hidden nodes, with the number of hidden nodes being twice the number of inputs. Use a random 75/25 split of the data for the training/test set. Try some different learning rates (LR).  For each LR find the best VS solution (in terms of VS MSE). Average the results of a few different random splits and initial weight settings for each learning rate.  Create one graph with the MSE for the TS, VS, and test set, and also the classification accuracy on the VS and test set, on the y-axis, and the different learning rates on the x-axis.  Discuss your results and how the learning rate affected backpropagation learning.

 

4.  (10%) Using the best LR you discovered, experiment with different numbers of hidden nodes. Start with 2 hidden nodes and double them for each test until you get no more improvement.  For each number of hidden nodes find the best VS solution (in terms of VS MSE). Average the results of a few different random splits and initial weight settings for each case.  Create one graph with the MSE for the TS, VS, and test set, the classification accuracy on the VS and test set, and the number of epochs needed to reach the best VS solution, on the y-axis (3 scales), and the different numbers of hidden nodes on the x-axis.  You can support multiple y-axis scales by simply mentioning in your explanation which scale goes with which lines (the line styles should be differentiated so as to clearly distinguish one from another).  In general, whenever you are testing a parameter such as the number of hidden nodes, test values until no more improvement is found. For example, if 20 hidden nodes did better than 10, you wouldn't stop at 20, but would try 40, etc., until you saw that you no longer got improvement. Discuss the effect of different numbers of hidden nodes on the algorithm's performance.

 

5. (10%) Try some different momentum terms in the learning equation using the best number of hidden nodes and LR from your earlier experiments. Graph as in step 4 but with momentum on the x-axis.  How did the momentum term affect your results?

 

6.  (5%) Using the best parameters found in 3-5, what difference in accuracy and learning do you get (for both training and test set) on vowel if you include name and gender as features vs. excluding them?  Discuss some possible reasons for this.

 

7.  (15%) Do an experiment of your own regarding backpropagation learning and report your results and conclusions.  Do something more interesting than just trying BP on a different task, or just a variation of the other requirements above.  Analyze and discuss your results.  Try to explain why things happened the way they did.

 

Turn in a thoughtful, well-written report (see the guidelines above) that details your experiments and addresses the questions posed above (look carefully at everything to make sure you've covered all the parts of each).

 

Decision Tree Project

 

1.  (35%) Correctly implement the ID3 decision tree algorithm, including the ability to handle unknown attributes (you do not need to handle real-valued attributes).  Attach your source code. Use standard information gain as your basic attribute evaluation metric.  (Note that normal ID3 would always augment information gain with gain ratio or some other mechanism to penalize statistically insignificant attribute splits.  Otherwise, even with approaches like the pruning below, the SS# type of overfit could still hurt us.)  It is a good idea to test your algorithm on a simple data set that you can check by hand (like the lenses data) to make sure it is working correctly. You should be able to get about 68% (61%-82%) predictive accuracy on lenses.
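
A sketch of the information gain computation for nominal attributes is shown below (it assumes instances are lists of attribute values and labels is a parallel list of class values; names are illustrative only):

    import math
    from collections import Counter

    def entropy(labels):
        total = len(labels)
        return -sum((c / total) * math.log2(c / total)
                    for c in Counter(labels).values())

    def information_gain(instances, labels, attribute_index):
        # Info(S) minus the weighted entropy of the partitions created by
        # splitting on the given (nominal) attribute.
        partitions = {}
        for inst, label in zip(instances, labels):
            partitions.setdefault(inst[attribute_index], []).append(label)
        remainder = sum(len(part) / len(labels) * entropy(part)
                        for part in partitions.values())
        return entropy(labels) - remainder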

 

2. (15%) You will use your ID3 algorithm to induce decision trees for the cars data set and the voting data set.  Do not use a stopping criterion, but induce the tree as far as it can go (until classes are pure or there are no more data or attributes to split on).  Note that you will need to support unknown attributes in the voting data set.  Use 10-fold CV on each data set to predict how well the models will do on novel data.  Report the training and test classification accuracy for each fold and then average the test accuracies to get your prediction.  Create a table summarizing these accuracy results, and discuss what you observed.  As a rough sanity check, typical decision tree accuracies for these data sets are: Cars: .90-.95, Vote: .92-.95.
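
For the 10-fold CV, something along these lines works for generating the folds (a sketch; the seed and slicing scheme are arbitrary choices):

    import random

    def ten_fold_indices(n, seed=0):
        # Shuffle the instance indices, then deal them into 10 roughly equal folds.
        # Each fold serves once as the test set; the other nine form the training set.
        indices = list(range(n))
        random.Random(seed).shuffle(indices)
        return [indices[i::10] for i in range(10)]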

 

3. (10%) For each of the two problems, summarize in English what the decision tree has learned (look at the induced tree and describe what rules it has discovered to try to solve each task).  If the tree is large you can just discuss a few of the more shallow rules and the most important decisions made high in the tree.

 

4.  (10%) How did you handle unknown attributes in the voting problem? Why did you choose this approach? (Do not use the approach of just throwing out data with unknown attributes).

 

5.  (15%) Implement reduced error pruning to help avoid overfitting.  You will need to take a validation set out of your training data to do this, while still having a test set to test your final accuracy.  Create a table comparing the original trees created with no overfit avoidance in item 2 above and the trees you create with pruning.  This table should compare a) the # of nodes and tree depth of the final decision trees and b) the generalization (test set) accuracy. (For the 10-fold CV models, just use their average values in the table).  Summarize your findings.

 

6.  (15%) Do an experiment of your own regarding decision trees and report your results and conclusions.  You can be as creative as you would like on this.  Experiments could include modifying the algorithm, modifying the measure of best attribute, comparing information gain and gain ratio, comparing different stopping criteria and/or pruning approaches, etc.  Analyze and discuss your results.  Try to explain why things happened the way they did.

 

Turn in a thoughtful, well-written report (see the guidelines above) that addresses each question posed above (look carefully at everything to make sure you've covered all the parts of each).

 

 

Instance-based Learning Project

 

1. (40%) Implement the k nearest neighbor algorithm and the k nearest neighbor regression algorithm, including optional distance weighting for both algorithms. Attach your source code.
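
A compact sketch of both variants, with optional inverse-distance-squared weighting, is shown below (it assumes purely numeric features stored as NumPy arrays; the small epsilon guards against division by zero on exact matches):

    import numpy as np

    def knn_predict(x, train_X, train_y, k=3, weighted=False, regression=False):
        # train_y holds class labels (classification) or real targets (regression).
        dists = np.sqrt(((train_X - x) ** 2).sum(axis=1))
        nearest = np.argsort(dists)[:k]
        weights = 1.0 / (dists[nearest] ** 2 + 1e-10) if weighted else np.ones(k)
        if regression:
            return np.dot(weights, train_y[nearest]) / weights.sum()
        votes = {}
        for idx, w in zip(nearest, weights):
            votes[train_y[idx]] = votes.get(train_y[idx], 0.0) + w
        return max(votes, key=votes.get)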

 

2.  (10%) Use the k nearest neighbor algorithm for the magic telescope problem using this training set and this test set.  Try it with k=3 with normalization (input features normalized between 0 and 1) and without normalization and discuss the accuracy results on the test set.  For the rest of the experiments use only normalized data.  With just the normalized training set as your data, graph classification accuracy on the test set with odd values of k from 1 to 15. Which value of k is the best in terms of classification accuracy?  As a rough sanity check, typical k-nn accuracies for the magic telescope data set are 75-85%.

 

3.  (10%) Use the regression variation of your algorithm for the housing price prediction problem using this training set and this test set.  Report Mean Square Error on the test set as your accuracy metric for this case. Experiment using odd values of k from 1 to 15. Which value of k is the best?

 

4.  (10%) Repeat your experiments for magic telescope and housing using distance-weighted (inverse of distance squared) voting. Discuss how distance weighting affects the algorithm performance.

 

5.  (10%) For the best value of k for each data set, implement a reduction algorithm that removes data points in some rational way such that performance does not drop too drastically on the test set given the reduced training set. Compare your performance on the test set for the reduced and non-reduced versions and give the number (and percentage) of training examples removed from the original training set. (Note that performance for magic telescope is classification accuracy and for housing it is sum squared error.) How well does your reduction algorithm work? Magic Telescope has about 12,000 instances, and if you use a leave-one-out style of testing for your data set reduction, your algorithm will run slowly since that is an O(n²) computation at each step.  If you wish, you may use a random subset of 2,000 of the magic telescope instances.  More information on reduction techniques can be found here.

 

6.  (10%) Use the k nearest neighbor algorithm to solve the credit-approval (credit-a) data set.  Note that this set has both continuous and nominal attributes, together with don’t know values.  Implement and justify a distance metric which supports continuous, nominal, and don’t know attribute values.  Use your own choice for k, training/test split, etc. and discuss your results. More information on distance metrics can be found here. As a rough sanity check, typical k-nn accuracies for the credit data set are 70-80%.
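
One reasonable (HEOM-style) choice is sketched below: nominal mismatches contribute distance 1, continuous differences are normalized by the attribute's range, and a don't know value on either side contributes the maximum distance of 1. You are free to design and justify something different.

    from math import sqrt

    def heom_distance(a, b, is_nominal, ranges):
        # a, b: attribute-value lists with None for don't know;
        # is_nominal: per-attribute flags; ranges: per-attribute (max - min) values.
        total = 0.0
        for av, bv, nominal, rng in zip(a, b, is_nominal, ranges):
            if av is None or bv is None:
                d = 1.0                          # don't know contributes the max distance
            elif nominal:
                d = 0.0 if av == bv else 1.0
            else:
                d = abs(av - bv) / rng if rng else 0.0
            total += d * d
        return sqrt(total)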

 

7.  (10%) Do an experiment of your own regarding the k nearest neighbor paradigm and report your results and conclusions.  Analyze and discuss your results.  Try to explain why things happened the way they did.

 

Turn in a thoughtful, well-written report (see the guidelines above) that details your experiments and addresses the questions posed above (look carefully at everything to make sure you've covered all the parts of each).

 

 

Clustering Project

 

1. (40%) Implement the k-means clustering algorithm and the HAC (Hierarchical Agglomerative Clustering) algorithm. Attach your source code. Use Euclidean distance for continuous attributes and (0,1) distances for nominal and unknown attributes (e.g. matching nominals have distance 0 else 1, unknown attributes have distance 1).  HAC should support both single link and complete link options.  For k-means you will pass in a specific k value for the number of clusters that should be in the resulting clustering.  Since HAC automatically generates groupings for all values of k, you will pass in a range or set of k values for which actual output will be generated. The output for both algorithms should include for each clustering: a) the number of clusters, b) the centroid values of each cluster, c) the number of instances tied to that centroid, d) the SSE of each cluster, and e) the total SSE of the full clustering. The sum squared error (SSE) of a single cluster is the sum of the squared distances to the cluster centroid.  Run k-means, HAC-single link, and HAC-complete link on this exact set of the sponge data set (use all columns and do not shuffle or normalize) and report your exact results for each algorithm with k=4 clusters.  For k-means use the first 4 elements of the data set as initial centroids.  This will allow us to check the accuracy of your implementation. 
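
The overall k-means loop can be as simple as the sketch below; distance() and centroid() stand in for whatever helpers you write to implement the mixed Euclidean / 0-1 rules above, and for the sponge run you would seed with the first k instances rather than random ones:

    import random

    def k_means(instances, k, distance, centroid, max_iters=100):
        centroids = random.sample(instances, k)
        clusters = []
        for _ in range(max_iters):
            clusters = [[] for _ in range(k)]
            for inst in instances:                 # assign to nearest centroid
                closest = min(range(k), key=lambda c: distance(inst, centroids[c]))
                clusters[closest].append(inst)
            new_centroids = [centroid(cl) if cl else centroids[i]
                             for i, cl in enumerate(clusters)]
            if new_centroids == centroids:         # assignments stable: converged
                break
            centroids = new_centroids
        sse = [sum(distance(inst, c) ** 2 for inst in cl)
               for cl, c in zip(clusters, centroids)]
        return clusters, centroids, sse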

 

To help debug your implementations, you may run them on this data set (don’t shuffle or normalize it). Note that this is just the labor data set with an instance id column added for showing results.  (Do NOT use the id column or the output class column as feature data).  The results for 5-means, using the first 5 elements of the data set as initial centroids should be this.  In HAC we just do the one closest merge per iteration.  The results for HAC-single link up to 5 clusters should be this and complete link up to 5 clusters should be this.

 

For the sample files, we ignore missing values when calculating centroids and assign them a distance of 1 when determining total sum squared error.  Suppose you had the following instances in a cluster:

 

?, 1.0, Round

?, 2.0, Square

?, 3.0, Round

Red, ?, ?

Red, ?, Square

 

The centroid value for the first attribute would be "Red" and SSE would be 3.  The centroid value for the second attribute would be 2, and SSE would be 4.  In case of a tie, as with the third attribute, we choose the nominal value which appeared first in the metadata list.  So if the attribute were declared as @attribute Shape{"Round", "Square"}, then the centroid value for the third attribute would be Round and the SSE would be 3.  For other types of ties (a node or cluster with the same distance to another cluster, which should be rare), just go with the earliest cluster in your list.  If all the instances in a cluster have don’t know for one of the attributes, then use don’t know in the centroid for that attribute.
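
The nominal case above can be sanity-checked with a few lines of code (None stands for a don't know value, and declared_order is the order of values in the ARFF metadata):

    def nominal_centroid_and_sse(values, declared_order):
        # Centroid = most common known value, ties broken by metadata order;
        # each don't know (None) contributes a squared distance of 1.
        known = [v for v in values if v is not None]
        centroid = max(declared_order, key=known.count) if known else None
        sse = sum(1 if v is None else (0 if v == centroid else 1) for v in values)
        return centroid, sse

    shapes = ["Round", "Square", "Round", None, "Square"]
    print(nominal_centroid_and_sse(shapes, ["Round", "Square"]))   # ('Round', 3)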

 

2.   (20%) Run all three variations (k-means, HAC-single link, and HAC-complete link) on the full iris data set where you do not include the output label as part of the data set. For k-means you should always choose k random points in the data set as initial centroids.  If you ever end up with any empty clusters in k-means, re-run with different initial centroids.  Run it for k = 2-7.  State whether you normalize or not (your choice).  Graph the total SSE for each k and discuss your results (i.e. what kind of clusters are being made). Now do it again where you include the output class as one of the input features and discuss your results and any differences.  For this final data set, run k-means 10 times with k=4, each time with different initial random centroids and discuss any variations in the results.

 

3.   (15%) Run all three variations (k-means, HAC-single link, and HAC-complete link) on the following smaller (500 instance) abalone data set where you include all attributes including the normal output attribute “rings”.  Treat “rings” as a continuous variable, rather than nominal.  Why would I suggest that?  Run it for k = 2-7.  Graph and discuss your results without normalization.  Then run it again with normalization and discuss any differences.

 

4. (10%) For your normalized abalone experiments, calculate and graph a performance metric for the clusterings for k = 2-7. You may use the Davies-Bouldin Index or some other metric including one of your own making.  If not using Davies-Bouldin, justify your metric.  Discuss how effective the metric you used might be in these cases for selecting which number of clusters is best.
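
If you go with Davies-Bouldin, a sketch of the computation is below (lower is better; clusters and centroids are assumed to be NumPy arrays of normalized numeric data):

    import numpy as np

    def davies_bouldin(clusters, centroids):
        # Average, over clusters, of the worst-case ratio
        # (scatter_i + scatter_j) / distance(centroid_i, centroid_j).
        scatter = [np.mean([np.linalg.norm(x - c) for x in cluster])
                   for cluster, c in zip(clusters, centroids)]
        k = len(centroids)
        total = 0.0
        for i in range(k):
            total += max((scatter[i] + scatter[j]) /
                         np.linalg.norm(centroids[i] - centroids[j])
                         for j in range(k) if j != i)
        return total / k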

 

5.  (15%) Do an experiment of your own regarding clustering and report your results and conclusions.  Analyze and discuss your results.  Try to explain why things happened the way they did.

 

 

Group Project Proposal

 

This proposal should be e-mailed as a PDF directly to Professor Martinez.

Come up with one carefully proposed idea for a possible group machine learning project that could be done this semester.   This proposal should not be more than one page long.  It should include a thoughtful first draft covering a) a description of the project, b) what features the data set would include, and c) how and from where the data set would be gathered and labeled.  Give at least one fully specified example of a data set instance based on your proposed features, including a reasonable representation (continuous, nominal, etc.) and a value for each feature.  The actual values may be fictional at this time.  This effort will cause you to consider how plausible the future data gathering and representation might actually be.  Note: Don't choose a task where the data set is already worked through and collected, and pretty much ready to use for training.  I want you to learn by having to work through these important issues, at least to some degree.

 

Also specify whether this is a project you would really like to be part of or if you just did it because it was assigned.  I will use this in helping decide which subset of the proposals I will send to the class.  Please e-mail me this proposal by the due date.  I will then consider which ones are most reasonable (about 12-15) and e-mail them out to the class.  I will then have each of you e-mail me a ranked list of the top 4 projects you would like to be part of.  I will then do my best to place you in a team of 3 on a project you are interested in.  Note that you may not all get your first choice, but I guarantee that if you are the one who proposes the project, and that project is chosen and you want to be on it, you will get to be on that project.

 

You may work together on the proposal (up to 2 people) and send in one proposal if it is one that you both specify as wanting to work on.

 

The grade on this will not be based on whether your proposal is chosen or not.  It will be based on whether it appears that you put in a reasonable effort to propose a plausible project and if the proposal is appropriate based on what we have learned in class.

 

Group Project Progress Report

 

This report should be printed and brought to class on the due date.

Go here to see the different projects and teams.

Partway into the semester each group will hand in a 2-3 page review of their project progress including

·          A description of the problem

·          What machine learning model they are initially trying to learn with

·          How and from where they are gathering data

·          A description of their data set including:

o    Actual example instances, including a reasonable representation (continuous, nominal, etc.) and values for each feature

o    How many instances and features you plan to have in your final data set

·          Brief discussion of plans and schedule to finish the project

 

Group Project

 

Your goal in the group project is to get the highest possible generalization accuracy on a real world application.  A large part of the project will be gathering, deriving, and representing input features.  After you have come up with basic data and features, you will choose a machine learning model(s), and format the data to fit the model(s).  Expect that initial results may not be as good as you would like.  You will then begin the iterative process of a) trying adjusted/improved data/features, and b) adjusted/different machine learning models in order to get the best possible results.  Your written report and oral presentation (formats below) should contain at least the following.

 

a)       Motivation and discussion of your chosen task

b)      The features used to solve the problem and details on how you gathered and represented the features, including critical decisions/choices made along the way

c)       Your initial results with your initial model

d)      The iterative steps you took to get better results (improved features and/or learning models)

e)       Clear reporting and explanation of your final results including your training/testing approach

f)        Conclusions, insights, and future directions you would take if time permitted

 

1.       Prepare and submit a polished paper describing your work. Submit one hard copy at the beginning of class on the day of the oral presentation.  You should use these Latex style files if you are writing in Latex (which is a great way to write nice looking papers) or this Word template. Read one of these links (especially the early paragraphs) because they contain more information on the contents of your paper (#pages, order, etc.).  I will not be getting out a ruler to make sure you match format exactly, but it is a good experience to put together a professional looking paper.  Your paper should look like a conference paper, including Title, Author Affiliations, Abstract, Introduction, Methods, Results and Conclusion (Bibliographical references are optional for this, but do include them if you have any).  Here is an example of a well written write-up from a previous semester.

2.       Prepare a polished oral presentation, including slides, that communicates your work. Your presentation should take 12 minutes, with 3 minutes allowed for questions (a rough guide for slide preparation is about 1 slide per 1-2 minutes). Practice in advance to ensure a quality, appropriately timed presentation.

3.       Email me a thoughtful and honest evaluation of the contributions of your group members (including yourself). For each, include a score from 0 to 10 indicating your evaluation of their work (10 meaning they were a valuable member of the team that made significant contributions to the project and were good to work with, 0 meaning they contributed nothing). If you would like, you may also include any clarifying comments, etc.