Project reports are to be done on a word processor and be neat and professional. Good writing,
grammar, punctuation, etc. are important, and points will be taken off if these
things are lacking. Each written project report should be 3-5 pages and is
limited to no more than five single-sided, single-spaced pages using 12 point
font, including all graphs and figures. Reports longer than the page limit will
receive an automatic 10% reduction. (For the backpropagation report you may use
6 pages if absolutely necessary.) Communicating clearly and concisely what you
have to say is an important skill you will use throughout your career. Number
each section of your paper to match the numbering in the assignment directions.
You should have some discussion for each section, though some may be brief
(e.g. a section that just asks you to code up the algorithm, though you could
still say a sentence or two about it). You do not need the standard intro of
“Machine learning is an exciting new field…”, etc. Your audience is the TAs, who
know the basics but want to hear about what you learned, your analysis, etc.
Figures are not to be hand-drawn and should be large enough to be easily
legible. Keep reported numbers to a consistent number (3 or 4) of significant
digits. __Remember to read the directions carefully and to do and discuss each
subtask__. Also label all axes on graphs!

All assignments will be submitted through Learning Suite (click on Assignments
in the left panel and then select the appropriate project). There are two
entries for each project: one for the write-up and one for the code. The
write-up should be a PDF (make sure you look at your PDF before submitting and
check that it is clear and readable, including readable fonts on graphs, etc.).
The code can be submitted as text if it is one file, or as a zip or tar file.
Make sure you submit both for each project. When your projects are graded you
can go to Learning Suite and access your project, which will have feedback. I
ask the TAs to take off lightly on the first project for simple errors and point
them out, but I expect you to review those, and the penalty will be higher if
they are repeated in later projects.

To help with the projects, we provide toolkits in C++ and Java, though you may
program in the language of your choice. You are not required to use the provided
toolkit and you may write your own if you desire. Programming projects will be
graded primarily on positive completion, correctness of results, technical
accuracy in the discussion, and perceived effort and understanding. Most parts
of the projects require some clear measurable results (e.g. working program,
graphs of results, etc.) and some questions where you analyze your results. If
your results or answers differ significantly from appropriate norms, then you
will be marked off. A few questions are more open-ended and subjective, and in
these cases you will be graded based on perceived effort and thought. If you
make a thoughtful, conscientious effort, learn the model well, and try to create
a quality write-up demonstrating your learning, you will do well. Have fun and
learn while doing the projects!

- Instructions and source code for the toolkit can be found here.

- A collection of data sets already in
the ARFF format can be found here.
- Details on ARFF are found here.
- The complete and continually updated UC Irvine data set repository is here.
This is a good place to go to study more about the problems used in your
assignments and also to find new problems.
- In order to help you debug projects
we have included some small examples
and other hints with actual learned hypotheses so that you can compare
the results of your code and help ensure that your code is working
properly.

1. (40%) Correctly implement the perceptron learning algorithm and integrate it
with the toolkit. Before implementing your perceptron and other projects, you
should review the requirements for the project so that you implement it in a way
that will support them all. Note that for this and the other labs it is not
always easy to tell if you have implemented the model exactly correctly, since
training sets, initial parameters, etc. usually have a random aspect. However,
it is easy to see when your results are inconsistent with reasonable results,
and points will be reduced based on how far off your implementation appears to
be.

2. (5%) Create 2 ARFF files, each with 8 instances (4 from each class) using 2
real valued inputs which range between -1 and 1. One should be linearly
separable and the other not.
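
As a rough illustration of the file layout, here is a minimal Python sketch that builds the text of such an ARFF file. The helper name `to_arff`, the relation and attribute names, and the eight points are all made up for this example; any points that meet the requirements above would do.

```python
def to_arff(relation, points):
    """Return ARFF text for 2-input, 2-class data given as (x1, x2, label) tuples."""
    lines = [
        f"@relation {relation}",
        "@attribute x1 real",
        "@attribute x2 real",
        "@attribute class {0,1}",
        "@data",
    ]
    lines += [f"{x1},{x2},{label}" for x1, x2, label in points]
    return "\n".join(lines) + "\n"

# Illustrative points only: class 1 lies above the line x2 = 0, class 0 below it,
# so this set is linearly separable.
separable = [
    (-0.8, 0.5, 1), (-0.3, 0.9, 1), (0.4, 0.3, 1), (0.7, 0.8, 1),
    (-0.6, -0.4, 0), (-0.1, -0.9, 0), (0.5, -0.2, 0), (0.9, -0.7, 0),
]
arff_text = to_arff("separable", separable)
```

Writing `arff_text` to a file with an `.arff` extension gives one of the two data sets; the non-separable set needs different points.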

3. (10%) Train on both sets (the entire sets) with the Perceptron Rule. Try it
with a couple of different learning rates and discuss the effect of learning
rate, including how many epochs are completed before stopping. (For these
situations learning rate should have minimal effect, unlike with the
Backpropagation lab).

The basic stopping criterion for many models is to stop training when no longer
making significant progress, most commonly when you have gone a number of
epochs (e.g. 5) with no significant improvement in accuracy. (Note that the
weights/accuracy do not usually change monotonically.) Describe your specific
stopping criterion. Don’t just stop at the first epoch when no improvement
occurs. Use a learning rate of .1 for experiments 4-6 below.
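
One way such a stopping rule could look is sketched below in Python. This is only an illustration under stated assumptions: the names `train_perceptron`, `patience` and `max_epochs` are hypothetical, weights start at zero, accuracy is measured on the training data, and "significant improvement" is simplified to "any new best accuracy".

```python
import random

def predict(weights, x):
    """Threshold output: 1 if net > 0 else 0. The last weight is the bias."""
    net = sum(w * xi for w, xi in zip(weights, x)) + weights[-1]
    return 1 if net > 0 else 0

def accuracy(weights, data):
    return sum(predict(weights, x) == t for x, t in data) / len(data)

def train_perceptron(data, lr=0.1, patience=5, max_epochs=1000, seed=0):
    """Perceptron rule; stop after `patience` epochs in a row with no new best accuracy."""
    rng = random.Random(seed)
    data = list(data)
    weights = [0.0] * (len(data[0][0]) + 1)
    best_acc, best_weights = accuracy(weights, data), list(weights)
    stale, epochs = 0, 0
    while stale < patience and epochs < max_epochs:
        rng.shuffle(data)                      # new presentation order each epoch
        for x, t in data:
            z = predict(weights, x)
            for i, xi in enumerate(x):
                weights[i] += lr * (t - z) * xi
            weights[-1] += lr * (t - z)        # bias input is always 1
        epochs += 1
        acc = accuracy(weights, data)
        if acc > best_acc:
            best_acc, best_weights, stale = acc, list(weights), 0
        else:
            stale += 1
    return best_weights, best_acc, epochs
```

The returned epoch count includes the final `patience` epochs that produced no improvement, which is worth stating when you report how many epochs were needed.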

4. (10%) Graph the
instances and decision line for the two cases above (with LR=.1).
For all graphs always label the axes!
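
For plotting, the decision line can be read directly off the learned weights. The sketch below assumes a perceptron with two input weights and a bias, whose boundary is where the net value is zero; the function name `decision_line` is made up for this example.

```python
def decision_line(w1, w2, bias):
    """Return (slope, intercept) of the line w1*x + w2*y + bias = 0, solved for y."""
    if w2 == 0:
        # The boundary is vertical and cannot be written as y = slope*x + intercept.
        raise ValueError("vertical decision line: x = -bias / w1")
    return -w1 / w2, -bias / w2

slope, intercept = decision_line(1.0, 2.0, -1.0)
```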

5. (20%) Use the perceptron rule to learn this version of the voting task. This
particular task is an edited version of the standard voting set, where we have
replaced all the “don’t know” values with the most common value for the
particular attribute. Randomly split the data into a 70% training set and a 30%
test set. Try it five times with different random 70/30 splits. For each split
report the final training and test set accuracy and the # of epochs required.
Also report the average of these values from the 5 trials. You should update
after every instance. Remember to shuffle the data order after each epoch. By
looking at the weights, explain what the model has learned and how the
individual input features affect the result. Which specific features are most
critical for the voting task, and which are least critical? Do one graph of the
average misclassification rate vs epochs (0th through final epoch) for the
training set. Our helps page has some help for doing graphs. As a rough sanity
check, typical Perceptron accuracies for the voting data set are 90%-98%.
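
A small sketch of the random 70/30 split and the averaging over trials follows. The helper names `split_70_30` and `mean` are hypothetical, and using a different seed per trial is just one simple way to get five different splits.

```python
import random

def split_70_30(instances, seed):
    """Shuffle a copy of the instances and cut it into 70% training / 30% test."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = (7 * len(shuffled)) // 10
    return shuffled[:cut], shuffled[cut:]

def mean(values):
    return sum(values) / len(values)

# Five different splits, one per seed; train and evaluate on each, then average.
splits = [split_70_30(range(100), seed) for seed in range(5)]
```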

6. (15%) Do your own experiment with either the
perceptron or delta rule. Include in
your discussion what you learned from the experiment. Have fun and be creative! For this lab and all the future labs make
sure you do something more than just try out the model on different data sets. One option you can do to fulfill this part is
the following:

Use the perceptron rule to learn the iris
task or some other task with more than two possible output values. Note that
the iris data set has 3 output classes, and a perceptron node only has two
possible outputs. Two common ways to
deal with this are:

a) Create 1 perceptron for each output class. Each perceptron has its own training set
which considers its class positive and all other classes to be negative
examples. Run all three perceptrons on
novel data and set the class to the label of the perceptron which outputs
high. If there is a tie, choose the
perceptron with the highest net value.

b) Create 1 perceptron for each pair of output
classes, where the training set only contains examples from the 2 classes. Run all perceptrons on novel data and set the
class to the label with the most wins (votes) from the perceptrons. In case of a tie, use the net values to
decide.

You could implement one of these. For either of these approaches you can train
up the models independently or simultaneously.
For testing you just execute the novel instance on each model and
combine the overall results to see which output class wins.
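
Approach (a) could be combined at test time roughly as follows. This is a sketch under the assumption that each trained perceptron is stored as a weight list with the bias last; `one_vs_rest_predict` and the example weights are invented for illustration.

```python
def net_value(weights, x):
    """Weighted sum of the inputs plus the bias (stored as the last weight)."""
    return sum(w * xi for w, xi in zip(weights, x)) + weights[-1]

def one_vs_rest_predict(models, x):
    """models maps class label -> weight list. Returns the winning label.

    Picking the largest net value covers every case: a single perceptron
    outputting high has the largest net, and ties (several high, or none
    high) are broken by the highest net value.
    """
    nets = {label: net_value(weights, x) for label, weights in models.items()}
    return max(nets, key=nets.get)

# Hypothetical weights for three classes over two inputs.
models = {"a": [1.0, 0.0, 0.0], "b": [0.0, 1.0, 0.0], "c": [-1.0, -1.0, 0.5]}
```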

Note: In order to help you debug this and other projects we have included some
small examples and other hints with actual learned hypotheses so that you can
compare the results of your code and help ensure that your code is working
properly. You may also discuss and compare results with classmates.

1. (40%) Implement the
backpropagation algorithm and integrate it with the toolkit. This is probably
the most intense lab, so start early!
This and the following are implementations you should be able to use on
real world problems in your careers, group projects, etc. Your implementation
should include:

◦ ability to create a network structure with at least one hidden layer and an arbitrary number of nodes
◦ random weight initialization (small random weights with 0 mean)
◦ on-line/stochastic weight update
◦ a reasonable stopping criterion
◦ training set randomization at each epoch
◦ an option to include a momentum term

2. (13%) Use your backpropagation learner, with stochastic weight updates, for
the iris classification problem. Use one layer of hidden nodes with the number
of hidden nodes being twice the number of inputs. Always use bias weights to
each hidden and output node. Use a random 75/25 split of the data for the
training/test set and a learning rate of .1. Use a validation set (VS) for your
stopping criterion for this and the remaining experiments. Note that with a VS
you do not stop at the first epoch in which the VS accuracy does not improve.
Rather, you keep track of the best solution so far (bssf) on the VS and consider
a window of epochs (e.g. 5); when there has been no improvement over the bssf
for the length of the window, you stop. Create one graph with the MSE (mean
squared error) on the training set, the MSE on the VS, and the classification
accuracy (% classified correctly) of the VS on the y-axis, and number of epochs
on the x-axis. (Note the two scales on the y-axis.) Each measure should be shown
with a different color, line type, etc. Typical backpropagation accuracies for
the Iris data set are 85-95%.
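
The bssf window can be sketched independently of the network itself. In the Python sketch below, `train_one_epoch`, `validation_accuracy` and `get_weights` are hypothetical callbacks standing in for your own learner; nothing here is part of the provided toolkit.

```python
def train_with_validation(train_one_epoch, validation_accuracy, get_weights,
                          window=5, max_epochs=10000):
    """Train until `window` epochs pass with no improvement over the bssf on the VS.

    Returns (best_weights, best_accuracy, best_epoch, total_epochs).
    """
    best_acc, best_weights, best_epoch = -1.0, None, 0
    epoch = 0
    while epoch - best_epoch < window and epoch < max_epochs:
        train_one_epoch()
        epoch += 1
        acc = validation_accuracy()
        if acc > best_acc:
            # New best solution so far: remember a snapshot of the weights.
            best_acc, best_weights, best_epoch = acc, get_weights(), epoch
    return best_weights, best_acc, best_epoch, epoch
```

The weights returned are those of the bssf, not of the last epoch trained, so `get_weights` should return a copy rather than a live reference.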

3. (12%) For parts 3-5 you will use the vowel data set, which is a
more difficult task (what would the baseline accuracy be?). Typical
backpropagation accuracies for the Vowel data set are ~60%. Consider carefully which of the given input features you should
actually use (Train/test, speaker, and gender?). Use one layer of hidden nodes
with the number of hidden nodes being twice the number of inputs. Use random
75/25 splits of the data for the training/test set. Try some different learning
rates (LR). For each LR find the best VS solution (in terms of VS MSE). Note that the proper approach in this case would be to average the
results of multiple random initial conditions (splits and initial weight settings)
for each learning rate. To
minimize work you may just do each learning rate once with the same initial
conditions. If you would like you may average the results of multiple initial
conditions (e.g. 3) per LR, and that obviously would give more accurate
results. The same applies for parts 4 and 5. Create one graph with MSE for the
training set, VS, and test set, and also the classification accuracy on the VS
and test set. Create another graph
showing the number of epochs needed to get to the best VS solution on the
y-axis with the different learning rates on the x-axis. In general,
whenever you are testing a parameter such as LR, # of hidden nodes, etc., test
values until no more improvement is found. For example, if 20 hidden nodes did
better than 10, you would not stop at 20, but would try 40, etc., until you saw
that you no longer got improvement.

4. (10%) Using the best LR you discovered, experiment with different numbers of
hidden nodes. Start with 1 hidden node, then 2, and then double the number for
each test until you get no more improvement. For each number of hidden nodes
find the best VS solution (in terms of VS MSE). Graph as in step 3 but with # of
hidden nodes on the x-axis.
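
The doubling search can be written as a small loop. In this sketch `evaluate` is a hypothetical function that trains a network with the given number of hidden nodes and returns its best VS MSE; the `limit` cap is an added safety assumption, not a requirement of the assignment.

```python
def search_by_doubling(evaluate, start=1, limit=1024):
    """Try start, 2*start, 4*start, ... until the VS MSE stops improving."""
    tried = [(start, evaluate(start))]
    n = start * 2
    while n <= limit:
        tried.append((n, evaluate(n)))
        if tried[-1][1] >= tried[-2][1]:   # no improvement over the previous setting
            break
        n *= 2
    best_n, best_mse = min(tried, key=lambda pair: pair[1])
    return best_n, best_mse, tried
```

Because a single run is noisy, one worse result can end the search early; averaging several runs per setting, as suggested in part 3, makes the stopping point more trustworthy.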

5. (10%) Try some
different momentum terms in the learning equation using the best number of
hidden nodes and LR from your earlier experiments. Graph as in step 3 but with
momentum on the x-axis.
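
For reference, a momentum term adds a fraction of the previous weight change to the current one. The sketch below writes the update in terms of the error gradient for each weight; the function name and the flat weight list are simplifications made for this example.

```python
def momentum_step(weights, gradients, prev_deltas, lr=0.1, momentum=0.9):
    """One weight update: delta = -lr * gradient + momentum * previous delta."""
    deltas = [-lr * g + momentum * d for g, d in zip(gradients, prev_deltas)]
    new_weights = [w + d for w, d in zip(weights, deltas)]
    return new_weights, deltas

# The returned deltas are passed back in as prev_deltas on the next update.
new_weights, deltas = momentum_step([1.0], [2.0], [0.5])
```

With `momentum=0` this reduces to the plain stochastic update, which makes it easy to offer momentum as an option.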

6. (15%) Do an
experiment of your own regarding backpropagation learning. Do something
more interesting than just trying BP on a different task, or just a variation
of the other requirements above.

1. (40%) Correctly implement the ID3 decision tree algorithm, including the
ability to handle unknown attributes (you do not need to handle real valued
attributes). Use standard information gain as your basic attribute evaluation
metric. (Note that normal ID3 would always augment information gain with gain
ratio or some other mechanism to penalize statistically insignificant attribute
splits. Otherwise, even with approaches like pruning below, the SS# type of
overfit could still hurt us.) It is a good idea to use a simple data set (like
the lenses data) that you can check by hand to test your algorithm and make
sure that it is working correctly. You should be able to get about 68%
(61%-82%) predictive accuracy on lenses.
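
Information gain for a nominal attribute can be computed as in the sketch below. It assumes instances are stored as dictionaries keyed by attribute name, which is a choice made for this example only, and it does not handle unknown values.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy in bits of a non-empty list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after splitting on `attribute`."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder

rows = [{"a": "x", "b": "x"}, {"a": "x", "b": "y"},
        {"a": "y", "b": "x"}, {"a": "y", "b": "y"}]
labels = ["p", "p", "n", "n"]
```

Here attribute `a` separates the classes perfectly (gain of 1 bit) while `b` tells us nothing (gain of 0), so ID3 would split on `a` first.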

2. (15%) You will use your ID3 algorithm to induce decision trees for the cars
data set and the voting data set. Do not use a stopping criterion, but induce
the tree as far as it can go (until classes are pure or there are no more data
or attributes to split on). Note that with a full tree you will often get 100%
accuracy on the training set. (Why would you, and in what cases would you not?
This question is for our discussion, and should also be answered in your
report.) Note that you will need to support unknown attributes in the voting
data set. Use 10-fold CV on each data set to predict how well the models will do
on novel data. Report the training and test classification accuracy for each
fold and then average the test accuracies to get your prediction. Create a table
summarizing these accuracy results, and discuss what you observed. As a rough
sanity check, typical decision tree accuracies for these data sets are: Cars:
.90-.95, Vote: .92-.95.
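
A minimal sketch of generating the 10 folds is shown below. The generator name `k_fold_indices` is hypothetical; it works on instance indices so it can be used with any data representation.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds over n instances."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]     # k nearly equal-sized folds
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx
```

Each instance lands in exactly one test fold, so averaging the ten test accuracies uses every instance once for testing.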

3. (10%) For each of the two problems, summarize in English what the decision
tree has learned (i.e. look at the induced tree and describe what rules it has
discovered to try to solve each task). If the tree is large you can just
discuss a few of the shallower attribute combinations and the most important
decisions made high in the tree.

4. (5%) How did
you handle unknown attributes in the voting problem? Why did you choose this
approach? (Do not use the approach of just throwing out data with unknown
attributes).

5. (15%) Implement
reduced error pruning to help avoid overfitting. You will need to take a
validation set out of your training data to do this, while still having a test
set to test your final accuracy. Create
a table comparing the original trees created with no overfit avoidance in item
2 above and the trees you create with pruning.
This table should compare a) the # of nodes (including leaf nodes) and
tree depth of the final decision trees and b) the generalization (test set)
accuracy. (For the unpruned 10-fold CV models, just use their average values in the table).
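
One common bottom-up form of reduced error pruning is sketched below. The tree representation is an assumption made for this example: a leaf is `{"label": ...}` and an internal node is `{"attr": ..., "majority": ..., "children": {value: subtree}}`, where `majority` is the most common training class at that node. Other variants (for example, greedily pruning the single best node each round) are equally valid.

```python
def classify(tree, row):
    """Follow the tree to a leaf; unseen values fall back to the node's majority class."""
    while "label" not in tree:
        tree = tree["children"].get(row[tree["attr"]], {"label": tree["majority"]})
    return tree["label"]

def errors(tree, rows, labels):
    return sum(classify(tree, r) != y for r, y in zip(rows, labels))

def prune(tree, rows, labels):
    """Prune in place using the validation rows/labels that reach this node."""
    if "label" in tree:
        return tree
    for value, child in list(tree["children"].items()):
        reach = [(r, y) for r, y in zip(rows, labels) if r[tree["attr"]] == value]
        tree["children"][value] = prune(child,
                                        [r for r, _ in reach],
                                        [y for _, y in reach])
    leaf = {"label": tree["majority"]}
    # Replace the subtree by a leaf when that does not hurt validation accuracy.
    if errors(leaf, rows, labels) <= errors(tree, rows, labels):
        return leaf
    return tree
```

Note that a subtree reached by no validation instances is always pruned here, since both alternatives make zero errors; whether that is desirable is worth a sentence in your report.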

6. (15%) Do an
experiment of your own regarding decision trees. You can be as creative
as you would like on this. Experiments could include such things as
modifying the algorithm, modifying the measure of best attribute, comparing
information gain and gain ratio, supporting real valued attributes, comparing
different stopping criteria and/or pruning approaches, etc. Be creative!

1. (40%) Implement the *k* nearest neighbor algorithm and the *k* nearest neighbor regression algorithm,
including optional distance weighting for both algorithms. Attach your source
code.

2. (15%) Use the *k* nearest neighbor algorithm (without distance weighting)
for the magic telescope problem using this training set and this test set. Try
it with *k*=3 with normalization (input features normalized between 0 and 1)
and without normalization and discuss the accuracy results on the test set. For
the rest of the experiments use only normalized data. With just the normalized
training set as your data, graph classification accuracy on the test set with
odd values of *k* from 1 to 15. Which value of *k* is the best in terms of
classification accuracy? As a rough sanity check, typical *k*-nn accuracies for
the magic telescope data set are 75-85%.
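
Min-max normalization might look like the sketch below. Taking the minimum and maximum from the training set and reusing them for the test set is an assumption of this example (state your own choice in your report); the function names are made up.

```python
def fit_min_max(rows):
    """Return per-feature minimums and maximums for a list of numeric rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]

def normalize(rows, mins, maxs):
    """Scale each feature to (value - min) / (max - min); constant features map to 0."""
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, mins, maxs)]
            for row in rows]

train = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
mins, maxs = fit_min_max(train)
train_norm = normalize(train, mins, maxs)
```

With this choice a test value outside the training range normalizes to something below 0 or above 1, which is usually harmless for nearest neighbor distances.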

3. (10%) Use the regression variation of your algorithm (without distance
weighting) for the housing price prediction problem using this training set and
this test set. Report mean squared error on the test set as your accuracy
metric for this case. Experiment using odd values of *k* from 1 to 15. Which
value of *k* is the best?

4. (10%) Repeat your experiments for magic
telescope and housing using distance-weighted (inverse of distance squared)
voting.
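
A compact sketch of both variants with optional inverse-squared-distance weighting follows. It assumes continuous features and Euclidean distance, and the tiny constant added to the squared distance (to avoid dividing by zero on exact matches) is a choice made for this example.

```python
from collections import defaultdict
from math import dist

def nearest(train, query, k):
    """train: list of (features, target). Returns the k closest (distance, target) pairs."""
    pairs = [(dist(x, query), y) for x, y in train]
    return sorted(pairs, key=lambda pair: pair[0])[:k]

def weight(d, weighted):
    return 1.0 / (d * d + 1e-12) if weighted else 1.0

def knn_classify(train, query, k=3, weighted=False):
    votes = defaultdict(float)
    for d, label in nearest(train, query, k):
        votes[label] += weight(d, weighted)
    return max(votes, key=votes.get)

def knn_regress(train, query, k=3, weighted=False):
    neighbors = nearest(train, query, k)
    weights = [weight(d, weighted) for d, _ in neighbors]
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)

points = [((0.0, 0.0), "a"), ((1.0, 0.0), "b"), ((1.1, 0.0), "b")]
```

For the query `(0.1, 0.0)` with *k*=3, the unweighted vote picks "b" (two votes to one), while the distance-weighted vote picks "a" because the single "a" neighbor is much closer.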

5. (10%) Use the *k* nearest neighbor algorithm to solve the credit-approval
(credit-a) data set. Note that this set has both continuous and nominal
attributes, together with don’t know values. Implement and justify a distance
metric which supports continuous, nominal, and don’t know attribute values. Use
your own choice for *k*, training/test split, etc. More information on distance
metrics can be found here. As a rough sanity check, typical *k*-nn accuracies
for the credit data set are 70-80%.
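
One possible metric, shown only as a starting point, treats each attribute separately: range-normalized absolute difference for continuous values, 0/1 overlap for nominal values, and the maximum distance of 1 whenever either value is unknown. The `"?"` marker, the function name and the argument layout are assumptions of this sketch, and you would still need to justify whichever metric you pick.

```python
from math import sqrt

UNKNOWN = "?"

def mixed_distance(a, b, is_nominal, ranges):
    """Distance between two instances with continuous, nominal and unknown values.

    is_nominal[i] says whether attribute i is nominal; ranges[i] is max - min
    of attribute i on the training set (ignored for nominal attributes).
    """
    total = 0.0
    for x, y, nominal, value_range in zip(a, b, is_nominal, ranges):
        if x == UNKNOWN or y == UNKNOWN:
            d = 1.0
        elif nominal:
            d = 0.0 if x == y else 1.0
        else:
            d = abs(x - y) / value_range if value_range else 0.0
        total += d * d
    return sqrt(total)

example = mixed_distance(("red", 2.0, UNKNOWN), ("blue", 4.0, 1.0),
                         (True, False, False), (None, 4.0, 10.0))
```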

6. (15%) Do an
experiment of your own regarding the *k*
nearest neighbor paradigm. One option you can do to fulfill this part is the
following:

For the best value of *k* for each or any one of the datasets, implement a reduction
algorithm that removes data points in some rational way such that performance
does not drop too drastically on the test set given the reduced training set.
Compare your performance on the test set for the reduced and non-reduced
versions and give the number (and percentage) of training examples removed from
the original training set. (Note that performance for magic telescope is
classification accuracy and for housing it is sum squared
error). How well does your reduction algorithm work? Magic Telescope has about
12,000 instances, and if you use a leave-one-out style of testing for your data
set reduction, then your algorithm will run slowly since that is *n*^2 at each
step. If you wish, you may use a random subset of 2,000 of the magic telescope
instances. More information on reduction techniques can be found here.

1. (40%) Implement the *k*-means clustering algorithm __or__ the
HAC (Hierarchical Agglomerative Clustering) algorithm. Attach your source code.
Use Euclidean distance for continuous attributes and (0,1) distances for
nominal and unknown attributes (e.g. matching nominals
have distance 0 else 1, unknown attributes have distance 1). HAC should support both single link and
complete link options. For *k*-means you will pass in a specific *k* value for the number of clusters that
should be in the resulting clustering.
Since HAC automatically generates groupings for all values of *k*, you will pass in a range or set of *k* values for which actual output will be
generated. The output for the algorithm tested should include for each
clustering: a) the number of clusters, b) the centroid values of each cluster,
c) the number of instances tied to that centroid, d) the SSE of each cluster,
and e) the total SSE of the full clustering. The sum squared error (SSE) of a
single cluster is the sum of the squared distances from its instances to the
cluster centroid. Run *k*-means, or HAC-single link and HAC-complete link, on
this exact set of the sponge data set (use all columns and do not shuffle or
normalize) and report your exact results for each algorithm with *k*=4
clusters. For *k*-means use the first 4 elements of the data set as initial
centroids. This will allow us to check the accuracy of your implementation.
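
A bare-bones *k*-means sketch for continuous attributes only is given below, using the first *k* instances as initial centroids as described above. Nominal and unknown values, which the sponge data requires, are deliberately left out, and the function name and return layout are choices made for this example.

```python
from math import dist

def k_means(points, k, max_iters=100):
    """Return (centroids, assignment, per-cluster SSE) for continuous data."""
    centroids = [tuple(p) for p in points[:k]]    # first k instances as initial centroids
    assignment = None
    for _ in range(max_iters):
        # Ties go to the earliest cluster because min() keeps the first minimum.
        new_assignment = [min(range(k), key=lambda c: dist(p, centroids[c]))
                          for p in points]
        if new_assignment == assignment:
            break                                 # converged: nothing moved
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    sse = [sum(dist(p, centroids[c]) ** 2
               for p, a in zip(points, assignment) if a == c)
           for c in range(k)]
    return centroids, assignment, sse

centroids, assignment, sse = k_means([(0, 0), (10, 10), (0, 1), (10, 11)], k=2)
```

The total SSE of the clustering is simply `sum(sse)`.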

To help debug your implementations, you may run them on this data set (don’t
shuffle or normalize it). Note that this is just the labor data set with an
instance id column added for showing results. (Do NOT use the id column or the
output class column as feature data.) The results for 5-means, using the first
5 elements of the data set as initial centroids, should be this. In HAC we just
do the one closest merge per iteration. The results for HAC-single link up to 5
clusters should be this and complete link up to 5 clusters should be this.

For the sample files, we ignore missing values
when calculating centroids and assign them a distance of 1 when determining
total sum squared error. Suppose you had the following instances in a
cluster:

?, 1.0, Round

?, 2.0, Square

?, 3.0, Round

Red, ?, ?

Red, ?, Square

The centroid value for the first attribute would be "Red" and SSE would be 3.
The centroid value for the second attribute would be 2, and SSE would be 4. In
case of a tie, as with the third attribute, we choose the nominal value which
appeared first in the meta data list. So if the attribute were declared as
@attribute Shape{"Round", "Square"}, then the centroid value for the third
attribute would be Round and the SSE would be 3. For other types of ties (node
or cluster with the same distance to another cluster, which should be rare),
just go with the earliest cluster in your list. If all the instances in a
cluster have don’t know for one of the attributes, then use don’t know in
the centroid for that attribute.
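
The worked example above can be reproduced with a short sketch. The helper names are hypothetical, and the declared value list for the first attribute (with "Blue" as a second value) is invented because the example does not give it.

```python
from collections import Counter

UNKNOWN = "?"

def nominal_centroid(values, declared_order):
    """Most common known value; ties go to the value declared first."""
    counts = Counter(v for v in values if v != UNKNOWN)
    if not counts:
        return UNKNOWN
    best = max(counts.values())
    return next(v for v in declared_order if counts.get(v, 0) == best)

def continuous_centroid(values):
    """Mean of the known values, ignoring unknowns."""
    known = [v for v in values if v != UNKNOWN]
    return sum(known) / len(known) if known else UNKNOWN

def attribute_sse(values, centroid, nominal):
    """Sum of squared distances for one attribute; unknowns count as distance 1."""
    total = 0.0
    for v in values:
        if v == UNKNOWN or centroid == UNKNOWN:
            total += 1.0
        elif nominal:
            total += 0.0 if v == centroid else 1.0
        else:
            total += (v - centroid) ** 2
    return total

color = ["?", "?", "?", "Red", "Red"]
number = [1.0, 2.0, 3.0, "?", "?"]
shape = ["Round", "Square", "Round", "?", "Square"]

color_c = nominal_centroid(color, ["Red", "Blue"])       # "Blue" is made up
number_c = continuous_centroid(number)
shape_c = nominal_centroid(shape, ["Round", "Square"])
```

Running `attribute_sse` on the three columns gives 3, 4 and 3, matching the values stated above.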

2. (20%) Run
your variation (*k*-means, or
HAC-single link and HAC-complete link) on the full iris
data set where you do not include the output label as part of the data set. For
*k*-means you should always choose *k* random points in the data set as
initial centroids. If you ever end up
with any empty clusters in *k*-means,
re-run with different initial centroids.
Run it for *k* = 2-7. State whether you normalize or not (your
choice). Graph the total SSE for each *k* and discuss your results (i.e. what
kind of clusters are being made). Now do it again where you include the output
class as one of the input features and discuss your results and any
differences. For this final data set,
run *k*-means 5 times with *k*=4, each time with different initial
random centroids and discuss any variations in the results.

3. (15%) Run
your variation (*k*-means, or
HAC-single link and HAC-complete link) on the following smaller (500 instance) abalone
data set where you include all attributes including the normal output attribute
“rings”. Treat “rings” as a continuous
variable, rather than nominal. Why would
I suggest that? Run it for *k* = 2-7.
Graph your results without normalization. Then run it again with normalization.

4. (10%) For your
normalized abalone experiments, calculate and graph a performance metric for
the clusterings for *k* = 2-7. You may
use the Silhouette or some other metric including one of your own making. If not using Silhouette, justify your
metric. Discuss how effective the metric
you used might be in these cases for selecting which number of clusters is
best.
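
If you choose the Silhouette metric, a direct implementation for continuous, already normalized data could look like this sketch. Scoring singleton clusters as 0 is a common convention adopted here, and the function name and argument layout are assumptions.

```python
from math import dist

def silhouette(points, assignment):
    """Mean silhouette score over all points; needs at least two clusters."""
    clusters = {}
    for p, a in zip(points, assignment):
        clusters.setdefault(a, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for p, a in zip(points, assignment):
        own = clusters[a]
        if len(own) == 1:
            scores.append(0.0)                 # convention for singleton clusters
            continue
        # a_i: mean distance to the other members of the point's own cluster.
        a_i = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b_i: smallest mean distance to the members of any other cluster.
        b_i = min(sum(dist(p, q) for q in other) / len(other)
                  for label, other in clusters.items() if label != a)
        denom = max(a_i, b_i)
        scores.append((b_i - a_i) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

square = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette(square, [0, 0, 1, 1])
bad = silhouette(square, [0, 1, 0, 1])
```

Scores near 1 indicate tight, well separated clusters, and negative scores suggest points sit closer to another cluster than to their own, so `good` comes out high and `bad` comes out negative.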

5. (15%) Do an
experiment of your own regarding clustering.

Come up with one carefully proposed idea for a possible group machine learning
project that could be done this semester. This proposal should be one page
long. The proposal should start with a descriptive title of a few words, your
name, and to what extent you would want to be part of this project if chosen.
Your page will then have the following 3 paragraphs:

1. Description of the project

2. What features the data set would include

3. How and from where the data set would be gathered and labeled

As part of paragraph 2 *give one fully specified example of a data
set instance based on your proposed features, including a reasonable
representation (continuous, nominal, etc.) and value for each feature and
output*. Don’t worry about
normalizing for this example. The actual
values may be fictional at this time.
Creating an example training instance will encourage
you to consider how plausible the future data gathering and representation
might actually be. Following is an
example (not using well thought out features) of what an example instance might
look like if the task was heart attack diagnosis.

__Heart Rate  Pain Level  BP-systolic  BP-diastolic  Age  Gender  Color  Numb  Output__
  96          8           120          80            54   F       Red    N     Yes

Don’t choose a task where the
data set is already worked through and collected, and pretty much ready to use
for training. I want you to learn by
having to work through, at least to some degree, the challenging issues
regarding feature selection and data gathering. Please e-mail me this proposal as a __PDF__
by the due date. You may work together on the proposal and send in one
proposal *if* it is one that you all
specify as wanting to work on.

The grade on this will not be
based on whether your proposal is chosen or not. It will be based on whether it appears that
you put in a reasonable effort to propose a plausible project and if the
proposal is appropriate for a semester project based on what we have learned in
class so far, and also if you included each of the items mentioned above. Sometimes your project will not be chosen for
the potential list simply because one similar to it is already there, or I feel
that it may be too hard to get the data, etc.

Immediately after the due
date I will consider which proposals are most reasonable for the class and
e-mail all of them out to each of you. Read them all at least briefly as part of this
assignment is simply to have you look over a set of possible tasks to get more
of a feel for the types of things that could be done with machine learning.

After reading through them, each of you should reply to my e-mail with 1) a
ranked list of the top 4 projects you would like to be part of (if one of your
choices is the project you proposed, __note which__), and 2) whether you are
committed to give full effort on the group project (regardless of which one you
end up on). If you foresee that you may not be able to (e.g. possibly dropping
the class, etc.), I need to know, so that I can put together sufficiently
dependable groups.

After getting all your
e-mails I will then do my best to place you in teams of 3-4 on a project you
are interested in. Only a subset of the proposals I send out will be
chosen for actual projects. Note that
you may not all get your first choice, but I will guarantee that if a) you are
the one(s) who proposed the project, and if b) that project is chosen and you
put it as your first choice, then you will be on that project.

When a group is chosen to
work on the project, then you will all start fresh to attack the problem and
create the actual set of features you will work with, which may end up being
quite different from the ones proposed in the initial proposal. You may also modify somewhat the initially
proposed project as needed. If you want
to make major changes, please run it by me first. I usually choose projects to e-mail out for
consideration based on their potential, assuming some significant upgrades on
the features will occur once the group gets going, rather than just using the
features suggested by the initial proposer.

Your
goal in the group project is to get the highest possible generalization
accuracy on a real world application. A
large part of the project will be gathering, deriving, and representing input
features. After you have come up with
basic data and features, you will choose a machine learning model(s), and
format the data to fit the model(s).
Expect that initial results may not be as good as you would like. You will then begin the iterative process of
a) trying adjusted/improved data/features, and b) adjusted/different machine
learning models in order to get the best possible results. You may use your own implementations of
models or a resource such as WEKA (http://www.cs.waikato.ac.nz/ml/weka/)
for doing simulations on multiple models.
Your written report and oral presentation (formats below) should contain
at least the following.

a) Motivation and discussion of your chosen task

b) The features used to solve the problem and details on how you gathered and
represented the features, including critical decisions/choices made along the
way

c) Your initial results with your initial model

d) __The iterative steps you took to get better results (improved features
and/or learning models)__

e) Clear reporting and explanation of your final results including your
training/testing approach

f) Conclusions, insights, and future directions you would take if time permitted

1. Prepare and submit a polished paper describing your work. Submit one hard
copy at the beginning of class on the day of the oral presentation. You should
use these LaTeX style files if you are writing in LaTeX (which is a great way to
write nice looking papers) or this Word template. Read one of these links
(especially the early paragraphs) because they contain more information on the
contents of your paper (# of pages, order, etc.). I will not be getting out a
ruler to make sure you match the format exactly, but it is a good experience to
put together a professional looking paper. Your paper should look like a
conference paper, including Title, Author Affiliations, Abstract, Introduction,
Methods, Results and Conclusion (bibliographical references are optional for
this, but do include them if you have any). Here are some examples of well
written write-ups from previous semesters.

2. Prepare a polished oral presentation, including slides, that communicates
your work. Your presentation should take 12 minutes and allow 3 minutes for
questions (a rough guide for slide preparation is about 1 slide per 1-2
minutes). Practice in advance to ensure a quality, appropriately timed
presentation.

3. Email me a thoughtful and honest evaluation of the contributions of your
group members (including yourself). For each, include a score from 0 to 10
indicating your evaluation of their work (10 meaning they were a valuable
member of the team that made significant contributions to the project and were
good to work with, 0 meaning they contributed nothing). If you would like, you
may also include any clarifying comments, etc.