Projects are to be done on a word processor and be neat and
professional. Good writing, grammar, punctuation, etc. are important and points
will be taken off if these things are lacking. Each written project
report should be about 4-5 pages and is limited to no more than five
single-sided, single-spaced pages in 12-point font, including all graphs and
figures. (For the backpropagation
report you may use 6 pages if absolutely necessary.) Communicating clearly and
concisely what you have to say is an important skill you will use throughout
your career. Figures are not to be hand-drawn and should be large enough to be
easily legible. All assignments should be e-mailed as PDF files to cs478ta@axon.cs.byu.edu. Put the assignment name in the subject
line. If you have a question for
the TAs, send it to the same address, but put "question" in the subject line so they can
respond more readily. Except where specifically stated otherwise, all reports
are due by midnight of the published due date. By midnight we mean the very end of the published due date.
Programming
projects will be graded primarily on positive completion and technical accuracy
in the discussion. Most parts of the projects require a clear, measurable
result (e.g. working program, graphs of results, etc.) and some questions where
you analyze your results. Most of the questions will have objective
answers which you are expected to get right. A few questions are more
open-ended and subjective, and in these cases you will be graded based on
perceived effort and thought.
· Instructions and source code for the toolkit can be found here.
· A collection of data sets already in the ARFF format can be found here.
· Details on ARFF are found here.
· The complete and continually updating UC Irvine Data Set is here. This is a good place to go to study more about the problems used in your assignments and also to find new problems.
· In order to help you debug projects we have included some small examples and other hints with actual learned hypotheses so that you can compare the results of your code and help ensure that your code is working properly.
Answer problems 1.1, 1.2, and 1.4 from
the text. This assignment is due before class on the published due date.
1. (25%) Correctly
implement the perceptron learning algorithm and integrate it with the
toolkit. Before implementing your
perceptron you should review the remaining project requirements so that you
implement it in a way that will support them. In particular, note that you will be supporting tasks with
non-binary outputs, which require multiple output nodes. Include your source code as a separate
attachment in this and all labs.
2. (5%) Create
2 ARFF files, each with 8 instances (4 from each class) and 2 real-valued inputs
that range between -1 and 1. One should be linearly separable and
the other not.
3. (10%) Train on both sets (the entire
sets) with the Perceptron Rule.
Try it with a couple different learning rates and discuss the effect of
learning rate, including how many epochs are completed before stopping. (For these situations learning rate
should have minimal effect, unlike with the Backpropagation lab). Discuss the differences in outcome with
the linearly separable vs. the non-linearly separable set.
The basic stopping
criterion for these models is to stop training when no longer making significant
progress, for example, when you
have gone a number of epochs with no significant improvement in accuracy. Describe your specific stopping
criterion. Don’t just stop at the
first epoch in which no improvement occurs.
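One way to read the stopping rule above is as a patience window: keep training until accuracy has failed to improve by a small margin for several consecutive epochs. The following is a minimal sketch for a single-output perceptron in plain Python. The names `train_perceptron`, `patience`, and `min_gain`, and their default values, are assumptions made for illustration and are not part of the toolkit.

```python
import random

def predict(weights, x):
    """Return 1 if the net value (bias weight is last) is positive, else 0."""
    net = sum(w * xi for w, xi in zip(weights, x + [1.0]))
    return 1 if net > 0 else 0

def accuracy(weights, data):
    return sum(predict(weights, x) == t for x, t in data) / len(data)

def train_perceptron(data, lr=0.1, patience=5, min_gain=0.01,
                     max_epochs=1000, seed=0):
    """data is a list of (input_list, target) pairs with targets 0 or 1."""
    rng = random.Random(seed)
    data = list(data)
    weights = [0.0] * (len(data[0][0]) + 1)   # one extra weight for the bias
    best_acc, stale, epochs = 0.0, 0, 0
    # Assumed criterion: stop after `patience` epochs without a gain of
    # at least `min_gain` over the best accuracy seen so far.
    while stale < patience and epochs < max_epochs:
        rng.shuffle(data)                     # new instance order each epoch
        for x, t in data:
            z = predict(weights, x)
            # Perceptron rule: delta_w = lr * (target - output) * input
            for i, xi in enumerate(x + [1.0]):
                weights[i] += lr * (t - z) * xi
        epochs += 1
        acc = accuracy(weights, data)
        if acc > best_acc + min_gain:
            best_acc, stale = acc, 0
        else:
            stale += 1
    return weights, epochs
```

The number of epochs returned includes the final `patience` epochs that showed no improvement, so state clearly in your report how you count epochs.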
Use a learning rate of .1 for experiments 4-6 below.
4. (10%) Graph the
instances and decision line for the two cases above (with LR=.1) and discuss
the graphs.
5. (15%) Use the
perceptron rule to learn this version of the voting task. This particular task is an edited
version of the standard
voting set, where we have replaced all the don’t know values with the most
common value for the particular attribute. Randomly split the data into 70% training and 30% test
set. Try it five times with
different random splits. For each
split report the final training and test set accuracy and the # of epochs
required. Also report the average
of these values from the 5 trials. You should update after every instance. Remember to shuffle the data order
after each epoch. By looking at
the weights, explain what the model has learned and how the individual input
features affect the result. Do one
graph of the average misclassification rate vs epochs (first to final
epoch) for the training set.
Discuss your results.
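The random 70/30 split described above can be sketched as follows. This is only an illustration in plain Python; `split_train_test` and its `seed` parameter are hypothetical names, and the toolkit may already provide its own way to shuffle and split a data set.

```python
import random

def split_train_test(instances, train_fraction=0.7, seed=None):
    """Shuffle a copy of the instances and cut it into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(instances)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def average(values):
    """Mean of a non-empty list, e.g. the test accuracies of the 5 trials."""
    return sum(values) / len(values)
```

Calling `split_train_test` with five different seeds gives five different splits that can be reproduced later when you write up the results.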
6. (20%) Use the
perceptron rule to learn the iris
task. Randomly split the data into
70% training and 30% test set. Try
it five times with different random splits. For each split report the final training and test set
accuracy and the # of epochs required.
Also report the average of these values from the 5 trials. By looking at the weights, explain what
the model has learned and how the individual input features affect the
result. Discuss your results. As a rough sanity check, typical Perceptron accuracies for
these data sets are: Vote: 90%-98%, Iris: 60-95%. Note
that the iris data set has 3 output classes, and the perceptron node only has
two possible outputs. Two common
ways to deal with this are:
a) Create 1 perceptron for each output
class. Each perceptron has its own
training set which considers its class positive and all other classes to be
negative examples. Run all three
perceptrons on novel data and set the class to the label of the perceptron
which outputs high. If there is a
tie, choose the perceptron with the highest net value.
b) Create 1 perceptron for each pair of
output classes, where the training set only contains examples from the 2
classes. Run all perceptrons on
novel data and set the class to the label with the most wins (votes) from the
perceptrons. In case of a tie, use
the net values to decide.
Implement one of these and discuss your
results. For either of these
approaches you can train up the models independently or simultaneously. For testing you just execute the novel
instance on each model and combine the overall results to see which output
class wins.
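The testing step for approach (a) can be sketched as below: run the instance through each class's perceptron and combine the outputs. The function name `classify_one_vs_rest` and the weight layout (bias weight last) are assumptions. The case where no perceptron outputs high is not covered by the description above; falling back to the highest net value there is also an assumption.

```python
def classify_one_vs_rest(weight_sets, labels, x):
    """weight_sets[i] is the weight list (bias last) of the perceptron for labels[i]."""
    inputs = list(x) + [1.0]
    nets = [sum(w * xi for w, xi in zip(ws, inputs)) for ws in weight_sets]
    firing = [i for i, net in enumerate(nets) if net > 0]
    # If exactly one perceptron outputs high, its label wins. If several do
    # (a tie), or none do (assumed fallback), use the highest net value.
    candidates = firing if firing else range(len(nets))
    best = max(candidates, key=lambda i: nets[i])
    return labels[best]
```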
7. (15%) Do your own experiment with
either the perceptron or delta rule.
Show and discuss your results, including what you learned. Have fun and be creative! Make sure you do something more than just
arbitrarily try out the perceptron on some new data set.
Note: In order to help you debug
this and other projects we have included some small examples and other hints with actual learned hypotheses so that you can compare the
results of your code and help ensure that your code is working properly.
1. (30%) Implement the backpropagation algorithm and
integrate it with the toolkit. Attach your source code. Your implementation
should include:
◦ ability to create an arbitrary network structure (# of nodes, layers, etc.)
◦ random weight initialization (small random weights with 0 mean)
◦ on-line/stochastic weight update
◦ a reasonable stopping criterion
◦ training set randomization at each epoch
◦ an option to include a momentum term
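Two of the requirements in the list above, zero-mean random initialization and the momentum term, can be sketched in a few lines. This is an illustration only: the uniform distribution, the `scale` of 0.1, and the function names are assumptions, and `gradient_term` stands for the usual product of a node's error term and its input.

```python
import random

def init_weights(n_inputs, n_outputs, scale=0.1, seed=None):
    """Small random weights drawn uniformly from [-scale, scale] (mean 0)."""
    rng = random.Random(seed)
    # One extra weight per node for the bias.
    return [[rng.uniform(-scale, scale) for _ in range(n_inputs + 1)]
            for _ in range(n_outputs)]

def momentum_step(weight, prev_delta, gradient_term, lr=0.1, momentum=0.9):
    """Return (new_weight, delta) for one weight.

    delta_w(t) = lr * gradient_term + momentum * delta_w(t - 1)
    """
    delta = lr * gradient_term + momentum * prev_delta
    return weight + delta, delta
```

Setting `momentum=0.0` recovers the plain on-line update, which makes the momentum term easy to switch off.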
2. (10%) Use
your backpropagation learner, with stochastic weight updates, for the iris
classification problem. Use one layer of hidden nodes with the number of hidden
nodes being twice the number of inputs.
Remember to use bias weights!
Use a random 75/25 split of the data for the training/test set and a
learning rate of .1. Graph both MSE (mean squared error) and misclassification
accuracy vs epochs. As a rough sanity check, typical Backpropagation accuracies
for these data sets are: Iris: high 70% to low 90%, Vowel: around 60%.
3. (20%) For
3-5 you will just use the vowel data
set. Use your backpropagation
learner, with stochastic weight updates, for the vowel
classification problem. Consider carefully which of the given input features
you should actually use. Initially
use one layer of hidden nodes with the number of hidden nodes being twice the
number of inputs. Remember to use
bias weights. Use a random 75/25
split of the data for the training/test set. Try some different learning
rates. Average the results of a few different random splits for each
learning rate. Graph both MSE
(mean squared error) and Misclassification accuracy vs Epochs for the different
learning rates. For these tasks you should be able to do one graph for
MSE vs LR and another graph for classification accuracy vs LR, where the
results for the different learning rates are shown with a different color, line
type, etc. Explain your stopping
criteria, justify why you used it, and discuss its effect on your
generalization accuracy. (If using a window it should usually be more than just
a few epochs; 20-100 is common.)
If using a validation set, also graph the validation set error per
epoch. How did learning rate
affect your results? Discuss the quality of results from each data set.
4. (10%) Once you have a reasonable learning rate
value chosen, experiment with different numbers of hidden nodes. Start with 2
hidden nodes and double them for each test until you get no more
improvement. For the vowel dataset graph MSE vs epochs for both training
and validation sets for each choice of number of hidden nodes. Discuss the
effect of different numbers of hidden units on the algorithm's performance.
5. (10%) Try some different momentum terms in the learning
equation using the best number of hidden nodes and learning rate from your
earlier experiments. For the vowel
dataset graph MSE vs epochs for both training and validation sets for your
different choices of momentum. How did the momentum term affect your
results?
6. (5%) Using
the best parameters found in 2-4, what difference in accuracy and learning do
you get (for both training and test set) with vowel if you include name and
gender as features vs excluding them? Discuss some possible reasons for this.
7. (15%) Do an experiment of your own regarding
backpropagation learning and report your results and conclusions. Do
something more interesting than just trying BP on a different task, or just a
variation of the other requirements above. Analyze and discuss your results. Try to explain why
things happened the way they did.
Turn in a thoughtful, well-written
report (see the guidelines above) that details your experiments and addresses
the questions posed above (look carefully at everything to make sure you've
covered all the parts of each).
1. (35%) Correctly implement the ID3 decision tree
algorithm, including the ability to handle unknown attributes (You do not need
to handle real valued attributes). Attach your source code. Use standard
information gain as your basic attribute evaluation metric. It is a good
idea to use a simple data set (like the lenses data),
that you can check by hand, to test your algorithm to make sure that it is
working correctly. You should be able to get about 68% predictive accuracy on
lenses.
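Standard information gain is the entropy of the class labels minus the weighted entropy after splitting on an attribute. The sketch below assumes each instance is a sequence of nominal values with the class label last; the function names are illustrative, and unknown values are not handled here (an unknown marker would simply be treated as another attribute value).

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a non-empty list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, attribute_index, label_index=-1):
    """Entropy of the labels minus the weighted entropy after the split."""
    labels = [row[label_index] for row in rows]
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute_index], []).append(row[label_index])
    remainder = sum(len(part) / len(rows) * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder
```

A small data set that you can work by hand, such as lenses, is a convenient check on these numbers before building the full tree.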
2. (15%) Use your ID3 algorithm to induce a decision tree
for the cars data
set and the voting data set. Do not use a stopping criterion, but induce
the tree as far as it can go (until classes are pure or there are no more data
or attributes to split on). Note
that you will need to support unknown attributes in the voting data set.
For each data set first induce a tree using the entire data set as the training
set. How accurately does it classify the data sets (is this a good
measure of predictive accuracy)? Now use 10-fold CV on each data set to predict
how well the models will do on novel data. Report the training and test classification accuracy for
each fold and then average the test accuracies to get your prediction.
Create a table summarizing these accuracy results, and discuss what you observed. As a rough sanity check, typical
Decision Tree accuracies for these data sets are: Lenses: .61-.82, Cars:
.90-.95, Vote: .92-.95.
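The 10-fold cross-validation above needs each instance to appear in exactly one test fold. A minimal sketch of the index bookkeeping, assuming instances are addressed by position, is shown below; `k_fold_splits` is a hypothetical helper and not a toolkit function.

```python
import random

def k_fold_splits(n_instances, k=10, seed=None):
    """Yield (train_indices, test_indices); each index is tested exactly once."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # fold sizes differ by at most 1
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

For each pair, induce the tree on the training indices, record training and test accuracy, and average the ten test accuracies at the end.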
3. (10%) For each of the two problems, summarize in English
what the decision tree has learned (look at the induced tree and describe what
rules it has discovered to try to solve each task). If the tree is large you can just discuss a few of the more
shallow rules and the most important decisions made high in the tree.
4. (10%) How did you handle unknown attributes in the
voting problem? Why did you choose this approach? (Do not use the approach of
just throwing out data with unknown attributes).
5. (15%) Implement an approach to avoid
overfitting. You may use a validation set, reduced error pruning, or some
statistical stopping criterion, etc. If you use your own stopping criterion,
justify why you think it should work. Note that just checking if
information gain is above some threshold is not a good stopping criterion
because it does not take into account any statistical significance information
regarding the new nodes (e.g. the Social Security # problem). If you use information gain you must at
least add some mechanism which takes into account these issues. Create a table comparing the original
trees created with no overfit avoidance in item 2 above and the trees you create
with your overfit avoidance technique.
This table should compare a) the size (# of nodes and tree depth) of the
final decision trees and b) the generalization accuracy. Summarize your findings.
6. (15%) Do an experiment of your own regarding decision
trees and report your results and conclusions. You can be as creative as
you would like on this. Experiments could include modifying the
algorithm, modifying the measure of best attribute, comparing information gain
and gain ratio, comparing different stopping criteria and/or pruning
approaches, etc. Analyze and discuss your results. Try to explain
why things happened the way they did.
Turn in a thoughtful, well-written
report (see the guidelines above) that addresses each question posed
above (look carefully at everything to make sure you've covered all the parts
of each).
1. (40%) Implement the k nearest neighbor algorithm and the k nearest
neighbor regression algorithm, including optional distance weighting for both
algorithms. Attach your source code.
2. (10%) Use the k nearest neighbor algorithm for the magic
telescope problem using this training set
and this test set.
Try it with k=3
with normalization (input features normalized between 0 and 1) and without normalization
and discuss the accuracy results on the test set. For the rest of the
experiments use only normalized data. With just the normalized training
set as your data, graph classification accuracy on the test set with odd values
of k from
1 to 15. Which value of k is the best in terms of classification accuracy? As a rough sanity check, typical k-nn accuracies for the magic telescope data set are 75-85%.
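Normalizing input features between 0 and 1 is commonly done with min-max scaling. The sketch below takes the minimum and maximum from the training set and applies them to any set of rows; that choice, the handling of constant columns, and the function names are assumptions. Test-set values outside the training range will fall slightly outside [0, 1] under this scheme.

```python
def fit_min_max(rows):
    """Return per-column (mins, maxs) for a list of numeric feature rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]

def apply_min_max(rows, mins, maxs):
    """Scale each column to [0, 1] using the given mins and maxs."""
    scaled = []
    for row in rows:
        scaled.append([
            (v - lo) / (hi - lo) if hi > lo else 0.0   # constant column maps to 0
            for v, lo, hi in zip(row, mins, maxs)
        ])
    return scaled
```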
3. (10%) Use the regression variation of your
algorithm for the housing
price prediction problem using this training set
and this test set.
Report Mean Square Error on the test set as your accuracy metric for this case.
Experiment using odd values of k from 1 to 15. Which value of k is the best?
4. (10%) Repeat
your experiments for magic telescope and housing using distance-weighted
(inverse of distance squared) voting. Discuss how distance weighting affects
the algorithm performance.
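Distance weighting with the inverse of the squared distance can be sketched as follows for both classification and regression. This is an illustration using Euclidean distance on numeric features; the function names and the small `eps` added to avoid division by zero on exact matches are assumptions.

```python
import math
from collections import defaultdict

def k_nearest(train, query, k):
    """train is a list of (features, target); return the k closest as (distance, target)."""
    scored = [(math.dist(x, query), y) for x, y in train]
    return sorted(scored, key=lambda pair: pair[0])[:k]

def predict_class(train, query, k=3, weighted=False, eps=1e-12):
    """Majority vote, optionally weighting each vote by 1 / distance^2."""
    votes = defaultdict(float)
    for d, y in k_nearest(train, query, k):
        votes[y] += 1.0 / (d * d + eps) if weighted else 1.0
    return max(votes, key=votes.get)

def predict_value(train, query, k=3, weighted=False, eps=1e-12):
    """Mean of the neighbors' targets, optionally weighted by 1 / distance^2."""
    neighbors = k_nearest(train, query, k)
    weights = [1.0 / (d * d + eps) if weighted else 1.0 for d, _ in neighbors]
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)
```

With weighting turned on, a single very close neighbor can outvote several distant ones, which is one effect worth looking for in your results.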
5. (10%) For
the best value of k for each dataset, implement a reduction algorithm that removes
data points in some rational way such that performance does not drop too
drastically on the test set given the reduced training set. Compare your
performance on the test set for the reduced and non-reduced versions and give
the number (and percentage) of training examples removed from the original
training set. (Note that performance for magic telescope is classification
accuracy and for housing it is sum squared error). How well does your pruning
algorithm work? Magic Telescope has about 12,000 instances and if you use a
leave one out style of testing for your data set reduction, then your algorithm
will run slowly since that is n^2 at each step.
If you wish, you may use a random subset of 2,000 of the magic telescope
instances. More information on
reduction techniques can be found here.
6. (10%) Use the k nearest neighbor algorithm to solve the credit-approval
(credit-a) data set.
Note that this set has both continuous and nominal attributes, together with
don’t know values. Implement and justify a distance metric which supports
continuous, nominal, and don’t know attribute values. Use your own choice
for k,
training/test split, etc. and discuss your results. More information on
distance metrics can be found here.
As a rough sanity check, typical k-nn accuracies for
the credit data set are 70-80%.
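One possible shape for such a metric is a per-attribute distance that is combined in Euclidean fashion: a range-normalized difference for continuous values, a 0/1 overlap for nominal values, and the maximum distance of 1 whenever either value is unknown. The sketch below is only one candidate, not the required answer; the `UNKNOWN` marker, the argument layout, and the choice of 1 for unknowns are all assumptions that you would still need to justify.

```python
import math

UNKNOWN = None  # assumed marker for a don't know value

def attribute_distance(a, b, is_nominal, value_range):
    """Distance between two values of one attribute, roughly in [0, 1]."""
    if a is UNKNOWN or b is UNKNOWN:
        return 1.0                       # assumed: maximal distance for unknowns
    if is_nominal:
        return 0.0 if a == b else 1.0    # overlap distance for nominal values
    if value_range == 0:
        return 0.0
    return abs(a - b) / value_range      # range-normalized continuous difference

def heterogeneous_distance(x, y, nominal_flags, ranges):
    """Euclidean combination of the per-attribute distances."""
    return math.sqrt(sum(
        attribute_distance(a, b, nom, rng) ** 2
        for a, b, nom, rng in zip(x, y, nominal_flags, ranges)
    ))
```

Here `ranges` holds max minus min for each continuous attribute (taken from the training data) and is ignored for nominal attributes.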
7. (10%) Do an experiment of your own regarding the k nearest
neighbor paradigm and report your results and conclusions. Analyze and
discuss your results. Try to explain why things happened the way they
did.
Turn in a thoughtful, well-written
report (see the guidelines above) that details your experiments and
addresses the questions posed above (look carefully at everything to make sure
you've covered all the parts of each).
1. (40%) Implement the k-means clustering algorithm and the HAC
(Hierarchical Agglomerative Clustering) algorithm. Attach your source code. Use
Euclidean distance for continuous attributes and (0,1) distances for nominal
and unknown attributes (e.g. matching nominals have
distance 0 else 1, unknown attributes have distance 1). HAC should support both single link and
complete link options. For k-means you
will pass in a specific k value for the number of clusters that should be in the resulting
grouping. Since HAC automatically
generates groupings for all values of k, you will pass in a range or set of k values
for which output will be generated. The output for both algorithms should
include for each grouping: a) the number of clusters, b) the centroid values of
each cluster, c) the number of instances tied to that centroid, d) the SSE of
each cluster, and e) the total SSE of the full group of clusters. The sum
squared error (SSE) of a single cluster is the sum of the squared distances to
the cluster centroid. Run k-means,
HAC-single link, and HAC-complete link on this
exact set of the sponge data set (use all columns and do not shuffle or
normalize) and report your exact results for each algorithm with k=4 clusters. For k-means use the first 4 elements of the data
set as initial centroids. This
will allow us to check the accuracy of your implementation.
To help
debug your implementations, you may run them on this data set (don’t shuffle or normalize
it). Note that this is just the labor data set with an instance id column added
for showing results. (Do NOT use
the id column or the output class column as feature data). The results for 5-means, using the
first 5 elements of the data set as initial centroids should be this.
In HAC we just do the one closest merge per iteration. The results for HAC-single link up to 5
clusters should be this and complete
link up to 5 clusters should be this.
For the
sample files, we ignore missing values when calculating centroids and assign
them a distance of 1 when determining total sum squared error. Suppose
you had the following instances in a cluster:
?, 1.0, Round
?, 2.0, Square
?, 3.0, Round
Red, ?, ?
Red, ?, Square
The
centroid value for the first attribute would be "Red" and SSE would
be 3. The centroid value for the second attribute would be 2, and SSE
would be 4. In case of a tie as
with the third attribute, we choose the nominal value which appeared first in
the meta data list. So if the
attribute were declared as @attribute Shape{"Round",
"Square"}, then the centroid value for the third attribute would be
Round and the SSE would be 3. For
other types of ties (node or cluster with the same distance to another cluster,
which should be rare), just go with the earliest cluster in the list. If all the instances in a cluster have
don’t know for one of the attributes, then use don’t know in the centroid for
that attribute.
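The worked example above can be reproduced with a short sketch. It follows the stated rules: unknown values are ignored when computing a centroid, contribute a distance of 1 to the SSE, and nominal ties go to the value declared first. The function names are illustrative, the declared value list for the first attribute is hypothetical, and charging a distance of 1 when the centroid itself is unknown is an assumption.

```python
from collections import Counter

UNKNOWN = "?"

def nominal_centroid(values, declared_order):
    """Most common known value; ties go to the value declared first."""
    known = [v for v in values if v != UNKNOWN]
    if not known:
        return UNKNOWN
    counts = Counter(known)
    top = max(counts.values())
    return next(v for v in declared_order if counts.get(v, 0) == top)

def continuous_centroid(values):
    """Mean of the known values, or unknown if there are none."""
    known = [v for v in values if v != UNKNOWN]
    return sum(known) / len(known) if known else UNKNOWN

def attribute_sse(values, centroid, nominal):
    """Sum of squared distances from each value to the centroid."""
    total = 0.0
    for v in values:
        if v == UNKNOWN or centroid == UNKNOWN:
            total += 1.0                      # unknown contributes distance 1
        elif nominal:
            total += 0.0 if v == centroid else 1.0
        else:
            total += (v - centroid) ** 2
    return total

color = ["?", "?", "?", "Red", "Red"]
size = [1.0, 2.0, 3.0, "?", "?"]
shape = ["Round", "Square", "Round", "?", "Square"]

c1 = nominal_centroid(color, ["Red", "Blue"])   # hypothetical declaration
c2 = continuous_centroid(size)
c3 = nominal_centroid(shape, ["Round", "Square"])
print(c1, attribute_sse(color, c1, True))    # Red 3.0
print(c2, attribute_sse(size, c2, False))    # 2.0 4.0
print(c3, attribute_sse(shape, c3, True))    # Round 3.0
```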
2. (20%) Run all three variations (k-means,
HAC-single link, and HAC-complete link) on the full iris
data set where you do not include the output label as part of the data set. For
k-means
you should always choose k random points in the data set as initial centroids. If you ever end up with any empty
clusters in k-means,
re-run with different initial centroids.
Run it for k = 2-7. State whether
you normalize or not (your choice).
Graph the total SSE for each k and discuss your results (i.e. what kind of
clusters are being made). Now do it again where you include the output class as
one of the input features and discuss your results and any differences. For this final data set, run k-means 10
times with k=4,
each time with different initial random centroids and discuss any variations in
the results.
3. (15%) Run all three variations (k-means, HAC-single
link, and HAC-complete link) on the following smaller (500 instance) abalone data set where you include all
attributes including the normal output attribute “rings”. Treat “rings” as a continuous variable,
rather than nominal. Why would I
suggest that? Run it for k = 2-7. Graph and discuss your results without
normalization. Then run it again
with normalization and discuss any differences.
4. (10%) For your Iris experiments (with output class) and
your normalized abalone experiments, calculate and graph a performance metric
for the clusterings for k = 2-7. You may use the Davies-Bouldin
Index or some other metric including one of your own making. If not using Davies-Bouldin,
justify your metric. Discuss how
effective the metric you used might be in these cases for selecting which
number of clusters is best.
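For reference, the Davies-Bouldin index averages, over all clusters, the worst-case ratio of combined within-cluster scatter to between-centroid distance; lower values indicate more compact, better separated clusters. The sketch below assumes purely continuous attributes, Euclidean distance, at least two non-empty clusters, and distinct centroids. The function name and input layout are illustrative.

```python
import math

def davies_bouldin(clusters):
    """clusters is a list of clusters, each a list of equal-length numeric points."""
    centroids = [[sum(col) / len(points) for col in zip(*points)]
                 for points in clusters]
    # Scatter: average distance from each point to its own centroid.
    scatter = [
        sum(math.dist(p, c) for p in points) / len(points)
        for points, c in zip(clusters, centroids)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst (largest) similarity ratio between cluster i and any other cluster.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k) if j != i
        )
    return total / k
```

Nominal or unknown attribute values would need the same (0,1) distance treatment used elsewhere in this lab before the index could be applied to them.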
5. (15%) Do an experiment of your own regarding
clustering and report your results and conclusions. Analyze and discuss
your results. Try to explain why things happened the way they did.
Come up with one carefully
proposed idea for a possible group machine learning project, that could be done
this semester. This proposal
should not be more than one page long.
It should include a thoughtful first draft proposal of a) description of
the project, b) what features the data set would include, and c) how and from
where the data set would be gathered and labeled. Give at least one fully specified example of a data set instance based
on your proposed features, including a reasonable representation (continuous,
nominal, etc.) and value for each feature. The actual values may be fictional at this time. This effort will cause you to consider
how plausible the future data gathering and representation might actually be.
Also specify whether this is
a project you would really like to be part of or if you just did it because it
was assigned. I will use this in helping decide which subset of the
proposals I will send to the class. Please e-mail me this proposal by the
due date. I will then consider which ones are most reasonable (about
12-15) and e-mail them out to the class. I will then have each of you
e-mail me a ranked list of the top 4 projects you would like to be part
of. I will then do my best to place you in a team of 3 on a project
you are interested in. Note that you may not all get your first choice,
but I guarantee that if you are the one who proposes a project, that
project is chosen, and you want to be on it, you will get to be on that
project.
You can work together on this
(up to 2 people) and send in one proposal if it is one that you both specify as
wanting to work on.
The grade on this will not be
based on whether your proposal is chosen or not. It will be based on whether it appears that you put in a
reasonable effort to propose a plausible project and if the proposal is
appropriate based on what we have learned in class.
Your goal in the group project is to
get as high a generalization accuracy as possible on a real world application. A large part of the project will be in
gathering, deriving, and representing input features. After you have come up with basic data and features, you
will choose a machine learning model, and format the data to fit the model. Expect that initial results may not be
as good as you would like. You will
then begin the iterative process of a) trying adjusted/improved data/features,
and b) adjusted/different machine learning models in order to get the best
possible results. Your report and
presentations (format below) should contain at least the following.
a) Motivation and discussion of your chosen task
b) The features used to solve the problem and details on how you gathered and represented the features, including critical decision/choices you made along the way
c) Your initial results with your initial model
d) The iterative steps you took to get better results (improved features and/or learning models)
e) Clear reporting and explanation of your final results including your training/testing approach
f) Conclusions, insights, and future directions you would take if time permitted
1. Prepare and submit a polished paper
describing your work. Submit one hard copy at the beginning of class on the day
of the oral presentation. You
should use these LaTeX
style files if you are writing in LaTeX (which is
a great way to write nice looking papers) or this Word template. Read one of these links (especially the early paragraphs)
because they contain more information on the contents of your paper (#pages,
order, etc.). I will not be
getting out a ruler to make sure you match format exactly, but it is a good
experience to put together a professional looking paper. Your paper should look like a
conference paper, including Title, Author Affiliations, Abstract, Introduction,
Methods, Results and Conclusion (Bibliographical references are optional for
this, but do include them if you have any). Here is an
example of a well written write-up from a previous semester.
2. Prepare a polished oral presentation,
including slides, that communicates your work. Your presentation should take 12
minutes and allow 3 minutes for questions (a rough guide for slide preparations
is about 1 slide per 1-2 minutes). Practice in advance to ensure a quality
presentation.
3. Email me a thoughtful and honest
evaluation of the contributions of your group members (including yourself). For
each, include a score from 0 to 10 indicating your evaluation of their work (10
meaning they were a valuable member of the team that made significant
contributions to the project and were good to work with, 0 meaning they
contributed nothing). If you would like, you may also include any clarifying
comments, etc.