Assignment Specifications

Guidelines

Each written assignment is limited to no more than four single-sided, single-spaced pages (the first should require significantly less) using 12 point font, including all figures. Figures are not to be hand-drawn and should be large enough to be easily legible. Anything beyond the four page limit will not be graded! You must work to include all important information in the allotted four pages. Communicating clearly and concisely what you have to say is an important skill you will use throughout your career. NOTE: All written assignments are to be done on a word processor and be neat and professional.

Grading

Lab assignments will be graded 40% on positive completion, 50% on technical accuracy and insight and 10% on writing and presentation. Most parts of the assignments are fairly objective and for these, you are expected to be correct. A few questions are more open-ended and subjective, and in these cases you will be graded based on perceived effort and understanding. Good writing, grammar, punctuation, etc. are extremely important because of the great effect they have on the impact of your work.

Written Assignment

This assignment is worth 4% of your overall Homework and Programming Assignments grade.

Thoughtfully answer the following questions:

  1. Explain what you believe it means for a system to learn.
  2. Give three examples of problems which you believe are well-suited for solution by machine learning. Give three examples of problems which you believe are not amenable to solution by machine learning. In each instance, give reasons to justify your choice.
  3. Argue for why future machine learning systems will exhibit human level intelligence. Then argue for why they will not. (Although possibly justifiable, theological arguments are not acceptable for the purposes of this assignment.)

This assignment should be between one and two (typed) pages in length.

Data Set Module Assignment

This assignment is worth 5% of your overall Homework and Programming Assignments grade.

In this first assignment you will begin implementing the input functionality necessary for training and testing the various models we will experiment with over the semester. You are to implement the structures and associated operations necessary to hold and manipulate datasets and instances.

We will always use ARFF (Attribute-Relation File Format) files for our datasets, and we will make the assumption that all data will fit in RAM. Details on ARFF and a collection of ARFF data sets are provided.

Unlike most lab assignments, this one will require you to demonstrate your code for one of the TAs, so you are responsible for doing so on or before the due date. For the purposes of passing off the assignment, you must add a few display functions that dump information to the screen and a "main" function to exercise your functions/methods/whatevers. In order to get full credit, your "program" must accurately:

  • Ask the user to enter the name of an ARFF file (w/o the extension)
  • Load the ARFF file into adequate structures in memory
  • Display on screen:
    • The name of the dataset and its size (i.e., number of instances)
    • The number of attributes and for each attribute: its name and type. If the type is not continuous, then a list of all possible values
    • The first 5 instances from the data section of the ARFF file
    • Randomize the data and display the first 5 instances of the "new" dataset

The 5 datasets used for testing (and passing off) are as follows:

Experiment Shell Assignment

This assignment is worth 7% of your overall Homework and Programming Assignments grade.

In this assignment you will implement a user interface for managing all your learning modules, conducting experiments, and evaluating and reporting results. The general idea is to implement a high-level module whose role is to:

  • Provide a user-friendly interface
  • Hide implementation details from the user
  • Offer uniform access to all learning modules
  • Control execution

Your shell should allow users to:

  • Select a dataset
  • Select a learning algorithm
  • Select an evaluation method
  • See a report of the experiment, including:
    • Name of dataset, learning algorithm used and evaluation method used (also display any salient parameters)
    • Sizes of training and test set
    • Predictive accuracy

The shell should support:

  • The various methods for accuracy prediction evaluation.
  • A function for learning (inducing a model of the user-selected type)
  • A function for estimating generalization accuracy (using the user-selected method)

For the purpose of passing off the assignment, you will need to stub out some of your functions since no learners have been implemented yet. As the semester progresses, we will be adding functionality to the system in the form of learning modules, evaluation methods, etc. Some thinking now about the design of the system as a whole will make your life much simpler down the road.

In order to get full credit, you must:

  • Hand-in a short document describing your overall design (no more than 2 pages)
  • Explain your design philosophy to the TAs (motivation, how things will fit in, etc.)
  • Demonstrate your program:
    • Try various user input selections
    • Show the results (adequately stubbed out so the TAs know what would have been executed)

Decision Tree Assignment

This assignment is worth 12% of your overall Homework and Programming Assignments grade.

  1. Implement the ID3 decision tree algorithm, including the option for reduced error pruning and integrate it with your experiment shell. It is a good idea to use a simple data set (like the lenses data), that you can check by hand, to test your algorithm to make sure that it is working correctly.
  2. Use your ID3 algorithm to induce a decision tree for the iris problem using the entire dataset. Give a visual representation of the tree. How accurately does it classify the data set (is this a good measure of predictive accuracy)? Now randomly split the data set into 70% training data and 30% test data and re-induce the tree using only the training set. How accurately does it classify the training set? How accurately does it classify the test set? What does this have to do with estimating predictive accuracy?
  3. Repeat your experiments for the congressional voting record problem.
  4. Implement "reduced-error pruning" and use it on the trees generated for the iris and voting problems. In each case, use a validation set (~10%) from the training set to drive the pruning algorithm (and be sure not to use it for training). Give a visual representation of the pruned trees. How accurately do they classify the training set? How accurately do they classify the test set?
  5. Analyze the data you have collected and thoughtfully answer the following questions:
    1. For both problems, do your best to summarize in English what the decision tree has learned.
    2. For both problems, do your best to summarize in English what the pruned decision tree has learned. How is this different than the unpruned tree?
    3. How did you handle unknown attributes in the voting problem?
    4. Does the pruning algorithm improve the algorithm's performance on either problem? Why or why not?
    5. Come up with your own experiment. Briefly and carefully explain it, run it and report and discuss your results.

Turn in a thoughtful, well-written written report (see the guidelines above) that details your experiments and addresses the questions posed above (look carefully at everything to make sure you've covered all the parts of each).

Backpropagation Assignment

This assignment is worth 12% of your overall Homework and Programming Assignments grade.

  1. Implement the backpropagation algorithm and integrate it with your experiment shell. Your implementation should include:
    • Incremental weight update
    • A reasonable stopping criterion
    • Training set randomization at each epoch

  2. Use your backpropagation learner for the iris classification problem. Use a 70/30 split of the data for the training/test set. Experiment with different learning rates. Graph training and test set TSS over time for several different learning rates. Explain and justify your stopping criterion.
  3. Repeat your experiments for the vowel identification problem. This time use a 75/25 split for training and testing sets.
  4. Once you have a reasonable learning rate, experiment with different numbers of hidden nodes. Start with 2 hidden nodes and double the number for each test until you see no more improvement on the training set. For both data sets, graph TSS and classification error for both the training and test sets for each number of hidden nodes.
  5. Analyze the data you have collected and thoughtfully answer the following questions:
    1. Did the backprop perform significantly better on one of the problems? If so, which one? Discuss some possible reasons for this.
    2. Discuss the effect of different learning rates on the algorithm's performance.
    3. Discuss the effect of different numbers of hidden units on the algorithm's performance.
    4. On the vowel data, what types of errors are most common for your network (the best way to see this is to create a confusion matrix for the network outputs)? Why? Suggest some modifications/approaches to improving your results
    5. Come up with your own experiment. Briefly and carefully explain it, run it and report and discuss your results.

Turn in a thoughtful, well-written written report (see the guidelines above) that details your experiments and addresses the questions posed above (look carefully at everything to make sure you've covered all the parts of each).

Midterm Report

This assignment is worth 12% of your overall Homework and Programming Assignments grade.

  1. Update your experiment shell to handle significance testing in the form of a paired permutation test.
  2. Visit the UCI Machine Learning Repository and select 5 data sets to which both ID3 and backpropagation learning can be applied.
  3. For each of your 5 data sets, compare the performance of ID3 and backpropagation using a paired permutation test.
    • for ID3 do not use pruning.
    • for the backprop algorithm, use incremental weight updates, a single hidden layer (with an appropriate number of hidden nodes) and no momentum.
    • for the permutation tests, obtain 10 paired accuracy estimates (the best and easiest way to do this is 10-fold cross-validation) and perform an exhaustive permutation test. To ensure the validity of your statistical estimates, you should choose data sets large enough that each of your folds contains at least 30 instances.

Turn in a thoughtful, well-written written report (see the guidelines above) that details your experiments. Explain and justify any decisions you make, summarize and analyze your results and draw conclusions.

Naive Bayes Assignment

This assignment is worth 12% of your overall Homework and Programming Assignments grade.

  1. Implement the naive Bayes algorithm and integrate it with your experiment shell.
  2. Use your naive Bayes learner to learn the nursery problem and evaluate its predictive accuracy using 10-fold cross-validation.
  3. Repeat your experiment using a 60/40 split of the data for the training/test set.
  4. Use your naive Bayes learner for the soybean problem. Use this training set and evaluate predictive accuracy using this test set.
  5. When the assumption of conditional independence holds, naive Bayes is a maximum a posteriori classifier. How might you check whether this assumption holds for a given dataset? Use your method on the nursery problem and decide whether or not naive Bayes is "safe" on it.
  6. Come up with your own experiment. Briefly and carefully explain it, run it and report and discuss your results.

Turn in a thoughtful, well-written written report (see the guidelines above) that details your experiments and addresses the questions posed above (look carefully at everything to make sure you've covered all the parts of each).

Instance-based Learning Assignment

This assignment is worth 12% of your overall Homework and Programming Assignments grade.

  1. Implement the k nearest neighbor algorithm, including optional distance weighting, and integrate it with your experiment shell.
  2. Use your algorithm for the balanced scale problem using this training set and this test set. Experiment with odd values of k from 1 to 19. Which value of k is the best?
  3. Use your algorithm for the housing price prediction problem using this training set and this test set. Note that this is a regression problem -- report Root Mean Squared Error as your accuracy metric. Experiment using odd values of k from 1 to 19. Which value of k is the best? Repeat your experiments using normalized input features.
  4. Repeat your experiments using distance-weighted (inverse of distance squared) voting.
  5. For the best value of k for each dataset (and using normalized data for the housing data), implement a simple "pruning" algorithm that iteratively removes data points that least affect performance on a validation set. Graph performance on the validation and test sets vs. number of training examples removed.
  6. Analyze the data you have collected and thoughtfully answer the following questions:
    1. How well suited is the nearest neighbor algorithm for these datasets? Why?
    2. How does distance-weighted voting affect your algorithm's performance? Why?
    3. What effect does normalization of the input data have on the algorithm's performance on the housing data set? Why?
    4. How well does the simple pruning algorithm work? How many examples can you eliminate without significantly affecting performance?
    5. Come up with your own experiment. Briefly and carefully explain it, run it and report and discuss your results.

Turn in a thoughtful, well-written written report (see the guidelines above) that details your experiments and addresses the questions posed above (look carefully at everything to make sure you've covered all the parts of each).

Genetic Algorithm Assignment

This assignment is worth 12% of your overall Homework and Programming Assignments grade.

  1. Implement a genetic algorithm for evolving feed forward neural networks with a single hidden layer and integrate it with your experiment shell.
    • The number of inputs and outputs will depend on the problem being solved.
    • The number of hidden nodes should be chosen to be "reasonable", perhaps by revisiting your results on the backpropagation assignment.
    • In the genome, weights should be binary-encoded, integer-valued and range from -512 to +512; when expressed as a phenotype, they should be real-valued and range from -0.512 to 0.512.
    • You will need to use your NN module to compute fitness = (accuracy on 80% of the data). You should randomly choose 80% of the data and compute classification accuracy on this set as your fitness score. The remaining 20% should be used as a holdout set (in this exercise we will not do anything but report results on the holdout set).
  2. Using the iris dataset:
    • Use fitness proportionate selection.
    • During evolution, graph (on the same graph) the best fitness, worst fitness and average fitness of your population over time (number of generations). Also, graph (together, on a different graph than the first) the classification accuracy on the 20% holdout set for the best member of the population, worst member of the population and the average classification accuracy for the population on the hold out set.
    • Experiment with different values for population size, replacement rate and mutation rate. Choose those that you think are best and justify your choices.
  3. Using your chosen values for population size, replacement rate and mutation rate, re-run your experiments using tournament selection and rank selection. For tournament selection, use a probability p=0.75 that the higher fitness wins. For rank selection, use probability of selection p=((popsize+1)-rank)/sum(all ranks).
  4. Analyze the data you have collected and thoughtfully answer the following questions:
    1. How do the values for population size, replacement rate and mutation rate affect the performance of the algorithm?
    2. Which of the selection methods worked the best? Why?
    3. How does the evolved neural network for the iris problem compare with your network trained using backpropagation? How does the training time for the two approaches compare? How do their accuracies compare?
    4. Come up with your own experiment. Briefly and carefully explain it, run it and report and discuss your results.

Turn in a thoughtful, well-written written report (see the guidelines above) that details your experiments and addresses the questions posed above (look carefully at everything to make sure you've covered all the parts of each).

Stacking

This assignment is worth 12% of your overall Homework and Programming Assignments grade.

  1. Choose 3 datasets which can be used with Backprop, Decision Tree, Naive Bayes and k-NN.
  2. Train ten different learning models on each of the tree data sets. These ten models should be made up of four different MLP networks (different numbers of hidden nodes), two different decision trees (full tree and stump), one naive Bayes net and three k-NN models (different values for k). As necessary, choose learning parameters for the algorithms based on your previous experience and state which parameters you used.
  3. Create a new training set for each of the three applications whose input features are the outputs of the ten models and whose output(s) is/are the same as in the original training set.
  4. Train a stacked model (you choose which of the four algorithms you want to use for the "stacked" model) on the three new training sets.
  5. For each of the three data sets report and compare the results of the stacked ensemble with the results of each of the four individual learning algorithms.
  6. Come up with your own experiment. Briefly and carefully explain it, run it and report and discuss your results.

Turn in a thoughtful, well-written written report (see the guidelines above) that details your experiments. Explain and justify any decisions you make, summarize and analyze your results and draw conclusions.