We will always use ARFF files for our datasets, and we will make the assumption
that all data will fit in RAM. Details on ARFF are found here. A collection of
data sets already in the ARFF format can be found here.
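To make the "everything fits in RAM" assumption concrete, here is a minimal sketch of reading a simplified ARFF document into a list of lists of floats. It is not the toolkit's own parser: the function name load_arff and the choice to encode nominal values by their index and missing values ("?") as NaN are assumptions made for illustration. It does not handle quoted names containing spaces, string or date attributes, or sparse rows.

```python
import math


def load_arff(text):
    """Parse a simplified ARFF document held in a string.

    Returns (relation, attributes, rows). attributes is a list of
    (name, nominal_values_or_None); rows is a list of lists of floats.
    """
    relation = ""
    attributes = []
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            row = []
            for (name, values), cell in zip(attributes, line.split(",")):
                cell = cell.strip()
                if cell == "?":
                    row.append(math.nan)  # missing value (assumed encoding)
                elif values is None:
                    row.append(float(cell))  # continuous attribute
                else:
                    row.append(float(values.index(cell)))  # nominal -> index
            rows.append(row)
            continue
        lower = line.lower()
        if lower.startswith("@relation"):
            relation = line.split(None, 1)[1].strip()
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            kind = kind.strip()
            if kind.startswith("{"):
                # Nominal attribute: remember its allowed values in order.
                values = [v.strip() for v in kind.strip("{}").split(",")]
                attributes.append((name, values))
            else:
                attributes.append((name, None))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows
```

Calling load_arff(open("iris.arff").read()) on a file such as the iris data set would return one row of floats per instance, with the class label stored as an index into its list of nominal values.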
A basic toolkit is provided in C++, Java, and Python to help you get started
implementing learning algorithms. You are also welcome to code up your own
toolkit or modify the source code made available here however you want. If you
do so, here are a few things to keep in mind:
The CS 478 toolkit is intended as a starting place for working with machine
learning algorithms. It provides the following functionality to run your
algorithms:
Build Instructions for the Java version:
Download the zip file here
Build Instructions for Linux:
1. Unzip the zip file
2. javac *.java
(We do not use Windows, but there is nothing fancy in the code, and I imagine
that it should work in Microsoft Visual C++. You will need to create a project
solution. If you use Windows and have trouble, come see us for help.)
Build Instructions for the C++ version:
Open the terminal
wget http://axon.cs.byu.edu/~martinez/classes/478/stuff/toolkitc.zip
unzip toolkitc.zip
cd toolkit/src/
make opt
Test to make sure that the toolkit works:
mkdir datasets
cd datasets/
wget http://axon.cs.byu.edu/~martinez/classes/478/stuff/iris.arff
cd ..
./bin/MLSystemManager -L baseline -A datasets/iris.arff -E training
You should see the results for a baseline classifier (33% accuracy on iris).
Usage Instructions for both versions:
MLSystemManager -L [LearningAlgorithm] -A [ARFF_File] -E [EvaluationMethod]
{[ExtraParameters]} [-N] [-R seed]
Where -N normalizes the training and test data sets (the normalization max and
min come from the training set).
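As a rough illustration of what taking the max and min from the training set means, here is a minimal sketch of min-max normalization. The function names are hypothetical, and scaling to the range [0, 1] with every column treated as continuous is an assumption; the toolkit's actual -N behavior may differ in those details.

```python
def fit_min_max(train_rows):
    """Compute per-column minimum and maximum from the training rows only."""
    columns = list(zip(*train_rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(rows, mins, maxs):
    """Scale rows with previously fitted training-set minima and maxima."""
    normalized = []
    for row in rows:
        new_row = []
        for value, lo, hi in zip(row, mins, maxs):
            # A constant column has hi == lo; map it to 0.0 to avoid
            # dividing by zero.
            new_row.append((value - lo) / (hi - lo) if hi > lo else 0.0)
        normalized.append(new_row)
    return normalized


train = [[0.0, 10.0], [4.0, 20.0]]
test = [[2.0, 30.0]]
mins, maxs = fit_min_max(train)
train_n = apply_min_max(train, mins, maxs)
test_n = apply_min_max(test, mins, maxs)
```

Because the test set is scaled with the training set's range, test values can fall outside [0, 1], as the second column of test_n does here.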
The -R option allows you to pass in a seed for the random number generator. By
default, the data set is shuffled differently each time you run the code. If you
wish to reproduce the same shuffle, provide a seed such as 1 or 2.
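The effect of a seed can be sketched in a few lines. This is an illustration using Python's standard random module, not the toolkit's code, and the function name shuffled_rows is hypothetical.

```python
import random


def shuffled_rows(rows, seed=None):
    """Return a shuffled copy of rows.

    With seed=None the generator is seeded from a system source, so each run
    gives a different order; with a fixed seed the order is repeatable.
    """
    rng = random.Random(seed)
    copy = list(rows)
    rng.shuffle(copy)
    return copy


data = list(range(20))
first = shuffled_rows(data, seed=1)
second = shuffled_rows(data, seed=1)  # same seed, same order as `first`
```

Fixing the seed is useful when comparing two learning algorithms, since both then see the same training and test instances.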
Possible evaluation methods are:
./MLSystemManager -L [LearningAlgorithm] -A [ARFF_File] -E training
./MLSystemManager -L [LearningAlgorithm] -A [ARFF_File] -E static [TestARFF_File]
./MLSystemManager -L [LearningAlgorithm] -A [ARFF_File] -E random [PercentageForTraining]
./MLSystemManager -L [LearningAlgorithm] -A [ARFF_File] -E cross [NumOfFolds]
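To show how the random and cross methods differ, here is a minimal sketch of the two ways of dividing an already shuffled data set. The function names are hypothetical, and treating PercentageForTraining as a fraction between 0 and 1 is an assumption; check the toolkit source for the convention it actually uses.

```python
def split_by_fraction(rows, train_fraction):
    """Random split: the first part trains, the remainder tests.

    Assumes rows were already shuffled and train_fraction is in [0, 1].
    """
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def cross_validation_folds(rows, num_folds):
    """Yield (train, test) pairs; each row is in exactly one test fold."""
    n = len(rows)
    for i in range(num_folds):
        begin = i * n // num_folds
        end = (i + 1) * n // num_folds
        yield rows[:begin] + rows[end:], rows[begin:end]


rows = list(range(10))
train, test = split_by_fraction(rows, 0.5)
folds = list(cross_validation_folds(rows, 3))
```

With cross-validation the reported accuracy is typically the average over the folds, so every instance is used for testing exactly once.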
Here is an example of using the C++ ML toolkit and its output:
./MLSystemManager -L dummy -A ../Research/dataSets/iris.arff -E training -N
Dataset name: iris
Dataset is normalized.
Number of instances: 150
Learning algorithm: dummy
Evaluation method: training
Accuracy on the training set:
Output classes accuracy:
Iris-setosa: 1
Iris-versicolor: 0
Iris-virginica: 0
Set accuracy: 0.333333
Accuracy on the test set:
Output classes accuracy:
Iris-setosa: 1
Iris-versicolor: 0
Iris-virginica: 0
Set accuracy: 0.333333
Time to train: 5.96046e-06 seconds
(Note: If the simulation starts before midnight and ends after, the time
will not be accurate)
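The per-class and set accuracies in the output above can be reproduced with a short calculation. The sketch below is an illustration rather than the toolkit's code, and class_accuracies is a hypothetical name. It assumes iris has 50 instances of each class and that the learner predicts Iris-setosa for every instance.

```python
def class_accuracies(predicted, actual):
    """Return (per-class accuracy dict, overall accuracy)."""
    correct, total = {}, {}
    for p, a in zip(predicted, actual):
        total[a] = total.get(a, 0) + 1
        correct[a] = correct.get(a, 0) + (1 if p == a else 0)
    per_class = {label: correct[label] / total[label] for label in total}
    overall = sum(correct.values()) / len(actual)
    return per_class, overall


labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
actual = [label for label in labels for _ in range(50)]
predicted = ["Iris-setosa"] * len(actual)  # always guess the same class
per_class, overall = class_accuracies(predicted, actual)
```

Here per_class is 1.0 for Iris-setosa and 0.0 for the other two classes, and overall is 50/150, matching the 0.333333 set accuracy shown.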
A DummyLearner class is provided that classifies all instances as the majority
class (BaselineLearner). This class can be used as a template for creating your
own learning algorithms. The instances are stored in a vector of vectors of
doubles (C++ version) or an ArrayList of ArrayLists of doubles (Java version).
When creating a new learning algorithm, the new class must inherit from the
Learner class, and you need to add its include line to the MLSystemManager file.
For further implementation questions, please see the TAs.
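The idea of a majority-class learner that inherits from a common base class can be sketched as follows. The Learner class here is a hypothetical stand-in with assumed train and predict methods; the toolkit's real Learner interface may use different method names and signatures, so treat this only as an outline of the pattern.

```python
from collections import Counter


class Learner:
    """Hypothetical stand-in for the toolkit's Learner base class."""

    def train(self, features, labels):
        raise NotImplementedError

    def predict(self, instance):
        raise NotImplementedError


class MajorityClassLearner(Learner):
    """Predicts the most common training label for every instance."""

    def __init__(self):
        self.majority = None

    def train(self, features, labels):
        # Ignore the features entirely; only the label counts matter.
        self.majority = Counter(labels).most_common(1)[0][0]

    def predict(self, instance):
        return self.majority


learner = MajorityClassLearner()
learner.train([[5.1, 3.5], [7.0, 3.2], [6.4, 3.2]], [0.0, 1.0, 1.0])
prediction = learner.predict([6.0, 3.0])
```

A learner like this gives the accuracy that any real algorithm should beat, which is why it makes a convenient template and sanity check.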