General Inefficiency of Batch Training for Gradient Descent Learning
by D. Randall Wilson and Tony R. Martinez
To appear in Neural Networks
On-line Appendix
Introduction
This directory contains the experimental results reported in the paper cited above.
The paper had two main sets of experimental results:
- Machine Learning Database (MLDB) Experiments to test batch vs. on-line training.
- Digit Speech Recognition Experiments to test mini-batch training.
The entire archive of results can be downloaded as a .zip file.
Machine Learning Database Results
The first set of experiments used 26 datasets from the
UCI Machine Learning Database Repository.
Information about the number of instances, inputs, output classes, etc., for each
of these databases is available in the Microsoft Excel spreadsheet
MLDB/MLDB-Information.xls.
For each task, 60% of the data was used for training, 20% for a hold-out
set, and the remaining 20% as a final test set. The hold-out set was used to test the generalization
accuracy after each epoch of training, and the final test set was used to measure accuracy at the
epoch where generalization was the highest on the hold-out set.
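This selection rule can be sketched as follows (the accuracy values below are illustrative only, not taken from the results files):

```python
# Hold-out ("dev") accuracy picks the best epoch; the test set reports
# the accuracy at that same epoch.  Values are illustrative, indexed by epoch.
holdout_acc = [55.0, 70.2, 81.5, 84.0, 83.1]
test_acc    = [54.0, 69.8, 80.9, 83.2, 83.5]

best_epoch = max(range(len(holdout_acc)), key=holdout_acc.__getitem__)
print(best_epoch, test_acc[best_epoch])  # 3 83.2
```

Note that the reported accuracy (83.2) is the test-set value at the chosen epoch, even though a later epoch happens to do better on the test set.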
Each task was trained using both on-line and batch training methods, and in each case learning
rates of 0.1, 0.01, 0.001, and 0.0001 were used. Ten trials were run for each of the four learning
rates for both methods on each of the 26 tasks, for a total of 10*4*2*26 = 2,080 runs.
More details about the neural network architecture and training are available in
the paper.
Each neural network was trained for 1,000 epochs for learning rates 0.1 and 0.01; 5,000 epochs
for 0.001; and 10,000 epochs for learning rate 0.0001. Each neural network was then tested on the
hold-out set and the results for each epoch over the ten trials were averaged. A few tasks had
generalization accuracy that was still rising after the number of epochs listed above. In those cases,
additional training epochs were used in order to determine how long batch and on-line training took
to train to a maximum accuracy in each case.
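The grid of runs and the epoch budgets described above can be summarized in a few lines (this is just a restatement of the setup, not code from the experiments):

```python
# Experimental grid: 26 tasks x 2 methods x 4 learning rates x 10 trials.
# Epoch budgets per learning rate as listed in the text (some tasks were
# given additional epochs when accuracy was still rising).
epochs_for = {0.1: 1000, 0.01: 1000, 0.001: 5000, 0.0001: 10000}
tasks, methods, trials = 26, ("on-line", "batch"), 10

total_runs = tasks * len(methods) * len(epochs_for) * trials
print(total_runs)  # 2080
```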
Complete, Raw MLDB Results
The directory MLDB/complete/ contains one ".csv" (Comma-Separated Value)
file for each of the 26 MLDB datasets, plus one containing the average across
all datasets.
A .csv file can be opened in Microsoft Excel but is also easy to parse:
each line consists of the same number of fields (values), delimited by
commas.
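For example, such a file can be read with Python's standard csv module. The sample below is a tiny stand-in with the same shape as the real files, which have many more rows and columns:

```python
import csv
import io

# Tiny stand-in for one of the results files (illustrative values only).
sample = (
    "australian,Itr,c-1,b-1\n"
    ",0.1,55.2,55.2\n"
    ",1,78.4,61.0\n"
)

rows = list(csv.reader(io.StringIO(sample)))
header = rows[0]
# Skip the dataset-name column and convert the numeric fields.
data = [[float(v) for v in row[1:]] for row in rows[1:]]
print(header)   # ['australian', 'Itr', 'c-1', 'b-1']
print(data[0])  # [0.1, 55.2, 55.2]
```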
Each of the raw results files contains the following columns:
- The name of the database (e.g., "australian") at the top of the first column.
- Itr: Number of training iterations (epochs) up to and including the row.
(If a file contains a "0.1" instead of an integer, this indicates the
accuracy with the random initial weights; "0.1" was used instead of "0"
to avoid divide-by-zero errors in Microsoft Excel graphs.)
- Hold-out set accuracy for each method. Each column has a heading to indicate
which run is represented in the column. The format of the column heading is
- c = continuous = on-line; or b = batch.
- A number 0..9 immediately after c or b is which of the ten trials it is.
(No number means it is the average, which is what is displayed in the leftmost
columns).
- A dash (which takes the place of the decimal point in the learning rate).
- The learning rate (w/o the decimal point)
- A "t" if the accuracy is on the test set instead of the hold-out set.
For example:
- c-1 is the average on the hold-out set over all 10 runs of on-line training using a learning rate of 0.1.
- b0-01t is the accuracy on the test set on run 0 (out of 0..9) of batch training using
a learning rate of 0.01.
Note that not all columns have results for the same number of epochs.
Often smaller learning rates (or batch training) required many more
epochs (iterations) of training.
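The heading scheme above can be decoded mechanically. Here is a hypothetical helper (not part of the archive) that parses a heading into its parts:

```python
import re

# method (c/b), optional trial digit, dash, learning-rate digits,
# optional trailing "t" for the test set.
HEADING_RE = re.compile(r"^([cb])(\d)?-(\d+)(t)?$")

def parse_heading(heading):
    m = HEADING_RE.match(heading)
    if m is None:
        raise ValueError(f"unrecognized heading: {heading!r}")
    method, trial, lr_digits, test = m.groups()
    return {
        "method": "on-line" if method == "c" else "batch",
        "trial": int(trial) if trial is not None else None,  # None = average
        "learning_rate": float("0." + lr_digits),  # "01" -> 0.01
        "set": "test" if test else "hold-out",
    }

print(parse_heading("c-1"))     # average of on-line runs, LR 0.1, hold-out
print(parse_heading("b0-01t"))  # batch trial 0, LR 0.01, test set
```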
Average Raw MLDB Results
The directory MLDB/average/ contains the same
results as MLDB/complete/, except that only the averages are included,
and not the 10 individual runs. This makes the files much smaller and
more manageable.
Selected Charts
The directory MLDB/charts/ contains copies of
a couple of the MLDB results files in Microsoft Excel format, with
charts built to display how accuracy changed through time for batch
and on-line ("continuous") training on the given datasets. The
mushroom dataset was plotted in the paper, as was the overall
average. The paper did not have room for these additional examples.
MLDB Results Overview
The file MLDB-Overview.csv
(and the Excel version) summarize
the on-line vs. batch training experiments on all 26 of the machine
learning databases.
It is not trivial to boil down thousands of experiments into a single chart,
but these charts attempt to do just that.
Each row contains the name of one of the MLDB datasets, along with the
following information:
- Classes: Number of output classes in the dataset.
- Instances: Number of instances in the dataset (used for training, hold-out and testing).
- MaxC: Maximum generalization accuracy achieved by on-line ("continuous") training.
- MaxB: Maximum generalization accuracy achieved by batch training.
- MaxAcc: Maximum generalization accuracy of on-line or batch training.
- Diff: Difference between batch and on-line accuracy (i.e., batch - on-line).
- CSafeLR: Largest learning rate that allowed on-line ("continuous") training to
get within 0.5% of its highest accuracy.
- BSafeLR: Largest learning rate that allowed batch training to
get within 0.5% of its highest accuracy.
- CSafe: Number of training epochs needed to get within 0.5% of on-line's
highest accuracy using the learning rate shown under "CSafeLR".
- BSafe: Number of training epochs needed to get within 0.5% of batch's
highest accuracy using the learning rate shown under "BSafeLR".
- xSlower: Number of times slower batch training is than on-line training,
when both are allowed to use their "best" learning rate,
i.e., the one that gets them within 0.5% of the highest
accuracy in the fewest number of epochs.
- xSlow by hand: Same as xSlower, but has a few entries adjusted
by hand. For example, for the letter-recognition dataset,
on-line got to an accuracy of 83.53% after 4968 epochs.
Batch training's highest accuracy was 73.25% after 9915
epochs, but only because training stopped after 10000
epochs (and these epochs took a long time due to the size
of the dataset). Examining the accuracy trends on a
log scale made it clear that batch training was about
100 times behind on-line, so the "xSlow" column
contains this value as a better estimate of how much
slower batch was progressing on that dataset.
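The "xSlower" ratio can be sketched as follows. Each method uses its best "safe" learning rate (the one reaching within 0.5% of the highest accuracy in the fewest epochs), and the epoch counts are compared. The numbers below are illustrative, not taken from the results files:

```python
# Epochs needed to reach the "safe" accuracy at each learning rate
# (illustrative values; a missing entry means that rate never got safe).
safe_epochs_online = {0.1: 120, 0.01: 900, 0.001: 7000}
safe_epochs_batch  = {0.01: 4500, 0.001: 12000}

best_online = min(safe_epochs_online.values())
best_batch = min(safe_epochs_batch.values())
print(best_batch / best_online)  # 37.5
```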
What is "safe"? There are two versions of the table which differ
by what they mean by a "safe" learning rate for each training method. A learning
rate is "safe" if it allows training to get within 0.5% of the maximum accuracy.
For example, on-line training might get 88.7% accuracy in 9000 epochs on a particular
dataset, but if it can get up to 88.5% accuracy in 100 epochs, then we would say that
this is "close enough", and that the savings in training epochs are worth the
(probably statistically insignificant) difference in accuracy.
The first table defines "safe" as coming within 0.5% of the max accuracy achieved
by either training method. Thus, if batch training never comes within 0.5% of
on-line training, then it must use whatever learning rate gave it the absolute
best accuracy. The second table defines "safe" as coming within 0.5% of the max
accuracy of any learning rate for just the training method itself.
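The two definitions differ only in which maximum serves as the reference point, as this small sketch shows (accuracies in percent, using the 88.7/88.5 example above):

```python
def safe_v1(acc, max_either_method):
    # Table 1: reference is the max over BOTH training methods.
    return acc >= max_either_method - 0.5

def safe_v2(acc, max_same_method):
    # Table 2: reference is the max for the SAME method only.
    return acc >= max_same_method - 0.5

print(safe_v1(88.5, 88.7))  # True: within 0.5% of the overall max
print(safe_v1(88.1, 88.7))  # False
```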
The rest of the columns contain the following
information for each learning rate:
Cont (i.e., "continuous" or "on-line"):
- Epoch: Number of epochs to get to the best hold-out set accuracy.
- Dev: Accuracy on the hold-out ("development" or "dev") set at the given epoch.
- Test: Accuracy on the test set at the same epoch. In other words,
the hold-out ("dev") set is used to pick which epoch looks
the most accurate, and the test set is used to test what
the accuracy was at that epoch.
Batch: The same information as above is listed for batch training.
Digit Speech Recognition and Mini-Batch Experiments
The other set of experiments presented in the paper used a large
training set (20,000 instances) and different "batch sizes",
where a size of 1 is the same as on-line training, and a size
of 20,000 is the same as batch training. By varying the batch
size from 1 to 20,000, we can observe the effect of moving from
on-line to batch training.
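The accumulate-then-apply scheme can be sketched on a toy one-weight model (illustrative code, not the paper's network or data):

```python
# Mini-batch gradient descent: gradient updates are accumulated over
# `batch_size` instances and then applied at once.  batch_size=1 is
# on-line training; batch_size=len(data) is batch training.
def train(data, lr, batch_size, epochs):
    w = 0.0                                # one weight: fit y = w * x
    for _ in range(epochs):
        acc, count = 0.0, 0
        for x, y in data:
            acc += 2.0 * (w * x - y) * x   # d/dw of (w*x - y)^2
            count += 1
            if count == batch_size:
                w -= lr * acc
                acc, count = 0.0, 0
        if count:                          # flush a partial final batch
            w -= lr * acc
    return w

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # y = 2x
print(round(train(data, lr=0.05, batch_size=1, epochs=25), 3))  # 2.0
print(round(train(data, lr=0.05, batch_size=3, epochs=25), 3))  # 2.0
```

On this trivial problem both settings converge; the paper's point is about how the usable learning rate, and hence the training speed, changes as the batch size grows on realistic tasks.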
The digit results are presented in raw format in
Digit.csv (or as an
Excel spreadsheet).
The rows are labeled from 0 (initial weights) to 10,000 and
represent the number of training epochs (row 0 is actually labeled
0.1 to avoid error messages when plotting on a log scale using an
X-Y chart). Some learning rates did not require all 10,000
training epochs and thus do not continue that long.
The columns are labeled with the batch size and learning rate.
For example, 100-.01 means that weights were accumulated for 100
instances before being applied, and a learning rate of 0.01 was used.
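A hypothetical helper (not part of the archive) for splitting such a label into batch size and learning rate:

```python
def parse_label(label):
    # e.g. "100-.01" -> batch size 100, learning rate 0.01
    size, lr = label.split("-", 1)
    return int(size), float(lr)

print(parse_label("100-.01"))  # (100, 0.01)
```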
The Excel spreadsheet contains charts showing progress on a logarithmic
scale for the various learning rates and batch sizes.
Another Excel spreadsheet, DigitWithCharts.xls
contains the same results, but has many charts showing the trends for each
learning rate, etc. It may require some scrolling to find all of the charts.
The same results are also repeated in a file called
DigitLogScaleResults.csv. In this version, only 70 of the rows are
kept (according to a log scale) to make it easier to plot and look at the
trends in the results without so many data points.
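The idea of keeping a roughly log-spaced subset of the epoch rows can be sketched as below; the exact spacing (and the count of 70) in DigitLogScaleResults.csv is the authors' choice, so this only shows the general approach:

```python
import math

def log_spaced(max_epoch, n_keep):
    # Round n_keep log-spaced points between 1 and max_epoch to integers;
    # duplicates at the low end collapse, so fewer rows may survive.
    step = math.log10(max_epoch) / (n_keep - 1)
    idx = {round(10 ** (i * step)) for i in range(n_keep)}
    return sorted(idx)

kept = log_spaced(10000, 70)
print(kept[0], kept[-1])  # 1 10000
```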
Finally, in another Excel spreadsheet, DigitOverview.xls,
the "best" learning rate for each batch size (including 1 and 20,000, representing
on-line and batch, respectively) and the corresponding accuracy are reported. This
chart is the basis for the data in Table 2 in the paper.
Contact Information
Feel free to contact me if you have questions, comments or suggestions.
Randy Wilson
E-mail: randy@axon.cs.byu.edu
WWW: http://axon.cs.byu.edu/~randy