General Inefficiency of Batch Training for Gradient Descent Learning
by D. Randall Wilson and Tony R. Martinez
To appear in Neural Networks
On-line Appendix
Introduction
This directory contains the experimental results reported in the paper cited above.
The paper had two main sets of experimental results:
- Machine Learning Database (MLDB) Experiments to test batch vs. on-line training.
- Digit Speech Recognition Experiments to test mini-batch training.
The entire archive of results can be downloaded as a .zip file.
Machine Learning Database Results
The first set of experiments used 26 datasets from the
UCI Machine Learning Database Repository.
Information about the number of instances, inputs, output classes, etc., for each
of these databases is available in the Microsoft Excel spreadsheet
MLDB/MLDB-Information.xls.
For each task, 60% of the data was used for training, 20% for a hold-out
set, and the remaining 20% as a final test set. The hold-out set was used to test the generalization
accuracy after each epoch of training, and the final test set was used to measure accuracy at the
epoch where generalization was the highest on the hold-out set.
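This selection rule can be sketched as follows (the accuracy values below are illustrative only, not taken from the results files):

```python
# Hold-out ("dev") accuracy picks the best epoch; the test set reports
# the accuracy at that same epoch.  Values are illustrative, indexed by epoch.
holdout_acc = [55.0, 70.2, 81.5, 84.0, 83.1]
test_acc    = [54.0, 69.8, 80.9, 83.2, 83.5]

best_epoch = max(range(len(holdout_acc)), key=holdout_acc.__getitem__)
print(best_epoch, test_acc[best_epoch])  # 3 83.2
```

Note that the reported accuracy (83.2) is the test-set value at the chosen epoch, even though a later epoch happens to do better on the test set.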
Each task was trained using both on-line and batch training methods, and in each case learning
rates of 0.1, 0.01, 0.001, and 0.0001 were used. Ten trials were run for each of the four learning
rates for both methods on each of the 26 tasks, for a total of 10*4*2*26 = 2,080 runs.
More details about the neural network architecture and training are available in
the paper.
Each neural network was trained for 1,000 epochs for learning rates 0.1 and 0.01; 5,000 epochs
for 0.001; and 10,000 epochs for learning rate 0.0001. Each neural network was then tested on the
hold-out set and the results for each epoch over the ten trials were averaged. A few tasks had
generalization accuracy that was still rising after the number of epochs listed above. In those cases,
additional training epochs were used in order to determine how long batch and on-line training took
to train to a maximum accuracy in each case.
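The grid of runs and the epoch budgets described above can be summarized in a few lines (this is just a restatement of the setup, not code from the experiments):

```python
# Experimental grid: 26 tasks x 2 methods x 4 learning rates x 10 trials.
# Epoch budgets per learning rate as listed in the text (some tasks were
# given additional epochs when accuracy was still rising).
epochs_for = {0.1: 1000, 0.01: 1000, 0.001: 5000, 0.0001: 10000}
tasks, methods, trials = 26, ("on-line", "batch"), 10

total_runs = tasks * len(methods) * len(epochs_for) * trials
print(total_runs)  # 2080
```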
Complete, Raw MLDB Results
The directory MLDB/complete/ contains one ".csv" (Comma-Separated Value)
file for each of the 26 MLDB datasets, plus one containing the average across
all datasets.
A .csv file can be opened in Microsoft Excel but is also easy to parse:
each line consists of the same number of fields (values), delimited by
commas.
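For example, such a file can be read with Python's standard csv module. The sample below is a tiny stand-in with the same shape as the real files, which have many more rows and columns:

```python
import csv
import io

# Tiny stand-in for one of the results files (illustrative values only).
sample = (
    "australian,Itr,c-1,b-1\n"
    ",0.1,55.2,55.2\n"
    ",1,78.4,61.0\n"
)

rows = list(csv.reader(io.StringIO(sample)))
header = rows[0]
# Skip the dataset-name column and convert the numeric fields.
data = [[float(v) for v in row[1:]] for row in rows[1:]]
print(header)   # ['australian', 'Itr', 'c-1', 'b-1']
print(data[0])  # [0.1, 55.2, 55.2]
```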
Each of the raw results files contains the following columns:
- The name of the database (e.g., "australian") at the top of the first column.
- Itr: Number of training iterations (epochs) up to and including the row.
(If a file contains a "0.1" instead of an integer, this indicates the
accuracy with the random initial weights; "0.1" was used instead of "0"
to avoid divide-by-zero errors in Microsoft Excel graphs.)
- Hold-out set accuracy for each method. Each column has a heading to indicate
which run is represented in the column. The format of the column heading is
- c = continuous = on-line; or b = batch.
- A number 0..9 immediately after c or b is which of the ten trials it is.
(No number means it is the average, which is what is displayed in the leftmost
columns).
- A dash (which takes the place of the decimal point in the learning rate).
- The learning rate (w/o the decimal point)
- A "t" if the accuracy is on the test set instead of the hold-out set.
For example:
- c-1 is the average on the hold-out set over all 10 runs of on-line training using a learning rate of 0.1.
- b0-01t is the accuracy on the test set on run 0 (out of 0..9) of batch training using
a learning rate of 0.01.
Note that not all columns have results for the same number of epochs.
Often smaller learning rates (or batch training) required many more
epochs (iterations) of training.
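The heading scheme above can be decoded mechanically. Here is a hypothetical helper (not part of the archive) that parses a heading into its parts:

```python
import re

# method (c/b), optional trial digit, dash, learning-rate digits,
# optional trailing "t" for the test set.
HEADING_RE = re.compile(r"^([cb])(\d)?-(\d+)(t)?$")

def parse_heading(heading):
    m = HEADING_RE.match(heading)
    if m is None:
        raise ValueError(f"unrecognized heading: {heading!r}")
    method, trial, lr_digits, test = m.groups()
    return {
        "method": "on-line" if method == "c" else "batch",
        "trial": int(trial) if trial is not None else None,  # None = average
        "learning_rate": float("0." + lr_digits),  # "01" -> 0.01
        "set": "test" if test else "hold-out",
    }

print(parse_heading("c-1"))     # average of on-line runs, LR 0.1, hold-out
print(parse_heading("b0-01t"))  # batch trial 0, LR 0.01, test set
```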
Average Raw MLDB Results
The directory MLDB/average/ contains the same
results as MLDB/complete/, except that only the averages are included,
and not the 10 individual runs. This makes the files much smaller and
more manageable.
Selected Charts
The directory MLDB/charts/ contains copies of
a couple of the MLDB results files in Microsoft Excel format, with
charts built to display how accuracy changed through time for batch
and on-line ("continuous") training on the given datasets. The
mushroom dataset was plotted in the paper, as was the overall
average. The paper did not have room for these additional examples.
MLDB Results Overview
The file MLDB-Overview.csv
(and the Excel version) summarize
the on-line vs. batch training experiments on all 26 of the machine
learning databases.
It is not trivial to boil down thousands of experiments into a single chart,
but these charts attempt to do just that.
Each row contains the name of one of the MLDB datasets, along with the
following information:
- Classes: Number of output classes in the dataset.
- Instances: Number of instances in the dataset (used for training, hold-out and testing).
- MaxC: Maximum generalization accuracy achieved by on-line ("continuous") training.
- MaxB: Maximum generalization accuracy achieved by batch training.
- MaxAcc: Maximum generalization accuracy of on-line or batch training.
- Diff: Difference between batch and on-line accuracy (i.e., batch - on-line).
- CSafeLR: Largest learning rate that allowed on-line ("continuous") training to
get within 0.5% of its highest accuracy.
- BSafeLR: Largest learning rate that allowed batch training to
get within 0.5% of its highest accuracy.
- CSafe: Number of training epochs needed to get within 0.5% of on-line's
highest accuracy using the learning rate shown under "CSafeLR".
- BSafe: Number of training epochs needed to get within 0.5% of batch's
highest accuracy using the learning rate shown under "BSafeLR".
- xSlower: Number of times slower batch training is than on-line training,
when both are allowed to use their "best" learning rate,
i.e., the one that gets them within 0.5% of the highest
accuracy in the fewest number of epochs.
- xSlow by hand: Same as xSlower, but has a few entries adjusted
by hand. For example, for the letter-recognition dataset,
on-line got to an accuracy of 83.53% after 4968 epochs.
Batch training's highest accuracy was 73.25% after 9915
epochs, but only because training stopped after 10000
epochs (and these epochs took a long time due to the size
of the dataset). Examining the accuracy trends on a
log scale made it clear that batch training was about
100 times behind on-line, so the "xSlow" column
contains this value as a better estimate of how much
slower batch was progressing on that dataset.
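The "xSlower" ratio can be sketched as follows. Each method uses its best "safe" learning rate (the one reaching within 0.5% of the highest accuracy in the fewest epochs), and the epoch counts are compared. The numbers below are illustrative, not taken from the results files:

```python
# Epochs needed to reach the "safe" accuracy at each learning rate
# (illustrative values; a missing entry means that rate never got safe).
safe_epochs_online = {0.1: 120, 0.01: 900, 0.001: 7000}
safe_epochs_batch  = {0.01: 4500, 0.001: 12000}

best_online = min(safe_epochs_online.values())
best_batch = min(safe_epochs_batch.values())
print(best_batch / best_online)  # 37.5
```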
What is "safe"? There are two versions of the table which differ
by what they mean by a "safe" learning rate for each training method. A learning
rate is "safe" if it allows training to get within 0.5% of the maximum accuracy.
For example, on-line training might get 88.7% accuracy in 9000 epochs on a particular
dataset, but if it can get up to 88.5% accuracy in 100 epochs, then we would say that
this is "close enough", and that the savings in training epochs are worth the
(probably statistically insignificant) difference in accuracy.
The first table defines "safe" as coming within 0.5% of the max accuracy achieved
by either training method. Thus, if batch training never comes within 0.5% of
on-line training, then it must use whatever learning rate gave it the absolute
best accuracy. The second table defines "safe" as coming within 0.5% of the max
accuracy of any learning rate for just the training method itself.
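The two definitions differ only in which maximum serves as the reference point, as this small sketch shows (accuracies in percent, using the 88.7/88.5 example above):

```python
def safe_v1(acc, max_either_method):
    # Table 1: reference is the max over BOTH training methods.
    return acc >= max_either_method - 0.5

def safe_v2(acc, max_same_method):
    # Table 2: reference is the max for the SAME method only.
    return acc >= max_same_method - 0.5

print(safe_v1(88.5, 88.7))  # True: within 0.5% of the overall max
print(safe_v1(88.1, 88.7))  # False
```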
The rest of the columns contain the following
information for each learning rate:
Cont (i.e., "continuous" or "on-line"):
- Epoch: Number of epochs to get to the best hold-out set accuracy.
- Dev: Accuracy on the hold-out ("development" or "dev") set at the given epoch.
- Test: Accuracy on the test set at the same epoch. In other words,
the hold-out ("dev") set is used to pick which epoch looks
the most accurate, and the test set is used to test what
the accuracy was at that epoch.
Batch: The same information as above is listed for batch training.
Digit Speech Recognition and Mini-Batch Experiments
The other set of experiments presented in the paper used a large
training set (20,000 instances) and different "batch sizes",
where a size of 1 is the same as on-line training, and a size
of 20,000 is the same as batch training. By varying the batch
size from 1 to 20,000, we can observe the effect of moving from
on-line to batch training.
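The accumulate-then-apply scheme can be sketched on a toy one-weight model (illustrative code, not the paper's network or data):

```python
# Mini-batch gradient descent: gradient updates are accumulated over
# `batch_size` instances and then applied at once.  batch_size=1 is
# on-line training; batch_size=len(data) is batch training.
def train(data, lr, batch_size, epochs):
    w = 0.0                                # one weight: fit y = w * x
    for _ in range(epochs):
        acc, count = 0.0, 0
        for x, y in data:
            acc += 2.0 * (w * x - y) * x   # d/dw of (w*x - y)^2
            count += 1
            if count == batch_size:
                w -= lr * acc
                acc, count = 0.0, 0
        if count:                          # flush a partial final batch
            w -= lr * acc
    return w

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # y = 2x
print(round(train(data, lr=0.05, batch_size=1, epochs=25), 3))  # 2.0
print(round(train(data, lr=0.05, batch_size=3, epochs=25), 3))  # 2.0
```

On this trivial problem both settings converge; the paper's point is about how the usable learning rate, and hence the training speed, changes as the batch size grows on realistic tasks.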
The digit results are presented in raw format in
Digit.csv (or as an
Excel spreadsheet).
The rows are labeled from 0 (initial weights) to 10,000 and
represent the number of training epochs (row 0 is actually labeled
0.1 to avoid error messages when plotting on a log scale using an
X-Y chart). Some learning rates did not require all 10,000
training epochs and thus do not continue that long.
The columns are labeled with the batch size and learning rate.
For example, 100-.01 means that weights were accumulated for 100
instances before being applied, and a learning rate of 0.01 was used.
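A hypothetical helper (not part of the archive) for splitting such a label into batch size and learning rate:

```python
def parse_label(label):
    # e.g. "100-.01" -> batch size 100, learning rate 0.01
    size, lr = label.split("-", 1)
    return int(size), float(lr)

print(parse_label("100-.01"))  # (100, 0.01)
```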
The Excel spreadsheet contains charts showing progress on a logarithmic
scale for the various learning rates and batch sizes.
Another Excel spreadsheet, DigitWithCharts.xls
contains the same results, but has many charts showing the trends for each
learning rate, etc. It may require some scrolling to find all of the charts.
The same results are also repeated in a file called
DigitLogScaleResults.csv. In this version, only 70 of the rows are
kept (according to a log scale) to make it easier to plot and look at the
trends in the results without so many data points.
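The idea of keeping a roughly log-spaced subset of the epoch rows can be sketched as below; the exact spacing (and the count of 70) in DigitLogScaleResults.csv is the authors' choice, so this only shows the general approach:

```python
import math

def log_spaced(max_epoch, n_keep):
    # Round n_keep log-spaced points between 1 and max_epoch to integers;
    # duplicates at the low end collapse, so fewer rows may survive.
    step = math.log10(max_epoch) / (n_keep - 1)
    idx = {round(10 ** (i * step)) for i in range(n_keep)}
    return sorted(idx)

kept = log_spaced(10000, 70)
print(kept[0], kept[-1])  # 1 10000
```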
Finally, in another Excel spreadsheet, DigitOverview.xls,
the "best" learning rate for each batch size (including 1 and 20,000, representing
on-line and batch, respectively) and the corresponding accuracy are reported. This
chart is the basis for the data in Table 2 in the paper.
Contact Information
Feel free to contact me if you have questions, comments or suggestions.
Randy Wilson
E-mail: randy@axon.cs.byu.edu
WWW: http://axon.cs.byu.edu/~randy