General Inefficiency of Batch Training for Gradient Descent Learning

by D. Randall Wilson and Tony R. Martinez

To appear in Neural Networks

On-line Appendix

Introduction

This directory contains the experimental results reported in the paper. The paper had two main sets of experimental results:
  1. Machine Learning Database (MLDB) Experiments to test batch vs. on-line training.
  2. Digit Speech Recognition Experiments to test mini-batch training.
The entire archive of results can be downloaded as a .zip file.

Machine Learning Database Results

The first set of experiments used 26 datasets from the
UCI Machine Learning Database Repository. Information about the number of instances, inputs, output classes, etc., for each of these databases is available in the Microsoft Excel spreadsheet MLDB/MLDB-Information.xls.

For each task, 60% of the data was used for training, 20% for a hold-out set, and the remaining 20% as a final test set. The hold-out set was used to test the generalization accuracy after each epoch of training, and the final test set was used to measure accuracy at the epoch where generalization was the highest on the hold-out set.
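
As a small illustration of this selection rule (with made-up accuracy values, not results from the experiments):

  # Made-up per-epoch accuracies, purely to illustrate the selection rule:
  # report the test-set accuracy at the epoch where hold-out accuracy peaked.
  holdout_acc = [0.71, 0.78, 0.83, 0.82, 0.84, 0.83]
  test_acc    = [0.70, 0.77, 0.81, 0.80, 0.82, 0.81]

  best_epoch = max(range(len(holdout_acc)), key=lambda e: holdout_acc[e])
  print("best epoch:", best_epoch + 1)               # epochs numbered from 1
  print("reported accuracy:", test_acc[best_epoch])  # test accuracy at that epoch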

Each task was trained using both on-line and batch training methods, and in each case learning rates of 0.1, 0.01, 0.001, and 0.0001 were used. Ten trials were run for each of the four learning rates for both methods on each of the 26 tasks, for a total of 10*4*2*26 = 2,080 runs.

More details about the neural network architecture and training are available in the paper.

Each neural network was trained for 1,000 epochs for learning rates 0.1 and 0.01; 5,000 epochs for 0.001; and 10,000 epochs for learning rate 0.0001. Each neural network was then tested on the hold-out set and the results for each epoch over the ten trials were averaged. A few tasks had generalization accuracy that was still rising after the number of epochs listed above. In those cases, additional training epochs were used in order to determine how long batch and on-line training took to train to a maximum accuracy in each case.

Complete, Raw MLDB Results

The directory MLDB/complete/ contains one ".csv" (Comma-Separated Value) file for each of the 26 MLDB datasets, plus one containing the average across all datasets.

A csv file can be opened by Microsoft Excel, but is also easy to parse. Each line consists of the same number of fields (values), with a comma used to delimit each field.

Each of the raw results files contains the following columns:

Note that not all columns have results for the same number of epochs. Often smaller learning rates (or batch training) required many more epochs (iterations) of training.
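
For readers who want to process the files programmatically, a minimal Python sketch for loading one of them might look like the following (the filename and the presence of a header row are assumptions):

  import csv

  # Read one of the raw results files into per-column lists.  The filename is
  # only a guess at the naming convention; substitute any file from
  # MLDB/complete/.  Shorter columns (fewer epochs) leave blank fields, which
  # are skipped here.
  with open("MLDB/complete/mushroom.csv", newline="") as f:
      rows = list(csv.reader(f))

  header, body = rows[0], rows[1:]
  columns = {name: [] for name in header}
  for row in body:
      for name, value in zip(header, row):
          if value != "":
              try:
                  columns[name].append(float(value))
              except ValueError:
                  columns[name].append(value)   # keep non-numeric fields as-is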

Average Raw MLDB Results

The directory MLDB/average/ contains the same results as MLDB/complete/, except that only the averages are included, and not the 10 individual runs. This makes the files much smaller and more manageable.

Selected Charts

The directory MLDB/charts/ contains copies of a couple of the MLDB results files in Microsoft Excel format, with charts built to display how accuracy changed through time for batch and on-line ("continuous") training on the given datasets. The mushroom dataset was plotted in the paper, as was the overall average. The paper did not have room for these additional examples.

MLDB Results Overview

The file MLDB-Overview.csv (and its Excel version) summarizes the on-line vs. batch training experiments on all 26 of the machine learning databases.

It is not trivial to boil down thousands of experiments into a single chart, but these charts attempt to do just that.

Each row contains the name of one of the MLDB datasets, along with the following information:

What is "safe"? There are two versions of the table which differ by what they mean by a "safe" learning rate for each training method. A learning rate is "safe" if it allows training to get within 0.5% of the maximum accuracy. For example, on-line training might get 88.7% accuracy in 9000 epochs on a particular dataset, but if it can get up to 88.5% accuracy in 100 epochs, then we would say that this is "close enough", and that the savings in training epochs is worth the (probably statistically insignificant) difference in accuracy.

The first table defines "safe" as coming within 0.5% of the max accuracy achieved by either training method. Thus, if batch training never comes within 0.5% of on-line training, then it must use whatever learning rate gave it the absolute best accuracy. The second table defines "safe" as coming within 0.5% of the max accuracy of any learning rate for just the training method itself.
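
A rough Python sketch of the two definitions is given below, using the example numbers above for on-line training and made-up numbers for batch training. Among the qualifying learning rates it picks the one needing the fewest epochs, in the spirit of the "savings in training epochs" above, although the tables' exact selection rule may differ:

  # results[method][lr] = (best hold-out accuracy, epochs needed to reach it).
  # The on-line numbers echo the example above; the batch numbers are made up.
  results = {
      "online": {0.1: (88.5, 100), 0.01: (88.7, 900),
                 0.001: (88.7, 9000), 0.0001: (88.6, 10000)},
      "batch":  {0.1: (71.2, 1000), 0.01: (86.9, 1000),
                 0.001: (88.4, 5000), 0.0001: (88.6, 10000)},
  }
  TOL = 0.5   # "within 0.5% of the maximum accuracy"

  overall_max = max(acc for m in results.values() for acc, _ in m.values())

  for method, by_lr in results.items():
      method_max = max(acc for acc, _ in by_lr.values())
      for label, target in (("either method", overall_max), ("itself", method_max)):
          safe = {lr: ep for lr, (acc, ep) in by_lr.items() if acc >= target - TOL}
          if safe:                                     # fastest qualifying rate
              lr = min(safe, key=safe.get)
          else:                                        # fall back to the most accurate rate
              lr = max(by_lr, key=lambda r: by_lr[r][0])
          print(method, "safe vs.", label + ":", lr)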

The rest of the columns contain the following information for each learning rate:

  • Cont (i.e., "continuous" or "on-line"):
  • Batch: The same information as above is listed for batch training.

Digit Speech Recognition and Mini-Batch Experiments

The other set of experiments presented in the paper used a large training set (20,000 instances) and different "batch sizes", where a size of 1 is the same as on-line training, and a size of 20,000 is the same as batch training. By varying the batch size from 1 to 20,000, we can observe the effect of moving from on-line to batch training.
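
As a generic illustration of how batch size interpolates between the two extremes, the following Python sketch accumulates weight changes over a given number of instances before applying them (a plain linear unit with squared error is used here for simplicity; it is not the network described in the paper):

  import numpy as np

  # Generic mini-batch gradient descent sketch (illustration only, not the
  # paper's network).  Weight changes are accumulated over batch_size
  # instances and then applied: batch_size = 1 behaves like on-line training,
  # batch_size = len(X) like batch training.
  def train(X, y, batch_size, learning_rate, epochs):
      rng = np.random.default_rng(0)
      w = rng.normal(scale=0.1, size=X.shape[1])
      for _ in range(epochs):
          accumulated = np.zeros_like(w)
          for i in range(len(X)):
              error = X[i] @ w - y[i]                  # simple linear unit
              accumulated += -learning_rate * error * X[i]
              if (i + 1) % batch_size == 0:            # apply the accumulated change
                  w += accumulated
                  accumulated[:] = 0.0
          w += accumulated                             # leftover partial batch, if any
      return w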

The digit results are presented in raw format in Digit.csv (or as an Excel spreadsheet).

The rows are labeled from 0 (initial weights) to 10,000 and represent the number of training epochs (row 0 is actually labeled 0.1 to avoid error messages when plotting on a log scale using an X-Y chart). Some learning rates did not require all 10,000 training epochs and thus do not continue that long.

The columns are labeled with the batch size and learning rate. For example, 100-.01 means that weights were accumulated for 100 instances before being applied, and a learning rate of 0.01 was used.

The Excel spreadsheet contains charts showing progress on a logarithmic scale for the various learning rates and batch sizes.
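
A rough Python equivalent of such a chart is sketched below; the epoch-column header and the exact column label are assumptions and may need to be adjusted to the actual file:

  import csv
  import matplotlib.pyplot as plt

  # Plot one batch-size/learning-rate column of Digit.csv against the epoch
  # column on a logarithmic x axis.  "epoch" and "100-.01" are assumed header
  # names; adjust them to the actual file.  Because row 0 is labeled 0.1,
  # every x value is positive and can be plotted on a log scale.
  epochs, accuracy = [], []
  with open("Digit.csv", newline="") as f:
      for row in csv.DictReader(f):
          if row.get("100-.01"):                       # shorter columns leave blanks
              epochs.append(float(row["epoch"]))
              accuracy.append(float(row["100-.01"]))

  plt.semilogx(epochs, accuracy)
  plt.xlabel("training epochs")
  plt.ylabel("accuracy (%)")
  plt.show()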

Another Excel spreadsheet, DigitWithCharts.xls, contains the same results, but has many charts showing the trends for each learning rate, etc. It may require some scrolling to find all of the charts. The same results are also repeated in a file called DigitLogScaleResults.csv. In this version, only 70 of the rows are kept (according to a log scale) to make it easier to plot and look at the trends in the results without so many data points.
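
For example, roughly 70 log-spaced epoch numbers between 1 and 10,000 can be generated as follows (the exact epochs kept in DigitLogScaleResults.csv may differ):

  import numpy as np

  # About 70 epoch numbers, evenly spaced on a log scale from 1 to 10,000.
  # Rounding to integers merges a few of the smallest values, so slightly
  # fewer than 70 distinct epochs may remain.
  epochs = np.unique(np.round(np.logspace(0, 4, 70)).astype(int))
  print(len(epochs), epochs[:8])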

Finally, in another Excel spreadsheet, DigitOverview.xls, the "best" learning rate for each batch size (including 1 and 20,000, representing on-line and batch, respectively) and the corresponding accuracy are reported. This overview is the basis for the data in Table 2 in the paper.
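
The same kind of summary could be recomputed from the raw results along these lines (the peak accuracies below are made up purely for illustration; the real values are in Digit.csv):

  # peak[(batch_size, learning_rate)] = best accuracy reached by that column
  # (made-up numbers purely for illustration).
  peak = {
      (1, 0.01): 96.0, (1, 0.001): 95.5,
      (100, 0.01): 95.8, (100, 0.001): 95.4,
      (20000, 0.001): 93.0, (20000, 0.0001): 93.6,
  }

  best = {}
  for (batch_size, lr), acc in peak.items():
      if batch_size not in best or acc > best[batch_size][1]:
          best[batch_size] = (lr, acc)

  for batch_size in sorted(best):      # best learning rate and accuracy per batch size
      print(batch_size, best[batch_size])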


Contact Information

Feel free to contact me if you have questions, comments or suggestions.

Randy Wilson
E-mail: randy@axon.cs.byu.edu
WWW: http://axon.cs.byu.edu/~randy