Abstract: Gradient descent training of neural networks can be done in either a batch or on-line manner. A widely held myth in the neural network community is that batch training is as fast or faster and/or more "correct" than on-line training because it supposedly uses a better approximation of the true gradient for its weight updates. This paper explains why batch training is almost always slower than on-line training often orders of magnitude slower especially on large training sets. The main reason is due to the ability of on-line training to follow curves in the error surface throughout each epoch, which allows it to safely use a larger learning rate and thus converge with less iterations through the training data. Empirical results on a large (20,000-instance) speech recognition task and on 26 other learning tasks demonstrate that convergence can be reached significantly faster using on-line training than batch training, with no apparent difference in accuracy.
Reference: Neural Networks, vol. 16, no. 10, pages 1429-1451, Elsevier Science Ltd. Oxford, UK, UK, 2003. (PDF: wilson.nn2003.batch.pdf)
Abstract: Exemplar-based learning algorithms are often faced with the problem of deciding which instances or other exemplars to store for use during generalization. Storing too many exemplars can result in large memory requirements and slow execution speed, and can cause an oversensitivity to noise. This paper has two main purposes. First, it provides a survey of existing algorithms used to reduce the number of exemplars retained in exemplar-based learning models. Second, it proposes six new reduction algorithms called DROP1-5 and DEL that can be used to prune instances from the concept description. These algorithms and 10 algorithms from the survey are compared on 31 datasets. Of those algorithms that provide substantial storage reduction, the DROP algorithms have the highest generalization accuracy in these experiments, especially in the presence of noise.
Reference: Machine Learning, vol. 38, no. 3, pp. 257-286, March 2000. (Postscript: wilson.ml2000.drop.ps) (or PDF)
Abstract: The basic nearest-neighbor rule generalizes well in many domains but has several shortcomings, including inappropriate distance functions, large storage requirements, slow execution time, sensitivity to noise, and an inability to adjust its decision boundaries after storing the training data. This paper proposes methods for overcoming each of these weaknesses and combines these methods into a comprehensive learning system called the Integrated Decremental Instance-Based Learning Algorithm (IDIBL) that seeks to reduce storage, improve execution speed, and increase generalization accuracy, when compared to the basic nearest neighbor algorithm and other learning models. IDIBL tunes its own parameters using a new measure of fitness that combines confidence and cross-validation (CVC) accuracy in order to avoid discretization problems with more traditional leave-one-out cross-validation (LCV). In our experiments IDIBL achieves higher generalization accuracy than other less comprehensive instance-based learning algorithms, while requiring less than one-fourth the storage of the nearest neighbor algorithm and improving execution speed by a corresponding factor. In experiments on 21 datasets, IDIBL also achieves higher generalization accuracy than those reported for 16 major machine learning and neural network models.
Reference: Computational Intelligence, vol. 16, no. 1, pp. 1-28, 2000. (Postscript: wilson.ci2000.idibl.ps) (or PDF)
Abstract: Nearest neighbor and instance-based learning techniques typically handle continuous and linear input values well, but often do not handle nominal input attributes appropriately. The Value Difference Metric (VDM) was designed to find reasonable distance values between nominal attribute values, but it largely ignores continuous attributes, requiring discretization to map continuous values into nominal values. This paper proposes three new heterogeneous distance functions, called the Heterogeneous Value Difference Metric (HVDM), Interpolated Value Difference Metric (IVDM) and the Windowed Value Difference Metric (WVDM). These new distance functions are designed to handle applications with nominal attributes, continuous attributes, or both. In experiments on 48 applications the new distance metrics achieve higher average classification accuracy than previous distance functions on applications with both nominal and continuous attributes.
Reference: Journal of Artificial Intelligence Research (JAIR), Vol. 6, No. 1, pp. 1-34, 1997. (Postscript: wilson.jair97.hvdm.ps), (or PDF), (or see the HTML version)
Abstract: The nearest neighbor algorithm and its derivatives, which are often referred to collectively as instance-based learning algorithms, have been successful on a variety of real-world applications. However, in its basic form, the nearest neighbor algorithm suffers from inadequate distance functions, large storage requirements, slow execution speed, a sensitivity to noise and irrelevant attributes, and an inability to adjust its decision surfaces after storing the data. This dissertation presents a collection of papers that seek to overcome each of these disadvantages. The most successful enhancements are combined into a comprehensive system called the Integrated Decremental Instance-Based Learning algorithm, which in experiments on 44 applications achieves higher generalization accuracy than other instance-based learning algorithms. It also yields higher generalization accuracy than that reported for 16 major machine learning and neural network models.
Reference: Ph.D. Dissertation, Computer Science Department, Brigham Young University, August 1997. (Postscript: wilson.phd97.ps) (or PDF)
Abstract: Multilayer perceptrons are often trained using error backpropagation (BP). BP training can be done in either a batch or continuous manner. Claims have frequently been made that batch training is faster and/or more "correct" than continuous training because it uses a better approximation of the true gradient for its weight updates. These claims are often supported by empirical evidence on very small data sets. These claims are untrue, however, for large training sets. This paper explains why batch training is much slower than continuous training for large training sets. Various levels of semi-batch training used on a 20,000-instance speech recognition task show a roughly linear increase in training time required with an increase in batch size.
Reference: In Proceedings of the International Joint Conference on Neural Networks (IJCNN2000), Vol. II, pp. 113-117, July 2000. (Postscript: wilson.ijcnn2000.batch.ps, or (or PDF)
Abstract:Neural network and machine learning algorithms often have parameters that must be tuned for good performance on a particular task. Leave-one-out cross-validation (LCV) accuracy is often used to measure the fitness of a set of parameter values. However, small changes in parameters often have no effect on LCV accuracy. Many learning algorithms can measure the confidence of a classification decision, but often confidence alone is an inappropriate measure of fitness. This paper proposes a combined measure of Cross-Validation and Confidence (CVC) for obtaining a continuous measure of fitness for sets of parameters in learning algorithms. This paper also proposes the Refined Instance-Based (RIB) learning algorithm which illustrates the use of CVC in automated parameter tuning. Using CVC provides significant improvement in generalization accuracy on a collection of 31 classification tasks when compared to using LCV.
Reference: In Proceedings of the International Joint Conference on Neural Networks (IJCNN'99), paper 163, 1999. (Postscript: wilson.ijcnn99.cvc.ps, or PDF)
Abstract:A relaxation network model that includes higher order weight connections is introduced. To demonstrate its utility, the model is applied to the speech recognition domain. Traditional speech recognition systems typically consider only that context preceding the word to be recognized. However, intuition suggests that considering following context as well as preceding context should improve recognition accuracy. The work described here tests this hypothesis by applying the higher order relaxation network to consider both precedes and follows context in a speech recognition task. The results demonstrate both the general utility of the higher order relaxation network as well as its improvement over traditional methods on a speech recognition task.
Reference: In Proceedings of the International Joint Conference on Neural Networks (IJCNN'99), paper 2188, 1999. (PDF format: ventura.ijcnn99b.pdf)
Abstract:Constraint satisfaction networks contain nodes that receive weighted evidence from external sources and/or other nodes. A relaxation process allows the activation of nodes to affect neighboring nodes, which in turn can affect their neighbors, allowing information to travel through a network. When doing discrete updates (as in a software implementation of a relaxation network), a goal net or goal activation can be computed in response to the net input into a node, and a relaxation rate can then be used to determine how fast the node moves from its current value to its goal value. An open question was whether or not the relaxation rate is a sensitive parameter. This paper shows that the relaxation rate has almost no effect on how information flows through the network as long as it is small enough to avoid large discrete steps and/or oscillation.
Reference: In Proceedings of the International Joint Conference on Neural Networks (IJCNN'99), paper 162, 1999. (PDF: wilson.ijcnn99.relax.pdf)
Abstract: The nearest neighbor algorithm and its derivatives are often quite successful at learning a concept from a training set and providing good generalization on subsequent input vectors. However, these techniques often retain the entire training set in memory, resulting in large memory requirements and slow execution speed, as well as a sensitivity to noise. This paper provides a discussion of issues related to reducing the number of instances retained in memory while maintaining (and sometimes improving) generalization accuracy, and mentions algorithms other researchers have used to address this problem. It presents three intuitive noise-tolerant algorithms that can be used to prune instances from the training set. In experiments on 29 applications, the algorithm that achieves the highest reduction in storage also results in the highest generalization accuracy of the three methods.
Reference: In Fisher, D., ed., Machine Learning: Proceedings of the Fourteenth International Conference (ICML'97), Morgan Kaufmann Publishers, San Francisco, CA, pp. 403-411, July 1997. (Postscript: wilson.icml97.prune.ps) (or PDF)
Abstract: Probabilistic Neural Networks (PNN) typically learn more quickly than many neural network models and have had success on a variety of applications. However, in their basic form, they tend to have a large number of hidden nodes. One common solution to this problem is to keep only a randomly-selected subset of the original training data in building the network. This paper presents an algorithm called the Reduced Probabilistic Neural Network (RPNN) that seeks to choose a better-than-random subset of the available instances to use as center points of nodes in the network. The algorithm tends to retain non-noisy border points while removing nodes with instances in regions of the input space that are highly homogeneous. In experiments on 22 datasets, the RPNN had better average generalization accuracy than two other PNN models, while requiring an average of less than one-third the number of nodes.
Reference: In Proceedings of the International Conference on Artificial Neural Networks and Genetic Algorithms (ICANNGA'97), pp. 514-517, 1997. (Postscript: wilson.icannga97.rpnn.ps) (or PDF)
Abstract: Many neural network models must be trained by finding a set of real-valued weights that yield high accuracy on a training set. Other learning models require weights on input attributes that yield high leave-one-out classification accuracy in order to avoid problems associated with irrelevant attributes and high dimensionality. In addition, there are a variety of general problems for which a set of real values must be found which maximize some evaluation function. This paper presents an algorithm for doing a schemata search over a real-valued weight space to find a set of weights (or other real values) for a given evaluation function. The algorithm, called the Real-Valued Schemata Search (RVSS) uses the BRACE statistical technique [Moore & Lee, 1993] to determine when to narrow the search space. This paper details the RVSS approach and gives initial empirical results.
Reference: In Proceedings of the 1997 Sian Ka'an International Workshop on Neural Networks and Neurocontrol, 1997. (Postscript: wilson.sian97.schema.ps or presentation: wilson.sian97.slides.ps) (or PDF)
Abstract: Radial Basis Function (RBF) networks typically use a distance function designed for numeric attributes, such as Euclidean or city-block distance. This paper presents a heterogeneous distance function which is appropriate for applications with symbolic attributes, numeric attributes, or both. Empirical results on 30 data sets indicate that the heterogeneous distance metric yields significantly improved generalization accuracy over Euclidean distance in most cases involving symbolic attributes.
Reference: Proceedings of the International Conference on Neural Networks (ICNN'96), Vol. 2, pp. 1263-1267, June 1996. (Postscript: wilson.icnn96.hrbf.ps) (or PDF)
Reference: In Proceedings of the 1997 International Conference on Intelligent Information Systems (IIS'97), pp. 108-114, 1997. (Postscript: wilson.iis97.bias.ps) (or PDF)
Abstract: Nearest neighbor and instance-based learning techniques typically handle continuous and linear input values well, but often do not handle symbolic input attributes appropriately. The Value Difference Metric (VDM) was designed to find reasonable distance values between symbolic attribute values, but it largely ignores continuous attributes, using discretization to map continuous values into symbolic values. This paper presents two heterogeneous distance metrics, called the Interpolated VDM (IVDM) and Windowed VDM (WVDM), that extend the Value Difference Metric to handle continuous attributes more appropriately. In experiments on 21 data sets the new distance metrics achieves higher classification accuracy in most cases involving continuous attributes.
Reference: Proceedings of the International Conference on Artificial Intelligence, Expert Systems, and Neural Networks (AIE'96), pp. 74-78, August 1996. (Postscript: wilson.aie96.ivdm.ps) (or PDF)
Abstract: This paper presents an inductive learning system called the Genetic Instance-Based Learning (GIBL) system. This system combines instance-based learning approaches with evolutionary computation in order to achieve high accuracy in the presence of irrelevant or redundant attributes. Evolutionary computation is used to find a set of attribute weights that yields a high estimate of classification accuracy. Results of experiments on 16 data sets are shown, and are compared with a non-weighted version of the instance-based learning system. The results indicate that the generalization accuracy of GIBL is somewhat higher than that of the non-weighted system on regular data, and is significantly higher on data with irrelevant or redundant attributes.
Reference: Proceedings of the International Conference on Artificial Intelligence, Expert Systems, and Neural Networks (AIE'96), pp. 11-14, August 1996. (Postscript: wilson.aie96.gibl.ps) (or PDF)
Abstract: A learning system can generalize from training set data in many ways. No single style of generalization is likely to solve all problems better than any other style, and different styles work better on some applications than others. Several generalization styles are proposed, including distance metrics, first-order features, voting schemes, and confidence levels. These generalization styles are efficient in terms of time and space and lend themselves to massively parallel architectures. Empirical results of using these generalization styles on several real-world applications are presented. These results indicate that the prototype styles of generalization presented can provide accurate generalization for many applications, and that having several styles of generalization available often permits one to be selected that works well for a particular application.
Reference: Master's Thesis, Brigham Young University, August 1994. (Postscript: wilson.thesis94.ps) (or PDF)
Abstract: There are many ways for a learning system to generalize from training set data. There is likely no one style of generalization which will solve all problems better than any other style, for different styles will work better on some applications than others. This paper presents several styles of generalization and uses them to suggest that a collection of such styles can provide more accurate generalization than any one style by itself. Empirical results of generalizing on several real-world applications are given, and comparisons are made on the generalization accuracy of each style of generalization. The empirical results support the hypothesis that using multiple generalization styles can improve generalization accuracy.
Reference: Proceedings of the First New Zealand International Conference on Artificial Neural Networks and Expert Systems (ANNES), pp. 54-57, November 1993. (Postscript: wilson.annes93.mult.ps) (or PDF)
Abstract: There are many ways for a learning system to generalize from training set data. This paper presents several generalization styles using prototypes in an attempt to provide accurate generalization on training set data for a wide variety of applications. These general- ization styles are efficient in terms of time and space, and lend themselves well to massively parallel architectures. Empirical results of generalizing on several real-world applications are given, and these results indicate that the prototype styles of generalization presented have potential to provide accurate generalization for many applications.
Reference: The Sixth Australian Joint Conference on Artificial Intelligence (AI '93), pp. 356-361, November 1993. (Postscript: wilson.ai93.proto.ps) (or PDF)
Abstract: A method and apparatus for signal classification using a multilayer temporal Workshop on Technology for Family History and Genealogical Research (FHT2001), March 29, 2001. (PDF: wilson.fht01.remerge.pdf) (See my Remerge web page for updates).