© 1997 AI Access Foundation and Morgan Kaufmann Publishers. All rights reserved.
D. Randall Wilson RANDY@AXON.CS.BYU.EDU
Tony R. Martinez MARTINEZ@CS.BYU.EDU
Computer Science Department
Brigham Young University
Provo, UT 84602, USA
Instance-Based Learning (IBL) (Aha, Kibler & Albert, 1991; Aha, 1992; Wilson & Martinez, 1993; Wettschereck, Aha & Mohri, 1995; Domingos, 1995) is a paradigm of learning in which algorithms typically store some or all of the n available training examples (instances) from a training set, T, during learning. Each instance has an input vector x and an output class c. During generalization, these systems use a distance function to determine how close a new input vector y is to each stored instance, and use the nearest instance or instances to predict the output class of y (i.e., to classify y). Some instance-based learning algorithms are referred to as nearest neighbor techniques (Cover & Hart, 1967; Hart, 1968; Dasarathy, 1991), and memory-based reasoning methods (Stanfill & Waltz, 1986; Cost & Salzberg, 1993; Rachlin et al., 1994) overlap significantly with the instance-based paradigm as well. Such algorithms have had much success on a wide variety of real-world classification tasks.
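As a concrete sketch of this generalization step, the following minimal 1-nearest-neighbor classifier (an illustration only, not the implementation of any cited system) stores the training set T and classifies a query vector y by the class of its closest stored instance under Euclidean distance:

```python
import math

def euclidean(x, y):
    """Euclidean distance between two numeric input vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def nearest_neighbor_classify(T, y, dist=euclidean):
    """Predict the class of input vector y as the class c of the
    closest stored instance (x, c) in the training set T."""
    _, c = min(T, key=lambda inst: dist(inst[0], y))
    return c

# Tiny illustrative training set: (input vector, output class) pairs.
T = [((0.0, 0.0), "A"), ((1.0, 1.0), "A"), ((5.0, 5.0), "B")]
print(nearest_neighbor_classify(T, (4.0, 4.5)))  # -> B
```

The choice of `dist` is exactly the design decision this paper addresses: the same classifier can be equipped with any of the distance functions discussed below.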
Many neural network models also make use of distance functions, including radial basis function networks (Broomhead & Lowe, 1988; Renals & Rohwer, 1989; Wasserman, 1993), counterpropagation networks (Hecht-Nielsen, 1987), ART (Carpenter & Grossberg, 1987), self-organizing maps (Kohonen, 1990) and competitive learning (Rumelhart & McClelland, 1986). Distance functions are also used in many fields besides machine learning and neural networks, including statistics (Atkeson, Moore & Schaal, 1996), pattern recognition (Diday, 1974; Michalski, Stepp & Diday, 1981), and cognitive psychology (Tversky, 1977; Nosofsky, 1986).
Many distance functions have been proposed to decide which instance is closest to a given input vector (Michalski, Stepp & Diday, 1981; Diday, 1974). Most of these metrics work well for numerical attributes but do not appropriately handle nominal (i.e., discrete, and perhaps unordered) attributes.
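A toy illustration of the problem, assuming a hypothetical nominal attribute "color" that has been arbitrarily coded as integers: a linear metric then imposes an ordering on the values that has no basis in the data.

```python
# Arbitrary integer coding of a nominal attribute (hypothetical example):
codes = {"red": 0, "green": 1, "blue": 2}

# A linear metric now claims "red" is closer to "green" than to "blue",
# even though no such ordering exists among the colors.
d_red_green = abs(codes["red"] - codes["green"])  # 1
d_red_blue = abs(codes["red"] - codes["blue"])    # 2
print(d_red_green < d_red_blue)  # True, purely an artifact of the coding
```

A different (equally arbitrary) coding would reverse the conclusion, which is why such attributes call for a distance function defined on the values themselves.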
The Value Difference Metric (VDM) (Stanfill & Waltz, 1986) was introduced to define an appropriate distance function for nominal (also called symbolic) attributes. The Modified Value Difference Metric (MVDM) uses a different weighting scheme than VDM and is used in the PEBLS system (Cost & Salzberg, 1993; Rachlin et al., 1994). These distance metrics work well in many nominal domains, but they do not handle continuous attributes directly. Instead, they rely upon discretization (Lebowitz, 1985; Schlimmer, 1987), which can degrade generalization accuracy (Ventura & Martinez, 1995).
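The core value-difference idea can be sketched as follows: two nominal values are close when they induce similar class distributions in the training data. This unweighted form is an illustration of that idea, with hypothetical data and names, not code from the cited systems:

```python
from collections import Counter

def vdm(training, x, y, q=2):
    """Unweighted Value Difference Metric between two values x and y of
    a single nominal attribute. `training` is a list of
    (attribute_value, class) pairs. The distance is
        vdm(x, y) = sum_c | N_xc/N_x - N_yc/N_y |^q,
    where N_x counts occurrences of value x, and N_xc counts
    occurrences of value x labeled with class c."""
    n_val = Counter(v for v, _ in training)
    n_val_class = Counter(training)
    classes = {c for _, c in training}
    total = 0.0
    for c in classes:
        p_x = n_val_class[(x, c)] / n_val[x] if n_val[x] else 0.0
        p_y = n_val_class[(y, c)] / n_val[y] if n_val[y] else 0.0
        total += abs(p_x - p_y) ** q
    return total

# Illustrative data: one nominal attribute's values with their classes.
data = [("red", "A"), ("red", "A"), ("red", "B"),
        ("blue", "B"), ("blue", "B"), ("green", "A")]
# "red" and "green" mostly predict class A, so VDM places them close;
# "blue" predicts class B, so it lies farther from "red".
print(vdm(data, "red", "green"))  # approx. 0.222
print(vdm(data, "red", "blue"))   # approx. 0.889
```

Note that this sketch handles only nominal values; applying it to a continuous attribute would require discretizing the attribute first, which is precisely the limitation the paper's interpolated variants address.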
Many real-world applications have both nominal and linear attributes, including, for example, over half of the datasets in the UCI Machine Learning Database Repository (Merz & Murphy, 1996). This paper introduces three new distance functions that are more appropriate than previous functions for applications with both nominal and continuous attributes. These new distance functions can be incorporated into many of the above learning systems and areas of study, and can be augmented with weighting schemes (Wettschereck, Aha & Mohri, 1995; Atkeson, Moore & Schaal, 1996) and other enhancements that each system provides.
The choice of distance function influences the bias of a learning algorithm. A bias is "a rule or method that causes an algorithm to choose one generalized output over another" (Mitchell, 1980). A learning algorithm must have a bias in order to generalize, and it has been shown that no learning algorithm can generalize more accurately than any other when summed over all possible problems (Schaffer, 1994) (unless information about the problem other than the training data is available). It follows then that no distance function can be strictly better than any other in terms of generalization ability, when considering all possible problems with equal probability.
However, when there is a higher probability of one class of problems occurring than another, some learning algorithms can generalize more accurately than others (Wolpert, 1993). This is not because they are better when summed over all problems, but because the problems on which they perform well are more likely to occur. In this sense, one algorithm or distance function can be an improvement over another in that it has a higher probability of good generalization than another, because it is better matched to the kinds of problems that will likely occur.
Many learning algorithms use a bias of simplicity (Mitchell, 1980; Wolpert, 1993) to generalize, and this bias is appropriate--meaning that it leads to good generalization accuracy--for a wide variety of real-world applications, though the meaning of simplicity varies depending upon the representational language of each learning algorithm. Other biases, such as decisions made on the basis of additional domain knowledge for a particular problem (Mitchell, 1980), can also improve generalization.
In this light, the distance functions presented in this paper are more appropriate than those used for comparison in that they on average yield improved generalization accuracy on a collection of 48 applications. The results are theoretically limited to this set of datasets, but the hope is that these datasets are representative of other problems that will be of interest (and occur frequently) in the real world, and that the distance functions presented here will be useful in such cases, especially those involving both continuous and nominal input attributes.
Section 2 provides background information on distance functions used previously. Section 3 introduces a distance function that combines Euclidean distance and VDM to handle both continuous and nominal attributes. Sections 4 and 5 present two extensions of the Value Difference Metric which allow for direct use of continuous attributes. Section 4 introduces the Interpolated Value Difference Metric (IVDM), which uses interpolation of probabilities to avoid problems related to discretization. Section 5 presents the Windowed Value Difference Metric (WVDM), which uses a more detailed probability density function for a similar interpolation process.
Section 6 presents empirical results comparing three commonly-used distance functions with the three new functions presented in this paper. The results are obtained from using each of the distance functions in an instance-based learning system on 48 datasets. The results indicate that the new heterogeneous distance functions are more appropriate than previously used functions on datasets with both nominal and linear attributes, in that they achieve higher average generalization accuracy on these datasets. Section 7 discusses related work, and Section 8 provides conclusions and future research directions.