Randy Wilson's Publications

This page contains the titles and abstracts of publications of D. Randall Wilson. Postscript files are available for most papers. Papers by other members of the Neural Network & Machine Learning (NNML) research group are also available.

The General Inefficiency of Batch Training for Gradient Descent Learning

Authors:

Contact address:

Abstract: Gradient descent training of neural networks can be done in either a batch or on-line manner. A widely held myth in the neural network community is that batch training is as fast or faster and/or more "correct" than on-line training because it supposedly uses a better approximation of the true gradient for its weight updates. This paper explains why batch training is almost always slower than on-line training—often orders of magnitude slower, especially on large training sets. The main reason is due to the ability of on-line training to follow curves in the error surface throughout each epoch, which allows it to safely use a larger learning rate and thus converge with less iterations through the training data. Empirical results on a large (20,000-instance) speech recognition task and on 26 other learning tasks demonstrate that convergence can be reached significantly faster using on-line training than batch training, with no apparent difference in accuracy.

Reference: Neural Networks, vol. 16, no. 10, pages 1429-1451, Elsevier Science Ltd. Oxford, UK, UK, 2003. (PDF: Wilson.nn03.batch.pdf). Test data is also available.

Reduction Techniques for Exemplar-Based Learning Algorithms

Authors:

Contact address:

Abstract: Exemplar-based learning algorithms are often faced with the problem of deciding which instances or other exemplars to store for use during generalization. Storing too many exemplars can result in large memory requirements and slow execution speed, and can cause an oversensitivity to noise. This paper has two main purposes. First, it provides a survey of existing algorithms used to reduce the number of exemplars retained in exemplar-based learning models. Second, it proposes six new reduction algorithms called DROP1-5 and DEL that can be used to prune instances from the concept description. These algorithms and 10 algorithms from the survey are compared on 31 datasets. Of those algorithms that provide substantial storage reduction, the DROP algorithms have the highest generalization accuracy in these experiments, especially in the presence of noise.

Reference: Machine Learning, vol. 38, no. 3, pp. 257-286, March 2000. (Postscript: wilson.ml2000.drop.ps) (or PDF). Source code and test data is also available.

An Integrated Instance-Based Learning Algorithm

Authors:

Contact address:

Abstract: The basic nearest-neighbor rule generalizes well in many domains but has several shortcomings, including inappropriate distance functions, large storage requirements, slow execution time, sensitivity to noise, and an inability to adjust its decision boundaries after storing the training data. This paper proposes methods for overcoming each of these weaknesses and combines these methods into a comprehensive learning system called the Integrated Decremental Instance-Based Learning Algorithm (IDIBL) that seeks to reduce storage, improve execution speed, and increase generalization accuracy, when compared to the basic nearest neighbor algorithm and other learning models. IDIBL tunes its own parameters using a new measure of fitness that combines confidence and cross-validation (CVC) accuracy in order to avoid discretization problems with more traditional leave-one-out cross-validation (LCV). In our experiments IDIBL achieves higher generalization accuracy than other less comprehensive instance-based learning algorithms, while requiring less than one-fourth the storage of the nearest neighbor algorithm and improving execution speed by a corresponding factor. In experiments on 21 datasets, IDIBL also achieves higher generalization accuracy than those reported for 16 major machine learning and neural network models.

Reference: Computational Intelligence, vol. 16, no. 1, pp. 1-28, 2000. (Postscript: wilson.ci2000.idibl.ps) (or PDF)

Improved Heterogeneous Distance Functions

Authors:

Contact address:

Abstract: Nearest neighbor and instance-based learning techniques typically handle continuous and linear input values well, but often do not handle nominal input attributes appropriately. The Value Difference Metric (VDM) was designed to find reasonable distance values between nominal attribute values, but it largely ignores continuous attributes, requiring discretization to map continuous values into nominal values. This paper proposes three new heterogeneous distance functions, called the Heterogeneous Value Difference Metric (HVDM), Interpolated Value Difference Metric (IVDM) and the Windowed Value Difference Metric (WVDM). These new distance functions are designed to handle applications with nominal attributes, continuous attributes, or both. In experiments on 48 applications the new distance metrics achieve higher average classification accuracy than previous distance functions on applications with both nominal and continuous attributes.

Reference: Journal of Artificial Intelligence Research (JAIR), Vol. 6, No. 1, pp. 1-34, 1997. (Postscript: wilson.jair97.hvdm.ps), (or PDF), (or see the HTML version)

Prototype Classifier Design with Pruning

Authors:

Contact address:

Abstract: The conventional nearest neighbor classifier (NNC) suffers from a large storage requirement and a slow query execution time when dealing with a large database. Algorithms which attempt to alleviate these disadvantages can be divided into three main categories: Fast searching algorithms, Instance-based learning algorithms and Prototype based algorithms. In this paper an algorithm, called LVQPRU, is proposed for pruning NNC prototype vectors so that a compact classiffier with good performance can be obtained. The basic condensing algorithm is applied to the initial prototypes to speed up the learning process. A Learning Vector Quantization (LVQ) algorithm is utilized to fine-tune the remaining prototypes during each pruning iteration. The LVQPRU algorithm is evaluated on several data sets with 12 reported algorithms using ten-fold cross-validation followed by a t-test. Simulation results show that the proposed algorithm has the highest generalization accuracies and good storage reduction ratios for most of the data sets.

Reference: International Journal on Artificial Intelligence Tools, Vol. 14, No. 1-2, pp. 261-280, 2005. (PDF: wilson.jait05.nnc.pdf)

Advances in Instance-Based Learning Algorithms

Authors:

Contact address:

Abstract: The nearest neighbor algorithm and its derivatives, which are often referred to collectively as instance-based learning algorithms, have been successful on a variety of real-world applications. However, in its basic form, the nearest neighbor algorithm suffers from inadequate distance functions, large storage requirements, slow execution speed, a sensitivity to noise and irrelevant attributes, and an inability to adjust its decision surfaces after storing the data. This dissertation presents a collection of papers that seek to overcome each of these disadvantages. The most successful enhancements are combined into a comprehensive system called the Integrated Decremental Instance-Based Learning algorithm, which in experiments on 44 applications achieves higher generalization accuracy than other instance-based learning algorithms. It also yields higher generalization accuracy than that reported for 16 major machine learning and neural network models.

Reference: Ph.D. Dissertation, Computer Science Department, Brigham Young University, August 1997. (Postscript: wilson.phd97.ps) (or PDF)

The Need for Small Learning Rates on Large Problems,

Authors:

Contact address:

Abstract: In gradient descent learning algorithms such as error back-propagation, the learning rate parameter can have a significant effect on generalization accuracy. In particular, decreasing the learning rate below that which yields the fastest convergence can significantly improve generalization accuracy, especially on large, complex problems. The learning rate also directly affects training speed, but not necessarily in the way that many people expect. Many neural network practitioners currently attempt to use the largest learning rate that still allows for convergence, in order to improve training speed. However, a learning rate that is too large can be as slow as a learning rate that is too small, and a learning rate that is too large or too small can require orders of magnitude more training time than one that is in an appropriate range. This paper illustrates how the learning rate affects training speed and generalization accuracy, and thus gives guidelines on how to efficiently select a learning rate that maximizes generalization accuracy.

Reference: In Proceedings of the 2001 International Joint Conference on Neural Networks (IJCNN2001), pp. 115-119, July 2001. (Postscript: wilson.ijcnn2001.lr.ps) (or PDF)

The Inefficiency of Batch Training for Large Training Sets,

Authors:

Contact address:

Abstract: Multilayer perceptrons are often trained using error backpropagation (BP). BP training can be done in either a batch or continuous manner. Claims have frequently been made that batch training is faster and/or more "correct" than continuous training because it uses a better approximation of the true gradient for its weight updates. These claims are often supported by empirical evidence on very small data sets. These claims are untrue, however, for large training sets. This paper explains why batch training is much slower than continuous training for large training sets. Various levels of semi-batch training used on a 20,000-instance speech recognition task show a roughly linear increase in training time required with an increase in batch size.

Reference: In Proceedings of the International Joint Conference on Neural Networks (IJCNN2000), Vol. II, pp. 113-117, July 2000. (Postscript: wilson.ijcnn2000.batch.ps, or (or PDF)

Combining Cross-Validation and Confidence to Measure Fitness

Authors:

Contact address:

Abstract:Neural network and machine learning algorithms often have parameters that must be tuned for good performance on a particular task. Leave-one-out cross-validation (LCV) accuracy is often used to measure the fitness of a set of parameter values. However, small changes in parameters often have no effect on LCV accuracy. Many learning algorithms can measure the confidence of a classification decision, but often confidence alone is an inappropriate measure of fitness. This paper proposes a combined measure of Cross-Validation and Confidence (CVC) for obtaining a continuous measure of fitness for sets of parameters in learning algorithms. This paper also proposes the Refined Instance-Based (RIB) learning algorithm which illustrates the use of CVC in automated parameter tuning. Using CVC provides significant improvement in generalization accuracy on a collection of 31 classification tasks when compared to using LCV.

Reference: In Proceedings of the International Joint Conference on Neural Networks (IJCNN'99), paper 163, 1999. (Postscript: wilson.ijcnn99.cvc.ps, or PDF)

A Neural Model of Centered Tri-gram Speech Recognition

Authors:

Contact address:

Abstract:A relaxation network model that includes higher order weight connections is introduced. To demonstrate its utility, the model is applied to the speech recognition domain. Traditional speech recognition systems typically consider only that context preceding the word to be recognized. However, intuition suggests that considering following context as well as preceding context should improve recognition accuracy. The work described here tests this hypothesis by applying the higher order relaxation network to consider both precedes and follows context in a speech recognition task. The results demonstrate both the general utility of the higher order relaxation network as well as its improvement over traditional methods on a speech recognition task.

Reference: In Proceedings of the International Joint Conference on Neural Networks (IJCNN'99), paper 2188, 1999. (PDF format: ventura.ijcnn99b.pdf)

The Robustness of Relaxation Rates in Constraint Satisfaction Networks

Authors:

Contact address:

Abstract:Constraint satisfaction networks contain nodes that receive weighted evidence from external sources and/or other nodes. A relaxation process allows the activation of nodes to affect neighboring nodes, which in turn can affect their neighbors, allowing information to travel through a network. When doing discrete updates (as in a software implementation of a relaxation network), a goal net or goal activation can be computed in response to the net input into a node, and a relaxation rate can then be used to determine how fast the node moves from its current value to its goal value. An open question was whether or not the relaxation rate is a sensitive parameter. This paper shows that the relaxation rate has almost no effect on how information flows through the network as long as it is small enough to avoid large discrete steps and/or oscillation.

Reference: In Proceedings of the International Joint Conference on Neural Networks (IJCNN'99), paper 162, 1999. (PDF: wilson.ijcnn99.relax.pdf)

Instance Pruning Techniques

Authors:

Contact address:

Abstract: The nearest neighbor algorithm and its derivatives are often quite successful at learning a concept from a training set and providing good generalization on subsequent input vectors. However, these techniques often retain the entire training set in memory, resulting in large memory requirements and slow execution speed, as well as a sensitivity to noise. This paper provides a discussion of issues related to reducing the number of instances retained in memory while maintaining (and sometimes improving) generalization accuracy, and mentions algorithms other researchers have used to address this problem. It presents three intuitive noise-tolerant algorithms that can be used to prune instances from the training set. In experiments on 29 applications, the algorithm that achieves the highest reduction in storage also results in the highest generalization accuracy of the three methods.

Reference: In Fisher, D., ed., Machine Learning: Proceedings of the Fourteenth International Conference (ICML'97), Morgan Kaufmann Publishers, San Francisco, CA, pp. 403-411, July 1997. (Postscript: wilson.icml97.prune.ps) (or PDF)

Improved Center Point Selection for Probabilistic Neural Networks

Authors:

Contact address:

Abstract: Probabilistic Neural Networks (PNN) typically learn more quickly than many neural network models and have had success on a variety of applications. However, in their basic form, they tend to have a large number of hidden nodes. One common solution to this problem is to keep only a randomly-selected subset of the original training data in building the network. This paper presents an algorithm called the Reduced Probabilistic Neural Network (RPNN) that seeks to choose a better-than-random subset of the available instances to use as center points of nodes in the network. The algorithm tends to retain non-noisy border points while removing nodes with instances in regions of the input space that are highly homogeneous. In experiments on 22 datasets, the RPNN had better average generalization accuracy than two other PNN models, while requiring an average of less than one-third the number of nodes.

Reference: In Proceedings of the International Conference on Artificial Neural Networks and Genetic Algorithms (ICANNGA'97), pp. 514-517, 1997. (Postscript: wilson.icannga97.rpnn.ps) (or PDF)

Real-Valued Schemata Search Using Statistical Confidence

Authors:

Contact address:

Abstract: Many neural network models must be trained by finding a set of real-valued weights that yield high accuracy on a training set. Other learning models require weights on input attributes that yield high leave-one-out classification accuracy in order to avoid problems associated with irrelevant attributes and high dimensionality. In addition, there are a variety of general problems for which a set of real values must be found which maximize some evaluation function. This paper presents an algorithm for doing a schemata search over a real-valued weight space to find a set of weights (or other real values) for a given evaluation function. The algorithm, called the Real-Valued Schemata Search (RVSS) uses the BRACE statistical technique [Moore & Lee, 1993] to determine when to narrow the search space. This paper details the RVSS approach and gives initial empirical results.

Reference: In Proceedings of the 1997 Sian Ka'an International Workshop on Neural Networks and Neurocontrol, 1997. (Postscript: wilson.sian97.schema.ps or presentation: wilson.sian97.slides.ps) (or PDF)

Heterogeneous Radial Basis Function Networks

Authors:

Contact address:

Abstract: Radial Basis Function (RBF) networks typically use a distance function designed for numeric attributes, such as Euclidean or city-block distance. This paper presents a heterogeneous distance function which is appropriate for applications with symbolic attributes, numeric attributes, or both. Empirical results on 30 data sets indicate that the heterogeneous distance metric yields significantly improved generalization accuracy over Euclidean distance in most cases involving symbolic attributes.

Reference: Proceedings of the International Conference on Neural Networks (ICNN'96), Vol. 2, pp. 1263-1267, June 1996. (Postscript: wilson.icnn96.hrbf.ps) (or PDF)

Bias and the Probability of Generalization

Authors:

Abstract:

Reference: In Proceedings of the 1997 International Conference on Intelligent Information Systems (IIS'97), pp. 108-114, 1997. (Postscript: wilson.iis97.bias.ps) (or PDF)

Value Difference Metrics for Continuously Valued Attributes

Authors:

Contact address:

Abstract: Nearest neighbor and instance-based learning techniques typically handle continuous and linear input values well, but often do not handle symbolic input attributes appropriately. The Value Difference Metric (VDM) was designed to find reasonable distance values between symbolic attribute values, but it largely ignores continuous attributes, using discretization to map continuous values into symbolic values. This paper presents two heterogeneous distance metrics, called the Interpolated VDM (IVDM) and Windowed VDM (WVDM), that extend the Value Difference Metric to handle continuous attributes more appropriately. In experiments on 21 data sets the new distance metrics achieves higher classification accuracy in most cases involving continuous attributes.

Reference: Proceedings of the International Conference on Artificial Intelligence, Expert Systems, and Neural Networks (AIE'96), pp. 74-78, August 1996. (Postscript: wilson.aie96.ivdm.ps) (or PDF)

Instance-Based Learning with Genetically Derived Attribute Weights

Authors:

Contact address:

Abstract: This paper presents an inductive learning system called the Genetic Instance-Based Learning (GIBL) system. This system combines instance-based learning approaches with evolutionary computation in order to achieve high accuracy in the presence of irrelevant or redundant attributes. Evolutionary computation is used to find a set of attribute weights that yields a high estimate of classification accuracy. Results of experiments on 16 data sets are shown, and are compared with a non-weighted version of the instance-based learning system. The results indicate that the generalization accuracy of GIBL is somewhat higher than that of the non-weighted system on regular data, and is significantly higher on data with irrelevant or redundant attributes.

Reference: Proceedings of the International Conference on Artificial Intelligence, Expert Systems, and Neural Networks (AIE'96), pp. 11-14, August 1996. (Postscript: wilson.aie96.gibl.ps) (or PDF)

Prototype Styles of Generalization

Authors:

Contact address:

Abstract: A learning system can generalize from training set data in many ways. No single style of generalization is likely to solve all problems better than any other style, and different styles work better on some applications than others. Several generalization styles are proposed, including distance metrics, first-order features, voting schemes, and confidence levels. These generalization styles are efficient in terms of time and space and lend themselves to massively parallel architectures. Empirical results of using these generalization styles on several real-world applications are presented. These results indicate that the prototype styles of generalization presented can provide accurate generalization for many applications, and that having several styles of generalization available often permits one to be selected that works well for a particular application.

Reference: Master's Thesis, Brigham Young University, August 1994. (Postscript: wilson.thesis94.ps) (or PDF)

The Importance of Using Multiple Styles of Generalization

Authors:

Contact address:

Abstract: There are many ways for a learning system to generalize from training set data. There is likely no one style of generalization which will solve all problems better than any other style, for different styles will work better on some applications than others. This paper presents several styles of generalization and uses them to suggest that a collection of such styles can provide more accurate generalization than any one style by itself. Empirical results of generalizing on several real-world applications are given, and comparisons are made on the generalization accuracy of each style of generalization. The empirical results support the hypothesis that using multiple generalization styles can improve generalization accuracy.

Reference: Proceedings of the First New Zealand International Conference on Artificial Neural Networks and Expert Systems (ANNES), pp. 54-57, November 1993. (Postscript: wilson.annes93.mult.ps) (or PDF)

The Potential of Prototype Styles of Generalization

Authors:

Contact address:

Abstract: There are many ways for a learning system to generalize from training set data. This paper presents several generalization styles using prototypes in an attempt to provide accurate generalization on training set data for a wide variety of applications. These general- ization styles are efficient in terms of time and space, and lend themselves well to massively parallel architectures. Empirical results of generalizing on several real-world applications are given, and these results indicate that the prototype styles of generalization presented have potential to provide accurate generalization for many applications.

Reference: The Sixth Australian Joint Conference on Artificial Intelligence (AI '93), pp. 356-361, November 1993. (Postscript: wilson.ai93.proto.ps) (or PDF)

Method and Apparatus for Signal Classification Using a Multilayer Network

Authors: Tony R. Martinez, R. Brian Moncur, D. Lynn Shepherd, D. Randall Wilson, Carl Hal Hansen
Contact address: martinez@cs.byu.edu, BMoncur@fonix.com, LynnS@fonix.com, WilsonR@fonix.com

Abstract: A method and apparatus for signal classification using a multilayer temporal relaxation network involves receiving an input signal feature vector, classifying a first signal feature, and classifying a second signal feature using contextual information. The multilayer temporal relaxation network applies a relaxation process that updates an activation value of a node in a first layer and updates an activation value of a node in a second layer. The multilayer network then generates a signal classification according to an activation value of a node in the multilayer network.

Reference: U.S. Patent & Trademark Office, Patent No. 6,208,963, issued March 27, 2001. (WWW listing for Patent No. 6,208,963)

Beyond GEDCOM: Modeling Genealogical Source Record Data

Author: D. Randall Wilson
Contact address: wilsonr@familysearch.org

Abstract: Computer software has modeled genealogical conclusions for decades using a data model similar to the one used by the GEDCOM data exchange format. Modeling genealogical source data that appears in historical records, however, presents some challenges beyond what a conclusional data model was designed to handle. This paper discusses the ways in which historical record data and conclusion data are different, and in what ways they are the same. It illustrates how a model can handle both in a way that makes them compatible for many uses while preserving some of the nuances found in the records.

Reference: 13th Annual Family History Technology Workshop (FHTW2013), 2013. (PDF: wilson.fht2013.beyondgedcom.pdf ; Poster)

Personal Digital Preservation: Issues and Approaches

Author: D.Randall Wilson
Contact address: wilsonr@familysearch.org

Abstract: Many individuals and organizations have valuable photos and documents that need to be preserved, organized and shared. However, it is nearly impossible to preserve such "digital artifacts" in a way that guarantees that they will be around and accessible 50 years from now. People often have shoe boxes full of old family photos, which are priceless treasures when the individuals are identified, but become nearly worthless when nobody can remember who they are. Other people probably have many old photos that would be valuable to you, but it is currently very difficult for you to find out that they exist or to get a copy. Others in the family could help you identify who the people are in some of the photos you have, but there isn't a convenient way to get their input. Many excellent products and services have been created that address parts of the problem, but there remains a need for industry standards and free (or perhaps pre-paid) long-term preservation services that continue to preserve data and metadata even if the user passes away or a company goes out of business.

This paper reviews the requirements for long-term preservation of personal digital artifacts, discusses existing standards that might help (such as Adobe's XMP metadata format and the Metadata Working Group's definitions), and suggests what still needs to happen to make personal digital preservation reasonable and attractive to the masses.

Reference: RootsTech 2012. (PDF; Slides).

Beyond Probabilistic Record Linkage: Using Neural Networks and Complex Features to Improve Genealogical Record Linkage

Author:

Contact address:

Abstract: Probabilistic record linkage has been used for many years in a variety of industries, including medical, government, private sector and research groups. The formulas used for probabilistic record linkage have been recognized by some as being equivalent to the naïve Bayes classifier. While this method can produce useful results, it is not difficult to improve accuracy by using one of a host of other machine learning or neural network algorithms. Even a simple single-layer perceptron tends to outperform the naïve Bayes classifier—and thus traditional probabilistic record linkage methods—by a substantial margin. Furthermore, many record linkage system use simple field comparisons rather than more complex features, partially due to the limits of the probabilistic formulas they use. This paper presents an overview of probabilistic record linkage, shows how to cast it in machine learning terms, and then shows that it is equivalent to a naïve Bayes classifier. It then discusses how to use more complex features than simple field comparisons, and shows how probabilistic record linkage formulas can be modified to handle this. Finally, it demonstrates a huge improvement in accuracy through the use of neural networks and higher-level matching features, compared to traditional probabilistic record linkage on a large (80,000 pair) set of labeled pairs of genealogical records used by FamilySearch.org.

Reference: International Joint Conference on Neural Networks (IJCNN 2011), August 2011, pp. 9-14. (PDF)

Genealogical Record Linkage: Features for Automated Person Matching

Authors:

Contact address:

Abstract: This paper provides a high-level overview of how automatic person matching (genealogical record linkage) algorithms can be developed, and then provides a detailed explanation of many of the features used by FamilySearch in doing person matching. Empirical results show a dramatic improvement in accuracy by using these features trained with neural networks, when compared to traditional probabilistic record linkage with simple field agreement features.

Reference: RootsTech 2011, pp. 331-340, February 2011, Salt Lake City, Utah. (PDF)

Genealogical Record Linkage on International Data

Author:

Contact address:

Abstract: Record linkage is used to identify multiple records that refer to the same real person in many fields, including medicine, advertising, business and genealogy. Record linkage requires dealing with variations in names, dates, places and other data, which can be a challenge even in a single culture. FamilySearch deals with genealogical records from many parts of the world, which presents further challenges in dealing with various naming conventions, writing systems, and cultural differences. This paper reviews recent work done by FamilySearch to address the special matching challenges on Chinese, Japanese, Korean and Cyrillic records, as well as Scandinavian cultures.

Reference: 8th Annual Family History Technology Workshop (FHT2008), 2008. (PDF: wilson.fht2008.cjkc.pdf)

High-Level View of a Source-Centric Genealogical Model: "The Model with Four Boxes"

Author:

Contact address:

Abstract: This paper presents a high-level genealogical model intended to avoid infinite amounts of duplicate effort that is currently a problem in several parts of family history work. The model contains just four high-level elements: (i) a source authority keeps track of all known sources of genealogical data in the world, including written records and living memory; (ii) an artifact archive stores scanned images of documents and other digital artifacts; (iii) a structured data archive stores structured data that was extracted directly from sources; and (iv) a family tree stores conclusions about real people, including pointers to entries in the structured data archive that reference a person. In addition to tracking sources, evidence and conclusions, verification work is also tracked so that this, too, need not be repeated indefinitely. Reference: 6th Annual Family History Technology Workshop (FHT2006), 2006. (PDF: wilson.fht2006.source-centric.pdf)

Name Standardization for Genealogical Record Linkage

Author:

Contact address:

Abstract: A common part of genealogical research is finding multiple records that refer to the same person and either linking them together or gathering their information into a single place. Finding records that refer to the same person-whether in printed indexes or online databases-usually requires using the person's name. However, names can vary in different sources for a variety of reasons, including nicknames, transcription or typographical errors, abbreviations, migration name changes, homographic spelling variations, etc. Various approaches have been proposed to standardize names so that variations of the same person's names can be brought together. These approaches include manually built name catalogs, algorithmically enhanced name catalogs, name encoding algorithms (such as Soundex, Nysiis, and Double Metaphone), name comparison algorithms (such as Edit Distance, Jaro-Winkler distance, etc.), as well as simple normalization techniques (lower case, remove periods, etc.). Each approach addresses a different set of the above types of name variations, and each casts a different size and shape of a "net" over the name space. The result is that each brings together different sets of names that commonly belong to the same person, and each tends to also bring together other names that do not.

This paper presents empirical results of testing a variety of name standardization techniques on a corpus of labeled training data. A database of 15 million individual genealogical records is used to determine how many individuals are brought together using each standardization technique (i.e., the "average number of hits"). Genealogical experts have found 25,000 pairs of records for which both records in each pair refer to the same person. The recall of a particular standardization method is the percentage of these individuals brought together when using that method (or collection of methods). The goal of standardization is to achieve near 100% recall while keeping the average number of overall hits down as low as possible. This paper presents the recall and average number of hits for a variety of name standardization approaches and promising combinations of these approaches, and draws conclusions regarding the best technique(s) to use at different levels of acceptable recall on data of this type.

Reference: 5th Annual Family History Technology Workshop (FHT2005), 2005. (PDF: wilson.fht2005.namestd.pdf)

Efficient Genealogy through Personal Extraction and Automated Verification

Author:

Contact address:

Abstract: Traditional genealogical research often involves (1) searching for documents containing information about relatives, and then (2) entering the extracted genealogical information from the documents into a genealogical database. Often the overwhelming majority of the time is spent on the first step, because documents must be sought at libraries, on microfilm, or even at distant locations. Extraction efforts involve entering all of the genealogical information about all of the people referenced in a source into a database. If extraction is done before the searching phase, then searches can be completed in seconds instead of hours. Volunteers are more likely to be interested in helping with such efforts if they are able to choose assignments of personal interest to them, and also if they know that there is a "critical mass" of volunteers that are all pitching in to extract a significant portion of the available relevant records. Technology has been of great benefit in helping researchers to organize and share their research and conclusions with others around the world. However, it is important to make sure that both traditional research and extraction efforts all contribute to the global effort of building a worldwide genealogical database instead of simply multiplying the number of (often undocumented) copies—and perhaps erroneous variations—of information taken from original sources. Furthermore, there remains a need to be able to automatically verify work done by others so that the work does not have to be duplicated. This paper discusses what makes extraction approaches efficient, and how personal extraction provides additional motivation for people to participate. It then shares ideas on the harder issues of how to use the extracted records as the basic building blocks of a new kind of evidence-based genealogical database. Finally, it discusses approaches to verification in order to provide some hope of being able to trust the work of others and build a truly collaborative global genealogical database.

Reference: 3rd Annual Family History Technology Workshop (FHT2003), 2003. (PDF: wilson.fht2003.extraction.pdf)

Bidirectional Source Linking: Doing Genealogy "Once" and "For All"

Author:

Contact address:

Abstract: There are many people working on genealogy throughout the world. Unfortunately, much of their time is spent duplicating the work of others or doing work that will be done again later. This paper discusses the reasons behind this duplication of efforts. It then proposes the use of bidirectional source linking in conjunction with evidence-based genealogical research methods to make it possible to quickly find out what work has already been done with regards to a particular record or set of records, and to more easily know what else remains to be done. The proposed methods also make it possible for genealogists to do work just once that can then benefit everyone.

Reference: 2nd Annual Family History Technology Workshop (FHT2002), pp. 54-60, April 4, 2002. (PDF: wilson.fht2002.bi-link.pdf)

Graph-Based Remerging of Genealogical Databases

Author:

Contact address:

Abstract: Most personal genealogy software packages available today suffer from weak merging functionality, which makes collaboration with others much more difficult than it needs to be. In particular, if someone shares a copy of their electronic genealogical database with a relative, and then both make additions and/or changes to the data, there is currently no reasonable way in most genealogical programs to merge these changes together, except through painstaking examination of almost everyone in the database. This paper reviews current approaches to this problem and presents an algorithm for automating the merging process by taking into account the relationships of individuals in the database. The algorithm does not require storage of ID numbers, nor does it require exclusive access to portions of the database by certain individuals. Instead, subgraphs of people with identical information are identified, and the user only need make decisions where the databases differ on the edges of the subgraphs.

Reference: Family History Technology Workshop (FHT2001), March 29, 2001. (PDF: wilson.fht2001.remerge.pdf) (See my Remerge web page for updates).

Go to

Last modified Jun 5, 1996, by Randy Wilson.

Direct comments, suggestions or problems to randy@axon.cs.byu.edu.