Genealogy Merging Algorithms

Introduction

Most personal genealogy software packages available today suffer from weak merging functionality, which makes collaboration with others much more difficult than it needs to be. In particular, if someone shares a copy of their electronic genealogical database with a relative, and then both make additions and/or changes to the data, there is currently no reasonable way in most genealogical programs to merge these changes together.

A more powerful approach was presented in the paper entitled Graph-based Remerging of Genealogical Databases by Randy Wilson, which was presented at the 2001 Family History Technology Workshop on March 29, 2001, at Brigham Young University.

Click here for a PDF file of that paper.

The ideas in that paper and on this web page are available to be freely used by anyone who wants to use them. Including these algorithms in your genealogical software, thus allowing people to collaborate and avoid wasting more time would be payment enough to me.

Further Developments

Since the publication of that report, additional ideas have come up that could improve the usefulness of the algorithm. After reading the paper, feel free to also read these ideas, and contact me if you have any further suggestions, questions, or if you would be interested in implementing these algorithms into your own code.

Incremental Updates

The graph-based merging algorithm allows you to share your database with someone else, and then "resynchronize" them back together by using the matching individuals to determine how any non-matching individuals fit together into the relationship graph.

However, many people's modems are slow, so sending a genealogical database that is several megabytes in size may not be practical for some people. Fortunately, it is also not necessary after the first copy of the database is sent.

Once both people have a copy of the same database (on a certain date), they can both make updates independently. When someone wants to send another update, they can do an "incremental update," which works as follows.

1. All individuals whose records have been added or modified since the last synchronization (which could be kept track of in a list, e.g., "Last update with Robert B. Jones on 11 November 2000") are marked for inclusion in the export.
2. The shortest number of people needed to link these people together are found and also marked for export, so that the exported file contains a connected graph of individuals. (The number of additional individuals needed for this will usually not be more than a couple of dozen).
3. Since the merging algorithm needs to be able to find a good-sized group of matching individuals to be sure of a match, additional individuals connected to the exported ones can be added, unless there are already enough matching individuals from Step 2 to "anchor" the group properly.

This results in a small subset of the total database, which can then be easily e-mailed to a relative. Upon receipt, their program would do the graph-based "remerging" algorithm as usual, but there would just be fewer individuals in the incoming database. The "anchored" individuals would match, causing the new or modified individuals to fit into the relationship graph in the correct place. The thousands of unchanged individuals would not need to be e-mailed nor reviewed by the person receiving the update.

Contact Information

If you have any ideas related to merging genealogical databases, or if you would like to include these algorithms in your code, discuss them, volunteer to write some code, or anything else, please contact me!

randy@axon.cs.byu.edu