Genealogy Merging Algorithms
Introduction
Most personal genealogy software packages available today suffer from weak merging
functionality, which makes collaboration with others much more difficult than it needs to be. In
particular, if someone shares a copy of their electronic genealogical database with a relative, and
then both make additions and/or changes to the data, there is currently no reasonable way in most
genealogical programs to merge these changes together.
A more powerful approach was presented in the paper entitled
Graph-based Remerging of Genealogical Databases
by Randy Wilson, which was presented at the
2001 Family History Technology Workshop
on March 29, 2001, at Brigham Young University.
Click here for a PDF file of that paper.
The ideas in that paper and on this web page are available to be freely used by
anyone who wants to use them. Including these algorithms in your genealogical
software, thus allowing people to collaborate and avoid wasting more time would
be payment enough to me.
Further Developments
Since the publication of that report, additional ideas have come up that
could improve the usefulness of the algorithm. After reading the paper,
feel free to also read these ideas, and contact me
if you have any further suggestions, questions, or if you would be interested
in implementing these algorithms into your own code.
Incremental Updates
The graph-based merging algorithm allows you to share your database with someone
else, and then "resynchronize" them back together by using the matching individuals
to determine how any non-matching individuals fit together into the relationship graph.
However, many people's modems are slow, so sending a genealogical database that is
several megabytes in size may not be practical for some people. Fortunately, it is
also not necessary after the first copy of the database is sent.
Once both people have a copy of the same database (on a certain date), they can
both make updates independently. When someone wants to send another update, they
can do an "incremental update," which works as follows.
- 1. All individuals whose records have been added or modified since the last
synchronization (which could be kept track of in a list, e.g., "Last update with
Robert B. Jones on 11 November 2000") are marked for inclusion in the export.
- 2. The shortest number of people needed to link these people together are found
and also marked for export, so that the exported file contains a connected graph
of individuals. (The number of additional individuals needed for this will usually
not be more than a couple of dozen).
- 3. Since the merging algorithm needs to be able to find a good-sized group of
matching individuals to be sure of a match, additional individuals connected to
the exported ones can be added, unless there are already enough matching
individuals from Step 2 to "anchor" the group properly.
This results in a small subset of the total database, which can then be easily
e-mailed to a relative. Upon receipt, their program would do the graph-based
"remerging" algorithm as usual, but there would just be fewer individuals in
the incoming database. The "anchored" individuals would match, causing the
new or modified individuals to fit into the relationship graph in the correct
place. The thousands of unchanged individuals would not need to be e-mailed
nor reviewed by the person receiving the update.
Contact Information
If you have any ideas related to merging genealogical databases, or if you
would like to include these algorithms in your code, discuss them, volunteer
to write some code, or anything else, please contact me!