Dealing with Duplicate Person Data

clones.jpgI've recently been working on a fairly large project that that has contact information for almost 2 million people. These records contain details for both online and offline actions. Since the data can come from multiple sources there exist many duplicate records. Duplicate records mean more processing for our code, more storage space and more hassle for our clients who have to deal with these duplicates. All in all, bad things to leave lying around.

In this article we'll look at some strategies that I used to identify and remove these duplicates. All code in this article are samples, and we'll leave the task of assembling them into a final working program up to the reader.

CPAN is your Friend

Like all good Perl projects, we will make heavy use of the CPAN. It makes our lives so much easier and every day I'm more in awe at the quality and bredth of solutions I find there. For this project we'll be using Text::LevenshteinXS, Lingua::EN::Nickname and Parallel::ForkManager.

What is a Duplicate?

For our project we didn't want to automatically merge duplicates but rather flag them as being potential duplicates and then let our client choose to do the actual merge. Not only does this force a human to make the final decision but it also allows the client to determine which values for which conflicting fields are chosen for the final merged record.

To mark a pair of records as being potential duplicates we want to give them a score and then remember any pair that scored over a certain threshold. In our case the scores go from 0 to 100 and our threshold is 75. Each organization is different in what fields it has and what they consider to be a duplicate. Since some fields are more important than others (having the same state is not as important as having the same last name) we need to give our fields some weight. For this project we'll use the following fields and weights:

    my %field_weight = (
        last_name  => 25,
        first_name => 25,
        address1   => 20,
        address2   => 10,
        city       => 20,
    );


If the fields from 2 different records don't match (we ignore case for all matches because "peters" is the same thing as "PETERS") for a given field we subtract the given weight. You will of course have to play with these numbers for your own data.

Catching Typos

What if 2 people who are living at the same address have first names of "Michael" and "Michal"? They are most likely the same person and someone somewhere made a typo. One of the easiest ways to catch typos or strings that are almost identical is with an algorithm called a "Levenshtein Distance". You can read more about it if you want the specifics, but basically this algorithm calculates the number of changes needed to transform one string into another. In the case of "Michael" vs "Michal" the Levenshtein Distance is 1. The distance between "Michael" and "Michelle" is 3 and the distance between "Michael" and "Peters" is a whopping 7.

    use Text::LevenshteinXS qw(distance);
    my $distance = distance("Michael", "Michelle");


We're using Text::LevenshteinXS because it's simple and written in C (the "XS" part of the name means it's written in Perl's C API which is called XS).

More Weights

Since typos are usually 1 or 2 characters we don't really want anything with a distance of more than 3. But 1 is more likely to be a duplicate than is 3. So once again we'll use weighted values. For this project we'll use the following weights modifiers:

    my @distance_modifiers = (.10, .30, .45, .60, .90);


So if we're comparing our first_name fields and they have a Levenshtein Distance of 2, then we'll subtract 7.5 points (25 x .30). If that's the only difference between our two records then we'll have a total score of 92.5 which is a pretty good score.

    my $distance = distance($val1, $val2);
    $score -= $distance ? $field_weight{$field} * $distance_modifiers[$distance -1] : 0;


Nicknames

Nicknames are another problem we want to deal with. If we encounter 2 records that are similar but one has a first name of "Michael" and the other is "Mike" that won't score very well because it has a Levenshtein Distance of 4. It gets even worse when you look at examples like "Charles" and "Chuck". Our specific problem is made easier by the fact that almost all our our data is in English. CPAN comes to the rescue with Lingua::EN::Nickname. You can give it 2 names and it will give you a score that lets you know whether one is probably the nickname of the other.

    nickname_eq("Robert", "Bob"); # gives a score of 98


Having played with our data set more, we find that if the score is above 90 we are almost certain that it's a nickname and we treat it as if the names had matched exactly. If the score is between 80 and 90 we use the score as another modifier to the weight.

    my $nick_score = nickname_eq($val1, $val2);
    if( $nick_score < 90 && $nick_score > 80 ) {
        $score -= $field_weight{$field} * (1 - ($nick_score / 100));
    }


Abbreviations

We love to use abbreviations to make things shorter and easier to type. I can't remember the last time I entered my full street name in a web form. But to better find duplicates in our data we need to be able to consider "Drive" and "Dr." to be the same thing. You could use something like Geo::PostalAddress to normalize() the different addresses, but that was a little more processing than I wanted for this project.

So I just used a simple lookup hash based on the data I pulled from the US Postal Service (http://www.usps.com/ncsc/lookups/abbr_suffix.txt). I just took the abbreviations that were most common and our resulting address normalization looks something like this:

    my %ADDR_ABBR = (
        avenue     => 'ave',
        avn        => 'ave',
        boulevard  => 'blvd',
        circle     => 'cir',
        circ       => 'cir',
        court      => 'ct',
        crt        => 'ct',
        drive      => 'dr',
        driv       => 'dr',
        drv        => 'dr',
        ...
    );

    my @parts = split(/\s+/, $address);
    for(my $i=0; $i<=$#parts; $i++) {
        if( exists $ADDR_ABBR{$parts[$i]} ) {
            $parts[$i] = $ADDR_ABBR{$parts[$i]};
        }
    }

    $address = join(' ', @parts);


And now we can compare 2 addresses to get the Levenshtein distance without having to worry about common abbreviations throwing it off.

Multiple Cores and Multiple CPUS

Since this process will be running on a machine with multiple CPUs we want to take advantage of that. We are comparing each record to every other record so it's basically an N-squared algorithm. With millions of records this means the number of record comparisons will be in the trillions and would take way too long if we didn't take advantage of the extra CPUs on this machine. To split the work up I decided to take all of the data for a single US state and work on that as a group and then move on to the next state one by one.

This approach is trivially parallelizable so we should simply be able to fork off multiple processes and have them work on their own portion of the data. Duplicate scores will be recorded in the database so we don't even need any communication between the processes (in reality I used IPC::Shareable to communicate with the main process so that we could show a progress indicator of sorts when used from the command line, but this isn't necessary).

Using fork() under Unix isn't really that hard especially if you follow the canonical example in the Camel Book. But that's no reason it can't be made even simpler. I like Parallel::ForkManager because it makes it brain-dead simple to fork off a bunch of processes up to a some limit (for me this seemed to be the number of CPUs + 1). Then every time a child process finishes it takes care of forking another one to take it's place to do more work.

    my $fork_manager = Parallel::ForkManager->new($num_cpus + 1);
    while(my ($state) = get_next_state()) {
        # fork off and let a child handle this state
        my $pid = $fork_manager->start and next;

        process_state_records($state);
        $fork_manager->finish();
    }
    $fork_manager->wait_all_children;


Conclusions

While there are many ways to tackle this problem (and mine is far from the most optimal) I hope that this helps you to see the power of the CPAN when tackling big problems. It's hard to find a problem that someone else hasn't already faced and with the size and generosity of the Perl community, they've probably put at least some of the solution on to the CPAN.

Leave a comment

Authors

  • Dave Cross
  • Luis Motta Campos
  • Jason Purdy
  • Michael Peters
  • Steve Marvell

Recent Entries

Close