As I look at this problem I notice a couple key facts to base some improvements on:
Facts and observations
- Max iterations of 1000.
- 15 for Levenshtein distance sounds really high to me.
- You know, by observing the data empirically, what your fuzzy matching should look like (there are many cases for fuzzy matching and each depends on why the data is bad).
- By building this API-like you could plug in many algorithms, including your own and others like Soundex, instead of depending on just one.
Requirements
I've interpreted your problem as requiring the following two things:
- You have objects that you want to look up via a key that is based on the name. It sounds like you want to do this because you need a pre-existing
PersonDO
of which one exists per unique name, and the same name may come up more than once in your loop/workflow.PersonDO
- You need "fuzzy matching" because the incoming data is not pure. For the purposes of this algorithm we'll assume that if a name "matches", it should always use the same
PersonDO
(in other words, a person's unique identifier is their name, which is obviously not the case in real life, but seems to work for you here).
Implementation
Next, let's look at making some improvements on your code:
1. Cleanup: unnecessary hashcode manipulation.
You don't need to generate hash codes yourself. This confuses the issue a bit.
You're simply generating a hashcode for the combination of the firstname + lastname. This is exactly what would do if you gave it the concatenated string as a key. So, just do that (and add a space, just in case we want to reverse parse out first/last from the key later).HashMap
Map<String, PersonDO> personCache = Maps.newHashMap();
public String getPersonKey(String first, String last) {
return first + " " + last;
}
...
// Initialization code
for(PersonDO p: dao.getPeople()) {
personCache.put(getPersonKey(p.getFirstName(), p.getLastName()), p);
}
2. Cleanup: Build a retrieval function to perform the lookup.
Since we've changed the key in the map we need to change the lookup function. We'll build this like a mini-API. If we always knew the key exactly (i.e. unique IDs) we would of course just use . So we'll start with that, but since we know we will need to add fuzzy matching we'll add a wrapper where this can happen:Map.get
public PersonDO findPersonDO(String searchFirst, String searchLast) {
return personCache.get(getPersonKey(searchFirst, searchLast));
}
3. Build a fuzzy matching algorithm yourself using scoring.
Note that since you are using Guava, I've used a few conveniences here (, , , etc.).Ordering
ImmutableList
Doubles
First, we want to preserve the work we do to figure out how close a match is. Do this with a POJO:
class Match {
private PersonDO candidate;
private double score; // 0 - definitely not, 1.0 - perfect match
// Add candidate/score constructor here
// Add getters for candidate/score here
public static final Ordering<Match> SCORE_ORDER =
new Ordering<Match>() {
@Override
public int compare(Match left, Match right) {
return Doubles.compare(left.score, right.score);
}
};
}
Next, we create a method for scoring a generic name. We should score first and last names separately, because it reduces noise. For instance, we don't care if the firstname matches any part of the last name — unless your firstname might accidentally be in the lastname field or vice versa, which you should account for intentionally and not accidentally (we'll address this later).
Note that we no longer need a "max levenshtein distance". This is because we normalize them to length, and we will pick the closest match later. 15 character adds/edits/deletes seems very high, and since we've minimized the blank first/last name problem by scoring names separately, we could probably now pick a max of 3-4 if you wanted (scoring anything else as a 0).
// Typos on first letter are much more rare. Max score 0.3
public static final double MAX_SCORE_FOR_NO_FIRST_LETTER_MATCH = 0.3;
public double scoreName(String searchName, String candidateName) {
if (searchName.equals(candidateName)) return 1.0
int editDistance = StringUtils.getLevenshteinDistance(
searchName, candidateName);
// Normalize for length:
double score =
(candidateName.length() - editDistance) / candidateName.length();
// Artificially reduce the score if the first letters don't match
if (searchName.charAt(0) != candidateName.charAt(0)) {
score = Math.min(score, MAX_SCORE_FOR_NO_FIRST_LETTER_MATCH);
}
// Try Soundex or other matching here. Remember that you don't want
// to go above 1.0, so you may want to create a second score and
// return the higher.
return Math.max(0.0, Math.min(score, 1.0));
}
As noted above, you could plug-in third party or other word-matching algorithms and gain from the shared knowledge of all of them.
Now, we go through the whole list and score every name. Note that I've added a spot for "tweaks". Tweaks may include:
-
Reversal: If the PersonDO is "Benjamin Franklin", but the CSV sheet may contain "Franklin, Benjamin", then you will want to correct for reversed names. In this case you will probably want to add a method that would score the name in reverse and take that score if it is significantly higher. If it matched in reverse exactly, you would give it a 1.0 score.
checkForReversal
-
Abbreviations: You may want to give the score a bonus bump if either first/last names matches identically and the other is fully contained in the candidate (or vice versa). This might indicate an abbreviation, like "Samantha/Sam".
-
Common nicknames: You could add a set of known nicknames ("Robert -> Bob, Rob, Bobby, Robby") and then score the search name against all of them and take the highest score. If it matches any of these, you would probably give it a 1.0 score.
As you can see building this as a series of API's gives us logical locations to easily tweak this to our heart's content.
On with the alogrithm:
public static final double MIN_SCORE = 0.3;
public List<Match> findMatches(String searchFirst, String searchLast) {
List<Match> results = new ArrayList<Match>();
// Keep in mind that this doesn't scale well.
// With only 1000 names that's not even a concern a little bit, but
// thinking ahead, here are two ideas if you need to:
// - Keep a map of firstnames. Each entry should be a map of last names.
// Then, only iterate through last names if the firstname score is high
// enough.
// - Score each unique first or last name only once and cache the score.
for(PersonDO person: personCache.values()) {
// Some of my own ideas follow, you can tweak based on your
// knowledge of the data)
// No reason to deal with the combined name, that just makes things
// more fuzzy (like your problem of too-high scores when one name
// is completely missing).
// So, score each name individually.
double scoreFirst = scoreName(searchFirst, person.getFirstName());
double scoreLast = scoreName(searchLast, person.getLastName());
double score = (scoreFirst + scoreLast)/2.0;
// Add tweaks or alternate scores here. If you do alternates, in most
// cases you'll probably want to take the highest, but you may want to
// average them if it makes more sense.
if (score > MIN_SCORE) {
results.add(new Match(person, score));
}
}
return ImmutableList.copyOf(results);
}
Now we modify your findClosestMatch to get just the highest one from all matches (throws if none in list).NoSuchElementException
Possible tweaks:
- You may want to check if multiple names scored very close, and either report the runners-up (see below), or skip the row for manual choice later.
- You may want to report how many other matches there were (if you have a very tight scoring algorithm).
Code:
public Match findClosestMatch(String searchFirst, String searchLast) {
List<Match> matches = findMatch(searchFirst, searchLast);
// Tweak here
return Match.SCORE_ORDER.max(list);
}
.. and then modify our original getter:
public PersonDO findPersonDO(String searchFirst, String searchLast) {
PersonDO person = personCache.get(getPersonKey(searchFirst, searchLast));
if (person == null) {
Match match = findClosestMatch(searchFirst, searchLast);
// Do something here, based on score.
person = match.getCandidate();
}
return person;
}
4. Report "fuzziness" differently.
Finally, you'll notice that doesn't just return a person, it returns a — This is so that we can modify the program to treat fuzzy matches differently from exact matches.findClosestMatch
Match
Some things you probably want to do with this:
-
Report guesses: Save all names that matched based on fuzziness into a list so that you can report those and they can be audited later.
-
Validate first: You may want to add a control to turn on and off whether it actually uses the fuzzy matches or just reports them so that you can massage the data before it comes in.
-
Data defenesiveness: You may want to qualify any edits made on a fuzzy match as "uncertain". For example, you could disallow any "major edits" to a Person record if the match was fuzzy.
Conclusion
As you can see it's not too much code to do this yourself. It's doubtful there is ever going to be a library that will predict names as well as you can knowing the data yourself.
Building this in pieces as I've done in the example above will allow you to iterate and tweak easily and even plug in third-party libraries to improve your scoring instead of depending on them entirely -- faults and all.