Complicated Simple Things

When integrating data, evaluating objects from multiple sources aids in determining their equivalence. Each source may identify customers, but determining which customer from each system represents the same customer can prove daunting.  Sometimes matching things is straight-forward; for example, if all sources should have an accurate social security number or taxpayer ID, success involves simply linking the matching numbers. 

Generally, each system has a local customer ID and matching relies on looking at text names and addresses; however, matching ends up more difficult because of the textual nature of these items.  If everyone spelled identically, abbreviated identically, and punctuated everything exactly the same all the time, then problems would occur only when accidental typos appeared in the data content.  Since these editorial variances hinder the accurate matching of text data, tools and techniques exist to help manage these challenges. 

Such processes can help standardize and otherwise clean up names and addresses prior to matching, by applying such possibly simple procedures as translating known typos into proper values based on entries in a manually maintained reference list.  Functions such as "soundex" can do a fuzzy match based on phonetics instead of exact letters.  Additional matching possibilities include looking at the first few letters in a word, or removing vowels, spaces, and punctuation prior to matching. 

As complexity increases, a simple match becomes a significant set of rules and exceptions.  If only partial matches happen, then further data items get interrogated.  A percentage of confidence in two objects being a match replaces the original, it-is-a-match-or-it-is-not-a-match, binary approach.  Chains of rules develop, such as if these two columns fully match we have a high percentage of confidence, but if only the first 10 characters match and these two other fields are exact matches then we have a slightly lower percent of confidence that these are a match, and so on and so forth.  After applying these rules, some cut-off point marks what constitutes a match. 

Designating what seems reasonable for the type of data being processed drives the establishment of an arbitrary numeric limit.  Keep in mind that not all that glitters is gold; while converging on the name "Jim's All Night Diner and Laundromat" convinces one strongly that a match exists, because of many "St. Mary's Hospital", "Joe's Garage" and similar examples across the country that might produce a name match, the instances should not necessarily match to each other.  Dealing with businesses, chains and franchises can lead logic down a twisting set of corridors only made more complex by ever-changing business sales force slants.  An alternate option worth exploring could involve a manual review of possible matches.  If a staff manually evaluates matches within a certain percentage of confidence to determine when things match or not, their expected productivity determines if such an investment offers true value.

Upon making a determination that two objects from two different systems represent the same functional object, this knowledge should persist.  Future processing of data from these sources can leverage previous matches.  But do support this persisted information by properly flagging the changes or events that then compel such previously matched data back into a reevaluation, such as a changed address or whatever other situation may reasonably bring a match into question. 

Even after all the possible matching rules have been determined, it still may prove premature to expect smooth sailing for the data integration ship.  Problems of granularity distinctions between sources may arise.  What one system considers a single customer, may appear as two or more customers in another source.  Exploration of the meaning behind such many-to-one, or even many-to-many interrelationships, allows for further adventures beyond managing the simple matching needs.