Patient Data Matching and Demographic Data

The United States has moved rapidly to increase the adoption of electronic medical records (EMRs) over the past decade. These new systems promised improved safety and lower costs through information sharing. One key challenge in achieving this goal is better patient data matching[1]. Patient matching is the process of comparing different identity records to determine whether they refer to the same patient.

For example, in the figure below, each individual data attribute is compared to determine whether the two records are the same. For instance, algorithms can compare whether John and Jon are the same name, then look at Smith vs. Smtih, and then compare the individual dates. If the elements are exactly the same, this is a relatively simple task. But when data quality errors are introduced, the task becomes far more cumbersome.
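The attribute-by-attribute comparison described above can be sketched in a few lines of Python. This is only an illustration, not the method used by any particular matching system: the field names (`first_name`, `last_name`, `dob`), the helper names, and the 0.75 similarity threshold are all assumptions made for the example, and the string similarity comes from the standard library's `difflib.SequenceMatcher` rather than a purpose-built name-matching algorithm.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0 and 1, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def compare_records(rec_a: dict, rec_b: dict, threshold: float = 0.75) -> dict:
    """Compare two records field by field and report which fields agree.

    Names are compared approximately so small typos still count as agreement;
    the date of birth must match exactly. The threshold is an illustrative
    assumption, not a recommended value.
    """
    result = {}
    for field in ("first_name", "last_name"):
        result[field] = similarity(rec_a[field], rec_b[field]) >= threshold
    result["dob"] = rec_a["dob"] == rec_b["dob"]
    return result


# Hypothetical records mirroring the John/Jon and Smith/Smtih example.
record_a = {"first_name": "John", "last_name": "Smith", "dob": "1980-01-15"}
record_b = {"first_name": "Jon", "last_name": "Smtih", "dob": "1980-01-15"}

print(compare_records(record_a, record_b))
```

A real system would typically weight each field and combine the results into an overall match decision; here the per-field flags are returned as-is to keep the idea visible.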

[Figure: records from each data source compared attribute by attribute, with the resulting match status]

Data quality errors include varying data standards, the absence or presence of different data elements, misspelled names, transposed names (Smith John versus John Smith), phonetic variation, name changes such as those following marriage, and ‘fat finger’ errors. The old adage “garbage in, garbage out” holds true for patient matching. If the data is of poor quality, there is very little even the best algorithm can do. However, one of the simplest data quality metrics to consider is whether a given data element is collected at all.
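As a rough illustration of that last metric, the share of records in which each data element is actually populated can be computed as follows. The record layout, field names, and sample values are hypothetical and are not taken from any study; the sketch simply treats a missing key, `None`, or a blank string as "not collected".

```python
def completeness(records: list, fields: list) -> dict:
    """Return, for each field, the fraction of records with a non-empty value."""
    total = len(records)
    rates = {}
    for field in fields:
        # A value counts as present only if it is non-empty after trimming whitespace.
        present = sum(1 for r in records if str(r.get(field) or "").strip())
        rates[field] = present / total if total else 0.0
    return rates


# Hypothetical sample of four patient records.
sample = [
    {"last_name": "Smith", "ssn": "123-45-6789", "email": "js@example.com"},
    {"last_name": "Jones", "ssn": "", "email": "aj@example.com"},
    {"last_name": "Lee", "ssn": None, "email": "  "},
    {"last_name": "Patel"},
]

for field, rate in completeness(sample, ["last_name", "ssn", "email"]).items():
    print(f"{field}: {rate:.0%} populated")
```

Running the same calculation per year or per site would give the kind of availability-over-time view that matters when deciding which attributes a matching algorithm can rely on.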

To date, there has been little substantive work that looks at the data available for matching on a national scale. A recent study, “The Building Blocks of Interoperability. A Multisite Analysis of Patient Demographic Attributes Available for Matching” (available through an Open Source license), focuses on the issue of patient demographic availability. The work, coordinated at Northwestern University by Dr. Abel Kho’s group along with over two dozen other investigators, helps fill this knowledge gap and led to a few interesting findings.

The paper looks at how the different data elements change across time and across regions. The work found that key elements such as first and last names, address, date of birth, and gender are highly available across time and regions. Additionally, the work found that social security numbers are being collected less frequently, which may have implications for organizations' ability to link records. Furthermore, we found that email address collection is increasing significantly.

This ongoing work helps set the stage for future exploration in the areas of patient data quality and patient matching. The manuscript serves as a good starting point for better understanding the challenges of consistently matching health records across the country.