Often when you work with large amounts of data in a database, you come across unwanted duplicates. These can cause issues from both a runtime and a data-quality perspective. The Talend Data Quality tool comes in very handy when dealing with duplicates: simply run an analysis on a column set and it can detect duplicate rows and even generate jobs to remove them. But what if two or more rows are clearly the same, just misspelled or incorrectly formatted?
Unlike column set analysis, advanced matching can also detect these similar yet different rows. In the example below, it is obvious that “Eric Smith” and “Eric Smit” are the same person, just misspelled. Another example is when the parts of an address appear in a different order. In such cases, we need advanced matching to detect these similarities.
Talend Data Quality (Profiling) provides two different algorithms for comparing rows: T-Swoosh and Simple VSR Matcher. They use different methods to compare rows and will produce different numbers of duplicates. To learn more about these algorithms, click T-Swoosh and Simple VSR Matcher.
A blocking key is not mandatory, but it can speed up the analysis when used correctly. A column can be defined as a blocking key to divide the data into blocks. During analysis, rows are compared for duplication only within their own block; rows from different blocks are never compared.
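To make the effect of blocking concrete, here is a small illustrative sketch (the records, field names, and the choice of `zip` as the blocking key are all hypothetical, not taken from Talend): only rows that share a blocking-key value become candidate pairs, so the number of comparisons shrinks from all-pairs to a few per block.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample data; "zip" serves as the blocking key.
records = [
    {"id": 1, "name": "Eric Smith", "zip": "11201"},
    {"id": 2, "name": "Eric Smit",  "zip": "11201"},
    {"id": 3, "name": "Anna Jones", "zip": "90210"},
    {"id": 4, "name": "Ana Jones",  "zip": "90210"},
]

def candidate_pairs(records, block_key):
    """Group rows by the blocking key, then pair rows only within each block."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[block_key]].append(rec)
    for group in blocks.values():
        yield from combinations(group, 2)

pairs = list(candidate_pairs(records, "zip"))
# 2 candidate pairs instead of the 6 an all-pairs comparison would need
```

The trade-off is that a misspelled blocking-key value puts a row in the wrong block, so its duplicates are missed; that is why the blocking key should be a reliably clean column.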
Matching keys are the columns you want to compare. These are typically string values that can easily be misspelled or incorrectly formatted, such as names or addresses. Each matching key can have its own matching function, ranging from “exact” to the “Jaro-Winkler” algorithm. To learn about these matching functions, click here.
Thresholds are used to separate duplicates from unique rows. By default, the match threshold is set to 85% and the confident match threshold to 90%. If the similarity score is below 85%, the rows are unique; if it is between 85% and 90%, the rows are suspects and need expert knowledge to decide whether they are duplicates; and if it is above 90%, the rows are duplicates. You can experiment with these thresholds to achieve the desired result.
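The three-way split described above can be sketched as a small classification function (the function name and the decision to treat a score exactly at a threshold as belonging to the higher band are illustrative assumptions):

```python
def classify(score, match_threshold=0.85, confident_threshold=0.90):
    """Map a similarity score to Talend's three outcome bands.

    Below the match threshold -> unique; between the two thresholds
    -> suspect (needs manual review); at or above the confident match
    threshold -> duplicate.
    """
    if score < match_threshold:
        return "unique"
    if score < confident_threshold:
        return "suspect"
    return "duplicate"

print(classify(0.80))  # unique
print(classify(0.87))  # suspect
print(classify(0.95))  # duplicate
```

Raising the confident match threshold pushes more pairs into the suspect band for manual review; lowering the match threshold pulls more borderline pairs out of the unique band.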
Now that we have set up our advanced matching configuration, we can finally run the analysis. Simply click the run button and the analysis tab will show the result. In the example below, we can see that 5.68% of the rows are duplicates and 1.76% are suspects. We can then either correct these rows manually or create a job to deal with them.