This document explains the procedure on how you join two datasets using proximity matching.
The algorithms we support are:
- Fuzzy Matcher: You can use this algorithm to match values in one or more columns from dataset 1 with one or more columns from dataset 2. You can perform proximity matching based on the matching threshold. The threshold values range from 0 to 1 where 1 is a perfect match.
- Record Linkage: You can use this algorithm to match values in one or more columns from dataset 1 with one or more columns from dataset 2. This algorithm requires identifying a consistent column (with you know data should be matching 100%) along with fuzzy join columns. Relatively, this algorithm works faster than “Fuzzy Matcher”.
Fuzzy join using "Fuzzy Matcher" algorithm:
- From query studio, create a new query and drag and drop RDBMS tables’ widgets as shown here:
- You need to select database type, connection name, and pick a table as dataset 1 (assume you are using a “driver” dataset table here) and click select as shown below:
- Repeat the same process to select dataset 2.
- From the transformations tab, drag the “Fuzzy Lookup” widget and connect the two tables as shown here:
- Click on the “Fuzzy datasets” and now you see the two algorithms as shown below:
- Let's select the “Fuzzy Matcher” algorithm and selected 3 matching values from the left 1st column with the other 3 matching values in the 2nd column and clicked on “select” as shown below:
From the above assume you selected the three similar values in two columns as below:
- Mailing_Street with MailingStreet
- Mailing_City with MailingCity
- Mailing_Zip_Postal_Code with MailingZip
Now, you need to add an RD output widget to the Fuzzy Matcher from the “Output” tab to extract the results as shown below:
- Optionally you can download the fuzzy join output using “offload the File” widget as shown below:
- Now click execute button to run the query chain as follows:
- From the above execution, it is clear that you have 93 matching results with a search from “130353” total values. Now click on the RD output to go through the match percentage results as shown below:
Note: As you see from the above, no threshold value has been selected while doing a “Fuzzy Match” algorithm, so you will get almost all matching results from 0-1. (low to high matching) You can see the matching value results above and you remember the selection while doing the Fuzzy Match in the initial stage.
Fuzzy join using "Record Linkage" transalgorithmormation:
- Follow the steps from the above section to select datasets and connect using the "Fuzzy Join" transformation. Choose "Record Linkage" algorithm as shown below:
Assume here you are selecting one consistent value and other values as below:
- Mailing_Street with MailingStreet
- Mailing_City with MailingCity
- Mailing_Zip_Postal_Code (consistent value) with MailingZip
- As you selected only few matching’s (including one consistent value) you got 37 values as matching results. In the monitor mode, click on the RD output:
- From the above, you can see the best match score for this linkage algorithm.
Comments
0 comments
Please sign in to leave a comment.