This article explains the steps needed to compare random sample data between two heterogeneous data sources. This process is helpful when you want to perform a quick, random-sample data integrity check on a very large dataset by comparing data field by field for a random set of records.
To illustrate the process, we will use SQL Server as the source and Snowflake as the target. We will use the LINEITEM table from the TPCH_100 schema, which has about 600 million records, with a random sample size of 600 records.
Step 1: Snapshot the random sample dataset from the source
We will use the source SQL Server instance to capture the keys of the random sample dataset. The following steps capture a random sample from the SQL Server LINEITEM table.
- Create a new query chain with the name “Random 600 records from LINEITEM Table” as follows
- Drag and drop the “Custom SQL” widget from the design palette and enter the following SQL (an optional sampling variant is sketched at the end of this step):
SQL:
SELECT TOP 600 *
FROM "TPCH_100".dbo."LINEITEM" TABLESAMPLE (1)
- Drag and drop the “RD Output” widget and enter the following configuration
- Execute the query chain.
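Note: TABLESAMPLE samples whole data pages rather than individual rows, so the TOP 600 rows tend to come from a handful of contiguous pages. With about 600 million records, a 1 percent sample still covers roughly 6 million rows; if you want the 600 picks spread more evenly across that sample, a variant such as the following sketch shuffles the sampled rows first, at the cost of the extra sort:
SQL:
-- Shuffle the ~1 percent page sample so the 600 rows are spread across pages
SELECT TOP 600 *
FROM "TPCH_100".dbo."LINEITEM" TABLESAMPLE (1 PERCENT)
ORDER BY NEWID()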
Step 2: Create variable source query
- Connect to the “RD Extracts” profile (the RightData (DataTrust) DataStore DB) and create a query that uses the view created in Step 1 as the source, concatenating all the key columns as follows:
SQL:
SELECT CONCAT(L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER) AS CONCAT_KEY
FROM "rightdata_db".dbo."MSSQL_TPCH_RANDOM_600"
Step 3: Create a variable
- Create variable “VAR_MSSQL_TPCH_RANDOM_600_KEYS” using the query “QRY_MSSQL_TPCH_RANDOM_600_KEYS” as source as follows:
- Save the variable.
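Optionally, before wiring the variable into a scenario, a quick sanity check against the extract (a sketch, run under the same “RD Extracts” profile) confirms that the snapshot produced exactly 600 unique keys:
SQL:
-- Both counts should be 600; a lower DISTINCT_KEYS value indicates key collisions
SELECT COUNT(*) AS ROW_COUNT,
       COUNT(DISTINCT CONCAT(L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER)) AS DISTINCT_KEYS
FROM "rightdata_db".dbo."MSSQL_TPCH_RANDOM_600"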
Step 4: Create a “Functional Data Reconciliation” (FDR) scenario
- Go to Scenario Studio and create a new scenario with the name “Compare random 600 records between MSSQL and Snowflake”
- Drag and drop the “Custom SQL” widget from the design palette, double-click it, select the MSSQL TPCH profile, and enter the SQL text shown below:
SQL:
SELECT *
FROM "TPCH_10".dbo."LINEITEM"
WHERE CONCAT(L_ORDERKEY,L_PARTKEY,L_SUPPKEY,L_LINENUMBER) IN
( '~~USR_VAR_MSSQL_TPCH_RANDOM_600_KEYS##' )
Note: Use the variable created in Step 3.
- Click “Infer Metadata” and “Save”
- Repeat the previous two steps (the “Custom SQL” widget and “Infer Metadata”) for the Snowflake profile as follows (a sketch of the Snowflake-side SQL appears at the end of this step):
- Drag and drop the “Map and Recon” widget and configure as follows
- Save the config and run the scenario.
- To run the scenario for a different random sample, execute the query chain “Random 600 records from LINEITEM Table” again before running the scenario.
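For reference, the Snowflake-side query would look roughly like the sketch below. The database and schema names ("TPCH_100"."PUBLIC") are placeholders for wherever the LINEITEM copy lives in your Snowflake account, and the same variable placeholder is reused so both sides are filtered to the identical set of keys:
SQL:
-- Keys are cast to VARCHAR so the concatenated value matches the SQL Server side
SELECT *
FROM "TPCH_100"."PUBLIC"."LINEITEM"
WHERE CONCAT(L_ORDERKEY::VARCHAR, L_PARTKEY::VARCHAR, L_SUPPKEY::VARCHAR, L_LINENUMBER::VARCHAR) IN
( '~~USR_VAR_MSSQL_TPCH_RANDOM_600_KEYS##' )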
Step 5: Create a job
Automate the end-to-end process by creating a job:
- Go to “Batch Console” and create a new job with the name “JOB_Compare random 600 records between MSSQL and Snowflake” as follows:
- From the design palette “Schedule” tab, drag and drop the “Execute Now” widget to the canvas (you can use other scheduling options as per your requirement).
- Now drag and drop the “Query Chains” widget and select the query chain “Random 600 records from LINEITEM Table”
- Drag and drop “FDR Scenarios” to the canvas and select the FDR “Compare random 600 records between MSSQL and Snowflake” as follows:
- Execute the job to capture the random sample keys and run the comparison between MSSQL and Snowflake.