This article explains the steps needed to compare random sample data between two heterogeneous data sources. This process is helpful when you want to perform a quick, random-sample data integrity check on a very large dataset by comparing data field by field for a random set of records.
To illustrate the process, we will use SQL Server as the source and Snowflake as the target. We will use the LINEITEM table from the TPCH_100 schema, which has about 600 million records, with a random sample size of 600 records.
Step 1: Snapshot the random sample dataset from the source
We will use the source SQL Server instance to capture the keys of the random sample dataset. The following steps capture a random sample from the SQL Server LINEITEM table.
- Create a new query chain with the name “Random 600 records from LINEITEM Table” as follows
- Drag and drop the “Custom SQL” widget from the design palette and enter the following SQL (an optional sampling variant is sketched at the end of this step):
SQL:
SELECT TOP 600 *
FROM "TPCH_100".dbo."LINEITEM" TABLESAMPLE (1)
- Drag and drop the “RD Output” widget and enter the following configuration
- Execute the query chain.
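Note: TABLESAMPLE samples whole data pages rather than individual rows, so the TOP 600 rows tend to come from a handful of contiguous pages. With about 600 million records, a 1 percent sample still covers roughly 6 million rows; if you want the 600 picks spread more evenly across that sample, a variant such as the following sketch shuffles the sampled rows first, at the cost of the extra sort:
SQL:
-- Shuffle the ~1 percent page sample so the 600 rows are spread across pages
SELECT TOP 600 *
FROM "TPCH_100".dbo."LINEITEM" TABLESAMPLE (1 PERCENT)
ORDER BY NEWID()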
Step 2: Create variable source query
- Connect to the “RD Extracts” profile (the RightData (DataTrust) DataStore DB) and create a query that uses the view created in Step 1 as the source, concatenating all the key columns as follows:
SQL:
SELECT CONCAT(L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER) AS CONCAT_KEY
FROM "rightdata_db".dbo."MSSQL_TPCH_RANDOM_600"
Step 3: Create a variable
- Create variable “VAR_MSSQL_TPCH_RANDOM_600_KEYS” using the query “QRY_MSSQL_TPCH_RANDOM_600_KEYS” as source as follows:
- Save the variable.
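Optionally, before wiring the variable into a scenario, a quick sanity check against the extract (a sketch, run under the same “RD Extracts” profile) confirms that the snapshot produced exactly 600 unique keys:
SQL:
-- Both counts should be 600; a lower DISTINCT_KEYS value indicates key collisions
SELECT COUNT(*) AS ROW_COUNT,
       COUNT(DISTINCT CONCAT(L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER)) AS DISTINCT_KEYS
FROM "rightdata_db".dbo."MSSQL_TPCH_RANDOM_600"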
Step 4: Create a “Functional Data Reconciliation” (FDR) scenario
- Go to Scenario Studio and create a new scenario with the name “Compare random 600 records between MSSQL and Snowflake”
- Drag and drop the “Custom SQL” widget from the design palette, double-click it, select the MSSQL TPCH profile, and enter the SQL text shown below:
SQL:
SELECT *
FROM "TPCH_10".dbo."LINEITEM"
WHERE CONCAT(L_ORDERKEY,L_PARTKEY,L_SUPPKEY,L_LINENUMBER) IN
( '~~USR_VAR_MSSQL_TPCH_RANDOM_600_KEYS##' )
Note: Use the variable created in Step 3.
- Click “Infer Metadata” and “Save”
- Repeat the previous two steps (the “Custom SQL” widget and “Infer Metadata”) for the Snowflake profile as follows (a sketch of the Snowflake-side SQL appears at the end of this step):
- Drag and drop the “Map and Recon” widget and configure as follows
- Save the config and run the scenario.
- To run the scenario for a different random sample, execute the query chain “Random 600 records from LINEITEM Table” again before running the scenario.
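For reference, the Snowflake-side query would look roughly like the sketch below. The database and schema names ("TPCH_100"."PUBLIC") are placeholders for wherever the LINEITEM copy lives in your Snowflake account, and the same variable placeholder is reused so both sides are filtered to the identical set of keys:
SQL:
-- Keys are cast to VARCHAR so the concatenated value matches the SQL Server side
SELECT *
FROM "TPCH_100"."PUBLIC"."LINEITEM"
WHERE CONCAT(L_ORDERKEY::VARCHAR, L_PARTKEY::VARCHAR, L_SUPPKEY::VARCHAR, L_LINENUMBER::VARCHAR) IN
( '~~USR_VAR_MSSQL_TPCH_RANDOM_600_KEYS##' )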
Step 5: Create a job
Automate the end-to-end process by creating a job:
- Go to “Batch Console” and create a new job with the name “JOB_Compare random 600 records between MSSQL and Snowflake” as follows:
- From the design palette “Schedule” tab, drag and drop the “Execute Now” widget to the canvas (you can use other scheduling options as per your requirement).
- Now drag and drop the “Query Chains” widget and select the query chain “Random 600 records from LINEITEM Table”
- Drag and drop “FDR Scenarios” to the canvas and select the FDR “Compare random 600 records between MSSQL and Snowflake” as follows:
- Execute the job to capture the random sample keys and run the comparison between MSSQL and Snowflake.