Snowflake and ELVT vs [ELT|ETL] – Case Study Part 1, The “No Data Model” Data Architecture

My previous articles on Snowflake and ELVT, Snowflake and ELVT vs [ELT, ETL], Part 1 and Snowflake and ELVT vs [ELT, ETL], Part 2, The ELVT Reference Architecture, have been primarily theoretical, only alluding to implementations using this technique.

This article is a case study of a real-world implementation. The client and purpose of the application are not relevant to the architecture.

The client presented a unique challenge: create an architecture before the application and its data model are even defined or implemented!

The only known “requirements” are:

  1. The primary source of data will be a custom Salesforce application, using both standard and custom Salesforce Objects.
  2. Every change to every Salesforce Object record must be captured.
  3. The architecture must support the full data life cycle, from development through production and maintenance.
  4. There will be ongoing changes to the Salesforce Object fields over time.

At this point, the reader might be saying to themselves, as I did initially, “No big deal, use a data integration product such as Fivetran”; such a product handles both data and schema changes.

Except there is a fifth requirement: the solution must be “FedRAMP Moderate” certified! None of the currently available data integration offerings meet this requirement. As a result, Kafka and JSON form the data pipeline from Salesforce to Snowflake. The pipeline is repurposed from a prior implementation of Salesforce-to-PostgreSQL replication.

How to cope with all the above?

A Snowflake-based Salesforce “Object” DDL generator! Each Salesforce Object results in three VIEWs, two TABLEs, one STREAM and one TASK in Snowflake for capturing the Salesforce data for reporting and analytics.

The Object Generator consists of meta-data TABLEs and UDFs that generate the DDL for the physical data TABLEs, STREAMs and VIEWs corresponding to Salesforce Objects.

Meta-data Tables

  1. A meta-data table, SFORCE_META_DATA, consisting of meta-data extracted from Salesforce, with the following columns:
    1. SFORCE_OBJECT – the name used for creating the various Snowflake objects, e.g., ACCOUNT for a standard object, or CUSTOM for a custom object CUSTOM__c.
    2. API_NAME – the Salesforce field API Name – used to extract the appropriate JSON field for the VIEW column name.
    3. FIELD_LABEL – the Salesforce field label – this becomes the VIEW column name.
    4. DATA_TYPE – the Salesforce data type.
    5. SECURITY_CLASSIFICATION – used for data masking.
  2. A meta-data table, SFORCE_JSON_CASTING, for casting the Salesforce data types to Snowflake data types, with the following columns:
    1. SFORCE_DATA_TYPE – the Salesforce data type.
    2. CAST_TO – the desired Snowflake data type.
    3. IGNORE_FLAG – ignore columns with this data type.
  3. A Calendar dimension table, DIM_CALENDAR_DATE, with a DATE column. (A sketch of possible DDL for these tables follows.)
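
For concreteness, here is a minimal sketch of what these meta-data tables might look like. The column names follow the list above; the data types, the calendar date range and the use of a row generator are assumptions, not the project’s actual DDL.

-- Hypothetical sketch of the meta-data tables; the real DDL may differ.
CREATE TABLE IF NOT EXISTS SFORCE_META_DATA (
    SFORCE_OBJECT           VARCHAR,  -- e.g., ACCOUNT, or CUSTOM for CUSTOM__c
    API_NAME                VARCHAR,  -- Salesforce field API Name; selects the JSON field
    FIELD_LABEL             VARCHAR,  -- Salesforce field label; becomes the VIEW column name
    DATA_TYPE               VARCHAR,  -- Salesforce data type
    SECURITY_CLASSIFICATION VARCHAR   -- drives data masking
);

CREATE TABLE IF NOT EXISTS SFORCE_JSON_CASTING (
    SFORCE_DATA_TYPE VARCHAR,  -- Salesforce data type
    CAST_TO          VARCHAR,  -- target Snowflake data type
    IGNORE_FLAG      BOOLEAN   -- TRUE: skip columns of this data type
);

-- Calendar dimension: one row per DATE (the start date and range are assumptions).
CREATE TABLE IF NOT EXISTS DIM_CALENDAR_DATE AS
SELECT DATEADD(DAY, SEQ4(), '2020-01-01'::DATE) AS "DATE"
FROM TABLE(GENERATOR(ROWCOUNT => 3653));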

Physical Model

Every Salesforce Object, e.g., ACCOUNT and CONTACT, will have the following corresponding physical Snowflake objects:

  1. Staging TABLE for the Kafka payload, e.g., ACCOUNT_KAFKA_STG, with the two standard VARIANT/JSON columns from the Kafka Connector:
    1. RECORD_METADATA
    2. RECORD_CONTENT
  2. Append only STREAM, e.g., ACCOUNT_KAFKA_STREAM on the staging table, e.g., ACCOUNT_KAFKA_STG
  3. History TABLE, e.g., ACCOUNT, holding all data from Salesforce with the following columns:
    1. LAST_MODIFIED_DTIME – the last timestamp the source Salesforce record was modified
    2. CREATED_DTIME – the time the record was created in Salesforce
    3. RECORD_CONTENT – the VARIANT column from the staging table STREAM
  4. TASK, e.g., INSERT_INTO_ACCOUNT_TASK – performs the ELT from the STREAM, removing unwanted JSON objects and filling the DTIME columns.

UDFs are created to generate the DDL for each of the physical objects. Each of the UDFs takes the desired object name, e.g., ‘ACCOUNT’ and returns the DDL for creating the appropriate object.

The “standardized” UDFs:

  • UDF_GEN_SFORCE_OBJECT_TABLE(‘ACCOUNT’)  produces:
CREATE TABLE IF NOT EXISTS
    SFORCE_PHYSICAL_SCHEMA.ACCOUNT
    (LAST_MODIFIED_DTIME TIMESTAMP_TZ(9),
     CREATED_DTIME TIMESTAMP_TZ(9),
     RECORD_CONTENT VARIANT)
    DATA_RETENTION_TIME_IN_DAYS = 90;
  • UDF_GEN_SFORCE_KAFKA_STG_TABLE(‘ACCOUNT’) produces:
CREATE TABLE IF NOT EXISTS
    SFORCE_PHYSICAL_SCHEMA.ACCOUNT_KAFKA_STG
    (RECORD_METADATA VARIANT,
     RECORD_CONTENT VARIANT)
    DATA_RETENTION_TIME_IN_DAYS = 30;
  • UDF_GEN_SFORCE_KAFKA_STREAM(‘ACCOUNT’) produces:
CREATE STREAM IF NOT EXISTS
    SFORCE_PHYSICAL_SCHEMA.ACCOUNT_KAFKA_STREAM
    ON TABLE SFORCE_PHYSICAL_SCHEMA.ACCOUNT_KAFKA_STG
    APPEND_ONLY = TRUE
    SHOW_INITIAL_ROWS = FALSE;
  • UDF_GEN_INSERT_INTO_TASK (‘ACCOUNT’) produces:
CREATE OR REPLACE TASK SFORCE_PHYSICAL_SCHEMA.INSERT_INTO_ACCOUNT_TASK
    SCHEDULE = '1440 MINUTE'
    ALLOW_OVERLAPPING_EXECUTION = FALSE
    WAREHOUSE = SFORCE_TASK_USAGE_WH
    USER_TASK_TIMEOUT_MS = 300000
WHEN SYSTEM$STREAM_HAS_DATA('SFORCE_PHYSICAL_SCHEMA.ACCOUNT_KAFKA_STREAM')
AS
INSERT INTO SFORCE_PHYSICAL_SCHEMA.ACCOUNT
    (LAST_MODIFIED_DTIME, CREATED_DTIME, RECORD_CONTENT)
SELECT
    CONVERT_TIMEZONE('America/Los_Angeles',
                     RECORD_CONTENT:LastModifiedDate::TIMESTAMP) AS LAST_MODIFIED_DTIME,
    CONVERT_TIMEZONE('America/Los_Angeles',
                     RECORD_CONTENT:CreatedDate::TIMESTAMP)      AS CREATED_DTIME,
    RECORD_CONTENT
FROM
    SFORCE_PHYSICAL_SCHEMA.ACCOUNT_KAFKA_STREAM
ORDER BY
    LAST_MODIFIED_DTIME;

There are corresponding overloaded UDFs with additional parameters that govern the DDL options, such as the target schema and the choice of CREATE OR REPLACE vs. IF NOT EXISTS. The “standardized” UDFs call these functions with standard settings.
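
The UDF bodies are not shown in this article. As a rough sketch under that caveat, a standardized generator can be a plain SQL UDF that templates the DDL string; only the function name and the output format come from the examples above, the body itself is an assumption.

-- Hypothetical sketch of a standardized generator; the real version delegates to an
-- overloaded UDF that also accepts schema and CREATE-option parameters.
CREATE OR REPLACE FUNCTION UDF_GEN_SFORCE_OBJECT_TABLE(OBJECT_NAME VARCHAR)
RETURNS VARCHAR
AS
$$
    'CREATE TABLE IF NOT EXISTS SFORCE_PHYSICAL_SCHEMA.' || OBJECT_NAME ||
    ' (LAST_MODIFIED_DTIME TIMESTAMP_TZ(9),' ||
    ' CREATED_DTIME TIMESTAMP_TZ(9),' ||
    ' RECORD_CONTENT VARIANT)' ||
    ' DATA_RETENTION_TIME_IN_DAYS = 90;'
$$;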

This means that all generated objects have the same structure; only the names differ. The deployment schema is SFORCE_PHYSICAL_SCHEMA. Changes to a Salesforce Object never require ALTERing any of the physical objects.

This approach facilitates the development and deployment process. Eventually, an automated pipeline will create the physical objects when new Salesforce Objects are added. Until then, we can substantially automate the process by loading the SFORCE_META_DATA table from a CSV file.

This allows us to create simple SQL to produce scripts, e.g.

SELECT UDF_GEN_SFORCE_OBJECT_TABLE('ACCOUNT')
UNION ALL
SELECT UDF_GEN_SFORCE_KAFKA_STG_TABLE('ACCOUNT')
UNION ALL
SELECT UDF_GEN_SFORCE_KAFKA_STREAM('ACCOUNT');

We combine the UDF calls into a single UDF that creates the DDL to deploy all the physical objects:

UDF_DEPLOY_SFORCE_PHYSICAL(object_name VARCHAR)

We can then generate a script for deploying either the entire set of Salesforce Objects or a subset with a simple SQL statement, e.g.:

SELECT 
    UDF_DEPLOY_SFORCE_PHYSICAL(SFORCE_OBJECT)
FROM
    (   SELECT 
            DISTINCT SFORCE_OBJECT
        FROM 
            SFORCE_META_DATA);

Presentation VIEW Model

One of the “best practices” with JSON is to use VIEWs for presentation. JSON fields are shredded into relational columns only as needed, typically for performance reasons. A common use case is converting a JSON UTC column to a desired time zone. In this application, the LAST_MODIFIED_DTIME and CREATED_DTIME are particularly important and need to be in Pacific Time.
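
For example, a presentation VIEW can shred a JSON field and convert it from UTC in a single expression. The snippet below is illustrative only: VW_EXAMPLE is a made-up name, and the field path mirrors the task shown earlier.

-- Illustrative only: shred a UTC JSON timestamp and shift it to Pacific Time in a VIEW.
CREATE OR REPLACE VIEW VW_EXAMPLE AS
SELECT
    CONVERT_TIMEZONE('America/Los_Angeles',
                     RECORD_CONTENT:CreatedDate::TIMESTAMP) AS CREATED_DTIME,
    RECORD_CONTENT
FROM SFORCE_PHYSICAL_SCHEMA.ACCOUNT_KAFKA_STG;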

As the number and structure of the Salesforce Objects are both unknown and certain to change frequently, a simple Snowflake VIEW generator is the key component of this architecture.

The application is required to keep a history of every change to every record in every Salesforce Object. This presents challenges common to temporal databases. In this application, most metrics are counts and date ranges rather than numeric values, but the same techniques apply.

As the rate of change differs between Salesforce Objects, there is no reliable join between objects for historical data. A “normalization” technique is needed to enable consistent joins between Snowflake objects.

The solution is a three-layer stack of VIEWs:

  1. The “historical” VIEW. This VIEW presents the data for every row in the TABLE. These VIEWs are referred to and named with “_HIST” as the suffix, e.g., VW_ACCOUNT_HIST.
  2. The “as of date” VIEW. This VIEW presents a daily snapshot of the values of each record at the end of every calendar DATE, e.g.,

    WHERE MAX(LAST_MODIFIED_DTIME) <=  DIM_CALENDAR_DATE.DATE

     The suffix is “_AS_OF_DATE”, e.g., VW_ACCOUNT_AS_OF_DATE. This allows meaningful JOINs between objects based on the appropriate IDs and the AS_OF_DATE column, e.g.,
JOIN ON  VW_CONTACT_AS_OF_DATE.ACCOUNT_ID = VW_ACCOUNT_AS_OF_DATE.ACCOUNT_ID
AND
VW_CONTACT_AS_OF_DATE.AS_OF_DATE = VW_ACCOUNT_AS_OF_DATE.AS_OF_DATE

PERFORMANCE NOTE: The above is the "logical" join condition. Because ACCOUNT_ID is an 18-character string, Snowflake (and most databases) perform very poorly when joining on long character strings. The actual joins always use HASH(AS_OF_DATE, <name>_ID) as the join keys (see the sketch after this list).

  3. The “current” record status, with the suffix “_CURR”. This is simply:

SELECT… FROM… VW_object_AS_OF_DATE WHERE AS_OF_DATE = CURRENT_DATE()

Joins only require appropriate ID columns, as seen below.

The “as of date” and “current” VIEWs are the primary VIEWs for the anticipated analytics and KPIs.
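
To illustrate the hashed-join technique from the performance note above, a join between two “as of date” VIEWs might look like the following sketch. Whether the HASH is precomputed in the VIEWs or computed at join time is an implementation detail; the view and column names follow the examples above.

-- Illustrative only: hash the (date, ID) pair so the join key is a number,
-- not an 18-character Salesforce ID string.
SELECT
    ACCT."Account Name",
    CNTC."First Name"
FROM VW_ACCOUNT_AS_OF_DATE ACCT
JOIN VW_CONTACT_AS_OF_DATE CNTC
    ON HASH(CNTC.AS_OF_DATE, CNTC.ACCOUNT_ID) = HASH(ACCT.AS_OF_DATE, ACCT.ACCOUNT_ID)
WHERE ACCT.AS_OF_DATE = '2021-10-29';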

Generating VIEWs

Let’s discuss the key UDFs used to generate the VIEWs.

UDF_GEN_VW_HIST(‘object_name’) generates the first level of VIEWs DDL, the “history” _HIST VIEW. This is the “presentation”/SOPR VIEW as described in Snowflake and ELVT vs [ELT,ETL], Part 2, The ELVT Reference Architecture.

UDF_GEN_VW_HIST uses the SFORCE_META_DATA and SFORCE_JSON_CASTING tables to generate the CREATE VIEW DDL. It maps each JSON API field to a VIEW column based on the field label, casting each field according to the SFORCE_JSON_CASTING mapping.

UDF_GEN_VW_HIST is composed of calls to several functions. The key child function is UDF_GEN_TARGET_COLUMNS. This function creates the columns in the VIEW for both standard fields and custom fields. It uses the FIELD_LABEL as the VIEW’s column name in mixed case, except for the ID field and fields with data types of REFERENCE and LOOKUP; for these, the resulting column name is upper snake_case. The ID field is named <object_name>_ID. A partial example for VW_CONTACT_HIST:

RECORD_CONTENT:Target_Payload__c.AccountId::STRING AS "ACCOUNT_ID",
RECORD_CONTENT:Target_Payload__c.Annual_Income__c::STRING AS "Annual Income",
RECORD_CONTENT:Target_Payload__c.Birthdate::DATE AS "Birthdate",
RECORD_CONTENT:Target_Payload__c.Id::STRING AS "CONTACT_ID",
RECORD_CONTENT:Target_Payload__c.CreatedById::STRING AS "CREATED_BY_ID",
RECORD_CONTENT:Target_Payload__c.FirstName::STRING AS "First Name"
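
UDF_GEN_TARGET_COLUMNS itself is not reproduced in this article. A rough sketch of the kind of meta-data-driven query it might wrap is below; the Target_Payload__c path comes from the example above, while the query itself is an assumption and omits the ID/REFERENCE/LOOKUP naming exceptions.

-- Hypothetical sketch: build the shredded column list for VW_CONTACT_HIST from meta-data.
SELECT LISTAGG(
           'RECORD_CONTENT:Target_Payload__c.' || M.API_NAME ||
           '::' || C.CAST_TO || ' AS "' || M.FIELD_LABEL || '"',
           ', ')
           WITHIN GROUP (ORDER BY M.FIELD_LABEL) AS TARGET_COLUMNS
FROM SFORCE_META_DATA M
JOIN SFORCE_JSON_CASTING C
    ON M.DATA_TYPE = C.SFORCE_DATA_TYPE
WHERE M.SFORCE_OBJECT = 'CONTACT'
  AND NOT C.IGNORE_FLAG;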

UDF_GEN_VW_AS_OF_DATE generates the second level VIEW DDL, joining the _HIST VIEW with DIM_CALENDAR_DATE as noted above.
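
The generated DDL is not reproduced here; a hedged sketch of the shape an _AS_OF_DATE VIEW might take, using the calendar join and the “latest record per day” logic described above, is:

-- Illustrative only: one row per ACCOUNT_ID per calendar DATE, carrying the most
-- recent _HIST record whose modification date is on or before that DATE.
CREATE OR REPLACE VIEW VW_ACCOUNT_AS_OF_DATE AS
SELECT
    H.*,
    D."DATE" AS AS_OF_DATE
FROM VW_ACCOUNT_HIST H
JOIN DIM_CALENDAR_DATE D
    ON TO_DATE(H.LAST_MODIFIED_DTIME) <= D."DATE"
WHERE D."DATE" <= CURRENT_DATE()
QUALIFY ROW_NUMBER() OVER (
            PARTITION BY H.ACCOUNT_ID, D."DATE"
            ORDER BY H.LAST_MODIFIED_DTIME DESC) = 1;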

UDF_GEN_VW_CURR generates the third level VIEW’s DDL. It is simply

SELECT * FROM VW_object_AS_OF_DATE WHERE AS_OF_DATE = CURRENT_DATE

All the generated VIEWs use CREATE OR REPLACE syntax. This allows simple updates to the VIEWs based on changes to the SFORCE_META_DATA table. NOTE: due to Snowflake’s compiling of VIEWs, all three VIEWs need to be recreated in the following order when the meta-data for the Salesforce Object changes:

  1. _HIST
  2. _AS_OF_DATE
  3. _CURR

as both _AS_OF_DATE and _CURR use SELECT * in their DDL, which is resolved at CREATE VIEW execution.

UDF_DEPLOY_SFORCE_VIEWS(‘object_name’) generates a DDL script for creating the three VIEWs.

The same simple SQL technique as noted above for the physical objects can be used to generate a script to create or refresh multiple objects.
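
For example:

SELECT
    UDF_DEPLOY_SFORCE_VIEWS(SFORCE_OBJECT)
FROM
    (   SELECT
            DISTINCT SFORCE_OBJECT
        FROM
            SFORCE_META_DATA);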

Query Examples

Let’s examine the results for a Salesforce Account Object selected from the different VIEW types.

SELECT 
    "Account Name",
    ACCOUNT_ID,
    last_modified_dtime,
    TO_DATE(LAST_MODIFIED_DTIME)
FROM
    VW_ACCOUNT_HIST
WHERE
    ACCOUNT_ID = '0013R000003pNu8QAE'
ORDER BY
    to_Date(last_modified_dtime);

returns:

Account Name                | ACCOUNT_ID         | LAST_MODIFIED_DTIME | TO_DATE(LAST_MODIFIED_DTIME)
QA_verification             | 0013R000003pNu8QAE | 2021-10-22 16:13:54 | 2021-10-22
QA_verification test        | 0013R000003pNu8QAE | 2021-10-22 16:04:58 | 2021-10-22
QA_verification test record | 0013R000003pNu8QAE | 2021-10-22 16:13:09 | 2021-10-22
QA_verification             | 0013R000003pNu8QAE | 2021-10-29 08:37:34 | 2021-10-29
QA_troubleshootings         | 0013R000003pNu8QAE | 2021-10-29 11:07:39 | 2021-10-29
QA_ver                      | 0013R000003pNu8QAE | 2021-10-29 08:40:23 | 2021-10-29
QA_troubleshooting          | 0013R000003pNu8QAE | 2021-10-29 11:06:49 | 2021-10-29

Note that the account had several updates on two different dates, 10/22 and 10/29.

Let’s see the data for the daily snapshot from VW_ACCOUNT_AS_OF_DATE:

Account Name        | ACCOUNT_ID         | LAST_MODIFIED_DTIME | AS_OF_DATE
QA_verification     | 0013R000003pNu8QAE | 2021-10-22 16:13:54 | 2021-10-22
QA_verification     | 0013R000003pNu8QAE | 2021-10-22 16:13:54 | 2021-10-25
QA_verification     | 0013R000003pNu8QAE | 2021-10-22 16:13:54 | 2021-10-26
QA_verification     | 0013R000003pNu8QAE | 2021-10-22 16:13:54 | 2021-10-28
QA_verification     | 0013R000003pNu8QAE | 2021-10-22 16:13:54 | 2021-10-24
QA_verification     | 0013R000003pNu8QAE | 2021-10-22 16:13:54 | 2021-10-27
QA_verification     | 0013R000003pNu8QAE | 2021-10-22 16:13:54 | 2021-10-23
QA_troubleshootings | 0013R000003pNu8QAE | 2021-10-29 11:07:39 | 2021-10-29
QA_troubleshootings | 0013R000003pNu8QAE | 2021-10-29 11:07:39 | 2021-10-30
QA_troubleshootings | 0013R000003pNu8QAE | 2021-10-29 11:07:39 | 2021-10-31

Note that the results remain the same from the last update on 10/22 through 10/28 and change on 10/29.

Let’s look at the current status, VW_ACCOUNT_CURR:

Account Name        | ACCOUNT_ID         | LAST_MODIFIED_DTIME | AS_OF_DATE
QA_troubleshootings | 0013R000003pNu8QAE | 2021-10-29 11:07:39 | 2021-11-01

Note that the status has not changed since 10/29.

Finally, let’s see the current values for this Account and its CONTACTs:

SELECT 
    "Account Name", 
    "First Name", 
    "Birthdate", 
    "Driver's License State"
FROM 
    VW_ACCOUNT_CURR ACCT
JOIN 
    VW_CONTACT_CURR CNTC
ON 
    ACCT.ACCOUNT_ID = CNTC.ACCOUNT_ID
WHERE 
    ACCT.ACCOUNT_ID = '0013R000003pNu8QAE';

produces:

Account Name        | First Name | Birthdate  | Driver's License State
QA_troubleshootings | thomas     | 2010-01-01 | California

Summary

The architecture described in this article has numerous advantages:

  1. Every Salesforce Object in the application has (or will have) a corresponding set of physical objects
    1. All the physical objects have an identical structure
    2. The physical objects are all generated from meta-data
    3. The default standard for all generated physical objects is idempotent, using the appropriate IF NOT EXISTS or OR REPLACE in the CREATE statements
  2. The generated DDL is idempotent
    1. Only the _HIST VIEWs' DDL needs to be regenerated for new or changed Salesforce Objects
    2. As noted for VIEWs, all three DDL definitions still need to be executed; other VIEWs may also need to be recreated
  3. ETL/ELT is minimized and standardized
    1. The Kafka connector is used for ELT from Salesforce
    2. TASKs are used to copy the data from the Kafka staging TABLE to the main TABLE
    3. The ETL/ELT processes are monitored via queries against the SNOWFLAKE.ACCOUNT_USAGE schema
  4. Using JSON/VARIANT as the primary persistence store allows the Salesforce Objects to be changed before exposing them in the VIEWs; the new data will be present when finally exposed in the VIEWs
  5. There is no need for traditional schema maintenance, as the architecture is truly “schema on read”.

Epilogue

The architecture and examples in this article were developed and tested months before the first Salesforce Objects were implemented. While the application is still in its initial stages, 22 standard and custom Salesforce Objects have been implemented, producing 161 Snowflake objects. No changes have been needed to the architecture.

Copyright ©2021, Jeffrey Jacobs & Associates, LLC
