Snowflake and ELVT vs [ELT, ETL]

Over several generations of RDBMS technologies, I have learned that common practices, knowledge, and attitudes become obsolete. Query planning and optimization is continually evolving. This evolution, combined with Snowflake’s revolutionary performance capabilities and data model agnosticism, is leading many database practitioners to a new architectural pattern. A key element of this new pattern is what I call Extract, Load, Virtual Transform (ELVT).

A little historical perspective is in order. Running a SQL query involves three fundamental steps:

  1. Creating an execution plan. This operation is performed by the Query Optimizer (or sometimes Query Compiler).
  2. Executing the plan, i.e., retrieving the desired data from storage.
  3. Returning the data to the user.

I spent much of my professional database career working with Oracle’s RDBMS product suite. In the earlier versions of Oracle, execution plan creation was based on a simple set of rules. This was known as the Rule Based Optimizer (RBO). The optimization was quite simple: use an index if one exists, or scan the data if no index is available. Query tuning was largely a function of shared tribal knowledge and understanding the rules. Those of us who worked with early versions of Oracle all have stories about dramatically improving performance by reversing the join order, and similar tales.

The Cost Based Optimizer (CBO) was introduced with Release 7. It was primitive, and mostly ignored, but it was the beginning of one of the most sophisticated applications of computer science in any technology. The creators of Snowflake, Thierry Cruanes and Benoit Dageville, were significant contributors to the CBO. (Oracle’s CBO has been an endless source of fascination to me.) Let us not forget the incredible improvements in computer hardware in general.

A CBO evaluates many possible execution plans based on table statistics and the operations available in the underlying execution engine. For our purposes, an even more important CBO feature is the rewriting of the original query into better performing code. The underlying theory of relational algebra enables nearly infinite transformations of a SQL query into its logical equivalents, of which only an exceedingly small number will perform better than the original.

A VIEW is essentially a pre-packaged query and may be used any place a physical table is used. When VIEWs were introduced, DBAs often discouraged their use; joining VIEWs in a query was particularly frowned upon. Despite the abilities of today’s CBOs, this attitude, and a preference for physical tables over VIEWs, is still quite common.
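
To see why that attitude is largely obsolete, consider how a modern CBO treats a query against a VIEW. The following is only a sketch with hypothetical names, not the output of any specific optimizer; it illustrates view merging and predicate pushdown, the kind of logically equivalent rewrite described above.

CREATE OR REPLACE VIEW VW_ORDERS_ENRICHED AS
SELECT
    o.ORDER_ID,
    o.ORDER_DATE,
    o.CUSTOMER_ID,
    c.CUSTOMER_NAME
FROM ORDERS o
JOIN CUSTOMERS c ON c.CUSTOMER_ID = o.CUSTOMER_ID;

-- What the user writes:
SELECT * FROM VW_ORDERS_ENRICHED WHERE ORDER_DATE = '2021-11-26';

-- A logically equivalent form a CBO may produce, with the view merged and the
-- filter pushed down to the base table access:
SELECT
    o.ORDER_ID,
    o.ORDER_DATE,
    o.CUSTOMER_ID,
    c.CUSTOMER_NAME
FROM ORDERS o
JOIN CUSTOMERS c ON c.CUSTOMER_ID = o.CUSTOMER_ID
WHERE o.ORDER_DATE = '2021-11-26';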

Extract, Transform and Load (ETL) was the earliest method for loading large amounts of data into a database. ETL was performed by a separate application, such as Informatica, usually hosted on its own servers. The necessary transformations were performed outside the database, and the transformed data was then loaded into the database.

As database platform performance improved, the order changed. Data was loaded into the database, and the transformation was performed by SQL operations that INSERT the data into tables. Hence, Extract, Load and Transform (ELT). This is not particularly new, having become more common with each new generation of hardware.

ELT is a somewhat misleading term. It should really be a four-letter acronym, ELTL, as the final step is still Loading data into physical tables. Snowflake’s performance has made ELTL very popular.

In many cases, the primary rationale for using the ELT/ELTL approach is the practitioner’s prior experience with data marts, operational data stores, and database performance.

Snowflake’s revolutionary performance enables a new approach requiring a new name and acronym: Extract, Load, Virtual Transform (ELVT). Data is still Extracted from source systems and Loaded into the database. ETL/ELTL is used to produce clean physical tables. But VIEWs perform the logical Transformation without loading into or creating additional physical tables.
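
A minimal sketch of the idea, using hypothetical table and column names: where ELT/ELTL would INSERT transformed rows into a second physical table, ELVT defines the transformation as a VIEW over the loaded data.

-- Raw data is loaded as-is into RAW_SALES; the "transform" is only a VIEW.
CREATE OR REPLACE VIEW VW_SALES_CLEANSED AS
SELECT
    SALE_ID,
    UPPER(TRIM(REGION))               AS REGION,      -- standardize region codes
    TRY_TO_DECIMAL(AMOUNT_RAW, 12, 2) AS SALE_AMOUNT  -- tolerate malformed values
FROM RAW_SALES
WHERE SALE_ID IS NOT NULL;

-- The ELT/ELTL equivalent would instead materialize the same result, e.g.:
-- INSERT INTO SALES_CLEANSED SELECT ... FROM RAW_SALES;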

Let us keep three common engineering adages in mind:

  1. Murphy’s Law: whatever can go wrong, will go wrong
  2. KISS: Keep It Simple Stupid
  3. The probability of Murphy’s Law being observed in any system increases with the number of moving parts

ETL/ELTL pipelines typically have multiple steps, with intermediate storage between steps. Things may fail or break for many reasons.

ELVT has the following advantages, assuming the physical tables contain data at the required level of cleanliness:

  1. Simplicity: VIEWs are SQL
  2. Fewer moving parts
  3. True single source of data; no transformed physical data that may be incorrect or out of date
  4. Easily modified and fixed
  5. Simpler governance and security; SECURE VIEWs may be used
  6. It is straightforward to replace a VIEW with a physical table
  7. Business Intelligence tool vendor independence – this will be covered in detail in my next article
  8. Reduced development time, effort, and cost across the entire data cycle

The reader might question the impact of ELVT on performance and computing costs. The major factor in query performance is the amount of data to be processed. If ELT/ELTL produces much smaller tables, then those should be considered. A well-designed ELVT architecture enables straightforward replacement of VIEWs with physical tables if needed, as sketched below. The use of VIEWs instead of direct physical table access has little impact on CPU usage.
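
For example, continuing the hypothetical names from the earlier sketch, a VIEW can be materialized without disturbing its consumers: create a physical table from the VIEW’s query, then repoint the VIEW at the table so downstream queries keep referencing the same object.

-- Materialize the VIEW's current result into a physical table
CREATE OR REPLACE TABLE SALES_CLEANSED_T AS
SELECT * FROM VW_SALES_CLEANSED;

-- Repoint the VIEW at the physical table; queries against the VIEW are unchanged
CREATE OR REPLACE VIEW VW_SALES_CLEANSED AS
SELECT * FROM SALES_CLEANSED_T;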

I have used ELVT on several projects. I have participated in many discussions and seen numerous case studies where ELVT is commonly used in place of ELT/ELTL.

The article Pivoting IIOT Data in Snowflake describes a simple ELVT model for creating pivoted VIEWs for BI purposes. The underlying tables are never directly referenced in queries. One VIEW, VW_SAMPLE_QUARTER_HOUR_STATS, aggregates the data into average, maximum, and minimum values for quarter-hour intervals.
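
That article’s DDL is not reproduced here, but a VIEW of that shape might look roughly like the following sketch, which assumes a base table of raw samples named SAMPLE_DATA (an illustrative assumption, not the article’s actual schema):

CREATE OR REPLACE VIEW VW_SAMPLE_QUARTER_HOUR_STATS AS
SELECT
    TAGPATH,
    DESCRIPTIVE_TAG_NAME,
    TIME_SLICE(SAMPLE_TIME, 15, 'MINUTE') AS SAMPLE_QRTR_HOUR,  -- quarter-hour bucket
    AVG(SAMPLE_VALUE)                     AS "Quarter Hour Avg",
    MAX(SAMPLE_VALUE)                     AS "Quarter Hour Max",
    MIN(SAMPLE_VALUE)                     AS "Quarter Hour Min"
FROM SAMPLE_DATA
GROUP BY TAGPATH, DESCRIPTIVE_TAG_NAME, SAMPLE_QRTR_HOUR;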

Let us examine a more complex VIEW from that article’s data model, shown at the end of this article. This VIEW determines the start and end times for decanting natural gas from trucks into a pipeline. A “totalizer” sensor maintains the running total of the amount of gas decanted over time. The trucks come from different farms, and each farm has a dedicated decanting station, aka “pole”. The time and duration of each decanting is determined by changes in the totalizer value. Totalizers occasionally “burp” an invalid sample, which needs to be ignored.

Using a Small virtual warehouse, SELECT * FROM VW_YJ_DECANT_INJECTIONS, a complex VIEW with 289M rows in the base table, took ~2.7s total time and ~0.6s CPU. The query was executed “first time” to avoid local caching, and a Small warehouse was used to minimize local disk caching. The query plan has 26 steps.

This was a very inexpensive way to get the required results, and it was faster to develop and deploy. Compare that to the cost of storage plus the time and resources for developing, managing, and testing an ETL or ELT pipeline had we chosen to create a second physical table instead. ELVT is a much more agile and cost-effective approach.

Snowflake’s unique capabilities enable this new architectural pattern. ELVT is not a silver bullet; ETL/ELTL pipelines need to be examined to see if they can be replaced with VIEWs. In my next article, I will discuss a common, disciplined approach to building ELVT architectures.

CREATE OR REPLACE VIEW VW_YJ_DECANT_INJECTION_TEST COPY GRANTS AS
SELECT
    *,
    -- Duration of each decanting event, in hours, to two decimal places
    TRUNC(TIMEDIFF('minutes', START_TIME, END_TIME) / 60, 2) AS DURATION_HOURS
FROM
    (   SELECT
            DESCRIPTIVE_TAG_NAME,
            -- Map each pole's totalizer tag to its farm
            CASE
                WHEN DESCRIPTIVE_TAG_NAME = 'pole_1_totalizer_scf' THEN 'McDonalds 1'
                WHEN DESCRIPTIVE_TAG_NAME = 'pole_2_totalizer_scf' THEN 'McDonalds 2'
                WHEN DESCRIPTIVE_TAG_NAME = 'pole_3_totalizer_scf' THEN 'McDonalds 3'
                WHEN DESCRIPTIVE_TAG_NAME = 'pole_4_totalizer_scf' THEN 'McDonalds 4'
                ELSE 'Unknown'
            END AS FARM,
            -- Gas injected: change in the totalizer since the prior sample,
            -- computed only for stable readings (ignores totalizer "burps")
            CASE
                WHEN current_value = next_value
                AND  current_value <> first_prior_value
                THEN TRUNC(current_value - LAG(TRUNC(current_value, 0), 1) OVER (
                         PARTITION BY DESCRIPTIVE_TAG_NAME
                         ORDER BY T.CURRENT_TIME))
            END AS SCF_INJECTED,
            LAG(T.CURRENT_TIME, 1) OVER (
                PARTITION BY DESCRIPTIVE_TAG_NAME
                ORDER BY T.CURRENT_TIME) AS START_TIME,
            T.CURRENT_TIME               AS END_TIME
        FROM
            (   SELECT
                    DESCRIPTIVE_TAG_NAME,
                    SAMPLE_QRTR_HOUR             AS CURRENT_TIME,
                    TRUNC("Quarter Hour Avg", 0) AS current_value,
                    -- Neighboring samples used to detect stable totalizer changes
                    LEAD(TRUNC("Quarter Hour Avg", 0), 1) OVER (
                        PARTITION BY DESCRIPTIVE_TAG_NAME
                        ORDER BY SAMPLE_QRTR_HOUR) AS next_value,
                    LEAD(SAMPLE_QRTR_HOUR, 1) OVER (
                        PARTITION BY DESCRIPTIVE_TAG_NAME
                        ORDER BY SAMPLE_QRTR_HOUR) AS next_time,
                    LAG(TRUNC("Quarter Hour Avg", 0), 1) OVER (
                        PARTITION BY DESCRIPTIVE_TAG_NAME
                        ORDER BY SAMPLE_QRTR_HOUR) AS first_prior_value,
                    LAG(SAMPLE_QRTR_HOUR, 1) OVER (
                        PARTITION BY DESCRIPTIVE_TAG_NAME
                        ORDER BY SAMPLE_QRTR_HOUR) AS first_prior_time
                FROM
                    VW_SAMPLE_QUARTER_HOUR_STATS
                WHERE
                    TAGPATH IN
                    (   SELECT TAGPATH
                        FROM   TAG_DETAIL
                        WHERE  DESCRIPTIVE_TAG_NAME IN ( 'pole_1_totalizer_scf',
                                                         'pole_2_totalizer_scf',
                                                         'pole_3_totalizer_scf',
                                                         'pole_4_totalizer_scf' ) ) ) T
        WHERE
            -- Keep only the first and last samples of each stable totalizer value
            (   current_value <> first_prior_value
            AND current_value =  next_value )
        OR  (   current_value <> next_value
            AND current_value =  first_prior_value ) )
WHERE
    SCF_INJECTED IS NOT NULL
AND SCF_INJECTED >= 500
ORDER BY
    DESCRIPTIVE_TAG_NAME,
    START_TIME;

Copyright © 2021, Jeffrey Jacobs

Snowflake Micro-partition vs Legacy Macro-partition Pruning

I have been in the data business through several RDBMS generations and have seen many attempts at comparing performance between competing vendors.

To say those comparisons should be taken with a grain of salt is an understatement. The resulting salt consumption would not be good for anybody’s health.

The Transaction Processing Performance Council (TPC) benchmarks provide the standard. TPC provides datasets and specifications for various benchmarks.

Historically, RDBMS vendors ran (or avoided running) TPC benchmarks themselves and boasted about the results.

This process came with the caveat: “there are lies, damn lies, and (vendor) benchmarks”. There were (and are) just too many variables rendering claims unreliable at best.

I have not seen any TPC benchmarks used to compare current RDBMS vendors. I have seen many, in my opinion, overly simplistic query performance comparisons, but none that I consider credible or reliable. Snowflake provides TPC datasets in the SNOWFLAKE_SAMPLE_DATA database in every account, but I have yet to hear of anybody reporting results from a significant comparison using them. (I would not be surprised to learn that Snowflake used them extensively for their own testing.)

There are numerous customer testimonials stating that moving from their prior vendor to Snowflake has resulted in greatly improved query performance, but there could be many reasons for this.

Putting this in perspective requires understanding the internals of Snowflake’s performance. While there are many factors involved in query performance, this article focuses on one major ingredient, partitioning and partition pruning.

Partitioning divides a table’s storage into pieces. The fewer partitions requiring processing, the better the query performance. Skipping the data in a partition is referred to as pruning. The query engine’s optimizer examines the meta-data of each of a table’s partitions to determine whether it can be pruned.

Partitions in both on-prem and cloud-based legacy RDBMSes (e.g., Oracle, Teradata, Synapse, BigQuery) tend to be very large. We will refer to these as “macro-partitions”. Macro-partitions require specifying a partition-key, which is a very small set of the table’s columns. Each partition contains only the data that satisfies the condition specified for all key columns. The most common condition is a range of values, typically dates, for each partition, although a scalar value may also be used. Some RDBMSes allow sub-partitioning as well, with the sub-partition-key consisting of the parent’s partition-key columns plus additional key columns for the sub-partition(s).

If a partition is not pruned, the data within the partition must be processed. For analytics, this is typically the most expensive operation, a full scan of the data.

Macro-partitioning is both an art and a science, requiring substantial planning, setup and often maintenance.

Snowflake’s approach is completely different. The table is automatically partitioned into micro-partitions, with a maximum size of 16MB compressed data, typically 100-150MB uncompressed. The meta-data for every column in a micro-partition has the minimum and maximum values for that column. Unlike macro-partitioned tables, every column in the table can potentially be used to determine if the micro-partition can be pruned. This includes appropriate fields (or sub-columns) in semi-structured data contained in VARIANT columns, e.g., JSON.
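
For example (hypothetical table and JSON shape), a filter on a sub-column of a VARIANT can participate in pruning just like a filter on a regular column:

SELECT COUNT(*)
FROM EVENTS_RAW
WHERE PAYLOAD:device:site::string = 'pole_1'  -- filter on a VARIANT sub-column
AND   EVENT_DATE = '2002-11-29';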

The statistics are gathered when the micro-partition is created and are kept in the meta-data managed by Snowflake’s cloud services layer. Once created, a micro-partition is immutable.

There are many other aspects to Snowflake’s performance, but micro-partitioning is a key differentiator from all other RDBMSes.

Here are the key differentiators between micro-partitions and legacy partitions:

  1. Macro-partitions are comparatively large, e.g., one week’s or one month’s data. Micro-partitions are smaller, e.g., one day’s data.
  2. Macro-partitions must have the partition-key columns and the range of values for each partition specified. Pruning granularity is based on the left-to-right ordering of the columns in the partition-key definition. If the leading (left-most) column is not used in the WHERE clause, no pruning takes place. If the 2nd partition-key column is not used in the WHERE clause, pruning is based only on the 1st column, resulting in scanning multiple partitions matching the 1st column’s filter, and so on.
  3. In micro-partitions, the partition-key columns are not specified, as every column has maximum and minimum values in the meta-data.

A table in Snowflake is effectively range partitioned on every column. Using more filtering columns in the WHERE clause may dramatically increase the pruning effectiveness.
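
To make the contrast concrete, here is a hedged sketch with illustrative names, using Oracle-flavored syntax for the legacy side. The legacy table can prune only on its declared partition-key column, while the Snowflake table declares no key at all yet retains min/max meta-data for every column in every micro-partition.

-- Legacy macro-partitioning (Oracle-flavored): pruning only on the declared key
CREATE TABLE WEB_SALES_LEGACY (
    D_DATE             DATE,
    WS_SALES_PRICE     NUMBER(7,2),
    WS_EXT_SALES_PRICE NUMBER(7,2)
)
PARTITION BY RANGE (D_DATE) (
    PARTITION P_2002_W47 VALUES LESS THAN (DATE '2002-11-24'),
    PARTITION P_2002_W48 VALUES LESS THAN (DATE '2002-12-01')
);

-- Snowflake: no partition-key declaration; micro-partitioning is automatic and
-- min/max meta-data is kept for every column
CREATE TABLE WEB_SALES_SF (
    D_DATE             DATE,
    WS_SALES_PRICE     NUMBER(7,2),
    WS_EXT_SALES_PRICE NUMBER(7,2)
);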

We will demonstrate using a denormalized TPC-DS web sales table created from "SNOWFLAKE_SAMPLE_DATA"."TPCDS_SF10TCL" with data from 2002. The table, DEMO_PRUNING, was created so that pruning occurs on a single table. Daily loading of the data was simulated using ORDER BY D_DATE, resulting in minor overlap of data in the micro-partitions.
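
The exact DDL for DEMO_PRUNING is not shown in this article; the following is a hedged sketch of how such a table might be built (column list abbreviated), with ORDER BY D_DATE standing in for the simulated daily loads:

CREATE OR REPLACE TABLE DEMO_PRUNING AS
SELECT
    d.D_DATE,
    ws.WS_LIST_PRICE,
    ws.WS_SALES_PRICE,
    ws.WS_EXT_SALES_PRICE
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCDS_SF10TCL"."WEB_SALES" ws
JOIN "SNOWFLAKE_SAMPLE_DATA"."TPCDS_SF10TCL"."DATE_DIM"  d
  ON ws.WS_SOLD_DATE_SK = d.D_DATE_SK
WHERE d.D_YEAR = 2002
ORDER BY d.D_DATE;  -- loading in date order clusters micro-partitions on D_DATE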

We will look at Q4’s data, which includes Black Friday and what is now called Cyber Monday. Although the table contains eight columns, only three are of interest for our purpose:

  • D_DATE
  • WS_EXT_SALES_PRICE
  • WS_SALES_PRICE

The table contains all data for 2002:

  • 1,437,206,906 rows
  • 795 micro-partitions

In Q4 of 2002:

  • One day’s data is ~7.8M rows
  • One week’s data is ~54.6M rows

Snowflake’s UI shows the execution plan and statistics in the History Profile tab, reached by clicking the Query ID link in the History window or the Query ID link in the Results pane of the Worksheet.

The statistics for the week containing Black Friday will serve as a proxy for comparing a one-week macro-partition query with micro-partition queries.

The following query provides the desired statistics:

SELECT
    COUNT(*), MAX(WS_LIST_PRICE), MAX(WS_SALES_PRICE), AVG(WS_EXT_SALES_PRICE)
FROM DEMO_PRUNING
WHERE
    D_DATE BETWEEN '2002-11-24' AND '2002-11-30'

The query scans 31 partitions (346MB), which we will use as the baseline when comparing micro- and macro-partition pruning in the following examples. (Note that in row-storage RDBMSes, the number of bytes scanned in a macro-partition would be significantly greater.)

In a macro-partition RDBMS, any query whose WHERE clause restricts D_DATE to dates within this week requires scanning all of the macro-partition’s data, e.g.:

SELECT
    COUNT(*), MAX(WS_LIST_PRICE), MAX(WS_SALES_PRICE), AVG(WS_EXT_SALES_PRICE)
FROM DEMO_PRUNING
WHERE
    D_DATE = '2002-11-29' -- Black Friday
  • The macro-partitioned RDBMS scans 2 full weeks of data, 62 partitions, 728MB.
  • Snowflake scans 23 micro-partitions, 308MB.

Let us look at the very largest purchases over the Black Friday through Cyber Monday period:

SELECT
    COUNT(*), MAX(WS_LIST_PRICE), MAX(WS_SALES_PRICE), AVG(WS_EXT_SALES_PRICE)
FROM DEMO_PRUNING
WHERE
    D_DATE BETWEEN '2002-11-28' AND '2002-12-03'
    AND WS_EXT_SALES_PRICE > 29000
  • The macro-partitioned RDBMS scans 2 full weeks of data, 62 partitions, 728MB. WS_EXT_SALES_PRICE would not typically be a column in a macro-partition-key specification.
  • Snowflake uses the new filter as an “ad-hoc” partition-key column, scanning 12 partitions, 152MB.

Finally, let us add yet another filter column to look at high priced item purchases:

SELECT
    COUNT(*), MAX(WS_LIST_PRICE), MAX(WS_SALES_PRICE), AVG(WS_EXT_SALES_PRICE)
FROM DEMO_PRUNING
WHERE
    D_DATE BETWEEN '2002-11-28' AND '2002-12-03'
    AND WS_EXT_SALES_PRICE > 29000
    AND WS_SALES_PRICE > 297
  • The macro-partitioned RDBMS scans 2 full weeks of data, 62 partitions, 728MB. WS_SALES_PRICE would also not typically be a column in a macro-partition-key specification.
  • Snowflake uses the new filter to further reduce the scan to 11 partitions, 111MB.

Clustering of the data is a key factor in effective partition pruning. Data that is loaded on a regular basis, e.g., daily, is typically well clustered. Even poorly clustered data often performs surprisingly well.
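
A quick way to check how well a table is clustered on a candidate column (shown here for the example table used above):

SELECT SYSTEM$CLUSTERING_INFORMATION('DEMO_PRUNING', '(D_DATE)');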

Loading the data with COPY INTO … FROM SELECT … ORDER BY is a highly effective technique for some types of loads, with a higher, one-time compute cost.
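
The same idea can be expressed as a plain INSERT from a staging table; STAGING_WEB_SALES below is hypothetical, and this is a sketch of the re-sort-on-load pattern rather than the COPY form itself.

INSERT INTO DEMO_PRUNING
SELECT *
FROM STAGING_WEB_SALES   -- assumed to have been loaded first, e.g., via COPY INTO
ORDER BY D_DATE;         -- re-sorts so micro-partitions cluster well on D_DATE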

Snowflake’s Automatic Clustering feature may be used to maintain clustering on a defined clustering key. It runs as a background service, and UPDATE and DELETE operations may result in additional service charges. Please refer to Snowflake’s clustering documentation for recommendations about this feature.
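
Defining a clustering key is a one-line operation; the background service then maintains the clustering (again using the example table):

ALTER TABLE DEMO_PRUNING CLUSTER BY (D_DATE);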

Snowflake’s unique micro-partitioning strategy is a key factor in its exceptional query performance.

Special thanks to my long time friend Kent Graziano for his feedback on this article.

Copyright © Jeffrey Jacobs, 2021