Over several generations of RDBMS technologies, I have learned that common practices, knowledge, and attitudes become obsolete. Query planning and optimization is continually evolving. This evolution, combined with Snowflake’s revolutionary performance capabilities and data model agnosticism, is leading many database practitioners to a new architectural pattern. A key element of this new pattern is what I call Extract, Load, Virtual Transform (ELVT).
A little historical perspective is in order. Running a SQL query involves three fundamental steps:
- Creating an execution plan. This operation is performed by the Query Optimizer (or sometimes Query Compiler).
- Executing the plan, i.e., retrieving the desired data from storage.
- Returning the data to the user.
I spent much of my professional database career working with Oracle’s RDBMS product suite. In the earlier versions of Oracle, execution plan creation was based on a simple set of rules. This was known as the Rule Based Optimizer (RBO). The optimization was quite simple: use an index if one exists, otherwise scan the data. Query tuning was largely a function of shared tribal knowledge and understanding the rules. Those of us who worked with early versions of Oracle all have stories about dramatically improving performance by reversing the join order, and similar tales.
The Cost Based Optimizer (CBO) was introduced with Release 7. It was primitive, and mostly ignored, but it was the beginning of one of the most sophisticated applications of computer science in any technology. The creators of Snowflake, Thierry Cruanes and Benoit Dageville, were significant contributors to the CBO. (Oracle’s CBO has been an endless source of fascination to me.) Let us not forget the incredible improvements in computer hardware in general.
A CBO evaluates many possible execution plans based on table statistics and the operations available in the underlying execution engine. For our purposes, an even more important CBO feature is the rewriting of the original query into better performing code. The underlying theory of relational algebra enables nearly infinite transformations of a SQL query into its logical equivalents. Only an exceedingly small number of these will perform better than the original.
A VIEW is essentially a pre-packaged query and may be used any place a physical table is used. When VIEWs were first introduced, DBAs often discouraged their use; joining VIEWs in a query was particularly frowned upon. Despite the abilities of today’s CBOs, this attitude, and a preference for physical tables over VIEWs, is still quite common.
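To illustrate the point, a VIEW can stand in for a physical table anywhere in a query, joins included. A minimal sketch (all table, view, and column names here are hypothetical):

    -- A VIEW pre-packaging a filter
    CREATE OR REPLACE VIEW vw_active_customers AS
    SELECT customer_id, customer_name, region
    FROM customers
    WHERE status = 'ACTIVE';

    -- The VIEW is used exactly like a physical table, including in joins
    SELECT o.order_id, c.customer_name
    FROM orders o
    JOIN vw_active_customers c ON c.customer_id = o.customer_id;

A modern CBO merges the VIEW text into the outer query before optimization, so the join is planned as if it were written against the base table directly.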
Extract, Transform and Load (ETL) was the earliest method for loading large amounts of data into a database. ETL was performed by a separate application, such as Informatica, usually hosted on its own servers. The necessary data transformations were performed outside the database and then loaded into the database.
As database platform performance improved, the order changed. Data was loaded into the database and the transformation was performed by SQL operations which INSERT the data into tables. Hence, Extract Load and Transform (ELT). This is not particularly new, having become more common with each new generation of hardware.
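A minimal ELT sketch of that pattern (stage, table, and column names are hypothetical): raw data is loaded first, then transformed by SQL inside the database and INSERTed into a target table.

    -- Load raw data into a staging table (the 'L' step)
    COPY INTO raw_sales
    FROM @sales_stage
    FILE_FORMAT = (TYPE = CSV);

    -- Transform with SQL and INSERT into a physical target table (the 'T' step)
    INSERT INTO clean_sales (sale_date, store_id, amount_usd)
    SELECT TO_DATE(sale_ts), store_id, amount_cents / 100
    FROM raw_sales
    WHERE amount_cents IS NOT NULL;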
ELT is a somewhat misleading term. It should be a four-letter acronym, ELTL as the final step is still Loading data into physical tables. Snowflake’s performance has made ELTL very popular.
In many cases, the primary rationale for the ELT/ELTL approach is the practitioner’s prior experience with data marts, operational data stores, and historical database performance.
Snowflake’s revolutionary performance enables a new approach requiring a new name and acronym: Extract, Load, Virtual Transform (ELVT). Data is still Extracted from source systems and Loaded into the database. ETL/ELTL is used to produce clean physical tables. But VIEWs perform the logical Transformation without loading into or creating additional physical tables.
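Under ELVT, the transformation layer is simply a set of VIEWs over the clean physical tables; nothing is re-loaded. A hypothetical sketch (table and column names are assumptions):

    -- The 'Virtual Transform': a VIEW over the clean physical table
    -- (sensor_readings), not another physical table
    CREATE OR REPLACE VIEW vw_daily_readings AS
    SELECT sensor_id,
           DATE_TRUNC('day', reading_ts) AS reading_day,
           AVG(reading_value)            AS avg_value,
           MAX(reading_value)            AS max_value,
           MIN(reading_value)            AS min_value
    FROM sensor_readings
    GROUP BY sensor_id, DATE_TRUNC('day', reading_ts);

Consumers query vw_daily_readings exactly as they would a transformed physical table, but there is no second copy of the data to build, schedule, or keep in sync.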
Let us keep three common engineering adages in mind:
- Murphy’s Law: whatever can go wrong, will go wrong
- KISS: Keep It Simple Stupid
- The probability of Murphy’s Law being observed in any system increases with the number of moving parts
ELT/ELTL pipelines typically have multiple steps, with intermediate storage between steps. Things may fail or break for many reasons.
ELVT has the following advantages, with the assumption that the physical tables contain the required level of cleanliness:
- Simplicity: VIEWs are SQL
- Fewer moving parts
- True single source of data; no transformed physical data that may be incorrect or out of date
- Easily modified and fixed
- Simpler governance and security; SECURE VIEWs may be used
- It is straightforward to replace a VIEW with a physical table
- Business Intelligence tool vendor independence – this will be covered in detail in my next article
- Reduced development time, effort, and cost across the entire data cycle
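The VIEW-to-table swap mentioned above is worth a sketch. If a VIEW later needs to become a physical table for performance reasons, it can be materialized and repointed without changing any downstream queries (names below are hypothetical):

    -- Materialize the VIEW's results into a physical table
    CREATE OR REPLACE TABLE daily_summary_t AS
    SELECT * FROM vw_daily_summary;

    -- Repoint the VIEW at the new table; every downstream query
    -- that referenced vw_daily_summary continues to work unchanged
    CREATE OR REPLACE VIEW vw_daily_summary AS
    SELECT * FROM daily_summary_t;

Because consumers only ever reference the VIEW, the decision of virtual versus physical becomes an implementation detail that can be changed at any time.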
The reader might question the impact of ELVT on performance and computing costs. The major factor in query performance is the amount of data to be processed. If ELT/ELTL produces much smaller tables, that approach should be considered. A well-designed ELVT architecture enables straightforward replacement of VIEWs with physical tables if needed. The use of VIEWs instead of direct physical table access has little impact on CPU usage.
I have used ELVT on several projects. I have participated in many discussions and seen numerous case studies where ELVT is commonly used in place of ELT/ELTL.
The article Pivoting IIOT Data in Snowflake describes a simple ELVT model for creating pivoted VIEWs for BI purposes. The underlying tables are never directly referenced in queries. One VIEW, VW_SAMPLE_QUARTER_HOUR_STATS, aggregates the data into the average, maximum, and minimum values for quarter-hour intervals.
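That quarter-hour aggregation VIEW is not reproduced here, but a sketch of its general shape might look like the following; the base table (sample_data) and its columns (sample_ts, sample_value) are assumptions, while the output columns match those referenced by the decanting VIEW at the end of this article:

    -- Hypothetical sketch of VW_SAMPLE_QUARTER_HOUR_STATS:
    -- bucket raw samples into 15-minute slices and aggregate
    CREATE OR REPLACE VIEW VW_SAMPLE_QUARTER_HOUR_STATS AS
    SELECT DESCRIPTIVE_TAG_NAME,
           TAGPATH,
           TIME_SLICE(sample_ts, 15, 'MINUTE') AS SAMPLE_QRTR_HOUR,
           AVG(sample_value) AS "Quarter Hour Avg",
           MAX(sample_value) AS "Quarter Hour Max",
           MIN(sample_value) AS "Quarter Hour Min"
    FROM sample_data
    GROUP BY DESCRIPTIVE_TAG_NAME, TAGPATH, TIME_SLICE(sample_ts, 15, 'MINUTE');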
Let us examine a more complex VIEW from that article’s data model, shown at the end of this article. This VIEW determines the start and end times for decanting natural gas from trucks into a pipeline. A “totalizer” sensor maintains the running total of the amount of gas decanted over time. The trucks come from different farms, and each farm has a dedicated decanting station, aka “pole”. The time and duration of each decanting is determined by changes in the totalizer value. Totalizers occasionally “burp” an invalid sample, which needs to be ignored.
Using a Small virtual warehouse, SELECT * FROM VW_YJ_DECANT_INJECTIONS, a complex VIEW with 289M rows in the base table, took ~2.7s total time and ~0.6s CPU. The query was executed “first time” to avoid local caching, and a Small warehouse was used to minimize local disk caching. The query plan has 26 steps.
This was a very inexpensive operation to get the results required, and it was faster to develop and deploy. Compare that to the cost of storage plus the time and resources for developing, managing, and testing an ETL or ELT pipeline had we chosen to create a second physical table instead. ELVT is a much more agile and cost-effective approach.
Snowflake’s unique capabilities enable this new architectural pattern. ELVT is not a silver bullet; existing ETL/ELTL pipelines need to be examined to see whether they can be replaced with VIEWs. In my next article, I will discuss a common, disciplined approach to building ELVT architectures.
    CREATE OR REPLACE VIEW VW_YJ_DECANT_INJECTION_TEST COPY GRANTS AS
    SELECT *,
           TRUNC(TIMEDIFF('minutes', START_TIME, END_TIME) / 60, 2) AS DURATION_HOURS
    FROM (
        SELECT DESCRIPTIVE_TAG_NAME,
               CASE
                   WHEN DESCRIPTIVE_TAG_NAME = 'pole_1_totalizer_scf' THEN 'McDonalds 1'
                   WHEN DESCRIPTIVE_TAG_NAME = 'pole_2_totalizer_scf' THEN 'McDonalds 2'
                   WHEN DESCRIPTIVE_TAG_NAME = 'pole_3_totalizer_scf' THEN 'McDonalds 3'
                   WHEN DESCRIPTIVE_TAG_NAME = 'pole_4_totalizer_scf' THEN 'McDonalds 4'
                   ELSE 'Unknown'
               END AS FARM,
               CASE
                   WHEN current_value = next_value
                    AND current_value <> first_prior_value
                   THEN TRUNC(current_value
                              - LAG(TRUNC(current_value, 0), 1) OVER (
                                    PARTITION BY DESCRIPTIVE_TAG_NAME
                                    ORDER BY CURRENT_TIME))
               END AS SCF_INJECTED,
               LAG(T.current_time, 1) OVER (
                   PARTITION BY DESCRIPTIVE_TAG_NAME
                   ORDER BY CURRENT_TIME) AS START_TIME,
               T.CURRENT_TIME AS END_TIME
        FROM (
            SELECT DESCRIPTIVE_TAG_NAME,
                   SAMPLE_QRTR_HOUR AS CURRENT_TIME,
                   TRUNC("Quarter Hour Avg", 0) AS current_value,
                   LEAD(TRUNC("Quarter Hour Avg", 0), 1) OVER (
                       PARTITION BY DESCRIPTIVE_TAG_NAME
                       ORDER BY SAMPLE_QRTR_HOUR) AS next_value,
                   LEAD(SAMPLE_QRTR_HOUR, 1) OVER (
                       PARTITION BY DESCRIPTIVE_TAG_NAME
                       ORDER BY SAMPLE_QRTR_HOUR) AS next_time,
                   LAG(TRUNC("Quarter Hour Avg", 0), 1) OVER (
                       PARTITION BY DESCRIPTIVE_TAG_NAME
                       ORDER BY SAMPLE_QRTR_HOUR) AS first_prior_value,
                   LAG(SAMPLE_QRTR_HOUR, 1) OVER (
                       PARTITION BY DESCRIPTIVE_TAG_NAME
                       ORDER BY SAMPLE_QRTR_HOUR) AS first_prior_time
            FROM VW_SAMPLE_QUARTER_HOUR_STATS
            WHERE tagpath IN (
                SELECT tagpath
                FROM tag_detail
                WHERE DESCRIPTIVE_TAG_NAME IN (
                    'pole_1_totalizer_scf',
                    'pole_2_totalizer_scf',
                    'pole_3_totalizer_scf',
                    'pole_4_totalizer_scf'))
        ) T
        -- Keep only rows where the totalizer value changed and the change
        -- was confirmed by an adjacent sample (filters out "burps")
        WHERE (current_value <> first_prior_value AND CURRENT_VALUE = NEXT_VALUE)
           OR (current_value <> next_value AND CURRENT_VALUE = FIRST_PRIOR_VALUE)
    )
    WHERE scf_injected IS NOT NULL
      AND scf_injected >= 500
    ORDER BY DESCRIPTIVE_TAG_NAME, START_TIME;
Copyright © 2021, Jeffrey Jacobs