Agile Cost Management in Snowflake – Part 2, Storage

In Agile Cost Management in Snowflake – Part 1, Compute, we discussed managing compute costs. Managing storage costs is much simpler, but it is still very important, as poorly managed storage will result in unexpected expenses.

As with Compute, Snowflake’s revolutionary architecture requires a different approach and mindset to managing storage. In legacy DW RDBMS, storage is a limited resource. The need to accurately estimate current and future requirements is a key driver in determining cost. Adding storage is expensive and often requires downtime. Reducing physical storage is unheard of. Instead, DBAs typically focus on reclaiming and reusing storage as data becomes obsolete. This often involves complex and burdensome Information Life Cycle processes. Finally, managing and ensuring the reliability of backups, restores and recovery is a major headache.

Snowflake brings unlimited storage to the cloud, enabling an Agile “lean and fit” storage management style, effectively eliminating the DBA burdens discussed above. Instead, the focus is on appropriate, simple settings to control storage costs.

Pricing

Storage charges in Snowflake are simply “pass through” of the cloud vendor’s pricing.

Snowflake charges monthly for data in databases and data in Snowflake file “stages”. Charges are calculated from the average number of bytes stored per day, measured daily.

Managing Cost in Stages

Cost management in stages is pretty simple: remove files from stages after successful load into the target databases, unless there is a compelling reason to keep them. This can be performed as part of the COPY INTO command or via the REMOVE command after loading.
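For example, assuming a hypothetical stage MY_STAGE and target table MY_TABLE, either of the following standard commands keeps the stage from accumulating billable files:

  -- Option 1: purge the staged files automatically after a successful load
  COPY INTO my_table FROM @my_stage PURGE = TRUE;

  -- Option 2: load first, then remove the files explicitly
  COPY INTO my_table FROM @my_stage;
  REMOVE @my_stage PATTERN = '.*\.csv\.gz';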

Time Travel

As we’ll see, appropriate use of Snowflake’s data retention capability, Time Travel, is the key to controlling storage costs. Time Travel features enable:

  • Restoring accidentally or deliberately dropped databases, schemas and tables with simple SQL
  • Cloning tables, schemas and databases from key historical points with simple SQL
  • Backing up data for historical and archiving purposes via cloning from key points in the past with simple SQL
  • Querying data as it existed at previous points in time with simple SQL
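As a quick illustration of the first and last items above (object names are hypothetical):

  -- restore an accidentally dropped table
  UNDROP TABLE orders;

  -- query the table as it existed one hour ago
  SELECT COUNT(*) FROM orders AT (OFFSET => -3600);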

Snowflake’s Database Storage

To manage database storage costs, it’s important to first understand the relevant parts of the storage architecture.

Snowflake uses unlimited “blob” storage for database and file stage storage, which cloud vendors tout as having “eleven nines” (99.999999999%) of durability. All data is stored redundantly across multiple availability zones (AZs).

Snowflake’s “secret sauce” is the use of “micro-partitions”. A micro-partition is a file object in blob storage.

A micro-partition has the following characteristics:

  • Micro-partitions are immutable. Once written, their data is never modified.
  • Micro-partitions are always compressed and encrypted. The maximum (and common) size is 16MB, typically representing 100-150MB of uncompressed data.
  • In addition to statistics on the data, a micro-partition may have a “date modified” timestamp.
  • “Active” micro-partitions have not been modified, i.e. they have no “date modified” timestamp.
  • “Historical” micro-partitions have been modified, i.e. a “date modified” timestamp is present.
  • UPDATE, DELETE, and a MERGE that performs an UPDATE result in the date modified timestamp being set and a new micro-partition being written. INSERT (and COPY) always create new micro-partitions.
  • Truncating a table effectively sets the date modified timestamp to the time of the operation for all of the micro-partitions in the table.
  • Dropping a table sets the date modified timestamp to the time of the operation and makes the table definition unavailable.


Data Retention Period

When a date modified timestamp is present, the data retention setting determines the length of time the historical data will remain available. This in turn determines the charge for the data.

Data retention may be set at the account, database, schema or table level via the DATA_RETENTION_TIME_IN_DAYS parameter. A value set at a higher level, e.g. database, may be overridden by setting the value for an object at a lower level. Time Travel may only be set for permanent and transient tables. The default is always 1 day. Transient tables may only be set to 1 or 0 days.
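For example, retention could be set broadly at the database level and overridden for an individual table (object names are hypothetical):

  ALTER DATABASE analytics_db SET DATA_RETENTION_TIME_IN_DAYS = 30;
  ALTER TABLE analytics_db.staging.stg_orders SET DATA_RETENTION_TIME_IN_DAYS = 0;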

These features enable Snowflake’s “Time Travel” capabilities. In addition, there is a “Fail Safe” backup of 7 days, available in all editions. Recovering data from Fail Safe requires Snowflake support intervention. Time Travel and Fail Safe are collectively referred to as Continuous Data Protection (CDP).

Historical micro-partitions are moved into Fail Safe when their data retention setting is exceeded.

“Standard Time Travel” is only available for 1 day. This applies to both Standard and Premier Editions.

“Extended Time Travel” may range from 1 to 90 days retention for the Enterprise, Business Critical and Virtual Private Snowflake editions.  Managing Extended Time Travel is the key to managing storage costs.

Note that Time Travel can be disabled by setting the retention period to 0. This is generally a bad idea.

Time Travel Cost Example

This is an extreme example to demonstrate Time Travel’s effect on storage costs and why data retention management is critical.

  1. Create a permanent (not TRANSIENT or TEMPORARY) table with 90 day data retention
  2. Populate on Day 1 with 10GB
  3. Drop table on Day 1.
  4. Total cost = (90+7) * 10GB * price/GB/day

This will enable you to UNDROP the table with simple SQL for 90 days from the time of dropping it (recovery during the final 7 Fail Safe days requires Snowflake support). Either way, you will be charged for 97 days of storage, even though you might never UNDROP and/or use it!  (We’ll show you how to fix this later.)

Similarly, historical micro-partitions modified due to UPDATE, DELETE and MERGE are moved into Fail Safe when their data retention setting is exceeded.

Changing Data Retention Period

The data retention period may be changed for an object via the SQL command:
 ALTER <object_type> <object_name> SET DATA_RETENTION_TIME_IN_DAYS = <days>;

If the retention period is increased, all data that has not been moved into Fail Safe will be affected by the new value. E.g., if the value is increased from 10 days to 20 days, historical data that would have been moved to Fail Safe after 10 days lives for an additional 10 days.

If the retention period is decreased, data that has exceeded the new retention period will be moved to Fail Safe. E.g., decreasing the value from 10 to 1 results in 9 days of historical data being moved into Fail Safe.

Fixing the 90 day “Mistake”

Let’s assume that this table and its population was a mistake. Fortunately, we can reduce the cost to a minimum of Fail Safe’s 7 days, depending on when the mistake is discovered.

  1. UNDROP the table
  2. Modify the data retention period to 0 days
  3. Drop the table

Once the retention period is reduced and the table is dropped again, the data moves into Fail Safe, so you are still subject to a minimum charge of 7 days plus any time the mistake went undiscovered.
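In SQL, the fix for a hypothetical table named BIG_MISTAKE is just three statements:

  UNDROP TABLE big_mistake;                                     -- bring the table back
  ALTER TABLE big_mistake SET DATA_RETENTION_TIME_IN_DAYS = 0;  -- no more Time Travel retention
  DROP TABLE big_mistake;                                       -- now only the 7 day Fail Safe charge remains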

Managing Cost and Time Travel

The key decision to be made is: how long a data retention period do you need? Key factors are:

  1. Is point in time query required for modified historical data? Remember, 90 days is the maximum age for historical data (for Enterprise Edition and above).
  2. Is point in time query required for overall data consistency, e.g. multiple ETL/ELT processes must be completed before querying a set of tables?
  3. How far back might you need to go for fast restore of dropped objects? Remember, data is redundantly stored in 3 availability zones with eleven 9s of durability, so backup for media failures is not needed.
  4. How volatile is your data, how frequently are UPDATEs, DELETEs and MERGEs that modify data performed?
  5. Is there a need to create historical clones, for testing, debugging, etc.? Note that micro-partitions referenced by clone(s) live until all retention time in referencing clone(s) is exceeded. See Cloning Tables, Schemas, and Databases.

Higher volatility = increased storage usage by creating new micro-partitions. Carefully consider the tradeoffs; does high volatility mean increased need for point in time queries and possibly recovery from mistakes? Or the opposite?

Staging tables should have minimal data retention. They should be either TEMPORARY or TRANSIENT tables. Temporary tables only exist during a session. Transient tables may have data retention of 1 or 0. Data is never moved into Fail Safe for either type of table.
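A minimal sketch of such a staging table, with hypothetical name and columns:

  CREATE TRANSIENT TABLE stg_orders (
    order_id   NUMBER,
    raw_record VARIANT
  )
  DATA_RETENTION_TIME_IN_DAYS = 0;  -- no Time Travel; transient tables never enter Fail Safe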

Cluster Keys

Be careful when using Cluster Keys, as background reclustering rewrites micro-partitions, which may result in frequent aging of micro-partitions and unexpected storage costs. Consider a fairly low data retention setting when using cluster keys.

Point in Time Cloning

Databases, schemas and tables may be cloned as of a point in time using the AT clause. This is particularly useful for regression testing, debugging and other development and QA.
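For example, cloning an entire database as of a point in time is a single statement (names and timestamp are hypothetical); being a zero-copy clone, it consumes no additional storage until the source or clone diverges:

  CREATE DATABASE qa_db CLONE prod_db
    AT (TIMESTAMP => '2019-06-30 23:59:00'::TIMESTAMP_LTZ);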

Table Storage Metrics View

You will find all of the storage metrics for a table in the TABLE_STORAGE_METRICS view, available in each database’s INFORMATION_SCHEMA (and in the ACCOUNT_USAGE schema of the shared SNOWFLAKE database). In particular:

  • ACTIVE_BYTES – number of bytes in active micro-partitions
  • TIME_TRAVEL_BYTES – number of bytes in historical micro-partitions
  • FAILSAFE_BYTES – number of bytes in Fail Safe
  • RETAINED_FOR_CLONE_BYTES – number of bytes in micro-partitions referenced by clones
  • TABLE_DROPPED – date and time if table was dropped
  • TABLE_ENTERED_FAILSAFE – date and time the table entered Fail Safe. Requires Snowflake support to restore

The retention period in days for a table is found in the RETENTION_TIME column in the TABLES view. Note that the BYTES column in this view does not reflect true storage cost; it represents the uncompressed byte count for a scan of active micro-partitions.
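For example, a query along the following lines (against a hypothetical database MYDB) surfaces the tables carrying the most Time Travel and Fail Safe storage:

  SELECT table_catalog, table_schema, table_name,
         active_bytes, time_travel_bytes, failsafe_bytes, retained_for_clone_bytes
  FROM   mydb.information_schema.table_storage_metrics
  ORDER  BY time_travel_bytes + failsafe_bytes DESC;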

Summary

Snowflake provides unparalleled capabilities to control your cloud data warehouse spend in an agile manner. It further provides unrivalled Continuous Data Protection against failures of all types. Investing in a new approach to storage management offers an opportunity to significantly improve your ROI on Snowflake.

Thanks to Amy Kaulius for her technical review, and to LBJ, my editor/wife.

Copyright © 2019, Jeffrey M. Jacobs, permission to redistribute with copyright notice

Agile Cost Management in Snowflake – Part 1, Compute

Snowflake’s revolutionary architecture brings new opportunities and approaches to managing the cost of running its cloud data warehouse. To paraphrase Voltaire and Peter Parker’s Uncle Ben, “with great innovation comes great power; with great power comes new responsibilities”. The overall workload architecture is key to managing spend, which in turn requires new modes of thinking and new administration responsibilities.

Historically, workload management was focused on improving performance and throughput; it was seldom directly involved in cost control.

In “legacy” RDBMS, both in cloud and on-premise, workload management is focused on allocating scarce resources among users running different types of queries. Workload management may take different forms. Users may be limited in the amount of CPU and I/O they can use and/or receive priorities on queues and other resources. Elasticity and expansion options are typically limited, and in many cases require downtime. Shared nothing architectures are particularly difficult to upsize and downsize.

With legacy RDBMS, throughput and performance are the main drivers, not cost management. This leads to “bigger is always better” thinking. Elasticity on these platforms is tactical, not strategic, often involving delays or downtime. Over-provisioning for anticipated peak load is all too common.

Snowflake provides unlimited compute and storage resources; money is the scarce resource on which to focus workload management.

Managing Snowflake Spend

There are two aspects to managing your Snowflake spend; compute and storage. This article concentrates on the compute aspects, which are both more variable and tunable. We’ll address storage in Part 2.

Minimizing spend is a matter of proactive management and continual improvement, i.e. “refactoring” the workload distribution. Snowflake provides unique capabilities for refactoring the workload distribution.

Let’s take a brief look at what constitutes workload management (and keep it fairly simple).

“Performance” is defined as how long it takes for a query to execute, i.e. when does the user receive results.

“Throughput” is defined as how many queries can be run in X amount of time. This determines how many concurrent users the system supports.

The fundamental architectures for database engines are Massively Parallel Processor (MPP) and Symmetric Multi-Processor (SMP).  For our purposes and to keep things simple, we’ll consider an MPP processor as a compute engine, either physical server(s) or virtual machine(s), with local storage used during query execution. In a shared nothing architecture, the persisted database storage (PDS) is also typically distributed across the processors (e.g., Redshift and Teradata).

In MPP, the processing of each query is distributed among all of the processors. This provides the necessary power for processing large amounts of data quickly. More processors = better performance = higher throughput. However, the law of diminishing returns also comes into play. Why use, and pay for, 128 processors when a much smaller number will meet the desired service levels?

Snowflake’s unique architecture provides complete and total separation of the processor and PDS.

Snowflake’s primary billing format is “pay for what you use”, not pay for what you provision. Usage is measured in “credits”. The dollar amount for a credit varies depending on which cloud service is used. For compute:

credits used per cluster = number of servers in the cluster (determined by t-shirt size) × hours running, billed per second with a 60-second minimum. (See below for more on t-shirt sizing.)

Compute is performed by a “warehouse”, a somewhat confusing terminology choice by Snowflake. The traditional use of the term “(data) warehouse” refers to both the database storage as well as the compute server.

 We will use the term “Virtual Warehouse” (VW), as compute and database storage are completely independent. A VW can potentially use data from any and/or multiple databases.

Managing VWs

All of the settings for a VW in this section may either be:

  • Set at VW creation
  • Changed via the SQL Alter Warehouse statement
  • Set or changed in the Snowflake Web UI

Charges for a VW apply only when it is running, not when it is suspended (not running). VW options include:

  • Auto-resume – a suspended VW is resumed when a query is submitted.  The minimum charge for resuming a suspended VW is 60 seconds.  The alternative is explicitly resuming the VW via the SQL Alter Warehouse statement.
  • Auto-suspend – a VW may be suspended when no queries are run for the specified time. The alternative is explicitly suspending the VW via the SQL Alter Warehouse statement.
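For example, a VW with both options enabled might be created as follows (the name and values are hypothetical; AUTO_SUSPEND is in seconds):

  CREATE WAREHOUSE analytics_wh
    WAREHOUSE_SIZE      = 'MEDIUM'
    AUTO_SUSPEND        = 300      -- suspend after 5 minutes of inactivity
    AUTO_RESUME         = TRUE     -- resume automatically when a query arrives
    INITIALLY_SUSPENDED = TRUE;    -- don't start billing until first use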

A VW consists of between 1 and 10 “clusters”. VWs with > 1 cluster are referred to as “multi-cluster VWs”. More on this later. Each cluster consists of 1 or more “servers”.  A server is a virtual machine with multiple vCPUs and local storage for processing and caching, e.g. an EC2 instance in Amazon. Snowflake engineering is continually evaluating the cloud vendors’ offerings for the best performance and to maintain performance parity across cloud services.

A query runs in only 1 cluster. Multiple clusters provide greater throughput, i.e. load balancing.

Clusters come in “t-shirt” sizes, defined by the number of servers per cluster:

  • Extra Small – 1 server
  • Small – 2 servers
  • Medium – 4 servers
  • Large – 8 servers
  • Extra Large – 16 servers
  • 2X Large – 32 servers
  • 3X Large – 64 servers
  • 4X Large – 128 servers

Multi-cluster VWs come in 2 flavors, auto-scale and maximized.

Auto-scale balances the number of running clusters based on query load, i.e. the number of queries that need to be run simultaneously.  As the load demand increases, additional clusters are started. As the load decreases, clusters are suspended. The 60 second minimum charge applies only to VW startup, not cluster startup. Auto-scale has two additional settings, economy and standard, controlling the aggressiveness of resuming and suspending clusters.

Maximized mode starts with and runs with the number of clusters specified.  Remember the 60 second minimum; in this mode, on resume of the multi-cluster VW, you are charged a minimum of 60 seconds for all clusters!
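Continuing the hypothetical analytics_wh example, converting it to an auto-scale multi-cluster VW is a single statement (multi-cluster VWs require Enterprise Edition or above):

  ALTER WAREHOUSE analytics_wh SET
    MIN_CLUSTER_COUNT = 1          -- auto-scale mode: min < max
    MAX_CLUSTER_COUNT = 4
    SCALING_POLICY    = 'ECONOMY'; -- less aggressive cluster startup

Setting MIN_CLUSTER_COUNT equal to MAX_CLUSTER_COUNT would give maximized mode instead.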

Clusters in a VW can be resized up or down on the fly, which does not affect running queries! Only newly submitted queries use the reconfigured size. But there’s one more feature to manage cost: the complete separation of VWs and physical database storage allows multiple VWs to run against the same data without incurring additional storage costs.

Let’s Get Agile

Where does “agile” come in? As an adjective, it’s defined as “able to move quickly and easily”. All of the options and configurations we’ve discussed are very simple to manage. A key feature of agile development is refactoring, which is the approach to be used for managing our workload architecture and spend. Refactoring may involve changing the settings of an existing VW, or creating new VWs to replace and/or offload work from an existing VW.

Understanding Workload

Snowflake’s Information Schema provides easy to use “table functions” for understanding workload and account usage.  Since we’ve already shown how to Reduce Your DBAs’ Burden with Snowflake, we’ll assign the primary responsibility for workload management to the DBAs.

The following table functions and views are key to understanding workloads.

Typically you would create views or packaged queries around these functions, as discussed in the documentation.  Note that there are a number of prepackaged dashboards and visualizations available, e.g. from Tableau, Looker and Power BI.

WAREHOUSE_METERING_HISTORY provides a good starting place for identifying VWs to be refactored. In addition, Resource Monitors can be useful for proactively monitoring and controlling credit usage.
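A starting-point query might look like the following sketch; WAREHOUSE_METERING_HISTORY is an Information Schema table function, while the 7-day window is just an example:

  SELECT warehouse_name, SUM(credits_used) AS credits_last_7_days
  FROM   TABLE(information_schema.warehouse_metering_history(
                 DATE_RANGE_START => DATEADD('day', -7, CURRENT_TIMESTAMP()),
                 DATE_RANGE_END   => CURRENT_TIMESTAMP()))
  GROUP  BY warehouse_name
  ORDER  BY credits_last_7_days DESC;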

QUERY_HISTORY and QUERY_HISTORY_BY_* provide the ability to analyze and identify different workload patterns. Ideally, we want to create one or more history tables covering periods longer than the 7 days these functions retain.
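One simple way to build such a history table is sketched below; the schema and table names are hypothetical, and a scheduled job would append new rows before the 7-day window rolls off:

  CREATE TABLE IF NOT EXISTS admin.query_history_archive AS
  SELECT *
  FROM   TABLE(information_schema.query_history(RESULT_LIMIT => 10000));

  -- thereafter, append only rows newer than what has already been captured
  INSERT INTO admin.query_history_archive
  SELECT *
  FROM   TABLE(information_schema.query_history(RESULT_LIMIT => 10000)) q
  WHERE  q.start_time > (SELECT MAX(start_time) FROM admin.query_history_archive);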

For the purpose of this article, we’ll assume that we are managing workloads for each database separately, as these each address separate business functions.

Which columns identify VW characteristics to be refactored?

  • WAREHOUSE_NAME
  • WAREHOUSE_SIZE
  • WAREHOUSE_TYPE

How do we determine when a VW should be running/resumed/suspended based on date and time?

  • START_TIME
  • END_TIME

How do we determine sizing?

  • TOTAL_ELAPSED_TIME
  • BYTES_SCANNED
  • ROWS_PRODUCED
  • EXECUTION_TIME

With all of this information, we can now clearly identify workload patterns such as:

  • What types of queries run during business hours, but almost never on weekends or between 8PM and 7AM?
  • What are the appropriate VW settings for a workload?
  • Are there concurrency spikes in the workload?
  • What is the range of query types? Is the size of the queries consistent or widely ranging?
  • Are there peak loads on certain days of the month, quarter, and year?
  • How often and how complex are ad hoc queries?
  • Which user(s) perform which categories of queries?
  • How should credit usage be estimated and Resource Monitors configured?

A Few Guidelines and Tips

  • Workloads will change over time. Not only will existing usage patterns change, new usage patterns will emerge for both existing data and new data.
  • VWs for loading should be separate from VWs used for queries. Be aware of the guidelines for loading data effectively. Larger VWs will not improve performance for loading large files.
  • Understand the cost difference in t-shirt sizes. Adjust auto-resume and auto-suspend appropriately.
  • Set auto-scaling for workloads with varying loads.
  • Consider setting STATEMENT_TIMEOUT_IN_SECONDS, particularly for very large VWs used for ad hoc queries. This will cancel queries that run longer than the specified time (see the example after this list).
  • Evaluate the execution plan of queries with long running times, heavy byte scanning, etc. to determine if they should be running in larger VWs. In particular, disk spilling is a key indicator of needing larger t-shirt sizing.
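As an illustration of the STATEMENT_TIMEOUT_IN_SECONDS guideline above (the warehouse name and limit are hypothetical):

  ALTER WAREHOUSE adhoc_xl_wh SET STATEMENT_TIMEOUT_IN_SECONDS = 1800;  -- cancel anything running over 30 minutes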

Note on Cluster Keys

Although not strictly part of workload management, the use of cluster keys affects both compute and storage costs as part of its background processing. It should be used with care; see the documentation for discussion on credit usage.

Summary

Snowflake provides unparalleled agile capabilities to control your cloud data warehouse spend. Investing in a new approach to workload management offers an opportunity to significantly improve your ROI on Snowflake.

Copyright ©, 2019, Jeffrey M Jacobs

Notes from Snow Summit Product Innovations Keynote

This morning’s keynote was full of upcoming and very exciting innovations. The biggest are:

  1. Going cross region and cross cloud.
    1. GCP coming, currently in preview.
    2. Cross region and cross cloud bidirectional replication.
    3. Transparent failover.
    4. Consolidated billing and management of all related accounts.
    5. Load data from Google Cloud Storage as stage, prior to full implementation of Snowflake on GCP (I believe this is already available).
  2. Numeracy UI – a significantly better web user interface.
  3. Tasks and job control
    1. Ability to invoke new stored procedures
    2. Guaranteed change data capture on streaming data
  4. Simplified Kafka connection in conjunction with #3.
  5. Materialized views on external files.
  6. Auto Ingest, similar to Snowpipe but without the need for a REST call to kick off ingestion.
  7. Geo/spatial and user defined data types.

It was also great watching the presentation on data lakes in Snowflake, almost matching my post The Elephant in the Data Lake and Snowflake!

The Elephant in the Data Lake and Snowflake

Let’s talk about the elephant in the data lake, Hadoop, and the constant evolution of technology.

Hadoop, (symbolized by an elephant), was created to handle massive amounts of raw data that were beyond the capabilities of existing database technologies. At its core, Hadoop is simply a distributed file system. There are no restrictions on the types of data files that can be stored, but the primary file contents are structured and semi-structured text. “Data lake” and Hadoop have been largely synonymous, but, as we’ll discuss, it’s time to break that connection with Snowflake’s cloud data warehouse technology.

Hadoop’s infrastructure requires a great deal of system administration, even in cloud managed systems.   Administration tasks include: replication, adding nodes, creating directories and partitions, performance, workload management, data (re-)distribution, etc.  Core security tools are minimal, often requiring add-ons. Disaster recovery is another major headache.  Although Hadoop is considered a “shared nothing” architecture, all users compete for resources across the cluster.

Map Reduce (MR) is a fairly simple distributed, disk based computing solution, requiring the writing of complex, procedural code. The developer needs to be intimately familiar with the contents of files being used. MR is batch oriented and the number of simultaneous queries is limited. This led to development of other technologies, such as Spark, to improve performance by using memory instead of disk.

It became apparent that, for reporting and analytics purposes, representing data in relational form with a dictionary was still very desirable.   Hive was born.   With a relational model and dictionary, SQL itself was once again back in favor.  Many people were surprised to discover that SQL was not dependent on the underlying storage technique, but could be mapped to delimited and semi-structured data.   “Schema on read” became “schema on demand”.

A plethora of new SQL tools were born.  Some ran entirely within the Hadoop cluster, others grafted external servers to improve query performance and concurrency.  This also made it much easier for existing Business Intelligence tools to work against the Hadoop data lake.

ETL from Hadoop into other SQL RDBMS has also become commonplace.

But Hadoop’s underlying administration issues still exist, even with cloud managed services.  Feeding and caring for an elephant is an expensive, non-trivial exercise.

Hadoop in its time was a sea change.  Snowflake’s technology is the latest sea change in database technology.

First, let’s determine if Snowflake is suitable for use as a data lake.  The key features of a data lake are:

  1. Support for a wide variety of data types, e.g. structured data, semi-structured (JSON, Avro, Parquet, etc.), log files, IOT events, etc.
  2. Unlimited storage.
  3. Wide range of ingestion techniques and technologies for ETL, ELT and streaming.

Snowflake meets all 3 of the above criteria, and offers substantial improvements over Hadoop.

For #1, Snowflake supports all of the standard RDBMS data types in native format, i.e. NUMBER is stored as binary, not text.  Semi-structured data is stored as VARIANT. Unlike most other semi-structured technologies, Snowflake performs the same optimizations on VARIANT data as it does on structured data, i.e. meta-data collection, statistics, compression and encryption. This provides a tremendous improvement in performance, allowing Snowflake to perform its micro-partition pruning technology on semi-structured data as well as strongly typed data.
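For example, JSON loaded into a hypothetical VARIANT column can be queried with ordinary SQL plus path notation, and benefits from the same pruning:

  CREATE TABLE events (payload VARIANT);

  SELECT payload:user.id::NUMBER    AS user_id,
         payload:event_type::STRING AS event_type
  FROM   events
  WHERE  payload:event_type::STRING = 'login';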

For #2, a simple “yes” would suffice; there are no limits to Snowflake’s storage capacity.   But Snowflake also eliminates the storage management headache.  There are no nodes to set up and manage, no file systems or directories to define, no data distribution policies to implement and revise.  Replication, backup and disaster recovery are all part of Snowflake’s managed services.  Data protection policies are set by simple SQL or UI actions.

For #3, Snowflake and its ecosystem offers a tremendous number of partners and connectors.

Let’s look at other advantages:

Security – In addition to role based security, Snowflake security offers:

  • Data is always encrypted both at rest and in transit, with managed hierarchical and periodic key rotation. Data uploaded to staging may be encrypted by the user.  Snowflake offers “triple key security” and customer managed keys in its Enterprise for Sensitive Data (ESD) level.
  • All editions are SOC 1 Type II and SOC 2 Type II compliant. ESD is PCI DSS certified, FedRAMP Ready, and HIPAA compliant.
  • The “Virtual Private Snowflake” edition offers additional security, providing a single tenant Snowflake cloud.

Performance and Concurrency – Snowflake’s architecture offers both high performance and unlimited concurrency.

First, let’s look at performance:

  • Snowflake’s unique micro-partition architecture provides both partition pruning as well as column pruning within a micro-partition.
  • “Compute” is separate from storage. Compute is provided by “Virtual Data Warehouses (VDW)”.   A VDW consists of:
    1. 1 or more clusters
    2. A cluster may contain 1 – 128 servers. A server is a virtual machine in the appropriate cloud.
  • Query performance depends on the cluster size of a VDW. Cluster size can be increased or decreased dynamically by simple SQL or UI commands without any service interruption. In-flight queries continue to completion; only new queries are affected by the change.
    1. Snowflake maintains the results of all queries for 24 hours; identical queries retrieve results from this result cache.

Now let’s look at concurrency:

  • Concurrency, i.e. the number of simultaneous queries, may be scaled in 2 ways.
    1. A VDW may be configured with multiple clusters, again via simple SQL or UI command. As of this writing, the maximum number of clusters is 10.  A VDW may be configured as autoscale, with clusters being started and stopped based on dynamic load balancing.
    2. For even higher levels of concurrency, multiple VDWs can be created for similar workloads.

As we’ve seen, Snowflake clearly adds the ability to meet the requirements of a data lake, in addition to its more traditional data warehouse capabilities.

It’s time to return the elephant to the wild!

Copyright © Jeffrey Jacobs, 2019, permission to redistribute with copyright notice

Reduce Your DBAs’ Burden with Snowflake

 

Today’s DBA typically manages 10s, 100s or even 1000s of databases, often from multiple vendors, both on premise and in the cloud.   These may include RDBMS, NoSQL DBMS, and/or Hadoop clusters.

While management automation has made substantial strides enabling DBAs to handle larger workloads, the care and feeding of these databases is still all too frequently a burdensome task.

Data warehouses, data marts, and data lakes usually require the most attention.  Let’s discuss how using the Snowflake RDBMS can dramatically reduce the DBA’s workload!

Snowflake Architecture Overview

Snowflake’s architecture consists of 3 distinct layers.

  1. “Cloud Services” manages the cloud vendor services. It also manages the meta-data and SQL execution plan generation.
  2. “Virtual Data Warehouses” (VDW) provide the compute engine(s).   VDWs are completely separate and independent of each other, even when operating on the same data.   A VDW supporting BI is not affected by another VDW performing ETL/ELT. They are very simple to create and manage, via either a simple SQL statement or the web UI.   Note that a VDW is pure compute; don’t confuse this with the traditional use of “data warehouse” to mean both compute and storage.
  3. “Storage” aka “database” provides a “shared disk” architecture, separate from the VDWs that are using it. Storage is AWS S3 or Azure Blob.

A VDW can operate against any database; it is not allocated to a specific database.

Let’s look at the common tasks and requirements for creation and management.

Initial Creation for Legacy DBMS

The initial creation of a database typically requires the following steps. This is neither an exhaustive list nor applicable to every DBMS:

  1. Capacity Planning. This is critical to ensuring smooth operation, both at normal loads and peak loads.   This can be a daunting task, and usually results in over-provisioning.
  2. Infrastructure procurement and installation. Typically based on the peak load estimate, often leaving unused capacity for significant periods of time.
  3. Software installation.
  4. Creation of the database object itself (CREATE DATABASE …). Common steps include specifying the file system, file naming conventions, storage management, system parameters, etc.
  5. Database schema and object creation. We will assume that a suitable logical data model has been created and basic DDL generated.   However, if the platform supports indexing, the DBA will typically be called upon to perform initial review and implementation of bit-mapped or b-tree indexes.  Additional indexing is to be expected!
    1. Further, for many RDBMS, additional indexes and possibly other support features may be required for semi-structured data performance.
  6. Identification of ETL loads. This further requires:
    1. Determining size and performance impact of each load.
    2. Determining appropriate time of day/day of week for each ETL load.
    3. Scheduling of jobs to perform ETL and ELT.

Initial Create for Snowflake

  1. Create Account. Select the appropriate account type; no other steps necessary.
  2. Issue CREATE DATABASE SQL (or use web UI). The primary parameter is “Time Travel”, i.e. backup data retention in days (1-90) for recovery from user errors, disaster recovery and “point in time” querying. No file system, storage management, etc. required.   Snowflake databases are inherently elastic, with automatic purging of obsolete data.
    1. Note that all data is automatically replicated across availability zones. Coupled with “Time Travel”, this typically eliminates the need for other forms of backup.   However, creating a manual backup is also a single, nearly instantaneous operation of creating a new database by cloning.
  3. Create database objects, i.e. tables, sequences and views. No indexes.  No additional support objects required for semi-structured data performance.
  4. Create initial VDW(s). Again, simple SQL with simple, easily changeable parameters:
    1. Size – number of servers per cluster, ranging from 1 – 128.
    2. Single or multiple cluster, possibly with auto-scaling.
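Steps 2 and 4, plus the manual backup noted in 2.1, are each a single statement; the sketch below uses hypothetical names:

  CREATE DATABASE analytics DATA_RETENTION_TIME_IN_DAYS = 30;  -- Time Travel set at creation

  CREATE WAREHOUSE etl_vdw WAREHOUSE_SIZE = 'LARGE' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;

  -- a "manual backup" is just a near-instantaneous zero-copy clone
  CREATE DATABASE analytics_backup CLONE analytics;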

Snowflake’s separation of compute and storage means that ETL/ELT does not run on the same compute as operational DML queries.   ETL/ELT processing is completely isolated and self-contained.

Leaving time for the DBA to work on legacy problems!

Performance Tuning for Legacy DBMS

Tuning a legacy database platform varies tremendously depending on the platform.   Common tuning work includes:

  1. Partitioning tables. All too often, this is required after the data warehouse has been operating for some time. This may involve downtime, online repartitioning with attendant compute and disk load, creating new tables, etc.
  2. Gathering statistics. This may be required after each data load.  It may be automated by a job scheduler, but often runs in off hours or on weekends.   Planning and analysis is often required to determine which tables need regular statistics gathered and which do not.
  3. Adjusting System Parameters. This often requires restarting the database.
  4. “Row store” databases typically require indexes for performance.  Indexes are also typically needed for performance of semi-structured data such as JSON.

Performance Tuning in Snowflake

None of the performance tuning discussed previously is required for Snowflake.

  1. Statistics are collected when data is loaded and become part of the meta-data. They are always current.
  2. All tables in Snowflake are automatically “micro-partitioned”.
  3. Snowflake is a column store, so no indexing is needed.
  4. There are no “system parameters” to tune. In rare cases, a cluster key may be needed to improve performance on multi-TB (compressed) tables.  Snowflake performs clustering when loading data.  Adding a cluster key to an existing large table results in reclustering as a background process.

Snowflake’s approach to performance tuning is based on the elasticity of its VDW.

For large, complex queries, the solution is to scale up the VDW to a larger size, i.e. more servers in a cluster.   Again, this is a simple SQL or web UI action.  A VDW can be scaled up (or down) dynamically; this won’t affect running queries, only queries submitted after the size change.
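For example (warehouse name is hypothetical):

  ALTER WAREHOUSE reporting_vdw SET WAREHOUSE_SIZE = 'XLARGE';  -- in-flight queries finish on the old size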

A multi-cluster VDW, which can be created with a simple SQL or web UI action, is used to support high-concurrency workloads.   This type of VDW may also be “auto-scaled”, allowing dynamic activation and suspension of clusters.

Another key feature for Snowflake performance is the ability to create multiple VDWs, each supporting a different type of workload or business area.   The VDWs are completely independent of each other.

Finally, Snowflake’s “micro-partition” architecture eliminates the need for traditional performance tuning.   This approach takes partition pruning to a new level, using an incredibly rich meta-data store, enabling both vertical elimination of partitions and horizontal examination only of relevant columns.

Once again, leaving the DBA time to address legacy problems!

Copyright © Jeffrey Jacobs, 2019