Agile Cost Management in Snowflake – Part 1, Compute

Snowflake’s revolutionary architecture brings new opportunities and approaches to managing the cost of running its cloud data warehouse. To paraphrase Voltaire and Peter Parker’s Uncle Ben, “with great innovation comes great power; with great power comes new responsibilities”. The overall workload architecture is key to managing spend, which in turn requires new modes of thinking and new administration responsibilities.

Historically, workload management was focused on improving performance and throughput; it was seldom directly involved in cost control.

In “legacy” RDBMS, both in the cloud and on-premises, workload management focuses on allocating scarce resources among users running different types of queries. It may take different forms: users may be limited in the amount of CPU and I/O they can consume, and/or receive priorities on queues and other resources. Elasticity and expansion options are typically limited, and in many cases require downtime. Shared-nothing architectures are particularly difficult to upsize and downsize.

With legacy RDBMS, throughput and performance are the main drivers, not cost management. This leads to “bigger is always better” thinking. Elasticity on these platforms is tactical, not strategic, often involving delays or downtime. Over-provisioning for anticipated peak load is all too common.

Snowflake provides unlimited compute and storage resources; money is the scarce resource on which to focus workload management.

Managing Snowflake Spend

There are two aspects to managing your Snowflake spend: compute and storage. This article concentrates on compute, which is both more variable and more tunable. We’ll address storage in Part 2.

Minimizing spend is a matter of proactive management and continual improvement, i.e. “refactoring” the workload distribution, for which Snowflake provides unique capabilities.

Let’s take a brief look at what constitutes workload management (and keep it fairly simple).

“Performance” is defined as how long it takes for a query to execute, i.e. how soon the user receives results.

“Throughput” is defined as how many queries can be run in a given amount of time. This determines how many concurrent users the system supports.

The fundamental architectures for database engines are Massively Parallel Processing (MPP) and Symmetric Multiprocessing (SMP). For our purposes, and to keep things simple, we’ll consider an MPP “processor” to be a compute engine, either physical server(s) or virtual machine(s), with local storage used during query execution. In a shared-nothing architecture, the persisted database storage (PDS) is also typically distributed across the processors (e.g., Redshift and Teradata).

In MPP, the processing of each query is distributed among all of the processors. This provides the necessary power for processing large amounts of data quickly. More processors = better performance = higher throughput. However, the law of diminishing returns also comes into play. Why use, and pay for, 128 processors when a much smaller number will meet the desired service levels?

Snowflake’s unique architecture completely separates compute from the PDS.

Snowflake’s primary billing format is “pay for what you use”, not pay for what you provision. Usage is measured in “credits”. The dollar amount for a credit varies depending on which cloud service is used. For compute:  

    credits_used_per_cluster = servers_per_cluster * running_time_in_hours

where servers_per_cluster is determined by the cluster’s “t-shirt” size, and each server consumes one credit per hour, billed per second after the 60-second minimum. (See below for more on t-shirt sizing.)
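
A quick worked example: a Medium cluster has 4 servers, so each hour it runs consumes 4 credits. Running it for 30 minutes therefore consumes

    4 servers * 0.5 hours = 2 credits

At an illustrative rate of $3.00 per credit (the actual rate depends on your cloud service, region and Snowflake edition), that half hour costs $6.00.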

Compute is performed by a “warehouse”, a somewhat confusing terminology choice by Snowflake; the traditional term “(data) warehouse” refers to both the database storage and the compute server.

We will use the term “Virtual Warehouse” (VW), as compute and database storage are completely independent. A VW can potentially use data from any database, or from multiple databases.

Managing VWs

All of the settings for a VW in this section may be (see the example after the list):

  • Set at VW creation
  • Changed via the SQL Alter Warehouse statement
  • Set or changed in the Snowflake Web UI
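
As a minimal sketch, here is a creation statement covering the settings discussed in the rest of this section (the warehouse name reporting_wh is hypothetical):

    CREATE WAREHOUSE reporting_wh WITH
      WAREHOUSE_SIZE = 'MEDIUM'     -- t-shirt size: 4 servers per cluster
      AUTO_SUSPEND = 300            -- suspend after 300 seconds with no queries
      AUTO_RESUME = TRUE            -- resume automatically when a query arrives
      INITIALLY_SUSPENDED = TRUE;   -- don't start (and don't bill) until first use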

Charges for a VW apply only while it is running; a suspended VW incurs no compute charges. VW options include:

  • Auto-resume – a suspended VW is resumed when a query is submitted.  The minimum charge for resuming a suspended VW is 60 seconds.  The alternative is explicitly resuming the VW via the SQL Alter Warehouse statement.
  • Auto-suspend – a VW may be suspended when no queries are run for the specified time. The alternative is explicitly suspending the VW via the SQL Alter Warehouse statement.
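
The explicit alternatives look like this (a sketch; the warehouse name is hypothetical):

    ALTER WAREHOUSE reporting_wh SET AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;  -- tune automatic behavior
    ALTER WAREHOUSE reporting_wh SUSPEND;  -- explicitly suspend; billing stops
    ALTER WAREHOUSE reporting_wh RESUME;   -- explicitly resume; 60-second minimum applies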

A VW consists of between 1 and 10 “clusters”; VWs with more than one cluster are referred to as “multi-cluster VWs” (more on this later). Each cluster consists of 1 or more “servers”. A server is a virtual machine with multiple vCPUs and local storage for processing and caching, e.g. an EC2 instance on AWS. Snowflake engineering continually evaluates the cloud vendors’ offerings for the best performance and to maintain performance parity across cloud services.

A query runs in only 1 cluster. Multiple clusters provide greater throughput, i.e. load balancing.

Clusters come in “t-shirt” sizes:

  • Extra Small – 1 server
  • Small – 2 servers
  • Medium – 4 servers
  • Large – 8 servers
  • Extra Large – 16 servers
  • 2X Large – 32 servers
  • 3X Large – 64 servers
  • 4X Large – 128 servers

Multi-cluster VWs come in 2 flavors, auto-scale and maximized.

Auto-scale balances the number of running clusters based on query load, i.e. the number of queries that need to run simultaneously. As load increases, additional clusters are started; as load decreases, clusters are suspended. The 60-second minimum charge applies only to VW startup, not cluster startup. Auto-scale has two additional settings, Economy and Standard, which control how aggressively clusters are resumed and suspended.
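
A minimal auto-scale configuration sketch (the cluster counts are illustrative):

    ALTER WAREHOUSE reporting_wh SET
      MIN_CLUSTER_COUNT = 1        -- run as few as 1 cluster when load is light
      MAX_CLUSTER_COUNT = 4        -- start up to 4 clusters under heavy concurrency
      SCALING_POLICY = 'ECONOMY';  -- conserve credits rather than starting clusters eagerly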

Maximized mode starts and runs with the number of clusters specified. Remember the 60-second minimum: in this mode, on resume of the multi-cluster VW, you are charged a minimum of 60 seconds for all clusters!
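
Maximized mode is simply a matter of setting the minimum and maximum cluster counts equal, e.g.:

    ALTER WAREHOUSE reporting_wh SET MIN_CLUSTER_COUNT = 4 MAX_CLUSTER_COUNT = 4;  -- always run 4 clusters while resumed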

Clusters in a VW can be resized up or down on the fly, which does not affect running queries! Only newly submitted queries use the reconfigured size. And there’s one more feature to manage cost: the complete separation of VWs and physical database storage allows multiple VWs to run against the same data without incurring additional storage costs.
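
Resizing on the fly is a single statement, e.g. sizing up for a heavy batch window and back down afterwards:

    ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'XLARGE';  -- before the batch window
    ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'MEDIUM';  -- after it completes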

Let’s Get Agile

Where does “agile” come in? As an adjective, it’s defined as “able to move quickly and easily”. All of the options and configurations we’ve discussed are very simple to manage. A key feature of agile development is refactoring, which is the approach we’ll use for managing our workload architecture and spend. Refactoring may involve changing the settings of an existing VW, or creating new VWs to replace or offload work from an existing one.

Understanding Workload

Snowflake’s Information Schema provides easy-to-use “table functions” for understanding workload and account usage. Since we’ve already shown how to Reduce Your DBAs’ Burden with Snowflake, we’ll assign primary responsibility for workload management to the DBAs.

Two table functions are key to understanding workloads: WAREHOUSE_METERING_HISTORY and QUERY_HISTORY (with its QUERY_HISTORY_BY_* variants).

Typically you would create views or packaged queries over these functions, as discussed in the documentation. Note that a number of prepackaged dashboards and visualizations are available, e.g. from Tableau, Looker and Power BI.

WAREHOUSE_METERING_HISTORY provides a good starting place for identifying VWs to be refactored. In addition, Resource Monitors can be useful for proactively monitoring and controlling credit usage.
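
For example, a sketch reviewing per-VW credit consumption for the last week, plus a Resource Monitor capping spend (the monitor name and quota are illustrative):

    -- hourly credit consumption per warehouse over the last 7 days
    SELECT warehouse_name, start_time, end_time, credits_used
    FROM TABLE(INFORMATION_SCHEMA.WAREHOUSE_METERING_HISTORY(DATEADD('days', -7, CURRENT_DATE())))
    ORDER BY warehouse_name, start_time;

    -- notify at 80% of the monthly credit quota, suspend assigned VWs at 100%
    CREATE RESOURCE MONITOR monthly_cap WITH CREDIT_QUOTA = 100
      TRIGGERS ON 80 PERCENT DO NOTIFY
               ON 100 PERCENT DO SUSPEND;
    ALTER WAREHOUSE reporting_wh SET RESOURCE_MONITOR = monthly_cap;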

QUERY_HISTORY and the QUERY_HISTORY_BY_* variants provide the ability to analyze and identify different workload patterns. Since these functions only return up to 7 days of history, ideally we want to persist the results into one or more history tables covering longer periods.
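
A sketch of persisting that history (the admin schema and table names are hypothetical):

    -- one-time: capture the available history
    CREATE TABLE admin.query_history_archive AS
      SELECT * FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY(
        END_TIME_RANGE_START => DATEADD('days', -7, CURRENT_TIMESTAMP()),
        RESULT_LIMIT => 10000));

    -- run periodically (e.g. daily) to extend coverage beyond 7 days;
    -- deduplicate on QUERY_ID if runs overlap
    INSERT INTO admin.query_history_archive
      SELECT * FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY(
        END_TIME_RANGE_START => DATEADD('days', -1, CURRENT_TIMESTAMP()),
        RESULT_LIMIT => 10000));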

For the purpose of this article, we’ll assume that we are managing workloads for each database separately, as these each address separate business functions.

Which columns identify VW characteristics to be refactored?

  • WAREHOUSE_NAME
  • WAREHOUSE_SIZE
  • WAREHOUSE_TYPE

How do we determine when a VW should be running/resumed/suspended based on date and time?

  • START_TIME
  • END_TIME

How do we determine sizing?

  • TOTAL_ELAPSED_TIME
  • BYTES_SCANNED
  • ROWS_PRODUCED
  • EXECUTION_TIME
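
For example, a sketch against the hypothetical archive table above, surfacing long-running, scan-heavy workloads that may warrant a different t-shirt size:

    SELECT warehouse_name,
           COUNT(*) AS queries,
           AVG(total_elapsed_time) / 1000 AS avg_elapsed_sec,  -- TOTAL_ELAPSED_TIME is in milliseconds
           AVG(bytes_scanned) / POWER(1024, 3) AS avg_gb_scanned
    FROM admin.query_history_archive
    GROUP BY warehouse_name
    ORDER BY avg_elapsed_sec DESC;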

With all of this information, we can now clearly identify workload patterns and answer questions such as:

  • What types of queries run during business hours, but almost never on weekends or between 8 PM and 7 AM?
  • What are the appropriate VW settings for a workload?
  • Are there concurrency spikes in the workload?
  • What is the range of query types? Is the size of the queries consistent or widely ranging?
  • Are there peak loads on certain days of the month, quarter, or year?
  • How often and how complex are ad hoc queries?
  • Which user(s) perform which categories of queries?
  • How do we estimate credit usage and set up Resource Monitors?
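
As one sketch of such pattern analysis, again using the hypothetical archive table, bucketing activity by day of week and hour reveals windows where a VW could be suspended or downsized:

    SELECT warehouse_name,
           DAYOFWEEK(start_time) AS day_of_week,  -- numbering depends on the WEEK_START parameter
           HOUR(start_time) AS hour_of_day,
           COUNT(*) AS queries
    FROM admin.query_history_archive
    GROUP BY 1, 2, 3
    ORDER BY 1, 2, 3;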

A Few Guidelines and Tips

  • Workloads will change over time. Not only will existing usage patterns change, new usage patterns will emerge for both existing data and new data.
  • VWs for loading should be separate from VWs used for queries. Be aware of the guidelines for loading data effectively. Larger VWs will not improve performance for loading large files.
  • Understand the cost difference between t-shirt sizes. Adjust auto-resume and auto-suspend appropriately.
  • Set auto-scaling for workloads with varying loads.
  • Consider setting STATEMENT_TIMEOUT_IN_SECONDS, particularly for very large VWs used for ad hoc queries. This cancels queries that run longer than the specified time (see the example after this list).
  • Evaluate the execution plan of queries with long running times, heavy byte scanning, etc. to determine if they should be running in larger VWs. In particular, disk spilling is a key indicator of needing larger t-shirt sizing.
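
For instance, a sketch capping ad hoc queries at one hour on a hypothetical extra-large VW:

    ALTER WAREHOUSE adhoc_xl_wh SET STATEMENT_TIMEOUT_IN_SECONDS = 3600;  -- cancel anything running longer than 1 hour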

Note on Clustering Keys

Although not strictly part of workload management, the use of clustering keys affects both compute and storage costs through its background maintenance processing. It should be used with care; see the documentation for a discussion of credit usage.

Summary

Snowflake provides unparalleled agile capabilities to control your cloud data warehouse spend. Investing in a new approach to workload management offers an opportunity to significantly improve your ROI on Snowflake.

Copyright ©, 2019, Jeffrey M Jacobs
