The Elephant in the Data Lake and Snowflake

Let’s talk about the elephant in the data lake, Hadoop, and the constant evolution of technology.

Hadoop (symbolized by an elephant) was created to handle massive amounts of raw data beyond the capabilities of existing database technologies. At its core, Hadoop is simply a distributed file system. There are no restrictions on the types of data files that can be stored, but the primary file contents are structured and semi-structured text. “Data lake” and Hadoop have been largely synonymous, but, as we’ll discuss, Snowflake’s cloud data warehouse technology makes it time to break that connection.

Hadoop’s infrastructure requires a great deal of system administration, even in cloud-managed systems. Administration tasks include replication, adding nodes, creating directories and partitions, performance tuning, workload management, data (re-)distribution, and more. Core security tools are minimal, often requiring add-ons. Disaster recovery is another major headache. Although Hadoop is considered a “shared nothing” architecture, all users compete for resources across the cluster.

MapReduce (MR) is a fairly simple distributed, disk-based computing framework that requires writing complex, procedural code. The developer needs to be intimately familiar with the contents of the files being used. MR is batch oriented, and the number of simultaneous queries is limited. This led to the development of other technologies, such as Spark, which improve performance by using memory instead of disk.

It became apparent that, for reporting and analytics purposes, representing data in relational form with a dictionary was still highly desirable. Hive was born. With a relational model and dictionary, SQL itself was once again back in favor. Many people were surprised to discover that SQL was not dependent on the underlying storage technique, but could be mapped onto delimited and semi-structured data. “Schema on read” became “schema on demand”.

A plethora of new SQL tools emerged. Some ran entirely within the Hadoop cluster; others grafted on external servers to improve query performance and concurrency. This also made it much easier for existing Business Intelligence tools to work against the Hadoop data lake.

ETL from Hadoop into other SQL RDBMSs has also become commonplace.

But Hadoop’s underlying administration issues still exist, even with cloud-managed services. Feeding and caring for an elephant is an expensive, non-trivial exercise.

Hadoop in its time was a sea change.  Snowflake’s technology is the latest sea change in database technology.

First, let’s determine if Snowflake is suitable for use as a data lake.  The key features of a data lake are:

  1. Support for a wide variety of data types, e.g. structured data, semi-structured data (JSON, Avro, Parquet, etc.), log files, and IoT events.
  2. Unlimited storage.
  3. Wide range of ingestion techniques and technologies for ETL, ELT and streaming.

Snowflake meets all three of the above criteria and offers substantial improvements over Hadoop.

For #1, Snowflake supports all of the standard RDBMS data types in native format, i.e. NUMBER is stored as binary, not text. Semi-structured data is stored as VARIANT. Unlike most other semi-structured technologies, Snowflake performs the same optimizations on VARIANT data as it does on structured data, i.e. metadata collection, statistics, compression, and encryption. This provides a tremendous improvement in performance, allowing Snowflake to apply its micro-partition pruning technology to semi-structured data as well as strongly typed data.
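
For example, a VARIANT column can be queried directly with SQL path notation and a cast. The table name, column names, and JSON shape below are hypothetical, but the syntax is standard Snowflake SQL; this is a minimal sketch, not a prescribed schema:

  CREATE TABLE raw_events (event VARIANT);

  -- Extract typed values straight out of the VARIANT column;
  -- the WHERE filter benefits from the same micro-partition pruning as a typed column.
  SELECT
      event:device.id::STRING     AS device_id,
      event:reading.temp_c::FLOAT AS temp_c
  FROM raw_events
  WHERE event:device.type::STRING = 'thermostat';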

For #2, a simple “yes” would suffice; there are no limits to Snowflake’s storage capacity.   But Snowflake also eliminates the storage management headache.  There are no nodes to set up and manage, no file systems or directories to define, no data distribution policies to implement and revise.  Replication, backup and disaster recovery are all part of Snowflake’s managed services.  Data protection policies are set by simple SQL or UI actions.
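
As a rough sketch of those data protection policies (the database, schema, and table names are hypothetical), retention, point-in-time queries, and recovery are all plain SQL:

  -- Set the Time Travel retention period for a database
  ALTER DATABASE analytics_db SET DATA_RETENTION_TIME_IN_DAYS = 30;

  -- Query a table as it looked one hour ago (the offset is in seconds)
  SELECT COUNT(*) FROM analytics_db.public.orders AT(OFFSET => -3600);

  -- Recover an accidentally dropped table from Time Travel
  UNDROP TABLE analytics_db.public.orders;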

For #3, Snowflake and its ecosystem offer a tremendous number of partners and connectors.
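
As one illustration of ingestion (the stage, file format, and table names below are hypothetical), bulk loading semi-structured files is a single COPY command:

  CREATE FILE FORMAT json_fmt TYPE = 'JSON';
  CREATE STAGE landing_stage FILE_FORMAT = json_fmt;

  -- Bulk load staged JSON files into a VARIANT column
  COPY INTO raw_events
    FROM @landing_stage
    FILE_FORMAT = (FORMAT_NAME = 'json_fmt');

Continuous ingestion via Snowpipe and the many partner connectors follows similar staging and COPY patterns.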

Let’s look at other advantages:

Security – In addition to role-based security, Snowflake offers:

  • Data is always encrypted, both at rest and in transit, with managed hierarchical keys and periodic key rotation. Data uploaded to staging may also be encrypted by the user. Snowflake offers “triple key security” and customer-managed keys in its Enterprise for Sensitive Data (ESD) edition.
  • All editions are SOC 1 Type II and SOC 2 Type II compliant. ESD is PCI DSS certified, FedRAMP Ready, and HIPAA compliant.
  • The “Virtual Private Snowflake” edition offers additional security, providing a single-tenant Snowflake cloud.

Performance and Concurrency – Snowflake’s architecture offers both high performance and unlimited concurrency.

First, let’s look at performance:

  • Snowflake’s unique micro-partition architecture provides both partition pruning and column pruning within a micro-partition.
  • “Compute” is separate from storage. Compute is provided by “Virtual Data Warehouses” (VDWs). A VDW consists of:
    1. One or more clusters.
    2. Each cluster contains 1 – 128 servers; a server is a virtual machine in the underlying cloud.
  • Query performance depends on the cluster size of a VDW. Cluster size can be increased or decreased dynamically with a simple SQL or UI command, without any service interruption: in-flight queries run to completion, and only new queries are affected by the change (a sketch follows this list).
    1. Snowflake retains the results of all queries for 24 hours; identical queries are served from this result cache.
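
A minimal sketch of a dynamic resize (the warehouse name is hypothetical):

  -- Scale up for a heavy workload; in-flight queries finish on the old size
  ALTER WAREHOUSE bi_wh SET WAREHOUSE_SIZE = 'LARGE';

  -- Scale back down when the workload subsides
  ALTER WAREHOUSE bi_wh SET WAREHOUSE_SIZE = 'SMALL';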

Now let’s look at concurrency:

  • Concurrency, i.e. the number of simultaneous queries, may be scaled in two ways.
    1. A VDW may be configured with multiple clusters, again via a simple SQL or UI command. As of this writing, the maximum number of clusters is 10. A VDW may also be configured for auto-scaling, with clusters being started and stopped based on dynamic load balancing (see the sketch after this list).
    2. For even higher levels of concurrency, multiple VDWs can be created for similar workloads.
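
A sketch of a multi-cluster, auto-scaling VDW; the name and limits are illustrative only, and multi-cluster warehouses require an appropriate Snowflake edition:

  CREATE WAREHOUSE dashboards_wh
    WAREHOUSE_SIZE    = 'MEDIUM'
    MIN_CLUSTER_COUNT = 1
    MAX_CLUSTER_COUNT = 4    -- clusters start and stop with concurrency demand
    SCALING_POLICY    = 'STANDARD'
    AUTO_SUSPEND      = 300
    AUTO_RESUME       = TRUE;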

As we’ve seen, Snowflake clearly meets the requirements of a data lake, in addition to providing its more traditional data warehouse capabilities.

It’s time to return the elephant to the wild!

Copyright © Jeffrey Jacobs, 2019, permission to redistribute with copyright notice

Reduce Your DBAs’ Burden with Snowflake


Today’s DBA typically manages tens, hundreds, or even thousands of databases, often from multiple vendors, both on premises and in the cloud. These may include RDBMSs, NoSQL DBMSs, and/or Hadoop clusters.

While management automation has made substantial strides in enabling DBAs to handle larger workloads, the care and feeding of these databases is still all too frequently a burdensome task.

Data warehouses, data marts, and data lakes usually require the most attention. Let’s discuss how using the Snowflake RDBMS can dramatically reduce the DBA’s workload!

Snowflake Architecture Overview

Snowflake’s architecture consists of three distinct layers:

  1. “Cloud Services” manages the underlying cloud vendor’s services, along with the metadata and SQL execution plan generation.
  2. “Virtual Data Warehouses” (VDWs) provide the compute engine(s). VDWs are completely separate from, and independent of, each other, even when operating on the same data. A VDW supporting BI is not affected by another VDW performing ETL/ELT. They are very simple to create and manage, with either a simple SQL statement or the web UI. Note that a VDW is pure compute; don’t confuse this with the traditional use of “data warehouse” to mean both compute and storage.
  3. “Storage”, aka the “database”, provides a “shared disk” architecture, separate from the VDWs that use it. Storage is AWS S3 or Azure Blob storage.

A VDW can operate against any database; it is not allocated to a specific database.
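
A brief sketch of that independence (the warehouse, database, and table names are hypothetical): two VDWs serve two different workloads, and either one can query any database in the account.

  CREATE WAREHOUSE etl_wh WAREHOUSE_SIZE = 'LARGE'  AUTO_SUSPEND = 60  AUTO_RESUME = TRUE;
  CREATE WAREHOUSE bi_wh  WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;

  -- Switching compute is a session-level setting; the data never moves
  USE WAREHOUSE bi_wh;
  SELECT o.order_id, i.amount
  FROM sales_db.public.orders o
  JOIN finance_db.public.invoices i ON i.order_id = o.order_id;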

Let’s look at the common tasks and requirements for creation and management.

Initial Creation for Legacy DBMS

The initial creation of a database typically requires the following steps. This is neither an exhaustive list nor applicable to every DBMS:

  1. Capacity Planning. This is critical to ensuring smooth operation, both at normal loads and peak loads.   This can be a daunting task, and usually results in over-provisioning.
  2. Infrastructure procurement and installation. Typically based on the peak load estimate, often leaving unused capacity for significant periods of time.
  3. Software installation.
  4. Creation of the database object itself (CREATE DATABASE …). Common steps include specifying the file system, file naming conventions, storage management, system parameters, etc.
  5. Database schema and object creation. We will assume that a suitable logical data model has been created and basic DDL generated. However, if the platform supports indexing, the DBA will typically be called upon to perform an initial review and implementation of bitmap or B-tree indexes. Additional indexing is to be expected!
    1. Further, for many RDBMSs, additional indexes and possibly other supporting features may be required for semi-structured data performance.
  6. Identification of ETL loads. This further requires:
    1. Determining size and performance impact of each load.
    2. Determining appropriate time of day/day of week for each ETL load.
    3. Scheduling of jobs to perform ETL and ELT.

Initial Creation for Snowflake

  1. Create an account. Select the appropriate account type; no other steps are necessary.
  2. Issue CREATE DATABASE SQL (or use the web UI). The primary parameter is “Time Travel”, i.e. backup data retention in days (1-90), for recovery from user errors, disaster recovery, and “point in time” querying. No file system, storage management, etc. is required. Snowflake databases are inherently elastic, with automatic purging of obsolete data. (A sketch of the full sequence follows this list.)
    1. Note that all data is automatically replicated across availability zones. Coupled with “Time Travel”, this typically eliminates the need for other forms of backup. However, creating a manual backup is also a single, nearly instantaneous operation: creating a new database by cloning.
  3. Create database objects, i.e. tables, sequences, and views. No indexes. No additional support objects are required for semi-structured data performance.
  4. Create initial VDW(s). Again, simple SQL with simple, easily changeable parameters:
    1. Size – number of servers per cluster, ranging from 1 – 128.
    2. Single or multiple clusters, possibly with auto-scaling.
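
The entire sequence above fits in a few statements. The names and settings below are hypothetical, but the commands are standard Snowflake SQL; treat this as a minimal sketch:

  -- Database with 30 days of Time Travel retention
  CREATE DATABASE edw DATA_RETENTION_TIME_IN_DAYS = 30;

  -- A "backup" is a zero-copy clone: near-instant, with no extra storage until data diverges
  CREATE DATABASE edw_backup CLONE edw;

  -- Initial compute: a single-cluster VDW that suspends itself when idle
  CREATE WAREHOUSE edw_wh
    WAREHOUSE_SIZE = 'SMALL'
    AUTO_SUSPEND   = 300
    AUTO_RESUME    = TRUE;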

Snowflake’s separation of compute and storage means that ETL/ELT does not run on the same compute as operational queries. ETL/ELT processing is completely isolated and self-contained.

Leaving time for the DBA to work on legacy problems!

Performance Tuning for Legacy DBMS

Tuning a legacy database platform varies tremendously depending on the platform.   Common tuning work includes:

  1. Partitioning tables. All too often, this is required after the data warehouse has been operating for some time. This may involve downtime, online repartitioning with attendant compute and disk load, creating new tables, etc.
  2. Gathering statistics. This may be required after each data load. It may be automated by a job scheduler, but often runs in off hours or on weekends. Planning and analysis are often required to determine which tables need statistics gathered regularly and which do not.
  3. Adjusting System Parameters. This often requires restarting the database.
  4. “Row store” databases typically require indexes for performance.  Indexes are also typically needed for performance of semi-structured data such as JSON.

Performance Tuning in Snowflake

None of the performance tuning discussed previously is required for Snowflake.

  1. Statistics are collected when data is loaded and become part of the metadata. They are always current.
  2. All tables in Snowflake are automatically “micro-partitioned”.
  3. Snowflake is a column store, so no indexing is needed.
  4. There are no “system parameters” to tune. In rare cases, a cluster key may be needed to improve performance on multi-TB (compressed) tables. Snowflake clusters data naturally as it is loaded; adding a cluster key to an existing large table triggers reclustering as a background process, as sketched below.
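
A sketch of that rare case (the table and column names are hypothetical):

  -- Add a clustering key to an existing large table; reclustering proceeds in the background
  ALTER TABLE big_fact CLUSTER BY (event_date, customer_id);

  -- Inspect how well the table is clustered on those columns
  SELECT SYSTEM$CLUSTERING_INFORMATION('big_fact', '(event_date, customer_id)');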

Snowflake’s approach to performance tuning is based on the elasticity of its VDW.

For large, complex queries, the solution is to scale up the VDW to a larger size, i.e. more servers in a cluster.   Again, this is a simple SQL or web UI action.  A VDW can be scaled up (or down) dynamically; this won’t affect running queries, only queries submitted after the size change.

A multi-cluster VDW, created with a simple SQL or web UI action, is used to support high-concurrency workloads. This type of VDW may also be “auto-scaled”, allowing dynamic activation and suspension of clusters.

Another key feature for Snowflake performance is the ability to create multiple VDWs, each supporting a different type of workload or business area.   The VDWs are completely independent of each other.

Finally, Snowflake’s “micro-partition” architecture eliminates the need for traditional performance tuning. This approach takes partition pruning to a new level, using an incredibly rich metadata store to enable both horizontal pruning of micro-partitions and vertical examination of only the relevant columns.

Once again, leaving the DBA time to address legacy problems!

Copyright © Jeffrey Jacobs, 2019