Pivoting IIOT Data in Snowflake

In this article, we discuss building an architecture for user-friendly historical analysis of Industrial Internet of Things (IIOT) data.

IIOT data is a sub-category of Internet of Things (IOT) data. IOT data is gathered from a large number of similar or identical devices. For our purposes, IIOT data is collected from Programmable Logic Controllers (PLCs) used to control and monitor equipment in “industrial” plants. Each PLC typically manages and monitors a well-defined set of steps in the overall process. Multiple data samples are gathered from multiple PLC sensors. The plant may be large, such as a refinery, or small, such as a dairy farm.

“Supervisory Control and Data Acquisition” (SCADA) applications from companies such as Inductive Automation and Rockwell Automation are used to both control and monitor the PLCs. SCADA apps provide GUIs for control and monitoring, with a focus on real-time plant management. The data is typically stored both in memory for real-time use and in a legacy “local” RDBMS such as Postgres, SQL Server or Oracle for history. The data may be stored in specialized time series formats, depending on the vendor.

The SCADA databases are designed for minimal maintenance and administration and, although technically feasible, are not suitable for use as a more generalized data warehouse. The SCADA applications provide a great deal of BI-style functionality, but it is specific to near-real-time monitoring and management needs.

PLC data is standardized via the Open Platform Communications Unified Architecture (OPC UA). Programming a PLC includes defining meta-data for each sensor and control. While this meta-data may include a detailed description, it is typically not captured in the SCADA’s RDBMS.

Each data sensor in a PLC has a “tagpath” defined in the SCADA app which uniquely identifies the sensor. A data sample consists of the tagpath identifier, the timestamp, and the sample value. Sensors and controls are referred to as “tags”.

We would like to store this data in a more general data warehouse, and Snowflake is the ideal solution. We want to enhance and enrich the data to perform deeper and broader historical analysis.

We will look at a simplified real-world use case. Dairy farms produce an enormous amount of manure and, with it, greenhouse gases, primarily methane. Methane can be converted into “renewable” natural gas (RNG) and sold via pipelines for many uses. RNG also earns EPA renewable energy credits. This is a complex process involving collecting the gas from the manure (“digesting”), removing the impurities (“upgrading”) and injecting or trucking the resulting salable natural gas to a pipeline.

McDonald’s farm has such a plant, with 350 tags from several PLCs. Tags include such things as:

  • Fan speed
  • Temperature at various places
  • Pressure at various places
  • Humidity
  • Gas quality
  • Gas flows

Tagpaths are generally cryptic, e.g.

  • mcdonalds_psa/pt0963/processvalue
  • mcdonalds_psa/pt0974/processvalue
  • mcdonalds_plant/pt_01
  • mcdonalds_plant/pt_03
  • mcdonalds_plant/pt_04

The business would appreciate more descriptive terms for their analysis. Let’s examine a slightly simplified data model and process for supporting the business’ needs.

The SCADA app collects the tag samples into a Postgres database, which is loaded into Snowflake daily. The SAMPLE_DATA table consists of 3 columns:

TAG_ID INTEGER,
SAMPLE_TIME TIMESTAMP,
SAMPLE_VALUE FLOAT

The data is loaded without any transformation, resulting in a tall, skinny table.
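
For reference, a minimal sketch of the table and its daily load, assuming a hypothetical external stage (@SAMPLE_STAGE) holding the Postgres extract as CSV; your actual ingestion (Snowpipe, an ETL tool, etc.) may differ:

CREATE TABLE IF NOT EXISTS SAMPLE_DATA (
    TAG_ID       INTEGER,
    SAMPLE_TIME  TIMESTAMP,
    SAMPLE_VALUE FLOAT
);

-- Hypothetical daily load; stage name and file format are illustrative
COPY INTO SAMPLE_DATA
FROM @SAMPLE_STAGE/daily/
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);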

Some aspects of a tag, such as its sampling rate, may change over time. A second table from the SCADA app, TAG_PATH, provides a Type 2 dimension mapping the TAG_ID to the tagpath and its effective date range.

TAG_ID INTEGER,
TAGPATH STRING,
CREATION_DATE TIMESTAMP,
RETIREMENT_DATE TIMESTAMP
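
As a sketch, a point-in-time lookup against this dimension, assuming the currently active row carries a NULL RETIREMENT_DATE (the TAG_ID value here is hypothetical):

-- Which tagpath did TAG_ID 42 refer to on a given date?
SELECT TAGPATH
FROM TAG_PATH
WHERE TAG_ID = 42
  AND '2020-09-06' >= CREATION_DATE
  AND '2020-09-06' < COALESCE(RETIREMENT_DATE, CURRENT_TIMESTAMP());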

For analytics, we want a table or VIEW with the columns for the desired samples, i.e. pivoted data. Only a limited number of tags are of interest for any given area of analysis. Providing a simple user request and subsequent service process is a key goal for our architecture.

The third table is TAG_DETAIL, which contains the enriched description of the tag. For our purposes, we will use the following columns:

TAGPATH STRING,
DESCRIPTIVE_TAG_NAME STRING

The sample data for our example:

TAGPATH                             DESCRIPTIVE_TAG_NAME
mcdonalds_plant/pt_01               Pressure Transmitter #1 - From Digester
mcdonalds_plant/pt_03               Pressure Transmitter #3 - Sulfur Trap
mcdonalds_plant/pt_04               Pressure Transmitter #4 - To Pipeline
mcdonalds_psa/pt0963/processvalue   PSA Pressure Transmitter 0963 To Vacuum Pumps
mcdonalds_psa/pt0974/processvalue   PSA Pressure Transmitter 0974 post PSA (to pipeline and/or flare)

The TAG_DETAIL data is created manually and loaded via spreadsheets. Note that DESCRIPTIVE_TAG_NAME is used to create column names; avoid single or double quotes in these names to prevent problems with the resulting VIEW’s column names.
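
Since these names arrive via spreadsheet, a quick check after each load catches offending quotes before they reach any generated DDL:

-- Flag descriptive names containing single or double quotes
SELECT TAGPATH, DESCRIPTIVE_TAG_NAME
FROM TAG_DETAIL
WHERE DESCRIPTIVE_TAG_NAME LIKE '%''%'
   OR DESCRIPTIVE_TAG_NAME LIKE '%"%';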

Simply joining the three tables would be not only unwieldy but also nearly useless for analysis. The analyst needs to compare tags over the same time period, but the tags may have different sampling rates. As RNG production is a fairly slow process, the desired data are each tag’s average, min and max by quarter hour.

We also have a calendar dimension, DIM_SAMPLE_CALENDAR.

The first step is to create a VIEW that aggregates the raw data into quarter-hour statistics.

CREATE OR REPLACE VIEW
VW_SAMPLE_QUARTER_HOUR_STATS COPY GRANTS AS
SELECT
    DET.DESCRIPTIVE_TAG_NAME,
    STATS.SAMPLE_QRTR_HOUR,
    STATS.SAMPLE_HOUR,
    STATS.SAMPLE_DATE,
    CAL.SAMPLE_WEEK,
    CAL.SAMPLE_MONTH,
    CAL.SAMPLE_YEAR,
    STATS."Quarter Hour Avg",
    STATS."Quarter Hour Min",
    STATS."Quarter Hour Max"
FROM
    (   SELECT
            PATH.TAGPATH,
            TRUNC(AVG(SAMPLE_VALUE),4) "Quarter Hour Avg",
            TRUNC(MIN(SAMPLE_VALUE),4) "Quarter Hour Min",
            TRUNC(MAX(SAMPLE_VALUE),4) "Quarter Hour Max",
            -- bucket each sample into its quarter hour
            TIMEADD(MINUTE,
            CASE
                WHEN MINUTE(SAMPLE_TIME) BETWEEN 0 AND 14
                THEN 0
                WHEN MINUTE(SAMPLE_TIME) BETWEEN 15 AND 29
                THEN 15
                WHEN MINUTE(SAMPLE_TIME) BETWEEN 30 AND 44
                THEN 30
                ELSE 45
            END, DATE_TRUNC('hour', SAMPLE_TIME)) AS SAMPLE_QRTR_HOUR,
            DATE_TRUNC('hour', SAMPLE_TIME) SAMPLE_HOUR,
            TO_DATE(SAMPLE_TIME)            SAMPLE_DATE
        FROM
            SAMPLE_DATA DAT
        JOIN TAG_PATH PATH
            ON PATH.TAG_ID = DAT.TAG_ID
        GROUP BY
            1,
            5,
            6,
            7) STATS
LEFT JOIN
    TAG_DETAIL DET
ON
    STATS.TAGPATH = DET.TAGPATH
LEFT JOIN
    DIM_SAMPLE_CALENDAR CAL
ON
    STATS.SAMPLE_DATE = CAL.SAMPLE_DATE;
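
As an aside, Snowflake’s built-in TIME_SLICE function can replace the CASE expression for the quarter-hour bucket; a sketch of the equivalent expression:

-- Equivalent quarter-hour bucket via TIME_SLICE
SELECT
    TAG_ID,
    TIME_SLICE(SAMPLE_TIME, 15, 'MINUTE') AS SAMPLE_QRTR_HOUR
FROM SAMPLE_DATA;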

This gives us a tall, skinny source object for developing our pivoted analysis VIEWs. The basic pivot looks like this:

SELECT
    t1.*
FROM
    (   SELECT
            SAMPLE_QRTR_HOUR,
            SAMPLE_HOUR,
            SAMPLE_DATE,
            SAMPLE_WEEK,
            SAMPLE_MONTH,
            SAMPLE_YEAR,
            DESCRIPTIVE_TAG_NAME,
            "Quarter Hour Avg" -- other options are "Quarter Hour Min" or "Quarter Hour Max"
        FROM
            VW_SAMPLE_QUARTER_HOUR_STATS) PIVOT (AVG ("Quarter Hour Avg") FOR DESCRIPTIVE_TAG_NAME
    IN ('Digester Gas Analyzer CH4 %',…)) t1 -- more columns as desired

Note that, due to the aggregation by quarter hour, there is only one row per tag per SAMPLE_QRTR_HOUR, so the AVG in the PIVOT simply passes each value through.
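
If in doubt, a quick check confirms the uniqueness:

-- Should return zero rows
SELECT SAMPLE_QRTR_HOUR, DESCRIPTIVE_TAG_NAME, COUNT(*)
FROM VW_SAMPLE_QUARTER_HOUR_STATS
GROUP BY 1, 2
HAVING COUNT(*) > 1;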

We want a simple request-and-service process that also avoids the headache of the embedded single quotes that PIVOT leaves in the resulting column names.

We are lazy, so we’re going to create a simple PIVOT_GENERATOR table, with the following columns:

VIEW_NAME STRING, -- the name of the VIEW to be generated
TAG_DESCRIPTIVE_NAME STRING -- the desired descriptive column names from TAG_DETAIL

We give the user a spreadsheet of the data in TAG_DETAIL, and have them return a spreadsheet with a VIEW name to be created (or modified) and the desired DESCRIPTIVE_TAG_NAMEs.

Example:

VIEW_NAME     TAG_DESCRIPTIVE_NAME
VW_PRESSURE   Pressure Transmitter #1 - From Digester
VW_PRESSURE   Pressure Transmitter #3 - Sulfur Trap
VW_PRESSURE   Pressure Transmitter #4 - To Pipeline
VW_PRESSURE   PSA Pressure Transmitter 0963 To Vacuum Pumps
VW_PRESSURE   PSA Pressure Transmitter 0974 post PSA (to pipeline and/or flare)

This data is loaded into the PIVOT_GENERATOR table, typically via spreadsheet.

We create a simple UDF that takes the desired PIVOT_GENERATOR.VIEW_NAME and generates a CREATE VIEW statement with the desired descriptive column names, using an outer query that removes the annoying single quotes.

CREATE OR REPLACE FUNCTION UDF_GENERATE_QRTR_HOUR_VIEW(NEW_VIEW VARCHAR) 
RETURNS TABLE (DDL_LINE VARCHAR)
    LANGUAGE SQL
AS
    $$
    WITH
        NAMES AS
        (   SELECT
                '''' || DESCRIPTIVE_TAG_NAME || '''' PIVOT_NAME,   -- single-quoted: the name PIVOT generates
                '"'||DESCRIPTIVE_TAG_NAME||'"'       VIEW_COLUMN,  -- double-quoted: the clean column name
                DESCRIPTIVE_TAG_NAME
            FROM
                PIVOT_GENERATOR
            WHERE
                VIEW_NAME = NEW_VIEW)
    SELECT
        'CREATE OR REPLACE VIEW ' || NEW_VIEW||' COPY GRANTS AS SELECT'
     
    UNION ALL
     
    SELECT
        'SAMPLE_QRTR_HOUR,
SAMPLE_HOUR,
SAMPLE_DATE,
SAMPLE_WEEK,
SAMPLE_MONTH,
SAMPLE_YEAR,'
    
    UNION ALL
    
    SELECT
        LISTAGG( '"'||PIVOT_NAME||'"'||' AS '||VIEW_COLUMN,',')
    FROM
        NAMES
    
    UNION ALL
    
    SELECT
        'FROM
( SELECT
t1.*
FROM
(   SELECT
SAMPLE_QRTR_HOUR,
SAMPLE_HOUR,
SAMPLE_DATE,
SAMPLE_WEEK,
SAMPLE_MONTH,
SAMPLE_YEAR,
DESCRIPTIVE_TAG_NAME,
"Quarter Hour Avg"
FROM
"VW_SAMPLE_QUARTER_HOUR_STATS") PIVOT (AVG
("Quarter Hour Avg") FOR "Tag Description" IN ('
    
    UNION ALL
    
    SELECT
        LISTAGG(PIVOT_NAME,',')
    FROM
        NAMES
    
    UNION ALL
    
    SELECT
        '))t1) ORDER BY 1'$$;

We run the UDF with the desired VIEW_NAME:

SELECT *
FROM TABLE(UDF_GENERATE_QRTR_HOUR_VIEW('VW_PRESSURE'));

The result is the CREATE OR REPLACE VIEW statement, returned one line per row. Formatted, it reads:

CREATE OR REPLACE VIEW 
    VW_PRESSURE COPY GRANTS AS 
SELECT
    SAMPLE_QRTR_HOUR,
    SAMPLE_HOUR,
    SAMPLE_DATE,
    SAMPLE_WEEK,
    SAMPLE_MONTH,
    SAMPLE_YEAR,
    "'Pressure Transmitter #1 - From Digester'"       AS "Pressure Transmitter #1 - From Digester", 
    "'Pressure Transmitter #3 - Sulfur Trap'"         AS "Pressure Transmitter #3 - Sulfur Trap", 
    "'Pressure Transmitter #4 - To Pipeline'"         AS "Pressure Transmitter #4 - To Pipeline", 
    "'PSA Pressure Transmitter 0963 To Vacuum Pumps'" AS 
    "PSA Pressure Transmitter 0963 To Vacuum Pumps", 
    "'PSA Pressure Transmitter 0974 post PSA (to pipeline and/or flare)'" AS 
    "PSA Pressure Transmitter 0974 post PSA (to pipeline and/or flare)"
FROM
(   SELECT
    t1.*
        FROM
            (   SELECT
            SAMPLE_QRTR_HOUR,
            SAMPLE_HOUR,
            SAMPLE_DATE,
            SAMPLE_WEEK,
            SAMPLE_MONTH,
            SAMPLE_YEAR,
            DESCRIPTIVE_TAG_NAME,
            "Quarter Hour Avg"
            FROM
            "VW_SAMPLE_QUARTER_HOUR_STATS") PIVOT (AVG ("Quarter Hour Avg") FOR 
            DESCRIPTIVE_TAG_NAME IN 
             ( 'Pressure Transmitter #1 - From Digester', 
               'Pressure Transmitter #3 - Sulfur Trap', 
               'Pressure Transmitter #4 - To Pipeline', 
               'PSA Pressure Transmitter 0963 To Vacuum Pumps', 
               'PSA Pressure Transmitter 0974 post PSA (to pipeline and/or flare)'
               ))t1) 
ORDER BY 
    1
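
The UDF only generates the text; the statement still has to be executed. Pasting the output into a worksheet works, or the call can be wrapped in a Snowflake Scripting anonymous block, sketched below. Note the sketch assumes the UDF returns its lines in generation order, which has held in practice but is not formally guaranteed without an explicit line-number column:

EXECUTE IMMEDIATE $$
DECLARE
    ddl STRING;
BEGIN
    -- Assemble the generated lines into a single statement
    SELECT LISTAGG(DDL_LINE, '\n')
    INTO :ddl
    FROM TABLE(UDF_GENERATE_QRTR_HOUR_VIEW('VW_PRESSURE'));
    -- Run the CREATE OR REPLACE VIEW
    EXECUTE IMMEDIATE ddl;
    RETURN 'VW_PRESSURE created';
END;
$$;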

As a test, let’s look at the daily averages of the pivoted columns for the latest day:

SELECT
    SAMPLE_DATE,
    AVG("Pressure Transmitter #1 - From Digester"),
    AVG("Pressure Transmitter #3 - Sulfur Trap"),
    AVG("Pressure Transmitter #4 - To Pipeline"),
    AVG("PSA Pressure Transmitter 0963 To Vacuum Pumps"),
    AVG("PSA Pressure Transmitter 0974 post PSA (to pipeline and/or flare)")
FROM VW_PRESSURE
WHERE SAMPLE_DATE = '2020-09-06'
GROUP BY 1;

The query returns a single row, shown transposed here for readability:

SAMPLE_DATE                                                                2020-09-06
AVG("Pressure Transmitter #1 - From Digester")                             0.749943548
AVG("Pressure Transmitter #3 - Sulfur Trap")                               0.379994
AVG("Pressure Transmitter #4 - To Pipeline")                               -0.53716
AVG("PSA Pressure Transmitter 0963 To Vacuum Pumps")                       -0.07077
AVG("PSA Pressure Transmitter 0974 post PSA (to pipeline and/or flare)")   0.119654

We now have a simple, maintainable, repeatable process providing understandable data for analysis.

Author: Jeffrey Jacobs, jmjacobs@jeffreyjacobs.com

Copyright © Jeffrey Jacobs, 2021
