In this article, we discuss building an architecture for performing user-friendly historical analysis of Industrial Internet of Things (IIOT) data.
IIOT data is a sub-category of Internet of Things (IOT) data. IOT data is gathered from a large number of similar or identical devices. For our purposes, IIOT data is collected from Programmable Logic Controllers (PLCs) used to control and monitor equipment in “industrial” plants. Each PLC typically manages and monitors a well-defined set of steps in the overall process. Multiple data samples are gathered from multiple PLC sensors. The plant may be large, such as a refinery, or small, such as a dairy farm.
“Supervisory Control and Data Acquisition” (SCADA) applications from companies such as Inductive Automation and Rockwell Automation are used to both control and monitor the PLCs. SCADA apps provide GUIs for control and monitoring, with a focus on real-time plant management. The data is typically stored both in memory for real-time use and in a legacy “local” RDBMS such as Postgres, SQL Server, or Oracle for history. Depending on the vendor, the data may also be stored in specialized time series formats.
The SCADA databases are intended for minimal maintenance and administration, and, although technically feasible, are not suitable for use as a more generalized data warehouse. The SCADA applications provide a great deal of BI style functionality, but this is very specific to near real time monitoring and management needs.
PLC data is standardized via the Open Platform Communications Unified Architecture (OPC UA). Programming a PLC includes defining metadata for each sensor and control. While this metadata may include a detailed description, the description is typically not captured in the SCADA’s RDBMS.
Each data sensor in a PLC has a “tagpath” defined in the SCADA app which uniquely identifies the sensor. A data sample consists of the tagpath identifier, the timestamp, and the sample value. Sensors and controls are referred to as “tags”.
We would like to store this data in a more general data warehouse, and Snowflake is the ideal solution. We want to enhance and enrich the data to perform deeper and broader historical analysis.
We will look at a slightly simplified real-world use case. Dairy farms produce an enormous amount of manure and, as a result, greenhouse gases, primarily methane. Methane can be converted into “renewable” natural gas (RNG) and sold via pipelines for many uses; RNG also earns EPA renewable energy credits. This is a complex process involving collecting the gas from the manure (“digesting”), removing the impurities (“upgrading”), and injecting or trucking the resulting salable natural gas to a pipeline.
McDonald’s farm has such a plant, with 350 tags from several PLCs. Tags include such things as:
- Fan speed
- Temperature at various places
- Pressure at various places
- Gas quality
- Gas flows
Tagpaths are generally cryptic, e.g. mcdonalds_psa/pt0963/processvalue.
The business would appreciate more descriptive terms for their analysis. Let’s examine a slightly simplified data model and process for supporting the business’ needs.
The SCADA app collects the tag samples into a Postgres database, which is loaded into Snowflake daily. The SAMPLE_DATA table consists of 3 columns:
TAG_ID INTEGER, SAMPLE_TIME TIMESTAMP, SAMPLE_VALUE FLOAT
The data is loaded without any transformation, resulting in a tall, skinny table.
Some aspects, such as sampling rate, may change for a tag. A second table, from the SCADA app, TAG_PATH, provides a Type 2 dimension, mapping the TAG_ID to the tagpath and its effective date range.
TAG_ID INTEGER, TAGPATH STRING, CREATION_DATE TIMESTAMP, RETIREMENT_DATE TIMESTAMP
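To illustrate how this Type 2 dimension resolves a sample to its effective tagpath, here is a minimal Python sketch (the rows and the second tagpath are hypothetical, purely for illustration; an open-ended RETIREMENT_DATE is modeled as None):

```python
from datetime import datetime

# Hypothetical TAG_PATH rows: (TAG_ID, TAGPATH, CREATION_DATE, RETIREMENT_DATE)
TAG_PATH = [
    (17, "mcdonalds_plant/pt_01",  datetime(2019, 1, 1), datetime(2020, 6, 1)),
    (17, "mcdonalds_plant/pt_01b", datetime(2020, 6, 1), None),
]

def resolve_tagpath(tag_id, sample_time):
    """Return the tagpath effective for tag_id at sample_time (Type 2 lookup)."""
    for tid, tagpath, created, retired in TAG_PATH:
        if tid == tag_id and created <= sample_time and (retired is None or sample_time < retired):
            return tagpath
    return None
```

In Snowflake, the equivalent is a join on TAG_ID with a predicate on the effective date range.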
For analytics, we want a table or VIEW with the columns for the desired samples, i.e. pivoted data. Only a limited number of tags are of interest for any given area of analysis. Providing a simple user request and subsequent service process is a key goal for our architecture.
The third table is TAG_DETAIL, which contains the enriched description of the tag. For our purposes, we will use the following columns:
TAGPATH STRING, DESCRIPTIVE_TAG_NAME STRING
With the sample data for our example:
| TAGPATH | DESCRIPTIVE_TAG_NAME |
| --- | --- |
| mcdonalds_plant/pt_01 | Pressure Transmitter #1 – From Digester |
| mcdonalds_plant/pt_03 | Pressure Transmitter #3 – Sulfur Trap |
| mcdonalds_plant/pt_04 | Pressure Transmitter #4 – To Pipeline |
| mcdonalds_psa/pt0963/processvalue | PSA Pressure Transmitter 0963 To Vacuum Pumps |
| mcdonalds_psa/pt0974/processvalue | PSA Pressure Transmitter 0974 post PSA (to pipeline and/or flare) |
The TAG_DETAIL data is created manually and loaded via spreadsheets. Note that DESCRIPTIVE_TAG_NAME is used to create column names; avoid single or double quotes in these names to prevent problems with the resulting VIEW’s column names.
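Since the spreadsheet is hand-maintained, a pre-load check for quote characters is cheap insurance. A minimal Python sketch (the function name is ours, not part of any library):

```python
def invalid_tag_names(names):
    """Return descriptive tag names containing single or double quotes,
    which would break the generated VIEW's quoted column names."""
    return [n for n in names if "'" in n or '"' in n]
```

Rows flagged by such a check can be corrected before loading TAG_DETAIL.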
Simply joining the three tables would be not only unwieldy but also useless. The analyst needs to look at tag data over the same period, but the tags may have different sampling periods. As RNG production is a fairly slow process, the desired data are each tag’s average, min, and max by quarter hour.
We also have a calendar dimension, DIM_SAMPLE_CALENDAR.
The first step is to create a VIEW on the raw data itself.
```sql
CREATE OR REPLACE VIEW VW_SAMPLE_QUARTER_HOUR COPY GRANTS AS
SELECT
    DET.DESCRIPTIVE_TAG_NAME,
    STATS.SAMPLE_QRTR_HOUR,
    STATS.SAMPLE_HOUR,
    STATS.SAMPLE_DATE,
    CAL.SAMPLE_WEEK,
    CAL.SAMPLE_MONTH,
    CAL.SAMPLE_YEAR,
    STATS."Quarter Hour Avg",
    STATS."Quarter Hour Min",
    STATS."Quarter Hour Max"
FROM (
    SELECT
        PATH.TAGPATH,
        TRUNC(AVG(SAMPLE_VALUE), 4) "Quarter Hour Avg",
        TRUNC(MIN(SAMPLE_VALUE), 4) "Quarter Hour Min",
        TRUNC(MAX(SAMPLE_VALUE), 4) "Quarter Hour Max",
        TIMEADD(MINUTE,
                CASE
                    WHEN MINUTE(SAMPLE_TIME) BETWEEN 0 AND 14 THEN 0
                    WHEN MINUTE(SAMPLE_TIME) BETWEEN 15 AND 29 THEN 15
                    WHEN MINUTE(SAMPLE_TIME) BETWEEN 30 AND 44 THEN 30
                    ELSE 45
                END,
                DATE_TRUNC('hour', SAMPLE_TIME)) AS SAMPLE_QRTR_HOUR,
        DATE_TRUNC('hour', SAMPLE_TIME) SAMPLE_HOUR,
        TO_DATE(SAMPLE_TIME) SAMPLE_DATE
    FROM SAMPLE_DATA DAT
    JOIN TAG_PATH PATH ON PATH.TAG_ID = DAT.TAG_ID
    GROUP BY 1, 5, 6, 7
) STATS
LEFT JOIN TAG_DETAIL DET ON STATS.TAGPATH = DET.TAGPATH
LEFT JOIN DIM_SAMPLE_CALENDAR CAL ON STATS.SAMPLE_DATE = CAL.SAMPLE_DATE;
```
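The SAMPLE_QRTR_HOUR expression (truncate to the hour, then add 0, 15, 30, or 45 minutes) can be sanity-checked outside Snowflake. A minimal Python equivalent of that bucketing logic:

```python
from datetime import datetime, timedelta

def quarter_hour_bucket(ts):
    """Floor a timestamp to its 15-minute boundary, mirroring the
    TIMEADD/CASE expression in the view."""
    offset = (ts.minute // 15) * 15  # 0, 15, 30, or 45
    return ts.replace(minute=0, second=0, microsecond=0) + timedelta(minutes=offset)
```

For example, 10:14:59 and 10:00:00 both land in the 10:00 bucket, while 10:15:00 starts the next one.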
This gives us a tall, skinny source object for developing our pivoted analysis VIEWs. The basic pivot looks like this:
```sql
SELECT t1.*
FROM (
    SELECT SAMPLE_QRTR_HOUR, SAMPLE_HOUR, SAMPLE_DATE, SAMPLE_WEEK,
           SAMPLE_MONTH, SAMPLE_YEAR, DESCRIPTIVE_TAG_NAME,
           "Quarter Hour Avg" -- other options are "Quarter Hour Min" or "Quarter Hour Max"
    FROM VW_SAMPLE_QUARTER_HOUR
) PIVOT (AVG("Quarter Hour Avg")
         FOR DESCRIPTIVE_TAG_NAME IN ('Digester Gas Analyzer CH4 %', …)) t1 -- more columns
```
Note that, due to the aggregation by quarter hour, there is only one row per tag per SAMPLE_QRTR_HOUR.
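The effect of the PIVOT (one row per quarter-hour bucket, one column per tag) can be illustrated with a small Python sketch over hypothetical tall rows:

```python
def pivot_samples(rows, tag_names):
    """Pivot (bucket, tag_name, value) tuples into one dict per bucket,
    with a column per requested tag (None where no sample exists)."""
    out = {}
    for bucket, tag, value in rows:
        if tag in tag_names:
            out.setdefault(bucket, {t: None for t in tag_names})[tag] = value
    return out

rows = [
    ("10:00", "Fan speed", 120.0),
    ("10:00", "Gas quality", 97.5),
    ("10:15", "Fan speed", 118.0),
]
wide = pivot_samples(rows, ["Fan speed", "Gas quality"])
```

Tags with no sample in a bucket appear as NULL (None here), just as in the SQL pivot.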
We want a simple, repeatable process that also avoids the headache of the embedded single quotes in the resulting column names.
We are lazy, so we’re going to create a simple PIVOT_GENERATOR table with the following columns:

VIEW_NAME STRING, -- the name of the VIEW to be generated
DESCRIPTIVE_TAG_NAME STRING -- the desired descriptive column names from TAG_DETAIL
We give the user a spreadsheet of the data in TAG_DETAIL, and have them return a spreadsheet with a VIEW name to be created (or modified) and the desired DESCRIPTIVE_TAG_NAMEs.
| VIEW_NAME | DESCRIPTIVE_TAG_NAME |
| --- | --- |
| VW_PRESSURE | Pressure Transmitter #1 – From Digester |
| VW_PRESSURE | Pressure Transmitter #3 – Sulfur Trap |
| VW_PRESSURE | Pressure Transmitter #4 – To Pipeline |
| VW_PRESSURE | PSA Pressure Transmitter 0963 To Vacuum Pumps |
| VW_PRESSURE | PSA Pressure Transmitter 0974 post PSA (to pipeline and/or flare) |
This data is loaded into the PIVOT_GENERATOR table, typically via spreadsheet.
We create a simple UDF that, given the desired PIVOT_GENERATOR.VIEW_NAME, generates a CREATE VIEW statement with the desired descriptive column names, using an outer query that removes the annoying single quotes.
```sql
CREATE OR REPLACE FUNCTION UDF_GENERATE_QRTR_HOUR_VIEW(NEW_VIEW VARCHAR)
RETURNS TABLE (DDL_LINE VARCHAR)
LANGUAGE SQL
AS
$$
WITH NAMES AS (
    SELECT '''' || DESCRIPTIVE_TAG_NAME || '''' PIVOT_NAME,
           '"' || DESCRIPTIVE_TAG_NAME || '"' VIEW_COLUMN,
           DESCRIPTIVE_TAG_NAME
    FROM PIVOT_GENERATOR
    WHERE VIEW_NAME = NEW_VIEW)
SELECT 'CREATE OR REPLACE VIEW ' || NEW_VIEW || ' COPY GRANTS AS SELECT'
UNION ALL
SELECT 'SAMPLE_QRTR_HOUR, SAMPLE_HOUR, SAMPLE_DATE, SAMPLE_WEEK, SAMPLE_MONTH, SAMPLE_YEAR,'
UNION ALL
SELECT LISTAGG('"' || PIVOT_NAME || '"' || ' AS ' || VIEW_COLUMN, ',') FROM NAMES
UNION ALL
SELECT 'FROM ( SELECT t1.* FROM ( SELECT SAMPLE_QRTR_HOUR, SAMPLE_HOUR, SAMPLE_DATE, SAMPLE_WEEK, SAMPLE_MONTH, SAMPLE_YEAR, DESCRIPTIVE_TAG_NAME, "Quarter Hour Avg" FROM VW_SAMPLE_QUARTER_HOUR) PIVOT (AVG ("Quarter Hour Avg") FOR DESCRIPTIVE_TAG_NAME IN ('
UNION ALL
SELECT LISTAGG(PIVOT_NAME, ',') FROM NAMES
UNION ALL
SELECT ')) t1 ) ORDER BY 1'
$$;
```
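The quoting trick is the subtle part: each name quoted as 'name' in the IN list becomes a pivoted column literally named "'name'", which the outer SELECT aliases back to the clean "name". The two LISTAGG string manipulations can be mimicked in Python (function names are ours, for illustration only):

```python
def column_clause(descriptive_names):
    """Build the outer-SELECT alias list the UDF's first LISTAGG produces:
    "'name'" AS "name" for each descriptive tag name."""
    return ",".join('"\'%s\'" AS "%s"' % (n, n) for n in descriptive_names)

def in_list(descriptive_names):
    """Build the PIVOT ... IN (...) list: each name as a quoted literal."""
    return ",".join("'%s'" % n for n in descriptive_names)
```

Tracing these two helpers against the generated DDL below shows exactly where each quote character comes from.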
We run the UDF with the desired VIEW_NAME:
SELECT * FROM TABLE(UDF_GENERATE_QRTR_HOUR_VIEW('VW_PRESSURE'));
The result is the CREATE OR REPLACE VIEW statement. Formatting the result gives us:
```sql
CREATE OR REPLACE VIEW VW_PRESSURE COPY GRANTS AS
SELECT
    SAMPLE_QRTR_HOUR, SAMPLE_HOUR, SAMPLE_DATE, SAMPLE_WEEK, SAMPLE_MONTH, SAMPLE_YEAR,
    "'Pressure Transmitter #1 - From Digester'" AS "Pressure Transmitter #1 - From Digester",
    "'Pressure Transmitter #3 - Sulfur Trap'" AS "Pressure Transmitter #3 - Sulfur Trap",
    "'Pressure Transmitter #4 - To Pipeline'" AS "Pressure Transmitter #4 - To Pipeline",
    "'PSA Pressure Transmitter 0963 To Vacuum Pumps'" AS "PSA Pressure Transmitter 0963 To Vacuum Pumps",
    "'PSA Pressure Transmitter 0974 post PSA (to pipeline and/or flare)'" AS "PSA Pressure Transmitter 0974 post PSA (to pipeline and/or flare)"
FROM (
    SELECT t1.*
    FROM (
        SELECT SAMPLE_QRTR_HOUR, SAMPLE_HOUR, SAMPLE_DATE, SAMPLE_WEEK,
               SAMPLE_MONTH, SAMPLE_YEAR, DESCRIPTIVE_TAG_NAME, "Quarter Hour Avg"
        FROM VW_SAMPLE_QUARTER_HOUR
    ) PIVOT (AVG("Quarter Hour Avg") FOR DESCRIPTIVE_TAG_NAME IN (
        'Pressure Transmitter #1 - From Digester',
        'Pressure Transmitter #3 - Sulfur Trap',
        'Pressure Transmitter #4 - To Pipeline',
        'PSA Pressure Transmitter 0963 To Vacuum Pumps',
        'PSA Pressure Transmitter 0974 post PSA (to pipeline and/or flare)'
    )) t1
)
ORDER BY 1
```
As a test, let’s look at the averages for the columns for the latest day:
```sql
SELECT SAMPLE_DATE,
       AVG("Pressure Transmitter #1 - From Digester"),
       AVG("Pressure Transmitter #3 - Sulfur Trap"),
       AVG("Pressure Transmitter #4 - To Pipeline"),
       AVG("PSA Pressure Transmitter 0963 To Vacuum Pumps"),
       AVG("PSA Pressure Transmitter 0974 post PSA (to pipeline and/or flare)")
FROM VW_PRESSURE
WHERE SAMPLE_DATE = '2020-09-06'
GROUP BY 1;
```
The query results are:
| SAMPLE_DATE | AVG("PRESSURE TRANSMITTER #1 – FROM DIGESTER") | AVG("PRESSURE TRANSMITTER #3 – SULFUR TRAP") | AVG("PRESSURE TRANSMITTER #4 – TO PIPELINE") | AVG("PSA PRESSURE TRANSMITTER 0963 TO VACUUM PUMPS") | AVG("PSA PRESSURE TRANSMITTER 0974 POST PSA (TO PIPELINE AND/OR FLARE)") |
| --- | --- | --- | --- | --- | --- |
We now have a simple, maintainable, repeatable process providing understandable data for analysis.
Author: Jeffrey Jacobs, firstname.lastname@example.org
Copyright © Jeffrey Jacobs, 2021