Premise:

  • The analytic data pipeline was stable for decades, providing historical performance reporting on business transactions captured in enterprise applications.
  • We are in the midst of a transition toward Systems of Intelligence with “converged” analytics.
  • The new pipeline will encompass near real-time and batch responsiveness and span business intelligence and machine learning.
  • Data originating from smart, connected products (IoT) will get some low-latency analytics at the edge.  But analytics centralized in the cloud will make it possible to optimize device performance as part of a large ecosystem.
  • This new analytic pipeline will drive entirely new design decisions relative to traditional OLTP to data warehouse designs.

Three themes highlight the evolution of the analytic pipeline

The best way to understand the emerging analytic data pipeline is to look back and find the elements that are already changing and then extrapolate based on the new requirements to support the IoT.  Three themes play out:

Cost of data capture and management has to approach $0.

The cost of capturing data manually by keyboard has stayed roughly constant over several decades at $1/KB, or $1bn/TB.  But all new information, including analog data, is now captured digitally at a marginal cost approaching $0.  This volume and variety of data has to be matched by a data pipeline that fully leverages elastic clusters of commodity hardware and software with automated management, so that the cost of capture and management also approaches $0.
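The $1/KB and $1bn/TB figures are the same number expressed at different scales; a quick arithmetic sketch (using decimal units, an assumption) makes the equivalence explicit:

```python
# Manual keyboard capture has historically cost about $1 per KB.
COST_PER_KB_MANUAL = 1.0           # dollars per kilobyte
KB_PER_TB = 1_000_000_000          # 1 TB = 10^12 bytes = 10^9 KB (decimal units)

cost_per_tb_manual = COST_PER_KB_MANUAL * KB_PER_TB
print(f"Manual capture: ${cost_per_tb_manual:,.0f} per TB")
```

At $1 billion per terabyte, manual capture simply cannot keep pace with machine-generated data volumes, which is why the marginal cost of the new pipeline has to approach $0.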

Fast Data requires near real-time analytics.

The data pipeline needs to support the new velocity of data by providing near real-time responsiveness between capturing data and driving an action while still leveraging historical data to improve the context of analytics.

Converged analytics emerges.

Applications need to provide “converged” analytics, which supports both batch and near real-time responsiveness and spans business intelligence and machine learning on any type of data.

The rest of the research note is divided into two parts.  The first is for IT directors who need to understand the new principles behind pipelines between products and applications as opposed to traditional ones between applications.  The second, more detailed part, is for architects.  This part explains the new ingredients for those who will be responsible for building the new analytic pipelines.

For IT directors: summary recipe and ingredients for new IoT-ready analytic pipeline

Example application

GE’s Predix is a SaaS application providing a predictive maintenance service for industrial equipment from a wide range of vendors.  Continual sensor readings enable those vendors to monitor, maintain, and monetize an entire ecosystem of smart, connected products operated by their customers.

Analytic output

No longer just traditional business intelligence (BI) for human consumption, but increasingly near real-time machine learning that optimizes the performance of an ecosystem of smart devices.

Source data

“Messy” data with only some structure often originates in analog form from sensor readings on devices.  What structure it has often evolves over time, requiring more flexible management than traditional RDBMS’s provide.

Intermediate processing

Edge compute processing is required to handle intermittent connectivity from greatly decentralized data capture.  Edge compute must help filter the signal from the noise in order to optimize bandwidth and responsiveness.

Destination

Bidirectional.  From the edge up to the cloud and back out via a control channel.

Latency

Traditional pipelines were batch because BI workloads were historical.  New pipelines have to capture data, analyze it, make decisions, and optimize the response of smart devices in near real-time.  Historical and new data sets that add additional perspective to the analytics can be incorporated in batch mode.

For IT architects: detailed recipe and ingredients for new IoT-ready analytic pipeline

Analytic output: monitor, maintain, and monetize smart, connected products.  Traditional data warehouse pipelines typically supported business intelligence workloads flowing in one direction from source applications.  The new pipelines, by contrast, need to support predictive analytics and machine learning and flow back out to the source devices in order to optimize how an ecosystem of connected products operates.

  • Machine learning for performance optimization of connected devices as part of an ecosystem.  More data from a larger network of devices improves predictive accuracy.
  • Business intelligence workloads such as reports, OLAP, dashboards for human tracking of performance.
  • Business value: topline via differentiation, bottom line via efficiency of product/service delivery.
    • Monitor: understand behavior of devices.
    • Maintain: support ability to upgrade connected devices.
    • Monetize: guide an action by an individual or a company.

Source data: doesn’t fit into the cookie cutter forms that get stored in tables from traditional OLTP apps and data warehouses.  The data can be less structured and, more important, its structure can evolve with new devices and new versions of sensor readings on devices.

  • Industrial data typically originates in analog forms such as motion, voltage, moisture, temperature, acceleration, and location.
  • Data must be converted to digital form, typically with a timestamp and a device identifier that together make each reading identifiably unique.
  • Data also typically has version information since data from sensors can change over time.
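A sensor reading of the kind described above might be represented as a JSON event.  The field names and values here are illustrative assumptions, not a Predix schema; the key point is that the device identifier and timestamp together make each reading identifiably unique, and a version field tracks the evolving structure:

```python
import json

# Hypothetical sensor event.  The (device_id, timestamp) pair uniquely
# identifies the reading; schema_version tracks evolving structure.
event = {
    "device_id": "turbine-0042",
    "timestamp": "2016-05-01T12:00:00Z",
    "schema_version": 2,
    "readings": {                  # analog measurements converted to digital
        "temperature_c": 71.3,
        "vibration_mm_s": 4.8,
        "voltage_v": 418.2,
    },
}

key = (event["device_id"], event["timestamp"])  # unique identity of the reading
payload = json.dumps(event)                     # wire format sent downstream
```

A newer device model could add fields under `readings` without breaking older consumers, which is exactly the flexibility a fixed relational schema struggles to offer.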

Intermediate processing: edge compute will often have to bring some intelligence close to the source devices.  Its primary role is to provide secure connectivity to the cloud, filter the signal from the noise, buffer data when connectivity is intermittent, and process updates from the cloud when the definition of what is signal and what is noise changes.

  • Requires power and connectivity.  Connectivity can be intermittent if edge data buffering is available.
  • Downstream out-of-order processing is necessary if buffering prevents completely sequential delivery of sensor data.
  • Filtering and aggregation are necessary to eliminate the “noise” and only send changed data or anomalies for downstream processing.
  • Edge processing must have enough compute to receive updates from the cloud about what to filter out and changing definition of anomalies or data of interest.
  • New security model must enable authentication and authorization of devices and encryption of data from source to destination.
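The filtering, buffering, and cloud-updated rules described above can be sketched in a few dozen lines.  This is a minimal illustration, not a production edge runtime; the thresholds and the anomaly rule are assumptions standing in for whatever definitions the cloud pushes down:

```python
from collections import deque

class EdgeNode:
    """Minimal sketch of edge processing: filter noise, buffer on outage.

    The change threshold and anomaly limit are illustrative; in practice
    the cloud pushes updated definitions of signal vs. noise to the edge.
    """

    def __init__(self, change_threshold=0.5, anomaly_limit=90.0):
        self.change_threshold = change_threshold
        self.anomaly_limit = anomaly_limit
        self.last_sent = {}        # device_id -> last value forwarded upstream
        self.buffer = deque()      # holds events while connectivity is down

    def update_rules(self, change_threshold=None, anomaly_limit=None):
        # The cloud can change what counts as signal vs. noise.
        if change_threshold is not None:
            self.change_threshold = change_threshold
        if anomaly_limit is not None:
            self.anomaly_limit = anomaly_limit

    def ingest(self, device_id, value, connected=True):
        prev = self.last_sent.get(device_id)
        is_anomaly = value > self.anomaly_limit
        changed = prev is None or abs(value - prev) >= self.change_threshold
        if not (changed or is_anomaly):
            return None            # noise: drop locally to save bandwidth
        event = {"device_id": device_id, "value": value, "anomaly": is_anomaly}
        if connected:
            self.last_sent[device_id] = value
            return event           # signal: forward upstream immediately
        self.buffer.append(event)  # intermittent link: buffer for later
        return None

    def flush(self):
        # Connectivity restored: drain buffered events.  They may reach the
        # cloud out of order, so downstream processing must handle reordering.
        drained = list(self.buffer)
        self.buffer.clear()
        for e in drained:
            self.last_sent[e["device_id"]] = e["value"]
        return drained
```

Note how the out-of-order requirement from the bullet list falls directly out of the buffering behavior: anything held during an outage arrives downstream after newer live readings.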

Destination: Unlike traditional data warehouses where the destination supports business intelligence workloads, the primary consumer now becomes applications.  They optimize the operation of an entire ecosystem of smart, connected products by sending feedback, typically through a control channel.  Humans can be in the loop as part of the decision process and the applications can support business intelligence workloads, but that isn’t the default case.

Latency between source and analytic output: as low as possible and getting lower.  Unlike data warehouses that support business intelligence views of historical data, an effective feedback channel that can control smart, connected devices needs to be low latency.  The more that the intelligence is in the cloud, the lower the latency needed to reach the edge.

Storage: Because of the variety and evolving structure of the source data (see above), storage has to provide more flexibility and scale than the tables in traditional scale-up SQL RDBMS’s.  But above all, new storage approaches have to take the “messy” data that arrives in traditional Data Lakes and convert it under the covers as much as possible into something that looks like the more curated MPP SQL databases.  In other words, emerging storage technology needs to help diminish the data preparation process by converting raw, incoming data into data ready for analysis.

  • IoT Devices emit events that may have different metrics and dimensions over time.  And the range of sensors adds additional variety to the data coming in.
  • JSON has emerged as the preferred way of representing this data.  It has the flexibility to handle the variety of machine-generated information because it comes with a description of its structure.  It also is easy for developers to read when they’re working with it.
  • New database storage models can make managing JSON data much easier.  Data stores such as Key Value (DynamoDB), Time Series (Riak-TS), Wide (sparse) Columns (HBase or Cassandra), and Document (MongoDB, Azure DocumentDB) can handle high-volume streams of JSON events, each with varying sweet spots in their usage scenarios.
  • Some storage approaches such as HDFS and similar file systems (Azure Data Lake) ingest this data with little upfront processing and rely on data engineers and data scientists to progressively curate it with more structure.
  • Data volumes and velocity are orders of magnitude greater than BI workloads on data warehouses.  Storage needs to scale out elastically, accommodate appended data rather than updates, and provide high availability via multi-master replication.
  • Even streaming data that doesn’t accumulate in an unbounded collection of stored objects will have some summary or analytic derived data that needs storage for later access or re-processing.
  • Traditional SQL DBMS’s separate the logical view of the data, which the developer sees, from how it is stored physically within the DBMS.  The purpose of this approach is to store the data in the form that delivers the highest performance while presenting it to the developer in the form that is easiest to work with.  MPP SQL RDBMS’s designed for data warehousing store their tables in columns instead of the traditional rows, which makes scanning just the relevant columns much faster.  With JSON data, it’s typically not clear up front how the data will be accessed.  Databases in the future will have to analyze how applications are using the data and organize its storage to match those usage patterns.  Just “shredding” the JSON objects and force-fitting them into tables, as some legacy databases do, won’t turn “messy” data into something that’s curated for performance.
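The idea of a database watching how applications access JSON data and then reorganizing storage can be sketched very simply.  Everything here is a hypothetical illustration: the tracker records which field paths queries touch and recommends which ones to materialize as columns.

```python
from collections import Counter

class AccessTracker:
    """Illustrative sketch: record which JSON field paths queries touch,
    then recommend frequently accessed paths for columnar materialization."""

    def __init__(self):
        self.field_hits = Counter()

    def record_query(self, fields):
        # Called (hypothetically) by the query engine for each query,
        # with the list of JSON paths the query actually read.
        self.field_hits.update(fields)

    def columns_to_materialize(self, min_share=0.2):
        # Recommend any path responsible for at least min_share of accesses.
        total = sum(self.field_hits.values())
        if total == 0:
            return []
        return [f for f, n in self.field_hits.most_common()
                if n / total >= min_share]
```

A real system would weigh rewrite costs and storage overhead as well, but the shape of the feedback loop — observe usage, then curate physical layout — is the point the note is making.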

Query technology:  Transaction processing and support for SQL queries that hide any need to navigate the data together define the essence of traditional RDBMS’s.  But with the variety of structure in IoT data, typically stored in JSON format, there needs to be a compromise between SQL, which assumes tabular data, and something that can access the “messier” JSON data.

  • XQuery was once considered the magic bullet for dealing with data with complex structure.  But it never became mainstream: it was too difficult for most developers to work with, even though it was the most technically correct way of solving the problem.
  • What is emerging is a combination of SQL and a “dot notation” way of navigating JSON data.  Developers express what data they need in SQL, and the query returns the relevant JSON objects.  Unlike SQL databases, which return exactly what you specify, these new databases require the developer to go the last “10 yards”: navigating the returned JSON with dot notation to extract exactly what they want.  In a nutshell, SQL databases get you to your destination with just an address.  These new hybrid SQL/NoSQL databases get you most of the way with an address, and then you have to find the rest of your way with directions and a map.
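The “last 10 yards” can be made concrete with a small sketch.  The SQL text and field names below are hypothetical; the point is that the query returns whole JSON documents, and the developer navigates into them to extract the values of interest:

```python
import json

# Hypothetical result of a hybrid SQL/NoSQL query such as:
#   SELECT doc FROM readings WHERE doc.device_id = 'turbine-0042';
# The database returns matching JSON objects, not scalar columns.
rows = [
    json.loads('{"device_id": "turbine-0042", '
               '"readings": {"temperature_c": 71.3, "vibration_mm_s": 4.8}}'),
]

# The "last 10 yards": navigate the returned documents yourself
# (dot notation in the query language maps to key access here).
temps = [row["readings"]["temperature_c"] for row in rows]
```

In a pure SQL database, `temperature_c` would arrive as a column in the result set; here the address (the SQL predicate) gets you to the document, and the directions (the navigation) get you to the value.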

Management: Because these workloads include streaming analytics, they require much greater operational simplicity than current Hadoop deployments.  Streaming workloads are near real-time and run non-stop.  Data preparation and integration also have to start leveraging more machine learning relative to human curation in order to keep track of ever-expanding sources of data with as few operational interruptions as possible.

  • The new databases need elastic scalability. Unlike traditional data warehouses which have well-known workloads, IoT ingest and analysis volumes can vary significantly over time.  Elasticity that works at scale means more than just spreading data across more nodes in a cluster.  It means that the meta data that describes the data has to scale out as well.  And for a database processing updated data, that means the meta data needs transactional control in addition to the core data.  The high availability and disaster recovery mechanisms have to scale out as well.
  • Managing stream processing relative to traditional MPP databases requires another leap in sophistication.  Managing analytics on IoT data flowing across a cluster requires hiding things from developers like data that might arrive late and analytic operations that have to get moved around the cluster depending on where data arrives.
  • Data preparation and integration in batch systems is itself a batch process.  As streaming systems get more advanced, the process of tracking the lineage and governance of new sources has to become more automatic.  For example, the management tools need to track who in the organization is using what data sets and how.  That “tribal” knowledge that used to be tacit needs to be part of the catalog that enables users to find the data sets or analytic streams that are relevant to them.
  • In traditional DBMS’s, no matter how real-time the workload, the database calculated the descriptive data required for analysis and management offline, when there was more time.  Increasingly, this meta data needs to be calculated in near real-time as data comes into the database.  And as the data gets analyzed, the database has to figure out how to organize it for best performance.  It can make the necessary changes itself or recommend them to the DBA or developer.
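Computing descriptive meta data incrementally, rather than in an offline batch pass, is a well-understood technique.  As one concrete sketch, per-device summary statistics can be maintained one event at a time using Welford’s online algorithm (the class and its use in a database are illustrative, not a specific product’s design):

```python
class RunningStats:
    """Maintain descriptive meta data incrementally as events arrive,
    instead of recomputing it offline in batch (Welford's online algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0   # running sum of squared deviations from the mean

    def update(self, x):
        # O(1) per event: suitable for a non-stop streaming workload.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance of everything seen so far.
        return self._m2 / self.count if self.count else 0.0
```

A database keeping one such accumulator per device and per metric always has current statistics available for query optimization or anomaly thresholds, with no offline recalculation window required.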

Action Items

  • IT directors must rethink pricing for databases that are going to handle the volume of data associated with IoT pipelines.  Two new criteria need to be met: metered pricing that supports elastic workloads, and open source pricing levels that can accommodate much higher data volumes.
  • IT architects must leverage near real-time analytics that require a combination of stream processors and databases that can use machine learning to improve the operation of an ecosystem of smart, connected products.