Premise

It’s hard not to walk away from Pentaho’s 2015 user conference impressed.  The product has become mainstream and it appears to be on a path to even greater importance.  The product’s ability to orchestrate the analytic data pipeline fits neatly into Wikibon’s Systems of Intelligence theme. IT infrastructure and application architects can use Pentaho to build new Systems of Intelligence on existing Systems of Record by blending across traditional data, big data, and fast data and taking the analytics all the way to the operational application.

Pentaho blends and enriches these sources and delivers the data to, or through, the point of analysis, either to drive an operational application or to present data to an end user.  The key is that this process is getting faster on both ends.  On the back end it is getting better at handling streaming data as a source, a big theme on the roadmap for IoT functionality; on the front end it is embedding analytics more deeply in applications via its APIs and open source visualization widgets.

Orchestrating the analytic data pipeline

The company was very disciplined about its message.  Everything rolled up to “governed data delivery”.  Scratch the surface, though, and it’s clear there is much more to the product than a pretty GUI, for anyone who might categorize Pentaho as just a business intelligence vendor.

Pentaho’s ambition is to orchestrate the analytic data pipeline from end-to-end.  In other words, it wants to orchestrate the blending, enrichment, delivery, and analysis of data from the source repository all the way to the operational application in which the analytics may be embedded.  Governance may not be flashy, but it’s a requirement if you want to trust your analytics.  Pentaho tracks the lineage of all the data it works with from the destination back to the source.  It also exports this lineage information when necessary to other platforms, such as Cloudera’s or Tableau’s.

Figure 1: Pentaho’s vision for orchestrating the end-to-end analytic data pipeline

Releases 5.3 and 5.4

The company’s recent heritage is in managing the integration of data across data warehouses and big data environments such as Hadoop.  This year’s point releases, 5.3 and 5.4, included the initial elements of functionality that will continue to grow in future releases.  First among these is Streamlined Data Refinery (SDR).  As we understand it, SDR takes blending, something that used to require SQL expertise, and pushes it out to a wider user community.

The idea is that business analysts, not data engineers, can blend data from Hadoop clusters and data warehouses because IT has put guardrails on the process.  The key capability, as we understood it, is the semi-automated creation of an OLAP model.  This capability is by no means complete, but it appears to be maturing rather quickly.

The point releases introduced new deployment capabilities for integration and orchestration by enabling storage and processing to be done in the cloud.

Release 6.0 – enterprise hardening and performance, deeper integration of Hadoop and data warehouses, and collaborative model building

Release 6.0, the year’s major release, became generally available on October 14, during the show.  Its major focus was integrating the data lake and the data warehouse.

Figure 2: Pentaho’s focus for 2015 is deeper integration of traditional data warehouses and Hadoop

It wasn’t clear whether blends between Hadoop and data warehouses were possible pre-6.0, but they are definitely supported now.  Data can also come from streams such as Storm.  What’s interesting is that each transformation step can produce a virtual table, something that only Pentaho maintains, not the underlying storage engines.  As part of orchestrating the analytic data pipeline from end-to-end, Pentaho can pass this virtual table to another application to operate on.  For example, it could send the table to an analytic application that scores each entry using an R model.
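To make the virtual-table idea concrete, here is a minimal sketch in Python (purely illustrative; the names and data are hypothetical, and this is not Pentaho’s actual API): each transformation step consumes and produces an in-memory table, so a blended result can be handed straight to a downstream scoring step.

```python
# Illustrative sketch of a pipeline of virtual tables (lists of dicts):
# a blend step joins two sources, and its output is passed to a scoring
# step -- analogous to handing a blended table to an R scoring model.

def blend(warehouse_rows, hadoop_rows, key):
    """Join rows from two sources on a shared key, yielding a virtual table."""
    lookup = {r[key]: r for r in hadoop_rows}
    return [{**r, **lookup[r[key]]} for r in warehouse_rows if r[key] in lookup]

def score(rows, model):
    """Pass the virtual table to an analytic step that scores each entry."""
    return [{**r, "score": model(r)} for r in rows]

# Hypothetical data standing in for a warehouse table and a Hadoop data set.
orders = [{"cust": 1, "amount": 120.0}, {"cust": 2, "amount": 45.0}]
clicks = [{"cust": 1, "visits": 9}, {"cust": 2, "visits": 2}]

# A trivial stand-in for a scoring model (e.g. one built in R).
churn_model = lambda r: 1.0 if r["visits"] < 3 else 0.1

pipeline = score(blend(orders, clicks, key="cust"), churn_model)
```

The point of the sketch is only the shape of the flow: every step takes a table and returns a table, so the orchestrator can insert, reorder, or hand off steps without the underlying storage engines knowing about the intermediate results.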

With the new SDR functionality which first appeared in 5.3 and 5.4, Pentaho can go much further in figuring out the OLAP model from the data it’s blending.  While preparing the data, the system can offer intelligent prompts for users to identify the features that are relevant for building the OLAP model.  Then, each time the job runs, the system builds the model automatically.  Users can collaboratively add to the model and their additions can then feed back into the auto generation process.
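A rough sketch of the idea behind semi-automated OLAP model generation, based on our reading of SDR (hypothetical code, not Pentaho’s implementation): inspect the blended data’s columns, propose measures and dimensions, and let users amend the proposal, with their additions carried forward each time the job runs.

```python
# Illustrative sketch: infer a candidate OLAP model from a blended table.
# Numeric columns become candidate measures; everything else, dimensions.
# In SDR the system prompts users to confirm or adjust such a proposal.

def propose_olap_model(rows):
    """Propose measures and dimensions from the first row's column types."""
    sample = rows[0]
    model = {"measures": [], "dimensions": []}
    for col, value in sample.items():
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            model["measures"].append(col)
        else:
            model["dimensions"].append(col)
    return model

rows = [{"region": "EMEA", "product": "X1", "revenue": 1200.0, "units": 3}]
model = propose_olap_model(rows)

# A user's manual addition, which would feed back into later auto-generation.
model["dimensions"].append("quarter")
```

A real implementation would also have to detect hierarchies, date grains, and keys; the sketch shows only the core heuristic of splitting columns into measures and dimensions.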

6.0 also brought more focus on delivering enterprise-hardened data integration and analysis.  That hardening included better cluster support, better authentication and authorization, SNMP support for third-party monitoring tools, and the ability to ensure upgrades of customers’ existing deployments would work even if it meant sacrificing some new features.

Roadmap – manage increasingly heterogeneous landscape, including streaming data for IoT, with ever deeper embedding in operational applications

To date the emphasis has been on integrating data across the data warehouse and Hadoop.  Hadoop support included Storm for near real-time stream processing.  But with Storm’s fade from relevance, new stream processors are on the roadmap.

As part of a broader theme of shielding customers from technology changes, the Labs department is working on Spark support, both for batch and stream processing.  A future release will enable customers to use the GUI data integration tool to create jobs that run on Spark rather than on MapReduce or a data warehouse.

Figure 3: Prototype of GUI design tool that will generate Spark code

The IoT opportunity for the whole ecosystem is enormous because it will increase the data volume, velocity, and variability by orders of magnitude.  From the perspective of use cases such as smart buildings, connected cars, energy generation and failure, fraud prevention, and predictive maintenance, Pentaho sees a lot of commonality in requirements.

For example, all the scenarios they see involve raw data accumulated over time in a data lake, near real-time data continuously streaming in, and the up-to-date current state of the relevant devices.  While the sweet spot of traditional data integration has been batch blending and enrichment, users will now need a more exploratory experience so they can look for anomalies and patterns across historical, streaming, and current data.
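The three classes of data can be sketched as follows (a hypothetical illustration of the pattern, not a Pentaho interface): historical readings from a data lake establish a baseline, streamed readings are checked against it, and the current device state decides which devices are even in scope.

```python
# Illustrative IoT sketch: flag streamed readings that deviate sharply
# from a device's historical baseline, considering only devices whose
# current state says they are online.  All data here is made up.

from statistics import mean, stdev

historical = {"dev-1": [20.1, 20.4, 19.8, 20.0, 20.3]}  # data lake: readings over time
stream = [("dev-1", 20.2), ("dev-1", 27.5)]              # near real-time events
state = {"dev-1": {"status": "online"}}                  # current device state

def anomalies(historical, stream, state, z=3.0):
    """Flag streamed readings more than z standard deviations from history."""
    flagged = []
    for dev, reading in stream:
        if state.get(dev, {}).get("status") != "online":
            continue  # skip devices not currently reporting
        hist = historical[dev]
        if abs(reading - mean(hist)) > z * stdev(hist):
            flagged.append((dev, reading))
    return flagged
```

Even this toy version shows why the experience has to be exploratory: the threshold, the baseline window, and the definition of "anomaly" all need to be tuned interactively against historical, streaming, and current data together.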

APIs aren’t as flashy as GUI business intelligence tools.  But Pentaho is putting a large and continuing effort into opening up interfaces to make embedding its pipeline in applications ever easier.

The Emerging Stack for Analytic Data Pipelines

Systems of Record have had a well-established pipeline architecture with operational and analytic databases linked by traditional ETL tools.  Customers and vendors are still working out exactly what that architecture will look like for Systems of Intelligence built on Big and Fast Data.  This analytic data pipeline has to gather data from a wide variety of sources, perform ever more sophisticated analytics, and do it all with ever greater speed. Building this needs a coherent foundation.

New processing engines are fracturing this foundation all the time: NoSQL and NewSQL DBMSs, the Hadoop ecosystem, Spark, and all the related services on AWS, Azure, and Google Cloud Platform.  All of this is making customers look for “future-proof” insurance in the form of an insulating layer.

The first order of business is deciding just how much of this new pipeline belongs on an insulating layer and how much can be built directly on native data platforms and analysis tools.  For Systems of Record, customers largely kept the data integration process on one platform (Informatica ETL) separate from the analytics process.  That analytics process relied on a range of specialized tools, mostly alongside the same database engines used for OLTP but configured for data warehousing.

Action Item

Customers face one key choice, and they should make it based on how they are organized and on their objectives.

If there is one organization responsible for data management infrastructure and a separate one responsible for the analytic data feeds that operational applications consume, then the best fit is two systems.  One is an appropriately future-proofed platform for data integration.  The other comprises specialized analytic databases and tools.  The assumption here is that direct access to the explosive innovation in analytics is beneficial and that it is probably too difficult to future-proof analytics.  This choice mirrors the architecture of Systems of Record.  The downside is that customers will have to custom-build the integration and orchestration of the two systems if their chosen vendors don’t provide it out-of-the-box.

If there is a single organization to provide data integration and analytic data feeds, then the best fit is one system that incorporates data integration and analytics as well as orchestrating the whole process.  The advantage here is that data engineers, data scientists, and business analysts can collaboratively shape the data for rapidly evolving analytic data pipelines.  Pentaho fits in this category.  However, for advanced analytics such as machine learning it lets developers call out directly to specialized tools.  While Pentaho can still orchestrate the entire analytic data pipeline in this scenario, it loses some of the value of enabling business and IT to work collaboratively on evolving the analytic data pipeline.