
Building Blocks of Machine Learning Applications Feature Operational Complexity

Premise: Big data pros have a plethora of mix-and-match tools from which to choose when building applications. What’s not clear, however, is whether those tools actually fit together in practice as well as they do in theory.

Based on the experiences of leading Internet companies, common application design patterns have been emerging for mainstream big data pros. Using these common designs, mainstream companies can mix and match open source as well as proprietary tools (see Figure 1). While vendors may have solved the problem of selection, the mix-and-match approach has side effects in operational and development complexity. Some vendors have begun making progress on this complexity by designing more of their services to work as a single unit. We’ve highlighted these issues before (here and here). But most vendors of popular services haven’t solved the problem of administrative or operational simplicity.

Figure 1: Selecting from a Chinese menu of services is required for building machine learning applications.

Wikibon groups the administrative functions of big data applications into four categories, each with one or more tasks. The challenge for administrators is that virtually every application service has its own way of implementing all administrative tasks. That matrix of application services versus administrative tasks is what is slowing big data deployments outside the largest or most sophisticated organizations (see Figure 2).

Figure 2: Machine learning application services drive operational complexity. Each component has its own implementation of individual administrative tasks.
  • Trusted systems require authentication, authorization, and data governance. Where trusted systems tend to break down is in the “cracks” between the boundaries of the analytic services.
  • Scalability requires elasticity, but separate clusters for separate products add latency and complexity through differing data distribution schemes. Both latency and differing data distribution schemes are the enemy of simple and fast data pipelines.
  • Resilience builds on availability & recovery and resource isolation. If different services handle availability and recovery differently, the analytic data pipeline is likely to be in an inconsistent state.
  • Automation requires scheduling & orchestration, testing integration, and logging. More services from more vendors increase the likelihood that they will use different technology for scheduling and orchestrating work across a cluster. (For detailed definitions of the administrative tasks that are the source of complexity, see Figure 3.)
Figure 3: Definitions of common administrative tasks typical of big data services.

 

Trusted systems require authentication, authorization, and data governance.

Sharing common authentication and authorization services across Hadoop components has been a challenge; sharing these services across the wider ecosystem of services now required to build a pipeline is an even greater problem. Some services use Kerberos and some LDAP. Some of the Hadoop-originated services use one or more of Knox, Sentry, and Ranger for authorizing different access permissions, depending on the vendor. But many Hadoop services don’t support these yet. And non-Hadoop services support something entirely different. Kafka stores permissions in Zookeeper and plans to store them internally. MemSQL manages permissions within its DBMS. Data governance, which includes audit, lineage, lifecycle management, curation, and metadata management, has the same challenges. Cloudera built its own solution, Navigator. But it doesn’t extend beyond Cloudera services.
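To make the fragmentation concrete, the sketch below expresses one logical rule ("the analyst may read the clickstream data") in two different permission systems. The helper functions, topic, table, and host names are illustrative assumptions, not a real unified API: Kafka's ACL tooling and MemSQL's SQL GRANTs each keep their own copy of the rule.

```python
# Illustrative sketch only: the same logical permission has to be expressed
# separately in each service's own security model. Names and hosts are
# placeholders, and the helper functions are hypothetical wrappers.
import subprocess

def grant_kafka_read(topic: str, principal: str, zookeeper: str) -> None:
    """Kafka stores ACLs in ZooKeeper and is administered via its own CLI tooling."""
    subprocess.run([
        "kafka-acls.sh",
        "--authorizer-properties", f"zookeeper.connect={zookeeper}",
        "--add",
        "--allow-principal", f"User:{principal}",
        "--operation", "Read",
        "--topic", topic,
    ], check=True)

def grant_memsql_read(table: str, user: str, cursor) -> None:
    """MemSQL manages permissions inside the DBMS with MySQL-style GRANTs."""
    cursor.execute(f"GRANT SELECT ON analytics.{table} TO '{user}'@'%'")

# Two calls, two security models, two audit trails; nothing keeps them in
# sync if the analyst's access is later revoked in only one place.
grant_kafka_read("clickstream", "analyst", "zk1:2181")
# grant_memsql_read("clickstream", "analyst", cursor)  # requires a live DB connection
```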

Trust in the integrity of the process and its output requires the following: knowing which people or programs accessed the pipeline, that they accessed only what they were allowed to see, and that any changes to the data were tracked in a continuous “chain of custody” across the pipeline. Perhaps the most promising product in this area is Alation. Alation claims to handle data governance tasks across products without requiring them to be modified in order to work with it.
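One way to picture the chain-of-custody requirement is as a record emitted at every stage of the pipeline. The schema below is a hypothetical illustration, not Alation's model or any product's API; it simply shows the minimum a governance layer would need to capture: who touched which dataset, at what stage, doing what, derived from which inputs.

```python
# Hypothetical illustration of a per-stage audit/lineage record; field names
# are assumptions, not any product's schema.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List

@dataclass
class CustodyRecord:
    dataset: str                 # logical name of the data being touched
    stage: str                   # pipeline stage, e.g. "ingest", "enrich", "score"
    principal: str               # person or program that accessed the data
    operation: str               # "read", "transform", "write", ...
    upstream: List[str] = field(default_factory=list)   # lineage: input datasets
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

# One record per access, per stage; trust in the final output depends on this
# chain being unbroken from raw ingest through to the report or model.
record = CustodyRecord(
    dataset="clickstream_enriched",
    stage="enrich",
    principal="spark-job:sessionize",
    operation="transform",
    upstream=["clickstream_raw"],
)
print(record)
```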

Scalability requires elasticity, but separate clusters for separate products add latency and complexity through differing data distribution schemes.

Big data services scale elastically, typically enabling them to add storage or compute as needed. With Hadoop, several services such as Hive and HBase might share a common store, with the underlying data residing in HDFS.

This elasticity gets challenging when data has to move across compute and/or storage clusters belonging to different services within the pipeline. Data gravity slows this process way down. Having the data partitioned and sharded across servers (i.e. elastically scaled) differently in each cluster makes the problem much worse. If data ingested into a Kafka cluster isn’t divided up the same way Spark needs it, it has to be reshuffled when it moves between the clusters. That reshuffling is extremely expensive because it taxes network and/or storage bandwidth, both of which are typically scarce.
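A minimal PySpark Structured Streaming sketch makes the reshuffling visible. The topic, column names, and broker address are assumptions, and the Kafka connector package has to be available to Spark; the point is that if Kafka partitions the stream by one key but the job aggregates on another, Spark must shuffle every record across the network to regroup it.

```python
# Sketch, assuming a "clickstream" topic keyed by device_id and JSON payloads
# that carry a user_id field; all names are illustrative.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("kafka-to-spark-reshuffle")
         .getOrCreate())

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "clickstream")            # Kafka partitions by device_id
          .load()
          .selectExpr("CAST(key AS STRING) AS device_id",
                      "CAST(value AS STRING) AS payload"))

# Aggregating by a key Kafka didn't partition on forces a full shuffle:
# every executor has to exchange rows with every other executor.
per_user = (events
            .selectExpr("get_json_object(payload, '$.user_id') AS user_id")
            .groupBy("user_id")
            .count())

query = (per_user.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```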

Resilience builds on availability & recovery and resource isolation.

Distributed systems in the enterprise have to be able to continue operating even if some nodes go offline. They also have to bring the offline nodes up to date when back online. Resilience also means multiple services can share access to infrastructure by isolating which service can access which infrastructure resources.

Again, a mix-and-match approach to services makes an end-to-end pipeline difficult to operate. Most Hadoop-based services use YARN for resource isolation and tend to use Zookeeper to keep track of which services are online and healthy. Outside the Hadoop ecosystem, services might use Mesos instead of YARN. Availability and recovery also typically work differently. Kafka and Spark are examples of two products that reinforce each other on availability and recovery: Spark doesn’t yet handle sudden spikes of high-speed incoming data well, so Kafka can operate as a buffer to keep pressure from building up while Spark works through the data at a more measured pace. Maintaining high availability within a service also depends on a protocol that all the nodes in the service use to communicate their health to each other. When different services use different protocols, they can react differently to the same failure event, such as a network partition. In other words, the different parts of the pipeline can end up out of kilter.
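A hedged sketch of that buffering pattern, again assuming Spark Structured Streaming's Kafka source and illustrative topic, broker, and paths: Kafka absorbs the burst, Spark caps how much it pulls per micro-batch, and a checkpoint directory gives the job a consistent place to recover its position after a failure.

```python
# Sketch with placeholder topic, broker, and paths: Kafka buffers the spike
# while Spark drains it at a bounded, measured pace.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("buffered-ingest").getOrCreate()

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "clickstream")
          .option("maxOffsetsPerTrigger", 100000)    # upper bound pulled per micro-batch
          .load())

query = (stream.writeStream
         .format("parquet")
         .option("path", "/data/landing/clickstream")
         .option("checkpointLocation", "/data/checkpoints/clickstream")  # recovery point
         .start())
query.awaitTermination()
```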

Automation requires scheduling & orchestration, testing integration, and logging.

By definition, an analytic data pipeline is a sequence of operations that ingests raw data and turns it into actionable insight. Scheduling and orchestration makes sure all the services properly share resources and run and complete in the proper sequence.

Here, too, there is a multitude of choices for scheduling and orchestration, even just within the container ecosystem. Although the resilience section discussed YARN for resource isolation, it’s also relevant to scheduling. Resource managers like YARN work with schedulers to make sure services land on the right servers and can deliver the required performance. YARN competes with Mesos for products outside the Hadoop-originated ecosystem, and Mesos couldn’t be more different: it handles coarse-grained scheduling on its own but relies on each service to handle its own concurrency needs for scheduling activities within the service. An MPP SQL DBMS, for example, would typically have a sophisticated concurrency capability that YARN may not know how to interact with.
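As one illustration of what sequencing a pipeline looks like, the sketch below uses Apache Airflow, named here only as an example scheduler rather than a recommendation. The DAG guarantees the ingest, transform, and publish stages run in order, but each stage still hands off to whatever resource manager its service was built for; task names and commands are placeholders.

```python
# Minimal, hypothetical Airflow 2.x DAG; the scripts it calls are placeholders
# standing in for the ingest -> transform -> publish stages of a pipeline.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="analytic_pipeline",
    start_date=datetime(2016, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    ingest = BashOperator(task_id="ingest",
                          bash_command="python ingest_from_kafka.py")
    transform = BashOperator(task_id="transform",
                             bash_command="spark-submit sessionize.py")
    publish = BashOperator(task_id="publish",
                           bash_command="python load_into_dbms.py")

    # The scheduler enforces ordering across services; each service still runs
    # on its own resource manager (YARN, Mesos, or something built in).
    ingest >> transform >> publish
```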

Action Item: Wikibon recommends mainstream companies look to public cloud providers as partners in building and deploying big data applications. The public cloud vendors can make operating the services in the open source ecosystem simpler and they also offer increasingly simple proprietary services. While the proprietary services come with a greater lock-in trade-off, they are becoming more accessible as well.
