Wrapping DevOps Around the Data Science Pipeline

June 27, 2017 | By James Kobielus

AI, Analysis, Big Data

Premise

Organizations are deploying deep learning, machine learning, cognitive analytics, and other data-science apps into their most mission-critical, always-on applications. As they further embed these data-science assets into high-value applications, organizations must design analytic pipelines that can operate reliably, cost-effectively, and with high throughput. Increasingly, data science professionals are automating more of their tasks within continuous DevOps workflows.

Analysis

Data science is the core skillset of the next-generation developer. To successfully scale up mature application-release pipelines, data science teams should adopt DevOps practices. So far, though, few data science teams have adopted DevOps. According to a recent CrowdChat, Wikibon’s community believes that fewer than 25% of organizations that have adopted DevOps are applying these practices to the data science pipeline. This poll result is illustrated in Figure 1.

Figure 1: Adoption of DevOps in Data Science

Wrapping DevOps around the analytics pipeline requires the convergence of the organizational silos that have long kept data-science teams separate from traditional application developers. Many of the most innovative new business applications involve the fruits of data science and are developed and maintained by converged teams of statistical modelers working closely with data engineers, programmers, subject-matter experts, and other specialists throughout the DevOps lifecycle.

Bringing DevOps into the analytic pipeline requires that data science professionals adopt the following practices:

Introduce industrial discipline into data-science development and deployment. Establishing a continuous data-science pipeline involves instituting standardized tasks, repeatable workflows, and fine-tuned rules governing the creation and deployment of deep-learning models and other artifacts.
Automate every stage of the data-science pipeline. Once standardized, many data-science pipeline tasks are amenable to greater automation, ranging from upfront data discovery and exploration all the way through to modeling, training, deployment, and evaluation.
Enforce strong lifecycle governance over data-science assets. The foundations of comprehensive data-science pipeline governance are a source-control repository, data lake, and integrated collaboration environment.
Evaluate blend of commercial and custom-built tools for data-science DevOps. Few tools on the market currently handle end-to-end DevOps for data science. Consequently, you may need to explore using multiple existing tools for various stages of the pipeline. And you may need to write custom apps and integration code to build a more comprehensive toolset that spans your organization’s data-science pipeline.

Introduce Industrial Discipline Into The Data-Science Development And Deployment

DevOps focuses on organizing software release pipelines to boost scale, speed, efficiency, and predictability. One can industrialize the data-science pipeline by reconstituting its processes around three core principles: role specialization, workflow repeatability, and tool-driven acceleration.

Table 1 discusses how those industrialized practices may be implemented in the data-science pipeline.

PRACTICE	DISCUSSION
*Role specialization*	The key role specializations in an industrialized data-science pipeline consist of statistical modelers, data engineers, application developers, business analysts, and subject-domain specialists. Within collaborative environments, data science team members pool their skills and specialties in the exploration, development, deployment, testing, and management of convolutional neural networks, machine learning models, programming code, and other pipeline artifacts.
*Workflow repeatability*	The principal workflow patterns in the data-science pipeline are those that govern the creation and deployment of deep-learning models, statistical algorithms, and other repeatable data/analytics artifacts. The primary workflow patterns fall into such categories as data discovery, acquisition, ingestion, aggregation, transformation, cleansing, prototyping, exploration, modeling, governance, logging, auditing, archiving, and so on. In a typical enterprise data-science development practice, some of these patterns may be largely automated, while others may fall toward the manual, collaborative, and agile end of the spectrum. In the context of the data-science pipeline, workflow repeatability requires moving away from data science’s traditional focus on one-time, ad hoc, largely manual development efforts that are divorced from subsequent operational monitoring and updating, and for which downstream changes require significant and disruptive model rebuilds.
*Tool-driven acceleration*	The most important platform for tool-driven acceleration in data-science pipelines is the cloud. The chief cloud-based tools include Spark runtime engines for distributed, in-memory training of machine learning and other statistical algorithms; unified workbenches for fast, flexible sharing and collaboration within data science teams; and on-demand, self-service collaboration environments that provide each DevOps pipeline participant with tools and interfaces geared to their specific tasks.

Table 1: Industrialized Data-Science Pipeline Practices
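
As a rough illustration of workflow repeatability, the Python sketch below runs a fixed, ordered list of named steps and records an audit entry for each one. Everything in it is an assumption made for illustration: the step names (ingest, cleanse, transform), the run_pipeline helper, and the sample records are hypothetical and do not come from any tool discussed in this report.

```python
from typing import Callable, List, Tuple

# Hypothetical step functions; each takes a list of rows and returns new rows.
def ingest(rows: List[dict]) -> List[dict]:
    # Copy the input so later steps never mutate the source data.
    return [dict(r) for r in rows]

def cleanse(rows: List[dict]) -> List[dict]:
    # Drop records that are missing the "value" field.
    return [r for r in rows if r.get("value") is not None]

def transform(rows: List[dict]) -> List[dict]:
    # Add a derived feature alongside the original fields.
    return [{**r, "value_squared": r["value"] ** 2} for r in rows]

# The standardized workflow: a fixed, ordered list of named steps.
PIPELINE: List[Tuple[str, Callable[[List[dict]], List[dict]]]] = [
    ("ingest", ingest),
    ("cleanse", cleanse),
    ("transform", transform),
]

def run_pipeline(rows, steps=PIPELINE):
    """Run every step in order and return (result, audit_log)."""
    audit_log = []
    for name, step in steps:
        rows_in = len(rows)
        rows = step(rows)
        audit_log.append({"step": name, "rows_in": rows_in, "rows_out": len(rows)})
    return rows, audit_log

# Illustrative input only.
raw = [{"id": 1, "value": 3}, {"id": 2, "value": None}, {"id": 3, "value": 5}]
result, log = run_pipeline(raw)
for entry in log:
    print(entry)
```

Because the step list is data rather than ad hoc notebook cells, the same sequence can be rerun on new inputs, and the audit log gives downstream monitoring something to inspect.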

Automate Every Stage Of The Data-Science Pipeline

DevOps achieves industrial discipline in software development, testing, and deployment by automating many tasks that were formerly handled through manual techniques. Many upfront data engineering functions, such as data discovery, transformation, and staging, have long been automated to a high degree. The next frontier in automation consists of the core modeling practices, including visualization, algorithm selection, and hyperparameter tuning.

Table 2 discusses the data-science modeling pipeline stages that are most amenable to automation. Among the tools and technologies that address these capabilities, the most noteworthy is Google’s AutoML.

STAGE	DISCUSSION
*Automated data visualization*	This accelerates the front-end process of discovering and exploring data prior to statistical modeling, such as by automatically plotting all variables against the target variable being predicted through machine learning, and also computing summary statistics.
*Automated data preprocessing*	This accelerates the process of readying training data for statistical modeling by automating how categorical variables are encoded, missing values are imputed, and so on.
*Automated feature engineering*	Once the statistical modeling process has begun, this accelerates the process of exploring alternative feature representations of the data, such as by testing which best fit the training data.
*Automated algorithm selection*	Once the feature engineering has been performed, this accelerates the process of identifying, from diverse neural-net and other architectural options, the algorithms best suited to the data and the learning challenge at hand.
*Automated hyperparameter tuning*	Once the statistical algorithm has been selected, this accelerates the process of identifying optimal model hyperparameters, such as the number of hidden layers; the model’s learning rate (the size of the adjustment made to the model’s weights at each training iteration); its regularization (adjustments that help models avoid overfitting); and so on.
*Automated model benchmarking*	Once the training data, feature representation, algorithm, and hyperparameters are set, this accelerates the process, prior to deployment, of generating, evaluating, and identifying trade-offs among alternate candidate models that conform with all of that.
*Automated model training*	Once a model has been built and training data acquired and labeled, this accelerates the process of ensuring that it performs its designated data-driven learning task—such as predicting some future event, classifying some entity, or detecting some anomalous incident—with sufficient accuracy and efficiency.
*Automated model diagnostics*	Once models have been deployed, this accelerates the process of generating the learning curves, partial dependence plots, and other metrics that show how rapidly, accurately, and efficiently they achieved their target outcomes, such as making predictions or classifications on live data.

Table 2: Automating the Stages of the Data Science Pipeline
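
To make the hyperparameter-tuning stage in Table 2 more concrete, here is a minimal sketch of one simple way to automate it: an exhaustive grid search scored on held-out data. This is not how Google’s AutoML or any other product works; grid search is swapped in because it is easy to follow. The toy one-weight model, the data, and the grid values are all assumptions chosen for illustration.

```python
import itertools

def train(xs, ys, learning_rate, l2, epochs=200):
    """Fit y ~ w * x by gradient descent with an L2 penalty on w."""
    w = 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of mean squared error plus the L2 regularization term.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n + 2 * l2 * w
        w -= learning_rate * grad
    return w

def mse(w, xs, ys):
    """Mean squared error of the one-weight model on (xs, ys)."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grid_search(train_set, valid_set, grid):
    """Try every combination in the grid; keep the lowest validation error."""
    best = None
    for lr, l2 in itertools.product(grid["learning_rate"], grid["l2"]):
        w = train(*train_set, learning_rate=lr, l2=l2)
        score = mse(w, *valid_set)
        if best is None or score < best["score"]:
            best = {"learning_rate": lr, "l2": l2, "w": w, "score": score}
    return best

# Toy data generated from y = 2x; the values are illustrative only.
train_set = ([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
valid_set = ([5.0, 6.0], [10.0, 12.0])
grid = {"learning_rate": [0.001, 0.01, 0.05], "l2": [0.0, 0.1, 1.0]}

best = grid_search(train_set, valid_set, grid)
print(best)
```

Scoring each candidate on data that was not used for training is what lets the search penalize settings that underfit or overfit. On real models, an exhaustive grid becomes expensive quickly, which is one reason automated tuning tools exist.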

Enforce Strong Lifecycle Governance Over Data-Science Assets

DevOps ensures industrial-grade output through imposition of strong governance over assets, workflows, and quality at every stage of the data-science pipeline.

As you ramp up your data-science development teams and give them more powerful tools for data engineering, modeling, and refinement, you will soon be swamped with more versions of more development artifacts than you can possibly track manually. Your AI pipelines will:

Include more participants, steps, workflows, milestones, checkpoints, reviews, and metrics;
Develop more artifacts (models, code, visualizations, etc.) in more tools and languages;
Incorporate more complex feature sets that include more independent variables;
Train more models with data from more sources in more formats and schemas;
Run more concurrent ingest, modeling, training, and deployment jobs; and
Deploy, monitor, and manage more models in more downstream applications.

As these complexities mount in the data-science pipeline, strong governance tools and practices will enable your DevOps professionals to answer the following questions continuously, in real time, and in an automated fashion:

How many data-science workstreams are in process concurrently across your organization?
How recently have you evaluated and retrained each in-production data-science build with the best training data available?
How recently was the training data associated with each data-science build refreshed in your data lake? How recently was training data labeled and certified as fit for training the associated data-science models?
How recently was each trained data-science model build promoted to production status?
How recently have you trained the “challenger” models associated with each in-production data-science “champion” model build and evaluated their performance vis-à-vis the champion and each other?
Where in your data-science DevOps pipeline does responsibility reside for approving the latest, greatest, and fittest champion data-science model for production deployment?
Are you logging all these data-science DevOps pipeline steps?
Are you archiving the logs?
How searchable are the archives for process monitoring, tracking, auditing, and e-discovery?
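
Several of the questions above reduce to queries over build metadata. The sketch below assumes a hypothetical in-memory registry in which each build records when it was last trained and when its training data was last refreshed. The ModelBuild fields, the build names, the dates, and the 30-day threshold are invented for illustration and are not taken from any product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class ModelBuild:
    name: str
    status: str               # "production" or "challenger"
    last_trained: datetime    # when this build was last (re)trained
    data_refreshed: datetime  # when its training data was last refreshed

def stale_production_builds(builds, now, max_age):
    """Production builds whose last training run is older than max_age."""
    return [b.name for b in builds
            if b.status == "production" and now - b.last_trained > max_age]

def trained_on_outdated_data(builds):
    """Builds whose training data was refreshed after they were last trained."""
    return [b.name for b in builds if b.data_refreshed > b.last_trained]

# Hypothetical registry contents.
builds = [
    ModelBuild("churn", "production", datetime(2017, 6, 20), datetime(2017, 6, 18)),
    ModelBuild("fraud", "production", datetime(2017, 4, 1), datetime(2017, 6, 1)),
    ModelBuild("pricing", "production", datetime(2017, 6, 10), datetime(2017, 6, 15)),
    ModelBuild("fraud-v2", "challenger", datetime(2017, 6, 25), datetime(2017, 6, 1)),
]

now = datetime(2017, 6, 27)
stale = stale_production_builds(builds, now, max_age=timedelta(days=30))
outdated = trained_on_outdated_data(builds)
print("Not retrained in 30 days:", stale)
print("Trained before latest data refresh:", outdated)
```

In practice these timestamps would come from pipeline logs or repository metadata rather than hand-entered records, so the answers stay current without manual tracking.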

The data-science DevOps environment should automate governance of all activities and interactions throughout the lifecycle of all deliverables. To ensure industrial-grade quality controls, tools and platforms should enable creation, monitoring, and management of policies across all pipeline participants, states, and artifacts. This is analogous to how high-throughput manufacturing facilities dedicate personnel to test samples of their production runs before they’re shipped to the customer.
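
One way to picture automated policy enforcement is a promotion gate that checks a challenger model against explicit rules and logs every decision. The sketch below is only an assumption about what such a gate might look like: the policy fields, the role name, the accuracy figures, and the review_promotion function are hypothetical.

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("promotion")

# Hypothetical policy; the values are placeholders, not recommendations.
POLICY = {"min_improvement": 0.01, "approver_roles": {"model-risk-officer"}}

def review_promotion(champion, challenger, approver_role, policy=POLICY):
    """Return True only if the challenger satisfies every policy rule."""
    checks = {
        "beats_champion": (challenger["accuracy"] - champion["accuracy"]
                           >= policy["min_improvement"]),
        "approved_by_authorized_role": approver_role in policy["approver_roles"],
    }
    decision = all(checks.values())
    # Log a structured record so the decision can be audited later.
    log.info(json.dumps({
        "champion": champion["name"],
        "challenger": challenger["name"],
        "checks": checks,
        "promote": decision,
    }))
    return decision

champion = {"name": "fraud-v1", "accuracy": 0.91}
challenger = {"name": "fraud-v2", "accuracy": 0.93}
promoted = review_promotion(champion, challenger, "model-risk-officer")
rejected = review_promotion(champion, challenger, "data-scientist")
```

Keeping the policy in a data structure, separate from the code that applies it, means the rules themselves can be versioned, reviewed, and changed without touching the gate.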

Table 3 discusses the principal platforms for comprehensive governance across the data-science pipeline.

PLATFORM	DISCUSSION
*Source-control repository*	This is where all builds, logs, and other pipeline artifacts are stored, managed, and controlled throughout every step of the DevOps lifecycle. The repository serves as the hub for collaboration, reuse, and sharing of all pipeline artifacts by all participants, regardless of their roles, tools, and workstreams. It provides access to the algorithm and code libraries that data scientists and other developers incorporate in their projects. It’s the point from which consistent policies are enforced for workflow, sharing, reuse, permissioning, check-in/check-out, change tracking, versioning, training, configuration, deployment, monitoring, testing, evaluation, performance assessment, and quality assurance across all pipeline artifacts.
*Data lake*	This is where data is stored, aggregated, and prepared for use in exploration, modeling, and training throughout the data-science pipeline. Typically, it’s a distributed file system (built on Hadoop or NoSQL platforms) that stores multistructured data in its original formats, schemas, and types, facilitating its use in subsequent exploration, modeling, and training by data scientists. Another way to think of the data lake is as a queryable archive of various data sources that have potential value in exploration, visualization, modeling, training, reproducibility, accountability, auditing, e-discovery, and regulatory compliance.
*Integrated collaboration environment*	This is where data-science DevOps professionals execute all or most pipeline functions. It provides a unified platform for source discovery, visualization, exploration, data preparation, statistical modeling, training, deployment, feedback, evaluation, sharing, and reuse. It should provide: unified access to the data-science pipeline’s source-control repository, data lake, processing clusters, and other shared resources; self-service tools for prototyping and programming data-science applications for deployment in various runtime environments; tools for composing AI applications as containerized microservices for cloud-native deployments; an interactive, scalable, and secure visual workbench for consolidating open-source tools, languages, and libraries and for collaborating within teams to rapidly put high-quality data-science applications into production; access to data-science tools and libraries (including TensorFlow, Caffe, Keras, MXNet, and CNTK) as well as support for programming in R, Python, Scala, Java, and other languages; and facilities for developers to connect with one another while accessing project dashboards and learning resources, forking and sharing projects, exchanging development assets (datasets, models, projects, tutorials, Jupyter notebooks, etc.), and sharing results.

Table 3: Data Science Pipeline Governance Platforms
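
The change-tracking and versioning role of the repository in Table 3 can be sketched with content hashes: a new version is recorded only when the model, its training data, or its parameters actually change. The ArtifactRepository class below is a hypothetical in-memory stand-in written for illustration. It is not the interface of Git or of any platform named in this report.

```python
import hashlib

def fingerprint(payload: bytes) -> str:
    """Content hash used to detect whether an artifact has changed."""
    return hashlib.sha256(payload).hexdigest()

class ArtifactRepository:
    """In-memory stand-in for a repository of pipeline artifacts."""

    def __init__(self):
        self._versions = {}  # artifact name -> list of version records

    def check_in(self, name, model_bytes, training_data, params):
        """Record a new version if anything changed; return the version number."""
        record = {
            "model_sha256": fingerprint(model_bytes),
            "data_sha256": fingerprint(training_data),
            "params": dict(params),
        }
        history = self._versions.setdefault(name, [])
        # Skip the check-in when nothing changed since the last version.
        if history and history[-1] == record:
            return len(history)
        history.append(record)
        return len(history)

    def history(self, name):
        """Return a copy of the version records for one artifact."""
        return list(self._versions.get(name, []))

# Illustrative byte strings stand in for serialized models and datasets.
repo = ArtifactRepository()
v1 = repo.check_in("churn-model", b"weights-a", b"data-2017-05", {"learning_rate": 0.01})
v1_again = repo.check_in("churn-model", b"weights-a", b"data-2017-05", {"learning_rate": 0.01})
v2 = repo.check_in("churn-model", b"weights-a", b"data-2017-06", {"learning_rate": 0.01})
print(v1, v1_again, v2)
```

Recording the training-data hash next to the model hash is what makes a build reproducible and auditable: any deployed model can be traced back to the exact data and parameters that produced it.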

Evaluate Blend Of Commercial And Custom-Built Tools For Data-Science DevOps

Only a handful of commercial tools (to be discussed shortly) enable data scientists, traditional app developers, and IT professionals to implement an industrial-strength data-science pipeline along the lines discussed above. Consequently, you may need to explore using multiple existing tools to implement the requisite automation, governance, and other DevOps requirements across the pipeline. And you may need to write custom apps and integration code to build a more comprehensive DevOps toolchain that spans your organization’s data-science pipeline.

Table 4 discusses some commercially available data science platforms that offer some of these core DevOps features.

PLATFORM/TOOL	DESCRIPTION
IBM Watson Data Platform	IBM Watson Data Platform supports collaborative development of cognitive analytics. Hosted and fully managed in IBM Cloud, it supports cognitive-assisted, automated development of machine learning models in Spark, as well as DevOps-oriented deployment of analytics into business processes and other downstream applications. The platform includes a hub repository with a library of cognitive algorithms for modeling and execution in Spark, Hadoop, stream computing, and other platforms. An integrated environment enables data science teams to collaborate cross-functionally and deliver results into production environments rapidly. The platform enables continuous improvement of analytic models through rapid iteration, provides task-oriented tools for diverse data-science roles, supports lifecycle governance of data and models, and incorporates tools for automating ingest and preparation of data in cloud-based data lakes.
Cloudera Data Science Workbench	Cloudera Data Science Workbench automates data and analytics pipelines. Running on-premises or in the cloud, it provides a self-service DevOps tool for data scientists to manage their own analytics pipeline. It supports quick development, prototyping, training, and production deployment of machine learning projects. It accelerates data science from exploration to production using R, Python, Spark and other libraries and frameworks. It provides out-of-the-box support for full Hadoop security, authentication, authorization, encryption, and governance. And it includes a collaborative, shareable project environment to ensure that diverse data science teams can work together towards standard, reproducible research.
Domino Data Lab Data Science Workbench	Domino Data Lab’s Data Science Workbench facilitates discovery, reuse, and reproduction of past data-science deliverables as well as work in progress. As a DevOps platform, it automatically tracks code, data, results, environments, and parameters in one central place. It accelerates iteration and experimentation among teams of data scientists. It enables data scientists to easily spin up interactive workspaces in Jupyter, RStudio, and Zeppelin. It enables flexible sharing and discussion of work among collaborators and stakeholders in one place. It controls access with granular permissions and LDAP integration; enables rapid publishing of visualizations, interactive dashboards, and web apps; and supports scheduling of recurring tasks to update reports, so that results can be served through the web or sent to stakeholders via email. It controls access to dashboards and reports from a central security layer.
MapR Distributed Deep Learning Quick Start Solution	MapR Distributed Deep Learning Quick Start Solution provides a scalable DevOps platform for automated deployment of deep learning models. It supports continuous development, training, and release of deep learning applications into operational environments. It enables deep learning models built in tools such as TensorFlow, Caffe, and MXNet to be modeled, trained, and deployed into mixed hardware environments that include GPUs and CPUs. It supports distributed deployment of containerized deep-learning pipeline functions over Kubernetes and persistence of models and data in the MapR Converge-X Data Fabric. It provides a dashboard for monitoring of deep-learning pipeline jobs, while also logging all pipeline operations for continuous analysis and diagnostics.

Table 4: Commercial Data Science DevOps Platforms

When converging governance of your data science pipeline with the broader DevOps practices in your organization, you should ensure that your enabling platforms harmonize with corresponding investments in the following:

Source-control repositories, such as Git, Bitbucket, and Visual Studio Team Services
Integration, testing, and performance monitoring tools, such as Jenkins and Bamboo
Infrastructure automation and configuration management tools, such as Puppet, Chef, and Ansible
Logging and monitoring tools, such as Datadog and Splunk
Orchestration platforms, such as Kubernetes, Mesos DC/OS, and Swarm
Containerization tools, such as Docker

Action Item

Extending DevOps practices to data science pipelines will enhance the value of big data investments to the business. Don’t focus on tool chains first. Instead, establish disciplines and workflow patterns that are governable and can eventually be automated with tools.

James Kobielus
