Application Decay and the Burden of Data-Driven Algorithm Training

April 11, 2017 | By James Kobielus

Big Data

Application developers like to think that they produce gems. Many do, but their handiwork is not immune to the ravages of obsolescence.

Application fitness is fragile. Like any eroding asset, applications must be maintained in order to stay fit for their intended purpose. Application decay is the process by which the fitness of some programmatic artifact declines. Though the app in question may remain bug-free, its value to users will decay if the code proves difficult or costly to modify, extend, or evolve to keep pace with changing conditions. For example, an app’s value may decay because developers neglect to fix a confusing user interface, to include a key attribute in its data model, or to expose an API that other apps need in order to consume its output.

Application maintenance may involve modifications to the layers of code, schemas, metadata, and other artifacts that make up the app. To the extent that developers have inadvertently built “technical debt” (in other words, maintenance nightmares) into their apps, the costs of keeping it all fit-to-purpose rise and the inevitable value decay accelerates. Typically, technical debt stems from problematic development practices such as opaque interfaces, convoluted interdependencies, tightly coupled components, and scanty documentation.

Technical debt adds fragility to your code and downstream costs and delays to your development practice. As developers add fresh complexities to their work, the maintenance burdens grow, technical debts accumulate, and the potential for value decay intensifies. As developers start to incorporate data-driven statistical algorithms into their work, warding off application decay will also depend on the maintenance of those artifacts. As my Wikibon colleague George Gilbert and I discussed on theCUBE recently, maintaining traditional application code and training data-driven statistical algorithms are two sides of the same coin. Whereas traditional maintenance involves revisions to deterministic application logic, the new era of data science demands that developers also maintain—in other words, train—the probabilistic logic expressed in machine learning (ML), deep learning, predictive analytics, and other statistical models.

Training is the essence of a data-driven algorithm’s fitness, but it too is a fragile thing. Supervised learning involves creating ML algorithms that learn from labeled training data. To the extent that developers have trained it effectively, an ML algorithm will learn the correlations between data features and desired outputs that are implicit in the training data. A typical ML algorithm might use artificial neural nets to identify which features in the data (e.g., independent variables such as customer attributes and behaviors) best predict the outcome of interest (e.g., dependent variables such as customer likelihood to churn). But even if you trained the algorithm well at initial deployment, failing to retrain it at regular intervals with fresh observational data may cause your ML model to become less effective over time. That’s especially true if the original independent variables become less predictive of the phenomenon of interest.
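To make that decay concrete, here is a minimal Python sketch under loud assumptions: the data is synthetic, and a one-feature threshold classifier stands in for the neural nets mentioned above. The function names and the boundary values 3.0 and 6.0 are hypothetical, chosen only for illustration. It trains a model on labeled data, lets the relationship between feature and outcome shift, and compares the stale model with a retrained one on the fresh data.

```python
import random

def train_threshold(samples):
    """Fit a one-feature classifier: predict 1 when x >= threshold."""
    best_threshold, best_correct = None, -1
    # Try each observed feature value as the cutoff and keep the best one.
    for candidate in sorted(x for x, _ in samples):
        correct = sum((x >= candidate) == bool(y) for x, y in samples)
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold

def accuracy(threshold, samples):
    """Fraction of labeled samples the threshold classifies correctly."""
    return sum((x >= threshold) == bool(y) for x, y in samples) / len(samples)

def make_samples(rng, boundary, n=500):
    """Synthetic labeled data: the label is 1 when the feature reaches `boundary`."""
    samples = []
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)
        samples.append((x, 1 if x >= boundary else 0))
    return samples

rng = random.Random(0)  # fixed seed so the run is repeatable

# Initial deployment: the model is trained on data where the boundary is 3.0.
initial = make_samples(rng, boundary=3.0)
model = train_threshold(initial)

# Later, the feature has become less predictive in its old form: the boundary moved.
fresh = make_samples(rng, boundary=6.0)
stale_accuracy = accuracy(model, fresh)

# Retraining on fresh labeled observations restores fitness.
retrained = train_threshold(fresh)
retrained_accuracy = accuracy(retrained, fresh)

print(f"stale model on fresh data:     {stale_accuracy:.2f}")
print(f"retrained model on fresh data: {retrained_accuracy:.2f}")
```

In this toy setup the stale model misclassifies every fresh observation that falls between the old boundary and the new one, while the retrained model does not. Real drift is rarely this clean, so treat the sketch as an illustration of the maintenance loop rather than a monitoring recipe.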

When you engage in supervised learning—the dominant development approach for ML—how you choose to train your algorithms can add fragility, hence an ongoing maintenance burden, to whatever applications consume ML model outputs. Here are some algorithm-training issues that can multiply downstream technical debts:

Data scarcity: Some ML algorithms may be in application domains for which adequate training data sets are difficult to obtain.
Frequent retraining: Some ML algorithms may require frequent retraining, especially if the underlying predictors are poorly understood, interact in complicated patterns, or change dynamically.
Manual-labeling resource issues: Some ML algorithms depend on labeled training-data sets that require scarce, unreliable, low-productivity human resources.
Training-data variety: Some ML algorithms leverage diverse training data sets from myriad sources that may prove difficult and costly to refresh regularly for retraining.
Convoluted DevOps training pipelines: Some ML algorithms are built and trained in machine learning, data engineering and modeling, and DevOps pipelines with many intricate interdependencies.
Multi-stage training procedures for complex apps: Some ML algorithms are embedded within exceptionally complicated distributed applications with multi-stage training procedures of commensurate complexity.
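
The interdependency issues in the last two items can be sketched in a few lines of Python. This is only an illustration: the stage names and the `DOWNSTREAM` mapping are hypothetical, not taken from any real pipeline. Given a record of which stages consume which outputs, a breadth-first walk lists everything that has to be rebuilt or retrained when one stage changes.

```python
from collections import deque

# Hypothetical pipeline: each stage maps to the stages that consume its output.
DOWNSTREAM = {
    "raw_events": ["feature_engineering"],
    "manual_labels": ["churn_model"],
    "feature_engineering": ["churn_model", "segment_model"],
    "churn_model": ["retention_app"],
    "segment_model": ["retention_app"],
    "retention_app": [],
}

def stages_to_refresh(changed, downstream):
    """Return every stage affected by a change to `changed`, in breadth-first order."""
    affected = []
    seen = {changed}
    queue = deque([changed])
    while queue:
        stage = queue.popleft()
        for consumer in downstream.get(stage, []):
            if consumer not in seen:  # visit each stage only once
                seen.add(consumer)
                affected.append(consumer)
                queue.append(consumer)
    return affected

print(stages_to_refresh("raw_events", DOWNSTREAM))
print(stages_to_refresh("manual_labels", DOWNSTREAM))
```

Even in this small example, a change to the raw event data touches every other stage except the manual labels. The more stages and cross-links a pipeline has, the longer that list grows, which is one way the maintenance burden compounds.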

As your development practice adds layers of algorithmic-model training, the technical debts accumulate. Keeping these models current will usually mean retraining them afresh with every app-update cycle. And if you consider that technical debts may also require parallel revisions to the corresponding app code, it’s clear that the maintenance burdens will intensify as developers build apps that rely on ML and other advanced analytics.

Keeping the maintenance burden under control requires a keen eye on the technical debts that may be building up inside your next-generation application development practice. That means you need to factor ML, deep learning, and cognitive computing pipelines into your DevOps practices. For a good Wikibon Premium report on the DevOps challenges in the era of digital business, check out this report from last year.