Applying Machine Learning to IT Operations Management – Part 4: Diagnostics, Remediation, and Optimization

September 11, 2017 | By George Gilbert

Analysis, Availability, Big Data

Premise. Customers can choose breadth-first or depth-first application performance and operations intelligence (AP&OI) applications. Breadth-first offers the ability to diagnose and describe problems across the end-to-end landscape, including across cloud platforms, at the expense of more false positive alerts. Depth-first, focused on big data applications, for example, offers a narrower scope but with high-fidelity root cause analysis and the associated ability to precisely remediate problems or optimize the application’s performance.

This report is the fourth in a series of four reports on applying machine learning to IT Operations Management. As mainstream enterprises put in place the foundations for digital business, it’s becoming clear that the demands of managing the new infrastructure are growing exponentially. The first report summarizes the approaches. The following three reports go into greater depth:

Applying Machine Learning to IT Operations Management – Part 1

Applying Machine Learning to IT Operations Management for End-to-End Visibility – Part 2

Applying Machine Learning to IT Operations Management for High-Fidelity Root Cause Analysis – Part 3

Being able to diagnose, anticipate, plan, and then take action to maintain application and infrastructure availability and performance requires deep knowledge of an application domain or resource pool. That deep knowledge can only come from two sources. The application performance and operations intelligence (AP&OI) vendor has to build in knowledge of a finite domain: a set of workload, service, and infrastructure entities. Machine learning models encode knowledge of the topology of how the entities fit together as well as how they operate individually and collectively over time. The deeper that knowledge, the more precise the recommendations for how to maintain performance and availability. Ultimately, by putting a boundary around the AP&OI’s scope of visibility, the AP&OI application is able to make assumptions about how things fit together and what’s going wrong when there’s a problem.

High fidelity remediation requires knowledge about how a domain or resource pool works together.

AP&OI apps based on breadth-first require added intelligence from administrator diagnostics.

Breadth-first and depth-first can anticipate problems but planning requires depth-first.

High Fidelity Remediation Requires Knowledge About How a Domain or Resource Pool Works Together.

Diagnosing, describing, and remediating a problem with high fidelity can only happen with a depth-first product (see table 1). A depth-first product knows how a well-defined set of entities in an application fit together and operate. For big data applications, this deep knowledge comes from pre-trained machine learning models that capture the behavior of each entity in the application, including workloads, services, and infrastructure; a data model that corresponds to each entity’s structure; and a pre-designed “graph structure” that captures their collective structure and behavior. The graph structure, which is relatively new to enterprise software, is similar to the “knowledge graph” data model supporting Google Search, which captures how people, places, and things are related. The AP&OI vendor can train their machine learning models both in how each entity works individually as well as how they all work together over time. The vendor can do this training using actual data from design partners or operations at early customers. Getting to high fidelity intelligence can only work where the scope is bounded, such that the models can have enough pre-trained intelligence to figure out an individual customer’s specific topology and the operation of all their entities working together.
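To make the idea of a pre-designed graph structure concrete, here is a minimal sketch in Python. It is not any vendor's actual data model: the entity names and the `DEPENDS_ON` map are hypothetical, loosely modeled on a Hive, MapReduce, YARN, and HDFS stack, and the `affected_by` helper is an illustrative assumption. It only shows how knowing the dependency edges between entities lets an application work out which workloads and services sit above a failing component.

```python
from collections import defaultdict, deque

# Hypothetical topology: each entity maps to the entities it depends on.
DEPENDS_ON = {
    "hive_query": ["mapreduce_job"],
    "mapreduce_job": ["yarn", "hdfs"],
    "yarn": ["node_1", "node_2"],
    "hdfs": ["node_1", "node_2"],
    "node_1": [],
    "node_2": [],
}


def affected_by(failed, depends_on):
    """Return every entity that directly or transitively depends on `failed`."""
    # Invert the edges so we can walk from a component up to its dependents.
    dependents = defaultdict(list)
    for entity, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(entity)

    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents[current]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


impacted = affected_by("node_1", DEPENDS_ON)
print(sorted(impacted))
```

A real depth-first product would attach learned behavior to each node and edge; the sketch covers only the structural part, which is what allows a problem on one server to be tied back to the query that ultimately suffers from it.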

For example, a customer was executing a long-running ETL workload using Hive on MapReduce, with data distributed across HDFS in a large cluster and YARN as the resource manager. The customer wrote a SQL query in Hive, which created a query plan that executed in MapReduce. MapReduce accessed the datasets in HDFS based on the file paths and the tables within them. The workload was on a multi-tenant infrastructure, so YARN mediated the multiple services distributed across and running on each server. The AP&OI application tracked how everything executed together, all the way down to the data shuffling in the map and reduce steps and their phases on individual processor cores. The AP&OI application found that 382 tasks finished in under a minute. One straggler task never finished; it ran for 17 hours before it was killed. A surface-level analysis would show only that the MapReduce job stopped. The AP&OI application determined that the problem was data skew: too much data on one node bottlenecked the entire workload. The AP&OI application was able to remediate the problem by adjusting the settings in both Hive and MapReduce, since it knew that the query ultimately came from Hive but executed in MapReduce. The AP&OI application also knew how the MapReduce operations mapped to the data in HDFS distributed across the cluster. When the data skew was fixed, the entire workload completed in 2 hours and 17 minutes. This type of diagnosis and remediation is only possible with a depth-first AP&OI product.
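The first step in the example above, spotting the straggler among hundreds of fast tasks, can be sketched in a few lines of Python. This is an illustrative heuristic, not the method used by the product described: the `find_stragglers` function, the factor of 10, and the 45-second duration assigned to each fast task are all assumptions. Only the counts (382 fast tasks, one task at 17 hours) come from the example.

```python
from statistics import median


def find_stragglers(durations_s, factor=10.0):
    """Return indices of tasks that ran longer than `factor` times the median."""
    typical = median(durations_s)
    return [i for i, d in enumerate(durations_s) if d > factor * typical]


# 382 tasks that finished in under a minute (45 s each is a made-up value)
# plus one task that ran for 17 hours before being killed.
durations = [45.0] * 382 + [17 * 3600.0]

stragglers = find_stragglers(durations)
print(stragglers)
```

A duration outlier like this only hints at data skew. Confirming it, and knowing which Hive and MapReduce settings to change, requires the per-task input sizes and the mapping back to the originating query, which is the depth-first knowledge the example relies on.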

Intelligence \ Action	Describe	Remediate	Optimize
Diagnose	✓ breadth-first, depth-first	✓ depth-first, limited for breadth-first	✕
Anticipate	✓ breadth-first, depth-first	✓ depth-first	✕
Plan	✕	✕	✓ depth-first

Table 1: Intelligence determines ability to act

AP&OI Apps Based on Breadth-First Require Added Administrator Intelligence for Diagnostics.

AP&OI applications based on breadth-first can surface alerts that diagnose problems. It’s somewhat harder to suggest remediation because an accurate root cause analysis requires more depth than just identifying when something goes wrong with a component (see table 1). Breadth-first has more trouble telling you why something went wrong because it doesn’t have a deep model about how a component works. And it has yet more trouble telling you when something goes wrong between components because the management application doesn’t necessarily know about how all the components fit and work together. Administrators are more likely to see alerts all around the core problem and will have to analyze the data from multiple, specialized administrative perspectives to get the root cause.

Breadth-First and Depth-First Can Anticipate Problems, But Planning Requires Depth-First.

Both breadth-first and depth-first management applications can anticipate problems based on an analysis of the trends in telemetry. Depth-first approaches can suggest remediation with higher confidence based on richer knowledge of the context of the interactions within the application. Finally, planning requires the ability to predict multiple outcomes based on what-if analysis and then to choose the optimal outcome. For example, plans to add capacity may show that storage-intensive nodes are more important for new capacity than more balanced systems.
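The what-if analysis described above can be illustrated with a small Python sketch that evaluates each candidate node type against a projected demand and picks the cheapest outcome. Every number here is hypothetical: the node profiles, costs, and demand figures are invented for illustration, and `NODE_TYPES`, `nodes_needed`, and `cheapest_plan` are assumed names, not part of any AP&OI product. A real planner would draw its predictions from learned models of the workload rather than simple division.

```python
from math import ceil

# Hypothetical node profiles: capacity per node and cost per node.
NODE_TYPES = {
    "balanced": {"cores": 32, "storage_tb": 24, "cost": 10_000},
    "storage_intensive": {"cores": 16, "storage_tb": 96, "cost": 12_000},
}


def nodes_needed(profile, extra_cores, extra_storage_tb):
    """Nodes required so that both the core and storage demands are met."""
    return max(
        ceil(extra_cores / profile["cores"]),
        ceil(extra_storage_tb / profile["storage_tb"]),
    )


def cheapest_plan(extra_cores, extra_storage_tb, node_types=NODE_TYPES):
    """Evaluate one scenario per node type and return the cheapest."""
    plans = {}
    for name, profile in node_types.items():
        count = nodes_needed(profile, extra_cores, extra_storage_tb)
        plans[name] = {"nodes": count, "cost": count * profile["cost"]}
    best = min(plans, key=lambda name: plans[name]["cost"])
    return best, plans


# Storage-heavy growth scenario (made-up demand figures).
best, plans = cheapest_plan(extra_cores=64, extra_storage_tb=480)
print(best, plans)
```

With these assumed figures, a storage-heavy growth scenario favors storage-intensive nodes, while a compute-heavy scenario would favor the balanced profile. The point is the shape of the analysis: enumerate outcomes, score them, and choose.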

Action Item. AP&OI tools with intelligence will leapfrog each other for the next few years. Don’t make depth-first or breadth-first decisions narrowly based on product or brand attributes. Instead, base decisions on your portfolio of applications and talent. Depth-first AP&OI applications are the best approach under either or both of the following conditions: 1) your big data application or similar specialized domain employs batch-mode integration between it and other applications such as Web or mobile or systems of record; and 2) you have a shortage of administrators in a particular domain and need to leverage their expertise with as much root cause and remediation intelligence as possible. Breadth-first AP&OI applications are the best approach under either or both of the following conditions: 1) your specialized domain features high-latency integration with the rest of your application and infrastructure landscape; and 2) you need to foster closer collaboration between administrators in different domains.

George Gilbert

George Gilbert, lead data & analytics analyst for theCUBE Research. Former Gartner analyst, former lead enterprise software analyst for Credit Suisse First Boston, one of the top investment banks serving the technology sector. Big Data analyst for Gigaom Research. Co-founded Techalphapartners, a consultancy that advised vendors and institutional investors on market development and product strategy. George has led conference panels with prominent thought leaders in cloud infrastructure and big data. He has been profiled on the front page of the Wall Street Journal and published as a guest author in a major overview of the evolution of cloud computing in The Economist. Prior to being an analyst, George was a product manager on Notes at Lotus Development. George received his BA in economics from Harvard University.
