Artificial Intelligence (AI) in general needs no media blitz. IT operations (ITOps), on the other hand, is the least sexy and most often overlooked group within technology operations. However, the marriage of the two has given ITOps significant visibility and elevated its profile. There is significant and renewed interest in the holy grail of minimal downtime and the efficient running of ITOps. Historically, ITOps has been all about keeping the lights on and making sure that the infrastructure, and the applications running on it, perform as expected. This post focuses on understanding the context, evolution, and direction of AIOps, while a subsequent post will take a deep dive into the current market offerings.
What is this rage about IT operations and keeping the lights on? First, the burn rate of the ITOps budget has been rising steadily and now consumes up to 70% of the budget, which is no longer sustainable. Second, with the rise of digital-first enterprises and apps, point tools and platforms are unable to handle the required response, scale, and inherent complexity. The built-in measurements are more binary than granular (not to mention that they completely miss the user experience). The compound effect is a muted and subpar client experience. When an end user calls the helpdesk to report that an app is not working, or that the database is down and they can’t complete transactions or generate a Tableau report, the credibility of the ITOps team takes a hit. This is despite heavy spending on infrastructure, monitoring systems, and tools to keep the lights on. A 2018 Everest Group survey of 200 CIOs (at enterprises with more than $1B in revenue) found:
- 71% of the enterprises believed they lack a meaningfully scalable model for infrastructure growth.
- 73% of the respondents had included and identified intelligent automation as a key theme for infrastructure management as part of a broader IT adoption strategy.
Digital transformation often leads to a hybrid environment (at least in the interim), and that requires two different sets of tools, processes, and response thresholds. This has set the stage for the rise of artificial intelligence for IT operations, or AIOps. At least that’s the premise.
Systems were simple, siloed and segmented
As we transitioned away from a self-hosted back office with localized operations, we were still in back-office mode as far as most end customers were concerned. Most of the productivity and efficiency improvements were targeted at internal systems and rarely crossed the hard line to the end customer. The advent and acceleration of cloud computing (2000s) laid the foundation of this “Digital Divide” between the legacy and current state. As ubiquitous connectivity and high-speed internet became mainstream, the pace of transformation amplified further with the advent of the apps economy, rich media uptake, and mobile transformation. For the first time, consumers were ahead of the enterprise and in control of technology adoption, uptake, and industrialization. The elements of the digital economy started to emerge piecemeal from the dotcom bubble (2000s), with e-commerce players and distinct trends laying the foundations of modern-day customer experience and interaction. Amazon patented its 1-Click service, which allows users to make faster purchases. Digital transformation fundamentals and foundational drivers started to emerge rapidly with e-commerce adoption and acceptance. While the Harvard Business School (HBR) premium model was very well received by consumers for services and software, the exchange of goods and services needed platforms to transact with trust. The ability to transact and collaborate very quickly replaced the 9-to-5 business day with a 24x7 model. Interaction in this new medium was primarily digital, and the existing suite of applications and infrastructure was not able to support and scale for this new opportunity. Rather, legacy architecture stifled rapid growth for many enterprises, and some were driven out of business not by competition but by their own inaction.
Descent into Chaos … Local Scale to Planetary Scale
This was further compounded by the explosion in apps, interfaces, and devices. Consider Facebook: more than 1.39 billion people connect to Facebook’s infrastructure per month, of which 1.19 billion are on mobile alone. Nearly 1 billion photos are shared and more than 3 billion videos are viewed every day. Facebook’s services run on top of hundreds of thousands of servers spread across multiple geographically separated data centers. None of this can be managed at human scale! Things will break and require care and feeding at a rate much faster than eyeballs can watch and hands can hit the keyboard. The evolution from on-prem/local to web scale, and now to planetary scale, requires an out-of-the-box, unconventional approach to reacting and responding to events. Uptime and digital user experience are the lifeline of customer service and business processes, which would otherwise degrade into chaos. Thanks to data analytics and machine learning technologies, we have a possible breakthrough here. This would not have been possible were it not for Google and its teams fine-tuning and industrializing the Site Reliability Engineering (SRE) approach at scale. The focus of this post is not to cover SRE but to highlight the emergence of data science and machine learning, and the possibilities that emerged as a result. That is a promise!
The Promise … Single Pane of Glass (SPOG)
Gartner research published in 2019 on augmenting decision making in DevOps states, “The growing need for organizations to analyze vast volumes of data in enabling rapid application delivery makes manual decision making a key bottleneck in DevOps. I&O leaders must leverage AI techniques to make data-driven decisions and automate actions to ensure business agility and stability”. Gartner estimates that only 5% of all large enterprises are currently combining big data and machine learning (the heart of an AIOps platform) to support and partially replace monitoring, service desk, and automation processes and tasks. However, Gartner expects that number to jump to 40% of all large enterprises by 2022. If this comes true, AIOps will create a massive shift in IT operations methodology and spending, and it benefits everyone to understand what vendors, products, and services make up the AIOps marketplace. Will there be a single pane of glass (SPOG)? Perhaps as likely as finding a unicorn in your neighborhood park! Silos between network, infrastructure, apps, servers, databases, security, and end-user computing are deep, diverse, and well fortified. Instead, the focus should be on breaking the silos, leveraging business process availability (BPA) and, subsequently, digital experience monitoring (DEM) as key metrics. This becomes possible if we pivot to event-based viewing, dynamic discovery, real-time mapping, and event correlation, rather than triage organized by monitoring tool or by resolution role. The building blocks of a single pane of glass are accurate telemetry, aggregation of large amounts of data, minimal human input in the ack-react loop, and noise reduction using algorithmic clustering.
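To make the noise-reduction idea a little more concrete, here is a minimal sketch of one simple form of algorithmic clustering: alerts that hit the same business service close together in time are collapsed into a single event. Everything in it is an illustrative assumption rather than how any particular AIOps product works: the `Alert` fields, the `service` tag, the `cluster_alerts` function, and the 120-second gap are all hypothetical, and real platforms typically use richer signals such as topology and text similarity.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Alert:
    # Hypothetical alert shape; real tools expose many more fields.
    timestamp: float  # seconds since epoch
    source: str       # monitoring tool that raised the alert
    service: str      # assumed business-service tag added at ingestion
    message: str


def cluster_alerts(alerts: List[Alert], max_gap_seconds: float = 120.0) -> List[List[Alert]]:
    """Group alerts per service; an alert joins the service's open cluster
    if it arrives within max_gap_seconds of that cluster's latest alert."""
    clusters: List[List[Alert]] = []
    open_cluster_by_service: Dict[str, List[Alert]] = {}

    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = open_cluster_by_service.get(alert.service)
        if current is not None and alert.timestamp - current[-1].timestamp <= max_gap_seconds:
            # Same service, close in time: treat as part of the same event.
            current.append(alert)
        else:
            # First alert for this service, or the gap was too long: new event.
            current = [alert]
            clusters.append(current)
            open_cluster_by_service[alert.service] = current

    return clusters
```

Fed an alert storm from several monitoring tools, `cluster_alerts` returns one list per inferred event, so an operator would acknowledge a handful of events instead of every individual alert. The time-gap rule is deliberately naive; it is meant only to show where clustering sits in the ack-react loop.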
Data as Crystal Ball Into Future State
AIOps is to service management what AI is to enterprise data. AIOps allows us to move from reactive to preventive and finally to predictive. The goal of AI and its current advancements is to apply tools and techniques to data to prevent the inevitable, to predict future possibilities within the use case context, and to help optimize business process performance and functions. We have collected a ton of data and stored it successfully and safely in the proverbial mile-deep vaults, but we haven’t had the time or the tools to analyze it and leverage its ability to act as a crystal ball for making future predictions. This is primarily ticket data, but it can be expanded to include all IT operations artifacts, including logs, RCAs, runbooks, monitoring alerts, notifications, etc. That this data can help predict future state health is the premise of AIOps!
AIOps Use Cases … The Path Forward
While we are far from realizing the benefits of full automation and the move towards NoOps, the following patterns have emerged as possible use cases that not only address low-hanging fruit but also provide a foundation for building AI/ML-based IT operations practices.
- Prevent and Predict: An emerging use case is to predict the failure of the DevOps pipeline based on the release history, the magnitude of changes, the complexity of the build, etc. This avoids the toll of downtime as well as expensive regression testing.
- Anomaly/threat detection: Once the baseline behavior of the system is established, the AIOps tool watches for variance and flags outliers as they appear. AIOps is a valuable addition to a strong security management posture. Heuristics and algorithms can mine traffic data for botnets, scripts, or other threats that can take out a network. Subsequently, if the anomalies represent the new baseline, the mechanism updates and revises its thresholds dynamically. This capability and the resulting use case are gaining wider traction due to the rapid growth of cloud computing workloads.
- Event Correlation: Infrastructure teams face floods of alerts, yet only a handful are business impacting. AIOps can mine these alerts, use inference models to group them together, and identify the upstream root-cause issues at the core of the problem. Often when an event occurs, multiple monitoring systems generate alert storms while users open related tickets; all of these can be triaged and tracked as one event.
- Intelligent alerting and escalation: After root-cause alerts and issues are identified, ITOps teams are using artificial intelligence to automatically notify subject matter experts or teams of incidents for faster remediation. Artificial intelligence can act as a routing system, immediately setting the remediation workflow in motion before a human being ever gets involved.
- Incident auto-remediation: AIOps is also being used as an end-to-end bridge between ITSM and IT operations management tools. Traditionally, ITSM teams sift through infrastructure data to identify and remediate issues at the root cause. AIOps extracts root cause inferences from infrastructure alerts and sends them to an ITSM team or tool through API integration pathways.
- Capacity optimization: This can also include predictive capacity planning and refers to the use of statistical analysis or AI-based analytics to optimize application availability and workloads across infrastructure. These analytics can proactively monitor raw utilization, bandwidth, CPU, memory, and more, and help increase overall application uptime.
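The anomaly detection use case above, with its dynamically revised baseline, can be sketched with a rolling window and a z-score test. This is a minimal illustration under stated assumptions, not a description of any vendor's algorithm: the `RollingBaselineDetector` class, the window size, the threshold of three standard deviations, and the warm-up count are all hypothetical choices.

```python
from collections import deque
from statistics import mean, pstdev


class RollingBaselineDetector:
    """Flags values that deviate strongly from a rolling baseline.

    Every observation, anomalous or not, is added to the window, so a
    sustained shift gradually becomes the new baseline.
    """

    def __init__(self, window_size: int = 30, z_threshold: float = 3.0, min_samples: int = 10):
        self.window = deque(maxlen=window_size)  # oldest values fall off automatically
        self.z_threshold = z_threshold
        self.min_samples = min_samples

    def observe(self, value: float) -> bool:
        """Return True if value is an outlier relative to the current window."""
        is_anomaly = False
        if len(self.window) >= self.min_samples:
            baseline = mean(self.window)
            spread = pstdev(self.window)
            if spread > 0:
                is_anomaly = abs(value - baseline) / spread > self.z_threshold
            else:
                # A perfectly flat baseline: any change counts as a deviation.
                is_anomaly = value != baseline
        # Learn from the new value so the thresholds adapt over time.
        self.window.append(value)
        return is_anomaly
```

Calling `observe` once per metric sample (CPU utilization, request latency, and so on) yields a flag per sample. Because flagged values still enter the window, a metric that settles at a new level stops being reported once the window has filled with the new behavior, which mirrors the "new baseline" behavior described in the list. A production system would likely also account for seasonality and trend, which this sketch ignores.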
As complexity continues to mount, failure potential increases exponentially while the pressure builds for IT teams to deliver business services with minimal/zero downtime. AIOps is emerging as a leading-edge discipline to address these operational issues and, in doing so, to focus effectively on true auto-remediation and root-cause (vs. repeat) resolutions. IT leadership and operations teams have started to recognize the potential and are carefully integrating the use cases into their operational models, with some early proofs of concept/proofs of value (POC/POV) delivering promising results. The competitive advantage in adopting and embracing AIOps is not purely about resource-unit/opex savings; it also has the potential to bring continuous innovation to the enterprise. Some of the key players in the emerging AIOps space are StackState, OpsRamp, Opsani, Dynatrace, ScienceLogic, Moogsoft, BigPanda, SignalFx, and Darwin AI. A subsequent post will cover the individual vendor capabilities and product focus areas.
DISCLOSURE STATEMENT: Opinions are those of the individual author. Unless noted otherwise in this post, the author is not affiliated with, nor endorsed by, any of the companies mentioned. All trademarks and other intellectual property used or displayed are property of their respective owners.