Evolution of Artificial Intelligence for IT Operations

Sudhir Goel, Director of IT Infrastructure and Operations, Granite Construction

As IT services have matured in recent years, IT operations processes and automation tools have evolved to support them. The focus of these improvements has been to deliver IT support quickly and consistently. However, automated resolution of operational incidents has been slower to come of age. Many operational automation products label themselves as artificial intelligence (AI) based tools, but they often rely on simple automation rather than being intelligent solutions capable of learning.

In the context of operational incident resolution, automation is limited to detecting operational events, matching them to a pre-identified set of event signatures, and resolving them with previously created runbooks.
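As a rough illustration of this kind of signature matching, the sketch below maps incoming event text to a runbook name using regular expressions. The patterns, the runbook names, and the match_runbook function are all hypothetical and are not taken from any specific product.

```python
import re

# Hypothetical signature table: regex pattern -> name of a pre-built runbook.
SIGNATURES = {
    r"disk usage (9\d|100)% on (?P<host>\S+)": "cleanup_tmp_files",
    r"service (?P<service>\S+) not responding": "restart_service",
}


def match_runbook(event_text):
    """Return (runbook_name, captured_fields) for the first matching signature.

    Returns None when the event matches no known signature, which is the
    case that simple automation cannot handle on its own.
    """
    for pattern, runbook in SIGNATURES.items():
        found = re.search(pattern, event_text)
        if found:
            return runbook, found.groupdict()
    return None
```

An event that matches no pattern returns None, which reflects the limitation described above: nothing outside the pre-identified set can be resolved automatically.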

More advanced tools track the success of runbook resolution and record a “success-score”. This score is used to prioritize the runbooks, resulting in a better resolution rate in the future. Although these tools use resolution results to automatically improve the matching of event signatures to runbooks, they do not create new runbooks on their own.
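One possible way to compute and use such a score is sketched below. The RunbookStats class and the smoothing formula are assumptions chosen for illustration; actual tools may score runbooks quite differently.

```python
from dataclasses import dataclass


@dataclass
class RunbookStats:
    """Tracks how often a (hypothetical) runbook has resolved an incident."""
    name: str
    attempts: int = 0
    successes: int = 0

    def record(self, resolved: bool) -> None:
        self.attempts += 1
        if resolved:
            self.successes += 1

    @property
    def success_score(self) -> float:
        # Assumption: add-one smoothing, so an untried runbook starts at 0.5
        # instead of 0 and still gets a chance to run.
        return (self.successes + 1) / (self.attempts + 2)


def prioritize(candidates):
    """Order candidate runbooks from highest to lowest success score."""
    return sorted(candidates, key=lambda r: r.success_score, reverse=True)
```

The score only reorders runbooks that already exist. It improves matching over time but never produces a new runbook.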

True AI should be able to create a new runbook based on observations and learning, even if by only observing a human. IT operations technology has not reached that level yet.

In one of my previous roles, we had significant success in developing a solution that came close to AI. In one year, this solution auto-resolved an impressive 45% of incidents.

It resolved incidents in a matter of seconds, prevented downstream impact, and eliminated 20% of incidents from ever occurring. The results, though impressive, still fell far short of a true AI solution, which should have eliminated the need for human intervention for any recurring incident, auto-resolved up to 70% of incidents, and prevented an even larger share from ever occurring.

An AI solution typically consists of the following modules:

• Monitoring/detection of events – This can take the form of a virtual chat agent or a monitoring/ticket parsing tool.

• Signature generation and runbook mapping – Using inputs from the monitoring/detection tools, the AI identifies a unique event signature and matches it to a set of runbooks/solutions that may be able to resolve the incident.

• Machine learning runbook execution/prioritization module – This module runs the matched runbooks/solutions one after another until the issue is fixed or the matching solutions are exhausted. It also tracks the success rate of each solution and generates a “success-score” for future runs. If no solution is found, the ticket is moved to a human queue with a history of the runbooks/solutions applied.

• Runbook creation module – This module observes and captures all commands a human executes to resolve an event, searches online vendor websites/repositories for solutions, and creates a basic runbook. This runbook can then be manually tweaked to ready it for use.
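The last two modules can be sketched together as follows. This is a minimal illustration under stated assumptions, not a description of any real product: resolve_incident, draft_runbook, and the dictionary layouts are hypothetical names and structures invented for the example.

```python
def resolve_incident(incident, ranked_runbooks, is_resolved):
    """Run runbooks in priority order until one resolves the incident.

    ranked_runbooks: list of (name, action) pairs, best candidate first.
    is_resolved: callable that checks whether the incident is fixed.
    """
    history = []
    for name, action in ranked_runbooks:
        try:
            action(incident)
            ok = bool(is_resolved(incident))
        except Exception:
            ok = False  # a failing runbook counts as an unsuccessful attempt
        history.append((name, ok))
        if ok:
            return {"status": "auto_resolved", "history": history}
    # Matching solutions exhausted: hand over to a human with the history.
    return {"status": "human_queue", "history": history}


def draft_runbook(signature, captured_commands):
    """Turn a human resolver's command log into a draft needing manual review."""
    steps = [
        # The log records what was run, not why or under which condition,
        # so those fields stay empty until a person fills them in.
        {"command": cmd.strip(), "reason": None, "condition": None}
        for cmd in captured_commands
        if cmd.strip()
    ]
    return {"signature": signature, "steps": steps, "status": "draft"}
```

The empty reason and condition fields in the draft mark the part of the runbook that capture alone cannot supply.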

It is the runbook creation module that needs significant enhancement to get to a true AI solution. However, auto-creation of runbooks is not as simple as it appears. Humans take actions based on what they see on the screen, but a capture of resolver actions is simply a log of the commands that were executed. It does not capture why a command was needed or when a variation of the commands should be used; that still requires manual input. Similarly, vendor websites are not designed to give automated solutions for issues reported by customers across the world. Over time, with iterative improvements to the runbook creation module, as well as better solution linkages with vendors, there is potential to develop a true AI solution. That solution would be truly self-learning and would make a measurable impact by minimizing business downtime and reducing operational costs.