Systems Handbook

Recently, I had the opportunity to participate in a joint mainframe study between IBM and a large financial services company in Europe. As part of that study, the customer’s service management processes were also explored, with the goal of figuring out where and how these processes could be optimized and further automated.

The challenge that I faced was to quickly get an overview not only of the individual service management processes but also of the tools that support people in these processes in their day-to-day jobs. Now, you probably rightly assume that the author of this blog should be very familiar with all this. But to be honest, as soon as you start to dig a little deeper, it becomes very obvious where the gaps are. So let me share with you how I went about filling in the blank spots.

One of the service management processes that I looked at in more detail was Incident Management. As you know, Incident Management is about restoring normal service operation as quickly as possible once an incident has been detected. It helps to ensure that service level objectives and availability targets are met. Conceptually, the tooling required to support you in this process is depicted in the following picture:

Shows incident management phases and capabilities required to resolve incidents

The process requires tools that monitor the individual services to detect issues. This is all about monitoring, tracing and logging. Once an incident has been detected, the process requires tools to create a ticket, to analyze the data, and to reduce it to consumable bites for the Site Reliability Engineer (SRE) to work with. Then, there are tools that assist you with planning tasks such as notifying the appropriate specialist and suggesting possible solutions. Finally, it requires tools that put all this into action through collaboration and knowledge sharing with the help of chat tools, bots and automated runbooks.
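To make the four phases a bit more tangible, here is a minimal Python sketch of how they hand over to each other. It is purely illustrative and does not reflect any IBM product API: the `Incident` class, the functions `detect`, `analyze`, `plan` and `act`, the threshold rule, the on-call table and the runbook names are all hypothetical names I made up for this example.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Callable, Dict, List, Optional

_ticket_numbers = count(1)  # hypothetical stand-in for a real ticketing system


@dataclass
class Incident:
    service: str
    events: List[str]
    ticket_id: Optional[str] = None
    summary: Optional[str] = None
    assignee: Optional[str] = None
    suggestions: List[str] = field(default_factory=list)


def detect(service: str, metrics: Dict[str, float], threshold: float) -> Optional[Incident]:
    """Monitoring: open an incident if any metric exceeds the threshold."""
    breaches = [f"{name}={value}" for name, value in metrics.items() if value > threshold]
    return Incident(service, breaches) if breaches else None


def analyze(incident: Incident) -> None:
    """Create a ticket and reduce the raw events to a short summary."""
    incident.ticket_id = f"INC-{next(_ticket_numbers):04d}"
    shown = ", ".join(incident.events[:3])  # keep only the first few events
    incident.summary = f"{len(incident.events)} breach(es) on {incident.service}: {shown}"


def plan(incident: Incident, on_call: Dict[str, str], runbooks: Dict[str, List[str]]) -> None:
    """Pick the specialist to notify and look up candidate runbooks."""
    incident.assignee = on_call.get(incident.service, "duty-manager")
    incident.suggestions = list(runbooks.get(incident.service, []))


def act(incident: Incident, post: Callable[[str], None]) -> None:
    """Share the incident in a chat channel; `post` stands in for a chat client."""
    hints = ", ".join(incident.suggestions) or "n/a"
    post(f"[{incident.ticket_id}] {incident.summary} -> {incident.assignee}; try: {hints}")


# Example run with made-up data; the list `messages` plays the role of the chat channel.
messages: List[str] = []
incident = detect("payments", {"cpu_pct": 97.0, "io_wait_pct": 12.0}, threshold=90.0)
if incident is not None:
    analyze(incident)
    plan(incident, {"payments": "alice"}, {"payments": ["restart-region"]})
    act(incident, messages.append)
    print(messages[0])
```

In a real toolchain each of these functions would be a separate product or service, which is exactly why the integration between them matters.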

So far so good. But how do the products in the IBM zSystems AIOps portfolio fit into this picture? How can I learn a bit more about those products that are not necessarily part of my own day job?

Luckily, this learning journey wasn’t so difficult. I used the AIOps framework that we have created around the areas Detect, Decide, and Act. That framework highlights the different capabilities that are required to successfully operate your mainframe systems in the context of your hybrid cloud environment. Please refer to Sanjay Chandru’s blog for an introduction to this framework, and to the subsequent blogs it references, which discuss the three areas in more detail.

On top of the framework, we have also created the AIOps on IBM zSystems Handbook, which organizes the products in the IBM zSystems AIOps portfolio along Detect, Decide, and Act. The overview fits on a single page, as depicted in the following picture:

AIOps framework on IBM zSystems depicting IBM capabilities in the areas Detect, Decide, and Act

For instance, if you are interested in finding out more about Collaborative incident remediation, you can click on the color-coded hand icon next to this heading to jump to a more detailed page introducing IBM Z ChatOps and Service Management Unite. That page then briefly talks about the challenges, the new capabilities you might consider to address them, and what IBM can offer in that space. And if that’s not enough, you can also follow a link to a blog with more details about this section.

So, after studying the handbook and getting a better sense of all of our products’ capabilities, I was able to complete the IBM view of the Incident Management tooling overview easily. The result is shown below:

Incident Management Toolchain depicting IBM tools and references to 3rd party

As you can see, IBM’s AIOps portfolio covers all the capabilities that are needed to facilitate the Incident Management process, from detecting incidents in the first place, through AI-assisted grooming, to the collaborative resolution of the incident using ChatOps on popular platforms such as Slack, Microsoft Teams and the open-source-based Mattermost.

Looking at the picture, you might ask yourself: do I really need all these tools? No, of course you don’t need all the products if you are content with the way you manage incidents in your organization today. However, with the growing complexity of the application landscape paired with shrinking mainframe skills, effective tools are needed to improve mean time to restore, or at least to maintain it at the level you have today.

Isn’t there a product that can do all of this? Personally, I haven’t seen such a product. As the picture above shows, a variety of capabilities are required to manage incidents efficiently. The key point is not necessarily to have a single product. Rather, it is about how easily you can integrate these capabilities, even across products from different vendors, so that you can pick the best tool for each task and benefit most. For instance, the IBM Z OMEGAMON Data Provider converts proprietary OMEGAMON metrics to JSON data that can be fed into a data lake of choice. Similarly, IBM Cloud Pak for Watson AIOps or IBM Netcool Operations Insight can consume events from a variety of IBM and non-IBM sources and inform SREs about them.
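To illustrate why a neutral format such as JSON makes this kind of integration easy, here is a small Python sketch of a consumer that reads metric records and groups them before loading them into a data store. It is a sketch under assumptions, not a description of the OMEGAMON Data Provider interface: I assume one JSON object per line, and the field names `system_id` and `cpu_pct` as well as the helper names are hypothetical. Please check the product documentation for the actual record layout.

```python
import json
from typing import Dict, Iterable, Iterator, List


def parse_metric_lines(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per well-formed JSON object line; skip everything else."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate malformed input instead of stopping the feed
        if isinstance(record, dict):
            yield record


def partition_by(records: Iterable[dict], key: str) -> Dict[str, List[dict]]:
    """Group records by the value of `key`, e.g. to write one partition per system."""
    partitions: Dict[str, List[dict]] = {}
    for record in records:
        partitions.setdefault(str(record.get(key, "unknown")), []).append(record)
    return partitions


# Made-up sample input; the field names are hypothetical.
sample = [
    '{"system_id": "SYS1", "cpu_pct": 71.5}',
    "not json",
    '{"system_id": "SYS2", "cpu_pct": 43.0}',
    '{"system_id": "SYS1", "cpu_pct": 88.2}',
]
partitions = partition_by(parse_metric_lines(sample), "system_id")
for system, rows in sorted(partitions.items()):
    print(system, len(rows))
```

Because the consumer only depends on plain JSON, the same few lines would work regardless of which vendor’s tool produced the records.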

I found the handbook an excellent tool for familiarizing myself with all the products in the IBM zSystems AIOps portfolio. It reminded me of the various challenges you can encounter in your day-to-day operations and of the solutions that are available to address them. If you haven’t had a chance to take a look yet, I can only encourage you to download a copy and start planning the next stage of your journey to AIOps.

Originally published on the IBM AIOps Community Blog.