Handy AIOps on IBM zSystems Handbook – A user experience

Sep 1, 2022


Juergen Holtz

Recently, I had the opportunity to participate in a joint mainframe study between IBM and a large financial services company in Europe. As part of that study, the customer’s service management processes were also explored, with the goal of figuring out where and how these processes could be optimized and further automated.

The challenge that I faced was to quickly get an overview of the individual service management processes, as well as the tools that support people in these processes in their day-to-day jobs. Now, you probably rightly assume that the author of this blog should be very familiar with all this. But to be honest, as soon as you start to dig a little deeper, it becomes very obvious where the gaps are. So let me share with you how I approached filling in the blank spots.

One of the service management processes that I looked at in more detail was Incident Management. As you know, Incident Management is about restoring normal service operation as quickly as possible once an incident has been detected. It helps to ensure that service level objectives and availability targets are met. Conceptually, the tooling required to support you in this process is depicted in the following picture:

Shows incident management phases and capabilities required to resolve incidents

The process requires tools that monitor the individual services to detect issues. This is all about monitoring, tracing, and logging. Once an incident has been detected, it requires tools to create a ticket, to analyze the data, and to reduce it to consumable bites for the Site Reliability Engineer (SRE) to work with. Then, there are tools that assist you with planning tasks such as notifying the appropriate specialist and suggesting possible solutions. Finally, it requires tools that put all this into action through collaboration and knowledge sharing with the help of chat tools, bots, and automated runbooks.

So far, so good. But how do the products in the IBM zSystems AIOps portfolio fit into this picture? And how could I learn a bit more about those products that are not necessarily part of my own day job?

Luckily, this learning journey wasn’t so difficult. I used the AIOps framework that we have created around the areas Detect, Decide, and Act. The framework highlights the different capabilities that are required to successfully operate your mainframe systems in the context of your hybrid cloud environment. Please refer to Sanjay Chandru’s blog for an introduction to this framework, and to the subsequent blogs it references, which discuss the three areas in more detail.

On top of the framework, we have also created the AIOps on IBM zSystems Handbook, which organizes the products in the IBM zSystems AIOps portfolio along Detect, Decide, and Act. The overview fits on a single page, as depicted in the following picture:

AIOps framework on IBM zSystems depicting IBM capabilities in the areas Detect, Decide, and Act

For instance, if you are interested in finding out more about Collaborative incident remediation, you can click on the color-coded hand icon next to this heading to jump to a more detailed page introducing IBM Z ChatOps and Service Management Unite. That page briefly describes the challenges, the new capabilities you might consider to address them, and what IBM can offer in that space. And if that’s not enough, you can also follow a link to a blog with more details about this section.

So, after studying the handbook and getting a better sense of all of our products’ capabilities, I was able to complete the IBM view of the Incident Management tooling overview easily. The result is shown below:

Incident Management Toolchain depicting IBM tools and references to 3rd party

As you can see, IBM’s AIOps portfolio covers all the capabilities that are needed to facilitate the Incident Management process, from detecting incidents in the first place, through AI-assisted grooming, to the collaborative resolution of the incident using ChatOps on popular platforms such as Slack, Microsoft Teams, and the open-source-based Mattermost.

Looking at the picture, you might ask yourself: do I really need all these tools? No, of course you don’t need all the products if you are content with the way you manage incidents in your organization today. However, with the growing complexity of the application landscape paired with shrinking mainframe skills, effective tools are needed to improve mean time to restore, or at least to maintain it at the level you have today.

Isn’t there a product that can do all of this? Personally, I haven’t seen such a product. As the picture above shows, various capabilities are required to manage incidents efficiently. The key point is not necessarily to have a single product. Rather, it is about how easily you can integrate these capabilities (even across products from different vendors) so that you can pick the best tool and benefit most. For instance, the IBM Z OMEGAMON Data Provider converts proprietary OMEGAMON metrics to JSON data that can be fed into a data lake of choice. Similarly, IBM Cloud Pak for Watson AIOps or IBM Netcool Operations Insight can consume events from a variety of IBM and non-IBM sources and inform SREs about them.
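To make the integration point a bit more tangible, here is a minimal sketch of what consuming such JSON metrics on the receiving side could look like. It assumes the metrics arrive as one JSON object per line; the field names (`table_name`, `system`, `busy_pct`, `used_mb`) and the sample records are purely hypothetical and are not taken from the OMEGAMON Data Provider documentation.

```python
import json
from collections import defaultdict


def group_records(lines, key="table_name"):
    """Parse JSON Lines text and group the records by a key field.

    The default key "table_name" is a hypothetical field name chosen
    for illustration only.
    """
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        # Records without the key field are collected under "unknown".
        groups[record.get(key, "unknown")].append(record)
    return dict(groups)


# Hypothetical sample input: one JSON object per line.
sample = [
    '{"table_name": "cpu", "system": "SYSA", "busy_pct": 71.5}',
    '{"table_name": "cpu", "system": "SYSB", "busy_pct": 42.0}',
    '',
    '{"table_name": "storage", "system": "SYSA", "used_mb": 2048}',
]

grouped = group_records(sample)
for name, records in grouped.items():
    print(name, len(records))
```

Because each record is plain JSON, the same few lines would work regardless of which data lake or analytics platform sits at the receiving end, which is exactly the kind of loose coupling that makes mixing tools from different vendors practical.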

I found the handbook an excellent tool for familiarizing myself with all the products in the IBM zSystems AIOps portfolio. It reminded me of the various challenges you can encounter in your day-to-day operations and the solutions that are available to address them. If you haven’t had a chance to take a look yet, I can only encourage you to download a copy and start planning the next stage of your journey to AIOps.

Originally published on the IBM AIOps Community Blog.
