Driving Mainframe Reliability with AIOps, Observability, and Automation

Sep 12, 2025

Amanda Hendley is the Managing Editor of Planet Mainframe and host of the Virtual Mainframe User Groups. With a career rooted in the technology community, she has held leadership roles at the Technology Association of Georgia, Computer Measurement Group (CMG), and Planet Mainframe. A proud Georgia Tech graduate, Amanda spends her free time renovating homes and volunteering with SEGSPrescue.org in Atlanta, Georgia.

At SHARE, Broadcom spotlighted the future of mainframe operations through the lens of AIOps, observability, and automation. Stuart McIrvine, Director of Product Management for AI, Ops, and Automation at Broadcom’s Mainframe Software Division, explained how these capabilities are shaping reliability, efficiency, and interoperability for enterprises that rely on mainframe systems.

Observability at the Application Layer

Broadcom’s WatchTower platform continues to evolve, expanding beyond infrastructure to deliver visibility into the application layer. This shift helps customers understand not only when incidents occur but also how failures cascade across systems.

A major part of this is topology mapping, which shows how resources such as CICS, Db2, MQ, and now IMS are connected. When an outage happens, teams can quickly see which areas may be impacted—speeding up diagnosis and recovery.
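
The impact question that topology mapping answers can be sketched as a graph traversal. The example below is a minimal illustration, not WatchTower's implementation: the resource names, the DEPENDS_ON map, and the impacted_by function are all hypothetical. It assumes the topology is available as a simple "depends on" map and walks it in reverse to list everything downstream of a failed resource.

```python
from collections import deque

# Hypothetical topology: each key depends on the resources in its list.
# Names are illustrative only and are not taken from WatchTower.
DEPENDS_ON = {
    "payments-app": ["CICS-PROD1"],
    "CICS-PROD1": ["DB2-A", "MQ-QM1"],
    "reporting-app": ["DB2-A"],
    "IMS-REGION1": ["MQ-QM1"],
}


def impacted_by(failed, depends_on):
    """Return every resource that directly or transitively depends on `failed`."""
    # Invert the edges: resource -> resources that depend on it.
    dependents = {}
    for resource, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(resource)

    # Breadth-first walk outward from the failed resource.
    seen = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for nxt in dependents.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)

    # The failed resource itself is not part of its own impact list.
    return sorted(seen - {failed})


print(impacted_by("DB2-A", DEPENDS_ON))
# ['CICS-PROD1', 'payments-app', 'reporting-app']
```

In this toy map, a Db2 failure surfaces the CICS region and both applications that sit above it, which is the kind of cascade view the article describes.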

“Incidents are inevitable on the mainframe, and outages are expensive. What’s critical is full observability and the ability to recover quickly.”
— Stuart McIrvine

Predicting and Preventing Outages

Observability provides the “what” and “where,” but AIOps provides the “what’s next.” By using machine learning to establish activity baselines, WatchTower can flag anomalies and warn customers before outages occur. Whether it’s Black Friday-level traffic or an unusual mid-summer spike, AIOps helps teams anticipate problems instead of reacting to them.
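
The idea of a context-dependent baseline can be illustrated with a small sketch. This is an assumption-laden stand-in, not the model WatchTower uses: a per-season mean and standard deviation with a z-score threshold replaces whatever machine learning the product applies, and the season labels, sample values, and function names are hypothetical.

```python
from statistics import mean, stdev


def build_baselines(history):
    """Group (season, value) samples and return {season: (mean, stdev)}."""
    by_season = {}
    for season, value in history:
        by_season.setdefault(season, []).append(value)
    # stdev needs at least two samples per season.
    return {s: (mean(v), stdev(v)) for s, v in by_season.items() if len(v) >= 2}


def is_anomalous(season, value, baselines, threshold=3.0):
    """Flag a value more than `threshold` standard deviations from its season's mean."""
    mu, sigma = baselines[season]
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical transactions-per-second samples for two periods.
history = (
    [("black_friday", v) for v in (9000, 9500, 10000, 10500, 11000)]
    + [("midsummer", v) for v in (900, 950, 1000, 1050, 1100)]
)
baselines = build_baselines(history)

print(is_anomalous("black_friday", 10200, baselines))  # False: normal for the peak
print(is_anomalous("midsummer", 10200, baselines))     # True: abnormal for summer
```

The same reading of 10,200 is unremarkable against the Black Friday baseline and a clear outlier against the mid-summer one, which is the distinction the paragraph above draws.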

The Role of Automation

Automation closes the loop. By eliminating manual, repetitive tasks, and the human errors that come with them, automation enables faster recovery when problems arise. As McIrvine puts it, incidents and outages will happen, and “what’s most important is quick recovery.”

Open Standards for Interoperability

No enterprise runs only on the mainframe or only in the cloud. Broadcom has embraced OpenTelemetry as the standard for sharing observability data across platforms. This lets WatchTower share mainframe observability data with tools like AppDynamics, Dynatrace, and Datadog, giving organizations a consistent view of operations across platforms.

Looking Ahead

AI on the mainframe is still expanding. Beyond today’s machine learning and anomaly detection, McIrvine sees generative AI providing richer insights and agentic AI automating fixes through system-specific intelligence.

“AI will continue to expand its role in observability platforms, evolving from prediction to explanation and even resolution.”
— Stuart McIrvine


Read the Transcript

Stuart McIrvine is Director of Product Management for AI, Ops, and Automation at Broadcom’s Mainframe Software Division. Stuart drives the strategy for integrating automation, observability, and AI into mainframe operations to improve reliability and efficiency. With a career spanning IBM, HPE, and Broadcom, he brings deep technical knowledge and market expertise.

What is Broadcom highlighting at SHARE?
We continue to evolve the WatchTower platform. WatchTower is focused on mainframe observability—making the mainframe more observable and sharing that data with other platforms. Our core focus for this conference is observability at the application layer: high-level, non-intrusive ways of monitoring how applications operate, as well as deep dives.

Topology is another area of focus. Incidents are inevitable on the platform, and customers want to understand the implications: if a resource fails, what else could be affected? We’ve extended topology to cover CICS, Db2, and MQ, and this week we’re announcing support for IMS, which is critical for many customers. So overall, we’re expanding WatchTower’s observability and interoperability with other platforms.

There’s a lot of use of observability for known issues, but how are mainframe customers using AIOps to anticipate issues before they affect operations?
That’s a critical area. For many customers—especially in financial services and retail—an incident is extremely expensive. You can’t have the mainframe going down. They want to anticipate problems before they happen.

We use machine learning to understand patterns and profiles of each customer’s landscape. For example, before Black Friday, a certain level of activity is normal, but in the middle of summer, that same level might be abnormal. We look for those abnormalities and flag them. By leveraging discovery, profiling, machine learning, and anomaly detection, we give customers an early warning that something’s going wrong and help them pinpoint where to look to prevent an incident.

What is the role of automation in improving service reliability and reducing costs?
Automation is very important because there are a lot of repetitive tasks. When people perform tasks, we make mistakes. Automating problem resolution means the machine doesn’t make those errors—no typos, no slips.

Automation is also critical for recovery. Incidents will happen. Outages will happen. What’s most important is quick recovery. Automation helps detect where the problem is and speeds up the resolution process.

How does WatchTower do this, and does it integrate with other tools?
WatchTower collects incidents and alerts from many areas—system performance, network, storage, database—and brings them together in one interface. It automatically pulls in insights about each alert. For example, if there’s a performance alert, WatchTower gathers related information about resources.

Not all users today have the deep skills of 30- or 40-year mainframers. So we provide additional context and explanations for newer users. With topology, they can see how one alert affects other resources.

Mainframe is just one of many platforms, so interoperability is critical. We support open standards, especially OpenTelemetry, which lets us share mainframe observability data with platforms like AppDynamics, Dynatrace, and Datadog.

Why is OpenTelemetry becoming so important?
Nobody is just mainframe, and nobody is just cloud. Enterprises need a consistent view across platforms. OpenTelemetry has become the standard for sharing observability data, and that’s why we use it to partner with enterprise vendors like Datadog and AppDynamics.

WatchTower has been out about 18 months since the official release. Do you have real-world examples you can share?
Let me focus on the application layer. Customers run their businesses on large mainframe applications, and they need to understand more about how those applications behave.

One of our newest capabilities is AP for Z, an application profiler. A customer recently used it to confirm whether a recompiled application had actually rolled into production. Profiling revealed it was still running on the old compiler, even though operations had said otherwise. So they were able to prove the update hadn’t gone live.

Another customer is running both AppDynamics and WatchTower. They’re using application tracing with OpenTelemetry to share mainframe data into their non-mainframe observability platform.

So one example is application profiling, and another is interoperability through OpenTelemetry.

Where do you see the future of AIOps on the mainframe in the next few years?
AI is evolving quickly. A few years ago, we introduced machine learning to predict outages—that’s just one aspect of AI.

Generative AI, which produces content and answers questions, will play a larger role in observability. Looking further out, agentic AI will be important. Agents tied to systems like CICS, MQ, and Db2 will understand those environments in detail and even automate fixes. For example, a CICS agent will know better than anything else how to resolve a CICS issue.

So AI will continue to expand its role in observability platforms.

If you had 20 seconds to explain why AIOps is essential for the modern mainframe, what would you say?
Incidents are inevitable on the mainframe, and outages are expensive. The key is full observability—understanding what’s happening when an incident occurs—and the ability to recover quickly. The only way to do that is with deep observability and the tools to narrow down, target, and resolve the problem fast. Incidents will happen, but fixing them quickly is critical.

Transcript edited with AI assistance for clarity and readability while maintaining the speakers’ original words.

