Observations on Observability

Observability is the vehicle for delivering the next generation of tools to help us regain control.

About 15 years ago, I participated in a meeting with customer IT executives to discuss their biggest challenges. I remember one of them describing difficulties in identifying all the ‘moving parts’ in an application and assembling all the associated subject matter experts.

The client shared an example of an application distributed across multiple platforms that relied on 13 components to process the client’s transaction. The most time-consuming part of fixing an application problem was simply identifying which of those 13 components was failing. I recall thinking, ‘Thirteen components in one transaction? Whoever designed that wasn’t responsible for fixing it when it broke.’ 

I expect there are application developers reading this article who look back fondly at simpler times when there were only 13 components involved.

More recently, I exchanged emails with a client who was trying to figure out why a subset of their batch jobs had suddenly doubled their CPU time. The programs had not been changed, they were doing roughly the same number of I/Os (which is always a good indicator of the volume of work being done), and they were using the same amount of memory as before, yet they took twice as much CPU time.

What do these events have to do with Observability with a capital ‘O’? In the last two years, the term “Observability” has suddenly appeared everywhere in the performance and resiliency world. Broadcom’s Nicole Fagen kindly asked me to moderate two ‘Observability Shootouts’ at SHARE in Atlanta and again in New Orleans. The shootouts featured representatives from BMC, Broadcom, EPS Strategies, IBM, IntelliMagic, and Rocket Software demonstrating the observability capabilities of their products. I remember watching the demos, green with envy that products like these didn’t exist earlier in my career.

Most of us in the mainframe performance world have a very SMF-centric view of monitoring. Our peers on other platforms deeply envy SMF. There is no doubt that SMF represents a great source of information now, and it promises a vast ocean of insights when we have more powerful tools to analyze it in the future. 

However, the term “Observability” originated in the distributed world, so it isn’t SMF-based. That, combined with those very impressive demos at the Observability Shootouts, got me thinking that I need to broaden my view of what can be observed.

There is a world of fascinating data and metrics outside SMF. For example, imagine if you could easily combine SMF data with data from other tools to identify every component used by a given transaction, get notified if one of them stops working, and be told what the impact of that failure is. That warrants an uppercase WOW.

What is Observability?

I searched for a universal definition of Observability. Not surprisingly, I couldn’t find one. That’s when I realized that Observability is not some standard tool or report that will make us all obsolete. After all, Artificial Intelligence will look after that!

To me, Observability is a movement, a response by vendors to customer cries for help in managing a wildly complicated IT world. We are drowning in data, metrics, and complexity, and we are losing the deeply experienced subject matter experts to retirement or other pursuits. Observability is the vehicle for delivering the next generation of tools to help us regain control. 

Looking ahead, Observability will encompass products that warn us about the impact of problems before we even know we have a problem. It will mine our metrics, both SMF and otherwise, and point out relationships we never considered.

Let me give you an example. A Watson & Walker client recently mentioned that their CPU time per transaction was very stable when the percentage of time the CPU was in problem state was within a certain (and quite small) range, but it became much more erratic outside that range. He wanted to know whether this was unique to them, or if it was the same everywhere.

I was taken aback, as it never occurred to me to consider a correlation between those two metrics. I asked some very experienced friends if I was the only person in the performance world unaware of this. Everyone replied that they had never considered looking for a correlation there either.
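
Just to make the idea concrete, here is a minimal sketch of the kind of relationship-hunting I have in mind, written in Python with pandas. Everything in it is an assumption for illustration: the intervals.csv file and column names such as cpu_per_txn and pct_problem_state simply stand in for interval-level metrics you might extract from SMF or any other source; this is not how any particular vendor’s product works.

```python
import pandas as pd

# Hypothetical file of interval-level metrics extracted from SMF (or anywhere else).
# Assumed columns: interval_end, cpu_per_txn, pct_problem_state, io_rate, memory_used, ...
df = pd.read_csv("intervals.csv")

# Pairwise Pearson correlation across every numeric metric in the file.
corr = df.select_dtypes("number").corr()

# Flatten the matrix, keep each metric pair once, and rank by correlation strength
# so the unexpected relationships float to the top.
pairs = (
    corr.stack()
        .reset_index(name="r")
        .query("level_0 < level_1")
        .sort_values("r", key=abs, ascending=False)
)
print(pairs.head(10))
```

A real Observability product would go far beyond a crude correlation pass like this (think topology, anomaly detection, and some notion of cause and effect), but even this much can surface pairings you never thought to look for, like the problem-state percentage and CPU per transaction relationship above.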

I suspect there are many relationships between the workloads and systems we manage that we are not aware of. It sure would be nice to have ‘The Observability Tool’ beside me to gently point out relevant insights! I’d be a hero to my users and an indispensable sage to my executive.

What Observability capabilities would you like to see to address your performance and availability challenges?

Frank describes himself as a ‘true Mid-Atlanticer’ — he was born in New York, spent half his life in Ireland, and the other half in the Northeast of the US, resulting in an average position of halfway across the Atlantic Ocean.

His first IT job was as an operator in an insurance company in Ireland. Over the subsequent 20 years he became a VM system programmer, then joined IBM Ireland as an MVS system programmer, and then worked in IBM’s fledgling Services organization before moving to IBM’s Redbooks group in Poughkeepsie as a project leader for sysplex Redbooks.

During his time in the Redbooks group, Frank’s role expanded to include Redbooks on high availability, GDPS, and performance, teaching various classes around the world, and consulting with IBM’s most complex and interesting customers.

In March 2014, Frank left IBM to join Watson & Walker and is currently its President and author of most articles in their quarterly newsletter, Cheryl Watson’s Tuning Letter, a role he describes as ‘the ultimate job for
