Why microbenchmarks are not suitable for performance analysis

Jan 26, 2022

Mike Chase

What are microbenchmarks, and how do they differ from real-world programs?

Microbenchmarks are very small programs that test a small number of operations in a loop, often as few as a single computation. They differ from real-world programs in that the results of the computations aren’t used for any purpose in the program. Microbenchmarks are used to measure the performance of a computation, and the computation is put in a loop because a single execution is generally too short to be measured. Microbenchmarks are not real-world programs because they are not meaningful programs written to meet business needs or perform a useful task.

Ex. 1: Microbenchmark to measure the performance of the ADD statement. There is no purpose in incrementing Y, as it isn’t used anywhere in the program.

01 X PIC 9(6) COMP-3.
01 Y PIC 9(9) COMP-3 VALUE 0.

PERFORM VARYING X FROM 1 BY 1 UNTIL X > 99999
    ADD 1 TO Y
END-PERFORM.

Ex. 2: Real-world loop computing data that is used elsewhere in the program.

01 X PIC 9(6) COMP-3.
01 RECEIPTS-GRP.
    02 RECEIPTS OCCURS 99999 TIMES.
        03 TOTAL PIC 9(18) COMP-3.
        03 SUBTOTAL PIC 9(18) COMP-3.
01 MAX-ITEMS PIC 9(5) COMP-3.
01 TAX-RATE PIC 9(2)V9(2) COMP-3.

<Code that computes subtotals>
…
PERFORM VARYING X FROM 1 BY 1 UNTIL X > MAX-ITEMS
    COMPUTE TOTAL(X) = SUBTOTAL(X) * TAX-RATE
END-PERFORM
…
<Other code that uses TOTAL>

Why is it not appropriate to do performance testing on microbenchmarks?

There may be interactions between instructions that show up in tight loops (a loop made up of a small number of machine instructions, usually coming from a small number of COBOL statements within the loop, such as in Ex. 1) that are uncommon in real-world programs. These interactions mean that microbenchmarks are not reliable indicators of performance; optimization done on microbenchmarks may be optimizing for interactions that don’t show up in real-world code.

Spending time optimizing microbenchmarks may make those microbenchmarks run faster, but the effort may have minimal impact on actual applications. Since it’s the real-world applications that run in production on a regular basis, not the microbenchmarks, there is no use in analyzing and optimizing microbenchmarks unless similar issues can be found in real-world code. There is little value in improving test code when the improvement has minimal impact on real-world applications.

Here are some specific things to consider:

Interactions Between Instructions

Instructions in a program are not executed in isolation. Instructions issued earlier can affect instructions issued later. For example, in the tight loop in Ex. 1, the loop counter (X) and the data item being incremented (Y) are read from and written to repeatedly in close succession. This strains the hardware, which must always have the correct data ready when it’s needed without introducing stalls (where subsequent instructions are delayed because they’re waiting on a prior result). While IBM Z hardware and the Enterprise COBOL for z/OS compiler have both improved in recent releases to handle this better, this interaction is likely to occur more frequently in a microbenchmark than in real-world code. There are many other types of interactions between instructions as well. Optimizing a microbenchmark may mean optimizing interactions between instructions that don’t frequently occur in real-world code, so the work may have minimal real-world impact.

Variation

Another problem is that some microbenchmarks have a short execution time, which means that any variation in execution time (“noise” in the system due to system load and other factors) is exaggerated. For example, if an application takes 2 seconds to run and then 3 seconds when it is run again, does that indicate a 1.5x slowdown, or is it due to noise? By contrast, if a long-running program takes 100 seconds to run and then 101 seconds when run again, the same 1 second of “noise” is just a small fraction of the total.

Run 1	Run 2	Increase (seconds)	Increase (multiplication factor)
2 seconds	3 seconds	1 second of noise	1.5x
100 seconds	101 seconds	1 second of noise	1.01x
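
The arithmetic behind the table can be sketched in a few lines of Python. This is only an illustration, not part of any IBM tooling: the function name `relative_noise` is made up for this example, and the inputs are the run times from the table above.

```python
def relative_noise(run1_seconds: float, run2_seconds: float) -> float:
    """Return the run-to-run difference as a fraction of the first run."""
    return abs(run2_seconds - run1_seconds) / run1_seconds


# Run times taken from the table above.
short = relative_noise(2.0, 3.0)      # short-running program
long_ = relative_noise(100.0, 101.0)  # long-running program

print(f"short program: {short:.0%} variation")
print(f"long program:  {long_:.0%} variation")
```

The same 1 second of noise is half of the short program’s run time but only a hundredth of the long program’s run time, which is why the short measurement is so much harder to trust.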

Running a short program many times over can help average out some of this variance, but there is also overhead associated with running any program. This overhead includes invoking the COBOL runtime, bringing the program into memory, and initializing data items, and the compiler cannot help improve the performance of this overhead. If the running time of the program itself is small, that overhead becomes a large portion of the measured time. Whether you run the program once or thousands of times, the ratio of overhead to program running time stays the same.

Run 1 (run once)	Run 2 (run 100x)
2 seconds program runtime + 1 second overhead = 3 seconds	(2 seconds program runtime * 100 runs) + (1 second overhead * 100 runs)= 200 + 100 = 300 seconds
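
To see why repeating the run does not help, here is a minimal Python sketch using the numbers from the table above. The function name `overhead_share` is hypothetical, and the sketch assumes the per-run overhead is the same on every run.

```python
def overhead_share(program_seconds: float, overhead_seconds: float, runs: int) -> float:
    """Return the fraction of total elapsed time that is per-run overhead."""
    total_program = program_seconds * runs
    total_overhead = overhead_seconds * runs
    return total_overhead / (total_program + total_overhead)


# 2 seconds of program runtime and 1 second of overhead per run, as in the table.
for runs in (1, 100, 10_000):
    share = overhead_share(2.0, 1.0, runs)
    print(f"{runs:>6} runs: overhead is {share:.1%} of the total")
```

Because both the program time and the overhead are multiplied by the number of runs, the number of runs cancels out and the overhead remains one third of the total in every case.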

When is it appropriate to do performance testing on a microbenchmark?

It may be that a microbenchmark-style loop actually appears in real-world code. If so, and if performance measurements indicate it is a bottleneck, then it is worth looking at, because unlike a standalone microbenchmark, we now have an example where that loop affects the overall performance of the application, and so a performance improvement actually will be beneficial.

Are all real-world programs appropriate for performance testing?

A short-running real-world program is subject to the same concerns as a short-running microbenchmark. In that case, it’s more useful to measure the performance of the whole application, perhaps run with more data, so that the actual bottlenecks in the application are exposed.

Recommended approach to performance testing

Performance testing should focus on long-running, real-world applications. In cases where applications are not performing as well as they should, or perform worse on newer hardware or when compiled with a newer compiler version, a detailed performance report can indicate which programs, and which instructions in those programs, are performing worse between versions. This information does not need to be gathered in production, but it should be gathered with the actual application and ideally with real-world data. A performance report and measurements give the compiler developers a targeted place to look, enabling IBM to fix situations that have a direct impact on the performance of client applications running in production. Time spent optimizing microbenchmarks may not have any impact on a client application’s performance in production, and so it takes away from more useful work that could benefit our clients.
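
As a rough illustration of comparing measurements between versions, the Python sketch below ranks programs by how much slower they became. Everything in it is an assumption made for the example: the function name `rank_regressions`, the program names, and the timings are hypothetical and do not come from a real performance report or IBM tool.

```python
def rank_regressions(baseline: dict, candidate: dict) -> list:
    """Return (program, slowdown_factor) pairs for programs that got slower, worst first."""
    slowdowns = []
    for program, old_seconds in baseline.items():
        new_seconds = candidate.get(program)
        # Skip programs missing from the new measurements or with unusable baselines.
        if new_seconds is None or old_seconds <= 0:
            continue
        factor = new_seconds / old_seconds
        if factor > 1.0:
            slowdowns.append((program, factor))
    return sorted(slowdowns, key=lambda pair: pair[1], reverse=True)


# Hypothetical elapsed times in seconds for long-running programs,
# measured with an older and a newer compiler version.
baseline = {"PAYROLL": 120.0, "BILLING": 300.0, "REPORTS": 90.0}
candidate = {"PAYROLL": 118.0, "BILLING": 360.0, "REPORTS": 99.0}

regressions = rank_regressions(baseline, candidate)
for program, factor in regressions:
    print(f"{program}: {factor:.2f}x slower")
```

A ranking like this only points to where to look first. The detailed report is still needed to see which instructions inside those programs are responsible.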

IBM offers COBOL performance tuning webinars. Register to join a live webinar or find a pre-recorded webinar video here.

Originally published on the IBM Z and Linux Community Blog.
