Performance management is a key consideration on any platform. A system has several resources (CPU, I/O, storage, network, etc.) that work together to process a workload. To assess the overall health of the system, you can review mainframe system reports for key performance metrics on these resources. These metrics are measured and compared against service level agreements (SLAs) or rule-of-thumb performance standards.
An SLA is a contract between the user and the system that describes the goals to meet for business-critical workloads. If the measured results fall short of those goals, mission-critical workloads running on the system will usually suffer. Possible remedies include tuning performance to bring the metrics back to baseline standards, buying more resources, offloading eligible workloads to specialty engines, or taking resources from less critical work by adjusting priorities.
Types of workloads
First, let’s take a look at the typical workloads that run on mainframe systems. They fall into two types: batch processing and online transaction processing.
Batch workloads: Batch workloads process high volumes of data and produce outputs or reports. They are typically scheduled programs that run without user interaction, often during off-peak hours. An example is a mainframe job that produces large numbers of customer billing statements or processes customer orders.
Transactional workloads: These workloads typically involve end-user interactions. Transactions are usually short and are often considered mission-critical for the business. Examples include bank ATM transactions, merchant credit card processing at checkout stations, and online order purchases.
Mainframe systems regularly capture key metrics to gauge system performance, and various performance monitoring tools display this data to end users. Here are some of them:
Throughput
This indicates the average number of service completions per unit of time, for example, the number of transactions per second or per minute. Transactional workloads are typically measured using this performance metric.
Average response time
This measures the average amount of time it takes to complete a single service. Transactional workloads are usually measured using this performance metric – it can also be specified as an SLA goal for workloads.
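Both throughput and average response time can be derived from the same set of completed transactions. The following Python sketch illustrates the arithmetic. The record layout (start and end timestamps in seconds) and the function names are assumptions made for this example, not the format of any particular mainframe report.

```python
from typing import List, Tuple

# Each record is (start_time, end_time) in seconds; this layout is hypothetical.
Transaction = Tuple[float, float]


def throughput(transactions: List[Transaction], interval_seconds: float) -> float:
    """Completed transactions per second over the measurement interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return len(transactions) / interval_seconds


def average_response_time(transactions: List[Transaction]) -> float:
    """Mean elapsed time (end - start) per completed transaction, in seconds."""
    if not transactions:
        raise ValueError("no transactions to average")
    total_elapsed = sum(end - start for start, end in transactions)
    return total_elapsed / len(transactions)


# Four transactions completed within a 4-second measurement interval.
sample = [(0.0, 0.2), (1.0, 1.4), (2.5, 2.8), (3.0, 3.5)]
print(throughput(sample, interval_seconds=4.0))  # transactions per second
print(average_response_time(sample))             # seconds per transaction
```

Note that the two metrics are independent: a system can sustain high throughput while individual transactions still respond slowly, which is why both are usually tracked.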
Resource utilization
Typically, this metric shows how much of a resource the workloads (batch or transaction) consumed over a period of time. Examples: CPU utilization, processor storage utilization, I/O rates, paging rates, etc.
Resource velocity
Velocity is a measure of resource contention. When multiple workloads require a resource (for example, CPU) at the same time, there is contention for that resource. While one workload is using the resource, the other workloads are put in a waiting queue. Resource velocity is the ratio of the time spent using the resource (A) to the total time spent using the resource (A) and waiting in the queue (B), expressed as a percentage: A / (A + B) × 100. The value ranges from 0 to 100. A value near 0 means a high amount of contention for the resource, and a value of 100 indicates no contention. This metric can be specified as an SLA goal for workloads.
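The velocity formula can be sketched in a few lines of Python. The function name and the use of seconds as the unit are assumptions for illustration. Real monitors typically derive the using and waiting figures from sampled states rather than exact timings.

```python
def resource_velocity(using_time: float, waiting_time: float) -> float:
    """Return velocity as a percentage: A / (A + B) * 100.

    using_time   -- time spent using the resource (A)
    waiting_time -- time spent queued for the resource (B)
    """
    if using_time < 0 or waiting_time < 0:
        raise ValueError("times must be non-negative")
    total = using_time + waiting_time
    if total == 0:
        raise ValueError("no using or waiting time observed")
    return 100.0 * using_time / total


# 30 s using the CPU and 10 s waiting: little contention.
print(resource_velocity(30.0, 10.0))  # 75.0
# 5 s using the CPU and 45 s waiting: heavy contention.
print(resource_velocity(5.0, 45.0))   # 10.0
```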
Performance index
As part of an SLA, workloads are classified into service classes, and each service class has goals. Goals can be expressed as response time, velocity, etc. Because the SLA defines several types of goals for different workloads, a single metric, the performance index (PI), is used to show how each workload is performing against its goal. PI is the ratio between the defined goal and the achieved result, arranged so that larger values always mean worse performance: for a response time goal, PI is the achieved response time divided by the goal response time, and for a velocity goal, PI is the goal velocity divided by the achieved velocity. A PI of 1 means the workload is meeting its goal exactly. A value < 1 means the workload is exceeding its goal, and a value > 1 means the workload is missing its goal.
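The PI calculation can be sketched as follows. For a PI above 1 to always mean a missed goal, the direction of the ratio has to depend on the goal type: achieved over goal for response time (lower is better), and goal over achieved for velocity (higher is better). This follows the convention commonly described for z/OS Workload Manager, which is an assumption here because the text does not name a specific workload manager. The function names are hypothetical.

```python
def pi_response_time(achieved_seconds: float, goal_seconds: float) -> float:
    """PI for a response time goal: achieved / goal (lower response time is better)."""
    if achieved_seconds <= 0 or goal_seconds <= 0:
        raise ValueError("response times must be positive")
    return achieved_seconds / goal_seconds


def pi_velocity(achieved_velocity: float, goal_velocity: float) -> float:
    """PI for a velocity goal: goal / achieved (higher velocity is better)."""
    if achieved_velocity <= 0 or goal_velocity <= 0:
        raise ValueError("velocities must be positive")
    return goal_velocity / achieved_velocity


def status(pi: float) -> str:
    """Interpret a PI value against the threshold of 1."""
    if pi < 1:
        return "exceeding goal"
    if pi > 1:
        return "missing goal"
    return "meeting goal"


# Goal: 1.0 s response time; achieved: 1.5 s.
print(status(pi_response_time(achieved_seconds=1.5, goal_seconds=1.0)))
# Goal: velocity 40; achieved: velocity 80.
print(status(pi_velocity(achieved_velocity=80.0, goal_velocity=40.0)))
```

Normalizing every goal type to a single index lets service classes with different goal types be compared on one scale.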