Changing Demands on Workload Availability
Businesses are under pressure to provide continuous availability for their critical workloads. They need to be online 24×7 to remain competitive and avoid negative headlines. In some industries, businesses are fined for downtime and for failing to meet regulatory compliance requirements. For these types of workloads, multi-hour downtime for unplanned or planned outages is no longer acceptable. This includes workloads that currently rely on IBM MQ for their high availability functions.
Although IBM MQ provides high availability functions, businesses are being tasked with further improving the resiliency of their critical MQ workloads. For your business, high availability within a sysplex, or failover of the sysplex to a disaster recovery site, is no longer sufficient: your clients are demanding continuous availability for these workloads.
IBM MQ High Availability
IBM MQ provides a flexible, robust messaging backbone, with assured delivery of messages across a wide range of operating systems and platforms. An important benefit of messaging is that it removes complexity from the application code: the reliable delivery and recovery of data is the job of IBM MQ. A key concept of IBM MQ is asynchronous processing: applications don’t rely on each other’s availability, or on the availability of the network, to send data. If the applications and network are available, messages are delivered immediately. However, if the target application or network is not available, IBM MQ temporarily stores the messages until they can be delivered.
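The store-and-forward behavior described above can be illustrated with a toy model. This is a minimal sketch in plain Python, not IBM MQ code: the class `StoreAndForwardQueue`, its methods, and the message names are hypothetical and exist only to show that the sender is never blocked by an unavailable target.

```python
from collections import deque


class StoreAndForwardQueue:
    """Toy model of asynchronous delivery: hold messages until the target is reachable."""

    def __init__(self):
        self.pending = deque()   # messages waiting for the target
        self.delivered = []      # messages handed to the target application

    def put(self, message, target_available):
        # The sender never waits for the receiver: the message is always accepted.
        self.pending.append(message)
        if target_available:
            self.flush()

    def flush(self):
        # Deliver everything that was stored, in arrival order.
        while self.pending:
            self.delivered.append(self.pending.popleft())


queue = StoreAndForwardQueue()
queue.put("order-1", target_available=True)    # delivered immediately
queue.put("order-2", target_available=False)   # stored until the target returns
queue.put("order-3", target_available=False)
stored = list(queue.pending)                   # snapshot of the backlog
queue.flush()                                  # target is back: drain the backlog
```

After the final `flush()`, the backlog is empty and all three messages have been delivered in the order they were sent.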
The fundamental components of IBM MQ are its messages, queues, queue managers, and channels.
- Messages are data blocks that have some meaning to an application. Before forwarding a message, IBM MQ adds information about how to store and deliver it, and strips that information before delivering the message to the target application.
- Queues are data objects used to store messages, either until the message is forwarded to the appropriate queue manager or until the message is requested by the target application.
- Queue managers provide the messaging services and manage the queues and channels. Queue managers handle the delivery and recovery of messages. They are responsible for ensuring that messages are transferred to other queue managers by using channels.
- Channels represent the logical communication links, typically TCP socket connections, between queue managers and between a queue manager and a connecting application.
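As an illustration of how delivery information travels with a message and is removed before the application sees it, consider the following sketch. It is plain Python and makes no attempt to reproduce the real IBM MQ message format: the functions `wrap` and `unwrap`, the header fields, and the queue name are all hypothetical.

```python
def wrap(payload, destination_queue, persistent=True):
    """Attach hypothetical delivery information to an application payload."""
    header = {"destination": destination_queue, "persistent": persistent}
    return {"header": header, "payload": payload}


def unwrap(message):
    """Strip the delivery information and return only the application data."""
    return message["payload"]


in_flight = wrap("debit account 42", destination_queue="PAYMENTS.IN")
received = unwrap(in_flight)   # the target application sees only its own data
```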
Figure 1: IBM MQ Topology
Figure 1 is an example of an IBM MQ topology. To process business transactions, messages are sent directly to local queue managers. These messages are typically destined for queues on remote queue managers within a z/OS sysplex, where the applications and data are located.
IBM MQ provides several functions on z/OS to enable high availability for the delivery and processing of messages.
- Shared channels enable local queue managers to quickly recover in the event of a remote queue manager failure. Using Sysplex Distributor, a local queue manager connects to an available remote queue manager in the sysplex and all messages flow to/from that remote queue manager. If the remote queue manager becomes unavailable, the local queue manager is automatically redirected to another remote queue manager by Sysplex Distributor.
- Shared queues allow applications to process messages from any queue manager in the sysplex. When a local queue manager sends messages to a remote queue manager in the sysplex, they are stored in a queue that is accessible by all remote queue managers. In the event of a failure of a remote queue manager, any messages previously delivered to that queue manager can be retrieved from any other available remote queue manager.
- Clusters provide the ability for local queue managers to deliver messages across all remote queue managers. If a remote queue manager becomes unavailable, any messages sent from the local queue manager are directed to the remaining available remote queue managers.
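The cluster behavior in the last item can be sketched as a routing function that skips unavailable queue managers. This is an illustrative model only: the function `route_messages`, the queue manager names, and the simple round-robin distribution are assumptions made for the example, and the actual workload balancing in an MQ cluster takes more factors into account.

```python
from itertools import cycle


def route_messages(messages, queue_managers, available):
    """Spread messages round-robin over the queue managers that are currently available."""
    targets = [qm for qm in queue_managers if qm in available]
    if not targets:
        raise RuntimeError("no remote queue manager is available")
    rotation = cycle(targets)
    return {msg: next(rotation) for msg in messages}


remote_qms = ["QM1", "QM2", "QM3"]   # hypothetical queue manager names
before = route_messages(["m1", "m2", "m3"], remote_qms, available={"QM1", "QM2", "QM3"})
after = route_messages(["m4", "m5", "m6"], remote_qms, available={"QM1", "QM3"})  # QM2 is down
```

With all three queue managers available, each receives one message. Once `QM2` is marked unavailable, the same call distributes new messages over `QM1` and `QM3` only.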
Using IBM Multi-site Workload Lifeline for Continuous Availability
IBM Multi-site Workload Lifeline (Lifeline) is a product that enables continuous availability for critical workloads during unplanned outages and reduces the downtime of these workloads during planned outages. By leveraging the capabilities of the MQ cluster together with a replication product, such as InfoSphere Data Replication (IIDR) for Db2, Lifeline can quickly reroute messages from one sysplex to another, usually in less than a minute.
Figure 2: IBM MQ with Lifeline Topology
Figure 2 is an example of an IBM MQ topology with Lifeline and IIDR. The local queue managers and remote queue managers spanning both sysplexes are configured into an MQ cluster. IIDR replicates the data sources being updated by the workload transactions from one sysplex to the other sysplex, to keep the data sources synchronized.
By default, messages sent from local queue managers are delivered to any available remote queue manager in the cluster. To ensure that the workload’s data sources are only updated in one sysplex at any time, Lifeline influences how messages are delivered within the cluster. With Lifeline, one of the sysplexes is designated as the “active site” and the other sysplex is the “standby site”. Lifeline ensures that messages sent from local queue managers are only delivered to available remote queue managers in this “active site”. Lifeline can also influence message routing such that the distribution of the messages in the “active site” will favor the remote queue managers that are performing better. Although remote queue managers are available in the “standby site”, no messages are delivered to these queue managers from the local queue managers. Applications in the “active site” process messages from remote queue managers in the “active site” and any resulting updates to data sources are replicated to the “standby site” by IIDR.
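The routing rules in this paragraph (deliver only to available queue managers in the active site, and favor the ones that perform better) can be sketched as follows. This is a simplified model, not Lifeline's actual algorithm: the dictionary layout, the `score` field, and the proportional weighting are assumptions chosen for illustration.

```python
def eligible_targets(queue_managers, active_site):
    """Keep only the available queue managers located in the active site."""
    return [qm for qm in queue_managers if qm["site"] == active_site and qm["available"]]


def routing_weights(queue_managers, active_site):
    """Turn hypothetical performance scores into each queue manager's share of messages."""
    targets = eligible_targets(queue_managers, active_site)
    if not targets:
        return {}
    total = sum(qm["score"] for qm in targets)
    return {qm["name"]: qm["score"] / total for qm in targets}


remote_qms = [
    {"name": "QMA1", "site": "A", "available": True, "score": 3},  # performing well
    {"name": "QMA2", "site": "A", "available": True, "score": 1},  # performing poorly
    {"name": "QMB1", "site": "B", "available": True, "score": 5},  # standby site
]
weights = routing_weights(remote_qms, active_site="A")
```

Even though `QMB1` is available and has the highest score, it receives no share of the messages because it is in the standby site.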
In the event of an unplanned outage, where none of the remote queue managers in the “active site” are available, Lifeline can re-designate the “standby site” to be the new “active site”. After IIDR has completed any remaining replication of updates to the data sources, Lifeline will direct new messages sent from local queue managers to remote queue managers in this new “active site”. This recovery of the MQ workload, following an unplanned outage, can typically be accomplished in under a minute.
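The decision described above has two conditions: no queue manager in the active site is reachable, and replication to the standby site has caught up. The sketch below models that check. The function `unplanned_switch`, the state dictionary, and the `replication_backlog` counter are hypothetical and do not reflect how Lifeline or IIDR expose this information.

```python
def unplanned_switch(state):
    """Return the site designation after checking whether a switch is needed and safe."""
    active, standby = state["active_site"], state["standby_site"]
    if any(state["available"][active].values()):
        return state  # at least one queue manager still serves the active site
    if state["replication_backlog"] > 0:
        return state  # wait: the standby data sources are not yet caught up
    return {**state, "active_site": standby, "standby_site": active}


healthy = {
    "active_site": "A",
    "standby_site": "B",
    "available": {"A": {"QMA1": True, "QMA2": False}, "B": {"QMB1": True}},
    "replication_backlog": 0,
}
outage = {**healthy, "available": {"A": {"QMA1": False, "QMA2": False}, "B": {"QMB1": True}}}
catching_up = {**outage, "replication_backlog": 12}

after_outage = unplanned_switch(outage)
```

Only the `outage` state triggers a switch. In the `catching_up` state the function leaves the designation unchanged until the backlog reaches zero.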
For planned outages, such as for maintenance activities or verification of workload recovery procedures, Lifeline can enable graceful rerouting of messages sent from local queue managers. To perform a planned workload switch, Lifeline temporarily prevents any messages on local queue managers from being sent to any remote queue manager in the cluster. This allows applications to process any remaining messages queued on remote queue managers in the “active site” and allows IIDR to replicate any updates to the data sources to the “standby site”. Any queued messages on remote queue managers not processed by the applications are transferred by Lifeline to remote queue managers in the “standby site”, to ensure no messages are lost during the workload switch. Lifeline then re-designates the “standby site” to be the “active site” and any messages waiting to be sent from local queue managers are now able to be routed to remote queue managers in the new “active site”. This planned MQ workload switch, with no lost messages or lost data source updates, can typically be accomplished in under a minute.
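The sequence of a planned switch can be traced with a small model that accounts for every message. This is a sketch under simplifying assumptions: the function `planned_switch` and its parameters are hypothetical, the number of messages the applications manage to drain is passed in as `app_capacity`, and the IIDR replication step is not modeled.

```python
def planned_switch(held_on_local, queued_active, app_capacity):
    """Walk through a planned site switch and return where every message ends up."""
    # Step 1: sending is suspended, so held_on_local stays on the local queue managers.
    # Step 2: applications in the active site process what they can.
    processed = queued_active[:app_capacity]
    # Step 3: anything left unprocessed is transferred to the standby site.
    transferred = queued_active[app_capacity:]
    # Step 4: the standby site becomes active and the held messages are released to it.
    new_active_queue = transferred + held_on_local
    return processed, new_active_queue


processed, new_active_queue = planned_switch(
    held_on_local=["m4", "m5"],
    queued_active=["m1", "m2", "m3"],
    app_capacity=2,
)
```

Every message is either processed in the old active site or queued in the new one, which is the "no lost messages" property the text describes.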
IBM MQ provides high availability functions to quickly recover in the event of failures of queue managers in a z/OS sysplex. For unplanned or planned outages that affect all queue managers in the sysplex, recovering the queue managers in a disaster recovery site can result in an extended workload outage. IBM Multi-site Workload Lifeline can help your business enable continuous availability for your MQ workloads. For more information about Lifeline, see https://www.ibm.com/us-en/marketplace/multisite-workload-lifeline.
Michael Fitzpatrick is a Senior Technical Staff Member of the IBM Enterprise Networking Software Group, based in Research Triangle Park, North Carolina, in the US. He is the architect for the Multi-site Workload Lifeline product. Mike has worked in the networking area for 23 years, with a focus on resiliency, network design, and performance.