Db2 User Group | September 2025
Mainframes in Big Data
In this session, Colin Knight explores the role of mainframes—particularly Db2 for z/OS—in the evolving world of big data. Drawing on his extensive experience at NatWest, he shares how mainframe data is delivered into big data environments using tools such as InfoSphere CDC, Q Replication, and unload processes. Colin also looks ahead to the future of mainframes in big data, offering his perspective on how Db2 data can best be integrated into modern analytical frameworks. He concludes with reflections, a puzzle on big data, and an open discussion with attendees.
Read the Transcription
[00:00:01] – Amanda Hendley, Host
Well, thank you for joining me today. My name is Amanda Hendley. I am your host for today’s virtual user group, and we’re here to talk about Db2. So thank you for joining me. Today, we are going to have some brief introductory remarks. This is where we are on the schedule. Then we’ll have our presentation. We’ll leave a little bit of time for Q&A. Then we’ll wrap up today with some news and articles and announce our next session. Before we dive into the presentation, I do want to mention that it is Observability Month at Planet Mainframe, PlanetMainframe.com. We’re talking about all things observability and OpenTelemetry. There’s original content as well as some fun things like observability and OpenTelemetry trivia on Planet Mainframe this month. Please do check that out. For Q&A today, we are doing questions as we go and at the end. As we go, you can pop a question into chat. I think, Colin, correct me if I’m wrong, that if there’s something that needs a little bit more explanation, we’d welcome someone to come off mute and explain it.
[00:01:16] – Colin Knight, Presenter
Definitely. Cool.
[00:01:18] – Amanda Hendley, Host
We just want to make sure that you guys are able to follow along. If you do have questions along the way, be sure to ask them. With that, let me introduce our speaker; I’m going to stop my share. We’re here for mainframes in big data, and I’m excited to introduce you to Colin Knight. Colin has spent 40 years in the mainframe world, including 36 working with Db2. He’s been at NatWest since 1994. He gives back to the community as the co-chair of the Db2 UK User Group and the GSE UK Db2 Working Group. Thank you for joining us, Colin. I’ll let you take over the screen share.
[00:02:00] – Colin Knight, Presenter
I think that looks like it says it’s working. Hopefully, you can see the first purple screen. The color scheme is a copyright of NatWest. It’s a deep purple, and those are the colors of NatWest. Some of the screens will be deep purple, most of them will be white, so don’t worry too much about that. Okay, so welcome aboard, everybody. The subject is mainframes in big data. It’s around my part in its downfall, or opening the floodgates, whichever direction you choose to look at it from. Let’s see if we can move it on. Yes, it does work. This is the agenda I’m looking at today. The first item is just looking at what big data is and what the mainframe has got to do with big data. Then I’ll take you through how Db2 for z/OS in NatWest looks, just how we do things with Db2 and what scale we use and so on. I’ll move on to how we make mainframe data available for analytics on big data in NatWest. Then we’ll look forward a bit towards the future. I guess a lot of this is really a personal journey, so I’m really looking at what I’d like to see in the future and how I’d like to see the mainframe appear on the big data platform.
[00:03:36] – Colin Knight, Presenter
Some of the ideas that I’d like to see happen may or may not happen, and I’m reaching out to see what’s available going forward. Finally, we’ll summarize all of this. There’s a puzzle to look at, which has a big data slant, and some interesting musings on that subject. Then we’ll have questions at the end. If there are any other questions, we can do that then. Okay, so moving on to the first subject. When I started this, when I sat down and thought, Right, what am I going to do? Big data, what does that mean? What does it look like? I thought, Well, the first question is, how big is data? Unfortunately, I was thinking more along the lines of The Hitchhiker’s Guide to the Galaxy and how big space is. Wow, it’s really big. I remember that from the narrator in The Hitchhiker’s Guide to the Galaxy, and I think the same thing goes for big data and data itself. The current estimate that I’ve seen is that in the global data sphere, there’s around 181 zettabytes currently. Now, one zettabyte is a trillion gigabytes. Yeah, that’s really big. And one zettabyte is 10 to the 21 bytes, or a 1 with 21 noughts after it.
[00:05:13] – Colin Knight, Presenter
That is quite big. Just about everything we do, everything we touch, everywhere we go, creates data: banking, using your phone, card payments, CCTV, video doorbells. All that data goes somewhere, and that’s creating data all the time. Now, looking at the size of data is one thing, but then I thought, Well, how much data is actually created per day? Again, these are estimates, and I do question just how you find all that data, how you count that, how does anybody know what I’m storing on my personal phone? I hope nobody does know what I’m storing on my personal phone because that’s personal. But also on my laptop, how do people know how much data I’ve stored on my SSD? It’s got to be an estimate. I don’t know quite how they get to that estimate, but the suggestion is it’s 400 million terabytes per day. To my mind, that doesn’t sound like a lot, and it might be an underestimate. Then, as I was thinking about big data, I thought, Okay, is there a way that I could actually get through the day without creating any data? Moving away from my weekday job, it has to be a weekend, because otherwise I’m creating data as soon as I log on, et cetera, et cetera, with everything I do at work.
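As a quick, purely illustrative back-of-the-envelope check of those figures (the numbers below are just the estimates quoted above), the arithmetic works out like this:

```python
# Back-of-the-envelope check of the estimates quoted above (illustrative only).
ZETTABYTE = 10**21                          # 1 zettabyte = 10^21 bytes
GIGABYTE = 10**9
TERABYTE = 10**12

global_datasphere = 181 * ZETTABYTE         # ~181 zettabytes in total
created_per_day = 400_000_000 * TERABYTE    # ~400 million terabytes created per day

print(ZETTABYTE // GIGABYTE)                # 1,000,000,000,000 -> a trillion gigabytes per zettabyte
print(global_datasphere / created_per_day)  # ~452 days of creation at that daily rate
```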
[00:06:48] – Colin Knight, Presenter
Picking a Saturday or Sunday, the first thing in the day is I’d have to leave my phone behind, not touch it. That stops quite a lot of data being created. How about my video doorbell? Okay, so I could sneak out the back door if I wanted to go somewhere. But as soon as I got in my car, data is being collected in the car. If I pass CCTV or traffic cameras, they’re going to record more data on my car passing that area. It then gets really difficult to actually pass the day without creating any data. You couldn’t use that thing in the corner, which I won’t mention because otherwise it’ll start talking to me. There’s the Google one as well. I wouldn’t be able to turn on lights in my house, and it goes on and on. I think it would be quite hard to actually avoid creating any data in a single day. We all create data every day. I was also thinking about how some of that data is collected and what it’s used for. One thing I found, which is quite interesting, is soundscape data analysis. This is where they’re looking at urban soundscapes, which is basically just recording sound in an urban environment to look at public health and environmental changes.
[00:08:22] – Colin Knight, Presenter
Perhaps bird calls, what birds are there now and what birds were there maybe a year ago. The traffic flow, how that noise comes out on the microphones, and noise pollution itself from different sources. That must create a lot of data, but it could be really useful for making your urban environments much more healthy and better to live in. I think that’s an interesting angle on data, but all that data is recorded and stored somewhere, and it’s all part of that 1 trillion gigabytes. Sorry, it’s all part of the 181 zettabytes, which is a lot more than 1 trillion gigabytes. Okay, let’s move on. I’m going to ask that big question: what is big data? I’m happy for anybody to put anything in the chat for that or put their hand up if there’s anything anybody wants to call out, but I’ll go through the definition anyway. This is the definition of big data, basically. It’s extremely large and complex data sets that are difficult to process using, quotes, traditional data processing tools. Okay, I don’t get much out of that statement, to be honest. Traditional data processing tools, why?
[00:09:51] – Colin Knight, Presenter
That asks more questions than it answers. But the next section, really, is looking at what we call the three Vs. So the first one is volume. Yeah, it’s really big, massive. Terabytes of data, petabytes, exabytes, all of that. The next V is velocity. It’s streamed, quite often in real-time. And to analyze that data, you generally need to analyze it in real-time as well. So that’s certainly a big element of big data, the velocity it comes at you at, and obviously, the velocity you want to analyze it at as well. The third V, but by no means the last, is variety of data. It’s data types like structured data, DBMSs, spreadsheets, semi-structured data, XML, JSON, and your totally unstructured data like the text that I’m recording right now and the images, audio, and video that I’m recording. All of that, very unstructured. But quite often, you want to make sense of that alongside your other data types. That feeds into the traditional data processing tools not being up for the job, because you’ve got all types of data, structured and unstructured. Traditionally, you just operate on one data type. That’s where that variety comes in, and the idea of new data processing tools being needed as well.
[00:11:23] – Colin Knight, Presenter
The fourth of the three Vs is veracity. I think this is the important one, actually, because the whole question of big data is all about quality and how you can trust the data. Do you know where the data came from? Is it internal or external? Does it come from social media? We know of any number of issues from social media feeds that are well-documented. So yeah, veracity is really key to your big data. You can’t just analyze it; you need to check and make sure that your big data is actually correct. As I said, it’s hard to process with traditional data processing tools because you’re joining different data types from different sources. I think that’s another key element of big data: it isn’t just one data type, obviously. Okay, so hopefully that’s explained what big data is. I got a lot out of it when I was looking it up and reading about big data, and I’ve got several books on it now. I’m starting to understand it a lot more than I did maybe a few months ago. Right, well, what’s big data used for? The obvious uses would be around finance, fraud detection, risk management, all of which have to be in real-time.
[00:12:59] – Colin Knight, Presenter
I think healthcare is one key aspect of big data. Large-scale patient records would form part of big data used for diagnosis or investigating epidemics, things like that, health trends. Maybe looking at where money or resources in healthcare could be best focused by looking at data on where those hotspots are for various health scares and so on. That’s a key aspect of big data. I think I developed a dry throat this morning when I was going through this for about the fifth time, and it’s coming back, so I’ll be slurping water from time to time. I hope you’re still with me. Do ask any questions if you’ve got them. Another use, I think, is with social media, using social media input to maybe look at where to focus adverts, what adverts would get the most benefit. Government: health tracking, again, looking at crime patterns, perhaps, road safety, accident hotspots, looking at where to maybe put resources to improve traffic flow and safety on the roads. Science: the obvious one is climate change, and that must be using massive amounts of big data. One thing that I’ve read about was in manufacturing, using predictive maintenance by monitoring the equipment sensor data.
[00:14:53] – Colin Knight, Presenter
So that, therefore, you see maybe if a particular tool or part of the manufacturing process is needing maintenance, or it’s coming up for renewal, or something’s slowing down on it, then you could check that out and respond to it before it actually broke down, perhaps. I think the big example that really appeals to me is looking at F1 and the big data you see from an F1 car. Now, wherever the F1 car is going around the circuit, you’re getting potentially a million data points per second, probably more than that. Collecting that really is big data, because you’ve got sensor data, you’ve got velocity data, you’ve got gravitational pull data, you’ve got visual data, you’ve got tyre temperature data, you’ve got any number of sensors, all types of data feeding into that big data view of your F1 car. There’s no way that F1 would be where it is today without all of that data being crunched. Finally, in the last slide, we talked about not being able to use the traditional methods of data processing on this new big data. I’m thinking, well, tools for big data are many and varied, and certainly things like Hadoop, Apache Spark, NoSQL, MongoDB, Kafka, Snowflake, TensorFlow, all of these are new tools that can certainly hit your big data and make sense of it and analyze it and do what you need to do with it to get the answers from your big data environment.
[00:16:43] – Colin Knight, Presenter
Right. That’s big data and what it’s used for. Now, how do we use Db2 and z/OS in NatWest? Just to give you a view of how we operate on Db2. We certainly do provide Db2 data for big data, but let’s go through the data we get on the mainframe. On peak days in 2025, we saw 21,740 transactions per second. A large amount of those were CICS transactions. We have a large CICS environment, but also off-host transactions that are coming in through the old distributed connections into Db2. We also have a table, the biggest table we have, with 65 billion rows. I’ll look at that again next year and see how big it is, but it’s certainly one that keeps growing. It’s always easy to find that table, and it’s been the biggest one for the last five or six years, I think. Db2 itself on z/OS, we are now on Db2 13 Function Level 500 (FL500), and that’s across the whole of NatWest. That’s the old new function mode, really, as it used to be before it got to version 12 and so on. We’re very proud to announce this bit, which is that we have the first z17 installed in Europe, and that landed in May, in fact.
[00:18:38] – Colin Knight, Presenter
So even before it went GA, and that’s up and running and being thoroughly tested, ready for production. I think GA was announced in April for the 18th of June, so we were quite lucky to get that delivered and get it connected up. Within the main Db2 environments, we run 921,000 SQL statements every second. I was thinking, how many SQL statements do I write a day? Well, I’ve written two or three today, so it would take me a long time to write 921,000, I think. That’s probably at peak, and that’s in our main Db2 group. Overall, it’s more than that for all the Db2s we have. We also have a message from Teams, which I should switch off. We also have over 20 million customers in the UK for NatWest Bank, and we operate across retail, commercial, and private banking markets within the UK. That’s what we do with Db2 in NatWest and a bit about what we do in the banking environment in the UK. Now, I probably could have shown this slide earlier, but I think Amanda already mentioned that I’ve been in the mainframe world for 40 years, 36 years in Db2.
[00:20:22] – Colin Knight, Presenter
The first Db2 that I worked on was version 2.1. I had just arrived as 1.3 was going out the door and 2.1 had been installed and everything had been migrated. That was back in 1989. That’s quite a while back. I just had to try this when I was preparing my presentation. I used the internal AI tool and I asked the question, who is Colin Knight? Now, I know within NatWest there are two guys with my name, me and somebody else who works in the bank outside of IT. I thought, let’s ask the AI machine what it says. It came back and said that Colin Knight was a senior executive, a Chief Risk Officer. Now, thankfully, I’m not, but I wasn’t aware of that third Colin Knight, and I’m not sure if that’s a hallucination or what from AI, because I don’t recall a Colin Knight being a Chief Risk Officer. That may just be crossed wires within AI, and I’ll probably investigate that later. I’m a Db2 Systems Programming Tech Lead, and I’m 61 years old, which probably means I should have retired last year or maybe a few years before that, but I’m still enjoying it, so we move on.
[00:22:01] – Colin Knight, Presenter
I should just add, actually, just for completeness, my interests are obviously F1, as I mentioned earlier, photography, steam trains, computers, birds and wildlife, and walking. Looking after my four-year-old grandson, which I’ll be doing tomorrow from quite early on, giving him his breakfast and so on. So yeah, quite busy still, but I’m enjoying working in Db2 still. So how do we in NatWest actually make that mainframe data available in the big data environment? Traditionally, and for many years, we’ve done extracts, and we’ve used simple unloads. More recently, we use the High Performance Unload, HPU. That works, and it’s quite cheap, and that’s good. Also, we use Q Replication using IBM MQ. The Db2 data changes go into IBM MQ and then out to Db2 target tables. Then extracts from those targets are sent to the big data environment. We also use a tool called InfoSphere CDC, which, through sheer total good luck, I was able to become the SME for. I often think about that when I’m called out at three o’clock in the morning, and thank my lucky stars that I became the SME for CDC. We feed InfoSphere CDC data into Kafka, generally, but from there it may go to MongoDB and other systems to be analyzed to get the value out of the big data.
[00:24:00] – Colin Knight, Presenter
We also have some direct links, for example, Splunk via a direct link and SQL into the Db2 database itself. We take data into Teradata and Oracle to do the analytics. We have an IDAA in test, which we do a lot of analysis and monitoring with. That really is very useful, especially around SMF data and so on. Then, more and more, we are creating REST APIs so that we can expose the mainframe data for big data analytics and make it easier to see what’s going on within the mainframe. Those are the things we use to get mainframe data into big data. Any questions so far? Moving swiftly on. I’m going to just go through those tools that we use. Extracts, unloads, and HPU: they’re great, they’re cheap. We don’t necessarily do much in the transforms, though we can do some clever stuff with HPU. But at the end of the day, you’re creating multiple copies, potentially in multiple locations. It’s usually a moment in time. Typically in NatWest, it’s daily and weekly. There are sometimes monthly extracts, perhaps for MIS reporting, but it could be more often. Generally, it is daily in most cases. Very cheap, very easy to set up: set up a batch job, run it every day, grab the data, off you go.
[00:25:55] – Colin Knight, Presenter
It’s been in use for many years with no problems, and the CSV files hang around. But what bugs me sometimes is that some of these unloads could have been written and created 10, 15 years ago. Are all those files that we create and unload actually used by anyone? I think if you’ve got full control over all of your channels and applications and things don’t get decommissioned very often, then sure, you know exactly where that data is going, who it goes to, and what uses it. But generally, I imagine there are some files we have created where nobody knows quite what they’re used for, or if they’re still used. I think that leads on to your golden source and data dictionary view and having a view of everything, which is quite key for your big data, I think. As I say, HPU does clever things, can do some transforms in your extracts, and all of these unloads are very good for MIS reporting. If you’re looking at yesterday’s data and you want to report to management how things look in your bank, then it’s very simple, very easy. It does the job. Now, on to Q Replication, which is one of those tools that is absolutely brilliant when it’s working, but when it goes wrong, everybody sits there looking at it scratching their head.
[00:27:29] – Colin Knight, Presenter
I’ve been on one of those recovery calls, well, a year or so ago, and that involved IBM, who also sat around the table virtually and scratched their heads. I think we did solve it after about six hours, but it creates more moving parts, and more things can go wrong. Typically, we do Db2 to Db2. Sometimes it’s a different Db2, but usually… well, no, I think equally it goes to a different Db2, and sometimes it goes to the same Db2, and the target’s on the same Db2 as the source. It does replicate large volumes very nicely with very low latency. CPU costs, though, are not insignificant. They are significant. Just look at what has to happen with Q Replication. You’re looking at Db2 logs; you’re going to take in those logs for any changes for the tables that you’re replicating. Any changes you see, you grab from the log and you insert the data into an IBM MQ queue. Sometime later, on the apply side, you read the IBM MQ queue and you say, Oh, there’s a change for my target, and put that into the target table with a timestamp so I know exactly when that operation happened.
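The IBM-supplied Capture and Apply programs do all of this for you; the sketch below is only a conceptual illustration of the apply-side pattern just described, reading change messages off an IBM MQ queue and applying them with a timestamp. The queue manager, channel, queue name, and the JSON message layout are hypothetical placeholders, not how Q Replication actually encodes its messages.

```python
# Conceptual sketch only: the real capture and apply sides are IBM-supplied
# Q Replication programs, not user code. This just illustrates the pattern above:
# read a change message from an IBM MQ receive queue and apply it with a timestamp.
import json
from datetime import datetime, timezone

import pymqi

def apply_to_target(change):
    # Hypothetical apply step: in reality this would be an SQL INSERT/UPDATE/DELETE
    # against the target table, stamped with when the change was applied.
    change["applied_at"] = datetime.now(timezone.utc).isoformat()
    print(change)

qmgr = pymqi.connect("QMGR1", "APP.SVRCONN", "mqhost(1414)")  # hypothetical queue manager details
recv_q = pymqi.Queue(qmgr, "QREP.RECV.QUEUE")                 # hypothetical receive queue

try:
    while True:
        try:
            msg = recv_q.get()                                # destructive get, no wait
        except pymqi.MQMIError as err:
            if err.reason == pymqi.CMQC.MQRC_NO_MSG_AVAILABLE:
                break                                         # queue drained, stop for now
            raise
        change = json.loads(msg)                              # e.g. {"op": "INSERT", "row": {...}}
        apply_to_target(change)
finally:
    recv_q.close()
    qmgr.disconnect()
```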
[00:28:57] – Colin Knight, Presenter
Now, that operation is not free. You’re going from one Db2 potentially to another Db2 using IBM MQ, probably two MQs in the middle. As the diagram shows, there are various queues, admin queues, control tables. Excuse me, dry throat. There’s a lot of moving parts in here, and a lot of things that could go wrong: in IBM MQ, potentially in networks, potentially in the Db2 that’s producing large volumes of log changes to be analyzed by Q Replication Capture. Excuse me. Certainly, it’s not free, for all those reasons. Various things can cause upset to Q Replication. Typically, you change a structure. That could stop the subscription for your replication. This is really problematic when you’re running in test, because basically a simple table change that you might want to do quite regularly in test suddenly means your subscription falls over. That is something that can be a real headache in test, and even in production it can knock out your subscription. But equally, there’s keeping source and target in sync, so you know that your data on target is up to date with source. Luckily, with Q Replication, there is a way of reporting how your target compares to source. That should alleviate any doubts that your target data is out of sync.
[00:30:45] – Colin Knight, Presenter
But then you extract from the target table and send it outside into big data environments. How do you know that your new target data location and the data in there is the same as the source data was, especially if you’re doing conversions and changing character sets and so on, and transforms? That can certainly add to the picture and maybe cause problems, almost corrupting the data as it transfers from the mainframe into big data. But also, just looking at taking that data from Q Replication outside: taking the data from the target tables for big data. One thing that we do on certain systems is to say, actually, I’m not going to take it in real-time as such, but I want to have a consistent copy of those tables for my application. To do that, we do something called a suspension of the capture and apply, and we take a copy of those target tables that have been updated, say for the previous day or the previous week. We take all the data out of that for what we want to do in big data, and then we resume the capture and apply and carry on processing the data.
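The shape of that suspend, copy, resume pattern looks roughly like the sketch below. The helper functions and table names are hypothetical stand-ins for whatever site-specific commands or jobs perform each step; the real mechanism behind it is the spill queues described next.

```python
# Orchestration shape only: the helpers are hypothetical stand-ins for the actual
# site-specific commands or jobs. The point is the pattern: suspend, take a
# consistent copy, and always resume.

def suspend_capture_and_apply():
    print("suspend: new log changes accumulate instead of hitting the target")

def unload_consistent_copy(tables):
    for table in tables:
        print(f"unload {table} for the big data extract")   # e.g. an unload job per table

def resume_capture_and_apply():
    print("resume: accumulated changes are applied, then normal replication continues")

target_tables = ["SCHEMA1.ACCOUNTS", "SCHEMA1.PAYMENTS"]     # hypothetical table names

suspend_capture_and_apply()
try:
    unload_consistent_copy(target_tables)                    # copy is consistent as of the suspend
finally:
    resume_capture_and_apply()                               # always resume, even if the unload fails
```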
[00:32:26] – Colin Knight, Presenter
What we have within Q Replication for that is spill queues. Whilst we’re stopping the actual updates, sorry, we don’t stop the subscription, we stop the updates through capture and apply. Any data that is captured from the log is sent to a spill queue, to be applied when we resume. We suspend, take a copy of the targets, and then resume, and then we apply all the spill queue updates to the target. That’s quite a nice way of taking a consistent copy, and we use that in one or two places within NatWest. We have quite a bit of Q Replication between Db2s and into Db2s, if you like. All right, now we get on to InfoSphere CDC, which, as I said, is my specialist subject. This is a good product and does a job. It is capable of supporting high-volume throughput through Db2, capturing the changes to Db2 tables and pushing them out to the big data environment. It also generally has low latency, but that can increase at peak logging in Db2. When Db2 logging is really busy, often on an evening, then the latency can go a bit higher than zero.
[00:34:26] – Colin Knight, Presenter
The design of CDC is that it’s not going to be synchronous replication. It’s got latency built in by design. CPU costs: we see CDC is definitely more expensive than Q Replication, which is a shame because it’s really performing the same function within Db2. It’s scraping the log for changes to tables and then applying them to your target, very similar to Q Replication. But the actual capture process within CDC is more expensive than Q Replication from our experience. I think it has been getting better. They are getting closer to what Q Replication does, but there’s still a slight extra cost, so it’s a consideration for sure. One area that’s been a real problem for us in NatWest for a while is looking at how we refresh targets. That’s a key area. Typically, for example, in Db2 we’ve got a reorg with a discard, so we’re discarding records and we’re not logging that process, because we don’t need to, we don’t have to. It’s a good way of speeding up the process. Because there’s no logging, CDC can’t keep track of the changes within that process.
[00:36:12] – Colin Knight, Presenter
What CDC has to do is register a synchronization process between target and source. That process is called a refresh. Now, another example is where we have, say, a table that is built up with inserts all the time, but actually we process most of that data within two or three days. Say a week’s data is all we keep in there. After a week, we have reorgs that then delete all the data older than a week, because we know we’ve processed all the data. Maybe we’ve got a flag in there saying that it’s been processed as well. We reorg with a discard to delete all that old data. Now, with CDC, that triggers a refresh. We’re going to select the whole table from Db2 and send that down the pipe to the target. But why do we really need that? If we’ve been capturing all the data via inserts into the table in CDC and sending it to the target, why do we need to send all the data again? That’s where we get into: can we ignore the data, or do we need to refresh the data? Do we really need to send all those duplicates, or can we just ignore them?
[00:37:40] – Colin Knight, Presenter
Through various learnings, it’s very key that you do not ignore data via post filters, because what that means is, if you use a post filter to ignore your data refreshes, then you grab all the data in Db2. You select potentially a couple of billion rows, a million rows, however big the table is. You send all that data down the pipe to your target, and the apply engine, the target engine, says, Yeah, I’ve got a post filter on that table. I’m going to discard all of that data and throw it all away and not apply it to the target. I don’t need it. Thanks very much. You’ve wasted all that CPU on the mainframe to get nothing. But worse than that, if you do ignore data via post filters, you can also lose data, because CDC sometimes expects data to be in the refresh if data is being changed during the refresh operation. That data might appear in the refresh, and if you ignore the refresh, you don’t get all the updates that happened during the refresh period. There are a lot of issues there. You could easily fall into losing data. You need to be very careful.
[00:39:05] – Colin Knight, Presenter
Always try and make your ignore process on the source side. Then you don’t waste CPU and you don’t lose data. We use Kafka as our main target for CDC, and that works quite well. You get decision-making applications and systems that subscribe and consume the data out of Kafka, and that all works very well. Yeah, I can’t argue with that. There are transforms and things that you need to do, and these must be efficient, and some of that is provided by the user, potentially, to translate the data from the mainframe into Kafka and then on to wherever it’s going at the target. But a key element is how you manage that environment. You can see there’s a lot of moving parts again, even more than Q Replication. You’ve got a mainframe side, my team, for example, that looks after the CDC source capture side. You’ve got a team that looks after the target apply side on a server outside of the mainframe. You’ve got a team that supports Kafka and looks after that. You’ve got teams that look at the networks, both internal mainframe and external networking. All these have to sit around a table and look at the diagnostics to find where the problem lies.
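A minimal sketch of what a downstream consumer of those CDC topics might look like, using the confluent-kafka Python client. The broker address, group id, topic name, and message layout are assumptions; the real ones depend on how the CDC apply side and Kafka are configured.

```python
# Minimal sketch of a downstream consumer reading CDC changes from Kafka.
# Broker address, group id, topic name, and message layout are hypothetical.
from confluent_kafka import Consumer

consumer = Consumer({
    'bootstrap.servers': 'kafka-broker:9092',   # hypothetical broker
    'group.id': 'big-data-analytics',
    'auto.offset.reset': 'earliest',
})
consumer.subscribe(['db2.schema1.accounts'])    # hypothetical CDC topic name

try:
    while True:
        msg = consumer.poll(1.0)                # wait up to 1 second for a message
        if msg is None:
            continue
        if msg.error():
            print(f"consumer error: {msg.error()}")
            continue
        # Each message is one captured Db2 change; decode it and hand it to the
        # decision-making application.
        print(msg.key(), msg.value())
finally:
    consumer.close()
```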
[00:40:51] – Colin Knight, Presenter
Gulp, gasp. Getting dry now. Again, a bit like in Q Replication, you’ve got this target versus source issue and query. Have you got all the data? Is the data as accurate as the source, or have you somehow converted it and lost some of the important elements of the data? Therefore, as I’ve hinted, there are certainly a few problems with CDC replication as well. Okay, so yes, it’s clever, but there are a lot of moving parts. You’ve got capture, you’ve got apply, you’ve got the Kafka target, you’ve got the network. The log scraping process is one element on the capture side. Yeah, and then on top of that, you’ve got your post filters potentially converting the data in one way or another. Some of the post filters can actually say, I’m not interested in anything but inserts, or only updates. You have to be really careful how that’s set up as well. The refresh is annoying, as I’ve said. This can be unpredictable, especially if you do ad hoc reorgs without logging. That will cause problems for your CDC. We’ve mentioned post filters and the problems you can get there. CPU costs, obviously, for getting all the data from source and then throwing it away because you’ve got a post filter that says, I’m going to ignore all your data for the refresh.
[00:42:30] – Colin Knight, Presenter
Log scraping is costly and not as efficient as Q Rep, but we don’t use CDC just for Db2 capture and apply; we also capture VSAM data set changes and Oracle table changes. It’s definitely a big tool for analytics and big data. It’s important and definitely something we can’t live without at the moment. But you do have to have, I think, a good understanding of how that data changes at source, what happens to it, before you can move on to maybe making the best use of it within CDC. We’ve also hinted, with Q Replication, about structure changes. Even more so with CDC: if you make a structure change, it’s now not just the applications on source that can be affected by your structure change, needing application rebinds in Db2 and so on. You make sure that you apply your source structure changes at a quiet time, at the correct time. But now you’ve got the added dimension that you impact target applications, the other side of CDC. You could cause an outage to CDC or a subscription stop if you change the structure at source. You’ve now got to consider another dimension, making sure that you can make your changes when it’s good for your target application as well.
[00:44:15] – Colin Knight, Presenter
Excuse me a second. I’m going to have a cough. Right. Are there any questions? I can’t see any at the moment, so I’ll press on. In a way, suddenly, for the changes you want to make at source for your source application, you’ve got to ask permission and find the right time to make the change with the target application, after CDC and through into big data, and the applications that use the data there. In my mind, that’s the tail wagging the dog. You’re not just taking data from the mainframe, you’re actually influencing when structures on the mainframe can change so that you don’t affect your application on big data. Yes, and for very good reasons for the customer, you’ve got to do that. But ultimately, I have to say, CDC does the job. It’s mostly low latency. I’ve got to say, I’ve met a lot of new people outside the mainframe by supporting CDC over the last few years, with plenty of call-outs at times. Less time spent in bed, perhaps, but I think I now fully understand what makes CDC tick. So, yeah, it’s good, but there are problems at times.
[00:45:50] – Colin Knight, Presenter
And those are the beads of sweat when you wake up at three o’clock and the person on the phone says, We’ve got a CDC problem. Could you have a look at it, Colin? Yeah, you get a lot more beads of sweat than that, I think. Right, okay, let’s move on. All right, now I’ll move on to another pet topic of mine, which is the IBM Db2 Analytics Accelerator. The first thing to know about IDAA: again, they’re a really good and very useful way of running analytics on your mainframe data. But you do only have a subset of SQL that you can run in an IDAA, so you can’t run everything you can on Db2 for z/OS in an IDAA. There’s a subset. But that subset does grow every time they do an upgrade. There’s less and less that you can’t do. The really key element, in some ways, is that replication between Db2 and IDAA can be done in real-time via the Integrated Synchronization feature, which is faster and cheaper than Q Replication or CDC by a long way. I think that’s a real consideration for an IDAA. Also, coming along in version 8.1, we’ve got a new monitoring configuration, and that answers a lot of questions when you’re trying to find out how things are operating within IDAA.
[00:47:40] – Colin Knight, Presenter
That’s a really good feature coming along in 8.1. Now, another good thing about IDAA is that you can load not only Db2 data but non-Db2 data into your IDAA. In our case, in NatWest, we load a lot of SMF data into IDAA from the files out of SMF extracts. That’s really clever. We really do get good value from looking at SMF data stats out of IDAA. The sad demise of Data Studio I have to mention here. Data Studio had plugins to manage IDAA and the Loader and was quite nice to use. I found it a pain, but lots of people do like Data Studio. It certainly does the job. Now, in the new world, we’ve got Admin Foundation and the new tools from that, which manage IDAA and the Loader. There are other ways of using IDAA, for archiving data, perhaps. So data in Db2 that maybe you don’t need interactively, you can archive and compress and shrink within IDAA quite significantly and use it later if you need to. We also, as I say, do a lot of SMF monitoring in our IDAA, and the CDC SMF data is really quite useful. I take the data around latency, for example, out of SMF for CDC.
[00:49:26] – Colin Knight, Presenter
That gets loaded into the IDAA each morning from the previous day, and I have a little process that then tells me exactly what the maximum figure was for latency the previous day. Before anybody tells me there’s a problem, I can see that there might have been a slowdown at a certain time, and then I can investigate it from there. That’s quite handy. It’s quite an early warning alert system for the morning. Now, I had to look at the big question. You’ve got an IDAA, what does it do? And how is SQL faster? The definition is that SQL running on an IDAA can be orders of magnitude faster than in Db2. I think it really depends on what you actually run. If it’s a real warehouse-type query, an analytical query, then yeah, I think it would be orders of magnitude faster than Db2. But if it’s a simple get to a table via an index and a very simple query, it’s probably the same as Db2 on the outside. It depends what you’re asking, I think, as to what you get faster out of IDAA. I’m going to have another cough. Excuse me a second. I’ve lost my cursor.
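A sketch of the kind of morning latency check described above, assuming the CDC SMF data has been loaded into a table the accelerator can see. The connection details and the table and column names are hypothetical; the special register simply asks Db2 to route eligible queries to the accelerator.

```python
# Sketch of the morning latency check. Connection string, table, and column names
# are hypothetical; the real ones depend on how the SMF extracts are loaded.
import ibm_db

conn = ibm_db.connect(
    "DATABASE=DB2P;HOSTNAME=zoshost;PORT=446;PROTOCOL=TCPIP;UID=user;PWD=secret;", "", ""
)

# Ask Db2 to route eligible queries to the accelerator (IDAA).
ibm_db.exec_immediate(conn, "SET CURRENT QUERY ACCELERATION = ELIGIBLE")

sql = """
SELECT MAX(LATENCY_SECONDS) AS MAX_LATENCY
FROM   SMFSTATS.CDC_LATENCY          -- hypothetical table loaded from SMF extracts
WHERE  RECORD_DATE = CURRENT DATE - 1 DAY
"""
stmt = ibm_db.exec_immediate(conn, sql)
row = ibm_db.fetch_assoc(stmt)
print(f"Maximum CDC latency yesterday: {row['MAX_LATENCY']} seconds")

ibm_db.close(conn)
```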
[00:50:57] – Colin Knight, Presenter
All right. Okay. Under the covers, it’s IBM Db2 Warehouse. That’s what we use within IDAA. It does the job. It’s very clever. Just looking at a bit more about IDAA. In 8.1, there’s a nice new feature where you can copy data between IDAAs. Before 8.1, you used to have to take the data out of IDAA into Db2, then copy it into another IDAA. Now, for example, if you’re upgrading an IDAA and you want to make sure the data is backed up, you can copy data to another IDAA if you’ve got the space, and therefore be sure you haven’t lost any data after the upgrade. A key feature of IDAA is that you can load non-Db2 data. It’s not just for Db2 data. As I said, we use a lot of SMF data in IDAA, and we use it for monitoring things like IBM MQ and CDC. You can run complex queries on large data sets, which is the definition-of-big-data thing. But on top of paying for Db2, you’re paying for IDAA, all the features that you need with an IDAA, and all the mechanics and the black box that you have to install for IDAA.
[00:52:40] – Colin Knight, Presenter
So, yeah, it’s very good, but I’m not sure. Yeah, it’s more expensive. That’s the bottom line. Okay. So, looking at problems and issues of making mainframe data available for big data. I think the major consideration has got to be around security. I work for a bank. It’s a key element in everything we do. It’s not just security at the target, which could be in the cloud. It’s also security in transit. That’s a major consideration with anything we do extracting mainframe data and sending it to big data. We have to make sure it’s synchronized. I’ve gone through a few problems and explained how checking target against source is key to a lot of application tasks, and synchronizing the data is key to making sense of your big data. It’s that validity of your big data, the final V, that’s important here. Yeah, that’s important, making sure it is synchronized properly. Latency is always a consideration, even in real-time. Even with the best intentions, most replication tools have some latency built in. They’re never going to be zero all the time. How does that affect your target application? Is it good enough? Can customers wait a few seconds for the answer?
[00:54:30] – Colin Knight, Presenter
If it’s unpredictable, that’s going to be a problem for your customers, surely. Now, one particular problem, excuse me a second, is what happens when you run a recovery at source, or the source changes. For example, I want to recover my key table. It’s now UK time, 25 past 5, and I’ve got to get my data back to 10:00 this morning because someone did something, ran a batch job at a wrong time, maybe repeated a batch job, and all the updates have been repeated and the data’s wrong. We have to get that table back to 10:00. Great. Recover. Got the timestamp in the RECOVER utility in Db2. The source table’s back to 10:00. Everything’s working. The source application’s fine. Everything’s running. Customers are happy. Off we go. But now you’ve got another dimension. You’ve got: what happens to the transactions you’ve captured in, for example, CDC? You’ve now got data that never existed. How do you cope with that at the target? You then have to somehow remove those records from your target systems, your applications relying on big data. That’s a bit of a key headache, and hopefully you don’t see that too often, but it could be a real headache for your customers.
[00:56:08] – Colin Knight, Presenter
CPU overheads: it all costs money to replicate. Maintenance costs more, to support the processes at source that do the replication. My question is, why don’t you just leave the data on source and view the data in a big data environment from the mainframe directly? Finally, moving data does give a lot more units of fun, especially around quiet times when you should just be sleeping. It’s nice to have that wake-up call at 3:00 AM sometimes. Moving on to the future. We’ve talked about problems, we’ve talked about how good CDC can be, we’ve talked about IDAA, we’ve talked about big data. Generally, as I’ve shown, most of the Db2 data and the mainframe data in big data in NatWest is being replicated into a big data environment. I think the future should be along the lines of virtualization, using tools like Cloud Pak for Data for data virtualization. These give you a view into the mainframe data without creating a physical copy, often using caching techniques. You’ve still got a copy in real-time, but you haven’t got copy after copy in different locations and all the management headaches around that. I think the key element is always going to be around structured and unstructured data views, how you see Db2 alongside text, video, audio data, perhaps, and how you make sense of that.
[00:58:09] – Colin Knight, Presenter
That can be done virtually, I think, in the future, certainly. Simplifying it, simplifying the process to get data from the mainframe to big data. We don’t want to move the data around all the time. We want fewer moving parts, so virtualization would answer that question. Maybe another way is using more REST APIs, and I think they’re coming out all the time in a lot of areas, to expose data to applications outside of the mainframe and deliver that information to big data applications. It delivers the data in web-friendly formats, JSON and so on, which is easy to integrate with big data analytics platforms and pipelines, so that’s a good way of getting data out. You can’t really get through a presentation in the 21st century, in 2025, without mentioning AI. AI and SQL DI modeling the data on the mainframe, which is really good, and I think something that we all want to look at and get working. We can get some really big analytical questions answered through direct access on the mainframe. It could be more IDAA and less CDC. That would be nice, I think, in certain aspects. IDAA replication, as I said, is much more efficient and cheaper, and you can get answers out of IDAA for your analytic queries.
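A minimal sketch of consuming one of those REST APIs from a big data pipeline. The host, path, credentials, and JSON response shape are all hypothetical; the real service could be built with z/OS Connect or Db2’s native REST services.

```python
# Minimal sketch of calling a REST API that exposes Db2 data as JSON to a big data
# pipeline. Host, path, credentials, and response layout are hypothetical.
import requests

resp = requests.get(
    "https://mainframe-gateway.example.com/db2/accounts/summary",  # hypothetical endpoint
    params={"asOfDate": "2025-09-01"},
    headers={"Accept": "application/json"},
    auth=("svc_bigdata", "secret"),                                 # placeholder credentials
    timeout=30,
)
resp.raise_for_status()

for row in resp.json().get("rows", []):    # hypothetical JSON layout
    print(row)
```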
[00:59:38] – Colin Knight, Presenter
That would be a good way of doing it. In my mind, the future has to be around keeping data at source as much as possible, whilst you’re still extracting some data. The concept of data gravity is a good point here, where you have a tendency for large volumes of data to attract applications. Why don’t you run some more applications on the mainframe for that? Move services and other data onto the location where the data actually resides. Could this happen with mainframe Db2 data? Good question. Finally, here on this slide, we’re looking at quantum data, quantum computing. What can that do for analytics on big data? Well, yeah, quantum computing would definitely deliver massive parallelism and speed. It improves handling of complex data structures, and you get enhanced machine learning. But obviously, the big snag at the moment is overcoming the current technological challenges. Is that really the future? I think quantum computing will come for sure. Excuse me. It’ll be interesting to see how it evolves around big data and analyzing that. All right. Just looking at something we’re doing and hoping to produce over the next few months, a year or two.
[01:01:11] – Colin Knight, Presenter
We’re taking data out of SYSVIEW via the APIs provided by Broadcom for monitoring tools and Db2. We’re hoping to use Python via those APIs to produce graphs and look at the data for monitoring Db2. But also, within Grafana, we can produce dashboards to report on monitoring information from the Db2s. That’s opening a window, I think, which I’m hoping can be exposed to areas maybe outside the mainframe, but certainly to our management team, to look at how Db2 is performing and what it does, which might help to answer some questions for a lot of people. And thanks to Broadcom for giving us help along the way there. SQL DI? Yes, I think that should be easy to install, but it isn’t, and I’m not sure why. But even if we don’t use SQL DI, we still have AI acceleration on mainframe data via the z17, coming very soon to most mainframe shops. So there’s going to be more AI, and that’s definitely going to influence how we do analytics on big data. But the key element, I think, is always going to be getting some form of golden source and getting your data dictionaries, so that you can actually see what data is available and how to get hold of it.
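A sketch of the kind of pipeline described above: pull Db2 monitoring data from a SYSVIEW REST API with Python and write it somewhere a Grafana dashboard can pick it up. The endpoint path, parameters, and field names are hypothetical; the actual Broadcom APIs and the Grafana data source used will differ.

```python
# Sketch only: fetch Db2 monitoring samples from a hypothetical SYSVIEW REST
# endpoint and write a simple CSV that a Grafana data source (or further
# processing) can read. Endpoint, parameters, and fields are assumptions.
import csv
import requests

resp = requests.get(
    "https://sysview.example.com/api/db2/metrics",        # hypothetical SYSVIEW endpoint
    params={"subsystem": "DB2P", "metric": "cpu"},
    auth=("monitor", "secret"),                            # placeholder credentials
    timeout=30,
)
resp.raise_for_status()
samples = resp.json().get("samples", [])                   # hypothetical response layout

with open("db2_cpu_samples.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["timestamp", "value"])
    writer.writeheader()
    for s in samples:
        writer.writerow({"timestamp": s.get("timestamp"), "value": s.get("value")})
```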
[01:02:52] – Colin Knight, Presenter
That’s the vision I think people really need to look at: how to get data dictionaries and golden source really visible to everyone who wants to use big data. Visualizing the data and virtualizing the data is very important. We’ve not got far to go now. Getting to the last bit. I’ll just put a slide out there to ask some questions. I think these are ones I’m going to leave with you, in some ways, for you to think about. Do you need your own copy of the data? I think in the analytical world, often you do, because you want to change that data, your copy of the data, and say, Well, what if I do this? What if I change that? What does that do? How does that look if I run my query again? It’s got to be close to real-time. But how close to real-time is sufficient? Is latency at any time going to be an issue? Do I have to have zero latency 99.9% of the time? I think more and more that’s going to be key to your big data applications. Is virtualization and a holistic view of big data the final answer?
[01:04:07] – Colin Knight, Presenter
I think it probably is. It’s as simple as that, isn’t it? We really do want to have a virtual view, an overall view, of all the data in big data, wherever it comes from, and understand how it got there. I think virtualization could be a big answer to most of those questions. Do we know what data is actually available? Again, that’s looking back at the holistic view and understanding your data across the mainframe within a big data environment. Do we understand, or need to understand, how data is used or changed at source? Well, potentially, if you’ve got a virtualized view of your data on the mainframe for big data, you may not need to know how it changes. You may be able to watch it through your virtual tools and understand it that way. That may be an answer to your problem of understanding how the data has changed and what I’m seeing in my target application. How do I travel through the world of big data? What tools am I going to use? There are always tools coming out for big data every day, so knowing what’s available and how you can use it, golden source, data dictionaries, these are all the answers.
[01:05:38] – Colin Knight, Presenter
Then what about data lakes, data lakehouses, hybrid cloud solutions? I think hybrid cloud solutions are the answer, and it’s just looking at the tools to make your virtual data available. There are more questions and answers there, and something for people to think about, I hope. Okay, I began to build this presentation by looking out for stuff on big data, basically. I found a very interesting chap, Viktor Mayer-Schönberger, who provided a podcast on big data from 2016. I think it was presented at Oxford University in the UK. Now, I was listening to that avidly on a wet Sunday afternoon, and it really sparked my imagination, and it really made me think about how we use big data and what it’s all about and how the mainframe fits into all of that. One aspect that is really interesting brings out how big data works. This is prior to 2016. Walmart did an analysis of shopping behaviors just before a hurricane was due to hit, based on the forecast. Say, Tuesday, a hurricane is going to hit your area. They wanted to look at what people actually purchased just before that hurricane hit town. What did they go out and buy?
[01:07:15] – Colin Knight, Presenter
Now, I think it’s fairly easy to know that things like torches, flashlights, batteries for torches, fuel for generators, perhaps, all key things that people did go out and buy from Walmart, and certainly the stats and the data showed that. But there was one thing that also came up on the analysis of things people purchased that nobody had a clue why. I don’t know whether anybody can think of what else people might want to buy when there’s a hurricane coming. Well, people, apparently, went out and bought higher volumes of Pop-Tarts when the hurricane was about to hit. Why? In my mind, chances are if a hurricane hits, you’re going to lose power, you’re probably not going to be able to warm up those Pop-Tarts. Why would you buy the Pop-Tarts? Yeah, and this is the thing. They tried to answer the question of why. From the data, they could see definitely there was a trend in more Pop-Tarts were sold. Then they realized, looking at the data, well, that doesn’t matter. It doesn’t matter why they want to buy more Pop-Tarts. The data shows they’re buying more Pop-Tarts, so Walmart had to provide more Pop-Tarts.
[01:09:00] – Colin Knight, Presenter
When there was a hurricane about to strike, the plans were to make sure that there were plenty of Pop-Tarts available, that the supply wouldn’t run out, and they put them at the front of the shop so they could be purchased when there’s a hurricane about to strike. Which shows you good analysis of big data: put out there what the customer wants just before a hurricane, even if you don’t understand why. It’s important to make sure the customer gets what they want, and you can therefore sell more Pop-Tarts during that period. That’s a fascinating look at what big data means to people. Not long to go now. We can take a breather. Nice picture of a cat there. I look around; my cat’s not actually here at the moment, which is a shame, because I could have then shown you my cat also sleeping. I don’t think there’s anything more peaceful than a sleeping cat. Often I look around when I’m working, look behind me, and there’s my cat Thor sleeping away on the mat. Just take a few seconds to breathe out. What have we done so far? I hope you haven’t nodded off in the last 10 minutes.
[01:10:17] – Colin Knight, Presenter
We explained big data, what it is, how big it is, and how mainframes fit into the process. Obviously, mainframes are a key system of record, so they should be in big data. It’s an element that’s critical. We looked at an overview of how Db2 operates in NatWest, and who am I? We also looked at how we make mainframe data available for big data in NatWest, and tools of the trade like CDC, Q Rep, IDAA, and there are others as well, unloads, HPU, all of that. We also tried to look into the future. We want virtual. We don’t want to have to move and copy data around over and over again. We want a hybrid cloud solution, and AI is going to play a part, and maybe quantum computers as well. We looked at problems, issues, pain points along the way. There’s always pain with gain as well. I was going to say without gain, but no, we definitely have plenty of gain with providing data into big data environments from the mainframe, but there are also pain points along the way that we explained. The question: can I stop copying data off the mainframe?
[01:11:36] – Colin Knight, Presenter
I hope so. Then we spent a little bit of time looking at a puzzle around big data and Pop-Tarts. That’s it. Time’s up. I think I’ve got just about one more slide. If you haven’t got any questions right now, by all means, send me an email. That’s my email address, raleighnight@natwest.com. That’s my 60-year-old cat, Thor, who recently had an operation, so his paw was bandaged up. He’s now feeling a lot better, minus one toe, which needed to be removed, but he’s coping well and he’s enjoying life still and sleeping away the afternoons as usual. And he’s certainly helped me with this presentation. So, yes, thanks very much. I’ll open it up to any questions. I’ll switch back for any questions. Okay, I’m not seeing any questions, Amanda.
[01:12:49] – Amanda Hendley, Host
I had dropped one in earlier. I don’t know if you can see.
[01:13:02] – Colin Knight, Presenter
I think, replication-wise, IDAA has got to be the best way of replicating data because it’s built into Db2 at the base level. I’d always suggest that’s the best one. But obviously, to use IDAA replication, you first have to have an IDAA, which is not free. If, leaving that aside, you haven’t got an IDAA, then CDC does the job, especially if you’re moving to targets like Kafka. Q Rep is producing replication to Db2, so it also depends what you want to do. But certainly, if you’re going to Kafka, CDC is a really good solution, probably the best solution for that. For other sources, Q Rep and taking the data out of the target. Nothing’s really simple, though. Most of them have got good performance, I’d say, if you tune them right, but it’s never simple. You never… Mark. Hi, Mark. You never warm up your Pop-Tarts. That answers the question. It was in my head. I was thinking, why would you buy Pop-Tarts when you’ve got no power? Good. That answers that. Thank you.
[01:14:21] – Amanda Hendley, Host
I think warming up your Pop-Tarts is a dangerous slope, because if you warm them up too much, you’ve got something inside that’s red hot. Yeah. No, I think they’re great out of the box. But only the real Pop-Tarts, not the organic ones. They are definitely missing something when they don’t have the same amount of sugar and preservatives.
[01:14:45] – Colin Knight, Presenter
They’re better for you, but they don’t taste quite so good. Yes. Exactly.
[01:14:52] – Amanda Hendley, Host
I was curious to know, Colin, what do you think about, or do you think that virtualization and AI tools are going to finally reduce the need for us to copy data off the mainframe? Where do you see AI?
[01:15:10] – Colin Knight, Presenter
I hope so. I hope that we can really get the AI functioning. Like z17, where you can accelerate AI on the chip. SQL DI should be answering some of the questions that you’ve got for analytical queries. You’ve probably only got about four functions in SQL DI. Even then, you can get some good answers, and I’ve seen some really good business cases, but it’s still not quite where you want it just yet. I think AI is going to be a big part of the whole picture, as with everything else in IT. I think you’re going to want to see the data as it really is, quite often, with enough virtual tools so that you transform the data in situ, in cache, but you’re almost touching it within the mainframe. You’ve got that hybrid view and you’ve got that holistic view, and that’s where all your ducks are lined up, I think. By copying data out, by replicating data, you’re instantly creating a number of problems for yourself, and managing those tools as well is not easy.
[01:16:33] – Amanda Hendley, Host
Great. Any other questions?
[01:16:43] – Colin Knight, Presenter
All right.
[01:16:48] – Amanda Hendley, Host
We’ll see if any others pop in. But Colin, I want to thank you for your presentation today. Obviously, a great topic and really important for us to be thinking about, especially as more and more data is produced, as you talked about. I see a question from Mark just popped in.
[01:17:09] – Colin Knight, Presenter
Oh. Yeah. Right. Really, I guess, in my experience, we wanted to use CDC because the target was Kafka. I think that was what pushed us to use CDC, because Q Replication wasn’t really going to do the job. There’d still be something else needed to get to Kafka. In real-time, that’s a real headache, whereas CDC will take your data from Db2 straight out to Kafka, and then you can subscribe to the topics in Kafka, and you’ve suddenly got application data at the target. Q Replication is only going to be part of the story, though, isn’t it? Because at the end of the day, you’ve still got a target table that you then have to take data out of, again, to copy it or to transform it or to do something with it. The IDAA, yeah, the IDAA is really nice. It would be nice to see more use of an IDAA to get your analytical queries through for big data. And that should be happening in a lot of places, I think. But it’s a significant outlay to buy an IDAA. I think that’s part of the problem. Yeah. Okay, it’s gone quiet, I think.
[01:18:47] – Amanda Hendley, Host
All right, I’m going to take my screen share back. Let’s see. Great. Well, Colin, again, thank you for your presentation today. Before we depart and I invite you to the next session, I just wanted to give you a couple of QR codes. These are all things also searchable on PlanetMainframe.com. But we did Db2 month earlier this year in June, in alignment with IDUG North America. I’ve done a QR code over to the recap article that we did, but you can just search Db2 on PlanetMainframe.com, and this stuff will pop up. Then we started releasing some content from Cheryl Watson’s Tuning Letter. Just a slow trickle. It’s not the brand new stuff, that’s subscription only, but it is from Cheryl Watson’s Tuning Letter. Those pieces are being posted on PlanetMainframe.com occasionally. There’s one about buffer pools I pulled. Then our YouTube channel. If you haven’t been to our YouTube channel, we post these videos. Planet Mainframe will also be rolling out a new YouTube channel of all of our content, including interviews and podcasts and everything that we’ve done. All of that’s available for you to check out. Let me just make sure no questions are coming in.
[01:20:30] – Amanda Hendley, Host
But lastly, I will mention that our next session is on Db2 12 and 13, SQL and SQL enhancements. We’re doing an update session, November 18th. I hope to see you there. That’s our next session. Thank you again, Colin. Thank you, everyone, for joining us today. All right, you all have a good one.
[01:20:57] – Colin Knight, Presenter
Thank you. Cheers, everyone. Have a good afternoon or evening. Thank you, Colin.