“Big data” and the application of analytics to large quantities of data are a persistent trend these days in just about every organization. The general idea is that large amounts of data, from multiple sources and of multiple types, can be analyzed to produce heretofore unknown insights about your business.
But what exactly is big data? One of the most commonly cited definitions comes from Forrester Research, which describes big data in terms of “the four V’s”: volume, velocity, variety, and variability. The first V, volume, is the obvious one, right? For “data” to be “big” you have to have a lot of it, and most of us do in some form or another. A recent study published by IDC projects that the volume of data under management by the year 2020 will be 44 times greater than what was managed in 2009.
But volume is only the first dimension of the big data challenge; the others are velocity, variety, and variability. Velocity refers to the increasing speed at which data arrives in our systems, along with the growing number and frequency of business transactions being conducted. Variety refers to the expanding range of data types and formats being managed, both structured and unstructured, as opposed to just relational data. And the fourth V, variability, refers to data whose structure and meaning keep changing over time. Others have tried to add more V’s to the big data definition as well; I’ve seen and heard people add verification, value, and veracity to the discussion.
When you mention big data and analytics, the first things many people think of are Hadoop and Spark, and maybe NoSQL. But are these newer technologies required for big data projects? And what about the mainframe?
The Mainframe Does Big Data
Mainframes are not often mentioned in big data articles and advertising. But they should be! The mainframe is the most secure and reliable processor of business transactions for the Fortune 500. Millions of CICS and IMS/TM transactions are processed every second by big businesses and their customers. Every time you book a flight, visit the ATM, or make a purchase just about anywhere, chances are that there is a mainframe behind the scenes making sure the transaction happens accurately and efficiently.
And why is that important to note in an article about big data? Well, according to a recent study by TechTarget, the number one type of data being collected for big data programs is “structured transaction data.” And most of that lives on a mainframe!
The same study also indicates that most organizations are planning to use mainstream relational databases to support their big data environment – more so than Hadoop, NoSQL, or any other type of database or data platform. So traditional RDBMSes, like DB2 for z/OS, can be – and are being – used to drive big data projects. And O’Reilly’s Data Science Salary Survey, conducted at the Strata Conference, indicates that the top tool used by data scientists is SQL on relational databases. Yes, the same SQL that is the meat and potatoes of most data access has not been displaced by other tools for big data analytics – at least not for the most part.
Judging What Is Big
When evaluating your data, what is big for you may differ from what is big for another shop. Remember, it is not just about how big the data is, but also about how different and how rapidly changing it is.
Raw data growth is only part of the story. More data types are being captured, stored, and made available for analysis, and more external data sources are being tapped, too.
Structured data remains the bedrock of the information infrastructure in most organizations, but unstructured data is growing in importance. Unstructured data refers to non-traditional data that does not fit neatly into the rows and columns of a database: text documents, e-mail, images, audio, video, and the like. This “unstructured” data constitutes the bulk of the data out there; analysts at IDC estimate that it accounts for 90% of all digital information.
If you are using a lot of LOBs to store multimedia, or even large text documents, in DB2, then you are working in a potential big data environment!
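To make that concrete, here is a minimal sketch of the kind of DDL involved. The table and column names are hypothetical, and keep in mind that on DB2 for z/OS each LOB column also requires a LOB table space and auxiliary table (which recent versions can create implicitly).

```sql
-- Hypothetical claims table storing large documents and images as LOBs.
-- On DB2 for z/OS, each LOB column also needs a LOB table space and an
-- auxiliary table; recent versions can create these implicitly.
CREATE TABLE CLAIM_DOC
  (CLAIM_ID    INTEGER  NOT NULL PRIMARY KEY,
   CLAIM_DATE  DATE     NOT NULL,
   DOC_TEXT    CLOB(10M),      -- large text document
   DOC_IMAGE   BLOB(100M));    -- scanned image, audio, video, etc.
```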
Relational and DB2 for Big Data?
Sure, it makes sense that relational databases can be used for big data projects. But probably not all of them. Examples where relational may break down or perform poorly include projects requiring a very large number of columns, a flexible schema, relaxed consistency (BASE instead of ACID), or graph processing. But that does not mean the mainframe cannot be used. IBM’s Linux for System z can be used to run Hadoop as well as many types of open source NoSQL database systems. Consider, for example, Veristorm’s zDoop distribution of Hadoop, which runs on Linux for System z.
And traditional z/OS-based software and projects can be “big data” projects, too. The IBM DB2 Analytics Accelerator (IDAA) can be integrated with DB2 for z/OS and used to provide powerful analytics processing. IDAA is a workload-optimized appliance that blends System z and Netezza technologies to deliver mixed-workload performance for complex analytic needs. It can run complex queries up to 2,000 times faster while retaining single-record lookup speed, and it offloads query processing while eliminating costly query tuning. IDAA can deliver high performance for complex business analysis over many rows of data. And it is completely transparent to your application: it looks and behaves just like DB2 for z/OS, only faster.
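As a rough illustration of that transparency, an application (or the system) can steer eligible queries toward the accelerator using the CURRENT QUERY ACCELERATION special register; the table and query below are hypothetical, and the register settings available depend on your DB2 and IDAA levels.

```sql
-- Ask DB2 to route eligible queries to the accelerator (other settings
-- such as NONE and ENABLE WITH FAILBACK exist; check your DB2/IDAA docs).
SET CURRENT QUERY ACCELERATION = ENABLE;

-- A typical long-running analytic query over a hypothetical sales table;
-- DB2 decides whether to offload it to IDAA or run it locally.
SELECT   REGION, PRODUCT_LINE, SUM(SALE_AMT) AS TOTAL_SALES
FROM     SALES_HISTORY
WHERE    SALE_DATE BETWEEN '2013-01-01' AND '2013-12-31'
GROUP BY REGION, PRODUCT_LINE
ORDER BY TOTAL_SALES DESC;
```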
And IBM’s BigInsights Hadoop distribution has integration points with DB2 for z/OS. IBM’s InfoSphere BigInsights connector for DB2 for z/OS provides two user-defined functions, JAQL_SUBMIT and HDFS_READ. These functions enable developers to write JAQL scripts that query JSON objects in the BigInsights environment for analysis.
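The sketch below shows the general shape of how the two functions can be combined from SQL: JAQL_SUBMIT ships a script to BigInsights and returns a reference to the result file, which HDFS_READ then exposes to DB2 as a table. The script, host names, and parameter layout here are assumptions for illustration only; consult the connector documentation for the actual signatures.

```sql
-- Illustrative only: placeholders, not working values. JAQL_SUBMIT runs a
-- JAQL script on BigInsights and returns a reference to its result file;
-- HDFS_READ reads that comma-delimited result back as a SQL table.
SELECT T.CUST_ID, T.CLICK_COUNT
FROM   TABLE(HDFS_READ(
              JAQL_SUBMIT('<your JAQL script here>',
                          '/user/biadmin/results/clicks.csv',
                          'http://biginsights-host:8080',
                          ''),
              0))
       AS T(CUST_ID INTEGER, CLICK_COUNT INTEGER);
```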
From a more traditional DB2 perspective, compression can be used to reduce the amount of storage used by very large databases. The DSN1COMP utility can be used to gauge the percentage of compression that can be achieved for your DB2 table spaces. And as of DB2 9 for z/OS, you can even compress your indexes; sample tests have shown index compression rates of 70 percent or more.
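Both forms of compression are simple DDL options. The object names below are hypothetical; run DSN1COMP against the existing data sets first to estimate the likely savings.

```sql
-- Hypothetical objects: enable dictionary-based row compression for a
-- table space and compression for one of its indexes.
CREATE TABLESPACE TSBIGTXN IN DBBIGDTA
  COMPRESS YES;

CREATE INDEX IXBIGTXN1 ON BIG_TXN (ACCOUNT_ID, TXN_DATE)
  COMPRESS YES;
```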
Partitioning can also be used to spread data across multiple data sets, which might be needed for very large table spaces. Universal table spaces can store up to 128 TB of data, and with partition-by-range table spaces the DBA controls which data goes into which partition. All DBAs should know that balancing the amount of data across partitions is crucial to achieving good performance, not only for programs and transactions but also for utility processing (e.g., LOAD, UNLOAD, REORG, backup, and recovery).
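A minimal sketch of table-controlled range partitioning, with hypothetical names and limit keys, might look like this:

```sql
-- Hypothetical orders table spread across partitions by order date so
-- that data (and utility processing) can be balanced across partitions.
CREATE TABLE ORDER_HIST
  (ORDER_ID    BIGINT       NOT NULL,
   ORDER_DATE  DATE         NOT NULL,
   CUST_ID     INTEGER      NOT NULL,
   ORDER_AMT   DECIMAL(11,2))
  PARTITION BY RANGE (ORDER_DATE)
   (PARTITION 1 ENDING AT ('2012-12-31'),
    PARTITION 2 ENDING AT ('2013-12-31'),
    PARTITION 3 ENDING AT ('2014-12-31'),
    PARTITION 4 ENDING AT (MAXVALUE));
```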
And DB2 is gaining capabilities for ingesting, storing, and processing Big Data, including JSON documents and integration with Hadoop. So you will be able to use DB2 for z/OS along with new big data technologies to drive your analytics projects. But the key here is that the mainframe should be involved in your big data planning efforts!
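As one hedged example of where this is heading, DB2 11 for z/OS can store JSON documents as BSON in a BLOB column and pull individual fields out of them with the JSON_VAL function. The table below is hypothetical, and the conversion functions and result-type codes vary by version, so verify the details against your DB2 level.

```sql
-- Illustrative only: a hypothetical table holding JSON documents as BSON
-- in a BLOB column, queried with JSON_VAL ('s:40' = string of length 40,
-- 'f' = floating point). Check the type codes for your DB2 level.
CREATE TABLE CUST_EVENTS
  (EVENT_ID   INTEGER NOT NULL PRIMARY KEY,
   EVENT_DOC  BLOB(1M));

SELECT JSON_VAL(EVENT_DOC, 'customer.name', 's:40') AS CUST_NAME,
       JSON_VAL(EVENT_DOC, 'event.amount',  'f')    AS AMOUNT
FROM   CUST_EVENTS;
```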
Finally, it is worth noting that mainframe users have been pushing the boundaries of what is acceptable for their DB2 for z/OS environment. This means more data, more users, more transactions, more memory, more, more, more! Although not common, I have seen data sharing groups with more than 20 members, buffer pool sizes over 100GB, and escalating transaction rates, especially for DDF distributed transactions. What is “not common” today will become more commonplace as newer and greater DB2 versions with additional features and functionality are released.
Summary
The bottom line is that the mainframe should be a vital part of your big data projects if they are to deliver value to your organization.
Now don’t get me wrong. I am not saying that you shouldn’t learn the new technologies of big data, such as Hadoop, Spark and the NoSQL offerings. But I am saying that DB2 and the mainframe should absolutely be a major component of your big data planning and infrastructure.
Because, we are, indeed, doing big data on DB2… and the mainframe.
Regular Planet Mainframe Blog Contributor
Craig Mullins is President & Principal Consultant of Mullins Consulting, Inc., and the publisher/editor of The Database Site. Craig also writes for many popular IT and database journals and web sites, and is a frequent speaker on database issues at IT conferences. He has been named by IBM as a Gold Consultant and an Information Champion. He was recently named one of the Top 200 Thought Leaders in Big Data & Analytics by AnalyticsWeek magazine.