You’ve heard people talking about it and finally your organization has decided that there is a business case for going down that route, but what does big data involve? What do you need to know? After all, mainframes have always been able to store large amounts of data, and, importantly, retrieve that information quickly. Isn’t that what IMS databases were designed to do? Firstly, when you start Googling like mad to find out some information, you need to search for big analytics and perhaps even big workflows as well as just big data. But before you do that, let’s run over some basics.
There was a time when people filled in paper-based forms and someone else would then enter that information onto the computer, and then some application or transaction would run against that data. Nowadays, things can’t work that slowly. Not only are people able to enter their own information straight into the computer, but information is also coming from the Internet of Things (IoT), including sources such as online thermometers, credit card machines, and CCTV footage. The amount of data that is being made available is huge – and somewhere inside all that data is some really useful information that could make an organization more profitable and more successful.
Apart from the sheer volume of data, another problem that has to be faced is the variety of formats that the data can be in. These two factors impact how quickly information can be retrieved. And, of course, just getting a result isn’t the only thing – it has to be accurate and relevant information!
The other thing you need to realize about big data is that it can’t be stored natively on a mainframe – it has to be stored on Linux on z Systems (or you might have a LinuxONE mainframe). The most popular tools to use for big data are Hadoop, HBase, and MapReduce. The Hadoop Distributed File System (HDFS) is, as its name suggests, a file system. Data in a Hadoop cluster gets broken down into smaller pieces called blocks, and these are distributed throughout the cluster. Any work on the data can then be performed on manageable pieces rather than on the whole mass of data. HBase is an open source, non-relational, distributed database modelled after Google’s Bigtable. It’s a column-oriented database management system (DBMS) that runs on top of HDFS. HBase itself is written in Java, and so are HBase applications. MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
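To make the block-and-MapReduce idea a little more concrete, here is a minimal sketch in plain Python. It is not Hadoop code: the block size, the helper names (`split_into_blocks`, `map_block`, `reduce_counts`, `word_count`), and the sample lines are all made up for illustration, and everything runs on one machine. It simply shows the shape of the approach: split the data into blocks, work on each block independently, then merge the partial results.

```python
from collections import Counter
from functools import reduce

# Illustrative only: real HDFS blocks are sized in bytes, not lines.
BLOCK_SIZE = 4


def split_into_blocks(lines, block_size=BLOCK_SIZE):
    """Break the input into fixed-size blocks, as a stand-in for HDFS blocks."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def map_block(block):
    """Map step: count the words in one block, independently of the others."""
    counts = Counter()
    for line in block:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial word counts into one."""
    return left + right


def word_count(lines):
    """Split, map each block, then reduce the partial results."""
    blocks = split_into_blocks(lines)
    # On a real cluster, each block could be handled by a different node.
    partials = [map_block(block) for block in blocks]
    return reduce(reduce_counts, partials, Counter())


# Hypothetical sample data.
SAMPLE_LINES = [
    "big data on the mainframe",
    "big analytics and big workflows",
    "data in blocks",
    "blocks across the cluster",
    "the cluster does the work",
]
TOTALS = word_count(SAMPLE_LINES)
```

Because each call to `map_block` only ever sees its own block, the map step could be spread across many machines, which is the point of working on manageable pieces rather than on the whole mass of data.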
Apache Flume is a system for collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store. Sqoop transfers bulk data between Hadoop and structured datastores such as relational databases. Developers like NoSQL databases because they can store and retrieve data without being locked into the tabular relationships used in relational databases. NoSQL databases make scaling easier and, for some workloads, can provide better performance. Large volumes of structured, semi-structured, and unstructured data can be stored using NoSQL. An example of a NoSQL database you may have heard of is MongoDB.
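As a rough illustration of what "not being locked into tabular relationships" means, the plain-Python sketch below keeps a small collection of documents in which each record carries whatever fields it needs. It does not use MongoDB or any other NoSQL product: the `readings` data and the `find` helper are hypothetical stand-ins written only for this example.

```python
# A schema-less "collection": each document is a dict and may carry different fields.
readings = [
    {"_id": 1, "source": "thermometer", "celsius": 21.5},
    {"_id": 2, "source": "card_machine", "amount": 42.0, "currency": "GBP"},
    {"_id": 3, "source": "thermometer", "celsius": 19.0, "location": "warehouse"},
]


def find(collection, **criteria):
    """Return the documents whose fields match every key/value pair in criteria.

    Hypothetical helper for illustration; not the API of any real database.
    """
    return [
        doc for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


# Query by a field that only some documents have; no table schema is needed.
thermometer_docs = find(readings, source="thermometer")
```

In a relational table, the card machine's `amount` and the thermometer's `celsius` would either need separate tables or a lot of empty columns. Here, each document simply omits the fields it doesn't have.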
When it comes to big analytics, it’s worth mentioning Spark. Spark is an open source distributed cluster-computing framework and data processing engine, and it comes from Apache. It’s a popular alternative to Hadoop’s MapReduce – in fact, for some applications, it can be up to 100 times faster than MapReduce. Spark uses a cluster manager and a distributed storage system. For cluster management, Spark supports a native Spark cluster, Hadoop YARN, or Apache Mesos. For distributed storage, Spark interfaces with HDFS, Cassandra, OpenStack Swift, and Amazon S3. Developers can write applications in Java, Scala, or Python.
Spark Core provides distributed task dispatching, scheduling, and basic I/O functionality. Its fundamental programming abstraction is the Resilient Distributed Dataset (RDD), which is a logical collection of data partitioned across machines. RDDs can be created by referencing datasets in external storage systems, or by applying coarse-grained transformations (e.g. map, filter, reduce, join) to existing RDDs. The RDD abstraction is exposed through a language-integrated API in Java, Python, and Scala that is similar to local, in-process collections. This makes programming easier because applications manipulate RDDs in much the same way as they manipulate local collections of data. The fact that Spark is so fast and so flexible is probably why IBM is so keen on using it and why so many people have contributed to it.
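The sketch below is a toy, single-machine imitation of that idea in plain Python, to show why the programming style feels like working with a local collection. It is not Spark and does not use the Spark API: the `PartitionedDataset` class and its methods are hypothetical, the "machines" are just Python lists, and the work is done eagerly rather than being scheduled across a cluster. What it does show is a dataset split into partitions, with map and filter applied to every partition and reduce combining per-partition results.

```python
from functools import reduce


class PartitionedDataset:
    """Toy stand-in for an RDD: a collection of items split into partitions."""

    def __init__(self, partitions):
        self._partitions = [list(p) for p in partitions]

    @classmethod
    def from_list(cls, items, num_partitions=3):
        """Deal the items out round-robin across num_partitions partitions."""
        items = list(items)
        return cls([items[i::num_partitions] for i in range(num_partitions)])

    def map(self, fn):
        """Apply fn to every item, partition by partition; returns a new dataset."""
        return PartitionedDataset([[fn(x) for x in p] for p in self._partitions])

    def filter(self, predicate):
        """Keep only items for which predicate is true; returns a new dataset."""
        return PartitionedDataset(
            [[x for x in p if predicate(x)] for p in self._partitions]
        )

    def reduce(self, fn):
        """Reduce each non-empty partition, then combine the partial results."""
        partials = [reduce(fn, p) for p in self._partitions if p]
        return reduce(fn, partials)

    def collect(self):
        """Gather all items back into one local list."""
        return [x for p in self._partitions for x in p]


# Hypothetical temperature readings in Celsius.
temps = PartitionedDataset.from_list([18.0, 21.5, 25.0, 30.5, 16.0, 27.0])
warm_fahrenheit = temps.filter(lambda c: c > 20).map(lambda c: c * 9 / 5 + 32)
warmest = warm_fahrenheit.reduce(max)
```

Each transformation returns a new dataset instead of changing the old one, so calls can be chained just as they would be on an ordinary list. In PySpark, the corresponding RDD methods are also named `map`, `filter`, `reduce`, and `collect`.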
The first wave of adopters has been using big data for a while, and their experience has shown that organizations can use it to gain a business advantage over their competitors. Big data is more than just SlideWare (software that only appears on PowerPoint slides), and more than just something that techie people want to have a play with. The software is getting better – it is being developed by users as well as by large software vendors – and it will run on mainframe Linux partitions. So now is the time for organizations that plan to still be in business in ten years’ time to start to make the most of big data technology.
Regular Planet Mainframe Blog Contributor
Trevor Eddolls is CEO at iTech-Ed Ltd, and an IBM Champion for eight years running. He currently chairs the Virtual IMS, Virtual CICS and Virtual Db2 user groups, and is featured in many blogs. He is also editorial director for the Arcati Mainframe Yearbook.