When asked what is the single most important or stressful aspect of their job, DBAs typically respond “assuring optimal performance.” Indeed, a Forrester Research survey of critical DBA concerns indicates that performance and troubleshooting tops the list of most challenging DBA tasks. With this in mind, let’s take a moment to outline the basic factors that influence the performance of DB2 applications.
Even though a proper investigation of DB2 performance issues should probably begin with database design, let’s start off with a discussion of SQL because it impacts more users. Everyone who needs to access or modify data in a DB2 database will use SQL to accomplish the task.
As you write SQL statements to access DB2 data, there are a few simple but important rules to follow to encourage efficient SQL. Of course, SQL performance is a complex topic, and understanding every nuance of how SQL performs can take a lifetime. That said, adhering to just a few simple rules can put you on the right track to achieving high-performing DB2 applications.
The first rule is to always provide only the exact columns that you need to retrieve in the SELECT-list of each SQL SELECT statement. If you only need three columns, why ask for more? Every column that you request must be accessed and moved by DB2 from database storage to your program.
Most DB2 developers have heard the standard advice: “Do not use SELECT *”. This is a common standard in many shops and it is a good one… but it does not go far enough. For those who do not know, SELECT * is a shorthand means of telling DB2 to retrieve all of the columns from the table(s) being accessed. It can save some time for the SQL coder, but it is not a good idea to specify SELECT * in production applications because:
- DB2 tables may need to be changed in the future to include additional columns. SELECT * will retrieve those new columns too, and your program may not be capable of handling the additional data without requiring time-consuming changes. If instead you simply specified only the columns you need, you can add columns whenever you like without impacting production applications.
- DB2 will consume additional resources for every column that you request to be returned. If the program does not need the data, it should not ask for it. Even if the program needs every column, it is better to explicitly code each column by name in the SQL statement for clarity and to avoid the previous pitfall.
So SELECT * is fine for quick and dirty queries, but using it is a bad practice for inclusion in application programs.
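For example, if a program needs only the employee number, last name, and salary, name those columns explicitly. In this sketch, the EMP table matches the examples later in this article, while the WORKDEPT column is simply assumed for illustration:

-- Wasteful: returns every column, including any added later
SELECT *
FROM EMP
WHERE WORKDEPT = 'D11';

-- Better: ask only for the columns the program actually uses
SELECT EMPNO, LASTNAME, SALARY
FROM EMP
WHERE WORKDEPT = 'D11';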
A second rule to keep in mind is that you should not code your SQL to ask for things you already know. This may seem like simple advice that is easy to heed, but most programmers violate this rule at one time or another. For a typical example, consider the following SQL statement:
SELECT EMPNO, LASTNAME, SALARY
FROM EMP
WHERE EMPNO = '000010';
At first glance this SQL looks fine. It is very simple, and with an index on the EMPNO column the query should perform well. But it is asking for information that you already know and it should be recoded.
The problem is that EMPNO is included in the SELECT-list. You already know that EMPNO will be equal to the value '000010' because that is what the WHERE clause tells DB2 to look for. There is no possible way for DB2 to return an employee with any other number. But with EMPNO listed in the SELECT-list, DB2 will dutifully retrieve that column too, incurring additional overhead and degrading performance. The overhead may be minimal, but if the same SQL statement is run hundreds, thousands, or even millions of times a day, that minimal performance impact can add up to a significant one.
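Recoded to stop asking for what is already known, the query needs to return only the last name and salary:

SELECT LASTNAME, SALARY
FROM EMP
WHERE EMPNO = '000010';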
Another guiding principle, which you can take as the third rule, is to access DB2 data as data in a relational database, not as records in flat files. A common rookie mistake, especially among longtime mainframe programmers, is failing to think relationally.
For example, you should always use the WHERE clause to filter data in the SQL instead of bringing it into your program and filtering it with IF-THEN-ELSE statements. From a performance perspective, it is much better for DB2 to filter the data before returning it to your program. This is so because DB2 uses additional I/O and CPU resources to obtain each row of data. And the fewer rows passed to your program, the more efficient your SQL will be. So, the following SQL
SELECT EMPNO, LASTNAME, SALARY
FROM EMP
WHERE SALARY > 50000.00;
…is better than reading all of the data without the WHERE clause and then checking each row in your program to see whether SALARY is greater than 50000.00. Of course, this is a simple example. It is typically more complex SQL statements, with multiple predicates and join clauses, that cause programmers to switch to filtering in their COBOL or Java code instead of tinkering with an already complex SQL statement. But the proper approach is to filter in the SQL, not in your program.
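Even when the criteria grow more complicated, the filtering still belongs in the WHERE clause. Here is a sketch with multiple predicates; the WORKDEPT and JOB columns are assumed for illustration and are not part of the earlier example:

SELECT EMPNO, LASTNAME, SALARY
FROM EMP
WHERE SALARY > 50000.00
  AND WORKDEPT = 'D11'      -- filter on department in the SQL, not in the program
  AND JOB <> 'MANAGER';     -- exclude managers here rather than with IF logic

Every predicate coded here is one less IF-THEN-ELSE test in the program, and fewer rows shipped from DB2 to the application.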
Another way that programmers write SQL to access DB2 tables as if they were flat files is to avoid joins. DB2 is not designed to mimic the old master-file processing tactics used with QSAM files, by which I mean reading a record from one file and then using a value from that record to drive reads against another file. Some DB2 programmers mimic this type of processing with two cursors: one to read a row, and a second driven by a value from that row. This is a recipe for poor performance. Instead, code the SQL as a join and let DB2 do the work for you.
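For example, instead of one cursor on EMP and a second cursor on a department table driven by each employee's department value, code a single join. This sketch assumes a DEPT table with DEPTNO and DEPTNAME columns and a WORKDEPT column on EMP; substitute your own table and column names:

-- One join instead of two nested cursors
SELECT E.EMPNO, E.LASTNAME, D.DEPTNAME
FROM EMP E
INNER JOIN DEPT D
   ON E.WORKDEPT = D.DEPTNO
WHERE E.SALARY > 50000.00;

With the join coded in SQL, DB2 is free to choose the most efficient join method and access path instead of being locked into row-at-a-time processing by the program.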
And finally, the fourth rule to keep in mind is that you should put as much work as possible into the SQL and let DB2 optimize the access paths for you. With appropriate statistics and proper SQL coding, DB2 will almost always formulate a more efficient access path to the data than you can code into your programs. Yes, the SQL will be more complex, and you need to factor that into your decision-making process, but from a performance perspective, let SQL do the work!
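As one last illustration of letting SQL do the work, have DB2 aggregate the data rather than fetching every row and totaling it in your program. Again, the WORKDEPT column is assumed here for illustration:

-- Let DB2 summarize the data instead of the application
SELECT WORKDEPT, AVG(SALARY) AS AVG_SALARY, MAX(SALARY) AS MAX_SALARY
FROM EMP
GROUP BY WORKDEPT;

One statement returns a handful of summary rows instead of thousands of detail rows for the program to crunch.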
Conclusion
The rules and ideas in this article can be used to create a set of philosophical guiding principles for writing SQL to access DB2 data. No, I cannot guarantee that you will not have any performance problems if you follow them, but I can unequivocally assure you that you will be minimizing self-inflicted problems.
So these rules, though they are not the be-all / end-all of SQL performance tuning, can set you up on the right path. Additional, in-depth tuning will likely be required at some point. But following the above rules will ensure that you are not making “rookie” mistakes that can kill the performance of your DB2 applications.
Regular Planet Mainframe Blog Contributor
Craig Mullins is President & Principal Consultant of Mullins Consulting, Inc., and the publisher/editor of The Database Site. Craig also writes for many popular IT and database journals and web sites, and is a frequent speaker on database issues at IT conferences. He has been named by IBM as a Gold Consultant and an Information Champion. He was recently named one of the Top 200 Thought Leaders in Big Data & Analytics by AnalyticsWeek magazine.
Craig, thank you for this interesting article. I had implemented most of your performance rules in my earlier product MANASYS so that programmers would just write
COPY TableName;
PROCESS TableName WHERE (condition);
…
and if TableName were an SQL table then MANASYS would figure out which fields were actually used in the program and generate a SELECT with an appropriate data list. Of course, if TableName were a VSAM record, it would get the whole record and use a combination of keyed access and filtering to implement the WHERE. However, I'd never thought of your second rule: not asking for things that we already know.
The implementation of SQL in the new product, MANASYS Jazz (see http://www.jazzsoftware.co.nz), will follow the same principles, but I'll include this extra rule where possible. Thank you for bringing it to my attention.
Robert Barnes,
CEO, Jazz Software Ltd,
Auckland, New Zealand