Covering Indexes in MySQL...

This is going to be a quick-and-dirty discussion on the importance of covering indexes in MySQL.  I was recently hired to steward a fairly complex LAMP system.  It relies heavily on MySQL as the database engine, with PHP driving most of the business logic behind a Smarty template engine.  So far, by adjusting tunable parameters within MySQL, I've been able to improve response times (world-wide page load averages) by about 25%.  Now it's time to start ploughing through the slow-query logs, since the db server is stable under current loads and we've established, through metrics, a baseline of performance.

A db index -- or indexing, the act of adding one or more indexes to a table or tables to improve data access and retrieval -- is often seen as something akin to burning incense and killing goats while scrying the fortune of the gods.  As with all things performance-related in MySQL, it requires a continuous cycle of minute adjustment, measurement, and evaluation.

We're going to talk about covering indexes today -- (here's a good overview of indexes, courtesy of Wikipedia -- good to know, or good review!) -- which are indexes that contain all of the data a query needs; in other words, the data is stored with the index and is returned by accessing the index alone, without doing a table look-up or scan.
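As a toy illustration (the table and column names here are invented, not from the system in question), a query whose columns all live in a single index never has to touch the table rows at all:

[cc lang='mysql' line_numbers='false']
-- hypothetical example table
CREATE TABLE widget (
  id INT NOT NULL PRIMARY KEY,
  color VARCHAR(16),
  weight INT,
  KEY idx_color_weight (color, weight)
) ENGINE=InnoDB;

-- both columns referenced below live in idx_color_weight, so EXPLAIN
-- reports "Using index" -- the covering index answers the query on its own
SELECT weight FROM widget WHERE color = 'blue';
[/cc]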

In my slow-query log, where any query requiring more than 1 second of CPU time to execute is flogged and logged, I saw a query that was part of the login process appearing up to several times per second.  The horrifying thing about this particular query, as opposed to all the other horrors exposed in the slow-query log (and, trust me, if you've never delved into the slow-query log on your production machine as the primary developer for your project, bring a barf-bucket, as it's not pretty...), was that it was firing off several times per second and each invocation was taking between 20 and 40 seconds to complete.

Did I mention that this query was part of the login process for our users?

Horrifying.
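For the record, a slow-query log like the one that surfaced all of this is typically switched on with server settings along these lines (a sketch, not our exact production configuration):

[cc lang='mysql' line_numbers='false']
-- enable the slow-query log and flag anything over 1 second
SET GLOBAL slow_query_log = 1;
SET GLOBAL long_query_time = 1;
-- optionally, also log queries that don't use an index at all
SET GLOBAL log_queries_not_using_indexes = 1;
[/cc]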

The query was a complex one that joined two tables: it accessed several fields across both tables in the WHERE clause, used only a single key join between the two, and finished with an ORDER BY.  The EXPLAIN for this particular query (sorry, I didn't screen-shot anything!) showed it examining about 330,000 rows -- using the indexes, the WHERE clause, a filesort, and a temporary table -- and on my sandbox (a much slower-than-production laptop) it ran in about 19 seconds.  The query basically looked something like this:

[cc lang='mysql' line_numbers='false']

select c.c1, c.c2, c.c3, cp.c1, cp.c2
from customer_table c
inner join customer_profile_table cp on (c.pkey = cp.key1 and cp.pkey = v5)
where c.key = v1
  and c.c4 in (v2, v3)
  and c.c5 in (v4)
  and c.c6 = v5
  and cp.c3 = v6
order by c.c1 DESC limit 600

[/cc]

I realize this query is hard to read -- I don't want to give any real-world examples because (a) I'm under NDA, and (b) I'm at home and not at work while typing this, so I'm coasting off my short-term memory, which is, in itself, terrifying... just bear with me and all will be made clear, I promise.  Stare at that query until it makes sense, please.  (v = value, c = column, etc.)

Now, to give credit, by itself this query isn't so bad.  If I had been the author of this query, I probably would have run it in my environment, using test data, and seen acceptable response times for the dataset, the environment, and the system load.  In other words, probably about as far from a real-world production environment as you can get.  So, it ran just fine!

Another thing to note is that this query is dynamically generated by the PHP business logic based on real-time input from the system.  The IN(x,y) parts of the query can contain a variable number of elements depending on customer account type and activity.  So the low-hanging fruit of eliminating the IN(x,y) parts of the query isn't necessarily the way we want to go.  We can improve this query by paying attention to a few basic rules, the table DDL, and plain old common sense.  We don't even have to pull any rabbits out of any hats.

First off, let's look at the DDL (data definition language) for customer_table (c), where we immediately notice that:

  1. One of the referenced columns in the where clause is not indexed (c6)
  2. The remaining columns are indexed
  3. The primary key is a composite key of (pkey, key1)

When we examine the DDL of customer_profile_table (cp) -- a simplified sketch of both tables' DDL follows this list -- we notice that:

  1. All of the columns referenced in the WHERE clause are indexed
  2. The primary key is a single key (pkey)
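
For orientation, here's a heavily simplified sketch of what the two tables' DDL looked like -- placeholder names and types matching the query above, not the actual production schema:

[cc lang='mysql' line_numbers='false']
-- placeholder DDL, trimmed to just the columns and keys discussed here
CREATE TABLE customer_table (
  pkey INT NOT NULL,
  key1 INT NOT NULL,
  `key` INT NOT NULL,
  c1 INT, c2 INT, c3 INT, c4 INT, c5 INT, c6 INT,
  PRIMARY KEY (pkey, key1),
  KEY idx_key (`key`),
  KEY idx_c4 (c4),
  KEY idx_c5 (c5)
  -- note: no index on c6
) ENGINE=InnoDB;

CREATE TABLE customer_profile_table (
  pkey INT NOT NULL,
  key1 INT NOT NULL,
  c1 INT, c2 INT, c3 INT,
  PRIMARY KEY (pkey),
  KEY idx_key1 (key1),
  KEY idx_c3 (c3)
) ENGINE=InnoDB;
[/cc]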

The EXPLAIN output is concerning -- we're looking at about 330,000 rows to satisfy the query.  We are using the indexed columns as search criteria, and it's coming back as type=ref, which isn't optimal, but is still good.
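Roughly the shape of what the EXPLAIN showed (reconstructed from memory with placeholder index names, not a verbatim capture):

[cc lang='mysql' line_numbers='false']
-- approximate reconstruction of the EXPLAIN output, not a verbatim capture
-- table | type | key      | rows     | Extra
-- c     | ref  | idx_key  | ~330000  | Using where; Using temporary; Using filesort
-- cp    | ref  | idx_key1 | (small)  | Using where
[/cc]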

The ORDER BY is killing us.  When I remove the ORDER BY from the query, the query speed immediately improves by several orders of magnitude.  When I look at the EXPLAIN output to see why, it shows that tmp tables are no longer being used; the filesorts are happening in memory.

But I have to have the ORDER BY -- without delving deeper into the code (analyzing the business logic could mean an extensive code re-write, and we're only after a quick win here) and spending hours, if not days, on the issue -- so how can I avoid the tmp-table filesort?

Time to look at the indexes.

In my customer_table (c), I am accessing the table based on the following conditionals:

  1. c.pkey (join) = cp.key1
  2. c.key (where) = value
  3. c.c4 (where) in (values)
  4. c.c5 (where) in (values)
  5. c.c6 (where) = value

Note that we're not at all concerned with the columns from (c) that appear in the SELECT part of the query (c1, c2, c3).  For indexing, we need to make sure that all of the columns in the where clause are indexed, as a general rule.

When we look at the DDL for the table, we see that we're missing an index on c6.  The first reaction is to run out and alter the table, adding that index to make the query run faster.  In reality, it will probably make performance worse for this and other queries.
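To be concrete, that knee-jerk fix would be something like the following (index name invented) -- and it's the approach we're about to argue against:

[cc lang='mysql' line_numbers='false']
-- the reflexive fix: yet another standalone single-column index
ALTER TABLE customer_table ADD INDEX idx_c6 (c6);
[/cc]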

You have to keep in mind that we're dealing with InnoDB tables -- the generality with these tables is that data is stored with the index (the table itself is organized as a clustered index on the primary key, and every secondary index carries the primary key along with it).  Ergo, the more indexes you create on a table, the more overhead you're introducing during maintenance.  Inserts, updates, and deletes become, in terms of speed, access, and maintenance, a function of the number of indexes you have.

Additionally, the more indexes you have, the more "pieces" MySQL has to merge when it's building your resulting data set.  In our example, we've got five columns referenced in the WHERE or JOIN clauses: one of which isn't indexed, and two of which belong to composite indexes but are referenced in different parts of the query.  When the MySQL engine is building the dataset for an indexed field, it traverses that index (say, c.c4 in (v2, v3)) and pulls the matching rows out of the index.  When it has all the separate datasets, it has to do a merge between them to satisfy the search requirements.

So, depending on the size of your table (in this case, obviously many hundreds of thousands of rows) and the number of indexed columns you're joining on, you'll eventually end up out on tmp disk space (because you've exhausted the memory allocated to your application instance slice) to do the sort-merge and, unless you're on SSD devices, hitting physical disk is going to noticeably impact your performance.

The take-away here is that, for every index you're using, MySQL sort-of treats it like an individual table -- since the data is stored in the index, the engine can retrieve the data it needs by accessing an index, and it's then left to the optimizer to merge the results of all those index retrievals.  Starting to see why we're hitting disk space and filesort merges?

Instead of adding another index to the table, which will only increase the complexity, the trick here is to add a covering index, which is basically one ginormous composite index containing all of the columns specified in the query's JOIN and WHERE conditions.  In other words, we add a new index (pkey, key, c4, c5, c6) to the schema.  (Pay attention to the order of the columns as they appear in the query -- MySQL matches on the left-most prefix of the index, so the columns your query constrains need to form a prefix, from left to right, of the index as it was declared.)
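In DDL terms, that's something like the following (the index name is mine, and on a table this size you'd want to run the ALTER during a quiet window):

[cc lang='mysql' line_numbers='false']
-- one composite index covering every column the JOIN and WHERE touch
ALTER TABLE customer_table
  ADD INDEX idx_cover_login (pkey, `key`, c4, c5, c6);
[/cc]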

What's nice about covering indexes is that, later on, if I have a query that accesses (pkey, key, c4), I don't need to add another index -- the covering index I just created satisfies the conditions for that query and will be used as well.  In the event that a query can't use a left-most prefix of the covering index (say it only constrains c5), it's acceptable practice to add that column as its own index.  You'll then use the covering index, plus the second index, in the data merge -- which is still better than doing a data merge across five, six, seven, or more indexes.
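A couple of hypothetical queries to illustrate the left-most prefix rule against that new index:

[cc lang='mysql' line_numbers='false']
-- can use idx_cover_login: (pkey, key, c4) is a left-most prefix of the index
SELECT c1 FROM customer_table WHERE pkey = 1 AND `key` = 2 AND c4 = 3;

-- cannot seek on it: c5 skips the leading columns, so a query like this
-- would want its own single-column index on c5
SELECT c1 FROM customer_table WHERE c5 = 4;
[/cc]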

I apply the same logic to the second table, customer_profile_table (cp).  I create a covering index based on (pkey, key1, c3).  Note also that I'm going to tweak the query from:

[cc lang='mysql' line_numbers='false']inner join on (c.pkey = cp.key1 and cp.pkey = v5)[/cc]

to

[cc lang='mysql' line_numbers='false']inner join on (cp.pkey = v5 and c.pkey = cp.key1)[/cc]

to ensure that the access order on cp is consistent with the index.  Last, I'm going to leave the ORDER BY logic in the query.  Normally, for performance's sake, you'd want to avoid ORDER BY -- it's one of the cardinal sins of query performance.  Let the processing language do the sort, in memory, on the returned dataset where possible.
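Putting it all together, the second covering index and the slightly re-ordered query look roughly like this (placeholder names again):

[cc lang='mysql' line_numbers='false']
-- covering index on the profile table
ALTER TABLE customer_profile_table
  ADD INDEX idx_cover_profile (pkey, key1, c3);

-- the revised query, with the join conditions re-ordered to match the index
SELECT c.c1, c.c2, c.c3, cp.c1, cp.c2
FROM customer_table c
INNER JOIN customer_profile_table cp
  ON (cp.pkey = v5 AND c.pkey = cp.key1)
WHERE c.`key` = v1
  AND c.c4 IN (v2, v3)
  AND c.c5 IN (v4)
  AND c.c6 = v5
  AND cp.c3 = v6
ORDER BY c.c1 DESC
LIMIT 600;
[/cc]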

When I EXPLAIN the slightly-modified query with the new indexing, I see that we've reduced the number of rows examined to about 22,000 and, while we're still doing a disk-based filesort/merge, the query now measures in the 0.19-second range.  A drastic improvement over the original query!

I make the changes to the production tables and spend the next several days monitoring the slow-query log and, during this time, I never see the query appear again.  In addition, I see that selects on the system rise by about 11%, meaning I've increased system throughput as a result.  My metrics for disk writes and temporary tables drop to essentially zero per second and, overall, I get a more responsive feel from the system.

I still have to wait a week or two for Google Analytics to report back on the change, but what I've seen so far has been very informative, instructive, and convincing.

I hope you find this the same.