
Renaming mongodb Columns

Today I was putzing around in the geo-spatial collection when I noticed I was unhappy with one of the column names within the collection.

In the mySQL world, changing a column name is pretty straight-forward courtesy of the alter table command.
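For the record, the mySQL version is a one-liner (the table name and column type here are hypothetical -- CHANGE makes you re-state the full column definition):

[codesyntax lang="sql" lines="no"]

ALTER TABLE geodata CHANGE CountryID countryID INT NOT NULL;

[/codesyntax]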

Mongo...not so much...

<BEGIN_UNRELATED_SIDE_RANT>

The Mongo documentation is normally the first place most of us go when we're looking for help in using our favorite noSQL database.

Why?

Well...because that's usually where Google directs us to go and also because there just isn't a whole lot of documentation out there on the subject to begin with.

The mongo (10gen) documentation is pretty good.  It's not, however, excellent.  And I can articulate the reason why.

It's pretty easy to identify documentation written by engineers as opposed to documentation written by everyone else (on the planet).  And not because of technical content or the (ab)use of really big and impressive-sounding jargon.

No - it's because most engineering-authored documents are written using a solution-based voice instead of a problem-based voice.

Think about it: when I have to go to the man-page for help, it's because I have a problem.  If I had a solution, I would be writing a blog post.  But since I have a problem, I need the man-pages, online docs, whatever, to help me figure out a solution.

Engineering documents are written from a solution perspective: the document assumes you possess some bit of arcane lore (probably exactly the bit you're missing, which is what caused your trip to the documentation vault in the first place), and everything explained within the document hinges on this piece of knowledge which the author, albeit with the finest of intentions, assumes is already firmly in your mental possession.

And that's why I usually don't like 10gen's documentation.  But, like I said earlier, it's the only game in (Google)town.

<END_UNRELATED_SIDE_RANT>

In mongo, to change the name of a column within a collection, you first have to be on a release of mongodb 1.7.2 or later.  Since most of us bleeding-edge, early-adopter types are all 2.x versioned, this shouldn't be an issue.

This page from 10gen is the update page and, within, talks about the $rename modifier to the update command.  What it doesn't say, because it assumes you want to update records and not schema, is how to apply a change to all of the records in your collection.

In my case, I have a column name which I fat-fingered right out of its camel-case: CountryID instead of countryID.  (And, yes, OCD-peeps, I know that it's not strictly camelCase, thank-you!)  I want to spin through all 3.7 million rows in my collection and rename this column...

[codesyntax lang="javascript" lines="no"]


> db.geodata_geo.update( {} , { $rename : { 'CountryID' : 'countryID' }}, false, true );

[/codesyntax]

So what we have here is the update command to the collection (geodata_geo) and four parameters:

  1. {} -- the empty query document (this is what's missing from the 10gen doc): it matches, and therefore applies the update to, every record in the collection
  2. $rename -- the modifier to the update command which, in this case, renames 'CountryID' to 'countryID'
  3. false -- the upsert flag: false means don't insert a new record when no match is found (we only want to touch existing records)
  4. true -- the multi option: apply the command to all matching records since, by default, update() quits after updating the first record

And I run this command and mongo goes off (whirr...whirr ... I have two-node replication...) and renames the column in my collection!

What it didn't do was update my index. 

So, after my column-renaming completed, I needed to drop the index(es) that had 'CountryID' as members and re-index the collection to reflect the new column name.
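In shell terms, the cleanup looks something like this (the single-column index spec below is a guess -- check getIndexes() for your actual definitions before dropping anything):

[codesyntax lang="javascript" lines="no"]

> db.geodata_geo.getIndexes();                      // list the indexes still referencing CountryID
> db.geodata_geo.dropIndex({ 'CountryID' : 1 });    // drop the stale index by its key spec
> db.geodata_geo.ensureIndex({ 'countryID' : 1 });  // rebuild it against the renamed column

[/codesyntax]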

Executing getIndexes() confirmed that my mongo world was back in its correct orbit and life, once again, was good.

Covering Indexes in mySQL...


This is going to be a quick-and-dirty discussion on the importance of covering indexes in mySQL.  I was recently hired to steward a fairly complex LAMP system.  It relies heavily on mySQL as the db-engine and has mostly PHP driving the business logic within a Smarty template engine. So far, by adjusting tunable parameters within mySQL, I've been able to improve response times (world-wide page load averages) by about 25%.  Now it's time to start ploughing through the slow-query logs, since the db server is stable under the current loads and we've established, through metrics, a baseline of performance.

A db index (or indexing, to describe the act of adding one or more indexes to a table or tables to seek improvements in data access and retrieval) is often seen as something akin to the burning of incense and killing of goats when scrying the fortune of the gods.  As in all things performance-related to mySQL, it requires a continuous cycle of minute adjustment, measurement, and evaluation.

We're going to talk about covering indexes today -- (here's a good overview of indexes courtesy of wikipedia -- good to know, or good review!) -- which are indexes that contain all of the data needed by the query; in other words, the data is stored in the index itself and is returned by accessing the index alone, without doing a table look-up or scan.
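As a minimal sketch (the table and columns here are made up): an index on (a, b) covers any query that only touches a and b, so the engine never visits the table rows at all:

[cc lang='mysql' line_numbers='false']

-- the index on (a, b) contains both columns, so this query is satisfied
-- entirely from the index (EXPLAIN will show "Using index" in Extra):
CREATE INDEX idx_ab ON t (a, b);
SELECT b FROM t WHERE a = 42;

[/cc]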

In my slow-query log, where any query requiring more than 1 second of cpu-time to execute is flogged and logged, I saw a query that was part of the login process appearing up to several times per second.  The horrifying thing about this particular query, as opposed to all the other horrors exposed in the slow-query log (and, trust me, if you've never delved into the slow-query log on your production machine as the primary developer for your project, bring a barf-bucket as it's not pretty...), was that it was firing off several times per second but each instance was taking between 20 and 40 seconds to complete.

Did I mention that this query was part of the login process for our users?

Horrifying.

The query was a complex one that joined two tables and accessed several fields across both tables in the where clause, yet it only used a single key join between the two, and it completed with an ORDER BY.  The explain for this particular query (sorry, I didn't screen-shot anything!) showed it running, on my sandbox, in about 19 seconds (my sandbox is a much slower-than-production laptop) and accessing about 330,000 rows of data, using the indexes, where clause, file sort, and temporary table.  The query basically looked something like this:

[cc lang='mysql' line_numbers='false']

select c.c1, c.c2, c.c3, cp.c1, cp.c2
  from customer_table c inner join customer_profile_table cp
    on (c.pkey = cp.key1 and cp.pkey = v5)
  where c.`key` = v1 and c.c4 in (v2, v3) and c.c5 in (v4) and c.c6 = v5
  order by c.c1 desc limit 600;

[/cc]

I realize this query is hard to read - I don't want to give any real-world examples because (a) I'm under NDA, and (b) I'm at home and not at work while typing this, so I'm coasting off my short-term memory which is, in itself, terrifying...just bear with me and all will be made clear, I promise.  Stare at that query until it makes sense, please.  (v - value, c - column, etc.)

Now, to give credit, by itself this query isn't so bad.  If I had been the author of this query, I probably would have run it in my environment, using test data, and seen acceptable response times for the dataset, for the environment, including the system load.  In other words, probably about as far from a real-world production environment as you can get.  So, it ran just fine!

Another thing to note is that this query is dynamically generated by the PHP business logic based on real-time input from the system.  The IN(x,y) parts of the query can contain a variable number of elements depending on customer account type and activity.  So the low-hanging fruit of eliminating the IN(x,y) parts of the query isn't necessarily the way we want to go.  We can improve this query by paying attention to a few basic rules, the table DDL, and plain old common sense.  We don't even have to pull any rabbits out of any hats.

First off, let's look at the DDL (data-definition-language) for table customer_table (c) and we immediately notice that:

  1. One of the referenced columns in the where clause is not indexed (c6)
  2. The remaining columns are indexed
  3. The primary key is a composite key of (PKEY, key1)

When we examine the DDL of customer_profile_table (cp), we notice that:

  1. All of the columns referenced in the WHERE clause are indexed
  2. The primary key is a single key (pkey)

The EXPLAIN output is concerning -- we're looking at about 330,000 rows to satisfy the query -- but we're using the index columns as search criteria and it's coming back as type=ref, which is sub-optimal, but still good.

The order by is killing us.  When I remove the order-by from the query, I immediately improve the query speed by several orders of magnitude.  When I look at the explain output to see why, the output shows me that tmp tables are no longer being used; the filesorts are happening in memory.
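You can see this for yourself straight from EXPLAIN (a sketch using the anonymized query from above):

[cc lang='mysql' line_numbers='false']

-- "Using temporary; Using filesort" in the Extra column is the tell
-- that the sort is spilling out to a tmp table:
explain select c.c1, c.c2, c.c3, cp.c1, cp.c2
  from customer_table c inner join customer_profile_table cp
    on (c.pkey = cp.key1 and cp.pkey = v5)
  where c.`key` = v1 and c.c4 in (v2, v3) and c.c5 in (v4) and c.c6 = v5
  order by c.c1 desc limit 600;

[/cc]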

But I have to have the order-by -- without delving deeper into the code (analyzing the business logic could mean an extensive code re-write and we're only after a quick-win here.) and spending hours, if not days, on the issue -- so how can I avoid the tmp-table file sort?

Time to look at the indexes.

In my customer_table (c), I am accessing the table based on the following conditionals:

  1. c.pkey (join) = cp.key1
  2. c.key (where) = value
  3. c.c4 (where) in (values)
  4. c.c5 (where) in (values)
  5. c.c6 (where) = value

Note that we're not at all concerned with the columns from (c) that appear in the SELECT part of the query (c1, c2, c3).  For indexing, we need to make sure that all of the columns in the where clause are indexed, as a general rule.

When we look at the DDL for the table, we see that we're missing an index on c6.  The first reaction is to run out and alter the table to add that index to make the query run faster.  In reality, it will probably make performance worse for this, and other queries.

You have to keep in mind that we're dealing with InnoDB tables -- the generality for these types of tables is that data is stored with the index.  Ergo, the more indexes you create on a table, the more overhead you're introducing to the environment during maintenance: the speed of inserts, updates, and deletes becomes a function of the number of indexes you have.

Additionally, the more indexes you have, the more "pieces" mySQL has to merge when it's building your resulting data set.  In our example, we've got five columns referenced in the WHERE or JOIN clauses: one of which isn't indexed, and two of which belong to composite indexes but are referenced in different parts of the query.  When the mySQL engine is building the dataset for these indexed fields, it traverses each index (e.g., c.c6 = v5) and pulls a dataset out of that index.  When it has all the separate datasets, it has to do a merge between them to complete the search requirements.

So, depending on the size of your table (in this case, obviously many hundreds of thousands of rows) and the number of indexed columns you're joining on, you'll eventually end up out on tmp disk space (because you've exhausted the available memory allocated to your application instance slice) to do the sort-merge and, unless you're on SSD devices, this (hitting physical disk) is going to noticeably impact your performance.

The take-away here is that for every index you're using, mySQL sort-of treats it like an individual table -- since the data is stored in the index, I can retrieve the data I need by accessing an index, and then I leave it to the optimizer to merge the results of all the index retrievals I've made.  Starting to see why we're hitting disk space and filesort merges?

Instead of adding another index to the table, which will only increase the complexity, the trick here is to add a covering index to the table which is, basically, one ginormous composite index which contains all of the index columns specified in the query.  In other words, we add a new index (pkey, key, c4, c5, c6) to the schema.  (Pay attention to the order of the columns as they appear in the query -- mySQL matches the query against the left-most columns of the index as declared in its ddl -- you have to access your index, via the query, in the same order (from left to right) in which the index was declared.)
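In DDL terms, the change is a single statement (a sketch against the anonymized schema; note the backticks, since key is a reserved word in mySQL):

[cc lang='mysql' line_numbers='false']

-- one composite (covering) index, columns in left-to-right query order:
ALTER TABLE customer_table
  ADD INDEX idx_cover (pkey, `key`, c4, c5, c6);

[/cc]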

What's nice about covering indexes is that, later on, if I have a query that accesses (pkey, key, c4), I don't need to write another index -- the covering index I just wrote will satisfy the conditions for that query and it will be used also.  In the event where I have a query that accesses the index out-of-order, it's acceptable practice to add that out-of-order index column as its own index.  You'll then use the covering index, plus the second index, in the data merge.  Which is still better than doing a data merge across five, six, seven, or more indexes.
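A quick sketch of that left-most-prefix behavior (the values are made up):

[cc lang='mysql' line_numbers='false']

-- this query references (pkey, `key`, c4), a left-most prefix of the
-- covering index (pkey, `key`, c4, c5, c6), so the same index serves it:
select c1 from customer_table
  where pkey = v1 and `key` = v2 and c4 in (v3, v4);

[/cc]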

I apply the same logic to the second table (customer_profile_table (cp)).  I create a covering index based on (pkey, key1, c3).  Note also that I'm going to tweak the query from:

[cc lang='mysql' line_numbers='false']inner join on (c.pkey = cp.key1 and cp.pkey = v5)[/cc]

to

[cc lang='mysql' line_numbers='false']inner join on (cp.pkey = v5 and c.pkey = cp.key1)[/cc]

to ensure that the index order on cp is consistent.  Last, I'm going to leave the ORDER BY logic in the query.  Normally, for performance's sake, you'd want to avoid ORDER BY -- it's one of the cardinal sins of query performance.  Let the processing language do the sort, in-memory, on the returned dataset where possible.
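For completeness, the second covering index is the same idea (again, a sketch against the anonymized schema):

[cc lang='mysql' line_numbers='false']

-- cp is accessed by pkey first (per the re-ordered join), then key1, then c3:
ALTER TABLE customer_profile_table
  ADD INDEX idx_cp_cover (pkey, key1, c3);

[/cc]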

When I EXPLAIN the slightly-modified query, with the new indexing, I see that we've reduced the number of rows down to about 22,000 and, while we're still doing a disk-based filesort/merge, the query is now measuring in the 0.19 second range.  A drastic improvement over the original query!

I make the changes to the production table and spend the next several days monitoring the slow-query log and, during this time, I never see the query appear again.  In addition, I see that selects on the system rise about 11%, meaning that I've increased the system throughput as a result.  My metrics relating to disk writes and temp tables fall off the chart (to nearly 0.0/second) and, overall, I get a more "responsy" feel from the system.

I still have to wait a week or two for google analytics to report back on the change, but what I've seen so far has been very informative, instructive and convincing.

I hope you find this the same.

O'Reilly...and my timestamp index problem...

Just a quick jot - I'm headed to the O'Reilly MySQL conference, April 12-14 in Santa Clara for three joyous days filled with database geekiness. There's a lot of diversity in my schedule for the conference - I'm focusing mainly on performance tuning for mySQL, cloud deployment, replication and sharding, and a shallow dip into noSQL strategies.

Here's the current problem I'm working on:

For any given table (I don't believe that the ddl matters), say you have a timestamp field which is indexed and referenced in a query.  The table is InnoDB, so the timestamp field can only carry a btree index.  Which means the index is stored in ascending order.

In your query, your conditional reads something to the effect of (where timestamp_column > now() - 5 minutes)  (it's pseudo-code, ok?)
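In real mySQL syntax, the shape of the query is something like this (the table and column names are made up):

[cc lang='mysql' line_numbers='false']

-- fetch anything stamped within the last five minutes; ts is the indexed timestamp:
select * from events
  where ts > now() - interval 5 minute;

[/cc]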

What I expect to happen is that the query calculates the conditional border value and then does a range scan on the btree index, examining only those index values (rows) that exceed the calculated value.

What I'm seeing (in the slow-query log, among other places), is that the query is executing as an index-scan, but is scanning all the rows of the index, essentially doing a full table scan.

The query is calculated and stored as a variable that's passed to mysql_query() -- so the parser is determining the now() - 5 minutes value and passing it off as a string constant...

If anyone has any suggestions, I'd be most happy to hear them.  This lil' kitteh turd has filled my slow-query litter box....