
Renaming mongodb Columns

Today I was putzing around in the geo-spatial collection when I noticed that I had an unhappy over one of the column names within the collection.

In the mySQL world, changing a column name is pretty straightforward, courtesy of the alter table command.

Mongo...not so much...

<BEGIN_UNRELATED_SIDE_RANT>

The Mongo documentation is normally the first place most of us go when we're looking for help in using our favorite noSQL database.

Why?

Well...because that's usually where Google directs us to go and also because there just isn't a whole lot of documentation out there on the subject to begin with.

The mongo (10gen) documentation is pretty good.  It's not, however, excellent.  And I can articulate the reason why.

It's pretty easy to identify documentation written by engineers as opposed to documentation written by everyone else (on the planet).  And not because of technical content or the (ab)use of really big and impressive-sounding jargon.

No - it's because most engineering-authored documents are written using a solution-based voice instead of a problem-based voice.

Think about it:  when I have to go to the man-page for help, it's because I have a problem.  If I had a solution, I would be writing a blog post.  But since I have a problem, I need the man-pages, online docs, whatever, to help me figure out a solution.

Engineering documents are written from a solution perspective:  the document assumes you possess some bit of arcane lore (which is probably exactly the bit of lore you're missing, and the reason for your trip to the documentation vault).  Everything explained within the document hinges on this piece of knowledge which the author, albeit with the finest of intentions, assumes is already firmly in your mental possession.

And that's why I usually don't like 10gen's documentation.  But, like I said earlier, it's the only game in (Google)town.

<END_UNRELATED_SIDE_RANT>

In mongo, to change the name of a column within a collection, you first have to be on a release of mongodb 1.7.2 or later.  Since most of us bleeding-edge, early-adopter types are running 2.x versions, this shouldn't be an issue.

This page from 10gen is the update page and, within, talks about the $rename modifier to the update command.  What the section doesn't say, because it assumes you want to update records and not schema, is how to apply a change to all of the records in your collection.

In my case, I have a column name that I fat-fingered right out of its camel-case:  CountryID instead of countryID.  (And, yes, OCD-peeps, I know that it's not strictly camelCase, thank-you!)  I want to spin through all 3.7 million rows in my collection and rename this column...

[codesyntax lang="javascript" lines="no"]


> db.geodata_geo.update( {} , { $rename : { 'CountryID' : 'countryID' }}, false, true );

[/codesyntax]

So what we have here is the update command to the collection (geodata_geo) and four parameters:

  1. {} -- the empty query document (this is what's missing from the 10gen doc): it matches every record in the collection
  2. $rename -- the modifier to the update command which, in this case, replaces 'CountryID' with 'countryID'
  3. false -- the upsert option: don't insert a new record if no match is found
  4. true -- the multi option: apply the update to every matching record since, by default, update() quits after updating the first record

And I run this command and mongo goes off (whirr...whirr ... I have two-node replication...) and renames the column in my collection!

What it didn't do was update my index. 

So, after my column-renaming completed, I needed to drop the index(es) that had 'CountryID' as members and re-index the collection to reflect the new column name.
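In shell terms, that looks something like this -- the exact index keys below are illustrative, since it depends on which of your indexes referenced the old column name:

[codesyntax lang="javascript" lines="no"]

// drop the stale index that still references the old column name
> db.geodata_geo.dropIndex({ 'CountryID' : 1 });

// rebuild the index against the renamed column
> db.geodata_geo.ensureIndex({ 'countryID' : 1 });

[/codesyntax]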

Executing getIndexes() confirmed that my mongo world was back in its correct orbit and life, once again, was good.

Why is my mongo query so slow?

Why's my mongodb query so slow? I've got my geospatial collection set up -- I'm running some really great queries, making sure that the locations I'm pulling aren't in any sort of cache, and I'm just blown away by how fast the data is being returned.

The problem is that when I query the collection to pull up the requisite lon/lat data by name (city & state, or city & country), the query seems to take seconds to complete!

I set up the table correctly... I indexed the crap out of all my columns... A week or two ago, I was at mongoSV 2011 in Santa Clara and learned some really cool stuff about queries, indexing, and performance management, so let's dig out the notes and see where I went wrong.  Because I strongly doubt that the problem is in mongo; rather, as we used to say in technical support, this is a PBCK issue...

The first thing I want to do is run an explain against my query so I can see mongo's query plan for my query.  This should provide me with a starting point for trying to figure out what went wrong.

> db.geodata_geo.find({ cityName : "Anniston", stateName : "Alabama" }).explain();

By adding the trailing function, .explain(), I'm asking mongoDB to return the query plan to me instead of the query results.  I hit enter to launch the explain() and get back the following output:

> db.geodata_geo.find({ cityName : "Anniston", stateName : "Alabama"}).explain();
{
	"cursor" : "BasicCursor",
	"nscanned" : 3691723,
	"nscannedObjects" : 3691723,
	"n" : 1,
	"millis" : 2269,
	"nYields" : 0,
	"nChunkSkips" : 0,
	"isMultiKey" : false,
	"indexOnly" : false,
	"indexBounds" : {

	}
}

The important information here is the cursor, nscanned, millis, and indexOnly fields in the query output (above).   What this output is telling me is that I'm using a "BasicCursor" for my search cursor -- which indicates that, yes, I am doing a table-scan on the collection.  So, already I know my query is not optimal.  But, wait!  More good news...

The value for nscanned and nscannedObjects is the same: 3,691,723 -- which, not coincidentally, is the same as the cardinality of the collection.  This number is the number of documents scanned to satisfy the query which, given its value, confirms that I am doing a full table scan.

millis tells me the number of milliseconds the query took:  2.269 seconds.  Way too slow for my back-end methods() serving a REST API -- unacceptable.

And then we get to the tell:  indexOnly tells me whether the query could have been resolved by an (existing) covering index.  Seeing the value false here, together with the BasicCursor, tells me that the collection has no usable index on the columns I am searching against.

What?!?  I know I indexed this collection...

So, I run db.geodata_geo.getIndexes() to dump my indexes and ... I ... don't see my name columns indexed.  Oh, I remembered to index the ID and Code columns...but somehow, indexing the Name columns completely slipped past my lower brain-pan.

I add these indexes to my collection:

> db.geodata_geo.ensureIndex({ cityName : 1 });
> db.geodata_geo.ensureIndex({ stateName : 1 });

And then I rerun the query plan and see the following output:

> db.geodata_geo.find({ cityName : "Anniston", stateName : "Alabama"}).explain();
{
	"cursor" : "BtreeCursor cityName_1",
	"nscanned" : 2,
	"nscannedObjects" : 2,
	"n" : 1,
	"millis" : 101,
	"nYields" : 0,
	"nChunkSkips" : 0,
	"isMultiKey" : false,
	"indexOnly" : false,
	"indexBounds" : {
		"cityName" : [
			[
				"Anniston",
				"Anniston"
			]
		]
	}
}

Instead of BasicCursor, I see BtreeCursor which gives me a happy.  I also see that the nscanned and nscannedObjects values are now more realistic...seriously:  2 is a LOT better than 3.6 million something, right?  Another happy for me!

I score the third happy when I see that the millis has dropped down to 101:  0.101 seconds to execute this search/query!  Not jaw-dropping, I agree -- but acceptable considering that everything is running off my laptop...I know production times will be much, much lower.
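One follow-up thought, untested here, so treat it as a sketch: since the query filters on both cityName and stateName, a single compound index would let mongo narrow on both fields in one index pass instead of walking the cityName index and then filtering on stateName:

> db.geodata_geo.ensureIndex({ cityName : 1, stateName : 1 });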

 

In the end, I learned that a simple tool like .explain() can tell me where my attention is needed when it comes to optimizing and fixing even simple, seemingly innocent queries.  Knowing what you're looking at, and what you're looking for, is pretty much the thick end of the baseball bat when it comes to crushing one out of the park.

I hope this helps!

 

Reference Link:  Explain

Searching MongoDB Sub-Documents...

I've recently finished a mongo collection that stores all auditing data from my application -- specifically, it records every database transaction, conducted in either mySQL or mongo, assigns an event identifier to the event, and stores the data under that event ID within a single sessionManager object.

Sounds good?

Well, I like it.   This design eliminated the need to maintain meta-data in my data tables since I can pull transaction history for any record that I've accessed.

The problem is that, being new to mongodb, I find that accessing what I've put into mongodb isn't (yet) as intuitive as, say, using my mySQL skills.

Sub-documents within a mongo document are analogous to the results of a mySQL join.  One of the key motivators in storing this information in mongodb to begin with was that I could de-normalize the data by storing the sub-document with its parent instead of having to incur the expense of a search-join-fetch later.

Traditionally, any data objects defined as a one-to-many type of a relationship (1:m) were stored in multiple mySQL tables and were accessed via some sort of join mechanism.

Mongodb breaks that traditional mold by allowing you to store a sub-document (the "m" part of the 1:m relationship) within the same document in which you're currently working.

Using my sessionManager document, I have a document that looks something like this:

[cc lang='javascript' line_numbers='false']
{
	_id : somevalue,
	foo : bar,
	event : {},
	argle : bargle
}
[/cc]

My desire is to, for every database event that is recorded, enter information about that event within the sub-document that I've wittily named "event".

In my PHP code, I've written a sequence manager for mongo that maintains a document containing sequence values for various tables.  Think of this as the functional version of mySQL's auto-increment feature.  I decided, then, for the sessionManager events, I would use this key sequence to obtain unique values and use those as my sub-document index.  I'd then store whatever data I needed to store using the sequence value as a sub-document key, or index:

[cc lang='javascript' line_numbers='false']
{
	_id : somevalue,
	foo : bar,
	event : {
		n : {
			created : dateval,
			table : tableName,
			schema : dbSchema,
			query : lastQuery
		}
	},
	argle : bargle
}
[/cc]

So, when I need to add another event, I just create a new sub-document under the event key, then add the data I need to store under the sub-document index key.
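Here's a rough shell sketch of that pattern -- the counters collection name, the sequence value, and the field values are all placeholders (my real sequence manager lives in the PHP layer):

> // grab the next sequence value from a counters document (placeholder pattern)
> var seq = db.counters.findAndModify({ query : { _id : "sessionManager" }, update : { $inc : { seq : 1 } }, new : true, upsert : true }).seq;
> // build a $set modifier so the sequence value becomes the sub-document key...
> var mod = { $set : {} };
> mod.$set["event." + seq] = { created : new Date(), table : "geodata_geo", schema : "dev_honeybadger", query : "..." };
> // ...and write the event data under event.<seq>
> db.mytable.update({ foo : "bar" }, mod);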

Worked like a champ!

And then I asked myself:  "So, Brainiac, how would you go about extracting event -n- from your collection?"

I went through a lot of failed query attempts, bugged a lot of people, googled and saw stuff  that led me down many plush ratholes until I finally, through some serious trial-and-error, got the answer...

> db.mytable.find( { foo : bar }, { 'event.n' : 1 } );

where n = the number of the event I want to find.

If I want to get all of the events for a particular document (sessionManager object), then I would write something like:

> db.mytable.find( {foo : bar}, { event : 1});

If I wanted to return all of the events for all of the objects, then I would write this:

> db.mytable.find( {}, {event : 1});

What I've not been able to figure out, so far, is how I can use $slice to grab a range of events within a document.  Everything I try returns the full set of sub-documents back to me.  The doc tells me that $slice is used to return a subrange of array elements, which is what I thought "event.n" was but, apparently, it's not.  (I think it's an object (sub-document), which is why $slice fails for me.)
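For what it's worth, here's a sketch of the shape $slice does expect -- events $push-ed onto an array instead of keyed by sequence value. I haven't converted my collection over, so this is hypothetical:

> // append each event to an array rather than keying it by sequence value
> db.mytable.update({ foo : "bar" }, { $push : { events : { seq : 12, created : new Date(), table : "geodata_geo" } } });
> // then $slice can return a subrange: skip the first 10 events, return the next 5
> db.mytable.find({ foo : "bar" }, { events : { $slice : [10, 5] } });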

It's not a big deal because, programmatically, I can grab the entire sub-document from its parent and parse in-memory to get the desired record.  And, if I know what the value for -n- is, then I can fetch just that one sub-document.  So, I'm ok for now.  However, please feel free to enlighten me with your expertise and experience should you see where I am failing here, ok?

 

mongodb, geospatial indexing, and advanced queries....

I've been working to build, and re-build, a geospatial table for work.  There have been a lot of challenges in this project for me, as this is the first time that I've had to architect db designs incorporating mongodb with mySQL.

The mongo geospatial repository will be replacing several tables in the legacy mySQL system -- as you may know, mongodb comes with full geospatial support, so executing queries against a collection (table) built in this manner is shocking in terms of its response speed -- especially when you compare those speeds to the traditional mySQL algorithms for extracting geo-points based on distance ranges for lat/lon coordinates.  The tl;dr for this paragraph is:  no more hideous trigonometric mySQL queries!

What I learned in this exercise was that architecting a mongo collection requires you to re-think how data is stored.  Mongo stores data as a collection of documents.  The key to successful thinking, at least in terms of mongo storage, is denormalization of your data objects.

Let's use a standard customer object as an example.  Every customer has at least one phone number.  Most, if not all, customers have more than one phone number.  We could define several columns in the customer table for the phone numbers: workphone, homephone, cellphone, otherphone and store the data that way.  Problem is that we will eventually hit the wall where we have the need to store numbers for which we don't have columns pre-defined:  faxphone, skypephone, tddphone, vrsphone, etc.

RDBMS design demands a normalization of this 1:M data design by requiring a second table to store just phone numbers for each customer.  The phone table would have a primary key (id), the customer id, the customer phone number and perhaps a short, descriptive, field explaining the purpose of this number.  To get the phone data for a customer, then, you'd simply query (or join) the phone table based on the customer ID to get all the phone tuples for that customer.

Mongo, on the other hand, sees every customer as a document.  Think of each customer in your collection as a piece of paper.  You want to go into your collection and retrieve one piece of paper upon which is all the customer data.  So, for example, you retrieve the document for "John Smith" and on this document, among several key-value pairs, there's an array called phone:

phone : { home : "(408) 123-4567", work : "(415) 123-4567", cell : "(312) 765-4321" }

...and so on...

Mongo stores the document for this, or any, user by de-normalizing the data relationships within the customer object.  These relationships can be maintained as sub-arrays within the document.  Because mongo is schema-less, not every customer object is required to have all the possible combinations of phone numbers.  So, if you were to do a search to pull up all customers with fax numbers, our Mr. Smith would not appear in this list since he has no fax number listed in his phone array.

See?
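That fax-number search, assuming the collection is named customers, is just an $exists test on the embedded field:

> db.customers.find({ "phone.fax" : { $exists : true } });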

The first step towards clarity in mongo architecture, then, is to think of all the data when you design a class object and include that data within a single document.  Data that would have been stored in traditional RDBMS relation tables is incorporated into the document as sub-arrays.

But, you're asking, what if you want to later add a fax number to John Smith's phone collection?  Can you do that?

Sure!

Again, this is the inherent strength in mongodb -- it's schema-less!  Adding another number to the existing collection of phone numbers, or adding a new "column" to the document itself, requires only that you update that document.  That's it!
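A quick sketch of both cases, again assuming a customers collection (the names and numbers are just examples):

> // add a fax number inside the existing phone sub-document
> db.customers.update({ name : "John Smith" }, { $set : { "phone.fax" : "(408) 555-0100" } });
> // add an entirely new "column" to the document
> db.customers.update({ name : "John Smith" }, { $set : { twitter : "@jsmith" } });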

So, returning to the geospatial build, I used mySQL to pull the legacy data and collect the updated catalog tables into a single database.  Then I built new tables that (a) eliminated columns I no longer needed and (b) de-normalized the data so that every tuple in every table reflected all of the data.

I then combined the five tables into a single table under a new primary-key value and then imported this data directly into a mongo collection.  This took several hours as my collection has over 3.6 million rows.

Once I had the collection in mongo, I made a mongo-dump of the collection so that I could recover back to this point in case anything went south.  (Which it did...)
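The dump itself is just mongodump pointed at the database and collection (the paths here are illustrative):

$ mongodump --db dev_honeybadger --collection geodata_geo --out ./backup
$ mongorestore --db dev_honeybadger --collection geodata_geo ./backup/dev_honeybadger/geodata_geo.bson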

I executed a PHP script I wrote to scan the mySQL table, get the tuple by the newly-created primary key, and then create the sub-array in the mongo collection for the geospatial data.  See, in order to impose a geospatial index, your lat/lon data has to be a sub-array within the primary collection.  There's no way I've yet discovered to import data over from either a flat (csv) file, or directly from mySQL, so that it creates your sub-array automagically.  Hence, the home-brew PHP script to parse through the mySQL records and build (insert) the sub-array in the newly-created mongodb collection.
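Per record, what the script is effectively doing amounts to this shell update (the id and coordinates are made-up example values):

> db.geodata_geo.update({ id_geo : 12345 }, { $set : { loc : { lon : -52.0442, lat : 47.2662 } } });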

(Side note:  I was careful to maintain the maximum mantissa values for the lat/lon fields by initially importing into mySQL as varchar(255) fields -- this kept my 13-digit mantissas.  When I imported the data into mongodb, mongo converted these values into doubles and kept the precision.  However, my PHP program, casting these values to either (float) or (double), rounded the mantissa to 7-digit precision.  Suitable for task?  Yes.  Annoying to lose this data?  Yes.  If you have a solution to this, please leave me a comment at the end of this article.  Thanks!   :-P )

The next step was to add the geo-spatial index to the collection:

> db.geodata_geo.ensureIndex( { loc : "2d" } );
point not in interval of [ -180, 180 )

What?

This error message was telling me that my data was out of range of the acceptable lat/lon values!

I tried searching for the data culprits:

> db.geodata_geo.find( { "loc" : { $exists : true }}).count();
3685667
> db.geodata_geo.find({"loc.lon" : {$lt : -180}}).count();
0
> db.geodata_geo.find({"loc.lon" : {$gt : 180}}).count();
0
> db.geodata_geo.find({"loc.lat" : {$gt : 180}}).count();
0
> db.geodata_geo.find({"loc.lat" : {$lt : -180}}).count();
0

These queries were telling me that while I have over 3.6 million records indexed, none are outside of the -180,180 boundaries.

> db.geodata_geo.find({"loc.lat" : {$gt : -180}, "loc.lon" : {$lt : 180}}).count();
3685663
> db.geodata_geo.find( { "loc" : { $exists : true }}).count();
3685667

These queries tell me that I have a delta of 4 records that fall outside the strict -180, 180 comparisons.

Wait...wut?

I focus on the $gt/$lt and wonder if I have an "edge" case.  Given that I've lost 6 digits of my mantissa, I wonder if mongo has rounded my data to my edge cases of 180:

> db.geodata_geo.find({"loc.lon" : 180 });

And I get back exactly four records that have a lon-value of exactly 180:

"loc" : { "lon" : 180, "lat" : -16.1499996 }

This, to me, seems to be an error in how mongodb indexes geospatial data.  If 180 is an allowed value for lat/lon, then why throw the error when you ensure the index?  I decide to handle this rounding problem by expanding the allowable limits of my index:

> db.geodata_geo.ensureIndex({ "loc" : "2d" }, { min : -180, max : 181 });
> db.geodata_geo.getIndexes();
[
	{
		"v" : 1,
		"key" : {
			"_id" : 1
		},
		"ns" : "dev_honeybadger.geodata_geo",
		"name" : "_id_"
	},
	{
		"v" : 1,
		"key" : {
			"loc" : "2d"
		},
		"ns" : "dev_honeybadger.geodata_geo",
		"name" : "loc_",
		"min" : -180,
		"max" : 181
	}
]

And I see that my geospatial index has been created.  Now, to test:

> db.geodata_geo.find( { loc : {$near : [-50,50] } } ).limit(5);

And it immediately returns five records (Elliston, Bonavista, Elliston Station, Catalina and Port Union, Division #7,  in Canada) that I asked for.

My geospatial index is complete!  Now, all I need to do is add my regular indexes for keyed searching and export the table off my development environment.
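Those keyed-search indexes are the standard single-field variety, something along these lines (the exact fields depend on what the API searches on):

> db.geodata_geo.ensureIndex({ cityName : 1 });
> db.geodata_geo.ensureIndex({ countryID : 1 });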

 

 

mongodb.findOne() -- calling with PHP variables (not literals)

So I've been doing a lot of work, for work, in MongoDB lately and I've learned an awful lot.  Or, depending on your point of view, a lot that's just awful.

See, there's not what you could even charitably call a lot of MongoDB documentation to begin with.   If you filter what is available on, oh, say, PHP implementation, well the results just dwindle to something roughly the same size as a tax-collector's heart.

Here's the scenario -- I've been working on adding a mongo abstraction class on top of my base-data abstraction class, wherein said classes are extended by the table-level class instantiations.  This allows me to keep all of my query logic in the middle tier of the class design, generic and administrative functions in the base class, and table-specific stuff in the table class.  So far, so good, right?

Well, I get the mongo constructor running and, like its mySQL counterpart, I have a rule in every table constructor that states "if I pass an indexed field and its value to the constructor, then instantiate the class pre-populated with that record."

And that's where things start to head south.

In my constructor logic, I'm only allowing single-value key->value pairs as constructor parameters with the design intention of getting a record from the db using the pkey of the table/collection.  In other words, you get one column and one column value.  So, if you're going to instantiate a new user object, you'd probably want to pass in the primary-key field of a user and that field's value:

$objUser = new UserProfile('email', 'mshallop@gmail.com');   // instantiate a new user object with this email address

Still pretty easy.  I bang out the mySQL equivalent in nothing flat.  I hit a huge pothole when I get to the mongo side.

The method is defined as a protected abstract method in the base class - so this method has to appear in both child classes as defined in the parent:

protected abstract function loadClassById($_key, $_value);

So I have my methods defined in both the mySQL and mongoDB middle layer.  My strategy for the mongo fetch-and-return is pretty simple -- once the class has been instantiated, do the following:

  1. make sure the $_key value exists in the allowed field list
  2. make sure the $_value has a value
  3. query mongodb using .findOne()
  4. store the return key->value pairs in the member array
  5. return status

That's pretty much it.  But I run into huge problems when I get to step 3 -- using the mongoDB findOne command.

The findOne method takes an array input of the key->value pair.  From the mongo command line, you'd execute something like this:

> db.session_ses.findOne({'idpro_ses' : 1})
{
 "_id" : ObjectId("4ea1af93ddc69802376b56d1"),
 "id_ses" : 1,
 "idpro_ses" : 1
}

( Just to show you that the data exists in the mongo collection...)

But, the PHP-ized version of the method is a wee bit different:

$this->collection->findOne(array('idpro_ses' => 1));

All of the examples that I've been able to locate show the method being invoked with literals.  My problem is that I have the two input parameters sent to the method ($_key and $_value) and I've got to find a way to get the PHP version of the method call to work using variables instead of constants.  This is what didn't work:

$this->collection->findOne(array($_key => $_value));

$this->collection->findOne(array("'" . $_key . "'" => $_value));

 

$this->collection->findOne(array("{$_key}" => $_value));

$aryData = array();
$aryData[$_key] = $_value;
$this->collection->findOne($aryData);
// or
$this->collection->findOne(var_dump($aryData));

I thought this worked but I was wrong:

$this->collection->findOne(array(array_keys($aryData) => array_values($aryData)));

This format returned a mongo record -- the problem was that it returned the first mongo record independently of any key-search criteria.

What finally worked for me was this:

            $qs = array(); // query structure: the key => value criteria for findOne()
            // cast the incoming value to the type mongo expects for this field
            switch($this->fieldTypes[$_key]) {
                case 'int' :
                    $_value = intval($_value);
                    break;
                case 'str' :
                    $_value = strval($_value);
                    break;
                case 'float' :
                    $_value = floatval($_value);
                    break;
            }
            $qs[$_key] = $_value;
            $aryData = $this->collection->findOne($qs);

[Update]

I encountered a similar problem when trying to update records in a mongo collection -- while I could update the record from the mongo command line, I did not experience the same success in trying to execute the command from within my PHP program...

$foo = $collection->find(array('id_geo' => $row['id_geo']));

Consistently failed.  No exceptions were caught, and mongo's getLastError() reported no errors in the transaction.

After several iterations of debugging and attempting various work-arounds, I stumbled upon the solution: casting.  While the variable was being evaluated in the PHP array as type int, somehow this wasn't being interpreted that way by Mongo.  Casting the variable to an integer:

$foo = $collection->find(array('id_geo' => intval($row['id_geo']))); 

 generated a successful query for both the find() and my update() functions.

As I gain experience with Mongo, I expect to discover more of these little mannerisms...