mongodb, geospatial indexing, and advanced queries....

I've been working to build, and re-build, a geospatial table for work.  There's been a lot of challenges in this project for me as this is the first time that I've had to architect db designs incorporating mongodb with mySQL.

The mongo geospatial repository will be replacing several tables in the legacy mySQL system -- as you may know, mongodb comes with full geospatial support so executing queries against a collection (table) built in this manner is shocking in terms of it's response speeds -- especially when you compare those speeds to the traditional mySQL algorithms for extracting geo-points based on distance ranges for lat/lon coordinates.  The tl;dr for this paragraph is:  no more hideous trigonometric mySQL queries!

What I learned in this exercise was that the key to architecting a mongo collection requires you to re-think how data is stored.  Mongo stores data as a collection of documents.  The key to successful thinking, at least in terms of mongo storage, is denormalization of your data objects.

Let's use a standard customer object as an example.  Every customer has at least one phone number.  Most, if not all, customers have more than one phone number.  We could define several columns in the customer table for the phone numbers: workphone, homephone, cellphone, otherphone and store the data that way.  Problem is that we will eventually hit the wall where we have the need to store numbers for which we don't have columns pre-defined:  faxphone, skypephone, tddphone, vrsphone, etc.

RDBMS design demands a normalization of this 1:M data design by requiring a second table to store just phone numbers for each customer.  The phone table would have a primary key (id), the customer id, the customer phone number and perhaps a short, descriptive, field explaining the purpose of this number.  To get the phone data for a customer, then, you'd simply query (or join) the phone table based on the customer ID to get all the phone tuples for that customer.

Mongo, on the other hand, sees every customer as a document.  Think of each customer in your collection as a piece of paper.  You want to go into your collection and retrieve on piece of paper upon which has all the customer data.  So, for example, you retrieve the document for "John Smith" and on this document, it lists several key-value pairs, underneath an array called phone:

phone : { home : (408) 123-4567, work : (415) 123-4567, cell : (312) 765-4321 }

...and so on...

Mongo stores the document for this, or any user, by de-normalizing the data relationships within the customer object.  These relationships can be maintained as sub-arrays within the document.  Because mongo is schema-less, every customer object isn't required to have all the possible combinations of phone numbers.  So, if you were to do a search where you pull-up all customers with fax numbers, our Mr. Smith would not appear in this list since he has no fax number listed in his phone array.


This first step towards clarity in mongo architecture, then, is to think of all the data when you design a class object and include that data within a single document.  Data that was stored, in traditional RDBMS relation-based tables, is incorporated into the document as sub-arrays to the document.

But, you're asking, what if you want to later add a fax number to John Smith's phone collection?  Can you do that?


Again, this is the inherent strength in mongodb -- it's schema-less!  Adding another number to the existing collection of phone numbers, or adding a new "column" to the document itself, requires only that you update that document.  That's it!

So, returning back to the geospatial build, I used mySQL pull the legacy data, and collect the updated catalog tables into a single database.  Then I built new tables that (a) eliminated columns I no longer needed and, (b), de-normalized the data so that every tuple in every table reflected all of the data.

I then combined the five tables into a single table under a new primary-key value and then imported this data directly into a mongo collection.  This took several hours as my collection has over 3.6 million rows.

Once I had the collection in mongo, I made a mongo-dump of the collection so that I'd could recover back to this point in-case anything went south.  (Which it did...)

I executed a PHP script I wrote to scan the mySQL table, get the tuple by the newly-created primary key, and then create the sub-array in the mongo collection for the geospatial data.  See, in order to impose a geospatial index, your lat/lon data has to be a sub-array within the primary collection.  There's no way I've yet discovered to import data over from either a flat (csv) file, or directly from mySQL, so that it creates your sub-array automagically.  Hence, the home-brew PHP script to parse through the mySQL records and build (insert) the sub-array in the newly-created mongodb collection.

(Side note:  I was careful to maintain the maximum mantissa values for the lat/lon fields by importing initially, into mySQL as varchar(255) fields -- this kept my 13-digit mantissas.  When I imported the data into mongodb, mongo converted these values into doubles and kept the precision.  However, my PHP program, casting these values to either (float) or (double) converted (round) the matissa to 7-digit precision.  Suitable for task?  Yes.  Annoying to lose this data?  Yes.  If you have a solution to this, please leave me a comment at the end of this article.  Thanks!   :-P )

The next step was to add the geo-spatial index to the collection:

> db.geodata_geo.ensureIndex( { loc : "2d" } ); point not in interval of [ -180, 180 )


This error message was telling me that my data was out of range of the acceptable lat/lon values!

I tried searching for the data culprits:

> db.geodata_geo.find( { "loc" : { $exists : true }}).count();
> db.geodata_geo.find({"loc.lon" : {$lt : -180}}).count();
> db.geodata_geo.find({"loc.lon" : {$gt : 180}}).count();
> db.geodata_geo.find({"" : {$gt : 180}}).count();
> db.geodata_geo.find({"" : {$lt : -180}}).count();

These queries were telling me that while I have over 3.6 million records indexed, none are outside of the -180,180 boundaries.

> db.geodata_geo.find({"" : {$gt : -180}, "loc.lon" : {$lt : 180}}).count();
> db.geodata_geo.find( { "loc" : { $exists : true }}).count();

These queries tell me that I have a delta of 4-records that exists outside of the -180, 180 boundary.


I focus on the $gt/$lt and wonder if I have an "edge" case.  Given that I've lost 6-digits of my mantissa, I wonder if mongo has rounded my data to my edge cases of 180:

> db.geodata_geo.find({"loc.lon" : 180 });

And I get back exactly four records that have a lon-value of exactly 180:

"loc" : { "lon" : 180, "lat" : -16.1499996 }

This, to me, seems to be an error in how mongodb indexes geospatial data.  If 180 is an allowed value for lat/lon, then why throw the error when you ensure the index?  I decide to handle this rounding problem by expanding the allowable limits of my query:

> db.geodata_geo.ensureIndex({ "loc" : "2d" }, { min : -180, max : 181 });
> db.geodata_geo.getIndexes();
		"v" : 1,
		"key" : {
			"_id" : 1
		"ns" : "dev_honeybadger.geodata_geo",
		"name" : "_id_"
		"v" : 1,
		"key" : {
			"loc" : "2d"
		"ns" : "dev_honeybadger.geodata_geo",
		"name" : "loc_",
		"min" : -180,
		"max" : 181

And I see that my geospatial index has been created.  Now, to test:

> db.geodata_geo.find( { loc : {$near : [-50,50] } } ).limit(5);

And it immediately returns five records (Elliston, Bonavista, Elliston Station, Catalina and Port Union, Division #7,  in Canada) that I asked for.

My geospatial index is complete!  Now, all I need to do is add my regular indexes for keyed searching and export the table off my development environment.