MySQL Latitude/Longitude Indexing

Table of Contents

The Problem
A Solution -- First, the principles
Representation choices
GCDist -- Compute "Great Circle Distance"
Required TABLE Structure
Performance
FindNearest
The code for deg*10000 and 'miles'
Using a Single Table
Using Two Tables
Postlog
Brought to you by Rick James

The Problem


You want to find the nearest 10 pizza parlors, but you cannot figure out how to do it efficiently in your huge database. Database indexes are good at one-dimensional indexing, but poor at two dimensions.

You might have tried
    ⚈  INDEX(lat), INDEX(lon) -- but the optimizer used only one
    ⚈  INDEX(lat,lon) -- but it still had to work too hard
    ⚈  Sometimes you ended up with a full table scan -- Yuck.

WHERE SQRT(...) < ... -- No chance of using any index.

WHERE lat BETWEEN ... AND lon BETWEEN ... -- This has some chance of using such indexes.
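
For concreteness, here is a sketch of that last attempt (table name and coordinates are hypothetical, in plain degrees). Even with INDEX(lat,lon), the index narrows only the latitude range; every row in that latitude band must still be examined to check the longitude:

    -- Hypothetical table; 0.25 degrees of latitude is roughly 17 miles
    SELECT *
        FROM PizzaParlors
        WHERE lat BETWEEN 33.68 - 0.25 AND 33.68 + 0.25    -- index narrows this
          AND lon BETWEEN -96.91 - 0.25 AND -96.91 + 0.25; -- this is checked row by row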

A Solution -- First, the principles


PARTITIONs in MySQL give you a way to sort of have two clustered indexes. So, if we could slice up (partition) the globe in one dimension and use ordinary indexing in the other dimension, maybe we can get something approximating a 2D index.

It works. Not perfectly, but better than the alternatives.

What to PARTITION on? It seems like latitude or longitude would be a good idea. Note that a degree of longitude varies in width, from 69 miles at the equator to 0 at the poles, while a degree of latitude is always about 69 miles. So, latitude seems like a better choice.

How many PARTITIONs? It does not matter a lot. Some thoughts:
    ⚈  90 - 2 degrees each. (I don't like tables with too many partitions; 90 seems like plenty.)
    ⚈  50-100 - evenly populated. (This requires code. For 2.7M placenames, 85 partitions varied from 0.5 degrees to very wide partitions at the poles.)
    ⚈  Don't have more than 100 partitions; there are inefficiencies in the partition implementation.

How to PARTITION? Well, MySQL is very picky. So FLOAT/DOUBLE are out. DECIMAL is out. So, we are stuck with some kludge. Essentially, we need to convert Lat/Lng to some size of INT and use PARTITION BY RANGE.
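
To make the kludge concrete, here is a minimal sketch, assuming a table Locations whose lat column is already stored as Deg*10000 in a MEDIUMINT and whose PRIMARY KEY includes lat. A real table would use far more (and narrower) partitions:

    ALTER TABLE Locations
        PARTITION BY RANGE (lat) (
            PARTITION p_south VALUES LESS THAN (-300000),  -- below -30 degrees
            PARTITION p_mid   VALUES LESS THAN ( 300000),  -- -30 to +30 degrees
            PARTITION p_north VALUES LESS THAN MAXVALUE    -- above +30 degrees
        );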

Representation choices


To get to a datatype that can be used in PARTITION, you need to "scale" the latitude and longitude.
   Datatype           Bytes       resolution
   ------------------ -----  --------------------------------
   Deg*100 (SMALLINT)     4  1570 m    1.0 mi  Cities
   DECIMAL(4,2)/(5,2)     5  1570 m    1.0 mi  Cities
   SMALLINT scaled        4   682 m    0.4 mi  Cities
   Deg*10000 (MEDIUMINT)  6    16 m     52 ft  Houses/Businesses
   DECIMAL(6,4)/(7,4)     7    16 m     52 ft  Houses/Businesses
   MEDIUMINT scaled       6   2.7 m    8.8 ft
   FLOAT                  8   1.7 m    5.6 ft
   DECIMAL(8,6)/(9,6)     9    16cm    1/2 ft  Friends in a mall
   Deg*10000000 (INT)     8    16mm    5/8 in  Marbles
   DOUBLE                16   3.5nm     ...    Fleas on a dog
(Sorted by resolution)

What these mean...

Deg*100 (SMALLINT) -- you take the lat/lng, multiply by 100, round, and store into a SMALLINT. That will take 2 bytes for each dimension, for a total of 4 bytes. Two items might be a mile apart, but register as having the same latitude and longitude.

DECIMAL(4,2) for Latitude and DECIMAL(5,2) for Longitude will take 2+3 bytes and have no better resolution than Deg*100.

SMALLINT scaled -- Convert Latitude into a SMALLINT SIGNED by doing (degrees / 90 * 32767) and rounding; Longitude by (degrees / 180 * 32767).

FLOAT has 24 significant bits; DOUBLE has 53. (They don't work with PARTITIONing but are included for completeness. Often people use DOUBLE without realizing how much an overkill it is, and how much space it takes.)

Sure, you could do DEG*1000 and other "in between" cases, but there is no advantage. DEG*1000 takes as much space as DEG*10000, but has less resolution.

So, go down the list to see how much resolution you need, then pick an encoding you are comfortable with. However, since we are about to use latitude as a "partition key", it must be limited to one of the INTs. For the sample code, I will use Deg*10000 (MEDIUMINT).
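
Here is a sketch of that scaling, going in and coming back out (the coordinates are merely illustrative):

    -- Storing: multiply by 10000 and round; the result fits easily in a signed MEDIUMINT
    SET @lat := ROUND(  35.1234567 * 10000);   --   351235
    SET @lon := ROUND(-118.7654321 * 10000);   -- -1187654
    -- Displaying: divide to get degrees back, good to 4 decimal places
    SELECT @lat / 10000 AS lat_degrees, @lon / 10000 AS lon_degrees;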

GCDist -- Compute "Great Circle Distance"


GCDist is a helper FUNCTION that correctly computes the distance between two points on the globe.

The code has been benchmarked at about 20 microseconds per call on a 2011-vintage PC. If you had to check a million points, that would take 20 seconds -- far too much for a web application. So, one goal of the Procedure that uses it will be to minimize the usage. With the code presented here, the function need be called only a few dozen or few hundred times, except in pathological cases.

Sure, you could use the Pythagorean formula. And it would work for most applications. But it takes little extra effort to do the GC computation. Furthermore, GC works across a pole and across the dateline. And, a Pythagorean function is not that much faster.

For efficiency, GCDist understands the scaling you picked and has that stuff hardcoded. I am picking "Deg*10000", so the function expects 350000 for 35 degrees. If you choose a different scaling, you will need to change the code.

GCDist() takes 4 scaled DOUBLEs -- lat1, lon1, lat2, lon2 -- and returns a scaled number of "degrees" representing the distance.

The table of Representation Choices says 52 feet of resolution for DECIMAL(x,4). Here is how it was calculated: measure the diagonal between lat/lng (0,0) and (0.0001, 0.0001) -- one 'unit in the last place': GCDist(0,0,1,1) * 69.172 / 10000 * 5280 = 51.65, where
    ⚈  69.172 miles/degree of latitude
    ⚈  10000 units per degree for the scaling chosen
    ⚈  5280 feet / mile.

(No, this function does not compensate for the Earth being an oblate spheroid, etc.)
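
Once the GCDist function (presented below) has been created, the 52-foot figure can be reproduced directly; this is just the calculation above expressed in SQL:

    SELECT GCDist(0, 0, 1, 1)        -- one 'unit in the last place' in each direction
           * 69.172 / 10000 * 5280   -- scaled degrees -> miles -> feet
           AS feet;                  -- about 51.65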

Required TABLE Structure


Having a single table is simpler, but may be less efficient than using a pair of tables. Either way, the main table is partitioned and indexed as indicated below.

For various reasons, you must create the main (or only) table this way...

Required fields (in the first, or only, table):
    ⚈  id -- unique. If 2 tables, it is probably an AUTO_INCREMENT in the other table
    ⚈  lat -- scaled (see above) latitude
    ⚈  lon -- scaled longitude
    ⚈  any fields needed for filtering
    ⚈  PRIMARY KEY(lon, id, lat)
    ⚈  INDEX(id) -- (optional) if AUTO_INCREMENT here, or if need to JOIN from another table (for maintenance)
    ⚈  ENGINE=InnoDB -- so the PRIMARY KEY will be "clustered"

To elaborate on the contents and ordering of the PK:
    ⚈  lon -- must be first; the algorithm needs "clustering" on it.
    ⚈  lat -- because every UNIQUE key in a PARTITIONed table must include the partition 'key'
    ⚈  id -- because the PK must be unique.
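
Putting those requirements together, here is a sketch of the table. The population column stands in for whatever filtering fields you need, and the partition boundaries are merely illustrative (a real table would have closer to 90 of them):

    CREATE TABLE Locations (
        id  MEDIUMINT UNSIGNED NOT NULL,   -- AUTO_INCREMENT lives in the other table if using two
        lat MEDIUMINT NOT NULL,            -- scaled: degrees * 10000
        lon MEDIUMINT NOT NULL,            -- scaled: degrees * 10000
        population INT UNSIGNED NOT NULL,  -- example field for the "condition"
        PRIMARY KEY (lon, id, lat),        -- lon first, for clustering within each partition
        INDEX(id)                          -- optional; for JOINs and maintenance
    ) ENGINE=InnoDB
    PARTITION BY RANGE (lat) (
        PARTITION p0 VALUES LESS THAN (-600000),   -- south of -60 degrees
        PARTITION p1 VALUES LESS THAN (-300000),
        PARTITION p2 VALUES LESS THAN (      0),
        PARTITION p3 VALUES LESS THAN ( 300000),
        PARTITION p4 VALUES LESS THAN ( 600000),
        PARTITION p5 VALUES LESS THAN MAXVALUE
    );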

Filtering... An argument to the FindNearest procedure includes a Boolean expression ("condition") for a WHERE clause. If you don't need any filtering, pass in "1". If you do need filtering, the fields you are going to test must be included in this table to avoid lots of JOINing to another table. To avoid "SQL injection", do not let web users put arbitrary expressions in the "condition" without making sure it is safe.

The idea behind having two tables is to keep the first table (the one that is searched) as small as possible, thereby minimizing I/O. The second table has the rest of the desired info; it does not matter (much) how bulky it is, since it will be probed by its primary key for only the desired number of "nearest" items; no need to slog through other rows.

The FindNearest PROCEDURE will do multiple SELECTs something like this:
      WHERE lat    BETWEEN @my_lat - @dlat
                       AND @my_lat + @dlat   -- PARTITION Pruning and bounding box
        AND lon    BETWEEN @my_lon - @dlon
                       AND @my_lon + @dlon   -- first part of PK
        AND _condition_
The query planner will
    ⚈  Do PARTITION "pruning" based on the latitude;
    ⚈  Within a PARTITION (which is effectively a table), use lon to do a 'clustered' range scan;
    ⚈  Then the "condition" comes into play, and lat is rechecked.
This design leads to very few disk blocks needing to be read, which is the main goal of the design.
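
To verify that pruning actually happens on your table, run the probe with literal (already scaled) values through EXPLAIN PARTITIONS (MySQL 5.6 and earlier; in 5.7+ plain EXPLAIN includes the partitions column). The partitions column of the output should list only one or two partitions. The constants here are merely illustrative:

    EXPLAIN PARTITIONS
        SELECT COUNT(*)
            FROM Locations
            WHERE lat BETWEEN 351235 - 2169 AND 351235 + 2169      -- about 15 miles
              AND lon BETWEEN -1187654 - 2650 AND -1187654 + 2650;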

Note that this does not even call GCDist. That comes in the last pass when the ORDER BY and LIMIT are used.

There will be at least two SELECTs but, with proper tuning, usually no more than 6 SELECTs will be needed.

Performance


Because of all the variations, it is hard to get a meaningful benchmark. So, here is some hand-waving instead.

The following comes from a 3.2M-row, 120MB table (cities in the world), 50-mile cutoff, looking for 10 nearest cities (but not always finding 10 within 50 miles), using "current" technology (MySQL uses only one CPU per query, and CPUs for the last decade have been about the same). There are 20 partitions, averaging 160K rows each. All InnoDB blocks are cached in the buffer_pool, so this fails to point out how limited the I/O is. The 'filtering' performed is "population > 0", which eliminates 98.5% of the rows. This helps explain why thousands of rows are scanned to find 10.
    ⚈  2-5ms for most random tests performed
    ⚈  2-4 SELECTs performed (Com_select)
    ⚈  Handler_read value (from SHOW STATUS): Handler_read_next has the largest value, typically a few thousand, and closely matches Innodb_rows_read
    ⚈  1-3 iterations of the LOOP were performed
    ⚈  1 Created_tmp_table, typically 8 Sort_rows

Note: If you choose to have a single table, and it contains a TEXT field, then the tmp table will have to be on-disk as a MyISAM table, not a MEMORY table. This is one of the possible costs of the single-table approach.

With no filtering and a lower starting radius (5 mi instead of 15), Handler_read_next drops dramatically, and the overall time drops.
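
If you want to gather the same counters for your own data, here is a sketch (the coordinates and condition are illustrative; FindNearest itself is defined in the next sections):

    FLUSH STATUS;    -- zero the session counters
    CALL FindNearest(35.15, -90.05, 50, 10, 'population > 0');
    SHOW SESSION STATUS LIKE 'Handler%';
    SHOW SESSION STATUS LIKE 'Com_select';
    SHOW SESSION STATUS LIKE 'Created_tmp%';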

FindNearest


Here's the gist of it.
    ⚈  Make a guess at how close to "me" to look.
    ⚈  See how many items are in a 'square' around me, after filtering.
    ⚈  If not enough, repeat, doubling the width of the square.
    ⚈  After finding enough, or giving up because we are looking "too far", make one last pass to get all the data, ORDERed and LIMITed

Note that the loop merely uses 'squares' of lat/lng ranges. This is crude, but fits well with the partitioning and indexing, without needing excessive calls to GCDist. In the sample code, I picked 15 miles as the starting value. Adjusting this will have some impact on the Procedure's performance, but the impact will vary with the use cases. A rough way to set the starting radius is to guess a value that will find the desired LIMIT about half the time. (This value is hardcoded in the PROCEDURE.)

Parameters passed into FindNearest():
    ⚈  Latitude -- -90..90 (not scaled -- see hardcoded conversion in PROCEDURE)
    ⚈  Longitude -- -180..180 (not scaled)
    ⚈  Max distance -- in miles or km -- see hardcoded conversion in PROCEDURE
    ⚈  Limit -- maximum number of items to return
    ⚈  Condition -- something to put after 'AND' (more discussion above)

The procedure will find the nearest items, up to Limit, that meet the Condition. But it will give up at Max distance. (If you are at the South Pole, why bother searching very far for the tenth pizza parlor?)

Because of the "scaling", "hardcoding", "Condition", the table name, etc, this PROCEDURE is not truly generic and must be recoded for each application. Yes, I could have designed it to pass all that stuff in. But what a mess.

Timing: Under 10ms for "typical" usage; any dataset size. Slower for pathological cases (low min distance, high max distance, crossing dateline, bad filtering, etc)

End-cases:
    ⚈  By using GC distance, not Pythagoras, distances are 'correct' even near poles.
    ⚈  Poles -- Even if the "nearest" is 180 degrees away (longitude), it can find it.
    ⚈  Dateline -- There is a small, 'contained', piece of code for crossing the Dateline. Example: you are at +179 deg longitude, and the nearest item is at -179.

The procedure returns one resultset: all the columns of the table (SELECT *), plus dist.
    ⚈  Only rows that meet your Condition and are within Max distance are returned
    ⚈  At most Limit rows are returned
    ⚈  The rows will be ordered, "closest" first.
    ⚈  "dist" will be in miles or km (based on a hardcoded constant in the SP)

The code for deg*10000 and 'miles'


This version is based on scaling "Deg*10000 (MEDIUMINT)".
DELIMITER //

drop function if exists GCDist //
CREATE FUNCTION GCDist (
        _lat1 DOUBLE,  -- Scaled Degrees north for one point
        _lon1 DOUBLE,  -- Scaled Degrees west for one point
        _lat2 DOUBLE,  -- other point
        _lon2 DOUBLE
    ) RETURNS DOUBLE
    DETERMINISTIC
CONTAINS SQL  -- SQL but does not read or write
BEGIN
    -- Hardcoded constant:
    DECLARE _deg2rad DOUBLE DEFAULT PI()/1800000;  -- For scaled by 1e4 to MEDIUMINT
    DECLARE _rlat1 DOUBLE DEFAULT _deg2rad * _lat1;
    DECLARE _rlat2 DOUBLE DEFAULT _deg2rad * _lat2;
    -- compute as if earth's radius = 1.0
    DECLARE _rlond DOUBLE DEFAULT _deg2rad * (_lon1 - _lon2);
    DECLARE _m     DOUBLE DEFAULT COS(_rlat2);
    DECLARE _x     DOUBLE DEFAULT COS(_rlat1) - _m * COS(_rlond);
    DECLARE _y     DOUBLE DEFAULT               _m * SIN(_rlond);
    DECLARE _z     DOUBLE DEFAULT SIN(_rlat1) - SIN(_rlat2);
    DECLARE _n     DOUBLE DEFAULT SQRT(
                        _x * _x +
                        _y * _y +
                        _z * _z    );
    RETURN  2 * ASIN(_n / 2) / _deg2rad;   -- again--scaled degrees
END;
//

-- FindNearest (about my 6th approach)
drop procedure if exists FindNearest //
CREATE
PROCEDURE FindNearest (
        IN _my_lat DOUBLE,  -- Latitude of me [-90..90] (not scaled)
        IN _my_lon DOUBLE,  -- Longitude [-180..180]
        IN _max_dist DOUBLE,  -- Limit how far to search: miles or km
        IN _limit INT,     -- How many items to try to get
        IN _condition VARCHAR(1111)   -- ANDed; restricted to *Locations* fields
    )
    DETERMINISTIC
BEGIN
    -- lat and lng are in degrees -90..+90 and -180..+180
    -- All computations done in Latitude degrees.
    -- Thing to tailor
    --   *id* = column that JOINs *Locations* and *Info*
    --   *Locations*, *Info* -- the two tables
    --   Scaling of lat, lon; here using *10000 in MEDIUMINT
    --   Table name
    --   miles versus km.

    -- Hardcoded constant:
    DECLARE _deg2rad DOUBLE DEFAULT PI()/1800000;  -- For scaled by 1e4 to MEDIUMINT

    -- Cannot use params in PREPARE, so switch to @variables:
    -- Hardcoded constant:
    SET @my_lat := _my_lat * 10000,
        @my_lon := _my_lon * 10000,
        @deg2dist := 0.0069172,  -- 69.172 for miles; 111.325 for km  *** (mi vs km)
        @start_deg := 15 / @deg2dist,  -- Start with this radius first (15 miles)
        @max_deg := _max_dist / @deg2dist,
        @cutoff := @max_deg / SQRT(2),  -- (slightly pessimistic)
        @dlat := @start_deg,  -- note: must stay positive
        @lon2lat := COS(_deg2rad * @my_lat),
        @iterations := 0;        -- just debugging

    -- Loop through, expanding search
    --   Search a 'square', repeat with bigger square until find enough rows
    --   If the inital probe found _limit rows, then probably the first
    --   iteration here will find the desired data.
    -- Hardcoded table name:
    -- This is the "first SELECT":
    SET @sql = CONCAT(
        "SELECT COUNT(*) INTO @near_ct
            FROM Locations
            WHERE lat    BETWEEN @my_lat - @dlat
                             AND @my_lat + @dlat   -- PARTITION Pruning and bounding box
              AND lon    BETWEEN @my_lon - @dlon
                             AND @my_lon + @dlon   -- first part of PK
              AND ", _condition);
    PREPARE _sql FROM @sql;
    MainLoop: LOOP
        SET @iterations := @iterations + 1;
        -- The main probe: Search a 'square'
        SET @dlon := ABS(@dlat / @lon2lat);  -- good enough for now  -- note: must stay positive
        -- Hardcoded constants:
        SET @dlon := IF(ABS(@my_lat) + @dlat >= 900000, 3600001, @dlon);  -- near a Pole
        EXECUTE _sql;
        IF ( @near_ct >= _limit OR         -- Found enough
             @dlat >= @cutoff ) THEN       -- Give up (too far)
            LEAVE MainLoop;
        END IF;
        -- Expand 'square':
        SET @dlat := LEAST(2 * @dlat, @cutoff);   -- Double the radius to search
    END LOOP MainLoop;
    DEALLOCATE PREPARE _sql;

    -- Out of loop because found _limit items, or going too far.
    -- Expand range by about 1.4 (but not past _max_dist),
    -- then fetch details on nearest 10.

    -- Hardcoded constant:
    SET @dlat := IF( @dlat >= @max_deg OR @dlon >= 1800000,
                @max_deg,
                GCDist(ABS(@my_lat), @my_lon,
                       ABS(@my_lat) - @dlat, @my_lon - @dlon) );
            -- ABS: go toward equator to find farthest corner (also avoids poles)
            -- Dateline: not a problem (see GCDist code)

    -- Reach for longitude line at right angle:
    -- sin(dlon)*cos(lat) = sin(dlat)
    -- Hardcoded constant:
    SET @dlon := IFNULL(ASIN(SIN(_deg2rad * @dlat) /
                             COS(_deg2rad * @my_lat))
                            / _deg2rad -- precise
                        , 3600001);    -- must be too near a pole

    -- This is the "last SELECT":
    -- Hardcoded constants:
    IF (ABS(@my_lon) + @dlon < 1800000 OR    -- Usual case - not crossing dateline
        ABS(@my_lat) + @dlat <  900000) THEN -- crossing pole, so dateline not an issue
        -- Hardcoded table name:
        SET @sql = CONCAT(
            "SELECT *,
                    @deg2dist * GCDist(@my_lat, @my_lon, lat, lon) AS dist
                FROM Locations
                WHERE lat BETWEEN @my_lat - @dlat
                              AND @my_lat + @dlat   -- PARTITION Pruning and bounding box
                  AND lon BETWEEN @my_lon - @dlon
                              AND @my_lon + @dlon   -- first part of PK
                  AND ", _condition, "
                HAVING dist <= ", _max_dist, "
                ORDER BY dist
                LIMIT ", _limit
                        );
    ELSE
        -- Hardcoded constants and table name:
        -- Circle crosses dateline, do two SELECTs, one for each side
        SET @west_lon := IF(@my_lon < 0, @my_lon, @my_lon - 3600000);
        SET @east_lon := @west_lon + 3600000;
        -- One of those will be beyond +/- 180; this gets points beyond the dateline
        SET @sql = CONCAT(
            "( SELECT *,
                    @deg2dist * GCDist(@my_lat, @west_lon, lat, lon) AS dist
                FROM Locations
                WHERE lat BETWEEN @my_lat - @dlat
                              AND @my_lat + @dlat   -- PARTITION Pruning and bounding box
                  AND lon BETWEEN @west_lon - @dlon
                              AND @west_lon + @dlon   -- first part of PK
                  AND ", _condition, "
                HAVING dist <= ", _max_dist, " )
            UNION ALL
            ( SELECT *,
                    @deg2dist * GCDist(@my_lat, @east_lon, lat, lon) AS dist
                FROM Locations
                WHERE lat BETWEEN @my_lat - @dlat
                              AND @my_lat + @dlat   -- PARTITION Pruning and bounding box
                  AND lon BETWEEN @east_lon - @dlon
                              AND @east_lon + @dlon   -- first part of PK
                  AND ", _condition, "
                HAVING dist <= ", _max_dist, " )
            ORDER BY dist
            LIMIT ", _limit
                        );
    END IF;

    PREPARE _sql FROM @sql;
    EXECUTE _sql;
    DEALLOCATE PREPARE _sql;
END;
//
DELIMITER ;

Using a Single Table


If all the data about all the points will fit in one table in the buffer_pool, then it is probably better to use a single table.

In this case, the `id` part of the PRIMARY KEY can be any field that is unique.

Using Two Tables


If a single table cannot be cached, it is better to split the data vertically. The first table (let's call it `Locations`) has a minimal set of columns to run the Stored Procedure. The second table (let's call it `Info`) has the rest of the data.

We can hope that this "vertical" split makes Locations small enough to be fully cached; even if it is not, we can depend on the algorithm to minimize the number of I/Os needed for searches -- clearly far fewer than if you depended on just partitioning or just an index (of any flavor).

If you are searching for, say, 10 items, we want to not touch the Info table until we have determined exactly the 10 items (via their `id`s); then there will be at most 10 I/Os into Info, regardless of how randomly the rows are arranged. (Yeah, it could be more than 10 because of BTree activity, off-page blob storage, etc; but let's call it 10.)

Let's analyze the I/O for Locations, assuming that it cannot be fully cached. The "first SELECT" looks at a 'square' around a given latitude and longitude. PARTITION pruning will kick in first, and limit the scope to 1 (sometimes 2) PARTITION. Within that PARTITION (think of it as a table), the WHERE clause kicks in and scans a range of longitudes. This "range scan" is over consecutive rows in the data because of the clustered PRIMARY KEY starting with `lon`. If that range hits 100 rows, then probably 1-2 blocks will be hit (possibly needing an I/O each). If it hits 1K rows, then it might be a dozen blocks. As the scope widens (the LOOP), these numbers could grow. (In theory they could grow to encompass the entire table, hence the "max_dist" argument, which gives you some control.)

The "last SELECT" is a 'square' in which it knows that there are at least 10 items (if you are looking for the 10 nearest). This query is repeating what was already fetched by the first SELECT, so all the blocks will be cached (with rare exceptions). No extra I/Os.

The 'square' might be bigger than it absolutely needed to be, and the corners of the square are wasted since the desired items are only within the inscribed circle. Overall, the algorithm might touch a few times as many rows as needed. This is still _much_ better than scanning an entire latitude range, which is the best that can be done by other techniques.

Principles in having two tables:
    ⚈  `id` (or equivalent) is used to JOIN the tables
    ⚈  `id`, if an AUTO_INCREMENT, will be such in `Info`, not `Locations`
    ⚈  `id` should be the PRIMARY KEY in `Info`
    ⚈  INDEX(id) may be desired in `Locations`, in case you need to delete entries, etc.
    ⚈  Any field in "condition" _must_ be in Locations, and _may_ be (redundantly) in Info.
    ⚈  Minimize the size of fields. For example, the sample code uses MEDIUMINT (3 bytes) instead of INT (4 bytes).
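
Here is a sketch of the second table under those principles (every column other than id is illustrative):

    CREATE TABLE Info (
        id MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,  -- JOINs to Locations.id
        name VARCHAR(100) NOT NULL,
        population INT UNSIGNED NOT NULL,   -- may redundantly repeat the filter field
        notes TEXT,                         -- bulky columns belong here, not in Locations
        PRIMARY KEY (id)
    ) ENGINE=InnoDB;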

To do the JOIN, you can either do it after getting back from the Stored Procedure, or you can modify the SP something like the following. Instead of the "last select",
    -- This is the "last SELECT":
    IF (ABS(@my_lon) + @dlon ... ) THEN
        SET @sql = CONCAT( ... the simpler SELECT with ORDER BY dist ...);
    ELSE
        SET @sql = CONCAT( ... the pair of SELECTs with UNION ALL ... );
    END IF;

    -- Use that SELECT as a subquery while JOINing to `Info`:
    SET @sql = CONCAT(
        "SELECT Info.*, Loc.dist        -- the fields you are likely to want
            FROM ( ", @sql, " ) AS Loc  -- subquery
            JOIN Info  USING (id)
            ORDER BY Loc.dist"          -- SQL won't guarantee order without this
                     );

    PREPARE _sql FROM @sql;
    EXECUTE _sql;
    DEALLOCATE PREPARE _sql;

Postlog


Original writing -- Apr, 2012; 'Performance' section and 2-table discussion added -- Sep, 2014.

Requires at least MySQL 5.1

A forum thread discussing this algorithm.

Akiban has a builtin mechanism for doing the equivalent. It involves a "Z-Order" index and can mix with other conditions in a WHERE clause. See Akiban Spatial Indexes
(Link may be broken now that FoundationDB and Akiban have merged.)


Contact me by posting a question at MySQL Forums :: Performance
-- Rick James
