MySQL Techniques for "Find Nearest"

Table of Contents

The Problem
Static versus Moving Items
Dateline and Poles
Polar Bears
Computing Distance
The Contenders
Test case
No-Index
Bounding box
Spatial
Partitioning
Z-Order
Comparison of methods
GCDistDeg
Postlog
Brought to you by Rick James

The Problem


Searching for the "nearest" on a map, needs a 2D index. But indexes are one-dimensional. This article discusses several techniques, starting with obvious, but inefficient, methods, and moving into efficient methods.

Static versus Moving Items


The test data in this discussion is static -- cities. However, the principles work with dynamic data, such as trucks. The notable difference is that the underlying data will be moving around. This does not hurt the algorithms, but it poses a potentially worse performance challenge. Most algorithms have the PRIMARY KEY depend on location, so when an item is moved, it must be deleted from one spot in the table and reinserted in another spot.

Dateline and Poles


Cities, businesses, and trucks are not likely to be located near poles, or even near the Dateline. Both of these are computational problems when computing distances.

We can mostly assume there are few if any items near the poles. The Dateline is less easy to sweep under the rug; there are a few cities, businesses, and trucks near longitude -180 or +180, especially in Fiji. The No-Index algorithm is works fine assuming that the distance formula works. The Partition algorithm has special code. The others probably need extra code.

Polar Bears


If you have tagged polar bears and are tracking them, the discussion here is of limited use. You have the problems discussed about poles, and the hassles of moving items.

If all the items are in one polar region, I would recommend projecting the globe onto a square with the pole at the center. Have the projection go some number of degrees toward the equator. On the projection, use x & y coordinates, not polar coordinates. Then the square will work nicely with any of the algorithms.

Computing Distance


Pythagorean -- This is rather inaccurate at the poles. Don't bother using it for anything.

GCDistDeg -- This Stored Function uses the haversine algorithm. Code is below.

ST_Distance_Sphere -- This works only with Geometry items, and requires MySQL 5.7.6 or newer.

The Contenders


These five algorithms will be discussed further below. They are listed in (roughly) most-efficient to least-efficient order. The algorithm assumes a table with items with latitude and longitude coordinates, plus other columns. The goal is to "Find the nearest N items to a given lat/lng location, plus optional filtering." Filtering examples: Businesses->Restaurants; Polar bears->Male; Trucks->Tractor-trailors; Cities->Population > 100K.

No-Index - This is the straightforward case that people start with. However, "find nearest" involves a full table scan, computing the distance to every point, sorting by distance, then delivering the closest item(s). This is OK for thousands of items, but not for millions.

Bounding Box - This involves 2 indexes and an addition to the WHERE clause. There is some arithmetic to compute ranges for latitude and longitude for that clause. It is orders of magnitude faster than No-Index, but significantly slower than the other algorithms when handling millions of rows. This is not practical for billions of rows.

Spatial - This involves a bounding box and an index. It is relatively simple to set up. It can be slow if the limit is too high and there are lots if items. (Example: You are in a big city and you ask for the nearest restaurant, limited to 50 miles.)

Partitioning - What is really needed is a two-dimensional index, but indexes don't work that way. However, with PARTITION BY RANGE, one can approximate one dimension with partitions, then use the PRIMARY KEY for the other dimension. The algorithm uses a bounding box that expands until the row(s) are found or the limit is reached.

Z-Order - That takes a different tack; it plays with the bits of latitude and longitude to blend them together. Now, items that are "near" each other on the globe are _usually_ "near" each other with this contrived number. In this algorithm, a bounding box starts full sized, but shrinks as needed. It involves 2-6 SELECTs and scales well.

Test case


For experimenting with the algorithms, I picked a dataset of 3.1 million cities around the world. This provided oceans (no cities) versus crowded areas. And it avoided the poles, so the code for such could be postponed.

For bounding boxes, I picked (arbitrarily) 50 miles (80.57km, 0.723 degrees of latitude). And I picked (arbitrarily) to pick the 10 nearest cities. But I deliberately picked not to test with filters, since an appropriate filter could lead to finding zero results, regardless of the radius.

No-Index


This is what the novice starts with, and it leads to a full table scan. Slow.
    SELECT *, GCDistDeg($lat_me, $lng_me, lat, lng) * 69.172 AS miles
        FROM t
        ORDER BY miles ASC
        LIMIT 10;

Bounding box


With suitable indexes, and use of a "bounding box" in the WHERE clause, you can easily improve on the No-Index algorithm.
    WHERE lat BETWEEN ... AND ...
      AND lng BETWEEN ... AND ...

If you are looking for items within 0.2 degree of latitude (= 14 miles = 22 km) of $my_lat, $my_lng, then
    WHERE lat BETWEEN $my_lat - 0.2
                  AND $my_lat + 0.2
      AND lng BETWEEN $my_lng - 0.2 / COS(RADIANS($my_lat))
                  AND $my_lng + 0.2 / COS(RADIANS($my_lat))

Dividing by the cosine approximately adjusts for how longitude lines creep toward each other as you go away from the equator. This approximation is 'good enough' except very close to the poles.

Use these two indexes:
    INDEX(lat, lng)
    INDEX(lng, lat)

The Optimizer will pick between the two indexes based on statistics. Do not also provide single-column indexes on lat and lng; the Optimizer might pick one of them, thereby preventing the use of "Using index condition" (aka ICP - index condition pushdown). ICP sometimes provides a 2x speedup.

If, say, there are 1000 items in the latitude range (an east-west stripe) but 4000 in the longitude range (north-south), the Optimizer will prefer (lat, lng). With that, it will scan the 1000, then check each according to the longitude range.

A slightly more precise, especially at high latitudes, way of computing the longitude with based on radius is (using PHP)
    // $radius is in degrees, not miles or km:
    $dlng = $radius / cos(deg2rad($my_lat));   // The method described above
    $dlng = rad2deg(asin( sin(deg2rad($radius)) / cos(deg2rad($my_lat)) ));  // better
    or, in SQL:
      AND lng BETWEEN ? - DEGREES(ASIN( SIN(RADIANS(?)) / COS(RADIANS(?)) ))
                  AND ? + DEGREES(ASIN( SIN(RADIANS(?)) / COS(RADIANS(?)) ))
    and bind with  $my_lng, $radius, $my_lat,
                   $my_lng, $radius, $my_lat
Caution: If you are too close ($my_lat+$radius >= 90) to the pole, $dlng will be NaN (Not a Number).

The bounding-box method is orders of magnitude faster than the no-index case. However, it requires that you limit how far you need to search. And, if you are in a crowded city looking for a coffee bar, you might still have to filter through thousands of rows.

It turns out that all of the following techniques also use a bounding box, but often buried in the code. Without a box, the search for something that does not exist anywhere would require searching the entire table -- slow. And the end-user probably does not care if the thing being searched for is on the other side of the world.

The bounding box technique when picking from a 50-mile radius from 3.1 million items (cities) will typically take between 1ms and 60ms if the data is cached; longer if it must hit disk.

Spatial

     SELECT ...,
            ST_Distance_Sphere(lat, lng) AS meters
         FROM table
         WHERE ... (any filtering)
           AND MBRContains(GeomFromText(Polygon(...)),  -- The bounding box is encoded as a 'square'
                           sp_point)
         ORDER BY distance ASC, id
         LIMIT ...

Notes:
    ⚈  Select whatever columns you need.
    ⚈  Use some spherical distance formula, and convert to the desired units (miles or km or whatever)
    ⚈  For efficiency, the table should have extra columns to facilitate filtering (eg coffee shops / businesses)
    ⚈  Use LIMIT to control the number of items to return

Distance calculation:
    ⚈  GCDistDeg(lat1, lng1, lat2, lng2) -- args and return are in degrees
    ⚈  With MySQL 5.7.6 or newer, ST_Distance_Sphere(point1, point2) -- args are POINTs, returns meters

Engine:
    ⚈  Pre 5.7.6 (MariaDB 10.2.2): use MyISAM
    ⚈  5.7.6+ (10.2.2+): Use InnoDB (or MyISAM)

Partitioning


This is attempts to make a two-dimensional index out of the tools available in MySQL. The partitions are on some function of latitude. Then the PRIMARY KEY starts with longitude. That way you first get an east-west stripe, sometimes more than one stripe. After that, you can limit the searhing because of lng being first in the PK. (Again, use a bounding box.)

See this for an implementation of such:
Find the nearest 10 pizza parlors

Z-Order


This involves blending the lat and lng numbers together to form a number that has geographically nearby items with similar number. Yeah, it sounds like magic. Yeah, it is not perfect, but it is good enough to be very fast -- touching an average of 3 data blocks of the table to find the 10 nearest items in up to 50 miles.

To keep the bounding boxes close to square, longitudes are keep as is, but latidudes are stretched by an unobvious function (the integral of secant). About 85 degrees north/south is where the algorithm becomes dubious. This is fine for cities, businesses, and trucks but not for polar bears.

Wikipedia on Z-roder curve
Implementing zorder in MySQL

Comparison of methods


Test case
    ⚈  3.1 million cities in the world
    ⚈  Start at some semi-random starting point
    ⚈  "semi-random" meaning that it avoids the middle of oceans, etc.
    ⚈  Find the 10 nearest
    ⚈  Limit the search to 50 miles

under construction
    Method        Rows  Blocks ms
    ------------
    no-index
    bounding-box
    partitioning
    zorder
    spatial

Metrics
    ⚈  Elapsed time is in milliseconds, taken around the call to the routine
    ⚈  "Rows" generally comes from innodb_rows_read
    ⚈  All data is cached.
    ⚈  "Blocks" is a hack to guesstimate how many disk blocks would be fetched from disk if the table were significantly bigger than cache.
    ⚈  So, think of the elapsed time as being CPU time, but the "Blocks" as being more important if short on RAM.
    ⚈  Rows/10 is an interesting metric -- "How many rows had to be touched to find the 10 desired". For no-index and bounding-box, those grow as the dataset grows. For the others, the ratio is relatively constant.

No filtering is done for these tests. That is, all nearby cities count toward the 10. In a real situation, the "items" might be "businesses", and the query might want to filter down to just "coffee purveyors".

GCDistDeg


Haversine algorithm for computing the distance between two lat/lng pairs anywhere on earth.
DELIMITER //
CREATE  FUNCTION `GCDistDeg`(
        _lat1 DOUBLE,
        _lon1 DOUBLE,
        _lat2 DOUBLE,
        _lon2 DOUBLE
    ) RETURNS double
    DETERMINISTIC
    SQL SECURITY INVOKER
    COMMENT 'Degrees in, Degrees out.  Use 69.172 mi/deg'
BEGIN

    DECLARE _deg2rad DOUBLE DEFAULT PI()/180;
    DECLARE _rlat1 DOUBLE DEFAULT _deg2rad * _lat1;
    DECLARE _rlat2 DOUBLE DEFAULT _deg2rad * _lat2;

    DECLARE _rlond DOUBLE DEFAULT _deg2rad * (_lon1 - _lon2);
    DECLARE _m     DOUBLE DEFAULT COS(_rlat2);
    DECLARE _x     DOUBLE DEFAULT COS(_rlat1) - _m * COS(_rlond);
    DECLARE _y     DOUBLE DEFAULT               _m * SIN(_rlond);
    DECLARE _z     DOUBLE DEFAULT SIN(_rlat1) - SIN(_rlat2);
    DECLARE _n     DOUBLE DEFAULT SQRT(
                        _x * _x +
                        _y * _y +
                        _z * _z    );
    RETURN  2 * ASIN(_n / 2) / _deg2rad;
END
//

Postlog


This problem has been a fun challenge that I have worked on off-and-on for many years.

-- Rick James

MySQL Documents by Rick James

HowTo Techniques for Optimizing Tough Tasks:

Partition Maintenance (DROP+REORG) for time series (includes list of PARTITION uses)
Big DELETEs - how to optimize -- and other chunking advice, plus a use for PARTITIONing
    Chunking lengthy DELETE/UPDATE/etc.
Data Warehouse techniques:
    Overview   Summary Tables   High speed ingestion  
Entity-Attribute-Value -- a common, poorly performing, design pattern (EAV); plus an alternative
5 methods for 'Find Nearest'
Find the nearest 10 pizza parlors -- efficient searching on Latitude + Longitude (another PARITION use)
    Lat/Long representation choices
Z-Order 'find nearest'(under construction)
Pagination, not with OFFSET, LIMIT
Techniques on efficiently finding a random row (On beyond ORDER BY RAND())
GUID/UUID Performance (type 1 only)
IP Range Table Performance -- or other disjoint ranges
Rollup Unique User Counts
Alter of a Huge table -- Mostly obviated by 5.6
Latest 10 news articles -- how to optimize the schema and code for such
Build and execute a "Pivot" SELECT (showing rows as columns)
Find largest row for each group ("groupwise max")

Other Tips, Tuning, Debugging, Optimizations, etc...

Rick's RoTs (Rules of Thumb -- lots of tips)
Memory Allocation (caching, etc)
Character Set and Collation problem solver
    Trouble with UTF-8   If you want case folding, but accent sensitivity, please file a request at http://bugs.mysql.com .
    Python tips,   PHP tips,   other language tips
    utf8 Collations   utf8mb4 Collations on 8.0
Converting from MyISAM to InnoDB -- includes differences between them
Compound INDEXes plus other insights into the mysteries of INDEXing
Cookbook for Creating Indexes
    Many-to-many mapping table   wp_postmeta   UNION+OFFSET
MySQL Limits -- built-in hard limits
    767-byte INDEX limit
Galera, tips on converting to (Percona XtraDB Cluster, MariaDB 10, or manually installed)
5.7's Query Rewrite -- perhaps 5.7's best perf gain, at least for this forum's users
Request for tuning / slowlog info
Best of MySQL Forum -- index of lots of tips, discussions, etc

Analyze MySQL Performance
    Analyze VARIABLEs and GLOBAL STATUS     Analyze SlowLog

My slides from conferences
Percona Live 4/2017 - Rick's RoTs (Rules of Thumb) - MySQL/MariaDB
Percona Live 4/2017 - Index Cookbook - MySQL/MariaDB
Percona Live 9/2015 - PARTITIONing - MySQL/MariaDB
(older ones upon request)

Contact me via LinkedIn; be sure to include a brief teaser in the Invite request:   View Rick James's profile on LinkedIn

Did my articles help you out? Like what you see? Consider donating:

☕️ Buy me a Banana Latte ($4) There is no obligation but it would put a utf8mb4 smiley 🙂 on my face, instead of the Mojibake "🙂"