MySQL Techniques for "Find Nearest"

Table of Contents

The Problem

Static versus Moving Items

Dateline and Poles

Polar Bears

Computing Distance

The Contenders

Test case

No-Index

Bounding box

Spatial

Partitioning

Z-Order

Comparison of methods

GCDistDeg

Representation choices

Postlog

Brought to you by Rick James

The Problem


Searching for the "nearest" on a map needs a 2D index. But indexes are one-dimensional. This article discusses several techniques, starting with obvious, but inefficient, methods, and moving into efficient methods.

Static versus Moving Items


The test data in this discussion is static -- cities. However, the principles work with dynamic data, such as trucks. The notable difference is that the underlying data will be moving around. This does not hurt the algorithms, but it adds a burden to the UPDATEs. Most algorithms have the PRIMARY KEY depend on location, so when an item is moved, it must be deleted from one spot in the table and reinserted in another spot.

Dateline and Poles


Cities, businesses, and trucks are not likely to be located near poles, or even near the Dateline. Both of these are computational problems when computing distances.

We can mostly assume there are few if any items near the poles. The Dateline is less easy to sweep under the rug; there are a few cities, businesses, and trucks near longitude -180 or +180, especially in Fiji. The No-Index algorithm is works fine assuming that the distance formula works. The Partition algorithm has special code. The others probably need extra code.

Polar Bears


If you have tagged polar bears and are tracking them, the discussion here is of limited use. You have the problems discussed about poles, and the hassles of moving items.

If all the items are in one polar region, I would recommend projecting the globe onto a square with the pole at the center. Have the projection go some number of degrees toward the equator. On the projection, use x & y coordinates, not polar coordinates. Then the square will work nicely with any of the algorithms.

Computing Distance


Pythagorean -- This is rather inaccurate at the poles. It is faster than GCDistDeg, but is not very accurate for long distances. You could consider it for reasonably short distances (up to tens of miles or kilometers). Be sure to adjust longitude by the cosine of the latitude.

GCDistDeg -- This Stored Function uses the haversine algorithm to calculate the Great Circle Distance. Code is below.

ST_Distance_Sphere -- This works only with Geometry items, and requires MySQL 5.7.6 or newer.

Timing: ST_Distance_Sphere is 10-20 times as fast as GCDistDeg. The downside is that you must have a bulky Point column instead of smaller lat and lng columns.

The Contenders


Five algorithms will be discussed further below. They are listed in (roughly) most-efficient to least-efficient order. The algorithms use a table with items with latitude and longitude coordinates, plus other columns. Each algorithm needs a distinctly different schema (different indexes, partitioning, spatial, etc). The goal is to "Find the nearest N items to a given lat/lng location, plus optional filtering." Filtering examples: Businesses->Restaurants; Polar bears->Male; Trucks->Tractor-trailors; Cities->(Population > 100K).

No-Index - This is the straightforward case that people start with. However, "find nearest" involves a full table scan, computing the distance to every point, sorting by distance, then delivering the closest item(s). This is OK for thousands of items, but not for millions.

Bounding Box - (Also called "bounding square".) This involves 2 indexes and an addition to the WHERE clause. There is some arithmetic to compute ranges for latitude and longitude for that clause. It is orders of magnitude faster than No-Index, but significantly slower than the other algorithms when handling millions of rows. This is not practical for heavy use with billions of rows.

Spatial - This involves a bounding box and an index. It is relatively simple to set up. It can be slow if the limit is too high and there are lots if items. (Example: You are in a big city and you ask for the nearest restaurant, limited to 50 miles. But fast in a desert looking for water.)

Partitioning - What is really needed is a two-dimensional index, but indexes don't work that way. However, with PARTITION BY RANGE, one can approximate one dimension with partitions, then use the PRIMARY KEY for the other dimension. The algorithm uses a bounding box that expands until enough row(s) are found or the limit is reached.

Z-Order - This takes a different tack; it plays with the bits of latitude and longitude to blend them together. Now, items that are "near" each other on the globe are usually "near" each other with this contrived number. In this algorithm, a bounding box starts full sized, but shrinks as needed. It involves 2-6 SELECTs and scales well.

Both Partitioning and Z-order do the best job of minimizing the number of disk blocks touched. This makes either excellent for scaling into billions of rows, while providing fast results. Part of the secret is using multiple SELECTs and limiting how hard each one has to work.

Both Bounding Box and Spatial are reasonably fast most of the time. But when the density of items is high and the range limitation is high, they may need to hit a lot of rows.

Test case


For experimenting with the algorithms, I picked a dataset of 3.1 million cities around the world. This provided oceans (no cities) versus crowded areas. This avoided the poles. Code for such is not available.

For bounding boxes, I picked (arbitrarily) 50 miles (80.57km, 0.723 degrees of latitude). And I picked (arbitrarily) to request the 10 nearest cities. But I deliberately picked not to test with filters, since an appropriate filter could lead to finding zero results, regardless of the radius.

No-Index


This is what the novice starts with, and it always requires a full table scan. Slow.
    SELECT *, GCDistDeg($lat_me, $lng_me, lat, lng) * 69.172 AS miles
        FROM t
        ORDER BY miles ASC
        LIMIT 10;

Bounding box


With suitable indexes, and use of a "bounding box" in the WHERE clause, you can easily improve on the No-Index algorithm.
    WHERE lat BETWEEN ... AND ...
      AND lng BETWEEN ... AND ...
    HAVING distance <= ...

If you are looking for items within $m miles of $my_lat, $my_lng, then
    WHERE lat BETWEEN $my_lat - $m/69.172
                  AND $my_lat + $m/69.172
      AND lng BETWEEN $my_lng - $m/69.172 / COS(RADIANS($my_lat))
                  AND $my_lng + $m/69.172 / COS(RADIANS($my_lat))
    HAVING miles <= $m

Dividing by the cosine approximately adjusts for how longitude lines creep toward each other as you go away from the equator. This approximation is 'good enough' except very close to the poles.

Use these two indexes:
    INDEX(lat, lng)
    INDEX(lng, lat)

The Optimizer will pick between the two indexes based on statistics. Do not also provide single-column indexes on lat and lng; the Optimizer might pick one of them, thereby preventing the use of "Using index condition" (aka ICP - index condition pushdown). ICP sometimes provides a 2x speedup.

If, say, there are 1000 items in the latitude range (an east-west stripe) but 4000 in the longitude range (north-south), the Optimizer will prefer (lat, lng). With that, it will scan the 1000, then check each according to the longitude range.

A slightly more precise, especially at high latitudes, way of computing the longitude with based on radius is (using PHP)
    // $radius is in degrees, not miles or km:
    $dlng = $radius / cos(deg2rad($my_lat));   // The method described above
    $dlng = rad2deg(asin( sin(deg2rad($radius)) / cos(deg2rad($my_lat)) ));  // better
    or, in SQL:
      AND lng BETWEEN ? - DEGREES(ASIN( SIN(RADIANS(?)) / COS(RADIANS(?)) ))
                  AND ? + DEGREES(ASIN( SIN(RADIANS(?)) / COS(RADIANS(?)) ))
    and bind with  $my_lng, $radius, $my_lat,
                   $my_lng, $radius, $my_lat
Caution: If you are too close ($my_lat+$radius >= 90) to the pole, $dlng will be NaN (Not a Number).

The bounding-box method is orders of magnitude faster than the no-index case. However, it requires that you limit how far you need to search. And, if you are in a crowded city looking for a coffee bar, you might still have to filter through thousands of rows.

It turns out that all of the following techniques also use a bounding box, but often buried in the code. Without a box, the search for something that does not exist anywhere would require searching the entire table -- slow. And the end-user probably does not care if the thing being searched for is on the other side of the world.

The bounding box technique when picking from a 50-mile radius from 3.1 million items (cities) will typically take between 1ms and 60ms if the data is cached; longer if it must hit disk. The high end is for areas of Bangladesh, which has lots of cities crowded together.

Spatial

     SELECT ...,
            ST_Distance_Sphere(center_point, item_point) AS meters
         FROM table
         WHERE ... (any filtering)
           AND MBRContains(GeomFromText(Polygon(...)),  -- The bounding box is encoded as a 'square'
                           sp_point)
         HAVING meters <= ...
         ORDER BY distance ASC, id ASC
         LIMIT ...

Notes:
    ⚈  Select whatever columns you need.
    ⚈  Use some spherical distance formula, and convert to the desired units (miles or km or whatever)
    ⚈  For efficiency, the table should have extra columns to facilitate filtering (eg coffee shops / businesses)
    ⚈  Use LIMIT to control the number of items to return
    ⚈  The polygon corners are like the bounding box discussed above, including the cosine adjustment.

Distance calculation:
    ⚈  GCDistDeg(lat1, lng1, lat2, lng2) -- args and return are in degrees; adjust longitudes by cos(lat).
    ⚈  With MySQL 5.7.6 or newer, ST_Distance_Sphere(point1, point2) -- args are POINTs, returns meters

Engine:
    ⚈  Pre 5.7.6 (MariaDB 10.2.2): use MyISAM and GCDistDeg
    ⚈  For 5.7.6+ (10.2.2+): Use InnoDB (or MyISAM) and ST_Distance_Sphere

Partitioning


This attempts to make a two-dimensional index out of the tools available in MySQL. The partition ranges are on some function of latitude. Then the PRIMARY KEY starts with longitude. That way you first get an east-west stripe (one partition); sometimes more than one stripe. After that, you can limit the searhing because of lng being first in the PK. (Again, use a bounding box.)

See this for an implementation of such:
Find the nearest 10 pizza parlors

Z-Order


This involves blending the lat and lng numbers together to form a number that has geographically nearby items with numerically close numbers. Yeah, it sounds like magic. Yeah, it is not perfect, but it is good enough to be very fast -- touching an average of 2 data blocks of the table to find the 10 nearest cities up to 50 miles.

To keep the bounding boxes close to square, longitudes are keep as is, but latidudes are stretched by an unobvious function (the integral of secant). About 85 degrees north/south is where the algorithm becomes dubious. This is fine for cities, businesses, and trucks but not for polar bears.

Wikipedia on Z-roder curve
Implementing zorder in MySQL

Comparison of methods


Test case
    ⚈  3.1 million cities in the world
    ⚈  Start at some semi-random starting point
    ⚈  "semi-random" meaning that it avoids the middle of oceans, etc.
    ⚈  Find the 10 nearest cities
    ⚈  Limit the search to 50 miles

under construction
    Method        Rows  Blocks  ms sels
    ------------
    no-index      3.1M    30K  40K   1
    bounding-box  1700     96   24   1
    spatial       1900      ?   15   1
    zorder          54      2    3   4
    partitioning   570      4    2   3

Metrics
    ⚈  Elapsed time is in milliseconds, taken around the call to the routine, which is a Stored Proc for Partitioning, or PHP code for Zorder, or just an SQL query for the others.
    ⚈  "Rows" generally comes from innodb_rows_read
    ⚈  All data is assumed to be cached.
    ⚈  "Blocks" is a hack to guesstimate how many disk blocks would be fetched from disk if the table were significantly bigger than cache.
    ⚈  So, think of the elapsed time as being CPU time, but the "Blocks" as being more important if you are short on RAM, leading to being I/O-bound.
    ⚈  Rows/10 is an interesting metric -- "How many rows had to be touched to find the 10 desired". For no-index and bounding-box, those grow as the dataset grows. For the others, the ratio is relatively constant.
    ⚈  Bounding-box and Spatial are 'slow' because of needing to calculate Distance for so many rows.

No filtering is done for these tests. That is, all nearby cities count toward the 10. In a real situation, the "items" might be "businesses", and the query might want to filter down to just "coffee purveyors". The table is designed with filtering in mind. That is, you should add columns (within reason) to the table to allow for queries to limit the search to only green vehicles, transvestite polar bears, or large cities.

GCDistDeg


Haversine algorithm for computing the distance between two lat/lng pairs anywhere on earth.
DELIMITER //
CREATE  FUNCTION `GCDistDeg`(
        _lat1 DOUBLE,
        _lon1 DOUBLE,
        _lat2 DOUBLE,
        _lon2 DOUBLE
    ) RETURNS double
    DETERMINISTIC
    SQL SECURITY INVOKER
    COMMENT 'Degrees in, Degrees out.  For conversion: 69.172 mi/deg or 111.325 km/deg'
BEGIN

    DECLARE _deg2rad DOUBLE DEFAULT PI()/180;
    DECLARE _rlat1 DOUBLE DEFAULT _deg2rad * _lat1;
    DECLARE _rlat2 DOUBLE DEFAULT _deg2rad * _lat2;

    DECLARE _rlond DOUBLE DEFAULT _deg2rad * (_lon1 - _lon2);
    DECLARE _m     DOUBLE DEFAULT COS(_rlat2);
    DECLARE _x     DOUBLE DEFAULT COS(_rlat1) - _m * COS(_rlond);
    DECLARE _y     DOUBLE DEFAULT               _m * SIN(_rlond);
    DECLARE _z     DOUBLE DEFAULT SIN(_rlat1) - SIN(_rlat2);
    DECLARE _n     DOUBLE DEFAULT SQRT(
                        _x * _x +
                        _y * _y +
                        _z * _z    );
    RETURN  2 * ASIN(_n / 2) / _deg2rad;
END
//

Representation choices

   Datatype           Bytes       resolution
   ------------------ -----  --------------------------------
Normal use:
   DECIMAL(4,2)/(5,2)     5  1570 m    1.0 mi  Cities
   DECIMAL(6,4)/(7,4)     7    16 m     52 ft  Houses/Businesses
   DECIMAL(8,6)/(9,6)     9    16cm    1/2 ft  Friends in a mall
   FLOAT                  8   1.7 m    5.6 ft  Vehicles
   DOUBLE                16   3.5nm     ...    Fleas on a dog
For Partitioning, which has limitations on datatype for "partition key":
   Deg*100 (SMALLINT)     4  1570 m    1.0 mi  Cities
   Deg*10000 (MEDIUMINT)  6    16 m     52 ft  Houses/Businesses
   SMALLINT scaled        4   682 m    0.4 mi  Cities
   MEDIUMINT scaled       6   2.7 m    8.8 ft  Vehicles
   Deg*10000000 (INT)     8    16mm    5/8 in  Marbles
For SPATIAL indexing:
   POINT                 25   3.5nm     ...    (2 doubles, plus overhead)
See also

Postlog


This problem has been a fun challenge that I have worked on off-and-on for many years. For a long time I saw this as one of the very few viable uses for PARTITIONing, but now I see that Z-Order is strong contender, but arguably more complex to program.


MySQL Documents by Rick James

HowTo Techniques for Optimizing Tough Tasks:

Partition Maintenance (DROP+REORG) for time series (includes list of PARTITION uses)

Big DELETEs - how to optimize -- and other chunking advice, plus a use for PARTITIONing
    Chunking lengthy DELETE/UPDATE/etc.

Data Warehouse techniques:
    Overview   Summary Tables   High speed ingestion  

Entity-Attribute-Value -- a common, poorly performing, design pattern (EAV); plus an alternative

5 methods for 'Find Nearest'

Find the nearest 10 pizza parlors -- efficient searching on Latitude + Longitude (another PARITION use)
    Lat/Long representation choices

Z-Order 'find nearest'(under construction)

Pagination, not with OFFSET, LIMIT

Techniques on efficiently finding a random row (On beyond ORDER BY RAND())

GUID/UUID Performance (type 1 only)

IP Range Table Performance -- or other disjoint ranges

Rollup Unique User Counts

Alter of a Huge table -- Mostly obviated by 5.6

Latest 10 news articles -- how to optimize the schema and code for such

Build and execute a "Pivot" SELECT (showing rows as columns)

Find largest row for each group ("groupwise max")

Other Tips, Tuning, Debugging, Optimizations, etc...

Rick's RoTs (Rules of Thumb -- lots of tips)

Datatypes and building a good schema

Memory Allocation (caching, etc)

Character Set and Collation problem solver
    Trouble with UTF-8   If you want case folding, but accent sensitivity, please file a request at http://bugs.mysql.com .
    Python tips,   PHP tips,   other language tips
    utf8 Collations   utf8mb4 Collations on 8.0

Converting from MyISAM to InnoDB -- includes differences between them

Compound INDEXes plus other insights into the mysteries of INDEXing

Cookbook for Creating Indexes
    Many-to-many mapping table   wp_postmeta   UNION+OFFSET

MySQL Limits -- built-in hard limits
    767-byte INDEX limit

Galera, tips on converting to (Percona XtraDB Cluster, MariaDB 10, or manually installed)

5.7's Query Rewrite -- perhaps 5.7's best perf gain, at least for this forum's users

Analyze MySQL Performance
    Analyze VARIABLEs and GLOBAL STATUS     Analyze SlowLog

My slides from conferences
Percona Live 4/2017 - Rick's RoTs (Rules of Thumb) - MySQL/MariaDB
Percona Live 4/2017 - Index Cookbook - MySQL/MariaDB
Percona Live 9/2015 - PARTITIONing - MySQL/MariaDB
(older ones upon request)

Contact me via LinkedIn; be sure to include a brief teaser in the Invite request:   View Rick James's profile on LinkedIn

Did my articles help you out? Like what you see? Consider donating:

☕️ Buy me a Banana Latte ($4) There is no obligation but it would put a utf8mb4 smiley 🙂 on my face, instead of the Mojibake "🙂"