Searching for the "nearest" on a map, needs a 2D index. But indexes are one-dimensional. This article discusses several techniques, starting with obvious, but inefficient, methods, and moving into efficient methods.
Static versus Moving Items
The test data in this discussion is static -- cities. However, the principles work with dynamic data, such as trucks. The notable difference is that the underlying data will be moving around. This does not hurt the algorithms, but it poses a potentially worse performance challenge. Most algorithms have the PRIMARY KEY depend on location, so when an item is moved, it must be deleted from one spot in the table and reinserted in another spot.
Dateline and Poles
Cities, businesses, and trucks are not likely to be located near poles, or even near the Dateline. Both of these are computational problems when computing distances.
We can mostly assume there are few if any items near the poles. The Dateline is less easy to sweep under the rug; there are a few cities, businesses, and trucks near longitude -180 or +180, especially in Fiji. The No-Index algorithm is works fine assuming that the distance formula works. The Partition algorithm has special code. The others probably need extra code.
If you have tagged polar bears and are tracking them, the discussion here is of limited use. You have the problems discussed about poles, and the hassles of moving items.
If all the items are in one polar region, I would recommend projecting the globe onto a square with the pole at the center. Have the projection go some number of degrees toward the equator. On the projection, use x & y coordinates, not polar coordinates. Then the square will work nicely with any of the algorithms.
Pythagorean -- This is rather inaccurate at the poles. Don't bother using it for anything.
GCDistDeg -- This Stored Function uses the haversine algorithm. Code is below.
ST_Distance_Sphere -- This works only with Geometry items, and requires MySQL 5.7.6 or newer.
These five algorithms will be discussed further below. They are listed in (roughly) most-efficient to least-efficient order. The algorithm assumes a table with items with latitude and longitude coordinates, plus other columns. The goal is to "Find the nearest N items to a given lat/lng location, plus optional filtering." Filtering examples: Businesses->Restaurants; Polar bears->Male; Trucks->Tractor-trailors; Cities->Population > 100K.
No-Index - This is the straightforward case that people start with. However, "find nearest" involves a full table scan, computing the distance to every point, sorting by distance, then delivering the closest item(s). This is OK for thousands of items, but not for millions.
Bounding Box - This involves 2 indexes and an addition to the WHERE clause. There is some arithmetic to compute ranges for latitude and longitude for that clause. It is orders of magnitude faster than No-Index, but significantly slower than the other algorithms when handling millions of rows. This is not practical for billions of rows.
Spatial - This involves a bounding box and an index. It is relatively simple to set up. It can be slow if the limit is too high and there are lots if items. (Example: You are in a big city and you ask for the nearest restaurant, limited to 50 miles.)
Partitioning - What is really needed is a two-dimensional index, but indexes don't work that way. However, with PARTITION BY RANGE, one can approximate one dimension with partitions, then use the PRIMARY KEY for the other dimension. The algorithm uses a bounding box that expands until the row(s) are found or the limit is reached.
Z-Order - That takes a different tack; it plays with the bits of latitude and longitude to blend them together. Now, items that are "near" each other on the globe are _usually_ "near" each other with this contrived number. In this algorithm, a bounding box starts full sized, but shrinks as needed. It involves 2-6 SELECTs and scales well.
For experimenting with the algorithms, I picked a dataset of 3.1 million cities around the world. This provided oceans (no cities) versus crowded areas. And it avoided the poles, so the code for such could be postponed.
For bounding boxes, I picked (arbitrarily) 50 miles (80.57km, 0.723 degrees of latitude). And I picked (arbitrarily) to pick the 10 nearest cities. But I deliberately picked not to test with filters, since an appropriate filter could lead to finding zero results, regardless of the radius.
This is what the novice starts with, and it leads to a full table scan. Slow.
SELECT *, GCDistDeg($lat_me, $lng_me, lat, lng) * 69.172 AS miles
ORDER BY miles ASC
With suitable indexes, and use of a "bounding box" in the WHERE clause, you can easily improve on the No-Index algorithm.
WHERE lat BETWEEN ... AND ...
AND lng BETWEEN ... AND ...
If you are looking for items within 0.2 degree of latitude (= 14 miles = 22 km) of $my_lat, $my_lng, then
WHERE lat BETWEEN $my_lat - 0.2
AND $my_lat + 0.2
AND lng BETWEEN $my_lng - 0.2 / COS(RADIANS($my_lat))
AND $my_lng + 0.2 / COS(RADIANS($my_lat))
Dividing by the cosine approximately adjusts for how longitude lines creep toward each other as you go away from the equator. This approximation is 'good enough' except very close to the poles.
Use these two indexes:
The Optimizer will pick between the two indexes based on statistics. Do not also provide single-column indexes on lat and lng; the Optimizer might pick one of them, thereby preventing the use of "Using index condition" (aka ICP - index condition pushdown). ICP sometimes provides a 2x speedup.
If, say, there are 1000 items in the latitude range (an east-west stripe) but 4000 in the longitude range (north-south), the Optimizer will prefer (lat, lng). With that, it will scan the 1000, then check each according to the longitude range.
A slightly more precise, especially at high latitudes, way of computing the longitude with based on radius is (using PHP)
// $radius is in degrees, not miles or km:
$dlng = $radius / cos(deg2rad($my_lat)); // The method described above
$dlng = rad2deg(asin( sin(deg2rad($radius)) / cos(deg2rad($my_lat)) )); // better
or, in SQL:
AND lng BETWEEN ? - DEGREES(ASIN( SIN(RADIANS(?)) / COS(RADIANS(?)) ))
AND ? + DEGREES(ASIN( SIN(RADIANS(?)) / COS(RADIANS(?)) ))
and bind with $my_lng, $radius, $my_lat,
$my_lng, $radius, $my_lat
Caution: If you are too close ($my_lat+$radius >= 90) to the pole, $dlng will be NaN (Not a Number).
The bounding-box method is orders of magnitude faster than the no-index case. However, it requires that you limit how far you need to search. And, if you are in a crowded city looking for a coffee bar, you might still have to filter through thousands of rows.
It turns out that all of the following techniques also use a bounding box, but often buried in the code. Without a box, the search for something that does not exist anywhere would require searching the entire table -- slow. And the end-user probably does not care if the thing being searched for is on the other side of the world.
The bounding box technique when picking from a 50-mile radius from 3.1 million items (cities) will typically take between 1ms and 60ms if the data is cached; longer if it must hit disk.
ST_Distance_Sphere(lat, lng) AS meters
WHERE ... (any filtering)
AND MBRContains(GeomFromText(Polygon(...)), -- The bounding box is encoded as a 'square'
ORDER BY distance ASC, id
⚈ Select whatever columns you need.
⚈ Use some spherical distance formula, and convert to the desired units (miles or km or whatever)
⚈ For efficiency, the table should have extra columns to facilitate filtering (eg coffee shops / businesses)
⚈ Use LIMIT to control the number of items to return
⚈ GCDistDeg(lat1, lng1, lat2, lng2) -- args and return are in degrees
⚈ With MySQL 5.7.6 or newer, ST_Distance_Sphere(point1, point2) -- args are POINTs, returns meters
⚈ Pre 5.7.6 (MariaDB 10.2.2): use MyISAM
⚈ 5.7.6+ (10.2.2+): Use InnoDB (or MyISAM)
This is attempts to make a two-dimensional index out of the tools available in MySQL. The partitions are on some function of latitude. Then the PRIMARY KEY starts with longitude. That way you first get an east-west stripe, sometimes more than one stripe. After that, you can limit the searhing because of lng being first in the PK. (Again, use a bounding box.)
See this for an implementation of such: Find the nearest 10 pizza parlors
This involves blending the lat and lng numbers together to form a number that has geographically nearby items with similar number. Yeah, it sounds like magic. Yeah, it is not perfect, but it is good enough to be very fast -- touching an average of 3 data blocks of the table to find the 10 nearest items in up to 50 miles.
To keep the bounding boxes close to square, longitudes are keep as is, but latidudes are stretched by an unobvious function (the integral of secant). About 85 degrees north/south is where the algorithm becomes dubious. This is fine for cities, businesses, and trucks but not for polar bears.
Wikipedia on Z-roder curve
Implementing zorder in MySQL
Comparison of methods
⚈ 3.1 million cities in the world
⚈ Start at some semi-random starting point
⚈ "semi-random" meaning that it avoids the middle of oceans, etc.
⚈ Find the 10 nearest
⚈ Limit the search to 50 miles
Method Rows Blocks ms
⚈ Elapsed time is in milliseconds, taken around the call to the routine
⚈ "Rows" generally comes from innodb_rows_read
⚈ All data is cached.
⚈ "Blocks" is a hack to guesstimate how many disk blocks would be fetched from disk if the table were significantly bigger than cache.
⚈ So, think of the elapsed time as being CPU time, but the "Blocks" as being more important if short on RAM.
⚈ Rows/10 is an interesting metric -- "How many rows had to be touched to find the 10 desired". For no-index and bounding-box, those grow as the dataset grows. For the others, the ratio is relatively constant.
No filtering is done for these tests. That is, all nearby cities count toward the 10. In a real situation, the "items" might be "businesses", and the query might want to filter down to just "coffee purveyors".
Haversine algorithm for computing the distance between two lat/lng pairs anywhere on earth.
CREATE FUNCTION `GCDistDeg`(
) RETURNS double
SQL SECURITY INVOKER
COMMENT 'Degrees in, Degrees out. Use 69.172 mi/deg'
DECLARE _deg2rad DOUBLE DEFAULT PI()/180;
DECLARE _rlat1 DOUBLE DEFAULT _deg2rad * _lat1;
DECLARE _rlat2 DOUBLE DEFAULT _deg2rad * _lat2;
DECLARE _rlond DOUBLE DEFAULT _deg2rad * (_lon1 - _lon2);
DECLARE _m DOUBLE DEFAULT COS(_rlat2);
DECLARE _x DOUBLE DEFAULT COS(_rlat1) - _m * COS(_rlond);
DECLARE _y DOUBLE DEFAULT _m * SIN(_rlond);
DECLARE _z DOUBLE DEFAULT SIN(_rlat1) - SIN(_rlat2);
DECLARE _n DOUBLE DEFAULT SQRT(
_x * _x +
_y * _y +
_z * _z );
RETURN 2 * ASIN(_n / 2) / _deg2rad;
This problem has been a fun challenge that I have worked on off-and-on for many years.
-- Rick James
MySQL Documents by Rick James
HowTo Techniques for Optimizing Tough Tasks:
Partition Maintenance (DROP+REORG) for time series (includes list of PARTITION uses)
Big DELETEs - how to optimize -- and other chunking advice, plus a use for PARTITIONing
Chunking lengthy DELETE/UPDATE/etc.
Data Warehouse techniques:
Overview Summary Tables High speed ingestion
Entity-Attribute-Value -- a common, poorly performing, design pattern (EAV); plus an alternative
5 methods for 'Find Nearest'
Find the nearest 10 pizza parlors -- efficient searching on Latitude + Longitude (another PARITION use)
Lat/Long representation choices
Z-Order 'find nearest'(under construction)
Pagination, not with OFFSET, LIMIT
Techniques on efficiently finding a random row (On beyond ORDER BY RAND())
GUID/UUID Performance (type 1 only)
IP Range Table Performance -- or other disjoint ranges
Rollup Unique User Counts
Alter of a Huge table -- Mostly obviated by 5.6
Latest 10 news articles -- how to optimize the schema and code for such
Build and execute a "Pivot" SELECT (showing rows as columns)
Find largest row for each group ("groupwise max")
Other Tips, Tuning, Debugging, Optimizations, etc...
Rick's RoTs (Rules of Thumb -- lots of tips)
Memory Allocation (caching, etc)
Character Set and Collation problem solver
Trouble with UTF-8 If you want case folding, but accent sensitivity, please file a request at http://bugs.mysql.com .
Python tips, PHP tips, other language tips
utf8 Collations utf8mb4 Collations on 8.0
Converting from MyISAM to InnoDB -- includes differences between them
Compound INDEXes plus other insights into the mysteries of INDEXing
Cookbook for Creating Indexes
Many-to-many mapping table wp_postmeta UNION+OFFSET
MySQL Limits -- built-in hard limits
767-byte INDEX limit
Galera, tips on converting to (Percona XtraDB Cluster, MariaDB 10, or manually installed)
5.7's Query Rewrite -- perhaps 5.7's best perf gain, at least for this forum's users
Request for tuning / slowlog info
Best of MySQL Forum -- index of lots of tips, discussions, etc
Analyze MySQL Performance
Analyze VARIABLEs and GLOBAL STATUS Analyze SlowLog
My slides from conferences
Percona Live 4/2017 - Rick's RoTs (Rules of Thumb) - MySQL/MariaDB
Percona Live 4/2017 - Index Cookbook - MySQL/MariaDB
Percona Live 9/2015 - PARTITIONing - MySQL/MariaDB
(older ones upon request)
Contact me via LinkedIn; be sure to include a brief teaser in the Invite request: