Lat/Lng search to Find the nearest 10 pizza parlors
The Problem
Searching for the "nearest" on a map needs a 2D index.
But indexes are one-dimensional.
This article discusses several techniques, starting with obvious, but inefficient, methods, and moving into efficient methods.
Static versus Moving Items
The test data in this discussion is static -- cities. However, the principles work with dynamic data, such as trucks. The notable difference is that the underlying data will be moving around. This does not hurt the algorithms, but it adds a burden to the UPDATEs. Most algorithms have the PRIMARY KEY depend on location, so when an item is moved, it must be deleted from one spot in the table and reinserted in another spot.
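For example, when a truck reports a new GPS fix, the application simply updates the coordinates; because the location is in the PRIMARY KEY, InnoDB carries that out internally as a delete plus re-insert. (A sketch; the `trucks` table and the secondary index on `truck_id` are hypothetical.)

    UPDATE trucks                        -- PRIMARY KEY includes (lng, lat, ...)
        SET lat = @new_lat,
            lng = @new_lng
        WHERE truck_id = @truck_id;      -- needs a secondary INDEX(truck_id)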
Dateline and Poles
Cities, businesses, and trucks are not likely to be located near the poles, or even near the Dateline. Both locations are computational problem areas when computing distances.
We can mostly assume there are few, if any, items near the poles. The Dateline is less easy to sweep under the rug; there are a few cities, businesses, and trucks near longitude -180/+180, especially in Fiji (Taveuni Island). The No-Index algorithm works fine, provided the distance formula handles the wraparound. The Partition algorithm has special code for it; the others probably need extra code.
Polar Bears
If you have tagged polar bears and are tracking them, the discussion here is of limited use. You have the problems discussed about poles, and the hassles of moving items.
If all the items are in one polar region, I would recommend projecting the globe onto a square with the pole at the center. Have the projection go some number of degrees toward the equator. On the projection, use x & y coordinates, not polar coordinates. Then the square will work nicely with any of the algorithms.
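A sketch of such a projection (my own formulation; `bears` and the x/y columns are hypothetical). This is roughly an azimuthal equidistant projection centered on the North Pole, with units of 'degrees from the pole':

    UPDATE bears
        SET x = (90 - lat) * COS(RADIANS(lng)),
            y = (90 - lat) * SIN(RADIANS(lng));
    -- Then index and search on (x, y) with any of the algorithms below,
    -- using plain Pythagorean distance in x and y.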
Computing Distance
Pythagorean -- This is faster than GCDistDeg, but it is rather inaccurate near the poles and for long distances. Consider it only for reasonably short distances (up to tens of miles or kilometers). Be sure to adjust longitude by the cosine of the latitude. (A sketch appears after this list.)
GCDistDeg -- This Stored Function uses the haversine algorithm to calculate the Great Circle Distance. Code is below.
ST_Distance_Sphere -- This works only with Geometry items, and requires MySQL 5.7.6 or newer.
Timing: ST_Distance_Sphere is 10-20 times as fast as GCDistDeg. The downside is that you must have a bulky Point column instead of smaller lat and lng columns.
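A sketch of the Pythagorean approximation mentioned above (the user variables are hypothetical; the result is in miles):

    -- Reasonable only for short distances, away from the poles and the Dateline:
    SELECT 69.172 * SQRT(
               POW(@lat2 - @lat1, 2) +
               POW((@lng2 - @lng1) * COS(RADIANS((@lat1 + @lat2) / 2)), 2)
           ) AS miles;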
The Contenders
Five algorithms will be discussed further below. They are listed in (roughly) most-efficient to least-efficient order.
The algorithms use a table of items with latitude and longitude coordinates, plus other columns. Each algorithm needs a distinctly different schema (different indexes, partitioning, spatial column, etc).
The goal is to "Find the nearest N items to a given lat/lng location, plus optional filtering." Filtering examples: Businesses->Restaurants; Polar bears->Male; Trucks->Tractor-trailers; Cities->(Population > 100K).
No-Index - This is the straightforward case that people start with. However, "find nearest" involves a full table scan, computing the distance to every point, sorting by distance, then delivering the closest item(s). This is OK for thousands of items, but not for millions.
Bounding Box - (Also called "bounding square".) This involves 2 indexes and an addition to the WHERE clause. There is some arithmetic to compute ranges for latitude and longitude for that clause. It is orders of magnitude faster than No-Index, but significantly slower than the other algorithms when handling millions of rows. This is not practical for heavy use with billions of rows.
Spatial - This involves a bounding box and an index. It is relatively simple to set up. It can be slow if the limit (the search radius) is too large and there are lots of items. (Example: you are in a big city and ask for the nearest restaurant, limited to 50 miles. But it is fast in a desert when looking for water.)
Partitioning - What is really needed is a two-dimensional index, but indexes don't work that way. However, with PARTITION BY RANGE, one can approximate one dimension with partitions, then use the PRIMARY KEY for the other dimension. The algorithm uses a bounding box that expands until enough row(s) are found or the limit is reached.
Z-Order - This takes a different tack; it plays with the bits of latitude and longitude to blend them together. Now, items that are "near" each other on the globe are usually "near" each other with this contrived number. In this algorithm, a bounding box starts full sized, but shrinks as needed. It involves 2-6 SELECTs and scales well.
Both Partitioning and Z-order do the best job of minimizing the number of disk blocks touched. This makes either excellent for scaling into billions of rows, while providing fast results. Part of the secret is using multiple SELECTs and limiting how hard each one has to work.
Both Bounding Box and Spatial are reasonably fast most of the time. But when the density of items is high and the range limitation is high, they may need to hit a lot of rows.
Test case
For experimenting with the algorithms, I picked a dataset of 3.1 million cities around the world. This provided both empty oceans (no cities) and crowded areas, while avoiding the poles. The code and data for building that test table are not available here.
For bounding boxes, I picked (arbitrarily) 50 miles (80.47 km, 0.723 degrees of latitude). I also (arbitrarily) chose to request the 10 nearest cities, but deliberately chose not to test with filters, since an appropriate filter could lead to finding zero results, regardless of the radius.
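As a small worked example of that arithmetic (user variables standing in for the constants and the center latitude):

    SET @m    := 50;                                    -- miles
    SET @dlat := @m / 69.172;                           -- ~0.723 degrees of latitude
    SET @dlng := @m / 69.172 / COS(RADIANS(@my_lat));   -- degrees of longitude; wider away from the equator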
No-Index
This is what the novice starts with, and it always requires a full table scan. Slow.
SELECT *, GCDistDeg($lat_me, $lng_me, lat, lng) * 69.172 AS miles
FROM t
ORDER BY miles ASC
LIMIT 10;
Bounding box
With suitable indexes, and use of a "bounding box" in the WHERE clause, you can easily improve on the No-Index algorithm.
WHERE lat BETWEEN ... AND ...
AND lng BETWEEN ... AND ...
HAVING distance <= ...
If you are looking for items within $m miles of $my_lat, $my_lng, then
WHERE lat BETWEEN $my_lat - $m/69.172
AND $my_lat + $m/69.172
AND lng BETWEEN $my_lng - $m/69.172 / COS(RADIANS($my_lat))
AND $my_lng + $m/69.172 / COS(RADIANS($my_lat))
HAVING miles <= $m
Dividing by the cosine approximately adjusts for how longitude lines creep toward each other as you go away from the equator. This approximation is 'good enough' except very close to the poles.
Use these two indexes:
INDEX(lat, lng)
INDEX(lng, lat)
The Optimizer will pick between the two indexes based on statistics. Do not also provide single-column indexes on lat and lng; the Optimizer might pick one of them, thereby preventing the use of "Using index condition" (aka ICP - index condition pushdown). ICP sometimes provides a 2x speedup.
If, say, there are 1000 items in the latitude range (an east-west stripe) but 4000 in the longitude range (north-south), the Optimizer will prefer (lat, lng). With that, it will scan the 1000, then check each according to the longitude range.
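Putting those pieces together, here is a sketch of the full query (hypothetical table `t`; MySQL user variables stand in for $m, $my_lat, $my_lng; GCDistDeg is the Stored Function given near the end of this article):

    SELECT  id, lat, lng,
            GCDistDeg(@my_lat, @my_lng, lat, lng) * 69.172 AS miles
        FROM t
        WHERE lat BETWEEN @my_lat - @m/69.172
                      AND @my_lat + @m/69.172
          AND lng BETWEEN @my_lng - @m/69.172 / COS(RADIANS(@my_lat))
                      AND @my_lng + @m/69.172 / COS(RADIANS(@my_lat))
        HAVING miles <= @m
        ORDER BY miles
        LIMIT 10;

The HAVING discards the 'corners' of the box that are farther away than @m miles.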
A slightly more precise way (especially at high latitudes) of computing the longitude range based on the radius is (using PHP)
// $radius is in degrees, not miles or km:
$dlng = $radius / cos(deg2rad($my_lat)); // The method described above
$dlng = rad2deg(asin( sin(deg2rad($radius)) / cos(deg2rad($my_lat)) )); // better
or, in SQL:
AND lng BETWEEN ? - DEGREES(ASIN( SIN(RADIANS(?)) / COS(RADIANS(?)) ))
AND ? + DEGREES(ASIN( SIN(RADIANS(?)) / COS(RADIANS(?)) ))
and bind with $my_lng, $radius, $my_lat,
$my_lng, $radius, $my_lat
Caution: If you are too close ($my_lat+$radius >= 90) to the pole, $dlng will be NaN (Not a Number).
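One way to guard against that (a sketch; @radius is in degrees, as above): if the box would reach a pole, fall back to the full longitude range.

    SET @dlng := IF(ABS(@my_lat) + @radius >= 90,
                    180,    -- the box touches a pole; use all longitudes
                    DEGREES(ASIN(SIN(RADIANS(@radius)) / COS(RADIANS(@my_lat)))));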
The bounding-box method is orders of magnitude faster than the no-index case. However, it requires that you limit how far you need to search.
And, if you are in a crowded city looking for a coffee bar, you might still have to filter through thousands of rows.
It turns out that all of the following techniques also use a bounding box, but often buried in the code. Without a box, the search for something that does not exist anywhere would require searching the entire table -- slow. And the end-user probably does not care if the thing being searched for is on the other side of the world.
The bounding box technique when picking from a 50-mile radius from 3.1 million items (cities) will typically take between 1ms and 60ms if the data is cached; longer if it must hit disk. The high end is for areas of Bangladesh, which has lots of cities crowded together.
Spatial
SELECT  ...,
        ST_Distance_Sphere(center_point, item_point) AS meters
    FROM tbl
    WHERE ...                           -- (any filtering)
      AND ST_Contains(ST_GeomFromText('POLYGON((...))'),   -- the bounding box, encoded as a 'square'
                      item_point)
    HAVING meters <= ...
    ORDER BY meters ASC, id ASC
    LIMIT ...;
Notes:
⚈ Select whatever columns you need.
⚈ Use some spherical distance formula, and convert to the desired units (miles or km or whatever)
⚈ For efficiency, the table should have extra columns to facilitate filtering (eg coffee shops / businesses)
⚈ Use LIMIT to control the number of items to return
⚈ The polygon corners come from the bounding box discussed above, including the cosine adjustment. (A sketch of building the polygon follows the Engine notes below.)
Distance calculation:
⚈ GCDistDeg(lat1, lng1, lat2, lng2) -- args and return are in degrees; adjust longitudes by cos(lat).
⚈ With MySQL 5.7.6 or newer, ST_Distance_Sphere(point1, point2) -- args are POINTs, returns meters
Engine:
⚈ Pre 5.7.6 (MariaDB 10.2.2): use MyISAM and GCDistDeg
⚈ For 5.7.6+ (10.2.2+): Use InnoDB (or MyISAM) and ST_Distance_Sphere
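As a sketch of the kind of setup involved (table, column, and variable names here are hypothetical; SRID 0 is assumed, with longitude as X and latitude as Y):

    CREATE TABLE items (
        id          INT UNSIGNED AUTO_INCREMENT,
        item_point  POINT NOT NULL,        -- lng/lat stored as X/Y (SRID 0)
        -- ... columns for filtering, display, etc ...
        PRIMARY KEY (id),
        SPATIAL INDEX (item_point)
    ) ENGINE=InnoDB;

    -- Build the bounding 'square' as WKT, using the same half-widths (@dlat, @dlng)
    -- computed for the Bounding Box algorithm:
    SET @poly := CONCAT('POLYGON((',
            @my_lng - @dlng, ' ', @my_lat - @dlat, ',',
            @my_lng + @dlng, ' ', @my_lat - @dlat, ',',
            @my_lng + @dlng, ' ', @my_lat + @dlat, ',',
            @my_lng - @dlng, ' ', @my_lat + @dlat, ',',
            @my_lng - @dlng, ' ', @my_lat - @dlat, '))');
    -- Then, in the query above:  AND ST_Contains(ST_GeomFromText(@poly), item_point)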
Partitioning
This attempts to make a two-dimensional index out of the tools available in MySQL.
The partition ranges are on some function of latitude. Then the PRIMARY KEY starts with longitude.
That way you first get an east-west stripe (one partition); sometimes more than one stripe.
After that, you can limit the searching because lng is first in the PK.
(Again, use a bounding box.)
See this for an implementation of such:
Find the nearest 10 pizza parlors
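As a rough sketch of the shape of such a table (names, datatypes, and partition boundaries here are illustrative; see the linked article for the real thing):

    CREATE TABLE items (
        lat_x100  SMALLINT NOT NULL,     -- latitude  * 100 (see the datatype table below)
        lng_x100  SMALLINT NOT NULL,     -- longitude * 100
        id        INT UNSIGNED NOT NULL,
        -- ... columns for filtering, display, etc ...
        PRIMARY KEY (lng_x100, lat_x100, id)   -- lng first: the 'second dimension'
    ) ENGINE=InnoDB
    PARTITION BY RANGE (lat_x100) (            -- latitude stripes: the 'first dimension'
        PARTITION p_s90 VALUES LESS THAN (-8000),
        PARTITION p_s80 VALUES LESS THAN (-7000),
        -- ... more stripes, each covering a band of latitude ...
        PARTITION p_n80 VALUES LESS THAN (8000),
        PARTITION p_n90 VALUES LESS THAN MAXVALUE
    );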
Z-Order
This involves blending the lat and lng numbers together to form a number that has geographically nearby items with numerically close numbers.
Yeah, it sounds like magic.
Yeah, it is not perfect, but it is good enough to be very fast -- touching an average of 2 data blocks of the table to find the 10 nearest cities within 50 miles.
To keep the bounding boxes close to square, longitudes are kept as is, but latitudes are stretched by an unobvious function (the integral of the secant). Above about 85 degrees north/south, the algorithm becomes dubious. This is fine for cities, businesses, and trucks, but not for polar bears.
Wikipedia on Z-order curve
Implementing zorder in MySQL
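To give the flavor of the bit-blending, here is a toy sketch of my own (the real implementation linked above also stretches latitude and uses more bits):

    DELIMITER //
    CREATE FUNCTION ZOrder16(_x INT UNSIGNED, _y INT UNSIGNED) RETURNS INT UNSIGNED
        DETERMINISTIC
    BEGIN
        -- Interleave the low 16 bits of _x and _y into one 32-bit value.
        -- (_x and _y would be scaled lng and stretched lat.)
        DECLARE _i INT DEFAULT 0;
        DECLARE _z INT UNSIGNED DEFAULT 0;
        WHILE _i < 16 DO
            SET _z := _z
                    | (((_x >> _i) & 1) << (2 * _i))        -- bit of _x -> even position
                    | (((_y >> _i) & 1) << (2 * _i + 1));   -- bit of _y -> odd position
            SET _i := _i + 1;
        END WHILE;
        RETURN _z;
    END //
    DELIMITER ;

Items that are near each other on the globe usually share the high-order bits of that value, so an index on it tends to put them in the same block(s).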
Comparison of methods
Test case
⚈ 3.1 million cities in the world
⚈ Start at some semi-random starting point
⚈ "semi-random" meaning that it avoids the middle of oceans, etc.
⚈ Find the 10 nearest cities
⚈ Limit the search to 50 miles
under construction
    Method         Rows  Blocks    ms  SELECTs
    ------------   ----  ------  ----  -------
    no-index       3.1M     30K   40K     1
    bounding-box   1700      96    24     1
    spatial        1900       ?    15     1
    zorder           54       2     3     4
    partitioning    570       4     2     3
Metrics
⚈ Elapsed time is in milliseconds, taken around the call to the routine, which is a Stored Proc for Partitioning, or PHP code for Zorder, or just an SQL query for the others.
⚈ "Rows" generally comes from innodb_rows_read
⚈ All data is assumed to be cached.
⚈ "Blocks" is a hack to guesstimate how many disk blocks would be fetched from disk if the table were significantly bigger than cache.
⚈ So, think of the elapsed time as being CPU time, but the "Blocks" as being more important if you are short on RAM, leading to being I/O-bound.
⚈ Rows/10 is an interesting metric -- "How many rows had to be touched to find the 10 desired". For no-index and bounding-box, those grow as the dataset grows. For the others, the ratio is relatively constant.
⚈ Bounding-box and Spatial are 'slow' because of needing to calculate Distance for so many rows.
No filtering is done for these tests. That is, all nearby cities count toward the 10.
In a real situation, the "items" might be "businesses", and the query might want to filter down to just "coffee purveyors".
The table is designed with filtering in mind. That is, you should add columns (within reason) to the table to allow for queries to limit the search to only green vehicles, transvestite polar bears, or large cities.
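For example (the column name and encoding here are hypothetical):

    ALTER TABLE items ADD COLUMN biz_type TINYINT UNSIGNED NOT NULL DEFAULT 0;
    -- Then add to the WHERE clause of whichever algorithm you use:
    --     AND biz_type = 7        -- e.g., 7 = 'coffee purveyor'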
GCDistDeg
Haversine algorithm for computing the distance between two lat/lng pairs anywhere on earth.
DELIMITER //
CREATE FUNCTION `GCDistDeg`(
        _lat1 DOUBLE,
        _lon1 DOUBLE,
        _lat2 DOUBLE,
        _lon2 DOUBLE
    ) RETURNS DOUBLE
    DETERMINISTIC
    SQL SECURITY INVOKER
    COMMENT 'Degrees in, Degrees out. For conversion: 69.172 mi/deg or 111.325 km/deg'
BEGIN
    DECLARE _deg2rad DOUBLE DEFAULT PI()/180;
    DECLARE _rlat1   DOUBLE DEFAULT _deg2rad * _lat1;
    DECLARE _rlat2   DOUBLE DEFAULT _deg2rad * _lat2;
    DECLARE _rlond   DOUBLE DEFAULT _deg2rad * (_lon1 - _lon2);
    DECLARE _m       DOUBLE DEFAULT COS(_rlat2);
    -- Chord components between the two points on a unit sphere:
    DECLARE _x       DOUBLE DEFAULT COS(_rlat1) - _m * COS(_rlond);
    DECLARE _y       DOUBLE DEFAULT _m * SIN(_rlond);
    DECLARE _z       DOUBLE DEFAULT SIN(_rlat1) - SIN(_rlat2);
    -- Chord length:
    DECLARE _n       DOUBLE DEFAULT SQRT(_x * _x + _y * _y + _z * _z);
    -- Chord length -> central angle, then back to degrees:
    RETURN 2 * ASIN(_n / 2) / _deg2rad;
END //
DELIMITER ;
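Usage example (the city coordinates are approximate; the function returns degrees, so multiply by 69.172 for miles):

    -- San Francisco (37.7749, -122.4194) to Los Angeles (34.0522, -118.2437);
    -- expect roughly 350 miles:
    SELECT GCDistDeg(37.7749, -122.4194, 34.0522, -118.2437) * 69.172 AS miles;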
Representation choices
    Datatype                Bytes (lat+lng)  Resolution            Typical use
    ----------------------  ---------------  --------------------  ------------------
    Normal use:
    DECIMAL(4,2)/(5,2)             5          1570 m     1.0 mi    Cities
    DECIMAL(6,4)/(7,4)             7            16 m      52 ft    Houses/Businesses
    DECIMAL(8,6)/(9,6)             9            16 cm    1/2 ft    Friends in a mall
    FLOAT                          8           1.7 m     5.6 ft    Vehicles
    DOUBLE                        16           3.5 nm      ...     Fleas on a dog

    For PARTITIONing, which limits the datatypes allowed for the "partition key":
    Deg*100 (SMALLINT)             4          1570 m     1.0 mi    Cities
    Deg*10000 (MEDIUMINT)          6            16 m      52 ft    Houses/Businesses
    SMALLINT scaled                4           682 m     0.4 mi    Cities
    MEDIUMINT scaled               6           2.7 m     8.8 ft    Vehicles
    Deg*10000000 (INT)             8            16 mm    5/8 in    Marbles

    For SPATIAL indexing:
    POINT                         25           3.5 nm      ...     (2 DOUBLEs, plus overhead)
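For example, the "Houses/Businesses" row, wired up for the Bounding Box algorithm (a sketch; the table name is hypothetical):

    CREATE TABLE places (
        id   INT UNSIGNED AUTO_INCREMENT,
        lat  DECIMAL(6,4) NOT NULL,     --  -90.0000 ..  90.0000
        lng  DECIMAL(7,4) NOT NULL,     -- -180.0000 .. 180.0000
        -- ... other columns ...
        PRIMARY KEY (id),
        INDEX (lat, lng),
        INDEX (lng, lat)
    ) ENGINE=InnoDB;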
Postlog
This problem has been a fun challenge that I have worked on off-and-on for many years.
For a long time I saw this as one of the very few viable uses for PARTITIONing, but now I see that Z-Order is a strong contender, though arguably more complex to program.