The Problem
The normal way to count "Unique Users" is to take large log files,
sort by userid, dedup, and count. This requires a rather
large amount of processing. Furthermore, the count derived
cannot be rolled up. That is, daily counts cannot be added
to get weekly counts -- some users will be counted multiple times.
So, the problem is to store the counts is such a way as to allow
rolling up.
The solution
Let's think about what we can do with a hash of the userid.
The hash could map to a bit in a bit string.
A BIT_COUNT of the bit string would give
the 1-bits, representing the number of users.
But that bit string would have to be huge.
What if we could use shorter bit strings?
Then different userids would be folded into the same bit.
Let's assume we can solve that.
Meanwhile, what about the rollup?
The daily bit strings can be OR'd together to get a similar
bit string for the week.
We have now figured out how to do the rollup, but have created
another problem -- the counts are too low.
Inflating the BIT_COUNT
A sufficiently random hash (eg MD5) will fold userids into the same bits
with a predictable frequency. We need to figure this out,
and work backwards. That is, given that X percent of the bits are set, we need
a formula that says approximately how many userids were used
to get those bits.
I simulated the problem by generating random hashes and calculated the
number of bits that would be set.
Then, with the help of Eureqa software, I derived the formula:
Y = 0.5456*X + 0.6543*tan(1.39*X*X*X)
How good is it?
The forumla is reasonably precise.
It is usually within 1% of the correct value; rarely off by 2%.
Of course, if virtually all the bits are set, the forumla
can't be very precise. Hence, you need to plan to have the
bit strings big enough to handle the expected number of Uniques.
In practice, you can use less than 1 bit per Unique.
This would be a huge space savings over trying to save all the
userids.
Another suggestion... If you are rolling up over a big
span of time (eg hourly -> monthly), the bit strings must
all be the same length, and the monthly string must be
big enough to handle the expected count.
This is likely to lead to very sparse hourly bit strings.
Hence, it may be prudent to compress the hourly stings.
Postlog
This technique should work equally well in MySQL, MariaDB, and Percona.
I did some preliminary code in September 2016; with suitable encouragement, I could continue with that.
It is complicated by bit-wise operations being limited to BIGINT.
However, with MySQL 8.0, the desired bit-wise
operations can be applied to BLOB, greatly simplifying the code.
So, I might continue with the pre-8.0 code or I might just work on the 8.0 code.
See also:
HyperLogLog
Invented Nov, 2013; published Apr, 2014
Partition Maintenance (DROP+REORG) for time series (includes list of PARTITION uses)
Big DELETEs - how to optimize -- and other chunking advice, plus a use for PARTITIONing
Chunking lengthy DELETE/UPDATE/etc.
Data Warehouse techniques:
Data Warehouse Overview
Summary Tables
High speed ingestion
Bulk Normalization
Schema and code design for large Sensor database
Entity-Attribute-Value (EAV) -- a common, poorly performing, design pattern; plus an alternative
Lat/Lng search to Find the nearest 10 pizza parlors
Lat/Long representation choices
Pagination, not with OFFSET, LIMIT
Techniques on efficiently finding a random row (On beyond ORDER BY RAND())
GUID/UUID Performance (type 1 only)
IP Range Table Performance -- or other disjoint ranges
Alter of a Huge table -- Mostly obviated by 5.6
Efficient List of Latest 10 news articles
Build and execute a Pivot SELECT (showing rows as columns)
(Groupwise Max): Efficiently find largest row(s) for each group
Other Tips, Tuning, Debugging, Optimizations, etc...
Rick's RoTs (Rules of Thumb -- lots of tips)
Datatypes and building a good schema
Memory Allocation (caching, etc)
Character Set and Collation problem solver
Trouble with UTF-8
If you want case folding, but accent sensitivity, please file a request at https://bugs.mysql.com .
Python tips,
PHP tips,
other language tips
utf8 Collations
utf8mb4 Collations on 8.0
Converting from MyISAM to InnoDB -- includes differences between them
Compound INDEXes plus other insights into the mysteries of INDEXing
Cookbook for Creating Indexes
Many-to-many mapping table
Handler counts
wp_postmeta
UNION+OFFSET
MySQL Limits -- built-in hard limits
767-byte INDEX limit
Galera, tips on converting to (Percona XtraDB Cluster, MariaDB 10, or manually installed)
5.7's Query Rewrite -- perhaps 5.7's best perf gain, at least for this forum's users
Analyze MySQL Performance
Analyze VARIABLEs and GLOBAL STATUS
Analyze SlowLog
My slides from conferences
MiniFest 2021 - Rick James & Daniel Black - Answering on Stack Overflow(+comments) - MariaDB Frontlines
Percona Live 4/2017 - Rick's RoTs (Rules of Thumb) - MySQL/MariaDB
Percona Live 4/2017 - Index Cookbook - MySQL/MariaDB
Percona Live 9/2015 - PARTITIONing - MySQL/MariaDB
Contact me via LinkedIn; be sure to include a brief teaser in the Invite request: