Rollup Unique User Counts

Table of Contents

The Problem
The solution
Inflating the BIT_COUNT
How good is it?
Postlog
Brought to you by Rick James

The Problem


The normal way to count "Unique Users" is to take large log files, sort by userid, dedup, and count. This requires a rather large amount of processing. Furthermore, the count derived cannot be rolled up. That is, daily counts cannot be added to get weekly counts -- some users will be counted multiple times.

So, the problem is to store the counts is such a way as to allow rolling up.

The solution


Let's think about what we can do with a hash of the userid. The hash could map to a bit in a bit string. A BIT_COUNT of the bit string would give the 1-bits, representing the number of users. But that bit string would have to be huge. What if we could use shorter bit strings? Then different userids would be folded into the same bit. Let's assume we can solve that.

Meanwhile, what about the rollup? The daily bit strings can be OR'd together to get a similar bit string for the week.

We have now figured out how to do the rollup, but have created another problem -- the counts are too low.

Inflating the BIT_COUNT


A sufficiently random hash (eg MD5) will fold userids into the same bits with a predictable frequency. We need to figure this out, and work backwards. That is, given that X percent of the bits are set, we need a formula that says approximately how many userids were used to get those bits.

I simulated the problem by generating random hashes and calculated the number of bits that would be set. Then, with the help of Eureqa software, I derived the formula:

Y = 0.5456*X + 0.6543*tan(1.39*X*X*X)

How good is it?


The forumla is reasonably precise. It is usually within 1% of the correct value; rarely off by 2%.

Of course, if virtually all the bits are set, the forumla can't be very precise. Hence, you need to plan to have the bit strings big enough to handle the expected number of Uniques. In practice, you can use less than 1 bit per Unique. This would be a huge space savings over trying to save all the userids.

Another suggestion... If you are rolling up over a big span of time (eg hourly -> monthly), the bit strings must all be the same length, and the monthly string must be big enough to handle the expected count. This is likely to lead to very sparse hourly bit strings. Hence, it may be prudent to compress the hourly stings.

Postlog


Invented Nov, 2013; published Apr, 2014

Future: I am working on actual code (Sep, 2016) It is complicated by bit-wise operations being limited to BIGINT. However, with MySQL 8.0 (freshly released), the desired bit-wise operations can be applied to BLOB, greatly simplifying my code. I hope publish the pre-8.0 code soon; 8.0 code later.
-- Rick James

MySQL Documents by Rick James

HowTo Techniques for Optimizing Tough Tasks:

Partition Maintenance (DROP+REORG) for time series (includes list of PARTITION uses)
Big DELETEs - how to optimize -- and other chunking advice, plus a use for PARTITIONing
    Chunking lengthy DELETE/UPDATE/etc.
Data Warehouse techniques:
    Overview   Summary Tables   High speed ingestion  
Entity-Attribute-Value -- a common, poorly performing, design pattern (EAV); plus an alternative
Find the nearest 10 pizza parlors -- efficient searching on Latitude + Longitude (another PARITION use)
    Lat/Long representation choices
Pagination, not with OFFSET, LIMIT
Techniques on efficiently finding a random row (On beyond ORDER BY RAND())
GUID/UUID Performance (type 1 only)
IP Range Table Performance -- or other disjoint ranges
Rollup Unique User Counts
Alter of a Huge table -- Mostly obviated by 5.6
Latest 10 news articles -- how to optimize the schema and code for such
Build and execute a "Pivot" SELECT (showing rows as columns)
Find largest row for each group ("groupwise max")

Other Tips, Tuning, Debugging, Optimizations, etc...

Rick's RoTs (Rules of Thumb -- lots of tips)
Memory Allocation (caching, etc)
Character Set and Collation problem solver
    Trouble with UTF-8   If you want case folding, but accent sensitivity, please file a request at http://bugs.mysql.com .
    Python tips,   PHP tips,   other language tips
    utf8 Collations   utf8mb4 Collations on 8.0
Converting from MyISAM to InnoDB -- includes differences between them
Compound INDEXes plus other insights into the mysteries of INDEXing
Cookbook for Creating Indexes
    Many-to-many mapping table   wp_postmeta   UNION+OFFSET
MySQL Limits -- built-in hard limits
    767-byte INDEX limit
Galera, tips on converting to (Percona XtraDB Cluster, MariaDB 10, or manually installed)
5.7's Query Rewrite -- perhaps 5.7's best perf gain, at least for this forum's users
Request for tuning / slowlog info
Best of MySQL Forum -- index of lots of tips, discussions, etc

Analyze MySQL Performance
    Analyze VARIABLEs and GLOBAL STATUS     Analyze SlowLog

My slides from conferences
Percona Live 4/2017 - Rick's RoTs (Rules of Thumb) - MySQL/MariaDB
Percona Live 4/2017 - Index Cookbook - MySQL/MariaDB
Percona Live 9/2015 - PARTITIONing - MySQL/MariaDB
(older ones upon request)

Contact me via LinkedIn; be sure to include a brief teaser in the Invite request:   View Rick James's profile on LinkedIn