EAV - Entity-Attribute-Value Implentation

Table of Contents

The Desire
It goes by various Names
Bad solution
The Problems
A Solution
But what about the ad hoc queries?
Why it Works
Details on the BLOB/JSON
Conclusions
Brought to you by Rick James


The Desire


    ⚈  Open-ended set of "attributes" (key=value) for each "entity". That is, the list of attributes is not known at development time, and will grow in the future. (This makes one column per attribute impractical.)
    ⚈  "ad hoc" queries testing attributes.
    ⚈  Attribute values come in different types (numbers, strings, dates, etc.)
    ⚈  Scale to lots of entities, yet perform well.

It goes by various Names

    ⚈  EAV -- Entity, Attribute, Value
    ⚈  key-value
    ⚈  RDF -- This is a flavor of EAV
    ⚈  MariaDB has
dynamic columns that look something like the solution below. They have the added advantage of being able to index the columns otherwise hidden in the blob. (There are caveats.)
    ⚈  MongoDB, CouchDB -- and others -- Not SQL-based.

Bad solution


    ⚈  Table with 3 columns: entity, key, value
    ⚈  The "value" is a string, or maybe multiple columns depending on type or other kludges.
    ⚈  a JOIN b ON a.entity=b.entity AND b.key='x' JOIN c ON ...

The Problems


    ⚈  The SELECTs get messy -- multiple JOINs
    ⚈  Datatype issues -- It's clumsy to be putting numbers into strings
    ⚈  Numbers stored in VARCHAR do not compare 'correctly', especially for range tests.
    ⚈  Bulky.
    ⚈  Dedupping the Values is clumsy.

A Solution


Decide which columns need to be searched/sorted. No, you don't need all the columns to be searchable or sortable. Certain columns are frequently used for selection; identify these. You probably won't use all of them in all queries, but you will use some of them in every query.

The solution uses one table for all the EAV stuff. The columns include the searchable fields plus one BLOB. Searchable fields are declared appropriately (INT, TIMESTAMP, etc). The BLOB contains JSON-encoding of all the extra fields.

The table should be InnoDB, hence it should have a PRIMARY KEY. If you don't have a 'natural' PK, add an AUTO_INCREMENT. Add a small number of other indexes (often 'compound') on the searchable fields. PARTITIONing is unlikely to be of any use, unless the Entities should purged after some time. (Example: News Articles)

But what about the ad hoc queries?


You have included the most important fields to search on -- date, category, etc. These should filter the data down significantly. When you also want to filter on something more obscure, that will be handled differently. The application code will look at the BLOB for that; more on this later.

Why it Works


    ⚈  You are not really going to search on more than a few fields.
    ⚈  The disk footprint is smaller; Smaller --> More cacheable --> Faster
    ⚈  It needs no JOINs
    ⚈  The indexes are useful
    ⚈  The one table has one row per entity, and can grow as needed. (EAV needs many rows per entity.)
    ⚈  Performance is as good as the indexes you have on the 'searchable fields'.
    ⚈  Optionally, you can duplicate the indexed fields in the BLOB.
    ⚈  Values missing from 'searchable fields' would need to be NULL (or whatever), and the code would need to deal with such.

Details on the BLOB/JSON


    ⚈  Build the extra (or all) key-value pairs in a hash (associative array) in your application. Encode it. COMPRESS it. Insert that string into the BLOB.
    ⚈  
JSON is recommended, but not mandatory; it is simpler than XML. Other serializations (eg, YAML) could be used.
    ⚈  COMPRESS the JSON and put it into a BLOB (or MEDIUMBLOB) instead of a TEXT field. Compression gives about 3x shrinkage.
    ⚈  When SELECTing, UNCOMPRESS the blob. Decode the string into a hash. You are now ready to interrogate/display any of the extra fields.

Conclusions


    ⚈  Schema is reasonably compact (compression, real datatypes, less redundancy than EAV, etc)
    ⚈  Queries are fast (since you have picked 'good' indexes)
    ⚈  Expandable (JSON is happy to have new fields)
    ⚈  Compatible (No 3rd party products, just supported products)
    ⚈  Range tests work (unlike storing INTs in VARCHARs)
    ⚈  (Drawback) Cannot use the non-indexed attributes in WHERE or ORDER BY clauses, must deal with that in the app.


Contact me by posting a question at
MySQL Forums :: Performance
-- Rick James

Rick's MySQL Documents

MySQL Documents by Rick James

Tips, Debugging, HowTos, Optimizations, etc...

Rick's RoTs (Rules of Thumb -- lots of tips)
Memory Allocation (caching, etc)
Character Set and Collation problem solver
Converting from MyISAM to InnoDB -- includes differences between them
Big DELETEs - how to optimize
Compound INDEXes plus other insights into the mysteries of INDEXing
Partition Maintenance (DROP+REORG) for time series
Entity-Attribute-Value -- a common, poorly performing, design patter; plus an alternative
Find the nearest 10 pizza parlors (efficient searching on Latitude + Longitude)
Alter of a Huge table
Latest 10 news articles -- how to optimize the schema and code for such
Pagination, not with OFFSET, LIMIT
Data Warehouse techniques (esp., Summary Tables)
Techniques on efficiently finding a random row (On beyond ORDER BY RAND())
GUID/UUID Performance (type 1 only)
IP Range Table Performance
MySQL Limits
Galera Limitations (with Percona XtraDB Cluster / MariaDB)
Rollup Unique User Counts
Best of MySQL Forum