Current Mood.

In Search of Flexible Data

Some time ago, on a project long long archived, I philosophised a new database structure. Being of a sound-mind — and most importantly a one-man development team — I could run with this theorised new structure, which in essence is built on the idea of flexibility.

In a nutshell — everything is an item, items are defined by their attributes.

Items are anything. Blog posts, a person, a customer order. Each of these things have defining properties — a blog title, a name, an invoice number etc. Some have 400 defining properties, some have 4. Some are integars, some are 4000 words.

For a basic example, the structure looks like this...








item
item_id*item_type_iditem_date_addeditem_status
112016-06-30 00:12:001
212016-06-30 00:13:000
312016-06-30 00:14:001







item_attribute
item_attribute_id*item_iditem_attribute_type_iditem_attribute_valueattribute_date_added
111Hello World2016-06-30 00:12:00
212This is a blog post, shortened here so...2016-06-30 00:12:00
321How To Blog 1012016-06-30 00:13:00






item_attribute_type
item_attribute_type_id*item_attribute_name
1blog_title
2blog_content





item_type
item_type_id*item_type_name
1blog_post


Connected...

In the small example outlined above, item_id=1 has 2 attributes assigned — a blog title and some blog content — while item_id=2 has just a blog title set.

This system (with some indexing and the occasional custom view) has served me well, allowing new attributes to be added — even if they apply to a single blog post, promoting a depth-over-breadth structure.

However (as always), structures are tested with queries. And this is where the strength of the flexibility was tested. Even something as simple as getting all the blog posts and printing them out as an archive would require multiple queries — ie. get all the items, cycle through and get the required attributes (if they exist). I've searched tirelessly over the past few years for a cleaner JOIN for these queries, but because this was my own theory, there wasn't an easily defined search term. I experimented with my own, but it was always faster/lighter to simply run with multiple small queries.

And then I stumbled on this (via The Straight Dope Forum), which I've been informed is similar to an Excel pivot statement...


MAX(CASE WHEN item_attribute.item_attribute_type_id=[attribute_id] THEN item_attribute.item_attribute_value ELSE NULL END) as [attribute_name]

For the example above...


SELECT i.*, MAX(CASE WHEN item_attribute.item_attribute_type_id=1 THEN item_attribute.item_attribute_value ELSE NULL END) as blog_title, MAX(CASE WHEN item_attribute.item_attribute_type_id=2 THEN item_attribute.item_attribute_value ELSE NULL END) as blog_content FROM item i LEFT JOIN item_attribute ON (item_attribute.item_id = i.item_id) GROUP BY i.item_id ORDER BY i.item_date_added DESC

Taking this further (spoiler: I did) the attribute part of the query can be drawn directly from the attribute table. An extra query for your page load (cached as needed, obvs) but will allow this main query to expand as new attributes are added.

There are some obvious scalability issues that I haven't yet tested (the current project I'm trialling this on only has about a dozen attributes so far). Not sure what kind of heavy load will be endured when there's 400 different attributes and all we need is a knife and/or 4 or 5 to display a bloody blog post on a page.

A possible solution could be a table that connects the specific query to the required attributes needed for that specific page. For example:







page
page_id*page_name
1blog_archive
2latest_blog_posts






page_2_attribute
page_2_attribute_id*page_idattribute_id
111
212


Continuing the original example, here we just need to first grab the specific attributes required for the blog_archive page (blog_post and blog_content) and then plug them into the query above. Later on, if we discover we also need to include the output of a new attribute called blog_image on this same page, we simply need to add the attribute and then connect it all up in the page_2_attribute joining table. Furthermore, this could even be expanded so the page_name is referenced automatically in the URL (settle down, sport).

As said, still very much in test mode, but the flexible structure and this new pivoting query is so far serving me well. But as Dave Winer would say "I'm still digging".