7-10 years ago, I repeatedly argued the viewpoints:
- Relational DBMS were the right choice in most cases.
- Multiple kinds of relational DBMS were needed, optimized for different kinds of use case.
- There were a variety of specialized use cases in which non-relational data models were best.
Since then, however:
- Hadoop has flourished.
- NoSQL has flourished.
- Graph DBMS have matured somewhat.
- Much of the action has shifted to machine-generated data, of which there are many kinds.
So it’s probably best to revisit all that in a somewhat organized way.
To make the subject somewhat manageable, I’ll focus on fielded data — i.e. data that represents values of something — rather than, for example, video or images. Fielded data always arrives as a string of bits, whose meaning boils down to a set of <name, value> pairs. Here by “string of bits” I mean mainly a single record or document (for example), although most of what I say can apply to a whole stream of data instead.
Important distinctions include:
- Are the field names implicit or explicit? In relational use cases field names tend to be implicit, governed by the metadata. In some log files they may be space-savingly implicit as well. In other logs, XML streams, JSON streams and so on they are explicit.
- If the field names are implicit, is any processing needed to recover them? Think Hadoop or Splunk acting on “dumb-looking” log data.
- In any one record/document/whatever, are the field names unique? If not, then the current data model is not relational.
- Are the field names the same from one record/document/whatever to the next? I.e., does the data fit into a consistent schema?
- Is there a structure connecting the field names (and if so what kind)? E.g., hierarchical documents, or relational foreign keys.
Some major data models can be put into a fairly strict ordering of query desirability by noting:
- The best thing to query is a relational DBMS. Everything has a known field name, so SELECTs are straightforward. You also have JOINs, which are commonly very valuable. And RDBMS are a mature technology with in many cases great query performance.
- The next-best thing to query is another kind of data store with known field names. In such data stores:
- SQL or SQL-like SELECTs will still work, or can easily be made to do.
- Useful indexing systems can be grafted on to them (although they are typically less mature than in RDBMS).
- In the (mainly) future, perhaps JOINs can be grafted on as well.
- The worst thing to query is a data store in which you only have a schema on read. You have to do work to make the thing queryable in the first place
Unsurprisingly, that ordering is reversed when it comes to writing data.
- The easiest thing to write to is a data store with no structure.
- Next-easiest is to write to a data store that lets you make up the structure as you go along.
- The hardest thing to write to is a relational DBMS, because of the requirements that must be obeyed, notably:
- Implicit field names, governed by metadata.
- Unique field names within any one record.
- The same (ordered) set of field names for each record — more precisely, a limited collection of such ordered sets, one per table.
And so, for starters, most large enterprises will have important use cases for data stores in all of the obvious categories. In particular:
- Usually it is best to have separate brands of general-purpose/OLTP (OnLine Transaction Processing) and analytic RDBMS. Further:
- I have in the past also advocated for a mid-range — i.e. lighter-weight — general purpose RDBMS.
- SAP really, really wants you to use HANA to run SAP’s apps.
- You might want an in-memory RDBMS (MemSQL) or a particularly cloudy one or whatever.
- Your website alone is reason enough to use a NoSQL DBMS, most likely MongoDB or Cassandra. And it often makes sense to have multiple NoSQL systems used for different purposes, because:
- They’re all immature right now, with advantages over each other.
- The apps you’re using them for are likely to be thrown out in a few years, so you won’t have great pain switching if you ever do decide to standardize.
- Whatever else Hadoop is — and it’s a lot of things — it’s also a happy home for log files. And enterprises have lots of log files.
Beyond that:
- You may want something to manage organizational hierarchies and so on, if you build enough custom systems in areas such as security, knowledge management, or MDM (Master Data Management). I’m increasingly persuaded by the argument that this should be a graph DBMS rather than an LDAP (Lightweight Directory Access Protocol) system.
- Splunk is cool.
- Use cases for various other kinds of data stores can often be found.
- Of course you’ll be implicitly using whatever is bundled into your SaaS (Software as a Service) systems, your app-specific appliances and so on.
And finally, I think in-memory data grids:
- Will be widely used and important.
- Will be used to instantiate multiple data models at once.
Related links
- One reason for writing this post was for some deck-clearing before I revisit the white-hot topic of data streaming. (October, 2014)
- I’ve long mused about the challenges of getting by without joins. (November, 2010)
- In 2013 I observed that data models will be in perpetual, rapid flux.
- In 2013 I also discussed attempts to combine multiple data models (or access methods) in a single DBMS.
- I surveyed data models and access methods back in 2008.