Saturday, August 18, 2012

Choosing a NoSQL Database

Where to Start?

When starting a new project, it is necessary to figure out which of the multitude of No SQL providers to choose from.

I found the following diagram very helpful: No-SQL Diagram
Take note that it covers both CAP requirements and data model (Key-Value, Column or Document).

The following article brings home what CAP means in the real world:  MongoDB vs CouchDB

Once we have eliminated a few choices (we are picking from a short-list) the following article is helpful to understand the trade-offs we are making:  No-SQL Summary

What does CAP mean to me ?

When we introduce a second data center into our design, we are forced into an overall AP model.
We know there will be times that the two data centers will lose connectivity from each other (P) and the primary reason we are doing this is to keep our site available (A).

Even if we run an Active-Passive configuration, we must accept that our primary DC may lose power before replicating data to our secondary. If/when we get the primary datacenter back online we will need to reconcile the data (eventually consistent).

The question becomes, do we want to handle this inconsistency at the application or the data-store layer, putting this another way:
Do we want our data store to span data centers?

If we span data centers we need to deal with eventually consistent data on the site, we will have different data at each site and totals will not always match, we have to make sure that these inconsistencies don't impact the user (too much).

If we isolate the data centers (data-store clusters don't span DC's) all the data-stores within the datacenter can be CP (if a fatal problem occurs it is still possible to cut over to the other data-center). But within each DC things are now much simpler as all the stores are consistent.
The fail-over/back events (at the datacenter level) cause inconsistencies that have to be handled at the application layer, however this may be okay if such fail-overs are rare and the desired SLA is more generous.

In summary if data-stores clusters are restricted to a single data center (or AWS region) our application logic can be simpler, than it would be, if it had to handle eventually consistent data.
However we run the risk of having to reconcile data during a datacenter failover. This risk can be mitigated by spreading the nodes of the cluster across independent power/network zones (availability zones) hence reducing the chance of an outage at the datacenter. But this risk can not be eliminated unless we handle inconsistent data at the application layer.

Data Model

The choice of data model comes down to how we want to handle "document linkage". Key/Value stores are easy to understand and the most efficient for simple use cases. However we inevitably get to the point where we want to "serialize" more complex objects into a single value.
Then we discover that we need to access the data in a more complex way and we end up with a more complex solution:
  • Creating reference documents, where the value is a list of keys.
  • Mangling the keys so that we can load data based on a pattern.
  • Using a third party index (such as solr).
Document stores are specifically optimized for this usage pattern - to be able to extract (and update) parts of multiple documents in a single query.

Hence the choice between these technologies is based on how complex we expect our use cases to be.

Sanity Check our Choice

Running a google query on "DATABASE_NAME doesn't scale" is a good way to find the articles about the negatives.

The most important question is why (the given DB doesn't scale), since if a product only has one limiting feature its likely someone is working on fixing it (we can also make an assessment of how likely the fix is).

Anyway here are a couple of good negative articles:

No comments:

Post a Comment