ALLSYSTEM DESIGN

10 questions to ask yourself before choosing a NoSQL database

So why am I writing this post?

Is it because I think NoSQL solutions are inferior to RDBMS solutions? Certainly not !

Is it because I’m fixated on the SQL way of doing things and don’t want to dive into the uncertainty of a relatively new technology ? Nope – not that either ! In fact, I’m very excited to learn and use the facilities provided by various distributed databases.

Then why am I writing this ?

The reason is simple – last year, a team within my broader organization set out to design a system which provided a schema management facilities for in-game telemetry events. It turned out to be a lot more expensive than initially planned. Why? Because they went with the wrong database solution.

One of the requirements of this system was ensuring that the schema edits are consistent and latest version of the schema is shown to each schema editor. It also should have supported concurrent edits.

Additionally, the number of users hitting this system concurrently would never exceed a few hundred at max. The amount of data stored would not be in tera bytes – a few hundred giga bytes at maximum.

So, for anyone who understands the tradeoffs of CAP Theorem, the choice should have been obvious – go with a RDBMS. This’d have had the benefit of supporting the consistency and transaction support requirements of the system.

Instead, the team decided to go with a NoSQL database ( Azure Table Storage) for prototyping. The official reason they gave for this choice was that this made prototyping faster and provided more flexibility while updating schemas of individual telemetry events. The low cost of Azure Table Storage when compared to Azure SQL was cited as another reason.

Fast forward 5 months ….

They started experiencing a lot of issues around maintaining transactional integrity for CRUD operations. The thin application logic layer designed to handle transactions wasn’t so thin anymore.  The upgrade and backward compatibility story started to look more complicated.

Plagued with a host of other issues, they were back to the drawing board – this time to replace the storage layer with SQL Azure ! I don’t remember the exact details, but this change added approximately 45% additional time and cost to the project.

Management was upset and the project almost got cut. But the engineers on the team were really good and they were able to land the project, albeit the delay and initial wrong technical decisions.

This project had a happy ending – but it might not have been that way. In fact, a lot of internal projects are shut down when they cannot deliver the promised functionality within +/- 20% of the committed dates.

So, how do you know that a NoSQL solution is right for your next software project ? Start by asking yourself and your team these 10 questions:

# 1 : Are you ready to absorb the training cost for your developers / system admins?

In case of this specific project, none of the devs on this team had previous NoSQL experience but tons o SQL Server experience. Going with a NoSQL solution added about 1 sprint in training and of course, design missteps due to lack of experience.

# 2 : Is your data transaction based ? Alternatively, what level of transaction support do you need ?

If your system requires ACID properties, you’re better off sticking to a RDBMS solution. Otherwise, you’ll spend a lot of time trying to replicate ACID guarantees at your application / business logic layer and chances are you’ll still not be as efficient as a RDBMS solution.

# 3: Do you need High Scalability ?

Always run a back of the envelope calculation about what kind of scalability you’d need. In this specific case, we’re building the system for Microsoft Internal Game Studios.

  • There are 10-15 game studios under consideration – depending on how many signs up to use the system
  • There would be a max of 3-5 active game titles per studio.
  • Each game title stores telemetry schema for three environments – Development, Pre- production (PPE)  and Production
  • For each title, there would be 2-5 data scientists modifying game title data concurrently
  • Each title event would have approximately 50 KB of max event data for each version
  • We’re required to store all all versions – let’s estimate this to be 1000 over the life of a title

Having the above rough estimates, we can calculate the concurrency and storage requirements:

Max Concurrency = Number of Studios * Number of titles/ studio * Number of users / Title

=  15 * 5 * 5 = 375 concurrent users

Max Storage = Number of Studios * Number of titles/ studio * Number of environments

* Size of event schema per version * number of versions to store

= 15 * 5 * 3 * 50 KB * 1000 = 11250000 KB = 11.25 GB of max storage

SQL Azure supports 1024 concurrent open connections and would have been able to easily support the concurrency requirements. Also, 11.25 GB is really a very small number when considering cloud scale.

TL’DR – this system was not meant to be the next FaceBook or Bing – so was going the NoSQL route really worth it ?

#4 : Is a NoSQL Solution realistically going to save you money ?

On paper, Azure Table storage was a cheaper option because it costs only cents per gigabyte of data whereas SQL Azure charged around $5 per gigabyte of data at that time.