ALLSYSTEM DESIGN

10 questions to ask yourself before choosing a NoSQL database

So why am I writing this post?

Is it because I think NoSQL solutions are inferior to RDBMS solutions? Certainly not ! 

Is it because I'm fixated on the SQL way of doing things and don't want to dive into the uncertainty of a relatively new technology ? Nope - not that either ! In fact, I'm very excited to learn and use the facilities provided by various distributed databases.

Then why am I writing this ? 

The reason is simple - a few years back, I witnessed the design a system which provided schema management facilities for telemetry events. It turned out to be a lot more expensive than initially planned. Why? Because wrong database solution was chosen.

One of the requirements of this system was ensuring that the schema edits are consistent and latest version of the schema is shown to each schema editor. It also should have supported concurrent edits

Additionally, the number of users hitting this system concurrently would never exceed a few hundred at max. The amount of data stored would not be in terabytes - a few hundred gigabytes at maximum.

So, if we considered the trade-offs of CAP Theorem, the choice should have been obvious - go with a RDBMS. This'd have had the benefit of supporting the consistency and transaction support requirements of the system. w

Instead, a NoSQL database ( Azure Table Storage) was chosen for prototyping. The official reason for this choice was that it made prototyping faster and provided more flexibility while updating schemas of individual telemetry events. The low cost of Azure Table Storage when compared to Azure SQL was cited as another reason.

Fast forward 5 months ....

The system started experiencing a lot of issues around maintaining transnational integrity for CRUD operations. The thin application logic layer designed to handle transactions wasn't so thin anymore.  The upgrade and backward compatibility story started to look more complicated. 

Plagued with a host of other issues, engineers were back to the drawing board - this time to replace the storage layer with Azure SQL! I don't remember the exact details, but this change added approximately 40% additional time and cost to the project.

Management was upset and the project almost got cut. But the engineers on the team were really good and they were able to land the project, albeit the delay and initial wrong technical decisions. 

This project had a happy ending - but it might not have been that way. In fact, a lot of internal projects are shut down when they cannot deliver the promised functionality within +/- 20% of the committed dates.

So, how do you know that a NoSQL solution is right for your next software project ? Start by asking yourself and your team these 10 questions:

#1: Are you ready to absorb the training cost for your developers / system admins ?

    If you're an established IT shop/ software development company, chances are you already have people on your payroll who're intimately familiar with SQL. This group includes not only developers, but also database administrators (DBAs).

Unless you're planning to do hiring for the new NoSQL project, there is going to be a training cost for existing devs and DBAs. The additional training will also likely stretch out the project delivery dates. 

One easy way to think about this is as follows:

  • Count the total number of years of experience your team members (devs + DBAs) have with relational database technology.
  • Figure out the cost of getting to the same level number of years of experience with NoSQL either via training or new hiring. 
  • Finally, figure out what you're getting out of absorbing this cost , i.e., your ROI ?

In case of this specific project, none of the devs on this team had previous NoSQL experience but tons o SQL Server experience. Going with a NoSQL solution added about 1 sprint in training and of course, design missteps due to lack of experience.

#2: Is your data transaction based? Alternatively, what level of transaction support do you need ?

    If your system requires ACID properties, you're better off sticking to a RDBMS solution. Otherwise, you'll spend a lot of time trying to replicate ACID guarantees at your application / business logic layer and chances are you'll still not be as efficient as a RDBMS solution. 

#3: Do you need Web/High Scalability ?

    Always run a back of the envelope calculation about what kind of scalability you'd need. In this specific case, we're building the system for Microsoft Internal Game Studios.

  • There are 10-15 game studios under consideration - depending on how many signs up to use the system
  • There would be a max of 3-5 active game titles per studio. 
  • Each game title stores telemetry schema for three environments - Development, Pre- production (PPE)  and Production
  • For each title, there would be 2-5 data scientists modifying game title data concurrently
  • Each title event would have approximately 50 KB of max event data for each version
  • We're required to store all all versions - let's estimate this to be 1000 over the life of a title

Having the above rough estimates, we can calculate the concurrency and storage requirements:

Max Concurrency = Number of Studios * Number of titles/ studio * Number of users / Title 

                                =  15 * 5 * 5 = 375 concurrent users

Max Storage = Number of Studios * Number of titles/ studio * Number of environments 

                          * Size of event schema per version * number of versions to store

                        = 15 * 5 * 3 * 50 KB * 1000 = 11250000 KB = 11.25 GB of max storage

SQL Azure supports 1024 concurrent open connections and would have been able to easily support the concurrency requirements. Also, 11.25 GB is really a very small number when considering cloud scale.

TL'DR - this system was not meant to be the next FaceBook or Bing - so was going the NoSQL route really worth it ?

#4: Is a NoSQL solution realistically going to save you money ?

On paper, Azure Table storage was a cheaper option because it costs only cents per gigabyte of data whereas SQL Azure charged around $5 per gigabyte of data at that time. 

But since the storage for our system would not exceed 12GB - did this really matter ? $60 per month is literally what we pay a dev for 30 minutes of writing code on the same system. 

So, before deciding to go with NoSQL just because it is cheaper in unit cost, figure out whether the savings is amounting to a significant portion of the budget.

#5: Do you need to attract Venture Capital ?

Anecdotally, there is a bias in the silicon valley towards NoSQL . This is because NoSQL is perceived to be inherently scalable and RDBMS is perceived to be non-scalable. Remember, the keyword here is perception !

This perception of scalability may help convince investors that your software is on the right trajectory to be ready for mass adoption and hence attract their investment dollars.

Many of the NoSQL companies are themselves VC funded - this also creates a positive bias towards them.

Finally, all the marketing buzz around "NoSQL" helps with driving positive sentiment about your product among investors.

#6:  Are you hiring start-up mentality people ?

If you're planning on hiring start-up mentality folks, a lot of them might already have NoSQL knowledge. 

However, if you're not strategically located in a major tech hub, the chances of landing those individuals are less. You area might have a ready pool of RDBMS developers -- trying to recruit NoSQL engineers and DBAs in such an area might delay the project delivery dates and also cost you a lot more money due to the supply demand curve.

My recommendation is to work with your recruiting agency/ HR department to run a market survey for developers and factor that into your technology choices as well.

#7: What technology do your customers use downstream ?

Consider a scenario where you're delivering analytics  data to your customers. You're using a NoSQL to store the analytics data. However, one of your customers decides to stick with their SQL based reporting system. 

what does this mean for you ?

It means that you now need to translate all the NoSQL data into SQL format and push down to your customers SQL database via a service such as Azure Data Factory. This is an additional development and operational cost you need to absorb. If all your downstream customers are using SQL, then you need to take a hard look at whether using NoSQL and doing all these expensive data transforms makes sense for your system ?

#8: For your product , does availability trump consistency ?

If you're building a system like Facebook newsfeed, you might want the system to be highly available and live with eventual consistency

On the other hand, if you're building a banking system (or a schema store as in our case), you might want to favor immediate consistency and forgo high availability.

Either way, you should first consider the implications of CAP theorem and then decide if you need a SQL or NoSQL solution for your system.

#9: Do you anticipate lot of changes to your database schema ?

If you expect a lot of changes to your database schema, as is often the case with mobile apps, real-time analytics, content management systems, etc., then a NoSQL solution might be the way to go.

You can employ a partitioning scheme that can allow you to update your DB schema in a more convenient way than most SQL databases allow. 

#10: Do you want to use NoSQL for personal enrichment/ gratification ?

Please don't do it!

I've seen a few people who are just enamored by the possibilty of learning a NoSQL system and putting it in their resume. There's nothing wrong with it - I'm fascinated by NoSQL technology too. 

But please don't let that be a driving factor behind choosing a technology stack (consciously or subconsciously). You can always learn it on your own time if you're so inclined.


Who wins the database wars?

Frankly - no-one and both!

In a lot of cases, you might need SQL and NoSQL technologies to co-exist side by side in the same system. For example, if you're building a photo sharing application like Instagram, your photos might reside in a NoSQL database whereas your login/ ACLs information might reside in a SQL database. 

Interested in learning more ? Please check out the book "Next Generation Databases" for an in depth look at the evolution and features of NoSQL technology.