NoSQL Distilled

Posted on September 17, 2012

I’ve been waiting expectantly for this book for a while now, so as soon as it was published I was keen to obtain a copy. NoSQL is a set of persistence technologies that justifiably receive a lot of attention because of the way they force people to change how they look at data access. But because NoSQL is a set of (complicated) technologies, they are generally not very well understood. This, coupled with Fowler’s ability to convey complex concepts in a way that the rest of us mortals can understand, has resulted in a compelling read that I will try to summarise in what follows.

Part I: Understand

1. Why NoSQL

The book starts out by explaining the value of relational databases. For me this is very important because often when a new technology comes along there is a tendency to throw the baby out with the bath water and reject similar technologies that have gone before. The impedance mismatch between OO models and relational models is described along with the emergence of ORM to address this mismatch. The difference between an application database and an integration database is described, as well as the significant increase in IOPS demanded from databases due to the massive increases in the scale of applications (achieved by clustering). The emergence of NoSQL is explained and an attempt is made to define the term.

2. Aggregate Data Models

One of the defining characteristics of a NoSQL database is that it stores data as aggregates as opposed to the normalised representations we are familiar with. A whole chapter is dedicated to describing aggregate data models, the consequences of using aggregate data models and the different ways in which NoSQL databases store aggregates.
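To make the contrast concrete, here is a minimal sketch of the same order held in a normalised form and as a single aggregate. It is not taken from the book; the order data, field names and helper functions are all hypothetical and plain Python dictionaries stand in for a real database.

```python
# Hypothetical order data; names and fields are illustrative only.

# Normalised form: one structure per entity, linked by ids.
customers = {1: {"name": "Ann"}}
orders = {99: {"customer_id": 1}}
order_lines = [
    {"order_id": 99, "product": "book", "price": 30},
    {"order_id": 99, "product": "pen", "price": 2},
]

# Aggregate form: the whole order is stored and retrieved as one unit.
order_aggregate = {
    "id": 99,
    "customer": {"id": 1, "name": "Ann"},
    "lines": [
        {"product": "book", "price": 30},
        {"product": "pen", "price": 2},
    ],
}


def total_normalised(order_id):
    # The total requires looking across a separate structure (a "join").
    return sum(line["price"] for line in order_lines if line["order_id"] == order_id)


def total_aggregate(order):
    # Everything needed is already inside the aggregate.
    return sum(line["price"] for line in order["lines"])
```

The aggregate is convenient when the order is always read as a whole, but answering a question such as "which orders contain a pen?" means looking inside every aggregate.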

3. More Details on Data Models

Further consequences of using aggregate data models are explored before taking a look at Graph databases which are used to persist data that contains complex relationship structures. The implications of schema-less databases are discussed along with some of the positives and negatives of this approach. One of the big challenges for newcomers to NoSQL is identified as modelling aggregates in a way that fits the way in which the data is to be accessed. However, this was explained very well.

4. Distribution Models

While it is valid for a NoSQL database to be deployed on a single server, one of the major benefits of NoSQL databases is their ability to cluster easily and cost effectively. Deploying in a clustered configuration means consideration must be given to partitioning (or sharding) and replication. Just like with a relational database, partitioning can be achieved by partitioning collections/tables across different servers and/or splitting the data within those collections/tables. The fact that data is stored as aggregates makes this easier to achieve using NoSQL databases. Replication is primarily thought of as providing resilience though more and more it is used to increase the number of IOPS achievable by distributing reads and writes across the cluster (with associated effects on data consistency).

5. Consistency

Consistency is one of the things that we often don’t consider when using a relational database. This is because there is a single copy of the data and the database protects us to a large degree from reading stale data (though most RDBMS allow for increasing/reducing consistency). Techniques such as optimistic locking are fairly commonplace thanks to the adoption of ORM frameworks and provide another mechanism to avoid conflicts. Of course, when we start to replicate data across a cluster and allow reads/writes to more than one node, managing conflicts becomes a lot more challenging. We start to have to make trade-offs between consistency and latency. These are described along with how durability may also be traded for reduced latency.
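As a reminder of what optimistic locking looks like on a single node, here is a minimal sketch, assuming an in-memory store. The class and method names (VersionedStore, StaleWriteError) are hypothetical and not from the book or any particular ORM.

```python
class StaleWriteError(Exception):
    """Raised when a write is based on an out-of-date read."""


class VersionedStore:
    """Hypothetical single-node store that keeps a version with each value."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        # Unknown keys are reported as version 0 with no value.
        return self._data.get(key, (0, None))

    def write(self, key, value, expected_version):
        current_version, _ = self.read(key)
        if current_version != expected_version:
            # Someone else wrote since the caller's read: reject the update.
            raise StaleWriteError(
                f"expected version {expected_version}, found {current_version}"
            )
        new_version = current_version + 1
        self._data[key] = (new_version, value)
        return new_version
```

A writer reads the current version, does its work, and passes that version back with the update; if another writer got there first, the update is rejected rather than silently overwriting the other change. The check only works this simply because there is one copy of the data to compare against.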

6. Version Stamps

Version stamps are rightly given only a small portion of the book. Again, the adoption of ORM and optimistic locking means that the concept of versioning data is very well understood today. However, some very interesting points are made about how this changes when data is replicated across a cluster and the term vector stamp is introduced.
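The idea behind a vector stamp can be sketched as one counter per node, compared entry by entry. This is a minimal illustration under my own assumptions (node names, the dictionary representation and the function names are hypothetical), not the implementation used by any particular database.

```python
def increment(stamp, node):
    """Return a copy of the stamp with this node's counter advanced by one."""
    updated = dict(stamp)
    updated[node] = updated.get(node, 0) + 1
    return updated


def compare(a, b):
    """Classify two vector stamps as equal, ordered, or conflicting."""
    nodes = set(a) | set(b)
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        # Each side has seen an update the other has not: a write-write conflict.
        return "conflict"
    if a_ahead:
        return "a is newer"
    if b_ahead:
        return "b is newer"
    return "equal"
```

For example, if two nodes both start from the same stamp and each accepts a write independently, comparing the resulting stamps reports a conflict, which a single version counter could not detect.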

7. Map-Reduce

For me, Map-Reduce is one of the defining features of NoSQL databases. A solid explanation of the map and reduce processes is provided, as well as of what is described as incremental map-reduce. Incremental map-reduce occurs when the map-reduce process outputs data that is materialised (stored for some temporary period of time) and used as input for further map-reduce processes.
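The map and reduce steps can be sketched in a few lines. This is a single-process illustration with hypothetical order data and function names; a real NoSQL database would run the map step on each node and shuffle the grouped values between nodes before reducing.

```python
from collections import defaultdict

# Hypothetical order aggregates.
order_records = [
    {"id": 1, "lines": [
        {"product": "book", "price": 30, "quantity": 2},
        {"product": "pen", "price": 2, "quantity": 1},
    ]},
    {"id": 2, "lines": [
        {"product": "book", "price": 30, "quantity": 1},
        {"product": "pen", "price": 2, "quantity": 1},
    ]},
]


def map_order(order):
    # Map: emit one (product, revenue) pair per order line.
    for line in order["lines"]:
        yield line["product"], line["price"] * line["quantity"]


def reduce_revenue(product, values):
    # Reduce: combine all values emitted for the same key.
    return sum(values)


def map_reduce(records, mapper, reducer):
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}


revenue_by_product = map_reduce(order_records, map_order, reduce_revenue)
```

Because the mapper only ever looks at one aggregate at a time, the map step can be run in parallel wherever the aggregates happen to be stored.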

Part II: Implement

8. Key-Value Databases

The key-value family of NoSQL databases is described via a number of features such as consistency, transactions, query features, data structure and scaling. A number of use cases are provided as well as scenarios when key-value databases are not suitable. Given that the majority of developers are familiar with Map implementations and caching, key-value database concepts are reasonably accessible. That said, sharding is something that many developers will be unfamiliar with and this receives some attention. For me, sharding at a basic level can be compared to hash codes. Of course, concepts such as distribution and locality become more important when sharding across physical clusters (not that developers should be unaware of these things when writing hash code implementations).
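The comparison with hash codes can be sketched as follows: a key is hashed and the hash picks the shard, much as a hash code picks a bucket in a Map. This is a minimal sketch under my own assumptions; the class and function names are hypothetical, in-memory dictionaries stand in for servers, and real products use more elaborate schemes so that adding a shard does not move most of the keys.

```python
import hashlib


def shard_for(key, shard_count):
    """Pick a shard from a stable hash of the key."""
    # A stable digest is used because Python's built-in hash() of a string
    # varies between processes.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedKeyValueStore:
    """Hypothetical key-value store spread over several in-memory shards."""

    def __init__(self, shard_count):
        self.shards = [{} for _ in range(shard_count)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        return self.shards[shard_for(key, len(self.shards))].get(key)
```

As with a poor hash code, a poor choice of shard key leads to uneven distribution, with some shards doing most of the work.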

9. Document Databases

Given a background in REST, I think many of the concepts of document databases will be very familiar (documents are not dissimilar to REST resources). There are many overlapping concepts between all NoSQL databases, and the treatment of consistency, transactions and scaling is similar to that in the key-value chapter. Document databases store data in documents, and often the sub-elements of these documents can be queried and modified, which is in contrast to key-value databases. Document databases also tend to have richer query capabilities. Suitable and unsuitable uses of document databases conclude this section.

10. Column-Family Stores

Column family stores store data in similar ways to an RDBMS in that rows and columns are concepts that still exist (in contrast to the other NoSQL families). The query features are very different though and column family stores share consistency, transaction and cluster characteristics with other NoSQL databases.

11. Graph Databases

For me, graph databases are the odd man out of the NoSQL bunch, particularly in the way that they scale. Nodes and edges are introduced as the main concepts that underpin graph databases. These concepts also dictate many of the features of the query language.
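To show how nodes and edges shape the queries, here is a minimal sketch of a traversal over a tiny hypothetical graph held in memory. The data, edge types and function names are my own illustrative assumptions; a real graph database would express this in its own query language.

```python
from collections import deque

# Hypothetical directed edges: (source node, edge type, target node).
edges = [
    ("anna", "FRIEND", "barbara"),
    ("barbara", "FRIEND", "carol"),
    ("anna", "LIKES", "nosql"),
    ("carol", "FRIEND", "dawn"),
]


def neighbours(node, edge_type):
    """Nodes reached from `node` by following one edge of the given type."""
    return [target for source, kind, target in edges
            if source == node and kind == edge_type]


def reachable(start, edge_type, max_depth):
    """Breadth-first traversal along one edge type, up to max_depth hops."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue
        for nxt in neighbours(node, edge_type):
            if nxt not in seen:
                seen.add(nxt)
                found.append(nxt)
                queue.append((nxt, depth + 1))
    return found
```

Queries are phrased as "start here and follow these edges", which is why a question like "friends of friends" is natural in a graph database and awkward in an aggregate-oriented one.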

12. Schema Migrations

Given that one of the defining characteristics of a NoSQL database is its lack of schema, a discussion on schema migrations seems a little out of place. NoSQL database solutions typically require the schema to be maintained within the application code that accesses the data, and so schema migrations are still necessary, but they occur in the application. This is a big shift from the techniques used to maintain schema with an RDBMS.
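One way a migration can live in application code is to upgrade each document as it is read. The sketch below assumes a hypothetical customer document with a schema_version field and a change from a single name field to separate first and last names; none of these names come from the book.

```python
CURRENT_SCHEMA_VERSION = 2


def migrate(document):
    """Return a copy of the document upgraded to the current schema version."""
    doc = dict(document)
    # Documents written before versioning was introduced count as version 1.
    version = doc.get("schema_version", 1)
    if version == 1:
        # Version 1 stored a single "name"; version 2 splits it in two.
        first, _, last = doc.pop("name", "").partition(" ")
        doc["first_name"] = first
        doc["last_name"] = last
        version = 2
    doc["schema_version"] = version
    return doc
```

The application calls migrate on every document it loads and writes the upgraded form back when it next saves, so old and new shapes can coexist in the database while the data is migrated gradually.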

13. Polyglot Persistence

One of the things I really liked about this book is the balanced opinion provided. This is emphasised by the fact that a whole chapter is given over to discussing polyglot persistence (i.e. using different storage technologies to handle varying storage needs). It’s refreshing that the opinion expressed does not totally dismiss the dominant persistence technology (RDBMS) in favour of the newcomer (NoSQL), as clearly this would be extremely naive.

14. Beyond NoSQL

This chapter discusses data storage solutions not covered by the RDBMS or NoSQL umbrellas. These include file systems, event sourcing, memory image, version control, XML databases and object databases. This reinforces the points made in the previous chapter that encourage one to select a storage technology that fits the storage need.

15. Choosing Your Database

The book concludes with a very useful summary of some of the considerations you should make when choosing a database technology.

Thoughts

This is the best technology book I have read in quite a while. It did not overpromise but delivered a comprehensive and well-rounded overview of NoSQL databases. I’d highly recommend it for anyone involved in the field of software development who is interested in what NoSQL has to offer.