Couchbase 4.0 review: The Swiss Army knife of NoSQL

Hybrid document-oriented, key-value database brings easy, ad hoc queries into the mix with a SQL-like query language


Couchbase Server, similar to MongoDB and RethinkDB, is a document-oriented distributed database, but that description sells it a good deal short. Couchbase is what you get when a distributed key-value store and a document database join forces -- literally.

With direct and immediate ties to both Membase and CouchDB, Couchbase Server takes the best of both worlds and jams them into a single product. It's even added structured queries to the mix. With the recent release of version 4.0, the open source database takes a big leap forward in usability with the introduction of the SQL-like N1QL query language.

Understanding persistence in Couchbase is much easier if you approach it as a key-value store rather than as a document database. With a key-value store, it’s obvious that in order to store and retrieve a value, you also need to provide a key. It’s also obvious that the key doesn’t have to be repeated somewhere inside the value you’re storing. However, if you’re coming from a document database like MongoDB, it may seem strange to pull the unique identifier out of the object in order to store it.
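To make the point concrete, here is a minimal Python sketch in which a plain dictionary stands in for a bucket. The `bucket`, `upsert`, and `get` names, the key format, and the document fields are all illustrative assumptions, not Couchbase's client API.

```python
import json

# A plain dict stands in for a bucket: key -> serialized JSON value.
bucket = {}

def upsert(key, document):
    """Store a document under an explicit key; the key lives outside the value."""
    bucket[key] = json.dumps(document)

def get(key):
    """Direct key lookup, as in any key-value store."""
    return json.loads(bucket[key])

# The identifier is the key, so it does not need to be repeated in the body.
upsert("user::1001", {"first_name": "Ada", "last_name": "Lovelace", "city": "London"})
user = get("user::1001")
```

The stored value is still a full JSON document, which is what lets the same record serve both key-value lookups and document-style queries.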

The benefit here is that you’re really getting two kinds of databases at once. You can still take advantage of the key-value functionality that’s been a part of Couchbase from the beginning, while utilizing the document storage and retrieval that was incorporated in the 2.0 release.

If I had a N1QL for every view...

Before diving into what N1QL is, let’s get some context for why it’s here. Formerly, the way you got data out of Couchbase was either by direct key lookup or by writing an incremental map-reduce script (a “view”). This was a huge drawback because it seriously limited the ability to run ad hoc queries with acceptable performance.

If you decided you were interested in sifting through data by last name instead of Social Security number, for example, you’d have to write up a map-reduce job and wait for Couchbase to do the equivalent of a full table scan in order to populate your view. (Subsequent requests for that data would be speedy, but the first time for any custom view was painful.) On top of that, if you wanted to do anything else with the data (such as roll up the results on city and state), you’d be stuck doing it in your application code or writing another map-reduce job.
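The shape of that workflow can be sketched in a few lines of Python. Real Couchbase views are written as JavaScript map and reduce functions; the sample documents and the function names below are hypothetical stand-ins meant only to show why each new question costs a new view and a pass over every document.

```python
from collections import defaultdict

# Hypothetical documents keyed by Social Security number.
docs = {
    "111-11-1111": {"last_name": "Smith", "city": "Austin", "state": "TX"},
    "222-22-2222": {"last_name": "Jones", "city": "Austin", "state": "TX"},
    "333-33-3333": {"last_name": "Smith", "city": "Boise", "state": "ID"},
}

def map_by_last_name(key, doc):
    """Map step: emit (last_name, document key) for every document."""
    yield doc["last_name"], key

def map_by_city_state(key, doc):
    """A different question needs a different map function (a second view)."""
    yield (doc["city"], doc["state"]), 1

def build_view(documents, map_fn):
    """Populating a view touches every document once (the full-scan cost)."""
    view = defaultdict(list)
    for key, doc in documents.items():
        for emitted_key, value in map_fn(key, doc):
            view[emitted_key].append(value)
    return view

by_last_name = build_view(docs, map_by_last_name)
# Reduce step: count the emitted values per (city, state) key.
per_city = {k: sum(v) for k, v in build_view(docs, map_by_city_state).items()}
```

Once built, `by_last_name["Smith"]` is a cheap lookup, which mirrors the "painful first time, speedy afterward" behavior described above.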

N1QL aims to eliminate that pain by overlaying a partial SQL implementation on top of the otherwise NoSQL model. N1QL not only gives programmers another option for querying their data, but its familiar dialect opens the field to less technical folk who have experience and comfort in the world of SQL. Enabling your business analysts to explore the data more comfortably reduces iteration time to valuable results and frees up engineers who would otherwise be fielding those requests by writing custom views.

N1QL supports a seriously wide range of SQL syntax, from simple SELECT and WHERE statements to nested queries, GROUP BY and HAVING aggregations, and even JOIN. Here is an example of a valid N1QL query:

SELECT u.screen_name as sn, count(*) as num_tweets
FROM `tweets` tw
JOIN
  `users` u ON KEYS tw.user_id
WHERE tw.text LIKE "%javascript%"
GROUP BY u.screen_name
HAVING count(*) > 5

The above query finds tweets containing the word “javascript,” joins them with data from the “users” bucket, groups tweets by user, and only returns groups that have more than five tweets per user. Note that only INNER and LEFT OUTER joins are supported, and one side of the join has to be on a bucket key. Those limitations purportedly improve performance, but I’d still be hesitant to add any join-heavy query logic to my application code. If you know you’ll need it, you’re better off writing a view and side-stepping the costlier query.
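For readers who think more easily in application code, the following Python sketch walks through the same filter, join, group, and threshold steps over in-memory data. The sample tweets and users and the `prolific_users` helper are invented for illustration; the sketch says nothing about how the query engine actually executes the statement.

```python
from collections import Counter

# Hypothetical data: the `users` bucket is keyed by user id.
users = {
    "u1": {"screen_name": "alice"},
    "u2": {"screen_name": "bob"},
}
tweets = (
    [{"user_id": "u1", "text": "I love javascript"}] * 6
    + [{"user_id": "u2", "text": "javascript again"}] * 2
    + [{"user_id": "u2", "text": "nothing relevant"}] * 9
)

def prolific_users(tweets, users, word, min_tweets):
    counts = Counter()
    for tw in tweets:
        if word not in tw["text"]:           # WHERE tw.text LIKE "%word%"
            continue
        user = users.get(tw["user_id"])      # JOIN ... ON KEYS: a direct key lookup
        if user is None:                     # inner join drops unmatched tweets
            continue
        counts[user["screen_name"]] += 1     # GROUP BY u.screen_name
    # HAVING count(*) > min_tweets
    return {sn: n for sn, n in counts.items() if n > min_tweets}

result = prolific_users(tweets, users, "javascript", 5)
```

The join step is a single lookup by key, which is why one side of a N1QL join has to be a bucket key.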

Because the nesting of documents and document elements is common practice in document databases, N1QL includes new operators to help navigate these structures. NEST and UNNEST gather documents and split them out, respectively, while helper functions like array_length() allow you to work with embedded arrays.

Indexing, updates, and storage engines

When it comes to relational databases, SQL query performance can be greatly improved by indexing properly ahead of time, and the same is true for N1QL. To support indices for N1QL queries, Couchbase Server now comes with an Index service. The Index service is a new component that allows you to create and manage indices on buckets of data. You can create an index by specifying the fields on which to index, as well as N1QL expressions and an optional WHERE clause to filter which documents get sent to the indexer.
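As a rough mental model of an index with a filtering WHERE clause, the Python sketch below builds a sorted list of (field value, document key) pairs for only the documents that pass a predicate, then answers a lookup by binary search. The `build_index` and `lookup` helpers and the sample documents are hypothetical, and the real Index service is of course far more involved.

```python
import bisect

# Hypothetical documents in a bucket, keyed by document id.
docs = {
    "u1": {"type": "user", "last_name": "Smith"},
    "u2": {"type": "user", "last_name": "Jones"},
    "o1": {"type": "order", "last_name": "Smith"},
}

def build_index(documents, field, where=lambda doc: True):
    """Return sorted (field value, doc key) pairs for documents passing `where`."""
    entries = [(doc[field], key) for key, doc in documents.items() if where(doc)]
    return sorted(entries)

def lookup(index, value):
    """Binary-search the sorted entries instead of scanning every document."""
    start = bisect.bisect_left(index, (value,))
    keys = []
    for field_value, key in index[start:]:
        if field_value != value:
            break
        keys.append(key)
    return keys

# Only documents of type "user" are sent to the indexer.
idx = build_index(docs, "last_name", where=lambda d: d["type"] == "user")
smiths = lookup(idx, "Smith")
```

Filtering at build time keeps the index smaller, at the price that queries outside the filter cannot use it.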

A key point to keep in mind: When you create an index, it exists on a single instance of the index service. If that instance goes down, you’re out of luck. There is no automated replication or sharding of indexes; unless you’ve manually created the index on multiple nodes (which you can and should do), you’re back to full bucket scans for potentially complex queries. Compare indexed queries to writing incremental map-reduce views for complex queries. Although map-reduce views are sharded, distributed, and replicated with your data, getting results for a view requires scatter-gather operations across the network. Because an index resides completely on a single node, you can avoid any scatter-gather operations if you have a covering index.

Like Cassandra and some other popular data stores, Couchbase employs an append-only write model. This model favors immutability by never performing in-place updates. Instead, updated documents are added to the end of a file, which is subsequently read from the end. The most recent document wins, and old versions of the same document are invalid.

The append-only write model lends itself to the problem of files growing forever, since every possible change results in more bits at the end of the file. As a result, a cleanup step, often referred to as compaction, is needed to prevent the disk from filling up. The existing file is rewritten without all of the stale documents, and when the new file catches up with the old file, the database starts using the new one and the old one can be deleted.
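The append-and-compact cycle can be illustrated with a toy Python store in which a list of JSON lines stands in for the data file. The `AppendOnlyStore` class is an assumption made for this sketch and does not reflect Couchbase's on-disk format.

```python
import json

class AppendOnlyStore:
    """Toy append-only store: a list of JSON lines stands in for the data file."""

    def __init__(self):
        self.log = []

    def put(self, key, value):
        # Never update in place; always append a new record to the end.
        self.log.append(json.dumps({"key": key, "value": value}))

    def get(self, key):
        # Read from the end so the most recent version wins.
        for line in reversed(self.log):
            record = json.loads(line)
            if record["key"] == key:
                return record["value"]
        raise KeyError(key)

    def compact(self):
        # Rewrite the file keeping only the latest record per key,
        # then switch over to the new file.
        latest = {}
        for line in self.log:
            record = json.loads(line)
            latest[record["key"]] = line
        self.log = list(latest.values())

store = AppendOnlyStore()
for version in range(5):
    store.put("doc1", {"version": version})
store.put("doc2", {"version": 0})
size_before = len(store.log)
store.compact()
size_after = len(store.log)
```

Five updates to one document leave five records in the log until compaction runs, after which only the latest record for each key remains.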

Figure: Couchbase compaction. Instead of performing in-place updates, Couchbase appends updated documents to the end of a file. At some point, it writes a new version of the file that omits the stale documents -- a process called compaction.

Compaction works pretty well, but it has drawbacks. First, you need enough disk space to hold an extra copy of all of your data; otherwise, compaction will fail. More important, with a write-heavy use case, the new file may never catch up. Couchbase mitigates these issues by performing compaction not at the database or bucket level, but at the vBucket level, which is 1/1,024 of a bucket. By reducing the size of the file to be compacted, you can perform incremental compaction with lower disk resource requirements and a lower probability of compaction failure due to heavy writes. Compacting at the vBucket level is a big step toward preventing compaction failures, but it’s easy to negate its mitigations with a poorly devised data model.
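A back-of-the-envelope Python sketch shows why the smaller unit helps. The CRC32-modulo hashing here is only an illustrative way to assign keys to partitions and is not claimed to be Couchbase's exact function; the 100GB bucket size and the even spread of data across vBuckets are likewise assumptions.

```python
import zlib

NUM_VBUCKETS = 1024  # per the article: a vBucket is 1/1,024 of a bucket

def vbucket_for(key):
    """Illustrative hash partitioning; not necessarily Couchbase's exact function."""
    return zlib.crc32(key.encode("utf-8")) % NUM_VBUCKETS

def compaction_headroom(bucket_bytes, partitions):
    """Approximate extra disk needed to rewrite one partition at a time,
    assuming data is spread evenly across partitions."""
    return bucket_bytes / partitions

bucket_size = 100 * 2**30                                     # hypothetical 100 GiB bucket
whole_bucket = compaction_headroom(bucket_size, 1)            # compact all at once
per_vbucket = compaction_headroom(bucket_size, NUM_VBUCKETS)  # one vBucket at a time
```

Under those assumptions, the spare space needed drops from a full extra copy of the bucket to roughly 100MB per vBucket being compacted.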

Couchbase runs two different services that demand disk usage; as a result, it uses two different storage engines. The data service, responsible for basic CRUD operations and views, uses Couchstore to persist to disk. The index service, responsible for maintaining index data from the GSI (global secondary index), works with ForestDB for persistence.

Couchstore, which has been around since the Couchbase 2.0 release, is what the data service uses to handle direct document access and the maintenance and storage of views. It employs a slightly modified B-Tree data structure, which ensures consistency and performance across lookups, updates, and deletes. There are drawbacks due to the append-only strategy, such as no sibling-chaining for sequential lookups and a slightly costlier update algorithm, but it’s not too different from a standard B-Tree in practice. On disk, the Snappy library is used to compress data, similar to the default in MongoDB 3.0’s WiredTiger. However, whereas compression is pluggable in MongoDB, Snappy is the only option for Couchbase. Lucky for us, Snappy is a solid compression library.

ForestDB is relatively new by comparison, dating from its original beta release in October 2014. Used by the Index service to maintain the GSI, ForestDB is accessed exclusively through the Query service that receives and parses N1QL queries. It employs a “Hierarchical B+-Tree based Trie,” which is a “trie” (a tree data structure whose keys are strings) of B+ trees optimized for shallow depth and disk access. The benefits gained by this sort of data structure are primarily realized when you need efficient access to variable length strings, which is exactly what the index service is going to be doing.

This sort of data structure would also be useful in place of the B-Tree used in Couchstore, and in fact there are plans to replace Couchstore with ForestDB in the future. Developers at Couchbase have been working on it for over a year, but a target release number has yet to be announced. As with Couchstore, Snappy is used for compression.

InfoWorld Scorecard: Couchbase Server 4.0

  • Administration (20%): 8
  • Ease of use (20%): 8
  • Scalability (20%): 9
  • Installation and setup (15%): 9
  • Documentation (15%): 8
  • Value (10%): 8
  • Overall Score (100%): 8.5

At a Glance
  • Couchbase Server brings document-orientation, a performant key-value architecture, and the comfort of SQL under one umbrella to enable flexible approaches, sometimes at the cost of clarity.

    Pros

    • N1QL opens the door to a more widely accessible query model
    • Access to both document and key-value functionality
    • Cross Datacenter Replication out of the box

    Cons

    • Indexes aren't automatically replicated across nodes
    • Documentation is fragmented and too short in places