Abstract
A look at one of the best-known contenders in the non-relational database space.
Lately I've been teaching programming courses in both Python and Ruby, often to seasoned developers used to C++ and Java. Inevitably, the fact that Python and Ruby are dynamically typed languages, allowing any variable to contain any type of value, catches these students by surprise. They often are shocked to find that a given variable can, at any point in the program, be assigned to contain an integer, a string or an instance of an object, without any constraints. They wonder how it is that anyone could (or would) use such a language, given the possibility for runtime type errors. One of my jobs, as the instructor of this course, is to convince them that it is possible to work in such a language, but that doing so might require more adherence to conventions than they are used to.
So, it's ironic that during the last few months, as I have begun to experiment with non-relational databases, I have found myself experiencing something akin to my students' shock. My long-standing beliefs about data integrity and what constitutes a reliable database have gone through a bit of a shake-up. I'm still a bit wary of these non-relational (or NoSQL) databases, and I'm far from convinced that the time has come to throw out SQL and the relational model in favor of something that is often easier to work with.
I do think, as I outlined in last month's column, that these databases offer a type of storage and retrieval that often is a more natural fit for many data-storage requirements. And, just as memcached offered an alternative storage system that complemented relational databases rather than replacing them, so too can these non-relational databases perform many useful functions that would be difficult with a relational database.
One of the best-known contenders in the non-relational database space is MongoDB. MongoDB is an open-source project, sponsored by New York-based 10gen (which intends to make money from licensing and support fees). It is written in C++, and there are drivers for all popular modern languages. The software is licensed under the GNU Affero General Public License (AGPL), which means if you modify the MongoDB source, and if those modifications are available to users over a network, such as on a publicly accessible Web site, you must distribute the source to your modifications. This is different from the standard GPL, which does not require that you divulge the source code to server-side applications with which people interact via a browser or other Internet client.
MongoDB has gained a large number of adherents because of its combination of features. It is easy to work with from a variety of languages, is extremely fast (written in C++), is actively supported by both a company and a large community and has proven itself to be stable in many situations and under high-stress conditions. It also includes a number of features for indexing and scaling that make it attractive.
MongoDB, like several of its competitors, describes itself as a document database. This does not mean it is a filesystem meant to store documents, but rather that it replaces the model of tables, rows and columns with that of “documents” consisting of one or more name-value pairs. I find it easier to think of documents as hash tables (or Python dictionaries), in which the keys are strings and the values can be just about anything. Each of these documents exists in a collection, and you can have one or more collections.
In many ways, you can think of MongoDB as an object database, because it allows you to store and retrieve items as objects, rather than force them into two-dimensional tables. However, this object database stores only basic object types—numbers, strings, lists and hashes, for example. Fortunately, these types can store a wide variety of data, flexibly and reliably, so this is not much of a concern.
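To make the document model concrete, here is a minimal sketch in plain Python, with no MongoDB involved. It only illustrates how documents, collections and a database relate to one another; the variable names and the sample values are my own assumptions for illustration, not part of any MongoDB API.

```python
# A plain-Python model of the document idea; this is not MongoDB code.

# A document is a hash (dictionary) with string keys. Its values can be
# numbers, strings, lists or nested hashes.
person = {
    "name": "Reuven",
    "email_address": "reuven@lerner.co.il",
    "languages": ["Python", "Ruby"],   # a list value (sample data)
    "counts": {"a": 1, "b": 2},        # a nested hash value (sample data)
}

# A second document with completely different keys.
numbers = {"a": 1, "b": 2}

# A collection is modelled as a list of documents; nothing forces the
# documents to share a schema.
stuff = [person, numbers]

# A database is modelled as a mapping from collection names to collections.
database = {"stuff": stuff}

for document in database["stuff"]:
    print(sorted(document.keys()))
```

The point of the sketch is that the two documents sit side by side in one collection even though they have no keys in common, which is what replaces the fixed columns of a relational table.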
To download MongoDB, go to http://mongodb.org, and retrieve the version appropriate for your system. For my server running Ubuntu 8.10, I retrieved the 32-bit version of MongoDB 1.2.2. There is an option to retrieve a statically linked version, but the site itself indicates that this is a fallback, in case the dynamically linked version fails.
After unpacking the MongoDB server, create a directory in which it can store its data. By default, this is /data/db, which you can create with:
mkdir -p /data/db
Start the MongoDB server process with:
./bin/mongod
Now that you have a server running, you might expect that you need to create a database. However, this step is unnecessary. If you try to connect to a database that has not yet been defined, MongoDB creates it for you. I tend to do most of my MongoDB work in Ruby, so I downloaded and installed the driver for Ruby from GitHub and started up the interactive Ruby interpreter, irb. Then, I typed:
irb(main):001:0> require 'rubygems'
irb(main):002:0> require 'mongo'
With the MongoDB driver loaded, I was able to connect to the already-running server, creating an “atf” database:
irb(main):005:0> db = Mongo::Connection.new.db("atf")
After this, db is an instance of the Mongo::DB class, representing a MongoDB database. Each database may contain any number of collections, analogous to tables in a relational database. By default, this example database contains no collections, as you can see with this small snippet of code:
irb(main):008:0> db.collection_names.each { |name| puts name }
=> []
The return value of an empty list shows that the database is currently empty.
You can create a new collection by invoking the collection method on your database connection:
irb(main):012:0> c = db.collection("stuff")
Once you have created your collection, you also can see that MongoDB has silently created a second collection, named system.indexes, used for indexing the contents:
irb(main):032:0> db.collection_names
=> ["stuff", "system.indexes"]
Because MongoDB is a schema-less database, you can begin to store items to your collection immediately, without defining its columns or data types. In practice, this means you can store hashes with any keys and values that you choose. For example, you can use the insert method to add a new item to your collection:
irb(main):017:0> c.insert({:a => 1, :b => 2})
=> 4b6fe8983c1c7d6a6a000001
The return value is the unique ID for this document (or object) that has just been stored. You can ask the collection to show what you have stored by invoking its find_one method:
irb(main):021:0> c.find_one
=> {"_id"=>4b6fe8983c1c7d6a6a000001, "a"=>1, "b"=>2}
Notice that two things have happened here. First, the keys have been turned from Ruby symbols into strings. Indeed, MongoDB requires that all keys be strings; because symbols are used so pervasively in the Ruby world for hash keys, they are translated into strings silently if you use them.
Second, you can see that another key, named _id, has been added to the document, and its value matches the return value that you received with your first insert.
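The two effects just described can be modelled in a few lines of plain Python. This is only a sketch of the observable behavior, not the driver's actual code: the insert function and the list standing in for a collection are hypothetical, Python has no symbols (so str() stands in for the symbol-to-string conversion), and the random 24-digit hexadecimal string merely imitates the length of the IDs shown above.

```python
import secrets


def insert(collection, document):
    """Store a copy of document in collection (a list) and return its _id."""
    # Keys must be strings, so convert any key that is not one already.
    stored = {str(key): value for key, value in document.items()}
    # Add an _id only if the caller did not supply one. Assumption: a
    # random 12-byte value (24 hex digits) stands in for a real ID.
    stored.setdefault("_id", secrets.token_hex(12))
    collection.append(stored)
    return stored["_id"]


stuff = []                                # hypothetical stand-in for a collection
new_id = insert(stuff, {"a": 1, "b": 2})

print(new_id)                             # 24 hex digits, different on every run
print(stuff[0]["_id"] == new_id)          # the stored document carries the same ID
```

As in the irb session, the value returned by the insert is the same one that later shows up under the _id key of the stored document.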
You can ask the collection to tell how many documents it contains with the count method:
irb(main):026:0> c.count
=> 1
As you might expect, you can store and retrieve data using any number of different languages. Although you are likely to work in a single language, MongoDB (like relational databases) doesn't care what language you use and lets you mix and match them freely.
In the above examples, I used Ruby to store data. I should be able to retrieve this data using Python, as follows:
>>> import pymongo
>>> from pymongo import Connection
>>> connection = Connection()
>>> db = connection.atf
>>> db.collection_names()
[u'stuff', u'system.indexes']
>>> c = db.stuff
>>> c
Collection(Database(Connection('localhost', 27017), u'atf'), u'stuff')
>>> c.find_one()
{u'a': 1, u'_id': ObjectId('4b6fe8983c1c7d6a6a000001'), u'b': 2}
The only surprises here are probably that the strings are all stored as Unicode, represented with the u'' syntax in Python 2.6 (which I am using here). Also, the document ID, with the key of _id, still is there, but is an object, rather than a string.
You also can see that the MongoDB developers have gone to great efforts to keep the APIs similar across different languages. This means if you work in more than one language, you likely will be able to depend on similar (or identical) method names to perform the same task.
The find_one method, as you have seen, returns a single element from a collection. A similar find method returns a cursor over all of the elements. The cursor includes Ruby's Enumerable module, allowing you to iterate over all of the documents in a collection using each. For example, if you add another document:
irb(main):026:0> c.insert({'name' => 'Reuven', 'email_address' => 'reuven@lerner.co.il'})
=> 4b6ff0693c1c7d6ecd000001
you can retrieve the IDs as follows:
irb(main):030:0> c.find.each {|i| puts i['_id']}
4b6fe8983c1c7d6a6a000001
4b6ff0693c1c7d6ecd000001
Notice how you can pull out the _id value by treating the document as a hash. Indeed, if you ask Ruby to show the class of each object, rather than its ID, you can confirm that each document is a kind of hash:
irb(main):031:0> c.find.each {|i| puts i.class}
OrderedHash
OrderedHash
But, perhaps you're interested only in some of the documents. If you invoke find with a hash, it returns only those documents that match the contents of your hash. For example:
irb(main):040:0> c.find({'name' => 'Reuven'}).count
=> 1
If nothing matches the hash that you passed, you will get an empty result set:
irb(main):041:0> c.find({'name' => 'Reuvennn'}).count
=> 0
You also can search for regular expressions:
irb(main):042:0> c.find({'name' => /eu/}).count
=> 1
irb(main):043:0> c.find({'name' => /ez/}).count
=> 0
By passing a hash as the value for a key, you also can refine the query using the operators that MongoDB's query syntax defines. These query operators all begin with the dollar sign ($) and are used as keys in a sub-hash. For example, you can retrieve all of the documents whose “name” is one of the values in a specified array, as follows:
irb(main):049:0> c.find({'name' => {'$in' => ['Reuven', 'Atara', 'Shikma', 'Amotz'] } } ).count
=> 1
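If it helps to see the matching rules spelled out, the following plain-Python sketch models the three kinds of conditions used in the queries above: an exact value, a regular expression and the $in operator. It is a simplified illustration of the behavior, not how MongoDB evaluates queries internally; the matches and find functions are hypothetical helpers, and the list named stuff stands in for the collection.

```python
import re


def matches(document, query):
    """Return True if document satisfies every condition in query."""
    for key, condition in query.items():
        if key not in document:
            return False
        value = document[key]
        if isinstance(condition, dict):
            # A sub-hash holds $-prefixed operators; only $in is modelled.
            for operator, argument in condition.items():
                if operator != "$in":
                    raise ValueError("unsupported operator: " + operator)
                if value not in argument:
                    return False
        elif isinstance(condition, re.Pattern):
            # A regular expression may match anywhere in a string value.
            if not (isinstance(value, str) and condition.search(value)):
                return False
        elif value != condition:
            # Anything else must be equal to the stored value.
            return False
    return True


def find(collection, query=None):
    """Return the documents in collection (a list) that match query."""
    return [doc for doc in collection if matches(doc, query or {})]


stuff = [
    {"a": 1, "b": 2},
    {"name": "Reuven", "email_address": "reuven@lerner.co.il"},
]

print(len(find(stuff, {"name": "Reuven"})))            # 1
print(len(find(stuff, {"name": "Reuvennn"})))          # 0
print(len(find(stuff, {"name": re.compile("eu")})))    # 1
names = ["Reuven", "Atara", "Shikma", "Amotz"]
print(len(find(stuff, {"name": {"$in": names}})))      # 1
```

The counts agree with the irb results shown earlier. Note that the first document, which has no “name” key at all, simply fails to match any of these queries.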
You also can sort the results by invoking the sort method on the result set, using a similar syntax:
irb(main):049:0> c.find({'name' => {'$in' => ['Reuven', 'Atara', 'Shikma', 'Amotz'] } } ).sort({"name" => 1})
Just as you can sort a result set, you also can perform other actions on it that are analogous to several relational counterparts, such as grouping and limiting the number of results. If you are used to a functional style of programming, in which you chain a number of methods to one another, this style easily will lend itself to working with MongoDB.
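The chaining style can be sketched in plain Python with a small class whose methods each return a new result set. This is a toy model under stated assumptions, not the driver's cursor: the Cursor class and its methods are hypothetical, the sample documents are invented for illustration, and the direction argument mirrors the 1 in the sort call above (taken to mean ascending, with -1 for descending).

```python
class Cursor:
    """A tiny stand-in for a query result that supports method chaining."""

    def __init__(self, documents):
        self._documents = list(documents)

    def sort(self, key, direction=1):
        # Assumption: 1 sorts ascending and -1 descending.
        ordered = sorted(self._documents,
                         key=lambda doc: doc[key],
                         reverse=(direction == -1))
        return Cursor(ordered)

    def limit(self, n):
        # Keep at most the first n documents.
        return Cursor(self._documents[:n])

    def count(self):
        return len(self._documents)

    def __iter__(self):
        return iter(self._documents)


# Invented sample documents, reusing the names from the $in query.
people = [{"name": "Shikma"}, {"name": "Atara"},
          {"name": "Reuven"}, {"name": "Amotz"}]

first_two = Cursor(people).sort("name").limit(2)
print([doc["name"] for doc in first_two])   # ['Amotz', 'Atara']
```

Because every method hands back another Cursor, calls can be strung together in whatever order the query requires, which is the same pattern the find and sort calls above follow.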
MongoDB is causing many ripples in the open-source and database worlds because of its high performance and easy learning curve. This month, I covered the basics of installing and working with MongoDB. Next month, I'll look at some more-advanced topics, such as indexing (which makes queries execute much faster), embedding objects in one another and referencing objects across collections.