pharos

project

ThingDB

When you are building a web application, the standard practice is to use a SQL database. A SQL database is like a large Excel spreadsheet -- a bunch of columns identifying various properties your data has and a bunch of numbered rows, each one representing a particular thing.

But SQL has some problems. For one thing, it's inflexible and poorly-integrated into most programming languages. If you want to add a new feature, you have to be very careful to upgrade the database model and code in tandem, and it gets to be a real pain.

For another, it requires you to decide the structure of the data in advance. It's not safe to allow end users to add new columns to your database, so the only data they can add is the data you're expecting.

For another, it has no revision history. Once a change is committed to the database, the data is permanently changed, with no records of who changed it or when or what it was before. If you're going to make your data publicly-editable over the Web, this is a disaster -- as soon as one spammer starts screwing around with things, you're trapped.

Finally, it doesn't deal with heterogeneous types of objects well. Queries can be made against only one table at a time. If one table stores links and another stores comments on those links, there's no way to do a query that returns both the links and the comments created by a user, sorted by date.

Each one of these can be hacked around. With good techniques, you can have your code autoupgrade the SQL database. With special tables, you can keep around old revisions of the data. With table inheritance you can do heterogeneous queries. But together the problems add up to be quite frustrating.

As a result, at Reddit we built a system called ThingDB which acts as a layer on top of a SQL database to solve these problems. We noticed that our database consisted of a bunch of different kinds of things -- users, wiki pages, reddit links, and subreddits. Each one had a name (e.g. "AaronSw"), a kind (e.g. "User"), and a variety of properties (e.g. password and num_sites_to_display). So in ThingDB, we collapsed these all into one kind of object, called a thing.

The schema looks like this:

  • thing
    • id
    • name
    • kind_id
  • version
    • thing_id
    • creator
    • creation_date
    • comment
  • data
    • version_id
    • key
    • value

In other words, there are things (with names and kinds) which have versions (with creators, creation dates, and comments), which have data -- a series of key-value pairs.

Every time you modify a thing, it creates a new version with all the new data. This way no data is ever lost and changes made by spammers and so on can be safely deleted. Since all data is stored in the data table, new kinds of keys and values can be added at will. Since everything is a thing, queries can return any type of object.

We also built a Python API to make manipulating things easy:

link = getThing('f9fk')

print link.title+':', link.url

link.filetype = 'mpeg'
link.save()

The Python API transparently caches all thing information from the database and automatically updates the cache on saves.

Using this technique, Reddit was able to easily scale to millions of items and requests.

Distributing it

But what if we want to go bigger? For our project, we're starting with hundreds of millions of rows. The model of keeping it all on one machine can only stretch so far. At some point, you have to break it up.

We could build our own distributed database, and maybe someday we will, but until then, the simplest thing is to just split up the SQL database into multiple fragments (called partitions), each one of which can be stored in RAM by one machine and thus quickly queried.

Each partition can store 1/n of the things, based on a consistent hash of the thing name. (This is how web caches work.) Alternately, a coordination server can keep track of which machines hold which things, storing a Bloom filter of the information in memory, with the full results on disk. (This is how Google's BigTable works.)

Each partition has one master writer and then a number of replicated read slaves. This allows the system to scale infinitely: if you're bottlenecked on writes, just add more partitions; if you're bottlenecked on reads, just add more read slaves.

A coordination machine keeps track of which machines are busy and tries to distribute load.

Each database machine runs a TDB server which converts a special TDB protocol into SQL queries. (It also archives TDB writes as flat files for archival.) The TDB client library talks to the coordination machine and the various TDB servers to execute its writes and queries.

For writes and reads on a particular thing, it simply needs to hit one of the machines in that partition.

For larger queries, it hits one read slave in each partition and then does a mergesort on the results.

Since we're building our own reading, writing, and query system, we can cache everything and know exactly when to expire it. Thus the TDB clients (which run on the web servers) can cache all the results they get back from the database, either in a memcached installation or in a distributed Python-level cache, which gets immediately expired when a new write makes it out of date.