A warning for those of you who aren’t tech-crazed. My wife mentioned that reading this was like having a bucket of gravel dumped on her brain. I wish I could interpret that to mean she is in awe of all things tech, but something tells me that would be wishful thinking.
As we work on providing support for current and next-generation continuous glucose monitors, it has become clear that SweetSpot will soon be storing billions of glucose entries. Yikes! Numbers like these will send shivers through anyone running a lean-and-mean startup.
There are plenty of short-term fixes to improve the efficiency of the database: consolidate columns, move sparsely used columns into their own table, minimize the number of indexes, etc. The same methodology applies at the application layer: minimize interactions with the database, use bulk inserts, minimize object creation, blah, blah, blah. These techniques will actually buy us a fair amount of breathing room, but they aren’t long-term solutions.
Now we get into harder choices. Microsoft SQL Server, MySQL, PostgreSQL, and DB2 are examples of relational database management systems (RDBMSs) and are what most people think of when they think of a database. RDBMSs are very powerful software packages, yet they are notorious for not being able to scale out. You can throw a lot of expensive hardware at the problem to scale up, but fairly soon you’ll need to consider techniques like summary rollups, vertical and horizontal partitioning, data warehousing, and sharding to try to scale out (see this piece about the difference between scaling up and scaling out). I cut my teeth working on what was at the time one of the top 5 largest known RDBMSs in the world. One of my jobs was to pull information out of this behemoth system; I definitely experienced the pain of implementing some of these techniques. They sound straightforward but can be incredibly expensive to create and maintain. (I haven’t tried sharding, and it sounds like it might come the closest to being able to scale out the database bottleneck. It is a fascinating idea.)
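To make the sharding idea a little more concrete, here is a minimal JavaScript sketch. It is not something SweetSpot runs: the `userId` field, the `shardCount` parameter, the helper names, and the FNV-1a-style hash (computed over UTF-16 code units) are all assumptions chosen for illustration.

```javascript
// Hypothetical sketch: route each user's glucose entries to one of N shards.
// userId, shardCount, and the hash choice are assumptions, not SweetSpot code.
function fnv1a(text) {
  let hash = 0x811c9dc5; // 32-bit FNV offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    // Multiply by the FNV prime and keep the result as an unsigned 32-bit value.
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// The same user always maps to the same shard index in [0, shardCount).
function shardFor(userId, shardCount) {
  return fnv1a(String(userId)) % shardCount;
}

// Group a batch of entries by shard so each shard can get one bulk insert.
function groupByShard(entries, shardCount) {
  const groups = new Map();
  for (const entry of entries) {
    const shard = shardFor(entry.userId, shardCount);
    if (!groups.has(shard)) groups.set(shard, []);
    groups.get(shard).push(entry);
  }
  return groups;
}
```

Because all of one user's entries land on the same shard, a per-user read touches a single database. The catch with this naive modulo scheme is that changing `shardCount` reshuffles most users, which is part of why sharding is expensive to maintain.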
OK, enough of this lecture! Where does CouchDB come into play?
The CouchDB documentation is dripping with drool-worthy keywords: Erlang, shared-nothing clustering, a RESTful JSON interface, lock-free concurrency, and ad-hoc, schema-free querying running through a map-reduce system. This all adds up to what can awkwardly be described as a distributed, document-oriented, non-relational database system.
This triggers some serious geek lust (my wife really begs to differ, but I love her anyway).
How does this affect SweetSpot?
- The relational aspect of RDBMSs, being able to say “I want to join this information with a little information over here, and mix in some from there,” is not really needed. SweetSpot’s main data store contains minimal reference information, so it is pretty much storing documents that are retrieved from the database, minimally modified, and then consumed by the calculation engine, a data feed, or the website.
- SweetSpot’s usage patterns tend to be many reads interspersed with large bulk inserts from medical devices and data imports. These two patterns will clash more and more as the system grows, so a lock-free system is very appealing.
- SweetSpot is a heavy user of the data stored within its system; it aggregates and pushes it out to users and eventually to research centers and hospitals (only when authorized by the users; we take our members’ privacy very seriously!), and it runs increasingly complex calculations. A replication and shared-nothing clustering framework gives us more options for structuring our architecture to deal with these cases. For example, we can build a calculation cluster which has a local replicated version of the entries, where it can just hum along running calculations as needed while not slowing down the main application.
I’ll stop right here and proclaim that I am guilty of being a bit too starry-eyed about CouchDB: it is terribly slow and resource-heavy, it lacks an authorization and security layer, and the API is not quite as flexible as it needs to be. While there is a long way to go for it to play the sort of role I imagine, it opens up a wide range of possibilities for web applications in general. For example, a web application can have JavaScript running on the client’s machine, requesting and manipulating JSON directly from CouchDB without the need for a middle layer like Ruby on Rails or Spring.
CouchDB is an impressive piece of alpha software, starting with the admin web interface and finishing with the stupidly simple replication setup. The database is only accessible via a RESTful interface using JSON objects, with views onto data being constructed from JavaScript functions that wrap a map-reduce system. The developers are moving like wild hares, fixing and improving the system at an incredible rate. Spend a little time in the #couchdb IRC channel on irc.freenode.net and you’ll see what I mean (the performance tuning is just about to get underway, or so I was told).
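To get a feel for what a view function does, the sketch below runs a map function over an in-memory array of documents. It is not CouchDB's implementation: `runView` is a hypothetical helper, the emit callback is passed in as a second argument instead of being a global, and rows are ordered by a simplified string comparison rather than CouchDB's own collation rules.

```javascript
// Hypothetical local simulation of a map-only view; not CouchDB's actual code.
function runView(docs, mapFn) {
  const rows = [];
  for (const doc of docs) {
    // Each call to emit adds one row tied to the document that produced it.
    const emit = (key, value) => rows.push({ id: doc._id, key, value });
    mapFn(doc, emit);
  }
  // Simplification: order rows by the JSON text of their keys.
  rows.sort((a, b) => {
    const ka = JSON.stringify(a.key);
    const kb = JSON.stringify(b.key);
    return ka < kb ? -1 : ka > kb ? 1 : 0;
  });
  return rows;
}

// Made-up sample documents for the demonstration.
const docs = [
  { _id: "3", something: "interesting", with_goodies: ["there"] },
  { _id: "2", something: "boring", with_goodies: [] },
  { _id: "1", something: "interesting", with_goodies: ["here", "and", "here"] },
];

// Keep only the "interesting" documents, keyed by their first goodie.
const interesting = runView(docs, (doc, emit) => {
  if (doc.something === "interesting") {
    emit(doc.with_goodies[0], doc.with_goodies);
  }
});
```

The map function filters (the `if`), and the key it emits decides the ordering, which is roughly the "sort, filter, and munge" role views play.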
This is a project worth keeping your eyes on.
Thank you for making it this far, and as a little reward, here are some commands you can use to get started with CouchDB.
- You can kick the tires by following these commands:
svn export http://svn.apache.org/repos/asf/incubator/couchdb/trunk couchdb
cd couchdb
#if you don't have a mac, check out the README
sudo port install erlang icu SpiderMonkey
#and go get coffee!
./bootstrap -C
./configure
make && sudo make install
#note: check the README for security tips on how to avoid running as root...
sudo couchdb
- Get started by going to http://localhost:5984/_utils/index.html. By using Firefox with the Firebug plugin, you can poke around to see how the admin interface creates and receives JSON messages from the database. Slick!
- create a new ‘database’ (it is not really a database but just a way to organize your documents):
curl -i -X PUT \
-H 'Content-Type: application/json; charset=utf-8' \
http://localhost:5984/mydb
- create a new document:
curl -i -X POST \
-H 'Content-Type: text/javascript; charset=utf-8' \
-d '{"something":"interesting", "with_goodies":["here", "and", "here"]}' \
http://localhost:5984/mydb
- let’s go read the new document you created! (take the id returned in the above response and replace ID_HERE with it):
curl -i -X GET \
-H 'Content-Type: text/javascript; charset=utf-8' \
http://localhost:5984/mydb/ID_HERE
- now that we have a document, you can create views to sort, filter, and generally do some data munging:
curl -i -X POST \
-H 'Content-Type: text/javascript; charset=utf-8' \
-d "function(doc){ if (doc.something=='interesting'){ map(null, doc.with_goodies); } }" \
http://localhost:5984/mydb/_temp_view
- And I’ll leave you with this little trick:
curl -i -X PUT \
-H 'Content-Type: application/javascript; charset=utf-8' \
-d '{"_attachments":{"foo.txt":{"content-type":"base64","data":"VGhpcyBpcyBhIGJhc2U2NCBlbmNvZGVkIHRleHQ="}}}' \
http://localhost:5984/mydb/my_docs
Did you notice how we used PUT instead of POST? PUT is what you use either to modify a document or to create a new document with a user-defined ID, which in our case is ‘my_docs’. Now go to http://localhost:5984/mydb/my_docs/foo.txt; pretty neat, eh?
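The `data` field in that last command is just base64-encoded text. Here is a small Node.js sketch of how such a body could be put together and how the PUT-versus-POST choice follows from whether you supply an ID. The helper names `inlineAttachment` and `saveRequest` are made up for illustration, and the `"content-type": "base64"` value is copied from the curl command above rather than taken from an API reference. Nothing is actually sent to a server.

```javascript
// Hypothetical helpers; they only describe requests, they do not send them.
function inlineAttachment(name, text) {
  const data = Buffer.from(text, "utf8").toString("base64");
  // The "content-type" value mirrors the curl example in this post.
  return { _attachments: { [name]: { "content-type": "base64", data } } };
}

function saveRequest(dbUrl, doc, id) {
  const body = JSON.stringify(doc);
  // With an ID: PUT to /db/id. Without one: POST to /db and let CouchDB pick.
  return id === undefined
    ? { method: "POST", url: dbUrl, body }
    : { method: "PUT", url: `${dbUrl}/${encodeURIComponent(id)}`, body };
}

// Decode the attachment data used in the curl command.
const encoded = "VGhpcyBpcyBhIGJhc2U2NCBlbmNvZGVkIHRleHQ=";
const decoded = Buffer.from(encoded, "base64").toString("utf8");

// Rebuild the same request from the decoded text.
const request = saveRequest(
  "http://localhost:5984/mydb",
  inlineAttachment("foo.txt", decoded),
  "my_docs"
);
```

Here `decoded` is the text you see when you open foo.txt in the browser, and `request` describes the same PUT to /mydb/my_docs that the curl command performs.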