So, after choosing Rackspace Cloud for initial hosting, I need to start piecing together my application. First, I’m taking a look at where and how to store my data, and what to store it in. I’m actually going to store data in two places, depending on the type of data it is. This entry will cover where I’m storing what I call persistent data.
Persistent data is data that will remain relatively unchanged, and that I will need to keep indefinitely. I’ll actually be using another datastore for things like cache, session tokens, and the like. Another entry will be written to detail my choice for that. Persistent data is things like user accounts, comments, and other application data that I’m not ready to divulge yet, but you should get the point.
Individual data items are not expected to be large. No media, just Unicode text. However, the application I’m working on is something that could potentially scale to support large numbers of users. So while individual items may be small, there may be a lot of them. The comments feature especially could explode with data… or never be used at all. It’s one of those things.
So with the above in mind, and remembering some technology articles Digg released a while back about their infrastructure, I pretty much ruled out any RDBMS. Having developed on App Engine, I knew a little about key/value systems, but was spoiled by the query capabilities it layers on top of BigTable. As I started looking for alternatives, I stumbled across Facebook’s Opensource Page. There Cassandra caught my interest, and I started reading about it.
What interested me was that it’s built to scale across machines and even data centers, using a customizable replication topology. I prefer the eventually consistent approach. It’s a good fit for my application, as I want to make sure I never lose data and that read and write operations succeed on every request. For the few cases where I need to write and have the data immediately available, I can work around it using the other datastore I’ll write about later. The fact that Facebook was using it for its messaging system was another thing that caught my eye.
What concerned me was that it didn’t look like a lot was going on with the project. The project page was pretty sparse. Being an Apache Incubator project was nice, but… I was initially feeling that this was something Facebook threw over the wall and no one really picked up. The reason I’m saying all this is that if this is the impression I got, I imagine others may have had the same impression.
However, I’m pretty stubborn. Hey, I’ve been using Slackware since version 3, and remember when increasing the number of available HTTP sockets in Apache was a source code change. Lack of documentation doesn’t scare me, so I kept digging. Well, it turns out Digg, Rackspace, and to some degree Twitter are getting behind the project. The IRC channel (#cassandra on irc.freenode.net) is pretty active, and they appear to have an interest in prettying things up, as they have had a contest on 99designs.com to create a new logo. (At the time of writing this, the contest has 3 hours left, so I’m not posting the link, as I’m not sure what happens to contest links after they are over.) In fact, while I was researching the project, 0.4 was released.
There are a lot of projects out there that provide similar functionality, some more mature than Cassandra. As I said in my last article, this series will not compare what I’ve chosen to other software choices. I will only detail why I chose the product I am going with. I am fully open to hearing more information about other projects, as I have not begun building and am just making infrastructure choices. One reason for this series is that I want the feedback to confirm my choices, or to provide me with information about why I might look elsewhere.
So why Cassandra? It provides horizontal scalability for the persistent data layer of my application. The horizontal clustering approach, with its replication topology, also provides redundancy. This redundancy includes geographical redundancy support. It was originally built by Facebook, which is a prime example of an organization with data reliability and scalability requirements that match my own. After being open sourced, it has been picked up by Rackspace, Digg, and Twitter, more organizations with scalability requirements similar to my own. Sometimes you have to look at the bells and whistles, but sometimes it’s more important to look at who is standing behind a project. The latter, combined with the fact that out of the gate Cassandra meets my data requirements, made this choice a no-brainer.
Looking forward to how I’ll use Cassandra… Based on a conversation I had with jbellis in the Cassandra IRC channel, I’ll be standing up 3 Rackspace Cloud machines, instead of 1, when I’m ready to move off of working on my home dev machine. The reason for this is that I want to have the clustering concepts down long before I need them.
I’ll also have to look at how I want to lay out my data. As I understand how Cassandra works, there is no querying; you get by key. So, for example, my user system, which allows login from multiple sources, will need to be denormalized. In a relational setup I’d have something like
User:
    twitter_id
    facebook_id
where, on login, I’d query the user table for the appropriate id, or create a new account if it doesn’t exist. Instead, I’ll need something like
{username: {
    twitter_id: 123456789,
    facebook_id: abcdefg
  }
}
{twitter:
  {
    123456789: username
  }
}
{facebook:
  {
    abcdefg: username
  }
}
On login from, say, Twitter, I get the key twitter.123456789 and know my account is username. It means my application has more places to update information, but as far as speed goes, it should be pretty zippy.
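To make the layout above a bit more concrete, here is a rough Python sketch of the same idea. It uses plain dictionaries as a stand-in for the key/value store, so nothing here is a real Cassandra client call. The names store, create_user, and login are made up for illustration, and it assumes the caller supplies a username when a new account has to be created.

```python
# Hypothetical in-memory stand-in for the key/value store (plain dicts,
# NOT a Cassandra API). One map per "table" in the layout above.
store = {"users": {}, "twitter": {}, "facebook": {}}


def create_user(username, twitter_id=None, facebook_id=None):
    """Write the user record plus one reverse-lookup entry per linked id."""
    store["users"][username] = {
        "twitter_id": twitter_id,
        "facebook_id": facebook_id,
    }
    # These extra writes are the cost of denormalizing: every linked id
    # gets its own key pointing back at the username.
    if twitter_id is not None:
        store["twitter"][twitter_id] = username
    if facebook_id is not None:
        store["facebook"][facebook_id] = username


def login(provider, external_id, new_username):
    """Get by key; create the account if the id has not been seen before.

    provider is "twitter" or "facebook". This sketch does not check
    whether new_username is already taken.
    """
    username = store[provider].get(external_id)
    if username is None:
        create_user(new_username, **{provider + "_id": external_id})
        username = new_username
    return username


create_user("username", twitter_id=123456789, facebook_id="abcdefg")
print(login("twitter", 123456789, "ignored"))
```

The login path is a single get by key, with no query involved. The price is that create_user, and anything else that changes a linked id, has to keep the reverse-lookup entries in step with the user record.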
Now, I do have LOTS to still learn about Cassandra. The clustering concepts, and how to use the functionality Cassandra provides to order my data on disk, are going to take time. The project is also pretty active, and I’m probably months away from even doing my first install. So, some things may change before I even get to that point. What I really like about this is that it’s going to be a chance to learn a whole new way of managing data. Professionally, I support installation of MySQL, Microsoft SQL Server, and to a small extent Oracle, along with the clients our developers’ applications use to interact with them. I have direct experience with MySQL from various side projects, and of course I learned a lot about BigTable working on gaeutilities. Cassandra is going to be a whole new depth of managing data that DBAs or Google have usually handled for me. This is something I really look forward to.