Sysadmin by day, developer by night

Late last night an update was pushed to www.unscatter.com that had some pretty big backend changes, and one big UI change. Results from Facebook, Twitter and YouTube now stream in near real time for all search queries, bringing you the latest conversation and media about anything you’re searching for.

I built a new engine to handle this, using a Tornado instance with a MongoDB backend. I spent a lot of time researching message queue services such as RabbitMQ and HornetQ, and even considered a 0mq implementation, but decided to go with a simpler polling solution using MongoDB as a backend, as it was already in my software stack.

So, how does it work?

Well, the front end works like this:
1) A user makes a search query.

2) The mq_topics capped collection in MongoDB is checked to see if that query is an active topic. This is validated by a ttl field that’s set when the topic is created and updated for as long as any user is still running the query. If the topic doesn’t exist, it is created.

3) Using the _id of the topic, mq_queue is checked for messages that have that ObjectId as a key. If there are none, it waits for a second and checks again until it has something to return.

4) When it finds data, it returns a block of HTML to add to the box on the page.

5) The JavaScript on the page then updates the URI it checks for data, adding a “last” query parameter that identifies the most recent item it received. It also caps the number of status messages on the page at 50, deleting older ones so it doesn’t fill up memory in the client’s browser.

6) Using the new URI, a new query is made. On the backend it keeps checking for new items in mq_queue and returns them when some exist. It’s a basic long polling solution, and the JavaScript will try again automatically if there is a timeout or other network issue. (Actually, on the server side it queries the topics collection on every request. Writing this, I realize I could send the ObjectId of the topic and use it, saving a query each request. Will have to add that in the next release.)

All while the above is going on, there is my little message queue service running in the background. It’s not really a message queue service; I just gave it that name because I wasn’t sure what else to call it. What it does is poll mq_topics for active topics. When it finds an active topic, it uses that topic’s query to make a request to Facebook, Twitter and YouTube to see if there are new results. It uses each API’s own paging features to fetch only new results, if any exist. It then adds any items it finds to mq_queue for the clients to pick up. It’s not quite horizontally scalable yet, as it lacks a locking routine to make sure only one instance is getting results for a topic at a time. However, it will scale in the future.
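One pass of a background service like this might look something like the sketch below. Everything here is hypothetical: the `fetchers` mapping, the `(query, cursor) -> (items, new_cursor)` shape of each fetch function, and the `cursors` field are assumptions for illustration, and the fake fetcher stands in for the real Facebook, Twitter and YouTube calls.

```python
import time


def poll_once(topics, queue, fetchers, now=None):
    """Ask every source for new items on each active topic and queue them.

    topics   -- dict of query -> {"_id": ..., "ttl": float} (assumed layout)
    queue    -- list that receives the new messages
    fetchers -- dict of source name -> fetch(query, cursor) returning
                (items, new_cursor); a hypothetical interface
    Returns the number of messages added.
    """
    now = time.time() if now is None else now
    added = 0
    for query, topic in topics.items():
        if topic["ttl"] <= now:
            continue  # topic expired, nobody is watching this query
        cursors = topic.setdefault("cursors", {})
        for source, fetch in fetchers.items():
            # The cursor remembers where the previous pass stopped, so
            # only new results come back.
            items, cursors[source] = fetch(query, cursors.get(source))
            for html in items:
                queue.append({"topic_id": topic["_id"],
                              "source": source,
                              "html": html})
                added += 1
    return added


def fake_fetch(query, cursor):
    """Stand-in for a real API call: three fixed results, paged by index."""
    data = ["<li>one</li>", "<li>two</li>", "<li>three</li>"]
    start = cursor or 0
    return data[start:], len(data)


topics = {"tornado": {"_id": 1, "ttl": time.time() + 60}}
queue = []
poll_once(topics, queue, {"fake": fake_fetch})  # queues three items
poll_once(topics, queue, {"fake": fake_fetch})  # nothing new, queues none
```

Keeping the cursor on the topic is only one option; it does nothing about the missing locking, so two instances running this loop would still fetch the same results twice.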

The queue service is throttled per query, checking every few seconds so it’s not quite real time. This is done primarily to avoid hitting API limits.
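A per-query throttle can be as small as remembering when each query was last fetched. The sketch below is an assumption about how this could be done, not the service’s actual code; the class name and the five-second default are made up for the example.

```python
import time


class QueryThrottle:
    """Allow each query to be fetched at most once per `min_interval` seconds."""

    def __init__(self, min_interval=5.0, clock=time.monotonic):
        self.min_interval = min_interval  # hypothetical default
        self.clock = clock                # injectable so it can be tested
        self._last = {}                   # query -> time of last allowed fetch

    def ready(self, query):
        """Return True and record the time if the query may be fetched now."""
        now = self.clock()
        last = self._last.get(query)
        if last is not None and now - last < self.min_interval:
            return False
        self._last[query] = now
        return True


throttle = QueryThrottle(min_interval=5.0)
if throttle.ready("tornado"):
    pass  # this is where the API requests for "tornado" would go
```

Because the timestamps are kept per query, a busy topic never delays a different one; the trade-off is that the total number of API calls still grows with the number of active topics.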

I actually had this working for a while, except the front end was polling constantly and would return the 50 most recent items every time, which would have been a big bandwidth hog. Last night I spent a couple of hours rewriting it to use long polling and fetch only the most recent items. So, it might be buggy right now. I’m sharing the information about how I did this because I’m open to any suggestions or critiques from people who’ve implemented systems like this before. One problem with working on a project like this by yourself is the lack of someone to go over these ideas with.

In the future I’m going to be adding the Yahoo! Firehose to the search results. I’ll also add a pause and play functionality to the results box. Later I intend to start updating the news results from Bing automatically as well.

Currently the service isn’t API friendly at all, and this is on purpose. It returns HTML, and that format can and likely will change a lot as I work on this. I do intend to set up an API people can use in the future, and will return data in a format fit for that type of consumption later. Right now I am focusing on the product, as I’m still not making any money at all and probably won’t be for a while, since I still have large feature sets to build. Local search results are the next big target.
