David Cramer's Blog

Making Django 1.5 Compatible With Django-bcrypt

Last night I took the opportunity to upgrade all of getsentry.com to Django 1.5. While most things were fairly trivial to sort out, we hit one less obvious (and pretty critical) bug during the migration surrounding django-bcrypt.

This bug would only present itself if you’ve transitioned from older versions of Django, and therefore have passwords in the database using the custom algorithm. Specifically, you’ll have passwords in your users table that look something like bc$$somestring$12$somestring.

The fix is actually fairly simple, and just requires you to define a slightly custom legacy backend for django-bcrypt:

from django.contrib.auth.hashers import BCryptPasswordHasher


class DjangoBCryptPasswordHasher(BCryptPasswordHasher):
    """
    Handles legacy passwords which were hashed with the 'bc$' algorithm via
    django-bcrypt.
    """
    algorithm = "bc"

Once you’ve defined the backend, the rest is as simple as adding it to your list of password hashers:

PASSWORD_HASHERS = (
    'django.contrib.auth.hashers.PBKDF2PasswordHasher',
    'django.contrib.auth.hashers.PBKDF2SHA1PasswordHasher',
    'django.contrib.auth.hashers.BCryptPasswordHasher',
    'getsentry.utils.auth.DjangoBCryptPasswordHasher',
    'django.contrib.auth.hashers.SHA1PasswordHasher',
    'django.contrib.auth.hashers.MD5PasswordHasher',
)
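To see why changing only the `algorithm` attribute is enough, here is a minimal, Django-free sketch of prefix-based hasher lookup. It assumes (as I understand Django's behaviour) that the text before the first `$` in a stored password names the hasher; the `HASHERS` registry and `pick_hasher` function are hypothetical names for illustration, not Django APIs.

```python
# Hypothetical registry mapping an algorithm prefix to a hasher name.
# In Django, the equivalent mapping is built from PASSWORD_HASHERS.
HASHERS = {
    "pbkdf2_sha256": "PBKDF2PasswordHasher",
    "bcrypt": "BCryptPasswordHasher",
    "bc": "DjangoBCryptPasswordHasher",  # the legacy django-bcrypt prefix
}


def algorithm_of(encoded):
    """Return the text before the first '$' in a stored password."""
    return encoded.split("$", 1)[0]


def pick_hasher(encoded):
    """Look up the hasher responsible for a stored password string."""
    algorithm = algorithm_of(encoded)
    try:
        return HASHERS[algorithm]
    except KeyError:
        raise ValueError("Unknown password hashing algorithm %r" % algorithm)
```

Without the `"bc"` entry, a legacy `bc$...` password has no matching hasher, which is the failure the custom backend avoids.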

Update: As pointed out by @chrisstreeter it’s also fairly trivial to do a data migration: https://gist.github.com/streeter/5534008
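The core of such a migration can be sketched as a pure string rewrite. This is not a reproduction of the linked gist; it assumes that everything after the `bc$` prefix is already in the form Django's own `bcrypt` hasher expects, which is what the subclass approach above relies on. The function name is hypothetical.

```python
LEGACY_PREFIX = "bc$"      # prefix written by django-bcrypt
DJANGO_PREFIX = "bcrypt$"  # prefix used by Django's BCryptPasswordHasher


def convert_legacy_hash(encoded):
    """Rewrite a legacy 'bc$' password string to use the 'bcrypt$' prefix.

    Strings that do not start with the legacy prefix are returned unchanged,
    so the function is safe to run over every row more than once.
    """
    if encoded.startswith(LEGACY_PREFIX):
        return DJANGO_PREFIX + encoded[len(LEGACY_PREFIX):]
    return encoded
```

In a real data migration you would apply this to each user's password column and save the rows that changed.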


A Weekend in Russia

This past weekend I had the opportunity to attend Russia’s first ever PyCon. If you’re not familiar, PyCon is the name used for several Python programming conferences. The event itself was set at a holiday lodge in Yekaterinburg and had somewhere between 200 and 300 attendees.

It’s not often I get the chance to attend the country-specific Python events, but the more of them I see and hear about the more envious I am of the communities. In the US we only have a couple of large Python related events, the two I attend being PyCon US (~2500 attendees this year), and DjangoCon.

The event itself was mostly Russian speakers, with a few international speakers. I, along with several others (including Russell Keith-Magee, Jeff Lindsay, Holger Krekel, Armin Ronacher, and Amir Salihefendic), was invited to speak at the event. I don’t speak any Russian, so I was not able to attend every talk, but all of the content I saw was very good.

Overall the feel was very personal, and while it was put on by a professional organization it really had the community feel that I miss from when PyCon was much smaller. It was two full days of talks, along with the typical social events you might find. It was a lot of fun, and it amazes me how large our industry is: even in a country located (inconveniently) so far away, they can still find plenty of people interested in attending.

The event started with all of the invited speakers receiving a Russian hat with the PyCon Russia logo. The organization (IT-People) running the conference had already gone out of their way to make it easy for us international attendees, and this added to the thoughtfulness that seemed to run throughout the conference.

I’m not familiar with how large many of the other country-specific Python conferences are, but I expect PyCon Russia will be even more successful next year. The community feeling that you get from events like this is why I enjoy attending EuroPython each year. I definitely miss that feeling as PyCon (US) has grown larger.

Below you’ll find the slides for my talk “Building to Scale”:


Moving On

For the last three years I’ve been at Disqus helping to scale the infrastructure, as well as the engineering team. During that time I’ve had the opportunity to work on some amazing things, with some amazing people. Disqus is one of the largest platforms on the web, and that has never been more exciting than it is today. It hasn’t been all heads-down product development though, as I’ve been able to spend time on some really cool (open source!) tools.

It’s been exciting to see Disqus grow from a traditional group of startup hackers into a company that’s on the cusp of doing something so much bigger.

A New Challenge

While I love the engineering challenges at Disqus, I’ve decided that it was time to try something new. With that said, this Friday will be my last day at Disqus.

I’ll be joining a fledgling company called tenXer, which aims to solve a problem that is near and dear to me. We’re trying to improve the way people work by using measurable metrics. Data has never been as accessible as it is today, and we want to take that data and empower the individual to be more successful.

Measuring Success

The goal is lofty, but the gist of it is that we take a ton of inputs like commit data, code reviews, or even closed tickets. With all of that data, we try to connect the dots and form a reasonable conclusion about how you work, and ideally suggest ways you can be more efficient and, more importantly, more successful.

There are some interesting ideas floating around, and the possibilities are endless. Imagine if you could track things like commits, and combine that with less obvious data like how you perform after taking a short vacation. How about the never-ending debate over how many days a week, or hours a day, you should work? We want to take what people have done by hand for decades and bring a modern solution to it.

Engineering Focus

We’re going to be focusing on measuring engineering components first. It’s important to us as we’re engineers as well, and it’s something that will really let us dogfood the system.

If you’ve got an interest in this kind of thing, I’d love to hear your thoughts. We all have very strong opinions (usually differing) about what are good and bad metrics, and it’s really interesting to hear others’ takes on these things.

p.s. I’ll be at PyCon in Santa Clara, as well as PyCon Russia (in two weeks). Let’s grab a drink :)


Dependency Graphs and Package Versioning

Today I had the unfortunate pleasure of attempting to upgrade a dependency on getsentry.com. The package I was upgrading contained a bugfix that I needed, so this was actually something I wanted, and needed to get done. Unfortunately, the package also contained a new requirement: requests >= 1.0.

Conflicting Dependencies

Normally dependencies aren’t too much of a nightmare. Every so often you’ll get a library which version locks something that isn’t sensible, and you’ll hit conflicts. In this case, I figured that since I was already relying on the last release before requests 1.0, upgrading it would go off without a hitch. Nope.

Upgrading the library resulted in several other dependencies complaining that they require requests < 1.0, or even worse, they didn’t report their dependency correctly and instead failed to even work (in the test suite, at least). I quickly learned that there were (at least) two major compatibility issues with this upgrade. Even worse, one of them was a fundamental core API.

Most libraries had support for this dependency in a newer version, but some of them weren’t even released. I ended up having to pin git SHAs on several of the dependencies, which for various reasons isn’t usually a good idea.

Libraries vs Applications

I’ve had various people today suggest that I should just “update my code”. I’ll assume those people don’t understand what a dependency graph is, and especially the limited scoping that Python lets us work with. This code is relying on a library, and unfortunately in this case, it’s a popular one. This means we end up with numerous dependencies, many of which also share common dependencies. For example, Django is a dependency of most of the components in Sentry. Django, however, has well spaced releases, and does an excellent job at maintaining compatibility (and deprecations) between point releases.

Several people have tried to suggest that a major version bump means they can break APIs. You can do whatever you want with your library, but that doesn’t mean you should. To put it frankly:

A library should never completely change APIs between releases.

So please, whether your semantic versioning playbook says you can do something or not, it’s your choice whether you do.

Deprecation Policies

Let me be the first to tell you that I’m not great at following deprecation policies in my open source work. I do try, but sometimes things just slip through that weren’t considered. Instead, let’s talk about another project that many of us use every day: Django.

Looking at how Django does it, generally you’ll be given one entire release cycle to add transitional support. For example, Django added multiple database support, which subsequently added a new configuration value called DATABASES. This supported many databases instead of one, which was previously defined using DATABASE_XXX values. In the version in which this was released, they maintained compatibility with both the new style and the old. This, among many other reasons, is why Django is a great framework to build on.

In the case of requests, a heavily used attribute on the Response class was changed. The json attribute was changed to be a callable. Now I’m not sure why (though reading the source it seems inconsistent), but it’s an extremely well traveled code path, and entirely backwards incompatible. These are the kinds of changes that frustrate me.
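For code that has to run against both styles during a transition, a small shim is one way to cope. This is a sketch only: `response_json` is a hypothetical helper, and the two classes are stand-ins for the old attribute style and the new callable style rather than anything from the requests library.

```python
def response_json(response):
    """Return decoded JSON whether `response.json` is a value or a method."""
    value = response.json
    # New-style responses expose a method; old-style ones expose the data.
    return value() if callable(value) else value


class OldStyleResponse(object):
    """Stand-in for a response where `json` is a plain attribute."""
    json = {"ok": True}


class NewStyleResponse(object):
    """Stand-in for a response where `json` is a method."""
    def json(self):
        return {"ok": True}
```

Calling `response_json()` on either stand-in returns the same dictionary, so callers never touch `.json` directly.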

Keep Things Simple

I want to make one final point. People have continually pestered me to use the requests library for trivial things. My response has always been simply that it is unnecessary. Is the API cleaner than urllib? It sure is. Is it worth introducing a dependency when all I’m doing is a simple GET or POST request? Almost never.

The Python standard library really isn’t that complicated. Consider the cost of a dependency the next time you introduce it.
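As a rough illustration of how little is needed for a simple GET or POST, here is a sketch using only the standard library. It uses the Python 3 module names (`urllib.request`, `urllib.parse`); the helper names `build_request` and `fetch` are made up for this example.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_request(url, params=None, data=None):
    """Build a GET request, or a form-encoded POST when `data` is given."""
    if params:
        url = url + "?" + urlencode(params)
    body = urlencode(data).encode("utf-8") if data is not None else None
    # urllib treats a request with a body as POST, otherwise as GET.
    return Request(url, data=body)


def fetch(url, params=None, data=None, timeout=10):
    """Perform the request and return the raw response body as bytes."""
    with urlopen(build_request(url, params, data), timeout=timeout) as response:
        return response.read()
```

For example, `fetch("http://example.com/", params={"q": "x"})` performs a GET, and passing `data={"a": "1"}` instead turns it into a POST.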


Being Wrong on the Internet

First, some context. I forget how, but a GitHub project came across my Twitter stream. I clicked into it and immediately saw that it was something I disagreed with (in its intent). In turn, I posted something on Twitter. Nothing extremely offensive, but nothing nice. The exact contents of the tweet were:

Ever wanted to make sed or grep worse?

Realistically, what I was suggesting is “this is a bad idea”. Whatever I said could have been more clear, more friendly, etc. It wasn’t. We all know how Twitter works. What I said wasn’t nice, I won’t contest that, I also won’t defend it.

The Twitter Effect

One might argue that I should only criticize something if I’m willing to give positive (proper?) criticism. I can agree with that. Take a step back however, and look at the means of communication. I’m posting on my personal Twitter feed, a space confined to a single thought (or barely connected thoughts) fitting within 140 characters.

It is extremely difficult to convey thoughts on Twitter. That isn’t an “excuse” for anything you say. You should be consciously aware of that. I usually am (though not always), and it sometimes makes it very hard to relay something. Even when I was asked why I said that, the best I did was “I dont understand why you would want this”. That’s not because I didn’t understand, but it’s simply my reaction to the fact that there’s no way I can realistically explain something to (or convince) someone given the constraints.

False Behavior

The reason I’m writing this post is not actually because I got mixed up into this conversation. What I’m actually frustrated about is that I saw responses like this:

http://news.ycombinator.com/item?id=5107089

I wouldn’t be at all surprised if there was a strong undercurrent of misogyny involved here, motivating their incivility and rudeness.

Let’s get some more context in here. The GitHub URL I originally saw was:

https://github.com/harthur/replace

I’m not entirely sure what drives someone to the conclusion that because “harthur”, who turns out to be “Heather”, is a woman, I would by default intentionally discriminate against her. Not every single thing on the internet is about gender equality, or about a minority. In fact, the only reason I’m writing this post is because this kind of continued behavior on the internet is one of the primary reasons this is such a problem.

I’m glad that some people see the insanity of Internet-drama when it happens, and more importantly aren’t sitting in a corner “being nice”, but are instead speaking their opinion, just like many of us (correctly or incorrectly) do every day. Whichever side of any debate you’re on, speaking your opinion is almost certainly better than sitting there quietly.

tl;dr

To people like Heather, criticism (good and bad) comes every day. It doesn’t matter what kind of person you are, and it doesn’t matter if you can handle it or not. It’s going to be there. Open source doesn’t change that. In fact, no ecosystem in society changes that. It’s there, and it’s not something everyone can deal with.

I would probably be in just as much of an uproar if many people said negative things about something I did. Whether I agree with your reaction (or many others) or not, this won’t be the last time you receive criticism, and what separates people is how they deal with it.

Steve Klabnik, one of the individuals who seems to be a lot more visible, said something that really resonates with how I see communication on the internet (not just Twitter):

Twitter makes it so hard not to accidentally be an asshole.

For posterity, here are some links to the (far too in depth) Hacker News threads:

Moving Sentry From Heroku to Hardware

Update: Don’t decide against Heroku just because you’ve read my blog. It makes some things (especially prototyping) very easy, and with certain kinds of applications it can work very well.

I’ve talked a lot about how I run getsentry.com, mostly with my experiences on Heroku and how I switched to leased servers. Many people consistently suggested that operations work is difficult so they shouldn’t deal with it themselves. I’m not going to tell you that my roommate, Mike Clarke, one of the few operations people we have at DISQUS, has it easy, but I’d like to give you a little bit of food for thought.

GetSentry started around Christmas of 2011. I had already built and open sourced Sentry at Disqus, and the idea was to take that work and create a Heroku AddOn out of it. The pitch was that I could make a little bit of money on the side simply by hosting Sentry for people. About three months later I had that prototype hosting service running on Heroku, accepting payments both via the AddOn infrastructure, as well as on my own using the amazing Stripe platform.

Let’s fast forward to today. I no longer run any servers on Heroku (or any cloud provider, other than S3 for backups), and instead I lease servers. Now the company I lease from is what most people would call a “budget provider”. They’re extremely cheap (they don’t add extreme margins to the cost of the machines you’re leasing), and they do absolutely nothing for you. It’s not for the faint of heart. That said, it’s also how I can get away with very low costs.

I’m going to tell you a bit of a story of how I switched from Heroku to fully configured leased servers in less than a week, in my free time. I’m also going to try to convince you that it’s really not that complicated.

The First Server

This part could be more appropriately titled “Learning Chef”. I’m fortunate to have some awesome coworkers, and even more fortunate that when I was making this transition I had access to my roommate to prod him with questions. I’m also extremely fortunate that mediums like Google, IRC, and Twitter exist for any other questions I ever have.

The first task I had in getting my prototype web server online was to get it all configured. I could have taken the old fashioned approach of creating a few config files locally (vcs maybe) and then sending them up to the server, as well as manually installing whatever packages I needed (nginx, memcache, etc.), but with Puppet and Chef becoming all the rage I figured it was as good a time as ever to dig into one.

I decided to use the Chef hosted service, and after a few bumps with figuring out what all this Ruby stuff was about, I had managed to get a basic understanding of roles and cookbooks. After quite a bit of fiddling I had created a cookbook specific to getsentry (which holds things like setting up various paths), and a bunch of generic ones, like apt, nginx, memcached, python, etc.

Creating a Recipe

The meat of this was handled via Chef’s awesome roles, and wiring up a few things in the ‘default’ recipe of getsentry:

include_recipe "python"

directory "/srv/www" do
  owner "root"
  group "root"
  mode "0755"
  action :create
end

directory "/srv/www/getsentry.com" do
  owner "dcramer"
  group "dcramer"
  mode "0755"
  action :create
end

This formed the basis of any server that I would be running, and simply sets up a couple of directories. I also simply gave ownership to my user, as I’m the only one working on the project, and didn’t need the added complexities of build or system users.

I then moved on to a second recipe, which formed the basis of a web node. This one has a lot more to it, as it needed to configure nginx and memcache at the start:

include_recipe "getsentry"
include_recipe "supervisor"

template "#{node[:nginx][:dir]}/sites-available/getsentry.com" do
  source "nginx/getsentry.erb"
  owner "root"
  group "root"
  mode 0644
  notifies :reload, "service[nginx]"
end

nginx_site "getsentry.com"

supervisor_service "web-1" do
  directory "/srv/www/getsentry.com/current/"
  command "/srv/www/getsentry.com/env/bin/python manage.py run_gunicorn -b 0.0.0.0:9000 -w #{node[:getsentry][:web][:workers]}"
  environment "DJANGO_CONF" => node[:django_conf]
  user "dcramer"
end

supervisor_service "web-2" do
  directory "/srv/www/getsentry.com/current/"
  command "/srv/www/getsentry.com/env/bin/python manage.py run_gunicorn -b 0.0.0.0:9001 -w #{node[:getsentry][:web][:workers]}"
  environment "DJANGO_CONF" => node[:django_conf]
  user "dcramer"
end

There is a bit more to it than what I’ve shown, but all in all it was pretty simple. It just took me a bit to understand how Chef functioned. I’m now an engineer who has experience with Chef, even if it’s very little. From my perspective (on the hiring end at Disqus), that’s an awesome addition to an engineer’s skillset.

Once the web server was online, all I had to do was to configure a primary database server. I simply brought up another node, gave it a new role (db), and didn’t even need to create a custom recipe (I simply reused the existing pgbouncer, postgresql, and redis recipes available elsewhere on the internet).

Operational Complexity

I stated in the beginning that I completed this process in less than a week. From Heroku to hardware it took me about three evenings of toying with Chef (mostly more complex components, like iptables and building a deploy script). What I really want to point out is how I have never been in an operations position. I’ve definitely configured servers (ala apt-get install nano), and know my way around, especially with a database, but most of this was fairly new to me.

The continued argument of it being “too difficult” to run your own servers is quite the overstatement, but it’s not something you should ignore. There are many things I have to be concerned about, most importantly data loss and the ability to recover in the event of a disaster on my machines. These also aren’t overly complex challenges to handle.

Data redundancy is handled by a simple cron script that does nightly backups to S3. It’s literally just a script that calls pg_dump and s3cmd to send the files upstream. Now that’s not enough for any real requirements, so step two is simply setting up replication on your database node to a second server, even if that server is your application server.
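A nightly backup along those lines might look roughly like the sketch below. It is not the actual script: the database name, bucket name, and `/tmp` dump path are placeholders, and it assumes `pg_dump` and a configured `s3cmd` are on the PATH.

```python
import datetime
import subprocess


def backup_commands(database, bucket, today=None):
    """Return the pg_dump and s3cmd commands for one nightly backup."""
    today = today or datetime.date.today()
    # Placeholder location; pick somewhere with enough free disk space.
    dump_path = "/tmp/%s-%s.dump" % (database, today.isoformat())
    dump = ["pg_dump", "--format=custom", "--file", dump_path, database]
    upload = ["s3cmd", "put", dump_path, "s3://%s/%s/" % (bucket, database)]
    return [dump, upload]


def run_backup(database, bucket):
    """Run the dump, then the upload; stop at the first failing command."""
    for command in backup_commands(database, bucket):
        subprocess.check_call(command)
```

Scheduling `run_backup(...)` from cron once a night gives you dated dumps in S3; keeping the command construction separate makes it easy to check without touching a real database.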

Availability is the second big problem, and is easily handled the same way that you avoid losing your database: have a second server. This again can be a server whose primary task is something other than your application (it can be your database). It doesn’t have to be a permanent home for it. It only has to survive until a primary server is available or you’re willing and able to invest in more hardware.

Closing Thoughts

I spent an initial three evenings, and another week’s worth since, on server configuration and operations. There were various problems like Postgres not being tuned well enough (pgtune is amazing by the way), DNS being slow (fuck it, use IPs), and some more minor things that needed to be addressed throughout that time. All in all, there’s basically zero day-to-day operations concerns, and most of the work happens when I need to expand the system (which is rare).

All of it ended up as an extremely valuable learning experience, but using Chef wasn’t a necessity. I could have done things the more “amateur” way, but I also now have the benefit of being able to bring online a server, run a few commands, and have a machine or even a cluster identical to what’s already running.

On the limited hardware I run for getsentry.com, that is, two servers that actually service requests (one database, one app), we’ve serviced around 25 million requests since August 1st, doing anywhere from 500k to 2 million in a single day. That isn’t that much traffic, but what’s important is it services those requests very quickly, and is using very little of the resources that are dedicated to it. In the end, this means that Sentry’s revenue will grow much more quickly than its monthly bill will.

GetSentry has been profitable since its 4th month, and currently only spends 10% of its monthly revenue on hardware and other third-party services. That gap gets larger every month, and I’ve been more than happy to invest some of my time to keep that gap as large as possible. The irony of it all? I’m selling a service that’s entirely open source, yet suggesting that you run your own hardware. For some people sacrificing cost for convenience is acceptable, for others it may not be.

Also, this.

Look for a future post with many more details on how I set up Chef (likely incorrectly) with more in-depth code and configuration from the cookbooks.


Scaling Your Clouds

My post yesterday seems to have gotten all the cloud fanboys’ panties into a twist, so I figured I’d give them something else to rage about.

There were lots of claims that without the cloud you can’t scale, or you don’t have redundancy, or you can’t come up with the result of 2 + 2. I can’t even explain the level of ignorance I’ve seen come out of the woodwork.

So let’s clarify some things.

“The Cloud”

There are many definitions that float around for “the cloud”, and what it means, and more specifically what it’s supposed to do for you. When I talk about it, I’m not talking about you setting up hundreds of your own servers and virtualizing them. We do that too. I’m talking about the notion that there’s some mythical provider that is going to cater to your needs and you’re never going to have to worry about operational concerns.

There is nothing wrong with using Heroku, AWS, Dotcloud, or any of the hundreds of other cloud providers out there. They all provide you with some level of relaxed operational requirements. That said, you’re still restricted to whatever completely fucking shit hardware they decide is right for virtualization. Now I’m not talking AWS so much, as they do allow reasonable size instances, but you’re still restricted to what they’re willing to offer. You never have the option to order custom hardware.

Scale

A bunch of the internet hipsters on Hacker News and elsewhere seem to think that if you use the cloud, your application is going to magically scale by adding more servers to it. That may be true if you’re using MongoDB, but we don’t live in a fairy tale here and it will not ever work. There are very few systems that I’m aware of that can scale from one machine to tens to hundreds to thousands without a massive rearchitecture of how you use the system.

One of the first things I pointed out in my article was the fact that I had to spin up a large number of instances to handle temporary workload. Too bad the database was bottlenecking on concurrent writes to the same row. You can’t ignore one important factor: I can’t just “spin up more database”. There are many amazing systems out there that are built on the notion of distributed data with the goal of some level of horizontal scalability (Riak, Cassandra). Even they do not allow you to spin up more servers and gain more capacity immediately.

Operations Complexity

Another argument that was brought up was the fact that I now personally have to deal with redundancy, monitoring, security fixes, OS upgrades, bringing up more servers, etc. Sure, that’s true. Except that it will cost me far less time than I would have spent trying to create a SQL database that can horizontally scale to infinity.

  • Redundancy is easy, especially at small scale. Cloud hosting is not going to solve your database redundancy for you.
  • Just because I’m hosting my own machines doesn’t mean I can’t use New Relic, or in my case Scout.
  • I don’t need to frequently bring up additional servers to handle the load because my actual hardware performs 2000 times better than my old virtualized hardware.
  • Security updates? OS reloads? It’s not like I’m compiling shit by hand, and through the convenience of configuration management this is unbelievably easy.

If you ignore the entirety of operations, you will never have any idea what’s going on when there’s a problem.

The Time/Cost Tradeoff

In my original post I stated it took me about three days to get everything into Chef, and have the new hardware ordered and online. Even if this was three full days of my time, I had just spent four days a previous week trying to get the infinitely scalable cloud solution to perform well enough. Simple math right, four is more than three. Not worth it.

I built getsentry.com specifically with the goal of optimizing cost vs profit margins. This is the first month that it’s been profitable, and unless every single customer jumps ship at once, it’s unlikely that I will ever have to put my own money (excluding my time) into the project again.

tl;dr

Virtualized computing has many great uses, but you do not need it, especially if you’re just starting a business. If you want to try out a provider, don’t let me stop you. Make your own decisions. That said, you can be anything at any random company and tell me you use the cloud successfully, and I’ll give you a pat on the back. I’ll then tell you that we rent servers successfully, and by we, I mean DISQUS.


The Cloud Is Not for You

Update: Did I hurt your feelings with this post? Read Scaling your Clouds so you can rage even more.

Well, maybe not specifically you, but the mass that screams it will solve their problems.

It’s been a fun year so far. There have been exciting things happening both for me personally, as well as at DISQUS. One of those things has been launching a new side project, getsentry.com. This is about its 4th month running publicly, and it’s been doing very well. I wanted to talk a little bit about where it started, and how quickly it’s shifted in the stance of where it’s going.

Around Christmas of 2011, and after a lot of prodding by Craig Kerstiens (of Heroku), I had finally given in to the pressure of creating a hosted version of Sentry to launch as a Heroku addon. I already knew Sentry was awesome, as did many others, and this just meant getting something I put a lot of effort into out in front of so many others. It was very little work to get things up and running on Heroku, and just as easy to set up the addon endpoints. We started a private beta shortly thereafter, and immediately picked up a bunch of the Django/Python crowd.

From there it slowly, but steadily grew in both customers and data. In fact, for the first couple of months we were able to survive on just a few dynos and the first tier of dedicated postgres (which was the $200 package at the time). We’ve also expanded to cover nearly all popular languages, including PHP, Ruby, Java, and even JavaScript.

A bit further in the background of how I structured the Sentry service:

  • Two separate apps (www and app)
  • SSL everywhere (two certs, two addons, $40/month plus SSL cert costs)
  • A minimum of two dynos each ($72/month~)
  • Tier-1 dedicated DB (Ronin, $200/month)

Now, before I continue, let me say that I thoroughly enjoyed using Heroku. It’s a great service, I’m friends with a lot of people there. That said, I want to explain why you shouldn’t use Heroku, or the cloud. Let me also clarify that I’m not talking about the limitations of the idea of the cloud, but more specifically the limitations I’ve seen from providers, and specifically my experience with Heroku.

Right from the get-go we had a system that had pretty good HA and redundancy, especially due to how Heroku’s Postgres solution works. Unfortunately, we quickly saw the limitations of what both the Postgres and the dynos could handle.

Our first attempt to address this was to add worker nodes (ala Celery) to handle concurrency better. This turned into one or two additional dynos dedicated to processing jobs, as well as an additional Redis addon. Unfortunately the Redis addon is completely overpriced, so we quickly shifted to pulling up a VM in Linode’s east coast datacenter instead. This bought us a little bit of time, but really I’d say we were only given an additional 10% capacity by what should have been a large optimization.

Another week or two went by, and it was suggested that we get off the Ronin database, and upgrade to the Fugu package (which bumped up the database cost to $400/month). This did quite a bit. In fact, this let us handle most things without too much of a concern. A little while down the road, we had a customer sign up who was actually sending realistic amounts of data. More specifically, not even close to the amount of data Disqus’ Sentry server handles, but about 10x more than the rest of our customers combined had been sending.

Then shit started to hit the fan.

In no specific order, we started finding numerous problems with various systems:

  • Redis takes too much memory to reliably queue Sentry jobs.
  • Dynos are either memory or CPU bound, but we have no idea how or why.
  • The Postgres server can’t handle any reasonable level of concurrency.
  • We randomly have to spin up 20 dynos to get anywhere in the queue backlog.

Given all of that, I made the decision that I was going to go back to using real hardware and managing it myself. I’m no stranger to operations work, though it’s never been my day job. I did however want to do this right, and with the advice of my coworker, friend, and roommate, Mike Clarke, I decided I’d set these up properly, with Chef.

About three days into it, I had learned how to use Chef (I don’t write Ruby), brought up two full pluggable configurations for a db node and a web node, written a deployment script in Fabric, migrated to the new hardware and destroyed my Heroku and Linode instances. Three days, that’s all it took to replace the cloud.

Now you might argue that the cloud lets you scale up easily. YOU ARE WRONG, IT DOES NOT. The cloud gives you the convenience, or more importantly, the illusion of convenience, that you can bring up nodes to add to your network without giving it much thought. You can do that. You don’t ever realistically need to do that.

Almost any company worth a damn can bring online a server within 24 hours, even budget companies. When have you actually needed turnaround time faster than that? If you did, maybe you should read up on capacity planning.

The hosted Sentry now runs on two budget servers: one runs Postgres, pgbouncer, and Redis, while the other handles Nginx, Celery, memcached, and the Python webserver. The cost for these two machines? About $300/month. When I destroyed Heroku, my bill was looking to be around $600-700 between Heroku and Linode. Given the numbers we run at Disqus, the physical hardware should be able to handle no less than 2000% the capacity I was struggling to handle on the cloud.

I’m not saying you can’t make use of the cloud. For example, Disqus uses Amazon for running large amounts of map/reduce work. You know, elastic computing, the kind of computing that is inconsistent, unplanned, or generally infrequent. I’m also not saying you shouldn’t use Heroku. You should see if it works for you. However, if you ever come up to me and argue that the cloud is going to fix any problem, I’ll make the assumption that you’re one of those annoying kids that runs around screaming MongoDB and Node.js are the answer to all of the world’s problems.


Distributing Work in Python Without Celery

We’ve been migrating a lot of data to various places lately at DISQUS. These generally have been things like running consistency checks on our PostgreSQL shards, or creating a new system which requires a certain form of denormalized data. It usually involves iterating through the results of an entire table (and sometimes even more), and performing some action based on each row. We never care about results, we just want to be able to finish as quickly as possible.

Generally, we’d just create a simple do_something.py that would look something like this:

for comment in RangeQuerySetWrapper(Post.objects.all()):
    do_something(comment)

Note: RangeQuerySetWrapper is a wrapper around Django’s ORM that efficiently iterates a table.
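
RangeQuerySetWrapper itself isn’t shown here, so what follows is only a sketch of one common way to walk a big table cheaply: fetch rows in primary-key order, one bounded batch at a time, and remember the last id seen. The post table, the iterate_by_pk name, and the use of sqlite3 are all assumptions made so the example runs on its own (on Python 3), and treating min_id as exclusive is a guess too.

```python
import sqlite3


def iterate_by_pk(conn, min_id=0, step=1000):
    """Yield (id, body) rows from a hypothetical ``post`` table in id order.

    Each query is bounded by ``step``, so memory use stays flat no matter
    how many rows the table holds.
    """
    last = min_id
    while True:
        rows = conn.execute(
            "SELECT id, body FROM post WHERE id > ? ORDER BY id LIMIT ?",
            (last, step),
        ).fetchall()
        if not rows:
            return
        for row in rows:
            yield row
        # Resume the next batch right after the highest id we just saw.
        last = rows[-1][0]
```

Because the only state is the last id, picking up where a previous run stopped is just a matter of passing that id back in as min_id.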

Eventually we came up with an internal tool to make this a bit more bearable. Mostly to handle resuming processes based on the last primary key, and to track status. It evolved into a slightly more complex, but still simple utility we called Taskmaster:

def callback(obj):
    do_something(obj)

def main(**options):
    qs = Post.objects.all()
    tm = Taskmaster(callback, qs, **options)
    tm.start()

This used to never be much of a problem. We’d just spin up some utility server and max the CPUs on that single machine to get data processed in a day or less. Lately however, we’ve grown beyond the bounds of what is reasonable for a single machine to take care of, and we’ve had to look towards other solutions.

Why Not Celery?

As with most people, we rely on Celery and RabbitMQ for distributing asynchronous tasks in our application. Unfortunately that’s not quite the ideal fit out of the box for us in these situations. The root of the problem stems from the fact that we may need to run through a billion objects, and without some effort, that would mean every single task would need to fit into a RabbitMQ instance.

Given that we can’t simply queue every task and then distribute them to some Celery workers, and even more so that we simply don’t want to bring up Celery machines/write throwaway Celery code for a simple script, we chose to take a different route. That route ended up with a simple distributed buffer queue, built on the Python multiprocessing module.

Introducing Taskmaster

Taskmaster takes advantage of the remote management capabilities built into the multiprocessing module. This makes it very simple to just throw in a capped Queue and have workers connect, get and execute jobs, and control state via that single master process. In the end, we came up with an API looking something like this:

# spawn the master process
$ tm-master taskmaster.example --reset --key=foo --host=0.0.0.0:5050

# run a slave
$ tm-slave do_something:handle_job --host=192.168.0.1:5050
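
The remote management piece of multiprocessing lives in multiprocessing.managers. As a rough illustration (not Taskmaster’s actual code), this is how a capped queue can be exposed over the network with BaseManager and picked up by a worker. The serve_queue and connect_queue helpers and the get_queue type name are made up for this sketch, it assumes Python 3, and the server runs in a background thread only so the whole thing fits in one script.

```python
import queue
import threading
from multiprocessing.managers import BaseManager


def serve_queue(jobs, address=("127.0.0.1", 0), authkey=b"foo"):
    """Expose ``jobs`` to remote workers and return the bound address."""

    class MasterManager(BaseManager):
        pass

    # Every remote call to get_queue() hands back a proxy to the same queue.
    MasterManager.register("get_queue", callable=lambda: jobs)
    manager = MasterManager(address=address, authkey=authkey)
    server = manager.get_server()
    # A real master would call serve_forever() in its own process.
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server.address


def connect_queue(address, authkey=b"foo"):
    """Connect to a running master and return a proxy to its queue."""

    class WorkerManager(BaseManager):
        pass

    WorkerManager.register("get_queue")
    manager = WorkerManager(address=address, authkey=authkey)
    manager.connect()
    return manager.get_queue()


if __name__ == "__main__":
    jobs = queue.Queue(maxsize=10000)  # the capped buffer
    address = serve_queue(jobs)
    jobs.put({"id": 1})
    remote = connect_queue(address)
    print(remote.get(timeout=5))
```

The worker only ever talks to the proxy, so the master stays the single place that knows how far through the data it has gotten.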

You’ll see the status on the master as things process, and if you cancel the process and start it again, it will automatically resume:

$ tm-master taskmaster.example --reset --key=foo --host=0.0.0.0:5050
Taskmaster server running on '0.0.0.0:5050'
Current Job: 30421 | Rate:  991.06/s | Elapsed Time: 0:00:40

Implementing the iterator and the callback are just as simple as they used to be:

def get_jobs(last=0):
    # ``last`` will only be passed if previous state was available
    for obj in RangeQuerySetWrapper(Post.objects.all(), min_id=last):
        yield obj

def handle_job(obj):
    print("Got %r!" % obj)

Now under the hood Taskmaster will continue to iterate on get_jobs whenever the size of the queue is under the threshold (which defaults to 10,000 items). This means we have a constant memory footprint and can just spin up slaves to process the data.
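
To make the constant-memory claim concrete, here is a small single-machine sketch of the same idea: a producer pulls from a generator only while a bounded queue has room, and workers drain it. Local threads stand in for remote slaves, and run_buffered is a hypothetical helper written for Python 3, not part of Taskmaster.

```python
import queue
import threading

_DONE = object()  # sentinel telling a worker to stop


def run_buffered(job_iter, handle_job, size=10000, workers=4):
    """Feed ``job_iter`` through a bounded queue to ``workers`` threads.

    Returns the number of jobs produced. Error handling is left out to
    keep the sketch short.
    """
    jobs = queue.Queue(maxsize=size)

    def worker():
        while True:
            job = jobs.get()
            if job is _DONE:
                return
            handle_job(job)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()

    produced = 0
    for job in job_iter:
        # put() blocks while the buffer is full, so the iterator is only
        # advanced when there is room for another job.
        jobs.put(job)
        produced += 1

    for _ in threads:
        jobs.put(_DONE)
    for thread in threads:
        thread.join()
    return produced
```

At most size jobs are ever held in memory, regardless of how many the iterator will eventually yield.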

Taskmaster is still new, but if you’re in need of these kinds of one-off migration scripts, we encourage you to try it out and see if it fits.


Using Travis-CI With Python and Django

I’ve been using Travis-CI for a while now. Both my personal projects and several of the libraries we maintain at DISQUS rely on it for Continuous Integration. I figured it was about time to confess my undying love for Travis, and throw up some notes about the defaults we use in our projects.

Getting started with Travis-CI is pretty easy. It involves putting a .travis.yml file in the root of your project, and configuring the hooks between GitHub and Travis. While it’s not always easy to get the hooks configured when you’re using organizations, I’m not going to talk much about that. What I do want to share is how we’ve structured our configuration files for our Django and Python projects.

A basic .travis.yml might look something like this:

language: python
python:
  - "2.6"
  - "2.7"
install:
  - pip install -q -e . --use-mirrors
script:
  - python setup.py test

Most of the projects themselves use Django, which also means they need to test several Django versions. Travis makes this very simple with its matrix builds. In our case, we need to set up a DJANGO matrix, and ensure it gets installed:

env:
  - DJANGO=1.2.7
  - DJANGO=1.3.1
  - DJANGO=1.4
install:
  - pip install -q Django==$DJANGO --use-mirrors
  - pip install -q -e . --use-mirrors

Additionally we generally conform to pep8, and we always want to run pyflakes against our build. We also use a custom version of pyflakes which allows us to filter out warnings, as those are never critical errors. Adding this in is pretty simple using the before_script hook, which gets run before the tests are run in script.

install:
  - pip install -q Django==$DJANGO --use-mirrors
  - pip install pep8 --use-mirrors
  - pip install https://github.com/dcramer/pyflakes/tarball/master
  - pip install -q -e . --use-mirrors
before_script:
  - "pep8 --exclude=migrations --ignore=E501,E225 src"
  - pyflakes -x W src

When all is said and done, we end up with something like this:

language: python
python:
  - "2.6"
  - "2.7"
env:
  - DJANGO=1.2.7
  - DJANGO=1.3.1
  - DJANGO=1.4
install:
  - pip install -q Django==$DJANGO --use-mirrors
  - pip install pep8 --use-mirrors
  - pip install https://github.com/dcramer/pyflakes/tarball/master
  - pip install -q -e . --use-mirrors
before_script:
  - "pep8 --exclude=migrations --ignore=E501,E225 src"
  - pyflakes -x W src
script:
  - python setup.py test

Travis will automatically matrix each environment variable with each Python version, so you’ll get a test run for every combination of the two. Pretty easy, right?
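
If it helps to picture what that matrix looks like, it is just the cross product of the two lists. The snippet below only illustrates that arithmetic for the config above; build_matrix is a made-up helper, not anything Travis exposes.

```python
from itertools import product


def build_matrix(pythons, envs):
    """Pair every Python version with every env entry."""
    return [{"python": py, "env": env} for py, env in product(pythons, envs)]


pythons = ["2.6", "2.7"]
envs = ["DJANGO=1.2.7", "DJANGO=1.3.1", "DJANGO=1.4"]

for build in build_matrix(pythons, envs):
    print(build["python"], build["env"])
```

Two Python versions times three Django versions gives six test runs per push.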
