David Cramer's Blog

Sentry on Riak

Over the course of running getsentry.com, one thing has become abundantly clear: you can never have too much disk space. In the 20 months it’s been running, we’ve doubled our disk consumption six times. This may not sound like a big deal, but we’ve always tried to be tight on expenses, and it gets even more complicated when this is your primary database cluster.

Recently, with about two weeks remaining on our Postgres nodes, we decided to take a new approach to Sentry’s datastores. That approach begins with introducing Riak.

Why Riak

We made the choice to conform to an extremely simple interface: get, set, and delete. A fairly standard interface, but more importantly one that nearly any data store on the planet can work with. With that decision made, Riak became an extremely simple and obvious solution. We wanted a datastore that was easily scalable, managing its own shards and routing. On top of this, Riak has always been something that I’ve been very fascinated with, but never really had an ideal use case for.

sentry.nodestore

The introduction of Riak began with us refactoring event blob storage in Sentry. This storage made up 90% of our disk on the Postgres cluster, yet is only read once for every 4000 writes. To get started, we went back to our interface, and designed a very simple abstraction:

class NodeStorage(object):
    def delete(self, id):
        """
        >>> nodestore.delete('key1')
        """

    def get(self, id):
        """
        >>> data = nodestore.get('key1')
        >>> print data
        """

    def set(self, id, data):
        """
        >>> nodestore.set('key1', {'foo': 'bar'})
        """

Now implementing any kind of storage on top of this was very straightforward, and we were able to quickly whip up a Django backend (which stores things very similarly to the previous behavior) as well as a Riak backend. Even better, Travis CI was the only thing that ever ran the integration tests against Riak itself.
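
For illustration, a Riak-backed implementation of this interface might look roughly like the sketch below. This is not Sentry’s actual backend; the bucket name and connection details are assumptions, and it leans on the basho riak Python client.

import riak


class RiakNodeStorage(NodeStorage):
    def __init__(self, bucket='nodes'):
        # Defaults to a local node; in production you'd point this at the cluster.
        self.client = riak.RiakClient()
        self.bucket = self.client.bucket(bucket)

    def delete(self, id):
        self.bucket.get(id).delete()

    def get(self, id):
        return self.bucket.get(id).data

    def set(self, id, data):
        self.bucket.new(id, data=data).store()

The Django backend follows the same shape, just backed by a table keyed on id.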

Once we had a basic storage mechanism completed, we had to implement some transitional components. In this case, the majority of the code lives in something called NodeField. While it’s not great to look at, it manages automatically migrating old nodes into our new node storage whenever they are saved. The behavior is nearly identical, with one exception: we had to explicitly request nodes. We solved this by introducing a new helper function:

event_list = group.event_set.all().order_by('-datetime')[:100]

Event.objects.bind_nodes(event_list, 'data')

Behind the scenes, this calls out to nodestore.get_multi([node_ids]).
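
get_multi isn’t shown in the interface above, but a naive default on the base class is trivial (this is an illustrative addition; a backend like the Riak one can override it with a proper batch fetch):

    # On NodeStorage: a naive fallback that issues one get() per id.
    def get_multi(self, id_list):
        return dict((id, self.get(id)) for id in id_list)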

Once done, we were able to access data just as before:

for event in event_list:
    print event.data

Spinning up Riak

By far my favorite part of this experience was how painlessly Riak’s clustering worked. In fact, I did very little in terms of production or capacity testing. A bit of math gave us capacity specs for the servers, so it was as simple as:

  • Pull down the Riak cookbook (Chef)
  • Tell all nodes to join the cluster
  • Run curl to ensure connectivity from web nodes
  • Enable dual-write to Postgres + Riak (or as we call it, the just-in-case mechanism)

To my amazement things worked perfectly. Literally, Riak “just worked”. We started by dual-writing to Riak and Postgres, and reading from Postgres, and within one hour we had transitioned to reading from Riak, and then quickly to removing Postgres altogether.

But all wasn’t perfect. We jumped into it so quickly that we slipped up on hardware verification, only to find out that there had been a communication error and our servers had been misconfigured.

Moving to LevelDB

We had provisioned a cluster of 2x 800GB SSDs, only to find out that somewhere there was a slip-up and the drives didn’t end up in the expected RAID configuration (RAID 0). On top of that, disk space was climbing at a rate of 120GB a day, which would quickly exhaust the collective space available on the cluster. Even more of an issue was that Bitcask’s memory consumption was higher than we had measured, and we wouldn’t even be able to use 60% of the 1.5TB expected to be available on each machine.

Now came the scary part: cycling all machines so we could reconfigure RAID and switch the bucket’s backend to eleveldb. After a bit of talking and researching, we found that the best way to migrate buckets was to remove a node from the cluster, change its storage backend, and then have it rejoin (thus re-transferring all data). Pulling the trigger on the first “leave the cluster” command was intimidating, but it worked flawlessly. From there it just required us to cycle through the machines to get them re-RAIDed and have them rejoin the cluster, one at a time.

After about 48 hours we had gone through each machine, smoothly transitioned each node to the new backend, and had plenty of capacity to get us through the foreseeable future.

Scaling Out

I’m super excited to see where we go with our node storage model. In the long term we’re looking to expand out more things with a graph-like approach, which would allow us to continue to use SQL as an indexer (even clustered), but give us the freedom to push more things into the distributed context that tools like Riak provide us. More importantly, we’ve managed to keep Sentry just as simple as it was before for small users, and allow us to continue to grow getsentry.com.

Never would I have imagined that this would have been such a painless migration. The future is here.


Dropbox

I’m excited to say that today is my first day at Dropbox!

I realized last month that I was getting burnt out on the startup culture, and decided I needed to make a change. The change was either going to be doing Sentry full-time, or joining a larger company where I could focus on the bigger picture.

After I talked to a couple of guys at Dropbox I quickly realized there was a really good spot for me over there. I’ll be working on a team that focuses on everything I care about: testing and productivity. It’s a new team at Dropbox, but the goal is an obvious one: help developers spend less time doing repetitive work.

As you’ve likely seen from talking to me or listening to my presentations, this is something I deeply care about, and I’m excited to be able to spend my time improving the working life of myself and others in the field.


The Business of Sentry

Two weeks ago I attended EuroPython. It’s one of my favorite events of the year (likely because of the amazing venue it’s been held at for the last three of those years). This year I gave a talk on how we operate Sentry, titled “Open Source as a Business”. It goes into detail on how we run Sentry from the business perspective, and some of the challenges (as well as lessons) we’ve faced with it.

I’d love to hear from more people who are trying to turn side projects into actual businesses. I’m sure there are many that are doing well besides Sentry, but it’d be interesting to hear other stories, and to see if we all ended up with the same conclusions.

If you weren’t at EuroPython, you can view the slides on Speaker Deck as well as the full recording on YouTube.


You Should Be Using Nginx + UWSGI

After lots of experimentation (between disqus.com and getsentry.com), I’m comfortable saying that uwsgi should be the standard in the Python world. Combine it with nginx and you’re able to get a lot of (relative) performance out of your threaded (or not) Python web application.

Update: Ignoring the age-old argument of “[whatever metric you give] is slow”, the requests I’m illustrating here are to the “Store” endpoint in Sentry, which processes an input event (anywhere from 20kb to 1mb in size), makes several network hops for various authorization and quota strategies, and then eventually queues up some operations. tl;dr it offloads as much work as possible.

Serving Strategies

There’s quite a number of ways you can run a Python application. I’m not going to include mod_wsgi, and most importantly, I’m not trying to illustrate how evented models work. I don’t believe they’re practical (yet) in the Python world, so this topic is about a traditional threaded (or multi-process) Python application.

Instead, I’m going to focus on two of the most popular solutions, and the two I’m most familiar with: gunicorn and uwsgi.

gunicorn

When you move past mod_wsgi your options are basically Python web servers. Lately, one of the most popular (read: trendy) choices has been gunicorn.

We actually still recommend using gunicorn for Sentry, but that’s purely out of convenience. It was pretty easy to embed within Django, and setup was simple.

It also has about 10% of the configuration options that uwsgi has (which might actually be a good thing for some people).

Other than that, it provides a nearly identical base feature set to uwsgi (or any other Python web server) for our comparative purposes.

uwsgi

The only alternative, in my opinion, to gunicorn is uwsgi. It’s slightly more performant, has too many configuration options to ever understand, and also gains the advantage of having a protocol that can communicate with nginx.

It’s also fairly simple to set up if you can find an article on it; more on that later.

I started running uwsgi with something like --processes=10 and --threads=10 to try and max CPU on my servers. There were two goals here:

  • Max CPU, which required us to…
  • Reduce memory usage, which was possible because..
  • Sentry is threadsafe, and threads are easy.

(For what it’s worth, Disqus runs single threaded, but I’m cheap, and I wanted to keep Sentry as lean as possible, which means squeezing capacity out of nodes)

Iterating to Success

I was pretty proud when we got API response times down to 40ms on average. When I say API, I’m only talking about the time it takes from the request hitting the Python server to the server returning its response to the proxy.

Unfortunately, it quickly became apparent that there were capacity issues when we started getting more traffic and larger spikes. We’d hit bumpy response times that were no longer consistent, but we still had about 30% memory and 60% CPU to spare on the web nodes.

After quite a few tweaks, what we eventually settled on was managing a larger number of uwsgi processes and letting nginx load balance them (vs. letting uwsgi itself load balance).

What this means is that instead of doing uwsgi --processes=10, we ran 10 separate uwsgi processes.

The result was a beautiful, consistent 20ms average response time.

API Times

Putting It Together

Because I like when people do more than talk, I wanted to leave everyone with some snippets of code from our Chef recipes which we used to actually set all of this up on the servers (with minimal effort).

nginx

The first piece of configuration is nginx. We need to programmatically add backends based on the number of uwsgi processes we’re running, so things became a bit more complicated.

We start by building up the list in our web recipe:

# recipes/web.rb

hosts = (0..(node[:getsentry][:web][:processes] - 1)).to_a.map do |x|
  port = 9000 + x
  "127.0.0.1:#{port}"
end

template "#{node['nginx']['dir']}/sites-available/getsentry.com" do
  source "nginx/getsentry.erb"
  owner "root"
  group "root"
  variables(
    :hosts => hosts
  )
  mode 0644
  notifies :reload, "service[nginx]"
end

Then the nginx config becomes pretty straightforward:

# templates/getsentry.erb

upstream internal {
  least_conn;
<% @hosts.each do |host| %>
  server <%= host %>;
<% end %>
}

server {
  location / {
    uwsgi_pass         internal;

    uwsgi_param   Host                 $host;
    uwsgi_param   X-Real-IP            $remote_addr;
    uwsgi_param   X-Forwarded-For      $proxy_add_x_forwarded_for;
    uwsgi_param   X-Forwarded-Proto    $http_x_forwarded_proto;

    include uwsgi_params;
  }
}

We’ve now set up nginx to create one upstream host per web process, starting at port 9000, and configured it to talk to uwsgi using its socket protocol.

uwsgi

On the other side of things, we’re using supervisor to control our uwsgi processes, so things are pretty straightforward here as well:

# recipes/web.rb

command = "/srv/www/getsentry.com/env/bin/uwsgi -s 127.0.0.1:90%(process_num)02d --need-app --disable-logging --wsgi-file getsentry/wsgi.py --processes 1 --threads #{node['getsentry']['web']['threads']}"

supervisor_service "web" do
  directory "/srv/www/getsentry.com/current/"
  command command
  user "dcramer"
  stdout_logfile "syslog"
  stderr_logfile "syslog"
  startsecs 10
  stopsignal "QUIT"
  stopasgroup true
  killasgroup true
  process_name '%(program_name)s %(process_num)02d'
  numprocs node['getsentry']['web']['processes']
end

One Way, and Only One Way

Unless someone comes up with an extremely convincing argument why there should be another way (or a situation where this can’t work), I hope to hear this pattern become more standard in the Python world. At the very least, I hope it sparks some debates on how to improve process management inside of things like uwsgi.

If you take nothing else away from this post, leave with the notion that uwsgi is the only choice for serving threaded (or non-threaded) Python web applications.

(I hastily wrote this post to illustrate some findings today, so pardon the briefness and likely numerous typos)


Making Django 1.5 Compatible With Django-bcrypt

Last night I took the opportunity to upgrade all of getsentry.com to Django 1.5. While most things were fairly trivial to sort out, we hit one less obvious (and pretty critical) bug during the migration surrounding django-bcrypt.

This bug would only present itself if you’ve transitioned from older versions of Django, and therefore have passwords in the database using the custom algorithm. Specifically, you’ll have passwords in your user’s table that look something like bc$$somestring$12$somestring.

The fix is actually fairly simple, and just requires you to define a slightly custom legacy backend for django-bcrypt:

from django.contrib.auth.hashers import BCryptPasswordHasher


class DjangoBCryptPasswordHasher(BCryptPasswordHasher):
    """
    Handles legacy passwords which were hashed with the 'bc$' algorithm via
    django-bcrypt.
    """
    algorithm = "bc"

Once you’ve defined the backend, the rest is as simple as adding it to your list of password hashers:

PASSWORD_HASHERS = (
    'django.contrib.auth.hashers.PBKDF2PasswordHasher',
    'django.contrib.auth.hashers.PBKDF2SHA1PasswordHasher',
    'django.contrib.auth.hashers.BCryptPasswordHasher',
    'getsentry.utils.auth.DjangoBCryptPasswordHasher',
    'django.contrib.auth.hashers.SHA1PasswordHasher',
    'django.contrib.auth.hashers.MD5PasswordHasher',
)

Update: As pointed out by @chrisstreeter it’s also fairly trivial to do a data migration: https://gist.github.com/streeter/5534008


A Weekend in Russia

This past weekend I had the opportunity to attend Russia’s first ever PyCon. If you’re not familiar, PyCon is the name used for several Python programming conferences. The event itself was set at a holiday lodge in Yekaterinburg and had somewhere between 200 and 300 attendees.

It’s not often I get the chance to attend the country-specific Python events, but the more of them I see and hear about the more envious I am of the communities. In the US we only have a couple of large Python related events, the two I attend being PyCon US (~2500 attendees this year), and DjangoCon.

The event itself was mostly Russian speakers, with a few international speakers. I, along with several others (including Russell Keith-Magee, Jeff Lindsay, Holger Krekel, Armin Ronacher, and Amir Salihefendic), was invited to speak at the event. Since I don’t speak any Russian I was not able to follow every talk, but all of the content I saw was very good.

Overall the feel was very personal, and while it was put on by a professional organization it really had the community feel that I miss from when PyCon was much smaller. It was two full days of talks, along with the typical social events you might find. It was a lot of fun, and it amazes me how large our industry is: even in a country located (inconveniently) so far away, they can still find plenty of people interested in attending.

The event started with all of the invited speakers receiving a Russian hat with the PyCon Russia logo. The organization (IT-People) running the conference had already gone out of their way to make things easy for us international attendees, and this added to the thoughtful feel that ran throughout the conference.

I’m not familiar with how large many of the other country specific Python conferences are, but I expect PyCon Russia will be even more successful next year. The community feeling that you get from events like this is why I enjoy attending EuroPython each year. I definitely miss that feeling as PyCon (US) has grown larger.

Below you’ll find the slides for my talk “Building to Scale”:


Moving On

For the last three years I’ve been at Disqus helping to scale the infrastructure, as well as the engineering team. During that time I’ve had the opportunity to work on some amazing things, with some amazing people. Disqus is one of the largest platforms on the web, and that has never been more exciting than it is today. It hasn’t been all heads-down product development though, as I’ve been able to spend time on some really cool (open source!) tools.

It’s been exciting to see Disqus grow from a traditional group of startup hackers into a company that’s on the cusp of doing something so much bigger.

A New Challenge

While I love the engineering challenges at Disqus, I’ve decided that it was time to try something new. With that said, this Friday will be my last day at Disqus.

I’ll be joining a fledgling company called tenXer, which aims to solve a problem that is near and dear to me. We’re trying to improve the way people work by using measurable metrics. Data has never been as accessible as it is today, and we want to take that data and empower the individual to be more successful.

Measuring Success

The goal is lofty, but the gist of it is that we take a ton of inputs like commit data, code reviews, or even closing tickets. With all of that data, we try to connect the dots and form a reasonable conclusion on how you work, and ideally suggest ways you can be more efficient, and more importantly more successful.

There are some interesting ideas floating around with it, but the possibilities are endless. Imagine if you could track things like commits, and combine that with less obvious data like how you perform after taking a short vacation. How about the never-ending debate over how many days a week, or hours a day, you should work? We want to take what people have done by hand for decades and bring a modern solution to it.

Engineering Focus

We’re going to be focusing on measuring engineering components first. It’s important to us as we’re engineers as well, and it’s something that will really let us dogfood the system.

If you’ve got an interest in this kind of thing, I’d love to hear your thoughts. We all have very strong (usually differing) opinions about what are good and bad metrics, and it’s really interesting to hear others’ takes on these things.

p.s. I’ll be at PyCon in Santa Clara, as well as PyCon Russia (in two weeks), so let’s grab a drink :)


Dependency Graphs and Package Versioning

Today I had the unfortunate pleasure of attempting to upgrade a dependency on getsentry.com. The package I was upgrading contained a bugfix that I needed, so this was actually something I wanted, and needed to get done. Unfortunately, the package also contained a new requirement: requests >= 1.0.

Conflicting Dependencies

Normally dependencies aren’t too much of a nightmare. Every so often you’ll get a library which version locks something that isn’t sensible, and you’ll hit conflicts. In this case, I figured that since I was already relying on the previous release before requests 1.0, that upgrading it would go off without a hitch. Nope.

Upgrading the library resulted in several other dependencies complaining that they require requests < 1.0, or even worse, they didn’t report their dependency correctly and instead failed to even work (in the test suite, at least). I quickly learned that there were (at least) two major compatibility issues with this upgrade. Even worse, one of them was a fundamental core API.

Most libraries had support for this dependency in a newer version, but some of them weren’t even released. I ended up having to pin git SHAs on several of the dependencies, which for various reasons isn’t usually a good idea.
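
For context, pinning a dependency to an exact git SHA in a pip requirements file looks something like the line below (the package name and SHA here are made up for illustration):

# requirements.txt
git+https://github.com/example/somelib.git@1a2b3c4d5e6f7890#egg=somelib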

Libraries vs Applications

I’ve had various people today suggest that I should just “update my code”. I’ll assume those various people don’t understand what a dependency graph is, and especially the limited-scoping one that Python lets us work with. This code is relying on a library, and unfortunately in this case, it’s a popular one. This means we end up with numerous dependencies, many of which also share common dependencies. For example, Django is a dependency of most of the components in Sentry. Django, however, has well-spaced releases, and does an excellent job at maintaining compatibility (and deprecations) between point releases.

Several people have tried to suggest that a major version bump means they can break APIs. You can do whatever you want with your library, but that doesn’t mean you should. To put it frankly:

A library should never completely change APIs between releases.

So please, whether your semantic versioning playbook says you can do something or not, it’s your choice whether you do.

Deprecation Policies

Let me be the first to tell you that I’m not great at following deprecation policies in my open source work. I do try, but sometimes things just slip through that weren’t considered. Instead, let’s talk about another project that many of us use every day: Django.

Looking at how Django does it, you’ll generally be given one entire release cycle of transitional support. For example, Django added multiple database support, which introduced a new configuration value called DATABASES. This supports many databases instead of one, which was previously defined using DATABASE_XXX values. In the version in which this was released, they maintained compatibility with both the new style and the old. This, among many other reasons, is why Django is a great framework to build on.
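
Concretely, that transition looked roughly like this in settings.py (the values here are illustrative, not from getsentry):

# Old, single-database style (deprecated, but still honored during the
# transitional releases):
DATABASE_ENGINE = 'postgresql_psycopg2'
DATABASE_NAME = 'mydb'

# New style, introduced alongside multiple database support:
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'mydb',
    },
}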

In the case of requests, a heavily used attribute on the Response class was changed. The json attribute was changed to be a callable. Now I’m not sure why (though reading the source it seems inconsistent), but it’s an extremely well traveled code path, and entirely backwards incompatible. These are the kinds of changes that frustrate me.
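
If you have to straddle both sides of that change, a small shim goes a long way. A rough sketch (not from the original post):

def response_json(response):
    # In requests < 1.0, `json` was a property; in >= 1.0 it's a method.
    json_attr = response.json
    return json_attr() if callable(json_attr) else json_attr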

Keep Things Simple

I want to make one final point. Continually, people have pestered me to use the requests library for trivial things. My response has always been simply that it is unnecessary. Is the API cleaner than urllib? It sure is. Is it worth introducing a dependency when all I’m doing is a simple GET or POST request? Almost never.

The Python standard library really isn’t that complicated. Consider the cost of a dependency the next time you introduce it.


Being Wrong on the Internet

First, some context. I forget how, but a GitHub project came across my Twitter stream. I clicked into it, and immediately saw it was something that I disagreed with (in its intent). In turn, I posted something on Twitter. Nothing extremely offensive, but nothing nice. The exact contents of the tweet were:

Ever wanted to make sed or grep worse?

Realistically, what I was suggesting is “this is a bad idea”. Whatever I said could have been more clear, more friendly, etc. It wasn’t. We all know how Twitter works. What I said wasn’t nice, I won’t contest that, I also won’t defend it.

The Twitter Effect

One might argue that I should only criticize something if I’m willing to give positive (proper?) criticism. I can agree with that. Take a step back however, and look at the means of communication. I’m posting on my personal Twitter feed, a space confined to a single thought (or barely connected thoughts) fitting within 140 characters.

It is extremely difficult to convey thoughts on Twitter. That isn’t an “excuse” for anything you say. You should be consciously aware of that. I usually am (though not always), and it sometimes makes it very hard to relay something. Even when I was asked why I said that, the best I did was “I dont understand why you would want this”. That’s not because I didn’t understand, but it’s simply my reaction to the fact that there’s no way I can realistically explain (or convince) someone of something given the constraints.

False Behavior

The reason I’m writing this post is not actually because I got mixed up into this conversation. What I’m actually frustrated about is that I saw responses like this:

http://news.ycombinator.com/item?id=5107089

I wouldn’t be at all surprised if there was a strong undercurrent of misogyny involved here, motivating their incivility and rudeness.

Let’s get some more context in here. The GitHub URL I originally saw was:

https://github.com/harthur/replace

I’m not entirely sure what drives someone to the conclusion that because “harthur”, who turns out to be “Heather”, is a woman, I would by default intentionally discriminate against her. Not every single thing on the internet is about gender equality, or about a minority. In fact, the only reason I’m writing this post is because this kind of continued behavior on the internet is one of the primary reasons this is such a problem.

I’m glad that some people see the insanity of Internet-drama when it happens, and more importantly aren’t sitting in a corner “being nice”, but are instead speaking their opinion, just like many of us (correctly or incorrectly) do every day. Whichever side of any debate you’re on, speaking your opinion is almost certainly better than sitting there quietly.

tl;dr

To people like Heather, criticism (good and bad) comes every day. It doesn’t matter what kind of person you are, and it doesn’t matter if you can handle it or not. It’s going to be there. Open source doesn’t change that. In fact, no ecosystem in society changes that. It’s there, and it’s not something everyone can deal with.

I would probably be in just as much of an uproar if many people said negative things about something I did. Whether I agree with your reaction (or many others) or not, this won’t be the last time you receive criticism, and what separates people is how they deal with it.

Steve Klabnik, one of the individuals who seems to be a lot more visible, said something that really resonates with how I see communication on the internet (not just Twitter):

Twitter makes it so hard not to accidentally be an asshole.

For posterity, here are some links to the (far too in depth) Hacker News threads:

Moving Sentry From Heroku to Hardware

Update: Don’t decide against Heroku just because you’ve read my blog. It makes some things (especially prototyping) very easy, and with certain kinds of applications it can work very well.

I’ve talked a lot about how I run getsentry.com, mostly with my experiences on Heroku and how I switched to leased servers. Many people consistently suggested that operations work is difficult so they shouldn’t deal with it themselves. I’m not going to tell you that my roommate, Mike Clarke, one of the few operations people we have at DISQUS, has it easy, but I’d like to give you a little bit of food for thought.

GetSentry started around Christmas of 2011. I had already built and open sourced Sentry at Disqus, and the idea was to take that work and create a Heroku AddOn out of it. The pitch was that I could make a little bit of money on the side simply by hosting Sentry for people. About three months later I had that prototype hosting service running on Heroku, accepting payments both via the AddOn infrastructure and on my own using the amazing Stripe platform.

Let’s fast forward to today. I no longer run any servers on Heroku (or any cloud provider, other than S3 for backups), and instead I lease servers. Now, the company I lease from is what most people would call a “budget provider”. They’re extremely cheap (they don’t add extreme margins to the cost of the machines you’re leasing), and they do absolutely nothing for you. It’s not for the faint of heart. That said, it’s also how I can get away with very low costs.

I’m going to tell you a bit of a story of how I switched from Heroku to fully configured leased servers in less than a week, in my free time. I’m also going to try to convince you that it’s really not that complicated.

The First Server

This part could be more appropriately titled “Learning Chef”. I’m fortunate to have some awesome coworkers, and even more fortunate that when I was making this transition I had access to my roommate to prod him about questions. I’m also extremely fortunate that mediums like Google, IRC, and Twitter exist for any other questions I ever have.

The first task in getting my prototype web server online was to get it all configured. I could have taken the old-fashioned approach of creating a few config files locally (in version control, maybe) and then sending them up to the server, as well as manually installing whatever packages I needed (nginx, memcache, etc.), but with Puppet and Chef becoming all the rage I figured it was as good a time as ever to dig into one.

I decided to use the Chef hosted service, and after a few bumps with figuring out what all this Ruby stuff was about, I had managed to get a basic understanding of roles and cookbooks. After quite a bit of fiddling I had created a cookbook specific to getsentry (which holds things like setting up various paths), and a bunch of generic ones, like apt, nginx, memcached, python, etc.

Creating a Recipe

The meat of this was handled via Chef’s awesome roles, and wiring up a few things in the ‘default’ recipe of getsentry:

include_recipe "python"

directory "/srv/www" do
  owner "root"
  group "root"
  mode "0755"
  action :create
end

directory "/srv/www/getsentry.com" do
  owner "dcramer"
  group "dcramer"
  mode "0755"
  action :create
end

This formed the basis of any server that I would be running, and simply setup a couple of directories. I also simply gave ownership to my user, as I’m the only one working on the project, and didn’t need the added complexities of build or system users.

I then moved on to a second recipe, which formed the basis of a web node. This one has a lot more to it, as it needed to configure nginx and memcache at the start:

include_recipe "getsentry"
include_recipe "supervisor"

template "#{node[:nginx][:dir]}/sites-available/getsentry.com" do
  source "nginx/getsentry.erb"
  owner "root"
  group "root"
  mode 0644
  notifies :reload, "service[nginx]"
end

nginx_site "getsentry.com"

supervisor_service "web-1" do
  directory "/srv/www/getsentry.com/current/"
  command "/srv/www/getsentry.com/env/bin/python manage.py run_gunicorn -b 0.0.0.0:9000 -w #{node[:getsentry][:web][:workers]}"
  environment "DJANGO_CONF" => node[:django_conf]
  user "dcramer"
end

supervisor_service "web-2" do
  directory "/srv/www/getsentry.com/current/"
  command "/srv/www/getsentry.com/env/bin/python manage.py run_gunicorn -b 0.0.0.0:9001 -w #{node[:getsentry][:web][:workers]}"
  environment "DJANGO_CONF" => node[:django_conf]
  user "dcramer"
end

There is a bit more to it than what I’ve shown, but all in all it was pretty simple. It just took me a bit to understand how Chef functioned. I’m now an engineer with some experience in Chef, even if it’s very little. From my perspective (on the hiring end at Disqus), that’s an awesome addition to an engineer’s skillset.

Once the web server was online, all I had to do was configure a primary database server. I simply brought up another node, gave it a new role (db), and didn’t even need to create a custom recipe (I simply reused the existing pgbouncer, postgresql, and redis recipes available elsewhere on the internet).

Operational Complexity

I stated in the beginning that I completed this process in less than a week. From Heroku to hardware it took me about three evenings of toying with Chef (mostly more complex components, like iptables and building a deploy script). What I really want to point out is that I have never been in an operations position. I’ve definitely configured servers (a la apt-get install nano), and know my way around, especially with a database, but most of this was fairly new to me.

The continued argument of it being “too difficult” to run your own servers is quite the overstatement, but it’s not something you should ignore. There are many things I have to be concerned about, most importantly data loss and the ability to recover in the event of a disaster on my machines. These also aren’t overly complex challenges to handle.

Data redundancy is handled by a simple cron script that does nightly backups to S3. It’s literally just a script that calls pg_dump and s3cmd to send the files upstream. Now, that’s not enough for any real requirements, so step two is simply setting up replication from your database node to a second server, even if that server is your application server.
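
As a rough sketch (the database name, paths, and bucket below are placeholders, not the actual script), the nightly job amounts to something like:

#!/usr/bin/env python
# Nightly Postgres backup shipped to S3. Assumes pg_dump and s3cmd are
# installed and already configured on the host.
import datetime
import subprocess

filename = '/tmp/getsentry-%s.sql.gz' % datetime.date.today()

# Dump the database and compress it.
subprocess.check_call('pg_dump getsentry | gzip > %s' % filename, shell=True)

# Ship the dump to S3.
subprocess.check_call(['s3cmd', 'put', filename, 's3://example-backups/'])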

Availability is the second big problem, and is addressed the same way that you avoid losing your database: have a second server. This again can be a server whose primary task is something other than your application (it can be your database server). It doesn’t have to be a permanent home. It only has to survive until a primary server is available or you’re willing and able to invest in more hardware.

Closing Thoughts

I spent an initial three evenings, and another week’s worth of time since, on server configuration and operations. There were various problems, like Postgres not being tuned well enough (pgtune is amazing, by the way), DNS being slow (fuck it, use IPs), and some more minor things that needed to be addressed throughout that time. All in all, there are basically zero day-to-day operations concerns, and most of the work happens when I need to expand the system (which is rare).

All of it ended up being an extremely valuable learning experience, but using Chef wasn’t a necessity. I could have done things the more “amateur” way, but I also now have the benefit of being able to bring a server online, run a few commands, and have a machine or even a cluster identical to what’s already running.

On the limited hardware I run for getsentry.com (that is, two servers that actually service requests: one database, one app), we’ve serviced around 25 million requests since August 1st, doing anywhere from 500k to 2 million in a single day. That isn’t that much traffic, but what’s important is that it services those requests very quickly, and uses very little of the resources dedicated to it. In the end, this means that Sentry’s revenue will grow much more quickly than its monthly bill will.

GetSentry has been profitable since its 4th month, and currently spends only 10% of its monthly revenue on costs (hardware and other third-party services). That gap gets larger every month, and I’ve been more than happy to invest some of my time to keep that gap as large as possible. The irony of it all? I’m selling a service that’s entirely open source, yet suggesting that you run your own hardware. For some people sacrificing cost for convenience is acceptable; for others it may not be.

Also, this.

Look for a future post with many more details on how I set up Chef (likely incorrectly) with more in-depth code and configuration from the cookbooks.
