David Cramer's Blog

Sticking With Standards

More and more I’m seeing the “requirements.txt pattern” come up. This generally refers to projects (but not just), and seems to have started around the same time as Heroku adopting Python. I feel like this is something that matters in the Python world, and because I have an opinion on everything, I want to share mine.

requirements.txt

Let’s first talk about what this pattern actually is. As you should already be familiar with pip (if you’re not, this post is not for you), the idea of this is that whatever you’re doing, is installable by pointing pip at a requirements.txt file which contains a list of your projects dependencies. This has some obvious benefits, one being that you can mark repositories as dependencies.

Another benefit of this is when you have a large project (like DISQUS) and your dependencies can vary between environments. For example, we have several various requirements files for disqus-web (our largest package):

requirements/global.txt
requirements/production.txt
requirements/development.txt

These end up being pretty obvious, and when an app has specific needs there’s no reason not to approach the problem this way. That said, you dont need to do things this way, and in every project other than our main repository, including our open source work, all dependencies are specified completely in setup.py. Even in this case, we could just as easily specify our core requirements as part of the package and simply have additional files which label the production and development dependencies.

setup.py is the right choice

A common argument for not using setup.py is that a library is not the same as an app (or larger project). Why not? We employ the same metadata in everything. Each contains a list of dependencies, some various metadata, and possibly a list of extra resources (such as scripts, or documentation). Fundamentally they’re identical. Additionally, if pip is your thing, it does not prevent you from using setup.py. Let’s take an example setup.py:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
from setuptools import setup, find_packages

requires = [
    'Flask==0.8',
    'redis==2.4.11',
    'hiredis==0.1.1',
    'nydus==0.8.1',
]


setup(
    name='something-sexy',
    version='1.0.0',
    author="DISQUS",
    author_email="dev@disqus.com",
    package_dir={'': 'src'},
    packages=find_packages("src"),
    install_requires=requires,
    zip_safe=False,
)

Now, in our case, this is probably a service on Disqus, which means we’re not listing it as a dependancy. In every single scenario we have, we want our package to be on PYTHONPATH, and this is no different. There’s many ways to solve the problem, and generally adjusting sys.path is not what you’re going to want. Whether you install the package or you just run it as an editable package (via pip install -e or setuptool’s develop command), packaging your app makes it that much easier.

What’s even more important is that you stick with standards, especially in our growing ecosystem of open source and widely available libraries. There’s absolutely no reason to have to explain to a developer that they need to run some arbitrary command to get your neat tool to install. Following the well defined and adopted standards ensures that is never the case.

Keep it simple. Keep it obvious.

Comments

Using Arrays as Materialized Paths in Postgres

Something we’ve been casually working on at Disqus for quite some time is an improved pagination method for threaded comments. This is obviously pretty important to us, it drives the very foundation of our product. It also happens to be an area that’s somewhat challenging, and has a wide array of solutions. In the end, this is an overly complicated solution to solve the problem of threads having 10s or 100s of thousands of comments.

For some background, our first implementation is very similar to how Reddit and many other systems work. It generally looks something like this:

  1. Fetch all children for a tree
  2. Resort them in memory
  3. Return the N entry result set

While fairly easy to implement, this has the enormous cost of pulling down every single child and resorting it at an application level. There are various ways to optimize this, and we even attempted doing it within the database itself. In the end, none of our solutions worked at scale. They would either be too write heavy, or they’d move too much of the logic (read: CPU usage) to the database servers. That said, they led to something great, and in the end we settled on a solution that’s neither too write or read heavy. That solution was materialized paths, but not in your typical way.

A materialized path generally is represented as a serialization of all parents. So in a simple case, it might be a simple delimited list of id values. As an example, let’s say that we have a list of comments that are guaranteed to only be less than 1000 for their identifying value:

001
001002
001002003
001002007
001004
001005
001005006

In this case we’ve managed to stuff all of this into a sortable numeric value. Unfortunately, in the real world, it’s never this easy, so we looked for existing solutions to solve this problem. We’ll skip all of the bikeshedding here, and jump straight to our solution: Arrays.

Arrays are quite an interesting feature in Postgresql. They’re a native data type, indexable, sortable, and contain a variety of operators and functions (and even more so in 8.4+). They also fit in nicely with our previous solution, with the caveat that we had to write to the arrays rather than generate them at execution time. In fact, they fit so well that we were able to directly translate a majority of the effort we spent while toying with CTEs.

What we finally settled on was a schema which looks something like this:

\d postsort

  Column   |   Type    | Modifiers 
-----------+-----------+-----------
 tree_id   | integer   | not null
 child_id  | integer   | not null
 value     | numeric[] | not null

Indexes:
    "postsort_pkey" PRIMARY KEY, btree (tree_id, child_id)
    "postsort_path" btree (tree_id, value)

A simple three-column schema gives us:

  • tree_id The root node for this tree (for us, this is a comment thread)
  • child_id A child contained within this tree. There’s a row for every child
  • value Our materialized path, implemented as an array

The most important bit here is the value, and even more so what that array contains. Let’s take a look at our previous example of simple numeric IDs, and how that’d be represented in this table:

child_id | value
----------------
1        | [1.0]
2        | [1.0, 2.0]
3        | [1.0, 2.0, 3.0]
7        | [1.0, 2.0, 7.0]
4        | [1.0, 4.0]
5        | [1.0, 5.0]
6        | [1.0, 5.0, 6.0]

You’ll notice that the value always contains the id of the child as the last element, and is prefixed parents value. The child’s ID must be present in order to guarantee sortability in conditions where these values are not unique. More specifically, in a real world scenario, you’ll probably have some kind of score that you’d be including. As a demonstration of this eventual conflict, take the following values:

child_id | value
----------------
1        | [0.9134834, 1.0]
2        | [0.9134834, 1.0, 0.149341, 2.0]
3        | [0.9134834, 1.0, 0.149341, 2.0, 0.14123434, 3.0]
4        | [0.9134834, 1.0, 0.149341, 2.0, 0.14123434, 7.0]
5        | [0.9134834, 1.0, 0.149341, 5.0]
6        | [0.9134834, 1.0, 0.149341, 5.0, 0.601343, 5.0]

You’ll see that we had a conflicting score for two children. If we always include the unique identifying numeric value we’ll never have to worry about rows shifting into parents which they’re not a part of. You will also see that we’ve prefixed each child’s value with the score. This again gives us the numeric sorting order which we’re looking for and allows us to sort by any arbitrary score. This could be anything from a timestamp to a completely custom scoring algorithm based on something like up and down votes on a child.

The schema and data storage is pretty straightforward, the bigger challenge is actually implementing the logic in your application (or if you’re insane, within SQL triggers). We end up with a mess of SQL statements, with a singular goal to bring everything down to an atomic, transactionless nature. As an example, creating a new child probably resemebles something like the following:

1
2
3
4
5
6
7
8
9
10
11
INSERT INTO postsort (
    tree_id,
    child_id,
    value
)
SELECT t2.tree_id,
       %(child_id)d as child_id,
       (t2.value || %(value)s::numeric[]) as value
FROM postsort as t2
WHERE t2.tree_id = %(tree_id)d
  AND t2.child_id = %(parent_child_id)d

Once you’ve populated the table, queries become amazingly simple:

1
2
3
4
SELECT child_id
FROM postsort
WHERE tree_id = %(tree_id)s
ORDER BY value

What’s even more cool, aside from a lot of custom SQL we had to create for this to work in Django, is the fact that we were able to easily prototype and implement arrays within the Django ORM:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
class NumericArrayField(models.Field):
    __metaclass__ = models.SubfieldBase

    def db_type(self):
        return "numeric[]"

    def get_prep_value(self, value):
        if value:
            value = map(float, value)
        return value

    def to_python(self, value):
        if value:
            value = map(float, value)
        return value

We’ve just begun rolling this out at Disqus, but our initial performance and capacity tests are showing great results. The flexibility of arrays has been amazingly helpful in this scenario, and has pushed us into a new direction in what we can do with SQL. Disqus reaches more than 700 million unique visitors across its platform, and as always, Postgres has stood its ground and will continue to be our primary datastore of choice.

If Disqus sounds interesting to you, and you think you’re a good fit and we’re looking for passionate people to join our team.

Comments

Scaling Schema Changes

I frequently get asked how Disqus deals with schema changes. It’s a fair question, since we operate a fairly large amount of servers, but I also tend to think the answer is somewhat obvious. So let’s start with the problem of schema changes at scale (in PostgreSQL).

Generally you have some table, let’s call it a profile (since people seem to enjoy changing those). Well today, a new service has launched called Twooter, and we want to denormalize the user’s Twooter name into their profile. To do this we need to add a new field, twooter_username.

DDL First

The first thing we have to realize, is that everyone will not have twooter_username. Now even if that weren’t true, it needs to be to maintain compatibility, and efficiency. For us, this means that all additions must be made as NULLable columns. This means that the old code can stay in place whether the schema change has been made or not, and more importantly, NULLable ALTERs are much quicker in Postgres.

It’s very important that the schema change is made before the application’s new version is deployed. Ideally you want to do the change as soon as the schema is finalized. I’ll talk more a bit about the reasons for that later.

Application Changes

The second thing we need to concern ourselves with is our application logic. As I said before you must do the DDL before deploying your code changes. For us, this means all DDL happens in a branch, and can be merged once the change is completed. I also mentioned that additions must be NULLable, which not only means we can do the schema change before updating our application, but we also ensure forwards and backwards compatibility.

In addition to waiting for the schema change to complete before deploying your application, some changes may require several other steps along the release process. As an example, maybe we already had twooter_username stored in a different table, and we were literally just moving it to optimize our data access. This happens with a two things:

  • A write-through cache in the application to ensure new data is stored.
  • A backfill operation to ensure old data is stored (this also must be idempotent).

Once we’ve taken care of the above steps, only then can we actually utilize read operations on this new data. What this generally means is multi-step process to add a new data pattern:

  1. Perform DDL.
  2. Deploy write-through cache code.
  3. Run backfill operation.
  4. Run sanity checks (verify the data is correct, and exists).
  5. Deploy code which utilizes new data.

DDL on a Cluster

I’ve mostly been talking about how we scale the application side (read: code) for our DDL changes, but it’s also important to note how we do no-downtime schema changes. For this there are two important concepts we utilize: platform-wide read-only mode, and enough capacity to remove a node from the cluster. The last part is important: enough capacity to remove a node from the cluster.

Now let’s say this twooter_username is going to be added to a table which is so large, that even a fast NULLable ALTER cannot be run in production. In this case we’re actually going to need to swap out our master PG node to ensure we don’t hit any downtime, or slowness while making these changes. This is where read-only mode comes into play. It looks something like this:

  1. Take a slave out of the pool.
  2. Run DDL on slave.
  3. Put it back into the pool.
  4. (repeat on all slaves)
  5. Turn on read-only.
  6. Promote a slave to master.
  7. (repeat DDL operation on former-master)

And that’s all there is to it. I’d be curious to hear if anyone else is doing things differently.

Comments

Integrating Django With Nose at DISQUS

About a month ago we decided to make the transition off of Django’s test suite over to the Nose runners. Our main selling point was the extensibility, and the existing ecosystem of plugins. Four weeks later I’m happy to say we’re running (basically) Nose with some minor extensions, and it’s working great.

Getting Django running on Nose is no small feat. Luckily, someone else has already put in a lot of that effort, and packaged it up all nice and neat as django-nose. I won’t go through setting up the package, but it’s pretty straight forward. One thing that we quickly noticed however, was that it didnt quite fit our approach to testing, which was strictly unittest. After a couple days of going back and forth with some minor issues, we came up with a few pretty useful extensions to the platform.

A few of the big highlights for us:

  • Xunit integration (XML output of test results)
  • Skipped and deprecated test hooks
  • The ability to organize tests outside of the Django standards

I’m wanted to talk a bit about how we solved some of our problems, and the other benefits we’ve seen since adopting it.

Test Organization

The biggest win for us was definitely being able to reorganize our test suite. This took a bit of work, and I’ll talk about this with some of the plugins we whipped up to solve the problems. We ended up with a nice extensible test structure, similar to Django’s own test suite:

tests/
tests/db/
tests/db/connections/
tests/db/connections/redis/
tests/db/connections/redis/__init__.py
tests/db/connections/redis/models.py
tests/db/connections/redis/tests.py

We retained the ability to keep tests within the common app/tests convention, but we found that we were just stuffing too many tests into obscure application paths that it became unmaintainable after a while.

Unittest Compatibility

The first issue we hit was with test discovery. Nose has a pretty good default pattern for finding tests, but it had some behavior that didn’t quite fit with all of our existing code. Mostly, it found random functions that were prefixed with test_, or things like start_test_server which weren’t tests by themselves.

After digging a bit into the API, it turned out to be a pretty easy problem to solve, and we came up with the following plugin:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
class UnitTestPlugin(object):
    """
    Enables unittest compatibility mode (dont test functions, only TestCase
    subclasses, and only methods that start with [Tt]est).
    """
    enabled = True

    def wantClass(self, cls):
        if not issubclass(cls, unittest.TestCase):
            return False

    def wantMethod(self, method):
        if not issubclass(method.im_class, unittest.TestCase):
            return False
        if not method.__name__.lower().startswith('test'):
            return False

    def wantFunction(self, function):
        return False

Test Case Selection

To ensure compatibility with our previous unittest extensions, we needed a simple way to filter only selenium tests. We do this with the –selenium and –exclude-selenium flags.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
from disqus.tests.testcases import DisqusSeleniumTest
from nose.plugins.base import Plugin

class SeleniumSelector(Plugin):
    def options(self, parser, env):
        parser.add_option("--exclude-selenium",
                          dest="selenium", action="store_false",
                          default=None)
        parser.add_option("--selenium",
                          dest="selenium", action="store_true",
                          default=None)

    def configure(self, options, config):
        self.selenium = options.selenium
        self.enabled = options.selenium is not None

    def wantClass(self, cls):
        if self.selenium:
            return issubclass(cls, DisqusSeleniumTest)
        elif issubclass(cls, DisqusSeleniumTest):
            return False

Bisecting Tests

One feature I always thought was pretty useful in the Django test suite was their --bisect flag. Basically, given your test suite, and a failing test, it could help you find failures which were related to executing tests in say a specific order. This isn’t actually made available to normal Django applications, but being a large codebase it’s extremely useful for us.

I should note, this one adapted from Django and is very rough. It doesn’t report a proper TestResult, but it’s pretty close to where we want to get it.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
class _EmptyClass(object):
    pass

def make_bisect_runner(parent, bisect_label):
    def split_tests(test_labels):
        """
        Split tests in half, but keep children together.
        """
        chunked_tests = defaultdict(list)
        for test_label in test_labels:
            cls_path = test_label.rsplit('.', 1)[0]
            # filter out our bisected test
            if test_label.startswith(bisect_label):
                continue
            chunked_tests[cls_path].append(test_label)

        chunk_a = []
        chunk_b = []
        midpoint = len(chunked_tests) / 2
        for n, cls_path in enumerate(chunked_tests):
            if n < midpoint:
                chunk_a.extend(chunked_tests[cls_path])
            else:
                chunk_b.extend(chunked_tests[cls_path])
        return chunk_a, chunk_b

    class BisectTestRunner(parent.__class__):
        """
        Based on Django 1.3's bisect_tests, recursively splits all tests that are discovered
        into a bisect grid, grouped by their parent TestCase.
        """
        # TODO: potentially break things down further than class level based on whats happening
        # TODO: the way we determine "stop" might need some improvement
        def run(self, test):
            # find all test_labels grouped by base class
            test_labels = []
            context_list = list(test._tests)
            while context_list:
                context = context_list.pop()
                if isinstance(context, unittest.TestCase):
                    test = context.test
                    test_labels.append('%s:%s.%s' % (test.__class__.__module__, test.__class__.__name__,
                                                     test._testMethodName))
                else:
                    context_list.extend(context)

            subprocess_args = [sys.executable, sys.argv[0]] + [x for x in sys.argv[1:] if (x.startswith('-') and not x.startswith('--bisect'))]
            iteration = 1
            result = self._makeResult()
            test_labels_a, test_labels_b = [], []
            while True:
                chunk_a, chunk_b = split_tests(test_labels)
                if test_labels_a[:-1] == chunk_a and test_labels_b[:-1] == chunk_b:
                    print "Failure found somewhere in", test_labels_a + test_labels_b
                    break

                test_labels_a = chunk_a + [bisect_label]
                test_labels_b = chunk_b + [bisect_label]
                print '***** Pass %da: Running the first half of the test suite' % iteration
                print '***** Test labels:',' '.join(test_labels_a)
                failures_a = subprocess.call(subprocess_args + test_labels_a)

                print '***** Pass %db: Running the second half of the test suite' % iteration
                print '***** Test labels:',' '.join(test_labels_b)
                print
                failures_b = subprocess.call(subprocess_args + test_labels_b)

                if failures_a and not failures_b:
                    print "***** Problem found in first half. Bisecting again..."
                    iteration = iteration + 1
                    test_labels = test_labels_a[:-1]
                elif failures_b and not failures_a:
                    print "***** Problem found in second half. Bisecting again..."
                    iteration = iteration + 1
                    test_labels = test_labels_b[:-1]
                elif failures_a and failures_b:
                    print "***** Multiple sources of failure found"
                    print "***** test labels were:", test_labels_a[:-1] + test_labels_b[:-1]
                    result.addError(test, (Exception, 'Failures found in multiple sets: %s and %s' % (test_labels_a[:-1], test_labels_b[:-1]), None))
                    break
                else:
                    print "***** No source of failure found..."
                    break
            return result

    inst = _EmptyClass()
    inst.__class__ = BisectTestRunner
    inst.__dict__.update(parent.__dict__)
    return inst

class BisectTests(Plugin):
    def options(self, parser, env):
        parser.add_option("--bisect", dest="bisect_label", default=False)

    def configure(self, options, config):
        self.enabled = bool(options.bisect_label)
        self.bisect_label = options.bisect_label

    def prepareTestRunner(self, test):
        return make_bisect_runner(test, self.bisect_label)

Improvements to django-nose

Finally I wanted to talk about some of the things that we’ve been pushing back upstream. The first was support for discovery of models that were in non-app tests. This works the same way as Django in that it looks for appname/models.py, and if it’s found, it adds it to the INSTALLED_APPS automatically.

The second addition we’ve been working on allows you to run selective tests that dont require the database, and avoids actually building the database. It does this by looking for classes which inherit from TransactionTestCase, and if none are found, it skips database creation.

I’m curious to here what others have for tips and tricks regarding Nose (or maybe just helpful strategies in your own test runner).

Python and OS X Lion

Just a few quick tips that I’ve had to run through and discover today while upgrading to Lion.

Start by installing Xcode 4, which is available via the App Store (for free now). This will fix your missing distutils package (which probably fixes a majority of your issues). You’ll also need to reinstall all global site-packages, such as pip or virtualenvwrapper.

The last one, which was luckily solved for me already, was hitting [Errno 32] Broken pipe on various things. One example was this:

  File "/Users/dcramer/.virtualenvs/disqus/lib/python2.6/site-packages/compress/utils.py", line 145, in filter_js
    return filter_common(js, verbosity, filters=settings.COMPRESS_JS_FILTERS, attr='filter_js', separator='', signal=js_filtered)
  File "/Users/dcramer/.virtualenvs/disqus/lib/python2.6/site-packages/compress/utils.py", line 136, in filter_common
    output = getattr(get_class(f)(verbose=(verbosity >= 2)), attr)(output)
  File "/Users/dcramer/.virtualenvs/disqus/lib/python2.6/site-packages/compress/filters/yui/__init__.py", line 41, in filter_js
    return self.filter_common(js, 'js', JS_ARGUMENTS)
  File "/Users/dcramer/.virtualenvs/disqus/lib/python2.6/site-packages/compress/filters/yui/__init__.py", line 20, in filter_common
    p.stdin.write(content)
TemplateSyntaxError: Caught IOError while rendering: [Errno 32] Broken pipe

It turns out that with Xcode 4 there were some changes to the way (something that I dont care about) is handled. To solve this, add the following to your .profile:

export ARCHFLAGS='-arch i386 -arch x86_64'

If you rely on sshuttle be warned, it doesn’t work currently on OS X Lion.

I’ll update this post if I hit any more issues, but so far everything else seems to be running smoothly.

EuroPython

This last week I’ve been attending EuroPython over here in Firenze (or as we Americans know it, Florence), Italy. It’s been a pretty amazing time, visiting the beautiful city, putting faces to names, and seeing some great presentations. More importantly, and the main reason for my trip, was the two talks that I delivered here this week.

The first was on Tuesday morning, titled ”Building Scalable Web Apps”. I tried to show how one might solve some problems building a real web app, so we built a little bit of a backend for a Twitter-like stream.

I gave a second talk on Wednesday morning, ”Pitfalls of Continuous Deployment”, which talks a little bit about the lessons we’ve learned during adoption of CD, as well as the value of integration and reporting systems

Creating a Read-only Mirror for Your GitHub Server

Recently we’ve been transitioning our git repositories to GitHub. We chose to go this route for a variety of reasons, but mostly because they have kickass pull requests, which we’re going to test run as code reviews. However, one of the requirements of this process was that our original git-server still remain functional, in at least a read-only state. This saves us the time of having to update deploy and other scripts which read from this mirror and perform various tasks.

I was a bit surprised when I originally searched around for this, as I was either failing horribly at Google (granted, my queries were “how to setup git-server mirror”), or there just wasn’t much information out there on it. After a bit of crawling I found what seems to be a pretty easy way to get the behavior we wanted. For a recap, here’s a checklist of what we needed:

  • Read-only git server
  • One-way mirror from our new server to the legacy server.
  • Mirror all branches
  • Updated near real-time

So, given this, we created a simple bash script that runs on a 1 minute cron timer (it’s as close to real-time as we needed):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
#!/bin/bash

mkdir -p  /var/git/mirrors/

cd /var/git/mirrors/

# clone our newly acquired GitHub mirror
git clone --mirror git@github.com:organization/repo-name.git

cd disqus.git

# Add our local remote
git remote add local /var/git/repositories/repo-name.git

# Unsure if we need to fetch from local, but let's do it anyways
git fetch origin
git fetch local

# push all changes to local using --mirror (ensures all refs in remotes are pushed)
git push local --mirror

Since we were already using gitosis for permissions, it was easy for us to deprecate the legacy repo by simply moving everyone into a readable group that lacks write privileges.

Would love to hear some feedback from avid git users if there’s a better way to do this.

Setting Up Your Own PyPi Server

Ever had problems with PyPi being unreachable? Dislike dealing with requirement.txt files just to support a git repository? For a low low price of FREE, and an hour of labor, get your very own PyPi server and solve all of your worries!

Set up Chishop

We’re going to jump right into this one. Start by setting up Chishop. Currently the best way is to do so using the DISQUS fork as it contains several fixes. Expect to see all of the hard work in the various forks merged upstream as soon as we get some proper docs going. Follow the instructions in the README to configure Chishop, and your PyPi index.

Now you’re going to want to tweak some things that are on by default. For starters, you’re probably going to want to proxy the official PyPi repository, and this can be done by enabling a simple flag in your newly created settings.py:

1
DJANGOPYPI_PROXY_MISSING = True

There are many other configuration options, but you’re going to have to read the source for those.

Configure PIP/Setuptools/Buildout

Now that you’ve got a sexy PyPi server up and running, you’ll probably want to configure the default index locations for your package managers. It took me a bit of Googling but then I stumpled upon an awesome post by Jacob Kaplan-Moss about dealing with PyPi when it goes down, which describes procedures for configuring PyPi mirrors.

Let’s start with pip, which stores its configuration in ~/.pip/pip.conf:

[global]
index-url = http://my.chishop/simple

Next up, setuptools, located in ~/.pydistutils.cfg:

[easy_install]
index_url = http://my.chishop/simple

And finally, if you use buildout, tweak your buildout.cfg:

[buildout]
index = http://my.chishop/simple

Use It

Now that you have a fully functioning PyPi, kill off your requirements files and build a real setup.py. Hopefully as a bit of inspiration, here’s a snippet from Sentry’s:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
#!/usr/bin/env python

try:
    from setuptools import setup, find_packages
except ImportError:
    from ez_setup import use_setuptools
    use_setuptools()
    from setuptools import setup, find_packages

tests_require = [
    'django',
    'django-celery',
    'south',
    'django-haystack',
    'whoosh',
]

setup(
    name='django-sentry',
    version='1.6.8.1',
    author='David Cramer',
    author_email='dcramer@gmail.com',
    url='http://github.com/dcramer/django-sentry',
    description = 'Exception Logging to a Database in Django',
    packages=find_packages(exclude="example_project"),
    zip_safe=False,
    install_requires=[
        'django-paging>=0.2.2',
        'django-indexer==0.2.1',
        'uuid',
    ],
    dependency_links=[
        'https://github.com/disqus/django-haystack/tarball/master#egg=django-haystack',
    ],
    tests_require=tests_require,
    extras_require={'test': tests_require},
    test_suite='sentry.runtests.runtests',
    include_package_data=True,
    classifiers=[
        'Framework :: Django',
        'Intended Audience :: Developers',
        'Intended Audience :: System Administrators',
        'Operating System :: OS Independent',
        'Topic :: Software Development'
    ],
)

Building Cursors for the Disqus API

This last week we’ve been implementing cursors for the Disqus API (3.0). If you’re not familiar, the concept is like cursors in your database: create a marker for where you are with your result set so you can iterate through a large set of results efficiently. Think of it like a snapshot. A marker that lets us retrieve the results you were previously looking for, and return a subset of those results.

LIMIT/OFFSET is Bad

One of the big questions I’ve seen come up, is “Why not just use LIMIT and OFFSET?” To answer this, you must understand how LIMIT/OFFSET actually works. For this we’ll use your typical database example. You come in, request all results that rhyme with RICK, and there are approximately 1000 results. You first ask it for the first 100, which is very easy, as it can yield one row as it gets it, which means it just returns the first 100 rows that match the result set. Fast forward, and now you are asking it for rows 900-1000. The database now must iterate through the first 900 results before it can start returning a row (since it doesnt have a pointer to tell it how to get to result 900). In summary, LIMIT/OFFSET is VERY slow on large result sets.

Range Selectors

The typical solution to avoiding the above pattern is to switch to range selectors. Using some kind of index, you tell it exactly where you need to start and stop. Using the above example, we would say “I want RICK results that have an ID greater than 900 and less than 1000”, which will get you approximately the same thing. With this solution, however, you have to worry about gaps in your ranges. The result set, 900 to 1000, could have anywhere between 0 and 100 rows, which isn’t what you really want.

Non-Unique Ranges

There is one final thing we had to take into account when designing our cursors. We use them for both timestamp and incremental ID sorting (ideally timestamp-only), which works great, but presents the problem of conflicts. It’s very unlikely that two sets of data will have the exact datetime (down to the microsecond), but it happens, especially on very large data sets (like ours). To combat this, we have to actually combine range offsets with row offsets.

id  | timestamp         | title
-------------------------------
1   | 1299061169.043267 | foo
2   | 1299061169.043267 | bar
3   | 1299061170.034193 | baz

Combining Selectors

Our final result consists of generating range offsets with row offsets. We start by generating the absolute highest range identifier we can from a result set (typically the last row in the result), and then we append a row offset on to this (usually 0). In the case where the last row is identical to one or more rows (from end to start) we just increment this offset number. The resulting database logic turns into something like SELECT FROM posts WHERE timestamp > 2012-10-12T08:12:56.34153 LIMIT 50 OFFSET 5. Remember, the key here is that the “timestamp” value we’re sending is continually changing as we paginate through the cursor, which allows us to keep these queries very efficient.

I should note, that we also had to deal with doing the opposite operation of paginating forward, being the obvious “previous results”. This had its own set of problems that we basically had to reverse all of our operations. Given that we are at the cursor we see above, we need to generate a “previous cursor” object. To do this, we just take the first row in the series (again, doing the same offset calculations), and set a directional flag. The result is almost more documentation than code, just because of how complicated the logic can appear.

The end result of our cursors in the API, looks a little bit like this:

1
2
3
4
5
6
7
    "cursor": {
        "prev": "1299061169043267:0:1",
        "hasNext": true,
        "next": "1299061158809627:0:0",
        "hasPrev": true,
        "total": null,
      },

The logic is a bit fuzzy, and we have to do some best guesses in places (such as determining if there is actually a valid previous cursor), but the database queries end up about as efficient as we can hope for. We end up with (N results + 1) rows when we’re paginating forward, and (N results + 2) when pulling up previous cursors. To avoid confusion, this is literally one query for every request, period. There’s no additional overhead for doing counts or determining what your next or previous cursors are. That’s one optimized SQL statement to fetch your results, calculate your next, and previous cursors.

Since I feel bad for not leaving you all with much code, check out some of the database utilities that we use at Disqus to make life with Django QuerySets a bit easier.

Using OS X Media Keys in Rdio

Update: Use Fluid, with this awesome icon by Wilson Miner, and name it “Rdio Desktop” and follow these same instructions for a much better experience. You will also need to edit the macros and change Next/Previous track to use arrow keys instead of ctrl+arrow.

I’m not going into much details, as it’s been a long frustrating day dealing with a number of things today, but I wanted to share how I managd to actually get Rdio Desktop to not suck (well, more like make it bearable). What do I mean by this? After many hours I discovered how I could remap the media keys on my OS X keyboard to work with Rdio Desktop. Ugh.

Remapping media keys to function keys

To get us started, you’ll want to install KeyRemap4MacBook will allow you to use your existing function keys, and simple swap the media keys so that they actually send normal function keypresses. This is needed because Apple doesn’t feel it nescesary to allow you to remap them otherwise.

Pop it open and search for media. Tick the box next to whichever setting applies to your keyboard.

Creating macros for Rdio Desktop

Now that we can actually bind go the media keys, you’re going to need to create some macros to work with the Air app. Why do you have to do this? Because Adobe Air is a crappy framework that no one should ever build apps with. To do this you’re going to need to install KeyboardMaestro. Now while I had to follow a guide to creating these macros, I find that a pretty big waste of time. So save yourself some time, and download and run the macros I created via the aforementioned guide over at DropBox, and you’re good to go.

Complain to developers

Hopefully you found this guide very quickly and didn’t waste time digging for solutions. However, many people didn’t have it so easy. I encourage you to complain to any developer you ever meet who thinks its a good idea to build an Adobe {Air,Flash,Anything else that sucks} application and explain to them how much hell they put people through.