David Cramer's Blog

Creating a Read-only Mirror for Your GitHub Server

Recently we’ve been transitioning our git repositories to GitHub. We chose to go this route for a variety of reasons, but mostly because they have kickass pull requests, which we’re going to test run as code reviews. However, one of the requirements of this process was that our original git server remain functional, at least in a read-only state. This saves us the time of having to update deploy scripts and the other scripts which read from this mirror and perform various tasks.

I was a bit surprised when I originally searched around for this, as I was either failing horribly at Google (granted, my queries were “how to setup git-server mirror”), or there just wasn’t much information out there on it. After a bit of crawling I found what seems to be a pretty easy way to get the behavior we wanted. For a recap, here’s a checklist of what we needed:

  • Read-only git server
  • One-way mirror from our new server to the legacy server
  • Mirror all branches
  • Updated near real-time

So, given this, we created a simple bash script that runs on a one-minute cron timer (it’s as close to real-time as we needed):

#!/bin/bash

mkdir -p /var/git/mirrors/

cd /var/git/mirrors/

# clone our newly acquired GitHub mirror (a no-op once it already exists)
if [ ! -d repo-name.git ]; then
    git clone --mirror git@github.com:organization/repo-name.git
fi

cd repo-name.git

# Add our local remote (errors harmlessly on runs after the first)
git remote add local /var/git/repositories/repo-name.git

# Unsure if we need to fetch from local, but let's do it anyways
git fetch origin
git fetch local

# push all changes to local using --mirror (ensures all refs in remotes are pushed)
git push local --mirror

Since we were already using gitosis for permissions, it was easy for us to deprecate the legacy repo by simply moving everyone into a readable group that lacks write privileges.
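If you’re also on gitosis, the change amounts to a few lines in gitosis.conf. Something like this sketch should do it (the group and member names are made up; double-check the readonly directive against your gitosis version):

[group developers]
members = alice bob
readonly = repo-name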

Would love to hear some feedback from avid git users if there’s a better way to do this.

Setting Up Your Own PyPi Server

Ever had problems with PyPi being unreachable? Dislike dealing with requirements.txt files just to support a git repository? For the low, low price of FREE, and an hour of labor, get your very own PyPi server and solve all of your worries!

Set up Chishop

We’re going to jump right into this one. Start by setting up Chishop. Currently the best way is to do so using the DISQUS fork as it contains several fixes. Expect to see all of the hard work in the various forks merged upstream as soon as we get some proper docs going. Follow the instructions in the README to configure Chishop, and your PyPi index.

Now you’re going to want to tweak some of the default settings. For starters, you’re probably going to want to proxy the official PyPi repository, and this can be done by enabling a simple flag in your newly created settings.py:

DJANGOPYPI_PROXY_MISSING = True

There are many other configuration options, but you’re going to have to read the source for those.

Configure PIP/Setuptools/Buildout

Now that you’ve got a sexy PyPi server up and running, you’ll probably want to configure the default index locations for your package managers. It took me a bit of Googling, but then I stumbled upon an awesome post by Jacob Kaplan-Moss about dealing with PyPi when it goes down, which describes procedures for configuring PyPi mirrors.

Let’s start with pip, which stores its configuration in ~/.pip/pip.conf:

[global]
index-url = http://my.chishop/simple

Next up, setuptools, located in ~/.pydistutils.cfg:

[easy_install]
index_url = http://my.chishop/simple

And finally, if you use buildout, tweak your buildout.cfg:

[buildout]
index = http://my.chishop/simple
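If you also want to publish your own packages to the new index, a ~/.pypirc along these lines should work with stock distutils (the index name, URL, and credentials here are placeholders):

[distutils]
index-servers =
    chishop

[chishop]
repository: http://my.chishop
username: myuser
password: mypassword

After that, publishing is just python setup.py register sdist upload -r chishop.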

Use It

Now that you have a fully functioning PyPi, kill off your requirements files and build a real setup.py. Hopefully as a bit of inspiration, here’s a snippet from Sentry’s:

#!/usr/bin/env python

try:
    from setuptools import setup, find_packages
except ImportError:
    from ez_setup import use_setuptools
    use_setuptools()
    from setuptools import setup, find_packages

tests_require = [
    'django',
    'django-celery',
    'south',
    'django-haystack',
    'whoosh',
]

setup(
    name='django-sentry',
    version='1.6.8.1',
    author='David Cramer',
    author_email='dcramer@gmail.com',
    url='http://github.com/dcramer/django-sentry',
    description='Exception Logging to a Database in Django',
    packages=find_packages(exclude=['example_project']),
    zip_safe=False,
    install_requires=[
        'django-paging>=0.2.2',
        'django-indexer==0.2.1',
        'uuid',
    ],
    dependency_links=[
        'https://github.com/disqus/django-haystack/tarball/master#egg=django-haystack',
    ],
    tests_require=tests_require,
    extras_require={'test': tests_require},
    test_suite='sentry.runtests.runtests',
    include_package_data=True,
    classifiers=[
        'Framework :: Django',
        'Intended Audience :: Developers',
        'Intended Audience :: System Administrators',
        'Operating System :: OS Independent',
        'Topic :: Software Development'
    ],
)

Building Cursors for the Disqus API

This last week we’ve been implementing cursors for the Disqus API (3.0). If you’re not familiar, the concept is like cursors in your database: you create a marker for where you are with your result set so you can iterate through a large set of results efficiently. Think of it like a snapshot: a marker that lets us retrieve the results you were previously looking at, and return a subset of those results.

LIMIT/OFFSET is Bad

One of the big questions I’ve seen come up is “Why not just use LIMIT and OFFSET?” To answer this, you must understand how LIMIT/OFFSET actually works. For this we’ll use your typical database example. You come in and request all results that rhyme with RICK, and there are approximately 1000 results. You first ask for the first 100, which is very easy, as the database can yield each row as it gets it, meaning it just returns the first 100 rows that match the result set. Fast forward, and now you’re asking for rows 900-1000. The database must now iterate through the first 900 results before it can start returning a row (since it doesn’t have a pointer to tell it how to get to result 900). In summary, LIMIT/OFFSET is VERY slow on large result sets.

Range Selectors

The typical solution for avoiding the above pattern is to switch to range selectors. Using some kind of index, you tell the database exactly where you need to start and stop. Using the above example, we would say “I want RICK results that have an ID greater than 900 and less than 1000”, which will get you approximately the same thing. With this solution, however, you have to worry about gaps in your ranges. The result set, 900 to 1000, could have anywhere between 0 and 100 rows, which isn’t what you really want.
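To make the difference concrete, here’s a rough sketch of both patterns in Django ORM terms (Post and its fields are hypothetical, and only illustrate the shape of the queries):

# assuming a hypothetical Django model Post with id and title fields

# OFFSET-style pagination: the database walks past the first 900
# matching rows before it can return anything
page = Post.objects.filter(title__icontains='rick')[900:1000]

# Range-selector pagination: an indexed scan starting exactly where the
# last page ended, but it may return anywhere from 0 to 100 rows if
# there are gaps in the ID sequence
page = Post.objects.filter(title__icontains='rick', id__gt=900, id__lte=1000)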

Non-Unique Ranges

There is one final thing we had to take into account when designing our cursors. We use them for both timestamp and incremental ID sorting (ideally timestamp-only), which works great, but presents the problem of conflicts. It’s very unlikely that two rows will have the exact same datetime (down to the microsecond), but it happens, especially on very large data sets (like ours). To combat this, we have to combine range offsets with row offsets.

id  | timestamp         | title
-------------------------------
1   | 1299061169.043267 | foo
2   | 1299061169.043267 | bar
3   | 1299061170.034193 | baz

Combining Selectors

Our final result consists of generating range offsets combined with row offsets. We start by generating the highest range identifier we can from a result set (typically the last row in the result), and then we append a row offset to it (usually 0). In the case where the last row is identical to one or more rows before it (counting from end to start), we just increment this offset number. The resulting database logic turns into something like SELECT * FROM posts WHERE timestamp > '2012-10-12T08:12:56.34153' LIMIT 50 OFFSET 5. Remember, the key here is that the “timestamp” value we’re sending is continually changing as we paginate through the cursor, which allows us to keep these queries very efficient.

I should note that we also had to deal with the operation opposite of paginating forward: the obvious “previous results”. This had its own set of problems, in that we basically had to reverse all of our operations. Given that we’re at the cursor we see above, we need to generate a “previous cursor” object. To do this, we just take the first row in the series (again, doing the same offset calculations) and set a directional flag. The result is almost more documentation than code, just because of how complicated the logic can appear.
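To give a rough idea of the forward case, here’s a heavily simplified sketch (all names are made up; the real implementation handles many more edge cases, and the real cursor string also carries the directional flag described above):

def build_next_cursor(rows, limit):
    # Fetch limit + 1 rows so we know whether another page exists.
    has_next = len(rows) > limit
    rows = rows[:limit]
    if not rows:
        return None, has_next

    # The range offset is the highest identifier in the page.
    last = rows[-1]

    # The row offset counts rows (from end to start) sharing that
    # identifier, so duplicates are skipped by the next query's OFFSET.
    offset = 0
    for row in reversed(rows[:-1]):
        if row.timestamp != last.timestamp:
            break
        offset += 1

    return '%s:%s' % (last.timestamp, offset), has_next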

The end result of our cursors in the API, looks a little bit like this:

    "cursor": {
        "prev": "1299061169043267:0:1",
        "hasNext": true,
        "next": "1299061158809627:0:0",
        "hasPrev": true,
        "total": null,
      },

The logic is a bit fuzzy, and we have to make some best guesses in places (such as determining whether there is actually a valid previous cursor), but the database queries end up about as efficient as we can hope for. We end up with (N results + 1) rows when we’re paginating forward, and (N results + 2) when pulling up previous cursors. To avoid confusion, this is literally one query for every request, period. There’s no additional overhead for doing counts or determining what your next or previous cursors are. That’s one optimized SQL statement to fetch your results and calculate your next and previous cursors.

Since I feel bad for not leaving you all with much code, check out some of the database utilities that we use at Disqus to make life with Django QuerySets a bit easier.

Using OS X Media Keys in Rdio

Update: Use Fluid, with this awesome icon by Wilson Miner, and name it “Rdio Desktop” and follow these same instructions for a much better experience. You will also need to edit the macros and change Next/Previous track to use arrow keys instead of ctrl+arrow.

I’m not going into much detail, as it’s been a long, frustrating day dealing with a number of things, but I wanted to share how I managed to actually get Rdio Desktop to not suck (well, more like make it bearable). What do I mean by this? After many hours I discovered how I could remap the media keys on my OS X keyboard to work with Rdio Desktop. Ugh.

Remapping media keys to function keys

To get us started, you’ll want to install KeyRemap4MacBook, which will allow you to keep your existing function keys and simply swap the media keys so that they actually send normal function keypresses. This is needed because Apple doesn’t feel it necessary to allow you to remap them otherwise.

Pop it open and search for media. Tick the box next to whichever setting applies to your keyboard.

Creating macros for Rdio Desktop

Now that we can actually bind to the media keys, you’re going to need to create some macros to work with the Air app. Why do you have to do this? Because Adobe Air is a crappy framework that no one should ever build apps with. To do this you’re going to need to install KeyboardMaestro. Now while I had to follow a guide to creating these macros, I find that a pretty big waste of time. So save yourself some time, and download and run the macros I created via the aforementioned guide, hosted over at Dropbox, and you’re good to go.

Complain to developers

Hopefully you found this guide very quickly and didn’t waste time digging for solutions. However, many people didn’t have it so easy. I encourage you to complain to any developer you ever meet who thinks it’s a good idea to build an Adobe {Air,Flash,Anything else that sucks} application, and explain to them how much hell they put people through.

Error Tracing in Sentry

A few weeks ago we pushed out an update to Sentry, bumping its version to 1.6.0. Among the changes was a new “Sentry ID” value which is created by the client, rather than relying on the server. This seems insignificant, but it allows you to do something very powerful: trace errors from the customer or developer down to the precise request and log entry.

Exposing Sentry ID

The new IDs are generated automatically when a message is processed (by the client), so you won’t need to make any changes on that end. Likely, however, you’re going to want to expose these at your application level for a couple of different reasons. The first one we’re going to cover is your customer’s experience.

The easiest way to expose this information in a useful manner is by creating a modified 500.html. In DISQUS’ case, we mention the error reference ID to the end-user, so that when they’re reporting a problem they can pass along this information.

Create a custom 500 handler

The first thing you’re going to need to do is create a custom 500 handler. This is defined in urls.py, so we’re just going to go ahead and create the view in-place.

def handler500(request):
    """
    An error handler which exposes the request object to the error template.
    """
    from django.template import Context, loader
    from django.http import HttpResponseServerError
    from disqus.context_processors import default
    import logging
    import sys
    try:
        context = default(request)
    except Exception, e:
        logging.error(e, exc_info=sys.exc_info(), extra={'request': request})
        context = {}

    context['request'] = request

    t = loader.get_template('500.html') # You need to create a 500.html template.
    return HttpResponseServerError(t.render(Context(context)))

We’re exposing the request object to our 500.html in the above. Keep in mind that doing this allows you to add some logic into your template, and you’re going to need to be very careful that this logic can’t raise a new exception.

Tweaking your 500.html

The next thing you’ll need to do is to tweak your 500.html template to actually show the Sentry ID. Assuming the request object was passed into Sentry, it will attach the last error seen under request.sentry['id']. Given this, we can easily report it to the end-user in our template:

<p>The Disqus team has been alerted and we're on the case. For more information, check out <a href="http://status.disqus.com">Disqus Status »</a></p>
{% if request.sentry.id %}
    <p>If you need assistance, you may reference this error as <strong>{{ request.sentry.id }}</strong>.</p>
{% endif %}

Sentry ID as a response header

The other quick solution to get access to this variable is simply by enabling an included response middleware, SentryResponseErrorIdMiddleware. Just pop open your settings.py and append it to your MIDDLEWARE_CLASSES:

MIDDLEWARE_CLASSES = (
    ...,
    'sentry.client.middleware.SentryResponseErrorIdMiddleware',
)

Now if you check your response headers after hitting an error, you should see X-Sentry-ID.
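A quick way to verify it’s working (a sketch; swap in any URL on your site that raises an exception):

import urllib2

try:
    urllib2.urlopen('http://example.com/some-broken-page/')
except urllib2.HTTPError, e:
    # The middleware attaches the ID to the error response's headers
    print e.info().get('X-Sentry-ID')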

Find errors by ID

Sentry makes it very easy to pull up error messages by ID. The one requirement is that sentry.filters.SearchFilter is included within SENTRY_FILTERS (it’s enabled by default). Once done, Sentry will detect when you enter a UUID hex value (the Sentry ID) in the search box, and it will jump directly to that error’s page.

You’ll also notice that all messages are now tagged with their unique Sentry ID as well.

Settings in Django

I want to talk a bit about how we handle our large amounts of application configuration over at DISQUS. Every app has it, and it seems like there’s a hundred different ways you can manage it. While I’m not going to say ours is the best way, it has given us a very flexible application config across our varying situations.

Managing Local Settings

First off, we all know how Django does this by default: a simple settings.py file which is loaded at runtime. It works fairly well in very basic apps, until you start relying on a database, or some other configuration value which changes between production and development. Typically, once you’ve hit this, the first thing you do is add a local_settings.py. This generally is not kept in your VCS and contains any settings specific to your environment. To achieve this, you simply need to adjust your settings.py to include the following (at the end of the file, ideally):

try:
    from local_settings import *
except ImportError, e:
    print 'Unable to load local_settings.py:', e

Refactoring Settings

Now we’ve solved the very basic case, and this tends to get you quite a bit of breathing room. Eventually you may get to the point where you want some sort of globalized settings, generic development settings, or you just want to tweak settings based on their defaults. To achieve this we’re going to re-architect settings as a whole. For starters, let’s move everything into a conf module in your Python app. Try something like the following:

project/conf/__init__.py
project/conf/settings/__init__.py
project/conf/settings/default.py
project/conf/settings/dev.py

To make all this play nice, you’re going to want to shift all of your current settings.py code into project/conf/settings/default.py. This will give you a basis to work from, and allow you to easily inherit from it (think OO). Once this is moved, let’s refactor our new settings.py. Bear with me, as we’re going to throw a lot at you all at once now:

import os

## Import our defaults (globals)

from disqus.conf.settings.default import *

## Inherit from environment specifics

DJANGO_CONF = os.environ.get('DJANGO_CONF', 'default')
if DJANGO_CONF != 'default':
    module = __import__(DJANGO_CONF, globals(), locals(), ['*'])
    for k in dir(module):
        locals()[k] = getattr(module, k)

## Import local settings

try:
    from local_settings import *
except ImportError:
    import sys, traceback
    sys.stderr.write("Warning: Can't find the file 'local_settings.py' in the directory containing %r. It appears you've customized things.\nYou'll have to run django-admin.py, passing it your settings module.\n(If the file settings.py does indeed exist, it's causing an ImportError somehow.)\n" % __file__)
    sys.stderr.write("\nFor debugging purposes, the exception was:\n\n")
    traceback.print_exc()

## Remove disabled apps

if 'DISABLED_APPS' in locals():
    INSTALLED_APPS = [k for k in INSTALLED_APPS if k not in DISABLED_APPS]

    # Filter with list comprehensions; popping inside enumerate() would
    # skip the element following each removal
    for a in DISABLED_APPS:
        MIDDLEWARE_CLASSES = [m for m in MIDDLEWARE_CLASSES if not m.startswith(a)]
        TEMPLATE_CONTEXT_PROCESSORS = [m for m in TEMPLATE_CONTEXT_PROCESSORS if not m.startswith(a)]
        DATABASE_ROUTERS = [m for m in DATABASE_ROUTERS if not m.startswith(a)]

Let’s cover a bit of what we’ve achieved with our new settings.py. First, we’re inheriting from conf/settings/default.py, followed by the ability to specify an additional set of overrides using the DJANGO_CONF environment variable (this works much like DJANGO_SETTINGS_MODULE). Next, we’re again pulling in our local_settings.py, and finally, we’re pulling in a setting called DISABLED_APPS. This final piece lets us (even within local_settings) specify applications which should be disabled in our environment. We found it useful to pull things like Sentry out of our tests and development environments.
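For example, booting a server against the dev settings described below is just a matter of setting the environment variable (the module path follows this post’s example layout):

DJANGO_CONF=project.conf.settings.dev python manage.py runserver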

Improving Local Settings

Now that we’ve got a nice basic setup for our application configuration, let’s talk about a few other nice-to-haves that we can pull off with this. Remember how we mentioned it would be nice to inherit from defaults, even in local settings? Well now you can do this, as your settings are stored elsewhere (likely in default.py). Take this piece of code as an example:

from project.conf.settings.dev import *

# See the above file for various settings which you shouldn't need to modify :)
# Adjust them by placing the new values in this file

# enable solr
SOLR_ENABLED = True

# disable sentry
DISABLED_APPS = ['sentry']

We also recommend taking your local_settings.py and making a copy as example_local_settings.py within your repository.

Development Settings

You’ll see we recommended a dev.py settings module above, and we reference it again here in our local_settings.py. Taking some examples of how we achieve a standardized setup at DISQUS, here’s something to get you started:

# Development environment settings

from project.conf.settings.default import *

import getpass

TEMPLATE_LOADERS = (
    # Remove cached template loader
    'django.template.loaders.filesystem.Loader',
    'django.template.loaders.app_directories.Loader',
)

DISABLED_APPS = ['sentry.client', 'sentry']

DEBUG = True

DATABASE_PREFIX = ''
DATABASE_USER = getpass.getuser()
DATABASE_PASSWORD = ''
DATABASE_HOST = ''
DATABASE_PORT = None

for k, v in DATABASES.iteritems():
    DATABASES[k].update({
        'NAME': DATABASE_PREFIX + v['NAME'],
        'HOST': DATABASE_HOST,
        'PORT': DATABASE_PORT,
        'USER': DATABASE_USER,
        'PASSWORD': DATABASE_PASSWORD,
        'OPTIONS': {
            'autocommit': False
        }
    })

# django-devserver: http://github.com/dcramer/django-devserver
try:
    import devserver
except ImportError:
    pass
else:
    INSTALLED_APPS = INSTALLED_APPS + (
        'devserver',
    )
    DEVSERVER_IGNORED_PREFIXES = ['/media', '/uploads']
    DEVSERVER_MODULES = (
        # 'devserver.modules.sql.SQLRealTimeModule',
        # 'devserver.modules.sql.SQLSummaryModule',
        # 'devserver.modules.profile.ProfileSummaryModule',
        # 'devserver.modules.request.SessionInfoModule',
        # 'devserver.modules.profile.MemoryUseModule',
        # 'devserver.modules.profile.LeftOversModule',
        # 'devserver.modules.cache.CacheSummaryModule',
    )


INSTALLED_APPS = (
    'south',
) + INSTALLED_APPS

MIDDLEWARE_CLASSES = MIDDLEWARE_CLASSES + (
    'disqus.middleware.profile.ProfileMiddleware',
)

CACHE_BACKEND = 'locmem://'

Hopefully this will save you as much time as it’s saved us. Simplifying settings like the above has made it so a new developer, or a new development machine, can be up and running with little to no change to the application configuration itself.

How to Actually Make LocalSolr Work

Today I’ve been working on integrating geospatial search with our upcoming DISQUS Search product, which happens to rely on Solr. It didn’t take much work before I stumbled upon LocalSolr, which seems to be the de facto GIS implementation. The docs were fairly brief, but it seemed easy to get up and running. It just so happens that it wasn’t that easy after all. Hoping that this helps someone else out, here’s my step-by-step to getting it set up (locally, at least):

First up, you’re going to need to grab the localsolr libraries in some form or another. Hidden obscurely behind a “Quick Start” link is a tgz of an example project. It’s much like the example project included with the actual Solr package, so it should be fairly straightforward. Once I had this, I pulled in my existing configuration to replace the example’s, and updated it per the docs.

The first set of changes needed to be made in solrconfig.xml. You’re going to need to add the localsolr component, and optionally the geofaceting component. You’ll also need to create a separate handler for geo searches (unless you plan to use longitude and latitude with every single search to Solr).

<searchComponent name="geofacet"
                 class="com.pjaol.search.solr.component.LocalSolrFacetComponent"/>

<searchComponent name="localsolr"
                 class="com.pjaol.search.solr.component.LocalSolrQueryComponent">
  <str name="latField">lat</str>
  <str name="lngField">lng</str>
</searchComponent>
<requestHandler name="geo" class="org.apache.solr.handler.component.SearchHandler">
  <arr name="components">
    <str>localsolr</str>
    <str>geofacet</str>
    <str>mlt</str>
  </arr>
</requestHandler>

Once done, you can move on to altering your schema.xml. It’s very important that, if you used the examples on the LocalSolr site and have already begun indexing, you obliterate your index completely, as it will contain invalid data. This presents itself with an ugly, misleading (at least to Python folk) error: Invalid shift value in prefixCoded string. It turns out that you actually need to use tdouble instead of sdouble on all field types. Don’t ask me why, as I don’t care to know. So, on to the schema changes:

<!-- local lucene field types - ensure these are tdouble! -->
<field name="lat" type="tdouble" indexed="true" stored="false" required="false"/>
<field name="lng" type="tdouble" indexed="true" stored="false" required="false"/>
<field name="geo_distance" type="tdouble" required="false"/>
<dynamicField name="_local*" type="tdouble" indexed="true" stored="false"/>

Now just reindex your data and enjoy. You’ll need to pass the qt parameter when searching, and set it to geo (or whatever you named your requestHandler above).
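For example, a geo query against the new handler looks something like this (the host and the lat/long/radius parameter names are as I remember them from the LocalSolr docs, so double-check against your version):

http://localhost:8983/solr/select?qt=geo&q=*:*&lat=37.77&long=-122.41&radius=10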

Database Routers in Django

Whether you’re doing master/slave or partitioning data, when your product gets large enough you’ll need the ability to route data to various nodes in your database. Django (as of 1.2) provides a pretty cool solution out of the box called a database router. Here at DISQUS we have a large set of data, and this of course brings the need to implement some of these fairly standard solutions.

The first solution that many companies will choose is a master/slave setup. This is the most common of all database scaling techniques and is very easy to set up in modern RDBMS solutions. In Django, this also comes easily, with a few lines of code:

class MasterSlaveRouter(object):
    "Sends reads to 'slave' and writes to 'default'."
    def db_for_write(self, model, **hints):
        return 'default'

    def db_for_read(self, model, **hints):
        return 'slave'

Now while this won’t scale very far (if you’re not using a proxy or bouncer, this is a single slave), it also brings a lot of other problems with it. The dreaded replication lag will hit you no matter your size (ever notice Facebook not being in “sync”?), and it can be fairly difficult to work around. I’m not going to dive into details here, but there are many ways to lessen the visibility of this delay, such as caching, as well as doing some of your reads off your master nodes.
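One common trick along those lines is pinning reads to the master once a write has happened. Here’s a minimal sketch (the threading.local bookkeeping is made up for illustration; a real version would reset the flag at the start of each request, e.g. via middleware):

import threading

_state = threading.local()

class StickyMasterRouter(object):
    "Sends reads to 'slave' until the first write, then sticks to 'default'."
    def db_for_write(self, model, **hints):
        # Remember that this thread wrote, so later reads see the write
        _state.wrote = True
        return 'default'

    def db_for_read(self, model, **hints):
        if getattr(_state, 'wrote', False):
            return 'default'
        return 'slave'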

The other solution I want to talk about is partitioning. We’re going to specifically talk about vertical partitioning, or the act of separating data by purpose. This is another solution that’s very easy to implement, as it just requires you to move tables to other servers. Again, in Django this is easy to do with routers:

class PartitionByApp(object):
    "Send reads to an app-specific alias, and writes to the 'default'."
    def db_for_write(self, model, **hints):
        return 'default'

    def db_for_read(self, model, **hints):
        return model._meta.app_label

We’re currently working on splitting off a fairly large set of data over here, so we whipped up a slightly more flexible solution using routers. Our needs were simple: assign an app (or a model) to a separate database cluster. Here’s what we came up with:

from django.conf import settings

class PrimaryRouter(object):
    _lookup_cache = {}

    default_read = None
    default_write = 'default'

    def get_db_config(self, model):
        "Returns the database configuration for `model`"
        if model not in self._lookup_cache:
            conf = settings.DATABASE_CONFIG['routing']

            app_label = model._meta.app_label
            module_name = model._meta.module_name
            module_label = '%s.%s' % (app_label, module_name)

            if module_label in conf:
                result = conf[module_label]
            elif app_label in conf:
                result = conf[app_label]
            else:
                result = {}
            self._lookup_cache[model] = result
        return self._lookup_cache[model]

    def db_for_read(self, model, **hints):
        db_config = self.get_db_config(model)
        return db_config.get('read', db_config.get('write', self.default_read))

    def db_for_write(self, model, **hints):
        db_config = self.get_db_config(model)
        return db_config.get('write', self.default_write)

    def allow_relation(self, obj1, obj2, **hints):
        # Only allow relations if the models are on the same database
        db_config_1 = self.get_db_config(obj1)
        db_config_2 = self.get_db_config(obj2)
        return db_config_1.get('write') == db_config_2.get('write')

    def allow_syncdb(self, db, model):
        db_config = self.get_db_config(model)
        allowed = db_config.get('syncdb')
        # defaults to both read and write servers
        if allowed is None:
            allowed = filter(None, [self.db_for_read(model),
                                    self.db_for_write(model)])
        if allowed:
            # FIX: TEST_MIRROR passes the mirrored alias, and not the originating
            for k in allowed:
                if db == k:
                    return True
                if db == (settings.DATABASES[k].get('TEST_MIRROR') or k):
                    return True
            return False

To use this, we simply define a key called routing in our DATABASE_CONFIG.

# Note: this isn't how we partition our models, it's just an example
DATABASE_CONFIG = {
    'routing': {
        # defaults for all models in forums
        'forums': {
            'write': 'default',
            'read': 'default.slave',
        },
        # override for forums.Forum
        'forums.forum': {
            'write': 'cluster2',
            'read': 'cluster2.slave',
        },
        # override for forums.Post
        'forums.post': {
            'write': 'default',
            'read': 'default.slave',
        },
    },
}

A future post will cover how we’ve started moving to a dictConfigurator to make inheritance in many of our settings much easier.

BitFields in Django

Today we’re releasing another heavily used component from the DISQUS code base, our BitField class. While not a true BIT field (it uses a BIGINT), it still allows you the convenience of accessing the values as if they were bit flags.

When I joined DISQUS about 7 months ago, we were using a Q-like object class to do checks against our BigIntegerFields. It worked fairly well, but was just too verbose. To add to that, we had a function which would attach callables to the instance for each flag. This let us do things like instance.FLAG_NAME() to check if a flag was set, and instance.FLAG_NAME(True) to set it. This worked well, but, like many things, we wanted to improve on it.

So we ended up building out BitField. We modeled it off of the concept of a simple attribute key store. The idea was to keep it dead simple to add flags, but also allow easy access and querying on those flags. A complete guide is available on the GitHub project page, so we’re just going to highlight usage of it.

First things first, defining your BitField. All you have to do is pass it a list of keys as the flags kwarg:

from django.db import models

from bitfield import BitField

class MyModel(models.Model):
    flags = BitField(flags=(
        'awesome_flag',
        'flaggy_foo',
        'baz_bar',
    ))

Now reading and writing bits is very pythonic:

# Create the model
o = MyModel.objects.create(flags=0)

# Add awesome_flag (does not work in SQLite)
MyModel.objects.filter(pk=o.pk).update(flags=MyModel.flags.awesome_flag)

# Set flags manually to [awesome_flag, flaggy_foo]
MyModel.objects.filter(pk=o.pk).update(flags=3)

# Remove awesome_flag (does not work in SQLite)
MyModel.objects.filter(pk=o.pk).update(flags=~MyModel.flags.awesome_flag)

# Test awesome_flag
if o.flags.awesome_flag:
    print "Happy times!"

# List all flags on the field
for f in o.flags:
    print f
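
You can also query against the flags directly (this mirrors the examples on the project page; check it for the full API):

# Find all objects with awesome_flag set
MyModel.objects.filter(flags=MyModel.flags.awesome_flag)

# Find all objects without awesome_flag set
MyModel.objects.exclude(flags=MyModel.flags.awesome_flag)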

Let us know if you have any feedback, and make sure you subscribe to updates from our code blog.

Blog Refresh

It’s been long overdue, but I’ve given my blog a facelift, and a brand new domain. I wasn’t content with the .net domain, and my name is just far too common to have that many choices, so I went with one of the few I could come up with that was available, JustCramer.com. I figured that with the domain move, I’d also go ahead and get off of WordPress and refresh the design at the same time.

The new blog is powered off of Jekyll and is running off of the GitHub Pages technology. I decided to use the HTML5 Boilerplate project as a starter for the new design, and it’s turned out pretty well. If you’re curious, the source is available on GitHub as always.

Importing posts from WordPress turned out to be quite the challenge. Jekyll included a tool to translate the posts (for the most part) to HTML pages, and it worked decently, though there were a few hurdles to overcome.

The first problem was that I had been using a syntax highlighting plugin in WordPress that worked by using pre tags with a lang attribute. For example, <pre lang="python">. These all had to be translated to the {% highlight %} tags supported by Jekyll (on top of Pygments).

Next up we actually had to deal with outputting { and } characters without Liquid (the template language Jekyll uses) parsing them. Just like Django, it seems they didn’t see the need to have some kind of raw escape tag (don’t get me started on the templatetag template tag in Django). After a bit of Googling I discovered the entities for these characters were &#123; and &#125;.

After getting the pages mostly working, I realized they were all in .markdown. While I have nothing against Markdown, it didn’t seem to play nice with some of my HTML. For example, it didn’t like lines that started with HTML tags. Not caring to figure out how to work around this, I decided I’d try to convert all of the HTML to Markdown. I attempted it with both Markdownify (PHP) and a project called html2text (Python). Neither project worked out. In the end I decided I would just swap all pages over to HTML:

for x in *.markdown;
do
  mv "$x" "${x%.markdown}.html";
done

Now that I had renamed all of the files, it quickly came to my attention that I was going to have to deal with paragraphs. To do this I wrote a quick script that would iterate over all of the files and replace lines which appeared to be actual paragraphs with paragraph tags. If you’re curious about this script, you can find it in the source.
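A minimal sketch of the idea looks something like this (the “looks like a paragraph” heuristic here is made up; the real script is in the blog’s source):

import glob

def wrap_paragraphs(text):
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        # Treat non-empty lines that don't already start with an HTML
        # tag or a Liquid block as paragraphs
        if stripped and not stripped.startswith(('<', '{')):
            line = '<p>%s</p>' % stripped
        out.append(line)
    return '\n'.join(out)

for path in glob.glob('_posts/*.html'):
    with open(path) as f:
        text = f.read()
    with open(path, 'w') as f:
        f.write(wrap_paragraphs(text))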

The last big hurdle I had to deal with was rewriting permalinks. This was a bigger challenge than I expected. My entire goal here was to rely on free services, and handling redirects requires some logic. At the suggestion of Anton Kovalyov, I reluctantly decided to write a quick AppEngine app.

After a few wacky ideas, I decided that the simplest solution would be to write a URL mapper using origin and destination columns. With this in mind, I quickly whipped up a PHP script to dump my entire database of posts into a CSV file, mapping the old permalinks (on davidcramer.net) to the new style. Finishing up the AppEngine app was fairly straightforward after this was ready. It included a CSV importer (to throw the data into my datastore model), and handling the redirect part was painless. Again, if you’re curious, the source is on GitHub.

I still have to deal with migrating URLs for Disqus, but let me know if you have any feedback on the new blog, its design, or just anything in general. Thanks!