Kristian Glass - Do I Smell Burning?

Shipping Stuff

(No boats were harmed, involved, or even really alluded to in the making of this post)

Two things came through my RSS reader recently that particularly resonated with me. The first, a blog post by Martin Keegan, “Intellectual Debt”, says:

I think it’s possible to accumulate “intellectual debt”. Thoughts and ideas that you’ve had, worked on, developed, talked about, but have not written up and published. You can have an idea, but until you’ve tried to write it up properly such that someone else could read and criticise it, you can’t be sure that it actually makes sense.

and GitHub’s Steve Smith makes the following declaration in an interview on “Optimizing For Happiness”:

Whether or not it gets used, you have to finish something. The worst thing you can do is start a bunch of things, get halfway through, quit and start something else. You’re not going to be happy. Ship stuff.

 

Both of these hit home hard. I’ve got a variety of half-finished, un-written-up side projects that have suffered an undue amount of neglect due to client work and life. I suspect I’m far from alone in this.

Right now I’m taking some holiday time, and since several schools of thought seem to suggest a public commitment can aid in achieving a goal, here’s the list of things I’m going to try to ship over the coming weeks:

  • "Learning Me A Haskell" - an informal "worked introduction" to writing a puzzle solver in Haskell
  • "Jeeves-Door" - hooking up my doorbell to a Raspberry Pi for tweets and texts and photos when someone comes a-ringing
  • "Jeeves-Bell" - replacing the guts of my analogue alarm clock with a Raspberry Pi  for better alarm control and scheduling
  • A campaign to encourage people to cite their sources when publishing

There’s more I’ve got queued, but this holiday time is finite, and this feels like the right amount to be “challenging but achievable”.

If, like me, you’ve got a personal project backlog, why not join me on a “shipping spree”? I’d love to hear from you if you do.

Upgrading Puppet in Vagrant Boxes

I’ve finally found the time to sit down and start using Vagrant for Real Things. For the unaware, Vagrant is essentially a tool for managing development VMs - excellent for such things as managing a local development environment, or developing and testing Chef/Puppet configuration. For more detail see the excellent set of slides by Vagrant author Mitchell Hashimoto - Develop and Test Configuration Management Scripts with Vagrant.

Something I swiftly ran into was that I have several manifests written for Puppet 3. Things introduced since Puppet 2.7 include unless as a synonym for if ! in the DSL, and Hiera as a “first class citizen”.

So, I want to use Puppet 3, but I really don’t want to have to go and rebuild existing Vagrant boxes. I have Puppet modules for ensuring that I’m running the latest stable Puppet, but that’s not going to work when the existing install can’t parse said modules.

It turns out you can have multiple Provisioners in Vagrant. So, while in general Puppet (or Chef if you’re of that persuasion) is The Right Way to provision things, we can add a Shell provisioner to run before the Puppet provisioner and ensure the VM is running Puppet 3.

upgrade-puppet.sh

Note: Everything I’m doing at the moment is Ubuntu-based; this script is Debian/Ubuntu specific, but should be fairly trivial to adapt to the (supported) distro of your choosing

PuppetLabs provides packages to enable their apt repositories for Debian and Ubuntu. The OS code name is helpfully provided by /etc/lsb-release as $DISTRIB_CODENAME (Update: this doesn’t work for Debian; better to use lsb_release --codename --short after installing the lsb-release package that provides it), so it’s a simple matter of extracting the code name and using it to fetch and install the right package, before updating the package indexes and upgrading Puppet:

#!/bin/bash
set -e

sudo apt-get install --yes lsb-release
DISTRIB_CODENAME=$(lsb_release --codename --short)
DEB="puppetlabs-release-${DISTRIB_CODENAME}.deb"
DEB_PROVIDES="/etc/apt/sources.list.d/puppetlabs.list" # Assume that this file's existence means we have the Puppet Labs repo added

if [ ! -e "$DEB_PROVIDES" ]
then
    # Echo statement useful for debugging, but automated runs of this will interpret any output as an error
    # echo "Could not find $DEB_PROVIDES - fetching and installing $DEB"
    wget -q "http://apt.puppetlabs.com/${DEB}"
    sudo dpkg -i "$DEB"
fi
sudo apt-get update
sudo apt-get install --yes puppet

With that in place, all that remains is to get Vagrant to use it as a shell provisioner. Just drop the below line into your Vagrantfile, somewhere before your Puppet provisioner:

config.vm.provision :shell, :path => "upgrade-puppet.sh"

Puppet 3 Provisioner

Once I’d upgraded to Puppet 3, I noticed a few warnings appear across my boxes. Naturally, I wanted to squash these.

FQDN

Warning: Could not retrieve fact fqdn

Something (I haven’t quite established what) was checking the fqdn fact. All the boxes I use seem not to set an FQDN, for example:

$ hostname -f
precise32

It’s a simple matter of setting config.vm.hostname to a valid FQDN, for example:

config.vm.hostname = "vagrant.example.com"

(If you’re using Vagrant v1 configuration, you’ll want config.vm.host_name (note the underscore))

Hiera

Warning: Config file /etc/puppet/hiera.yaml not found, using Hiera defaults

Puppet 3 now has Hiera built in, and while its default configuration seems fairly sane and reasonable, it still regards the lack of an explicit configuration file as warning-worthy. So, use puppet.options to explicitly set Puppet’s hiera_config option, for example:

config.vm.provision :puppet do |puppet|
    puppet.manifests_path = "manifests"
    puppet.manifest_file  = "base.pp"
    puppet.module_path = "modules"
    puppet.options = "--hiera_config /vagrant/hiera.yaml"
end

Conclusion

Sure, perhaps you’re better off building a new base box, but if you’re not ready to do that, this should hopefully come in useful!

Sometimes it's the little things

I was reading through the ElasticSearch Guide this morning and found me a typo.

Since their documentation also lives on GitHub, it wasn’t very long before I’d cloned it, fixed it, and sent a pull request.

This is nice. This is so much nicer than the other all-too-common model:

  • Find appropriate contact method, be it a web form or email address somewhere on the site
  • Email them a description of the issue
  • Wait

Sometimes I get a reply. Sometimes I don’t. It’s all too fire-and-forget. What GitHub gives me is visibility and openness. There’s now a public URL for my pull request / issue. The “open requests” count increments. This doesn’t sound like much, but it’s important. Anyone visiting the repository sees the count. Anyone can see the issue.

Why does openness matter?

Tom Preston-Werner, one of the GitHub founders, covers why people should open-source code very nicely in his blog post, “Open Source (Almost) Everything”, with the caveat of:

Don’t open source anything that represents core business value

What I’m saying is that it’s not just code. Open your documentation, and open your processes. I want to know that you’ll respond to fixes and issues and questions. I want to see how you respond, and how you react. I want to see that you care about your users. If you’ve just dumped code onto the internet under the guise of “openness”, and all feedback is routed to /dev/null, I want to see that.

There are lots of things that factor into my decision whether or not to invest in a product or technology. Openness helps.

Even if you’re a primarily closed-source company, what do you have to lose by open-sourcing your (presumably already freely-downloadable) documentation? Maybe you open it up and no-one interacts with it. Is that a cost? Maybe it’ll encourage it to get a bit more love ;)

With the tools currently available it’s now easier than ever to be more open, and I’m starting to wonder to what extent pointless closedness should be viewed as a weakness.

Addendum - Is this GitHub specific?

Not at all. Replace “GitHub” with BitBucket if you like, or Google Code Project Hosting (though I’d like a nice ‘pull-request’ UI or similar). GitHub just has “the nicest and easiest” (read: “my favourite”) UI for code hosting and basic issue tracking.

StackCompare v1.0

StackCompare, my web app for comparing your reputation and badges on StackExchange sites like StackOverflow, has reached version 1.0.

Features include reputation tracking and graphing, to see how you’re doing compared to your friends and rivals, and a detailed comparison tool, so you can see exactly what badges someone has compared to you, and vice versa.

If you’re a keen StackExchange user who likes to know how they’re doing compared to other users you might know, this is the tool for you.

Using Amazon S3 to host your Django Static Files

(Note, if you haven’t read it already, I recommend my previous article on Django and Static Files to get an understanding of the fundamentals)

Pretty much every Django project I deploy, I use Amazon’s Simple Storage Service (S3) for hosting my static files. If you aren’t particularly familiar with it, then the salient points are:

  • It's part of the excellent Amazon Web Services (AWS) offering
  • It's essentially a cloud file store. You have a bucket. You create, read, update and delete files in that bucket.
  • You can make files in your buckets web accessible
  • Amazon are probably better at this than you
  • It's fairly cheap
    • Storage costs approximately $0.13 per month per GB stored up to 1 TB
    • Inbound data transfer is free
    • Outbound data transfer is free up to 1 GB per month
    • Outbound data transfer between 1 GB and 10 TB per month costs approximately $0.13 per GB
    • The cost to the average reader: under $0.15, or free if they're covered by the AWS Free Usage Tier
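
To put a rough number on that last point, here's a quick back-of-the-envelope sketch in Python; the storage and traffic figures are assumptions for a small Django site, not quotes from Amazon:

STORAGE_PRICE_PER_GB = 0.13   # approx $ per GB-month stored
TRANSFER_PRICE_PER_GB = 0.13  # approx $ per GB outbound beyond the first free 1 GB

stored_gb = 0.5    # assumed: half a gigabyte of static assets
outbound_gb = 1.5  # assumed: outbound traffic per month

billable_outbound_gb = max(0, outbound_gb - 1)  # the first 1 GB out per month is free
monthly_cost = stored_gb * STORAGE_PRICE_PER_GB + billable_outbound_gb * TRANSFER_PRICE_PER_GB
print('~$%.2f per month' % monthly_cost)  # ~$0.13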

Why don’t I use nginx or Apache or whatever webserver I have in front of my Django deployment for static file hosting? Three things:

  • Specialisation - while I have no doubts about the abilities of nginx and Apache to host static files, S3 will inevitably do it far better for far less effort, and it means one less thing for them to do
  • I frequently deploy to Heroku where I don't have access to the configuration of the httpd layer
  • I find it pretty simple - not much more than half a dozen lines added to `settings.py`

So, first things first, you’ll need django-storages for the STATICFILES_STORAGE class (see my previous article for the role of storages), and boto, the (excellent) Python AWS library that the unsurprisingly-named S3BotoStorage uses to communicate with S3.

Assuming you’re inside a virtualenv, this should be pretty straightforward:

$ pip install django-storages boto
Downloading/unpacking django-storages
# Snip
Downloading/unpacking boto
# Snip
Successfully installed django-storages boto
Cleaning up...

(Also if you have a requirements.txt file or setup.py, don’t forget to update them!)

Now that that’s all installed, add storages to INSTALLED_APPS in your settings.py:

    INSTALLED_APPS += ('storages',)

You’ll need an S3 bucket to push files to, so head over to the AWS Management Console for S3 and “Create Bucket”, giving it some appropriate name, and picking the most appropriate geographical region for you. You’ll be offered the option to set up logging et cetera, but can happily skip this by just clicking “Create”:

Amazon S3 'Create Bucket' dialog

You’ll also need to get this name into your settings. We’ll do this from os.environ, because we’ve all read the twelve-factor opinions on config, right? (Go read it, so that if you turn up on IRC for help and I ask you to pastebin your settings.py, you don’t have to go on an extensive redacting spree, expose sensitive information, or both)

    AWS_STORAGE_BUCKET_NAME = os.environ['AWS_STORAGE_BUCKET_NAME']

If you’re on Heroku, you’ll want to add that to your config:

    $ heroku config:add AWS_STORAGE_BUCKET_NAME='your_bucket_name_here'

Finally, all you need is to have Django use the right static configuration. I tend to wrap this in an if not DEBUG block because I don’t want it while developing (I include the AWS_STORAGE_BUCKET_NAME in that block too, so I don’t need to be too specific about my environment at dev time):

if not DEBUG:
    AWS_STORAGE_BUCKET_NAME = os.environ['AWS_STORAGE_BUCKET_NAME']
    STATICFILES_STORAGE = 'storages.backends.s3boto.S3BotoStorage'

Voila. Now, with DEBUG set to False, I just need to collectstatic and my static files will be uploaded to S3:

$ python manage.py collectstatic
You have requested to collect static files at the destination
location as specified in your settings.

This will overwrite existing files!
Are you sure you want to do this?

Type 'yes' to continue, or 'no' to cancel: yes
#Snip

And there you have it. Hopefully you shouldn’t have any problems following this guide, but if you have any questions, issues, or feedback (always appreciated!) then please leave a comment, find me on IRC, or catch me on Twitter.

Did this help? Check out my book

Ok so I'm still writing the book so you can't buy it just yet. But if you want to make sure that you're serving your static files in the best way possible, you'll want it:

Check out the book

Django and Static Files

Django’s handling of static files is great, but sometimes causes confusion. If you’re wondering how it all fits together, what some of the settings mean, or just want some example uses, then keep reading.

Introduction

A typical Django project will have multiple sets of static files. The two common sources are applications with a static directory for media specific to them, and a similarly-named directory somewhere in the project for media tying the whole project together.

Ultimately, you want these all to end up in one place, to be served to the end user. This is where the collectstatic command comes in; as the name suggests, it’ll collect all your static files together into that one place. Of course, if you have DEBUG set to True in your settings, runserver will happily handle all this for you, but that won’t be the case for your final deployment (and nor should it be!)

So how does this all happen?

Configuration

Finders

First of all, where to find the static files?

By default settings.py will be created with a STATICFILES_FINDERS setting, with a value of:

('django.contrib.staticfiles.finders.FileSystemFinder',
 'django.contrib.staticfiles.finders.AppDirectoriesFinder')

These will be used to find the source static files. The AppDirectoriesFinder is responsible for picking up $app_name/static/, while the FileSystemFinder uses the directories specified in the STATICFILES_DIRS tuple.

You’ll probably want STATICFILES_DIRS to look something like the below:

# Because actually hard-coding absolute paths into your code would be bad...
import os
PROJECT_DIR = os.path.dirname(__file__)

STATICFILES_DIRS = (os.path.join(PROJECT_DIR, 'static'),)

Storage

The STATICFILES_STORAGE setting controls how the files are aggregated together. The default value is django.contrib.staticfiles.storage.StaticFilesStorage which will copy the collected files to the directory specified by STATIC_ROOT.

Do not confuse STATIC_ROOT, to where static files are collected, with the aforementioned STATICFILES_DIRS; the former is output, the latter are inputs. They should not overlap. This is a common mistake.

Update: To be absolutely clear, STATIC_ROOT should live outside of your Django project - it’s the directory to where your static files are collected, for use by a local webserver or similar; Django’s involvement with that directory should end once your static files have been collected there

URL

Last but not least, STATIC_URL should be the URL at which a user / client / browser can reach the static files that have been aggregated by collectstatic.

If you’re using the default StaticFilesStorage, then this will be wherever your nginx (or similar) instance is serving up STATIC_ROOT, e.g. the default /static/, or, better, something like http://static.example.com/. If you’re using Amazon S3 this will be http://your_s3_bucket.s3.amazonaws.com/. Essentially, this is wholly dependent on whatever technique you’re using to host your static files. It’s a URL, and not a file path.
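
Pulling those three settings together, here’s a minimal sketch using the example values from this post (the paths and hostname are illustrative):

import os
PROJECT_DIR = os.path.dirname(__file__)

# Inputs: where collectstatic looks for static files
STATICFILES_DIRS = (os.path.join(PROJECT_DIR, 'static'),)

# Output: where collectstatic copies them to, outside the project directory
STATIC_ROOT = '/srv/www/com.example.static/'

# Where clients fetch them from: a URL, not a file path
STATIC_URL = 'http://static.example.com/'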

Common Mistakes

  • Overlap between STATICFILES_DIRS and STATIC_ROOT - the former is a set of places to look for static files, the latter is where they’re stored
  • Incorrect STATIC_URL - it’s a URL, not a file path
  • Incorrect configuration of whatever you’re using to host your static content; this is why I use S3, in my experience it’s the least effort to get working
  • Having STATIC_ROOT inside your project directory. While not strictly a mistake, it’s not where it belongs, and is generally a sign of other misunderstandings

Examples

All of these will assume you’ve left STATICFILES_FINDERS as its default, and STATICFILES_DIRS as described above; I’ve never yet had a reason, across dozens of projects, for these not to be the case

Apache serving static files

STATIC_ROOT = '/srv/www/com.example.static/'
STATIC_URL = 'http://static.example.com/'

Apache config (e.g. /etc/apache2/sites-enabled/com.example.static.conf):

<VirtualHost *:80>
    DocumentRoot /srv/www/com.example.static
    ServerName static.example.com
</VirtualHost>

Amazon S3

(Update: If you want more information on this topic, check out my follow-up blog post, Using Amazon S3 to host your Django Static Files)

Note this uses django-storages, which is a nice wide-ranging collection of custom backends for STATICFILES_STORAGE described above.

INSTALLED_APPS += ('storages',)
STATICFILES_STORAGE = 'storages.backends.s3boto.S3BotoStorage'
AWS_STORAGE_BUCKET_NAME = 'my_bucket_name'
STATIC_URL = 'http://%s.s3.amazonaws.com/' % AWS_STORAGE_BUCKET_NAME

I frequently deploy to Heroku where custom httpd configuration is nontrivial, so often use this method.

Conclusion

Static file handling is important to get right, and straightforward once you know how, but easy to get wrong. I hope this clarifies things.

As ever, drop me a line if you have any queries, questions or complaints.

Did this help? Check out my book

Ok so I'm still writing the book so you can't buy it just yet. But if you want to make sure that you're serving your static files in the best way possible, you'll want it:

Check out the book

Time-series graphs with TempoDB and Flot

A side project of mine that I’m working on at the moment is StackCompare, an app for StackExchange users to compare their reputation and badges to that of their friends. One feature I wanted to add was a graph of reputation over time.

Step One - Data Aggregation

First things first, get the data into some sort of database. The data is a time series (a set of tuples of the form (timestamp, data)), so my thoughts immediately went to setting up my own OpenTSDB instance. However, where possible I’d rather use a hosted solution at this stage of development, and some googling led me to TempoDB, a hosted time-series database service. Currently TempoDB seems quite early-stage (a warning sign to me is the lack of any mention of pricing…) but it works quite nicely, with decent documentation and a Python client.

Writing to TempoDB is nice and straightforward:

from datetime import datetime

# TEMPODB_CLIENT is a TempoDB client instance, configured elsewhere with the account credentials

def _reputation_key(site, user_id):
    key = '%s.%d.reputation' % (site, user_id)
    return key

def write_reputation(site, users):
    data = [{'key': _reputation_key(site, user.user_id), 'v': user.reputation} for user in users]
    now = datetime.utcnow()
    TEMPODB_CLIENT.write_bulk(now, data)

and then it was just a matter of wrapping this in a management command (using stackpy, my Python library for the StackExchange v2 API (currently quite pre-alpha-quality…)):

from django.core.management.base import BaseCommand
from django.conf import settings

import stackpy

# tempo is the module containing write_reputation above; _get_all_user_ids lives elsewhere in the project

class Command(BaseCommand):
    help = 'Grab reputation for all users and their friends, and store in TempoDB'

    def handle(self, *args, **options):
        s = stackpy.Stackpy(settings.STACKEXCHANGE_CLIENT_KEY)
        user_ids = _get_all_user_ids()
        users = s.users(user_ids).items
        tempo.write_reputation('stackoverflow', users)

As StackCompare is currently a Heroku app, it was trivial to hook up the Heroku Scheduler to run this every 10 minutes.

Step Two - Data Extraction

Getting things out of TempoDB is equally straightforward:

def get_reputation(site, user_ids):
    # Fairly fluffy datetime range
    end = datetime.utcnow() + timedelta(days=1)
    start = end - timedelta(weeks=52)
    keys = [_reputation_key(site, user_id) for user_id in user_ids]
    datasets = TEMPODB_CLIENT.read(start, end, keys=keys, interval='1hour')
    return [(_key_to_dict(dataset.series.key), dataset.data) for dataset in datasets]

I’m not using series attributes just yet, as that part of the client library is still slightly in flux; instead I’m encoding them in the key.
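
The _key_to_dict helper used in get_reputation above isn’t shown in these snippets; here’s a rough sketch of it, assuming only the "site.user_id.attribute" key format produced by _reputation_key:

def _key_to_dict(key):
    # Keys look like "<site>.<user_id>.<attribute>", e.g. "stackoverflow.1234.reputation",
    # so unpack the "attributes" encoded in the key back into a dict
    site, user_id, attribute = key.rsplit('.', 2)
    return {'site': site, 'user_id': user_id, 'attribute': attribute}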

Step Three - Display

On the graphing front, it was time to whip out Flot - a Javascript plotting library for jQuery, beautifully simple to use.

First, some placeholder HTML (with some slightly ugly hardcoded sizes…):

<div id="plot" style="width: 960px; height: 500px;">
    <h2 id="plot-placeholder">Loading...</h2>
</div>

Then, a little Javascript to populate it:

    <script src="{% static "flot-0.7/jquery.flot.js" %}"></script>
    <div id="graph-data-url" data-url="{% url "graph_data_api" %}"></div>
    <script>
        var url = $("#graph-data-url").data("url");
        $(function() {
            $.get(url, function(data) { //TODO Handle errors...
                $("#plot-placeholder").remove();
                var data = $.parseJSON(data);
                var options = {
                    xaxis:{mode:"time"},
                    series:{
                        lines:{show:true},
                        points:{show:true}
                    }
                };
                $.plot($("#plot"), data, options);
            });
        });
    </script>

All that was needed to finish it off was some short Python code to massage the data from TempoDB into the right format for Flot (slightly paraphrased):

import json
import time

from django.contrib.auth.decorators import login_required
from django.http import HttpResponse

# tempo is the module shown earlier; _get_ids and _determine_label are defined elsewhere in the project

@login_required
def graph_data(request):
    user_profile = request.user.get_profile()

    ids = _get_ids(user_profile)

    flot_data = []
    data = tempo.get_reputation('stackoverflow', ids)
    for attr_dict, series_data in data:
        def flotify(series_data):
            return [[time.mktime(point.ts.timetuple()) * 1000, point.value] for point in series_data] # Javascript time in ms
        user_id = int(attr_dict['user_id'])
        label = _determine_label(user_id)
        series = {'label': label, 'data': flotify(series_data)}
        flot_data.append(series)

    return HttpResponse(json.dumps(flot_data))

And voila:

StackCompare Reputation Graph

NHS Hack Day London 2012

So last weekend I found myself up at 0600 on a Saturday, of my own free will. Why? I’d signed up for NHS Hack Day, a two-day event essentially throwing a bunch of NHS-types and tech-types together in a room and seeing what comes out. My domain knowledge is pretty limited - I know the NHS exists and like it, and several friends are doctors or working on becoming so - but I like building things, particularly for people with a clear idea of what they want.

I was initially a little wary - my general area of expertise is fairly back-end-y, which is normally in relatively low demand for small / early-stage projects. However, a few days before, Dr. Wai Keong Wong made a post to Google Groups about looking for Java developers to work on Renal Patient View, an open source system used by 19,000 patients of 53 renal units in the UK to access blood test results. It’s been almost a year since I last really wrote Java in anger, having almost completely switched to writing Python and Django, but this seemed like something I’d want to get involved in, and could potentially really help out with.

Having arrived on the first day, I listened to the other presentations, and while they all sounded very good, Renal Patient View still seemed very appealing, so it was time to meet the others similarly interested: Dr Grant Hill-Cawthorne, Dr Zeinab Abdi, Ayesha Garrett, and my friends Jeff Snyder and Grey Baker. Our first challenge: getting the software built. With my ever-rusting Java experience casting me as ‘the Java guy’ of the team, we perhaps weren’t best placed to be doing this, but we had clue, coffee, and enthusiasm; how bad could it go…! Well, several hours later, we had it building…

The term ‘Open Source’ is used in a lot of ways by a lot of people. Renal Patient View is an Open Source project; you can find its code currently at SourceForge (in the process of being moved to GitHub). However, it’s one thing being able to access the code, and another to be able to actually do anything with it. One of what we felt were our biggest, yet least visible, contributions from the weekend was that we were able to identify and document the build/deployment process for other people to use, making Renal Patient View ‘even more Open Source’. Now, in no way should this be taken as a criticism of the Renal Patient View developers; making a relocatable build system requires genuinely hard work, and it’s to the credit of the RPV team that the source is even available for us to work with - the most important first step! I’m glad we were able to contribute in such a way that will aid anyone coming to the code in future.

Meanwhile, once we’d got development environments up and running, it was time to add some features! After some brainstorming, we aimed at revamping the design and copy, adding graphing of blood component levels in users’ results, and adding notification emails for patients when their test results arrived. See them demonstrated in the presentation below:

Presentation Video

[youtube=http://www.youtube.com/watch?v=Xuq3825lZio]

Team Photo

While we didn’t win, we made the shortlist, and here we are (apart from Grey, who was called away to Silicon Milkroundabout)

Renal Patient View NHS Hack Day team

In terms of what we were able to produce in a weekend, in a language and framework we were unfamiliar with, in a domain that half of us knew little about, I feel we made incredible progress. It was great meeting and working with the other people on the team, and I learnt an incredible amount.

As for the rest of the event, it was truly excellent. An incredible gathering of bright, intelligent, motivated and interesting people, who produced some great things. Credit to Dr. Carl Reynolds and all the other volunteers and organisers for an absolutely great weekend. Subsequent hack days are being organised for (currently, I believe) Liverpool and Oxford - I thoroughly encourage anyone reading to attend, and I’ll hopefully see you there!

Raw AWS vs Heroku

I’ve been using Heroku a lot recently for deploying code. For those who’ve missed it, it’s a rather nice Platform as a Service (PaaS) offering; think Google App Engine, but actually usable: no long list of forbidden or broken frameworks and libraries, you can use a relational database, and other such niceties.

I’ve had a number of people ask in conversation why I use (or indeed trust) Heroku and don’t just deploy on top of AWS. Then I came across this question asking “Why do people use Heroku when AWS is present?“ on Stack Overflow, which I thought did a pretty good job of explicitly covering some of the aspects of the choice.

Maybe I should have flagged the question as Not Constructive (it’s now closed because others have), but I instead found myself using it as a bit of a dumping ground for my thoughts on the matter; if you’re interested, check it out - I hope it’s informative.