Nathan Grigg

Managing Amazon S3 Redirects

This week I put together a Python script to manage Amazon S3’s web page redirects. It’s a simple script that uses boto to compare a list of redirects to files in an S3 bucket, then upload any that are new or modified. When you remove a redirect from the list, it is deleted from the S3 bucket. The script is posted on GitHub.

I use Amazon S3 to host this blog. It is a cheap and low-maintenance way to host a static website, although these advantages come with a few drawbacks. For example, up until a few months ago you couldn’t even redirect one URL to another. On a standard web host, this is as easy as making some changes to a configuration file.

Amazon now supports redirects, but they aren’t easy to configure. To set a redirect, you upload a file to your S3 bucket and set a particular piece of metadata. The contents of the file don’t matter; usually you use an empty file. You can use Amazon’s web interface to set the metadata, but this is obviously not a good long-term solution.
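To make this concrete, here is a minimal sketch of creating a single redirect with the boto library discussed below; the credentials, bucket name, and key names are placeholders. The metadata in question is the x-amz-website-redirect-location header, which you can attach when uploading an empty object (boto's key.set_redirect, used in the script below, sets the same header for you).

from boto.s3.connection import S3Connection

conn = S3Connection("your-aws-access-key", "your-aws-secret-key")
bucket = conn.get_bucket("name.of.bucket")

# upload an empty object whose only job is to redirect
key = bucket.new_key("old-page.html")
key.set_contents_from_string(
    "", headers={"x-amz-website-redirect-location": "/new-page.html"})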

Update: There are actually two types of Amazon S3 redirects. I briefly discuss the other here.

So I wrote a Python script. This was inspired partly by a conversation I had with Justin Blanton, and partly by the horror I felt when I ran across a meta refresh on my site from the days before Amazon supported redirects.

Boto

The Boto library provides a pretty good interface to Amazon’s API. (It encompasses the entire API, but I am only familiar with the S3 part.) It does a good job of abstracting away the details of the API, but the documentation is sparse.

The main Boto objects I need are the bucket object and the key object, which of course represent an S3 bucket and a key inside that bucket, respectively.
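For orientation, here is a minimal sketch of how those two objects are used, again with placeholder credentials and names. Keys returned by bucket.list() carry only the listing metadata; bucket.get_key makes a separate request for a single key, and key.get_redirect returns a key's redirect location, or None if it isn't a redirect.

from boto.s3.connection import S3Connection

conn = S3Connection("your-aws-access-key", "your-aws-secret-key")
bucket = conn.get_bucket("name.of.bucket")   # the bucket object

# listing the bucket yields key objects with name, size, etag, etc.
for key in bucket.list():
    print key.name, key.size

# fetch a single key and ask for its redirect target
key = bucket.get_key("foo/index.html")
if key is not None:
    print key.get_redirect()  # the redirect location, or None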

The script

The script (listed below) connects to Amazon and creates the bucket object on lines 15 and 16. Then it calls bucket.list() on line 17 to list the keys in the bucket. Because of the way the API works, the listed keys come with some metadata (such as size and MD5 hash) but not others (like content type or redirect location). We load the keys into a dictionary, indexed by name.

Beginning on line 20, we loop through the redirects that we want to sync. What we do next depends on whether the given redirect already exists in the bucket. If it does exist, we remove it from the dictionary (line 23) so it won't get deleted later. If, on the other hand, it does not exist, we create a new key. (Note that bucket.new_key on line 25 creates a key object, not an actual key on S3.) In both cases, we use key.set_redirect on line 32 to upload the key to S3 with the appropriate redirect metadata set.

Line 28 short-circuits the loop if the redirect we are uploading is identical to the one already on S3. Originally I was going to leave this out, since it issues a HEAD request in the hope of preventing a PUT request. But HEAD requests are cheaper and probably faster, and in most cases I expect the majority of the redirects to already exist on S3, so we will usually save some requests. Also, I wanted the script to be able to print only the redirects that had changed.

At the end, we delete each redirect on S3 that we haven't seen yet. Line 40 uses Python's ternary if to look up each key's redirect with get_redirect, but only if the key's size is zero, since redirect objects are empty. This avoids unnecessary requests to Amazon.

I posted a more complex version of the code on GitHub that has a command line interface, reads redirects from a file, and does some error handling.

 1  #!/usr/bin/python
 2  from boto.s3.connection import S3Connection
 3
 4  DRY_RUN = True
 5  DELETE = True  # delete other redirects?
 6  ACCESS = "your-aws-access-key"
 7  SECRET = "your-aws-secret-key"
 8  BUCKET = "name.of.bucket"
 9  REDIRECTS = [("foo/index.html", "/bar"),
10               ("google.html", "http://google.com"),
11               ]
12  if DRY_RUN: print "Dry run"
13
14  # Download keys from Amazon
15  conn = S3Connection(ACCESS, SECRET)
16  bucket = conn.get_bucket(BUCKET)
17  remote_keys = {key.name: key for key in bucket.list()}
18
19  # Upload keys
20  for local_key, location in REDIRECTS:
21      exists = bool(local_key in remote_keys)
22      if exists:
23          key = remote_keys.pop(local_key)
24      else:
25          key = bucket.new_key(local_key)
26
27      # don't re-upload identical redirects
28      if exists and location == key.get_redirect():
29          continue
30
31      if not DRY_RUN:
32          key.set_redirect(location)
33      print "{2:<6} {0} {1}".format(
34          local_key, location, "update" if exists else "new")
35
36  # Delete keys
37  if DELETE:
38      for key in remote_keys.values():
39          # assume size-non-zero keys aren't redirects to save requests
40          redirect = key.get_redirect() if key.size == 0 else None
41          if redirect is None:
42              continue
43          if not DRY_RUN:
44              key.delete()
45          print "delete {0} {1}".format(key.name, redirect)