Nathan Grigg

Filter RSS

I was looking to make more room on my phone’s home screen, and I realized that my use of one app had dwindled more than enough to remove it. I never post any more, but there are a couple of people I would still like to follow who don’t cross-post to Twitter. The service has RSS feeds for every user, but the feeds include both posts and replies. I only want to see posts. So I brushed off my primitive XSLT skills.

I wrote an XSLT program to delete RSS items that begin with @. While I was at it, I replaced each title with the user’s name, since the text of the post is also available in the description tag.

Here is the transformation that would filter my posts, if I had any:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<!-- Default identity transformation -->
<xsl:template match="@*|node()">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
</xsl:template>
<!-- Replace title with my username -->
<xsl:template match="item/title/text()">nathangrigg</xsl:template>
<!-- Remove completely items which are directed at other users.
     The RSS feed has titles of the form @username: text of post. -->
<xsl:template match="item[contains(title, '@nathangrigg: @')]" />
</xsl:stylesheet>

Now I can use xsltproc to filter the RSS. In order to fill in the username automatically, I wrapped the XSLT program in a shell script that also invokes curl.

#!/bin/bash
set -o errexit
set -o pipefail
set -o nounset

if (( $# != 1 )); then
    >&2 echo "USAGE: $0 username"
    exit 1
fi

username=$1

xslt() {
cat << EOM
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<!-- Default identity transformation -->
<xsl:template match="@*|node()">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
</xsl:template>
<!-- Replace title with just the username -->
<xsl:template match="item/title/text()">$username</xsl:template>
<!-- Remove completely items which are directed at other users.
     The RSS feed has titles of the form @username: text of post. -->
<xsl:template match="item[contains(title, '@$username: @')]" />
</xsl:stylesheet>
EOM
}

rss() {
    curl --silent --fail$username/posts
}

xsltproc <(xslt) <(rss)

Illustrating Python multithreading vs multiprocessing

While adding multithreading support to a Python script, I found myself thinking again about the difference between multithreading and multiprocessing in the context of Python.

For the uninitiated, Python multithreading uses threads to do parallel processing. This is the most common way to do parallel work in many programming languages. But CPython has the Global Interpreter Lock (GIL), which means that no two Python statements (bytecodes, strictly speaking) can execute at the same time. So this form of parallelization is only helpful if most of your threads are either not actively doing anything (for example, waiting for input) or doing something that happens outside the GIL (for example, launching a subprocess or doing a numpy calculation). Using threads is very lightweight; for example, the threads share memory space.

Python multiprocessing, on the other hand, uses multiple system-level processes; that is, it starts up multiple instances of the Python interpreter. This gets around the GIL limitation but obviously has more overhead. In addition, communicating between processes is not as easy as reading and writing shared memory.

To illustrate the difference, I wrote two functions. The first is called idle and simply sleeps for two seconds. The second is called busy and computes a large sum. I ran each 15 times using 5 workers, once using threads and once using processes. Then I used matplotlib to visualize the results.
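The experiment boils down to something like the following sketch (scaled down so it finishes quickly: idle sleeps for a fraction of a second rather than two, and busy computes a smaller sum; the function and variable names here are illustrative, not the original code):

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def idle(n):
    """Sleep briefly. sleep releases the GIL, so threads overlap."""
    time.sleep(0.2)
    return n

def busy(n):
    """Pure-Python arithmetic. This holds the GIL, so threads serialize."""
    return sum(i * i for i in range(200000))

def timed(executor_cls, fn, tasks=15, workers=5):
    """Run `tasks` copies of `fn` on `workers` workers; return wall time."""
    start = time.time()
    with executor_cls(max_workers=workers) as ex:
        list(ex.map(fn, range(tasks)))
    return time.time() - start

if __name__ == "__main__":
    # 15 idle tasks on 5 threads run as 3 parallel batches.
    print("idle/threads:", timed(ThreadPoolExecutor, idle))
    # 15 busy tasks on 5 threads run essentially one at a time under the GIL.
    print("busy/threads:", timed(ThreadPoolExecutor, busy))
    # Processes sidestep the GIL, so busy tasks run in parallel too.
    print("busy/processes:", timed(ProcessPoolExecutor, busy))
```

Recording a start and end timestamp per task instead of one total, and plotting the intervals, gives graphs like the ones below.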

Here are the two idle graphs, which look essentially identical. (Although if you look closely, you can see that the multiprocess version is slightly slower.)

Idle threads. The tasks of each group run in parallel.
Idle processes. The tasks of each group run in parallel.

And here are the two busy graphs. The threads are clearly not helping anything.

Busy threads. Each task runs sequentially, despite multithreading.
Busy processes. The tasks of each group run in parallel.

As is my custom these days, I did the computations in an IPython notebook.

Basic unobtrusive multithreading in Python

I have a Python script that downloads OFX files from each of my banks and credit cards. For a long time, I have been intending to make the HTTP requests multithreaded, since it is terribly inefficient to wait for one response to arrive before sending the next request.

Here is the single-threaded code block I was working with.

def ReadOfx(accounts):
    downloaded = []
    for account in accounts:
        try:
            account.AddOfx(read_ofx.Download(account))
        except urllib.error.HTTPError as err:
            print("Unable to download {}: {}".format(account, err))
        else:
            downloaded.append(account)

    return downloaded

Using the Python 2.7 standard library, I would probably use either the threading module or multiprocessing.pool.ThreadPool. In both cases, you can call a function in a separate thread but you cannot access the return value. In my code, I would need to alter Download to take a second parameter and store the output there. If the second parameter is shared across multiple threads, I have to worry about thread safety. Doable, but ugly.
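With the threading module, for instance, the workaround looks something like this sketch, where download_into and the fake payload are stand-ins for read_ofx.Download:

```python
import threading

# threading.Thread discards the target's return value, so each thread
# writes its result into a shared dict, guarded by a lock.
results = {}
lock = threading.Lock()

def download_into(results, account):
    data = "ofx:" + account  # stand-in for read_ofx.Download(account)
    with lock:
        results[account] = data

threads = [threading.Thread(target=download_into, args=(results, acct))
           for acct in ["checking", "savings"]]
for t in threads:
    t.start()
for t in threads:
    t.join()
# results now maps each account to its downloaded data.
```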

In Python 3.2 and higher, the concurrent.futures module makes this much easier. (It has also been backported to Python 2.) Each time you submit a function to be run on a separate thread, you get a Future object. When you ask for the result, the main thread blocks until your thread is complete. But the main benefit is that I don’t have to make any changes to Download.

# Among other imports, we have `from concurrent import futures`.
def ReadOfx(accounts):
    with futures.ThreadPoolExecutor(max_workers=10) as ex:
        ofx_futures = [(account, ex.submit(read_ofx.Download, account))
                       for account in accounts]
        print("Started {} downloads".format(len(ofx_futures)))

    downloaded = []
    for account, future in ofx_futures:
        try:
            account.AddOfx(future.result())
        except urllib.error.HTTPError as err:
            print("Unable to download {}: {}".format(account, err))
        else:
            downloaded.append(account)

    return downloaded

In a typical run, my 6 accounts take 3, 4, 5, 6, 8, and 10 seconds to download. Using a single thread, this is more than 30 seconds. Using multiple threads, we just have to wait 10 seconds for all responses to arrive.
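The arithmetic is easy to check with a scaled-down simulation, where sleeps stand in for the downloads:

```python
import time
from concurrent import futures

durations = [0.03, 0.04, 0.05, 0.06, 0.08, 0.10]  # scaled-down seconds

def download(seconds):
    time.sleep(seconds)
    return seconds

start = time.time()
with futures.ThreadPoolExecutor(max_workers=10) as ex:
    results = list(ex.map(download, durations))
elapsed = time.time() - start

# With one worker per account, total wall time is roughly max(durations),
# not sum(durations).
```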

Persistent IPython notebook server with launchd, virtual host, and proxy

I have been using IPython for interactive Python shells for several years. For most of that time, I have resisted the web-browser-based notebook interface and mainly used the console version. Despite my love of all things texty, I finally gave in, and began using the web version almost exclusively. So much so that I got annoyed at constantly needing to start and stop the IPython server and having a terminal dedicated to running it.

Always-running server using launchd

My first step was to keep the IPython server running at all times. I did this with a KeepAlive launchd job.

This job runs ipython notebook with the --port flag, so that the port stays the same each time.
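An agent of that shape might look like the following sketch. Only the KeepAlive behavior and the --port=10223 argument come from the text; the label and the path to ipython are illustrative guesses:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
    "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.example.ipython-notebook</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/bin/ipython</string>
        <string>notebook</string>
        <string>--port=10223</string>
    </array>
    <key>KeepAlive</key>
    <true/>
</dict>
</plist>
```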

I used LaunchControl to create and load this launch agent, but you can also just save it in ~/Library/LaunchAgents and run launchctl load.

If you want, you can be done now. The notebook browser is running at http://localhost:10223.

Virtual host and proxy using Apache

But I was not done, because I already had too many processes on my machine that were serving content at some localhost port. This required me to memorize port numbers, made Safari’s autocorrect not very useful, and felt barbaric. What I needed was a domain name that resolved to http://localhost:10223. To do this, I needed a virtual host and a proxy.

Before reading further, you should know that I am not an Apache expert. In fact, I have never managed an Apache webserver except as a hobby. The best I can promise you is that this works for me, on my OS X computer, for now.

In /etc/hosts, I created a new host called py.

    py

This resolves py to, i.e., localhost.

Now in /etc/apache2/httpd.conf I created a virtual host and a proxy.

<VirtualHost>
    ServerName py
    ProxyPass /api/kernels/ ws://localhost:10223/api/kernels/
    ProxyPassReverse /api/kernels/ ws://localhost:10223/api/kernels/
    ProxyPass / http://localhost:10223/
    ProxyPassReverse / http://localhost:10223/
    RequestHeader set Origin "http://localhost:10223/"
</VirtualHost>

This forwards all traffic to py on port 80 to localhost on port 10223. Note that the order of the ProxyPass directives is apparently important. Also, if you use * instead of the address in the VirtualHost directive, you might also be forwarding requests originating outside of your machine, which sounds dangerous.

Then I ran sudo apachectl restart, and everything seemed to work.

Note that Safari interprets py as a Google search, so I have to type py/. Chrome does the same thing, except that after I load py/ once, the trailing slash is optional.