Python WSGI Middleware for automatic Gzipping
I've just started learning Python WSGI (PEP-333) and thought the best way to learn would be to write some WSGI tools myself. Most recently, I chose to write a middleware application that converts all output into valid gzipped data. In this article, I will be demonstrating how my middleware gzipper works and how to implement it.
Gzipping arbitrary data
The first obstacle was that Python has no simple way of converting a normal string into a gzipped string. To do this, let's build off of the gzip.GzipFile object, which isn't really designed for this sort of task.
#!/usr/bin/python2.5 from gzip import GzipFile import StringIO def gzip_string(string, compression_level): """ The `gzip` module didn't provide a way to gzip just a string. Had to hack together this. I know, it isn't pretty. """ fake_file = StringIO.StringIO() gz_file = GzipFile(None, 'wb', compression_level, fileobj=fake_file) gz_file.write(string) gz_file.close() return fake_file.getvalue()
The above function accomplishes this nicely. By using a `StringIO` instance for the `fileobj` in the `GzipFile`, it allows us to make the function think that it is writing to a normal file. Instead, we're grabbing the content through our fake file.
Does the client want gzipped output?
This is a pretty important task. What if the client didn't even ask for gzipped output? Or worse, what if they can't even support it? Cases like these must be taken into consideration. We'll use the following code to check.
def parse_encoding_header(header): """ Break up the `HTTP_ACCEPT_ENCODING` header into a dict of the form, {'encoding-name':qvalue}. """ encodings = {'identity':1.0} for encoding in header.split(","): if(encoding.find(";") > -1): encoding, qvalue = encoding.split(";") encoding = encoding.strip() qvalue = qvalue.split('=', 1)[1] if(qvalue != ""): encodings[encoding] = float(qvalue) else: encodings[encoding] = 1 else: encodings[encoding] = 1 return encodings def client_wants_gzip(accept_encoding_header): """ Check to see if the client can accept gzipped output, and whether or not it is even the preferred method. If `identity` is higher, then no gzipping should occur. """ encodings = parse_encoding_header(accept_encoding_header) # Do the actual comparisons if('gzip' in encodings): return encodings['gzip'] >= encodings['identity'] elif('*' in encodings): return encodings['*'] >= encodings['identity'] else: return False
What this does is parse the `HTTP_ACCEPT_ENCODING` response header to make sure that it says that gzipping is alright and preferred over no compression. It does this by looking at the qvalue weights for each type of encoding. For more information on this, I suggest you read RFC 2616, section 14 which covers the "Accept-Encoding" header.
Creating the WSGI middleware
Alright, now it's time to use that function an create a middleware application.
class Gzipper(object): """ WSGI middleware to wrap around and gzip all output. This automatically adds the content-encoding header. """ def __init__(self, app, compresslevel=6): self.app = app self.compresslevel = compresslevel def __call__(self, environ, start_response): """ Do the actual work. If the host doesn't support gzip as a proper encoding, then simply pass over to the next app on the wsgi stack. """ accept_encoding_header = environ.get("HTTP_ACCEPT_ENCODING", "") if(not client_wants_gzip(accept_encoding_header)): return self.app(environ, start_response) def _start_response(status, headers, *args, **kwargs): """ Wrapper around the original `start_response` function. The sole purpose being to add the proper headers automatically. """ headers.append(("Content-Encoding", "gzip")) headers.append(("Vary", "Accept-Encoding")) return start_response(status, headers, *args, **kwargs) # Unfortunately, we can't gzip data chunk-by-chunk, so we need to join up whatever # is being sent out first. (Remember, WSGI apps return items which can be iterated.) data = "".join(self.app(environ, _start_response)) return [gzip_string(data, self.compresslevel)]
Before doing any encoding, this checks to make sure the client can even support gzipped output by looking at the "HTTP_ACCEPT_ENCODING" header value. What's really cool about how this works is that it adds a content-encoding header for you. This, however, shows a weakness in WSGI as the middleware is required to write a new start_response function that wraps around the old one. WSGI 2.0 clears this up, but this version isn't in use yet.
Using the middleware
Now, let's see how we would implement this with a simple "hello world" WSGI application.
#!/usr/bin/python2.5 from gzipper import Gzipper from wsgiref.simple_server import WSGIServer, WSGIRequestHandler # # Let's create a simple WSGI application to test out the Gzipper. # def test_app(environ, start_response): status = "200 OK" headers = [("content-type", "text/html")] start_response(status, headers) return ["Hello world!"] # Set up the WSGI server and add the middleware that wraps around # the actual application. httpd = WSGIServer(('', 8080), WSGIRequestHandler) httpd.set_app(Gzipper(test_app, compresslevel=8)) httpd.serve_forever()
Final thoughts
Middleware can be incredibly useful. I, however, have been seeing numerous people approaching it incorrectly by modifying the environ and having the application relying on it. I believe that PJ Eby said it best: "If your application requires that API to be present, then it's not middleware any more!" It would be better to just add that code to a library utilized by the application.
To view all of the code in one block, go to the next page.
Pages: 1 2

japherwocky wrote,
I think that's exactly the proper usage of StringIO. Not a hack at all. :)
Evan wrote,
Heh, yeah I guess you're right. It just looks messy to me. :-P
Jim wrote,
Two bugs:
> if(environ.get("HTTP_ACCEPT_ENCODING", "").find("gzip") < 0)
This fails for cases like Accept-Encoding: identity, compress;q=0.5, gzip;q=0
You need to actually split up the header value properly in order to handle it correctly, a simple search fails for corner cases.
You also need to transmit a Vary header.
Evan wrote,
Jim, thank you for your input. I have updated the code accordingly.
Sergey wrote,
In the first for loop, you could do
The advantage here is that the code is linear. The disadvantage is that a later Python version is required.