Designing an async API, from sans-I/O on up

While developing the CLA bot for CPython (aka, the Knights Who Say Ni), I decided to make it asynchronous, partially for performance reasons but also because I simply wanted to. 😁 The problem with doing that for a GitHub bot is there wasn't any pre-existing GitHub library written in Python that was asynchronous. That led to me having to write my own code to make REST calls to GitHub's API in an asynchronous fashion. Knowing full well I wasn't the only person wanting to write GitHub bots in Python that used async functions (including myself since I knew I was going to end up creating Bedevere eventually), I decided to create a new project that provided a library for making asynchronous calls to GitHub's API. Since I wanted the library to be easy to maintain I decided I didn't want to abstract away GitHub's API under some sheen of object-oriented niceties, but instead simply handle the parts required in making a REST request when calling GitHub (in other words I wanted a library that basically looks like an HTTP REST call, but that handled all HTTP-level details that are GitHub-specific). In the end I created gidgethub.

I'm actually not blogging to promote the library (although people are already using it so it seems to not be a failure 😉), but instead to talk about its design from the perspective of creating a library that uses async functions to make HTTP calls on behalf of the user. A knee-jerk reaction to that statement might be, "I have written plenty of libraries that make HTTP calls and there wasn't anything that special about them," and you would probably be right. But asynchronous libraries have some special considerations you must think about, else you will end up coding yourself into a corner that you might regret.

Traditionally, when needing to make an HTTP request in a library, you would reach for something like urllib3 or requests to make the HTTP request and you would be done with it. The users of your library didn't care what you used to make the HTTP request as that was just an implementation detail that doesn't affect them. Everyone gets to use what they want for HTTP requests and everyone's happy!

But then there's async/await. When writing asynchronous code in Python that uses async functions, you end up choosing an asynchronous framework that provides both the functions that you ultimately await on and the async manager of what you are waiting to complete (the "async manager" is typically called an "event loop", but the term just doesn't feel like it fits this async use-case like it does a GUI framework). A key point in all of this is you typically want a single async manager that manages everything you're waiting on. And since async managers only know how to wait on what their own code returns, that means you can only make things like HTTP requests using the framework's own code instead of some third-party library like requests. Thus when you're doing async programming you end up choosing a single framework which you use to make all of your HTTP calls. (If you want to have a better understanding of how this all works, you can see my blog post on how async/await works to better understand how async frameworks are generally structured.)
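As a small illustration of the "single async manager" point, here is a minimal sketch using asyncio from the standard library. The fetch() coroutine is a made-up stand-in for an HTTP call; what matters is that everything being waited on comes from the same framework that runs the loop.

```python
import asyncio


async def fetch(name, delay):
    # Stand-in for an HTTP call: we can only await something the
    # framework itself knows how to wait on (here, asyncio.sleep()).
    await asyncio.sleep(delay)
    return name


async def main():
    # One async manager (asyncio's event loop) waits on both calls.
    return await asyncio.gather(fetch("first", 0.02), fetch("second", 0.01))


results = asyncio.run(main())
```

Swapping asyncio.sleep() for a blocking call like time.sleep() would stall the whole loop, which is the same reason a blocking HTTP library like requests doesn't fit here.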

Notice how having to stick with a single framework for HTTP calls is different from the traditional situation where people didn't care what HTTP library was used? If a library chooses to go the traditional route and standardize on a specific async framework, then it runs the risk of restricting itself to only users of that same async framework (or frameworks that happen to be compatible with that API). Locking users out of a library because of the async framework they chose is rather unfortunate when Python has a good number of async frameworks that take different approaches and thus are not compatible with each other.

As a big proponent of both Curio and trio, I wanted to make sure that gidgethub could support any async framework somehow. Since I wasn't aware of any other HTTP-based library trying to be this open-ended, I took a long time to think over the API design (remember, API design is UX design at the code level; public APIs matter a lot, so I try to never rush this initial design stage). In the end I decided to start from a sans-I/O base and work my way up to a higher-level abstraction where people make what look like asynchronous REST calls that are tailored to GitHub.

If you're not familiar with the sans-I/O approach to library design, I once wrote a blog post about it. In a nutshell, it's writing code that does no I/O on its own and instead focuses on accepting/returning what is read/written from/to the appropriate I/O. For example, hyper-h2 implements a sans-I/O library for HTTP/2 by parsing bytes read from a socket and returning what should be written back to the socket as a response, all without actually touching the socket itself. The sans-I/O approach when dealing with I/O has multiple benefits, from easier testing to being reusable across libraries that choose to do their own I/O.
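To make the idea concrete, here is a minimal sketch of a sans-I/O parser for newline-delimited messages. The LineParser class is a hypothetical example written for this post, not part of hyper-h2 or gidgethub; it only ever sees bytes that someone else read.

```python
class LineParser:
    """Hypothetical sans-I/O parser for newline-delimited messages."""

    def __init__(self):
        self._buffer = b""

    def receive_data(self, data):
        """Accept bytes read from anywhere; return the complete lines so far."""
        self._buffer += data
        # Everything before the last newline is complete; the rest is kept
        # in the buffer until more bytes arrive.
        *lines, self._buffer = self._buffer.split(b"\n")
        return lines


parser = LineParser()
first = parser.receive_data(b"hello\nwor")  # one complete line, one partial
second = parser.receive_data(b"ld\n")       # the partial line is completed
```

Because the parser never touches a socket, the same code works whether the bytes came from a blocking read, an async framework, or a test.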

In gidgethub's case, the ability to use the lowest-level code in other libraries was important to me. Because the lowest-level APIs do no I/O, they can be used by both synchronous and asynchronous I/O libraries, so those libraries don't have to re-implement the same details. Since gidgethub is dealing with HTTP requests, that means its base level is structured to work in a world of URLs, HTTP headers, and the bytes making up the body of an HTTP packet: the core parts of an HTTP request and response.

A perfect example of how this works is gidgethub.sansio.Event and its from_http() class method. The Event class represents a webhook event from GitHub, which is basically the HTTP request GitHub makes to your webhook whenever some event happens that you have said your endpoint cares about (e.g. a new pull request has been opened). The __init__() method for Event is pretty straightforward and directly represents the details of a webhook event:

def __init__(self, data, *, event, delivery_id):  
    ...

But that in no way directly maps to an HTTP packet coming from GitHub. Now one could write some code to read the HTTP request object for the HTTP server you have accepting GitHub's webhook events, but to stay flexible and reusable, I instead created a class method which simply accepts the common parts of an HTTP packet plus any crucial GitHub-specific details:

@classmethod
def from_http(cls, headers, body, *, secret=None):  
    ...

What this means is that if someone was using a library like aiohttp, they could use gidgethub.sansio.Event easily by simply teasing out the constituent parts of the HTTP request and passing them to from_http() without any extra effort.

# 'request' is an instance of aiohttp.web.BaseRequest.
headers = request.headers
body = await request.read()
event = gidgethub.sansio.Event.from_http(headers, body)

This frees gidgethub from having to care about I/O while letting anyone using any HTTP library not have to figure out how to tease out all the details in a GitHub webhook event.
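For a feel of what a from_http()-style constructor involves, here is a rough sketch. WebhookEvent is a hypothetical stand-in and not gidgethub's actual implementation; it assumes the X-GitHub-Event, X-GitHub-Delivery, and X-Hub-Signature headers that GitHub sends with webhook deliveries, and a JSON body.

```python
import hashlib
import hmac
import json


class WebhookEvent:
    """Hypothetical stand-in for a sans-I/O webhook event class."""

    def __init__(self, data, *, event, delivery_id):
        self.data = data
        self.event = event
        self.delivery_id = delivery_id

    @classmethod
    def from_http(cls, headers, body, *, secret=None):
        # Normalize header names so lookups are case-insensitive.
        headers = {key.lower(): value for key, value in headers.items()}
        if secret is not None:
            # Assumed scheme: HMAC-SHA1 of the raw body, hex-encoded.
            digest = hmac.new(secret.encode("utf-8"), body, hashlib.sha1)
            expected = "sha1=" + digest.hexdigest()
            received = headers.get("x-hub-signature", "")
            if not hmac.compare_digest(expected, received):
                raise ValueError("payload signature does not match")
        data = json.loads(body.decode("utf-8"))
        return cls(data, event=headers["x-github-event"],
                   delivery_id=headers["x-github-delivery"])


# No HTTP server needed: just headers and bytes.
body = json.dumps({"action": "opened"}).encode("utf-8")
headers = {"X-GitHub-Event": "pull_request", "X-GitHub-Delivery": "1234"}
event = WebhookEvent.from_http(headers, body)
```

Note how the usage at the bottom needs nothing but a dict and some bytes, which is also what makes this style of code easy to test.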

A possibly even better example of sans-I/O at work is the gidgethub.sansio.create_headers() function:

def create_headers(requester, *, accept=accept_format(), oauth_token=None):  
    ...

When making a call to GitHub's REST API, there are some common things that you include in the HTTP headers. What this function does is allow you to specify those details in a function call and get back a dict which can be directly used as the headers in an HTTP request to GitHub. Once again this frees gidgethub from having to care about I/O and lets the user use whatever HTTP library they want while getting to skip the need to fiddle with HTTP headers in making a REST call to GitHub.
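A function like this can be tiny. The sketch below is my own illustration under stated assumptions rather than gidgethub's code: build_headers() is a hypothetical name, and the default accept value and the "token" authorization scheme are assumptions based on GitHub's REST API conventions.

```python
def build_headers(requester, *, accept="application/vnd.github.v3+json",
                  oauth_token=None):
    """Hypothetical helper: return a dict of headers for a GitHub API call."""
    # GitHub requires a User-Agent identifying who is making the request.
    headers = {"user-agent": requester, "accept": accept}
    if oauth_token is not None:
        headers["authorization"] = f"token {oauth_token}"
    return headers


anonymous = build_headers("brettcannon")
authorized = build_headers("brettcannon", oauth_token="abc123")
```

The resulting dict can be handed to whatever HTTP library you like, synchronous or asynchronous.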

If you look at the gidgethub.sansio module you will notice it's basically a utility library that helps either create or consume the parts of an HTTP request or response. The module just deals in strings, dicts, and bytes for the URL, headers, and body of an HTTP packet, respectively. It works out rather well and lets anyone use this code for making HTTP calls to GitHub while freeing me from caring about what HTTP library they prefer.

With gidgethub.sansio providing a base, you can make HTTP requests to GitHub's API like so:

import datetime
import time

import requests
from gidgethub.sansio import create_headers, decipher_response

# `auth` is assumed to hold a GitHub OAuth token.
request_headers = create_headers("brettcannon", oauth_token=auth)
url = "https://api.github.com/repos/brettcannon/gidgethub/issues/1"
response = requests.get(url, headers=request_headers)
status_code = response.status_code
headers = response.headers
body = response.content
data, rate, more = decipher_response(status_code, headers, body)
# Response details are in `data`.
if more:
    if not rate.remaining:
        # Out of quota, so wait until the rate limit resets.
        now = datetime.datetime.now(datetime.timezone.utc)
        wait = rate.reset_datetime - now
        time.sleep(wait.total_seconds())
    response_more = requests.get(more, headers=request_headers)
    # Decipher `response_more` ...

Not horrible, but you certainly wouldn't want to write that out for every REST call you made if you had to make more than one. This is when one would have traditionally chosen an HTTP library and made the HTTP calls on behalf of the user to hide all of these details. But since that doesn't work well for asynchronous code without restricting the library to specific asynchronous frameworks, what does one do?

In the end I decided to look for commonality among the APIs of various HTTP libraries. I blogged about my search and basically found very little that was the same across libraries at the API level (i.e. requests all had arguments named url, headers, and method, but that's where the commonality ended; don't even bother looking for consistency in the returned object). I hoped to find enough commonality that I could rely on a common API in gidgethub, but I wasn't that lucky. In the end I decided that if there wasn't a common API already out there which I could write code to use generically then I would simply create my own solution.

The gidgethub.abc module provides an abstract base class that one can implement for an asynchronous framework. This allows for any asynchronous framework to implement the simple request() method and the sleep() method, giving gidgethub a common API to work with for a nicer, higher-level API. The API that async frameworks need to implement is so simple, in fact, that the return value for the HTTP request is just a tuple of the status code, headers, and body of the response (I only used a tuple since I'm confident the API won't need to grow any more values due to the structure of HTTP, otherwise I would want something more flexible as a return value in case more data might need to be returned in some future change to the API).
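Here is a rough sketch of the shape of that idea. The names GitHubClient, FakeClient, and getitem() are hypothetical and the method signatures are assumptions for illustration, not a copy of gidgethub.abc; the fake implementation returns a canned response so the example runs without a network.

```python
import abc
import asyncio
import json


class GitHubClient(abc.ABC):
    """Hypothetical framework-agnostic base class."""

    @abc.abstractmethod
    async def request(self, method, url, headers, body=b""):
        """Make an HTTP request; return (status_code, headers, body)."""

    @abc.abstractmethod
    async def sleep(self, seconds):
        """Pause using the framework's own sleep function."""

    async def getitem(self, url):
        # Higher-level call built only on the two abstract methods.
        status, _headers, body = await self.request(
            "GET", url, {"accept": "application/json"})
        if status != 200:
            raise RuntimeError(f"unexpected status {status}")
        return json.loads(body.decode("utf-8"))


class FakeClient(GitHubClient):
    """In-memory implementation on top of asyncio; no network involved."""

    async def request(self, method, url, headers, body=b""):
        return 200, {"content-type": "application/json"}, b'{"number": 1}'

    async def sleep(self, seconds):
        await asyncio.sleep(seconds)


url = "https://api.github.com/repos/brettcannon/gidgethub/issues/1"
result = asyncio.run(FakeClient().getitem(url))
```

Supporting a different async framework would mean writing another small subclass with those two methods, while getitem() and anything else built on top stays untouched.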

To verify that this idea of abstracting async frameworks down to just two functions worked, I decided that I would wrap aiohttp myself and include it in gidgethub in the gidgethub.aiohttp module (the whole module is only 12 statements). And then Cory Benfield decided to wrap treq, creating the gidgethub.treq module and acting as a third-party verifier that my idea was not completely stupid (done in 19 statements; 21 if you break out the dict comprehensions into separate statements). 😉

In the end, I think the overall design of gidgethub turned out pretty well. Applying the sans-I/O approach let me expose the core parts of what's required to call GitHub's REST API so others don't have to re-implement them. Building a higher-level API where the underlying HTTP library is abstracted away for easy pluggability also seems to have worked out, with the amount of work for someone familiar with a framework being rather small. My hope is I'm not deluding myself that this is a reasonable design for libraries that need to make async HTTP calls, as I might try proposing the async HTTP equivalent of the database API spec. That would then free up library authors from having to choose an async framework to tie themselves to or having to completely avoid providing any niceties on top of an HTTP call (although I do strongly support doing a sans-I/O approach to begin with for maximum flexibility). I suspect I will have some interesting discussions at PyCon US about this ...