Looking for commonality among HTTP request APIs

It all started when ...

While developing the CLA bot for Python I realized this might not be the last bot I or anyone else will have to write to help with CPython's development once it shifts to GitHub. As such, I began to think about what my ideal GitHub library would look like.

Now you might be thinking that there has to be a library that provides some idiomatic access to GitHub's API and you would be right. If you look at GitHub's list of libraries, you will see there are 10 libraries listed for Python alone. But for me they all have one fatal flaw: none of them are asynchronous, and thus none of them can be used from async code. And so with that, I started to think about how I would design a GitHub library in an asynchronous Python world.

While I thought about this problem I was quickly reminded of how Python moving to an asynchronous world means old assumptions stop working. In this instance, it's assuming I can make HTTP requests occur implicitly behind the scenes in third-party library code. See, in a synchronous world I could develop a library that uses e.g. requests and any users of my library wouldn't really care what HTTP library I used. In a synchronous world, an HTTP request is just another blocking network call, which means what HTTP library I use doesn't impact what HTTP library you use (yes, you can do fancier things like connection pooling that is specific to an HTTP library, but I'm ignoring advanced features for this comparison).

But in an asynchronous world, you suddenly have a framework driving your HTTP requests for your entire application, e.g. aiohttp and its event loop. You can't then use just any old HTTP library behind the scenes for your library; the library author might choose aiohttp but you choose curio for your application code, which means you would need to have two separate frameworks driving asynchronous scheduling and that can get messy. What you end up wanting is uniformity in what your async functions await on so you have a single framework managing everything cleanly and efficiently.

This all means that if you're trying to develop a library that needs to make HTTP requests in an asynchronous world, you suddenly start to care a great deal about having defined APIs that everyone follows (this is also why in an asynchronous world you really care about people handling network protocols sans I/O so you don't force people to choose a specific async framework to support a protocol). Since there aren't tons of HTTP libraries in the async side of the Python world yet, there's a chance to try and standardize on the API now before there are too many libraries out there with slight differences between their APIs. This led me to look at some HTTP libraries in the Python community to see if some common API already exists that I wasn't aware of. A side benefit of looking into this is the search could potentially help curio out with an API design for any HTTP support they may end up providing in the future.

How I performed my search

I started my search by looking for the major HTTP request libraries out there for Python.

In the end I came up with a list of urllib3, requests, Twisted, Tornado, and aiohttp (the last one being the only library created in an async world). The key point is that this group of libraries should represent enough of the Python community's take on HTTP libraries to see if there are any commonalities among them.

I should mention that this blog post should never be viewed as negative towards any library. The work put into these libraries should always be appreciated even if they don't fit the needs of what this blog post is talking about. So thank you to everyone who has contributed to the libraries I'm analyzing; your work is appreciated. I'm also sure I've made some incorrect assumptions and that there are things that could stand to be corrected, so don't assume I'm an expert at the intricacies of HTTP, that I cover every possible detail, or that I'm even right the majority of the time. This entire post is me just trying to answer the question of "is there commonality among the APIs of some HTTP libraries?" for myself.

Looking for commonalities

When designing an API in Python, there are three things you have to consider: input, output, and exceptions. "Inputs" are the parameters and their expected interfaces/types that people will use to call into your library. "Outputs" is what your library will give back for those "inputs". And "exceptions" are what you have to possibly catch and handle if something goes wrong (and when you're dealing with networking, errors are always possible). In all of these cases you're primarily dealing with naming and expected object interfaces/types (it can obviously get more complicated than this, but an HTTP request is already defined somewhat so there's a basic understanding of what an API serving common REST requests would need to look like).

In the end I went through all of the libraries I listed and read their documentation for their latest release. I then filled in a multi-sheet spreadsheet with all the information I could for their APIs regarding making a request, what a response looks like, and what exceptions they may raise (I'm very lucky that I only have to care about what amounts to a single function call and what it returns or raises).

Input

An HTTP request is a simple concept broken up into three parts. The first part is the request line which specifies the HTTP verb/method you're using (e.g. POST), the URL you are requesting, and the version of HTTP you are making the request with (the last of which is more of an implementation detail for the HTTP library and not really your concern as a user of the library). After the request line come the headers, which contain multiple fields ranging from Content-Type to custom fields. And finally there's the body of the request when there's some form of payload, like when you provide JSON in the body of a POST. So to recap, I have the HTTP verb/method, URL, header fields, and the body payload to consider.
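
To make those three parts concrete, here's an illustrative HTTP/1.1 request: the first line is the request line, the following lines are header fields, and everything after the blank line is the body (the URL and payload are made up):

POST /user/repos HTTP/1.1
Host: api.github.com
Content-Type: application/json
Content-Length: 24

{"name": "example-repo"}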

Looking at all the libraries, they do agree on a couple of things when it comes to the fundamentals of a request call. First, they all name the parameter for the HTTP verb/method method and take a str for it. Second, they all take a str for the URL, call that parameter url, and have it as the second parameter. Third, they all have a parameter named headers for header fields.

And that's where the similarities stop. Take the headers parameter, for instance: requests has it as a keyword-only argument, Tornado has it as its third positional parameter, and the rest have it as the fourth positional parameter. The name of the parameter that takes the body isn't consistent, although everyone accepts bytes for it; there's also no agreement on its default value.

There are also extra features an HTTP library can support. A good example of this is whether redirection is automatically followed. Another is supporting a timeout. This means the request call doesn't stop at just what is in the HTTP request directly. But there's inconsistency here too: none of these extra parameters share a name or position across the libraries.

And then there's also the synchronous/asynchronous aspect since aiohttp uses an async with and no one else does.
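
To make the divergence concrete, here's roughly the same GET request in requests versus aiohttp, simplified to the essentials and assuming both libraries are installed (consult each library's documentation for the full story):

import requests
import aiohttp

# Synchronous, with requests:
response = requests.request('GET', 'https://example.com', headers={'accept': 'text/html'})

# Asynchronous, with aiohttp (an event loop must be driving this):
async def fetch() -> bytes:
    async with aiohttp.ClientSession() as session:
        async with session.request('GET', 'https://example.com', headers={'accept': 'text/html'}) as response:
            return await response.read()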

In the end, here's what all of the libraries agree on:

from typing import Dict, Optional

def request(method: str, url: str, *, headers: Dict[str, str] = ..., some_name_for_body: Optional[bytes] = None):
    ...

Output

Basically the only thing the libraries agree on in their return value from the request call is that you get back an object. That's it. Types don't match, attribute names don't match; nothing matches on a technical level.

Semantically at least there is some agreement. You do get back the status code of the HTTP response. All the libraries also return the reason phrase for the status code, but HTTP/2 drops the reason phrase, so it's not really necessary. They all return the header fields of the response, but each uses its own custom class to deal with case-insensitivity and potentially repeated fields. Finally, there's a way to get the body of the response.
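
For instance, just reading the status code differs across the libraries (attribute names as I understand them from each library's documentation):

response.status_code  # requests
response.status       # urllib3, aiohttp
response.code         # Tornado, Twisted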

All of this means I can update the common API to:

from typing import Dict, Optional

def request(method: str, url: str, *, headers: Dict[str, str] = ..., some_name_for_body: Optional[bytes] = None) -> object:
    ...

Exceptions

Unfortunately exceptions aren't much better than the response object. All libraries agree there should be a base exception for all HTTP-related errors, but that's where it stops; the name for the exception and what should be exposed on it varies (e.g. most include the status code, but requests instead attaches the request and/or response objects that triggered the exception).

Three of them have an exception for socket-related errors and for when the connection times out, and beyond that only pairs of libraries agree on having another kind of timeout exception or exceptions for malformed URLs, invalid headers, or proxy errors. All other errors are unique to a single library and can be very specific. So in the end there's no exception consistency.

Manufacturing commonality

Okay, so there's not much commonality at the API level. But the semantics are obviously similar since the libraries are all essentially implementing the same thing, so it would seem possible to come up with an API that makes sense overall and can wrap the libraries' own APIs so they expose a uniform API for users who need it (or that would make sense if one were to design an HTTP library from scratch today).

Input

The first question is whether the API should be a context manager or a function. Since both requests and aiohttp have context managers for better connection management, I'm going to go with that as the best approach. Having said that, to keep the example simple I'm going to simply define a request() function, but you can assume it stands in for the __init__() of an asynchronous context manager that you would use as:

async with request(...) as response:
    ...
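
For the curious, what that implies structurally is just the asynchronous context manager protocol; a bare-bones sketch (all names here are mine):

class _RequestContextManager:

    """Scaffolding for the async with usage above (a sketch)."""

    def __init__(self, method: str, url: str) -> None:
        self._method = method
        self._url = url

    async def __aenter__(self) -> 'Response':  # Response is defined later in this post.
        # Acquire a connection and actually make the request here.
        ...

    async def __aexit__(self, exc_type, exc, tb) -> None:
        # Release the connection, e.g. back to a connection pool.
        ...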

So, what can I start from based on the analysis above (with the starting assumption that I am using an async function)?

async def request(method: str, url: str) -> Response:
    ...

That's enough to basically make an extremely simple GET request, but not much else (just assume Response will be defined later). The first thing to support is header fields. All implementations agree on headers as the parameter name, but there's disagreement on the argument order. To follow the structure of an HTTP request I will make headers the next argument.

from typing import Dict

async def request(method: str, url: str, headers: Dict[str, str] = {}) -> Response:
    ...

The default value could be None, but since an empty dictionary doesn't change the meaning I don't see a need for using None. Using an empty dictionary also keeps the library code simple, as it doesn't have to do an if headers check to cover the None case.

If one were to normalize cleanly across libraries then there would need to be a decision about inserting header values which can be inferred, e.g. content-length. My suspicion is that all the libraries do this for certain fields; I'm just not sure which ones (e.g. do they all automatically insert user-agent if it's left out?).
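
As an illustration, a wrapper could fill in inferable fields itself rather than rely on each library to do it; a sketch (which fields to infer is an open question, and this sidesteps the case-insensitivity issue discussed next):

from typing import Dict

def infer_headers(headers: Dict[str, str], body: bytes) -> Dict[str, str]:
    """Return a copy of the headers with inferable fields filled in."""
    inferred = dict(headers)
    # Only safe if keys are consistently lowercased; see below.
    inferred.setdefault('content-length', str(len(body)))
    return inferred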

There's also the question of what to do about keys that differ only in case. HTTP field names are case-insensitive (and HTTP technically allows multiple headers with the same field name), so there should probably at least be a warning if any of the keys vary by case alone.
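
A warning like that is simple enough to sketch out (a hypothetical helper, not taken from any of the libraries):

import warnings
from typing import Dict

def warn_on_case_variants(headers: Dict[str, str]) -> None:
    """Warn about header fields that differ only by case."""
    seen: Dict[str, str] = {}
    for key in headers:
        folded = key.lower()
        if folded in seen:
            warnings.warn(f'header fields {seen[folded]!r} and {key!r} differ only by case')
        else:
            seen[folded] = key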

With the HTTP verb/method, URL, and headers taken care of, that leaves the body of the request left to specify. Since all the HTTP libraries accept bytes, the only decisions are position and name. Since I already put in the headers argument, the order problem is implicitly solved. But what about the name? The libraries use body, data, and postdata. The HTTP specification uses body. I personally also like body, but to check myself I asked on Twitter what people preferred.

So with my hunch (narrowly) validated, I end up with:

from typing import Dict

async def request(method: str, url: str, headers: Dict[str, str] = {}, body: bytes = b'') -> Response:
    ...

Much like with headers, I'm using the false-y value of the bytes type rather than None to keep the library code simple and not have to special-case None. The suggestion was made that if a data parameter existed it could encode its value based on what content-type was set to, but I don't want to go down the road of registering encoders to manage possible MIME types.

That covers what goes into an HTTP request, but there are always extra features around the request itself. Unfortunately the libraries don't agree on the name or position of a single extra feature's parameter, and not every feature is supported everywhere, so a wrapper would need a special-case failure for when a user asks for something that simply isn't possible with the library they chose.

But one extra that can be added on to any library is automatically following redirects. This can be normalized since any wrapper code could easily capture a 3xx response and then make another request based on the redirect. This does get a little complicated semantically if you make a POST request but receive a 303 redirection, which forces a GET request; in such instances an exception would just need to be raised.
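
Roughly, the wrapper loop would look like this, where _raw_request() is a hypothetical stand-in for whichever library's single-request call is being wrapped:

from typing import Dict

_REDIRECT_CODES = frozenset({301, 302, 303, 307, 308})

async def _request_with_redirects(method: str, url: str, headers: Dict[str, str],
                                  body: bytes, max_redirects: int = 10) -> 'Response':
    """Follow 3xx responses on top of a single-request call (a sketch)."""
    for _ in range(max_redirects):
        response = await _raw_request(method, url, headers, body)  # Hypothetical.
        if response.status_code not in _REDIRECT_CODES:
            return response
        if response.status_code == 303 and method not in {'GET', 'HEAD'}:
            # A 303 forces the method to change to GET; raise rather than guess.
            raise NotImplementedError('303 received for a non-GET request')
        url = response.headers['location']
    raise RuntimeError('too many redirects')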

Assuming those semantic issues could be decided upon, there is once again a name disagreement among the libraries. The options are allow_redirects, follow_redirects, redirect, and followRedirect. Ignoring the last one for not following PEP 8 style (which I always try to follow in new projects), I asked on Twitter what people preferred, with my preference being follow_redirects.

Obviously a very clear winner. There was also an interesting point made on Twitter that allow_redirects can be taken to mean that if a redirect is found an exception would be raised, otherwise the redirect wouldn't be followed but simply returned as the response (ironically Tornado uses follow_redirects as the parameter name but uses these semantics of raising an exception if set to False). I can see where they might reach that interpretation, so that solidifies for me the choice of follow_redirects.

from typing import Dict

async def request(method: str, url: str, headers: Dict[str, str] = {}, body: bytes = b'', *, follow_redirects: bool = True) -> Response:
    ...

The parameter is keyword-only because it's a boolean argument: passed positionally as a literal, the value loses any context of what the heck it's meant to represent. Luckily all the other arguments to this function have easily inferred meanings when passed as literals.
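
In other words, compare what the call site would look like either way (the URL is made up):

# If follow_redirects were positional, the reader would be left guessing what
# the trailing literal controls:
#     await request('GET', 'https://example.com', {}, b'', True)
# Keyword-only forces the intent to be spelled out:
await request('GET', 'https://example.com', follow_redirects=False)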

I also made the default value True since four of the five libraries that support redirection default to True.

And since timeout support is covered by everyone, I will put that in as well, with 0 as the default to represent no timeout, and keyword-only since a literal argument does not provide enough context:

from typing import Dict

async def request(method: str, url: str, headers: Dict[str, str] = {}, body: bytes = b'', *, follow_redirects: bool = True, timeout: float = 0) -> Response:
    ...

And that's it! The input aspect of the API does not support every possible option that all the libraries have, but it's enough for the most common use-cases.
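
Putting it all together, a fully specified call would look something like this (the URL and payload are purely illustrative):

response = await request(
    'POST',
    'https://api.example.com/widgets',
    headers={'content-type': 'application/json'},
    body=b'{"name": "example"}',
    follow_redirects=False,
    timeout=10.0,
)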

Output

Now that I can make a request I need to be able to read the response. As mentioned above, there is unfortunately no agreement among the libraries on what the returned object from our function should look like. That means I am not starting from much:

class Response:
    pass

Much like our request function, I will start from what HTTP responses contain and just work my way forward. That means the status code of the HTTP response is the first thing to worry about. Everyone agrees it should be an int, but what to name the attribute in the response? There's status, code, and status_code in use (the HTTP specification calls it a "status code"). My preference is with status_code. Will Twitter agree?

My streak of Twitter agreeing with me and the HTTP spec continues! That gives me something to start with (and thanks to Python 3.6 and its support for variable annotations I can type the class and its instances!):

class Response:

    status_code: int

Now all of the HTTP libraries provide access to the reason phrase in the response and three out of the five provide the HTTP version used in the response. But since HTTP/2 drops both the reason phrase and version info, I'm going to drop them as well.

Next up is headers. This is a little tricky thanks to how HTTP headers can be repeated. This has led to various custom mappings: urllib3 has a custom class, requests returns a case-insensitive dictionary, Twisted has its own case-insensitive multi-dict, as does Tornado (which uses HTTP-style capitalization), and aiohttp broke its case-insensitive multi-dict out into a separate project. That means consistency will require one of: using a proxy like multidict.CIMultiDictProxy, copying the headers over to a new case-insensitive multi-dict instance, lowercasing all names and only keeping the last value, not changing the keys at all and still only keeping the last value, or doing either of the last two while always using a list as the value to hold multiple values. The case-insensitive multi-dict is probably the safest bet.
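
To show what the multi-dict buys you, here's aiohttp's multidict package handling a repeated, differently-cased field (assuming multidict is installed):

from multidict import CIMultiDict

headers = CIMultiDict()
headers.add('Set-Cookie', 'flavour=gingerbread')
headers.add('set-cookie', 'theme=dark')

headers['SET-COOKIE']          # 'flavour=gingerbread'; lookup ignores case.
headers.getall('Set-Cookie')   # Both values are preserved.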

Regardless of the chosen solution, the type for the headers attribute should be a mapping:

from typing import Mapping

class Response:

    status_code: int
    headers: Mapping[str, str]

A near-consistency is that all but one of the libraries record what URL was requested to produce the response. Names vary between geturl(), url, url_obj, and effective_url. I like url, but since redirects can lead to a different URL than what you started with I can see the appeal of effective_url. Twitter's opinion?

Now an interesting suggestion made as part of this poll was to make the attribute a sequence of URLs, recording any and all redirections taken to reach the final URL. The only issue is that it would require any wrapper to handle redirection itself in order to record the intermediary URLs, but since that might be needed anyway, you might as well get the benefit of recording the redirection sequence.

from typing import Mapping, Sequence

class Response:

    status_code: int
    headers: Mapping[str, str]
    urls: Sequence[str]

This gets me to the final bit: how to get the body from the response. Once again the name varies among the libraries: read(), content, deliverBody(), and buffer. Since reading the body requires I/O, asynchronous code removes the desire for a plain attribute (you could always return an awaitable from an attribute, but that would be weird). Sticking with the name body, just like the matching parameter to the request function, makes sense to me. What say you, Twitter?

So that's interesting as that doesn't align with any of the HTTP libraries. Sticking with what I like, the response object in the end becomes:

import abc
from typing import Mapping, Sequence

class Response(abc.ABC):

    """The response to an HTTP request."""

    status_code: int
    headers: Mapping[str, str]
    urls: Sequence[str]  # `url: str` for simpler solution.

    @abc.abstractmethod
    async def body(self) -> bytes:
        """Read the body of the response."""

Exceptions

Trying to normalize what exceptions are raised is almost a lost cause. Every HTTP library has a base exception that's typically named HTTPError (probably because that's the name of urllib.error.HTTPError), but otherwise there's very little agreement. Three of them have an exception for connection issues (think more along the lines of socket trouble), and those same three have an exception for timeouts. After that everyone has very unique exceptions; one has an exception for proxy scheme errors, another worries about websockets, and yet another has one for HTTP/2 special headers. What is one to do?

What I would probably do is define a base exception that all other exceptions in the library inherit from. I would then have an exception for connection problems and one for timeouts. The real question becomes whether 4xx and 5xx responses warrant an exception or not. If they don't, that's simple: you don't worry about them. If they do, then they should probably have the response as an attribute and nothing more, as the response did come back successfully from an HTTP perspective.

Regardless of how the exception situation is handled, you can always raise the base exception with a cause, i.e. raise Exception(...) from exc. That way the library-specific exception can still be caught in an agnostic fashion without losing any detail should one choose to inspect the original exception.
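
Pulling those thoughts together, here's a minimal sketch of the hierarchy and the chaining (every name here is mine, not any library's):

class HTTPException(Exception):
    """Base exception for all HTTP-related errors."""

class ConnectionFailure(HTTPException):
    """Socket-level trouble, e.g. DNS failure or a refused connection."""

class Timeout(HTTPException):
    """The request did not complete in time."""

def _translate(library_exc: Exception) -> None:
    """Re-raise a library-specific exception without losing any detail."""
    raise HTTPException(str(library_exc)) from library_exc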

Is this even the right abstraction level?

But maybe I don't have to worry about any of the open questions I have because this is all at the wrong abstraction level? Would it be better to do this at the socket level? Both asyncio and curio emulate low-level socket calls, so could you then use h11 and/or hyper-h2 to build an HTTP library at that level and not have to wrap HTTP libraries themselves and instead abstract out the socket calls? Based on things like Nathaniel Smith's comparison of curio and asyncio, I think the answer is probably "no".