Unravelling attribute access in Python

I wonder how many people realize that Python has a lot of syntactic sugar? I'm not claiming it's like a Lisp-based language where the syntax is as bare bones as possible (although the Lisp comparison is not entirely unfounded), but much of Python's syntax isn't technically needed as under the hood a good chunk of it is just function calls.

But so what? Why care about how Python devolves into less syntax and more function calls? There are two reasons, really. One, knowing how Python actually functions is educational and helps you understand and debug when something goes awry. Two, it helps detail the bare minimum you need to implement the language.

And so, to both educate myself and to think about what might be required to implement Python for WebAssembly or a bare bones C API, I am writing this blog post about what attribute access looks like when you look beneath the syntax.

Now you could try to piece together exactly what is going on with attribute access by reading the Python language reference. That might lead you to the attribute reference expression and the data model for customizing attribute access, but there's a lot to try to comprehend and tie together into a single story of how attribute access works. And so I prefer to go through the CPython source code to tease out what is going on in the interpreter (and I'm specifically using the CPython 3.8.3 tag of the repository so I have stable links and am using the latest release at the time of writing).

Now there will be some C code in the beginning of this post, but I don't expect you to fully understand what's going on with it. I will explicitly say what you should get from the C code, so if you don't have any background in C it shouldn't hurt your understanding of what I'm about to talk about.

Looking at the bytecode

OK, so let's try to pull apart the following expression:

obj.attr

Probably the most straightforward place to start is with examining the bytecode for this. So, let's disassemble this line and see what the compiler emits for this:

>>> def example(): 
...     obj.attr
... 
>>> import dis
>>> dis.dis(example)
  2           0 LOAD_GLOBAL              0 (obj)
              2 LOAD_ATTR                1 (attr)
              4 POP_TOP
              6 LOAD_CONST               0 (None)
              8 RETURN_VALUE

The key opcode here is LOAD_ATTR. (In case you're interested, it replaces the object on the top of the stack with the result of accessing the named attribute as specified in co_names[i].)
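If you would rather check for the opcode programmatically than read the disassembly by eye, here is a minimal sketch using dis.get_instructions(). The exact bytecode differs between CPython versions, so the filter below only assumes that the attribute-loading opcode's name starts with LOAD_ATTR; the variable names are mine.

```python
import dis


def example():
    obj.attr  # never executed here; only the compiled bytecode is inspected


# Pair each instruction's opcode name with its resolved argument.
instructions = [(ins.opname, ins.argval) for ins in dis.get_instructions(example)]

# Keep only the attribute-loading instructions.
attr_loads = [(name, arg) for name, arg in instructions if name.startswith("LOAD_ATTR")]
print(attr_loads)
```

The resolved argument of that instruction is the attribute name, which is the entry from co_names mentioned above.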

CPython's interpreter loop is kept in Python/ceval.c. At its core is a massive switch statement that branches based on the opcode to be executed. Looking there you find the following lines of C for LOAD_ATTR:

        case TARGET(LOAD_ATTR): {
            PyObject *name = GETITEM(names, oparg);
            PyObject *owner = TOP();
            PyObject *res = PyObject_GetAttr(owner, name);
            Py_DECREF(owner);
            SET_TOP(res);
            if (res == NULL)
                goto error;
            DISPATCH();
        }
https://github.com/python/cpython/blob/6f8c8320e9eac9bc7a7f653b43506e75916ce8e8/Python/ceval.c#L2963-L2972

Most of that is just stack manipulation code that we can ignore. The key bit is the PyObject_GetAttr() call which is what truly implements attribute access.

That function name looks like another function name ...

Now that name sure looks like getattr(), but in the convention of C function names that CPython uses. Poking around in Python/bltinmodule.c, which houses all of the built-ins in Python, we can check if this hunch is true. Searching that file for "getattr", you find the line which binds the "getattr" name to the builtin_getattr() function.

static PyObject *
builtin_getattr(PyObject *self, PyObject *const *args, Py_ssize_t nargs)
{
    PyObject *v, *name, *result;

    if (!_PyArg_CheckPositional("getattr", nargs, 2, 3))
        return NULL;

    v = args[0];
    name = args[1];
    if (!PyUnicode_Check(name)) {
        PyErr_SetString(PyExc_TypeError,
                        "getattr(): attribute name must be string");
        return NULL;
    }
    if (nargs > 2) {
        if (_PyObject_LookupAttr(v, name, &result) == 0) {
            PyObject *dflt = args[2];
            Py_INCREF(dflt);
            return dflt;
        }
    }
    else {
        result = PyObject_GetAttr(v, name);
    }
    return result;
}
https://github.com/python/cpython/blob/6f8c8320e9eac9bc7a7f653b43506e75916ce8e8/Python/bltinmodule.c#L1060-L1086

There's a bunch of stuff to tease apart parameters and such that doesn't interest us, but you will notice that if you only pass in two arguments to getattr() it ends up calling PyObject_GetAttr().

What does this mean? Well, it means you can directly desugar obj.attr to getattr(obj, "attr")! And that also means that if we can understand PyObject_GetAttr(), then we understand how getattr() works and thus how attribute access works in Python.
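To see the desugaring from the Python side, here is a small sketch comparing the two spellings. The Point class and the raises_attribute_error() helper are made up for illustration.

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y


p = Point(1, 2)

# The dotted syntax and the function call produce the same result.
via_syntax = p.x
via_function = getattr(p, "x")

# Because the name is just a string, it can be chosen at runtime,
# which the dotted syntax cannot express.
coordinates = [getattr(p, name) for name in ("x", "y")]


def raises_attribute_error(thunk):
    """Report whether calling thunk() raises AttributeError."""
    try:
        thunk()
    except AttributeError:
        return True
    return False


# Both spellings fail the same way for a missing attribute.
syntax_fails = raises_attribute_error(lambda: p.z)
function_fails = raises_attribute_error(lambda: getattr(p, "z"))
```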

Unravelling getattr()

At this point I'm going to stop pasting in C code as the complexity of the code just goes up from here and it is no longer serving the purpose of demonstrating that obj.attr is syntax for getattr(obj, "attr"). I will continue to point out the relevant C code as comments in the pseudo-code for those that want to follow along in the bowels of CPython, though. Also note that the Python code should be considered pseudo-code as there is attribute access itself in the code implementing attribute access, but at the C level it isn't passing through normal attribute access machinery. So while you may see a . used syntactically in the pseudo-code, know that at the C level the attribute access is not recursive and is actually functioning the way you might naively assume it would.

What we know so far

At this point we know two things about getattr(). One, it takes (at least) two arguments. Two, the second argument must be an instance of str (subclasses included), and when it isn't, a TypeError is raised with a static string argument (which is probably static for performance purposes).

def getattr(obj: Any, attr: str, default: Any) -> Any:
    if not isinstance(attr, str):
        raise TypeError("getattr(): attribute name must be string")

    ...  # Fill in with PyObject_GetAttr().
Function signature for getattr()
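
Those two facts can be checked against the real built-in. This is a sketch only: the outcome() helper and the Name class are hypothetical, and it reports just the exception type because the exact TypeError message has changed between Python versions.

```python
class Name(str):
    """A str subclass; instances are still acceptable attribute names."""


def outcome(obj, name, *default):
    """Return getattr()'s result, or the name of the exception it raised."""
    try:
        return getattr(obj, name, *default)
    except (TypeError, AttributeError) as exc:
        return type(exc).__name__


non_str_name = outcome("abc", 42)                     # name is not a str
missing = outcome("abc", "nope")                      # no such attribute, no default
missing_with_default = outcome("abc", "nope", None)   # the default is returned
subclass_name = outcome("abc", Name("upper"))()       # a str subclass works as a name
```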

Looking up attributes via special methods

Attribute access on an object is implemented via two special methods. The first method is __getattribute__() which is called when trying to access any and all attributes. The second is __getattr__() which is called when __getattribute__() raises an AttributeError. The former method is (nowadays) always expected to be defined while the latter method is optional.

Python looks up special methods on an object's type, not the object itself. To be clear, I am very specifically using the word "type" here: the type of an instance is its class, and the type of a class is its metaclass (which is type itself unless you have specified otherwise). Luckily it's very easy to get the type of something, since calling type with a single argument returns that object's type: type(obj).
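
A quick way to convince yourself of this is to put a special method on an instance and watch it get ignored. The Box class below is hypothetical and exists only for this sketch.

```python
class Box:
    pass


b = Box()

# A special method stored on the instance is ignored by attribute lookup.
b.__getattr__ = lambda name: "from the instance"
try:
    b.missing
    instance_hook_used = True
except AttributeError:
    instance_hook_used = False

# The same method stored on the type is honoured.
Box.__getattr__ = lambda self, name: "from the type"
type_hook_result = b.missing

# The type of an instance is its class; the type of a plain class is type.
kinds = (type(b) is Box, type(Box) is type)
```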

We also need to know the method resolution order (MRO) of the type. This specifies the order of the type hierarchy for an object. The algorithm used by Python is from the Dylan programming language and it's called C3. From Python code the MRO is exposed by type(obj).mro().
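
As a sketch of what the MRO looks like in practice, consider a small diamond hierarchy. The classes and the first_definer() helper are invented for illustration; the helper mirrors the "first match in MRO order wins" rule used throughout the pseudo-code in this post.

```python
class A:
    who = "A"


class B(A):
    pass


class C(A):
    who = "C"


class D(B, C):
    pass


# C3 linearization puts C before A even though B inherits directly from A.
mro_names = [cls.__name__ for cls in type(D()).mro()]


def first_definer(cls, name):
    """Return the first class in cls's MRO whose own namespace defines name."""
    for base in cls.mro():
        if name in vars(base):
            return base
    return None


definer = first_definer(D, "who")
```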

Working off of an object's type is on purpose as this allows for faster lookup and access. In general it eliminates an extra lookup by skipping the instance every time we look for something. At an internal CPython level it allows for having special methods live in a struct field for very fast lookup. So while it might seem a little odd at first glance to be somewhat ignoring the direct object and to use its type instead, it's very much on purpose.

Now in the name of simplicity I am going to cheat a little and have getattr() handle both __getattribute__() and __getattr__() methods explicitly, while in CPython it does some trickery under the hood to make an object handle both methods itself. In the end, though, the semantics are the same for our purposes.

# Based on https://github.com/python/cpython/tree/v3.8.3.
from __future__ import annotations
import builtins
from typing import Any, Type

NOTHING = builtins.object()  # C: NULL


def getattr(obj: Any, attr: str, default: Any = NOTHING) -> Any:
    """Implement attribute access via  __getattribute__ and __getattr__."""
    # Python/bltinmodule.c:builtin_getattr
    if not isinstance(attr, str):
        raise TypeError("getattr(): attribute name must be string")

    obj_type_mro = type(obj).mro()
    attr_exc = NOTHING
    for base in obj_type_mro:
        if "__getattribute__" in base.__dict__:
            try:
                return base.__dict__["__getattribute__"](obj, attr)
            except AttributeError as exc:
                attr_exc = exc
                break
    # Objects/typeobject.c:slot_tp_getattr_hook
    # It is cheating to do this here as CPython actually rebinds the tp_getattro
    # slot with a wrapper that handles __getattr__() when present.
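    for base in obj_type_mro:
        if "__getattr__" in base.__dict__:
            try:
                return base.__dict__["__getattr__"](obj, attr)
            except AttributeError as exc:
                # Like __getattribute__(), a failing __getattr__() must not
                # prevent the default from being returned.
                attr_exc = exc
                break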

    if default is not NOTHING:
        return default
    elif attr_exc is not NOTHING:
        raise attr_exc
    else:
        raise AttributeError(f"{type(obj).__name__!r} object has no attribute {attr!r}")
Pseudo-code implementing getattr()

Unravelling object.__getattribute__()

While getting an implementation of getattr() is nice, it unfortunately doesn't tell us a whole lot about Python's rules for looking up an attribute since so much is handled in an object's __getattribute__() method. As such, I will cover how object.__getattribute__() works.

Looking for a data descriptor

The first substantial thing we are going to do in object.__getattribute__() is look for a data descriptor on the type. In case you have never heard of descriptors, they are a way to programmatically control how an individual attribute works. Even if the term is new to you, if you have been using Python for a while I suspect you have used descriptors: property, classmethod, and staticmethod are all descriptors.

There are two kinds of descriptors: data and non-data. Both kinds of descriptors define a __get__ method for getting what the attribute should be. Data descriptors also define a __set__ or __delete__ method (or both), while non-data descriptors define neither; property is a data descriptor, while classmethod and staticmethod are non-data descriptors.
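
To make the distinction concrete, here is a sketch that classifies a few descriptors with inspect.isdatadescriptor() from the standard library. The NonData and Data classes are hypothetical examples of each kind.

```python
import inspect


class NonData:
    """Defines only __get__, so it is a non-data descriptor."""

    def __get__(self, instance, owner=None):
        return "non-data"


class Data:
    """Defines __get__ and __set__, so it is a data descriptor."""

    def __get__(self, instance, owner=None):
        return "data"

    def __set__(self, instance, value):
        raise AttributeError("read-only attribute")


is_data = {
    "property": inspect.isdatadescriptor(property()),
    "classmethod": inspect.isdatadescriptor(classmethod(len)),
    "staticmethod": inspect.isdatadescriptor(staticmethod(len)),
    "NonData": inspect.isdatadescriptor(NonData()),
    "Data": inspect.isdatadescriptor(Data()),
}
```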

If we can't find a data descriptor for the attribute on the type, the next place we look is on the object itself. This is straightforward thanks to objects having a __dict__ attribute that stores the attributes of the object itself in a dictionary.
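
Here is a small sketch of what lives in an instance's __dict__ versus its class's, using vars() to get at the dictionaries. The Account class is made up for the example.

```python
class Account:
    currency = "USD"  # stored on the class

    def __init__(self, owner):
        self.owner = owner  # stored on the instance


acct = Account("Ada")
before = dict(vars(acct))  # snapshot of the instance's own attributes

# Assigning through the instance adds an entry to the instance's __dict__;
# the class attribute is left untouched and is now shadowed.
acct.currency = "EUR"
after = dict(vars(acct))
```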

If the object itself doesn't have the attribute then we see if a non-data descriptor exists. We already searched the type for a descriptor earlier, so if one was found but not used during the data descriptor step, it must be a non-data descriptor.

And finally, if we found the attribute on the type and it wasn't a descriptor, we return that. So to summarize, the search order for an attribute is:

  1. Data descriptor on the types
  2. Anything on the object itself
  3. Non-data descriptor on the types
  4. Anything on the types

You will notice that each half of the search starts with a descriptor and then falls back to a plain object: first a data descriptor and then the instance's own data, followed by a non-data descriptor and then a plain class attribute. And data comes before everything else. All of this makes sense when you think about how self.attr = val in an __init__() method stores data on an object: chances are that if you did that, you want that value to win over a method or some other class-level attribute. And you want descriptors first since if you bothered to programmatically define an attribute you probably meant for that to always be used.
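
The four steps of the search order can be observed directly. In this sketch the descriptor classes and Example are hypothetical, and the instance's __dict__ is filled in by hand so that the same names exist both on the instance and on the type.

```python
class DataDesc:
    def __get__(self, instance, owner=None):
        return "data descriptor"

    def __set__(self, instance, value):
        raise AttributeError("read-only attribute")


class NonDataDesc:
    def __get__(self, instance, owner=None):
        return "non-data descriptor"


class Example:
    a = DataDesc()
    b = NonDataDesc()
    c = NonDataDesc()
    d = "class attribute"


obj = Example()
# Write straight into the instance dictionary, bypassing attribute assignment.
vars(obj).update(a="instance", b="instance")

found = {name: getattr(obj, name) for name in "abcd"}
```

The data descriptor wins over the instance entry for a, the instance entry wins over the non-data descriptor for b, and c and d are only reached because the instance has nothing by those names.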

And one detail that I initially overlooked in doing this is that descriptors are objects, too. 😉 What I mean by that is you need to walk the MRO of the descriptor's own type to find the appropriate __get__ method as well as to look for __set__ or __delete__.

def _mro_getattr(type_: Type, attr: str) -> Any:
    """Get an attribute from a type based on its MRO."""
    for base in type_.mro():
        if attr in base.__dict__:
            return base.__dict__[attr]
    else:
        raise AttributeError(f"{type_.__name__!r} object has no attribute {attr!r}")


class object:
    def __getattribute__(self, attr: str, /) -> Any:
        """Attribute access."""
        # Objects/object.c:PyObject_GenericGetAttr
        self_type = type(self)
        if not isinstance(attr, str):
            raise TypeError(
                f"attribute name must be string, not {type(attr).__name__!r}"
            )

        type_attr = descriptor_type_get = NOTHING
        try:
            type_attr = _mro_getattr(self_type, attr)
        except AttributeError:
            pass  # Hopefully an instance attribute.
        else:
            type_attr_type = type(type_attr)
            try:
                descriptor_type_get = _mro_getattr(type_attr_type, "__get__")
            except AttributeError:
                pass  # At least a class attribute.
            else:
                # At least a non-data descriptor.
                for base in type_attr_type.mro():
                    if "__set__" in base.__dict__ or "__delete__" in base.__dict__:
                        # Data descriptor.
                        return descriptor_type_get(type_attr, self, self_type)

        if attr in self.__dict__:
            # Instance attribute.
            return self.__dict__[attr]
        elif descriptor_type_get is not NOTHING:
            # Non-data descriptor.
            return descriptor_type_get(type_attr, self, self_type)
        elif type_attr is not NOTHING:
            # Class attribute.
            return type_attr
        else:
            raise AttributeError(f"{self_type.__name__!r} object has no attribute {attr!r}")
Implementation of object.__getattribute__()
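
To see why the implementation searches the MRO of the descriptor's type for __get__, here is a sketch against real Python. All of the class names are invented: one descriptor inherits __get__ from a base class, while another object merely has a __get__ stored on the instance.

```python
class BaseDescriptor:
    def __get__(self, instance, owner=None):
        return f"computed by {type(self).__name__}"


class ChildDescriptor(BaseDescriptor):
    pass  # __get__ is found further along the MRO


class Plain:
    pass


fake = Plain()
# Stored on the instance, so the descriptor protocol never sees it.
fake.__get__ = lambda instance, owner=None: "never called"


class Holder:
    inherited = ChildDescriptor()
    not_a_descriptor = fake


h = Holder()
inherited_result = h.inherited
plain_result = h.not_a_descriptor
```

The inherited __get__ is honoured, while the instance-level one is ignored and the object itself comes back as a plain class attribute.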

Summary

As you can see, there's a bunch of things going on when looking up an attribute in Python. While I would say no individual part is overly complicated conceptually, all together it does lead to a lot going on. This is also why some people try to minimize attribute access in Python when in very performance-critical code to avoid all of this machinery.
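
One common form of that minimization is to look an attribute up once and reuse the result inside a hot loop. This is only a sketch with made-up function names, and whether it measurably helps depends on the interpreter version and the workload, so measure before relying on it.

```python
def build_squares(n):
    """Build a list of squares, looking up result.append on every iteration."""
    result = []
    for i in range(n):
        result.append(i * i)
    return result


def build_squares_cached(n):
    """Same output, but the bound method is looked up only once."""
    result = []
    append = result.append  # a single attribute access, reused below
    for i in range(n):
        append(i * i)
    return result
```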

And as a historical note, almost all of these semantics came to Python as part of new-style classes compared to "classic" classes. This distinction went away in Python 3 when classic classes were left behind, so if you don't know about classic classes then that's probably a good thing.

Other posts in this series can be found under the "syntactic sugar" tag on this blog. You can find the code above at https://github.com/brettcannon/desugar.

Updates

  • 2020-11-13: The deletion method on data descriptors is __delete__, not __del__ (thanks Géry Ogam).
  • 2020-11-13: Update __getattribute__ implementation to do an MRO search for the descriptor (thanks Géry Ogam).