Unravelling attribute access in Python
I wonder how many people realize that Python has a lot of syntactic sugar? I'm not claiming it's like a Lisp-based language where the syntax is as bare bones as possible (although the Lisp comparison is not entirely unfounded), but much of Python's syntax isn't technically needed as under the hood a good chunk of it is just function calls.
But so what? Why care about how Python devolves into less syntax and more function calls? There are two reasons, really. One is that it's educational to know how Python actually functions, which helps you understand and debug things when something goes awry. Two, it helps detail the bare minimum you need to implement the language.
And so, to both educate myself and to think about what might be required to implement Python for WebAssembly or a bare bones C API, I am writing this blog post about what attribute access looks like when you look beneath the syntax.
Now you could try to piece together exactly what is going on with attribute access by reading the Python language reference. That might lead you to the attribute reference expression and the data model for customizing attribute access, but there's a lot to try to comprehend and tie together into a single story of how attribute access works. And so I prefer to go through the CPython source code to tease out what is going on in the interpreter (and I'm specifically using the CPython 3.8.3 tag of the repository so I have stable links and am using the latest release at the time of writing).
Now there will be some C code in the beginning of this post, but I don't expect you to fully understand what's going on with it. I will explicitly say what you should get from the C code, so if you don't have any background in C it shouldn't hurt your understanding of what I'm about to talk about.
Looking at the bytecode
OK, so let's try to pull apart the following expression:
obj.attr
Probably the most straightforward place to start is with examining the bytecode. So, let's disassemble this line and see what the compiler emits:
>>> def example():
...     obj.attr
...
>>> import dis
>>> dis.dis(example)
  2           0 LOAD_GLOBAL              0 (obj)
              2 LOAD_ATTR                1 (attr)
              4 POP_TOP
              6 LOAD_CONST               0 (None)
              8 RETURN_VALUE
The key opcode here is LOAD_ATTR. (In case you're interested, it replaces the object on the top of the stack with the result of accessing the named attribute as specified in co_names[i].)
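If you would like to poke at the co_names connection yourself, here is a minimal sketch that inspects the compiled function rather than running it. The prefix match on the opcode name is my own assumption to tolerate interpreters that report variant names; only CPython 3.8 is guaranteed to match the listing above exactly.

```python
import dis


def example():
    obj.attr  # never executed; we only inspect the compiled code


# Names referenced by the code object. In CPython 3.8 the argument of
# LOAD_ATTR is an index into this tuple.
names = example.__code__.co_names

# Collect the attribute-loading instructions the compiler emitted.
# (Prefix match is an assumption to tolerate variant opcode names.)
attr_loads = [
    ins for ins in dis.get_instructions(example)
    if ins.opname.startswith("LOAD_ATTR")
]

# argval is the already-resolved attribute name for each instruction.
loaded_names = [ins.argval for ins in attr_loads]
print(names, loaded_names)
```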
CPython's interpreter loop is kept in Python/ceval.c. At its core is a massive switch statement that branches based on the opcode to be executed. Looking there you find the handful of lines of C that implement LOAD_ATTR.
Most of that is just stack manipulation code that we can ignore. The key bit is the PyObject_GetAttr() call which is what truly implements attribute access.
That function name looks like another function name ...
Now that name sure looks like getattr(), but in the convention of C function names that CPython uses. Poking around in Python/bltinmodule.c, which houses all of the built-ins in Python, we can check if this hunch is true. Searching that file for "getattr", you find the line which binds the "getattr" name to the builtin_getattr() function.
There's a bunch of stuff to tease apart parameters and such that doesn't interest us, but you will notice that if you only pass in two arguments to getattr() it ends up calling PyObject_GetAttr().
What does this mean? Well, it means you can directly desugar obj.attr to getattr(obj, "attr")! And that also means that if we can understand PyObject_GetAttr() then we can understand how getattr() works and thus how attribute access works in Python.
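A quick way to convince yourself of the desugaring is to compare both spellings on a throwaway object. The Point class here is purely hypothetical and exists only for the demonstration.

```python
class Point:
    """Hypothetical class used only to have an attribute to look up."""

    def __init__(self, x):
        self.x = x


p = Point(3)

# The dotted syntax and the getattr() call retrieve the very same object.
via_syntax = p.x
via_function = getattr(p, "x")
print(via_syntax is via_function)
```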
Unravelling getattr()
At this point I'm going to stop pasting in C code as the complexity of the code just goes up from here and it is no longer serving the purpose of demonstrating that obj.attr is syntax for getattr(obj, "attr"). I will continue to point out the relevant C code as comments in the pseudo-code for those that want to follow along in the bowels of CPython, though. Also note that the Python code should be considered pseudo-code as there is attribute access itself in the code implementing attribute access, but at the C level it isn't passing through normal attribute access machinery. So while you may see a . used syntactically in the pseudo-code, know that at the C level the attribute access is not recursive and is actually functioning the way you might naively assume it would.
What we know so far
At this point we know two things about getattr(). One is it takes (at least) two arguments. Two, the second argument must be an instance of str (or of a subclass of str), and when it isn't a TypeError is raised with a static string argument (which is probably static for performance purposes).
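Those two facts are enough for a first, very partial sketch. The name desugared_getattr and the _NOTHING sentinel are hypothetical, the error text is illustrative, and the actual lookup is still delegated to the built-in since we haven't unravelled it yet.

```python
from typing import Any

_NOTHING = object()  # hypothetical sentinel meaning "no default supplied"


def desugared_getattr(obj: Any, attr: Any, default: Any = _NOTHING) -> Any:
    """Sketch of the argument checking only; lookup is left to the built-in."""
    if not isinstance(attr, str):
        # A fixed message, echoing the static string mentioned above.
        raise TypeError("getattr(): attribute name must be string")
    if default is _NOTHING:
        return getattr(obj, attr)
    return getattr(obj, attr, default)


print(desugared_getattr("abc", "upper")())
```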
Looking up attributes via special methods
Attribute access on an object is implemented via two special methods. The first method is __getattribute__() which is called when trying to access any and all attributes. The second is __getattr__() which is called when __getattribute__() raises an AttributeError. The former method is (nowadays) always expected to be defined while the latter method is optional.
Python looks up special methods on an object's type, not the object itself. To be clear, I am very specifically using the word "type" here: the type of an instance is its class, and the type of a class is its metaclass (which is type unless you say otherwise). It's luckily very easy to get the type of something thanks to the type constructor returning an object's type when called with a single argument: type(obj).
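A small experiment shows the "type, not the object" rule in action. Greeter is a hypothetical class, and len() stands in for any operation driven by a special method.

```python
class Greeter:
    """Hypothetical class whose type defines a special method."""

    def __len__(self):
        return 1


g = Greeter()
# Attach a same-named callable to the instance; len() will ignore it.
g.__len__ = lambda: 99

print(len(g))       # consults type(g).__len__
print(g.__len__())  # ordinary attribute access finds the instance attribute
print(type(g) is Greeter, type(Greeter) is type)
```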
We also need to know the method resolution order (MRO) of the type. This specifies the order of the type hierarchy for an object. The algorithm used by Python is from the Dylan programming language and it's called C3. From Python code the MRO is exposed by type(obj).mro().
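To make the MRO concrete, here is a tiny diamond-shaped hierarchy. All four class names are hypothetical.

```python
class Base:
    pass


class Left(Base):
    pass


class Right(Base):
    pass


class Child(Left, Right):
    pass


obj = Child()
# The linearized hierarchy that attribute lookup will walk, in order.
mro_names = [cls.__name__ for cls in type(obj).mro()]
print(mro_names)
```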
Working off of an object's type is on purpose as this allows for faster lookup and access. In general it eliminates an extra lookup by skipping the instance every time we look for something. At an internal CPython level it allows for having special methods live in a struct field for very fast lookup. So while it might seem a little odd at first glance to be somewhat ignoring the direct object and to use its type instead, it's very much on purpose.
Now in the name of simplicity I am going to cheat a little and have getattr() handle both __getattribute__() and __getattr__() methods explicitly, while in CPython it does some trickery under the hood to make an object handle both methods itself. In the end, though, the semantics are the same for our purposes.
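Putting the pieces so far together, a getattr() that handles both special methods explicitly might look roughly like the following. Treat it as a sketch under assumptions rather than CPython's actual logic: sketch_getattr, _find_on_type, and _NOTHING are hypothetical names, and the code reads __mro__ directly instead of calling mro() so that it also works when the object is itself a class.

```python
from typing import Any

_NOTHING = object()  # hypothetical sentinel for "not given" / "not found"


def _find_on_type(type_: type, name: str) -> Any:
    """Search the namespaces along the type's MRO, never the instance."""
    for base in type_.__mro__:
        if name in vars(base):
            return vars(base)[name]
    return _NOTHING


def sketch_getattr(obj: Any, attr: str, default: Any = _NOTHING) -> Any:
    if not isinstance(attr, str):
        raise TypeError("getattr(): attribute name must be string")
    obj_type = type(obj)
    try:
        try:
            # Always defined, since object itself provides it.
            return _find_on_type(obj_type, "__getattribute__")(obj, attr)
        except AttributeError:
            # Optional fallback; re-raise if the type doesn't define one.
            fallback = _find_on_type(obj_type, "__getattr__")
            if fallback is _NOTHING:
                raise
            return fallback(obj, attr)
    except AttributeError:
        if default is _NOTHING:
            raise
        return default


class Lazy:
    """Hypothetical class that computes one attribute on demand."""

    def __init__(self):
        self.real = 1

    def __getattr__(self, name):
        if name == "virtual":
            return "computed"
        raise AttributeError(name)


lazy = Lazy()
print(sketch_getattr(lazy, "real"), sketch_getattr(lazy, "virtual"))
```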
Unravelling object.__getattribute__()
While getting an implementation of getattr() is nice, it unfortunately doesn't tell us a whole lot about Python's rules for looking up an attribute since so much is handled in an object's __getattribute__() method. As such, I will cover how object.__getattribute__() works.
Looking for a data descriptor
The first substantial thing we are going to do in object.__getattribute__() is look for a data descriptor on the type. In case you have never heard of descriptors, they are a way to programmatically control how an individual attribute works. Even if the name is new to you, if you have been using Python for a while I suspect you have used descriptors: property, classmethod, and staticmethod are all descriptors.
There are two kinds of descriptors: data and non-data. Both kinds of descriptors define a __get__ method for getting what the attribute should be. Data descriptors also define a __set__ or __delete__ method (or both) while non-data descriptors define neither; property is a data descriptor, classmethod and staticmethod are non-data descriptors.
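You can check that classification yourself with a couple of small helpers. Both function names are hypothetical, and per the rule about special methods they inspect the descriptor's type rather than the descriptor object.

```python
from typing import Any


def is_descriptor(value: Any) -> bool:
    """Hypothetical helper: does the value's type define __get__?"""
    return hasattr(type(value), "__get__")


def is_data_descriptor(value: Any) -> bool:
    """Hypothetical helper: does the value's type define __set__ or __delete__?"""
    kind = type(value)
    return hasattr(kind, "__set__") or hasattr(kind, "__delete__")


def sample():
    """Plain function; functions are descriptors too (that's how methods bind)."""


candidates = {
    "property": property(),
    "classmethod": classmethod(sample),
    "staticmethod": staticmethod(sample),
    "function": sample,
}
for label, value in candidates.items():
    print(label, is_descriptor(value), is_data_descriptor(value))
```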
If we can't find a data descriptor for the attribute on the type, the next place we look is on the object itself. This is a straight-forward thing thanks to objects having a __dict__ attribute that stores the attributes of the object itself in a dictionary.
If the object itself doesn't have the attribute then we see if a non-data descriptor exists. Since we already searched for a descriptor previously we can assume that if it was found but not already used when we looked for a data descriptor then it's a non-data descriptor.
And finally, if we found the attribute on the type and it wasn't a descriptor, we return that. So to summarize, the search order for an attribute is:
- Data descriptor on the types
- Anything on the object itself
- Non-data descriptor on the types
- Anything on the types
You will notice we first look for some kind of descriptor, then if that fails we look for a regular object that matches the kind of descriptor we wanted. And we first look for data, then we look for something else. All of this makes sense when you think about how self.attr = val in an __init__() method is storing data on an object. Chances are that if you did that then you want that before a method or something. And you want descriptors first since if you bothered to programmatically define an attribute you probably meant for that to always be used.
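The search order is easy to observe by deliberately creating name clashes between a class and one of its instances. Demo is a hypothetical class, and writing through vars() is just a way to put entries into the instance dictionary without triggering the property's setter logic.

```python
class Demo:
    """Hypothetical class with one attribute of each interesting kind."""

    @property
    def managed(self):    # data descriptor on the type
        return "from property"

    def method(self):     # functions are non-data descriptors
        return "from method"

    plain = "from class"  # ordinary class attribute


d = Demo()
# Write straight into the instance dictionary to set up the clashes.
vars(d)["managed"] = "from instance"
vars(d)["method"] = "from instance"
vars(d)["plain"] = "from instance"

print(d.managed)  # the data descriptor wins over the instance
print(d.method)   # the instance wins over the non-data descriptor
print(d.plain)    # the instance wins over a plain class attribute
```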
And one detail that I initially overlooked in doing this is that descriptors are objects, too. 😉 What I mean by that is you need to walk the descriptor's own MRO to find the appropriate __get__ method as well as look for __set__ or __delete__.
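Here is how the whole lookup could be sketched in Python, including the MRO walk on the descriptor's own type. This is a sketch of the steps described in this post, not a transcription of CPython: sketch_getattribute, _mro_find, and _MISSING are hypothetical names, and fetching the instance dictionary uses regular attribute access, which the C code does not need to do.

```python
from typing import Any

_MISSING = object()  # hypothetical sentinel for "not found"


def _mro_find(type_: type, name: str) -> Any:
    """Return the first match for name along the type's MRO, or _MISSING."""
    for base in type_.__mro__:
        if name in vars(base):
            return vars(base)[name]
    return _MISSING


def sketch_getattribute(obj: Any, attr: str) -> Any:
    obj_type = type(obj)
    # Find whatever the type hierarchy offers under this name.
    type_attr = _mro_find(obj_type, attr)
    descr_get = _MISSING
    if type_attr is not _MISSING:
        # Descriptors are objects too, so search the descriptor's own type.
        descr_type = type(type_attr)
        descr_get = _mro_find(descr_type, "__get__")
        is_data = (
            _mro_find(descr_type, "__set__") is not _MISSING
            or _mro_find(descr_type, "__delete__") is not _MISSING
        )
        # 1. Data descriptor on the types.
        if descr_get is not _MISSING and is_data:
            return descr_get(type_attr, obj, obj_type)
    # 2. Anything on the object itself (objects without a __dict__ have none).
    instance_dict = getattr(obj, "__dict__", {})
    if attr in instance_dict:
        return instance_dict[attr]
    # 3. Non-data descriptor on the types.
    if descr_get is not _MISSING:
        return descr_get(type_attr, obj, obj_type)
    # 4. Anything on the types.
    if type_attr is not _MISSING:
        return type_attr
    raise AttributeError(
        f"{obj_type.__name__!r} object has no attribute {attr!r}"
    )


class Sample:
    """Hypothetical class for trying the sketch out."""

    @property
    def managed(self):
        return "from property"

    def method(self):
        return "from method"

    plain = "from class"


s = Sample()
vars(s)["managed"] = "from instance"
vars(s)["plain"] = "from instance"
print(sketch_getattribute(s, "managed"), sketch_getattribute(s, "plain"))
```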
Summary
As you can see, there's a bunch of things going on when looking up an attribute in Python. While I would say no individual part is overly complicated conceptually, all together it does lead to a lot going on. This is also why some people try to minimize attribute access in Python when in very performance-critical code to avoid all of this machinery.
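As an illustration of that habit, a common pattern is to look a bound method up once and reuse it inside a hot loop. The function below is a hypothetical example and makes no claim about how much time it saves; measure before relying on it.

```python
def collect_squares(values):
    """Hypothetical hot loop that performs the method lookup only once."""
    result = []
    append = result.append  # one attribute lookup instead of one per iteration
    for value in values:
        append(value * value)
    return result


print(collect_squares(range(5)))
```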
And as a historical note, almost all of these semantics came to Python as part of new-style classes compared to "classic" classes. This distinction went away in Python 3 when classic classes were left behind, so if you don't know about classic classes then that's probably a good thing.
Other posts in this series can be found under the "syntactic sugar" tag on this blog. You can find the code above at https://github.com/brettcannon/desugar.
Updates
- 2020-11-13: The deletion method on data descriptors is __delete__, not __del__ (thanks Géry Ogam).
- 2020-11-13: Update __getattribute__ implementation to do an MRO search for the descriptor (thanks Géry Ogam).