Unravelling Python's classes
For the second-to-last post in my syntactic sugar series (the final post will be a summary/conclusion post), I am tackling what I expect is my largest and most complicated topic: the `class` statement (although I'm too lazy to check if that statement is true).
Lucky for me, the language reference outlines what class creation entails as part of explaining metaclasses, which makes my life a bit easier and will act as an outline for this blog post:
- MRO entries are resolved;
- the appropriate metaclass is determined;
- the class namespace is prepared;
- the class body is executed;
- the class object is created.
To help compartmentalize this large topic, we will tackle each of those bullet points in its own section of this post.
Resolving MRO entries
The method resolution order (aka MRO) is the sequence of classes used to look up attributes: the class itself, its direct base classes, and all indirect superclasses (Guido says the term "method resolution order" comes from a book proposing a proprietary extension to C++ that adds metaclasses). Unfortunately, while the docs call this "Resolving MRO entries", it's actually not about the MRO, but instead about the direct base classes that your class inherits from (i.e. the stuff between the parentheses after your class's name; `class A(B, C)` has `B, C` as the direct base classes).
This step is explicitly about taking the list of base classes for your class and making sure you really have the base classes you want to work with (i.e. for `class A(B, C): pass`, are `(B, C)` really the base classes for `A`?). There's a decent chance you may be feeling 🤨 right now about that statement; how can the list of base classes not be the base classes you mean?
It turns out that PEP 560, in order to support generic types for typing, made it so that any non-class objects you list in your base classes can define an `__mro_entries__` method which takes the original list of base classes and returns a tuple to replace the non-class object. This means there is some code to call to make sure you know exactly what your base classes should be (think of `list[int]` returning a non-class object; how do you make sure Python knows that the base class is `list`?). As an example, you can do this:
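A minimal sketch of the idea follows; `ListAlias` and `IntList` are made-up names for illustration, standing in for the kind of non-class object that `list[int]` produces.

```python
import types


class ListAlias:
    """Hypothetical non-class object that wants `list` as the real base."""

    def __mro_entries__(self, bases):
        # `bases` is the original tuple of bases from the class statement.
        # The returned tuple replaces this object in that tuple.
        return (list,)


class IntList(ListAlias()):  # note: an *instance* is listed as the base
    pass


# The instance was swapped out for `list` ...
assert IntList.__bases__ == (list,)
# ... while the bases as originally written are kept in `__orig_bases__`.
assert isinstance(IntList.__orig_bases__[0], ListAlias)

# `types.resolve_bases()` performs the same substitution on its own.
resolved = types.resolve_bases((ListAlias(),))
assert resolved == (list,)
```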
You can look at the source for `types.resolve_bases()` to see the algorithm for resolving the MRO entries implemented in Python (the C code exists in `type_new()`).
Determining the appropriate metaclass
We need an accurate list of base classes as our first step because the next step is determining the appropriate metaclass. In case you're not familiar with metaclasses, you can think of them as the classes you use to create classes; the class of the class (yes, it's very meta). Just try to remember that the point of this post is that `class` is just syntactic sugar and classes are just another kind of object, so there needs to be something to call to make your class object. There is a 3-step process to determining the appropriate metaclass:
- if there are no base classes and no explicit metaclass is given, then `type` is used (e.g. `class Example: ...`);
- if an explicit metaclass is given and it is not an instance of `type`, then it is used directly as the metaclass (e.g. `class Example(metaclass=NotType): ...`);
- if an instance of `type` is given as the explicit metaclass, or there's no metaclass specified while base class(es) are, then the most derived metaclass is used (e.g. `class Example(metaclass=TypeSubclass): ...` or `class Example(A, B): ...`).
The first two steps are straightforward. It's that last step that's a bit confusing: what does "most derived" mean? It's basically a fancy way of saying the class that is a subclass of all the other candidates. As an example, assume we have `class A: pass` and `class B(A): pass`. In this instance, `B` is more derived than `A` because `issubclass(B, A)` is true. So what step 3 is doing is taking the specified metaclass (or `type` as a default), and then seeing whether any of the base classes have a metaclass which is more of a subclass than the others; if no single candidate is a subclass of all the others, a `TypeError` is raised. You can see this implemented in Python as part of `types.prepare_class()` (the C code exists in `_PyType_CalculateMetaclass()` as called by `type_new()`).
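To make "most derived" concrete, here is a small sketch; every class name in it is made up for illustration.

```python
import types


class MetaA(type):
    pass


class MetaB(MetaA):
    pass


class A(metaclass=MetaA):
    pass


class B(metaclass=MetaB):
    pass


# No explicit metaclass, so the most derived metaclass of the bases wins.
class C(A, B):
    pass


assert type(C) is MetaB

# `types.prepare_class()` performs the same calculation.
meta, ns, kwds = types.prepare_class("C", (A, B))
assert meta is MetaB


# Two unrelated metaclasses: neither is a subclass of the other.
class MetaX(type):
    pass


class X(metaclass=MetaX):
    pass


try:
    class Broken(A, X):
        pass
except TypeError as exc:
    conflict = exc  # reported as a metaclass conflict
```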
Preparing the class namespace
As part of PEP 3115, metaclasses gained the ability to specify the object to use for the class namespace while the class is being constructed (although ultimately a `dict` is used in the final class object). This is what allowed people to use ordered dictionaries to keep track of the declaration order of things in a `class` statement before dictionaries began preserving insertion order starting in Python 3.7.
If the metaclass specifies a `__prepare__(name, bases, **kwds)` attribute that's callable (e.g. a function or `staticmethod`), its return value is used as the namespace; an empty dictionary is used in the default case. The Python code in `types.prepare_class()` shows this in action.
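As a sketch of what `__prepare__` makes possible, the metaclass below hands out a namespace that logs the order in which names are bound. `RecordingDict`, `OrderedMeta`, and `_definition_order` are hypothetical names, and `__prepare__` is written as a `classmethod` here, which is what the language reference recommends.

```python
class RecordingDict(dict):
    """Hypothetical namespace that logs the order in which names are bound."""

    def __init__(self):
        super().__init__()
        self.order = []

    def __setitem__(self, key, value):
        if key not in self:
            self.order.append(key)
        super().__setitem__(key, value)


class OrderedMeta(type):
    @classmethod
    def __prepare__(mcls, name, bases, **kwds):
        # Whatever is returned here is used to execute the class body.
        return RecordingDict()

    def __new__(mcls, name, bases, namespace, **kwds):
        # The final class still gets a plain dict.
        cls = super().__new__(mcls, name, bases, dict(namespace), **kwds)
        cls._definition_order = tuple(
            key for key in namespace.order if not key.startswith("__")
        )
        return cls


class Example(metaclass=OrderedMeta):
    x = 1

    def method(self):
        return self.x

    y = 2


assert Example._definition_order == ("x", "method", "y")
```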
Executing the class body
This is where things get tricky. The language reference says that in creating the class,
The class body is executed (approximately) as exec(body, globals(), namespace).
It's that "approximately" part which makes things ... interesting.
Scoping
The difficulty in executing a class body is that the scoping rules are not normal Python scoping rules (e.g. as if the code was all executing inside a function). For instance, while class attributes can see each other:
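A minimal illustration, with made-up attribute names:

```python
class Example:
    a = 3
    b = a + 2  # `a` is visible to later statements in the class body


assert Example.b == 5
```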
That same scoping does not apply within methods:
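A small sketch of the failure, assuming there is no module-level variable named `class_level` (the names are made up for illustration):

```python
class Example:
    class_level = 3

    def method(self):
        # The class body is not an enclosing scope for the method, so this
        # lookup skips it and goes straight to globals and builtins.
        return class_level


try:
    Example().method()
except NameError as exc:
    error = exc
```

Inside a method the attribute has to be reached explicitly, e.g. via `self.class_level` or `Example.class_level`.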
But then scoping in methods does work past the `class` statement (e.g. you can still access module globals from within a method defined in a class), so it's like methods pretend class-level definitions don't exist, but everything else is fine. Do note, though, I'm glossing over classes defined within functions as those classes and their methods can access stuff within the enclosing function just fine (plus you can do crazy stuff like insert things into the class scope to shadow stuff at the function scope; usual dynamic execution shenanigans when using CPython).
And to add another wrinkle, there's an implicit `__class__` variable which points to the executing class itself that is accessible only within methods:
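A short sketch of that variable in use (the class and method names are illustrative):

```python
class Example:
    def method(self):
        # `__class__` is never assigned anywhere in the source;
        # the compiler provides it to the method.
        return __class__


assert Example().method() is Example
```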
And all of this has to work without changing the definition order of things; remember that one of the reasons `__prepare__` exists is to allow you to record the definition order of attributes.
Isolating code
To unravel this, let's assume we have a custom function per class named `_exec_<name>(_ns)` whose only argument is the object we got from the "preparing the class namespace" step above. One thing we will have to do is make sure the parameter name of the function is unique as that would otherwise leak an implementation detail (for our purposes we will assume `_ns` is unique). This function needs to be defined at the same scope level as the `class` statement to make sure Python's scoping continues to work (e.g. those global variables still get picked up).
As a running example, let's translate the following class:
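The listing below is reconstructed from the unravelled code shown in the "Special attributes" section, so treat the exact layout as an assumption:

```python
class Example:
    """Docstring."""

    a: int = 3
    b = a + 2

    def c(self):
        return 42

    d = b * 2
```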
That means we want to start with a function named `_exec_Example()`:
Using the namespace
Now we can start executing the class body! Unfortunately we have to be very careful about the scoping of names; we need to make sure that the class-level attributes don't leak into the methods. To do that we make sure to directly assign class-level attributes to the namespace object so the attribute is never given a name within our custom function. We also rewrite the class-level code so any references to class-level attributes use the namespace object (we don't have to worry about methods since they can't see the objects regardless!).
Unfortunately we need to add a bit more complexity still. We need to use a temporary namespace to access values because we can't guarantee that the namespace object we were given doesn't have side-effects when using it (see my post on subscriptions to see how that's a possibility).
We also need to deal with methods in a special way. The tricky bit with methods is making sure they are hidden from each other. I originally wanted to delete the methods after they were defined, but then Python would make the method name inaccessible later which would mess things up if that name is used in a different scope (e.g. the same name used as a global variable in the module). And so I use a custom name for the method (I prepend `_method_` to the name), patch `__name__` and `__qualname__` to what they should be (much like `functools.wraps`, although it doesn't update `__code__.co_name` which could be waved away as an implementation detail), and then store the method in the namespace.
Pulling all of that together looks like the following:
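This sketch is derived from the fuller version in the "Special attributes" section with the special attributes left out, so the annotation on `a` is not handled yet:

```python
def _exec_Example(_ns):
    # Reads go through `_temp_ns` so `_ns` is only ever written to.
    _temp_ns = {}

    _temp_ns["a"] = _ns["a"] = 3
    _temp_ns["b"] = _ns["b"] = _temp_ns["a"] + 2

    # The method gets a private name so it never shadows anything,
    # then has its visible names patched.
    def _method_c(self):
        return 42
    _method_c.__name__ = "c"
    _method_c.__qualname__ = "Example.c"
    _temp_ns["c"] = _ns["c"] = _method_c
    del _method_c

    _temp_ns["d"] = _ns["d"] = _temp_ns["b"] * 2


namespace = {}
_exec_Example(namespace)
assert list(namespace) == ["a", "b", "c", "d"]  # definition order is kept
```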
Special attributes
But we're actually still not done! The language reference (at least currently) doesn't mention the various special attributes of a class whose values can only be known at parse/compile/creation time (i.e. attributes the metaclass could never be expected to calculate the value). Specifically, `__module__` and `__qualname__` are always set in the namespace at the start. Also, depending on whether they are needed/defined, `__doc__` and `__annotations__` also need to be set upfront. Taking our example and adding in these special attributes, we now get:
def _exec_Example(_ns):
    _temp_ns = {}
    # Get `__module__` from the global `__name__` of the module.
    _temp_ns["__module__"] = _ns["__module__"] = __name__
    # Adjust `__qualname__` as necessary due to nesting.
    _temp_ns["__qualname__"] = _ns["__qualname__"] = "Example"
    _temp_ns["__doc__"] = _ns["__doc__"] = """Docstring."""
    # `from __future__ import annotations` affects whether the values
    # are always strings.
    _temp_ns["__annotations__"] = _ns["__annotations__"] = {"a": int}
    _temp_ns["a"] = _ns["a"] = 3
    _temp_ns["b"] = _ns["b"] = _temp_ns["a"] + 2
    def _method_c(self):
        return 42
    _method_c.__name__ = "c"
    _method_c.__qualname__ = "Example.c"
    _temp_ns["c"] = _ns["c"] = _method_c
    del _method_c
    _temp_ns["d"] = _ns["d"] = _temp_ns["b"] * 2
I know this can seem like a lot, but it can all be done statically and mechanically. The C code for `__module__`, `__qualname__`, and `__doc__` lives in `type_new()`.
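One way to observe these attributes arriving ahead of the metaclass call is a metaclass that snapshots the namespace it receives. `SpyMeta` is a hypothetical helper, and the sketch deliberately does not inspect annotations since how they are stored differs between Python versions.

```python
class SpyMeta(type):
    """Hypothetical metaclass that records the namespace it is handed."""

    def __new__(mcls, name, bases, namespace, **kwds):
        mcls.seen = dict(namespace)
        return super().__new__(mcls, name, bases, namespace, **kwds)


class Example(metaclass=SpyMeta):
    """Docstring."""

    a = 3


# All of these were in the namespace before `SpyMeta` did any work.
assert "__module__" in SpyMeta.seen
assert SpyMeta.seen["__qualname__"] == "Example"
assert SpyMeta.seen["__doc__"] == "Docstring."
```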
Creating the class object
The language reference for creating the class object is surprisingly straight-forward for the general case:
- Call `metaclass(name, bases, namespace, **kwds)`
- Pass the class to the class decorators
- Make sure that `__dict__` gets set to a dict and, when accessed as an attribute, returns a read-only proxy
That all seems simple (and it was this simple way back when), but it's that step 1 which hides a ton of details thanks to Python providing a default implementation in `type`.
`type.__new__()` does a lot
One of the details is the calculation of the method resolution order. Technically this is done by calling the `mro()` method of the metaclass whose result is stored as `__mro__` on the class. Python uses something called the C3 algorithm from Dylan. You can look at the C code or the example Python code from when Python adopted the C3 algorithm if you're curious as to how it works.
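For a feel of the result, here is the classic diamond hierarchy (class names are illustrative):

```python
class A:
    pass


class B(A):
    pass


class C(A):
    pass


class D(B, C):
    pass


# C3 keeps the left-to-right order of the bases and never lists a class
# before one of its subclasses, so `A` comes after both `B` and `C`.
assert D.__mro__ == (D, B, C, A, object)

# `mro()` on the metaclass is what computes it; it returns a list.
assert D.mro() == list(D.__mro__)
```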
There is also setting `__class__` and `__bases__` on the class, which are both pretty straightforward since they are passed to the metaclass as part of calling it.
There is also the handling of `__set_name__()` on descriptors. This is called on each object in the final namespace that defines it, telling the descriptor which class owns it and what name it was bound to. If you're curious to see a use of this feature, see the implementation for `functools.cached_property`.
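A small sketch of the hook in action; `Named` is a hypothetical descriptor written for this example.

```python
class Named:
    """Hypothetical descriptor that remembers where it was bound."""

    def __set_name__(self, owner, name):
        # Called by `type.__new__()` once the owning class exists.
        self.owner = owner
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        return f"{self.owner.__name__}.{self.name}"


class Example:
    field = Named()


assert Example.field.name == "field"
assert Example.field.owner is Example
assert Example().field == "Example.field"
```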
Next, "the `__init_subclass__()` hook is called on the immediate parent of the new class in its method resolution order." Because `__init_subclass__()` is called on the parent of a class, it allows the parent to influence any future subclasses.
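As an illustration, a parent can use the hook to register its subclasses; `Plugin`, `Audio`, and the `tag` keyword are made up for this sketch.

```python
class Plugin:
    registry = []

    def __init_subclass__(cls, /, tag=None, **kwargs):
        # Extra keyword arguments from the class statement end up here.
        super().__init_subclass__(**kwargs)
        cls.tag = tag
        Plugin.registry.append(cls)


class Audio(Plugin, tag="audio"):
    pass


assert Plugin.registry == [Audio]
assert Audio.tag == "audio"
```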
After all of this, Python will do steps 2 and 3. Step 2 is much like unravelling function decorators, so I won't cover those here. And step 3 requires a proxy object that I'm too lazy to implement, so just pretend we have such a thing and that `__dict__` is a descriptor that wraps the dictionary with the proxy.
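For what it's worth, the standard library exposes the proxy type as `types.MappingProxyType`, so the pretend proxy could plausibly stand for something like it. The sketch below only observes the behaviour on an ordinary class:

```python
import types


class Example:
    a = 3


# Reading `__dict__` off a class gives back a read-only view.
assert type(Example.__dict__) is types.MappingProxyType

try:
    Example.__dict__["a"] = 4
except TypeError as exc:
    error = exc  # the proxy refuses item assignment

# Attribute assignment is the supported way to change the class.
Example.a = 4
assert Example.__dict__["a"] == 4

# The same type can wrap any mapping of your own.
proxy = types.MappingProxyType({"x": 1})
assert proxy["x"] == 1
```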
What about `__classcell__` and `super()`?
There is a CPython implementation detail where `__class__` gets passed in as `__classcell__` to the metaclass. All of this is part of supporting zero-argument `super()` calls, as Python does some trickery there by reading an injected `__class__` variable in that call. Unfortunately making this work for us is impossible. Since `__class__` is set to the class being used and not the class that defined the code calling `super()`, we don't have a way to dynamically set `__class__` to the right value in each and every method. As such, this is one case where we can't match the semantics of Python via unravelling. 🤷
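The machinery can at least be observed from the outside. This sketch relies on CPython behaviour, and `SpyMeta`, `Base`, and `Child` are hypothetical names:

```python
class SpyMeta(type):
    """Hypothetical metaclass that checks for the injected cell."""

    def __new__(mcls, name, bases, namespace, **kwds):
        mcls.saw_classcell = "__classcell__" in namespace
        # The namespace has to be passed along untouched so that
        # `type.__new__()` can fill in the cell.
        return super().__new__(mcls, name, bases, namespace, **kwds)


class Base:
    def greet(self):
        return "base"


class Child(Base, metaclass=SpyMeta):
    def greet(self):
        return "child+" + super().greet()


# The compiler added `__classcell__` because a method uses `super()`.
assert SpyMeta.saw_classcell
# The method closes over `__class__` even though the source never names it.
assert "__class__" in Child.greet.__code__.co_freevars
assert Child().greet() == "child+base"
```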
Putting it all together
Let's use a simple example to bring this all together.
class Example(SuperClass):
    """Docstring."""
    a: int = 3
    def c(self): return 42
This class unravels to:
import types
def _exec_Example(_ns):
    _temp_ns = {}
    _temp_ns["__module__"] = _ns["__module__"] = __name__
    _temp_ns["__qualname__"] = _ns["__qualname__"] = "Example"
    _temp_ns["__doc__"] = _ns["__doc__"] = """Docstring."""
    _temp_ns["__annotations__"] = _ns["__annotations__"] = {"a": int}
    _temp_ns["a"] = _ns["a"] = 3
    def _method_c(self):
        return 42
    _method_c.__name__ = "c"
    _method_c.__qualname__ = "Example.c"
    _temp_ns["c"] = _ns["c"] = _method_c
    del _method_c
def _class_Example():
    # Resolving MRO entries.
    bases = types.resolve_bases((SuperClass, ))
    # Determining the appropriate metaclass **and**
    # preparing the class namespace.
    meta, ns, kwds = types.prepare_class("Example", bases)
    # Executing the class body.
    _exec_Example(ns)
    # Creating the class object.
    cls = meta("Example", bases, ns, **kwds)
    ## Class decorators, if there were any.
    ## Make the namespace read-only. `read_only_proxy()` is the pretend
    ## helper from above; `type.__new__()` already does this for real,
    ## so the line is left commented out.
    # cls.__dict__ = read_only_proxy(ns)
    return cls
Example = _class_Example()
As I stated earlier, this isn't a perfect unravelling due to argument-free `super()` requiring an injected `__class__` value which is a runtime thing. But otherwise this covers how to unravel the `class` statement!
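As a cross-check, the standard library bundles the same steps into `types.new_class()`, which accepts a callback that fills in the namespace. The sketch below assumes a trivial `SuperClass` and sets `__module__` by hand, since otherwise the class would be attributed to the `types` module.

```python
import types


class SuperClass:
    pass


def _exec_Example(ns):
    ns["__module__"] = __name__
    ns["__doc__"] = """Docstring."""
    ns["a"] = 3

    def _method_c(self):
        return 42
    _method_c.__name__ = "c"
    _method_c.__qualname__ = "Example.c"
    ns["c"] = _method_c


# Resolves the bases, picks the metaclass, prepares the namespace,
# runs the callback, and finally calls the metaclass.
Example = types.new_class("Example", (SuperClass,), exec_body=_exec_Example)

assert Example.__mro__ == (Example, SuperClass, object)
assert Example().c() == 42
```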
Acknowledgements
A special thanks to Guido for proof-reading my initial draft. He pointed out various bits that needed clarification and improvement while providing some historical details.