Proposing a struct syntax for Python

Story time

When I go on vacation with a fellow Python developer, inevitably I will talk about Python. 😁 Back in September, Andrea and I drove the Cabot Trail with our friends Dusty and Jen, which led to a discussion about the match statement and how Dusty and I both wished more people would use the syntax. We created the match-variant package to try and help, but what Dusty and I really wanted was some syntax to make it easier to construct algebraic data types (ADTs) like Rust's enum type to facilitate type checkers in performing variant exhaustion checking (e.g., if your ADT had the directions of the compass, how do you make sure you don't skip the "west" case in your match statement?). At the same time, I was trying to think of a way to potentially meld this with syntax to replace the enum module to provide a simpler, stable representation of the same concept (the enum module has had some tweaks made that had to be rolled back due to compatibility concerns in the last couple of releases).

While I was trying to think this through, I also realized people would inevitably abuse whatever I came up with to act like dataclasses. That then made me wonder whether an ADT could be built with dataclasses as a base. I then had a realization that you could either construct an ADT and then "fake" dataclasses by having a single variant (e.g., your ADT had a single variant called Point), or you could make it easy to construct a dataclass and then somehow tell type checkers via a union that all of these dataclasses should be considered a single ADT for the purposes of type checking (e.g., you had East, West, North, and South and then created a union like Compass = typing.Union[East, West, North, South]).

And then two things happened to cause my mind to snap to the latter, dataclass-like solution. One, Glyph blogged about the idea of syntax support for dataclasses, which made me think I was on to something. And two, Hynek Schlawack and Bruce Eckel both gave talks at PyCon US 2023, about composition and about how object-oriented programming isn't the be-all and end-all of programming, respectively (which also aligns with my blogging about functions versus classes back in 2017). Those talks independently backed up some ideas I had which I thought might be considered extreme, but which I now think are sensible for this idea.

So, I want some syntax to make creating data-only classes extremely easy.

Goals

I want syntax that will keep people from using collections.namedtuple() just for its attribute API and ignoring/regretting its index API. Or put another way, I want syntax to replace the simple, data-only uses of dataclasses. An analogy to other languages is C structs, Rust structs, and Go structs.

Because of the data-only orientation, I want it to be comprehensible by beginners after they learn functions but before they have to learn the complexity of classes. This should also facilitate using composition over inheritance in general.

Performance should also be a key focus, as I could see this being used a lot for objects that each store a small amount of data, but with a huge number of instances in total.

Typing should be properly supported from the start, but not required. And structural/duck typing should be how runtime semantics operate.

The instances should be immutable. That fits with replacing collections.namedtuple(), makes instances easier to reason about, and lets them be used as keys in a dictionary. Tied into the composition angle, this leans a bit more into the functional programming side of Python.

No inheritance. Other languages also seem to get by fine without inheritance for this sort of data structure.

No methods. That's because methods are not that important if you don't have inheritance. At that point you might as well use a function and you can use a module or something to namespace things if you need. See Hynek and/or Bruce's talk for more reasons why methods are not necessary.

The proposal

I want to introduce the new (soft) keyword struct (the name was chosen by my wife, Andrea, as more understandable than data once you explain struct is short for structure and how it's used in other programming languages). You would follow the keyword with the name of the struct class. The parentheses after the name would contain the positional-or-keyword and/or keyword-only parameters the constructor would accept. Each parameter name would map directly to the name of the attribute the argument is saved to. Because of this mapping of parameter name to attribute name, no positional-only parameters are allowed (the only deviation from function declaration syntax).

As an example, a struct called Point which had x and y parameters could be defined as:

struct Point(x: int, y: int)
Example of the proposed struct syntax

This would essentially lead to the definition of a struct class which has:

  • __slots__, for performance
  • __match_args__, but only for positional-or-keyword parameters (you can still match on keyword-only parameters by name)
  • __repr__(), and whose result is valid for eval() (if possible)
  • __eq__(), based on __slots__ instead of isinstance() checks to support structural/duck typing
  • __hash__()
  • asdict(), to easily get the data out in another data structure (much like dataclasses.asdict())
  • replace(), to make it easy to construct new instances with minor changes without having to call asdict() on the original instance, update the dict however you need to, and then construct a new instance; this method is much like dataclasses.replace()

The equivalent Python code for our Point example would be:

from typing import Any, Self, TypedDict, Unpack


class _AsDictResult(TypedDict):
    """Return type of Point.asdict()."""

    x: int
    y: int


class _ReplaceParams(TypedDict, total=False):
    """For typing Point.replace()."""

    x: int
    y: int


class Point:
    __slots__ = ("x", "y")
    __match_args__ = ("x", "y")

    def __new__(cls, x: int, y: int) -> Self:
        """Create a new, immutable instance."""
        # Pretend this makes everything immutable in the end.
        self = mutable(cls.__slots__)
        self.x = x
        self.y = y
        return immutable(self)

    def __repr__(self) -> str:
        """Return the repr.

        Should be able to create a new instance via eval() if all attributes
        have a repr allowing for such usage.

        """
        params = []
        for attr in type(self).__slots__:
            params.append(f"{attr}={getattr(self, attr)!r}")
        return f"{type(self).__qualname__}({', '.join(params)})"

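    # NotImplemented is a value, not a type, so it can't appear in the
    # annotation (`bool | NotImplemented` raises a TypeError at definition).
    def __eq__(self, other: Any) -> bool: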
        """Check for equality.

        The comparison is done per-attribute to allow for duck typing (i.e.,
        nominal typing is not used as a shortcut for comparing).

        """
        attrs = frozenset(type(self).__slots__)
        other_attrs = frozenset(getattr(type(other), "__slots__", [object()]))
        if attrs != other_attrs:
            # Avoids the question of what to do if there are extra attributes
            # on `other`.
            return NotImplemented

        for attr in type(self).__slots__:
            if not hasattr(other, attr):
                return NotImplemented
            elif getattr(self, attr) != getattr(other, attr):
                return False
        else:
            return True

    def __hash__(self) -> int:
        """Hash the instance."""
        attr_tuple = tuple(getattr(self, attr) for attr in type(self).__slots__)
        return hash(attr_tuple)

    def asdict(self) -> _AsDictResult:
        """Return a dict representation of the object."""
        return {attr: getattr(self, attr) for attr in type(self).__slots__}

    def replace(self, **new_data: Unpack[_ReplaceParams]) -> Self:
        """Return a new instance, replacing value as appropriate.

        Raise a TypeError if any unexpected keys are provided.

        """
        current_data = self.asdict()
        if diff := set(new_data.keys()).difference(current_data.keys()):
            raise TypeError(
                f"{type(self).__qualname__}.replace() called with unexpected arguments: "
                f"{', '.join(sorted(diff))}"
            )
        else:
            return type(self)(**(current_data | new_data))
Unravelling of the proposed struct class
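The duck-typed equality is probably the most unusual part of that listing, so here is a self-contained sketch of just that behaviour. The slots_equal() helper and the Point, Coordinate, and Point3D classes are hypothetical, and immutability is ignored for brevity.

```python
def slots_equal(a, b):
    """Compare two objects attribute-by-attribute, based on __slots__."""
    a_slots = getattr(type(a), "__slots__", None)
    b_slots = getattr(type(b), "__slots__", None)
    if a_slots is None or b_slots is None:
        return NotImplemented
    if frozenset(a_slots) != frozenset(b_slots):
        return NotImplemented
    return all(getattr(a, name) == getattr(b, name) for name in a_slots)


class Point:
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        return slots_equal(self, other)


class Coordinate:
    """Unrelated to Point by inheritance, but with the same attributes."""

    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        return slots_equal(self, other)


class Point3D:
    __slots__ = ("x", "y", "z")

    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

    def __eq__(self, other):
        return slots_equal(self, other)


# Same attribute names and values: equal, regardless of the class.
assert Point(1, 2) == Coordinate(1, 2)
# Different values, or a different set of attributes: not equal.
assert Point(1, 2) != Coordinate(1, 3)
assert Point(1, 2) != Point3D(1, 2, 3)
# Unlike a namedtuple, a plain tuple never compares equal.
assert Point(1, 2) != (1, 2)
```

When the sets of slots differ, both sides return NotImplemented and Python falls back to identity comparison, which is why the last three comparisons come out as unequal rather than raising.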

I wish there was a way to do native docstring support while keeping this a single line, but e.g., struct Point(x: int, y: int), "a 2D point." just doesn't look right to me. Since it is a new keyword it might be possible to make a : optional and only usable to add a docstring, but that might be a little too weird when the : doesn't allow for other statements afterwards. Otherwise a bare string immediately after the definition could be picked up as a docstring, just like what PEP 257 calls an "attribute docstring".

If you squint a little bit, this kind of gets you immutable dicts, albeit with an attribute interface that can be typed without defining a separate TypedDict. Same goes for those people who want an attribute interface over dicts: define the equivalent struct and then create instances from a dict; Point(**{'x': 0, 'y': 0}) would convert a dict to the appropriate attribute interface much like types.SimpleNamespace, but with a more rigid API guarantee and typing built in.

As for the original ADT/enum dream, PEP 695 introduces the type keyword for type aliases. I can imagine you declare a union of structs to get that collection you expect a match statement to exhaustively check for (although that would probably require a PEP to define when type checkers can consider the union final/complete for pattern matching exhaustion). For instance, if you had struct Point1D(x), struct Point2D(x, y) and struct Point3D(x, y, z), you could do type Point = Point1D | Point2D | Point3D and have type checkers recognize that all three cases need to be handled by a match statement (maybe with some PatternExhaustion type or something to denote that fact).

But why?!?

To be very specific as to why I think this could be better than dataclasses:

  • Typing is optional (there are still folks who don't want to lean into that and it simply isn't always necessary)
  • Class construction should be faster (which intuitively you would think shouldn't matter, but I have talked to folks where this is an actual concern, especially when startup time is critical)
  • Syntax typically leads to better tooling support since there's no ambiguity
  • Easier to teach than (data)classes, so can act as a stepping stone towards classes
  • Better semantics than dataclasses have by default (at least in my opinion 😁)

And that's it! I am sure this will be controversial, especially given the restrictions I am proposing to keep the concept simple. I am also under no illusions: this may not be popular and thus may not go anywhere. But if folks show enough interest I'm willing to write this up as a PEP.