Proposing a struct syntax for Python
Story time
When I go on vacation with a fellow Python developer, inevitably I will talk about Python. 😁 Back in September, Andrea and I drove the Cabot Trail with our friends Dusty and Jen, which led to a discussion about the match statement and how Dusty and I both wished more people would use the syntax. We created the match-variant package to try and help, but what Dusty and I really wanted was some syntax to make it easier to construct algebraic data types (ADTs) like Rust's enum type to facilitate type checkers in performing variant exhaustion checking (e.g., if your ADT had the directions of the compass, how do you make sure you don't skip the "west" case in your match statement?). At the same time, I was trying to think of a way to potentially meld this with syntax to replace the enum module to provide a simpler, stable representation of the same concept (the enum module has had some tweaks made that had to be rolled back due to compatibility concerns in the last couple of releases).
While I was trying to think this through, I also realized people would inevitably abuse whatever I came up with to act like dataclasses. That then made me wonder whether an ADT could be built with dataclasses as a base. I then had a realization that you could either construct an ADT and then "fake" dataclasses by having a single variant (e.g., your ADT had a single variant called Point), or you could make it easy to construct a dataclass and then somehow tell type checkers via a union that all of these dataclasses should be considered a single ADT for the purposes of type checking (e.g., you had East, West, North, and South and then created a union like Compass = typing.Union[East, West, North, South]).
And then two things happened to cause my mind to snap to the latter, dataclass-like solution. One, Glyph blogged about the idea of syntax support for dataclasses, which made me think I was on to something. And two, Hynek Schlawack and Bruce Eckel gave talks at PyCon US 2023 about composition and about how object-oriented programming isn't the be-all and end-all of programming, respectively (which also aligns with my blogging about functions versus classes back in 2017). Those talks independently backed up some ideas I had which I thought might be considered extreme, but which I now think are sensible for this idea.
So, I want some syntax to make creating data-only classes extremely easy.
Goals
I want syntax that will keep people from using collections.namedtuple() just for its attribute API and ignoring/regretting its index API. Or put another way, I want syntax to replace the simple, data-only uses of dataclasses. An analogy to other languages is C structs, Rust structs, and Go structs.
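To make the "regretting its index API" point concrete, here is a small sketch of what comes along for the ride when you reach for collections.namedtuple() purely for attribute access. The Point and Size names are just illustrations.

```python
from collections import namedtuple

# Two unrelated record types that happen to have the same number of fields.
Point = namedtuple("Point", ["x", "y"])
Size = namedtuple("Size", ["width", "height"])

p = Point(1, 2)

# The attribute API is usually what people actually wanted.
attribute_value = p.x

# The tuple API is always there too, whether you want it or not.
indexed_value = p[0]
x, y = p  # unpacking works because it is a tuple

# Because instances are tuples, they compare equal to plain tuples
# and to unrelated namedtuples with the same values.
equals_plain_tuple = p == (1, 2)
equals_unrelated_type = p == Size(1, 2)
```

Both comparisons at the end evaluate to True, which is rarely what you want from a record type.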
Because of the data-only orientation, I want it to be comprehensible by beginners after they learn functions but before they have to learn the complexity of classes. This should also facilitate using composition over inheritance in general.
Performance should also be a key focus as I could see this being used a lot for objects to store small amounts of data individually, but having a huge amount of instances in total.
Typing should be properly supported from the start, but not required. And structural/duck typing should be how runtime semantics operate.
The instances should be immutable. It fits into the replacing of collections.namedtuple() while also being easier to reason about and being usable as a key in a dictionary. Tied into the composition angle, this leans a bit more into the functional programming side of Python.
No inheritance. Other languages also seem to get by fine without inheritance for this sort of data structure.
No methods. That's because methods are not that important if you don't have inheritance. At that point you might as well use a function and you can use a module or something to namespace things if you need. See Hynek and/or Bruce's talk for more reasons why methods are not necessary.
The proposal
I want to introduce the new (soft) keyword struct (the name was chosen by my wife, Andrea, as more understandable than data once you explain struct is short for structure and how it's used in other programming languages). You would be able to follow the keyword with the name of the struct class. The parentheses after the name would contain the positional-or-keyword and/or keyword-only parameters the constructor would accept. Each parameter name would directly map to the name of the attribute the argument would be saved to. Because of this mapping of parameter name to attribute name, no positional-only parameters are allowed (the only deviation from function declaration syntax).
As an example, a struct called Point which had x and y parameters could be defined as:
struct Point(x: int, y: int)
This would essentially lead to the definition of a struct class which has:
- __slots__, for performance
- __match_args__, but only for positional-or-keyword arguments (you can still match on keyword-only arguments directly)
- __repr__(), whose result is valid for eval() (if possible)
- __eq__(), based on __slots__ instead of isinstance() checks to support structural/duck typing
- __hash__()
- asdict(), to easily get the data out in another data structure (much like dataclasses.asdict())
- replace(), to make it easy to construct new instances with minor changes without having to call asdict() on the original instance, updating the dict however you need to, and then constructing a new instance; this method is much like dataclasses.replace()
The equivalent Python code for our Point example would be:
I wish there was a way to do native docstring support while keeping this a single line, but e.g., struct Point(x: int, y: int), "a 2D point." just doesn't look right to me. Since it is a new keyword it might be possible to make a : optional and only usable to add a docstring, but that might be a little too weird when the : doesn't allow for other statements afterwards. Otherwise, a bare string immediately after the definition could inherently be picked up as a docstring, just like what PEP 257 calls an "attribute docstring".
If you squint a little bit, this kind of gets you immutable dicts, albeit with an attribute interface that can be typed without defining a separate TypedDict. Same goes for those people who want an attribute interface over dicts: define the equivalent struct and then create instances using a dict; Point(**{'x': 0, 'y': 0}) would convert a dict to the appropriate attribute interface much like types.SimpleNamespace, but with a more rigid API guarantee and typing built in.
As for the original ADT/enum dream, PEP 695 introduces the type keyword for type aliases. I can imagine you declare a union of structs to get that collection you expect a match statement to exhaustively check for (although that would probably require a PEP to define when type checkers can consider the union final/complete for pattern matching exhaustion). For instance, if you had struct Point1D(x), struct Point2D(x, y) and struct Point3D(x, y, z), you could do type Point = Point1D | Point2D | Point3D and have type checkers recognize that all three cases need to be handled by a match statement (maybe with some PatternExhaustion type or something to denote that fact).
But why?!?
To be very specific as to why I think this could be better than dataclasses:
- Typing is optional (there are still folks who don't want to lean into that and it simply isn't always necessary)
- Class construction should be faster (which intuitively you would think shouldn't matter, but I have talked to folks where this is an actual concern, especially when startup time is critical)
- Syntax typically leads to better tooling support since there's no ambiguity
- Easier to teach than (data)classes, so can act as a stepping stone towards classes
- Better semantics than dataclasses have by default (at least in my opinion 😁)
And that's it! I am sure this will be controversial, especially based on the restrictions I am proposing to keep the concept simple. I am also under no illusion that this will be popular, so it may not go anywhere. But if folks show enough interest I'm willing to write this up as a PEP.