Unravelling generator expressions
In this post on Python's syntactic sugar, I want to try to tackle generator expressions. If you look at the language definition for generator expressions you will see that it says, "[a] generator expression yields a new generator object" for what is specified (which is essentially a compact for
loop with an expression for the body). So what does that look like if you take away the Python "magic" and unravel it down to its core Python semantics?
The bytecode
Let's take the following example:
The bytecode for this is:
You may notice a couple of things that are interesting about this:
- The generator expression is very much just a
for
loop in a generator. - The generator expression is stored as a constant in the function.
a
gets explicitly passed into the generator expression.
The semantics
The explicit passing of a
is the surprising bit in how generator expressions work, but it actually makes sense when you read the explanation as to why this occurs:
... the iterable expression in the leftmostfor
clause is immediately evaluated, so that an error produced by it will be emitted at the point where the generator expression is defined, rather than at the point where the first value is retrieved. Subsequentfor
clauses and any filter condition in the leftmostfor
clause cannot be evaluated in the enclosing scope as they may depend on the values obtained from the leftmost iterable.
So by passing a
in, the code for a
is evaluated at the time of creation of the generator expression, not at the time of execution. That way if there's an error with that part of the code the traceback will help you find where it was defined and not simply to where the generator expression happened to be run. Since subsequent for
loops in the generator expression may rely on the loop variant in the first clause, you can't eagerly evaluate any other parts of the expression.
The unravelling
There's a couple of details that required to make unravelling a generator expression successful, so I'm going to build up a running example to cover all the various cases.
With only one for
loop
Let's start with (c for b in a)
where c
can be some expression. To unravel this we need to make a generator which takes in a
as an argument to guarantee it is eagerly evaluated where the generator expression is defined.
We end up with a generator function which takes a single argument for the leftmost iterator. We call iter()
outside of the generator to control for scoping. We should also technically unravel the for
loop, but for readability I'm going to leave it in place. Now let's see what this looks like in some code that would use the generator expression:
This would then unravel to:
With multiple for
loops
Now let's toss in another for
loop: (e for b in a for d in c)
. This unravels to:
Since only the leftmost iterable is evaluated eagerly we can rely on the scoping rules for closures to get all of the other variables from the call site implicitly (this is where Python's simple namespace system comes in handy).
Putting this into an example like:
lead to an unravelling of:
The generator expression needs x
passed in because it's the leftmost iterable, but everything else is captured by the closure.
Assignment expressions
Let's make life complicated and throw in an assignment expression:
The bytecode for this becomes:
The key thing to notice is the various *_DEREF
opcodes which are what CPython uses to load/store nonlocal
variables.
Now we could just add a nonlocal
statement to our unravelled generator expression and assume we are done, but there is one issue to watch out for: has the variable previously been defined in the enclosing scope? If the variable doesn't exist when the scope with the nonlocal
is defined (technically the compiler walking the AST has not seen the variable yet), Python will raise an exception: SyntaxError: no binding for nonlocal 'b' found
.
Python gets to take a shortcut when it comes to a generator expression with an assignment expression and simply consider the nonlocal
as implicit without regards as to whether the variable was previously defined. But we don't get to cheat, and that means we may have to define the variable with a dummy value to make the CPython compiler happy.
But we also have to deal with whether the generator expression is ever run or runs but never sets b
(i.e. the iterable has a length of 0). In the example that would raise UnboundLocalError: local variable 'b' referenced before assignment
. To replicate that we need to delete b
if it never gets set appropriately.
What all of this means is our example unravels to:
But remember, we only want to do any of this nonlocal
work if there are assignment expressions to worry about.
The best laid plans ...
I actually wrote this entire post thinking I had solved the unravelling of generator expressions, and then I realized assignment expressions thwarted me in another way. Consider the following example:
If you run that example you end up with UnboundLocalError: local variable 'b' referenced before assignment
. Now let's unravel this:
Unfortunately calling this function succeeds. And since del
is a statement there's no way to insert ourselves into that expression to prevent b
from being resolved. But luckily Guido told me of a trick to still get the UnboundLocalError
: insert the initial assignment into an if False
calls. If we do if False: b = _PLACEHOLDER
it essentially tricks the compiler into doing the what we want and causing UnboundLocalError
to be raised when the variable isn't set while still having nonlocal
allow for assigning to the variable.
Taking our earlier unravelling that was successful, it change into the following:
Aside: what came first, the expression or the comprehension?
If you have not been programming in Python for more than 15 years you may think generator expressions came first, then list comprehensions. But actually it's the other way around: list comprehensions were introduced in Python 2.0 and generator expressions came in Python 2.4. This is because generators were introduced in Python 2.2 (thanks to inspiration from Icon), and so the possibility of even having generator expressions didn't exist when list comprehensions came into existence (thanks to inspiration from Haskell).
Acknowledgements
Thanks to Guido for pointing out the if False
trick to make unexecuted assignment expressions work. This was not part of the original post and so I initially thought the unraveling had failed.
Thanks to Serhiy Storchaka for pointing out that the scoping was off if you didn't call iter()
outside the generator and you needed to unravel the for
loops.