Unravelling generator expressions
In this post on Python's syntactic sugar, I want to try to tackle generator expressions. If you look at the language definition for generator expressions you will see that it says, "[a] generator expression yields a new generator object" for what is specified (which is essentially a compact
for loop with an expression for the body). So what does that look like if you take away the Python "magic" and unravel it down to its core Python semantics?
Let's take the following example:
The bytecode for this is:
You may notice a couple of things that are interesting about this:
- The generator expression is very much just a
forloop in a generator.
- The generator expression is stored as a constant in the function.
agets explicitly passed into the generator expression.
The explicit passing of
a is the surprising bit in how generator expressions work, but it actually makes sense when you read the explanation as to why this occurs:
... the iterable expression in the leftmost
forclause is immediately evaluated, so that an error produced by it will be emitted at the point where the generator expression is defined, rather than at the point where the first value is retrieved. Subsequent
forclauses and any filter condition in the leftmost
forclause cannot be evaluated in the enclosing scope as they may depend on the values obtained from the leftmost iterable.
So by passing
a in, the code for
a is evaluated at the time of creation of the generator expression, not at the time of execution. That way if there's an error with that part of the code the traceback will help you find where it was defined and not simply to where the generator expression happened to be run. Since subsequent
for loops in the generator expression may rely on the loop variant in the first clause, you can't eagerly evaluate any other parts of the expression.
There's a couple of details that required to make unravelling a generator expression successful, so I'm going to build up a running example to cover all the various cases.
With only one
Let's start with
(c for b in a) where
c can be some expression. To unravel this we need to make a generator which takes in
a as an argument to guarantee it is eagerly evaluated where the generator expression is defined.
We end up with a generator function which takes a single argument for the leftmost iterator. We call
iter() outside of the generator to control for scoping. We should also technically unravel the
for loop, but for readability I'm going to leave it in place. Now let's see what this looks like in some code that would use the generator expression:
This would then unravel to:
Now let's toss in another
(e for b in a for d in c). This unravels to:
Since only the leftmost iterable is evaluated eagerly we can rely on the scoping rules for closures to get all of the other variables from the call site implicitly (this is where Python's simple namespace system comes in handy).
Putting this into an example like:
lead to an unravelling of:
The generator expression needs
x passed in because it's the leftmost iterable, but everything else is captured by the closure.
Let's make life complicated and throw in an assignment expression:
The bytecode for this becomes:
The key thing to notice is the various
*_DEREF opcodes which are what CPython uses to load/store
Now we could just add a
nonlocal statement to our unravelled generator expression and assume we are done, but there is one issue to watch out for: has the variable previously been defined in the enclosing scope? If the variable doesn't exist when the scope with the
nonlocal is defined (technically the compiler walking the AST has not seen the variable yet), Python will raise an exception:
SyntaxError: no binding for nonlocal 'b' found.
Python gets to take a shortcut when it comes to a generator expression with an assignment expression and simply consider the
nonlocal as implicit without regards as to whether the variable was previously defined. But we don't get to cheat, and that means we may have to define the variable with a dummy value to make the CPython compiler happy.
But we also have to deal with whether the generator expression is ever run or runs but never sets
b (i.e. the iterable has a length of 0). In the example that would raise
UnboundLocalError: local variable 'b' referenced before assignment. To replicate that we need to delete
b if it never gets set appropriately.
What all of this means is our example unravels to:
But remember, we only want to do any of this
nonlocal work if there are assignment expressions to worry about.
The best laid plans ...
I actually wrote this entire post thinking I had solved the unravelling of generator expressions, and then I realized assignment expressions thwarted me in another way. Consider the following example:
If you run that example you end up with
UnboundLocalError: local variable 'b' referenced before assignment. Now let's unravel this:
Unfortunately calling this function succeeds. And since
del is a statement there's no way to insert ourselves into that expression to prevent
b from being resolved. But luckily Guido told me of a trick to still get the
UnboundLocalError: insert the initial assignment into an
if False calls. If we do
if False: b = _PLACEHOLDER it essentially tricks the compiler into doing the what we want and causing
UnboundLocalError to be raised when the variable isn't set while still having
nonlocal allow for assigning to the variable.
Taking our earlier unravelling that was successful, it change into the following:
Aside: what came first, the expression or the comprehension?
If you have not been programming in Python for more than 15 years you may think generator expressions came first, then list comprehensions. But actually it's the other way around: list comprehensions were introduced in Python 2.0 and generator expressions came in Python 2.4. This is because generators were introduced in Python 2.2 (thanks to inspiration from Icon), and so the possibility of even having generator expressions didn't exist when list comprehensions came into existence (thanks to inspiration from Haskell).
Thanks to Guido for pointing out the
if False trick to make unexecuted assignment expressions work. This was not part of the original post and so I initially thought the unraveling had failed.
Thanks to Serhiy Storchaka for pointing out that the scoping was off if you didn't call
iter() outside the generator and you needed to unravel the