In this post on Python's syntactic sugar, I want to try to tackle generator expressions. If you look at the language definition for generator expressions you will see that it says, "[a] generator expression yields a new generator object" for what is specified (which is essentially a compact
for loop with an expression for the body). So what does that look like if you take away the Python "magic" and unravel it down to its core Python semantics?
Let's take the following example:
The bytecode for this is:
You may notice a couple of things that are interesting about this:
- The generator expression is very much just a
forloop in a generator.
- The generator expression is stored as a constant in the function.
agets explicitly passed into the generator expression.
The explicit passing of
a is the surprising bit in how generator expressions work, but it actually makes sense when you read the explanation as to why this occurs:
... the iterable expression in the leftmost
forclause is immediately evaluated, so that an error produced by it will be emitted at the point where the generator expression is defined, rather than at the point where the first value is retrieved. Subsequent
forclauses and any filter condition in the leftmost
forclause cannot be evaluated in the enclosing scope as they may depend on the values obtained from the leftmost iterable.
So by passing
a in, the code for
a is evaluated at the time of creation of the generator expression, not at the time of execution. That way if there's an error with that part of the code the traceback will help you find where it was defined and not simply to where the generator expression happened to be run. Since subsequent
for loops in the generator expression may rely on the loop variant in the first clause, you can't eagerly evaluate any other parts of the expression.
There's a couple of details that required to make unravelling a generator expression successful, so I'm going to build up a running example to cover all the various cases.
With only one
Let's start with
(c for b in a) where
c can be some expression. To unravel this we need to make a generator which takes in
a as an argument to guarantee it is eagerly evaluated where the generator expression is defined.
We end up with a generator function which takes a single argument for the leftmost iterable. Let's see what this looks like in some code that would use the generator expression:
This would then unravel to:
Now let's toss in another
(e for b in a for d in c). This unravels to:
Since only the leftmost iterable is evaluated eagerly we can rely on the scoping rules for closures to get all of the other variables from the call site implicitly (this is where Python's simple namespace system comes in handy).
Putting this into an example like:
lead to an unravelling of:
The generator expression needs
x passed in because it's the leftmost iterable, but everything else is captured by the closure.
Let's make life complicated and throw in an assignment expression:
The bytecode for this becomes:
The key thing to notice is the various
*_DEREF opcodes which are what CPython uses to load/store
Now we could just add a
nonlocal statement to our unravelled generator expression and assume we are done, but there is one issue to watch out for: has the variable previously been defined in the enclosing scope? If the variable doesn't exist when the scope with the
nonlocal is defined (technically the compiler walking the AST has not seen the variable yet), Python will raise an exception:
SyntaxError: no binding for nonlocal 'b' found.
Python gets to take a shortcut when it comes to a generator expression with an assignment expression and simply consider the
nonlocal as implicit without regards as to whether the variable was previously defined. But we don't get to cheat, and that means we may have to define the variable with a dummy value to make the CPython compiler happy.
But we also have to deal with whether the generator expression is ever run or runs but never sets
b (i.e. the iterable has a length of 0). In the example that would raise
UnboundLocalError: local variable 'b' referenced before assignment. To replicate that we need to delete
b if it never gets set appropriately.
What all of this means is our example unravels to:
But remember, we only want to do any of this
nonlocal work if there are assignment expressions to worry about.
The best laid plans ...
I actually wrote this entire post thinking I had solved the unravelling of generator expressions, and then I realized assignment expressions thwarted me in the end. Consider the following example:
If you run that example you end up with
UnboundLocalError: local variable 'b' referenced before assignment. Now let's unravel this:
Unfortunately calling this function succeeds. And since
del is a statement there's no way to insert ourselves into that expression to prevent
b from being resolved. This means we cannot unravel assignment expressions. 🙁
So while I thought I had unravelled assignment expressions, it seems that in the end I was unsuccessful. But I have decided to publish this anyway to show how I typically approach unravelling a bit of syntax and sometimes are unable to do it.
Aside: what came first, the expression or the comprehension?
If you have not been programming in Python for more than 15 years you may think generator expressions came first, then list comprehensions. But actually it's the other way around: list comprehensions were introduced in Python 2.0 and generator expressions came in Python 2.4. This is because generators were introduced in Python 2.2 (thanks to the inspiration from Icon), and so the possibility of even having generator expressions didn't exist when list comprehensions came into existence (thanks to the inspiration from Haskell).