State of standardized lock files for Python: August 2023

Since people seemed to like my June 2023 post on the state of WASI support for CPython, I thought I would do one for another of my long-gestating projects: coming up with a standardized lock file format for Python packaging.

💡
When I say "lock file" I'm talking about pinning your dependencies and their versions and writing them to a file, the way pip-compile from pip-tools takes a requirements.in file and produces a requirements.txt file. I am not talking about file locking like fcntl.flock().
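To make that concrete, here is a hypothetical example (the package names are real, the pinned versions are illustrative): a requirements.in file lists only what you directly depend on, while the requirements.txt file pip-compile produces pins everything, indirect dependencies included.

```
# requirements.in
requests

# requirements.txt (generated by pip-compile; versions illustrative)
certifi==2023.7.22
    # via requests
charset-normalizer==3.2.0
    # via requests
idna==3.4
    # via requests
requests==2.31.0
    # via -r requirements.in
urllib3==2.0.4
    # via requests
```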

On the VS Code team, we have taken the position that, for anything in our core Python extension, we much prefer working with standards over anything tool-specific. As such, I have been helping out in trying to standardize things in Python packaging. Probably the most visible thing I helped with was establishing pyproject.toml via PEP 518. I also drove the creation of the [project] table in pyproject.toml via PEP 621.

For me, the next thing to standardize was a lock file format. Historically, people either manually pinned their dependencies to specific versions or they used a requirements.txt file. The former is rather tedious and often misses indirect dependencies, and the latter isn't actually a standard but a pip feature. Both of those things together made me want to come up with a file format that enables environment reproducibility by making it easy for people to get the exact same package versions installed. On top of reproducibility, I also wanted to take the opportunity to help people do installations in a more secure fashion, since it takes three extra flags to pip to make it install things securely.
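To give a sense of what that means in practice, a locked-down install with pip ends up being something along the lines of the following (assuming a requirements.txt file that includes hashes for every pinned package):

```
python -m pip install --require-hashes --only-binary :all: --no-deps -r requirements.txt
```

Each flag closes off a different hole: --require-hashes makes pip verify every downloaded file against a recorded hash, --only-binary :all: refuses sdists (which can run arbitrary code at install time), and --no-deps stops pip from pulling in anything the file doesn't explicitly list.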

That led me to write PEP 665. The goal was to create a lock file format around wheels which would facilitate people installing packages in a consistent, secure way. Unfortunately, after 10 months of working on the PEP, it was ultimately rejected. I personally believe the rejection was due to a lack of sdist support – which goes against the "secure" goal I had, since sdists can basically do anything at install time – and due to a lack of understanding around how important lock files are for security purposes (let alone just making sure you can replicate your environment in other places).

And so I decided I needed a proof-of-concept lock file format in order to show the importance of this. That would require being able to do a few things:

  1. Get a list of top-level dependencies that need to be installed from the user
  2. Communicate with a Python package index server like PyPI to find out what packages (and their wheels) are available
  3. Resolve the dependency graph to know what needs to ultimately be installed
  4. Create a lock file for that dependency graph of wheel files
  5. Install the wheel files listed in the lock file

Step 1 is somewhat taken care of by pyproject.toml and project.dependencies, although if you're not writing code that's meant to eventually end up in a wheel, it's a bit of an abuse of that data (there's been a discussion about how projects not destined to become a wheel should write down their dependencies, but I don't know if it's going to go anywhere). Step 2 is taken care of via the simple repository API, which can be either HTML- or JSON-based (I created mousebender to smooth over the details between the two types of API response formats, and that project is also where I'm hosting all of this work related to the proof-of-concept I want to end up with).
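As a rough sketch of those two steps, the following reads the top-level requirements out of pyproject.toml and then asks PyPI's simple repository API (the JSON flavour from PEP 691) what wheel files it has for a project. It uses only the standard library (tomllib requires Python 3.11+) rather than mousebender, so treat it as illustrative, not as what the proof-of-concept actually does:

```python
import json
import tomllib  # Python 3.11+
import urllib.request

# Step 1: the user's top-level requirements from pyproject.toml.
with open("pyproject.toml", "rb") as file:
    pyproject = tomllib.load(file)
requirements = pyproject["project"]["dependencies"]
print("Top-level requirements:", requirements)

# Step 2: what files does the index have for a given project?
request = urllib.request.Request(
    "https://pypi.org/simple/packaging/",
    headers={"Accept": "application/vnd.pypi.simple.v1+json"},
)
with urllib.request.urlopen(request) as response:
    project_details = json.load(response)

for file_details in project_details["files"]:
    if file_details["filename"].endswith(".whl"):
        print(file_details["filename"])
```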

Step 3 is where I'm currently at. Working with a resolver like resolvelib means you need the initial set of requirements, the constraints it has to operate under (e.g., platform details), and the ability to update the requirements the resolver is working with as it learns about more edges in the dependency graph. As I mentioned above, I can cheat about the initial set of requirements by grabbing them from pyproject.toml. The constraints are covered by packaging.markers and packaging.tags (and I wrote the latter module, so I'm "lucky" to be overly familiar with what's required for this situation). That leaves updating the requirements as new edges of the dependency graph are discovered.
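For those who haven't used them, those two packaging modules cover the two kinds of constraints: environment markers decide whether a requirement applies to an environment at all, while wheel tags decide which wheel files are compatible with it. A minimal illustration:

```python
from packaging.markers import Marker
from packaging.tags import sys_tags

# Markers: does this conditional requirement apply to the running
# interpreter/platform?
marker = Marker('python_version >= "3.8" and sys_platform != "win32"')
print(marker.evaluate())

# Tags: which wheels are compatible, from most to least specific?
for tag in sys_tags():
    print(tag)  # e.g. cp311-cp311-manylinux_2_17_x86_64
```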

But how do you end up with new edges of the dependency graph? Well, every dependency has its own dependencies. So what you end up doing is, once you think you know which wheel you want to install, you get the metadata for that wheel and read what requirements it has. That might sound simple, but the core metadata specification says wheel metadata gets written to a METADATA file that is formatted using email headers; not quite so easy as reading some JSON. Plus there are a lot of field types, the parsing requirements per field have changed over the years, etc. As such, the idea came up of putting some code into the packaging project – which I'm a co-maintainer of – so there could be a baseline core metadata parser which handled parsing this metadata, both in a forgiving and a strict manner (for this project I need strict parsing of the dependency information).
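To show what the email-header format looks like, here is a representative (made-up) excerpt of a wheel's METADATA file; the Requires-Dist fields are the dependency-graph edges I'm after:

```
Metadata-Version: 2.1
Name: example-project
Version: 1.0.0
Requires-Python: >=3.8
Requires-Dist: packaging>=23.1
Requires-Dist: tomli>=2.0; python_version < "3.11"
Requires-Dist: pytest; extra == "test"
```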

I got the forgiving parsing done in packaging 23.1 via packaging.metadata. But today I got the strict parsing merged, which also provides a higher-level API using richer object representations. All told, this part took me over 2.5 years to complete.
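As a quick sketch of the two layers in packaging.metadata: parse_email() is the forgiving layer, and Metadata.from_email() is the strict, validating one that produces the richer objects.

```python
from packaging.metadata import Metadata, parse_email

with open("METADATA", "rb") as file:
    data = file.read()

# Forgiving: returns whatever fields it could parse, plus a dict of
# anything it could not make sense of.
raw, unparsed = parse_email(data)
print(raw.get("name"), list(unparsed))

# Strict: validates every field, raising if anything is invalid, and
# returns richer objects (e.g. Requirement instances for Requires-Dist).
metadata = Metadata.from_email(data, validate=True)
for requirement in metadata.requires_dist or []:
    print(requirement)
```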

And with that, someone can tell me what their dependencies are, PyPI can tell me what wheels it has, and I can read what dependencies those wheels have. The next step is taking resolvelib and creating a resolver to generate the dependency graph. I'm planning to make the design of my resolver code flexible so that you can do interesting things like resolving for the oldest dependencies as well as the newest (handy for testing the range of versions you claim to support), resolving for the most or least specific wheels (so you can see what your actual platform restrictions are), and specifying the platform details so you can resolve for a different platform than the one you're running on (handy if your production environment is different from your development one). Those last two are important to me for work purposes as they would allow me to create a resolver that only supported pure Python wheels, which is necessary for WASI since there isn't extension module support for that platform (yet).
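For anyone curious what "creating a resolver" with resolvelib involves: you implement a provider that answers the resolver's questions. The skeleton below uses the method names from resolvelib's AbstractProvider, but the requirement/candidate classes and the method bodies are placeholders for the index-querying and metadata-reading pieces described above, not my actual design:

```python
import resolvelib


class WheelProvider(resolvelib.AbstractProvider):
    """Placeholder provider; requirement/candidate objects are assumed
    to wrap packaging's Requirement and Version types."""

    def identify(self, requirement_or_candidate):
        # Map a requirement or candidate back to the project it names.
        return requirement_or_candidate.name

    def get_preference(
        self, identifier, resolutions, candidates, information, backtrack_causes
    ):
        # Which unresolved project to work on next; "fewest candidates
        # first" is a simple heuristic.
        return sum(1 for _ in candidates[identifier])

    def find_matches(self, identifier, requirements, incompatibilities):
        # Ask the index for wheels of `identifier` and filter them by
        # specifiers, wheel tags, and markers (steps 2 and 3 above).
        raise NotImplementedError

    def is_satisfied_by(self, requirement, candidate):
        # Does the candidate's version fall within the requirement?
        return candidate.version in requirement.specifier

    def get_dependencies(self, candidate):
        # Read Requires-Dist from the candidate's METADATA file.
        raise NotImplementedError


resolver = resolvelib.Resolver(WheelProvider(), resolvelib.BaseReporter())
# result = resolver.resolve(initial_requirements)
```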