16 Aug 2019 6 min read packaging

How do you verify that PyPI can be trusted?

A co-worker of mine attended a technical talk about how Go's module mirror works and he asked me whether there was something there that Python should do.

Best technical talk that I've seen in a long time: @katie_hockman on Go Module Proxy. Motivating the problem and convincing us that the solution really worked. @brettsky we should do this for Python too (go module proxy, not trying top Katie's talk 😀). https://t.co/9UeGyYTOwp
— John Lam (@john_lam) July 25, 2019

Now Go's packaging story is rather different from Python's since in Go you specify the location of a module by the URL you fetch it from, e.g. github.com/you/hello specifies the hello module as found at https://github.com/you/hello. This means Go's module ecosystem is distributed, which leads to interesting problems of caching so code doesn't disappear off the internet (e.g. a left-pad incident), and needing to verify that a module's provider isn't suddenly changing the code they provide with something malicious.

But since the Python community has PyPI our problems are slightly different in that we just have to worry about a single point of failure (which has its own downsides). Now obviously you can run your own mirror of PyPI (and plenty of companies do), but for the general community no one wants to bother to set something up like that and try to keep it maintained (do you really need your own mirror to download some dependencies for the script you just wrote to help clean up your photos from your latest trip?). But we should still care about whether PyPI has been compromised such that packages hosted there have not been tampered with somehow between when the project owner uploaded their release's files and from when you download them.

Verifying PyPI is okay

So the first thing we can do is see if we can tell if PyPI has been compromised somehow. This takes on two different levels of complexity. One is checking if post-release anything nefarious has occurred. The fancier step is to provide a way for project owners to tell other folks what they are giving PyPI to act as an auditor.

Post-release trust

In a post-release scenario you're trusting that PyPI received a release from a project owner successfully and safely. What you're worrying about here is that at some later point PyPI gets compromised and someone e.g. swapped out the files in requests so that someone could steal some Bitcoin. So what are some options here?

Trust PyPI

The simplest one is don't worry about it. 😁 PyPI is run by some very smart, dedicated folks and so if you feel comfortable trusting them to not mess up then you can simply not stress about compromises.

Trust PyPI up to when you froze your dependencies

Now perhaps you do generally trust the PyPI administrators and don't think anything has happened yet, but you wouldn't mind a fairly cheap way that's available to today to make sure nothing fishy happens in the future. In that case you can record the hashes of your locked dependencies. (If you're an app developer you are locking your dependencies, right?)

Basically what you do is you have whatever tool you're using to lock your dependencies – e.g. pip-tools, pipenv, poetry – record the hash of the files you depend on upon locking. That way in the future you can check for yourself that the files you downloaded from PyPI match bit-for-bit what you previously downloaded and used. Now this doesn't guarantee that what you initially downloaded when you froze your dependencies didn't contain compromised code, but at least you know going forward nothing questionable has occurred.

Trust PyPI or an independent 3rd-party since they started running

Now we're into the "someone would have to do work to make this happen" realm; everything up until now you can do today, but this idea requires money (although PyPI still requires money to simply function as well, so please have your company donate if you use PyPI at work).

What one could do is run a 3rd-party service that records all the hashes of files that end up on PyPI. That way, if one wanted to see if the hash from PyPI hasn't changed since the 3rd-party service started running then one could simply ask the 3rd-party service for the hash for whatever file they want from PyPI, ask PyPI what they think the hash should be, and then check if the hashes match. If they do match then you should be able to trust the hashes, but if they differ then either PyPI or the 3rd-party service is compromised.

Now this is predicated on the idea that the 3rd-party service is truly 3rd-party. If any staff is shared between the 3rd-party service and PyPI then that's a potential point of compromise. This is also assuming that PyPI has not already been compromised. But at least in this scenario the point in time where your trust in PyPI starts from when the 3rd-party service began running and not when you locked your dependencies.

You can also extend this out to multiple 3rd-parties recording file hashes so that you can compare hashes against multiple sources. This not only makes it harder by forcing someone to compromise multiple services in order to cover up a file change, but if someone is compromised you could choose to use quorum to decide who's right and who's wrong.

Auditing what everyone claims

This entire blog post started because of a Twitter thread about how to be able to validate what PyPI claims. At some point I joked that I was shocked no one had mentioned the blockchain yet. And that's when I was informed that Certificate Transparency logs are basically what we would want and they use something called Merkle hash trees that started with P2P networks and have been used in blockchains.

Yea. CT logs in this case are basically a less stupid blockchain.
— Donald Stufft (@dstufft) July 27, 2019

I'm not going to go into all the details as how Certificate Transparency works, but basically they use an append-only log that can be cryptographically verified as having not been manipulated (and you could totally treat recording hashes of files on PyPI as an append-only log).

There are two very nice properties of these hash trees. One is it is very cheap to verify when an update has been made that all the previous entries in the log have not changed. Basically what you need is some key values from the previous version of the hash tree so that when you add new values to the tree and re-balance it's easy to verify the old stuff is still the same. This is great to help monitor for manipulation of previous data while also making it easy to add to the log.

The second property is that checking an entry hasn't been tampered with can be done without having the entire tree available. Basically you only need all nodes along a path from a leaf node to the root plus all immediate siblings of those nodes. This means that even if your hash tree has a massive amount of leaf nodes it doesn't take much to audit that a single leaf node has not not changed.

So all of this leads to a nice system to help keep PyPI honest if you can assume the initial hashes are reliable.

Release-in-progress trust

So all of the above scenarios assume PyPI was secure at the time of initially receiving a file but then potentially was compromised later. But how could we check that PyPI isn't already compromised?

One idea I had was that twine could upload a project releases' hashes to some trusted 3rd-parties as well as to PyPI. Then the 3rd-parties could either directly compare the hashes PyPI claims to have to what they were given independently or they could use their data to create that release's entry in the append-only hash tree log and see if the final hash matched what PyPI claims. And if a 3rd-party wasn't given some hashes by the project owner then they could simply fill in with what PyPI has. But the key point is that by having the project owner directly share hashes with 3rd-parties that are monitoring PyPI we would then have a way to detect if PyPI isn't providing files as the project owner expected.

Making PyPI harder to hack

Now obviously it would be best if PyPI was as hard to compromise as possible as well as detecting compromises on its own. There are actually two PEPs on the topic: PEP 458 and PEP 480. I'm not going to go into details since that's why we have PEPs, but people have thought through how to make PyPI hard to compromise as well as how to detect it.

But knowing that a design is available, you may be wondering why hasn't it been implemented?

What can you do to help?

There is a major reason why the ideas above have not been implemented: money. People using Python for personal projects typically don't worry about this sort of stuff because it just isn't a big concern, so people are not chomping at the bit to implement any of this for fun in their spare time. But for any business relying on packages coming from PyPI, it should be a concern since their business relies on the integrity of PyPI and the Python ecosystem. And so if you work for a company that uses packages from PyPI, then please consider having the company donate to the packaging WG (you can also find the link by going to PyPI and clicking the "Donate" button). Previous donations got us the current back-end and look of PyPI as well as the recent work to add two-factor authentication and API tokens, so they already know how to handle donations and turning them into results. So if anything I talked about here sounds worth doing, then please consider donating to help making it so they can happen.