Writing a Zip File Importer: The Finder (Part 2)

When writing a finder like importlib.machinery.FileFinder, it’s all about working with the storage mechanism you are trying to support. In this instance that means making sure the paths the finder is given are ones it can actually handle, and being able to check whether certain files exist.

In the case of zip files, it’s tricky because paths inside the archive always use / as the path separator. If someone on Windows adds a path to sys.path using \ as the path separator, you need to support it. That means constantly translating between OS-specific paths and zip file paths.
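As a rough illustration of that translation, here is a minimal sketch. The names to_zip_path and from_zip_path are hypothetical helpers made up for this example, not part of importlib or zipfile, and the separator is passed in explicitly so the Windows case can be tried on any platform.

```python
import os


def to_zip_path(path, sep=os.sep, is_directory=False):
    """Convert a path written with `sep` into zip-internal form."""
    zip_path = path.replace(sep, '/')
    # Directories inside the archive are spelled with a trailing '/'.
    if is_directory and zip_path and not zip_path.endswith('/'):
        zip_path += '/'
    return zip_path


def from_zip_path(zip_path, sep=os.sep):
    """Convert a zip-internal path back into one written with `sep`."""
    return zip_path.rstrip('/').replace('/', sep)


# A Windows-style directory, as someone might write it on sys.path.
print(to_zip_path('pkg\\subpkg', sep='\\', is_directory=True))  # pkg/subpkg/
print(from_zip_path('pkg/subpkg/', sep='\\'))                   # pkg\subpkg
```

Taking the separator as a parameter, rather than always reading os.sep, is just a convenience for experimenting; a real finder would presumably stick with whatever the running platform uses.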

Once you have that down, you need to implement the finder itself. When it is asked whether it can provide a loader for a module, it searches the zip file, within the directory the finder is in charge of, to see if a source file can be found. This is the core of a finder and it takes a few steps:

  1. See if a directory matches the tail end of the module being searched for, and if it does record that fact
  2. If a directory exists then check for an __init__.py file for it to be a real package
  3. If no real package is found, check for a .py file
  4. If none of that works but a directory was found then return it as a portion of a possible future namespace package

All of this roughly translates into the following code:

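from importlib.machinery import SOURCE_SUFFIXES

def find_module(self, fullname):
  # Returns a (loader, portions) pair, the shape PEP 420 specifies for find_loader().
  portions = []
  # Only the last component matters; this finder covers a single directory.
  module_name = fullname.rpartition('.')[-1]
  directory = self.make_directory_path(module_name)
  if self.has_directory(directory):
    # Remember the directory in case it is a namespace package portion.
    portions.append(directory)
    package_base = self.join(directory, '__init__')
    for suffix in SOURCE_SUFFIXES:
      package = package_base + suffix
      if self.has_file(package):
        # A real package: the directory contains an __init__ source file.
        return self.loader(fullname, package), portions
  # No real package, so look for a plain module.
  module_base = self.make_file_path(module_name)
  for suffix in SOURCE_SUFFIXES:
    module = module_base + suffix
    if self.has_file(module):
      return self.loader(fullname, module), portions
  # Nothing loadable; portions may still hold a namespace package directory.
  return None, portions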
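The code above leans on has_file() and has_directory() without showing them. One possible way to back them with the zipfile module is sketched below; the function signatures and the in-memory archive are assumptions made for this example, not the actual implementation. The point worth noticing is that an archive does not necessarily store an entry for each directory, so the directory check looks for any name under the prefix instead of an exact match.

```python
import io
import zipfile


def has_file(names, path):
    """True if `path` is stored as a file (not a directory) in the archive."""
    return path in names and not path.endswith('/')


def has_directory(names, directory):
    """True if anything lives under `directory`, which must end in '/'."""
    return any(name.startswith(directory) for name in names)


# Build a small archive in memory so the example needs no files on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('pkg/__init__.py', '')
    archive.writestr('pkg/mod.py', 'x = 1\n')
    archive.writestr('data/readme.txt', 'not a package\n')

# Read the names once; membership tests on a set are cheap.
with zipfile.ZipFile(buffer) as archive:
    names = set(archive.namelist())

print(has_directory(names, 'pkg/'))          # True
print(has_file(names, 'pkg/__init__.py'))    # True
print(has_directory(names, 'data/'))         # True
print(has_file(names, 'data/__init__.py'))   # False
```

Here data/ is a directory without an __init__.py, which is exactly the case that would end up as a namespace package portion in the algorithm above.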

The tricky bit that came out of implementing this general algorithm for zip files is that there are two kinds of paths in play: the ones you use to look inside the zip file (i.e. paths with only / as the path separator) and the ones you shuttle around to support people passing in OS-specific paths. In other words, the paths coming into and going out of that method need to use the path separators the user chose to work with, while everything internal needs to work with / only (and on top of that, directories must all end in /). It’s that dichotomy between the two kinds of paths that prevents importlib from having just a generic algorithm with some abstract API that finder subclasses can simply override.
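To make that boundary concrete, here is a small sketch of a single lookup that accepts the user’s spelling, works internally in zip form, and hands back the user’s spelling again. The function find_portion and its parameters are hypothetical, and the sketch leaves out the location of the archive itself, which a real finder would presumably put in front of the returned path.

```python
def find_portion(names, path_entry, module_name, sep):
    """Return the user-facing directory for `module_name`, or None.

    `names` holds zip-internal names; `path_entry` is the directory inside
    the archive that the finder is in charge of, written with `sep`.
    """
    # Inbound: translate the user's spelling to zip form with a trailing '/'.
    prefix = path_entry.replace(sep, '/')
    if prefix and not prefix.endswith('/'):
        prefix += '/'
    directory = prefix + module_name + '/'
    # Internal: only '/'-separated names are ever compared.
    if not any(name.startswith(directory) for name in names):
        return None
    # Outbound: drop the trailing '/' and restore the user's separator.
    return directory.rstrip('/').replace('/', sep)


names = {'vendor/pkg/__init__.py', 'vendor/ns/part.py'}
print(find_portion(names, 'vendor', 'ns', sep='\\'))         # vendor\ns
print(find_portion(names, 'vendor\\', 'missing', sep='\\'))  # None
```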

Luckily people don’t typically muck with the storage mechanisms of module code. I’m willing to bet something like 95% of all code is backed by files on a filesystem, 4% by zip files, and the rest by some exotic, custom finder (e.g. a SQL database). So for most people writing custom importers it’s going to come down to the loader and not the finder.

Anyway, those are the key points for handling a zip file finder: managing paths and re-implementing the basic algorithm for determining if a package, module, or namespace exists for the requested module.