pypi-crawlers

A set of tools to index Python package URLs and dependencies

These crawlers were created to maintain the data required by mach-nix.

This project contains two crawlers: one indexes the available sdist downloads on pypi.org, and the other downloads every package and extracts its dependency information.

The URL index is stored here: nix-pypi-fetcher (which also serves as a convenient standalone PyPI fetcher for Nix)
The dependency graph is stored here: pypi-deps-db


URL Crawler

It takes the complete list of packages from PyPI's XML-RPC API, then retrieves the download URLs for each package via PyPI's JSON API. The sha256 hashes are already included in the JSON response, so no package needs to be downloaded to build this index.
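
For illustration, here is a minimal sketch (not the project's actual crawler code) of how PyPI's JSON API exposes sdist URLs and sha256 hashes for a single package; the real crawler iterates over the full package list:

    # Minimal sketch: fetch the sdist download URLs and sha256 hashes of one
    # package from PyPI's JSON API (https://pypi.org/pypi/<name>/json).
    import json
    import urllib.request

    def sdist_releases(pkg_name):
        """Return {version: [(url, sha256), ...]} for all sdist files of a package."""
        with urllib.request.urlopen(f"https://pypi.org/pypi/{pkg_name}/json") as resp:
            data = json.load(resp)
        index = {}
        for version, files in data["releases"].items():
            sdists = [(f["url"], f["digests"]["sha256"])
                      for f in files if f["packagetype"] == "sdist"]
            if sdists:
                index[version] = sdists
        return index

    if __name__ == "__main__":
        print(json.dumps(sdist_releases("requests"), indent=2))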


Dependency Crawler

PyPI does not appear to provide any dependency information via its APIs; a package's dependencies are only revealed during the installation process itself. Therefore the strategy to extract a single package's dependencies is:

  1. Download and extract the sdist release of the package.
  2. Run the package's setup routine in a modified Python environment which doesn't perform a real setup but instead just dumps the package's dependencies.

The main modifications that had to be made to Python to extract the dependencies are:

  • Patch the built-in distutils module to run the setup routine until all of the important information gathering is finished, then serialize the relevant arguments to JSON and dump them to a file (a simplified sketch follows this list).
  • Patch setuptools to skip the installation of setup requirements and call distutils' setup method directly.
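
A heavily simplified, hypothetical sketch of that idea (the real patches under ./src/extractor/ handle many more corner cases): the setup() entry points are replaced by a function that records the dependency-related keyword arguments instead of performing an installation.

    # Hypothetical sketch only: intercept setup() and dump dependency-related
    # arguments to JSON instead of performing a real build/installation.
    import json
    import sys

    def fake_setup(**kwargs):
        # Keep only the arguments describing dependencies.
        keys = ("install_requires", "setup_requires", "extras_require", "python_requires")
        with open("deps.json", "w") as f:
            json.dump({k: kwargs[k] for k in keys if k in kwargs}, f, indent=2)

    # Patch distutils and setuptools before the package's setup.py is executed,
    # so `from setuptools import setup` picks up the fake function.
    import distutils.core
    import setuptools
    distutils.core.setup = fake_setup
    setuptools.setup = fake_setup

    # Run the package's setup.py inside the patched environment.
    setup_py = sys.argv[1] if len(sys.argv) > 1 else "setup.py"
    code = compile(open(setup_py).read(), setup_py, "exec")
    exec(code, {"__name__": "__main__", "__file__": setup_py})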

The process of extracting requirements for a single package is defined as a Nix derivation under ./src/extractor/. This allows the extraction to run as a Nix builder in a sandboxed environment. The extractor derivation takes one Python package's download information as input and produces a JSON output containing the dependencies. A Python-based service regularly checks for new packages detected by the URL crawler and runs them through the extractor builder to update the dependency DB. Afterwards the database is dumped to JSON and published at pypi-deps-db.
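
A rough, hypothetical sketch of a single update step (the extractor's real interface may differ; the argument names `pkg`, `version`, `url`, and `sha256` are illustrative):

    # Hypothetical sketch: build the extractor derivation in the Nix sandbox for
    # one package and read the resulting JSON (assumes the derivation's output
    # is a single JSON file; the real extractor interface may differ).
    import json
    import subprocess

    def extract_deps(pkg, version, url, sha256):
        store_path = subprocess.run(
            ["nix-build", "./src/extractor",
             "--argstr", "pkg", pkg,
             "--argstr", "version", version,
             "--argstr", "url", url,
             "--argstr", "sha256", sha256,
             "--no-out-link"],
            check=True, capture_output=True, text=True,
        ).stdout.strip()          # nix-build prints the resulting store path
        with open(store_path) as f:
            return json.load(f)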


Project Structure

|- nix/                     Contains NixOps deployments for the crawler and database
|   |- crawler/             Deployment for both crawlers together on a small machine
|   |- database/            Deployment for the DB needed to store dependency information
|   |- power-deps-crawler   Alternative deployment of the dependency crawler on a powerful
|                           machine, which was needed to process the complete past history
|                           of Python packages.
|
|- src/
    |- extractor/           Nix expression for extracting a single package
    |- crawl_deps.py        Entry point for the dependency crawler
    |- crawl_urls           Entry point for the URL crawler
    |- dump_deps.py         Entry point for dumping the dependencies from the DB into JSON

Debugging

See ./debug