## A set of tools to index python package URLs and dependencies

These crawlers were created to maintain the data required by [mach-nix](https://github.com/DavHau/mach-nix).

This project contains 2 crawlers: one indexes the available `sdist` python package downloads on pypi.org, while the other actually downloads all packages and extracts their dependency information.

The URL index is stored here: [nix-pypi-fetcher](https://github.com/DavHau/nix-pypi-fetcher) (which at the same time is a convenient standalone pypi fetcher for nix).
The dependency graph is stored here: [pypi-deps-db](https://github.com/DavHau/pypi-deps-db).

---

## URL Crawler

It takes the complete list of packages from pypi's XML-RPC API, then retrieves the download URLs for each package via pypi's JSON [API](https://warehouse.readthedocs.io/api-reference/json/). The sha256 hashes are already contained in the JSON response, so no packages need to be downloaded to build this index. (A minimal sketch of such a JSON API lookup is included at the end of this README.)

---

## Dependency Crawler

Pypi does not appear to provide any dependency information via its APIs. A package's dependencies are only revealed during the installation process itself. Therefore the strategy to extract one package's dependencies is:

1. Download and extract the sdist release of the package.
2. Run the package's setup routine through a modified python environment which doesn't perform a real `setup`, but instead just dumps the package's dependencies.

The main modifications that needed to be made to python in order to extract the dependencies are:

- Patch the builtin module `distutils` to run the setup routine only until all of the important information gathering is finished, then jsonify the relevant arguments and dump them to a file.
- Patch `setuptools` to skip the installation of setup requirements and directly call the setup method of `distutils`.

(A simplified illustration of this patching idea can be found at the end of this README.)

The process of extracting the requirements of a single package is defined as a nix derivation under `./src/extractor/`. This allows running the extraction as a nix builder in a sandboxed environment. The extractor derivation takes one python package's download information as input and produces a json output containing its dependencies.

A python based service regularly checks for new packages detected by the URL crawler and runs them through the `extractor` builder to update the dependency DB. Afterwards this database is dumped to json and published at [pypi-deps-db](https://github.com/DavHau/pypi-deps-db).

---

### Project Structure
```
|- nix/                    Contains NixOps deployments for the crawlers and the database
|  |- crawler/             Deployment for both crawlers together on a small machine
|  |- database/            Deployment for the DB needed to store dependency information
|  |- power-deps-crawler   Alternative deployment of the dependency crawler on a strong
|                          machine, which was needed to process the complete past history
|                          of python packages.
|
|- src/
   |- extractor/           Nix expression for extracting a single package
   |- crawl_deps.py        Entry point for the dependency crawler
   |- crawl_urls           Entry point for the URL crawler
   |- dump_deps.py         Entry point for dumping the dependencies from the DB into json
```

### Debugging
see [./debug](./debug)
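
---

### Example: querying pypi's JSON API

The snippet below is a minimal sketch of the kind of per-package lookup the URL crawler performs; it is not the project's actual crawler code, and it assumes the third-party `requests` library. It fetches one package's JSON metadata and yields the URL and sha256 hash of every `sdist` file. Obtaining the complete package list via pypi's XML-RPC API is not shown here.

```python
"""Minimal sketch: list sdist download URLs and sha256 hashes of one package
via pypi's JSON API. Hypothetical helper, not part of this project's sources."""
import requests  # assumed dependency


def sdist_releases(pkg_name: str):
    """Yield (version, url, sha256) for every sdist file of `pkg_name`."""
    resp = requests.get(f"https://pypi.org/pypi/{pkg_name}/json", timeout=30)
    resp.raise_for_status()
    for version, files in resp.json()["releases"].items():
        for f in files:
            if f["packagetype"] == "sdist":
                yield version, f["url"], f["digests"]["sha256"]


if __name__ == "__main__":
    for version, url, sha256 in sdist_releases("requests"):
        print(version, url, sha256)
```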
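
### Example: dumping dependencies instead of installing

The actual extractor patches `distutils` and `setuptools` inside the python environments built by the nix derivation under `./src/extractor/`. The snippet below is only a simplified, stand-alone illustration of the same idea: replace `setup()` with a function that records the dependency-related arguments as json instead of building or installing anything. File and function names here are made up for the example.

```python
"""Simplified illustration of intercepting setup() to dump declared dependencies."""
import json
import sys

import setuptools
import distutils.core


def fake_setup(**kwargs):
    """Record dependency-related setup() arguments instead of installing."""
    info = {
        "install_requires": kwargs.get("install_requires", []),
        "setup_requires": kwargs.get("setup_requires", []),
        "extras_require": kwargs.get("extras_require", {}),
        "python_requires": kwargs.get("python_requires"),
    }
    with open("requirements.json", "w") as f:
        json.dump(info, f, indent=2)


# Redirect both entry points a setup.py might call.
setuptools.setup = fake_setup
distutils.core.setup = fake_setup

if __name__ == "__main__":
    # Execute the package's setup.py in-place; setup() now only dumps json.
    sys.argv = ["setup.py", "install"]
    with open("setup.py") as f:
        exec(f.read(), {"__name__": "__main__", "__file__": "setup.py"})
```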