Management of the index of a registry source
This module contains management of the index and various operations, such as actually parsing the index, looking for crates, etc. This is intended to be abstract over remote indices (downloaded via git) and local registry indices (which are all just present on the filesystem).
Index Performance
One important aspect of the index is that we want to optimize the “happy path” as much as possible. Whenever you type cargo build, Cargo will always reparse the registry and learn about dependency information. This is done because Cargo needs to learn about the upstream crates.io crates that you’re using and ensure that the preexisting Cargo.lock still matches the current state of the world.
Consequently, Cargo “null builds” (builds where nothing needs recompiling, so the time taken is the overhead that Cargo itself adds to each build) need to be fast when accessing the index. The primary performance optimization here is to avoid parsing JSON blobs from the registry if we don’t need them. Most secondary optimizations are centered around removing allocations and such, but avoiding parsing JSON is the #1 optimization.
When we get queries from the resolver we’re given a Dependency. This dependency in turn has a version requirement, and with lock files that already exist these version requirements are exact version requirements, =a.b.c. This means that in theory we only need to parse one line of JSON per query in the registry, the one that matches version a.b.c.
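As a rough illustration of what an exact requirement looks like to a lookup routine, the following sketch recognizes the =a.b.c form and extracts the three numeric components. The function name exact_version is hypothetical and is not part of this module; the sketch also deliberately ignores pre-release and build-metadata suffixes, which a real semver parser would handle.

```rust
/// Hypothetical helper (not part of Cargo): returns the (major, minor, patch)
/// triple if `req` is an exact requirement of the form `=a.b.c`.
/// Assumption: pre-release and build-metadata suffixes are not supported here.
fn exact_version(req: &str) -> Option<(u64, u64, u64)> {
    // Only requirements that start with `=` pin a single version.
    let rest = req.trim().strip_prefix('=')?;
    let mut parts = rest.trim().split('.');
    let major = parts.next()?.parse().ok()?;
    let minor = parts.next()?.parse().ok()?;
    let patch = parts.next()?.parse().ok()?;
    // Reject anything with more than three dot-separated components.
    if parts.next().is_some() {
        return None;
    }
    Some((major, minor, patch))
}
```

With a requirement like this in hand, only the single index entry whose version equals that triple is relevant to the query.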
The crates.io index, however, is not amenable to this form of query. Instead, each crate’s entry in the crates.io index is simply a file where each line is a JSON blob. To learn which version each JSON blob describes we would need to parse the JSON, defeating the purpose of trying to parse as little as possible.
Note that as a small aside even loading the JSON from the registry is actually pretty slow. For crates.io and remote registries we don’t actually check out the git index on disk because that takes quite some time and is quite large. Instead we use libgit2 to read the JSON from the raw git objects. This in turn can be slow (aka show up high in profiles) because libgit2 has to do deflate decompression and such.
To solve all these issues a strategy is employed here where Cargo basically creates an index into the index. The first time a package is queried about (first time being for an entire computer) Cargo will load the contents (slowly via libgit2) from the registry. It will then (slowly) parse every single line to learn about its versions. Afterwards, however, Cargo will emit a new file (a cache) which is amenable for speedily parsing in future invocations.
This cache file is currently organized by basically having the semver version extracted from each JSON blob and stored alongside it. That way Cargo can quickly and easily read all versions contained in the file and see which JSON blob each is associated with. A JSON blob then doesn’t actually need to get parsed unless its version is the one being asked for.
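The idea of storing each version next to its unparsed blob can be sketched as follows. The byte layout here (version, NUL, JSON blob, NUL, repeated) and the names encode_cache and find_blob are assumptions made for illustration, not a description of the actual on-disk format. The sketch relies on the fact that valid JSON text never contains a raw NUL byte, and it takes the versions as already known; in practice they would come from the one slow pass that parses every line.

```rust
/// Hypothetical cache layout: for each entry, `version NUL json-blob NUL`.
/// The versions are assumed to have been extracted once, when the cache
/// was first built.
fn encode_cache(entries: &[(&str, &str)]) -> Vec<u8> {
    let mut out = Vec::new();
    for (version, blob) in entries {
        out.extend_from_slice(version.as_bytes());
        out.push(0);
        out.extend_from_slice(blob.as_bytes());
        out.push(0);
    }
    out
}

/// Scans the cache and returns the raw JSON blob stored for `wanted`.
/// No blob is interpreted along the way; only version bytes are compared.
/// Returns `None` if the version is absent or the cache is truncated.
fn find_blob<'a>(cache: &'a [u8], wanted: &str) -> Option<&'a [u8]> {
    let mut rest = cache;
    while !rest.is_empty() {
        // The version runs up to the first NUL.
        let v_end = rest.iter().position(|&b| b == 0)?;
        let (version, after) = (&rest[..v_end], &rest[v_end + 1..]);
        // The blob runs up to the next NUL.
        let b_end = after.iter().position(|&b| b == 0)?;
        let blob = &after[..b_end];
        if version == wanted.as_bytes() {
            return Some(blob);
        }
        rest = &after[b_end + 1..];
    }
    None
}
```

A caller would hand only the returned slice to a JSON parser, so a query for one exact version costs one JSON parse no matter how many versions the crate has published.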
Altogether, initial measurements of this show a massive improvement for Cargo null build performance. It’s expected that the improvements earned here will continue to grow over time, in the sense that the previous implementation (parse all lines each time) actually continues to slow down over time as new versions of a crate are published. In any case, when first implemented, a null build of Cargo itself would parse 3700 JSON blobs from the registry and load 150 blobs from git. Afterwards it parses 150 JSON blobs and loads 0 files from git. Removing 200ms or more from Cargo’s startup time is certainly nothing to sneeze at!
Note that this is just a high-level overview. There are of course lots of details, like invalidating caches and whatnot, which are handled below, but hopefully those are more obvious inline in the code itself.
Structs
- A parsed representation of a summary from the index.
- Manager for handling the on-disk index.
- Split 🔒
- An internal cache of summaries for a particular package.
- A representation of the cache on disk that Cargo maintains of summaries. Cargo will initially parse all summaries in the registry and will then serialize that into this form and place it in a new location on disk, ensuring that access in the future is much speedier.
- Crates.io treats hyphens and underscores as interchangeable, but the index and old versions of Cargo do not. Therefore, the index must store the uncanonicalized version of the name so that old versions of Cargo can find it. This iterator tries all possible combinations of switching hyphens and underscores to find the uncanonicalized one. As all stored inputs have the correct spelling, we start with the spelling as provided.
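The hyphen and underscore search described in the last item above can be sketched with a bitmask over the separator positions. This is a minimal illustration, not the struct’s actual implementation: the function name uncanonicalized_candidates and the cap of 10 separators (at most 1024 candidates) are assumptions chosen here to keep the enumeration bounded.

```rust
/// Hypothetical sketch: lists spellings of `name` obtained by swapping
/// `-` and `_` at each separator, starting with the spelling as provided.
/// Assumption: only the first 10 separators are varied (<= 1024 results).
fn uncanonicalized_candidates(name: &str) -> Vec<String> {
    // Byte offsets of every `-` or `_`; both are single-byte ASCII.
    let seps: Vec<usize> = name
        .char_indices()
        .filter(|&(_, c)| c == '-' || c == '_')
        .map(|(i, _)| i)
        .collect();
    let n = seps.len().min(10);
    let mut out = Vec::with_capacity(1 << n);
    // Mask 0 flips nothing, so the first candidate is the input itself.
    for mask in 0u32..(1u32 << n) {
        let mut bytes = name.as_bytes().to_vec();
        for (bit, &pos) in seps.iter().take(n).enumerate() {
            if mask & (1 << bit) != 0 {
                bytes[pos] = if bytes[pos] == b'-' { b'_' } else { b'-' };
            }
        }
        out.push(String::from_utf8(bytes).expect("only ASCII bytes were swapped"));
    }
    out
}
```

Because the as-provided spelling comes first, the common case where the name is already spelled as stored needs only a single lookup.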
Enums
- A lazily parsed IndexSummary.
Constants
Functions
- split 🔒