feat(pypi): store PyPI results as facts v2 #3654

Open
aignas wants to merge 15 commits into bazel-contrib:main from aignas:aignas.feat.facts

Conversation

@aignas
Collaborator

@aignas aignas commented Mar 8, 2026

This PR adds the functionality needed to write the data we find useful from the
SimpleAPI responses to the lock file. That is, the extension will no longer
connect to the network if it can find the necessary information in the lock file.
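
As a rough illustration of that behaviour, here is a minimal sketch of a "facts first, network second" lookup. The names `get_dist_urls`, `facts`, and `fetch` are hypothetical and not taken from the PR; the real extension is written in Starlark and stores richer data than a list of URLs.

```python
from typing import Callable, Dict, List


def get_dist_urls(
    pkg: str,
    facts: Dict[str, List[str]],
    fetch: Callable[[str], List[str]],
) -> List[str]:
    """Return distribution URLs for pkg, preferring recorded facts.

    `facts` stands in for the data persisted in the lock file and
    `fetch` stands in for a SimpleAPI request (both are assumptions).
    """
    if pkg in facts:
        # Lock file hit: no network access is needed.
        return facts[pkg]
    urls = fetch(pkg)
    # Record the result so that the next evaluation can skip the network.
    facts[pkg] = urls
    return urls
```

With this shape, a second call for the same package never reaches `fetch`, which is the property the PR description is after.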

Improvements that could be made, but that I don't have time for right now
(contributions are welcome):

  • Use the sha256 in addition to the version to filter what we need to pull
    from the lock file. Whilst this could theoretically speed up lookups,
    I am not sure how much better it would be in practice.
  • Pull the METADATA contents from the PyPI SimpleAPI if the metadata_url is there
    and write the Requires-Dist to the MODULE.bazel.lock file as facts.
  • Shorten the package URLs by removing the common prefix.
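
For the last item, one possible approach is sketched below. It is only an illustration of the idea, not something this PR implements; `shorten_urls` and `expand_urls` are hypothetical helpers.

```python
import os.path
from typing import List, Tuple


def shorten_urls(urls: List[str]) -> Tuple[str, List[str]]:
    """Split URLs into a shared prefix and per-URL tails."""
    # commonprefix compares character by character, so trim the result
    # back to the last "/" to avoid cutting a path segment in half.
    prefix = os.path.commonprefix(urls)
    prefix = prefix[: prefix.rfind("/") + 1]
    return prefix, [url[len(prefix):] for url in urls]


def expand_urls(prefix: str, tails: List[str]) -> List[str]:
    """Inverse of shorten_urls: rebuild the full URLs."""
    return [prefix + tail for tail in tails]
```

Storing the prefix once and only the tails per distribution would keep the lock file entries shorter while remaining fully reversible.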

Supersedes #3559
Fixes #2731

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly upgrades the PyPI package resolution process by introducing a sophisticated caching system. It enables the storage of PyPI index query results as 'facts' within the Bazel lock file, which will enhance build reproducibility and minimize external network dependencies. The changes also introduce granular control over package versions retrieved from the cache, allowing for more precise dependency management.

Highlights

  • PyPI Fact Storage: Implemented a two-tier caching system for PyPI index queries, storing results as 'facts' in the MODULE.bazel.lock file for improved reproducibility and reduced network calls. This includes a versioning mechanism for the stored facts.
  • Version-Aware Filtering: Introduced the ability to filter PyPI packages by specific requested versions, ensuring that the cache stores and retrieves only the necessary package distributions.
  • Improved HTML Parsing: Enhanced the parse_simpleapi_html function to more robustly parse SHA256 hashes and data-yanked attributes from PyPI simple API HTML, including handling escaped characters in yank reasons.
  • Cache Integration: Integrated the new caching and fact storage mechanisms into the _pip_impl extension and simpleapi_download logic, passing module_ctx to the cache and incorporating version information into cache keys.
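
To make the version-aware filtering highlight concrete, here is a minimal sketch. It assumes each distribution record carries a `version` field and is keyed by its sha256; both the record shape and the function name `filter_packages` are assumptions for illustration, not the PR's actual `_filter_packages`.

```python
from typing import Dict, Iterable


def filter_packages(
    dists: Dict[str, dict],
    requested_versions: Iterable[str],
) -> Dict[str, dict]:
    """Keep only distributions whose version was requested.

    `dists` maps a sha256 to a record such as
    {"filename": ..., "version": ...} (an assumed shape).
    """
    wanted = set(requested_versions)
    if not wanted:
        # Nothing specific was requested, so keep everything.
        return dict(dists)
    return {sha: d for sha, d in dists.items() if d["version"] in wanted}
```

Filtering before storing keeps the persisted facts limited to the distributions a build can actually use.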

Changelog
  • python/private/pypi/extension.bzl
    • Updated pypi_cache instantiation to pass module_ctx for fact management.
    • Added facts from the simpleapi_cache to the _pip_impl_repo call.
    • Included mods.facts in the module_ctx.extension_metadata for exposure.
  • python/private/pypi/hub_builder.bzl
    • Modified the sources parameter to accept a dictionary mapping distribution names to lists of versions, enabling version-specific requests.
  • python/private/pypi/parse_requirements.bzl
    • Updated the get_index_urls callable signature to accept a dictionary of distribution names and their requested versions.
    • Adjusted the logic to collect and pass distribution names along with their versions to get_index_urls.
  • python/private/pypi/parse_simpleapi_html.bzl
    • Refined the parsing of sha256 and data-yanked attributes from HTML, including handling of escaped quotes in yank reasons.
    • Updated metadata parsing to use attrs instead of tail for consistency.
  • python/private/pypi/pypi_cache.bzl
    • Introduced module_ctx and _FACT_VERSION to pypi_cache.
    • Implemented memory_cache and facts_cache structs for a two-tier caching strategy.
    • Modified _pypi_cache_setdefault and _pypi_cache_get to use a key including index_url, real_url, and versions.
    • Added _filter_packages to filter distributions based on requested versions before caching.
    • Implemented _get_facts, _get_from_facts, and _store_facts for managing facts in the lock file.
  • python/private/pypi/simpleapi_download.bzl
    • Introduced input_sources to store the distribution-version mapping.
    • Modified the loop over sources to iterate through package-version pairs.
    • Updated _read_simpleapi to accept a versions argument.
    • Included versions in the cache_key for _read_simpleapi.
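
The two-tier strategy described for pypi_cache.bzl can be sketched roughly as follows. This is a simplified model under stated assumptions: the class `TwoTierCache`, the JSON-encoded key, and the plain dict standing in for lock file facts are all hypothetical, and the real code is Starlark built around module_ctx.

```python
import json
from typing import Any, Dict, Iterable, Optional


class TwoTierCache:
    """In-memory tier backed by a persistent 'facts' tier (a plain dict here)."""

    def __init__(self, facts: Optional[Dict[str, Any]] = None):
        self._memory: Dict[str, Any] = {}
        # Stand-in for the facts read from the lock file.
        self.facts: Dict[str, Any] = dict(facts or {})

    @staticmethod
    def _key(index_url: str, real_url: str, versions: Iterable[str]) -> str:
        # Sorting makes the key independent of the order of the versions.
        return json.dumps([index_url, real_url, sorted(versions)])

    def get(self, index_url, real_url, versions):
        key = self._key(index_url, real_url, versions)
        if key in self._memory:
            return self._memory[key]
        if key in self.facts:
            # Promote the persisted entry into the in-memory tier.
            self._memory[key] = self.facts[key]
            return self._memory[key]
        return None

    def setdefault(self, index_url, real_url, versions, value):
        cached = self.get(index_url, real_url, versions)
        if cached is not None:
            return cached
        key = self._key(index_url, real_url, versions)
        self._memory[key] = value
        self.facts[key] = value
        return value
```

Including the versions in the key means a request for a different set of versions is a cache miss instead of silently returning a filtered result that lacks the needed distributions.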

# Each line follows the following pattern
# <a href="https://...#sha256=..." attribute1="foo" ... attributeN="bar">filename</a><br />
#
# Sometimes the lines may be split, so we should seek until `<br />`
Collaborator Author

What I meant here is that the yank reason sometimes spans multiple lines. It would be best to build a small tokenizing parser here.
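
A minimal sketch of what such a tokenizing approach could look like is shown below, using Python's standard `html.parser.HTMLParser` as the tokenizer. Starlark has no equivalent module, so this only illustrates the idea; the class name `SimpleAPIParser` and the record fields are assumptions.

```python
from html.parser import HTMLParser


class SimpleAPIParser(HTMLParser):
    """Collect one record per <a> element of a Simple API page."""

    def __init__(self):
        super().__init__()
        self.dists = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        url, _, fragment = (attrs.get("href") or "").partition("#")
        sha256 = fragment[len("sha256="):] if fragment.startswith("sha256=") else ""
        self._current = {
            "url": url,
            "sha256": sha256,
            # Presence of the attribute marks the file as yanked,
            # even when no reason is given.
            "yanked": "data-yanked" in attrs,
            "yank_reason": attrs.get("data-yanked") or "",
            "filename": "",
        }

    def handle_data(self, data):
        if self._current is not None:
            self._current["filename"] += data.strip()

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.dists.append(self._current)
            self._current = None
```

Because the tokenizer works on tags and attributes instead of lines, a yank reason containing newlines or escaped quotes arrives as one unescaped attribute value.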

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant feature to store PyPI query results as facts in the lock file, aiming to improve performance on subsequent runs. However, a critical security vulnerability exists in the handling of yanked packages: the parsing logic in parse_simpleapi_html.bzl ignores the data-yanked status if the reason provided by the index is empty. This violates the PyPI Simple Repository API specification and could lead to the installation of vulnerable packages. Furthermore, a critical bug in the new in-memory caching implementation prevents it from storing any data.
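
The yanked-package concern can be illustrated with a small sketch. PEP 592 says the data-yanked attribute may have no value and that its presence alone marks the file as yanked. The first function below is only a guess at the kind of check the review describes, not the PR's actual code; both function names are hypothetical.

```python
from typing import Dict, Optional


def is_yanked_reason_only(attrs: Dict[str, Optional[str]]) -> bool:
    # Problematic pattern: an empty or missing reason is treated as "not yanked".
    return bool(attrs.get("data-yanked"))


def is_yanked(attrs: Dict[str, Optional[str]]) -> bool:
    # Presence of the attribute is what matters; the reason is optional.
    return "data-yanked" in attrs
```

Checking for the key instead of the truthiness of its value keeps files that were yanked without a reason out of the set of installable candidates.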

@aignas aignas force-pushed the aignas.feat.facts branch from 8206cf5 to b90cc51 on March 10, 2026 09:57
@aignas aignas marked this pull request as ready for review March 10, 2026 10:21
@aignas aignas requested review from groodt and rickeylev as code owners March 10, 2026 10:21
Development

Successfully merging this pull request may close these issues.

Write pip extension metadata to the MODULE.bazel.lock file

1 participant