Skip to content

ci: layered Python base images for cross-matrix dedup#9672

Open
richiejp wants to merge 3 commits intomasterfrom
ci/layered-base-images
Open

ci: layered Python base images for cross-matrix dedup#9672
richiejp wants to merge 3 commits intomasterfrom
ci/layered-base-images

Conversation

@richiejp
Copy link
Copy Markdown
Collaborator

@richiejp richiejp commented May 5, 2026

Experiment to see if creating OS+vendor+language base images can massively reduce CI time and build time locally.


The 234-entry backend matrix runs the same apt-update + GPU SDK install +
Python toolchain bootstrap into N independent registry-cache tags. Factor
that shared work out into a tier-1+2 base image (lang × accel × ubuntu ×
cuda) built once per workflow run, then consumed by every backend that
matches its tuple via BASE_IMAGE_PREBUILT.

The matrix data moves to .github/backend-matrix.yaml so backend.yml can
switch to fromJSON without duplicating the matrix. scripts/changed-backends.js
reads the data file, derives the deduplicated bases-matrix, annotates each
Python entry with the right base-image-prebuilt ref, and runs a collision
check that fails loudly if a future matrix change makes two consumers want
incompatible bases under the same tag-stem.

PR builds tag with -pr so end-to-end validation lives within one PR;
master builds tag without the suffix. The base-images registry cache
parallels the existing per-matrix-entry caches.

Adding a new (accel, cuda) flavour is a backend-matrix.yaml edit; adding
a new language tier is a Dockerfile. recipe + a slim of the
consumer Dockerfile (script auto-detects via .docker/bases/).

10 distinct bases derive from the current 234 entries, replacing the
inline bootstrap that previously ran into ~10 separate cache tags.

Assisted-by: Claude:opus-4-7-1m [Claude Code]

richiejp added a commit that referenced this pull request May 6, 2026
The previous tag scheme pushed to quay.io/go-skynet/localai-base, which
required a separate quay repo + a write-permission grant for the CI
robot. PR #9672 hit a 401 on push because that grant was missing — the
robot can log in but not write to localai-base.

ci-cache already exists, the robot already has write access (it writes
the buildkit cache there on every backend build), and OCI tags namespace
cleanly within a repo. So publish base images to
quay.io/go-skynet/ci-cache:base-image-<stem>[-pr<N>]. The `base-image-`
prefix doesn't collide with the existing tag prefixes:
  - cache<tag-suffix>           per-backend buildkit cache
  - cache-localai<tag-suffix>   root image buildkit cache
  - base-<stem>                 base image's own buildkit cache
  - base-image-<stem>           the published OCI image (new)

base_images.yml's compute_ref step and prebuiltRef() in
scripts/changed-backends.js are kept in lock-step. Local Makefile tags
are unchanged (they're just local docker labels with no remote
correlation).

Assisted-by: Claude:opus-4-7-1m [Claude Code]
richiejp added a commit that referenced this pull request May 6, 2026
The previous tag scheme pushed to quay.io/go-skynet/localai-base, which
required a separate quay repo + a write-permission grant for the CI
robot. PR #9672 hit a 401 on push because that grant was missing — the
robot can log in but not write to localai-base.

ci-cache already exists, the robot already has write access (it writes
the buildkit cache there on every backend build), and OCI tags namespace
cleanly within a repo. So publish base images to
quay.io/go-skynet/ci-cache:base-image-<stem>[-pr<N>]. The `base-image-`
prefix doesn't collide with the existing tag prefixes:
  - cache<tag-suffix>           per-backend buildkit cache
  - cache-localai<tag-suffix>   root image buildkit cache
  - base-<stem>                 base image's own buildkit cache
  - base-image-<stem>           the published OCI image (new)

base_images.yml's compute_ref step and prebuiltRef() in
scripts/changed-backends.js are kept in lock-step. Local Makefile tags
are unchanged (they're just local docker labels with no remote
correlation).

Assisted-by: Claude:opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
@richiejp richiejp force-pushed the ci/layered-base-images branch from 7481e52 to 76eab55 Compare May 6, 2026 15:08
richiejp added 3 commits May 6, 2026 16:10
The 234-entry backend matrix runs the same apt-update + GPU SDK install +
Python toolchain bootstrap into N independent registry-cache tags. Factor
that shared work out into a tier-1+2 base image (lang × accel × ubuntu ×
cuda) built once per workflow run, then consumed by every backend that
matches its tuple via BASE_IMAGE_PREBUILT.

The matrix data moves to .github/backend-matrix.yaml so backend.yml can
switch to fromJSON without duplicating the matrix. scripts/changed-backends.js
reads the data file, derives the deduplicated bases-matrix, annotates each
Python entry with the right base-image-prebuilt ref, and runs a collision
check that fails loudly if a future matrix change makes two consumers want
incompatible bases under the same tag-stem.

PR builds tag with -pr<N> so end-to-end validation lives within one PR;
master builds tag without the suffix. The base-images registry cache
parallels the existing per-matrix-entry caches.

Adding a new (accel, cuda) flavour is a backend-matrix.yaml edit; adding
a new language tier is a Dockerfile.<lang> recipe + a slim of the
consumer Dockerfile (script auto-detects via .docker/bases/).

10 distinct bases derive from the current 234 entries, replacing the
inline bootstrap that previously ran into ~10 separate cache tags.

Assisted-by: Claude:opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Python's tier-1+2 base image (apt + GPU SDK + lang toolchain) was the only
lang previously factored. The remaining 82 matrix entries (62 golang +
9 llama-cpp + 9 turboquant + 1 ik-llama-cpp + 1 rust) still inlined the
same bootstrap into per-backend cache tags.

Add .docker/bases/Dockerfile.{golang,cpp,rust} mirroring Dockerfile.python's
GPU stack, with the lang-specific tail at the bottom (Go + protoc + grpc
tooling; protoc + cmake + GRPC; rustup + audio dev libs respectively).
Slim the five consumer Dockerfiles to FROM ${BASE_IMAGE_PREBUILT} + the
per-backend COPY/make.

The C++ trio (llama-cpp, ik-llama-cpp, turboquant) only differ in their
make targets, so langOf() in scripts/changed-backends.js remaps all three
Dockerfile suffixes to the shared 'cpp' base. That collapses 17 would-be
distinct bases to 8. langTriggerSelector and baseTriggerFiles are
extended so PRs touching the new recipes fan out canaries; the
.docker/bases/ auto-detection picks up the new langs without further
script changes.

Makefile: add docker-build-{python,golang,cpp,rust}-base targets and a
local-base-tag/local-base-target macro pair so each backend's
docker-build-X chains through the right base. The previous python-only
prereq is now a generic per-lang dispatch.

Total distinct bases for the full 234-entry matrix: 29 (was 9 with only
python factored). The C++ base also absorbs the previously per-consumer
GRPC build stage, removing the dominant cost from the llama-cpp /
ik-llama-cpp / turboquant rebuild paths.

Assisted-by: Claude:opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
The previous tag scheme pushed to quay.io/go-skynet/localai-base, which
required a separate quay repo + a write-permission grant for the CI
robot. PR #9672 hit a 401 on push because that grant was missing — the
robot can log in but not write to localai-base.

ci-cache already exists, the robot already has write access (it writes
the buildkit cache there on every backend build), and OCI tags namespace
cleanly within a repo. So publish base images to
quay.io/go-skynet/ci-cache:base-image-<stem>[-pr<N>]. The `base-image-`
prefix doesn't collide with the existing tag prefixes:
  - cache<tag-suffix>           per-backend buildkit cache
  - cache-localai<tag-suffix>   root image buildkit cache
  - base-<stem>                 base image's own buildkit cache
  - base-image-<stem>           the published OCI image (new)

base_images.yml's compute_ref step and prebuiltRef() in
scripts/changed-backends.js are kept in lock-step. Local Makefile tags
are unchanged (they're just local docker labels with no remote
correlation).

Assisted-by: Claude:opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
@richiejp richiejp force-pushed the ci/layered-base-images branch from 76eab55 to 9d42a16 Compare May 6, 2026 20:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant