feat: add buildx bake target for nmp-cpu-tasks#159
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds Makefile image/build variables and pattern targets, a docker-bake.hcl with platform selection and three bake targets, three multi-stage Dockerfiles, README/.dockerignore updates, CI job, E2E tests with conftest feature gating and image fixture, and subprocess pause handling with unit tests. ChangesDocker build infrastructure
End-to-end job tests
Suggested reviewers:
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
Makefile (1)
82-95: 💤 Low value
DOCKERFILE_ROOTis declared but never propagated to bake.
docker-bake.hclreadsDOCKERFILE_ROOT, but neitherload/%norpush/%forwards it, so overriding it via Make has no effect (works only because both default to.).♻️ Forward DOCKERFILE_ROOT
load/%: ## Build and load a local single-arch image from bake target %-docker BUILD_ARCH=$(BUILD_ARCH) \ BUILD_CONTEXT=$(BUILD_CONTEXT) \ + DOCKERFILE_ROOT=$(DOCKERFILE_ROOT) \ NMP_IMAGE_REGISTRY=$(NMP_IMAGE_REGISTRY) \ NMP_IMAGE_TAG=$(NMP_IMAGE_TAG) \ docker buildx bake --load $*-docker push/%: ## Build and push a multi-arch image manifest from bake target %-docker BUILD_CONTEXT=$(BUILD_CONTEXT) \ + DOCKERFILE_ROOT=$(DOCKERFILE_ROOT) \ NMP_IMAGE_REGISTRY=$(NMP_IMAGE_REGISTRY) \ NMP_IMAGE_TAG=$(NMP_IMAGE_TAG) \ docker buildx bake --push $*-docker🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@Makefile` around lines 82 - 95, The Makefile targets load/% and push/% do not forward DOCKERFILE_ROOT to docker buildx bake, so overrides are ignored; update both targets (load/% and push/%) to export DOCKERFILE_ROOT by adding DOCKERFILE_ROOT=$(DOCKERFILE_ROOT) to the environment block before the docker buildx bake invocation so docker-bake.hcl receives the overridden value.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docker/Dockerfile.nmp-cpu-tasks`:
- Line 11: The Dockerfile uses COPY . . which sends the entire repo root into
the build context; update the repository's .dockerignore to exclude .git and any
other large/generated build-output directories relevant to this image (e.g.,
dist, build, .venv, node_modules, caches) so the build context is smaller and
cache invalidation is reduced—add entries for .git and the specific generated
dirs used by this project and verify the Docker build still has all required
files.
In `@Makefile`:
- Around line 82-88: The README references a non-existent make target
docker/nmp-cpu-tasks while the Makefile only defines load/% and push/%; either
update the README to use make load/nmp-cpu-tasks or add a docker/% alias target
in the Makefile that forwards to the existing load/% implementation (e.g., make
docker/% depend on or invoke load/%), ensuring the new target name (docker/% or
the README change to load/nmp-cpu-tasks) matches the referenced command and
reuses the existing BUILD_ARCH/BUILD_CONTEXT/NMP_IMAGE_REGISTRY/NMP_IMAGE_TAG
behavior used by load/%.
---
Nitpick comments:
In `@Makefile`:
- Around line 82-95: The Makefile targets load/% and push/% do not forward
DOCKERFILE_ROOT to docker buildx bake, so overrides are ignored; update both
targets (load/% and push/%) to export DOCKERFILE_ROOT by adding
DOCKERFILE_ROOT=$(DOCKERFILE_ROOT) to the environment block before the docker
buildx bake invocation so docker-bake.hcl receives the overridden value.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b18ebfc3-135d-4809-a594-eaf7a11ab4c4
📒 Files selected for processing (3)
Makefiledocker-bake.hcldocker/Dockerfile.nmp-cpu-tasks
|
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docker/Dockerfile.nmp-api`:
- Around line 42-43: The Dockerfile uses a build argument ARG USERNAME but
hardcodes "nvs" in COPY --chown=nvs:nvs and ENV USERNAME=nvs, which breaks
overrides; update all COPY --chown and ENV USERNAME occurrences to use the ARG
(e.g., reference the variable instead of "nvs") so chown and the environment
follow the provided USERNAME, and ensure ARG USERNAME is declared before those
lines and used consistently in each block (the COPY --chown and ENV USERNAME
entries referenced in the diff and the similar blocks later).
- Line 99: The Dockerfile currently forces the container to run as root via the
"USER 0" directive; remove or replace that directive so the runtime image runs
as a non-root user (e.g., use the existing non-root user created earlier or
switch to a specific non-root UID/GID such as 1000) so that the image does not
drop to root at line containing "USER 0". Ensure the chosen non-root user exists
in the image (refer to the user creation step or existing user name) and that
file permissions/ownership are adjusted accordingly.
In `@docker/Dockerfile.nmp-core`:
- Around line 42-43: The Dockerfile defines ARG USERNAME and USER_UID but still
uses the hardcoded string "nvs" in places (e.g., COPY --chown and ENV), which
breaks overrides; find occurrences of the literal "nvs" (including in COPY
--chown, ENV, RUN chown commands) and replace them with the ARG-based variable
(use the ARG/ENV expansion like ${USERNAME} or "$USERNAME") and ensure UID uses
${USER_UID} where appropriate so COPY --chown, chown, and ENV remain consistent
when ARGs are overridden (also update the other occurrences noted around the
later blocks that reference "nvs").
- Line 62: Replace the final root user declaration "USER 0" with the non-root
username created earlier in the Dockerfile (the user added with useradd/addgroup
or created as e.g. "appuser" or "node"); locate the earlier creation (commands
like useradd, adduser, groupadd or the line that set a USER previously) and use
that exact username in the final USER instruction so the container runs as that
non-root user instead of root.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 98c79f47-48e2-4b53-b1b9-66203d2b7736
📒 Files selected for processing (3)
docker-bake.hcldocker/Dockerfile.nmp-apidocker/Dockerfile.nmp-core
| @@ -0,0 +1,105 @@ | |||
| # syntax=docker/dockerfile:1 | |||
|
|
|||
| FROM ghcr.io/astral-sh/uv:python3.11-bookworm-slim AS builder | |||
There was a problem hiding this comment.
so are we no longer using nvidia base images? IIRC thats why we removed the docker images outside of the OSS repo
There was a problem hiding this comment.
I'm just doing the bare minimum required to get things working. The idea here is that customers would build their own images, so they should be light. I'm not quite sure what to do about the bigger gpu and customizer images though. We need some big instances for those, and I don't know if we want to have those building on OSS.
There was a problem hiding this comment.
personally I don't like those distroless images as far as accessing them "in the field". If someone wants to build this repository for scratch, for whatever reason (weird customer stuff), it's probably best we don't depend on things inside a private network boundary. Also doesn't feel in the spirit of OS
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
e2e/conftest.py (1)
151-157: 💤 Low valueUse concrete return type instead of
Any.
image_builderreturnsCallable[[str], str], as confirmed by usage intest_jobs_auth.py.♻️ Proposed change
-from typing import IO, Any +from collections.abc import Callable +from typing import IO, Any @@ -@pytest.fixture(scope="session") -def image() -> Any: +@pytest.fixture(scope="session") +def image() -> Callable[[str], str]:🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@e2e/conftest.py` around lines 151 - 157, The fixture `image` is typed too generically as Any; `image_builder` returns a Callable[[str], str], so update the fixture's return type to the concrete callable type (e.g., Callable[[str], str]) and import typing.Callable if not already present so the `image` function signature reads something like def image() -> Callable[[str], str]:; retain the existing logic and ensure any imports (Callable) are added or consolidated at the top of the file.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@e2e/test_jobs.py`:
- Around line 243-257: The test currently prints and asserts on the raw secret
via EnvironmentVariable(name="SECRET_ENV_VAR") and the shell command echo
"Secret value is: $SECRET_ENV_VAR", which leaks secrets; change the job command
to not output the secret (e.g., echo a fixed placeholder or a success message)
and update the assertion to verify job completion and that the secret value is
NOT present in logs (use wait_for_job_logs and assert the raw secret string is
absent) — locate and update the command in the job spec where
EnvironmentVariable is used and adjust assertions around completed_job and
wait_for_job_logs accordingly.
- Around line 230-234: The test creates a secret with a fixed name via
sdk.secrets.create (the secret variable) which causes collisions; change the
call to use a unique name per run (e.g., append a short UUID/timestamp/CI_JOB_ID
or workspace identifier) so the secret name is not constant across
retries/parallel runs, and ensure any tear-down references that expect the
original name are updated to use the generated name.
---
Nitpick comments:
In `@e2e/conftest.py`:
- Around line 151-157: The fixture `image` is typed too generically as Any;
`image_builder` returns a Callable[[str], str], so update the fixture's return
type to the concrete callable type (e.g., Callable[[str], str]) and import
typing.Callable if not already present so the `image` function signature reads
something like def image() -> Callable[[str], str]:; retain the existing logic
and ensure any imports (Callable) are added or consolidated at the top of the
file.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 101fc346-dae2-4116-add3-4389eb9cf1d2
📒 Files selected for processing (6)
.dockerignoredocker/Dockerfile.nmp-apidocker/Dockerfile.nmp-coree2e/conftest.pye2e/test_jobs.pye2e/test_jobs_auth.py
✅ Files skipped from review due to trivial changes (1)
- .dockerignore
🚧 Files skipped from review as they are similar to previous changes (2)
- docker/Dockerfile.nmp-core
- docker/Dockerfile.nmp-api
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
services/core/jobs/src/nmp/core/jobs/controllers/backends/subprocess.py (1)
278-292:⚠️ Potential issue | 🟠 Major | ⚡ Quick winTreat missing subprocess metadata as paused when step is
PAUSING.
sync()now supports pausing, but this branch still returnsERRORwhen metadata is missing forPAUSING. That can break pause transitions after backend restarts, even thoughPAUSING -> PAUSEDis valid.Suggested fix
if metadata is None: if step.status == PlatformJobStatus.CANCELLING: return JobUpdate( status=PlatformJobStatus.CANCELLED.value, status_details={"message": "Subprocess not found, job cancelled"}, ) + if step.status == PlatformJobStatus.PAUSING: + return JobUpdate( + status=PlatformJobStatus.PAUSED.value, + status_details={"message": "Subprocess not found, job paused"}, + ) return JobUpdate( status=PlatformJobStatus.ERROR.value, error_details={"message": "Local subprocess metadata not found"}, )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@services/core/jobs/src/nmp/core/jobs/controllers/backends/subprocess.py` around lines 278 - 292, The branch that handles metadata is treating missing subprocess metadata as ERROR for PAUSING; change it so that when metadata is None and step.status == PlatformJobStatus.PAUSING you return a JobUpdate marking the step PAUSED (e.g., status=PlatformJobStatus.PAUSED.value with an appropriate status_details message) instead of ERROR; keep the existing behavior for CANCELLING (return CANCELLED) and for other statuses return the ERROR JobUpdate. This affects the metadata check around metadata is None in subprocess.py and should align with how _create_step_update/_terminate_process handle PAUSING elsewhere.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@services/core/jobs/tests/controllers/test_subprocess_backend.py`:
- Around line 238-240: The test currently backdates timestamps using
step.created_at.replace(year=step.created_at.year - 1) which can fail for
leap-day dates; instead subtract a duration using datetime.timedelta (e.g.,
import datetime and set step.created_at = step.created_at -
datetime.timedelta(days=365) and same for step.updated_at) to deterministically
backdate by one year while preserving timezone/naive semantics.
---
Outside diff comments:
In `@services/core/jobs/src/nmp/core/jobs/controllers/backends/subprocess.py`:
- Around line 278-292: The branch that handles metadata is treating missing
subprocess metadata as ERROR for PAUSING; change it so that when metadata is
None and step.status == PlatformJobStatus.PAUSING you return a JobUpdate marking
the step PAUSED (e.g., status=PlatformJobStatus.PAUSED.value with an appropriate
status_details message) instead of ERROR; keep the existing behavior for
CANCELLING (return CANCELLED) and for other statuses return the ERROR JobUpdate.
This affects the metadata check around metadata is None in subprocess.py and
should align with how _create_step_update/_terminate_process handle PAUSING
elsewhere.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 5349dc78-b6ca-47b7-99bb-f1ff22fcc141
📒 Files selected for processing (6)
.github/workflows/ci.yamle2e/test_jobs.pyservices/core/jobs/jobs-launcher/build-manual.shservices/core/jobs/src/nmp/core/jobs/controllers/backends/docker.pyservices/core/jobs/src/nmp/core/jobs/controllers/backends/subprocess.pyservices/core/jobs/tests/controllers/test_subprocess_backend.py
✅ Files skipped from review due to trivial changes (2)
- services/core/jobs/jobs-launcher/build-manual.sh
- services/core/jobs/src/nmp/core/jobs/controllers/backends/docker.py
🚧 Files skipped from review as they are similar to previous changes (1)
- e2e/test_jobs.py
b3a4593 to
536a0ed
Compare
Signed-off-by: Ryan Sadler <267728323+ironcommit@users.noreply.github.com>
Signed-off-by: Ryan Sadler <267728323+ironcommit@users.noreply.github.com>
Signed-off-by: Ryan Sadler <267728323+ironcommit@users.noreply.github.com>
Signed-off-by: Ryan Sadler <267728323+ironcommit@users.noreply.github.com>
Signed-off-by: Ryan Sadler <267728323+ironcommit@users.noreply.github.com>
Signed-off-by: Ryan Sadler <267728323+ironcommit@users.noreply.github.com>
Signed-off-by: Ryan Sadler <267728323+ironcommit@users.noreply.github.com>
Signed-off-by: Ryan Sadler <267728323+ironcommit@users.noreply.github.com>
Signed-off-by: Ryan Sadler <267728323+ironcommit@users.noreply.github.com>
Signed-off-by: Ryan Sadler <267728323+ironcommit@users.noreply.github.com>
Signed-off-by: Ryan Sadler <267728323+ironcommit@users.noreply.github.com>
Signed-off-by: Ryan Sadler <267728323+ironcommit@users.noreply.github.com>
Signed-off-by: Ryan Sadler <267728323+ironcommit@users.noreply.github.com>
c5ad1a3 to
debe34d
Compare
| env: | ||
| _TYPER_FORCE_DISABLE_TERMINAL: "1" | ||
| run: | | ||
| uv run --frozen nemo quickstart up --image my-registry/nmp-api:local --no-pull --force |
There was a problem hiding this comment.
do we want to push to our package registry?
| @@ -0,0 +1,105 @@ | |||
| # syntax=docker/dockerfile:1 | |||
|
|
|||
| FROM ghcr.io/astral-sh/uv:python3.11-bookworm-slim AS builder | |||
There was a problem hiding this comment.
personally I don't like those distroless images as far as accessing them "in the field". If someone wants to build this repository for scratch, for whatever reason (weird customer stuff), it's probably best we don't depend on things inside a private network boundary. Also doesn't feel in the spirit of OS
| COPY web/packages ./packages | ||
|
|
||
| RUN pnpm install --frozen-lockfile | ||
| RUN pnpm --filter nemo-studio-ui build:fastapi |
There was a problem hiding this comment.
I saw a decent amount of instability in https://github.com/NVIDIA-NeMo/Platform-Deploy/pull/66 with hang ups on arm64 for studio and have a retry workaround there. Might not be worth it but thought it was worth sharing
There was a problem hiding this comment.
appropriate to add these in this same PR? might be fine
| except yaml.YAMLError as e: | ||
| raise ValueError(f"Error parsing platform config at {path}: {e}") from e | ||
|
|
||
| if "platform" in config_data or "files" in config_data: |
There was a problem hiding this comment.
why this files check, even though we never use that key?
| import docker | ||
| from docker.errors import DockerException, ImageNotFound | ||
|
|
||
| import docker |
There was a problem hiding this comment.
is this the right sort order?
| from docker.errors import APIError, BuildError, DockerException, ImageNotFound, NotFound | ||
| from tenacity import retry, stop_after_attempt, wait_fixed | ||
|
|
||
| import docker |
There was a problem hiding this comment.
think this sort order is probably wrong across these fields, revert?
|
|
||
| echo "Building static binary for ${APP_NAME}..." | ||
|
|
||
| # Warning: this binary is copied by the Docker jobs controller into the target |
There was a problem hiding this comment.
when is this gonna be rewritten in rust? 😆
| ```bash | ||
| cd /path/to/nmp | ||
| make docker/nmp-cpu-tasks | ||
| make load/nmp-cpu-tasks |
There was a problem hiding this comment.
why did we change this target to load/?
| rm -rf .venv/ | ||
|
|
||
| .PHONY: load/% | ||
| load/%: ## Build and load a local single-arch image from bake target %-docker |
There was a problem hiding this comment.
not that I love docker/%, but I'd prefer build/% or docker-build/% or even bake/%
Summary
Testing
Summary by CodeRabbit