Skip to content

feat: add buildx bake target for nmp-cpu-tasks#159

Open
ironcommit wants to merge 14 commits into
mainfrom
cpu-tasks/rsadler
Open

feat: add buildx bake target for nmp-cpu-tasks#159
ironcommit wants to merge 14 commits into
mainfrom
cpu-tasks/rsadler

Conversation

@ironcommit
Copy link
Copy Markdown
Contributor

@ironcommit ironcommit commented Jun 3, 2026

Summary

  • add a local docker buildx bake target for nmp-cpu-tasks
  • add generic Makefile wrappers for load/ and push/
  • add the nmp-cpu-tasks Dockerfile used by the bake target

Testing

  • docker buildx bake --print nmp-cpu-tasks-docker
  • BUILD_ARCH=linux/arm64 docker buildx bake --print nmp-cpu-tasks-docker
  • BUILD_ARCH=linux/amd64 docker buildx bake --print nmp-cpu-tasks-docker

Summary by CodeRabbit

  • Chores
    • Configurable image/build defaults and Makefile targets for load/push; updated .dockerignore; CI job to build/run Docker-focused E2E tests.
  • New Features
    • New Docker bake targets and dedicated Dockerfiles for API, core, and CPU-tasks enabling single- and multi-arch image builds.
  • Documentation
    • Docker-based quickstart for local image loading, verification, and restart.
  • Bug Fixes
    • Improved subprocess backend handling for pausing → paused transitions.
  • Tests
    • Expanded E2E suites covering jobs, logs, secrets, pause/resume, cancellation, and auth.

@ironcommit ironcommit requested review from a team as code owners June 3, 2026 17:54
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 3, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds Makefile image/build variables and pattern targets, a docker-bake.hcl with platform selection and three bake targets, three multi-stage Dockerfiles, README/.dockerignore updates, CI job, E2E tests with conftest feature gating and image fixture, and subprocess pause handling with unit tests.

Changes

Docker build infrastructure

Layer / File(s) Summary
Make variables and pattern targets
Makefile
Adds NMP_IMAGE_REGISTRY, NMP_IMAGE_TAG, BUILD_CONTEXT, DOCKERFILE_ROOT and pattern targets load/% and push/% that call docker buildx bake with BUILD_ARCH, BUILD_CONTEXT, DOCKERFILE_ROOT, NMP_IMAGE_REGISTRY, NMP_IMAGE_TAG.
Bake configuration and platform selection
docker-bake.hcl
Adds build variables (BUILD_ARCH, BUILD_CONTEXT, DOCKERFILE_ROOT, NMP_IMAGE_REGISTRY, NMP_IMAGE_TAG, NMP_PLATFORM_VERSION, NMP_CODE_REVISION), implements get_platforms() (single BUILD_ARCH if set else linux/amd64,linux/arm64), and defines nmp-api-docker, nmp-core-docker, nmp-cpu-tasks-docker targets (api passes build args).
nmp-api multi-stage Dockerfile
docker/Dockerfile.nmp-api
Builder: Python/uv venv, DuckDB prep, prefetch HF model; Go stage builds static jobs-launcher; Node stage builds Studio UI; runtime: assembles artifacts, exposes 8080, readiness healthcheck, ENTRYPOINT nemo services run.
nmp-core multi-stage Dockerfile
docker/Dockerfile.nmp-core
Builder: Python venv and DuckDB extensions; Go static jobs-launcher build; runtime: copies venv, DuckDB, binary, exposes 8080, readiness check, ENTRYPOINT nemo services run.
nmp-cpu-tasks multi-stage Dockerfile
docker/Dockerfile.nmp-cpu-tasks
Builder: uv sync --only cpu-tasks to freeze venv with cache; runtime: create non-root user, copy .venv and tiktoken-cache, set TIKTOKEN_CACHE_DIR and PATH, ENTRYPOINT nemo-platform, default CMD --help.
Docs and build context ignore
README.md, .dockerignore
Adds local nmp-api quickstart instructions and expands .dockerignore to exclude common caches and build outputs.
CI job and jobs-launcher note
.github/workflows/ci.yaml, services/core/jobs/jobs-launcher/build-manual.sh
Adds python-e2e-docker-test CI job to build/load local images and run Docker-focused E2E tests; adds warning comment about target OS/arch when building jobs-launcher.

End-to-end job tests

Layer / File(s) Summary
E2E pytest config and image fixture
e2e/conftest.py
Import image_builder, add _AUTH_ENABLED gating and feature marker registration/collection skip logic, and add a session-scoped image fixture resolving registry/tag.
Jobs tests: helpers and diagnostics
e2e/test_jobs.py
Adds module scaffolding, _single_step_job helper to build PlatformJobSpec, and _job_diagnostic_message diagnostic helper.
Jobs tests: lifecycle, logs, config, storage, secrets, control
e2e/test_jobs.py
Adds tests for job lifecycle, log batch streaming, config readability, inter-step persistent storage, secret-backed env vars, expected failures, cancel/pause/resume behaviors.
Auth-enabled jobs tests
e2e/test_jobs_auth.py
Adds bearer-token SDK helper and auth-gated tests verifying principal propagation, unauthorized access enforcement (403), and admin listing across ALL_WORKSPACES.
Subprocess backend pause handling and tests
services/core/jobs/src/.../subprocess.py, services/core/jobs/tests/controllers/test_subprocess_backend.py
Treat PAUSING similar to CANCELLING for termination; map PAUSING->PAUSING while running and PAUSING->PAUSED after exit; add unit tests for pausing transitions and TTL behavior.
Docker controller import and docstring
services/core/jobs/src/.../docker.py
Relocate import docker and expand get_jobs_launcher_binary docstring with a Warning about host-side resolution and cross-arch expectations.

Suggested reviewers:

  • svvarom
🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Title check ⚠️ Warning The title focuses narrowly on nmp-cpu-tasks bake target, but the PR also adds nmp-api, nmp-core, Docker image support, E2E tests, CI workflows, and subprocess pause/resume—far broader than the title suggests. Revise title to reflect the full scope: e.g., 'feat: add Docker buildx bake targets and E2E tests for nmp-api, nmp-core, and nmp-cpu-tasks' or split into multiple PRs.
Docstring Coverage ⚠️ Warning Docstring coverage is 32.26% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch cpu-tasks/rsadler

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
Makefile (1)

82-95: 💤 Low value

DOCKERFILE_ROOT is declared but never propagated to bake.

docker-bake.hcl reads DOCKERFILE_ROOT, but neither load/% nor push/% forwards it, so overriding it via Make has no effect (works only because both default to .).

♻️ Forward DOCKERFILE_ROOT
 load/%: ## Build and load a local single-arch image from bake target %-docker
 	BUILD_ARCH=$(BUILD_ARCH) \
 	BUILD_CONTEXT=$(BUILD_CONTEXT) \
+	DOCKERFILE_ROOT=$(DOCKERFILE_ROOT) \
 	NMP_IMAGE_REGISTRY=$(NMP_IMAGE_REGISTRY) \
 	NMP_IMAGE_TAG=$(NMP_IMAGE_TAG) \
 	docker buildx bake --load $*-docker

 push/%: ## Build and push a multi-arch image manifest from bake target %-docker
 	BUILD_CONTEXT=$(BUILD_CONTEXT) \
+	DOCKERFILE_ROOT=$(DOCKERFILE_ROOT) \
 	NMP_IMAGE_REGISTRY=$(NMP_IMAGE_REGISTRY) \
 	NMP_IMAGE_TAG=$(NMP_IMAGE_TAG) \
 	docker buildx bake --push $*-docker
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Makefile` around lines 82 - 95, The Makefile targets load/% and push/% do not
forward DOCKERFILE_ROOT to docker buildx bake, so overrides are ignored; update
both targets (load/% and push/%) to export DOCKERFILE_ROOT by adding
DOCKERFILE_ROOT=$(DOCKERFILE_ROOT) to the environment block before the docker
buildx bake invocation so docker-bake.hcl receives the overridden value.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docker/Dockerfile.nmp-cpu-tasks`:
- Line 11: The Dockerfile uses COPY . . which sends the entire repo root into
the build context; update the repository's .dockerignore to exclude .git and any
other large/generated build-output directories relevant to this image (e.g.,
dist, build, .venv, node_modules, caches) so the build context is smaller and
cache invalidation is reduced—add entries for .git and the specific generated
dirs used by this project and verify the Docker build still has all required
files.

In `@Makefile`:
- Around line 82-88: The README references a non-existent make target
docker/nmp-cpu-tasks while the Makefile only defines load/% and push/%; either
update the README to use make load/nmp-cpu-tasks or add a docker/% alias target
in the Makefile that forwards to the existing load/% implementation (e.g., make
docker/% depend on or invoke load/%), ensuring the new target name (docker/% or
the README change to load/nmp-cpu-tasks) matches the referenced command and
reuses the existing BUILD_ARCH/BUILD_CONTEXT/NMP_IMAGE_REGISTRY/NMP_IMAGE_TAG
behavior used by load/%.

---

Nitpick comments:
In `@Makefile`:
- Around line 82-95: The Makefile targets load/% and push/% do not forward
DOCKERFILE_ROOT to docker buildx bake, so overrides are ignored; update both
targets (load/% and push/%) to export DOCKERFILE_ROOT by adding
DOCKERFILE_ROOT=$(DOCKERFILE_ROOT) to the environment block before the docker
buildx bake invocation so docker-bake.hcl receives the overridden value.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b18ebfc3-135d-4809-a594-eaf7a11ab4c4

📥 Commits

Reviewing files that changed from the base of the PR and between 274a76f and e1844dc.

📒 Files selected for processing (3)
  • Makefile
  • docker-bake.hcl
  • docker/Dockerfile.nmp-cpu-tasks

Comment thread docker/Dockerfile.nmp-cpu-tasks
Comment thread Makefile
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 3, 2026

Suite Lines Covered Line Rate Branch Rate
Unit Tests 18705/24706 75.7% 62.1%
Integration Tests 11970/23483 51.0% 26.1%

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docker/Dockerfile.nmp-api`:
- Around line 42-43: The Dockerfile uses a build argument ARG USERNAME but
hardcodes "nvs" in COPY --chown=nvs:nvs and ENV USERNAME=nvs, which breaks
overrides; update all COPY --chown and ENV USERNAME occurrences to use the ARG
(e.g., reference the variable instead of "nvs") so chown and the environment
follow the provided USERNAME, and ensure ARG USERNAME is declared before those
lines and used consistently in each block (the COPY --chown and ENV USERNAME
entries referenced in the diff and the similar blocks later).
- Line 99: The Dockerfile currently forces the container to run as root via the
"USER 0" directive; remove or replace that directive so the runtime image runs
as a non-root user (e.g., use the existing non-root user created earlier or
switch to a specific non-root UID/GID such as 1000) so that the image does not
drop to root at line containing "USER 0". Ensure the chosen non-root user exists
in the image (refer to the user creation step or existing user name) and that
file permissions/ownership are adjusted accordingly.

In `@docker/Dockerfile.nmp-core`:
- Around line 42-43: The Dockerfile defines ARG USERNAME and USER_UID but still
uses the hardcoded string "nvs" in places (e.g., COPY --chown and ENV), which
breaks overrides; find occurrences of the literal "nvs" (including in COPY
--chown, ENV, RUN chown commands) and replace them with the ARG-based variable
(use the ARG/ENV expansion like ${USERNAME} or "$USERNAME") and ensure UID uses
${USER_UID} where appropriate so COPY --chown, chown, and ENV remain consistent
when ARGs are overridden (also update the other occurrences noted around the
later blocks that reference "nvs").
- Line 62: Replace the final root user declaration "USER 0" with the non-root
username created earlier in the Dockerfile (the user added with useradd/addgroup
or created as e.g. "appuser" or "node"); locate the earlier creation (commands
like useradd, adduser, groupadd or the line that set a USER previously) and use
that exact username in the final USER instruction so the container runs as that
non-root user instead of root.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 98c79f47-48e2-4b53-b1b9-66203d2b7736

📥 Commits

Reviewing files that changed from the base of the PR and between e1844dc and 5eeb60a.

📒 Files selected for processing (3)
  • docker-bake.hcl
  • docker/Dockerfile.nmp-api
  • docker/Dockerfile.nmp-core

Comment thread docker/Dockerfile.nmp-api
Comment thread docker/Dockerfile.nmp-api Outdated
Comment thread docker/Dockerfile.nmp-core
Comment thread docker/Dockerfile.nmp-core Outdated
Comment thread docker/Dockerfile.nmp-api
@@ -0,0 +1,105 @@
# syntax=docker/dockerfile:1

FROM ghcr.io/astral-sh/uv:python3.11-bookworm-slim AS builder
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so are we no longer using nvidia base images? IIRC thats why we removed the docker images outside of the OSS repo

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just doing the bare minimum required to get things working. The idea here is that customers would build their own images, so they should be light. I'm not quite sure what to do about the bigger gpu and customizer images though. We need some big instances for those, and I don't know if we want to have those building on OSS.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

personally I don't like those distroless images as far as accessing them "in the field". If someone wants to build this repository for scratch, for whatever reason (weird customer stuff), it's probably best we don't depend on things inside a private network boundary. Also doesn't feel in the spirit of OS

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
e2e/conftest.py (1)

151-157: 💤 Low value

Use concrete return type instead of Any.

image_builder returns Callable[[str], str], as confirmed by usage in test_jobs_auth.py.

♻️ Proposed change
-from typing import IO, Any
+from collections.abc import Callable
+from typing import IO, Any
@@
-@pytest.fixture(scope="session")
-def image() -> Any:
+@pytest.fixture(scope="session")
+def image() -> Callable[[str], str]:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@e2e/conftest.py` around lines 151 - 157, The fixture `image` is typed too
generically as Any; `image_builder` returns a Callable[[str], str], so update
the fixture's return type to the concrete callable type (e.g., Callable[[str],
str]) and import typing.Callable if not already present so the `image` function
signature reads something like def image() -> Callable[[str], str]:; retain the
existing logic and ensure any imports (Callable) are added or consolidated at
the top of the file.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@e2e/test_jobs.py`:
- Around line 243-257: The test currently prints and asserts on the raw secret
via EnvironmentVariable(name="SECRET_ENV_VAR") and the shell command echo
"Secret value is: $SECRET_ENV_VAR", which leaks secrets; change the job command
to not output the secret (e.g., echo a fixed placeholder or a success message)
and update the assertion to verify job completion and that the secret value is
NOT present in logs (use wait_for_job_logs and assert the raw secret string is
absent) — locate and update the command in the job spec where
EnvironmentVariable is used and adjust assertions around completed_job and
wait_for_job_logs accordingly.
- Around line 230-234: The test creates a secret with a fixed name via
sdk.secrets.create (the secret variable) which causes collisions; change the
call to use a unique name per run (e.g., append a short UUID/timestamp/CI_JOB_ID
or workspace identifier) so the secret name is not constant across
retries/parallel runs, and ensure any tear-down references that expect the
original name are updated to use the generated name.

---

Nitpick comments:
In `@e2e/conftest.py`:
- Around line 151-157: The fixture `image` is typed too generically as Any;
`image_builder` returns a Callable[[str], str], so update the fixture's return
type to the concrete callable type (e.g., Callable[[str], str]) and import
typing.Callable if not already present so the `image` function signature reads
something like def image() -> Callable[[str], str]:; retain the existing logic
and ensure any imports (Callable) are added or consolidated at the top of the
file.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 101fc346-dae2-4116-add3-4389eb9cf1d2

📥 Commits

Reviewing files that changed from the base of the PR and between c13708f and 1de1466.

📒 Files selected for processing (6)
  • .dockerignore
  • docker/Dockerfile.nmp-api
  • docker/Dockerfile.nmp-core
  • e2e/conftest.py
  • e2e/test_jobs.py
  • e2e/test_jobs_auth.py
✅ Files skipped from review due to trivial changes (1)
  • .dockerignore
🚧 Files skipped from review as they are similar to previous changes (2)
  • docker/Dockerfile.nmp-core
  • docker/Dockerfile.nmp-api

Comment thread e2e/test_jobs.py
Comment thread e2e/test_jobs.py Outdated
@ironcommit ironcommit requested a review from a team as a code owner June 3, 2026 20:59
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
services/core/jobs/src/nmp/core/jobs/controllers/backends/subprocess.py (1)

278-292: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Treat missing subprocess metadata as paused when step is PAUSING.

sync() now supports pausing, but this branch still returns ERROR when metadata is missing for PAUSING. That can break pause transitions after backend restarts, even though PAUSING -> PAUSED is valid.

Suggested fix
         if metadata is None:
             if step.status == PlatformJobStatus.CANCELLING:
                 return JobUpdate(
                     status=PlatformJobStatus.CANCELLED.value,
                     status_details={"message": "Subprocess not found, job cancelled"},
                 )
+            if step.status == PlatformJobStatus.PAUSING:
+                return JobUpdate(
+                    status=PlatformJobStatus.PAUSED.value,
+                    status_details={"message": "Subprocess not found, job paused"},
+                )
             return JobUpdate(
                 status=PlatformJobStatus.ERROR.value,
                 error_details={"message": "Local subprocess metadata not found"},
             )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/core/jobs/src/nmp/core/jobs/controllers/backends/subprocess.py`
around lines 278 - 292, The branch that handles metadata is treating missing
subprocess metadata as ERROR for PAUSING; change it so that when metadata is
None and step.status == PlatformJobStatus.PAUSING you return a JobUpdate marking
the step PAUSED (e.g., status=PlatformJobStatus.PAUSED.value with an appropriate
status_details message) instead of ERROR; keep the existing behavior for
CANCELLING (return CANCELLED) and for other statuses return the ERROR JobUpdate.
This affects the metadata check around metadata is None in subprocess.py and
should align with how _create_step_update/_terminate_process handle PAUSING
elsewhere.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@services/core/jobs/tests/controllers/test_subprocess_backend.py`:
- Around line 238-240: The test currently backdates timestamps using
step.created_at.replace(year=step.created_at.year - 1) which can fail for
leap-day dates; instead subtract a duration using datetime.timedelta (e.g.,
import datetime and set step.created_at = step.created_at -
datetime.timedelta(days=365) and same for step.updated_at) to deterministically
backdate by one year while preserving timezone/naive semantics.

---

Outside diff comments:
In `@services/core/jobs/src/nmp/core/jobs/controllers/backends/subprocess.py`:
- Around line 278-292: The branch that handles metadata is treating missing
subprocess metadata as ERROR for PAUSING; change it so that when metadata is
None and step.status == PlatformJobStatus.PAUSING you return a JobUpdate marking
the step PAUSED (e.g., status=PlatformJobStatus.PAUSED.value with an appropriate
status_details message) instead of ERROR; keep the existing behavior for
CANCELLING (return CANCELLED) and for other statuses return the ERROR JobUpdate.
This affects the metadata check around metadata is None in subprocess.py and
should align with how _create_step_update/_terminate_process handle PAUSING
elsewhere.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5349dc78-b6ca-47b7-99bb-f1ff22fcc141

📥 Commits

Reviewing files that changed from the base of the PR and between 1de1466 and e62787b.

📒 Files selected for processing (6)
  • .github/workflows/ci.yaml
  • e2e/test_jobs.py
  • services/core/jobs/jobs-launcher/build-manual.sh
  • services/core/jobs/src/nmp/core/jobs/controllers/backends/docker.py
  • services/core/jobs/src/nmp/core/jobs/controllers/backends/subprocess.py
  • services/core/jobs/tests/controllers/test_subprocess_backend.py
✅ Files skipped from review due to trivial changes (2)
  • services/core/jobs/jobs-launcher/build-manual.sh
  • services/core/jobs/src/nmp/core/jobs/controllers/backends/docker.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • e2e/test_jobs.py

Comment thread services/core/jobs/tests/controllers/test_subprocess_backend.py Outdated
@ironcommit ironcommit force-pushed the cpu-tasks/rsadler branch 2 times, most recently from b3a4593 to 536a0ed Compare June 4, 2026 00:45
@mckornfield mckornfield self-requested a review June 4, 2026 04:03
ironcommit added 13 commits June 4, 2026 08:23
Signed-off-by: Ryan Sadler <267728323+ironcommit@users.noreply.github.com>
Signed-off-by: Ryan Sadler <267728323+ironcommit@users.noreply.github.com>
Signed-off-by: Ryan Sadler <267728323+ironcommit@users.noreply.github.com>
Signed-off-by: Ryan Sadler <267728323+ironcommit@users.noreply.github.com>
Signed-off-by: Ryan Sadler <267728323+ironcommit@users.noreply.github.com>
Signed-off-by: Ryan Sadler <267728323+ironcommit@users.noreply.github.com>
Signed-off-by: Ryan Sadler <267728323+ironcommit@users.noreply.github.com>
Signed-off-by: Ryan Sadler <267728323+ironcommit@users.noreply.github.com>
Signed-off-by: Ryan Sadler <267728323+ironcommit@users.noreply.github.com>
Signed-off-by: Ryan Sadler <267728323+ironcommit@users.noreply.github.com>
Signed-off-by: Ryan Sadler <267728323+ironcommit@users.noreply.github.com>
Signed-off-by: Ryan Sadler <267728323+ironcommit@users.noreply.github.com>
Signed-off-by: Ryan Sadler <267728323+ironcommit@users.noreply.github.com>
@ironcommit ironcommit force-pushed the cpu-tasks/rsadler branch from c5ad1a3 to debe34d Compare June 4, 2026 15:23
Comment thread .github/workflows/ci.yaml
env:
_TYPER_FORCE_DISABLE_TERMINAL: "1"
run: |
uv run --frozen nemo quickstart up --image my-registry/nmp-api:local --no-pull --force
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to push to our package registry?

Comment thread docker/Dockerfile.nmp-api
@@ -0,0 +1,105 @@
# syntax=docker/dockerfile:1

FROM ghcr.io/astral-sh/uv:python3.11-bookworm-slim AS builder
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

personally I don't like those distroless images as far as accessing them "in the field". If someone wants to build this repository for scratch, for whatever reason (weird customer stuff), it's probably best we don't depend on things inside a private network boundary. Also doesn't feel in the spirit of OS

Comment thread docker/Dockerfile.nmp-api
COPY web/packages ./packages

RUN pnpm install --frozen-lockfile
RUN pnpm --filter nemo-studio-ui build:fastapi
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw a decent amount of instability in https://github.com/NVIDIA-NeMo/Platform-Deploy/pull/66 with hang ups on arm64 for studio and have a retry workaround there. Might not be worth it but thought it was worth sharing

Comment thread e2e/test_jobs.py
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

appropriate to add these in this same PR? might be fine

except yaml.YAMLError as e:
raise ValueError(f"Error parsing platform config at {path}: {e}") from e

if "platform" in config_data or "files" in config_data:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this files check, even though we never use that key?

import docker
from docker.errors import DockerException, ImageNotFound

import docker
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this the right sort order?

from docker.errors import APIError, BuildError, DockerException, ImageNotFound, NotFound
from tenacity import retry, stop_after_attempt, wait_fixed

import docker
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

think this sort order is probably wrong across these fields, revert?


echo "Building static binary for ${APP_NAME}..."

# Warning: this binary is copied by the Docker jobs controller into the target
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when is this gonna be rewritten in rust? 😆

```bash
cd /path/to/nmp
make docker/nmp-cpu-tasks
make load/nmp-cpu-tasks
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did we change this target to load/?

Comment thread Makefile
rm -rf .venv/

.PHONY: load/%
load/%: ## Build and load a local single-arch image from bake target %-docker
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not that I love docker/%, but I'd prefer build/% or docker-build/% or even bake/%

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants