Conversation
Prevents ImagePullBackOff failures on restricted/offline networks by using preloaded images from k3s's local store instead of pulling from DockerHub. Also pins alpine:latest to alpine:3.20 in k3s-memory-config for consistency. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds a preloader stage to all three server Dockerfiles (CPU, GPU, Jetson) that uses crane to download k3s airgap images and Helm chart utility images (aws-cli, kubectl, alpine, busybox) as tarballs at build time. server.sh copies these to k3s's agent images directory on startup so pods can start without reaching DockerHub. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
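As an illustration of the approach (not the PR's exact Dockerfile: the crane binary path, `/preload` directory, and image list here are assumptions for sketching), a preloader stage might look like:

```dockerfile
# Sketch of a crane-based preloader stage. The /ko-app/crane binary path,
# /preload directory, and image list are assumptions, not the PR's contents.
FROM alpine:3.20 AS preloader
COPY --from=gcr.io/go-containerregistry/crane:latest /ko-app/crane /usr/local/bin/crane
ARG TARGETARCH
RUN PLAT="linux/${TARGETARCH}" && \
    mkdir -p /preload && \
    crane pull --platform "$PLAT" docker.io/amazon/aws-cli:latest /preload/aws-cli-latest.tar && \
    crane pull --platform "$PLAT" docker.io/bitnami/kubectl:latest /preload/kubectl-latest.tar && \
    crane pull --platform "$PLAT" docker.io/alpine:3.20 /preload/alpine-3.20.tar

FROM ubuntu:22.04
# server.sh can copy these tarballs into k3s's agent images directory
# (/var/lib/rancher/k3s/agent/images/) at startup; containerd auto-imports
# images found there, so pods start without reaching DockerHub.
COPY --from=preloader /preload /preload
```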
Replace the warmup-inference-model Job and sync-pinamod init container with a mount-s3 FUSE mount on the edge-endpoint pod. Model weights (pinamod + pretrained weights) are now served via a read-only S3 FUSE mount with local disk caching, eliminating the need for separate S3 sync steps and enabling offline access to pretrained HuggingFace/timm weights.

Changes:
- Install mount-s3 in Dockerfile (amd64 + arm64)
- Add mount-s3 init container to edge-deployment with FUSE mount
- Remove warmup-inference-model Job and wait-for-warmup script
- Remove sync-pinamod init container from inference deployment
- Set HF_HUB_OFFLINE=1 and TRANSFORMERS_OFFLINE=1 on inference pods
- Remove batch/jobs RBAC rule (no longer needed)
- Add s3Mount config values (bucket, region, mountPath, cachePath)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mount-s3 is a FUSE daemon that must stay running to serve file reads. As an init container, the process exits and the FUSE mount disappears. Convert to a sidecar container running in --foreground mode so the mount persists for the lifetime of the pod and is visible to inference pods via hostPath + mountPropagation: Bidirectional. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
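As a sketch of this sidecar shape (the container name, image, bucket, and paths below are assumptions drawn from elsewhere in this PR, not the actual Helm template), a persistent FUSE sidecar could look like:

```yaml
# Hypothetical sidecar sketch; image, bucket, and hostPath are placeholders.
containers:
  - name: mount-s3
    image: edge-endpoint  # the PR runs this from the edge-endpoint image
    securityContext:
      privileged: true  # required for FUSE
    # --foreground keeps the process, and therefore the FUSE mount,
    # alive for the lifetime of the pod
    command: ["mount-s3", "pinamod-artifacts-public", "/mnt/s3",
              "--read-only", "--foreground"]
    volumeMounts:
      - name: s3-mount
        mountPath: /mnt/s3
        mountPropagation: Bidirectional
volumes:
  - name: s3-mount
    hostPath:
      path: /opt/groundlight/edge/pinamod-mount
      type: DirectoryOrCreate
```

Note that the hostPath + mountPropagation mechanics were revised in the follow-up commits; this only illustrates why the mount must live in a long-running container rather than an init container.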
The FUSE mount must happen on the actual hostPath bind mount point inside the container, not on a path within the overlay filesystem. Mount on /mnt/s3 (the volume mount point) so the FUSE mount propagates to the host via Bidirectional mountPropagation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
hostPath volume FUSE mounts don't propagate through containerd's bind mount setup. Instead, use hostPID + nsenter -t 1 -m to run mount-s3 directly in the host's mount namespace. Copy the mount-s3 binary and AWS credentials to the host filesystem first, then mount with --foreground to keep the sidecar alive. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use cp+mv instead of direct cp to avoid ETXTBSY when the old pod is still running mount-s3 from the same host path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
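The staged-copy trick can be illustrated in isolation (the paths and file names here are stand-ins, not the PR's actual ones): writing to a temporary name and then renaming replaces the directory entry atomically, instead of writing into an inode that the kernel refuses to modify (ETXTBSY) because a running process is executing it.

```shell
#!/bin/sh
# Demo of the cp+mv staged copy. HOST_DIR and the binary name are stand-ins.
set -e
HOST_DIR=$(mktemp -d)
printf 'new-binary-contents' > ./mount-s3.new   # pretend this is the new binary

# A direct `cp` over a file some process is executing fails with ETXTBSY.
# Copying to a temp name and renaming swaps the path atomically instead:
cp ./mount-s3.new "$HOST_DIR/mount-s3.tmp"
mv "$HOST_DIR/mount-s3.tmp" "$HOST_DIR/mount-s3"

cat "$HOST_DIR/mount-s3"   # -> new-binary-contents
rm -r "$HOST_DIR" ./mount-s3.new
```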
Mount directly on /host-root/<path> (bind mount of host root) from within the container. This avoids nsenter complexity and library dependency issues. The FUSE mount on the host-root bind mount makes files visible at the host path for other pods via hostPath. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move the mount-s3 logic out of the Helm template into a standalone shell script for readability. The template now just calls the script and passes config via environment variables. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
On Balena, k3s runs inside a Docker container whose root filesystem has private mount propagation by default. The mount-s3 sidecar needs mountPropagation: Bidirectional which requires shared propagation. Run mount --make-rshared / before k3s starts to enable this. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
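In the k3s container's entrypoint this is a one-line change; a non-runnable sketch (the script name and placement are assumptions):

```shell
# Assumed placement near the top of server.sh; requires a privileged container.
# Remake the container's root mount recursively shared so that
# mountPropagation: Bidirectional can work, then start k3s as before.
mount --make-rshared /
exec k3s server "$@"
```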
Force-pushed from 35fe022 to 3ac8e70
HuggingFace and DockerHub are no longer required at runtime: - Pretrained weights served from S3 via FUSE mount (HF_HUB_OFFLINE=1) - DockerHub images preloaded with imagePullPolicy: IfNotPresent Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from 3ac8e70 to 5f6704b
```
@@ -4,6 +4,44 @@
 # https://k3d.io/v5.7.4/usage/advanced/cuda/#building-a-customized-k3s-image
```
The following Dockerfile changes are for preloading images with crane. This will also be used in GEP for Balena deployments. Note that the three Dockerfiles represent different architectures (CPU, x86 GPU, and Jetson).
In GEP we kept these in Jinja format and generate the actual Dockerfile via Python. Here we kept a full Dockerfile for each deployment so it is easier for people to reuse this for their custom deployment environments.
```diff
 containers:
   - name: k3s-memory-configurator
-    image: alpine:latest
+    image: alpine:3.20
```
I modified alpine to use 3.20 instead of latest so that it matches what the network-healer is using below. I am not sure why they are different versions.
CoreyEWood left a comment:
Overall this looks great! Love how simple the S3 mounting is to set up, and I think ultimately the validation via testing is the key part of it. I left comments that are worth looking at before merging, but they're mostly about smaller portions of the change.
```
crane pull --platform "$PLAT" docker.io/amazon/aws-cli:latest /preload/aws-cli-latest.tar && \
crane pull --platform "$PLAT" docker.io/bitnami/kubectl:latest /preload/kubectl-latest.tar && \
crane pull --platform "$PLAT" docker.io/alpine:3.20 /preload/alpine-3.20.tar && \
crane pull --platform "$PLAT" docker.io/busybox:1.35 /preload/busybox-1.35.tar
```
This pulls busybox:1.35 but edge-deployment.yaml uses busybox:1.36 for apply-edge-config. Should the crane pull command include 1.36 to cover this?
Good catch! Will add that to the list.
```
command: ["/bin/sh", "/etc/groundlight/edge-config/apply-edge-config.sh"]

containers:
  - name: mount-s3
```
The other containers in this pod all have resources.requests.memory set to avoid eviction under memory pressure. Should mount-s3 have one as well?
```
# -------------------------
# Stage 1: Preload - download third-party images as tarballs for airgap/restricted network support
# -------------------------
```
This preloader stage seems to be duplicated across Dockerfile, Dockerfile.gpu, and Dockerfile.jetson. Could we extract it into a shared Dockerfile or base image so the preload list only needs to be maintained in one place?
We can; for GEP we used a Jinja template. But for edge-endpoint, since it is public facing, I wanted to keep the architectures separate so it is easier for people to copy and modify them based on their needs.
With Jinja templating we could make this cleaner, but it would require calling a Python script, like GEP does, to compile the actual Dockerfile at build time.
Okay gotcha, makes sense. In this case we'll just have to remember to keep all of these in sync.
Maybe add a note in this section that it should be kept in sync across all of these when making changes, so that people in the future are aware.
```
- `*.us-west-2.amazonaws.com`: AWS access for inference image download (ECR), model weights (S3), and credential refresh (STS)
- `*.s3.amazonaws.com`: AWS S3 access for model weights via FUSE mount
```
Do you know why we need both of these instead of just the first one?
I believe the AWS CLI may use the global endpoints (s3.amazonaws.com, sts.amazonaws.com) without the region prefix even when a region is specified, especially when talking to STS to get credentials.
Have we fully deprecated the non-helm deploy route? It looks like this script is still referenced in the old deploy/k3s/inference_deployment_template.yaml, so we might need to keep it around if anyone else is still using that deployment path.
Good call, there are some people still using the non-helm deploy route. I will revert the script for now.
Sounds good. And from looking at it, it seems to me like nothing else here should break non-helm deployments hopefully. Do you think we should test that explicitly to be safe? Or are you confident about that?
Yea I think this is fine, since we did not touch any k3s templates. We can communicate with the people still using the old k3s method to switch over soon so we can fully deprecate it.
```
# Copy AWS credentials to host for mount-s3 to use
mkdir -p /host-root/root/.aws
cp /root/.aws/credentials /host-root/root/.aws/credentials 2>/dev/null || true
cp /root/.aws/config /host-root/root/.aws/config 2>/dev/null || true
```
Why is this needed if it already has access to the credentials here?
I think I originally planned to run mount-s3 on the host for testing, but I agree it shouldn't be necessary; I will test it out and update.
```
    mountPropagation: Bidirectional
  - name: aws-credentials
    mountPath: /root/.aws
    readOnly: true
```
Should this container have a liveness probe? If the FUSE mount fails silently (e.g. a credentials issue), the pod would appear healthy but inference pods would see an empty directory. Claude says that something like checking `mountpoint -q /host-root/<mount_path>` could catch this.
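For reference, a probe along the suggested lines might look like the sketch below; the mount path is an assumed example taken from the test plan at the bottom of this PR, not necessarily the chart's actual value:

```yaml
livenessProbe:
  exec:
    # Fails (and restarts the sidecar) if the FUSE mount has disappeared.
    command: ["mountpoint", "-q", "/host-root/opt/groundlight/edge/pinamod-mount"]
  initialDelaySeconds: 10
  periodSeconds: 30
  failureThreshold: 3
```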
Good call, will add it 🙂
```
imagePullPolicy: "{{ include "groundlight-edge-endpoint.edgeEndpointPullPolicy" . }}"
securityContext:
  privileged: true # Required for FUSE mount
command: ["/bin/bash", "/groundlight-edge/deploy/bin/mount-s3.sh"]
```
Do you think we need to think about the ordering of this container vs the inference pods starting up at all? Most of the time it seems like it should be fine because we expect this to start before inference pods do, and it should run quickly. And maybe adding a liveness probe would cover the scenario where this doesn't succeed. But just want to flag in case there's other scenarios we should cover to avoid some kind of silent failure.
I think this is fine; usually mount-s3 starts fairly quickly, and the liveness probe from the comment above should cover failures. Also, the inference pod at startup still needs to pull the inference image, so it will definitely be slower than the S3 mount.
```
fi

echo "Mounting s3://$S3_BUCKET at $MOUNT_POINT (cache: $CACHE_DIR, region: $S3_REGION)"
mount-s3 "$S3_BUCKET" "$MOUNT_POINT" \
```
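The invocation is cut off in the diff view; based on the flags mentioned elsewhere in this PR (--read-only, --cache, --foreground), the full command presumably resembles this sketch (flag spelling follows Mountpoint for Amazon S3's CLI; the exact flag set in the script may differ):

```shell
# Sketch only; not necessarily the script's exact flags.
mount-s3 "$S3_BUCKET" "$MOUNT_POINT" \
  --region "$S3_REGION" \
  --cache "$CACHE_DIR" \
  --read-only \
  --foreground
```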
Not sure the current state of edge documentation but do you think we should add a short section about the mountpoint usage to a README somewhere?
I think we can add that later; the current edge architecture docs only cover our escalation and ML logic, not the structure of the edge-endpoint.
- Add busybox:1.36 to preloader in all 3 Balena Dockerfiles (covers apply-edge-config init container added in main merge)
- Add resources.requests.memory: 50Mi to mount-s3 sidecar (consistent with other containers in the pod)
- Add liveness probe to mount-s3 (mountpoint -q check to detect FUSE mount failures)
- Remove unnecessary AWS credentials copy to /host-root in mount-s3.sh (mount-s3 reads creds from its own /root/.aws mount)
- Restore wait-for-warmup.sh for non-helm k3s deploy path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CoreyEWood left a comment:
Changes look good! Left a few more comments but then should be ready to merge.
```
crane pull --platform "$PLAT" docker.io/busybox:1.35 /preload/busybox-1.35.tar && \
crane pull --platform "$PLAT" docker.io/busybox:1.36 /preload/busybox-1.36.tar
```
It looks like we use 1.35 and 1.36 in one place each, I wonder if we can switch to using just one version to avoid having to pull both images? Unless there's a reason we need both 1.35 and 1.36 for those specific spots.
- Standardize on busybox:1.36 everywhere (was 1.35 in splunk validation and preloaders, 1.36 in apply-edge-config)
- Add NOTE comments to preloader stages reminding to keep image lists in sync across Dockerfile, Dockerfile.gpu, and Dockerfile.jetson

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…k-resilience

# Conflicts:
#	deploy/helm/groundlight-edge-endpoint/files/inference-deployment-template.yaml
Summary
Improve network resilience for edge devices by eliminating runtime dependencies on DockerHub and replacing S3 sync with a FUSE-mounted S3 bucket for model weights.
1. Image pull policy & preloading (offline k3s support)
- Add `imagePullPolicy: IfNotPresent` to all DockerHub images in Helm templates (aws-cli, kubectl, alpine, busybox)
- Pin `alpine:latest` to `alpine:3.20` for consistency with preloaded images
- Use `crane` to pull images at build time (note that the preloading step only happens inside the Balena build process; we do not preload when installing from Helm directly)
- Copy preloaded image tarballs to k3s's agent images directory in `server.sh`

2. Replace S3 sync with mount-s3 FUSE mount
- Remove `warmup-inference-model` Job and `wait-for-warmup` init container
- Remove `sync-pinamod` init container from inference deployments
- Add a `mount-s3` sidecar container to the edge-endpoint pod that FUSE-mounts `s3://pinamod-artifacts-public` `--read-only` with local disk cache (`--cache`) for performance
- Use the `/host-root/<path>` pattern to make the FUSE mount visible to other pods via hostPath
- Set `HF_HUB_OFFLINE=1` and `TRANSFORMERS_OFFLINE=1` to prevent runtime downloads from HuggingFace
- Remove the `aws-credentials` volume and `batch/jobs` RBAC from inference pods (no longer needed)
- Add `s3Mount` configuration values (bucket, region, mountPath, cachePath)

3. Files changed
- `Dockerfile` - install `mount-s3` + FUSE
- `deploy/bin/mount-s3.sh` - mount-s3 sidecar script (new)
- `deploy/bin/wait-for-warmup.sh` - deleted
- `deploy/helm/.../templates/edge-deployment.yaml` - add mount-s3 sidecar + host-root volume
- `deploy/helm/.../templates/warmup-inference-model.yaml` - deleted
- `deploy/helm/.../files/inference-deployment-template.yaml` - remove sync init containers, add env vars
- `deploy/helm/.../templates/service-account.yaml` - remove batch/jobs RBAC
- `deploy/helm/.../values.yaml` - add s3Mount config
- `deploy/balena-k3s/server/Dockerfile*` - add preloader stages
- `deploy/balena-k3s/server/server.sh` - copy preloaded images to k3s
- Helm templates - add `imagePullPolicy: IfNotPresent` to all DockerHub images except for the edge-endpoint and inference images

Test plan
Tested on G4 EC2 (Tesla T4, k3s v1.33.5):
- Model weights served from the FUSE mount at `/opt/groundlight/edge/pinamod-mount/`
- Edge answers (`from_edge: true`) for all 9 detectors
- Detector (`det_34Ws7...`) loads pinamod weights from the FUSE mount
- `imagePullPolicy: IfNotPresent` set on all DockerHub images

🤖 Generated with Claude Code