
Network resilience: image preloading, pull policy, and S3 FUSE mount for model weights#370

Merged
honeytung merged 15 commits into main from htung_axoncorp/network-resilience
Apr 15, 2026
Conversation

Member

@honeytung honeytung commented Apr 3, 2026

Summary

Improve network resilience for edge devices by eliminating runtime dependencies on DockerHub and replacing S3 sync with a FUSE-mounted S3 bucket for model weights.

1. Image pull policy & preloading (offline k3s support)

  • Add imagePullPolicy: IfNotPresent to all DockerHub images in Helm templates (aws-cli, kubectl, alpine, busybox)
  • Pin alpine:latest to alpine:3.20 for consistency with preloaded images
  • Add image preloading stages to Balena Dockerfiles (CPU, GPU, Jetson) using crane to pull images at build time (note: the preloading steps happen only inside the Balena build process; we do not preload images when installing directly from Helm)
  • Copy preloaded tarballs to k3s airgap directory at runtime via server.sh
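As a sketch of the preloader pattern (the stage layout and crane-install step here are assumptions; the image list is taken from this PR and must stay in sync across all three Dockerfiles):

```dockerfile
# Sketch only -- see deploy/balena-k3s/server/Dockerfile* for the real stages.
# NOTE: keep the preload image list in sync across Dockerfile, Dockerfile.gpu,
# and Dockerfile.jetson.
FROM golang:1.22-alpine AS preloader
RUN go install github.com/google/go-containerregistry/cmd/crane@latest
ARG PLAT=linux/amd64
RUN mkdir -p /preload && \
    crane pull --platform "$PLAT" docker.io/amazon/aws-cli:latest /preload/aws-cli-latest.tar && \
    crane pull --platform "$PLAT" docker.io/bitnami/kubectl:latest /preload/kubectl-latest.tar && \
    crane pull --platform "$PLAT" docker.io/alpine:3.20 /preload/alpine-3.20.tar && \
    crane pull --platform "$PLAT" docker.io/busybox:1.36 /preload/busybox-1.36.tar

# A later stage copies the tarballs into the final image, and server.sh moves
# them into k3s's airgap directory at runtime:
# COPY --from=preloader /preload /preload
```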

2. Replace S3 sync with mount-s3 FUSE mount

  • Remove warmup-inference-model Job and wait-for-warmup init container
  • Remove sync-pinamod init container from inference deployments
  • Add mount-s3 sidecar container to the edge-endpoint pod that FUSE-mounts s3://pinamod-artifacts-public
  • Mount uses --read-only with local disk cache (--cache) for performance
  • Mount at the /host-root/<path> pattern so the FUSE mount is visible to other pods via hostPath
  • Inference pods read pinamod models and pretrained weights directly from the mount
  • Set HF_HUB_OFFLINE=1 and TRANSFORMERS_OFFLINE=1 to prevent runtime downloads from HuggingFace
  • Remove aws-credentials volume and batch/jobs RBAC from inference pods (no longer needed)
  • Add s3Mount configuration values (bucket, region, mountPath, cachePath)
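A condensed sketch of the sidecar wiring (container layout and the image values key are assumptions; the real template is edge-deployment.yaml):

```yaml
# Hypothetical sketch of the mount-s3 sidecar on the edge-endpoint pod.
containers:
  - name: mount-s3
    image: "{{ .Values.edgeEndpointImage }}"   # assumed values key
    securityContext:
      privileged: true   # required for FUSE
    command: ["/bin/bash", "/groundlight-edge/deploy/bin/mount-s3.sh"]
    env:
      - name: S3_BUCKET
        value: "{{ .Values.s3Mount.bucket }}"
      - name: S3_REGION
        value: "{{ .Values.s3Mount.region }}"
    volumeMounts:
      - name: host-root
        mountPath: /host-root
        mountPropagation: Bidirectional   # lets the FUSE mount reach the host
```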

3. Files changed

  • Dockerfile — install mount-s3 + FUSE
  • deploy/bin/mount-s3.sh — mount-s3 sidecar script (new)
  • deploy/bin/wait-for-warmup.sh — deleted
  • deploy/helm/.../templates/edge-deployment.yaml — add mount-s3 sidecar + host-root volume
  • deploy/helm/.../templates/warmup-inference-model.yaml — deleted
  • deploy/helm/.../files/inference-deployment-template.yaml — remove sync init containers, add env vars
  • deploy/helm/.../templates/service-account.yaml — remove batch/jobs RBAC
  • deploy/helm/.../values.yaml — add s3Mount config
  • deploy/balena-k3s/server/Dockerfile* — add preloader stages
  • deploy/balena-k3s/server/server.sh — copy preloaded images to k3s
  • Various Helm templates — add imagePullPolicy: IfNotPresent to all DockerHub images except for edge-endpoint and inference images

Test plan

Tested on G4 EC2 (Tesla T4, k3s v1.33.5):

  • mount-s3 sidecar starts and FUSE-mounts S3 bucket successfully
  • Host sees mounted files at /opt/groundlight/edge/pinamod-mount/
  • 9 detector inference pods start without warmup Job or sync-pinamod
  • Inference returns edge predictions (from_edge: true) for all 9 detectors
  • OD detector (det_34Ws7...) loads pinamod weights from FUSE mount
  • mount-s3 disk cache populates on first read (~83MB for OD weights)
  • No HuggingFace downloads (HF_HUB_OFFLINE=1 enforced)
  • imagePullPolicy: IfNotPresent set on all DockerHub images
  • Balena device test with preloaded images + restricted network

🤖 Generated with Claude Code

Prevents ImagePullBackOff failures on restricted/offline networks by using
preloaded images from k3s's local store instead of pulling from DockerHub.
Also pins alpine:latest to alpine:3.20 in k3s-memory-config for consistency.
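The resulting pattern on each utility container looks like this (a sketch; the container name is taken from the k3s-memory-config change in this PR):

```yaml
- name: k3s-memory-configurator
  image: alpine:3.20              # pinned; matches the preloaded tarball
  imagePullPolicy: IfNotPresent   # prefer the k3s local image store over DockerHub
```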

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@honeytung honeytung requested a review from a team as a code owner April 3, 2026 23:26
honeytung and others added 8 commits April 3, 2026 16:49
Adds a preloader stage to all three server Dockerfiles (CPU, GPU, Jetson)
that uses crane to download k3s airgap images and Helm chart utility images
(aws-cli, kubectl, alpine, busybox) as tarballs at build time. server.sh
copies these to k3s's agent images directory on startup so pods can start
without reaching DockerHub.
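A hypothetical sketch of the server.sh copy step. On a real device the destination would be k3s's airgap directory (/var/lib/rancher/k3s/agent/images), which k3s scans on startup and imports tarballs from; /tmp stand-in paths are used here so the sketch runs anywhere:

```shell
PRELOAD_DIR="${PRELOAD_DIR:-/tmp/preload-demo}"
AIRGAP_DIR="${AIRGAP_DIR:-/tmp/k3s-images-demo}"
mkdir -p "$PRELOAD_DIR" "$AIRGAP_DIR"
touch "$PRELOAD_DIR/alpine-3.20.tar"   # stand-in for a crane-pulled tarball
for tar in "$PRELOAD_DIR"/*.tar; do
  [ -e "$tar" ] || continue   # no tarballs baked in; nothing to copy
  cp "$tar" "$AIRGAP_DIR/"
done
ls "$AIRGAP_DIR"
```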

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the warmup-inference-model Job and sync-pinamod init container
with a mount-s3 FUSE mount on the edge-endpoint pod. Model weights
(pinamod + pretrained weights) are now served via a read-only S3 FUSE
mount with local disk caching, eliminating the need for separate S3
sync steps and enabling offline access to pretrained HuggingFace/timm
weights.

Changes:
- Install mount-s3 in Dockerfile (amd64 + arm64)
- Add mount-s3 init container to edge-deployment with FUSE mount
- Remove warmup-inference-model Job and wait-for-warmup script
- Remove sync-pinamod init container from inference deployment
- Set HF_HUB_OFFLINE=1 and TRANSFORMERS_OFFLINE=1 on inference pods
- Remove batch/jobs RBAC rule (no longer needed)
- Add s3Mount config values (bucket, region, mountPath, cachePath)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mount-s3 is a FUSE daemon that must stay running to serve file reads.
As an init container, the process exits and the FUSE mount disappears.
Convert to a sidecar container running in --foreground mode so the
mount persists for the lifetime of the pod and is visible to inference
pods via hostPath + mountPropagation: Bidirectional.
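The mount invocation itself, as a fragment of mount-s3.sh (--read-only, --cache, and --foreground are referenced in this PR; --region is an assumption based on the s3Mount config values):

```shell
mount-s3 "$S3_BUCKET" "$MOUNT_POINT" \
  --foreground \
  --read-only \
  --cache "$CACHE_DIR" \
  --region "$S3_REGION"
```

Running in --foreground keeps the FUSE daemon, and therefore the mount, alive for the lifetime of the sidecar container.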

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The FUSE mount must happen on the actual hostPath bind mount point
inside the container, not on a path within the overlay filesystem.
Mount on /mnt/s3 (the volume mount point) so the FUSE mount propagates
to the host via Bidirectional mountPropagation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
hostPath volume FUSE mounts don't propagate through containerd's
bind mount setup. Instead, use hostPID + nsenter -t 1 -m to run
mount-s3 directly in the host's mount namespace. Copy the mount-s3
binary and AWS credentials to the host filesystem first, then mount
with --foreground to keep the sidecar alive.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use cp+mv instead of direct cp to avoid ETXTBSY when the old pod
is still running mount-s3 from the same host path.
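A minimal demonstration of the cp+mv pattern (demo paths, not the real ones). Overwriting a binary in place fails with ETXTBSY while it is executing; copying to a temp name and renaming over the target avoids this because rename(2) replaces the directory entry atomically without touching the busy inode:

```shell
SRC=/tmp/demo-new-binary
DST_DIR=/tmp/demo-host-bin
mkdir -p "$DST_DIR"
printf 'v2' > "$SRC"
# Copy to a temp name on the same filesystem, then atomically rename.
cp "$SRC" "$DST_DIR/mount-s3.tmp"
mv -f "$DST_DIR/mount-s3.tmp" "$DST_DIR/mount-s3"
cat "$DST_DIR/mount-s3"
```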

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mount directly on /host-root/<path> (bind mount of host root) from
within the container. This avoids nsenter complexity and library
dependency issues. The FUSE mount on the host-root bind mount makes
files visible at the host path for other pods via hostPath.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move the mount-s3 logic out of the Helm template into a standalone
shell script for readability. The template now just calls the script
and passes config via environment variables.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@honeytung honeytung changed the title from "Add imagePullPolicy: IfNotPresent to all DockerHub images" to "Network resilience: image preloading, pull policy, and S3 FUSE mount for model weights" on Apr 8, 2026
honeytung and others added 2 commits April 8, 2026 11:15
On Balena, k3s runs inside a Docker container whose root filesystem
has private mount propagation by default. The mount-s3 sidecar needs
mountPropagation: Bidirectional which requires shared propagation.
Run mount --make-rshared / before k3s starts to enable this.
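A sketch of the propagation check behind the fix (runs unprivileged anywhere; the actual `mount --make-rshared /` requires root and runs in server.sh before k3s starts). The root mount's optional fields in /proc/self/mountinfo carry a "shared:N" tag once propagation is shared:

```shell
# Field 5 of mountinfo is the mount point; find the line for "/".
root_line=$(awk '$5 == "/" { print; exit }' /proc/self/mountinfo)
echo "$root_line"
case "$root_line" in
  *shared:*) echo "root mount propagation: shared" ;;
  *)         echo "root mount propagation: private (fix: mount --make-rshared /)" ;;
esac
```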

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@honeytung honeytung force-pushed the htung_axoncorp/network-resilience branch 2 times, most recently from 35fe022 to 3ac8e70 on April 8, 2026 at 18:54
HuggingFace and DockerHub are no longer required at runtime:
- Pretrained weights served from S3 via FUSE mount (HF_HUB_OFFLINE=1)
- DockerHub images preloaded with imagePullPolicy: IfNotPresent

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@honeytung honeytung force-pushed the htung_axoncorp/network-resilience branch from 3ac8e70 to 5f6704b on April 8, 2026 at 18:55
@@ -4,6 +4,44 @@
# https://k3d.io/v5.7.4/usage/advanced/cuda/#building-a-customized-k3s-image
Member Author

The following Dockerfile changes are for preloading images with crane. This will also be used in GEP for Balena deployments. Note that the three Dockerfiles represent different architectures (CPU, x86 GPU, and Jetson).

In GEP we kept these in Jinja format and generate the actual Dockerfile via Python. Here we kept a full Dockerfile for each deployment so it is easier for people to reuse them in their custom deployment environments.

containers:
- name: k3s-memory-configurator
-  image: alpine:latest
+  image: alpine:3.20
Member Author

@honeytung honeytung Apr 8, 2026

I modified alpine to use 3.20 instead of latest so that it matches what the network-healer is using below. I am not sure why they are different versions.

Collaborator

@CoreyEWood CoreyEWood left a comment

Overall this looks great! Love how simple the S3 mounting is to set up. And I think ultimately the validation via testing is the most key part of it. I left comments which I think are worth looking at before merging but mostly about smaller portions of it.

Comment thread deploy/balena-k3s/server/Dockerfile Outdated
crane pull --platform "$PLAT" docker.io/amazon/aws-cli:latest /preload/aws-cli-latest.tar && \
crane pull --platform "$PLAT" docker.io/bitnami/kubectl:latest /preload/kubectl-latest.tar && \
crane pull --platform "$PLAT" docker.io/alpine:3.20 /preload/alpine-3.20.tar && \
crane pull --platform "$PLAT" docker.io/busybox:1.35 /preload/busybox-1.35.tar
Collaborator

This pulls busybox:1.35 but edge-deployment.yaml uses busybox:1.36 for apply-edge-config. Should the crane pull command include 1.36 to cover this?

Member Author

@honeytung honeytung Apr 14, 2026

Good catch! Will add that to the list.

command: ["/bin/sh", "/etc/groundlight/edge-config/apply-edge-config.sh"]

containers:
- name: mount-s3
Collaborator

The other containers in this pod all have resources.requests.memory set to avoid eviction under memory pressure. Should mount-s3 have one as well?

Comment on lines +7 to +9
# -------------------------
# Stage 1: Preload - download third-party images as tarballs for airgap/restricted network support
# -------------------------
Collaborator

This preloader stage seems to be duplicated across Dockerfile, Dockerfile.gpu, and Dockerfile.jetson. Could we extract it into a shared Dockerfile or base image so the preload list only needs to be maintained in one place?

Member Author

@honeytung honeytung Apr 14, 2026

We can; for GEP we used a Jinja template. But since edge-endpoint is public facing, I wanted to keep the architectures separate so it is easier for people to copy and modify them based on their needs.

With jinja templating we can make this cleaner, but it will require us to call a python script like GEP does to compile the actual Dockerfile at build time.

Collaborator

Okay gotcha, makes sense. In this case we'll just have to remember to keep all of these in sync.

Collaborator

Maybe add a note in this section that it should be kept in sync across all of these when making changes, so that people in the future are aware.

Comment thread NETWORK-REQUIREMENTS.md
Comment on lines +20 to +21
- `*.us-west-2.amazonaws.com`: AWS access for inference image download (ECR), model weights (S3), and credential refresh (STS)
- `*.s3.amazonaws.com`: AWS S3 access for model weights via FUSE mount
Collaborator

Do you know why we need both of these instead of just the first one?

Member Author

I believe the AWS CLI may use the global endpoints (s3.amazonaws.com, sts.amazonaws.com) without the region prefix even when a region is specified, especially when talking to STS to get credentials.

Collaborator

Have we fully deprecated the non-helm deploy route? It looks like this script is still referenced in the old deploy/k3s/inference_deployment_template.yaml, so we might need to keep it around if anyone else is still using that deployment path.

Member Author

Good call, there are some people still using the non-helm deploy route. I will revert the script for now.

Collaborator

Sounds good. And from looking at it, it seems to me like nothing else here should break non-helm deployments hopefully. Do you think we should test that explicitly to be safe? Or are you confident about that?

Member Author

Yea I think this is fine, since we did not touch any k3s templates. We can communicate with the people still using the old k3s method to switch over soon so we can fully deprecate it.

Comment thread deploy/bin/mount-s3.sh Outdated
Comment on lines +22 to +25
# Copy AWS credentials to host for mount-s3 to use
mkdir -p /host-root/root/.aws
cp /root/.aws/credentials /host-root/root/.aws/credentials 2>/dev/null || true
cp /root/.aws/config /host-root/root/.aws/config 2>/dev/null || true
Collaborator

Why is this needed if it already has access to the credentials here?

Member Author

I think I originally planned to run mount-s3 on the host for testing, but I agree it shouldn't be necessary; I will test it out and update.

mountPropagation: Bidirectional
- name: aws-credentials
mountPath: /root/.aws
readOnly: true
Collaborator

Should this container have a liveness probe? If the FUSE mount fails silently (e.g. a credentials issue), the pod would appear healthy but inference pods would see an empty directory. Claude says that something like checking mountpoint -q /host-root/<mount_path> could catch this.
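A possible shape for that probe, assuming the mount path from the test plan (/opt/groundlight/edge/pinamod-mount); timings are placeholders:

```yaml
livenessProbe:
  exec:
    # mountpoint -q exits non-zero if the path is not an active mount,
    # so a silently failed FUSE mount restarts the sidecar.
    command: ["mountpoint", "-q", "/host-root/opt/groundlight/edge/pinamod-mount"]
  initialDelaySeconds: 10
  periodSeconds: 30
```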

Member Author

Good call, will add it 🙂

imagePullPolicy: "{{ include "groundlight-edge-endpoint.edgeEndpointPullPolicy" . }}"
securityContext:
privileged: true # Required for FUSE mount
command: ["/bin/bash", "/groundlight-edge/deploy/bin/mount-s3.sh"]
Collaborator

Do you think we need to think about the ordering of this container vs the inference pods starting up at all? Most of the time it seems like it should be fine because we expect this to start before inference pods do, and it should run quickly. And maybe adding a liveness probe would cover the scenario where this doesn't succeed. But just want to flag in case there's other scenarios we should cover to avoid some kind of silent failure.

Member Author

@honeytung honeytung Apr 14, 2026

I think this is fine, usually mount-s3 will start fairly quickly, but a liveness probe from the comment above should solve this issue. Also, the inference pod at startup will still need to pull the inference image so it will definitely be slower than the s3 mount.

Comment thread deploy/bin/mount-s3.sh
fi

echo "Mounting s3://$S3_BUCKET at $MOUNT_POINT (cache: $CACHE_DIR, region: $S3_REGION)"
mount-s3 "$S3_BUCKET" "$MOUNT_POINT" \
Collaborator

Not sure the current state of edge documentation but do you think we should add a short section about the mountpoint usage to a README somewhere?

Member Author

I think we can add that later, I think the current edge architecture docs only covers our escalation and ML logic but not the structure of the edge-endpoint.

- Add busybox:1.36 to preloader in all 3 Balena Dockerfiles (covers
  apply-edge-config init container added in main merge)
- Add resources.requests.memory: 50Mi to mount-s3 sidecar (consistent
  with other containers in the pod)
- Add liveness probe to mount-s3 (mountpoint -q check to detect FUSE
  mount failures)
- Remove unnecessary AWS credentials copy to /host-root in mount-s3.sh
  (mount-s3 reads creds from its own /root/.aws mount)
- Restore wait-for-warmup.sh for non-helm k3s deploy path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collaborator

@CoreyEWood CoreyEWood left a comment

Changes look good! Left a few more comments but then should be ready to merge.

Comment thread deploy/balena-k3s/server/Dockerfile Outdated
Comment on lines +40 to +41
crane pull --platform "$PLAT" docker.io/busybox:1.35 /preload/busybox-1.35.tar && \
crane pull --platform "$PLAT" docker.io/busybox:1.36 /preload/busybox-1.36.tar
Collaborator

It looks like we use 1.35 and 1.36 in one place each, I wonder if we can switch to using just one version to avoid having to pull both images? Unless there's a reason we need both 1.35 and 1.36 for those specific spots.


honeytung and others added 2 commits April 15, 2026 14:04
- Standardize on busybox:1.36 everywhere (was 1.35 in splunk validation
  and preloaders, 1.36 in apply-edge-config)
- Add NOTE comments to preloader stages reminding to keep image lists
  in sync across Dockerfile, Dockerfile.gpu, and Dockerfile.jetson

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…k-resilience

# Conflicts:
#	deploy/helm/groundlight-edge-endpoint/files/inference-deployment-template.yaml
@honeytung honeytung merged commit bd1f7c2 into main Apr 15, 2026
12 checks passed
@honeytung honeytung deleted the htung_axoncorp/network-resilience branch April 15, 2026 21:48