
Network resilience: image preloading, pull policy, and S3 FUSE mount for model weights#370

Merged
honeytung merged 15 commits into main from htung_axoncorp/network-resilience
Apr 15, 2026
Conversation

Member

@honeytung honeytung commented Apr 3, 2026

Summary

Improve network resilience for edge devices by eliminating runtime dependencies on DockerHub and replacing S3 sync with a FUSE-mounted S3 bucket for model weights.

1. Image pull policy & preloading (offline k3s support)

  • Add imagePullPolicy: IfNotPresent to all DockerHub images in Helm templates (aws-cli, kubectl, alpine, busybox)
  • Pin alpine:latest to alpine:3.20 for consistency with preloaded images
  • Add image preloading stages to Balena Dockerfiles (CPU, GPU, Jetson) using crane to pull images at build time (note: the preloading steps happen only inside the Balena build process; we do not preload images when installing directly from Helm)
  • Copy preloaded tarballs to k3s airgap directory at runtime via server.sh
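As a sketch of the preloader pattern (the stage layout and crane-install step here are assumptions; the image list is taken from this PR and must stay in sync across all three Dockerfiles):

```dockerfile
# Sketch only -- see deploy/balena-k3s/server/Dockerfile* for the real stages.
# NOTE: keep the preload image list in sync across Dockerfile, Dockerfile.gpu,
# and Dockerfile.jetson.
FROM golang:1.22-alpine AS preloader
RUN go install github.com/google/go-containerregistry/cmd/crane@latest
ARG PLAT=linux/amd64
RUN mkdir -p /preload && \
    crane pull --platform "$PLAT" docker.io/amazon/aws-cli:latest /preload/aws-cli-latest.tar && \
    crane pull --platform "$PLAT" docker.io/bitnami/kubectl:latest /preload/kubectl-latest.tar && \
    crane pull --platform "$PLAT" docker.io/alpine:3.20 /preload/alpine-3.20.tar && \
    crane pull --platform "$PLAT" docker.io/busybox:1.36 /preload/busybox-1.36.tar

# A later stage copies the tarballs into the final image, and server.sh moves
# them into k3s's airgap directory at runtime:
# COPY --from=preloader /preload /preload
```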

2. Replace S3 sync with mount-s3 FUSE mount

  • Remove warmup-inference-model Job and wait-for-warmup init container
  • Remove sync-pinamod init container from inference deployments
  • Add mount-s3 sidecar container to the edge-endpoint pod that FUSE-mounts s3://pinamod-artifacts-public
  • Mount uses --read-only with local disk cache (--cache) for performance
  • Mount at the /host-root/<path> pattern so the FUSE mount is visible to other pods via hostPath
  • Inference pods read pinamod models and pretrained weights directly from the mount
  • Set HF_HUB_OFFLINE=1 and TRANSFORMERS_OFFLINE=1 to prevent runtime downloads from HuggingFace
  • Remove aws-credentials volume and batch/jobs RBAC from inference pods (no longer needed)
  • Add s3Mount configuration values (bucket, region, mountPath, cachePath)
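A condensed sketch of the sidecar wiring (container layout and the image values key are assumptions; the real template is edge-deployment.yaml):

```yaml
# Hypothetical sketch of the mount-s3 sidecar on the edge-endpoint pod.
containers:
  - name: mount-s3
    image: "{{ .Values.edgeEndpointImage }}"   # assumed values key
    securityContext:
      privileged: true   # required for FUSE
    command: ["/bin/bash", "/groundlight-edge/deploy/bin/mount-s3.sh"]
    env:
      - name: S3_BUCKET
        value: "{{ .Values.s3Mount.bucket }}"
      - name: S3_REGION
        value: "{{ .Values.s3Mount.region }}"
    volumeMounts:
      - name: host-root
        mountPath: /host-root
        mountPropagation: Bidirectional   # lets the FUSE mount reach the host
```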

3. Files changed

  • Dockerfile — install mount-s3 + FUSE
  • deploy/bin/mount-s3.sh — mount-s3 sidecar script (new)
  • deploy/bin/wait-for-warmup.sh — deleted
  • deploy/helm/.../templates/edge-deployment.yaml — add mount-s3 sidecar + host-root volume
  • deploy/helm/.../templates/warmup-inference-model.yaml — deleted
  • deploy/helm/.../files/inference-deployment-template.yaml — remove sync init containers, add env vars
  • deploy/helm/.../templates/service-account.yaml — remove batch/jobs RBAC
  • deploy/helm/.../values.yaml — add s3Mount config
  • deploy/balena-k3s/server/Dockerfile* — add preloader stages
  • deploy/balena-k3s/server/server.sh — copy preloaded images to k3s
  • Various Helm templates — add imagePullPolicy: IfNotPresent to all DockerHub images except for edge-endpoint and inference images

Test plan

Tested on G4 EC2 (Tesla T4, k3s v1.33.5):

  • mount-s3 sidecar starts and FUSE-mounts S3 bucket successfully
  • Host sees mounted files at /opt/groundlight/edge/pinamod-mount/
  • 9 detector inference pods start without warmup Job or sync-pinamod
  • Inference returns edge predictions (from_edge: true) for all 9 detectors
  • OD detector (det_34Ws7...) loads pinamod weights from FUSE mount
  • mount-s3 disk cache populates on first read (~83MB for OD weights)
  • No HuggingFace downloads (HF_HUB_OFFLINE=1 enforced)
  • imagePullPolicy: IfNotPresent set on all DockerHub images
  • Balena device test with preloaded images + restricted network

🤖 Generated with Claude Code

Prevents ImagePullBackOff failures on restricted/offline networks by using
preloaded images from k3s's local store instead of pulling from DockerHub.
Also pins alpine:latest to alpine:3.20 in k3s-memory-config for consistency.
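The resulting pattern on each utility container looks like this (a sketch; the container name is taken from the k3s-memory-config change in this PR):

```yaml
- name: k3s-memory-configurator
  image: alpine:3.20              # pinned; matches the preloaded tarball
  imagePullPolicy: IfNotPresent   # prefer the k3s local image store over DockerHub
```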

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@honeytung honeytung requested a review from a team as a code owner April 3, 2026 23:26
honeytung and others added 8 commits April 3, 2026 16:49
Adds a preloader stage to all three server Dockerfiles (CPU, GPU, Jetson)
that uses crane to download k3s airgap images and Helm chart utility images
(aws-cli, kubectl, alpine, busybox) as tarballs at build time. server.sh
copies these to k3s's agent images directory on startup so pods can start
without reaching DockerHub.
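A hypothetical sketch of the server.sh copy step. On a real device the destination would be k3s's airgap directory (/var/lib/rancher/k3s/agent/images), which k3s scans on startup and imports tarballs from; /tmp stand-in paths are used here so the sketch runs anywhere:

```shell
PRELOAD_DIR="${PRELOAD_DIR:-/tmp/preload-demo}"
AIRGAP_DIR="${AIRGAP_DIR:-/tmp/k3s-images-demo}"
mkdir -p "$PRELOAD_DIR" "$AIRGAP_DIR"
touch "$PRELOAD_DIR/alpine-3.20.tar"   # stand-in for a crane-pulled tarball
for tar in "$PRELOAD_DIR"/*.tar; do
  [ -e "$tar" ] || continue   # no tarballs baked in; nothing to copy
  cp "$tar" "$AIRGAP_DIR/"
done
ls "$AIRGAP_DIR"
```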

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the warmup-inference-model Job and sync-pinamod init container
with a mount-s3 FUSE mount on the edge-endpoint pod. Model weights
(pinamod + pretrained weights) are now served via a read-only S3 FUSE
mount with local disk caching, eliminating the need for separate S3
sync steps and enabling offline access to pretrained HuggingFace/timm
weights.

Changes:
- Install mount-s3 in Dockerfile (amd64 + arm64)
- Add mount-s3 init container to edge-deployment with FUSE mount
- Remove warmup-inference-model Job and wait-for-warmup script
- Remove sync-pinamod init container from inference deployment
- Set HF_HUB_OFFLINE=1 and TRANSFORMERS_OFFLINE=1 on inference pods
- Remove batch/jobs RBAC rule (no longer needed)
- Add s3Mount config values (bucket, region, mountPath, cachePath)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mount-s3 is a FUSE daemon that must stay running to serve file reads.
As an init container, the process exits and the FUSE mount disappears.
Convert to a sidecar container running in --foreground mode so the
mount persists for the lifetime of the pod and is visible to inference
pods via hostPath + mountPropagation: Bidirectional.
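The mount invocation itself, as a fragment of mount-s3.sh (--read-only, --cache, and --foreground are referenced in this PR; --region is an assumption based on the s3Mount config values):

```shell
mount-s3 "$S3_BUCKET" "$MOUNT_POINT" \
  --foreground \
  --read-only \
  --cache "$CACHE_DIR" \
  --region "$S3_REGION"
```

Running in --foreground keeps the FUSE daemon, and therefore the mount, alive for the lifetime of the sidecar container.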

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The FUSE mount must happen on the actual hostPath bind mount point
inside the container, not on a path within the overlay filesystem.
Mount on /mnt/s3 (the volume mount point) so the FUSE mount propagates
to the host via Bidirectional mountPropagation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
hostPath volume FUSE mounts don't propagate through containerd's
bind mount setup. Instead, use hostPID + nsenter -t 1 -m to run
mount-s3 directly in the host's mount namespace. Copy the mount-s3
binary and AWS credentials to the host filesystem first, then mount
with --foreground to keep the sidecar alive.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use cp+mv instead of direct cp to avoid ETXTBSY when the old pod
is still running mount-s3 from the same host path.
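A minimal demonstration of the cp+mv pattern (demo paths, not the real ones). Overwriting a binary in place fails with ETXTBSY while it is executing; copying to a temp name and renaming over the target avoids this because rename(2) replaces the directory entry atomically without touching the busy inode:

```shell
SRC=/tmp/demo-new-binary
DST_DIR=/tmp/demo-host-bin
mkdir -p "$DST_DIR"
printf 'v2' > "$SRC"
# Copy to a temp name on the same filesystem, then atomically rename.
cp "$SRC" "$DST_DIR/mount-s3.tmp"
mv -f "$DST_DIR/mount-s3.tmp" "$DST_DIR/mount-s3"
cat "$DST_DIR/mount-s3"
```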

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mount directly on /host-root/<path> (bind mount of host root) from
within the container. This avoids nsenter complexity and library
dependency issues. The FUSE mount on the host-root bind mount makes
files visible at the host path for other pods via hostPath.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move the mount-s3 logic out of the Helm template into a standalone
shell script for readability. The template now just calls the script
and passes config via environment variables.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@honeytung honeytung changed the title from "Add imagePullPolicy: IfNotPresent to all DockerHub images" to "Network resilience: image preloading, pull policy, and S3 FUSE mount for model weights" on Apr 8, 2026
honeytung and others added 2 commits April 8, 2026 11:15
On Balena, k3s runs inside a Docker container whose root filesystem
has private mount propagation by default. The mount-s3 sidecar needs
mountPropagation: Bidirectional which requires shared propagation.
Run mount --make-rshared / before k3s starts to enable this.
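A sketch of the propagation check behind the fix (runs unprivileged anywhere; the actual `mount --make-rshared /` requires root and runs in server.sh before k3s starts). The root mount's optional fields in /proc/self/mountinfo carry a "shared:N" tag once propagation is shared:

```shell
# Field 5 of mountinfo is the mount point; find the line for "/".
root_line=$(awk '$5 == "/" { print; exit }' /proc/self/mountinfo)
echo "$root_line"
case "$root_line" in
  *shared:*) echo "root mount propagation: shared" ;;
  *)         echo "root mount propagation: private (fix: mount --make-rshared /)" ;;
esac
```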

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@honeytung honeytung force-pushed the htung_axoncorp/network-resilience branch 2 times, most recently from 35fe022 to 3ac8e70 on April 8, 2026 at 18:54
HuggingFace and DockerHub are no longer required at runtime:
- Pretrained weights served from S3 via FUSE mount (HF_HUB_OFFLINE=1)
- DockerHub images preloaded with imagePullPolicy: IfNotPresent

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@honeytung honeytung force-pushed the htung_axoncorp/network-resilience branch from 3ac8e70 to 5f6704b on April 8, 2026 at 18:55
@@ -4,6 +4,44 @@
# https://k3d.io/v5.7.4/usage/advanced/cuda/#building-a-customized-k3s-image
Member Author

The following Dockerfile changes are for preloading images with crane. This will also be used in GEP for Balena deployments. Note that the three Dockerfiles represent different architectures (CPU, x86 GPU, and Jetson).

In GEP we kept these in Jinja format and generate the actual Dockerfile via Python. Here we kept a full Dockerfile for each deployment so it is easier for people to reuse them in their custom deployment environments.

containers:
- name: k3s-memory-configurator
-  image: alpine:latest
+  image: alpine:3.20
Member Author

@honeytung honeytung Apr 8, 2026

I modified alpine to use 3.20 instead of latest so that it matches what the network-healer is using below. I am not sure why they are different versions.

Collaborator

@CoreyEWood CoreyEWood left a comment

Overall this looks great! Love how simple the S3 mounting is to set up. And I think ultimately the validation via testing is the most key part of it. I left comments which I think are worth looking at before merging but mostly about smaller portions of it.

Comment thread deploy/balena-k3s/server/Dockerfile Outdated
crane pull --platform "$PLAT" docker.io/amazon/aws-cli:latest /preload/aws-cli-latest.tar && \
crane pull --platform "$PLAT" docker.io/bitnami/kubectl:latest /preload/kubectl-latest.tar && \
crane pull --platform "$PLAT" docker.io/alpine:3.20 /preload/alpine-3.20.tar && \
crane pull --platform "$PLAT" docker.io/busybox:1.35 /preload/busybox-1.35.tar
Collaborator

This pulls busybox:1.35 but edge-deployment.yaml uses busybox:1.36 for apply-edge-config. Should the crane pull command include 1.36 to cover this?

Member Author

@honeytung honeytung Apr 14, 2026

Good catch! Will add that to the list.

command: ["/bin/sh", "/etc/groundlight/edge-config/apply-edge-config.sh"]

containers:
- name: mount-s3
Collaborator

The other containers in this pod all have resources.requests.memory set to avoid eviction under memory pressure. Should mount-s3 have one as well?

Comment on lines +7 to +9
# -------------------------
# Stage 1: Preload - download third-party images as tarballs for airgap/restricted network support
# -------------------------
Collaborator

This preloader stage seems to be duplicated across Dockerfile, Dockerfile.gpu, and Dockerfile.jetson. Could we extract it into a shared Dockerfile or base image so the preload list only needs to be maintained in one place?

Member Author

@honeytung honeytung Apr 14, 2026

We can; for GEP we used a Jinja template. But since edge-endpoint is public facing, I wanted to keep the architectures separate so it is easier for people to copy and modify them based on their needs.

With jinja templating we can make this cleaner, but it will require us to call a python script like GEP does to compile the actual Dockerfile at build time.

Collaborator

Okay gotcha, makes sense. In this case we'll just have to remember to keep all of these in sync.

Collaborator

Maybe add a note in this section that it should be kept in sync across all of these when making changes, so that people in the future are aware.

Comment thread NETWORK-REQUIREMENTS.md
Comment on lines +20 to +21
- `*.us-west-2.amazonaws.com`: AWS access for inference image download (ECR), model weights (S3), and credential refresh (STS)
- `*.s3.amazonaws.com`: AWS S3 access for model weights via FUSE mount
Collaborator

Do you know why we need both of these instead of just the first one?

Member Author

I believe the AWS CLI may use the global endpoints (s3.amazonaws.com, sts.amazonaws.com) without the region prefix even when a region is specified, especially when talking to STS to get credentials.

Collaborator

Have we fully deprecated the non-helm deploy route? It looks like this script is still referenced in the old deploy/k3s/inference_deployment_template.yaml, so we might need to keep it around if anyone else is still using that deployment path.

Member Author

Good call, there are some people still using the non-helm deploy route. I will revert the script for now.

Collaborator

Sounds good. And from looking at it, it seems to me like nothing else here should break non-helm deployments hopefully. Do you think we should test that explicitly to be safe? Or are you confident about that?

Member Author

Yea I think this is fine, since we did not touch any k3s templates. We can communicate with the people still using the old k3s method to switch over soon so we can fully deprecate it.

Comment thread deploy/bin/mount-s3.sh Outdated
Comment on lines +22 to +25
# Copy AWS credentials to host for mount-s3 to use
mkdir -p /host-root/root/.aws
cp /root/.aws/credentials /host-root/root/.aws/credentials 2>/dev/null || true
cp /root/.aws/config /host-root/root/.aws/config 2>/dev/null || true
Collaborator

Why is this needed if it already has access to the credentials here?

Member Author

I think I originally planned to run mount-s3 on the host for testing, but I agree it shouldn't be necessary; I will test it out and update.

mountPropagation: Bidirectional
- name: aws-credentials
mountPath: /root/.aws
readOnly: true
Collaborator

Should this container have a liveness probe? If the FUSE mount fails silently (e.g. a credentials issue), the pod would appear healthy but inference pods would see an empty directory. Claude says that something like checking mountpoint -q /host-root/<mount_path> could catch this.
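A possible shape for that probe, assuming the mount path from the test plan (/opt/groundlight/edge/pinamod-mount); timings are placeholders:

```yaml
livenessProbe:
  exec:
    # mountpoint -q exits non-zero if the path is not an active mount,
    # so a silently failed FUSE mount restarts the sidecar.
    command: ["mountpoint", "-q", "/host-root/opt/groundlight/edge/pinamod-mount"]
  initialDelaySeconds: 10
  periodSeconds: 30
```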

Member Author

Good call, will add it 🙂

imagePullPolicy: "{{ include "groundlight-edge-endpoint.edgeEndpointPullPolicy" . }}"
securityContext:
privileged: true # Required for FUSE mount
command: ["/bin/bash", "/groundlight-edge/deploy/bin/mount-s3.sh"]
Collaborator

Do you think we need to think about the ordering of this container vs the inference pods starting up at all? Most of the time it seems like it should be fine because we expect this to start before inference pods do, and it should run quickly. And maybe adding a liveness probe would cover the scenario where this doesn't succeed. But just want to flag in case there's other scenarios we should cover to avoid some kind of silent failure.

Member Author

@honeytung honeytung Apr 14, 2026

I think this is fine, usually mount-s3 will start fairly quickly, but a liveness probe from the comment above should solve this issue. Also, the inference pod at startup will still need to pull the inference image so it will definitely be slower than the s3 mount.

Comment thread deploy/bin/mount-s3.sh
fi

echo "Mounting s3://$S3_BUCKET at $MOUNT_POINT (cache: $CACHE_DIR, region: $S3_REGION)"
mount-s3 "$S3_BUCKET" "$MOUNT_POINT" \
Collaborator

Not sure the current state of edge documentation but do you think we should add a short section about the mountpoint usage to a README somewhere?

Member Author

I think we can add that later, I think the current edge architecture docs only covers our escalation and ML logic but not the structure of the edge-endpoint.

- Add busybox:1.36 to preloader in all 3 Balena Dockerfiles (covers
  apply-edge-config init container added in main merge)
- Add resources.requests.memory: 50Mi to mount-s3 sidecar (consistent
  with other containers in the pod)
- Add liveness probe to mount-s3 (mountpoint -q check to detect FUSE
  mount failures)
- Remove unnecessary AWS credentials copy to /host-root in mount-s3.sh
  (mount-s3 reads creds from its own /root/.aws mount)
- Restore wait-for-warmup.sh for non-helm k3s deploy path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collaborator

@CoreyEWood CoreyEWood left a comment

Changes look good! Left a few more comments but then should be ready to merge.

Comment thread deploy/balena-k3s/server/Dockerfile Outdated
Comment on lines +40 to +41
crane pull --platform "$PLAT" docker.io/busybox:1.35 /preload/busybox-1.35.tar && \
crane pull --platform "$PLAT" docker.io/busybox:1.36 /preload/busybox-1.36.tar
Collaborator

It looks like we use 1.35 and 1.36 in one place each, I wonder if we can switch to using just one version to avoid having to pull both images? Unless there's a reason we need both 1.35 and 1.36 for those specific spots.


honeytung and others added 2 commits April 15, 2026 14:04
- Standardize on busybox:1.36 everywhere (was 1.35 in splunk validation
  and preloaders, 1.36 in apply-edge-config)
- Add NOTE comments to preloader stages reminding to keep image lists
  in sync across Dockerfile, Dockerfile.gpu, and Dockerfile.jetson

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…k-resilience

# Conflicts:
#	deploy/helm/groundlight-edge-endpoint/files/inference-deployment-template.yaml
@honeytung honeytung merged commit bd1f7c2 into main Apr 15, 2026
12 checks passed
@honeytung honeytung deleted the htung_axoncorp/network-resilience branch April 15, 2026 21:48