edge-endpoint can be configured by python-sdk #359
```sh
cp deploy/helm/groundlight-edge-endpoint/files/default-edge-config.yaml $EDGE_CONFIG_FILE
sed -i "s/detector_id: \"\"/detector_id: \"$DETECTOR_ID\"/" $EDGE_CONFIG_FILE
sed -i "s/refresh_rate: 60/refresh_rate: $REFRESH_RATE/" $EDGE_CONFIG_FILE
cat > $EDGE_CONFIG_FILE <<EOF
```
We no longer have default-edge-config.yaml (since we rely on Pydantic defaults), so we need to construct a config from scratch here.
brandon-wada left a comment:
I'm reminded how rusty I am on the edge code. At least all the parts look like they make sense to me
```python
    detector is responding to health checks.
    """
    config = EdgeConfigManager.active()
    detector_ids = [d.detector_id for d in config.detectors if d.detector_id]
```
When is detector_id falsey?
It should not be. I have adjusted the Pydantic model to ensure that it is not, and removed all the defensive code that assumes it might be.
```yaml
    subPath: dummy-nginx.conf
  - name: edge-endpoint-persistent-volume
    mountPath: /opt/groundlight/edge/sqlite
  - name: edge-endpoint-persistent-volume
```
Do we want to reuse the existing volume for this?
Yes, this reuses the existing edge-endpoint-pvc. We just added a new mountPath at /opt/groundlight/edge/config on the same PVC that already mounts /opt/groundlight/edge/sqlite and /opt/groundlight/edge/serving/model-repo. No new PVC.
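The reuse described above can be sketched as a fragment of the Deployment spec (a hedged sketch: the volume and claim names come from this discussion, surrounding fields are omitted). The same PVC-backed volume is simply mounted at multiple paths:

```yaml
# Sketch only: one PVC-backed volume, multiple mountPaths on the same container.
volumeMounts:
  - name: edge-endpoint-persistent-volume
    mountPath: /opt/groundlight/edge/sqlite
  - name: edge-endpoint-persistent-volume
    mountPath: /opt/groundlight/edge/config
volumes:
  - name: edge-endpoint-persistent-volume
    persistentVolumeClaim:
      claimName: edge-endpoint-pvc
```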
Summary
Adds HTTP endpoints for reading and modifying the edge endpoint configuration at runtime, without requiring a Helm redeploy. A Python SDK companion PR provides the client-side methods.
New endpoints:
- `GET /edge-config` -- returns the active `EdgeEndpointConfig`
- `PUT /edge-config` -- replaces the active config (diffs against current state, adds/removes detector pods accordingly)
- `GET /edge-detector-readiness` -- reports which configured detectors have inference pods ready to serve
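The endpoints can be driven with plain HTTP, without the SDK. A hedged sketch with Python's stdlib `urllib` follows; the endpoint path is from this PR, but the base URL/port and the payload field names (`detectors`, `detector_id`) are assumptions for illustration:

```python
# Build a PUT /edge-config request that replaces the active config.
# Base URL and payload shape are assumptions, not confirmed by the PR text.
import json
from urllib.request import Request

def put_edge_config_request(base_url: str, config: dict) -> Request:
    """Construct (but do not send) a PUT /edge-config request."""
    return Request(
        f"{base_url}/edge-config",
        data=json.dumps(config).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

req = put_edge_config_request(
    "http://localhost:30101",  # assumed NodePort; adjust for your cluster
    {"detectors": [{"detector_id": "det_abc123"}]},
)
```

Sending the request (e.g. via `urllib.request.urlopen(req)`) should behave as described above: 200 on a valid body, 422 on an invalid one.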
Key design decisions

Config as a shared file on PVC
Multiple uvicorn workers cannot share in-memory state, so the active config is persisted to a YAML file on the existing PVC (`/opt/groundlight/edge/config/active-edge-config.yaml`). Workers read it through an mtime-based cache (`EdgeConfigManager.active()`), which calls `os.path.getmtime` on each access and only re-parses the file when it has changed. This gives cross-worker consistency with negligible overhead.
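The mtime-based cache can be sketched as follows. This is a simplified, hedged rendition of what `EdgeConfigManager.active()` does, not the actual implementation; the class name here is invented, and JSON stands in for YAML only to keep the example stdlib-only:

```python
# Hedged sketch: re-parse the config file only when its modification time
# changes, so every uvicorn worker sees updates at the cost of one stat().
import json
import os

class CachedConfigFile:
    def __init__(self, path: str):
        self.path = path
        self._cached = None
        self._cached_mtime = None

    def active(self):
        """Return the parsed config, re-reading only when the file changed."""
        mtime = os.path.getmtime(self.path)  # one cheap stat() per access
        if self._cached is None or mtime != self._cached_mtime:
            with open(self.path) as f:
                self._cached = json.load(f)
            self._cached_mtime = mtime
        return self._cached
```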
Config seeding via init container

An `apply-edge-config` init container copies the Helm ConfigMap to the active config file on the PVC. It uses a revision fingerprint (revision + deploy timestamp) stored on the PVC to distinguish `helm install`/`upgrade` from restarts, copying only when the fingerprint changes. This ensures Helm config is applied on deploy while SDK changes survive restarts.

At startup, `main.py` checks: `EDGE_CONFIG` env var (for Balena) > active config on PVC > Pydantic defaults.
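The fingerprint check can be rendered in a few lines. The real init container is a busybox shell script (`files/apply-edge-config.sh`); the Python below is a hedged stand-in, with invented function and path names, showing only the decision logic:

```python
# Hedged sketch of the init container logic: copy Helm's config onto the
# PVC only when the revision fingerprint changed (install/upgrade), never
# on a plain pod restart, so SDK-applied changes survive restarts.
import os
import shutil

def seed_active_config(configmap_file: str, active_file: str,
                       fingerprint_file: str, fingerprint: str) -> bool:
    """Return True if the Helm config was copied onto the PVC."""
    previous = None
    if os.path.exists(fingerprint_file):
        with open(fingerprint_file) as f:
            previous = f.read().strip()
    if previous == fingerprint:
        return False  # restart: keep whatever the SDK last wrote
    shutil.copyfile(configmap_file, active_file)  # new deploy: Helm wins
    with open(fingerprint_file, "w") as f:
        f.write(fingerprint)
    return True
```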
Backward compatibility

Older Helm charts lack the init container. In that case, `main.py` detects the missing active config file and falls back to copying the Helm ConfigMap directly.
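Combining the startup priority and the legacy fallback, the source selection in `main.py` amounts to something like the following (a hedged sketch; the function and argument names are assumptions, not the real code):

```python
# Priority at startup: EDGE_CONFIG env var > active config on PVC >
# (legacy charts only) the Helm ConfigMap > Pydantic model defaults.
def choose_config_source(environ: dict, active_config_on_pvc: bool,
                         helm_configmap_present: bool) -> str:
    if environ.get("EDGE_CONFIG"):
        return "env"            # Balena injects the whole config via env
    if active_config_on_pvc:
        return "pvc"            # seeded by the init container or the SDK
    if helm_configmap_present:
        return "configmap"      # old chart without the init container
    return "defaults"           # empty config -> Pydantic defaults apply
```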
Detector reconciliation

`PUT /edge-config` triggers `reconcile_config()`, which diffs the new config against the current one (`compute_detector_diff`) and marks detectors removed from the config as `pending_deletion` in the DB.

The `inference-model-updater` picks up these DB changes on its next loop iteration -- deleting pods for `pending_deletion` records, then creating pods for new ones. This could cause the config update process to take a bit longer than necessary, because we have to wait for the refresh loop to occur. I thought about adding logic to make the inference-model-updater poll for config updates while it waits, and terminate early if it sees one. In the end, I decided not to add that (yet). YAGNI.
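A minimal sketch of the diff step, assuming a set-based comparison over detector IDs (the real `compute_detector_diff` signature is not shown in this PR text, so the shape below is an assumption):

```python
# Hedged sketch: compare detector IDs between the current and newly PUT
# config to decide which inference pods to create and which DB records
# the inference-model-updater should see flagged pending_deletion.
def compute_detector_diff(current_ids, new_ids):
    current, new = set(current_ids), set(new_ids)
    to_add = sorted(new - current)      # pods the updater should create
    to_remove = sorted(current - new)   # records to mark pending_deletion
    return to_add, to_remove
```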
Deprecation of default-edge-config.yaml

The static `deploy/helm/groundlight-edge-endpoint/files/default-edge-config.yaml` has been removed. Default config is now defined by Pydantic model defaults in the `python-sdk`. When no config file is provided to Helm, `_helpers.tpl` generates an empty YAML object, and the system uses Pydantic defaults.
Notable refactorings

- `EdgeConfigManager` (new, replaces `edge_config_loader.py`): Singleton-style class with class methods for config lifecycle -- `save()`, `active()`, `detector_configs()`, `detector_config()`.
- `EdgeInferenceManager` simplification: Removed stored `detector_inference_configs`, `inference_client_urls`, `oodd_inference_client_urls`, `min_times_between_escalations`, and the 30-second scheduler poll. The manager now computes pod URLs on the fly from detector IDs and reads config from `EdgeConfigManager.active()`. Only `last_escalation_times` remains as runtime state.
- `naming.py` (new): Extracted `get_edge_inference_service_name` and `get_edge_inference_model_name` out of `edge_inference.py` to break a circular import between `edge_inference.py` and `edge_config_manager.py`.
- `pending_deletion` column: Added to the `InferenceDeployment` DB model. The column has a server default of `False`, so existing databases are compatible without migration.
- `apply-edge-config` init container (new): Busybox-based init container added to the edge-endpoint Deployment. Uses a revision fingerprint to conditionally copy Helm config to PVC. Script lives in `files/apply-edge-config.sh`, embedded in the `edge-config` ConfigMap.
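The no-migration claim for the `pending_deletion` column can be demonstrated directly in SQLite. This is an illustration only -- the table and column names come from this PR, but the raw SQL stands in for whatever the ORM actually emits:

```python
# Adding a column with a server default backfills existing rows, so a
# database created before this PR keeps working with no migration step.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inference_deployments (detector_id TEXT)")
conn.execute("INSERT INTO inference_deployments VALUES ('det_abc')")
# The "server default" -- the pre-existing row immediately reads as 0 (False):
conn.execute(
    "ALTER TABLE inference_deployments "
    "ADD COLUMN pending_deletion INTEGER NOT NULL DEFAULT 0"
)
row = conn.execute(
    "SELECT detector_id, pending_deletion FROM inference_deployments"
).fetchone()
```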
Authentication

These new endpoints are unauthenticated, consistent with all other non-inference endpoints on the edge endpoint (status page, metrics, health checks). The edge endpoint assumes a trusted local network.

`GET /edge-config` and `GET /edge-detector-readiness` are comparable in sensitivity to the existing status page and `metrics.json`, which already expose detector IDs, model versions, and readiness.

`PUT /edge-config` is the only unauthenticated write endpoint that modifies system state. An attacker with network access could add or remove detectors. This is an accepted tradeoff under the trusted-network assumption. If edge endpoints are ever exposed to untrusted networks, this endpoint should be the first to get auth.
Release strategy

This PR must be deployed before the companion SDK PR (`python-sdk` 0.26.0). The SDK's `gl.edge.*` methods call endpoints that only exist after this PR is deployed. Releasing the SDK first would cause those methods to 404.

The edge endpoint's `groundlight` dependency does not need to change for this PR -- it uses SDK features already available in 0.25.x. However, once the SDK is released at 0.26.0, a follow-up bump to `>=0.26.0, <0.27.0` is recommended.
Tests

- `test/api/test_edge_config.py`: GET returns 200, PUT validates body (422 on invalid, 200 on valid)
- `test/core/test_edge_config_manager.py`: `compute_detector_diff` (6 cases), `apply_detector_changes` (3 cases), `EdgeConfigManager` (save/load/active roundtrip, mtime caching, startup priority, detector config lookups)
Load Testing

This PR touches some code in the hot path of inference (in what I hope is a trivial way). To ensure that I am not introducing any performance regressions, I performed the following benchmark. The results appear nearly identical, each test achieving a "Max Steady RPS" of 30 with the count `step-yolox-tracking` pipeline.

With changes: (benchmark screenshot)

Before changes: (benchmark screenshot)