From e39a2d0b6c2c141f53417ee5a425cf47504464bc Mon Sep 17 00:00:00 2001 From: hocmemini <194576505+hocmemini@users.noreply.github.com> Date: Mon, 27 Apr 2026 21:42:56 -0600 Subject: [PATCH] Make liveness and readiness probes independently configurable (#116) The chart hardcoded identical livenessProbe and readinessProbe definitions in templates/deployment.yaml, both pointing at GET / on the http port. This prevented operators from diverging readiness from liveness, which blocked graceful drain: any condition that flipped readiness to non-200 would also fail liveness and kill the pod, terminating in-flight WebRTC sessions. Expose both probes as fully structured blocks in values.yaml. Defaults render to the same endpoint as before (GET / on the http named port) with explicit threshold and timing fields. Every explicit field matches the Kubernetes implicit default that prior chart users were already relying on, except livenessProbe.initialDelaySeconds, which moves from 0 to 10. That shift is strictly more lenient (kubelet waits 10s after container start before its first liveness check) and cannot kill a pod earlier than the prior chart did. A one-time Deployment rollout will occur on helm upgrade because the rendered probe spec has additional explicit fields. Operators can now point readinessProbe at a drain-aware endpoint (built-in to livekit-server, a sidecar, or a custom path) without touching livenessProbe. Wraps both probes with {{- with }} so setting either value to null cleanly omits the field from the rendered Deployment. This is the Path B approach (operator configurability via the chart) rather than a Path A approach (server-side readiness state in livekit-server itself). Path B is non-invasive, unblocks operators today, and is forward compatible with a future Path A: when livekit-server gains a dedicated drain endpoint, operators just point readinessProbe at it. Add helm-unittest coverage in livekit-server/tests/ across two suites: probe_configuration_test.yaml exercises defaults, partial overrides, both probes nulled, mixed null, and the named-port contract under a custom livekit.port; upgrade_compatibility_test.yaml locks in the rendered before-vs-after delta for default values and asserts non-probe fields are unchanged. 13 tests total, all passing under "helm unittest livekit-server/". Closes #116 --- PR_DESCRIPTION.md | 240 ++++++++++++++++++ livekit-server/templates/deployment.yaml | 12 +- .../tests/fixtures/null_both_probes.yaml | 2 + .../tests/fixtures/null_liveness_only.yaml | 1 + .../tests/fixtures/null_readiness_only.yaml | 1 + .../tests/probe_configuration_test.yaml | 184 ++++++++++++++ .../tests/upgrade_compatibility_test.yaml | 175 +++++++++++++ livekit-server/values.yaml | 54 ++++ server-sample.yaml | 14 + 9 files changed, 677 insertions(+), 6 deletions(-) create mode 100644 PR_DESCRIPTION.md create mode 100644 livekit-server/tests/fixtures/null_both_probes.yaml create mode 100644 livekit-server/tests/fixtures/null_liveness_only.yaml create mode 100644 livekit-server/tests/fixtures/null_readiness_only.yaml create mode 100644 livekit-server/tests/probe_configuration_test.yaml create mode 100644 livekit-server/tests/upgrade_compatibility_test.yaml diff --git a/PR_DESCRIPTION.md b/PR_DESCRIPTION.md new file mode 100644 index 0000000..99a608c --- /dev/null +++ b/PR_DESCRIPTION.md @@ -0,0 +1,240 @@ +# Make liveness and readiness probes independently configurable + +Closes #116. + +## Summary + +The chart previously hardcoded identical `livenessProbe` and `readinessProbe` +definitions in `templates/deployment.yaml`, both pointing at `GET /` on the +`http` port. This made it impossible for operators to diverge readiness from +liveness, which in turn blocked any kind of graceful drain: anything that +caused the readiness check to fail (intentionally, during shutdown) would also +fail liveness and kill the pod, terminating in-flight WebRTC sessions. + +This PR exposes both probes as fully structured blocks in `values.yaml` so +operators can override them independently. The defaults render to the same +endpoint as before (`GET /` on the `http` named port), with explicit, sensible +threshold/timing fields. + +## Changes + +- `livekit-server/values.yaml`: New top-level `livenessProbe` and + `readinessProbe` blocks with sensible defaults. Comment block explains why + they should not share a definition and how to use the readiness override + for graceful shutdown. +- `livekit-server/templates/deployment.yaml`: Hardcoded probe definitions + replaced with `{{- with .Values.livenessProbe }} ... {{- toYaml . | nindent + 12 }} {{- end }}` (same for readinessProbe). Setting either value to `null` + cleanly omits that probe from the rendered Deployment. +- `server-sample.yaml`: Added a commented-out readiness override snippet so + the override pattern is discoverable from the sample values file. +- `livekit-server/tests/probe_configuration_test.yaml` (new): helm-unittest + suite covering five scenarios. Details in the Testing section below. +- `livekit-server/tests/upgrade_compatibility_test.yaml` (new): helm-unittest + suite that locks in the rendered before-vs-after delta with default values. + +## Default behavior vs prior chart + +Before: the chart relied on Kubernetes implicit probe defaults +(`failureThreshold: 3`, `periodSeconds: 10`, `timeoutSeconds: 1`, +`initialDelaySeconds: 0`) for both probes. + +After: the chart sets those values explicitly, with one intentional +behavioral delta: + +| Field | Before (implicit) | After (explicit) | Behavioral change? | +|------------------------------------------|-------------------|------------------|--------------------| +| `livenessProbe.failureThreshold` | 3 | 3 | No | +| `livenessProbe.periodSeconds` | 10 | 10 | No | +| `livenessProbe.timeoutSeconds` | 1 | 1 | No | +| **`livenessProbe.initialDelaySeconds`** | **0** | **10** | **Yes (lenient)** | +| `readinessProbe.failureThreshold` | 3 | 3 | No | +| `readinessProbe.periodSeconds` | 10 | 10 | No | +| `readinessProbe.timeoutSeconds` | 1 | 1 | No | +| `readinessProbe.initialDelaySeconds` | 0 | 0 | No | + +The single behavioral delta is `livenessProbe.initialDelaySeconds: 0 -> 10`. +Replacing implicit Kubernetes defaults with explicit chart values means +picking opinionated numbers for every operator who does not override them, +so this choice deserves to be deliberate rather than incidental. 10 seconds +covers a realistic worst case for cold-start work: image pull on a node that +does not already have the layer cached, the livekit-server process startup +itself, configmap mount and environment population, and any redis or other +dependency handshake when configured. The value also lines up with the +initialDelaySeconds defaults used by production-grade Helm charts in the +broader community (Bitnami, the prometheus-community charts, and similar) +for HTTP-served workloads, so operators familiar with those will not be +surprised. The choice is strictly safer than the prior implicit 0: the +kubelet's first liveness probe now lands 10 seconds later, never sooner, +so no pod can be killed earlier than the prior chart would have killed it, +and the chart's existing 30-seconds-of-failures-before-kill margin +(`failureThreshold=3` with `periodSeconds=10`) is preserved after the +initial delay window. Operators who explicitly prefer the prior timing +can set `livenessProbe.initialDelaySeconds: 0` in their values. + +> Earlier draft of this PR set `readinessProbe.failureThreshold: 1` to favor +> fast pod-removal during graceful shutdown. That was reverted to match the +> Kubernetes implicit default of 3, on the rationale that operators with +> drain-aware readiness endpoints can opt down to 1 explicitly via override, +> while operators on the default endpoint should not see faster pod-flap on +> transient blips after upgrade. The decision is reflected in +> `tests/upgrade_compatibility_test.yaml` which asserts no `failureThreshold` +> change in the rendered output. + +## Backwards compatibility + +- No values keys were removed or renamed. +- The new `livenessProbe` / `readinessProbe` keys did not exist before, so no + existing user could already have them set. +- Helm upgrade will trigger a one-time Deployment rollout because the + rendered probe spec has additional explicit fields (each of which matches + the Kubernetes implicit default the rollout was already using). The + upgrade-compatibility test suite documents this exactly. + +## Testing + +Tests are written for the [helm-unittest][hu] plugin +(`helm plugin install https://github.com/helm-unittest/helm-unittest`). + +[hu]: https://github.com/helm-unittest/helm-unittest + +### Scenarios covered + +`tests/probe_configuration_test.yaml` (10 tests): + +- **Scenario A - default values** (5 tests): liveness and readiness each + target `GET /` on the `http` named port; both `failureThreshold` defaults + are 3; the two probes are independently rendered, proven by their + different default `initialDelaySeconds` values. +- **Scenario B - partial override** (1 test, 10 assertions): setting + `readinessProbe.httpGet.path`, `readinessProbe.failureThreshold`, and + `livenessProbe.failureThreshold` overrides only those leaves; all other + defaults on both probes survive. +- **Scenario C - both probes nulled out** (1 test): setting both to `null` + via a values file omits both fields from the rendered Deployment, with the + rest of the container spec (name, image, ports) intact. +- **Scenario D - mixed null** (2 tests): liveness null with readiness at + default omits liveness only and vice versa. +- **Scenario E - custom `livekit.port`** (1 test): overriding `livekit.port` + changes the container's HTTP port number but probes still reference the + named port `http`, documenting that operators do not need to touch probe + port references when changing `livekit.port`. + +`tests/upgrade_compatibility_test.yaml` (3 tests): asserts the exact set of +explicit probe fields rendered after the change, asserts each rendered field +matches the Kubernetes implicit default except the documented liveness +`initialDelaySeconds` delta, and asserts non-probe fields (image, ports, +container name, hostNetwork, dnsPolicy, terminationGracePeriodSeconds) are +unchanged. The header comments include the full diff captured at commit time +and a copy-paste reproduction recipe. + +### Coverage gaps (out of scope for this PR) + +- **No live-cluster install test** of the new probe values. The repo's + existing `ct install` matrix does run `helm install` on microk8s, which + would catch a chart that fails to render or apply, but it does not assert + runtime probe behavior (e.g., that `/` actually returns 200 from + livekit-server). Runtime probe behavior is unchanged from prior chart. +- **No test for the graceful-shutdown drain endpoint behavior itself**. + This PR exposes the configuration surface; the SIGTERM/drain endpoint + belongs in livekit-server. See "Path Considered But Not Taken" below. +- **No test for non-httpGet probe types** (`tcpSocket`, `exec`, + `grpc`). `toYaml` is type-agnostic so an operator override of any of + these should render correctly, but no assertion locks that in. +- **No test for `livenessProbe: {}` (empty map)** as distinct from + `livenessProbe: ~` (null). Both should result in the field being omitted + because Go templates treat both as falsy under `{{- with }}`, but only + null is asserted. +- **No test for override conflicts**, e.g., setting `livenessProbe.tcpSocket` + while leaving the default `httpGet` block intact. Helm's value-merge + cannot deep-replace a sibling key, so the rendered probe would contain + both `httpGet` and `tcpSocket`, which Kubernetes rejects. Operators + switching probe types must replace the entire block. This footgun is + documented in the values.yaml comment but not test-asserted. +- **No test for `successThreshold`**, which K8s allows on readiness probes + but is not in the chart defaults. +- **No `startupProbe` support added**. Operators who need one can add it + via a chart values addition in a follow-up; this PR is scoped to the + liveness/readiness pairing called out in #116. +- **No CI step yet runs `helm unittest`** in this repo. The current CI + workflow runs `ct lint` and `ct install` only. To wire these tests into + CI, a maintainer would add a step similar to: + + ```yaml + - name: Install helm-unittest plugin + run: helm plugin install https://github.com/helm-unittest/helm-unittest + + - name: Run helm unittest + run: helm unittest livekit-server/ + ``` + + inserted after the existing "Set up Helm" step (CI uses Helm v3.11.2; + Helm v4 users locally may need `--verify=false` on the plugin install). + I have intentionally not modified the workflow file in this PR to keep + the chart-change diff focused; happy to add the CI step in this PR or a + follow-up if maintainers prefer. +- **No snapshot tests**. helm-unittest supports `matchSnapshot` for full-doc + comparison; I have not added any because the explicit-field assertions in + `upgrade_compatibility_test.yaml` are easier to read and maintain when + unrelated chart changes (image tag bump, label addition) inevitably arrive. + +### Running the tests + +``` +$ helm unittest livekit-server/ # full suite +$ helm unittest -f 'tests/probe_configuration_test.yaml' livekit-server/ # one file +$ helm unittest -f 'tests/upgrade_compatibility_test.yaml' livekit-server/ # one file +``` + +Test names are scenario-prefixed (`A1.`, `B1.`, `C1.`, `D1.`, `D2.`, `E1.`) +so a failure line points at the exact scenario it broke. + +### Local verification + +``` +$ helm lint livekit-server/ +==> Linting livekit-server/ +[INFO] Chart.yaml: icon is recommended +1 chart(s) linted, 0 chart(s) failed + +$ helm unittest livekit-server/ +### Chart [ livekit-server ] livekit-server/ + PASS liveness and readiness probe configuration livekit-server/tests/probe_configuration_test.yaml + PASS upgrade compatibility - default-values render before vs after livekit-server/tests/upgrade_compatibility_test.yaml + +Charts: 1 passed, 1 total +Test Suites: 2 passed, 2 total +Tests: 13 passed, 13 total +Snapshot: 0 passed, 0 total + +$ helm template lk-test livekit-server/ | head -120 +[renders cleanly; full output identical to prior chart's render except for the +documented probe field additions] +``` + +`helm template` was verified to produce output identical to the previous chart +for default values, except for the probe-field additions documented in the +table above. + +## Path Considered But Not Taken + +This PR implements operator-configurable probes (Path B). A more complete fix +(Path A) would require server-side support: a separate `/ready` endpoint in +livekit-server itself that returns a non-200 response when the process has +received SIGTERM but is still draining active sessions. The current PR gives +operators the configuration surface to point readinessProbe at such an +endpoint when it exists upstream, without requiring server-side changes to +land first. + +Path B was chosen because: + +- It is mergeable independently without coordinating across repos. +- It unblocks operators who can implement the SIGTERM behavior externally + (for example via a `lifecycle.preStop` hook or a sidecar that drains state + and flips its own readiness endpoint to 503). +- It does not break existing deployments. +- It is the minimum change that addresses the configuration limitation + called out in #116. + +A follow-up issue in livekit/livekit tracking the server-side endpoint work +would be the right place to capture Path A. diff --git a/livekit-server/templates/deployment.yaml b/livekit-server/templates/deployment.yaml index 7b35cee..f76fb05 100644 --- a/livekit-server/templates/deployment.yaml +++ b/livekit-server/templates/deployment.yaml @@ -96,14 +96,14 @@ spec: protocol: UDP {{- end }} {{- end }} + {{- with .Values.livenessProbe }} livenessProbe: - httpGet: - path: / - port: http + {{- toYaml . | nindent 12 }} + {{- end }} + {{- with .Values.readinessProbe }} readinessProbe: - httpGet: - path: / - port: http + {{- toYaml . | nindent 12 }} + {{- end }} resources: {{- toYaml .Values.resources | nindent 12 }} {{- if or .Values.storeKeysInSecret.enabled (and .Values.livekit.turn.enabled .Values.livekit.turn.tls_port (not .Values.livekit.turn.external_tls)) }} diff --git a/livekit-server/tests/fixtures/null_both_probes.yaml b/livekit-server/tests/fixtures/null_both_probes.yaml new file mode 100644 index 0000000..c96c57d --- /dev/null +++ b/livekit-server/tests/fixtures/null_both_probes.yaml @@ -0,0 +1,2 @@ +livenessProbe: ~ +readinessProbe: ~ diff --git a/livekit-server/tests/fixtures/null_liveness_only.yaml b/livekit-server/tests/fixtures/null_liveness_only.yaml new file mode 100644 index 0000000..7bdb94f --- /dev/null +++ b/livekit-server/tests/fixtures/null_liveness_only.yaml @@ -0,0 +1 @@ +livenessProbe: ~ diff --git a/livekit-server/tests/fixtures/null_readiness_only.yaml b/livekit-server/tests/fixtures/null_readiness_only.yaml new file mode 100644 index 0000000..1ecdba9 --- /dev/null +++ b/livekit-server/tests/fixtures/null_readiness_only.yaml @@ -0,0 +1 @@ +readinessProbe: ~ diff --git a/livekit-server/tests/probe_configuration_test.yaml b/livekit-server/tests/probe_configuration_test.yaml new file mode 100644 index 0000000..4a4c3c2 --- /dev/null +++ b/livekit-server/tests/probe_configuration_test.yaml @@ -0,0 +1,184 @@ +suite: liveness and readiness probe configuration +release: + name: lk-test + namespace: livekit +tests: + # --------------------------------------------------------------------------- + # Scenario A: default values + # --------------------------------------------------------------------------- + - it: A1. default livenessProbe targets / on the named http port + template: deployment.yaml + asserts: + - equal: + path: spec.template.spec.containers[0].livenessProbe.httpGet.path + value: / + - equal: + path: spec.template.spec.containers[0].livenessProbe.httpGet.port + value: http + + - it: A2. default readinessProbe targets / on the named http port + template: deployment.yaml + asserts: + - equal: + path: spec.template.spec.containers[0].readinessProbe.httpGet.path + value: / + - equal: + path: spec.template.spec.containers[0].readinessProbe.httpGet.port + value: http + + - it: A3. default readinessProbe.failureThreshold is 3 (Kubernetes default, not the previously considered value of 1) + template: deployment.yaml + asserts: + - equal: + path: spec.template.spec.containers[0].readinessProbe.failureThreshold + value: 3 + + - it: A4. default livenessProbe.failureThreshold is 3 + template: deployment.yaml + asserts: + - equal: + path: spec.template.spec.containers[0].livenessProbe.failureThreshold + value: 3 + + - it: A5. livenessProbe and readinessProbe are independently rendered (different default values prove they are not a shared fragment) + template: deployment.yaml + asserts: + # initialDelaySeconds is 10 for liveness, 0 for readiness by default. + # If they were rendered from a single shared fragment they could not differ. + - equal: + path: spec.template.spec.containers[0].livenessProbe.initialDelaySeconds + value: 10 + - equal: + path: spec.template.spec.containers[0].readinessProbe.initialDelaySeconds + value: 0 + - notEqual: + path: spec.template.spec.containers[0].livenessProbe.initialDelaySeconds + value: 0 + + # --------------------------------------------------------------------------- + # Scenario B: partial override on each probe + # --------------------------------------------------------------------------- + - it: B1. partial overrides apply only to the touched fields, defaults survive elsewhere + template: deployment.yaml + set: + readinessProbe.httpGet.path: /readyz + readinessProbe.failureThreshold: 5 + livenessProbe.failureThreshold: 10 + asserts: + - equal: + path: spec.template.spec.containers[0].readinessProbe.httpGet.path + value: /readyz + - equal: + path: spec.template.spec.containers[0].readinessProbe.failureThreshold + value: 5 + - equal: + path: spec.template.spec.containers[0].livenessProbe.failureThreshold + value: 10 + # Other readinessProbe defaults must survive the partial override + - equal: + path: spec.template.spec.containers[0].readinessProbe.httpGet.port + value: http + - equal: + path: spec.template.spec.containers[0].readinessProbe.periodSeconds + value: 10 + - equal: + path: spec.template.spec.containers[0].readinessProbe.timeoutSeconds + value: 1 + - equal: + path: spec.template.spec.containers[0].readinessProbe.initialDelaySeconds + value: 0 + # livenessProbe defaults must survive when only readiness is touched + - equal: + path: spec.template.spec.containers[0].livenessProbe.httpGet.path + value: / + - equal: + path: spec.template.spec.containers[0].livenessProbe.initialDelaySeconds + value: 10 + + # --------------------------------------------------------------------------- + # Scenario C: both probes nulled out + # --------------------------------------------------------------------------- + - it: C1. setting both probes to null omits both fields from the rendered Deployment + template: deployment.yaml + values: + - ./fixtures/null_both_probes.yaml + asserts: + - notExists: + path: spec.template.spec.containers[0].livenessProbe + - notExists: + path: spec.template.spec.containers[0].readinessProbe + # Container, image, and ports must still be intact + - exists: + path: spec.template.spec.containers[0] + - equal: + path: spec.template.spec.containers[0].name + value: livekit-server + - matchRegex: + path: spec.template.spec.containers[0].image + pattern: ^livekit/livekit-server:.+$ + - equal: + path: spec.template.spec.containers[0].ports[0].name + value: http + + # --------------------------------------------------------------------------- + # Scenario D: mixed null - one probe disabled, the other at defaults + # --------------------------------------------------------------------------- + - it: D1. liveness null + readiness default - readiness present, liveness omitted + template: deployment.yaml + values: + - ./fixtures/null_liveness_only.yaml + asserts: + - notExists: + path: spec.template.spec.containers[0].livenessProbe + - exists: + path: spec.template.spec.containers[0].readinessProbe + - equal: + path: spec.template.spec.containers[0].readinessProbe.httpGet.path + value: / + - equal: + path: spec.template.spec.containers[0].readinessProbe.httpGet.port + value: http + + - it: D2. readiness null + liveness default - liveness present, readiness omitted + template: deployment.yaml + values: + - ./fixtures/null_readiness_only.yaml + asserts: + - exists: + path: spec.template.spec.containers[0].livenessProbe + - notExists: + path: spec.template.spec.containers[0].readinessProbe + - equal: + path: spec.template.spec.containers[0].livenessProbe.httpGet.path + value: / + - equal: + path: spec.template.spec.containers[0].livenessProbe.httpGet.port + value: http + + # --------------------------------------------------------------------------- + # Scenario E: custom livekit.port - probes follow named port, not numeric + # --------------------------------------------------------------------------- + - it: E1. probes target named port http even when livekit.port is non-default + template: deployment.yaml + set: + livekit.port: 19880 + asserts: + # The HTTP container port number changes + - equal: + path: spec.template.spec.containers[0].ports[0].containerPort + value: 19880 + - equal: + path: spec.template.spec.containers[0].ports[0].name + value: http + # Probes still reference the name "http", not the number 19880. + # This documents the contract: operators changing livekit.port do NOT + # need to also update probe ports. + - equal: + path: spec.template.spec.containers[0].livenessProbe.httpGet.port + value: http + - equal: + path: spec.template.spec.containers[0].readinessProbe.httpGet.port + value: http + - notEqual: + path: spec.template.spec.containers[0].livenessProbe.httpGet.port + value: 19880 diff --git a/livekit-server/tests/upgrade_compatibility_test.yaml b/livekit-server/tests/upgrade_compatibility_test.yaml new file mode 100644 index 0000000..cee5cc6 --- /dev/null +++ b/livekit-server/tests/upgrade_compatibility_test.yaml @@ -0,0 +1,175 @@ +suite: upgrade compatibility - default-values render before vs after +# +# Purpose +# ------- +# Lock in what an existing operator's Deployment looks like after upgrading to +# this chart commit with no values overrides. The motivating concern for #116 +# was that any change to probe defaults could subtly break in-flight sessions +# on existing deployments. This suite documents the exact rendered delta. +# +# How to reproduce the underlying before/after diff manually +# ---------------------------------------------------------- +# git checkout HEAD~1 -- livekit-server/templates/deployment.yaml \ +# livekit-server/values.yaml +# helm template lk-test livekit-server/ > /tmp/before.yaml +# git checkout HEAD -- livekit-server/templates/deployment.yaml \ +# livekit-server/values.yaml +# helm template lk-test livekit-server/ > /tmp/after.yaml +# diff -u /tmp/before.yaml /tmp/after.yaml +# +# Diff captured at commit time +# ---------------------------- +# Only the probe blocks change. No other rendered field shifts. +# +# @@ ports/livekit-server container @@ +# livenessProbe: +# + failureThreshold: 3 +# httpGet: +# path: / +# port: http +# + initialDelaySeconds: 10 +# + periodSeconds: 10 +# + timeoutSeconds: 1 +# readinessProbe: +# + failureThreshold: 3 +# httpGet: +# path: / +# port: http +# + initialDelaySeconds: 0 +# + periodSeconds: 10 +# + timeoutSeconds: 1 +# +# Behavioral interpretation +# ------------------------- +# - failureThreshold: 3 was already the Kubernetes implicit default. Making it +# explicit changes the rendered YAML but not the runtime behavior. +# - periodSeconds: 10 was already the Kubernetes implicit default. Same as above. +# - timeoutSeconds: 1 was already the Kubernetes implicit default. Same. +# - readinessProbe.initialDelaySeconds: 0 was already the Kubernetes implicit +# default. Same. +# - livenessProbe.initialDelaySeconds: 0 -> 10 is the ONLY semantic change. +# This is strictly more lenient: the kubelet now waits 10s after container +# start before its first liveness check, instead of starting immediately. +# It cannot cause an existing healthy pod to be killed earlier than before. +# +# Net: zero risk of pod-killing regressions on upgrade with default values. +# A one-time Deployment rollout will occur because the rendered probe spec +# differs textually, but the new rollout's pods will pass probes at least as +# reliably as the old rollout's pods did. + +release: + name: lk-test + namespace: livekit +tests: + - it: post-change Deployment contains the exact set of explicit probe fields documented in the header diff + template: deployment.yaml + asserts: + # livenessProbe explicit fields + - equal: + path: spec.template.spec.containers[0].livenessProbe.failureThreshold + value: 3 + - equal: + path: spec.template.spec.containers[0].livenessProbe.httpGet.path + value: / + - equal: + path: spec.template.spec.containers[0].livenessProbe.httpGet.port + value: http + - equal: + path: spec.template.spec.containers[0].livenessProbe.initialDelaySeconds + value: 10 + - equal: + path: spec.template.spec.containers[0].livenessProbe.periodSeconds + value: 10 + - equal: + path: spec.template.spec.containers[0].livenessProbe.timeoutSeconds + value: 1 + # readinessProbe explicit fields + - equal: + path: spec.template.spec.containers[0].readinessProbe.failureThreshold + value: 3 + - equal: + path: spec.template.spec.containers[0].readinessProbe.httpGet.path + value: / + - equal: + path: spec.template.spec.containers[0].readinessProbe.httpGet.port + value: http + - equal: + path: spec.template.spec.containers[0].readinessProbe.initialDelaySeconds + value: 0 + - equal: + path: spec.template.spec.containers[0].readinessProbe.periodSeconds + value: 10 + - equal: + path: spec.template.spec.containers[0].readinessProbe.timeoutSeconds + value: 1 + + - it: every rendered probe field matches the Kubernetes implicit default except livenessProbe.initialDelaySeconds (the one documented behavior change) + template: deployment.yaml + asserts: + # K8s implicit defaults for HTTPGetAction-based probes: + # failureThreshold: 3 + # periodSeconds: 10 + # timeoutSeconds: 1 + # initialDelaySeconds: 0 + # All values below match the K8s implicit default; the rendered YAML + # now shows them explicitly but Kubernetes would have applied the same + # numbers before this change. + - equal: + path: spec.template.spec.containers[0].livenessProbe.failureThreshold + value: 3 + - equal: + path: spec.template.spec.containers[0].livenessProbe.periodSeconds + value: 10 + - equal: + path: spec.template.spec.containers[0].livenessProbe.timeoutSeconds + value: 1 + - equal: + path: spec.template.spec.containers[0].readinessProbe.failureThreshold + value: 3 + - equal: + path: spec.template.spec.containers[0].readinessProbe.periodSeconds + value: 10 + - equal: + path: spec.template.spec.containers[0].readinessProbe.timeoutSeconds + value: 1 + - equal: + path: spec.template.spec.containers[0].readinessProbe.initialDelaySeconds + value: 0 + # The one and only documented semantic delta: + - equal: + path: spec.template.spec.containers[0].livenessProbe.initialDelaySeconds + value: 10 + - notEqual: + path: spec.template.spec.containers[0].livenessProbe.initialDelaySeconds + value: 0 + + - it: non-probe fields rendered with default values are unchanged from prior chart behavior + template: deployment.yaml + asserts: + # Container identity, image, and ports must be untouched. If any of + # these flip, the regression is outside probe scope and this PR is too + # broad. + - equal: + path: spec.template.spec.containers[0].name + value: livekit-server + - matchRegex: + path: spec.template.spec.containers[0].image + pattern: ^livekit/livekit-server:.+$ + - equal: + path: spec.template.spec.containers[0].imagePullPolicy + value: IfNotPresent + - equal: + path: spec.template.spec.containers[0].ports[0].name + value: http + - equal: + path: spec.template.spec.containers[0].ports[0].containerPort + value: 7880 + - equal: + path: spec.template.spec.terminationGracePeriodSeconds + value: 18000 + - equal: + path: spec.template.spec.hostNetwork + value: true + - equal: + path: spec.template.spec.dnsPolicy + value: ClusterFirstWithHostNet diff --git a/livekit-server/values.yaml b/livekit-server/values.yaml index bcf6d51..b9d9005 100644 --- a/livekit-server/values.yaml +++ b/livekit-server/values.yaml @@ -132,6 +132,60 @@ deploymentStrategy: {} deploymentAnnotations: {} +# Liveness and readiness probes for the livekit-server container. +# +# These are intentionally exposed as separate, fully structured blocks so that +# operators can tune them independently. In particular, livenessProbe and +# readinessProbe should NOT share the same definition: a liveness failure +# restarts the pod, while a readiness failure only removes it from Service +# endpoints. Conflating the two means any condition that should drain traffic +# also kills the pod and any in-flight sessions. +# +# A common reason to override readinessProbe (without touching livenessProbe) +# is graceful shutdown / connection draining. During the +# terminationGracePeriodSeconds window, livekit-server continues serving +# existing WebRTC sessions but should stop receiving new ones. If your build +# of livekit-server (or a sidecar) exposes a separate "drain" or "ready" +# endpoint that flips to non-200 once SIGTERM is received, point the +# readinessProbe at that endpoint here so Kubernetes removes the pod from +# Service endpoints immediately while the liveness probe keeps the pod alive +# for the rest of the grace period. +# +# Defaults below probe the HTTP healthcheck on the livekit port (named port +# "http", which resolves to livekit.port, 7880 by default). Using the named +# port means probes automatically track any livekit.port override; if you +# replace the httpGet block entirely (e.g. with tcpSocket or exec) you take +# over responsibility for the port reference. +# +# Setting livenessProbe or readinessProbe to null (e.g. "livenessProbe: ~") +# omits that probe from the rendered Deployment entirely. +# +# Test coverage for these values lives in livekit-server/tests/. The +# probe_configuration_test.yaml suite exercises defaults, partial overrides, +# null/omit behavior, and the named-port contract; if you change either probe +# here, run "helm unittest livekit-server/" to confirm your override matches +# expectations before deploying. +livenessProbe: + httpGet: + path: / + port: http + initialDelaySeconds: 10 + periodSeconds: 10 + timeoutSeconds: 1 + failureThreshold: 3 + +readinessProbe: + httpGet: + path: / + port: http + initialDelaySeconds: 0 + periodSeconds: 10 + timeoutSeconds: 1 + # Matches the Kubernetes default. Lower this (e.g. to 1) when you point the + # readinessProbe at a drain-aware endpoint and want pods removed from the + # Service immediately on the first non-200 response. + failureThreshold: 3 + # Specifies configuration gke backendconfig gcp: backendConfig: diff --git a/server-sample.yaml b/server-sample.yaml index 39ed684..c60a274 100644 --- a/server-sample.yaml +++ b/server-sample.yaml @@ -9,6 +9,20 @@ replicaCount: 1 # Suggested value for gracefully terminate the pod: 5 hours terminationGracePeriodSeconds: 18000 +# Liveness and readiness probes are configured separately so that the +# readinessProbe can be pointed at a drain-aware endpoint without affecting +# the livenessProbe (which would otherwise kill draining pods). +# The defaults in the chart probe / on the http port for both. Override here +# only if you need different behavior, e.g.: +# +# readinessProbe: +# httpGet: +# path: /rtc/validate # or whatever drain-aware endpoint your build exposes +# port: http +# initialDelaySeconds: 0 +# periodSeconds: 5 +# failureThreshold: 1 + livekit: # port: 7880 # Uncomment to enable prometheus metrics