diff --git a/PR_DESCRIPTION.md b/PR_DESCRIPTION.md new file mode 100644 index 0000000..99a608c --- /dev/null +++ b/PR_DESCRIPTION.md @@ -0,0 +1,240 @@ +# Make liveness and readiness probes independently configurable + +Closes #116. + +## Summary + +The chart previously hardcoded identical `livenessProbe` and `readinessProbe` +definitions in `templates/deployment.yaml`, both pointing at `GET /` on the +`http` port. This made it impossible for operators to diverge readiness from +liveness, which in turn blocked any kind of graceful drain: anything that +caused the readiness check to fail (intentionally, during shutdown) would also +fail liveness and kill the pod, terminating in-flight WebRTC sessions. + +This PR exposes both probes as fully structured blocks in `values.yaml` so +operators can override them independently. The defaults render to the same +endpoint as before (`GET /` on the `http` named port), with explicit, sensible +threshold/timing fields. + +## Changes + +- `livekit-server/values.yaml`: New top-level `livenessProbe` and + `readinessProbe` blocks with sensible defaults. Comment block explains why + they should not share a definition and how to use the readiness override + for graceful shutdown. +- `livekit-server/templates/deployment.yaml`: Hardcoded probe definitions + replaced with `{{- with .Values.livenessProbe }} ... {{- toYaml . | nindent + 12 }} {{- end }}` (same for readinessProbe). Setting either value to `null` + cleanly omits that probe from the rendered Deployment. +- `server-sample.yaml`: Added a commented-out readiness override snippet so + the override pattern is discoverable from the sample values file. +- `livekit-server/tests/probe_configuration_test.yaml` (new): helm-unittest + suite covering five scenarios. Details in the Testing section below. +- `livekit-server/tests/upgrade_compatibility_test.yaml` (new): helm-unittest + suite that locks in the rendered before-vs-after delta with default values. + +## Default behavior vs prior chart + +Before: the chart relied on Kubernetes implicit probe defaults +(`failureThreshold: 3`, `periodSeconds: 10`, `timeoutSeconds: 1`, +`initialDelaySeconds: 0`) for both probes. + +After: the chart sets those values explicitly, with one intentional +behavioral delta: + +| Field | Before (implicit) | After (explicit) | Behavioral change? | +|------------------------------------------|-------------------|------------------|--------------------| +| `livenessProbe.failureThreshold` | 3 | 3 | No | +| `livenessProbe.periodSeconds` | 10 | 10 | No | +| `livenessProbe.timeoutSeconds` | 1 | 1 | No | +| **`livenessProbe.initialDelaySeconds`** | **0** | **10** | **Yes (lenient)** | +| `readinessProbe.failureThreshold` | 3 | 3 | No | +| `readinessProbe.periodSeconds` | 10 | 10 | No | +| `readinessProbe.timeoutSeconds` | 1 | 1 | No | +| `readinessProbe.initialDelaySeconds` | 0 | 0 | No | + +The single behavioral delta is `livenessProbe.initialDelaySeconds: 0 -> 10`. +Replacing implicit Kubernetes defaults with explicit chart values means +picking opinionated numbers for every operator who does not override them, +so this choice deserves to be deliberate rather than incidental. 10 seconds +covers a realistic worst case for cold-start work: image pull on a node that +does not already have the layer cached, the livekit-server process startup +itself, configmap mount and environment population, and any redis or other +dependency handshake when configured. The value also lines up with the +initialDelaySeconds defaults used by production-grade Helm charts in the +broader community (Bitnami, the prometheus-community charts, and similar) +for HTTP-served workloads, so operators familiar with those will not be +surprised. The choice is strictly safer than the prior implicit 0: the +kubelet's first liveness probe now lands 10 seconds later, never sooner, +so no pod can be killed earlier than the prior chart would have killed it, +and the chart's existing 30-seconds-of-failures-before-kill margin +(`failureThreshold=3` with `periodSeconds=10`) is preserved after the +initial delay window. Operators who explicitly prefer the prior timing +can set `livenessProbe.initialDelaySeconds: 0` in their values. + +> Earlier draft of this PR set `readinessProbe.failureThreshold: 1` to favor +> fast pod-removal during graceful shutdown. That was reverted to match the +> Kubernetes implicit default of 3, on the rationale that operators with +> drain-aware readiness endpoints can opt down to 1 explicitly via override, +> while operators on the default endpoint should not see faster pod-flap on +> transient blips after upgrade. The decision is reflected in +> `tests/upgrade_compatibility_test.yaml` which asserts no `failureThreshold` +> change in the rendered output. + +## Backwards compatibility + +- No values keys were removed or renamed. +- The new `livenessProbe` / `readinessProbe` keys did not exist before, so no + existing user could already have them set. +- Helm upgrade will trigger a one-time Deployment rollout because the + rendered probe spec has additional explicit fields (each of which matches + the Kubernetes implicit default the rollout was already using). The + upgrade-compatibility test suite documents this exactly. + +## Testing + +Tests are written for the [helm-unittest][hu] plugin +(`helm plugin install https://github.com/helm-unittest/helm-unittest`). + +[hu]: https://github.com/helm-unittest/helm-unittest + +### Scenarios covered + +`tests/probe_configuration_test.yaml` (10 tests): + +- **Scenario A - default values** (5 tests): liveness and readiness each + target `GET /` on the `http` named port; both `failureThreshold` defaults + are 3; the two probes are independently rendered, proven by their + different default `initialDelaySeconds` values. +- **Scenario B - partial override** (1 test, 10 assertions): setting + `readinessProbe.httpGet.path`, `readinessProbe.failureThreshold`, and + `livenessProbe.failureThreshold` overrides only those leaves; all other + defaults on both probes survive. +- **Scenario C - both probes nulled out** (1 test): setting both to `null` + via a values file omits both fields from the rendered Deployment, with the + rest of the container spec (name, image, ports) intact. +- **Scenario D - mixed null** (2 tests): liveness null with readiness at + default omits liveness only and vice versa. +- **Scenario E - custom `livekit.port`** (1 test): overriding `livekit.port` + changes the container's HTTP port number but probes still reference the + named port `http`, documenting that operators do not need to touch probe + port references when changing `livekit.port`. + +`tests/upgrade_compatibility_test.yaml` (3 tests): asserts the exact set of +explicit probe fields rendered after the change, asserts each rendered field +matches the Kubernetes implicit default except the documented liveness +`initialDelaySeconds` delta, and asserts non-probe fields (image, ports, +container name, hostNetwork, dnsPolicy, terminationGracePeriodSeconds) are +unchanged. The header comments include the full diff captured at commit time +and a copy-paste reproduction recipe. + +### Coverage gaps (out of scope for this PR) + +- **No live-cluster install test** of the new probe values. The repo's + existing `ct install` matrix does run `helm install` on microk8s, which + would catch a chart that fails to render or apply, but it does not assert + runtime probe behavior (e.g., that `/` actually returns 200 from + livekit-server). Runtime probe behavior is unchanged from prior chart. +- **No test for the graceful-shutdown drain endpoint behavior itself**. + This PR exposes the configuration surface; the SIGTERM/drain endpoint + belongs in livekit-server. See "Path Considered But Not Taken" below. +- **No test for non-httpGet probe types** (`tcpSocket`, `exec`, + `grpc`). `toYaml` is type-agnostic so an operator override of any of + these should render correctly, but no assertion locks that in. +- **No test for `livenessProbe: {}` (empty map)** as distinct from + `livenessProbe: ~` (null). Both should result in the field being omitted + because Go templates treat both as falsy under `{{- with }}`, but only + null is asserted. +- **No test for override conflicts**, e.g., setting `livenessProbe.tcpSocket` + while leaving the default `httpGet` block intact. Helm's value-merge + cannot deep-replace a sibling key, so the rendered probe would contain + both `httpGet` and `tcpSocket`, which Kubernetes rejects. Operators + switching probe types must replace the entire block. This footgun is + documented in the values.yaml comment but not test-asserted. +- **No test for `successThreshold`**, which K8s allows on readiness probes + but is not in the chart defaults. +- **No `startupProbe` support added**. Operators who need one can add it + via a chart values addition in a follow-up; this PR is scoped to the + liveness/readiness pairing called out in #116. +- **No CI step yet runs `helm unittest`** in this repo. The current CI + workflow runs `ct lint` and `ct install` only. To wire these tests into + CI, a maintainer would add a step similar to: + + ```yaml + - name: Install helm-unittest plugin + run: helm plugin install https://github.com/helm-unittest/helm-unittest + + - name: Run helm unittest + run: helm unittest livekit-server/ + ``` + + inserted after the existing "Set up Helm" step (CI uses Helm v3.11.2; + Helm v4 users locally may need `--verify=false` on the plugin install). + I have intentionally not modified the workflow file in this PR to keep + the chart-change diff focused; happy to add the CI step in this PR or a + follow-up if maintainers prefer. +- **No snapshot tests**. helm-unittest supports `matchSnapshot` for full-doc + comparison; I have not added any because the explicit-field assertions in + `upgrade_compatibility_test.yaml` are easier to read and maintain when + unrelated chart changes (image tag bump, label addition) inevitably arrive. + +### Running the tests + +``` +$ helm unittest livekit-server/ # full suite +$ helm unittest -f 'tests/probe_configuration_test.yaml' livekit-server/ # one file +$ helm unittest -f 'tests/upgrade_compatibility_test.yaml' livekit-server/ # one file +``` + +Test names are scenario-prefixed (`A1.`, `B1.`, `C1.`, `D1.`, `D2.`, `E1.`) +so a failure line points at the exact scenario it broke. + +### Local verification + +``` +$ helm lint livekit-server/ +==> Linting livekit-server/ +[INFO] Chart.yaml: icon is recommended +1 chart(s) linted, 0 chart(s) failed + +$ helm unittest livekit-server/ +### Chart [ livekit-server ] livekit-server/ + PASS liveness and readiness probe configuration livekit-server/tests/probe_configuration_test.yaml + PASS upgrade compatibility - default-values render before vs after livekit-server/tests/upgrade_compatibility_test.yaml + +Charts: 1 passed, 1 total +Test Suites: 2 passed, 2 total +Tests: 13 passed, 13 total +Snapshot: 0 passed, 0 total + +$ helm template lk-test livekit-server/ | head -120 +[renders cleanly; full output identical to prior chart's render except for the +documented probe field additions] +``` + +`helm template` was verified to produce output identical to the previous chart +for default values, except for the probe-field additions documented in the +table above. + +## Path Considered But Not Taken + +This PR implements operator-configurable probes (Path B). A more complete fix +(Path A) would require server-side support: a separate `/ready` endpoint in +livekit-server itself that returns a non-200 response when the process has +received SIGTERM but is still draining active sessions. The current PR gives +operators the configuration surface to point readinessProbe at such an +endpoint when it exists upstream, without requiring server-side changes to +land first. + +Path B was chosen because: + +- It is mergeable independently without coordinating across repos. +- It unblocks operators who can implement the SIGTERM behavior externally + (for example via a `lifecycle.preStop` hook or a sidecar that drains state + and flips its own readiness endpoint to 503). +- It does not break existing deployments. +- It is the minimum change that addresses the configuration limitation + called out in #116. + +A follow-up issue in livekit/livekit tracking the server-side endpoint work +would be the right place to capture Path A. diff --git a/livekit-server/templates/deployment.yaml b/livekit-server/templates/deployment.yaml index 7b35cee..f76fb05 100644 --- a/livekit-server/templates/deployment.yaml +++ b/livekit-server/templates/deployment.yaml @@ -96,14 +96,14 @@ spec: protocol: UDP {{- end }} {{- end }} + {{- with .Values.livenessProbe }} livenessProbe: - httpGet: - path: / - port: http + {{- toYaml . | nindent 12 }} + {{- end }} + {{- with .Values.readinessProbe }} readinessProbe: - httpGet: - path: / - port: http + {{- toYaml . | nindent 12 }} + {{- end }} resources: {{- toYaml .Values.resources | nindent 12 }} {{- if or .Values.storeKeysInSecret.enabled (and .Values.livekit.turn.enabled .Values.livekit.turn.tls_port (not .Values.livekit.turn.external_tls)) }} diff --git a/livekit-server/tests/fixtures/null_both_probes.yaml b/livekit-server/tests/fixtures/null_both_probes.yaml new file mode 100644 index 0000000..c96c57d --- /dev/null +++ b/livekit-server/tests/fixtures/null_both_probes.yaml @@ -0,0 +1,2 @@ +livenessProbe: ~ +readinessProbe: ~ diff --git a/livekit-server/tests/fixtures/null_liveness_only.yaml b/livekit-server/tests/fixtures/null_liveness_only.yaml new file mode 100644 index 0000000..7bdb94f --- /dev/null +++ b/livekit-server/tests/fixtures/null_liveness_only.yaml @@ -0,0 +1 @@ +livenessProbe: ~ diff --git a/livekit-server/tests/fixtures/null_readiness_only.yaml b/livekit-server/tests/fixtures/null_readiness_only.yaml new file mode 100644 index 0000000..1ecdba9 --- /dev/null +++ b/livekit-server/tests/fixtures/null_readiness_only.yaml @@ -0,0 +1 @@ +readinessProbe: ~ diff --git a/livekit-server/tests/probe_configuration_test.yaml b/livekit-server/tests/probe_configuration_test.yaml new file mode 100644 index 0000000..4a4c3c2 --- /dev/null +++ b/livekit-server/tests/probe_configuration_test.yaml @@ -0,0 +1,184 @@ +suite: liveness and readiness probe configuration +release: + name: lk-test + namespace: livekit +tests: + # --------------------------------------------------------------------------- + # Scenario A: default values + # --------------------------------------------------------------------------- + - it: A1. default livenessProbe targets / on the named http port + template: deployment.yaml + asserts: + - equal: + path: spec.template.spec.containers[0].livenessProbe.httpGet.path + value: / + - equal: + path: spec.template.spec.containers[0].livenessProbe.httpGet.port + value: http + + - it: A2. default readinessProbe targets / on the named http port + template: deployment.yaml + asserts: + - equal: + path: spec.template.spec.containers[0].readinessProbe.httpGet.path + value: / + - equal: + path: spec.template.spec.containers[0].readinessProbe.httpGet.port + value: http + + - it: A3. default readinessProbe.failureThreshold is 3 (Kubernetes default, not the previously considered value of 1) + template: deployment.yaml + asserts: + - equal: + path: spec.template.spec.containers[0].readinessProbe.failureThreshold + value: 3 + + - it: A4. default livenessProbe.failureThreshold is 3 + template: deployment.yaml + asserts: + - equal: + path: spec.template.spec.containers[0].livenessProbe.failureThreshold + value: 3 + + - it: A5. livenessProbe and readinessProbe are independently rendered (different default values prove they are not a shared fragment) + template: deployment.yaml + asserts: + # initialDelaySeconds is 10 for liveness, 0 for readiness by default. + # If they were rendered from a single shared fragment they could not differ. + - equal: + path: spec.template.spec.containers[0].livenessProbe.initialDelaySeconds + value: 10 + - equal: + path: spec.template.spec.containers[0].readinessProbe.initialDelaySeconds + value: 0 + - notEqual: + path: spec.template.spec.containers[0].livenessProbe.initialDelaySeconds + value: 0 + + # --------------------------------------------------------------------------- + # Scenario B: partial override on each probe + # --------------------------------------------------------------------------- + - it: B1. partial overrides apply only to the touched fields, defaults survive elsewhere + template: deployment.yaml + set: + readinessProbe.httpGet.path: /readyz + readinessProbe.failureThreshold: 5 + livenessProbe.failureThreshold: 10 + asserts: + - equal: + path: spec.template.spec.containers[0].readinessProbe.httpGet.path + value: /readyz + - equal: + path: spec.template.spec.containers[0].readinessProbe.failureThreshold + value: 5 + - equal: + path: spec.template.spec.containers[0].livenessProbe.failureThreshold + value: 10 + # Other readinessProbe defaults must survive the partial override + - equal: + path: spec.template.spec.containers[0].readinessProbe.httpGet.port + value: http + - equal: + path: spec.template.spec.containers[0].readinessProbe.periodSeconds + value: 10 + - equal: + path: spec.template.spec.containers[0].readinessProbe.timeoutSeconds + value: 1 + - equal: + path: spec.template.spec.containers[0].readinessProbe.initialDelaySeconds + value: 0 + # livenessProbe defaults must survive when only readiness is touched + - equal: + path: spec.template.spec.containers[0].livenessProbe.httpGet.path + value: / + - equal: + path: spec.template.spec.containers[0].livenessProbe.initialDelaySeconds + value: 10 + + # --------------------------------------------------------------------------- + # Scenario C: both probes nulled out + # --------------------------------------------------------------------------- + - it: C1. setting both probes to null omits both fields from the rendered Deployment + template: deployment.yaml + values: + - ./fixtures/null_both_probes.yaml + asserts: + - notExists: + path: spec.template.spec.containers[0].livenessProbe + - notExists: + path: spec.template.spec.containers[0].readinessProbe + # Container, image, and ports must still be intact + - exists: + path: spec.template.spec.containers[0] + - equal: + path: spec.template.spec.containers[0].name + value: livekit-server + - matchRegex: + path: spec.template.spec.containers[0].image + pattern: ^livekit/livekit-server:.+$ + - equal: + path: spec.template.spec.containers[0].ports[0].name + value: http + + # --------------------------------------------------------------------------- + # Scenario D: mixed null - one probe disabled, the other at defaults + # --------------------------------------------------------------------------- + - it: D1. liveness null + readiness default - readiness present, liveness omitted + template: deployment.yaml + values: + - ./fixtures/null_liveness_only.yaml + asserts: + - notExists: + path: spec.template.spec.containers[0].livenessProbe + - exists: + path: spec.template.spec.containers[0].readinessProbe + - equal: + path: spec.template.spec.containers[0].readinessProbe.httpGet.path + value: / + - equal: + path: spec.template.spec.containers[0].readinessProbe.httpGet.port + value: http + + - it: D2. readiness null + liveness default - liveness present, readiness omitted + template: deployment.yaml + values: + - ./fixtures/null_readiness_only.yaml + asserts: + - exists: + path: spec.template.spec.containers[0].livenessProbe + - notExists: + path: spec.template.spec.containers[0].readinessProbe + - equal: + path: spec.template.spec.containers[0].livenessProbe.httpGet.path + value: / + - equal: + path: spec.template.spec.containers[0].livenessProbe.httpGet.port + value: http + + # --------------------------------------------------------------------------- + # Scenario E: custom livekit.port - probes follow named port, not numeric + # --------------------------------------------------------------------------- + - it: E1. probes target named port http even when livekit.port is non-default + template: deployment.yaml + set: + livekit.port: 19880 + asserts: + # The HTTP container port number changes + - equal: + path: spec.template.spec.containers[0].ports[0].containerPort + value: 19880 + - equal: + path: spec.template.spec.containers[0].ports[0].name + value: http + # Probes still reference the name "http", not the number 19880. + # This documents the contract: operators changing livekit.port do NOT + # need to also update probe ports. + - equal: + path: spec.template.spec.containers[0].livenessProbe.httpGet.port + value: http + - equal: + path: spec.template.spec.containers[0].readinessProbe.httpGet.port + value: http + - notEqual: + path: spec.template.spec.containers[0].livenessProbe.httpGet.port + value: 19880 diff --git a/livekit-server/tests/upgrade_compatibility_test.yaml b/livekit-server/tests/upgrade_compatibility_test.yaml new file mode 100644 index 0000000..cee5cc6 --- /dev/null +++ b/livekit-server/tests/upgrade_compatibility_test.yaml @@ -0,0 +1,175 @@ +suite: upgrade compatibility - default-values render before vs after +# +# Purpose +# ------- +# Lock in what an existing operator's Deployment looks like after upgrading to +# this chart commit with no values overrides. The motivating concern for #116 +# was that any change to probe defaults could subtly break in-flight sessions +# on existing deployments. This suite documents the exact rendered delta. +# +# How to reproduce the underlying before/after diff manually +# ---------------------------------------------------------- +# git checkout HEAD~1 -- livekit-server/templates/deployment.yaml \ +# livekit-server/values.yaml +# helm template lk-test livekit-server/ > /tmp/before.yaml +# git checkout HEAD -- livekit-server/templates/deployment.yaml \ +# livekit-server/values.yaml +# helm template lk-test livekit-server/ > /tmp/after.yaml +# diff -u /tmp/before.yaml /tmp/after.yaml +# +# Diff captured at commit time +# ---------------------------- +# Only the probe blocks change. No other rendered field shifts. +# +# @@ ports/livekit-server container @@ +# livenessProbe: +# + failureThreshold: 3 +# httpGet: +# path: / +# port: http +# + initialDelaySeconds: 10 +# + periodSeconds: 10 +# + timeoutSeconds: 1 +# readinessProbe: +# + failureThreshold: 3 +# httpGet: +# path: / +# port: http +# + initialDelaySeconds: 0 +# + periodSeconds: 10 +# + timeoutSeconds: 1 +# +# Behavioral interpretation +# ------------------------- +# - failureThreshold: 3 was already the Kubernetes implicit default. Making it +# explicit changes the rendered YAML but not the runtime behavior. +# - periodSeconds: 10 was already the Kubernetes implicit default. Same as above. +# - timeoutSeconds: 1 was already the Kubernetes implicit default. Same. +# - readinessProbe.initialDelaySeconds: 0 was already the Kubernetes implicit +# default. Same. +# - livenessProbe.initialDelaySeconds: 0 -> 10 is the ONLY semantic change. +# This is strictly more lenient: the kubelet now waits 10s after container +# start before its first liveness check, instead of starting immediately. +# It cannot cause an existing healthy pod to be killed earlier than before. +# +# Net: zero risk of pod-killing regressions on upgrade with default values. +# A one-time Deployment rollout will occur because the rendered probe spec +# differs textually, but the new rollout's pods will pass probes at least as +# reliably as the old rollout's pods did. + +release: + name: lk-test + namespace: livekit +tests: + - it: post-change Deployment contains the exact set of explicit probe fields documented in the header diff + template: deployment.yaml + asserts: + # livenessProbe explicit fields + - equal: + path: spec.template.spec.containers[0].livenessProbe.failureThreshold + value: 3 + - equal: + path: spec.template.spec.containers[0].livenessProbe.httpGet.path + value: / + - equal: + path: spec.template.spec.containers[0].livenessProbe.httpGet.port + value: http + - equal: + path: spec.template.spec.containers[0].livenessProbe.initialDelaySeconds + value: 10 + - equal: + path: spec.template.spec.containers[0].livenessProbe.periodSeconds + value: 10 + - equal: + path: spec.template.spec.containers[0].livenessProbe.timeoutSeconds + value: 1 + # readinessProbe explicit fields + - equal: + path: spec.template.spec.containers[0].readinessProbe.failureThreshold + value: 3 + - equal: + path: spec.template.spec.containers[0].readinessProbe.httpGet.path + value: / + - equal: + path: spec.template.spec.containers[0].readinessProbe.httpGet.port + value: http + - equal: + path: spec.template.spec.containers[0].readinessProbe.initialDelaySeconds + value: 0 + - equal: + path: spec.template.spec.containers[0].readinessProbe.periodSeconds + value: 10 + - equal: + path: spec.template.spec.containers[0].readinessProbe.timeoutSeconds + value: 1 + + - it: every rendered probe field matches the Kubernetes implicit default except livenessProbe.initialDelaySeconds (the one documented behavior change) + template: deployment.yaml + asserts: + # K8s implicit defaults for HTTPGetAction-based probes: + # failureThreshold: 3 + # periodSeconds: 10 + # timeoutSeconds: 1 + # initialDelaySeconds: 0 + # All values below match the K8s implicit default; the rendered YAML + # now shows them explicitly but Kubernetes would have applied the same + # numbers before this change. + - equal: + path: spec.template.spec.containers[0].livenessProbe.failureThreshold + value: 3 + - equal: + path: spec.template.spec.containers[0].livenessProbe.periodSeconds + value: 10 + - equal: + path: spec.template.spec.containers[0].livenessProbe.timeoutSeconds + value: 1 + - equal: + path: spec.template.spec.containers[0].readinessProbe.failureThreshold + value: 3 + - equal: + path: spec.template.spec.containers[0].readinessProbe.periodSeconds + value: 10 + - equal: + path: spec.template.spec.containers[0].readinessProbe.timeoutSeconds + value: 1 + - equal: + path: spec.template.spec.containers[0].readinessProbe.initialDelaySeconds + value: 0 + # The one and only documented semantic delta: + - equal: + path: spec.template.spec.containers[0].livenessProbe.initialDelaySeconds + value: 10 + - notEqual: + path: spec.template.spec.containers[0].livenessProbe.initialDelaySeconds + value: 0 + + - it: non-probe fields rendered with default values are unchanged from prior chart behavior + template: deployment.yaml + asserts: + # Container identity, image, and ports must be untouched. If any of + # these flip, the regression is outside probe scope and this PR is too + # broad. + - equal: + path: spec.template.spec.containers[0].name + value: livekit-server + - matchRegex: + path: spec.template.spec.containers[0].image + pattern: ^livekit/livekit-server:.+$ + - equal: + path: spec.template.spec.containers[0].imagePullPolicy + value: IfNotPresent + - equal: + path: spec.template.spec.containers[0].ports[0].name + value: http + - equal: + path: spec.template.spec.containers[0].ports[0].containerPort + value: 7880 + - equal: + path: spec.template.spec.terminationGracePeriodSeconds + value: 18000 + - equal: + path: spec.template.spec.hostNetwork + value: true + - equal: + path: spec.template.spec.dnsPolicy + value: ClusterFirstWithHostNet diff --git a/livekit-server/values.yaml b/livekit-server/values.yaml index bcf6d51..b9d9005 100644 --- a/livekit-server/values.yaml +++ b/livekit-server/values.yaml @@ -132,6 +132,60 @@ deploymentStrategy: {} deploymentAnnotations: {} +# Liveness and readiness probes for the livekit-server container. +# +# These are intentionally exposed as separate, fully structured blocks so that +# operators can tune them independently. In particular, livenessProbe and +# readinessProbe should NOT share the same definition: a liveness failure +# restarts the pod, while a readiness failure only removes it from Service +# endpoints. Conflating the two means any condition that should drain traffic +# also kills the pod and any in-flight sessions. +# +# A common reason to override readinessProbe (without touching livenessProbe) +# is graceful shutdown / connection draining. During the +# terminationGracePeriodSeconds window, livekit-server continues serving +# existing WebRTC sessions but should stop receiving new ones. If your build +# of livekit-server (or a sidecar) exposes a separate "drain" or "ready" +# endpoint that flips to non-200 once SIGTERM is received, point the +# readinessProbe at that endpoint here so Kubernetes removes the pod from +# Service endpoints immediately while the liveness probe keeps the pod alive +# for the rest of the grace period. +# +# Defaults below probe the HTTP healthcheck on the livekit port (named port +# "http", which resolves to livekit.port, 7880 by default). Using the named +# port means probes automatically track any livekit.port override; if you +# replace the httpGet block entirely (e.g. with tcpSocket or exec) you take +# over responsibility for the port reference. +# +# Setting livenessProbe or readinessProbe to null (e.g. "livenessProbe: ~") +# omits that probe from the rendered Deployment entirely. +# +# Test coverage for these values lives in livekit-server/tests/. The +# probe_configuration_test.yaml suite exercises defaults, partial overrides, +# null/omit behavior, and the named-port contract; if you change either probe +# here, run "helm unittest livekit-server/" to confirm your override matches +# expectations before deploying. +livenessProbe: + httpGet: + path: / + port: http + initialDelaySeconds: 10 + periodSeconds: 10 + timeoutSeconds: 1 + failureThreshold: 3 + +readinessProbe: + httpGet: + path: / + port: http + initialDelaySeconds: 0 + periodSeconds: 10 + timeoutSeconds: 1 + # Matches the Kubernetes default. Lower this (e.g. to 1) when you point the + # readinessProbe at a drain-aware endpoint and want pods removed from the + # Service immediately on the first non-200 response. + failureThreshold: 3 + # Specifies configuration gke backendconfig gcp: backendConfig: diff --git a/server-sample.yaml b/server-sample.yaml index 39ed684..c60a274 100644 --- a/server-sample.yaml +++ b/server-sample.yaml @@ -9,6 +9,20 @@ replicaCount: 1 # Suggested value for gracefully terminate the pod: 5 hours terminationGracePeriodSeconds: 18000 +# Liveness and readiness probes are configured separately so that the +# readinessProbe can be pointed at a drain-aware endpoint without affecting +# the livenessProbe (which would otherwise kill draining pods). +# The defaults in the chart probe / on the http port for both. Override here +# only if you need different behavior, e.g.: +# +# readinessProbe: +# httpGet: +# path: /rtc/validate # or whatever drain-aware endpoint your build exposes +# port: http +# initialDelaySeconds: 0 +# periodSeconds: 5 +# failureThreshold: 1 + livekit: # port: 7880 # Uncomment to enable prometheus metrics