
event-exporter-v2: shared-nothing HA re-architecture #3

Open
erain wants to merge 203 commits into master from event-exporter-v2-shared-nothing-ha

Conversation

@erain (Owner) commented Mar 26, 2026

Summary

Re-architecture of the event-exporter as a new standalone module (event-exporter-v2/) for horizontal scalability and high availability. The original event-exporter/ is untouched.

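  • Shared-nothing hash partitioning: each pod watches all events but sends only the events it owns, where owner = xxhash(namespace/name) % N (plain modulo hashing rather than a consistent-hash ring, so most keys change owner when N changes)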
  • Peer discovery via headless Service Endpoints informer — no leader election, no gRPC, no external dependencies
  • Deterministic InsertId for Cloud Logging deduplication during scaling transitions
  • Tuned sink defaults — buffer 500 (was 100), concurrency 25 (was 10), flush 2s (was 5s)
  • Exponential backoff with jitter replacing fixed 10s retry in Cloud Logging writer
  • HPA-compatible Deployment with readiness probe, PDB, and custom metrics
  • Fully backward compatible — same binary runs in single-pod mode when --headless-service-name is empty

Design

Architecture

  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
  │   Pod 0     │  │   Pod 1     │  │   Pod 2     │
  │             │  │             │  │             │
  │ ┌─────────┐ │  │ ┌─────────┐ │  │ ┌─────────┐ │
  │ │ Watcher │ │  │ │ Watcher │ │  │ │ Watcher │ │
  │ │(all evts)│ │  │ │(all evts)│ │  │ │(all evts)│ │
  │ └────┬────┘ │  │ └────┬────┘ │  │ └────┬────┘ │
  │      │      │  │      │      │  │      │      │
  │ hash%3==0?  │  │ hash%3==1?  │  │ hash%3==2?  │
  │   ┌──┴──┐   │  │   ┌──┴──┐   │  │   ┌──┴──┐   │
  │   │Sink │   │  │   │Sink │   │  │   │Sink │   │
  │   │~1/3 │   │  │   │~1/3 │   │  │   │~1/3 │   │
  │   └──┬──┘   │  │   └──┬──┘   │  │   └──┬──┘   │
  └──────┼──────┘  └──────┼──────┘  └──────┼──────┘
         │                │                │
         └────────────────┼────────────────┘
                          ▼
                 GCP Cloud Logging

How the hash partitioning works

  1. Each pod discovers peers via a headless Service's Endpoints informer
  2. Peer IPs are sorted lexicographically — all pods produce the same ordering
  3. Each pod identifies itself by POD_IP (Downward API) and finds its index in the sorted list
  4. For each event: owner = xxhash(namespace/name) % len(peers) — if owner == myIndex, process it
  5. Grace period: if a pod isn't in Endpoints yet, it processes ALL events (prevents gaps)
  6. Deduplication: deterministic InsertId = namespace/name/resourceVersion handles brief overlaps during scaling
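
The steps above can be sketched in Go roughly as follows. This is a minimal illustration, not the code in peerdiscovery/discovery.go: the names ownerIndex and isOwner are hypothetical, and FNV-1a from the standard library stands in for xxhash so the example has no external dependency.

```go
package main

import (
	"hash/fnv"
	"sort"
)

// ownerIndex maps an event key to a slot in [0, n). n must be > 0.
// ASSUMPTION: FNV-1a is used here only as a stand-in for xxhash.
func ownerIndex(key string, n int) int {
	h := fnv.New64a()
	h.Write([]byte(key)) // hash.Hash.Write never returns an error
	return int(h.Sum64() % uint64(n))
}

// isOwner reports whether the pod with IP self should process the event
// identified by namespace/name, given the peer IPs seen in Endpoints.
func isOwner(peers []string, self, namespace, name string) bool {
	// Sort a copy so every pod derives the same ordering (step 2).
	sorted := append([]string(nil), peers...)
	sort.Strings(sorted)

	// Find our own index in the sorted list (step 3).
	idx := sort.SearchStrings(sorted, self)
	if idx == len(sorted) || sorted[idx] != self {
		// Grace period (step 5): not listed yet, so process everything.
		// This also covers an empty peer list.
		return true
	}

	// Ownership check (step 4).
	return ownerIndex(namespace+"/"+name, len(sorted)) == idx
}
```

For any event, exactly one listed peer gets true from isOwner, and a pod that is not yet listed gets true for every event. Because the owner is computed as hash % N, a change in N reassigns most keys, not only the share of the joining or departing pod, which is why the overlap handling in step 6 matters.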

Why shared-nothing over leader-worker

| Dimension | Shared-Nothing (chosen) | Leader-Worker |
| --- | --- | --- |
| Complexity | Low: no gRPC, no leader election | High: gRPC service, election, role transitions |
| SPOF | None: all pods are equal | Leader is a bottleneck |
| HPA | Natural: add more pods | Requires separate scaling for leader vs workers |
| API server impact | N watchers (negligible for N<10) | 1 watcher (minimal) |
| Memory per pod | Full event cache per pod | Only leader needs caches |

Scaling behavior

  • Scale up (2→3): the new pod appears in Endpoints and all pods recompute N=3. A brief overlap window (<1s) can produce duplicates, which Cloud Logging is expected to deduplicate by InsertId
  • Scale down (3→2): the departing pod drains its buffers on SIGTERM. The remaining pods absorb the orphaned partition
  • Pod crash: the remaining pods see the updated Endpoints and absorb the crashed pod's partition. Resync (1min) recovers any in-flight events
  • Throughput: ~200-500 events/sec per pod (an estimate; load testing is still a TODO). N pods give roughly N × that sink throughput, although every pod still watches and hashes all events
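
The overlap handling above depends on every pod deriving the same InsertId for the same event revision. A minimal sketch of that idea follows, assuming the namespace/name/resourceVersion format stated earlier. The eventMeta type and the insertID and dedupe functions are hypothetical, and dedupe only models a backend that keeps one entry per ID. As far as the LogEntry documentation describes it, Cloud Logging treats entries in the same project with the same timestamp and insert ID as duplicates, so the real behavior still needs the validation listed under Status.

```go
package main

// eventMeta is a hypothetical stand-in for the fields of a Kubernetes
// Event that the ID is built from.
type eventMeta struct {
	Namespace       string
	Name            string
	ResourceVersion string
}

// insertID derives the same ID on every pod for the same event revision,
// so two pods that both send it during a rebalance produce equal IDs.
func insertID(e eventMeta) string {
	return e.Namespace + "/" + e.Name + "/" + e.ResourceVersion
}

// dedupe models a backend that keeps the first entry per ID.
// ASSUMPTION: this is a simplification of Cloud Logging's behavior.
func dedupe(ids []string) []string {
	seen := make(map[string]struct{}, len(ids))
	var out []string
	for _, id := range ids {
		if _, ok := seen[id]; ok {
			continue // already received from another pod
		}
		seen[id] = struct{}{}
		out = append(out, id)
	}
	return out
}
```

Including resourceVersion in the ID means an updated event (for example, an incremented count) gets a new ID and is kept, while the same revision sent by two pods collapses to one entry.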

New files (vs original event-exporter)

| File | Purpose |
| --- | --- |
| peerdiscovery/discovery.go | Endpoints informer, sorted peer list, IsOwner() hash check |
| peerdiscovery/discovery_test.go | Tests: determinism, distribution, grace period, rebalancing |
| peerdiscovery/metrics.go | event_exporter_peer_count, event_exporter_peer_updates_total |
| example/event-exporter-ha.yaml | HA deployment: headless Service, HPA, PDB, readiness probe |

Modified files (vs original event-exporter)

| File | Change |
| --- | --- |
| main.go | --headless-service-name flag, PeerDiscovery init, /readyz endpoint |
| event_exporter.go | PeerDiscovery as third concurrent goroutine |
| sinks/stackdriver/sink.go | Hash filtering in OnAdd/OnUpdate (before serialization), queue depth reporting |
| sinks/stackdriver/sink_factory.go | Exported type, CreateNewWithOwnerChecker method |
| sinks/stackdriver/sink_config.go | Tuned defaults: buffer 500, concurrency 25, flush 2s |
| sinks/stackdriver/writer.go | Exponential backoff (1s→60s, 2x, 20% jitter) |
| sinks/stackdriver/log_entry_factory.go | Deterministic InsertId |
| sinks/stackdriver/metrics.go | events_dropped_by_hash_total, events_owned_total, queue_depth |

New Prometheus metrics

| Metric | Type | Purpose |
| --- | --- | --- |
| event_exporter_peer_count | Gauge | Current number of peer pods |
| event_exporter_peer_updates_total | Counter | Peer list update events |
| event_exporter_events_dropped_by_hash_total | Counter | Events skipped (owned by another pod) |
| event_exporter_events_owned_total | Counter | Events this pod processes |
| event_exporter_queue_depth | Gauge | Log entry channel depth (HPA metric) |

Status

This is an initial working implementation. Known TODOs:

  • End-to-end tests with multi-pod deployment
  • Load testing at scale (>1000 events/sec)
  • Validate Cloud Logging InsertId deduplication behavior under rebalancing
  • Tune HPA thresholds based on real-world metrics
  • Update Makefile/Dockerfile for v2 build paths
  • Consider EndpointSlice API as future replacement for Endpoints

Test plan

  • Unit tests pass: go test ./... (peerdiscovery, sinks, watchers, podlabels)
  • go vet ./... clean
  • go build ./... succeeds
  • Original event-exporter/ untouched and builds/tests independently
  • Deploy in test GKE cluster with 3 replicas
  • Verify all events appear in Cloud Logging with no gaps
  • Scale up/down and verify no event loss
  • Kill a pod and verify partition recovery

🤖 Generated with Claude Code

dependabot bot and others added 30 commits October 6, 2023 19:28
Bumps [github.com/google/gofuzz](https://github.com/google/gofuzz) from 1.1.0 to 1.2.0.
- [Release notes](https://github.com/google/gofuzz/releases)
- [Commits](google/gofuzz@v1.1.0...v1.2.0)

---
updated-dependencies:
- dependency-name: github.com/google/gofuzz
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps google.golang.org/protobuf from 1.28.1 to 1.33.0.

---
updated-dependencies:
- dependency-name: google.golang.org/protobuf
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [github.com/golang/glog](https://github.com/golang/glog) from 1.2.4 to 1.2.5.
- [Release notes](https://github.com/golang/glog/releases)
- [Commits](golang/glog@v1.2.4...v1.2.5)

---
updated-dependencies:
- dependency-name: github.com/golang/glog
  dependency-version: 1.2.5
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps golang from 1.24 to 1.25.

---
updated-dependencies:
- dependency-name: golang
  dependency-version: '1.25'
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [golang.org/x/oauth2](https://github.com/golang/oauth2) from 0.0.0-20220223155221-ee480838109b to 0.32.0.
- [Commits](https://github.com/golang/oauth2/commits/v0.32.0)

---
updated-dependencies:
- dependency-name: golang.org/x/oauth2
  dependency-version: 0.32.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [google.golang.org/api](https://github.com/googleapis/google-api-go-client) from 0.224.0 to 0.253.0.
- [Release notes](https://github.com/googleapis/google-api-go-client/releases)
- [Changelog](https://github.com/googleapis/google-api-go-client/blob/main/CHANGES.md)
- [Commits](googleapis/google-api-go-client@v0.224.0...v0.253.0)

---
updated-dependencies:
- dependency-name: google.golang.org/api
  dependency-version: 0.253.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.36.0 to 0.45.0.
- [Commits](golang/crypto@v0.36.0...v0.45.0)

---
updated-dependencies:
- dependency-name: golang.org/x/crypto
  dependency-version: 0.45.0
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps golang from 1.23.2 to 1.25.5.

---
updated-dependencies:
- dependency-name: golang
  dependency-version: 1.25.5
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps golang from 1.24 to 1.25.

---
updated-dependencies:
- dependency-name: golang
  dependency-version: '1.25'
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
…cher-streaming

Allow watch streaming option for lister watcher to prevent high memory spike.
…cher-streaming

Bump event-exporter version to v0.5.9
…cher-streaming

Enable streaming listWatcher only if the required k8s feature gate is enabled
Bump up golang.org/x/crypto to 0.45.0 for event-exporter
…rMetriclabel

update extractAllLabels to filter out labels that aren't metric labels
…attempt

Add restart attempt and jobset uid labels in pod owner transform
juli4n and others added 30 commits March 18, 2026 09:37
Switch to buildx from legacy docker builder
…e-fix

Switch to buildx from legacy docker builder
Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.79.2 to 1.79.3.
- [Release notes](https://github.com/grpc/grpc-go/releases)
- [Commits](grpc/grpc-go@v1.79.2...v1.79.3)

---
updated-dependencies:
- dependency-name: google.golang.org/grpc
  dependency-version: 1.79.3
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
…dependabot/go_modules/custom-metrics-stackdriver-adapter/google.golang.org/grpc-1.79.3

Bump google.golang.org/grpc from 1.79.2 to 1.79.3 in /custom-metrics-stackdriver-adapter
Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.56.3 to 1.79.3.
- [Release notes](https://github.com/grpc/grpc-go/releases)
- [Commits](grpc/grpc-go@v1.56.3...v1.79.3)

---
updated-dependencies:
- dependency-name: google.golang.org/grpc
  dependency-version: 1.79.3
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.79.2 to 1.79.3.
- [Release notes](https://github.com/grpc/grpc-go/releases)
- [Commits](grpc/grpc-go@v1.79.2...v1.79.3)

---
updated-dependencies:
- dependency-name: google.golang.org/grpc
  dependency-version: 1.79.3
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
…dependabot/go_modules/custom-metrics-stackdriver-adapter/examples/direct-to-sd/google.golang.org/grpc-1.79.3

Bump google.golang.org/grpc from 1.56.3 to 1.79.3 in /custom-metrics-stackdriver-adapter/examples/direct-to-sd
Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.79.2 to 1.79.3.
- [Release notes](https://github.com/grpc/grpc-go/releases)
- [Commits](grpc/grpc-go@v1.79.2...v1.79.3)

---
updated-dependencies:
- dependency-name: google.golang.org/grpc
  dependency-version: 1.79.3
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.79.2 to 1.79.3.
- [Release notes](https://github.com/grpc/grpc-go/releases)
- [Commits](grpc/grpc-go@v1.79.2...v1.79.3)

---
updated-dependencies:
- dependency-name: google.golang.org/grpc
  dependency-version: 1.79.3
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
…dependabot/go_modules/event-exporter/google.golang.org/grpc-1.79.3

Bump google.golang.org/grpc from 1.79.2 to 1.79.3 in /event-exporter
…dependabot/go_modules/prometheus-to-sd/google.golang.org/grpc-1.79.3

Bump google.golang.org/grpc from 1.79.2 to 1.79.3 in /prometheus-to-sd
Bumps [google.golang.org/api](https://github.com/googleapis/google-api-go-client) from 0.270.0 to 0.272.0.
- [Release notes](https://github.com/googleapis/google-api-go-client/releases)
- [Changelog](https://github.com/googleapis/google-api-go-client/blob/main/CHANGES.md)
- [Commits](googleapis/google-api-go-client@v0.270.0...v0.272.0)

---
updated-dependencies:
- dependency-name: google.golang.org/api
  dependency-version: 0.272.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
…dependabot/go_modules/prometheus-to-sd/google.golang.org/api-0.272.0

Bump google.golang.org/api from 0.270.0 to 0.272.0 in /prometheus-to-sd
…dependabot/go_modules/kubelet-to-gcm/google.golang.org/grpc-1.79.3

Bump google.golang.org/grpc from 1.79.2 to 1.79.3 in /kubelet-to-gcm
Bumps [k8s.io/component-base](https://github.com/kubernetes/component-base) from 0.35.2 to 0.35.3.
- [Commits](kubernetes/component-base@v0.35.2...v0.35.3)

---
updated-dependencies:
- dependency-name: k8s.io/component-base
  dependency-version: 0.35.3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
…dependabot/go_modules/custom-metrics-stackdriver-adapter/k8s.io/component-base-0.35.3

Bump k8s.io/component-base from 0.35.2 to 0.35.3 in /custom-metrics-stackdriver-adapter
Bumps [k8s.io/apimachinery](https://github.com/kubernetes/apimachinery) from 0.35.2 to 0.35.3.
- [Commits](kubernetes/apimachinery@v0.35.2...v0.35.3)

---
updated-dependencies:
- dependency-name: k8s.io/apimachinery
  dependency-version: 0.35.3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [k8s.io/client-go](https://github.com/kubernetes/client-go) from 0.35.2 to 0.35.3.
- [Changelog](https://github.com/kubernetes/client-go/blob/master/CHANGELOG.md)
- [Commits](kubernetes/client-go@v0.35.2...v0.35.3)

---
updated-dependencies:
- dependency-name: k8s.io/client-go
  dependency-version: 0.35.3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
…dependabot/go_modules/prometheus-to-sd/k8s.io/apimachinery-0.35.3

Bump k8s.io/apimachinery from 0.35.2 to 0.35.3 in /prometheus-to-sd
…dependabot/go_modules/event-exporter/k8s.io/client-go-0.35.3

Bump k8s.io/client-go from 0.35.2 to 0.35.3 in /event-exporter
Bumps [k8s.io/api](https://github.com/kubernetes/api) from 0.35.2 to 0.35.3.
- [Commits](kubernetes/api@v0.35.2...v0.35.3)

---
updated-dependencies:
- dependency-name: k8s.io/api
  dependency-version: 0.35.3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [k8s.io/client-go](https://github.com/kubernetes/client-go) from 0.35.2 to 0.35.3.
- [Changelog](https://github.com/kubernetes/client-go/blob/master/CHANGELOG.md)
- [Commits](kubernetes/client-go@v0.35.2...v0.35.3)

---
updated-dependencies:
- dependency-name: k8s.io/client-go
  dependency-version: 0.35.3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
…dependabot/go_modules/prometheus-to-sd/k8s.io/api-0.35.3

Bump k8s.io/api from 0.35.2 to 0.35.3 in /prometheus-to-sd
…dependabot/go_modules/prometheus-to-sd/k8s.io/client-go-0.35.3

Bump k8s.io/client-go from 0.35.2 to 0.35.3 in /prometheus-to-sd
Bumps [k8s.io/klog/v2](https://github.com/kubernetes/klog) from 2.130.1 to 2.140.0.
- [Release notes](https://github.com/kubernetes/klog/releases)
- [Changelog](https://github.com/kubernetes/klog/blob/main/RELEASE.md)
- [Commits](kubernetes/klog@v2.130.1...2.140.0)

---
updated-dependencies:
- dependency-name: k8s.io/klog/v2
  dependency-version: 2.140.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
…dependabot/go_modules/event-exporter/k8s.io/klog/v2-2.140.0

Bump k8s.io/klog/v2 from 2.130.1 to 2.140.0 in /event-exporter
…hing for HA

Add event-exporter-v2 as a new standalone module alongside the existing
event-exporter. This is a re-architecture of the event exporter for
higher performance, scalability, and high availability while maintaining
the same core functionality (exporting all cluster events to GCP Cloud
Logging) and low cluster-wide footprint.

Key changes from event-exporter:
- Shared-nothing consistent hashing for multi-pod horizontal scaling
- Peer discovery via headless Service Endpoints informer
- Hash-based event partitioning (xxhash modulo) across pods
- Deterministic InsertId for Cloud Logging deduplication during rebalancing
- Exponential backoff with jitter replacing fixed 10s retry
- Tuned defaults: buffer 500 (was 100), concurrency 25 (was 10), flush 2s (was 5s)
- New Prometheus metrics: queue_depth, peer_count, events_owned/dropped_by_hash
- Readiness probe (/readyz) gated on peer discovery sync
- HA deployment manifest with headless Service, HPA, and PDB
- Fully backward compatible: single-pod mode when --headless-service-name is empty

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>