feat(discovery): queries, Signal and Ranking rework #57
… (Issue 1: CRD types)

Implements the first issue of the three-stage discovery pipeline redesign documented in docs/decisions/13-discovery-signals-ranking.md.

## Breaking Changes

- Removed: spec.sources[], DiscoverySource, PrometheusSource (API type), RegistrySource, status.sourceCount
- Added: spec.queries[], spec.signals[], spec.ranking (new three-stage pipeline)
- DiscoveredImage: removed score/source fields, added rank/finalScore/selected/signals/ranking breakdown fields

## New API Types

Query stage:
- DiscoveryQuery (prometheus | loki)
- DiscoveryPrometheusQuery, DiscoveryLokiQuery, LokiParser

Signal stage (4 types):
- aggregate, timeWeightedAggregate, windowAggregate, eventPullTime

Ranking stage (3 strategies):
- signal, weightedSum (minMax normalized), modelExposure (cold-node exposure)

Status:
- QueryResult[], SignalResult[]: per-query/signal observability
- Rich DiscoveredImage with signals[] and ranking breakdown

## Other Changes

- Regenerated deepcopy and CRD manifests
- Stubbed controller: sets Ready=False/NotImplemented until Issues 2-10 land
- Removed internal/discovery/registry.go (registry source retired)
- Removed test/e2e/discovery-aggregation/ and discovery-registry/ (retired)
- Updated all e2e tests to new schema, assert NotImplemented condition
- Rewrote docs/content/docs/discovery.md with full pipeline explanation
- Regenerated AI docs (knowledge.yaml, llms.txt, llms-full.txt)

Closes #55
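For readers unfamiliar with the `weightedSum` strategy, here is a minimal Go sketch of min-max normalization followed by a weighted sum. It is an illustration only: the function names, the map-based data layout, the choice to score 0 when all values of a signal are equal, and the treatment of images missing from a signal are assumptions, not the code in this PR.

```go
package main

import "fmt"

// minMaxNormalize rescales one signal's per-image values into [0, 1].
// Assumption: when every value is identical, all images normalize to 0.
func minMaxNormalize(values map[string]float64) map[string]float64 {
	out := make(map[string]float64, len(values))
	first := true
	var lo, hi float64
	for _, v := range values {
		if first {
			lo, hi, first = v, v, false
			continue
		}
		if v < lo {
			lo = v
		}
		if v > hi {
			hi = v
		}
	}
	for image, v := range values {
		if hi == lo {
			out[image] = 0
			continue
		}
		out[image] = (v - lo) / (hi - lo)
	}
	return out
}

// weightedSum combines several normalized signals into one score per image.
// signals maps signal name -> image -> raw value; weights maps signal name -> weight.
// Assumption: an image absent from a signal contributes 0 for that signal.
func weightedSum(signals map[string]map[string]float64, weights map[string]float64) map[string]float64 {
	scores := map[string]float64{}
	for name, perImage := range signals {
		w := weights[name]
		for image, n := range minMaxNormalize(perImage) {
			scores[image] += w * n
		}
	}
	return scores
}

func main() {
	// Hypothetical signal names and values, for illustration only.
	signals := map[string]map[string]float64{
		"pulls": {"a": 10, "b": 20, "c": 30},
		"size":  {"a": 100, "b": 300, "c": 200},
	}
	weights := map[string]float64{"pulls": 2, "size": 1}
	fmt.Println(weightedSum(signals, weights))
}
```

Normalizing each signal before weighting keeps a signal with a large numeric range (for example bytes) from dominating one with a small range (for example counts).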
@copilot also implement the remaining parts so this is complete. Make sure the e2e tests, linter, and unit tests pass locally before pushing.
…gistry datasource

- Add DiscoveryQueryTypeRegistry + DiscoveryRegistryQuery to API types
- Restore internal/discovery/registry.go and registry_test.go
- Add internal/discovery/engine.go: full 3-stage pipeline execution (query → signal → ranking)
  - Prometheus instant/range, registry queries
  - aggregate, timeWeightedAggregate, windowAggregate signals
  - signal, weightedSum, modelExposure ranking strategies
- Add internal/discovery/engine_test.go: tests for all pipeline stages
- Add FetchRaw() to PrometheusSource for timestamp-preserving data access
- Replace controller stub (NotImplemented) with real pipeline execution
- Update e2e tests: assert real behavior (Synced/DNSError) instead of NotImplemented
- Add discovery-registry e2e test suite
- Regenerate deepcopy and CRD manifests

All unit tests pass, linter clean (0 issues).
Implemented in d3757c2. Here's what's now in place:

- Pipeline engine ( …
- Registry datasource restored: …
- Controller now executes the real pipeline: sets …

All unit tests pass, linter clean (0 issues).
@copilot implement the remaining parts
@copilot implement the remaining parts. Commit and push after each logical part. The previous session https://github.com/corewire/drop/tasks/0b950942-ffbf-466d-970d-903d7307ae19 failed because the infrastructure was broken.
…discovery

Deploy a single-binary Loki into the e2e-infra namespace and seed it with kubelet-style image-pull event log lines (Pulling/Pulled/Failed/already present) so DiscoveryPolicy loki queries with the kubernetesEvents parser and the eventPullTime signal can be exercised end-to-end. Wired into hack/e2e-infra/setup.sh and the Tiltfile alongside the existing Prometheus and registry infrastructure.
Add a DiscoveryPolicy e2e suite that runs a Loki range query with the kubernetesEvents parser and derives p50 cold-pull-time and failure-count eventPullTime signals from the seeded image-pull events, asserting the pipeline reports Ready=Synced and discovers the expected images. Also refresh the e2e README scenario table (discovery, discovery-loki, discovery-registry).
The kubelet readiness probe against Loki's /ready was flaky during ring stabilization (the probe's 1s timeout was exceeded and /ready returns 503 until the ingester settles), leaving the deployment stuck as not-available. The existing Prometheus and registry manifests use no readiness probe; the seed job already polls /ready before pushing and consumers retry, so gate readiness the same way for consistency and reliability.
Also assert test/tools:v1 (the third seeded image) appears in status.discoveredImages so the assertions cover the full seed dataset.
@copilot fix the e2e tests.
The readiness probe was dropped in the previous commit because the 1s timeout was too short for ring stabilization. Without any probe, kubectl wait --for=condition=available succeeds as soon as the container starts (before Loki's HTTP server accepts requests), so the seed job could run against a not-yet-ready Loki. Re-add the probe with a longer 5s timeout and 15s initial delay, giving Loki up to ~105s to pass before the Deployment is marked Available and the setup.sh seed step begins.

Also:
- Remove stale 02-assert-notimplemented.yaml (controller no longer returns NotImplemented; file was unused by any chainsaw-test.yaml)
- Fix test/e2e/README.md: wrong make target, wrong scenario names, missing scenarios (cachedimageset-discovery, discovery-failure)
- Update Makefile e2e-infra comment and CI step name to include Loki
Loki query type and `eventPullTime` signal were stubbed with "not yet implemented" errors. This wires both up end-to-end.

## Loki datasource (`internal/discovery/loki.go`)

- `LokiSource.FetchRaw`: calls `/loki/api/v1/query_range`, returns `map[image][]TimedSample`
- `image` stream label
- `kubernetesEvents` parser: extracts pull durations from `Pulled` messages (`in Xs` pattern), falls back to Pulling→Pulled timestamp delta; stores failures under `image:failed`, cache hits under `image:cache_hit`
- `lokiInferReasonFromMessage`: infers event reason from plain-text log lines when no structured label is present

## Engine wiring (`internal/discovery/engine.go`)

- `executeLokiQuery`: bridges `DiscoveryLokiQuery` spec → `LokiSource.FetchRaw`
- `DiscoveryQueryTypeLoki` case in `executeQuery` now fully executes instead of returning stub error
- `SignalTypeEventPullTime` case in `deriveSignal` now calls `deriveEventPullTime`
- `deriveEventPullTime`: computes p50/p90/p95/avg/max/count/failureCount/cacheHitCount statistics over Loki event samples
- `collectImages` updated to strip `:failed`/`:cache_hit` suffixes so failure-only images are still ranked

## Tests

- `loki_test.go`: generic path, kubernetesEvents parser, event-pair fallback, HTTP error, reason inference, duration parsing
- `engine_test.go`: full pipeline tests for Loki+eventPullTime avg score and failure count; percentile correctness against known 1–10s dataset

## E2E infrastructure and test (Loki)

Adds the previously-missing Loki end-to-end coverage so the Loki query type and `eventPullTime` signal are exercised against a real Loki backend:

- `hack/e2e-infra/loki.yaml`: deploys a single-binary Loki (ConfigMap + Deployment + Service) into the `e2e-infra` namespace, with no readiness probe to match the existing Prometheus/registry manifests (Loki's `/ready` is flaky during ring stabilization).
- `hack/e2e-infra/seed-loki-job.yaml`: seeds Loki via `/loki/api/v1/push` with kubelet-style image-pull event log lines (`Pulling`/`Pulled … in Xs`/`Failed`/`already present`) for the registry-seeded test images.
- `hack/e2e-infra/setup.sh` and `Tiltfile`: deploy and seed Loki alongside Prometheus and the registry.
- `test/e2e/discovery-loki/`: Chainsaw suite running a Loki range query with the `kubernetesEvents` parser, deriving `p50` cold-pull-time and `failureCount` `eventPullTime` signals, and asserting `Ready=Synced` plus that all three seeded images are discovered.
- `test/e2e/README.md`: refreshed scenario table (`discovery`, `discovery-loki`, `discovery-registry`).

The full Loki data path (config → push format → LogQL query → `kubernetesEvents` parser → per-image durations/failure counts) was validated end-to-end against a real Loki in Docker, and the `DiscoveryPolicy` manifest strict-decodes into the typed API.

Note: a full Chainsaw `make test-e2e` run could not be completed in the sandbox because pod-to-pod networking in the nested-docker kind cluster is non-functional (the existing Prometheus infra fails cross-pod identically); the manifests mirror the working Prometheus/registry pattern and are expected to pass in real CI.