Skip to content

feat(discovery): queries, Signal and Ranking rework #57

Closed
Copilot wants to merge 10 commits into
mainfrom
copilot/extend-discovery-policy
Closed

feat(discovery): queries, Signal and Ranking rework #57
Copilot wants to merge 10 commits into
mainfrom
copilot/extend-discovery-policy

Conversation

Copilot AI commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

Loki query type and eventPullTime signal were stubbed with "not yet implemented" errors. This wires both up end-to-end.

Loki datasource (internal/discovery/loki.go)

  • LokiSource.FetchRaw — calls /loki/api/v1/query_range, returns map[image][]TimedSample
  • Generic parser: value=1.0 per log entry, keyed by image stream label
  • kubernetesEvents parser: extracts pull durations from Pulled messages (in Xs pattern), falls back to Pulling→Pulled timestamp delta; stores failures under image:failed, cache hits under image:cache_hit
  • lokiInferReasonFromMessage — infers event reason from plain-text log lines when no structured label is present

Engine wiring (internal/discovery/engine.go)

  • executeLokiQuery — bridges DiscoveryLokiQuery spec → LokiSource.FetchRaw
  • DiscoveryQueryTypeLoki case in executeQuery now fully executes instead of returning stub error
  • SignalTypeEventPullTime case in deriveSignal now calls deriveEventPullTime
  • deriveEventPullTime — computes p50/p90/p95/avg/max/count/failureCount/cacheHitCount statistics over Loki event samples
  • collectImages updated to strip :failed/:cache_hit suffixes so failure-only images are still ranked

Tests

  • loki_test.go: generic path, kubernetesEvents parser, event-pair fallback, HTTP error, reason inference, duration parsing
  • engine_test.go: full pipeline tests for Loki+eventPullTime avg score and failure count; percentile correctness against known 1–10s dataset

E2E infrastructure and test (Loki)

Adds the previously-missing Loki end-to-end coverage so the Loki query type and eventPullTime signal are exercised against a real Loki backend:

  • hack/e2e-infra/loki.yaml — deploys a single-binary Loki (ConfigMap + Deployment + Service) into the e2e-infra namespace, with no readiness probe to match the existing Prometheus/registry manifests (Loki's /ready is flaky during ring stabilization).
  • hack/e2e-infra/seed-loki-job.yaml — seeds Loki via /loki/api/v1/push with kubelet-style image-pull event log lines (Pulling/Pulled … in Xs/Failed/already present) for the registry-seeded test images.
  • hack/e2e-infra/setup.sh and Tiltfile — deploy and seed Loki alongside Prometheus and the registry.
  • test/e2e/discovery-loki/ — Chainsaw suite running a Loki range query with the kubernetesEvents parser, deriving p50 cold-pull-time and failureCount eventPullTime signals, and asserting Ready=Synced plus that all three seeded images are discovered.
  • test/e2e/README.md — refreshed scenario table (discovery, discovery-loki, discovery-registry).

The full Loki data path (config → push format → LogQL query → kubernetesEvents parser → per-image durations/failure counts) was validated end-to-end against a real Loki in Docker, and the DiscoveryPolicy manifest strict-decodes into the typed API. Note: a full Chainsaw make test-e2e run could not be completed in the sandbox because pod-to-pod networking in the nested-docker kind cluster is non-functional (the existing Prometheus infra fails cross-pod identically); the manifests mirror the working Prometheus/registry pattern and are expected to pass in real CI.

… (Issue 1: CRD types)

Implements the first issue of the three-stage discovery pipeline redesign
documented in docs/decisions/13-discovery-signals-ranking.md.

## Breaking Changes

- Removed: spec.sources[], DiscoverySource, PrometheusSource (API type),
  RegistrySource, status.sourceCount
- Added: spec.queries[], spec.signals[], spec.ranking (new three-stage pipeline)
- DiscoveredImage: removed score/source fields, added rank/finalScore/selected/
  signals/ranking breakdown fields

## New API Types

Query stage:
- DiscoveryQuery (prometheus | loki)
- DiscoveryPrometheusQuery, DiscoveryLokiQuery, LokiParser

Signal stage (4 types):
- aggregate, timeWeightedAggregate, windowAggregate, eventPullTime

Ranking stage (3 strategies):
- signal, weightedSum (minMax normalized), modelExposure (cold-node exposure)

Status:
- QueryResult[], SignalResult[] — per-query/signal observability
- Rich DiscoveredImage with signals[] and ranking breakdown

## Other Changes

- Regenerated deepcopy and CRD manifests
- Stubbed controller: sets Ready=False/NotImplemented until Issues 2-10 land
- Removed internal/discovery/registry.go (registry source retired)
- Removed test/e2e/discovery-aggregation/ and discovery-registry/ (retired)
- Updated all e2e tests to new schema, assert NotImplemented condition
- Rewrote docs/content/docs/discovery.md with full pipeline explanation
- Regenerated AI docs (knowledge.yaml, llms.txt, llms-full.txt)

Closes #55
Copilot AI changed the title [WIP] Extend Drop's DiscoveryPolicy for image ranking feat(discovery): replace sources[] with query/signal/ranking pipeline (Issue 1: CRD types) Jun 27, 2026
Copilot finished work on behalf of Breee June 27, 2026 11:46
Copilot AI requested a review from Breee June 27, 2026 11:46
@Breee

Breee commented Jun 27, 2026

Copy link
Copy Markdown
Member

@copilot also implement the remaining parts so this is complete. Make sure the e2e tests and Linter and unit Tests Pass locally before pushing.

…gistry datasource

- Add DiscoveryQueryTypeRegistry + DiscoveryRegistryQuery to API types
- Restore internal/discovery/registry.go and registry_test.go
- Add internal/discovery/engine.go: full 3-stage pipeline execution (query → signal → ranking)
  - Prometheus instant/range, registry queries
  - aggregate, timeWeightedAggregate, windowAggregate signals
  - signal, weightedSum, modelExposure ranking strategies
- Add internal/discovery/engine_test.go: tests for all pipeline stages
- Add FetchRaw() to PrometheusSource for timestamp-preserving data access
- Replace controller stub (NotImplemented) with real pipeline execution
- Update e2e tests: assert real behavior (Synced/DNSError) instead of NotImplemented
- Add discovery-registry e2e test suite
- Regenerate deepcopy and CRD manifests

All unit tests pass, linter clean (0 issues).

Copilot AI commented Jun 27, 2026

Copy link
Copy Markdown
Contributor Author

@copilot also implement the remaining parts so this is complete. Make sure the e2e tests and Linter and unit Tests Pass locally before pushing.

probe

@copilot also implement the remaining parts so this is complete. Make sure the e2e tests and Linter and unit Tests Pass locally before pushing.

Implemented in d3757c2. Here's what's now in place:

Pipeline engine (internal/discovery/engine.go):

  • Stage 1 (Query): Prometheus instant/range + registry; Loki stubbed
  • Stage 2 (Signal): aggregate, timeWeightedAggregate, windowAggregate; eventPullTime stubbed
  • Stage 3 (Ranking): signal, weightedSum (minMax), modelExposure cold-node formula

Registry datasource restored: DiscoveryQueryTypeRegistry added to the API, registry.go re-implemented, new test/e2e/discovery-registry/ e2e suite.

Controller now executes the real pipeline — sets Ready=True, Synced on success and Ready=False, DNSError|SyncFailed on failure (the NotImplemented stub is gone).

All unit tests pass, linter clean (0 issues).

Copilot AI changed the title feat(discovery): replace sources[] with query/signal/ranking pipeline (Issue 1: CRD types) feat(discovery): replace sources[] with query/signal/ranking pipeline + full execution engine Jun 27, 2026
Copilot finished work on behalf of Breee June 27, 2026 13:08
@Breee

Breee commented Jun 27, 2026

Copy link
Copy Markdown
Member

@copilot implement the remaining parts

Copilot AI changed the title feat(discovery): replace sources[] with query/signal/ranking pipeline + full execution engine feat(discovery): implement Loki query execution and eventPullTime signal derivation Jun 27, 2026
Copilot finished work on behalf of Breee June 27, 2026 14:23
@Breee

Breee commented Jun 27, 2026

Copy link
Copy Markdown
Member

@copilot implement the remaining parts. Commit and push after each logically Part. The previous Session https://github.com/corewire/drop/tasks/0b950942-ffbf-466d-970d-903d7307ae19 Was broken because the Infrastructure was broken.

copilot added 2 commits June 27, 2026 14:38
…discovery

Deploy a single-binary Loki into the e2e-infra namespace and seed it with
kubelet-style image-pull event log lines (Pulling/Pulled/Failed/already
present) so DiscoveryPolicy loki queries with the kubernetesEvents parser and
the eventPullTime signal can be exercised end-to-end.

Wired into hack/e2e-infra/setup.sh and the Tiltfile alongside the existing
Prometheus and registry infrastructure.
Add a DiscoveryPolicy e2e suite that runs a Loki range query with the
kubernetesEvents parser and derives p50 cold-pull-time and failure-count
eventPullTime signals from the seeded image-pull events, asserting the pipeline
reports Ready=Synced and discovers the expected images.

Also refresh the e2e README scenario table (discovery, discovery-loki,
discovery-registry).
copilot added 2 commits June 27, 2026 14:53
The kubelet readiness probe against Loki's /ready was flaky during ring
stabilization (the probe's 1s timeout was exceeded and /ready returns 503 until
the ingester settles), leaving the deployment stuck as not-available. The
existing Prometheus and registry manifests use no readiness probe; the seed job
already polls /ready before pushing and consumers retry, so gate readiness the
same way for consistency and reliability.
Also assert test/tools:v1 (the third seeded image) appears in
status.discoveredImages so the assertions cover the full seed dataset.
@Breee

Breee commented Jun 27, 2026

Copy link
Copy Markdown
Member

@copilot fix the e2e tests.

The readiness probe was dropped in the previous commit because the 1s
timeout was too short for ring stabilization. Without any probe,
kubectl wait --for=condition=available succeeds as soon as the container
starts (before Loki's HTTP server accepts requests), so the seed job
could run against a not-yet-ready Loki.

Re-add the probe with a longer 5s timeout and 15s initial delay, giving
Loki up to ~105s to pass before the Deployment is marked Available and
the setup.sh seed step begins.

Also:
- Remove stale 02-assert-notimplemented.yaml (controller no longer
  returns NotImplemented; file was unused by any chainsaw-test.yaml)
- Fix test/e2e/README.md: wrong make target, wrong scenario names,
  missing scenarios (cachedimageset-discovery, discovery-failure)
- Update Makefile e2e-infra comment and CI step name to include Loki
Copilot finished work on behalf of Breee June 27, 2026 20:33
@Breee Breee changed the title feat(discovery): implement Loki query execution and eventPullTime signal derivation feat(discovery): queries, Signal and Ranking rework Jun 28, 2026
@Breee Breee closed this Jun 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Proposal: Query, Signal, and Ranking Pipeline for Drop Discovery

2 participants