Feat/breaking change rework ranking#58

Open
Breee wants to merge 35 commits into main from feat/breaking-change-rework-ranking

@Breee Breee commented Jun 28, 2026

Member

This pull request introduces significant enhancements to the DiscoveryPolicy API and the development environment, focusing on supporting richer query and ranking capabilities, as well as improving E2E infrastructure.

The main changes include a comprehensive refactor of DeepCopy methods to support new types, the addition of Loki (log aggregation) resources for E2E and local development, and the introduction of a feature spec for a future DiscoveryPolicy UI.

This is for our plans in #55.

Copilot AI and others added 30 commits June 27, 2026 11:19
… (Issue 1: CRD types)

Implements the first issue of the three-stage discovery pipeline redesign
documented in docs/decisions/13-discovery-signals-ranking.md.

## Breaking Changes

- Removed: spec.sources[], DiscoverySource, PrometheusSource (API type),
  RegistrySource, status.sourceCount
- Added: spec.queries[], spec.signals[], spec.ranking (new three-stage pipeline)
- DiscoveredImage: removed score/source fields, added rank/finalScore/selected/
  signals/ranking breakdown fields

## New API Types

Query stage:
- DiscoveryQuery (prometheus | loki)
- DiscoveryPrometheusQuery, DiscoveryLokiQuery, LokiParser

Signal stage (4 types):
- aggregate, timeWeightedAggregate, windowAggregate, eventPullTime

Ranking stage (3 strategies):
- signal, weightedSum (minMax normalized), modelExposure (cold-node exposure)
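
To make the weightedSum strategy concrete, here is a minimal Go sketch of min-max normalization followed by a weighted sum. It is not the code from this PR: the function names, signal names, image names, and weights are all hypothetical, and the handling of edge cases (a zero range maps to 0, an image missing from a signal contributes 0) is an assumption of the sketch.

```go
package main

import (
	"fmt"
	"sort"
)

// minMax rescales one signal's per-image values to [0, 1].
// If every value is equal the range is zero; this sketch maps them all to 0.
func minMax(values map[string]float64) map[string]float64 {
	out := make(map[string]float64, len(values))
	first := true
	var lo, hi float64
	for _, v := range values {
		if first || v < lo {
			lo = v
		}
		if first || v > hi {
			hi = v
		}
		first = false
	}
	for image, v := range values {
		if hi == lo {
			out[image] = 0
			continue
		}
		out[image] = (v - lo) / (hi - lo)
	}
	return out
}

// weightedSum combines normalized signals into one score per image.
// signals maps signal name -> image -> raw value; weights maps signal name -> weight.
// An image missing from a signal simply contributes 0 for that signal.
func weightedSum(signals map[string]map[string]float64, weights map[string]float64) map[string]float64 {
	scores := make(map[string]float64)
	for name, values := range signals {
		for image, normalized := range minMax(values) {
			scores[image] += weights[name] * normalized
		}
	}
	return scores
}

// rankImages orders images by descending score, breaking ties by name.
func rankImages(scores map[string]float64) []string {
	images := make([]string, 0, len(scores))
	for image := range scores {
		images = append(images, image)
	}
	sort.Slice(images, func(i, j int) bool {
		a, b := images[i], images[j]
		if scores[a] != scores[b] {
			return scores[a] > scores[b]
		}
		return a < b
	})
	return images
}

func main() {
	// Hypothetical signals and weights, for illustration only.
	signals := map[string]map[string]float64{
		"pullCount": {"app:v1": 10, "db:v2": 30, "tools:v1": 20},
		"pullTime":  {"app:v1": 8, "db:v2": 2, "tools:v1": 2},
	}
	weights := map[string]float64{"pullCount": 0.75, "pullTime": 0.25}
	scores := weightedSum(signals, weights)
	for rank, image := range rankImages(scores) {
		fmt.Printf("%d %s %.3f\n", rank+1, image, scores[image])
	}
}
```

Normalizing each signal before weighting keeps a signal with a large numeric range (for example, pull counts) from dominating one with a small range (for example, seconds).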

Status:
- QueryResult[], SignalResult[] — per-query/signal observability
- Rich DiscoveredImage with signals[] and ranking breakdown

## Other Changes

- Regenerated deepcopy and CRD manifests
- Stubbed controller: sets Ready=False/NotImplemented until Issues 2-10 land
- Removed internal/discovery/registry.go (registry source retired)
- Removed test/e2e/discovery-aggregation/ and discovery-registry/ (retired)
- Updated all e2e tests to new schema, assert NotImplemented condition
- Rewrote docs/content/docs/discovery.md with full pipeline explanation
- Regenerated AI docs (knowledge.yaml, llms.txt, llms-full.txt)

Closes #55
…gistry datasource

- Add DiscoveryQueryTypeRegistry + DiscoveryRegistryQuery to API types
- Restore internal/discovery/registry.go and registry_test.go
- Add internal/discovery/engine.go: full 3-stage pipeline execution (query → signal → ranking)
  - Prometheus instant/range, registry queries
  - aggregate, timeWeightedAggregate, windowAggregate signals
  - signal, weightedSum, modelExposure ranking strategies
- Add internal/discovery/engine_test.go: tests for all pipeline stages
- Add FetchRaw() to PrometheusSource for timestamp-preserving data access
- Replace controller stub (NotImplemented) with real pipeline execution
- Update e2e tests: assert real behavior (Synced/DNSError) instead of NotImplemented
- Add discovery-registry e2e test suite
- Regenerate deepcopy and CRD manifests
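
The query → signal → ranking flow described above can be sketched roughly as follows. This is an illustrative sketch only, not the contents of internal/discovery/engine.go: every type and function name here (`Sample`, `Ranked`, `QueryFunc`, `aggregateSum`, `rankBySignal`, `runPipeline`) is hypothetical. It covers only a sum-style aggregate signal and the single-signal ranking strategy, and it assumes that a query yields per-image numeric samples.

```go
package main

import (
	"fmt"
	"sort"
)

// Sample is one row returned by the query stage: a value observed for an image.
type Sample struct {
	Image string
	Value float64
}

// Ranked is one entry of the final result. The field names loosely mirror the
// rank/finalScore status fields but are assumptions of this sketch.
type Ranked struct {
	Rank       int
	Image      string
	FinalScore float64
}

// QueryFunc stands in for a Prometheus, Loki, or registry query.
type QueryFunc func() ([]Sample, error)

// aggregateSum is the signal stage: it sums all samples per image.
func aggregateSum(samples []Sample) map[string]float64 {
	totals := make(map[string]float64)
	for _, s := range samples {
		totals[s.Image] += s.Value
	}
	return totals
}

// rankBySignal is the ranking stage: it orders images by one signal,
// highest first, and keeps at most topN entries (topN <= 0 keeps all).
func rankBySignal(signal map[string]float64, topN int) []Ranked {
	ranked := make([]Ranked, 0, len(signal))
	for image, score := range signal {
		ranked = append(ranked, Ranked{Image: image, FinalScore: score})
	}
	sort.Slice(ranked, func(i, j int) bool {
		if ranked[i].FinalScore != ranked[j].FinalScore {
			return ranked[i].FinalScore > ranked[j].FinalScore
		}
		return ranked[i].Image < ranked[j].Image
	})
	if topN > 0 && len(ranked) > topN {
		ranked = ranked[:topN]
	}
	for i := range ranked {
		ranked[i].Rank = i + 1
	}
	return ranked
}

// runPipeline wires the three stages together.
func runPipeline(query QueryFunc, topN int) ([]Ranked, error) {
	samples, err := query()
	if err != nil {
		return nil, fmt.Errorf("query stage: %w", err)
	}
	return rankBySignal(aggregateSum(samples), topN), nil
}

func main() {
	// A fake query with hypothetical image names.
	query := func() ([]Sample, error) {
		return []Sample{{"a:v1", 1}, {"b:v1", 5}, {"a:v1", 2}, {"c:v1", 2}}, nil
	}
	ranked, err := runPipeline(query, 2)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, r := range ranked {
		fmt.Printf("%d %s %.1f\n", r.Rank, r.Image, r.FinalScore)
	}
}
```

Keeping each stage a plain function over maps and slices makes it possible to unit test the stages independently, without a live datasource.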

All unit tests pass, linter clean (0 issues).
…discovery

Deploy a single-binary Loki into the e2e-infra namespace and seed it with
kubelet-style image-pull event log lines (Pulling/Pulled/Failed/already
present) so DiscoveryPolicy loki queries with the kubernetesEvents parser and
the eventPullTime signal can be exercised end-to-end.

Wired into hack/e2e-infra/setup.sh and the Tiltfile alongside the existing
Prometheus and registry infrastructure.
Add a DiscoveryPolicy e2e suite that runs a Loki range query with the
kubernetesEvents parser and derives p50 cold-pull-time and failure-count
eventPullTime signals from the seeded image-pull events, asserting the pipeline
reports Ready=Synced and discovers the expected images.

Also refresh the e2e README scenario table (discovery, discovery-loki,
discovery-registry).
The kubelet readiness probe against Loki's /ready was flaky during ring
stabilization: the probe's 1s timeout was exceeded, and /ready returns 503
until the ingester settles, which left the Deployment stuck as not Available.
The existing Prometheus and registry manifests use no readiness probe, the seed
job already polls /ready before pushing, and consumers retry, so drop the probe
and gate readiness the same way for consistency and reliability.
Also assert test/tools:v1 (the third seeded image) appears in
status.discoveredImages so the assertions cover the full seed dataset.
The readiness probe was dropped in the previous commit because the 1s
timeout was too short for ring stabilization. Without any probe,
kubectl wait --for=condition=available succeeds as soon as the container
starts (before Loki's HTTP server accepts requests), so the seed job
could run against a not-yet-ready Loki.

Re-add the probe with a longer 5s timeout and 15s initial delay, giving
Loki up to ~105s to pass before the Deployment is marked Available and
the setup.sh seed step begins.

Also:
- Remove stale 02-assert-notimplemented.yaml (controller no longer
  returns NotImplemented; file was unused by any chainsaw-test.yaml)
- Fix test/e2e/README.md: wrong make target, wrong scenario names,
  missing scenarios (cachedimageset-discovery, discovery-failure)
- Update Makefile e2e-infra comment and CI step name to include Loki

2 participants