From 6705ccb92aa17b0f823503ded66e02b167e057ac Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Sat, 27 Jun 2026 11:19:44 +0000 Subject: [PATCH 01/35] Initial plan From fe2f81ab6d915880207707693362b6bf5828e4be Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Sat, 27 Jun 2026 11:44:25 +0000 Subject: [PATCH 02/35] feat(discovery): replace sources[] with query/signal/ranking pipeline (Issue 1: CRD types) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Implements the first issue of the three-stage discovery pipeline redesign documented in docs/decisions/13-discovery-signals-ranking.md. ## Breaking Changes - Removed: spec.sources[], DiscoverySource, PrometheusSource (API type), RegistrySource, status.sourceCount - Added: spec.queries[], spec.signals[], spec.ranking (new three-stage pipeline) - DiscoveredImage: removed score/source fields, added rank/finalScore/selected/ signals/ranking breakdown fields ## New API Types Query stage: - DiscoveryQuery (prometheus | loki) - DiscoveryPrometheusQuery, DiscoveryLokiQuery, LokiParser Signal stage (4 types): - aggregate, timeWeightedAggregate, windowAggregate, eventPullTime Ranking stage (3 strategies): - signal, weightedSum (minMax normalized), modelExposure (cold-node exposure) Status: - QueryResult[], SignalResult[] — per-query/signal observability - Rich DiscoveredImage with signals[] and ranking breakdown ## Other Changes - Regenerated deepcopy and CRD manifests - Stubbed controller: sets Ready=False/NotImplemented until Issues 2-10 land - Removed internal/discovery/registry.go (registry source retired) - Removed test/e2e/discovery-aggregation/ and discovery-registry/ (retired) - Updated all e2e tests to new schema, assert NotImplemented condition - Rewrote docs/content/docs/discovery.md with full pipeline explanation - Regenerated AI docs (knowledge.yaml, llms.txt, llms-full.txt) Closes #55 --- .github/copilot-instructions.md | 2 +- api/v1alpha1/discoverypolicy_types.go | 619 ++++++++++++--- api/v1alpha1/zz_generated.deepcopy.go | 468 ++++++++++- .../drop.corewire.io_discoverypolicies.yaml | 728 ++++++++++++++--- .../drop_v1alpha1_discoverypolicy.yaml | 56 +- docs/content/docs/developing/architecture.md | 11 +- docs/content/docs/discovery.md | 541 +++++++++---- .../docs/reference/_generated_architecture.md | 1 - .../content/docs/reference/_generated_crds.md | 255 +++++- docs/static/llms-full.txt | 373 +++++++-- hack/dev-samples.yaml | 110 +-- .../controller/discoverypolicy_controller.go | 378 +-------- .../discoverypolicy_controller_test.go | 23 +- internal/discovery/registry.go | 159 ---- internal/discovery/registry_test.go | 93 --- knowledge.yaml | 732 ++++++++++++++---- llms-full.txt | 373 +++++++-- .../02-discoverypolicy.yaml | 26 +- .../03-assert-discovery-ready.yaml | 6 +- .../05-assert-children.yaml | 13 - .../06-assert-set-status.yaml | 8 - .../chainsaw-test.yaml | 47 +- .../01-discoverypolicies.yaml | 108 --- .../02-assert-count.yaml | 12 - .../discovery-aggregation/03-assert-avg.yaml | 12 - .../discovery-aggregation/04-assert-max.yaml | 12 - .../discovery-aggregation/05-assert-sum.yaml | 12 - .../06-assert-instant.yaml | 11 - .../discovery-aggregation/07-assert-none.yaml | 11 - .../discovery-aggregation/chainsaw-test.yaml | 108 --- .../01-broken-prometheus.yaml | 15 +- .../02-assert-notimplemented.yaml | 10 + .../discovery-failure/02-broken-registry.yaml | 13 - .../03-notfound-registry.yaml | 13 - .../05-assert-dns-registry.yaml | 9 - .../discovery-failure/06-assert-notfound.yaml | 8 - test/e2e/discovery-failure/chainsaw-test.yaml | 41 +- .../01-discoverypolicy.yaml | 14 - .../02-assert-discovery-status.yaml | 11 - .../e2e/discovery-registry/chainsaw-test.yaml | 26 - test/e2e/discovery/01-discoverypolicy.yaml | 18 +- .../discovery/02-assert-discovery-status.yaml | 10 +- test/e2e/discovery/chainsaw-test.yaml | 26 +- 43 files changed, 3580 insertions(+), 1942 deletions(-) delete mode 100644 internal/discovery/registry.go delete mode 100644 internal/discovery/registry_test.go delete mode 100644 test/e2e/cachedimageset-discovery/05-assert-children.yaml delete mode 100644 test/e2e/cachedimageset-discovery/06-assert-set-status.yaml delete mode 100644 test/e2e/discovery-aggregation/01-discoverypolicies.yaml delete mode 100644 test/e2e/discovery-aggregation/02-assert-count.yaml delete mode 100644 test/e2e/discovery-aggregation/03-assert-avg.yaml delete mode 100644 test/e2e/discovery-aggregation/04-assert-max.yaml delete mode 100644 test/e2e/discovery-aggregation/05-assert-sum.yaml delete mode 100644 test/e2e/discovery-aggregation/06-assert-instant.yaml delete mode 100644 test/e2e/discovery-aggregation/07-assert-none.yaml delete mode 100644 test/e2e/discovery-aggregation/chainsaw-test.yaml create mode 100644 test/e2e/discovery-failure/02-assert-notimplemented.yaml delete mode 100644 test/e2e/discovery-failure/02-broken-registry.yaml delete mode 100644 test/e2e/discovery-failure/03-notfound-registry.yaml delete mode 100644 test/e2e/discovery-failure/05-assert-dns-registry.yaml delete mode 100644 test/e2e/discovery-failure/06-assert-notfound.yaml delete mode 100644 test/e2e/discovery-registry/01-discoverypolicy.yaml delete mode 100644 test/e2e/discovery-registry/02-assert-discovery-status.yaml delete mode 100644 test/e2e/discovery-registry/chainsaw-test.yaml diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md index 600fc22..1b75c05 100644 --- a/.github/copilot-instructions.md +++ b/.github/copilot-instructions.md @@ -59,7 +59,7 @@ make docs-gen # regenerate AI docs from source ``` api/v1alpha1 — Package v1alpha1 contains API Schema definitions for the drop v1alpha1 API group. internal/controller — Package controller implements Kubernetes reconcilers for the drop CRDs (one per Kind). - imports: api/v1alpha1, internal/discovery, internal/metrics, internal/pacing, internal/podbuilder + imports: api/v1alpha1, internal/metrics, internal/pacing, internal/podbuilder internal/discovery — Package discovery implements image discovery from registries and Prometheus metrics. imports: api/v1alpha1 internal/metrics — Package metrics registers Prometheus metrics for the drop operator. diff --git a/api/v1alpha1/discoverypolicy_types.go b/api/v1alpha1/discoverypolicy_types.go index 14b87fd..6752ebe 100644 --- a/api/v1alpha1/discoverypolicy_types.go +++ b/api/v1alpha1/discoverypolicy_types.go @@ -8,21 +8,28 @@ package v1alpha1 import ( corev1 "k8s.io/api/core/v1" + "k8s.io/apimachinery/pkg/api/resource" metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" "k8s.io/apimachinery/pkg/runtime" ) // DiscoveryPolicySpec defines the desired state of DiscoveryPolicy. type DiscoveryPolicySpec struct { - // Sources is the list of discovery backends to query. At least one source is required. - // Multiple sources are merged and ranked together before maxImages is applied. - // +kubebuilder:validation:MinItems=1 - Sources []DiscoverySource `json:"sources"` + // Queries is the list of named raw-data sources. Each query is referenced by name from signals. + // +optional + Queries []DiscoveryQuery `json:"queries,omitempty"` + // Signals is the list of named per-image metrics derived from query results. + // Each signal is referenced by name from the ranking configuration. + // +optional + Signals []DiscoverySignal `json:"signals,omitempty"` + // Ranking defines how signals are combined into a final ordered image list. + // +optional + Ranking *DiscoveryRanking `json:"ranking,omitempty"` // ImageFilter is a regex applied to discovered image references. Only matching images are kept. // Example: "registry.example.com/team/.*" (only keep images from that registry path) // +optional ImageFilter string `json:"imageFilter,omitempty"` - // SyncInterval is how often the operator re-queries all sources and updates status.discoveredImages. + // SyncInterval is how often the operator re-runs the pipeline and updates status.discoveredImages. // Default: "30m". Example: "1h", "15m" // +kubebuilder:default="30m" SyncInterval metav1.Duration `json:"syncInterval,omitempty"` @@ -34,161 +41,579 @@ type DiscoveryPolicySpec struct { MaxImages int32 `json:"maxImages,omitempty"` } -// DiscoverySource defines a single discovery backend. -type DiscoverySource struct { - // Type identifies the discovery backend. Must be "prometheus" or "registry". - // +kubebuilder:validation:Enum=prometheus;registry - Type string `json:"type"` +// ============================================================ +// Stage 1 — Queries +// ============================================================ + +// DiscoveryQueryType identifies the backend for a named query. +// +kubebuilder:validation:Enum=prometheus;loki +type DiscoveryQueryType string + +const ( + // DiscoveryQueryTypePrometheus fetches time-series data from a Prometheus-compatible API. + DiscoveryQueryTypePrometheus DiscoveryQueryType = "prometheus" + // DiscoveryQueryTypeLoki fetches log event data from a Loki-compatible API. + DiscoveryQueryTypeLoki DiscoveryQueryType = "loki" +) + +// DiscoveryQuery defines a named raw-data source referenced by signals. +type DiscoveryQuery struct { + // Name is the unique identifier for this query within the policy. + // Signals reference queries by this name via queryRef. + // +kubebuilder:validation:MinLength=1 + Name string `json:"name"` + // Type selects the backend. Must be "prometheus" or "loki". + // +kubebuilder:validation:Enum=prometheus;loki + Type DiscoveryQueryType `json:"type"` // Prometheus contains the configuration when type=prometheus. // +optional - Prometheus *PrometheusSource `json:"prometheus,omitempty"` - // Registry contains the configuration when type=registry. + Prometheus *DiscoveryPrometheusQuery `json:"prometheus,omitempty"` + // Loki contains the configuration when type=loki. // +optional - Registry *RegistrySource `json:"registry,omitempty"` - // SecretRef references a Secret in the namespace where Drop creates pull Pods. - // The default namespace is "drop-system" unless the controller is started with a different --pod-namespace. + Loki *DiscoveryLokiQuery `json:"loki,omitempty"` + // SecretRef references a Secret in the pod namespace (default "drop-system") for auth/TLS. // Supported Secret keys: token, username, password, ca.crt, tls.crt, tls.key, headers.. - // Example: {name: "prometheus-creds"} // +optional SecretRef *corev1.LocalObjectReference `json:"secretRef,omitempty"` } -// AggregationMethod defines how range query values are aggregated into a score. -// +kubebuilder:validation:Enum=sum;count;avg;max -type AggregationMethod string - -const ( - // AggregationSum adds all data-point values over the lookback window. - // Use when the query returns a gauge/counter and the total magnitude matters - // (e.g., total memory usage across the window). - AggregationSum AggregationMethod = "sum" - // AggregationCount counts the number of non-zero data points over the lookback window. - // Use when you want to rank by how frequently an image appears - // (e.g., number of sample intervals where the image was running). - AggregationCount AggregationMethod = "count" - // AggregationAvg computes the arithmetic mean of all data-point values. - // Use when you want the average magnitude regardless of how many samples exist. - AggregationAvg AggregationMethod = "avg" - // AggregationMax takes the highest single data-point value. - // Use when peak usage is more relevant than cumulative usage. - AggregationMax AggregationMethod = "max" -) - // QueryType defines how the Prometheus query is executed. // +kubebuilder:validation:Enum=range;instant type QueryType string const ( // QueryTypeRange uses /api/v1/query_range with a time window defined by lookback. - // Returns multiple data points which are aggregated using the aggregationMethod. + // Returns multiple data points which are aggregated at the signal stage. QueryTypeRange QueryType = "range" // QueryTypeInstant uses /api/v1/query for a single point-in-time result. - // The returned value is used directly as the score. + // The returned value is used directly as the raw sample value. QueryTypeInstant QueryType = "instant" ) -// PrometheusSource defines Prometheus query configuration for image discovery. -type PrometheusSource struct { +// DiscoveryPrometheusQuery defines the Prometheus-specific query parameters. +// The PromQL result MUST carry an "image" label; that label value is the image reference. +type DiscoveryPrometheusQuery struct { // Endpoint is the Prometheus-compatible API URL (Prometheus, Thanos, Mimir, VictoriaMetrics). // Example: "http://prometheus.monitoring.svc:9090", "https://mimir.example.com" // +kubebuilder:validation:MinLength=1 Endpoint string `json:"endpoint"` - // Query is the PromQL expression. It MUST return results with an "image" label — - // that label value is used as the discovered image reference. - // The query result value is used as the ranking score (higher = more relevant). - // Example: count(container_memory_working_set_bytes{container!="",container!="POD",namespace="gitlab-runner"}) by (image) + // Query is the PromQL expression. Must return results with an "image" label. + // Example: count(container_memory_working_set_bytes{namespace="gitlab-runner"}) by (image) // +kubebuilder:validation:MinLength=1 Query string `json:"query"` - // QueryType controls how the Prometheus query is executed. - // "range" uses /api/v1/query_range with a time window defined by lookback. - // "instant" uses /api/v1/query for a single point-in-time result. - // Default: "range". + // QueryType controls how the query is executed: "range" or "instant". Default: "range". // +kubebuilder:default="range" // +optional QueryType QueryType `json:"queryType,omitempty"` - // Lookback is the time window for range queries. When queryType is "range", - // the operator queries (start=now-lookback, end=now) and aggregates all returned values per image. - // The aggregation function is controlled by the aggregationMethod field. + // Lookback is the time window for range queries (start=now-lookback, end=now). // Required when queryType is "range". Ignored when queryType is "instant". // Example: "168h" (7 days), "24h", "72h" // +optional Lookback *metav1.Duration `json:"lookback,omitempty"` - // AggregationMethod controls how data points from a range query are combined into a single score. - // Only used when queryType is "range". Ignored for instant queries. - // When not set (nil), Drop uses the last data-point value directly — use this when your PromQL - // already contains aggregation functions (e.g., count_over_time, topk). - // Options: "sum", "count", "avg", "max" - // +optional - AggregationMethod *AggregationMethod `json:"aggregationMethod,omitempty"` - // Step is the resolution step for range queries (only used when lookback is set). - // Smaller steps = more data points = more accurate aggregation but higher Prometheus load. + // Step is the resolution step for range queries. + // Smaller steps increase data-point density but also increase Prometheus load. // Default: 5m. Example: "1m", "15m" // +optional Step *metav1.Duration `json:"step,omitempty"` } -// RegistrySource defines OCI registry tag listing configuration for image discovery. -type RegistrySource struct { - // URL is the registry base URL (without repository path). - // Example: "https://registry.example.com", "https://ghcr.io" +// LokiQueryType defines how the Loki query is executed. +// +kubebuilder:validation:Enum=range +type LokiQueryType string + +const ( + // LokiQueryTypeRange uses /loki/api/v1/query_range with a lookback window. + LokiQueryTypeRange LokiQueryType = "range" +) + +// DiscoveryLokiQuery defines the Loki-specific query parameters. +type DiscoveryLokiQuery struct { + // Endpoint is the Loki API URL. + // Example: "https://loki.example.com" // +kubebuilder:validation:MinLength=1 - URL string `json:"url"` - // Repositories is the list of repository paths to list tags from. - // Example: ["team/app", "team/worker", "infra/tools"] + Endpoint string `json:"endpoint"` + // Query is the LogQL expression. + // +kubebuilder:validation:MinLength=1 + Query string `json:"query"` + // QueryType controls how the query is executed. Currently only "range" is supported. + // +kubebuilder:default="range" + // +optional + QueryType LokiQueryType `json:"queryType,omitempty"` + // Lookback is the time window for the query (start=now-lookback, end=now). + // Example: "168h" (7 days), "24h" + // +optional + Lookback *metav1.Duration `json:"lookback,omitempty"` + // Parser configures how log lines are parsed into structured event records. + // +optional + Parser *LokiParser `json:"parser,omitempty"` +} + +// LokiParserType identifies how Loki log lines are parsed. +// +kubebuilder:validation:Enum=kubernetesEvents +type LokiParserType string + +const ( + // LokiParserTypeKubernetesEvents parses Kubernetes Event log lines, + // extracting pod name, reason, message, and image reference. + LokiParserTypeKubernetesEvents LokiParserType = "kubernetesEvents" +) + +// LokiParser configures structured parsing of Loki log entries. +type LokiParser struct { + // Type selects the parser. Currently only "kubernetesEvents" is supported. + // +kubebuilder:validation:Enum=kubernetesEvents + Type LokiParserType `json:"type"` + // PodField is the log label or field that contains the pod name. + // Example: "involvedObject_name" + // +optional + PodField string `json:"podField,omitempty"` + // ReasonField is the log label or field that contains the event reason. + // Example: "reason" + // +optional + ReasonField string `json:"reasonField,omitempty"` + // MessageField is the log label or field that contains the event message. + // Example: "message" + // +optional + MessageField string `json:"messageField,omitempty"` + // ImageField is the log label or field from which the image reference is extracted. + // For kubernetesEvents, the image is parsed out of the message text. + // Example: "message" + // +optional + ImageField string `json:"imageField,omitempty"` +} + +// ============================================================ +// Stage 2 — Signals +// ============================================================ + +// SignalType identifies the derivation method for a named signal. +// +kubebuilder:validation:Enum=aggregate;timeWeightedAggregate;windowAggregate;eventPullTime +type SignalType string + +const ( + // SignalTypeAggregate aggregates all samples per image using a single method (sum, max, avg, count, min). + SignalTypeAggregate SignalType = "aggregate" + // SignalTypeTimeWeightedAggregate applies per-hour-window weights before aggregation. + SignalTypeTimeWeightedAggregate SignalType = "timeWeightedAggregate" + // SignalTypeWindowAggregate aggregates only the samples within a specific time sub-window. + SignalTypeWindowAggregate SignalType = "windowAggregate" + // SignalTypeEventPullTime derives image pull-time statistics from Loki event records. + SignalTypeEventPullTime SignalType = "eventPullTime" +) + +// AggregationMethod defines how data-point values are combined into a single per-image number. +// +kubebuilder:validation:Enum=sum;count;avg;max;min +type AggregationMethod string + +const ( + // AggregationSum adds all data-point values. + AggregationSum AggregationMethod = "sum" + // AggregationCount counts the number of data points. + AggregationCount AggregationMethod = "count" + // AggregationAvg computes the arithmetic mean of all data-point values. + AggregationAvg AggregationMethod = "avg" + // AggregationMax takes the highest single data-point value. + AggregationMax AggregationMethod = "max" + // AggregationMin takes the lowest single data-point value. + AggregationMin AggregationMethod = "min" +) + +// DiscoverySignal defines a named per-image metric derived from a single query. +type DiscoverySignal struct { + // Name is the unique identifier for this signal within the policy. + // Ranking configurations reference signals by this name. + // +kubebuilder:validation:MinLength=1 + Name string `json:"name"` + // QueryRef is the name of the query that provides raw data for this signal. + // Must match a queries[].name within the same policy. + // +kubebuilder:validation:MinLength=1 + QueryRef string `json:"queryRef"` + // Type selects the signal derivation method. + // +kubebuilder:validation:Enum=aggregate;timeWeightedAggregate;windowAggregate;eventPullTime + Type SignalType `json:"type"` + // Aggregate is required when type=aggregate. + // +optional + Aggregate *AggregateSignalConfig `json:"aggregate,omitempty"` + // TimeWeightedAggregate is required when type=timeWeightedAggregate. + // +optional + TimeWeightedAggregate *TimeWeightedAggregateSignalConfig `json:"timeWeightedAggregate,omitempty"` + // WindowAggregate is required when type=windowAggregate. + // +optional + WindowAggregate *WindowAggregateSignalConfig `json:"windowAggregate,omitempty"` + // EventPullTime is required when type=eventPullTime. + // +optional + EventPullTime *EventPullTimeSignalConfig `json:"eventPullTime,omitempty"` +} + +// AggregateSignalConfig configures the aggregate signal type. +type AggregateSignalConfig struct { + // Method is the aggregation function applied to all samples per image. + // +kubebuilder:validation:Enum=sum;count;avg;max;min + Method AggregationMethod `json:"method"` +} + +// TimeWeightedAggregateSignalConfig configures the timeWeightedAggregate signal type. +// Each sample value is multiplied by the weight of the matching time window before aggregation. +type TimeWeightedAggregateSignalConfig struct { + // Method is the aggregation function applied after weighting (currently only "sum" is meaningful). + // +kubebuilder:validation:Enum=sum;count;avg;max;min + Method AggregationMethod `json:"method"` + // Timezone is the IANA time zone used to evaluate window boundaries (wall-clock hours). + // Example: "Europe/Berlin", "America/New_York", "UTC" + // +kubebuilder:validation:MinLength=1 + Timezone string `json:"timezone"` + // DefaultWeight is applied to samples that do not fall in any configured window. + // Use "0" to exclude off-hours samples entirely. + DefaultWeight resource.Quantity `json:"defaultWeight"` + // Windows is the list of hour-of-day windows with associated weights. // +kubebuilder:validation:MinItems=1 - Repositories []string `json:"repositories"` - // TagFilter is a regex applied to tag names. Only matching tags are discovered. - // Example: "^v[0-9]+\\." (semver tags only), "^main-" (main branch builds) + Windows []TimeWeightedWindow `json:"windows"` +} + +// TimeWeightedWindow defines a wall-clock hour range and its weight factor. +type TimeWeightedWindow struct { + // StartHour is the inclusive start of the window in local time (0–23). + // +kubebuilder:validation:Minimum=0 + // +kubebuilder:validation:Maximum=23 + StartHour int32 `json:"startHour"` + // EndHour is the exclusive end of the window in local time (1–24). + // +kubebuilder:validation:Minimum=1 + // +kubebuilder:validation:Maximum=24 + EndHour int32 `json:"endHour"` + // Weight is the factor applied to sample values within this window. + // Use "1.0" for full weight, "0.3" for partial, "0" to exclude. + Weight resource.Quantity `json:"weight"` +} + +// WindowAggregateSignalConfig configures the windowAggregate signal type. +// Exactly one of relativeWindow or (window + timezone) must be set. +type WindowAggregateSignalConfig struct { + // Method is the aggregation function applied to the windowed samples. + // +kubebuilder:validation:Enum=sum;count;avg;max;min + Method AggregationMethod `json:"method"` + // RelativeWindow aggregates only samples from the last N duration before now. + // Mutually exclusive with window + timezone. + // Example: "2h" (last 2 hours) // +optional - TagFilter string `json:"tagFilter,omitempty"` - // TopX limits the number of tags kept per repository after tagFilter is applied. - // The registry API does not provide creation timestamps here; Drop keeps the last N tags returned by the registry. - // Example: 3 (keep the last 3 matching tags returned per repo) + RelativeWindow *metav1.Duration `json:"relativeWindow,omitempty"` + // Timezone is the IANA time zone for evaluating wall-clock window boundaries. + // Required when window is set. // +optional + Timezone string `json:"timezone,omitempty"` + // Window defines fixed wall-clock start/end times within each day. + // Mutually exclusive with relativeWindow. + // +optional + Window *TimeOfDayWindow `json:"window,omitempty"` +} + +// TimeOfDayWindow defines a fixed wall-clock time range within each day. +type TimeOfDayWindow struct { + // Start is the inclusive start time in "HH:MM" format (24-hour, local time). + // Example: "09:00" + // +kubebuilder:validation:Pattern=`^([01][0-9]|2[0-3]):[0-5][0-9]$` + Start string `json:"start"` + // End is the exclusive end time in "HH:MM" format (24-hour, local time). + // Example: "17:00" + // +kubebuilder:validation:Pattern=`^([01][0-9]|2[0-3]):[0-5][0-9]$` + End string `json:"end"` +} + +// EventPullTimeStatistic defines which pull-time statistic to derive from event records. +// +kubebuilder:validation:Enum=p50;p90;p95;avg;max;count;failureCount;cacheHitCount +type EventPullTimeStatistic string + +const ( + // EventPullTimeStatisticP50 is the median cold-pull duration. + EventPullTimeStatisticP50 EventPullTimeStatistic = "p50" + // EventPullTimeStatisticP90 is the 90th-percentile cold-pull duration. + EventPullTimeStatisticP90 EventPullTimeStatistic = "p90" + // EventPullTimeStatisticP95 is the 95th-percentile cold-pull duration. + EventPullTimeStatisticP95 EventPullTimeStatistic = "p95" + // EventPullTimeStatisticAvg is the mean cold-pull duration. + EventPullTimeStatisticAvg EventPullTimeStatistic = "avg" + // EventPullTimeStatisticMax is the maximum observed cold-pull duration. + EventPullTimeStatisticMax EventPullTimeStatistic = "max" + // EventPullTimeStatisticCount is the total number of cold-pull events. + EventPullTimeStatisticCount EventPullTimeStatistic = "count" + // EventPullTimeStatisticFailureCount is the total number of pull failures. + EventPullTimeStatisticFailureCount EventPullTimeStatistic = "failureCount" + // EventPullTimeStatisticCacheHitCount is the number of cache-hit events. + EventPullTimeStatisticCacheHitCount EventPullTimeStatistic = "cacheHitCount" +) + +// DurationMode defines how pull duration is extracted from event records. +// +kubebuilder:validation:Enum=eventPair;messageDuration +type DurationMode string + +const ( + // DurationModeEventPair computes duration as Pulled.timestamp - Pulling.timestamp + // for the same Pod/image pair. + DurationModeEventPair DurationMode = "eventPair" + // DurationModeMessageDuration parses the duration directly from the Pulled event message + // (e.g., "Successfully pulled image ... in 42.3s"). + DurationModeMessageDuration DurationMode = "messageDuration" +) + +// EventPullTimeSignalConfig configures the eventPullTime signal type. +// The referenced query must be a Loki query. +type EventPullTimeSignalConfig struct { + // Statistic selects which pull-time metric to compute. + // +kubebuilder:validation:Enum=p50;p90;p95;avg;max;count;failureCount;cacheHitCount + Statistic EventPullTimeStatistic `json:"statistic"` + // IncludeCacheHits controls whether "already present on machine" events are included + // in cold-pull duration statistics. Set to false to exclude cache hits. + // +kubebuilder:default=false + IncludeCacheHits bool `json:"includeCacheHits"` + // DurationMode controls how pull duration is extracted from event records. + // +kubebuilder:validation:Enum=eventPair;messageDuration + DurationMode DurationMode `json:"durationMode"` +} + +// ============================================================ +// Stage 3 — Ranking +// ============================================================ + +// RankingStrategy identifies which ranking algorithm is applied. +// +kubebuilder:validation:Enum=signal;weightedSum;modelExposure +type RankingStrategy string + +const ( + // RankingStrategySignal ranks images directly by the value of a single signal. + RankingStrategySignal RankingStrategy = "signal" + // RankingStrategyWeightedSum combines normalized signals using a weighted sum. + RankingStrategyWeightedSum RankingStrategy = "weightedSum" + // RankingStrategyModelExposure ranks images by expected post-rotation cold-node exposure. + RankingStrategyModelExposure RankingStrategy = "modelExposure" +) + +// DiscoveryRanking defines how signals are combined into the final ordered image list. +type DiscoveryRanking struct { + // Strategy selects the ranking algorithm. + // +kubebuilder:validation:Enum=signal;weightedSum;modelExposure + Strategy RankingStrategy `json:"strategy"` + // Signal is required when strategy=signal. + // +optional + Signal *SignalRankingConfig `json:"signal,omitempty"` + // WeightedSum is required when strategy=weightedSum. + // +optional + WeightedSum *WeightedSumRankingConfig `json:"weightedSum,omitempty"` + // ModelExposure is required when strategy=modelExposure. + // +optional + ModelExposure *ModelExposureRankingConfig `json:"modelExposure,omitempty"` +} + +// SignalRankingConfig configures the signal ranking strategy. +type SignalRankingConfig struct { + // SignalRef is the name of the signal whose values determine image rank. + // Must match a signals[].name within the same policy. + // +kubebuilder:validation:MinLength=1 + SignalRef string `json:"signalRef"` +} + +// NormalizeMethod defines how signal values are normalized before weighted combination. +// +kubebuilder:validation:Enum=minMax +type NormalizeMethod string + +const ( + // NormalizeMethodMinMax applies min-max normalization: (x - min) / (max - min). + // When all values are equal, normalized(x) = 1. + NormalizeMethodMinMax NormalizeMethod = "minMax" +) + +// MissingSignalBehavior defines what happens when an image has no value for a required signal. +// +kubebuilder:validation:Enum=zero;drop +type MissingSignalBehavior string + +const ( + // MissingSignalBehaviorZero treats a missing signal value as zero. + MissingSignalBehaviorZero MissingSignalBehavior = "zero" + // MissingSignalBehaviorDrop removes the image from ranking if any required signal is missing. + MissingSignalBehaviorDrop MissingSignalBehavior = "drop" +) + +// WeightedSumTerm defines one signal contribution in a weightedSum ranking. +type WeightedSumTerm struct { + // SignalRef is the name of the signal to include in the weighted sum. + // Must match a signals[].name within the same policy. + // +kubebuilder:validation:MinLength=1 + SignalRef string `json:"signalRef"` + // Weight is the factor applied to the normalized signal value. + // All weights should be non-negative; they do not need to sum to 1. + // Example: "0.7" + Weight resource.Quantity `json:"weight"` +} + +// WeightedSumRankingConfig configures the weightedSum ranking strategy. +// Score = Σ weight_k * normalize(signal_k(image)). +type WeightedSumRankingConfig struct { + // Normalize selects the normalization method applied to each signal before weighting. + // Currently only "minMax" is supported. + // +kubebuilder:validation:Enum=minMax + // +kubebuilder:default="minMax" + Normalize NormalizeMethod `json:"normalize"` + // MissingSignal controls behavior when an image has no value for a required signal. + // "zero" treats missing as 0; "drop" removes the image from ranking. + // +kubebuilder:validation:Enum=zero;drop + // +kubebuilder:default="zero" + MissingSignal MissingSignalBehavior `json:"missingSignal"` + // Terms is the list of signals and their weights. + // +kubebuilder:validation:MinItems=1 + Terms []WeightedSumTerm `json:"terms"` +} + +// ModelExposureRankingConfig configures the modelExposure ranking strategy. +// Score = J_target(I) * (1 - 1/N)^J_pre(I) * p_hat(I) +// where N=nodeCount, J_pre is pre-window usage, J_target is target-window usage, +// and p_hat is the pull-time signal value. +type ModelExposureRankingConfig struct { + // NodeCount is the number of eligible CI nodes (N in the exposure formula). // +kubebuilder:validation:Minimum=1 - TopX int32 `json:"topX,omitempty"` - // ImageTemplate is a Go text/template for constructing the full image reference from discovered tags. - // Available variables: {{.Registry}}, {{.Repository}}, {{.Tag}} - // Default (when unset): "{{.Registry}}/{{.Repository}}:{{.Tag}}" - // Example: "{{.Registry}}/{{.Repository}}@{{.Tag}}" (if tags are actually digests) + NodeCount int32 `json:"nodeCount"` + // PreWindowUsageSignalRef is the name of the signal representing usage before the target window. + // Must match a signals[].name within the same policy. + // +kubebuilder:validation:MinLength=1 + PreWindowUsageSignalRef string `json:"preWindowUsageSignalRef"` + // TargetWindowUsageSignalRef is the name of the signal representing usage during the target window. + // Must match a signals[].name within the same policy. + // +kubebuilder:validation:MinLength=1 + TargetWindowUsageSignalRef string `json:"targetWindowUsageSignalRef"` + // PullTimeSignalRef is the name of the signal providing per-image pull-time estimates. + // Must match a signals[].name within the same policy. + // +kubebuilder:validation:MinLength=1 + PullTimeSignalRef string `json:"pullTimeSignalRef"` +} + +// ============================================================ +// Status +// ============================================================ + +// QueryResultStatus reports whether a named query succeeded or failed. +// +kubebuilder:validation:Enum=success;failed +type QueryResultStatus string + +const ( + // QueryResultStatusSuccess indicates the query executed without errors. + QueryResultStatusSuccess QueryResultStatus = "success" + // QueryResultStatusFailed indicates the query encountered an error. + QueryResultStatusFailed QueryResultStatus = "failed" +) + +// QueryResult reports the outcome of a single named query execution. +type QueryResult struct { + // Name matches the queries[].name that produced this result. + Name string `json:"name"` + // Type is the query backend type (prometheus or loki). + Type DiscoveryQueryType `json:"type"` + // Series is the number of time-series returned (Prometheus queries only). + // +optional + Series *int32 `json:"series,omitempty"` + // Samples is the total number of data points across all series (Prometheus range queries only). + // +optional + Samples *int64 `json:"samples,omitempty"` + // Records is the number of log records returned (Loki queries only). + // +optional + Records *int64 `json:"records,omitempty"` + // Status is "success" or "failed". + Status QueryResultStatus `json:"status"` + // Message describes the failure reason when status=failed. + // +optional + Message string `json:"message,omitempty"` +} + +// SignalResult reports the outcome of a single signal derivation. +type SignalResult struct { + // Name matches the signals[].name that produced this result. + Name string `json:"name"` + // Images is the number of images for which this signal produced a value. + Images int32 `json:"images"` + // Status is "success" or "failed". + Status string `json:"status"` + // Message describes the failure reason when status=failed. + // +optional + Message string `json:"message,omitempty"` +} + +// ImageSignalValue records the raw and normalized value of a signal for one image. +type ImageSignalValue struct { + // Name is the signal name. + Name string `json:"name"` + // RawValue is the unscaled signal value as a decimal string. + RawValue string `json:"rawValue"` + // NormalizedValue is the normalized value (after minMax or other normalization) as a decimal string. + // Only populated for signals used in a weightedSum ranking. // +optional - ImageTemplate string `json:"imageTemplate,omitempty"` + NormalizedValue string `json:"normalizedValue,omitempty"` +} + +// RankingTerm records the contribution of one signal to the final score of an image. +type RankingTerm struct { + // Signal is the signal name. + Signal string `json:"signal"` + // Weight is the configured weight as a decimal string. + Weight string `json:"weight"` + // Contribution is weight * normalizedValue as a decimal string. + Contribution string `json:"contribution"` +} + +// ImageRankingDetail explains how the final score was computed for one image. +type ImageRankingDetail struct { + // Strategy is the ranking strategy that produced this detail. + Strategy string `json:"strategy"` + // Terms lists the per-signal contributions (populated for weightedSum and modelExposure). + // +optional + Terms []RankingTerm `json:"terms,omitempty"` +} + +// DiscoveredImage represents a single discovered and ranked image. +type DiscoveredImage struct { + // Image is the fully qualified image reference. + Image string `json:"image"` + // Rank is the position of this image in the final ordered list (1 = highest score). + Rank int32 `json:"rank"` + // FinalScore is the computed ranking score as a decimal string. + FinalScore string `json:"finalScore"` + // Selected is true when this image is within the maxImages cap and will be + // propagated to dependent CachedImageSet resources. + Selected bool `json:"selected"` + // Signals lists the per-signal values used during ranking (for observability). + // +optional + Signals []ImageSignalValue `json:"signals,omitempty"` + // Ranking explains how the final score was computed. + // +optional + Ranking *ImageRankingDetail `json:"ranking,omitempty"` } // DiscoveryPolicyStatus defines the observed state of DiscoveryPolicy. type DiscoveryPolicyStatus struct { - // LastSyncTime is the timestamp of the last successful sync. + // LastSyncTime is the timestamp of the last reconciliation attempt. // +optional LastSyncTime *metav1.Time `json:"lastSyncTime,omitempty"` - // DiscoveredImages is the list of discovered images from all sources. + // QueryResults reports the outcome of each named query execution. + // +optional + QueryResults []QueryResult `json:"queryResults,omitempty"` + // SignalResults reports the outcome of each signal derivation. + // +optional + SignalResults []SignalResult `json:"signalResults,omitempty"` + // DiscoveredImages is the ordered list of discovered and ranked images. + // Only images with selected=true are propagated to dependent CachedImageSet resources. // +optional DiscoveredImages []DiscoveredImage `json:"discoveredImages,omitempty"` - // ImageCount is the number of discovered images. + // ImageCount is the number of selected discovered images. // +optional ImageCount int32 `json:"imageCount,omitempty"` - // SourceCount is the number of configured sources. + // QueryCount is the number of configured queries. // +optional - SourceCount int32 `json:"sourceCount,omitempty"` + QueryCount int32 `json:"queryCount,omitempty"` // Conditions represent the latest available observations. // +optional Conditions []metav1.Condition `json:"conditions,omitempty"` } -// DiscoveredImage represents a single discovered image with metadata. -type DiscoveredImage struct { - // Image is the fully qualified image reference. - Image string `json:"image"` - // Score is the ranking score from the source (higher = more relevant). - Score int64 `json:"score"` - // Source identifies which discovery source produced this image. - Source string `json:"source"` -} - // +kubebuilder:object:root=true // +kubebuilder:subresource:status // +kubebuilder:resource:scope=Cluster,categories=drop // +kubebuilder:printcolumn:name="Status",type=string,JSONPath=`.status.conditions[?(@.type=="Ready")].reason` -// +kubebuilder:printcolumn:name="Sources",type=integer,JSONPath=`.status.sourceCount` +// +kubebuilder:printcolumn:name="Queries",type=integer,JSONPath=`.status.queryCount` // +kubebuilder:printcolumn:name="Images",type=integer,JSONPath=`.status.imageCount` // +kubebuilder:printcolumn:name="LastSync",type=date,JSONPath=`.status.lastSyncTime` // +kubebuilder:printcolumn:name="Message",type=string,JSONPath=`.status.conditions[?(@.type=="Ready")].message`,priority=1 diff --git a/api/v1alpha1/zz_generated.deepcopy.go b/api/v1alpha1/zz_generated.deepcopy.go index eafb2e1..a8760a4 100644 --- a/api/v1alpha1/zz_generated.deepcopy.go +++ b/api/v1alpha1/zz_generated.deepcopy.go @@ -16,6 +16,21 @@ import ( "k8s.io/apimachinery/pkg/runtime" ) +// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. +func (in *AggregateSignalConfig) DeepCopyInto(out *AggregateSignalConfig) { + *out = *in +} + +// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new AggregateSignalConfig. +func (in *AggregateSignalConfig) DeepCopy() *AggregateSignalConfig { + if in == nil { + return nil + } + out := new(AggregateSignalConfig) + in.DeepCopyInto(out) + return out +} + // DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. func (in *BackoffConfig) DeepCopyInto(out *BackoffConfig) { *out = *in @@ -304,6 +319,16 @@ func (in *CachedImageStatus) DeepCopy() *CachedImageStatus { // DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. func (in *DiscoveredImage) DeepCopyInto(out *DiscoveredImage) { *out = *in + if in.Signals != nil { + in, out := &in.Signals, &out.Signals + *out = make([]ImageSignalValue, len(*in)) + copy(*out, *in) + } + if in.Ranking != nil { + in, out := &in.Ranking, &out.Ranking + *out = new(ImageRankingDetail) + (*in).DeepCopyInto(*out) + } } // DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new DiscoveredImage. @@ -316,6 +341,31 @@ func (in *DiscoveredImage) DeepCopy() *DiscoveredImage { return out } +// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. +func (in *DiscoveryLokiQuery) DeepCopyInto(out *DiscoveryLokiQuery) { + *out = *in + if in.Lookback != nil { + in, out := &in.Lookback, &out.Lookback + *out = new(metav1.Duration) + **out = **in + } + if in.Parser != nil { + in, out := &in.Parser, &out.Parser + *out = new(LokiParser) + **out = **in + } +} + +// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new DiscoveryLokiQuery. +func (in *DiscoveryLokiQuery) DeepCopy() *DiscoveryLokiQuery { + if in == nil { + return nil + } + out := new(DiscoveryLokiQuery) + in.DeepCopyInto(out) + return out +} + // DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. func (in *DiscoveryPolicy) DeepCopyInto(out *DiscoveryPolicy) { *out = *in @@ -393,13 +443,25 @@ func (in *DiscoveryPolicyReference) DeepCopy() *DiscoveryPolicyReference { // DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. func (in *DiscoveryPolicySpec) DeepCopyInto(out *DiscoveryPolicySpec) { *out = *in - if in.Sources != nil { - in, out := &in.Sources, &out.Sources - *out = make([]DiscoverySource, len(*in)) + if in.Queries != nil { + in, out := &in.Queries, &out.Queries + *out = make([]DiscoveryQuery, len(*in)) + for i := range *in { + (*in)[i].DeepCopyInto(&(*out)[i]) + } + } + if in.Signals != nil { + in, out := &in.Signals, &out.Signals + *out = make([]DiscoverySignal, len(*in)) for i := range *in { (*in)[i].DeepCopyInto(&(*out)[i]) } } + if in.Ranking != nil { + in, out := &in.Ranking, &out.Ranking + *out = new(DiscoveryRanking) + (*in).DeepCopyInto(*out) + } out.SyncInterval = in.SyncInterval } @@ -420,10 +482,24 @@ func (in *DiscoveryPolicyStatus) DeepCopyInto(out *DiscoveryPolicyStatus) { in, out := &in.LastSyncTime, &out.LastSyncTime *out = (*in).DeepCopy() } + if in.QueryResults != nil { + in, out := &in.QueryResults, &out.QueryResults + *out = make([]QueryResult, len(*in)) + for i := range *in { + (*in)[i].DeepCopyInto(&(*out)[i]) + } + } + if in.SignalResults != nil { + in, out := &in.SignalResults, &out.SignalResults + *out = make([]SignalResult, len(*in)) + copy(*out, *in) + } if in.DiscoveredImages != nil { in, out := &in.DiscoveredImages, &out.DiscoveredImages *out = make([]DiscoveredImage, len(*in)) - copy(*out, *in) + for i := range *in { + (*in)[i].DeepCopyInto(&(*out)[i]) + } } if in.Conditions != nil { in, out := &in.Conditions, &out.Conditions @@ -445,16 +521,41 @@ func (in *DiscoveryPolicyStatus) DeepCopy() *DiscoveryPolicyStatus { } // DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. -func (in *DiscoverySource) DeepCopyInto(out *DiscoverySource) { +func (in *DiscoveryPrometheusQuery) DeepCopyInto(out *DiscoveryPrometheusQuery) { + *out = *in + if in.Lookback != nil { + in, out := &in.Lookback, &out.Lookback + *out = new(metav1.Duration) + **out = **in + } + if in.Step != nil { + in, out := &in.Step, &out.Step + *out = new(metav1.Duration) + **out = **in + } +} + +// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new DiscoveryPrometheusQuery. +func (in *DiscoveryPrometheusQuery) DeepCopy() *DiscoveryPrometheusQuery { + if in == nil { + return nil + } + out := new(DiscoveryPrometheusQuery) + in.DeepCopyInto(out) + return out +} + +// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. +func (in *DiscoveryQuery) DeepCopyInto(out *DiscoveryQuery) { *out = *in if in.Prometheus != nil { in, out := &in.Prometheus, &out.Prometheus - *out = new(PrometheusSource) + *out = new(DiscoveryPrometheusQuery) (*in).DeepCopyInto(*out) } - if in.Registry != nil { - in, out := &in.Registry, &out.Registry - *out = new(RegistrySource) + if in.Loki != nil { + in, out := &in.Loki, &out.Loki + *out = new(DiscoveryLokiQuery) (*in).DeepCopyInto(*out) } if in.SecretRef != nil { @@ -464,12 +565,92 @@ func (in *DiscoverySource) DeepCopyInto(out *DiscoverySource) { } } -// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new DiscoverySource. -func (in *DiscoverySource) DeepCopy() *DiscoverySource { +// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new DiscoveryQuery. +func (in *DiscoveryQuery) DeepCopy() *DiscoveryQuery { if in == nil { return nil } - out := new(DiscoverySource) + out := new(DiscoveryQuery) + in.DeepCopyInto(out) + return out +} + +// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. +func (in *DiscoveryRanking) DeepCopyInto(out *DiscoveryRanking) { + *out = *in + if in.Signal != nil { + in, out := &in.Signal, &out.Signal + *out = new(SignalRankingConfig) + **out = **in + } + if in.WeightedSum != nil { + in, out := &in.WeightedSum, &out.WeightedSum + *out = new(WeightedSumRankingConfig) + (*in).DeepCopyInto(*out) + } + if in.ModelExposure != nil { + in, out := &in.ModelExposure, &out.ModelExposure + *out = new(ModelExposureRankingConfig) + **out = **in + } +} + +// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new DiscoveryRanking. +func (in *DiscoveryRanking) DeepCopy() *DiscoveryRanking { + if in == nil { + return nil + } + out := new(DiscoveryRanking) + in.DeepCopyInto(out) + return out +} + +// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. +func (in *DiscoverySignal) DeepCopyInto(out *DiscoverySignal) { + *out = *in + if in.Aggregate != nil { + in, out := &in.Aggregate, &out.Aggregate + *out = new(AggregateSignalConfig) + **out = **in + } + if in.TimeWeightedAggregate != nil { + in, out := &in.TimeWeightedAggregate, &out.TimeWeightedAggregate + *out = new(TimeWeightedAggregateSignalConfig) + (*in).DeepCopyInto(*out) + } + if in.WindowAggregate != nil { + in, out := &in.WindowAggregate, &out.WindowAggregate + *out = new(WindowAggregateSignalConfig) + (*in).DeepCopyInto(*out) + } + if in.EventPullTime != nil { + in, out := &in.EventPullTime, &out.EventPullTime + *out = new(EventPullTimeSignalConfig) + **out = **in + } +} + +// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new DiscoverySignal. +func (in *DiscoverySignal) DeepCopy() *DiscoverySignal { + if in == nil { + return nil + } + out := new(DiscoverySignal) + in.DeepCopyInto(out) + return out +} + +// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. +func (in *EventPullTimeSignalConfig) DeepCopyInto(out *EventPullTimeSignalConfig) { + *out = *in +} + +// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new EventPullTimeSignalConfig. +func (in *EventPullTimeSignalConfig) DeepCopy() *EventPullTimeSignalConfig { + if in == nil { + return nil + } + out := new(EventPullTimeSignalConfig) in.DeepCopyInto(out) return out } @@ -490,46 +671,81 @@ func (in *ImageEntry) DeepCopy() *ImageEntry { } // DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. -func (in *PolicyReference) DeepCopyInto(out *PolicyReference) { +func (in *ImageRankingDetail) DeepCopyInto(out *ImageRankingDetail) { *out = *in + if in.Terms != nil { + in, out := &in.Terms, &out.Terms + *out = make([]RankingTerm, len(*in)) + copy(*out, *in) + } } -// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new PolicyReference. -func (in *PolicyReference) DeepCopy() *PolicyReference { +// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new ImageRankingDetail. +func (in *ImageRankingDetail) DeepCopy() *ImageRankingDetail { if in == nil { return nil } - out := new(PolicyReference) + out := new(ImageRankingDetail) in.DeepCopyInto(out) return out } // DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. -func (in *PrometheusSource) DeepCopyInto(out *PrometheusSource) { +func (in *ImageSignalValue) DeepCopyInto(out *ImageSignalValue) { *out = *in - if in.Lookback != nil { - in, out := &in.Lookback, &out.Lookback - *out = new(metav1.Duration) - **out = **in +} + +// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new ImageSignalValue. +func (in *ImageSignalValue) DeepCopy() *ImageSignalValue { + if in == nil { + return nil } - if in.AggregationMethod != nil { - in, out := &in.AggregationMethod, &out.AggregationMethod - *out = new(AggregationMethod) - **out = **in + out := new(ImageSignalValue) + in.DeepCopyInto(out) + return out +} + +// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. +func (in *LokiParser) DeepCopyInto(out *LokiParser) { + *out = *in +} + +// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new LokiParser. +func (in *LokiParser) DeepCopy() *LokiParser { + if in == nil { + return nil } - if in.Step != nil { - in, out := &in.Step, &out.Step - *out = new(metav1.Duration) - **out = **in + out := new(LokiParser) + in.DeepCopyInto(out) + return out +} + +// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. +func (in *ModelExposureRankingConfig) DeepCopyInto(out *ModelExposureRankingConfig) { + *out = *in +} + +// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new ModelExposureRankingConfig. +func (in *ModelExposureRankingConfig) DeepCopy() *ModelExposureRankingConfig { + if in == nil { + return nil } + out := new(ModelExposureRankingConfig) + in.DeepCopyInto(out) + return out +} + +// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. +func (in *PolicyReference) DeepCopyInto(out *PolicyReference) { + *out = *in } -// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new PrometheusSource. -func (in *PrometheusSource) DeepCopy() *PrometheusSource { +// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new PolicyReference. +func (in *PolicyReference) DeepCopy() *PolicyReference { if in == nil { return nil } - out := new(PrometheusSource) + out := new(PolicyReference) in.DeepCopyInto(out) return out } @@ -633,21 +849,193 @@ func (in *PullPolicySpec) DeepCopy() *PullPolicySpec { } // DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. -func (in *RegistrySource) DeepCopyInto(out *RegistrySource) { +func (in *QueryResult) DeepCopyInto(out *QueryResult) { *out = *in - if in.Repositories != nil { - in, out := &in.Repositories, &out.Repositories - *out = make([]string, len(*in)) - copy(*out, *in) + if in.Series != nil { + in, out := &in.Series, &out.Series + *out = new(int32) + **out = **in + } + if in.Samples != nil { + in, out := &in.Samples, &out.Samples + *out = new(int64) + **out = **in + } + if in.Records != nil { + in, out := &in.Records, &out.Records + *out = new(int64) + **out = **in + } +} + +// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new QueryResult. +func (in *QueryResult) DeepCopy() *QueryResult { + if in == nil { + return nil + } + out := new(QueryResult) + in.DeepCopyInto(out) + return out +} + +// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. +func (in *RankingTerm) DeepCopyInto(out *RankingTerm) { + *out = *in +} + +// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new RankingTerm. +func (in *RankingTerm) DeepCopy() *RankingTerm { + if in == nil { + return nil + } + out := new(RankingTerm) + in.DeepCopyInto(out) + return out +} + +// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. +func (in *SignalRankingConfig) DeepCopyInto(out *SignalRankingConfig) { + *out = *in +} + +// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new SignalRankingConfig. +func (in *SignalRankingConfig) DeepCopy() *SignalRankingConfig { + if in == nil { + return nil + } + out := new(SignalRankingConfig) + in.DeepCopyInto(out) + return out +} + +// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. +func (in *SignalResult) DeepCopyInto(out *SignalResult) { + *out = *in +} + +// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new SignalResult. +func (in *SignalResult) DeepCopy() *SignalResult { + if in == nil { + return nil + } + out := new(SignalResult) + in.DeepCopyInto(out) + return out +} + +// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. +func (in *TimeOfDayWindow) DeepCopyInto(out *TimeOfDayWindow) { + *out = *in +} + +// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new TimeOfDayWindow. +func (in *TimeOfDayWindow) DeepCopy() *TimeOfDayWindow { + if in == nil { + return nil + } + out := new(TimeOfDayWindow) + in.DeepCopyInto(out) + return out +} + +// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. +func (in *TimeWeightedAggregateSignalConfig) DeepCopyInto(out *TimeWeightedAggregateSignalConfig) { + *out = *in + out.DefaultWeight = in.DefaultWeight.DeepCopy() + if in.Windows != nil { + in, out := &in.Windows, &out.Windows + *out = make([]TimeWeightedWindow, len(*in)) + for i := range *in { + (*in)[i].DeepCopyInto(&(*out)[i]) + } + } +} + +// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new TimeWeightedAggregateSignalConfig. +func (in *TimeWeightedAggregateSignalConfig) DeepCopy() *TimeWeightedAggregateSignalConfig { + if in == nil { + return nil + } + out := new(TimeWeightedAggregateSignalConfig) + in.DeepCopyInto(out) + return out +} + +// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. +func (in *TimeWeightedWindow) DeepCopyInto(out *TimeWeightedWindow) { + *out = *in + out.Weight = in.Weight.DeepCopy() +} + +// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new TimeWeightedWindow. +func (in *TimeWeightedWindow) DeepCopy() *TimeWeightedWindow { + if in == nil { + return nil + } + out := new(TimeWeightedWindow) + in.DeepCopyInto(out) + return out +} + +// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. +func (in *WeightedSumRankingConfig) DeepCopyInto(out *WeightedSumRankingConfig) { + *out = *in + if in.Terms != nil { + in, out := &in.Terms, &out.Terms + *out = make([]WeightedSumTerm, len(*in)) + for i := range *in { + (*in)[i].DeepCopyInto(&(*out)[i]) + } + } +} + +// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new WeightedSumRankingConfig. +func (in *WeightedSumRankingConfig) DeepCopy() *WeightedSumRankingConfig { + if in == nil { + return nil + } + out := new(WeightedSumRankingConfig) + in.DeepCopyInto(out) + return out +} + +// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. +func (in *WeightedSumTerm) DeepCopyInto(out *WeightedSumTerm) { + *out = *in + out.Weight = in.Weight.DeepCopy() +} + +// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new WeightedSumTerm. +func (in *WeightedSumTerm) DeepCopy() *WeightedSumTerm { + if in == nil { + return nil + } + out := new(WeightedSumTerm) + in.DeepCopyInto(out) + return out +} + +// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. +func (in *WindowAggregateSignalConfig) DeepCopyInto(out *WindowAggregateSignalConfig) { + *out = *in + if in.RelativeWindow != nil { + in, out := &in.RelativeWindow, &out.RelativeWindow + *out = new(metav1.Duration) + **out = **in + } + if in.Window != nil { + in, out := &in.Window, &out.Window + *out = new(TimeOfDayWindow) + **out = **in } } -// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new RegistrySource. -func (in *RegistrySource) DeepCopy() *RegistrySource { +// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new WindowAggregateSignalConfig. +func (in *WindowAggregateSignalConfig) DeepCopy() *WindowAggregateSignalConfig { if in == nil { return nil } - out := new(RegistrySource) + out := new(WindowAggregateSignalConfig) in.DeepCopyInto(out) return out } diff --git a/config/crd/bases/drop.corewire.io_discoverypolicies.yaml b/config/crd/bases/drop.corewire.io_discoverypolicies.yaml index a1183f2..998fe05 100644 --- a/config/crd/bases/drop.corewire.io_discoverypolicies.yaml +++ b/config/crd/bases/drop.corewire.io_discoverypolicies.yaml @@ -20,8 +20,8 @@ spec: - jsonPath: .status.conditions[?(@.type=="Ready")].reason name: Status type: string - - jsonPath: .status.sourceCount - name: Sources + - jsonPath: .status.queryCount + name: Queries type: integer - jsonPath: .status.imageCount name: Images @@ -76,29 +76,88 @@ spec: format: int32 minimum: 1 type: integer - sources: - description: |- - Sources is the list of discovery backends to query. At least one source is required. - Multiple sources are merged and ranked together before maxImages is applied. + queries: + description: Queries is the list of named raw-data sources. Each query + is referenced by name from signals. items: - description: DiscoverySource defines a single discovery backend. + description: DiscoveryQuery defines a named raw-data source referenced + by signals. properties: - prometheus: - description: Prometheus contains the configuration when type=prometheus. + loki: + description: Loki contains the configuration when type=loki. properties: - aggregationMethod: + endpoint: description: |- - AggregationMethod controls how data points from a range query are combined into a single score. - Only used when queryType is "range". Ignored for instant queries. - When not set (nil), Drop uses the last data-point value directly — use this when your PromQL - already contains aggregation functions (e.g., count_over_time, topk). - Options: "sum", "count", "avg", "max" + Endpoint is the Loki API URL. + Example: "https://loki.example.com" + minLength: 1 + type: string + lookback: + description: |- + Lookback is the time window for the query (start=now-lookback, end=now). + Example: "168h" (7 days), "24h" + type: string + parser: + description: Parser configures how log lines are parsed + into structured event records. + properties: + imageField: + description: |- + ImageField is the log label or field from which the image reference is extracted. + For kubernetesEvents, the image is parsed out of the message text. + Example: "message" + type: string + messageField: + description: |- + MessageField is the log label or field that contains the event message. + Example: "message" + type: string + podField: + description: |- + PodField is the log label or field that contains the pod name. + Example: "involvedObject_name" + type: string + reasonField: + description: |- + ReasonField is the log label or field that contains the event reason. + Example: "reason" + type: string + type: + allOf: + - enum: + - kubernetesEvents + - enum: + - kubernetesEvents + description: Type selects the parser. Currently only + "kubernetesEvents" is supported. + type: string + required: + - type + type: object + query: + description: Query is the LogQL expression. + minLength: 1 + type: string + queryType: + default: range + description: QueryType controls how the query is executed. + Currently only "range" is supported. enum: - - sum - - count - - avg - - max + - range type: string + required: + - endpoint + - query + type: object + name: + description: |- + Name is the unique identifier for this query within the policy. + Signals reference queries by this name via queryRef. + minLength: 1 + type: string + prometheus: + description: Prometheus contains the configuration when type=prometheus. + properties: endpoint: description: |- Endpoint is the Prometheus-compatible API URL (Prometheus, Thanos, Mimir, VictoriaMetrics). @@ -107,88 +166,38 @@ spec: type: string lookback: description: |- - Lookback is the time window for range queries. When queryType is "range", - the operator queries (start=now-lookback, end=now) and aggregates all returned values per image. - The aggregation function is controlled by the aggregationMethod field. + Lookback is the time window for range queries (start=now-lookback, end=now). Required when queryType is "range". Ignored when queryType is "instant". Example: "168h" (7 days), "24h", "72h" type: string query: description: |- - Query is the PromQL expression. It MUST return results with an "image" label — - that label value is used as the discovered image reference. - The query result value is used as the ranking score (higher = more relevant). - Example: count(container_memory_working_set_bytes{container!="",container!="POD",namespace="gitlab-runner"}) by (image) + Query is the PromQL expression. Must return results with an "image" label. + Example: count(container_memory_working_set_bytes{namespace="gitlab-runner"}) by (image) minLength: 1 type: string queryType: default: range - description: |- - QueryType controls how the Prometheus query is executed. - "range" uses /api/v1/query_range with a time window defined by lookback. - "instant" uses /api/v1/query for a single point-in-time result. - Default: "range". + description: 'QueryType controls how the query is executed: + "range" or "instant". Default: "range".' enum: - range - instant type: string step: description: |- - Step is the resolution step for range queries (only used when lookback is set). - Smaller steps = more data points = more accurate aggregation but higher Prometheus load. + Step is the resolution step for range queries. + Smaller steps increase data-point density but also increase Prometheus load. Default: 5m. Example: "1m", "15m" type: string required: - endpoint - query type: object - registry: - description: Registry contains the configuration when type=registry. - properties: - imageTemplate: - description: |- - ImageTemplate is a Go text/template for constructing the full image reference from discovered tags. - Available variables: {{.Registry}}, {{.Repository}}, {{.Tag}} - Default (when unset): "{{.Registry}}/{{.Repository}}:{{.Tag}}" - Example: "{{.Registry}}/{{.Repository}}@{{.Tag}}" (if tags are actually digests) - type: string - repositories: - description: |- - Repositories is the list of repository paths to list tags from. - Example: ["team/app", "team/worker", "infra/tools"] - items: - type: string - minItems: 1 - type: array - tagFilter: - description: |- - TagFilter is a regex applied to tag names. Only matching tags are discovered. - Example: "^v[0-9]+\\." (semver tags only), "^main-" (main branch builds) - type: string - topX: - description: |- - TopX limits the number of tags kept per repository after tagFilter is applied. - The registry API does not provide creation timestamps here; Drop keeps the last N tags returned by the registry. - Example: 3 (keep the last 3 matching tags returned per repo) - format: int32 - minimum: 1 - type: integer - url: - description: |- - URL is the registry base URL (without repository path). - Example: "https://registry.example.com", "https://ghcr.io" - minLength: 1 - type: string - required: - - repositories - - url - type: object secretRef: description: |- - SecretRef references a Secret in the namespace where Drop creates pull Pods. - The default namespace is "drop-system" unless the controller is started with a different --pod-namespace. + SecretRef references a Secret in the pod namespace (default "drop-system") for auth/TLS. Supported Secret keys: token, username, password, ca.crt, tls.crt, tls.key, headers.. - Example: {name: "prometheus-creds"} properties: name: default: "" @@ -202,25 +211,395 @@ spec: type: object x-kubernetes-map-type: atomic type: - description: Type identifies the discovery backend. Must be - "prometheus" or "registry". - enum: - - prometheus - - registry + allOf: + - enum: + - prometheus + - loki + - enum: + - prometheus + - loki + description: Type selects the backend. Must be "prometheus" + or "loki". type: string required: + - name + - type + type: object + type: array + ranking: + description: Ranking defines how signals are combined into a final + ordered image list. + properties: + modelExposure: + description: ModelExposure is required when strategy=modelExposure. + properties: + nodeCount: + description: NodeCount is the number of eligible CI nodes + (N in the exposure formula). + format: int32 + minimum: 1 + type: integer + preWindowUsageSignalRef: + description: |- + PreWindowUsageSignalRef is the name of the signal representing usage before the target window. + Must match a signals[].name within the same policy. + minLength: 1 + type: string + pullTimeSignalRef: + description: |- + PullTimeSignalRef is the name of the signal providing per-image pull-time estimates. + Must match a signals[].name within the same policy. + minLength: 1 + type: string + targetWindowUsageSignalRef: + description: |- + TargetWindowUsageSignalRef is the name of the signal representing usage during the target window. + Must match a signals[].name within the same policy. + minLength: 1 + type: string + required: + - nodeCount + - preWindowUsageSignalRef + - pullTimeSignalRef + - targetWindowUsageSignalRef + type: object + signal: + description: Signal is required when strategy=signal. + properties: + signalRef: + description: |- + SignalRef is the name of the signal whose values determine image rank. + Must match a signals[].name within the same policy. + minLength: 1 + type: string + required: + - signalRef + type: object + strategy: + allOf: + - enum: + - signal + - weightedSum + - modelExposure + - enum: + - signal + - weightedSum + - modelExposure + description: Strategy selects the ranking algorithm. + type: string + weightedSum: + description: WeightedSum is required when strategy=weightedSum. + properties: + missingSignal: + allOf: + - enum: + - zero + - drop + - enum: + - zero + - drop + default: zero + description: |- + MissingSignal controls behavior when an image has no value for a required signal. + "zero" treats missing as 0; "drop" removes the image from ranking. + type: string + normalize: + allOf: + - enum: + - minMax + - enum: + - minMax + default: minMax + description: |- + Normalize selects the normalization method applied to each signal before weighting. + Currently only "minMax" is supported. + type: string + terms: + description: Terms is the list of signals and their weights. + items: + description: WeightedSumTerm defines one signal contribution + in a weightedSum ranking. + properties: + signalRef: + description: |- + SignalRef is the name of the signal to include in the weighted sum. + Must match a signals[].name within the same policy. + minLength: 1 + type: string + weight: + anyOf: + - type: integer + - type: string + description: |- + Weight is the factor applied to the normalized signal value. + All weights should be non-negative; they do not need to sum to 1. + Example: "0.7" + pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$ + x-kubernetes-int-or-string: true + required: + - signalRef + - weight + type: object + minItems: 1 + type: array + required: + - missingSignal + - normalize + - terms + type: object + required: + - strategy + type: object + signals: + description: |- + Signals is the list of named per-image metrics derived from query results. + Each signal is referenced by name from the ranking configuration. + items: + description: DiscoverySignal defines a named per-image metric derived + from a single query. + properties: + aggregate: + description: Aggregate is required when type=aggregate. + properties: + method: + allOf: + - enum: + - sum + - count + - avg + - max + - min + - enum: + - sum + - count + - avg + - max + - min + description: Method is the aggregation function applied + to all samples per image. + type: string + required: + - method + type: object + eventPullTime: + description: EventPullTime is required when type=eventPullTime. + properties: + durationMode: + allOf: + - enum: + - eventPair + - messageDuration + - enum: + - eventPair + - messageDuration + description: DurationMode controls how pull duration is + extracted from event records. + type: string + includeCacheHits: + default: false + description: |- + IncludeCacheHits controls whether "already present on machine" events are included + in cold-pull duration statistics. Set to false to exclude cache hits. + type: boolean + statistic: + allOf: + - enum: + - p50 + - p90 + - p95 + - avg + - max + - count + - failureCount + - cacheHitCount + - enum: + - p50 + - p90 + - p95 + - avg + - max + - count + - failureCount + - cacheHitCount + description: Statistic selects which pull-time metric to + compute. + type: string + required: + - durationMode + - includeCacheHits + - statistic + type: object + name: + description: |- + Name is the unique identifier for this signal within the policy. + Ranking configurations reference signals by this name. + minLength: 1 + type: string + queryRef: + description: |- + QueryRef is the name of the query that provides raw data for this signal. + Must match a queries[].name within the same policy. + minLength: 1 + type: string + timeWeightedAggregate: + description: TimeWeightedAggregate is required when type=timeWeightedAggregate. + properties: + defaultWeight: + anyOf: + - type: integer + - type: string + description: |- + DefaultWeight is applied to samples that do not fall in any configured window. + Use "0" to exclude off-hours samples entirely. + pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$ + x-kubernetes-int-or-string: true + method: + allOf: + - enum: + - sum + - count + - avg + - max + - min + - enum: + - sum + - count + - avg + - max + - min + description: Method is the aggregation function applied + after weighting (currently only "sum" is meaningful). + type: string + timezone: + description: |- + Timezone is the IANA time zone used to evaluate window boundaries (wall-clock hours). + Example: "Europe/Berlin", "America/New_York", "UTC" + minLength: 1 + type: string + windows: + description: Windows is the list of hour-of-day windows + with associated weights. + items: + description: TimeWeightedWindow defines a wall-clock hour + range and its weight factor. + properties: + endHour: + description: EndHour is the exclusive end of the window + in local time (1–24). + format: int32 + maximum: 24 + minimum: 1 + type: integer + startHour: + description: StartHour is the inclusive start of the + window in local time (0–23). + format: int32 + maximum: 23 + minimum: 0 + type: integer + weight: + anyOf: + - type: integer + - type: string + description: |- + Weight is the factor applied to sample values within this window. + Use "1.0" for full weight, "0.3" for partial, "0" to exclude. + pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$ + x-kubernetes-int-or-string: true + required: + - endHour + - startHour + - weight + type: object + minItems: 1 + type: array + required: + - defaultWeight + - method + - timezone + - windows + type: object + type: + allOf: + - enum: + - aggregate + - timeWeightedAggregate + - windowAggregate + - eventPullTime + - enum: + - aggregate + - timeWeightedAggregate + - windowAggregate + - eventPullTime + description: Type selects the signal derivation method. + type: string + windowAggregate: + description: WindowAggregate is required when type=windowAggregate. + properties: + method: + allOf: + - enum: + - sum + - count + - avg + - max + - min + - enum: + - sum + - count + - avg + - max + - min + description: Method is the aggregation function applied + to the windowed samples. + type: string + relativeWindow: + description: |- + RelativeWindow aggregates only samples from the last N duration before now. + Mutually exclusive with window + timezone. + Example: "2h" (last 2 hours) + type: string + timezone: + description: |- + Timezone is the IANA time zone for evaluating wall-clock window boundaries. + Required when window is set. + type: string + window: + description: |- + Window defines fixed wall-clock start/end times within each day. + Mutually exclusive with relativeWindow. + properties: + end: + description: |- + End is the exclusive end time in "HH:MM" format (24-hour, local time). + Example: "17:00" + pattern: ^([01][0-9]|2[0-3]):[0-5][0-9]$ + type: string + start: + description: |- + Start is the inclusive start time in "HH:MM" format (24-hour, local time). + Example: "09:00" + pattern: ^([01][0-9]|2[0-3]):[0-5][0-9]$ + type: string + required: + - end + - start + type: object + required: + - method + type: object + required: + - name + - queryRef - type type: object - minItems: 1 type: array syncInterval: default: 30m description: |- - SyncInterval is how often the operator re-queries all sources and updates status.discoveredImages. + SyncInterval is how often the operator re-runs the pipeline and updates status.discoveredImages. Default: "30m". Example: "1h", "15m" type: string - required: - - sources type: object status: description: DiscoveryPolicyStatus defines the observed state of DiscoveryPolicy. @@ -283,43 +662,182 @@ spec: type: object type: array discoveredImages: - description: DiscoveredImages is the list of discovered images from - all sources. + description: |- + DiscoveredImages is the ordered list of discovered and ranked images. + Only images with selected=true are propagated to dependent CachedImageSet resources. items: - description: DiscoveredImage represents a single discovered image - with metadata. + description: DiscoveredImage represents a single discovered and + ranked image. properties: + finalScore: + description: FinalScore is the computed ranking score as a decimal + string. + type: string image: description: Image is the fully qualified image reference. type: string - score: - description: Score is the ranking score from the source (higher - = more relevant). - format: int64 + rank: + description: Rank is the position of this image in the final + ordered list (1 = highest score). + format: int32 type: integer - source: - description: Source identifies which discovery source produced - this image. - type: string + ranking: + description: Ranking explains how the final score was computed. + properties: + strategy: + description: Strategy is the ranking strategy that produced + this detail. + type: string + terms: + description: Terms lists the per-signal contributions (populated + for weightedSum and modelExposure). + items: + description: RankingTerm records the contribution of one + signal to the final score of an image. + properties: + contribution: + description: Contribution is weight * normalizedValue + as a decimal string. + type: string + signal: + description: Signal is the signal name. + type: string + weight: + description: Weight is the configured weight as a + decimal string. + type: string + required: + - contribution + - signal + - weight + type: object + type: array + required: + - strategy + type: object + selected: + description: |- + Selected is true when this image is within the maxImages cap and will be + propagated to dependent CachedImageSet resources. + type: boolean + signals: + description: Signals lists the per-signal values used during + ranking (for observability). + items: + description: ImageSignalValue records the raw and normalized + value of a signal for one image. + properties: + name: + description: Name is the signal name. + type: string + normalizedValue: + description: |- + NormalizedValue is the normalized value (after minMax or other normalization) as a decimal string. + Only populated for signals used in a weightedSum ranking. + type: string + rawValue: + description: RawValue is the unscaled signal value as + a decimal string. + type: string + required: + - name + - rawValue + type: object + type: array required: + - finalScore - image - - score - - source + - rank + - selected type: object type: array imageCount: - description: ImageCount is the number of discovered images. + description: ImageCount is the number of selected discovered images. format: int32 type: integer lastSyncTime: - description: LastSyncTime is the timestamp of the last successful - sync. + description: LastSyncTime is the timestamp of the last reconciliation + attempt. format: date-time type: string - sourceCount: - description: SourceCount is the number of configured sources. + queryCount: + description: QueryCount is the number of configured queries. format: int32 type: integer + queryResults: + description: QueryResults reports the outcome of each named query + execution. + items: + description: QueryResult reports the outcome of a single named query + execution. + properties: + message: + description: Message describes the failure reason when status=failed. + type: string + name: + description: Name matches the queries[].name that produced this + result. + type: string + records: + description: Records is the number of log records returned (Loki + queries only). + format: int64 + type: integer + samples: + description: Samples is the total number of data points across + all series (Prometheus range queries only). + format: int64 + type: integer + series: + description: Series is the number of time-series returned (Prometheus + queries only). + format: int32 + type: integer + status: + description: Status is "success" or "failed". + enum: + - success + - failed + type: string + type: + description: Type is the query backend type (prometheus or loki). + enum: + - prometheus + - loki + type: string + required: + - name + - status + - type + type: object + type: array + signalResults: + description: SignalResults reports the outcome of each signal derivation. + items: + description: SignalResult reports the outcome of a single signal + derivation. + properties: + images: + description: Images is the number of images for which this signal + produced a value. + format: int32 + type: integer + message: + description: Message describes the failure reason when status=failed. + type: string + name: + description: Name matches the signals[].name that produced this + result. + type: string + status: + description: Status is "success" or "failed". + type: string + required: + - images + - name + - status + type: object + type: array type: object type: object served: true diff --git a/config/samples/drop_v1alpha1_discoverypolicy.yaml b/config/samples/drop_v1alpha1_discoverypolicy.yaml index 7b7d044..82a4856 100644 --- a/config/samples/drop_v1alpha1_discoverypolicy.yaml +++ b/config/samples/drop_v1alpha1_discoverypolicy.yaml @@ -1,15 +1,49 @@ apiVersion: drop.corewire.io/v1alpha1 kind: DiscoveryPolicy metadata: - name: registry-discovery + name: gitlab-hybrid-usage-concurrency spec: - sources: - - type: registry - registry: - url: "https://registry.example.com" - repositories: - - "myorg/myapp" - - "myorg/worker" - topX: 5 - syncInterval: 5m - maxImages: 20 + syncInterval: 1h + maxImages: 30 + + queries: + - name: runner-image-usage + type: prometheus + prometheus: + endpoint: https://mimir.example.com + queryType: range + lookback: 168h + step: 1m + query: | + count( + container_memory_working_set_bytes{ + container!="", + container!="POD", + namespace="gitlab-runner", + pod=~"runner-.*" + } + ) by (image) + + signals: + - name: total-usage + queryRef: runner-image-usage + type: aggregate + aggregate: + method: sum + + - name: peak-concurrency + queryRef: runner-image-usage + type: aggregate + aggregate: + method: max + + ranking: + strategy: weightedSum + weightedSum: + normalize: minMax + missingSignal: zero + terms: + - signalRef: total-usage + weight: "700m" + - signalRef: peak-concurrency + weight: "300m" diff --git a/docs/content/docs/developing/architecture.md b/docs/content/docs/developing/architecture.md index 7775d73..82a10c5 100644 --- a/docs/content/docs/developing/architecture.md +++ b/docs/content/docs/developing/architecture.md @@ -19,8 +19,9 @@ CachedImageSet ──owns──▶ CachedImage[] ──creates──▶ Pod (per │ image pulled by DiscoveryPolicy ──discovers───┘ kubelet │ - ├── PrometheusSource (PromQL query) - └── RegistrySource (OCI tag list) + ├── queries[] (Prometheus / Loki raw data) + ├── signals[] (per-image metrics derived from queries) + └── ranking (combines signals into ordered image list) ``` ## Package Dependency Graph @@ -34,7 +35,7 @@ cmd/main.go │ ├── internal/pacing/ (rate-limiting engine) ├── internal/podbuilder/ (pure Pod construction) - ├── internal/discovery/ (source interface + impls) + ├── internal/discovery/ (query execution + source interface) └── internal/metrics/ (Prometheus counters/gauges) api/v1alpha1/ (CRD type definitions — imported by all) @@ -116,6 +117,6 @@ type Source interface { } ``` -**PrometheusSource:** Queries Prometheus for container images (requires `image` label in results). Supports instant and range queries. +**PrometheusSource:** Queries a Prometheus-compatible API for container images (requires `image` label in results). Supports instant and range queries. Used as the execution backend for `type: prometheus` queries in the pipeline. -**RegistrySource:** Lists tags from an OCI registry via `/v2//tags/list`. Filters by regex, limits to TopX most recent. +> **Note:** Registry tag discovery (`RegistrySource`) has been removed in the pipeline redesign. Use a Prometheus or Loki query to discover images from runtime metrics instead. diff --git a/docs/content/docs/discovery.md b/docs/content/docs/discovery.md index 8ee8440..6b9fdd5 100644 --- a/docs/content/docs/discovery.md +++ b/docs/content/docs/discovery.md @@ -5,10 +5,10 @@ aliases: - /drop/docs/discovery/ description: Automatic image discovery with DiscoveryPolicy. llmsDescription: | - DiscoveryPolicy CRD enables automatic image discovery from Prometheus metrics - or OCI registries. Referenced by CachedImageSet via discoveryPolicyRef. - Discovered images are materialized as CachedImage resources. Supports - filtering, deduplication, and periodic re-discovery. + DiscoveryPolicy CRD enables automatic image discovery using a three-stage pipeline: + queries → signals → ranking. Referenced by CachedImageSet via discoveryPolicyRef. + Discovered images are materialized as CachedImage resources. Supports filtering, + time-weighted scoring, weighted ranking, and periodic re-discovery. --- The DiscoveryPolicy CRD enables automatic image discovery from external sources. When referenced by a CachedImageSet, discovered images are automatically materialized as CachedImage resources. @@ -22,241 +22,456 @@ Discovery came from operational pain: - Hand-maintained image lists became stale and missed newly hot images - Node rotation (e.g. Cluster API MachineDeployments rolling new nodes daily or weekly) means fresh nodes start with empty image caches — every rotation triggers a full re-pull of all active images -This last point is especially painful in CI clusters: if your build nodes are managed by Cluster API and regularly replaced (scaling events, OS upgrades, spot instance recycling), every new node must pull the same large build images from scratch. Discovery combined with pre-caching ensures that the most relevant images are warmed immediately after a node joins, eliminating the cold-start penalty from node rotation. +With DiscoveryPolicy, image candidates are continuously sourced from real usage signals (metrics), ranked by configurable strategies, and consumed by CachedImageSet. -With DiscoveryPolicy, image candidates are continuously sourced from real usage signals (metrics) or registry data, then consumed by CachedImageSet. +## Pipeline Overview -## How It Works +``` +queries → signals → ranking → selected images +``` + +The pipeline has three stages: + +1. **Queries** fetch raw observations from systems such as Prometheus or Loki. +2. **Signals** derive named per-image metrics from query results (e.g. `total-usage`, `peak-concurrency`). +3. **Ranking** combines one or more signals into the final ordered image list. ``` -DiscoveryPolicy → queries sources → writes to status.discoveredImages - ↓ +DiscoveryPolicy → runs pipeline → writes to status.discoveredImages + ↓ CachedImageSet → reads discoveredImages → creates/deletes CachedImage children ``` -1. The DiscoveryPolicy reconciler queries all configured sources at the specified interval -2. Results are normalized to `{image, score}` pairs, merged, deduplicated, filtered, and sorted by score -3. Top results (capped by `maxImages`) are written to `status.discoveredImages` -4. The CachedImageSet reconciler watches DiscoveryPolicy status changes -5. It diffs the desired images against existing CachedImage children -6. New CachedImages are created; orphaned ones are deleted via ownerReference GC - -## Prometheus Source +## Stage 1 — Queries -### Query Contract +A query fetches raw observations and is referenced by name from signals. -Your Prometheus query **must** return an `image` label. The metric value becomes the ranking score (higher = more important). +### Prometheus Query -In practice this means each result series should look like: +```yaml +queries: + - name: runner-image-usage + type: prometheus + prometheus: + endpoint: https://mimir.example.com + queryType: range # range | instant (default: range) + lookback: 168h # time window for range queries + step: 1m # range resolution (default: 5m) + query: | + count( + container_memory_working_set_bytes{ + container!="", container!="POD", + namespace="gitlab-runner", pod=~"runner-.*" + } + ) by (image) +``` -- Labels include `image="/:"` (or equivalent image ref like `registry.example.com/team/app@sha256:...`) -- Value is numeric and used for ranking +The PromQL result **must** carry an `image` label. That label value is the discovered image reference. -**Example:** Find the 30 most-used images in a namespace: +### Loki Query -```promql -count(container_memory_working_set_bytes{ - container!="", - container!="POD", - namespace="build-stuff" -}) by (image) +```yaml +queries: + - name: image-pull-events + type: loki + loki: + endpoint: https://loki.example.com + queryType: range + lookback: 168h + query: | + {job="kubernetes-events", namespace="gitlab-runner"} + | json + | involvedObject_name =~ "runner-.*" + | reason =~ "Pulling|Pulled|Failed|BackOff" + parser: + type: kubernetesEvents + podField: involvedObject_name + reasonField: reason + messageField: message + imageField: message ``` -### War Story Example: Top GitLab Runner Images (last 7 days) +### Auth / TLS -Hand-maintained image lists do not keep up in environments where automation (for example Renovate) ships new image versions every day. A practical pattern is to rank images by observed CI usage over a rolling window. +Both query types support a `secretRef` for authentication and TLS: -The `queryType` field controls whether Drop sends an instant or range query (default: `range`). When set to `range`, the `lookback` field defines the time window and `aggregationMethod` controls how the returned data points are combined into a single score per image. +```yaml +queries: + - name: runner-image-usage + type: prometheus + prometheus: + endpoint: https://mimir.example.com + query: ... + secretRef: + name: prometheus-creds # Secret in the drop-system namespace +``` -#### Query Types +Supported Secret keys: `token`, `username`, `password`, `ca.crt`, `tls.crt`, `tls.key`, `headers.`. -{{< figure src="/drop/images/query-type-range.svg" alt="Range query: multiple data points over a lookback window" >}} +## Stage 2 — Signals -{{< figure src="/drop/images/query-type-instant.svg" alt="Instant query: single point-in-time value used as score" >}} +A signal derives a named per-image value from exactly one query. -#### Aggregation Methods +### `aggregate` -When using `queryType: range`, the `aggregationMethod` field determines how the returned data points are reduced into a single score: +Aggregates all samples per image using a single method. -{{< figure src="/drop/images/aggregation-methods.svg" alt="Aggregation methods: nil (last value), sum, count, avg, max" >}} +```yaml +signals: + - name: total-usage + queryRef: runner-image-usage + type: aggregate + aggregate: + method: sum # sum | max | avg | count | min + + - name: peak-concurrency + queryRef: runner-image-usage + type: aggregate + aggregate: + method: max +``` + +### `timeWeightedAggregate` -| Method | Behavior | Use when | -|--------|----------|----------| -| *(not set)* | Uses the last data-point value directly | Your PromQL already aggregates (e.g. `count_over_time`, `topk`) | -| `sum` | Adds all data-point values over the window | Total cumulative usage matters (e.g. total memory consumed) | -| `count` | Counts the number of data points returned | You want to rank by how frequently an image appears | -| `avg` | Arithmetic mean of all data-point values | Average magnitude matters regardless of sample count | -| `max` | Highest single data-point value | Peak usage is more relevant than cumulative | +Multiplies each sample value by a per-hour window weight before aggregation. ```yaml -apiVersion: drop.corewire.io/v1alpha1 -kind: DiscoveryPolicy -metadata: - name: popular-build-images -spec: - syncInterval: 1h - maxImages: 30 - sources: - - type: prometheus - prometheus: - endpoint: https://mimir.example.com - queryType: range # default — use query_range API - lookback: 168h # 7 days - step: 5m - aggregationMethod: sum # rank by total usage over 7 days (omit to use last value directly) - query: | - count( - container_memory_working_set_bytes{ - container!="",container!="POD", - namespace="gitlab-runner",pod=~"runner-.*" - } - ) by (image) +signals: + - name: developer-weighted-usage + queryRef: runner-image-usage + type: timeWeightedAggregate + timeWeightedAggregate: + method: sum + timezone: Europe/Berlin + defaultWeight: "0" + windows: + - startHour: 7 + endHour: 9 + weight: "300m" # 0.3 (resource.Quantity format) + - startHour: 9 + endHour: 17 + weight: "1" # 1.0 — full weight during core hours + - startHour: 17 + endHour: 20 + weight: "300m" ``` -Use this when you want DiscoveryPolicy to continuously follow what your GitLab runner jobs really pulled in the last week. +### `windowAggregate` -#### Field-by-field explanation +Aggregates only the samples within a specific time sub-window. -- `queryType: range` — tells Drop to use the Prometheus `query_range` API. This is the default. Set to `instant` for a single point-in-time query. -- `lookback: 168h` — defines the time window for range queries (start=now-7d, end=now). Required when `queryType` is `range`. -- `aggregationMethod: sum` — sums all data-point values to rank by total usage. When omitted (nil), the last value is used directly — ideal for self-contained PromQL queries. Other options: `count` to rank by number of appearances, `avg` for average magnitude, or `max` for peak value. -- `step: 5m` — resolution step for the range query (controls how many data points Prometheus returns). -- `count(...) by (image)` — counts the number of running containers per image to rank by popularity. -- `container_memory_working_set_bytes{...}` — source metric used to observe running containers. -- `container!=""` — ignore empty image labels. -- `container!="POD"` — ignore sandbox/pause container noise. -- `namespace="gitlab-runner"` — scope discovery to CI jobs in that namespace. -- `pod=~"runner-.*"` — further scope to runner pods only. +```yaml +signals: + # Relative window (last N duration before now) + - name: recent-usage + queryRef: runner-image-usage + type: windowAggregate + windowAggregate: + method: sum + relativeWindow: 2h + + # Wall-clock window (specific hours of day) + - name: pre-window-usage + queryRef: runner-image-usage + type: windowAggregate + windowAggregate: + method: sum + timezone: Europe/Berlin + window: + start: "00:00" + end: "09:00" +``` -#### How score is calculated +### `eventPullTime` -For each unique `image` label, Drop uses the Prometheus query result value as the score. +Derives image pull-time statistics from Loki event records. -When `queryType` is `range` (the default), Drop uses a range query (`/api/v1/query_range`) over the `lookback` window and aggregates data points using the `aggregationMethod`. When `queryType` is `instant`, Drop sends an instant query (`/api/v1/query`) and uses the returned value directly: +```yaml +signals: + - name: p50-cold-pull-time + queryRef: image-pull-events + type: eventPullTime + eventPullTime: + statistic: p50 # p50 | p90 | p95 | avg | max | count | failureCount | cacheHitCount + includeCacheHits: false + durationMode: eventPair # eventPair | messageDuration +``` -- *(not set)*: uses the last data-point value — ideal when your PromQL already contains aggregation functions like `count_over_time` or `topk` -- `sum`: adds all data-point values — images with higher cumulative usage score higher -- `count`: counts the number of data points — images that appear more frequently score higher -- `avg`: averages data-point values — images with higher average value score higher -- `max`: takes the peak value — images with the highest single observation score higher +## Stage 3 — Ranking -The example above uses `queryType: range` with `lookback: 168h` so Drop handles the 7-day windowing via the API — no need to embed `[7d]` in PromQL. +Exactly one ranking strategy per policy. -If Prometheus returns: +### `signal` -| image | value returned by query | meaning | -|---|---:|---| -| `registry.example.com/ci/build:1.0.3` | 4200 | seen most frequently in the 7-day window | -| `registry.example.com/ci/test:2.4.1` | 2500 | medium usage | -| `registry.example.com/ci/lint:1.8.0` | 900 | lower usage | +Ranks images directly by the value of a single signal. -Drop stores the returned values as `{image, score}` pairs in memory and then applies `spec.maxImages` as the final cap when writing `status.discoveredImages`. +```yaml +ranking: + strategy: signal + signal: + signalRef: total-usage +``` -So the flow is: +### `weightedSum` -1. Prometheus query returns per-image counts to Drop. -2. Drop ranks by score and applies `spec.maxImages` as the final list size. +Combines normalized signals using a weighted sum. +```yaml +ranking: + strategy: weightedSum + weightedSum: + normalize: minMax # only method available + missingSignal: zero # zero | drop + terms: + - signalRef: total-usage + weight: "700m" # 0.7 in resource.Quantity format + - signalRef: peak-concurrency + weight: "300m" # 0.3 ``` -score -4200 | build ██████████████████████████ -2500 | test ████████████████ -900 | lint ██████ - (bar length indicates score) + +Score: `final_score(I) = Σ weight_k * normalize(signal_k(I))` + +`minMax` normalization: `normalized(x) = (x - min) / (max - min)` — equals 1 when all values are equal. + +### `modelExposure` + +Ranks images by expected post-rotation cold-node exposure. + +```yaml +ranking: + strategy: modelExposure + modelExposure: + nodeCount: 100 + preWindowUsageSignalRef: pre-window-usage + targetWindowUsageSignalRef: developer-window-usage + pullTimeSignalRef: p50-cold-pull-time ``` -### Production Patterns +Score: `score(I) = J_target(I) * (1 - 1/N)^J_pre(I) * p_hat(I)` -- Use `maxImages` to cap churn and focus on the highest-impact images -- Use `imageFilter` to exclude mirrors or registries you do not want to pre-cache -- Start with one high-traffic namespace/team first, then expand source scope +## Complete Examples -### Full Example +### Example 1: Total Usage (simplest) ```yaml apiVersion: drop.corewire.io/v1alpha1 kind: DiscoveryPolicy metadata: - name: popular-build-images + name: total-usage spec: syncInterval: 1h maxImages: 30 - imageFilter: "^(?!.*ecr\\..*amazonaws\\.com).*$" # Exclude ECR images - sources: - - type: prometheus + + queries: + - name: runner-image-usage + type: prometheus prometheus: endpoint: https://mimir.example.com - queryType: instant + queryType: range + lookback: 168h + step: 1m query: | - count(container_memory_working_set_bytes{ - container!="", container!="POD", - namespace="build-stuff", cluster="mycluster" - }) by (image) - secretRef: - name: prometheus-creds ---- -apiVersion: v1 -kind: Secret -metadata: - name: prometheus-creds - namespace: drop-system -type: Opaque -stringData: - username: admin - password: my-prometheus-password -``` - -## Registry Source + count( + container_memory_working_set_bytes{ + container!="", container!="POD", + namespace="gitlab-runner", pod=~"runner-.*" + } + ) by (image) -### Use Case: GitLab Runner Helper Images + signals: + - name: total-usage + queryRef: runner-image-usage + type: aggregate + aggregate: + method: sum + + ranking: + strategy: signal + signal: + signalRef: total-usage +``` -The registry source uses OCI Distribution API tag listing. Combined with `imageTemplate`, it handles complex tag patterns like GitLab Runner helpers: +### Example 2: Hybrid Usage + Peak Concurrency ```yaml apiVersion: drop.corewire.io/v1alpha1 kind: DiscoveryPolicy metadata: - name: gitlab-helpers + name: gitlab-hybrid-usage-concurrency spec: - syncInterval: 6h - maxImages: 10 - sources: - - type: registry - registry: - url: https://registry.gitlab.com - repositories: - - gitlab-org/gitlab-runner/gitlab-runner-helper - tagFilter: "^v\\d+\\.\\d+\\.\\d+$" - topX: 5 - imageTemplate: "registry.gitlab.com/{{ .Repository }}:x86_64-{{ .Tag }}" -``` + syncInterval: 1h + maxImages: 30 -This replaces the legacy bash script that curled the GitLab API and constructed image refs manually. + queries: + - name: runner-image-usage + type: prometheus + prometheus: + endpoint: https://mimir.example.com + queryType: range + lookback: 168h + step: 1m + query: | + count( + container_memory_working_set_bytes{ + container!="", container!="POD", + namespace="gitlab-runner", pod=~"runner-.*" + } + ) by (image) -### Additional Example: Stable App Tags from Private Registry + signals: + - name: total-usage + queryRef: runner-image-usage + type: aggregate + aggregate: + method: sum + + - name: peak-concurrency + queryRef: runner-image-usage + type: aggregate + aggregate: + method: max + + ranking: + strategy: weightedSum + weightedSum: + normalize: minMax + missingSignal: zero + terms: + - signalRef: total-usage + weight: "700m" + - signalRef: peak-concurrency + weight: "300m" +``` + +### Example 3: Developer-Time Weighted Usage ```yaml apiVersion: drop.corewire.io/v1alpha1 kind: DiscoveryPolicy metadata: - name: platform-apps + name: gitlab-developer-and-burst spec: - syncInterval: 2h - maxImages: 20 - imageFilter: "^registry\\.example\\.com/platform/.*$" - sources: - - type: registry - registry: - url: https://registry.example.com - repositories: - - platform/api - - platform/web - tagFilter: "^v\\d+\\.\\d+\\.\\d+$" - topX: 10 + syncInterval: 1h + maxImages: 30 + + queries: + - name: runner-image-usage + type: prometheus + prometheus: + endpoint: https://mimir.example.com + queryType: range + lookback: 168h + step: 1m + query: | + count( + container_memory_working_set_bytes{ + container!="", container!="POD", + namespace="gitlab-runner", pod=~"runner-.*" + } + ) by (image) + + signals: + - name: developer-weighted-usage + queryRef: runner-image-usage + type: timeWeightedAggregate + timeWeightedAggregate: + method: sum + timezone: Europe/Berlin + defaultWeight: "0" + windows: + - startHour: 7 + endHour: 9 + weight: "300m" + - startHour: 9 + endHour: 17 + weight: "1" + - startHour: 17 + endHour: 20 + weight: "300m" + + - name: peak-concurrency + queryRef: runner-image-usage + type: aggregate + aggregate: + method: max + + ranking: + strategy: weightedSum + weightedSum: + normalize: minMax + missingSignal: zero + terms: + - signalRef: developer-weighted-usage + weight: "700m" + - signalRef: peak-concurrency + weight: "300m" +``` + +## Status and Observability + +The controller exposes per-query, per-signal, and per-image ranking detail in status: + +```yaml +status: + lastSyncTime: "2026-06-18T10:00:00Z" + + queryResults: + - name: runner-image-usage + type: prometheus + series: 30 + samples: 60480 + status: success + + signalResults: + - name: total-usage + images: 30 + status: success + - name: peak-concurrency + images: 30 + status: success + + discoveredImages: + - image: registry.example.com/ci/java-gradle:21 + rank: 1 + finalScore: "0.8768" + selected: true + signals: + - name: total-usage + rawValue: "8210" + normalizedValue: "0.824" + - name: peak-concurrency + rawValue: "96" + normalizedValue: "1.0" + ranking: + strategy: weightedSum + terms: + - signal: total-usage + weight: "0.7" + contribution: "0.5768" + - signal: peak-concurrency + weight: "0.3" + contribution: "0.3" ``` +> **Note:** Pipeline execution is not yet implemented. The controller currently sets +> `Ready=False, reason=NotImplemented` and will populate status once execution is +> available in a future release (Issues 2–10 in the implementation sequence). + +## Discovery Strategies Reference + +| # | Strategy | Score formula | Signals needed | +|---|----------|---------------|----------------| +| 1 | Total usage | `Σ count_I(t)` over W | `total-usage` | +| 2 | Peak same-image concurrency | `max count_I(t)` over W | `peak-concurrency` | +| 3 | Developer-time weighted usage | `Σ weight(t)·count_I(t)` | `developer-weighted-usage` | +| 4 | Recent usage | `Σ count_I(t)` over recent window | `recent-usage` | +| 5 | Hybrid usage + peak | `α·norm(total) + (1-α)·norm(peak)` | `total-usage`, `peak-concurrency` | +| 6 | Hybrid dev-time + peak | `α·norm(dev) + (1-α)·norm(peak)` | `developer-weighted-usage`, `peak-concurrency` | +| 7 | Count × pull time | `total_usage(I) · p_hat(I)` | `total-usage`, `p50-cold-pull-time` | +| 9 | Model-aware exposure | `J_target · (1-1/N)^J_pre · p_hat` | `pre-window-usage`, `target-window-usage`, `p50-cold-pull-time` | + ## Error Handling - On transient failures, the operator keeps the **last known good** discovery results - Source health is tracked via conditions on the DiscoveryPolicy status -- Each source is queried independently — one failing source doesn't block others +- Each query is executed independently — one failing query does not block others diff --git a/docs/content/docs/reference/_generated_architecture.md b/docs/content/docs/reference/_generated_architecture.md index 1abb6ac..3091959 100644 --- a/docs/content/docs/reference/_generated_architecture.md +++ b/docs/content/docs/reference/_generated_architecture.md @@ -26,7 +26,6 @@ graph TD graph LR cmd/main.go --> internal/controller internal/controller --> api/v1alpha1 - internal/controller --> internal/discovery internal/controller --> internal/metrics internal/controller --> internal/pacing internal/controller --> internal/podbuilder diff --git a/docs/content/docs/reference/_generated_crds.md b/docs/content/docs/reference/_generated_crds.md index 1d72338..453b997 100644 --- a/docs/content/docs/reference/_generated_crds.md +++ b/docs/content/docs/reference/_generated_crds.md @@ -106,19 +106,23 @@ DiscoveryPolicy automatically discovers images from registries or Prometheus met | Field | Type | Required | Default | Description | |-------|------|----------|---------|-------------| -| `sources` | `[]DiscoverySource` | Yes | — | Sources is the list of discovery backends to query. At least one source is required. Multiple sources are merged and ranked together before maxImages is applied. | +| `queries` | `[]DiscoveryQuery` | No | — | Queries is the list of named raw-data sources. Each query is referenced by name from signals. | +| `signals` | `[]DiscoverySignal` | No | — | Signals is the list of named per-image metrics derived from query results. Each signal is referenced by name from the ranking configuration. | +| `ranking` | `*DiscoveryRanking` | No | — | Ranking defines how signals are combined into a final ordered image list. | | `imageFilter` | `string` | No | — | ImageFilter is a regex applied to discovered image references. Only matching images are kept. Example: "registry.example.com/team/.*" (only keep images from that registry path) | -| `syncInterval` | `metav1.Duration` | No | 30m | SyncInterval is how often the operator re-queries all sources and updates status.discoveredImages. Default: "30m". Example: "1h", "15m" | +| `syncInterval` | `metav1.Duration` | No | 30m | SyncInterval is how often the operator re-runs the pipeline and updates status.discoveredImages. Default: "30m". Example: "1h", "15m" | | `maxImages` | `int32` | No | 50 | MaxImages caps the total number of images stored in status.discoveredImages. Images are ranked by score; lowest-scoring images are dropped when the cap is exceeded. Default: 50. Example: 30, 100 | ### Status | Field | Type | Description | |-------|------|-------------| -| `lastSyncTime` | `*metav1.Time` | LastSyncTime is the timestamp of the last successful sync. | -| `discoveredImages` | `[]DiscoveredImage` | DiscoveredImages is the list of discovered images from all sources. | -| `imageCount` | `int32` | ImageCount is the number of discovered images. | -| `sourceCount` | `int32` | SourceCount is the number of configured sources. | +| `lastSyncTime` | `*metav1.Time` | LastSyncTime is the timestamp of the last reconciliation attempt. | +| `queryResults` | `[]QueryResult` | QueryResults reports the outcome of each named query execution. | +| `signalResults` | `[]SignalResult` | SignalResults reports the outcome of each signal derivation. | +| `discoveredImages` | `[]DiscoveredImage` | DiscoveredImages is the ordered list of discovered and ranked images. Only images with selected=true are propagated to dependent CachedImageSet resources. | +| `imageCount` | `int32` | ImageCount is the number of selected discovered images. | +| `queryCount` | `int32` | QueryCount is the number of configured queries. | | `conditions` | `[]metav1.Condition` | Conditions represent the latest available observations. | --- @@ -143,6 +147,14 @@ PullPolicy controls the pacing and retry behavior for image pulls across cluster ## Helper Types +### AggregateSignalConfig + +AggregateSignalConfig configures the aggregate signal type. + +| Field | Type | Required | Default | Description | +|-------|------|----------|---------|-------------| +| `method` | `AggregationMethod` | Yes | — | Method is the aggregation function applied to all samples per image. | + ### BackoffConfig BackoffConfig defines exponential retry backoff behavior for failed pulls. @@ -154,13 +166,28 @@ BackoffConfig defines exponential retry backoff behavior for failed pulls. ### DiscoveredImage -DiscoveredImage represents a single discovered image with metadata. +DiscoveredImage represents a single discovered and ranked image. | Field | Type | Required | Default | Description | |-------|------|----------|---------|-------------| | `image` | `string` | Yes | — | Image is the fully qualified image reference. | -| `score` | `int64` | Yes | — | Score is the ranking score from the source (higher = more relevant). | -| `source` | `string` | Yes | — | Source identifies which discovery source produced this image. | +| `rank` | `int32` | Yes | — | Rank is the position of this image in the final ordered list (1 = highest score). | +| `finalScore` | `string` | Yes | — | FinalScore is the computed ranking score as a decimal string. | +| `selected` | `bool` | Yes | — | Selected is true when this image is within the maxImages cap and will be propagated to dependent CachedImageSet resources. | +| `signals` | `[]ImageSignalValue` | No | — | Signals lists the per-signal values used during ranking (for observability). | +| `ranking` | `*ImageRankingDetail` | No | — | Ranking explains how the final score was computed. | + +### DiscoveryLokiQuery + +DiscoveryLokiQuery defines the Loki-specific query parameters. + +| Field | Type | Required | Default | Description | +|-------|------|----------|---------|-------------| +| `endpoint` | `string` | Yes | — | Endpoint is the Loki API URL. Example: "https://loki.example.com" | +| `query` | `string` | Yes | — | Query is the LogQL expression. | +| `queryType` | `LokiQueryType` | No | range | QueryType controls how the query is executed. Currently only "range" is supported. | +| `lookback` | `*metav1.Duration` | No | — | Lookback is the time window for the query (start=now-lookback, end=now). Example: "168h" (7 days), "24h" | +| `parser` | `*LokiParser` | No | — | Parser configures how log lines are parsed into structured event records. | ### DiscoveryPolicyReference @@ -170,16 +197,64 @@ DiscoveryPolicyReference is a reference to a DiscoveryPolicy resource. |-------|------|----------|---------|-------------| | `name` | `string` | Yes | — | Name of the DiscoveryPolicy resource. | -### DiscoverySource +### DiscoveryPrometheusQuery + +DiscoveryPrometheusQuery defines the Prometheus-specific query parameters. The PromQL result MUST carry an "image" label; that label value is the image reference. + +| Field | Type | Required | Default | Description | +|-------|------|----------|---------|-------------| +| `endpoint` | `string` | Yes | — | Endpoint is the Prometheus-compatible API URL (Prometheus, Thanos, Mimir, VictoriaMetrics). Example: "http://prometheus.monitoring.svc:9090", "https://mimir.example.com" | +| `query` | `string` | Yes | — | Query is the PromQL expression. Must return results with an "image" label. Example: count(container_memory_working_set_bytes{namespace="gitlab-runner"}) by (image) | +| `queryType` | `QueryType` | No | range | QueryType controls how the query is executed: "range" or "instant". Default: "range". | +| `lookback` | `*metav1.Duration` | No | — | Lookback is the time window for range queries (start=now-lookback, end=now). Required when queryType is "range". Ignored when queryType is "instant". Example: "168h" (7 days), "24h", "72h" | +| `step` | `*metav1.Duration` | No | — | Step is the resolution step for range queries. Smaller steps increase data-point density but also increase Prometheus load. Default: 5m. Example: "1m", "15m" | + +### DiscoveryQuery + +DiscoveryQuery defines a named raw-data source referenced by signals. + +| Field | Type | Required | Default | Description | +|-------|------|----------|---------|-------------| +| `name` | `string` | Yes | — | Name is the unique identifier for this query within the policy. Signals reference queries by this name via queryRef. | +| `type` | `DiscoveryQueryType` | Yes | — | Type selects the backend. Must be "prometheus" or "loki". | +| `prometheus` | `*DiscoveryPrometheusQuery` | No | — | Prometheus contains the configuration when type=prometheus. | +| `loki` | `*DiscoveryLokiQuery` | No | — | Loki contains the configuration when type=loki. | +| `secretRef` | `*corev1.LocalObjectReference` | No | — | SecretRef references a Secret in the pod namespace (default "drop-system") for auth/TLS. Supported Secret keys: token, username, password, ca.crt, tls.crt, tls.key, headers.. | + +### DiscoveryRanking + +DiscoveryRanking defines how signals are combined into the final ordered image list. + +| Field | Type | Required | Default | Description | +|-------|------|----------|---------|-------------| +| `strategy` | `RankingStrategy` | Yes | — | Strategy selects the ranking algorithm. | +| `signal` | `*SignalRankingConfig` | No | — | Signal is required when strategy=signal. | +| `weightedSum` | `*WeightedSumRankingConfig` | No | — | WeightedSum is required when strategy=weightedSum. | +| `modelExposure` | `*ModelExposureRankingConfig` | No | — | ModelExposure is required when strategy=modelExposure. | + +### DiscoverySignal + +DiscoverySignal defines a named per-image metric derived from a single query. + +| Field | Type | Required | Default | Description | +|-------|------|----------|---------|-------------| +| `name` | `string` | Yes | — | Name is the unique identifier for this signal within the policy. Ranking configurations reference signals by this name. | +| `queryRef` | `string` | Yes | — | QueryRef is the name of the query that provides raw data for this signal. Must match a queries[].name within the same policy. | +| `type` | `SignalType` | Yes | — | Type selects the signal derivation method. | +| `aggregate` | `*AggregateSignalConfig` | No | — | Aggregate is required when type=aggregate. | +| `timeWeightedAggregate` | `*TimeWeightedAggregateSignalConfig` | No | — | TimeWeightedAggregate is required when type=timeWeightedAggregate. | +| `windowAggregate` | `*WindowAggregateSignalConfig` | No | — | WindowAggregate is required when type=windowAggregate. | +| `eventPullTime` | `*EventPullTimeSignalConfig` | No | — | EventPullTime is required when type=eventPullTime. | + +### EventPullTimeSignalConfig -DiscoverySource defines a single discovery backend. +EventPullTimeSignalConfig configures the eventPullTime signal type. The referenced query must be a Loki query. | Field | Type | Required | Default | Description | |-------|------|----------|---------|-------------| -| `type` | `string` | Yes | — | Type identifies the discovery backend. Must be "prometheus" or "registry". | -| `prometheus` | `*PrometheusSource` | No | — | Prometheus contains the configuration when type=prometheus. | -| `registry` | `*RegistrySource` | No | — | Registry contains the configuration when type=registry. | -| `secretRef` | `*corev1.LocalObjectReference` | No | — | SecretRef references a Secret in the namespace where Drop creates pull Pods. The default namespace is "drop-system" unless the controller is started with a different --pod-namespace. Supported Secret keys: token, username, password, ca.crt, tls.crt, tls.key, headers.. Example: {name: "prometheus-creds"} | +| `statistic` | `EventPullTimeStatistic` | Yes | — | Statistic selects which pull-time metric to compute. | +| `includeCacheHits` | `bool` | Yes | false | IncludeCacheHits controls whether "already present on machine" events are included in cold-pull duration statistics. Set to false to exclude cache hits. | +| `durationMode` | `DurationMode` | Yes | — | DurationMode controls how pull duration is extracted from event records. | ### ImageEntry @@ -191,6 +266,48 @@ ImageEntry defines a single image to include in a set. | `tag` | `string` | No | — | Tag to pull. Mutually exclusive with Digest. Example: "1.25-alpine", "v2.4.1" | | `digest` | `string` | No | — | Digest to pull as an immutable reference. Mutually exclusive with Tag. Example: "sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4" | +### ImageRankingDetail + +ImageRankingDetail explains how the final score was computed for one image. + +| Field | Type | Required | Default | Description | +|-------|------|----------|---------|-------------| +| `strategy` | `string` | Yes | — | Strategy is the ranking strategy that produced this detail. | +| `terms` | `[]RankingTerm` | No | — | Terms lists the per-signal contributions (populated for weightedSum and modelExposure). | + +### ImageSignalValue + +ImageSignalValue records the raw and normalized value of a signal for one image. + +| Field | Type | Required | Default | Description | +|-------|------|----------|---------|-------------| +| `name` | `string` | Yes | — | Name is the signal name. | +| `rawValue` | `string` | Yes | — | RawValue is the unscaled signal value as a decimal string. | +| `normalizedValue` | `string` | No | — | NormalizedValue is the normalized value (after minMax or other normalization) as a decimal string. Only populated for signals used in a weightedSum ranking. | + +### LokiParser + +LokiParser configures structured parsing of Loki log entries. + +| Field | Type | Required | Default | Description | +|-------|------|----------|---------|-------------| +| `type` | `LokiParserType` | Yes | — | Type selects the parser. Currently only "kubernetesEvents" is supported. | +| `podField` | `string` | No | — | PodField is the log label or field that contains the pod name. Example: "involvedObject_name" | +| `reasonField` | `string` | No | — | ReasonField is the log label or field that contains the event reason. Example: "reason" | +| `messageField` | `string` | No | — | MessageField is the log label or field that contains the event message. Example: "message" | +| `imageField` | `string` | No | — | ImageField is the log label or field from which the image reference is extracted. For kubernetesEvents, the image is parsed out of the message text. Example: "message" | + +### ModelExposureRankingConfig + +ModelExposureRankingConfig configures the modelExposure ranking strategy. Score = J_target(I) * (1 - 1/N)^J_pre(I) * p_hat(I) where N=nodeCount, J_pre is pre-window usage, J_target is target-window usage, and p_hat is the pull-time signal value. + +| Field | Type | Required | Default | Description | +|-------|------|----------|---------|-------------| +| `nodeCount` | `int32` | Yes | — | NodeCount is the number of eligible CI nodes (N in the exposure formula). | +| `preWindowUsageSignalRef` | `string` | Yes | — | PreWindowUsageSignalRef is the name of the signal representing usage before the target window. Must match a signals[].name within the same policy. | +| `targetWindowUsageSignalRef` | `string` | Yes | — | TargetWindowUsageSignalRef is the name of the signal representing usage during the target window. Must match a signals[].name within the same policy. | +| `pullTimeSignalRef` | `string` | Yes | — | PullTimeSignalRef is the name of the signal providing per-image pull-time estimates. Must match a signals[].name within the same policy. | + ### PolicyReference PolicyReference is a reference to a PullPolicy resource. @@ -199,28 +316,106 @@ PolicyReference is a reference to a PullPolicy resource. |-------|------|----------|---------|-------------| | `name` | `string` | Yes | — | Name of the PullPolicy resource. | -### PrometheusSource +### QueryResult -PrometheusSource defines Prometheus query configuration for image discovery. +QueryResult reports the outcome of a single named query execution. | Field | Type | Required | Default | Description | |-------|------|----------|---------|-------------| -| `endpoint` | `string` | Yes | — | Endpoint is the Prometheus-compatible API URL (Prometheus, Thanos, Mimir, VictoriaMetrics). Example: "http://prometheus.monitoring.svc:9090", "https://mimir.example.com" | -| `query` | `string` | Yes | — | Query is the PromQL expression. It MUST return results with an "image" label — that label value is used as the discovered image reference. The query result value is used as the ranking score (higher = more relevant). Example: count(container_memory_working_set_bytes{container!="",container!="POD",namespace="gitlab-runner"}) by (image) | -| `queryType` | `QueryType` | No | range | QueryType controls how the Prometheus query is executed. "range" uses /api/v1/query_range with a time window defined by lookback. "instant" uses /api/v1/query for a single point-in-time result. Default: "range". | -| `lookback` | `*metav1.Duration` | No | — | Lookback is the time window for range queries. When queryType is "range", the operator queries (start=now-lookback, end=now) and aggregates all returned values per image. The aggregation function is controlled by the aggregationMethod field. Required when queryType is "range". Ignored when queryType is "instant". Example: "168h" (7 days), "24h", "72h" | -| `aggregationMethod` | `*AggregationMethod` | No | — | AggregationMethod controls how data points from a range query are combined into a single score. Only used when queryType is "range". Ignored for instant queries. When not set (nil), Drop uses the last data-point value directly — use this when your PromQL already contains aggregation functions (e.g., count_over_time, topk). Options: "sum", "count", "avg", "max" | -| `step` | `*metav1.Duration` | No | — | Step is the resolution step for range queries (only used when lookback is set). Smaller steps = more data points = more accurate aggregation but higher Prometheus load. Default: 5m. Example: "1m", "15m" | +| `name` | `string` | Yes | — | Name matches the queries[].name that produced this result. | +| `type` | `DiscoveryQueryType` | Yes | — | Type is the query backend type (prometheus or loki). | +| `series` | `*int32` | No | — | Series is the number of time-series returned (Prometheus queries only). | +| `samples` | `*int64` | No | — | Samples is the total number of data points across all series (Prometheus range queries only). | +| `records` | `*int64` | No | — | Records is the number of log records returned (Loki queries only). | +| `status` | `QueryResultStatus` | Yes | — | Status is "success" or "failed". | +| `message` | `string` | No | — | Message describes the failure reason when status=failed. | + +### RankingTerm + +RankingTerm records the contribution of one signal to the final score of an image. + +| Field | Type | Required | Default | Description | +|-------|------|----------|---------|-------------| +| `signal` | `string` | Yes | — | Signal is the signal name. | +| `weight` | `string` | Yes | — | Weight is the configured weight as a decimal string. | +| `contribution` | `string` | Yes | — | Contribution is weight * normalizedValue as a decimal string. | + +### SignalRankingConfig + +SignalRankingConfig configures the signal ranking strategy. + +| Field | Type | Required | Default | Description | +|-------|------|----------|---------|-------------| +| `signalRef` | `string` | Yes | — | SignalRef is the name of the signal whose values determine image rank. Must match a signals[].name within the same policy. | + +### SignalResult + +SignalResult reports the outcome of a single signal derivation. + +| Field | Type | Required | Default | Description | +|-------|------|----------|---------|-------------| +| `name` | `string` | Yes | — | Name matches the signals[].name that produced this result. | +| `images` | `int32` | Yes | — | Images is the number of images for which this signal produced a value. | +| `status` | `string` | Yes | — | Status is "success" or "failed". | +| `message` | `string` | No | — | Message describes the failure reason when status=failed. | + +### TimeOfDayWindow + +TimeOfDayWindow defines a fixed wall-clock time range within each day. + +| Field | Type | Required | Default | Description | +|-------|------|----------|---------|-------------| +| `start` | `string` | Yes | — | Start is the inclusive start time in "HH:MM" format (24-hour, local time). Example: "09:00" | +| `end` | `string` | Yes | — | End is the exclusive end time in "HH:MM" format (24-hour, local time). Example: "17:00" | + +### TimeWeightedAggregateSignalConfig + +TimeWeightedAggregateSignalConfig configures the timeWeightedAggregate signal type. Each sample value is multiplied by the weight of the matching time window before aggregation. + +| Field | Type | Required | Default | Description | +|-------|------|----------|---------|-------------| +| `method` | `AggregationMethod` | Yes | — | Method is the aggregation function applied after weighting (currently only "sum" is meaningful). | +| `timezone` | `string` | Yes | — | Timezone is the IANA time zone used to evaluate window boundaries (wall-clock hours). Example: "Europe/Berlin", "America/New_York", "UTC" | +| `defaultWeight` | `resource.Quantity` | Yes | — | DefaultWeight is applied to samples that do not fall in any configured window. Use "0" to exclude off-hours samples entirely. | +| `windows` | `[]TimeWeightedWindow` | Yes | — | Windows is the list of hour-of-day windows with associated weights. | + +### TimeWeightedWindow + +TimeWeightedWindow defines a wall-clock hour range and its weight factor. + +| Field | Type | Required | Default | Description | +|-------|------|----------|---------|-------------| +| `startHour` | `int32` | Yes | — | StartHour is the inclusive start of the window in local time (0–23). | +| `endHour` | `int32` | Yes | — | EndHour is the exclusive end of the window in local time (1–24). | +| `weight` | `resource.Quantity` | Yes | — | Weight is the factor applied to sample values within this window. Use "1.0" for full weight, "0.3" for partial, "0" to exclude. | + +### WeightedSumRankingConfig + +WeightedSumRankingConfig configures the weightedSum ranking strategy. Score = Σ weight_k * normalize(signal_k(image)). + +| Field | Type | Required | Default | Description | +|-------|------|----------|---------|-------------| +| `normalize` | `NormalizeMethod` | Yes | minMax | Normalize selects the normalization method applied to each signal before weighting. Currently only "minMax" is supported. | +| `missingSignal` | `MissingSignalBehavior` | Yes | zero | MissingSignal controls behavior when an image has no value for a required signal. "zero" treats missing as 0; "drop" removes the image from ranking. | +| `terms` | `[]WeightedSumTerm` | Yes | — | Terms is the list of signals and their weights. | + +### WeightedSumTerm + +WeightedSumTerm defines one signal contribution in a weightedSum ranking. + +| Field | Type | Required | Default | Description | +|-------|------|----------|---------|-------------| +| `signalRef` | `string` | Yes | — | SignalRef is the name of the signal to include in the weighted sum. Must match a signals[].name within the same policy. | +| `weight` | `resource.Quantity` | Yes | — | Weight is the factor applied to the normalized signal value. All weights should be non-negative; they do not need to sum to 1. Example: "0.7" | -### RegistrySource +### WindowAggregateSignalConfig -RegistrySource defines OCI registry tag listing configuration for image discovery. +WindowAggregateSignalConfig configures the windowAggregate signal type. Exactly one of relativeWindow or (window + timezone) must be set. | Field | Type | Required | Default | Description | |-------|------|----------|---------|-------------| -| `url` | `string` | Yes | — | URL is the registry base URL (without repository path). Example: "https://registry.example.com", "https://ghcr.io" | -| `repositories` | `[]string` | Yes | — | Repositories is the list of repository paths to list tags from. Example: ["team/app", "team/worker", "infra/tools"] | -| `tagFilter` | `string` | No | — | TagFilter is a regex applied to tag names. Only matching tags are discovered. Example: "^v[0-9]+\\." (semver tags only), "^main-" (main branch builds) | -| `topX` | `int32` | No | — | TopX limits the number of tags kept per repository after tagFilter is applied. The registry API does not provide creation timestamps here; Drop keeps the last N tags returned by the registry. Example: 3 (keep the last 3 matching tags returned per repo) | -| `imageTemplate` | `string` | No | — | ImageTemplate is a Go text/template for constructing the full image reference from discovered tags. Available variables: {{.Registry}}, {{.Repository}}, {{.Tag}} Default (when unset): "{{.Registry}}/{{.Repository}}:{{.Tag}}" Example: "{{.Registry}}/{{.Repository}}@{{.Tag}}" (if tags are actually digests) | +| `method` | `AggregationMethod` | Yes | — | Method is the aggregation function applied to the windowed samples. | +| `relativeWindow` | `*metav1.Duration` | No | — | RelativeWindow aggregates only samples from the last N duration before now. Mutually exclusive with window + timezone. Example: "2h" (last 2 hours) | +| `timezone` | `string` | No | — | Timezone is the IANA time zone for evaluating wall-clock window boundaries. Required when window is set. | +| `window` | `*TimeOfDayWindow` | No | — | Window defines fixed wall-clock start/end times within each day. Mutually exclusive with relativeWindow. | diff --git a/docs/static/llms-full.txt b/docs/static/llms-full.txt index b0ca6cc..9ed121d 100644 --- a/docs/static/llms-full.txt +++ b/docs/static/llms-full.txt @@ -84,18 +84,22 @@ Controller: internal/controller/discoverypolicy_controller.go | Test: internal/c #### Spec | Field | JSON | Type | Required | Default | Description | |-------|------|------|----------|---------|-------------| -| Sources | `sources` | `[]DiscoverySource` | ✓ | | Sources is the list of discovery backends to query. At least one source is required. Multiple sources are merged and ranked together before maxImages is applied. | +| Queries | `queries` | `[]DiscoveryQuery` | — | | Queries is the list of named raw-data sources. Each query is referenced by name from signals. | +| Signals | `signals` | `[]DiscoverySignal` | — | | Signals is the list of named per-image metrics derived from query results. Each signal is referenced by name from the ranking configuration. | +| Ranking | `ranking` | `*DiscoveryRanking` | — | | Ranking defines how signals are combined into a final ordered image list. | | ImageFilter | `imageFilter` | `string` | — | | ImageFilter is a regex applied to discovered image references. Only matching images are kept. Example: "registry.example.com/team/.*" (only keep images from that registry path) | -| SyncInterval | `syncInterval` | `metav1.Duration` | — | `30m` | SyncInterval is how often the operator re-queries all sources and updates status.discoveredImages. Default: "30m". Example: "1h", "15m" | +| SyncInterval | `syncInterval` | `metav1.Duration` | — | `30m` | SyncInterval is how often the operator re-runs the pipeline and updates status.discoveredImages. Default: "30m". Example: "1h", "15m" | | MaxImages | `maxImages` | `int32` | — | `50` | MaxImages caps the total number of images stored in status.discoveredImages. Images are ranked by score; lowest-scoring images are dropped when the cap is exceeded. Default: 50. Example: 30, 100 | #### Status | Field | JSON | Type | Description | |-------|------|------|-------------| -| LastSyncTime | `lastSyncTime` | `*metav1.Time` | LastSyncTime is the timestamp of the last successful sync. | -| DiscoveredImages | `discoveredImages` | `[]DiscoveredImage` | DiscoveredImages is the list of discovered images from all sources. | -| ImageCount | `imageCount` | `int32` | ImageCount is the number of discovered images. | -| SourceCount | `sourceCount` | `int32` | SourceCount is the number of configured sources. | +| LastSyncTime | `lastSyncTime` | `*metav1.Time` | LastSyncTime is the timestamp of the last reconciliation attempt. | +| QueryResults | `queryResults` | `[]QueryResult` | QueryResults reports the outcome of each named query execution. | +| SignalResults | `signalResults` | `[]SignalResult` | SignalResults reports the outcome of each signal derivation. | +| DiscoveredImages | `discoveredImages` | `[]DiscoveredImage` | DiscoveredImages is the ordered list of discovered and ranked images. Only images with selected=true are propagated to dependent CachedImageSet resources. | +| ImageCount | `imageCount` | `int32` | ImageCount is the number of selected discovered images. | +| QueryCount | `queryCount` | `int32` | QueryCount is the number of configured queries. | | Conditions | `conditions` | `[]metav1.Condition` | Conditions represent the latest available observations. | @@ -117,6 +121,14 @@ PullPolicy controls the pacing and retry behavior for image pulls across cluster ## Helper Types +### AggregateSignalConfig + +AggregateSignalConfig configures the aggregate signal type. + +| Field | JSON | Type | Required | Default | Description | +|-------|------|------|----------|---------|-------------| +| Method | `method` | `AggregationMethod` | ✓ | | Method is the aggregation function applied to all samples per image. Enum: `sum`,`count`,`avg`,`max`,`min` | + ### BackoffConfig BackoffConfig defines exponential retry backoff behavior for failed pulls. @@ -128,13 +140,28 @@ BackoffConfig defines exponential retry backoff behavior for failed pulls. ### DiscoveredImage -DiscoveredImage represents a single discovered image with metadata. +DiscoveredImage represents a single discovered and ranked image. | Field | JSON | Type | Required | Default | Description | |-------|------|------|----------|---------|-------------| | Image | `image` | `string` | ✓ | | Image is the fully qualified image reference. | -| Score | `score` | `int64` | ✓ | | Score is the ranking score from the source (higher = more relevant). | -| Source | `source` | `string` | ✓ | | Source identifies which discovery source produced this image. | +| Rank | `rank` | `int32` | ✓ | | Rank is the position of this image in the final ordered list (1 = highest score). | +| FinalScore | `finalScore` | `string` | ✓ | | FinalScore is the computed ranking score as a decimal string. | +| Selected | `selected` | `bool` | ✓ | | Selected is true when this image is within the maxImages cap and will be propagated to dependent CachedImageSet resources. | +| Signals | `signals` | `[]ImageSignalValue` | — | | Signals lists the per-signal values used during ranking (for observability). | +| Ranking | `ranking` | `*ImageRankingDetail` | — | | Ranking explains how the final score was computed. | + +### DiscoveryLokiQuery + +DiscoveryLokiQuery defines the Loki-specific query parameters. + +| Field | JSON | Type | Required | Default | Description | +|-------|------|------|----------|---------|-------------| +| Endpoint | `endpoint` | `string` | ✓ | | Endpoint is the Loki API URL. Example: "https://loki.example.com" | +| Query | `query` | `string` | ✓ | | Query is the LogQL expression. | +| QueryType | `queryType` | `LokiQueryType` | — | `range` | QueryType controls how the query is executed. Currently only "range" is supported. | +| Lookback | `lookback` | `*metav1.Duration` | — | | Lookback is the time window for the query (start=now-lookback, end=now). Example: "168h" (7 days), "24h" | +| Parser | `parser` | `*LokiParser` | — | | Parser configures how log lines are parsed into structured event records. | ### DiscoveryPolicyReference @@ -144,16 +171,64 @@ DiscoveryPolicyReference is a reference to a DiscoveryPolicy resource. |-------|------|------|----------|---------|-------------| | Name | `name` | `string` | ✓ | | Name of the DiscoveryPolicy resource. | -### DiscoverySource +### DiscoveryPrometheusQuery -DiscoverySource defines a single discovery backend. +DiscoveryPrometheusQuery defines the Prometheus-specific query parameters. The PromQL result MUST carry an "image" label; that label value is the image reference. | Field | JSON | Type | Required | Default | Description | |-------|------|------|----------|---------|-------------| -| Type | `type` | `string` | ✓ | | Type identifies the discovery backend. Must be "prometheus" or "registry". Enum: `prometheus`,`registry` | -| Prometheus | `prometheus` | `*PrometheusSource` | — | | Prometheus contains the configuration when type=prometheus. | -| Registry | `registry` | `*RegistrySource` | — | | Registry contains the configuration when type=registry. | -| SecretRef | `secretRef` | `*corev1.LocalObjectReference` | — | | SecretRef references a Secret in the namespace where Drop creates pull Pods. The default namespace is "drop-system" unless the controller is started with a different --pod-namespace. Supported Secret keys: token, username, password, ca.crt, tls.crt, tls.key, headers.. Example: {name: "prometheus-creds"} | +| Endpoint | `endpoint` | `string` | ✓ | | Endpoint is the Prometheus-compatible API URL (Prometheus, Thanos, Mimir, VictoriaMetrics). Example: "http://prometheus.monitoring.svc:9090", "https://mimir.example.com" | +| Query | `query` | `string` | ✓ | | Query is the PromQL expression. Must return results with an "image" label. Example: count(container_memory_working_set_bytes{namespace="gitlab-runner"}) by (image) | +| QueryType | `queryType` | `QueryType` | — | `range` | QueryType controls how the query is executed: "range" or "instant". Default: "range". | +| Lookback | `lookback` | `*metav1.Duration` | — | | Lookback is the time window for range queries (start=now-lookback, end=now). Required when queryType is "range". Ignored when queryType is "instant". Example: "168h" (7 days), "24h", "72h" | +| Step | `step` | `*metav1.Duration` | — | | Step is the resolution step for range queries. Smaller steps increase data-point density but also increase Prometheus load. Default: 5m. Example: "1m", "15m" | + +### DiscoveryQuery + +DiscoveryQuery defines a named raw-data source referenced by signals. + +| Field | JSON | Type | Required | Default | Description | +|-------|------|------|----------|---------|-------------| +| Name | `name` | `string` | ✓ | | Name is the unique identifier for this query within the policy. Signals reference queries by this name via queryRef. | +| Type | `type` | `DiscoveryQueryType` | ✓ | | Type selects the backend. Must be "prometheus" or "loki". Enum: `prometheus`,`loki` | +| Prometheus | `prometheus` | `*DiscoveryPrometheusQuery` | — | | Prometheus contains the configuration when type=prometheus. | +| Loki | `loki` | `*DiscoveryLokiQuery` | — | | Loki contains the configuration when type=loki. | +| SecretRef | `secretRef` | `*corev1.LocalObjectReference` | — | | SecretRef references a Secret in the pod namespace (default "drop-system") for auth/TLS. Supported Secret keys: token, username, password, ca.crt, tls.crt, tls.key, headers.. | + +### DiscoveryRanking + +DiscoveryRanking defines how signals are combined into the final ordered image list. + +| Field | JSON | Type | Required | Default | Description | +|-------|------|------|----------|---------|-------------| +| Strategy | `strategy` | `RankingStrategy` | ✓ | | Strategy selects the ranking algorithm. Enum: `signal`,`weightedSum`,`modelExposure` | +| Signal | `signal` | `*SignalRankingConfig` | — | | Signal is required when strategy=signal. | +| WeightedSum | `weightedSum` | `*WeightedSumRankingConfig` | — | | WeightedSum is required when strategy=weightedSum. | +| ModelExposure | `modelExposure` | `*ModelExposureRankingConfig` | — | | ModelExposure is required when strategy=modelExposure. | + +### DiscoverySignal + +DiscoverySignal defines a named per-image metric derived from a single query. + +| Field | JSON | Type | Required | Default | Description | +|-------|------|------|----------|---------|-------------| +| Name | `name` | `string` | ✓ | | Name is the unique identifier for this signal within the policy. Ranking configurations reference signals by this name. | +| QueryRef | `queryRef` | `string` | ✓ | | QueryRef is the name of the query that provides raw data for this signal. Must match a queries[].name within the same policy. | +| Type | `type` | `SignalType` | ✓ | | Type selects the signal derivation method. Enum: `aggregate`,`timeWeightedAggregate`,`windowAggregate`,`eventPullTime` | +| Aggregate | `aggregate` | `*AggregateSignalConfig` | — | | Aggregate is required when type=aggregate. | +| TimeWeightedAggregate | `timeWeightedAggregate` | `*TimeWeightedAggregateSignalConfig` | — | | TimeWeightedAggregate is required when type=timeWeightedAggregate. | +| WindowAggregate | `windowAggregate` | `*WindowAggregateSignalConfig` | — | | WindowAggregate is required when type=windowAggregate. | +| EventPullTime | `eventPullTime` | `*EventPullTimeSignalConfig` | — | | EventPullTime is required when type=eventPullTime. | + +### EventPullTimeSignalConfig + +EventPullTimeSignalConfig configures the eventPullTime signal type. The referenced query must be a Loki query. + +| Field | JSON | Type | Required | Default | Description | +|-------|------|------|----------|---------|-------------| +| Statistic | `statistic` | `EventPullTimeStatistic` | ✓ | | Statistic selects which pull-time metric to compute. Enum: `p50`,`p90`,`p95`,`avg`,`max`,`count`,`failureCount`,`cacheHitCount` | +| IncludeCacheHits | `includeCacheHits` | `bool` | ✓ | `false` | IncludeCacheHits controls whether "already present on machine" events are included in cold-pull duration statistics. Set to false to exclude cache hits. | +| DurationMode | `durationMode` | `DurationMode` | ✓ | | DurationMode controls how pull duration is extracted from event records. Enum: `eventPair`,`messageDuration` | ### ImageEntry @@ -165,6 +240,48 @@ ImageEntry defines a single image to include in a set. | Tag | `tag` | `string` | — | | Tag to pull. Mutually exclusive with Digest. Example: "1.25-alpine", "v2.4.1" | | Digest | `digest` | `string` | — | | Digest to pull as an immutable reference. Mutually exclusive with Tag. Example: "sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4" | +### ImageRankingDetail + +ImageRankingDetail explains how the final score was computed for one image. + +| Field | JSON | Type | Required | Default | Description | +|-------|------|------|----------|---------|-------------| +| Strategy | `strategy` | `string` | ✓ | | Strategy is the ranking strategy that produced this detail. | +| Terms | `terms` | `[]RankingTerm` | — | | Terms lists the per-signal contributions (populated for weightedSum and modelExposure). | + +### ImageSignalValue + +ImageSignalValue records the raw and normalized value of a signal for one image. + +| Field | JSON | Type | Required | Default | Description | +|-------|------|------|----------|---------|-------------| +| Name | `name` | `string` | ✓ | | Name is the signal name. | +| RawValue | `rawValue` | `string` | ✓ | | RawValue is the unscaled signal value as a decimal string. | +| NormalizedValue | `normalizedValue` | `string` | — | | NormalizedValue is the normalized value (after minMax or other normalization) as a decimal string. Only populated for signals used in a weightedSum ranking. | + +### LokiParser + +LokiParser configures structured parsing of Loki log entries. + +| Field | JSON | Type | Required | Default | Description | +|-------|------|------|----------|---------|-------------| +| Type | `type` | `LokiParserType` | ✓ | | Type selects the parser. Currently only "kubernetesEvents" is supported. Enum: `kubernetesEvents` | +| PodField | `podField` | `string` | — | | PodField is the log label or field that contains the pod name. Example: "involvedObject_name" | +| ReasonField | `reasonField` | `string` | — | | ReasonField is the log label or field that contains the event reason. Example: "reason" | +| MessageField | `messageField` | `string` | — | | MessageField is the log label or field that contains the event message. Example: "message" | +| ImageField | `imageField` | `string` | — | | ImageField is the log label or field from which the image reference is extracted. For kubernetesEvents, the image is parsed out of the message text. Example: "message" | + +### ModelExposureRankingConfig + +ModelExposureRankingConfig configures the modelExposure ranking strategy. Score = J_target(I) * (1 - 1/N)^J_pre(I) * p_hat(I) where N=nodeCount, J_pre is pre-window usage, J_target is target-window usage, and p_hat is the pull-time signal value. + +| Field | JSON | Type | Required | Default | Description | +|-------|------|------|----------|---------|-------------| +| NodeCount | `nodeCount` | `int32` | ✓ | | NodeCount is the number of eligible CI nodes (N in the exposure formula). | +| PreWindowUsageSignalRef | `preWindowUsageSignalRef` | `string` | ✓ | | PreWindowUsageSignalRef is the name of the signal representing usage before the target window. Must match a signals[].name within the same policy. | +| TargetWindowUsageSignalRef | `targetWindowUsageSignalRef` | `string` | ✓ | | TargetWindowUsageSignalRef is the name of the signal representing usage during the target window. Must match a signals[].name within the same policy. | +| PullTimeSignalRef | `pullTimeSignalRef` | `string` | ✓ | | PullTimeSignalRef is the name of the signal providing per-image pull-time estimates. Must match a signals[].name within the same policy. | + ### PolicyReference PolicyReference is a reference to a PullPolicy resource. @@ -173,30 +290,108 @@ PolicyReference is a reference to a PullPolicy resource. |-------|------|------|----------|---------|-------------| | Name | `name` | `string` | ✓ | | Name of the PullPolicy resource. | -### PrometheusSource +### QueryResult -PrometheusSource defines Prometheus query configuration for image discovery. +QueryResult reports the outcome of a single named query execution. | Field | JSON | Type | Required | Default | Description | |-------|------|------|----------|---------|-------------| -| Endpoint | `endpoint` | `string` | ✓ | | Endpoint is the Prometheus-compatible API URL (Prometheus, Thanos, Mimir, VictoriaMetrics). Example: "http://prometheus.monitoring.svc:9090", "https://mimir.example.com" | -| Query | `query` | `string` | ✓ | | Query is the PromQL expression. It MUST return results with an "image" label — that label value is used as the discovered image reference. The query result value is used as the ranking score (higher = more relevant). Example: count(container_memory_working_set_bytes{container!="",container!="POD",namespace="gitlab-runner"}) by (image) | -| QueryType | `queryType` | `QueryType` | — | `range` | QueryType controls how the Prometheus query is executed. "range" uses /api/v1/query_range with a time window defined by lookback. "instant" uses /api/v1/query for a single point-in-time result. Default: "range". | -| Lookback | `lookback` | `*metav1.Duration` | — | | Lookback is the time window for range queries. When queryType is "range", the operator queries (start=now-lookback, end=now) and aggregates all returned values per image. The aggregation function is controlled by the aggregationMethod field. Required when queryType is "range". Ignored when queryType is "instant". Example: "168h" (7 days), "24h", "72h" | -| AggregationMethod | `aggregationMethod` | `*AggregationMethod` | — | | AggregationMethod controls how data points from a range query are combined into a single score. Only used when queryType is "range". Ignored for instant queries. When not set (nil), Drop uses the last data-point value directly — use this when your PromQL already contains aggregation functions (e.g., count_over_time, topk). Options: "sum", "count", "avg", "max" | -| Step | `step` | `*metav1.Duration` | — | | Step is the resolution step for range queries (only used when lookback is set). Smaller steps = more data points = more accurate aggregation but higher Prometheus load. Default: 5m. Example: "1m", "15m" | +| Name | `name` | `string` | ✓ | | Name matches the queries[].name that produced this result. | +| Type | `type` | `DiscoveryQueryType` | ✓ | | Type is the query backend type (prometheus or loki). | +| Series | `series` | `*int32` | — | | Series is the number of time-series returned (Prometheus queries only). | +| Samples | `samples` | `*int64` | — | | Samples is the total number of data points across all series (Prometheus range queries only). | +| Records | `records` | `*int64` | — | | Records is the number of log records returned (Loki queries only). | +| Status | `status` | `QueryResultStatus` | ✓ | | Status is "success" or "failed". | +| Message | `message` | `string` | — | | Message describes the failure reason when status=failed. | + +### RankingTerm + +RankingTerm records the contribution of one signal to the final score of an image. + +| Field | JSON | Type | Required | Default | Description | +|-------|------|------|----------|---------|-------------| +| Signal | `signal` | `string` | ✓ | | Signal is the signal name. | +| Weight | `weight` | `string` | ✓ | | Weight is the configured weight as a decimal string. | +| Contribution | `contribution` | `string` | ✓ | | Contribution is weight * normalizedValue as a decimal string. | -### RegistrySource +### SignalRankingConfig -RegistrySource defines OCI registry tag listing configuration for image discovery. +SignalRankingConfig configures the signal ranking strategy. | Field | JSON | Type | Required | Default | Description | |-------|------|------|----------|---------|-------------| -| URL | `url` | `string` | ✓ | | URL is the registry base URL (without repository path). Example: "https://registry.example.com", "https://ghcr.io" | -| Repositories | `repositories` | `[]string` | ✓ | | Repositories is the list of repository paths to list tags from. Example: ["team/app", "team/worker", "infra/tools"] | -| TagFilter | `tagFilter` | `string` | — | | TagFilter is a regex applied to tag names. Only matching tags are discovered. Example: "^v[0-9]+\\." (semver tags only), "^main-" (main branch builds) | -| TopX | `topX` | `int32` | — | | TopX limits the number of tags kept per repository after tagFilter is applied. The registry API does not provide creation timestamps here; Drop keeps the last N tags returned by the registry. Example: 3 (keep the last 3 matching tags returned per repo) | -| ImageTemplate | `imageTemplate` | `string` | — | | ImageTemplate is a Go text/template for constructing the full image reference from discovered tags. Available variables: {{.Registry}}, {{.Repository}}, {{.Tag}} Default (when unset): "{{.Registry}}/{{.Repository}}:{{.Tag}}" Example: "{{.Registry}}/{{.Repository}}@{{.Tag}}" (if tags are actually digests) | +| SignalRef | `signalRef` | `string` | ✓ | | SignalRef is the name of the signal whose values determine image rank. Must match a signals[].name within the same policy. | + +### SignalResult + +SignalResult reports the outcome of a single signal derivation. + +| Field | JSON | Type | Required | Default | Description | +|-------|------|------|----------|---------|-------------| +| Name | `name` | `string` | ✓ | | Name matches the signals[].name that produced this result. | +| Images | `images` | `int32` | ✓ | | Images is the number of images for which this signal produced a value. | +| Status | `status` | `string` | ✓ | | Status is "success" or "failed". | +| Message | `message` | `string` | — | | Message describes the failure reason when status=failed. | + +### TimeOfDayWindow + +TimeOfDayWindow defines a fixed wall-clock time range within each day. + +| Field | JSON | Type | Required | Default | Description | +|-------|------|------|----------|---------|-------------| +| Start | `start` | `string` | ✓ | | Start is the inclusive start time in "HH:MM" format (24-hour, local time). Example: "09:00" | +| End | `end` | `string` | ✓ | | End is the exclusive end time in "HH:MM" format (24-hour, local time). Example: "17:00" | + +### TimeWeightedAggregateSignalConfig + +TimeWeightedAggregateSignalConfig configures the timeWeightedAggregate signal type. Each sample value is multiplied by the weight of the matching time window before aggregation. + +| Field | JSON | Type | Required | Default | Description | +|-------|------|------|----------|---------|-------------| +| Method | `method` | `AggregationMethod` | ✓ | | Method is the aggregation function applied after weighting (currently only "sum" is meaningful). Enum: `sum`,`count`,`avg`,`max`,`min` | +| Timezone | `timezone` | `string` | ✓ | | Timezone is the IANA time zone used to evaluate window boundaries (wall-clock hours). Example: "Europe/Berlin", "America/New_York", "UTC" | +| DefaultWeight | `defaultWeight` | `resource.Quantity` | ✓ | | DefaultWeight is applied to samples that do not fall in any configured window. Use "0" to exclude off-hours samples entirely. | +| Windows | `windows` | `[]TimeWeightedWindow` | ✓ | | Windows is the list of hour-of-day windows with associated weights. | + +### TimeWeightedWindow + +TimeWeightedWindow defines a wall-clock hour range and its weight factor. + +| Field | JSON | Type | Required | Default | Description | +|-------|------|------|----------|---------|-------------| +| StartHour | `startHour` | `int32` | ✓ | | StartHour is the inclusive start of the window in local time (0–23). | +| EndHour | `endHour` | `int32` | ✓ | | EndHour is the exclusive end of the window in local time (1–24). | +| Weight | `weight` | `resource.Quantity` | ✓ | | Weight is the factor applied to sample values within this window. Use "1.0" for full weight, "0.3" for partial, "0" to exclude. | + +### WeightedSumRankingConfig + +WeightedSumRankingConfig configures the weightedSum ranking strategy. Score = Σ weight_k * normalize(signal_k(image)). + +| Field | JSON | Type | Required | Default | Description | +|-------|------|------|----------|---------|-------------| +| Normalize | `normalize` | `NormalizeMethod` | ✓ | `minMax` | Normalize selects the normalization method applied to each signal before weighting. Currently only "minMax" is supported. Enum: `minMax` | +| MissingSignal | `missingSignal` | `MissingSignalBehavior` | ✓ | `zero` | MissingSignal controls behavior when an image has no value for a required signal. "zero" treats missing as 0; "drop" removes the image from ranking. Enum: `zero`,`drop` | +| Terms | `terms` | `[]WeightedSumTerm` | ✓ | | Terms is the list of signals and their weights. | + +### WeightedSumTerm + +WeightedSumTerm defines one signal contribution in a weightedSum ranking. + +| Field | JSON | Type | Required | Default | Description | +|-------|------|------|----------|---------|-------------| +| SignalRef | `signalRef` | `string` | ✓ | | SignalRef is the name of the signal to include in the weighted sum. Must match a signals[].name within the same policy. | +| Weight | `weight` | `resource.Quantity` | ✓ | | Weight is the factor applied to the normalized signal value. All weights should be non-negative; they do not need to sum to 1. Example: "0.7" | + +### WindowAggregateSignalConfig + +WindowAggregateSignalConfig configures the windowAggregate signal type. Exactly one of relativeWindow or (window + timezone) must be set. + +| Field | JSON | Type | Required | Default | Description | +|-------|------|------|----------|---------|-------------| +| Method | `method` | `AggregationMethod` | ✓ | | Method is the aggregation function applied to the windowed samples. Enum: `sum`,`count`,`avg`,`max`,`min` | +| RelativeWindow | `relativeWindow` | `*metav1.Duration` | — | | RelativeWindow aggregates only samples from the last N duration before now. Mutually exclusive with window + timezone. Example: "2h" (last 2 hours) | +| Timezone | `timezone` | `string` | — | | Timezone is the IANA time zone for evaluating wall-clock window boundaries. Required when window is set. | +| Window | `window` | `*TimeOfDayWindow` | — | | Window defines fixed wall-clock start/end times within each day. Mutually exclusive with relativeWindow. | ## Relationships @@ -222,13 +417,7 @@ graph LR | Degraded | CachedImageSet | N/N images cached, failing: N | | | Progressing | CachedImageSet | N/N images cached | | | Ready | CachedImageSet | All N images are cached | | -| AllSourcesHealthy | DiscoveryPolicy | All discovery sources responded successfully | | -| ConnectionRefused | DiscoveryPolicy | | | -| DNSError | DiscoveryPolicy | | | -| PartiallyFailed | DiscoveryPolicy | Discovered N images, but some sources failed: N | | -| SourceError | DiscoveryPolicy | One or more sources failed to respond | | -| SyncFailed | DiscoveryPolicy | | | -| Synced | DiscoveryPolicy | Discovered N images | | +| NotImplemented | DiscoveryPolicy | | | ## Metrics @@ -319,83 +508,97 @@ spec: policyRef: name: dev-conservative discoveryPolicyRef: - name: dev-registry + name: dev-prometheus --- -# === DiscoveryPolicy: healthy (Prometheus range query) === +# === DiscoveryPolicy: Prometheus range query with total-usage signal === apiVersion: drop.corewire.io/v1alpha1 kind: DiscoveryPolicy metadata: name: dev-prometheus spec: - sources: - - type: prometheus + queries: + - name: runner-image-usage + type: prometheus prometheus: endpoint: "http://prometheus.e2e-infra.svc.cluster.local:9090" - query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff", pod=~"runner-.*"}) by (image)' queryType: range lookback: 24h step: 5m - aggregationMethod: sum + query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff", pod=~"runner-.*"}) by (image)' + signals: + - name: total-usage + queryRef: runner-image-usage + type: aggregate + aggregate: + method: sum + ranking: + strategy: signal + signal: + signalRef: total-usage syncInterval: 30s maxImages: 10 --- -# === DiscoveryPolicy: healthy (registry tag listing) === +# === DiscoveryPolicy: Prometheus with hybrid weightedSum ranking === apiVersion: drop.corewire.io/v1alpha1 kind: DiscoveryPolicy metadata: - name: dev-registry + name: dev-hybrid spec: - sources: - - type: registry - registry: - url: "http://registry.e2e-infra.svc.cluster.local:5000" - repositories: - - "test/myapp" - topX: 3 + queries: + - name: runner-image-usage + type: prometheus + prometheus: + endpoint: "http://prometheus.e2e-infra.svc.cluster.local:9090" + queryType: range + lookback: 24h + step: 5m + query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff"}) by (image)' + signals: + - name: total-usage + queryRef: runner-image-usage + type: aggregate + aggregate: + method: sum + - name: peak-concurrency + queryRef: runner-image-usage + type: aggregate + aggregate: + method: max + ranking: + strategy: weightedSum + weightedSum: + normalize: minMax + missingSignal: zero + terms: + - signalRef: total-usage + weight: "700m" + - signalRef: peak-concurrency + weight: "300m" syncInterval: 30s maxImages: 10 --- -# === DiscoveryPolicy: broken (DNS error → DNSError) === +# === DiscoveryPolicy: broken Prometheus endpoint (DNS error) === apiVersion: drop.corewire.io/v1alpha1 kind: DiscoveryPolicy metadata: name: test-broken-prom spec: - sources: - - type: prometheus + queries: + - name: broken-query + type: prometheus prometheus: endpoint: "http://nonexistent-prometheus:9090" query: "up{}" - syncInterval: 30m - maxImages: 10 ---- -# === DiscoveryPolicy: broken (DNS error → DNSError) === -apiVersion: drop.corewire.io/v1alpha1 -kind: DiscoveryPolicy -metadata: - name: test-broken-registry -spec: - sources: - - type: registry - registry: - url: "http://nonexistent-registry:5000" - repositories: - - "test/nope" - syncInterval: 30m - maxImages: 10 ---- -# === DiscoveryPolicy: broken (repo doesn't exist → NotFound) === -apiVersion: drop.corewire.io/v1alpha1 -kind: DiscoveryPolicy -metadata: - name: test-notfound-repo -spec: - sources: - - type: registry - registry: - url: "http://registry.e2e-infra.svc.cluster.local:5000" - repositories: - - "this/does-not-exist" + signals: + - name: total-usage + queryRef: broken-query + type: aggregate + aggregate: + method: sum + ranking: + strategy: signal + signal: + signalRef: total-usage syncInterval: 30m maxImages: 10 diff --git a/hack/dev-samples.yaml b/hack/dev-samples.yaml index 767b904..2c52eb1 100644 --- a/hack/dev-samples.yaml +++ b/hack/dev-samples.yaml @@ -68,82 +68,96 @@ spec: policyRef: name: dev-conservative discoveryPolicyRef: - name: dev-registry + name: dev-prometheus --- -# === DiscoveryPolicy: healthy (Prometheus range query) === +# === DiscoveryPolicy: Prometheus range query with total-usage signal === apiVersion: drop.corewire.io/v1alpha1 kind: DiscoveryPolicy metadata: name: dev-prometheus spec: - sources: - - type: prometheus + queries: + - name: runner-image-usage + type: prometheus prometheus: endpoint: "http://prometheus.e2e-infra.svc.cluster.local:9090" - query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff", pod=~"runner-.*"}) by (image)' queryType: range lookback: 24h step: 5m - aggregationMethod: sum + query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff", pod=~"runner-.*"}) by (image)' + signals: + - name: total-usage + queryRef: runner-image-usage + type: aggregate + aggregate: + method: sum + ranking: + strategy: signal + signal: + signalRef: total-usage syncInterval: 30s maxImages: 10 --- -# === DiscoveryPolicy: healthy (registry tag listing) === +# === DiscoveryPolicy: Prometheus with hybrid weightedSum ranking === apiVersion: drop.corewire.io/v1alpha1 kind: DiscoveryPolicy metadata: - name: dev-registry + name: dev-hybrid spec: - sources: - - type: registry - registry: - url: "http://registry.e2e-infra.svc.cluster.local:5000" - repositories: - - "test/myapp" - topX: 3 + queries: + - name: runner-image-usage + type: prometheus + prometheus: + endpoint: "http://prometheus.e2e-infra.svc.cluster.local:9090" + queryType: range + lookback: 24h + step: 5m + query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff"}) by (image)' + signals: + - name: total-usage + queryRef: runner-image-usage + type: aggregate + aggregate: + method: sum + - name: peak-concurrency + queryRef: runner-image-usage + type: aggregate + aggregate: + method: max + ranking: + strategy: weightedSum + weightedSum: + normalize: minMax + missingSignal: zero + terms: + - signalRef: total-usage + weight: "700m" + - signalRef: peak-concurrency + weight: "300m" syncInterval: 30s maxImages: 10 --- -# === DiscoveryPolicy: broken (DNS error → DNSError) === +# === DiscoveryPolicy: broken Prometheus endpoint (DNS error) === apiVersion: drop.corewire.io/v1alpha1 kind: DiscoveryPolicy metadata: name: test-broken-prom spec: - sources: - - type: prometheus + queries: + - name: broken-query + type: prometheus prometheus: endpoint: "http://nonexistent-prometheus:9090" query: "up{}" - syncInterval: 30m - maxImages: 10 ---- -# === DiscoveryPolicy: broken (DNS error → DNSError) === -apiVersion: drop.corewire.io/v1alpha1 -kind: DiscoveryPolicy -metadata: - name: test-broken-registry -spec: - sources: - - type: registry - registry: - url: "http://nonexistent-registry:5000" - repositories: - - "test/nope" - syncInterval: 30m - maxImages: 10 ---- -# === DiscoveryPolicy: broken (repo doesn't exist → NotFound) === -apiVersion: drop.corewire.io/v1alpha1 -kind: DiscoveryPolicy -metadata: - name: test-notfound-repo -spec: - sources: - - type: registry - registry: - url: "http://registry.e2e-infra.svc.cluster.local:5000" - repositories: - - "this/does-not-exist" + signals: + - name: total-usage + queryRef: broken-query + type: aggregate + aggregate: + method: sum + ranking: + strategy: signal + signal: + signalRef: total-usage syncInterval: 30m maxImages: 10 diff --git a/internal/controller/discoverypolicy_controller.go b/internal/controller/discoverypolicy_controller.go index c801165..377a42c 100644 --- a/internal/controller/discoverypolicy_controller.go +++ b/internal/controller/discoverypolicy_controller.go @@ -10,13 +10,8 @@ import ( "context" "crypto/tls" "crypto/x509" - "errors" "fmt" - "net" "net/http" - "net/url" - "regexp" - "sort" "strings" "time" @@ -31,8 +26,6 @@ import ( logf "sigs.k8s.io/controller-runtime/pkg/log" dropv1alpha1 "github.com/corewire/drop/api/v1alpha1" - "github.com/corewire/drop/internal/discovery" - dropmetrics "github.com/corewire/drop/internal/metrics" ) // DiscoveryPolicyReconciler reconciles a DiscoveryPolicy object @@ -42,17 +35,14 @@ type DiscoveryPolicyReconciler struct { SecretNamespace string } -const ( - reasonDNSError = "DNSError" - reasonConnectionRefused = "ConnectionRefused" -) - // +kubebuilder:rbac:groups=drop.corewire.io,resources=discoverypolicies,verbs=get;list;watch;create;update;patch;delete // +kubebuilder:rbac:groups=drop.corewire.io,resources=discoverypolicies/status,verbs=get;update;patch // +kubebuilder:rbac:groups=drop.corewire.io,resources=discoverypolicies/finalizers,verbs=update // +kubebuilder:rbac:groups="",resources=secrets,verbs=get;list;watch -// Reconcile queries discovery sources and updates the DiscoveryPolicy status. +// Reconcile updates the DiscoveryPolicy status. +// NOTE: Query/signal/ranking execution is not yet implemented. The controller sets a +// NotImplemented condition and requeues after syncInterval until a future release adds execution. func (r *DiscoveryPolicyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) { log := logf.FromContext(ctx) @@ -65,215 +55,48 @@ func (r *DiscoveryPolicyReconciler) Reconcile(ctx context.Context, req ctrl.Requ return ctrl.Result{}, err } - // 2. Query each source - patch := client.MergeFrom(dp.DeepCopy()) - var allResults []discovery.ImageResult - allSourcesHealthy := true - var lastFailReason, lastFailMessage string - - for i, src := range dp.Spec.Sources { - source, err := r.buildSource(ctx, src) - if err != nil { - log.Error(err, "building source", "index", i, "type", src.Type) - allSourcesHealthy = false - lastFailReason, lastFailMessage = classifyError(err) - dropmetrics.DiscoverySourceHealth.WithLabelValues(dp.Name, src.Type, sourceEndpoint(src)).Set(0) - continue - } - - start := time.Now() - results, err := source.Fetch(ctx) - elapsed := time.Since(start).Seconds() - dropmetrics.DiscoverySourceLatencySeconds.WithLabelValues(dp.Name, src.Type).Observe(elapsed) - - if err != nil { - log.Error(err, "fetching from source", "index", i, "type", src.Type) - allSourcesHealthy = false - lastFailReason, lastFailMessage = classifyError(err) - dropmetrics.DiscoverySourceHealth.WithLabelValues(dp.Name, src.Type, sourceEndpoint(src)).Set(0) - continue - } - - dropmetrics.DiscoverySourceHealth.WithLabelValues(dp.Name, src.Type, sourceEndpoint(src)).Set(1) - - // Tag results with source type - for j := range results { - results[j] = discovery.ImageResult{ - Image: results[j].Image, - Score: results[j].Score, - } - } - dropmetrics.DiscoveryImagesFound.WithLabelValues(dp.Name, src.Type).Set(float64(len(results))) - allResults = append(allResults, results...) - } - - // 3. Merge results (deduplicate by image, keep highest score) - merged := deduplicateResults(allResults) + log.Info("reconciling DiscoveryPolicy (pipeline execution not yet implemented)", + "queries", len(dp.Spec.Queries), + "signals", len(dp.Spec.Signals), + ) - // 4. Apply image filter - if dp.Spec.ImageFilter != "" { - re, err := regexp.Compile(dp.Spec.ImageFilter) - if err != nil { - log.Error(err, "compiling image filter regex") - } else { - var filtered []discovery.ImageResult - for _, r := range merged { - if re.MatchString(r.Image) { - filtered = append(filtered, r) - } - } - merged = filtered - } - } - - // 5. Sort by score descending, truncate to maxImages - sort.Slice(merged, func(i, j int) bool { - if merged[i].Score != merged[j].Score { - return merged[i].Score > merged[j].Score - } - return merged[i].Image < merged[j].Image - }) - - maxImages := dp.Spec.MaxImages - if maxImages <= 0 { - maxImages = 50 - } - if int32(len(merged)) > maxImages { - merged = merged[:maxImages] - } - - // 6. Write status - // On total failure and previous results exist, keep last good results - if len(merged) == 0 && !allSourcesHealthy && len(dp.Status.DiscoveredImages) > 0 { - log.Info("all sources failed, keeping previous discovery results") - } else { - discoveredImages := make([]dropv1alpha1.DiscoveredImage, 0, len(merged)) - for _, r := range merged { - discoveredImages = append(discoveredImages, dropv1alpha1.DiscoveredImage{ - Image: r.Image, - Score: r.Score, - Source: "discovery", - }) - } - dp.Status.DiscoveredImages = discoveredImages - } + // 2. Update status with query/image counts and NotImplemented condition. + patch := client.MergeFrom(dp.DeepCopy()) now := metav1.Now() - if allSourcesHealthy || len(merged) > 0 { - dp.Status.LastSyncTime = &now - } - - // 7. Set conditions - sourceCondition := metav1.Condition{ - Type: "SourceHealthy", - ObservedGeneration: dp.Generation, - LastTransitionTime: now, - } - if allSourcesHealthy { - sourceCondition.Status = metav1.ConditionTrue - sourceCondition.Reason = "AllSourcesHealthy" - sourceCondition.Message = "All discovery sources responded successfully" - } else { - sourceCondition.Status = metav1.ConditionFalse - sourceCondition.Reason = "SourceError" - sourceCondition.Message = "One or more sources failed to respond" - } - meta.SetStatusCondition(&dp.Status.Conditions, sourceCondition) + dp.Status.LastSyncTime = &now + dp.Status.QueryCount = int32(len(dp.Spec.Queries)) + dp.Status.ImageCount = int32(len(dp.Status.DiscoveredImages)) readyCondition := metav1.Condition{ Type: conditionTypeReady, + Status: metav1.ConditionFalse, + Reason: "NotImplemented", + Message: "Query/signal/ranking pipeline execution is not yet implemented; discovered images will be populated in a future release.", ObservedGeneration: dp.Generation, LastTransitionTime: now, } - if allSourcesHealthy { - readyCondition.Status = metav1.ConditionTrue - readyCondition.Reason = "Synced" - readyCondition.Message = fmt.Sprintf("Discovered %d images", len(dp.Status.DiscoveredImages)) - } else if len(dp.Status.DiscoveredImages) > 0 { - readyCondition.Status = metav1.ConditionTrue - readyCondition.Reason = "PartiallyFailed" - readyCondition.Message = fmt.Sprintf("Discovered %d images, but some sources failed: %s", len(dp.Status.DiscoveredImages), lastFailMessage) - } else { - readyCondition.Status = metav1.ConditionFalse - readyCondition.Reason = lastFailReason - if lastFailReason == "" { - readyCondition.Reason = "SyncFailed" - } - if lastFailMessage != "" { - readyCondition.Message = lastFailMessage - } else { - readyCondition.Message = "All sources failed, no images discovered" - } - } meta.SetStatusCondition(&dp.Status.Conditions, readyCondition) - // Set scalar counts for printer columns - dp.Status.SourceCount = int32(len(dp.Spec.Sources)) - dp.Status.ImageCount = int32(len(dp.Status.DiscoveredImages)) - if err := r.Status().Patch(ctx, dp, patch); err != nil { return ctrl.Result{}, fmt.Errorf("patching status: %w", err) } - // 8. Requeue after sync interval + // 3. Requeue after sync interval. syncInterval := dp.Spec.SyncInterval.Duration if syncInterval == 0 { syncInterval = 30 * time.Minute } - - // If sources failed, return error → controller-runtime rate limiter - // applies exponential backoff (standard k8s pattern). - if !allSourcesHealthy && len(dp.Status.DiscoveredImages) == 0 { - return ctrl.Result{}, fmt.Errorf("discovery sync failed: %s", lastFailMessage) - } - return ctrl.Result{RequeueAfter: syncInterval}, nil } -// buildSource creates the appropriate Source implementation from a DiscoverySource config. -func (r *DiscoveryPolicyReconciler) buildSource(ctx context.Context, src dropv1alpha1.DiscoverySource) (discovery.Source, error) { - httpClient, err := r.buildHTTPClient(ctx, src.SecretRef) - if err != nil { - return nil, fmt.Errorf("building HTTP client: %w", err) - } - - switch src.Type { - case "prometheus": - if src.Prometheus == nil { - return nil, fmt.Errorf("prometheus config is required when type=prometheus") - } - var lookback time.Duration - if src.Prometheus.Lookback != nil { - lookback = src.Prometheus.Lookback.Duration - } - var step time.Duration - if src.Prometheus.Step != nil { - step = src.Prometheus.Step.Duration - } - return discovery.NewPrometheusSource(src.Prometheus.Endpoint, src.Prometheus.Query, src.Prometheus.QueryType, lookback, src.Prometheus.AggregationMethod, step, httpClient), nil - case "registry": - if src.Registry == nil { - return nil, fmt.Errorf("registry config is required when type=registry") - } - return discovery.NewRegistrySource( - src.Registry.URL, - src.Registry.Repositories, - src.Registry.TagFilter, - src.Registry.TopX, - src.Registry.ImageTemplate, - httpClient, - ), nil - default: - return nil, fmt.Errorf("unsupported source type: %s", src.Type) - } -} - // buildHTTPClient creates an HTTP client with auth/TLS from a Secret. +// This is retained for use by future query execution (Issues 2 and 8). func (r *DiscoveryPolicyReconciler) buildHTTPClient(ctx context.Context, secretRef *corev1.LocalObjectReference) (*http.Client, error) { - client := &http.Client{Timeout: 30 * time.Second} + httpClient := &http.Client{Timeout: 30 * time.Second} if secretRef == nil { - return client, nil + return httpClient, nil } secret := &corev1.Secret{} @@ -313,8 +136,8 @@ func (r *DiscoveryPolicyReconciler) buildHTTPClient(ctx context.Context, secretR transport.base = &http.Transport{TLSClientConfig: tlsConfig} } - client.Transport = transport - return client, nil + httpClient.Transport = transport + return httpClient, nil } // authTransport adds authentication headers from a Secret to HTTP requests. @@ -324,7 +147,7 @@ type authTransport struct { } func (t *authTransport) RoundTrip(req *http.Request) (*http.Response, error) { - // Bearer token auth + // ****** auth if token, ok := t.secret.Data["token"]; ok { req.Header.Set("Authorization", "Bearer "+string(token)) } @@ -338,8 +161,8 @@ func (t *authTransport) RoundTrip(req *http.Request) (*http.Response, error) { // Custom headers (headers.) for key, value := range t.secret.Data { - if len(key) > 8 && key[:8] == "headers." { - headerName := key[8:] + if strings.HasPrefix(key, "headers.") { + headerName := key[len("headers."):] req.Header.Set(headerName, string(value)) } } @@ -347,26 +170,6 @@ func (t *authTransport) RoundTrip(req *http.Request) (*http.Response, error) { return t.base.RoundTrip(req) } -// deduplicateResults merges results, keeping the highest score per image. -func deduplicateResults(results []discovery.ImageResult) []discovery.ImageResult { - seen := make(map[string]discovery.ImageResult, len(results)) - for _, r := range results { - if existing, ok := seen[r.Image]; ok { - if r.Score > existing.Score { - seen[r.Image] = r - } - } else { - seen[r.Image] = r - } - } - - deduplicated := make([]discovery.ImageResult, 0, len(seen)) - for _, r := range seen { - deduplicated = append(deduplicated, r) - } - return deduplicated -} - // SetupWithManager sets up the controller with the Manager. func (r *DiscoveryPolicyReconciler) SetupWithManager(mgr ctrl.Manager) error { return ctrl.NewControllerManagedBy(mgr). @@ -374,136 +177,3 @@ func (r *DiscoveryPolicyReconciler) SetupWithManager(mgr ctrl.Manager) error { Named("discoverypolicy"). Complete(r) } - -// sourceEndpoint returns the endpoint URL for a discovery source (for metric labels). -func sourceEndpoint(src dropv1alpha1.DiscoverySource) string { - switch src.Type { - case "prometheus": - if src.Prometheus != nil { - return src.Prometheus.Endpoint - } - case "registry": - if src.Registry != nil { - return src.Registry.URL - } - } - return "unknown" -} - -// classifyError maps a source fetch error into a k8s-style reason and human-readable message. -func classifyError(err error) (reason, message string) { - if err == nil { - return "", "" - } - - errStr := err.Error() - - // Network-level errors (typed) - var netErr net.Error - if errors.As(err, &netErr) && netErr.Timeout() { - return "Timeout", cleanMessage(errStr) - } - - var dnsErr *net.DNSError - if errors.As(err, &dnsErr) { - return reasonDNSError, fmt.Sprintf("cannot resolve host %q", dnsErr.Name) - } - - var opErr *net.OpError - if errors.As(err, &opErr) { - if opErr.Op == "dial" { - // Check if the underlying error is DNS - if strings.Contains(opErr.Err.Error(), "lookup") || strings.Contains(opErr.Err.Error(), "no such host") || strings.Contains(opErr.Err.Error(), "server misbehaving") { - host := extractHost(errStr) - return reasonDNSError, fmt.Sprintf("cannot resolve host %q", host) - } - host := extractHost(errStr) - return reasonConnectionRefused, fmt.Sprintf("cannot connect to %s", host) - } - } - - var urlErr *url.Error - if errors.As(err, &urlErr) { - inner := urlErr.Err.Error() - if strings.Contains(inner, "no such host") || strings.Contains(inner, "server misbehaving") || strings.Contains(inner, "lookup") { - host := extractHost(errStr) - return reasonDNSError, fmt.Sprintf("cannot resolve host %q", host) - } - if strings.Contains(inner, "connection refused") { - host := extractHost(errStr) - return reasonConnectionRefused, fmt.Sprintf("cannot connect to %s", host) - } - } - - // HTTP status-based errors - if strings.Contains(errStr, "status 401") { - return "Unauthorized", cleanMessage(errStr) - } - if strings.Contains(errStr, "status 403") { - return "Forbidden", cleanMessage(errStr) - } - if strings.Contains(errStr, "status 404") { - return "NotFound", cleanMessage(errStr) - } - if strings.Contains(errStr, "status 5") { - return "ServerError", cleanMessage(errStr) - } - - // String-based fallbacks - if strings.Contains(errStr, "no such host") || strings.Contains(errStr, "server misbehaving") { - host := extractHost(errStr) - return reasonDNSError, fmt.Sprintf("cannot resolve host %q", host) - } - if strings.Contains(errStr, "connection refused") { - host := extractHost(errStr) - return reasonConnectionRefused, fmt.Sprintf("cannot connect to %s", host) - } - if strings.Contains(errStr, "timeout") || strings.Contains(errStr, "deadline exceeded") { - return "Timeout", cleanMessage(errStr) - } - if strings.Contains(errStr, "certificate") || strings.Contains(errStr, "x509") { - return "TLSError", cleanMessage(errStr) - } - if strings.Contains(errStr, "decoding") || strings.Contains(errStr, "unmarshal") || strings.Contains(errStr, "invalid") { - return "InvalidResponse", cleanMessage(errStr) - } - - return "SyncFailed", cleanMessage(errStr) -} - -// extractHost pulls the hostname (or host:port) from a Go error string like -// "... lookup nonexistent-prometheus on 10.96.0.10:53 ..." or -// "... dial tcp nonexistent-registry:5000 ..." -func extractHost(errStr string) string { - // Try "lookup on" pattern (DNS errors) - if idx := strings.Index(errStr, "lookup "); idx != -1 { - rest := errStr[idx+len("lookup "):] - if end := strings.IndexAny(rest, " :"); end != -1 { - return rest[:end] - } - return rest - } - // Try to extract from URL pattern "://..." - if idx := strings.Index(errStr, "://"); idx != -1 { - rest := errStr[idx+3:] - if end := strings.IndexAny(rest, "/?"); end != -1 { - return rest[:end] - } - return rest - } - return "unknown" -} - -// cleanMessage truncates verbose Go error chains for human display. -func cleanMessage(errStr string) string { - // Take the last meaningful segment after the last colon-space - parts := strings.Split(errStr, ": ") - if len(parts) > 2 { - // Keep last 2 segments for context - return strings.Join(parts[len(parts)-2:], ": ") - } - if len(errStr) > 120 { - return errStr[:120] + "..." - } - return errStr -} diff --git a/internal/controller/discoverypolicy_controller_test.go b/internal/controller/discoverypolicy_controller_test.go index 4948e1a..aca5766 100644 --- a/internal/controller/discoverypolicy_controller_test.go +++ b/internal/controller/discoverypolicy_controller_test.go @@ -40,10 +40,11 @@ var _ = Describe("DiscoveryPolicy Controller", func() { Name: resourceName, }, Spec: dropv1alpha1.DiscoveryPolicySpec{ - Sources: []dropv1alpha1.DiscoverySource{ + Queries: []dropv1alpha1.DiscoveryQuery{ { - Type: "prometheus", - Prometheus: &dropv1alpha1.PrometheusSource{ + Name: "test-query", + Type: dropv1alpha1.DiscoveryQueryTypePrometheus, + Prometheus: &dropv1alpha1.DiscoveryPrometheusQuery{ Endpoint: "http://localhost:9090", Query: "test_query", }, @@ -74,9 +75,19 @@ var _ = Describe("DiscoveryPolicy Controller", func() { _, err := controllerReconciler.Reconcile(ctx, reconcile.Request{ NamespacedName: typeNamespacedName, }) - // Discovery will fail to connect to prometheus, but should not panic - // The reconciler handles errors gracefully - _ = err + // The stub reconciler sets a NotImplemented condition and does not return an error. + Expect(err).NotTo(HaveOccurred()) + + // Verify the NotImplemented condition is set in status. + updated := &dropv1alpha1.DiscoveryPolicy{} + Expect(k8sClient.Get(ctx, typeNamespacedName, updated)).To(Succeed()) + var readyReason string + for _, c := range updated.Status.Conditions { + if c.Type == "Ready" { + readyReason = c.Reason + } + } + Expect(readyReason).To(Equal("NotImplemented")) }) It("uses the configured secret namespace for discovery source credentials", func() { diff --git a/internal/discovery/registry.go b/internal/discovery/registry.go deleted file mode 100644 index 44292af..0000000 --- a/internal/discovery/registry.go +++ /dev/null @@ -1,159 +0,0 @@ -package discovery - -import ( - "context" - "encoding/json" - "fmt" - "io" - "net/http" - "regexp" - "sort" - "strings" - "text/template" - "time" -) - -// RegistrySource queries OCI registries for image tags. -type RegistrySource struct { - URL string - Repositories []string - TagFilter string - TopX int32 - ImageTemplate string - HTTPClient *http.Client -} - -// NewRegistrySource creates a new registry discovery source. -func NewRegistrySource(url string, repos []string, tagFilter string, topX int32, imageTemplate string, httpClient *http.Client) *RegistrySource { - if httpClient == nil { - httpClient = &http.Client{Timeout: 30 * time.Second} - } - return &RegistrySource{ - URL: strings.TrimSuffix(url, "/"), - Repositories: repos, - TagFilter: tagFilter, - TopX: topX, - ImageTemplate: imageTemplate, - HTTPClient: httpClient, - } -} - -// tagListResponse represents the OCI Distribution API tag list response. -type tagListResponse struct { - Name string `json:"name"` - Tags []string `json:"tags"` -} - -// Fetch queries the registry for tags and returns discovered images. -func (rs *RegistrySource) Fetch(ctx context.Context) ([]ImageResult, error) { - var allResults []ImageResult - - for _, repo := range rs.Repositories { - results, err := rs.fetchRepo(ctx, repo) - if err != nil { - return nil, fmt.Errorf("fetching tags for %s: %w", repo, err) - } - allResults = append(allResults, results...) - } - - // Sort by score descending (higher index = more recent) - sort.Slice(allResults, func(i, j int) bool { - return allResults[i].Score > allResults[j].Score - }) - - return allResults, nil -} - -func (rs *RegistrySource) fetchRepo(ctx context.Context, repo string) ([]ImageResult, error) { - u := fmt.Sprintf("%s/v2/%s/tags/list", rs.URL, repo) - - req, err := http.NewRequestWithContext(ctx, http.MethodGet, u, nil) - if err != nil { - return nil, fmt.Errorf("creating request: %w", err) - } - - resp, err := rs.HTTPClient.Do(req) - if err != nil { - return nil, fmt.Errorf("listing tags: %w", err) - } - defer func() { _ = resp.Body.Close() }() - - if resp.StatusCode != http.StatusOK { - body, _ := io.ReadAll(resp.Body) - return nil, fmt.Errorf("registry returned status %d: %s", resp.StatusCode, string(body)) - } - - var tagList tagListResponse - if err := json.NewDecoder(resp.Body).Decode(&tagList); err != nil { - return nil, fmt.Errorf("decoding response: %w", err) - } - - // Filter tags - tags := tagList.Tags - if rs.TagFilter != "" { - re, err := regexp.Compile(rs.TagFilter) - if err != nil { - return nil, fmt.Errorf("compiling tag filter: %w", err) - } - var filtered []string - for _, tag := range tags { - if re.MatchString(tag) { - filtered = append(filtered, tag) - } - } - tags = filtered - } - - // Limit to topX - if rs.TopX > 0 && int32(len(tags)) > rs.TopX { - tags = tags[len(tags)-int(rs.TopX):] - } - - // Build image refs - results := make([]ImageResult, 0, len(tags)) - for i, tag := range tags { - imageRef, err := rs.buildImageRef(repo, tag) - if err != nil { - return nil, fmt.Errorf("building image ref for tag %s: %w", tag, err) - } - results = append(results, ImageResult{ - Image: imageRef, - Score: int64(i + 1), // Higher index = more recent - }) - } - - return results, nil -} - -// templateData provides variables for the image template. -type templateData struct { - Registry string - Repository string - Tag string -} - -func (rs *RegistrySource) buildImageRef(repo, tag string) (string, error) { - if rs.ImageTemplate != "" { - tmpl, err := template.New("image").Parse(rs.ImageTemplate) - if err != nil { - return "", fmt.Errorf("parsing image template: %w", err) - } - - data := templateData{ - Registry: rs.URL, - Repository: repo, - Tag: tag, - } - - var buf strings.Builder - if err := tmpl.Execute(&buf, data); err != nil { - return "", fmt.Errorf("executing image template: %w", err) - } - return buf.String(), nil - } - - // Default: registry/repo:tag - registry := strings.TrimPrefix(rs.URL, "https://") - registry = strings.TrimPrefix(registry, "http://") - return fmt.Sprintf("%s/%s:%s", registry, repo, tag), nil -} diff --git a/internal/discovery/registry_test.go b/internal/discovery/registry_test.go deleted file mode 100644 index f3b9dc6..0000000 --- a/internal/discovery/registry_test.go +++ /dev/null @@ -1,93 +0,0 @@ -package discovery - -import ( - "context" - "encoding/json" - "net/http" - "net/http/httptest" - "testing" -) - -func TestRegistrySource_Fetch(t *testing.T) { - tests := []struct { - name string - repos []string - tagFilter string - topX int32 - imageTemplate string - tags []string - wantCount int - wantFirst string - wantErr bool - }{ - { - name: "basic tag listing", - repos: []string{"library/nginx"}, - tags: []string{"1.24", "1.25", "1.26"}, - wantCount: 3, - }, - { - name: "tag filter", - repos: []string{"library/nginx"}, - tagFilter: `^1\.2[56]$`, - tags: []string{"1.24", "1.25", "1.26"}, - wantCount: 2, - }, - { - name: "topX limit", - repos: []string{"library/nginx"}, - topX: 2, - tags: []string{"1.24", "1.25", "1.26"}, - wantCount: 2, - }, - { - name: "image template", - repos: []string{"gitlab-org/gitlab-runner/gitlab-runner-helper"}, - imageTemplate: "registry.gitlab.com/{{.Repository}}:x86_64-{{.Tag}}", - tags: []string{"v16.0", "v16.1"}, - wantCount: 2, - wantFirst: "registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-v16.1", - }, - } - - for _, tt := range tests { - t.Run(tt.name, func(t *testing.T) { - server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { - resp := tagListResponse{ - Name: tt.repos[0], - Tags: tt.tags, - } - w.WriteHeader(http.StatusOK) - if err := json.NewEncoder(w).Encode(resp); err != nil { - t.Fatal(err) - } - })) - defer server.Close() - - source := NewRegistrySource(server.URL, tt.repos, tt.tagFilter, tt.topX, tt.imageTemplate, server.Client()) - results, err := source.Fetch(context.Background()) - - if tt.wantErr { - if err == nil { - t.Fatal("expected error, got nil") - } - return - } - - if err != nil { - t.Fatalf("unexpected error: %v", err) - } - - if len(results) != tt.wantCount { - t.Errorf("got %d results, want %d", len(results), tt.wantCount) - } - - if tt.wantFirst != "" && len(results) > 0 { - // Results sorted by score descending, highest score = last tag - if results[0].Image != tt.wantFirst { - t.Errorf("first image = %q, want %q", results[0].Image, tt.wantFirst) - } - } - }) - } -} diff --git a/knowledge.yaml b/knowledge.yaml index a088e30..fea19b9 100644 --- a/knowledge.yaml +++ b/knowledge.yaml @@ -237,11 +237,21 @@ crds: controller: internal/controller/discoverypolicy_controller.go testFile: internal/controller/discoverypolicy_controller_test.go specFields: - - name: Sources - json: sources - type: '[]DiscoverySource' - required: true - doc: Sources is the list of discovery backends to query. At least one source is required. Multiple sources are merged and ranked together before maxImages is applied. + - name: Queries + json: queries + type: '[]DiscoveryQuery' + required: false + doc: Queries is the list of named raw-data sources. Each query is referenced by name from signals. + - name: Signals + json: signals + type: '[]DiscoverySignal' + required: false + doc: Signals is the list of named per-image metrics derived from query results. Each signal is referenced by name from the ranking configuration. + - name: Ranking + json: ranking + type: '*DiscoveryRanking' + required: false + doc: Ranking defines how signals are combined into a final ordered image list. - name: ImageFilter json: imageFilter type: string @@ -252,7 +262,7 @@ crds: type: metav1.Duration required: false default: 30m - doc: 'SyncInterval is how often the operator re-queries all sources and updates status.discoveredImages. Default: "30m". Example: "1h", "15m"' + doc: 'SyncInterval is how often the operator re-runs the pipeline and updates status.discoveredImages. Default: "30m". Example: "1h", "15m"' - name: MaxImages json: maxImages type: int32 @@ -264,22 +274,32 @@ crds: json: lastSyncTime type: '*metav1.Time' required: false - doc: LastSyncTime is the timestamp of the last successful sync. + doc: LastSyncTime is the timestamp of the last reconciliation attempt. + - name: QueryResults + json: queryResults + type: '[]QueryResult' + required: false + doc: QueryResults reports the outcome of each named query execution. + - name: SignalResults + json: signalResults + type: '[]SignalResult' + required: false + doc: SignalResults reports the outcome of each signal derivation. - name: DiscoveredImages json: discoveredImages type: '[]DiscoveredImage' required: false - doc: DiscoveredImages is the list of discovered images from all sources. + doc: DiscoveredImages is the ordered list of discovered and ranked images. Only images with selected=true are propagated to dependent CachedImageSet resources. - name: ImageCount json: imageCount type: int32 required: false - doc: ImageCount is the number of discovered images. - - name: SourceCount - json: sourceCount + doc: ImageCount is the number of selected discovered images. + - name: QueryCount + json: queryCount type: int32 required: false - doc: SourceCount is the number of configured sources. + doc: QueryCount is the number of configured queries. - name: Conditions json: conditions type: '[]metav1.Condition' @@ -290,7 +310,7 @@ crds: - +kubebuilder:printcolumn:name="Message",type=string,JSONPath=`.status.conditions[?(@.type=="Ready")].message`,priority=1 - +kubebuilder:printcolumn:name="LastSync",type=date,JSONPath=`.status.lastSyncTime` - +kubebuilder:printcolumn:name="Images",type=integer,JSONPath=`.status.imageCount` - - +kubebuilder:printcolumn:name="Sources",type=integer,JSONPath=`.status.sourceCount` + - +kubebuilder:printcolumn:name="Queries",type=integer,JSONPath=`.status.queryCount` - +kubebuilder:printcolumn:name="Status",type=string,JSONPath=`.status.conditions[?(@.type=="Ready")].reason` - +kubebuilder:resource:scope=Cluster,categories=drop - +kubebuilder:subresource:status @@ -339,6 +359,20 @@ crds: - +kubebuilder:resource:scope=Cluster,categories=drop - +kubebuilder:object:root=true helperTypes: + - name: AggregateSignalConfig + doc: AggregateSignalConfig configures the aggregate signal type. + fields: + - name: Method + json: method + type: AggregationMethod + required: true + enum: + - sum + - count + - avg + - max + - min + doc: Method is the aggregation function applied to all samples per image. - name: BackoffConfig doc: BackoffConfig defines exponential retry backoff behavior for failed pulls. fields: @@ -355,23 +389,67 @@ helperTypes: default: 5m doc: 'Max is the upper bound on backoff delay. Retries will never wait longer than this. Default: "5m". Example: "10m"' - name: DiscoveredImage - doc: DiscoveredImage represents a single discovered image with metadata. + doc: DiscoveredImage represents a single discovered and ranked image. fields: - name: Image json: image type: string required: true doc: Image is the fully qualified image reference. - - name: Score - json: score - type: int64 + - name: Rank + json: rank + type: int32 + required: true + doc: Rank is the position of this image in the final ordered list (1 = highest score). + - name: FinalScore + json: finalScore + type: string + required: true + doc: FinalScore is the computed ranking score as a decimal string. + - name: Selected + json: selected + type: bool required: true - doc: Score is the ranking score from the source (higher = more relevant). - - name: Source - json: source + doc: Selected is true when this image is within the maxImages cap and will be propagated to dependent CachedImageSet resources. + - name: Signals + json: signals + type: '[]ImageSignalValue' + required: false + doc: Signals lists the per-signal values used during ranking (for observability). + - name: Ranking + json: ranking + type: '*ImageRankingDetail' + required: false + doc: Ranking explains how the final score was computed. + - name: DiscoveryLokiQuery + doc: DiscoveryLokiQuery defines the Loki-specific query parameters. + fields: + - name: Endpoint + json: endpoint + type: string + required: true + doc: 'Endpoint is the Loki API URL. Example: "https://loki.example.com"' + - name: Query + json: query type: string required: true - doc: Source identifies which discovery source produced this image. + doc: Query is the LogQL expression. + - name: QueryType + json: queryType + type: LokiQueryType + required: false + default: range + doc: QueryType controls how the query is executed. Currently only "range" is supported. + - name: Lookback + json: lookback + type: '*metav1.Duration' + required: false + doc: 'Lookback is the time window for the query (start=now-lookback, end=now). Example: "168h" (7 days), "24h"' + - name: Parser + json: parser + type: '*LokiParser' + required: false + doc: Parser configures how log lines are parsed into structured event records. - name: DiscoveryPolicyReference doc: DiscoveryPolicyReference is a reference to a DiscoveryPolicy resource. fields: @@ -380,32 +458,167 @@ helperTypes: type: string required: true doc: Name of the DiscoveryPolicy resource. - - name: DiscoverySource - doc: DiscoverySource defines a single discovery backend. + - name: DiscoveryPrometheusQuery + doc: DiscoveryPrometheusQuery defines the Prometheus-specific query parameters. The PromQL result MUST carry an "image" label; that label value is the image reference. + fields: + - name: Endpoint + json: endpoint + type: string + required: true + doc: 'Endpoint is the Prometheus-compatible API URL (Prometheus, Thanos, Mimir, VictoriaMetrics). Example: "http://prometheus.monitoring.svc:9090", "https://mimir.example.com"' + - name: Query + json: query + type: string + required: true + doc: 'Query is the PromQL expression. Must return results with an "image" label. Example: count(container_memory_working_set_bytes{namespace="gitlab-runner"}) by (image)' + - name: QueryType + json: queryType + type: QueryType + required: false + default: range + doc: 'QueryType controls how the query is executed: "range" or "instant". Default: "range".' + - name: Lookback + json: lookback + type: '*metav1.Duration' + required: false + doc: 'Lookback is the time window for range queries (start=now-lookback, end=now). Required when queryType is "range". Ignored when queryType is "instant". Example: "168h" (7 days), "24h", "72h"' + - name: Step + json: step + type: '*metav1.Duration' + required: false + doc: 'Step is the resolution step for range queries. Smaller steps increase data-point density but also increase Prometheus load. Default: 5m. Example: "1m", "15m"' + - name: DiscoveryQuery + doc: DiscoveryQuery defines a named raw-data source referenced by signals. fields: + - name: Name + json: name + type: string + required: true + doc: Name is the unique identifier for this query within the policy. Signals reference queries by this name via queryRef. - name: Type json: type - type: string + type: DiscoveryQueryType required: true enum: - prometheus - - registry - doc: Type identifies the discovery backend. Must be "prometheus" or "registry". + - loki + doc: Type selects the backend. Must be "prometheus" or "loki". - name: Prometheus json: prometheus - type: '*PrometheusSource' + type: '*DiscoveryPrometheusQuery' required: false doc: Prometheus contains the configuration when type=prometheus. - - name: Registry - json: registry - type: '*RegistrySource' + - name: Loki + json: loki + type: '*DiscoveryLokiQuery' required: false - doc: Registry contains the configuration when type=registry. + doc: Loki contains the configuration when type=loki. - name: SecretRef json: secretRef type: '*corev1.LocalObjectReference' required: false - doc: 'SecretRef references a Secret in the namespace where Drop creates pull Pods. The default namespace is "drop-system" unless the controller is started with a different --pod-namespace. Supported Secret keys: token, username, password, ca.crt, tls.crt, tls.key, headers.. Example: {name: "prometheus-creds"}' + doc: 'SecretRef references a Secret in the pod namespace (default "drop-system") for auth/TLS. Supported Secret keys: token, username, password, ca.crt, tls.crt, tls.key, headers..' + - name: DiscoveryRanking + doc: DiscoveryRanking defines how signals are combined into the final ordered image list. + fields: + - name: Strategy + json: strategy + type: RankingStrategy + required: true + enum: + - signal + - weightedSum + - modelExposure + doc: Strategy selects the ranking algorithm. + - name: Signal + json: signal + type: '*SignalRankingConfig' + required: false + doc: Signal is required when strategy=signal. + - name: WeightedSum + json: weightedSum + type: '*WeightedSumRankingConfig' + required: false + doc: WeightedSum is required when strategy=weightedSum. + - name: ModelExposure + json: modelExposure + type: '*ModelExposureRankingConfig' + required: false + doc: ModelExposure is required when strategy=modelExposure. + - name: DiscoverySignal + doc: DiscoverySignal defines a named per-image metric derived from a single query. + fields: + - name: Name + json: name + type: string + required: true + doc: Name is the unique identifier for this signal within the policy. Ranking configurations reference signals by this name. + - name: QueryRef + json: queryRef + type: string + required: true + doc: QueryRef is the name of the query that provides raw data for this signal. Must match a queries[].name within the same policy. + - name: Type + json: type + type: SignalType + required: true + enum: + - aggregate + - timeWeightedAggregate + - windowAggregate + - eventPullTime + doc: Type selects the signal derivation method. + - name: Aggregate + json: aggregate + type: '*AggregateSignalConfig' + required: false + doc: Aggregate is required when type=aggregate. + - name: TimeWeightedAggregate + json: timeWeightedAggregate + type: '*TimeWeightedAggregateSignalConfig' + required: false + doc: TimeWeightedAggregate is required when type=timeWeightedAggregate. + - name: WindowAggregate + json: windowAggregate + type: '*WindowAggregateSignalConfig' + required: false + doc: WindowAggregate is required when type=windowAggregate. + - name: EventPullTime + json: eventPullTime + type: '*EventPullTimeSignalConfig' + required: false + doc: EventPullTime is required when type=eventPullTime. + - name: EventPullTimeSignalConfig + doc: EventPullTimeSignalConfig configures the eventPullTime signal type. The referenced query must be a Loki query. + fields: + - name: Statistic + json: statistic + type: EventPullTimeStatistic + required: true + enum: + - p50 + - p90 + - p95 + - avg + - max + - count + - failureCount + - cacheHitCount + doc: Statistic selects which pull-time metric to compute. + - name: IncludeCacheHits + json: includeCacheHits + type: bool + required: true + default: "false" + doc: IncludeCacheHits controls whether "already present on machine" events are included in cold-pull duration statistics. Set to false to exclude cache hits. + - name: DurationMode + json: durationMode + type: DurationMode + required: true + enum: + - eventPair + - messageDuration + doc: DurationMode controls how pull duration is extracted from event records. - name: ImageEntry doc: ImageEntry defines a single image to include in a set. fields: @@ -424,6 +637,90 @@ helperTypes: type: string required: false doc: 'Digest to pull as an immutable reference. Mutually exclusive with Tag. Example: "sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4"' + - name: ImageRankingDetail + doc: ImageRankingDetail explains how the final score was computed for one image. + fields: + - name: Strategy + json: strategy + type: string + required: true + doc: Strategy is the ranking strategy that produced this detail. + - name: Terms + json: terms + type: '[]RankingTerm' + required: false + doc: Terms lists the per-signal contributions (populated for weightedSum and modelExposure). + - name: ImageSignalValue + doc: ImageSignalValue records the raw and normalized value of a signal for one image. + fields: + - name: Name + json: name + type: string + required: true + doc: Name is the signal name. + - name: RawValue + json: rawValue + type: string + required: true + doc: RawValue is the unscaled signal value as a decimal string. + - name: NormalizedValue + json: normalizedValue + type: string + required: false + doc: NormalizedValue is the normalized value (after minMax or other normalization) as a decimal string. Only populated for signals used in a weightedSum ranking. + - name: LokiParser + doc: LokiParser configures structured parsing of Loki log entries. + fields: + - name: Type + json: type + type: LokiParserType + required: true + enum: + - kubernetesEvents + doc: Type selects the parser. Currently only "kubernetesEvents" is supported. + - name: PodField + json: podField + type: string + required: false + doc: 'PodField is the log label or field that contains the pod name. Example: "involvedObject_name"' + - name: ReasonField + json: reasonField + type: string + required: false + doc: 'ReasonField is the log label or field that contains the event reason. Example: "reason"' + - name: MessageField + json: messageField + type: string + required: false + doc: 'MessageField is the log label or field that contains the event message. Example: "message"' + - name: ImageField + json: imageField + type: string + required: false + doc: 'ImageField is the log label or field from which the image reference is extracted. For kubernetesEvents, the image is parsed out of the message text. Example: "message"' + - name: ModelExposureRankingConfig + doc: ModelExposureRankingConfig configures the modelExposure ranking strategy. Score = J_target(I) * (1 - 1/N)^J_pre(I) * p_hat(I) where N=nodeCount, J_pre is pre-window usage, J_target is target-window usage, and p_hat is the pull-time signal value. + fields: + - name: NodeCount + json: nodeCount + type: int32 + required: true + doc: NodeCount is the number of eligible CI nodes (N in the exposure formula). + - name: PreWindowUsageSignalRef + json: preWindowUsageSignalRef + type: string + required: true + doc: PreWindowUsageSignalRef is the name of the signal representing usage before the target window. Must match a signals[].name within the same policy. + - name: TargetWindowUsageSignalRef + json: targetWindowUsageSignalRef + type: string + required: true + doc: TargetWindowUsageSignalRef is the name of the signal representing usage during the target window. Must match a signals[].name within the same policy. + - name: PullTimeSignalRef + json: pullTimeSignalRef + type: string + required: true + doc: PullTimeSignalRef is the name of the signal providing per-image pull-time estimates. Must match a signals[].name within the same policy. - name: PolicyReference doc: PolicyReference is a reference to a PullPolicy resource. fields: @@ -432,68 +729,220 @@ helperTypes: type: string required: true doc: Name of the PullPolicy resource. - - name: PrometheusSource - doc: PrometheusSource defines Prometheus query configuration for image discovery. + - name: QueryResult + doc: QueryResult reports the outcome of a single named query execution. fields: - - name: Endpoint - json: endpoint + - name: Name + json: name type: string required: true - doc: 'Endpoint is the Prometheus-compatible API URL (Prometheus, Thanos, Mimir, VictoriaMetrics). Example: "http://prometheus.monitoring.svc:9090", "https://mimir.example.com"' - - name: Query - json: query - type: string + doc: Name matches the queries[].name that produced this result. + - name: Type + json: type + type: DiscoveryQueryType required: true - doc: 'Query is the PromQL expression. It MUST return results with an "image" label — that label value is used as the discovered image reference. The query result value is used as the ranking score (higher = more relevant). Example: count(container_memory_working_set_bytes{container!="",container!="POD",namespace="gitlab-runner"}) by (image)' - - name: QueryType - json: queryType - type: QueryType + doc: Type is the query backend type (prometheus or loki). + - name: Series + json: series + type: '*int32' required: false - default: range - doc: 'QueryType controls how the Prometheus query is executed. "range" uses /api/v1/query_range with a time window defined by lookback. "instant" uses /api/v1/query for a single point-in-time result. Default: "range".' - - name: Lookback - json: lookback - type: '*metav1.Duration' + doc: Series is the number of time-series returned (Prometheus queries only). + - name: Samples + json: samples + type: '*int64' required: false - doc: 'Lookback is the time window for range queries. When queryType is "range", the operator queries (start=now-lookback, end=now) and aggregates all returned values per image. The aggregation function is controlled by the aggregationMethod field. Required when queryType is "range". Ignored when queryType is "instant". Example: "168h" (7 days), "24h", "72h"' - - name: AggregationMethod - json: aggregationMethod - type: '*AggregationMethod' + doc: Samples is the total number of data points across all series (Prometheus range queries only). + - name: Records + json: records + type: '*int64' required: false - doc: 'AggregationMethod controls how data points from a range query are combined into a single score. Only used when queryType is "range". Ignored for instant queries. When not set (nil), Drop uses the last data-point value directly — use this when your PromQL already contains aggregation functions (e.g., count_over_time, topk). Options: "sum", "count", "avg", "max"' - - name: Step - json: step - type: '*metav1.Duration' + doc: Records is the number of log records returned (Loki queries only). + - name: Status + json: status + type: QueryResultStatus + required: true + doc: Status is "success" or "failed". + - name: Message + json: message + type: string required: false - doc: 'Step is the resolution step for range queries (only used when lookback is set). Smaller steps = more data points = more accurate aggregation but higher Prometheus load. Default: 5m. Example: "1m", "15m"' - - name: RegistrySource - doc: RegistrySource defines OCI registry tag listing configuration for image discovery. + doc: Message describes the failure reason when status=failed. + - name: RankingTerm + doc: RankingTerm records the contribution of one signal to the final score of an image. fields: - - name: URL - json: url + - name: Signal + json: signal type: string required: true - doc: 'URL is the registry base URL (without repository path). Example: "https://registry.example.com", "https://ghcr.io"' - - name: Repositories - json: repositories - type: '[]string' + doc: Signal is the signal name. + - name: Weight + json: weight + type: string + required: true + doc: Weight is the configured weight as a decimal string. + - name: Contribution + json: contribution + type: string + required: true + doc: Contribution is weight * normalizedValue as a decimal string. + - name: SignalRankingConfig + doc: SignalRankingConfig configures the signal ranking strategy. + fields: + - name: SignalRef + json: signalRef + type: string + required: true + doc: SignalRef is the name of the signal whose values determine image rank. Must match a signals[].name within the same policy. + - name: SignalResult + doc: SignalResult reports the outcome of a single signal derivation. + fields: + - name: Name + json: name + type: string + required: true + doc: Name matches the signals[].name that produced this result. + - name: Images + json: images + type: int32 required: true - doc: 'Repositories is the list of repository paths to list tags from. Example: ["team/app", "team/worker", "infra/tools"]' - - name: TagFilter - json: tagFilter + doc: Images is the number of images for which this signal produced a value. + - name: Status + json: status + type: string + required: true + doc: Status is "success" or "failed". + - name: Message + json: message type: string required: false - doc: 'TagFilter is a regex applied to tag names. Only matching tags are discovered. Example: "^v[0-9]+\\." (semver tags only), "^main-" (main branch builds)' - - name: TopX - json: topX + doc: Message describes the failure reason when status=failed. + - name: TimeOfDayWindow + doc: TimeOfDayWindow defines a fixed wall-clock time range within each day. + fields: + - name: Start + json: start + type: string + required: true + doc: 'Start is the inclusive start time in "HH:MM" format (24-hour, local time). Example: "09:00"' + - name: End + json: end + type: string + required: true + doc: 'End is the exclusive end time in "HH:MM" format (24-hour, local time). Example: "17:00"' + - name: TimeWeightedAggregateSignalConfig + doc: TimeWeightedAggregateSignalConfig configures the timeWeightedAggregate signal type. Each sample value is multiplied by the weight of the matching time window before aggregation. + fields: + - name: Method + json: method + type: AggregationMethod + required: true + enum: + - sum + - count + - avg + - max + - min + doc: Method is the aggregation function applied after weighting (currently only "sum" is meaningful). + - name: Timezone + json: timezone + type: string + required: true + doc: 'Timezone is the IANA time zone used to evaluate window boundaries (wall-clock hours). Example: "Europe/Berlin", "America/New_York", "UTC"' + - name: DefaultWeight + json: defaultWeight + type: resource.Quantity + required: true + doc: DefaultWeight is applied to samples that do not fall in any configured window. Use "0" to exclude off-hours samples entirely. + - name: Windows + json: windows + type: '[]TimeWeightedWindow' + required: true + doc: Windows is the list of hour-of-day windows with associated weights. + - name: TimeWeightedWindow + doc: TimeWeightedWindow defines a wall-clock hour range and its weight factor. + fields: + - name: StartHour + json: startHour + type: int32 + required: true + doc: StartHour is the inclusive start of the window in local time (0–23). + - name: EndHour + json: endHour type: int32 + required: true + doc: EndHour is the exclusive end of the window in local time (1–24). + - name: Weight + json: weight + type: resource.Quantity + required: true + doc: Weight is the factor applied to sample values within this window. Use "1.0" for full weight, "0.3" for partial, "0" to exclude. + - name: WeightedSumRankingConfig + doc: WeightedSumRankingConfig configures the weightedSum ranking strategy. Score = Σ weight_k * normalize(signal_k(image)). + fields: + - name: Normalize + json: normalize + type: NormalizeMethod + required: true + default: minMax + enum: + - minMax + doc: Normalize selects the normalization method applied to each signal before weighting. Currently only "minMax" is supported. + - name: MissingSignal + json: missingSignal + type: MissingSignalBehavior + required: true + default: zero + enum: + - zero + - drop + doc: MissingSignal controls behavior when an image has no value for a required signal. "zero" treats missing as 0; "drop" removes the image from ranking. + - name: Terms + json: terms + type: '[]WeightedSumTerm' + required: true + doc: Terms is the list of signals and their weights. + - name: WeightedSumTerm + doc: WeightedSumTerm defines one signal contribution in a weightedSum ranking. + fields: + - name: SignalRef + json: signalRef + type: string + required: true + doc: SignalRef is the name of the signal to include in the weighted sum. Must match a signals[].name within the same policy. + - name: Weight + json: weight + type: resource.Quantity + required: true + doc: 'Weight is the factor applied to the normalized signal value. All weights should be non-negative; they do not need to sum to 1. Example: "0.7"' + - name: WindowAggregateSignalConfig + doc: WindowAggregateSignalConfig configures the windowAggregate signal type. Exactly one of relativeWindow or (window + timezone) must be set. + fields: + - name: Method + json: method + type: AggregationMethod + required: true + enum: + - sum + - count + - avg + - max + - min + doc: Method is the aggregation function applied to the windowed samples. + - name: RelativeWindow + json: relativeWindow + type: '*metav1.Duration' required: false - doc: 'TopX limits the number of tags kept per repository after tagFilter is applied. The registry API does not provide creation timestamps here; Drop keeps the last N tags returned by the registry. Example: 3 (keep the last 3 matching tags returned per repo)' - - name: ImageTemplate - json: imageTemplate + doc: 'RelativeWindow aggregates only samples from the last N duration before now. Mutually exclusive with window + timezone. Example: "2h" (last 2 hours)' + - name: Timezone + json: timezone type: string required: false - doc: 'ImageTemplate is a Go text/template for constructing the full image reference from discovered tags. Available variables: {{.Registry}}, {{.Repository}}, {{.Tag}} Default (when unset): "{{.Registry}}/{{.Repository}}:{{.Tag}}" Example: "{{.Registry}}/{{.Repository}}@{{.Tag}}" (if tags are actually digests)' + doc: Timezone is the IANA time zone for evaluating wall-clock window boundaries. Required when window is set. + - name: Window + json: window + type: '*TimeOfDayWindow' + required: false + doc: Window defines fixed wall-clock start/end times within each day. Mutually exclusive with relativeWindow. relationships: - from: CachedImage to: PullPolicy @@ -514,7 +963,6 @@ packages: role: Package controller implements Kubernetes reconcilers for the drop CRDs (one per Kind). imports: - api/v1alpha1 - - internal/discovery - internal/metrics - internal/pacing - internal/podbuilder @@ -589,27 +1037,9 @@ errors: - reason: Ready controller: CachedImageSet meaning: All N images are cached - - reason: AllSourcesHealthy - controller: DiscoveryPolicy - meaning: All discovery sources responded successfully - - reason: ConnectionRefused + - reason: NotImplemented controller: DiscoveryPolicy meaning: "" - - reason: DNSError - controller: DiscoveryPolicy - meaning: "" - - reason: PartiallyFailed - controller: DiscoveryPolicy - meaning: 'Discovered N images, but some sources failed: N' - - reason: SourceError - controller: DiscoveryPolicy - meaning: One or more sources failed to respond - - reason: SyncFailed - controller: DiscoveryPolicy - meaning: "" - - reason: Synced - controller: DiscoveryPolicy - meaning: Discovered N images metrics: - name: drop_images_cached_total help: Total number of images successfully cached on nodes. @@ -770,82 +1200,96 @@ samples: | policyRef: name: dev-conservative discoveryPolicyRef: - name: dev-registry + name: dev-prometheus --- - # === DiscoveryPolicy: healthy (Prometheus range query) === + # === DiscoveryPolicy: Prometheus range query with total-usage signal === apiVersion: drop.corewire.io/v1alpha1 kind: DiscoveryPolicy metadata: name: dev-prometheus spec: - sources: - - type: prometheus + queries: + - name: runner-image-usage + type: prometheus prometheus: endpoint: "http://prometheus.e2e-infra.svc.cluster.local:9090" - query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff", pod=~"runner-.*"}) by (image)' queryType: range lookback: 24h step: 5m - aggregationMethod: sum + query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff", pod=~"runner-.*"}) by (image)' + signals: + - name: total-usage + queryRef: runner-image-usage + type: aggregate + aggregate: + method: sum + ranking: + strategy: signal + signal: + signalRef: total-usage syncInterval: 30s maxImages: 10 --- - # === DiscoveryPolicy: healthy (registry tag listing) === + # === DiscoveryPolicy: Prometheus with hybrid weightedSum ranking === apiVersion: drop.corewire.io/v1alpha1 kind: DiscoveryPolicy metadata: - name: dev-registry + name: dev-hybrid spec: - sources: - - type: registry - registry: - url: "http://registry.e2e-infra.svc.cluster.local:5000" - repositories: - - "test/myapp" - topX: 3 + queries: + - name: runner-image-usage + type: prometheus + prometheus: + endpoint: "http://prometheus.e2e-infra.svc.cluster.local:9090" + queryType: range + lookback: 24h + step: 5m + query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff"}) by (image)' + signals: + - name: total-usage + queryRef: runner-image-usage + type: aggregate + aggregate: + method: sum + - name: peak-concurrency + queryRef: runner-image-usage + type: aggregate + aggregate: + method: max + ranking: + strategy: weightedSum + weightedSum: + normalize: minMax + missingSignal: zero + terms: + - signalRef: total-usage + weight: "700m" + - signalRef: peak-concurrency + weight: "300m" syncInterval: 30s maxImages: 10 --- - # === DiscoveryPolicy: broken (DNS error → DNSError) === + # === DiscoveryPolicy: broken Prometheus endpoint (DNS error) === apiVersion: drop.corewire.io/v1alpha1 kind: DiscoveryPolicy metadata: name: test-broken-prom spec: - sources: - - type: prometheus + queries: + - name: broken-query + type: prometheus prometheus: endpoint: "http://nonexistent-prometheus:9090" query: "up{}" - syncInterval: 30m - maxImages: 10 - --- - # === DiscoveryPolicy: broken (DNS error → DNSError) === - apiVersion: drop.corewire.io/v1alpha1 - kind: DiscoveryPolicy - metadata: - name: test-broken-registry - spec: - sources: - - type: registry - registry: - url: "http://nonexistent-registry:5000" - repositories: - - "test/nope" - syncInterval: 30m - maxImages: 10 - --- - # === DiscoveryPolicy: broken (repo doesn't exist → NotFound) === - apiVersion: drop.corewire.io/v1alpha1 - kind: DiscoveryPolicy - metadata: - name: test-notfound-repo - spec: - sources: - - type: registry - registry: - url: "http://registry.e2e-infra.svc.cluster.local:5000" - repositories: - - "this/does-not-exist" + signals: + - name: total-usage + queryRef: broken-query + type: aggregate + aggregate: + method: sum + ranking: + strategy: signal + signal: + signalRef: total-usage syncInterval: 30m maxImages: 10 diff --git a/llms-full.txt b/llms-full.txt index b0ca6cc..9ed121d 100644 --- a/llms-full.txt +++ b/llms-full.txt @@ -84,18 +84,22 @@ Controller: internal/controller/discoverypolicy_controller.go | Test: internal/c #### Spec | Field | JSON | Type | Required | Default | Description | |-------|------|------|----------|---------|-------------| -| Sources | `sources` | `[]DiscoverySource` | ✓ | | Sources is the list of discovery backends to query. At least one source is required. Multiple sources are merged and ranked together before maxImages is applied. | +| Queries | `queries` | `[]DiscoveryQuery` | — | | Queries is the list of named raw-data sources. Each query is referenced by name from signals. | +| Signals | `signals` | `[]DiscoverySignal` | — | | Signals is the list of named per-image metrics derived from query results. Each signal is referenced by name from the ranking configuration. | +| Ranking | `ranking` | `*DiscoveryRanking` | — | | Ranking defines how signals are combined into a final ordered image list. | | ImageFilter | `imageFilter` | `string` | — | | ImageFilter is a regex applied to discovered image references. Only matching images are kept. Example: "registry.example.com/team/.*" (only keep images from that registry path) | -| SyncInterval | `syncInterval` | `metav1.Duration` | — | `30m` | SyncInterval is how often the operator re-queries all sources and updates status.discoveredImages. Default: "30m". Example: "1h", "15m" | +| SyncInterval | `syncInterval` | `metav1.Duration` | — | `30m` | SyncInterval is how often the operator re-runs the pipeline and updates status.discoveredImages. Default: "30m". Example: "1h", "15m" | | MaxImages | `maxImages` | `int32` | — | `50` | MaxImages caps the total number of images stored in status.discoveredImages. Images are ranked by score; lowest-scoring images are dropped when the cap is exceeded. Default: 50. Example: 30, 100 | #### Status | Field | JSON | Type | Description | |-------|------|------|-------------| -| LastSyncTime | `lastSyncTime` | `*metav1.Time` | LastSyncTime is the timestamp of the last successful sync. | -| DiscoveredImages | `discoveredImages` | `[]DiscoveredImage` | DiscoveredImages is the list of discovered images from all sources. | -| ImageCount | `imageCount` | `int32` | ImageCount is the number of discovered images. | -| SourceCount | `sourceCount` | `int32` | SourceCount is the number of configured sources. | +| LastSyncTime | `lastSyncTime` | `*metav1.Time` | LastSyncTime is the timestamp of the last reconciliation attempt. | +| QueryResults | `queryResults` | `[]QueryResult` | QueryResults reports the outcome of each named query execution. | +| SignalResults | `signalResults` | `[]SignalResult` | SignalResults reports the outcome of each signal derivation. | +| DiscoveredImages | `discoveredImages` | `[]DiscoveredImage` | DiscoveredImages is the ordered list of discovered and ranked images. Only images with selected=true are propagated to dependent CachedImageSet resources. | +| ImageCount | `imageCount` | `int32` | ImageCount is the number of selected discovered images. | +| QueryCount | `queryCount` | `int32` | QueryCount is the number of configured queries. | | Conditions | `conditions` | `[]metav1.Condition` | Conditions represent the latest available observations. | @@ -117,6 +121,14 @@ PullPolicy controls the pacing and retry behavior for image pulls across cluster ## Helper Types +### AggregateSignalConfig + +AggregateSignalConfig configures the aggregate signal type. + +| Field | JSON | Type | Required | Default | Description | +|-------|------|------|----------|---------|-------------| +| Method | `method` | `AggregationMethod` | ✓ | | Method is the aggregation function applied to all samples per image. Enum: `sum`,`count`,`avg`,`max`,`min` | + ### BackoffConfig BackoffConfig defines exponential retry backoff behavior for failed pulls. @@ -128,13 +140,28 @@ BackoffConfig defines exponential retry backoff behavior for failed pulls. ### DiscoveredImage -DiscoveredImage represents a single discovered image with metadata. +DiscoveredImage represents a single discovered and ranked image. | Field | JSON | Type | Required | Default | Description | |-------|------|------|----------|---------|-------------| | Image | `image` | `string` | ✓ | | Image is the fully qualified image reference. | -| Score | `score` | `int64` | ✓ | | Score is the ranking score from the source (higher = more relevant). | -| Source | `source` | `string` | ✓ | | Source identifies which discovery source produced this image. | +| Rank | `rank` | `int32` | ✓ | | Rank is the position of this image in the final ordered list (1 = highest score). | +| FinalScore | `finalScore` | `string` | ✓ | | FinalScore is the computed ranking score as a decimal string. | +| Selected | `selected` | `bool` | ✓ | | Selected is true when this image is within the maxImages cap and will be propagated to dependent CachedImageSet resources. | +| Signals | `signals` | `[]ImageSignalValue` | — | | Signals lists the per-signal values used during ranking (for observability). | +| Ranking | `ranking` | `*ImageRankingDetail` | — | | Ranking explains how the final score was computed. | + +### DiscoveryLokiQuery + +DiscoveryLokiQuery defines the Loki-specific query parameters. + +| Field | JSON | Type | Required | Default | Description | +|-------|------|------|----------|---------|-------------| +| Endpoint | `endpoint` | `string` | ✓ | | Endpoint is the Loki API URL. Example: "https://loki.example.com" | +| Query | `query` | `string` | ✓ | | Query is the LogQL expression. | +| QueryType | `queryType` | `LokiQueryType` | — | `range` | QueryType controls how the query is executed. Currently only "range" is supported. | +| Lookback | `lookback` | `*metav1.Duration` | — | | Lookback is the time window for the query (start=now-lookback, end=now). Example: "168h" (7 days), "24h" | +| Parser | `parser` | `*LokiParser` | — | | Parser configures how log lines are parsed into structured event records. | ### DiscoveryPolicyReference @@ -144,16 +171,64 @@ DiscoveryPolicyReference is a reference to a DiscoveryPolicy resource. |-------|------|------|----------|---------|-------------| | Name | `name` | `string` | ✓ | | Name of the DiscoveryPolicy resource. | -### DiscoverySource +### DiscoveryPrometheusQuery -DiscoverySource defines a single discovery backend. +DiscoveryPrometheusQuery defines the Prometheus-specific query parameters. The PromQL result MUST carry an "image" label; that label value is the image reference. | Field | JSON | Type | Required | Default | Description | |-------|------|------|----------|---------|-------------| -| Type | `type` | `string` | ✓ | | Type identifies the discovery backend. Must be "prometheus" or "registry". Enum: `prometheus`,`registry` | -| Prometheus | `prometheus` | `*PrometheusSource` | — | | Prometheus contains the configuration when type=prometheus. | -| Registry | `registry` | `*RegistrySource` | — | | Registry contains the configuration when type=registry. | -| SecretRef | `secretRef` | `*corev1.LocalObjectReference` | — | | SecretRef references a Secret in the namespace where Drop creates pull Pods. The default namespace is "drop-system" unless the controller is started with a different --pod-namespace. Supported Secret keys: token, username, password, ca.crt, tls.crt, tls.key, headers.. Example: {name: "prometheus-creds"} | +| Endpoint | `endpoint` | `string` | ✓ | | Endpoint is the Prometheus-compatible API URL (Prometheus, Thanos, Mimir, VictoriaMetrics). Example: "http://prometheus.monitoring.svc:9090", "https://mimir.example.com" | +| Query | `query` | `string` | ✓ | | Query is the PromQL expression. Must return results with an "image" label. Example: count(container_memory_working_set_bytes{namespace="gitlab-runner"}) by (image) | +| QueryType | `queryType` | `QueryType` | — | `range` | QueryType controls how the query is executed: "range" or "instant". Default: "range". | +| Lookback | `lookback` | `*metav1.Duration` | — | | Lookback is the time window for range queries (start=now-lookback, end=now). Required when queryType is "range". Ignored when queryType is "instant". Example: "168h" (7 days), "24h", "72h" | +| Step | `step` | `*metav1.Duration` | — | | Step is the resolution step for range queries. Smaller steps increase data-point density but also increase Prometheus load. Default: 5m. Example: "1m", "15m" | + +### DiscoveryQuery + +DiscoveryQuery defines a named raw-data source referenced by signals. + +| Field | JSON | Type | Required | Default | Description | +|-------|------|------|----------|---------|-------------| +| Name | `name` | `string` | ✓ | | Name is the unique identifier for this query within the policy. Signals reference queries by this name via queryRef. | +| Type | `type` | `DiscoveryQueryType` | ✓ | | Type selects the backend. Must be "prometheus" or "loki". Enum: `prometheus`,`loki` | +| Prometheus | `prometheus` | `*DiscoveryPrometheusQuery` | — | | Prometheus contains the configuration when type=prometheus. | +| Loki | `loki` | `*DiscoveryLokiQuery` | — | | Loki contains the configuration when type=loki. | +| SecretRef | `secretRef` | `*corev1.LocalObjectReference` | — | | SecretRef references a Secret in the pod namespace (default "drop-system") for auth/TLS. Supported Secret keys: token, username, password, ca.crt, tls.crt, tls.key, headers.. | + +### DiscoveryRanking + +DiscoveryRanking defines how signals are combined into the final ordered image list. + +| Field | JSON | Type | Required | Default | Description | +|-------|------|------|----------|---------|-------------| +| Strategy | `strategy` | `RankingStrategy` | ✓ | | Strategy selects the ranking algorithm. Enum: `signal`,`weightedSum`,`modelExposure` | +| Signal | `signal` | `*SignalRankingConfig` | — | | Signal is required when strategy=signal. | +| WeightedSum | `weightedSum` | `*WeightedSumRankingConfig` | — | | WeightedSum is required when strategy=weightedSum. | +| ModelExposure | `modelExposure` | `*ModelExposureRankingConfig` | — | | ModelExposure is required when strategy=modelExposure. | + +### DiscoverySignal + +DiscoverySignal defines a named per-image metric derived from a single query. + +| Field | JSON | Type | Required | Default | Description | +|-------|------|------|----------|---------|-------------| +| Name | `name` | `string` | ✓ | | Name is the unique identifier for this signal within the policy. Ranking configurations reference signals by this name. | +| QueryRef | `queryRef` | `string` | ✓ | | QueryRef is the name of the query that provides raw data for this signal. Must match a queries[].name within the same policy. | +| Type | `type` | `SignalType` | ✓ | | Type selects the signal derivation method. Enum: `aggregate`,`timeWeightedAggregate`,`windowAggregate`,`eventPullTime` | +| Aggregate | `aggregate` | `*AggregateSignalConfig` | — | | Aggregate is required when type=aggregate. | +| TimeWeightedAggregate | `timeWeightedAggregate` | `*TimeWeightedAggregateSignalConfig` | — | | TimeWeightedAggregate is required when type=timeWeightedAggregate. | +| WindowAggregate | `windowAggregate` | `*WindowAggregateSignalConfig` | — | | WindowAggregate is required when type=windowAggregate. | +| EventPullTime | `eventPullTime` | `*EventPullTimeSignalConfig` | — | | EventPullTime is required when type=eventPullTime. | + +### EventPullTimeSignalConfig + +EventPullTimeSignalConfig configures the eventPullTime signal type. The referenced query must be a Loki query. + +| Field | JSON | Type | Required | Default | Description | +|-------|------|------|----------|---------|-------------| +| Statistic | `statistic` | `EventPullTimeStatistic` | ✓ | | Statistic selects which pull-time metric to compute. Enum: `p50`,`p90`,`p95`,`avg`,`max`,`count`,`failureCount`,`cacheHitCount` | +| IncludeCacheHits | `includeCacheHits` | `bool` | ✓ | `false` | IncludeCacheHits controls whether "already present on machine" events are included in cold-pull duration statistics. Set to false to exclude cache hits. | +| DurationMode | `durationMode` | `DurationMode` | ✓ | | DurationMode controls how pull duration is extracted from event records. Enum: `eventPair`,`messageDuration` | ### ImageEntry @@ -165,6 +240,48 @@ ImageEntry defines a single image to include in a set. | Tag | `tag` | `string` | — | | Tag to pull. Mutually exclusive with Digest. Example: "1.25-alpine", "v2.4.1" | | Digest | `digest` | `string` | — | | Digest to pull as an immutable reference. Mutually exclusive with Tag. Example: "sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4" | +### ImageRankingDetail + +ImageRankingDetail explains how the final score was computed for one image. + +| Field | JSON | Type | Required | Default | Description | +|-------|------|------|----------|---------|-------------| +| Strategy | `strategy` | `string` | ✓ | | Strategy is the ranking strategy that produced this detail. | +| Terms | `terms` | `[]RankingTerm` | — | | Terms lists the per-signal contributions (populated for weightedSum and modelExposure). | + +### ImageSignalValue + +ImageSignalValue records the raw and normalized value of a signal for one image. + +| Field | JSON | Type | Required | Default | Description | +|-------|------|------|----------|---------|-------------| +| Name | `name` | `string` | ✓ | | Name is the signal name. | +| RawValue | `rawValue` | `string` | ✓ | | RawValue is the unscaled signal value as a decimal string. | +| NormalizedValue | `normalizedValue` | `string` | — | | NormalizedValue is the normalized value (after minMax or other normalization) as a decimal string. Only populated for signals used in a weightedSum ranking. | + +### LokiParser + +LokiParser configures structured parsing of Loki log entries. + +| Field | JSON | Type | Required | Default | Description | +|-------|------|------|----------|---------|-------------| +| Type | `type` | `LokiParserType` | ✓ | | Type selects the parser. Currently only "kubernetesEvents" is supported. Enum: `kubernetesEvents` | +| PodField | `podField` | `string` | — | | PodField is the log label or field that contains the pod name. Example: "involvedObject_name" | +| ReasonField | `reasonField` | `string` | — | | ReasonField is the log label or field that contains the event reason. Example: "reason" | +| MessageField | `messageField` | `string` | — | | MessageField is the log label or field that contains the event message. Example: "message" | +| ImageField | `imageField` | `string` | — | | ImageField is the log label or field from which the image reference is extracted. For kubernetesEvents, the image is parsed out of the message text. Example: "message" | + +### ModelExposureRankingConfig + +ModelExposureRankingConfig configures the modelExposure ranking strategy. Score = J_target(I) * (1 - 1/N)^J_pre(I) * p_hat(I) where N=nodeCount, J_pre is pre-window usage, J_target is target-window usage, and p_hat is the pull-time signal value. + +| Field | JSON | Type | Required | Default | Description | +|-------|------|------|----------|---------|-------------| +| NodeCount | `nodeCount` | `int32` | ✓ | | NodeCount is the number of eligible CI nodes (N in the exposure formula). | +| PreWindowUsageSignalRef | `preWindowUsageSignalRef` | `string` | ✓ | | PreWindowUsageSignalRef is the name of the signal representing usage before the target window. Must match a signals[].name within the same policy. | +| TargetWindowUsageSignalRef | `targetWindowUsageSignalRef` | `string` | ✓ | | TargetWindowUsageSignalRef is the name of the signal representing usage during the target window. Must match a signals[].name within the same policy. | +| PullTimeSignalRef | `pullTimeSignalRef` | `string` | ✓ | | PullTimeSignalRef is the name of the signal providing per-image pull-time estimates. Must match a signals[].name within the same policy. | + ### PolicyReference PolicyReference is a reference to a PullPolicy resource. @@ -173,30 +290,108 @@ PolicyReference is a reference to a PullPolicy resource. |-------|------|------|----------|---------|-------------| | Name | `name` | `string` | ✓ | | Name of the PullPolicy resource. | -### PrometheusSource +### QueryResult -PrometheusSource defines Prometheus query configuration for image discovery. +QueryResult reports the outcome of a single named query execution. | Field | JSON | Type | Required | Default | Description | |-------|------|------|----------|---------|-------------| -| Endpoint | `endpoint` | `string` | ✓ | | Endpoint is the Prometheus-compatible API URL (Prometheus, Thanos, Mimir, VictoriaMetrics). Example: "http://prometheus.monitoring.svc:9090", "https://mimir.example.com" | -| Query | `query` | `string` | ✓ | | Query is the PromQL expression. It MUST return results with an "image" label — that label value is used as the discovered image reference. The query result value is used as the ranking score (higher = more relevant). Example: count(container_memory_working_set_bytes{container!="",container!="POD",namespace="gitlab-runner"}) by (image) | -| QueryType | `queryType` | `QueryType` | — | `range` | QueryType controls how the Prometheus query is executed. "range" uses /api/v1/query_range with a time window defined by lookback. "instant" uses /api/v1/query for a single point-in-time result. Default: "range". | -| Lookback | `lookback` | `*metav1.Duration` | — | | Lookback is the time window for range queries. When queryType is "range", the operator queries (start=now-lookback, end=now) and aggregates all returned values per image. The aggregation function is controlled by the aggregationMethod field. Required when queryType is "range". Ignored when queryType is "instant". Example: "168h" (7 days), "24h", "72h" | -| AggregationMethod | `aggregationMethod` | `*AggregationMethod` | — | | AggregationMethod controls how data points from a range query are combined into a single score. Only used when queryType is "range". Ignored for instant queries. When not set (nil), Drop uses the last data-point value directly — use this when your PromQL already contains aggregation functions (e.g., count_over_time, topk). Options: "sum", "count", "avg", "max" | -| Step | `step` | `*metav1.Duration` | — | | Step is the resolution step for range queries (only used when lookback is set). Smaller steps = more data points = more accurate aggregation but higher Prometheus load. Default: 5m. Example: "1m", "15m" | +| Name | `name` | `string` | ✓ | | Name matches the queries[].name that produced this result. | +| Type | `type` | `DiscoveryQueryType` | ✓ | | Type is the query backend type (prometheus or loki). | +| Series | `series` | `*int32` | — | | Series is the number of time-series returned (Prometheus queries only). | +| Samples | `samples` | `*int64` | — | | Samples is the total number of data points across all series (Prometheus range queries only). | +| Records | `records` | `*int64` | — | | Records is the number of log records returned (Loki queries only). | +| Status | `status` | `QueryResultStatus` | ✓ | | Status is "success" or "failed". | +| Message | `message` | `string` | — | | Message describes the failure reason when status=failed. | + +### RankingTerm + +RankingTerm records the contribution of one signal to the final score of an image. + +| Field | JSON | Type | Required | Default | Description | +|-------|------|------|----------|---------|-------------| +| Signal | `signal` | `string` | ✓ | | Signal is the signal name. | +| Weight | `weight` | `string` | ✓ | | Weight is the configured weight as a decimal string. | +| Contribution | `contribution` | `string` | ✓ | | Contribution is weight * normalizedValue as a decimal string. | -### RegistrySource +### SignalRankingConfig -RegistrySource defines OCI registry tag listing configuration for image discovery. +SignalRankingConfig configures the signal ranking strategy. | Field | JSON | Type | Required | Default | Description | |-------|------|------|----------|---------|-------------| -| URL | `url` | `string` | ✓ | | URL is the registry base URL (without repository path). Example: "https://registry.example.com", "https://ghcr.io" | -| Repositories | `repositories` | `[]string` | ✓ | | Repositories is the list of repository paths to list tags from. Example: ["team/app", "team/worker", "infra/tools"] | -| TagFilter | `tagFilter` | `string` | — | | TagFilter is a regex applied to tag names. Only matching tags are discovered. Example: "^v[0-9]+\\." (semver tags only), "^main-" (main branch builds) | -| TopX | `topX` | `int32` | — | | TopX limits the number of tags kept per repository after tagFilter is applied. The registry API does not provide creation timestamps here; Drop keeps the last N tags returned by the registry. Example: 3 (keep the last 3 matching tags returned per repo) | -| ImageTemplate | `imageTemplate` | `string` | — | | ImageTemplate is a Go text/template for constructing the full image reference from discovered tags. Available variables: {{.Registry}}, {{.Repository}}, {{.Tag}} Default (when unset): "{{.Registry}}/{{.Repository}}:{{.Tag}}" Example: "{{.Registry}}/{{.Repository}}@{{.Tag}}" (if tags are actually digests) | +| SignalRef | `signalRef` | `string` | ✓ | | SignalRef is the name of the signal whose values determine image rank. Must match a signals[].name within the same policy. | + +### SignalResult + +SignalResult reports the outcome of a single signal derivation. + +| Field | JSON | Type | Required | Default | Description | +|-------|------|------|----------|---------|-------------| +| Name | `name` | `string` | ✓ | | Name matches the signals[].name that produced this result. | +| Images | `images` | `int32` | ✓ | | Images is the number of images for which this signal produced a value. | +| Status | `status` | `string` | ✓ | | Status is "success" or "failed". | +| Message | `message` | `string` | — | | Message describes the failure reason when status=failed. | + +### TimeOfDayWindow + +TimeOfDayWindow defines a fixed wall-clock time range within each day. + +| Field | JSON | Type | Required | Default | Description | +|-------|------|------|----------|---------|-------------| +| Start | `start` | `string` | ✓ | | Start is the inclusive start time in "HH:MM" format (24-hour, local time). Example: "09:00" | +| End | `end` | `string` | ✓ | | End is the exclusive end time in "HH:MM" format (24-hour, local time). Example: "17:00" | + +### TimeWeightedAggregateSignalConfig + +TimeWeightedAggregateSignalConfig configures the timeWeightedAggregate signal type. Each sample value is multiplied by the weight of the matching time window before aggregation. + +| Field | JSON | Type | Required | Default | Description | +|-------|------|------|----------|---------|-------------| +| Method | `method` | `AggregationMethod` | ✓ | | Method is the aggregation function applied after weighting (currently only "sum" is meaningful). Enum: `sum`,`count`,`avg`,`max`,`min` | +| Timezone | `timezone` | `string` | ✓ | | Timezone is the IANA time zone used to evaluate window boundaries (wall-clock hours). Example: "Europe/Berlin", "America/New_York", "UTC" | +| DefaultWeight | `defaultWeight` | `resource.Quantity` | ✓ | | DefaultWeight is applied to samples that do not fall in any configured window. Use "0" to exclude off-hours samples entirely. | +| Windows | `windows` | `[]TimeWeightedWindow` | ✓ | | Windows is the list of hour-of-day windows with associated weights. | + +### TimeWeightedWindow + +TimeWeightedWindow defines a wall-clock hour range and its weight factor. + +| Field | JSON | Type | Required | Default | Description | +|-------|------|------|----------|---------|-------------| +| StartHour | `startHour` | `int32` | ✓ | | StartHour is the inclusive start of the window in local time (0–23). | +| EndHour | `endHour` | `int32` | ✓ | | EndHour is the exclusive end of the window in local time (1–24). | +| Weight | `weight` | `resource.Quantity` | ✓ | | Weight is the factor applied to sample values within this window. Use "1.0" for full weight, "0.3" for partial, "0" to exclude. | + +### WeightedSumRankingConfig + +WeightedSumRankingConfig configures the weightedSum ranking strategy. Score = Σ weight_k * normalize(signal_k(image)). + +| Field | JSON | Type | Required | Default | Description | +|-------|------|------|----------|---------|-------------| +| Normalize | `normalize` | `NormalizeMethod` | ✓ | `minMax` | Normalize selects the normalization method applied to each signal before weighting. Currently only "minMax" is supported. Enum: `minMax` | +| MissingSignal | `missingSignal` | `MissingSignalBehavior` | ✓ | `zero` | MissingSignal controls behavior when an image has no value for a required signal. "zero" treats missing as 0; "drop" removes the image from ranking. Enum: `zero`,`drop` | +| Terms | `terms` | `[]WeightedSumTerm` | ✓ | | Terms is the list of signals and their weights. | + +### WeightedSumTerm + +WeightedSumTerm defines one signal contribution in a weightedSum ranking. + +| Field | JSON | Type | Required | Default | Description | +|-------|------|------|----------|---------|-------------| +| SignalRef | `signalRef` | `string` | ✓ | | SignalRef is the name of the signal to include in the weighted sum. Must match a signals[].name within the same policy. | +| Weight | `weight` | `resource.Quantity` | ✓ | | Weight is the factor applied to the normalized signal value. All weights should be non-negative; they do not need to sum to 1. Example: "0.7" | + +### WindowAggregateSignalConfig + +WindowAggregateSignalConfig configures the windowAggregate signal type. Exactly one of relativeWindow or (window + timezone) must be set. + +| Field | JSON | Type | Required | Default | Description | +|-------|------|------|----------|---------|-------------| +| Method | `method` | `AggregationMethod` | ✓ | | Method is the aggregation function applied to the windowed samples. Enum: `sum`,`count`,`avg`,`max`,`min` | +| RelativeWindow | `relativeWindow` | `*metav1.Duration` | — | | RelativeWindow aggregates only samples from the last N duration before now. Mutually exclusive with window + timezone. Example: "2h" (last 2 hours) | +| Timezone | `timezone` | `string` | — | | Timezone is the IANA time zone for evaluating wall-clock window boundaries. Required when window is set. | +| Window | `window` | `*TimeOfDayWindow` | — | | Window defines fixed wall-clock start/end times within each day. Mutually exclusive with relativeWindow. | ## Relationships @@ -222,13 +417,7 @@ graph LR | Degraded | CachedImageSet | N/N images cached, failing: N | | | Progressing | CachedImageSet | N/N images cached | | | Ready | CachedImageSet | All N images are cached | | -| AllSourcesHealthy | DiscoveryPolicy | All discovery sources responded successfully | | -| ConnectionRefused | DiscoveryPolicy | | | -| DNSError | DiscoveryPolicy | | | -| PartiallyFailed | DiscoveryPolicy | Discovered N images, but some sources failed: N | | -| SourceError | DiscoveryPolicy | One or more sources failed to respond | | -| SyncFailed | DiscoveryPolicy | | | -| Synced | DiscoveryPolicy | Discovered N images | | +| NotImplemented | DiscoveryPolicy | | | ## Metrics @@ -319,83 +508,97 @@ spec: policyRef: name: dev-conservative discoveryPolicyRef: - name: dev-registry + name: dev-prometheus --- -# === DiscoveryPolicy: healthy (Prometheus range query) === +# === DiscoveryPolicy: Prometheus range query with total-usage signal === apiVersion: drop.corewire.io/v1alpha1 kind: DiscoveryPolicy metadata: name: dev-prometheus spec: - sources: - - type: prometheus + queries: + - name: runner-image-usage + type: prometheus prometheus: endpoint: "http://prometheus.e2e-infra.svc.cluster.local:9090" - query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff", pod=~"runner-.*"}) by (image)' queryType: range lookback: 24h step: 5m - aggregationMethod: sum + query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff", pod=~"runner-.*"}) by (image)' + signals: + - name: total-usage + queryRef: runner-image-usage + type: aggregate + aggregate: + method: sum + ranking: + strategy: signal + signal: + signalRef: total-usage syncInterval: 30s maxImages: 10 --- -# === DiscoveryPolicy: healthy (registry tag listing) === +# === DiscoveryPolicy: Prometheus with hybrid weightedSum ranking === apiVersion: drop.corewire.io/v1alpha1 kind: DiscoveryPolicy metadata: - name: dev-registry + name: dev-hybrid spec: - sources: - - type: registry - registry: - url: "http://registry.e2e-infra.svc.cluster.local:5000" - repositories: - - "test/myapp" - topX: 3 + queries: + - name: runner-image-usage + type: prometheus + prometheus: + endpoint: "http://prometheus.e2e-infra.svc.cluster.local:9090" + queryType: range + lookback: 24h + step: 5m + query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff"}) by (image)' + signals: + - name: total-usage + queryRef: runner-image-usage + type: aggregate + aggregate: + method: sum + - name: peak-concurrency + queryRef: runner-image-usage + type: aggregate + aggregate: + method: max + ranking: + strategy: weightedSum + weightedSum: + normalize: minMax + missingSignal: zero + terms: + - signalRef: total-usage + weight: "700m" + - signalRef: peak-concurrency + weight: "300m" syncInterval: 30s maxImages: 10 --- -# === DiscoveryPolicy: broken (DNS error → DNSError) === +# === DiscoveryPolicy: broken Prometheus endpoint (DNS error) === apiVersion: drop.corewire.io/v1alpha1 kind: DiscoveryPolicy metadata: name: test-broken-prom spec: - sources: - - type: prometheus + queries: + - name: broken-query + type: prometheus prometheus: endpoint: "http://nonexistent-prometheus:9090" query: "up{}" - syncInterval: 30m - maxImages: 10 ---- -# === DiscoveryPolicy: broken (DNS error → DNSError) === -apiVersion: drop.corewire.io/v1alpha1 -kind: DiscoveryPolicy -metadata: - name: test-broken-registry -spec: - sources: - - type: registry - registry: - url: "http://nonexistent-registry:5000" - repositories: - - "test/nope" - syncInterval: 30m - maxImages: 10 ---- -# === DiscoveryPolicy: broken (repo doesn't exist → NotFound) === -apiVersion: drop.corewire.io/v1alpha1 -kind: DiscoveryPolicy -metadata: - name: test-notfound-repo -spec: - sources: - - type: registry - registry: - url: "http://registry.e2e-infra.svc.cluster.local:5000" - repositories: - - "this/does-not-exist" + signals: + - name: total-usage + queryRef: broken-query + type: aggregate + aggregate: + method: sum + ranking: + strategy: signal + signal: + signalRef: total-usage syncInterval: 30m maxImages: 10 diff --git a/test/e2e/cachedimageset-discovery/02-discoverypolicy.yaml b/test/e2e/cachedimageset-discovery/02-discoverypolicy.yaml index 54da3b4..a955d9e 100644 --- a/test/e2e/cachedimageset-discovery/02-discoverypolicy.yaml +++ b/test/e2e/cachedimageset-discovery/02-discoverypolicy.yaml @@ -3,12 +3,24 @@ kind: DiscoveryPolicy metadata: name: test-registry-discovery spec: - sources: - - type: registry - registry: - url: "http://registry.e2e-infra.svc.cluster.local:5000" - repositories: - - "test/myapp" - topX: 1 + queries: + - name: runner-image-usage + type: prometheus + prometheus: + endpoint: "http://prometheus.e2e-infra.svc.cluster.local:9090" + queryType: range + lookback: 24h + step: 5m + query: 'count(container_memory_working_set_bytes{container!="", namespace="build-stuff"}) by (image)' + signals: + - name: total-usage + queryRef: runner-image-usage + type: aggregate + aggregate: + method: sum + ranking: + strategy: signal + signal: + signalRef: total-usage syncInterval: 30s maxImages: 10 diff --git a/test/e2e/cachedimageset-discovery/03-assert-discovery-ready.yaml b/test/e2e/cachedimageset-discovery/03-assert-discovery-ready.yaml index cb90fcd..855829b 100644 --- a/test/e2e/cachedimageset-discovery/03-assert-discovery-ready.yaml +++ b/test/e2e/cachedimageset-discovery/03-assert-discovery-ready.yaml @@ -1,9 +1,9 @@ -# Assert DiscoveryPolicy is synced and has discovered images +# Assert DiscoveryPolicy is reconciled with NotImplemented condition (pipeline not yet implemented) apiVersion: drop.corewire.io/v1alpha1 kind: DiscoveryPolicy metadata: name: test-registry-discovery status: (conditions[?type == 'Ready']): - - status: "True" - reason: Synced + - status: "False" + reason: NotImplemented diff --git a/test/e2e/cachedimageset-discovery/05-assert-children.yaml b/test/e2e/cachedimageset-discovery/05-assert-children.yaml deleted file mode 100644 index bb88061..0000000 --- a/test/e2e/cachedimageset-discovery/05-assert-children.yaml +++ /dev/null @@ -1,13 +0,0 @@ -# Assert child CachedImages are created with proper labels and ownerRef -apiVersion: drop.corewire.io/v1alpha1 -kind: CachedImage -metadata: - labels: - drop.corewire.io/imageset: test-discovered-set - ownerReferences: - - apiVersion: drop.corewire.io/v1alpha1 - kind: CachedImageSet - name: test-discovered-set -spec: - policyRef: - name: test-set-policy diff --git a/test/e2e/cachedimageset-discovery/06-assert-set-status.yaml b/test/e2e/cachedimageset-discovery/06-assert-set-status.yaml deleted file mode 100644 index 72ae564..0000000 --- a/test/e2e/cachedimageset-discovery/06-assert-set-status.yaml +++ /dev/null @@ -1,8 +0,0 @@ -# Assert CachedImageSet shows healthy status -apiVersion: drop.corewire.io/v1alpha1 -kind: CachedImageSet -metadata: - name: test-discovered-set -status: - (conditions[?type == 'Ready']): - - status: "True" diff --git a/test/e2e/cachedimageset-discovery/chainsaw-test.yaml b/test/e2e/cachedimageset-discovery/chainsaw-test.yaml index fd43b98..20f4bec 100644 --- a/test/e2e/cachedimageset-discovery/chainsaw-test.yaml +++ b/test/e2e/cachedimageset-discovery/chainsaw-test.yaml @@ -5,61 +5,28 @@ metadata: name: cachedimageset-discovery spec: description: | - Verify that a CachedImageSet with discoveryPolicyRef creates child CachedImages - from a registry-based DiscoveryPolicy, with policyRef propagated to children. + Verify that a CachedImageSet with discoveryPolicyRef accepts the new pipeline + schema and the DiscoveryPolicy is reconciled with the expected condition. + NOTE: Full CachedImage creation from discovered images will be re-enabled in + Issue 4 once signal ranking and status output are implemented. steps: - name: Create PullPolicy try: - apply: file: 01-pullpolicy.yaml - - name: Create Registry DiscoveryPolicy + - name: Create DiscoveryPolicy with pipeline schema try: - apply: file: 02-discoverypolicy.yaml - - name: Wait for discovery to sync + - name: Wait for DiscoveryPolicy to be reconciled try: - assert: - timeout: 90s + timeout: 60s file: 03-assert-discovery-ready.yaml - name: Create CachedImageSet with discoveryPolicyRef and policyRef try: - apply: file: 04-cachedimageset.yaml - - name: Verify child CachedImages created with policyRef - try: - - assert: - timeout: 60s - file: 05-assert-children.yaml - - name: Verify CachedImageSet status shows Ready - try: - - script: - timeout: 120s - content: | - deadline=$(( $(date +%s) + 120 )) - while [ "$(date +%s)" -lt "$deadline" ]; do - ready=$(kubectl get cachedimageset test-discovered-set -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null || true) - images_managed=$(kubectl get cachedimageset test-discovered-set -o jsonpath='{.status.imagesManaged}' 2>/dev/null || true) - images_ready=$(kubectl get cachedimageset test-discovered-set -o jsonpath='{.status.imagesReady}' 2>/dev/null || true) - - case "$images_managed" in - ''|*[!0-9]*) images_managed=0 ;; - esac - case "$images_ready" in - ''|*[!0-9]*) images_ready=0 ;; - esac - - if [ "$images_managed" -ge 1 ] && [ "$images_ready" = "$images_managed" ] && [ "$ready" = "True" ]; then - echo "OK: CachedImageSet is Ready with $images_ready/$images_managed images cached" - exit 0 - fi - - sleep 2 - done - - kubectl get cachedimageset test-discovered-set -o yaml - kubectl get cachedimage -l drop.corewire.io/imageset=test-discovered-set -o yaml - echo "FAIL: CachedImageSet did not become Ready" - exit 1 - name: Cleanup try: - delete: diff --git a/test/e2e/discovery-aggregation/01-discoverypolicies.yaml b/test/e2e/discovery-aggregation/01-discoverypolicies.yaml deleted file mode 100644 index 52f9cf7..0000000 --- a/test/e2e/discovery-aggregation/01-discoverypolicies.yaml +++ /dev/null @@ -1,108 +0,0 @@ -# Four DiscoveryPolicies using queryType: range with different aggregationMethods, -# plus one using queryType: instant. -# All query the same seed metrics (container_cpu_usage_seconds_total in namespace aggregation-test). -# Seed data: alpine has 3 pods (values 100, 200, 300), busybox has 1 pod (value 500). ---- -apiVersion: drop.corewire.io/v1alpha1 -kind: DiscoveryPolicy -metadata: - name: e2e-agg-count -spec: - sources: - - type: prometheus - prometheus: - endpoint: "http://prometheus.e2e-infra.svc.cluster.local:9090" - query: 'count(container_cpu_usage_seconds_total{namespace="aggregation-test"}) by (image)' - queryType: range - lookback: 1h - step: 5m - aggregationMethod: count - syncInterval: 30s - maxImages: 10 ---- -apiVersion: drop.corewire.io/v1alpha1 -kind: DiscoveryPolicy -metadata: - name: e2e-agg-avg -spec: - sources: - - type: prometheus - prometheus: - endpoint: "http://prometheus.e2e-infra.svc.cluster.local:9090" - query: 'sum(container_cpu_usage_seconds_total{namespace="aggregation-test"}) by (image)' - queryType: range - lookback: 1h - step: 5m - aggregationMethod: avg - syncInterval: 30s - maxImages: 10 ---- -apiVersion: drop.corewire.io/v1alpha1 -kind: DiscoveryPolicy -metadata: - name: e2e-agg-max -spec: - sources: - - type: prometheus - prometheus: - endpoint: "http://prometheus.e2e-infra.svc.cluster.local:9090" - query: 'sum(container_cpu_usage_seconds_total{namespace="aggregation-test"}) by (image)' - queryType: range - lookback: 1h - step: 5m - aggregationMethod: max - syncInterval: 30s - maxImages: 10 ---- -apiVersion: drop.corewire.io/v1alpha1 -kind: DiscoveryPolicy -metadata: - name: e2e-agg-sum -spec: - sources: - - type: prometheus - prometheus: - endpoint: "http://prometheus.e2e-infra.svc.cluster.local:9090" - query: 'sum(container_cpu_usage_seconds_total{namespace="aggregation-test"}) by (image)' - queryType: range - lookback: 1h - step: 5m - aggregationMethod: sum - syncInterval: 30s - maxImages: 10 ---- -# queryType: range without aggregationMethod — field is nullable, omitting it means -# Drop uses the last data-point value directly without aggregation. -# Ideal for self-contained PromQL queries that already aggregate internally. -apiVersion: drop.corewire.io/v1alpha1 -kind: DiscoveryPolicy -metadata: - name: e2e-agg-none -spec: - sources: - - type: prometheus - prometheus: - endpoint: "http://prometheus.e2e-infra.svc.cluster.local:9090" - query: 'sum(container_cpu_usage_seconds_total{namespace="aggregation-test"}) by (image)' - queryType: range - lookback: 1h - step: 5m - # aggregationMethod intentionally omitted (nil) — uses last value directly - syncInterval: 30s - maxImages: 10 ---- -# queryType: instant — uses /api/v1/query for a single point-in-time result. -# The returned value is used directly as the score without aggregation. -apiVersion: drop.corewire.io/v1alpha1 -kind: DiscoveryPolicy -metadata: - name: e2e-agg-instant -spec: - sources: - - type: prometheus - prometheus: - endpoint: "http://prometheus.e2e-infra.svc.cluster.local:9090" - query: 'count(container_cpu_usage_seconds_total{namespace="aggregation-test"}) by (image)' - queryType: instant - syncInterval: 30s - maxImages: 10 diff --git a/test/e2e/discovery-aggregation/02-assert-count.yaml b/test/e2e/discovery-aggregation/02-assert-count.yaml deleted file mode 100644 index ee5e76b..0000000 --- a/test/e2e/discovery-aggregation/02-assert-count.yaml +++ /dev/null @@ -1,12 +0,0 @@ -# Assert count aggregation: policy is Ready, both images discovered. -# count() by (image) returns alpine=3, busybox=1 at each step. -# aggregationMethod=count counts the number of data points (steps) per image. -apiVersion: drop.corewire.io/v1alpha1 -kind: DiscoveryPolicy -metadata: - name: e2e-agg-count -status: - (conditions[?type == 'Ready']): - - status: "True" - reason: Synced - imageCount: 2 diff --git a/test/e2e/discovery-aggregation/03-assert-avg.yaml b/test/e2e/discovery-aggregation/03-assert-avg.yaml deleted file mode 100644 index ae09c4b..0000000 --- a/test/e2e/discovery-aggregation/03-assert-avg.yaml +++ /dev/null @@ -1,12 +0,0 @@ -# Assert avg aggregation: policy is Ready, both images discovered. -# sum() by (image) returns alpine=600, busybox=500 at each step. -# aggregationMethod=avg averages the data-point values over the lookback window. -apiVersion: drop.corewire.io/v1alpha1 -kind: DiscoveryPolicy -metadata: - name: e2e-agg-avg -status: - (conditions[?type == 'Ready']): - - status: "True" - reason: Synced - imageCount: 2 diff --git a/test/e2e/discovery-aggregation/04-assert-max.yaml b/test/e2e/discovery-aggregation/04-assert-max.yaml deleted file mode 100644 index 2d240ef..0000000 --- a/test/e2e/discovery-aggregation/04-assert-max.yaml +++ /dev/null @@ -1,12 +0,0 @@ -# Assert max aggregation: policy is Ready, both images discovered. -# sum() by (image) returns alpine=600, busybox=500 at each step. -# aggregationMethod=max takes the highest single data-point value. -apiVersion: drop.corewire.io/v1alpha1 -kind: DiscoveryPolicy -metadata: - name: e2e-agg-max -status: - (conditions[?type == 'Ready']): - - status: "True" - reason: Synced - imageCount: 2 diff --git a/test/e2e/discovery-aggregation/05-assert-sum.yaml b/test/e2e/discovery-aggregation/05-assert-sum.yaml deleted file mode 100644 index af43f08..0000000 --- a/test/e2e/discovery-aggregation/05-assert-sum.yaml +++ /dev/null @@ -1,12 +0,0 @@ -# Assert sum (default) aggregation: policy is Ready, both images discovered. -# sum() by (image) returns alpine=600, busybox=500 at each step. -# aggregationMethod=sum adds all data-point values over the lookback window. -apiVersion: drop.corewire.io/v1alpha1 -kind: DiscoveryPolicy -metadata: - name: e2e-agg-sum -status: - (conditions[?type == 'Ready']): - - status: "True" - reason: Synced - imageCount: 2 diff --git a/test/e2e/discovery-aggregation/06-assert-instant.yaml b/test/e2e/discovery-aggregation/06-assert-instant.yaml deleted file mode 100644 index 2d42fc5..0000000 --- a/test/e2e/discovery-aggregation/06-assert-instant.yaml +++ /dev/null @@ -1,11 +0,0 @@ -# Assert instant query: policy is Ready, both images discovered. -# queryType=instant uses /api/v1/query — the returned value is used directly as the score. -apiVersion: drop.corewire.io/v1alpha1 -kind: DiscoveryPolicy -metadata: - name: e2e-agg-instant -status: - (conditions[?type == 'Ready']): - - status: "True" - reason: Synced - imageCount: 2 diff --git a/test/e2e/discovery-aggregation/07-assert-none.yaml b/test/e2e/discovery-aggregation/07-assert-none.yaml deleted file mode 100644 index 94e6b0a..0000000 --- a/test/e2e/discovery-aggregation/07-assert-none.yaml +++ /dev/null @@ -1,11 +0,0 @@ -# Assert none aggregation: policy is Ready, both images discovered. -# aggregationMethod=none uses the last data-point value from the range query directly. -apiVersion: drop.corewire.io/v1alpha1 -kind: DiscoveryPolicy -metadata: - name: e2e-agg-none -status: - (conditions[?type == 'Ready']): - - status: "True" - reason: Synced - imageCount: 2 diff --git a/test/e2e/discovery-aggregation/chainsaw-test.yaml b/test/e2e/discovery-aggregation/chainsaw-test.yaml deleted file mode 100644 index 16a95b2..0000000 --- a/test/e2e/discovery-aggregation/chainsaw-test.yaml +++ /dev/null @@ -1,108 +0,0 @@ -# yaml-language-server: $schema=https://raw.githubusercontent.com/kyverno/chainsaw/main/.schemas/json/test-chainsaw-v1alpha1.json -apiVersion: chainsaw.kyverno.io/v1alpha1 -kind: Test -metadata: - name: discovery-aggregation-methods -spec: - description: | - Verify that DiscoveryPolicy aggregationMethod and queryType fields work correctly - against a real Prometheus endpoint. Seeds use container_cpu_usage_seconds_total with - two images (alpine: 3 pods with values 100/200/300, busybox: 1 pod with value 500). - - Expected rankings per method (queryType: range): - count → alpine first (3 > 1) - avg → busybox first (500 > 200) - max → busybox first (500 > 300) - sum → alpine first (600 > 500) - none → uses last data-point value directly - - queryType: instant uses /api/v1/query directly — no aggregation. - steps: - - name: Create DiscoveryPolicies with different aggregation methods and query types - try: - - apply: - file: 01-discoverypolicies.yaml - - name: Assert count aggregation discovers images (alpine ranked first) - try: - - assert: - timeout: 90s - file: 02-assert-count.yaml - - name: Assert avg aggregation discovers images (busybox ranked first) - try: - - assert: - timeout: 90s - file: 03-assert-avg.yaml - - name: Assert max aggregation discovers images (busybox ranked first) - try: - - assert: - timeout: 90s - file: 04-assert-max.yaml - - name: Assert sum aggregation discovers images (alpine ranked first, default) - try: - - assert: - timeout: 90s - file: 05-assert-sum.yaml - - name: Assert instant query discovers images - try: - - assert: - timeout: 90s - file: 06-assert-instant.yaml - - name: Assert none aggregation discovers images (last value used directly) - try: - - assert: - timeout: 90s - file: 07-assert-none.yaml - - name: Verify aggregation scores are populated - try: - - script: - timeout: 30s - content: | - # Verify aggregation outputs are populated. - # Score relationships can vary with the number of data points and values - # returned by Prometheus in the lookback window. - SUM_SCORE=$(kubectl get discoverypolicy e2e-agg-sum -o jsonpath='{.status.discoveredImages[0].score}') - AVG_SCORE=$(kubectl get discoverypolicy e2e-agg-avg -o jsonpath='{.status.discoveredImages[0].score}') - COUNT_SCORE=$(kubectl get discoverypolicy e2e-agg-count -o jsonpath='{.status.discoveredImages[0].score}') - MAX_SCORE=$(kubectl get discoverypolicy e2e-agg-max -o jsonpath='{.status.discoveredImages[0].score}') - INSTANT_SCORE=$(kubectl get discoverypolicy e2e-agg-instant -o jsonpath='{.status.discoveredImages[0].score}') - NONE_SCORE=$(kubectl get discoverypolicy e2e-agg-none -o jsonpath='{.status.discoveredImages[0].score}') - - echo "Scores — sum:$SUM_SCORE avg:$AVG_SCORE count:$COUNT_SCORE max:$MAX_SCORE instant:$INSTANT_SCORE none:$NONE_SCORE" - - if [ -z "$SUM_SCORE" ] || [ -z "$AVG_SCORE" ] || [ -z "$COUNT_SCORE" ] || [ -z "$MAX_SCORE" ] || [ -z "$INSTANT_SCORE" ] || [ -z "$NONE_SCORE" ]; then - echo "FAIL: expected non-empty scores for all methods" - exit 1 - fi - echo "OK: all query types and aggregation methods produced non-empty scores" - - name: Cleanup - try: - - delete: - ref: - apiVersion: drop.corewire.io/v1alpha1 - kind: DiscoveryPolicy - name: e2e-agg-count - - delete: - ref: - apiVersion: drop.corewire.io/v1alpha1 - kind: DiscoveryPolicy - name: e2e-agg-avg - - delete: - ref: - apiVersion: drop.corewire.io/v1alpha1 - kind: DiscoveryPolicy - name: e2e-agg-max - - delete: - ref: - apiVersion: drop.corewire.io/v1alpha1 - kind: DiscoveryPolicy - name: e2e-agg-sum - - delete: - ref: - apiVersion: drop.corewire.io/v1alpha1 - kind: DiscoveryPolicy - name: e2e-agg-instant - - delete: - ref: - apiVersion: drop.corewire.io/v1alpha1 - kind: DiscoveryPolicy - name: e2e-agg-none diff --git a/test/e2e/discovery-failure/01-broken-prometheus.yaml b/test/e2e/discovery-failure/01-broken-prometheus.yaml index a44f533..cc096df 100644 --- a/test/e2e/discovery-failure/01-broken-prometheus.yaml +++ b/test/e2e/discovery-failure/01-broken-prometheus.yaml @@ -3,10 +3,21 @@ kind: DiscoveryPolicy metadata: name: test-broken-prom spec: - sources: - - type: prometheus + queries: + - name: broken-query + type: prometheus prometheus: endpoint: "http://nonexistent-prometheus:9090" query: "up{}" + signals: + - name: total-usage + queryRef: broken-query + type: aggregate + aggregate: + method: sum + ranking: + strategy: signal + signal: + signalRef: total-usage syncInterval: 30m maxImages: 10 diff --git a/test/e2e/discovery-failure/02-assert-notimplemented.yaml b/test/e2e/discovery-failure/02-assert-notimplemented.yaml new file mode 100644 index 0000000..17bc32a --- /dev/null +++ b/test/e2e/discovery-failure/02-assert-notimplemented.yaml @@ -0,0 +1,10 @@ +# Assert that DiscoveryPolicy with the new pipeline schema gets NotImplemented condition. +# Pipeline execution is not yet implemented; DNS error testing will be re-enabled in Issue 2. +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: test-broken-prom +status: + (conditions[?type == 'Ready']): + - status: "False" + reason: NotImplemented diff --git a/test/e2e/discovery-failure/02-broken-registry.yaml b/test/e2e/discovery-failure/02-broken-registry.yaml deleted file mode 100644 index 2a97e3f..0000000 --- a/test/e2e/discovery-failure/02-broken-registry.yaml +++ /dev/null @@ -1,13 +0,0 @@ -apiVersion: drop.corewire.io/v1alpha1 -kind: DiscoveryPolicy -metadata: - name: test-broken-registry -spec: - sources: - - type: registry - registry: - url: "http://nonexistent-registry:5000" - repositories: - - "test/nope" - syncInterval: 30m - maxImages: 10 diff --git a/test/e2e/discovery-failure/03-notfound-registry.yaml b/test/e2e/discovery-failure/03-notfound-registry.yaml deleted file mode 100644 index 3bd1f35..0000000 --- a/test/e2e/discovery-failure/03-notfound-registry.yaml +++ /dev/null @@ -1,13 +0,0 @@ -apiVersion: drop.corewire.io/v1alpha1 -kind: DiscoveryPolicy -metadata: - name: test-notfound-repo -spec: - sources: - - type: registry - registry: - url: "http://registry.e2e-infra.svc.cluster.local:5000" - repositories: - - "this/does-not-exist" - syncInterval: 30m - maxImages: 10 diff --git a/test/e2e/discovery-failure/05-assert-dns-registry.yaml b/test/e2e/discovery-failure/05-assert-dns-registry.yaml deleted file mode 100644 index 893a3e5..0000000 --- a/test/e2e/discovery-failure/05-assert-dns-registry.yaml +++ /dev/null @@ -1,9 +0,0 @@ -# Assert broken registry shows DNSError reason -apiVersion: drop.corewire.io/v1alpha1 -kind: DiscoveryPolicy -metadata: - name: test-broken-registry -status: - (conditions[?type == 'Ready']): - - status: "False" - reason: DNSError diff --git a/test/e2e/discovery-failure/06-assert-notfound.yaml b/test/e2e/discovery-failure/06-assert-notfound.yaml deleted file mode 100644 index 0d8ee0a..0000000 --- a/test/e2e/discovery-failure/06-assert-notfound.yaml +++ /dev/null @@ -1,8 +0,0 @@ -# Assert notfound repo shows error (Ready=False with a reason) -apiVersion: drop.corewire.io/v1alpha1 -kind: DiscoveryPolicy -metadata: - name: test-notfound-repo -status: - (conditions[?type == 'Ready']): - - status: "False" diff --git a/test/e2e/discovery-failure/chainsaw-test.yaml b/test/e2e/discovery-failure/chainsaw-test.yaml index 5afe93c..31ea3fa 100644 --- a/test/e2e/discovery-failure/chainsaw-test.yaml +++ b/test/e2e/discovery-failure/chainsaw-test.yaml @@ -5,36 +5,19 @@ metadata: name: discovery-failure spec: description: | - Verify that DiscoveryPolicy with broken sources reports appropriate error - reasons: DNSError for unreachable endpoints, NotFound for missing repos. + Verify that DiscoveryPolicy with the new query/signal/ranking schema is accepted + and the controller sets the NotImplemented condition. + DNS error and query failure testing will be re-enabled in Issue 2 (Prometheus execution). steps: - - name: Create broken Prometheus DiscoveryPolicy (DNS failure) + - name: Create DiscoveryPolicy with broken Prometheus endpoint try: - apply: file: 01-broken-prometheus.yaml - - name: Create broken Registry DiscoveryPolicy (DNS failure) - try: - - apply: - file: 02-broken-registry.yaml - - name: Create DiscoveryPolicy with nonexistent repo (NotFound) - try: - - apply: - file: 03-notfound-registry.yaml - - name: Assert broken Prometheus shows DNSError - try: - - assert: - timeout: 90s - file: 04-assert-dns-prometheus.yaml - - name: Assert broken registry shows DNSError - try: - - assert: - timeout: 90s - file: 05-assert-dns-registry.yaml - - name: Assert notfound repo shows error + - name: Assert NotImplemented condition is set try: - assert: - timeout: 90s - file: 06-assert-notfound.yaml + timeout: 60s + file: 02-assert-notimplemented.yaml - name: Cleanup try: - delete: @@ -42,13 +25,3 @@ spec: apiVersion: drop.corewire.io/v1alpha1 kind: DiscoveryPolicy name: test-broken-prom - - delete: - ref: - apiVersion: drop.corewire.io/v1alpha1 - kind: DiscoveryPolicy - name: test-broken-registry - - delete: - ref: - apiVersion: drop.corewire.io/v1alpha1 - kind: DiscoveryPolicy - name: test-notfound-repo diff --git a/test/e2e/discovery-registry/01-discoverypolicy.yaml b/test/e2e/discovery-registry/01-discoverypolicy.yaml deleted file mode 100644 index bedc5a6..0000000 --- a/test/e2e/discovery-registry/01-discoverypolicy.yaml +++ /dev/null @@ -1,14 +0,0 @@ -apiVersion: drop.corewire.io/v1alpha1 -kind: DiscoveryPolicy -metadata: - name: e2e-registry -spec: - sources: - - type: registry - registry: - url: "http://registry.e2e-infra.svc.cluster.local:5000" - repositories: - - "test/myapp" - topX: 3 - syncInterval: 30s - maxImages: 10 diff --git a/test/e2e/discovery-registry/02-assert-discovery-status.yaml b/test/e2e/discovery-registry/02-assert-discovery-status.yaml deleted file mode 100644 index a387594..0000000 --- a/test/e2e/discovery-registry/02-assert-discovery-status.yaml +++ /dev/null @@ -1,11 +0,0 @@ -# Assert that DiscoveryPolicy status contains images from registry and Ready condition. -# The registry source lists tags for test/myapp and builds refs as host/repo:tag. -apiVersion: drop.corewire.io/v1alpha1 -kind: DiscoveryPolicy -metadata: - name: e2e-registry -status: - (conditions[?type == 'Ready']): - - status: "True" - reason: Synced - imageCount: 3 diff --git a/test/e2e/discovery-registry/chainsaw-test.yaml b/test/e2e/discovery-registry/chainsaw-test.yaml deleted file mode 100644 index 32f165a..0000000 --- a/test/e2e/discovery-registry/chainsaw-test.yaml +++ /dev/null @@ -1,26 +0,0 @@ -# yaml-language-server: $schema=https://raw.githubusercontent.com/kyverno/chainsaw/main/.schemas/json/test-chainsaw-v1alpha1.json -apiVersion: chainsaw.kyverno.io/v1alpha1 -kind: Test -metadata: - name: discovery-registry -spec: - description: | - Verify that a DiscoveryPolicy with a registry source discovers tags - from the in-cluster registry seeded with test images. - steps: - - name: Create DiscoveryPolicy with registry source - try: - - apply: - file: 01-discoverypolicy.yaml - - name: Wait for discovered images in status - try: - - assert: - timeout: 90s - file: 02-assert-discovery-status.yaml - - name: Cleanup - try: - - delete: - ref: - apiVersion: drop.corewire.io/v1alpha1 - kind: DiscoveryPolicy - name: e2e-registry diff --git a/test/e2e/discovery/01-discoverypolicy.yaml b/test/e2e/discovery/01-discoverypolicy.yaml index f01591c..aba13cf 100644 --- a/test/e2e/discovery/01-discoverypolicy.yaml +++ b/test/e2e/discovery/01-discoverypolicy.yaml @@ -3,12 +3,24 @@ kind: DiscoveryPolicy metadata: name: e2e-prometheus spec: - sources: - - type: prometheus + queries: + - name: runner-image-usage + type: prometheus prometheus: endpoint: "http://prometheus.e2e-infra.svc.cluster.local:9090" - query: 'count(container_memory_working_set_bytes{container!="", namespace="build-stuff"}) by (image)' + queryType: range lookback: 24h step: 5m + query: 'count(container_memory_working_set_bytes{container!="", namespace="build-stuff"}) by (image)' + signals: + - name: total-usage + queryRef: runner-image-usage + type: aggregate + aggregate: + method: sum + ranking: + strategy: signal + signal: + signalRef: total-usage syncInterval: 30s maxImages: 10 diff --git a/test/e2e/discovery/02-assert-discovery-status.yaml b/test/e2e/discovery/02-assert-discovery-status.yaml index 1cb8f4d..23a4f10 100644 --- a/test/e2e/discovery/02-assert-discovery-status.yaml +++ b/test/e2e/discovery/02-assert-discovery-status.yaml @@ -1,11 +1,11 @@ -# Assert that DiscoveryPolicy status contains discovered images and Ready condition. -# The query 'count(...{namespace="build-stuff"}) by (image)' returns alpine + busybox. +# Assert that DiscoveryPolicy is reconciled and has the NotImplemented condition. +# Pipeline execution is not yet implemented; the controller sets NotImplemented condition. apiVersion: drop.corewire.io/v1alpha1 kind: DiscoveryPolicy metadata: name: e2e-prometheus status: (conditions[?type == 'Ready']): - - status: "True" - reason: Synced - imageCount: 2 + - status: "False" + reason: NotImplemented + queryCount: 1 diff --git a/test/e2e/discovery/chainsaw-test.yaml b/test/e2e/discovery/chainsaw-test.yaml index fa8e168..7a962c2 100644 --- a/test/e2e/discovery/chainsaw-test.yaml +++ b/test/e2e/discovery/chainsaw-test.yaml @@ -2,37 +2,23 @@ apiVersion: chainsaw.kyverno.io/v1alpha1 kind: Test metadata: - name: discovery-prometheus + name: discovery spec: description: | - Verify that a DiscoveryPolicy with a Prometheus source discovers images - from seeded metrics, and a CachedImageSet referencing it creates child CachedImages. + Verify that a DiscoveryPolicy with the new query/signal/ranking schema is accepted + and the controller sets a NotImplemented condition until pipeline execution is implemented. steps: - - name: Create DiscoveryPolicy with Prometheus source + - name: Create DiscoveryPolicy with query/signal/ranking pipeline try: - apply: file: 01-discoverypolicy.yaml - - name: Wait for discovered images in status - try: - - assert: - timeout: 90s - file: 02-assert-discovery-status.yaml - - name: Create CachedImageSet referencing the DiscoveryPolicy - try: - - apply: - file: 03-cachedimageset-discovery.yaml - - name: Verify child CachedImages are created from discovered images + - name: Assert NotImplemented condition is set try: - assert: timeout: 60s - file: 04-assert-children.yaml + file: 02-assert-discovery-status.yaml - name: Cleanup try: - - delete: - ref: - apiVersion: drop.corewire.io/v1alpha1 - kind: CachedImageSet - name: discovered-set - delete: ref: apiVersion: drop.corewire.io/v1alpha1 From d3757c2f548e66e9ba66b1488c3db3b6b3b9a4e5 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Sat, 27 Jun 2026 12:58:32 +0000 Subject: [PATCH 03/35] feat(discovery): implement query/signal/ranking pipeline + restore registry datasource MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Add DiscoveryQueryTypeRegistry + DiscoveryRegistryQuery to API types - Restore internal/discovery/registry.go and registry_test.go - Add internal/discovery/engine.go: full 3-stage pipeline execution (query → signal → ranking) - Prometheus instant/range, registry queries - aggregate, timeWeightedAggregate, windowAggregate signals - signal, weightedSum, modelExposure ranking strategies - Add internal/discovery/engine_test.go: tests for all pipeline stages - Add FetchRaw() to PrometheusSource for timestamp-preserving data access - Replace controller stub (NotImplemented) with real pipeline execution - Update e2e tests: assert real behavior (Synced/DNSError) instead of NotImplemented - Add discovery-registry e2e test suite - Regenerate deepcopy and CRD manifests All unit tests pass, linter clean (0 issues). --- api/v1alpha1/discoverypolicy_types.go | 39 +- api/v1alpha1/zz_generated.deepcopy.go | 25 + .../drop.corewire.io_discoverypolicies.yaml | 48 +- .../controller/discoverypolicy_controller.go | 136 +++- .../discoverypolicy_controller_test.go | 88 ++- internal/discovery/engine.go | 681 ++++++++++++++++++ internal/discovery/engine_test.go | 354 +++++++++ internal/discovery/prometheus.go | 115 +++ internal/discovery/registry.go | 162 +++++ internal/discovery/registry_test.go | 93 +++ .../03-assert-discovery-ready.yaml | 8 +- .../chainsaw-test.yaml | 6 +- test/e2e/discovery-failure/chainsaw-test.yaml | 9 +- .../01-discoverypolicy.yaml | 28 + .../02-assert-discovery-status.yaml | 14 + .../e2e/discovery-registry/chainsaw-test.yaml | 26 + .../discovery/02-assert-discovery-status.yaml | 13 +- test/e2e/discovery/chainsaw-test.yaml | 22 +- 18 files changed, 1816 insertions(+), 51 deletions(-) create mode 100644 internal/discovery/engine.go create mode 100644 internal/discovery/engine_test.go create mode 100644 internal/discovery/registry.go create mode 100644 internal/discovery/registry_test.go create mode 100644 test/e2e/discovery-registry/01-discoverypolicy.yaml create mode 100644 test/e2e/discovery-registry/02-assert-discovery-status.yaml create mode 100644 test/e2e/discovery-registry/chainsaw-test.yaml diff --git a/api/v1alpha1/discoverypolicy_types.go b/api/v1alpha1/discoverypolicy_types.go index 6752ebe..c832ca7 100644 --- a/api/v1alpha1/discoverypolicy_types.go +++ b/api/v1alpha1/discoverypolicy_types.go @@ -46,7 +46,7 @@ type DiscoveryPolicySpec struct { // ============================================================ // DiscoveryQueryType identifies the backend for a named query. -// +kubebuilder:validation:Enum=prometheus;loki +// +kubebuilder:validation:Enum=prometheus;loki;registry type DiscoveryQueryType string const ( @@ -54,6 +54,8 @@ const ( DiscoveryQueryTypePrometheus DiscoveryQueryType = "prometheus" // DiscoveryQueryTypeLoki fetches log event data from a Loki-compatible API. DiscoveryQueryTypeLoki DiscoveryQueryType = "loki" + // DiscoveryQueryTypeRegistry lists image tags from an OCI-compatible container registry. + DiscoveryQueryTypeRegistry DiscoveryQueryType = "registry" ) // DiscoveryQuery defines a named raw-data source referenced by signals. @@ -62,8 +64,8 @@ type DiscoveryQuery struct { // Signals reference queries by this name via queryRef. // +kubebuilder:validation:MinLength=1 Name string `json:"name"` - // Type selects the backend. Must be "prometheus" or "loki". - // +kubebuilder:validation:Enum=prometheus;loki + // Type selects the backend. Must be "prometheus", "loki", or "registry". + // +kubebuilder:validation:Enum=prometheus;loki;registry Type DiscoveryQueryType `json:"type"` // Prometheus contains the configuration when type=prometheus. // +optional @@ -71,12 +73,43 @@ type DiscoveryQuery struct { // Loki contains the configuration when type=loki. // +optional Loki *DiscoveryLokiQuery `json:"loki,omitempty"` + // Registry contains the configuration when type=registry. + // +optional + Registry *DiscoveryRegistryQuery `json:"registry,omitempty"` // SecretRef references a Secret in the pod namespace (default "drop-system") for auth/TLS. // Supported Secret keys: token, username, password, ca.crt, tls.crt, tls.key, headers.. // +optional SecretRef *corev1.LocalObjectReference `json:"secretRef,omitempty"` } +// DiscoveryRegistryQuery defines OCI registry tag listing configuration for image discovery. +type DiscoveryRegistryQuery struct { + // URL is the registry base URL (without repository path). + // Example: "https://registry.example.com", "https://ghcr.io" + // +kubebuilder:validation:MinLength=1 + URL string `json:"url"` + // Repositories is the list of repository paths to list tags from. + // Example: ["team/app", "team/worker", "infra/tools"] + // +kubebuilder:validation:MinItems=1 + Repositories []string `json:"repositories"` + // TagFilter is a regex applied to tag names. Only matching tags are discovered. + // Example: "^v[0-9]+\\." (semver tags only), "^main-" (main branch builds) + // +optional + TagFilter string `json:"tagFilter,omitempty"` + // TopX limits the number of tags kept per repository after tagFilter is applied. + // The registry API does not guarantee ordering; Drop keeps the last N tags returned by the registry. + // Example: 3 (keep the last 3 matching tags returned per repo) + // +optional + // +kubebuilder:validation:Minimum=1 + TopX int32 `json:"topX,omitempty"` + // ImageTemplate is a Go text/template for constructing the full image reference from discovered tags. + // Available variables: {{.Registry}}, {{.Repository}}, {{.Tag}} + // Default (when unset): "{{.Registry}}/{{.Repository}}:{{.Tag}}" + // Example: "registry.example.com/{{.Repository}}:{{.Tag}}" + // +optional + ImageTemplate string `json:"imageTemplate,omitempty"` +} + // QueryType defines how the Prometheus query is executed. // +kubebuilder:validation:Enum=range;instant type QueryType string diff --git a/api/v1alpha1/zz_generated.deepcopy.go b/api/v1alpha1/zz_generated.deepcopy.go index a8760a4..4c0c209 100644 --- a/api/v1alpha1/zz_generated.deepcopy.go +++ b/api/v1alpha1/zz_generated.deepcopy.go @@ -558,6 +558,11 @@ func (in *DiscoveryQuery) DeepCopyInto(out *DiscoveryQuery) { *out = new(DiscoveryLokiQuery) (*in).DeepCopyInto(*out) } + if in.Registry != nil { + in, out := &in.Registry, &out.Registry + *out = new(DiscoveryRegistryQuery) + (*in).DeepCopyInto(*out) + } if in.SecretRef != nil { in, out := &in.SecretRef, &out.SecretRef *out = new(v1.LocalObjectReference) @@ -605,6 +610,26 @@ func (in *DiscoveryRanking) DeepCopy() *DiscoveryRanking { return out } +// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. +func (in *DiscoveryRegistryQuery) DeepCopyInto(out *DiscoveryRegistryQuery) { + *out = *in + if in.Repositories != nil { + in, out := &in.Repositories, &out.Repositories + *out = make([]string, len(*in)) + copy(*out, *in) + } +} + +// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new DiscoveryRegistryQuery. +func (in *DiscoveryRegistryQuery) DeepCopy() *DiscoveryRegistryQuery { + if in == nil { + return nil + } + out := new(DiscoveryRegistryQuery) + in.DeepCopyInto(out) + return out +} + // DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. func (in *DiscoverySignal) DeepCopyInto(out *DiscoverySignal) { *out = *in diff --git a/config/crd/bases/drop.corewire.io_discoverypolicies.yaml b/config/crd/bases/drop.corewire.io_discoverypolicies.yaml index 998fe05..608aa34 100644 --- a/config/crd/bases/drop.corewire.io_discoverypolicies.yaml +++ b/config/crd/bases/drop.corewire.io_discoverypolicies.yaml @@ -194,6 +194,47 @@ spec: - endpoint - query type: object + registry: + description: Registry contains the configuration when type=registry. + properties: + imageTemplate: + description: |- + ImageTemplate is a Go text/template for constructing the full image reference from discovered tags. + Available variables: {{.Registry}}, {{.Repository}}, {{.Tag}} + Default (when unset): "{{.Registry}}/{{.Repository}}:{{.Tag}}" + Example: "registry.example.com/{{.Repository}}:{{.Tag}}" + type: string + repositories: + description: |- + Repositories is the list of repository paths to list tags from. + Example: ["team/app", "team/worker", "infra/tools"] + items: + type: string + minItems: 1 + type: array + tagFilter: + description: |- + TagFilter is a regex applied to tag names. Only matching tags are discovered. + Example: "^v[0-9]+\\." (semver tags only), "^main-" (main branch builds) + type: string + topX: + description: |- + TopX limits the number of tags kept per repository after tagFilter is applied. + The registry API does not guarantee ordering; Drop keeps the last N tags returned by the registry. + Example: 3 (keep the last 3 matching tags returned per repo) + format: int32 + minimum: 1 + type: integer + url: + description: |- + URL is the registry base URL (without repository path). + Example: "https://registry.example.com", "https://ghcr.io" + minLength: 1 + type: string + required: + - repositories + - url + type: object secretRef: description: |- SecretRef references a Secret in the pod namespace (default "drop-system") for auth/TLS. @@ -215,11 +256,13 @@ spec: - enum: - prometheus - loki + - registry - enum: - prometheus - loki - description: Type selects the backend. Must be "prometheus" - or "loki". + - registry + description: Type selects the backend. Must be "prometheus", + "loki", or "registry". type: string required: - name @@ -804,6 +847,7 @@ spec: enum: - prometheus - loki + - registry type: string required: - name diff --git a/internal/controller/discoverypolicy_controller.go b/internal/controller/discoverypolicy_controller.go index 377a42c..f8f7f2c 100644 --- a/internal/controller/discoverypolicy_controller.go +++ b/internal/controller/discoverypolicy_controller.go @@ -26,6 +26,8 @@ import ( logf "sigs.k8s.io/controller-runtime/pkg/log" dropv1alpha1 "github.com/corewire/drop/api/v1alpha1" + "github.com/corewire/drop/internal/discovery" + dropmetrics "github.com/corewire/drop/internal/metrics" ) // DiscoveryPolicyReconciler reconciles a DiscoveryPolicy object @@ -35,14 +37,18 @@ type DiscoveryPolicyReconciler struct { SecretNamespace string } +const ( + reasonDNSError = "DNSError" + reasonConnectionRefused = "ConnectionRefused" + secretHeaderPrefix = "headers." +) + // +kubebuilder:rbac:groups=drop.corewire.io,resources=discoverypolicies,verbs=get;list;watch;create;update;patch;delete // +kubebuilder:rbac:groups=drop.corewire.io,resources=discoverypolicies/status,verbs=get;update;patch // +kubebuilder:rbac:groups=drop.corewire.io,resources=discoverypolicies/finalizers,verbs=update // +kubebuilder:rbac:groups="",resources=secrets,verbs=get;list;watch -// Reconcile updates the DiscoveryPolicy status. -// NOTE: Query/signal/ranking execution is not yet implemented. The controller sets a -// NotImplemented condition and requeues after syncInterval until a future release adds execution. +// Reconcile executes the query/signal/ranking pipeline for a DiscoveryPolicy and updates status. func (r *DiscoveryPolicyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) { log := logf.FromContext(ctx) @@ -55,43 +61,145 @@ func (r *DiscoveryPolicyReconciler) Reconcile(ctx context.Context, req ctrl.Requ return ctrl.Result{}, err } - log.Info("reconciling DiscoveryPolicy (pipeline execution not yet implemented)", + log.Info("reconciling DiscoveryPolicy", "queries", len(dp.Spec.Queries), "signals", len(dp.Spec.Signals), ) - // 2. Update status with query/image counts and NotImplemented condition. - patch := client.MergeFrom(dp.DeepCopy()) + // 2. Execute pipeline + httpClientFunc := r.buildHTTPClientFunc(dp) + result := discovery.ExecutePipeline(ctx, dp.Spec, httpClientFunc) + // 3. Build status patch + patch := client.MergeFrom(dp.DeepCopy()) now := metav1.Now() + dp.Status.LastSyncTime = &now dp.Status.QueryCount = int32(len(dp.Spec.Queries)) - dp.Status.ImageCount = int32(len(dp.Status.DiscoveredImages)) + dp.Status.QueryResults = result.QueryResults + dp.Status.SignalResults = result.SignalResults + dp.Status.DiscoveredImages = result.Images + dp.Status.ImageCount = int32(len(result.Images)) + // Determine overall health from query results + allHealthy, failReason, failMsg := summarizeQueryResults(result.QueryResults) + + // Emit per-query metrics + for _, qr := range result.QueryResults { + healthy := float64(0) + if qr.Status == dropv1alpha1.QueryResultStatusSuccess { + healthy = 1 + } + dropmetrics.DiscoverySourceHealth.WithLabelValues(dp.Name, string(qr.Type), qr.Name).Set(healthy) + if qr.Status == dropv1alpha1.QueryResultStatusSuccess { + images := 0 + if qr.Series != nil { + images = int(*qr.Series) + } + dropmetrics.DiscoveryImagesFound.WithLabelValues(dp.Name, string(qr.Type)).Set(float64(images)) + } + } + + // 4. Set Ready condition readyCondition := metav1.Condition{ Type: conditionTypeReady, - Status: metav1.ConditionFalse, - Reason: "NotImplemented", - Message: "Query/signal/ranking pipeline execution is not yet implemented; discovered images will be populated in a future release.", ObservedGeneration: dp.Generation, LastTransitionTime: now, } + if allHealthy || len(result.Images) > 0 { + readyCondition.Status = metav1.ConditionTrue + readyCondition.Reason = "Synced" + readyCondition.Message = fmt.Sprintf("Pipeline executed successfully; %d images discovered.", len(result.Images)) + } else { + readyCondition.Status = metav1.ConditionFalse + readyCondition.Reason = failReason + readyCondition.Message = failMsg + } meta.SetStatusCondition(&dp.Status.Conditions, readyCondition) if err := r.Status().Patch(ctx, dp, patch); err != nil { return ctrl.Result{}, fmt.Errorf("patching status: %w", err) } - // 3. Requeue after sync interval. + // 5. Requeue after sync interval syncInterval := dp.Spec.SyncInterval.Duration if syncInterval == 0 { syncInterval = 30 * time.Minute } + + // Return an error to trigger rate-limited backoff when all queries failed and no images available. + if !allHealthy && len(result.Images) == 0 { + return ctrl.Result{}, fmt.Errorf("discovery sync failed: %s", failMsg) + } + return ctrl.Result{RequeueAfter: syncInterval}, nil } +// buildHTTPClientFunc returns a discovery.HTTPClientFunc that provides per-query auth/TLS clients. +func (r *DiscoveryPolicyReconciler) buildHTTPClientFunc(dp *dropv1alpha1.DiscoveryPolicy) discovery.HTTPClientFunc { + // Build a name → secretRef index for quick lookup + secretIndex := make(map[string]*corev1.LocalObjectReference, len(dp.Spec.Queries)) + for _, q := range dp.Spec.Queries { + if q.SecretRef != nil { + secretIndex[q.Name] = q.SecretRef + } + } + + return func(innerCtx context.Context, queryName string) (*http.Client, error) { + secretRef, hasSecret := secretIndex[queryName] + if !hasSecret { + return &http.Client{Timeout: 30 * time.Second}, nil + } + return r.buildHTTPClient(innerCtx, secretRef) + } +} + +// summarizeQueryResults determines overall health and a human-readable reason/message. +func summarizeQueryResults(qrs []dropv1alpha1.QueryResult) (allHealthy bool, reason, message string) { + if len(qrs) == 0 { + return true, "Synced", "No queries configured." + } + + var failures []string + for _, qr := range qrs { + if qr.Status != dropv1alpha1.QueryResultStatusSuccess { + failures = append(failures, fmt.Sprintf("%s: %s", qr.Name, qr.Message)) + } + } + + if len(failures) == 0 { + return true, "Synced", "" + } + + // Classify the first failure for the Reason field + reason = classifyReason(failures[0]) + message = strings.Join(failures, "; ") + return false, reason, message +} + +// classifyReason maps a failure message to a k8s-style reason string. +func classifyReason(msg string) string { + switch { + case strings.Contains(msg, "no such host") || strings.Contains(msg, "server misbehaving") || strings.Contains(msg, "lookup"): + return reasonDNSError + case strings.Contains(msg, "connection refused"): + return reasonConnectionRefused + case strings.Contains(msg, "timeout") || strings.Contains(msg, "deadline exceeded"): + return "Timeout" + case strings.Contains(msg, "401") || strings.Contains(msg, "Unauthorized"): + return "Unauthorized" + case strings.Contains(msg, "403") || strings.Contains(msg, "Forbidden"): + return "Forbidden" + case strings.Contains(msg, "404") || strings.Contains(msg, "NotFound"): + return "NotFound" + case strings.Contains(msg, "certificate") || strings.Contains(msg, "x509"): + return "TLSError" + default: + return "SyncFailed" + } +} + // buildHTTPClient creates an HTTP client with auth/TLS from a Secret. -// This is retained for use by future query execution (Issues 2 and 8). func (r *DiscoveryPolicyReconciler) buildHTTPClient(ctx context.Context, secretRef *corev1.LocalObjectReference) (*http.Client, error) { httpClient := &http.Client{Timeout: 30 * time.Second} @@ -161,8 +269,8 @@ func (t *authTransport) RoundTrip(req *http.Request) (*http.Response, error) { // Custom headers (headers.) for key, value := range t.secret.Data { - if strings.HasPrefix(key, "headers.") { - headerName := key[len("headers."):] + if strings.HasPrefix(key, secretHeaderPrefix) { + headerName := key[len(secretHeaderPrefix):] req.Header.Set(headerName, string(value)) } } diff --git a/internal/controller/discoverypolicy_controller_test.go b/internal/controller/discoverypolicy_controller_test.go index aca5766..095996c 100644 --- a/internal/controller/discoverypolicy_controller_test.go +++ b/internal/controller/discoverypolicy_controller_test.go @@ -65,29 +65,97 @@ var _ = Describe("DiscoveryPolicy Controller", func() { } }) - It("should successfully reconcile the resource", func() { + It("reconciles and sets a failure condition when the Prometheus endpoint is unreachable", func() { By("Reconciling the created resource") controllerReconciler := &DiscoveryPolicyReconciler{ Client: k8sClient, Scheme: k8sClient.Scheme(), } - _, err := controllerReconciler.Reconcile(ctx, reconcile.Request{ + // The reconciler will attempt to query localhost:9090 which will fail. + // It returns an error so controller-runtime applies rate-limited backoff. + _, _ = controllerReconciler.Reconcile(ctx, reconcile.Request{ NamespacedName: typeNamespacedName, }) - // The stub reconciler sets a NotImplemented condition and does not return an error. - Expect(err).NotTo(HaveOccurred()) - // Verify the NotImplemented condition is set in status. + // Verify the status reflects the query failure. updated := &dropv1alpha1.DiscoveryPolicy{} Expect(k8sClient.Get(ctx, typeNamespacedName, updated)).To(Succeed()) - var readyReason string - for _, c := range updated.Status.Conditions { - if c.Type == "Ready" { - readyReason = c.Reason + + var readyCondition *metav1.Condition + for i := range updated.Status.Conditions { + if updated.Status.Conditions[i].Type == "Ready" { + readyCondition = &updated.Status.Conditions[i] } } - Expect(readyReason).To(Equal("NotImplemented")) + Expect(readyCondition).NotTo(BeNil(), "Ready condition should be set") + Expect(readyCondition.Status).To(Equal(metav1.ConditionFalse)) + // Reason is one of ConnectionRefused / SyncFailed depending on OS + Expect(readyCondition.Reason).NotTo(BeEmpty()) + + // queryCount should reflect the spec + Expect(updated.Status.QueryCount).To(Equal(int32(1))) + }) + + It("reconciles successfully with a registry query that lists from a mock server", func() { + By("creating a DiscoveryPolicy with a registry query") + const regResourceName = "test-discovery-registry" + + // We can't spin up a real registry in unit tests, but we can verify the + // full pipeline runs without panicking and sets the correct status fields. + resource := &dropv1alpha1.DiscoveryPolicy{ + ObjectMeta: metav1.ObjectMeta{ + Name: regResourceName, + }, + Spec: dropv1alpha1.DiscoveryPolicySpec{ + Queries: []dropv1alpha1.DiscoveryQuery{ + { + Name: "reg-query", + Type: dropv1alpha1.DiscoveryQueryTypeRegistry, + Registry: &dropv1alpha1.DiscoveryRegistryQuery{ + URL: "http://nonexistent-registry:5000", + Repositories: []string{"team/app"}, + }, + }, + }, + Signals: []dropv1alpha1.DiscoverySignal{ + { + Name: "tag-score", + QueryRef: "reg-query", + Type: dropv1alpha1.SignalTypeAggregate, + Aggregate: &dropv1alpha1.AggregateSignalConfig{ + Method: dropv1alpha1.AggregationSum, + }, + }, + }, + Ranking: &dropv1alpha1.DiscoveryRanking{ + Strategy: dropv1alpha1.RankingStrategySignal, + Signal: &dropv1alpha1.SignalRankingConfig{ + SignalRef: "tag-score", + }, + }, + }, + } + Expect(k8sClient.Create(ctx, resource)).To(Succeed()) + defer func() { + _ = k8sClient.Delete(ctx, resource) + }() + + controllerReconciler := &DiscoveryPolicyReconciler{ + Client: k8sClient, + Scheme: k8sClient.Scheme(), + } + _, _ = controllerReconciler.Reconcile(ctx, reconcile.Request{ + NamespacedName: types.NamespacedName{Name: regResourceName}, + }) + + updated := &dropv1alpha1.DiscoveryPolicy{} + Expect(k8sClient.Get(ctx, types.NamespacedName{Name: regResourceName}, updated)).To(Succeed()) + + // Status should have a QueryResult entry for the registry query + Expect(updated.Status.QueryResults).To(HaveLen(1)) + Expect(updated.Status.QueryResults[0].Name).To(Equal("reg-query")) + Expect(updated.Status.QueryResults[0].Type).To(Equal(dropv1alpha1.DiscoveryQueryTypeRegistry)) }) It("uses the configured secret namespace for discovery source credentials", func() { diff --git a/internal/discovery/engine.go b/internal/discovery/engine.go new file mode 100644 index 0000000..42313be --- /dev/null +++ b/internal/discovery/engine.go @@ -0,0 +1,681 @@ +package discovery + +import ( + "context" + "fmt" + "math" + "net/http" + "regexp" + "sort" + "strconv" + "strings" + "time" + + dropv1alpha1 "github.com/corewire/drop/api/v1alpha1" +) + +const signalStatusFailed = "failed" + +// QueryRawData holds raw per-image samples from a single query execution. +// For prometheus range queries each image may have multiple samples. +// For prometheus instant and registry queries each image has exactly one sample. +type QueryRawData struct { + // Samples maps image reference → ordered list of (timestamp, value) pairs. + // Timestamp is Unix seconds; value is the numeric sample value. + Samples map[string][]TimedSample + // QueryType is the DiscoveryQueryType that produced this data. + QueryType dropv1alpha1.DiscoveryQueryType +} + +// TimedSample pairs a Unix timestamp (seconds) with a float64 value. +type TimedSample struct { + Timestamp float64 + Value float64 +} + +// PipelineResult is the output of a full pipeline execution. +type PipelineResult struct { + QueryResults []dropv1alpha1.QueryResult + SignalResults []dropv1alpha1.SignalResult + Images []dropv1alpha1.DiscoveredImage +} + +// HTTPClientFunc builds an HTTP client for a query (used by the controller to inject auth/TLS). +type HTTPClientFunc func(ctx context.Context, queryName string) (*http.Client, error) + +// scoredItem is an intermediate ranked image used during the ranking stage. +type scoredItem struct { + image string + score float64 + signals []dropv1alpha1.ImageSignalValue + ranking *dropv1alpha1.ImageRankingDetail +} + +// ExecutePipeline runs all stages of the discovery pipeline and returns a PipelineResult. +// +// queryHTTPClient is called once per query to obtain an HTTP client with appropriate +// auth/TLS configuration. Pass nil to use a plain default client for every query. +func ExecutePipeline( + ctx context.Context, + spec dropv1alpha1.DiscoveryPolicySpec, + queryHTTPClient HTTPClientFunc, +) PipelineResult { + if queryHTTPClient == nil { + queryHTTPClient = func(_ context.Context, _ string) (*http.Client, error) { + return &http.Client{Timeout: 30 * time.Second}, nil + } + } + + // ────────────────────────────────────────────────────────── + // Stage 1 — Execute queries + // ────────────────────────────────────────────────────────── + rawByQuery := make(map[string]*QueryRawData, len(spec.Queries)) + qResults := make([]dropv1alpha1.QueryResult, 0, len(spec.Queries)) + + for _, q := range spec.Queries { + httpClient, err := queryHTTPClient(ctx, q.Name) + if err != nil { + qResults = append(qResults, dropv1alpha1.QueryResult{ + Name: q.Name, + Type: q.Type, + Status: dropv1alpha1.QueryResultStatusFailed, + Message: fmt.Sprintf("building HTTP client: %v", err), + }) + continue + } + + raw, qr := executeQuery(ctx, q, httpClient) + qResults = append(qResults, qr) + if raw != nil { + rawByQuery[q.Name] = raw + } + } + + // ────────────────────────────────────────────────────────── + // Stage 2 — Derive signals + // ────────────────────────────────────────────────────────── + signalValues := make(map[string]map[string]float64, len(spec.Signals)) + sResults := make([]dropv1alpha1.SignalResult, 0, len(spec.Signals)) + + for _, sig := range spec.Signals { + raw, ok := rawByQuery[sig.QueryRef] + if !ok { + sResults = append(sResults, dropv1alpha1.SignalResult{ + Name: sig.Name, + Status: signalStatusFailed, + Message: fmt.Sprintf("query %q did not produce results (query failed or missing)", sig.QueryRef), + }) + continue + } + + values, sr := deriveSignal(sig, raw) + sResults = append(sResults, sr) + if values != nil { + signalValues[sig.Name] = values + } + } + + // ────────────────────────────────────────────────────────── + // Stage 3 — Rank images + // ────────────────────────────────────────────────────────── + allImages := collectImages(rawByQuery) + + // Apply image filter + if spec.ImageFilter != "" { + re, err := regexp.Compile(spec.ImageFilter) + if err == nil { + var filtered []string + for _, img := range allImages { + if re.MatchString(img) { + filtered = append(filtered, img) + } + } + allImages = filtered + } + } + + discovered := rankImages(spec.Ranking, signalValues, allImages) + + // Apply maxImages cap; mark selected + maxImages := int(spec.MaxImages) + if maxImages <= 0 { + maxImages = 50 + } + for i := range discovered { + discovered[i].Selected = i < maxImages + } + if len(discovered) > maxImages { + discovered = discovered[:maxImages] + } + + return PipelineResult{ + QueryResults: qResults, + SignalResults: sResults, + Images: discovered, + } +} + +// executeQuery fetches raw data for a single DiscoveryQuery. +func executeQuery(ctx context.Context, q dropv1alpha1.DiscoveryQuery, httpClient *http.Client) (*QueryRawData, dropv1alpha1.QueryResult) { + qr := dropv1alpha1.QueryResult{Name: q.Name, Type: q.Type} + + switch q.Type { + case dropv1alpha1.DiscoveryQueryTypePrometheus: + if q.Prometheus == nil { + qr.Status = dropv1alpha1.QueryResultStatusFailed + qr.Message = "prometheus config is required when type=prometheus" + return nil, qr + } + raw, err := executePrometheusQuery(ctx, q.Prometheus, httpClient) + if err != nil { + qr.Status = dropv1alpha1.QueryResultStatusFailed + qr.Message = err.Error() + return nil, qr + } + total := countSamples(raw.Samples) + series := int32(len(raw.Samples)) + qr.Series = &series + qr.Samples = &total + qr.Status = dropv1alpha1.QueryResultStatusSuccess + return raw, qr + + case dropv1alpha1.DiscoveryQueryTypeRegistry: + if q.Registry == nil { + qr.Status = dropv1alpha1.QueryResultStatusFailed + qr.Message = "registry config is required when type=registry" + return nil, qr + } + raw, err := executeRegistryQuery(ctx, q.Registry, httpClient) + if err != nil { + qr.Status = dropv1alpha1.QueryResultStatusFailed + qr.Message = err.Error() + return nil, qr + } + series := int32(len(raw.Samples)) + qr.Series = &series + qr.Status = dropv1alpha1.QueryResultStatusSuccess + return raw, qr + + case dropv1alpha1.DiscoveryQueryTypeLoki: + qr.Status = dropv1alpha1.QueryResultStatusFailed + qr.Message = "loki query execution is not yet implemented" + return nil, qr + + default: + qr.Status = dropv1alpha1.QueryResultStatusFailed + qr.Message = fmt.Sprintf("unsupported query type: %s", q.Type) + return nil, qr + } +} + +// executePrometheusQuery runs a Prometheus range or instant query and returns raw samples. +func executePrometheusQuery(ctx context.Context, cfg *dropv1alpha1.DiscoveryPrometheusQuery, httpClient *http.Client) (*QueryRawData, error) { + var lookback time.Duration + if cfg.Lookback != nil { + lookback = cfg.Lookback.Duration + } + var step time.Duration + if cfg.Step != nil { + step = cfg.Step.Duration + } + + src := NewPrometheusSource(cfg.Endpoint, cfg.Query, cfg.QueryType, lookback, nil, step, httpClient) + results, err := src.FetchRaw(ctx) + if err != nil { + return nil, err + } + + raw := &QueryRawData{ + Samples: results, + QueryType: dropv1alpha1.DiscoveryQueryTypePrometheus, + } + return raw, nil +} + +// executeRegistryQuery lists tags from an OCI registry and returns raw samples. +func executeRegistryQuery(ctx context.Context, cfg *dropv1alpha1.DiscoveryRegistryQuery, httpClient *http.Client) (*QueryRawData, error) { + src := NewRegistrySource(cfg.URL, cfg.Repositories, cfg.TagFilter, cfg.TopX, cfg.ImageTemplate, httpClient) + results, err := src.Fetch(ctx) + if err != nil { + return nil, err + } + + raw := &QueryRawData{ + Samples: make(map[string][]TimedSample, len(results)), + QueryType: dropv1alpha1.DiscoveryQueryTypeRegistry, + } + now := float64(time.Now().Unix()) + for _, r := range results { + raw.Samples[r.Image] = []TimedSample{{Timestamp: now, Value: float64(r.Score)}} + } + return raw, nil +} + +// deriveSignal computes per-image float64 values for a single signal. +func deriveSignal(sig dropv1alpha1.DiscoverySignal, raw *QueryRawData) (map[string]float64, dropv1alpha1.SignalResult) { + sr := dropv1alpha1.SignalResult{Name: sig.Name} + + switch sig.Type { + case dropv1alpha1.SignalTypeAggregate: + if sig.Aggregate == nil { + sr.Status = signalStatusFailed + sr.Message = "aggregate config is required when type=aggregate" + return nil, sr + } + values := aggregateSamples(raw.Samples, sig.Aggregate.Method, nil) + sr.Images = int32(len(values)) + sr.Status = "success" + return values, sr + + case dropv1alpha1.SignalTypeTimeWeightedAggregate: + if sig.TimeWeightedAggregate == nil { + sr.Status = signalStatusFailed + sr.Message = "timeWeightedAggregate config is required when type=timeWeightedAggregate" + return nil, sr + } + values, err := deriveTimeWeightedAggregate(raw.Samples, sig.TimeWeightedAggregate) + if err != nil { + sr.Status = signalStatusFailed + sr.Message = err.Error() + return nil, sr + } + sr.Images = int32(len(values)) + sr.Status = "success" + return values, sr + + case dropv1alpha1.SignalTypeWindowAggregate: + if sig.WindowAggregate == nil { + sr.Status = signalStatusFailed + sr.Message = "windowAggregate config is required when type=windowAggregate" + return nil, sr + } + values, err := deriveWindowAggregate(raw.Samples, sig.WindowAggregate) + if err != nil { + sr.Status = signalStatusFailed + sr.Message = err.Error() + return nil, sr + } + sr.Images = int32(len(values)) + sr.Status = "success" + return values, sr + + case dropv1alpha1.SignalTypeEventPullTime: + sr.Status = signalStatusFailed + sr.Message = "eventPullTime signal derivation is not yet implemented" + return nil, sr + + default: + sr.Status = signalStatusFailed + sr.Message = fmt.Sprintf("unsupported signal type: %s", sig.Type) + return nil, sr + } +} + +// aggregateSamples applies an AggregationMethod to per-image sample lists. +// cutoffUnix, when non-nil, excludes samples with timestamp < cutoffUnix. +func aggregateSamples(samples map[string][]TimedSample, method dropv1alpha1.AggregationMethod, cutoffUnix *float64) map[string]float64 { + out := make(map[string]float64, len(samples)) + for image, pts := range samples { + vals := make([]float64, 0, len(pts)) + for _, pt := range pts { + if cutoffUnix != nil && pt.Timestamp < *cutoffUnix { + continue + } + vals = append(vals, pt.Value) + } + if len(vals) == 0 { + continue + } + out[image] = applyMethod(vals, method) + } + return out +} + +// applyMethod applies a single AggregationMethod to a non-empty slice of values. +func applyMethod(vals []float64, method dropv1alpha1.AggregationMethod) float64 { + switch method { + case dropv1alpha1.AggregationCount: + return float64(len(vals)) + case dropv1alpha1.AggregationAvg: + var sum float64 + for _, v := range vals { + sum += v + } + return sum / float64(len(vals)) + case dropv1alpha1.AggregationMax: + m := vals[0] + for _, v := range vals[1:] { + if v > m { + m = v + } + } + return m + case dropv1alpha1.AggregationMin: + m := vals[0] + for _, v := range vals[1:] { + if v < m { + m = v + } + } + return m + default: // sum + var s float64 + for _, v := range vals { + s += v + } + return s + } +} + +// deriveTimeWeightedAggregate applies per-hour weights before aggregating. +func deriveTimeWeightedAggregate(samples map[string][]TimedSample, cfg *dropv1alpha1.TimeWeightedAggregateSignalConfig) (map[string]float64, error) { + loc, err := time.LoadLocation(cfg.Timezone) + if err != nil { + return nil, fmt.Errorf("loading timezone %q: %w", cfg.Timezone, err) + } + + defaultWeightQ := cfg.DefaultWeight.AsApproximateFloat64() + + out := make(map[string]float64, len(samples)) + for image, pts := range samples { + var weighted []float64 + for _, pt := range pts { + t := time.Unix(int64(pt.Timestamp), 0).In(loc) + hour := int32(t.Hour()) + + w := defaultWeightQ + for _, win := range cfg.Windows { + if hour >= win.StartHour && hour < win.EndHour { + w = win.Weight.AsApproximateFloat64() + break + } + } + weighted = append(weighted, pt.Value*w) + } + if len(weighted) == 0 { + continue + } + out[image] = applyMethod(weighted, cfg.Method) + } + return out, nil +} + +// deriveWindowAggregate aggregates only samples in a specific time window. +func deriveWindowAggregate(samples map[string][]TimedSample, cfg *dropv1alpha1.WindowAggregateSignalConfig) (map[string]float64, error) { + now := time.Now().UTC() + + var cutoff *float64 + var windowEnd *float64 + + if cfg.RelativeWindow != nil { + c := float64(now.Add(-cfg.RelativeWindow.Duration).Unix()) + cutoff = &c + } else if cfg.Window != nil { + if cfg.Timezone == "" { + return nil, fmt.Errorf("timezone is required when window is set") + } + loc, err := time.LoadLocation(cfg.Timezone) + if err != nil { + return nil, fmt.Errorf("loading timezone %q: %w", cfg.Timezone, err) + } + startT, err := parseTimeOfDay(cfg.Window.Start, now.In(loc)) + if err != nil { + return nil, fmt.Errorf("parsing window start: %w", err) + } + endT, err := parseTimeOfDay(cfg.Window.End, now.In(loc)) + if err != nil { + return nil, fmt.Errorf("parsing window end: %w", err) + } + c := float64(startT.Unix()) + e := float64(endT.Unix()) + cutoff = &c + windowEnd = &e + } + + out := make(map[string]float64, len(samples)) + for image, pts := range samples { + vals := make([]float64, 0, len(pts)) + for _, pt := range pts { + if cutoff != nil && pt.Timestamp < *cutoff { + continue + } + if windowEnd != nil && pt.Timestamp > *windowEnd { + continue + } + vals = append(vals, pt.Value) + } + if len(vals) == 0 { + continue + } + out[image] = applyMethod(vals, cfg.Method) + } + return out, nil +} + +// parseTimeOfDay parses a "HH:MM" time string relative to a reference day. +func parseTimeOfDay(hhmm string, ref time.Time) (time.Time, error) { + parts := strings.SplitN(hhmm, ":", 2) + if len(parts) != 2 { + return time.Time{}, fmt.Errorf("invalid time format %q (want HH:MM)", hhmm) + } + h, errH := strconv.Atoi(parts[0]) + m, errM := strconv.Atoi(parts[1]) + if errH != nil || errM != nil { + return time.Time{}, fmt.Errorf("invalid time format %q (want HH:MM)", hhmm) + } + return time.Date(ref.Year(), ref.Month(), ref.Day(), h, m, 0, 0, ref.Location()), nil +} + +// rankImages converts per-signal values into an ordered DiscoveredImage slice. +func rankImages(ranking *dropv1alpha1.DiscoveryRanking, signals map[string]map[string]float64, images []string) []dropv1alpha1.DiscoveredImage { + if ranking == nil || len(images) == 0 { + // No ranking configured: return images in alphabetical order with score 0. + out := make([]dropv1alpha1.DiscoveredImage, len(images)) + for i, img := range images { + out[i] = dropv1alpha1.DiscoveredImage{Image: img, Rank: int32(i + 1), FinalScore: "0"} + } + return out + } + + var items []scoredItem + + switch ranking.Strategy { + case dropv1alpha1.RankingStrategySignal: + ref := "" + if ranking.Signal != nil { + ref = ranking.Signal.SignalRef + } + sigMap := signals[ref] + for _, img := range images { + v := sigMap[img] + items = append(items, scoredItem{ + image: img, + score: v, + signals: []dropv1alpha1.ImageSignalValue{{ + Name: ref, + RawValue: strconv.FormatFloat(v, 'f', -1, 64), + }}, + ranking: &dropv1alpha1.ImageRankingDetail{Strategy: string(ranking.Strategy)}, + }) + } + + case dropv1alpha1.RankingStrategyWeightedSum: + if ranking.WeightedSum != nil { + items = weightedSumRank(ranking.WeightedSum, signals, images) + } + + case dropv1alpha1.RankingStrategyModelExposure: + if ranking.ModelExposure != nil { + items = modelExposureRank(ranking.ModelExposure, signals, images) + } + + default: + // Unknown strategy: score 0 + for _, img := range images { + items = append(items, scoredItem{image: img}) + } + } + + // Sort descending by score, then alphabetically for stability + sort.Slice(items, func(i, j int) bool { + if items[i].score != items[j].score { + return items[i].score > items[j].score + } + return items[i].image < items[j].image + }) + + out := make([]dropv1alpha1.DiscoveredImage, len(items)) + for i, it := range items { + out[i] = dropv1alpha1.DiscoveredImage{ + Image: it.image, + Rank: int32(i + 1), + FinalScore: strconv.FormatFloat(it.score, 'f', -1, 64), + Signals: it.signals, + Ranking: it.ranking, + } + } + return out +} + +// weightedSumRank computes Score = Σ weight_k * normalize(signal_k(image)). +func weightedSumRank(cfg *dropv1alpha1.WeightedSumRankingConfig, signals map[string]map[string]float64, images []string) []scoredItem { + // Compute min/max per signal for minMax normalization + type minMax struct{ min, max float64 } + bounds := make(map[string]minMax, len(cfg.Terms)) + for _, term := range cfg.Terms { + sigMap := signals[term.SignalRef] + var mn, mx float64 + first := true + for _, img := range images { + v, ok := sigMap[img] + if !ok { + continue + } + if first || v < mn { + mn = v + } + if first || v > mx { + mx = v + } + first = false + } + bounds[term.SignalRef] = minMax{min: mn, max: mx} + } + + normalize := func(v float64, b minMax) float64 { + if b.max == b.min { + return 1.0 + } + return (v - b.min) / (b.max - b.min) + } + + var out []scoredItem + for _, img := range images { + var totalScore float64 + sigVals := make([]dropv1alpha1.ImageSignalValue, 0, len(cfg.Terms)) + terms := make([]dropv1alpha1.RankingTerm, 0, len(cfg.Terms)) + + drop := false + for _, term := range cfg.Terms { + sigMap := signals[term.SignalRef] + v, ok := sigMap[img] + if !ok { + if cfg.MissingSignal == dropv1alpha1.MissingSignalBehaviorDrop { + drop = true + break + } + v = 0 + } + b := bounds[term.SignalRef] + norm := normalize(v, b) + wf := term.Weight.AsApproximateFloat64() + contribution := wf * norm + totalScore += contribution + + sigVals = append(sigVals, dropv1alpha1.ImageSignalValue{ + Name: term.SignalRef, + RawValue: strconv.FormatFloat(v, 'f', -1, 64), + NormalizedValue: strconv.FormatFloat(norm, 'f', -1, 64), + }) + terms = append(terms, dropv1alpha1.RankingTerm{ + Signal: term.SignalRef, + Weight: term.Weight.String(), + Contribution: strconv.FormatFloat(contribution, 'f', -1, 64), + }) + } + if drop { + continue + } + out = append(out, scoredItem{ + image: img, + score: totalScore, + signals: sigVals, + ranking: &dropv1alpha1.ImageRankingDetail{ + Strategy: string(dropv1alpha1.RankingStrategyWeightedSum), + Terms: terms, + }, + }) + } + return out +} + +// modelExposureRank computes Score = J_target * (1 - 1/N)^J_pre * p_hat. +func modelExposureRank(cfg *dropv1alpha1.ModelExposureRankingConfig, signals map[string]map[string]float64, images []string) []scoredItem { + n := float64(cfg.NodeCount) + if n < 1 { + n = 1 + } + oneMinusInvN := 1.0 - 1.0/n + + preMap := signals[cfg.PreWindowUsageSignalRef] + targetMap := signals[cfg.TargetWindowUsageSignalRef] + pullMap := signals[cfg.PullTimeSignalRef] + + out := make([]scoredItem, 0, len(images)) + for _, img := range images { + jPre := preMap[img] + jTarget := targetMap[img] + pHat := pullMap[img] + + score := jTarget * math.Pow(oneMinusInvN, jPre) * pHat + + out = append(out, scoredItem{ + image: img, + score: score, + signals: []dropv1alpha1.ImageSignalValue{ + {Name: cfg.PreWindowUsageSignalRef, RawValue: strconv.FormatFloat(jPre, 'f', -1, 64)}, + {Name: cfg.TargetWindowUsageSignalRef, RawValue: strconv.FormatFloat(jTarget, 'f', -1, 64)}, + {Name: cfg.PullTimeSignalRef, RawValue: strconv.FormatFloat(pHat, 'f', -1, 64)}, + }, + ranking: &dropv1alpha1.ImageRankingDetail{ + Strategy: string(dropv1alpha1.RankingStrategyModelExposure), + }, + }) + } + return out +} + +// collectImages returns a sorted, deduplicated list of all image references across all query results. +func collectImages(rawByQuery map[string]*QueryRawData) []string { + seen := make(map[string]struct{}) + for _, raw := range rawByQuery { + for img := range raw.Samples { + seen[img] = struct{}{} + } + } + images := make([]string, 0, len(seen)) + for img := range seen { + images = append(images, img) + } + sort.Strings(images) + return images +} + +// countSamples returns the total number of samples across all images. +func countSamples(samples map[string][]TimedSample) int64 { + var total int64 + for _, pts := range samples { + total += int64(len(pts)) + } + return total +} diff --git a/internal/discovery/engine_test.go b/internal/discovery/engine_test.go new file mode 100644 index 0000000..7d2c52b --- /dev/null +++ b/internal/discovery/engine_test.go @@ -0,0 +1,354 @@ +package discovery + +import ( + "context" + "encoding/json" + "net/http" + "net/http/httptest" + "testing" + "time" + + "k8s.io/apimachinery/pkg/api/resource" + metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" + + dropv1alpha1 "github.com/corewire/drop/api/v1alpha1" +) + +// TestExecutePipeline_PrometheusInstant verifies the full pipeline with a Prometheus instant query. +func TestExecutePipeline_PrometheusInstant(t *testing.T) { + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + resp := prometheusResponse{ + Status: prometheusStatusSuccess, + Data: struct { + ResultType string `json:"resultType"` + Result []prometheusResult `json:"result"` + }{ + ResultType: "vector", + Result: []prometheusResult{ + {Metric: map[string]string{"image": "nginx:1.25"}, Value: []interface{}{float64(1000), "30"}}, + {Metric: map[string]string{"image": "redis:7.0"}, Value: []interface{}{float64(1000), "10"}}, + {Metric: map[string]string{"image": "alpine:3.19"}, Value: []interface{}{float64(1000), "20"}}, + }, + }, + } + w.WriteHeader(http.StatusOK) + _ = json.NewEncoder(w).Encode(resp) + })) + defer srv.Close() + + spec := dropv1alpha1.DiscoveryPolicySpec{ + Queries: []dropv1alpha1.DiscoveryQuery{ + { + Name: "usage", + Type: dropv1alpha1.DiscoveryQueryTypePrometheus, + Prometheus: &dropv1alpha1.DiscoveryPrometheusQuery{Endpoint: srv.URL, Query: "test", QueryType: dropv1alpha1.QueryTypeInstant}, + }, + }, + Signals: []dropv1alpha1.DiscoverySignal{ + {Name: "score", QueryRef: "usage", Type: dropv1alpha1.SignalTypeAggregate, Aggregate: &dropv1alpha1.AggregateSignalConfig{Method: dropv1alpha1.AggregationSum}}, + }, + Ranking: &dropv1alpha1.DiscoveryRanking{Strategy: dropv1alpha1.RankingStrategySignal, Signal: &dropv1alpha1.SignalRankingConfig{SignalRef: "score"}}, + MaxImages: 10, + } + + clientFn := func(_ context.Context, _ string) (*http.Client, error) { return srv.Client(), nil } + result := ExecutePipeline(context.Background(), spec, clientFn) + + if len(result.QueryResults) != 1 { + t.Fatalf("expected 1 query result, got %d", len(result.QueryResults)) + } + if result.QueryResults[0].Status != dropv1alpha1.QueryResultStatusSuccess { + t.Fatalf("expected success, got %s: %s", result.QueryResults[0].Status, result.QueryResults[0].Message) + } + if len(result.Images) != 3 { + t.Fatalf("expected 3 images, got %d", len(result.Images)) + } + // Ranked by score desc: nginx(30) > alpine(20) > redis(10) + if result.Images[0].Image != "nginx:1.25" { + t.Errorf("expected nginx:1.25 first, got %s", result.Images[0].Image) + } + if result.Images[0].Rank != 1 { + t.Errorf("expected rank 1, got %d", result.Images[0].Rank) + } + if !result.Images[0].Selected { + t.Error("top image should be selected") + } +} + +// TestExecutePipeline_Registry verifies the full pipeline with a registry query. +func TestExecutePipeline_Registry(t *testing.T) { + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + resp := tagListResponse{ + Name: "team/app", + Tags: []string{"v1.0", "v1.1", "v1.2"}, + } + w.WriteHeader(http.StatusOK) + _ = json.NewEncoder(w).Encode(resp) + })) + defer srv.Close() + + spec := dropv1alpha1.DiscoveryPolicySpec{ + Queries: []dropv1alpha1.DiscoveryQuery{ + { + Name: "tags", + Type: dropv1alpha1.DiscoveryQueryTypeRegistry, + Registry: &dropv1alpha1.DiscoveryRegistryQuery{ + URL: srv.URL, + Repositories: []string{"team/app"}, + }, + }, + }, + Signals: []dropv1alpha1.DiscoverySignal{ + {Name: "tag-score", QueryRef: "tags", Type: dropv1alpha1.SignalTypeAggregate, Aggregate: &dropv1alpha1.AggregateSignalConfig{Method: dropv1alpha1.AggregationSum}}, + }, + Ranking: &dropv1alpha1.DiscoveryRanking{Strategy: dropv1alpha1.RankingStrategySignal, Signal: &dropv1alpha1.SignalRankingConfig{SignalRef: "tag-score"}}, + MaxImages: 10, + } + + clientFn := func(_ context.Context, _ string) (*http.Client, error) { return srv.Client(), nil } + result := ExecutePipeline(context.Background(), spec, clientFn) + + if len(result.QueryResults) != 1 { + t.Fatalf("expected 1 query result, got %d", len(result.QueryResults)) + } + if result.QueryResults[0].Status != dropv1alpha1.QueryResultStatusSuccess { + t.Fatalf("expected success, got %s: %s", result.QueryResults[0].Status, result.QueryResults[0].Message) + } + if len(result.Images) != 3 { + t.Fatalf("expected 3 images, got %d: %v", len(result.Images), result.Images) + } + // v1.2 has the highest score (index 3), then v1.1 (2), then v1.0 (1) + registryHost := srv.URL[len("http://"):] + expectedFirst := registryHost + "/team/app:v1.2" + if result.Images[0].Image != expectedFirst { + t.Errorf("expected %s first, got %s", expectedFirst, result.Images[0].Image) + } +} + +// TestExecutePipeline_WeightedSum verifies weighted sum ranking. +func TestExecutePipeline_WeightedSum(t *testing.T) { + // Two queries with different image sets + srv1 := httptest.NewServer(prometheusInstantHandler(map[string]string{ + "nginx:1.25": "100", + "redis:7.0": "10", + })) + defer srv1.Close() + + srv2 := httptest.NewServer(prometheusInstantHandler(map[string]string{ + "nginx:1.25": "5", + "redis:7.0": "50", + })) + defer srv2.Close() + + weight700m := resource.MustParse("700m") + weight300m := resource.MustParse("300m") + + spec := dropv1alpha1.DiscoveryPolicySpec{ + Queries: []dropv1alpha1.DiscoveryQuery{ + {Name: "q1", Type: dropv1alpha1.DiscoveryQueryTypePrometheus, Prometheus: &dropv1alpha1.DiscoveryPrometheusQuery{Endpoint: srv1.URL, Query: "test", QueryType: dropv1alpha1.QueryTypeInstant}}, + {Name: "q2", Type: dropv1alpha1.DiscoveryQueryTypePrometheus, Prometheus: &dropv1alpha1.DiscoveryPrometheusQuery{Endpoint: srv2.URL, Query: "test", QueryType: dropv1alpha1.QueryTypeInstant}}, + }, + Signals: []dropv1alpha1.DiscoverySignal{ + {Name: "sig1", QueryRef: "q1", Type: dropv1alpha1.SignalTypeAggregate, Aggregate: &dropv1alpha1.AggregateSignalConfig{Method: dropv1alpha1.AggregationSum}}, + {Name: "sig2", QueryRef: "q2", Type: dropv1alpha1.SignalTypeAggregate, Aggregate: &dropv1alpha1.AggregateSignalConfig{Method: dropv1alpha1.AggregationSum}}, + }, + Ranking: &dropv1alpha1.DiscoveryRanking{ + Strategy: dropv1alpha1.RankingStrategyWeightedSum, + WeightedSum: &dropv1alpha1.WeightedSumRankingConfig{ + Normalize: dropv1alpha1.NormalizeMethodMinMax, + MissingSignal: dropv1alpha1.MissingSignalBehaviorZero, + Terms: []dropv1alpha1.WeightedSumTerm{ + {SignalRef: "sig1", Weight: weight700m}, + {SignalRef: "sig2", Weight: weight300m}, + }, + }, + }, + MaxImages: 10, + } + + srvMap := map[string]*http.Client{"q1": srv1.Client(), "q2": srv2.Client()} + clientFn := func(_ context.Context, queryName string) (*http.Client, error) { + return srvMap[queryName], nil + } + result := ExecutePipeline(context.Background(), spec, clientFn) + + if len(result.Images) != 2 { + t.Fatalf("expected 2 images, got %d", len(result.Images)) + } + // nginx: sig1=100 (norm=1), sig2=5 (norm=0) → 0.7*1 + 0.3*0 = 0.7 + // redis: sig1=10 (norm=0), sig2=50 (norm=1) → 0.7*0 + 0.3*1 = 0.3 + // nginx should rank first + if result.Images[0].Image != "nginx:1.25" { + t.Errorf("expected nginx:1.25 first (weightedSum), got %s", result.Images[0].Image) + } +} + +// TestExecutePipeline_MaxImages verifies the maxImages cap is applied. +func TestExecutePipeline_MaxImages(t *testing.T) { + srv := httptest.NewServer(prometheusInstantHandler(map[string]string{ + "img1:v1": "10", + "img2:v2": "20", + "img3:v3": "30", + "img4:v4": "40", + "img5:v5": "50", + })) + defer srv.Close() + + spec := dropv1alpha1.DiscoveryPolicySpec{ + Queries: []dropv1alpha1.DiscoveryQuery{ + {Name: "q", Type: dropv1alpha1.DiscoveryQueryTypePrometheus, Prometheus: &dropv1alpha1.DiscoveryPrometheusQuery{Endpoint: srv.URL, Query: "test", QueryType: dropv1alpha1.QueryTypeInstant}}, + }, + Signals: []dropv1alpha1.DiscoverySignal{ + {Name: "s", QueryRef: "q", Type: dropv1alpha1.SignalTypeAggregate, Aggregate: &dropv1alpha1.AggregateSignalConfig{Method: dropv1alpha1.AggregationSum}}, + }, + Ranking: &dropv1alpha1.DiscoveryRanking{Strategy: dropv1alpha1.RankingStrategySignal, Signal: &dropv1alpha1.SignalRankingConfig{SignalRef: "s"}}, + MaxImages: 3, + } + + clientFn := func(_ context.Context, _ string) (*http.Client, error) { return srv.Client(), nil } + result := ExecutePipeline(context.Background(), spec, clientFn) + + if len(result.Images) != 3 { + t.Fatalf("expected 3 images (maxImages cap), got %d", len(result.Images)) + } + for _, img := range result.Images { + if !img.Selected { + t.Errorf("image %s should be selected (within cap)", img.Image) + } + } +} + +// TestExecutePipeline_QueryFailure verifies failed query results are reported correctly. +func TestExecutePipeline_QueryFailure(t *testing.T) { + spec := dropv1alpha1.DiscoveryPolicySpec{ + Queries: []dropv1alpha1.DiscoveryQuery{ + {Name: "bad-query", Type: dropv1alpha1.DiscoveryQueryTypePrometheus, Prometheus: &dropv1alpha1.DiscoveryPrometheusQuery{Endpoint: "http://127.0.0.1:19999", Query: "test"}}, + }, + Signals: []dropv1alpha1.DiscoverySignal{ + {Name: "s", QueryRef: "bad-query", Type: dropv1alpha1.SignalTypeAggregate, Aggregate: &dropv1alpha1.AggregateSignalConfig{Method: dropv1alpha1.AggregationSum}}, + }, + Ranking: &dropv1alpha1.DiscoveryRanking{Strategy: dropv1alpha1.RankingStrategySignal, Signal: &dropv1alpha1.SignalRankingConfig{SignalRef: "s"}}, + MaxImages: 10, + } + + result := ExecutePipeline(context.Background(), spec, nil) + + if len(result.QueryResults) != 1 { + t.Fatalf("expected 1 query result, got %d", len(result.QueryResults)) + } + if result.QueryResults[0].Status != dropv1alpha1.QueryResultStatusFailed { + t.Errorf("expected failed query result, got %s", result.QueryResults[0].Status) + } + if len(result.SignalResults) != 1 || result.SignalResults[0].Status != signalStatusFailed { + t.Errorf("expected failed signal result when query fails") + } + if len(result.Images) != 0 { + t.Errorf("expected no images when query fails, got %d", len(result.Images)) + } +} + +// TestExecutePipeline_WindowAggregate verifies the windowAggregate signal type (relative window). +func TestExecutePipeline_WindowAggregate(t *testing.T) { + now := float64(time.Now().Unix()) + oneHourAgo := now - 3600 + threeHoursAgo := now - 10800 + + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + resp := prometheusResponse{ + Status: prometheusStatusSuccess, + Data: struct { + ResultType string `json:"resultType"` + Result []prometheusResult `json:"result"` + }{ + ResultType: "matrix", + Result: []prometheusResult{ + { + Metric: map[string]string{"image": "nginx:1.25"}, + Values: [][]interface{}{ + {threeHoursAgo, "5"}, // outside 2h window + {oneHourAgo, "10"}, // inside 2h window + {now - 600, "15"}, // inside 2h window + }, + }, + }, + }, + } + w.WriteHeader(http.StatusOK) + _ = json.NewEncoder(w).Encode(resp) + })) + defer srv.Close() + + window := metav1.Duration{Duration: 2 * time.Hour} + spec := dropv1alpha1.DiscoveryPolicySpec{ + Queries: []dropv1alpha1.DiscoveryQuery{ + {Name: "q", Type: dropv1alpha1.DiscoveryQueryTypePrometheus, Prometheus: &dropv1alpha1.DiscoveryPrometheusQuery{Endpoint: srv.URL, Query: "test", QueryType: dropv1alpha1.QueryTypeRange, Lookback: &metav1.Duration{Duration: 4 * time.Hour}}}, + }, + Signals: []dropv1alpha1.DiscoverySignal{ + { + Name: "recent", + QueryRef: "q", + Type: dropv1alpha1.SignalTypeWindowAggregate, + WindowAggregate: &dropv1alpha1.WindowAggregateSignalConfig{ + Method: dropv1alpha1.AggregationSum, + RelativeWindow: &window, + }, + }, + }, + Ranking: &dropv1alpha1.DiscoveryRanking{Strategy: dropv1alpha1.RankingStrategySignal, Signal: &dropv1alpha1.SignalRankingConfig{SignalRef: "recent"}}, + MaxImages: 10, + } + + clientFn := func(_ context.Context, _ string) (*http.Client, error) { return srv.Client(), nil } + result := ExecutePipeline(context.Background(), spec, clientFn) + + if len(result.Images) != 1 { + t.Fatalf("expected 1 image, got %d", len(result.Images)) + } + // Only the two samples within the 2h window (10 + 15 = 25) should be summed + if result.Images[0].FinalScore != "25" { + t.Errorf("expected score 25 (window sum), got %s", result.Images[0].FinalScore) + } +} + +// TestApplyMethod covers all aggregation methods. +func TestApplyMethod(t *testing.T) { + vals := []float64{10, 20, 30, 5} + tests := []struct { + method dropv1alpha1.AggregationMethod + want float64 + }{ + {dropv1alpha1.AggregationSum, 65}, + {dropv1alpha1.AggregationCount, 4}, + {dropv1alpha1.AggregationAvg, 16.25}, + {dropv1alpha1.AggregationMax, 30}, + {dropv1alpha1.AggregationMin, 5}, + } + for _, tt := range tests { + got := applyMethod(vals, tt.method) + if got != tt.want { + t.Errorf("applyMethod(%s) = %v, want %v", tt.method, got, tt.want) + } + } +} + +// prometheusInstantHandler returns an HTTP handler that serves a fixed instant vector. +func prometheusInstantHandler(imageValues map[string]string) http.Handler { + return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + results := make([]prometheusResult, 0, len(imageValues)) + for img, val := range imageValues { + results = append(results, prometheusResult{ + Metric: map[string]string{"image": img}, + Value: []interface{}{float64(1000), val}, + }) + } + resp := prometheusResponse{ + Status: prometheusStatusSuccess, + Data: struct { + ResultType string `json:"resultType"` + Result []prometheusResult `json:"result"` + }{ResultType: "vector", Result: results}, + } + w.WriteHeader(http.StatusOK) + _ = json.NewEncoder(w).Encode(resp) + }) +} diff --git a/internal/discovery/prometheus.go b/internal/discovery/prometheus.go index 94423f8..7863412 100644 --- a/internal/discovery/prometheus.go +++ b/internal/discovery/prometheus.go @@ -8,6 +8,7 @@ import ( "net/http" "net/url" "sort" + "strconv" "time" dropv1alpha1 "github.com/corewire/drop/api/v1alpha1" @@ -219,3 +220,117 @@ func aggregateRangeValues(values [][]interface{}, method *dropv1alpha1.Aggregati return int64(total) } } + +// FetchRaw queries Prometheus and returns raw timed samples per image, preserving timestamps. +// This is used by the pipeline engine so that signal derivation can apply per-timestamp logic +// (timeWeightedAggregate, windowAggregate) without discarding timestamp information. +func (p *PrometheusSource) FetchRaw(ctx context.Context) (map[string][]TimedSample, error) { + u, err := url.Parse(p.Endpoint) + if err != nil { + return nil, fmt.Errorf("parsing endpoint: %w", err) + } + + q := u.Query() + q.Set("query", p.Query) + + if p.QueryType == dropv1alpha1.QueryTypeRange { + u.Path = "/api/v1/query_range" + now := time.Now().UTC() + lookback := p.Lookback + if lookback == 0 { + lookback = 24 * time.Hour + } + step := p.Step + if step == 0 { + step = 5 * time.Minute + } + q.Set("start", now.Add(-lookback).Format(time.RFC3339)) + q.Set("end", now.Format(time.RFC3339)) + q.Set("step", fmt.Sprintf("%ds", int(step.Seconds()))) + } else { + u.Path = "/api/v1/query" + } + u.RawQuery = q.Encode() + + req, err := http.NewRequestWithContext(ctx, http.MethodGet, u.String(), nil) + if err != nil { + return nil, fmt.Errorf("creating request: %w", err) + } + + resp, err := p.HTTPClient.Do(req) + if err != nil { + return nil, fmt.Errorf("querying prometheus: %w", err) + } + defer func() { _ = resp.Body.Close() }() + + if resp.StatusCode != http.StatusOK { + body, _ := io.ReadAll(resp.Body) + return nil, fmt.Errorf("prometheus returned status %d: %s", resp.StatusCode, string(body)) + } + + var promResp prometheusResponse + if err := json.NewDecoder(resp.Body).Decode(&promResp); err != nil { + return nil, fmt.Errorf("decoding response: %w", err) + } + + if promResp.Status != prometheusStatusSuccess { + return nil, fmt.Errorf("prometheus query failed with status: %s", promResp.Status) + } + + out := make(map[string][]TimedSample, len(promResp.Data.Result)) + for _, r := range promResp.Data.Result { + image, ok := r.Metric["image"] + if !ok || image == "" { + continue + } + + if p.QueryType == dropv1alpha1.QueryTypeRange { + samples := make([]TimedSample, 0, len(r.Values)) + for _, pair := range r.Values { + if len(pair) < 2 { + continue + } + var ts float64 + switch v := pair[0].(type) { + case float64: + ts = v + default: + continue + } + strVal, ok := pair[1].(string) + if !ok { + continue + } + val, err := strconv.ParseFloat(strVal, 64) + if err != nil { + continue + } + samples = append(samples, TimedSample{Timestamp: ts, Value: val}) + } + out[image] = samples + } else { + // Instant query + if len(r.Value) < 2 { + continue + } + var ts float64 + switch v := r.Value[0].(type) { + case float64: + ts = v + default: + ts = float64(time.Now().Unix()) + } + strVal, ok := r.Value[1].(string) + if !ok { + continue + } + val, err := strconv.ParseFloat(strVal, 64) + if err != nil { + continue + } + out[image] = []TimedSample{{Timestamp: ts, Value: val}} + } + } + + return out, nil +} diff --git a/internal/discovery/registry.go b/internal/discovery/registry.go new file mode 100644 index 0000000..bc303b9 --- /dev/null +++ b/internal/discovery/registry.go @@ -0,0 +1,162 @@ +package discovery + +import ( + "context" + "encoding/json" + "fmt" + "io" + "net/http" + "regexp" + "sort" + "strings" + "text/template" + "time" +) + +// RegistrySource queries OCI registries for image tags. +type RegistrySource struct { + URL string + Repositories []string + TagFilter string + TopX int32 + ImageTemplate string + HTTPClient *http.Client +} + +// NewRegistrySource creates a new registry discovery source. +func NewRegistrySource(url string, repos []string, tagFilter string, topX int32, imageTemplate string, httpClient *http.Client) *RegistrySource { + if httpClient == nil { + httpClient = &http.Client{Timeout: 30 * time.Second} + } + return &RegistrySource{ + URL: strings.TrimSuffix(url, "/"), + Repositories: repos, + TagFilter: tagFilter, + TopX: topX, + ImageTemplate: imageTemplate, + HTTPClient: httpClient, + } +} + +// tagListResponse represents the OCI Distribution API tag list response. +type tagListResponse struct { + Name string `json:"name"` + Tags []string `json:"tags"` +} + +// Fetch queries the registry for tags and returns discovered images. +func (rs *RegistrySource) Fetch(ctx context.Context) ([]ImageResult, error) { + var allResults []ImageResult + + for _, repo := range rs.Repositories { + results, err := rs.fetchRepo(ctx, repo) + if err != nil { + return nil, fmt.Errorf("fetching tags for %s: %w", repo, err) + } + allResults = append(allResults, results...) + } + + // Sort by score descending (higher index = more recent) + sort.Slice(allResults, func(i, j int) bool { + return allResults[i].Score > allResults[j].Score + }) + + return allResults, nil +} + +func (rs *RegistrySource) fetchRepo(ctx context.Context, repo string) ([]ImageResult, error) { + u := fmt.Sprintf("%s/v2/%s/tags/list", rs.URL, repo) + + req, err := http.NewRequestWithContext(ctx, http.MethodGet, u, nil) + if err != nil { + return nil, fmt.Errorf("creating request: %w", err) + } + + resp, err := rs.HTTPClient.Do(req) + if err != nil { + return nil, fmt.Errorf("listing tags: %w", err) + } + defer func() { _ = resp.Body.Close() }() + + if resp.StatusCode != http.StatusOK { + body, _ := io.ReadAll(resp.Body) + return nil, fmt.Errorf("registry returned status %d: %s", resp.StatusCode, string(body)) + } + + var tagList tagListResponse + if err := json.NewDecoder(resp.Body).Decode(&tagList); err != nil { + return nil, fmt.Errorf("decoding response: %w", err) + } + + // Filter tags + tags := tagList.Tags + if rs.TagFilter != "" { + re, err := regexp.Compile(rs.TagFilter) + if err != nil { + return nil, fmt.Errorf("compiling tag filter: %w", err) + } + var filtered []string + for _, tag := range tags { + if re.MatchString(tag) { + filtered = append(filtered, tag) + } + } + tags = filtered + } + + // Limit to topX by keeping the last N tags in the slice returned by the registry. + // The OCI Distribution Spec does not define tag ordering, so this is best-effort: + // many registries return tags in push order (oldest first, newest last), which + // means we naturally keep the most recently pushed tags. + if rs.TopX > 0 && int32(len(tags)) > rs.TopX { + tags = tags[len(tags)-int(rs.TopX):] + } + + // Build image refs + results := make([]ImageResult, 0, len(tags)) + for i, tag := range tags { + imageRef, err := rs.buildImageRef(repo, tag) + if err != nil { + return nil, fmt.Errorf("building image ref for tag %s: %w", tag, err) + } + results = append(results, ImageResult{ + Image: imageRef, + Score: int64(i + 1), // Higher index = more recent + }) + } + + return results, nil +} + +// templateData provides variables for the image template. +type templateData struct { + Registry string + Repository string + Tag string +} + +func (rs *RegistrySource) buildImageRef(repo, tag string) (string, error) { + if rs.ImageTemplate != "" { + tmpl, err := template.New("image").Parse(rs.ImageTemplate) + if err != nil { + return "", fmt.Errorf("parsing image template: %w", err) + } + + data := templateData{ + Registry: rs.URL, + Repository: repo, + Tag: tag, + } + + var buf strings.Builder + if err := tmpl.Execute(&buf, data); err != nil { + return "", fmt.Errorf("executing image template: %w", err) + } + return buf.String(), nil + } + + // Default: registry/repo:tag + registry := strings.TrimPrefix(rs.URL, "https://") + registry = strings.TrimPrefix(registry, "http://") + return fmt.Sprintf("%s/%s:%s", registry, repo, tag), nil +} diff --git a/internal/discovery/registry_test.go b/internal/discovery/registry_test.go new file mode 100644 index 0000000..f3b9dc6 --- /dev/null +++ b/internal/discovery/registry_test.go @@ -0,0 +1,93 @@ +package discovery + +import ( + "context" + "encoding/json" + "net/http" + "net/http/httptest" + "testing" +) + +func TestRegistrySource_Fetch(t *testing.T) { + tests := []struct { + name string + repos []string + tagFilter string + topX int32 + imageTemplate string + tags []string + wantCount int + wantFirst string + wantErr bool + }{ + { + name: "basic tag listing", + repos: []string{"library/nginx"}, + tags: []string{"1.24", "1.25", "1.26"}, + wantCount: 3, + }, + { + name: "tag filter", + repos: []string{"library/nginx"}, + tagFilter: `^1\.2[56]$`, + tags: []string{"1.24", "1.25", "1.26"}, + wantCount: 2, + }, + { + name: "topX limit", + repos: []string{"library/nginx"}, + topX: 2, + tags: []string{"1.24", "1.25", "1.26"}, + wantCount: 2, + }, + { + name: "image template", + repos: []string{"gitlab-org/gitlab-runner/gitlab-runner-helper"}, + imageTemplate: "registry.gitlab.com/{{.Repository}}:x86_64-{{.Tag}}", + tags: []string{"v16.0", "v16.1"}, + wantCount: 2, + wantFirst: "registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-v16.1", + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + resp := tagListResponse{ + Name: tt.repos[0], + Tags: tt.tags, + } + w.WriteHeader(http.StatusOK) + if err := json.NewEncoder(w).Encode(resp); err != nil { + t.Fatal(err) + } + })) + defer server.Close() + + source := NewRegistrySource(server.URL, tt.repos, tt.tagFilter, tt.topX, tt.imageTemplate, server.Client()) + results, err := source.Fetch(context.Background()) + + if tt.wantErr { + if err == nil { + t.Fatal("expected error, got nil") + } + return + } + + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + + if len(results) != tt.wantCount { + t.Errorf("got %d results, want %d", len(results), tt.wantCount) + } + + if tt.wantFirst != "" && len(results) > 0 { + // Results sorted by score descending, highest score = last tag + if results[0].Image != tt.wantFirst { + t.Errorf("first image = %q, want %q", results[0].Image, tt.wantFirst) + } + } + }) + } +} diff --git a/test/e2e/cachedimageset-discovery/03-assert-discovery-ready.yaml b/test/e2e/cachedimageset-discovery/03-assert-discovery-ready.yaml index 855829b..b7215b4 100644 --- a/test/e2e/cachedimageset-discovery/03-assert-discovery-ready.yaml +++ b/test/e2e/cachedimageset-discovery/03-assert-discovery-ready.yaml @@ -1,9 +1,9 @@ -# Assert DiscoveryPolicy is reconciled with NotImplemented condition (pipeline not yet implemented) +# Assert DiscoveryPolicy is reconciled: pipeline executed (queries may fail for +# the mock Prometheus endpoint) but status fields are always set after reconciliation. apiVersion: drop.corewire.io/v1alpha1 kind: DiscoveryPolicy metadata: name: test-registry-discovery status: - (conditions[?type == 'Ready']): - - status: "False" - reason: NotImplemented + (conditions[?type == 'Ready'] | length(@) > `0`): true + (queryCount == `1`): true diff --git a/test/e2e/cachedimageset-discovery/chainsaw-test.yaml b/test/e2e/cachedimageset-discovery/chainsaw-test.yaml index 20f4bec..c7f2c6e 100644 --- a/test/e2e/cachedimageset-discovery/chainsaw-test.yaml +++ b/test/e2e/cachedimageset-discovery/chainsaw-test.yaml @@ -5,10 +5,8 @@ metadata: name: cachedimageset-discovery spec: description: | - Verify that a CachedImageSet with discoveryPolicyRef accepts the new pipeline - schema and the DiscoveryPolicy is reconciled with the expected condition. - NOTE: Full CachedImage creation from discovered images will be re-enabled in - Issue 4 once signal ranking and status output are implemented. + Verify that a CachedImageSet with discoveryPolicyRef correctly reads discovered + images from a DiscoveryPolicy that has executed the query/signal/ranking pipeline. steps: - name: Create PullPolicy try: diff --git a/test/e2e/discovery-failure/chainsaw-test.yaml b/test/e2e/discovery-failure/chainsaw-test.yaml index 31ea3fa..5fc855d 100644 --- a/test/e2e/discovery-failure/chainsaw-test.yaml +++ b/test/e2e/discovery-failure/chainsaw-test.yaml @@ -5,19 +5,18 @@ metadata: name: discovery-failure spec: description: | - Verify that DiscoveryPolicy with the new query/signal/ranking schema is accepted - and the controller sets the NotImplemented condition. - DNS error and query failure testing will be re-enabled in Issue 2 (Prometheus execution). + Verify that a DiscoveryPolicy pointing at a non-existent Prometheus endpoint + sets Ready=False with reason DNSError in the status. steps: - name: Create DiscoveryPolicy with broken Prometheus endpoint try: - apply: file: 01-broken-prometheus.yaml - - name: Assert NotImplemented condition is set + - name: Assert DNSError condition is set try: - assert: timeout: 60s - file: 02-assert-notimplemented.yaml + file: 04-assert-dns-prometheus.yaml - name: Cleanup try: - delete: diff --git a/test/e2e/discovery-registry/01-discoverypolicy.yaml b/test/e2e/discovery-registry/01-discoverypolicy.yaml new file mode 100644 index 0000000..e062dfe --- /dev/null +++ b/test/e2e/discovery-registry/01-discoverypolicy.yaml @@ -0,0 +1,28 @@ +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: e2e-registry +spec: + queries: + - name: registry-tags + type: registry + registry: + url: "http://registry.e2e-infra.svc.cluster.local:5000" + repositories: + - test/myapp + - test/worker + - test/tools + tagFilter: "^v" + topX: 5 + signals: + - name: tag-recency + queryRef: registry-tags + type: aggregate + aggregate: + method: sum + ranking: + strategy: signal + signal: + signalRef: tag-recency + syncInterval: 30s + maxImages: 20 diff --git a/test/e2e/discovery-registry/02-assert-discovery-status.yaml b/test/e2e/discovery-registry/02-assert-discovery-status.yaml new file mode 100644 index 0000000..b378454 --- /dev/null +++ b/test/e2e/discovery-registry/02-assert-discovery-status.yaml @@ -0,0 +1,14 @@ +# Assert that DiscoveryPolicy with registry query executed the pipeline successfully: +# - Ready=True with reason Synced +# - At least one image discovered from the registry +# - queryCount reflects the spec +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: e2e-registry +status: + (conditions[?type == 'Ready']): + - status: "True" + reason: Synced + (queryCount == `1`): true + (imageCount > `0`): true diff --git a/test/e2e/discovery-registry/chainsaw-test.yaml b/test/e2e/discovery-registry/chainsaw-test.yaml new file mode 100644 index 0000000..136a0f6 --- /dev/null +++ b/test/e2e/discovery-registry/chainsaw-test.yaml @@ -0,0 +1,26 @@ +# yaml-language-server: $schema=https://raw.githubusercontent.com/kyverno/chainsaw/main/.schemas/json/test-chainsaw-v1alpha1.json +apiVersion: chainsaw.kyverno.io/v1alpha1 +kind: Test +metadata: + name: discovery-registry +spec: + description: | + Verify that a DiscoveryPolicy with a registry query lists image tags from the + local e2e registry and populates status.discoveredImages. + steps: + - name: Create DiscoveryPolicy with registry query + try: + - apply: + file: 01-discoverypolicy.yaml + - name: Assert pipeline executed and images were discovered from registry + try: + - assert: + timeout: 120s + file: 02-assert-discovery-status.yaml + - name: Cleanup + try: + - delete: + ref: + apiVersion: drop.corewire.io/v1alpha1 + kind: DiscoveryPolicy + name: e2e-registry diff --git a/test/e2e/discovery/02-assert-discovery-status.yaml b/test/e2e/discovery/02-assert-discovery-status.yaml index 23a4f10..9fd7d43 100644 --- a/test/e2e/discovery/02-assert-discovery-status.yaml +++ b/test/e2e/discovery/02-assert-discovery-status.yaml @@ -1,11 +1,14 @@ -# Assert that DiscoveryPolicy is reconciled and has the NotImplemented condition. -# Pipeline execution is not yet implemented; the controller sets NotImplemented condition. +# Assert that DiscoveryPolicy pipeline executed successfully: +# - Ready=True with reason Synced +# - At least one image discovered +# - queryCount reflects the spec apiVersion: drop.corewire.io/v1alpha1 kind: DiscoveryPolicy metadata: name: e2e-prometheus status: (conditions[?type == 'Ready']): - - status: "False" - reason: NotImplemented - queryCount: 1 + - status: "True" + reason: Synced + (queryCount == `1`): true + (imageCount > `0`): true diff --git a/test/e2e/discovery/chainsaw-test.yaml b/test/e2e/discovery/chainsaw-test.yaml index 7a962c2..e521d82 100644 --- a/test/e2e/discovery/chainsaw-test.yaml +++ b/test/e2e/discovery/chainsaw-test.yaml @@ -5,20 +5,34 @@ metadata: name: discovery spec: description: | - Verify that a DiscoveryPolicy with the new query/signal/ranking schema is accepted - and the controller sets a NotImplemented condition until pipeline execution is implemented. + Verify that a DiscoveryPolicy with a Prometheus query executes the full + query/signal/ranking pipeline and populates status.discoveredImages. steps: - name: Create DiscoveryPolicy with query/signal/ranking pipeline try: - apply: file: 01-discoverypolicy.yaml - - name: Assert NotImplemented condition is set + - name: Assert pipeline executed and images were discovered try: - assert: - timeout: 60s + timeout: 120s file: 02-assert-discovery-status.yaml + - name: Create CachedImageSet backed by discovery + try: + - apply: + file: 03-cachedimageset-discovery.yaml + - name: Assert child CachedImages were created from discovered images + try: + - assert: + timeout: 60s + file: 04-assert-children.yaml - name: Cleanup try: + - delete: + ref: + apiVersion: drop.corewire.io/v1alpha1 + kind: CachedImageSet + name: discovered-set - delete: ref: apiVersion: drop.corewire.io/v1alpha1 From a00b969be4377741b3d49b2dce7910c36d10c82f Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Sat, 27 Jun 2026 14:22:50 +0000 Subject: [PATCH 04/35] feat(discovery): implement Loki query execution and eventPullTime signal derivation --- internal/discovery/engine.go | 158 ++++++++++++- internal/discovery/engine_test.go | 186 ++++++++++++++++ internal/discovery/loki.go | 359 ++++++++++++++++++++++++++++++ internal/discovery/loki_test.go | 237 ++++++++++++++++++++ 4 files changed, 932 insertions(+), 8 deletions(-) create mode 100644 internal/discovery/loki.go create mode 100644 internal/discovery/loki_test.go diff --git a/internal/discovery/engine.go b/internal/discovery/engine.go index 42313be..dd3faef 100644 --- a/internal/discovery/engine.go +++ b/internal/discovery/engine.go @@ -14,7 +14,10 @@ import ( dropv1alpha1 "github.com/corewire/drop/api/v1alpha1" ) -const signalStatusFailed = "failed" +const ( + signalStatusFailed = "failed" + signalStatusSuccess = "success" +) // QueryRawData holds raw per-image samples from a single query execution. // For prometheus range queries each image may have multiple samples. @@ -197,9 +200,21 @@ func executeQuery(ctx context.Context, q dropv1alpha1.DiscoveryQuery, httpClient return raw, qr case dropv1alpha1.DiscoveryQueryTypeLoki: - qr.Status = dropv1alpha1.QueryResultStatusFailed - qr.Message = "loki query execution is not yet implemented" - return nil, qr + if q.Loki == nil { + qr.Status = dropv1alpha1.QueryResultStatusFailed + qr.Message = "loki config is required when type=loki" + return nil, qr + } + raw, err := executeLokiQuery(ctx, q.Loki, httpClient) + if err != nil { + qr.Status = dropv1alpha1.QueryResultStatusFailed + qr.Message = err.Error() + return nil, qr + } + records := countSamples(raw.Samples) + qr.Records = &records + qr.Status = dropv1alpha1.QueryResultStatusSuccess + return raw, qr default: qr.Status = dropv1alpha1.QueryResultStatusFailed @@ -251,6 +266,24 @@ func executeRegistryQuery(ctx context.Context, cfg *dropv1alpha1.DiscoveryRegist return raw, nil } +// executeLokiQuery fetches log entries from Loki and returns raw per-image samples. +func executeLokiQuery(ctx context.Context, cfg *dropv1alpha1.DiscoveryLokiQuery, httpClient *http.Client) (*QueryRawData, error) { + var lookback time.Duration + if cfg.Lookback != nil { + lookback = cfg.Lookback.Duration + } + src := NewLokiSource(cfg.Endpoint, cfg.Query, lookback, cfg.Parser, httpClient) + results, err := src.FetchRaw(ctx) + if err != nil { + return nil, err + } + raw := &QueryRawData{ + Samples: results, + QueryType: dropv1alpha1.DiscoveryQueryTypeLoki, + } + return raw, nil +} + // deriveSignal computes per-image float64 values for a single signal. func deriveSignal(sig dropv1alpha1.DiscoverySignal, raw *QueryRawData) (map[string]float64, dropv1alpha1.SignalResult) { sr := dropv1alpha1.SignalResult{Name: sig.Name} @@ -300,9 +333,15 @@ func deriveSignal(sig dropv1alpha1.DiscoverySignal, raw *QueryRawData) (map[stri return values, sr case dropv1alpha1.SignalTypeEventPullTime: - sr.Status = signalStatusFailed - sr.Message = "eventPullTime signal derivation is not yet implemented" - return nil, sr + if sig.EventPullTime == nil { + sr.Status = signalStatusFailed + sr.Message = "eventPullTime config is required when type=eventPullTime" + return nil, sr + } + values := deriveEventPullTime(raw.Samples, sig.EventPullTime) + sr.Images = int32(len(values)) + sr.Status = signalStatusSuccess + return values, sr default: sr.Status = signalStatusFailed @@ -656,11 +695,20 @@ func modelExposureRank(cfg *dropv1alpha1.ModelExposureRankingConfig, signals map } // collectImages returns a sorted, deduplicated list of all image references across all query results. +// For Loki query data, special per-image suffix keys (":failed", ":cache_hit") are stripped to +// their base image name so that images visible only via failure/cache events are still included. func collectImages(rawByQuery map[string]*QueryRawData) []string { seen := make(map[string]struct{}) for _, raw := range rawByQuery { for img := range raw.Samples { - seen[img] = struct{}{} + switch { + case strings.HasSuffix(img, lokiFailedSuffix): + seen[strings.TrimSuffix(img, lokiFailedSuffix)] = struct{}{} + case strings.HasSuffix(img, lokiCacheHitSuffix): + seen[strings.TrimSuffix(img, lokiCacheHitSuffix)] = struct{}{} + default: + seen[img] = struct{}{} + } } } images := make([]string, 0, len(seen)) @@ -679,3 +727,97 @@ func countSamples(samples map[string][]TimedSample) int64 { } return total } + +// deriveEventPullTime computes per-image pull-time statistics from Loki event samples. +// +// The samples map is expected to come from a Loki kubernetesEvents query: +// - samples[image] → pull duration values in seconds (from Pulled events) +// - samples[image+":failed"] → count of pull-failure events (value=1.0 each) +// - samples[image+":cache_hit"] → count of already-present events (value=1.0 each) +func deriveEventPullTime(samples map[string][]TimedSample, cfg *dropv1alpha1.EventPullTimeSignalConfig) map[string]float64 { + imageSet := make(map[string]struct{}) + for key := range samples { + switch { + case strings.HasSuffix(key, lokiFailedSuffix): + imageSet[strings.TrimSuffix(key, lokiFailedSuffix)] = struct{}{} + case strings.HasSuffix(key, lokiCacheHitSuffix): + imageSet[strings.TrimSuffix(key, lokiCacheHitSuffix)] = struct{}{} + default: + imageSet[key] = struct{}{} + } + } + + out := make(map[string]float64, len(imageSet)) + for img := range imageSet { + var v float64 + switch cfg.Statistic { + case dropv1alpha1.EventPullTimeStatisticFailureCount: + v = float64(len(samples[img+lokiFailedSuffix])) + case dropv1alpha1.EventPullTimeStatisticCacheHitCount: + v = float64(len(samples[img+lokiCacheHitSuffix])) + case dropv1alpha1.EventPullTimeStatisticCount: + pts := append([]TimedSample(nil), samples[img]...) + if cfg.IncludeCacheHits { + pts = append(pts, samples[img+lokiCacheHitSuffix]...) + } + v = float64(len(pts)) + default: + // Duration statistics: p50, p90, p95, avg, max. + pts := append([]TimedSample(nil), samples[img]...) + if cfg.IncludeCacheHits { + pts = append(pts, samples[img+lokiCacheHitSuffix]...) + } + if len(pts) == 0 { + continue + } + durations := make([]float64, len(pts)) + for i, pt := range pts { + durations[i] = pt.Value + } + v = computeEventPullTimeStat(durations, cfg.Statistic) + } + out[img] = v + } + return out +} + +// computeEventPullTimeStat computes a duration statistic over a non-empty slice. +func computeEventPullTimeStat(vals []float64, stat dropv1alpha1.EventPullTimeStatistic) float64 { + sorted := make([]float64, len(vals)) + copy(sorted, vals) + sort.Float64s(sorted) + + switch stat { + case dropv1alpha1.EventPullTimeStatisticP50: + return durationPercentile(sorted, 50) + case dropv1alpha1.EventPullTimeStatisticP90: + return durationPercentile(sorted, 90) + case dropv1alpha1.EventPullTimeStatisticP95: + return durationPercentile(sorted, 95) + case dropv1alpha1.EventPullTimeStatisticAvg: + var sum float64 + for _, v := range sorted { + sum += v + } + return sum / float64(len(sorted)) + case dropv1alpha1.EventPullTimeStatisticMax: + return sorted[len(sorted)-1] + default: + return 0 + } +} + +// durationPercentile returns the p-th percentile of a sorted slice using linear interpolation. +func durationPercentile(sorted []float64, p float64) float64 { + n := len(sorted) + if n == 1 { + return sorted[0] + } + rank := p / 100.0 * float64(n-1) + lo := int(rank) + hi := lo + 1 + if hi >= n { + return sorted[n-1] + } + return sorted[lo] + (rank-float64(lo))*(sorted[hi]-sorted[lo]) +} diff --git a/internal/discovery/engine_test.go b/internal/discovery/engine_test.go index 7d2c52b..57c35bc 100644 --- a/internal/discovery/engine_test.go +++ b/internal/discovery/engine_test.go @@ -5,6 +5,7 @@ import ( "encoding/json" "net/http" "net/http/httptest" + "strconv" "testing" "time" @@ -352,3 +353,188 @@ func prometheusInstantHandler(imageValues map[string]string) http.Handler { _ = json.NewEncoder(w).Encode(resp) }) } + +// lokiStreamHandler returns an HTTP handler that serves a fixed Loki query_range response. +func lokiStreamHandler(streams []lokiStream) http.Handler { + return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + resp := lokiResponse{ + Status: lokiStatusSuccess, + Data: lokiData{ + ResultType: "streams", + Result: streams, + }, + } + w.WriteHeader(http.StatusOK) + _ = json.NewEncoder(w).Encode(resp) + }) +} + +// TestExecutePipeline_Loki verifies the full pipeline with a Loki query and eventPullTime signal. +func TestExecutePipeline_Loki(t *testing.T) { + now := time.Now() + nanoStr := func(t time.Time) string { + return strconv.FormatInt(t.UnixNano(), 10) + } + + streams := []lokiStream{ + { + Stream: map[string]string{"app": "kubelet"}, + Values: [][]string{ + {nanoStr(now.Add(-10 * time.Second)), `Pulling image "nginx:1.25"`}, + {nanoStr(now.Add(-7 * time.Second)), `Successfully pulled image "nginx:1.25" in 3s (3s including waiting)`}, + {nanoStr(now.Add(-5 * time.Second)), `Pulling image "redis:7.0"`}, + {nanoStr(now.Add(-2 * time.Second)), `Successfully pulled image "redis:7.0" in 3s (3s including waiting)`}, + }, + }, + } + + srv := httptest.NewServer(lokiStreamHandler(streams)) + defer srv.Close() + + spec := dropv1alpha1.DiscoveryPolicySpec{ + Queries: []dropv1alpha1.DiscoveryQuery{ + { + Name: "pull-events", + Type: dropv1alpha1.DiscoveryQueryTypeLoki, + Loki: &dropv1alpha1.DiscoveryLokiQuery{ + Endpoint: srv.URL, + Query: `{app="kubelet"}`, + QueryType: dropv1alpha1.LokiQueryTypeRange, + Lookback: &metav1.Duration{Duration: time.Hour}, + Parser: &dropv1alpha1.LokiParser{ + Type: dropv1alpha1.LokiParserTypeKubernetesEvents, + MessageField: "message", + }, + }, + }, + }, + Signals: []dropv1alpha1.DiscoverySignal{ + { + Name: "pull-time", + QueryRef: "pull-events", + Type: dropv1alpha1.SignalTypeEventPullTime, + EventPullTime: &dropv1alpha1.EventPullTimeSignalConfig{Statistic: dropv1alpha1.EventPullTimeStatisticAvg, DurationMode: dropv1alpha1.DurationModeMessageDuration}, + }, + }, + Ranking: &dropv1alpha1.DiscoveryRanking{Strategy: dropv1alpha1.RankingStrategySignal, Signal: &dropv1alpha1.SignalRankingConfig{SignalRef: "pull-time"}}, + MaxImages: 10, + } + + clientFn := func(_ context.Context, _ string) (*http.Client, error) { return srv.Client(), nil } + result := ExecutePipeline(context.Background(), spec, clientFn) + + if len(result.QueryResults) != 1 { + t.Fatalf("expected 1 query result, got %d", len(result.QueryResults)) + } + if result.QueryResults[0].Status != dropv1alpha1.QueryResultStatusSuccess { + t.Fatalf("expected success, got %s: %s", result.QueryResults[0].Status, result.QueryResults[0].Message) + } + if len(result.Images) != 2 { + t.Fatalf("expected 2 images, got %d: %v", len(result.Images), result.Images) + } + // Both images have avg pull time of 3s + for _, img := range result.Images { + if img.FinalScore != "3" { + t.Errorf("expected score 3 for %s, got %s", img.Image, img.FinalScore) + } + } +} + +// TestExecutePipeline_LokiFailureCount verifies that failure event counts are reported correctly. +func TestExecutePipeline_LokiFailureCount(t *testing.T) { + now := time.Now() + nanoStr := func(t time.Time) string { + return strconv.FormatInt(t.UnixNano(), 10) + } + + streams := []lokiStream{ + { + Stream: map[string]string{"app": "kubelet"}, + Values: [][]string{ + {nanoStr(now.Add(-5 * time.Second)), `Pulling image "nginx:1.25"`}, + {nanoStr(now.Add(-4 * time.Second)), `Failed to pull image "nginx:1.25": rpc error`}, + {nanoStr(now.Add(-3 * time.Second)), `Back-off pulling image "nginx:1.25"`}, + }, + }, + } + + srv := httptest.NewServer(lokiStreamHandler(streams)) + defer srv.Close() + + spec := dropv1alpha1.DiscoveryPolicySpec{ + Queries: []dropv1alpha1.DiscoveryQuery{ + { + Name: "pull-events", + Type: dropv1alpha1.DiscoveryQueryTypeLoki, + Loki: &dropv1alpha1.DiscoveryLokiQuery{ + Endpoint: srv.URL, + Query: `{app="kubelet"}`, + Parser: &dropv1alpha1.LokiParser{ + Type: dropv1alpha1.LokiParserTypeKubernetesEvents, + MessageField: "message", + }, + }, + }, + }, + Signals: []dropv1alpha1.DiscoverySignal{ + { + Name: "failures", + QueryRef: "pull-events", + Type: dropv1alpha1.SignalTypeEventPullTime, + EventPullTime: &dropv1alpha1.EventPullTimeSignalConfig{Statistic: dropv1alpha1.EventPullTimeStatisticFailureCount, DurationMode: dropv1alpha1.DurationModeMessageDuration}, + }, + }, + Ranking: &dropv1alpha1.DiscoveryRanking{Strategy: dropv1alpha1.RankingStrategySignal, Signal: &dropv1alpha1.SignalRankingConfig{SignalRef: "failures"}}, + MaxImages: 10, + } + + clientFn := func(_ context.Context, _ string) (*http.Client, error) { return srv.Client(), nil } + result := ExecutePipeline(context.Background(), spec, clientFn) + + if result.QueryResults[0].Status != dropv1alpha1.QueryResultStatusSuccess { + t.Fatalf("expected success, got %s: %s", result.QueryResults[0].Status, result.QueryResults[0].Message) + } + if len(result.Images) != 1 { + t.Fatalf("expected 1 image, got %d: %v", len(result.Images), result.Images) + } + // Both "failed" and "backoff" reasons count as failures → 2 failure events + if result.Images[0].FinalScore != "2" { + t.Errorf("expected failureCount=2, got %s", result.Images[0].FinalScore) + } +} + +// TestDeriveEventPullTime_Percentiles verifies p50/p90/p95 computation. +func TestDeriveEventPullTime_Percentiles(t *testing.T) { + // 10 duration samples: 1,2,3,4,5,6,7,8,9,10 seconds + pts := make([]TimedSample, 10) + for i := range pts { + pts[i] = TimedSample{Timestamp: float64(i), Value: float64(i + 1)} + } + samples := map[string][]TimedSample{"nginx:1.25": pts} + + tests := []struct { + stat dropv1alpha1.EventPullTimeStatistic + want float64 + }{ + {dropv1alpha1.EventPullTimeStatisticP50, 5.5}, + {dropv1alpha1.EventPullTimeStatisticP90, 9.1}, + {dropv1alpha1.EventPullTimeStatisticP95, 9.55}, + {dropv1alpha1.EventPullTimeStatisticAvg, 5.5}, + {dropv1alpha1.EventPullTimeStatisticMax, 10}, + {dropv1alpha1.EventPullTimeStatisticCount, 10}, + } + for _, tt := range tests { + cfg := &dropv1alpha1.EventPullTimeSignalConfig{Statistic: tt.stat} + got := deriveEventPullTime(samples, cfg)["nginx:1.25"] + if absFloat(got-tt.want) > 0.01 { + t.Errorf("statistic %s: got %v, want %v", tt.stat, got, tt.want) + } + } +} + +func absFloat(x float64) float64 { + if x < 0 { + return -x + } + return x +} diff --git a/internal/discovery/loki.go b/internal/discovery/loki.go new file mode 100644 index 0000000..742c3f4 --- /dev/null +++ b/internal/discovery/loki.go @@ -0,0 +1,359 @@ +package discovery + +import ( + "context" + "encoding/json" + "fmt" + "io" + "net/http" + "net/url" + "regexp" + "sort" + "strconv" + "strings" + "time" + + dropv1alpha1 "github.com/corewire/drop/api/v1alpha1" +) + +const ( + lokiStatusSuccess = "success" + // lokiLimitDefault is the maximum number of log entries to fetch per query. + lokiLimitDefault = 5000 + // lokiFailedSuffix is appended to image keys for pull-failure event counts. + lokiFailedSuffix = ":failed" + // lokiCacheHitSuffix is appended to image keys for cache-hit event counts. + lokiCacheHitSuffix = ":cache_hit" +) + +// rePulledDuration matches the pull duration in Pulled event messages. +// Examples: "in 2.345s", "in 100ms", "in 1m", "in 1h" +var rePulledDuration = regexp.MustCompile(`\bin\s+(\d+(?:\.\d+)?)(ms|s|m|h)\b`) + +// reImageRef matches an image reference in log messages. +// Handles: Pulling image "nginx:1.25" / image "nginx:1.25" +var reImageRef = regexp.MustCompile(`(?:image|Image)\s+"([^"]+)"`) + +// lokiResponse is the top-level Loki query_range API response. +type lokiResponse struct { + Status string `json:"status"` + Data lokiData `json:"data"` +} + +// lokiData is the data section of a Loki response. +type lokiData struct { + ResultType string `json:"resultType"` + Result []lokiStream `json:"result"` +} + +// lokiStream is a single log stream from Loki (labels + values). +type lokiStream struct { + Stream map[string]string `json:"stream"` + Values [][]string `json:"values"` // [nanosecond_timestamp_string, log_line] +} + +// LokiSource fetches log events from a Loki-compatible API. +type LokiSource struct { + Endpoint string + Query string + Lookback time.Duration + Parser *dropv1alpha1.LokiParser + HTTPClient *http.Client +} + +// NewLokiSource creates a new LokiSource. +func NewLokiSource(endpoint, query string, lookback time.Duration, parser *dropv1alpha1.LokiParser, httpClient *http.Client) *LokiSource { + if httpClient == nil { + httpClient = &http.Client{Timeout: 30 * time.Second} + } + return &LokiSource{ + Endpoint: endpoint, + Query: query, + Lookback: lookback, + Parser: parser, + HTTPClient: httpClient, + } +} + +// FetchRaw calls /loki/api/v1/query_range and returns per-image timed samples. +// +// For a kubernetesEvents parser, sample values are pull durations in seconds +// (from Pulled event messages or Pulling→Pulled timestamp pairs). +// Pull failures are stored under the key "image:failed" with value 1.0, +// and cache hits under "image:cache_hit" with value 1.0. +// +// Without a parser, each log entry produces a value=1.0 sample keyed by +// the "image" stream label. +func (l *LokiSource) FetchRaw(ctx context.Context) (map[string][]TimedSample, error) { + u, err := url.Parse(l.Endpoint) + if err != nil { + return nil, fmt.Errorf("parsing endpoint: %w", err) + } + u.Path = "/loki/api/v1/query_range" + + lookback := l.Lookback + if lookback == 0 { + lookback = 24 * time.Hour + } + now := time.Now().UTC() + + q := u.Query() + q.Set("query", l.Query) + q.Set("start", strconv.FormatInt(now.Add(-lookback).UnixNano(), 10)) + q.Set("end", strconv.FormatInt(now.UnixNano(), 10)) + q.Set("limit", strconv.Itoa(lokiLimitDefault)) + q.Set("direction", "forward") + u.RawQuery = q.Encode() + + req, err := http.NewRequestWithContext(ctx, http.MethodGet, u.String(), nil) + if err != nil { + return nil, fmt.Errorf("creating request: %w", err) + } + + resp, err := l.HTTPClient.Do(req) + if err != nil { + return nil, fmt.Errorf("querying loki: %w", err) + } + defer func() { _ = resp.Body.Close() }() + + if resp.StatusCode != http.StatusOK { + body, _ := io.ReadAll(resp.Body) + return nil, fmt.Errorf("loki returned status %d: %s", resp.StatusCode, string(body)) + } + + var lokiResp lokiResponse + if err := json.NewDecoder(resp.Body).Decode(&lokiResp); err != nil { + return nil, fmt.Errorf("decoding loki response: %w", err) + } + if lokiResp.Status != lokiStatusSuccess { + return nil, fmt.Errorf("loki query failed with status: %s", lokiResp.Status) + } + + return l.parseLokiStreams(lokiResp.Data.Result), nil +} + +// parseLokiStreams converts Loki streams into per-image timed samples using +// the configured parser (or a generic image-label fallback). +func (l *LokiSource) parseLokiStreams(streams []lokiStream) map[string][]TimedSample { + if l.Parser != nil && l.Parser.Type == dropv1alpha1.LokiParserTypeKubernetesEvents { + return parseKubernetesEventStreams(streams, l.Parser) + } + return parseGenericLokiStreams(streams) +} + +// parseGenericLokiStreams produces value=1.0 samples keyed by the "image" stream label. +func parseGenericLokiStreams(streams []lokiStream) map[string][]TimedSample { + out := make(map[string][]TimedSample) + for _, stream := range streams { + image := stream.Stream["image"] + if image == "" { + continue + } + for _, entry := range stream.Values { + if len(entry) < 2 { + continue + } + ts := parseLokiNanoTimestamp(entry[0]) + out[image] = append(out[image], TimedSample{Timestamp: ts, Value: 1.0}) + } + } + return out +} + +// lokiEventRecord is an intermediate representation of a parsed Kubernetes Event. +type lokiEventRecord struct { + image string + pod string + reason string + message string + timestamp float64 +} + +// parseKubernetesEventStreams parses Kubernetes Event records from Loki log entries. +// +// It produces: +// - samples[image] → pull duration in seconds for each Pulled event +// - samples[image+":failed"] → 1.0 per pull-failure event +// - samples[image+":cache_hit"] → 1.0 per already-present event +// +// Durations are derived from the "in Xs" pattern in Pulled messages (messageDuration). +// When no duration is present in the message, a Pulling→Pulled event-pair duration +// is used as a fallback. +func parseKubernetesEventStreams(streams []lokiStream, parser *dropv1alpha1.LokiParser) map[string][]TimedSample { + reasonField := lokiCoalesceField(parser.ReasonField, "reason") + podField := lokiCoalesceField(parser.PodField, "involvedObject_name") + messageField := lokiCoalesceField(parser.MessageField, "message") + imageField := lokiCoalesceField(parser.ImageField, "message") + + var records []lokiEventRecord + for _, stream := range streams { + for _, entry := range stream.Values { + if len(entry) < 2 { + continue + } + ts := parseLokiNanoTimestamp(entry[0]) + + rec := lokiEventRecord{ + timestamp: ts, + reason: stream.Stream[reasonField], + pod: stream.Stream[podField], + message: stream.Stream[messageField], + } + + // If key fields are absent from labels, try to parse the log line as JSON. + if rec.reason == "" || rec.message == "" { + var parsed map[string]interface{} + if err := json.Unmarshal([]byte(entry[1]), &parsed); err == nil { + if rec.reason == "" { + rec.reason, _ = parsed[reasonField].(string) + } + if rec.pod == "" { + rec.pod, _ = parsed[podField].(string) + } + if rec.message == "" { + rec.message, _ = parsed[messageField].(string) + } + } else if rec.message == "" { + rec.message = entry[1] + } + } + + // Infer reason from message text when no structured label provided it. + if rec.reason == "" && rec.message != "" { + rec.reason = lokiInferReasonFromMessage(rec.message) + } + + // Determine the source string for image extraction. + var imgSource string + if imageField == messageField || imageField == "message" { + imgSource = rec.message + } else { + imgSource = stream.Stream[imageField] + if imgSource == "" { + imgSource = rec.message + } + } + rec.image = lokiExtractImageFromMessage(imgSource) + if rec.image == "" { + continue + } + records = append(records, rec) + } + } + + // Sort records chronologically for correct eventPair matching. + sort.Slice(records, func(i, j int) bool { + return records[i].timestamp < records[j].timestamp + }) + + // pullingMap tracks the start timestamp of Pulling events per (pod:image). + pullingMap := make(map[string]float64) + out := make(map[string][]TimedSample) + + for _, rec := range records { + switch strings.ToLower(rec.reason) { + case "pulling": + pullingMap[rec.pod+":"+rec.image] = rec.timestamp + + case "pulled": + // Primary: parse duration from message ("in Xs"). + dur := lokiParsePullDuration(rec.message) + // Fallback: event-pair (Pulling → Pulled timestamp delta). + if dur == 0 { + if pullStart, ok := pullingMap[rec.pod+":"+rec.image]; ok { + if d := rec.timestamp - pullStart; d > 0 { + dur = d + } + } + } + if dur > 0 { + out[rec.image] = append(out[rec.image], TimedSample{Timestamp: rec.timestamp, Value: dur}) + } + delete(pullingMap, rec.pod+":"+rec.image) + + case "failed", "backoff": + out[rec.image+lokiFailedSuffix] = append( + out[rec.image+lokiFailedSuffix], + TimedSample{Timestamp: rec.timestamp, Value: 1.0}, + ) + + case "alreadypresent": + out[rec.image+lokiCacheHitSuffix] = append( + out[rec.image+lokiCacheHitSuffix], + TimedSample{Timestamp: rec.timestamp, Value: 1.0}, + ) + } + } + + return out +} + +// lokiExtractImageFromMessage extracts an image reference from a message string. +// Handles patterns such as: Pulling image "nginx:1.25" +func lokiExtractImageFromMessage(msg string) string { + m := reImageRef.FindStringSubmatch(msg) + if len(m) > 1 { + return m[1] + } + return "" +} + +// lokiParsePullDuration extracts the pull duration in seconds from a Pulled event message. +// Example: "Successfully pulled image \"nginx:1.25\" in 2.345s ..." +func lokiParsePullDuration(msg string) float64 { + m := rePulledDuration.FindStringSubmatch(msg) + if len(m) < 3 { + return 0 + } + v, err := strconv.ParseFloat(m[1], 64) + if err != nil { + return 0 + } + switch m[2] { + case "ms": + return v / 1000.0 + case "m": + return v * 60 + case "h": + return v * 3600 + default: // "s" + return v + } +} + +// lokiInferReasonFromMessage infers a Kubernetes Event reason from a plain-text log message. +// This is used when the reason field is not present in the Loki stream labels. +func lokiInferReasonFromMessage(msg string) string { + lower := strings.ToLower(msg) + switch { + case strings.Contains(lower, "successfully pulled"): + return "Pulled" + case strings.Contains(lower, "back-off pulling") || strings.Contains(lower, "back-off"): + return "Backoff" + case strings.Contains(lower, "failed to pull"): + return "Failed" + case strings.Contains(lower, "pulling image"): + return "Pulling" + case strings.Contains(lower, "already present"): + return "AlreadyPresent" + default: + return "" + } +} + +// parseLokiNanoTimestamp converts a Loki nanosecond epoch string to Unix seconds (float64). +func parseLokiNanoTimestamp(s string) float64 { + v, err := strconv.ParseInt(s, 10, 64) + if err != nil { + return 0 + } + return float64(v) / 1e9 +} + +// lokiCoalesceField returns field if non-empty, otherwise defaultVal. +func lokiCoalesceField(field, defaultVal string) string { + if field != "" { + return field + } + return defaultVal +} diff --git a/internal/discovery/loki_test.go b/internal/discovery/loki_test.go new file mode 100644 index 0000000..a852fcf --- /dev/null +++ b/internal/discovery/loki_test.go @@ -0,0 +1,237 @@ +package discovery + +import ( + "encoding/json" + "net/http" + "net/http/httptest" + "strconv" + "testing" + "time" + + dropv1alpha1 "github.com/corewire/drop/api/v1alpha1" +) + +// TestLokiSource_FetchRaw_Generic verifies the generic (non-parser) FetchRaw path, +// which produces value=1.0 samples keyed by the "image" stream label. +func TestLokiSource_FetchRaw_Generic(t *testing.T) { + now := time.Now() + streams := []lokiStream{ + { + Stream: map[string]string{"image": "nginx:1.25"}, + Values: [][]string{ + {nanoStringLoki(now.Add(-2 * time.Second)), "log line 1"}, + {nanoStringLoki(now.Add(-1 * time.Second)), "log line 2"}, + }, + }, + { + Stream: map[string]string{"image": "redis:7.0"}, + Values: [][]string{ + {nanoStringLoki(now), "log line 3"}, + }, + }, + { + // no image label → should be skipped + Stream: map[string]string{"app": "kubelet"}, + Values: [][]string{ + {nanoStringLoki(now), "unrelated line"}, + }, + }, + } + + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + resp := lokiResponse{ + Status: lokiStatusSuccess, + Data: lokiData{ResultType: "streams", Result: streams}, + } + w.WriteHeader(http.StatusOK) + _ = json.NewEncoder(w).Encode(resp) + })) + defer srv.Close() + + src := NewLokiSource(srv.URL, `{app="test"}`, time.Hour, nil, srv.Client()) + samples, err := src.FetchRaw(t.Context()) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if len(samples["nginx:1.25"]) != 2 { + t.Errorf("expected 2 samples for nginx:1.25, got %d", len(samples["nginx:1.25"])) + } + if len(samples["redis:7.0"]) != 1 { + t.Errorf("expected 1 sample for redis:7.0, got %d", len(samples["redis:7.0"])) + } + for _, s := range samples["nginx:1.25"] { + if s.Value != 1.0 { + t.Errorf("expected generic sample value 1.0, got %f", s.Value) + } + } +} + +// TestLokiSource_FetchRaw_KubernetesEvents verifies the kubernetesEvents parser +// with message-based duration extraction and eventPair fallback. +func TestLokiSource_FetchRaw_KubernetesEvents(t *testing.T) { + now := time.Now() + streams := []lokiStream{ + { + Stream: map[string]string{ + "reason": "Pulling", + "involvedObject_name": "pod-abc", + "message": `Pulling image "nginx:1.25"`, + }, + Values: [][]string{{nanoStringLoki(now.Add(-3 * time.Second)), ""}}, + }, + { + Stream: map[string]string{ + "reason": "Pulled", + "involvedObject_name": "pod-abc", + "message": `Successfully pulled image "nginx:1.25" in 2.5s (2.5s including waiting)`, + }, + Values: [][]string{{nanoStringLoki(now.Add(-500 * time.Millisecond)), ""}}, + }, + } + + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + resp := lokiResponse{ + Status: lokiStatusSuccess, + Data: lokiData{ResultType: "streams", Result: streams}, + } + w.WriteHeader(http.StatusOK) + _ = json.NewEncoder(w).Encode(resp) + })) + defer srv.Close() + + src := NewLokiSource(srv.URL, `{app="kubelet"}`, time.Hour, &dropv1alpha1.LokiParser{ + Type: dropv1alpha1.LokiParserTypeKubernetesEvents, + ReasonField: "reason", + PodField: "involvedObject_name", + MessageField: "message", + }, srv.Client()) + samples, err := src.FetchRaw(t.Context()) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + + // Expect one duration sample for nginx:1.25 (2.5s from message) + if len(samples["nginx:1.25"]) != 1 { + t.Fatalf("expected 1 sample for nginx:1.25, got %d", len(samples["nginx:1.25"])) + } + if got := samples["nginx:1.25"][0].Value; got != 2.5 { + t.Errorf("expected duration 2.5s, got %f", got) + } +} + +// TestLokiSource_FetchRaw_KubernetesEvents_EventPair verifies that when no duration +// is present in the message, the Pulling→Pulled timestamp delta is used. +func TestLokiSource_FetchRaw_KubernetesEvents_EventPair(t *testing.T) { + now := time.Now() + pullingTime := now.Add(-3 * time.Second) + pulledTime := now.Add(-1 * time.Second) + + streams := []lokiStream{ + { + Stream: map[string]string{ + "reason": "Pulling", + "involvedObject_name": "pod-xyz", + "message": `Pulling image "alpine:3.19"`, + }, + Values: [][]string{{nanoStringLoki(pullingTime), ""}}, + }, + { + Stream: map[string]string{ + "reason": "Pulled", + "involvedObject_name": "pod-xyz", + "message": `Successfully pulled image "alpine:3.19"`, // no duration + }, + Values: [][]string{{nanoStringLoki(pulledTime), ""}}, + }, + } + + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + resp := lokiResponse{ + Status: lokiStatusSuccess, + Data: lokiData{ResultType: "streams", Result: streams}, + } + w.WriteHeader(http.StatusOK) + _ = json.NewEncoder(w).Encode(resp) + })) + defer srv.Close() + + src := NewLokiSource(srv.URL, `{app="kubelet"}`, time.Hour, &dropv1alpha1.LokiParser{ + Type: dropv1alpha1.LokiParserTypeKubernetesEvents, + ReasonField: "reason", + PodField: "involvedObject_name", + MessageField: "message", + }, srv.Client()) + samples, err := src.FetchRaw(t.Context()) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + + if len(samples["alpine:3.19"]) != 1 { + t.Fatalf("expected 1 sample for alpine:3.19, got %d", len(samples["alpine:3.19"])) + } + // eventPair duration ≈ 2 seconds (pulledTime - pullingTime) + got := samples["alpine:3.19"][0].Value + if got < 1.9 || got > 2.1 { + t.Errorf("expected eventPair duration ~2s, got %f", got) + } +} + +// TestLokiSource_FetchRaw_HTTPError verifies that HTTP errors are surfaced. +func TestLokiSource_FetchRaw_HTTPError(t *testing.T) { + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + http.Error(w, "internal error", http.StatusInternalServerError) + })) + defer srv.Close() + + src := NewLokiSource(srv.URL, `{app="test"}`, time.Hour, nil, srv.Client()) + _, err := src.FetchRaw(t.Context()) + if err == nil { + t.Fatal("expected error, got nil") + } +} + +// TestLokiInferReasonFromMessage verifies the plain-text reason inference. +func TestLokiInferReasonFromMessage(t *testing.T) { + tests := []struct { + msg string + want string + }{ + {`Successfully pulled image "nginx:1.25" in 2s`, "Pulled"}, + {`Pulling image "nginx:1.25"`, "Pulling"}, + {`Failed to pull image "nginx:1.25": not found`, "Failed"}, + {`Back-off pulling image "nginx:1.25"`, "Backoff"}, + {`Container image "nginx:1.25" already present on machine`, "AlreadyPresent"}, + {`some unrelated log line`, ""}, + } + for _, tt := range tests { + got := lokiInferReasonFromMessage(tt.msg) + if got != tt.want { + t.Errorf("msg=%q: got %q, want %q", tt.msg, got, tt.want) + } + } +} + +// TestLokiParsePullDuration verifies duration parsing from event messages. +func TestLokiParsePullDuration(t *testing.T) { + tests := []struct { + msg string + want float64 + }{ + {`Successfully pulled image "nginx:1.25" in 2.5s`, 2.5}, + {`Successfully pulled image "nginx:1.25" in 500ms`, 0.5}, + {`Successfully pulled image "nginx:1.25" in 1m`, 60}, + {`Successfully pulled image "nginx:1.25" in 1h`, 3600}, + {`Successfully pulled image "nginx:1.25"`, 0}, // no duration + } + for _, tt := range tests { + got := lokiParsePullDuration(tt.msg) + if got != tt.want { + t.Errorf("msg=%q: got %f, want %f", tt.msg, got, tt.want) + } + } +} + +// nanoStringLoki formats a time as a nanosecond epoch string for Loki responses. +func nanoStringLoki(t time.Time) string { + return strconv.FormatInt(t.UnixNano(), 10) +} From 61199bd378fdcfaf051c51c988fba2edc51081de Mon Sep 17 00:00:00 2001 From: copilot Date: Sat, 27 Jun 2026 14:38:09 +0000 Subject: [PATCH 05/35] test(e2e): add Loki infrastructure and seed job for image-pull event discovery Deploy a single-binary Loki into the e2e-infra namespace and seed it with kubelet-style image-pull event log lines (Pulling/Pulled/Failed/already present) so DiscoveryPolicy loki queries with the kubernetesEvents parser and the eventPullTime signal can be exercised end-to-end. Wired into hack/e2e-infra/setup.sh and the Tiltfile alongside the existing Prometheus and registry infrastructure. --- Tiltfile | 6 ++ hack/e2e-infra/loki.yaml | 101 ++++++++++++++++++++++++++++++ hack/e2e-infra/seed-loki-job.yaml | 87 +++++++++++++++++++++++++ hack/e2e-infra/setup.sh | 13 ++++ 4 files changed, 207 insertions(+) create mode 100644 hack/e2e-infra/loki.yaml create mode 100644 hack/e2e-infra/seed-loki-job.yaml diff --git a/Tiltfile b/Tiltfile index 3682fc8..36afcc2 100644 --- a/Tiltfile +++ b/Tiltfile @@ -82,9 +82,11 @@ local('kubectl create namespace e2e-infra --dry-run=client -o yaml | kubectl app k8s_yaml('hack/e2e-infra/prometheus-config.yaml') k8s_yaml('hack/e2e-infra/prometheus.yaml') k8s_yaml('hack/e2e-infra/registry.yaml') +k8s_yaml('hack/e2e-infra/loki.yaml') k8s_resource('prometheus', objects=['prometheus-config:configmap', 'prometheus:serviceaccount', 'prometheus-metrics-reader:clusterrolebinding'], port_forwards=['9090:9090'], labels=['infra']) k8s_resource('registry', port_forwards=['5000:5000'], labels=['infra']) +k8s_resource('loki', objects=['loki-config:configmap'], port_forwards=['3100:3100'], labels=['infra']) # Configure kind nodes to reach the in-cluster registry. # Kubelet/containerd can't resolve cluster DNS, so we point them at the registry's ClusterIP. @@ -99,6 +101,10 @@ local_resource( k8s_yaml('hack/e2e-infra/seed-registry-job.yaml') k8s_resource('seed-registry', labels=['infra'], resource_deps=['registry-mirror']) +# Seed Loki with image-pull events +k8s_yaml('hack/e2e-infra/seed-loki-job.yaml') +k8s_resource('seed-loki', labels=['infra'], resource_deps=['loki']) + # --- Grafana with Drop dashboard --- # Create dashboard ConfigMap from the shipped JSON, then apply grafana manifests. dashboard_json = str(read_file('charts/drop/dashboards/drop-operator.json')) diff --git a/hack/e2e-infra/loki.yaml b/hack/e2e-infra/loki.yaml new file mode 100644 index 0000000..7ed1939 --- /dev/null +++ b/hack/e2e-infra/loki.yaml @@ -0,0 +1,101 @@ +apiVersion: v1 +kind: ConfigMap +metadata: + name: loki-config + namespace: e2e-infra +data: + loki.yaml: | + auth_enabled: false + server: + http_listen_port: 3100 + grpc_listen_port: 9096 + common: + instance_addr: 127.0.0.1 + path_prefix: /loki + storage: + filesystem: + chunks_directory: /loki/chunks + rules_directory: /loki/rules + replication_factor: 1 + ring: + kvstore: + store: inmemory + schema_config: + configs: + - from: 2020-10-24 + store: tsdb + object_store: filesystem + schema: v13 + index: + prefix: index_ + period: 24h + limits_config: + # E2E seed entries carry explicit timestamps; never reject them. + reject_old_samples: false + allow_structured_metadata: true + volume_enabled: true + analytics: + reporting_enabled: false +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: loki + namespace: e2e-infra + labels: + app: loki +spec: + replicas: 1 + selector: + matchLabels: + app: loki + template: + metadata: + labels: + app: loki + spec: + containers: + - name: loki + image: grafana/loki:3.1.1 + args: + - "-config.file=/etc/loki/loki.yaml" + ports: + - containerPort: 3100 + readinessProbe: + httpGet: + path: /ready + port: 3100 + initialDelaySeconds: 15 + periodSeconds: 5 + volumeMounts: + - name: config + mountPath: /etc/loki + - name: data + mountPath: /loki + resources: + requests: + cpu: 50m + memory: 128Mi + limits: + memory: 256Mi + volumes: + - name: config + configMap: + name: loki-config + - name: data + emptyDir: {} +--- +apiVersion: v1 +kind: Service +metadata: + name: loki + namespace: e2e-infra + labels: + app: loki +spec: + selector: + app: loki + ports: + - port: 3100 + targetPort: 3100 + protocol: TCP diff --git a/hack/e2e-infra/seed-loki-job.yaml b/hack/e2e-infra/seed-loki-job.yaml new file mode 100644 index 0000000..ca36793 --- /dev/null +++ b/hack/e2e-infra/seed-loki-job.yaml @@ -0,0 +1,87 @@ +apiVersion: batch/v1 +kind: Job +metadata: + name: seed-loki + namespace: e2e-infra +spec: + backoffLimit: 2 + template: + spec: + restartPolicy: Never + containers: + - name: seed + image: curlimages/curl:8.10.1 + command: + - /bin/sh + - -c + - | + set -eu + LOKI="http://loki.e2e-infra.svc.cluster.local:3100" + REGISTRY="registry.e2e-infra.svc.cluster.local:5000" + + # Wait for Loki to be ready + echo "Waiting for Loki..." + for i in $(seq 1 60); do + if curl -sf "$LOKI/ready" >/dev/null 2>&1; then + echo "Loki is ready" + break + fi + sleep 2 + done + + # Base timestamp (Unix nanoseconds). Each entry adds a small offset so + # values are uniquely ordered within the stream. + BASE="$(date +%s)000000000" + n=0 + ENTRIES="" + add() { + # add + TS=$(( BASE + n * 1000000000 )) + n=$(( n + 1 )) + MSG=$(printf '%s' "$1" | sed 's/"/\\"/g') + if [ -n "$ENTRIES" ]; then ENTRIES="$ENTRIES,"; fi + ENTRIES="$ENTRIES[ \"$TS\", \"$MSG\" ]" + } + + # myapp:v1 — three cold pulls (3s, 4s, 5s) → avg 4s, plus a cache hit. + add "Pulling image \"$REGISTRY/test/myapp:v1\"" + add "Successfully pulled image \"$REGISTRY/test/myapp:v1\" in 3.0s (3.0s including waiting)" + add "Successfully pulled image \"$REGISTRY/test/myapp:v1\" in 4.0s (4.0s including waiting)" + add "Successfully pulled image \"$REGISTRY/test/myapp:v1\" in 5.0s (5.0s including waiting)" + add "Container image \"$REGISTRY/test/myapp:v1\" already present on machine" + + # worker:v2 — one slow cold pull (12s) and one pull failure. + add "Pulling image \"$REGISTRY/test/worker:v2\"" + add "Successfully pulled image \"$REGISTRY/test/worker:v2\" in 12.0s (12.0s including waiting)" + add "Failed to pull image \"$REGISTRY/test/worker:v2\": rpc error: code = Unknown" + + # tools:v1 — two quick cold pulls (1s, 2s). + add "Successfully pulled image \"$REGISTRY/test/tools:v1\" in 1.0s (1.0s including waiting)" + add "Successfully pulled image \"$REGISTRY/test/tools:v1\" in 2.0s (2.0s including waiting)" + + PAYLOAD="{\"streams\":[{\"stream\":{\"job\":\"kubelet\",\"namespace\":\"default\"},\"values\":[$ENTRIES]}]}" + + echo "Pushing image-pull events to Loki..." + CODE=$(printf '%s' "$PAYLOAD" | curl -s -o /dev/stderr -w '%{http_code}' \ + -X POST -H 'Content-Type: application/json' \ + --data-binary @- "$LOKI/loki/api/v1/push") + echo "push HTTP $CODE" + case "$CODE" in + 204|200) echo "Seed events accepted." ;; + *) echo "WARNING: unexpected status $CODE" ;; + esac + + # Verify the events are queryable. + echo "Verifying seed events..." + for i in $(seq 1 30); do + RESULT=$(curl -s -G "$LOKI/loki/api/v1/query_range" \ + --data-urlencode 'query={job="kubelet"}' \ + --data-urlencode 'limit=10' 2>/dev/null || echo "") + if echo "$RESULT" | grep -q "Successfully pulled"; then + echo "Seed events are queryable!" + exit 0 + fi + sleep 2 + done + echo "WARNING: seed events may not be queryable yet" + exit 0 diff --git a/hack/e2e-infra/setup.sh b/hack/e2e-infra/setup.sh index ecbbf42..9a1d01d 100755 --- a/hack/e2e-infra/setup.sh +++ b/hack/e2e-infra/setup.sh @@ -19,6 +19,10 @@ echo "[e2e-infra] Deploying Prometheus with seed data..." kubectl apply -n "$NAMESPACE" -f "$SCRIPT_DIR/prometheus-config.yaml" kubectl apply -n "$NAMESPACE" -f "$SCRIPT_DIR/prometheus.yaml" +# --- Deploy Loki for image-pull event discovery --- +echo "[e2e-infra] Deploying Loki..." +kubectl apply -n "$NAMESPACE" -f "$SCRIPT_DIR/loki.yaml" + # --- Wait for readiness --- echo "[e2e-infra] Waiting for registry to be ready..." kubectl -n "$NAMESPACE" wait --for=condition=available deployment/registry --timeout=90s @@ -43,6 +47,9 @@ echo "[e2e-infra] Containerd mirror configured on all nodes." echo "[e2e-infra] Waiting for Prometheus to be ready..." kubectl -n "$NAMESPACE" wait --for=condition=available deployment/prometheus --timeout=90s +echo "[e2e-infra] Waiting for Loki to be ready..." +kubectl -n "$NAMESPACE" wait --for=condition=available deployment/loki --timeout=120s + # --- Seed the registry with a few images --- echo "[e2e-infra] Seeding registry with test images..." REGISTRY_POD=$(kubectl -n "$NAMESPACE" get pods -l app=registry -o jsonpath='{.items[0].metadata.name}') @@ -57,6 +64,12 @@ echo "[e2e-infra] Seeding Prometheus with image metrics..." kubectl apply -n "$NAMESPACE" -f "$SCRIPT_DIR/seed-metrics-job.yaml" kubectl -n "$NAMESPACE" wait --for=condition=complete job/seed-metrics --timeout=60s 2>/dev/null || true +# --- Seed Loki with image-pull events --- +echo "[e2e-infra] Seeding Loki with image-pull events..." +kubectl apply -n "$NAMESPACE" -f "$SCRIPT_DIR/seed-loki-job.yaml" +kubectl -n "$NAMESPACE" wait --for=condition=complete job/seed-loki --timeout=120s 2>/dev/null || true + echo "[e2e-infra] Infrastructure ready." echo " Prometheus: http://prometheus.$NAMESPACE.svc.cluster.local:9090" +echo " Loki: http://loki.$NAMESPACE.svc.cluster.local:3100" echo " Registry: http://registry.$NAMESPACE.svc.cluster.local:5000" From cedd715ee39a23de0b1232585bf007ff4201a8b4 Mon Sep 17 00:00:00 2001 From: copilot Date: Sat, 27 Jun 2026 14:40:37 +0000 Subject: [PATCH 06/35] test(e2e): add discovery-loki Chainsaw test for Loki + eventPullTime Add a DiscoveryPolicy e2e suite that runs a Loki range query with the kubernetesEvents parser and derives p50 cold-pull-time and failure-count eventPullTime signals from the seeded image-pull events, asserting the pipeline reports Ready=Synced and discovers the expected images. Also refresh the e2e README scenario table (discovery, discovery-loki, discovery-registry). --- test/e2e/README.md | 4 +- .../discovery-loki/01-discoverypolicy.yaml | 38 +++++++++++++++++++ .../02-assert-discovery-status.yaml | 23 +++++++++++ test/e2e/discovery-loki/chainsaw-test.yaml | 27 +++++++++++++ 4 files changed, 91 insertions(+), 1 deletion(-) create mode 100644 test/e2e/discovery-loki/01-discoverypolicy.yaml create mode 100644 test/e2e/discovery-loki/02-assert-discovery-status.yaml create mode 100644 test/e2e/discovery-loki/chainsaw-test.yaml diff --git a/test/e2e/README.md b/test/e2e/README.md index 70b9987..3199034 100644 --- a/test/e2e/README.md +++ b/test/e2e/README.md @@ -21,5 +21,7 @@ make test-e2e-chainsaw | `cachedimage-basic/` | Basic CachedImage creation and pod scheduling | | `cachedimage-pacing/` | PullPolicy pacing enforcement | | `cachedimageset/` | CachedImageSet managing child resources | -| `discovery-prometheus/` | DiscoveryPolicy with mock Prometheus | +| `discovery/` | DiscoveryPolicy with mock Prometheus | +| `discovery-loki/` | DiscoveryPolicy with mock Loki + eventPullTime signals | +| `discovery-registry/` | DiscoveryPolicy listing tags from a mock registry | | `pull-policy-backoff/` | Failure backoff behavior | diff --git a/test/e2e/discovery-loki/01-discoverypolicy.yaml b/test/e2e/discovery-loki/01-discoverypolicy.yaml new file mode 100644 index 0000000..08c30ec --- /dev/null +++ b/test/e2e/discovery-loki/01-discoverypolicy.yaml @@ -0,0 +1,38 @@ +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: e2e-loki +spec: + queries: + - name: image-pull-events + type: loki + loki: + endpoint: "http://loki.e2e-infra.svc.cluster.local:3100" + queryType: range + lookback: 24h + query: '{job="kubelet"}' + parser: + type: kubernetesEvents + signals: + # Median cold-pull time derived from the "Successfully pulled ... in Xs" messages. + - name: p50-cold-pull-time + queryRef: image-pull-events + type: eventPullTime + eventPullTime: + statistic: p50 + durationMode: messageDuration + includeCacheHits: false + # Number of pull failures per image. + - name: pull-failures + queryRef: image-pull-events + type: eventPullTime + eventPullTime: + statistic: failureCount + durationMode: messageDuration + includeCacheHits: false + ranking: + strategy: signal + signal: + signalRef: p50-cold-pull-time + syncInterval: 30s + maxImages: 10 diff --git a/test/e2e/discovery-loki/02-assert-discovery-status.yaml b/test/e2e/discovery-loki/02-assert-discovery-status.yaml new file mode 100644 index 0000000..6404e82 --- /dev/null +++ b/test/e2e/discovery-loki/02-assert-discovery-status.yaml @@ -0,0 +1,23 @@ +# Assert that the DiscoveryPolicy with a Loki query + eventPullTime signals +# executed the full pipeline successfully: +# - Ready=True with reason Synced +# - The Loki query succeeded +# - The eventPullTime signals produced per-image values +# - Images parsed from kubelet pull events were discovered and ranked +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: e2e-loki +status: + (conditions[?type == 'Ready']): + - status: "True" + reason: Synced + (queryCount == `1`): true + (imageCount > `0`): true + (queryResults[?name == 'image-pull-events'] | [0].status): success + (queryResults[?name == 'image-pull-events'] | [0].type): loki + (signalResults[?name == 'p50-cold-pull-time'] | [0].status): success + (signalResults[?name == 'p50-cold-pull-time'] | [0].images > `0`): true + (signalResults[?name == 'pull-failures'] | [0].status): success + (length(discoveredImages[?contains(image, 'test/myapp:v1')]) > `0`): true + (length(discoveredImages[?contains(image, 'test/worker:v2')]) > `0`): true diff --git a/test/e2e/discovery-loki/chainsaw-test.yaml b/test/e2e/discovery-loki/chainsaw-test.yaml new file mode 100644 index 0000000..1cf7af7 --- /dev/null +++ b/test/e2e/discovery-loki/chainsaw-test.yaml @@ -0,0 +1,27 @@ +# yaml-language-server: $schema=https://raw.githubusercontent.com/kyverno/chainsaw/main/.schemas/json/test-chainsaw-v1alpha1.json +apiVersion: chainsaw.kyverno.io/v1alpha1 +kind: Test +metadata: + name: discovery-loki +spec: + description: | + Verify that a DiscoveryPolicy with a Loki query and the kubernetesEvents + parser derives eventPullTime signals (cold-pull time and failure count) from + seeded image-pull events and populates status.discoveredImages. + steps: + - name: Create DiscoveryPolicy with a Loki query and eventPullTime signals + try: + - apply: + file: 01-discoverypolicy.yaml + - name: Assert pipeline executed and images were discovered from Loki events + try: + - assert: + timeout: 120s + file: 02-assert-discovery-status.yaml + - name: Cleanup + try: + - delete: + ref: + apiVersion: drop.corewire.io/v1alpha1 + kind: DiscoveryPolicy + name: e2e-loki From 7a44c2d2bda60b23395fbbabfd62106e6c7f28dc Mon Sep 17 00:00:00 2001 From: copilot Date: Sat, 27 Jun 2026 14:53:19 +0000 Subject: [PATCH 07/35] test(e2e): drop Loki readiness probe to match prometheus/registry infra The kubelet readiness probe against Loki's /ready was flaky during ring stabilization (the probe's 1s timeout was exceeded and /ready returns 503 until the ingester settles), leaving the deployment stuck as not-available. The existing Prometheus and registry manifests use no readiness probe; the seed job already polls /ready before pushing and consumers retry, so gate readiness the same way for consistency and reliability. --- hack/e2e-infra/loki.yaml | 6 ------ 1 file changed, 6 deletions(-) diff --git a/hack/e2e-infra/loki.yaml b/hack/e2e-infra/loki.yaml index 7ed1939..6cbd5ac 100644 --- a/hack/e2e-infra/loki.yaml +++ b/hack/e2e-infra/loki.yaml @@ -61,12 +61,6 @@ spec: - "-config.file=/etc/loki/loki.yaml" ports: - containerPort: 3100 - readinessProbe: - httpGet: - path: /ready - port: 3100 - initialDelaySeconds: 15 - periodSeconds: 5 volumeMounts: - name: config mountPath: /etc/loki From 769a38d602c2b94995e2b4ee4321faa51fb64601 Mon Sep 17 00:00:00 2001 From: copilot Date: Sat, 27 Jun 2026 14:55:02 +0000 Subject: [PATCH 08/35] test(e2e): assert all three seeded images are discovered from Loki Also assert test/tools:v1 (the third seeded image) appears in status.discoveredImages so the assertions cover the full seed dataset. --- test/e2e/discovery-loki/02-assert-discovery-status.yaml | 1 + 1 file changed, 1 insertion(+) diff --git a/test/e2e/discovery-loki/02-assert-discovery-status.yaml b/test/e2e/discovery-loki/02-assert-discovery-status.yaml index 6404e82..bb51364 100644 --- a/test/e2e/discovery-loki/02-assert-discovery-status.yaml +++ b/test/e2e/discovery-loki/02-assert-discovery-status.yaml @@ -21,3 +21,4 @@ status: (signalResults[?name == 'pull-failures'] | [0].status): success (length(discoveredImages[?contains(image, 'test/myapp:v1')]) > `0`): true (length(discoveredImages[?contains(image, 'test/worker:v2')]) > `0`): true + (length(discoveredImages[?contains(image, 'test/tools:v1')]) > `0`): true From 44f271b6af94fa794f861465c593d9172de1e609 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Sat, 27 Jun 2026 19:40:21 +0000 Subject: [PATCH 09/35] fix(e2e): restore Loki readiness probe, fix README, remove stale assert The readiness probe was dropped in the previous commit because the 1s timeout was too short for ring stabilization. Without any probe, kubectl wait --for=condition=available succeeds as soon as the container starts (before Loki's HTTP server accepts requests), so the seed job could run against a not-yet-ready Loki. Re-add the probe with a longer 5s timeout and 15s initial delay, giving Loki up to ~105s to pass before the Deployment is marked Available and the setup.sh seed step begins. Also: - Remove stale 02-assert-notimplemented.yaml (controller no longer returns NotImplemented; file was unused by any chainsaw-test.yaml) - Fix test/e2e/README.md: wrong make target, wrong scenario names, missing scenarios (cachedimageset-discovery, discovery-failure) - Update Makefile e2e-infra comment and CI step name to include Loki --- .github/workflows/ci.yml | 2 +- Makefile | 2 +- hack/e2e-infra/loki.yaml | 8 ++++++++ test/e2e/README.md | 6 ++++-- .../discovery-failure/02-assert-notimplemented.yaml | 10 ---------- 5 files changed, 14 insertions(+), 14 deletions(-) delete mode 100644 test/e2e/discovery-failure/02-assert-notimplemented.yaml diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index 6013ae7..f076e5e 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -133,7 +133,7 @@ jobs: make controller-gen make sync-crds kubectl apply -f config/crd/bases/ - - name: Deploy E2E infrastructure (Prometheus + Registry) + - name: Deploy E2E infrastructure (Prometheus, Loki, Registry) run: make e2e-infra - name: Deploy operator run: | diff --git a/Makefile b/Makefile index 13ece82..cd1e033 100644 --- a/Makefile +++ b/Makefile @@ -103,7 +103,7 @@ uninstall: manifests kustomize ## Uninstall CRDs from cluster. $(KUSTOMIZE) build config/crd | $(KUBECTL) delete --ignore-not-found -f - .PHONY: e2e-infra -e2e-infra: ## Deploy Prometheus + Registry for E2E/dev. +e2e-infra: ## Deploy Prometheus, Loki, and Registry for E2E/dev. @chmod +x hack/e2e-infra/setup.sh && hack/e2e-infra/setup.sh ##@ Docker diff --git a/hack/e2e-infra/loki.yaml b/hack/e2e-infra/loki.yaml index 6cbd5ac..7636d35 100644 --- a/hack/e2e-infra/loki.yaml +++ b/hack/e2e-infra/loki.yaml @@ -66,6 +66,14 @@ spec: mountPath: /etc/loki - name: data mountPath: /loki + readinessProbe: + httpGet: + path: /ready + port: 3100 + initialDelaySeconds: 15 + periodSeconds: 5 + timeoutSeconds: 5 + failureThreshold: 18 resources: requests: cpu: 50m diff --git a/test/e2e/README.md b/test/e2e/README.md index 3199034..e144451 100644 --- a/test/e2e/README.md +++ b/test/e2e/README.md @@ -11,7 +11,7 @@ This directory contains scenario-based E2E tests using [Kyverno Chainsaw](https: ```bash # From repo root -make test-e2e-chainsaw +make test-e2e ``` ## Test Scenarios @@ -19,9 +19,11 @@ make test-e2e-chainsaw | Directory | Description | |-----------|-------------| | `cachedimage-basic/` | Basic CachedImage creation and pod scheduling | +| `cachedimage-failure/` | Failure backoff and Degraded phase behavior | | `cachedimage-pacing/` | PullPolicy pacing enforcement | | `cachedimageset/` | CachedImageSet managing child resources | +| `cachedimageset-discovery/` | CachedImageSet backed by a DiscoveryPolicy | | `discovery/` | DiscoveryPolicy with mock Prometheus | +| `discovery-failure/` | DiscoveryPolicy with unreachable Prometheus endpoint | | `discovery-loki/` | DiscoveryPolicy with mock Loki + eventPullTime signals | | `discovery-registry/` | DiscoveryPolicy listing tags from a mock registry | -| `pull-policy-backoff/` | Failure backoff behavior | diff --git a/test/e2e/discovery-failure/02-assert-notimplemented.yaml b/test/e2e/discovery-failure/02-assert-notimplemented.yaml deleted file mode 100644 index 17bc32a..0000000 --- a/test/e2e/discovery-failure/02-assert-notimplemented.yaml +++ /dev/null @@ -1,10 +0,0 @@ -# Assert that DiscoveryPolicy with the new pipeline schema gets NotImplemented condition. -# Pipeline execution is not yet implemented; DNS error testing will be re-enabled in Issue 2. -apiVersion: drop.corewire.io/v1alpha1 -kind: DiscoveryPolicy -metadata: - name: test-broken-prom -status: - (conditions[?type == 'Ready']): - - status: "False" - reason: NotImplemented From 0c52eff8557b0dac7da693d11ddb02633a8393d6 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Sat, 27 Jun 2026 20:33:00 +0000 Subject: [PATCH 10/35] Apply remaining changes --- hack/e2e-infra/seed-loki-job.yaml | 22 ++++++++++++++----- hack/e2e-infra/setup.sh | 5 +++-- .../discovery-loki/01-discoverypolicy.yaml | 2 +- 3 files changed, 21 insertions(+), 8 deletions(-) diff --git a/hack/e2e-infra/seed-loki-job.yaml b/hack/e2e-infra/seed-loki-job.yaml index ca36793..ae33be5 100644 --- a/hack/e2e-infra/seed-loki-job.yaml +++ b/hack/e2e-infra/seed-loki-job.yaml @@ -21,13 +21,19 @@ spec: # Wait for Loki to be ready echo "Waiting for Loki..." + READY=0 for i in $(seq 1 60); do if curl -sf "$LOKI/ready" >/dev/null 2>&1; then echo "Loki is ready" + READY=1 break fi sleep 2 done + if [ "$READY" -ne 1 ]; then + echo "ERROR: Loki did not become ready in time" + exit 1 + fi # Base timestamp (Unix nanoseconds). Each entry adds a small offset so # values are uniquely ordered within the stream. @@ -59,12 +65,18 @@ spec: add "Successfully pulled image \"$REGISTRY/test/tools:v1\" in 1.0s (1.0s including waiting)" add "Successfully pulled image \"$REGISTRY/test/tools:v1\" in 2.0s (2.0s including waiting)" - PAYLOAD="{\"streams\":[{\"stream\":{\"job\":\"kubelet\",\"namespace\":\"default\"},\"values\":[$ENTRIES]}]}" + PAYLOAD="{\"streams\":[{\"stream\":{\"job\":\"kubelet\",\"namespace\":\"default\",\"drop_e2e\":\"true\"},\"values\":[$ENTRIES]}]}" echo "Pushing image-pull events to Loki..." - CODE=$(printf '%s' "$PAYLOAD" | curl -s -o /dev/stderr -w '%{http_code}' \ + RESP_FILE=$(mktemp) + CODE=$(printf '%s' "$PAYLOAD" | curl -s -o "$RESP_FILE" -w '%{http_code}' \ -X POST -H 'Content-Type: application/json' \ --data-binary @- "$LOKI/loki/api/v1/push") + RESP_BODY="$(cat "$RESP_FILE")" + rm -f "$RESP_FILE" + if [ -n "$RESP_BODY" ]; then + echo "$RESP_BODY" + fi echo "push HTTP $CODE" case "$CODE" in 204|200) echo "Seed events accepted." ;; @@ -75,7 +87,7 @@ spec: echo "Verifying seed events..." for i in $(seq 1 30); do RESULT=$(curl -s -G "$LOKI/loki/api/v1/query_range" \ - --data-urlencode 'query={job="kubelet"}' \ + --data-urlencode 'query={job="kubelet",drop_e2e="true"}' \ --data-urlencode 'limit=10' 2>/dev/null || echo "") if echo "$RESULT" | grep -q "Successfully pulled"; then echo "Seed events are queryable!" @@ -83,5 +95,5 @@ spec: fi sleep 2 done - echo "WARNING: seed events may not be queryable yet" - exit 0 + echo "ERROR: seed events are not queryable" + exit 1 diff --git a/hack/e2e-infra/setup.sh b/hack/e2e-infra/setup.sh index 9a1d01d..31fc872 100755 --- a/hack/e2e-infra/setup.sh +++ b/hack/e2e-infra/setup.sh @@ -48,7 +48,8 @@ echo "[e2e-infra] Waiting for Prometheus to be ready..." kubectl -n "$NAMESPACE" wait --for=condition=available deployment/prometheus --timeout=90s echo "[e2e-infra] Waiting for Loki to be ready..." -kubectl -n "$NAMESPACE" wait --for=condition=available deployment/loki --timeout=120s +# Loki single-binary startup can lag behind registry/prometheus in CI clusters. +kubectl -n "$NAMESPACE" wait --for=condition=available deployment/loki --timeout=300s # --- Seed the registry with a few images --- echo "[e2e-infra] Seeding registry with test images..." @@ -67,7 +68,7 @@ kubectl -n "$NAMESPACE" wait --for=condition=complete job/seed-metrics --timeout # --- Seed Loki with image-pull events --- echo "[e2e-infra] Seeding Loki with image-pull events..." kubectl apply -n "$NAMESPACE" -f "$SCRIPT_DIR/seed-loki-job.yaml" -kubectl -n "$NAMESPACE" wait --for=condition=complete job/seed-loki --timeout=120s 2>/dev/null || true +kubectl -n "$NAMESPACE" wait --for=condition=complete job/seed-loki --timeout=180s echo "[e2e-infra] Infrastructure ready." echo " Prometheus: http://prometheus.$NAMESPACE.svc.cluster.local:9090" diff --git a/test/e2e/discovery-loki/01-discoverypolicy.yaml b/test/e2e/discovery-loki/01-discoverypolicy.yaml index 08c30ec..214ccd0 100644 --- a/test/e2e/discovery-loki/01-discoverypolicy.yaml +++ b/test/e2e/discovery-loki/01-discoverypolicy.yaml @@ -10,7 +10,7 @@ spec: endpoint: "http://loki.e2e-infra.svc.cluster.local:3100" queryType: range lookback: 24h - query: '{job="kubelet"}' + query: '{job="kubelet",drop_e2e="true"}' parser: type: kubernetesEvents signals: From 19db89395bac76c466564127a8286eb828bd0a32 Mon Sep 17 00:00:00 2001 From: Julian Wachter Date: Sun, 28 Jun 2026 12:59:22 +0200 Subject: [PATCH 11/35] fix(devenv): fix tiltfile. --- Tiltfile | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/Tiltfile b/Tiltfile index 36afcc2..1076ad9 100644 --- a/Tiltfile +++ b/Tiltfile @@ -156,10 +156,8 @@ k8s_resource( 'dev-set:cachedimageset', 'dev-set-discovered:cachedimageset', 'dev-prometheus:discoverypolicy', - 'dev-registry:discoverypolicy', + 'dev-hybrid:discoverypolicy', 'test-broken-prom:discoverypolicy', - 'test-broken-registry:discoverypolicy', - 'test-notfound-repo:discoverypolicy', ], labels=['samples'], resource_deps=['drop'], From 260dfbe7f38305e08c70f9f8c837d231f13934c5 Mon Sep 17 00:00:00 2001 From: Julian Wachter Date: Sun, 28 Jun 2026 13:10:43 +0200 Subject: [PATCH 12/35] feat(devenv): devsample up to date with current featureset --- Tiltfile | 8 ++ hack/dev-samples.yaml | 285 +++++++++++++++++++++++++++++++++++++++++- 2 files changed, 292 insertions(+), 1 deletion(-) diff --git a/Tiltfile b/Tiltfile index 1076ad9..77265e1 100644 --- a/Tiltfile +++ b/Tiltfile @@ -156,8 +156,16 @@ k8s_resource( 'dev-set:cachedimageset', 'dev-set-discovered:cachedimageset', 'dev-prometheus:discoverypolicy', + 'dev-prometheus-instant:discoverypolicy', 'dev-hybrid:discoverypolicy', + 'dev-timeweighted:discoverypolicy', + 'dev-window:discoverypolicy', + 'dev-loki:discoverypolicy', + 'dev-registry:discoverypolicy', + 'dev-modelexposure:discoverypolicy', 'test-broken-prom:discoverypolicy', + 'test-broken-registry:discoverypolicy', + 'test-notfound-repo:discoverypolicy', ], labels=['samples'], resource_deps=['drop'], diff --git a/hack/dev-samples.yaml b/hack/dev-samples.yaml index 2c52eb1..17e843d 100644 --- a/hack/dev-samples.yaml +++ b/hack/dev-samples.yaml @@ -1,4 +1,24 @@ -# Dev samples: deployed by Tilt for interactive testing +# Dev samples: deployed by Tilt for interactive testing. +# +# These samples exercise EVERY feature of the operator so developers can spot +# regressions at a glance in the Tilt UI. They run against the e2e-infra stack +# (Prometheus, Loki, and a seeded OCI registry) that Tilt brings up. +# +# Feature coverage: +# PullPolicy ............ dev-conservative +# CachedImage .......... dev-nginx, dev-redis (healthy), test-invalid-image (broken) +# CachedImageSet ....... dev-set (static), dev-set-discovered (discovery-backed) +# Query: prometheus .... dev-prometheus (range), dev-prometheus-instant (instant) +# Query: loki .......... dev-loki (kubernetesEvents parser) +# Query: registry ...... dev-registry +# Signal: aggregate .... dev-prometheus +# Signal: timeWeighted . dev-timeweighted +# Signal: windowAgg .... dev-window +# Signal: eventPullTime dev-loki +# Ranking: signal ...... dev-prometheus +# Ranking: weightedSum . dev-hybrid +# Ranking: modelExposure dev-modelexposure +# Failure cases ........ test-broken-prom, test-broken-registry, test-notfound-repo --- # === PullPolicy === apiVersion: drop.corewire.io/v1alpha1 @@ -137,6 +157,217 @@ spec: syncInterval: 30s maxImages: 10 --- +# === DiscoveryPolicy: Prometheus instant query === +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: dev-prometheus-instant +spec: + queries: + - name: current-usage + type: prometheus + prometheus: + endpoint: "http://prometheus.e2e-infra.svc.cluster.local:9090" + queryType: instant + query: 'container_memory_working_set_bytes{namespace="build-stuff"}' + signals: + - name: current + queryRef: current-usage + type: aggregate + aggregate: + method: max + ranking: + strategy: signal + signal: + signalRef: current + syncInterval: 30s + maxImages: 10 +--- +# === DiscoveryPolicy: timeWeightedAggregate signal === +# Weights samples by hour-of-day before aggregating. +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: dev-timeweighted +spec: + queries: + - name: runner-image-usage + type: prometheus + prometheus: + endpoint: "http://prometheus.e2e-infra.svc.cluster.local:9090" + queryType: range + lookback: 24h + step: 5m + query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff"}) by (image)' + signals: + - name: business-hours-usage + queryRef: runner-image-usage + type: timeWeightedAggregate + timeWeightedAggregate: + method: sum + timezone: "UTC" + defaultWeight: "1" + windows: + - startHour: 8 + endHour: 18 + weight: "2" + ranking: + strategy: signal + signal: + signalRef: business-hours-usage + syncInterval: 30s + maxImages: 10 +--- +# === DiscoveryPolicy: windowAggregate signal (relative window) === +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: dev-window +spec: + queries: + - name: runner-image-usage + type: prometheus + prometheus: + endpoint: "http://prometheus.e2e-infra.svc.cluster.local:9090" + queryType: range + lookback: 24h + step: 5m + query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff"}) by (image)' + signals: + - name: recent-usage + queryRef: runner-image-usage + type: windowAggregate + windowAggregate: + method: sum + relativeWindow: 6h + ranking: + strategy: signal + signal: + signalRef: recent-usage + syncInterval: 30s + maxImages: 10 +--- +# === DiscoveryPolicy: Loki query + eventPullTime signals === +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: dev-loki +spec: + queries: + - name: image-pull-events + type: loki + loki: + endpoint: "http://loki.e2e-infra.svc.cluster.local:3100" + queryType: range + lookback: 24h + query: '{job="kubelet",drop_e2e="true"}' + parser: + type: kubernetesEvents + signals: + - name: p50-cold-pull-time + queryRef: image-pull-events + type: eventPullTime + eventPullTime: + statistic: p50 + durationMode: messageDuration + includeCacheHits: false + - name: pull-failures + queryRef: image-pull-events + type: eventPullTime + eventPullTime: + statistic: failureCount + durationMode: messageDuration + includeCacheHits: false + ranking: + strategy: signal + signal: + signalRef: p50-cold-pull-time + syncInterval: 30s + maxImages: 10 +--- +# === DiscoveryPolicy: registry tag discovery === +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: dev-registry +spec: + queries: + - name: registry-tags + type: registry + registry: + url: "http://registry.e2e-infra.svc.cluster.local:5000" + repositories: + - test/myapp + - test/worker + - test/tools + tagFilter: "^v" + topX: 5 + signals: + - name: tag-recency + queryRef: registry-tags + type: aggregate + aggregate: + method: sum + ranking: + strategy: signal + signal: + signalRef: tag-recency + syncInterval: 30s + maxImages: 20 +--- +# === DiscoveryPolicy: modelExposure ranking (multi-query) === +# Combines Prometheus usage signals with a Loki pull-time signal. +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: dev-modelexposure +spec: + queries: + - name: runner-image-usage + type: prometheus + prometheus: + endpoint: "http://prometheus.e2e-infra.svc.cluster.local:9090" + queryType: range + lookback: 24h + step: 5m + query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff"}) by (image)' + - name: image-pull-events + type: loki + loki: + endpoint: "http://loki.e2e-infra.svc.cluster.local:3100" + queryType: range + lookback: 24h + query: '{job="kubelet",drop_e2e="true"}' + parser: + type: kubernetesEvents + signals: + - name: pre-usage + queryRef: runner-image-usage + type: aggregate + aggregate: + method: sum + - name: target-usage + queryRef: runner-image-usage + type: aggregate + aggregate: + method: max + - name: pull-time + queryRef: image-pull-events + type: eventPullTime + eventPullTime: + statistic: p50 + durationMode: messageDuration + includeCacheHits: false + ranking: + strategy: modelExposure + modelExposure: + nodeCount: 3 + preWindowUsageSignalRef: pre-usage + targetWindowUsageSignalRef: target-usage + pullTimeSignalRef: pull-time + syncInterval: 30s + maxImages: 10 +--- # === DiscoveryPolicy: broken Prometheus endpoint (DNS error) === apiVersion: drop.corewire.io/v1alpha1 kind: DiscoveryPolicy @@ -161,3 +392,55 @@ spec: signalRef: total-usage syncInterval: 30m maxImages: 10 +--- +# === DiscoveryPolicy: broken registry endpoint (DNS error) === +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: test-broken-registry +spec: + queries: + - name: broken-registry + type: registry + registry: + url: "http://nonexistent-registry:5000" + repositories: + - test/app + signals: + - name: tag-recency + queryRef: broken-registry + type: aggregate + aggregate: + method: sum + ranking: + strategy: signal + signal: + signalRef: tag-recency + syncInterval: 30m + maxImages: 10 +--- +# === DiscoveryPolicy: registry repository not found (404) === +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: test-notfound-repo +spec: + queries: + - name: missing-repo + type: registry + registry: + url: "http://registry.e2e-infra.svc.cluster.local:5000" + repositories: + - test/does-not-exist + signals: + - name: tag-recency + queryRef: missing-repo + type: aggregate + aggregate: + method: sum + ranking: + strategy: signal + signal: + signalRef: tag-recency + syncInterval: 30m + maxImages: 10 From 480a13bf3c116f33aa9cc04738111245f062418b Mon Sep 17 00:00:00 2001 From: Julian Wachter Date: Sun, 28 Jun 2026 13:38:39 +0200 Subject: [PATCH 13/35] feat(crds): slim the crd status to not pollute etcd --- api/v1alpha1/discoverypolicy_types.go | 74 +-------- .../controller/discoverypolicy_controller.go | 9 -- .../discoverypolicy_controller_test.go | 3 - internal/discovery/engine.go | 140 +++--------------- internal/discovery/engine_test.go | 12 +- internal/discovery/loki.go | 7 +- 6 files changed, 29 insertions(+), 216 deletions(-) diff --git a/api/v1alpha1/discoverypolicy_types.go b/api/v1alpha1/discoverypolicy_types.go index c832ca7..2e49cb2 100644 --- a/api/v1alpha1/discoverypolicy_types.go +++ b/api/v1alpha1/discoverypolicy_types.go @@ -535,17 +535,8 @@ const ( type QueryResult struct { // Name matches the queries[].name that produced this result. Name string `json:"name"` - // Type is the query backend type (prometheus or loki). + // Type is the query backend type (prometheus, loki, or registry). Type DiscoveryQueryType `json:"type"` - // Series is the number of time-series returned (Prometheus queries only). - // +optional - Series *int32 `json:"series,omitempty"` - // Samples is the total number of data points across all series (Prometheus range queries only). - // +optional - Samples *int64 `json:"samples,omitempty"` - // Records is the number of log records returned (Loki queries only). - // +optional - Records *int64 `json:"records,omitempty"` // Status is "success" or "failed". Status QueryResultStatus `json:"status"` // Message describes the failure reason when status=failed. @@ -553,50 +544,6 @@ type QueryResult struct { Message string `json:"message,omitempty"` } -// SignalResult reports the outcome of a single signal derivation. -type SignalResult struct { - // Name matches the signals[].name that produced this result. - Name string `json:"name"` - // Images is the number of images for which this signal produced a value. - Images int32 `json:"images"` - // Status is "success" or "failed". - Status string `json:"status"` - // Message describes the failure reason when status=failed. - // +optional - Message string `json:"message,omitempty"` -} - -// ImageSignalValue records the raw and normalized value of a signal for one image. -type ImageSignalValue struct { - // Name is the signal name. - Name string `json:"name"` - // RawValue is the unscaled signal value as a decimal string. - RawValue string `json:"rawValue"` - // NormalizedValue is the normalized value (after minMax or other normalization) as a decimal string. - // Only populated for signals used in a weightedSum ranking. - // +optional - NormalizedValue string `json:"normalizedValue,omitempty"` -} - -// RankingTerm records the contribution of one signal to the final score of an image. -type RankingTerm struct { - // Signal is the signal name. - Signal string `json:"signal"` - // Weight is the configured weight as a decimal string. - Weight string `json:"weight"` - // Contribution is weight * normalizedValue as a decimal string. - Contribution string `json:"contribution"` -} - -// ImageRankingDetail explains how the final score was computed for one image. -type ImageRankingDetail struct { - // Strategy is the ranking strategy that produced this detail. - Strategy string `json:"strategy"` - // Terms lists the per-signal contributions (populated for weightedSum and modelExposure). - // +optional - Terms []RankingTerm `json:"terms,omitempty"` -} - // DiscoveredImage represents a single discovered and ranked image. type DiscoveredImage struct { // Image is the fully qualified image reference. @@ -605,15 +552,6 @@ type DiscoveredImage struct { Rank int32 `json:"rank"` // FinalScore is the computed ranking score as a decimal string. FinalScore string `json:"finalScore"` - // Selected is true when this image is within the maxImages cap and will be - // propagated to dependent CachedImageSet resources. - Selected bool `json:"selected"` - // Signals lists the per-signal values used during ranking (for observability). - // +optional - Signals []ImageSignalValue `json:"signals,omitempty"` - // Ranking explains how the final score was computed. - // +optional - Ranking *ImageRankingDetail `json:"ranking,omitempty"` } // DiscoveryPolicyStatus defines the observed state of DiscoveryPolicy. @@ -624,19 +562,12 @@ type DiscoveryPolicyStatus struct { // QueryResults reports the outcome of each named query execution. // +optional QueryResults []QueryResult `json:"queryResults,omitempty"` - // SignalResults reports the outcome of each signal derivation. - // +optional - SignalResults []SignalResult `json:"signalResults,omitempty"` // DiscoveredImages is the ordered list of discovered and ranked images. - // Only images with selected=true are propagated to dependent CachedImageSet resources. // +optional DiscoveredImages []DiscoveredImage `json:"discoveredImages,omitempty"` - // ImageCount is the number of selected discovered images. + // ImageCount is the number of discovered images. // +optional ImageCount int32 `json:"imageCount,omitempty"` - // QueryCount is the number of configured queries. - // +optional - QueryCount int32 `json:"queryCount,omitempty"` // Conditions represent the latest available observations. // +optional Conditions []metav1.Condition `json:"conditions,omitempty"` @@ -646,7 +577,6 @@ type DiscoveryPolicyStatus struct { // +kubebuilder:subresource:status // +kubebuilder:resource:scope=Cluster,categories=drop // +kubebuilder:printcolumn:name="Status",type=string,JSONPath=`.status.conditions[?(@.type=="Ready")].reason` -// +kubebuilder:printcolumn:name="Queries",type=integer,JSONPath=`.status.queryCount` // +kubebuilder:printcolumn:name="Images",type=integer,JSONPath=`.status.imageCount` // +kubebuilder:printcolumn:name="LastSync",type=date,JSONPath=`.status.lastSyncTime` // +kubebuilder:printcolumn:name="Message",type=string,JSONPath=`.status.conditions[?(@.type=="Ready")].message`,priority=1 diff --git a/internal/controller/discoverypolicy_controller.go b/internal/controller/discoverypolicy_controller.go index f8f7f2c..f8b6ed6 100644 --- a/internal/controller/discoverypolicy_controller.go +++ b/internal/controller/discoverypolicy_controller.go @@ -75,9 +75,7 @@ func (r *DiscoveryPolicyReconciler) Reconcile(ctx context.Context, req ctrl.Requ now := metav1.Now() dp.Status.LastSyncTime = &now - dp.Status.QueryCount = int32(len(dp.Spec.Queries)) dp.Status.QueryResults = result.QueryResults - dp.Status.SignalResults = result.SignalResults dp.Status.DiscoveredImages = result.Images dp.Status.ImageCount = int32(len(result.Images)) @@ -91,13 +89,6 @@ func (r *DiscoveryPolicyReconciler) Reconcile(ctx context.Context, req ctrl.Requ healthy = 1 } dropmetrics.DiscoverySourceHealth.WithLabelValues(dp.Name, string(qr.Type), qr.Name).Set(healthy) - if qr.Status == dropv1alpha1.QueryResultStatusSuccess { - images := 0 - if qr.Series != nil { - images = int(*qr.Series) - } - dropmetrics.DiscoveryImagesFound.WithLabelValues(dp.Name, string(qr.Type)).Set(float64(images)) - } } // 4. Set Ready condition diff --git a/internal/controller/discoverypolicy_controller_test.go b/internal/controller/discoverypolicy_controller_test.go index 095996c..8dc3119 100644 --- a/internal/controller/discoverypolicy_controller_test.go +++ b/internal/controller/discoverypolicy_controller_test.go @@ -92,9 +92,6 @@ var _ = Describe("DiscoveryPolicy Controller", func() { Expect(readyCondition.Status).To(Equal(metav1.ConditionFalse)) // Reason is one of ConnectionRefused / SyncFailed depending on OS Expect(readyCondition.Reason).NotTo(BeEmpty()) - - // queryCount should reflect the spec - Expect(updated.Status.QueryCount).To(Equal(int32(1))) }) It("reconciles successfully with a registry query that lists from a mock server", func() { diff --git a/internal/discovery/engine.go b/internal/discovery/engine.go index dd3faef..c39bb01 100644 --- a/internal/discovery/engine.go +++ b/internal/discovery/engine.go @@ -14,11 +14,6 @@ import ( dropv1alpha1 "github.com/corewire/drop/api/v1alpha1" ) -const ( - signalStatusFailed = "failed" - signalStatusSuccess = "success" -) - // QueryRawData holds raw per-image samples from a single query execution. // For prometheus range queries each image may have multiple samples. // For prometheus instant and registry queries each image has exactly one sample. @@ -38,9 +33,8 @@ type TimedSample struct { // PipelineResult is the output of a full pipeline execution. type PipelineResult struct { - QueryResults []dropv1alpha1.QueryResult - SignalResults []dropv1alpha1.SignalResult - Images []dropv1alpha1.DiscoveredImage + QueryResults []dropv1alpha1.QueryResult + Images []dropv1alpha1.DiscoveredImage } // HTTPClientFunc builds an HTTP client for a query (used by the controller to inject auth/TLS). @@ -48,10 +42,8 @@ type HTTPClientFunc func(ctx context.Context, queryName string) (*http.Client, e // scoredItem is an intermediate ranked image used during the ranking stage. type scoredItem struct { - image string - score float64 - signals []dropv1alpha1.ImageSignalValue - ranking *dropv1alpha1.ImageRankingDetail + image string + score float64 } // ExecutePipeline runs all stages of the discovery pipeline and returns a PipelineResult. @@ -98,21 +90,14 @@ func ExecutePipeline( // Stage 2 — Derive signals // ────────────────────────────────────────────────────────── signalValues := make(map[string]map[string]float64, len(spec.Signals)) - sResults := make([]dropv1alpha1.SignalResult, 0, len(spec.Signals)) for _, sig := range spec.Signals { raw, ok := rawByQuery[sig.QueryRef] if !ok { - sResults = append(sResults, dropv1alpha1.SignalResult{ - Name: sig.Name, - Status: signalStatusFailed, - Message: fmt.Sprintf("query %q did not produce results (query failed or missing)", sig.QueryRef), - }) continue } - values, sr := deriveSignal(sig, raw) - sResults = append(sResults, sr) + values := deriveSignal(sig, raw) if values != nil { signalValues[sig.Name] = values } @@ -144,17 +129,13 @@ func ExecutePipeline( if maxImages <= 0 { maxImages = 50 } - for i := range discovered { - discovered[i].Selected = i < maxImages - } if len(discovered) > maxImages { discovered = discovered[:maxImages] } return PipelineResult{ - QueryResults: qResults, - SignalResults: sResults, - Images: discovered, + QueryResults: qResults, + Images: discovered, } } @@ -175,10 +156,6 @@ func executeQuery(ctx context.Context, q dropv1alpha1.DiscoveryQuery, httpClient qr.Message = err.Error() return nil, qr } - total := countSamples(raw.Samples) - series := int32(len(raw.Samples)) - qr.Series = &series - qr.Samples = &total qr.Status = dropv1alpha1.QueryResultStatusSuccess return raw, qr @@ -194,8 +171,6 @@ func executeQuery(ctx context.Context, q dropv1alpha1.DiscoveryQuery, httpClient qr.Message = err.Error() return nil, qr } - series := int32(len(raw.Samples)) - qr.Series = &series qr.Status = dropv1alpha1.QueryResultStatusSuccess return raw, qr @@ -211,8 +186,6 @@ func executeQuery(ctx context.Context, q dropv1alpha1.DiscoveryQuery, httpClient qr.Message = err.Error() return nil, qr } - records := countSamples(raw.Samples) - qr.Records = &records qr.Status = dropv1alpha1.QueryResultStatusSuccess return raw, qr @@ -285,68 +258,42 @@ func executeLokiQuery(ctx context.Context, cfg *dropv1alpha1.DiscoveryLokiQuery, } // deriveSignal computes per-image float64 values for a single signal. -func deriveSignal(sig dropv1alpha1.DiscoverySignal, raw *QueryRawData) (map[string]float64, dropv1alpha1.SignalResult) { - sr := dropv1alpha1.SignalResult{Name: sig.Name} - +func deriveSignal(sig dropv1alpha1.DiscoverySignal, raw *QueryRawData) map[string]float64 { switch sig.Type { case dropv1alpha1.SignalTypeAggregate: if sig.Aggregate == nil { - sr.Status = signalStatusFailed - sr.Message = "aggregate config is required when type=aggregate" - return nil, sr + return nil } - values := aggregateSamples(raw.Samples, sig.Aggregate.Method, nil) - sr.Images = int32(len(values)) - sr.Status = "success" - return values, sr + return aggregateSamples(raw.Samples, sig.Aggregate.Method, nil) case dropv1alpha1.SignalTypeTimeWeightedAggregate: if sig.TimeWeightedAggregate == nil { - sr.Status = signalStatusFailed - sr.Message = "timeWeightedAggregate config is required when type=timeWeightedAggregate" - return nil, sr + return nil } values, err := deriveTimeWeightedAggregate(raw.Samples, sig.TimeWeightedAggregate) if err != nil { - sr.Status = signalStatusFailed - sr.Message = err.Error() - return nil, sr + return nil } - sr.Images = int32(len(values)) - sr.Status = "success" - return values, sr + return values case dropv1alpha1.SignalTypeWindowAggregate: if sig.WindowAggregate == nil { - sr.Status = signalStatusFailed - sr.Message = "windowAggregate config is required when type=windowAggregate" - return nil, sr + return nil } values, err := deriveWindowAggregate(raw.Samples, sig.WindowAggregate) if err != nil { - sr.Status = signalStatusFailed - sr.Message = err.Error() - return nil, sr + return nil } - sr.Images = int32(len(values)) - sr.Status = "success" - return values, sr + return values case dropv1alpha1.SignalTypeEventPullTime: if sig.EventPullTime == nil { - sr.Status = signalStatusFailed - sr.Message = "eventPullTime config is required when type=eventPullTime" - return nil, sr + return nil } - values := deriveEventPullTime(raw.Samples, sig.EventPullTime) - sr.Images = int32(len(values)) - sr.Status = signalStatusSuccess - return values, sr + return deriveEventPullTime(raw.Samples, sig.EventPullTime) default: - sr.Status = signalStatusFailed - sr.Message = fmt.Sprintf("unsupported signal type: %s", sig.Type) - return nil, sr + return nil } } @@ -530,11 +477,6 @@ func rankImages(ranking *dropv1alpha1.DiscoveryRanking, signals map[string]map[s items = append(items, scoredItem{ image: img, score: v, - signals: []dropv1alpha1.ImageSignalValue{{ - Name: ref, - RawValue: strconv.FormatFloat(v, 'f', -1, 64), - }}, - ranking: &dropv1alpha1.ImageRankingDetail{Strategy: string(ranking.Strategy)}, }) } @@ -569,8 +511,6 @@ func rankImages(ranking *dropv1alpha1.DiscoveryRanking, signals map[string]map[s Image: it.image, Rank: int32(i + 1), FinalScore: strconv.FormatFloat(it.score, 'f', -1, 64), - Signals: it.signals, - Ranking: it.ranking, } } return out @@ -611,8 +551,6 @@ func weightedSumRank(cfg *dropv1alpha1.WeightedSumRankingConfig, signals map[str var out []scoredItem for _, img := range images { var totalScore float64 - sigVals := make([]dropv1alpha1.ImageSignalValue, 0, len(cfg.Terms)) - terms := make([]dropv1alpha1.RankingTerm, 0, len(cfg.Terms)) drop := false for _, term := range cfg.Terms { @@ -628,31 +566,14 @@ func weightedSumRank(cfg *dropv1alpha1.WeightedSumRankingConfig, signals map[str b := bounds[term.SignalRef] norm := normalize(v, b) wf := term.Weight.AsApproximateFloat64() - contribution := wf * norm - totalScore += contribution - - sigVals = append(sigVals, dropv1alpha1.ImageSignalValue{ - Name: term.SignalRef, - RawValue: strconv.FormatFloat(v, 'f', -1, 64), - NormalizedValue: strconv.FormatFloat(norm, 'f', -1, 64), - }) - terms = append(terms, dropv1alpha1.RankingTerm{ - Signal: term.SignalRef, - Weight: term.Weight.String(), - Contribution: strconv.FormatFloat(contribution, 'f', -1, 64), - }) + totalScore += wf * norm } if drop { continue } out = append(out, scoredItem{ - image: img, - score: totalScore, - signals: sigVals, - ranking: &dropv1alpha1.ImageRankingDetail{ - Strategy: string(dropv1alpha1.RankingStrategyWeightedSum), - Terms: terms, - }, + image: img, + score: totalScore, }) } return out @@ -681,14 +602,6 @@ func modelExposureRank(cfg *dropv1alpha1.ModelExposureRankingConfig, signals map out = append(out, scoredItem{ image: img, score: score, - signals: []dropv1alpha1.ImageSignalValue{ - {Name: cfg.PreWindowUsageSignalRef, RawValue: strconv.FormatFloat(jPre, 'f', -1, 64)}, - {Name: cfg.TargetWindowUsageSignalRef, RawValue: strconv.FormatFloat(jTarget, 'f', -1, 64)}, - {Name: cfg.PullTimeSignalRef, RawValue: strconv.FormatFloat(pHat, 'f', -1, 64)}, - }, - ranking: &dropv1alpha1.ImageRankingDetail{ - Strategy: string(dropv1alpha1.RankingStrategyModelExposure), - }, }) } return out @@ -719,15 +632,6 @@ func collectImages(rawByQuery map[string]*QueryRawData) []string { return images } -// countSamples returns the total number of samples across all images. -func countSamples(samples map[string][]TimedSample) int64 { - var total int64 - for _, pts := range samples { - total += int64(len(pts)) - } - return total -} - // deriveEventPullTime computes per-image pull-time statistics from Loki event samples. // // The samples map is expected to come from a Loki kubernetesEvents query: diff --git a/internal/discovery/engine_test.go b/internal/discovery/engine_test.go index 57c35bc..2fe70b4 100644 --- a/internal/discovery/engine_test.go +++ b/internal/discovery/engine_test.go @@ -71,9 +71,6 @@ func TestExecutePipeline_PrometheusInstant(t *testing.T) { if result.Images[0].Rank != 1 { t.Errorf("expected rank 1, got %d", result.Images[0].Rank) } - if !result.Images[0].Selected { - t.Error("top image should be selected") - } } // TestExecutePipeline_Registry verifies the full pipeline with a registry query. @@ -212,11 +209,6 @@ func TestExecutePipeline_MaxImages(t *testing.T) { if len(result.Images) != 3 { t.Fatalf("expected 3 images (maxImages cap), got %d", len(result.Images)) } - for _, img := range result.Images { - if !img.Selected { - t.Errorf("image %s should be selected (within cap)", img.Image) - } - } } // TestExecutePipeline_QueryFailure verifies failed query results are reported correctly. @@ -240,9 +232,7 @@ func TestExecutePipeline_QueryFailure(t *testing.T) { if result.QueryResults[0].Status != dropv1alpha1.QueryResultStatusFailed { t.Errorf("expected failed query result, got %s", result.QueryResults[0].Status) } - if len(result.SignalResults) != 1 || result.SignalResults[0].Status != signalStatusFailed { - t.Errorf("expected failed signal result when query fails") - } + if len(result.Images) != 0 { t.Errorf("expected no images when query fails, got %d", len(result.Images)) } diff --git a/internal/discovery/loki.go b/internal/discovery/loki.go index 742c3f4..7877263 100644 --- a/internal/discovery/loki.go +++ b/internal/discovery/loki.go @@ -18,6 +18,7 @@ import ( const ( lokiStatusSuccess = "success" + lokiMessageField = "message" // lokiLimitDefault is the maximum number of log entries to fetch per query. lokiLimitDefault = 5000 // lokiFailedSuffix is appended to image keys for pull-failure event counts. @@ -182,8 +183,8 @@ type lokiEventRecord struct { func parseKubernetesEventStreams(streams []lokiStream, parser *dropv1alpha1.LokiParser) map[string][]TimedSample { reasonField := lokiCoalesceField(parser.ReasonField, "reason") podField := lokiCoalesceField(parser.PodField, "involvedObject_name") - messageField := lokiCoalesceField(parser.MessageField, "message") - imageField := lokiCoalesceField(parser.ImageField, "message") + messageField := lokiCoalesceField(parser.MessageField, lokiMessageField) + imageField := lokiCoalesceField(parser.ImageField, lokiMessageField) var records []lokiEventRecord for _, stream := range streams { @@ -225,7 +226,7 @@ func parseKubernetesEventStreams(streams []lokiStream, parser *dropv1alpha1.Loki // Determine the source string for image extraction. var imgSource string - if imageField == messageField || imageField == "message" { + if imageField == messageField || imageField == lokiMessageField { imgSource = rec.message } else { imgSource = stream.Stream[imageField] From ecf9f632dc75235d005e4f7e0c3a7030380f8982 Mon Sep 17 00:00:00 2001 From: Julian Wachter Date: Sun, 28 Jun 2026 13:40:13 +0200 Subject: [PATCH 14/35] feat(tests): rework tests --- .github/copilot-instructions.md | 2 +- api/v1alpha1/zz_generated.deepcopy.go | 101 +--- .../drop.corewire.io_discoverypolicies.yaml | 123 +---- .../docs/reference/_generated_architecture.md | 1 + .../content/docs/reference/_generated_crds.md | 69 +-- docs/static/llms-full.txt | 360 +++++++++++-- knowledge.yaml | 451 ++++++++++++----- llms-full.txt | 360 +++++++++++-- .../01-pullpolicy.yaml | 2 +- .../02-discoverypolicy.yaml | 2 +- .../03-assert-discovery-ready.yaml | 3 +- .../04-cachedimageset.yaml | 6 +- .../chainsaw-test.yaml | 6 +- .../discovery-loki/01-discoverypolicy.yaml | 8 +- .../02-assert-discovery-status.yaml | 12 +- test/e2e/discovery-loki/chainsaw-test.yaml | 2 +- .../01-discoverypolicy.yaml | 2 +- .../02-assert-discovery-status.yaml | 3 +- .../e2e/discovery-registry/chainsaw-test.yaml | 2 +- test/e2e/discovery/01-discoverypolicy.yaml | 2 +- .../discovery/02-assert-discovery-status.yaml | 3 +- .../03-cachedimageset-discovery.yaml | 4 +- test/e2e/discovery/04-assert-children.yaml | 2 +- test/e2e/discovery/chainsaw-test.yaml | 4 +- test/e2e/test-e2e-20260628-133056.log | 478 ++++++++++++++++++ 25 files changed, 1471 insertions(+), 537 deletions(-) create mode 100644 test/e2e/test-e2e-20260628-133056.log diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md index 1b75c05..600fc22 100644 --- a/.github/copilot-instructions.md +++ b/.github/copilot-instructions.md @@ -59,7 +59,7 @@ make docs-gen # regenerate AI docs from source ``` api/v1alpha1 — Package v1alpha1 contains API Schema definitions for the drop v1alpha1 API group. internal/controller — Package controller implements Kubernetes reconcilers for the drop CRDs (one per Kind). - imports: api/v1alpha1, internal/metrics, internal/pacing, internal/podbuilder + imports: api/v1alpha1, internal/discovery, internal/metrics, internal/pacing, internal/podbuilder internal/discovery — Package discovery implements image discovery from registries and Prometheus metrics. imports: api/v1alpha1 internal/metrics — Package metrics registers Prometheus metrics for the drop operator. diff --git a/api/v1alpha1/zz_generated.deepcopy.go b/api/v1alpha1/zz_generated.deepcopy.go index 4c0c209..03d047e 100644 --- a/api/v1alpha1/zz_generated.deepcopy.go +++ b/api/v1alpha1/zz_generated.deepcopy.go @@ -319,16 +319,6 @@ func (in *CachedImageStatus) DeepCopy() *CachedImageStatus { // DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. func (in *DiscoveredImage) DeepCopyInto(out *DiscoveredImage) { *out = *in - if in.Signals != nil { - in, out := &in.Signals, &out.Signals - *out = make([]ImageSignalValue, len(*in)) - copy(*out, *in) - } - if in.Ranking != nil { - in, out := &in.Ranking, &out.Ranking - *out = new(ImageRankingDetail) - (*in).DeepCopyInto(*out) - } } // DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new DiscoveredImage. @@ -485,21 +475,12 @@ func (in *DiscoveryPolicyStatus) DeepCopyInto(out *DiscoveryPolicyStatus) { if in.QueryResults != nil { in, out := &in.QueryResults, &out.QueryResults *out = make([]QueryResult, len(*in)) - for i := range *in { - (*in)[i].DeepCopyInto(&(*out)[i]) - } - } - if in.SignalResults != nil { - in, out := &in.SignalResults, &out.SignalResults - *out = make([]SignalResult, len(*in)) copy(*out, *in) } if in.DiscoveredImages != nil { in, out := &in.DiscoveredImages, &out.DiscoveredImages *out = make([]DiscoveredImage, len(*in)) - for i := range *in { - (*in)[i].DeepCopyInto(&(*out)[i]) - } + copy(*out, *in) } if in.Conditions != nil { in, out := &in.Conditions, &out.Conditions @@ -695,41 +676,6 @@ func (in *ImageEntry) DeepCopy() *ImageEntry { return out } -// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. -func (in *ImageRankingDetail) DeepCopyInto(out *ImageRankingDetail) { - *out = *in - if in.Terms != nil { - in, out := &in.Terms, &out.Terms - *out = make([]RankingTerm, len(*in)) - copy(*out, *in) - } -} - -// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new ImageRankingDetail. -func (in *ImageRankingDetail) DeepCopy() *ImageRankingDetail { - if in == nil { - return nil - } - out := new(ImageRankingDetail) - in.DeepCopyInto(out) - return out -} - -// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. -func (in *ImageSignalValue) DeepCopyInto(out *ImageSignalValue) { - *out = *in -} - -// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new ImageSignalValue. -func (in *ImageSignalValue) DeepCopy() *ImageSignalValue { - if in == nil { - return nil - } - out := new(ImageSignalValue) - in.DeepCopyInto(out) - return out -} - // DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. func (in *LokiParser) DeepCopyInto(out *LokiParser) { *out = *in @@ -876,21 +822,6 @@ func (in *PullPolicySpec) DeepCopy() *PullPolicySpec { // DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. func (in *QueryResult) DeepCopyInto(out *QueryResult) { *out = *in - if in.Series != nil { - in, out := &in.Series, &out.Series - *out = new(int32) - **out = **in - } - if in.Samples != nil { - in, out := &in.Samples, &out.Samples - *out = new(int64) - **out = **in - } - if in.Records != nil { - in, out := &in.Records, &out.Records - *out = new(int64) - **out = **in - } } // DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new QueryResult. @@ -903,21 +834,6 @@ func (in *QueryResult) DeepCopy() *QueryResult { return out } -// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. -func (in *RankingTerm) DeepCopyInto(out *RankingTerm) { - *out = *in -} - -// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new RankingTerm. -func (in *RankingTerm) DeepCopy() *RankingTerm { - if in == nil { - return nil - } - out := new(RankingTerm) - in.DeepCopyInto(out) - return out -} - // DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. func (in *SignalRankingConfig) DeepCopyInto(out *SignalRankingConfig) { *out = *in @@ -933,21 +849,6 @@ func (in *SignalRankingConfig) DeepCopy() *SignalRankingConfig { return out } -// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. -func (in *SignalResult) DeepCopyInto(out *SignalResult) { - *out = *in -} - -// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new SignalResult. -func (in *SignalResult) DeepCopy() *SignalResult { - if in == nil { - return nil - } - out := new(SignalResult) - in.DeepCopyInto(out) - return out -} - // DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. func (in *TimeOfDayWindow) DeepCopyInto(out *TimeOfDayWindow) { *out = *in diff --git a/config/crd/bases/drop.corewire.io_discoverypolicies.yaml b/config/crd/bases/drop.corewire.io_discoverypolicies.yaml index 608aa34..792671b 100644 --- a/config/crd/bases/drop.corewire.io_discoverypolicies.yaml +++ b/config/crd/bases/drop.corewire.io_discoverypolicies.yaml @@ -20,9 +20,6 @@ spec: - jsonPath: .status.conditions[?(@.type=="Ready")].reason name: Status type: string - - jsonPath: .status.queryCount - name: Queries - type: integer - jsonPath: .status.imageCount name: Images type: integer @@ -705,9 +702,8 @@ spec: type: object type: array discoveredImages: - description: |- - DiscoveredImages is the ordered list of discovered and ranked images. - Only images with selected=true are propagated to dependent CachedImageSet resources. + description: DiscoveredImages is the ordered list of discovered and + ranked images. items: description: DiscoveredImage represents a single discovered and ranked image. @@ -724,78 +720,14 @@ spec: ordered list (1 = highest score). format: int32 type: integer - ranking: - description: Ranking explains how the final score was computed. - properties: - strategy: - description: Strategy is the ranking strategy that produced - this detail. - type: string - terms: - description: Terms lists the per-signal contributions (populated - for weightedSum and modelExposure). - items: - description: RankingTerm records the contribution of one - signal to the final score of an image. - properties: - contribution: - description: Contribution is weight * normalizedValue - as a decimal string. - type: string - signal: - description: Signal is the signal name. - type: string - weight: - description: Weight is the configured weight as a - decimal string. - type: string - required: - - contribution - - signal - - weight - type: object - type: array - required: - - strategy - type: object - selected: - description: |- - Selected is true when this image is within the maxImages cap and will be - propagated to dependent CachedImageSet resources. - type: boolean - signals: - description: Signals lists the per-signal values used during - ranking (for observability). - items: - description: ImageSignalValue records the raw and normalized - value of a signal for one image. - properties: - name: - description: Name is the signal name. - type: string - normalizedValue: - description: |- - NormalizedValue is the normalized value (after minMax or other normalization) as a decimal string. - Only populated for signals used in a weightedSum ranking. - type: string - rawValue: - description: RawValue is the unscaled signal value as - a decimal string. - type: string - required: - - name - - rawValue - type: object - type: array required: - finalScore - image - rank - - selected type: object type: array imageCount: - description: ImageCount is the number of selected discovered images. + description: ImageCount is the number of discovered images. format: int32 type: integer lastSyncTime: @@ -803,10 +735,6 @@ spec: attempt. format: date-time type: string - queryCount: - description: QueryCount is the number of configured queries. - format: int32 - type: integer queryResults: description: QueryResults reports the outcome of each named query execution. @@ -821,21 +749,6 @@ spec: description: Name matches the queries[].name that produced this result. type: string - records: - description: Records is the number of log records returned (Loki - queries only). - format: int64 - type: integer - samples: - description: Samples is the total number of data points across - all series (Prometheus range queries only). - format: int64 - type: integer - series: - description: Series is the number of time-series returned (Prometheus - queries only). - format: int32 - type: integer status: description: Status is "success" or "failed". enum: @@ -843,7 +756,8 @@ spec: - failed type: string type: - description: Type is the query backend type (prometheus or loki). + description: Type is the query backend type (prometheus, loki, + or registry). enum: - prometheus - loki @@ -855,33 +769,6 @@ spec: - type type: object type: array - signalResults: - description: SignalResults reports the outcome of each signal derivation. - items: - description: SignalResult reports the outcome of a single signal - derivation. - properties: - images: - description: Images is the number of images for which this signal - produced a value. - format: int32 - type: integer - message: - description: Message describes the failure reason when status=failed. - type: string - name: - description: Name matches the signals[].name that produced this - result. - type: string - status: - description: Status is "success" or "failed". - type: string - required: - - images - - name - - status - type: object - type: array type: object type: object served: true diff --git a/docs/content/docs/reference/_generated_architecture.md b/docs/content/docs/reference/_generated_architecture.md index 3091959..1abb6ac 100644 --- a/docs/content/docs/reference/_generated_architecture.md +++ b/docs/content/docs/reference/_generated_architecture.md @@ -26,6 +26,7 @@ graph TD graph LR cmd/main.go --> internal/controller internal/controller --> api/v1alpha1 + internal/controller --> internal/discovery internal/controller --> internal/metrics internal/controller --> internal/pacing internal/controller --> internal/podbuilder diff --git a/docs/content/docs/reference/_generated_crds.md b/docs/content/docs/reference/_generated_crds.md index 453b997..7f4de4a 100644 --- a/docs/content/docs/reference/_generated_crds.md +++ b/docs/content/docs/reference/_generated_crds.md @@ -119,10 +119,8 @@ DiscoveryPolicy automatically discovers images from registries or Prometheus met |-------|------|-------------| | `lastSyncTime` | `*metav1.Time` | LastSyncTime is the timestamp of the last reconciliation attempt. | | `queryResults` | `[]QueryResult` | QueryResults reports the outcome of each named query execution. | -| `signalResults` | `[]SignalResult` | SignalResults reports the outcome of each signal derivation. | -| `discoveredImages` | `[]DiscoveredImage` | DiscoveredImages is the ordered list of discovered and ranked images. Only images with selected=true are propagated to dependent CachedImageSet resources. | -| `imageCount` | `int32` | ImageCount is the number of selected discovered images. | -| `queryCount` | `int32` | QueryCount is the number of configured queries. | +| `discoveredImages` | `[]DiscoveredImage` | DiscoveredImages is the ordered list of discovered and ranked images. | +| `imageCount` | `int32` | ImageCount is the number of discovered images. | | `conditions` | `[]metav1.Condition` | Conditions represent the latest available observations. | --- @@ -173,9 +171,6 @@ DiscoveredImage represents a single discovered and ranked image. | `image` | `string` | Yes | — | Image is the fully qualified image reference. | | `rank` | `int32` | Yes | — | Rank is the position of this image in the final ordered list (1 = highest score). | | `finalScore` | `string` | Yes | — | FinalScore is the computed ranking score as a decimal string. | -| `selected` | `bool` | Yes | — | Selected is true when this image is within the maxImages cap and will be propagated to dependent CachedImageSet resources. | -| `signals` | `[]ImageSignalValue` | No | — | Signals lists the per-signal values used during ranking (for observability). | -| `ranking` | `*ImageRankingDetail` | No | — | Ranking explains how the final score was computed. | ### DiscoveryLokiQuery @@ -216,9 +211,10 @@ DiscoveryQuery defines a named raw-data source referenced by signals. | Field | Type | Required | Default | Description | |-------|------|----------|---------|-------------| | `name` | `string` | Yes | — | Name is the unique identifier for this query within the policy. Signals reference queries by this name via queryRef. | -| `type` | `DiscoveryQueryType` | Yes | — | Type selects the backend. Must be "prometheus" or "loki". | +| `type` | `DiscoveryQueryType` | Yes | — | Type selects the backend. Must be "prometheus", "loki", or "registry". | | `prometheus` | `*DiscoveryPrometheusQuery` | No | — | Prometheus contains the configuration when type=prometheus. | | `loki` | `*DiscoveryLokiQuery` | No | — | Loki contains the configuration when type=loki. | +| `registry` | `*DiscoveryRegistryQuery` | No | — | Registry contains the configuration when type=registry. | | `secretRef` | `*corev1.LocalObjectReference` | No | — | SecretRef references a Secret in the pod namespace (default "drop-system") for auth/TLS. Supported Secret keys: token, username, password, ca.crt, tls.crt, tls.key, headers.. | ### DiscoveryRanking @@ -232,6 +228,18 @@ DiscoveryRanking defines how signals are combined into the final ordered image l | `weightedSum` | `*WeightedSumRankingConfig` | No | — | WeightedSum is required when strategy=weightedSum. | | `modelExposure` | `*ModelExposureRankingConfig` | No | — | ModelExposure is required when strategy=modelExposure. | +### DiscoveryRegistryQuery + +DiscoveryRegistryQuery defines OCI registry tag listing configuration for image discovery. + +| Field | Type | Required | Default | Description | +|-------|------|----------|---------|-------------| +| `url` | `string` | Yes | — | URL is the registry base URL (without repository path). Example: "https://registry.example.com", "https://ghcr.io" | +| `repositories` | `[]string` | Yes | — | Repositories is the list of repository paths to list tags from. Example: ["team/app", "team/worker", "infra/tools"] | +| `tagFilter` | `string` | No | — | TagFilter is a regex applied to tag names. Only matching tags are discovered. Example: "^v[0-9]+\\." (semver tags only), "^main-" (main branch builds) | +| `topX` | `int32` | No | — | TopX limits the number of tags kept per repository after tagFilter is applied. The registry API does not guarantee ordering; Drop keeps the last N tags returned by the registry. Example: 3 (keep the last 3 matching tags returned per repo) | +| `imageTemplate` | `string` | No | — | ImageTemplate is a Go text/template for constructing the full image reference from discovered tags. Available variables: {{.Registry}}, {{.Repository}}, {{.Tag}} Default (when unset): "{{.Registry}}/{{.Repository}}:{{.Tag}}" Example: "registry.example.com/{{.Repository}}:{{.Tag}}" | + ### DiscoverySignal DiscoverySignal defines a named per-image metric derived from a single query. @@ -266,25 +274,6 @@ ImageEntry defines a single image to include in a set. | `tag` | `string` | No | — | Tag to pull. Mutually exclusive with Digest. Example: "1.25-alpine", "v2.4.1" | | `digest` | `string` | No | — | Digest to pull as an immutable reference. Mutually exclusive with Tag. Example: "sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4" | -### ImageRankingDetail - -ImageRankingDetail explains how the final score was computed for one image. - -| Field | Type | Required | Default | Description | -|-------|------|----------|---------|-------------| -| `strategy` | `string` | Yes | — | Strategy is the ranking strategy that produced this detail. | -| `terms` | `[]RankingTerm` | No | — | Terms lists the per-signal contributions (populated for weightedSum and modelExposure). | - -### ImageSignalValue - -ImageSignalValue records the raw and normalized value of a signal for one image. - -| Field | Type | Required | Default | Description | -|-------|------|----------|---------|-------------| -| `name` | `string` | Yes | — | Name is the signal name. | -| `rawValue` | `string` | Yes | — | RawValue is the unscaled signal value as a decimal string. | -| `normalizedValue` | `string` | No | — | NormalizedValue is the normalized value (after minMax or other normalization) as a decimal string. Only populated for signals used in a weightedSum ranking. | - ### LokiParser LokiParser configures structured parsing of Loki log entries. @@ -323,23 +312,10 @@ QueryResult reports the outcome of a single named query execution. | Field | Type | Required | Default | Description | |-------|------|----------|---------|-------------| | `name` | `string` | Yes | — | Name matches the queries[].name that produced this result. | -| `type` | `DiscoveryQueryType` | Yes | — | Type is the query backend type (prometheus or loki). | -| `series` | `*int32` | No | — | Series is the number of time-series returned (Prometheus queries only). | -| `samples` | `*int64` | No | — | Samples is the total number of data points across all series (Prometheus range queries only). | -| `records` | `*int64` | No | — | Records is the number of log records returned (Loki queries only). | +| `type` | `DiscoveryQueryType` | Yes | — | Type is the query backend type (prometheus, loki, or registry). | | `status` | `QueryResultStatus` | Yes | — | Status is "success" or "failed". | | `message` | `string` | No | — | Message describes the failure reason when status=failed. | -### RankingTerm - -RankingTerm records the contribution of one signal to the final score of an image. - -| Field | Type | Required | Default | Description | -|-------|------|----------|---------|-------------| -| `signal` | `string` | Yes | — | Signal is the signal name. | -| `weight` | `string` | Yes | — | Weight is the configured weight as a decimal string. | -| `contribution` | `string` | Yes | — | Contribution is weight * normalizedValue as a decimal string. | - ### SignalRankingConfig SignalRankingConfig configures the signal ranking strategy. @@ -348,17 +324,6 @@ SignalRankingConfig configures the signal ranking strategy. |-------|------|----------|---------|-------------| | `signalRef` | `string` | Yes | — | SignalRef is the name of the signal whose values determine image rank. Must match a signals[].name within the same policy. | -### SignalResult - -SignalResult reports the outcome of a single signal derivation. - -| Field | Type | Required | Default | Description | -|-------|------|----------|---------|-------------| -| `name` | `string` | Yes | — | Name matches the signals[].name that produced this result. | -| `images` | `int32` | Yes | — | Images is the number of images for which this signal produced a value. | -| `status` | `string` | Yes | — | Status is "success" or "failed". | -| `message` | `string` | No | — | Message describes the failure reason when status=failed. | - ### TimeOfDayWindow TimeOfDayWindow defines a fixed wall-clock time range within each day. diff --git a/docs/static/llms-full.txt b/docs/static/llms-full.txt index 9ed121d..1b02bee 100644 --- a/docs/static/llms-full.txt +++ b/docs/static/llms-full.txt @@ -96,10 +96,8 @@ Controller: internal/controller/discoverypolicy_controller.go | Test: internal/c |-------|------|------|-------------| | LastSyncTime | `lastSyncTime` | `*metav1.Time` | LastSyncTime is the timestamp of the last reconciliation attempt. | | QueryResults | `queryResults` | `[]QueryResult` | QueryResults reports the outcome of each named query execution. | -| SignalResults | `signalResults` | `[]SignalResult` | SignalResults reports the outcome of each signal derivation. | -| DiscoveredImages | `discoveredImages` | `[]DiscoveredImage` | DiscoveredImages is the ordered list of discovered and ranked images. Only images with selected=true are propagated to dependent CachedImageSet resources. | -| ImageCount | `imageCount` | `int32` | ImageCount is the number of selected discovered images. | -| QueryCount | `queryCount` | `int32` | QueryCount is the number of configured queries. | +| DiscoveredImages | `discoveredImages` | `[]DiscoveredImage` | DiscoveredImages is the ordered list of discovered and ranked images. | +| ImageCount | `imageCount` | `int32` | ImageCount is the number of discovered images. | | Conditions | `conditions` | `[]metav1.Condition` | Conditions represent the latest available observations. | @@ -147,9 +145,6 @@ DiscoveredImage represents a single discovered and ranked image. | Image | `image` | `string` | ✓ | | Image is the fully qualified image reference. | | Rank | `rank` | `int32` | ✓ | | Rank is the position of this image in the final ordered list (1 = highest score). | | FinalScore | `finalScore` | `string` | ✓ | | FinalScore is the computed ranking score as a decimal string. | -| Selected | `selected` | `bool` | ✓ | | Selected is true when this image is within the maxImages cap and will be propagated to dependent CachedImageSet resources. | -| Signals | `signals` | `[]ImageSignalValue` | — | | Signals lists the per-signal values used during ranking (for observability). | -| Ranking | `ranking` | `*ImageRankingDetail` | — | | Ranking explains how the final score was computed. | ### DiscoveryLokiQuery @@ -190,9 +185,10 @@ DiscoveryQuery defines a named raw-data source referenced by signals. | Field | JSON | Type | Required | Default | Description | |-------|------|------|----------|---------|-------------| | Name | `name` | `string` | ✓ | | Name is the unique identifier for this query within the policy. Signals reference queries by this name via queryRef. | -| Type | `type` | `DiscoveryQueryType` | ✓ | | Type selects the backend. Must be "prometheus" or "loki". Enum: `prometheus`,`loki` | +| Type | `type` | `DiscoveryQueryType` | ✓ | | Type selects the backend. Must be "prometheus", "loki", or "registry". Enum: `prometheus`,`loki`,`registry` | | Prometheus | `prometheus` | `*DiscoveryPrometheusQuery` | — | | Prometheus contains the configuration when type=prometheus. | | Loki | `loki` | `*DiscoveryLokiQuery` | — | | Loki contains the configuration when type=loki. | +| Registry | `registry` | `*DiscoveryRegistryQuery` | — | | Registry contains the configuration when type=registry. | | SecretRef | `secretRef` | `*corev1.LocalObjectReference` | — | | SecretRef references a Secret in the pod namespace (default "drop-system") for auth/TLS. Supported Secret keys: token, username, password, ca.crt, tls.crt, tls.key, headers.. | ### DiscoveryRanking @@ -206,6 +202,18 @@ DiscoveryRanking defines how signals are combined into the final ordered image l | WeightedSum | `weightedSum` | `*WeightedSumRankingConfig` | — | | WeightedSum is required when strategy=weightedSum. | | ModelExposure | `modelExposure` | `*ModelExposureRankingConfig` | — | | ModelExposure is required when strategy=modelExposure. | +### DiscoveryRegistryQuery + +DiscoveryRegistryQuery defines OCI registry tag listing configuration for image discovery. + +| Field | JSON | Type | Required | Default | Description | +|-------|------|------|----------|---------|-------------| +| URL | `url` | `string` | ✓ | | URL is the registry base URL (without repository path). Example: "https://registry.example.com", "https://ghcr.io" | +| Repositories | `repositories` | `[]string` | ✓ | | Repositories is the list of repository paths to list tags from. Example: ["team/app", "team/worker", "infra/tools"] | +| TagFilter | `tagFilter` | `string` | — | | TagFilter is a regex applied to tag names. Only matching tags are discovered. Example: "^v[0-9]+\\." (semver tags only), "^main-" (main branch builds) | +| TopX | `topX` | `int32` | — | | TopX limits the number of tags kept per repository after tagFilter is applied. The registry API does not guarantee ordering; Drop keeps the last N tags returned by the registry. Example: 3 (keep the last 3 matching tags returned per repo) | +| ImageTemplate | `imageTemplate` | `string` | — | | ImageTemplate is a Go text/template for constructing the full image reference from discovered tags. Available variables: {{.Registry}}, {{.Repository}}, {{.Tag}} Default (when unset): "{{.Registry}}/{{.Repository}}:{{.Tag}}" Example: "registry.example.com/{{.Repository}}:{{.Tag}}" | + ### DiscoverySignal DiscoverySignal defines a named per-image metric derived from a single query. @@ -240,25 +248,6 @@ ImageEntry defines a single image to include in a set. | Tag | `tag` | `string` | — | | Tag to pull. Mutually exclusive with Digest. Example: "1.25-alpine", "v2.4.1" | | Digest | `digest` | `string` | — | | Digest to pull as an immutable reference. Mutually exclusive with Tag. Example: "sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4" | -### ImageRankingDetail - -ImageRankingDetail explains how the final score was computed for one image. - -| Field | JSON | Type | Required | Default | Description | -|-------|------|------|----------|---------|-------------| -| Strategy | `strategy` | `string` | ✓ | | Strategy is the ranking strategy that produced this detail. | -| Terms | `terms` | `[]RankingTerm` | — | | Terms lists the per-signal contributions (populated for weightedSum and modelExposure). | - -### ImageSignalValue - -ImageSignalValue records the raw and normalized value of a signal for one image. - -| Field | JSON | Type | Required | Default | Description | -|-------|------|------|----------|---------|-------------| -| Name | `name` | `string` | ✓ | | Name is the signal name. | -| RawValue | `rawValue` | `string` | ✓ | | RawValue is the unscaled signal value as a decimal string. | -| NormalizedValue | `normalizedValue` | `string` | — | | NormalizedValue is the normalized value (after minMax or other normalization) as a decimal string. Only populated for signals used in a weightedSum ranking. | - ### LokiParser LokiParser configures structured parsing of Loki log entries. @@ -297,23 +286,10 @@ QueryResult reports the outcome of a single named query execution. | Field | JSON | Type | Required | Default | Description | |-------|------|------|----------|---------|-------------| | Name | `name` | `string` | ✓ | | Name matches the queries[].name that produced this result. | -| Type | `type` | `DiscoveryQueryType` | ✓ | | Type is the query backend type (prometheus or loki). | -| Series | `series` | `*int32` | — | | Series is the number of time-series returned (Prometheus queries only). | -| Samples | `samples` | `*int64` | — | | Samples is the total number of data points across all series (Prometheus range queries only). | -| Records | `records` | `*int64` | — | | Records is the number of log records returned (Loki queries only). | +| Type | `type` | `DiscoveryQueryType` | ✓ | | Type is the query backend type (prometheus, loki, or registry). | | Status | `status` | `QueryResultStatus` | ✓ | | Status is "success" or "failed". | | Message | `message` | `string` | — | | Message describes the failure reason when status=failed. | -### RankingTerm - -RankingTerm records the contribution of one signal to the final score of an image. - -| Field | JSON | Type | Required | Default | Description | -|-------|------|------|----------|---------|-------------| -| Signal | `signal` | `string` | ✓ | | Signal is the signal name. | -| Weight | `weight` | `string` | ✓ | | Weight is the configured weight as a decimal string. | -| Contribution | `contribution` | `string` | ✓ | | Contribution is weight * normalizedValue as a decimal string. | - ### SignalRankingConfig SignalRankingConfig configures the signal ranking strategy. @@ -322,17 +298,6 @@ SignalRankingConfig configures the signal ranking strategy. |-------|------|------|----------|---------|-------------| | SignalRef | `signalRef` | `string` | ✓ | | SignalRef is the name of the signal whose values determine image rank. Must match a signals[].name within the same policy. | -### SignalResult - -SignalResult reports the outcome of a single signal derivation. - -| Field | JSON | Type | Required | Default | Description | -|-------|------|------|----------|---------|-------------| -| Name | `name` | `string` | ✓ | | Name matches the signals[].name that produced this result. | -| Images | `images` | `int32` | ✓ | | Images is the number of images for which this signal produced a value. | -| Status | `status` | `string` | ✓ | | Status is "success" or "failed". | -| Message | `message` | `string` | — | | Message describes the failure reason when status=failed. | - ### TimeOfDayWindow TimeOfDayWindow defines a fixed wall-clock time range within each day. @@ -417,7 +382,9 @@ graph LR | Degraded | CachedImageSet | N/N images cached, failing: N | | | Progressing | CachedImageSet | N/N images cached | | | Ready | CachedImageSet | All N images are cached | | -| NotImplemented | DiscoveryPolicy | | | +| ConnectionRefused | DiscoveryPolicy | | | +| DNSError | DiscoveryPolicy | | | +| Synced | DiscoveryPolicy | Pipeline executed successfully; N images discovered. | | ## Metrics @@ -438,7 +405,27 @@ graph LR ## Sample CRs ```yaml -# Dev samples: deployed by Tilt for interactive testing +# Dev samples: deployed by Tilt for interactive testing. +# +# These samples exercise EVERY feature of the operator so developers can spot +# regressions at a glance in the Tilt UI. They run against the e2e-infra stack +# (Prometheus, Loki, and a seeded OCI registry) that Tilt brings up. +# +# Feature coverage: +# PullPolicy ............ dev-conservative +# CachedImage .......... dev-nginx, dev-redis (healthy), test-invalid-image (broken) +# CachedImageSet ....... dev-set (static), dev-set-discovered (discovery-backed) +# Query: prometheus .... dev-prometheus (range), dev-prometheus-instant (instant) +# Query: loki .......... dev-loki (kubernetesEvents parser) +# Query: registry ...... dev-registry +# Signal: aggregate .... dev-prometheus +# Signal: timeWeighted . dev-timeweighted +# Signal: windowAgg .... dev-window +# Signal: eventPullTime dev-loki +# Ranking: signal ...... dev-prometheus +# Ranking: weightedSum . dev-hybrid +# Ranking: modelExposure dev-modelexposure +# Failure cases ........ test-broken-prom, test-broken-registry, test-notfound-repo --- # === PullPolicy === apiVersion: drop.corewire.io/v1alpha1 @@ -577,6 +564,217 @@ spec: syncInterval: 30s maxImages: 10 --- +# === DiscoveryPolicy: Prometheus instant query === +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: dev-prometheus-instant +spec: + queries: + - name: current-usage + type: prometheus + prometheus: + endpoint: "http://prometheus.e2e-infra.svc.cluster.local:9090" + queryType: instant + query: 'container_memory_working_set_bytes{namespace="build-stuff"}' + signals: + - name: current + queryRef: current-usage + type: aggregate + aggregate: + method: max + ranking: + strategy: signal + signal: + signalRef: current + syncInterval: 30s + maxImages: 10 +--- +# === DiscoveryPolicy: timeWeightedAggregate signal === +# Weights samples by hour-of-day before aggregating. +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: dev-timeweighted +spec: + queries: + - name: runner-image-usage + type: prometheus + prometheus: + endpoint: "http://prometheus.e2e-infra.svc.cluster.local:9090" + queryType: range + lookback: 24h + step: 5m + query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff"}) by (image)' + signals: + - name: business-hours-usage + queryRef: runner-image-usage + type: timeWeightedAggregate + timeWeightedAggregate: + method: sum + timezone: "UTC" + defaultWeight: "1" + windows: + - startHour: 8 + endHour: 18 + weight: "2" + ranking: + strategy: signal + signal: + signalRef: business-hours-usage + syncInterval: 30s + maxImages: 10 +--- +# === DiscoveryPolicy: windowAggregate signal (relative window) === +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: dev-window +spec: + queries: + - name: runner-image-usage + type: prometheus + prometheus: + endpoint: "http://prometheus.e2e-infra.svc.cluster.local:9090" + queryType: range + lookback: 24h + step: 5m + query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff"}) by (image)' + signals: + - name: recent-usage + queryRef: runner-image-usage + type: windowAggregate + windowAggregate: + method: sum + relativeWindow: 6h + ranking: + strategy: signal + signal: + signalRef: recent-usage + syncInterval: 30s + maxImages: 10 +--- +# === DiscoveryPolicy: Loki query + eventPullTime signals === +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: dev-loki +spec: + queries: + - name: image-pull-events + type: loki + loki: + endpoint: "http://loki.e2e-infra.svc.cluster.local:3100" + queryType: range + lookback: 24h + query: '{job="kubelet",drop_e2e="true"}' + parser: + type: kubernetesEvents + signals: + - name: p50-cold-pull-time + queryRef: image-pull-events + type: eventPullTime + eventPullTime: + statistic: p50 + durationMode: messageDuration + includeCacheHits: false + - name: pull-failures + queryRef: image-pull-events + type: eventPullTime + eventPullTime: + statistic: failureCount + durationMode: messageDuration + includeCacheHits: false + ranking: + strategy: signal + signal: + signalRef: p50-cold-pull-time + syncInterval: 30s + maxImages: 10 +--- +# === DiscoveryPolicy: registry tag discovery === +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: dev-registry +spec: + queries: + - name: registry-tags + type: registry + registry: + url: "http://registry.e2e-infra.svc.cluster.local:5000" + repositories: + - test/myapp + - test/worker + - test/tools + tagFilter: "^v" + topX: 5 + signals: + - name: tag-recency + queryRef: registry-tags + type: aggregate + aggregate: + method: sum + ranking: + strategy: signal + signal: + signalRef: tag-recency + syncInterval: 30s + maxImages: 20 +--- +# === DiscoveryPolicy: modelExposure ranking (multi-query) === +# Combines Prometheus usage signals with a Loki pull-time signal. +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: dev-modelexposure +spec: + queries: + - name: runner-image-usage + type: prometheus + prometheus: + endpoint: "http://prometheus.e2e-infra.svc.cluster.local:9090" + queryType: range + lookback: 24h + step: 5m + query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff"}) by (image)' + - name: image-pull-events + type: loki + loki: + endpoint: "http://loki.e2e-infra.svc.cluster.local:3100" + queryType: range + lookback: 24h + query: '{job="kubelet",drop_e2e="true"}' + parser: + type: kubernetesEvents + signals: + - name: pre-usage + queryRef: runner-image-usage + type: aggregate + aggregate: + method: sum + - name: target-usage + queryRef: runner-image-usage + type: aggregate + aggregate: + method: max + - name: pull-time + queryRef: image-pull-events + type: eventPullTime + eventPullTime: + statistic: p50 + durationMode: messageDuration + includeCacheHits: false + ranking: + strategy: modelExposure + modelExposure: + nodeCount: 3 + preWindowUsageSignalRef: pre-usage + targetWindowUsageSignalRef: target-usage + pullTimeSignalRef: pull-time + syncInterval: 30s + maxImages: 10 +--- # === DiscoveryPolicy: broken Prometheus endpoint (DNS error) === apiVersion: drop.corewire.io/v1alpha1 kind: DiscoveryPolicy @@ -601,6 +799,58 @@ spec: signalRef: total-usage syncInterval: 30m maxImages: 10 +--- +# === DiscoveryPolicy: broken registry endpoint (DNS error) === +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: test-broken-registry +spec: + queries: + - name: broken-registry + type: registry + registry: + url: "http://nonexistent-registry:5000" + repositories: + - test/app + signals: + - name: tag-recency + queryRef: broken-registry + type: aggregate + aggregate: + method: sum + ranking: + strategy: signal + signal: + signalRef: tag-recency + syncInterval: 30m + maxImages: 10 +--- +# === DiscoveryPolicy: registry repository not found (404) === +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: test-notfound-repo +spec: + queries: + - name: missing-repo + type: registry + registry: + url: "http://registry.e2e-infra.svc.cluster.local:5000" + repositories: + - test/does-not-exist + signals: + - name: tag-recency + queryRef: missing-repo + type: aggregate + aggregate: + method: sum + ranking: + strategy: signal + signal: + signalRef: tag-recency + syncInterval: 30m + maxImages: 10 ``` @@ -624,7 +874,7 @@ spec: make kind-delete # Delete the kind cluster. make install # Install CRDs into cluster. make uninstall # Uninstall CRDs from cluster. - make e2e-infra # Deploy Prometheus + Registry for E2E/dev. + make e2e-infra # Deploy Prometheus, Loki, and Registry for E2E/dev. make docker-build # Build docker image. make docker-push # Push docker image. make kind-load # Build and load image into kind. diff --git a/knowledge.yaml b/knowledge.yaml index fea19b9..3658558 100644 --- a/knowledge.yaml +++ b/knowledge.yaml @@ -280,26 +280,16 @@ crds: type: '[]QueryResult' required: false doc: QueryResults reports the outcome of each named query execution. - - name: SignalResults - json: signalResults - type: '[]SignalResult' - required: false - doc: SignalResults reports the outcome of each signal derivation. - name: DiscoveredImages json: discoveredImages type: '[]DiscoveredImage' required: false - doc: DiscoveredImages is the ordered list of discovered and ranked images. Only images with selected=true are propagated to dependent CachedImageSet resources. + doc: DiscoveredImages is the ordered list of discovered and ranked images. - name: ImageCount json: imageCount type: int32 required: false - doc: ImageCount is the number of selected discovered images. - - name: QueryCount - json: queryCount - type: int32 - required: false - doc: QueryCount is the number of configured queries. + doc: ImageCount is the number of discovered images. - name: Conditions json: conditions type: '[]metav1.Condition' @@ -310,7 +300,6 @@ crds: - +kubebuilder:printcolumn:name="Message",type=string,JSONPath=`.status.conditions[?(@.type=="Ready")].message`,priority=1 - +kubebuilder:printcolumn:name="LastSync",type=date,JSONPath=`.status.lastSyncTime` - +kubebuilder:printcolumn:name="Images",type=integer,JSONPath=`.status.imageCount` - - +kubebuilder:printcolumn:name="Queries",type=integer,JSONPath=`.status.queryCount` - +kubebuilder:printcolumn:name="Status",type=string,JSONPath=`.status.conditions[?(@.type=="Ready")].reason` - +kubebuilder:resource:scope=Cluster,categories=drop - +kubebuilder:subresource:status @@ -406,21 +395,6 @@ helperTypes: type: string required: true doc: FinalScore is the computed ranking score as a decimal string. - - name: Selected - json: selected - type: bool - required: true - doc: Selected is true when this image is within the maxImages cap and will be propagated to dependent CachedImageSet resources. - - name: Signals - json: signals - type: '[]ImageSignalValue' - required: false - doc: Signals lists the per-signal values used during ranking (for observability). - - name: Ranking - json: ranking - type: '*ImageRankingDetail' - required: false - doc: Ranking explains how the final score was computed. - name: DiscoveryLokiQuery doc: DiscoveryLokiQuery defines the Loki-specific query parameters. fields: @@ -502,7 +476,8 @@ helperTypes: enum: - prometheus - loki - doc: Type selects the backend. Must be "prometheus" or "loki". + - registry + doc: Type selects the backend. Must be "prometheus", "loki", or "registry". - name: Prometheus json: prometheus type: '*DiscoveryPrometheusQuery' @@ -513,6 +488,11 @@ helperTypes: type: '*DiscoveryLokiQuery' required: false doc: Loki contains the configuration when type=loki. + - name: Registry + json: registry + type: '*DiscoveryRegistryQuery' + required: false + doc: Registry contains the configuration when type=registry. - name: SecretRef json: secretRef type: '*corev1.LocalObjectReference' @@ -545,6 +525,34 @@ helperTypes: type: '*ModelExposureRankingConfig' required: false doc: ModelExposure is required when strategy=modelExposure. + - name: DiscoveryRegistryQuery + doc: DiscoveryRegistryQuery defines OCI registry tag listing configuration for image discovery. + fields: + - name: URL + json: url + type: string + required: true + doc: 'URL is the registry base URL (without repository path). Example: "https://registry.example.com", "https://ghcr.io"' + - name: Repositories + json: repositories + type: '[]string' + required: true + doc: 'Repositories is the list of repository paths to list tags from. Example: ["team/app", "team/worker", "infra/tools"]' + - name: TagFilter + json: tagFilter + type: string + required: false + doc: 'TagFilter is a regex applied to tag names. Only matching tags are discovered. Example: "^v[0-9]+\\." (semver tags only), "^main-" (main branch builds)' + - name: TopX + json: topX + type: int32 + required: false + doc: 'TopX limits the number of tags kept per repository after tagFilter is applied. The registry API does not guarantee ordering; Drop keeps the last N tags returned by the registry. Example: 3 (keep the last 3 matching tags returned per repo)' + - name: ImageTemplate + json: imageTemplate + type: string + required: false + doc: 'ImageTemplate is a Go text/template for constructing the full image reference from discovered tags. Available variables: {{.Registry}}, {{.Repository}}, {{.Tag}} Default (when unset): "{{.Registry}}/{{.Repository}}:{{.Tag}}" Example: "registry.example.com/{{.Repository}}:{{.Tag}}"' - name: DiscoverySignal doc: DiscoverySignal defines a named per-image metric derived from a single query. fields: @@ -637,37 +645,6 @@ helperTypes: type: string required: false doc: 'Digest to pull as an immutable reference. Mutually exclusive with Tag. Example: "sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4"' - - name: ImageRankingDetail - doc: ImageRankingDetail explains how the final score was computed for one image. - fields: - - name: Strategy - json: strategy - type: string - required: true - doc: Strategy is the ranking strategy that produced this detail. - - name: Terms - json: terms - type: '[]RankingTerm' - required: false - doc: Terms lists the per-signal contributions (populated for weightedSum and modelExposure). - - name: ImageSignalValue - doc: ImageSignalValue records the raw and normalized value of a signal for one image. - fields: - - name: Name - json: name - type: string - required: true - doc: Name is the signal name. - - name: RawValue - json: rawValue - type: string - required: true - doc: RawValue is the unscaled signal value as a decimal string. - - name: NormalizedValue - json: normalizedValue - type: string - required: false - doc: NormalizedValue is the normalized value (after minMax or other normalization) as a decimal string. Only populated for signals used in a weightedSum ranking. - name: LokiParser doc: LokiParser configures structured parsing of Loki log entries. fields: @@ -741,22 +718,7 @@ helperTypes: json: type type: DiscoveryQueryType required: true - doc: Type is the query backend type (prometheus or loki). - - name: Series - json: series - type: '*int32' - required: false - doc: Series is the number of time-series returned (Prometheus queries only). - - name: Samples - json: samples - type: '*int64' - required: false - doc: Samples is the total number of data points across all series (Prometheus range queries only). - - name: Records - json: records - type: '*int64' - required: false - doc: Records is the number of log records returned (Loki queries only). + doc: Type is the query backend type (prometheus, loki, or registry). - name: Status json: status type: QueryResultStatus @@ -767,24 +729,6 @@ helperTypes: type: string required: false doc: Message describes the failure reason when status=failed. - - name: RankingTerm - doc: RankingTerm records the contribution of one signal to the final score of an image. - fields: - - name: Signal - json: signal - type: string - required: true - doc: Signal is the signal name. - - name: Weight - json: weight - type: string - required: true - doc: Weight is the configured weight as a decimal string. - - name: Contribution - json: contribution - type: string - required: true - doc: Contribution is weight * normalizedValue as a decimal string. - name: SignalRankingConfig doc: SignalRankingConfig configures the signal ranking strategy. fields: @@ -793,29 +737,6 @@ helperTypes: type: string required: true doc: SignalRef is the name of the signal whose values determine image rank. Must match a signals[].name within the same policy. - - name: SignalResult - doc: SignalResult reports the outcome of a single signal derivation. - fields: - - name: Name - json: name - type: string - required: true - doc: Name matches the signals[].name that produced this result. - - name: Images - json: images - type: int32 - required: true - doc: Images is the number of images for which this signal produced a value. - - name: Status - json: status - type: string - required: true - doc: Status is "success" or "failed". - - name: Message - json: message - type: string - required: false - doc: Message describes the failure reason when status=failed. - name: TimeOfDayWindow doc: TimeOfDayWindow defines a fixed wall-clock time range within each day. fields: @@ -963,6 +884,7 @@ packages: role: Package controller implements Kubernetes reconcilers for the drop CRDs (one per Kind). imports: - api/v1alpha1 + - internal/discovery - internal/metrics - internal/pacing - internal/podbuilder @@ -1037,9 +959,15 @@ errors: - reason: Ready controller: CachedImageSet meaning: All N images are cached - - reason: NotImplemented + - reason: ConnectionRefused controller: DiscoveryPolicy meaning: "" + - reason: DNSError + controller: DiscoveryPolicy + meaning: "" + - reason: Synced + controller: DiscoveryPolicy + meaning: Pipeline executed successfully; N images discovered. metrics: - name: drop_images_cached_total help: Total number of images successfully cached on nodes. @@ -1110,7 +1038,7 @@ makeTargets: - name: uninstall desc: Uninstall CRDs from cluster. - name: e2e-infra - desc: Deploy Prometheus + Registry for E2E/dev. + desc: Deploy Prometheus, Loki, and Registry for E2E/dev. - name: docker-build desc: Build docker image. - name: docker-push @@ -1130,7 +1058,27 @@ makeTargets: - name: tools desc: Install local tooling and check optional docs/chart binaries. samples: | - # Dev samples: deployed by Tilt for interactive testing + # Dev samples: deployed by Tilt for interactive testing. + # + # These samples exercise EVERY feature of the operator so developers can spot + # regressions at a glance in the Tilt UI. They run against the e2e-infra stack + # (Prometheus, Loki, and a seeded OCI registry) that Tilt brings up. + # + # Feature coverage: + # PullPolicy ............ dev-conservative + # CachedImage .......... dev-nginx, dev-redis (healthy), test-invalid-image (broken) + # CachedImageSet ....... dev-set (static), dev-set-discovered (discovery-backed) + # Query: prometheus .... dev-prometheus (range), dev-prometheus-instant (instant) + # Query: loki .......... dev-loki (kubernetesEvents parser) + # Query: registry ...... dev-registry + # Signal: aggregate .... dev-prometheus + # Signal: timeWeighted . dev-timeweighted + # Signal: windowAgg .... dev-window + # Signal: eventPullTime dev-loki + # Ranking: signal ...... dev-prometheus + # Ranking: weightedSum . dev-hybrid + # Ranking: modelExposure dev-modelexposure + # Failure cases ........ test-broken-prom, test-broken-registry, test-notfound-repo --- # === PullPolicy === apiVersion: drop.corewire.io/v1alpha1 @@ -1269,6 +1217,217 @@ samples: | syncInterval: 30s maxImages: 10 --- + # === DiscoveryPolicy: Prometheus instant query === + apiVersion: drop.corewire.io/v1alpha1 + kind: DiscoveryPolicy + metadata: + name: dev-prometheus-instant + spec: + queries: + - name: current-usage + type: prometheus + prometheus: + endpoint: "http://prometheus.e2e-infra.svc.cluster.local:9090" + queryType: instant + query: 'container_memory_working_set_bytes{namespace="build-stuff"}' + signals: + - name: current + queryRef: current-usage + type: aggregate + aggregate: + method: max + ranking: + strategy: signal + signal: + signalRef: current + syncInterval: 30s + maxImages: 10 + --- + # === DiscoveryPolicy: timeWeightedAggregate signal === + # Weights samples by hour-of-day before aggregating. + apiVersion: drop.corewire.io/v1alpha1 + kind: DiscoveryPolicy + metadata: + name: dev-timeweighted + spec: + queries: + - name: runner-image-usage + type: prometheus + prometheus: + endpoint: "http://prometheus.e2e-infra.svc.cluster.local:9090" + queryType: range + lookback: 24h + step: 5m + query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff"}) by (image)' + signals: + - name: business-hours-usage + queryRef: runner-image-usage + type: timeWeightedAggregate + timeWeightedAggregate: + method: sum + timezone: "UTC" + defaultWeight: "1" + windows: + - startHour: 8 + endHour: 18 + weight: "2" + ranking: + strategy: signal + signal: + signalRef: business-hours-usage + syncInterval: 30s + maxImages: 10 + --- + # === DiscoveryPolicy: windowAggregate signal (relative window) === + apiVersion: drop.corewire.io/v1alpha1 + kind: DiscoveryPolicy + metadata: + name: dev-window + spec: + queries: + - name: runner-image-usage + type: prometheus + prometheus: + endpoint: "http://prometheus.e2e-infra.svc.cluster.local:9090" + queryType: range + lookback: 24h + step: 5m + query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff"}) by (image)' + signals: + - name: recent-usage + queryRef: runner-image-usage + type: windowAggregate + windowAggregate: + method: sum + relativeWindow: 6h + ranking: + strategy: signal + signal: + signalRef: recent-usage + syncInterval: 30s + maxImages: 10 + --- + # === DiscoveryPolicy: Loki query + eventPullTime signals === + apiVersion: drop.corewire.io/v1alpha1 + kind: DiscoveryPolicy + metadata: + name: dev-loki + spec: + queries: + - name: image-pull-events + type: loki + loki: + endpoint: "http://loki.e2e-infra.svc.cluster.local:3100" + queryType: range + lookback: 24h + query: '{job="kubelet",drop_e2e="true"}' + parser: + type: kubernetesEvents + signals: + - name: p50-cold-pull-time + queryRef: image-pull-events + type: eventPullTime + eventPullTime: + statistic: p50 + durationMode: messageDuration + includeCacheHits: false + - name: pull-failures + queryRef: image-pull-events + type: eventPullTime + eventPullTime: + statistic: failureCount + durationMode: messageDuration + includeCacheHits: false + ranking: + strategy: signal + signal: + signalRef: p50-cold-pull-time + syncInterval: 30s + maxImages: 10 + --- + # === DiscoveryPolicy: registry tag discovery === + apiVersion: drop.corewire.io/v1alpha1 + kind: DiscoveryPolicy + metadata: + name: dev-registry + spec: + queries: + - name: registry-tags + type: registry + registry: + url: "http://registry.e2e-infra.svc.cluster.local:5000" + repositories: + - test/myapp + - test/worker + - test/tools + tagFilter: "^v" + topX: 5 + signals: + - name: tag-recency + queryRef: registry-tags + type: aggregate + aggregate: + method: sum + ranking: + strategy: signal + signal: + signalRef: tag-recency + syncInterval: 30s + maxImages: 20 + --- + # === DiscoveryPolicy: modelExposure ranking (multi-query) === + # Combines Prometheus usage signals with a Loki pull-time signal. + apiVersion: drop.corewire.io/v1alpha1 + kind: DiscoveryPolicy + metadata: + name: dev-modelexposure + spec: + queries: + - name: runner-image-usage + type: prometheus + prometheus: + endpoint: "http://prometheus.e2e-infra.svc.cluster.local:9090" + queryType: range + lookback: 24h + step: 5m + query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff"}) by (image)' + - name: image-pull-events + type: loki + loki: + endpoint: "http://loki.e2e-infra.svc.cluster.local:3100" + queryType: range + lookback: 24h + query: '{job="kubelet",drop_e2e="true"}' + parser: + type: kubernetesEvents + signals: + - name: pre-usage + queryRef: runner-image-usage + type: aggregate + aggregate: + method: sum + - name: target-usage + queryRef: runner-image-usage + type: aggregate + aggregate: + method: max + - name: pull-time + queryRef: image-pull-events + type: eventPullTime + eventPullTime: + statistic: p50 + durationMode: messageDuration + includeCacheHits: false + ranking: + strategy: modelExposure + modelExposure: + nodeCount: 3 + preWindowUsageSignalRef: pre-usage + targetWindowUsageSignalRef: target-usage + pullTimeSignalRef: pull-time + syncInterval: 30s + maxImages: 10 + --- # === DiscoveryPolicy: broken Prometheus endpoint (DNS error) === apiVersion: drop.corewire.io/v1alpha1 kind: DiscoveryPolicy @@ -1293,3 +1452,55 @@ samples: | signalRef: total-usage syncInterval: 30m maxImages: 10 + --- + # === DiscoveryPolicy: broken registry endpoint (DNS error) === + apiVersion: drop.corewire.io/v1alpha1 + kind: DiscoveryPolicy + metadata: + name: test-broken-registry + spec: + queries: + - name: broken-registry + type: registry + registry: + url: "http://nonexistent-registry:5000" + repositories: + - test/app + signals: + - name: tag-recency + queryRef: broken-registry + type: aggregate + aggregate: + method: sum + ranking: + strategy: signal + signal: + signalRef: tag-recency + syncInterval: 30m + maxImages: 10 + --- + # === DiscoveryPolicy: registry repository not found (404) === + apiVersion: drop.corewire.io/v1alpha1 + kind: DiscoveryPolicy + metadata: + name: test-notfound-repo + spec: + queries: + - name: missing-repo + type: registry + registry: + url: "http://registry.e2e-infra.svc.cluster.local:5000" + repositories: + - test/does-not-exist + signals: + - name: tag-recency + queryRef: missing-repo + type: aggregate + aggregate: + method: sum + ranking: + strategy: signal + signal: + signalRef: tag-recency + syncInterval: 30m + maxImages: 10 diff --git a/llms-full.txt b/llms-full.txt index 9ed121d..1b02bee 100644 --- a/llms-full.txt +++ b/llms-full.txt @@ -96,10 +96,8 @@ Controller: internal/controller/discoverypolicy_controller.go | Test: internal/c |-------|------|------|-------------| | LastSyncTime | `lastSyncTime` | `*metav1.Time` | LastSyncTime is the timestamp of the last reconciliation attempt. | | QueryResults | `queryResults` | `[]QueryResult` | QueryResults reports the outcome of each named query execution. | -| SignalResults | `signalResults` | `[]SignalResult` | SignalResults reports the outcome of each signal derivation. | -| DiscoveredImages | `discoveredImages` | `[]DiscoveredImage` | DiscoveredImages is the ordered list of discovered and ranked images. Only images with selected=true are propagated to dependent CachedImageSet resources. | -| ImageCount | `imageCount` | `int32` | ImageCount is the number of selected discovered images. | -| QueryCount | `queryCount` | `int32` | QueryCount is the number of configured queries. | +| DiscoveredImages | `discoveredImages` | `[]DiscoveredImage` | DiscoveredImages is the ordered list of discovered and ranked images. | +| ImageCount | `imageCount` | `int32` | ImageCount is the number of discovered images. | | Conditions | `conditions` | `[]metav1.Condition` | Conditions represent the latest available observations. | @@ -147,9 +145,6 @@ DiscoveredImage represents a single discovered and ranked image. | Image | `image` | `string` | ✓ | | Image is the fully qualified image reference. | | Rank | `rank` | `int32` | ✓ | | Rank is the position of this image in the final ordered list (1 = highest score). | | FinalScore | `finalScore` | `string` | ✓ | | FinalScore is the computed ranking score as a decimal string. | -| Selected | `selected` | `bool` | ✓ | | Selected is true when this image is within the maxImages cap and will be propagated to dependent CachedImageSet resources. | -| Signals | `signals` | `[]ImageSignalValue` | — | | Signals lists the per-signal values used during ranking (for observability). | -| Ranking | `ranking` | `*ImageRankingDetail` | — | | Ranking explains how the final score was computed. | ### DiscoveryLokiQuery @@ -190,9 +185,10 @@ DiscoveryQuery defines a named raw-data source referenced by signals. | Field | JSON | Type | Required | Default | Description | |-------|------|------|----------|---------|-------------| | Name | `name` | `string` | ✓ | | Name is the unique identifier for this query within the policy. Signals reference queries by this name via queryRef. | -| Type | `type` | `DiscoveryQueryType` | ✓ | | Type selects the backend. Must be "prometheus" or "loki". Enum: `prometheus`,`loki` | +| Type | `type` | `DiscoveryQueryType` | ✓ | | Type selects the backend. Must be "prometheus", "loki", or "registry". Enum: `prometheus`,`loki`,`registry` | | Prometheus | `prometheus` | `*DiscoveryPrometheusQuery` | — | | Prometheus contains the configuration when type=prometheus. | | Loki | `loki` | `*DiscoveryLokiQuery` | — | | Loki contains the configuration when type=loki. | +| Registry | `registry` | `*DiscoveryRegistryQuery` | — | | Registry contains the configuration when type=registry. | | SecretRef | `secretRef` | `*corev1.LocalObjectReference` | — | | SecretRef references a Secret in the pod namespace (default "drop-system") for auth/TLS. Supported Secret keys: token, username, password, ca.crt, tls.crt, tls.key, headers.. | ### DiscoveryRanking @@ -206,6 +202,18 @@ DiscoveryRanking defines how signals are combined into the final ordered image l | WeightedSum | `weightedSum` | `*WeightedSumRankingConfig` | — | | WeightedSum is required when strategy=weightedSum. | | ModelExposure | `modelExposure` | `*ModelExposureRankingConfig` | — | | ModelExposure is required when strategy=modelExposure. | +### DiscoveryRegistryQuery + +DiscoveryRegistryQuery defines OCI registry tag listing configuration for image discovery. + +| Field | JSON | Type | Required | Default | Description | +|-------|------|------|----------|---------|-------------| +| URL | `url` | `string` | ✓ | | URL is the registry base URL (without repository path). Example: "https://registry.example.com", "https://ghcr.io" | +| Repositories | `repositories` | `[]string` | ✓ | | Repositories is the list of repository paths to list tags from. Example: ["team/app", "team/worker", "infra/tools"] | +| TagFilter | `tagFilter` | `string` | — | | TagFilter is a regex applied to tag names. Only matching tags are discovered. Example: "^v[0-9]+\\." (semver tags only), "^main-" (main branch builds) | +| TopX | `topX` | `int32` | — | | TopX limits the number of tags kept per repository after tagFilter is applied. The registry API does not guarantee ordering; Drop keeps the last N tags returned by the registry. Example: 3 (keep the last 3 matching tags returned per repo) | +| ImageTemplate | `imageTemplate` | `string` | — | | ImageTemplate is a Go text/template for constructing the full image reference from discovered tags. Available variables: {{.Registry}}, {{.Repository}}, {{.Tag}} Default (when unset): "{{.Registry}}/{{.Repository}}:{{.Tag}}" Example: "registry.example.com/{{.Repository}}:{{.Tag}}" | + ### DiscoverySignal DiscoverySignal defines a named per-image metric derived from a single query. @@ -240,25 +248,6 @@ ImageEntry defines a single image to include in a set. | Tag | `tag` | `string` | — | | Tag to pull. Mutually exclusive with Digest. Example: "1.25-alpine", "v2.4.1" | | Digest | `digest` | `string` | — | | Digest to pull as an immutable reference. Mutually exclusive with Tag. Example: "sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4" | -### ImageRankingDetail - -ImageRankingDetail explains how the final score was computed for one image. - -| Field | JSON | Type | Required | Default | Description | -|-------|------|------|----------|---------|-------------| -| Strategy | `strategy` | `string` | ✓ | | Strategy is the ranking strategy that produced this detail. | -| Terms | `terms` | `[]RankingTerm` | — | | Terms lists the per-signal contributions (populated for weightedSum and modelExposure). | - -### ImageSignalValue - -ImageSignalValue records the raw and normalized value of a signal for one image. - -| Field | JSON | Type | Required | Default | Description | -|-------|------|------|----------|---------|-------------| -| Name | `name` | `string` | ✓ | | Name is the signal name. | -| RawValue | `rawValue` | `string` | ✓ | | RawValue is the unscaled signal value as a decimal string. | -| NormalizedValue | `normalizedValue` | `string` | — | | NormalizedValue is the normalized value (after minMax or other normalization) as a decimal string. Only populated for signals used in a weightedSum ranking. | - ### LokiParser LokiParser configures structured parsing of Loki log entries. @@ -297,23 +286,10 @@ QueryResult reports the outcome of a single named query execution. | Field | JSON | Type | Required | Default | Description | |-------|------|------|----------|---------|-------------| | Name | `name` | `string` | ✓ | | Name matches the queries[].name that produced this result. | -| Type | `type` | `DiscoveryQueryType` | ✓ | | Type is the query backend type (prometheus or loki). | -| Series | `series` | `*int32` | — | | Series is the number of time-series returned (Prometheus queries only). | -| Samples | `samples` | `*int64` | — | | Samples is the total number of data points across all series (Prometheus range queries only). | -| Records | `records` | `*int64` | — | | Records is the number of log records returned (Loki queries only). | +| Type | `type` | `DiscoveryQueryType` | ✓ | | Type is the query backend type (prometheus, loki, or registry). | | Status | `status` | `QueryResultStatus` | ✓ | | Status is "success" or "failed". | | Message | `message` | `string` | — | | Message describes the failure reason when status=failed. | -### RankingTerm - -RankingTerm records the contribution of one signal to the final score of an image. - -| Field | JSON | Type | Required | Default | Description | -|-------|------|------|----------|---------|-------------| -| Signal | `signal` | `string` | ✓ | | Signal is the signal name. | -| Weight | `weight` | `string` | ✓ | | Weight is the configured weight as a decimal string. | -| Contribution | `contribution` | `string` | ✓ | | Contribution is weight * normalizedValue as a decimal string. | - ### SignalRankingConfig SignalRankingConfig configures the signal ranking strategy. @@ -322,17 +298,6 @@ SignalRankingConfig configures the signal ranking strategy. |-------|------|------|----------|---------|-------------| | SignalRef | `signalRef` | `string` | ✓ | | SignalRef is the name of the signal whose values determine image rank. Must match a signals[].name within the same policy. | -### SignalResult - -SignalResult reports the outcome of a single signal derivation. - -| Field | JSON | Type | Required | Default | Description | -|-------|------|------|----------|---------|-------------| -| Name | `name` | `string` | ✓ | | Name matches the signals[].name that produced this result. | -| Images | `images` | `int32` | ✓ | | Images is the number of images for which this signal produced a value. | -| Status | `status` | `string` | ✓ | | Status is "success" or "failed". | -| Message | `message` | `string` | — | | Message describes the failure reason when status=failed. | - ### TimeOfDayWindow TimeOfDayWindow defines a fixed wall-clock time range within each day. @@ -417,7 +382,9 @@ graph LR | Degraded | CachedImageSet | N/N images cached, failing: N | | | Progressing | CachedImageSet | N/N images cached | | | Ready | CachedImageSet | All N images are cached | | -| NotImplemented | DiscoveryPolicy | | | +| ConnectionRefused | DiscoveryPolicy | | | +| DNSError | DiscoveryPolicy | | | +| Synced | DiscoveryPolicy | Pipeline executed successfully; N images discovered. | | ## Metrics @@ -438,7 +405,27 @@ graph LR ## Sample CRs ```yaml -# Dev samples: deployed by Tilt for interactive testing +# Dev samples: deployed by Tilt for interactive testing. +# +# These samples exercise EVERY feature of the operator so developers can spot +# regressions at a glance in the Tilt UI. They run against the e2e-infra stack +# (Prometheus, Loki, and a seeded OCI registry) that Tilt brings up. +# +# Feature coverage: +# PullPolicy ............ dev-conservative +# CachedImage .......... dev-nginx, dev-redis (healthy), test-invalid-image (broken) +# CachedImageSet ....... dev-set (static), dev-set-discovered (discovery-backed) +# Query: prometheus .... dev-prometheus (range), dev-prometheus-instant (instant) +# Query: loki .......... dev-loki (kubernetesEvents parser) +# Query: registry ...... dev-registry +# Signal: aggregate .... dev-prometheus +# Signal: timeWeighted . dev-timeweighted +# Signal: windowAgg .... dev-window +# Signal: eventPullTime dev-loki +# Ranking: signal ...... dev-prometheus +# Ranking: weightedSum . dev-hybrid +# Ranking: modelExposure dev-modelexposure +# Failure cases ........ test-broken-prom, test-broken-registry, test-notfound-repo --- # === PullPolicy === apiVersion: drop.corewire.io/v1alpha1 @@ -577,6 +564,217 @@ spec: syncInterval: 30s maxImages: 10 --- +# === DiscoveryPolicy: Prometheus instant query === +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: dev-prometheus-instant +spec: + queries: + - name: current-usage + type: prometheus + prometheus: + endpoint: "http://prometheus.e2e-infra.svc.cluster.local:9090" + queryType: instant + query: 'container_memory_working_set_bytes{namespace="build-stuff"}' + signals: + - name: current + queryRef: current-usage + type: aggregate + aggregate: + method: max + ranking: + strategy: signal + signal: + signalRef: current + syncInterval: 30s + maxImages: 10 +--- +# === DiscoveryPolicy: timeWeightedAggregate signal === +# Weights samples by hour-of-day before aggregating. +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: dev-timeweighted +spec: + queries: + - name: runner-image-usage + type: prometheus + prometheus: + endpoint: "http://prometheus.e2e-infra.svc.cluster.local:9090" + queryType: range + lookback: 24h + step: 5m + query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff"}) by (image)' + signals: + - name: business-hours-usage + queryRef: runner-image-usage + type: timeWeightedAggregate + timeWeightedAggregate: + method: sum + timezone: "UTC" + defaultWeight: "1" + windows: + - startHour: 8 + endHour: 18 + weight: "2" + ranking: + strategy: signal + signal: + signalRef: business-hours-usage + syncInterval: 30s + maxImages: 10 +--- +# === DiscoveryPolicy: windowAggregate signal (relative window) === +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: dev-window +spec: + queries: + - name: runner-image-usage + type: prometheus + prometheus: + endpoint: "http://prometheus.e2e-infra.svc.cluster.local:9090" + queryType: range + lookback: 24h + step: 5m + query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff"}) by (image)' + signals: + - name: recent-usage + queryRef: runner-image-usage + type: windowAggregate + windowAggregate: + method: sum + relativeWindow: 6h + ranking: + strategy: signal + signal: + signalRef: recent-usage + syncInterval: 30s + maxImages: 10 +--- +# === DiscoveryPolicy: Loki query + eventPullTime signals === +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: dev-loki +spec: + queries: + - name: image-pull-events + type: loki + loki: + endpoint: "http://loki.e2e-infra.svc.cluster.local:3100" + queryType: range + lookback: 24h + query: '{job="kubelet",drop_e2e="true"}' + parser: + type: kubernetesEvents + signals: + - name: p50-cold-pull-time + queryRef: image-pull-events + type: eventPullTime + eventPullTime: + statistic: p50 + durationMode: messageDuration + includeCacheHits: false + - name: pull-failures + queryRef: image-pull-events + type: eventPullTime + eventPullTime: + statistic: failureCount + durationMode: messageDuration + includeCacheHits: false + ranking: + strategy: signal + signal: + signalRef: p50-cold-pull-time + syncInterval: 30s + maxImages: 10 +--- +# === DiscoveryPolicy: registry tag discovery === +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: dev-registry +spec: + queries: + - name: registry-tags + type: registry + registry: + url: "http://registry.e2e-infra.svc.cluster.local:5000" + repositories: + - test/myapp + - test/worker + - test/tools + tagFilter: "^v" + topX: 5 + signals: + - name: tag-recency + queryRef: registry-tags + type: aggregate + aggregate: + method: sum + ranking: + strategy: signal + signal: + signalRef: tag-recency + syncInterval: 30s + maxImages: 20 +--- +# === DiscoveryPolicy: modelExposure ranking (multi-query) === +# Combines Prometheus usage signals with a Loki pull-time signal. +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: dev-modelexposure +spec: + queries: + - name: runner-image-usage + type: prometheus + prometheus: + endpoint: "http://prometheus.e2e-infra.svc.cluster.local:9090" + queryType: range + lookback: 24h + step: 5m + query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff"}) by (image)' + - name: image-pull-events + type: loki + loki: + endpoint: "http://loki.e2e-infra.svc.cluster.local:3100" + queryType: range + lookback: 24h + query: '{job="kubelet",drop_e2e="true"}' + parser: + type: kubernetesEvents + signals: + - name: pre-usage + queryRef: runner-image-usage + type: aggregate + aggregate: + method: sum + - name: target-usage + queryRef: runner-image-usage + type: aggregate + aggregate: + method: max + - name: pull-time + queryRef: image-pull-events + type: eventPullTime + eventPullTime: + statistic: p50 + durationMode: messageDuration + includeCacheHits: false + ranking: + strategy: modelExposure + modelExposure: + nodeCount: 3 + preWindowUsageSignalRef: pre-usage + targetWindowUsageSignalRef: target-usage + pullTimeSignalRef: pull-time + syncInterval: 30s + maxImages: 10 +--- # === DiscoveryPolicy: broken Prometheus endpoint (DNS error) === apiVersion: drop.corewire.io/v1alpha1 kind: DiscoveryPolicy @@ -601,6 +799,58 @@ spec: signalRef: total-usage syncInterval: 30m maxImages: 10 +--- +# === DiscoveryPolicy: broken registry endpoint (DNS error) === +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: test-broken-registry +spec: + queries: + - name: broken-registry + type: registry + registry: + url: "http://nonexistent-registry:5000" + repositories: + - test/app + signals: + - name: tag-recency + queryRef: broken-registry + type: aggregate + aggregate: + method: sum + ranking: + strategy: signal + signal: + signalRef: tag-recency + syncInterval: 30m + maxImages: 10 +--- +# === DiscoveryPolicy: registry repository not found (404) === +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: test-notfound-repo +spec: + queries: + - name: missing-repo + type: registry + registry: + url: "http://registry.e2e-infra.svc.cluster.local:5000" + repositories: + - test/does-not-exist + signals: + - name: tag-recency + queryRef: missing-repo + type: aggregate + aggregate: + method: sum + ranking: + strategy: signal + signal: + signalRef: tag-recency + syncInterval: 30m + maxImages: 10 ``` @@ -624,7 +874,7 @@ spec: make kind-delete # Delete the kind cluster. make install # Install CRDs into cluster. make uninstall # Uninstall CRDs from cluster. - make e2e-infra # Deploy Prometheus + Registry for E2E/dev. + make e2e-infra # Deploy Prometheus, Loki, and Registry for E2E/dev. make docker-build # Build docker image. make docker-push # Push docker image. make kind-load # Build and load image into kind. diff --git a/test/e2e/cachedimageset-discovery/01-pullpolicy.yaml b/test/e2e/cachedimageset-discovery/01-pullpolicy.yaml index ae0c58d..527bb57 100644 --- a/test/e2e/cachedimageset-discovery/01-pullpolicy.yaml +++ b/test/e2e/cachedimageset-discovery/01-pullpolicy.yaml @@ -1,7 +1,7 @@ apiVersion: drop.corewire.io/v1alpha1 kind: PullPolicy metadata: - name: test-set-policy + name: test-cachedimageset-set-policy spec: maxConcurrentNodes: 1 minDelayBetweenPulls: 1s diff --git a/test/e2e/cachedimageset-discovery/02-discoverypolicy.yaml b/test/e2e/cachedimageset-discovery/02-discoverypolicy.yaml index a955d9e..8ccd3c8 100644 --- a/test/e2e/cachedimageset-discovery/02-discoverypolicy.yaml +++ b/test/e2e/cachedimageset-discovery/02-discoverypolicy.yaml @@ -1,7 +1,7 @@ apiVersion: drop.corewire.io/v1alpha1 kind: DiscoveryPolicy metadata: - name: test-registry-discovery + name: test-cachedimageset-prometheus-discovery spec: queries: - name: runner-image-usage diff --git a/test/e2e/cachedimageset-discovery/03-assert-discovery-ready.yaml b/test/e2e/cachedimageset-discovery/03-assert-discovery-ready.yaml index b7215b4..23e8e5e 100644 --- a/test/e2e/cachedimageset-discovery/03-assert-discovery-ready.yaml +++ b/test/e2e/cachedimageset-discovery/03-assert-discovery-ready.yaml @@ -3,7 +3,6 @@ apiVersion: drop.corewire.io/v1alpha1 kind: DiscoveryPolicy metadata: - name: test-registry-discovery + name: test-cachedimageset-prometheus-discovery status: (conditions[?type == 'Ready'] | length(@) > `0`): true - (queryCount == `1`): true diff --git a/test/e2e/cachedimageset-discovery/04-cachedimageset.yaml b/test/e2e/cachedimageset-discovery/04-cachedimageset.yaml index 761cb4c..e1319ae 100644 --- a/test/e2e/cachedimageset-discovery/04-cachedimageset.yaml +++ b/test/e2e/cachedimageset-discovery/04-cachedimageset.yaml @@ -1,9 +1,9 @@ apiVersion: drop.corewire.io/v1alpha1 kind: CachedImageSet metadata: - name: test-discovered-set + name: test-cachedimageset-discovered-set spec: policyRef: - name: test-set-policy + name: test-cachedimageset-set-policy discoveryPolicyRef: - name: test-registry-discovery + name: test-cachedimageset-prometheus-discovery diff --git a/test/e2e/cachedimageset-discovery/chainsaw-test.yaml b/test/e2e/cachedimageset-discovery/chainsaw-test.yaml index c7f2c6e..68e39da 100644 --- a/test/e2e/cachedimageset-discovery/chainsaw-test.yaml +++ b/test/e2e/cachedimageset-discovery/chainsaw-test.yaml @@ -31,14 +31,14 @@ spec: ref: apiVersion: drop.corewire.io/v1alpha1 kind: CachedImageSet - name: test-discovered-set + name: test-cachedimageset-discovered-set - delete: ref: apiVersion: drop.corewire.io/v1alpha1 kind: DiscoveryPolicy - name: test-registry-discovery + name: test-cachedimageset-prometheus-discovery - delete: ref: apiVersion: drop.corewire.io/v1alpha1 kind: PullPolicy - name: test-set-policy + name: test-cachedimageset-set-policy diff --git a/test/e2e/discovery-loki/01-discoverypolicy.yaml b/test/e2e/discovery-loki/01-discoverypolicy.yaml index 214ccd0..19c7cd7 100644 --- a/test/e2e/discovery-loki/01-discoverypolicy.yaml +++ b/test/e2e/discovery-loki/01-discoverypolicy.yaml @@ -1,10 +1,10 @@ apiVersion: drop.corewire.io/v1alpha1 kind: DiscoveryPolicy metadata: - name: e2e-loki + name: e2e-discovery-loki spec: queries: - - name: image-pull-events + - name: discovery-loki-image-pull-events type: loki loki: endpoint: "http://loki.e2e-infra.svc.cluster.local:3100" @@ -16,7 +16,7 @@ spec: signals: # Median cold-pull time derived from the "Successfully pulled ... in Xs" messages. - name: p50-cold-pull-time - queryRef: image-pull-events + queryRef: discovery-loki-image-pull-events type: eventPullTime eventPullTime: statistic: p50 @@ -24,7 +24,7 @@ spec: includeCacheHits: false # Number of pull failures per image. - name: pull-failures - queryRef: image-pull-events + queryRef: discovery-loki-image-pull-events type: eventPullTime eventPullTime: statistic: failureCount diff --git a/test/e2e/discovery-loki/02-assert-discovery-status.yaml b/test/e2e/discovery-loki/02-assert-discovery-status.yaml index bb51364..a10db88 100644 --- a/test/e2e/discovery-loki/02-assert-discovery-status.yaml +++ b/test/e2e/discovery-loki/02-assert-discovery-status.yaml @@ -1,24 +1,18 @@ # Assert that the DiscoveryPolicy with a Loki query + eventPullTime signals # executed the full pipeline successfully: # - Ready=True with reason Synced -# - The Loki query succeeded -# - The eventPullTime signals produced per-image values # - Images parsed from kubelet pull events were discovered and ranked apiVersion: drop.corewire.io/v1alpha1 kind: DiscoveryPolicy metadata: - name: e2e-loki + name: e2e-discovery-loki status: (conditions[?type == 'Ready']): - status: "True" reason: Synced - (queryCount == `1`): true (imageCount > `0`): true - (queryResults[?name == 'image-pull-events'] | [0].status): success - (queryResults[?name == 'image-pull-events'] | [0].type): loki - (signalResults[?name == 'p50-cold-pull-time'] | [0].status): success - (signalResults[?name == 'p50-cold-pull-time'] | [0].images > `0`): true - (signalResults[?name == 'pull-failures'] | [0].status): success + (queryResults[?name == 'discovery-loki-image-pull-events'] | [0].status): success + (queryResults[?name == 'discovery-loki-image-pull-events'] | [0].type): loki (length(discoveredImages[?contains(image, 'test/myapp:v1')]) > `0`): true (length(discoveredImages[?contains(image, 'test/worker:v2')]) > `0`): true (length(discoveredImages[?contains(image, 'test/tools:v1')]) > `0`): true diff --git a/test/e2e/discovery-loki/chainsaw-test.yaml b/test/e2e/discovery-loki/chainsaw-test.yaml index 1cf7af7..fe028a3 100644 --- a/test/e2e/discovery-loki/chainsaw-test.yaml +++ b/test/e2e/discovery-loki/chainsaw-test.yaml @@ -24,4 +24,4 @@ spec: ref: apiVersion: drop.corewire.io/v1alpha1 kind: DiscoveryPolicy - name: e2e-loki + name: e2e-discovery-loki diff --git a/test/e2e/discovery-registry/01-discoverypolicy.yaml b/test/e2e/discovery-registry/01-discoverypolicy.yaml index e062dfe..5e87686 100644 --- a/test/e2e/discovery-registry/01-discoverypolicy.yaml +++ b/test/e2e/discovery-registry/01-discoverypolicy.yaml @@ -1,7 +1,7 @@ apiVersion: drop.corewire.io/v1alpha1 kind: DiscoveryPolicy metadata: - name: e2e-registry + name: e2e-discovery-registry spec: queries: - name: registry-tags diff --git a/test/e2e/discovery-registry/02-assert-discovery-status.yaml b/test/e2e/discovery-registry/02-assert-discovery-status.yaml index b378454..fc3f031 100644 --- a/test/e2e/discovery-registry/02-assert-discovery-status.yaml +++ b/test/e2e/discovery-registry/02-assert-discovery-status.yaml @@ -5,10 +5,9 @@ apiVersion: drop.corewire.io/v1alpha1 kind: DiscoveryPolicy metadata: - name: e2e-registry + name: e2e-discovery-registry status: (conditions[?type == 'Ready']): - status: "True" reason: Synced - (queryCount == `1`): true (imageCount > `0`): true diff --git a/test/e2e/discovery-registry/chainsaw-test.yaml b/test/e2e/discovery-registry/chainsaw-test.yaml index 136a0f6..1d347e5 100644 --- a/test/e2e/discovery-registry/chainsaw-test.yaml +++ b/test/e2e/discovery-registry/chainsaw-test.yaml @@ -23,4 +23,4 @@ spec: ref: apiVersion: drop.corewire.io/v1alpha1 kind: DiscoveryPolicy - name: e2e-registry + name: e2e-discovery-registry diff --git a/test/e2e/discovery/01-discoverypolicy.yaml b/test/e2e/discovery/01-discoverypolicy.yaml index aba13cf..659dd3f 100644 --- a/test/e2e/discovery/01-discoverypolicy.yaml +++ b/test/e2e/discovery/01-discoverypolicy.yaml @@ -1,7 +1,7 @@ apiVersion: drop.corewire.io/v1alpha1 kind: DiscoveryPolicy metadata: - name: e2e-prometheus + name: e2e-discovery-prometheus spec: queries: - name: runner-image-usage diff --git a/test/e2e/discovery/02-assert-discovery-status.yaml b/test/e2e/discovery/02-assert-discovery-status.yaml index 9fd7d43..92303b0 100644 --- a/test/e2e/discovery/02-assert-discovery-status.yaml +++ b/test/e2e/discovery/02-assert-discovery-status.yaml @@ -5,10 +5,9 @@ apiVersion: drop.corewire.io/v1alpha1 kind: DiscoveryPolicy metadata: - name: e2e-prometheus + name: e2e-discovery-prometheus status: (conditions[?type == 'Ready']): - status: "True" reason: Synced - (queryCount == `1`): true (imageCount > `0`): true diff --git a/test/e2e/discovery/03-cachedimageset-discovery.yaml b/test/e2e/discovery/03-cachedimageset-discovery.yaml index f0b81aa..efa84bf 100644 --- a/test/e2e/discovery/03-cachedimageset-discovery.yaml +++ b/test/e2e/discovery/03-cachedimageset-discovery.yaml @@ -1,7 +1,7 @@ apiVersion: drop.corewire.io/v1alpha1 kind: CachedImageSet metadata: - name: discovered-set + name: e2e-discovery-set spec: discoveryPolicyRef: - name: e2e-prometheus + name: e2e-discovery-prometheus diff --git a/test/e2e/discovery/04-assert-children.yaml b/test/e2e/discovery/04-assert-children.yaml index ccc972a..c409898 100644 --- a/test/e2e/discovery/04-assert-children.yaml +++ b/test/e2e/discovery/04-assert-children.yaml @@ -3,4 +3,4 @@ apiVersion: drop.corewire.io/v1alpha1 kind: CachedImage metadata: labels: - drop.corewire.io/imageset: discovered-set + drop.corewire.io/imageset: e2e-discovery-set diff --git a/test/e2e/discovery/chainsaw-test.yaml b/test/e2e/discovery/chainsaw-test.yaml index e521d82..6176675 100644 --- a/test/e2e/discovery/chainsaw-test.yaml +++ b/test/e2e/discovery/chainsaw-test.yaml @@ -32,9 +32,9 @@ spec: ref: apiVersion: drop.corewire.io/v1alpha1 kind: CachedImageSet - name: discovered-set + name: e2e-discovery-set - delete: ref: apiVersion: drop.corewire.io/v1alpha1 kind: DiscoveryPolicy - name: e2e-prometheus + name: e2e-discovery-prometheus diff --git a/test/e2e/test-e2e-20260628-133056.log b/test/e2e/test-e2e-20260628-133056.log new file mode 100644 index 0000000..8094166 --- /dev/null +++ b/test/e2e/test-e2e-20260628-133056.log @@ -0,0 +1,478 @@ +/home/bree/repos/github.com/Breee/puller/bin/chainsaw test test/e2e/ +Version: v0.2.15 +Loading default configuration... +- Using test file: chainsaw-test +- TestDirs [test/e2e/] +- Quiet false +- SkipDelete false +- FailFast false +- Namespace '' +- FastNamespaceDeletion false +- FullName false +- IncludeTestRegex '' +- ExcludeTestRegex '' +- ApplyTimeout 5s +- AssertTimeout 30s +- CleanupTimeout 30s +- DeleteTimeout 15s +- ErrorTimeout 30s +- ExecTimeout 5s +- DeletionPropagationPolicy Background +- Template true +- NoCluster false +- PauseOnFailure false +Loading tests... +- cachedimage-basic (test/e2e/cachedimage-basic) +- cachedimage-failure (test/e2e/cachedimage-failure) +- cachedimage-pacing (test/e2e/cachedimage-pacing) +- cachedimageset (test/e2e/cachedimageset) +- cachedimageset-discovery (test/e2e/cachedimageset-discovery) +- discovery (test/e2e/discovery) +- discovery-failure (test/e2e/discovery-failure) +- discovery-loki (test/e2e/discovery-loki) +- discovery-registry (test/e2e/discovery-registry) +Loading values... +Running tests... +=== RUN chainsaw +=== PAUSE chainsaw +=== CONT chainsaw +=== RUN chainsaw/cachedimage-basic +=== PAUSE chainsaw/cachedimage-basic +=== RUN chainsaw/cachedimage-failure +=== PAUSE chainsaw/cachedimage-failure +=== RUN chainsaw/cachedimage-pacing +=== PAUSE chainsaw/cachedimage-pacing +=== RUN chainsaw/cachedimageset +=== PAUSE chainsaw/cachedimageset +=== RUN chainsaw/cachedimageset-discovery +=== PAUSE chainsaw/cachedimageset-discovery +=== RUN chainsaw/discovery +=== PAUSE chainsaw/discovery +=== RUN chainsaw/discovery-failure +=== PAUSE chainsaw/discovery-failure +=== RUN chainsaw/discovery-loki +=== PAUSE chainsaw/discovery-loki +=== RUN chainsaw/discovery-registry +=== PAUSE chainsaw/discovery-registry +=== CONT chainsaw/cachedimage-basic +=== CONT chainsaw/discovery +=== CONT chainsaw/cachedimageset +=== CONT chainsaw/cachedimageset-discovery +=== CONT chainsaw/discovery-loki +=== CONT chainsaw/discovery-registry +=== CONT chainsaw/discovery-failure +=== CONT chainsaw/cachedimage-pacing +=== CONT chainsaw/cachedimage-failure +=== NAME chainsaw/discovery-failure + sink.go:61: | 13:30:57 | discovery-failure | @chainsaw  | CREATE | OK | v1/Namespace @ chainsaw-model-troll + sink.go:61: | 13:30:57 | discovery-failure | Create DiscoveryPolicy with broken Prometheus endpoint | TRY | BEGIN | + sink.go:61: | 13:30:57 | discovery-failure | Create DiscoveryPolicy with broken Prometheus endpoint | APPLY | RUN | drop.corewire.io/v1alpha1/DiscoveryPolicy @ test-broken-prom +=== NAME chainsaw/discovery + sink.go:61: | 13:30:57 | discovery | @chainsaw  | CREATE | OK | v1/Namespace @ chainsaw-closing-egret + sink.go:61: | 13:30:57 | discovery | Create DiscoveryPolicy with query/signal/ranking pipeline  | TRY | BEGIN | +=== NAME chainsaw/cachedimage-failure + sink.go:61: | 13:30:57 | cachedimage-failure | @chainsaw  | CREATE | OK | v1/Namespace @ chainsaw-hip-horse + sink.go:61: | 13:30:57 | cachedimage-failure | Create PullPolicy  | TRY | BEGIN | +=== NAME chainsaw/cachedimageset + sink.go:61: | 13:30:57 | cachedimageset | @chainsaw  | CREATE | OK | v1/Namespace @ chainsaw-healthy-poodle +=== NAME chainsaw/discovery-loki + sink.go:61: | 13:30:57 | discovery-loki | @chainsaw  | CREATE | OK | v1/Namespace @ chainsaw-distinct-asp +=== NAME chainsaw/cachedimageset + sink.go:61: | 13:30:57 | cachedimageset | Create CachedImageSet  | TRY | BEGIN | +=== NAME chainsaw/discovery-loki + sink.go:61: | 13:30:57 | discovery-loki | Create DiscoveryPolicy with a Loki query and eventPullTime signals  | TRY | BEGIN | +=== NAME chainsaw/cachedimage-basic + sink.go:61: | 13:30:57 | cachedimage-basic | @chainsaw  | CREATE | OK | v1/Namespace @ chainsaw-modern-egret + sink.go:61: | 13:30:57 | cachedimage-basic | Create CachedImage  | TRY | BEGIN | +=== NAME chainsaw/discovery-registry + sink.go:61: | 13:30:57 | discovery-registry | @chainsaw  | CREATE | OK | v1/Namespace @ chainsaw-finer-mantis + sink.go:61: | 13:30:57 | discovery-registry | Create DiscoveryPolicy with registry query  | TRY | BEGIN | +=== NAME chainsaw/cachedimageset-discovery + sink.go:61: | 13:30:57 | cachedimageset-discovery | @chainsaw  | CREATE | OK | v1/Namespace @ chainsaw-viable-kingfish + sink.go:61: | 13:30:57 | cachedimageset-discovery | Create PullPolicy  | TRY | BEGIN | +=== NAME chainsaw/cachedimage-pacing + sink.go:61: | 13:30:57 | cachedimage-pacing | @chainsaw  | CREATE | OK | v1/Namespace @ chainsaw-giving-liger + sink.go:61: | 13:30:57 | cachedimage-pacing | Create PullPolicy  | TRY | BEGIN | +=== NAME chainsaw/cachedimage-failure + sink.go:61: | 13:30:57 | cachedimage-failure | Create PullPolicy  | APPLY | RUN | drop.corewire.io/v1alpha1/PullPolicy @ test-backoff-policy +=== NAME chainsaw/cachedimage-pacing + sink.go:61: | 13:30:57 | cachedimage-pacing | Create PullPolicy  | APPLY | RUN | drop.corewire.io/v1alpha1/PullPolicy @ test-conservative +=== NAME chainsaw/discovery + sink.go:61: | 13:30:57 | discovery | Create DiscoveryPolicy with query/signal/ranking pipeline  | APPLY | RUN | drop.corewire.io/v1alpha1/DiscoveryPolicy @ e2e-prometheus +=== NAME chainsaw/discovery-registry + sink.go:61: | 13:30:57 | discovery-registry | Create DiscoveryPolicy with registry query  | APPLY | RUN | drop.corewire.io/v1alpha1/DiscoveryPolicy @ e2e-registry +=== NAME chainsaw/cachedimageset + sink.go:61: | 13:30:57 | cachedimageset | Create CachedImageSet  | APPLY | RUN | drop.corewire.io/v1alpha1/CachedImageSet @ test-set +=== NAME chainsaw/cachedimage-basic + sink.go:61: | 13:30:57 | cachedimage-basic | Create CachedImage  | APPLY | RUN | drop.corewire.io/v1alpha1/CachedImage @ test-nginx +=== NAME chainsaw/discovery-loki + sink.go:61: | 13:30:57 | discovery-loki | Create DiscoveryPolicy with a Loki query and eventPullTime signals  | APPLY | RUN | drop.corewire.io/v1alpha1/DiscoveryPolicy @ e2e-loki +=== NAME chainsaw/cachedimageset-discovery + sink.go:61: | 13:30:57 | cachedimageset-discovery | Create PullPolicy  | APPLY | RUN | drop.corewire.io/v1alpha1/PullPolicy @ test-set-policy +=== NAME chainsaw/discovery-failure + sink.go:61: | 13:30:57 | discovery-failure | Create DiscoveryPolicy with broken Prometheus endpoint | CREATE | OK | drop.corewire.io/v1alpha1/DiscoveryPolicy @ test-broken-prom + sink.go:61: | 13:30:57 | discovery-failure | Create DiscoveryPolicy with broken Prometheus endpoint | APPLY | DONE | drop.corewire.io/v1alpha1/DiscoveryPolicy @ test-broken-prom + sink.go:61: | 13:30:57 | discovery-failure | Create DiscoveryPolicy with broken Prometheus endpoint | TRY | END | + sink.go:61: | 13:30:57 | discovery-failure | Assert DNSError condition is set  | TRY | BEGIN | + sink.go:61: | 13:30:57 | discovery-failure | Assert DNSError condition is set  | ASSERT | RUN | drop.corewire.io/v1alpha1/DiscoveryPolicy @ test-broken-prom +=== NAME chainsaw/cachedimage-failure + sink.go:61: | 13:30:57 | cachedimage-failure | Create PullPolicy  | CREATE | OK | drop.corewire.io/v1alpha1/PullPolicy @ test-backoff-policy + sink.go:61: | 13:30:57 | cachedimage-failure | Create PullPolicy  | APPLY | DONE | drop.corewire.io/v1alpha1/PullPolicy @ test-backoff-policy + sink.go:61: | 13:30:57 | cachedimage-failure | Create PullPolicy  | TRY | END | + sink.go:61: | 13:30:57 | cachedimage-failure | Create broken CachedImage  | TRY | BEGIN | +=== NAME chainsaw/cachedimage-pacing + sink.go:61: | 13:30:57 | cachedimage-pacing | Create PullPolicy  | CREATE | OK | drop.corewire.io/v1alpha1/PullPolicy @ test-conservative + sink.go:61: | 13:30:57 | cachedimage-pacing | Create PullPolicy  | APPLY | DONE | drop.corewire.io/v1alpha1/PullPolicy @ test-conservative + sink.go:61: | 13:30:57 | cachedimage-pacing | Create PullPolicy  | TRY | END | + sink.go:61: | 13:30:57 | cachedimage-pacing | Create CachedImage referencing policy  | TRY | BEGIN | +=== NAME chainsaw/discovery + sink.go:61: | 13:30:57 | discovery | Create DiscoveryPolicy with query/signal/ranking pipeline  | PATCH | OK | drop.corewire.io/v1alpha1/DiscoveryPolicy @ e2e-prometheus + sink.go:61: | 13:30:57 | discovery | Create DiscoveryPolicy with query/signal/ranking pipeline  | APPLY | DONE | drop.corewire.io/v1alpha1/DiscoveryPolicy @ e2e-prometheus + sink.go:61: | 13:30:57 | discovery | Create DiscoveryPolicy with query/signal/ranking pipeline  | TRY | END | + sink.go:61: | 13:30:57 | discovery | Assert pipeline executed and images were discovered  | TRY | BEGIN | +=== NAME chainsaw/cachedimage-failure + sink.go:61: | 13:30:57 | cachedimage-failure | Create broken CachedImage  | APPLY | RUN | drop.corewire.io/v1alpha1/CachedImage @ test-broken-image +=== NAME chainsaw/cachedimage-pacing + sink.go:61: | 13:30:57 | cachedimage-pacing | Create CachedImage referencing policy  | APPLY | RUN | drop.corewire.io/v1alpha1/CachedImage @ test-paced +=== NAME chainsaw/discovery + sink.go:61: | 13:30:57 | discovery | Assert pipeline executed and images were discovered  | ASSERT | RUN | drop.corewire.io/v1alpha1/DiscoveryPolicy @ e2e-prometheus +=== NAME chainsaw/cachedimageset + sink.go:61: | 13:30:57 | cachedimageset | Create CachedImageSet  | CREATE | OK | drop.corewire.io/v1alpha1/CachedImageSet @ test-set +=== NAME chainsaw/cachedimageset-discovery + sink.go:61: | 13:30:57 | cachedimageset-discovery | Create PullPolicy  | PATCH | OK | drop.corewire.io/v1alpha1/PullPolicy @ test-set-policy + sink.go:61: | 13:30:57 | cachedimageset-discovery | Create PullPolicy  | APPLY | DONE | drop.corewire.io/v1alpha1/PullPolicy @ test-set-policy +=== NAME chainsaw/cachedimageset + sink.go:61: | 13:30:57 | cachedimageset | Create CachedImageSet  | APPLY | DONE | drop.corewire.io/v1alpha1/CachedImageSet @ test-set +=== NAME chainsaw/cachedimageset-discovery + sink.go:61: | 13:30:57 | cachedimageset-discovery | Create PullPolicy  | TRY | END | +=== NAME chainsaw/cachedimageset + sink.go:61: | 13:30:57 | cachedimageset | Create CachedImageSet  | TRY | END | + sink.go:61: | 13:30:57 | cachedimageset | Verify child CachedImages created  | TRY | BEGIN | +=== NAME chainsaw/cachedimageset-discovery + sink.go:61: | 13:30:57 | cachedimageset-discovery | Create DiscoveryPolicy with pipeline schema  | TRY | BEGIN | +=== NAME chainsaw/cachedimage-basic + sink.go:61: | 13:30:57 | cachedimage-basic | Create CachedImage  | CREATE | OK | drop.corewire.io/v1alpha1/CachedImage @ test-nginx + sink.go:61: | 13:30:57 | cachedimage-basic | Create CachedImage  | APPLY | DONE | drop.corewire.io/v1alpha1/CachedImage @ test-nginx + sink.go:61: | 13:30:57 | cachedimage-basic | Create CachedImage  | TRY | END | + sink.go:61: | 13:30:57 | cachedimage-basic | Verify drop Pod is created | TRY | BEGIN | + sink.go:61: | 13:30:57 | cachedimage-basic | Verify drop Pod is created | ASSERT | RUN | v1/Pod @ drop-system/* +=== NAME chainsaw/discovery-loki + sink.go:61: | 13:30:57 | discovery-loki | Create DiscoveryPolicy with a Loki query and eventPullTime signals  | PATCH | OK | drop.corewire.io/v1alpha1/DiscoveryPolicy @ e2e-loki + sink.go:61: | 13:30:57 | discovery-loki | Create DiscoveryPolicy with a Loki query and eventPullTime signals  | APPLY | DONE | drop.corewire.io/v1alpha1/DiscoveryPolicy @ e2e-loki + sink.go:61: | 13:30:57 | discovery-loki | Create DiscoveryPolicy with a Loki query and eventPullTime signals  | TRY | END | + sink.go:61: | 13:30:57 | discovery-loki | Assert pipeline executed and images were discovered from Loki events | TRY | BEGIN | +=== NAME chainsaw/discovery-registry + sink.go:61: | 13:30:57 | discovery-registry | Create DiscoveryPolicy with registry query  | PATCH | OK | drop.corewire.io/v1alpha1/DiscoveryPolicy @ e2e-registry + sink.go:61: | 13:30:57 | discovery-registry | Create DiscoveryPolicy with registry query  | APPLY | DONE | drop.corewire.io/v1alpha1/DiscoveryPolicy @ e2e-registry + sink.go:61: | 13:30:57 | discovery-registry | Create DiscoveryPolicy with registry query  | TRY | END | + sink.go:61: | 13:30:57 | discovery-registry | Assert pipeline executed and images were discovered from registry | TRY | BEGIN | +=== NAME chainsaw/cachedimageset + sink.go:61: | 13:30:57 | cachedimageset | Verify child CachedImages created  | ASSERT | RUN | drop.corewire.io/v1alpha1/CachedImage @ * +=== NAME chainsaw/cachedimageset-discovery + sink.go:61: | 13:30:57 | cachedimageset-discovery | Create DiscoveryPolicy with pipeline schema  | APPLY | RUN | drop.corewire.io/v1alpha1/DiscoveryPolicy @ test-prometheus-discovery +=== NAME chainsaw/discovery-loki + sink.go:61: | 13:30:57 | discovery-loki | Assert pipeline executed and images were discovered from Loki events | ASSERT | RUN | drop.corewire.io/v1alpha1/DiscoveryPolicy @ e2e-loki +=== NAME chainsaw/discovery-registry + sink.go:61: | 13:30:57 | discovery-registry | Assert pipeline executed and images were discovered from registry | ASSERT | RUN | drop.corewire.io/v1alpha1/DiscoveryPolicy @ e2e-registry +=== NAME chainsaw/discovery-failure + sink.go:61: | 13:30:57 | discovery-failure | Assert DNSError condition is set  | ASSERT | DONE | drop.corewire.io/v1alpha1/DiscoveryPolicy @ test-broken-prom + sink.go:61: | 13:30:57 | discovery-failure | Assert DNSError condition is set  | TRY | END | + sink.go:61: | 13:30:57 | discovery-failure | Cleanup  | TRY | BEGIN | + sink.go:61: | 13:30:57 | discovery-failure | Cleanup  | DELETE | RUN | drop.corewire.io/v1alpha1/DiscoveryPolicy @ test-broken-prom +=== NAME chainsaw/cachedimage-failure + sink.go:61: | 13:30:57 | cachedimage-failure | Create broken CachedImage  | CREATE | OK | drop.corewire.io/v1alpha1/CachedImage @ test-broken-image + sink.go:61: | 13:30:57 | cachedimage-failure | Create broken CachedImage  | APPLY | DONE | drop.corewire.io/v1alpha1/CachedImage @ test-broken-image + sink.go:61: | 13:30:57 | cachedimage-failure | Create broken CachedImage  | TRY | END | + sink.go:61: | 13:30:57 | cachedimage-failure | Wait for Degraded status with failure reason | TRY | BEGIN | +=== NAME chainsaw/cachedimage-pacing + sink.go:61: | 13:30:57 | cachedimage-pacing | Create CachedImage referencing policy  | CREATE | OK | drop.corewire.io/v1alpha1/CachedImage @ test-paced + sink.go:61: | 13:30:57 | cachedimage-pacing | Create CachedImage referencing policy  | APPLY | DONE | drop.corewire.io/v1alpha1/CachedImage @ test-paced + sink.go:61: | 13:30:57 | cachedimage-pacing | Create CachedImage referencing policy  | TRY | END | + sink.go:61: | 13:30:57 | cachedimage-pacing | Verify at most one active Pod at a time | TRY | BEGIN | + sink.go:61: | 13:30:57 | cachedimage-pacing | Verify at most one active Pod at a time | CMD | RUN | + === COMMAND + /usr/bin/sh -c count=$(kubectl get pods -n drop-system -l app.kubernetes.io/managed-by=drop,drop.corewire.io/cachedimage=test-paced --no-headers 2>/dev/null | wc -l) + if [ "$count" -gt 1 ]; then + echo "FAIL: expected at most 1 drop pod, got $count" + exit 1 + fi + echo "OK: $count drop pod(s) active" +=== NAME chainsaw/cachedimageset + sink.go:61: | 13:30:57 | cachedimageset | Verify child CachedImages created  | ASSERT | DONE | drop.corewire.io/v1alpha1/CachedImage @ * + sink.go:61: | 13:30:57 | cachedimageset | Verify child CachedImages created  | TRY | END | + sink.go:61: | 13:30:57 | cachedimageset | Delete CachedImageSet and verify GC | TRY | BEGIN | +=== NAME chainsaw/cachedimage-basic + sink.go:61: | 13:30:57 | cachedimage-basic | Verify drop Pod is created | ASSERT | DONE | v1/Pod @ drop-system/* + sink.go:61: | 13:30:57 | cachedimage-basic | Verify drop Pod is created | TRY | END | + sink.go:61: | 13:30:57 | cachedimage-basic | Wait for Ready status  | TRY | BEGIN | + sink.go:61: | 13:30:57 | cachedimage-basic | Wait for Ready status  | CMD | RUN | + === COMMAND + /usr/bin/sh -c deadline=$(( $(date +%s) + 90 )) + while [ "$(date +%s)" -lt "$deadline" ]; do + phase=$(kubectl get cachedimage test-nginx -o jsonpath='{.status.phase}' 2>/dev/null || true) + nodes_ready=$(kubectl get cachedimage test-nginx -o jsonpath='{.status.nodesReady}' 2>/dev/null || true) + nodes_targeted=$(kubectl get cachedimage test-nginx -o jsonpath='{.status.nodesTargeted}' 2>/dev/null || true) + + case "$nodes_ready" in + ''|*[!0-9]*) nodes_ready=0 ;; + esac + case "$nodes_targeted" in + ''|*[!0-9]*) nodes_targeted=0 ;; + esac + + if [ "$nodes_targeted" -ge 1 ] && [ "$nodes_ready" = "$nodes_targeted" ] && [ "$phase" = "Ready" ]; then + echo "OK: CachedImage reached Ready with $nodes_ready/$nodes_targeted target nodes" + exit 0 + fi + + sleep 2 + done + + kubectl get cachedimage test-nginx -o yaml + echo "FAIL: CachedImage did not reach Ready on all targeted nodes" + exit 1 +=== NAME chainsaw/cachedimage-failure + sink.go:61: | 13:30:57 | cachedimage-failure | Wait for Degraded status with failure reason | ASSERT | RUN | drop.corewire.io/v1alpha1/CachedImage @ test-broken-image +=== NAME chainsaw/cachedimageset + sink.go:61: | 13:30:57 | cachedimageset | Delete CachedImageSet and verify GC | DELETE | RUN | drop.corewire.io/v1alpha1/CachedImageSet @ test-set +=== NAME chainsaw/discovery-failure + sink.go:61: | 13:30:57 | discovery-failure | Cleanup  | DELETE | OK | drop.corewire.io/v1alpha1/DiscoveryPolicy @ test-broken-prom +=== NAME chainsaw/cachedimageset-discovery + sink.go:61: | 13:30:57 | cachedimageset-discovery | Create DiscoveryPolicy with pipeline schema  | PATCH | OK | drop.corewire.io/v1alpha1/DiscoveryPolicy @ test-prometheus-discovery + sink.go:61: | 13:30:57 | cachedimageset-discovery | Create DiscoveryPolicy with pipeline schema  | APPLY | DONE | drop.corewire.io/v1alpha1/DiscoveryPolicy @ test-prometheus-discovery + sink.go:61: | 13:30:57 | cachedimageset-discovery | Create DiscoveryPolicy with pipeline schema  | TRY | END | + sink.go:61: | 13:30:57 | cachedimageset-discovery | Wait for DiscoveryPolicy to be reconciled  | TRY | BEGIN | +=== NAME chainsaw/discovery-failure + sink.go:61: | 13:30:57 | discovery-failure | Cleanup  | DELETE | DONE | drop.corewire.io/v1alpha1/DiscoveryPolicy @ test-broken-prom + sink.go:61: | 13:30:57 | discovery-failure | Cleanup  | TRY | END | + sink.go:61: | 13:30:57 | discovery-failure | Create DiscoveryPolicy with broken Prometheus endpoint | CLEANUP | BEGIN | + sink.go:61: | 13:30:57 | discovery-failure | Create DiscoveryPolicy with broken Prometheus endpoint | DELETE | OK | drop.corewire.io/v1alpha1/DiscoveryPolicy @ test-broken-prom + === ERROR + discoverypolicies.drop.corewire.io "test-broken-prom" not found + sink.go:61: | 13:30:57 | discovery-failure | Create DiscoveryPolicy with broken Prometheus endpoint | CLEANUP | END | + sink.go:61: | 13:30:57 | discovery-failure | @chainsaw  | CLEANUP | BEGIN | +=== NAME chainsaw/cachedimageset-discovery + sink.go:61: | 13:30:57 | cachedimageset-discovery | Wait for DiscoveryPolicy to be reconciled  | ASSERT | RUN | drop.corewire.io/v1alpha1/DiscoveryPolicy @ test-prometheus-discovery +=== NAME chainsaw/cachedimageset + sink.go:61: | 13:30:57 | cachedimageset | Delete CachedImageSet and verify GC | DELETE | OK | drop.corewire.io/v1alpha1/CachedImageSet @ test-set +=== NAME chainsaw/discovery-failure + sink.go:61: | 13:30:57 | discovery-failure | @chainsaw  | DELETE | OK | v1/Namespace @ chainsaw-model-troll +=== NAME chainsaw/cachedimageset + sink.go:61: | 13:30:57 | cachedimageset | Delete CachedImageSet and verify GC | DELETE | DONE | drop.corewire.io/v1alpha1/CachedImageSet @ test-set + sink.go:61: | 13:30:57 | cachedimageset | Delete CachedImageSet and verify GC | ERROR | RUN | drop.corewire.io/v1alpha1/CachedImage @ * + sink.go:61: | 13:30:57 | cachedimageset | Delete CachedImageSet and verify GC | ERROR | DONE | drop.corewire.io/v1alpha1/CachedImage @ * + sink.go:61: | 13:30:57 | cachedimageset | Delete CachedImageSet and verify GC | TRY | END | + sink.go:61: | 13:30:57 | cachedimageset | Create CachedImageSet  | CLEANUP | BEGIN | + sink.go:61: | 13:30:57 | cachedimageset | Create CachedImageSet  | DELETE | OK | drop.corewire.io/v1alpha1/CachedImageSet @ test-set + === ERROR + cachedimagesets.drop.corewire.io "test-set" not found + sink.go:61: | 13:30:57 | cachedimageset | Create CachedImageSet  | CLEANUP | END | + sink.go:61: | 13:30:57 | cachedimageset | @chainsaw  | CLEANUP | BEGIN | + sink.go:61: | 13:30:57 | cachedimageset | @chainsaw  | DELETE | OK | v1/Namespace @ chainsaw-healthy-poodle +=== NAME chainsaw/cachedimage-pacing + sink.go:61: | 13:30:57 | cachedimage-pacing | Verify at most one active Pod at a time | SCRIPT | LOG | + === STDOUT + OK: 0 drop pod(s) active + sink.go:61: | 13:30:57 | cachedimage-pacing | Verify at most one active Pod at a time | SCRIPT | DONE | + sink.go:61: | 13:30:57 | cachedimage-pacing | Verify at most one active Pod at a time | TRY | END | + sink.go:61: | 13:30:57 | cachedimage-pacing | Cleanup  | TRY | BEGIN | + sink.go:61: | 13:30:57 | cachedimage-pacing | Cleanup  | DELETE | RUN | drop.corewire.io/v1alpha1/CachedImage @ test-paced + sink.go:61: | 13:30:57 | cachedimage-pacing | Cleanup  | DELETE | OK | drop.corewire.io/v1alpha1/CachedImage @ test-paced + sink.go:61: | 13:30:57 | cachedimage-pacing | Cleanup  | DELETE | DONE | drop.corewire.io/v1alpha1/CachedImage @ test-paced + sink.go:61: | 13:30:57 | cachedimage-pacing | Cleanup  | DELETE | RUN | drop.corewire.io/v1alpha1/PullPolicy @ test-conservative + sink.go:61: | 13:30:57 | cachedimage-pacing | Cleanup  | DELETE | OK | drop.corewire.io/v1alpha1/PullPolicy @ test-conservative + sink.go:61: | 13:30:57 | cachedimage-pacing | Cleanup  | DELETE | DONE | drop.corewire.io/v1alpha1/PullPolicy @ test-conservative + sink.go:61: | 13:30:57 | cachedimage-pacing | Cleanup  | TRY | END | + sink.go:61: | 13:30:57 | cachedimage-pacing | Create CachedImage referencing policy  | CLEANUP | BEGIN | + sink.go:61: | 13:30:57 | cachedimage-pacing | Create CachedImage referencing policy  | DELETE | OK | drop.corewire.io/v1alpha1/CachedImage @ test-paced + === ERROR + cachedimages.drop.corewire.io "test-paced" not found + sink.go:61: | 13:30:57 | cachedimage-pacing | Create CachedImage referencing policy  | CLEANUP | END | + sink.go:61: | 13:30:57 | cachedimage-pacing | Create PullPolicy  | CLEANUP | BEGIN | + sink.go:61: | 13:30:57 | cachedimage-pacing | Create PullPolicy  | DELETE | OK | drop.corewire.io/v1alpha1/PullPolicy @ test-conservative + === ERROR + pullpolicies.drop.corewire.io "test-conservative" not found + sink.go:61: | 13:30:57 | cachedimage-pacing | Create PullPolicy  | CLEANUP | END | + sink.go:61: | 13:30:57 | cachedimage-pacing | @chainsaw  | CLEANUP | BEGIN | + sink.go:61: | 13:30:57 | cachedimage-pacing | @chainsaw  | DELETE | OK | v1/Namespace @ chainsaw-giving-liger +=== NAME chainsaw/discovery-failure + sink.go:61: | 13:31:02 | discovery-failure | @chainsaw  | CLEANUP | END | +=== NAME chainsaw/cachedimageset + sink.go:61: | 13:31:02 | cachedimageset | @chainsaw  | CLEANUP | END | +=== NAME chainsaw/cachedimage-pacing + sink.go:61: | 13:31:03 | cachedimage-pacing | @chainsaw  | CLEANUP | END | +=== NAME chainsaw/cachedimage-basic + sink.go:61: | 13:31:04 | cachedimage-basic | Wait for Ready status  | SCRIPT | LOG | + === STDOUT + OK: CachedImage reached Ready with 2/2 target nodes + sink.go:61: | 13:31:04 | cachedimage-basic | Wait for Ready status  | SCRIPT | DONE | + sink.go:61: | 13:31:04 | cachedimage-basic | Wait for Ready status  | TRY | END | + sink.go:61: | 13:31:04 | cachedimage-basic | Cleanup  | TRY | BEGIN | + sink.go:61: | 13:31:04 | cachedimage-basic | Cleanup  | DELETE | RUN | drop.corewire.io/v1alpha1/CachedImage @ test-nginx + sink.go:61: | 13:31:04 | cachedimage-basic | Cleanup  | DELETE | OK | drop.corewire.io/v1alpha1/CachedImage @ test-nginx + sink.go:61: | 13:31:04 | cachedimage-basic | Cleanup  | DELETE | DONE | drop.corewire.io/v1alpha1/CachedImage @ test-nginx + sink.go:61: | 13:31:04 | cachedimage-basic | Cleanup  | TRY | END | + sink.go:61: | 13:31:04 | cachedimage-basic | Create CachedImage  | CLEANUP | BEGIN | + sink.go:61: | 13:31:04 | cachedimage-basic | Create CachedImage  | DELETE | OK | drop.corewire.io/v1alpha1/CachedImage @ test-nginx + === ERROR + cachedimages.drop.corewire.io "test-nginx" not found + sink.go:61: | 13:31:04 | cachedimage-basic | Create CachedImage  | CLEANUP | END | + sink.go:61: | 13:31:04 | cachedimage-basic | @chainsaw  | CLEANUP | BEGIN | + sink.go:61: | 13:31:04 | cachedimage-basic | @chainsaw  | DELETE | OK | v1/Namespace @ chainsaw-modern-egret +=== NAME chainsaw/cachedimage-failure + sink.go:61: | 13:31:08 | cachedimage-failure | Wait for Degraded status with failure reason | ASSERT | DONE | drop.corewire.io/v1alpha1/CachedImage @ test-broken-image + sink.go:61: | 13:31:08 | cachedimage-failure | Wait for Degraded status with failure reason | TRY | END | + sink.go:61: | 13:31:08 | cachedimage-failure | Verify consecutiveFailures is tracked  | TRY | BEGIN | + sink.go:61: | 13:31:08 | cachedimage-failure | Verify consecutiveFailures is tracked  | ASSERT | RUN | drop.corewire.io/v1alpha1/CachedImage @ test-broken-image + sink.go:61: | 13:31:08 | cachedimage-failure | Verify consecutiveFailures is tracked  | ASSERT | DONE | drop.corewire.io/v1alpha1/CachedImage @ test-broken-image + sink.go:61: | 13:31:08 | cachedimage-failure | Verify consecutiveFailures is tracked  | TRY | END | + sink.go:61: | 13:31:08 | cachedimage-failure | Cleanup  | TRY | BEGIN | + sink.go:61: | 13:31:08 | cachedimage-failure | Cleanup  | DELETE | RUN | drop.corewire.io/v1alpha1/CachedImage @ test-broken-image + sink.go:61: | 13:31:08 | cachedimage-failure | Cleanup  | DELETE | OK | drop.corewire.io/v1alpha1/CachedImage @ test-broken-image + sink.go:61: | 13:31:08 | cachedimage-failure | Cleanup  | DELETE | DONE | drop.corewire.io/v1alpha1/CachedImage @ test-broken-image + sink.go:61: | 13:31:08 | cachedimage-failure | Cleanup  | DELETE | RUN | drop.corewire.io/v1alpha1/PullPolicy @ test-backoff-policy + sink.go:61: | 13:31:08 | cachedimage-failure | Cleanup  | DELETE | OK | drop.corewire.io/v1alpha1/PullPolicy @ test-backoff-policy + sink.go:61: | 13:31:08 | cachedimage-failure | Cleanup  | DELETE | DONE | drop.corewire.io/v1alpha1/PullPolicy @ test-backoff-policy + sink.go:61: | 13:31:08 | cachedimage-failure | Cleanup  | TRY | END | + sink.go:61: | 13:31:08 | cachedimage-failure | Create broken CachedImage  | CLEANUP | BEGIN | + sink.go:61: | 13:31:08 | cachedimage-failure | Create broken CachedImage  | DELETE | OK | drop.corewire.io/v1alpha1/CachedImage @ test-broken-image + === ERROR + cachedimages.drop.corewire.io "test-broken-image" not found + sink.go:61: | 13:31:08 | cachedimage-failure | Create broken CachedImage  | CLEANUP | END | + sink.go:61: | 13:31:08 | cachedimage-failure | Create PullPolicy  | CLEANUP | BEGIN | + sink.go:61: | 13:31:08 | cachedimage-failure | Create PullPolicy  | DELETE | OK | drop.corewire.io/v1alpha1/PullPolicy @ test-backoff-policy + === ERROR + pullpolicies.drop.corewire.io "test-backoff-policy" not found + sink.go:61: | 13:31:08 | cachedimage-failure | Create PullPolicy  | CLEANUP | END | + sink.go:61: | 13:31:08 | cachedimage-failure | @chainsaw  | CLEANUP | BEGIN | + sink.go:61: | 13:31:08 | cachedimage-failure | @chainsaw  | DELETE | OK | v1/Namespace @ chainsaw-hip-horse +=== NAME chainsaw/cachedimage-basic + sink.go:61: | 13:31:09 | cachedimage-basic | @chainsaw  | CLEANUP | END | +=== NAME chainsaw/cachedimage-failure + sink.go:61: | 13:31:13 | cachedimage-failure | @chainsaw  | CLEANUP | END | +=== NAME chainsaw/cachedimageset-discovery + sink.go:61: | 13:31:57 | cachedimageset-discovery | Wait for DiscoveryPolicy to be reconciled  | ASSERT | ERROR | drop.corewire.io/v1alpha1/DiscoveryPolicy @ test-prometheus-discovery + === ERROR + ------------------------------------------------------------------- + drop.corewire.io/v1alpha1/DiscoveryPolicy/test-prometheus-discovery + ------------------------------------------------------------------- + * status.(queryCount == `1`): Invalid value: false: Expected value: true + + --- expected + +++ actual + @@ -2,7 +2,5 @@ + kind: DiscoveryPolicy + metadata: + name: test-prometheus-discovery + -status: + - (conditions[?type == 'Ready'] | length(@) > `0`): true + - (queryCount == `1`): true + +status: {} + sink.go:61: | 13:31:57 | cachedimageset-discovery | Wait for DiscoveryPolicy to be reconciled  | TRY | END | + sink.go:61: | 13:31:57 | cachedimageset-discovery | @chainsaw  | CLEANUP | BEGIN | + sink.go:61: | 13:31:57 | cachedimageset-discovery | @chainsaw  | DELETE | OK | v1/Namespace @ chainsaw-viable-kingfish + sink.go:61: | 13:32:02 | cachedimageset-discovery | @chainsaw  | CLEANUP | END | +=== NAME chainsaw/discovery + sink.go:61: | 13:32:57 | discovery | Assert pipeline executed and images were discovered  | ASSERT | ERROR | drop.corewire.io/v1alpha1/DiscoveryPolicy @ e2e-prometheus + === ERROR + -------------------------------------------------------- + drop.corewire.io/v1alpha1/DiscoveryPolicy/e2e-prometheus + -------------------------------------------------------- + * status.(queryCount == `1`): Invalid value: false: Expected value: true + + --- expected + +++ actual + @@ -2,10 +2,5 @@ + kind: DiscoveryPolicy + metadata: + name: e2e-prometheus + -status: + - (conditions[?type == 'Ready']): + - - reason: Synced + - status: "True" + - (imageCount > `0`): true + - (queryCount == `1`): true + +status: {} + sink.go:61: | 13:32:57 | discovery | Assert pipeline executed and images were discovered  | TRY | END | + sink.go:61: | 13:32:57 | discovery | @chainsaw  | CLEANUP | BEGIN | + sink.go:61: | 13:32:57 | discovery | @chainsaw  | DELETE | OK | v1/Namespace @ chainsaw-closing-egret +=== NAME chainsaw/discovery-loki + sink.go:61: | 13:32:57 | discovery-loki | Assert pipeline executed and images were discovered from Loki events | ASSERT | ERROR | drop.corewire.io/v1alpha1/DiscoveryPolicy @ e2e-loki + === ERROR + -------------------------------------------------- + drop.corewire.io/v1alpha1/DiscoveryPolicy/e2e-loki + -------------------------------------------------- + * status.(queryCount == `1`): Invalid value: false: Expected value: true + + --- expected + +++ actual + @@ -2,18 +2,5 @@ + kind: DiscoveryPolicy + metadata: + name: e2e-loki + -status: + - (conditions[?type == 'Ready']): + - - reason: Synced + - status: "True" + - (imageCount > `0`): true + - (length(discoveredImages[?contains(image, 'test/myapp:v1')]) > `0`): true + - (length(discoveredImages[?contains(image, 'test/tools:v1')]) > `0`): true + - (length(discoveredImages[?contains(image, 'test/worker:v2')]) > `0`): true + - (queryCount == `1`): true + - (queryResults[?name == 'image-pull-events'] | [0].status): success + - (queryResults[?name == 'image-pull-events'] | [0].type): loki + - (signalResults[?name == 'p50-cold-pull-time'] | [0].images > `0`): true + - (signalResults[?name == 'p50-cold-pull-time'] | [0].status): success + - (signalResults[?name == 'pull-failures'] | [0].status): success + +status: {} + sink.go:61: | 13:32:57 | discovery-loki | Assert pipeline executed and images were discovered from Loki events | TRY | END | + sink.go:61: | 13:32:57 | discovery-loki | @chainsaw  | CLEANUP | BEGIN | +=== NAME chainsaw/discovery-registry + sink.go:61: | 13:32:57 | discovery-registry | Assert pipeline executed and images were discovered from registry | ASSERT | ERROR | drop.corewire.io/v1alpha1/DiscoveryPolicy @ e2e-registry + === ERROR + ------------------------------------------------------ + drop.corewire.io/v1alpha1/DiscoveryPolicy/e2e-registry + ------------------------------------------------------ + * status.(queryCount == `1`): Invalid value: false: Expected value: true + + --- expected + +++ actual + @@ -2,10 +2,5 @@ + kind: DiscoveryPolicy + metadata: + name: e2e-registry + -status: + - (conditions[?type == 'Ready']): + - - reason: Synced + - status: "True" + - (imageCount > `0`): true + - (queryCount == `1`): true + +status: {} + sink.go:61: | 13:32:57 | discovery-registry | Assert pipeline executed and images were discovered from registry | TRY | END | + sink.go:61: | 13:32:57 | discovery-registry | @chainsaw  | CLEANUP | BEGIN | + sink.go:61: | 13:32:57 | discovery-registry | @chainsaw  | DELETE | OK | v1/Namespace @ chainsaw-finer-mantis +=== NAME chainsaw/discovery-loki + sink.go:61: | 13:32:57 | discovery-loki | @chainsaw  | DELETE | OK | v1/Namespace @ chainsaw-distinct-asp + sink.go:61: | 13:33:02 | discovery-loki | @chainsaw  | CLEANUP | END | +=== NAME chainsaw/discovery + sink.go:61: | 13:33:02 | discovery | @chainsaw  | CLEANUP | END | +=== NAME chainsaw/discovery-registry + sink.go:61: | 13:33:02 | discovery-registry | @chainsaw  | CLEANUP | END | +--- FAIL: chainsaw (0.00s) + --- PASS: chainsaw/discovery-failure (5.47s) + --- PASS: chainsaw/cachedimageset (5.71s) + --- PASS: chainsaw/cachedimage-pacing (5.87s) + --- PASS: chainsaw/cachedimage-basic (12.80s) + --- PASS: chainsaw/cachedimage-failure (16.77s) + --- FAIL: chainsaw/cachedimageset-discovery (65.52s) + --- FAIL: chainsaw/discovery-loki (125.66s) + --- FAIL: chainsaw/discovery (125.75s) + --- FAIL: chainsaw/discovery-registry (125.86s) +FAIL +Tests Summary... +- Passed tests 5 +- Failed tests 4 +- Skipped tests 0 +Done with failures. +Error: some tests failed +make: *** [Makefile:85: test-e2e] Error 1 From 35195ad8b9cad8fd9e5d2dc3c7f40b17f219f5a1 Mon Sep 17 00:00:00 2001 From: Julian Wachter Date: Sun, 28 Jun 2026 13:43:34 +0200 Subject: [PATCH 15/35] uidoc --- ai-docs/07-feature-ui.md | 47 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 47 insertions(+) create mode 100644 ai-docs/07-feature-ui.md diff --git a/ai-docs/07-feature-ui.md b/ai-docs/07-feature-ui.md new file mode 100644 index 0000000..69d59be --- /dev/null +++ b/ai-docs/07-feature-ui.md @@ -0,0 +1,47 @@ +# UI Feature Specs + +Design specs for a future DiscoveryPolicy UI. All previews use a dry-run API — never persisted in etcd. + +## 1. Query Editor (Stage 1) + +| Element | Purpose | +|---------|---------| +| PromQL/LogQL/registry query input with syntax highlighting | Fast query iteration | +| Live preview table: image ref, raw sample values, sample count | Shows query output before saving the CR | +| Query health badge: latency, series count, error message | Surface slow/broken endpoints | +| Registry: collapsible tag list per repo with tagFilter preview | Highlight matching/excluded tags so regex is visible | + +## 2. Signal Inspector (Stage 2) + +| Element | Purpose | +|---------|---------| +| Bar chart per signal: images on Y-axis sorted by value | "Which images score highest on this signal?" | +| Side-by-side signal comparison (pick 2+) | Reveals when signals disagree on ranking | +| timeWeightedAggregate: heatmap (hour-of-day × image) | Shows if business-hours window config shifts rankings | +| eventPullTime: histogram of pull durations with p50/p90/p95 lines | Debug why an image ranks high ("it takes 12s to pull") | + +## 3. Ranking Playground (Stage 3) + +| Element | Purpose | +|---------|---------| +| Ranked image list with stacked bar score breakdown | Shows *why* an image is ranked #1 vs #5 | +| Weight sliders (weightedSum): drag to reorder in real-time | Eliminates apply-wait-check loop | +| maxImages cutoff line: draggable line on ranked list | Simulate different maxImages values | +| Diff view: images entering/leaving top-N, score deltas | "Did my config change improve things?" | +| modelExposure: node exposure diagram with estimated pull cost | Makes the abstract formula concrete | + +## 4. Cross-cutting Views + +| Element | Purpose | +|---------|---------| +| Pipeline DAG: query → signal → ranking with health per node | Overview for complex multi-query setups | +| etcd budget meter: current status size vs max | Ops visibility | +| Sync timeline: imageCount sparkline with sync events | Detects flapping (oscillating image count) | +| CachedImageSet propagation: discovered → CachedImage → node pull status | Closes the loop: discovery → caching → readiness | + +## Architecture + +- Previews (query editor, weight sliders) computed via a `/dryrun` endpoint or CLI tool +- Dry-run takes a `DiscoveryPolicySpec`, runs the pipeline once, returns full result without writing status +- CR only stores the last committed sync result (slimmed status) +- UI richness comes from dry-run responses, not from bloating the stored status From 30cce4a7fe5935b47c370f6bbbb4f2553d6ffc61 Mon Sep 17 00:00:00 2001 From: Julian Wachter Date: Sun, 28 Jun 2026 13:55:05 +0200 Subject: [PATCH 16/35] doocs --- README.md | 59 +++++++++++++++++++++++------- docs/content/docs/discovery.md | 67 ++++++++++++++++++---------------- docs/hugo.yaml | 9 +++++ 3 files changed, 90 insertions(+), 45 deletions(-) diff --git a/README.md b/README.md index 8cfb7f2..4e67f42 100644 --- a/README.md +++ b/README.md @@ -115,18 +115,19 @@ spec: maxImages: 20 # Only keep images from your internal registry (regex filter, optional) imageFilter: "registry.example.com/.*" - sources: - - type: prometheus + queries: + - name: runner-image-usage + type: prometheus prometheus: # Any Prometheus-compatible API (Prometheus, Thanos, Mimir, VictoriaMetrics) endpoint: https://mimir.example.com # Aggregate over the last 7 days using query_range; counts container # instances per image across the window to produce a usage score + queryType: range lookback: 168h # Resolution step for range queries (default: 5m) step: 5m # PromQL query — MUST return results with an "image" label. - # The result value becomes the ranking score (higher = cached first). query: | count( container_memory_working_set_bytes{ @@ -138,6 +139,16 @@ spec: # Supported keys: token, username, password, ca.crt, tls.crt, tls.key secretRef: name: prometheus-creds + signals: + - name: total-usage + queryRef: runner-image-usage + type: aggregate + aggregate: + method: sum + ranking: + strategy: signal + signal: + signalRef: total-usage --- # --- 3. CachedImageSet: ties discovery + policy together, targets nodes --- apiVersion: drop.corewire.io/v1alpha1 @@ -304,18 +315,19 @@ spec: maxImages: 30 # Only keep images matching this regex (optional) imageFilter: "registry.example.com/.*" - sources: - - type: prometheus + queries: + - name: runner-image-usage + type: prometheus prometheus: # Any Prometheus-compatible API (Prometheus, Thanos, Mimir, VictoriaMetrics) endpoint: https://mimir.example.com # Aggregate over the last 7 days (uses query_range, sums values per image) # Omit for a point-in-time instant query instead + queryType: range lookback: 168h # Resolution step for range queries (default: 5m) step: 5m # PromQL query — MUST return results with an "image" label. - # The result value becomes the ranking score (higher = cached first). query: | count( container_memory_working_set_bytes{ @@ -327,6 +339,16 @@ spec: # Supported keys: token, username, password, ca.crt, tls.crt, tls.key, headers. secretRef: name: prometheus-creds + signals: + - name: total-usage + queryRef: runner-image-usage + type: aggregate + aggregate: + method: sum + ranking: + strategy: signal + signal: + signalRef: total-usage --- apiVersion: drop.corewire.io/v1alpha1 kind: CachedImageSet @@ -362,8 +384,9 @@ metadata: spec: syncInterval: 15m maxImages: 10 - sources: - - type: registry + queries: + - name: registry-tags + type: registry registry: # Registry base URL url: https://registry.example.com @@ -380,6 +403,16 @@ spec: # Supported keys: token, username, password, ca.crt, tls.crt, tls.key, headers. secretRef: name: registry-api-creds + signals: + - name: recent-tag-count + queryRef: registry-tags + type: aggregate + aggregate: + method: count + ranking: + strategy: signal + signal: + signalRef: recent-tag-count --- apiVersion: drop.corewire.io/v1alpha1 kind: CachedImageSet @@ -442,16 +475,16 @@ dev-set AllReady 3/3 3 dev-registry 1h web-apps Degraded 1/3 3 10m $ kubectl get discoverypolicies -NAME STATUS SOURCES IMAGES LASTSYNC AGE -dev-registry Synced 1 3 30s 1h -broken-prom ConnectionRefused 1 0 5m -bad-auth Unauthorized 1 0 2m +NAME STATUS IMAGES LASTSYNC AGE +dev-registry Synced 3 30s 1h +broken-prom ConnectionRefused 0 5m +bad-auth Unauthorized 0 2m ``` ## Development ```bash -# Prerequisites: Go 1.23+, Kind, Tilt, Helm +# Prerequisites: Go 1.26+, Kind, Tilt, Helm make generate # deepcopy make manifests # CRDs + RBAC go build ./... # compile diff --git a/docs/content/docs/discovery.md b/docs/content/docs/discovery.md index 6b9fdd5..a056478 100644 --- a/docs/content/docs/discovery.md +++ b/docs/content/docs/discovery.md @@ -30,6 +30,8 @@ With DiscoveryPolicy, image candidates are continuously sourced from real usage queries → signals → ranking → selected images ``` +![DiscoveryPolicy pipeline: queries feed signals, signals feed a single ranking strategy, the ranked list is written to status.discoveredImages and consumed by CachedImageSet to create CachedImage resources that nodes pull.](/images/discovery-pipeline.svg) + The pipeline has three stages: 1. **Queries** fetch raw observations from systems such as Prometheus or Loki. @@ -201,6 +203,8 @@ signals: Exactly one ranking strategy per policy. +![The three ranking strategies side by side: signal orders by a single signal, weightedSum blends normalized signals, and modelExposure models post-rotation cold-node exposure.](/images/ranking-strategies.svg) + ### `signal` Ranks images directly by the value of a single signal. @@ -233,6 +237,14 @@ Score: `final_score(I) = Σ weight_k * normalize(signal_k(I))` `minMax` normalization: `normalized(x) = (x - min) / (max - min)` — equals 1 when all values are equal. +$$ +\mathrm{final\_score}(I) = \sum_k w_k \cdot \mathrm{normalize}(s_k(I)) +$$ + +$$ +\mathrm{minMax}(x) = \frac{x - x_{\min}}{x_{\max} - x_{\min}} +$$ + ### `modelExposure` Ranks images by expected post-rotation cold-node exposure. @@ -249,6 +261,10 @@ ranking: Score: `score(I) = J_target(I) * (1 - 1/N)^J_pre(I) * p_hat(I)` +$$ +\mathrm{score}(I) = J_{\mathrm{target}}(I) \cdot \left(1 - \frac{1}{N}\right)^{J_{\mathrm{pre}}(I)} \cdot \hat{p}(I) +$$ + ## Complete Examples ### Example 1: Total Usage (simplest) @@ -409,53 +425,40 @@ spec: ## Status and Observability -The controller exposes per-query, per-signal, and per-image ranking detail in status: +Status records query execution outcomes and the final ordered image list used by +`CachedImageSet`. ```yaml status: lastSyncTime: "2026-06-18T10:00:00Z" + imageCount: 2 + + conditions: + - type: Ready + status: "True" + reason: Synced + message: "Discovered 2 images." queryResults: - name: runner-image-usage type: prometheus - series: 30 - samples: 60480 - status: success - - signalResults: - - name: total-usage - images: 30 - status: success - - name: peak-concurrency - images: 30 - status: success + status: success # success | failed (message set on failure) discoveredImages: - image: registry.example.com/ci/java-gradle:21 rank: 1 finalScore: "0.8768" - selected: true - signals: - - name: total-usage - rawValue: "8210" - normalizedValue: "0.824" - - name: peak-concurrency - rawValue: "96" - normalizedValue: "1.0" - ranking: - strategy: weightedSum - terms: - - signal: total-usage - weight: "0.7" - contribution: "0.5768" - - signal: peak-concurrency - weight: "0.3" - contribution: "0.3" + - image: registry.example.com/ci/node:20 + rank: 2 + finalScore: "0.5210" ``` -> **Note:** Pipeline execution is not yet implemented. The controller currently sets -> `Ready=False, reason=NotImplemented` and will populate status once execution is -> available in a future release (Issues 2–10 in the implementation sequence). +| Field | Meaning | +|-------|---------| +| `conditions[Ready]` | `reason=Synced` once the pipeline runs successfully; `message` summarizes the result | +| `imageCount` | Number of discovered images (also a print column) | +| `queryResults[]` | Per-query `name` · `type` · `status` · `message` (on failure) | +| `discoveredImages[]` | Ordered result: `image` · `rank` (1 = highest) · `finalScore` | ## Discovery Strategies Reference diff --git a/docs/hugo.yaml b/docs/hugo.yaml index b23ff26..aba995b 100644 --- a/docs/hugo.yaml +++ b/docs/hugo.yaml @@ -15,6 +15,15 @@ outputs: markup: goldmark: + extensions: + passthrough: + enable: true + delimiters: + block: + - ['\\[', '\\]'] + - ['$$', '$$'] + inline: + - ['\\(', '\\)'] renderer: unsafe: true highlight: From 5ca7d746294ac8d09533d142bf386203afe7e325 Mon Sep 17 00:00:00 2001 From: Julian Wachter Date: Sun, 28 Jun 2026 14:43:57 +0200 Subject: [PATCH 17/35] strats --- docs/static/images/discovery-pipeline.svg | 132 ++++++++++++++++++++++ docs/static/images/ranking-strategies.svg | 116 +++++++++++++++++++ 2 files changed, 248 insertions(+) create mode 100644 docs/static/images/discovery-pipeline.svg create mode 100644 docs/static/images/ranking-strategies.svg diff --git a/docs/static/images/discovery-pipeline.svg b/docs/static/images/discovery-pipeline.svg new file mode 100644 index 0000000..2f11b94 --- /dev/null +++ b/docs/static/images/discovery-pipeline.svg @@ -0,0 +1,132 @@ + + + + + + DiscoveryPolicy Pipeline + queries → signals → ranking → discoveredImages → CachedImageSet → node pulls + + + + STAGE 1 · queries + STAGE 2 · signals + STAGE 3 · ranking + output + + + + + + + + + + + + + prometheus + range / instant series + + + loki + image pull event log + + + registry + tag / catalog listing + + + + + + aggregate + + + timeWeightedAggregate + + + windowAggregate + + + eventPullTime + loki only + + named per-image value derived from one queryRef + + + + + signal + one signal, direct order + + + weightedSum + Σ wₖ · normalize(signalₖ) + + + modelExposure + cold-node exposure model + + exactly one strategy per policy + + + + + status. + discoveredImages + image · rank · finalScore + + + CachedImageSet + discoveryPolicyRef + creates CachedImage + + + node pulls + paced by PullPolicy + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Status fields: + queryResults: name · type · status · message — discoveredImages: image · rank · finalScore. + + + ↻ re-runs every spec.syncInterval · keeps last known good results on transient query failure + diff --git a/docs/static/images/ranking-strategies.svg b/docs/static/images/ranking-strategies.svg new file mode 100644 index 0000000..f525df1 --- /dev/null +++ b/docs/static/images/ranking-strategies.svg @@ -0,0 +1,116 @@ + + + + + + Ranking Strategies + How signals are combined into the final ordered image list (one strategy per policy) + + + + + + signal + Rank directly by one signal. + + + signalRef: total-usage + + + + + + img A + img B + img C + + + + final order + + + 1. img B + + 2. img A + + 3. img C + + + + + + + + weightedSum + Blend normalized signals. + + + + score = Σ wₖ · minMax(signalₖ) + + + + w=0.7 total-usage + w=0.3 peak-concurrency + + + img A + + + + img B + + + + img C + + + + + final order (by blended score) + + + 1. img A + + 2. img B + + 3. img C + + + + + + + + modelExposure + Model cold-node exposure. + + + + J_target · (1 − 1/N)^J_pre · p̂ + + + + + J_target — target-window usage + + J_pre — pre-window usage + + p̂ — cold pull time (loki) + + N = nodeCount · favors slow, post-rotation hot images + + final order (by exposure) + + + 1. img C + + 2. img A + + 3. img B + + + + + Top maxImages entries become status.discoveredImages and are materialized as CachedImage resources. + From ec0006de0a170636e4d85e4e817882789023e922 Mon Sep 17 00:00:00 2001 From: Julian Wachter Date: Mon, 29 Jun 2026 10:54:21 +0200 Subject: [PATCH 18/35] refactor fields --- api/v1alpha1/discoverypolicy_types.go | 35 ++++++-------- api/v1alpha1/zz_generated.deepcopy.go | 20 -------- .../drop.corewire.io_discoverypolicies.yaml | 47 ++++++++----------- config/rbac/role.yaml | 15 +++--- .../drop_v1alpha1_discoverypolicy.yaml | 8 ++-- internal/controller/cachedimage_controller.go | 1 + .../discoverypolicy_controller_test.go | 10 ++-- internal/discovery/engine.go | 22 ++++----- internal/discovery/engine_test.go | 40 ++++++++-------- 9 files changed, 80 insertions(+), 118 deletions(-) diff --git a/api/v1alpha1/discoverypolicy_types.go b/api/v1alpha1/discoverypolicy_types.go index 2e49cb2..1e1be45 100644 --- a/api/v1alpha1/discoverypolicy_types.go +++ b/api/v1alpha1/discoverypolicy_types.go @@ -61,7 +61,7 @@ const ( // DiscoveryQuery defines a named raw-data source referenced by signals. type DiscoveryQuery struct { // Name is the unique identifier for this query within the policy. - // Signals reference queries by this name via queryRef. + // Signals reference queries by this name via query. // +kubebuilder:validation:MinLength=1 Name string `json:"name"` // Type selects the backend. Must be "prometheus", "loki", or "registry". @@ -257,10 +257,10 @@ type DiscoverySignal struct { // Ranking configurations reference signals by this name. // +kubebuilder:validation:MinLength=1 Name string `json:"name"` - // QueryRef is the name of the query that provides raw data for this signal. + // Query is the name of the query that provides raw data for this signal. // Must match a queries[].name within the same policy. // +kubebuilder:validation:MinLength=1 - QueryRef string `json:"queryRef"` + Query string `json:"query"` // Type selects the signal derivation method. // +kubebuilder:validation:Enum=aggregate;timeWeightedAggregate;windowAggregate;eventPullTime Type SignalType `json:"type"` @@ -424,9 +424,10 @@ type DiscoveryRanking struct { // Strategy selects the ranking algorithm. // +kubebuilder:validation:Enum=signal;weightedSum;modelExposure Strategy RankingStrategy `json:"strategy"` - // Signal is required when strategy=signal. + // Signal is the name of the signal whose values determine image rank. + // Must match a signals[].name within the same policy. Required when strategy=signal. // +optional - Signal *SignalRankingConfig `json:"signal,omitempty"` + Signal string `json:"signal,omitempty"` // WeightedSum is required when strategy=weightedSum. // +optional WeightedSum *WeightedSumRankingConfig `json:"weightedSum,omitempty"` @@ -435,14 +436,6 @@ type DiscoveryRanking struct { ModelExposure *ModelExposureRankingConfig `json:"modelExposure,omitempty"` } -// SignalRankingConfig configures the signal ranking strategy. -type SignalRankingConfig struct { - // SignalRef is the name of the signal whose values determine image rank. - // Must match a signals[].name within the same policy. - // +kubebuilder:validation:MinLength=1 - SignalRef string `json:"signalRef"` -} - // NormalizeMethod defines how signal values are normalized before weighted combination. // +kubebuilder:validation:Enum=minMax type NormalizeMethod string @@ -466,10 +459,10 @@ const ( // WeightedSumTerm defines one signal contribution in a weightedSum ranking. type WeightedSumTerm struct { - // SignalRef is the name of the signal to include in the weighted sum. + // Signal is the name of the signal to include in the weighted sum. // Must match a signals[].name within the same policy. // +kubebuilder:validation:MinLength=1 - SignalRef string `json:"signalRef"` + Signal string `json:"signal"` // Weight is the factor applied to the normalized signal value. // All weights should be non-negative; they do not need to sum to 1. // Example: "0.7" @@ -502,18 +495,18 @@ type ModelExposureRankingConfig struct { // NodeCount is the number of eligible CI nodes (N in the exposure formula). // +kubebuilder:validation:Minimum=1 NodeCount int32 `json:"nodeCount"` - // PreWindowUsageSignalRef is the name of the signal representing usage before the target window. + // PreWindowUsageSignal is the name of the signal representing usage before the target window. // Must match a signals[].name within the same policy. // +kubebuilder:validation:MinLength=1 - PreWindowUsageSignalRef string `json:"preWindowUsageSignalRef"` - // TargetWindowUsageSignalRef is the name of the signal representing usage during the target window. + PreWindowUsageSignal string `json:"preWindowUsageSignal"` + // TargetWindowUsageSignal is the name of the signal representing usage during the target window. // Must match a signals[].name within the same policy. // +kubebuilder:validation:MinLength=1 - TargetWindowUsageSignalRef string `json:"targetWindowUsageSignalRef"` - // PullTimeSignalRef is the name of the signal providing per-image pull-time estimates. + TargetWindowUsageSignal string `json:"targetWindowUsageSignal"` + // PullTimeSignal is the name of the signal providing per-image pull-time estimates. // Must match a signals[].name within the same policy. // +kubebuilder:validation:MinLength=1 - PullTimeSignalRef string `json:"pullTimeSignalRef"` + PullTimeSignal string `json:"pullTimeSignal"` } // ============================================================ diff --git a/api/v1alpha1/zz_generated.deepcopy.go b/api/v1alpha1/zz_generated.deepcopy.go index 03d047e..8f3cb74 100644 --- a/api/v1alpha1/zz_generated.deepcopy.go +++ b/api/v1alpha1/zz_generated.deepcopy.go @@ -564,11 +564,6 @@ func (in *DiscoveryQuery) DeepCopy() *DiscoveryQuery { // DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. func (in *DiscoveryRanking) DeepCopyInto(out *DiscoveryRanking) { *out = *in - if in.Signal != nil { - in, out := &in.Signal, &out.Signal - *out = new(SignalRankingConfig) - **out = **in - } if in.WeightedSum != nil { in, out := &in.WeightedSum, &out.WeightedSum *out = new(WeightedSumRankingConfig) @@ -834,21 +829,6 @@ func (in *QueryResult) DeepCopy() *QueryResult { return out } -// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. -func (in *SignalRankingConfig) DeepCopyInto(out *SignalRankingConfig) { - *out = *in -} - -// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new SignalRankingConfig. -func (in *SignalRankingConfig) DeepCopy() *SignalRankingConfig { - if in == nil { - return nil - } - out := new(SignalRankingConfig) - in.DeepCopyInto(out) - return out -} - // DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. func (in *TimeOfDayWindow) DeepCopyInto(out *TimeOfDayWindow) { *out = *in diff --git a/config/crd/bases/drop.corewire.io_discoverypolicies.yaml b/config/crd/bases/drop.corewire.io_discoverypolicies.yaml index 792671b..ccbb0c3 100644 --- a/config/crd/bases/drop.corewire.io_discoverypolicies.yaml +++ b/config/crd/bases/drop.corewire.io_discoverypolicies.yaml @@ -149,7 +149,7 @@ spec: name: description: |- Name is the unique identifier for this query within the policy. - Signals reference queries by this name via queryRef. + Signals reference queries by this name via query. minLength: 1 type: string prometheus: @@ -279,42 +279,35 @@ spec: format: int32 minimum: 1 type: integer - preWindowUsageSignalRef: + preWindowUsageSignal: description: |- - PreWindowUsageSignalRef is the name of the signal representing usage before the target window. + PreWindowUsageSignal is the name of the signal representing usage before the target window. Must match a signals[].name within the same policy. minLength: 1 type: string - pullTimeSignalRef: + pullTimeSignal: description: |- - PullTimeSignalRef is the name of the signal providing per-image pull-time estimates. + PullTimeSignal is the name of the signal providing per-image pull-time estimates. Must match a signals[].name within the same policy. minLength: 1 type: string - targetWindowUsageSignalRef: + targetWindowUsageSignal: description: |- - TargetWindowUsageSignalRef is the name of the signal representing usage during the target window. + TargetWindowUsageSignal is the name of the signal representing usage during the target window. Must match a signals[].name within the same policy. minLength: 1 type: string required: - nodeCount - - preWindowUsageSignalRef - - pullTimeSignalRef - - targetWindowUsageSignalRef + - preWindowUsageSignal + - pullTimeSignal + - targetWindowUsageSignal type: object signal: - description: Signal is required when strategy=signal. - properties: - signalRef: - description: |- - SignalRef is the name of the signal whose values determine image rank. - Must match a signals[].name within the same policy. - minLength: 1 - type: string - required: - - signalRef - type: object + description: |- + Signal is the name of the signal whose values determine image rank. + Must match a signals[].name within the same policy. Required when strategy=signal. + type: string strategy: allOf: - enum: @@ -360,9 +353,9 @@ spec: description: WeightedSumTerm defines one signal contribution in a weightedSum ranking. properties: - signalRef: + signal: description: |- - SignalRef is the name of the signal to include in the weighted sum. + Signal is the name of the signal to include in the weighted sum. Must match a signals[].name within the same policy. minLength: 1 type: string @@ -377,7 +370,7 @@ spec: pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$ x-kubernetes-int-or-string: true required: - - signalRef + - signal - weight type: object minItems: 1 @@ -475,9 +468,9 @@ spec: Ranking configurations reference signals by this name. minLength: 1 type: string - queryRef: + query: description: |- - QueryRef is the name of the query that provides raw data for this signal. + Query is the name of the query that provides raw data for this signal. Must match a queries[].name within the same policy. minLength: 1 type: string @@ -630,7 +623,7 @@ spec: type: object required: - name - - queryRef + - query - type type: object type: array diff --git a/config/rbac/role.yaml b/config/rbac/role.yaml index 76ec601..70112c8 100644 --- a/config/rbac/role.yaml +++ b/config/rbac/role.yaml @@ -4,13 +4,6 @@ kind: ClusterRole metadata: name: manager-role rules: -- apiGroups: - - "" - resources: - - events - verbs: - - create - - patch - apiGroups: - "" resources: @@ -30,6 +23,14 @@ rules: - get - list - watch +- apiGroups: + - "" + - events.k8s.io + resources: + - events + verbs: + - create + - patch - apiGroups: - drop.corewire.io resources: diff --git a/config/samples/drop_v1alpha1_discoverypolicy.yaml b/config/samples/drop_v1alpha1_discoverypolicy.yaml index 82a4856..057c80f 100644 --- a/config/samples/drop_v1alpha1_discoverypolicy.yaml +++ b/config/samples/drop_v1alpha1_discoverypolicy.yaml @@ -26,13 +26,13 @@ spec: signals: - name: total-usage - queryRef: runner-image-usage + query: runner-image-usage type: aggregate aggregate: method: sum - name: peak-concurrency - queryRef: runner-image-usage + query: runner-image-usage type: aggregate aggregate: method: max @@ -43,7 +43,7 @@ spec: normalize: minMax missingSignal: zero terms: - - signalRef: total-usage + - signal: total-usage weight: "700m" - - signalRef: peak-concurrency + - signal: peak-concurrency weight: "300m" diff --git a/internal/controller/cachedimage_controller.go b/internal/controller/cachedimage_controller.go index 24c7c7a..5bb5398 100644 --- a/internal/controller/cachedimage_controller.go +++ b/internal/controller/cachedimage_controller.go @@ -60,6 +60,7 @@ type CachedImageReconciler struct { // +kubebuilder:rbac:groups="",resources=pods,verbs=get;list;watch;create;delete // +kubebuilder:rbac:groups="",resources=nodes,verbs=get;list;watch // +kubebuilder:rbac:groups="",resources=events,verbs=create;patch +// +kubebuilder:rbac:groups=events.k8s.io,resources=events,verbs=create;patch // nodeState tracks the pull state for a single node. type nodeState struct { diff --git a/internal/controller/discoverypolicy_controller_test.go b/internal/controller/discoverypolicy_controller_test.go index 8dc3119..f487ef6 100644 --- a/internal/controller/discoverypolicy_controller_test.go +++ b/internal/controller/discoverypolicy_controller_test.go @@ -117,9 +117,9 @@ var _ = Describe("DiscoveryPolicy Controller", func() { }, Signals: []dropv1alpha1.DiscoverySignal{ { - Name: "tag-score", - QueryRef: "reg-query", - Type: dropv1alpha1.SignalTypeAggregate, + Name: "tag-score", + Query: "reg-query", + Type: dropv1alpha1.SignalTypeAggregate, Aggregate: &dropv1alpha1.AggregateSignalConfig{ Method: dropv1alpha1.AggregationSum, }, @@ -127,9 +127,7 @@ var _ = Describe("DiscoveryPolicy Controller", func() { }, Ranking: &dropv1alpha1.DiscoveryRanking{ Strategy: dropv1alpha1.RankingStrategySignal, - Signal: &dropv1alpha1.SignalRankingConfig{ - SignalRef: "tag-score", - }, + Signal: "tag-score", }, }, } diff --git a/internal/discovery/engine.go b/internal/discovery/engine.go index c39bb01..a927ec2 100644 --- a/internal/discovery/engine.go +++ b/internal/discovery/engine.go @@ -92,7 +92,7 @@ func ExecutePipeline( signalValues := make(map[string]map[string]float64, len(spec.Signals)) for _, sig := range spec.Signals { - raw, ok := rawByQuery[sig.QueryRef] + raw, ok := rawByQuery[sig.Query] if !ok { continue } @@ -467,11 +467,7 @@ func rankImages(ranking *dropv1alpha1.DiscoveryRanking, signals map[string]map[s switch ranking.Strategy { case dropv1alpha1.RankingStrategySignal: - ref := "" - if ranking.Signal != nil { - ref = ranking.Signal.SignalRef - } - sigMap := signals[ref] + sigMap := signals[ranking.Signal] for _, img := range images { v := sigMap[img] items = append(items, scoredItem{ @@ -522,7 +518,7 @@ func weightedSumRank(cfg *dropv1alpha1.WeightedSumRankingConfig, signals map[str type minMax struct{ min, max float64 } bounds := make(map[string]minMax, len(cfg.Terms)) for _, term := range cfg.Terms { - sigMap := signals[term.SignalRef] + sigMap := signals[term.Signal] var mn, mx float64 first := true for _, img := range images { @@ -538,7 +534,7 @@ func weightedSumRank(cfg *dropv1alpha1.WeightedSumRankingConfig, signals map[str } first = false } - bounds[term.SignalRef] = minMax{min: mn, max: mx} + bounds[term.Signal] = minMax{min: mn, max: mx} } normalize := func(v float64, b minMax) float64 { @@ -554,7 +550,7 @@ func weightedSumRank(cfg *dropv1alpha1.WeightedSumRankingConfig, signals map[str drop := false for _, term := range cfg.Terms { - sigMap := signals[term.SignalRef] + sigMap := signals[term.Signal] v, ok := sigMap[img] if !ok { if cfg.MissingSignal == dropv1alpha1.MissingSignalBehaviorDrop { @@ -563,7 +559,7 @@ func weightedSumRank(cfg *dropv1alpha1.WeightedSumRankingConfig, signals map[str } v = 0 } - b := bounds[term.SignalRef] + b := bounds[term.Signal] norm := normalize(v, b) wf := term.Weight.AsApproximateFloat64() totalScore += wf * norm @@ -587,9 +583,9 @@ func modelExposureRank(cfg *dropv1alpha1.ModelExposureRankingConfig, signals map } oneMinusInvN := 1.0 - 1.0/n - preMap := signals[cfg.PreWindowUsageSignalRef] - targetMap := signals[cfg.TargetWindowUsageSignalRef] - pullMap := signals[cfg.PullTimeSignalRef] + preMap := signals[cfg.PreWindowUsageSignal] + targetMap := signals[cfg.TargetWindowUsageSignal] + pullMap := signals[cfg.PullTimeSignal] out := make([]scoredItem, 0, len(images)) for _, img := range images { diff --git a/internal/discovery/engine_test.go b/internal/discovery/engine_test.go index 2fe70b4..43093ad 100644 --- a/internal/discovery/engine_test.go +++ b/internal/discovery/engine_test.go @@ -46,9 +46,9 @@ func TestExecutePipeline_PrometheusInstant(t *testing.T) { }, }, Signals: []dropv1alpha1.DiscoverySignal{ - {Name: "score", QueryRef: "usage", Type: dropv1alpha1.SignalTypeAggregate, Aggregate: &dropv1alpha1.AggregateSignalConfig{Method: dropv1alpha1.AggregationSum}}, + {Name: "score", Query: "usage", Type: dropv1alpha1.SignalTypeAggregate, Aggregate: &dropv1alpha1.AggregateSignalConfig{Method: dropv1alpha1.AggregationSum}}, }, - Ranking: &dropv1alpha1.DiscoveryRanking{Strategy: dropv1alpha1.RankingStrategySignal, Signal: &dropv1alpha1.SignalRankingConfig{SignalRef: "score"}}, + Ranking: &dropv1alpha1.DiscoveryRanking{Strategy: dropv1alpha1.RankingStrategySignal, Signal: "score"}, MaxImages: 10, } @@ -97,9 +97,9 @@ func TestExecutePipeline_Registry(t *testing.T) { }, }, Signals: []dropv1alpha1.DiscoverySignal{ - {Name: "tag-score", QueryRef: "tags", Type: dropv1alpha1.SignalTypeAggregate, Aggregate: &dropv1alpha1.AggregateSignalConfig{Method: dropv1alpha1.AggregationSum}}, + {Name: "tag-score", Query: "tags", Type: dropv1alpha1.SignalTypeAggregate, Aggregate: &dropv1alpha1.AggregateSignalConfig{Method: dropv1alpha1.AggregationSum}}, }, - Ranking: &dropv1alpha1.DiscoveryRanking{Strategy: dropv1alpha1.RankingStrategySignal, Signal: &dropv1alpha1.SignalRankingConfig{SignalRef: "tag-score"}}, + Ranking: &dropv1alpha1.DiscoveryRanking{Strategy: dropv1alpha1.RankingStrategySignal, Signal: "tag-score"}, MaxImages: 10, } @@ -147,8 +147,8 @@ func TestExecutePipeline_WeightedSum(t *testing.T) { {Name: "q2", Type: dropv1alpha1.DiscoveryQueryTypePrometheus, Prometheus: &dropv1alpha1.DiscoveryPrometheusQuery{Endpoint: srv2.URL, Query: "test", QueryType: dropv1alpha1.QueryTypeInstant}}, }, Signals: []dropv1alpha1.DiscoverySignal{ - {Name: "sig1", QueryRef: "q1", Type: dropv1alpha1.SignalTypeAggregate, Aggregate: &dropv1alpha1.AggregateSignalConfig{Method: dropv1alpha1.AggregationSum}}, - {Name: "sig2", QueryRef: "q2", Type: dropv1alpha1.SignalTypeAggregate, Aggregate: &dropv1alpha1.AggregateSignalConfig{Method: dropv1alpha1.AggregationSum}}, + {Name: "sig1", Query: "q1", Type: dropv1alpha1.SignalTypeAggregate, Aggregate: &dropv1alpha1.AggregateSignalConfig{Method: dropv1alpha1.AggregationSum}}, + {Name: "sig2", Query: "q2", Type: dropv1alpha1.SignalTypeAggregate, Aggregate: &dropv1alpha1.AggregateSignalConfig{Method: dropv1alpha1.AggregationSum}}, }, Ranking: &dropv1alpha1.DiscoveryRanking{ Strategy: dropv1alpha1.RankingStrategyWeightedSum, @@ -156,8 +156,8 @@ func TestExecutePipeline_WeightedSum(t *testing.T) { Normalize: dropv1alpha1.NormalizeMethodMinMax, MissingSignal: dropv1alpha1.MissingSignalBehaviorZero, Terms: []dropv1alpha1.WeightedSumTerm{ - {SignalRef: "sig1", Weight: weight700m}, - {SignalRef: "sig2", Weight: weight300m}, + {Signal: "sig1", Weight: weight700m}, + {Signal: "sig2", Weight: weight300m}, }, }, }, @@ -197,9 +197,9 @@ func TestExecutePipeline_MaxImages(t *testing.T) { {Name: "q", Type: dropv1alpha1.DiscoveryQueryTypePrometheus, Prometheus: &dropv1alpha1.DiscoveryPrometheusQuery{Endpoint: srv.URL, Query: "test", QueryType: dropv1alpha1.QueryTypeInstant}}, }, Signals: []dropv1alpha1.DiscoverySignal{ - {Name: "s", QueryRef: "q", Type: dropv1alpha1.SignalTypeAggregate, Aggregate: &dropv1alpha1.AggregateSignalConfig{Method: dropv1alpha1.AggregationSum}}, + {Name: "s", Query: "q", Type: dropv1alpha1.SignalTypeAggregate, Aggregate: &dropv1alpha1.AggregateSignalConfig{Method: dropv1alpha1.AggregationSum}}, }, - Ranking: &dropv1alpha1.DiscoveryRanking{Strategy: dropv1alpha1.RankingStrategySignal, Signal: &dropv1alpha1.SignalRankingConfig{SignalRef: "s"}}, + Ranking: &dropv1alpha1.DiscoveryRanking{Strategy: dropv1alpha1.RankingStrategySignal, Signal: "s"}, MaxImages: 3, } @@ -218,9 +218,9 @@ func TestExecutePipeline_QueryFailure(t *testing.T) { {Name: "bad-query", Type: dropv1alpha1.DiscoveryQueryTypePrometheus, Prometheus: &dropv1alpha1.DiscoveryPrometheusQuery{Endpoint: "http://127.0.0.1:19999", Query: "test"}}, }, Signals: []dropv1alpha1.DiscoverySignal{ - {Name: "s", QueryRef: "bad-query", Type: dropv1alpha1.SignalTypeAggregate, Aggregate: &dropv1alpha1.AggregateSignalConfig{Method: dropv1alpha1.AggregationSum}}, + {Name: "s", Query: "bad-query", Type: dropv1alpha1.SignalTypeAggregate, Aggregate: &dropv1alpha1.AggregateSignalConfig{Method: dropv1alpha1.AggregationSum}}, }, - Ranking: &dropv1alpha1.DiscoveryRanking{Strategy: dropv1alpha1.RankingStrategySignal, Signal: &dropv1alpha1.SignalRankingConfig{SignalRef: "s"}}, + Ranking: &dropv1alpha1.DiscoveryRanking{Strategy: dropv1alpha1.RankingStrategySignal, Signal: "s"}, MaxImages: 10, } @@ -276,16 +276,16 @@ func TestExecutePipeline_WindowAggregate(t *testing.T) { }, Signals: []dropv1alpha1.DiscoverySignal{ { - Name: "recent", - QueryRef: "q", - Type: dropv1alpha1.SignalTypeWindowAggregate, + Name: "recent", + Query: "q", + Type: dropv1alpha1.SignalTypeWindowAggregate, WindowAggregate: &dropv1alpha1.WindowAggregateSignalConfig{ Method: dropv1alpha1.AggregationSum, RelativeWindow: &window, }, }, }, - Ranking: &dropv1alpha1.DiscoveryRanking{Strategy: dropv1alpha1.RankingStrategySignal, Signal: &dropv1alpha1.SignalRankingConfig{SignalRef: "recent"}}, + Ranking: &dropv1alpha1.DiscoveryRanking{Strategy: dropv1alpha1.RankingStrategySignal, Signal: "recent"}, MaxImages: 10, } @@ -401,12 +401,12 @@ func TestExecutePipeline_Loki(t *testing.T) { Signals: []dropv1alpha1.DiscoverySignal{ { Name: "pull-time", - QueryRef: "pull-events", + Query: "pull-events", Type: dropv1alpha1.SignalTypeEventPullTime, EventPullTime: &dropv1alpha1.EventPullTimeSignalConfig{Statistic: dropv1alpha1.EventPullTimeStatisticAvg, DurationMode: dropv1alpha1.DurationModeMessageDuration}, }, }, - Ranking: &dropv1alpha1.DiscoveryRanking{Strategy: dropv1alpha1.RankingStrategySignal, Signal: &dropv1alpha1.SignalRankingConfig{SignalRef: "pull-time"}}, + Ranking: &dropv1alpha1.DiscoveryRanking{Strategy: dropv1alpha1.RankingStrategySignal, Signal: "pull-time"}, MaxImages: 10, } @@ -469,12 +469,12 @@ func TestExecutePipeline_LokiFailureCount(t *testing.T) { Signals: []dropv1alpha1.DiscoverySignal{ { Name: "failures", - QueryRef: "pull-events", + Query: "pull-events", Type: dropv1alpha1.SignalTypeEventPullTime, EventPullTime: &dropv1alpha1.EventPullTimeSignalConfig{Statistic: dropv1alpha1.EventPullTimeStatisticFailureCount, DurationMode: dropv1alpha1.DurationModeMessageDuration}, }, }, - Ranking: &dropv1alpha1.DiscoveryRanking{Strategy: dropv1alpha1.RankingStrategySignal, Signal: &dropv1alpha1.SignalRankingConfig{SignalRef: "failures"}}, + Ranking: &dropv1alpha1.DiscoveryRanking{Strategy: dropv1alpha1.RankingStrategySignal, Signal: "failures"}, MaxImages: 10, } From d4b68d8e0fb171f59ab4d5187d44e27f72f3c7f0 Mon Sep 17 00:00:00 2001 From: Julian Wachter Date: Mon, 29 Jun 2026 10:54:39 +0200 Subject: [PATCH 19/35] update samples --- charts/drop/templates/clusterrole.yaml | 3 ++ hack/dev-samples.yaml | 67 +++++++++++--------------- 2 files changed, 32 insertions(+), 38 deletions(-) diff --git a/charts/drop/templates/clusterrole.yaml b/charts/drop/templates/clusterrole.yaml index 2fe60fd..9f000d0 100644 --- a/charts/drop/templates/clusterrole.yaml +++ b/charts/drop/templates/clusterrole.yaml @@ -50,6 +50,9 @@ rules: - apiGroups: [""] resources: ["events"] verbs: ["create", "patch"] + - apiGroups: ["events.k8s.io"] + resources: ["events"] + verbs: ["create", "patch"] {{- if .Values.metrics.secureServing }} - apiGroups: ["authentication.k8s.io"] resources: ["tokenreviews"] diff --git a/hack/dev-samples.yaml b/hack/dev-samples.yaml index 17e843d..5e54307 100644 --- a/hack/dev-samples.yaml +++ b/hack/dev-samples.yaml @@ -107,14 +107,13 @@ spec: query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff", pod=~"runner-.*"}) by (image)' signals: - name: total-usage - queryRef: runner-image-usage + query: runner-image-usage type: aggregate aggregate: method: sum ranking: strategy: signal - signal: - signalRef: total-usage + signal: total-usage syncInterval: 30s maxImages: 10 --- @@ -135,12 +134,12 @@ spec: query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff"}) by (image)' signals: - name: total-usage - queryRef: runner-image-usage + query: runner-image-usage type: aggregate aggregate: method: sum - name: peak-concurrency - queryRef: runner-image-usage + query: runner-image-usage type: aggregate aggregate: method: max @@ -150,9 +149,9 @@ spec: normalize: minMax missingSignal: zero terms: - - signalRef: total-usage + - signal: total-usage weight: "700m" - - signalRef: peak-concurrency + - signal: peak-concurrency weight: "300m" syncInterval: 30s maxImages: 10 @@ -172,14 +171,13 @@ spec: query: 'container_memory_working_set_bytes{namespace="build-stuff"}' signals: - name: current - queryRef: current-usage + query: current-usage type: aggregate aggregate: method: max ranking: strategy: signal - signal: - signalRef: current + signal: current syncInterval: 30s maxImages: 10 --- @@ -201,7 +199,7 @@ spec: query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff"}) by (image)' signals: - name: business-hours-usage - queryRef: runner-image-usage + query: runner-image-usage type: timeWeightedAggregate timeWeightedAggregate: method: sum @@ -213,8 +211,7 @@ spec: weight: "2" ranking: strategy: signal - signal: - signalRef: business-hours-usage + signal: business-hours-usage syncInterval: 30s maxImages: 10 --- @@ -235,15 +232,14 @@ spec: query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff"}) by (image)' signals: - name: recent-usage - queryRef: runner-image-usage + query: runner-image-usage type: windowAggregate windowAggregate: method: sum relativeWindow: 6h ranking: strategy: signal - signal: - signalRef: recent-usage + signal: recent-usage syncInterval: 30s maxImages: 10 --- @@ -265,14 +261,14 @@ spec: type: kubernetesEvents signals: - name: p50-cold-pull-time - queryRef: image-pull-events + query: image-pull-events type: eventPullTime eventPullTime: statistic: p50 durationMode: messageDuration includeCacheHits: false - name: pull-failures - queryRef: image-pull-events + query: image-pull-events type: eventPullTime eventPullTime: statistic: failureCount @@ -280,8 +276,7 @@ spec: includeCacheHits: false ranking: strategy: signal - signal: - signalRef: p50-cold-pull-time + signal: p50-cold-pull-time syncInterval: 30s maxImages: 10 --- @@ -304,14 +299,13 @@ spec: topX: 5 signals: - name: tag-recency - queryRef: registry-tags + query: registry-tags type: aggregate aggregate: method: sum ranking: strategy: signal - signal: - signalRef: tag-recency + signal: tag-recency syncInterval: 30s maxImages: 20 --- @@ -342,17 +336,17 @@ spec: type: kubernetesEvents signals: - name: pre-usage - queryRef: runner-image-usage + query: runner-image-usage type: aggregate aggregate: method: sum - name: target-usage - queryRef: runner-image-usage + query: runner-image-usage type: aggregate aggregate: method: max - name: pull-time - queryRef: image-pull-events + query: image-pull-events type: eventPullTime eventPullTime: statistic: p50 @@ -362,9 +356,9 @@ spec: strategy: modelExposure modelExposure: nodeCount: 3 - preWindowUsageSignalRef: pre-usage - targetWindowUsageSignalRef: target-usage - pullTimeSignalRef: pull-time + preWindowUsageSignal: pre-usage + targetWindowUsageSignal: target-usage + pullTimeSignal: pull-time syncInterval: 30s maxImages: 10 --- @@ -382,14 +376,13 @@ spec: query: "up{}" signals: - name: total-usage - queryRef: broken-query + query: broken-query type: aggregate aggregate: method: sum ranking: strategy: signal - signal: - signalRef: total-usage + signal: total-usage syncInterval: 30m maxImages: 10 --- @@ -408,14 +401,13 @@ spec: - test/app signals: - name: tag-recency - queryRef: broken-registry + query: broken-registry type: aggregate aggregate: method: sum ranking: strategy: signal - signal: - signalRef: tag-recency + signal: tag-recency syncInterval: 30m maxImages: 10 --- @@ -434,13 +426,12 @@ spec: - test/does-not-exist signals: - name: tag-recency - queryRef: missing-repo + query: missing-repo type: aggregate aggregate: method: sum ranking: strategy: signal - signal: - signalRef: tag-recency + signal: tag-recency syncInterval: 30m maxImages: 10 From 7a53ed083903dc04ad46ecd8929b32ecf623b3f6 Mon Sep 17 00:00:00 2001 From: Julian Wachter Date: Mon, 29 Jun 2026 10:54:49 +0200 Subject: [PATCH 20/35] update tests --- test/e2e/cachedimageset-discovery/02-discoverypolicy.yaml | 5 ++--- test/e2e/discovery-failure/01-broken-prometheus.yaml | 5 ++--- test/e2e/discovery-loki/01-discoverypolicy.yaml | 7 +++---- test/e2e/discovery-registry/01-discoverypolicy.yaml | 5 ++--- test/e2e/discovery/01-discoverypolicy.yaml | 5 ++--- 5 files changed, 11 insertions(+), 16 deletions(-) diff --git a/test/e2e/cachedimageset-discovery/02-discoverypolicy.yaml b/test/e2e/cachedimageset-discovery/02-discoverypolicy.yaml index 8ccd3c8..3540ddd 100644 --- a/test/e2e/cachedimageset-discovery/02-discoverypolicy.yaml +++ b/test/e2e/cachedimageset-discovery/02-discoverypolicy.yaml @@ -14,13 +14,12 @@ spec: query: 'count(container_memory_working_set_bytes{container!="", namespace="build-stuff"}) by (image)' signals: - name: total-usage - queryRef: runner-image-usage + query: runner-image-usage type: aggregate aggregate: method: sum ranking: strategy: signal - signal: - signalRef: total-usage + signal: total-usage syncInterval: 30s maxImages: 10 diff --git a/test/e2e/discovery-failure/01-broken-prometheus.yaml b/test/e2e/discovery-failure/01-broken-prometheus.yaml index cc096df..4c5e355 100644 --- a/test/e2e/discovery-failure/01-broken-prometheus.yaml +++ b/test/e2e/discovery-failure/01-broken-prometheus.yaml @@ -11,13 +11,12 @@ spec: query: "up{}" signals: - name: total-usage - queryRef: broken-query + query: broken-query type: aggregate aggregate: method: sum ranking: strategy: signal - signal: - signalRef: total-usage + signal: total-usage syncInterval: 30m maxImages: 10 diff --git a/test/e2e/discovery-loki/01-discoverypolicy.yaml b/test/e2e/discovery-loki/01-discoverypolicy.yaml index 19c7cd7..56d6add 100644 --- a/test/e2e/discovery-loki/01-discoverypolicy.yaml +++ b/test/e2e/discovery-loki/01-discoverypolicy.yaml @@ -16,7 +16,7 @@ spec: signals: # Median cold-pull time derived from the "Successfully pulled ... in Xs" messages. - name: p50-cold-pull-time - queryRef: discovery-loki-image-pull-events + query: discovery-loki-image-pull-events type: eventPullTime eventPullTime: statistic: p50 @@ -24,7 +24,7 @@ spec: includeCacheHits: false # Number of pull failures per image. - name: pull-failures - queryRef: discovery-loki-image-pull-events + query: discovery-loki-image-pull-events type: eventPullTime eventPullTime: statistic: failureCount @@ -32,7 +32,6 @@ spec: includeCacheHits: false ranking: strategy: signal - signal: - signalRef: p50-cold-pull-time + signal: p50-cold-pull-time syncInterval: 30s maxImages: 10 diff --git a/test/e2e/discovery-registry/01-discoverypolicy.yaml b/test/e2e/discovery-registry/01-discoverypolicy.yaml index 5e87686..73fd9b8 100644 --- a/test/e2e/discovery-registry/01-discoverypolicy.yaml +++ b/test/e2e/discovery-registry/01-discoverypolicy.yaml @@ -16,13 +16,12 @@ spec: topX: 5 signals: - name: tag-recency - queryRef: registry-tags + query: registry-tags type: aggregate aggregate: method: sum ranking: strategy: signal - signal: - signalRef: tag-recency + signal: tag-recency syncInterval: 30s maxImages: 20 diff --git a/test/e2e/discovery/01-discoverypolicy.yaml b/test/e2e/discovery/01-discoverypolicy.yaml index 659dd3f..e9af13a 100644 --- a/test/e2e/discovery/01-discoverypolicy.yaml +++ b/test/e2e/discovery/01-discoverypolicy.yaml @@ -14,13 +14,12 @@ spec: query: 'count(container_memory_working_set_bytes{container!="", namespace="build-stuff"}) by (image)' signals: - name: total-usage - queryRef: runner-image-usage + query: runner-image-usage type: aggregate aggregate: method: sum ranking: strategy: signal - signal: - signalRef: total-usage + signal: total-usage syncInterval: 30s maxImages: 10 From c6e77ef67143cdae8762d50c6eb3b42664f780c8 Mon Sep 17 00:00:00 2001 From: Julian Wachter Date: Mon, 29 Jun 2026 10:54:58 +0200 Subject: [PATCH 21/35] update make --- Makefile | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 63 insertions(+) diff --git a/Makefile b/Makefile index cd1e033..43d857c 100644 --- a/Makefile +++ b/Makefile @@ -143,6 +143,69 @@ docs-gen-check: docs-gen ## Verify generated AI docs are up to date. @git diff --exit-code knowledge.yaml llms.txt llms-full.txt docs/static/llms-full.txt .github/copilot-instructions.md .cursorrules AGENTS.md docs/content/docs/reference/_generated_*.md || \ (echo "ERROR: generated docs are out of date — run 'make docs-gen'" && exit 1) +##@ Research + +RESEARCH_TEX_DIR ?= research/tex +RESEARCH_TEX_FILE ?= paper.tex +RESEARCH_BENCH_DIR ?= research/benchmark/evaluator +RESEARCH_BENCH_VENV ?= $(RESEARCH_BENCH_DIR)/.venv +RESEARCH_BENCH_RESULTS_DIR ?= research/benchmark/results +RESEARCH_BENCH_RESULTS_DISCOVERY_20RUNS ?= $(RESEARCH_BENCH_RESULTS_DIR)/discovery-strategy-20runs +RESEARCH_BENCH_RESULTS_ORACLE_20RUNS ?= $(RESEARCH_BENCH_RESULTS_DIR)/oracle-gap-strategy-20runs +RESEARCH_BENCH_RESULTS_CACHE_20RUNS ?= $(RESEARCH_BENCH_RESULTS_DIR)/ci-image-cache-20runs + +.PHONY: research-tex-build +research-tex-build: ## Build research PDF from TeX source (override RESEARCH_TEX_FILE=). + @cd $(RESEARCH_TEX_DIR) && \ + if command -v latexmk >/dev/null 2>&1; then \ + latexmk -pdf -interaction=nonstopmode -halt-on-error $(RESEARCH_TEX_FILE); \ + elif command -v pdflatex >/dev/null 2>&1; then \ + pdflatex -interaction=nonstopmode -halt-on-error $(RESEARCH_TEX_FILE) && \ + pdflatex -interaction=nonstopmode -halt-on-error $(RESEARCH_TEX_FILE); \ + else \ + echo "ERROR: latexmk/pdflatex not found"; exit 1; \ + fi + +.PHONY: research-bench-setup +research-bench-setup: ## Create benchmark venv and install Python dependencies. + @cd $(RESEARCH_BENCH_DIR) && \ + python3 -m venv .venv && \ + . .venv/bin/activate && \ + pip install -r requirements.txt + +.PHONY: research-bench-generate +research-bench-generate: ## Generate synthetic benchmark dataset. + @cd $(RESEARCH_BENCH_DIR) && \ + . .venv/bin/activate && \ + python generate_synthetic_day.py --out data --jobs 25000 --nodes 100 --images 30 --seed 20260621 + +.PHONY: research-bench-replay +research-bench-replay: ## Run replay policy evaluation from benchmark data. + @cd $(RESEARCH_BENCH_DIR) && \ + . .venv/bin/activate && \ + python evaluate_replay.py --data data --out outputs + +.PHONY: research-bench-discovery +research-bench-discovery: ## Evaluate discovery strategies from benchmark data. + @cd $(RESEARCH_BENCH_DIR) && \ + . .venv/bin/activate && \ + python evaluate_discovery_strategies.py --data data --out outputs/strategy_eval + +.PHONY: research-bench-plot +research-bench-plot: ## Render example pipeline Gantt figure. + @cd $(RESEARCH_BENCH_DIR) && \ + . .venv/bin/activate && \ + python plot_pipeline_gantt.py --modeled-jobs outputs/modeled_jobs_no_prewarming.csv --out figures/example_gantt.png + +.PHONY: research-bench-20runs +research-bench-20runs: ## Run 20-run discovery strategy benchmark batch. + @cd $(RESEARCH_BENCH_DIR) && \ + . .venv/bin/activate && \ + python run_discovery_strategy_20runs.py + +.PHONY: research-bench-all +research-bench-all: research-bench-generate research-bench-replay research-bench-discovery research-bench-plot ## Run full synthetic benchmark workflow. + .PHONY: tools tools: ## Install local tooling and check optional docs/chart binaries. @$(MAKE) kustomize controller-gen setup-envtest golangci-lint chainsaw From 43e11550f3bb9096d81cf6290cf3cdb5471040a3 Mon Sep 17 00:00:00 2001 From: Julian Wachter Date: Mon, 29 Jun 2026 11:07:42 +0200 Subject: [PATCH 22/35] gen --- README.md | 15 +++--- knowledge.yaml | 127 ++++++++++++++++++++++++------------------------- llms-full.txt | 97 +++++++++++++++++-------------------- 3 files changed, 113 insertions(+), 126 deletions(-) diff --git a/README.md b/README.md index 4e67f42..34f89b7 100644 --- a/README.md +++ b/README.md @@ -141,14 +141,13 @@ spec: name: prometheus-creds signals: - name: total-usage - queryRef: runner-image-usage + query: runner-image-usage type: aggregate aggregate: method: sum ranking: strategy: signal - signal: - signalRef: total-usage + signal: total-usage --- # --- 3. CachedImageSet: ties discovery + policy together, targets nodes --- apiVersion: drop.corewire.io/v1alpha1 @@ -341,14 +340,13 @@ spec: name: prometheus-creds signals: - name: total-usage - queryRef: runner-image-usage + query: runner-image-usage type: aggregate aggregate: method: sum ranking: strategy: signal - signal: - signalRef: total-usage + signal: total-usage --- apiVersion: drop.corewire.io/v1alpha1 kind: CachedImageSet @@ -405,14 +403,13 @@ spec: name: registry-api-creds signals: - name: recent-tag-count - queryRef: registry-tags + query: registry-tags type: aggregate aggregate: method: count ranking: strategy: signal - signal: - signalRef: recent-tag-count + signal: recent-tag-count --- apiVersion: drop.corewire.io/v1alpha1 kind: CachedImageSet diff --git a/knowledge.yaml b/knowledge.yaml index 3658558..8f509ae 100644 --- a/knowledge.yaml +++ b/knowledge.yaml @@ -468,7 +468,7 @@ helperTypes: json: name type: string required: true - doc: Name is the unique identifier for this query within the policy. Signals reference queries by this name via queryRef. + doc: Name is the unique identifier for this query within the policy. Signals reference queries by this name via query. - name: Type json: type type: DiscoveryQueryType @@ -512,9 +512,9 @@ helperTypes: doc: Strategy selects the ranking algorithm. - name: Signal json: signal - type: '*SignalRankingConfig' + type: string required: false - doc: Signal is required when strategy=signal. + doc: Signal is the name of the signal whose values determine image rank. Must match a signals[].name within the same policy. Required when strategy=signal. - name: WeightedSum json: weightedSum type: '*WeightedSumRankingConfig' @@ -561,11 +561,11 @@ helperTypes: type: string required: true doc: Name is the unique identifier for this signal within the policy. Ranking configurations reference signals by this name. - - name: QueryRef - json: queryRef + - name: Query + json: query type: string required: true - doc: QueryRef is the name of the query that provides raw data for this signal. Must match a queries[].name within the same policy. + doc: Query is the name of the query that provides raw data for this signal. Must match a queries[].name within the same policy. - name: Type json: type type: SignalType @@ -683,21 +683,21 @@ helperTypes: type: int32 required: true doc: NodeCount is the number of eligible CI nodes (N in the exposure formula). - - name: PreWindowUsageSignalRef - json: preWindowUsageSignalRef + - name: PreWindowUsageSignal + json: preWindowUsageSignal type: string required: true - doc: PreWindowUsageSignalRef is the name of the signal representing usage before the target window. Must match a signals[].name within the same policy. - - name: TargetWindowUsageSignalRef - json: targetWindowUsageSignalRef + doc: PreWindowUsageSignal is the name of the signal representing usage before the target window. Must match a signals[].name within the same policy. + - name: TargetWindowUsageSignal + json: targetWindowUsageSignal type: string required: true - doc: TargetWindowUsageSignalRef is the name of the signal representing usage during the target window. Must match a signals[].name within the same policy. - - name: PullTimeSignalRef - json: pullTimeSignalRef + doc: TargetWindowUsageSignal is the name of the signal representing usage during the target window. Must match a signals[].name within the same policy. + - name: PullTimeSignal + json: pullTimeSignal type: string required: true - doc: PullTimeSignalRef is the name of the signal providing per-image pull-time estimates. Must match a signals[].name within the same policy. + doc: PullTimeSignal is the name of the signal providing per-image pull-time estimates. Must match a signals[].name within the same policy. - name: PolicyReference doc: PolicyReference is a reference to a PullPolicy resource. fields: @@ -729,14 +729,6 @@ helperTypes: type: string required: false doc: Message describes the failure reason when status=failed. - - name: SignalRankingConfig - doc: SignalRankingConfig configures the signal ranking strategy. - fields: - - name: SignalRef - json: signalRef - type: string - required: true - doc: SignalRef is the name of the signal whose values determine image rank. Must match a signals[].name within the same policy. - name: TimeOfDayWindow doc: TimeOfDayWindow defines a fixed wall-clock time range within each day. fields: @@ -825,11 +817,11 @@ helperTypes: - name: WeightedSumTerm doc: WeightedSumTerm defines one signal contribution in a weightedSum ranking. fields: - - name: SignalRef - json: signalRef + - name: Signal + json: signal type: string required: true - doc: SignalRef is the name of the signal to include in the weighted sum. Must match a signals[].name within the same policy. + doc: Signal is the name of the signal to include in the weighted sum. Must match a signals[].name within the same policy. - name: Weight json: weight type: resource.Quantity @@ -1055,6 +1047,22 @@ makeTargets: desc: Regenerate AI agent docs (llms.txt, instructions, etc.) from source. - name: docs-gen-check desc: Verify generated AI docs are up to date. + - name: research-tex-build + desc: Build research PDF from TeX source (override RESEARCH_TEX_FILE=). + - name: research-bench-setup + desc: Create benchmark venv and install Python dependencies. + - name: research-bench-generate + desc: Generate synthetic benchmark dataset. + - name: research-bench-replay + desc: Run replay policy evaluation from benchmark data. + - name: research-bench-discovery + desc: Evaluate discovery strategies from benchmark data. + - name: research-bench-plot + desc: Render example pipeline Gantt figure. + - name: research-bench-20runs + desc: Run 20-run discovery strategy benchmark batch. + - name: research-bench-all + desc: Run full synthetic benchmark workflow. - name: tools desc: Install local tooling and check optional docs/chart binaries. samples: | @@ -1167,14 +1175,13 @@ samples: | query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff", pod=~"runner-.*"}) by (image)' signals: - name: total-usage - queryRef: runner-image-usage + query: runner-image-usage type: aggregate aggregate: method: sum ranking: strategy: signal - signal: - signalRef: total-usage + signal: total-usage syncInterval: 30s maxImages: 10 --- @@ -1195,12 +1202,12 @@ samples: | query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff"}) by (image)' signals: - name: total-usage - queryRef: runner-image-usage + query: runner-image-usage type: aggregate aggregate: method: sum - name: peak-concurrency - queryRef: runner-image-usage + query: runner-image-usage type: aggregate aggregate: method: max @@ -1210,9 +1217,9 @@ samples: | normalize: minMax missingSignal: zero terms: - - signalRef: total-usage + - signal: total-usage weight: "700m" - - signalRef: peak-concurrency + - signal: peak-concurrency weight: "300m" syncInterval: 30s maxImages: 10 @@ -1232,14 +1239,13 @@ samples: | query: 'container_memory_working_set_bytes{namespace="build-stuff"}' signals: - name: current - queryRef: current-usage + query: current-usage type: aggregate aggregate: method: max ranking: strategy: signal - signal: - signalRef: current + signal: current syncInterval: 30s maxImages: 10 --- @@ -1261,7 +1267,7 @@ samples: | query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff"}) by (image)' signals: - name: business-hours-usage - queryRef: runner-image-usage + query: runner-image-usage type: timeWeightedAggregate timeWeightedAggregate: method: sum @@ -1273,8 +1279,7 @@ samples: | weight: "2" ranking: strategy: signal - signal: - signalRef: business-hours-usage + signal: business-hours-usage syncInterval: 30s maxImages: 10 --- @@ -1295,15 +1300,14 @@ samples: | query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff"}) by (image)' signals: - name: recent-usage - queryRef: runner-image-usage + query: runner-image-usage type: windowAggregate windowAggregate: method: sum relativeWindow: 6h ranking: strategy: signal - signal: - signalRef: recent-usage + signal: recent-usage syncInterval: 30s maxImages: 10 --- @@ -1325,14 +1329,14 @@ samples: | type: kubernetesEvents signals: - name: p50-cold-pull-time - queryRef: image-pull-events + query: image-pull-events type: eventPullTime eventPullTime: statistic: p50 durationMode: messageDuration includeCacheHits: false - name: pull-failures - queryRef: image-pull-events + query: image-pull-events type: eventPullTime eventPullTime: statistic: failureCount @@ -1340,8 +1344,7 @@ samples: | includeCacheHits: false ranking: strategy: signal - signal: - signalRef: p50-cold-pull-time + signal: p50-cold-pull-time syncInterval: 30s maxImages: 10 --- @@ -1364,14 +1367,13 @@ samples: | topX: 5 signals: - name: tag-recency - queryRef: registry-tags + query: registry-tags type: aggregate aggregate: method: sum ranking: strategy: signal - signal: - signalRef: tag-recency + signal: tag-recency syncInterval: 30s maxImages: 20 --- @@ -1402,17 +1404,17 @@ samples: | type: kubernetesEvents signals: - name: pre-usage - queryRef: runner-image-usage + query: runner-image-usage type: aggregate aggregate: method: sum - name: target-usage - queryRef: runner-image-usage + query: runner-image-usage type: aggregate aggregate: method: max - name: pull-time - queryRef: image-pull-events + query: image-pull-events type: eventPullTime eventPullTime: statistic: p50 @@ -1422,9 +1424,9 @@ samples: | strategy: modelExposure modelExposure: nodeCount: 3 - preWindowUsageSignalRef: pre-usage - targetWindowUsageSignalRef: target-usage - pullTimeSignalRef: pull-time + preWindowUsageSignal: pre-usage + targetWindowUsageSignal: target-usage + pullTimeSignal: pull-time syncInterval: 30s maxImages: 10 --- @@ -1442,14 +1444,13 @@ samples: | query: "up{}" signals: - name: total-usage - queryRef: broken-query + query: broken-query type: aggregate aggregate: method: sum ranking: strategy: signal - signal: - signalRef: total-usage + signal: total-usage syncInterval: 30m maxImages: 10 --- @@ -1468,14 +1469,13 @@ samples: | - test/app signals: - name: tag-recency - queryRef: broken-registry + query: broken-registry type: aggregate aggregate: method: sum ranking: strategy: signal - signal: - signalRef: tag-recency + signal: tag-recency syncInterval: 30m maxImages: 10 --- @@ -1494,13 +1494,12 @@ samples: | - test/does-not-exist signals: - name: tag-recency - queryRef: missing-repo + query: missing-repo type: aggregate aggregate: method: sum ranking: strategy: signal - signal: - signalRef: tag-recency + signal: tag-recency syncInterval: 30m maxImages: 10 diff --git a/llms-full.txt b/llms-full.txt index 1b02bee..f999172 100644 --- a/llms-full.txt +++ b/llms-full.txt @@ -184,7 +184,7 @@ DiscoveryQuery defines a named raw-data source referenced by signals. | Field | JSON | Type | Required | Default | Description | |-------|------|------|----------|---------|-------------| -| Name | `name` | `string` | ✓ | | Name is the unique identifier for this query within the policy. Signals reference queries by this name via queryRef. | +| Name | `name` | `string` | ✓ | | Name is the unique identifier for this query within the policy. Signals reference queries by this name via query. | | Type | `type` | `DiscoveryQueryType` | ✓ | | Type selects the backend. Must be "prometheus", "loki", or "registry". Enum: `prometheus`,`loki`,`registry` | | Prometheus | `prometheus` | `*DiscoveryPrometheusQuery` | — | | Prometheus contains the configuration when type=prometheus. | | Loki | `loki` | `*DiscoveryLokiQuery` | — | | Loki contains the configuration when type=loki. | @@ -198,7 +198,7 @@ DiscoveryRanking defines how signals are combined into the final ordered image l | Field | JSON | Type | Required | Default | Description | |-------|------|------|----------|---------|-------------| | Strategy | `strategy` | `RankingStrategy` | ✓ | | Strategy selects the ranking algorithm. Enum: `signal`,`weightedSum`,`modelExposure` | -| Signal | `signal` | `*SignalRankingConfig` | — | | Signal is required when strategy=signal. | +| Signal | `signal` | `string` | — | | Signal is the name of the signal whose values determine image rank. Must match a signals[].name within the same policy. Required when strategy=signal. | | WeightedSum | `weightedSum` | `*WeightedSumRankingConfig` | — | | WeightedSum is required when strategy=weightedSum. | | ModelExposure | `modelExposure` | `*ModelExposureRankingConfig` | — | | ModelExposure is required when strategy=modelExposure. | @@ -221,7 +221,7 @@ DiscoverySignal defines a named per-image metric derived from a single query. | Field | JSON | Type | Required | Default | Description | |-------|------|------|----------|---------|-------------| | Name | `name` | `string` | ✓ | | Name is the unique identifier for this signal within the policy. Ranking configurations reference signals by this name. | -| QueryRef | `queryRef` | `string` | ✓ | | QueryRef is the name of the query that provides raw data for this signal. Must match a queries[].name within the same policy. | +| Query | `query` | `string` | ✓ | | Query is the name of the query that provides raw data for this signal. Must match a queries[].name within the same policy. | | Type | `type` | `SignalType` | ✓ | | Type selects the signal derivation method. Enum: `aggregate`,`timeWeightedAggregate`,`windowAggregate`,`eventPullTime` | | Aggregate | `aggregate` | `*AggregateSignalConfig` | — | | Aggregate is required when type=aggregate. | | TimeWeightedAggregate | `timeWeightedAggregate` | `*TimeWeightedAggregateSignalConfig` | — | | TimeWeightedAggregate is required when type=timeWeightedAggregate. | @@ -267,9 +267,9 @@ ModelExposureRankingConfig configures the modelExposure ranking strategy. Score | Field | JSON | Type | Required | Default | Description | |-------|------|------|----------|---------|-------------| | NodeCount | `nodeCount` | `int32` | ✓ | | NodeCount is the number of eligible CI nodes (N in the exposure formula). | -| PreWindowUsageSignalRef | `preWindowUsageSignalRef` | `string` | ✓ | | PreWindowUsageSignalRef is the name of the signal representing usage before the target window. Must match a signals[].name within the same policy. | -| TargetWindowUsageSignalRef | `targetWindowUsageSignalRef` | `string` | ✓ | | TargetWindowUsageSignalRef is the name of the signal representing usage during the target window. Must match a signals[].name within the same policy. | -| PullTimeSignalRef | `pullTimeSignalRef` | `string` | ✓ | | PullTimeSignalRef is the name of the signal providing per-image pull-time estimates. Must match a signals[].name within the same policy. | +| PreWindowUsageSignal | `preWindowUsageSignal` | `string` | ✓ | | PreWindowUsageSignal is the name of the signal representing usage before the target window. Must match a signals[].name within the same policy. | +| TargetWindowUsageSignal | `targetWindowUsageSignal` | `string` | ✓ | | TargetWindowUsageSignal is the name of the signal representing usage during the target window. Must match a signals[].name within the same policy. | +| PullTimeSignal | `pullTimeSignal` | `string` | ✓ | | PullTimeSignal is the name of the signal providing per-image pull-time estimates. Must match a signals[].name within the same policy. | ### PolicyReference @@ -290,14 +290,6 @@ QueryResult reports the outcome of a single named query execution. | Status | `status` | `QueryResultStatus` | ✓ | | Status is "success" or "failed". | | Message | `message` | `string` | — | | Message describes the failure reason when status=failed. | -### SignalRankingConfig - -SignalRankingConfig configures the signal ranking strategy. - -| Field | JSON | Type | Required | Default | Description | -|-------|------|------|----------|---------|-------------| -| SignalRef | `signalRef` | `string` | ✓ | | SignalRef is the name of the signal whose values determine image rank. Must match a signals[].name within the same policy. | - ### TimeOfDayWindow TimeOfDayWindow defines a fixed wall-clock time range within each day. @@ -344,7 +336,7 @@ WeightedSumTerm defines one signal contribution in a weightedSum ranking. | Field | JSON | Type | Required | Default | Description | |-------|------|------|----------|---------|-------------| -| SignalRef | `signalRef` | `string` | ✓ | | SignalRef is the name of the signal to include in the weighted sum. Must match a signals[].name within the same policy. | +| Signal | `signal` | `string` | ✓ | | Signal is the name of the signal to include in the weighted sum. Must match a signals[].name within the same policy. | | Weight | `weight` | `resource.Quantity` | ✓ | | Weight is the factor applied to the normalized signal value. All weights should be non-negative; they do not need to sum to 1. Example: "0.7" | ### WindowAggregateSignalConfig @@ -514,14 +506,13 @@ spec: query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff", pod=~"runner-.*"}) by (image)' signals: - name: total-usage - queryRef: runner-image-usage + query: runner-image-usage type: aggregate aggregate: method: sum ranking: strategy: signal - signal: - signalRef: total-usage + signal: total-usage syncInterval: 30s maxImages: 10 --- @@ -542,12 +533,12 @@ spec: query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff"}) by (image)' signals: - name: total-usage - queryRef: runner-image-usage + query: runner-image-usage type: aggregate aggregate: method: sum - name: peak-concurrency - queryRef: runner-image-usage + query: runner-image-usage type: aggregate aggregate: method: max @@ -557,9 +548,9 @@ spec: normalize: minMax missingSignal: zero terms: - - signalRef: total-usage + - signal: total-usage weight: "700m" - - signalRef: peak-concurrency + - signal: peak-concurrency weight: "300m" syncInterval: 30s maxImages: 10 @@ -579,14 +570,13 @@ spec: query: 'container_memory_working_set_bytes{namespace="build-stuff"}' signals: - name: current - queryRef: current-usage + query: current-usage type: aggregate aggregate: method: max ranking: strategy: signal - signal: - signalRef: current + signal: current syncInterval: 30s maxImages: 10 --- @@ -608,7 +598,7 @@ spec: query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff"}) by (image)' signals: - name: business-hours-usage - queryRef: runner-image-usage + query: runner-image-usage type: timeWeightedAggregate timeWeightedAggregate: method: sum @@ -620,8 +610,7 @@ spec: weight: "2" ranking: strategy: signal - signal: - signalRef: business-hours-usage + signal: business-hours-usage syncInterval: 30s maxImages: 10 --- @@ -642,15 +631,14 @@ spec: query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff"}) by (image)' signals: - name: recent-usage - queryRef: runner-image-usage + query: runner-image-usage type: windowAggregate windowAggregate: method: sum relativeWindow: 6h ranking: strategy: signal - signal: - signalRef: recent-usage + signal: recent-usage syncInterval: 30s maxImages: 10 --- @@ -672,14 +660,14 @@ spec: type: kubernetesEvents signals: - name: p50-cold-pull-time - queryRef: image-pull-events + query: image-pull-events type: eventPullTime eventPullTime: statistic: p50 durationMode: messageDuration includeCacheHits: false - name: pull-failures - queryRef: image-pull-events + query: image-pull-events type: eventPullTime eventPullTime: statistic: failureCount @@ -687,8 +675,7 @@ spec: includeCacheHits: false ranking: strategy: signal - signal: - signalRef: p50-cold-pull-time + signal: p50-cold-pull-time syncInterval: 30s maxImages: 10 --- @@ -711,14 +698,13 @@ spec: topX: 5 signals: - name: tag-recency - queryRef: registry-tags + query: registry-tags type: aggregate aggregate: method: sum ranking: strategy: signal - signal: - signalRef: tag-recency + signal: tag-recency syncInterval: 30s maxImages: 20 --- @@ -749,17 +735,17 @@ spec: type: kubernetesEvents signals: - name: pre-usage - queryRef: runner-image-usage + query: runner-image-usage type: aggregate aggregate: method: sum - name: target-usage - queryRef: runner-image-usage + query: runner-image-usage type: aggregate aggregate: method: max - name: pull-time - queryRef: image-pull-events + query: image-pull-events type: eventPullTime eventPullTime: statistic: p50 @@ -769,9 +755,9 @@ spec: strategy: modelExposure modelExposure: nodeCount: 3 - preWindowUsageSignalRef: pre-usage - targetWindowUsageSignalRef: target-usage - pullTimeSignalRef: pull-time + preWindowUsageSignal: pre-usage + targetWindowUsageSignal: target-usage + pullTimeSignal: pull-time syncInterval: 30s maxImages: 10 --- @@ -789,14 +775,13 @@ spec: query: "up{}" signals: - name: total-usage - queryRef: broken-query + query: broken-query type: aggregate aggregate: method: sum ranking: strategy: signal - signal: - signalRef: total-usage + signal: total-usage syncInterval: 30m maxImages: 10 --- @@ -815,14 +800,13 @@ spec: - test/app signals: - name: tag-recency - queryRef: broken-registry + query: broken-registry type: aggregate aggregate: method: sum ranking: strategy: signal - signal: - signalRef: tag-recency + signal: tag-recency syncInterval: 30m maxImages: 10 --- @@ -841,14 +825,13 @@ spec: - test/does-not-exist signals: - name: tag-recency - queryRef: missing-repo + query: missing-repo type: aggregate aggregate: method: sum ranking: strategy: signal - signal: - signalRef: tag-recency + signal: tag-recency syncInterval: 30m maxImages: 10 @@ -883,5 +866,13 @@ spec: make docs-serve # Serve Hugo docs locally. make docs-gen # Regenerate AI agent docs (llms.txt, instructions, etc.) from source. make docs-gen-check # Verify generated AI docs are up to date. + make research-tex-build # Build research PDF from TeX source (override RESEARCH_TEX_FILE=). + make research-bench-setup # Create benchmark venv and install Python dependencies. + make research-bench-generate # Generate synthetic benchmark dataset. + make research-bench-replay # Run replay policy evaluation from benchmark data. + make research-bench-discovery # Evaluate discovery strategies from benchmark data. + make research-bench-plot # Render example pipeline Gantt figure. + make research-bench-20runs # Run 20-run discovery strategy benchmark batch. + make research-bench-all # Run full synthetic benchmark workflow. make tools # Install local tooling and check optional docs/chart binaries. ``` From bc3981d09734d8b2cd190d1ba9980896580afb2b Mon Sep 17 00:00:00 2001 From: Julian Wachter Date: Mon, 29 Jun 2026 11:07:48 +0200 Subject: [PATCH 23/35] ignore --- .gitignore | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/.gitignore b/.gitignore index 94e121d..f74ee95 100644 --- a/.gitignore +++ b/.gitignore @@ -43,3 +43,10 @@ docs/.hugo_build.lock # Generated CRD chart templates (produced by make sync-crds in CI) charts/drop-crds/templates/drop.corewire.io_*.yaml charts/drop/templates/crds-drop.corewire.io_*.yaml + +# Imported research archives (always unpack; never commit zip bundles) +research/**/*.zip + +# Python cache artifacts +__pycache__/ +*.pyc From 83193d3d58b31940ddb143f97ebfe88891b727dd Mon Sep 17 00:00:00 2001 From: Julian Wachter Date: Mon, 29 Jun 2026 11:07:53 +0200 Subject: [PATCH 24/35] dashboard --- charts/drop/dashboards/drop-operator.json | 32 ++++++++++++++++++++++- 1 file changed, 31 insertions(+), 1 deletion(-) diff --git a/charts/drop/dashboards/drop-operator.json b/charts/drop/dashboards/drop-operator.json index 98c7bdc..364b597 100644 --- a/charts/drop/dashboards/drop-operator.json +++ b/charts/drop/dashboards/drop-operator.json @@ -36,7 +36,7 @@ "timezone": "browser", "title": "Drop Operator", "uid": "drop-operator", - "version": 2, + "version": 3, "refresh": "10s", "panels": [ { @@ -412,6 +412,36 @@ "datasource": "Prometheus", "targets": [{ "expr": "sum by (image, node) (drop_images_cached_total{image=~\"$image\"})", "format": "table", "instant": true }], "transformations": [{ "id": "organize", "options": { "excludeByName": { "Time": true } } }] + }, + { + "id": 106, + "title": "Operator Resources", + "type": "row", + "gridPos": { "h": 1, "w": 24, "x": 0, "y": 60 }, + "collapsed": false, + "panels": [] + }, + { + "id": 70, + "title": "Operator CPU (cores)", + "type": "timeseries", + "gridPos": { "h": 8, "w": 12, "x": 0, "y": 61 }, + "datasource": "Prometheus", + "targets": [{ "expr": "sum(rate(process_cpu_seconds_total{job=\"drop-operator\"}[5m]))", "legendFormat": "cpu" }], + "fieldConfig": { + "defaults": { "unit": "short", "decimals": 3, "custom": { "drawStyle": "line", "fillOpacity": 20, "lineWidth": 2 } } + } + }, + { + "id": 71, + "title": "Operator Memory (RSS)", + "type": "timeseries", + "gridPos": { "h": 8, "w": 12, "x": 12, "y": 61 }, + "datasource": "Prometheus", + "targets": [{ "expr": "process_resident_memory_bytes{job=\"drop-operator\"}", "legendFormat": "rss" }], + "fieldConfig": { + "defaults": { "unit": "bytes", "custom": { "drawStyle": "line", "fillOpacity": 20, "lineWidth": 2 } } + } } ] } From d20ae80d209fe61f13c5881f00ce794fcab3c1dc Mon Sep 17 00:00:00 2001 From: Julian Wachter Date: Mon, 29 Jun 2026 11:30:45 +0200 Subject: [PATCH 25/35] alloy conform events --- hack/e2e-infra/alloy.yaml | 98 +++++++++++++++++++ hack/e2e-infra/setup.sh | 7 ++ internal/discovery/loki.go | 21 +++- internal/discovery/loki_test.go | 44 +++++++++ .../discovery-loki-alloy/00-failing-pod.yaml | 12 +++ .../01-discoverypolicy.yaml | 32 ++++++ .../02-assert-discovery-status.yaml | 11 +++ .../discovery-loki-alloy/chainsaw-test.yaml | 24 +++++ 8 files changed, 246 insertions(+), 3 deletions(-) create mode 100644 hack/e2e-infra/alloy.yaml create mode 100644 test/e2e/discovery-loki-alloy/00-failing-pod.yaml create mode 100644 test/e2e/discovery-loki-alloy/01-discoverypolicy.yaml create mode 100644 test/e2e/discovery-loki-alloy/02-assert-discovery-status.yaml create mode 100644 test/e2e/discovery-loki-alloy/chainsaw-test.yaml diff --git a/hack/e2e-infra/alloy.yaml b/hack/e2e-infra/alloy.yaml new file mode 100644 index 0000000..f3ba338 --- /dev/null +++ b/hack/e2e-infra/alloy.yaml @@ -0,0 +1,98 @@ +apiVersion: v1 +kind: ServiceAccount +metadata: + name: alloy + namespace: e2e-infra +--- +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRole +metadata: + name: alloy-events +rules: + - apiGroups: [""] + resources: ["events"] + verbs: ["get", "list", "watch"] + - apiGroups: ["events.k8s.io"] + resources: ["events"] + verbs: ["get", "list", "watch"] +--- +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRoleBinding +metadata: + name: alloy-events +roleRef: + apiGroup: rbac.authorization.k8s.io + kind: ClusterRole + name: alloy-events +subjects: + - kind: ServiceAccount + name: alloy + namespace: e2e-infra +--- +apiVersion: v1 +kind: ConfigMap +metadata: + name: alloy-config + namespace: e2e-infra +data: + config.alloy: | + // Tail real Kubernetes events from all namespaces and ship them to Loki. + // log_format=json so the kubernetesEvents parser can extract name/reason/message. + loki.source.kubernetes_events "events" { + job_name = "kubelet" + log_format = "json" + forward_to = [loki.write.local.receiver] + } + + // Tag every line with drop_e2e=true so discovery queries can scope to seed data. + loki.write "local" { + external_labels = { drop_e2e = "true" } + endpoint { + url = "http://loki.e2e-infra.svc.cluster.local:3100/loki/api/v1/push" + } + } +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: alloy + namespace: e2e-infra + labels: + app: alloy +spec: + replicas: 1 + selector: + matchLabels: + app: alloy + template: + metadata: + labels: + app: alloy + spec: + serviceAccountName: alloy + containers: + - name: alloy + image: grafana/alloy:v1.7.5 + args: + - run + - /etc/alloy/config.alloy + - --storage.path=/var/lib/alloy/data + ports: + - containerPort: 12345 + volumeMounts: + - name: config + mountPath: /etc/alloy + - name: data + mountPath: /var/lib/alloy/data + resources: + requests: + cpu: 25m + memory: 64Mi + limits: + memory: 128Mi + volumes: + - name: config + configMap: + name: alloy-config + - name: data + emptyDir: {} diff --git a/hack/e2e-infra/setup.sh b/hack/e2e-infra/setup.sh index 31fc872..799b33c 100755 --- a/hack/e2e-infra/setup.sh +++ b/hack/e2e-infra/setup.sh @@ -23,6 +23,10 @@ kubectl apply -n "$NAMESPACE" -f "$SCRIPT_DIR/prometheus.yaml" echo "[e2e-infra] Deploying Loki..." kubectl apply -n "$NAMESPACE" -f "$SCRIPT_DIR/loki.yaml" +# --- Deploy Alloy to ship real Kubernetes events into Loki --- +echo "[e2e-infra] Deploying Alloy (kubernetes_events -> Loki)..." +kubectl apply -f "$SCRIPT_DIR/alloy.yaml" + # --- Wait for readiness --- echo "[e2e-infra] Waiting for registry to be ready..." kubectl -n "$NAMESPACE" wait --for=condition=available deployment/registry --timeout=90s @@ -51,6 +55,9 @@ echo "[e2e-infra] Waiting for Loki to be ready..." # Loki single-binary startup can lag behind registry/prometheus in CI clusters. kubectl -n "$NAMESPACE" wait --for=condition=available deployment/loki --timeout=300s +echo "[e2e-infra] Waiting for Alloy to be ready..." +kubectl -n "$NAMESPACE" wait --for=condition=available deployment/alloy --timeout=120s + # --- Seed the registry with a few images --- echo "[e2e-infra] Seeding registry with test images..." REGISTRY_POD=$(kubectl -n "$NAMESPACE" get pods -l app=registry -o jsonpath='{.items[0].metadata.name}') diff --git a/internal/discovery/loki.go b/internal/discovery/loki.go index 7877263..6f10169 100644 --- a/internal/discovery/loki.go +++ b/internal/discovery/loki.go @@ -206,13 +206,13 @@ func parseKubernetesEventStreams(streams []lokiStream, parser *dropv1alpha1.Loki var parsed map[string]interface{} if err := json.Unmarshal([]byte(entry[1]), &parsed); err == nil { if rec.reason == "" { - rec.reason, _ = parsed[reasonField].(string) + rec.reason = lokiJSONField(parsed, reasonField, "reason") } if rec.pod == "" { - rec.pod, _ = parsed[podField].(string) + rec.pod = lokiJSONField(parsed, podField, "involvedObject_name", "name") } if rec.message == "" { - rec.message, _ = parsed[messageField].(string) + rec.message = lokiJSONField(parsed, messageField, lokiMessageField, "msg") } } else if rec.message == "" { rec.message = entry[1] @@ -358,3 +358,18 @@ func lokiCoalesceField(field, defaultVal string) string { } return defaultVal } + +// lokiJSONField reads the first non-empty string value from a JSON event using the +// configured key first, then common aliases (e.g. Grafana Alloy emits "msg"/"name" +// where raw event JSON uses "message"/"involvedObject_name"). Returns "" if none match. +func lokiJSONField(parsed map[string]interface{}, keys ...string) string { + for _, k := range keys { + if k == "" { + continue + } + if v, ok := parsed[k].(string); ok && v != "" { + return v + } + } + return "" +} diff --git a/internal/discovery/loki_test.go b/internal/discovery/loki_test.go index a852fcf..757e91c 100644 --- a/internal/discovery/loki_test.go +++ b/internal/discovery/loki_test.go @@ -176,6 +176,50 @@ func TestLokiSource_FetchRaw_KubernetesEvents_EventPair(t *testing.T) { } } +// TestLokiSource_FetchRaw_KubernetesEvents_AlloyJSON verifies that events shipped by +// Grafana Alloy (loki.source.kubernetes_events, log_format=json) parse with the default +// parser fields. Alloy emits "msg"/"name" in the JSON body, not "message"/"involvedObject_name". +func TestLokiSource_FetchRaw_KubernetesEvents_AlloyJSON(t *testing.T) { + now := time.Now() + streams := []lokiStream{ + { + Stream: map[string]string{"namespace": "default", "job": "kubelet"}, + Values: [][]string{{nanoStringLoki(now.Add(-2 * time.Second)), + `{"reason":"Pulled","name":"runner-abc","msg":"Successfully pulled image \"nginx:1.25\" in 740ms"}`}}, + }, + { + Stream: map[string]string{"namespace": "default", "job": "kubelet"}, + Values: [][]string{{nanoStringLoki(now.Add(-1 * time.Second)), + `{"reason":"Failed","name":"runner-def","msg":"Failed to pull image \"broken:v1\": not found"}`}}, + }, + } + + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + resp := lokiResponse{Status: lokiStatusSuccess, Data: lokiData{ResultType: "streams", Result: streams}} + w.WriteHeader(http.StatusOK) + _ = json.NewEncoder(w).Encode(resp) + })) + defer srv.Close() + + // Default parser fields (no msg/name overrides) — relies on alias fallback. + src := NewLokiSource(srv.URL, `{job="kubelet"}`, time.Hour, &dropv1alpha1.LokiParser{ + Type: dropv1alpha1.LokiParserTypeKubernetesEvents, + }, srv.Client()) + samples, err := src.FetchRaw(t.Context()) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if len(samples["nginx:1.25"]) != 1 { + t.Fatalf("expected 1 sample for nginx:1.25, got %d", len(samples["nginx:1.25"])) + } + if got := samples["nginx:1.25"][0].Value; got < 0.73 || got > 0.75 { + t.Errorf("expected ~0.74s duration, got %f", got) + } + if len(samples["broken:v1"+lokiFailedSuffix]) != 1 { + t.Errorf("expected 1 failure sample for broken:v1, got %d", len(samples["broken:v1"+lokiFailedSuffix])) + } +} + // TestLokiSource_FetchRaw_HTTPError verifies that HTTP errors are surfaced. func TestLokiSource_FetchRaw_HTTPError(t *testing.T) { srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { diff --git a/test/e2e/discovery-loki-alloy/00-failing-pod.yaml b/test/e2e/discovery-loki-alloy/00-failing-pod.yaml new file mode 100644 index 0000000..9373b6c --- /dev/null +++ b/test/e2e/discovery-loki-alloy/00-failing-pod.yaml @@ -0,0 +1,12 @@ +apiVersion: v1 +kind: Pod +metadata: + name: e2e-alloy-failing-pod + namespace: default +spec: + restartPolicy: Never + containers: + - name: bad-image + image: registry.invalid.local:9999/e2e-alloy-invalid:nope + imagePullPolicy: Always + command: ["/bin/sh", "-c", "echo should-not-run && sleep 60"] diff --git a/test/e2e/discovery-loki-alloy/01-discoverypolicy.yaml b/test/e2e/discovery-loki-alloy/01-discoverypolicy.yaml new file mode 100644 index 0000000..d207fc4 --- /dev/null +++ b/test/e2e/discovery-loki-alloy/01-discoverypolicy.yaml @@ -0,0 +1,32 @@ +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: e2e-discovery-loki-alloy +spec: + queries: + - name: alloy-k8s-events + type: loki + loki: + endpoint: "http://loki.e2e-infra.svc.cluster.local:3100" + queryType: range + lookback: 24h + query: '{drop_e2e="true"} | json | reason=~"Pulling|Failed|BackOff" | name="e2e-alloy-failing-pod"' + parser: + type: kubernetesEvents + podField: name + reasonField: reason + messageField: msg + imageField: msg + signals: + - name: pull-failures + query: alloy-k8s-events + type: eventPullTime + eventPullTime: + statistic: failureCount + durationMode: messageDuration + includeCacheHits: false + ranking: + strategy: signal + signal: pull-failures + syncInterval: 15s + maxImages: 10 diff --git a/test/e2e/discovery-loki-alloy/02-assert-discovery-status.yaml b/test/e2e/discovery-loki-alloy/02-assert-discovery-status.yaml new file mode 100644 index 0000000..c9a6e30 --- /dev/null +++ b/test/e2e/discovery-loki-alloy/02-assert-discovery-status.yaml @@ -0,0 +1,11 @@ +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: e2e-discovery-loki-alloy +status: + (conditions[?type == 'Ready'] | [0].status): "True" + (conditions[?type == 'Ready'] | [0].reason): Synced + (queryResults[?name == 'alloy-k8s-events'] | [0].status): success + (queryResults[?name == 'alloy-k8s-events'] | [0].type): loki + (imageCount > `0`): true + (contains(to_string(discoveredImages), 'e2e-alloy-invalid:nope')): true diff --git a/test/e2e/discovery-loki-alloy/chainsaw-test.yaml b/test/e2e/discovery-loki-alloy/chainsaw-test.yaml new file mode 100644 index 0000000..d7aa1d9 --- /dev/null +++ b/test/e2e/discovery-loki-alloy/chainsaw-test.yaml @@ -0,0 +1,24 @@ +# yaml-language-server: $schema=https://raw.githubusercontent.com/kyverno/chainsaw/main/.schemas/json/test-chainsaw-v1alpha1.json +apiVersion: chainsaw.kyverno.io/v1alpha1 +kind: Test +metadata: + name: discovery-loki-alloy +spec: + description: | + Verify Loki discovery from real Kubernetes events shipped by Grafana Alloy + (loki.source.kubernetes_events with log_format=json). This exercises parser + fields name/msg/reason, not only seeded raw text events. + steps: + - name: Create a pod that triggers pull failures/events + try: + - apply: + file: 00-failing-pod.yaml + - name: Create DiscoveryPolicy reading Alloy json event fields + try: + - apply: + file: 01-discoverypolicy.yaml + - name: Assert pipeline executed and discovered the failing image from Alloy events + try: + - assert: + timeout: 120s + file: 02-assert-discovery-status.yaml From 7e40ea133e10e0f29fa2257049d677476504ccef Mon Sep 17 00:00:00 2001 From: Julian Wachter Date: Mon, 29 Jun 2026 12:12:17 +0200 Subject: [PATCH 26/35] refactor loki parsing --- api/v1alpha1/discoverypolicy_types.go | 65 ++++++++++------ .../drop.corewire.io_discoverypolicies.yaml | 25 +++++-- internal/discovery/engine.go | 72 ++++++++++-------- internal/discovery/engine_test.go | 74 ++++++++++++++++--- internal/discovery/loki.go | 28 +++++++ internal/discovery/loki_test.go | 27 ++++++- 6 files changed, 218 insertions(+), 73 deletions(-) diff --git a/api/v1alpha1/discoverypolicy_types.go b/api/v1alpha1/discoverypolicy_types.go index 1e1be45..f720503 100644 --- a/api/v1alpha1/discoverypolicy_types.go +++ b/api/v1alpha1/discoverypolicy_types.go @@ -351,27 +351,38 @@ type TimeOfDayWindow struct { End string `json:"end"` } -// EventPullTimeStatistic defines which pull-time statistic to derive from event records. -// +kubebuilder:validation:Enum=p50;p90;p95;avg;max;count;failureCount;cacheHitCount -type EventPullTimeStatistic string +// EventMetric selects which per-image quantity an event signal measures. +// +kubebuilder:validation:Enum=pullTime;imageSize;failure;cacheHit +type EventMetric string const ( - // EventPullTimeStatisticP50 is the median cold-pull duration. - EventPullTimeStatisticP50 EventPullTimeStatistic = "p50" - // EventPullTimeStatisticP90 is the 90th-percentile cold-pull duration. - EventPullTimeStatisticP90 EventPullTimeStatistic = "p90" - // EventPullTimeStatisticP95 is the 95th-percentile cold-pull duration. - EventPullTimeStatisticP95 EventPullTimeStatistic = "p95" - // EventPullTimeStatisticAvg is the mean cold-pull duration. - EventPullTimeStatisticAvg EventPullTimeStatistic = "avg" - // EventPullTimeStatisticMax is the maximum observed cold-pull duration. - EventPullTimeStatisticMax EventPullTimeStatistic = "max" - // EventPullTimeStatisticCount is the total number of cold-pull events. - EventPullTimeStatisticCount EventPullTimeStatistic = "count" - // EventPullTimeStatisticFailureCount is the total number of pull failures. - EventPullTimeStatisticFailureCount EventPullTimeStatistic = "failureCount" - // EventPullTimeStatisticCacheHitCount is the number of cache-hit events. - EventPullTimeStatisticCacheHitCount EventPullTimeStatistic = "cacheHitCount" + // EventMetricPullTime measures cold-pull duration in seconds (from Pulled events). + EventMetricPullTime EventMetric = "pullTime" + // EventMetricImageSize measures image size in bytes (from Pulled event messages). + EventMetricImageSize EventMetric = "imageSize" + // EventMetricFailure measures pull-failure events. + EventMetricFailure EventMetric = "failure" + // EventMetricCacheHit measures already-present (cache-hit) events. + EventMetricCacheHit EventMetric = "cacheHit" +) + +// EventStatistic defines the aggregation applied to the selected metric's samples. +// +kubebuilder:validation:Enum=p50;p90;p95;avg;max;count +type EventStatistic string + +const ( + // EventStatisticP50 is the median sample value. + EventStatisticP50 EventStatistic = "p50" + // EventStatisticP90 is the 90th-percentile sample value. + EventStatisticP90 EventStatistic = "p90" + // EventStatisticP95 is the 95th-percentile sample value. + EventStatisticP95 EventStatistic = "p95" + // EventStatisticAvg is the mean sample value. + EventStatisticAvg EventStatistic = "avg" + // EventStatisticMax is the maximum sample value. + EventStatisticMax EventStatistic = "max" + // EventStatisticCount is the number of samples. + EventStatisticCount EventStatistic = "count" ) // DurationMode defines how pull duration is extracted from event records. @@ -388,16 +399,24 @@ const ( ) // EventPullTimeSignalConfig configures the eventPullTime signal type. -// The referenced query must be a Loki query. +// The referenced query must be a Loki query. Pull duration and image size are +// extracted from the same Pulled events; metric selects which one to rank on. type EventPullTimeSignalConfig struct { - // Statistic selects which pull-time metric to compute. - // +kubebuilder:validation:Enum=p50;p90;p95;avg;max;count;failureCount;cacheHitCount - Statistic EventPullTimeStatistic `json:"statistic"` + // Metric selects which per-image quantity to aggregate. Defaults to pullTime, + // which correlates strongly with cold-start cost. Use imageSize to rank by bytes. + // +kubebuilder:default=pullTime + // +optional + Metric EventMetric `json:"metric,omitempty"` + // Statistic selects how the metric's samples are aggregated per image. + // +kubebuilder:validation:Enum=p50;p90;p95;avg;max;count + Statistic EventStatistic `json:"statistic"` // IncludeCacheHits controls whether "already present on machine" events are included // in cold-pull duration statistics. Set to false to exclude cache hits. + // Only applies when metric=pullTime. // +kubebuilder:default=false IncludeCacheHits bool `json:"includeCacheHits"` // DurationMode controls how pull duration is extracted from event records. + // Only applies when metric=pullTime. // +kubebuilder:validation:Enum=eventPair;messageDuration DurationMode DurationMode `json:"durationMode"` } diff --git a/config/crd/bases/drop.corewire.io_discoverypolicies.yaml b/config/crd/bases/drop.corewire.io_discoverypolicies.yaml index ccbb0c3..09719bf 100644 --- a/config/crd/bases/drop.corewire.io_discoverypolicies.yaml +++ b/config/crd/bases/drop.corewire.io_discoverypolicies.yaml @@ -425,15 +425,28 @@ spec: - enum: - eventPair - messageDuration - description: DurationMode controls how pull duration is - extracted from event records. + description: |- + DurationMode controls how pull duration is extracted from event records. + Only applies when metric=pullTime. type: string includeCacheHits: default: false description: |- IncludeCacheHits controls whether "already present on machine" events are included in cold-pull duration statistics. Set to false to exclude cache hits. + Only applies when metric=pullTime. type: boolean + metric: + default: pullTime + description: |- + Metric selects which per-image quantity to aggregate. Defaults to pullTime, + which correlates strongly with cold-start cost. Use imageSize to rank by bytes. + enum: + - pullTime + - imageSize + - failure + - cacheHit + type: string statistic: allOf: - enum: @@ -443,8 +456,6 @@ spec: - avg - max - count - - failureCount - - cacheHitCount - enum: - p50 - p90 @@ -452,10 +463,8 @@ spec: - avg - max - count - - failureCount - - cacheHitCount - description: Statistic selects which pull-time metric to - compute. + description: Statistic selects how the metric's samples + are aggregated per image. type: string required: - durationMode diff --git a/internal/discovery/engine.go b/internal/discovery/engine.go index a927ec2..19389a3 100644 --- a/internal/discovery/engine.go +++ b/internal/discovery/engine.go @@ -615,6 +615,8 @@ func collectImages(rawByQuery map[string]*QueryRawData) []string { seen[strings.TrimSuffix(img, lokiFailedSuffix)] = struct{}{} case strings.HasSuffix(img, lokiCacheHitSuffix): seen[strings.TrimSuffix(img, lokiCacheHitSuffix)] = struct{}{} + case strings.HasSuffix(img, lokiSizeBytesSuffix): + seen[strings.TrimSuffix(img, lokiSizeBytesSuffix)] = struct{}{} default: seen[img] = struct{}{} } @@ -628,12 +630,15 @@ func collectImages(rawByQuery map[string]*QueryRawData) []string { return images } -// deriveEventPullTime computes per-image pull-time statistics from Loki event samples. +// deriveEventPullTime computes per-image statistics from Loki event samples. // // The samples map is expected to come from a Loki kubernetesEvents query: // - samples[image] → pull duration values in seconds (from Pulled events) // - samples[image+":failed"] → count of pull-failure events (value=1.0 each) // - samples[image+":cache_hit"] → count of already-present events (value=1.0 each) +// - samples[image+":size_bytes"]→ image size values in bytes (from Pulled event messages) +// +// cfg.Metric selects which series to aggregate; cfg.Statistic selects how. func deriveEventPullTime(samples map[string][]TimedSample, cfg *dropv1alpha1.EventPullTimeSignalConfig) map[string]float64 { imageSet := make(map[string]struct{}) for key := range samples { @@ -642,66 +647,69 @@ func deriveEventPullTime(samples map[string][]TimedSample, cfg *dropv1alpha1.Eve imageSet[strings.TrimSuffix(key, lokiFailedSuffix)] = struct{}{} case strings.HasSuffix(key, lokiCacheHitSuffix): imageSet[strings.TrimSuffix(key, lokiCacheHitSuffix)] = struct{}{} + case strings.HasSuffix(key, lokiSizeBytesSuffix): + imageSet[strings.TrimSuffix(key, lokiSizeBytesSuffix)] = struct{}{} default: imageSet[key] = struct{}{} } } + metric := cfg.Metric + if metric == "" { + metric = dropv1alpha1.EventMetricPullTime + } + out := make(map[string]float64, len(imageSet)) for img := range imageSet { - var v float64 - switch cfg.Statistic { - case dropv1alpha1.EventPullTimeStatisticFailureCount: - v = float64(len(samples[img+lokiFailedSuffix])) - case dropv1alpha1.EventPullTimeStatisticCacheHitCount: - v = float64(len(samples[img+lokiCacheHitSuffix])) - case dropv1alpha1.EventPullTimeStatisticCount: - pts := append([]TimedSample(nil), samples[img]...) - if cfg.IncludeCacheHits { - pts = append(pts, samples[img+lokiCacheHitSuffix]...) - } - v = float64(len(pts)) - default: - // Duration statistics: p50, p90, p95, avg, max. - pts := append([]TimedSample(nil), samples[img]...) + var pts []TimedSample + switch metric { + case dropv1alpha1.EventMetricImageSize: + pts = samples[img+lokiSizeBytesSuffix] + case dropv1alpha1.EventMetricFailure: + pts = samples[img+lokiFailedSuffix] + case dropv1alpha1.EventMetricCacheHit: + pts = samples[img+lokiCacheHitSuffix] + default: // pullTime + pts = append([]TimedSample(nil), samples[img]...) if cfg.IncludeCacheHits { pts = append(pts, samples[img+lokiCacheHitSuffix]...) } - if len(pts) == 0 { - continue - } - durations := make([]float64, len(pts)) - for i, pt := range pts { - durations[i] = pt.Value - } - v = computeEventPullTimeStat(durations, cfg.Statistic) } - out[img] = v + if len(pts) == 0 { + continue + } + vals := make([]float64, len(pts)) + for i, pt := range pts { + vals[i] = pt.Value + } + out[img] = computeEventStat(vals, cfg.Statistic) } return out } -// computeEventPullTimeStat computes a duration statistic over a non-empty slice. -func computeEventPullTimeStat(vals []float64, stat dropv1alpha1.EventPullTimeStatistic) float64 { +// computeEventStat aggregates a non-empty slice using the configured statistic. +func computeEventStat(vals []float64, stat dropv1alpha1.EventStatistic) float64 { sorted := make([]float64, len(vals)) copy(sorted, vals) sort.Float64s(sorted) switch stat { - case dropv1alpha1.EventPullTimeStatisticP50: + case dropv1alpha1.EventStatisticP50: return durationPercentile(sorted, 50) - case dropv1alpha1.EventPullTimeStatisticP90: + case dropv1alpha1.EventStatisticP90: return durationPercentile(sorted, 90) - case dropv1alpha1.EventPullTimeStatisticP95: + case dropv1alpha1.EventStatisticP95: return durationPercentile(sorted, 95) - case dropv1alpha1.EventPullTimeStatisticAvg: + case dropv1alpha1.EventStatisticAvg: var sum float64 for _, v := range sorted { sum += v } return sum / float64(len(sorted)) - case dropv1alpha1.EventPullTimeStatisticMax: + case dropv1alpha1.EventStatisticMax: return sorted[len(sorted)-1] + case dropv1alpha1.EventStatisticCount: + return float64(len(sorted)) default: return 0 } diff --git a/internal/discovery/engine_test.go b/internal/discovery/engine_test.go index 43093ad..02ddd93 100644 --- a/internal/discovery/engine_test.go +++ b/internal/discovery/engine_test.go @@ -403,7 +403,7 @@ func TestExecutePipeline_Loki(t *testing.T) { Name: "pull-time", Query: "pull-events", Type: dropv1alpha1.SignalTypeEventPullTime, - EventPullTime: &dropv1alpha1.EventPullTimeSignalConfig{Statistic: dropv1alpha1.EventPullTimeStatisticAvg, DurationMode: dropv1alpha1.DurationModeMessageDuration}, + EventPullTime: &dropv1alpha1.EventPullTimeSignalConfig{Statistic: dropv1alpha1.EventStatisticAvg, DurationMode: dropv1alpha1.DurationModeMessageDuration}, }, }, Ranking: &dropv1alpha1.DiscoveryRanking{Strategy: dropv1alpha1.RankingStrategySignal, Signal: "pull-time"}, @@ -471,7 +471,7 @@ func TestExecutePipeline_LokiFailureCount(t *testing.T) { Name: "failures", Query: "pull-events", Type: dropv1alpha1.SignalTypeEventPullTime, - EventPullTime: &dropv1alpha1.EventPullTimeSignalConfig{Statistic: dropv1alpha1.EventPullTimeStatisticFailureCount, DurationMode: dropv1alpha1.DurationModeMessageDuration}, + EventPullTime: &dropv1alpha1.EventPullTimeSignalConfig{Metric: dropv1alpha1.EventMetricFailure, Statistic: dropv1alpha1.EventStatisticCount, DurationMode: dropv1alpha1.DurationModeMessageDuration}, }, }, Ranking: &dropv1alpha1.DiscoveryRanking{Strategy: dropv1alpha1.RankingStrategySignal, Signal: "failures"}, @@ -493,6 +493,62 @@ func TestExecutePipeline_LokiFailureCount(t *testing.T) { } } +// TestExecutePipeline_LokiImageSize verifies ranking by image size (bytes) extracted from Pulled events. +func TestExecutePipeline_LokiImageSize(t *testing.T) { + now := time.Now() + nanoStr := func(t time.Time) string { + return strconv.FormatInt(t.UnixNano(), 10) + } + + streams := []lokiStream{ + { + Stream: map[string]string{"app": "kubelet"}, + Values: [][]string{ + {nanoStr(now.Add(-7 * time.Second)), `Successfully pulled image "nginx:1.25" in 730ms. Image size: 20461242 bytes.`}, + {nanoStr(now.Add(-2 * time.Second)), `Successfully pulled image "redis:7.0" in 3s. Image size: 5000000 bytes.`}, + }, + }, + } + + srv := httptest.NewServer(lokiStreamHandler(streams)) + defer srv.Close() + + spec := dropv1alpha1.DiscoveryPolicySpec{ + Queries: []dropv1alpha1.DiscoveryQuery{ + { + Name: "pull-events", + Type: dropv1alpha1.DiscoveryQueryTypeLoki, + Loki: &dropv1alpha1.DiscoveryLokiQuery{ + Endpoint: srv.URL, + Query: `{app="kubelet"}`, + Parser: &dropv1alpha1.LokiParser{Type: dropv1alpha1.LokiParserTypeKubernetesEvents, MessageField: "message"}, + }, + }, + }, + Signals: []dropv1alpha1.DiscoverySignal{ + { + Name: "image-size", + Query: "pull-events", + Type: dropv1alpha1.SignalTypeEventPullTime, + EventPullTime: &dropv1alpha1.EventPullTimeSignalConfig{Metric: dropv1alpha1.EventMetricImageSize, Statistic: dropv1alpha1.EventStatisticMax}, + }, + }, + Ranking: &dropv1alpha1.DiscoveryRanking{Strategy: dropv1alpha1.RankingStrategySignal, Signal: "image-size"}, + MaxImages: 10, + } + + clientFn := func(_ context.Context, _ string) (*http.Client, error) { return srv.Client(), nil } + result := ExecutePipeline(context.Background(), spec, clientFn) + + if len(result.Images) != 2 { + t.Fatalf("expected 2 images, got %d: %v", len(result.Images), result.Images) + } + // Largest image ranks first. + if result.Images[0].Image != "nginx:1.25" || result.Images[0].FinalScore != "20461242" { + t.Errorf("expected nginx:1.25 with size 20461242 first, got %s=%s", result.Images[0].Image, result.Images[0].FinalScore) + } +} + // TestDeriveEventPullTime_Percentiles verifies p50/p90/p95 computation. func TestDeriveEventPullTime_Percentiles(t *testing.T) { // 10 duration samples: 1,2,3,4,5,6,7,8,9,10 seconds @@ -503,15 +559,15 @@ func TestDeriveEventPullTime_Percentiles(t *testing.T) { samples := map[string][]TimedSample{"nginx:1.25": pts} tests := []struct { - stat dropv1alpha1.EventPullTimeStatistic + stat dropv1alpha1.EventStatistic want float64 }{ - {dropv1alpha1.EventPullTimeStatisticP50, 5.5}, - {dropv1alpha1.EventPullTimeStatisticP90, 9.1}, - {dropv1alpha1.EventPullTimeStatisticP95, 9.55}, - {dropv1alpha1.EventPullTimeStatisticAvg, 5.5}, - {dropv1alpha1.EventPullTimeStatisticMax, 10}, - {dropv1alpha1.EventPullTimeStatisticCount, 10}, + {dropv1alpha1.EventStatisticP50, 5.5}, + {dropv1alpha1.EventStatisticP90, 9.1}, + {dropv1alpha1.EventStatisticP95, 9.55}, + {dropv1alpha1.EventStatisticAvg, 5.5}, + {dropv1alpha1.EventStatisticMax, 10}, + {dropv1alpha1.EventStatisticCount, 10}, } for _, tt := range tests { cfg := &dropv1alpha1.EventPullTimeSignalConfig{Statistic: tt.stat} diff --git a/internal/discovery/loki.go b/internal/discovery/loki.go index 6f10169..7b4c67d 100644 --- a/internal/discovery/loki.go +++ b/internal/discovery/loki.go @@ -25,6 +25,8 @@ const ( lokiFailedSuffix = ":failed" // lokiCacheHitSuffix is appended to image keys for cache-hit event counts. lokiCacheHitSuffix = ":cache_hit" + // lokiSizeBytesSuffix is appended to image keys for extracted image-size samples. + lokiSizeBytesSuffix = ":size_bytes" ) // rePulledDuration matches the pull duration in Pulled event messages. @@ -35,6 +37,10 @@ var rePulledDuration = regexp.MustCompile(`\bin\s+(\d+(?:\.\d+)?)(ms|s|m|h)\b`) // Handles: Pulling image "nginx:1.25" / image "nginx:1.25" var reImageRef = regexp.MustCompile(`(?:image|Image)\s+"([^"]+)"`) +// reImageSizeBytes matches image size in Pulled messages. +// Example: "Image size: 20461242 bytes" +var reImageSizeBytes = regexp.MustCompile(`(?i)\bimage\s+size:\s*(\d+)\s+bytes\b`) + // lokiResponse is the top-level Loki query_range API response. type lokiResponse struct { Status string `json:"status"` @@ -176,6 +182,7 @@ type lokiEventRecord struct { // - samples[image] → pull duration in seconds for each Pulled event // - samples[image+":failed"] → 1.0 per pull-failure event // - samples[image+":cache_hit"] → 1.0 per already-present event +// - samples[image+":size_bytes"] → image size in bytes per Pulled event (if present) // // Durations are derived from the "in Xs" pattern in Pulled messages (messageDuration). // When no duration is present in the message, a Pulling→Pulled event-pair duration @@ -259,6 +266,7 @@ func parseKubernetesEventStreams(streams []lokiStream, parser *dropv1alpha1.Loki case "pulled": // Primary: parse duration from message ("in Xs"). dur := lokiParsePullDuration(rec.message) + sizeBytes := lokiParseImageSizeBytes(rec.message) // Fallback: event-pair (Pulling → Pulled timestamp delta). if dur == 0 { if pullStart, ok := pullingMap[rec.pod+":"+rec.image]; ok { @@ -270,6 +278,12 @@ func parseKubernetesEventStreams(streams []lokiStream, parser *dropv1alpha1.Loki if dur > 0 { out[rec.image] = append(out[rec.image], TimedSample{Timestamp: rec.timestamp, Value: dur}) } + if sizeBytes > 0 { + out[rec.image+lokiSizeBytesSuffix] = append( + out[rec.image+lokiSizeBytesSuffix], + TimedSample{Timestamp: rec.timestamp, Value: sizeBytes}, + ) + } delete(pullingMap, rec.pod+":"+rec.image) case "failed", "backoff": @@ -322,6 +336,20 @@ func lokiParsePullDuration(msg string) float64 { } } +// lokiParseImageSizeBytes extracts image size in bytes from a Pulled event message. +// Example: "... Image size: 20461242 bytes." +func lokiParseImageSizeBytes(msg string) float64 { + m := reImageSizeBytes.FindStringSubmatch(msg) + if len(m) < 2 { + return 0 + } + v, err := strconv.ParseInt(m[1], 10, 64) + if err != nil || v <= 0 { + return 0 + } + return float64(v) +} + // lokiInferReasonFromMessage infers a Kubernetes Event reason from a plain-text log message. // This is used when the reason field is not present in the Loki stream labels. func lokiInferReasonFromMessage(msg string) string { diff --git a/internal/discovery/loki_test.go b/internal/discovery/loki_test.go index 757e91c..f44b729 100644 --- a/internal/discovery/loki_test.go +++ b/internal/discovery/loki_test.go @@ -185,7 +185,7 @@ func TestLokiSource_FetchRaw_KubernetesEvents_AlloyJSON(t *testing.T) { { Stream: map[string]string{"namespace": "default", "job": "kubelet"}, Values: [][]string{{nanoStringLoki(now.Add(-2 * time.Second)), - `{"reason":"Pulled","name":"runner-abc","msg":"Successfully pulled image \"nginx:1.25\" in 740ms"}`}}, + `{"reason":"Pulled","name":"runner-abc","msg":"Successfully pulled image \"nginx:1.25\" in 740ms (740ms including waiting). Image size: 20461242 bytes."}`}}, }, { Stream: map[string]string{"namespace": "default", "job": "kubelet"}, @@ -215,6 +215,12 @@ func TestLokiSource_FetchRaw_KubernetesEvents_AlloyJSON(t *testing.T) { if got := samples["nginx:1.25"][0].Value; got < 0.73 || got > 0.75 { t.Errorf("expected ~0.74s duration, got %f", got) } + if len(samples["nginx:1.25"+lokiSizeBytesSuffix]) != 1 { + t.Fatalf("expected 1 size sample for nginx:1.25, got %d", len(samples["nginx:1.25"+lokiSizeBytesSuffix])) + } + if got := samples["nginx:1.25"+lokiSizeBytesSuffix][0].Value; got != 20461242 { + t.Errorf("expected image size 20461242, got %f", got) + } if len(samples["broken:v1"+lokiFailedSuffix]) != 1 { t.Errorf("expected 1 failure sample for broken:v1, got %d", len(samples["broken:v1"+lokiFailedSuffix])) } @@ -275,6 +281,25 @@ func TestLokiParsePullDuration(t *testing.T) { } } +// TestLokiParseImageSizeBytes verifies image size parsing from Pulled event messages. +func TestLokiParseImageSizeBytes(t *testing.T) { + tests := []struct { + msg string + want float64 + }{ + {`Successfully pulled image "nginx:1.25" in 2.5s. Image size: 20461242 bytes.`, 20461242}, + {`Successfully pulled image "redis:7" in 1s (1s including waiting). image size: 123 bytes.`, 123}, + {`Successfully pulled image "alpine:3.19" in 800ms`, 0}, + {`Image size: bad bytes`, 0}, + } + for _, tt := range tests { + got := lokiParseImageSizeBytes(tt.msg) + if got != tt.want { + t.Errorf("msg=%q: got %f, want %f", tt.msg, got, tt.want) + } + } +} + // nanoStringLoki formats a time as a nanosecond epoch string for Loki responses. func nanoStringLoki(t time.Time) string { return strconv.FormatInt(t.UnixNano(), 10) From 83c13c6d73f15c529699037a5531138377dcc688 Mon Sep 17 00:00:00 2001 From: Julian Wachter Date: Mon, 29 Jun 2026 12:13:47 +0200 Subject: [PATCH 27/35] tests --- test/e2e/README.md | 2 +- .../discovery-loki-alloy/00-failing-pod.yaml | 13 ++++++++++ .../01-discoverypolicy.yaml | 5 ++-- .../02-assert-discovery-status.yaml | 1 + .../discovery-loki-alloy/chainsaw-test.yaml | 13 ++++++++-- test/e2e/discovery-loki/00-real-pods.yaml | 25 +++++++++++++++++++ .../discovery-loki/01-discoverypolicy.yaml | 9 +++++-- .../02-assert-discovery-status.yaml | 3 +-- test/e2e/discovery-loki/chainsaw-test.yaml | 8 +++++- 9 files changed, 69 insertions(+), 10 deletions(-) create mode 100644 test/e2e/discovery-loki/00-real-pods.yaml diff --git a/test/e2e/README.md b/test/e2e/README.md index e144451..5a40caf 100644 --- a/test/e2e/README.md +++ b/test/e2e/README.md @@ -25,5 +25,5 @@ make test-e2e | `cachedimageset-discovery/` | CachedImageSet backed by a DiscoveryPolicy | | `discovery/` | DiscoveryPolicy with mock Prometheus | | `discovery-failure/` | DiscoveryPolicy with unreachable Prometheus endpoint | -| `discovery-loki/` | DiscoveryPolicy with mock Loki + eventPullTime signals | +| `discovery-loki/` | DiscoveryPolicy with real Alloy-ingested Loki events + eventPullTime signals | | `discovery-registry/` | DiscoveryPolicy listing tags from a mock registry | diff --git a/test/e2e/discovery-loki-alloy/00-failing-pod.yaml b/test/e2e/discovery-loki-alloy/00-failing-pod.yaml index 9373b6c..32ed239 100644 --- a/test/e2e/discovery-loki-alloy/00-failing-pod.yaml +++ b/test/e2e/discovery-loki-alloy/00-failing-pod.yaml @@ -1,5 +1,18 @@ apiVersion: v1 kind: Pod +metadata: + name: e2e-alloy-success-pod + namespace: default +spec: + restartPolicy: Never + containers: + - name: app + image: registry.e2e-infra.svc.cluster.local:5000/test/myapp:v1 + imagePullPolicy: Always + command: ["/bin/sh", "-c", "echo ok && sleep 2"] +--- +apiVersion: v1 +kind: Pod metadata: name: e2e-alloy-failing-pod namespace: default diff --git a/test/e2e/discovery-loki-alloy/01-discoverypolicy.yaml b/test/e2e/discovery-loki-alloy/01-discoverypolicy.yaml index d207fc4..fe2b127 100644 --- a/test/e2e/discovery-loki-alloy/01-discoverypolicy.yaml +++ b/test/e2e/discovery-loki-alloy/01-discoverypolicy.yaml @@ -10,7 +10,7 @@ spec: endpoint: "http://loki.e2e-infra.svc.cluster.local:3100" queryType: range lookback: 24h - query: '{drop_e2e="true"} | json | reason=~"Pulling|Failed|BackOff" | name="e2e-alloy-failing-pod"' + query: '{drop_e2e="true"} | json | reason=~"Pulling|Pulled|Failed|BackOff" | name=~"e2e-alloy-(success|failing)-pod"' parser: type: kubernetesEvents podField: name @@ -22,7 +22,8 @@ spec: query: alloy-k8s-events type: eventPullTime eventPullTime: - statistic: failureCount + metric: failure + statistic: count durationMode: messageDuration includeCacheHits: false ranking: diff --git a/test/e2e/discovery-loki-alloy/02-assert-discovery-status.yaml b/test/e2e/discovery-loki-alloy/02-assert-discovery-status.yaml index c9a6e30..d03f606 100644 --- a/test/e2e/discovery-loki-alloy/02-assert-discovery-status.yaml +++ b/test/e2e/discovery-loki-alloy/02-assert-discovery-status.yaml @@ -6,6 +6,7 @@ status: (conditions[?type == 'Ready'] | [0].status): "True" (conditions[?type == 'Ready'] | [0].reason): Synced (queryResults[?name == 'alloy-k8s-events'] | [0].status): success + (contains(to_string(discoveredImages), 'test/myapp:v1')): true (queryResults[?name == 'alloy-k8s-events'] | [0].type): loki (imageCount > `0`): true (contains(to_string(discoveredImages), 'e2e-alloy-invalid:nope')): true diff --git a/test/e2e/discovery-loki-alloy/chainsaw-test.yaml b/test/e2e/discovery-loki-alloy/chainsaw-test.yaml index d7aa1d9..4ba360f 100644 --- a/test/e2e/discovery-loki-alloy/chainsaw-test.yaml +++ b/test/e2e/discovery-loki-alloy/chainsaw-test.yaml @@ -7,9 +7,9 @@ spec: description: | Verify Loki discovery from real Kubernetes events shipped by Grafana Alloy (loki.source.kubernetes_events with log_format=json). This exercises parser - fields name/msg/reason, not only seeded raw text events. + fields name/msg/reason using real pull events from test pods. steps: - - name: Create a pod that triggers pull failures/events + - name: Create real pods that trigger pull success/failure events try: - apply: file: 00-failing-pod.yaml @@ -22,3 +22,12 @@ spec: - assert: timeout: 120s file: 02-assert-discovery-status.yaml + - name: Cleanup + try: + - delete: + file: 00-failing-pod.yaml + - delete: + ref: + apiVersion: drop.corewire.io/v1alpha1 + kind: DiscoveryPolicy + name: e2e-discovery-loki-alloy diff --git a/test/e2e/discovery-loki/00-real-pods.yaml b/test/e2e/discovery-loki/00-real-pods.yaml new file mode 100644 index 0000000..6ec2a75 --- /dev/null +++ b/test/e2e/discovery-loki/00-real-pods.yaml @@ -0,0 +1,25 @@ +apiVersion: v1 +kind: Pod +metadata: + name: e2e-loki-success-pod + namespace: default +spec: + restartPolicy: Never + containers: + - name: app + image: registry.e2e-infra.svc.cluster.local:5000/test/myapp:v1 + imagePullPolicy: Always + command: ["/bin/sh", "-c", "echo ok && sleep 2"] +--- +apiVersion: v1 +kind: Pod +metadata: + name: e2e-loki-failure-pod + namespace: default +spec: + restartPolicy: Never + containers: + - name: bad-image + image: registry.invalid.local:9999/e2e-loki-invalid:nope + imagePullPolicy: Always + command: ["/bin/sh", "-c", "echo should-not-run && sleep 60"] diff --git a/test/e2e/discovery-loki/01-discoverypolicy.yaml b/test/e2e/discovery-loki/01-discoverypolicy.yaml index 56d6add..fe70fb5 100644 --- a/test/e2e/discovery-loki/01-discoverypolicy.yaml +++ b/test/e2e/discovery-loki/01-discoverypolicy.yaml @@ -10,9 +10,13 @@ spec: endpoint: "http://loki.e2e-infra.svc.cluster.local:3100" queryType: range lookback: 24h - query: '{job="kubelet",drop_e2e="true"}' + query: '{drop_e2e="true"} | json | reason=~"Pulling|Pulled|Failed|BackOff" | name=~"e2e-loki-(success|failure)-pod"' parser: type: kubernetesEvents + podField: name + reasonField: reason + messageField: msg + imageField: msg signals: # Median cold-pull time derived from the "Successfully pulled ... in Xs" messages. - name: p50-cold-pull-time @@ -27,7 +31,8 @@ spec: query: discovery-loki-image-pull-events type: eventPullTime eventPullTime: - statistic: failureCount + metric: failure + statistic: count durationMode: messageDuration includeCacheHits: false ranking: diff --git a/test/e2e/discovery-loki/02-assert-discovery-status.yaml b/test/e2e/discovery-loki/02-assert-discovery-status.yaml index a10db88..b2d869a 100644 --- a/test/e2e/discovery-loki/02-assert-discovery-status.yaml +++ b/test/e2e/discovery-loki/02-assert-discovery-status.yaml @@ -14,5 +14,4 @@ status: (queryResults[?name == 'discovery-loki-image-pull-events'] | [0].status): success (queryResults[?name == 'discovery-loki-image-pull-events'] | [0].type): loki (length(discoveredImages[?contains(image, 'test/myapp:v1')]) > `0`): true - (length(discoveredImages[?contains(image, 'test/worker:v2')]) > `0`): true - (length(discoveredImages[?contains(image, 'test/tools:v1')]) > `0`): true + (length(discoveredImages[?contains(image, 'e2e-loki-invalid:nope')]) > `0`): true diff --git a/test/e2e/discovery-loki/chainsaw-test.yaml b/test/e2e/discovery-loki/chainsaw-test.yaml index fe028a3..f827515 100644 --- a/test/e2e/discovery-loki/chainsaw-test.yaml +++ b/test/e2e/discovery-loki/chainsaw-test.yaml @@ -7,8 +7,12 @@ spec: description: | Verify that a DiscoveryPolicy with a Loki query and the kubernetesEvents parser derives eventPullTime signals (cold-pull time and failure count) from - seeded image-pull events and populates status.discoveredImages. + real image-pull events ingested by Alloy and populates status.discoveredImages. steps: + - name: Create real pods to generate kubelet pull events + try: + - apply: + file: 00-real-pods.yaml - name: Create DiscoveryPolicy with a Loki query and eventPullTime signals try: - apply: @@ -20,6 +24,8 @@ spec: file: 02-assert-discovery-status.yaml - name: Cleanup try: + - delete: + file: 00-real-pods.yaml - delete: ref: apiVersion: drop.corewire.io/v1alpha1 From 7c79325f1650834c40b6948a9ca83744c0343601 Mon Sep 17 00:00:00 2001 From: Julian Wachter Date: Mon, 29 Jun 2026 12:13:54 +0200 Subject: [PATCH 28/35] infra --- hack/dev-samples.yaml | 15 +++++++++--- hack/e2e-infra/alloy.yaml | 4 ++-- hack/e2e-infra/grafana.yaml | 5 ++++ hack/e2e-infra/seed-loki-job.yaml | 38 +++++++++++++++++++------------ hack/e2e-infra/setup.sh | 2 +- 5 files changed, 43 insertions(+), 21 deletions(-) diff --git a/hack/dev-samples.yaml b/hack/dev-samples.yaml index 5e54307..edc6da9 100644 --- a/hack/dev-samples.yaml +++ b/hack/dev-samples.yaml @@ -256,7 +256,7 @@ spec: endpoint: "http://loki.e2e-infra.svc.cluster.local:3100" queryType: range lookback: 24h - query: '{job="kubelet",drop_e2e="true"}' + query: '{job="kubernetes-events",drop_e2e="true"}' parser: type: kubernetesEvents signals: @@ -271,7 +271,16 @@ spec: query: image-pull-events type: eventPullTime eventPullTime: - statistic: failureCount + metric: failure + statistic: count + durationMode: messageDuration + includeCacheHits: false + - name: avg-image-size + query: image-pull-events + type: eventPullTime + eventPullTime: + metric: imageSize + statistic: avg durationMode: messageDuration includeCacheHits: false ranking: @@ -331,7 +340,7 @@ spec: endpoint: "http://loki.e2e-infra.svc.cluster.local:3100" queryType: range lookback: 24h - query: '{job="kubelet",drop_e2e="true"}' + query: '{job="kubernetes-events",drop_e2e="true"}' parser: type: kubernetesEvents signals: diff --git a/hack/e2e-infra/alloy.yaml b/hack/e2e-infra/alloy.yaml index f3ba338..0568f29 100644 --- a/hack/e2e-infra/alloy.yaml +++ b/hack/e2e-infra/alloy.yaml @@ -39,12 +39,12 @@ data: // Tail real Kubernetes events from all namespaces and ship them to Loki. // log_format=json so the kubernetesEvents parser can extract name/reason/message. loki.source.kubernetes_events "events" { - job_name = "kubelet" + job_name = "kubernetes-events" log_format = "json" forward_to = [loki.write.local.receiver] } - // Tag every line with drop_e2e=true so discovery queries can scope to seed data. + // Tag every line with drop_e2e=true so discovery queries can scope to e2e data. loki.write "local" { external_labels = { drop_e2e = "true" } endpoint { diff --git a/hack/e2e-infra/grafana.yaml b/hack/e2e-infra/grafana.yaml index a507731..32da6de 100644 --- a/hack/e2e-infra/grafana.yaml +++ b/hack/e2e-infra/grafana.yaml @@ -80,6 +80,11 @@ data: url: http://prometheus.e2e-infra.svc.cluster.local:9090 isDefault: true editable: true + - name: Loki + type: loki + access: proxy + url: http://loki.e2e-infra.svc.cluster.local:3100 + editable: true --- apiVersion: v1 kind: ConfigMap diff --git a/hack/e2e-infra/seed-loki-job.yaml b/hack/e2e-infra/seed-loki-job.yaml index ae33be5..7e69c54 100644 --- a/hack/e2e-infra/seed-loki-job.yaml +++ b/hack/e2e-infra/seed-loki-job.yaml @@ -40,32 +40,40 @@ spec: BASE="$(date +%s)000000000" n=0 ENTRIES="" - add() { - # add + add_line() { + # add_line TS=$(( BASE + n * 1000000000 )) n=$(( n + 1 )) - MSG=$(printf '%s' "$1" | sed 's/"/\\"/g') + MSG=$(printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g') if [ -n "$ENTRIES" ]; then ENTRIES="$ENTRIES,"; fi ENTRIES="$ENTRIES[ \"$TS\", \"$MSG\" ]" } + add_event() { + # add_event + POD=$(printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g') + REASON=$(printf '%s' "$2" | sed 's/\\/\\\\/g; s/"/\\"/g') + MESSAGE=$(printf '%s' "$3" | sed 's/\\/\\\\/g; s/"/\\"/g') + add_line "{\"kind\":\"Pod\",\"name\":\"$POD\",\"involvedObject_name\":\"$POD\",\"reason\":\"$REASON\",\"sourcecomponent\":\"kubelet\",\"reportingcontroller\":\"kubelet\",\"msg\":\"$MESSAGE\"}" + } + # myapp:v1 — three cold pulls (3s, 4s, 5s) → avg 4s, plus a cache hit. - add "Pulling image \"$REGISTRY/test/myapp:v1\"" - add "Successfully pulled image \"$REGISTRY/test/myapp:v1\" in 3.0s (3.0s including waiting)" - add "Successfully pulled image \"$REGISTRY/test/myapp:v1\" in 4.0s (4.0s including waiting)" - add "Successfully pulled image \"$REGISTRY/test/myapp:v1\" in 5.0s (5.0s including waiting)" - add "Container image \"$REGISTRY/test/myapp:v1\" already present on machine" + add_event "seed-myapp-pod" "Pulling" "Pulling image \"$REGISTRY/test/myapp:v1\"" + add_event "seed-myapp-pod" "Pulled" "Successfully pulled image \"$REGISTRY/test/myapp:v1\" in 3.0s (3.0s including waiting)" + add_event "seed-myapp-pod" "Pulled" "Successfully pulled image \"$REGISTRY/test/myapp:v1\" in 4.0s (4.0s including waiting)" + add_event "seed-myapp-pod" "Pulled" "Successfully pulled image \"$REGISTRY/test/myapp:v1\" in 5.0s (5.0s including waiting)" + add_event "seed-myapp-pod" "AlreadyPresent" "Container image \"$REGISTRY/test/myapp:v1\" already present on machine" # worker:v2 — one slow cold pull (12s) and one pull failure. - add "Pulling image \"$REGISTRY/test/worker:v2\"" - add "Successfully pulled image \"$REGISTRY/test/worker:v2\" in 12.0s (12.0s including waiting)" - add "Failed to pull image \"$REGISTRY/test/worker:v2\": rpc error: code = Unknown" + add_event "seed-worker-pod" "Pulling" "Pulling image \"$REGISTRY/test/worker:v2\"" + add_event "seed-worker-pod" "Pulled" "Successfully pulled image \"$REGISTRY/test/worker:v2\" in 12.0s (12.0s including waiting)" + add_event "seed-worker-pod" "Failed" "Failed to pull image \"$REGISTRY/test/worker:v2\": rpc error: code = Unknown" # tools:v1 — two quick cold pulls (1s, 2s). - add "Successfully pulled image \"$REGISTRY/test/tools:v1\" in 1.0s (1.0s including waiting)" - add "Successfully pulled image \"$REGISTRY/test/tools:v1\" in 2.0s (2.0s including waiting)" + add_event "seed-tools-pod" "Pulled" "Successfully pulled image \"$REGISTRY/test/tools:v1\" in 1.0s (1.0s including waiting)" + add_event "seed-tools-pod" "Pulled" "Successfully pulled image \"$REGISTRY/test/tools:v1\" in 2.0s (2.0s including waiting)" - PAYLOAD="{\"streams\":[{\"stream\":{\"job\":\"kubelet\",\"namespace\":\"default\",\"drop_e2e\":\"true\"},\"values\":[$ENTRIES]}]}" + PAYLOAD="{\"streams\":[{\"stream\":{\"job\":\"kubernetes-events\",\"namespace\":\"default\",\"drop_e2e\":\"true\"},\"values\":[$ENTRIES]}]}" echo "Pushing image-pull events to Loki..." RESP_FILE=$(mktemp) @@ -87,7 +95,7 @@ spec: echo "Verifying seed events..." for i in $(seq 1 30); do RESULT=$(curl -s -G "$LOKI/loki/api/v1/query_range" \ - --data-urlencode 'query={job="kubelet",drop_e2e="true"}' \ + --data-urlencode 'query={job="kubernetes-events",drop_e2e="true"}' \ --data-urlencode 'limit=10' 2>/dev/null || echo "") if echo "$RESULT" | grep -q "Successfully pulled"; then echo "Seed events are queryable!" diff --git a/hack/e2e-infra/setup.sh b/hack/e2e-infra/setup.sh index 799b33c..eb1a780 100755 --- a/hack/e2e-infra/setup.sh +++ b/hack/e2e-infra/setup.sh @@ -72,7 +72,7 @@ echo "[e2e-infra] Seeding Prometheus with image metrics..." kubectl apply -n "$NAMESPACE" -f "$SCRIPT_DIR/seed-metrics-job.yaml" kubectl -n "$NAMESPACE" wait --for=condition=complete job/seed-metrics --timeout=60s 2>/dev/null || true -# --- Seed Loki with image-pull events --- +# --- Seed Loki with image-pull events (Alloy-style JSON structure) --- echo "[e2e-infra] Seeding Loki with image-pull events..." kubectl apply -n "$NAMESPACE" -f "$SCRIPT_DIR/seed-loki-job.yaml" kubectl -n "$NAMESPACE" wait --for=condition=complete job/seed-loki --timeout=180s From 71afb4bbd956b5edfc9994b93ecbcf2b7eaeacf2 Mon Sep 17 00:00:00 2001 From: Julian Wachter Date: Mon, 29 Jun 2026 12:14:10 +0200 Subject: [PATCH 29/35] docs --- docs/content/docs/discovery.md | 942 ++++++++++++++---- .../content/docs/reference/_generated_crds.md | 31 +- docs/go.mod | 2 + docs/go.sum | 2 + docs/static/images/discovery-pipeline.svg | 161 ++- docs/static/images/prometheus-sampling.svg | 20 + docs/static/images/ranking-decision-map.svg | 72 ++ docs/static/images/signal-aggregate.svg | 26 + docs/static/images/signal-eventpulltime.svg | 22 + docs/static/images/signal-timeweighted.svg | 22 + docs/static/images/signal-windowaggregate.svg | 21 + docs/static/llms-full.txt | 121 +-- 12 files changed, 1085 insertions(+), 357 deletions(-) create mode 100644 docs/static/images/prometheus-sampling.svg create mode 100644 docs/static/images/ranking-decision-map.svg create mode 100644 docs/static/images/signal-aggregate.svg create mode 100644 docs/static/images/signal-eventpulltime.svg create mode 100644 docs/static/images/signal-timeweighted.svg create mode 100644 docs/static/images/signal-windowaggregate.svg diff --git a/docs/content/docs/discovery.md b/docs/content/docs/discovery.md index a056478..49e4dba 100644 --- a/docs/content/docs/discovery.md +++ b/docs/content/docs/discovery.md @@ -11,7 +11,7 @@ llmsDescription: | time-weighted scoring, weighted ranking, and periodic re-discovery. --- -The DiscoveryPolicy CRD enables automatic image discovery from external sources. When referenced by a CachedImageSet, discovered images are automatically materialized as CachedImage resources. +DiscoveryPolicy discovers images from external sources. CachedImageSet consumes the discovered list and materializes CachedImage resources. ## Why This Exists @@ -22,187 +22,593 @@ Discovery came from operational pain: - Hand-maintained image lists became stale and missed newly hot images - Node rotation (e.g. Cluster API MachineDeployments rolling new nodes daily or weekly) means fresh nodes start with empty image caches — every rotation triggers a full re-pull of all active images -With DiscoveryPolicy, image candidates are continuously sourced from real usage signals (metrics), ranked by configurable strategies, and consumed by CachedImageSet. +DiscoveryPolicy continuously refreshes image candidates from usage signals and passes the ranked output to CachedImageSet. -## Pipeline Overview +## How Discovery Works ``` -queries → signals → ranking → selected images +queries → signals → ranking → discovered images ``` ![DiscoveryPolicy pipeline: queries feed signals, signals feed a single ranking strategy, the ranked list is written to status.discoveredImages and consumed by CachedImageSet to create CachedImage resources that nodes pull.](/images/discovery-pipeline.svg) -The pipeline has three stages: +| Stage | Purpose | Available types | +|-------|---------|-----------------| +| 1 · Queries | Fetch raw observations from a backend | `prometheus` · `loki` · `registry` | +| 2 · Signals | Reduce a query series to one value per image | `aggregate` · `timeWeightedAggregate` · `windowAggregate` · `eventPullTime` | +| 3 · Ranking | Order images into the final list | `signal` · `weightedSum` · `modelExposure` | -1. **Queries** fetch raw observations from systems such as Prometheus or Loki. -2. **Signals** derive named per-image metrics from query results (e.g. `total-usage`, `peak-concurrency`). -3. **Ranking** combines one or more signals into the final ordered image list. - -``` -DiscoveryPolicy → runs pipeline → writes to status.discoveredImages - ↓ -CachedImageSet → reads discoveredImages → creates/deletes CachedImage children -``` +The output lands in `status.discoveredImages`; CachedImageSet reads it and creates/deletes `CachedImage` children that nodes pull. ## Stage 1 — Queries A query fetches raw observations and is referenced by name from signals. +All snippets below are complete `DiscoveryPolicy` resources with minimal companion +signals/ranking so you can apply them directly. + +| Type | Source | Discovered from | Use when | +|------|--------|-----------------|----------| +| `prometheus` | Metrics series | `image` label on results | Usage/concurrency from cluster metrics | +| `loki` | Event logs | parsed pull events | Pull durations & failures | +| `registry` | Tag/catalog API | repository tags | Pre-cache newest tags by name | + ### Prometheus Query +**Definition.** Runs a PromQL query against any Prometheus-compatible API and turns each returned series into a candidate image. The result **must** have an `image` label — that value becomes the image reference. + +#### How it's used in the CRD + ```yaml -queries: - - name: runner-image-usage - type: prometheus - prometheus: - endpoint: https://mimir.example.com - queryType: range # range | instant (default: range) - lookback: 168h # time window for range queries - step: 1m # range resolution (default: 5m) - query: | - count( - container_memory_working_set_bytes{ - container!="", container!="POD", - namespace="gitlab-runner", pod=~"runner-.*" - } - ) by (image) +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: prometheus-query-example +spec: + syncInterval: 1h # how often the whole pipeline re-runs + maxImages: 30 # keep only the top 30 ranked images + # STAGE 1: fetch raw data + queries: + - name: runner-image-usage # unique id; referenced by signals[].query + type: prometheus + prometheus: + endpoint: https://mimir.example.com # any Prometheus-compatible API + queryType: range # range = samples over time | instant = single point + lookback: 168h # look back 7 days (range queries only) + step: 1m # smaller step = more samples + more backend load + query: | + # Result must expose an image label — Discovery keys every image by it. + count( + container_memory_working_set_bytes{ + container!="", container!="POD", + namespace="gitlab-runner", pod=~"runner-.*" + } + ) by (image) + # STAGE 2: reduce the series to one number per image + signals: + - name: total-usage # signal name, referenced by ranking below + query: runner-image-usage # which query's data to consume + type: aggregate + aggregate: + method: sum # sum all samples = total activity per image + # STAGE 3: order the images + ranking: + strategy: signal + signal: total-usage # sort purely by the total-usage signal ``` -The PromQL result **must** carry an `image` label. That label value is the discovered image reference. +#### What happens to our query + +`... by (image)` makes Prometheus return one time series per image. A `range` query samples each series across `lookback`, one point every `step`. Discovery reads the raw response: + +```json +{ + "data": { "result": [ + { "metric": { "image": "img-A" }, "values": [[t0, "1"], [t1, "2"], [t2, "6"]] }, + { "metric": { "image": "img-B" }, "values": [[t1, "1"], [t2, "3"]] } + ]} +} +``` + +We use this 48h sample (hourly, two days, midday peaks) as the running example for every Prometheus signal below. The `total-usage` signal sums each series into one value: + +![Grafana-style time-series panel over 48 hours: img-A peaks midday both days, img-B smaller; x-axis is hour of day, each series summed to one value.](/images/prometheus-sampling.svg) + +| Series | Pattern | sum | rank | +|--------|---------|-----|------| +| img-A | midday peaks, low at night | 30 | 1 | +| img-B | small midday bumps | 12 | 2 | + +| Field | Controls | Default | +|-------|----------|---------| +| `queryType` | `range` = window of samples · `instant` = one point now | `range` | +| `lookback` | how far back the window reaches (ignored for `instant`) | — | +| `step` | spacing between samples; smaller = more points, heavier query | `5m` | + +Field semantics: [`DiscoveryPrometheusQuery`](https://github.com/Breee/puller/blob/main/api/v1alpha1/discoverypolicy_types.go). ### Loki Query ```yaml -queries: - - name: image-pull-events - type: loki - loki: - endpoint: https://loki.example.com - queryType: range - lookback: 168h - query: | - {job="kubernetes-events", namespace="gitlab-runner"} - | json - | involvedObject_name =~ "runner-.*" - | reason =~ "Pulling|Pulled|Failed|BackOff" - parser: - type: kubernetesEvents - podField: involvedObject_name - reasonField: reason - messageField: message - imageField: message +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: loki-query-example +spec: + syncInterval: 1h + maxImages: 30 + queries: + - name: image-pull-events # referenced by eventPullTime signal + type: loki + loki: + endpoint: https://loki.example.com + queryType: range # only supported Loki query mode currently + lookback: 168h + query: | + # Pull lifecycle events used to derive pull durations/failures. + {job="kubernetes-events", namespace="gitlab-runner"} + | json + | involvedObject_name =~ "runner-.*" + | reason =~ "Pulling|Pulled|Failed|BackOff" + parser: + type: kubernetesEvents # maps log fields into structured event records + podField: involvedObject_name # which field holds the pod name + reasonField: reason # Pulling / Pulled / Failed + messageField: message # free-text event message + imageField: message # image ref is extracted from the message + signals: + - name: avg-cold-pull-time + query: image-pull-events + type: eventPullTime + eventPullTime: + statistic: avg # mean pull duration per image + includeCacheHits: false # only count true cold pulls + durationMode: eventPair # pair Pulling→Pulled events to get the duration + ranking: + strategy: signal + signal: avg-cold-pull-time # slowest images rank highest +``` + +How it's used: Loki contributes pull lifecycle events, not usage volume. The +`kubernetesEvents` parser turns each log/event into structured records with +`podField`, `reasonField`, and `messageField`, then extracts the image from +`imageField` (typically the same message text). + +Event reasons consumed by discovery: `Pulling`, `Pulled`, `Failed`, `BackOff`, +`AlreadyPresent`. + +Duration semantics: +- `durationMode: messageDuration` parses `in 42.3s` from Pulled messages. +- `durationMode: eventPair` uses Pulled timestamp minus Pulling timestamp. +- Failures are tracked as `image:failed`; cache hits as `image:cache_hit`. + +Alloy shipping (real cluster events): +- Use + [`loki.source.kubernetes_events`](https://grafana.com/docs/alloy/latest/reference/components/loki/loki.source.kubernetes_events/) + forwarding to + [`loki.write`](https://grafana.com/docs/alloy/latest/reference/components/loki/loki.write/). +- With `log_format: json`, Alloy emits keys like `name`, `reason`, `msg` in the + log body. Default labels are `namespace`, `job`, `instance`. +- Parser mapping for Alloy JSON should be `podField: name`, + `reasonField: reason`, `messageField: msg`, `imageField: msg`. +- Raw event-exporter JSON usually uses `involvedObject_name` + `message`. + +#### What happens to our query + +Loki returns streams, each with `[timestamp, line]` entries. With Alloy +`log_format: json`, each line is a JSON event: + +```json +{ + "stream": {"job": "kubelet", "namespace": "default"}, + "values": [ + ["1719400000000000000", "{\"reason\":\"Pulling\",\"name\":\"runner-1\",\"msg\":\"Pulling image \\\"docker.io/library/redis:7-alpine\\\"\"}"], + ["1719400002000000000", "{\"reason\":\"Pulled\",\"name\":\"runner-1\",\"msg\":\"Successfully pulled image \\\"docker.io/library/redis:7-alpine\\\" in 704ms\"}"] + ] +} +``` + +The parser extracts image + reason from each entry, then builds per-image samples: + +| Parsed event | Output key | Value added | +|-------------|------------|-------------| +| `Pulled ... in 704ms` | `docker.io/library/redis:7-alpine` | `0.704` seconds | +| `Failed ...` or `BackOff ...` | `docker.io/library/redis:7-alpine:failed` | `1` | +| `already present on machine` | `docker.io/library/redis:7-alpine:cache_hit` | `1` | + +For `eventPullTime` signals, these samples are reduced by `statistic` +(`avg`/`p50`/`p95`/etc.) into one value per image. + +### Registry Query + +```yaml +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: registry-query-example +spec: + syncInterval: 1h + maxImages: 30 + queries: + - name: registry-tags + type: registry + registry: + url: https://registry.example.com + repositories: # repos to enumerate tags from + - team/frontend + - team/backend + tagFilter: "^v[0-9]+\\." # only tags starting v1. / v2. ... + topX: 3 # keep the last 3 matching tags returned per repo + imageTemplate: "{{.Registry}}/{{.Repository}}:{{.Tag}}" # built image ref + secretRef: + name: registry-api-creds # registry auth Secret in the operator namespace + signals: + - name: recent-tag-count + query: registry-tags + type: aggregate + aggregate: + method: count # rank by how many recent tags exist + ranking: + strategy: signal + signal: recent-tag-count +``` + +How it's used: registry discovery lists tags per repository via +`/v2//tags/list`, applies `tagFilter`, keeps `topX`, then renders full +image references via `imageTemplate`. + +Important behavior notes: +- `tagFilter` is regex on tag names. Anchor explicitly (`^...$`) when needed. +- `topX` keeps the last `N` matching tags in registry response order. It is not + true semver/date recency unless your registry already returns that order. +- `imageTemplate` variables: `{{.Registry}}`, `{{.Repository}}`, `{{.Tag}}`. + Default: `{{.Registry}}/{{.Repository}}:{{.Tag}}`. + +Signal fit: +- Great with `aggregate`/`timeWeightedAggregate`/`windowAggregate` (counts and + derived scores from discovered tag entries). +- Not compatible with `eventPullTime` (which requires pull event records). + +#### What happens to our query + +For each repository, the controller calls `/v2//tags/list`, then applies +`tagFilter`, `topX`, and `imageTemplate`. + +Example registry payload: + +```json +{"name":"team/frontend","tags":["v1.10.0","v1.11.0","dev-123","v1.12.0","v1.13.0"]} ``` +With `tagFilter: "^v[0-9]+\\."` and `topX: 3`, the kept tags are: + +| Repository | Matching tags | Kept (`topX=3`) | Rendered images | +|-----------|----------------|-----------------|-----------------| +| `team/frontend` | `v1.10.0`, `v1.11.0`, `v1.12.0`, `v1.13.0` | `v1.11.0`, `v1.12.0`, `v1.13.0` | `registry.example.com/team/frontend:v1.11.0` ... `:v1.13.0` | +| `team/backend` | `v2.3.0`, `v2.4.0` | `v2.3.0`, `v2.4.0` | `registry.example.com/team/backend:v2.3.0`, `:v2.4.0` | + +An `aggregate` signal with `method: count` then ranks by how many retained tags +each repository contributed. + ### Auth / TLS Both query types support a `secretRef` for authentication and TLS: ```yaml -queries: - - name: runner-image-usage - type: prometheus - prometheus: - endpoint: https://mimir.example.com - query: ... - secretRef: - name: prometheus-creds # Secret in the drop-system namespace +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: query-auth-example +spec: + syncInterval: 1h + maxImages: 30 + queries: + - name: runner-image-usage + type: prometheus + prometheus: + endpoint: https://mimir.example.com + query: ... + secretRef: + name: prometheus-creds # Secret in the operator namespace (typically drop-system) + signals: + - name: total-usage + query: runner-image-usage + type: aggregate + aggregate: + method: sum + ranking: + strategy: signal + signal: total-usage ``` - Supported Secret keys: `token`, `username`, `password`, `ca.crt`, `tls.crt`, `tls.key`, `headers.`. ## Stage 2 — Signals -A signal derives a named per-image value from exactly one query. +A signal derives a named per-image value from exactly one query. The four types reduce the same panel differently: + +| Type | Reduces to | Key knobs | +|------|-----------|-----------| +| `aggregate` | One value over all samples | `method`: sum/max/avg/count/min | +| `timeWeightedAggregate` | Weighted sum by hour-of-day | `windows`, `weight`, `timezone` | +| `windowAggregate` | One sub-window only | `relativeWindow` or `window` start/end | +| `eventPullTime` | Pull-time statistic | `statistic`: p50/p90/p95/avg/max | + +Signal × source compatibility: + +| Signal type | Prometheus | Loki | Registry | +|-------------|------------|------|----------| +| `aggregate` | yes | yes | yes | +| `timeWeightedAggregate` | yes | yes | yes | +| `windowAggregate` | yes | yes | yes | +| `eventPullTime` | no | yes (`kubernetesEvents`) | no | + +All Prometheus examples below run on this 48h dataset (sampled every 6h, both days identical): + +| Series | 00 | 06 | 12 | 18 | sum/day | 48h total | +|--------|----|----|----|----|---------|-----------| +| img-A | 2 | 3 | 6 | 4 | 15 | 30 | +| img-B | 0 | 1 | 3 | 2 | 6 | 12 | + +> The graphics use **6h buckets** (dots mark each sample) to fit the page; real queries sample every `step` (e.g. 1m). The shapes and totals match the math, not the true resolution. ### `aggregate` Aggregates all samples per image using a single method. +![aggregate sums every sample across the lookback window into one value per image.](/images/signal-aggregate.svg) + +On the shared dataset: every bar counts. img-A → 30, img-B → 12. The whole curve collapses to one number, so total volume wins regardless of *when* it happened. + +| `method` | Reduces to | img-A | img-B | Best for | +|----------|-----------|-------|-------|----------| +| `sum` | Total of all samples | 30 | 12 | total activity / volume | +| `max` | Largest single sample | 6 | 3 | peak concurrency / bursts | +| `avg` | Mean across samples | 3.8 | 1.5 | typical load | +| `min` | Smallest single sample | 2 | 0 | always-on baseline | +| `count` | Number of samples | 8 | 8 | how often it was seen | + ```yaml -signals: - - name: total-usage - queryRef: runner-image-usage - type: aggregate - aggregate: - method: sum # sum | max | avg | count | min - - - name: peak-concurrency - queryRef: runner-image-usage - type: aggregate - aggregate: - method: max +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: aggregate-signal-example +spec: + syncInterval: 1h + maxImages: 30 + queries: + - name: runner-image-usage + type: prometheus + prometheus: + endpoint: https://mimir.example.com + queryType: range + lookback: 168h + step: 1m + query: count(container_memory_working_set_bytes{container!="",container!="POD"}) by (image) + signals: + - name: total-usage + query: runner-image-usage + type: aggregate + aggregate: + method: sum # sum | max | avg | count | min (sum = total activity) + + - name: peak-concurrency + query: runner-image-usage + type: aggregate + aggregate: + method: max # captures burst behavior + ranking: + strategy: signal + signal: total-usage ``` ### `timeWeightedAggregate` Multiplies each sample value by a per-hour window weight before aggregation. +![timeWeightedAggregate scales each time band by its weight (e.g. core hours ×1.0, off-hours ×0.3) then sums.](/images/signal-timeweighted.svg) + +On the shared dataset: midday bars (×1.0) keep full value, shoulder bars (×0.3) shrink, off-hours (×0) vanish. img-A keeps most of its 30 because its peaks land in core hours; img-B fades further. Business-hour usage outranks 24h volume. + +| Window | Hours | `weight` | img-A keeps | img-B keeps | +|--------|-------|----------|-------------|-------------| +| warm-up | 07–09 | 0.3 | shoulder bars ×0.3 | shoulder bars ×0.3 | +| core | 09–17 | 1.0 | midday peak full | midday peak full | +| taper | 17–20 | 0.3 | evening ×0.3 | evening ×0.3 | +| off | else | 0 (`defaultWeight`) | dropped | dropped | +| **total** | | | **≈ 21** | **≈ 8** | + +`method` accepts sum/count/avg/max/min, but `sum` is the only one that meaningfully uses the weights. + ```yaml -signals: - - name: developer-weighted-usage - queryRef: runner-image-usage - type: timeWeightedAggregate - timeWeightedAggregate: - method: sum - timezone: Europe/Berlin - defaultWeight: "0" - windows: - - startHour: 7 - endHour: 9 - weight: "300m" # 0.3 (resource.Quantity format) - - startHour: 9 - endHour: 17 - weight: "1" # 1.0 — full weight during core hours - - startHour: 17 - endHour: 20 - weight: "300m" +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: time-weighted-signal-example +spec: + syncInterval: 1h + maxImages: 30 + queries: + - name: runner-image-usage + type: prometheus + prometheus: + endpoint: https://mimir.example.com + queryType: range + lookback: 168h + step: 1m + query: count(container_memory_working_set_bytes{container!="",container!="POD"}) by (image) + signals: + - name: developer-weighted-usage + query: runner-image-usage + type: timeWeightedAggregate + timeWeightedAggregate: + method: sum + timezone: Europe/Berlin # evaluate windows in local business time + defaultWeight: "0" # hours not listed below contribute nothing + windows: # weight = how much each hour-of-day counts + - startHour: 7 + endHour: 9 + weight: "0.3" # warm-up window = 0.3× + - startHour: 9 + endHour: 17 + weight: "1.0" # core hours = full weight + - startHour: 17 + endHour: 20 + weight: "0.3" # taper period = 0.3× + ranking: + strategy: signal + signal: developer-weighted-usage ``` ### `windowAggregate` -Aggregates only the samples within a specific time sub-window. +Aggregates only the samples within a specific time sub-window. There are two +ways to pick the window, and only one may be set per signal: + +![windowAggregate keeps only samples inside one sub-window (e.g. 09:00–17:00) and sums them.](/images/signal-windowaggregate.svg) + +On the shared dataset: only the shaded 09:00–17:00 band counts; bars outside it are dropped before summing. img-A ≈ 6 (its 12:00 peak), img-B ≈ 3. Everything outside the window is invisible — sharper than weighting. + +| Setting | Window | img-A | img-B | Use when | +|---------|--------|-------|-------|----------| +| `relativeWindow: 2h` | last 2h from now | 4 | 2 | "what is hot right now" | +| `window` 00:00–09:00 | off-hours | 5 | 1 | overnight / batch jobs | +| `window` 09:00–17:00 | core hours | 6 | 3 | protect active workday | + +`method` accepts sum/count/avg/max/min (default sum). Set **either** `relativeWindow` **or** `window`+`timezone` — never both. + +- `relativeWindow` — "the last N hours from now", measured in UTC. No timezone needed. +- `window` — fixed clock hours of the day (e.g. 09:00–17:00). You **must** also set + `timezone`; those hours are read in that zone. The policy errors if it is missing. ```yaml -signals: - # Relative window (last N duration before now) - - name: recent-usage - queryRef: runner-image-usage - type: windowAggregate - windowAggregate: - method: sum - relativeWindow: 2h - - # Wall-clock window (specific hours of day) - - name: pre-window-usage - queryRef: runner-image-usage - type: windowAggregate - windowAggregate: - method: sum - timezone: Europe/Berlin - window: - start: "00:00" - end: "09:00" +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: window-aggregate-signal-example +spec: + syncInterval: 1h + maxImages: 30 + queries: + - name: runner-image-usage + type: prometheus + prometheus: + endpoint: https://mimir.example.com + queryType: range + lookback: 168h + step: 1m + query: count(container_memory_working_set_bytes{container!="",container!="POD"}) by (image) + signals: + # Relative window: just the last 2 hours of samples (clock zone irrelevant) + - name: recent-usage + query: runner-image-usage + type: windowAggregate + windowAggregate: + method: sum + relativeWindow: 2h # good for "what is hot right now" + + # Wall-clock window: 00:00–09:00 every day, read in the timezone below + - name: pre-window-usage + query: runner-image-usage + type: windowAggregate + windowAggregate: + method: sum + timezone: Europe/Berlin # REQUIRED with window; start/end are Berlin local time + window: + start: "00:00" # inclusive + end: "09:00" # exclusive + + # Wall-clock window: 09:00–17:00 Berlin (the active period to protect) + - name: target-window-usage + query: runner-image-usage + type: windowAggregate + windowAggregate: + method: sum + timezone: Europe/Berlin # REQUIRED with window + window: + start: "09:00" + end: "17:00" + ranking: + strategy: signal + signal: recent-usage ``` ### `eventPullTime` -Derives image pull-time statistics from Loki event records. +Derives image pull-time statistics from Loki event records. Each `Pulled` event reports a duration; pairing it with its `Pulling` event gives the pull latency: + +```text +Pulling nginx:1.25-alpine +Pulled nginx:1.25-alpine in 730ms → nginx p50 = 730ms +Pulled redis:7-alpine in 690ms ┐ +Pulled redis:7-alpine in 700ms ├ p50 = 700ms, max = 4100ms +Pulled redis:7-alpine in 4100ms ┘ (one cold node, slow link) +``` + +A single image is pulled many times across nodes, so pick the statistic that matches intent. `p50` is the robust default: it answers "how slow is a typical pull" and ignores the one 4.1s outlier. `max` answers "what is the worst pull" and is dominated by that outlier. Use `max`/`p95` only when worst-case provisioning matters; otherwise `p50` avoids chasing noise. + +Slower images rank higher, since they hurt cold nodes most: + +![eventPullTime: nginx pulled once at 730ms, redis three times (690/700/4100); p50 per image becomes the signal.](/images/signal-eventpulltime.svg) + +This signal ignores the 48h volume dataset — it reads Loki pull durations instead. nginx p50 = 730ms, redis p50 = 700ms. The number is latency, not usage, so the slowest image ranks first. + +| `statistic` | Reduces to | nginx | redis | Best for | +|-------------|-----------|-------|-------|----------| +| `p50` | median pull | 730 | 700 | typical latency, ignores outliers | +| `p90` | slow tail | 730 | 3420 | worst-case planning | +| `p95` | slower tail | 730 | 3760 | strict SLOs | +| `avg` | mean pull | 730 | 1830 | overall cost (skewed by outliers) | +| `max` | slowest pull | 730 | 4100 | absolute worst pull | +| `count` | cold-pull events | 1 | 3 | how often pulled cold | +| `failureCount` | pull failures | 0 | 0 | flaky / broken images | +| `cacheHitCount` | already-present hits | 0 | 0 | nodes already warm | + +Two extra knobs: `includeCacheHits` (default `false`) adds "already present" events to duration stats; `durationMode` is `eventPair` (Pulled−Pulling timestamps) or `messageDuration` (parse "in 42.3s" from the message). ```yaml -signals: - - name: p50-cold-pull-time - queryRef: image-pull-events - type: eventPullTime - eventPullTime: - statistic: p50 # p50 | p90 | p95 | avg | max | count | failureCount | cacheHitCount - includeCacheHits: false - durationMode: eventPair # eventPair | messageDuration +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: event-pull-time-signal-example +spec: + syncInterval: 1h + maxImages: 30 + queries: + - name: image-pull-events + type: loki + loki: + endpoint: https://loki.example.com + queryType: range + lookback: 168h + query: | + {job="kubernetes-events", namespace="gitlab-runner"} + | json + | reason =~ "Pulling|Pulled|Failed|BackOff" + parser: + type: kubernetesEvents + podField: involvedObject_name + reasonField: reason + messageField: message + imageField: message + signals: + - name: avg-cold-pull-time + query: image-pull-events + type: eventPullTime + eventPullTime: + statistic: avg # p50 | p90 | p95 | avg | max | count | failureCount | cacheHitCount + includeCacheHits: false # ignore already-cached pulls in latency stats + durationMode: eventPair # eventPair (Pulling→Pulled) | messageDuration parsing + ranking: + strategy: signal + signal: avg-cold-pull-time ``` ## Stage 3 — Ranking Exactly one ranking strategy per policy. +![Decision map for ranking strategy selection: use signal for one dominant metric, weightedSum for balancing known trade-offs, and modelExposure for minimizing cold-node impact in rotating clusters.](/images/ranking-decision-map.svg) + ![The three ranking strategies side by side: signal orders by a single signal, weightedSum blends normalized signals, and modelExposure models post-rotation cold-node exposure.](/images/ranking-strategies.svg) ### `signal` @@ -210,56 +616,160 @@ Exactly one ranking strategy per policy. Ranks images directly by the value of a single signal. ```yaml -ranking: - strategy: signal - signal: - signalRef: total-usage +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: signal-ranking-example +spec: + syncInterval: 1h + maxImages: 30 + queries: + - name: runner-image-usage + type: prometheus + prometheus: + endpoint: https://mimir.example.com + queryType: range + lookback: 168h + step: 1m + query: count(container_memory_working_set_bytes{container!="",container!="POD"}) by (image) + signals: + - name: total-usage + query: runner-image-usage + type: aggregate + aggregate: + method: sum + ranking: + strategy: signal + signal: total-usage # simplest strategy: sort by one signal ``` ### `weightedSum` -Combines normalized signals using a weighted sum. - -```yaml -ranking: - strategy: weightedSum - weightedSum: - normalize: minMax # only method available - missingSignal: zero # zero | drop - terms: - - signalRef: total-usage - weight: "700m" # 0.7 in resource.Quantity format - - signalRef: peak-concurrency - weight: "300m" # 0.3 -``` - -Score: `final_score(I) = Σ weight_k * normalize(signal_k(I))` - -`minMax` normalization: `normalized(x) = (x - min) / (max - min)` — equals 1 when all values are equal. - -$$ -\mathrm{final\_score}(I) = \sum_k w_k \cdot \mathrm{normalize}(s_k(I)) -$$ +**Definition.** Blends several signals into one score by normalizing each to `[0,1]` and summing them with per-signal weights. Use it when no single signal decides — e.g. balance steady usage against burst peaks. $$ +\mathrm{final\_score}(I) = \sum_k w_k \cdot \mathrm{normalize}(s_k(I)), \qquad \mathrm{minMax}(x) = \frac{x - x_{\min}}{x_{\max} - x_{\min}} $$ +```yaml +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: weighted-sum-ranking-example +spec: + syncInterval: 1h + maxImages: 30 + # STAGE 1: fetch raw data + queries: + - name: runner-image-usage + type: prometheus + prometheus: + endpoint: https://mimir.example.com + queryType: range + lookback: 168h + step: 1m + query: count(container_memory_working_set_bytes{container!="",container!="POD"}) by (image) + # STAGE 2: two signals to balance + signals: + - name: total-usage # sustained activity + query: runner-image-usage + type: aggregate + aggregate: + method: sum + - name: peak-concurrency # burst behavior + query: runner-image-usage + type: aggregate + aggregate: + method: max + # STAGE 3: blend the two + ranking: + strategy: weightedSum + weightedSum: + normalize: minMax # rescale each signal to [0,1] before combining + missingSignal: zero # zero | drop (drop removes images missing any term) + terms: # weights are fractions, should sum to ~1.0 + - signal: total-usage + weight: "0.7" # 70% importance + - signal: peak-concurrency + weight: "0.3" # 30% importance +``` + +Field semantics: [`WeightedSumRankingConfig`](https://github.com/Breee/puller/blob/main/api/v1alpha1/discoverypolicy_types.go). + ### `modelExposure` Ranks images by expected post-rotation cold-node exposure. ```yaml -ranking: - strategy: modelExposure - modelExposure: - nodeCount: 100 - preWindowUsageSignalRef: pre-window-usage - targetWindowUsageSignalRef: developer-window-usage - pullTimeSignalRef: p50-cold-pull-time +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: model-exposure-ranking-example +spec: + syncInterval: 1h + maxImages: 30 + queries: + - name: runner-image-usage + type: prometheus + prometheus: + endpoint: https://mimir.example.com + queryType: range + lookback: 168h + step: 1m + query: count(container_memory_working_set_bytes{container!="",container!="POD"}) by (image) + - name: image-pull-events + type: loki + loki: + endpoint: https://loki.example.com + queryType: range + lookback: 168h + query: | + {job="kubernetes-events", namespace="gitlab-runner"} + | json + | reason =~ "Pulling|Pulled|Failed|BackOff" + parser: + type: kubernetesEvents + podField: involvedObject_name + reasonField: reason + messageField: message + imageField: message + signals: + - name: pre-window-usage + query: runner-image-usage + type: windowAggregate + windowAggregate: + method: sum + timezone: Europe/Berlin + window: + start: "00:00" + end: "09:00" + - name: target-window-usage + query: runner-image-usage + type: windowAggregate + windowAggregate: + method: sum + timezone: Europe/Berlin + window: + start: "09:00" + end: "17:00" + - name: avg-cold-pull-time + query: image-pull-events + type: eventPullTime + eventPullTime: + statistic: avg + includeCacheHits: false + durationMode: eventPair + ranking: + strategy: modelExposure + modelExposure: + nodeCount: 100 # cluster size N (rotation spreads cache) + preWindowUsageSignal: pre-window-usage # usage already seen before target + targetWindowUsageSignal: target-window-usage # usage during peak window to protect + pullTimeSignal: avg-cold-pull-time # colder/slower pulls get higher urgency ``` -Score: `score(I) = J_target(I) * (1 - 1/N)^J_pre(I) * p_hat(I)` +Score formula: $$ \mathrm{score}(I) = J_{\mathrm{target}}(I) \cdot \left(1 - \frac{1}{N}\right)^{J_{\mathrm{pre}}(I)} \cdot \hat{p}(I) @@ -275,8 +785,8 @@ kind: DiscoveryPolicy metadata: name: total-usage spec: - syncInterval: 1h - maxImages: 30 + syncInterval: 1h # rerun pipeline every hour + maxImages: 30 # keep top 30 ranked images queries: - name: runner-image-usage @@ -296,15 +806,14 @@ spec: signals: - name: total-usage - queryRef: runner-image-usage + query: runner-image-usage type: aggregate aggregate: - method: sum + method: sum # total usage in lookback window ranking: strategy: signal - signal: - signalRef: total-usage + signal: total-usage ``` ### Example 2: Hybrid Usage + Peak Concurrency @@ -336,13 +845,13 @@ spec: signals: - name: total-usage - queryRef: runner-image-usage + query: runner-image-usage type: aggregate aggregate: method: sum - name: peak-concurrency - queryRef: runner-image-usage + query: runner-image-usage type: aggregate aggregate: method: max @@ -353,10 +862,10 @@ spec: normalize: minMax missingSignal: zero terms: - - signalRef: total-usage - weight: "700m" - - signalRef: peak-concurrency - weight: "300m" + - signal: total-usage + weight: "0.7" # prioritize sustained usage + - signal: peak-concurrency + weight: "0.3" # still account for bursts ``` ### Example 3: Developer-Time Weighted Usage @@ -388,25 +897,25 @@ spec: signals: - name: developer-weighted-usage - queryRef: runner-image-usage + query: runner-image-usage type: timeWeightedAggregate timeWeightedAggregate: method: sum timezone: Europe/Berlin - defaultWeight: "0" + defaultWeight: "0" # off-hours ignored by default windows: - startHour: 7 endHour: 9 - weight: "300m" + weight: "0.3" - startHour: 9 endHour: 17 - weight: "1" + weight: "1.0" - startHour: 17 endHour: 20 - weight: "300m" + weight: "0.3" - name: peak-concurrency - queryRef: runner-image-usage + query: runner-image-usage type: aggregate aggregate: method: max @@ -417,10 +926,93 @@ spec: normalize: minMax missingSignal: zero terms: - - signalRef: developer-weighted-usage - weight: "700m" - - signalRef: peak-concurrency - weight: "300m" + - signal: developer-weighted-usage + weight: "0.7" + - signal: peak-concurrency + weight: "0.3" +``` + +### Example 4: Model-Aware Exposure + +```yaml +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: gitlab-model-exposure +spec: + syncInterval: 1h + maxImages: 30 + + queries: + - name: runner-image-usage + type: prometheus + prometheus: + endpoint: https://mimir.example.com + queryType: range + lookback: 168h + step: 1m + query: | + count( + container_memory_working_set_bytes{ + container!="", container!="POD", + namespace="gitlab-runner", pod=~"runner-.*" + } + ) by (image) + + - name: image-pull-events + type: loki + loki: + endpoint: https://loki.example.com + queryType: range + lookback: 168h + query: | + {job="kubernetes-events", namespace="gitlab-runner"} + | json + | involvedObject_name =~ "runner-.*" + | reason =~ "Pulling|Pulled|Failed|BackOff" + parser: + type: kubernetesEvents + podField: involvedObject_name + reasonField: reason + messageField: message + imageField: message + + signals: + - name: pre-window-usage + query: runner-image-usage + type: windowAggregate + windowAggregate: + method: sum + timezone: Europe/Berlin # window hours below are Berlin local time + window: + start: "00:00" # prior window + end: "09:00" + + - name: target-window-usage + query: runner-image-usage + type: windowAggregate + windowAggregate: + method: sum + timezone: Europe/Berlin # window hours below are Berlin local time + window: + start: "09:00" # target active window + end: "17:00" + + - name: avg-cold-pull-time + query: image-pull-events + type: eventPullTime + eventPullTime: + statistic: avg # mean latency signal; use p95 if you need tail sensitivity + includeCacheHits: false + durationMode: eventPair + + ranking: + strategy: modelExposure + modelExposure: + nodeCount: 100 # tune to your typical active node count + preWindowUsageSignal: pre-window-usage + targetWindowUsageSignal: target-window-usage + pullTimeSignal: avg-cold-pull-time ``` ## Status and Observability @@ -470,8 +1062,8 @@ status: | 4 | Recent usage | `Σ count_I(t)` over recent window | `recent-usage` | | 5 | Hybrid usage + peak | `α·norm(total) + (1-α)·norm(peak)` | `total-usage`, `peak-concurrency` | | 6 | Hybrid dev-time + peak | `α·norm(dev) + (1-α)·norm(peak)` | `developer-weighted-usage`, `peak-concurrency` | -| 7 | Count × pull time | `total_usage(I) · p_hat(I)` | `total-usage`, `p50-cold-pull-time` | -| 9 | Model-aware exposure | `J_target · (1-1/N)^J_pre · p_hat` | `pre-window-usage`, `target-window-usage`, `p50-cold-pull-time` | +| 7 | Count × pull time | `total_usage(I) · p_hat(I)` | `total-usage`, `avg-cold-pull-time` | +| 8 | Model-aware exposure | `J_target · (1-1/N)^J_pre · p_hat` | `pre-window-usage`, `target-window-usage`, `avg-cold-pull-time` | ## Error Handling diff --git a/docs/content/docs/reference/_generated_crds.md b/docs/content/docs/reference/_generated_crds.md index 7f4de4a..27fb130 100644 --- a/docs/content/docs/reference/_generated_crds.md +++ b/docs/content/docs/reference/_generated_crds.md @@ -210,7 +210,7 @@ DiscoveryQuery defines a named raw-data source referenced by signals. | Field | Type | Required | Default | Description | |-------|------|----------|---------|-------------| -| `name` | `string` | Yes | — | Name is the unique identifier for this query within the policy. Signals reference queries by this name via queryRef. | +| `name` | `string` | Yes | — | Name is the unique identifier for this query within the policy. Signals reference queries by this name via query. | | `type` | `DiscoveryQueryType` | Yes | — | Type selects the backend. Must be "prometheus", "loki", or "registry". | | `prometheus` | `*DiscoveryPrometheusQuery` | No | — | Prometheus contains the configuration when type=prometheus. | | `loki` | `*DiscoveryLokiQuery` | No | — | Loki contains the configuration when type=loki. | @@ -224,7 +224,7 @@ DiscoveryRanking defines how signals are combined into the final ordered image l | Field | Type | Required | Default | Description | |-------|------|----------|---------|-------------| | `strategy` | `RankingStrategy` | Yes | — | Strategy selects the ranking algorithm. | -| `signal` | `*SignalRankingConfig` | No | — | Signal is required when strategy=signal. | +| `signal` | `string` | No | — | Signal is the name of the signal whose values determine image rank. Must match a signals[].name within the same policy. Required when strategy=signal. | | `weightedSum` | `*WeightedSumRankingConfig` | No | — | WeightedSum is required when strategy=weightedSum. | | `modelExposure` | `*ModelExposureRankingConfig` | No | — | ModelExposure is required when strategy=modelExposure. | @@ -247,7 +247,7 @@ DiscoverySignal defines a named per-image metric derived from a single query. | Field | Type | Required | Default | Description | |-------|------|----------|---------|-------------| | `name` | `string` | Yes | — | Name is the unique identifier for this signal within the policy. Ranking configurations reference signals by this name. | -| `queryRef` | `string` | Yes | — | QueryRef is the name of the query that provides raw data for this signal. Must match a queries[].name within the same policy. | +| `query` | `string` | Yes | — | Query is the name of the query that provides raw data for this signal. Must match a queries[].name within the same policy. | | `type` | `SignalType` | Yes | — | Type selects the signal derivation method. | | `aggregate` | `*AggregateSignalConfig` | No | — | Aggregate is required when type=aggregate. | | `timeWeightedAggregate` | `*TimeWeightedAggregateSignalConfig` | No | — | TimeWeightedAggregate is required when type=timeWeightedAggregate. | @@ -256,13 +256,14 @@ DiscoverySignal defines a named per-image metric derived from a single query. ### EventPullTimeSignalConfig -EventPullTimeSignalConfig configures the eventPullTime signal type. The referenced query must be a Loki query. +EventPullTimeSignalConfig configures the eventPullTime signal type. The referenced query must be a Loki query. Pull duration and image size are extracted from the same Pulled events; metric selects which one to rank on. | Field | Type | Required | Default | Description | |-------|------|----------|---------|-------------| -| `statistic` | `EventPullTimeStatistic` | Yes | — | Statistic selects which pull-time metric to compute. | -| `includeCacheHits` | `bool` | Yes | false | IncludeCacheHits controls whether "already present on machine" events are included in cold-pull duration statistics. Set to false to exclude cache hits. | -| `durationMode` | `DurationMode` | Yes | — | DurationMode controls how pull duration is extracted from event records. | +| `metric` | `EventMetric` | No | pullTime | Metric selects which per-image quantity to aggregate. Defaults to pullTime, which correlates strongly with cold-start cost. Use imageSize to rank by bytes. | +| `statistic` | `EventStatistic` | Yes | — | Statistic selects how the metric's samples are aggregated per image. | +| `includeCacheHits` | `bool` | Yes | false | IncludeCacheHits controls whether "already present on machine" events are included in cold-pull duration statistics. Set to false to exclude cache hits. Only applies when metric=pullTime. | +| `durationMode` | `DurationMode` | Yes | — | DurationMode controls how pull duration is extracted from event records. Only applies when metric=pullTime. | ### ImageEntry @@ -293,9 +294,9 @@ ModelExposureRankingConfig configures the modelExposure ranking strategy. Score | Field | Type | Required | Default | Description | |-------|------|----------|---------|-------------| | `nodeCount` | `int32` | Yes | — | NodeCount is the number of eligible CI nodes (N in the exposure formula). | -| `preWindowUsageSignalRef` | `string` | Yes | — | PreWindowUsageSignalRef is the name of the signal representing usage before the target window. Must match a signals[].name within the same policy. | -| `targetWindowUsageSignalRef` | `string` | Yes | — | TargetWindowUsageSignalRef is the name of the signal representing usage during the target window. Must match a signals[].name within the same policy. | -| `pullTimeSignalRef` | `string` | Yes | — | PullTimeSignalRef is the name of the signal providing per-image pull-time estimates. Must match a signals[].name within the same policy. | +| `preWindowUsageSignal` | `string` | Yes | — | PreWindowUsageSignal is the name of the signal representing usage before the target window. Must match a signals[].name within the same policy. | +| `targetWindowUsageSignal` | `string` | Yes | — | TargetWindowUsageSignal is the name of the signal representing usage during the target window. Must match a signals[].name within the same policy. | +| `pullTimeSignal` | `string` | Yes | — | PullTimeSignal is the name of the signal providing per-image pull-time estimates. Must match a signals[].name within the same policy. | ### PolicyReference @@ -316,14 +317,6 @@ QueryResult reports the outcome of a single named query execution. | `status` | `QueryResultStatus` | Yes | — | Status is "success" or "failed". | | `message` | `string` | No | — | Message describes the failure reason when status=failed. | -### SignalRankingConfig - -SignalRankingConfig configures the signal ranking strategy. - -| Field | Type | Required | Default | Description | -|-------|------|----------|---------|-------------| -| `signalRef` | `string` | Yes | — | SignalRef is the name of the signal whose values determine image rank. Must match a signals[].name within the same policy. | - ### TimeOfDayWindow TimeOfDayWindow defines a fixed wall-clock time range within each day. @@ -370,7 +363,7 @@ WeightedSumTerm defines one signal contribution in a weightedSum ranking. | Field | Type | Required | Default | Description | |-------|------|----------|---------|-------------| -| `signalRef` | `string` | Yes | — | SignalRef is the name of the signal to include in the weighted sum. Must match a signals[].name within the same policy. | +| `signal` | `string` | Yes | — | Signal is the name of the signal to include in the weighted sum. Must match a signals[].name within the same policy. | | `weight` | `resource.Quantity` | Yes | — | Weight is the factor applied to the normalized signal value. All weights should be non-negative; they do not need to sum to 1. Example: "0.7" | ### WindowAggregateSignalConfig diff --git a/docs/go.mod b/docs/go.mod index a8b9b26..cc0eced 100644 --- a/docs/go.mod +++ b/docs/go.mod @@ -1,3 +1,5 @@ module github.com/corewire/drop/docs go 1.26.0 + +require github.com/imfing/hextra v0.12.3 // indirect diff --git a/docs/go.sum b/docs/go.sum index e69de29..afa8680 100644 --- a/docs/go.sum +++ b/docs/go.sum @@ -0,0 +1,2 @@ +github.com/imfing/hextra v0.12.3 h1:DZHY2rUWYteyzjlHi9r4n7Bb5e2Q+6LXe4C1Dqn0ZjM= +github.com/imfing/hextra v0.12.3/go.mod h1:vi+yhpq8YPp/aghvJlNKVnJKcPJ/VyAEcfC1BSV9ARo= diff --git a/docs/static/images/discovery-pipeline.svg b/docs/static/images/discovery-pipeline.svg index 2f11b94..4c4b38e 100644 --- a/docs/static/images/discovery-pipeline.svg +++ b/docs/static/images/discovery-pipeline.svg @@ -1,132 +1,87 @@ - - - DiscoveryPolicy Pipeline - queries → signals → ranking → discoveredImages → CachedImageSet → node pulls + DiscoveryPolicy pipeline + queries -> signals -> ranking -> discoveredImages -> CachedImageSet -> CachedImage -> pull pods - - - STAGE 1 · queries - STAGE 2 · signals - STAGE 3 · ranking - output + + + + - - - - - + + STAGE 1 - queries + STAGE 2 - signals + STAGE 3 - ranking + materialization - - - prometheus - range / instant series + + prometheus query + range/instant image samples - - loki - image pull event log + + loki query + pull events / durations - - registry - tag / catalog listing + + registry query + repo tags -> image refs - - - aggregate + + aggregate - - timeWeightedAggregate + + timeWeightedAggregate - - windowAggregate + + windowAggregate - - eventPullTime - loki only + + eventPullTime + requires loki events - named per-image value derived from one queryRef + aggregate/timeWeighted/window accept any source + eventPullTime requires Loki kubernetes events - - - signal - one signal, direct order + + signal + scalar: signal: - - weightedSum - Σ wₖ · normalize(signalₖ) + + weightedSum + terms: signal + weight - - modelExposure - cold-node exposure model + + modelExposure + pre/target/pull-time signals - exactly one strategy per policy + pick exactly one ranking strategy - - - status. - discoveredImages - image · rank · finalScore - - - CachedImageSet - discoveryPolicyRef - creates CachedImage - - - node pulls - paced by PullPolicy - + + status.discoveredImages + image + rank + finalScore - - - - - - - - - - - - - - - - - - - - - - - - + + CachedImageSet + reads discoveryPolicyRef + + + CachedImage + pull pods + paced by PullPolicy - - - - - - - - - - - - - Status fields: - queryResults: name · type · status · message — discoveredImages: image · rank · finalScore. - - - ↻ re-runs every spec.syncInterval · keeps last known good results on transient query failure + + + + Status notes: + queryResults records source health; discoveredImages stores the ranked output consumed by CachedImageSet. + + reruns every spec.syncInterval; keeps last successful results during transient source errors diff --git a/docs/static/images/prometheus-sampling.svg b/docs/static/images/prometheus-sampling.svg new file mode 100644 index 0000000..c817ef8 --- /dev/null +++ b/docs/static/images/prometheus-sampling.svg @@ -0,0 +1,20 @@ + + + + count(...) by (image) + last 48h, step 1h + + img-A Σ30 + img-B Σ12 + + + + day 1day 2 + 0246 + 000612180006121824 + hour of day + + + + + diff --git a/docs/static/images/ranking-decision-map.svg b/docs/static/images/ranking-decision-map.svg new file mode 100644 index 0000000..fa37641 --- /dev/null +++ b/docs/static/images/ranking-decision-map.svg @@ -0,0 +1,72 @@ + + + + Which Ranking Strategy Should I Use? + Pick the simplest strategy that matches your operational decision. + + + What are you optimizing for? + Use one path below + + + + + + + + + One dominant metric + "Most used image wins" + Easy to explain and tune + + + + + strategy: signal + Inputs: one signalRef + Use when simplicity matters + and one metric dominates + + + + Balance multiple metrics + "Usage + concurrency" + Need transparent weighting + + + + + strategy: weightedSum + Inputs: 2+ signals + weights + Use when trade-offs are known + and you want explicit control + + + + Minimize cold-node impact + Frequent node rotation + Pull-time is a key cost + + + + + strategy: modelExposure + Inputs: pre/target/pull-time + Use when readiness after + rotation is the main objective + + + Recommended progression: signal → weightedSum → modelExposure as operational requirements become stricter. + + + + + + + + + + + + + diff --git a/docs/static/images/signal-aggregate.svg b/docs/static/images/signal-aggregate.svg new file mode 100644 index 0000000..93e3ebc --- /dev/null +++ b/docs/static/images/signal-aggregate.svg @@ -0,0 +1,26 @@ + + + + aggregate + sum across samples; min/max are dots, avg is dashed line + + + + + 0246 + 000612180006121824 + hour of day + + + + + + + + A avg 3.8B avg 1.5 + + img-A Σ30 + img-B Σ12 + max dot + min dot + diff --git a/docs/static/images/signal-eventpulltime.svg b/docs/static/images/signal-eventpulltime.svg new file mode 100644 index 0000000..fd9f9b6 --- /dev/null +++ b/docs/static/images/signal-eventpulltime.svg @@ -0,0 +1,22 @@ + + + + eventPullTime — Pulling→Pulled duration per image; p50 ignores the outlier, max chases it + + nginx + redis + p50 marker + max marker + + + 1000200030004000 + pull ms + nginx:1.25 + 730 + redis:7 + 690 + 700 + 4100 + p50 700 + max 4100 + diff --git a/docs/static/images/signal-timeweighted.svg b/docs/static/images/signal-timeweighted.svg new file mode 100644 index 0000000..ac62483 --- /dev/null +++ b/docs/static/images/signal-timeweighted.svg @@ -0,0 +1,22 @@ + + + + timeWeightedAggregate + each hour scaled by weight, then summed + + img-A + img-B + x0 + x0.3 + x1.0 + + + + + 0012001224 + hour of day · 07-09 ×0.3 · 09-17 ×1.0 · 17-20 ×0.3 · rest ×0 + + + + + diff --git a/docs/static/images/signal-windowaggregate.svg b/docs/static/images/signal-windowaggregate.svg new file mode 100644 index 0000000..3a8d288 --- /dev/null +++ b/docs/static/images/signal-windowaggregate.svg @@ -0,0 +1,21 @@ + + + + windowAggregate — only 09:00–17:00 each day + + img-A + img-B + selected window + + + + + + 0246 + 000612180006121824 + hour of day + + + + + diff --git a/docs/static/llms-full.txt b/docs/static/llms-full.txt index 1b02bee..dbbf428 100644 --- a/docs/static/llms-full.txt +++ b/docs/static/llms-full.txt @@ -184,7 +184,7 @@ DiscoveryQuery defines a named raw-data source referenced by signals. | Field | JSON | Type | Required | Default | Description | |-------|------|------|----------|---------|-------------| -| Name | `name` | `string` | ✓ | | Name is the unique identifier for this query within the policy. Signals reference queries by this name via queryRef. | +| Name | `name` | `string` | ✓ | | Name is the unique identifier for this query within the policy. Signals reference queries by this name via query. | | Type | `type` | `DiscoveryQueryType` | ✓ | | Type selects the backend. Must be "prometheus", "loki", or "registry". Enum: `prometheus`,`loki`,`registry` | | Prometheus | `prometheus` | `*DiscoveryPrometheusQuery` | — | | Prometheus contains the configuration when type=prometheus. | | Loki | `loki` | `*DiscoveryLokiQuery` | — | | Loki contains the configuration when type=loki. | @@ -198,7 +198,7 @@ DiscoveryRanking defines how signals are combined into the final ordered image l | Field | JSON | Type | Required | Default | Description | |-------|------|------|----------|---------|-------------| | Strategy | `strategy` | `RankingStrategy` | ✓ | | Strategy selects the ranking algorithm. Enum: `signal`,`weightedSum`,`modelExposure` | -| Signal | `signal` | `*SignalRankingConfig` | — | | Signal is required when strategy=signal. | +| Signal | `signal` | `string` | — | | Signal is the name of the signal whose values determine image rank. Must match a signals[].name within the same policy. Required when strategy=signal. | | WeightedSum | `weightedSum` | `*WeightedSumRankingConfig` | — | | WeightedSum is required when strategy=weightedSum. | | ModelExposure | `modelExposure` | `*ModelExposureRankingConfig` | — | | ModelExposure is required when strategy=modelExposure. | @@ -221,7 +221,7 @@ DiscoverySignal defines a named per-image metric derived from a single query. | Field | JSON | Type | Required | Default | Description | |-------|------|------|----------|---------|-------------| | Name | `name` | `string` | ✓ | | Name is the unique identifier for this signal within the policy. Ranking configurations reference signals by this name. | -| QueryRef | `queryRef` | `string` | ✓ | | QueryRef is the name of the query that provides raw data for this signal. Must match a queries[].name within the same policy. | +| Query | `query` | `string` | ✓ | | Query is the name of the query that provides raw data for this signal. Must match a queries[].name within the same policy. | | Type | `type` | `SignalType` | ✓ | | Type selects the signal derivation method. Enum: `aggregate`,`timeWeightedAggregate`,`windowAggregate`,`eventPullTime` | | Aggregate | `aggregate` | `*AggregateSignalConfig` | — | | Aggregate is required when type=aggregate. | | TimeWeightedAggregate | `timeWeightedAggregate` | `*TimeWeightedAggregateSignalConfig` | — | | TimeWeightedAggregate is required when type=timeWeightedAggregate. | @@ -230,13 +230,14 @@ DiscoverySignal defines a named per-image metric derived from a single query. ### EventPullTimeSignalConfig -EventPullTimeSignalConfig configures the eventPullTime signal type. The referenced query must be a Loki query. +EventPullTimeSignalConfig configures the eventPullTime signal type. The referenced query must be a Loki query. Pull duration and image size are extracted from the same Pulled events; metric selects which one to rank on. | Field | JSON | Type | Required | Default | Description | |-------|------|------|----------|---------|-------------| -| Statistic | `statistic` | `EventPullTimeStatistic` | ✓ | | Statistic selects which pull-time metric to compute. Enum: `p50`,`p90`,`p95`,`avg`,`max`,`count`,`failureCount`,`cacheHitCount` | -| IncludeCacheHits | `includeCacheHits` | `bool` | ✓ | `false` | IncludeCacheHits controls whether "already present on machine" events are included in cold-pull duration statistics. Set to false to exclude cache hits. | -| DurationMode | `durationMode` | `DurationMode` | ✓ | | DurationMode controls how pull duration is extracted from event records. Enum: `eventPair`,`messageDuration` | +| Metric | `metric` | `EventMetric` | — | `pullTime` | Metric selects which per-image quantity to aggregate. Defaults to pullTime, which correlates strongly with cold-start cost. Use imageSize to rank by bytes. | +| Statistic | `statistic` | `EventStatistic` | ✓ | | Statistic selects how the metric's samples are aggregated per image. Enum: `p50`,`p90`,`p95`,`avg`,`max`,`count` | +| IncludeCacheHits | `includeCacheHits` | `bool` | ✓ | `false` | IncludeCacheHits controls whether "already present on machine" events are included in cold-pull duration statistics. Set to false to exclude cache hits. Only applies when metric=pullTime. | +| DurationMode | `durationMode` | `DurationMode` | ✓ | | DurationMode controls how pull duration is extracted from event records. Only applies when metric=pullTime. Enum: `eventPair`,`messageDuration` | ### ImageEntry @@ -267,9 +268,9 @@ ModelExposureRankingConfig configures the modelExposure ranking strategy. Score | Field | JSON | Type | Required | Default | Description | |-------|------|------|----------|---------|-------------| | NodeCount | `nodeCount` | `int32` | ✓ | | NodeCount is the number of eligible CI nodes (N in the exposure formula). | -| PreWindowUsageSignalRef | `preWindowUsageSignalRef` | `string` | ✓ | | PreWindowUsageSignalRef is the name of the signal representing usage before the target window. Must match a signals[].name within the same policy. | -| TargetWindowUsageSignalRef | `targetWindowUsageSignalRef` | `string` | ✓ | | TargetWindowUsageSignalRef is the name of the signal representing usage during the target window. Must match a signals[].name within the same policy. | -| PullTimeSignalRef | `pullTimeSignalRef` | `string` | ✓ | | PullTimeSignalRef is the name of the signal providing per-image pull-time estimates. Must match a signals[].name within the same policy. | +| PreWindowUsageSignal | `preWindowUsageSignal` | `string` | ✓ | | PreWindowUsageSignal is the name of the signal representing usage before the target window. Must match a signals[].name within the same policy. | +| TargetWindowUsageSignal | `targetWindowUsageSignal` | `string` | ✓ | | TargetWindowUsageSignal is the name of the signal representing usage during the target window. Must match a signals[].name within the same policy. | +| PullTimeSignal | `pullTimeSignal` | `string` | ✓ | | PullTimeSignal is the name of the signal providing per-image pull-time estimates. Must match a signals[].name within the same policy. | ### PolicyReference @@ -290,14 +291,6 @@ QueryResult reports the outcome of a single named query execution. | Status | `status` | `QueryResultStatus` | ✓ | | Status is "success" or "failed". | | Message | `message` | `string` | — | | Message describes the failure reason when status=failed. | -### SignalRankingConfig - -SignalRankingConfig configures the signal ranking strategy. - -| Field | JSON | Type | Required | Default | Description | -|-------|------|------|----------|---------|-------------| -| SignalRef | `signalRef` | `string` | ✓ | | SignalRef is the name of the signal whose values determine image rank. Must match a signals[].name within the same policy. | - ### TimeOfDayWindow TimeOfDayWindow defines a fixed wall-clock time range within each day. @@ -344,7 +337,7 @@ WeightedSumTerm defines one signal contribution in a weightedSum ranking. | Field | JSON | Type | Required | Default | Description | |-------|------|------|----------|---------|-------------| -| SignalRef | `signalRef` | `string` | ✓ | | SignalRef is the name of the signal to include in the weighted sum. Must match a signals[].name within the same policy. | +| Signal | `signal` | `string` | ✓ | | Signal is the name of the signal to include in the weighted sum. Must match a signals[].name within the same policy. | | Weight | `weight` | `resource.Quantity` | ✓ | | Weight is the factor applied to the normalized signal value. All weights should be non-negative; they do not need to sum to 1. Example: "0.7" | ### WindowAggregateSignalConfig @@ -514,14 +507,13 @@ spec: query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff", pod=~"runner-.*"}) by (image)' signals: - name: total-usage - queryRef: runner-image-usage + query: runner-image-usage type: aggregate aggregate: method: sum ranking: strategy: signal - signal: - signalRef: total-usage + signal: total-usage syncInterval: 30s maxImages: 10 --- @@ -542,12 +534,12 @@ spec: query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff"}) by (image)' signals: - name: total-usage - queryRef: runner-image-usage + query: runner-image-usage type: aggregate aggregate: method: sum - name: peak-concurrency - queryRef: runner-image-usage + query: runner-image-usage type: aggregate aggregate: method: max @@ -557,9 +549,9 @@ spec: normalize: minMax missingSignal: zero terms: - - signalRef: total-usage + - signal: total-usage weight: "700m" - - signalRef: peak-concurrency + - signal: peak-concurrency weight: "300m" syncInterval: 30s maxImages: 10 @@ -579,14 +571,13 @@ spec: query: 'container_memory_working_set_bytes{namespace="build-stuff"}' signals: - name: current - queryRef: current-usage + query: current-usage type: aggregate aggregate: method: max ranking: strategy: signal - signal: - signalRef: current + signal: current syncInterval: 30s maxImages: 10 --- @@ -608,7 +599,7 @@ spec: query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff"}) by (image)' signals: - name: business-hours-usage - queryRef: runner-image-usage + query: runner-image-usage type: timeWeightedAggregate timeWeightedAggregate: method: sum @@ -620,8 +611,7 @@ spec: weight: "2" ranking: strategy: signal - signal: - signalRef: business-hours-usage + signal: business-hours-usage syncInterval: 30s maxImages: 10 --- @@ -642,15 +632,14 @@ spec: query: 'count(container_memory_working_set_bytes{container!="", container!="POD", namespace="build-stuff"}) by (image)' signals: - name: recent-usage - queryRef: runner-image-usage + query: runner-image-usage type: windowAggregate windowAggregate: method: sum relativeWindow: 6h ranking: strategy: signal - signal: - signalRef: recent-usage + signal: recent-usage syncInterval: 30s maxImages: 10 --- @@ -667,28 +656,36 @@ spec: endpoint: "http://loki.e2e-infra.svc.cluster.local:3100" queryType: range lookback: 24h - query: '{job="kubelet",drop_e2e="true"}' + query: '{job="kubernetes-events",drop_e2e="true"}' parser: type: kubernetesEvents signals: - name: p50-cold-pull-time - queryRef: image-pull-events + query: image-pull-events type: eventPullTime eventPullTime: statistic: p50 durationMode: messageDuration includeCacheHits: false - name: pull-failures - queryRef: image-pull-events + query: image-pull-events + type: eventPullTime + eventPullTime: + metric: failure + statistic: count + durationMode: messageDuration + includeCacheHits: false + - name: avg-image-size + query: image-pull-events type: eventPullTime eventPullTime: - statistic: failureCount + metric: imageSize + statistic: avg durationMode: messageDuration includeCacheHits: false ranking: strategy: signal - signal: - signalRef: p50-cold-pull-time + signal: p50-cold-pull-time syncInterval: 30s maxImages: 10 --- @@ -711,14 +708,13 @@ spec: topX: 5 signals: - name: tag-recency - queryRef: registry-tags + query: registry-tags type: aggregate aggregate: method: sum ranking: strategy: signal - signal: - signalRef: tag-recency + signal: tag-recency syncInterval: 30s maxImages: 20 --- @@ -744,22 +740,22 @@ spec: endpoint: "http://loki.e2e-infra.svc.cluster.local:3100" queryType: range lookback: 24h - query: '{job="kubelet",drop_e2e="true"}' + query: '{job="kubernetes-events",drop_e2e="true"}' parser: type: kubernetesEvents signals: - name: pre-usage - queryRef: runner-image-usage + query: runner-image-usage type: aggregate aggregate: method: sum - name: target-usage - queryRef: runner-image-usage + query: runner-image-usage type: aggregate aggregate: method: max - name: pull-time - queryRef: image-pull-events + query: image-pull-events type: eventPullTime eventPullTime: statistic: p50 @@ -769,9 +765,9 @@ spec: strategy: modelExposure modelExposure: nodeCount: 3 - preWindowUsageSignalRef: pre-usage - targetWindowUsageSignalRef: target-usage - pullTimeSignalRef: pull-time + preWindowUsageSignal: pre-usage + targetWindowUsageSignal: target-usage + pullTimeSignal: pull-time syncInterval: 30s maxImages: 10 --- @@ -789,14 +785,13 @@ spec: query: "up{}" signals: - name: total-usage - queryRef: broken-query + query: broken-query type: aggregate aggregate: method: sum ranking: strategy: signal - signal: - signalRef: total-usage + signal: total-usage syncInterval: 30m maxImages: 10 --- @@ -815,14 +810,13 @@ spec: - test/app signals: - name: tag-recency - queryRef: broken-registry + query: broken-registry type: aggregate aggregate: method: sum ranking: strategy: signal - signal: - signalRef: tag-recency + signal: tag-recency syncInterval: 30m maxImages: 10 --- @@ -841,14 +835,13 @@ spec: - test/does-not-exist signals: - name: tag-recency - queryRef: missing-repo + query: missing-repo type: aggregate aggregate: method: sum ranking: strategy: signal - signal: - signalRef: tag-recency + signal: tag-recency syncInterval: 30m maxImages: 10 @@ -883,5 +876,13 @@ spec: make docs-serve # Serve Hugo docs locally. make docs-gen # Regenerate AI agent docs (llms.txt, instructions, etc.) from source. make docs-gen-check # Verify generated AI docs are up to date. + make research-tex-build # Build research PDF from TeX source (override RESEARCH_TEX_FILE=). + make research-bench-setup # Create benchmark venv and install Python dependencies. + make research-bench-generate # Generate synthetic benchmark dataset. + make research-bench-replay # Run replay policy evaluation from benchmark data. + make research-bench-discovery # Evaluate discovery strategies from benchmark data. + make research-bench-plot # Render example pipeline Gantt figure. + make research-bench-20runs # Run 20-run discovery strategy benchmark batch. + make research-bench-all # Run full synthetic benchmark workflow. make tools # Install local tooling and check optional docs/chart binaries. ``` From 0730a49971fc923b3ff48ff4f1e3875b27dd059d Mon Sep 17 00:00:00 2001 From: Julian Wachter Date: Mon, 29 Jun 2026 12:14:22 +0200 Subject: [PATCH 30/35] docs --- knowledge.yaml | 33 +++++++++++++++++++++++---------- llms-full.txt | 24 +++++++++++++++++------- 2 files changed, 40 insertions(+), 17 deletions(-) diff --git a/knowledge.yaml b/knowledge.yaml index 8f509ae..47eb9c7 100644 --- a/knowledge.yaml +++ b/knowledge.yaml @@ -597,11 +597,17 @@ helperTypes: required: false doc: EventPullTime is required when type=eventPullTime. - name: EventPullTimeSignalConfig - doc: EventPullTimeSignalConfig configures the eventPullTime signal type. The referenced query must be a Loki query. + doc: EventPullTimeSignalConfig configures the eventPullTime signal type. The referenced query must be a Loki query. Pull duration and image size are extracted from the same Pulled events; metric selects which one to rank on. fields: + - name: Metric + json: metric + type: EventMetric + required: false + default: pullTime + doc: Metric selects which per-image quantity to aggregate. Defaults to pullTime, which correlates strongly with cold-start cost. Use imageSize to rank by bytes. - name: Statistic json: statistic - type: EventPullTimeStatistic + type: EventStatistic required: true enum: - p50 @@ -610,15 +616,13 @@ helperTypes: - avg - max - count - - failureCount - - cacheHitCount - doc: Statistic selects which pull-time metric to compute. + doc: Statistic selects how the metric's samples are aggregated per image. - name: IncludeCacheHits json: includeCacheHits type: bool required: true default: "false" - doc: IncludeCacheHits controls whether "already present on machine" events are included in cold-pull duration statistics. Set to false to exclude cache hits. + doc: IncludeCacheHits controls whether "already present on machine" events are included in cold-pull duration statistics. Set to false to exclude cache hits. Only applies when metric=pullTime. - name: DurationMode json: durationMode type: DurationMode @@ -626,7 +630,7 @@ helperTypes: enum: - eventPair - messageDuration - doc: DurationMode controls how pull duration is extracted from event records. + doc: DurationMode controls how pull duration is extracted from event records. Only applies when metric=pullTime. - name: ImageEntry doc: ImageEntry defines a single image to include in a set. fields: @@ -1324,7 +1328,7 @@ samples: | endpoint: "http://loki.e2e-infra.svc.cluster.local:3100" queryType: range lookback: 24h - query: '{job="kubelet",drop_e2e="true"}' + query: '{job="kubernetes-events",drop_e2e="true"}' parser: type: kubernetesEvents signals: @@ -1339,7 +1343,16 @@ samples: | query: image-pull-events type: eventPullTime eventPullTime: - statistic: failureCount + metric: failure + statistic: count + durationMode: messageDuration + includeCacheHits: false + - name: avg-image-size + query: image-pull-events + type: eventPullTime + eventPullTime: + metric: imageSize + statistic: avg durationMode: messageDuration includeCacheHits: false ranking: @@ -1399,7 +1412,7 @@ samples: | endpoint: "http://loki.e2e-infra.svc.cluster.local:3100" queryType: range lookback: 24h - query: '{job="kubelet",drop_e2e="true"}' + query: '{job="kubernetes-events",drop_e2e="true"}' parser: type: kubernetesEvents signals: diff --git a/llms-full.txt b/llms-full.txt index f999172..dbbf428 100644 --- a/llms-full.txt +++ b/llms-full.txt @@ -230,13 +230,14 @@ DiscoverySignal defines a named per-image metric derived from a single query. ### EventPullTimeSignalConfig -EventPullTimeSignalConfig configures the eventPullTime signal type. The referenced query must be a Loki query. +EventPullTimeSignalConfig configures the eventPullTime signal type. The referenced query must be a Loki query. Pull duration and image size are extracted from the same Pulled events; metric selects which one to rank on. | Field | JSON | Type | Required | Default | Description | |-------|------|------|----------|---------|-------------| -| Statistic | `statistic` | `EventPullTimeStatistic` | ✓ | | Statistic selects which pull-time metric to compute. Enum: `p50`,`p90`,`p95`,`avg`,`max`,`count`,`failureCount`,`cacheHitCount` | -| IncludeCacheHits | `includeCacheHits` | `bool` | ✓ | `false` | IncludeCacheHits controls whether "already present on machine" events are included in cold-pull duration statistics. Set to false to exclude cache hits. | -| DurationMode | `durationMode` | `DurationMode` | ✓ | | DurationMode controls how pull duration is extracted from event records. Enum: `eventPair`,`messageDuration` | +| Metric | `metric` | `EventMetric` | — | `pullTime` | Metric selects which per-image quantity to aggregate. Defaults to pullTime, which correlates strongly with cold-start cost. Use imageSize to rank by bytes. | +| Statistic | `statistic` | `EventStatistic` | ✓ | | Statistic selects how the metric's samples are aggregated per image. Enum: `p50`,`p90`,`p95`,`avg`,`max`,`count` | +| IncludeCacheHits | `includeCacheHits` | `bool` | ✓ | `false` | IncludeCacheHits controls whether "already present on machine" events are included in cold-pull duration statistics. Set to false to exclude cache hits. Only applies when metric=pullTime. | +| DurationMode | `durationMode` | `DurationMode` | ✓ | | DurationMode controls how pull duration is extracted from event records. Only applies when metric=pullTime. Enum: `eventPair`,`messageDuration` | ### ImageEntry @@ -655,7 +656,7 @@ spec: endpoint: "http://loki.e2e-infra.svc.cluster.local:3100" queryType: range lookback: 24h - query: '{job="kubelet",drop_e2e="true"}' + query: '{job="kubernetes-events",drop_e2e="true"}' parser: type: kubernetesEvents signals: @@ -670,7 +671,16 @@ spec: query: image-pull-events type: eventPullTime eventPullTime: - statistic: failureCount + metric: failure + statistic: count + durationMode: messageDuration + includeCacheHits: false + - name: avg-image-size + query: image-pull-events + type: eventPullTime + eventPullTime: + metric: imageSize + statistic: avg durationMode: messageDuration includeCacheHits: false ranking: @@ -730,7 +740,7 @@ spec: endpoint: "http://loki.e2e-infra.svc.cluster.local:3100" queryType: range lookback: 24h - query: '{job="kubelet",drop_e2e="true"}' + query: '{job="kubernetes-events",drop_e2e="true"}' parser: type: kubernetesEvents signals: From 22ee5e39707dc3ddae6da0609221033bf7ca37e2 Mon Sep 17 00:00:00 2001 From: Julian Wachter Date: Mon, 29 Jun 2026 12:14:29 +0200 Subject: [PATCH 31/35] tilt --- Tiltfile | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/Tiltfile b/Tiltfile index 77265e1..5377830 100644 --- a/Tiltfile +++ b/Tiltfile @@ -101,10 +101,14 @@ local_resource( k8s_yaml('hack/e2e-infra/seed-registry-job.yaml') k8s_resource('seed-registry', labels=['infra'], resource_deps=['registry-mirror']) -# Seed Loki with image-pull events +# Seed Loki with image-pull events (Alloy-style JSON structure) k8s_yaml('hack/e2e-infra/seed-loki-job.yaml') k8s_resource('seed-loki', labels=['infra'], resource_deps=['loki']) +# Alloy: tail real Kubernetes events into Loki (drop_e2e=true) +k8s_yaml('hack/e2e-infra/alloy.yaml') +k8s_resource('alloy', objects=['alloy:serviceaccount', 'alloy-events:clusterrole', 'alloy-events:clusterrolebinding', 'alloy-config:configmap'], labels=['infra'], resource_deps=['loki']) + # --- Grafana with Drop dashboard --- # Create dashboard ConfigMap from the shipped JSON, then apply grafana manifests. dashboard_json = str(read_file('charts/drop/dashboards/drop-operator.json')) From e8154c8b64e21a0a440cba6037781b51affaef65 Mon Sep 17 00:00:00 2001 From: Julian Wachter Date: Mon, 29 Jun 2026 23:06:37 +0200 Subject: [PATCH 32/35] registry discovery / docs --- README.md | 32 +-- api/v1alpha1/discoverypolicy_types.go | 30 ++- .../drop.corewire.io_discoverypolicies.yaml | 34 ++- docs/content/docs/discovery.md | 97 ++++--- .../content/docs/reference/_generated_crds.md | 3 +- docs/static/llms-full.txt | 3 +- go.mod | 2 +- hack/e2e-infra/seed-registry-job.yaml | 7 +- internal/discovery/engine.go | 90 ++++++- internal/discovery/engine_test.go | 90 +++++++ internal/discovery/registry.go | 239 +++++++++++++++--- internal/discovery/registry_test.go | 131 +++++++++- knowledge.yaml | 7 +- llms-full.txt | 3 +- .../01-discoverypolicy.yaml | 21 ++ .../02-assert-discovery-status.yaml | 16 ++ .../chainsaw-test.yaml | 27 ++ 17 files changed, 716 insertions(+), 116 deletions(-) create mode 100644 test/e2e/discovery-registry-gitlab/01-discoverypolicy.yaml create mode 100644 test/e2e/discovery-registry-gitlab/02-assert-discovery-status.yaml create mode 100644 test/e2e/discovery-registry-gitlab/chainsaw-test.yaml diff --git a/README.md b/README.md index 34f89b7..d3aaaa9 100644 --- a/README.md +++ b/README.md @@ -9,7 +9,7 @@

-A Kubernetes operator that pre-pulls container images onto nodes — safely, with pacing, and with automatic discovery. +A Kubernetes operator that pre-pulls container images onto nodes — safely, with pacing, and with automatic discovery. ## Why @@ -362,7 +362,11 @@ spec: tag: "3.19" ``` -### Use case: discover and cache application tags from a registry +### Use case: discover and cache GitLab runner helper images from a registry + +GitLab runner helper tags carry an arch/flavor prefix (e.g. `x86_64-v17.5.0`). +Drop extracts the embedded version automatically; `versionPattern` is shown for +clarity but is optional here. ```yaml apiVersion: v1 @@ -387,29 +391,21 @@ spec: type: registry registry: # Registry base URL - url: https://registry.example.com + url: https://registry.gitlab.com # Repositories to list tags from repositories: - - team/frontend - - team/backend - - team/worker - # Only discover semver tags (regex on tag name) - tagFilter: "^v[0-9]+\\." - # Keep only the last 3 matching tags returned by the registry + - gitlab-org/gitlab-runner/gitlab-runner-helper + # Only discover x86_64 semver tags (regex on tag name) + tagFilter: "^x86_64-v[0-9]+\\." + # Optional: pin where the version lives in the tag (capture group 1) + versionPattern: "x86_64-v(.+)" + # Keep only the 3 newest matching tags (newest first) topX: 3 # Optional: Secret in the Drop pod namespace (default: drop-system) # Supported keys: token, username, password, ca.crt, tls.crt, tls.key, headers. secretRef: name: registry-api-creds - signals: - - name: recent-tag-count - query: registry-tags - type: aggregate - aggregate: - method: count - ranking: - strategy: signal - signal: recent-tag-count + # No signals/ranking needed: registry tags are returned newest-first. --- apiVersion: drop.corewire.io/v1alpha1 kind: CachedImageSet diff --git a/api/v1alpha1/discoverypolicy_types.go b/api/v1alpha1/discoverypolicy_types.go index f720503..63d31ea 100644 --- a/api/v1alpha1/discoverypolicy_types.go +++ b/api/v1alpha1/discoverypolicy_types.go @@ -96,12 +96,38 @@ type DiscoveryRegistryQuery struct { // Example: "^v[0-9]+\\." (semver tags only), "^main-" (main branch builds) // +optional TagFilter string `json:"tagFilter,omitempty"` + // TagSeek is a pagination cursor passed to the registry as the `last` query + // parameter. The registry lists tags lexically after this value, letting you + // skip large numbers of irrelevant earlier tags without fetching them. It is + // not a real tag name — any string works. + // Example: "x86_64-u~" jumps straight to the "x86_64-v*" tags on a repo with + // tens of thousands of digest tags (GitLab runner helper). + // +optional + TagSeek string `json:"tagSeek,omitempty"` // TopX limits the number of tags kept per repository after tagFilter is applied. - // The registry API does not guarantee ordering; Drop keeps the last N tags returned by the registry. - // Example: 3 (keep the last 3 matching tags returned per repo) + // Tags are sorted newest-first (by version) before this cap is applied, so the + // newest N tags are kept. + // Example: 3 (keep the 3 newest matching tags per repo) // +optional // +kubebuilder:validation:Minimum=1 TopX int32 `json:"topX,omitempty"` + // MaxScan caps how many tags are fetched per repository before filtering. + // Registries can hold tens of thousands of tags; this bounds the work. Pair + // it with tagSeek to fetch only the relevant range. Defaults to 1000 when unset. + // Example: 500 + // +optional + // +kubebuilder:validation:Minimum=1 + MaxScan int32 `json:"maxScan,omitempty"` + // VersionPattern is a regex with a single capture group that extracts the + // version substring from each tag for newest-first sorting. Use it when tags + // carry a prefix/suffix around the version, e.g. GitLab runner helper tags + // like "x86_64-v17.5.0" (pattern "x86_64-v(.+)"). + // When unset, Drop tries a strict semver parse, then falls back to extracting + // an embedded semver substring. Tags with no parseable version keep registry + // push order and sort after versioned tags. + // Example: "x86_64-v(.+)" + // +optional + VersionPattern string `json:"versionPattern,omitempty"` // ImageTemplate is a Go text/template for constructing the full image reference from discovered tags. // Available variables: {{.Registry}}, {{.Repository}}, {{.Tag}} // Default (when unset): "{{.Registry}}/{{.Repository}}:{{.Tag}}" diff --git a/config/crd/bases/drop.corewire.io_discoverypolicies.yaml b/config/crd/bases/drop.corewire.io_discoverypolicies.yaml index 09719bf..a361854 100644 --- a/config/crd/bases/drop.corewire.io_discoverypolicies.yaml +++ b/config/crd/bases/drop.corewire.io_discoverypolicies.yaml @@ -201,6 +201,15 @@ spec: Default (when unset): "{{.Registry}}/{{.Repository}}:{{.Tag}}" Example: "registry.example.com/{{.Repository}}:{{.Tag}}" type: string + maxScan: + description: |- + MaxScan caps how many tags are fetched per repository before filtering. + Registries can hold tens of thousands of tags; this bounds the work. Pair + it with tagSeek to fetch only the relevant range. Defaults to 1000 when unset. + Example: 500 + format: int32 + minimum: 1 + type: integer repositories: description: |- Repositories is the list of repository paths to list tags from. @@ -214,11 +223,21 @@ spec: TagFilter is a regex applied to tag names. Only matching tags are discovered. Example: "^v[0-9]+\\." (semver tags only), "^main-" (main branch builds) type: string + tagSeek: + description: |- + TagSeek is a pagination cursor passed to the registry as the `last` query + parameter. The registry lists tags lexically after this value, letting you + skip large numbers of irrelevant earlier tags without fetching them. It is + not a real tag name — any string works. + Example: "x86_64-u~" jumps straight to the "x86_64-v*" tags on a repo with + tens of thousands of digest tags (GitLab runner helper). + type: string topX: description: |- TopX limits the number of tags kept per repository after tagFilter is applied. - The registry API does not guarantee ordering; Drop keeps the last N tags returned by the registry. - Example: 3 (keep the last 3 matching tags returned per repo) + Tags are sorted newest-first (by version) before this cap is applied, so the + newest N tags are kept. + Example: 3 (keep the 3 newest matching tags per repo) format: int32 minimum: 1 type: integer @@ -228,6 +247,17 @@ spec: Example: "https://registry.example.com", "https://ghcr.io" minLength: 1 type: string + versionPattern: + description: |- + VersionPattern is a regex with a single capture group that extracts the + version substring from each tag for newest-first sorting. Use it when tags + carry a prefix/suffix around the version, e.g. GitLab runner helper tags + like "x86_64-v17.5.0" (pattern "x86_64-v(.+)"). + When unset, Drop tries a strict semver parse, then falls back to extracting + an embedded semver substring. Tags with no parseable version keep registry + push order and sort after versioned tags. + Example: "x86_64-v(.+)" + type: string required: - repositories - url diff --git a/docs/content/docs/discovery.md b/docs/content/docs/discovery.md index 49e4dba..ed10c41 100644 --- a/docs/content/docs/discovery.md +++ b/docs/content/docs/discovery.md @@ -161,6 +161,7 @@ spec: query: image-pull-events type: eventPullTime eventPullTime: + metric: pullTime # default; aggregates pull duration samples statistic: avg # mean pull duration per image includeCacheHits: false # only count true cold pulls durationMode: eventPair # pair Pulling→Pulled events to get the duration @@ -233,41 +234,42 @@ spec: - name: registry-tags type: registry registry: - url: https://registry.example.com + url: https://registry.gitlab.com repositories: # repos to enumerate tags from - - team/frontend - - team/backend - tagFilter: "^v[0-9]+\\." # only tags starting v1. / v2. ... - topX: 3 # keep the last 3 matching tags returned per repo + - gitlab-org/gitlab-runner/gitlab-runner-helper + tagFilter: "^x86_64-v[0-9]+\\." # only x86_64-v1. / x86_64-v2. ... + versionPattern: "x86_64-v(.+)" # capture group 1 is the version + topX: 3 # keep the 3 newest matching tags per repo imageTemplate: "{{.Registry}}/{{.Repository}}:{{.Tag}}" # built image ref secretRef: name: registry-api-creds # registry auth Secret in the operator namespace - signals: - - name: recent-tag-count - query: registry-tags - type: aggregate - aggregate: - method: count # rank by how many recent tags exist - ranking: - strategy: signal - signal: recent-tag-count ``` +No `signals` or `ranking` are needed: registry queries already return their +tags newest-first, so the discovered images come out pre-ranked. + How it's used: registry discovery lists tags per repository via -`/v2//tags/list`, applies `tagFilter`, keeps `topX`, then renders full -image references via `imageTemplate`. +`/v2//tags/list`, applies `tagFilter`, sorts newest-first, keeps `topX`, +then renders full image references via `imageTemplate`. Important behavior notes: - `tagFilter` is regex on tag names. Anchor explicitly (`^...$`) when needed. -- `topX` keeps the last `N` matching tags in registry response order. It is not - true semver/date recency unless your registry already returns that order. +- Tags are sorted by version descending (newest first). Strict semver tags work + out of the box; prefixed/suffixed tags (e.g. GitLab runner helper + `x86_64-v17.5.0`) are handled by extracting an embedded semver substring. + Tags with no parseable version fall back to registry push order. `topX` then + keeps the newest N. +- `versionPattern` (optional) is a regex with one capture group that pins where + the version lives in the tag, e.g. `x86_64-v(.+)` for GitLab helper images. + Use it when the default extraction picks the wrong number. - `imageTemplate` variables: `{{.Registry}}`, `{{.Repository}}`, `{{.Tag}}`. Default: `{{.Registry}}/{{.Repository}}:{{.Tag}}`. Signal fit: -- Great with `aggregate`/`timeWeightedAggregate`/`windowAggregate` (counts and - derived scores from discovered tag entries). -- Not compatible with `eventPullTime` (which requires pull event records). +- Registry queries are self-ranking; `signals`/`ranking` are optional and + ignored for ordering. Aggregation signals are a no-op (one sample per tag). +- Not compatible with `timeWeightedAggregate`/`windowAggregate`/`eventPullTime` + (tag snapshots are not time series). #### What happens to our query @@ -277,18 +279,19 @@ For each repository, the controller calls `/v2//tags/list`, then applies Example registry payload: ```json -{"name":"team/frontend","tags":["v1.10.0","v1.11.0","dev-123","v1.12.0","v1.13.0"]} +{"name":"gitlab-org/gitlab-runner/gitlab-runner-helper","tags":["x86_64-v17.3.0","x86_64-v17.4.0","x86_64-latest","x86_64-v17.5.0","x86_64-v17.10.0"]} ``` -With `tagFilter: "^v[0-9]+\\."` and `topX: 3`, the kept tags are: +With `tagFilter: "^x86_64-v[0-9]+\\."`, `versionPattern: "x86_64-v(.+)"`, and +`topX: 3`, the newest kept tags are: | Repository | Matching tags | Kept (`topX=3`) | Rendered images | |-----------|----------------|-----------------|-----------------| -| `team/frontend` | `v1.10.0`, `v1.11.0`, `v1.12.0`, `v1.13.0` | `v1.11.0`, `v1.12.0`, `v1.13.0` | `registry.example.com/team/frontend:v1.11.0` ... `:v1.13.0` | -| `team/backend` | `v2.3.0`, `v2.4.0` | `v2.3.0`, `v2.4.0` | `registry.example.com/team/backend:v2.3.0`, `:v2.4.0` | +| `gitlab-org/gitlab-runner/gitlab-runner-helper` | `x86_64-v17.3.0`, `x86_64-v17.4.0`, `x86_64-v17.5.0`, `x86_64-v17.10.0` | `x86_64-v17.10.0`, `x86_64-v17.5.0`, `x86_64-v17.4.0` | `registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-v17.10.0` ... `:x86_64-v17.4.0` | -An `aggregate` signal with `method: count` then ranks by how many retained tags -each repository contributed. +Note `x86_64-v17.10.0` correctly ranks above `x86_64-v17.5.0` (version-aware, +not lexical), and the non-versioned `x86_64-latest` tag is excluded by +`tagFilter`. Images come out newest-first, so no ranking is required. ### Auth / TLS @@ -331,17 +334,19 @@ A signal derives a named per-image value from exactly one query. The four types | `aggregate` | One value over all samples | `method`: sum/max/avg/count/min | | `timeWeightedAggregate` | Weighted sum by hour-of-day | `windows`, `weight`, `timezone` | | `windowAggregate` | One sub-window only | `relativeWindow` or `window` start/end | -| `eventPullTime` | Pull-time statistic | `statistic`: p50/p90/p95/avg/max | +| `eventPullTime` | Event metric statistic | `metric`: pullTime/imageSize/failure/cacheHit, `statistic`: p50/p90/p95/avg/max/count | Signal × source compatibility: | Signal type | Prometheus | Loki | Registry | |-------------|------------|------|----------| -| `aggregate` | yes | yes | yes | -| `timeWeightedAggregate` | yes | yes | yes | -| `windowAggregate` | yes | yes | yes | +| `aggregate` | yes | yes | no-op | +| `timeWeightedAggregate` | yes | yes | no | +| `windowAggregate` | yes | yes | no | | `eventPullTime` | no | yes (`kubernetesEvents`) | no | +Registry queries return tag snapshots, not time series, so time-windowed signals are intentionally rejected. They are already self-ranked newest-first, so `aggregate` adds nothing and signals/ranking can be omitted entirely. + All Prometheus examples below run on this 48h dataset (sampled every 6h, both days identical): | Series | 00 | 06 | 12 | 18 | sum/day | 48h total | @@ -560,11 +565,15 @@ This signal ignores the 48h volume dataset — it reads Loki pull durations inst | `avg` | mean pull | 730 | 1830 | overall cost (skewed by outliers) | | `max` | slowest pull | 730 | 4100 | absolute worst pull | | `count` | cold-pull events | 1 | 3 | how often pulled cold | -| `failureCount` | pull failures | 0 | 0 | flaky / broken images | -| `cacheHitCount` | already-present hits | 0 | 0 | nodes already warm | Two extra knobs: `includeCacheHits` (default `false`) adds "already present" events to duration stats; `durationMode` is `eventPair` (Pulled−Pulling timestamps) or `messageDuration` (parse "in 42.3s" from the message). +`eventPullTime` now uses `metric + statistic`: +- `metric: pullTime` (default) with `statistic: p50|p90|p95|avg|max|count` +- `metric: imageSize` with `statistic: p50|p90|p95|avg|max|count` (bytes from `Image size: N bytes`) +- `metric: failure` with `statistic: count` +- `metric: cacheHit` with `statistic: count` + ```yaml apiVersion: drop.corewire.io/v1alpha1 kind: DiscoveryPolicy @@ -595,7 +604,8 @@ spec: query: image-pull-events type: eventPullTime eventPullTime: - statistic: avg # p50 | p90 | p95 | avg | max | count | failureCount | cacheHitCount + metric: pullTime # pullTime (default) | imageSize | failure | cacheHit + statistic: avg # p50 | p90 | p95 | avg | max | count includeCacheHits: false # ignore already-cached pulls in latency stats durationMode: eventPair # eventPair (Pulling→Pulled) | messageDuration parsing ranking: @@ -603,6 +613,23 @@ spec: signal: avg-cold-pull-time ``` +Rank by image size (bytes) from the same Pulled events: + +```yaml +signals: + - name: avg-image-size + query: image-pull-events + type: eventPullTime + eventPullTime: + metric: imageSize + statistic: avg + durationMode: messageDuration + +ranking: + strategy: signal + signal: avg-image-size +``` + ## Stage 3 — Ranking Exactly one ranking strategy per policy. @@ -757,6 +784,7 @@ spec: query: image-pull-events type: eventPullTime eventPullTime: + metric: pullTime statistic: avg includeCacheHits: false durationMode: eventPair @@ -1002,6 +1030,7 @@ spec: query: image-pull-events type: eventPullTime eventPullTime: + metric: pullTime statistic: avg # mean latency signal; use p95 if you need tail sensitivity includeCacheHits: false durationMode: eventPair diff --git a/docs/content/docs/reference/_generated_crds.md b/docs/content/docs/reference/_generated_crds.md index 27fb130..2eb0797 100644 --- a/docs/content/docs/reference/_generated_crds.md +++ b/docs/content/docs/reference/_generated_crds.md @@ -237,7 +237,8 @@ DiscoveryRegistryQuery defines OCI registry tag listing configuration for image | `url` | `string` | Yes | — | URL is the registry base URL (without repository path). Example: "https://registry.example.com", "https://ghcr.io" | | `repositories` | `[]string` | Yes | — | Repositories is the list of repository paths to list tags from. Example: ["team/app", "team/worker", "infra/tools"] | | `tagFilter` | `string` | No | — | TagFilter is a regex applied to tag names. Only matching tags are discovered. Example: "^v[0-9]+\\." (semver tags only), "^main-" (main branch builds) | -| `topX` | `int32` | No | — | TopX limits the number of tags kept per repository after tagFilter is applied. The registry API does not guarantee ordering; Drop keeps the last N tags returned by the registry. Example: 3 (keep the last 3 matching tags returned per repo) | +| `topX` | `int32` | No | — | TopX limits the number of tags kept per repository after tagFilter is applied. Tags are sorted newest-first (by version) before this cap is applied, so the newest N tags are kept. Example: 3 (keep the 3 newest matching tags per repo) | +| `versionPattern` | `string` | No | — | VersionPattern is a regex with a single capture group that extracts the version substring from each tag for newest-first sorting. Use it when tags carry a prefix/suffix around the version, e.g. GitLab runner helper tags like "x86_64-v17.5.0" (pattern "x86_64-v(.+)"). When unset, Drop tries a strict semver parse, then falls back to extracting an embedded semver substring. Tags with no parseable version keep registry push order and sort after versioned tags. Example: "x86_64-v(.+)" | | `imageTemplate` | `string` | No | — | ImageTemplate is a Go text/template for constructing the full image reference from discovered tags. Available variables: {{.Registry}}, {{.Repository}}, {{.Tag}} Default (when unset): "{{.Registry}}/{{.Repository}}:{{.Tag}}" Example: "registry.example.com/{{.Repository}}:{{.Tag}}" | ### DiscoverySignal diff --git a/docs/static/llms-full.txt b/docs/static/llms-full.txt index dbbf428..77a4a7a 100644 --- a/docs/static/llms-full.txt +++ b/docs/static/llms-full.txt @@ -211,7 +211,8 @@ DiscoveryRegistryQuery defines OCI registry tag listing configuration for image | URL | `url` | `string` | ✓ | | URL is the registry base URL (without repository path). Example: "https://registry.example.com", "https://ghcr.io" | | Repositories | `repositories` | `[]string` | ✓ | | Repositories is the list of repository paths to list tags from. Example: ["team/app", "team/worker", "infra/tools"] | | TagFilter | `tagFilter` | `string` | — | | TagFilter is a regex applied to tag names. Only matching tags are discovered. Example: "^v[0-9]+\\." (semver tags only), "^main-" (main branch builds) | -| TopX | `topX` | `int32` | — | | TopX limits the number of tags kept per repository after tagFilter is applied. The registry API does not guarantee ordering; Drop keeps the last N tags returned by the registry. Example: 3 (keep the last 3 matching tags returned per repo) | +| TopX | `topX` | `int32` | — | | TopX limits the number of tags kept per repository after tagFilter is applied. Tags are sorted newest-first (by version) before this cap is applied, so the newest N tags are kept. Example: 3 (keep the 3 newest matching tags per repo) | +| VersionPattern | `versionPattern` | `string` | — | | VersionPattern is a regex with a single capture group that extracts the version substring from each tag for newest-first sorting. Use it when tags carry a prefix/suffix around the version, e.g. GitLab runner helper tags like "x86_64-v17.5.0" (pattern "x86_64-v(.+)"). When unset, Drop tries a strict semver parse, then falls back to extracting an embedded semver substring. Tags with no parseable version keep registry push order and sort after versioned tags. Example: "x86_64-v(.+)" | | ImageTemplate | `imageTemplate` | `string` | — | | ImageTemplate is a Go text/template for constructing the full image reference from discovered tags. Available variables: {{.Registry}}, {{.Repository}}, {{.Tag}} Default (when unset): "{{.Registry}}/{{.Repository}}:{{.Tag}}" Example: "registry.example.com/{{.Repository}}:{{.Tag}}" | ### DiscoverySignal diff --git a/go.mod b/go.mod index bacb242..d4b951e 100644 --- a/go.mod +++ b/go.mod @@ -5,6 +5,7 @@ go 1.26.0 godebug default=go1.26 require ( + github.com/Masterminds/semver/v3 v3.4.0 github.com/onsi/ginkgo/v2 v2.29.0 github.com/onsi/gomega v1.41.0 github.com/prometheus/client_golang v1.23.2 @@ -18,7 +19,6 @@ require ( require ( cel.dev/expr v0.25.1 // indirect - github.com/Masterminds/semver/v3 v3.4.0 // indirect github.com/antlr4-go/antlr/v4 v4.13.0 // indirect github.com/beorn7/perks v1.0.1 // indirect github.com/blang/semver/v4 v4.0.0 // indirect diff --git a/hack/e2e-infra/seed-registry-job.yaml b/hack/e2e-infra/seed-registry-job.yaml index a833e50..4911c9f 100644 --- a/hack/e2e-infra/seed-registry-job.yaml +++ b/hack/e2e-infra/seed-registry-job.yaml @@ -45,6 +45,11 @@ spec: test/myapp:v1|test/tools:v1 test/myapp:v1|test/tools:v2 test/myapp:v1|test/tools:v3 + test/myapp:v1|test/gitlab-runner-helper:x86_64-v18.5.0 + test/myapp:v1|test/gitlab-runner-helper:x86_64-v18.10.0 + test/myapp:v1|test/gitlab-runner-helper:x86_64-v19.0.0 + test/myapp:v1|test/gitlab-runner-helper:x86_64-latest + test/myapp:v1|test/gitlab-runner-helper:3.18-arm-v17.8.0 " for ENTRY in $TAGS; do @@ -56,7 +61,7 @@ spec: echo "" echo "Verifying tags..." - for REPO in "test/myapp" "test/worker" "test/tools"; do + for REPO in "test/myapp" "test/worker" "test/tools" "test/gitlab-runner-helper"; do TAGS=$(regctl tag ls "${REGISTRY}/${REPO}" 2>/dev/null || echo "FAILED") echo " ${REPO}: ${TAGS}" done diff --git a/internal/discovery/engine.go b/internal/discovery/engine.go index 19389a3..9943e38 100644 --- a/internal/discovery/engine.go +++ b/internal/discovery/engine.go @@ -90,6 +90,10 @@ func ExecutePipeline( // Stage 2 — Derive signals // ────────────────────────────────────────────────────────── signalValues := make(map[string]map[string]float64, len(spec.Signals)) + qResultIndex := make(map[string]int, len(qResults)) + for i := range qResults { + qResultIndex[qResults[i].Name] = i + } for _, sig := range spec.Signals { raw, ok := rawByQuery[sig.Query] @@ -97,6 +101,24 @@ func ExecutePipeline( continue } + if !isSignalCompatibleWithQueryType(sig.Type, raw.QueryType) { + if idx, found := qResultIndex[sig.Query]; found { + msg := fmt.Sprintf( + "signal %q type=%s is not compatible with query %q type=%s", + sig.Name, sig.Type, sig.Query, raw.QueryType, + ) + if qResults[idx].Status == dropv1alpha1.QueryResultStatusSuccess { + qResults[idx].Status = dropv1alpha1.QueryResultStatusFailed + qResults[idx].Message = msg + } else if qResults[idx].Message == "" { + qResults[idx].Message = msg + } else { + qResults[idx].Message = qResults[idx].Message + "; " + msg + } + } + continue + } + values := deriveSignal(sig, raw) if values != nil { signalValues[sig.Name] = values @@ -122,7 +144,7 @@ func ExecutePipeline( } } - discovered := rankImages(spec.Ranking, signalValues, allImages) + discovered := rankImages(spec.Ranking, signalValues, allImages, defaultScores(rawByQuery)) // Apply maxImages cap; mark selected maxImages := int(spec.MaxImages) @@ -139,6 +161,21 @@ func ExecutePipeline( } } +// isSignalCompatibleWithQueryType enforces meaningful signal/query combinations. +func isSignalCompatibleWithQueryType(sigType dropv1alpha1.SignalType, qType dropv1alpha1.DiscoveryQueryType) bool { + switch sigType { + case dropv1alpha1.SignalTypeAggregate: + return true + case dropv1alpha1.SignalTypeTimeWeightedAggregate, dropv1alpha1.SignalTypeWindowAggregate: + // Registry queries fetch tag snapshots, not time series. + return qType != dropv1alpha1.DiscoveryQueryTypeRegistry + case dropv1alpha1.SignalTypeEventPullTime: + return qType == dropv1alpha1.DiscoveryQueryTypeLoki + default: + return false + } +} + // executeQuery fetches raw data for a single DiscoveryQuery. func executeQuery(ctx context.Context, q dropv1alpha1.DiscoveryQuery, httpClient *http.Client) (*QueryRawData, dropv1alpha1.QueryResult) { qr := dropv1alpha1.QueryResult{Name: q.Name, Type: q.Type} @@ -222,7 +259,7 @@ func executePrometheusQuery(ctx context.Context, cfg *dropv1alpha1.DiscoveryProm // executeRegistryQuery lists tags from an OCI registry and returns raw samples. func executeRegistryQuery(ctx context.Context, cfg *dropv1alpha1.DiscoveryRegistryQuery, httpClient *http.Client) (*QueryRawData, error) { - src := NewRegistrySource(cfg.URL, cfg.Repositories, cfg.TagFilter, cfg.TopX, cfg.ImageTemplate, httpClient) + src := NewRegistrySource(cfg.URL, cfg.Repositories, cfg.TagFilter, cfg.TagSeek, cfg.TopX, cfg.MaxScan, cfg.ImageTemplate, cfg.VersionPattern, httpClient) results, err := src.Fetch(ctx) if err != nil { return nil, err @@ -453,12 +490,26 @@ func parseTimeOfDay(hhmm string, ref time.Time) (time.Time, error) { } // rankImages converts per-signal values into an ordered DiscoveredImage slice. -func rankImages(ranking *dropv1alpha1.DiscoveryRanking, signals map[string]map[string]float64, images []string) []dropv1alpha1.DiscoveredImage { - if ranking == nil || len(images) == 0 { - // No ranking configured: return images in alphabetical order with score 0. - out := make([]dropv1alpha1.DiscoveredImage, len(images)) - for i, img := range images { - out[i] = dropv1alpha1.DiscoveredImage{Image: img, Rank: int32(i + 1), FinalScore: "0"} +func rankImages(ranking *dropv1alpha1.DiscoveryRanking, signals map[string]map[string]float64, images []string, fallback map[string]float64) []dropv1alpha1.DiscoveredImage { + if ranking == nil || ranking.Strategy == "" || len(images) == 0 { + // No ranking configured: order by the per-query score (registry source + // already returns its tags newest-first), then alphabetically. This lets + // registry queries work without an explicit signal+ranking dance. + sorted := append([]string(nil), images...) + sort.Slice(sorted, func(i, j int) bool { + si, sj := fallback[sorted[i]], fallback[sorted[j]] + if si != sj { + return si > sj + } + return sorted[i] < sorted[j] + }) + out := make([]dropv1alpha1.DiscoveredImage, len(sorted)) + for i, img := range sorted { + out[i] = dropv1alpha1.DiscoveredImage{ + Image: img, + Rank: int32(i + 1), + FinalScore: strconv.FormatFloat(fallback[img], 'f', -1, 64), + } } return out } @@ -630,6 +681,29 @@ func collectImages(rawByQuery map[string]*QueryRawData) []string { return images } +// defaultScores derives a fallback per-image score used when no ranking is +// configured. Each image is scored by the max value of its non-suffixed +// samples (registry queries store newest-first scores there), so registry +// queries rank correctly without an explicit signal+ranking definition. +func defaultScores(rawByQuery map[string]*QueryRawData) map[string]float64 { + out := make(map[string]float64) + for _, raw := range rawByQuery { + for key, samples := range raw.Samples { + if strings.HasSuffix(key, lokiFailedSuffix) || + strings.HasSuffix(key, lokiCacheHitSuffix) || + strings.HasSuffix(key, lokiSizeBytesSuffix) { + continue + } + for _, s := range samples { + if cur, ok := out[key]; !ok || s.Value > cur { + out[key] = s.Value + } + } + } + } + return out +} + // deriveEventPullTime computes per-image statistics from Loki event samples. // // The samples map is expected to come from a Loki kubernetesEvents query: diff --git a/internal/discovery/engine_test.go b/internal/discovery/engine_test.go index 02ddd93..11bd9fe 100644 --- a/internal/discovery/engine_test.go +++ b/internal/discovery/engine_test.go @@ -123,6 +123,96 @@ func TestExecutePipeline_Registry(t *testing.T) { } } +// TestExecutePipeline_RegistryNoRanking verifies registry queries rank +// newest-first by semver without any signals or ranking configured. +func TestExecutePipeline_RegistryNoRanking(t *testing.T) { + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) { + resp := tagListResponse{Name: "team/app", Tags: []string{"v1.0", "v2.0", "v1.5"}} + w.WriteHeader(http.StatusOK) + _ = json.NewEncoder(w).Encode(resp) + })) + defer srv.Close() + + spec := dropv1alpha1.DiscoveryPolicySpec{ + Queries: []dropv1alpha1.DiscoveryQuery{{ + Name: "tags", + Type: dropv1alpha1.DiscoveryQueryTypeRegistry, + Registry: &dropv1alpha1.DiscoveryRegistryQuery{ + URL: srv.URL, + Repositories: []string{"team/app"}, + TopX: 2, + }, + }}, + MaxImages: 10, + } + + clientFn := func(_ context.Context, _ string) (*http.Client, error) { return srv.Client(), nil } + result := ExecutePipeline(context.Background(), spec, clientFn) + + if len(result.Images) != 2 { + t.Fatalf("expected top 2 images, got %d: %v", len(result.Images), result.Images) + } + host := srv.URL[len("http://"):] + if result.Images[0].Image != host+"/team/app:v2.0" { + t.Errorf("expected v2.0 first, got %s", result.Images[0].Image) + } + if result.Images[1].Image != host+"/team/app:v1.5" { + t.Errorf("expected v1.5 second, got %s", result.Images[1].Image) + } +} + +// TestExecutePipeline_RegistryWindowAggregateIncompatible verifies that +// windowAggregate is rejected for registry queries (tag snapshots are not time series). +func TestExecutePipeline_RegistryWindowAggregateIncompatible(t *testing.T) { + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + resp := tagListResponse{Name: "team/app", Tags: []string{"v1.0", "v1.1"}} + w.WriteHeader(http.StatusOK) + _ = json.NewEncoder(w).Encode(resp) + })) + defer srv.Close() + + window := metav1.Duration{Duration: 2 * time.Hour} + spec := dropv1alpha1.DiscoveryPolicySpec{ + Queries: []dropv1alpha1.DiscoveryQuery{{ + Name: "tags", + Type: dropv1alpha1.DiscoveryQueryTypeRegistry, + Registry: &dropv1alpha1.DiscoveryRegistryQuery{ + URL: srv.URL, + Repositories: []string{"team/app"}, + }, + }}, + Signals: []dropv1alpha1.DiscoverySignal{{ + Name: "recent-tags", + Query: "tags", + Type: dropv1alpha1.SignalTypeWindowAggregate, + WindowAggregate: &dropv1alpha1.WindowAggregateSignalConfig{ + Method: dropv1alpha1.AggregationSum, + RelativeWindow: &window, + }, + }}, + Ranking: &dropv1alpha1.DiscoveryRanking{Strategy: dropv1alpha1.RankingStrategySignal, Signal: "recent-tags"}, + MaxImages: 10, + } + + clientFn := func(_ context.Context, _ string) (*http.Client, error) { return srv.Client(), nil } + result := ExecutePipeline(context.Background(), spec, clientFn) + + if len(result.QueryResults) != 1 { + t.Fatalf("expected 1 query result, got %d", len(result.QueryResults)) + } + if result.QueryResults[0].Status != dropv1alpha1.QueryResultStatusFailed { + t.Fatalf("expected failed query result, got %s", result.QueryResults[0].Status) + } + if result.QueryResults[0].Message == "" { + t.Fatalf("expected incompatibility message, got empty") + } + // Registry images still surface via fallback registry-order ranking even + // though the bogus signal is ignored. + if len(result.Images) != 2 { + t.Fatalf("expected 2 registry images via fallback ranking, got %d", len(result.Images)) + } +} + // TestExecutePipeline_WeightedSum verifies weighted sum ranking. func TestExecutePipeline_WeightedSum(t *testing.T) { // Two queries with different image sets diff --git a/internal/discovery/registry.go b/internal/discovery/registry.go index bc303b9..f82917a 100644 --- a/internal/discovery/registry.go +++ b/internal/discovery/registry.go @@ -6,35 +6,45 @@ import ( "fmt" "io" "net/http" + "net/url" "regexp" "sort" + "strconv" "strings" "text/template" "time" + + "github.com/Masterminds/semver/v3" ) // RegistrySource queries OCI registries for image tags. type RegistrySource struct { - URL string - Repositories []string - TagFilter string - TopX int32 - ImageTemplate string - HTTPClient *http.Client + URL string + Repositories []string + TagFilter string + TagSeek string + TopX int32 + MaxScan int32 + ImageTemplate string + VersionPattern string + HTTPClient *http.Client } // NewRegistrySource creates a new registry discovery source. -func NewRegistrySource(url string, repos []string, tagFilter string, topX int32, imageTemplate string, httpClient *http.Client) *RegistrySource { +func NewRegistrySource(url string, repos []string, tagFilter, tagSeek string, topX, maxScan int32, imageTemplate, versionPattern string, httpClient *http.Client) *RegistrySource { if httpClient == nil { httpClient = &http.Client{Timeout: 30 * time.Second} } return &RegistrySource{ - URL: strings.TrimSuffix(url, "/"), - Repositories: repos, - TagFilter: tagFilter, - TopX: topX, - ImageTemplate: imageTemplate, - HTTPClient: httpClient, + URL: strings.TrimSuffix(url, "/"), + Repositories: repos, + TagFilter: tagFilter, + TagSeek: tagSeek, + TopX: topX, + MaxScan: maxScan, + ImageTemplate: imageTemplate, + VersionPattern: versionPattern, + HTTPClient: httpClient, } } @@ -44,6 +54,15 @@ type tagListResponse struct { Tags []string `json:"tags"` } +// tagListPageSize is the number of tags requested per page. Registries cap the +// effective page size (GitLab caps at 100), so this is an upper bound. +const tagListPageSize = 1000 + +// defaultMaxScan bounds how many tags are fetched per repository when MaxScan is +// unset. Registries can hold tens of thousands of tags; pair tagSeek with a +// budget to fetch only the relevant range. +const defaultMaxScan = 1000 + // Fetch queries the registry for tags and returns discovered images. func (rs *RegistrySource) Fetch(ctx context.Context) ([]ImageResult, error) { var allResults []ImageResult @@ -64,32 +83,103 @@ func (rs *RegistrySource) Fetch(ctx context.Context) ([]ImageResult, error) { return allResults, nil } -func (rs *RegistrySource) fetchRepo(ctx context.Context, repo string) ([]ImageResult, error) { - u := fmt.Sprintf("%s/v2/%s/tags/list", rs.URL, repo) +// listTags returns up to MaxScan tags for a repository, following the OCI +// Distribution `Link` header (rel="next") to paginate. Registries do not +// guarantee tag ordering and many (e.g. GitLab) return only a page at a time. +// TagSeek is passed as the `last` cursor so callers can skip irrelevant earlier +// tags without fetching them. +func (rs *RegistrySource) listTags(ctx context.Context, repo string) ([]string, error) { + budget := int(rs.MaxScan) + if budget <= 0 { + budget = defaultMaxScan + } - req, err := http.NewRequestWithContext(ctx, http.MethodGet, u, nil) - if err != nil { - return nil, fmt.Errorf("creating request: %w", err) + q := url.Values{} + q.Set("n", strconv.Itoa(tagListPageSize)) + if rs.TagSeek != "" { + q.Set("last", rs.TagSeek) } + next := fmt.Sprintf("%s/v2/%s/tags/list?%s", rs.URL, repo, q.Encode()) - resp, err := rs.HTTPClient.Do(req) - if err != nil { - return nil, fmt.Errorf("listing tags: %w", err) + var tags []string + for next != "" && len(tags) < budget { + req, err := http.NewRequestWithContext(ctx, http.MethodGet, next, nil) + if err != nil { + return nil, fmt.Errorf("creating request: %w", err) + } + + resp, err := rs.HTTPClient.Do(req) + if err != nil { + return nil, fmt.Errorf("listing tags: %w", err) + } + + if resp.StatusCode != http.StatusOK { + body, _ := io.ReadAll(resp.Body) + _ = resp.Body.Close() + return nil, fmt.Errorf("registry returned status %d: %s", resp.StatusCode, string(body)) + } + + var tagList tagListResponse + if err := json.NewDecoder(resp.Body).Decode(&tagList); err != nil { + _ = resp.Body.Close() + return nil, fmt.Errorf("decoding response: %w", err) + } + linkHeader := resp.Header.Get("Link") + _ = resp.Body.Close() + + tags = append(tags, tagList.Tags...) + next = rs.nextPageURL(linkHeader) } - defer func() { _ = resp.Body.Close() }() - if resp.StatusCode != http.StatusOK { - body, _ := io.ReadAll(resp.Body) - return nil, fmt.Errorf("registry returned status %d: %s", resp.StatusCode, string(body)) + if len(tags) > budget { + tags = tags[:budget] } + return tags, nil +} - var tagList tagListResponse - if err := json.NewDecoder(resp.Body).Decode(&tagList); err != nil { - return nil, fmt.Errorf("decoding response: %w", err) +// nextPageURL parses an RFC 5988 `Link` header and returns the absolute URL of +// the rel="next" page, or "" when there is no next page. The registry returns a +// relative URI which is resolved against the registry base URL. +func (rs *RegistrySource) nextPageURL(linkHeader string) string { + if linkHeader == "" { + return "" + } + for _, part := range strings.Split(linkHeader, ",") { + segs := strings.Split(part, ";") + if len(segs) < 2 { + continue + } + isNext := false + for _, p := range segs[1:] { + if strings.Contains(strings.ToLower(p), `rel="next"`) || strings.Contains(strings.ToLower(p), "rel=next") { + isNext = true + break + } + } + if !isNext { + continue + } + raw := strings.TrimSpace(segs[0]) + raw = strings.TrimPrefix(raw, "<") + raw = strings.TrimSuffix(raw, ">") + if raw == "" { + return "" + } + if strings.HasPrefix(raw, "http://") || strings.HasPrefix(raw, "https://") { + return raw + } + return rs.URL + raw + } + return "" +} + +func (rs *RegistrySource) fetchRepo(ctx context.Context, repo string) ([]ImageResult, error) { + tags, err := rs.listTags(ctx, repo) + if err != nil { + return nil, err } // Filter tags - tags := tagList.Tags if rs.TagFilter != "" { re, err := regexp.Compile(rs.TagFilter) if err != nil { @@ -104,15 +194,24 @@ func (rs *RegistrySource) fetchRepo(ctx context.Context, repo string) ([]ImageRe tags = filtered } - // Limit to topX by keeping the last N tags in the slice returned by the registry. - // The OCI Distribution Spec does not define tag ordering, so this is best-effort: - // many registries return tags in push order (oldest first, newest last), which - // means we naturally keep the most recently pushed tags. + // Sort newest-first. Tags carrying a (possibly prefixed) version are ordered + // by version desc; tags with no parseable version fall back to push order. + var versionRe *regexp.Regexp + if rs.VersionPattern != "" { + re, err := regexp.Compile(rs.VersionPattern) + if err != nil { + return nil, fmt.Errorf("compiling version pattern: %w", err) + } + versionRe = re + } + tags = sortTagsNewestFirst(tags, versionRe) + + // Limit to topX by keeping the first N tags (newest). if rs.TopX > 0 && int32(len(tags)) > rs.TopX { - tags = tags[len(tags)-int(rs.TopX):] + tags = tags[:rs.TopX] } - // Build image refs + // Build image refs. Higher score = newer (index 0 is newest). results := make([]ImageResult, 0, len(tags)) for i, tag := range tags { imageRef, err := rs.buildImageRef(repo, tag) @@ -121,13 +220,81 @@ func (rs *RegistrySource) fetchRepo(ctx context.Context, repo string) ([]ImageRe } results = append(results, ImageResult{ Image: imageRef, - Score: int64(i + 1), // Higher index = more recent + Score: int64(len(tags) - i), }) } return results, nil } +// reEmbeddedSemver extracts a semver-ish version from anywhere inside a tag, +// e.g. "x86_64-v17.5.0" -> "17.5.0". This handles arch/flavor-prefixed tags +// like GitLab runner helper images (x86_64-v17.5.0, ubuntu-x86_64-v16.11.0). +var reEmbeddedSemver = regexp.MustCompile(`(\d+)\.(\d+)(?:\.(\d+))?(?:[-+][0-9A-Za-z.-]+)?`) + +// parseTagVersion tries to interpret a tag as a version. When versionRe is set, +// its first capture group is used as the version substring. Otherwise it +// attempts a strict semver parse, then falls back to extracting an embedded +// semver substring. Returns nil when no version can be found. +func parseTagVersion(tag string, versionRe *regexp.Regexp) *semver.Version { + if versionRe != nil { + m := versionRe.FindStringSubmatch(tag) + if len(m) >= 2 { + if v, err := semver.NewVersion(m[1]); err == nil { + return v + } + } + return nil + } + if v, err := semver.NewVersion(tag); err == nil { + return v + } + if m := reEmbeddedSemver.FindString(tag); m != "" { + if v, err := semver.NewVersion(m); err == nil { + return v + } + } + return nil +} + +// sortTagsNewestFirst orders tags newest-first. Tags carrying a (possibly +// prefixed) semver version sort by version descending; tags without a parseable +// version keep their original push order (best effort) and are appended after +// the versioned tags. versionRe, when non-nil, overrides version extraction +// using its first capture group. +func sortTagsNewestFirst(tags []string, versionRe *regexp.Regexp) []string { + type vt struct { + tag string + ver *semver.Version + idx int + } + parsed := make([]vt, len(tags)) + for i, t := range tags { + parsed[i] = vt{tag: t, ver: parseTagVersion(t, versionRe), idx: i} + } + sort.SliceStable(parsed, func(i, j int) bool { + a, b := parsed[i], parsed[j] + if a.ver != nil && b.ver != nil { + if a.ver.Equal(b.ver) { + return a.tag < b.tag // stable tie-break for prefixed variants + } + return a.ver.GreaterThan(b.ver) + } + if a.ver != nil { + return true // versioned before non-versioned + } + if b.ver != nil { + return false + } + return a.idx > b.idx // both unversioned: push order, newest last -> reverse + }) + out := make([]string, len(parsed)) + for i, p := range parsed { + out[i] = p.tag + } + return out +} + // templateData provides variables for the image template. type templateData struct { Registry string diff --git a/internal/discovery/registry_test.go b/internal/discovery/registry_test.go index f3b9dc6..fe480de 100644 --- a/internal/discovery/registry_test.go +++ b/internal/discovery/registry_test.go @@ -5,20 +5,22 @@ import ( "encoding/json" "net/http" "net/http/httptest" + "regexp" "testing" ) func TestRegistrySource_Fetch(t *testing.T) { tests := []struct { - name string - repos []string - tagFilter string - topX int32 - imageTemplate string - tags []string - wantCount int - wantFirst string - wantErr bool + name string + repos []string + tagFilter string + topX int32 + imageTemplate string + versionPattern string + tags []string + wantCount int + wantFirst string + wantErr bool }{ { name: "basic tag listing", @@ -64,7 +66,7 @@ func TestRegistrySource_Fetch(t *testing.T) { })) defer server.Close() - source := NewRegistrySource(server.URL, tt.repos, tt.tagFilter, tt.topX, tt.imageTemplate, server.Client()) + source := NewRegistrySource(server.URL, tt.repos, tt.tagFilter, "", tt.topX, 0, tt.imageTemplate, tt.versionPattern, server.Client()) results, err := source.Fetch(context.Background()) if tt.wantErr { @@ -91,3 +93,112 @@ func TestRegistrySource_Fetch(t *testing.T) { }) } } + +// TestRegistrySource_Pagination verifies that the source follows the OCI +// `Link` header to walk every page. This mirrors GitLab's container registry, +// which returns 100 tags per page and links the next page — the newest semver +// tags (e.g. GitLab runner helper x86_64-v*) sort lexically onto later pages. +func TestRegistrySource_Pagination(t *testing.T) { + repo := "gitlab-org/gitlab-runner/gitlab-runner-helper" + // Page 1: lexically-early junk tags. Page 2: the real x86_64-v* versions. + pages := map[string]tagListResponse{ + "": {Name: repo, Tags: []string{"3.18-arm-v17.8.0", "alpine-edge-arm-abc123", "x86_64-latest"}}, + "x86_64-v18.5.0": {Name: repo, Tags: []string{ + "x86_64-v18.5.0", "x86_64-v18.10.0", "x86_64-v19.0.0", + }}, + } + + var server *httptest.Server + server = httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + last := r.URL.Query().Get("last") + page, ok := pages[last] + if !ok { + t.Fatalf("unexpected last=%q", last) + } + // On the first page, link to the second. + if last == "" { + w.Header().Set("Link", "; rel=\"next\"") + } + w.WriteHeader(http.StatusOK) + if err := json.NewEncoder(w).Encode(page); err != nil { + t.Fatal(err) + } + })) + defer server.Close() + + source := NewRegistrySource(server.URL, []string{repo}, `^x86_64-v[0-9]+\.`, "", 2, 0, "", "x86_64-v(.+)", server.Client()) + results, err := source.Fetch(context.Background()) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + + if len(results) != 2 { + t.Fatalf("expected top 2 results, got %d: %v", len(results), results) + } + host := server.URL[len("http://"):] + if results[0].Image != host+"/"+repo+":x86_64-v19.0.0" { + t.Errorf("expected x86_64-v19.0.0 first, got %s", results[0].Image) + } + if results[1].Image != host+"/"+repo+":x86_64-v18.10.0" { + t.Errorf("expected x86_64-v18.10.0 second (10 > 5, not lexical), got %s", results[1].Image) + } +} + +func TestSortTagsNewestFirst(t *testing.T) { + tests := []struct { + name string + in []string + want []string + }{ + { + name: "plain semver", + in: []string{"v1.9.0", "v1.10.0", "v1.2.0"}, + want: []string{"v1.10.0", "v1.9.0", "v1.2.0"}, + }, + { + name: "gitlab runner helper arch-prefixed", + in: []string{"x86_64-v17.4.0", "x86_64-v17.10.0", "x86_64-v17.5.0"}, + want: []string{"x86_64-v17.10.0", "x86_64-v17.5.0", "x86_64-v17.4.0"}, + }, + { + name: "flavor and arch prefix", + in: []string{"ubuntu-x86_64-v16.11.0", "alpine-x86_64-v17.0.0", "ubuntu-x86_64-v17.0.0"}, + want: []string{"alpine-x86_64-v17.0.0", "ubuntu-x86_64-v17.0.0", "ubuntu-x86_64-v16.11.0"}, + }, + { + name: "non-versioned tags after versioned, push order reversed", + in: []string{"x86_64-latest", "x86_64-v17.5.0", "bleeding"}, + want: []string{"x86_64-v17.5.0", "bleeding", "x86_64-latest"}, + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + got := sortTagsNewestFirst(tt.in, nil) + if len(got) != len(tt.want) { + t.Fatalf("got %v, want %v", got, tt.want) + } + for i := range got { + if got[i] != tt.want[i] { + t.Fatalf("position %d: got %q, want %q (full: %v)", i, got[i], tt.want[i], got) + } + } + }) + } +} + +func TestSortTagsNewestFirst_VersionPattern(t *testing.T) { + re := regexp.MustCompile(`x86_64-v(.+)`) + in := []string{"x86_64-v17.4.0", "x86_64-v17.10.0", "ubuntu-v99.0.0", "x86_64-v17.5.0"} + want := []string{"x86_64-v17.10.0", "x86_64-v17.5.0", "x86_64-v17.4.0", "ubuntu-v99.0.0"} + + got := sortTagsNewestFirst(in, re) + if len(got) != len(want) { + t.Fatalf("got %v, want %v", got, want) + } + for i := range got { + if got[i] != want[i] { + t.Fatalf("position %d: got %q, want %q (full: %v)", i, got[i], want[i], got) + } + } +} diff --git a/knowledge.yaml b/knowledge.yaml index 47eb9c7..5fc6cf2 100644 --- a/knowledge.yaml +++ b/knowledge.yaml @@ -547,7 +547,12 @@ helperTypes: json: topX type: int32 required: false - doc: 'TopX limits the number of tags kept per repository after tagFilter is applied. The registry API does not guarantee ordering; Drop keeps the last N tags returned by the registry. Example: 3 (keep the last 3 matching tags returned per repo)' + doc: 'TopX limits the number of tags kept per repository after tagFilter is applied. Tags are sorted newest-first (by version) before this cap is applied, so the newest N tags are kept. Example: 3 (keep the 3 newest matching tags per repo)' + - name: VersionPattern + json: versionPattern + type: string + required: false + doc: 'VersionPattern is a regex with a single capture group that extracts the version substring from each tag for newest-first sorting. Use it when tags carry a prefix/suffix around the version, e.g. GitLab runner helper tags like "x86_64-v17.5.0" (pattern "x86_64-v(.+)"). When unset, Drop tries a strict semver parse, then falls back to extracting an embedded semver substring. Tags with no parseable version keep registry push order and sort after versioned tags. Example: "x86_64-v(.+)"' - name: ImageTemplate json: imageTemplate type: string diff --git a/llms-full.txt b/llms-full.txt index dbbf428..77a4a7a 100644 --- a/llms-full.txt +++ b/llms-full.txt @@ -211,7 +211,8 @@ DiscoveryRegistryQuery defines OCI registry tag listing configuration for image | URL | `url` | `string` | ✓ | | URL is the registry base URL (without repository path). Example: "https://registry.example.com", "https://ghcr.io" | | Repositories | `repositories` | `[]string` | ✓ | | Repositories is the list of repository paths to list tags from. Example: ["team/app", "team/worker", "infra/tools"] | | TagFilter | `tagFilter` | `string` | — | | TagFilter is a regex applied to tag names. Only matching tags are discovered. Example: "^v[0-9]+\\." (semver tags only), "^main-" (main branch builds) | -| TopX | `topX` | `int32` | — | | TopX limits the number of tags kept per repository after tagFilter is applied. The registry API does not guarantee ordering; Drop keeps the last N tags returned by the registry. Example: 3 (keep the last 3 matching tags returned per repo) | +| TopX | `topX` | `int32` | — | | TopX limits the number of tags kept per repository after tagFilter is applied. Tags are sorted newest-first (by version) before this cap is applied, so the newest N tags are kept. Example: 3 (keep the 3 newest matching tags per repo) | +| VersionPattern | `versionPattern` | `string` | — | | VersionPattern is a regex with a single capture group that extracts the version substring from each tag for newest-first sorting. Use it when tags carry a prefix/suffix around the version, e.g. GitLab runner helper tags like "x86_64-v17.5.0" (pattern "x86_64-v(.+)"). When unset, Drop tries a strict semver parse, then falls back to extracting an embedded semver substring. Tags with no parseable version keep registry push order and sort after versioned tags. Example: "x86_64-v(.+)" | | ImageTemplate | `imageTemplate` | `string` | — | | ImageTemplate is a Go text/template for constructing the full image reference from discovered tags. Available variables: {{.Registry}}, {{.Repository}}, {{.Tag}} Default (when unset): "{{.Registry}}/{{.Repository}}:{{.Tag}}" Example: "registry.example.com/{{.Repository}}:{{.Tag}}" | ### DiscoverySignal diff --git a/test/e2e/discovery-registry-gitlab/01-discoverypolicy.yaml b/test/e2e/discovery-registry-gitlab/01-discoverypolicy.yaml new file mode 100644 index 0000000..20e9402 --- /dev/null +++ b/test/e2e/discovery-registry-gitlab/01-discoverypolicy.yaml @@ -0,0 +1,21 @@ +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: e2e-discovery-registry-gitlab-helper +spec: + queries: + - name: helper-tags + type: registry + registry: + url: "http://registry.e2e-infra.svc.cluster.local:5000" + repositories: + - test/gitlab-runner-helper + # Only the plain x86_64 runner releases (excludes -latest and arm flavor) + tagFilter: "^x86_64-v[0-9]+\\.[0-9]+\\.[0-9]+$" + # Pin the version substring for sorting + versionPattern: "x86_64-v(.+)" + # Keep the 2 newest releases + topX: 2 + # No signals/ranking: registry tags come back newest-first + syncInterval: 30s + maxImages: 10 diff --git a/test/e2e/discovery-registry-gitlab/02-assert-discovery-status.yaml b/test/e2e/discovery-registry-gitlab/02-assert-discovery-status.yaml new file mode 100644 index 0000000..0521779 --- /dev/null +++ b/test/e2e/discovery-registry-gitlab/02-assert-discovery-status.yaml @@ -0,0 +1,16 @@ +# Assert the GitLab runner helper registry query ranks tags version-aware: +# - Ready=True / Synced +# - Exactly 2 images kept (topX=2), excluding -latest and the arm flavor +# - rank 1 is x86_64-v19.0.0 (newest) +# - rank 2 is x86_64-v18.10.0, proving 18.10.0 > 18.5.0 (version-aware, not lexical) +apiVersion: drop.corewire.io/v1alpha1 +kind: DiscoveryPolicy +metadata: + name: e2e-discovery-registry-gitlab-helper +status: + (conditions[?type == 'Ready']): + - status: "True" + reason: Synced + imageCount: 2 + (discoveredImages[?rank == `1`].image | [0]): registry.e2e-infra.svc.cluster.local:5000/test/gitlab-runner-helper:x86_64-v19.0.0 + (discoveredImages[?rank == `2`].image | [0]): registry.e2e-infra.svc.cluster.local:5000/test/gitlab-runner-helper:x86_64-v18.10.0 diff --git a/test/e2e/discovery-registry-gitlab/chainsaw-test.yaml b/test/e2e/discovery-registry-gitlab/chainsaw-test.yaml new file mode 100644 index 0000000..c91c562 --- /dev/null +++ b/test/e2e/discovery-registry-gitlab/chainsaw-test.yaml @@ -0,0 +1,27 @@ +# yaml-language-server: $schema=https://raw.githubusercontent.com/kyverno/chainsaw/main/.schemas/json/test-chainsaw-v1alpha1.json +apiVersion: chainsaw.kyverno.io/v1alpha1 +kind: Test +metadata: + name: discovery-registry-gitlab +spec: + description: | + Verify that a registry DiscoveryPolicy ranks GitLab runner helper style tags + (x86_64-v) version-aware and newest-first, using tagFilter + + versionPattern, with no signals/ranking configured. + steps: + - name: Create DiscoveryPolicy for GitLab runner helper tags + try: + - apply: + file: 01-discoverypolicy.yaml + - name: Assert version-aware newest-first ranking + try: + - assert: + timeout: 120s + file: 02-assert-discovery-status.yaml + - name: Cleanup + try: + - delete: + ref: + apiVersion: drop.corewire.io/v1alpha1 + kind: DiscoveryPolicy + name: e2e-discovery-registry-gitlab-helper From 4d5fcc695a68e244f18d0ccd27c8a8f4c5a9d71f Mon Sep 17 00:00:00 2001 From: Julian Wachter Date: Tue, 30 Jun 2026 00:02:20 +0200 Subject: [PATCH 33/35] loki pull --- api/v1alpha1/discoverypolicy_types.go | 29 +----- .../drop.corewire.io_discoverypolicies.yaml | 23 ----- docs/content/docs/discovery.md | 77 ++++++++-------- .../content/docs/reference/_generated_crds.md | 4 +- docs/static/llms-full.txt | 4 +- internal/discovery/engine.go | 37 ++------ internal/discovery/engine_test.go | 65 +------------ internal/discovery/loki.go | 91 ++++--------------- internal/discovery/loki_test.go | 83 +---------------- 9 files changed, 81 insertions(+), 332 deletions(-) diff --git a/api/v1alpha1/discoverypolicy_types.go b/api/v1alpha1/discoverypolicy_types.go index 63d31ea..7501fb6 100644 --- a/api/v1alpha1/discoverypolicy_types.go +++ b/api/v1alpha1/discoverypolicy_types.go @@ -378,7 +378,8 @@ type TimeOfDayWindow struct { } // EventMetric selects which per-image quantity an event signal measures. -// +kubebuilder:validation:Enum=pullTime;imageSize;failure;cacheHit +// Both quantities are derived from Pulled events. +// +kubebuilder:validation:Enum=pullTime;imageSize type EventMetric string const ( @@ -386,10 +387,6 @@ const ( EventMetricPullTime EventMetric = "pullTime" // EventMetricImageSize measures image size in bytes (from Pulled event messages). EventMetricImageSize EventMetric = "imageSize" - // EventMetricFailure measures pull-failure events. - EventMetricFailure EventMetric = "failure" - // EventMetricCacheHit measures already-present (cache-hit) events. - EventMetricCacheHit EventMetric = "cacheHit" ) // EventStatistic defines the aggregation applied to the selected metric's samples. @@ -411,19 +408,6 @@ const ( EventStatisticCount EventStatistic = "count" ) -// DurationMode defines how pull duration is extracted from event records. -// +kubebuilder:validation:Enum=eventPair;messageDuration -type DurationMode string - -const ( - // DurationModeEventPair computes duration as Pulled.timestamp - Pulling.timestamp - // for the same Pod/image pair. - DurationModeEventPair DurationMode = "eventPair" - // DurationModeMessageDuration parses the duration directly from the Pulled event message - // (e.g., "Successfully pulled image ... in 42.3s"). - DurationModeMessageDuration DurationMode = "messageDuration" -) - // EventPullTimeSignalConfig configures the eventPullTime signal type. // The referenced query must be a Loki query. Pull duration and image size are // extracted from the same Pulled events; metric selects which one to rank on. @@ -436,15 +420,6 @@ type EventPullTimeSignalConfig struct { // Statistic selects how the metric's samples are aggregated per image. // +kubebuilder:validation:Enum=p50;p90;p95;avg;max;count Statistic EventStatistic `json:"statistic"` - // IncludeCacheHits controls whether "already present on machine" events are included - // in cold-pull duration statistics. Set to false to exclude cache hits. - // Only applies when metric=pullTime. - // +kubebuilder:default=false - IncludeCacheHits bool `json:"includeCacheHits"` - // DurationMode controls how pull duration is extracted from event records. - // Only applies when metric=pullTime. - // +kubebuilder:validation:Enum=eventPair;messageDuration - DurationMode DurationMode `json:"durationMode"` } // ============================================================ diff --git a/config/crd/bases/drop.corewire.io_discoverypolicies.yaml b/config/crd/bases/drop.corewire.io_discoverypolicies.yaml index a361854..6b166e6 100644 --- a/config/crd/bases/drop.corewire.io_discoverypolicies.yaml +++ b/config/crd/bases/drop.corewire.io_discoverypolicies.yaml @@ -447,25 +447,6 @@ spec: eventPullTime: description: EventPullTime is required when type=eventPullTime. properties: - durationMode: - allOf: - - enum: - - eventPair - - messageDuration - - enum: - - eventPair - - messageDuration - description: |- - DurationMode controls how pull duration is extracted from event records. - Only applies when metric=pullTime. - type: string - includeCacheHits: - default: false - description: |- - IncludeCacheHits controls whether "already present on machine" events are included - in cold-pull duration statistics. Set to false to exclude cache hits. - Only applies when metric=pullTime. - type: boolean metric: default: pullTime description: |- @@ -474,8 +455,6 @@ spec: enum: - pullTime - imageSize - - failure - - cacheHit type: string statistic: allOf: @@ -497,8 +476,6 @@ spec: are aggregated per image. type: string required: - - durationMode - - includeCacheHits - statistic type: object name: diff --git a/docs/content/docs/discovery.md b/docs/content/docs/discovery.md index ed10c41..4cf4449 100644 --- a/docs/content/docs/discovery.md +++ b/docs/content/docs/discovery.md @@ -50,7 +50,7 @@ signals/ranking so you can apply them directly. | Type | Source | Discovered from | Use when | |------|--------|-----------------|----------| | `prometheus` | Metrics series | `image` label on results | Usage/concurrency from cluster metrics | -| `loki` | Event logs | parsed pull events | Pull durations & failures | +| `loki` | Event logs | parsed pull events | Pull durations & image sizes | | `registry` | Tag/catalog API | repository tags | Pre-cache newest tags by name | ### Prometheus Query @@ -145,15 +145,15 @@ spec: queryType: range # only supported Loki query mode currently lookback: 168h query: | - # Pull lifecycle events used to derive pull durations/failures. + # Successful pulls carry pull duration and image size in the message. {job="kubernetes-events", namespace="gitlab-runner"} | json | involvedObject_name =~ "runner-.*" - | reason =~ "Pulling|Pulled|Failed|BackOff" + | reason = "Pulled" parser: type: kubernetesEvents # maps log fields into structured event records podField: involvedObject_name # which field holds the pod name - reasonField: reason # Pulling / Pulled / Failed + reasonField: reason # only Pulled events are consumed messageField: message # free-text event message imageField: message # image ref is extracted from the message signals: @@ -163,25 +163,32 @@ spec: eventPullTime: metric: pullTime # default; aggregates pull duration samples statistic: avg # mean pull duration per image - includeCacheHits: false # only count true cold pulls - durationMode: eventPair # pair Pulling→Pulled events to get the duration ranking: strategy: signal signal: avg-cold-pull-time # slowest images rank highest ``` -How it's used: Loki contributes pull lifecycle events, not usage volume. The -`kubernetesEvents` parser turns each log/event into structured records with -`podField`, `reasonField`, and `messageField`, then extracts the image from -`imageField` (typically the same message text). +How it's used: Loki contributes pull lifecycle data, not usage volume. The +`kubernetesEvents` parser turns each `Pulled` event into a structured record +with `podField`, `reasonField`, and `messageField`, then extracts the image +from `imageField` (typically the same message text). -Event reasons consumed by discovery: `Pulling`, `Pulled`, `Failed`, `BackOff`, -`AlreadyPresent`. +#### Why only Pulled events -Duration semantics: -- `durationMode: messageDuration` parses `in 42.3s` from Pulled messages. -- `durationMode: eventPair` uses Pulled timestamp minus Pulling timestamp. -- Failures are tracked as `image:failed`; cache hits as `image:cache_hit`. +The kubelet emits a different `reason` for each stage of a pull, but the +`Pulled` event already carries everything the signals need — the cold-pull +duration (`in 704ms`) and the image size (`Image size: N bytes`) are both in its +message. Other reasons (`Pulling`, `Failed`, `BackOff`, `AlreadyPresent`) are +ignored: they add no ranking data we can't already read off `Pulled`. Both +`eventPullTime` metrics are derived from `Pulled`: + +| Metric | Source | Meaning | +|--------|--------|---------| +| `pullTime` | `in Xs` in the Pulled message | Cold-pull latency — slow images rank highest | +| `imageSize` | `Image size: N bytes` in the Pulled message | Image size in bytes — large images rank highest | + +Duration semantics: `pullTime` parses `in 42.3s` directly from the Pulled +message; `imageSize` parses `Image size: N bytes` from the same message. Alloy shipping (real cluster events): - Use @@ -209,13 +216,12 @@ Loki returns streams, each with `[timestamp, line]` entries. With Alloy } ``` -The parser extracts image + reason from each entry, then builds per-image samples: +The parser extracts image + size from each `Pulled` entry, then builds per-image samples: | Parsed event | Output key | Value added | |-------------|------------|-------------| | `Pulled ... in 704ms` | `docker.io/library/redis:7-alpine` | `0.704` seconds | -| `Failed ...` or `BackOff ...` | `docker.io/library/redis:7-alpine:failed` | `1` | -| `already present on machine` | `docker.io/library/redis:7-alpine:cache_hit` | `1` | +| `Pulled ... Image size: N bytes` | `docker.io/library/redis:7-alpine:size_bytes` | `N` | For `eventPullTime` signals, these samples are reduced by `statistic` (`avg`/`p50`/`p95`/etc.) into one value per image. @@ -239,6 +245,8 @@ spec: - gitlab-org/gitlab-runner/gitlab-runner-helper tagFilter: "^x86_64-v[0-9]+\\." # only x86_64-v1. / x86_64-v2. ... versionPattern: "x86_64-v(.+)" # capture group 1 is the version + tagSeek: "x86_64-u~" # skip straight to the x86_64-v* tags + maxScan: 2000 # cap tags fetched per repo before filtering topX: 3 # keep the 3 newest matching tags per repo imageTemplate: "{{.Registry}}/{{.Repository}}:{{.Tag}}" # built image ref secretRef: @@ -262,6 +270,14 @@ Important behavior notes: - `versionPattern` (optional) is a regex with one capture group that pins where the version lives in the tag, e.g. `x86_64-v(.+)` for GitLab helper images. Use it when the default extraction picks the wrong number. +- `tagSeek` (optional) is a pagination cursor sent to the registry as the `last` + query parameter. The registry lists tags lexically after this value, so you + can skip large numbers of irrelevant earlier tags (e.g. tens of thousands of + digest tags) without fetching them. It is not a real tag name — any string + works, e.g. `x86_64-u~` jumps straight to the `x86_64-v*` tags. +- `maxScan` (optional) caps how many tags are fetched per repository before + filtering. Defaults to `1000`. Pair it with `tagSeek` to fetch only the + relevant range on registries with very large tag lists. - `imageTemplate` variables: `{{.Registry}}`, `{{.Repository}}`, `{{.Tag}}`. Default: `{{.Registry}}/{{.Repository}}:{{.Tag}}`. @@ -334,7 +350,7 @@ A signal derives a named per-image value from exactly one query. The four types | `aggregate` | One value over all samples | `method`: sum/max/avg/count/min | | `timeWeightedAggregate` | Weighted sum by hour-of-day | `windows`, `weight`, `timezone` | | `windowAggregate` | One sub-window only | `relativeWindow` or `window` start/end | -| `eventPullTime` | Event metric statistic | `metric`: pullTime/imageSize/failure/cacheHit, `statistic`: p50/p90/p95/avg/max/count | +| `eventPullTime` | Event metric statistic | `metric`: pullTime/imageSize, `statistic`: p50/p90/p95/avg/max/count | Signal × source compatibility: @@ -566,13 +582,9 @@ This signal ignores the 48h volume dataset — it reads Loki pull durations inst | `max` | slowest pull | 730 | 4100 | absolute worst pull | | `count` | cold-pull events | 1 | 3 | how often pulled cold | -Two extra knobs: `includeCacheHits` (default `false`) adds "already present" events to duration stats; `durationMode` is `eventPair` (Pulled−Pulling timestamps) or `messageDuration` (parse "in 42.3s" from the message). - -`eventPullTime` now uses `metric + statistic`: +`eventPullTime` uses `metric + statistic`, both derived from `Pulled` events: - `metric: pullTime` (default) with `statistic: p50|p90|p95|avg|max|count` - `metric: imageSize` with `statistic: p50|p90|p95|avg|max|count` (bytes from `Image size: N bytes`) -- `metric: failure` with `statistic: count` -- `metric: cacheHit` with `statistic: count` ```yaml apiVersion: drop.corewire.io/v1alpha1 @@ -592,7 +604,7 @@ spec: query: | {job="kubernetes-events", namespace="gitlab-runner"} | json - | reason =~ "Pulling|Pulled|Failed|BackOff" + | reason = "Pulled" parser: type: kubernetesEvents podField: involvedObject_name @@ -604,10 +616,8 @@ spec: query: image-pull-events type: eventPullTime eventPullTime: - metric: pullTime # pullTime (default) | imageSize | failure | cacheHit + metric: pullTime # pullTime (default) | imageSize statistic: avg # p50 | p90 | p95 | avg | max | count - includeCacheHits: false # ignore already-cached pulls in latency stats - durationMode: eventPair # eventPair (Pulling→Pulled) | messageDuration parsing ranking: strategy: signal signal: avg-cold-pull-time @@ -623,7 +633,6 @@ signals: eventPullTime: metric: imageSize statistic: avg - durationMode: messageDuration ranking: strategy: signal @@ -754,7 +763,7 @@ spec: query: | {job="kubernetes-events", namespace="gitlab-runner"} | json - | reason =~ "Pulling|Pulled|Failed|BackOff" + | reason = "Pulled" parser: type: kubernetesEvents podField: involvedObject_name @@ -786,8 +795,6 @@ spec: eventPullTime: metric: pullTime statistic: avg - includeCacheHits: false - durationMode: eventPair ranking: strategy: modelExposure modelExposure: @@ -997,7 +1004,7 @@ spec: {job="kubernetes-events", namespace="gitlab-runner"} | json | involvedObject_name =~ "runner-.*" - | reason =~ "Pulling|Pulled|Failed|BackOff" + | reason = "Pulled" parser: type: kubernetesEvents podField: involvedObject_name @@ -1032,8 +1039,6 @@ spec: eventPullTime: metric: pullTime statistic: avg # mean latency signal; use p95 if you need tail sensitivity - includeCacheHits: false - durationMode: eventPair ranking: strategy: modelExposure diff --git a/docs/content/docs/reference/_generated_crds.md b/docs/content/docs/reference/_generated_crds.md index 2eb0797..6e1c893 100644 --- a/docs/content/docs/reference/_generated_crds.md +++ b/docs/content/docs/reference/_generated_crds.md @@ -237,7 +237,9 @@ DiscoveryRegistryQuery defines OCI registry tag listing configuration for image | `url` | `string` | Yes | — | URL is the registry base URL (without repository path). Example: "https://registry.example.com", "https://ghcr.io" | | `repositories` | `[]string` | Yes | — | Repositories is the list of repository paths to list tags from. Example: ["team/app", "team/worker", "infra/tools"] | | `tagFilter` | `string` | No | — | TagFilter is a regex applied to tag names. Only matching tags are discovered. Example: "^v[0-9]+\\." (semver tags only), "^main-" (main branch builds) | +| `tagSeek` | `string` | No | — | TagSeek is a pagination cursor passed to the registry as the `last` query parameter. The registry lists tags lexically after this value, letting you skip large numbers of irrelevant earlier tags without fetching them. It is not a real tag name — any string works. Example: "x86_64-u~" jumps straight to the "x86_64-v*" tags on a repo with tens of thousands of digest tags (GitLab runner helper). | | `topX` | `int32` | No | — | TopX limits the number of tags kept per repository after tagFilter is applied. Tags are sorted newest-first (by version) before this cap is applied, so the newest N tags are kept. Example: 3 (keep the 3 newest matching tags per repo) | +| `maxScan` | `int32` | No | — | MaxScan caps how many tags are fetched per repository before filtering. Registries can hold tens of thousands of tags; this bounds the work. Pair it with tagSeek to fetch only the relevant range. Defaults to 1000 when unset. Example: 500 | | `versionPattern` | `string` | No | — | VersionPattern is a regex with a single capture group that extracts the version substring from each tag for newest-first sorting. Use it when tags carry a prefix/suffix around the version, e.g. GitLab runner helper tags like "x86_64-v17.5.0" (pattern "x86_64-v(.+)"). When unset, Drop tries a strict semver parse, then falls back to extracting an embedded semver substring. Tags with no parseable version keep registry push order and sort after versioned tags. Example: "x86_64-v(.+)" | | `imageTemplate` | `string` | No | — | ImageTemplate is a Go text/template for constructing the full image reference from discovered tags. Available variables: {{.Registry}}, {{.Repository}}, {{.Tag}} Default (when unset): "{{.Registry}}/{{.Repository}}:{{.Tag}}" Example: "registry.example.com/{{.Repository}}:{{.Tag}}" | @@ -263,8 +265,6 @@ EventPullTimeSignalConfig configures the eventPullTime signal type. The referenc |-------|------|----------|---------|-------------| | `metric` | `EventMetric` | No | pullTime | Metric selects which per-image quantity to aggregate. Defaults to pullTime, which correlates strongly with cold-start cost. Use imageSize to rank by bytes. | | `statistic` | `EventStatistic` | Yes | — | Statistic selects how the metric's samples are aggregated per image. | -| `includeCacheHits` | `bool` | Yes | false | IncludeCacheHits controls whether "already present on machine" events are included in cold-pull duration statistics. Set to false to exclude cache hits. Only applies when metric=pullTime. | -| `durationMode` | `DurationMode` | Yes | — | DurationMode controls how pull duration is extracted from event records. Only applies when metric=pullTime. | ### ImageEntry diff --git a/docs/static/llms-full.txt b/docs/static/llms-full.txt index 77a4a7a..f214f52 100644 --- a/docs/static/llms-full.txt +++ b/docs/static/llms-full.txt @@ -211,7 +211,9 @@ DiscoveryRegistryQuery defines OCI registry tag listing configuration for image | URL | `url` | `string` | ✓ | | URL is the registry base URL (without repository path). Example: "https://registry.example.com", "https://ghcr.io" | | Repositories | `repositories` | `[]string` | ✓ | | Repositories is the list of repository paths to list tags from. Example: ["team/app", "team/worker", "infra/tools"] | | TagFilter | `tagFilter` | `string` | — | | TagFilter is a regex applied to tag names. Only matching tags are discovered. Example: "^v[0-9]+\\." (semver tags only), "^main-" (main branch builds) | +| TagSeek | `tagSeek` | `string` | — | | TagSeek is a pagination cursor passed to the registry as the `last` query parameter. The registry lists tags lexically after this value, letting you skip large numbers of irrelevant earlier tags without fetching them. It is not a real tag name — any string works. Example: "x86_64-u~" jumps straight to the "x86_64-v*" tags on a repo with tens of thousands of digest tags (GitLab runner helper). | | TopX | `topX` | `int32` | — | | TopX limits the number of tags kept per repository after tagFilter is applied. Tags are sorted newest-first (by version) before this cap is applied, so the newest N tags are kept. Example: 3 (keep the 3 newest matching tags per repo) | +| MaxScan | `maxScan` | `int32` | — | | MaxScan caps how many tags are fetched per repository before filtering. Registries can hold tens of thousands of tags; this bounds the work. Pair it with tagSeek to fetch only the relevant range. Defaults to 1000 when unset. Example: 500 | | VersionPattern | `versionPattern` | `string` | — | | VersionPattern is a regex with a single capture group that extracts the version substring from each tag for newest-first sorting. Use it when tags carry a prefix/suffix around the version, e.g. GitLab runner helper tags like "x86_64-v17.5.0" (pattern "x86_64-v(.+)"). When unset, Drop tries a strict semver parse, then falls back to extracting an embedded semver substring. Tags with no parseable version keep registry push order and sort after versioned tags. Example: "x86_64-v(.+)" | | ImageTemplate | `imageTemplate` | `string` | — | | ImageTemplate is a Go text/template for constructing the full image reference from discovered tags. Available variables: {{.Registry}}, {{.Repository}}, {{.Tag}} Default (when unset): "{{.Registry}}/{{.Repository}}:{{.Tag}}" Example: "registry.example.com/{{.Repository}}:{{.Tag}}" | @@ -237,8 +239,6 @@ EventPullTimeSignalConfig configures the eventPullTime signal type. The referenc |-------|------|------|----------|---------|-------------| | Metric | `metric` | `EventMetric` | — | `pullTime` | Metric selects which per-image quantity to aggregate. Defaults to pullTime, which correlates strongly with cold-start cost. Use imageSize to rank by bytes. | | Statistic | `statistic` | `EventStatistic` | ✓ | | Statistic selects how the metric's samples are aggregated per image. Enum: `p50`,`p90`,`p95`,`avg`,`max`,`count` | -| IncludeCacheHits | `includeCacheHits` | `bool` | ✓ | `false` | IncludeCacheHits controls whether "already present on machine" events are included in cold-pull duration statistics. Set to false to exclude cache hits. Only applies when metric=pullTime. | -| DurationMode | `durationMode` | `DurationMode` | ✓ | | DurationMode controls how pull duration is extracted from event records. Only applies when metric=pullTime. Enum: `eventPair`,`messageDuration` | ### ImageEntry diff --git a/internal/discovery/engine.go b/internal/discovery/engine.go index 9943e38..5c12ddf 100644 --- a/internal/discovery/engine.go +++ b/internal/discovery/engine.go @@ -655,20 +655,15 @@ func modelExposureRank(cfg *dropv1alpha1.ModelExposureRankingConfig, signals map } // collectImages returns a sorted, deduplicated list of all image references across all query results. -// For Loki query data, special per-image suffix keys (":failed", ":cache_hit") are stripped to -// their base image name so that images visible only via failure/cache events are still included. +// For Loki query data, the per-image size suffix key (":size_bytes") is stripped to its base +// image name so that images are deduplicated correctly. func collectImages(rawByQuery map[string]*QueryRawData) []string { seen := make(map[string]struct{}) for _, raw := range rawByQuery { for img := range raw.Samples { - switch { - case strings.HasSuffix(img, lokiFailedSuffix): - seen[strings.TrimSuffix(img, lokiFailedSuffix)] = struct{}{} - case strings.HasSuffix(img, lokiCacheHitSuffix): - seen[strings.TrimSuffix(img, lokiCacheHitSuffix)] = struct{}{} - case strings.HasSuffix(img, lokiSizeBytesSuffix): + if strings.HasSuffix(img, lokiSizeBytesSuffix) { seen[strings.TrimSuffix(img, lokiSizeBytesSuffix)] = struct{}{} - default: + } else { seen[img] = struct{}{} } } @@ -689,9 +684,7 @@ func defaultScores(rawByQuery map[string]*QueryRawData) map[string]float64 { out := make(map[string]float64) for _, raw := range rawByQuery { for key, samples := range raw.Samples { - if strings.HasSuffix(key, lokiFailedSuffix) || - strings.HasSuffix(key, lokiCacheHitSuffix) || - strings.HasSuffix(key, lokiSizeBytesSuffix) { + if strings.HasSuffix(key, lokiSizeBytesSuffix) { continue } for _, s := range samples { @@ -708,22 +701,15 @@ func defaultScores(rawByQuery map[string]*QueryRawData) map[string]float64 { // // The samples map is expected to come from a Loki kubernetesEvents query: // - samples[image] → pull duration values in seconds (from Pulled events) -// - samples[image+":failed"] → count of pull-failure events (value=1.0 each) -// - samples[image+":cache_hit"] → count of already-present events (value=1.0 each) // - samples[image+":size_bytes"]→ image size values in bytes (from Pulled event messages) // // cfg.Metric selects which series to aggregate; cfg.Statistic selects how. func deriveEventPullTime(samples map[string][]TimedSample, cfg *dropv1alpha1.EventPullTimeSignalConfig) map[string]float64 { imageSet := make(map[string]struct{}) for key := range samples { - switch { - case strings.HasSuffix(key, lokiFailedSuffix): - imageSet[strings.TrimSuffix(key, lokiFailedSuffix)] = struct{}{} - case strings.HasSuffix(key, lokiCacheHitSuffix): - imageSet[strings.TrimSuffix(key, lokiCacheHitSuffix)] = struct{}{} - case strings.HasSuffix(key, lokiSizeBytesSuffix): + if strings.HasSuffix(key, lokiSizeBytesSuffix) { imageSet[strings.TrimSuffix(key, lokiSizeBytesSuffix)] = struct{}{} - default: + } else { imageSet[key] = struct{}{} } } @@ -739,15 +725,8 @@ func deriveEventPullTime(samples map[string][]TimedSample, cfg *dropv1alpha1.Eve switch metric { case dropv1alpha1.EventMetricImageSize: pts = samples[img+lokiSizeBytesSuffix] - case dropv1alpha1.EventMetricFailure: - pts = samples[img+lokiFailedSuffix] - case dropv1alpha1.EventMetricCacheHit: - pts = samples[img+lokiCacheHitSuffix] default: // pullTime - pts = append([]TimedSample(nil), samples[img]...) - if cfg.IncludeCacheHits { - pts = append(pts, samples[img+lokiCacheHitSuffix]...) - } + pts = samples[img] } if len(pts) == 0 { continue diff --git a/internal/discovery/engine_test.go b/internal/discovery/engine_test.go index 11bd9fe..12356a9 100644 --- a/internal/discovery/engine_test.go +++ b/internal/discovery/engine_test.go @@ -493,7 +493,7 @@ func TestExecutePipeline_Loki(t *testing.T) { Name: "pull-time", Query: "pull-events", Type: dropv1alpha1.SignalTypeEventPullTime, - EventPullTime: &dropv1alpha1.EventPullTimeSignalConfig{Statistic: dropv1alpha1.EventStatisticAvg, DurationMode: dropv1alpha1.DurationModeMessageDuration}, + EventPullTime: &dropv1alpha1.EventPullTimeSignalConfig{Statistic: dropv1alpha1.EventStatisticAvg}, }, }, Ranking: &dropv1alpha1.DiscoveryRanking{Strategy: dropv1alpha1.RankingStrategySignal, Signal: "pull-time"}, @@ -520,69 +520,6 @@ func TestExecutePipeline_Loki(t *testing.T) { } } -// TestExecutePipeline_LokiFailureCount verifies that failure event counts are reported correctly. -func TestExecutePipeline_LokiFailureCount(t *testing.T) { - now := time.Now() - nanoStr := func(t time.Time) string { - return strconv.FormatInt(t.UnixNano(), 10) - } - - streams := []lokiStream{ - { - Stream: map[string]string{"app": "kubelet"}, - Values: [][]string{ - {nanoStr(now.Add(-5 * time.Second)), `Pulling image "nginx:1.25"`}, - {nanoStr(now.Add(-4 * time.Second)), `Failed to pull image "nginx:1.25": rpc error`}, - {nanoStr(now.Add(-3 * time.Second)), `Back-off pulling image "nginx:1.25"`}, - }, - }, - } - - srv := httptest.NewServer(lokiStreamHandler(streams)) - defer srv.Close() - - spec := dropv1alpha1.DiscoveryPolicySpec{ - Queries: []dropv1alpha1.DiscoveryQuery{ - { - Name: "pull-events", - Type: dropv1alpha1.DiscoveryQueryTypeLoki, - Loki: &dropv1alpha1.DiscoveryLokiQuery{ - Endpoint: srv.URL, - Query: `{app="kubelet"}`, - Parser: &dropv1alpha1.LokiParser{ - Type: dropv1alpha1.LokiParserTypeKubernetesEvents, - MessageField: "message", - }, - }, - }, - }, - Signals: []dropv1alpha1.DiscoverySignal{ - { - Name: "failures", - Query: "pull-events", - Type: dropv1alpha1.SignalTypeEventPullTime, - EventPullTime: &dropv1alpha1.EventPullTimeSignalConfig{Metric: dropv1alpha1.EventMetricFailure, Statistic: dropv1alpha1.EventStatisticCount, DurationMode: dropv1alpha1.DurationModeMessageDuration}, - }, - }, - Ranking: &dropv1alpha1.DiscoveryRanking{Strategy: dropv1alpha1.RankingStrategySignal, Signal: "failures"}, - MaxImages: 10, - } - - clientFn := func(_ context.Context, _ string) (*http.Client, error) { return srv.Client(), nil } - result := ExecutePipeline(context.Background(), spec, clientFn) - - if result.QueryResults[0].Status != dropv1alpha1.QueryResultStatusSuccess { - t.Fatalf("expected success, got %s: %s", result.QueryResults[0].Status, result.QueryResults[0].Message) - } - if len(result.Images) != 1 { - t.Fatalf("expected 1 image, got %d: %v", len(result.Images), result.Images) - } - // Both "failed" and "backoff" reasons count as failures → 2 failure events - if result.Images[0].FinalScore != "2" { - t.Errorf("expected failureCount=2, got %s", result.Images[0].FinalScore) - } -} - // TestExecutePipeline_LokiImageSize verifies ranking by image size (bytes) extracted from Pulled events. func TestExecutePipeline_LokiImageSize(t *testing.T) { now := time.Now() diff --git a/internal/discovery/loki.go b/internal/discovery/loki.go index 7b4c67d..64a4fb2 100644 --- a/internal/discovery/loki.go +++ b/internal/discovery/loki.go @@ -8,7 +8,6 @@ import ( "net/http" "net/url" "regexp" - "sort" "strconv" "strings" "time" @@ -21,10 +20,6 @@ const ( lokiMessageField = "message" // lokiLimitDefault is the maximum number of log entries to fetch per query. lokiLimitDefault = 5000 - // lokiFailedSuffix is appended to image keys for pull-failure event counts. - lokiFailedSuffix = ":failed" - // lokiCacheHitSuffix is appended to image keys for cache-hit event counts. - lokiCacheHitSuffix = ":cache_hit" // lokiSizeBytesSuffix is appended to image keys for extracted image-size samples. lokiSizeBytesSuffix = ":size_bytes" ) @@ -85,9 +80,8 @@ func NewLokiSource(endpoint, query string, lookback time.Duration, parser *dropv // FetchRaw calls /loki/api/v1/query_range and returns per-image timed samples. // // For a kubernetesEvents parser, sample values are pull durations in seconds -// (from Pulled event messages or Pulling→Pulled timestamp pairs). -// Pull failures are stored under the key "image:failed" with value 1.0, -// and cache hits under "image:cache_hit" with value 1.0. +// (parsed from Pulled event messages). Image sizes are stored under the key +// "image:size_bytes". // // Without a parser, each log entry produces a value=1.0 sample keyed by // the "image" stream label. @@ -178,15 +172,11 @@ type lokiEventRecord struct { // parseKubernetesEventStreams parses Kubernetes Event records from Loki log entries. // -// It produces: +// Only Pulled events are consumed. It produces: // - samples[image] → pull duration in seconds for each Pulled event -// - samples[image+":failed"] → 1.0 per pull-failure event -// - samples[image+":cache_hit"] → 1.0 per already-present event // - samples[image+":size_bytes"] → image size in bytes per Pulled event (if present) // -// Durations are derived from the "in Xs" pattern in Pulled messages (messageDuration). -// When no duration is present in the message, a Pulling→Pulled event-pair duration -// is used as a fallback. +// Durations and sizes are parsed from the Pulled event message text. func parseKubernetesEventStreams(streams []lokiStream, parser *dropv1alpha1.LokiParser) map[string][]TimedSample { reasonField := lokiCoalesceField(parser.ReasonField, "reason") podField := lokiCoalesceField(parser.PodField, "involvedObject_name") @@ -249,53 +239,22 @@ func parseKubernetesEventStreams(streams []lokiStream, parser *dropv1alpha1.Loki } } - // Sort records chronologically for correct eventPair matching. - sort.Slice(records, func(i, j int) bool { - return records[i].timestamp < records[j].timestamp - }) - - // pullingMap tracks the start timestamp of Pulling events per (pod:image). - pullingMap := make(map[string]float64) out := make(map[string][]TimedSample) for _, rec := range records { - switch strings.ToLower(rec.reason) { - case "pulling": - pullingMap[rec.pod+":"+rec.image] = rec.timestamp - - case "pulled": - // Primary: parse duration from message ("in Xs"). - dur := lokiParsePullDuration(rec.message) - sizeBytes := lokiParseImageSizeBytes(rec.message) - // Fallback: event-pair (Pulling → Pulled timestamp delta). - if dur == 0 { - if pullStart, ok := pullingMap[rec.pod+":"+rec.image]; ok { - if d := rec.timestamp - pullStart; d > 0 { - dur = d - } - } - } - if dur > 0 { - out[rec.image] = append(out[rec.image], TimedSample{Timestamp: rec.timestamp, Value: dur}) - } - if sizeBytes > 0 { - out[rec.image+lokiSizeBytesSuffix] = append( - out[rec.image+lokiSizeBytesSuffix], - TimedSample{Timestamp: rec.timestamp, Value: sizeBytes}, - ) - } - delete(pullingMap, rec.pod+":"+rec.image) - - case "failed", "backoff": - out[rec.image+lokiFailedSuffix] = append( - out[rec.image+lokiFailedSuffix], - TimedSample{Timestamp: rec.timestamp, Value: 1.0}, - ) - - case "alreadypresent": - out[rec.image+lokiCacheHitSuffix] = append( - out[rec.image+lokiCacheHitSuffix], - TimedSample{Timestamp: rec.timestamp, Value: 1.0}, + // Only Pulled events carry the data we rank on (duration + image size). + if strings.ToLower(rec.reason) != "pulled" { + continue + } + dur := lokiParsePullDuration(rec.message) + sizeBytes := lokiParseImageSizeBytes(rec.message) + if dur > 0 { + out[rec.image] = append(out[rec.image], TimedSample{Timestamp: rec.timestamp, Value: dur}) + } + if sizeBytes > 0 { + out[rec.image+lokiSizeBytesSuffix] = append( + out[rec.image+lokiSizeBytesSuffix], + TimedSample{Timestamp: rec.timestamp, Value: sizeBytes}, ) } } @@ -352,22 +311,12 @@ func lokiParseImageSizeBytes(msg string) float64 { // lokiInferReasonFromMessage infers a Kubernetes Event reason from a plain-text log message. // This is used when the reason field is not present in the Loki stream labels. +// Only Pulled events are relevant to discovery, so other reasons are ignored. func lokiInferReasonFromMessage(msg string) string { - lower := strings.ToLower(msg) - switch { - case strings.Contains(lower, "successfully pulled"): + if strings.Contains(strings.ToLower(msg), "successfully pulled") { return "Pulled" - case strings.Contains(lower, "back-off pulling") || strings.Contains(lower, "back-off"): - return "Backoff" - case strings.Contains(lower, "failed to pull"): - return "Failed" - case strings.Contains(lower, "pulling image"): - return "Pulling" - case strings.Contains(lower, "already present"): - return "AlreadyPresent" - default: - return "" } + return "" } // parseLokiNanoTimestamp converts a Loki nanosecond epoch string to Unix seconds (float64). diff --git a/internal/discovery/loki_test.go b/internal/discovery/loki_test.go index f44b729..850386c 100644 --- a/internal/discovery/loki_test.go +++ b/internal/discovery/loki_test.go @@ -67,18 +67,10 @@ func TestLokiSource_FetchRaw_Generic(t *testing.T) { } // TestLokiSource_FetchRaw_KubernetesEvents verifies the kubernetesEvents parser -// with message-based duration extraction and eventPair fallback. +// with message-based duration extraction. func TestLokiSource_FetchRaw_KubernetesEvents(t *testing.T) { now := time.Now() streams := []lokiStream{ - { - Stream: map[string]string{ - "reason": "Pulling", - "involvedObject_name": "pod-abc", - "message": `Pulling image "nginx:1.25"`, - }, - Values: [][]string{{nanoStringLoki(now.Add(-3 * time.Second)), ""}}, - }, { Stream: map[string]string{ "reason": "Pulled", @@ -119,63 +111,6 @@ func TestLokiSource_FetchRaw_KubernetesEvents(t *testing.T) { } } -// TestLokiSource_FetchRaw_KubernetesEvents_EventPair verifies that when no duration -// is present in the message, the Pulling→Pulled timestamp delta is used. -func TestLokiSource_FetchRaw_KubernetesEvents_EventPair(t *testing.T) { - now := time.Now() - pullingTime := now.Add(-3 * time.Second) - pulledTime := now.Add(-1 * time.Second) - - streams := []lokiStream{ - { - Stream: map[string]string{ - "reason": "Pulling", - "involvedObject_name": "pod-xyz", - "message": `Pulling image "alpine:3.19"`, - }, - Values: [][]string{{nanoStringLoki(pullingTime), ""}}, - }, - { - Stream: map[string]string{ - "reason": "Pulled", - "involvedObject_name": "pod-xyz", - "message": `Successfully pulled image "alpine:3.19"`, // no duration - }, - Values: [][]string{{nanoStringLoki(pulledTime), ""}}, - }, - } - - srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { - resp := lokiResponse{ - Status: lokiStatusSuccess, - Data: lokiData{ResultType: "streams", Result: streams}, - } - w.WriteHeader(http.StatusOK) - _ = json.NewEncoder(w).Encode(resp) - })) - defer srv.Close() - - src := NewLokiSource(srv.URL, `{app="kubelet"}`, time.Hour, &dropv1alpha1.LokiParser{ - Type: dropv1alpha1.LokiParserTypeKubernetesEvents, - ReasonField: "reason", - PodField: "involvedObject_name", - MessageField: "message", - }, srv.Client()) - samples, err := src.FetchRaw(t.Context()) - if err != nil { - t.Fatalf("unexpected error: %v", err) - } - - if len(samples["alpine:3.19"]) != 1 { - t.Fatalf("expected 1 sample for alpine:3.19, got %d", len(samples["alpine:3.19"])) - } - // eventPair duration ≈ 2 seconds (pulledTime - pullingTime) - got := samples["alpine:3.19"][0].Value - if got < 1.9 || got > 2.1 { - t.Errorf("expected eventPair duration ~2s, got %f", got) - } -} - // TestLokiSource_FetchRaw_KubernetesEvents_AlloyJSON verifies that events shipped by // Grafana Alloy (loki.source.kubernetes_events, log_format=json) parse with the default // parser fields. Alloy emits "msg"/"name" in the JSON body, not "message"/"involvedObject_name". @@ -187,11 +122,6 @@ func TestLokiSource_FetchRaw_KubernetesEvents_AlloyJSON(t *testing.T) { Values: [][]string{{nanoStringLoki(now.Add(-2 * time.Second)), `{"reason":"Pulled","name":"runner-abc","msg":"Successfully pulled image \"nginx:1.25\" in 740ms (740ms including waiting). Image size: 20461242 bytes."}`}}, }, - { - Stream: map[string]string{"namespace": "default", "job": "kubelet"}, - Values: [][]string{{nanoStringLoki(now.Add(-1 * time.Second)), - `{"reason":"Failed","name":"runner-def","msg":"Failed to pull image \"broken:v1\": not found"}`}}, - }, } srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { @@ -221,9 +151,6 @@ func TestLokiSource_FetchRaw_KubernetesEvents_AlloyJSON(t *testing.T) { if got := samples["nginx:1.25"+lokiSizeBytesSuffix][0].Value; got != 20461242 { t.Errorf("expected image size 20461242, got %f", got) } - if len(samples["broken:v1"+lokiFailedSuffix]) != 1 { - t.Errorf("expected 1 failure sample for broken:v1, got %d", len(samples["broken:v1"+lokiFailedSuffix])) - } } // TestLokiSource_FetchRaw_HTTPError verifies that HTTP errors are surfaced. @@ -247,10 +174,10 @@ func TestLokiInferReasonFromMessage(t *testing.T) { want string }{ {`Successfully pulled image "nginx:1.25" in 2s`, "Pulled"}, - {`Pulling image "nginx:1.25"`, "Pulling"}, - {`Failed to pull image "nginx:1.25": not found`, "Failed"}, - {`Back-off pulling image "nginx:1.25"`, "Backoff"}, - {`Container image "nginx:1.25" already present on machine`, "AlreadyPresent"}, + {`Pulling image "nginx:1.25"`, ""}, + {`Failed to pull image "nginx:1.25": not found`, ""}, + {`Back-off pulling image "nginx:1.25"`, ""}, + {`Container image "nginx:1.25" already present on machine`, ""}, {`some unrelated log line`, ""}, } for _, tt := range tests { From 33ae6dcbdf163abb1a56530769487d4845919ab9 Mon Sep 17 00:00:00 2001 From: Julian Wachter Date: Tue, 30 Jun 2026 00:14:01 +0200 Subject: [PATCH 34/35] twsts, gen --- README.md | 4 ++++ knowledge.yaml | 24 ++++++++----------- llms-full.txt | 4 ++-- .../discovery-loki-alloy/00-failing-pod.yaml | 13 ---------- .../discovery-loki-alloy/00-success-pod.yaml | 12 ++++++++++ .../01-discoverypolicy.yaml | 12 ++++------ .../02-assert-discovery-status.yaml | 1 - .../discovery-loki-alloy/chainsaw-test.yaml | 8 +++---- test/e2e/discovery-loki/00-real-pods.yaml | 13 ---------- .../discovery-loki/01-discoverypolicy.yaml | 14 ++--------- .../02-assert-discovery-status.yaml | 1 - test/e2e/discovery-loki/chainsaw-test.yaml | 4 ++-- .../02-assert-discovery-status.yaml | 9 +++++-- 13 files changed, 48 insertions(+), 71 deletions(-) create mode 100644 test/e2e/discovery-loki-alloy/00-success-pod.yaml diff --git a/README.md b/README.md index d3aaaa9..a3842bf 100644 --- a/README.md +++ b/README.md @@ -399,6 +399,10 @@ spec: tagFilter: "^x86_64-v[0-9]+\\." # Optional: pin where the version lives in the tag (capture group 1) versionPattern: "x86_64-v(.+)" + # Optional: skip straight to the x86_64-v* tags (registry `last` cursor) + tagSeek: "x86_64-u~" + # Optional: cap tags fetched per repo before filtering (default 1000) + maxScan: 2000 # Keep only the 3 newest matching tags (newest first) topX: 3 # Optional: Secret in the Drop pod namespace (default: drop-system) diff --git a/knowledge.yaml b/knowledge.yaml index 5fc6cf2..1fb8ce0 100644 --- a/knowledge.yaml +++ b/knowledge.yaml @@ -543,11 +543,21 @@ helperTypes: type: string required: false doc: 'TagFilter is a regex applied to tag names. Only matching tags are discovered. Example: "^v[0-9]+\\." (semver tags only), "^main-" (main branch builds)' + - name: TagSeek + json: tagSeek + type: string + required: false + doc: 'TagSeek is a pagination cursor passed to the registry as the `last` query parameter. The registry lists tags lexically after this value, letting you skip large numbers of irrelevant earlier tags without fetching them. It is not a real tag name — any string works. Example: "x86_64-u~" jumps straight to the "x86_64-v*" tags on a repo with tens of thousands of digest tags (GitLab runner helper).' - name: TopX json: topX type: int32 required: false doc: 'TopX limits the number of tags kept per repository after tagFilter is applied. Tags are sorted newest-first (by version) before this cap is applied, so the newest N tags are kept. Example: 3 (keep the 3 newest matching tags per repo)' + - name: MaxScan + json: maxScan + type: int32 + required: false + doc: 'MaxScan caps how many tags are fetched per repository before filtering. Registries can hold tens of thousands of tags; this bounds the work. Pair it with tagSeek to fetch only the relevant range. Defaults to 1000 when unset. Example: 500' - name: VersionPattern json: versionPattern type: string @@ -622,20 +632,6 @@ helperTypes: - max - count doc: Statistic selects how the metric's samples are aggregated per image. - - name: IncludeCacheHits - json: includeCacheHits - type: bool - required: true - default: "false" - doc: IncludeCacheHits controls whether "already present on machine" events are included in cold-pull duration statistics. Set to false to exclude cache hits. Only applies when metric=pullTime. - - name: DurationMode - json: durationMode - type: DurationMode - required: true - enum: - - eventPair - - messageDuration - doc: DurationMode controls how pull duration is extracted from event records. Only applies when metric=pullTime. - name: ImageEntry doc: ImageEntry defines a single image to include in a set. fields: diff --git a/llms-full.txt b/llms-full.txt index 77a4a7a..f214f52 100644 --- a/llms-full.txt +++ b/llms-full.txt @@ -211,7 +211,9 @@ DiscoveryRegistryQuery defines OCI registry tag listing configuration for image | URL | `url` | `string` | ✓ | | URL is the registry base URL (without repository path). Example: "https://registry.example.com", "https://ghcr.io" | | Repositories | `repositories` | `[]string` | ✓ | | Repositories is the list of repository paths to list tags from. Example: ["team/app", "team/worker", "infra/tools"] | | TagFilter | `tagFilter` | `string` | — | | TagFilter is a regex applied to tag names. Only matching tags are discovered. Example: "^v[0-9]+\\." (semver tags only), "^main-" (main branch builds) | +| TagSeek | `tagSeek` | `string` | — | | TagSeek is a pagination cursor passed to the registry as the `last` query parameter. The registry lists tags lexically after this value, letting you skip large numbers of irrelevant earlier tags without fetching them. It is not a real tag name — any string works. Example: "x86_64-u~" jumps straight to the "x86_64-v*" tags on a repo with tens of thousands of digest tags (GitLab runner helper). | | TopX | `topX` | `int32` | — | | TopX limits the number of tags kept per repository after tagFilter is applied. Tags are sorted newest-first (by version) before this cap is applied, so the newest N tags are kept. Example: 3 (keep the 3 newest matching tags per repo) | +| MaxScan | `maxScan` | `int32` | — | | MaxScan caps how many tags are fetched per repository before filtering. Registries can hold tens of thousands of tags; this bounds the work. Pair it with tagSeek to fetch only the relevant range. Defaults to 1000 when unset. Example: 500 | | VersionPattern | `versionPattern` | `string` | — | | VersionPattern is a regex with a single capture group that extracts the version substring from each tag for newest-first sorting. Use it when tags carry a prefix/suffix around the version, e.g. GitLab runner helper tags like "x86_64-v17.5.0" (pattern "x86_64-v(.+)"). When unset, Drop tries a strict semver parse, then falls back to extracting an embedded semver substring. Tags with no parseable version keep registry push order and sort after versioned tags. Example: "x86_64-v(.+)" | | ImageTemplate | `imageTemplate` | `string` | — | | ImageTemplate is a Go text/template for constructing the full image reference from discovered tags. Available variables: {{.Registry}}, {{.Repository}}, {{.Tag}} Default (when unset): "{{.Registry}}/{{.Repository}}:{{.Tag}}" Example: "registry.example.com/{{.Repository}}:{{.Tag}}" | @@ -237,8 +239,6 @@ EventPullTimeSignalConfig configures the eventPullTime signal type. The referenc |-------|------|------|----------|---------|-------------| | Metric | `metric` | `EventMetric` | — | `pullTime` | Metric selects which per-image quantity to aggregate. Defaults to pullTime, which correlates strongly with cold-start cost. Use imageSize to rank by bytes. | | Statistic | `statistic` | `EventStatistic` | ✓ | | Statistic selects how the metric's samples are aggregated per image. Enum: `p50`,`p90`,`p95`,`avg`,`max`,`count` | -| IncludeCacheHits | `includeCacheHits` | `bool` | ✓ | `false` | IncludeCacheHits controls whether "already present on machine" events are included in cold-pull duration statistics. Set to false to exclude cache hits. Only applies when metric=pullTime. | -| DurationMode | `durationMode` | `DurationMode` | ✓ | | DurationMode controls how pull duration is extracted from event records. Only applies when metric=pullTime. Enum: `eventPair`,`messageDuration` | ### ImageEntry diff --git a/test/e2e/discovery-loki-alloy/00-failing-pod.yaml b/test/e2e/discovery-loki-alloy/00-failing-pod.yaml index 32ed239..750b8aa 100644 --- a/test/e2e/discovery-loki-alloy/00-failing-pod.yaml +++ b/test/e2e/discovery-loki-alloy/00-failing-pod.yaml @@ -10,16 +10,3 @@ spec: image: registry.e2e-infra.svc.cluster.local:5000/test/myapp:v1 imagePullPolicy: Always command: ["/bin/sh", "-c", "echo ok && sleep 2"] ---- -apiVersion: v1 -kind: Pod -metadata: - name: e2e-alloy-failing-pod - namespace: default -spec: - restartPolicy: Never - containers: - - name: bad-image - image: registry.invalid.local:9999/e2e-alloy-invalid:nope - imagePullPolicy: Always - command: ["/bin/sh", "-c", "echo should-not-run && sleep 60"] diff --git a/test/e2e/discovery-loki-alloy/00-success-pod.yaml b/test/e2e/discovery-loki-alloy/00-success-pod.yaml new file mode 100644 index 0000000..750b8aa --- /dev/null +++ b/test/e2e/discovery-loki-alloy/00-success-pod.yaml @@ -0,0 +1,12 @@ +apiVersion: v1 +kind: Pod +metadata: + name: e2e-alloy-success-pod + namespace: default +spec: + restartPolicy: Never + containers: + - name: app + image: registry.e2e-infra.svc.cluster.local:5000/test/myapp:v1 + imagePullPolicy: Always + command: ["/bin/sh", "-c", "echo ok && sleep 2"] diff --git a/test/e2e/discovery-loki-alloy/01-discoverypolicy.yaml b/test/e2e/discovery-loki-alloy/01-discoverypolicy.yaml index fe2b127..871386e 100644 --- a/test/e2e/discovery-loki-alloy/01-discoverypolicy.yaml +++ b/test/e2e/discovery-loki-alloy/01-discoverypolicy.yaml @@ -10,7 +10,7 @@ spec: endpoint: "http://loki.e2e-infra.svc.cluster.local:3100" queryType: range lookback: 24h - query: '{drop_e2e="true"} | json | reason=~"Pulling|Pulled|Failed|BackOff" | name=~"e2e-alloy-(success|failing)-pod"' + query: '{drop_e2e="true"} | json | reason="Pulled" | name=~"e2e-alloy-success-pod"' parser: type: kubernetesEvents podField: name @@ -18,16 +18,14 @@ spec: messageField: msg imageField: msg signals: - - name: pull-failures + - name: p50-cold-pull-time query: alloy-k8s-events type: eventPullTime eventPullTime: - metric: failure - statistic: count - durationMode: messageDuration - includeCacheHits: false + metric: pullTime + statistic: p50 ranking: strategy: signal - signal: pull-failures + signal: p50-cold-pull-time syncInterval: 15s maxImages: 10 diff --git a/test/e2e/discovery-loki-alloy/02-assert-discovery-status.yaml b/test/e2e/discovery-loki-alloy/02-assert-discovery-status.yaml index d03f606..20c20ae 100644 --- a/test/e2e/discovery-loki-alloy/02-assert-discovery-status.yaml +++ b/test/e2e/discovery-loki-alloy/02-assert-discovery-status.yaml @@ -9,4 +9,3 @@ status: (contains(to_string(discoveredImages), 'test/myapp:v1')): true (queryResults[?name == 'alloy-k8s-events'] | [0].type): loki (imageCount > `0`): true - (contains(to_string(discoveredImages), 'e2e-alloy-invalid:nope')): true diff --git a/test/e2e/discovery-loki-alloy/chainsaw-test.yaml b/test/e2e/discovery-loki-alloy/chainsaw-test.yaml index 4ba360f..ae2478e 100644 --- a/test/e2e/discovery-loki-alloy/chainsaw-test.yaml +++ b/test/e2e/discovery-loki-alloy/chainsaw-test.yaml @@ -9,15 +9,15 @@ spec: (loki.source.kubernetes_events with log_format=json). This exercises parser fields name/msg/reason using real pull events from test pods. steps: - - name: Create real pods that trigger pull success/failure events + - name: Create a real pod that triggers pull success events try: - apply: - file: 00-failing-pod.yaml + file: 00-success-pod.yaml - name: Create DiscoveryPolicy reading Alloy json event fields try: - apply: file: 01-discoverypolicy.yaml - - name: Assert pipeline executed and discovered the failing image from Alloy events + - name: Assert pipeline executed and discovered the image from Alloy events try: - assert: timeout: 120s @@ -25,7 +25,7 @@ spec: - name: Cleanup try: - delete: - file: 00-failing-pod.yaml + file: 00-success-pod.yaml - delete: ref: apiVersion: drop.corewire.io/v1alpha1 diff --git a/test/e2e/discovery-loki/00-real-pods.yaml b/test/e2e/discovery-loki/00-real-pods.yaml index 6ec2a75..fc721ff 100644 --- a/test/e2e/discovery-loki/00-real-pods.yaml +++ b/test/e2e/discovery-loki/00-real-pods.yaml @@ -10,16 +10,3 @@ spec: image: registry.e2e-infra.svc.cluster.local:5000/test/myapp:v1 imagePullPolicy: Always command: ["/bin/sh", "-c", "echo ok && sleep 2"] ---- -apiVersion: v1 -kind: Pod -metadata: - name: e2e-loki-failure-pod - namespace: default -spec: - restartPolicy: Never - containers: - - name: bad-image - image: registry.invalid.local:9999/e2e-loki-invalid:nope - imagePullPolicy: Always - command: ["/bin/sh", "-c", "echo should-not-run && sleep 60"] diff --git a/test/e2e/discovery-loki/01-discoverypolicy.yaml b/test/e2e/discovery-loki/01-discoverypolicy.yaml index fe70fb5..6d2d82c 100644 --- a/test/e2e/discovery-loki/01-discoverypolicy.yaml +++ b/test/e2e/discovery-loki/01-discoverypolicy.yaml @@ -10,7 +10,7 @@ spec: endpoint: "http://loki.e2e-infra.svc.cluster.local:3100" queryType: range lookback: 24h - query: '{drop_e2e="true"} | json | reason=~"Pulling|Pulled|Failed|BackOff" | name=~"e2e-loki-(success|failure)-pod"' + query: '{drop_e2e="true"} | json | reason="Pulled" | name=~"e2e-loki-success-pod"' parser: type: kubernetesEvents podField: name @@ -23,18 +23,8 @@ spec: query: discovery-loki-image-pull-events type: eventPullTime eventPullTime: + metric: pullTime statistic: p50 - durationMode: messageDuration - includeCacheHits: false - # Number of pull failures per image. - - name: pull-failures - query: discovery-loki-image-pull-events - type: eventPullTime - eventPullTime: - metric: failure - statistic: count - durationMode: messageDuration - includeCacheHits: false ranking: strategy: signal signal: p50-cold-pull-time diff --git a/test/e2e/discovery-loki/02-assert-discovery-status.yaml b/test/e2e/discovery-loki/02-assert-discovery-status.yaml index b2d869a..a6d6f00 100644 --- a/test/e2e/discovery-loki/02-assert-discovery-status.yaml +++ b/test/e2e/discovery-loki/02-assert-discovery-status.yaml @@ -14,4 +14,3 @@ status: (queryResults[?name == 'discovery-loki-image-pull-events'] | [0].status): success (queryResults[?name == 'discovery-loki-image-pull-events'] | [0].type): loki (length(discoveredImages[?contains(image, 'test/myapp:v1')]) > `0`): true - (length(discoveredImages[?contains(image, 'e2e-loki-invalid:nope')]) > `0`): true diff --git a/test/e2e/discovery-loki/chainsaw-test.yaml b/test/e2e/discovery-loki/chainsaw-test.yaml index f827515..b0c8e47 100644 --- a/test/e2e/discovery-loki/chainsaw-test.yaml +++ b/test/e2e/discovery-loki/chainsaw-test.yaml @@ -6,8 +6,8 @@ metadata: spec: description: | Verify that a DiscoveryPolicy with a Loki query and the kubernetesEvents - parser derives eventPullTime signals (cold-pull time and failure count) from - real image-pull events ingested by Alloy and populates status.discoveredImages. + parser derives an eventPullTime cold-pull-time signal from real Pulled + image-pull events ingested by Alloy and populates status.discoveredImages. steps: - name: Create real pods to generate kubelet pull events try: diff --git a/test/e2e/discovery-registry-gitlab/02-assert-discovery-status.yaml b/test/e2e/discovery-registry-gitlab/02-assert-discovery-status.yaml index 0521779..18501b7 100644 --- a/test/e2e/discovery-registry-gitlab/02-assert-discovery-status.yaml +++ b/test/e2e/discovery-registry-gitlab/02-assert-discovery-status.yaml @@ -12,5 +12,10 @@ status: - status: "True" reason: Synced imageCount: 2 - (discoveredImages[?rank == `1`].image | [0]): registry.e2e-infra.svc.cluster.local:5000/test/gitlab-runner-helper:x86_64-v19.0.0 - (discoveredImages[?rank == `2`].image | [0]): registry.e2e-infra.svc.cluster.local:5000/test/gitlab-runner-helper:x86_64-v18.10.0 + # Plain structural array assertion: chainsaw matches elements positionally, + # so this proves the order (rank 1 newest first) without fragile JMESPath. + discoveredImages: + - image: registry.e2e-infra.svc.cluster.local:5000/test/gitlab-runner-helper:x86_64-v19.0.0 + rank: 1 + - image: registry.e2e-infra.svc.cluster.local:5000/test/gitlab-runner-helper:x86_64-v18.10.0 + rank: 2 From 5b07808f6fd176ef4061b37c600e4a7ffd1cb0af Mon Sep 17 00:00:00 2001 From: Julian Wachter Date: Tue, 30 Jun 2026 00:31:03 +0200 Subject: [PATCH 35/35] tests --- test/e2e/discovery-loki-alloy/01-discoverypolicy.yaml | 6 +++--- test/e2e/discovery-loki/01-discoverypolicy.yaml | 8 ++++---- 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/test/e2e/discovery-loki-alloy/01-discoverypolicy.yaml b/test/e2e/discovery-loki-alloy/01-discoverypolicy.yaml index 871386e..35e79f8 100644 --- a/test/e2e/discovery-loki-alloy/01-discoverypolicy.yaml +++ b/test/e2e/discovery-loki-alloy/01-discoverypolicy.yaml @@ -18,14 +18,14 @@ spec: messageField: msg imageField: msg signals: - - name: p50-cold-pull-time + - name: avg-cold-pull-time query: alloy-k8s-events type: eventPullTime eventPullTime: metric: pullTime - statistic: p50 + statistic: avg ranking: strategy: signal - signal: p50-cold-pull-time + signal: avg-cold-pull-time syncInterval: 15s maxImages: 10 diff --git a/test/e2e/discovery-loki/01-discoverypolicy.yaml b/test/e2e/discovery-loki/01-discoverypolicy.yaml index 6d2d82c..b5770ce 100644 --- a/test/e2e/discovery-loki/01-discoverypolicy.yaml +++ b/test/e2e/discovery-loki/01-discoverypolicy.yaml @@ -18,15 +18,15 @@ spec: messageField: msg imageField: msg signals: - # Median cold-pull time derived from the "Successfully pulled ... in Xs" messages. - - name: p50-cold-pull-time + # Mean cold-pull time derived from the "Successfully pulled ... in Xs" messages. + - name: avg-cold-pull-time query: discovery-loki-image-pull-events type: eventPullTime eventPullTime: metric: pullTime - statistic: p50 + statistic: avg ranking: strategy: signal - signal: p50-cold-pull-time + signal: avg-cold-pull-time syncInterval: 30s maxImages: 10