feat(k8s): support promql historical data queries#82
feat(k8s): support promql historical data queries#82
Conversation
There was a problem hiding this comment.
Pull request overview
Adds first-class historical querying support to the Sysdig Monitor k8s_list_* MCP tools by introducing shared start/end RFC3339 parameters, validating/resolving the time window, and wrapping PromQL appropriately for “windowed” queries while preserving legacy instant-snapshot behavior when omitted.
Changes:
- Introduce shared time-window parsing/validation utilities (
TimeWindow,ParseTimeWindow,WithTimeWindowParams) and apply them acrossk8s_list_*Monitor tools. - Update affected tools to pass
GetQueryV1Params.Time(eval atend) and set a 60s timeout for windowed queries; deprecateintervalfor HTTP/network error tools with precedence rules + warnings. - Expand test coverage to include windowed query construction using an injected clock; document the behavior in tool docs and top-level README.
Reviewed changes
Copilot reviewed 37 out of 37 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| internal/infra/mcp/tools/utils.go | Adds shared time-window types/helpers and request arg presence detection. |
| internal/infra/mcp/tools/tools_suite_test.go | Adds shared test helpers for expected windowed query params and limits. |
| internal/infra/mcp/tools/tool_k8s_list_workloads.go | Adds start/end support, eval time + timeout, and windowed PromQL wrapping logic. |
| internal/infra/mcp/tools/tool_k8s_list_workloads_test.go | Adds mock clock + windowed query expectation cases. |
| internal/infra/mcp/tools/tool_k8s_list_underutilized_pods_memory_quota.go | Adds start/end support; wraps usage/limit in avg_over_time when windowed. |
| internal/infra/mcp/tools/tool_k8s_list_underutilized_pods_memory_quota_test.go | Adds mock clock + windowed query expectation. |
| internal/infra/mcp/tools/tool_k8s_list_underutilized_pods_cpu_quota.go | Adds start/end support; wraps usage/quota in avg_over_time when windowed. |
| internal/infra/mcp/tools/tool_k8s_list_underutilized_pods_cpu_quota_test.go | Adds mock clock + windowed query expectation. |
| internal/infra/mcp/tools/tool_k8s_list_top_unavailable_pods.go | Adds start/end support and Sysdig-canonical windowed unavailable semantics (min_over_time >= 1). |
| internal/infra/mcp/tools/tool_k8s_list_top_unavailable_pods_test.go | Adds mock clock + windowed semantics tests. |
| internal/infra/mcp/tools/tool_k8s_list_top_restarted_pods.go | Adds start/end support; uses increase() when windowed. |
| internal/infra/mcp/tools/tool_k8s_list_top_restarted_pods_test.go | Adds mock clock + windowed increase() expectations. |
| internal/infra/mcp/tools/tool_k8s_list_top_network_errors_in_pods.go | Adds start/end support; deprecates interval with precedence and windowed sum_over_time/N behavior. |
| internal/infra/mcp/tools/tool_k8s_list_top_network_errors_in_pods_test.go | Adds mock clock + windowed query expectations and clarifies legacy cases. |
| internal/infra/mcp/tools/tool_k8s_list_top_http_errors_in_pods.go | Adds start/end support; deprecates interval with precedence and windowed sum_over_time/N behavior. |
| internal/infra/mcp/tools/tool_k8s_list_top_http_errors_in_pods_test.go | Adds mock clock + windowed query expectations (including precedence over interval). |
| internal/infra/mcp/tools/tool_k8s_list_top_memory_consumed_workload.go | Adds start/end support; uses avg_over_time for windowed. |
| internal/infra/mcp/tools/tool_k8s_list_top_memory_consumed_workload_test.go | Adds mock clock + windowed query expectation. |
| internal/infra/mcp/tools/tool_k8s_list_top_memory_consumed_container.go | Adds start/end support; uses avg_over_time for windowed. |
| internal/infra/mcp/tools/tool_k8s_list_top_memory_consumed_container_test.go | Adds mock clock + windowed query expectation. |
| internal/infra/mcp/tools/tool_k8s_list_top_cpu_consumed_workload.go | Adds start/end support; uses avg_over_time for windowed. |
| internal/infra/mcp/tools/tool_k8s_list_top_cpu_consumed_workload_test.go | Adds mock clock + windowed query expectations + invalid window tests. |
| internal/infra/mcp/tools/tool_k8s_list_top_cpu_consumed_container.go | Adds start/end support; uses avg_over_time for windowed. |
| internal/infra/mcp/tools/tool_k8s_list_top_cpu_consumed_container_test.go | Adds mock clock + windowed query expectation. |
| internal/infra/mcp/tools/tool_k8s_list_pod_containers.go | Adds start/end support; uses max_over_time(...) > 0 inventory semantics when windowed. |
| internal/infra/mcp/tools/tool_k8s_list_pod_containers_test.go | Adds mock clock + windowed inventory semantics test. |
| internal/infra/mcp/tools/tool_k8s_list_nodes.go | Adds start/end support; uses max_over_time(...) > 0 inventory semantics when windowed. |
| internal/infra/mcp/tools/tool_k8s_list_nodes_test.go | Adds mock clock + windowed inventory semantics test. |
| internal/infra/mcp/tools/tool_k8s_list_cronjobs.go | Adds start/end support; uses max_over_time(...) > 0 inventory semantics when windowed. |
| internal/infra/mcp/tools/tool_k8s_list_cronjobs_test.go | Adds mock clock + windowed inventory semantics test. |
| internal/infra/mcp/tools/tool_k8s_list_count_pods_per_cluster.go | Adds start/end support; uses avg_over_time for windowed pod counts. |
| internal/infra/mcp/tools/tool_k8s_list_count_pods_per_cluster_test.go | Adds mock clock + windowed query expectation. |
| internal/infra/mcp/tools/tool_k8s_list_clusters.go | Adds start/end support; uses max_over_time(...) > 0 inventory semantics when windowed. |
| internal/infra/mcp/tools/tool_k8s_list_clusters_test.go | Adds mock clock + windowed inventory semantics tests. |
| internal/infra/mcp/tools/README.md | Documents start/end semantics and the per-tool PromQL wrapping table + deprecation behavior. |
| cmd/server/main.go | Injects a system clock into all window-aware tools at registration time. |
| README.md | Adds a note pointing to the tool docs for windowed aggregation semantics. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // RangeSelector returns the PromQL range-selector literal for this window, e.g. "[3600s]". | ||
| // The duration is rounded down to whole seconds so the selector is stable and debuggable. | ||
| func (w TimeWindow) RangeSelector() string { | ||
| return fmt.Sprintf("[%ds]", int64(w.End.Sub(w.Start).Seconds())) | ||
| } |
There was a problem hiding this comment.
TimeWindow.RangeSelector() can emit a "[0s]" range when the resolved window is <1s. This can happen in the supported "start without end" case because end defaults to clk.Now() (sub-second precision) while start is second precision; if start is within the current second, End.Sub(Start) truncates to 0 seconds. PromQL range selectors do not accept 0s, and for HTTP/network error tools the derived windowSeconds can become 0, causing a division-by-zero in the query. Consider normalizing end to whole seconds (or rounding up) and/or rejecting windows shorter than 1s in ParseTimeWindow before returning the TimeWindow.
tembleking
left a comment
There was a problem hiding this comment.
Nit: pre-existing typo in utils.go:16 — "exampes" should be "examples". Not from this PR but you're touching the file, easy drive-by fix.
…erplate, and add missing ParseTimeWindow unit tests
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 38 out of 38 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
internal/infra/mcp/tools/utils.go:16
Exampleswrites toschema["exampes"], which looks like a typo and will prevent MCP JSON schema examples from being emitted/recognized. Rename the key to"examples"so tooling can surface these examples correctly.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // newWindowedQueryParams constructs the GetQueryV1Params value that a windowed tool | ||
| // invocation is expected to produce: Query string, Time = end.Unix() via FromQueryTime1, | ||
| // and a 60s Timeout. | ||
| func newWindowedQueryParams(query string, end time.Time) sysdig.GetQueryV1Params { | ||
| var qt sysdig.Time | ||
| Expect(qt.FromQueryTime1(end.Unix())).To(Succeed()) | ||
| timeout := sysdig.Timeout("60s") | ||
| return sysdig.GetQueryV1Params{ | ||
| Query: query, | ||
| Time: &qt, | ||
| Timeout: &timeout, | ||
| } |
There was a problem hiding this comment.
newWindowedQueryParams calls Expect(...) during helper execution. Since this helper is used while building DescribeTable Entry(...) values, it can run during package init / spec construction (before RegisterFailHandler / a running spec exists), which can panic or behave unpredictably. Prefer returning an error, or handle the FromQueryTime1 error without Gomega (e.g., panic with a clear message) and keep assertions inside It/table bodies.
| // RangeSelector returns the PromQL range-selector literal for this window, e.g. "[3600s]". | ||
| func (w TimeWindow) RangeSelector() string { | ||
| return fmt.Sprintf("[%ds]", int64(w.End.Sub(w.Start).Seconds())) | ||
| } | ||
|
|
||
| // WindowSeconds returns the duration of the window in whole seconds. | ||
| func (w TimeWindow) WindowSeconds() int64 { | ||
| return int64(w.End.Sub(w.Start).Seconds()) | ||
| } |
There was a problem hiding this comment.
RangeSelector / WindowSeconds compute whole seconds via Duration.Seconds() (float64) and then cast to int64, which truncates and can yield 0 for sub-second windows. That can produce invalid PromQL like [0s] and even / 0 in the HTTP/network error tools. Compute whole seconds using integer duration arithmetic (e.g., w.End.Sub(w.Start)/time.Second) and ensure the result is >= 1 for non-zero windows.
|
All done, not taking care of the Copilot comments as per this PR. |
| func (w TimeWindow) RangeSelector() string { | ||
| return fmt.Sprintf("[%ds]", int64(w.End.Sub(w.Start).Seconds())) | ||
| } |
There was a problem hiding this comment.
If someone calls RangeSelector() on a zero TimeWindow, this produces [0s] — invalid PromQL.
All callers currently guard with !tw.IsZero(), so not a live bug, but the method itself has no guard. A future caller could easily miss the precondition.
| func (w TimeWindow) RangeSelector() string { | |
| return fmt.Sprintf("[%ds]", int64(w.End.Sub(w.Start).Seconds())) | |
| } | |
| func (w TimeWindow) RangeSelector() string { | |
| if w.IsZero() { | |
| panic("RangeSelector called on zero TimeWindow") | |
| } | |
| return fmt.Sprintf("[%ds]", int64(w.End.Sub(w.Start).Seconds())) | |
| } |
Or return (string, error) if you prefer not to panic. Same applies to WindowSeconds() below.
| func (w TimeWindow) ApplyToParams(params *sysdig.GetQueryV1Params) error { | ||
| evalTime, err := w.EvalTime() | ||
| if err != nil { | ||
| return err | ||
| } | ||
| params.Time = evalTime | ||
| if !w.IsZero() { | ||
| timeout := sysdig.Timeout(windowedQueryTimeout) | ||
| params.Timeout = &timeout | ||
| } | ||
| return nil | ||
| } |
There was a problem hiding this comment.
On the zero-window path, this still sets params.Time = nil, overwriting whatever was there before. Not a bug today since callers don't pre-populate it, but it's a surprising side-effect for a "no-op" case.
| func (w TimeWindow) ApplyToParams(params *sysdig.GetQueryV1Params) error { | |
| evalTime, err := w.EvalTime() | |
| if err != nil { | |
| return err | |
| } | |
| params.Time = evalTime | |
| if !w.IsZero() { | |
| timeout := sysdig.Timeout(windowedQueryTimeout) | |
| params.Timeout = &timeout | |
| } | |
| return nil | |
| } | |
| func (w TimeWindow) ApplyToParams(params *sysdig.GetQueryV1Params) error { | |
| if w.IsZero() { | |
| return nil | |
| } | |
| evalTime, err := w.EvalTime() | |
| if err != nil { | |
| return err | |
| } | |
| params.Time = evalTime | |
| timeout := sysdig.Timeout(windowedQueryTimeout) | |
| params.Timeout = &timeout | |
| return nil | |
| } |
| // Reads "start" and "end" from the request, validates them, and return the resolved TimeWindow. | ||
|
|
||
| func ParseTimeWindow(request mcp.CallToolRequest, clk clock.Clock) (TimeWindow, error) { |
There was a problem hiding this comment.
nit: Blank line between comment and function disconnects them — this won't be treated as godoc.
| // Reads "start" and "end" from the request, validates them, and return the resolved TimeWindow. | |
| func ParseTimeWindow(request mcp.CallToolRequest, clk clock.Clock) (TimeWindow, error) { | |
| // Reads "start" and "end" from the request, validates them, and return the resolved TimeWindow. | |
| func ParseTimeWindow(request mcp.CallToolRequest, clk clock.Clock) (TimeWindow, error) { |
| return TimeWindow{}, fmt.Errorf("end (%s) must be after start (%s)", end.Format(time.RFC3339), start.Format(time.RFC3339)) | ||
| } | ||
|
|
||
| return TimeWindow{Start: start, End: end}, nil |
There was a problem hiding this comment.
No upper bound on window duration. A user could request start=2020-01-01T00:00:00Z, producing a multi-year range that will likely OOM Prometheus.
Worth adding a max window guard here (30d? 90d?), or at minimum documenting in the param description that there's a practical limit.
| It("truncates sub-second precision from now so RangeSelector never emits [0s]", func() { | ||
| mockClock.EXPECT().Now().Return(now) | ||
| _, err := tools.ParseTimeWindow(makeRequest(map[string]any{ | ||
| "start": "2026-04-16T12:00:00Z", | ||
| }), mockClock) | ||
| Expect(err).To(HaveOccurred()) | ||
| }) | ||
| }) |
There was a problem hiding this comment.
Good coverage of ParseTimeWindow error paths, but missing direct unit tests for RangeSelector(), WindowSeconds(), EvalTime(), and ApplyToParams().
Also, this last test's description says "RangeSelector never emits [0s]" but it never calls RangeSelector() — it actually tests that ParseTimeWindow errors when start == truncated now. The description is misleading.
Suggested additions:
RangeSelector()returns correct string for a known 1h window →[3600s]ApplyToParams()setsTimeandTimeoutcorrectly on non-zero windowWindowSeconds()on a 1h window returns3600
| ) | ||
| }) | ||
|
|
||
| When("the time window is invalid", func() { |
There was a problem hiding this comment.
This is the only tool with error-path tests for invalid time windows. The other 15 tools skip them entirely.
Since ParseTimeWindow is shared, it's acceptable, but consider adding the same block to at least one more tool from a different category (e.g. http_errors, which has the legacy interval fallback) to catch tool-specific pre/post-processing bugs around the ParseTimeWindow call.
Enable users the ability to query historical data (if provided a timeline) from metric tools that use Sysdig Monitor with no data restrictions.
Adds optional
start/endhistorical query parameters to all Sysdig Monitork8s_list_* tools, enabling LLMs to query Kubernetes metrics over a past time window instead of only the current snapshot.When provided, the underlying PromQL is wrapped in the appropriate aggregation for each tool:
avg_over_timeincreasemin_over_time >= 1sum_over_time / N (rate per second)max_over_time > 0When omitted, tools behave as before (instant snapshot)