Rate limiting observability (metrics and tracing) #4553

@jerm-dro

Description

User Story

As an SRE,
I want metrics and tracing for rate limiting,
so that I can monitor enforcement and troubleshoot issues in production.

Context

See THV-0057: Rate Limiting for MCP Servers for full design details.

Acceptance Criteria

  • Decision counter increments on allowed and rejected requests, broken down by scope and operation type
  • Redis error counter increments on Redis failures, broken down by error type
  • Fail-open counter increments when a request is allowed through during a Redis outage
  • Check latency histogram records Lua script round-trip time
  • All metrics use toolhive_ prefix following project conventions
  • Span attributes rate_limit.decision, rate_limit.rejected_by, and rate_limit.fail_open are present on the request span
  • Unit: Verify each metric increments under the correct conditions
  • Unit: Verify span attributes are set on allowed, rejected, and fail-open paths
  • E2E: Send rate-limited traffic, scrape the metrics endpoint, and verify the counters are non-zero

Dependencies

  • STORY-001 (core rate limit middleware exists to instrument)

Out of Scope

  • Alert definitions and runbooks (operational docs)
  • Dashboard templates

Metadata

Labels: api, enhancement, go, kubernetes, operator