3 changes: 3 additions & 0 deletions .dockerignore
@@ -12,3 +12,6 @@

# Re-include template files for go:embed
!**/*.gotmpl

# Re-include yaml for default mapping config
!**/*.yaml
2 changes: 1 addition & 1 deletion Dockerfile
@@ -51,7 +51,7 @@ COPY --from=agent-builder /workspace/switch-agent-server .
COPY --from=agent-builder /workspace/switch-agent-client .

# Expose the service ports
-EXPOSE 50051 50051
+EXPOSE 50051 9100

ENTRYPOINT ["/switch-agent-server"]

Binary file added agent
Binary file not shown.
2 changes: 1 addition & 1 deletion api/v1alpha1/zz_generated.deepcopy.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions docs/.vitepress/config.mts
@@ -76,6 +76,7 @@ export default withMermaid({
{ text: 'Getting started', link: '/usage/getting-started' },
{ text: 'Provisioning', link: '/usage/provisioning' },
{ text: 'Agent', link: '/usage/agent' },
{ text: 'Agent Metrics', link: '/usage/metrics' },
]
},
{
3 changes: 3 additions & 0 deletions docs/usage/agent.md
@@ -15,3 +15,6 @@ The switch agent runs on the switch and exposes device and interface operations

## Notes
The current implementation uses SONiC Redis as the data source for switch state.

## Metrics
The agent exposes Prometheus metrics on port 9100. See [Agent metrics](./metrics.md) for the full metric reference and configuration schema.
274 changes: 274 additions & 0 deletions docs/usage/metrics.md
@@ -0,0 +1,274 @@
# Agent metrics

The switch agent exposes a Prometheus-compatible `/metrics` endpoint for monitoring switch health, interface state, and transceiver optics. Metrics are collected just-in-time from SONiC Redis on every Prometheus scrape — there is no background polling or caching.

## Endpoints

| Path | Description |
|------|-------------|
| `/metrics` | Prometheus metrics |
| `/healthz` | Health check — returns `200 OK` if Redis is reachable, `500` otherwise |

## Configuration

The agent accepts two flags for metrics:

| Flag | Default | Description |
|------|---------|-------------|
| `-metrics-port` | `9100` | HTTP port for the metrics server |
| `-metrics-config` | _(empty)_ | Path to a custom metrics mapping YAML. When empty, the built-in default config is used |
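As a rough sketch of how the two flags and their defaults fit together, the following uses the standard library `flag` package. How the agent actually registers its flags is not shown in this document, so `metricsOptions`, `parseMetricsFlags`, and the example config path are assumptions made for illustration.

```go
package main

import (
	"flag"
	"fmt"
)

// metricsOptions is a hypothetical holder for the two documented flags.
type metricsOptions struct {
	Port       int
	ConfigPath string
}

// parseMetricsFlags parses args using the documented names and defaults.
func parseMetricsFlags(args []string) (metricsOptions, error) {
	var opts metricsOptions
	fs := flag.NewFlagSet("switch-agent-server", flag.ContinueOnError)
	fs.IntVar(&opts.Port, "metrics-port", 9100, "HTTP port for the metrics server")
	fs.StringVar(&opts.ConfigPath, "metrics-config", "", "path to a custom metrics mapping YAML")
	err := fs.Parse(args)
	return opts, err
}

func main() {
	// The config path below is an invented example value.
	opts, err := parseMetricsFlags([]string{"-metrics-port", "9200", "-metrics-config", "/etc/agent/metrics.yaml"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", opts)
}
```

An empty `ConfigPath` would then be the signal to fall back to the built-in default config.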

## Metric types

Metrics come from two sources:

### Built-in collectors

These require custom logic (cross-database joins, aggregate counting, error fallbacks) and are always registered.

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `sonic_switch_info` | gauge | `mac`, `firmware`, `hwsku`, `asic`, `platform` | Device metadata, always 1. Firmware and ASIC fall back to `/etc/sonic/sonic_version.yml` when absent from Redis |
| `sonic_switch_ready` | gauge | — | 1 if the switch is ready, 0 otherwise |
| `sonic_switch_interface_oper_state` | gauge | `interface` | Operational state (1=up, 0=down) |
| `sonic_switch_interface_admin_state` | gauge | `interface` | Admin state (1=up, 0=down) |
| `sonic_switch_interfaces_total` | gauge | `operational_status` | Number of interfaces by status |
| `sonic_switch_ports_total` | gauge | — | Total physical ports |
| `sonic_scrape_duration_seconds` | gauge | — | Duration of the last metrics scrape |

### Config-driven collectors

These are defined in YAML and can be customized or extended by operators. The default config ships the following metrics:

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `sonic_switch_transceiver_dom_temperature_celsius` | gauge | `interface` | Transceiver temperature |
| `sonic_switch_transceiver_dom_voltage_volts` | gauge | `interface` | Transceiver supply voltage |
| `sonic_switch_transceiver_dom_rx_power_dbm` | gauge | `interface`, `lane` | RX power per lane |
| `sonic_switch_transceiver_dom_tx_bias_milliamps` | gauge | `interface`, `lane` | TX bias current per lane |
| `sonic_switch_transceiver_dom_threshold` | gauge | `interface`, `sensor`, `level`, `direction` | DOM threshold values |
| `sonic_switch_transceiver_info` | gauge | `interface`, `type`, `vendor`, `model`, `serial` | Transceiver metadata, always 1 |
| `sonic_switch_transceiver_rxlos` | gauge | `interface`, `lane` | RX loss of signal per lane (1=loss, 0=ok) |
| `sonic_switch_transceiver_txfault` | gauge | `interface`, `lane` | TX fault per lane (1=fault, 0=ok) |
| `sonic_switch_interface_neighbor_info` | gauge | `interface`, `neighbor_mac`, `neighbor_name`, `neighbor_port` | LLDP neighbor metadata, always 1 |
| `sonic_switch_temperature_celsius` | gauge | `sensor` | Chassis temperature sensor reading |
| `sonic_switch_temperature_high_threshold_celsius` | gauge | `sensor` | Chassis temperature sensor high threshold |
| `sonic_switch_temperature_warning` | gauge | `sensor` | Chassis temperature warning status (1=warning, 0=ok) |
| `sonic_switch_interface_bytes_total` | counter | `interface`, `direction` | Bytes transferred |
| `sonic_switch_interface_packets_total` | counter | `interface`, `direction`, `type` | Packets by type (unicast, multicast, broadcast, non_unicast) |
| `sonic_switch_interface_errors_total` | counter | `interface`, `direction` | Interface error counters |
| `sonic_switch_interface_discards_total` | counter | `interface`, `direction` | Interface discard counters |
| `sonic_switch_interface_dropped_packets_total` | counter | `interface`, `direction` | SAI-level dropped packets |
| `sonic_switch_interface_fec_frames_total` | counter | `interface`, `type` | FEC frame counters (correctable, uncorrectable, symbol_errors) |
| `sonic_switch_interface_queue_length` | gauge | `interface` | Current output queue length |
| `sonic_switch_interface_pfc_packets_total` | counter | `interface`, `direction`, `priority` | PFC packets per priority (0-7) |
| `sonic_switch_interface_rx_packet_size_bytes` | histogram | `interface` | RX packet size distribution (buckets: 64, 127, 255, 511, 1023, 1518, 2047, 4095, 9216, 16383) |
| `sonic_switch_interface_tx_packet_size_bytes` | histogram | `interface` | TX packet size distribution (buckets: 64, 127, 255, 511, 1023, 1518, 2047, 4095, 9216, 16383) |
| `sonic_switch_interface_anomaly_packets_total` | counter | `interface`, `type` | Anomalous packets (undersize, oversize, fragments, jabbers, unknown_protos) |

## Metrics configuration schema

A custom config file replaces all config-driven metrics. The file is YAML with a single top-level key:

```yaml
metrics:
  - redis_db: ...
    key_pattern: ...
    fields:
      - metric: ...
        ...
```

### `metrics[]` — Metric mapping

Each entry maps a set of Redis keys to one or more Prometheus metrics.

| Field | Required | Default | Description |
|-------|----------|---------|-------------|
| `redis_db` | yes | — | SONiC Redis database name (`CONFIG_DB`, `STATE_DB`, `COUNTERS_DB`, `APPL_DB`) |
| `key_pattern` | yes | — | Redis `KEYS` glob pattern (e.g. `TRANSCEIVER_INFO\|*`) |
| `key_separator` | no | `\|` | Character separating the table prefix from the key suffix |
| `key_resolver` | no | — | Name of a Redis hash that maps logical names to key suffixes (e.g. `COUNTERS_PORT_NAME_MAP`) |
| `fields` | yes | — | List of field-to-metric mappings |
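To illustrate how `key_separator` and `key_resolver` interact, here is a small sketch. It assumes the key is split at the first occurrence of the separator and that the resolver hash (name to key suffix, as described above) is inverted for lookup; `keySuffix`, `invert`, and the sample OID are hypothetical and not taken from the agent's code.

```go
package main

import (
	"fmt"
	"strings"
)

// keySuffix returns the part of key after the first separator.
// Cutting at the first occurrence matters for COUNTERS_DB keys, whose
// suffix ("oid:0x...") itself contains the ":" separator.
func keySuffix(key, sep string) (string, bool) {
	_, suffix, found := strings.Cut(key, sep)
	return suffix, found
}

// invert turns the resolver hash (logical name -> key suffix) into a
// lookup table from key suffix back to logical name.
func invert(nameToSuffix map[string]string) map[string]string {
	out := make(map[string]string, len(nameToSuffix))
	for name, suffix := range nameToSuffix {
		out[suffix] = name
	}
	return out
}

func main() {
	// Sample content in the shape of COUNTERS_PORT_NAME_MAP (values invented).
	resolver := invert(map[string]string{"Ethernet0": "oid:0x1000000000002"})
	suffix, _ := keySuffix("COUNTERS:oid:0x1000000000002", ":")
	fmt.Println(suffix, "->", resolver[suffix])
}
```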

### `fields[]` — Field mapping

Each entry maps a Redis hash field (or set of fields) to a Prometheus metric.

| Field | Required | Default | Description |
|-------|----------|---------|-------------|
| `field` | no | — | Specific Redis hash field name. Mutually exclusive with `field_pattern` |
| `field_pattern` | no | — | Set to `*` to iterate all hash fields. Mutually exclusive with `field` |
| `metric` | yes | — | Prometheus metric name |
| `type` | yes | — | `gauge`, `counter`, or `histogram` |
| `help` | no | — | Metric help string |
| `value` | no | — | Fixed metric value (ignores field value). Use for `_info` pattern metrics |
| `labels` | no | — | Map of label names to [value templates](#label-value-templates) |
| `transform` | no | — | [Value transformation](#transforms) |

When neither `field` nor `field_pattern` is set, the metric is emitted once per key using `value` or label data from the hash.

### Label value templates

Label values are strings that can reference dynamic data using `$` prefixes:

| Template | Resolves to |
|----------|-------------|
| `$key_suffix` | Part of the Redis key after the separator (e.g. `Ethernet0` from `TRANSCEIVER_INFO\|Ethernet0`) |
| `$port_name` | Resolved name from `key_resolver` (e.g. `Ethernet0` resolved via `COUNTERS_PORT_NAME_MAP`) |
| `$field_name` | The Redis hash field name (useful with `field_pattern: "*"`) |
| `$<hash_field>` | Value of a hash field (e.g. `$vendor_name` reads the `vendor_name` field) |
| _(literal)_ | Any string without a `$` prefix is used as-is |
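The resolution rules in the table can be read as a simple lookup order. The sketch below is one possible reading, with the reserved names checked before falling back to a hash field; `labelContext` and `resolveLabel` are hypothetical names, and the behaviour for a missing hash field (empty string here) is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// labelContext carries the data a template may reference for one sample.
type labelContext struct {
	KeySuffix string
	PortName  string
	FieldName string
	Hash      map[string]string // fields of the Redis hash being read
}

// resolveLabel expands a label value template according to the table above.
func resolveLabel(tmpl string, c labelContext) string {
	if !strings.HasPrefix(tmpl, "$") {
		return tmpl // literal value, used as-is
	}
	switch tmpl {
	case "$key_suffix":
		return c.KeySuffix
	case "$port_name":
		return c.PortName
	case "$field_name":
		return c.FieldName
	}
	// Any other $name reads the hash field of that name
	// (assumption: a missing field resolves to the empty string).
	return c.Hash[strings.TrimPrefix(tmpl, "$")]
}

func main() {
	c := labelContext{
		KeySuffix: "Ethernet0",
		Hash:      map[string]string{"manufacturer": "ExampleCorp"},
	}
	for _, t := range []string{"$key_suffix", "$manufacturer", "rx"} {
		fmt.Printf("%s => %s\n", t, resolveLabel(t, c))
	}
}
```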

### Transforms

Transforms modify how the metric value is derived. Set at most one transform per field mapping; the exception is `regex_capture`, which can be combined with `map`.

#### `map`

Maps string field values to floats. Unmapped values are silently skipped.

```yaml
transform:
  map:
    up: 1
    down: 0
```

#### `regex_capture`

Matches field names against a Go regex with [named capture groups](https://pkg.go.dev/regexp/syntax). Non-matching fields are skipped. Capture group names become additional Prometheus labels. Requires `field_pattern: "*"`.

```yaml
field_pattern: "*"
metric: sonic_switch_transceiver_dom_rx_power_dbm
labels:
  interface: "$key_suffix"
transform:
  regex_capture:
    pattern: "^rx(?P<lane>\\d+)power$"
```

This matches `rx1power`, `rx2power`, etc. and produces a `lane` label holding the captured digits.

`regex_capture` can be combined with `map` to filter field names by regex while also converting string values. For example, to expose per-lane boolean fields as numeric gauges:

```yaml
field_pattern: "*"
metric: sonic_switch_transceiver_rxlos
labels:
  interface: "$key_suffix"
transform:
  regex_capture:
    pattern: "^rxlos(?P<lane>\\d+)$"
  map:
    "True": 1
    "False": 0
```

#### `parse_threshold_field`

Parses SONiC DOM threshold field names (e.g. `temphighalarm`) into three additional labels: `sensor`, `level`, and `direction`. Requires `field_pattern: "*"`.

```yaml
transform:
  parse_threshold_field: true
```

| Field name | sensor | level | direction |
|------------|--------|-------|-----------|
| `temphighalarm` | temperature | alarm | high |
| `vcclowwarning` | voltage | warning | low |
| `rxpowerhighwarning` | rx_power | warning | high |
| `txbiaslowalarm` | tx_bias | alarm | low |
| `txpowerhighalarm` | tx_power | alarm | high |
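The table suggests that a threshold field name decomposes into a sensor prefix, a direction, and a level. The sketch below parses names from the right under that assumption. The sensor prefix table is inferred only from the rows shown above, and `parseThresholdField` is a hypothetical function, not the agent's parser.

```go
package main

import (
	"fmt"
	"strings"
)

// sensors maps field-name prefixes to sensor label values
// (inferred from the documented examples only).
var sensors = map[string]string{
	"temp":    "temperature",
	"vcc":     "voltage",
	"rxpower": "rx_power",
	"txbias":  "tx_bias",
	"txpower": "tx_power",
}

// parseThresholdField splits e.g. "temphighalarm" into
// sensor="temperature", level="alarm", direction="high".
func parseThresholdField(name string) (sensor, level, direction string, ok bool) {
	switch {
	case strings.HasSuffix(name, "alarm"):
		level = "alarm"
	case strings.HasSuffix(name, "warning"):
		level = "warning"
	default:
		return "", "", "", false
	}
	rest := strings.TrimSuffix(name, level)

	switch {
	case strings.HasSuffix(rest, "high"):
		direction = "high"
	case strings.HasSuffix(rest, "low"):
		direction = "low"
	default:
		return "", "", "", false
	}
	rest = strings.TrimSuffix(rest, direction)

	s, known := sensors[rest]
	if !known {
		return "", "", "", false
	}
	return s, level, direction, true
}

func main() {
	for _, f := range []string{"temphighalarm", "vcclowwarning", "rxpowerhighwarning"} {
		s, l, d, _ := parseThresholdField(f)
		fmt.Println(f, s, l, d)
	}
}
```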

#### `dom_flag_severity`

Computes a severity rollup from all DOM flag fields in the hash. Each field is parsed as a threshold field name; if its value is `"true"`, it contributes to the severity. Returns the highest severity found: `0` (ok), `1` (warning), or `2` (alarm). Note: this transform is available but not included in the default config because the `TRANSCEIVER_DOM_FLAG` table is not present on all platforms.

```yaml
transform:
  dom_flag_severity: true
```
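The severity rollup can be pictured as a maximum over all flags that are set. The following sketch simplifies the field parsing to a suffix check (`alarm` or `warning`) and treats only the exact string `"true"` as set, as stated above; `domFlagSeverity` is a hypothetical name and the real transform may parse field names more strictly.

```go
package main

import (
	"fmt"
	"strings"
)

// domFlagSeverity returns 2 if any alarm flag is set, 1 if only warning
// flags are set, and 0 otherwise.
func domFlagSeverity(hash map[string]string) int {
	severity := 0
	for field, value := range hash {
		if value != "true" {
			continue // flag not raised
		}
		switch {
		case strings.HasSuffix(field, "alarm"):
			return 2 // highest possible severity, no need to look further
		case strings.HasSuffix(field, "warning"):
			severity = 1
		}
	}
	return severity
}

func main() {
	flags := map[string]string{
		"temphighalarm":      "false",
		"rxpowerhighwarning": "true",
	}
	fmt.Println(domFlagSeverity(flags)) // only a warning flag is raised
}
```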

#### `histogram`

Maps multiple Redis hash fields to a single Prometheus histogram. Each entry in `buckets` maps an upper bound (float64) to a Redis hash field name. The transform reads each field, parses the count as an unsigned integer, and adds the counts up in ascending order of upper bound to produce cumulative bucket counts. The resulting histogram has `sum=0` because SAI counters don't provide total bytes, but bucket-based percentile queries and heatmap visualizations still work. Requires `type: "histogram"`.

```yaml
- metric: sonic_switch_interface_rx_packet_size_bytes
  type: histogram
  help: "RX packet size distribution"
  labels:
    interface: "$port_name"
  transform:
    histogram:
      buckets:
        64: SAI_PORT_STAT_ETHER_IN_PKTS_64_OCTETS
        127: SAI_PORT_STAT_ETHER_IN_PKTS_65_TO_127_OCTETS
        255: SAI_PORT_STAT_ETHER_IN_PKTS_128_TO_255_OCTETS
        511: SAI_PORT_STAT_ETHER_IN_PKTS_256_TO_511_OCTETS
        1023: SAI_PORT_STAT_ETHER_IN_PKTS_512_TO_1023_OCTETS
        1518: SAI_PORT_STAT_ETHER_IN_PKTS_1024_TO_1518_OCTETS
```

This emits `_bucket`, `_count`, and `_sum` series automatically; the Prometheus client library adds the histogram suffixes.
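The step from per-range SAI counters to cumulative Prometheus buckets is easy to get wrong, so here is a small worked sketch. `bucket` and `cumulativeBuckets` are hypothetical names, and treating a missing or non-numeric field as an error is an assumption; the agent may handle that case differently.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
)

// bucket pairs a histogram upper bound with the Redis hash field that
// holds the packet count for that size range.
type bucket struct {
	UpperBound float64
	Field      string
}

// cumulativeBuckets converts per-range counters into cumulative bucket
// counts keyed by upper bound, and returns the total count as well.
func cumulativeBuckets(buckets []bucket, hash map[string]string) (map[float64]uint64, uint64, error) {
	sorted := append([]bucket(nil), buckets...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].UpperBound < sorted[j].UpperBound })

	out := make(map[float64]uint64, len(sorted))
	var total uint64
	for _, b := range sorted {
		n, err := strconv.ParseUint(hash[b.Field], 10, 64)
		if err != nil {
			return nil, 0, fmt.Errorf("field %s: %w", b.Field, err)
		}
		total += n // each bucket includes all smaller buckets
		out[b.UpperBound] = total
	}
	return out, total, nil
}

func main() {
	hash := map[string]string{
		"SAI_PORT_STAT_ETHER_IN_PKTS_64_OCTETS":         "10",
		"SAI_PORT_STAT_ETHER_IN_PKTS_65_TO_127_OCTETS":  "5",
		"SAI_PORT_STAT_ETHER_IN_PKTS_128_TO_255_OCTETS": "2",
	}
	buckets := []bucket{
		{64, "SAI_PORT_STAT_ETHER_IN_PKTS_64_OCTETS"},
		{127, "SAI_PORT_STAT_ETHER_IN_PKTS_65_TO_127_OCTETS"},
		{255, "SAI_PORT_STAT_ETHER_IN_PKTS_128_TO_255_OCTETS"},
	}
	cum, total, err := cumulativeBuckets(buckets, hash)
	if err != nil {
		panic(err)
	}
	fmt.Println(cum[64], cum[127], cum[255], total)
}
```

The returned map and total have the shape that `prometheus.NewConstHistogram` in `client_golang` accepts for its `buckets` and `count` arguments, with `sum` passed as `0`.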

## Examples

### Adding a new counter from COUNTERS_DB

```yaml
metrics:
  - redis_db: COUNTERS_DB
    key_pattern: "COUNTERS:*"
    key_separator: ":"
    key_resolver: COUNTERS_PORT_NAME_MAP
    fields:
      - field: SAI_PORT_STAT_IF_IN_UCAST_PKTS
        metric: sonic_switch_interface_unicast_packets_total
        type: counter
        help: "Total unicast packets received"
        labels:
          interface: "$port_name"
          direction: "rx"
```

### Exposing a string field as an enum gauge

```yaml
metrics:
  - redis_db: STATE_DB
    key_pattern: "PORT_TABLE|*"
    key_separator: "|"
    fields:
      - field: oper_status
        metric: sonic_switch_interface_oper_state
        type: gauge
        help: "Operational state of the interface"
        labels:
          interface: "$key_suffix"
        transform:
          map:
            up: 1
            down: 0
```

### Metadata as labels (info pattern)

```yaml
metrics:
  - redis_db: STATE_DB
    key_pattern: "TRANSCEIVER_INFO|*"
    key_separator: "|"
    fields:
      - metric: sonic_switch_transceiver_info
        type: gauge
        help: "Transceiver metadata"
        value: 1
        labels:
          interface: "$key_suffix"
          vendor: "$manufacturer"
          serial: "$serial"
```
6 changes: 4 additions & 2 deletions go.mod
@@ -4,10 +4,12 @@ go 1.24.5

require (
github.com/go-logr/logr v1.4.3
github.com/go-redis/redismock/v9 v9.2.0
github.com/ironcore-dev/controller-utils v0.11.0
github.com/jedib0t/go-pretty/v6 v6.7.8
github.com/onsi/ginkgo/v2 v2.28.1
github.com/onsi/gomega v1.39.1
github.com/prometheus/client_golang v1.23.2
github.com/redis/go-redis/v9 v9.18.0
github.com/spf13/cobra v1.10.2
github.com/spf13/pflag v1.0.10
@@ -18,6 +20,7 @@ require (
k8s.io/apimachinery v0.34.1
k8s.io/client-go v0.34.1
sigs.k8s.io/controller-runtime v0.22.3
sigs.k8s.io/yaml v1.6.0
)

require (
@@ -62,12 +65,12 @@ require (
github.com/grpc-ecosystem/grpc-gateway/v2 v2.27.3 // indirect
github.com/inconshreveable/mousetrap v1.1.0 // indirect
github.com/json-iterator/go v1.1.12 // indirect
github.com/kylelemons/godebug v1.1.0 // indirect
github.com/mattn/go-runewidth v0.0.16 // indirect
github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd // indirect
github.com/modern-go/reflect2 v1.0.3-0.20250322232337-35a7c28c31ee // indirect
github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822 // indirect
github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2 // indirect
github.com/prometheus/client_golang v1.23.2 // indirect
github.com/prometheus/client_model v0.6.2 // indirect
github.com/prometheus/common v0.67.1 // indirect
github.com/prometheus/procfs v0.19.1 // indirect
@@ -114,5 +117,4 @@ require (
sigs.k8s.io/json v0.0.0-20250730193827-2d320260d730 // indirect
sigs.k8s.io/randfill v1.0.0 // indirect
sigs.k8s.io/structured-merge-diff/v6 v6.3.0 // indirect
sigs.k8s.io/yaml v1.6.0 // indirect
)
8 changes: 8 additions & 0 deletions go.sum
@@ -78,6 +78,8 @@ github.com/go-openapi/swag/typeutils v0.25.1 h1:rD/9HsEQieewNt6/k+JBwkxuAHktFtH3
github.com/go-openapi/swag/typeutils v0.25.1/go.mod h1:9McMC/oCdS4BKwk2shEB7x17P6HmMmA6dQRtAkSnNb8=
github.com/go-openapi/swag/yamlutils v0.25.1 h1:mry5ez8joJwzvMbaTGLhw8pXUnhDK91oSJLDPF1bmGk=
github.com/go-openapi/swag/yamlutils v0.25.1/go.mod h1:cm9ywbzncy3y6uPm/97ysW8+wZ09qsks+9RS8fLWKqg=
github.com/go-redis/redismock/v9 v9.2.0 h1:ZrMYQeKPECZPjOj5u9eyOjg8Nnb0BS9lkVIZ6IpsKLw=
github.com/go-redis/redismock/v9 v9.2.0/go.mod h1:18KHfGDK4Y6c2R0H38EUGWAdc7ZQS9gfYxc94k7rWT0=
github.com/go-task/slim-sprig/v3 v3.0.0 h1:sUs3vkvUymDpBKi3qH1YSqBQk9+9D/8M2mN1vB6EwHI=
github.com/go-task/slim-sprig/v3 v3.0.0/go.mod h1:W848ghGpv3Qj3dhTPRyJypKRiqCdHZiAzKg9hl15HA8=
github.com/goccy/go-yaml v1.18.0 h1:8W7wMFS12Pcas7KU+VVkaiCng+kG8QiFeFwzFb+rwuw=
@@ -139,6 +141,10 @@ github.com/modern-go/reflect2 v1.0.3-0.20250322232337-35a7c28c31ee h1:W5t00kpgFd
github.com/modern-go/reflect2 v1.0.3-0.20250322232337-35a7c28c31ee/go.mod h1:yWuevngMOJpCy52FWWMvUC8ws7m/LJsjYzDa0/r8luk=
github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822 h1:C3w9PqII01/Oq1c1nUAm88MOHcQC9l5mIlSMApZMrHA=
github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822/go.mod h1:+n7T8mK8HuQTcFwEeznm/DIxMOiR9yIdICNftLE1DvQ=
github.com/nxadm/tail v1.4.8 h1:nPr65rt6Y5JFSKQO7qToXr7pePgD6Gwiw05lkbyAQTE=
github.com/nxadm/tail v1.4.8/go.mod h1:+ncqLTQzXmGhMZNUePPaPqPvBxHAIsmXswZKocGu+AU=
github.com/onsi/ginkgo v1.16.5 h1:8xi0RTUf59SOSfEtZMvwTvXYMzG4gV23XVHOZiXNtnE=
github.com/onsi/ginkgo v1.16.5/go.mod h1:+E8gABHa3K6zRBolWtd+ROzc/U5bkGt0FwiG042wbpU=
github.com/onsi/ginkgo/v2 v2.28.1 h1:S4hj+HbZp40fNKuLUQOYLDgZLwNUVn19N3Atb98NCyI=
github.com/onsi/ginkgo/v2 v2.28.1/go.mod h1:CLtbVInNckU3/+gC8LzkGUb9oF+e8W8TdUsxPwvdOgE=
github.com/onsi/gomega v1.39.1 h1:1IJLAad4zjPn2PsnhH70V4DKRFlrCzGBNrNaru+Vf28=
@@ -300,6 +306,8 @@ gopkg.in/evanphx/json-patch.v4 v4.13.0 h1:czT3CmqEaQ1aanPc5SdlgQrrEIb8w/wwCvWWnf
gopkg.in/evanphx/json-patch.v4 v4.13.0/go.mod h1:p8EYWUEYMpynmqDbY58zCKCFZw8pRWMG4EsWvDvM72M=
gopkg.in/inf.v0 v0.9.1 h1:73M5CoZyi3ZLMOyDlQh031Cx6N9NDJ2Vvfl76EDAgDc=
gopkg.in/inf.v0 v0.9.1/go.mod h1:cWUDdTG/fYaXco+Dcufb5Vnc6Gp2YChqWtbxRZE0mXw=
gopkg.in/tomb.v1 v1.0.0-20141024135613-dd632973f1e7 h1:uRGJdciOHaEIrze2W8Q3AKkepLTh2hOroT7a+7czfdQ=
gopkg.in/tomb.v1 v1.0.0-20141024135613-dd632973f1e7/go.mod h1:dt/ZhP58zS4L8KSrWDmTeBkI65Dw0HsyUHuEVlX15mw=
gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=
gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
2 changes: 1 addition & 1 deletion hack/boilerplate.go.txt
@@ -1,2 +1,2 @@
-// SPDX-FileCopyrightText: 2025 SAP SE or an SAP affiliate company and IronCore contributors
+// SPDX-FileCopyrightText: 2026 SAP SE or an SAP affiliate company and IronCore contributors
// SPDX-License-Identifier: Apache-2.0
2 changes: 1 addition & 1 deletion hack/license-header.txt
@@ -1,2 +1,2 @@
-SPDX-FileCopyrightText: 2025 SAP SE or an SAP affiliate company and IronCore contributors
+SPDX-FileCopyrightText: 2026 SAP SE or an SAP affiliate company and IronCore contributors
SPDX-License-Identifier: Apache-2.0