Commit 7947697

sradco and Cursor AI Agent committed
k8s: add relabel config, relabeled rules, alerting health and AlertingRule CRD support
Add remaining k8s layer capabilities:

- AlertRelabelConfig CRUD operations for OpenShift AlertRelabelConfig CRs
- RelabeledRules: applies Prometheus relabel configs to derive effective alert rules with source/management labels
- AlertingHealth: route reachability checks for platform and user-workload Prometheus/Alertmanager endpoints
- AlertingRule CRUD operations for OpenShift AlertingRule CRs
- ExternalManagement label helpers

Also adds supporting leaf packages:

- pkg/managementlabels: management label constants and helpers
- pkg/classification: alert classification validation
- pkg/alert_rule: alert rule ID generation and parsing

Signed-off-by: avlitman <alitman@redhat.com>
Signed-off-by: Shirly Radco <sradco@redhat.com>
Signed-off-by: machadovilaca <machadovilaca@gmail.com>
Co-authored-by: Cursor AI Agent <cursor-ai@users.noreply.github.com>
Made-with: Cursor
1 parent 8828c5a commit 7947697

9 files changed, 1001 additions & 0 deletions

Lines changed: 66 additions & 0 deletions
# Classification Overrides: Proposed Change Summary

## What is changing

Replace ConfigMap-based classification storage with labels stored directly on the rule resource, using AlertRelabelConfig (ARC) CRs only for rules that cannot be directly modified.

## Before and After

| Rule type | Management | Before (ConfigMap) | After (Labels + ARC) |
|---|---|---|---|
| User-defined PrometheusRule | Unmanaged | ConfigMap in plugin namespace | Labels directly on the PrometheusRule |
| User-defined PrometheusRule | Operator-managed | ConfigMap in plugin namespace | ARC in `openshift-user-workload-monitoring` |
| User-defined PrometheusRule | GitOps-managed | ConfigMap in plugin namespace | Block (user adds labels in Git) |
| Platform via AlertingRule | Unmanaged | ConfigMap in plugin namespace | Labels directly on the AlertingRule |
| Platform via AlertingRule | Operator-managed | ConfigMap in plugin namespace | ARC in `openshift-monitoring` |
| Platform via AlertingRule | GitOps-managed | ConfigMap in plugin namespace | Block (user adds labels in Git) |
| Platform without AlertingRule | Operator-managed | ConfigMap in plugin namespace | ARC in `openshift-monitoring` |

### Storage comparison

| Aspect | Before | After |
|---|---|---|
| Where classification is stored | ConfigMaps in plugin namespace | Labels on the rule itself (or ARC for operator-managed) |
| New k8s interfaces | `ConfigMapInterface` (3 methods) | None |
| New files | 5 files | 0 (reuses existing label update paths) |
| Classification visibility | API response only | Everywhere (kubectl, Prometheus, Alertmanager) |
| ARC model | N/A | Per-alert-rule (same as existing label changes) |
| Provenance tracking | Implicit (separate data store) | `openshift_io_alert_rule_classification_managed_by` label |

## Pros

1. **Simpler implementation** -- removes 5 files and the ConfigMap subsystem; classification reuses the existing label update code paths
2. **Classification visible beyond the API** -- labels on the rule are visible to kubectl, Prometheus, Alertmanager, and external tools (Grafana, PagerDuty)
3. **No new k8s interfaces** -- no `ConfigMapInterface`, no ConfigMap mock in tests
4. **Consistent with label changes** -- classification follows the exact same branching as other label updates (direct for editable rules, ARC for operator-managed, block for GitOps)
5. **Clear provenance** -- the `managed_by` label explicitly tracks whether the classification was set via the API
6. **Zero migration** -- new feature, no existing data to migrate; existing clusters with custom ARCs or alerts are unaffected

## Cons

1. **`_from` (dynamic derivation) is limited** -- `componentFrom`/`layerFrom` is only supported for ARC-based rules (operator-managed platform). Directly editable rules support only static component/layer. This still covers the main use case (CVO-style platform alerts).
2. **RBAC addition for user-workload ARCs** -- operator-managed user-defined PrometheusRules require a Role + RoleBinding for ARCs in `openshift-user-workload-monitoring` (one namespace, not cluster-wide)
3. **GitOps-managed rules are blocked** -- users must add classification labels in Git. An ARC side channel for GitOps rules can be added later if requested.
4. **No provenance distinction for external label setters** -- if someone sets `openshift_io_alert_rule_component` via kubectl or their own ARC (without `managed_by`), the API treats it as "rule-defined." This is functionally correct, but the UI won't show it as a "user override."

## RBAC changes needed

A Role + RoleBinding in `openshift-user-workload-monitoring` for the monitoring-plugin ServiceAccount:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: monitoring-plugin-arc-manager
  namespace: openshift-user-workload-monitoring
rules:
- apiGroups: ["monitoring.openshift.io"]
  resources: ["alertrelabelconfigs"]
  verbs: ["get", "list", "watch", "create", "update", "delete"]
```

This covers operator-managed user-defined PrometheusRules. All other rule types require no new RBAC.

## Impact on existing clusters

None. Classification is only activated when a user explicitly classifies a rule via the API. Existing ARCs, PrometheusRules, and alerts are unaffected.
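For illustration, a hedged sketch of the "labels directly on the rule" shape described above. The rule, namespace, and label values here are hypothetical; only the `openshift_io_*` label names are taken from this proposal:

```yaml
# Hypothetical user-defined PrometheusRule after classification via the API.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-rules        # hypothetical
  namespace: my-app          # hypothetical
spec:
  groups:
  - name: example
    rules:
    - alert: HighErrorRate   # hypothetical alert
      expr: rate(http_errors_total[5m]) > 0.05
      for: 10m
      labels:
        severity: warning
        # Classification labels written by the plugin (names from this proposal):
        openshift_io_alert_rule_component: my-app
        openshift_io_alert_rule_classification_managed_by: monitoring-plugin
```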

pkg/alert_rule/alert_rule.go

Lines changed: 83 additions & 0 deletions
```go
package alertrule

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
	"sort"
	"strings"
	"unicode/utf8"

	"github.com/openshift/monitoring-plugin/pkg/classification"
	"github.com/openshift/monitoring-plugin/pkg/managementlabels"
	monitoringv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
)

// GetAlertingRuleId returns a stable "rid_"-prefixed identifier for a rule,
// derived from a SHA-256 hash of its canonical payload. It returns "" for
// rules that are neither alerting nor recording rules.
func GetAlertingRuleId(alertRule *monitoringv1.Rule) string {
	var name string
	var kind string
	if alertRule.Alert != "" {
		name = alertRule.Alert
		kind = "alert"
	} else if alertRule.Record != "" {
		name = alertRule.Record
		kind = "record"
	} else {
		return ""
	}

	expr := normalizeExpr(alertRule.Expr.String())
	forDuration := ""
	if alertRule.For != nil {
		forDuration = strings.TrimSpace(string(*alertRule.For))
	}

	labelsBlock := normalizedBusinessLabelsBlock(alertRule.Labels)

	// The canonical payload is intentionally derived from the rule spec (expr/for/labels)
	// and identity (kind/name), and excludes annotations and openshift_io_* provenance/system labels.
	canonicalPayload := strings.Join([]string{kind, name, expr, forDuration, labelsBlock}, "\n---\n")

	// Generate a SHA-256 hash of the canonical payload.
	hash := sha256.Sum256([]byte(canonicalPayload))

	return "rid_" + base64.RawURLEncoding.EncodeToString(hash[:])
}

func normalizeExpr(expr string) string {
	// Collapse consecutive whitespace so cosmetic formatting changes do not churn IDs.
	return strings.Join(strings.Fields(strings.TrimSpace(expr)), " ")
}

// normalizedBusinessLabelsBlock renders user-facing ("business") labels as
// sorted "key=value" lines, skipping system, invalid, and empty labels.
func normalizedBusinessLabelsBlock(in map[string]string) string {
	if len(in) == 0 {
		return ""
	}

	lines := make([]string, 0, len(in))
	for k, v := range in {
		key := strings.TrimSpace(k)
		if key == "" {
			continue
		}
		if strings.HasPrefix(key, "openshift_io_") || key == managementlabels.AlertNameLabel {
			// Skip system labels.
			continue
		}
		if !classification.ValidatePromLabelName(key) {
			continue
		}
		if v == "" {
			// Align with specHash behavior: drop empty values.
			continue
		}
		if !utf8.ValidString(v) {
			continue
		}

		lines = append(lines, fmt.Sprintf("%s=%s", key, v))
	}

	sort.Strings(lines)
	return strings.Join(lines, "\n")
}
```

pkg/k8s/alert_relabel_config.go

Lines changed: 99 additions & 0 deletions
```go
package k8s

import (
	"context"
	"fmt"

	osmv1 "github.com/openshift/api/monitoring/v1"
	osmv1client "github.com/openshift/client-go/monitoring/clientset/versioned"
	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/tools/cache"
)

// alertRelabelConfigManager provides CRUD access to AlertRelabelConfig CRs,
// backed by an all-namespaces informer for cheap List calls.
type alertRelabelConfigManager struct {
	clientset   *osmv1client.Clientset
	arcInformer cache.SharedIndexInformer
}

func newAlertRelabelConfigManager(ctx context.Context, clientset *osmv1client.Clientset) (*alertRelabelConfigManager, error) {
	arcInformer := cache.NewSharedIndexInformer(
		alertRelabelConfigListWatchForAllNamespaces(clientset),
		&osmv1.AlertRelabelConfig{},
		0,
		cache.Indexers{},
	)

	arcm := &alertRelabelConfigManager{
		clientset:   clientset,
		arcInformer: arcInformer,
	}

	go arcm.arcInformer.Run(ctx.Done())

	cache.WaitForNamedCacheSync("AlertRelabelConfig informer", ctx.Done(),
		arcm.arcInformer.HasSynced,
	)

	return arcm, nil
}

func alertRelabelConfigListWatchForAllNamespaces(clientset *osmv1client.Clientset) *cache.ListWatch {
	return cache.NewListWatchFromClient(clientset.MonitoringV1().RESTClient(), "alertrelabelconfigs", "", fields.Everything())
}

// List returns AlertRelabelConfigs from the informer cache. Note: the
// namespace parameter is currently unused; the informer watches all
// namespaces and the full cached set is returned.
func (arcm *alertRelabelConfigManager) List(ctx context.Context, namespace string) ([]osmv1.AlertRelabelConfig, error) {
	arcs := arcm.arcInformer.GetStore().List()

	alertRelabelConfigs := make([]osmv1.AlertRelabelConfig, 0, len(arcs))
	for _, item := range arcs {
		arc, ok := item.(*osmv1.AlertRelabelConfig)
		if !ok {
			continue
		}
		alertRelabelConfigs = append(alertRelabelConfigs, *arc)
	}

	return alertRelabelConfigs, nil
}

// Get reads directly from the API server (not the informer cache) and
// reports existence via the bool return instead of a NotFound error.
func (arcm *alertRelabelConfigManager) Get(ctx context.Context, namespace string, name string) (*osmv1.AlertRelabelConfig, bool, error) {
	arc, err := arcm.clientset.MonitoringV1().AlertRelabelConfigs(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		if errors.IsNotFound(err) {
			return nil, false, nil
		}

		return nil, false, err
	}

	return arc, true, nil
}

func (arcm *alertRelabelConfigManager) Create(ctx context.Context, arc osmv1.AlertRelabelConfig) (*osmv1.AlertRelabelConfig, error) {
	created, err := arcm.clientset.MonitoringV1().AlertRelabelConfigs(arc.Namespace).Create(ctx, &arc, metav1.CreateOptions{})
	if err != nil {
		return nil, fmt.Errorf("failed to create AlertRelabelConfig %s/%s: %w", arc.Namespace, arc.Name, err)
	}

	return created, nil
}

func (arcm *alertRelabelConfigManager) Update(ctx context.Context, arc osmv1.AlertRelabelConfig) error {
	_, err := arcm.clientset.MonitoringV1().AlertRelabelConfigs(arc.Namespace).Update(ctx, &arc, metav1.UpdateOptions{})
	if err != nil {
		return fmt.Errorf("failed to update AlertRelabelConfig %s/%s: %w", arc.Namespace, arc.Name, err)
	}

	return nil
}

func (arcm *alertRelabelConfigManager) Delete(ctx context.Context, namespace string, name string) error {
	err := arcm.clientset.MonitoringV1().AlertRelabelConfigs(namespace).Delete(ctx, name, metav1.DeleteOptions{})
	if err != nil {
		return fmt.Errorf("failed to delete AlertRelabelConfig %s/%s: %w", namespace, name, err)
	}

	return nil
}
```
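For context, an AlertRelabelConfig CR of the kind this manager creates might look like the sketch below. This is illustrative only: the metadata, alert name, and label values are hypothetical, and the exact `spec.configs` field set should be checked against the monitoring.openshift.io/v1 API before use:

```yaml
apiVersion: monitoring.openshift.io/v1
kind: AlertRelabelConfig
metadata:
  name: classify-high-error-rate   # hypothetical
  namespace: openshift-monitoring
spec:
  configs:
  - sourceLabels: [alertname]      # match a single alert by name
    regex: HighErrorRate           # hypothetical alert name
    targetLabel: openshift_io_alert_rule_component
    replacement: my-component      # hypothetical component value
```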

pkg/k8s/alerting_health.go

Lines changed: 127 additions & 0 deletions
```go
package k8s

import (
	"context"
	"fmt"
	"strings"
	"sync"

	"gopkg.in/yaml.v2"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

const (
	clusterMonitoringConfigMap = "cluster-monitoring-config"
	clusterMonitoringConfigKey = "config.yaml"
)

type clusterMonitoringConfig struct {
	EnableUserWorkload bool `yaml:"enableUserWorkload"`
}

// clusterMonitoringConfigManager watches the cluster-monitoring-config ConfigMap
// via an informer and caches the parsed enableUserWorkload value so that
// AlertingHealth never needs a live API call.
type clusterMonitoringConfigManager struct {
	informer cache.SharedIndexInformer

	mu      sync.RWMutex
	enabled bool
	err     error
}

func newClusterMonitoringConfigManager(ctx context.Context, clientset *kubernetes.Clientset) (*clusterMonitoringConfigManager, error) {
	informer := cache.NewSharedIndexInformer(
		cache.NewListWatchFromClient(
			clientset.CoreV1().RESTClient(),
			"configmaps",
			ClusterMonitoringNamespace,
			fields.OneTermEqualSelector("metadata.name", clusterMonitoringConfigMap),
		),
		&corev1.ConfigMap{},
		0,
		cache.Indexers{},
	)

	m := &clusterMonitoringConfigManager{
		informer: informer,
	}

	_, err := informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			cm, ok := obj.(*corev1.ConfigMap)
			if !ok {
				return
			}
			m.handleUpdate(cm)
		},
		UpdateFunc: func(_, newObj interface{}) {
			cm, ok := newObj.(*corev1.ConfigMap)
			if !ok {
				return
			}
			m.handleUpdate(cm)
		},
		DeleteFunc: func(_ interface{}) {
			m.mu.Lock()
			defer m.mu.Unlock()
			m.enabled = false
			m.err = nil
		},
	})
	if err != nil {
		return nil, fmt.Errorf("failed to add event handler to cluster-monitoring-config informer: %w", err)
	}

	go informer.Run(ctx.Done())

	cache.WaitForNamedCacheSync("ClusterMonitoringConfig informer", ctx.Done(),
		informer.HasSynced,
	)

	return m, nil
}

func (m *clusterMonitoringConfigManager) handleUpdate(cm *corev1.ConfigMap) {
	m.mu.Lock()
	defer m.mu.Unlock()

	raw, ok := cm.Data[clusterMonitoringConfigKey]
	if !ok || strings.TrimSpace(raw) == "" {
		m.enabled = false
		m.err = nil
		return
	}

	var cfg clusterMonitoringConfig
	if err := yaml.Unmarshal([]byte(raw), &cfg); err != nil {
		m.enabled = false
		m.err = fmt.Errorf("parse cluster monitoring config.yaml: %w", err)
		return
	}

	m.enabled = cfg.EnableUserWorkload
	m.err = nil
}

func (m *clusterMonitoringConfigManager) userWorkloadEnabled() (bool, error) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return m.enabled, m.err
}

// AlertingHealth returns alerting route health and UWM enablement status.
func (c *client) AlertingHealth(ctx context.Context) (AlertingHealth, error) {
	health := c.prometheusAlerts.alertingHealth(ctx)

	enabled, err := c.clusterMonitoringConfig.userWorkloadEnabled()
	if err != nil {
		return health, fmt.Errorf("failed to determine user workload enablement: %w", err)
	}
	health.UserWorkloadEnabled = enabled

	return health, nil
}
```
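The manager's concurrency pattern (informer callbacks write under a write lock; readers take a read lock and never touch the API) can be reduced to a small stdlib-only sketch. `cachedFlag` is an illustrative name, not part of this package:

```go
package main

import (
	"fmt"
	"sync"
)

// cachedFlag holds the latest parsed value plus any parse error, guarded by
// an RWMutex so many readers can check it concurrently between updates.
type cachedFlag struct {
	mu      sync.RWMutex
	enabled bool
	err     error
}

// set is what an informer Add/Update/Delete callback would call.
func (c *cachedFlag) set(enabled bool, err error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.enabled = enabled
	c.err = err
}

// get is the read path, analogous to userWorkloadEnabled above: no API
// call, just a read-locked copy of the cached state.
func (c *cachedFlag) get() (bool, error) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.enabled, c.err
}

func main() {
	var f cachedFlag
	f.set(true, nil) // e.g. config.yaml parsed with enableUserWorkload: true
	enabled, err := f.get()
	fmt.Println(enabled, err)
}
```

An RWMutex fits here because reads (every AlertingHealth request) vastly outnumber writes (ConfigMap changes), so readers never block each other.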
