Alerts cannot recover automatically when PromForDuration=0 (duration is 0) and RecoverDuration>0 (observation duration is greater than 0), even after the alert condition is no longer met, so they stay active indefinitely (#2999)

### Question and Steps to reproduce

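When an alert rule is configured with PromForDuration=0 (duration of 0) and RecoverDuration>0 (observation duration greater than 0), the alert cannot recover automatically even after the alert condition is no longer met, so it stays active indefinitely.
In addition, even with PromForDuration>0, alerts that have already fired show the same problem after the Nightingale service restarts: they cannot recover automatically.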

#### Steps to reproduce

Scenario 1: PromForDuration=0

1. Create an alert rule with the following configuration:

       PromQL: sum by(region) (
           round(
               increase(
                   sermant_http_client_requests_seconds_count{
                       env='prod',
                       status='401',
                       uri=~'/inApps/v1/transactions/.*'
                   }[1m]
               )
           )
       ) > 5

       Evaluation interval: 60s
       Duration (PromForDuration): 0s                # key setting
       Observation duration (RecoverDuration): 180s  # key setting
       Repeat notification interval: 10 minutes
       Max send count: 3

2. Wait for the alert to fire and the notification to be sent.
3. Change the metric data so that the alert condition is no longer met (for example, the value drops below the threshold).
4. Wait more than 180 seconds (the observation duration).
5. Check the alert status.

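Scenario 2: service restart

1. Create an alert rule with PromForDuration=60s and RecoverDuration=180s.
2. Trigger the alert.
3. Restart the Nightingale service.
4. Make the alert condition no longer hold.
5. Wait more than 180 seconds.
6. Check the alert status.
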
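#### Expected behavior

Once the alert condition is no longer met and RecoverDuration has elapsed (for example, 180 seconds), the system should:

- Automatically mark the alert as recovered
- Remove it from the active alert list
- Send a recovery notification, if one is configured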
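
The expected timing can be made concrete with a small sketch. It assumes the recovery condition that this report quotes from RecoverSingle (recovery is allowed once `now - LastEvalTime >= RecoverDuration`). The helper name `canRecover` and the timestamps are hypothetical and exist only for illustration; this is not Nightingale code.

```go
package main

import "fmt"

// canRecover reports whether an alert whose condition was last abnormal at
// lastAbnormal (Unix seconds) may be recovered at time now, given an
// observation window of recoverDuration seconds.
// Hypothetical helper for illustration; not part of Nightingale.
func canRecover(now, lastAbnormal, recoverDuration int64) bool {
	return now-lastAbnormal >= recoverDuration
}

func main() {
	const (
		interval        = int64(60)  // evaluation interval from the repro rule
		recoverDuration = int64(180) // RecoverDuration from the repro rule
	)
	lastAbnormal := int64(0) // last evaluation at which the condition still held

	// Walk the following evaluations and print the expected decision at each.
	for now := interval; now <= 4*interval; now += interval {
		fmt.Printf("t=%ds recover=%v\n", now, canRecover(now, lastAbnormal, recoverDuration))
	}
}
```

Under these assumptions, the first evaluation at which recovery is allowed is the one at t=180s, which matches the "wait more than 180 seconds" step in both scenarios.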

#### Actual behavior

The alert stays active and never recovers automatically; it has to be deleted manually.
The following message appears in the logs:

    rule_eval:xxx event:xxx do not has pending event, not recover

#### Root cause analysis

Code location: alert/process/process.go

##### Problem 1: pendingsUseByRecover is not written when PromForDuration=0

In the handleEvent function (lines 393-398):

    if p.rule.PromForDuration == 0 {
        fireEvents = append(fireEvents, event)
        if severity > event.Severity {
            severity = event.Severity
        }
        continue // moves on to the next event; pendingsUseByRecover is never written
    }

When PromForDuration=0, the code executes continue and skips the subsequent pendingsUseByRecover.Set() logic (lines 408-409).

##### Problem 2: RecoverSingle depends on pendingsUseByRecover

In the RecoverSingle function (lines 348-354):

    if cachedRule.RecoverDuration > 0 {
        lastPendingEvent, has := p.pendingsUseByRecover.Get(hash)
        if !has {
            // no abnormal point was ever recorded, so there is nothing to recover
            logger.Debugf("rule_eval:%s event:%v do not has pending event, not recover", p.Key(), event)
            return // returns immediately and refuses to recover
        }

        if now-lastPendingEvent.LastEvalTime < cachedRule.RecoverDuration {
            logger.Debugf("rule_eval:%s event:%v not recover", p.Key(), event)
            return
        }
    }

The recovery logic needs LastEvalTime from pendingsUseByRecover to decide whether the observation duration has elapsed. If no entry is found (has=false), it refuses to recover.
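
The interaction between Problem 1 and Problem 2 can be reproduced in a simplified model. The sketch below is not the Nightingale implementation: the `event` and `processor` types, the plain maps standing in for the event maps, and the method bodies are assumptions that mirror only the control flow quoted in this report. The for-duration wait is deliberately not modelled.

```go
package main

import "fmt"

// event is a stripped-down stand-in for an alert event (hypothetical type).
type event struct {
	Hash         string
	LastEvalTime int64 // Unix seconds of the evaluation that produced it
}

// processor is a simplified model of the state discussed in this report.
type processor struct {
	forDuration          int64 // PromForDuration, seconds
	recoverDuration      int64 // RecoverDuration, seconds
	fires                map[string]event
	pendingsUseByRecover map[string]event
}

func newProcessor(forDuration, recoverDuration int64) *processor {
	return &processor{
		forDuration:          forDuration,
		recoverDuration:      recoverDuration,
		fires:                map[string]event{},
		pendingsUseByRecover: map[string]event{},
	}
}

// handleEvent mirrors the reported control flow: with forDuration == 0 the
// event fires immediately and the pendingsUseByRecover write is skipped.
func (p *processor) handleEvent(e event) {
	if p.forDuration == 0 {
		p.fires[e.Hash] = e
		return
	}
	p.pendingsUseByRecover[e.Hash] = e
	p.fires[e.Hash] = e // simplification: the for-duration wait is not modelled
}

// recoverSingle mirrors the reported guard and reports whether the alert
// was recovered at time now.
func (p *processor) recoverSingle(hash string, now int64) bool {
	if p.recoverDuration > 0 {
		last, has := p.pendingsUseByRecover[hash]
		if !has {
			return false // the "do not has pending event, not recover" branch
		}
		if now-last.LastEvalTime < p.recoverDuration {
			return false // still inside the observation window
		}
	}
	delete(p.fires, hash)
	delete(p.pendingsUseByRecover, hash)
	return true
}

func main() {
	for _, forDuration := range []int64{0, 60} {
		p := newProcessor(forDuration, 180)
		p.handleEvent(event{Hash: "h1", LastEvalTime: 0})
		// Try to recover long after the observation window has passed.
		fmt.Printf("PromForDuration=%d recovered=%v\n", forDuration, p.recoverSingle("h1", 3600))
	}
}
```

In this model the rule with PromForDuration=0 is never recovered no matter how much time passes, while the rule with PromForDuration=60 recovers as soon as the window has elapsed.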

##### Problem 3: pendingsUseByRecover is not restored on service restart

In the RecoverAlertCurEventFromDb function (lines 493-495):

    func (p *Processor) RecoverAlertCurEventFromDb() {
        p.pendings = NewAlertCurEventMap(nil)
        p.pendingsUseByRecover = NewAlertCurEventMap(nil) // initialized empty

        // ... restores fires from the database
        // but does not restore pendingsUseByRecover
    }

On restart, only fires is restored from the database; pendingsUseByRecover remains empty.
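
One possible mitigation for the restart case, sketched here as an assumption rather than as the project's fix, is to seed pendingsUseByRecover from the events restored from the database. The types, the `restoreFromDB` function, and its `seed` switch are hypothetical and only model the behavior described above.

```go
package main

import "fmt"

// event is a stripped-down stand-in for a stored alert event (hypothetical).
type event struct {
	Hash         string
	LastEvalTime int64 // Unix seconds
}

type processor struct {
	recoverDuration      int64
	fires                map[string]event
	pendingsUseByRecover map[string]event
}

// restoreFromDB rebuilds in-memory state from the active events loaded from
// the database. When seed is true, each restored event is also written to
// pendingsUseByRecover so the recovery guard can find a LastEvalTime.
func restoreFromDB(active []event, recoverDuration int64, seed bool) *processor {
	p := &processor{
		recoverDuration:      recoverDuration,
		fires:                map[string]event{},
		pendingsUseByRecover: map[string]event{},
	}
	for _, e := range active {
		p.fires[e.Hash] = e
		if seed {
			p.pendingsUseByRecover[e.Hash] = e
		}
	}
	return p
}

// canRecover applies the guard quoted from RecoverSingle.
func (p *processor) canRecover(hash string, now int64) bool {
	if p.recoverDuration > 0 {
		last, has := p.pendingsUseByRecover[hash]
		if !has || now-last.LastEvalTime < p.recoverDuration {
			return false
		}
	}
	return true
}

func main() {
	active := []event{{Hash: "h1", LastEvalTime: 1000}}
	for _, seed := range []bool{false, true} {
		p := restoreFromDB(active, 180, seed)
		fmt.Printf("seed=%v recoverable at t=1200: %v\n", seed, p.canRecover("h1", 1200))
	}
}
```

Whether the observation window should start from the stored LastEvalTime or from the restart time is a design choice. Using the restart time would be more conservative, because it would always wait a full RecoverDuration after the service comes back.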
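
#### Impact

Configuration:
- Every alert rule configured with PromForDuration=0 and RecoverDuration>0 is affected
- This is a common combination (alert immediately, but observe for a while before recovering)

Operations:
- After a service restart, no active alert with RecoverDuration>0 can recover automatically
- Active alerts have to be cleaned up manually in the database

User experience:
- The alert list accumulates alerts that have actually recovered but were never cleared by the system
- Recovery notifications are never received
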
#### Suggested solution
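
The report leaves this section empty. As a starting point for discussion only, one option for Problem 1 is to record the event in pendingsUseByRecover before the PromForDuration == 0 branch, so that RecoverSingle always finds a LastEvalTime. The sketch below uses hypothetical types and a hypothetical `handleEvents` function; it illustrates the ordering and is not a patch against the real handleEvent.

```go
package main

import "fmt"

// event is a stripped-down stand-in for an alert event (hypothetical type).
type event struct {
	Hash         string
	LastEvalTime int64 // Unix seconds
}

type processor struct {
	forDuration          int64 // PromForDuration, seconds
	pendingsUseByRecover map[string]event
}

// handleEvents is a hypothetical reordering of the loop described in
// Problem 1: the latest abnormal evaluation is recorded first, and only
// then does the forDuration == 0 shortcut apply.
func (p *processor) handleEvents(events []event) (fireEvents []event) {
	for _, e := range events {
		// Always remember the most recent abnormal evaluation.
		p.pendingsUseByRecover[e.Hash] = e

		if p.forDuration == 0 {
			fireEvents = append(fireEvents, e)
			continue
		}
		// The pending-to-firing handling for forDuration > 0 is omitted here.
	}
	return fireEvents
}

func main() {
	p := &processor{forDuration: 0, pendingsUseByRecover: map[string]event{}}
	fired := p.handleEvents([]event{
		{Hash: "h1", LastEvalTime: 100},
		{Hash: "h1", LastEvalTime: 160},
	})
	last, has := p.pendingsUseByRecover["h1"]
	fmt.Printf("fired=%d tracked=%v lastEval=%d\n", len(fired), has, last.LastEvalTime)
}
```

With this ordering, the tracked LastEvalTime follows the most recent abnormal evaluation, so the observation window would be measured from the last time the condition held.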


### Relevant logs and configurations

```text
DEBUG process/process.go:352 rule_eval:alert-1-154 xxxxxxxx do not has pending event, not recover
```

### Version

Nightingale version: v7.0 (fill in the actual version in use)
Deployment method: binary
Data source type: Prometheus
