Document the Rule Query Inspector for threshold rules #6746
Draft: nastasha-solomon wants to merge 4 commits into main from issue-6555
Commits:
- 48e196e first draft (nastasha-solomon)
- 878044c Merge branch 'main' into issue-6555 (nastasha-solomon)
- f3e1d5a Update solutions/observability/incident-management/create-manage-rule… (nastasha-solomon)
- 5b03b56 Update solutions/observability/incident-management/triage-threshold-b… (nastasha-solomon)
234 changes: 234 additions & 0 deletions
explore-analyze/alerting/alerts/inspect-rule-queries.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,234 @@ | ||
| --- | ||
| navigation_title: Diagnose rule behavior | ||
| description: Use the rule query inspector to view the Elasticsearch request behind a rule and diagnose why an alert did or didn't fire. | ||
| applies_to: | ||
| stack: ga 9.5 | ||
| serverless: ga | ||
| products: | ||
| - id: kibana | ||
| --- | ||
|
|
||
| # Diagnose rule behavior with the rule query inspector [inspect-rule-queries] | ||
|
|
||
| The rule query inspector lets you view the {{es}} request that a rule sends when it evaluates your data. Use it to understand the query structure, confirm the rule is targeting the right data, and diagnose why an alert did or didn't fire. | ||
|
|
||
| ::::{note} | ||
| :applies_to: {"stack": "ga 9.5", "serverless": "ga"} | ||
| Currently, the rule query inspector is only available for **custom threshold rules**. | ||
| :::: | ||
|
|
||
| ## Access the inspector [inspect-access] | ||
|
|
||
| The inspector is available from two places, each showing a different query: | ||
|
|
||
| **From the rule details page (current rule parameters)** | ||
| : Open **{{stack-manage-app}}** > **{{rules-ui}}**, find your rule, and click its name to open the rule details page. Click **Rule query inspector**. The inspector builds the query from the rule's _current_ parameters. Use this view to verify that the rule is configured correctly and would match the data you expect. | ||
|
|
||
| **From an alert details page (historical parameters)** | ||
| : Go to the **Alerts** page, then open an individual alert. Click **Rule query inspector**. The inspector uses the rule parameters _as they existed when that specific alert fired_, including the exact evaluation time range. Use this view to understand why a particular alert was or wasn't triggered. | ||
|
|
||
| The key difference: the rule details page reflects the rule as it is _now_, while the alert details page reflects the rule as it was _then_. If you've edited the rule since an alert fired, the two inspectors will show different queries. | ||
|
|
||
| ## Anatomy of the query [inspect-query-anatomy] | ||
|
|
||
| The following sections describe the query structure for **custom threshold rules**. As support for additional rule types is added, this reference will expand. | ||
|
|
||
| The inspector displays the full {{es}} request. Each part of the query maps to a setting in your rule configuration. | ||
|
|
||
| ### Index and time range [inspect-anatomy-index-time] | ||
|
|
||
| The top-level index and `range` filter reflect your rule's data source and time window: | ||
|
|
||
| ```json | ||
| { | ||
| "index": ["<your-data-view-index-pattern>"], | ||
| "body": { | ||
| "query": { | ||
| "bool": { | ||
| "filter": [ | ||
| { | ||
| "range": { | ||
| "@timestamp": { <1> | ||
| "gte": "...", <2> | ||
| "lte": "..." <3> | ||
| } | ||
| } | ||
| } | ||
| ] | ||
| } | ||
| } | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| 1. The time field from your data view. | ||
| 2. The start of the evaluation window (`now` minus your rule's **time window** setting). | ||
| 3. The end of the evaluation window. From an alert details page, this matches the exact moment the alert was evaluated, not the current time. | ||
|
|
||
| When you open the inspector from an alert details page, the time range shows the exact window {{es}} searched when the alert was evaluated. If the range looks unexpected, compare it with the rule's **time window** setting. This can help explain alerts that seem outdated or cover an unexpected period. | ||
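
To check a time range by hand, you can recompute the window from the evaluation time and the rule's time window. The following is a minimal sketch, not the rule's actual implementation: the `evaluation_window` helper, the unit abbreviations, and the sample timestamp are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical mapping from time-window unit abbreviations to timedelta arguments.
UNIT_TO_KWARG = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def evaluation_window(evaluated_at, size, unit):
    """Return the (gte, lte) bounds of a window that ends at evaluated_at."""
    start = evaluated_at - timedelta(**{UNIT_TO_KWARG[unit]: size})
    return start.isoformat(), evaluated_at.isoformat()


# Assumed example: a rule with a 5-minute time window evaluated at 12:00 UTC.
evaluated_at = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
gte, lte = evaluation_window(evaluated_at, 5, "m")
print(gte, lte)
```

If the `gte` and `lte` values in the inspector differ from what you compute this way, check whether the rule's time window was edited after the alert fired.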
|
|
||
| ### Query filter [inspect-anatomy-query-filter] | ||
|
|
||
| If you set a **query filter** on the rule, it appears as an additional clause in the `bool` filter: | ||
|
|
||
| ```json | ||
| { | ||
| "query": { | ||
| "bool": { | ||
| "filter": [ | ||
| { "range": { "@timestamp": { ... } } }, | ||
| { "query_string": { "query": "host.name: host-1" } } <1> | ||
| ] | ||
| } | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| 1. The KQL query filter you set on the rule, translated to a `query_string` or `term` clause. If this filter excludes more data than expected, the rule won't find the documents you intended. | ||
|
|
||
| If the filter is missing or different from what you set, double-check the rule configuration. | ||
|
|
||
| ### Aggregations [inspect-anatomy-aggregations] | ||
|
|
||
| Each criterion you defined in the rule becomes an aggregation in the query. A rule with two criteria (for example, Aggregation A and Aggregation B) produces two sub-aggregations: | ||
|
|
||
| ```json | ||
| { | ||
| "aggs": { | ||
| "A": { <1> | ||
| "avg": { "field": "system.cpu.user.pct" } | ||
| }, | ||
| "B": { | ||
| "avg": { "field": "system.cpu.system.pct" } | ||
| } | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| 1. The letter label matches the criterion label shown in the rule configuration (**A**, **B**, and so on). | ||
|
|
||
| | Rule criterion | Aggregation in query | | ||
| | --- | --- | | ||
| | **Average** of a field | `avg` | | ||
| | **Max** of a field | `max` | | ||
| | **Min** of a field | `min` | | ||
| | **Sum** of a field | `sum` | | ||
| | **Count** (all docs) | `value_count` or `filter` + `value_count` | | ||
| | **Cardinality** of a field | `cardinality` | | ||
| | **95th percentile** of a field | `percentiles` with `{ "percents": [95] }` | | ||
| | **Rate** of a field | Two `max` aggregations plus a bucket script | | ||
|
|
||
| If you set a **KQL filter** on a criterion ({applies_to}`stack: ga 9.4+`), it appears as a `filter` aggregation wrapping the metric aggregation. | ||
|
|
||
| ### Group-by fields [inspect-anatomy-group-by] | ||
|
|
||
| If your rule uses **Group alerts by**, the aggregations are wrapped in a `composite` aggregation that partitions results by those fields: | ||
|
|
||
| ```json | ||
| { | ||
| "aggs": { | ||
| "groupBy": { | ||
| "composite": { | ||
| "sources": [ | ||
| { "host.name": { "terms": { "field": "host.name" } } } <1> | ||
| ], | ||
| "size": 10000 | ||
| }, | ||
| "aggs": { | ||
| "A": { "avg": { "field": "system.cpu.user.pct" } } | ||
| } | ||
| } | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| 1. One entry per **Group alerts by** field. Multiple group-by fields produce multiple `sources`. | ||
|
|
||
| Without group-by, the aggregations run over all matched documents and return a single value. | ||
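
The mapping from group-by fields to `sources` entries can be sketched as follows. This is only an illustration of the shape shown above, assuming one `terms` source per field. The `composite_sources` helper is hypothetical and is not how {{kib}} builds the query internally.

```python
def composite_sources(group_by_fields):
    """Build one composite `sources` entry per group-by field (illustrative only)."""
    return [{field: {"terms": {"field": field}}} for field in group_by_fields]


# Assumed example: a rule grouped by host and by service.
sources = composite_sources(["host.name", "service.name"])
print(sources)
```

Comparing the number of `sources` entries in the inspector with the number of **Group alerts by** fields is a quick way to confirm that the grouping matches the rule configuration.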
|
|
||
| ## Reading the response [inspect-response] | ||
|
|
||
| The inspector also shows the {{es}} response alongside the request. Match each aggregation in the response back to your rule configuration to understand what value was computed. | ||
|
|
||
| ### No group-by: single-value response [inspect-response-no-group] | ||
|
|
||
| When there are no group-by fields, the response contains a single set of aggregation values under `aggregations`: | ||
|
|
||
| ```json | ||
| { | ||
| "aggregations": { | ||
| "A": { "value": 0.82 }, <1> | ||
| "B": { "value": 0.15 } | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| 1. Aggregation `A` returned `0.82` and aggregation `B` returned `0.15`. If your rule equation is `(A + B) * 100` with threshold `IS ABOVE 95`, substitute these values into the equation to get `97`, which confirms that the threshold was met. | ||
|
|
||
| If the response value is below the threshold and no alert fired, this confirms the rule evaluated correctly. If you _expected_ an alert and the value is below the threshold, review your aggregations and KQL filters. | ||
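
The manual check can also be done in a few lines of code. The sketch below assumes a hypothetical rule equation of `(A + B) * 100` and an `IS ABOVE 95` threshold, chosen because the sample response contains only aggregations `A` and `B`. Substitute your own equation and threshold.

```python
# Sample response copied from the example above.
response = {"aggregations": {"A": {"value": 0.82}, "B": {"value": 0.15}}}


def equation_value(aggregations):
    """Evaluate the assumed equation (A + B) * 100 from the response values."""
    a = aggregations["A"]["value"]
    b = aggregations["B"]["value"]
    return (a + b) * 100


value = equation_value(response["aggregations"])
breached = value > 95  # assumed threshold: IS ABOVE 95
print(round(value, 2), breached)
```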
|
|
||
| ### With group-by: bucketed response [inspect-response-group] | ||
|
|
||
| When group-by fields are used, the response returns one bucket per group under `aggregations.groupBy.buckets`: | ||
|
|
||
| ```json | ||
| { | ||
| "aggregations": { | ||
| "groupBy": { | ||
| "buckets": [ | ||
| { | ||
| "key": { "host.name": "host-1" }, | ||
| "doc_count": 342, | ||
| "A": { "value": 0.97 } <1> | ||
| }, | ||
| { | ||
| "key": { "host.name": "host-2" }, | ||
| "doc_count": 58, | ||
| "A": { "value": 0.42 } <2> | ||
| } | ||
| ] | ||
| } | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| 1. `host-1` had a value of `0.97`. If the threshold is `IS ABOVE 0.95`, this group breached it and an alert should have fired for `host-1`. | ||
| 2. `host-2` had a value of `0.42`, which is below the threshold, so no alert fired for this group. | ||
|
|
||
| If a group you expected to appear is missing from the buckets, it had no matching documents during the evaluation window. The `composite` aggregation returns buckets only for groups with at least one matching document, so this can happen when the group sent no data or when the query filter excluded all documents for that group. | ||
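
For responses with many buckets, a short script can list the groups that breached the threshold and the expected groups that are missing. This is a sketch under assumptions: the `breaching_groups` and `missing_groups` helpers are hypothetical, and the comparison assumes an `IS ABOVE` threshold.

```python
# Sample response copied from the example above.
response = {
    "aggregations": {
        "groupBy": {
            "buckets": [
                {"key": {"host.name": "host-1"}, "doc_count": 342, "A": {"value": 0.97}},
                {"key": {"host.name": "host-2"}, "doc_count": 58, "A": {"value": 0.42}},
            ]
        }
    }
}


def breaching_groups(response, label, threshold):
    """Return the keys of buckets whose value is above the threshold."""
    buckets = response["aggregations"]["groupBy"]["buckets"]
    return [
        bucket["key"]
        for bucket in buckets
        if bucket[label]["value"] is not None and bucket[label]["value"] > threshold
    ]


def missing_groups(response, field, expected):
    """Return the expected group values that have no bucket in the response."""
    buckets = response["aggregations"]["groupBy"]["buckets"]
    seen = {bucket["key"][field] for bucket in buckets}
    return sorted(set(expected) - seen)


print(breaching_groups(response, "A", 0.95))
print(missing_groups(response, "host.name", ["host-1", "host-3"]))
```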
|
|
||
| ### What a "no data" response looks like [inspect-response-no-data] | ||
|
|
||
| If {{es}} returned no documents, the aggregation values will be `null` or the buckets array will be empty: | ||
|
|
||
| ```json | ||
| { | ||
| "aggregations": { | ||
| "A": { "value": null } | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| A `null` value means no data matched the query during the evaluation window. If you have **no data** alerts configured, this is the state that triggers them. Check the time range and query filter to confirm no documents were genuinely present, or investigate whether an index or data view configuration issue is preventing data from being found. | ||
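
A pasted response can be checked for the no-data state with a small helper. The sketch below is hypothetical and assumes the two response shapes shown on this page: either top-level aggregations with a `value`, or a `groupBy` aggregation with `buckets`.

```python
def has_no_data(aggregations):
    """Return True if the response shows no matching data (illustrative check)."""
    if "groupBy" in aggregations:
        # Grouped rules: an empty buckets array means no group had documents.
        return len(aggregations["groupBy"]["buckets"]) == 0
    # Ungrouped rules: every aggregation value is null (None in Python).
    return all(agg.get("value") is None for agg in aggregations.values())


print(has_no_data({"A": {"value": None}}))
print(has_no_data({"A": {"value": 0.82}, "B": {"value": 0.15}}))
print(has_no_data({"groupBy": {"buckets": []}}))
```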
|
|
||
| ## Common troubleshooting scenarios [inspect-troubleshoot] | ||
|
|
||
| :::{dropdown} Alert fired but I don't know why | ||
| Open the inspector from the alert details page. Review the time range to confirm it matches the evaluation period. Find the aggregation bucket for your group and check the value against the threshold. If the value meets the threshold condition (for example, it is above an `IS ABOVE` threshold), the alert fired correctly. | ||
| ::: | ||
|
|
||
| :::{dropdown} Alert didn't fire when I expected it to | ||
| Open the inspector from the rule details page and confirm the query targets the right index pattern and time range. Check the query filter for unintended restrictions. If the aggregation values in the response are below the threshold, the rule evaluated correctly but your data didn't breach the threshold during that window. | ||
| ::: | ||
|
|
||
| :::{dropdown} Rule looks correct now but the alert used different parameters | ||
| If you've modified the rule since the alert fired, open the inspector from the _alert details page_ rather than the rule details page. The alert inspector uses the parameters that were active at the time the alert fired, so the query will reflect the older configuration. | ||
| ::: | ||
|
|
||
| :::{dropdown} Empty or null aggregation values | ||
| The query matched no documents. Check whether the index pattern in the data view is correct, whether your time range is appropriate, and whether any query filter is too restrictive. Also verify that the data stream or index has data in the expected time period by running the same query in [Discover](/explore-analyze/discover.md) or [Dev Tools](/explore-analyze/query-filter/tools/console.md). | ||
| ::: | ||
|
|
||
| :::{dropdown} Expected group missing from results | ||
| If a group you expected (such as a specific host) doesn't appear in the buckets, no documents for that group matched the query during the evaluation window. This can happen when the group was inactive, when a filter excluded its documents, or when the field used for grouping has a different value in the actual documents than you expected. | ||
| ::: | ||
@benakansara would you mind validating the content on this page before I open the PR for review? Here's the HTML preview so you can also see the web-rendered version of the docs.
@nastasha-solomon Thanks for documenting this! A few thoughts on the scope of the documentation:
I think the primary use of the query inspector is for users to quickly check what the query returned - either because they got an alert unexpectedly, or because they want to make small adjustments (like tweaking the time range) to understand why an alert didn't fire as expected.
Users are generally not interested in how we structure the query internally, nor are they expected to modify the query beyond adjusting the time range. The query can also be quite lengthy and its structure can change as we add new features to the rule type (we don't usually modify existing functionality, but new capabilities can affect how the query is built). Documenting the query anatomy in detail creates a maintenance burden.
I'd suggest to scope the documentation as:
The "Anatomy of the query" section could be dropped or condensed into a short note that the request tab shows the full Elasticsearch query. The response tab shows the raw values Elasticsearch returned, which can help confirm whether data was found, whether expected groups are present, and what values the rule was working with.