add tbs storage limit and disk-related metrics#19568

Closed
rubvs wants to merge 3 commits into main from tbs-monitoring

Conversation

@rubvs
Contributor

@rubvs rubvs commented Nov 14, 2025

Motivation/summary

Expose a new set of metrics to enhance TBS observability. The metric fields in the index data are tested in this PR, while the mappings are tested in the corresponding linked PRs. Changes in the elastic/elasticsearch#138131 PR are tested in conjunction with the changes in this PR.

See #15533 (comment) for the detailed overview.

Depends on PR:

Checklist

For functional changes, consider:

  • Is it observable through the addition of either logging or metrics?
  • Is its use being published in telemetry to enable product improvement?
  • Have system tests been added to avoid regression?

How to test these changes

Step 1: Ensure Elasticsearch & Kibana are running

Depends on elastic/elasticsearch#138131 with updates to monitoring-beats.json.

  1. Build the Docker image with the changes.
> cd elasticsearch

# Build the ES image with the added metric mappings.
> ./gradlew buildAarch64DockerImage --rerun-tasks
  2. Edit apm-server/docker-compose.yml and change the ES image.
elasticsearch:
 image: docker.elastic.co/elasticsearch/elasticsearch:9.3.0-custom-SNAPSHOT
 # ... rest of config
  3. Spin up the required services.
> cd apm-server
> docker-compose up elasticsearch kibana

Step 2: Create APM Server config

apm-server:
  host: "127.0.0.1:8200"

output.elasticsearch:
  enabled: true
  hosts: ["http://localhost:9200"]
  username: "admin"
  password: "changeme"

monitoring.enabled: true

monitoring.elasticsearch:
  protocol: "http"
  hosts: ["http://localhost:9200"]
  username: "admin"
  password: "changeme"

Step 3: Start APM Server

Run APM Server binary directly:

> cd apm-server
> ./apm-server -e -v -c apm-server.yml

Step 4: Verify Data

Verify data in index .monitoring-beats-7-*

GET .monitoring-beats-7-2026.01.13/_search
{
  "_source": ["beats_stats.metrics.apm-server.sampling.tail"],
  "query": {
    "exists": {
      "field": "beats_stats.metrics.apm-server.sampling.tail"
    }
  },
  "size": 1
}

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 31,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": ".monitoring-beats-7-2026.01.13",
        "_id": "ApzqgpoBYbFQC8_sQ5JQ",
        "_score": 1,
        "_source": {
          "beats_stats": {
            "metrics": {
              "apm-server": {
                "sampling": {
                  "tail": {
                    "storage": {
                      "value_log_size": 0,
                      "storage_limit": 0,
                      "disk_used": 309071659008,
                      "disk_total": 994662584320,
                      "disk_usage_threshold_pct": 80.2,
                      "lsm_size": 8891
                    }
                  }
                }
              }
            }
          }
        }
      }
    ]
  }
}
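To make the check on the returned document concrete, here is a minimal Go sketch that decodes the `_source` of the hit above and derives the share of disk in use. The names `storageMetrics`, `monitoringDoc`, `parseStorage`, and `diskUsedPct` are hypothetical and not part of this PR; comparing the derived percentage against `disk_usage_threshold_pct` is an assumption about how the threshold is meant to be read.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// storageMetrics mirrors the fields under
// beats_stats.metrics.apm-server.sampling.tail.storage in the sample hit.
type storageMetrics struct {
	ValueLogSize          int64   `json:"value_log_size"`
	StorageLimit          int64   `json:"storage_limit"`
	DiskUsed              int64   `json:"disk_used"`
	DiskTotal             int64   `json:"disk_total"`
	DiskUsageThresholdPct float64 `json:"disk_usage_threshold_pct"`
	LSMSize               int64   `json:"lsm_size"`
}

// monitoringDoc models only the part of _source needed to reach the storage metrics.
type monitoringDoc struct {
	BeatsStats struct {
		Metrics struct {
			APMServer struct {
				Sampling struct {
					Tail struct {
						Storage storageMetrics `json:"storage"`
					} `json:"tail"`
				} `json:"sampling"`
			} `json:"apm-server"`
		} `json:"metrics"`
	} `json:"beats_stats"`
}

// parseStorage (hypothetical helper) decodes a hit's _source into storageMetrics.
func parseStorage(source []byte) (storageMetrics, error) {
	var doc monitoringDoc
	if err := json.Unmarshal(source, &doc); err != nil {
		return storageMetrics{}, err
	}
	return doc.BeatsStats.Metrics.APMServer.Sampling.Tail.Storage, nil
}

// diskUsedPct returns disk_used as a percentage of disk_total,
// or 0 when disk_total is not reported.
func diskUsedPct(m storageMetrics) float64 {
	if m.DiskTotal <= 0 {
		return 0
	}
	return float64(m.DiskUsed) / float64(m.DiskTotal) * 100
}

// sampleSource is the _source object copied from the search response above.
const sampleSource = `{"beats_stats": {"metrics": {"apm-server": {"sampling": {"tail": {"storage": {
  "value_log_size": 0,
  "storage_limit": 0,
  "disk_used": 309071659008,
  "disk_total": 994662584320,
  "disk_usage_threshold_pct": 80.2,
  "lsm_size": 8891
}}}}}}}`

func main() {
	m, err := parseStorage([]byte(sampleSource))
	if err != nil {
		panic(err)
	}
	used := diskUsedPct(m)
	fmt.Printf("disk used: %.1f%% (threshold %.1f%%), over threshold: %t\n",
		used, m.DiskUsageThresholdPct, used > m.DiskUsageThresholdPct)
}
```

A `storage_limit` of 0 in the sample is kept as-is; the sketch does not interpret it.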

Verify mappings for index .monitoring-beats-7-*

GET .monitoring-beats-7-2026.01.13/_mapping?filter_path=**.storage

{
  ".monitoring-beats-7-2026.01.13": {
    "mappings": {
      "properties": {
        "beats_stats": {
          "properties": {
            "metrics": {
              "properties": {
                "apm-server": {
                  "properties": {
                    "sampling": {
                      "properties": {
                        "tail": {
                          "properties": {
                            "storage": {
                              "properties": {
                                "disk_total": {
                                  "type": "long"
                                },
                                "disk_usage_threshold_pct": {
                                  "type": "float"
                                },
                                "disk_used": {
                                  "type": "long"
                                },
                                "lsm_size": {
                                  "type": "long"
                                },
                                "storage_limit": {
                                  "type": "long"
                                },
                                "value_log_size": {
                                  "type": "long"
                                }
                              }
                            }
                          }
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
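The mapping check can also be scripted. The following Go sketch, under the assumption that the `_mapping` response has the shape shown above, walks the nested `properties` objects and compares the field types under `storage` with the expected ones. The names `mappingNode`, `storageFieldTypes`, `storagePath`, and `expectedTypes` are hypothetical and not part of this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// mappingNode is one level of a mapping: a leaf with a type,
// or an object with nested properties.
type mappingNode struct {
	Type       string                 `json:"type"`
	Properties map[string]mappingNode `json:"properties"`
}

// storageFieldTypes (hypothetical helper) follows path through the nested
// "properties" of every index in a _mapping response and returns the
// field name -> type pairs found at the end of the path.
func storageFieldTypes(resp []byte, path ...string) (map[string]string, error) {
	var indices map[string]struct {
		Mappings mappingNode `json:"mappings"`
	}
	if err := json.Unmarshal(resp, &indices); err != nil {
		return nil, err
	}
	types := map[string]string{}
	for _, idx := range indices {
		node := idx.Mappings
		for _, key := range path {
			next, ok := node.Properties[key]
			if !ok {
				return nil, fmt.Errorf("mapping path element %q not found", key)
			}
			node = next
		}
		for name, field := range node.Properties {
			types[name] = field.Type
		}
	}
	return types, nil
}

// storagePath is the path to the storage object in the response above.
var storagePath = []string{"beats_stats", "metrics", "apm-server", "sampling", "tail", "storage"}

// expectedTypes lists the field types shown in the response above.
var expectedTypes = map[string]string{
	"disk_total":               "long",
	"disk_usage_threshold_pct": "float",
	"disk_used":                "long",
	"lsm_size":                 "long",
	"storage_limit":            "long",
	"value_log_size":           "long",
}

// sampleMapping is the _mapping response above in compact form.
const sampleMapping = `{".monitoring-beats-7-2026.01.13": {"mappings": {"properties": {
  "beats_stats": {"properties": {"metrics": {"properties": {"apm-server": {"properties": {
    "sampling": {"properties": {"tail": {"properties": {"storage": {"properties": {
      "disk_total": {"type": "long"},
      "disk_usage_threshold_pct": {"type": "float"},
      "disk_used": {"type": "long"},
      "lsm_size": {"type": "long"},
      "storage_limit": {"type": "long"},
      "value_log_size": {"type": "long"}
    }}}}}}
  }}}}}}
}}}}`

func main() {
	types, err := storageFieldTypes([]byte(sampleMapping), storagePath...)
	if err != nil {
		panic(err)
	}
	for name, want := range expectedTypes {
		if got := types[name]; got != want {
			fmt.Printf("%s: got type %q, want %q\n", name, got, want)
		}
	}
	fmt.Printf("checked %d fields\n", len(expectedTypes))
}
```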

Related issues

Part of #15533

@github-actions
Contributor

🤖 GitHub comments

Just comment with:

  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

@mergify
Contributor

mergify Bot commented Nov 14, 2025

This pull request does not have a backport label. Could you fix it @rubvs? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-7.17 is the label to automatically backport to the 7.17 branch.
  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit.
  • backport-9./d is the label to automatically backport to the 9./d branch. /d is the digit.
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@elasticmachine
Collaborator

💚 Build Succeeded

History

@ericywl
Contributor

ericywl commented Nov 28, 2025

I also ran the test following the steps posted by Ruben, and can confirm that the metrics mapping appears.

GET .monitoring-beats-7-2025.11.28/_mapping?filter_path=**.storage

{
  ".monitoring-beats-7-2025.11.28": {
    "mappings": {
      "properties": {
        "beats_stats": {
          "properties": {
            "metrics": {
              "properties": {
                "apm-server": {
                  "properties": {
                    "sampling": {
                      "properties": {
                        "tail": {
                          "properties": {
                            "storage": {
                              "properties": {
                                "disk_total": {
                                  "type": "long"
                                },
                                "disk_usage_threshold": {
                                  "type": "long"
                                },
                                "disk_used": {
                                  "type": "long"
                                },
                                "lsm_size": {
                                  "type": "long"
                                },
                                "storage_limit": {
                                  "type": "long"
                                },
                                "value_log_size": {
                                  "type": "long"
                                }
                              }
                            }
                          }
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

@ericywl ericywl marked this pull request as ready for review November 28, 2025 08:06
@ericywl ericywl requested a review from a team as a code owner November 28, 2025 08:06
@ericywl ericywl self-assigned this Dec 1, 2025
@ericywl ericywl added backport-active-9 Automated backport with mergify to all the active 9.[0-9]+ branches backport-active-8 Automated backport with mergify to all the active 8.[0-9]+ branches labels Dec 1, 2025
Member

@carsonip carsonip left a comment

implementation looks good, 1 nit on metric naming

sm.storageMetrics.storageLimitGauge, _ = meter.Int64Gauge("apm-server.sampling.tail.storage.storage_limit")
sm.storageMetrics.diskUsedGauge, _ = meter.Int64Gauge("apm-server.sampling.tail.storage.disk_used")
sm.storageMetrics.diskTotalGauge, _ = meter.Int64Gauge("apm-server.sampling.tail.storage.disk_total")
sm.storageMetrics.diskUsageThresholdGauge, _ = meter.Int64Gauge("apm-server.sampling.tail.storage.disk_usage_threshold")
Member

nit: I wonder if it would be more descriptive / self-explanatory with a suffix, e.g. _pct. I also wonder if it is better to range from 0 to 1 (correct to 2 d.p. maybe?) instead of 0-100.
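As an illustration of the 0 to 1 alternative raised in this comment, here is a minimal Go sketch that converts a percentage into a ratio rounded to two decimal places. `thresholdRatio` is a hypothetical helper, not code from this PR, and clamping out-of-range input is an added assumption.

```go
package main

import (
	"fmt"
	"math"
)

// thresholdRatio converts a percentage in [0, 100] into a ratio in [0, 1],
// rounded to two decimal places. Out-of-range input is clamped (an assumption).
func thresholdRatio(pct float64) float64 {
	pct = math.Max(0, math.Min(100, pct))
	// Rounding the percentage to a whole number before dividing by 100
	// is equivalent to rounding the ratio to two decimal places.
	return math.Round(pct) / 100
}

func main() {
	for _, pct := range []float64{0, 80, 80.2, 100} {
		fmt.Printf("%.1f%% -> %.2f\n", pct, thresholdRatio(pct))
	}
}
```

A ratio rounded this way loses sub-percent precision (80.2 becomes 0.80), which is one trade-off to weigh against keeping a 0-100 float with a `_pct` suffix.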

if sm.storageMetrics.diskTotalGauge != nil {
	sm.storageMetrics.diskTotalGauge.Record(context.Background(), int64(usage.TotalBytes))
}
// Record disk usage threshold as a percentage (0-100)
Member

nit: remember to update comment if we change to 0-1

func (sm *StorageManager) NewReadWriter(storageLimit uint64, diskUsageThreshold float64) RW {
	// Store configured values for monitoring metrics
	sm.configuredStorageLimit.Store(storageLimit)
	// Store disk usage threshold as percentage (0-100)
Member

ditto

carsonip previously approved these changes Jan 15, 2026
Member

@carsonip carsonip left a comment

thanks!

For backports,

  • this obviously cannot be directly backported to 8.x as is due to the 9.0-introduced threshold, so expect some changes there.
  • Backport active 9 is fair, but bear in mind that also means ideally (not necessarily) all relevant metric PRs in ES, beats, integration need to be backported to the same patch version. It will be a bit of work. Alternatively, you may backport to 9.3 only, or no backport but the change will sit in main for a few months.

@ericywl
Contributor

ericywl commented Jan 19, 2026

Pending on the other PRs to pass CI and be approved first.

@ericywl ericywl added this pull request to the merge queue Feb 20, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Feb 20, 2026
@ericywl ericywl added this pull request to the merge queue Feb 20, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Feb 20, 2026
@ericywl ericywl added this pull request to the merge queue Feb 20, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Feb 20, 2026
@ericywl ericywl added this pull request to the merge queue Feb 23, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Feb 23, 2026
@ericywl ericywl added this pull request to the merge queue Feb 23, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Feb 23, 2026
@ericywl
Contributor

ericywl commented Feb 23, 2026

Force pushed to fix some CLA issues. The contents are the same @carsonip, though it seems like system-test-fips failed again just now.

@ericywl ericywl requested a review from carsonip February 23, 2026 10:41
@ericywl ericywl enabled auto-merge February 23, 2026 10:58
@ericywl ericywl added this pull request to the merge queue Feb 23, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Feb 23, 2026
@carsonip carsonip added this pull request to the merge queue Feb 23, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Feb 23, 2026
@ericywl ericywl closed this Feb 23, 2026
@mergify
Contributor

mergify Bot commented Feb 23, 2026

⚠️ The sha of the head commit of this PR conflicts with #20464. Mergify cannot evaluate rules on this PR. ⚠️

@ericywl ericywl deleted the tbs-monitoring branch February 24, 2026 07:19
Labels

backport-active-8 Automated backport with mergify to all the active 8.[0-9]+ branches backport-active-9 Automated backport with mergify to all the active 9.[0-9]+ branches enhancement

4 participants