| 1 | +# Metrics Catalog for VTEX IO Node Apps |
| 2 | + |
| 3 | +This document catalogs all metrics available in the `@vtex/api` library, grouped by implementation: diagnostics-based and legacy. |
| 4 | + |
| 5 | +> **Looking for migration guidance?** See [METRICS_OVERVIEW.md](./METRICS_OVERVIEW.md) for migration patterns and best practices. |
| 6 | + |
| 7 | +## Table of Contents |
| 8 | + |
| 9 | +- [Metrics Architecture Overview](#metrics-architecture-overview) |
| 10 | +- [Complete Metrics Visual Summary](#complete-metrics-visual-summary) |
| 11 | +- [Diagnostics-Related Metrics](#diagnostics-related-metrics) |
| 12 | +- [Legacy Metrics (Non-Diagnostics)](#legacy-metrics-non-diagnostics) |
| 13 | + |
| 14 | +--- |
| 15 | + |
| 16 | +## Metrics Architecture Overview |
| 17 | + |
| 18 | +The `@vtex/api` library has two coexisting metrics systems during the migration period: |
| 19 | + |
| 20 | +1. **Diagnostics-Based Metrics** (New) - Uses `@vtex/diagnostics-nodejs` with OpenTelemetry |
| 21 | +2. **Legacy Metrics** (Existing) - Uses `prom-client`, `MetricsAccumulator`, and console.log exports |
| 22 | + |
| 23 | +The two systems operate independently of each other. The goal is to migrate gradually to diagnostics-based metrics while maintaining backward compatibility. |
| 24 | + |
| 25 | +### Two Categories of Metrics |
| 26 | + |
| 27 | +| Category | Description | Initialization | Customization | |
| 28 | +|----------|-------------|----------------|---------------| |
| 29 | +| **Runtime/Infrastructure** | System-wide metrics for capacity planning and SLOs | Once at startup | Limited (configured at startup) | |
| 30 | +| **App/Middleware** | Operation-specific metrics for debugging and optimization | Per-request/operation | Rich (can add custom attributes) | |
| 31 | + |
| 32 | +--- |
| 33 | + |
| 34 | +## Complete Metrics Visual Summary |
| 35 | + |
| 36 | +``` |
| 37 | +All Metrics in node-vtex-api |
| 38 | +│ |
| 39 | +├── 🆕 Diagnostics-Related Metrics (OpenTelemetry-based) |
| 40 | +│ │ |
| 41 | +│ ├── 🏗️ Runtime/Infrastructure Metrics |
| 42 | +│ │ │ |
| 43 | +│ │ ├── OTel Request Instruments (service/metrics/metrics.ts) |
| 44 | +│ │ │ ├── io_http_requests_current (Gauge) |
| 45 | +│ │ │ ├── runtime_http_requests_duration_milliseconds (Histogram) |
| 46 | +│ │ │ ├── runtime_http_requests_total (Counter) |
| 47 | +│ │ │ ├── runtime_http_response_size_bytes (Histogram) |
| 48 | +│ │ │ └── runtime_http_aborted_requests_total (Counter) |
| 49 | +│ │ │ |
| 50 | +│ │ ├── Auto-instrumentation (telemetry/client.ts) |
| 51 | +│ │ │ ├── http.server.duration (Histogram - HttpInstrumentation) |
| 52 | +│ │ │ ├── http.server.request.size (Histogram) |
| 53 | +│ │ │ ├── http.server.response.size (Histogram) |
| 54 | +│ │ │ ├── http.client.duration (Histogram - HttpInstrumentation) |
| 55 | +│ │ │ ├── http.client.request.size (Histogram) |
| 56 | +│ │ │ ├── http.client.response.size (Histogram) |
| 57 | +│ │ │ └── Koa-enhanced HTTP metrics (KoaInstrumentation) |
| 58 | +│ │ │ |
| 59 | +│ │ └── Host Metrics (HostMetricsInstrumentation) |
| 60 | +│ │ ├── process.runtime.nodejs.memory.heap.used (Gauge) |
| 61 | +│ │ ├── process.runtime.nodejs.memory.heap.total (Gauge) |
| 62 | +│ │ ├── process.runtime.nodejs.memory.rss (Gauge) |
| 63 | +│ │ ├── process.runtime.nodejs.memory.external (Gauge) |
| 64 | +│ │ ├── process.runtime.nodejs.memory.arrayBuffers (Gauge) |
| 65 | +│ │ ├── process.runtime.nodejs.event_loop.lag.max (Gauge) |
| 66 | +│ │ ├── process.runtime.nodejs.event_loop.lag.min (Gauge) |
| 67 | +│ │ ├── process.cpu.utilization (Gauge) |
| 68 | +│ │ ├── system.cpu.utilization (Gauge) |
| 69 | +│ │ ├── system.memory.usage (Gauge) |
| 70 | +│ │ ├── system.memory.utilization (Gauge) |
| 71 | +│ │ ├── system.network.io (Counter) |
| 72 | +│ │ └── system.network.errors (Counter) |
| 73 | +│ │ |
| 74 | +│ └── 📱 App/Middleware Metrics |
| 75 | +│ │ |
| 76 | +│ ├── HTTP Client (HttpClient/middlewares/metrics.ts) |
| 77 | +│ │ ├── latency histogram (via recordLatency) |
| 78 | +│ │ ├── http_client_requests_total (Counter) |
| 79 | +│ │ ├── http_client_cache_total (Counter) |
| 80 | +│ │ └── http_client_requests_retried_total (Counter) |
| 81 | +│ │ |
| 82 | +│ ├── HTTP Handler (worker/runtime/http/middlewares/*) |
| 83 | +│ │ ├── latency histogram (via recordLatency) |
| 84 | +│ │ ├── http_handler_requests_total (Counter) |
| 85 | +│ │ ├── http_server_requests_total (Counter) |
| 86 | +│ │ ├── http_server_requests_closed_total (Counter) |
| 87 | +│ │ └── http_server_requests_aborted_total (Counter) |
| 88 | +│ │ |
| 89 | +│ ├── GraphQL (worker/runtime/graphql/schema/schemaDirectives/Metric.ts) |
| 90 | +│ │ ├── latency histogram (via recordLatency) |
| 91 | +│ │ └── graphql_field_requests_total (Counter) |
| 92 | +│ │ |
| 93 | +│ └── HTTP Agent (HttpClient/middlewares/request/HttpAgentSingleton.ts) |
| 94 | +│ ├── http_agent_sockets_current (Gauge) |
| 95 | +│ ├── http_agent_free_sockets_current (Gauge) |
| 96 | +│ └── http_agent_pending_requests_current (Gauge) |
| 97 | +│ |
| 98 | +└── 🏛️ Legacy Metrics (Non-Diagnostics) |
| 99 | + │ |
| 100 | + ├── 📊 Prometheus Metrics (prom-client, exposed on /metrics) |
| 101 | + │ │ |
| 102 | + │ ├── Request Metrics (service/tracing/metrics/*) |
| 103 | + │ │ ├── runtime_http_requests_total (Counter) - labels: status_code, handler |
| 104 | + │ │ ├── runtime_http_aborted_requests_total (Counter) - labels: handler |
| 105 | + │ │ ├── runtime_http_requests_duration_milliseconds (Histogram) |
| 106 | + │ │ ├── runtime_http_response_size_bytes (Histogram) |
| 107 | + │ │ └── io_http_requests_current (Gauge) |
| 108 | + │ │ |
| 109 | + │ ├── Event Loop Metrics (service/tracing/metrics/measurers/*) |
| 110 | + │ │ ├── runtime_event_loop_lag_max_between_scrapes_seconds (Gauge) |
| 111 | + │ │ └── runtime_event_loop_lag_percentiles_between_scrapes_seconds (Gauge) |
| 112 | + │ │ |
| 113 | + │ └── Default Node.js Metrics (collectDefaultMetrics) |
| 114 | + │ ├── nodejs_gc_duration_seconds (Histogram) |
| 115 | + │ ├── nodejs_active_handles_total (Gauge) |
| 116 | + │ ├── nodejs_active_requests_total (Gauge) |
| 117 | + │ ├── nodejs_heap_size_total_bytes (Gauge) |
| 118 | + │ ├── nodejs_heap_size_used_bytes (Gauge) |
| 119 | + │ ├── nodejs_external_memory_bytes (Gauge) |
| 120 | + │ ├── nodejs_version_info (Gauge) |
| 121 | + │ ├── process_cpu_user_seconds_total (Counter) |
| 122 | + │ ├── process_cpu_system_seconds_total (Counter) |
| 123 | + │ ├── process_resident_memory_bytes (Gauge) |
| 124 | + │ └── process_start_time_seconds (Gauge) |
| 125 | + │ |
| 126 | + ├── 📝 MetricsAccumulator (console.log exports via trackStatus) |
| 127 | + │ │ |
| 128 | + │ ├── HTTP Handler Metrics (worker/runtime/http/middlewares/timings.ts) |
| 129 | + │ │ └── http-handler-{route_id} |
| 130 | + │ │ ├── Aggregates: count, mean, median, percentile95, percentile99, max |
| 131 | + │ │ └── Extensions: success, error, timeout, aborted, cancelled |
| 132 | + │ │ |
| 133 | + │ ├── HTTP Client Metrics (HttpClient/middlewares/metrics.ts) |
| 134 | + │ │ └── http-client-{metric_name} |
| 135 | + │ │ ├── Aggregates: count, mean, median, percentile95, percentile99, max |
| 136 | + │ │ └── Extensions: |
| 137 | + │ │ ├── Status: success, error, timeout, aborted, cancelled |
| 138 | + │ │ ├── Cache: success-hit, success-miss, success-inflight, success-memoized |
| 139 | + │ │ └── Retry: retry-{status}-{count} |
| 140 | + │ │ |
| 141 | + │ ├── GraphQL Metrics (worker/runtime/graphql/schema/schemaDirectives/Metric.ts) |
| 142 | + │ │ └── graphql-metric-{field_name} |
| 143 | + │ │ ├── Aggregates: count, mean, median, percentile95, percentile99, max |
| 144 | + │ │ └── Extensions: success, error |
| 145 | + │ │ |
| 146 | + │ ├── System Metrics (metrics/MetricsAccumulator.ts) |
| 147 | + │ │ ├── cpu - user (μs), system (μs) |
| 148 | + │ │ ├── memory - rss, heapTotal, heapUsed, external, arrayBuffers |
| 149 | + │ │ ├── httpAgent - sockets, freeSockets, pendingRequests |
| 150 | + │ │ └── incomingRequest - total, closed, aborted |
| 151 | + │ │ |
| 152 | + │ └── Cache Metrics (via trackCache) |
| 153 | + │ └── {cache_name}-cache |
| 154 | + │ ├── LRU: itemCount, length, disposedItems, hitRate, hits, max, total |
| 155 | + │ ├── Disk: hits, total |
| 156 | + │ └── Multilayer: hitRate, hits, total |
| 157 | + │ |
| 158 | + └── 💰 Billing Metrics (console.log with __VTEX_IO_BILLING) |
| 159 | + └── Process time per handler |
| 160 | + ├── account, app, handler |
| 161 | + ├── production, routeType (public_route/private_route) |
| 162 | + ├── timestamp, type (process-time), value (milliseconds) |
| 163 | + └── vendor, workspace |
| 164 | +``` |
| 165 | + |
| 166 | +--- |
| 167 | + |
| 168 | +## Diagnostics-Related Metrics |
| 169 | + |
| 170 | +### Runtime/Infrastructure Metrics |
| 171 | + |
| 172 | +These system-wide metrics are declared once, at service initialization. |
| 173 | + |
| 174 | +#### OTel Request Instruments |
| 175 | + |
| 176 | +**Source:** `service/metrics/metrics.ts` |
| 177 | + |
| 178 | +| Metric Name | Type | Description | |
| 179 | +|-------------|------|-------------| |
| 180 | +| `io_http_requests_current` | Gauge | Current number of requests in progress | |
| 181 | +| `runtime_http_requests_duration_milliseconds` | Histogram | Incoming HTTP request duration | |
| 182 | +| `runtime_http_requests_total` | Counter | Total number of HTTP requests | |
| 183 | +| `runtime_http_response_size_bytes` | Histogram | Outgoing response sizes | |
| 184 | +| `runtime_http_aborted_requests_total` | Counter | Total aborted HTTP requests | |
| 185 | + |
| 186 | +#### Auto-instrumentation Metrics |
| 187 | + |
| 188 | +**Source:** `telemetry/client.ts` (via OpenTelemetry instrumentations) |
| 189 | + |
| 190 | +| Metric Name | Type | Source | Description | |
| 191 | +|-------------|------|--------|-------------| |
| 192 | +| `http.server.duration` | Histogram | HttpInstrumentation | HTTP server request duration | |
| 193 | +| `http.client.duration` | Histogram | HttpInstrumentation | HTTP client request duration | |
| 194 | +| `process.runtime.nodejs.memory.*` | Gauge | HostMetrics | Node.js memory metrics | |
| 195 | +| `process.cpu.utilization` | Gauge | HostMetrics | Process CPU utilization | |
| 196 | +| `system.cpu.utilization` | Gauge | HostMetrics | System CPU utilization | |
| 197 | +| `system.memory.usage` | Gauge | HostMetrics | System memory usage | |
| 198 | + |
| 199 | +### App/Middleware Metrics |
| 200 | + |
| 201 | +These are operation-specific metrics recorded in middleware components. |
| 202 | + |
| 203 | +#### HTTP Client Metrics |
| 204 | + |
| 205 | +**Source:** `HttpClient/middlewares/metrics.ts` |
| 206 | + |
| 207 | +| Metric Name | Type | Attributes | |
| 208 | +|-------------|------|------------| |
| 209 | +| Latency histogram | Histogram | `component`, `client_metric`, `status_code`, `status`, `cache_state` | |
| 210 | +| `http_client_requests_total` | Counter | `component`, `client_metric`, `status_code`, `status` | |
| 211 | +| `http_client_cache_total` | Counter | `component`, `client_metric`, `status_code`, `status`, `cache_state` | |
| 212 | +| `http_client_requests_retried_total` | Counter | `component`, `client_metric`, `status_code`, `status`, `retry_count` | |
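The attribute columns above determine how each counter is partitioned: every distinct combination of attribute values is tracked as its own series. The sketch below illustrates that idea with a small in-memory counter. It is not the `@vtex/diagnostics-nodejs` or OpenTelemetry API; the `InMemoryCounter` class and the attribute values (`'http-client'`, `'example-client'`) are hypothetical and exist only for illustration.

```typescript
// Attribute bag, mirroring the attribute names listed in the table above.
type Attributes = Record<string, string | number>

// Hypothetical in-memory counter: one running total per attribute combination.
class InMemoryCounter {
  private readonly series = new Map<string, number>()

  add(value: number, attributes: Attributes): void {
    const key = InMemoryCounter.seriesKey(attributes)
    this.series.set(key, (this.series.get(key) ?? 0) + value)
  }

  get(attributes: Attributes): number {
    return this.series.get(InMemoryCounter.seriesKey(attributes)) ?? 0
  }

  // Number of distinct attribute combinations seen so far.
  get seriesCount(): number {
    return this.series.size
  }

  // Sort keys so that attribute order does not create a new series.
  private static seriesKey(attributes: Attributes): string {
    return Object.keys(attributes)
      .sort()
      .map((k) => `${k}=${attributes[k]}`)
      .join(',')
  }
}

// Illustrative values only; real values come from the middleware.
const httpClientRequestsTotal = new InMemoryCounter()
const okAttributes: Attributes = {
  component: 'http-client',
  client_metric: 'example-client',
  status_code: 200,
  status: 'success',
}

httpClientRequestsTotal.add(1, okAttributes)
httpClientRequestsTotal.add(1, okAttributes)
httpClientRequestsTotal.add(1, { ...okAttributes, status_code: 500, status: 'error' })
```

Two requests with identical attributes share one series, while the failed request opens a second one. This is why attributes with many possible values should be added with care.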
| 213 | + |
| 214 | +#### HTTP Handler Metrics |
| 215 | + |
| 216 | +**Source:** `worker/runtime/http/middlewares/timings.ts`, `requestStats.ts` |
| 217 | + |
| 218 | +| Metric Name | Type | Attributes | |
| 219 | +|-------------|------|------------| |
| 220 | +| Latency histogram | Histogram | `component`, `route_id`, `route_type`, `status_code`, `status` | |
| 221 | +| `http_handler_requests_total` | Counter | `component`, `route_id`, `route_type`, `status_code`, `status` | |
| 222 | +| `http_server_requests_total` | Counter | `route_id`, `route_type`, `status_code` | |
| 223 | +| `http_server_requests_closed_total` | Counter | `route_id`, `route_type`, `status_code` | |
| 224 | +| `http_server_requests_aborted_total` | Counter | `route_id`, `route_type`, `status_code` | |
| 225 | + |
| 226 | +#### GraphQL Metrics |
| 227 | + |
| 228 | +**Source:** `worker/runtime/graphql/schema/schemaDirectives/Metric.ts` |
| 229 | + |
| 230 | +| Metric Name | Type | Attributes | |
| 231 | +|-------------|------|------------| |
| 232 | +| Latency histogram | Histogram | `component`, `field_name`, `status` | |
| 233 | +| `graphql_field_requests_total` | Counter | `component`, `field_name`, `status` | |
| 234 | + |
| 235 | +#### HTTP Agent Metrics |
| 236 | + |
| 237 | +**Source:** `HttpClient/middlewares/request/HttpAgentSingleton.ts` |
| 238 | + |
| 239 | +| Metric Name | Type | Description | |
| 240 | +|-------------|------|-------------| |
| 241 | +| `http_agent_sockets_current` | Gauge | Active sockets | |
| 242 | +| `http_agent_free_sockets_current` | Gauge | Free sockets in pool | |
| 243 | +| `http_agent_pending_requests_current` | Gauge | Pending requests waiting for socket | |
| 244 | + |
| 245 | +--- |
| 246 | + |
| 247 | +## Legacy Metrics (Non-Diagnostics) |
| 248 | + |
| 249 | +### Prometheus Metrics |
| 250 | + |
| 251 | +Exposed on the `/metrics` endpoint via `prom-client`. |
| 252 | + |
| 253 | +#### Request Metrics |
| 254 | + |
| 255 | +**Source:** `service/tracing/metrics/MetricNames.ts` |
| 256 | + |
| 257 | +| Metric Name | Type | Labels | Description | |
| 258 | +|-------------|------|--------|-------------| |
| 259 | +| `runtime_http_requests_total` | Counter | `status_code`, `handler` | Total HTTP requests | |
| 260 | +| `runtime_http_aborted_requests_total` | Counter | `handler` | Aborted HTTP requests | |
| 261 | +| `runtime_http_requests_duration_milliseconds` | Histogram | `handler` | Request duration (buckets: 10-5120ms) | |
| 262 | +| `runtime_http_response_size_bytes` | Histogram | `handler` | Response sizes (buckets: 500B-4MB) | |
| 263 | +| `io_http_requests_current` | Gauge | - | Concurrent requests | |
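The table gives only the endpoints of each bucket range. As an illustration of how such boundaries could be generated, the sketch below assumes that each bucket doubles the previous one. Under that assumption, ten steps from 10 ms end at 5120 ms and fourteen steps from 500 bytes end at 4,096,000 bytes, which matches the stated ranges. The doubling factor, the bucket counts, and the helper itself are assumptions and are not taken from `MetricNames.ts`.

```typescript
// Hypothetical helper: returns `count` bucket upper bounds, starting at
// `start` and multiplying by `factor` at each step.
function exponentialBuckets(start: number, factor: number, count: number): number[] {
  if (start <= 0 || factor <= 1 || count < 1) {
    throw new RangeError('start must be > 0, factor > 1, and count >= 1')
  }
  const buckets: number[] = []
  let bound = start
  for (let i = 0; i < count; i++) {
    buckets.push(bound)
    bound *= factor
  }
  return buckets
}

// Assumed layouts: doubling from 10 ms and from 500 bytes.
const durationBucketsMs = exponentialBuckets(10, 2, 10)
const responseSizeBucketsBytes = exponentialBuckets(500, 2, 14)
```

`prom-client` ships its own `exponentialBuckets(start, factor, count)` helper with the same parameter order, so a real histogram definition would not need a hand-written version.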
| 264 | + |
| 265 | +#### Event Loop Metrics |
| 266 | + |
| 267 | +**Source:** `service/tracing/metrics/measurers/EventLoopLagMeasurer.ts` |
| 268 | + |
| 269 | +| Metric Name | Type | Labels | Description | |
| 270 | +|-------------|------|--------|-------------| |
| 271 | +| `runtime_event_loop_lag_max_between_scrapes_seconds` | Gauge | - | Max event loop lag | |
| 272 | +| `runtime_event_loop_lag_percentiles_between_scrapes_seconds` | Gauge | `percentile` | Event loop lag percentiles (95, 99) | |
| 273 | + |
| 274 | +#### Default Node.js Metrics |
| 275 | + |
| 276 | +Via `collectDefaultMetrics()` from `prom-client`: |
| 277 | + |
| 278 | +- `nodejs_gc_duration_seconds` - GC duration histogram |
| 279 | +- `nodejs_active_handles_total` - Active handles |
| 280 | +- `nodejs_active_requests_total` - Active requests |
| 281 | +- `nodejs_heap_size_*_bytes` - Heap metrics |
| 282 | +- `nodejs_external_memory_bytes` - External memory |
| 283 | +- `nodejs_version_info` - Node.js version |
| 284 | +- `process_cpu_*_seconds_total` - CPU counters |
| 285 | +- `process_resident_memory_bytes` - RSS memory |
| 286 | +- `process_start_time_seconds` - Process start time |
| 287 | + |
| 288 | +### MetricsAccumulator |
| 289 | + |
| 290 | +Exported via `console.log` as JSON and collected by Splunk. |
| 291 | + |
| 292 | +**Source:** `metrics/MetricsAccumulator.ts` |
| 293 | + |
| 294 | +#### Aggregated Metrics Format |
| 295 | + |
| 296 | +Each metric includes: |
| 297 | +- `name` - Metric identifier |
| 298 | +- `count` - Number of samples |
| 299 | +- `mean`, `median` - Average and middle values |
| 300 | +- `percentile95`, `percentile99` - Tail latencies |
| 301 | +- `max` - Maximum value |
| 302 | +- `production` - Environment flag |
| 303 | +- Plus any custom extensions |
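To make the fields above concrete, here is a minimal sketch that computes them from a list of latency samples. It assumes nearest-rank percentiles and uses the same rule for the median. The method `MetricsAccumulator` actually uses is not stated here, so treat the `aggregate` and `percentile` helpers and the metric name `http-handler-example` as hypothetical.

```typescript
// Shape of one aggregated metric, following the field list above.
interface AggregatedMetric {
  name: string
  count: number
  mean: number
  median: number
  percentile95: number
  percentile99: number
  max: number
  production: boolean
}

// Nearest-rank percentile over an ascending-sorted, non-empty array
// (assumed method, chosen for simplicity).
function percentile(sorted: number[], p: number): number {
  const rank = Math.ceil((p / 100) * sorted.length)
  return sorted[Math.max(rank, 1) - 1]
}

function aggregate(name: string, samples: number[], production: boolean): AggregatedMetric {
  if (samples.length === 0) {
    throw new RangeError('at least one sample is required')
  }
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...samples].sort((a, b) => a - b)
  const sum = sorted.reduce((acc, v) => acc + v, 0)
  return {
    name,
    count: sorted.length,
    mean: sum / sorted.length,
    median: percentile(sorted, 50),
    percentile95: percentile(sorted, 95),
    percentile99: percentile(sorted, 99),
    max: sorted[sorted.length - 1],
    production,
  }
}

// Five illustrative latency samples in milliseconds.
const example = aggregate('http-handler-example', [120, 80, 100, 300, 90], true)
```

With so few samples, the 95th and 99th percentiles both collapse to the maximum, which is one reason tail latencies are only meaningful over larger sample counts.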
| 304 | + |
| 305 | +#### System Metrics |
| 306 | + |
| 307 | +| Metric Name | Properties | |
| 308 | +|-------------|------------| |
| 309 | +| `cpu` | `user` (μs), `system` (μs) | |
| 310 | +| `memory` | `rss`, `heapTotal`, `heapUsed`, `external`, `arrayBuffers` | |
| 311 | +| `httpAgent` | `sockets`, `freeSockets`, `pendingRequests` | |
| 312 | +| `incomingRequest` | `total`, `closed`, `aborted` | |
| 313 | + |
| 314 | +### Billing Metrics |
| 315 | + |
| 316 | +**Source:** `worker/runtime/http/middlewares/timings.ts` |
| 317 | + |
| 318 | +Exported with the `__VTEX_IO_BILLING` flag for usage tracking: |
| 319 | + |
| 320 | +```json |
| 321 | +{ |
| 322 | + "__VTEX_IO_BILLING": "true", |
| 323 | + "account": "...", |
| 324 | + "app": "...", |
| 325 | + "handler": "...", |
| 326 | + "production": true, |
| 327 | + "routeType": "public_route", |
| 328 | + "timestamp": 1234567890, |
| 329 | + "type": "process-time", |
| 330 | + "value": 150, |
| 331 | + "vendor": "vtex", |
| 332 | + "workspace": "master" |
| 333 | +} |
| 334 | +``` |
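The payload above can be described with a TypeScript type, which helps when writing consumers or tests for these log lines. The sketch below is derived only from the JSON example. The `BillingMetric` type, the `buildBillingMetric` helper, and the placeholder values are hypothetical and do not reflect the actual code in `timings.ts`.

```typescript
type RouteType = 'public_route' | 'private_route'

// Field names and literal values follow the JSON example above.
interface BillingMetric {
  __VTEX_IO_BILLING: 'true'
  account: string
  app: string
  handler: string
  production: boolean
  routeType: RouteType
  timestamp: number
  type: 'process-time'
  value: number // process time in milliseconds
  vendor: string
  workspace: string
}

// Hypothetical helper: fills in the two constant fields.
function buildBillingMetric(
  fields: Omit<BillingMetric, '__VTEX_IO_BILLING' | 'type'>
): BillingMetric {
  return { __VTEX_IO_BILLING: 'true', type: 'process-time', ...fields }
}

// Placeholder values for illustration only.
const billingLine = JSON.stringify(
  buildBillingMetric({
    account: 'example-account',
    app: 'example-app',
    handler: 'example-handler',
    production: true,
    routeType: 'public_route',
    timestamp: 1234567890,
    value: 150,
    vendor: 'vtex',
    workspace: 'master',
  })
)
```

Serializing to a single JSON string matches the `console.log`-based export described for the legacy metrics.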
| 335 | + |
| 336 | +--- |
| 337 | + |
| 338 | +## Related Documentation |
| 339 | + |
| 340 | +- [Migration Guide](./METRICS_OVERVIEW.md) - Patterns and best practices for migrating to diagnostics-based metrics |
| 341 | + |