Commit 919bacd

docs: add comprehensive metrics documentation
This commit introduces two new markdown files: METRICS_CATALOG.md and METRICS_OVERVIEW.md. The METRICS_CATALOG.md provides a detailed catalog of all metrics available in the `@vtex/api` library, organized by their implementation types (diagnostics-based vs legacy). It includes sections on metrics architecture, visual summaries, and specific metrics for runtime, infrastructure, app, and middleware. The METRICS_OVERVIEW.md serves as a migration guide for transitioning from the legacy MetricsAccumulator API to the new DiagnosticsMetrics API. It outlines the benefits of migration, common patterns, best practices, and troubleshooting tips, ensuring a smooth transition for developers. Both documents aim to enhance understanding and usage of metrics within VTEX IO applications.
1 parent d42c2c4 commit 919bacd

2 files changed

Lines changed: 914 additions & 0 deletions

File tree

docs/METRICS_CATALOG.md

Lines changed: 341 additions & 0 deletions
@@ -0,0 +1,341 @@
1+
# Metrics Catalog for VTEX IO Node Apps
2+
3+
This document provides a comprehensive catalog of all metrics available in the `@vtex/api` library, organized by their implementation (diagnostics-based vs legacy).
4+
5+
> **Looking for migration guidance?** See [METRICS_OVERVIEW.md](./METRICS_OVERVIEW.md) for migration patterns and best practices.
6+
7+
## Table of Contents
8+
9+
- [Metrics Architecture Overview](#metrics-architecture-overview)
- [Complete Metrics Visual Summary](#complete-metrics-visual-summary)
- [Diagnostics-Related Metrics](#diagnostics-related-metrics)
- [Legacy Metrics (Non-Diagnostics)](#legacy-metrics-non-diagnostics)
13+
14+
---
15+
16+
## Metrics Architecture Overview
17+
18+
The `@vtex/api` library has two coexisting metrics systems during the migration period:
19+
20+
1. **Diagnostics-Based Metrics** (New) - Uses `@vtex/diagnostics-nodejs` with OpenTelemetry
2. **Legacy Metrics** (Existing) - Uses `prom-client`, `MetricsAccumulator`, and `console.log` exports
22+
23+
The two systems operate independently, so an app can emit metrics through both at the same time. The goal is to migrate gradually to diagnostics-based metrics while maintaining backward compatibility.
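
One way to picture the coexistence is a recorder that fans each sample out to both systems. The sketch below is illustrative only: `LatencyRecorder`, `DualRecorder`, and `InMemoryRecorder` are hypothetical names, not part of `@vtex/api` or `@vtex/diagnostics-nodejs`.

```typescript
// Hypothetical recorder shape; neither system's real API is shown here.
interface LatencyRecorder {
  recordLatency(name: string, millis: number, attributes: Record<string, string>): void;
}

// Fan-out recorder: forwards every sample to each backend independently,
// so a failure in one backend does not stop the others from recording.
class DualRecorder implements LatencyRecorder {
  private readonly backends: LatencyRecorder[];

  constructor(backends: LatencyRecorder[]) {
    this.backends = backends;
  }

  recordLatency(name: string, millis: number, attributes: Record<string, string>): void {
    for (let i = 0; i < this.backends.length; i++) {
      try {
        this.backends[i].recordLatency(name, millis, attributes);
      } catch (err) {
        // Metrics should never break request handling; the failed sample is dropped.
      }
    }
  }
}

// In-memory stand-in used only to demonstrate the fan-out.
class InMemoryRecorder implements LatencyRecorder {
  readonly samples: Array<{ name: string; millis: number }> = [];

  recordLatency(name: string, millis: number): void {
    this.samples.push({ name, millis });
  }
}
```

During a migration, one backend would wrap the legacy path and the other the diagnostics path; removing the legacy backend later would not change any call sites.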
24+
25+
### Two Categories of Metrics
26+
27+
| Category | Description | Initialization | Customization |
|----------|-------------|----------------|---------------|
| **Runtime/Infrastructure** | System-wide metrics for capacity planning and SLOs | Once at startup | Limited (configured at startup) |
| **App/Middleware** | Operation-specific metrics for debugging and optimization | Per-request/operation | Rich (can add custom attributes) |
31+
32+
---
33+
34+
## Complete Metrics Visual Summary
35+
36+
```
All Metrics in node-vtex-api

├── 🆕 Diagnostics-Related Metrics (OpenTelemetry-based)
│   │
│   ├── 🏗️ Runtime/Infrastructure Metrics
│   │   │
│   │   ├── OTel Request Instruments (service/metrics/metrics.ts)
│   │   │   ├── io_http_requests_current (Gauge)
│   │   │   ├── runtime_http_requests_duration_milliseconds (Histogram)
│   │   │   ├── runtime_http_requests_total (Counter)
│   │   │   ├── runtime_http_response_size_bytes (Histogram)
│   │   │   └── runtime_http_aborted_requests_total (Counter)
│   │   │
│   │   ├── Auto-instrumentation (telemetry/client.ts)
│   │   │   ├── http.server.duration (Histogram - HttpInstrumentation)
│   │   │   ├── http.server.request.size (Histogram)
│   │   │   ├── http.server.response.size (Histogram)
│   │   │   ├── http.client.duration (Histogram - HttpInstrumentation)
│   │   │   ├── http.client.request.size (Histogram)
│   │   │   ├── http.client.response.size (Histogram)
│   │   │   └── Koa-enhanced HTTP metrics (KoaInstrumentation)
│   │   │
│   │   └── Host Metrics (HostMetricsInstrumentation)
│   │       ├── process.runtime.nodejs.memory.heap.used (Gauge)
│   │       ├── process.runtime.nodejs.memory.heap.total (Gauge)
│   │       ├── process.runtime.nodejs.memory.rss (Gauge)
│   │       ├── process.runtime.nodejs.memory.external (Gauge)
│   │       ├── process.runtime.nodejs.memory.arrayBuffers (Gauge)
│   │       ├── process.runtime.nodejs.event_loop.lag.max (Gauge)
│   │       ├── process.runtime.nodejs.event_loop.lag.min (Gauge)
│   │       ├── process.cpu.utilization (Gauge)
│   │       ├── system.cpu.utilization (Gauge)
│   │       ├── system.memory.usage (Gauge)
│   │       ├── system.memory.utilization (Gauge)
│   │       ├── system.network.io (Counter)
│   │       └── system.network.errors (Counter)
│   │
│   └── 📱 App/Middleware Metrics
│       │
│       ├── HTTP Client (HttpClient/middlewares/metrics.ts)
│       │   ├── latency histogram (via recordLatency)
│       │   ├── http_client_requests_total (Counter)
│       │   ├── http_client_cache_total (Counter)
│       │   └── http_client_requests_retried_total (Counter)
│       │
│       ├── HTTP Handler (worker/runtime/http/middlewares/*)
│       │   ├── latency histogram (via recordLatency)
│       │   ├── http_handler_requests_total (Counter)
│       │   ├── http_server_requests_total (Counter)
│       │   ├── http_server_requests_closed_total (Counter)
│       │   └── http_server_requests_aborted_total (Counter)
│       │
│       ├── GraphQL (worker/runtime/graphql/schema/schemaDirectives/Metric.ts)
│       │   ├── latency histogram (via recordLatency)
│       │   └── graphql_field_requests_total (Counter)
│       │
│       └── HTTP Agent (HttpClient/middlewares/request/HttpAgentSingleton.ts)
│           ├── http_agent_sockets_current (Gauge)
│           ├── http_agent_free_sockets_current (Gauge)
│           └── http_agent_pending_requests_current (Gauge)

└── 🏛️ Legacy Metrics (Non-Diagnostics)

    ├── 📊 Prometheus Metrics (prom-client, exposed on /metrics)
    │   │
    │   ├── Request Metrics (service/tracing/metrics/*)
    │   │   ├── runtime_http_requests_total (Counter) - labels: status_code, handler
    │   │   ├── runtime_http_aborted_requests_total (Counter) - labels: handler
    │   │   ├── runtime_http_requests_duration_milliseconds (Histogram)
    │   │   ├── runtime_http_response_size_bytes (Histogram)
    │   │   └── io_http_requests_current (Gauge)
    │   │
    │   ├── Event Loop Metrics (service/tracing/metrics/measurers/*)
    │   │   ├── runtime_event_loop_lag_max_between_scrapes_seconds (Gauge)
    │   │   └── runtime_event_loop_lag_percentiles_between_scrapes_seconds (Gauge)
    │   │
    │   └── Default Node.js Metrics (collectDefaultMetrics)
    │       ├── nodejs_gc_duration_seconds (Histogram)
    │       ├── nodejs_active_handles_total (Gauge)
    │       ├── nodejs_active_requests_total (Gauge)
    │       ├── nodejs_heap_size_total_bytes (Gauge)
    │       ├── nodejs_heap_size_used_bytes (Gauge)
    │       ├── nodejs_external_memory_bytes (Gauge)
    │       ├── nodejs_version_info (Gauge)
    │       ├── process_cpu_user_seconds_total (Counter)
    │       ├── process_cpu_system_seconds_total (Counter)
    │       ├── process_resident_memory_bytes (Gauge)
    │       └── process_start_time_seconds (Gauge)

    ├── 📝 MetricsAccumulator (console.log exports via trackStatus)
    │   │
    │   ├── HTTP Handler Metrics (worker/runtime/http/middlewares/timings.ts)
    │   │   └── http-handler-{route_id}
    │   │       ├── Aggregates: count, mean, median, percentile95, percentile99, max
    │   │       └── Extensions: success, error, timeout, aborted, cancelled
    │   │
    │   ├── HTTP Client Metrics (HttpClient/middlewares/metrics.ts)
    │   │   └── http-client-{metric_name}
    │   │       ├── Aggregates: count, mean, median, percentile95, percentile99, max
    │   │       └── Extensions:
    │   │           ├── Status: success, error, timeout, aborted, cancelled
    │   │           ├── Cache: success-hit, success-miss, success-inflight, success-memoized
    │   │           └── Retry: retry-{status}-{count}
    │   │
    │   ├── GraphQL Metrics (worker/runtime/graphql/schema/schemaDirectives/Metric.ts)
    │   │   └── graphql-metric-{field_name}
    │   │       ├── Aggregates: count, mean, median, percentile95, percentile99, max
    │   │       └── Extensions: success, error
    │   │
    │   ├── System Metrics (metrics/MetricsAccumulator.ts)
    │   │   ├── cpu - user (μs), system (μs)
    │   │   ├── memory - rss, heapTotal, heapUsed, external, arrayBuffers
    │   │   ├── httpAgent - sockets, freeSockets, pendingRequests
    │   │   └── incomingRequest - total, closed, aborted
    │   │
    │   └── Cache Metrics (via trackCache)
    │       └── {cache_name}-cache
    │           ├── LRU: itemCount, length, disposedItems, hitRate, hits, max, total
    │           ├── Disk: hits, total
    │           └── Multilayer: hitRate, hits, total

    └── 💰 Billing Metrics (console.log with __VTEX_IO_BILLING)
        └── Process time per handler
            ├── account, app, handler
            ├── production, routeType (public_route/private_route)
            ├── timestamp, value (milliseconds)
            └── vendor, workspace
```
165+
166+
---
167+
168+
## Diagnostics-Related Metrics
169+
170+
### Runtime/Infrastructure Metrics
171+
172+
These system-wide metrics are declared once, when the service initializes.
173+
174+
#### OTel Request Instruments
175+
176+
**Source:** `service/metrics/metrics.ts`
177+
178+
| Metric Name | Type | Description |
|-------------|------|-------------|
| `io_http_requests_current` | Gauge | Current number of requests in progress |
| `runtime_http_requests_duration_milliseconds` | Histogram | Incoming HTTP request duration |
| `runtime_http_requests_total` | Counter | Total number of HTTP requests |
| `runtime_http_response_size_bytes` | Histogram | Outgoing response sizes |
| `runtime_http_aborted_requests_total` | Counter | Total aborted HTTP requests |
185+
186+
#### Auto-instrumentation Metrics
187+
188+
**Source:** `telemetry/client.ts` (via OpenTelemetry instrumentations)
189+
190+
| Metric Name | Type | Source | Description |
|-------------|------|--------|-------------|
| `http.server.duration` | Histogram | HttpInstrumentation | HTTP server request duration |
| `http.client.duration` | Histogram | HttpInstrumentation | HTTP client request duration |
| `process.runtime.nodejs.memory.*` | Gauge | HostMetrics | Node.js memory metrics |
| `process.cpu.utilization` | Gauge | HostMetrics | Process CPU utilization |
| `system.cpu.utilization` | Gauge | HostMetrics | System CPU utilization |
| `system.memory.usage` | Gauge | HostMetrics | System memory usage |
198+
199+
### App/Middleware Metrics
200+
201+
These are operation-specific metrics recorded in middleware components.
202+
203+
#### HTTP Client Metrics
204+
205+
**Source:** `HttpClient/middlewares/metrics.ts`
206+
207+
| Metric Name | Type | Attributes |
|-------------|------|------------|
| Latency histogram | Histogram | `component`, `client_metric`, `status_code`, `status`, `cache_state` |
| `http_client_requests_total` | Counter | `component`, `client_metric`, `status_code`, `status` |
| `http_client_cache_total` | Counter | `component`, `client_metric`, `status_code`, `status`, `cache_state` |
| `http_client_requests_retried_total` | Counter | `component`, `client_metric`, `status_code`, `status`, `retry_count` |
213+
214+
#### HTTP Handler Metrics
215+
216+
**Source:** `worker/runtime/http/middlewares/timings.ts`, `requestStats.ts`
217+
218+
| Metric Name | Type | Attributes |
|-------------|------|------------|
| Latency histogram | Histogram | `component`, `route_id`, `route_type`, `status_code`, `status` |
| `http_handler_requests_total` | Counter | `component`, `route_id`, `route_type`, `status_code`, `status` |
| `http_server_requests_total` | Counter | `route_id`, `route_type`, `status_code` |
| `http_server_requests_closed_total` | Counter | `route_id`, `route_type`, `status_code` |
| `http_server_requests_aborted_total` | Counter | `route_id`, `route_type`, `status_code` |
225+
226+
#### GraphQL Metrics
227+
228+
**Source:** `worker/runtime/graphql/schema/schemaDirectives/Metric.ts`
229+
230+
| Metric Name | Type | Attributes |
|-------------|------|------------|
| Latency histogram | Histogram | `component`, `field_name`, `status` |
| `graphql_field_requests_total` | Counter | `component`, `field_name`, `status` |
234+
235+
#### HTTP Agent Metrics
236+
237+
**Source:** `HttpClient/middlewares/request/HttpAgentSingleton.ts`
238+
239+
| Metric Name | Type | Description |
|-------------|------|-------------|
| `http_agent_sockets_current` | Gauge | Active sockets |
| `http_agent_free_sockets_current` | Gauge | Free sockets in pool |
| `http_agent_pending_requests_current` | Gauge | Pending requests waiting for socket |
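
As a rough illustration of where such gauge values can come from: Node's `http.Agent` exposes `sockets`, `freeSockets`, and `requests`, each an object keyed by host whose values are arrays. The sketch below counts the entries in each. `AgentLike` and `agentGaugeValues` are hypothetical names, and mapping these three properties to the three gauges is an assumption, not a description of `HttpAgentSingleton.ts`.

```typescript
// Structural subset of Node's http.Agent: each property maps a host key to an array.
interface AgentLike {
  sockets: Record<string, unknown[] | undefined>;
  freeSockets: Record<string, unknown[] | undefined>;
  requests: Record<string, unknown[] | undefined>;
}

// Sums the array lengths across all host keys.
function countEntries(byHost: Record<string, unknown[] | undefined>): number {
  let total = 0;
  const keys = Object.keys(byHost);
  for (let i = 0; i < keys.length; i++) {
    const list = byHost[keys[i]];
    total += list ? list.length : 0;
  }
  return total;
}

// Assumed mapping from agent state to the gauge names in the table above.
function agentGaugeValues(agent: AgentLike): Record<string, number> {
  return {
    http_agent_sockets_current: countEntries(agent.sockets),
    http_agent_free_sockets_current: countEntries(agent.freeSockets),
    http_agent_pending_requests_current: countEntries(agent.requests),
  };
}
```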
244+
245+
---
246+
247+
## Legacy Metrics (Non-Diagnostics)
248+
249+
### Prometheus Metrics
250+
251+
Exposed on the `/metrics` endpoint via `prom-client`.
252+
253+
#### Request Metrics
254+
255+
**Source:** `service/tracing/metrics/MetricNames.ts`
256+
257+
| Metric Name | Type | Labels | Description |
|-------------|------|--------|-------------|
| `runtime_http_requests_total` | Counter | `status_code`, `handler` | Total HTTP requests |
| `runtime_http_aborted_requests_total` | Counter | `handler` | Aborted HTTP requests |
| `runtime_http_requests_duration_milliseconds` | Histogram | `handler` | Request duration (buckets: 10-5120ms) |
| `runtime_http_response_size_bytes` | Histogram | `handler` | Response sizes (buckets: 500B-4MB) |
| `io_http_requests_current` | Gauge | - | Concurrent requests |
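
To make the duration bucket range concrete, the sketch below generates boundaries from 10 ms to 5120 ms. It assumes the boundaries double at each step (10 × 2⁹ = 5120), which fits the stated range but is not confirmed by this catalog. `exponentialBuckets` here is a local helper written for illustration, and `smallestBucketFor` is a hypothetical name.

```typescript
// Local helper: `count` boundaries starting at `start`, each `factor` times the previous.
function exponentialBuckets(start: number, factor: number, count: number): number[] {
  const buckets: number[] = [];
  let bound = start;
  for (let i = 0; i < count; i++) {
    buckets.push(bound);
    bound *= factor;
  }
  return buckets;
}

// Assumption: doubling boundaries, giving 10, 20, 40, ..., 5120 (milliseconds).
const durationBucketsMs = exponentialBuckets(10, 2, 10);

// Returns the smallest upper bound that contains the observation,
// or Infinity when it exceeds every finite boundary (the "+Inf" bucket).
function smallestBucketFor(value: number, bounds: number[]): number {
  for (let i = 0; i < bounds.length; i++) {
    if (value <= bounds[i]) {
      return bounds[i];
    }
  }
  return Infinity;
}
```

Prometheus histogram buckets are cumulative, so an observation is also counted in every bucket above the one returned here.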
264+
265+
#### Event Loop Metrics
266+
267+
**Source:** `service/tracing/metrics/measurers/EventLoopLagMeasurer.ts`
268+
269+
| Metric Name | Type | Labels | Description |
|-------------|------|--------|-------------|
| `runtime_event_loop_lag_max_between_scrapes_seconds` | Gauge | - | Max event loop lag |
| `runtime_event_loop_lag_percentiles_between_scrapes_seconds` | Gauge | `percentile` | Event loop lag percentiles (95, 99) |
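
The "between scrapes" part of these names suggests a value that covers only the window since the previous scrape. A minimal sketch of that idea for the max gauge, assuming the window resets on every collection (`MaxBetweenScrapes` is a hypothetical class, not the code in `EventLoopLagMeasurer.ts`):

```typescript
// Tracks the largest lag seen since the last collection, then resets.
class MaxBetweenScrapes {
  private max = 0;

  // Called for every lag sample, in seconds.
  observe(lagSeconds: number): void {
    if (lagSeconds > this.max) {
      this.max = lagSeconds;
    }
  }

  // Called once per scrape: returns the window's max and starts a new window.
  collect(): number {
    const value = this.max;
    this.max = 0;
    return value;
  }
}
```

Resetting on collection means a single slow tick shows up in exactly one scrape instead of pinning the gauge at its all-time high.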
273+
274+
#### Default Node.js Metrics
275+
276+
Via `collectDefaultMetrics()` from `prom-client`:
277+
278+
- `nodejs_gc_duration_seconds` - GC duration histogram
- `nodejs_active_handles_total` - Active handles
- `nodejs_active_requests_total` - Active requests
- `nodejs_heap_size_*_bytes` - Heap metrics
- `nodejs_external_memory_bytes` - External memory
- `nodejs_version_info` - Node.js version
- `process_cpu_*_seconds_total` - CPU counters
- `process_resident_memory_bytes` - RSS memory
- `process_start_time_seconds` - Process start time
287+
288+
### MetricsAccumulator
289+
290+
Metrics are exported as JSON via `console.log` and collected by Splunk.
291+
292+
**Source:** `metrics/MetricsAccumulator.ts`
293+
294+
#### Aggregated Metrics Format
295+
296+
Each metric includes:
297+
- `name` - Metric identifier
- `count` - Number of samples
- `mean`, `median` - Average and middle values
- `percentile95`, `percentile99` - Tail latencies
- `max` - Maximum value
- `production` - Environment flag
- Plus any custom extensions
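
As an illustration of how the numeric fields relate to the raw samples, here is a minimal sketch. It assumes nearest-rank percentiles and a median that averages the two middle values for even counts; `MetricsAccumulator` may compute these differently. `Aggregate` and `aggregate` are hypothetical names, and `production` and the extensions are omitted.

```typescript
interface Aggregate {
  name: string;
  count: number;
  mean: number;
  median: number;
  percentile95: number;
  percentile99: number;
  max: number;
}

// Nearest-rank percentile over an ascending, non-empty array.
function nearestRank(sorted: number[], percentile: number): number {
  const rank = Math.ceil((percentile * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Returns undefined when there are no samples to summarize.
function aggregate(name: string, samples: number[]): Aggregate | undefined {
  if (samples.length === 0) {
    return undefined;
  }
  const sorted = samples.slice().sort((a, b) => a - b);
  const count = sorted.length;
  let sum = 0;
  for (let i = 0; i < count; i++) {
    sum += sorted[i];
  }
  const mid = Math.floor(count / 2);
  const median = count % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
  return {
    name,
    count,
    mean: sum / count,
    median,
    percentile95: nearestRank(sorted, 95),
    percentile99: nearestRank(sorted, 99),
    max: sorted[count - 1],
  };
}
```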
304+
305+
#### System Metrics
306+
307+
| Metric Name | Properties |
|-------------|------------|
| `cpu` | `user` (μs), `system` (μs) |
| `memory` | `rss`, `heapTotal`, `heapUsed`, `external`, `arrayBuffers` |
| `httpAgent` | `sockets`, `freeSockets`, `pendingRequests` |
| `incomingRequest` | `total`, `closed`, `aborted` |
313+
314+
### Billing Metrics
315+
316+
**Source:** `worker/runtime/http/middlewares/timings.ts`
317+
318+
Exported with the `__VTEX_IO_BILLING` flag for usage tracking:
319+
320+
```json
{
  "__VTEX_IO_BILLING": "true",
  "account": "...",
  "app": "...",
  "handler": "...",
  "production": true,
  "routeType": "public_route",
  "timestamp": 1234567890,
  "type": "process-time",
  "value": 150,
  "vendor": "vtex",
  "workspace": "master"
}
```
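
For anyone consuming these log lines, a typed view of the record can help. The sketch below mirrors the field names and value types of the example above; `BillingRecord` and `parseBillingLine` are hypothetical names, and the validation is deliberately minimal (it checks only the flag and that `value` is a number).

```typescript
// Field names and types follow the example record above.
interface BillingRecord {
  __VTEX_IO_BILLING: string;
  account: string;
  app: string;
  handler: string;
  production: boolean;
  routeType: string;
  timestamp: number;
  type: string;
  value: number;
  vendor: string;
  workspace: string;
}

// Returns the record if the line is JSON carrying the billing flag, else undefined.
function parseBillingLine(line: string): BillingRecord | undefined {
  let parsed: unknown;
  try {
    parsed = JSON.parse(line);
  } catch (err) {
    return undefined; // Not JSON, so not a billing line.
  }
  if (typeof parsed !== "object" || parsed === null) {
    return undefined;
  }
  const candidate = parsed as Partial<BillingRecord>;
  if (candidate.__VTEX_IO_BILLING !== "true" || typeof candidate.value !== "number") {
    return undefined;
  }
  return candidate as BillingRecord;
}
```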
335+
336+
---
337+
338+
## Related Documentation
339+
340+
- [Migration Guide](./METRICS_OVERVIEW.md) - Patterns and best practices for migrating to diagnostics-based metrics
341+
