4 changes: 4 additions & 0 deletions docs/index.rst
@@ -117,6 +117,10 @@ application using OpenTelemetry. One call to ``enable_tracing()`` instruments
query sessions, transactions, and connection pool operations — so you can
visualize request flow in Jaeger, Grafana, or any OpenTelemetry-compatible backend.

The same page also covers client-side metrics. ``enable_metrics()`` exposes operation
latency, retry cost, and query session pool metrics through an OpenTelemetry
``MeterProvider``.


API Reference
-------------
186 changes: 166 additions & 20 deletions docs/opentelemetry.rst
@@ -1,14 +1,22 @@
OpenTelemetry Tracing
=====================

The SDK provides built-in distributed tracing via `OpenTelemetry <https://opentelemetry.io/>`_.
When enabled, key YDB operations — such as session creation, query execution, transaction
commit/rollback, and driver initialization — produce OpenTelemetry spans. Trace
context is automatically propagated to the YDB server through gRPC metadata using the
OpenTelemetry
=============

The SDK provides built-in distributed tracing and client-side metrics via
`OpenTelemetry <https://opentelemetry.io/>`_. When tracing is enabled, key YDB
operations — such as session creation, query execution, transaction commit/rollback,
and driver initialization — produce OpenTelemetry spans. Trace context is automatically
propagated to the YDB server through gRPC metadata using the
`W3C Trace Context <https://www.w3.org/TR/trace-context/>`_ standard.

Tracing is **zero-cost when disabled**: the SDK uses no-op stubs by default, so there is
no overhead unless you explicitly opt in.
Metrics expose operation latency and failures, retry cost, and query session pool state.
Tracing and metrics are configured independently: enabling one does not require enabling
the other.

Instrumentation is **zero-cost when disabled**: the SDK uses no-op tracing and
metrics registries by default, so importing the SDK does not import OpenTelemetry
or create metric instruments unless you explicitly opt in. ``enable_tracing()``
loads the tracing plugin, while ``enable_metrics()`` loads the metrics plugin and
replaces the no-op metrics registry with an OpenTelemetry-backed registry.


Installation
@@ -22,7 +30,7 @@ OpenTelemetry packages are not included by default. Install the SDK with the
pip install ydb[opentelemetry]

This pulls in ``opentelemetry-api``. You will also need ``opentelemetry-sdk`` and an
exporter for your tracing backend, for example:
exporter for your tracing or metrics backend, for example:

.. code-block:: sh

@@ -73,6 +81,53 @@ Repeated calls to ``enable_tracing()`` do nothing until you call ``disable_traci
which removes hooks so you can reconfigure or turn instrumentation off.


Enabling Metrics
----------------

Call ``enable_metrics()`` once, after configuring your OpenTelemetry meter provider
and before creating YDB drivers or query session pools:

.. code-block:: python

    from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
    from opentelemetry.sdk.metrics import MeterProvider
    from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
    from opentelemetry.sdk.resources import Resource

    import ydb
    from ydb.opentelemetry import enable_metrics

    # 1. Set up OpenTelemetry
    resource = Resource(attributes={"service.name": "my-service"})
    metric_reader = PeriodicExportingMetricReader(
        OTLPMetricExporter(endpoint="http://localhost:4317"),
        export_interval_millis=1000,
    )
    meter_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])

    # 2. Enable YDB metrics
    enable_metrics(meter_provider)

    # 3. Use the SDK as usual; metrics are recorded automatically
    with ydb.Driver(endpoint="grpc://localhost:2136", database="/local") as driver:
        driver.wait(timeout=5)
        with ydb.QuerySessionPool(driver, name="main-pool") as pool:
            pool.execute_with_retries("SELECT 1")

    # Flush pending metrics and stop the periodic reader
    meter_provider.shutdown()

``enable_metrics()`` accepts an optional ``meter_provider`` argument. If omitted, the
SDK obtains a meter named ``"ydb.sdk"`` from the global meter provider.

Repeated calls to ``enable_metrics()`` do nothing until you call
``disable_metrics()``, which clears the in-memory observable metric values and allows
metrics to be reconfigured. After metrics are disabled, the SDK restores the no-op
metrics registry, so metric recording calls go back to being cheap no-ops.
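
The enable/disable behaviour described above can be pictured with a small, self-contained sketch. It is not the SDK's implementation: ``NoopRegistry``, ``RecordingRegistry``, ``enable()``, ``disable()``, and ``current_registry()`` are hypothetical names, and the recording registry only stands in for an OpenTelemetry-backed one so that the example runs without extra packages.

```python
class NoopRegistry:
    """Default registry: every recording call returns immediately."""

    def record_duration(self, name, seconds):
        pass


class RecordingRegistry:
    """Hypothetical stand-in for an OpenTelemetry-backed registry."""

    def __init__(self):
        self.samples = []

    def record_duration(self, name, seconds):
        self.samples.append((name, seconds))


_registry = NoopRegistry()


def enable():
    """Swap in the recording registry; repeated calls do nothing."""
    global _registry
    if isinstance(_registry, NoopRegistry):
        _registry = RecordingRegistry()


def disable():
    """Drop recorded state and restore the no-op registry."""
    global _registry
    _registry = NoopRegistry()


def current_registry():
    """Return whichever registry is active right now."""
    return _registry
```

Call sites always go through ``current_registry()``, so they never need to check whether metrics are enabled; the swap decides whether anything is recorded.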

Metrics are independent from tracing. If both ``enable_tracing()`` and
``enable_metrics()`` are called, YDB client operations produce both spans and metrics.


What Is Instrumented
--------------------

@@ -171,6 +226,93 @@ On errors, the span also records:
- ``db.response.status_code`` — the YDB status code name (e.g. ``"SCHEME_ERROR"``).


Metric Instruments
------------------

The SDK creates the following instruments with meter name ``"ydb.sdk"``:

.. list-table::
   :header-rows: 1
   :widths: 30 15 15 40

   * - Metric
     - Instrument
     - Unit
     - Description
   * - ``db.client.operation.duration``
     - Histogram
     - ``s``
     - Latency of user-visible YDB client operations.
   * - ``ydb.client.operation.failed``
     - Counter
     - ``{command}``
     - Failed user-visible YDB client operations.
   * - ``ydb.query.session.create_time``
     - Histogram
     - ``s``
     - Time spent creating a query session.
   * - ``ydb.query.session.pending_requests``
     - UpDownCounter
     - ``{request}``
     - Requests currently waiting for a session from the pool.
   * - ``ydb.query.session.timeouts``
     - Counter
     - ``{connection}``
     - Session acquisition timeouts.
   * - ``ydb.query.session.count``
     - ObservableUpDownCounter
     - ``{connection}``
     - Current number of open query sessions by pool and state.
   * - ``ydb.query.session.max``
     - ObservableUpDownCounter
     - ``{connection}``
     - Maximum configured number of sessions for a query session pool.
   * - ``ydb.query.session.min``
     - ObservableUpDownCounter
     - ``{connection}``
     - Minimum configured number of sessions for a query session pool. The SDK does not configure
       a pool minimum, so this metric is always reported as ``0``.
   * - ``ydb.client.retry.duration``
     - Histogram
     - ``s``
     - Total user-visible duration of a logical retried operation, including attempts and backoff.
   * - ``ydb.client.retry.attempts``
     - Histogram
     - ``{attempt}``
     - Number of attempts performed for one logical retried operation.

Operation metrics carry only a small, fixed set of attributes:

.. list-table::
   :header-rows: 1
   :widths: 35 65

   * - Attribute
     - Description
   * - ``database``
     - Database path.
   * - ``endpoint``
     - Configured endpoint in ``host:port`` form.
   * - ``operation.name``
     - SDK operation name without the ``ydb.`` prefix, for example ``"ExecuteQuery"``.
   * - ``status_code``
     - Added only to ``ydb.client.operation.failed``.

Operation metrics are recorded for ``ExecuteQuery``, ``Commit``, ``Rollback``,
``CreateSession``, and ``BeginTransaction``.
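
To make the attribute rules above concrete, here is a minimal sketch of how such an attribute set could be assembled. The helper ``operation_attributes()`` is hypothetical and not part of the SDK; it assumes the endpoint is given as a URL such as ``grpc://localhost:2136``.

```python
from urllib.parse import urlsplit


def operation_attributes(endpoint, database, operation, status_code=None):
    """Build the attribute set for one operation metric (illustrative only)."""
    parts = urlsplit(endpoint)
    name = operation
    if name.startswith("ydb."):
        # Metric attributes use the operation name without the "ydb." prefix.
        name = name[len("ydb."):]
    attrs = {
        "database": database,
        "endpoint": f"{parts.hostname}:{parts.port}",
        "operation.name": name,
    }
    if status_code is not None:
        # Only the failure counter carries the status code.
        attrs["status_code"] = status_code
    return attrs
```

Keeping ``status_code`` off the duration histogram means the latency series for an operation is not split by outcome.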

Query session metrics carry the ``ydb.query.session.pool.name`` attribute. When
``name`` is not passed to ``QuerySessionPool``, the SDK falls back to the YDB
connection string of the driver, ``<endpoint><database>`` (for example
``grpc://localhost:2136/local``), so the pool is identifiable in dashboards out of
the box. When several pools share a connection string, set the label explicitly with
``QuerySessionPool(..., name="main-pool")``; this works for both synchronous and
asynchronous pools. ``ydb.query.session.count`` also carries
``ydb.query.session.state``, whose value is ``"idle"`` or ``"used"``.
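
The fallback rule for the pool label can be sketched as follows. ``pool_label()`` is a hypothetical helper written for illustration; the SDK's own code may be organised differently.

```python
def pool_label(endpoint, database, name=None):
    """Return the explicit pool name, or fall back to endpoint + database."""
    if name:
        return name
    return f"{endpoint}{database}"


# Two unnamed pools on the same driver get the same label,
# which is why an explicit name helps in that case.
default_a = pool_label("grpc://localhost:2136", "/local")
default_b = pool_label("grpc://localhost:2136", "/local")
named = pool_label("grpc://localhost:2136", "/local", name="main-pool")
```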

Retry metrics are recorded without attributes.
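
What the two retry instruments measure can be illustrated with a plain retry loop. This is only a sketch under assumptions: ``run_with_retries()``, its parameters, and the use of ``RuntimeError`` as the retriable error are hypothetical, and the SDK's retry logic is more elaborate.

```python
import time


def run_with_retries(operation, max_attempts=3, backoff_seconds=0.01):
    """Run `operation` and return (result, attempts, total_seconds)."""
    started = time.monotonic()
    attempts = 0
    while True:
        attempts += 1
        try:
            result = operation()
            break
        except RuntimeError:
            if attempts >= max_attempts:
                raise
            # Backoff time counts toward the total duration.
            time.sleep(backoff_seconds)
    return result, attempts, time.monotonic() - started
```

Here ``attempts`` corresponds to one sample for ``ydb.client.retry.attempts`` and the elapsed time to one sample for ``ydb.client.retry.duration``: one value per logical operation, not per attempt.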


Trace Context Propagation
-------------------------

@@ -189,24 +331,25 @@ request path.
Async Usage
-----------

Tracing works identically with the async driver. Call ``enable_tracing()`` once at
startup:
Tracing and metrics work identically with the async driver. Call
``enable_tracing()`` and/or ``enable_metrics()`` once at startup:

.. code-block:: python

    import asyncio
    import ydb
    from ydb.opentelemetry import enable_tracing
    from ydb.opentelemetry import enable_metrics, enable_tracing

    enable_tracing()
    enable_metrics()

    async def main():
        async with ydb.aio.Driver(
            endpoint="grpc://localhost:2136",
            database="/local",
        ) as driver:
            await driver.wait(timeout=5)
            async with ydb.aio.QuerySessionPool(driver) as pool:
            async with ydb.aio.QuerySessionPool(driver, name="async-main-pool") as pool:
                await pool.execute_with_retries("SELECT 1")

    asyncio.run(main())
@@ -229,12 +372,14 @@ To use a specific tracer instead of the global one:
Running the Examples
--------------------

The runnable script is ``examples/opentelemetry/otel_example.py`` (bank table + concurrent
Serializable transactions and ``app_startup`` / ``example_tli`` application spans). **Start
Docker (YDB or the full stack) first**, then install and run on the host — see
``examples/opentelemetry/README.md`` for the full order of commands and environment variables.
The runnable script is ``examples/opentelemetry/otel_example.py``. It demonstrates both
tracing and metrics: bank table + concurrent Serializable transactions,
``app_startup`` / ``example_tli`` application spans, and SDK metrics exported through
OTLP. **Start Docker (YDB or the full stack) first**, then install and run on the host
— see ``examples/opentelemetry/README.md`` for the full order of commands and
environment variables.

**Full stack in one command** (YDB + OTLP + Tempo + Grafana; the ``otel-example`` service is built from ``examples/opentelemetry/Dockerfile`` and runs the script once):
**Full stack in one command** (YDB + OTLP + Tempo + Grafana + Prometheus; the ``otel-example`` service is built from ``examples/opentelemetry/Dockerfile`` and runs the script once):

.. code-block:: sh

@@ -250,4 +395,5 @@ The first run builds the ``otel-example`` image from the local SDK source; subse
pip install -e '.[opentelemetry]' -r examples/opentelemetry/requirements.txt
python examples/opentelemetry/otel_example.py

Open `http://localhost:3000 <http://localhost:3000>`_ (Grafana) to explore traces via Tempo.
Open `http://localhost:3000 <http://localhost:3000>`_ (Grafana) to explore traces via
Tempo and metrics through the configured Prometheus data source.
11 changes: 6 additions & 5 deletions examples/opentelemetry/Dockerfile
@@ -1,11 +1,13 @@
# Isolated image for the OpenTelemetry demo. Build context is the repository root.
# Isolated image for the OpenTelemetry demo scripts. Build context is the repository root.
#
# docker compose -f examples/opentelemetry/compose-e2e.yaml build otel-example
# docker compose -f examples/opentelemetry/compose-e2e.yaml build
#
# A separate ``.dockerignore`` at the repo root keeps the context small.

FROM python:3.11-slim

ENV PYTHONUNBUFFERED=1

WORKDIR /app

# Dependency layer: copy only what setup.py needs so changes to the demo script do
@@ -15,7 +17,6 @@ COPY ydb ./ydb
COPY examples/opentelemetry/requirements.txt ./examples/opentelemetry/requirements.txt
RUN pip install --no-cache-dir -e '.[opentelemetry]' -r examples/opentelemetry/requirements.txt

# Demo script.
# Demo scripts.
COPY examples/opentelemetry/otel_example.py ./examples/opentelemetry/otel_example.py

CMD ["python", "examples/opentelemetry/otel_example.py"]
COPY examples/opentelemetry/load_tank.py ./examples/opentelemetry/load_tank.py
72 changes: 64 additions & 8 deletions examples/opentelemetry/README.md
@@ -1,7 +1,15 @@
# OpenTelemetry example (YDB Python SDK)

Async demo in [`otel_example.py`](otel_example.py): OTLP export, `enable_tracing()`,
`app_startup` and `example_tli` application spans, bank table, Serializable transactions (TLI-style load).
`enable_metrics()`, `app_startup` and `example_tli` application spans, SDK client
metrics, bank table, Serializable transactions (TLI-style load).

[`load_tank.py`](load_tank.py) runs a small step-like load profile for the
metrics dashboard:

```text
Peak RPS -> Medium RPS -> Min RPS -> Medium RPS -> repeat
```
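
The step profile can be pictured as a repeating schedule. The sketch below is not taken from `load_tank.py`; `PHASES` and `schedule()` are hypothetical names, and the rates and durations are the documented defaults of the `LOAD_TANK_*` variables.

```python
from itertools import cycle

# (phase, target RPS, duration in seconds), using the documented defaults.
PHASES = [
    ("peak", 120, 60),
    ("medium", 30, 90),
    ("min", 3, 60),
    ("medium", 30, 90),
]


def schedule(total_time):
    """Yield (phase, rps, seconds) steps until total_time is used up."""
    remaining = total_time
    for phase, rps, duration in cycle(PHASES):
        if remaining <= 0:
            return
        step = min(duration, remaining)
        yield phase, rps, step
        remaining -= step
```

With a total time of 330 seconds this yields one full cycle of 300 seconds followed by a truncated 30-second peak step.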

Most steps assume the **repository root** as the current directory; the install step also shows the variant from this folder.

@@ -17,7 +25,10 @@ docker compose up -d
# wait until the ydb container is healthy / port 2136 is open, then continue
```

**Full stack** (YDB + OTLP collector + Tempo + Grafana; the `otel-example` service is built from a `Dockerfile` and runs the script once inside Compose). The compose file is `compose-e2e.yaml` next to this README.
**Full stack** (YDB + OTLP collector + Tempo + Prometheus + Grafana; the
`otel-example` service runs the tracing/metrics demo once, and `load-generator`
runs the metrics load tank). The compose file is `compose-e2e.yaml` next to this
README.

```sh
cd /path/to/ydb-python-sdk
@@ -34,9 +45,29 @@ docker compose -f compose-e2e.yaml up --build
The first run builds the `otel-example` image from the local SDK source (`Dockerfile` in this folder, `.dockerignore` at the repo root keeps the context small). Subsequent runs reuse the cached image; pass `--build` if you change the SDK or the demo script.

Grafana: http://localhost:3000
Prometheus: http://localhost:9090

Grafana is provisioned with the **YDB Python SDK Metrics** dashboard. It uses
Prometheus queries for SDK metrics such as `db_client_operation_duration`,
`ydb_client_operation_failed`, `ydb_query_session_count`,
`ydb_query_session_pending_requests`, `ydb_query_session_create_time`, and
`ydb_client_retry_duration`. Use Grafana Explore for ad-hoc traces through Tempo
and metrics through Prometheus.
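
The Prometheus names above differ from the OpenTelemetry instrument names only in their separators. A simplified sketch of that mapping follows; `prometheus_name()` is a hypothetical helper, and real exporters may additionally append unit or `_total` suffixes depending on their configuration.

```python
def prometheus_name(otel_name):
    """Simplified mapping: dots in the instrument name become underscores."""
    return otel_name.replace(".", "_")


base = prometheus_name("db.client.operation.duration")
# Prometheus exposes a histogram as three series families.
histogram_series = [f"{base}_bucket", f"{base}_sum", f"{base}_count"]
```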

The SDK configures explicit OpenTelemetry histogram bucket boundaries for its
own duration and retry-attempt metrics. Duration values are recorded in seconds,
with sub-millisecond and millisecond-scale buckets so Grafana percentiles show
meaningful latency distributions for fast local YDB operations.
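
To see why fine-grained boundaries matter for fast operations, consider this sketch of explicit-bucket counting. The boundary values are hypothetical (the SDK's actual boundaries are not listed here), and `bucket_index()` / `bucket_counts()` are illustrative helpers.

```python
from bisect import bisect_left

# Hypothetical boundaries in seconds; the SDK's real values may differ.
BOUNDARIES = [0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0]


def bucket_index(seconds):
    """Index of the first bucket whose upper bound is >= seconds."""
    return bisect_left(BOUNDARIES, seconds)


def bucket_counts(samples):
    """Count samples per bucket; the last slot is the overflow bucket."""
    counts = [0] * (len(BOUNDARIES) + 1)
    for value in samples:
        counts[bucket_index(value)] += 1
    return counts
```

If the smallest boundary were much larger than typical latencies, every fast sample would land in the first bucket and percentile estimates would lose their resolution.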

Metrics are wired through a dedicated SDK metrics plugin. Until `enable_metrics()`
is called, the SDK uses a no-op metrics registry, and the hot-path metric helpers
do not import any OpenTelemetry metrics packages.

**Logs for `otel-example`:** the container name is prefixed (e.g. `opentelemetry-otel-example-1`); use `docker compose -f examples/opentelemetry/compose-e2e.yaml ps` or `docker ps -a` to find it. The service is one-shot (`restart: "no"`) — it may already have exited.

**Logs for `load-generator`:** the service is also one-shot. It runs for
`LOAD_TANK_TOTAL_TIME` seconds and then exits after flushing metrics.

## 2. Install dependencies (on the host, for a local `python` run)

**From the repository root** (editable SDK + pins from this example):
@@ -63,12 +94,37 @@ pip install -e '../..[opentelemetry]' -r requirements.txt
python examples/opentelemetry/otel_example.py
```

Defaults: YDB `grpc://localhost:2136`, OTLP `http://localhost:4317` (for a local collector, if you use one).
Defaults: YDB `grpc://localhost:2136`, OTLP `http://localhost:4317` (for a local collector, if you use one). The same OTLP endpoint receives both traces and metrics.

Run the load tank against an already running local stack:

```sh
python examples/opentelemetry/load_tank.py
```

## Environment (Docker / overrides)

| Variable | Meaning |
|----------|---------|
| `YDB_ENDPOINT` | e.g. `grpc://ydb:2136` inside the Compose network |
| `YDB_DATABASE` | default `/local` |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | e.g. `http://otel-collector:4317` |
| Variable | Meaning |
|----------|----------------------------------------------------------|
| `YDB_ENDPOINT` | e.g. `grpc://ydb:2136` inside the Compose network |
| `YDB_DATABASE` | default `/local` |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | e.g. `http://otel-collector:4317` |
| `OTEL_SERVICE_NAME` | service name attached to exported telemetry |
| `LOAD_TANK_TOTAL_TIME` | total load duration in seconds, default `6000` |
| `LOAD_TANK_WORKERS` | number of concurrent workers, default `40` |
| `LOAD_TANK_POOL_SIZE` | query session pool size, default `20` |
| `LOAD_TANK_PEAK_RPS` | peak phase target RPS, default `120` |
| `LOAD_TANK_MEDIUM_RPS` | medium phase target RPS, default `30` |
| `LOAD_TANK_MIN_RPS` | low phase target RPS, default `3` |
| `LOAD_TANK_ERROR_RPS` | failed query target RPS, default `1`; set `0` to disable |
| `LOAD_TANK_PRESSURE_POOL_SIZE` | pool size for session pressure metrics, default `1` |
| `LOAD_TANK_PRESSURE_WORKERS` | concurrent contenders for the pressure pool, default `8` |
| `LOAD_TANK_PRESSURE_HOLD_TIME` | seconds to hold the pressure-pool session, default `1.5` |
| `LOAD_TANK_PRESSURE_ACQUIRE_TIMEOUT` | short acquire timeout for timeout metrics, default `1.0` |
| `LOAD_TANK_PRESSURE_INTERVAL` | pause between pressure rounds, default `0.2` |
| `LOAD_TANK_SESSION_CHURN_INTERVAL` | interval for creating fresh sessions, default `2.0` |
| `LOAD_TANK_PEAK_DURATION` | peak phase duration in seconds, default `60` |
| `LOAD_TANK_MEDIUM_DURATION` | medium phase duration in seconds, default `90` |
| `LOAD_TANK_MIN_DURATION` | low phase duration in seconds, default `60` |
| `LOAD_TANK_QUERY` | query executed by workers, default `SELECT 1 AS value` |
| `LOAD_TANK_ERROR_QUERY` | query used to produce failed-operation metrics |
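
As an illustration of how such overrides are typically consumed, the sketch below reads a numeric setting with a fallback to its documented default. `env_float()` is a hypothetical helper and not necessarily how `load_tank.py` parses its configuration.

```python
import os


def env_float(name, default, environ=None):
    """Read a numeric setting, falling back to the documented default."""
    source = os.environ if environ is None else environ
    raw = source.get(name)
    if raw is None or raw == "":
        return default
    return float(raw)


# Passing a dict instead of os.environ keeps the example deterministic.
peak_default = env_float("LOAD_TANK_PEAK_RPS", 120.0, environ={})
peak_override = env_float("LOAD_TANK_PEAK_RPS", 120.0, environ={"LOAD_TANK_PEAK_RPS": "200"})
```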