4 changes: 4 additions & 0 deletions docs/index.rst
@@ -117,6 +117,10 @@ application using OpenTelemetry. One call to ``enable_tracing()`` instruments
query sessions, transactions, and connection pool operations — so you can
visualize request flow in Jaeger, Grafana, or any OpenTelemetry-compatible backend.

The same page also covers client-side metrics. ``enable_metrics()`` exposes operation
latency, retry cost, and query session pool metrics through an OpenTelemetry
``MeterProvider``.


API Reference
-------------
182 changes: 162 additions & 20 deletions docs/opentelemetry.rst
@@ -1,14 +1,22 @@
OpenTelemetry Tracing
=====================

The SDK provides built-in distributed tracing via `OpenTelemetry <https://opentelemetry.io/>`_.
When enabled, key YDB operations — such as session creation, query execution, transaction
commit/rollback, and driver initialization — produce OpenTelemetry spans. Trace
context is automatically propagated to the YDB server through gRPC metadata using the
OpenTelemetry
=============

The SDK provides built-in distributed tracing and client-side metrics via
`OpenTelemetry <https://opentelemetry.io/>`_. When tracing is enabled, key YDB
operations — such as session creation, query execution, transaction commit/rollback,
and driver initialization — produce OpenTelemetry spans. Trace context is automatically
propagated to the YDB server through gRPC metadata using the
`W3C Trace Context <https://www.w3.org/TR/trace-context/>`_ standard.

Tracing is **zero-cost when disabled**: the SDK uses no-op stubs by default, so there is
no overhead unless you explicitly opt in.
Metrics expose operation latency and failures, retry cost, and query session pool state.
Tracing and metrics are configured independently: enabling one does not require enabling
the other.

Instrumentation is **zero-cost when disabled**: the SDK uses no-op tracing and
metrics registries by default, so importing the SDK does not import OpenTelemetry
or create metric instruments unless you explicitly opt in. ``enable_tracing()``
loads the tracing plugin, while ``enable_metrics()`` loads the metrics plugin and
replaces the no-op metrics registry with an OpenTelemetry-backed registry.
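
The no-op-by-default behaviour described above can be pictured with a minimal, standard-library-only sketch. All names here (``NoopRegistry``, ``RecordingRegistry``, ``enable_metrics_sketch``, ``timed_operation``) are hypothetical and are not the SDK's internals; the sketch only illustrates swapping a module-level registry so that the hot path stays cheap until metrics are enabled.

```python
import time


class NoopRegistry:
    """Default registry: every recording call returns immediately."""

    def record_duration(self, name, seconds):
        pass


class RecordingRegistry:
    """Stand-in for an OpenTelemetry-backed registry (hypothetical)."""

    def __init__(self):
        self.samples = []

    def record_duration(self, name, seconds):
        self.samples.append((name, seconds))


# Module-level slot read by the hot path; it starts as the no-op registry.
_registry = NoopRegistry()


def enable_metrics_sketch(registry=None):
    """Swap in a recording registry; repeated calls are ignored until disabled."""
    global _registry
    if not isinstance(_registry, NoopRegistry):
        return _registry
    _registry = RecordingRegistry() if registry is None else registry
    return _registry


def disable_metrics_sketch():
    """Restore the no-op registry so recording calls become cheap again."""
    global _registry
    _registry = NoopRegistry()


def timed_operation(name, func):
    """Hot-path helper: always calls whichever registry is currently installed."""
    start = time.perf_counter()
    try:
        return func()
    finally:
        _registry.record_duration(name, time.perf_counter() - start)
```

Because the hot path never branches on an "enabled" flag, the disabled state costs one no-op method call per operation in this sketch.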


Installation
@@ -22,7 +30,7 @@ OpenTelemetry packages are not included by default. Install the SDK with the
pip install ydb[opentelemetry]

This pulls in ``opentelemetry-api``. You will also need ``opentelemetry-sdk`` and an
exporter for your tracing backend, for example:
exporter for your tracing or metrics backend, for example:

.. code-block:: sh

@@ -73,6 +81,53 @@ Repeated calls to ``enable_tracing()`` do nothing until you call ``disable_traci
which removes hooks so you can reconfigure or turn instrumentation off.


Enabling Metrics
----------------

Call ``enable_metrics()`` once, after configuring your OpenTelemetry meter provider
and before creating YDB drivers or query session pools:

.. code-block:: python

from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource

import ydb
from ydb.opentelemetry import enable_metrics

# 1. Set up OpenTelemetry
resource = Resource(attributes={"service.name": "my-service"})
metric_reader = PeriodicExportingMetricReader(
OTLPMetricExporter(endpoint="http://localhost:4317"),
export_interval_millis=1000,
)
meter_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])

# 2. Enable YDB metrics
enable_metrics(meter_provider)

# 3. Use the SDK as usual — metrics are recorded automatically
with ydb.Driver(endpoint="grpc://localhost:2136", database="/local") as driver:
driver.wait(timeout=5)
with ydb.QuerySessionPool(driver, name="main-pool") as pool:
pool.execute_with_retries("SELECT 1")

meter_provider.shutdown()

``enable_metrics()`` accepts an optional ``meter_provider`` argument. If omitted, the
SDK obtains a meter named ``"ydb.sdk"`` from the global meter provider.

Repeated calls to ``enable_metrics()`` do nothing until you call
``disable_metrics()``, which clears the in-memory observable metric values and allows
metrics to be reconfigured. After disabling metrics, the SDK restores the no-op
metrics registry, so metric recording calls remain cheap no-ops.

Metrics are independent from tracing. If both ``enable_tracing()`` and
``enable_metrics()`` are called, YDB client operations produce both spans and metrics.


What Is Instrumented
--------------------

@@ -171,6 +226,89 @@ On errors, the span also records:
- ``db.response.status_code`` — the YDB status code name (e.g. ``"SCHEME_ERROR"``).


Metric Instruments
------------------

The SDK creates the following instruments with meter name ``"ydb.sdk"``:

.. list-table::
:header-rows: 1
:widths: 30 15 15 40

* - Metric
- Instrument
- Unit
- Description
* - ``db.client.operation.duration``
- Histogram
- ``s``
- Latency of user-visible YDB client operations.
* - ``ydb.client.operation.failed``
- Counter
- ``{command}``
- Failed user-visible YDB client operations.
* - ``ydb.query.session.create_time``
- Histogram
- ``s``
- Time spent creating a query session.
* - ``ydb.query.session.pending_requests``
- UpDownCounter
- ``{request}``
- Requests currently waiting for a session from the pool.
* - ``ydb.query.session.timeouts``
- Counter
- ``{connection}``
- Session acquisition timeouts.
* - ``ydb.query.session.count``
- ObservableUpDownCounter
- ``{connection}``
- Current number of open query sessions by pool and state.
* - ``ydb.query.session.max``
- ObservableUpDownCounter
- ``{connection}``
- Maximum configured number of sessions for a query session pool.
* - ``ydb.query.session.min``
- ObservableUpDownCounter
- ``{connection}``
- Minimum configured number of sessions for a query session pool. The SDK does not configure
a pool minimum, so this metric is always reported as ``0``.
* - ``ydb.client.retry.duration``
- Histogram
- ``s``
- Total user-visible duration of a logical retried operation, including attempts and backoff.
* - ``ydb.client.retry.attempts``
- Histogram
- ``{attempt}``
- Number of attempts performed for one logical retried operation.

Operation metrics carry only the following stable attributes:

.. list-table::
:header-rows: 1
:widths: 35 65

* - Attribute
- Description
* - ``database``
- Database path.
* - ``endpoint``
- Configured endpoint in ``host:port`` form.
* - ``operation.name``
- SDK operation name without the ``ydb.`` prefix, for example ``"ExecuteQuery"``.
* - ``status_code``
- Added only to ``ydb.client.operation.failed``.

Operation metrics are recorded for ``ExecuteQuery``, ``Commit``, ``Rollback``,
``CreateSession``, and ``BeginTransaction``.
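
As an illustration of the attribute table above, the following sketch builds the attribute dictionary for one operation. The helper name ``operation_attributes`` and the way the scheme and the ``ydb.`` prefix are stripped are assumptions made for the example, not the SDK's actual code.

```python
def operation_attributes(database, endpoint, operation_name, status_code=None):
    """Build the attribute set for an operation metric (hypothetical helper)."""
    # Assumption: the endpoint is reduced to host:port by dropping any scheme.
    host_port = endpoint.split("://", 1)[-1]

    # Operation names are reported without the "ydb." prefix.
    prefix = "ydb."
    if operation_name.startswith(prefix):
        operation_name = operation_name[len(prefix):]

    attributes = {
        "database": database,
        "endpoint": host_port,
        "operation.name": operation_name,
    }
    # status_code is attached only when recording a failed operation.
    if status_code is not None:
        attributes["status_code"] = status_code
    return attributes
```

Calling it without ``status_code`` yields the attributes for ``db.client.operation.duration``; passing a status name such as ``"SCHEME_ERROR"`` yields the set used for ``ydb.client.operation.failed``.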

Query session metrics use ``ydb.query.session.pool.name``. The pool name is generated
automatically, or can be set explicitly with ``QuerySessionPool(..., name="main-pool")``
for both synchronous and asynchronous pools. ``ydb.query.session.count`` also includes
``ydb.query.session.state`` with values ``"idle"`` or ``"used"``.
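
An observable instrument reports its values from a callback at collection time. The sketch below shows, under stated assumptions, what such a callback could return for ``ydb.query.session.count``: one observation per pool and state. ``PoolSnapshot`` and ``observe_session_count`` are hypothetical names, and plain tuples stand in for OpenTelemetry ``Observation`` objects.

```python
from dataclasses import dataclass


@dataclass
class PoolSnapshot:
    """Hypothetical point-in-time view of one query session pool."""

    name: str
    idle: int
    used: int


def observe_session_count(pools):
    """Return (value, attributes) pairs, one per pool and session state."""
    observations = []
    for pool in pools:
        for state, value in (("idle", pool.idle), ("used", pool.used)):
            observations.append(
                (
                    value,
                    {
                        "ydb.query.session.pool.name": pool.name,
                        "ydb.query.session.state": state,
                    },
                )
            )
    return observations
```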

Retry metrics are recorded without attributes.
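
To make the two retry metrics concrete, here is a minimal sketch of a retry loop that records one duration and one attempt count per logical operation, including backoff time. The function ``run_with_retries``, its parameters, and the ``record`` callback are hypothetical; the real SDK decides which errors are retriable, whereas this sketch retries on any exception.

```python
import time


def run_with_retries(func, record, max_attempts=3, backoff_seconds=0.01):
    """Run func with retries and record retry metrics once per logical operation."""
    start = time.perf_counter()
    attempts = 0
    try:
        while True:
            attempts += 1
            try:
                return func()
            except Exception:
                # Assumption: every exception is retriable in this sketch.
                if attempts >= max_attempts:
                    raise
                time.sleep(backoff_seconds)
    finally:
        # Recorded on success and on final failure, without attributes.
        record("ydb.client.retry.duration", time.perf_counter() - start)
        record("ydb.client.retry.attempts", attempts)
```

An operation that succeeds on the first try still records one attempt, so the attempts histogram also shows how often no retry was needed.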


Trace Context Propagation
-------------------------

@@ -189,24 +327,25 @@ request path.
Async Usage
-----------

Tracing works identically with the async driver. Call ``enable_tracing()`` once at
startup:
Tracing and metrics work identically with the async driver. Call
``enable_tracing()`` and/or ``enable_metrics()`` once at startup:

.. code-block:: python

import asyncio
import ydb
from ydb.opentelemetry import enable_tracing
from ydb.opentelemetry import enable_metrics, enable_tracing

enable_tracing()
enable_metrics()

async def main():
async with ydb.aio.Driver(
endpoint="grpc://localhost:2136",
database="/local",
) as driver:
await driver.wait(timeout=5)
async with ydb.aio.QuerySessionPool(driver) as pool:
async with ydb.aio.QuerySessionPool(driver, name="async-main-pool") as pool:
await pool.execute_with_retries("SELECT 1")

asyncio.run(main())
@@ -229,12 +368,14 @@ To use a specific tracer instead of the global one:
Running the Examples
--------------------

The runnable script is ``examples/opentelemetry/otel_example.py`` (bank table + concurrent
Serializable transactions and ``app_startup`` / ``example_tli`` application spans). **Start
Docker (YDB or the full stack) first**, then install and run on the host — see
``examples/opentelemetry/README.md`` for the full order of commands and environment variables.
The runnable script is ``examples/opentelemetry/otel_example.py``. It demonstrates both
tracing and metrics: bank table + concurrent Serializable transactions,
``app_startup`` / ``example_tli`` application spans, and SDK metrics exported through
OTLP. **Start Docker (YDB or the full stack) first**, then install and run on the host
— see ``examples/opentelemetry/README.md`` for the full order of commands and
environment variables.

**Full stack in one command** (YDB + OTLP + Tempo + Grafana; the ``otel-example`` service is built from ``examples/opentelemetry/Dockerfile`` and runs the script once):
**Full stack in one command** (YDB + OTLP + Tempo + Grafana + Prometheus; the ``otel-example`` service is built from ``examples/opentelemetry/Dockerfile`` and runs the script once):

.. code-block:: sh

@@ -250,4 +391,5 @@ The first run builds the ``otel-example`` image from the local SDK source; subse
pip install -e '.[opentelemetry]' -r examples/opentelemetry/requirements.txt
python examples/opentelemetry/otel_example.py

Open `http://localhost:3000 <http://localhost:3000>`_ (Grafana) to explore traces via Tempo.
Open `http://localhost:3000 <http://localhost:3000>`_ (Grafana) to explore traces via
Tempo and metrics through the configured Prometheus data source.
11 changes: 6 additions & 5 deletions examples/opentelemetry/Dockerfile
@@ -1,11 +1,13 @@
# Isolated image for the OpenTelemetry demo. Build context is the repository root.
# Isolated image for the OpenTelemetry demo scripts. Build context is the repository root.
#
# docker compose -f examples/opentelemetry/compose-e2e.yaml build otel-example
# docker compose -f examples/opentelemetry/compose-e2e.yaml build
#
# A separate ``.dockerignore`` at the repo root keeps the context small.

FROM python:3.11-slim

ENV PYTHONUNBUFFERED=1

WORKDIR /app

# Dependency layer: copy only what setup.py needs so changes to the demo script do
@@ -15,7 +17,6 @@ COPY ydb ./ydb
COPY examples/opentelemetry/requirements.txt ./examples/opentelemetry/requirements.txt
RUN pip install --no-cache-dir -e '.[opentelemetry]' -r examples/opentelemetry/requirements.txt

# Demo script.
# Demo scripts.
COPY examples/opentelemetry/otel_example.py ./examples/opentelemetry/otel_example.py

CMD ["python", "examples/opentelemetry/otel_example.py"]
COPY examples/opentelemetry/load_tank.py ./examples/opentelemetry/load_tank.py
72 changes: 64 additions & 8 deletions examples/opentelemetry/README.md
@@ -1,7 +1,15 @@
# OpenTelemetry example (YDB Python SDK)

Async demo in [`otel_example.py`](otel_example.py): OTLP export, `enable_tracing()`,
`app_startup` and `example_tli` application spans, bank table, Serializable transactions (TLI-style load).
`enable_metrics()`, `app_startup` and `example_tli` application spans, SDK client
metrics, bank table, Serializable transactions (TLI-style load).

[`load_tank.py`](load_tank.py) runs a small step-like load profile for the
metrics dashboard:

```text
Peak RPS -> Medium RPS -> Min RPS -> Medium RPS -> repeat
```

Most steps assume the **repository root** as the current directory; the install step also shows the variant from this folder.

@@ -17,7 +25,10 @@ docker compose up -d
# wait until the ydb container is healthy / port 2136 is open, then continue
```

**Full stack** (YDB + OTLP collector + Tempo + Grafana; the `otel-example` service is built from a `Dockerfile` and runs the script once inside Compose). The compose file is `compose-e2e.yaml` next to this README.
**Full stack** (YDB + OTLP collector + Tempo + Prometheus + Grafana; the
`otel-example` service runs the tracing/metrics demo once, and `load-generator`
runs the metrics load tank). The compose file is `compose-e2e.yaml` next to this
README.

```sh
cd /path/to/ydb-python-sdk
@@ -34,9 +45,29 @@ docker compose -f compose-e2e.yaml up --build
The first run builds the `otel-example` image from the local SDK source (`Dockerfile` in this folder, `.dockerignore` at the repo root keeps the context small). Subsequent runs reuse the cached image; pass `--build` if you change the SDK or the demo script.

Grafana: http://localhost:3000
Prometheus: http://localhost:9090

Grafana is provisioned with the **YDB Python SDK Metrics** dashboard. It uses
Prometheus queries for SDK metrics such as `db_client_operation_duration`,
`ydb_client_operation_failed`, `ydb_query_session_count`,
`ydb_query_session_pending_requests`, `ydb_query_session_create_time`, and
`ydb_client_retry_duration`. Use Grafana Explore for ad-hoc traces through Tempo
and metrics through Prometheus.

The SDK configures explicit OpenTelemetry histogram bucket boundaries for its
own duration and retry-attempt metrics. Duration values are recorded in seconds,
with sub-millisecond and millisecond-scale buckets so Grafana percentiles show
meaningful latency distributions for fast local YDB operations.
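
A small sketch of why explicit boundaries matter for durations recorded in seconds. The `FINE_BOUNDARIES_S` values are illustrative assumptions, not the SDK's actual boundaries; `DEFAULT_BOUNDARIES` is the default explicit-bucket list from the OpenTelemetry specification as I understand it. Buckets are upper-bound inclusive, which `bisect_left` reproduces.

```python
from bisect import bisect_left

# Hypothetical sub-millisecond to second-scale boundaries, in seconds.
FINE_BOUNDARIES_S = [0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0]

# Default explicit bucket boundaries from the OpenTelemetry specification.
DEFAULT_BOUNDARIES = [0, 5, 10, 25, 50, 75, 100, 250, 500, 750,
                      1000, 2500, 5000, 7500, 10000]


def bucket_counts(samples, boundaries):
    """Count samples per bucket; bucket i covers (boundaries[i-1], boundaries[i]]."""
    counts = [0] * (len(boundaries) + 1)
    for value in samples:
        counts[bisect_left(boundaries, value)] += 1
    return counts


# Typical fast local operations, in seconds.
samples = [0.0003, 0.0008, 0.001, 0.004, 2.0]
fine = bucket_counts(samples, FINE_BOUNDARIES_S)
coarse = bucket_counts(samples, DEFAULT_BOUNDARIES)
```

With the fine boundaries the samples spread over several buckets, while with the default list every sample lands in the single `(0, 5]` bucket, so percentile estimates would carry no information.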

Metrics are wired through a dedicated SDK metrics plugin. Until `enable_metrics()`
is called, the SDK uses a no-op metrics registry and does not import
OpenTelemetry metrics packages from the hot-path metric helpers.

**Logs for `otel-example`:** the container name is prefixed (e.g. `opentelemetry-otel-example-1`); use `docker compose -f examples/opentelemetry/compose-e2e.yaml ps` or `docker ps -a` to find it. The service is one-shot (`restart: "no"`) — it may already have exited.

**Logs for `load-generator`:** the service is also one-shot. It runs for
`LOAD_TANK_TOTAL_TIME` seconds and then exits after flushing metrics.

## 2. Install dependencies (on the host, for a local `python` run)

**From the repository root** (editable SDK + pins from this example):
@@ -63,12 +94,37 @@ pip install -e '../..[opentelemetry]' -r requirements.txt
python examples/opentelemetry/otel_example.py
```

Defaults: YDB `grpc://localhost:2136`, OTLP `http://localhost:4317` (for a local collector, if you use one).
Defaults: YDB `grpc://localhost:2136`, OTLP `http://localhost:4317` (for a local collector, if you use one). The same OTLP endpoint receives both traces and metrics.

Run the load tank against an already running local stack:

```sh
python examples/opentelemetry/load_tank.py
```

## Environment (Docker / overrides)

| Variable | Meaning |
|----------|---------|
| `YDB_ENDPOINT` | e.g. `grpc://ydb:2136` inside the Compose network |
| `YDB_DATABASE` | default `/local` |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | e.g. `http://otel-collector:4317` |
| Variable | Meaning |
|----------|----------------------------------------------------------|
| `YDB_ENDPOINT` | e.g. `grpc://ydb:2136` inside the Compose network |
| `YDB_DATABASE` | default `/local` |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | e.g. `http://otel-collector:4317` |
| `OTEL_SERVICE_NAME` | service name attached to exported telemetry |
| `LOAD_TANK_TOTAL_TIME` | total load duration in seconds, default `6000` |
| `LOAD_TANK_WORKERS` | number of concurrent workers, default `40` |
| `LOAD_TANK_POOL_SIZE` | query session pool size, default `20` |
| `LOAD_TANK_PEAK_RPS` | peak phase target RPS, default `120` |
| `LOAD_TANK_MEDIUM_RPS` | medium phase target RPS, default `30` |
| `LOAD_TANK_MIN_RPS` | low phase target RPS, default `3` |
| `LOAD_TANK_ERROR_RPS` | failed query target RPS, default `1`; set `0` to disable |
| `LOAD_TANK_PRESSURE_POOL_SIZE` | pool size for session pressure metrics, default `1` |
| `LOAD_TANK_PRESSURE_WORKERS` | concurrent contenders for the pressure pool, default `8` |
| `LOAD_TANK_PRESSURE_HOLD_TIME` | seconds to hold the pressure-pool session, default `1.5` |
| `LOAD_TANK_PRESSURE_ACQUIRE_TIMEOUT` | short acquire timeout for timeout metrics, default `1.0` |
| `LOAD_TANK_PRESSURE_INTERVAL` | pause between pressure rounds, default `0.2` |
| `LOAD_TANK_SESSION_CHURN_INTERVAL` | interval for creating fresh sessions, default `2.0` |
| `LOAD_TANK_PEAK_DURATION` | peak phase duration in seconds, default `60` |
| `LOAD_TANK_MEDIUM_DURATION` | medium phase duration in seconds, default `90` |
| `LOAD_TANK_MIN_DURATION` | low phase duration in seconds, default `60` |
| `LOAD_TANK_QUERY` | query executed by workers, default `SELECT 1 AS value` |
| `LOAD_TANK_ERROR_QUERY` | query used to produce failed-operation metrics |
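
The step-like profile (Peak -> Medium -> Min -> Medium -> repeat) can be sketched as a lookup from elapsed time to target RPS. This is not the code of `load_tank.py`; the function `target_rps` and the `PHASES` list are hypothetical, filled in with the default RPS and duration values from the table above.

```python
# (phase name, target RPS, duration in seconds), using the documented defaults.
PHASES = [
    ("peak", 120, 60),
    ("medium", 30, 90),
    ("min", 3, 60),
    ("medium", 30, 90),
]


def target_rps(elapsed_seconds, phases=PHASES):
    """Return (phase name, target RPS) for a point in the repeating cycle."""
    cycle_length = sum(duration for _, _, duration in phases)
    offset = elapsed_seconds % cycle_length
    for name, rps, duration in phases:
        if offset < duration:
            return name, rps
        offset -= duration
    # Only reachable through floating-point edge cases; fall back to the last phase.
    name, rps, _ = phases[-1]
    return name, rps
```

With these defaults one full cycle lasts 300 seconds, so `target_rps(300)` is back in the peak phase.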