ydb-platform · tewbo · May 11, 2026 · May 12, 2026 · May 13, 2026 · May 14, 2026
diff --git a/docs/index.rst b/docs/index.rst
@@ -117,6 +117,10 @@ application using OpenTelemetry. One call to ``enable_tracing()`` instruments
 query sessions, transactions, and connection pool operations — so you can
 visualize request flow in Jaeger, Grafana, or any OpenTelemetry-compatible backend.
 
+The same page also covers client-side metrics. ``enable_metrics()`` exposes operation
+latency, retry cost, and query session pool metrics through an OpenTelemetry
+``MeterProvider``.
+
 
 API Reference
 -------------

diff --git a/docs/opentelemetry.rst b/docs/opentelemetry.rst
@@ -1,14 +1,22 @@
-OpenTelemetry Tracing
-=====================
-
-The SDK provides built-in distributed tracing via `OpenTelemetry <https://opentelemetry.io/>`_.
-When enabled, key YDB operations — such as session creation, query execution, transaction
-commit/rollback, and driver initialization — produce OpenTelemetry spans. Trace
-context is automatically propagated to the YDB server through gRPC metadata using the
+OpenTelemetry
+=============
+
+The SDK provides built-in distributed tracing and client-side metrics via
+`OpenTelemetry <https://opentelemetry.io/>`_. When tracing is enabled, key YDB
+operations — such as session creation, query execution, transaction commit/rollback,
+and driver initialization — produce OpenTelemetry spans. Trace context is automatically
+propagated to the YDB server through gRPC metadata using the
 `W3C Trace Context <https://www.w3.org/TR/trace-context/>`_ standard.
 
-Tracing is **zero-cost when disabled**: the SDK uses no-op stubs by default, so there is
-no overhead unless you explicitly opt in.
+Metrics expose operation latency and failures, retry cost, and query session pool state.
+Tracing and metrics are configured independently: enabling one does not require enabling
+the other.
+
+Instrumentation is **zero-cost when disabled**: the SDK uses no-op tracing and
+metrics registries by default, so importing the SDK does not import OpenTelemetry
+or create metric instruments unless you explicitly opt in. ``enable_tracing()``
+loads the tracing plugin, while ``enable_metrics()`` loads the metrics plugin and
+replaces the no-op metrics registry with an OpenTelemetry-backed registry.
 
 
 Installation
@@ -22,7 +30,7 @@ OpenTelemetry packages are not included by default. Install the SDK with the
     pip install ydb[opentelemetry]
 
 This pulls in ``opentelemetry-api``. You will also need ``opentelemetry-sdk`` and an
-exporter for your tracing backend, for example:
+exporter for your tracing or metrics backend, for example:
 
 .. code-block:: sh
 
@@ -73,6 +81,53 @@ Repeated calls to ``enable_tracing()`` do nothing until you call ``disable_traci
 which removes hooks so you can reconfigure or turn instrumentation off.
 
 
+Enabling Metrics
+----------------
+
+Call ``enable_metrics()`` once, after configuring your OpenTelemetry meter provider
+and before creating YDB drivers or query session pools:
+
+.. code-block:: python
+
+    from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
+    from opentelemetry.sdk.metrics import MeterProvider
+    from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
+    from opentelemetry.sdk.resources import Resource
+
+    import ydb
+    from ydb.opentelemetry import enable_metrics
+
+    # 1. Set up OpenTelemetry
+    resource = Resource(attributes={"service.name": "my-service"})
+    metric_reader = PeriodicExportingMetricReader(
+        OTLPMetricExporter(endpoint="http://localhost:4317"),
+        export_interval_millis=1000,
+    )
+    meter_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
+
+    # 2. Enable YDB metrics
+    enable_metrics(meter_provider)
+
+    # 3. Use the SDK as usual — metrics are recorded automatically
+    with ydb.Driver(endpoint="grpc://localhost:2136", database="/local") as driver:
+        driver.wait(timeout=5)
+        with ydb.QuerySessionPool(driver, name="main-pool") as pool:
+            pool.execute_with_retries("SELECT 1")
+
+    meter_provider.shutdown()
+
+``enable_metrics()`` accepts an optional ``meter_provider`` argument. If omitted, the
+SDK obtains a meter named ``"ydb.sdk"`` from the global meter provider.
+
+Repeated calls to ``enable_metrics()`` do nothing until you call
+``disable_metrics()``, which clears the in-memory observable metric values and allows
+metrics to be reconfigured. After disabling metrics, the SDK restores the no-op
+metrics registry, so metric recording calls go back to being cheap no-ops.
+
+Metrics are independent from tracing. If both ``enable_tracing()`` and
+``enable_metrics()`` are called, YDB client operations produce both spans and metrics.
+
+
 What Is Instrumented
 --------------------
 
@@ -171,6 +226,89 @@ On errors, the span also records:
 - ``db.response.status_code`` — the YDB status code name (e.g. ``"SCHEME_ERROR"``).
 
 
+Metric Instruments
+------------------
+
+The SDK creates the following instruments with meter name ``"ydb.sdk"``:
+
+.. list-table::
+   :header-rows: 1
+   :widths: 30 15 15 40
+
+   * - Metric
+     - Instrument
+     - Unit
+     - Description
+   * - ``db.client.operation.duration``
+     - Histogram
+     - ``s``
+     - Latency of user-visible YDB client operations.
+   * - ``ydb.client.operation.failed``
+     - Counter
+     - ``{command}``
+     - Failed user-visible YDB client operations.
+   * - ``ydb.query.session.create_time``
+     - Histogram
+     - ``s``
+     - Time spent creating a query session.
+   * - ``ydb.query.session.pending_requests``
+     - UpDownCounter
+     - ``{request}``
+     - Requests currently waiting for a session from the pool.
+   * - ``ydb.query.session.timeouts``
+     - Counter
+     - ``{connection}``
+     - Session acquisition timeouts.
+   * - ``ydb.query.session.count``
+     - ObservableUpDownCounter
+     - ``{connection}``
+     - Current number of open query sessions by pool and state.
+   * - ``ydb.query.session.max``
+     - ObservableUpDownCounter
+     - ``{connection}``
+     - Maximum configured number of sessions for a query session pool.
+   * - ``ydb.query.session.min``
+     - ObservableUpDownCounter
+     - ``{connection}``
+     - Minimum configured number of sessions for a query session pool. The SDK does not configure
+       a pool minimum, so this metric is always reported as ``0``.
+   * - ``ydb.client.retry.duration``
+     - Histogram
+     - ``s``
+     - Total user-visible duration of a logical retried operation, including attempts and backoff.
+   * - ``ydb.client.retry.attempts``
+     - Histogram
+     - ``{attempt}``
+     - Number of attempts performed for one logical retried operation.
+
+Operation metrics carry only the following stable attributes:
+
+.. list-table::
+   :header-rows: 1
+   :widths: 35 65
+
+   * - Attribute
+     - Description
+   * - ``database``
+     - Database path.
+   * - ``endpoint``
+     - Configured endpoint in ``host:port`` form.
+   * - ``operation.name``
+     - SDK operation name without the ``ydb.`` prefix, for example ``"ExecuteQuery"``.
+   * - ``status_code``
+     - YDB status of the failed operation. Added only to ``ydb.client.operation.failed``.
+
+Operation metrics are recorded for ``ExecuteQuery``, ``Commit``, ``Rollback``,
+``CreateSession``, and ``BeginTransaction``.
+
+Query session metrics are tagged with ``ydb.query.session.pool.name``. The pool name is
+generated automatically, or can be set explicitly with ``QuerySessionPool(..., name="main-pool")``
+for both synchronous and asynchronous pools. ``ydb.query.session.count`` also includes
+``ydb.query.session.state`` with values ``"idle"`` or ``"used"``.
+
+Retry metrics are recorded without attributes.
+
+
 Trace Context Propagation
 -------------------------
 
@@ -189,24 +327,25 @@ request path.
 Async Usage
 -----------
 
-Tracing works identically with the async driver. Call ``enable_tracing()`` once at
-startup:
+Tracing and metrics work identically with the async driver. Call
+``enable_tracing()`` and/or ``enable_metrics()`` once at startup:
 
 .. code-block:: python
 
     import asyncio
     import ydb
-    from ydb.opentelemetry import enable_tracing
+    from ydb.opentelemetry import enable_metrics, enable_tracing
 
     enable_tracing()
+    enable_metrics()
 
     async def main():
         async with ydb.aio.Driver(
             endpoint="grpc://localhost:2136",
             database="/local",
         ) as driver:
             await driver.wait(timeout=5)
-            async with ydb.aio.QuerySessionPool(driver) as pool:
+            async with ydb.aio.QuerySessionPool(driver, name="async-main-pool") as pool:
                 await pool.execute_with_retries("SELECT 1")
 
     asyncio.run(main())
@@ -229,12 +368,14 @@ To use a specific tracer instead of the global one:
 Running the Examples
 --------------------
 
-The runnable script is ``examples/opentelemetry/otel_example.py`` (bank table + concurrent
-Serializable transactions and ``app_startup`` / ``example_tli`` application spans). **Start
-Docker (YDB or the full stack) first**, then install and run on the host — see
-``examples/opentelemetry/README.md`` for the full order of commands and environment variables.
+The runnable script is ``examples/opentelemetry/otel_example.py``. It demonstrates both
+tracing and metrics: bank table + concurrent Serializable transactions,
+``app_startup`` / ``example_tli`` application spans, and SDK metrics exported through
+OTLP. **Start Docker (YDB or the full stack) first**, then install and run on the host
+— see ``examples/opentelemetry/README.md`` for the full order of commands and
+environment variables.
 
-**Full stack in one command** (YDB + OTLP + Tempo + Grafana; the ``otel-example`` service is built from ``examples/opentelemetry/Dockerfile`` and runs the script once):
+**Full stack in one command** (YDB + OTLP + Tempo + Grafana + Prometheus; the ``otel-example`` service is built from ``examples/opentelemetry/Dockerfile`` and runs the script once):
 
 .. code-block:: sh
 
@@ -250,4 +391,5 @@ The first run builds the ``otel-example`` image from the local SDK source; subse
     pip install -e '.[opentelemetry]' -r examples/opentelemetry/requirements.txt
     python examples/opentelemetry/otel_example.py
 
-Open `http://localhost:3000 <http://localhost:3000>`_ (Grafana) to explore traces via Tempo.
+Open `http://localhost:3000 <http://localhost:3000>`_ (Grafana) to explore traces via
+Tempo and metrics through the configured Prometheus data source.
diff --git a/examples/opentelemetry/Dockerfile b/examples/opentelemetry/Dockerfile
@@ -1,11 +1,13 @@
-# Isolated image for the OpenTelemetry demo. Build context is the repository root.
+# Isolated image for the OpenTelemetry demo scripts. Build context is the repository root.
 #
-#   docker compose -f examples/opentelemetry/compose-e2e.yaml build otel-example
+#   docker compose -f examples/opentelemetry/compose-e2e.yaml build
 #
 # A separate ``.dockerignore`` at the repo root keeps the context small.
 
 FROM python:3.11-slim
 
+ENV PYTHONUNBUFFERED=1
+
 WORKDIR /app
 
 # Dependency layer: copy only what setup.py needs so changes to the demo script do
@@ -15,7 +17,6 @@ COPY ydb ./ydb
 COPY examples/opentelemetry/requirements.txt ./examples/opentelemetry/requirements.txt
 RUN pip install --no-cache-dir -e '.[opentelemetry]' -r examples/opentelemetry/requirements.txt
 
-# Demo script.
+# Demo scripts.
 COPY examples/opentelemetry/otel_example.py ./examples/opentelemetry/otel_example.py
-
-CMD ["python", "examples/opentelemetry/otel_example.py"]
+COPY examples/opentelemetry/load_tank.py ./examples/opentelemetry/load_tank.py
diff --git a/examples/opentelemetry/README.md b/examples/opentelemetry/README.md
@@ -1,7 +1,15 @@
 # OpenTelemetry example (YDB Python SDK)
 
 Async demo in [`otel_example.py`](otel_example.py): OTLP export, `enable_tracing()`,
-`app_startup` and `example_tli` application spans, bank table, Serializable transactions (TLI-style load).
+`enable_metrics()`, `app_startup` and `example_tli` application spans, SDK client
+metrics, bank table, Serializable transactions (TLI-style load).
+
+[`load_tank.py`](load_tank.py) runs a small step-like load profile for the
+metrics dashboard:
+
+```text
+Peak RPS -> Medium RPS -> Min RPS -> Medium RPS -> repeat
+```
 
 Most steps assume the **repository root** as the current directory; the install step also shows the variant from this folder.
 
@@ -17,7 +25,10 @@ docker compose up -d
 # wait until the ydb container is healthy / port 2136 is open, then continue
 ```
 
-**Full stack** (YDB + OTLP collector + Tempo + Grafana; the `otel-example` service is built from a `Dockerfile` and runs the script once inside Compose). The compose file is `compose-e2e.yaml` next to this README.
+**Full stack** (YDB + OTLP collector + Tempo + Prometheus + Grafana; the
+`otel-example` service runs the tracing/metrics demo once, and `load-generator`
+runs the metrics load tank). The compose file is `compose-e2e.yaml` next to this
+README.
 
 ```sh
 cd /path/to/ydb-python-sdk
@@ -34,9 +45,29 @@ docker compose -f compose-e2e.yaml up --build
 The first run builds the `otel-example` image from the local SDK source (`Dockerfile` in this folder, `.dockerignore` at the repo root keeps the context small). Subsequent runs reuse the cached image; pass `--build` if you change the SDK or the demo script.
 
 Grafana: http://localhost:3000
+Prometheus: http://localhost:9090
+
+Grafana is provisioned with the **YDB Python SDK Metrics** dashboard. It uses
+Prometheus queries for SDK metrics such as `db_client_operation_duration`,
+`ydb_client_operation_failed`, `ydb_query_session_count`,
+`ydb_query_session_pending_requests`, `ydb_query_session_create_time`, and
+`ydb_client_retry_duration`. Use Grafana Explore for ad-hoc traces through Tempo
+and metrics through Prometheus.
+
+The SDK configures explicit OpenTelemetry histogram bucket boundaries for its
+own duration and retry-attempt metrics. Duration values are recorded in seconds,
+with sub-millisecond and millisecond-scale buckets so Grafana percentiles show
+meaningful latency distributions for fast local YDB operations.
+
+Metrics are wired through a dedicated SDK metrics plugin. Until `enable_metrics()`
+is called, the SDK uses a no-op metrics registry and does not import
+OpenTelemetry metrics packages from the hot-path metric helpers.
 
 **Logs for `otel-example`:** the container name is prefixed (e.g. `opentelemetry-otel-example-1`); use `docker compose -f examples/opentelemetry/compose-e2e.yaml ps` or `docker ps -a` to find it. The service is one-shot (`restart: "no"`) — it may already have exited.
 
+**Logs for `load-generator`:** the service is also one-shot. It runs for
+`LOAD_TANK_TOTAL_TIME` seconds and then exits after flushing metrics.
+
 ## 2. Install dependencies (on the host, for a local `python` run)
 
 **From the repository root** (editable SDK + pins from this example):
@@ -63,12 +94,37 @@ pip install -e '../..[opentelemetry]' -r requirements.txt
 python examples/opentelemetry/otel_example.py
 ```
 
-Defaults: YDB `grpc://localhost:2136`, OTLP `http://localhost:4317` (for a local collector, if you use one).
+Defaults: YDB `grpc://localhost:2136`, OTLP `http://localhost:4317` (for a local collector, if you use one). The same OTLP endpoint receives both traces and metrics.
+
+Run the load tank against an already running local stack:
+
+```sh
+python examples/opentelemetry/load_tank.py
+```
 
 ## Environment (Docker / overrides)
 
-| Variable | Meaning |
-|----------|---------|
-| `YDB_ENDPOINT` | e.g. `grpc://ydb:2136` inside the Compose network |
-| `YDB_DATABASE` | default `/local` |
-| `OTEL_EXPORTER_OTLP_ENDPOINT` | e.g. `http://otel-collector:4317` |
+| Variable | Meaning                                                  |
+|----------|----------------------------------------------------------|
+| `YDB_ENDPOINT` | e.g. `grpc://ydb:2136` inside the Compose network        |
+| `YDB_DATABASE` | default `/local`                                         |
+| `OTEL_EXPORTER_OTLP_ENDPOINT` | e.g. `http://otel-collector:4317`                        |
+| `OTEL_SERVICE_NAME` | service name attached to exported telemetry              |
+| `LOAD_TANK_TOTAL_TIME` | total load duration in seconds, default `6000`           |
+| `LOAD_TANK_WORKERS` | number of concurrent workers, default `40`               |
+| `LOAD_TANK_POOL_SIZE` | query session pool size, default `20`                    |
+| `LOAD_TANK_PEAK_RPS` | peak phase target RPS, default `120`                     |
+| `LOAD_TANK_MEDIUM_RPS` | medium phase target RPS, default `30`                    |
+| `LOAD_TANK_MIN_RPS` | low phase target RPS, default `3`                        |
+| `LOAD_TANK_ERROR_RPS` | failed query target RPS, default `1`; set `0` to disable |
+| `LOAD_TANK_PRESSURE_POOL_SIZE` | pool size for session pressure metrics, default `1`      |
+| `LOAD_TANK_PRESSURE_WORKERS` | concurrent contenders for the pressure pool, default `8` |
+| `LOAD_TANK_PRESSURE_HOLD_TIME` | seconds to hold the pressure-pool session, default `1.5` |
+| `LOAD_TANK_PRESSURE_ACQUIRE_TIMEOUT` | short acquire timeout for timeout metrics, default `1.0` |
+| `LOAD_TANK_PRESSURE_INTERVAL` | pause between pressure rounds, default `0.2`             |
+| `LOAD_TANK_SESSION_CHURN_INTERVAL` | interval for creating fresh sessions, default `2.0`      |
+| `LOAD_TANK_PEAK_DURATION` | peak phase duration in seconds, default `60`             |
+| `LOAD_TANK_MEDIUM_DURATION` | medium phase duration in seconds, default `90`           |
+| `LOAD_TANK_MIN_DURATION` | low phase duration in seconds, default `60`              |
+| `LOAD_TANK_QUERY` | query executed by workers, default `SELECT 1 AS value`   |
+| `LOAD_TANK_ERROR_QUERY` | query used to produce failed-operation metrics           |
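
Note on the README table above (illustrative, not part of the patch): the step-like profile "Peak RPS -> Medium RPS -> Min RPS -> Medium RPS -> repeat" can be read as a repeating schedule built from the `LOAD_TANK_*_RPS` and `LOAD_TANK_*_DURATION` defaults. `load_tank.py` itself is not shown in this diff, so the sketch below is an assumption about how such a schedule could be computed, not the script's code. The names `Phase`, `build_cycle`, and `phase_at` are hypothetical, and it assumes both medium phases use `LOAD_TANK_MEDIUM_DURATION`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    target_rps: float
    duration_s: float


def build_cycle(peak_rps=120, medium_rps=30, min_rps=3,
                peak_s=60, medium_s=90, min_s=60):
    """Build one Peak -> Medium -> Min -> Medium cycle from the table defaults."""
    peak = Phase("peak", peak_rps, peak_s)
    medium = Phase("medium", medium_rps, medium_s)
    low = Phase("min", min_rps, min_s)
    return [peak, medium, low, medium]


def phase_at(elapsed_s, cycle):
    """Return the phase that is active `elapsed_s` seconds after the start."""
    cycle_len = sum(p.duration_s for p in cycle)
    offset = elapsed_s % cycle_len  # the cycle repeats indefinitely
    for phase in cycle:
        if offset < phase.duration_s:
            return phase
        offset -= phase.duration_s
    return cycle[-1]  # guard against float rounding at the cycle boundary


cycle = build_cycle()
for t in (0, 60, 150, 210, 300):
    p = phase_at(t, cycle)
    print(t, p.name, p.target_rps)
```

With the defaults, one cycle lasts 300 seconds, so under these assumptions a 6000-second `LOAD_TANK_TOTAL_TIME` run would cover 20 full cycles.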