tucanoo/springboot-virtualthreads

Virtual Threads in Spring Boot 4 — Reproducible Benchmark

A narrow, reproducible experiment: take one realistic blocking Spring Boot 4 service, flip it from platform threads to virtual threads, and measure exactly what changes — throughput, tail latency, and where the bottleneck moves (the Tomcat thread pool → the HikariCP connection pool). The rig also carries a synchronized request path and JFR pinning capture to let you verify the JEP 491 change on Java 25 yourself (the pre-Java-24 "avoid synchronized" advice is now out of date).

This repo backs the Tucanoo article Virtual Threads in Spring Boot 4: I Rewrote a Blocking Service and Measured Everything. The raw results in results/ and the chart script in charts/ are the reproducibility evidence — clone it, run it on your own hardware, and tell us what you get.

Honest framing up front: single host, loopback networking. Absolute latencies are optimistic versus production. The claims here are the relative platform-vs-virtual delta and the bottleneck shift — both hold on loopback. See Caveats.


The headline result

| Concurrency | Platform threads | Virtual threads |
|---|---|---|
| 500 VUs | ~1,996 req/s · p99 ~1.3 s | ~5,262 req/s · p99 ~127 ms |
| 2,500 VUs | ~2,226 req/s · p99 ~2.8 s | ~13,673 req/s · p99 ~690 ms |
| 10,000 VUs | ~2,607 req/s · p99 ~6.6 s | ~10,091 req/s · p99 ~3.4 s* |

Platform threads plateau at ~2,000–2,600 req/s, the wall set by Tomcat's 200-thread pool; virtual threads reach ~6.1× the platform throughput at 2,500 concurrent users (~13,673 vs ~2,226 req/s), from a single configuration flag. (*10,000 VUs is a closed-model overload point, not steady state; see Caveats.)
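The multipliers follow directly from the headline figures. A minimal sketch of the arithmetic, using the rounded throughput values quoted above (the names `headline` and `speedup` are illustrative, not part of the repo):

```python
# Rounded throughput figures (req/s) copied from the headline result.
# Keys are concurrency in VUs; values are (platform, virtual).
headline = {
    500: (1996, 5262),
    2500: (2226, 13673),
    10000: (2607, 10091),
}


def speedup(platform_rps: float, virtual_rps: float) -> float:
    """Virtual-thread throughput as a multiple of platform-thread throughput."""
    return virtual_rps / platform_rps


for vus, (platform, virtual) in headline.items():
    print(f"{vus:>6} VUs: {speedup(platform, virtual):.1f}x")
```

Because the inputs are already rounded, treat the ratios as approximate; recompute them from results/matrix.csv if you need exact values.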

Hero chart: throughput and p99 latency vs concurrency, platform vs virtual

And the bottleneck-shift experiment — with the connection held across the downstream call, the HikariCP pool becomes a hard throughput ceiling. Platform threads flatline at their 200-thread wall; virtual threads ride the pool:

Bottleneck chart: throughput vs HikariCP pool size, platform vs virtual


Machine specs (the primary rig)

| Item | Value |
|---|---|
| Host | Intel Core i9-9900K @ 3.6 GHz |
| Cores | 8 physical / 16 logical (Hyper-Threading on), Coffee Lake, homogeneous (no P/E-core split) |
| RAM | 32 GB |
| OS | Linux (WSL2, Ubuntu 24.04) on Windows 10; .wslconfig: processors=16, memory=16GB |
| JDK | Temurin 25.0.3 LTS (/usr/lib/jvm/temurin-25-jdk-amd64); includes JEP 491 |
| Spring Boot | 4.0.7 (Tomcat 11 MVC, not WebFlux) · Servlet 6.1 |
| DB | PostgreSQL 16, dedicated standalone instance on port 5544, through HikariCP |
| Downstream stub | WebFlux/Netty stub (stub-app/), non-blocking Mono.delayElement(100ms) |
| Load generator | k6 v2.0.0 (tools/k6/k6), closed-model ramping-vus |

Core map (verified, not assumed)

Each tier is pinned to its own CPUs with taskset. On Linux availableProcessors() auto-tracks the affinity mask, so the carrier (ForkJoinPool) size follows the App tier's cores automatically — no -XX:ActiveProcessorCount hack is needed. HT siblings are adjacent pairs on this box (verified via lscpu -e: physical core N = logical CPUs 2N, 2N+1), so every tier owns whole physical cores and no two tiers share one:

| Tier | Logical CPUs (taskset) | Physical cores | Notes |
|---|---|---|---|
| App (SUT) | 0-7 | 0–3 (4 cores) | kept generous so it stays HikariCP-limited, not CPU-pegged (peaks ~80%) |
| PostgreSQL | 8-9 | 4 (1 core) | light load (~0.5 core at peak) |
| Load gen + stub | 10-15 | 5–7 (3 cores) | k6 + WebFlux stub; peaks ~74%, never the limiter |

The core map lives at the top of bench/lib.sh (CPUS_APP / CPUS_PG / CPUS_LOAD).
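The "whole physical cores, no sharing" property can be checked mechanically. The sketch below assumes the sibling layout described above (physical core N = logical CPUs 2N and 2N+1); on a machine where lscpu -e reports a different layout, the mapping function would need to change. The function and variable names are illustrative only.

```python
def physical_cores(logical_cpus: range) -> set[int]:
    """Map logical CPU ids to physical core ids, assuming siblings are (2N, 2N+1)."""
    return {cpu // 2 for cpu in logical_cpus}


# taskset ranges from the core map (inclusive ranges written as Python ranges).
tiers = {
    "app": range(0, 8),        # taskset 0-7
    "postgres": range(8, 10),  # taskset 8-9
    "load": range(10, 16),     # taskset 10-15
}

cores = {name: physical_cores(cpus) for name, cpus in tiers.items()}


def tiers_are_isolated(cores_by_tier: dict[str, set[int]]) -> bool:
    """True when no physical core appears in more than one tier."""
    seen: set[int] = set()
    for owned in cores_by_tier.values():
        if seen & owned:
            return False
        seen |= owned
    return True


print(cores)
print(tiers_are_isolated(cores))
```

A map that cuts on an odd boundary, for example 0-6 for the app and 7-8 for PostgreSQL, would put both tiers on physical core 3 and fail this check.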


What is (and isn't) the variable

The application code is identical across every run. Only environment variables change, all set by the harness and reported live at GET /api/runtime-info:

| Env var | Values | What it controls |
|---|---|---|
| VT_ENABLED | false / true | platform vs virtual threads (spring.threads.virtual.enabled) |
| HIKARI_MAX_POOL | 10, 50, 100, … | HikariCP maximum-pool-size (fixed-size: min = max) |
| CONN_MODE | released / held | when the DB connection is returned (see below) |
| PATH_MODE | normal / synchronized | striped-synchronized variant for the JEP 491 pinning check |

The system under test (app/) is one endpoint, GET /api/order-summary/{id}: a HikariCP/JDBC primary-key lookup (orders joined to customers, 1M rows seeded) plus a blocking RestClient call to the stub (~100 ms). It combines both into JSON.

Two request paths drive the two experiments:

  • released (default, the hero path) returns the connection after the sub-millisecond query, before the downstream call → the pool governs latency, not throughput.
  • held wraps the query + downstream call in one TransactionTemplate so the connection stays checked out across the 100 ms call (the real-world "transaction spans an external HTTP call" anti-pattern) → the pool is a hard throughput ceiling. This is the "your bottleneck just moved" demo.
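Why the held path turns the pool into a throughput ceiling can be estimated with Little's law: if every request keeps one connection for roughly the length of the downstream call, the pool can complete at most pool_size / hold_time requests per second. The sketch below assumes a hold time of exactly 100 ms; the real hold is somewhat longer because it also covers the query and transaction overhead, so measured ceilings should sit a little below these figures. `held_ceiling_rps` is an illustrative name, not something from the repo.

```python
def held_ceiling_rps(pool_size: int, hold_seconds: float = 0.1) -> float:
    """Upper bound on throughput when each request holds one connection for hold_seconds."""
    return pool_size / hold_seconds


# Pool sizes swept by the bottleneck experiment.
for pool in (50, 100, 200, 400, 800):
    print(f"pool={pool:>3}: at most ~{held_ceiling_rps(pool):,.0f} req/s")
```

On the released path the same connection is held only for the sub-millisecond query, so the equivalent bound is far above anything the rig drives, which is why the pool affects latency there rather than throughput.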

Reproduce it

On Linux (or WSL2). Prerequisites, with the paths the harness expects (override at the top of bench/lib.sh if yours differ):

  • Temurin 25 (or any JDK 25 with JEP 491) at /usr/lib/jvm/temurin-25-jdk-amd64
  • PostgreSQL 16 binaries at /usr/lib/postgresql/16/bin (the harness runs its own standalone instance on port 5544 with .pgdata/ — it does not touch any system cluster)
  • k6 v2.0.0 at tools/k6/k6 (fetch the static Linux binary into that path)

# 0. Clone
git clone https://github.com/tucanoo/springboot-virtualthreads.git
cd springboot-virtualthreads

# 1. Build the two jars (Spring Boot SUT + the WebFlux stub).
#    The Maven wrapper lives in app/; build the stub with it via -f.
export JAVA_HOME=/usr/lib/jvm/temurin-25-jdk-amd64
( cd app && ./mvnw -B -DskipTests package )
( cd app && ./mvnw -B -DskipTests -f ../stub-app/pom.xml package )

# 2. Run the article dataset (~50 min): hero (released) + bottleneck (held), 3 reps each.
#    Starts Postgres (+seed on first run), the stub and the app; pins every tier; samples
#    per-tier CPU; appends results/matrix.csv; tears everything down at the end.
bash bench/run-lean.sh

# 3. Render the figures from the CSV
python3 -m venv charts/.venv && source charts/.venv/bin/activate
pip install -r charts/requirements.txt
python charts/make_charts.py results/matrix.csv      # -> charts/out/figA_*, figB_*

run-lean.sh is just two run-matrix.sh invocations. To drive the matrix directly:

# Hero (released): pool fixed, concurrency swept
bash bench/run-matrix.sh --conn-mode released --pools 50 \
     --conc 100,500,1000,2500,5000,10000 --reps 3 --ramp 8 --warmup 10 --hold 20

# Bottleneck (held): pool swept at fixed concurrency
bash bench/run-matrix.sh --conn-mode held \
     --pools 50,100,200,400,800 --conc 2000 --reps 3 --ramp 8 --warmup 10 --hold 20

Flags: --modes platform,virtual · --pools · --conc · --reps · --ramp/--warmup/--hold (seconds) · --conn-mode released|held · --path-mode normal|synchronized · --keep. The 5-rep full matrix is bench/run-full.sh (several hours). The seeded DB persists in .pgdata/ across runs; force a reseed with RESEED=1.
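To budget a run, it helps to count scenarios before launching. The sketch below assumes that run-matrix.sh executes the full cross product of modes × pools × concurrencies × reps, that both thread modes run by default, and that each scenario generates load for ramp + warmup + hold seconds. Those are assumptions about the harness, and the function names are illustrative.

```python
def scenario_count(modes: int, pools: int, concurrencies: int, reps: int) -> int:
    """Number of runs in one run-matrix.sh invocation (assumed full cross product)."""
    return modes * pools * concurrencies * reps


def load_seconds(runs: int, ramp: int = 8, warmup: int = 10, hold: int = 20) -> int:
    """Time spent generating load, assuming each run lasts ramp + warmup + hold."""
    return runs * (ramp + warmup + hold)


hero = scenario_count(modes=2, pools=1, concurrencies=6, reps=3)        # released
bottleneck = scenario_count(modes=2, pools=5, concurrencies=1, reps=3)  # held
total = hero + bottleneck

print(hero, bottleneck, total)             # 36 30 66
print(round(load_seconds(total) / 60, 1))  # 41.8
```

The 66 scenarios match the row count of results/matrix.csv, and about 42 minutes of pure load time is consistent with the ~50 min quoted for run-lean.sh once start-up and teardown are included.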

Note: the author develops on Windows and executes the rig inside WSL2; that host-specific sync workflow is not required to reproduce — on a native Linux box the three steps above are the whole story.


The dataset

results/matrix.csv is the article dataset: 66 rows, both thread modes, 3 reps, collected in one clean run. Each row is one completed scenario (23 columns):

ts_iso, thread_mode, hikari_pool, path_mode, conn_mode, concurrency, rep,
throughput_rps, p50_ms, p95_ms, p99_ms, p999_ms, max_ms, error_rate, http_reqs,
app_cpu_pct, pg_cpu_pct, load_cpu_pct, hikari_active_max, hikari_pending_max,
jvm_threads_max, rss_mb_max, pinned_events

The conn_mode column separates the two experiments (released = hero, held = bottleneck). Per-tier CPU is logged on every run precisely so you can confirm the load generator and Postgres were never saturated — a saturated generator masquerading as the app topping out is the single biggest benchmark failure mode, and the data lets you rule it out.
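One way to run that check yourself is to scan the CSV for rows where a supporting tier looks busy. This is a sketch, not part of the repo: it assumes matrix.csv has a header row with the column names listed above, that the *_cpu_pct values are percentages of each tier's allotted CPUs, and that 90% is a reasonable alarm threshold. Adjust all three to match what you find in the file.

```python
import csv
from typing import Iterable, Mapping


def saturated_rows(rows: Iterable[Mapping[str, str]], threshold: float = 90.0) -> list[dict]:
    """Return rows where the load generator or Postgres CPU meets or exceeds threshold."""
    flagged = []
    for row in rows:
        load = float(row["load_cpu_pct"])
        pg = float(row["pg_cpu_pct"])
        if load >= threshold or pg >= threshold:
            flagged.append(dict(row))
    return flagged


def check_matrix(path: str = "results/matrix.csv", threshold: float = 90.0) -> list[dict]:
    """Read the results CSV and return any rows that look generator- or DB-bound."""
    with open(path, newline="") as handle:
        return saturated_rows(csv.DictReader(handle), threshold)
```

Calling `check_matrix()` from the repository root returns the suspicious rows; an empty list means no run crossed the chosen threshold on either supporting tier.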


Caveats

We'd rather you trust the numbers than be impressed by them, so:

  • Single-host loopback: absolute latencies are optimistic; the claims are the relative platform-vs-virtual delta and the bottleneck shift, both of which hold on loopback.
  • WSL2, not bare metal: the cpuset pinning is real within the VM, but the VM shares the physical machine with the Windows host. For bare-metal isolation, run the same harness on native Linux; the core map is unchanged.
  • 3 reps, not 5: the shipped lean dataset trades reps for turnaround. That is fine for effects this large, but stated honestly; run-full.sh does 5 reps if you want tighter variance bars.
  • The 10,000-VU rows are a closed-model overload cliff: on the virtual-thread side throughput drops below its 2,500-VU peak and p99 reaches ≈ 3–4 s, with a fraction of a percent of errors. It is a legitimate data point, but label it overloaded, not steady state.
  • Low-concurrency p99 spikes: a recurring ~1.2 s p99 blip at low load in some reps (JIT/GC/cold-connection warm-up) inflates low-concurrency p99; read median/p95 there.
  • The held experiment stays at c2000: at small pools, much higher concurrency blows past HikariCP's 10 s connection-timeout and the rows become timeout-dominated rather than clean throughput.
  • Pinning: with JEP 491, the expected jdk.VirtualThreadPinned count for a synchronized hot path on Java 25 is ≈ 0. The rig wires up the synchronized path and JFR capture so you can confirm this; the shipped dataset focuses on throughput, latency, and the bottleneck shift.
  • Numbers are i9-9900K-specific. Different CPU counts change the carrier-pool size and move these curves, which is exactly why reproducing on your own hardware is interesting.
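The c2000 choice for the held experiment can be sanity-checked with a back-of-the-envelope queueing estimate. Assuming every in-flight request wants a connection, each holder keeps it for about 100 ms, and waiters are served in order, the last request in line waits roughly (concurrency − pool) / pool × hold time. This is a simplification that ignores the 200-thread cap on the platform side and is not a model of HikariCP internals; the 10,000 figure is only an illustrative "much higher" concurrency.

```python
def worst_case_wait_seconds(concurrency: int, pool_size: int, hold_seconds: float = 0.1) -> float:
    """Approximate wait for the last request in a FIFO queue for a connection."""
    waiting = max(concurrency - pool_size, 0)
    return waiting / pool_size * hold_seconds


CONNECTION_TIMEOUT_S = 10.0  # the connection-timeout quoted in the caveats

for concurrency in (2000, 10000):
    wait = worst_case_wait_seconds(concurrency, pool_size=50)
    verdict = "times out" if wait > CONNECTION_TIMEOUT_S else "fits"
    print(f"c{concurrency}, pool=50: ~{wait:.1f} s ({verdict})")
```

Under these assumptions a pool of 50 at c2000 queues for about 3.9 s, inside the timeout, while the same pool at c10000 would queue for about 19.9 s and start failing requests.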

Repo layout

app/        Spring Boot 4 SUT — one blocking endpoint (RestClient -> stub + JDBC -> Postgres)
stub-app/   WebFlux/Netty downstream stub (non-blocking, fixed 100ms delay)
load/       k6 scenario (closed-model ramping-vus) + profiles
bench/      Bash harness: lib.sh (config + helpers), run-matrix.sh, run-lean.sh, run-full.sh
sql/        schema.sql + seed.sql (100k customers / 1M orders via generate_series)
results/    raw matrix.csv (committed = reproducibility)
charts/     make_charts.py -> hero + bottleneck figures (charts/out/)
tools/      portable k6 binary

Legacy artifacts: the bench/*.ps1 scripts, bench/CpuProbe.java / bench/OutboundProbe.java, and the stub/ (WireMock) directory are from an earlier Windows-native attempt that hit a loopback ceiling. The Linux/WSL2 bash harness above is the rig that produced the published data; the PowerShell files are kept for history and are not required.


License & reproduction

Run it, fork it, and re-plot the raw CSVs however you like. If you reproduce the benchmark on different hardware — a cloud instance, a bigger box, native Linux, ARM — we'd genuinely like to hear what you get. Read the full write-up and methodology in the companion article: Virtual Threads in Spring Boot 4: I Rewrote a Blocking Service and Measured Everything.
