X+SlidesBench

A Benchmark for Audience-Conditioned Slide Generation

X+SlidesBench is a benchmark toolkit for evaluating slide generation as audience-conditioned information selection. It builds source-grounded probes, assigns audience-specific utility weights, evaluates generated decks, and reports Audience Coverage, Domain-wise Coverage, Efficiency, Correctness, and aggregate summaries.

Overview

Slides are usually written for a specific audience. A proof detail can be useful for a specialist, but unnecessary for a decision maker. A learner may need background and definitions that an expert can skip. Most slide-generation evaluations still score content quality as if it were the same for every audience.

X+SlidesBench addresses this gap by evaluating slide generation as audience-conditioned information selection. For each source topic, X+SlidesBench builds source-grounded probes: audience-agnostic question-answer units with evidence spans. The same probe bank is then weighted for different audience profiles, so the benchmark can ask whether a generated deck covers information that is useful for specialists, learners, or decision makers.

X+SlidesBench first prepares source documents and audience profiles, then generates audience-agnostic probes from the source. Utility weights are then assigned for each audience profile. Any slide generator can produce PPTX or PDF decks, and X+SlidesBench evaluates each deck by checking whether it answers the retained probes and whether its slide claims are supported by the source.

X+SlidesBench is built on the following concepts and metrics:

Probe: an audience-agnostic question-answer unit backed by source evidence.
Audience weight: the utility of a probe for one audience profile.
Audience Coverage: recovered audience utility divided by available audience utility.
Domain-wise Coverage: Audience Coverage split by information domain.
Efficiency: recovered utility per slide, token, or estimated presentation minute.
Correctness: a source-grounded guardrail computed from extracted slide claims.
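
As a rough illustration of how the coverage definitions above fit together, the sketch below computes Audience Coverage and Domain-wise Coverage from per-probe weights and scores. The field names (probe_id, domain, weight, score) and the assumption that each probe score lies in [0, 1] are illustrative only; they are not taken from the toolkit's file formats.

```python
from collections import defaultdict


def audience_coverage(rows):
    """Recovered audience utility divided by available audience utility.

    Each row is a dict with a hypothetical `weight` (utility of the probe for
    one audience) and `score` (how well the deck answers the probe, assumed
    to lie in [0, 1]).
    """
    available = sum(r["weight"] for r in rows)
    if available == 0:
        return 0.0  # no utility available, so nothing can be recovered
    recovered = sum(r["weight"] * r["score"] for r in rows)
    return recovered / available


def domain_coverage(rows):
    """Audience Coverage computed separately for each information domain."""
    by_domain = defaultdict(list)
    for r in rows:
        by_domain[r["domain"]].append(r)
    return {domain: audience_coverage(group) for domain, group in by_domain.items()}


# Made-up example rows for a single audience profile.
rows = [
    {"probe_id": "p1", "domain": "method", "weight": 0.9, "score": 1.0},
    {"probe_id": "p2", "domain": "method", "weight": 0.6, "score": 0.0},
    {"probe_id": "p3", "domain": "background", "weight": 0.5, "score": 1.0},
]
print(round(audience_coverage(rows), 3))
print(domain_coverage(rows))
```

Because the same probe bank is reused across audiences, only the weight column changes between audience profiles in this sketch; the scores for a given deck stay the same.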

The paper benchmark covers 45 topics, 6,849 deduplicated source-grounded probes, 3 main audience profiles, and 7 presentation scenes. This repository includes a larger working example collection used for extended and follow-up checks, containing 113 source records, 8,127 sanitized source-grounded probes, and 27,059 audience-weight rows. Please refer to the dataset README for more details.

Setup

uv sync
git submodule update --init --recursive
git -C adapters/agent_generator apply ../../patches/agent_generator_local.patch
git -C adapters/template_generator apply ../../patches/template_generator_local.patch
cp .env.example .env
uv run xslidesbench doctor

For the same setup flow in one command, run bash src/tools/init_environment.sh.

The patch commands are not idempotent: git apply fails if a patch has already been applied. If git apply reports that a patch was already applied, continue with the next setup step. Prefer the xslidesbench CLI commands below as the stable public interface; src/tools/ only contains curated setup and batch helpers for larger experiment runs.
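
One way to tell whether a patch is already in place before running git apply is to use git's dry-run checks. The sketch below is not part of the toolkit; the function name patch_state and its return labels are hypothetical. It relies only on the standard git apply --check and git apply --reverse --check options.

```python
import subprocess
from pathlib import Path


def patch_state(repo_dir, patch_file, run=subprocess.run):
    """Classify a patch as 'applicable', 'already_applied', or 'conflict'.

    `git apply --check` is a dry run of the patch; adding `--reverse` checks
    whether the patch could be undone, which indicates it is already applied.
    `run` is injectable so the decision logic can be tested without git.
    """
    # Resolve to an absolute path because `git -C` changes the working directory.
    patch = str(Path(patch_file).resolve())

    def check(*flags):
        cmd = ["git", "-C", str(repo_dir), "apply", *flags, "--check", patch]
        return run(cmd, capture_output=True).returncode == 0

    if check():
        return "applicable"
    if check("--reverse"):
        return "already_applied"
    return "conflict"


# Example use (requires git and the submodule checkout):
# state = patch_state("adapters/agent_generator", "patches/agent_generator_local.patch")
# if state == "applicable":
#     subprocess.run(["git", "-C", "adapters/agent_generator", "apply",
#                     str(Path("patches/agent_generator_local.patch").resolve())], check=True)
```

A setup script built on this check could skip patches reported as already_applied and stop only on conflict.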

Agent-facing workflow guides live under skills/. Use them when delegating or automating repository work: setup, probe construction, deck generation, evaluation, and ablation analysis each have a dedicated skill. The skills point back to the stable CLI and batch helpers rather than replacing them.

Fill the shared model endpoint variables in .env:

OPENAI_BASE_URL=
OPENAI_API_KEY=
OPENAI_MODEL=gemini-3.1-pro-preview

Model roles are configured in configs/*.yaml. The current defaults use gemini-3.1-pro-preview for probe generation and gemini-3-flash-preview for utility weighting and evaluation. max_tokens is left unset by default because it caps only the generated output length, not the input context length.

PDF source parsing uses MinerU by default. Set MINERU_API_KEY for the hosted service, or MINERU_API_URL for a local/compatible endpoint. If MINERU_BASE_URL is empty, the default hosted base URL is used. Parsed markdown is cached under XSLIDESBENCH_MINERU_CACHE_DIR.

Docker Setup

Build the core container:

docker build -t xslidesbench:latest .

Use mirrors during build when needed, for example with CERNET:

docker build -t xslidesbench:latest \
  --build-arg APT_MIRROR=https://mirrors.cernet.edu.cn/debian \
  --build-arg APT_SECURITY_MIRROR=https://mirrors.cernet.edu.cn/debian-security \
  --build-arg UV_INDEX_URL=https://mirrors.cernet.edu.cn/pypi/web/simple \
  .

Run an interactive shell with your local .env:

docker compose run --rm xslidesbench

For Compose builds, put the same optional mirror variables in .env:

APT_MIRROR=https://mirrors.cernet.edu.cn/debian
APT_SECURITY_MIRROR=https://mirrors.cernet.edu.cn/debian-security
UV_INDEX_URL=https://mirrors.cernet.edu.cn/pypi/web/simple
DOCKER_BUILD_NETWORK=host

Inside the container:

uv run xslidesbench doctor

Pipeline Overview

X+SlidesBench follows the benchmark method as a seven-stage pipeline:

prepare source manifests and source-side metadata,
generate audience-agnostic, source-grounded probes,
assign audience-specific utility weights to the same probe bank,
generate or import PPTX/PDF decks from supported generators,
evaluate whether each deck answers retained probes and stays source-grounded,
compute Audience Coverage, Domain-wise Coverage, Efficiency, and Correctness,
aggregate row-level metric files by system, condition, audience, and source.

The commands below follow this order. Implementation details for each stage are documented in Pipeline.

Stage 1: Source Preparation

Each input source should be represented by a source file plus optional metadata. Supported source types include PDF, HTML, Markdown, and plain text.

data/runs/my_run/sources/
  case_001/
    source.pdf
    source_metadata.json

Example metadata:

{
  "scene": "policy_briefing",
  "name": "Document title",
  "category": "policy_report",
  "topic": "public governance"
}
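
For scripted setups, the directory layout and metadata file above could be created with a small helper such as the one below. This is a minimal sketch: the function name write_case is hypothetical, and the toolkit may accept or expect metadata keys beyond the four shown in the example.

```python
import json
import tempfile
from pathlib import Path


def write_case(sources_dir, case_id, metadata):
    """Create <sources_dir>/<case_id>/source_metadata.json and return its path."""
    case_dir = Path(sources_dir) / case_id
    case_dir.mkdir(parents=True, exist_ok=True)
    path = case_dir / "source_metadata.json"
    path.write_text(
        json.dumps(metadata, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return path


# Demo in a temporary directory; the source file itself (for example
# source.pdf) would be copied into the same case directory.
with tempfile.TemporaryDirectory() as tmp:
    meta_path = write_case(
        Path(tmp) / "sources",
        "case_001",
        {
            "scene": "policy_briefing",
            "name": "Document title",
            "category": "policy_report",
            "topic": "public governance",
        },
    )
    print(meta_path.relative_to(tmp))
```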

Prepare a source manifest from a configured source collection:

uv run xslidesbench sources \
  --config configs/default.yaml \
  --audiences-config configs/audiences.yaml \
  --output-dir data/runs/my_run/sources \
  --domain academia \
  --scene academic_research_talk \
  --audiences specialists,learners,decision_makers \
  --max-cases 5

Prepared example documents are distributed through the Hugging Face dataset release under examples/input_documents/files/.

Stage 2: Probe Construction

Probes are generated from the source before audience conditioning. Each probe stores a question, expected answer, source evidence, depth level, modality, and information domain.

uv run xslidesbench probes \
  --source-context data/runs/default/sources/case_000/source_context.txt \
  --scene academic_research_talk \
  --source-case-id case_000 \
  --output data/runs/default/probes/case_000/probes_deduped.jsonl \
  --config configs/default.yaml \
  --k 3

--source-chars is the maximum number of extracted source characters kept for probe generation after PDF/HTML/text parsing. It controls cost and latency, and it is not the model context-window size. For large-context models, keep one full-source prompt whenever practical. Split the source only when an endpoint cannot reliably handle the retained source.
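
To make the probe fields and the --source-chars cap described in this stage more concrete, here is a minimal sketch of a probe record serialized as JSONL, together with a character-based truncation helper. The field names and example values are assumptions for illustration; the keys in the toolkit's probes_deduped.jsonl may differ.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Probe:
    # Field names are illustrative; the toolkit's JSONL keys may differ.
    question: str
    expected_answer: str
    evidence: str      # source span that supports the expected answer
    depth_level: str
    modality: str
    domain: str        # information domain used for Domain-wise Coverage


def truncate_source(text, source_chars):
    """Keep at most `source_chars` extracted characters.

    This limits cost and latency; it is unrelated to the model context window.
    """
    return text[:source_chars]


def to_jsonl(probes):
    """Serialize probes as one JSON object per line."""
    return "\n".join(json.dumps(asdict(p), ensure_ascii=False) for p in probes)


def from_jsonl(text):
    """Parse JSONL back into Probe records, skipping blank lines."""
    return [Probe(**json.loads(line)) for line in text.splitlines() if line.strip()]


probe = Probe(
    question="What dataset size does the study report?",
    expected_answer="45 topics",
    evidence="The benchmark covers 45 topics.",
    depth_level="surface",
    modality="text",
    domain="results",
)
print(to_jsonl([probe]))
```

Note that the record carries no audience field: audience conditioning enters only through the separate weight files produced in the next stage.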

Stage 3: Audience Utility Weighting

The same probe bank is weighted separately for each audience profile.

uv run xslidesbench weights \
  --probes data/runs/default/probes/case_000/probes_deduped.jsonl \
  --audience specialists \
  --scene academic_research_talk \
  --output data/runs/default/probes/case_000/weights/specialists.jsonl \
  --config configs/default.yaml \
  --audiences-config configs/audiences.yaml \
  --k 3 \
  --batch-size 12

Audience and scene profiles are defined in configs/audiences.yaml.
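
The sketch below illustrates two ideas from this stage under stated assumptions: splitting a probe bank into fixed-size batches, as the --batch-size flag suggests, and joining per-audience weight rows into one table keyed by probe. The row keys probe_id and weight are hypothetical, and the behavior of --k is not modeled here.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def weight_table(weights_by_audience):
    """Map probe_id -> {audience: weight} from per-audience weight rows."""
    table = {}
    for audience, rows in weights_by_audience.items():
        for row in rows:
            table.setdefault(row["probe_id"], {})[audience] = row["weight"]
    return table


probe_ids = [f"p{i}" for i in range(30)]
print([len(batch) for batch in batched(probe_ids, 12)])

# Made-up weights: the same probe can matter differently to each audience.
table = weight_table({
    "specialists": [{"probe_id": "p0", "weight": 0.9}],
    "decision_makers": [{"probe_id": "p0", "weight": 0.2}],
})
print(table)
```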

Stage 4: Deck Generation Or Import

X+SlidesBench can evaluate decks from any generator as long as the deck is available as PPTX or PDF. The toolkit includes the following supported workflows.

X+SlidesBench

The X+SlidesBench generation workflow is configured through local adapter paths and model variables in .env:

LOCAL_AGENT_GENERATOR_DIR=adapters/agent_generator
LOCAL_AGENT_GENERATOR_BIN=
LOCAL_AGENT_RESEARCH_BASE_URL=${OPENAI_BASE_URL}
LOCAL_AGENT_RESEARCH_MODEL=gemini-3.1-pro-preview
LOCAL_AGENT_RESEARCH_API_KEY=${OPENAI_API_KEY}
LOCAL_AGENT_DESIGN_BASE_URL=${OPENAI_BASE_URL}
LOCAL_AGENT_DESIGN_MODEL=gemini-3.1-pro-preview
LOCAL_AGENT_DESIGN_API_KEY=${OPENAI_API_KEY}
LOCAL_AGENT_VISION_BASE_URL=${OPENAI_BASE_URL}
LOCAL_AGENT_VISION_MODEL=gemini-3.1-pro-preview
LOCAL_AGENT_VISION_API_KEY=${OPENAI_API_KEY}
MINERU_API_KEY=
MINERU_API_URL=
MINERU_BASE_URL=
MINERU_API_FIELD=
MINERU_MODEL_VERSION=vlm
XSLIDESBENCH_PDF_PARSER=mineru
XSLIDESBENCH_MINERU_CACHE_DIR=.cache/xslidesbench/mineru

Run prepared deck jobs:

uv run xslidesbench decks xslidesbench \
  --run-sheet data/runs/default/sheets/manual_run_sheet.csv \
  --env-file .env

SlideTailor

SlideTailor requires a local adapter, a CUDA device, reference paper-slide pairs, and a template deck.

LOCAL_TEMPLATE_GENERATOR_DIR=adapters/template_generator
LOCAL_TEMPLATE_GENERATOR_BIN=
LOCAL_TEMPLATE_DATASET_DIR=
LOCAL_TEMPLATE_OPENAI_BASE_URL=${OPENAI_BASE_URL}
LOCAL_TEMPLATE_OPENAI_API_KEY=${OPENAI_API_KEY}
LOCAL_TEMPLATE_OPENAI_MODEL=gemini-3.1-pro-preview
TEMPLATE_GENERATOR_RETRIEVER_MODEL_PATH=

Prepare configs and then run generation:

uv run xslidesbench decks slidetailor prepare \
  --run-dir data/runs/default \
  --template-dataset-dir "$LOCAL_TEMPLATE_DATASET_DIR"

uv run xslidesbench decks slidetailor run \
  --run-sheet data/runs/default/sheets/slidetailor_run_sheet.csv \
  --env-file .env \
  --device cuda:0

Both agnostic and audience-conditioned SlideTailor rows are generated directly from the source document and reference template. Audience-conditioned rows use the corresponding audience preference file, independent of the agnostic row.

NotebookLM

NotebookLM is optional and manual. Use the same agnostic or audience-conditioned deck-generation prompt format described in the stage-4 configuration, then import the exported deck.

uv run xslidesbench decks notebooklm import \
  --deck exported_notebooklm_deck.pdf \
  --output-dir data/runs/default \
  --case-label case_000 \
  --condition conditioned \
  --audience specialists

Stage 5: Deck Evaluation

Evaluation checks whether the deck answers each retained probe and whether deck claims are supported by the source. By default, Correctness is computed by extracting atomic slide claims, verifying them in batches against retrieved source snippets, and aggregating the verification labels into the same correctness scalar used by the tables. Speaker notes are not used by the main scoring path. Correctness judgments are cached per deck/source/model/settings combination, so evaluating the same deck against multiple audience profiles does not need to repeat claim extraction and verification.

uv run xslidesbench evaluate \
  --deck data/runs/default/decks/PPTAgent/conditioned/specialists/case_000.pptx \
  --audience specialists \
  --scene academic_research_talk \
  --source-context data/runs/default/sources/case_000/source_context.txt \
  --source-case-id case_000 \
  --probes data/runs/default/probes/case_000/probes_deduped.jsonl \
  --weights data/runs/default/probes/case_000/weights/specialists.jsonl \
  --output-dir data/runs/default/evaluations/xslidesbench/case_000/specialists \
  --config configs/default.yaml \
  --audiences-config configs/audiences.yaml \
  --correctness-method claim_level \
  --claim-batch-size 50
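
The caching and aggregation behavior described in this stage can be pictured with the sketch below. It is an assumption-laden illustration, not the toolkit's implementation: the cache-key layout, the label names, and the rule that Correctness equals the fraction of supported claims are all hypothetical.

```python
import hashlib
import json


def correctness_cache_key(deck_bytes, source_text, model, settings):
    """Stable key for one deck/source/model/settings combination.

    The audience is deliberately absent, so every audience profile that is
    evaluated against the same deck maps to the same cached judgments.
    """
    payload = {
        "deck": hashlib.sha256(deck_bytes).hexdigest(),
        "source": hashlib.sha256(source_text.encode("utf-8")).hexdigest(),
        "model": model,
        "settings": settings,
    }
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def correctness_score(labels):
    """Fraction of claims labeled 'supported' (an assumed aggregation rule).

    Returns None when no claims were extracted.
    """
    if not labels:
        return None
    return sum(label == "supported" for label in labels) / len(labels)


key = correctness_cache_key(
    b"deck-bytes", "source text", "example-model", {"claim_batch_size": 50}
)
print(key[:16])
print(correctness_score(["supported", "supported", "unsupported", "supported"]))
```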

Stage 6: Metric Scoring

Use xslidesbench metrics when probe scores already exist and only metric calculation is needed.

uv run xslidesbench metrics \
  --probes data/runs/default/probes/case_000/probes_deduped.jsonl \
  --weights data/runs/default/probes/case_000/weights/specialists.jsonl \
  --scores data/runs/default/evaluations/xslidesbench/case_000/specialists/probe_scores.jsonl \
  --audience specialists \
  --slide-count 8 \
  --word-count 700 \
  --correctness 0.9 \
  --tau-a 0.7 \
  --output data/runs/default/evaluations/xslidesbench/case_000/specialists/metrics.json
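
As a rough sketch of how the Efficiency variants relate to the --slide-count and --word-count inputs, the helper below divides recovered utility by each cost unit. The conversion from word count to presentation minutes, including the words_per_minute default, is an assumption made for illustration, and the --tau-a threshold is not modeled.

```python
def efficiency(recovered_utility, slide_count, word_count, words_per_minute=130.0):
    """Recovered utility per slide, per word, and per estimated minute.

    `words_per_minute` is a hypothetical speaking rate used to estimate
    presentation time from the deck's word count.
    """
    if slide_count <= 0 or word_count <= 0 or words_per_minute <= 0:
        raise ValueError("slide_count, word_count, and words_per_minute must be positive")
    estimated_minutes = word_count / words_per_minute
    return {
        "per_slide": recovered_utility / slide_count,
        "per_word": recovered_utility / word_count,
        "per_minute": recovered_utility / estimated_minutes,
    }


# Made-up recovered utility for a deck with 8 slides and 700 words.
print(efficiency(recovered_utility=14.0, slide_count=8, word_count=700))
```

Two decks with the same Audience Coverage can differ sharply on these ratios, which is why Efficiency is reported alongside coverage rather than folded into it.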

Stage 7: Result Aggregation

uv run xslidesbench aggregate \
  --metrics-dir data/runs/default/evaluations \
  --output data/runs/default/summaries/metric_summary.json

Aggregation reads row-level metric files and can compute bootstrap confidence intervals over the per-deck results.
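
For readers unfamiliar with bootstrap intervals, the sketch below shows a standard percentile bootstrap for the mean of per-deck metric values. The resample count, the percentile method, and the function name are assumptions; the aggregate command may use different settings.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean.

    Resamples the per-deck values with replacement, computes the mean of each
    resample, and returns (mean, lower, upper) from the sorted resample means.
    """
    if not values:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * (n_resamples - 1))]
    upper = means[int((1 - alpha / 2) * (n_resamples - 1))]
    return statistics.fmean(values), lower, upper


# Made-up Audience Coverage values for six decks from one system.
coverage = [0.62, 0.71, 0.55, 0.80, 0.67, 0.74]
mean, lower, upper = bootstrap_ci(coverage)
print(f"mean={mean:.3f} ci=[{lower:.3f}, {upper:.3f}]")
```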
