X+SlidesBench is a benchmark toolkit for evaluating slide generation as audience-conditioned information selection. It builds source-grounded probes, assigns audience-specific utility weights, evaluates generated decks, and reports Audience Coverage, Domain-wise Coverage, Efficiency, Correctness, and aggregate summaries.
Slides are usually written for a specific audience. A proof detail can be useful for a specialist, but unnecessary for a decision maker. A learner may need background and definitions that an expert can skip. Most slide-generation evaluations still score content quality as if it were the same for every audience.
X+SlidesBench addresses this gap by evaluating slide generation as audience-conditioned information selection. For each source topic, X+SlidesBench builds source-grounded probes: audience-agnostic question-answer units with evidence spans. The same probe bank is then weighted for different audience profiles, so the benchmark can ask whether a generated deck covers information that is useful for specialists, learners, or decision makers.
X+SlidesBench first prepares source documents and audience profiles, then generates audience-agnostic probes from the source. Utility weights are assigned for each audience profile. Any slide generator can then produce PPTX or PDF decks, and X+SlidesBench evaluates the decks by checking whether the deck answers retained probes and whether slide claims are supported by the source.
X+SlidesBench is built on the following core concepts and metrics:
- Probe: an audience-agnostic question-answer unit backed by source evidence.
- Audience weight: the utility of a probe for one audience profile.
- Audience Coverage: recovered audience utility divided by available audience utility.
- Domain-wise Coverage: Audience Coverage split by information domain.
- Efficiency: recovered utility per slide, token, or estimated presentation minute.
- Correctness: a source-grounded guardrail computed from extracted slide claims.
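The definitions above can be made concrete with a small sketch. It assumes each probe has already been joined with one audience weight and a binary "answered" judgment; the `ScoredProbe` record and its field names are hypothetical, and the toolkit itself may use graded scores or thresholds rather than a plain boolean.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ScoredProbe:
    """Hypothetical record: one probe joined with its weight and score."""
    probe_id: str
    domain: str      # information domain of the probe
    weight: float    # utility of the probe for one audience profile
    answered: bool   # assumed binary judgment: did the deck answer it?


def audience_coverage(probes):
    """Recovered audience utility divided by available audience utility."""
    available = sum(p.weight for p in probes)
    recovered = sum(p.weight for p in probes if p.answered)
    return recovered / available if available > 0 else 0.0


def domain_wise_coverage(probes):
    """Audience Coverage computed separately for each information domain."""
    by_domain = defaultdict(list)
    for p in probes:
        by_domain[p.domain].append(p)
    return {d: audience_coverage(ps) for d, ps in by_domain.items()}


def efficiency_per_slide(probes, slide_count):
    """Recovered utility per slide (tokens or minutes work the same way)."""
    recovered = sum(p.weight for p in probes if p.answered)
    return recovered / slide_count if slide_count > 0 else 0.0


probes = [
    ScoredProbe("p1", "method", 1.0, True),
    ScoredProbe("p2", "method", 0.5, False),
    ScoredProbe("p3", "background", 0.5, True),
]
print(audience_coverage(probes))                    # 0.75
print(domain_wise_coverage(probes)["background"])   # 1.0
print(efficiency_per_slide(probes, slide_count=3))  # 0.5
```

Because the weights differ per audience, the same set of answered probes can give a high coverage for one audience and a low one for another.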
The paper benchmark covers 45 topics, 6,849 deduplicated source-grounded probes, 3 main audience profiles, and 7 presentation scenes. This repository includes a larger working example collection used for extensive and follow-up checks, containing 113 source records, 8,127 sanitized source-grounded probes, and 27,059 audience-weight rows. Please refer to the dataset README for more details.
uv sync
git submodule update --init --recursive
git -C adapters/agent_generator apply ../../patches/agent_generator_local.patch
git -C adapters/template_generator apply ../../patches/template_generator_local.patch
cp .env.example .env
uv run xslidesbench doctor
For the same setup flow in one command, run bash src/tools/init_environment.sh.
The patch commands are not idempotent: git apply fails when a patch has already
been applied. If git apply reports that a patch was already applied, continue
with the next setup step. Prefer the xslidesbench CLI commands below as the
stable public interface; src/tools/ contains only curated setup and batch
helpers for larger experiment runs.
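The already-applied case can also be detected before patching. The following is a minimal sketch, not part of the toolkit: `apply_patch_once` is a hypothetical helper that relies on the standard `git apply --check` and `git apply --reverse --check` options. The `run` parameter exists only so the logic can be exercised without a real repository.

```python
import subprocess
from typing import Callable, Sequence


def _run(cmd: Sequence[str]) -> int:
    """Run a command quietly and return its exit code."""
    return subprocess.run(list(cmd), capture_output=True).returncode


def apply_patch_once(repo: str, patch: str,
                     run: Callable[[Sequence[str]], int] = _run) -> str:
    """Apply `patch` inside `repo` unless it is already applied.

    `patch` is resolved relative to `repo`, as with `git -C repo apply`.
    Returns "applied", "already applied", or "conflict".
    """
    base = ["git", "-C", repo, "apply"]
    # A clean forward check means the patch has not been applied yet.
    if run([*base, "--check", patch]) == 0:
        if run([*base, patch]) == 0:
            return "applied"
        return "conflict"
    # If the reverse patch applies cleanly, the patch is already in place.
    if run([*base, "--reverse", "--check", patch]) == 0:
        return "already applied"
    return "conflict"
```

For example, `apply_patch_once("adapters/agent_generator", "../../patches/agent_generator_local.patch")` mirrors the first patch command above and can be rerun safely.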
Agent-facing workflow guides live under skills/. Use them when delegating or
automating repository work: setup, probe construction, deck generation,
evaluation, and ablation analysis each have a dedicated skill. The skills point
back to the stable CLI and batch helpers rather than replacing them.
Fill the shared model endpoint variables in .env:
OPENAI_BASE_URL=
OPENAI_API_KEY=
OPENAI_MODEL=gemini-3.1-pro-preview
Model roles are configured in configs/*.yaml. The current defaults use
gemini-3.1-pro-preview for probe generation and gemini-3-flash-preview for
utility weighting and evaluation. max_tokens is unset by default; it caps only
the generated output length, not the input context length.
PDF source parsing uses MinerU by default. Set MINERU_API_KEY for the hosted
service, or MINERU_API_URL for a local/compatible endpoint. If
MINERU_BASE_URL is empty, the default hosted base URL is used. Parsed markdown
is cached under XSLIDESBENCH_MINERU_CACHE_DIR.
Build the core container:
docker build -t xslidesbench:latest .
Use mirrors during build when needed, for example with CERNET:
docker build -t xslidesbench:latest \
--build-arg APT_MIRROR=https://mirrors.cernet.edu.cn/debian \
--build-arg APT_SECURITY_MIRROR=https://mirrors.cernet.edu.cn/debian-security \
--build-arg UV_INDEX_URL=https://mirrors.cernet.edu.cn/pypi/web/simple \
.
Run an interactive shell with your local .env:
docker compose run --rm xslidesbench
For Compose builds, put the same optional mirror variables in .env:
APT_MIRROR=https://mirrors.cernet.edu.cn/debian
APT_SECURITY_MIRROR=https://mirrors.cernet.edu.cn/debian-security
UV_INDEX_URL=https://mirrors.cernet.edu.cn/pypi/web/simple
DOCKER_BUILD_NETWORK=host
Inside the container:
uv run xslidesbench doctor
X+SlidesBench follows the benchmark method as a seven-stage pipeline:
- prepare source manifests and source-side metadata,
- generate audience-agnostic, source-grounded probes,
- assign audience-specific utility weights to the same probe bank,
- generate or import PPTX/PDF decks from supported generators,
- evaluate whether each deck answers retained probes and stays source-grounded,
- compute Audience Coverage, Domain-wise Coverage, Efficiency, and Correctness,
- aggregate row-level metric files by system, condition, audience, and source.
The commands below follow this order. Implementation details for each stage are documented in Pipeline.
Each input source should be represented by a source file plus optional metadata. Supported source types include PDF, HTML, Markdown, and plain text.
data/runs/my_run/sources/
case_001/
source.pdf
source_metadata.json
Example metadata:
{
"scene": "policy_briefing",
"name": "Document title",
"category": "policy_report",
"topic": "public governance"
}
Prepare a source manifest from a configured source collection:
uv run xslidesbench sources \
--config configs/default.yaml \
--audiences-config configs/audiences.yaml \
--output-dir data/runs/my_run/sources \
--domain academia \
--scene academic_research_talk \
--audiences specialists,learners,decision_makers \
--max-cases 5
Prepared example documents are distributed through the Hugging Face dataset
release under examples/input_documents/files/.
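When adding your own sources by hand, the per-case layout and metadata file described above can be created with a few lines of Python. This is a sketch under assumptions: `write_case_metadata` is a hypothetical helper, not a toolkit function, and it writes only the four metadata keys shown in the example.

```python
import json
from pathlib import Path


def write_case_metadata(sources_dir, case_id, metadata):
    """Create <sources_dir>/<case_id>/source_metadata.json and return its path."""
    case_dir = Path(sources_dir) / case_id
    case_dir.mkdir(parents=True, exist_ok=True)
    path = case_dir / "source_metadata.json"
    # ensure_ascii=False keeps non-ASCII document titles readable on disk.
    path.write_text(
        json.dumps(metadata, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return path


example_metadata = {
    "scene": "policy_briefing",
    "name": "Document title",
    "category": "policy_report",
    "topic": "public governance",
}
# Usage: write_case_metadata("data/runs/my_run/sources", "case_001", example_metadata)
```

The source file itself (for example source.pdf) still needs to be copied into the same case directory.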
Probes are generated from the source before audience conditioning. Each probe stores a question, expected answer, source evidence, depth level, modality, and information domain.
uv run xslidesbench probes \
--source-context data/runs/default/sources/case_000/source_context.txt \
--scene academic_research_talk \
--source-case-id case_000 \
--output data/runs/default/probes/case_000/probes_deduped.jsonl \
--config configs/default.yaml \
--k 3
--source-chars is the maximum number of extracted source characters kept for
probe generation after PDF/HTML/text parsing. It controls cost and latency, and it
is not the model context-window size. For large-context models, keep one
full-source prompt whenever practical. Split the source only when an endpoint
cannot reliably handle the retained source.
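To illustrate the probe fields and the character budget described above, here is a minimal sketch. The `Probe` field names, the string type used for depth, and the head-only truncation in `retain_source` are all assumptions made for illustration; the actual JSONL schema and truncation strategy may differ.

```python
import json
from dataclasses import dataclass


@dataclass
class Probe:
    """Hypothetical probe record; field names are illustrative only."""
    question: str
    expected_answer: str
    evidence: str   # source evidence span backing the answer
    depth: str      # depth level (type assumed)
    modality: str
    domain: str     # information domain


def parse_probes(jsonl_text):
    """Parse one JSON object per non-empty line into Probe records."""
    return [
        Probe(**json.loads(line))
        for line in jsonl_text.splitlines()
        if line.strip()
    ]


def retain_source(text, source_chars=None):
    """Keep at most `source_chars` characters of the parsed source.

    This is a character budget, not a token or context-window limit.
    Head-only truncation is an assumption of this sketch.
    """
    return text if source_chars is None else text[:source_chars]


sample = json.dumps({
    "question": "What does the benchmark evaluate?",
    "expected_answer": "Audience-conditioned information selection.",
    "evidence": "evaluating slide generation as audience-conditioned information selection",
    "depth": "overview",
    "modality": "text",
    "domain": "motivation",
})
print(parse_probes(sample)[0].question)
print(len(retain_source("x" * 1000, source_chars=200)))  # 200
```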
The same probe bank is weighted separately for each audience profile.
uv run xslidesbench weights \
--probes data/runs/default/probes/case_000/probes_deduped.jsonl \
--audience specialists \
--scene academic_research_talk \
--output data/runs/default/probes/case_000/weights/specialists.jsonl \
--config configs/default.yaml \
--audiences-config configs/audiences.yaml \
--k 3 \
--batch-size 12
Audience and scene profiles are defined in configs/audiences.yaml.
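Since the same probe bank is weighted once per audience, the weights command is typically repeated with only the audience changed. The sketch below builds those command lines from the flags shown above; `weights_command` is a hypothetical helper, and the directory layout simply copies the example paths.

```python
import shlex

# Audience profile names as used in the sources command above.
AUDIENCES = ["specialists", "learners", "decision_makers"]


def weights_command(case_id, audience, scene, run_dir="data/runs/default"):
    """Build the argv list for one `xslidesbench weights` invocation."""
    probes_dir = f"{run_dir}/probes/{case_id}"
    return [
        "uv", "run", "xslidesbench", "weights",
        "--probes", f"{probes_dir}/probes_deduped.jsonl",
        "--audience", audience,
        "--scene", scene,
        "--output", f"{probes_dir}/weights/{audience}.jsonl",
        "--config", "configs/default.yaml",
        "--audiences-config", "configs/audiences.yaml",
        "--k", "3",
        "--batch-size", "12",
    ]


for audience in AUDIENCES:
    cmd = weights_command("case_000", audience, "academic_research_talk")
    print(shlex.join(cmd))
```

This only prints the commands. To execute them, pass each list to `subprocess.run(cmd, check=True)` from the repository root.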
X+SlidesBench can evaluate decks from any generator as long as the deck is available as PPTX or PDF. The toolkit includes these supported workflows.
X+SlidesBench is configured through local adapter paths and model variables:
LOCAL_AGENT_GENERATOR_DIR=adapters/agent_generator
LOCAL_AGENT_GENERATOR_BIN=
LOCAL_AGENT_RESEARCH_BASE_URL=${OPENAI_BASE_URL}
LOCAL_AGENT_RESEARCH_MODEL=gemini-3.1-pro-preview
LOCAL_AGENT_RESEARCH_API_KEY=${OPENAI_API_KEY}
LOCAL_AGENT_DESIGN_BASE_URL=${OPENAI_BASE_URL}
LOCAL_AGENT_DESIGN_MODEL=gemini-3.1-pro-preview
LOCAL_AGENT_DESIGN_API_KEY=${OPENAI_API_KEY}
LOCAL_AGENT_VISION_BASE_URL=${OPENAI_BASE_URL}
LOCAL_AGENT_VISION_MODEL=gemini-3.1-pro-preview
LOCAL_AGENT_VISION_API_KEY=${OPENAI_API_KEY}
MINERU_API_KEY=
MINERU_API_URL=
MINERU_BASE_URL=
MINERU_API_FIELD=
MINERU_MODEL_VERSION=vlm
XSLIDESBENCH_PDF_PARSER=mineru
XSLIDESBENCH_MINERU_CACHE_DIR=.cache/xslidesbench/mineru
Run prepared deck jobs:
uv run xslidesbench decks xslidesbench \
--run-sheet data/runs/default/sheets/manual_run_sheet.csv \
--env-file .env
SlideTailor requires a local adapter, a CUDA device, reference paper-slide pairs, and a template deck.
LOCAL_TEMPLATE_GENERATOR_DIR=adapters/template_generator
LOCAL_TEMPLATE_GENERATOR_BIN=
LOCAL_TEMPLATE_DATASET_DIR=
LOCAL_TEMPLATE_OPENAI_BASE_URL=${OPENAI_BASE_URL}
LOCAL_TEMPLATE_OPENAI_API_KEY=${OPENAI_API_KEY}
LOCAL_TEMPLATE_OPENAI_MODEL=gemini-3.1-pro-preview
TEMPLATE_GENERATOR_RETRIEVER_MODEL_PATH=
Prepare configs and then run generation:
uv run xslidesbench decks slidetailor prepare \
--run-dir data/runs/default \
--template-dataset-dir "$LOCAL_TEMPLATE_DATASET_DIR"
uv run xslidesbench decks slidetailor run \
--run-sheet data/runs/default/sheets/slidetailor_run_sheet.csv \
--env-file .env \
--device cuda:0
Both agnostic and audience-conditioned SlideTailor rows are generated directly from the source document and reference template. Audience-conditioned rows use the corresponding audience preference file, independent of the agnostic row.
NotebookLM is optional and manual. Use the same agnostic or audience-conditioned deck-generation prompt format described in the stage-4 configuration, then import the exported deck.
uv run xslidesbench decks notebooklm import \
--deck exported_notebooklm_deck.pdf \
--output-dir data/runs/default \
--case-label case_000 \
--condition conditioned \
--audience specialists
Evaluation checks whether the deck answers each retained probe and whether deck
claims are supported by the source. By default, Correctness is computed by
extracting atomic slide claims, verifying them in batches against retrieved
source snippets, and aggregating the verification labels into the same
correctness scalar used by the tables. Speaker notes are not used by the main
scoring path. Correctness judgments are cached per deck/source/model/settings
combination, so evaluating the same deck against multiple audience profiles does
not need to repeat claim extraction and verification.
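The caching behaviour described above can be pictured as a key that depends on the deck, source, model, and settings but not on the audience. The following is an illustrative sketch only: `correctness_cache_key` and its hashing scheme are assumptions, not the toolkit's actual cache format.

```python
import hashlib
import json


def _sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def correctness_cache_key(deck_bytes, source_text, model, settings):
    """Derive a stable key from deck, source, model, and settings.

    The audience is deliberately absent, so every audience profile
    evaluated against the same deck maps to the same cache entry.
    """
    payload = {
        "deck": _sha256(deck_bytes),
        "source": _sha256(source_text.encode("utf-8")),
        "model": model,
        "settings": settings,
    }
    # sort_keys makes the key independent of dict insertion order.
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return _sha256(encoded)


key = correctness_cache_key(
    b"deck file bytes",
    "source text",
    "gemini-3-flash-preview",
    {"correctness_method": "claim_level", "claim_batch_size": 50},
)
print(key[:16])
```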
uv run xslidesbench evaluate \
--deck data/runs/default/decks/PPTAgent/conditioned/specialists/case_000.pptx \
--audience specialists \
--scene academic_research_talk \
--source-context data/runs/default/sources/case_000/source_context.txt \
--source-case-id case_000 \
--probes data/runs/default/probes/case_000/probes_deduped.jsonl \
--weights data/runs/default/probes/case_000/weights/specialists.jsonl \
--output-dir data/runs/default/evaluations/xslidesbench/case_000/specialists \
--config configs/default.yaml \
--audiences-config configs/audiences.yaml \
--correctness-method claim_level \
--claim-batch-size 50
Use xslidesbench metrics when probe scores already exist and only metric
calculation is needed.
uv run xslidesbench metrics \
--probes data/runs/default/probes/case_000/probes_deduped.jsonl \
--weights data/runs/default/probes/case_000/weights/specialists.jsonl \
--scores data/runs/default/evaluations/xslidesbench/case_000/specialists/probe_scores.jsonl \
--audience specialists \
--slide-count 8 \
--word-count 700 \
--correctness 0.9 \
--tau-a 0.7 \
--output data/runs/default/evaluations/xslidesbench/case_000/specialists/metrics.json
uv run xslidesbench aggregate \
--metrics-dir data/runs/default/evaluations \
--output data/runs/default/summaries/metric_summary.json
Aggregation reads row-level metric files and can compute bootstrap confidence intervals over the per-deck results.
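For readers unfamiliar with bootstrap confidence intervals, the sketch below shows a plain percentile bootstrap over per-deck metric values. It is a generic illustration: `bootstrap_ci`, its defaults, and the choice of the percentile method are assumptions and may not match what the aggregate command does internally.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-deck metric values.

    Returns (point_estimate, lower, upper).
    """
    if not values:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)  # fixed seed for reproducible intervals
    n = len(values)
    # Resample decks with replacement and record each resample's mean.
    means = sorted(
        mean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return mean(values), means[lo_idx], means[hi_idx]


# Hypothetical per-deck Audience Coverage values for one system.
coverage = [0.2, 0.4, 0.6, 0.8, 1.0]
point, lower, upper = bootstrap_ci(coverage)
print(f"mean={point:.2f}, 95% CI=[{lower:.2f}, {upper:.2f}]")
```

Resampling at the deck level treats each deck as one independent observation, which matches the per-deck granularity of the row-level metric files.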