🌐 Project Page • 📖 Paper • 🤗 Hugging Face • 🧭 ModelScope
Official code and data for the paper RPC-Bench: A Fine-grained Benchmark for Research Paper Comprehension (ACL 2026).
RPC-Bench is a large-scale, fine-grained question answering benchmark constructed from review-rebuttal exchanges of high-quality academic papers. Each paper is available in two input formats (pure text and rendered page images), enabling evaluation of both large language models (LLMs) and visual language models (VLMs).
First, create a conda environment and install all pip package requirements.
conda create -n rpc python==3.11.13
conda activate rpc
pip install -r requirements.txt
The pipeline/ directory provides an example workflow for constructing benchmark QA annotations from crawled OpenReview review-rebuttal data through LLM-based decomposition, rewriting, and filtering.
See pipeline/README.md for details.
For this benchmark, each academic paper can be processed into either structured text or page-rendered images, enabling evaluation across both LLMs and VLMs. Choose the parsing mode that best fits your experimental objectives.
- File Download: Download paper PDFs based on metadata from JSON files located under the benchmark/ directory.
python download.py
- Text Parsing: Parse PDF content into text using MinerU.
pip install --upgrade pip
pip install uv
uv pip install -U "mineru[core]"
mineru-models-download
mineru -p "./benchmark/pdf/test" -o "./benchmark/parse/test" --source local
- Image Parsing: Convert PDF pages into image format for further processing.
python pdf2image.py
You may also download our processed data directly from Google Drive, Hugging Face, or ModelScope. The processed data includes:
- pdf/: original paper PDFs.
- md/: Markdown files parsed from each paper by MinerU, used as text input for LLM-oriented evaluation.
- parse/: full MinerU parsing outputs, including structured layout and content artifacts.
- vlm/: page images rendered from PDFs with PyMuPDF at 200 DPI, used for VLM-oriented evaluation.
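For readers who want to see what rendering pages with PyMuPDF at 200 DPI involves, here is a minimal sketch. It is not the repository's pdf2image.py: the function names, the output file naming, and the PNG format are assumptions made for illustration. It relies only on the fact that PDF coordinates are defined at 72 points per inch, so a 200 DPI render corresponds to a zoom factor of 200/72.

```python
from pathlib import Path

PDF_POINTS_PER_INCH = 72  # PDF user space is defined at 72 points per inch


def zoom_for_dpi(dpi: int) -> float:
    """Scale factor that maps 72-points-per-inch PDF space to the requested DPI."""
    return dpi / PDF_POINTS_PER_INCH


def render_pdf(pdf_path: Path, out_dir: Path, dpi: int = 200) -> list[Path]:
    """Render every page of one PDF to a PNG file and return the written paths.

    Hypothetical helper: the naming scheme below is an assumption, not the
    layout used by the released vlm/ directory.
    """
    import fitz  # PyMuPDF; imported lazily so zoom_for_dpi works without it

    out_dir.mkdir(parents=True, exist_ok=True)
    zoom = zoom_for_dpi(dpi)
    written: list[Path] = []
    with fitz.open(str(pdf_path)) as doc:
        for index, page in enumerate(doc):
            # The matrix scales both axes, so the pixmap has dpi/72 pixels per point.
            pixmap = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
            target = out_dir / f"{pdf_path.stem}_page{index + 1}.png"
            pixmap.save(str(target))
            written.append(target)
    return written
```

Under these assumptions, a US Letter page (612 x 792 points) would come out at 1700 x 2200 pixels, which gives a rough idea of the image size a VLM receives per page.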
The examples below show how to download only md/ and vlm/, which are sufficient for running the default LLM and VLM inference scripts.
Option 1: Download from Hugging Face
pip install -U huggingface_hub
hf download zai-org/RPC-Bench md/test/ vlm/test/ --repo-type dataset --local-dir ./benchmark
Option 2: Download from ModelScope
pip install -U modelscope
modelscope download --dataset ZhipuAI/RPC-Bench --include "md/test/**" "vlm/test/**" --local_dir ./benchmark
The consistency/ directory provides a self-contained example for measuring consistency between LLM judge outputs and human pairwise preferences.
See consistency/README.md for details.
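As a rough illustration of what measuring consistency between a judge and human pairwise preferences can look like, the sketch below computes a raw agreement rate and Cohen's kappa over per-pair labels. This is an assumption about one reasonable way to do it, not the method implemented in consistency/; the label values ("A", "B", "tie") and function names are hypothetical.

```python
from collections import Counter


def pairwise_agreement(judge: list[str], human: list[str]) -> float:
    """Fraction of comparison pairs where the judge picks the same label as the human."""
    if len(judge) != len(human):
        raise ValueError("judge and human label lists must have the same length")
    if not judge:
        raise ValueError("at least one comparison pair is required")
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(judge)


def cohen_kappa(judge: list[str], human: list[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    observed = pairwise_agreement(judge, human)
    n = len(judge)
    judge_counts = Counter(judge)
    human_counts = Counter(human)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(judge_counts[label] * human_counts[label] for label in judge_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters always use the same single label
    return (observed - expected) / (1 - expected)


# Hypothetical labels: which of two answers (A or B) each rater prefers.
judge_labels = ["A", "B", "A", "tie", "B", "A"]
human_labels = ["A", "B", "B", "tie", "B", "A"]
agreement = pairwise_agreement(judge_labels, human_labels)
kappa = cohen_kappa(judge_labels, human_labels)
```

Raw agreement is easy to read but can look high when one label dominates; kappa discounts that chance agreement, which is why both are shown here.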
GPT-5 is given as an example below, but you may replace this with any other LLM or VLM supported in your environment.
- LLM Inference:
python llm.py
- VLM Inference:
python vlm.py
After inference, evaluate predictions against benchmark references using:
python eval.py