Official implementation of "Ask Optimal Questions: Aligning Large Language Models with Retriever's Preference in Conversational Search".
Chanwoong Yoon1*, Gangwoo Kim1*, Byeongguk Jeon1, Sungdong Kim2,3, Yohan Jo4, Jaewoo Kang1
Korea University1, NAVER Cloud2, KAIST AI3, Seoul National University4
In NAACL 2025.
📃 Paper | 🤗 Model | 🤗 RF-Collection
Abstract

Conversational search, unlike single-turn retrieval tasks, requires understanding the current question within a dialogue context. The common approach of rewrite-then-retrieve aims to decontextualize questions to be self-sufficient for off-the-shelf retrievers, but most existing methods produce sub-optimal query rewrites due to their limited ability to incorporate signals from the retrieval results. To overcome this limitation, we present a novel framework RetPO (Retriever's Preference Optimization), which is designed to optimize a language model (LM) for reformulating search queries in line with the preferences of the target retrieval systems. The process begins by prompting a large LM to produce various potential rewrites and then collects retrieval performance for these rewrites as the retrievers' preferences. Through this process, we construct a large-scale dataset called RF-Collection, containing Retrievers' Feedback on over 410K query rewrites across 12K conversations. Furthermore, we fine-tune a smaller LM using this dataset to align it with the retrievers' preferences as feedback. The resulting model demonstrates superiority on two benchmarks, surpassing the previous state-of-the-art performance of rewrite-then-retrieve approaches, including GPT-3.5.
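Conceptually, the preference data behind RF-Collection comes from scoring candidate rewrites with the target retriever and contrasting the best- and worst-performing ones. The sketch below is only an illustration of that idea, not the released code; the names (build_preference_pairs, candidate_rewrites, retrieval_score, and the field names) are hypothetical placeholders.

# Illustrative sketch (hypothetical names): turn retrieval scores on candidate
# rewrites into chosen/rejected preference pairs for optimizing the rewriter.
def build_preference_pairs(conversations, candidate_rewrites, retrieval_score):
    pairs = []
    for turn in conversations:
        # Score each candidate rewrite with the target retriever,
        # e.g., recall or MRR of the gold passage for the rewritten query.
        scored = [(rw, retrieval_score(rw, turn["gold_passage"]))
                  for rw in candidate_rewrites[turn["id"]]]
        scored.sort(key=lambda x: x[1], reverse=True)
        best, worst = scored[0], scored[-1]
        if best[1] > worst[1]:  # keep only turns with a clear preference gap
            pairs.append({"prompt": turn["history"],
                          "chosen": best[0],
                          "rejected": worst[0]})
    return pairs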
- Installation Instructions
- Evaluation
- RetPO (Retriever's Preference Optimization)
Please note that we use two separate environments:
- retpo_search (retriever indexing and search)
- retpo_qr (QR model training and inference)
The retrieval code relies on faiss-gpu, which is tied to specific CUDA and torch versions; mismatched versions can cause errors, which is why we keep the two environments separate.
Since our pipeline performs a large amount of dense retrieval, we recommend using faiss-gpu rather than faiss-cpu if your hardware allows it.
# create environment
conda create -n retpo_search python=3.9 && conda activate retpo_search
# install torch
pip install torch==1.12.0+cu116 torchvision==0.13.0+cu116 torchaudio==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu116
# faiss-cpu or faiss-gpu
# CPU
pip install faiss-cpu==1.7.3
# GPU
pip install https://github.com/kyamagu/faiss-wheels/releases/download/v1.7.3/faiss_gpu-1.7.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
# other requirements
pip install -r requirements.txt
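After installation, you can run a quick sanity check (not part of the original scripts) to confirm that faiss was built with GPU support and that torch matches the expected CUDA build:

# Quick compatibility check for the retpo_search environment (illustrative only).
import faiss
import torch

print("torch:", torch.__version__, "| built for CUDA:", torch.version.cuda)
print("GPU available to torch:", torch.cuda.is_available())
print("GPUs visible to faiss:",
      faiss.get_num_gpus() if hasattr(faiss, "get_num_gpus") else "n/a (CPU build)")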
# create environment
cd retpo_qr/
conda create -n retpo_qr python=3.10 && conda activate retpo_qr
# install torch
pip install torch==2.1.0 # this specific version is crucial for reproducibility. you may need to install other variants based on your hardware.
# install dependencies
python -m pip install .
# Flash Attention 2 (Optional, but Recommended for Faster Training)
python -m pip install flash-attn --no-build-isolation
# If your machine has less than 96GB of RAM and many CPU cores, reduce MAX_JOBS, e.g.:
# MAX_JOBS=4 python -m pip install flash-attn --no-build-isolation
We mainly evaluate our method with two retrievers, BM25 and ANCE, on two conversational QA benchmarks, TopiOCQA and QReCC.
There are well-organized repositories for preprocessing these datasets and indexing passages for retrieval. We recommend using them before running our code; we mainly follow ConvGQR as a reference.
Specifically, to run our code, you need to prepare the following files (a minimal script to check the resulting layout is sketched after the directory tree below).
You can find the code to prepare these folders here:
pyserini_index/ # https://github.com/fengranMark/ConvGQR/blob/main/bm25/create_index.sh
tokenized/ # https://github.com/fengranMark/ConvGQR/blob/main/gen_tokenized_doc.py
embeddings/ # https://github.com/fengranMark/ConvGQR/blob/main/gen_doc_embeddings.py
ROOT_DIR/
└── datasets/
    ├── checkpoints/                 # Retriever checkpoints
    │   └── ad-hoc-ance-msmarco      # https://huggingface.co/3ricL/ad-hoc-ance-msmarco
    ├── topiocqa/
    │   ├── pyserini_index/
    │   ├── full_wiki_segments.tsv
    │   ├── tokenized/
    │   └── embeddings/
    └── qrecc/
        ├── pyserini_index/
        ├── full_wiki_segments.tsv
        ├── tokenized/
        └── embeddings/
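As mentioned above, you can verify the prepared layout before running evaluation. The following is a minimal illustrative check assuming the directory names shown above; it is not part of the released code:

# Minimal check that the expected data layout exists (illustrative only).
import os

ROOT_DIR = os.environ.get("ROOT_DIR", ".")  # adjust to your setup
expected = ["checkpoints/ad-hoc-ance-msmarco"]
for dataset in ("topiocqa", "qrecc"):
    expected += [f"{dataset}/pyserini_index", f"{dataset}/full_wiki_segments.tsv",
                 f"{dataset}/tokenized", f"{dataset}/embeddings"]
for rel in expected:
    path = os.path.join(ROOT_DIR, "datasets", rel)
    print(("OK      " if os.path.exists(path) else "MISSING ") + path)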
To reproduce our reported performance, you can download the queries generated by RetPO from this Google Drive and place them under ROOT_DIR/distill_outputs/ in the repository.
You can reproduce our main results by running the following commands.
cd eval
bash ./scripts/bm25_topiocqa.sh
We construct a large-scale dataset called RF-Collection, containing Retrievers' Feedback on over 410K query rewrites across 12K conversations. You can download it from Hugging Face with the following snippet.
from datasets import load_dataset

# Replace {ROOT_DIR} with your actual root directory.
ds = load_dataset("dmis-lab/RF-Collection", cache_dir="{ROOT_DIR}/retpo_qr/")
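Once downloaded, you can inspect the available splits and columns (a generic check that does not assume a particular schema):

# Print the splits and column names of RF-Collection.
print(ds)
for split in ds:
    print(split, ds[split].column_names)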
@article{yoon2024ask,
title={Ask Optimal Questions: Aligning Large Language Models with Retriever's Preference in Conversational Search},
author={Yoon, Chanwoong and Kim, Gangwoo and Jeon, Byeongguk and Kim, Sungdong and Jo, Yohan and Kang, Jaewoo},
journal={arXiv preprint arXiv:2402.11827},
year={2024}
}
For more information or any questions about our work, feel free to contact me (cwyoon99 (at) korea.ac.kr or gmail.com).