overview.mp4
The First Large-scale Benchmark for UAV Visual Localization under Low-altitude Multi-view Observation Conditions
AnyVisLoc is a benchmark for UAV visual localization under low-altitude, multi-view observation conditions. It uses 2.5D aerial and satellite reference maps, and its baseline follows a unified image retrieval β image matching β Perspective-n-Point (PnP) localization framework.
πππ News: Our paper has been accepted to CVPR 2026 Findings! πππ The complete AnyVisLoc dataset is now publicly available, and the testing code has been upgraded to support the NPZ-based data format. Thank you for your attention and support!
If you find our work useful, please consider giving us a βοΈ. Your support means a lot to us! π₯°π₯°π₯°
View Paper Β· Download Dataset (Baidu NetDisk) Β·
Other download options will be available within one week.
-
Large scale: 20,077 full-resolution DJI UAV images from 24 scenes across China. The reference maps cover distinct regions ranging in coverage area from 10,000
$m^2$ to 9,000,000$m^2$ . - Multi-altitude: The dataset contains diverse UAV flight heights, ranging approximately from 6 m to 500 m.
- Multi-view: The dataset covers common UAV imaging pitch angles from approximately 5Β° to 90Β°, including both nadir and oblique views.
- Multi-scene: The dataset includes dense urban areas, towns and villages, typical landmark scenes, campuses, parks, natural scenes such as grasslands, farmland, and mountains, as well as mixed environments.
- Multi-reference map: The dataset provides two complementary types of 2.5D reference maps. The aerial map provides high spatial resolution for high-precision localization, while the satellite map is a more broadly available reference source that does not require scene-specific aerial acquisition or reconstruction.
- Multi-drone type: DJI Mavic 2, Mavic 3, Mavic 3 Pro, Phantom 3, Phantom 4, Phantom 4 RTK, and Mini 4 Pro.
- Others: multiple weather conditions (βοΈβ βοΈπ«οΈπ§οΈ), seasons (π»ππβ), and illumination conditions (ππ).
Note: To further improve the diversity of the public release, we removed three scenes with limited geographic coverage and relatively single-view observations, and added two new scenes. The current release contains 24 scenes and 20,077 UAV images.
The ground-truth generation protocol is provided in GROUND_TRUTH.md.
Detailed scene-level statistics are provided in DATASET_STATISTICS.md.
This repository provides a unified AnyVisLoc testing pipeline for UAV visual localization. The testing code provides three evaluation modes:
- Full pipeline: image retrieval β pixel matching β Perspective-n-Point (PnP) localization.
- Retrieval only: evaluates image-level retrieval without matching or PnP.
- Matching + localization only: evaluates pixel matching and Perspective-n-Point (PnP) localization using the ground-truth reference crop, without image retrieval.
The current release supports the following components:
| Component | Supported options |
|---|---|
| Image retrieval | CAMP |
| Pixel matching | RoMa, SP_LG, SP_LG_GIM, SP_LG_MINIMA, ALIKED_LG, and DISK_LG |
| PnP solver | OpenCV implementations: P3P by default; AP3P, EPNP are also available through --PnP_method |
| Pose-prior setting | yp or p through --pose_priori |
Pose-prior protocol. In practical UAV applications, coarse altitude and attitude measurements are often available from the onboard inertial navigation system. Under low-altitude oblique views, a usable altitude and pitch angle estimate is important for estimating the projected footprint and cropping an appropriate aerial or satellite reference region. Large uncompensated yaw differences can also substantially reduce retrieval and pixel-matching accuracy.
ypis the easier and default setting: it uses coarse pitch, yaw, and altitude to estimate the view center, and yaw-aligns the reference map before retrieval and matching.pis the more challenging setting: it retains the pose-aware view-center estimate but does not yaw-rotate the reference map, leaving the yaw discrepancy to the retrieval and matching stages.unknownuses no pose prior. To study less reliable onboard measurements, users may perturb the released pose metadata before evaluation. These standardized settings are provided to make evaluations on AnyVisLoc more comparable; they do not prevent users from testing other prior assumptions or noise levels.
Clone the project:
git clone https://github.com/UAV-AVL/Benchmark.git
cd BenchmarkWe recommend creating a clean conda environment. The demo has been tested with Python 3.10, PyTorch 2.5.1, CUDA 12.1, and an NVIDIA RTX 4090 GPU.
conda create -n anyvisloc_demo python=3.10 -y
conda activate anyvisloc_demoInstall GDAL from conda-forge:
conda install -c conda-forge gdalInstall PyTorch first according to your CUDA version. For example, for CUDA 12.1:
pip install torch==2.5.1 torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu121Then install the remaining dependencies:
pip install -r requirements.txtIf you use a different CUDA version, install the corresponding PyTorch build from the official PyTorch installation page before running pip install -r requirements.txt.
Optional acceleration packages such as flash-attn and xformers are not included in the basic requirements.txt, because they are highly dependent on the CUDA, PyTorch, GCC, and system environment. Install them only when your environment is compatible.
The complete AnyVisLoc dataset is publicly available through the official download portal. By downloading or using the dataset, you agree to the AnyVisLoc Dataset License. Please place the downloaded files under a local dataset root, for example:
./Data/AnyVisLoc/
The default baseline configuration uses CAMP for image retrieval and supports RoMa and selectable sparse matchers for pixel-level matching. Required checkpoints can be downloaded from the original projects: CAMP, RoMa, DINOv2 ViT-L/14 (used by RoMa), GIM LightGlue, and MINIMA LightGlue. Alternatively, all required checkpoints can be downloaded from our Baidu NetDisk link (extraction code: zyp9).
Please keep the original filenames and place the checkpoints in the expected folders:
Benchmark/
βββ Retrieval_Models/
β βββ CAMP/
β βββ weights/
β βββ <CAMP checkpoint>.pth
βββ Matching_Models/
βββ RoMa/
β βββ ckpt/
β βββ roma_outdoor.pth
β βββ dinov2_vitl14_pretrain.pth
βββ Sparse_matchers/
βββ weights/
βββ gim_lightglue_100h.ckpt
βββ minima_lightglue.pth
Note: All third-party checkpoints remain subject to the licenses, terms of use, and citation requirements of their original authors.
The released AnyVisLoc dataset uses a scene-wise NPZ structure. Each scene folder contains one lightweight reference JSON file, two reference maps, two DSM files, and multiple UAV image samples.
Data/AnyVisLoc/
βββ Scene_01/
β βββ L01_reference.json
β βββ aerial_map.png
β βββ aerial_dsm.npy
β βββ satellite_map.png
β βββ satellite_dsm.npy
β βββ L01_0001.npz
β βββ L01_0002.npz
β βββ ...
βββ Scene_02/
β βββ L02_reference.json
β βββ aerial_map.png
β βββ aerial_dsm.npy
β βββ satellite_map.png
β βββ satellite_dsm.npy
β βββ L02_0001.npz
β βββ ...
βββ ...
Each Lxx_reference.json is required by the current runners. It is used to discover scene folders and, for each sample, resolve the reference data selected by --reference_mode before the map and DSM are loaded lazily.
The current testing code supports:
aerial: high-resolution aerial orthophoto + aerial DSM.satellite: satellite image + satellite DSM.
For each reference mode, the JSON file records the relative paths to the reference map and DSM, the map and DSM spatial resolutions, and the local coordinate origin information required to convert between image pixels and scene-local metric coordinates.
Each Lxx_????.npz file stores one UAV image sample and its metadata. The dataloader reads these files through AnyVisLocNPZDataset in avl_data.py. The current dataloader returns:
| Key | Description |
|---|---|
image |
RGB UAV image tensor in [3, H, W], normalized to [0, 1]. |
K |
Camera intrinsic matrix after UAV image downsampling, shape [3, 3]. |
dist |
Distortion coefficients, shape [5]. |
image_size |
Loaded/downsampled UAV image size, [H, W]. |
pose_c2w |
Camera-to-world pose matrix, shape [4, 4]. |
pose_w2c |
World-to-camera pose matrix, shape [4, 4]. |
xyz |
UAV position in the dataset-local metric coordinate system, [x, y, z]. |
euler_deg |
UAV attitude in degrees, [roll, pitch, yaw]. |
reference |
Lightweight reference pointer, including ref_json_path, ref_path, and scene_dir. |
sample_id |
Sample identifier, such as L01_0001. |
The full pipeline runs retrieval, matching, and PnP localization in sequence.
python run_avl.py \
--dataset_root ./Data/AnyVisLoc \
--yaml config_selectable_matchers.yaml \
--save_dir ./Result/AnyVisLoc \
--reference_mode aerial \
--pose_priori yp \
--strategy Topn_opt \
--PnP_method P3P \
--match_keypoints 3000 \
--retrieval_k 5 \
--success_thresholds 1 3 5 10 20 \
--device cudaRun only selected scenes:
python run_avl.py \
--dataset_root ./Data/AnyVisLoc \
--yaml config_selectable_matchers.yaml \
--reference_mode aerial \
--scenes Scene_01 Scene_02 \
--device cudaQuickly test the first few samples:
python run_avl.py \
--dataset_root ./Data/AnyVisLoc \
--yaml config_selectable_matchers.yaml \
--reference_mode aerial \
--limit 20 \
--device cudaTo enable visualization for qualitative inspection, add:
--visualize
β οΈ For faster evaluation, keep visualization disabled by default or explicitly use:
--no-visualizeThe full pipeline reads RETRIEVAL_METHODS and MATCHING_METHODS from the YAML configuration file. For example, the provided config_selectable_matchers.yaml uses:
RETRIEVAL_METHODS:
- CAMP
MATCHING_METHODS:
- SP_LG_GIM
PNP_METHODS:
- P3P
RETRIEVAL_COVER: 50
RETRIEVAL_TOPN: 5
RETRIEVAL_FEATURE_NORM: true
BATCH_SIZE: 128To test different retrieval or matching methods in the full pipeline, edit the YAML file.
Use run_avl_retrieval_only.py to evaluate image-level retrieval without pixel matching or PnP.
python run_avl_retrieval_only.py \
--dataset_root ./Data/AnyVisLoc \
--yaml config_selectable_matchers.yaml \
--save_dir ./Result/AnyVisLoc_retrieval_only \
--reference_mode aerial \
--retrieval_methods CAMP \
--retrieval_ks 1 3 5 \
--pose_priori yp \
--device cudaEvaluate satellite references instead of aerial references:
python run_avl_retrieval_only.py \
--dataset_root ./Data/AnyVisLoc \
--yaml config_selectable_matchers.yaml \
--reference_mode satellite \
--retrieval_methods CAMP \
--retrieval_ks 1 3 5 \
--device cudaUseful retrieval-only arguments
| Argument | Description |
|---|---|
--retrieval_methods |
Overrides RETRIEVAL_METHODS in the YAML file. Example: --retrieval_methods CAMP. |
--retrieval_ks |
Sets the K values for Recall@K and PDM@K. Example: --retrieval_ks 1 3 5. |
--ref_feature_cache_dir |
Sets a custom directory for cached reference/gallery retrieval features. |
--disable_ref_feature_cache |
Disables the reference feature cache. |
Use run_avl_match_loc.py to evaluate pixel-level matching and PnP localization without image retrieval. In this mode, the reference patch is cropped from the ground-truth view center, making it suitable for evaluating matching and localization under oracle coarse localization.
python run_avl_match_loc.py \
--dataset_root ./Data/AnyVisLoc \
--yaml config_selectable_matchers.yaml \
--save_dir ./Result/AnyVisLoc_match_loc \
--reference_mode aerial \
--matching_methods SP_LG_GIM \
--pose_priori yp \
--PnP_method P3P \
--match_keypoints 3000 \
--min_matches 5 \
--device cudaTest several matchers:
python run_avl_match_loc.py \
--dataset_root ./Data/AnyVisLoc \
--yaml config_selectable_matchers.yaml \
--matching_methods SP_LG SP_LG_GIM ALIKED_LG DISK_LG \
--reference_mode aerial \
--device cudaUseful matching/localization-only arguments
| Argument | Description |
|---|---|
--matching_methods |
Overrides MATCHING_METHODS in the YAML file. Supported names include Roma, SP_LG, SP_LG_GIM, SP_LG_MINIMA, ALIKED_LG, and DISK_LG. |
--resize_ratio |
Sets the resize ratio for UAV images before matching. |
--match_keypoints |
Sets the maximum number of sparse keypoints used by selectable matchers. |
--pnp_reproj_error |
Sets the RANSAC reprojection error threshold for PnP. |
--pnp_iterations |
Sets the maximum number of RANSAC iterations for PnP. |
--pnp_confidence |
Sets the RANSAC confidence for PnP. |
Common arguments shared by the new runners:
Click to expand common arguments
| Argument | Options / Example | Description |
|---|---|---|
--yaml |
config_selectable_matchers.yaml |
Configuration file for retrieval, matching, and evaluation settings. |
--scenes |
Scene_01 Scene_02 or Scene_01,Scene_02 |
Tests selected scenes only. Omit this argument to test all scenes. |
--strategy |
Top1 / Topn_opt |
Candidate-selection strategy for the full pipeline. Top1 matches only the top retrieval candidate. Topn_opt is the default and evaluates the top-N retrieval candidates before selecting the best PnP result. |
--patch_scale |
1.0 by default |
Physical reference-patch scale. The reference patch is estimated from the UAV image footprint using altitude, camera intrinsics, and pose prior. This value multiplies that footprint before cropping the reference map. Larger values add context but increase runtime and distractors; smaller values are faster but less tolerant to pose-prior errors. |
--reference_mode |
aerial / satellite |
Selects aerial or satellite reference maps. |
--pose_priori |
yp / p |
yp uses pitch, yaw, and altitude for view-center estimation and yaw-aligns the reference map. p keeps the pose-aware view-center estimate but does not yaw-align the map. |
--limit |
20 |
Processes only the first N samples for quick debugging. 0 means all samples. |
The scripts save results in timestamped folders under --save_dir. Expand the sections below to view the typical full-pipeline layout and the meaning of each main result file.
Typical full-pipeline output structure
Result/AnyVisLoc/
βββ 20260629-1200_all/
βββ aerial/
βββ aerial-CAMP-SP_LG_GIM-yp/
β βββ all_results.json
β βββ all_results.csv
β βββ all_results.jsonl
β βββ failed.json
β βββ failed.jsonl
β βββ summary.json
β βββ retrieval_metrics.json
β βββ retrieval_metrics_by_scene.csv
β βββ success_curve_overall.png
β βββ scene_success_stats/
β βββ Scene_01/
β βββ L01_0001/
β βββ VG_data_L01_0001.pkl
βββ all_results.json
βββ all_results.csv
βββ failed.json
βββ summary.json
Main result files
| File | Description |
|---|---|
summary.json |
Overall statistics, including success rate and mean/median localization error. |
all_results.csv |
Per-sample localization and retrieval metrics. |
all_results.json / all_results.jsonl |
Per-sample results in JSON format. |
failed.json / failed.jsonl |
Failed samples with error messages and tracebacks. |
retrieval_metrics.json |
Region-normalized retrieval metrics for the full pipeline. |
success_curve_overall.png |
Overall localization success curve under different distance thresholds. |
scene_success_stats/ |
Per-scene success statistics and curves. |
VG_data_*.pkl |
Detailed per-sample full-pipeline result. |
result.json |
Per-sample result used by retrieval-only and matching/localization-only runners. |
To test your own dataset with the AnyVisLoc workflow, convert it to the same NPZ-based structure:
YourDatasetRoot/
βββ Scene_01/
β βββ L01_reference.json
β βββ aerial_map.png
β βββ aerial_dsm.npy
β βββ satellite_map.png
β βββ satellite_dsm.npy
β βββ L01_0001.npz
β βββ L01_0002.npz
β βββ ...
βββ ...
Make sure that:
- Each scene folder has a valid
Lxx_reference.json. - Each UAV sample is stored as
Lxx_????.npz. - UAV samples contain camera intrinsics, distortion coefficients, camera poses, local metric coordinates, and Euler angles.
- Reference maps and DSMs are spatially aligned through the resolutions and local origins recorded in the reference JSON.
- Scene names follow the
Scene_XXconvention if you want to use the default scene filtering.
- Put your retrieval method under:
./Retrieval_Models/your_approach/
- Modify the retrieval model loader and feature extraction interface:
./Retrieval_Models/multi_model_loader.py
./Retrieval_Models/feature_extract.py
- Add your method name to the YAML configuration:
RETRIEVAL_METHODS:
- CAMP
- YOUR_RETRIEVAL_METHODThe retrieval method should provide a model and image transform that are compatible with the existing retrieval_init() and retrieval_all_anyvisloc() pipeline.
- Put your matcher under:
./Matching_Models/your_approach/
-
Add an initialization function and a matching function following the existing matcher interface.
-
Register the matcher in
avl_utils.py, especially in:
matching_init()
run_pixel_match_anyvisloc()- Add your method name to the YAML configuration:
MATCHING_METHODS:
- SP_LG_GIM
- YOUR_MATCHING_METHODThe current code already supports the selectable sparse matchers listed in the Supported Baseline Components section above.
Why do we need to perform image retrieval before image matching?
In UAV visual localization, the reference map usually covers a much larger area than a single real-time UAV image. Running pixel-level matching directly over the entire map would create a large search space and heavy computational and storage costs. Under low-altitude oblique observation, image-level retrieval is also more robust to viewpoint changes than direct full-map pixel matching. Therefore, AnyVisLoc first uses image retrieval, also known as visual geo-localization or visual place recognition, to obtain coarse location candidates, and then applies pixel matching and PnP localization for accurate pose estimation.
Why do we provide both aerial reference maps and satellite maps?
The two reference modalities represent different practical settings. Aerial reference maps provide high-resolution geometry and support high-precision localization, but they require dedicated aerial data collection and reconstruction. Satellite reference maps are more broadly available and can be acquired without scene-specific aerial mapping, making them suitable for studying localization under more general reference conditions. AnyVisLoc supports systematic evaluation with both reference sources.
Why is end-to-end retrievalβmatchingβPnP evaluation slow, and how can it be accelerated?
Most of the runtime is spent on image retrieval rather than matching or PnP. The retrieval patch size and location are adapted to each UAV image according to its altitude and pose prior. Since different UAV images can have different altitude, pitch, and yaw values, the reference map may need to be cropped differently for each query. In the yp setting, the reference map is also yaw-rotated per query, which changes the gallery layout and usually requires reference-patch features to be extracted again while searching across the full reference map.
The p setting keeps the reference map unrotated and supports caching reference-gallery features when the patch geometry and tile layout remain the same; see --ref_feature_cache_dir. This can reduce repeated feature extraction, but full-map retrieval is still computationally demanding. For larger experiments, we recommend distributing independent scenes or queries across multiple GPUs, grouping samples with equivalent or near-equivalent crop settings, and precomputing or caching the corresponding gallery features. The released demo intentionally keeps the direct pose-conditioned retrieval formulation for reproducibility instead of introducing aggressive approximation or engineering-specific optimizations.
Why are satellite-reference localization results substantially worse than aerial-reference results?
This gap is expected and reflects the intrinsic difficulty of low-altitude UAV localization with satellite references. UAV and satellite images can differ greatly in viewpoint, spatial resolution, illumination, season, and acquisition time, and many objects may have changed between acquisitions. Satellite imagery also has substantially lower spatial resolution than low-altitude UAV imagery, so even visually corresponding objects may not support accurate pixel-level alignment. In addition, satellite reference images can contain local georeferencing distortions, mosaicking artifacts, blur, and non-orthorectified building lean. Finally, the public satellite DSM has a spatial resolution of approximately 30 m, which cannot support high-precision pose estimation in the same way as the higher-resolution aerial DSM. These factors are realistic challenges for low-altitude UAV visual localization with satellite references.
Please cite the official AnyVisLoc paper in any public work that uses the dataset.
@InProceedings{Ye_2026_CVPR,
author = {Ye, Yibin and Teng, Xichao and Chen, Shuo and Liu, Leqi and
Wang, Kun and Song, Xiaokai and Li, Zhang},
title = {Exploring the Best Way for UAV Visual Localization under
Low-altitude Multi-view Observation Condition: A Benchmark},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR) Findings},
month = {June},
year = {2026},
pages = {1731--1741}
}AnyVisLoc is released under the AnyVisLoc Dataset License for non-commercial academic research, education, reproducibility, and internal evaluation only.
Without prior written permission, users may not use the dataset commercially; redistribute, mirror, upload, or share the dataset; or publicly release any re-annotated, re-partitioned, relabeled, transformed, or substantially overlapping dataset, benchmark, subset, or derived data file based on AnyVisLoc. Users may create internal annotations, alternative splits, and derived metadata for non-commercial research, but may not publicly release them.
Third-party satellite imagery, DSM/elevation products, and other third-party materials are not licensed by the AnyVisLoc authors; users must comply with the applicable third-party terms.
AnyVisLoc builds on and interfaces with several excellent open-source projects:
- CAMP, used as the retrieval baseline.
- RoMa, used for robust dense feature matching.
- LightGlue, used as the sparse local-feature matcher.
- SuperPoint, used with LightGlue-based pipelines.
- GIM, whose GIM-trained SuperPoint + LightGlue checkpoint is supported.
- MINIMA, whose modality-invariant LightGlue checkpoint is supported.
- ALIKED, used as a lightweight keypoint and descriptor extractor.
- DISK, available through the sparse-matching integration.
- AW3D30, used as the source of the public 30 m DSM product.
- Google Earth historical satellite imagery is subject to the applicable Google Geo Guidelines and other relevant third-party terms.
Please consult and comply with the licenses and citation requirements of all third-party projects and pretrained models.
For questions about AnyVisLoc, dataset access, or adding your own retrieval or matching method to this benchmark, please contact zhangli_nudt@163.com.

