From cbeabb02652354b662f16d71096fb39593967825 Mon Sep 17 00:00:00 2001 From: prrao87 <35005448+prrao87@users.noreply.github.com> Date: Fri, 15 May 2026 11:38:30 -0700 Subject: [PATCH 1/2] Update HF dataset cards --- docs/datasets/ade20k.mdx | 243 ++++++--- docs/datasets/chartqa.mdx | 251 +++++++-- docs/datasets/cifar10.mdx | 236 ++++++--- docs/datasets/coco-captions-2017.mdx | 273 +++++++--- docs/datasets/coco-detection-2017.mdx | 278 ++++++---- docs/datasets/docvqa.mdx | 249 +++++++-- docs/datasets/eurosat.mdx | 240 +++++++-- docs/datasets/fashion-mnist.mdx | 234 +++++++-- docs/datasets/fineweb-edu.mdx | 376 ++++++++------ docs/datasets/flickr30k.mdx | 277 ++++++---- docs/datasets/food101.mdx | 235 +++++++-- docs/datasets/gqa-testdev-balanced.mdx | 250 +++++++-- docs/datasets/hotpotqa-distractor.mdx | 261 +++++++--- docs/datasets/imagenet-1k-val.mdx | 234 ++++++--- docs/datasets/index.mdx | 60 +-- docs/datasets/kitti-2d-detection.mdx | 284 ++++++++--- docs/datasets/laion-1m.mdx | 340 ++++++------- docs/datasets/lerobot-pusht.mdx | 304 +++++++++-- docs/datasets/lerobot-xvla-soft-fold.mdx | 481 ++++++++++-------- docs/datasets/librispeech-clean.mdx | 286 ++++++++--- docs/datasets/mnist.mdx | 239 ++++++--- docs/datasets/ms-marco-v2.mdx | 260 +++++++--- docs/datasets/natural-questions-val.mdx | 252 +++++++-- docs/datasets/openvid.mdx | 462 ++++++++--------- docs/datasets/oxford-pets.mdx | 231 +++++++-- .../datasets/pascal-voc-2012-segmentation.mdx | 249 +++++++-- docs/datasets/squad-v2.mdx | 252 ++++++--- docs/datasets/stanford-cars.mdx | 248 +++++++-- docs/datasets/textvqa.mdx | 257 ++++++++-- docs/datasets/trivia-qa.mdx | 249 +++++++-- docs/datasets/vqav2.mdx | 289 +++++++---- 31 files changed, 6004 insertions(+), 2376 deletions(-) diff --git a/docs/datasets/ade20k.mdx b/docs/datasets/ade20k.mdx index 39f1558..e49b91c 100644 --- a/docs/datasets/ade20k.mdx +++ b/docs/datasets/ade20k.mdx @@ -1,7 +1,7 @@ --- title: "ADE20K" sidebarTitle: "ADE20K" -description: 
"Lance-formatted version of the full ADE20K scene parsing benchmark (sourced from 1aurent/ADE20K) — 27,574 scene images with semantic and instance segmentation maps, scene labels, and per-object metadata, all stored inline." +description: "A Lance-formatted version of the full ADE20K scene parsing benchmark, sourced from 1aurent/ADE20K. Each row is one scene image with its inline JPEG bytes, a per-pixel semantic segmentation map encoded as PNG bytes, an optional instance map, scene…" --- -Lance-formatted version of the full [ADE20K scene parsing benchmark](https://groups.csail.mit.edu/vision/datasets/ADE20K/) (sourced from [`1aurent/ADE20K`](https://huggingface.co/datasets/1aurent/ADE20K)) — **27,574 scene images** with semantic and instance segmentation maps, scene labels, and per-object metadata, all stored inline. +A Lance-formatted version of the full [ADE20K scene parsing benchmark](https://groups.csail.mit.edu/vision/datasets/ADE20K/), sourced from [`1aurent/ADE20K`](https://huggingface.co/datasets/1aurent/ADE20K). Each row is one scene image with its inline JPEG bytes, a per-pixel semantic segmentation map encoded as PNG bytes, an optional instance map, scene class labels, the full per-polygon object-name list, an OpenCLIP image embedding, and pre-built indices — all available directly from the Hub at `hf://datasets/lance-format/ade20k-lance/data`. + +## Key features + +- **Inline image and segmentation bytes** — both the JPEG image and the RGB-encoded PNG segmentation map ride on the same row, so an annotated example is a single row read with no sidecar files. +- **Per-polygon object metadata** — `object_names` keeps the full list (one entry per annotated polygon), `objects_present` is the deduped set used for class-presence filters, and `num_objects` is precomputed. +- **CLIP image embeddings** (`image_emb`, OpenCLIP ViT-B/32, 512-d, cosine-normalized) for visual retrieval over scenes. 
+- **Indices shipped on disk** — `IVF_PQ` on `image_emb`, `BTREE` on `num_objects`, and `LABEL_LIST` on `objects_present` for fast `array_has_any` / `array_has_all` predicates. ## Splits | Split | Rows | |-------|------| | `train.lance` | 25,574 | -| `validation.lance` | 2,000 | +| `validation.lance` | 2,000 | ## Schema @@ -30,70 +37,104 @@ Lance-formatted version of the full [ADE20K scene parsing benchmark](https://gro | `segmentation` | `large_binary` | Inline PNG bytes — semantic segmentation map (RGB encoding per ADE20K spec) | | `instance` | `large_binary?` | Inline PNG bytes — instance map; null if not provided | | `filename` | `string` | ADE20K relative filename | -| `scene` | `list` | Scene labels (e.g. `["bathroom"]`) | -| `object_names` | `list` | Names of all annotated objects (one entry per polygon) | +| `scene` | `list` | Scene class labels (e.g. `["bathroom"]`) | +| `object_names` | `list` | Per-polygon object names (one entry per polygon, not deduped) | | `objects_present` | `list` | Deduped object names — feeds the `LABEL_LIST` index | | `num_objects` | `int32` | Number of annotated objects | -| `image_emb` | `fixed_size_list` | OpenCLIP `ViT-B-32` image embedding (cosine-normalized) | +| `image_emb` | `fixed_size_list` | OpenCLIP ViT-B/32 image embedding (cosine-normalized) | ## Pre-built indices -- `IVF_PQ` on `image_emb` — `metric=cosine` -- `BTREE` on `num_objects` -- `LABEL_LIST` on `objects_present` — supports `array_has_any` / `array_has_all` +- `IVF_PQ` on `image_emb` — vector similarity search (cosine) +- `BTREE` on `num_objects` — fast range filters on scene complexity +- `LABEL_LIST` on `objects_present` — supports `array_has_any` / `array_has_all` for class-presence filtering + +## Why Lance? + +1. **Blazing Fast Random Access**: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation. +2. 
**Native Multimodal Support**: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search. +3. **Native Index Support**: Lance comes with fast, on-disk, scalable vector and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them. +4. **Efficient Data Evolution**: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time. +5. **Versatile Querying**: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes. +6. **Data Versioning**: Every mutation commits a new version; previous versions remain intact on disk. Tags pin a snapshot by name, so retrieval systems and training runs can reproduce against an exact slice of history. -## Quick start +## Load with `datasets.load_dataset` + +You can load Lance datasets via the standard HuggingFace `datasets` interface, suitable when your pipeline already speaks `Dataset` / `IterableDataset` or you want a quick streaming sample without installing anything Lance-specific. ```python -import lance +import datasets -ds = lance.dataset("hf://datasets/lance-format/ade20k-lance/data/validation.lance") -print(ds.count_rows(), ds.schema.names, ds.list_indices()) +hf_ds = datasets.load_dataset("lance-format/ade20k-lance", split="validation", streaming=True) +for row in hf_ds.take(3): + print(row["filename"], row["scene"], row["num_objects"]) ``` ## Load with LanceDB -These tables can also be consumed by [LanceDB](https://lancedb.github.io/lancedb/), the multimodal lakehouse and embedded search library built on top of Lance, for simplified vector search and other queries. 
+LanceDB is the embedded retrieval library built on top of the Lance format ([docs](https://lancedb.com/docs)), and is the interface most users interact with. It wraps the dataset as a queryable table with search and filter builders, and is the entry point used by the Search, Curate, Evolve, Versioning, and Materialize-a-subset sections below. ```python import lancedb db = lancedb.connect("hf://datasets/lance-format/ade20k-lance/data") tbl = db.open_table("validation") -print(f"LanceDB table opened with {len(tbl)} scene images") +print(len(tbl)) ``` -## Read an image with its segmentation +## Load with Lance + +`pylance` is the Python binding for the Lance format and works directly with the format's lower-level APIs. Reach for it when you want to inspect dataset internals — schema, scanner, fragments, the list of pre-built indices. ```python -import io import lance -from PIL import Image ds = lance.dataset("hf://datasets/lance-format/ade20k-lance/data/validation.lance") -row = ds.take([0], columns=["image", "segmentation", "scene", "objects_present"]).to_pylist()[0] - -Image.open(io.BytesIO(row["image"])).save("img.jpg") -Image.open(io.BytesIO(row["segmentation"])).save("seg.png") -print("scene:", row["scene"]) -print("objects:", row["objects_present"][:10]) +print(ds.count_rows(), ds.schema.names) +print(ds.list_indices()) ``` -## Filter by scene / objects +> **Tip — for production use, download locally first.** Streaming from the Hub works for exploration, but heavy random access, ANN search, and any mutation are far faster against a local copy: +> ```bash +> hf download lance-format/ade20k-lance --repo-type dataset --local-dir ./ade20k-lance +> ``` +> Then point Lance or LanceDB at `./ade20k-lance/data`. + +## Search + +The bundled `IVF_PQ` index on `image_emb` makes approximate-nearest-neighbor scene retrieval a single call. 
In production you would encode a query image through the same OpenCLIP ViT-B/32 model used at ingest and pass the resulting 512-d vector to `tbl.search(...)`. The example below uses the embedding stored on row 42 as a runnable stand-in, so the snippet works without loading any model. ```python -import lance -ds = lance.dataset("hf://datasets/lance-format/ade20k-lance/data/validation.lance") +import lancedb + +db = lancedb.connect("hf://datasets/lance-format/ade20k-lance/data") +tbl = db.open_table("validation") + +seed = ( + tbl.search() + .select(["image_emb", "filename", "scene"]) + .limit(1) + .offset(42) + .to_list()[0] +) -# Indoor scenes containing both a bed and a window. -rows = ds.scanner( - filter="array_has_all(objects_present, ['bed', 'window'])", - columns=["filename", "scene"], - limit=10, -).to_table().to_pylist() +hits = ( + tbl.search(seed["image_emb"]) + .metric("cosine") + .select(["filename", "scene", "objects_present"]) + .limit(10) + .to_list() +) +print("query scene:", seed["scene"]) +for r in hits: + print(f" {r['filename']} scene={r['scene']} objs={r['objects_present'][:5]}") ``` -### Filter with LanceDB +Because the embeddings are cosine-normalized, the first hit will typically be the source image itself — a useful sanity check. Tune `nprobes` and `refine_factor` to trade recall against latency for your workload. + +## Curate + +Curation for a semantic-segmentation workflow usually means picking scenes that contain specific classes, possibly bounded by complexity. The `LABEL_LIST` index on `objects_present` makes class-presence predicates trivial, and Lance evaluates them inside the same scan as a structural filter on `num_objects`. The bounded `.limit(500)` keeps the result small and inspectable, and the `segmentation` blob is left out of the projection so the candidate scan is dominated by metadata, not PNG bytes. 
```python import lancedb @@ -101,33 +142,81 @@ import lancedb db = lancedb.connect("hf://datasets/lance-format/ade20k-lance/data") tbl = db.open_table("validation") -rows = ( +candidates = ( tbl.search() - .where("array_has_all(objects_present, ['bed', 'window'])") - .select(["filename", "scene"]) - .limit(10) + .where( + "array_has_all(objects_present, ['bed', 'window']) AND num_objects >= 8", + prefilter=True, + ) + .select(["id", "filename", "scene", "objects_present", "num_objects"]) + .limit(500) .to_list() ) +print(f"{len(candidates)} candidates; first scene: {candidates[0]['scene']}") ``` -## Visual similarity search +The result is a plain list of dictionaries, ready to inspect, persist as a manifest of `id`s, or feed into the Evolve and Train workflows below. Swapping `array_has_all` for `array_has_any` widens the recall; replacing the structural predicate with `num_objects BETWEEN 3 AND 6` selects simpler scenes for an ablation slice. + +## Evolve + +Lance stores each column independently, so a new column can be appended without rewriting the existing data. The lightest form is a SQL expression: derive the new column from columns that already exist, and Lance computes it once and persists it. The example below adds a `has_person` flag and a `scene_label` string pulled out of the `scene` list, either of which can then be used directly in `where` clauses without recomputing the predicate on every query. + +> **Note:** Mutations require a local copy of the dataset, since the Hub mount is read-only. See the Materialize-a-subset section at the end of this card for a streaming pattern that downloads only the rows and columns you need, or use `hf download` to pull the full corpus first. 
+ +```python +import lancedb + +db = lancedb.connect("./ade20k-lance/data") # local copy required for writes +tbl = db.open_table("validation") + +tbl.add_columns({ + "has_person": "array_has_any(objects_present, ['person'])", + "scene_label": "element_at(scene, 1)", + "complexity_bucket": "CASE WHEN num_objects < 5 THEN 'sparse' " + "WHEN num_objects < 15 THEN 'medium' ELSE 'dense' END", +}) +``` + +If the values you want to attach already live in another table (offline panoptic ids, predictions from a baseline segmenter, a second-pass embedding), merge them in by joining on `id`: ```python -import lance import pyarrow as pa -ds = lance.dataset("hf://datasets/lance-format/ade20k-lance/data/validation.lance") -emb_field = ds.schema.field("image_emb") -ref = ds.take([0], columns=["image_emb"]).to_pylist()[0]["image_emb"] -query = pa.array([ref], type=emb_field.type) - -neighbors = ds.scanner( - nearest={"column": "image_emb", "q": query[0], "k": 5}, - columns=["filename", "scene"], -).to_table().to_pylist() +predictions = pa.table({ + "id": pa.array([0, 1, 2], type=pa.int64()), + "baseline_miou": pa.array([0.41, 0.55, 0.62]), +}) +tbl.merge(predictions, left_on="id") ``` -### LanceDB visual similarity search +The original columns and indices are untouched, so existing code that does not reference the new columns continues to work unchanged. New columns become visible to every reader as soon as the operation commits. For column values that require a Python computation (e.g., re-running a segmentation model over the image bytes), Lance provides a batch-UDF API — see the [Lance data evolution docs](https://lance.org/guide/data_evolution/). + +## Train + +Projection lets a training loop read only the columns each step actually needs. LanceDB tables expose this through `Permutation.identity(tbl).select_columns([...])`, which plugs straight into the standard `torch.utils.data.DataLoader` so prefetching, shuffling, and batching behave as in any PyTorch pipeline.
For a semantic-segmentation run, project the JPEG bytes and the segmentation PNG bytes; both are decoded inside the training step. Columns added in the Evolve section above cost nothing per batch until they are explicitly projected. + +```python +import lancedb +from lancedb.permutation import Permutation +from torch.utils.data import DataLoader + +db = lancedb.connect("hf://datasets/lance-format/ade20k-lance/data") +tbl = db.open_table("train") + +train_ds = Permutation.identity(tbl).select_columns(["image", "segmentation"]) +loader = DataLoader(train_ds, batch_size=8, shuffle=True, num_workers=4) + +for batch in loader: + # batch carries only the JPEG and PNG byte columns; decode both, + # remap the ADE20K RGB-encoded mask to class ids, forward, loss... + ... +``` + +Switching feature sets is a configuration change: passing `["image_emb", "objects_present"]` to `select_columns(...)` on the next run skips JPEG and PNG decoding entirely and reads only the cached 512-d vectors plus the deduped class list, which is the right shape for training a lightweight scene classifier or a class-presence probe on top of frozen features. + +## Versioning + +Every mutation to a Lance dataset, whether it adds a column, merges labels, or builds an index, commits a new version. Previous versions remain intact on disk. You can list versions and inspect the history directly from the Hub copy; creating new tags requires a local copy since tags are writes. 
```python import lancedb @@ -135,23 +224,51 @@ import lancedb db = lancedb.connect("hf://datasets/lance-format/ade20k-lance/data") tbl = db.open_table("validation") -ref = tbl.search().limit(1).select(["image_emb"]).to_list()[0] -query_embedding = ref["image_emb"] +print("Current version:", tbl.version) +print("History:", tbl.list_versions()) +print("Tags:", tbl.tags.list()) +``` -results = ( - tbl.search(query_embedding) - .metric("cosine") - .select(["filename", "scene"]) - .limit(5) - .to_list() -) +Once you have a local copy, tag a version for reproducibility: + +```python +local_db = lancedb.connect("./ade20k-lance/data") +local_tbl = local_db.open_table("validation") +local_tbl.tags.create("segmenter-baseline-v1", local_tbl.version) ``` -## Why Lance? +A tagged version can be opened by name, or any version reopened by its number, against either the Hub copy or a local one: + +```python +tbl_v1 = db.open_table("validation", version="segmenter-baseline-v1") +tbl_v5 = db.open_table("validation", version=5) +``` + +Pinning supports two workflows. A serving pipeline locked to `segmenter-baseline-v1` keeps reading the exact same segmentation maps and class lists while the dataset evolves in parallel; newly merged predictions or evolved columns do not change what the tag resolves to. A training experiment pinned to the same tag can be rerun later against the exact same images, so changes in mIoU reflect model changes rather than data drift. Neither workflow needs shadow copies or external manifest tracking. + +## Materialize a subset + +Reads from the Hub are lazy, so exploratory queries only transfer the columns and row groups they touch. Mutating operations (Evolve, tag creation) need a writable backing store, and a training loop benefits from a local copy with fast random access. Both can be served by a subset of the dataset rather than the full split. 
The pattern is to stream a filtered query through `.to_batches()` into a new local table; only the projected columns and matching row groups cross the wire, and the bytes never fully materialize in Python memory. + +```python +import lancedb + +remote_db = lancedb.connect("hf://datasets/lance-format/ade20k-lance/data") +remote_tbl = remote_db.open_table("train") + +batches = ( + remote_tbl.search() + .where("array_has_any(objects_present, ['bed', 'sofa', 'chair']) AND num_objects >= 5") + .select(["id", "image", "segmentation", "filename", "scene", + "objects_present", "num_objects", "image_emb"]) + .to_batches() +) + +local_db = lancedb.connect("./ade20k-indoor-subset") +local_db.create_table("train", batches) +``` -- One dataset for images + segmentation + instance + scene + objects + embeddings + indices — no folder of paired files. -- On-disk vector and label-list indices live next to the data, so search works on local copies and on the Hub. -- Schema evolution: add columns (panoptic ids, fresh embeddings, model predictions) without rewriting the data. +The resulting `./ade20k-indoor-subset` is a first-class LanceDB database. Every snippet in the Evolve, Train, and Versioning sections above works against it by swapping `hf://datasets/lance-format/ade20k-lance/data` for `./ade20k-indoor-subset`. ## Source & license diff --git a/docs/datasets/chartqa.mdx b/docs/datasets/chartqa.mdx index a7b7b3d..bc4bf0c 100644 --- a/docs/datasets/chartqa.mdx +++ b/docs/datasets/chartqa.mdx @@ -1,7 +1,7 @@ --- title: "ChartQA" sidebarTitle: "ChartQA" -description: "Lance-formatted version of ChartQA — VQA over scientific and business charts that combine logical and visual reasoning — sourced from lmms-lab/ChartQA." +description: "A Lance-formatted version of ChartQA, a benchmark for question answering over scientific and business charts that demands a mix of logical and visual reasoning, redistributed via lmms-lab/ChartQA. 
Each row carries the chart image as inline JPEG…" --- -Lance-formatted version of [ChartQA](https://github.com/vis-nlp/ChartQA) — VQA over scientific and business charts that combine logical and visual reasoning — sourced from [`lmms-lab/ChartQA`](https://huggingface.co/datasets/lmms-lab/ChartQA). +A Lance-formatted version of [ChartQA](https://github.com/vis-nlp/ChartQA), a benchmark for question answering over scientific and business charts that demands a mix of logical and visual reasoning, redistributed via [`lmms-lab/ChartQA`](https://huggingface.co/datasets/lmms-lab/ChartQA). Each row carries the chart image as inline JPEG bytes, the natural-language question and reference answer(s), a question-type tag (`human` vs `augmented`), and paired CLIP embeddings for the image and the question — all available directly from the Hub at `hf://datasets/lance-format/chartqa-lance/data`. + +## Key features + +- **Inline chart image bytes** in the `image` column — no sidecar files, no image folders. +- **Paired CLIP embeddings in the same row** — `image_emb` and `question_emb` (ViT-B/32, 512-dim, cosine-normalized) — so visual and textual retrieval are one indexed lookup. +- **All reference answers preserved in `answers`** alongside a canonical `answer` string used for full-text search. +- **Pre-built ANN, FTS, and scalar indices** covering both embedding columns, the question and answer strings, and the `type` tag. ## Splits -| Split | Rows | -|-------|------| -| `test.lance` | 2,500 | +| Split | Rows | Notes | +|-------|------|-------| +| `test.lance` | 2,500 | Public test slice from `lmms-lab/ChartQA` | -> The `lmms-lab/ChartQA` redistribution exposes test only. Train and validation live in the original release (https://github.com/vis-nlp/ChartQA); add them via `chartqa/dataprep.py --splits` once a parquet mirror is identified. +> The `lmms-lab/ChartQA` redistribution exposes the test split only. 
Train and validation live in the original ChartQA release; extend `chartqa/dataprep.py` with additional sources to add them. ## Schema | Column | Type | Notes | |---|---|---| -| `id` | `int64` | Row index | -| `image` | `large_binary` | Inline chart image bytes | -| `image_id` / `question_id` | `string?` | (Source does not assign explicit ids — null for now) | +| `id` | `int64` | Row index within split (natural join key) | +| `image` | `large_binary` | Inline JPEG bytes | +| `image_id` | `string?` | Source does not assign explicit ids — null | +| `question_id` | `string?` | Source does not assign explicit ids — null | | `question` | `string` | Natural-language question | -| `answers` | `list` | Reference answer (typically a single string) | -| `answer` | `string` | First answer — used as canonical | +| `answers` | `list` | Reference answer(s), typically a single string | +| `answer` | `string` | First reference answer — canonical, used for FTS | | `type` | `string?` | Question type (`human` vs `augmented`) | | `image_emb` | `fixed_size_list` | CLIP image embedding (cosine-normalized) | | `question_emb` | `fixed_size_list` | CLIP text embedding of the question | ## Pre-built indices -- `IVF_PQ` on `image_emb` and `question_emb` — `metric=cosine` -- `INVERTED` (FTS) on `question` and `answer` -- `BITMAP` on `type` +- `IVF_PQ` on `image_emb` — image-side vector search (cosine) +- `IVF_PQ` on `question_emb` — text-side vector search (cosine) +- `INVERTED` (FTS) on `question` and `answer` — keyword and hybrid search +- `BITMAP` on `type` — fast filtering by question type + +## Why Lance? -## Quick start +1. **Blazing Fast Random Access**: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation. +2. **Native Multimodal Support**: Store text, embeddings, and other data types together in a single file. 
Large binary objects are loaded lazily, and vectors are optimized for fast similarity search. +3. **Native Index Support**: Lance comes with fast, on-disk, scalable vector and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them. +4. **Efficient Data Evolution**: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time. +5. **Versatile Querying**: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes. +6. **Data Versioning**: Every mutation commits a new version; previous versions remain intact on disk. Tags pin a snapshot by name, so retrieval systems and training runs can reproduce against an exact slice of history. + +## Load with `datasets.load_dataset` + +You can load Lance datasets via the standard HuggingFace `datasets` interface, suitable when your pipeline already speaks `Dataset` / `IterableDataset` or you want a quick streaming sample. ```python -import lance -ds = lance.dataset("hf://datasets/lance-format/chartqa-lance/data/test.lance") -print(ds.count_rows(), ds.schema.names, ds.list_indices()) +import datasets + +hf_ds = datasets.load_dataset("lance-format/chartqa-lance", split="test", streaming=True) +for row in hf_ds.take(3): + print(row["question"], "->", row["answer"]) ``` ## Load with LanceDB -These tables can also be consumed by [LanceDB](https://lancedb.github.io/lancedb/), the multimodal lakehouse and embedded search library built on top of Lance, for simplified vector search and other queries. +LanceDB is the embedded retrieval library built on top of the Lance format ([docs](https://lancedb.com/docs)), and is the interface most users interact with. 
It wraps the dataset as a queryable table with search and filter builders, and is the entry point used by the Search, Curate, Evolve, Train, Versioning, and Materialize-a-subset sections below. ```python import lancedb db = lancedb.connect("hf://datasets/lance-format/chartqa-lance/data") tbl = db.open_table("test") -print(f"LanceDB table opened with {len(tbl)} chart-question pairs") +print(len(tbl)) ``` -### LanceDB vector search +## Load with Lance + +`pylance` is the Python binding for the Lance format and works directly with the format's lower-level APIs. Reach for it when you want to inspect dataset internals — schema, scanner, fragments, and the list of pre-built indices. + +```python +import lance + +ds = lance.dataset("hf://datasets/lance-format/chartqa-lance/data/test.lance") +print(ds.count_rows(), ds.schema.names) +print(ds.list_indices()) +``` + +> **Tip — for production use, download locally first.** Streaming from the Hub works for exploration, but heavy random access and ANN search are far faster against a local copy: +> ```bash +> hf download lance-format/chartqa-lance --repo-type dataset --local-dir ./chartqa-lance +> ``` +> Then point Lance or LanceDB at `./chartqa-lance/data`. + +## Search + +The bundled `IVF_PQ` index on `question_emb` makes question-to-question retrieval a single call: encode a query with the same CLIP model used at ingest (ViT-B/32, cosine-normalized) and pass the resulting 512-d vector to `tbl.search(...)`. The example below uses the `question_emb` already stored in row 42 as a runnable stand-in, so the snippet works without any model loaded. 
```python import lancedb @@ -70,19 +112,48 @@ import lancedb db = lancedb.connect("hf://datasets/lance-format/chartqa-lance/data") tbl = db.open_table("test") -ref = tbl.search().limit(1).select(["question_emb", "question"]).to_list()[0] -query_embedding = ref["question_emb"] +seed = ( + tbl.search() + .select(["question_emb", "question"]) + .limit(1) + .offset(42) + .to_list()[0] +) -results = ( - tbl.search(query_embedding, vector_column_name="question_emb") +hits = ( + tbl.search(seed["question_emb"], vector_column_name="question_emb") .metric("cosine") - .select(["question", "answer"]) - .limit(5) + .select(["question", "answer", "type"]) + .limit(10) .to_list() ) +print("query:", seed["question"]) +for r in hits: + print(f" [{r['type']}] {r['question'][:70]} -> {r['answer']}") ``` -### LanceDB full-text search +Swap `vector_column_name="question_emb"` for `image_emb` to do question-to-chart retrieval against the visual embedding instead — useful for finding charts whose layout is similar to a given prompt encoding. + +Because the dataset also ships an `INVERTED` index on `question` and `answer`, the same query can be issued as a hybrid search that combines the dense vector with a keyword query. LanceDB merges the two result lists and reranks them in a single call, which is useful when a phrase like "percentage" or "highest bar" must literally appear in the question but you still want CLIP to do the heavy lifting on semantic similarity. + +```python +hybrid_hits = ( + tbl.search(query_type="hybrid", vector_column_name="question_emb") + .vector(seed["question_emb"]) + .text("percentage") + .select(["question", "answer", "type"]) + .limit(10) + .to_list() +) +for r in hybrid_hits: + print(f" [{r['type']}] {r['question'][:70]} -> {r['answer']}") +``` + +Tune `metric`, `nprobes`, and `refine_factor` on the vector side to trade recall against latency. 
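The hybrid search described above fuses a dense-vector ranking with a keyword ranking. As a rough illustration of what such a merge can look like, the sketch below implements reciprocal rank fusion (RRF) in plain Python. It is not LanceDB's implementation, and the reranker LanceDB actually applies depends on its version and configuration. The helper name `rrf_fuse`, the constant `k=60`, and the sample row ids are all assumptions made for this example.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked id lists with reciprocal rank fusion.

    Hypothetical helper for illustration only; not part of LanceDB.
    Each id earns 1 / (k + rank) from every list it appears in.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, row_id in enumerate(ranking, start=1):
            scores[row_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so the order is deterministic.
    return sorted(scores, key=lambda row_id: (-scores[row_id], row_id))


# Made-up row ids: one list ranked by embedding similarity, one by keyword match.
vector_hits = [42, 7, 13, 99]
keyword_hits = [13, 42, 58]

fused = rrf_fuse([vector_hits, keyword_hits])
print(fused)
```

Ids that appear in both lists (42 and 13 here) rise above ids found by only one retriever, which is the behaviour a hybrid query relies on when a keyword must literally match but semantic similarity should still drive the ordering.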
+ +## Curate + +A typical curation pass combines a content predicate on the question text with a structural predicate on the question-type tag. Stacking both inside a single filtered scan keeps the result small and explicit, and the bounded `.limit(500)` makes it cheap to inspect before committing the subset to anything downstream. The example below collects human-authored questions that mention a percentage, which is a common slice for evaluating numeric-reasoning behaviour. ```python import lancedb @@ -90,14 +161,130 @@ import lancedb db = lancedb.connect("hf://datasets/lance-format/chartqa-lance/data") tbl = db.open_table("test") -results = ( - tbl.search("percentage") - .select(["question", "answer"]) - .limit(10) +candidates = ( + tbl.search("percentage OR percent") + .where("type = 'human'", prefilter=True) + .select(["id", "question", "answer", "type"]) + .limit(500) .to_list() ) +print(f"{len(candidates)} candidates; first: {candidates[0]['question'][:80]}") +``` + +The result is a plain list of dictionaries, ready to inspect, persist as a manifest of row ids, or feed into the Evolve and Train workflows below. The `image` column is never read, so the network traffic for a 500-row candidate scan is dominated by question and answer text rather than chart JPEGs. + +## Evolve + +Lance stores each column independently, so a new column can be appended without rewriting the existing data. The lightest form is a SQL expression: derive the new column from columns that already exist, and Lance computes it once and persists it. The example below adds `answer_length`, an `is_yes_no` flag, and an `is_numeric` flag, any of which can then be used directly in `where` clauses without recomputing the predicate on every query. + +> **Note:** Mutations require a local copy of the dataset, since the Hub mount is read-only. 
See the Materialize-a-subset section at the end of this card for a streaming pattern that downloads only the rows and columns you need, or use `hf download` to pull the full split first. + +```python +import lancedb + +db = lancedb.connect("./chartqa-lance/data") # local copy required for writes +tbl = db.open_table("test") + +tbl.add_columns({ + "answer_length": "length(answer)", + "is_yes_no": "lower(answer) IN ('yes', 'no')", + "is_numeric": "regexp_match(answer, '^-?[0-9]+(\\.[0-9]+)?%?$') IS NOT NULL", +}) +``` + +If the values you want to attach already live in another table (model predictions on the test set, reasoning-chain annotations, a difficulty score), merge them in by joining on the `id` column: + +```python +import pyarrow as pa + +predictions = pa.table({ + "id": pa.array([0, 1, 2]), + "pred_answer": pa.array(["12%", "Yes", "34"]), + "is_correct": pa.array([True, True, False]), +}) +tbl.merge(predictions, left_on="id") +``` + +The original columns and indices are untouched, so existing code that does not reference the new columns continues to work unchanged. New columns become visible to every reader as soon as the operation commits. For column values that require a Python computation (e.g., running a chart-OCR model over the image bytes), Lance provides a batch-UDF API — see the [Lance data evolution docs](https://lance.org/guide/data_evolution/). + +## Train + +Projection lets a training loop read only the columns each step actually needs. LanceDB tables expose this through `Permutation.identity(tbl).select_columns([...])`, which plugs straight into the standard `torch.utils.data.DataLoader` so prefetching, shuffling, and batching behave as in any PyTorch pipeline. For fine-tuning a VLM on chart QA, project the chart bytes plus the question and answer; columns added in the Evolve section above cost nothing per batch until they are explicitly projected.
+ +```python +import lancedb +from lancedb.permutation import Permutation +from torch.utils.data import DataLoader + +db = lancedb.connect("hf://datasets/lance-format/chartqa-lance/data") +tbl = db.open_table("test") + +train_ds = Permutation.identity(tbl).select_columns(["image", "question", "answer"]) +loader = DataLoader(train_ds, batch_size=32, shuffle=True, num_workers=4) + +for batch in loader: + # batch carries only the projected columns; decode the JPEG bytes, + # tokenize the question/answer pair, forward, backward... + ... +``` + +Switching feature sets is a configuration change: passing `["image_emb", "question_emb", "answer"]` to `select_columns(...)` on the next run skips JPEG decoding entirely and reads only the cached 512-d vectors, which is the right shape for training a lightweight answer-classifier or a linear probe on top of frozen features. + +## Versioning + +Every mutation to a Lance dataset, whether it adds a column, merges predictions, or builds an index, commits a new version. Previous versions remain intact on disk. You can list versions and inspect the history directly from the Hub copy; creating new tags requires a local copy since tags are writes. + +```python +import lancedb + +db = lancedb.connect("hf://datasets/lance-format/chartqa-lance/data") +tbl = db.open_table("test") + +print("Current version:", tbl.version) +print("History:", tbl.list_versions()) +print("Tags:", tbl.tags.list()) +``` + +Once you have a local copy, tag a version for reproducibility: + +```python +local_db = lancedb.connect("./chartqa-lance/data") +local_tbl = local_db.open_table("test") +local_tbl.tags.create("eval-v1", local_tbl.version) +``` + +A tagged version can be opened by name, or any version reopened by its number, against either the Hub copy or a local one: + +```python +tbl_v1 = db.open_table("test", version="eval-v1") +tbl_v5 = db.open_table("test", version=5) +``` + +Pinning supports two workflows. 
An evaluation harness locked to `eval-v1` keeps producing comparable scores while the dataset evolves in parallel — newly added prediction columns or labels do not change what the tag resolves to. A training experiment pinned to the same tag can be rerun later against the exact same charts and questions, so changes in metrics reflect model changes rather than data drift. Neither workflow needs shadow copies or external manifest tracking. + +## Materialize a subset + +Reads from the Hub are lazy, so exploratory queries only transfer the columns and row groups they touch. Mutating operations (Evolve, tag creation) need a writable backing store, and a training loop benefits from a local copy with fast random access. Both can be served by a subset of the dataset rather than the full split. The pattern is to stream a filtered query through `.to_batches()` into a new local table; only the projected columns and matching row groups cross the wire, and the bytes never fully materialize in Python memory. + +```python +import lancedb + +remote_db = lancedb.connect("hf://datasets/lance-format/chartqa-lance/data") +remote_tbl = remote_db.open_table("test") + +batches = ( + remote_tbl.search("percentage OR percent") + .where("type = 'human'") + .select(["id", "image", "question", "answer", "type", "image_emb", "question_emb"]) + .to_batches() +) + +local_db = lancedb.connect("./chartqa-human-subset") +local_db.create_table("test", batches) ``` +The resulting `./chartqa-human-subset` is a first-class LanceDB database. Every snippet in the Evolve, Train, and Versioning sections above works against it by swapping `hf://datasets/lance-format/chartqa-lance/data` for `./chartqa-human-subset`. + ## Source & license Converted from [`lmms-lab/ChartQA`](https://huggingface.co/datasets/lmms-lab/ChartQA). The original ChartQA dataset is released under the GNU GPL-3.0 license by Masry et al. 
diff --git a/docs/datasets/cifar10.mdx b/docs/datasets/cifar10.mdx index 60e4313..b81e31d 100644 --- a/docs/datasets/cifar10.mdx +++ b/docs/datasets/cifar10.mdx @@ -1,7 +1,7 @@ --- title: "CIFAR-10" sidebarTitle: "CIFAR-10" -description: "A Lance-formatted version of CIFAR-10 with 60,000 32×32 RGB images across 10 classes, stored inline with CLIP embeddings and a pre-built IVF_PQ ANN index." +description: "A Lance-formatted version of CIFAR-10 covering 60,000 32×32 RGB images across ten balanced object classes. Each row carries inline PNG bytes, the integer label, the human-readable class name, and a cosine-normalized CLIP image embedding, all backed…" --- -A Lance-formatted version of [CIFAR-10](https://huggingface.co/datasets/uoft-cs/cifar10) with **60,000 32×32 RGB images** across 10 classes, stored inline with CLIP embeddings and a pre-built `IVF_PQ` ANN index. +A Lance-formatted version of [CIFAR-10](https://huggingface.co/datasets/uoft-cs/cifar10) covering 60,000 32×32 RGB images across ten balanced object classes. Each row carries inline PNG bytes, the integer label, the human-readable class name, and a cosine-normalized CLIP image embedding, all backed by a bundled `IVF_PQ` vector index plus scalar indices on the label columns and available directly from the Hub at `hf://datasets/lance-format/cifar10-lance/data`. ## Key features -- All multimodal data (image bytes + embeddings) stored **inline** in the same Lance dataset. -- **Pre-computed CLIP embeddings** (OpenCLIP `ViT-B-32` / `laion2b_s34b_b79k`, 512-dim, L2-normalized) with an `IVF_PQ` index. -- **BTREE on `label`** and **BITMAP on `label_name`** for fast filtered scans. +- **Inline PNG bytes** in the `image` column — no sidecar files, no image folders. +- **Pre-computed CLIP image embeddings** (OpenCLIP `ViT-B-32` / `laion2b_s34b_b79k`, 512-dim, cosine-normalized) with a bundled `IVF_PQ` index. 
+- **Scalar indices on both label columns** — `BTREE` on `label` and `BITMAP` on `label_name` — so class filters and class-conditioned search resolve through an index instead of a full scan. +- **One columnar dataset** — scan labels cheaply, then fetch image bytes only for the rows you want. ## Splits | Split | Rows | |-------|------| -| `train` | 50,000 | -| `test` | 10,000 | +| `train.lance` | 50,000 | +| `test.lance` | 10,000 | ## Schema | Column | Type | Notes | |---|---|---| -| `id` | `int64` | Row index within the split | +| `id` | `int64` | Row index within the split (natural join key for merges) | | `image` | `large_binary` | Inline PNG bytes (32×32 RGB) | -| `label` | `int32` | Class id (0-9) | +| `label` | `int32` | Class id (0–9) | | `label_name` | `string` | One of `airplane`, `automobile`, `bird`, `cat`, `deer`, `dog`, `frog`, `horse`, `ship`, `truck` | | `image_emb` | `fixed_size_list` | CLIP image embedding (cosine-normalized) | ## Pre-built indices -- `IVF_PQ` on `image_emb` — `metric=cosine` -- `BTREE` on `label` -- `BITMAP` on `label_name` +- `IVF_PQ` on `image_emb` — vector similarity search (cosine) +- `BTREE` on `label` — fast equality and range filters on the class id +- `BITMAP` on `label_name` — fast filters across the ten class names + +## Why Lance? + +1. **Blazing Fast Random Access**: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation. +2. **Native Multimodal Support**: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search. +3. **Native Index Support**: Lance comes with fast, on-disk, scalable vector and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them. +4.
**Efficient Data Evolution**: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time. +5. **Versatile Querying**: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes. +6. **Data Versioning**: Every mutation commits a new version; previous versions remain intact on disk. Tags pin a snapshot by name, so retrieval systems and training runs can reproduce against an exact slice of history. ## Load with `datasets.load_dataset` +You can load Lance datasets via the standard HuggingFace `datasets` interface, suitable if your pipeline already speaks `Dataset` / `IterableDataset` or you want a quick streaming sample without installing anything Lance-specific. + ```python import datasets hf_ds = datasets.load_dataset("lance-format/cifar10-lance", split="train", streaming=True) for row in hf_ds.take(3): - print(row["label_name"]) -``` - -## Load directly with Lance (recommended) - -```python -import lance - -ds = lance.dataset("hf://datasets/lance-format/cifar10-lance/data/train.lance") -print(ds.count_rows(), ds.schema.names, ds.list_indices()) + print(row["label"], row["label_name"]) ``` ## Load with LanceDB +LanceDB is the embedded retrieval library built on top of the Lance format ([docs](https://lancedb.com/docs)), and is the interface most users interact with. It wraps the dataset as a queryable table with search and filter builders, and is the entry point used by the Search, Curate, Evolve, Train, Versioning, and Materialize-a-subset sections below. 
+ ```python import lancedb @@ -72,30 +77,27 @@ tbl = db.open_table("train") print(len(tbl)) ``` -> **Tip — for production use, download locally first.** -> ```bash -> hf download lance-format/cifar10-lance --repo-type dataset --local-dir ./cifar10-lance -> ``` +## Load with Lance -## Vector search example +`pylance` is the Python binding for the Lance format and works directly with the format's lower-level APIs. Reach for it when you want to inspect or operate on dataset internals — schema, scanner, fragments, and the list of pre-built indices. ```python import lance -import pyarrow as pa ds = lance.dataset("hf://datasets/lance-format/cifar10-lance/data/train.lance") -emb_field = ds.schema.field("image_emb") -ref = ds.take([0], columns=["image_emb"]).to_pylist()[0]["image_emb"] -query = pa.array([ref], type=emb_field.type) - -neighbors = ds.scanner( - nearest={"column": "image_emb", "q": query[0], "k": 5, "nprobes": 16, "refine_factor": 30}, - columns=["id", "label_name"], -).to_table().to_pylist() -print(neighbors) +print(ds.count_rows(), ds.schema.names) +print(ds.list_indices()) ``` -### LanceDB vector search +> **Tip — for production use, download locally first.** Streaming from the Hub works for exploration, but heavy random access and ANN search are far faster against a local copy: +> ```bash +> hf download lance-format/cifar10-lance --repo-type dataset --local-dir ./cifar10-lance +> ``` +> Then point Lance or LanceDB at `./cifar10-lance/data`. + +## Search + +The bundled `IVF_PQ` index on `image_emb` turns nearest-neighbor lookup on the 512-d CLIP space into a single call. In production you would encode a query image (or, for cross-modal text→image lookup, a tokenized prompt) through OpenCLIP `ViT-B-32` at runtime and pass the resulting vector to `tbl.search(...)`. The example below uses the embedding already stored in row 42 as a runnable stand-in so the snippet works without any model loaded. 
```python import lancedb @@ -103,60 +105,160 @@ import lancedb db = lancedb.connect("hf://datasets/lance-format/cifar10-lance/data") tbl = db.open_table("train") -ref = tbl.search().limit(1).select(["image_emb"]).to_list()[0] -query_embedding = ref["image_emb"] +seed = ( + tbl.search() + .select(["image_emb", "label_name"]) + .limit(1) + .offset(42) + .to_list()[0] +) -results = ( - tbl.search(query_embedding) +hits = ( + tbl.search(seed["image_emb"]) .metric("cosine") - .select(["id", "label_name"]) - .limit(5) + .select(["id", "label", "label_name"]) + .limit(10) .to_list() ) -for row in results: - print(row["id"], row["label_name"]) +print("query class:", seed["label_name"]) +for r in hits: + print(f" id={r['id']:>5} {r['label_name']}") ``` -## Filter by class +Because CIFAR-10 has only ten classes and the embeddings are cosine-normalized, near-neighbors of a seed image cluster tightly inside the seed's own class. Tune `metric`, `nprobes`, and `refine_factor` to trade recall against latency. -```python -import lance -ds = lance.dataset("hf://datasets/lance-format/cifar10-lance/data/train.lance") -ships = ds.scanner(filter="label_name = 'ship'", columns=["id"], limit=5).to_table() -``` +## Curate -### Filter by class with LanceDB +A typical curation pass for a classification workflow narrows the table to a single class (or a small set of confusable classes) before sampling. Because both label columns are indexed, the filter resolves without scanning the embedding or image bytes; the bounded `.limit(500)` keeps the output small enough to inspect or hand off as a manifest of row ids. 
```python import lancedb db = lancedb.connect("hf://datasets/lance-format/cifar10-lance/data") tbl = db.open_table("train") -ships = ( + +candidates = ( tbl.search() - .where("label_name = 'ship'") - .select(["id"]) - .limit(5) + .where("label_name IN ('cat', 'dog')", prefilter=True) + .select(["id", "label", "label_name"]) + .limit(500) .to_list() ) +print(f"{len(candidates)} cat/dog candidates") ``` -## Working with images +The result is a plain list of dictionaries, ready to inspect, persist as a manifest of `id`s, or feed into the Evolve and Train workflows below. The `image` and `image_emb` columns are never read, so the network traffic for a 500-row candidate scan is dominated by the tiny label payload. + +## Evolve + +Lance stores each column independently, so a new column can be appended without rewriting the existing data. The lightest form is a SQL expression: derive the new column from columns that already exist, and Lance computes it once and persists it. The example below adds an `is_animal` flag and an `is_target_class` flag, either of which can then be used directly in `where` clauses without recomputing the predicate on every query. + +> **Note:** Mutations require a local copy of the dataset, since the Hub mount is read-only. See the Materialize-a-subset section at the end of this card for a streaming pattern that downloads only the rows and columns you need, or use `hf download` to pull the full corpus first. ```python -from pathlib import Path -import lance +import lancedb -ds = lance.dataset("hf://datasets/lance-format/cifar10-lance/data/train.lance") -row = ds.take([0], columns=["image", "label_name"]).to_pylist()[0] -Path(f"sample_{row['label_name']}.png").write_bytes(row["image"]) +db = lancedb.connect("./cifar10-lance/data") # local copy required for writes +tbl = db.open_table("train") + +tbl.add_columns({ + "is_animal": "label_name IN ('bird', 'cat', 'deer', 'dog', 'frog', 'horse')", + "is_target_class": "label = 3", +}) ``` -## Why Lance? 
+If the values you want to attach already live in another table (offline labels from a stronger model, classifier predictions, per-row confidence scores), merge them in by joining on the `id` column: + +```python +import pyarrow as pa + +predictions = pa.table({ + "id": pa.array([0, 1, 2], type=pa.int64()), + "pred_label": pa.array([3, 8, 0], type=pa.int32()), + "pred_conf": pa.array([0.91, 0.74, 0.99]), +}) +tbl.merge(predictions, on="id") +``` + +The original columns and indices are untouched, so existing code that does not reference the new columns continues to work unchanged. New columns become visible to every reader as soon as the operation commits. For column values that require a Python computation (e.g., running a second image encoder over the inline PNG bytes), Lance provides a batch-UDF API in the underlying library — see the [Lance data evolution docs](https://lance.org/guide/data_evolution/) for that pattern. + +## Train + +Projection lets a training loop read only the columns each step actually needs. LanceDB tables expose this through `Permutation.identity(tbl).select_columns([...])`, which plugs straight into the standard `torch.utils.data.DataLoader` so prefetching, shuffling, and batching behave as in any PyTorch pipeline. Columns added in the Evolve section above cost nothing per batch until they are explicitly projected. + +```python +import lancedb +from lancedb.permutation import Permutation +from torch.utils.data import DataLoader + +db = lancedb.connect("hf://datasets/lance-format/cifar10-lance/data") +tbl = db.open_table("train") + +train_ds = Permutation.identity(tbl).select_columns(["image", "label"]) +loader = DataLoader(train_ds, batch_size=256, shuffle=True, num_workers=4) + +for batch in loader: + # batch carries only the projected columns; image_emb stays on disk. + # decode the PNG bytes, apply augmentations, forward, backward... + ...
+``` + +Switching feature sets is a configuration change: passing `["image_emb", "label"]` to `select_columns(...)` on the next run skips PNG decoding entirely and reads only the cached 512-d vectors, which is the right shape for training a linear probe or a lightweight reranker on top of frozen CLIP features. + +## Versioning + +Every mutation to a Lance dataset, whether it adds a column, merges labels, or builds an index, commits a new version. Previous versions remain intact on disk. You can list versions and inspect the history directly from the Hub copy; creating new tags requires a local copy since tags are writes. + +```python +import lancedb + +db = lancedb.connect("hf://datasets/lance-format/cifar10-lance/data") +tbl = db.open_table("train") + +print("Current version:", tbl.version) +print("History:", tbl.list_versions()) +print("Tags:", tbl.tags.list()) +``` + +Once you have a local copy, tag a version for reproducibility: + +```python +local_db = lancedb.connect("./cifar10-lance/data") +local_tbl = local_db.open_table("train") +local_tbl.tags.create("clip-vitb32-v1", local_tbl.version) +``` + +A tagged version can be opened by name, or any version reopened by its number, against either the Hub copy or a local one: + +```python +tbl_v1 = db.open_table("train", version="clip-vitb32-v1") +tbl_v5 = db.open_table("train", version=5) +``` + +Pinning supports two workflows. A retrieval system locked to `clip-vitb32-v1` keeps returning stable results while the dataset evolves in parallel; newly added prediction columns or relabelings do not change what the tag resolves to. A training experiment pinned to the same tag can be rerun later against the exact same images and labels, so changes in metrics reflect model changes rather than data drift. Neither workflow needs shadow copies or external manifest tracking. + +## Materialize a subset + +Reads from the Hub are lazy, so exploratory queries only transfer the columns and row groups they touch. 
Mutating operations (Evolve, tag creation) need a writable backing store, and a training loop benefits from a local copy with fast random access. Both can be served by a subset of the dataset rather than the full split. The pattern is to stream a filtered query through `.to_batches()` into a new local table; only the projected columns and matching row groups cross the wire, and the bytes never fully materialize in Python memory. + +```python +import lancedb + +remote_db = lancedb.connect("hf://datasets/lance-format/cifar10-lance/data") +remote_tbl = remote_db.open_table("train") + +batches = ( + remote_tbl.search() + .where("label_name IN ('cat', 'dog')") + .select(["id", "image", "label", "label_name", "image_emb"]) + .to_batches() +) + +local_db = lancedb.connect("./cifar10-cats-dogs") +local_db.create_table("train", batches) +``` -- One dataset for images + embeddings + indices + metadata — no sidecar files. -- On-disk vector and FTS indices live next to the data, so search works on both local copies and the Hub. -- Schema evolution: add new columns (model predictions, fresh embeddings, augmentations) without rewriting the data. +The resulting `./cifar10-cats-dogs` is a first-class LanceDB database. Every snippet in the Evolve, Train, and Versioning sections above works against it by swapping `hf://datasets/lance-format/cifar10-lance/data` for `./cifar10-cats-dogs`. ## Source & license diff --git a/docs/datasets/coco-captions-2017.mdx b/docs/datasets/coco-captions-2017.mdx index ad629d5..d177f2b 100644 --- a/docs/datasets/coco-captions-2017.mdx +++ b/docs/datasets/coco-captions-2017.mdx @@ -1,7 +1,7 @@ --- title: "COCO Captions 2017" sidebarTitle: "COCO Captions 2017" -description: "Lance-formatted version of the COCO Captions 2017 corpus, redistributed via lmms-lab/COCO-Caption2017. Each row is one image with 5–7 human-written captions, CLIP image embedding, and CLIP text embedding of the canonical caption — all stored inline." 
+description: "A Lance-formatted version of the COCO Captions 2017 corpus, redistributed via lmms-lab/COCO-Caption2017. Each row is one image with 5–7 human-written captions, a cosine-normalized CLIP image embedding, and a cosine-normalized CLIP text embedding of…" --- -Lance-formatted version of the [COCO Captions 2017](https://cocodataset.org/) corpus, redistributed via [`lmms-lab/COCO-Caption2017`](https://huggingface.co/datasets/lmms-lab/COCO-Caption2017). Each row is one image with **5–7 human-written captions**, CLIP image embedding, and CLIP text embedding of the canonical caption — all stored inline. +A Lance-formatted version of the [COCO Captions 2017](https://cocodataset.org/) corpus, redistributed via [`lmms-lab/COCO-Caption2017`](https://huggingface.co/datasets/lmms-lab/COCO-Caption2017). Each row is one image with **5–7 human-written captions**, a cosine-normalized CLIP image embedding, and a cosine-normalized CLIP text embedding of the canonical caption — all stored inline and available directly from the Hub at `hf://datasets/lance-format/coco-captions-2017-lance/data`. + +## Key features + +- **Inline JPEG bytes** in the `image` column — no sidecar files, no image folders. +- **Paired CLIP embeddings in the same row** — `image_emb` and `text_emb` (ViT-B/32, 512-dim, cosine-normalized) — so cross-modal retrieval is one indexed lookup. +- **All 5–7 raw captions kept in `captions`** alongside a `caption` canonical string used for full-text search. +- **Pre-built ANN, FTS, and scalar indices** covering both embedding columns, the canonical caption, and `image_id`. 
## Splits -| Split | Rows | -|-------|------| -| `val.lance` | 5,000 (canonical COCO 2017 val set) | -| `test.lance` | 40,700 | +| Split | Rows | Notes | +|-------|------|-------| +| `val.lance` | 5,000 | Canonical COCO 2017 val set | +| `test.lance` | 40,700 | Public test slice from `lmms-lab/COCO-Caption2017` | -> The 2017 train split (118 k images, ~18 GB of source JPEGs) is intentionally -> not bundled here because the `lmms-lab/COCO-Caption2017` redistribution does -> not include it. To extend with train, run `coco_captions_2017/dataprep.py` -> against your local COCO 2017 train mirror. +> The 2017 train split (118 k images, ~18 GB of source JPEGs) is intentionally not bundled here because the `lmms-lab/COCO-Caption2017` redistribution does not include it. To extend with train, run `coco_captions_2017/dataprep.py` against your local COCO 2017 train mirror. ## Schema | Column | Type | Notes | |---|---|---| -| `id` | `int64` | Row index within split | +| `id` | `int64` | Row index within split (natural join key) | | `image` | `large_binary` | Inline JPEG bytes | | `image_id` | `string` | COCO image id | | `filename` | `string` | Original filename (e.g. `000000179765.jpg`) | -| `captions` | `list` | All 5–7 captions | -| `caption` | `string` | First caption — used as canonical text for FTS | +| `captions` | `list` | All 5–7 captions for the image | +| `caption` | `string` | First caption — canonical text used for FTS | | `image_emb` | `fixed_size_list` | CLIP image embedding (cosine-normalized) | | `text_emb` | `fixed_size_list` | CLIP text embedding of the canonical caption | ## Pre-built indices -- `IVF_PQ` on `image_emb` and `text_emb` — `metric=cosine` -- `INVERTED` on `caption` -- `BTREE` on `image_id` +- `IVF_PQ` on `image_emb` — image-side vector search (cosine) +- `IVF_PQ` on `text_emb` — text-side vector search (cosine) +- `INVERTED` (FTS) on `caption` — keyword and hybrid search +- `BTREE` on `image_id` — fast lookup by COCO image id + +## Why Lance? 
+ +1. **Blazing Fast Random Access**: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation. +2. **Native Multimodal Support**: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search. +3. **Native Index Support**: Lance comes with fast, on-disk, scalable vector and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them. +4. **Efficient Data Evolution**: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time. +5. **Versatile Querying**: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes. +6. **Data Versioning**: Every mutation commits a new version; previous versions remain intact on disk. Tags pin a snapshot by name, so retrieval systems and training runs can reproduce against an exact slice of history. + +## Load with `datasets.load_dataset` -## Quick start +You can load Lance datasets via the standard HuggingFace `datasets` interface, suitable when your pipeline already speaks `Dataset` / `IterableDataset` or you want a quick streaming sample. 
```python -import lance +import datasets -ds = lance.dataset("hf://datasets/lance-format/coco-captions-2017-lance/data/val.lance") -print(ds.count_rows(), ds.schema.names) -print(ds.list_indices()) +hf_ds = datasets.load_dataset("lance-format/coco-captions-2017-lance", split="val", streaming=True) +for row in hf_ds.take(3): + print(row["caption"]) ``` ## Load with LanceDB -These tables can also be consumed by [LanceDB](https://lancedb.github.io/lancedb/), the multimodal lakehouse and embedded search library built on top of Lance, for simplified vector search and other queries. +LanceDB is the embedded retrieval library built on top of the Lance format ([docs](https://lancedb.com/docs)), and is the interface most users interact with. It wraps the dataset as a queryable table with search and filter builders, and is the entry point used by the Search, Curate, Evolve, Versioning, and Materialize-a-subset sections below. ```python import lancedb db = lancedb.connect("hf://datasets/lance-format/coco-captions-2017-lance/data") tbl = db.open_table("val") -print(f"LanceDB table opened with {len(tbl)} image-caption pairs") +print(len(tbl)) ``` -> **Tip — for production use, download locally first.** -> ```bash -> hf download lance-format/coco-captions-2017-lance --repo-type dataset --local-dir ./coco-captions-2017-lance -> ``` - -## Vector search examples +## Load with Lance -Cross-modal text→image: +`pylance` is the Python binding for the Lance format and works directly with the format's lower-level APIs. Reach for it when you want to inspect dataset internals — schema, scanner, fragments, the list of pre-built indices. 
```python -import lance, open_clip, pyarrow as pa, torch - -model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="laion2b_s34b_b79k") -tokenizer = open_clip.get_tokenizer("ViT-B-32") -model = model.eval().cuda().half() -with torch.no_grad(): - q = model.encode_text(tokenizer(["a giraffe eating leaves"]).cuda()) - q = (q / q.norm(dim=-1, keepdim=True)).float().cpu().numpy()[0] +import lance ds = lance.dataset("hf://datasets/lance-format/coco-captions-2017-lance/data/val.lance") -emb_field = ds.schema.field("image_emb") -hits = ds.scanner( - nearest={"column": "image_emb", "q": pa.array([q.tolist()], type=emb_field.type)[0], "k": 10}, - columns=["image_id", "caption"], -).to_table().to_pylist() +print(ds.count_rows(), ds.schema.names) +print(ds.list_indices()) ``` -### LanceDB cross-modal text→image search +> **Tip — for production use, download locally first.** Streaming from the Hub works for exploration, but heavy random access and ANN search are far faster against a local copy: +> ```bash +> hf download lance-format/coco-captions-2017-lance --repo-type dataset --local-dir ./coco-captions-2017-lance +> ``` +> Then point Lance or LanceDB at `./coco-captions-2017-lance/data`. -```python -import lancedb, open_clip, torch +## Search + +The bundled `IVF_PQ` index on `image_emb` makes cross-modal text→image retrieval a single call: encode a text query with the same CLIP model used at ingest (ViT-B/32, cosine-normalized), then pass the resulting 512-d vector to `tbl.search(...)` and target `image_emb`. The example below uses the `text_emb` already stored in row 42 as a runnable stand-in for "the CLIP encoding of a caption", so the snippet works without any model loaded. 
-model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="laion2b_s34b_b79k") -tokenizer = open_clip.get_tokenizer("ViT-B-32") -model = model.eval().cuda().half() -with torch.no_grad(): - q = model.encode_text(tokenizer(["a giraffe eating leaves"]).cuda()) - q = (q / q.norm(dim=-1, keepdim=True)).float().cpu().numpy()[0] +```python +import lancedb db = lancedb.connect("hf://datasets/lance-format/coco-captions-2017-lance/data") tbl = db.open_table("val") -results = ( - tbl.search(q.tolist(), vector_column_name="image_emb") +seed = ( + tbl.search() + .select(["text_emb", "caption"]) + .limit(1) + .offset(42) + .to_list()[0] +) + +hits = ( + tbl.search(seed["text_emb"], vector_column_name="image_emb") .metric("cosine") .select(["image_id", "caption"]) .limit(10) .to_list() ) +print("query caption:", seed["caption"]) +for r in hits: + print(f" {r['image_id']:>12} {r['caption'][:70]}") ``` -Full-text search: +Because the stored CLIP embeddings are normalized, cosine is the right metric, and the first hit will typically be the image paired with the seed caption, which makes a useful sanity check. Swap `vector_column_name="image_emb"` for `vector_column_name="text_emb"` to do text→text retrieval against the canonical captions instead. + +Because the dataset also ships an `INVERTED` index on `caption`, the same query can be issued as a hybrid search that combines the dense vector with a keyword query. LanceDB merges the two result lists and reranks them in a single call, which is useful when a phrase like "yellow taxi" must literally appear in the caption but you still want CLIP to do the heavy lifting on visual similarity.
```python -ds = lance.dataset("hf://datasets/lance-format/coco-captions-2017-lance/data/val.lance") -hits = ds.scanner( - full_text_query="surfer riding a wave", - columns=["image_id", "caption"], - limit=10, -).to_table().to_pylist() +hybrid_hits = ( + tbl.search(query_type="hybrid", vector_column_name="image_emb") + .vector(seed["text_emb"]) + .text("a man riding a surfboard") + .select(["image_id", "caption"]) + .limit(10) + .to_list() +) +for r in hybrid_hits: + print(f" {r['image_id']:>12} {r['caption'][:70]}") ``` -### LanceDB full-text search +Tune `metric`, `nprobes`, and `refine_factor` on the vector side to trade recall against latency. + +## Curate + +A typical curation pass for a captioning or contrastive-training workflow combines a content filter on the canonical caption with a structural filter on the number of captions per image. Stacking both inside a single filtered scan keeps the result small and explicit, and the bounded `.limit(500)` makes it cheap to inspect before committing the subset to anything downstream. ```python import lancedb @@ -137,19 +160,127 @@ import lancedb db = lancedb.connect("hf://datasets/lance-format/coco-captions-2017-lance/data") tbl = db.open_table("val") -results = ( - tbl.search("surfer riding a wave") - .select(["image_id", "caption"]) - .limit(10) +candidates = ( + tbl.search("surfer OR surfboard OR wave") + .where("array_length(captions) >= 5", prefilter=True) + .select(["image_id", "caption", "captions"]) + .limit(500) .to_list() ) +print(f"{len(candidates)} candidates; first caption: {candidates[0]['caption'][:80]}") ``` -## Why Lance? +The result is a plain list of dictionaries, ready to inspect, persist as a manifest of `image_id`s, or feed into the Evolve and Train workflows below. The `image` column is never read, so the network traffic for a 500-row candidate scan is dominated by caption text rather than JPEG bytes. + +## Evolve + +Lance stores each column independently, so a new column can be appended without rewriting the existing data.
The lightest form is a SQL expression: derive the new column from columns that already exist, and Lance computes it once and persists it. The example below adds `num_captions` and a `long_caption` flag, either of which can then be used directly in `where` clauses without recomputing the predicate on every query. + +> **Note:** Mutations require a local copy of the dataset, since the Hub mount is read-only. See the Materialize-a-subset section at the end of this card for a streaming pattern that downloads only the rows and columns you need, or use `hf download` to pull the full split first. + +```python +import lancedb + +db = lancedb.connect("./coco-captions-2017-lance/data") # local copy required for writes +tbl = db.open_table("val") + +tbl.add_columns({ + "num_captions": "array_length(captions)", + "long_caption": "length(caption) >= 80", +}) +``` + +If the values you want to attach already live in another table (offline labels, classifier predictions, a second-pass caption from a different model), merge them in by joining on `image_id`: + +```python +import pyarrow as pa + +labels = pa.table({ + "image_id": pa.array(["179765", "000139"]), + "scene_label": pa.array(["beach", "kitchen"]), +}) +tbl.merge(labels, on="image_id") +``` + +The original columns and indices are untouched, so existing code that does not reference the new columns continues to work unchanged. New columns become visible to every reader as soon as the operation commits. For column values that require a Python computation (e.g., running a second CLIP variant over the image bytes), Lance provides a batch-UDF API — see the [Lance data evolution docs](https://lance.org/guide/data_evolution/). + +## Train + +Projection lets a training loop read only the columns each step actually needs. 
LanceDB tables expose this through `Permutation.identity(tbl).select_columns([...])`, which plugs straight into the standard `torch.utils.data.DataLoader` so prefetching, shuffling, and batching behave as in any PyTorch pipeline. For a CLIP-style contrastive run, project the JPEG bytes and a sampled caption; for a reranker or probe on top of frozen features, project the precomputed embeddings instead. + +```python +import lancedb +from lancedb.permutation import Permutation +from torch.utils.data import DataLoader + +db = lancedb.connect("hf://datasets/lance-format/coco-captions-2017-lance/data") +tbl = db.open_table("val") + +train_ds = Permutation.identity(tbl).select_columns(["image", "caption"]) +loader = DataLoader(train_ds, batch_size=128, shuffle=True, num_workers=4) + +for batch in loader: + # batch carries only the projected columns; decode the JPEG bytes, + # tokenize the captions, encode, contrastive loss... + ... +``` + +Switching feature sets is a configuration change: passing `["image_emb", "text_emb"]` to `select_columns(...)` on the next run skips JPEG decoding entirely and reads only the cached 512-d vectors, which is the right shape for training a lightweight reranker or a linear probe. + +## Versioning + +Every mutation to a Lance dataset, whether it adds a column, merges labels, or builds an index, commits a new version. Previous versions remain intact on disk. You can list versions and inspect the history directly from the Hub copy; creating new tags requires a local copy since tags are writes. 
+ +```python +import lancedb + +db = lancedb.connect("hf://datasets/lance-format/coco-captions-2017-lance/data") +tbl = db.open_table("val") + +print("Current version:", tbl.version) +print("History:", tbl.list_versions()) +print("Tags:", tbl.tags.list()) +``` + +Once you have a local copy, tag a version for reproducibility: + +```python +local_db = lancedb.connect("./coco-captions-2017-lance/data") +local_tbl = local_db.open_table("val") +local_tbl.tags.create("clip-vitb32-v1", local_tbl.version) +``` + +A tagged version can be opened by name, or any version reopened by its number, against either the Hub copy or a local one: + +```python +tbl_v1 = db.open_table("val", version="clip-vitb32-v1") +tbl_v5 = db.open_table("val", version=5) +``` + +Pinning supports two workflows. A retrieval system locked to `clip-vitb32-v1` keeps returning stable results while the dataset evolves in parallel — newly added embeddings or labels do not change what the tag resolves to. A training experiment pinned to the same tag can be rerun later against the exact same images and captions, so changes in metrics reflect model changes rather than data drift. Neither workflow needs shadow copies or external manifest tracking. + +## Materialize a subset + +Reads from the Hub are lazy, so exploratory queries only transfer the columns and row groups they touch. Mutating operations (Evolve, tag creation) need a writable backing store, and a training loop benefits from a local copy with fast random access. Both can be served by a subset of the dataset rather than the full split. The pattern is to stream a filtered query through `.to_batches()` into a new local table; only the projected columns and matching row groups cross the wire, and the bytes never fully materialize in Python memory. 
+ +```python +import lancedb + +remote_db = lancedb.connect("hf://datasets/lance-format/coco-captions-2017-lance/data") +remote_tbl = remote_db.open_table("test") + +batches = ( + remote_tbl.search("surfer OR surfboard OR wave") + .where("array_length(captions) >= 5") + .select(["image_id", "image", "caption", "captions", "image_emb", "text_emb"]) + .to_batches() +) + +local_db = lancedb.connect("./coco-surf-subset") +local_db.create_table("train", batches) +``` -- One dataset carries images + image embeddings + text embeddings + indices — no sidecar files. -- On-disk vector and full-text indices live next to the data, so search works on local copies and on the Hub. -- Schema evolution: add columns (new captions, alternate embeddings, model predictions) without rewriting the data. +The resulting `./coco-surf-subset` is a first-class LanceDB database. Every snippet in the Evolve, Train, and Versioning sections above works against it by swapping `hf://datasets/lance-format/coco-captions-2017-lance/data` for `./coco-surf-subset`. 
## Source & license @@ -162,6 +293,6 @@ Converted from [`lmms-lab/COCO-Caption2017`](https://huggingface.co/datasets/lmm title={Microsoft COCO: Common objects in context}, author={Lin, Tsung-Yi and Maire, Michael and Belongie, Serge and Hays, James and Perona, Pietro and Ramanan, Deva and Doll{\'a}r, Piotr and Zitnick, C Lawrence}, booktitle={European Conference on Computer Vision (ECCV)}, - year={2014}, + year={2014} } ``` diff --git a/docs/datasets/coco-detection-2017.mdx b/docs/datasets/coco-detection-2017.mdx index 0b16696..e6ff4c5 100644 --- a/docs/datasets/coco-detection-2017.mdx +++ b/docs/datasets/coco-detection-2017.mdx @@ -1,7 +1,7 @@ --- title: "COCO 2017 Detection" sidebarTitle: "COCO 2017 Detection" -description: "Lance-formatted version of the COCO 2017 object detection benchmark — sourced from detection-datasets/coco — with 123,287 images and the full per-image list of bounding boxes, category labels, and CLIP image embeddings, all stored inline." +description: "A Lance-formatted version of the COCO 2017 object detection benchmark, sourced from detection-datasets/coco. Each row is one image with its inline JPEG bytes, the full per-image list of bounding boxes, COCO 80-class category ids and names…" --- -Lance-formatted version of the [COCO 2017 object detection benchmark](https://cocodataset.org/) — sourced from [`detection-datasets/coco`](https://huggingface.co/datasets/detection-datasets/coco) — with **123,287 images** and the full per-image list of bounding boxes, category labels, and CLIP image embeddings, all stored inline. +A Lance-formatted version of the [COCO 2017 object detection benchmark](https://cocodataset.org/), sourced from [`detection-datasets/coco`](https://huggingface.co/datasets/detection-datasets/coco). 
Each row is one image with its inline JPEG bytes, the full per-image list of bounding boxes, COCO 80-class category ids and names, per-object areas, an OpenCLIP image embedding, and pre-built indices — all available directly from the Hub at `hf://datasets/lance-format/coco-detection-2017-lance/data`. -## Why this version? +## Key features -Object detection datasets typically split images, annotations, and embeddings across multiple files (often three different formats: JPEG, JSON, NumPy). Lance keeps all of it in one tabular dataset: - -- one row per image, -- the JPEG bytes, the bounding box list, the category labels, and the CLIP image embedding all live as columns on the same row, -- `IVF_PQ` on the embedding column lets you do visual similarity search without leaving the dataset, and `LABEL_LIST` on `categories_present` lets you filter to "images containing a dog and a frisbee" in milliseconds. +- **Inline JPEG bytes** in the `image` column — no sidecar files, no image folders. +- **Per-object annotations as parallel list columns** — `bboxes`, `categories`, `category_names`, and `areas` are aligned position-for-position, so iterating boxes alongside their labels is a single row read. +- **Pre-aggregated annotation summaries** — `num_objects` (int) and `categories_present` (deduped string list) precompute the predicates curation queries hit most. +- **CLIP image embeddings** (`image_emb`, OpenCLIP ViT-B/32, 512-d, cosine-normalized) with a bundled `IVF_PQ` index for visual retrieval. ## Splits | Split | Rows | |-------|------| | `train.lance` | 117,000+ | -| `val.lance` | 4,950+ | +| `val.lance` | 4,950+ | -(Counts come from the `detection-datasets/coco` redistribution; box counts: ~860k train / ~37k val.) +Total annotated boxes: ~860k train / ~37k val. 
## Schema @@ -37,88 +36,77 @@ Object detection datasets typically split images, annotations, and embeddings ac |---|---|---| | `id` | `int64` | Row index within split | | `image` | `large_binary` | Inline JPEG bytes | -| `image_id` | `int64` | COCO image id | +| `image_id` | `int64` | COCO image id (natural join key) | | `width`, `height` | `int32` | Image dimensions in pixels | -| `bboxes` | `list>` | Each box is `[x_min, y_min, x_max, y_max]` in absolute pixel coords | -| `categories` | `list` | COCO 80-class id (0-79) | -| `category_names` | `list` | Human-readable class name per object (e.g. `person`, `dog`, …) | -| `areas` | `list` | Bounding-box area (pixels²) | +| `bboxes` | `list>` | Each box is `[x_min, y_min, x_max, y_max]` in absolute pixel coordinates | +| `categories` | `list` | COCO 80-class id (0–79), aligned with `bboxes` | +| `category_names` | `list` | Human-readable class name per object (e.g. `person`, `dog`) | +| `areas` | `list` | Bounding-box area in pixels², aligned with `bboxes` | | `num_objects` | `int32` | Number of annotated objects in the image | -| `categories_present` | `list` | Deduped class names — feeds the `LABEL_LIST` index for fast filtering | -| `image_emb` | `fixed_size_list` | OpenCLIP `ViT-B-32` image embedding (cosine-normalized) | +| `categories_present` | `list` | Deduped class names — feeds the `LABEL_LIST` index | +| `image_emb` | `fixed_size_list` | OpenCLIP ViT-B/32 image embedding (cosine-normalized) | ## Pre-built indices -- `IVF_PQ` on `image_emb` — `metric=cosine` -- `BTREE` on `image_id`, `num_objects` -- `LABEL_LIST` on `categories_present` — supports `array_has_any` / `array_has_all` predicates +- `IVF_PQ` on `image_emb` — vector similarity search (cosine) +- `BTREE` on `image_id` — fast lookup by COCO image id +- `BTREE` on `num_objects` — range filters on image complexity +- `LABEL_LIST` on `categories_present` — supports `array_has_any` / `array_has_all` for class-presence filtering + +## Why Lance? + +1. 
**Blazing Fast Random Access**: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation. +2. **Native Multimodal Support**: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search. +3. **Native Index Support**: Lance comes with fast, on-disk, scalable vector and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them. +4. **Efficient Data Evolution**: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time. +5. **Versatile Querying**: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes. +6. **Data Versioning**: Every mutation commits a new version; previous versions remain intact on disk. Tags pin a snapshot by name, so retrieval systems and training runs can reproduce against an exact slice of history. + +## Load with `datasets.load_dataset` -## Quick start +You can load Lance datasets via the standard HuggingFace `datasets` interface, suitable when your pipeline already speaks `Dataset` / `IterableDataset` or you want a quick streaming sample without installing anything Lance-specific. 
```python -import lance +import datasets -ds = lance.dataset("hf://datasets/lance-format/coco-detection-2017-lance/data/val.lance") -print(ds.count_rows(), ds.schema.names, ds.list_indices()) +hf_ds = datasets.load_dataset("lance-format/coco-detection-2017-lance", split="val", streaming=True) +for row in hf_ds.take(3): + print(row["image_id"], row["num_objects"], row["categories_present"][:5]) ``` ## Load with LanceDB -These tables can also be consumed by [LanceDB](https://lancedb.github.io/lancedb/), the multimodal lakehouse and embedded search library built on top of Lance, for simplified vector search and other queries. +LanceDB is the embedded retrieval library built on top of the Lance format ([docs](https://lancedb.com/docs)), and is the interface most users interact with. It wraps the dataset as a queryable table with search and filter builders, and is the entry point used by the Search, Curate, Evolve, Train, Versioning, and Materialize-a-subset sections below. ```python import lancedb db = lancedb.connect("hf://datasets/lance-format/coco-detection-2017-lance/data") tbl = db.open_table("val") -print(f"LanceDB table opened with {len(tbl)} images") +print(len(tbl)) ``` -> **Tip — for production use, download locally first.** -> ```bash -> hf download lance-format/coco-detection-2017-lance --repo-type dataset --local-dir ./coco-detection-2017-lance -> ``` +## Load with Lance -## Read one annotated image +`pylance` is the Python binding for the Lance format and works directly with the format's lower-level APIs. Reach for it when you want to inspect dataset internals — schema, scanner, fragments, the list of pre-built indices. 
```python -import io import lance -from PIL import Image, ImageDraw ds = lance.dataset("hf://datasets/lance-format/coco-detection-2017-lance/data/val.lance") -row = ds.take([0], columns=["image", "bboxes", "category_names", "width", "height"]).to_pylist()[0] - -img = Image.open(io.BytesIO(row["image"])).convert("RGB") -draw = ImageDraw.Draw(img) -for (x1, y1, x2, y2), name in zip(row["bboxes"], row["category_names"]): - draw.rectangle([x1, y1, x2, y2], outline="red", width=3) - draw.text((x1 + 4, y1 + 4), name, fill="red") -img.save("annotated.jpg") +print(ds.count_rows(), ds.schema.names) +print(ds.list_indices()) ``` -## Filter by classes (LABEL_LIST index) - -```python -import lance -ds = lance.dataset("hf://datasets/lance-format/coco-detection-2017-lance/data/val.lance") +> **Tip — for production use, download locally first.** Streaming from the Hub works for exploration, but heavy random access, ANN search, and any mutation are far faster against a local copy: +> ```bash +> hf download lance-format/coco-detection-2017-lance --repo-type dataset --local-dir ./coco-detection-2017-lance +> ``` +> Then point Lance or LanceDB at `./coco-detection-2017-lance/data`. -# Images that contain BOTH a person and a frisbee. -rows = ds.scanner( - filter="array_has_all(categories_present, ['person', 'frisbee'])", - columns=["image_id", "category_names"], - limit=10, -).to_table().to_pylist() - -# Images with at least 5 objects of any class. -busy = ds.scanner( - filter="num_objects >= 5", - columns=["image_id", "num_objects"], - limit=10, -).to_table().to_pylist() -``` +## Search -### Filter by classes with LanceDB +The bundled `IVF_PQ` index on `image_emb` makes approximate-nearest-neighbor visual retrieval a single call. In production you would encode a query image through the same OpenCLIP ViT-B/32 model used at ingest and pass the resulting 512-d vector to `tbl.search(...)`. 
The example below uses the embedding stored on row 42 as a runnable stand-in, so the snippet works without loading any model. ```python import lancedb @@ -126,41 +114,117 @@ import lancedb db = lancedb.connect("hf://datasets/lance-format/coco-detection-2017-lance/data") tbl = db.open_table("val") -rows = ( +seed = ( tbl.search() - .where("array_has_all(categories_present, ['person', 'frisbee'])") - .select(["image_id", "category_names"]) + .select(["image_emb", "image_id", "categories_present"]) + .limit(1) + .offset(42) + .to_list()[0] +) + +hits = ( + tbl.search(seed["image_emb"]) + .metric("cosine") + .select(["image_id", "categories_present", "num_objects"]) .limit(10) .to_list() ) +print("query categories:", seed["categories_present"]) +for r in hits: + print(f" image_id={r['image_id']:>10} n={r['num_objects']:>3} cats={r['categories_present'][:5]}") +``` + +Because the query is row 42's own stored embedding, the first hit will typically be the source image itself, which makes a useful sanity check. Tune `nprobes` and `refine_factor` to trade recall against latency for your workload. + +## Curate + +Curation for a detection workflow usually means picking images that contain a specific class combination, possibly bounded by scene complexity. The `LABEL_LIST` index on `categories_present` makes class-presence predicates trivial, and Lance evaluates them inside the same scan as range filters on `num_objects` or `width`/`height`. The bounded `.limit(500)` keeps the result small and inspectable, and the `image` column is left out of the projection so the candidate scan is dominated by annotation metadata, not JPEG bytes. 
+ +```python +import lancedb + +db = lancedb.connect("hf://datasets/lance-format/coco-detection-2017-lance/data") +tbl = db.open_table("val") -busy = ( +candidates = ( tbl.search() - .where("num_objects >= 5") - .select(["image_id", "num_objects"]) - .limit(10) + .where( + "array_has_all(categories_present, ['person', 'frisbee']) " + "AND num_objects BETWEEN 3 AND 12", + prefilter=True, + ) + .select(["image_id", "categories_present", "num_objects", "width", "height"]) + .limit(500) .to_list() ) +print(f"{len(candidates)} candidates; first image_id: {candidates[0]['image_id']}") ``` -## Visual similarity search +The result is a plain list of dictionaries, ready to inspect, persist as a manifest of `image_id`s, or feed into the Evolve and Train workflows below. Swapping `array_has_all` for `array_has_any` widens recall to images containing any of the listed classes; replacing the structural predicate with `num_objects >= 10` selects busy scenes for crowd-detection ablations. + +## Evolve + +Lance stores each column independently, so a new column can be appended without rewriting the existing data. The lightest form is a SQL expression: derive the new column from columns that already exist, and Lance computes it once and persists it. The example below adds a `has_person` flag, an `aspect_ratio`, and a `max_box_area` that surfaces the largest annotated object area per image — all of which can then be used directly in `where` clauses without recomputing the predicate on every query. + +> **Note:** Mutations require a local copy of the dataset, since the Hub mount is read-only. See the Materialize-a-subset section at the end of this card for a streaming pattern that downloads only the rows and columns you need, or use `hf download` to pull the full split first. 
+ +```python +import lancedb + +db = lancedb.connect("./coco-detection-2017-lance/data") # local copy required for writes +tbl = db.open_table("val") + +tbl.add_columns({ + "has_person": "array_has_any(categories_present, ['person'])", + "aspect_ratio": "CAST(width AS DOUBLE) / CAST(height AS DOUBLE)", + "max_box_area": "array_max(areas)", + "crowded": "num_objects >= 10", +}) +``` + +If the values you want to attach already live in another table (offline predictions from a baseline detector, per-image difficulty scores, or a second-pass embedding), merge them in by joining on `image_id`: ```python -import lance import pyarrow as pa -ds = lance.dataset("hf://datasets/lance-format/coco-detection-2017-lance/data/val.lance") -emb_field = ds.schema.field("image_emb") -ref = ds.take([0], columns=["image_emb"]).to_pylist()[0]["image_emb"] -query = pa.array([ref], type=emb_field.type) - -neighbors = ds.scanner( - nearest={"column": "image_emb", "q": query[0], "k": 5}, - columns=["image_id", "category_names"], -).to_table().to_pylist() +predictions = pa.table({ + "image_id": pa.array([397133, 37777, 252219], type=pa.int64()), + "baseline_map": pa.array([0.31, 0.48, 0.22]), +}) +tbl.merge(predictions, on="image_id") +``` + +The original columns and indices are untouched, so existing code that does not reference the new columns continues to work unchanged. New columns become visible to every reader as soon as the operation commits. For column values that require a Python computation (e.g., running a second detector over the image bytes), Lance provides a batch-UDF API — see the [Lance data evolution docs](https://lance.org/guide/data_evolution/). + +## Train + +Projection lets a training loop read only the columns each step actually needs. LanceDB tables expose this through `Permutation.identity(tbl).select_columns([...])`, which plugs straight into the standard `torch.utils.data.DataLoader` so prefetching, shuffling, and batching behave as in any PyTorch pipeline. 
For a detector training run, project the JPEG bytes alongside the parallel annotation columns the loss consumes — boxes, category ids, and (optionally) areas. Columns added in the Evolve section above cost nothing per batch until they are explicitly projected. + +```python +import lancedb +from lancedb.permutation import Permutation +from torch.utils.data import DataLoader + +db = lancedb.connect("hf://datasets/lance-format/coco-detection-2017-lance/data") +tbl = db.open_table("train") + +train_ds = Permutation.identity(tbl).select_columns( + ["image", "bboxes", "categories", "areas"] +) +loader = DataLoader(train_ds, batch_size=16, shuffle=True, num_workers=4, + collate_fn=lambda b: b) # detection targets are ragged + +for batch in loader: + # batch is a list of dicts: decode each JPEG, stack the bboxes / categories + # into the target dictionary your detector expects, forward, loss... + ... ``` -### LanceDB visual similarity search +Switching feature sets is a configuration change: passing `["image_emb", "categories_present"]` to `select_columns(...)` on the next run skips JPEG decoding entirely and reads only the cached 512-d vectors plus the deduped class list, which is the right shape for training a lightweight multi-label classifier or a class-presence probe on top of frozen features. + +## Versioning + +Every mutation to a Lance dataset, whether it adds a column, merges labels, or builds an index, commits a new version. Previous versions remain intact on disk. You can list versions and inspect the history directly from the Hub copy; creating new tags requires a local copy since tags are writes. 
```python import lancedb @@ -168,23 +232,51 @@ import lancedb db = lancedb.connect("hf://datasets/lance-format/coco-detection-2017-lance/data") tbl = db.open_table("val") -ref = tbl.search().limit(1).select(["image_emb"]).to_list()[0] -query_embedding = ref["image_emb"] +print("Current version:", tbl.version) +print("History:", tbl.list_versions()) +print("Tags:", tbl.tags.list()) +``` + +Once you have a local copy, tag a version for reproducibility: -results = ( - tbl.search(query_embedding) - .metric("cosine") - .select(["image_id", "category_names"]) - .limit(5) - .to_list() -) +```python +local_db = lancedb.connect("./coco-detection-2017-lance/data") +local_tbl = local_db.open_table("val") +local_tbl.tags.create("detector-baseline-v1", local_tbl.version) ``` -## Why Lance? +A tagged version can be opened by name, or any version reopened by its number, against either the Hub copy or a local one: + +```python +tbl_v1 = db.open_table("val", version="detector-baseline-v1") +tbl_v5 = db.open_table("val", version=5) +``` + +Pinning supports two workflows. An evaluation harness locked to `detector-baseline-v1` keeps scoring against the exact same boxes and category ids while the dataset evolves in parallel; newly merged predictions or evolved columns do not change what the tag resolves to. A training experiment pinned to the same tag can be rerun later against the exact same images and annotations, so changes in mAP reflect model changes rather than data drift. Neither workflow needs shadow copies or external manifest tracking. + +## Materialize a subset + +Reads from the Hub are lazy, so exploratory queries only transfer the columns and row groups they touch. Mutating operations (Evolve, tag creation) need a writable backing store, and a training loop benefits from a local copy with fast random access. Both can be served by a subset of the dataset rather than the full split. 
The pattern is to stream a filtered query through `.to_batches()` into a new local table; only the projected columns and matching row groups cross the wire, and the bytes never fully materialize in Python memory. + +```python +import lancedb + +remote_db = lancedb.connect("hf://datasets/lance-format/coco-detection-2017-lance/data") +remote_tbl = remote_db.open_table("train") + +batches = ( + remote_tbl.search() + .where("array_has_any(categories_present, ['dog', 'cat']) AND num_objects >= 2") + .select(["image_id", "image", "bboxes", "categories", "category_names", + "areas", "num_objects", "categories_present", "image_emb"]) + .to_batches() +) + +local_db = lancedb.connect("./coco-pets-subset") +local_db.create_table("train", batches) +``` -- One dataset carries images + boxes + categories + areas + embeddings + indices — no JSON sidecars. -- On-disk vector and label-list indices live next to the data, so filters and ANN search work on local copies and on the Hub. -- Schema evolution: add columns (segmentation polygons, keypoints, panoptic ids, fresh embeddings) without rewriting the data. +The resulting `./coco-pets-subset` is a first-class LanceDB database. Every snippet in the Evolve, Train, and Versioning sections above works against it by swapping `hf://datasets/lance-format/coco-detection-2017-lance/data` for `./coco-pets-subset`. ## Source & license diff --git a/docs/datasets/docvqa.mdx b/docs/datasets/docvqa.mdx index 542a2d2..9ab260b 100644 --- a/docs/datasets/docvqa.mdx +++ b/docs/datasets/docvqa.mdx @@ -1,7 +1,7 @@ --- title: "DocVQA" sidebarTitle: "DocVQA" -description: "Lance-formatted version of DocVQA — VQA over document images (industry / government scans, multi-page reports, forms, receipts) — sourced from lmms-lab/DocVQA (DocVQA config)." 
+description: "A Lance-formatted version of DocVQA, a benchmark for visual question answering over document images such as industry and government scans, multi-page reports, forms, and receipts, redistributed via lmms-lab/DocVQA (DocVQA config). Each row carries…" --- -Lance-formatted version of [DocVQA](https://www.docvqa.org/) — VQA over document images (industry / government scans, multi-page reports, forms, receipts) — sourced from [`lmms-lab/DocVQA`](https://huggingface.co/datasets/lmms-lab/DocVQA) (`DocVQA` config). +A Lance-formatted version of [DocVQA](https://www.docvqa.org/), a benchmark for visual question answering over document images such as industry and government scans, multi-page reports, forms, and receipts, redistributed via [`lmms-lab/DocVQA`](https://huggingface.co/datasets/lmms-lab/DocVQA) (`DocVQA` config). Each row carries the page image as inline JPEG bytes, the question and reference answer span(s), the original DocVQA question-type tags, UCSF Industry Documents Library provenance, and paired CLIP embeddings for the image and the question — all available directly from the Hub at `hf://datasets/lance-format/docvqa-lance/data`. + +## Key features + +- **Inline page image bytes** in the `image` column — no sidecar files, no document folders. +- **Paired CLIP embeddings in the same row** — `image_emb` and `question_emb` (ViT-B/32, 512-dim, cosine-normalized) — so visual and textual retrieval are one indexed lookup. +- **All reference answer spans preserved in `answers`** alongside a canonical `answer` string used for full-text search. +- **Pre-built ANN, FTS, scalar, and label-list indices** covering both embedding columns, the question and answer text, the document ids, and the `question_types` tag list. 
## Splits -| Split | Rows | -|-------|------| -| `validation.lance` | 5,349 | -| `test.lance` | 5,188 | +| Split | Rows | Notes | +|-------|------|-------| +| `validation.lance` | 5,349 | Canonical DocVQA validation set | +| `test.lance` | 5,188 | Public test slice from `lmms-lab/DocVQA` | ## Schema | Column | Type | Notes | |---|---|---| -| `id` | `int64` | Row index within split | +| `id` | `int64` | Row index within split (natural join key) | | `image` | `large_binary` | Inline JPEG bytes (page image) | -| `image_id` | `string?` | DocVQA `docId` (alias) | +| `image_id` | `string?` | DocVQA `docId` (alias of `doc_id`) | | `question_id` | `string?` | DocVQA `questionId` | | `question` | `string` | Natural-language question | | `answers` | `list` | Reference answer span(s) | -| `answer` | `string` | First reference answer (FTS target) | +| `answer` | `string` | First reference answer — canonical, used for FTS | | `doc_id` | `string?` | DocVQA document id | | `ucsf_document_id` | `string?` | UCSF Industry Documents Library id | | `ucsf_document_page_no` | `string?` | Page number within the source document | @@ -42,32 +49,66 @@ Lance-formatted version of [DocVQA](https://www.docvqa.org/) — VQA over docume ## Pre-built indices -- `IVF_PQ` on `image_emb` and `question_emb` — `metric=cosine` -- `INVERTED` (FTS) on `question` and `answer` -- `BTREE` on `image_id`, `question_id`, `doc_id` -- `LABEL_LIST` on `question_types` +- `IVF_PQ` on `image_emb` — image-side vector search (cosine) +- `IVF_PQ` on `question_emb` — text-side vector search (cosine) +- `INVERTED` (FTS) on `question` and `answer` — keyword and hybrid search +- `BTREE` on `image_id`, `question_id`, `doc_id` — fast lookup by document or question id +- `LABEL_LIST` on `question_types` — set-membership filtering over question-type tags + +## Why Lance? -## Quick start +1. 
**Blazing Fast Random Access**: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation. +2. **Native Multimodal Support**: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search. +3. **Native Index Support**: Lance comes with fast, on-disk, scalable vector and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them. +4. **Efficient Data Evolution**: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time. +5. **Versatile Querying**: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes. +6. **Data Versioning**: Every mutation commits a new version; previous versions remain intact on disk. Tags pin a snapshot by name, so retrieval systems and training runs can reproduce against an exact slice of history. + +## Load with `datasets.load_dataset` + +You can load Lance datasets via the standard HuggingFace `datasets` interface, suitable when your pipeline already speaks `Dataset` / `IterableDataset` or you want a quick streaming sample. 
```python -import lance -ds = lance.dataset("hf://datasets/lance-format/docvqa-lance/data/validation.lance") -print(ds.count_rows(), ds.schema.names, ds.list_indices()) +import datasets + +hf_ds = datasets.load_dataset("lance-format/docvqa-lance", split="validation", streaming=True) +for row in hf_ds.take(3): + print(row["question"], "->", row["answer"]) ``` ## Load with LanceDB -These tables can also be consumed by [LanceDB](https://lancedb.github.io/lancedb/), the multimodal lakehouse and embedded search library built on top of Lance, for simplified vector search and other queries. +LanceDB is the embedded retrieval library built on top of the Lance format ([docs](https://lancedb.com/docs)), and is the interface most users interact with. It wraps the dataset as a queryable table with search and filter builders, and is the entry point used by the Search, Curate, Evolve, Train, Versioning, and Materialize-a-subset sections below. ```python import lancedb db = lancedb.connect("hf://datasets/lance-format/docvqa-lance/data") tbl = db.open_table("validation") -print(f"LanceDB table opened with {len(tbl)} document-question pairs") +print(len(tbl)) ``` -### LanceDB vector search +## Load with Lance + +`pylance` is the Python binding for the Lance format and works directly with the format's lower-level APIs. Reach for it when you want to inspect dataset internals — schema, scanner, fragments, and the list of pre-built indices. + +```python +import lance + +ds = lance.dataset("hf://datasets/lance-format/docvqa-lance/data/validation.lance") +print(ds.count_rows(), ds.schema.names) +print(ds.list_indices()) +``` + +> **Tip — for production use, download locally first.** Streaming from the Hub works for exploration, but heavy random access and ANN search are far faster against a local copy: +> ```bash +> hf download lance-format/docvqa-lance --repo-type dataset --local-dir ./docvqa-lance +> ``` +> Then point Lance or LanceDB at `./docvqa-lance/data`. 
+ +## Search + +The bundled `IVF_PQ` index on `question_emb` makes question-to-question retrieval a single call: encode a query with the same CLIP model used at ingest (ViT-B/32, cosine-normalized) and pass the resulting 512-d vector to `tbl.search(...)`. The example below uses the `question_emb` already stored in row 42 as a runnable stand-in, so the snippet works without any model loaded. ```python import lancedb @@ -75,19 +116,48 @@ import lancedb db = lancedb.connect("hf://datasets/lance-format/docvqa-lance/data") tbl = db.open_table("validation") -ref = tbl.search().limit(1).select(["question_emb", "question"]).to_list()[0] -query_embedding = ref["question_emb"] +seed = ( + tbl.search() + .select(["question_emb", "question"]) + .limit(1) + .offset(42) + .to_list()[0] +) -results = ( - tbl.search(query_embedding, vector_column_name="question_emb") +hits = ( + tbl.search(seed["question_emb"], vector_column_name="question_emb") .metric("cosine") - .select(["question", "answer"]) - .limit(5) + .select(["question_id", "question", "answer", "question_types"]) + .limit(10) .to_list() ) +print("query:", seed["question"]) +for r in hits: + print(f" {r['question_id']:>8} {r['question'][:60]} -> {r['answer']}") ``` -### LanceDB full-text search +Swap `vector_column_name="question_emb"` for `vector_column_name="image_emb"`, and pass an image embedding as the query vector, to retrieve pages whose visual layout is similar to a given page. This is useful when you want to find other forms or invoices that look like a seed page. + +Because the dataset also ships an `INVERTED` index on `question` and `answer`, the same query can be issued as a hybrid search that combines the dense vector with a keyword query. LanceDB merges the two result lists and reranks them in a single call, which is useful when a phrase like "invoice total" or "date of birth" must literally appear in the question but you still want CLIP to do the heavy lifting on semantic similarity.
+ +```python +hybrid_hits = ( + tbl.search(query_type="hybrid", vector_column_name="question_emb") + .vector(seed["question_emb"]) + .text("invoice total") + .select(["question_id", "question", "answer"]) + .limit(10) + .to_list() +) +for r in hybrid_hits: + print(f" {r['question_id']:>8} {r['question'][:60]} -> {r['answer']}") +``` + +Tune `metric`, `nprobes`, and `refine_factor` on the vector side to trade recall against latency. + +## Curate + +A typical curation pass for a document-VQA workflow combines a content filter on the question with a structural filter on the question-type tags. Stacking both inside a single filtered scan keeps the result small and explicit, and the bounded `.limit(500)` makes it cheap to inspect before committing the subset to anything downstream. The example below collects form-style questions that mention a date, which is a common slice for evaluating form-understanding behaviour. ```python import lancedb @@ -95,42 +165,131 @@ import lancedb db = lancedb.connect("hf://datasets/lance-format/docvqa-lance/data") tbl = db.open_table("validation") -results = ( - tbl.search("invoice total") - .select(["question", "answer"]) - .limit(10) +candidates = ( + tbl.search("date") + .where("array_has_any(question_types, ['form'])", prefilter=True) + .select(["question_id", "doc_id", "question", "answer", "question_types"]) + .limit(500) .to_list() ) +print(f"{len(candidates)} candidates; first: {candidates[0]['question'][:80]}") ``` -## Filter by question type +The result is a plain list of dictionaries, ready to inspect, persist as a manifest of `question_id`s, or feed into the Evolve and Train workflows below. The `image` column is never read, so the network traffic for a 500-row candidate scan is dominated by question and answer text rather than page JPEGs. + +## Evolve + +Lance stores each column independently, so a new column can be appended without rewriting the existing data. 
The lightest form is a SQL expression: derive the new column from columns that already exist, and Lance computes it once and persists it. The example below adds `answer_length`, an `is_form_question` flag, and a `has_table` flag, any of which can then be used directly in `where` clauses without recomputing the predicate on every query. + +> **Note:** Mutations require a local copy of the dataset, since the Hub mount is read-only. See the Materialize-a-subset section at the end of this card for a streaming pattern that downloads only the rows and columns you need, or use `hf download` to pull the full split first. ```python -import lance -ds = lance.dataset("hf://datasets/lance-format/docvqa-lance/data/validation.lance") -forms = ds.scanner( - filter="array_has_any(question_types, ['form'])", - columns=["question", "answer"], - limit=5, -).to_table() +import lancedb + +db = lancedb.connect("./docvqa-lance/data") # local copy required for writes +tbl = db.open_table("validation") + +tbl.add_columns({ + "answer_length": "length(answer)", + "is_form_question": "array_has_any(question_types, ['form'])", + "has_table": "array_has_any(question_types, ['table/list'])", +}) ``` -### Filter with LanceDB +If the values you want to attach already live in another table (OCR-extracted page text, model predictions, layout-detector outputs), merge them in by joining on `question_id`: + +```python +import pyarrow as pa + +predictions = pa.table({ + "question_id": pa.array(["49153", "49154", "49155"]), + "pred_answer": pa.array(["$1,234.56", "John Doe", "2018-04-12"]), + "is_correct": pa.array([True, True, False]), +}) +tbl.merge(predictions, on="question_id") +``` + +The original columns and indices are untouched, so existing code that does not reference the new columns continues to work unchanged. New columns become visible to every reader as soon as the operation commits. 
For column values that require a Python computation (e.g., running OCR or a layout model over the page bytes), Lance provides a batch-UDF API — see the [Lance data evolution docs](https://lance.org/guide/data_evolution/). + +## Train + +Projection lets a training loop read only the columns each step actually needs. LanceDB tables expose this through `Permutation.identity(tbl).select_columns([...])`, which plugs straight into the standard `torch.utils.data.DataLoader` so prefetching, shuffling, and batching behave as in any PyTorch pipeline. For fine-tuning a document-VLM, project the page bytes plus the question and answer; columns added in the Evolve section above cost nothing per batch until they are explicitly projected. ```python import lancedb +from lancedb.permutation import Permutation +from torch.utils.data import DataLoader db = lancedb.connect("hf://datasets/lance-format/docvqa-lance/data") tbl = db.open_table("validation") -forms = ( - tbl.search() + +train_ds = Permutation.identity(tbl).select_columns(["image", "question", "answer"]) +loader = DataLoader(train_ds, batch_size=16, shuffle=True, num_workers=4) + +for batch in loader: + # batch carries only the projected columns; decode the JPEG bytes, + # tokenize the question/answer pair, forward, backward... + ... +``` + +Switching feature sets is a configuration change: passing `["image_emb", "question_emb", "answer"]` to `select_columns(...)` on the next run skips JPEG decoding entirely and reads only the cached 512-d vectors, which is the right shape for training a lightweight answer-classifier or a linear probe on top of frozen features. + +## Versioning + +Every mutation to a Lance dataset, whether it adds a column, merges predictions, or builds an index, commits a new version. Previous versions remain intact on disk. You can list versions and inspect the history directly from the Hub copy; creating new tags requires a local copy since tags are writes. 
+ +```python +import lancedb + +db = lancedb.connect("hf://datasets/lance-format/docvqa-lance/data") +tbl = db.open_table("validation") + +print("Current version:", tbl.version) +print("History:", tbl.list_versions()) +print("Tags:", tbl.tags.list()) +``` + +Once you have a local copy, tag a version for reproducibility: + +```python +local_db = lancedb.connect("./docvqa-lance/data") +local_tbl = local_db.open_table("validation") +local_tbl.tags.create("eval-v1", local_tbl.version) +``` + +A tagged version can be opened by name, or any version reopened by its number, against either the Hub copy or a local one: + +```python +tbl_v1 = db.open_table("validation", version="eval-v1") +tbl_v5 = db.open_table("validation", version=5) +``` + +Pinning supports two workflows. An evaluation harness locked to `eval-v1` keeps producing comparable scores while the dataset evolves in parallel — newly added prediction columns or labels do not change what the tag resolves to. A training experiment pinned to the same tag can be rerun later against the exact same pages and questions, so changes in metrics reflect model changes rather than data drift. Neither workflow needs shadow copies or external manifest tracking. + +## Materialize a subset + +Reads from the Hub are lazy, so exploratory queries only transfer the columns and row groups they touch. Mutating operations (Evolve, tag creation) need a writable backing store, and a training loop benefits from a local copy with fast random access. Both can be served by a subset of the dataset rather than the full split. The pattern is to stream a filtered query through `.to_batches()` into a new local table; only the projected columns and matching row groups cross the wire, and the bytes never fully materialize in Python memory. 
+ +```python +import lancedb + +remote_db = lancedb.connect("hf://datasets/lance-format/docvqa-lance/data") +remote_tbl = remote_db.open_table("validation") + +batches = ( + remote_tbl.search("date") .where("array_has_any(question_types, ['form'])") - .select(["question", "answer"]) - .limit(5) - .to_list() + .select(["id", "image", "question_id", "doc_id", "question", "answer", + "question_types", "image_emb", "question_emb"]) + .to_batches() ) + +local_db = lancedb.connect("./docvqa-forms-subset") +local_db.create_table("validation", batches) ``` +The resulting `./docvqa-forms-subset` is a first-class LanceDB database. Every snippet in the Evolve, Train, and Versioning sections above works against it by swapping `hf://datasets/lance-format/docvqa-lance/data` for `./docvqa-forms-subset`. + ## Source & license Converted from [`lmms-lab/DocVQA`](https://huggingface.co/datasets/lmms-lab/DocVQA). DocVQA is released under the MIT license; the underlying documents come from the [UCSF Industry Documents Library](https://www.industrydocuments.ucsf.edu/) — review their access conditions before redistribution. diff --git a/docs/datasets/eurosat.mdx b/docs/datasets/eurosat.mdx index c9ca094..0ec8b66 100644 --- a/docs/datasets/eurosat.mdx +++ b/docs/datasets/eurosat.mdx @@ -1,7 +1,7 @@ --- title: "EuroSAT" sidebarTitle: "EuroSAT" -description: "Lance-formatted version of EuroSAT — Sentinel-2 satellite imagery (RGB) covering 27,000 64×64 tiles across 10 land-cover classes, sourced from blanchon/EuroSAT_RGB." +description: "A Lance-formatted version of EuroSAT, the canonical Sentinel-2 RGB land-cover benchmark, sourced from blanchon/EuroSAT_RGB. 
Each row is a single 64×64 RGB tile with its integer class id, the human-readable class name, and a cosine-normalized…" --- -Lance-formatted version of [EuroSAT](https://github.com/phelber/eurosat) — Sentinel-2 satellite imagery (RGB) covering **27,000 64×64 tiles** across 10 land-cover classes, sourced from [`blanchon/EuroSAT_RGB`](https://huggingface.co/datasets/blanchon/EuroSAT_RGB). +A Lance-formatted version of [EuroSAT](https://github.com/phelber/eurosat), the canonical Sentinel-2 RGB land-cover benchmark, sourced from [`blanchon/EuroSAT_RGB`](https://huggingface.co/datasets/blanchon/EuroSAT_RGB). Each row is a single 64×64 RGB tile with its integer class id, the human-readable class name, and a cosine-normalized OpenCLIP image embedding — all stored inline and available directly from the Hub at `hf://datasets/lance-format/eurosat-lance/data`. -This is the canonical "geo" tile-level classification benchmark, useful for remote sensing pre-training and small-tile retrieval research. +## Key features + +- **Inline JPEG bytes** in the `image` column — no sidecar TIF folders, no per-class subdirectories. +- **Pre-computed OpenCLIP image embeddings** (`image_emb`, ViT-B/32, 512-dim, cosine-normalized) with a bundled `IVF_PQ` index for similarity search. +- **Both label representations** — integer `label` (0-9) and string `label_name` — with scalar indices on both for fast class filters. +- **One columnar dataset** — scan labels and embeddings cheaply, fetch tile bytes only for the rows you actually need. 
## Splits -| Split | Rows | -|-------|------| -| `train.lance` | 16,200 | -| `validation.lance` | 5,400 | -| `test.lance` | 5,400 | +| Split | Rows | Notes | +|-------|------|-------| +| `train.lance` | 16,200 | Training split | +| `validation.lance` | 5,400 | Validation split | +| `test.lance` | 5,400 | Held-out test split | ## Schema | Column | Type | Notes | |---|---|---| -| `id` | `int64` | Row index within split | -| `image` | `large_binary` | Inline JPEG bytes (64×64 RGB Sentinel-2) | +| `id` | `int64` | Row index within the split (natural join key) | +| `image` | `large_binary` | Inline JPEG bytes (64×64 RGB Sentinel-2 tile) | | `label` | `int32` | Class id (0-9) | -| `label_name` | `string` | `Annual_Crop`, `Forest`, `Herbaceous_Vegetation`, `Highway`, `Industrial_Buildings`, `Pasture`, `Permanent_Crop`, `Residential_Buildings`, `River`, `SeaLake` | +| `label_name` | `string` | One of `Annual_Crop`, `Forest`, `Herbaceous_Vegetation`, `Highway`, `Industrial_Buildings`, `Pasture`, `Permanent_Crop`, `Residential_Buildings`, `River`, `SeaLake` | | `image_emb` | `fixed_size_list` | OpenCLIP `ViT-B-32` image embedding (cosine-normalized) | ## Pre-built indices -- `IVF_PQ` on `image_emb` — `metric=cosine` -- `BTREE` on `label` -- `BITMAP` on `label_name` +- `IVF_PQ` on `image_emb` — vector similarity search (cosine) +- `BTREE` on `label` — fast equality / range filters by class id +- `BITMAP` on `label_name` — fast set-membership filters by class name + +## Why Lance? + +1. **Blazing Fast Random Access**: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation. +2. **Native Multimodal Support**: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search. +3. 
**Native Index Support**: Lance comes with fast, on-disk, scalable vector and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them. +4. **Efficient Data Evolution**: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time. +5. **Versatile Querying**: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes. +6. **Data Versioning**: Every mutation commits a new version; previous versions remain intact on disk. Tags pin a snapshot by name, so retrieval systems and training runs can reproduce against an exact slice of history. + +## Load with `datasets.load_dataset` -## Quick start +You can load Lance datasets via the standard HuggingFace `datasets` interface, suitable when your pipeline already speaks `Dataset` / `IterableDataset` or you want a quick streaming sample without installing anything Lance-specific. ```python -import lance +import datasets -ds = lance.dataset("hf://datasets/lance-format/eurosat-lance/data/train.lance") -print(ds.count_rows(), ds.schema.names, ds.list_indices()) +hf_ds = datasets.load_dataset("lance-format/eurosat-lance", split="train", streaming=True) +for row in hf_ds.take(3): + print(row["label_name"]) ``` ## Load with LanceDB -These tables can also be consumed by [LanceDB](https://lancedb.github.io/lancedb/), the multimodal lakehouse and embedded search library built on top of Lance, for simplified vector search and other queries. +LanceDB is the embedded retrieval library built on top of the Lance format ([docs](https://lancedb.com/docs)), and is the interface most users interact with. 
It wraps the dataset as a queryable table with search and filter builders, and is the entry point used by the Search, Curate, Evolve, Train, Versioning, and Materialize-a-subset sections below. ```python import lancedb db = lancedb.connect("hf://datasets/lance-format/eurosat-lance/data") tbl = db.open_table("train") -print(f"LanceDB table opened with {len(tbl)} satellite tiles") +print(len(tbl)) ``` -## Visual similarity search +## Load with Lance + +`pylance` is the Python binding for the Lance format and works directly with the format's lower-level APIs. Reach for it when you want to inspect or operate on dataset internals — schema, scanner, fragments, and the list of pre-built indices. ```python import lance -import pyarrow as pa ds = lance.dataset("hf://datasets/lance-format/eurosat-lance/data/train.lance") -emb_field = ds.schema.field("image_emb") -ref = ds.take([0], columns=["image_emb", "label_name"]).to_pylist()[0] -query = pa.array([ref["image_emb"]], type=emb_field.type) - -hits = ds.scanner( - nearest={"column": "image_emb", "q": query[0], "k": 5, "nprobes": 16, "refine_factor": 30}, - columns=["id", "label_name"], -).to_table().to_pylist() -print(f"reference: {ref['label_name']}") -for h in hits: - print(h) +print(ds.count_rows(), ds.schema.names) +print(ds.list_indices()) ``` -### LanceDB visual similarity search +> **Tip — for production use, download locally first.** Streaming from the Hub works for exploration, but heavy random access and ANN search are far faster against a local copy: +> ```bash +> hf download lance-format/eurosat-lance --repo-type dataset --local-dir ./eurosat-lance +> ``` +> Then point Lance or LanceDB at `./eurosat-lance/data`. + +## Search + +The bundled `IVF_PQ` index on `image_emb` makes visually-similar-tile retrieval a single call. In production you would encode a query tile through the same OpenCLIP `ViT-B-32` model used at ingest (cosine-normalized) and pass the resulting 512-d vector to `tbl.search(...)`. 
The example below uses the embedding already stored in row 42 as a runnable stand-in, so the snippet works without any model loaded. ```python import lancedb @@ -89,41 +106,166 @@ import lancedb db = lancedb.connect("hf://datasets/lance-format/eurosat-lance/data") tbl = db.open_table("train") -ref = tbl.search().limit(1).select(["image_emb", "label_name"]).to_list()[0] -query_embedding = ref["image_emb"] +seed = ( + tbl.search() + .select(["image_emb", "label_name"]) + .limit(1) + .offset(42) + .to_list()[0] +) -results = ( - tbl.search(query_embedding) +hits = ( + tbl.search(seed["image_emb"]) .metric("cosine") .select(["id", "label_name"]) - .limit(5) + .limit(10) .to_list() ) +print(f"reference tile class: {seed['label_name']}") +for r in hits: + print(f" id={r['id']:>6} {r['label_name']}") ``` -## Filter by class +Because the embeddings are cosine-normalized at ingest, `.metric("cosine")` is the right choice, and the first hit will typically be the seed tile itself, which is a useful sanity check. Tune `nprobes` and `refine_factor` to trade recall against latency for your workload. + +## Curate + +A typical curation pass for a land-cover classification or retrieval study narrows the dataset to a single class and then retrieves the visually closest tiles to a seed. Lance evaluates the vector search and the metadata filter inside a single query, so the candidate set comes back already filtered. The example below pulls the 500 forest tiles most similar to a chosen seed; the bounded `.limit(500)` keeps the output small enough to inspect or hand off.
```python -import lance -ds = lance.dataset("hf://datasets/lance-format/eurosat-lance/data/train.lance") -rivers = ds.scanner(filter="label_name = 'River'", columns=["id"], limit=5).to_table() +import lancedb + +db = lancedb.connect("hf://datasets/lance-format/eurosat-lance/data") +tbl = db.open_table("train") + +seed = ( + tbl.search() + .select(["image_emb"]) + .limit(1) + .offset(0) + .to_list()[0] +) + +candidates = ( + tbl.search(seed["image_emb"]) + .where("label_name = 'Forest'", prefilter=True) + .select(["id", "label", "label_name"]) + .limit(500) + .to_list() +) +print(f"{len(candidates)} Forest candidates") +``` + +The result is a plain list of dictionaries, ready to inspect, persist as a manifest of row ids, or feed into the Evolve and Train workflows below. The `image` column is never read, so the network traffic for a 500-row candidate scan is dominated by the small metadata payload rather than JPEG bytes. + +## Evolve + +Lance stores each column independently, so a new column can be appended without rewriting the existing data. The lightest form is a SQL expression: derive the new column from columns that already exist, and Lance computes it once and persists it. The example below adds a coarse `is_urban` flag that captures whether a tile belongs to one of the built-environment classes, useful as a direct predicate in later `where` clauses without re-evaluating the class set on every query. + +> **Note:** Mutations require a local copy of the dataset, since the Hub mount is read-only. See the Materialize-a-subset section at the end of this card for a streaming pattern that downloads only the rows and columns you need, or use `hf download` to pull the full corpus first. 
+ +```python +import lancedb + +db = lancedb.connect("./eurosat-lance/data") # local copy required for writes +tbl = db.open_table("train") + +tbl.add_columns({ + "is_urban": "label_name IN ('Highway', 'Industrial_Buildings', 'Residential_Buildings')", +}) +``` + +If the values you want to attach are class-level and already live in another table (a coarse climate label per class, for example), merge them in by joining on `label_name`; per-tile values such as an external aesthetic score or model predictions from a separate eval should join on `id` instead: + +```python +import pyarrow as pa + +climate = pa.table({ + "label_name": pa.array(["Forest", "Pasture", "SeaLake", "River"]), + "climate_zone": pa.array(["temperate", "temperate", "marine", "freshwater"]), +}) +tbl.merge(climate, on="label_name") ``` -### Filter by class with LanceDB +The original columns and indices are untouched, so existing code that does not reference the new columns continues to work unchanged. New columns become visible to every reader as soon as the operation commits. For column values that require a Python computation (e.g., running an alternative remote-sensing model over the tile bytes), Lance provides a batch-UDF API in the underlying library — see the [Lance data evolution docs](https://lance.org/guide/data_evolution/) for that pattern. + +## Train + +Projection lets a training loop read only the columns each step actually needs. LanceDB tables expose this through `Permutation.identity(tbl).select_columns([...])`, which plugs straight into the standard `torch.utils.data.DataLoader` so prefetching, shuffling, and batching behave as in any PyTorch pipeline. Columns added in the Evolve section above cost nothing per batch until they are explicitly projected.
```python import lancedb +from lancedb.permutation import Permutation +from torch.utils.data import DataLoader db = lancedb.connect("hf://datasets/lance-format/eurosat-lance/data") tbl = db.open_table("train") -rivers = tbl.search().where("label_name = 'River'").select(["id"]).limit(5).to_list() + +train_ds = Permutation.identity(tbl).select_columns(["image", "label"]) +loader = DataLoader(train_ds, batch_size=128, shuffle=True, num_workers=4) + +for batch in loader: + # batch carries only the projected columns; image_emb stays on disk. + # decode the JPEG bytes, forward through a CNN or ViT, cross-entropy loss... + ... ``` -## Why Lance? +Switching feature sets is a configuration change: passing `["image_emb", "label"]` to `select_columns(...)` on the next run skips JPEG decoding entirely and reads only the cached 512-d vectors, which is the right shape for training a linear probe or a lightweight classifier head on top of frozen CLIP features. + +## Versioning + +Every mutation to a Lance dataset, whether it adds a column, merges labels, or builds an index, commits a new version. Previous versions remain intact on disk. You can list versions and inspect the history directly from the Hub copy; creating new tags requires a local copy since tags are writes. 
+ +```python +import lancedb + +db = lancedb.connect("hf://datasets/lance-format/eurosat-lance/data") +tbl = db.open_table("train") + +print("Current version:", tbl.version) +print("History:", tbl.list_versions()) +print("Tags:", tbl.tags.list()) +``` + +Once you have a local copy, tag a version for reproducibility: + +```python +local_db = lancedb.connect("./eurosat-lance/data") +local_tbl = local_db.open_table("train") +local_tbl.tags.create("clip-vitb32-v1", local_tbl.version) +``` + +A tagged version can be opened by name, or any version reopened by its number, against either the Hub copy or a local one: + +```python +tbl_v1 = db.open_table("train", version="clip-vitb32-v1") +tbl_v5 = db.open_table("train", version=5) +``` + +Pinning supports two workflows. A retrieval system locked to `clip-vitb32-v1` keeps returning stable results while the dataset evolves in parallel; newly added columns or labels do not change what the tag resolves to. A training experiment pinned to the same tag can be rerun later against the exact same tiles, so changes in metrics reflect model changes rather than data drift. Neither workflow needs shadow copies or external manifest tracking. + +## Materialize a subset + +Reads from the Hub are lazy, so exploratory queries only transfer the columns and row groups they touch. Mutating operations (Evolve, tag creation) need a writable backing store, and a training loop benefits from a local copy with fast random access. Both can be served by a subset of the dataset rather than the full split. The pattern is to stream a filtered query through `.to_batches()` into a new local table; only the projected columns and matching row groups cross the wire, and the bytes never fully materialize in Python memory. 
+ +```python +import lancedb + +remote_db = lancedb.connect("hf://datasets/lance-format/eurosat-lance/data") +remote_tbl = remote_db.open_table("train") + +batches = ( + remote_tbl.search() + .where("label_name IN ('Forest', 'River', 'SeaLake')") + .select(["id", "image", "label", "label_name", "image_emb"]) + .to_batches() +) + +local_db = lancedb.connect("./eurosat-natural-subset") +local_db.create_table("train", batches) +``` -- One dataset for tiles + embeddings + indices — no sidecar TIF folder per class. -- On-disk vector and FTS indices live next to the data, so search works on local copies and on the Hub. -- Schema evolution: add columns (multi-spectral channels, model predictions, fresh embeddings) without rewriting the data. +The resulting `./eurosat-natural-subset` is a first-class LanceDB database. Every snippet in the Evolve, Train, and Versioning sections above works against it by swapping `hf://datasets/lance-format/eurosat-lance/data` for `./eurosat-natural-subset`. ## Source & license diff --git a/docs/datasets/fashion-mnist.mdx b/docs/datasets/fashion-mnist.mdx index 08de318..fc8f60b 100644 --- a/docs/datasets/fashion-mnist.mdx +++ b/docs/datasets/fashion-mnist.mdx @@ -1,7 +1,7 @@ --- title: "Fashion-MNIST" sidebarTitle: "Fashion-MNIST" -description: "A Lance-formatted version of Fashion-MNIST with 70,000 28×28 grayscale clothing images stored inline alongside CLIP embeddings and a pre-built IVF_PQ ANN index." +description: "A Lance-formatted version of Fashion-MNIST covering 70,000 28×28 grayscale clothing images across ten balanced apparel classes. Each row carries inline PNG bytes, the integer label, the human-readable class name, and a cosine-normalized CLIP image…" --- -A Lance-formatted version of [Fashion-MNIST](https://huggingface.co/datasets/zalando-datasets/fashion_mnist) with **70,000 28×28 grayscale clothing images** stored inline alongside CLIP embeddings and a pre-built `IVF_PQ` ANN index. 
+A Lance-formatted version of [Fashion-MNIST](https://huggingface.co/datasets/zalando-datasets/fashion_mnist) covering 70,000 28×28 grayscale clothing images across ten balanced apparel classes. Each row carries inline PNG bytes, the integer label, the human-readable class name, and a cosine-normalized CLIP image embedding. The dataset ships with a bundled `IVF_PQ` vector index plus scalar indices on the label columns, and is available directly from the Hub at `hf://datasets/lance-format/fashion-mnist-lance/data`. ## Key features -- All multimodal data (image bytes + embeddings) stored **inline** in the same Lance dataset. -- **Pre-computed CLIP embeddings** (OpenCLIP `ViT-B-32` / `laion2b_s34b_b79k`, 512-dim, L2-normalized) with an `IVF_PQ` index. -- **BTREE on `label`** and **BITMAP on `label_name`** for fast filtered scans. +- **Inline PNG bytes** in the `image` column — no sidecar files, no image folders. +- **Pre-computed CLIP image embeddings** (OpenCLIP `ViT-B-32` / `laion2b_s34b_b79k`, 512-dim, cosine-normalized) with a bundled `IVF_PQ` index. +- **Scalar indices on both label columns** — `BTREE` on `label` and `BITMAP` on `label_name` — so apparel-class filters and class-conditioned search are served by index lookups rather than full scans. +- **One columnar dataset** — scan labels cheaply, then fetch image bytes only for the rows you want.
## Splits | Split | Rows | |-------|------| -| `train` | 60,000 | -| `test` | 10,000 | +| `train.lance` | 60,000 | +| `test.lance` | 10,000 | ## Schema | Column | Type | Notes | |---|---|---| -| `id` | `int64` | Row index within the split | +| `id` | `int64` | Row index within the split (natural join key for merges) | | `image` | `large_binary` | Inline PNG bytes (28×28 grayscale) | -| `label` | `int32` | Class id (0-9) | -| `label_name` | `string` | One of `T-shirt/top`, `Trouser`, `Pullover`, `Dress`, `Coat`, `Sandal`, `Shirt`, `Sneaker`, `Bag`, `Ankle_boot` | +| `label` | `int32` | Class id (0–9) | +| `label_name` | `string` | One of `T-shirt_top`, `Trouser`, `Pullover`, `Dress`, `Coat`, `Sandal`, `Shirt`, `Sneaker`, `Bag`, `Ankle_boot` | | `image_emb` | `fixed_size_list` | CLIP image embedding (cosine-normalized) | +> The original Fashion-MNIST class strings `T-shirt/top` and `Ankle boot` are sanitized to `T-shirt_top` and `Ankle_boot` for use as filename-safe identifiers, so SQL filters on `label_name` should reference the underscored form. + ## Pre-built indices -- `IVF_PQ` on `image_emb` — `metric=cosine` -- `BTREE` on `label` -- `BITMAP` on `label_name` +- `IVF_PQ` on `image_emb` — vector similarity search (cosine) +- `BTREE` on `label` — fast equality and range filters on the class id +- `BITMAP` on `label_name` — fast filters across the ten class names -## Load with Lance +## Why Lance? + +1. **Blazing Fast Random Access**: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation. +2. **Native Multimodal Support**: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search. +3. 
**Native Index Support**: Lance comes with fast, on-disk, scalable vector and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them. +4. **Efficient Data Evolution**: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time. +5. **Versatile Querying**: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes. +6. **Data Versioning**: Every mutation commits a new version; previous versions remain intact on disk. Tags pin a snapshot by name, so retrieval systems and training runs can reproduce against an exact slice of history. + +## Load with `datasets.load_dataset` + +You can load Lance datasets via the standard HuggingFace `datasets` interface, suitable if your pipeline already speaks `Dataset` / `IterableDataset` or you want a quick streaming sample without installing anything Lance-specific. ```python -import lance +import datasets -ds = lance.dataset("hf://datasets/lance-format/fashion-mnist-lance/data/train.lance") -print(ds.count_rows(), ds.schema.names, ds.list_indices()) +hf_ds = datasets.load_dataset("lance-format/fashion-mnist-lance", split="train", streaming=True) +for row in hf_ds.take(3): + print(row["label"], row["label_name"]) ``` ## Load with LanceDB -These tables can also be consumed by [LanceDB](https://lancedb.github.io/lancedb/), the multimodal lakehouse and embedded search library built on top of Lance, for simplified vector search and other queries. +LanceDB is the embedded retrieval library built on top of the Lance format ([docs](https://lancedb.com/docs)), and is the interface most users interact with. 
It wraps the dataset as a queryable table with search and filter builders, and is the entry point used by the Search, Curate, Evolve, Train, Versioning, and Materialize-a-subset sections below. ```python import lancedb db = lancedb.connect("hf://datasets/lance-format/fashion-mnist-lance/data") tbl = db.open_table("train") -print(f"LanceDB table opened with {len(tbl)} images") +print(len(tbl)) ``` -## Load with `datasets.load_dataset` +## Load with Lance + +`pylance` is the Python binding for the Lance format and works directly with the format's lower-level APIs. Reach for it when you want to inspect or operate on dataset internals — schema, scanner, fragments, and the list of pre-built indices. ```python -import datasets +import lance -hf_ds = datasets.load_dataset("lance-format/fashion-mnist-lance", split="train", streaming=True) -for row in hf_ds.take(3): - print(row["label_name"]) +ds = lance.dataset("hf://datasets/lance-format/fashion-mnist-lance/data/train.lance") +print(ds.count_rows(), ds.schema.names) +print(ds.list_indices()) ``` -> **Tip — for production use, download locally first** to avoid Hub rate limits: +> **Tip — for production use, download locally first.** Streaming from the Hub works for exploration, but heavy random access and ANN search are far faster against a local copy: > ```bash > hf download lance-format/fashion-mnist-lance --repo-type dataset --local-dir ./fashion-mnist-lance > ``` +> Then point Lance or LanceDB at `./fashion-mnist-lance/data`. -## Vector search example +## Search + +The bundled `IVF_PQ` index on `image_emb` turns nearest-neighbor lookup on the 512-d CLIP space into a single call. In production you would encode a query image (or, for cross-modal text→image lookup, a tokenized prompt like "a black ankle boot") through OpenCLIP `ViT-B-32` at runtime and pass the resulting vector to `tbl.search(...)`. 
The example below uses the embedding already stored in row 42 as a runnable stand-in so the snippet works without any model loaded. ```python -import lance -import pyarrow as pa +import lancedb -ds = lance.dataset("hf://datasets/lance-format/fashion-mnist-lance/data/train.lance") -emb_field = ds.schema.field("image_emb") -ref = ds.take([0], columns=["image_emb"]).to_pylist()[0]["image_emb"] -query = pa.array([ref], type=emb_field.type) - -neighbors = ds.scanner( - nearest={"column": "image_emb", "q": query[0], "k": 5, "nprobes": 16, "refine_factor": 30}, - columns=["id", "label_name"], -).to_table().to_pylist() +db = lancedb.connect("hf://datasets/lance-format/fashion-mnist-lance/data") +tbl = db.open_table("train") + +seed = ( + tbl.search() + .select(["image_emb", "label_name"]) + .limit(1) + .offset(42) + .to_list()[0] +) + +hits = ( + tbl.search(seed["image_emb"]) + .metric("cosine") + .select(["id", "label", "label_name"]) + .limit(10) + .to_list() +) +print("query class:", seed["label_name"]) +for r in hits: + print(f" id={r['id']:>5} {r['label_name']}") ``` -### LanceDB vector search +Because the embeddings are cosine-normalized and CLIP separates apparel categories cleanly, near-neighbors of a seed image are typically dominated by the seed's own class, with the most confusable garments (Shirt vs T-shirt_top, Sneaker vs Sandal) showing up next. Tune `metric`, `nprobes`, and `refine_factor` to trade recall against latency. + +## Curate + +A typical curation pass for an apparel-classification workflow narrows the table to a confusable subset of classes (for example, the three upper-body garments that share silhouettes) before sampling. Because both label columns are indexed, the filter resolves without scanning the embedding or image bytes; the bounded `.limit(500)` keeps the output small enough to inspect or hand off as a manifest of row ids. 
```python import lancedb @@ -104,41 +139,128 @@ import lancedb db = lancedb.connect("hf://datasets/lance-format/fashion-mnist-lance/data") tbl = db.open_table("train") -ref = tbl.search().limit(1).select(["image_emb"]).to_list()[0] -query_embedding = ref["image_emb"] - -results = ( - tbl.search(query_embedding) - .metric("cosine") - .select(["id", "label_name"]) - .limit(5) +candidates = ( + tbl.search() + .where("label_name IN ('Shirt', 'T-shirt_top', 'Pullover')", prefilter=True) + .select(["id", "label", "label_name"]) + .limit(500) .to_list() ) +print(f"{len(candidates)} upper-body-garment candidates") ``` -## Filter by class +The result is a plain list of dictionaries, ready to inspect, persist as a manifest of `id`s, or feed into the Evolve and Train workflows below. The `image` and `image_emb` columns are never read, so the network traffic for a 500-row candidate scan is dominated by the tiny label payload. + +## Evolve + +Lance stores each column independently, so a new column can be appended without rewriting the existing data. The lightest form is a SQL expression: derive the new column from columns that already exist, and Lance computes it once and persists it. The example below adds an `is_footwear` flag that groups the three shoe-like classes and an `is_target_class` flag for one-vs-rest experiments, either of which can then be used directly in `where` clauses without recomputing the predicate on every query. + +> **Note:** Mutations require a local copy of the dataset, since the Hub mount is read-only. See the Materialize-a-subset section at the end of this card for a streaming pattern that downloads only the rows and columns you need, or use `hf download` to pull the full corpus first. 
```python -import lance -ds = lance.dataset("hf://datasets/lance-format/fashion-mnist-lance/data/train.lance") -sneakers = ds.scanner(filter="label_name = 'Sneaker'", columns=["id"], limit=5).to_table() +import lancedb + +db = lancedb.connect("./fashion-mnist-lance/data") # local copy required for writes +tbl = db.open_table("train") + +tbl.add_columns({ + "is_footwear": "label_name IN ('Sandal', 'Sneaker', 'Ankle_boot')", + "is_target_class": "label = 6", +}) ``` -### Filter by class with LanceDB +If the values you want to attach already live in another table (offline labels from a stronger model, classifier predictions, per-row confidence scores), merge them in by joining on the `id` column: + +```python +import pyarrow as pa + +predictions = pa.table({ + "id": pa.array([0, 1, 2], type=pa.int64()), + "pred_label": pa.array([9, 0, 3], type=pa.int32()), + "pred_conf": pa.array([0.94, 0.81, 0.77]), +}) +tbl.merge(predictions, left_on="id") +``` + +The original columns and indices are untouched, so existing code that does not reference the new columns continues to work unchanged. New columns become visible to every reader as soon as the operation commits. For column values that require a Python computation (e.g., running a second image encoder over the inline PNG bytes), Lance provides a batch-UDF API in the underlying library — see the [Lance data evolution docs](https://lance.org/guide/data_evolution/) for that pattern. + +## Train + +Projection lets a training loop read only the columns each step actually needs. LanceDB tables expose this through `Permutation.identity(tbl).select_columns([...])`, which plugs straight into the standard `torch.utils.data.DataLoader` so prefetch, shuffling, and batching behave as in any PyTorch pipeline. Columns added in the Evolve section above cost nothing per batch until they are explicitly projected. 
```python import lancedb +from lancedb.permutation import Permutation +from torch.utils.data import DataLoader db = lancedb.connect("hf://datasets/lance-format/fashion-mnist-lance/data") tbl = db.open_table("train") -sneakers = tbl.search().where("label_name = 'Sneaker'").select(["id"]).limit(5).to_list() + +train_ds = Permutation.identity(tbl).select_columns(["image", "label"]) +loader = DataLoader(train_ds, batch_size=256, shuffle=True, num_workers=4) + +for batch in loader: + # batch carries only the projected columns; image_emb stays on disk. + # decode the PNG bytes, normalize to [0, 1], forward, backward... + ... ``` -## Why Lance? +Switching feature sets is a configuration change: passing `["image_emb", "label"]` to `select_columns(...)` on the next run skips PNG decoding entirely and reads only the cached 512-d vectors, which is the right shape for training a linear probe or a lightweight reranker on top of frozen CLIP features. + +## Versioning + +Every mutation to a Lance dataset, whether it adds a column, merges labels, or builds an index, commits a new version. Previous versions remain intact on disk. You can list versions and inspect the history directly from the Hub copy; creating new tags requires a local copy since tags are writes. 
+ +```python +import lancedb + +db = lancedb.connect("hf://datasets/lance-format/fashion-mnist-lance/data") +tbl = db.open_table("train") + +print("Current version:", tbl.version) +print("History:", tbl.list_versions()) +print("Tags:", tbl.tags.list()) +``` + +Once you have a local copy, tag a version for reproducibility: + +```python +local_db = lancedb.connect("./fashion-mnist-lance/data") +local_tbl = local_db.open_table("train") +local_tbl.tags.create("clip-vitb32-v1", local_tbl.version) +``` + +A tagged version can be opened by name, or any version reopened by its number, against either the Hub copy or a local one: + +```python +tbl_v1 = db.open_table("train", version="clip-vitb32-v1") +tbl_v5 = db.open_table("train", version=5) +``` + +Pinning supports two workflows. A retrieval system locked to `clip-vitb32-v1` keeps returning stable results while the dataset evolves in parallel; newly added prediction columns or relabelings do not change what the tag resolves to. A training experiment pinned to the same tag can be rerun later against the exact same images and labels, so changes in metrics reflect model changes rather than data drift. Neither workflow needs shadow copies or external manifest tracking. + +## Materialize a subset + +Reads from the Hub are lazy, so exploratory queries only transfer the columns and row groups they touch. Mutating operations (Evolve, tag creation) need a writable backing store, and a training loop benefits from a local copy with fast random access. Both can be served by a subset of the dataset rather than the full split. The pattern is to stream a filtered query through `.to_batches()` into a new local table; only the projected columns and matching row groups cross the wire, and the bytes never fully materialize in Python memory. 
+ +```python +import lancedb + +remote_db = lancedb.connect("hf://datasets/lance-format/fashion-mnist-lance/data") +remote_tbl = remote_db.open_table("train") + +batches = ( + remote_tbl.search() + .where("label_name IN ('Shirt', 'T-shirt_top', 'Pullover')") + .select(["id", "image", "label", "label_name", "image_emb"]) + .to_batches() +) + +local_db = lancedb.connect("./fashion-mnist-upper-body") +local_db.create_table("train", batches) +``` -- One dataset for images + embeddings + indices + metadata — no sidecar files. -- On-disk vector and FTS indices live next to the data, so search works on local copies and the Hub. -- Schema evolution: add new columns (model predictions, fresh embeddings, augmentations) without rewriting the data. +The resulting `./fashion-mnist-upper-body` is a first-class LanceDB database. Every snippet in the Evolve, Train, and Versioning sections above works against it by swapping `hf://datasets/lance-format/fashion-mnist-lance/data` for `./fashion-mnist-upper-body`. ## Source & license diff --git a/docs/datasets/fineweb-edu.mdx b/docs/datasets/fineweb-edu.mdx index 97f36df..b45535a 100644 --- a/docs/datasets/fineweb-edu.mdx +++ b/docs/datasets/fineweb-edu.mdx @@ -1,7 +1,7 @@ --- title: "FineWeb-Edu" sidebarTitle: "FineWeb-Edu" -description: "FineWeb-edu dataset with over 1.5 billion rows. Each passage ships with cleaned text, metadata, and 384-dim text embeddings for retrieval-heavy workloads." +description: "A Lance-formatted version of FineWeb-Edu — over 1.5 billion educational web passages with cleaned text, source metadata, language detection signals, and 384-dim text embeddings — available directly from the Hub at…" --- -FineWeb-edu dataset with over 1.5 billion rows. Each passage ships with cleaned text, metadata, and 384-dim text embeddings for retrieval-heavy workloads. 
+A Lance-formatted version of [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) — **over 1.5 billion educational web passages** with cleaned text, source metadata, language detection signals, and 384-dim text embeddings — available directly from the Hub at `hf://datasets/lance-format/fineweb-edu/data/train.lance`. +## Key features -## Load via `datasets.load_dataset` +- **Cleaned passage text** in the `text` column with the source `url` and `title` carried alongside. +- **Language detection signals** (`language`, `language_probability`) for filtered subsets. +- **Pre-computed 384-dim text embeddings** in `text_embedding`, ready for ANN search once an index is built locally. +- **One columnar dataset** — scan metadata cheaply, project just the columns each query needs, defer the heavy `text` and `text_embedding` reads to the rows that matter. + +> **No pre-built indices on the Hub copy yet.** At 1.5 B+ rows the on-disk indices are too large to ship comfortably alongside the data on the Hub. The Search, Curate, Evolve, and Train sections below describe the same APIs you'd use against a fully indexed dataset, but vector and full-text examples assume a local copy with `IVF_PQ` and `INVERTED` indices built once after download. See the Materialize-a-subset section at the end for a focused-subset workflow that makes indexing tractable. 
+ +## Splits + +`train.lance` + +## Schema + +| Column | Type | Notes | +|---|---|---| +| `text` | `string` | Cleaned passage body | +| `title` | `string` | Page or article title when available | +| `url` | `string` | Canonical source URL | +| `language` | `string` | Detected language code (e.g., `en`) | +| `language_probability` | `float32` | Confidence of the language detector | +| `text_embedding` | `fixed_size_list` | Passage embedding for retrieval | +| *FineWeb-Edu quality metadata* | — | Heuristic scores and length statistics carried over from the upstream corpus | + +## Pre-built indices + +None bundled at present. Build the recommended indices on a local copy: ```python -import datasets +import lancedb + +db = lancedb.connect("./fineweb-edu/data") +tbl = db.open_table("train") -hf_ds = datasets.load_dataset( - "lance-format/fineweb-edu", - split="train", - streaming=True, +tbl.create_index( + metric="cosine", + vector_column_name="text_embedding", + index_type="IVF_PQ", + num_partitions=2048, + num_sub_vectors=96, ) -# Take first three rows and print titles -for row in hf_ds.take(3): - print(row["title"]) +tbl.create_fts_index("text", replace=True) ``` -Use Lance's native connector when you need ANN search, FTS, or direct access to embeddings while still pointing to the copy hosted on Hugging Face: +Both indices live next to the data, so subsequent queries against the same local path pick them up automatically. + +## Why Lance? + +1. **Blazing Fast Random Access**: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation. +2. **Native Multimodal Support**: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search. +3. 
**Native Index Support**: Lance comes with fast, on-disk, scalable vector and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them. +4. **Efficient Data Evolution**: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time. +5. **Versatile Querying**: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes. +6. **Data Versioning**: Every mutation commits a new version; previous versions remain intact on disk. Tags pin a snapshot by name, so retrieval systems and training runs can reproduce against an exact slice of history. + +## Load with `datasets.load_dataset` + +You can load Lance datasets via the standard HuggingFace `datasets` interface, suitable when your pipeline already speaks `Dataset` / `IterableDataset` or you want a quick streaming sample. ```python -import lance +import datasets -ds = lance.dataset("hf://datasets/lance-format/fineweb-edu/data/train.lance")print(f"Total passages: {ds.count_rows():,}") +hf_ds = datasets.load_dataset("lance-format/fineweb-edu", split="train", streaming=True) +for row in hf_ds.take(3): + print(row["title"] or row["url"]) ``` -These tables can also be consumed by [LanceDB](https://lancedb.github.io/lancedb/), the multimodal lakehouse and embedded search library built on top of Lance, for simplified vector search and other queries. +## Load with LanceDB + +LanceDB is the embedded retrieval library built on top of the Lance format ([docs](https://lancedb.com/docs)), and is the interface most users interact with. It wraps the dataset as a queryable table with search and filter builders, and is the entry point used by the Search, Curate, Evolve, Versioning, and Materialize-a-subset sections below. 
```python import lancedb db = lancedb.connect("hf://datasets/lance-format/fineweb-edu/data") tbl = db.open_table("train") -print(f"LanceDB table opened with {len(tbl)} passages") +print(len(tbl)) ``` +## Load with Lance +`pylance` is the Python binding for the Lance format and works directly with the format's lower-level APIs. Reach for it when you want to inspect dataset internals — schema, scanner, fragments, the list of pre-built indices. -> The dataset hosted on Hugging Face Hub does **not** currently have pre-built ANN (vector) or FTS (full-text search) indices. -> +```python +import lance -> - For any search or similarity workloads, you should download the dataset locally and build indices yourself. -> +ds = lance.dataset("hf://datasets/lance-format/fineweb-edu/data/train.lance") +print(ds.count_rows(), ds.schema.names) +print(ds.list_indices()) +``` + +> **Tip — for production use, download locally first.** Streaming from the Hub works for exploration, but at 1.5 B+ rows random access and any kind of search are dramatically faster against a local copy, and ANN / FTS require local indices anyway: > ```bash -> # Download once -> huggingface-cli download lance-format/fineweb-edu --repo-type dataset --local-dir ./fineweb-edu -> -> # Then load locally and build indices -> import lance -> ds = lance.dataset("./fineweb-edu") -> # ds.create_index(...) +> hf download lance-format/fineweb-edu --repo-type dataset --local-dir ./fineweb-edu > ``` -> +> Then point Lance or LanceDB at `./fineweb-edu/data`. For most workflows, the Materialize-a-subset section is a better starting point than downloading the full 1.5 B-row corpus. +## Search -## Why Lance? +Once an `IVF_PQ` index exists on `text_embedding`, dense retrieval is a single call. In production you would encode a query string through the same 384-dim text encoder used at ingest and pass the resulting vector to `tbl.search(...)`. The example below uses the embedding from row 42 as a runnable stand-in. 
+ +```python +import lancedb -- Optimized for AI workloads: Lance keeps multimodal data and vector search-ready storage in the same columnar format designed for accelerator-era retrieval (see [lance.org](https://lance.org)). -- Images + embeddings + metadata travel as one tabular dataset. -- On-disk, scalable ANN index means -- Schema evolution lets you add new features/columns (moderation tags, embeddings, etc.) without rewriting the raw data. +db = lancedb.connect("./fineweb-edu/data") # local copy with the indices from the section above +tbl = db.open_table("train") + +seed = ( + tbl.search() + .select(["text_embedding", "url"]) + .limit(1) + .offset(42) + .to_list()[0] +) +hits = ( + tbl.search(seed["text_embedding"]) + .metric("cosine") + .where("language = 'en' AND language_probability > 0.9", prefilter=True) + .select(["title", "url", "text"]) + .limit(10) + .to_list() +) +for r in hits: + print(f"{r['url']}\n {(r['title'] or '')[:80]}") +``` -## Quick Start (Lance Python) +The result set carries only the projected columns. The `text_embedding` vector is never read on the result side, and the `text` body is fetched only for the ten passages that actually came back, keeping the working set small even though the corpus is enormous. + +Because the recommended setup also builds an `INVERTED` index on `text`, the same query can be issued as a hybrid search that combines the dense vector with a keyword query. LanceDB merges the two result lists and reranks them in a single call, which is useful when a phrase must literally appear in the passage but the dense side still does most of the ranking. 
```python -import lance -import pyarrow as pa +hybrid_hits = ( + tbl.search(query_type="hybrid") + .vector(seed["text_embedding"]) + .text("quantum computing") + .where("language = 'en'", prefilter=True) + .select(["title", "url", "text"]) + .limit(10) + .to_list() +) +for r in hybrid_hits: + print(f"{r['url']}\n {(r['title'] or '')[:80]}") +``` -lance_ds = lance.dataset("hf://datasets/lance-format/fineweb-edu/data/train.lance") +Tune `metric`, `nprobes`, and `refine_factor` on the vector side to trade recall against latency. -# Browse titles & language without touching embeddings -rows = lance_ds.scanner( - columns=["title", "language"], - limit=5 -).to_table().to_pylist() +## Curate -# Vector similarity from the on-dataset ANN index -ref = lance_ds.take([0], columns=["text_embedding", "title"]) -query_vec = pa.array([ref.to_pylist()[0]["text_embedding"]], - type=ref.schema.field("text_embedding").type) +A typical curation pass over a web corpus starts with a metadata filter — pick high-confidence English, drop short or low-quality fragments, restrict to a domain — before any text gets read. Stacking predicates inside a single filtered scan keeps the result small and explicit, and the bounded `.limit(1000)` makes it cheap to inspect. -results = lance_ds.scanner( - nearest={ - "column": "text_embedding", - "q": query_vec[0], - "k": 5, - "nprobes": 8, - "refine_factor": 20, - }, - columns=["title", "language", "text"], -).to_table().to_pylist() -``` +```python +import lancedb -> **Hugging Face Streaming Note** -> - Streaming uses conservative ANN parameters (`nprobes`, `refine_factor`) to stay within HF rate limits. -> - Prefer local copies (`huggingface-cli download lance-format/fineweb-edu --local-dir ./fineweb`) for heavy workloads, then point Lance at `./fineweb`. 
+db = lancedb.connect("hf://datasets/lance-format/fineweb-edu/data") +tbl = db.open_table("train") -## Dataset Schema +candidates = ( + tbl.search() + .where( + "language = 'en' " + "AND language_probability > 0.95 " + "AND length(text) >= 1000", + prefilter=True, + ) + .select(["url", "title", "language_probability"]) + .limit(1000) + .to_list() +) +print(f"{len(candidates)} candidates; first url: {candidates[0]['url']}") +``` -Common columns you'll find in this Lance dataset: -- `text` – cleaned passage content. -- `title` – page/article title when available. -- `url` – canonical source URL. -- `language` + `language_probability` – detector outputs for filtering. -- Quality metadata from FineWeb-Edu (e.g., heuristic scores or length stats). -- `text_embedding` – 384-dimension float32 vector for retrieval. +The result is a plain list of dictionaries, ready to inspect, persist as a manifest of URLs, or hand to the Materialize-a-subset section below for export to a writable local copy. The `text_embedding` vector is never read and only three small metadata columns come back, so the 1000-row result itself is a few kilobytes even though the underlying table is in the billions. The `length(text)` predicate does read the `text` column for the rows it evaluates; drop it, or filter on a precomputed length column such as the `text_length` added in Evolve, when the scan must stay metadata-only. -## Usage Examples -> **Search snippets for reference** -> The vector/FTS examples below show the Lance APIs you’ll use once indexes are available. The hosted dataset doesn’t yet ship ANN/FTS indexes—download locally (or build indexes yourself) before running them. Pre-built indexes are coming soon. +## Evolve -### 1. 
Sample documents without embeddings +> **Note:** Mutations require a local copy of the dataset, since the Hub mount is read-only. See the Materialize-a-subset section at the end of this card for a streaming pattern that downloads only the rows and columns you need, or use `hf download` to pull a larger slice first. ```python -scanner = ds.scanner( - columns=["title", "language", "text"], - filter="language = 'en'", - limit=5, -) -for doc in scanner.to_table().to_pylist(): - print(doc["title"], doc["language"]) - print(doc["text"][:200], "...\n") +import lancedb + +db = lancedb.connect("./fineweb-edu/data") # local copy required for writes +tbl = db.open_table("train") + +tbl.add_columns({ + "text_length": "length(text)", + "long_passage": "length(text) >= 1000", +}) ``` -### 2. Vector search for semantically similar passages +If the values you want to attach already live in another table (offline labels, topic classifications, alternate embeddings from a stronger model), merge them in by joining on `url`: ```python -ref_doc = ds.take([123], columns=["text_embedding", "title", "text"]).to_pylist()[0] -emb_type = ds.to_table(columns=["text_embedding"], limit=1).schema.field("text_embedding").type -query = pa.array([ref_doc["text_embedding"]], type=emb_type) - -neighbors = ds.scanner( - nearest={ - "column": "text_embedding", - "q": query[0], - "k": 6, - "nprobes": 8, - "refine_factor": 20, - }, - columns=["title", "language", "text"], -).to_table().to_pylist()[1:] -``` - -### LanceDB Vector Search +import pyarrow as pa + +labels = pa.table({ + "url": pa.array(["https://example.com/a", "https://example.com/b"]), + "topic": pa.array(["math", "history"]), +}) +tbl.merge(labels, left_on="url") +``` + +The original columns and indices are untouched, so existing code that does not reference the new columns continues to work unchanged. New columns become visible to every reader as soon as the operation commits. 
For column values that require a Python computation (e.g., running a different embedding model over the text), Lance provides a batch-UDF API — see the [Lance data evolution docs](https://lance.org/guide/data_evolution/). + +## Train + +Projection lets a training loop read only the columns each step actually needs. LanceDB tables expose this through `Permutation.identity(tbl).select_columns([...])`, which plugs straight into the standard `torch.utils.data.DataLoader` so prefetching, shuffling, and batching behave as in any PyTorch pipeline. For language-model pretraining the natural projection is just the `text` column; for a retrieval probe or a reranker on top of frozen features, project the precomputed embedding instead. + ```python import lancedb +from lancedb.permutation import Permutation +from torch.utils.data import DataLoader -db = lancedb.connect("hf://datasets/lance-format/fineweb-edu/data") +db = lancedb.connect("./fineweb-edu/data") tbl = db.open_table("train") -# Get a passage to use as a query -ref_passage = tbl.limit(1).offset(123).select(["text_embedding", "text"]).to_pandas().to_dict('records')[0] -query_embedding = ref_passage["text_embedding"] +train_ds = Permutation.identity(tbl).select_columns(["text"]) +loader = DataLoader(train_ds, batch_size=64, shuffle=True, num_workers=8) -results = tbl.search(query_embedding) \ - .limit(5) \ - .to_list() +for batch in loader: + # batch carries only the projected columns; tokenize, forward, backward... + ... ``` -### 3. Full-text search with Lance FTS +Switching feature sets is a configuration change: passing `["text_embedding"]` to `select_columns(...)` on the next run reads only the 384-d vectors and skips the text body entirely, which is the right shape for training a lightweight retrieval head on cached embeddings. Columns added in Evolve cost nothing per batch until they are explicitly projected. 
-```python -hits = ds.scanner( - full_text_query="quantum computing", - columns=["title", "language", "text"], - limit=10, - fast_search=True, -).to_table().to_pylist() -``` +## Versioning + +Every mutation to a Lance dataset, whether it adds a column, merges labels, or builds an index, commits a new version. Previous versions remain intact on disk. You can list versions and inspect the history directly from the Hub copy; creating new tags requires a local copy since tags are writes. -### LanceDB Full-Text Search ```python import lancedb db = lancedb.connect("hf://datasets/lance-format/fineweb-edu/data") tbl = db.open_table("train") -results = tbl.search("quantum computing") \ - .select(["title", "language", "text"]) \ - .limit(10) \ - .to_list() +print("Current version:", tbl.version) +print("History:", tbl.list_versions()) +print("Tags:", tbl.tags.list()) ``` +Once you have a local copy, tag a version for reproducibility: -See `fineweb_edu/example.py` on lance-huggingface repo for a complete walkthrough that combines HF streaming batches with Lance-powered retrieval. - -## Dataset Evolution +```python +local_db = lancedb.connect("./fineweb-edu/data") +local_tbl = local_db.open_table("train") +local_tbl.tags.create("english-v1", local_tbl.version) +``` -Lance supports flexible schema and data evolution ([docs](https://lance.org/guide/data_evolution/?h=evol)). You can add/drop columns, backfill with SQL or Python, rename fields, or change data types without rewriting the whole dataset. In practice this lets you: -- Introduce fresh metadata (moderation labels, embeddings, quality scores) as new signals become available. -- Add new columns to existing datasets without re-exporting terabytes of video. -- Adjust column names or shrink storage (e.g., cast embeddings to float16) while keeping previous snapshots queryable for reproducibility. 
+A tagged version can be opened by name, or any version reopened by its number, against either the Hub copy or a local one: ```python -import lance -import pyarrow as pa -import numpy as np +tbl_v1 = db.open_table("train", version="english-v1") +tbl_v5 = db.open_table("train", version=5) +``` -# Assume ds is a local Lance dataset -# ds = lance.dataset("./fineweb_edu_local") +Pinning supports two workflows. A retrieval system locked to `english-v1` keeps returning stable results while the dataset evolves in parallel — newly added embeddings or labels do not change what the tag resolves to. A training experiment pinned to the same tag can be rerun later against the exact same passages, so changes in metrics reflect model changes rather than data drift. Neither workflow needs shadow copies or external manifest tracking. -base = pa.table({"id": pa.array([1, 2, 3]), "text": pa.array(["A", "B", "C"])}) -dataset = lance.write_dataset(base, "fineweb_evolution", mode="overwrite") +## Materialize a subset -# 1. Add a schema-only column (data to be added later) -dataset.add_columns(pa.field("subject", pa.string())) +At 1.5 B+ rows, very few workflows want the full corpus on local disk. The practical entry point is to stream a filtered query through `.to_batches()` into a new local table; only the projected columns and matching row groups cross the wire, and the bytes never fully materialize in Python memory. The result is a writable LanceDB database scoped to the rows that actually matter for the downstream task, sized to index and iterate cheaply. -# 2. Add a column with data -dataset.add_columns({"quality_bucket": "'unknown'"}) +```python +import lancedb -# 3. 
Generate rich columns via Python batch UDFs -@lance.batch_udf() -def random_embedding(batch): - vecs = np.random.rand(batch.num_rows, 384).astype("float32") - return pa.RecordBatch.from_arrays( - [pa.FixedSizeListArray.from_arrays(vecs.ravel(), 384)], - names=["text_embedding"], +remote_db = lancedb.connect("hf://datasets/lance-format/fineweb-edu/data") +remote_tbl = remote_db.open_table("train") + +batches = ( + remote_tbl.search() + .where( + "language = 'en' " + "AND language_probability > 0.95 " + "AND length(text) >= 1000" ) + .select(["url", "title", "text", "language", "language_probability", "text_embedding"]) + .to_batches() +) -dataset.add_columns(random_embedding) +local_db = lancedb.connect("./fineweb-edu-en") +local_db.create_table("train", batches) +``` + +The resulting `./fineweb-edu-en` is a first-class LanceDB database. Build the recommended indices on it once (the same `create_index` / `create_fts_index` calls shown in the Pre-built indices section, pointed at the local path), and every snippet in the Search, Evolve, Train, and Versioning sections above works against it by swapping `hf://datasets/lance-format/fineweb-edu/data` for `./fineweb-edu-en`. + +## Source & license -# 4. Bring in annotations with merge -labels = pa.table({"id": pa.array([1, 2, 3]), "label": pa.array(["math", "history", "science"])}) -dataset.merge(labels, "id") +Converted from [`HuggingFaceFW/fineweb-edu`](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu). FineWeb-Edu is distributed under [ODC-BY 1.0](https://opendatacommons.org/licenses/by/1-0/); individual document content remains subject to the rights of the original publishers. Review the [upstream dataset card](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) before downstream use. -# 5. 
Rename or cast columns as needs change -dataset.alter_columns({"path": "subject", "name": "topic"}) -dataset.alter_columns({"path": "text_embedding", "data_type": pa.list_(pa.float16(), 384)}) +## Citation + +``` +@misc{lozhkov2024finewebedu, + title = {FineWeb-Edu: the Finest Collection of Educational Content the Web Has to Offer}, + author = {Lozhkov, Anton and Ben Allal, Loubna and von Werra, Leandro and Wolf, Thomas}, + year = {2024}, + url = {https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu} +} ``` -You can iterate on embeddings, quality tags, or moderation fields while keeping earlier dataset versions available for reproducible experiments. diff --git a/docs/datasets/flickr30k.mdx b/docs/datasets/flickr30k.mdx index 154837d..8b3091d 100644 --- a/docs/datasets/flickr30k.mdx +++ b/docs/datasets/flickr30k.mdx @@ -1,7 +1,7 @@ --- title: "Flickr30k" sidebarTitle: "Flickr30k" -description: "Lance-formatted version of Flickr30k (re-distributed via lmms-lab/flickr30k) — 31,783 images, each paired with 5 human-written captions, with CLIP image and text embeddings stored inline and pre-built ANN indices on both." +description: "A Lance-formatted version of Flickr30k, redistributed via lmms-lab/flickr30k. Each row is one image with 5 human-written captions, a cosine-normalized CLIP image embedding, and a cosine-normalized CLIP text embedding of the canonical caption — all…" --- -Lance-formatted version of [Flickr30k](https://shannon.cs.illinois.edu/DenotationGraph/) (re-distributed via [`lmms-lab/flickr30k`](https://huggingface.co/datasets/lmms-lab/flickr30k)) — **31,783 images, each paired with 5 human-written captions**, with CLIP image **and** text embeddings stored inline and pre-built ANN indices on both. +A Lance-formatted version of [Flickr30k](https://shannon.cs.illinois.edu/DenotationGraph/), redistributed via [`lmms-lab/flickr30k`](https://huggingface.co/datasets/lmms-lab/flickr30k). 
Each row is one image with **5 human-written captions**, a cosine-normalized CLIP image embedding, and a cosine-normalized CLIP text embedding of the canonical caption — all stored inline and available directly from the Hub at `hf://datasets/lance-format/flickr30k-lance/data`. ## Key features -- **Inline images** — full JPEG bytes per row. -- **Pre-computed CLIP embeddings** for both image and caption text — `IVF_PQ` indices on both columns let you do cross-modal retrieval (image→caption or caption→image) without any model at query time. -- **Full-text inverted index** on the canonical caption. -- Self-contained: no sidecar files or external image downloads. +- **Inline JPEG bytes** in the `image` column — no sidecar files, no image folders. +- **Paired CLIP embeddings in the same row** — `image_emb` and `text_emb` (ViT-B/32, 512-dim, cosine-normalized) — so cross-modal retrieval is one indexed lookup. +- **All 5 raw captions kept in `captions`** alongside a `caption` canonical string used for full-text search. +- **Pre-built ANN, FTS, and scalar indices** covering both embedding columns, the canonical caption, and `image_id`. + +## Splits + +| Split | Rows | Notes | +|-------|------|-------| +| `train.lance` | 31,783 | All Flickr30k images; the `lmms-lab/flickr30k` redistribution merges the original train/val/test labels into a single split | ## Schema | Column | Type | Notes | |---|---|---| -| `id` | `int64` | Row index | +| `id` | `int64` | Row index within split (natural join key) | | `image` | `large_binary` | Inline JPEG bytes | | `image_id` | `string` | Original Flickr image id | -| `filename` | `string` | Original filename (e.g. `1000092795.jpg`) | +| `filename` | `string?` | Original filename (e.g. 
`1000092795.jpg`) | | `captions` | `list` | All 5 captions for the image | -| `caption` | `string` | First caption — used as canonical text for FTS / quick browsing | +| `caption` | `string` | First caption — canonical text used for FTS | | `image_emb` | `fixed_size_list` | CLIP image embedding (cosine-normalized) | | `text_emb` | `fixed_size_list` | CLIP text embedding of the canonical caption | ## Pre-built indices -- `IVF_PQ` on `image_emb` — `metric=cosine` -- `IVF_PQ` on `text_emb` — `metric=cosine` (cross-modal retrieval works out of the box) -- `INVERTED` on `caption` -- `BTREE` on `image_id` +- `IVF_PQ` on `image_emb` — image-side vector search (cosine) +- `IVF_PQ` on `text_emb` — text-side vector search (cosine) +- `INVERTED` (FTS) on `caption` — keyword and hybrid search +- `BTREE` on `image_id` — fast lookup by Flickr image id -## Splits +## Why Lance? -A single `train.lance` table containing all 31,783 rows (the `lmms-lab/flickr30k` redistribution exposes them as a single split). The original train/val/test labels are not preserved in the source parquet. +1. **Blazing Fast Random Access**: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation. +2. **Native Multimodal Support**: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search. +3. **Native Index Support**: Lance comes with fast, on-disk, scalable vector and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them. +4. **Efficient Data Evolution**: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time. +5. 
**Versatile Querying**: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes. +6. **Data Versioning**: Every mutation commits a new version; previous versions remain intact on disk. Tags pin a snapshot by name, so retrieval systems and training runs can reproduce against an exact slice of history. -## Load with Lance +## Load with `datasets.load_dataset` + +You can load Lance datasets via the standard HuggingFace `datasets` interface, suitable when your pipeline already speaks `Dataset` / `IterableDataset` or you want a quick streaming sample. ```python -import lance +import datasets -ds = lance.dataset("hf://datasets/lance-format/flickr30k-lance/data/train.lance") -print(ds.count_rows(), ds.schema.names, ds.list_indices()) +hf_ds = datasets.load_dataset("lance-format/flickr30k-lance", split="train", streaming=True) +for row in hf_ds.take(3): + print(row["caption"]) ``` ## Load with LanceDB -These tables can also be consumed by [LanceDB](https://lancedb.github.io/lancedb/), the multimodal lakehouse and embedded search library built on top of Lance, for simplified vector search and other queries. +LanceDB is the embedded retrieval library built on top of the Lance format ([docs](https://lancedb.com/docs)), and is the interface most users interact with. It wraps the dataset as a queryable table with search and filter builders, and is the entry point used by the Search, Curate, Evolve, Train, Versioning, and Materialize-a-subset sections below. ```python import lancedb db = lancedb.connect("hf://datasets/lance-format/flickr30k-lance/data") tbl = db.open_table("train") -print(f"LanceDB table opened with {len(tbl)} image-caption pairs") +print(len(tbl)) ``` -## Cross-modal text→image search +## Load with Lance + +`pylance` is the Python binding for the Lance format and works directly with the format's lower-level APIs. 
Reach for it when you want to inspect dataset internals — schema, scanner, fragments, and the list of pre-built indices. ```python import lance -import pyarrow as pa -import open_clip -import torch - -# 1. Encode the query text once with the same CLIP model used at conversion. -model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="laion2b_s34b_b79k") -tokenizer = open_clip.get_tokenizer("ViT-B-32") -model = model.eval().cuda().half() -with torch.no_grad(): - q = model.encode_text(tokenizer(["a man surfing at sunset"]).cuda()) - q = (q / q.norm(dim=-1, keepdim=True)).float().cpu().numpy()[0] ds = lance.dataset("hf://datasets/lance-format/flickr30k-lance/data/train.lance") -emb_field = ds.schema.field("image_emb") -query = pa.array([q.tolist()], type=emb_field.type) - -# 2. Nearest-neighbour search against the image embedding index. -hits = ds.scanner( - nearest={"column": "image_emb", "q": query[0], "k": 10, "nprobes": 16, "refine_factor": 30}, - columns=["image_id", "caption"], -).to_table().to_pylist() -for h in hits: - print(h) +print(ds.count_rows(), ds.schema.names) +print(ds.list_indices()) ``` -### LanceDB cross-modal text→image search +> **Tip — for production use, download locally first.** Streaming from the Hub works for exploration, but heavy random access and ANN search are far faster against a local copy: +> ```bash +> hf download lance-format/flickr30k-lance --repo-type dataset --local-dir ./flickr30k-lance +> ``` +> Then point Lance or LanceDB at `./flickr30k-lance/data`. -```python -import lancedb, open_clip, torch +## Search + +The bundled `IVF_PQ` index on `image_emb` makes cross-modal text→image retrieval a single call: encode a text query with the same CLIP model used at ingest (ViT-B/32, cosine-normalized), then pass the resulting 512-d vector to `tbl.search(...)` and target `image_emb`. 
The example below uses the `text_emb` already stored in row 42 as a runnable stand-in for "the CLIP encoding of a caption", so the snippet works without any model loaded. -model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="laion2b_s34b_b79k") -tokenizer = open_clip.get_tokenizer("ViT-B-32") -model = model.eval().cuda().half() -with torch.no_grad(): - q = model.encode_text(tokenizer(["a man surfing at sunset"]).cuda()) - q = (q / q.norm(dim=-1, keepdim=True)).float().cpu().numpy()[0] +```python +import lancedb db = lancedb.connect("hf://datasets/lance-format/flickr30k-lance/data") tbl = db.open_table("train") -results = ( - tbl.search(q.tolist(), vector_column_name="image_emb") +seed = ( + tbl.search() + .select(["text_emb", "caption"]) + .limit(1) + .offset(42) + .to_list()[0] +) + +hits = ( + tbl.search(seed["text_emb"], vector_column_name="image_emb") .metric("cosine") .select(["image_id", "caption"]) .limit(10) .to_list() ) +print("query caption:", seed["caption"]) +for r in hits: + print(f" {r['image_id']:>12} {r['caption'][:70]}") ``` -## Image→caption (image-to-text retrieval) +Because OpenAI-style CLIP embeddings are normalized, cosine is the right metric and the first hit will typically be the source image itself — a useful sanity check. Swap `vector_column_name="image_emb"` for `text_emb` to do text→text retrieval against the canonical captions instead. + +Because the dataset also ships an `INVERTED` index on `caption`, the same query can be issued as a hybrid search that combines the dense vector with a keyword query. LanceDB merges the two result lists and reranks them in a single call, which is useful when a phrase like "dog playing in the snow" must literally appear in the caption but you still want CLIP to do the heavy lifting on visual similarity. 
```python -ds = lance.dataset("hf://datasets/lance-format/flickr30k-lance/data/train.lance") -ref = ds.take([0], columns=["image_emb", "caption"]).to_pylist()[0] -emb_field = ds.schema.field("text_emb") -query = pa.array([ref["image_emb"]], type=emb_field.type) -neighbors = ds.scanner( - nearest={"column": "text_emb", "q": query[0], "k": 10}, - columns=["caption"], -).to_table().to_pylist() +hybrid_hits = ( + tbl.search(query_type="hybrid", vector_column_name="image_emb") + .vector(seed["text_emb"]) + .text("dog playing in the snow") + .select(["image_id", "caption"]) + .limit(10) + .to_list() +) +for r in hybrid_hits: + print(f" {r['image_id']:>12} {r['caption'][:70]}") ``` -### LanceDB image→caption search +Tune `metric`, `nprobes`, and `refine_factor` on the vector side to trade recall against latency. + +## Curate + +A typical curation pass for a captioning or contrastive-training workflow combines a content filter on the captions with a structural filter on the row. Stacking both inside a single filtered scan keeps the result small and explicit, and the bounded `.limit(500)` makes it cheap to inspect before committing the subset to anything downstream. 
```python import lancedb @@ -140,61 +157,127 @@ import lancedb db = lancedb.connect("hf://datasets/lance-format/flickr30k-lance/data") tbl = db.open_table("train") -ref = tbl.search().limit(1).select(["image_emb", "caption"]).to_list()[0] -query_embedding = ref["image_emb"] - -results = ( - tbl.search(query_embedding, vector_column_name="text_emb") - .metric("cosine") - .select(["caption"]) - .limit(10) +candidates = ( + tbl.search("surfer OR surfboard OR wave") + .where("array_length(captions) = 5", prefilter=True) + .select(["image_id", "caption", "captions"]) + .limit(500) .to_list() ) +print(f"{len(candidates)} candidates; first caption: {candidates[0]['caption'][:80]}") ``` -## Full-text search on captions +The result is a plain list of dictionaries, ready to inspect, persist as a manifest of `image_id`s, or feed into the Evolve and Train workflows below. The `image` column is never read, so the network traffic for a 500-row candidate scan is dominated by caption text rather than JPEG bytes. + +## Evolve + +Lance stores each column independently, so a new column can be appended without rewriting the existing data. The lightest form is a SQL expression: derive the new column from columns that already exist, and Lance computes it once and persists it. The example below adds `num_captions` and a `long_caption` flag, either of which can then be used directly in `where` clauses without recomputing the predicate on every query. + +> **Note:** Mutations require a local copy of the dataset, since the Hub mount is read-only. See the Materialize-a-subset section at the end of this card for a streaming pattern that downloads only the rows and columns you need, or use `hf download` to pull the full split first. 
```python -import lance -ds = lance.dataset("hf://datasets/lance-format/flickr30k-lance/data/train.lance") -hits = ds.scanner( - full_text_query="dog playing in the snow", - columns=["image_id", "caption"], - limit=10, -).to_table().to_pylist() +import lancedb + +db = lancedb.connect("./flickr30k-lance/data") # local copy required for writes +tbl = db.open_table("train") + +tbl.add_columns({ + "num_captions": "array_length(captions)", + "long_caption": "length(caption) >= 80", +}) +``` + +If the values you want to attach already live in another table (offline labels, classifier predictions, an aesthetic or NSFW score, a second-pass caption from a different model), merge them in by joining on `image_id`: + +```python +import pyarrow as pa + +labels = pa.table({ + "image_id": pa.array(["1000092795", "10002456"]), + "scene_label": pa.array(["outdoor", "indoor"]), +}) +tbl.merge(labels, on="image_id") ``` -### LanceDB full-text search +The original columns and indices are untouched, so existing code that does not reference the new columns continues to work unchanged. New columns become visible to every reader as soon as the operation commits. For column values that require a Python computation (e.g., running a second CLIP variant over the image bytes), Lance provides a batch-UDF API — see the [Lance data evolution docs](https://lance.org/guide/data_evolution/). + +## Train + +Projection lets a training loop read only the columns each step actually needs. LanceDB tables expose this through `Permutation.identity(tbl).select_columns([...])`, which plugs straight into the standard `torch.utils.data.DataLoader` so prefetching, shuffling, and batching behave as in any PyTorch pipeline. For a CLIP-style contrastive run, project the JPEG bytes and a sampled caption; for a reranker or probe on top of frozen features, project the precomputed embeddings instead. 
```python import lancedb +from lancedb.permutation import Permutation +from torch.utils.data import DataLoader db = lancedb.connect("hf://datasets/lance-format/flickr30k-lance/data") tbl = db.open_table("train") -results = ( - tbl.search("dog playing in the snow") - .select(["image_id", "caption"]) - .limit(10) - .to_list() -) +train_ds = Permutation.identity(tbl).select_columns(["image", "caption"]) +loader = DataLoader(train_ds, batch_size=128, shuffle=True, num_workers=4) + +for batch in loader: + # batch carries only the projected columns; decode the JPEG bytes, + # tokenize the captions, encode, contrastive loss... + ... ``` -## Working with images +Switching feature sets is a configuration change: passing `["image_emb", "text_emb"]` to `select_columns(...)` on the next run skips JPEG decoding entirely and reads only the cached 512-d vectors, which is the right shape for training a lightweight reranker or a linear probe. + +## Versioning + +Every mutation to a Lance dataset, whether it adds a column, merges labels, or builds an index, commits a new version. Previous versions remain intact on disk. You can list versions and inspect the history directly from the Hub copy; creating new tags requires a local copy since tags are writes. ```python -from pathlib import Path -import lance -ds = lance.dataset("hf://datasets/lance-format/flickr30k-lance/data/train.lance") -row = ds.take([0], columns=["image", "filename"]).to_pylist()[0] -Path(row["filename"]).write_bytes(row["image"]) +import lancedb + +db = lancedb.connect("hf://datasets/lance-format/flickr30k-lance/data") +tbl = db.open_table("train") + +print("Current version:", tbl.version) +print("History:", tbl.list_versions()) +print("Tags:", tbl.tags.list()) ``` -## Why Lance? 
+Once you have a local copy, tag a version for reproducibility: + +```python +local_db = lancedb.connect("./flickr30k-lance/data") +local_tbl = local_db.open_table("train") +local_tbl.tags.create("clip-vitb32-v1", local_tbl.version) +``` + +A tagged version can be opened by name, or any version reopened by its number, against either the Hub copy or a local one: + +```python +tbl_v1 = db.open_table("train", version="clip-vitb32-v1") +tbl_v5 = db.open_table("train", version=5) +``` + +Pinning supports two workflows. A retrieval system locked to `clip-vitb32-v1` keeps returning stable results while the dataset evolves in parallel — newly added embeddings or labels do not change what the tag resolves to. A training experiment pinned to the same tag can be rerun later against the exact same images and captions, so changes in metrics reflect model changes rather than data drift. Neither workflow needs shadow copies or external manifest tracking. + +## Materialize a subset + +Reads from the Hub are lazy, so exploratory queries only transfer the columns and row groups they touch. Mutating operations (Evolve, tag creation) need a writable backing store, and a training loop benefits from a local copy with fast random access. Both can be served by a subset of the dataset rather than the full split. The pattern is to stream a filtered query through `.to_batches()` into a new local table; only the projected columns and matching row groups cross the wire, and the bytes never fully materialize in Python memory. 
+ +```python +import lancedb + +remote_db = lancedb.connect("hf://datasets/lance-format/flickr30k-lance/data") +remote_tbl = remote_db.open_table("train") + +batches = ( + remote_tbl.search("surfer OR surfboard OR wave") + .where("array_length(captions) = 5") + .select(["image_id", "image", "caption", "captions", "image_emb", "text_emb"]) + .to_batches() +) + +local_db = lancedb.connect("./flickr30k-surf-subset") +local_db.create_table("train", batches) +``` -- One dataset carries images + image embeddings + text embeddings + indices — no sidecar files. -- On-disk vector and full-text indices live next to the data, so search works on local copies and on the Hub. -- Schema evolution: add columns (new captions, alternate embeddings, moderation labels) without rewriting the data. +The resulting `./flickr30k-surf-subset` is a first-class LanceDB database. Every snippet in the Evolve, Train, and Versioning sections above works against it by swapping `hf://datasets/lance-format/flickr30k-lance/data` for `./flickr30k-surf-subset`. ## Source & license diff --git a/docs/datasets/food101.mdx b/docs/datasets/food101.mdx index d74eca0..e74981a 100644 --- a/docs/datasets/food101.mdx +++ b/docs/datasets/food101.mdx @@ -1,7 +1,7 @@ --- title: "Food-101" sidebarTitle: "Food-101" -description: "Lance-formatted version of Food-101 — 101,000 food photographs across 101 classes — sourced from ethz/food101. Inline JPEG bytes + CLIP image embeddings + IVF_PQ." +description: "A Lance-formatted version of Food-101, the fine-grained dish-classification benchmark of 101,000 photos spread evenly across 101 dish classes, sourced from ethz/food101. Each row carries the inline JPEG bytes, the integer label, the human-readable…" --- -Lance-formatted version of [Food-101](https://www.kaggle.com/datasets/dansbecker/food-101) — 101,000 food photographs across 101 classes — sourced from [`ethz/food101`](https://huggingface.co/datasets/ethz/food101). 
Inline JPEG bytes + CLIP image embeddings + IVF_PQ. +A Lance-formatted version of [Food-101](https://www.kaggle.com/datasets/dansbecker/food-101), the fine-grained dish-classification benchmark of 101,000 photos spread evenly across 101 dish classes, sourced from [`ethz/food101`](https://huggingface.co/datasets/ethz/food101). Each row carries the inline JPEG bytes, the integer `label`, the human-readable `label_name`, and a cosine-normalized CLIP image embedding, all available directly from the Hub at `hf://datasets/lance-format/food101-lance/data`. + +## Key features + +- **Inline JPEG bytes** in the `image` column — no sidecar files, no image folders. +- **Pre-computed CLIP image embeddings** (`image_emb`, OpenCLIP `ViT-B-32`, 512-dim, cosine-normalized) with a bundled `IVF_PQ` index for similarity search. +- **Both numeric and string labels** (`label`, `label_name`) so filters can target either the class id or the dish name without an external mapping table. +- **Scalar indices on both label columns** so class-based curation is a quick predicate rather than a full scan. 
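The "cosine-normalized" wording in the features above is taken here to mean that each `image_emb` vector has unit L2 norm, which is an assumption about the conversion rather than something this card spells out. Under that assumption the plain dot product of two embeddings already equals their cosine similarity. A small sketch with made-up 3-d vectors (the real column is 512-d):

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed from un-normalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy stand-ins for two raw embeddings; the values are illustrative only.
a_raw, b_raw = [3.0, 4.0, 0.0], [1.0, 2.0, 2.0]
a, b = l2_normalize(a_raw), l2_normalize(b_raw)

# After normalization, the dot product matches the cosine of the raw vectors.
assert math.isclose(dot(a, b), cosine(a_raw, b_raw))
```

This is why a cosine metric is the natural choice for the bundled `IVF_PQ` index: with unit-norm vectors, ranking by cosine and ranking by dot product give the same order.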
## Splits -| Split | Rows | -|-------|------| -| `train.lance` | 75,750 | -| `validation.lance` | 25,250 | +| Split | Rows | Notes | +|-------|------|-------| +| `train.lance` | 75,750 | Canonical Food-101 train split (750 images per class) | +| `validation.lance` | 25,250 | Canonical Food-101 test split (250 images per class) | ## Schema | Column | Type | Notes | |---|---|---| -| `id` | `int64` | Row index within split | -| `image` | `large_binary` | Inline JPEG bytes | -| `label` | `int32` | Class id (0-100) | -| `label_name` | `string` | One of 101 dish names (`apple_pie`, `baby_back_ribs`, …) | -| `image_emb` | `fixed_size_list` | OpenCLIP `ViT-B-32` embedding (cosine-normalized) | +| `id` | `int64` | Row index within split (natural join key for merges) | +| `image` | `large_binary` | Inline JPEG bytes (256x256, quality 92) | +| `label` | `int32` | Class id (0–100) | +| `label_name` | `string` | One of 101 dish names, underscore-spaced (`apple_pie`, `baby_back_ribs`, …) | +| `image_emb` | `fixed_size_list` | OpenCLIP `ViT-B-32` image embedding (cosine-normalized) | ## Pre-built indices -- `IVF_PQ` on `image_emb` — `metric=cosine` -- `BTREE` on `label` -- `BITMAP` on `label_name` +- `IVF_PQ` on `image_emb` — vector similarity search (cosine) +- `BTREE` on `label` — fast lookup by class id +- `BITMAP` on `label_name` — fast lookup by class name + +## Why Lance? + +1. **Blazing Fast Random Access**: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation. +2. **Native Multimodal Support**: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search. +3. 
**Native Index Support**: Lance comes with fast, on-disk, scalable vector and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them. +4. **Efficient Data Evolution**: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time. +5. **Versatile Querying**: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes. +6. **Data Versioning**: Every mutation commits a new version; previous versions remain intact on disk. Tags pin a snapshot by name, so retrieval systems and training runs can reproduce against an exact slice of history. + +## Load with `datasets.load_dataset` -## Quick start +You can load Lance datasets via the standard HuggingFace `datasets` interface, suitable when your pipeline already speaks `Dataset` / `IterableDataset` or you want a quick streaming sample without installing anything Lance-specific. ```python -import lance -ds = lance.dataset("hf://datasets/lance-format/food101-lance/data/validation.lance") -print(ds.count_rows(), ds.schema.names, ds.list_indices()) +import datasets + +hf_ds = datasets.load_dataset("lance-format/food101-lance", split="validation", streaming=True) +for row in hf_ds.take(3): + print(row["label_name"]) ``` ## Load with LanceDB -These tables can also be consumed by [LanceDB](https://lancedb.github.io/lancedb/), the multimodal lakehouse and embedded search library built on top of Lance, for simplified vector search and other queries. +LanceDB is the embedded retrieval library built on top of the Lance format ([docs](https://lancedb.com/docs)), and is the interface most users interact with. 
It wraps the dataset as a queryable table with search and filter builders, and is the entry point used by the Search, Curate, Evolve, Train, Versioning, and Materialize-a-subset sections below. ```python import lancedb db = lancedb.connect("hf://datasets/lance-format/food101-lance/data") tbl = db.open_table("validation") -print(f"LanceDB table opened with {len(tbl)} images") +print(len(tbl)) ``` -## Filter by class +## Load with Lance + +`pylance` is the Python binding for the Lance format and works directly with the format's lower-level APIs. Reach for it when you want to inspect or operate on dataset internals — schema, scanner, fragments, and the list of pre-built indices. ```python import lance + ds = lance.dataset("hf://datasets/lance-format/food101-lance/data/validation.lance") -sushi = ds.scanner(filter="label_name = 'sushi'", columns=["id"], limit=5).to_table() +print(ds.count_rows(), ds.schema.names) +print(ds.list_indices()) ``` -### Filter by class with LanceDB +> **Tip — for production use, download locally first.** Streaming from the Hub works for exploration, but heavy random access and ANN search are far faster against a local copy: +> ```bash +> hf download lance-format/food101-lance --repo-type dataset --local-dir ./food101-lance +> ``` +> Then point Lance or LanceDB at `./food101-lance/data`. + +## Search + +The bundled `IVF_PQ` index on `image_emb` makes approximate-nearest-neighbor search a single call. In production you would encode a query photo through the same OpenCLIP `ViT-B-32` model used at ingest and pass the resulting 512-d vector to `tbl.search(...)`. The example below uses the embedding stored in row 0 as a runnable stand-in so the snippet works without a model loaded; the first hit is expected to be the seed image itself, which is a useful sanity check on the index. 
```python import lancedb db = lancedb.connect("hf://datasets/lance-format/food101-lance/data") tbl = db.open_table("validation") -sushi = tbl.search().where("label_name = 'sushi'").select(["id"]).limit(5).to_list() + +seed = ( + tbl.search() + .select(["image_emb", "label_name"]) + .limit(1) + .to_list()[0] +) + +hits = ( + tbl.search(seed["image_emb"]) + .metric("cosine") + .select(["id", "label_name"]) + .limit(10) + .to_list() +) +print("seed dish:", seed["label_name"]) +for r in hits: + print(f" {r['id']:>6} {r['label_name']}") ``` -## Visual similarity search +Tune `metric`, `nprobes`, and `refine_factor` to trade recall against latency for your workload. + +## Curate + +A typical curation pass for a fine-grained classifier uses a class-based filter to assemble a small, focused candidate set. The `BITMAP` index on `label_name` keeps the predicate cheap to evaluate, and the bounded `.limit(200)` keeps the result small enough to inspect or hand off to a training run.
```python -import lance, pyarrow as pa -ds = lance.dataset("hf://datasets/lance-format/food101-lance/data/validation.lance") -emb_field = ds.schema.field("image_emb") -ref = ds.take([0], columns=["image_emb", "label_name"]).to_pylist()[0] -query = pa.array([ref["image_emb"]], type=emb_field.type) -neighbors = ds.scanner( - nearest={"column": "image_emb", "q": query[0], "k": 5, "nprobes": 16, "refine_factor": 30}, - columns=["id", "label_name"], -).to_table().to_pylist() +import lancedb + +db = lancedb.connect("hf://datasets/lance-format/food101-lance/data") +tbl = db.open_table("validation") + +candidates = ( + tbl.search() + .where("label_name IN ('sushi', 'sashimi', 'ramen')") + .select(["id", "label", "label_name"]) + .limit(200) + .to_list() +) +print(f"{len(candidates)} candidate Japanese-cuisine rows; first: {candidates[0]['label_name']}") ``` -### LanceDB visual similarity search +The result is a plain list of dictionaries, ready to inspect, persist as a manifest of `id`s, or feed into the Evolve and Train workflows below. The `image` and `image_emb` columns are never read by this query, so the network traffic is dominated by the small label fields rather than JPEG bytes or vectors. + +## Evolve + +Lance stores each column independently, so a new column can be appended without rewriting the existing data. The lightest form is a SQL expression: derive the new column from columns that already exist, and Lance computes it once and persists it. The example below adds an `is_target_dish` flag for a focused training run and an `is_dessert` flag as a coarse category bucket, either of which can then be used directly in `where` clauses without recomputing the predicate on every query. + +> **Note:** Mutations require a local copy of the dataset, since the Hub mount is read-only. See the Materialize-a-subset section at the end of this card for a streaming pattern that downloads only the rows and columns you need, or use `hf download` to pull the full split first.
+ +```python +import lancedb + +db = lancedb.connect("./food101-lance/data") # local copy required for writes +tbl = db.open_table("validation") + +tbl.add_columns({ + "is_target_dish": "label_name IN ('sushi', 'ramen', 'pho')", + "is_dessert": "label_name IN ('apple_pie', 'cheesecake', 'tiramisu', 'ice_cream', 'donuts')", +}) +``` + +If the values you want to attach already live in another table (offline labels, a second classifier's predictions, human-verified taste tags), merge them in by joining on `id`: + +```python +import pyarrow as pa + +predictions = pa.table({ + "id": pa.array([0, 1, 2]), + "model_v2_pred": pa.array(["sushi", "sashimi", "sushi"]), +}) +tbl.merge(predictions, on="id") +``` + +The original columns and indices are untouched, so existing code that does not reference the new columns continues to work unchanged. New columns become visible to every reader as soon as the operation commits. For column values that require a Python computation (e.g., running a second embedding model over the JPEG bytes), Lance provides a batch-UDF API in the underlying library — see the [Lance data evolution docs](https://lance.org/guide/data_evolution/) for that pattern. + +## Train + +Projection lets a training loop read only the columns each step actually needs. LanceDB tables expose this through `Permutation.identity(tbl).select_columns([...])`, which plugs straight into the standard `torch.utils.data.DataLoader` so prefetch, shuffling, and batching behave as in any PyTorch pipeline. For a from-scratch image classifier, project the JPEG bytes and the integer label; for a linear probe or reranker on top of frozen CLIP features, swap the projection to the embedding column and skip JPEG decoding entirely. 
+ +```python +import lancedb +from lancedb.permutation import Permutation +from torch.utils.data import DataLoader + +db = lancedb.connect("hf://datasets/lance-format/food101-lance/data") +tbl = db.open_table("train") + +train_ds = Permutation.identity(tbl).select_columns(["image", "label"]) +loader = DataLoader(train_ds, batch_size=128, shuffle=True, num_workers=4) + +for batch in loader: + # batch carries only the projected columns; image_emb stays on disk. + # decode the JPEG bytes, forward, cross-entropy against `label`... + ... +``` + +Switching feature sets is a configuration change: passing `["image_emb", "label"]` to `select_columns(...)` on the next run reads only the cached 512-d vectors and the label, which is the right shape for a linear probe. + +## Versioning + +Every mutation to a Lance dataset, whether it adds a column, merges predictions, or builds an index, commits a new version. Previous versions remain intact on disk. You can list versions and inspect the history directly from the Hub copy; creating new tags requires a local copy since tags are writes. 
```python import lancedb @@ -97,18 +213,51 @@ import lancedb db = lancedb.connect("hf://datasets/lance-format/food101-lance/data") tbl = db.open_table("validation") -ref = tbl.search().limit(1).select(["image_emb", "label_name"]).to_list()[0] -query_embedding = ref["image_emb"] +print("Current version:", tbl.version) +print("History:", tbl.list_versions()) +print("Tags:", tbl.tags.list()) +``` -results = ( - tbl.search(query_embedding) - .metric("cosine") - .select(["id", "label_name"]) - .limit(5) - .to_list() +Once you have a local copy, tag a version for reproducibility: + +```python +local_db = lancedb.connect("./food101-lance/data") +local_tbl = local_db.open_table("validation") +local_tbl.tags.create("clip-vitb32-v1", local_tbl.version) +``` + +A tagged version can be opened by name, or any version reopened by its number, against either the Hub copy or a local one: + +```python +tbl_v1 = db.open_table("validation", version="clip-vitb32-v1") +tbl_v5 = db.open_table("validation", version=5) +``` + +Pinning supports two workflows. A retrieval system locked to `clip-vitb32-v1` keeps returning stable results while the dataset evolves in parallel; newly added columns or labels do not change what the tag resolves to. A training experiment pinned to the same tag can be rerun later against the exact same images and labels, so changes in metrics reflect model changes rather than data drift. Neither workflow needs shadow copies or external manifest tracking. + +## Materialize a subset + +Reads from the Hub are lazy, so exploratory queries only transfer the columns and row groups they touch. Mutating operations (Evolve, tag creation) need a writable backing store, and a training loop benefits from a local copy with fast random access. Both can be served by a subset of the dataset rather than the full split. 
The pattern is to stream a filtered query through `.to_batches()` into a new local table; only the projected columns and matching row groups cross the wire, and the bytes never fully materialize in Python memory. + +```python +import lancedb + +remote_db = lancedb.connect("hf://datasets/lance-format/food101-lance/data") +remote_tbl = remote_db.open_table("train") + +batches = ( + remote_tbl.search() + .where("label_name IN ('sushi', 'sashimi', 'ramen', 'pho', 'dumplings')") + .select(["id", "image", "label", "label_name", "image_emb"]) + .to_batches() ) + +local_db = lancedb.connect("./food101-asian-subset") +local_db.create_table("train", batches) ``` +The resulting `./food101-asian-subset` is a first-class LanceDB database. Every snippet in the Evolve, Train, and Versioning sections above works against it by swapping `hf://datasets/lance-format/food101-lance/data` for `./food101-asian-subset`. + ## Source & license Converted from [`ethz/food101`](https://huggingface.co/datasets/ethz/food101). The Food-101 dataset is by Bossard et al. (ETH Zurich) — see the [original dataset page](https://data.vision.ee.ethz.ch/cvl/datasets_extra/food-101/) for licensing details. diff --git a/docs/datasets/gqa-testdev-balanced.mdx b/docs/datasets/gqa-testdev-balanced.mdx index fc383bb..a50df8b 100644 --- a/docs/datasets/gqa-testdev-balanced.mdx +++ b/docs/datasets/gqa-testdev-balanced.mdx @@ -1,7 +1,7 @@ --- title: "GQA testdev-balanced" sidebarTitle: "GQA testdev-balanced" -description: "Lance-formatted version of the canonical GQA testdev_balanced slice — 12,578 compositional VQA questions joined with the matching 398 images — sourced from lmms-lab/GQA." +description: "A Lance-formatted version of the canonical GQA testdev_balanced slice — 12,578 compositional VQA questions joined against the matching 398 images — sourced from lmms-lab/GQA. 
The original redistribution ships instructions and images as separate…" --- -Lance-formatted version of the canonical GQA `testdev_balanced` slice — 12,578 compositional VQA questions joined with the matching 398 images — sourced from [`lmms-lab/GQA`](https://huggingface.co/datasets/lmms-lab/GQA). +A Lance-formatted version of the canonical GQA `testdev_balanced` slice — 12,578 compositional VQA questions joined against the matching 398 images — sourced from [`lmms-lab/GQA`](https://huggingface.co/datasets/lmms-lab/GQA). The original redistribution ships instructions and images as separate parquet configs; here they are pre-joined on `image_id`, so each row carries the question text, the short answer, the GQA reasoning-program tags, paired CLIP image and question embeddings, and the inline JPEG bytes — all available directly from the Hub at `hf://datasets/lance-format/gqa-testdev-balanced-lance/data`. -`lmms-lab/GQA` exposes instructions and images as **separate parquet configs**; this Lance dataset joins them on `imageId`, so each row has the question, the answer, the GQA reasoning-program tags, *and* the image bytes inline. +## Key features + +- **Inline JPEG bytes** in the `image` column, duplicated across rows that share an `image_id` so each Q/A row is self-contained. +- **Paired CLIP embeddings in the same row** — `image_emb` and `question_emb` (512-dim, cosine-normalized) — for cross-modal retrieval as one indexed lookup. +- **Compositional reasoning metadata** — `structural`, `semantic`, and `detailed` question-type tags plus the `semantic_str` reasoning program. +- **Pre-built ANN, FTS, scalar, and bitmap indices** covering both embeddings, the question and short answer, the reasoning-type tags, and the image/question ids. 
## Splits @@ -22,57 +27,91 @@ Lance-formatted version of the canonical GQA `testdev_balanced` slice — 12,578 |-------|------|----------------| | `testdev.lance` | 12,578 | 398 | -> Train (`train_balanced_instructions` × `train_balanced_images`, ~943k Q's × 72k images, ~10 GB images) and val splits are not bundled by default — pass `--instr-config`/`--images-config` to `gqa/dataprep.py` to extend. +The `train_balanced` (~943k questions × 72k images) and `val_balanced` splits are not bundled by default; pass `--instr-config` / `--images-config` to `gqa/dataprep.py` to build them. ## Schema | Column | Type | Notes | |---|---|---| -| `id` | `int64` | Row index | -| `image` | `large_binary` | Inline JPEG bytes (image is duplicated across rows that share an `image_id`) | +| `id` | `int64` | Row index within split | +| `image` | `large_binary` | Inline JPEG bytes (duplicated across rows that share an `image_id`) | | `image_id` | `string` | GQA scene-graph image id | | `question_id` | `string` | GQA question id | | `question` | `string` | Compositional natural-language question | | `answers` | `list` | One-element list (the GQA short answer) | -| `answer` | `string` | Same short answer (canonical / FTS target) | -| `full_answer` | `string?` | Full sentence answer | +| `answer` | `string` | Canonical short answer (used for FTS) | +| `full_answer` | `string?` | Full-sentence answer | | `structural` | `string?` | One of `verify`, `query`, `compare`, `choose`, `logical` | | `semantic` | `string?` | One of `attr`, `cat`, `global`, `obj`, `rel` | | `detailed` | `string?` | Fine-grained type (e.g.
`weatherVerifyC`) | | `is_balanced` | `bool` | GQA balanced subset flag | -| `group_global` / `group_local` | `string?` | GQA reasoning-group ids | +| `group_global`, `group_local` | `string?` | GQA reasoning-group ids | | `semantic_str` | `string?` | Compact description of the reasoning program | | `image_emb` | `fixed_size_list` | CLIP image embedding (cosine-normalized) | | `question_emb` | `fixed_size_list` | CLIP text embedding of the question | ## Pre-built indices -- `IVF_PQ` on `image_emb` and `question_emb` — `metric=cosine` -- `INVERTED` (FTS) on `question` and `answer` -- `BITMAP` on `structural`, `semantic`, `detailed` -- `BTREE` on `image_id`, `question_id` +- `IVF_PQ` on `image_emb` — image-side vector search (cosine) +- `IVF_PQ` on `question_emb` — question-side vector search (cosine) +- `INVERTED` (FTS) on `question` and `answer` — keyword and hybrid search +- `BITMAP` on `structural`, `semantic`, `detailed` — fast categorical filters on the reasoning program +- `BTREE` on `image_id`, `question_id` — fast lookup by GQA id + +## Why Lance? + +1. **Blazing Fast Random Access**: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation. +2. **Native Multimodal Support**: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search. +3. **Native Index Support**: Lance comes with fast, on-disk, scalable vector and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them. +4. **Efficient Data Evolution**: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time. +5. 
**Versatile Querying**: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes. +6. **Data Versioning**: Every mutation commits a new version; previous versions remain intact on disk. Tags pin a snapshot by name, so retrieval systems and training runs can reproduce against an exact slice of history. + +## Load with `datasets.load_dataset` -## Quick start +You can load Lance datasets via the standard Hugging Face `datasets` interface, which is suitable when your pipeline already speaks `Dataset` / `IterableDataset` or you want a quick streaming sample. ```python -import lance -ds = lance.dataset("hf://datasets/lance-format/gqa-testdev-balanced-lance/data/testdev.lance") -print(ds.count_rows(), ds.schema.names, ds.list_indices()) +import datasets + +hf_ds = datasets.load_dataset("lance-format/gqa-testdev-balanced-lance", split="testdev", streaming=True) +for row in hf_ds.take(3): + print(row["question"], "->", row["answer"]) ``` ## Load with LanceDB -These tables can also be consumed by [LanceDB](https://lancedb.github.io/lancedb/), the multimodal lakehouse and embedded search library built on top of Lance, for simplified vector search and other queries. +LanceDB is the embedded retrieval library built on top of the Lance format ([docs](https://lancedb.com/docs)), and is the interface most users work with. It wraps the dataset as a queryable table with search and filter builders, and is the entry point used by the Search, Curate, Evolve, Train, Versioning, and Materialize-a-subset sections below. ```python import lancedb db = lancedb.connect("hf://datasets/lance-format/gqa-testdev-balanced-lance/data") tbl = db.open_table("testdev") -print(f"LanceDB table opened with {len(tbl)} image-question pairs") +print(len(tbl)) ``` -### LanceDB vector search +## Load with Lance + +`pylance` is the Python binding for the Lance format and works directly with the format's lower-level APIs.
Reach for it when you want to inspect dataset internals — schema, scanner, fragments, and the list of pre-built indices. + +```python +import lance + +ds = lance.dataset("hf://datasets/lance-format/gqa-testdev-balanced-lance/data/testdev.lance") +print(ds.count_rows(), ds.schema.names) +print(ds.list_indices()) +``` + +> **Tip — for production use, download locally first.** Streaming from the Hub works for exploration, but heavy random access and ANN search are far faster against a local copy: +> ```bash +> hf download lance-format/gqa-testdev-balanced-lance --repo-type dataset --local-dir ./gqa-testdev-balanced-lance +> ``` +> Then point Lance or LanceDB at `./gqa-testdev-balanced-lance/data`. + +## Search + +The bundled `IVF_PQ` index on `image_emb` makes cross-modal text→image retrieval a single call: encode a question with the same CLIP model used at ingest (ViT-B/32, cosine-normalized), then pass the resulting 512-d vector to `tbl.search(...)` and target `image_emb`. The example below uses the `question_emb` already stored in row 42 as a runnable stand-in for "the CLIP encoding of a question", so the snippet works without any model loaded. 
```python import lancedb @@ -80,19 +119,48 @@ import lancedb db = lancedb.connect("hf://datasets/lance-format/gqa-testdev-balanced-lance/data") tbl = db.open_table("testdev") -ref = tbl.search().limit(1).select(["question_emb", "question"]).to_list()[0] -query_embedding = ref["question_emb"] +seed = ( + tbl.search() + .select(["question_emb", "question", "answer"]) + .limit(1) + .offset(42) + .to_list()[0] +) -results = ( - tbl.search(query_embedding, vector_column_name="question_emb") +hits = ( + tbl.search(seed["question_emb"], vector_column_name="image_emb") .metric("cosine") - .select(["question", "answer"]) - .limit(5) + .select(["image_id", "question", "answer", "structural"]) + .limit(10) .to_list() ) +print("query question:", seed["question"], "->", seed["answer"]) +for r in hits: + print(f" {r['image_id']:>12} [{r['structural']}] {r['question'][:70]}") ``` -### LanceDB full-text search +Because the CLIP embeddings are cosine-normalized, cosine is the right metric. The query here is a text embedding matched against image embeddings, so the seed row's own image is not guaranteed to rank first, and because each image is stored on every row that shares its `image_id`, several hits may point at the same image. Swap `vector_column_name="image_emb"` for `question_emb` to find paraphrased or topically related questions instead; in that case the first hit is expected to be the seed row itself, which is a useful sanity check. + +The dataset also ships an `INVERTED` index on `question` and `answer`, so the same query can be issued as a hybrid search that combines the dense vector with a literal keyword match. This is useful when a noun like "umbrella" must appear in the question text but you still want CLIP to handle visual similarity over the candidate set. + +```python +hybrid_hits = ( + tbl.search(query_type="hybrid", vector_column_name="image_emb") + .vector(seed["question_emb"]) + .text("umbrella") + .select(["image_id", "question", "answer"]) + .limit(10) + .to_list() +) +for r in hybrid_hits: + print(f" {r['image_id']:>12} {r['question'][:70]} -> {r['answer']}") ``` + +Tune `metric`, `nprobes`, and `refine_factor` on the vector side to trade recall against latency.
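
One way to make that trade-off concrete is to time a few `nprobes` / `refine_factor` settings against an exhaustive scan of the same query. The sketch below is an illustration rather than part of the original card: `recall_at_k` and `sweep` are hypothetical helper names, the settings are arbitrary starting points, and it assumes your LanceDB version provides `bypass_vector_index()` for the flat baseline. It compares sets of `image_id` with a fairly large `k`, because each image is stored on every row that shares it and row-level ties are therefore common.

```python
import time


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact result ids that the approximate search also returned."""
    exact = set(exact_ids)
    if not exact:
        return 0.0
    return len(exact & set(approx_ids)) / len(exact)


def sweep(tbl, query_vec, k=100, settings=((10, 1), (20, 5), (50, 10))):
    """Time several (nprobes, refine_factor) pairs against an exhaustive baseline.

    `tbl` is a LanceDB table handle; `settings` holds illustrative values only.
    """

    def image_ids(rows):
        return [r["image_id"] for r in rows]

    # Assumption: bypass_vector_index() forces a flat scan in your LanceDB version.
    exact = image_ids(
        tbl.search(query_vec, vector_column_name="image_emb")
        .metric("cosine")
        .bypass_vector_index()
        .select(["image_id"])
        .limit(k)
        .to_list()
    )

    results = []
    for nprobes, refine in settings:
        start = time.perf_counter()
        rows = (
            tbl.search(query_vec, vector_column_name="image_emb")
            .metric("cosine")
            .nprobes(nprobes)
            .refine_factor(refine)
            .select(["image_id"])
            .limit(k)
            .to_list()
        )
        elapsed = time.perf_counter() - start
        results.append((nprobes, refine, recall_at_k(image_ids(rows), exact), elapsed))
    return results
```

Calling `sweep(tbl, seed["question_emb"])` with the `tbl` and `seed` from the snippet above returns one `(nprobes, refine_factor, recall, seconds)` tuple per setting. The first query may include index-loading time, so it is worth repeating the sweep before reading much into the timings.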
+ +## Curate + +A typical curation pass for a compositional-reasoning study combines a predicate on the question text (or the GQA short answer) with a structural filter on the reasoning program, so the candidate set is both topically and structurally consistent. Stacking both inside a single filtered scan keeps the result small and explicit, and the bounded `.limit(500)` makes it cheap to inspect before committing the subset to anything downstream. ```python import lancedb @@ -100,42 +168,132 @@ import lancedb db = lancedb.connect("hf://datasets/lance-format/gqa-testdev-balanced-lance/data") tbl = db.open_table("testdev") -results = ( - tbl.search("color of the car") - .select(["question", "answer"]) - .limit(10) +candidates = ( + tbl.search() + .where( + "structural = 'verify' AND answer IN ('yes', 'no') AND question LIKE 'Is %'", + prefilter=True, + ) + .select(["question_id", "image_id", "question", "answer", "semantic"]) + .limit(500) .to_list() ) +print(f"{len(candidates)} verify-style yes/no candidates; first: {candidates[0]['question']}") ``` -## Filter by reasoning type +The result is a plain list of dictionaries, ready to inspect, persist as a manifest of `question_id`s, or feed into the Evolve and Train workflows below. The `image` column is never read, so the network traffic for a 500-row candidate scan is dominated by the question and answer strings rather than JPEG bytes. + +## Evolve + +Lance stores each column independently, so a new column can be appended without rewriting the existing data. The lightest form is a SQL expression: derive the new column from columns that already exist, and Lance computes it once and persists it. The example below adds an `is_binary_answer` flag together with `question_length` and `answer_length` integers, any of which can then be used directly in `where` clauses without recomputing the predicate on every query. + +> **Note:** Mutations require a local copy of the dataset, since the Hub mount is read-only.
See the Materialize-a-subset section at the end of this card for a streaming pattern that downloads only the rows and columns you need, or use `hf download` to pull the full split first. ```python -import lance -ds = lance.dataset("hf://datasets/lance-format/gqa-testdev-balanced-lance/data/testdev.lance") -verify_qs = ds.scanner(filter="structural = 'verify'", columns=["question", "answer"], limit=5).to_table() +import lancedb + +db = lancedb.connect("./gqa-testdev-balanced-lance/data") # local copy required for writes +tbl = db.open_table("testdev") + +tbl.add_columns({ + "is_binary_answer": "answer IN ('yes', 'no')", + "question_length": "length(question)", + "answer_length": "length(answer)", +}) ``` -### Filter with LanceDB +If the values you want to attach already live in another table (offline labels, scene-graph features, or per-question predictions from an external model), merge them in by joining on `question_id`: + +```python +import pyarrow as pa + +predictions = pa.table({ + "question_id": pa.array(["20240268", "20240269"]), + "model_answer": pa.array(["yes", "left"]), + "model_confidence": pa.array([0.91, 0.62]), +}) +tbl.merge(predictions, on="question_id") +``` + +The original columns and indices are untouched, so existing code that does not reference the new columns continues to work unchanged. New columns become visible to every reader as soon as the operation commits. For column values that require a Python computation, Lance provides a batch-UDF API — see the [Lance data evolution docs](https://lance.org/guide/data_evolution/). + +## Train + +Projection lets a training loop read only the columns each step actually needs. LanceDB tables expose this through `Permutation.identity(tbl).select_columns([...])`, which plugs straight into the standard `torch.utils.data.DataLoader` so prefetching, shuffling, and batching behave as in any PyTorch pipeline. 
For a VQA fine-tune, project the JPEG bytes, the question, and the short answer; columns added in the Evolve section above cost nothing per batch until they are explicitly projected. ```python import lancedb +from lancedb.permutation import Permutation +from torch.utils.data import DataLoader db = lancedb.connect("hf://datasets/lance-format/gqa-testdev-balanced-lance/data") tbl = db.open_table("testdev") -verify_qs = ( - tbl.search() - .where("structural = 'verify'") - .select(["question", "answer"]) - .limit(5) - .to_list() -) + +train_ds = Permutation.identity(tbl).select_columns(["image", "question", "answer"]) +loader = DataLoader(train_ds, batch_size=64, shuffle=True, num_workers=4) + +for batch in loader: + # batch carries only the projected columns; decode the JPEG bytes, + # tokenize the question, forward through the VLM, compute the loss... + ... ``` -## Why Lance? +Switching feature sets is a configuration change: passing `["image_emb", "question_emb", "answer"]` to `select_columns(...)` on the next run skips JPEG decoding entirely and reads only the cached 512-d vectors, which is the right shape for a lightweight reasoning probe over frozen CLIP features. + +## Versioning + +Every mutation to a Lance dataset, whether it adds a column, merges labels, or builds an index, commits a new version. Previous versions remain intact on disk. You can list versions and inspect the history directly from the Hub copy; creating new tags requires a local copy since tags are writes. 
+ +```python +import lancedb + +db = lancedb.connect("hf://datasets/lance-format/gqa-testdev-balanced-lance/data") +tbl = db.open_table("testdev") + +print("Current version:", tbl.version) +print("History:", tbl.list_versions()) +print("Tags:", tbl.tags.list()) +``` + +Once you have a local copy, tag a version for reproducibility: + +```python +local_db = lancedb.connect("./gqa-testdev-balanced-lance/data") +local_tbl = local_db.open_table("testdev") +local_tbl.tags.create("clip-vitb32-v1", local_tbl.version) +``` + +A tagged version can be opened by name, or any version reopened by its number, against either the Hub copy or a local one: + +```python +tbl_v1 = db.open_table("testdev", version="clip-vitb32-v1") +tbl_v5 = db.open_table("testdev", version=5) +``` + +Pinning supports two workflows. A retrieval system locked to `clip-vitb32-v1` keeps returning stable results while the dataset evolves in parallel — newly added model predictions or reasoning annotations do not change what the tag resolves to. A training experiment pinned to the same tag can be rerun later against the exact same images and questions, so changes in metrics reflect model changes rather than data drift. Neither workflow needs shadow copies or external manifest tracking. + +## Materialize a subset + +Reads from the Hub are lazy, so exploratory queries only transfer the columns and row groups they touch. Mutating operations (Evolve, tag creation) need a writable backing store, and a training loop benefits from a local copy with fast random access. Both can be served by a subset of the dataset rather than the full split. The pattern is to stream a filtered query through `.to_batches()` into a new local table; only the projected columns and matching row groups cross the wire, and the bytes never fully materialize in Python memory. 
+ +```python +import lancedb + +remote_db = lancedb.connect("hf://datasets/lance-format/gqa-testdev-balanced-lance/data") +remote_tbl = remote_db.open_table("testdev") + +batches = ( + remote_tbl.search() + .where("structural = 'verify' AND answer IN ('yes', 'no')") + .select(["question_id", "image_id", "image", "question", "answer", "image_emb", "question_emb"]) + .to_batches() +) + +local_db = lancedb.connect("./gqa-yesno-subset") +local_db.create_table("testdev", batches) +``` -- One dataset for the joined image + question + answer + reasoning-program metadata + dual embeddings + indices — no instructions/images parquet split to keep in sync. -- Schema evolution: add columns (alternate scene graphs, model predictions) without rewriting the data. +The resulting `./gqa-yesno-subset` is a first-class LanceDB database. Every snippet in the Evolve, Train, and Versioning sections above works against it by swapping `hf://datasets/lance-format/gqa-testdev-balanced-lance/data` for `./gqa-yesno-subset`. 
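
A table created this way is a fresh Lance dataset, so it is safest to assume the Hub copy's indices did not come along and to rebuild the ones you rely on before running search against the subset. The sketch below is an illustration under that assumption: `rebuild_indices` is a hypothetical helper, and the partition and sub-vector counts are placeholder values to tune for the subset size, not settings taken from the original card.

```python
def rebuild_indices(tbl, vector_columns=("image_emb", "question_emb")):
    """Recreate vector, full-text, and scalar indices on a local subset table.

    `tbl` is a writable LanceDB table, for example the `testdev` table inside
    `./gqa-yesno-subset`. All sizes below are illustrative placeholders.
    """
    for col in vector_columns:
        # IVF_PQ over the 512-d CLIP vectors: 32 sub-vectors of 16 dims each.
        tbl.create_index(
            metric="cosine",
            vector_column_name=col,
            num_partitions=16,
            num_sub_vectors=32,
        )
    # Full-text index so keyword and hybrid queries over `question` keep working.
    tbl.create_fts_index("question", replace=True)
    # B-tree index for id lookups and for joins on `question_id`.
    tbl.create_scalar_index("question_id", index_type="BTREE")
    return tbl
```

For example, `rebuild_indices(lancedb.connect("./gqa-yesno-subset").open_table("testdev"))` would prepare the subset for vector and keyword queries; index training needs enough rows, so very small subsets may be better served by a plain flat scan.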
## Source & license diff --git a/docs/datasets/hotpotqa-distractor.mdx b/docs/datasets/hotpotqa-distractor.mdx index f7d3339..b29694f 100644 --- a/docs/datasets/hotpotqa-distractor.mdx +++ b/docs/datasets/hotpotqa-distractor.mdx @@ -1,7 +1,7 @@ --- title: "HotpotQA distractor" sidebarTitle: "HotpotQA distractor" -description: "Lance-formatted version of HotpotQA — multi-hop reading-comprehension questions where each answer requires combining facts from two Wikipedia paragraphs — using the distractor config (10 candidate paragraphs per question, including gold + 8…" +description: "A Lance-formatted version of HotpotQA using the distractor config — multi-hop reading-comprehension questions where each answer requires combining facts from two Wikipedia paragraphs, with 10 candidate paragraphs per question (gold + 8…" --- -Lance-formatted version of [HotpotQA](https://hotpotqa.github.io/) — multi-hop reading-comprehension questions where each answer requires combining facts from two Wikipedia paragraphs — using the `distractor` config (10 candidate paragraphs per question, including gold + 8 distractors). Sourced from [`hotpot_qa`](https://huggingface.co/datasets/hotpot_qa). +A Lance-formatted version of [HotpotQA](https://hotpotqa.github.io/) using the `distractor` config — multi-hop reading-comprehension questions where each answer requires combining facts from two Wikipedia paragraphs, with 10 candidate paragraphs per question (gold + 8 distractors). The dataset ships with MiniLM question embeddings, flattened context text for full-text search, and pre-built ANN/FTS indices, available directly from the Hub at `hf://datasets/lance-format/hotpotqa-distractor-lance/data`. + +## Key features + +- **Multi-hop questions with gold supporting facts** — each row carries the question, the canonical short answer, and the `(title, sent_id)` pointers into the paragraphs that justify it. 
+- **Ten candidate paragraphs per question** in the parallel `context_titles` / `context_sentences` columns, plus a flattened `context_text` field that feeds the FTS index. +- **Pre-computed 384-dim question embeddings** (`question_emb`, `sentence-transformers/all-MiniLM-L6-v2`, cosine-normalized) with a bundled `IVF_PQ` index for semantic question lookup. +- **One columnar dataset** — scan metadata cheaply, then read the heavy context text only for the rows you actually want. ## Splits @@ -27,83 +34,127 @@ Lance-formatted version of [HotpotQA](https://hotpotqa.github.io/) — multi-hop |---|---|---| | `id` | `string` | HotpotQA question id | | `question` | `string` | The question | -| `answer` | `string` | Reference short answer (yes / no / span) | +| `answer` | `string` | Reference short answer (`yes` / `no` / span) | | `type` | `string?` | `bridge` or `comparison` | | `level` | `string?` | `easy` / `medium` / `hard` | -| `supporting_titles` | `list` | Wikipedia titles that contain gold facts | +| `supporting_titles` | `list` | Wikipedia titles that contain the gold facts | | `supporting_sent_ids` | `list` | Sentence indices into those titles | | `context_titles` | `list` | All 10 paragraph titles (gold + distractors) | | `context_sentences` | `list>` | Sentences per paragraph | | `context_text` | `string` | Flattened paragraphs — feeds the FTS index | | `num_supporting_facts` | `int32` | Number of gold supporting facts | -| `question_emb` | `fixed_size_list` | sentence-transformers `all-MiniLM-L6-v2` (cosine-normalized) | +| `question_emb` | `fixed_size_list` | MiniLM question embedding | ## Pre-built indices -- `IVF_PQ` on `question_emb` — `metric=cosine` -- `INVERTED` (FTS) on `question` and `context_text` -- `BTREE` on `id`, `answer` -- `BITMAP` on `type`, `level` +- `IVF_PQ` on `question_emb` — semantic question lookup (cosine) +- `INVERTED` (FTS) on `question` and `context_text` — keyword and hybrid search +- `BTREE` on `id`, `answer` — stable lookup by 
identifier +- `BITMAP` on `type`, `level` — cheap predicate evaluation for question class + +## Why Lance? -## Quick start +1. **Blazing Fast Random Access**: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation. +2. **Native Multimodal Support**: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search. +3. **Native Index Support**: Lance comes with fast, on-disk, scalable vector and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them. +4. **Efficient Data Evolution**: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time. +5. **Versatile Querying**: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes. +6. **Data Versioning**: Every mutation commits a new version; previous versions remain intact on disk. Tags pin a snapshot by name, so retrieval systems and training runs can reproduce against an exact slice of history. + +## Load with `datasets.load_dataset` + +You can load Lance datasets via the standard Hugging Face `datasets` interface, which is suitable when your pipeline already speaks `Dataset` / `IterableDataset` or you want a quick streaming sample.
```python -import lance -ds = lance.dataset("hf://datasets/lance-format/hotpotqa-distractor-lance/data/validation.lance") -print(ds.count_rows(), ds.schema.names, ds.list_indices()) +import datasets + +hf_ds = datasets.load_dataset("lance-format/hotpotqa-distractor-lance", split="validation", streaming=True) +for row in hf_ds.take(3): + print(row["question"], "->", row["answer"]) ``` ## Load with LanceDB -These tables can also be consumed by [LanceDB](https://lancedb.github.io/lancedb/), the multimodal lakehouse and embedded search library built on top of Lance, for simplified vector search and other queries. +LanceDB is the embedded retrieval library built on top of the Lance format ([docs](https://lancedb.com/docs)), and is the interface most users interact with. Each `.lance` file in `data/` is a table — open by name (`train`, `validation`). The same handle is used by the Search, Curate, Evolve, Versioning, and Materialize-a-subset sections below. ```python import lancedb db = lancedb.connect("hf://datasets/lance-format/hotpotqa-distractor-lance/data") tbl = db.open_table("validation") -print(f"LanceDB table opened with {len(tbl)} questions") +print(len(tbl)) ``` -## Multi-hop semantic search +## Load with Lance + +`pylance` is the Python binding for the Lance format and works directly with the format's lower-level APIs. Reach for it when you want to inspect dataset internals — schema, scanner, fragments, the list of pre-built indices. 
```python -import lance, pyarrow as pa -from sentence_transformers import SentenceTransformer - -encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cuda") -q = encoder.encode(["which actor played in both inception and dunkirk"], normalize_embeddings=True)[0] - -ds = lance.dataset("hf://datasets/lance-format/hotpotqa-distractor-lance/data/train.lance") -emb_field = ds.schema.field("question_emb") -hits = ds.scanner( - nearest={"column": "question_emb", "q": pa.array([q.tolist()], type=emb_field.type)[0], "k": 5}, - columns=["question", "answer", "supporting_titles"], -).to_table().to_pylist() +import lance + +ds = lance.dataset("hf://datasets/lance-format/hotpotqa-distractor-lance/data/validation.lance") +print(ds.count_rows(), ds.schema.names) +print(ds.list_indices()) ``` -### LanceDB semantic search +> **Tip — for production use, download locally first.** Streaming from the Hub works for exploration, but heavy random access and ANN search are far faster against a local copy: +> ```bash +> hf download lance-format/hotpotqa-distractor-lance --repo-type dataset --local-dir ./hotpotqa-distractor-lance +> ``` +> Then point Lance or LanceDB at `./hotpotqa-distractor-lance/data`. + +## Search + +The bundled `IVF_PQ` index on `question_emb` makes nearest-neighbour question lookup a single call. In production you would encode an incoming user question through the same 384-dim MiniLM encoder used at ingest and pass the resulting vector to `tbl.search(...)`. The example below uses the embedding from row 42 as a runnable stand-in so the snippet works without loading a model. 
```python import lancedb -from sentence_transformers import SentenceTransformer - -encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cuda") -q = encoder.encode(["which actor played in both inception and dunkirk"], normalize_embeddings=True)[0] db = lancedb.connect("hf://datasets/lance-format/hotpotqa-distractor-lance/data") tbl = db.open_table("train") -results = ( - tbl.search(q.tolist(), vector_column_name="question_emb") +seed = ( + tbl.search() + .select(["question_emb", "question"]) + .limit(1) + .offset(42) + .to_list()[0] +) + +hits = ( + tbl.search(seed["question_emb"], vector_column_name="question_emb") .metric("cosine") + .where("level = 'hard'", prefilter=True) + .select(["question", "answer", "supporting_titles", "type"]) + .limit(10) + .to_list() +) +for r in hits: + print(f"[{r['type']}] {r['question']} -> {r['answer']}") +``` + +The result set carries only the projected columns; the 384-d `question_emb` is never read on the result side, and the long `context_text` body is left untouched, keeping the working set small even when the underlying scan touches every row of the train split. + +Because the dataset also ships an `INVERTED` index on both `question` and `context_text`, the same query can be issued as a hybrid search that combines the dense vector with a keyword query against the full paragraph text. LanceDB merges the two result lists and reranks them in a single call, which is useful when a named entity must literally appear in one of the supporting paragraphs but the dense side still does most of the ranking. 
+ +```python +hybrid_hits = ( + tbl.search(query_type="hybrid") + .vector(seed["question_emb"]) + .text("inception dunkirk") .select(["question", "answer", "supporting_titles"]) - .limit(5) + .limit(10) .to_list() ) +for r in hybrid_hits: + print(r["question"], "->", r["answer"]) ``` -### LanceDB full-text search +Tune `metric`, `nprobes`, and `refine_factor` on the vector side to trade recall against latency for your workload. + +## Curate + +Building a focused evaluation slice usually means stacking predicates over the question metadata before any context text gets read. Lance evaluates the filter inside a single scan, so the candidate set comes back already filtered, and the bounded `.limit(2000)` keeps the output small enough to inspect. The example below assembles a set of hard, multi-hop comparison questions for which the gold answer is a real span rather than `yes`/`no`. ```python import lancedb @@ -111,42 +162,138 @@ import lancedb db = lancedb.connect("hf://datasets/lance-format/hotpotqa-distractor-lance/data") tbl = db.open_table("train") -results = ( - tbl.search("inception dunkirk") - .select(["question", "answer"]) - .limit(10) +candidates = ( + tbl.search() + .where( + "type = 'comparison' " + "AND level = 'hard' " + "AND num_supporting_facts >= 2 " + "AND answer NOT IN ('yes', 'no') " + "AND length(question) >= 40", + prefilter=True, + ) + .select(["id", "question", "answer", "supporting_titles"]) + .limit(2000) .to_list() ) +print(f"{len(candidates)} candidates; first: {candidates[0]['question']}") ``` -## Filter by question type +The result is a plain list of dictionaries, ready to inspect, persist as a manifest of question ids, or hand to the Evolve and Train sections below. Neither `context_text` nor `context_sentences` is read by this scan, so a 2000-row curation pass against the Hub moves only kilobytes of metadata. + +## Evolve + +Lance stores each column independently, so a new column can be appended without rewriting the existing data. 
The lightest form is a SQL expression: derive the new column from columns that already exist, and Lance computes it once and persists it. The example below adds a `question_length` column and an `is_multi_hop` flag, either of which can then be used directly in `where` clauses without recomputing the predicate on every query. + +> **Note:** Mutations require a local copy of the dataset, since the Hub mount is read-only. See the Materialize-a-subset section at the end of this card for a streaming pattern that downloads only the rows and columns you need, or use `hf download` to pull the full corpus. ```python -import lance -ds = lance.dataset("hf://datasets/lance-format/hotpotqa-distractor-lance/data/validation.lance") -hard_compare = ds.scanner( - filter="type = 'comparison' AND level = 'hard'", - columns=["question", "answer"], - limit=10, -).to_table() +import lancedb + +db = lancedb.connect("./hotpotqa-distractor-lance/data") # local copy required for writes +tbl = db.open_table("train") + +tbl.add_columns({ + "question_length": "length(question)", + "is_multi_hop": "num_supporting_facts >= 2", +}) ``` -### Filter with LanceDB +If the values you want to attach already live in another table (offline retriever scores, reranker logits, alternate embeddings from a stronger model), merge them in by joining on the question `id`: + +```python +import pyarrow as pa + +retriever_scores = pa.table({ + "id": pa.array(["5a8b57f25542995d1e6f1371", "5a8c7595554299585d9e36b6"]), + "bm25_top1_score": pa.array([12.7, 9.4]), +}) +tbl.merge(retriever_scores, on="id") +``` + +The original columns and indices are untouched, so existing code that does not reference the new columns continues to work unchanged. New columns become visible to every reader as soon as the operation commits.
For column values that require a Python computation (e.g., running a different encoder over the question text), Lance provides a batch-UDF API — see the [Lance data evolution docs](https://lance.org/guide/data_evolution/). + +## Train + +Projection lets a training loop read only the columns each step actually needs. LanceDB tables expose this through `Permutation.identity(tbl).select_columns([...])`, which plugs straight into the standard `torch.utils.data.DataLoader` so prefetching, shuffling, and batching behave as in any PyTorch pipeline. For a multi-hop QA model the natural projection is the question plus the flattened context and the gold answer; for a question-encoder retraining loop the precomputed embedding is enough on its own. ```python import lancedb +from lancedb.permutation import Permutation +from torch.utils.data import DataLoader db = lancedb.connect("hf://datasets/lance-format/hotpotqa-distractor-lance/data") -tbl = db.open_table("validation") -hard_compare = ( - tbl.search() - .where("type = 'comparison' AND level = 'hard'") - .select(["question", "answer"]) - .limit(10) - .to_list() +tbl = db.open_table("train") + +train_ds = Permutation.identity(tbl).select_columns(["question", "context_text", "answer"]) +loader = DataLoader(train_ds, batch_size=16, shuffle=True, num_workers=4) + +for batch in loader: + # batch carries only the projected columns; tokenize, forward, backward... + ... +``` + +Switching feature sets is a configuration change: passing `["question_emb", "answer"]` to `select_columns(...)` on the next run reads only the 384-d vectors and the short answer string, which is the right shape for fine-tuning a retrieval head on cached embeddings. Columns added in Evolve cost nothing per batch until they are explicitly projected. + +## Versioning + +Every mutation to a Lance dataset, whether it adds a column, merges labels, or builds an index, commits a new version. Previous versions remain intact on disk. 
You can list versions and inspect the history directly from the Hub copy; creating new tags requires a local copy since tags are writes. + +```python +import lancedb + +db = lancedb.connect("hf://datasets/lance-format/hotpotqa-distractor-lance/data") +tbl = db.open_table("train") + +print("Current version:", tbl.version) +print("History:", tbl.list_versions()) +print("Tags:", tbl.tags.list()) +``` + +Once you have a local copy, tag a version for reproducibility: + +```python +local_db = lancedb.connect("./hotpotqa-distractor-lance/data") +local_tbl = local_db.open_table("train") +local_tbl.tags.create("hard-multihop-v1", local_tbl.version) +``` + +A tagged version can be opened by name, or any version reopened by its number, against either the Hub copy or a local one: + +```python +tbl_v1 = db.open_table("train", version="hard-multihop-v1") +tbl_v5 = db.open_table("train", version=5) +``` + +Pinning supports two workflows. A QA system locked to `hard-multihop-v1` keeps returning stable supporting facts while the dataset evolves in parallel — newly added retriever scores or labels do not change what the tag resolves to. A training experiment pinned to the same tag can be rerun later against the exact same questions and contexts, so changes in metrics reflect model changes rather than data drift. Neither workflow needs shadow copies or external manifest tracking. + +## Materialize a subset + +Reads from the Hub are lazy, so exploratory queries only transfer the columns and row groups they touch. Mutating operations (Evolve, tag creation) need a writable backing store, and a training loop benefits from a local copy with fast random access. Both can be served by a subset of the dataset rather than the full corpus. The pattern is to stream a filtered query through `.to_batches()` into a new local table; only the projected columns and matching row groups cross the wire, and the bytes never fully materialize in Python memory. 
+ +```python +import lancedb + +remote_db = lancedb.connect("hf://datasets/lance-format/hotpotqa-distractor-lance/data") +remote_tbl = remote_db.open_table("train") + +batches = ( + remote_tbl.search() + .where( + "type = 'comparison' " + "AND level = 'hard' " + "AND num_supporting_facts >= 2" + ) + .select(["id", "question", "answer", "supporting_titles", "context_text", "question_emb"]) + .to_batches() ) + +local_db = lancedb.connect("./hotpotqa-hard-comparison") +local_db.create_table("train", batches) ``` +The resulting `./hotpotqa-hard-comparison` is a first-class LanceDB database. Every snippet in the Search, Evolve, Train, and Versioning sections above works against it by swapping `hf://datasets/lance-format/hotpotqa-distractor-lance/data` for `./hotpotqa-hard-comparison`. + ## Source & license Converted from [`hotpot_qa`](https://huggingface.co/datasets/hotpot_qa) (`distractor` config). HotpotQA is released under [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/). diff --git a/docs/datasets/imagenet-1k-val.mdx b/docs/datasets/imagenet-1k-val.mdx index 27a2fac..c8fb51d 100644 --- a/docs/datasets/imagenet-1k-val.mdx +++ b/docs/datasets/imagenet-1k-val.mdx @@ -1,7 +1,7 @@ --- title: "ImageNet-1k Validation" sidebarTitle: "ImageNet-1k Validation" -description: "A Lance-formatted version of the canonical 50,000-image ImageNet-1k validation split (also known as ILSVRC2012 val) sourced from benjamin-paine/imagenet-1k. All 50 k JPEGs are stored inline alongside CLIP embeddings and a pre-built IVF_PQ ANN index." +description: "A Lance-formatted version of the canonical 50,000-image ImageNet-1k (ILSVRC2012) validation split, sourced from benjamin-paine/imagenet-1k. 
Each row is one image with its integer class id, a string class name, and a cosine-normalized OpenCLIP image…" --- -A Lance-formatted version of the **canonical 50,000-image ImageNet-1k validation split** (also known as ILSVRC2012 val) sourced from [`benjamin-paine/imagenet-1k`](https://huggingface.co/datasets/benjamin-paine/imagenet-1k). All 50 k JPEGs are stored inline alongside CLIP embeddings and a pre-built `IVF_PQ` ANN index. +A Lance-formatted version of the canonical 50,000-image ImageNet-1k (ILSVRC2012) validation split, sourced from [`benjamin-paine/imagenet-1k`](https://huggingface.co/datasets/benjamin-paine/imagenet-1k). Each row is one image with its integer class id, a string class name, and a cosine-normalized OpenCLIP image embedding — all stored inline and available directly from the Hub at `hf://datasets/lance-format/imagenet-1k-val-lance/data`. The 1.28 M ImageNet-1k train split (~155 GB) is intentionally out of scope for this redistribution; the val split is the canonical evaluation slice for classification benchmarks and is small enough (~7 GB Lance) to ride entirely in inline storage alongside its embeddings. -> **Why only the validation split?** The 1.28 M ImageNet-1k train split is ~155 GB and is intentionally out of scope for this lance distribution. The val split is the canonical evaluation slice for image-classification benchmarks and is small enough (~6.7 GB raw, ~7 GB Lance) to ride entirely in inline storage with embeddings. +## Key features + +- **Inline JPEG bytes** in the `image` column — no per-class folders, no sidecar files. +- **Pre-computed OpenCLIP image embeddings** (`image_emb`, ViT-B/32 trained on `laion2b_s34b_b79k`, 512-dim, cosine-normalized) with a bundled `IVF_PQ` index for similarity search. +- **Both label representations** — integer `label` (0-999) and string `label_name` (first synonym of the WordNet synset, e.g. `golden_retriever`) — with scalar indices on both for fast class filters. 
+- **One columnar dataset** — scan labels and embeddings cheaply, fetch image bytes only for the rows you actually need. ## Splits -| Split | Rows | -|-------|------| -| `validation.lance` | 50,000 | +A single split, shipped as `validation.lance` (50,000 rows). ## Schema | Column | Type | Notes | |---|---|---| -| `id` | `int64` | Row index within the split (0-49,999) | +| `id` | `int64` | Row index within the split, 0-49,999 (natural join key) | | `image` | `large_binary` | Inline JPEG bytes | | `label` | `int32` | Class id (0-999) | | `label_name` | `string` | First synonym of the synset, underscore-spaced (e.g. `golden_retriever`) | -| `image_emb` | `fixed_size_list` | OpenCLIP `ViT-B-32` / `laion2b_s34b_b79k` embedding (cosine-normalized) | +| `image_emb` | `fixed_size_list` | OpenCLIP `ViT-B-32` / `laion2b_s34b_b79k` image embedding (cosine-normalized) | -The full WordNet synset descriptions for each class are available in the dataset metadata under `lance:class_names` (comma-separated). +The full comma-separated WordNet synset descriptions for each class are stored in the dataset metadata under `lance:class_names`. ## Pre-built indices -- `IVF_PQ` on `image_emb` — `metric=cosine`, `num_partitions=64` -- `BTREE` on `label` -- `BITMAP` on `label_name` +- `IVF_PQ` on `image_emb` — vector similarity search (cosine, `num_partitions=64`) +- `BTREE` on `label` — fast equality / range filters by class id +- `BITMAP` on `label_name` — fast set-membership filters by class name + +## Why Lance? + +1. **Blazing Fast Random Access**: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation. +2. **Native Multimodal Support**: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search. +3. 
**Native Index Support**: Lance comes with fast, on-disk, scalable vector and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them. +4. **Efficient Data Evolution**: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time. +5. **Versatile Querying**: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes. +6. **Data Versioning**: Every mutation commits a new version; previous versions remain intact on disk. Tags pin a snapshot by name, so retrieval systems and training runs can reproduce against an exact slice of history. + +## Load with `datasets.load_dataset` -## Quick start +You can load Lance datasets via the standard Hugging Face `datasets` interface, suitable when your pipeline already speaks `Dataset` / `IterableDataset` or you want a quick streaming sample without installing anything Lance-specific. ```python -import lance +import datasets -ds = lance.dataset("hf://datasets/lance-format/imagenet-1k-val-lance/data/validation.lance") -print(ds.count_rows(), ds.schema.names, ds.list_indices()) +hf_ds = datasets.load_dataset("lance-format/imagenet-1k-val-lance", split="validation", streaming=True) +for row in hf_ds.take(3): + print(row["label_name"]) ``` ## Load with LanceDB -These tables can also be consumed by [LanceDB](https://lancedb.github.io/lancedb/), the multimodal lakehouse and embedded search library built on top of Lance, for simplified vector search and other queries. +LanceDB is the embedded retrieval library built on top of the Lance format ([docs](https://lancedb.com/docs)), and is the interface most users interact with.
It wraps the dataset as a queryable table with search and filter builders, and is the entry point used by the Search, Curate, Evolve, Train, Versioning, and Materialize-a-subset sections below. ```python import lancedb db = lancedb.connect("hf://datasets/lance-format/imagenet-1k-val-lance/data") tbl = db.open_table("validation") -print(f"LanceDB table opened with {len(tbl)} images") +print(len(tbl)) ``` -> **Tip — for production use, download locally first** to avoid Hub rate limits: -> ```bash -> hf download lance-format/imagenet-1k-val-lance --repo-type dataset --local-dir ./imagenet-1k-val-lance -> ``` +## Load with Lance -## Vector search example +`pylance` is the Python binding for the Lance format and works directly with the format's lower-level APIs. Reach for it when you want to inspect or operate on dataset internals — schema, scanner, fragments, and the list of pre-built indices. ```python import lance -import pyarrow as pa ds = lance.dataset("hf://datasets/lance-format/imagenet-1k-val-lance/data/validation.lance") -emb_field = ds.schema.field("image_emb") -ref = ds.take([0], columns=["image_emb", "label_name"]).to_pylist()[0] -query = pa.array([ref["image_emb"]], type=emb_field.type) - -neighbors = ds.scanner( - nearest={"column": "image_emb", "q": query[0], "k": 5, "nprobes": 16, "refine_factor": 30}, - columns=["id", "label_name"], -).to_table().to_pylist() -print(f"reference: {ref['label_name']}") -for n in neighbors: - print(n) +print(ds.count_rows(), ds.schema.names) +print(ds.list_indices()) ``` -### LanceDB vector search +> **Tip — for production use, download locally first.** Streaming from the Hub works for exploration, but heavy random access and ANN search are far faster against a local copy: +> ```bash +> hf download lance-format/imagenet-1k-val-lance --repo-type dataset --local-dir ./imagenet-1k-val-lance +> ``` +> Then point Lance or LanceDB at `./imagenet-1k-val-lance/data`. 
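
As a minimal sketch of the tip above, the helper below picks the local copy when it has already been downloaded and falls back to the Hub URI otherwise. `resolve_dataset_uri` is a hypothetical name introduced here for illustration and is not part of Lance or LanceDB; the two locations are the ones this card already uses.

```python
from pathlib import Path

# Locations taken from this card; adjust LOCAL_DIR if you downloaded elsewhere.
HUB_URI = "hf://datasets/lance-format/imagenet-1k-val-lance/data"
LOCAL_DIR = Path("./imagenet-1k-val-lance/data")


def resolve_dataset_uri(local_dir: Path = LOCAL_DIR, hub_uri: str = HUB_URI) -> str:
    """Hypothetical helper: prefer the local copy if its directory exists."""
    return str(local_dir) if local_dir.is_dir() else hub_uri


uri = resolve_dataset_uri()
print("dataset location:", uri)

# With lancedb installed, the result can be passed to the same calls shown above:
#   import lancedb
#   tbl = lancedb.connect(uri).open_table("validation")
```

Keeping the choice in one place lets the read-only snippets run unchanged against either location. Writes still need the local path, since the Hub copy is read-only.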
+ +## Search + +The bundled `IVF_PQ` index on `image_emb` makes nearest-neighbor retrieval over the validation set a single call. In production you would encode a query image through the same OpenCLIP `ViT-B-32` / `laion2b_s34b_b79k` model used at ingest (cosine-normalized) and pass the resulting 512-d vector to `tbl.search(...)`. The example below uses the embedding already stored in row 42 as a runnable stand-in, so the snippet works without any model loaded. ```python import lancedb @@ -94,62 +104,162 @@ import lancedb db = lancedb.connect("hf://datasets/lance-format/imagenet-1k-val-lance/data") tbl = db.open_table("validation") -ref = tbl.search().limit(1).select(["image_emb", "label_name"]).to_list()[0] -query_embedding = ref["image_emb"] +seed = ( + tbl.search() + .select(["image_emb", "label_name"]) + .limit(1) + .offset(42) + .to_list()[0] +) -results = ( - tbl.search(query_embedding) +hits = ( + tbl.search(seed["image_emb"]) .metric("cosine") .select(["id", "label_name"]) - .limit(5) + .limit(10) .to_list() ) +print(f"reference class: {seed['label_name']}") +for r in hits: + print(f" id={r['id']:>6} {r['label_name']}") ``` -## Filter by class +Because the embeddings are cosine-normalized at ingest, `metric="cosine"` is the right choice and the first hit will typically be the seed image itself — a useful sanity check. Tune `nprobes` and `refine_factor` to trade recall against latency for your workload. -```python -import lance -ds = lance.dataset("hf://datasets/lance-format/imagenet-1k-val-lance/data/validation.lance") -goldens = ds.scanner(filter="label_name = 'golden_retriever'", columns=["id"], limit=5).to_table() -``` +## Curate -### Filter by class with LanceDB +A typical curation pass for an ImageNet-style classification or robustness study narrows the validation set to a single class (or a synset prefix) and then materializes a small candidate set for inspection. 
Stacking the filter and the projection inside a single scan keeps the result small and explicit, and the bounded `.limit(200)` makes it cheap to inspect before committing the subset to anything downstream. ```python import lancedb db = lancedb.connect("hf://datasets/lance-format/imagenet-1k-val-lance/data") tbl = db.open_table("validation") -goldens = ( + +candidates = ( tbl.search() .where("label_name = 'golden_retriever'") - .select(["id"]) - .limit(5) + .select(["id", "label", "label_name"]) + .limit(200) .to_list() ) +print(f"{len(candidates)} golden_retriever validation rows") ``` -## Working with images +The `BITMAP` index on `label_name` resolves the predicate without scanning, and the `image` column is never read, so the network traffic for the candidate scan is dominated by the small metadata payload rather than JPEG bytes. The result is a plain list of dictionaries, ready to inspect, persist as a manifest of row ids, or feed into the Evolve and Train workflows below. To grab a family of related classes, replace the equality with a `LIKE` predicate such as `label_name LIKE 'tabby%'` or an `IN` set over a curated synset list. + +## Evolve + +Lance stores each column independently, so a new column can be appended without rewriting the existing data. The lightest form is a SQL expression: derive the new column from columns that already exist, and Lance computes it once and persists it. The example below adds a coarse `is_dog` flag over a curated set of canine synsets, which can then be used directly in later `where` clauses without re-listing the class set on every query. + +> **Note:** Mutations require a local copy of the dataset, since the Hub mount is read-only. See the Materialize-a-subset section at the end of this card for a streaming pattern that downloads only the rows and columns you need, or use `hf download` to pull the full split first. 
```python -from pathlib import Path -import lance +import lancedb -ds = lance.dataset("hf://datasets/lance-format/imagenet-1k-val-lance/data/validation.lance") -row = ds.take([0], columns=["image", "label_name"]).to_pylist()[0] -Path(f"sample_{row['label_name']}.jpg").write_bytes(row["image"]) +db = lancedb.connect("./imagenet-1k-val-lance/data") # local copy required for writes +tbl = db.open_table("validation") + +tbl.add_columns({ + "is_dog": "label_name IN ('golden_retriever', 'Labrador_retriever', 'beagle', 'pug', 'poodle')", +}) ``` -## Why Lance? +If the values you want to attach already live in another table — per-class hypernyms from WordNet, ImageNet-A / ImageNet-R membership flags, model-prediction logs from an external eval run — merge them in by joining on `label_name`: + +```python +import pyarrow as pa + +hypernyms = pa.table({ + "label_name": pa.array(["golden_retriever", "tabby", "espresso"]), + "hypernym": pa.array(["dog", "cat", "beverage"]), +}) +tbl.merge(hypernyms, on="label_name") +``` + +The original columns and indices are untouched, so existing code that does not reference the new columns continues to work unchanged. New columns become visible to every reader as soon as the operation commits. For column values that require a Python computation (e.g., running an alternative vision backbone over the image bytes), Lance provides a batch-UDF API in the underlying library — see the [Lance data evolution docs](https://lance.org/guide/data_evolution/) for that pattern. + +## Train + +Projection lets a training loop — or, more commonly for this split, an evaluation loop — read only the columns each step actually needs. LanceDB tables expose this through `Permutation.identity(tbl).select_columns([...])`, which plugs straight into the standard `torch.utils.data.DataLoader` so prefetching, shuffling, and batching behave as in any PyTorch pipeline. Columns added in the Evolve section above cost nothing per batch until they are explicitly projected. 
+ +```python +import lancedb +from lancedb.permutation import Permutation +from torch.utils.data import DataLoader + +db = lancedb.connect("hf://datasets/lance-format/imagenet-1k-val-lance/data") +tbl = db.open_table("validation") + +eval_ds = Permutation.identity(tbl).select_columns(["image", "label"]) +loader = DataLoader(eval_ds, batch_size=128, shuffle=False, num_workers=4) + +for batch in loader: + # batch carries only the projected columns; image_emb stays on disk. + # decode the JPEG bytes, forward through your classifier, accumulate top-1 / top-5... + ... +``` + +Switching feature sets is a configuration change: passing `["image_emb", "label"]` to `select_columns(...)` on the next run skips JPEG decoding entirely and reads only the cached 512-d vectors, which is the right shape for training a linear probe or a lightweight classifier head on top of frozen CLIP features. + +## Versioning + +Every mutation to a Lance dataset, whether it adds a column, merges labels, or builds an index, commits a new version. Previous versions remain intact on disk. You can list versions and inspect the history directly from the Hub copy; creating new tags requires a local copy since tags are writes. 
+ +```python +import lancedb + +db = lancedb.connect("hf://datasets/lance-format/imagenet-1k-val-lance/data") +tbl = db.open_table("validation") + +print("Current version:", tbl.version) +print("History:", tbl.list_versions()) +print("Tags:", tbl.tags.list()) +``` + +Once you have a local copy, tag a version for reproducibility: + +```python +local_db = lancedb.connect("./imagenet-1k-val-lance/data") +local_tbl = local_db.open_table("validation") +local_tbl.tags.create("clip-vitb32-laion2b-v1", local_tbl.version) +``` + +A tagged version can be opened by name, or any version reopened by its number, against either the Hub copy or a local one: + +```python +tbl_v1 = db.open_table("validation", version="clip-vitb32-laion2b-v1") +tbl_v5 = db.open_table("validation", version=5) +``` + +Pinning supports two workflows. An evaluation harness locked to `clip-vitb32-laion2b-v1` keeps reporting numbers against a fixed snapshot of labels and embeddings even as the dataset evolves in parallel; newly added columns or relabelings do not change what the tag resolves to. A research experiment pinned to the same tag can be rerun later against the exact same images, so changes in metrics reflect model changes rather than data drift. Neither workflow needs shadow copies or external manifest tracking. + +## Materialize a subset + +Reads from the Hub are lazy, so exploratory queries only transfer the columns and row groups they touch. Mutating operations (Evolve, tag creation) need a writable backing store, and a training or evaluation loop benefits from a local copy with fast random access. Both can be served by a subset of the dataset rather than the full split. The pattern is to stream a filtered query through `.to_batches()` into a new local table; only the projected columns and matching row groups cross the wire, and the bytes never fully materialize in Python memory. 
+ +```python +import lancedb + +remote_db = lancedb.connect("hf://datasets/lance-format/imagenet-1k-val-lance/data") +remote_tbl = remote_db.open_table("validation") + +batches = ( + remote_tbl.search() + .where("label_name IN ('golden_retriever', 'Labrador_retriever', 'beagle', 'pug', 'poodle')") + .select(["id", "image", "label", "label_name", "image_emb"]) + .to_batches() +) + +local_db = lancedb.connect("./imagenet-dogs-subset") +local_db.create_table("validation", batches) +``` -- One dataset for images + embeddings + indices + metadata — no sidecar files. -- On-disk vector and FTS indices live next to the data, so search works on local copies and on the Hub. -- Schema evolution: add columns (model predictions, fresh embeddings, robustness annotations) without rewriting the data. +The resulting `./imagenet-dogs-subset` is a first-class LanceDB database. Every snippet in the Evolve, Train, and Versioning sections above works against it by swapping `hf://datasets/lance-format/imagenet-1k-val-lance/data` for `./imagenet-dogs-subset`. ## Source & license -Converted from [`benjamin-paine/imagenet-1k`](https://huggingface.co/datasets/benjamin-paine/imagenet-1k), itself a redistribution of the [ILSVRC2012 ImageNet-1k validation split](https://image-net.org/challenges/LSVRC/2012/). All use is subject to the [ImageNet terms of access](https://image-net.org/download.php) — for **research use only**. +Converted from [`benjamin-paine/imagenet-1k`](https://huggingface.co/datasets/benjamin-paine/imagenet-1k), itself a redistribution of the [ILSVRC2012 ImageNet-1k validation split](https://image-net.org/challenges/LSVRC/2012/). All use is subject to the [ImageNet terms of access](https://image-net.org/download.php) — **for research use only**. 
## Citation diff --git a/docs/datasets/index.mdx b/docs/datasets/index.mdx index 7f43b7f..b8cdf37 100644 --- a/docs/datasets/index.mdx +++ b/docs/datasets/index.mdx @@ -32,28 +32,28 @@ integration itself, see the [Hugging Face Hub integration page](/integrations/ai - `lance-format/mnist-lance` — A Lance-formatted version of the classic MNIST handwritten-digit dataset with 70,000 28×28 grayscale digits stored inline alongside CLIP image embeddings and a pre-built ANN index. + `lance-format/mnist-lance` — A Lance-formatted version of the classic MNIST handwritten-digit dataset covering 70,000 28×28 grayscale digits across ten balanced classes. Each row carries inline PNG bytes, the digit label, the human-readable class name, and a cosine-normalized… - `lance-format/cifar10-lance` — A Lance-formatted version of CIFAR-10 with 60,000 32×32 RGB images across 10 classes, stored inline with CLIP embeddings and a pre-built IVF_PQ ANN index. + `lance-format/cifar10-lance` — A Lance-formatted version of CIFAR-10 covering 60,000 32×32 RGB images across ten balanced object classes. Each row carries inline PNG bytes, the integer label, the human-readable class name, and a cosine-normalized CLIP image embedding, all backed… - `lance-format/fashion-mnist-lance` — A Lance-formatted version of Fashion-MNIST with 70,000 28×28 grayscale clothing images stored inline alongside CLIP embeddings and a pre-built IVF_PQ ANN index. + `lance-format/fashion-mnist-lance` — A Lance-formatted version of Fashion-MNIST covering 70,000 28×28 grayscale clothing images across ten balanced apparel classes. Each row carries inline PNG bytes, the integer label, the human-readable class name, and a cosine-normalized CLIP image… - `lance-format/food101-lance` — Lance-formatted version of Food-101 — 101,000 food photographs across 101 classes — sourced from ethz/food101. Inline JPEG bytes + CLIP image embeddings + IVF_PQ. 
+ `lance-format/food101-lance` — A Lance-formatted version of Food-101, the fine-grained dish-classification benchmark of 101,000 photos spread evenly across 101 dish classes, sourced from ethz/food101. Each row carries the inline JPEG bytes, the integer label, the human-readable… - `lance-format/oxford-pets-lance` — Lance-formatted version of the Oxford-IIIT Pet dataset — 7,390 cat & dog photos across 37 breeds — sourced from pcuenq/oxford-pets. + `lance-format/oxford-pets-lance` — A Lance-formatted version of the Oxford-IIIT Pet dataset — 7,390 cat and dog photos across 37 breeds — sourced from pcuenq/oxford-pets. Each row carries the inline JPEG bytes, the breed name, a species flag distinguishing cats from dogs, and a… - `lance-format/stanford-cars-lance` — Lance-formatted version of the Stanford Cars dataset — 8,144 training images across 196 fine-grained car make/model/year classes — sourced from Multimodal-Fatima/StanfordCars_train. + `lance-format/stanford-cars-lance` — A Lance-formatted version of the Stanford Cars fine-grained benchmark — 8,144 photographs across 196 make/model/year classes — sourced from Multimodal-Fatima/StanfordCars_train. Each row carries the inline JPEG bytes, the integer class id, a… - `lance-format/imagenet-1k-val-lance` — A Lance-formatted version of the canonical 50,000-image ImageNet-1k validation split (also known as ILSVRC2012 val) sourced from benjamin-paine/imagenet-1k. All 50 k JPEGs are stored inline alongside CLIP embeddings and a pre-built IVF_PQ ANN index. + `lance-format/imagenet-1k-val-lance` — A Lance-formatted version of the canonical 50,000-image ImageNet-1k (ILSVRC2012) validation split, sourced from benjamin-paine/imagenet-1k. 
Each row is one image with its integer class id, a string class name, and a cosine-normalized OpenCLIP image… - `lance-format/eurosat-lance` — Lance-formatted version of EuroSAT — Sentinel-2 satellite imagery (RGB) covering 27,000 64×64 tiles across 10 land-cover classes, sourced from blanchon/EuroSAT_RGB. + `lance-format/eurosat-lance` — A Lance-formatted version of EuroSAT, the canonical Sentinel-2 RGB land-cover benchmark, sourced from blanchon/EuroSAT_RGB. Each row is a single 64×64 RGB tile with its integer class id, the human-readable class name, and a cosine-normalized… @@ -61,16 +61,16 @@ integration itself, see the [Hugging Face Hub integration page](/integrations/ai - `lance-format/coco-detection-2017-lance` — Lance-formatted version of the COCO 2017 object detection benchmark — sourced from detection-datasets/coco — with 123,287 images and the full per-image list of bounding boxes, category labels, and CLIP image embeddings, all stored inline. + `lance-format/coco-detection-2017-lance` — A Lance-formatted version of the COCO 2017 object detection benchmark, sourced from detection-datasets/coco. Each row is one image with its inline JPEG bytes, the full per-image list of bounding boxes, COCO 80-class category ids and names… - `lance-format/pascal-voc-2012-segmentation-lance` — A Lance-formatted version of the Pascal VOC 2012 semantic segmentation split (sourced from nateraw/pascal-voc-2012) — 2,913 image / mask pairs with CLIP image embeddings stored inline and a pre-built IVF_PQ ANN index. + `lance-format/pascal-voc-2012-segmentation-lance` — A Lance-formatted version of the Pascal VOC 2012 semantic segmentation split, sourced from nateraw/pascal-voc-2012. 
Each row pairs an inline JPEG image with the per-pixel PNG segmentation mask and a cosine-normalized OpenCLIP ViT-B-32 image… - `lance-format/ade20k-lance` — Lance-formatted version of the full ADE20K scene parsing benchmark (sourced from 1aurent/ADE20K) — 27,574 scene images with semantic and instance segmentation maps, scene labels, and per-object metadata, all stored inline. + `lance-format/ade20k-lance` — A Lance-formatted version of the full ADE20K scene parsing benchmark, sourced from 1aurent/ADE20K. Each row is one scene image with its inline JPEG bytes, a per-pixel semantic segmentation map encoded as PNG bytes, an optional instance map, scene… - `lance-format/kitti-2d-detection-lance` — Lance-formatted version of the KITTI 2D Object Detection benchmark — 7,481 training images from the KITTI Vision Benchmark Suite with 2D bounding boxes plus the full 3D-box / observation-angle metadata. Sourced from nateraw/kitti so no manual… + `lance-format/kitti-2d-detection-lance` — A Lance-formatted version of the KITTI 2D Object Detection benchmark, sourced from nateraw/kitti so no manual signup or download from cvlibs.net is required. Each row is a single driving frame with inline JPEG bytes, the full set of 2D and 3D… @@ -78,13 +78,13 @@ integration itself, see the [Hugging Face Hub integration page](/integrations/ai - `lance-format/coco-captions-2017-lance` — Lance-formatted version of the COCO Captions 2017 corpus, redistributed via lmms-lab/COCO-Caption2017. Each row is one image with 5–7 human-written captions, CLIP image embedding, and CLIP text embedding of the canonical caption — all stored inline. + `lance-format/coco-captions-2017-lance` — A Lance-formatted version of the COCO Captions 2017 corpus, redistributed via lmms-lab/COCO-Caption2017. 
Each row is one image with 5–7 human-written captions, a cosine-normalized CLIP image embedding, and a cosine-normalized CLIP text embedding of… - `lance-format/flickr30k-lance` — Lance-formatted version of Flickr30k (re-distributed via lmms-lab/flickr30k) — 31,783 images, each paired with 5 human-written captions, with CLIP image and text embeddings stored inline and pre-built ANN indices on both. + `lance-format/flickr30k-lance` — A Lance-formatted version of Flickr30k, redistributed via lmms-lab/flickr30k. Each row is one image with 5 human-written captions, a cosine-normalized CLIP image embedding, and a cosine-normalized CLIP text embedding of the canonical caption — all… - `lance-format/laion-1m` — A lance dataset of LAION image-text corpus (~1M rows) with inline JPEG bytes, CLIP embeddings (img_emb), and full metadata available directly from the Hub: hf://datasets/lance-format/laion-1m/data/train.lance. + `lance-format/laion-1m` — A Lance-formatted slice of the LAION image-text corpus (~1M rows) with inline JPEG bytes, CLIP image embeddings (img_emb), full metadata, and a pre-built ANN index — all available directly from the Hub at… @@ -92,19 +92,19 @@ integration itself, see the [Hugging Face Hub integration page](/integrations/ai - `lance-format/chartqa-lance` — Lance-formatted version of ChartQA — VQA over scientific and business charts that combine logical and visual reasoning — sourced from lmms-lab/ChartQA. + `lance-format/chartqa-lance` — A Lance-formatted version of ChartQA, a benchmark for question answering over scientific and business charts that demands a mix of logical and visual reasoning, redistributed via lmms-lab/ChartQA. Each row carries the chart image as inline JPEG… - `lance-format/docvqa-lance` — Lance-formatted version of DocVQA — VQA over document images (industry / government scans, multi-page reports, forms, receipts) — sourced from lmms-lab/DocVQA (DocVQA config). 
+ `lance-format/docvqa-lance` — A Lance-formatted version of DocVQA, a benchmark for visual question answering over document images such as industry and government scans, multi-page reports, forms, and receipts, redistributed via lmms-lab/DocVQA (DocVQA config). Each row carries… - `lance-format/textvqa-lance` — Lance-formatted version of TextVQA — VQA where the question requires reading text in the image — sourced from lmms-lab/textvqa. + `lance-format/textvqa-lance` — A Lance-formatted version of TextVQA — visual question answering where the question requires reading text in the image (street signs, product labels, screen captures) — sourced from lmms-lab/textvqa. Each row carries the image bytes, the question… - `lance-format/vqav2-lance` — Lance-formatted version of VQAv2 — Visual Question Answering on COCO images, sourced from lmms-lab/VQAv2. Each row is a (image, question, 10 answers) triple with two CLIP embeddings (image + question text) so the same dataset supports both visual… + `lance-format/vqav2-lance` — A Lance-formatted version of VQAv2 — open-ended visual question answering on COCO images — sourced from lmms-lab/VQAv2. Each row is one (image, question, 10 annotator answers) triple with paired CLIP image and question embeddings drawn from the… - `lance-format/gqa-testdev-balanced-lance` — Lance-formatted version of the canonical GQA testdev_balanced slice — 12,578 compositional VQA questions joined with the matching 398 images — sourced from lmms-lab/GQA. + `lance-format/gqa-testdev-balanced-lance` — A Lance-formatted version of the canonical GQA testdev_balanced slice — 12,578 compositional VQA questions joined against the matching 398 images — sourced from lmms-lab/GQA. 
The original redistribution ships instructions and images as separate… @@ -112,19 +112,19 @@ integration itself, see the [Hugging Face Hub integration page](/integrations/ai - `lance-format/squad-v2-lance` — Lance-formatted version of SQuAD v2 — Stanford Question Answering Dataset, version 2 — with MiniLM sentence embeddings stored inline alongside the questions, contexts, and answers. + `lance-format/squad-v2-lance` — A Lance-formatted version of SQuAD v2 — the Stanford Question Answering Dataset with both answerable and deliberately unanswerable questions over Wikipedia passages — with MiniLM question embeddings stored inline and ready for retrieval at… - `lance-format/trivia-qa-lance` — Lance-formatted version of TriviaQA (rc.nocontext config) — a question-answering dataset of trivia questions paired with answer aliases — with MiniLM sentence embeddings stored inline. + `lance-format/trivia-qa-lance` — A Lance-formatted version of TriviaQA (rc.nocontext config) — a large reading-comprehension dataset of trivia questions paired with a canonical answer, accepted aliases, and entity-type metadata — with MiniLM question embeddings stored inline and… - `lance-format/hotpotqa-distractor-lance` — Lance-formatted version of HotpotQA — multi-hop reading-comprehension questions where each answer requires combining facts from two Wikipedia paragraphs — using the distractor config (10 candidate paragraphs per question, including gold + 8… + `lance-format/hotpotqa-distractor-lance` — A Lance-formatted version of HotpotQA using the distractor config — multi-hop reading-comprehension questions where each answer requires combining facts from two Wikipedia paragraphs, with 10 candidate paragraphs per question (gold + 8… - `lance-format/natural-questions-val-lance` — Lance-formatted version of the Natural Questions validation split — 7,830 real Google search queries with their full Wikipedia articles and 1–5 annotator labels per question. 
Sourced from google-research-datasets/natural_questions. + `lance-format/natural-questions-val-lance` — A Lance-formatted version of the Natural Questions validation split — 7,830 real Google search queries paired with the full Wikipedia article a human used to answer them, plus 1–5 annotator labels per question. MiniLM question embeddings are stored… - `lance-format/ms-marco-v2.1-lance` — Lance-formatted version of MS MARCO v2.1 — Microsoft's machine reading comprehension benchmark — with MiniLM query embeddings stored inline alongside the candidate passages and human-written answers. + `lance-format/ms-marco-v2.1-lance` — A Lance-formatted version of MS MARCO v2.1 — Microsoft's machine-reading-comprehension benchmark built from anonymized Bing query logs. Each row is one user query, the up-to-10 candidate passages Bing retrieved for it with relevance flags, and the… @@ -132,7 +132,7 @@ integration itself, see the [Hugging Face Hub integration page](/integrations/ai - `lance-format/fineweb-edu` — FineWeb-edu dataset with over 1.5 billion rows. Each passage ships with cleaned text, metadata, and 384-dim text embeddings for retrieval-heavy workloads. + `lance-format/fineweb-edu` — A Lance-formatted version of FineWeb-Edu — over 1.5 billion educational web passages with cleaned text, source metadata, language detection signals, and 384-dim text embeddings — available directly from the Hub at… @@ -140,7 +140,7 @@ integration itself, see the [Hugging Face Hub integration page](/integrations/ai - `lance-format/librispeech-clean-lance` — Lance-formatted version of the LibriSpeech ASR clean configuration (sourced from openslr/librispeech_asr). Audio is stored inline as FLAC bytes (no re-encoding); transcripts are sentence-embedded so semantic transcript search works out of the box. + `lance-format/librispeech-clean-lance` — A Lance-formatted version of the LibriSpeech ASR clean configuration, sourced from openslr/librispeech_asr. 
Each row is one utterance with inline FLAC audio bytes, the reference transcript, a sentence-transformers embedding of that transcript, and… @@ -148,7 +148,7 @@ integration itself, see the [Hugging Face Hub integration page](/integrations/ai - `lance-format/openvid-lance` — Lance format version of the OpenVid dataset with 937,957 high-quality videos stored with inline video blobs, embeddings, and rich metadata. + `lance-format/openvid-lance` — A Lance-formatted version of the OpenVid-1M corpus — 937,957 high-quality clips with inline MP4 bytes, 1024-dim video embeddings, captions, and rich per-clip quality signals — available directly from the Hub at… @@ -156,10 +156,10 @@ integration itself, see the [Hugging Face Hub integration page](/integrations/ai - `lance-format/lerobot-pusht-lance` — Lance-formatted version of lerobot/pusht — the canonical PushT benchmark from the Diffusion Policy paper — packaged using the same three-table layout as the existing lance-format/lerobot-xvla-soft-fold so consumers can flip between datasets without… + `lance-format/lerobot-pusht-lance` — A Lance-formatted version of lerobot/pusht — the canonical PushT benchmark from the Diffusion Policy paper — packaged using the same three-table layout as lance-format/lerobot-xvla-soft-fold so consumers can flip between datasets without changing… - `lance-format/lerobot-xvla-soft-fold` — This dataset was created using LeRobot. 
+ `lance-format/lerobot-xvla-soft-fold` — A Lance-formatted version of lerobot/xvla-soft-fold — a multi-camera robotics dataset from the X-VLA project — packaged as three Lance tables for efficient frame-level training, episode-level trajectory loading, and direct access to the original… diff --git a/docs/datasets/kitti-2d-detection.mdx b/docs/datasets/kitti-2d-detection.mdx index cce4c24..58df1d9 100644 --- a/docs/datasets/kitti-2d-detection.mdx +++ b/docs/datasets/kitti-2d-detection.mdx @@ -1,7 +1,7 @@ --- title: "KITTI 2D Detection" sidebarTitle: "KITTI 2D Detection" -description: "Lance-formatted version of the KITTI 2D Object Detection benchmark — 7,481 training images from the KITTI Vision Benchmark Suite with 2D bounding boxes plus the full 3D-box / observation-angle metadata. Sourced from nateraw/kitti so no manual…" +description: "A Lance-formatted version of the KITTI 2D Object Detection benchmark, sourced from nateraw/kitti so no manual signup or download from cvlibs.net is required. Each row is a single driving frame with inline JPEG bytes, the full set of 2D and 3D…" --- -Lance-formatted version of the [KITTI 2D Object Detection benchmark](https://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=2d) — 7,481 training images from the KITTI Vision Benchmark Suite with 2D bounding boxes plus the full 3D-box / observation-angle metadata. Sourced from [`nateraw/kitti`](https://huggingface.co/datasets/nateraw/kitti) so no manual signup or download from cvlibs.net is required. +A Lance-formatted version of the [KITTI 2D Object Detection benchmark](https://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=2d), sourced from [`nateraw/kitti`](https://huggingface.co/datasets/nateraw/kitti) so no manual signup or download from cvlibs.net is required. 
Each row is a single driving frame with inline JPEG bytes, the full set of 2D and 3D object annotations stored as parallel per-object lists, plus a cosine-normalized OpenCLIP `ViT-B-32` image embedding — all available directly from the Hub at `hf://datasets/lance-format/kitti-2d-detection-lance/data`. -KITTI is the canonical autonomous-driving 2D / 3D detection benchmark — useful for AV perception research, robust real-world benchmarking, and as a small-scale companion to nuScenes / Waymo. +KITTI is the canonical autonomous-driving detection benchmark with 8 object classes drawn from real street scenes around Karlsruhe. It is widely used for AV perception research and serves as a small-scale companion to nuScenes and Waymo. + +## Key features + +- **Inline JPEG bytes** in the `image` column — no parallel `image_2/` and `label_2/` folders to keep in sync. +- **Per-object 2D and 3D annotations on the same row** — bounding boxes, observation angles, 3D dimensions, locations, yaw, occlusion and truncation flags travel as parallel list columns of equal length. +- **Pre-computed CLIP image embeddings** (`image_emb`, 512-dim, cosine-normalized) with a bundled `IVF_PQ` index for visual similarity over driving scenes. +- **Scalar and label-list indices** on `num_objects` and `types_present` make per-class and crowdedness filters cheap on the Hub copy and locally. ## Splits -| Split | Rows | -|-------|------| -| `train.lance` | 7,481 | +| Split | Rows | Notes | +|-------|------|-------| +| `train.lance` | 7,481 | Official KITTI training set with labels | -(The `test` split has no labels published, so we omit it. Add it back via `--splits train test` if you want the unlabeled images as well.) +The KITTI `test` split has no public labels and is intentionally not bundled. Add it via `--splits train test` in `kitti/dataprep.py` if you want the unlabeled images for inference. 
## Schema | Column | Type | Notes | |---|---|---| -| `id` | `int64` | Row index within split | +| `id` | `int64` | Row index within split (natural join key for merges) | | `image` | `large_binary` | Inline JPEG bytes (re-encoded from the source PNG) | | `bboxes` | `list>` | 2D box per object — `[left, top, right, bottom]` in pixel coords | -| `alphas` | `list` | Observation angle (radians, KITTI convention) | -| `dimensions` | `list>` | 3D box `(h, w, l)` in metres | -| `locations` | `list>` | 3D centre `(x, y, z)` in camera coords (metres) | -| `rotation_y` | `list` | Yaw angle in camera coords (radians) | +| `alphas` | `list` | Observation angle per object (radians, KITTI convention) | +| `dimensions` | `list>` | 3D box `(h, w, l)` per object, in metres | +| `locations` | `list>` | 3D centre `(x, y, z)` per object in camera coords, in metres | +| `rotation_y` | `list` | Yaw per object in camera coords (radians) | | `occluded` | `list` | KITTI occlusion flag (0=visible, 1=partly, 2=largely, 3=unknown) | -| `truncated` | `list` | Truncation fraction (0.0-1.0) | -| `types` | `list` | Class name per object (e.g. `Car`, `Pedestrian`, `Cyclist`, `DontCare`) | -| `num_objects` | `int32` | Number of annotated objects | +| `truncated` | `list` | Truncation fraction per object (0.0-1.0) | +| `types` | `list` | Class name per object (`Car`, `Van`, `Truck`, `Pedestrian`, `Person_sitting`, `Cyclist`, `Tram`, `Misc`, `DontCare`) | +| `num_objects` | `int32` | Number of annotated objects in the frame | | `types_present` | `list` | Deduped class names — feeds the LABEL_LIST index | | `image_emb` | `fixed_size_list` | OpenCLIP `ViT-B-32` image embedding (cosine-normalized) | +All `list<...>` annotation columns on the same row are aligned — index `i` across `bboxes`, `alphas`, `dimensions`, `locations`, `rotation_y`, `occluded`, `truncated`, and `types` describes the same physical object. 
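Because the annotation columns are index-aligned, one row can be regrouped into one record per object with a plain `zip`. The sketch below assumes a row has already been fetched as a Python dict (for example from `.to_list()`). The `row_to_objects` helper and the sample values are illustrative assumptions, not part of the dataset tooling.

```python
# Columns that hold one entry per annotated object (names taken from the schema table).
PER_OBJECT_COLUMNS = [
    "bboxes", "alphas", "dimensions", "locations",
    "rotation_y", "occluded", "truncated", "types",
]


def row_to_objects(row, columns=PER_OBJECT_COLUMNS):
    """Regroup the parallel list columns of one row into per-object dicts."""
    # Only use the per-object columns that were actually projected into the row.
    present = [c for c in columns if c in row]
    lengths = {len(row[c]) for c in present}
    if len(lengths) > 1:
        raise ValueError(f"misaligned annotation lists: {lengths}")
    return [dict(zip(present, values)) for values in zip(*(row[c] for c in present))]


# Hypothetical row with two objects; the numbers are made up for illustration.
sample = {
    "id": 0,
    "bboxes": [[10.0, 20.0, 110.0, 220.0], [300.0, 150.0, 340.0, 260.0]],
    "types": ["Car", "Pedestrian"],
    "occluded": [0, 1],
}
objects = row_to_objects(sample)
print(objects[1]["types"], objects[1]["bboxes"])
```

The length check is a cheap guard: if a projection or a custom transform ever breaks the alignment, the helper fails loudly instead of pairing a box with the wrong class.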
+ ## Pre-built indices - `IVF_PQ` on `image_emb` — `metric=cosine` - `BTREE` on `num_objects` - `LABEL_LIST` on `types_present` -## Quick start +## Why Lance? + +1. **Blazing Fast Random Access**: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation. +2. **Native Multimodal Support**: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search. +3. **Native Index Support**: Lance comes with fast, on-disk, scalable vector and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them. +4. **Efficient Data Evolution**: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time. +5. **Versatile Querying**: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes. +6. **Data Versioning**: Every mutation commits a new version; previous versions remain intact on disk. Tags pin a snapshot by name, so retrieval systems and training runs can reproduce against an exact slice of history. + +## Load with `datasets.load_dataset` + +You can load Lance datasets via the standard HuggingFace `datasets` interface, suitable when your pipeline already speaks `Dataset` / `IterableDataset` or you want a quick streaming sample. 
```python -import lance +import datasets -ds = lance.dataset("hf://datasets/lance-format/kitti-2d-detection-lance/data/train.lance") -print(ds.count_rows(), ds.schema.names, ds.list_indices()) +hf_ds = datasets.load_dataset("lance-format/kitti-2d-detection-lance", split="train", streaming=True) +for row in hf_ds.take(3): + print(row["id"], row["num_objects"], row["types_present"]) ``` ## Load with LanceDB -These tables can also be consumed by [LanceDB](https://lancedb.github.io/lancedb/), the multimodal lakehouse and embedded search library built on top of Lance, for simplified vector search and other queries. +LanceDB is the embedded retrieval library built on top of the Lance format ([docs](https://lancedb.com/docs)), and is the interface most users interact with. It wraps the dataset as a queryable table with search and filter builders, and is the entry point used by the Search, Curate, Evolve, Versioning, and Materialize-a-subset sections below. ```python import lancedb db = lancedb.connect("hf://datasets/lance-format/kitti-2d-detection-lance/data") tbl = db.open_table("train") -print(f"LanceDB table opened with {len(tbl)} frames") +print(len(tbl)) ``` -## Read a frame with annotations +## Load with Lance + +`pylance` is the Python binding for the Lance format and works directly with the format's lower-level APIs. Reach for it when you want to inspect or operate on dataset internals — schema, scanner, fragments, and the list of pre-built indices. 
```python -import io import lance -from PIL import Image, ImageDraw ds = lance.dataset("hf://datasets/lance-format/kitti-2d-detection-lance/data/train.lance") -row = ds.take([0], columns=["image", "bboxes", "types"]).to_pylist()[0] - -img = Image.open(io.BytesIO(row["image"])).convert("RGB") -draw = ImageDraw.Draw(img) -for (l, t, r, b), cls in zip(row["bboxes"], row["types"]): - if cls == "DontCare": - continue - draw.rectangle([l, t, r, b], outline="lime", width=2) - draw.text((l + 4, t + 2), cls, fill="lime") -img.save("kitti.jpg") +print(ds.count_rows(), ds.schema.names) +print(ds.list_indices()) ``` -## Filter by classes +> **Tip — for production use, download locally first.** Streaming from the Hub works for exploration, but heavy random access and ANN search are far faster against a local copy: +> ```bash +> hf download lance-format/kitti-2d-detection-lance --repo-type dataset --local-dir ./kitti-2d-detection-lance +> ``` +> Then point Lance or LanceDB at `./kitti-2d-detection-lance/data`. + +## Search + +The bundled `IVF_PQ` index on `image_emb` makes visual nearest-neighbour search over driving scenes a single call. In production you would encode a query frame (or a scene prototype) through OpenCLIP `ViT-B-32` at runtime and pass the resulting 512-d cosine-normalized vector to `tbl.search(...)`. The example below uses the embedding from row 42 as a runnable stand-in. ```python -import lance -ds = lance.dataset("hf://datasets/lance-format/kitti-2d-detection-lance/data/train.lance") +import lancedb + +db = lancedb.connect("hf://datasets/lance-format/kitti-2d-detection-lance/data") +tbl = db.open_table("train") -# Frames containing both a Car and a Cyclist (LABEL_LIST index makes this fast). 
-both = ds.scanner( - filter="array_has_all(types_present, ['Car', 'Cyclist'])", - columns=["id", "types_present"], - limit=10, -).to_table() +seed = ( + tbl.search() + .select(["image_emb", "types_present"]) + .limit(1) + .offset(42) + .to_list()[0] +) -# Frames with at least 10 objects (for crowded-scene experiments). -crowded = ds.scanner(filter="num_objects >= 10", columns=["id"], limit=10).to_table() +hits = ( + tbl.search(seed["image_emb"]) + .metric("cosine") + .select(["id", "num_objects", "types_present"]) + .limit(10) + .to_list() +) +print("query scene types:", seed["types_present"]) +for r in hits: + print(f" id={r['id']:>5} n={r['num_objects']:>2} {r['types_present']}") ``` -### Filter by classes with LanceDB +Because the embeddings are cosine-normalized, `metric="cosine"` is the natural choice and the first hit is typically the seed row itself. Visual neighbours tend to share scene-level structure (highway vs. urban intersection vs. parked-cars row) before they share class composition, which is what makes the cross between `image_emb` and the `types_present` / `num_objects` indices useful for the curation patterns below. + +## Curate + +KITTI's parallel per-object list columns make composition-based filters natural: pick scenes by which classes are present, by how many objects are in them, or by the occlusion profile of those objects. Lance evaluates these predicates inside a single filtered scan, and the bounded `.limit(...)` keeps the candidate set small and explicit. The first snippet below finds crowded scenes that contain at least one cyclist and one pedestrian — a useful slice for vulnerable-road-user studies. 
```python import lancedb @@ -114,65 +150,155 @@ import lancedb db = lancedb.connect("hf://datasets/lance-format/kitti-2d-detection-lance/data") tbl = db.open_table("train") -both = ( +vru = ( tbl.search() - .where("array_has_all(types_present, ['Car', 'Cyclist'])") - .select(["id", "types_present"]) - .limit(10) + .where( + "array_has_all(types_present, ['Cyclist', 'Pedestrian']) AND num_objects >= 8", + prefilter=True, + ) + .select(["id", "num_objects", "types_present"]) + .limit(200) .to_list() ) +print(f"{len(vru)} VRU-rich frames") +``` + +A second pass can combine a structural filter with visual similarity: take a crowded urban seed frame and look for visually similar frames whose object lists also contain cars. This is a one-shot retrieval against the `IVF_PQ` index, joined with the `LABEL_LIST` index on `types_present` inside a single query. -crowded = ( +```python +seed = ( tbl.search() - .where("num_objects >= 10") - .select(["id"]) - .limit(10) + .where("num_objects >= 10 AND array_contains(types_present, 'Car')", prefilter=True) + .select(["image_emb"]) + .limit(1) + .to_list()[0] +) + +similar_crowded = ( + tbl.search(seed["image_emb"]) + .metric("cosine") + .where("array_contains(types_present, 'Car')", prefilter=True) + .select(["id", "num_objects", "types_present"]) + .limit(50) .to_list() ) ``` -## Visual similarity search +The results are plain lists of dictionaries, ready to inspect, persist as manifests of `id`s, or feed into the Evolve and Train workflows below. The annotation list columns and `image_emb` are read; the JPEG bytes are not touched until you ask for them. + +## Evolve + +Lance stores each column independently, so a new column can be appended without rewriting the existing data. The lightest form is a SQL expression: derive the new column from columns that already exist, and Lance computes it once and persists it. 
The example below adds per-frame counts for the two most safety-relevant classes plus a `has_vru` flag, all of which can then be used directly in `where` clauses without recomputing the predicate on every query. + +> **Note:** Mutations require a local copy of the dataset, since the Hub mount is read-only. See the Materialize-a-subset section at the end of this card for a streaming pattern that downloads only the rows and columns you need, or use `hf download` to pull the full split first. + +```python +import lancedb + +db = lancedb.connect("./kitti-2d-detection-lance/data") # local copy required for writes +tbl = db.open_table("train") + +tbl.add_columns({ + "num_cars": "array_length(array_filter(types, x -> x = 'Car'))", + "num_pedestrians": "array_length(array_filter(types, x -> x = 'Pedestrian'))", + "has_vru": "array_has_any(types_present, ['Pedestrian', 'Cyclist'])", +}) +``` + +If the values you want to attach already live in another table — detector predictions on the same frames, LIDAR-derived per-frame features, or human re-annotation — merge them in by joining on `id`: ```python -import lance import pyarrow as pa -ds = lance.dataset("hf://datasets/lance-format/kitti-2d-detection-lance/data/train.lance") -emb_field = ds.schema.field("image_emb") -ref = ds.take([0], columns=["image_emb"]).to_pylist()[0]["image_emb"] -query = pa.array([ref], type=emb_field.type) - -neighbors = ds.scanner( - nearest={"column": "image_emb", "q": query[0], "k": 5, "nprobes": 16, "refine_factor": 30}, - columns=["id", "types_present"], -).to_table().to_pylist() +predictions = pa.table({ + "id": pa.array([0, 1, 2], type=pa.int64()), + "pred_num_cars": pa.array([3, 5, 0], type=pa.int32()), +}) +tbl.merge(predictions, left_on="id") ``` -### LanceDB visual similarity search +The original columns and indices are untouched, so existing code that does not reference the new columns continues to work unchanged. New columns become visible to every reader as soon as the operation commits. 
For column values that require a Python computation (running a fresh detector over the JPEG bytes, deriving alternative embeddings), Lance also provides a batch-UDF API — see the [Lance data evolution docs](https://lance.org/guide/data_evolution/). + +## Train + +Projection lets a training loop read only the columns each step actually needs. LanceDB tables expose this through `Permutation.identity(tbl).select_columns([...])`, which plugs straight into the standard `torch.utils.data.DataLoader` so prefetching, shuffling, and batching behave as in any PyTorch pipeline. For a 2D detector, project the JPEG bytes together with the per-object `bboxes` and `types` lists; everything else (3D annotations, CLIP embeddings) stays on disk until you opt in. ```python import lancedb +from lancedb.permutation import Permutation +from torch.utils.data import DataLoader db = lancedb.connect("hf://datasets/lance-format/kitti-2d-detection-lance/data") tbl = db.open_table("train") -ref = tbl.search().limit(1).select(["image_emb"]).to_list()[0] -query_embedding = ref["image_emb"] - -results = ( - tbl.search(query_embedding) - .metric("cosine") - .select(["id", "types_present"]) - .limit(5) - .to_list() +train_ds = ( + Permutation.identity(tbl) + .select_columns(["image", "bboxes", "types"]) ) +loader = DataLoader(train_ds, batch_size=8, shuffle=True, num_workers=4) + +for batch in loader: + # batch carries only the projected columns; 3D fields and image_emb stay on disk. + # decode the JPEGs, drop DontCare boxes, build target tensors, forward, backward... + ... ``` -## Why Lance? +Switching feature sets is a configuration change: passing `["image_emb", "types_present"]` to `select_columns(...)` on the next run skips JPEG decoding entirely and reads only the cached 512-d vectors plus the deduped class list, which is the right shape for training a lightweight scene classifier or a linear probe on top of frozen CLIP features. 
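
The "drop DontCare boxes, build target tensors" step in the loop above can be sketched in plain Python. This is a minimal illustration, not the card's own pipeline: it assumes each entry of `bboxes` is `[x1, y1, x2, y2]` in pixels and aligned index-by-index with `types`, and both `CLASS_TO_ID` and `build_targets` are hypothetical names introduced here.

```python
# Hypothetical class vocabulary; KITTI also defines Van, Truck, Tram, and others.
CLASS_TO_ID = {"Car": 0, "Pedestrian": 1, "Cyclist": 2}


def build_targets(bboxes, types, class_to_id=CLASS_TO_ID):
    """Drop DontCare (and any unmapped) boxes and map class names to integer ids.

    Assumes each box is [x1, y1, x2, y2] and that bboxes[i] belongs to types[i].
    """
    if len(bboxes) != len(types):
        raise ValueError("bboxes and types must have the same length")
    boxes, labels = [], []
    for box, name in zip(bboxes, types):
        if name == "DontCare" or name not in class_to_id:
            continue  # ignored region or a class outside the vocabulary
        x1, y1, x2, y2 = box
        if x2 <= x1 or y2 <= y1:
            continue  # skip degenerate boxes
        boxes.append([float(x1), float(y1), float(x2), float(y2)])
        labels.append(class_to_id[name])
    return {"boxes": boxes, "labels": labels}


# Made-up row in the shape of the projected columns.
row = {
    "bboxes": [[10, 20, 110, 220], [0, 0, 50, 50], [300, 40, 340, 160]],
    "types": ["Car", "DontCare", "Pedestrian"],
}
targets = build_targets(row["bboxes"], row["types"])
print(targets)
```

The returned lists can be converted to tensors inside a `collate_fn`; keeping the filter in pure Python makes it easy to unit-test separately from JPEG decoding.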
+ +## Versioning + +Every mutation to a Lance dataset, whether it adds a column, merges predictions, or builds an index, commits a new version. Previous versions remain intact on disk. You can list versions and inspect the history directly from the Hub copy; creating new tags requires a local copy since tags are writes. + +```python +import lancedb + +db = lancedb.connect("hf://datasets/lance-format/kitti-2d-detection-lance/data") +tbl = db.open_table("train") + +print("Current version:", tbl.version) +print("History:", tbl.list_versions()) +print("Tags:", tbl.tags.list()) +``` + +Once you have a local copy, tag a version for reproducibility: + +```python +local_db = lancedb.connect("./kitti-2d-detection-lance/data") +local_tbl = local_db.open_table("train") +local_tbl.tags.create("kitti-clip-vitb32-v1", local_tbl.version) +``` + +A tagged version can be opened by name, or any version reopened by its number, against either the Hub copy or a local one: + +```python +tbl_v1 = db.open_table("train", version="kitti-clip-vitb32-v1") +tbl_v5 = db.open_table("train", version=5) +``` + +Pinning supports two workflows. A perception service locked to `kitti-clip-vitb32-v1` keeps returning stable retrieval results while the dataset evolves in parallel — newly added detector predictions or alternative embeddings do not change what the tag resolves to. A detection-training experiment pinned to the same tag can be rerun later against the exact same frames and annotations, so changes in metrics reflect model changes rather than data drift. Neither workflow needs shadow copies or external manifest tracking. + +## Materialize a subset + +Reads from the Hub are lazy, so exploratory queries only transfer the columns and row groups they touch. Mutating operations (Evolve, tag creation) need a writable backing store, and a training loop benefits from a local copy with fast random access. Both can be served by a subset of the dataset rather than the full split. 
The pattern is to stream a filtered query through `.to_batches()` into a new local table; only the projected columns and matching row groups cross the wire, and the bytes never fully materialize in Python memory. The filter below carves out a vulnerable-road-user training set — frames that contain at least one pedestrian or cyclist — and writes them to a local LanceDB database. + +```python +import lancedb + +remote_db = lancedb.connect("hf://datasets/lance-format/kitti-2d-detection-lance/data") +remote_tbl = remote_db.open_table("train") + +batches = ( + remote_tbl.search() + .where("array_has_any(types_present, ['Pedestrian', 'Cyclist'])") + .select(["id", "image", "bboxes", "types", "num_objects", "types_present", "image_emb"]) + .to_batches() +) + +local_db = lancedb.connect("./kitti-vru-subset") +local_db.create_table("train", batches) +``` -- One dataset for images + 2D + 3D annotations + embeddings + indices — no parallel `image_2/` and `label_2/` folders. -- On-disk vector and label-list indices live next to the data, so search and class-based filtering work on local copies and on the Hub. -- Schema evolution: add columns (LIDAR features, alternative embeddings, model predictions) without rewriting the data. +The resulting `./kitti-vru-subset` is a first-class LanceDB database. Every snippet in the Evolve, Train, and Versioning sections above works against it by swapping `hf://datasets/lance-format/kitti-2d-detection-lance/data` for `./kitti-vru-subset`. ## Source & license diff --git a/docs/datasets/laion-1m.mdx b/docs/datasets/laion-1m.mdx index 2f1d2e4..7458a11 100644 --- a/docs/datasets/laion-1m.mdx +++ b/docs/datasets/laion-1m.mdx @@ -1,7 +1,7 @@ --- title: "LAION-1M" sidebarTitle: "LAION-1M" -description: "A lance dataset of LAION image-text corpus (~1M rows) with inline JPEG bytes, CLIP embeddings (img_emb), and full metadata available directly from the Hub: hf://datasets/lance-format/laion-1m/data/train.lance." 
+description: "A Lance-formatted slice of the LAION image-text corpus (~1M rows) with inline JPEG bytes, CLIP image embeddings (img_emb), full metadata, and a pre-built ANN index — all available directly from the Hub at…" --- -A lance dataset of LAION image-text corpus (~1M rows) with inline JPEG bytes, CLIP embeddings (`img_emb`), and full metadata available directly from the Hub: `hf://datasets/lance-format/laion-1m/data/train.lance`. +A Lance-formatted slice of the [LAION](https://laion.ai/blog/laion-5b/) image-text corpus (~1M rows) with inline JPEG bytes, CLIP image embeddings (`img_emb`), full metadata, and a pre-built ANN index — all available directly from the Hub at `hf://datasets/lance-format/laion-1m/data/train.lance`. +## Key features -## Key Features +- **Inline JPEG bytes** in the `image` column — no sidecar files, no image folders. +- **Pre-computed CLIP image embeddings** (`img_emb`, 768-dim) with a bundled `IVF_PQ` index for similarity search. +- **Full LAION metadata** — captions, URLs, NSFW flags, EXIF, dimensions, similarity scores. +- **One columnar dataset** — scan metadata cheaply, then fetch image bytes only for the rows you want. -- **Images stored inline** – the `image` column is binary data, so sampling/exporting images never leaves Lance. -- **Prebuilt ANN index** – `img_emb` ships with IVF_PQ for instant similarity search. -- **Metadata rich** – captions, URLs, NSFW flags, EXIF, dimensions, similarity scores, etc. -- **Lance<>HF integration** – access via `datasets` or connect with Lance for ANN search, image export, and any operation that needs the vector index or binary blobs. 
+## Splits + +`train.lance` + +## Schema + +| Column | Type | Notes | +|---|---|---| +| `key` | `int64` | Row key (natural join key for merges) | +| `image` | `large_binary` | Inline JPEG bytes | +| `image_path` | `string` | Original filename | +| `caption` | `string` | Image caption | +| `url` | `string` | Source URL | +| `NSFW` | `int64` | 0 = safe, 1 = NSFW | +| `LICENSE` | `string` | Per-row license tag | +| `similarity` | `float64` | CLIP image–text cosine similarity | +| `width`, `height` | `int64` | Image dimensions | +| `original_width`, `original_height` | `int64` | Original dimensions before resize | +| `exif`, `md5`, `status`, `error_message` | `string` | Provenance / metadata | +| `img_emb` | `fixed_size_list` | CLIP image embedding | + +## Pre-built indices + +- `IVF_PQ` on `img_emb` — vector similarity search (L2) + +## Why Lance? + +1. **Blazing Fast Random Access**: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation. +2. **Native Multimodal Support**: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search. +3. **Native Index Support**: Lance comes with fast, on-disk, scalable vector and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them. +4. **Efficient Data Evolution**: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time. +5. **Versatile Querying**: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes. +6. 
**Data Versioning**: Every mutation commits a new version; previous versions remain intact on disk. Tags pin a snapshot by name, so retrieval systems and training runs can reproduce against an exact slice of history. ## Load with `datasets.load_dataset` +You can load Lance datasets via the standard HuggingFace `datasets` interface, suitable if your pipeline already speaks `Dataset` / `IterableDataset` or you want a quick streaming sample without installing anything Lance-specific. + ```python import datasets -hf_ds = datasets.load_dataset( - "lance-format/laion-1m", - split="train", - streaming=True -) -# Take first three rows and print captions +hf_ds = datasets.load_dataset("lance-format/laion-1m", split="train", streaming=True) for row in hf_ds.take(3): print(row["caption"]) ``` -## Load with Lance +## Load with LanceDB -Use Lance for ANN search, image export, and any operation that needs the vector index or binary blobs: - -```python -import lance - -ds = lance.dataset("hf://datasets/lance-format/laion-1m/data/train.lance") -print(ds.count_rows()) -``` - -These tables can also be consumed by [LanceDB](https://lancedb.github.io/lancedb/), the multimodal lakehouse and embedded search library built on top of Lance, for simplified vector search and other queries. +LanceDB is the embedded retrieval library built on top of the Lance format ([docs](https://lancedb.com/docs)), and is the interface most users interact with. It wraps the dataset as a queryable table with search and filter builders, and is the entry point used by the Search, Curate, Evolve, and Versioning examples below. 
```python import lancedb -db = lancedb.connect("hf://datasets/lance-format/laion-subset/data") +db = lancedb.connect("hf://datasets/lance-format/laion-1m/data") tbl = db.open_table("train") -print(f"LanceDB table opened with {len(tbl)} image-text pairs") +print(len(tbl)) ``` -> **⚠️ HuggingFace Streaming Note** -> - Download the dataset locally (`huggingface-cli download lance-format/laion-1m --repo-type dataset --local-dir ./laion`) for heavy usage, then point Lance at `./laion` to use the IVF_PQ index. +## Load with Lance +`pylance` is the Python binding for the Lance format and works directly with the format's lower-level APIs. Reach for it when you want to inspect or operate on dataset internals — schema, scanner, fragments, and the list of pre-built indices. -## Why Lance? +```python +import lance -- Optimized for AI workloads: Lance keeps multimodal data and vector search-ready storage in the same columnar format designed for accelerator-era retrieval (see [lance.org](https://lance.org)). -- Images + embeddings + metadata travel as one tabular dataset. -- On-disk, scalable ANN index -- Schema evolution lets you add new features/columns (moderation tags, embeddings, etc.) without rewriting the raw data. +ds = lance.dataset("hf://datasets/lance-format/laion-1m/data/train.lance") +print(ds.count_rows(), ds.schema.names) +print(ds.list_indices()) +``` -## Quick Start (Lance) +> **Tip — for production use, download locally first.** Streaming from the Hub works for exploration, but heavy random access and ANN search are far faster against a local copy: +> ```bash +> hf download lance-format/laion-1m --repo-type dataset --local-dir ./laion-1m +> ``` +> Then point Lance or LanceDB at `./laion-1m/data`. -### Inspecting Existing Indices +## Search -This dataset comes with a built in vector (IVF) index for image embeddings. 
You can inspect the prebuilt indices on the dataset: +The bundled `IVF_PQ` index on `img_emb` makes approximate-nearest-neighbor search a single call. In production you would encode a user prompt or query image through CLIP at runtime and pass the resulting 768-d vector to `tbl.search(...)`. The example below uses the embedding from row 42 as a runnable stand-in. ```python -import lance - -dataset = lance.dataset("hf://datasets/lance-format/laion-1m/data/train.lance") +import lancedb -# List all indices -indices = dataset.list_indices() -print(indices) -``` +db = lancedb.connect("hf://datasets/lance-format/laion-1m/data") +tbl = db.open_table("train") -While this dataset comes with pre-built indices, you can also create your own custom indices if needed. For example: +query = ( + tbl.search() + .select(["img_emb"]) + .limit(1) + .offset(42) + .to_list()[0]["img_emb"] +) -```python -# ds is a local Lance dataset -ds.create_index( - "img_emb", - index_type="IVF_PQ", - num_partitions=256, - num_sub_vectors=96, - replace=True, +hits = ( + tbl.search(query) + .metric("L2") + .select(["caption", "url", "similarity"]) + .limit(10) + .to_list() ) +for r in hits: + print(f"{r['similarity']:.3f} | {r['caption'][:80]}") ``` -```python -# ds is a local Lance dataset -ds.create_fts_index("caption") -``` +Tune `metric`, `nprobes`, and `refine_factor` to trade recall against latency for your workload. -## Quick Start (Lance) +## Curate -```python -import lance -import pyarrow as pa +Building a focused subset usually means combining similarity with metadata filters. Lance evaluates both inside a single query, so the candidate set comes back already filtered. The example below finds images visually similar to a seed row and restricts the result to safe-rated, high-resolution rows in one call. The bounded `.limit(500)` keeps the output small enough to inspect or hand off. 
-lance_ds = lance.dataset("hf://datasets/lance-format/laion-1m/data/train.lance") - -# Vector search via img_emb IVF_PQ index -emb_field = lance_ds.schema.field("img_emb") -query = pa.array(list(range(768)), type=emb_field.type) - -neighbors = lance_ds.scanner( - nearest={ - "column": emb_field.name, - "q": query[0], - "k": 6, - "nprobes": 16, - "refine_factor": 30, - }, - columns=["caption", "url", "similarity"], -).to_table().to_pylist() -``` +```python +import lancedb -## Storing & Retrieving Multimodal Data +db = lancedb.connect("hf://datasets/lance-format/laion-1m/data") +tbl = db.open_table("train") -```python -from pathlib import Path +seed = ( + tbl.search() + .select(["img_emb", "caption"]) + .limit(1) + .offset(42) + .to_list()[0] +) -rows = lance_ds.take([0, 1], columns=["image", "caption"]).to_pylist() -for idx, row in enumerate(rows): - Path("samples").mkdir(exist_ok=True) - with open(f"samples/{idx}.jpg", "wb") as f: - f.write(row["image"]) +candidates = ( + tbl.search(seed["img_emb"]) + .where('"NSFW" = 0 AND similarity > 0.3 AND width >= 512', prefilter=True) + .select(["key", "url", "caption", "similarity"]) + .limit(500) + .to_list() +) +print(f"{len(candidates)} candidates around: {seed['caption'][:60]}") ``` -Images are stored inline as binary columns (regular Lance binary, not the special blob handle used in OpenVid). They behave like any other column—scan captions without touching `image`, then `take()` when you want the bytes. +The result is a plain list of dictionaries, ready to inspect, persist as a manifest of row keys, or feed into the Evolve and Train workflows below. -## Dataset Schema +## Evolve -Core fields: -- `image_path`, `image` -- `caption`, `url` -- `NSFW` (uppercase), `similarity`, `LICENSE`, `key`, `status`, `error_message` -- `width`, `height`, `original_width`, `original_height` -- `exif`, `md5` -- `img_emb` +Lance stores each column independently, so a new column can be appended without rewriting the existing data. 
The lightest form is a SQL expression: derive the new column from columns that already exist, and Lance computes it once and persists it. The example below adds a precomputed `aspect_ratio` and an `is_high_res` flag, either of which can then be used directly in `where` clauses without recomputing the predicate on every query. +> **Note**: Mutations require a local copy of the dataset, since the Hub mount is read-only. See the Materialize-a-subset section at the end of this card for a streaming pattern that downloads only the rows and columns you need, or use `hf download` to pull the full corpus. -## Usage Examples +```python +import lancedb -### 1. Browse metadata +db = lancedb.connect("./laion-1m/data") # local copy required for writes +tbl = db.open_table("train") -```python -scanner = ds.scanner(columns=["caption", "url", "similarity"], limit=5) -for row in scanner.to_table().to_pylist(): - print(row) +tbl.add_columns({ + "aspect_ratio": "CAST(width AS DOUBLE) / CAST(height AS DOUBLE)", + "is_high_res": "width >= 512 AND height >= 512", +}) ``` -### 2. Export images +If the values you want to attach already live in another table (offline labels, classifier predictions, aesthetic scores), merge them in by joining on the `key` column: ```python -rows = ds.take(range(3), columns=["image", "caption"]).to_pylist() -for i, row in enumerate(rows): - with open(f"sample_{i}.jpg", "wb") as f: - f.write(row["image"]) +import pyarrow as pa + +labels = pa.table({ + "key": pa.array([0, 1, 2]), + "aesthetic_score": pa.array([7.1, 6.4, 8.9]), +}) +tbl.merge(labels, on="key") ``` -### 3. Vector similarity search +The original columns and indices are untouched, so existing code that does not reference the new columns continues to work unchanged. New columns become visible to every reader as soon as the operation commits. 
For column values that require a Python computation (e.g., running an embedding model over the image bytes), Lance provides a batch-UDF API in the underlying library; see the [Lance data evolution docs](https://lance.org/guide/data_evolution/) for that pattern. -```python -emb_field = ds.schema.field("img_emb") -ref = ds.take([123], columns=["img_emb"]).to_pylist()[0] -query = pa.array([ref["img_emb"]], type=emb_field.type) - -neighbors = ds.scanner( - nearest={ - "column": emb_field.name, - "q": query[0], - "k": 6, - "nprobes": 16, - "refine_factor": 30, - }, - columns=["caption", "url", "similarity"], -).to_table().to_pylist() -``` +## Train -### LanceDB Vector Similarity Search +Projection lets a training loop read only the columns each step actually needs. LanceDB tables expose this through `Permutation.identity(tbl).select_columns([...])`, which plugs straight into the standard `torch.utils.data.DataLoader` so prefetching, shuffling, and batching behave as in any PyTorch pipeline. Columns added in the Evolve section above cost nothing per batch until they are explicitly projected. ```python import lancedb +from lancedb.permutation import Permutation +from torch.utils.data import DataLoader db = lancedb.connect("hf://datasets/lance-format/laion-1m/data") -query_embedding = list(range(768)) +tbl = db.open_table("train") -results = tbl.search(query_embedding) \ - .limit(5) \ - .to_list() +train_ds = Permutation.identity(tbl).select_columns(["image", "caption"]) +loader = DataLoader(train_ds, batch_size=256, shuffle=True, num_workers=4) +for batch in loader: + # batch carries only the projected columns; img_emb and the metadata columns stay on disk. + # decode the JPEG bytes, tokenize the captions, forward, backward... + ... ``` -### LanceDB Full-Text Search +Switching feature sets is a configuration change: passing `["img_emb", "caption"]` to `select_columns(...)` on the next run reads only the cached 768-d vectors and the captions, with no data movement or shard reorganization.
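
One way to make that configuration change explicit is a small registry that maps a feature-set name to its column list and checks the list against the table's columns before training starts. The sketch below is illustrative only: `FEATURE_SETS` and `columns_for` are hypothetical names, and `SCHEMA_COLUMNS` is hard-coded from the schema table in this card rather than read from the live table.

```python
# Column names copied from the schema table in this card.
SCHEMA_COLUMNS = {
    "key", "image", "image_path", "caption", "url", "NSFW", "LICENSE",
    "similarity", "width", "height", "original_width", "original_height",
    "exif", "md5", "status", "error_message", "img_emb",
}

# Hypothetical feature-set names; only the column lists matter.
FEATURE_SETS = {
    "pixels": ["image", "caption"],       # decode JPEGs, tokenize captions
    "frozen_clip": ["img_emb", "caption"],  # skip decoding, reuse cached vectors
}


def columns_for(feature_set, available=SCHEMA_COLUMNS):
    """Return the projection for a feature set, failing early on unknown columns."""
    columns = FEATURE_SETS[feature_set]
    missing = [c for c in columns if c not in available]
    if missing:
        raise KeyError(f"columns not in table: {missing}")
    return columns


print(columns_for("pixels"))
print(columns_for("frozen_clip"))
```

The returned list is what would be handed to `select_columns(...)`. A column added later through Evolve has to be present in `available` first, which in practice could be built from the table's schema instead of a fixed set.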
+ +## Versioning + +Every mutation to a Lance dataset, whether it adds a column, merges labels, or builds an index, commits a new version. Previous versions remain intact on disk. You can list versions and inspect the history directly from the Hub copy; creating new tags requires a local copy since tags are writes. ```python import lancedb @@ -212,61 +223,50 @@ import lancedb db = lancedb.connect("hf://datasets/lance-format/laion-1m/data") tbl = db.open_table("train") -results = tbl.search("dog running") \ - .select(["caption", "url", "similarity"]) \ - .limit(10) \ - .to_list() +print("Current version:", tbl.version) +print("History:", tbl.list_versions()) +print("Tags:", tbl.tags.list()) ``` -## Dataset Evolution +Once you have a local copy, tag a version for reproducibility: + +```python +local_db = lancedb.connect("./laion-1m/data") +local_tbl = local_db.open_table("train") +local_tbl.tags.create("aesthetic-v1", local_tbl.version) +``` -Lance supports flexible schema and data evolution ([docs](https://lance.org/guide/data_evolution/)). You can add/drop columns, backfill with SQL or Python, rename fields, or change data types without rewriting the whole dataset. In practice this lets you: -- Introduce fresh metadata (moderation labels, embeddings, quality scores) as new signals become available. -- Add new columns to existing datasets without re-exporting terabytes of video. -- Adjust column names or shrink storage (e.g., cast embeddings to float16) while keeping previous snapshots queryable for reproducibility. 
+A tagged version can be opened by name, or any version reopened by its number, against either the Hub copy or a local one: ```python -import lance -import pyarrow as pa -import numpy as np +tbl_v1 = db.open_table("train", version="aesthetic-v1") +tbl_v5 = db.open_table("train", version=5) +``` -# Assumes you ran the export to Lance example above to store a local subset of the data -# ds = lance.dataset("./laion_1m_local") +Pinning supports two workflows. A retrieval system locked to `aesthetic-v1` keeps returning stable results while the dataset evolves in parallel; newly added columns or labels do not change what the tag resolves to. A training experiment pinned to the same tag can be rerun later against the exact same data, so changes in metrics reflect model changes rather than data drift. Neither workflow needs shadow copies or external manifest tracking. -# 1. Add a schema-only column (data to be added later) -dataset.add_columns(pa.field("moderation_label", pa.string())) +## Materialize a subset -# 2. Add a column with data backfill using a SQL expression -dataset.add_columns( - { - "moderation_label": "case WHEN \"NSFW\" > 0.5 THEN 'review' ELSE 'ok' END" - } -) +Reads from the Hub are lazy, so exploratory queries only transfer the columns and row groups they touch. Mutating operations (Evolve, tag creation) need a writable backing store, and a training loop benefits from a local copy with fast random access. Both can be served by a subset of the dataset rather than the full corpus. The pattern is to stream a filtered query through `.to_batches()` into a new local table; only the projected columns and matching row groups cross the wire, and the bytes never fully materialize in Python memory. -# 3. 
Generate rich columns via Python batch UDFs -@lance.batch_udf() -def random_embedding(batch): - arr = np.random.rand(batch.num_rows, 128).astype("float32") - return pa.RecordBatch.from_arrays( - [pa.FixedSizeListArray.from_arrays(arr.ravel(), 128)], - names=["embedding"], - ) +```python +import lancedb -dataset.add_columns(random_embedding) +remote_db = lancedb.connect("hf://datasets/lance-format/laion-1m/data") +remote_tbl = remote_db.open_table("train") -# 4. Bring in offline annotations with merge -labels = pa.table({ - "id": pa.array([1, 2, 3]), - "label": pa.array(["horse", "rabbit", "cat"]), -}) -dataset.merge(labels, "id") +batches = ( + remote_tbl.search() + .where('"NSFW" = 0 AND similarity > 0.35 AND width >= 512') + .select(["key", "image", "caption", "url", "img_emb"]) + .to_batches() +) -# 5. Rename or cast columns as needs change -dataset.alter_columns({"path": "quality_bucket", "name": "quality_tier"}) -dataset.alter_columns({"path": "embedding", "data_type": pa.list_(pa.float16(), 128)}) +local_db = lancedb.connect("./laion-subset") +local_db.create_table("train", batches) ``` -These operations are automatically versioned, so prior experiments can still point to earlier versions while the dataset keeps evolving. +The resulting `./laion-subset` is a first-class LanceDB database. Every snippet in the Evolve, Train, and Versioning sections above works against it by swapping `hf://datasets/lance-format/laion-1m/data` for `./laion-subset`. ## Citation @@ -281,4 +281,4 @@ These operations are automatically versioned, so prior experiments can still poi ## License -Content inherits LAION’s original licensing and safety guidelines. Review [LAION policy](https://laion.ai/blog/laion-5b/) before downstream use. +Content inherits LAION's original licensing and safety guidelines. Review [LAION policy](https://laion.ai/blog/laion-5b/) before downstream use. 
diff --git a/docs/datasets/lerobot-pusht.mdx b/docs/datasets/lerobot-pusht.mdx index 2030efb..609bd9f 100644 --- a/docs/datasets/lerobot-pusht.mdx +++ b/docs/datasets/lerobot-pusht.mdx @@ -1,7 +1,7 @@ --- title: "LeRobot PushT" sidebarTitle: "LeRobot PushT" -description: "Lance-formatted version of lerobot/pusht — the canonical PushT benchmark from the Diffusion Policy paper — packaged using the same three-table layout as the existing lance-format/lerobot-xvla-soft-fold so consumers can flip between datasets without…" +description: "A Lance-formatted version of lerobot/pusht — the canonical PushT benchmark from the Diffusion Policy paper — packaged using the same three-table layout as lance-format/lerobot-xvla-soft-fold so consumers can flip between datasets without changing…" --- -Lance-formatted version of [`lerobot/pusht`](https://huggingface.co/datasets/lerobot/pusht) — the canonical PushT benchmark from the [Diffusion Policy paper](https://diffusion-policy.cs.columbia.edu/) — packaged using the same three-table layout as the existing [`lance-format/lerobot-xvla-soft-fold`](https://huggingface.co/datasets/lance-format/lerobot-xvla-soft-fold) so consumers can flip between datasets without changing code. +A Lance-formatted version of [`lerobot/pusht`](https://huggingface.co/datasets/lerobot/pusht) — the canonical PushT benchmark from the [Diffusion Policy paper](https://diffusion-policy.cs.columbia.edu/) — packaged using the same three-table layout as [`lance-format/lerobot-xvla-soft-fold`](https://huggingface.co/datasets/lance-format/lerobot-xvla-soft-fold) so consumers can flip between datasets without changing code. Available directly from the Hub at `hf://datasets/lance-format/lerobot-pusht-lance/data`. + +## Key features + +- **Three-table layout** — `frames`, `episodes`, `videos` — so frame-level training, episode-level trajectory work, and raw video access live side-by-side without scattered parquet shards or sidecar MP4 directories. 
+- **Inline MP4 segments** in `episodes.lance` (one blob per camera, with `from_timestamp` / `to_timestamp` bounds) and full source MP4s in `videos.lance`, all surfaced as lazy `BlobFile` handles via `take_blobs` so metadata scans never read the bytes. +- **Frame-level observations and actions** in `frames.lance` with stable `episode_index`, `frame_index`, and `index` columns for joining or temporal iteration. +- **Schema-evolution friendly** — add alternate camera streams, language annotations, or model predictions later without rewriting the data. ## Tables -The dataset is published as three Lance tables under `data/`: +| Table | Rows ~ | Purpose | +|---|---|---| +| `frames.lance` | one row per frame | Per-frame observations, actions, episode/task indices | +| `episodes.lance` | one row per episode | Full per-episode trajectories plus per-camera MP4 segment blobs and timestamp bounds | +| `videos.lance` | one row per source MP4 | Raw source video blobs and file-level provenance (path, size, sha256) | -| Table | Purpose | -|---|---| -| `frames.lance` | One row per frame — observations, actions, episode index, task index. | -| `videos.lance` | One row per source MP4 — full per-camera video stored as an inline blob. | -| `episodes.lance` | One row per episode — full timestamps + actions + per-camera video segment blobs. | +Use `frames.lance` for low-level training (loss-per-timestep, state-conditioned policies). Use `episodes.lance` when you need the full trajectory and the matching video segments together. Use `videos.lance` when you want direct access to the original encoded video files. -Use `frames.lance` for low-level training (loss-per-timestep), `episodes.lance` when you need the full trajectory + matching video segments, and `videos.lance` when you want to pull entire raw videos by camera. 
+## Schemas -## Quick start +### `frames.lance` -```python -import lance +| Column | Type | Notes | +|---|---|---| +| `observation_state` | `list` | Robot state vector for that frame | +| `action` | `list` | Action vector for that frame | +| `timestamp` | `float` | Canonical frame timestamp (seconds) | +| `frame_index` | `int64` | Frame index within episode | +| `episode_index` | `int64` | Parent episode id | +| `index` | `int64` | Global frame index | +| `task_index` | `int64` | Task id | + +### `episodes.lance` + +| Column | Type | Notes | +|---|---|---| +| `episode_index` | `int64` | Episode id | +| `task_index` | `int64` | Task id | +| `fps` | `int32` | Frame rate of the episode video segments | +| `timestamps` | `list` | Per-frame timestamps | +| `actions` | `list>` | Per-frame action vectors | +| `observation_state` | `list>` | Per-frame robot state vectors | +| `_video_blob` | `large_binary` (blob-encoded) | Inline MP4 segment for each camera, read lazily via `take_blobs` | +| `_from_timestamp` | `float64` | Segment start time | +| `_to_timestamp` | `float64` | Segment end time | + +### `videos.lance` + +| Column | Type | Notes | +|---|---|---| +| `camera_angle` | `string` | Camera key | +| `chunk_index`, `file_index` | `int32` | IDs parsed from the source path | +| `relative_path`, `filename` | `string` | Provenance | +| `file_size_bytes` | `int64` | Source MP4 size | +| `sha256` | `string` | SHA256 of the MP4 bytes | +| `video_blob` | `large_binary` (blob-encoded) | Raw source MP4 bytes | + +## Pre-built indices + +None bundled. Build indices on a local copy if a workload calls for them — e.g., a `BTREE` on `frames.episode_index` for fast episode lookup, or a vector index after attaching observation embeddings via Evolve. + +## Why Lance? + +1. **Blazing Fast Random Access**: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation. +2. 
**Native Multimodal Support**: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search. +3. **Native Index Support**: Lance comes with fast, on-disk, scalable vector and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them. +4. **Efficient Data Evolution**: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time. +5. **Versatile Querying**: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes. +6. **Data Versioning**: Every mutation commits a new version; previous versions remain intact on disk. Tags pin a snapshot by name, so retrieval systems and training runs can reproduce against an exact slice of history. -frames = lance.dataset("hf://datasets/lance-format/lerobot-pusht-lance/data/frames.lance") -videos = lance.dataset("hf://datasets/lance-format/lerobot-pusht-lance/data/videos.lance") -episodes = lance.dataset("hf://datasets/lance-format/lerobot-pusht-lance/data/episodes.lance") +## Load with `datasets.load_dataset` -print("frames:", frames.count_rows()) -print("videos:", videos.count_rows()) -print("episodes:", episodes.count_rows()) +You can load Lance datasets via the standard HuggingFace `datasets` interface, suitable when your pipeline already speaks `Dataset` / `IterableDataset` or you want a quick streaming sample. Each Lance table is a separate `datasets` config. 
+ +```python +import datasets + +hf_ds = datasets.load_dataset("lance-format/lerobot-pusht-lance", split="frames", streaming=True) +for row in hf_ds.take(3): + print(row["episode_index"], row["frame_index"], row["action"]) ``` ## Load with LanceDB -These tables can also be consumed by [LanceDB](https://lancedb.github.io/lancedb/), the multimodal lakehouse and embedded search library built on top of Lance, for simplified vector search and other queries. Each `.lance` file in `data/` is a table — open by name. +LanceDB is the embedded retrieval library built on top of the Lance format ([docs](https://lancedb.com/docs)), and is the interface most users interact with. Each `.lance` file in `data/` is a table — open by name. The same handles are used by the Search, Curate, Evolve, Train, Versioning, and Materialize-a-subset sections below. ```python import lancedb @@ -50,55 +105,208 @@ import lancedb db = lancedb.connect("hf://datasets/lance-format/lerobot-pusht-lance/data") frames = db.open_table("frames") -videos = db.open_table("videos") episodes = db.open_table("episodes") +videos = db.open_table("videos") +print(len(frames), len(episodes), len(videos)) +``` + +## Load with Lance + +`pylance` is the Python binding for the Lance format and works directly with the format's lower-level APIs. Reach for it when you want to inspect dataset internals — schema, scanner, fragments, the list of pre-built indices — or when you need the blob-level `take_blobs` entry point that streams MP4 bytes lazily from inline storage. 
+ +```python +import lance + +ds = lance.dataset("hf://datasets/lance-format/lerobot-pusht-lance/data/frames.lance") +print(ds.count_rows(), ds.schema.names) +print(ds.list_indices()) +``` + +> **Tip — for production use, download locally first.** Streaming from the Hub works for exploration, but heavy random access to video segments and any kind of indexed search are dramatically faster against a local copy: +> ```bash +> hf download lance-format/lerobot-pusht-lance --repo-type dataset --local-dir ./lerobot-pusht +> ``` +> Then point Lance or LanceDB at `./lerobot-pusht/data`. + +## Search + +PushT does not ship a vector index out of the box — observation states are low-dimensional and most robotics workflows look up by index rather than by similarity. The bundled identifier columns (`episode_index`, `task_index`, `frame_index`) make exact lookups a single filtered scan. The example below pulls the first few frames of episode 0 from the frames table. + +```python +import lancedb -print("frames:", len(frames)) -print("videos:", len(videos)) -print("episodes:", len(episodes)) +db = lancedb.connect("hf://datasets/lance-format/lerobot-pusht-lance/data") +frames = db.open_table("frames") + +slice_ = ( + frames.search() + .where("episode_index = 0 AND frame_index < 10", prefilter=True) + .select(["episode_index", "frame_index", "timestamp", "action", "observation_state"]) + .limit(10) + .to_list() +) +for r in slice_: + print(r["frame_index"], r["timestamp"], r["action"]) ``` -### LanceDB query example +For similarity-style search across states or actions, attach an embedding column via Evolve and build an `IVF_PQ` index on it (see Evolve below). For visual similarity over rendered frames, the pre-extracted-frames pattern in Train below produces a table that can carry a learned image embedding alongside the pixels. 
+ +## Curate + +A typical curation pass for a robotics workflow starts with an episode-level filter — pick episodes with a particular task, length, or initial condition — and then drops down to the frames within those episodes. Stacking predicates inside a single filtered scan keeps the result small and explicit, and the bounded `.limit(...)` makes it cheap to inspect. ```python import lancedb db = lancedb.connect("hf://datasets/lance-format/lerobot-pusht-lance/data") -tbl = db.open_table("frames") - -# Browse a few frames from the first episode -results = ( - tbl.search() - .where("episode_index = 0") - .select(["episode_index", "frame_index", "timestamp"]) - .limit(5) +episodes = db.open_table("episodes") +frames = db.open_table("frames") + +# Pick a handful of episodes for the default task. +ep_rows = ( + episodes.search() + .where("task_index = 0", prefilter=True) + .select(["episode_index", "fps", "observation_images_image_from_timestamp", + "observation_images_image_to_timestamp"]) + .limit(10) + .with_row_id(True) + .to_list() +) +ep_ids = [r["episode_index"] for r in ep_rows] + +# Pull the frames belonging to those episodes for the next stage. +frame_rows = ( + frames.search() + .where(f"episode_index IN ({', '.join(map(str, ep_ids))})", prefilter=True) + .select(["episode_index", "frame_index", "timestamp", "action", "observation_state"]) + .limit(2000) .to_list() ) -for row in results: - print(row) +print(f"{len(ep_rows)} episodes, {len(frame_rows)} frames selected") ``` -## Pull a video segment for one episode +Neither scan reads any video bytes. The MP4 segments live in the blob-encoded `_video_blob` columns and stay on disk until something explicitly asks for them. + +## Evolve + +Lance stores each column independently, so a new column can be appended without rewriting the existing data. The lightest form is a SQL expression: derive the new column from columns that already exist, and Lance computes it once and persists it. 
The example below adds an `action_magnitude` and a `large_action` flag to the frames table, either of which can then be used directly in `where` clauses. + +> **Note:** Mutations require a local copy of the dataset, since the Hub mount is read-only. See the Materialize-a-subset section at the end of this card for a streaming pattern that downloads only the rows and columns you need. ```python -from pathlib import Path -import lance +import lancedb -episodes = lance.dataset("hf://datasets/lance-format/lerobot-pusht-lance/data/episodes.lance") -row = episodes.take([0]).to_pylist()[0] +db = lancedb.connect("./lerobot-pusht/data") # local copy required for writes +frames = db.open_table("frames") -# The episode row carries one ``_video_blob`` per camera angle. -for col, value in row.items(): - if col.endswith("_video_blob") and value: - Path(f"{col}.mp4").write_bytes(value) - print(f"saved {col}.mp4 ({len(value)/1e6:.1f} MB)") +frames.add_columns({ + "action_magnitude": "SQRT(action[1] * action[1] + action[2] * action[2])", + "large_action": "SQRT(action[1] * action[1] + action[2] * action[2]) > 5.0", +}) ``` -## Why Lance? +If the values you want to attach already live in another table (offline reward labels, classifier predictions, learned observation embeddings), merge them in by joining on the appropriate key — `index` for frames or `episode_index` for episodes: + +```python +import pyarrow as pa + +rewards = pa.table({ + "index": pa.array([0, 1, 2]), + "reward_to_go": pa.array([1.4, 1.3, 1.2]), +}) +frames.merge(rewards, on="index") +``` + +The original columns and the inline video blobs are untouched, so existing code that does not reference the new columns continues to work unchanged. For column values that require a Python computation (e.g., running a visual encoder over the decoded video frames), Lance provides a batch-UDF API — see the [Lance data evolution docs](https://lance.org/guide/data_evolution/). 
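As a plain-Python cross-check of the two SQL expressions above, the sketch below computes the same quantities for a single action vector. It assumes the SQL list subscripts are 1-based (so `action[1]` and `action[2]` map to Python's `action[0]` and `action[1]`) and that only the first two action components matter. The helper names are illustrative and are not part of Lance or LanceDB.

```python
import math


def action_magnitude(action):
    """Euclidean norm of the first two action components."""
    # SQL `action[1]` and `action[2]` correspond to Python `action[0]` and
    # `action[1]` under the assumed 1-based SQL list indexing.
    x, y = action[0], action[1]
    return math.sqrt(x * x + y * y)


def large_action(action, threshold=5.0):
    """Mirror of the `large_action` flag: magnitude strictly above the threshold."""
    return action_magnitude(action) > threshold


print(action_magnitude([3.0, 4.0]))  # 5.0
print(large_action([3.0, 4.0]))      # False (the comparison is strict)
```

Because the comparison is strict, an action whose magnitude is exactly 5.0 is not flagged.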
+ +## Train + +A common pattern for vision-conditioned policy training is to pre-extract decoded frame pixels once into a derived LanceDB table — one row per frame, with the per-frame `action` and `observation_state` already joined in — and train against that table with the regular projection-based dataloader. `take_blobs` is the mechanism that makes the extraction step tractable: each episode's MP4 segment is randomly addressable in `episodes.lance` (the `from_timestamp` / `to_timestamp` columns give the segment bounds), so the pass can subset bytes on demand and write decoded frames into a fresh table without an external file store. Other workflows project the `_video_blob` columns from `episodes.lance` directly and decode at the batch boundary, or skip pixels entirely and train a state-only policy on `frames.lance` — the right shape is workload-specific. The actual training loop is the same `Permutation.identity(tbl).select_columns(...)` snippet in every case; only the source table and the column list change. 
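Much of the bookkeeping in such an extraction pass is timestamp arithmetic. The sketch below maps a frame timestamp to a frame number inside an episode's segment. It assumes the frame timestamp counts seconds from the start of the episode and that the segment spans `to_ts - from_ts` seconds. The helper name and the example numbers are hypothetical and do not come from the dataset.

```python
def frame_number_in_segment(timestamp_s, fps, from_ts, to_ts):
    """Map an episode-relative timestamp to a frame number inside the segment.

    Assumes `timestamp_s` is measured from the start of the episode and that
    the segment lasts `to_ts - from_ts` seconds (assumptions of this sketch).
    """
    duration_s = to_ts - from_ts
    if not 0.0 <= timestamp_s <= duration_s:
        raise ValueError("timestamp falls outside the episode segment")
    # Round to the nearest frame to absorb small floating-point drift.
    return round(timestamp_s * fps)


print(frame_number_in_segment(1.5, fps=10, from_ts=20.0, to_ts=35.0))  # 15
```

Rejecting out-of-range timestamps early keeps a bad join between `frames` and `episodes` from silently decoding the wrong frame.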
+ +For a state-only policy, the frames table is already in the right shape — no pre-extraction needed: + +```python +import lancedb +from lancedb.permutation import Permutation +from torch.utils.data import DataLoader + +db = lancedb.connect("hf://datasets/lance-format/lerobot-pusht-lance/data") +frames = db.open_table("frames") + +train_ds = Permutation.identity(frames).select_columns(["observation_state", "action"]) +loader = DataLoader(train_ds, batch_size=128, shuffle=True, num_workers=4) +``` + +For a vision-conditioned policy, train against a pre-extracted frames-with-pixels table that joins each frame's decoded image to its `action` and `observation_state`: + +```python +import lancedb +from lancedb.permutation import Permutation +from torch.utils.data import DataLoader + +db = lancedb.connect("./lerobot-pusht-frames") # local table produced by the one-time extraction +tbl = db.open_table("train") + +train_ds = Permutation.identity(tbl).select_columns(["image", "observation_state", "action"]) +loader = DataLoader(train_ds, batch_size=64, shuffle=True, num_workers=4) +``` + +The inline `_video_blob` storage and `take_blobs` still earn their place outside of the training loop — visualizing an episode in a notebook, sampling for human review, one-off evaluation against a held-out task, and the pre-extraction step itself — but they are not the dataloader. + +## Versioning + +Every mutation to a Lance table, whether it adds a column, merges labels, or builds an index, commits a new version. Each of `frames`, `episodes`, and `videos` is versioned independently, so a column added to `frames` does not bump the version of `episodes`. You can list versions and inspect the history directly from the Hub copy; creating new tags requires a local copy since tags are writes. 
+ +```python +import lancedb + +db = lancedb.connect("hf://datasets/lance-format/lerobot-pusht-lance/data") +frames = db.open_table("frames") + +print("frames version:", frames.version) +print("history:", frames.list_versions()) +print("tags:", frames.tags.list()) +``` + +Once you have a local copy, tag the table for reproducibility: + +```python +local_db = lancedb.connect("./lerobot-pusht/data") +local_frames = local_db.open_table("frames") +local_frames.tags.create("pusht-v1", local_frames.version) +``` + +Reopen by tag or by version number against either the Hub copy or a local one: + +```python +frames_v1 = db.open_table("frames", version="pusht-v1") +frames_v5 = db.open_table("frames", version=5) +``` + +Pinning supports two workflows. A policy locked to `pusht-v1` keeps reproducing the same behavior while the dataset evolves in parallel. A training experiment pinned to the same tag can be rerun later against the exact same frames, so changes in metrics reflect model changes rather than data drift. + +## Materialize a subset + +Reads from the Hub are lazy, so exploratory queries only transfer the columns and row groups they touch. Mutating operations (Evolve, tag creation, index builds) need a writable backing store, and a training pipeline benefits from a local copy with fast random access into the video blobs. Both can be served by a subset of the dataset rather than the full corpus. The pattern is to stream a filtered query through `.to_batches()` into a new local table; only the projected columns and matching row groups cross the wire, and the bytes never fully materialize in Python memory. 
+ +```python +import lancedb + +remote_db = lancedb.connect("hf://datasets/lance-format/lerobot-pusht-lance/data") +remote_frames = remote_db.open_table("frames") + +batches = ( + remote_frames.search() + .where("task_index = 0 AND episode_index < 50") + .select(["episode_index", "frame_index", "index", "timestamp", "action", "observation_state"]) + .to_batches() +) + +local_db = lancedb.connect("./pusht-task0-subset") +local_db.create_table("frames", batches) +``` -- One dataset bundles low-level frames + full-episode trajectories + raw video blobs — no scattered parquet shards or sidecar MP4 directories. -- Inline video blobs use Lance's blob encoding so metadata scans never load the bytes; you fetch them on demand via `take_blobs`. -- Schema evolution: add columns (alternate camera streams, language annotations, model predictions) without rewriting the data. +The resulting `./pusht-task0-subset` is a first-class LanceDB database. Every snippet in the Evolve, Train, and Versioning sections above works against it by swapping `hf://datasets/lance-format/lerobot-pusht-lance/data` for `./pusht-task0-subset`. The same pattern applies to `episodes` and `videos` — narrow each table to the rows your workload needs, and the resulting database stays small enough to index and iterate cheaply. ## Source & license diff --git a/docs/datasets/lerobot-xvla-soft-fold.mdx b/docs/datasets/lerobot-xvla-soft-fold.mdx index 6b55e7e..3c6cc12 100644 --- a/docs/datasets/lerobot-xvla-soft-fold.mdx +++ b/docs/datasets/lerobot-xvla-soft-fold.mdx @@ -1,7 +1,7 @@ --- title: "LeRobot X-VLA Soft-Fold" sidebarTitle: "LeRobot X-VLA Soft-Fold" -description: "This dataset was created using LeRobot." 
+description: "A Lance-formatted version of lerobot/xvla-soft-fold — a multi-camera robotics dataset from the X-VLA project — packaged as three Lance tables for efficient frame-level training, episode-level trajectory loading, and direct access to the original…" --- -This dataset was created using [LeRobot](https://github.com/huggingface/lerobot). +A Lance-formatted version of [`lerobot/xvla-soft-fold`](https://huggingface.co/datasets/lerobot/xvla-soft-fold) — a multi-camera robotics dataset from the [X-VLA](https://thu-air-dream.github.io/X-VLA/) project — packaged as three Lance tables for efficient frame-level training, episode-level trajectory loading, and direct access to the original encoded videos. Available directly from the Hub at `hf://datasets/lance-format/lerobot-xvla-soft-fold/data`. + +- **1,542 episodes** +- **2,852,512 frames** at **20 FPS** +- **3 camera streams per episode** — `cam_high`, `cam_left_wrist`, `cam_right_wrist` +- **Robot state and action vectors** aligned to frame timestamps + +## Key features + +- **Three-table layout** — `frames`, `episodes`, `videos` — so frame-level training, episode-level trajectory work, and raw video access live side-by-side without scattered parquet shards or sidecar MP4 directories. +- **Per-camera inline MP4 segments** in `episodes.lance`, with `from_timestamp` / `to_timestamp` bounds per camera and per episode, surfaced as lazy `BlobFile` handles via `take_blobs` so metadata scans never read the bytes. +- **Frame-level observations and actions** in `frames.lance` with stable `episode_index`, `frame_index`, and `index` columns for joining or temporal iteration. +- **Source MP4 provenance** in `videos.lance` (`relative_path`, `filename`, `file_size_bytes`, `sha256`) alongside the raw bytes, for integrity checks or custom decode pipelines. 
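For the integrity checks mentioned in the last bullet, a minimal sketch using Python's standard `hashlib` could look like the following. It assumes the `sha256` column stores a hex digest and that the blob bytes have already been read into memory. `verify_video_row` and the sample row are hypothetical, not part of the dataset tooling.

```python
import hashlib


def verify_video_row(row, blob_bytes):
    """Check blob bytes against a row's recorded size and SHA256 digest."""
    if len(blob_bytes) != row["file_size_bytes"]:
        return False
    # Assumes the digest is stored as a hex string; compare case-insensitively.
    return hashlib.sha256(blob_bytes).hexdigest() == row["sha256"].lower()


# Stand-in row and payload for illustration only (not real dataset values).
payload = b"not a real mp4"
row = {
    "file_size_bytes": len(payload),
    "sha256": hashlib.sha256(payload).hexdigest(),
}
print(verify_video_row(row, payload))         # True
print(verify_video_row(row, payload + b"x"))  # False
```

Checking the size first is a cheap way to reject truncated downloads before hashing the full payload.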
+ +## Tables + +| Table | Rows | Purpose | +|---|---|---| +| `frames.lance` | 2,852,512 | Per-frame observations, actions, episode/task indices | +| `episodes.lance` | 1,542 | Full per-episode trajectories plus per-camera MP4 segment blobs and timestamp bounds | +| `videos.lance` | 104 | Raw source MP4 files (one row per source MP4) with file-level provenance | + +Use `frames.lance` for low-level training (loss-per-timestep, state-conditioned policies). Use `episodes.lance` when you need the full trajectory and the matching per-camera video segments together. Use `videos.lance` when you want direct access to the original encoded video files. + +## Schemas + +### `frames.lance` + +| Column | Type | Notes | +|---|---|---| +| `observation_state` | `list` | Robot state vector for that frame | +| `action` | `list` | Action vector for that frame | +| `time_stamp` | `float` | Original source timestamp field | +| `timestamp` | `float` | Canonical frame timestamp (seconds) | +| `frame_index` | `int64` | Frame index within episode | +| `episode_index` | `int64` | Parent episode id | +| `index` | `int64` | Global frame index | +| `task_index` | `int64` | Task id | + +### `episodes.lance` + +| Column | Type | Notes | +|---|---|---| +| `episode_index` | `int64` | Episode id | +| `task_index` | `int64` | Task id | +| `fps` | `int32` | Frame rate of the episode video segments | +| `timestamps` | `list` | Per-frame timestamps | +| `actions` | `list>` | Per-frame action vectors | +| `observation_state` | `list>` | Per-frame robot state vectors | +| `observation_images_cam_high_video_blob` | `large_binary` (blob-encoded) | Inline MP4 segment for `cam_high` | +| `observation_images_cam_high_from_timestamp` | `float64` | `cam_high` segment start time | +| `observation_images_cam_high_to_timestamp` | `float64` | `cam_high` segment end time | +| `observation_images_cam_left_wrist_video_blob` | `large_binary` (blob-encoded) | Inline MP4 segment for `cam_left_wrist` | +| 
`observation_images_cam_left_wrist_from_timestamp` | `float64` | `cam_left_wrist` segment start time | +| `observation_images_cam_left_wrist_to_timestamp` | `float64` | `cam_left_wrist` segment end time | +| `observation_images_cam_right_wrist_video_blob` | `large_binary` (blob-encoded) | Inline MP4 segment for `cam_right_wrist` | +| `observation_images_cam_right_wrist_from_timestamp` | `float64` | `cam_right_wrist` segment start time | +| `observation_images_cam_right_wrist_to_timestamp` | `float64` | `cam_right_wrist` segment end time | + +### `videos.lance` + +| Column | Type | Notes | +|---|---|---| +| `camera_angle` | `string` | Camera key (e.g. `cam_high`) | +| `chunk_index`, `file_index` | `int32` | IDs parsed from the source path | +| `relative_path`, `filename` | `string` | Provenance | +| `file_size_bytes` | `int64` | Source MP4 size | +| `sha256` | `string` | SHA256 of the MP4 bytes | +| `video_blob` | `large_binary` (blob-encoded) | Raw source MP4 bytes | + +## Pre-built indices + +None bundled. Build indices on a local copy if a workload calls for them — e.g., a `BTREE` on `frames.episode_index` for fast per-episode lookup, or a vector index after attaching observation embeddings via Evolve. + +## Why Lance? + +1. **Blazing Fast Random Access**: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation. +2. **Native Multimodal Support**: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search. +3. **Native Index Support**: Lance comes with fast, on-disk, scalable vector and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them. +4. 
**Efficient Data Evolution**: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time. +5. **Versatile Querying**: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes. +6. **Data Versioning**: Every mutation commits a new version; previous versions remain intact on disk. Tags pin a snapshot by name, so retrieval systems and training runs can reproduce against an exact slice of history. + +## Load with `datasets.load_dataset` + +You can load Lance datasets via the standard HuggingFace `datasets` interface, suitable when your pipeline already speaks `Dataset` / `IterableDataset` or you want a quick streaming sample. Each Lance table is a separate `datasets` config. -## Dataset Description +```python +import datasets - **Repository:** [X-VLA](https://thu-air-dream.github.io/X-VLA/) +hf_ds = datasets.load_dataset("lance-format/lerobot-xvla-soft-fold", split="frames", streaming=True) +for row in hf_ds.take(3): + print(row["episode_index"], row["frame_index"], row["action"]) +``` - **License:** Apache 2.0 +## Load with LanceDB - **Paper:** *Zheng et al., 2025, “X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model”* ([arXiv:2510.10274](https://arxiv.org/pdf/2510.10274)) +LanceDB is the embedded retrieval library built on top of the Lance format ([docs](https://lancedb.com/docs)), and is the interface most users interact with. Each `.lance` file in `data/` is a table — open by name. The same handles are used by the Search, Curate, Evolve, Train, Versioning, and Materialize-a-subset sections below. 
+```python +import lancedb -## What this dataset contains +db = lancedb.connect("hf://datasets/lance-format/lerobot-xvla-soft-fold/data") -This is the Lance-format version of [lerobot/xvla-soft-fold](https://huggingface.co/datasets/lerobot/xvla-soft-fold), designed for efficient frame-level sampling and sequential episode loading. +frames = db.open_table("frames") +episodes = db.open_table("episodes") +videos = db.open_table("videos") +print(len(frames), len(episodes), len(videos)) +``` -- `1,542` episodes -- `2,852,512` frames -- `20` FPS -- 3 camera streams per episode (`cam_high`, `cam_left_wrist`, `cam_right_wrist`) -- robot state vectors and action vectors aligned to frame timestamps +## Load with Lance -## Dataset structure +`pylance` is the Python binding for the Lance format and works directly with the format's lower-level APIs. Reach for it when you want to inspect dataset internals — schema, scanner, fragments, the list of pre-built indices — or when you need the blob-level `take_blobs` entry point that streams MP4 bytes lazily from inline storage. -The dataset is organized under `data/` with three Lance tables: +```python +import lance -### Frames table +ds = lance.dataset("hf://datasets/lance-format/lerobot-xvla-soft-fold/data/frames.lance") +print(ds.count_rows(), ds.schema.names) +print(ds.list_indices()) +``` -This is the main table for model training and analytics at frame granularity. Each row is one frame with aligned state/action metadata and indexing fields so you can filter by episode, iterate temporally, or build sampled batches directly. +> **Tip — for production use, download locally first.** Streaming from the Hub works for exploration, but heavy random access to video segments and any kind of indexed search are dramatically faster against a local copy. 
The full dataset is **>50 GB**, so ensure you have sufficient disk space: +> ```bash +> hf download lance-format/lerobot-xvla-soft-fold --repo-type dataset --local-dir ./lerobot-xvla-soft-fold +> ``` +> Then point Lance or LanceDB at `./lerobot-xvla-soft-fold/data`. For most workflows, the Materialize-a-subset section at the end of this card is a better starting point than downloading the full corpus. -Schema: -- `observation_state` (`list`): robot state vector for that frame. -- `action` (`list`): action vector for that frame. -- `time_stamp` (`float`): original source timestamp field. -- `timestamp` (`float`): canonical frame timestamp. -- `frame_index` (`int64`): frame index within episode. -- `episode_index` (`int64`): parent episode id. -- `index` (`int64`): global frame index. -- `task_index` (`int64`): task id. +## Search -### Episodes table +This dataset does not ship a vector index out of the box — observation states are low-dimensional and most robotics workflows look up by index rather than by similarity. The bundled identifier columns (`episode_index`, `task_index`, `frame_index`) make exact lookups a single filtered scan. The example below pulls the first few frames of episode 30 from the frames table. -This table is optimized for sequence-aware loading. Each row represents one complete episode and stores per-episode arrays (`timestamps`, `actions`, `observation_state`) plus per-camera video blobs and timestamp ranges. Use this table when you need contiguous windows, trajectory-level batching, or synchronized decoding from episode-level video chunks. +```python +import lancedb -Schema: -- `episode_index` (`int64`, required): episode id. -- `task_index` (`int64`, required): task id. -- `fps` (`int32`, required): frame rate. -- `timestamps` (`list`): per-frame timestamps for the episode. -- `actions` (`list>`): per-frame action vectors. -- `observation_state` (`list>`): per-frame robot state vectors. 
-- `observation_images_cam_high_video_blob` (`large_binary` blob): encoded video segment for `cam_high`. -- `observation_images_cam_high_from_timestamp` (`double`): segment start time for `cam_high`. -- `observation_images_cam_high_to_timestamp` (`double`): segment end time for `cam_high`. -- `observation_images_cam_left_wrist_video_blob` (`large_binary` blob): encoded video segment for `cam_left_wrist`. -- `observation_images_cam_left_wrist_from_timestamp` (`double`): segment start time for `cam_left_wrist`. -- `observation_images_cam_left_wrist_to_timestamp` (`double`): segment end time for `cam_left_wrist`. -- `observation_images_cam_right_wrist_video_blob` (`large_binary` blob): encoded video segment for `cam_right_wrist`. -- `observation_images_cam_right_wrist_from_timestamp` (`double`): segment start time for `cam_right_wrist`. -- `observation_images_cam_right_wrist_to_timestamp` (`double`): segment end time for `cam_right_wrist`. +db = lancedb.connect("hf://datasets/lance-format/lerobot-xvla-soft-fold/data") +frames = db.open_table("frames") -### Videos table +slice_ = ( + frames.search() + .where("episode_index = 30 AND frame_index < 10", prefilter=True) + .select(["episode_index", "frame_index", "timestamp", "action", "observation_state"]) + .limit(10) + .to_list() +) +for r in slice_: + print(r["frame_index"], r["timestamp"], r["action"]) +``` -This table stores raw MP4 payloads from the source and file-level provenance metadata. It is useful when you want direct access to original encoded video assets, integrity checks (`sha256`), or custom decoding pipelines that operate on the original video files themselves, rather than episode/frame abstractions. +For similarity-style search across states or actions, attach an embedding column via Evolve and build an `IVF_PQ` index on it. For visual similarity over rendered frames, the pre-extracted-frames pattern in Train below produces a table that can carry a learned image embedding alongside the pixels. 
-Schema: -- `camera_angle` (`string`, required): camera key. -- `chunk_index` (`int32`): chunk id parsed from path. -- `file_index` (`int32`): file id parsed from path. -- `relative_path` (`string`, required): original relative path in dataset. -- `filename` (`string`, required): MP4 filename. -- `file_size_bytes` (`int64`, required): file size. -- `sha256` (`string`, required): SHA256 digest. -- `video_blob` (`large_binary`, required blob): raw MP4 bytes. +## Curate -## Usage +A typical curation pass for a robotics workflow starts with an episode-level filter (pick episodes with a particular task, length, or initial condition) and then either iterates frames or pulls the matching video segments. Stacking predicates inside a single filtered scan keeps the result small and explicit, and the bounded `.limit(...)` makes it cheap to inspect. -In the following sections, we'll show how to work with the dataset in Lance or LanceDB. +```python +import lancedb -### Read with Lance +db = lancedb.connect("hf://datasets/lance-format/lerobot-xvla-soft-fold/data") +episodes = db.open_table("episodes") -```python -import lance +ep_rows = ( + episodes.search() + .where("task_index = 0 AND fps = 20", prefilter=True) + .select([ + "episode_index", + "observation_images_cam_high_from_timestamp", + "observation_images_cam_high_to_timestamp", + ]) + .limit(20) + .with_row_id(True) + .to_list() +) +print(f"{len(ep_rows)} episodes selected") +for r in ep_rows[:3]: + print( + f" ep {r['episode_index']} " + f"{r['observation_images_cam_high_from_timestamp']:.2f}s → " + f"{r['observation_images_cam_high_to_timestamp']:.2f}s" + ) +``` -root_path = "hf://datasets/lance-format/lerobot-xvla-soft-fold/data" -frames_table_name = "frames.lance" -episodes_table_name = "episodes.lance" -videos_table_name = "videos.lance" +The scan above reads only the projected metadata columns; no video bytes are read.
The MP4 segments live in the blob-encoded `_video_blob` columns and stay on disk until something explicitly asks for them, which makes "find me the right episodes" a metadata-only operation against a multi-million-frame corpus. -ds = lance.dataset(f"{root_path}/{frames_table_name}") -print(ds.count_rows()) +## Evolve -ds = lance.dataset(f"{root_path}/{episodes_table_name}") -print(ds.count_rows()) +Lance stores each column independently, so a new column can be appended without rewriting the existing data. The lightest form is a SQL expression: derive the new column from columns that already exist, and Lance computes it once and persists it. The example below adds an `episode_duration_s` column and an `is_long_episode` flag to the episodes table, both derived from the existing `cam_high` timestamp bounds. -ds = lance.dataset(f"{root_path}/{videos_table_name}") -print(ds.count_rows()) +> **Note:** Mutations require a local copy of the dataset, since the Hub mount is read-only. See the Materialize-a-subset section at the end of this card for a streaming pattern that downloads only the rows and columns you need.
-# 2852512 -# 1542 -# 104 +```python +import lancedb + +db = lancedb.connect("./lerobot-xvla-soft-fold/data") # local copy required for writes +episodes = db.open_table("episodes") + +episodes.add_columns({ + "episode_duration_s": ( + "observation_images_cam_high_to_timestamp - " + "observation_images_cam_high_from_timestamp" + ), + "is_long_episode": ( + "(observation_images_cam_high_to_timestamp - " + " observation_images_cam_high_from_timestamp) > 120.0" + ), +}) ``` -### Inspect a few frames +If the values you want to attach already live in another table (offline reward labels, classifier predictions, learned observation embeddings), merge them in by joining on the appropriate key — `index` for frames or `episode_index` for episodes: ```python -import lance +import pyarrow as pa -root_path = "hf://datasets/lance-format/lerobot-xvla-soft-fold/data" -frames_table_name = "frames.lance" - -frames = lance.dataset(f"{root_path}/{frames_table_name}") -print(f"There are {frames.count_rows()} frames in total") - -# pip install polars -res = frames.scanner( - columns=["episode_index", "frame_index", "timestamp"], - limit=2, -).to_table() -print(res) - -# Returns -# There are 2852512 frames in total -# pyarrow.Table -# episode_index: int64 -# frame_index: int64 -# timestamp: float -# ---- -# episode_index: [[0,0]] -# frame_index: [[0,1]] -# timestamp: [[0,0.05]] +ep_labels = pa.table({ + "episode_index": pa.array([0, 1, 2]), + "outcome": pa.array(["success", "partial", "success"]), +}) +episodes.merge(ep_labels, on="episode_index") ``` -### Retrieving and saving video blobs +The original columns and the inline video blobs are untouched, so existing code that does not reference the new columns continues to work unchanged. For column values that require a Python computation (e.g., running a visual encoder over the decoded video frames), Lance provides a batch-UDF API — see the [Lance data evolution docs](https://lance.org/guide/data_evolution/). 
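To make the join semantics concrete, the sketch below mimics the merge in plain Python. It assumes left-join behavior: every existing episode row is kept, and rows without a matching label get a null. `left_merge` is a hypothetical helper for illustration, not a Lance or LanceDB API.

```python
def left_merge(rows, labels, on):
    """Attach label columns to rows by key, keeping every original row."""
    by_key = {label[on]: label for label in labels}
    new_cols = sorted({k for label in labels for k in label if k != on})
    merged = []
    for row in rows:
        match = by_key.get(row[on], {})
        # Unmatched rows receive None for each new column.
        merged.append({**row, **{col: match.get(col) for col in new_cols}})
    return merged


# Stand-in rows for illustration only.
episodes = [{"episode_index": i, "task_index": 0} for i in range(4)]
ep_labels = [
    {"episode_index": 0, "outcome": "success"},
    {"episode_index": 1, "outcome": "partial"},
    {"episode_index": 2, "outcome": "success"},
]
for row in left_merge(episodes, ep_labels, on="episode_index"):
    print(row)
```

Under this assumption, labeling only a few episodes still leaves the table with its full row count, so downstream filters should handle nulls in the new column.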
-```py -from pathlib import Path -import lance +## Train -root_path = "hf://datasets/lance-format/lerobot-xvla-soft-fold/data" -episodes_table_name = "episodes.lance" -ds = lance.dataset(f"{root_path}/{episodes_table_name}") - -out = Path("video_blobs") -out.mkdir(exist_ok=True) - -# Retrieve first two videos from the episodes table -for offset in range(0, 2): - row = ( - ds.scanner( - columns=["episode_index", "observation_images_cam_high_video_blob"], - blob_handling="all_binary", - limit=2, - offset=offset, - ) - .to_table() - .to_pylist()[0] - ) - # Write the video blob to a file - (out / f"episode_{row['episode_index']}.mp4").write_bytes( - row["observation_images_cam_high_video_blob"] - ) -``` -This outputs the retrieved blobs as MP4 files in a local directory. +A common pattern for vision-language-action training is to pre-extract decoded frame pixels once into a derived LanceDB table — one row per frame, with the per-frame `action` and `observation_state` already joined in, and one column per camera holding the decoded image — and train against that table with the regular projection-based dataloader. `take_blobs` is the mechanism that makes the extraction step tractable: each episode's per-camera MP4 segment is randomly addressable in `episodes.lance` (the `*_from_timestamp` / `*_to_timestamp` columns give the segment bounds), so the pass can subset bytes on demand and write decoded frames into a fresh table without an external file store. Other workflows project the `*_video_blob` columns from `episodes.lance` directly and decode at the batch boundary, or skip pixels entirely and train a state-only policy on `frames.lance` — the right shape is workload-specific. The actual training loop is the same `Permutation.identity(tbl).select_columns(...)` snippet in every case; only the source table and the column list change. 
+ +For a state-only policy, the frames table is already in the right shape — no pre-extraction needed: + +```python +import lancedb +from lancedb.permutation import Permutation +from torch.utils.data import DataLoader -### Random seek on subsets of video +db = lancedb.connect("hf://datasets/lance-format/lerobot-xvla-soft-fold/data") +frames = db.open_table("frames") -The snippet shown below reads one episode’s video blob directly from HF Hub via Lance, computes a tiny time window inside that episode, opens the blob as a stream (without downloading full data into a local file), seeks to the start timestamp, and prints the blob size plus the exact seek positions in seconds and stream PTS units. +train_ds = Permutation.identity(frames).select_columns(["observation_state", "action"]) +loader = DataLoader(train_ds, batch_size=256, shuffle=True, num_workers=4) +``` -```py -import av -import lance +For a vision-language-action policy, train against a pre-extracted frames-with-pixels table that joins each frame's three decoded camera images to its `action` and `observation_state`. 
Picking the cameras the model actually conditions on is then a column projection — `cam_high` alone, all three, or any subset: -DATASET_URI = "hf://datasets/lance-format/lerobot-xvla-soft-fold/data/episodes.lance" -EPISODE_INDEX = 30 -START_OFFSET_S = 1.0 -WINDOW_S = 0.5 +```python +import lancedb +from lancedb.permutation import Permutation +from torch.utils.data import DataLoader -ds = lance.dataset(DATASET_URI) -row = ds.scanner( - columns=[ - "episode_index", - "observation_images_cam_high_from_timestamp", - "observation_images_cam_high_to_timestamp", - "_rowid", - ], - with_row_id=True, - filter=f"episode_index = {EPISODE_INDEX}", - limit=1, -).to_table().to_pylist()[0] - -start_s = row["observation_images_cam_high_from_timestamp"] + START_OFFSET_S -end_s = min( - start_s + WINDOW_S, - row["observation_images_cam_high_to_timestamp"], +db = lancedb.connect("./lerobot-xvla-frames") # local table produced by the one-time extraction +tbl = db.open_table("train") + +train_ds = Permutation.identity(tbl).select_columns( + ["cam_high", "cam_left_wrist", "cam_right_wrist", "observation_state", "action"] ) +loader = DataLoader(train_ds, batch_size=32, shuffle=True, num_workers=4) +``` + +The inline `_video_blob` storage and `take_blobs` still earn their place outside of the training loop — visualizing an episode in a notebook, sampling for human review, one-off evaluation, and the pre-extraction step itself — but they are not the dataloader. + +## Versioning + +Every mutation to a Lance table, whether it adds a column, merges labels, or builds an index, commits a new version. Each of `frames`, `episodes`, and `videos` is versioned independently, so a column added to `frames` does not bump the version of `episodes`. You can list versions and inspect the history directly from the Hub copy; creating new tags requires a local copy since tags are writes. 
+ +```python +import lancedb + +db = lancedb.connect("hf://datasets/lance-format/lerobot-xvla-soft-fold/data") +frames = db.open_table("frames") + +print("frames version:", frames.version) +print("history:", frames.list_versions()) +print("tags:", frames.tags.list()) +``` -blob = ds.take_blobs("observation_images_cam_high_video_blob", ids=[row["_rowid"]])[0] -with av.open(blob) as container: - stream = container.streams.video[0] - stream.codec_context.skip_frame = "NONKEY" +Once you have a local copy, tag the table for reproducibility: - start_pts = int(start_s / stream.time_base) - end_pts = int(end_s / stream.time_base) - container.seek(start_pts, stream=stream) +```python +local_db = lancedb.connect("./lerobot-xvla-soft-fold/data") +local_frames = local_db.open_table("frames") +local_frames.tags.create("xvla-v1", local_frames.version) +``` - print(f"episode_index={row['episode_index']}") - print(f"blob_size_bytes={blob.size()}") - print(f"seek_start_seconds={start_s:.3f}") - print(f"seek_end_seconds={end_s:.3f}") - print(f"seek_start_pts={start_pts}") - print(f"seek_end_pts={end_pts}") +Reopen by tag or by version number against either the Hub copy or a local one: -blob.close() +```python +frames_v1 = db.open_table("frames", version="xvla-v1") +frames_v5 = db.open_table("frames", version=5) ``` -### LanceDB search +Pinning supports two workflows. A policy locked to `xvla-v1` keeps reproducing the same behavior while the dataset evolves in parallel. A training experiment pinned to the same tag can be rerun later against the exact same frames and segments, so changes in metrics reflect model changes rather than data drift. + +## Materialize a subset -LanceDB users can also interface with the Lance dataset on the Hub. The key step is to -connect to the dataset repo and open the relevant table. +At >50 GB across three tables and millions of frames, few workflows want the full corpus on local disk. 
The practical entry point is to stream a filtered query through `.to_batches()` into a new local table; only the projected columns and matching row groups cross the wire, and the bytes never fully materialize in Python memory — including the per-camera `_video_blob` columns on `episodes.lance`, which stream through Arrow record batches rather than being assembled in a single buffer. -```py +```python import lancedb -db = lancedb.connect("hf://datasets/lance-format/lerobot-xvla-soft-fold/data") -tbl = db.open_table("episodes") - -# Search without any parameters -results = ( - tbl.search() - .select( - [ - "episode_index", - "observation_images_cam_high_from_timestamp", - "observation_images_cam_high_to_timestamp", - ] - ) - .limit(3) - .to_list() -) +remote_db = lancedb.connect("hf://datasets/lance-format/lerobot-xvla-soft-fold/data") +remote_episodes = remote_db.open_table("episodes") -for result in results: - print( - f"{result['episode_index']} | {result['observation_images_cam_high_from_timestamp']} | {result['observation_images_cam_high_to_timestamp']}" - ) +batches = ( + remote_episodes.search() + .where("task_index = 0 AND episode_index < 50") + .select([ + "episode_index", "task_index", "fps", "timestamps", "actions", "observation_state", + "observation_images_cam_high_video_blob", + "observation_images_cam_high_from_timestamp", + "observation_images_cam_high_to_timestamp", + ]) + .to_batches() +) -# Returns: -# 0 | 0.0 | 122.95 -# 1 | 122.95 | 230.65 -# 2 | 230.65 | 340.0 +local_db = lancedb.connect("./xvla-task0-subset") +local_db.create_table("episodes", batches) ``` -### Download +The resulting `./xvla-task0-subset` is a first-class LanceDB database. Every snippet in the Evolve, Train, and Versioning sections above works against it by swapping `hf://datasets/lance-format/lerobot-xvla-soft-fold/data` for `./xvla-task0-subset`. 
The same pattern applies to `frames` and `videos` — narrow each table to the rows your workload needs, and the resulting database stays small enough to index and iterate cheaply. -If you need to make modifications to the data or work with the raw files directly, you can do a -full download of the dataset locally. +## Source & license -> **⚠️ Large dataset download** -> The full dataset is >50GB in size, so ensure you have sufficient disk space available. +Converted from [`lerobot/xvla-soft-fold`](https://huggingface.co/datasets/lerobot/xvla-soft-fold) (LeRobot v3.0 dataset format), originally released as part of the [X-VLA](https://thu-air-dream.github.io/X-VLA/) project. Apache 2.0. -```bash -uv run hf download lance-format/lerobot-xvla-soft-fold --repo-type dataset --local-dir . +## Citation + +``` +@article{zheng2025xvla, + title={X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model}, + author={Zheng and others}, + journal={arXiv preprint arXiv:2510.10274}, + year={2025} +} + +@misc{cadene2024lerobot, + title={LeRobot: State-of-the-art Machine Learning for Real-World Robotics in PyTorch}, + author={R{\'e}mi Cadene and Simon Alibert and Alexander Soare and Quentin Gallou{\'e}dec and Adil Zouitine and Steven Palma and Pepijn Kooijmans and Michel Aractingi and Mustafa Shukor and Martino Russi and Francesco Capuano and Caroline Pascal and Jade Choghari and Jess Moss and Thomas Wolf}, + year={2024}, + url={https://github.com/huggingface/lerobot} +} ``` diff --git a/docs/datasets/librispeech-clean.mdx b/docs/datasets/librispeech-clean.mdx index 62217b9..37b9f7c 100644 --- a/docs/datasets/librispeech-clean.mdx +++ b/docs/datasets/librispeech-clean.mdx @@ -1,7 +1,7 @@ --- title: "LibriSpeech clean" sidebarTitle: "LibriSpeech clean" -description: "Lance-formatted version of the LibriSpeech ASR clean configuration (sourced from openslr/librispeech_asr). 
Audio is stored inline as FLAC bytes (no re-encoding); transcripts are sentence-embedded so semantic transcript search works out of the box." +description: "A Lance-formatted version of the LibriSpeech ASR clean configuration, sourced from openslr/librispeech_asr. Each row is one utterance with inline FLAC audio bytes, the reference transcript, a sentence-transformers embedding of that transcript, and…" --- -Lance-formatted version of the LibriSpeech ASR `clean` configuration (sourced from [`openslr/librispeech_asr`](https://huggingface.co/datasets/openslr/librispeech_asr)). Audio is stored inline as FLAC bytes (no re-encoding); transcripts are sentence-embedded so semantic transcript search works out of the box. +A Lance-formatted version of the LibriSpeech ASR `clean` configuration, sourced from [`openslr/librispeech_asr`](https://huggingface.co/datasets/openslr/librispeech_asr). Each row is one utterance with inline FLAC audio bytes, the reference transcript, a sentence-transformers embedding of that transcript, and speaker/chapter metadata — all available directly from the Hub at `hf://datasets/lance-format/librispeech-clean-lance/data`. + +## Key features + +- **Inline FLAC bytes** in the `audio` column at 16 kHz mono, with no re-encoding from the upstream parquet. +- **Sentence-transformers embedding of the transcript** in `text_emb` (`all-MiniLM-L6-v2`, 384-dim, cosine-normalized) with a bundled `IVF_PQ` index for semantic transcript search. +- **Pre-built `INVERTED` FTS index on `text`** and `BTREE` indices on `id`, `speaker_id`, and `chapter_id` for keyword search and stable lookup by identifier. +- **Per-utterance metadata** — `speaker_id`, `chapter_id`, `num_chars`, `sampling_rate` — that downstream filters can stack on. 
## Splits -| Split | Lance file | Rows | Description | -|-------|------------|------|-------------| -| `dev_clean.lance` | dev.clean | 2,703 | Standard ASR validation set | -| `test_clean.lance` | test.clean | 2,620 | Standard ASR test set | -| `train_clean_100.lance` | train.clean.100 | 28,539 | 100-hour clean training subset | +| Split | Source config | Rows | Description | +|-------|---------------|------|-------------| +| `dev_clean.lance` | `dev.clean` | 2,703 | Standard ASR validation set | +| `test_clean.lance` | `test.clean` | 2,620 | Standard ASR test set | +| `train_clean_100.lance` | `train.clean.100` | 28,539 | 100-hour clean training subset | -> The 360-hour and 500-hour LibriSpeech subsets (`train.360`, `train.other.500`) are **not** bundled here. To extend the dataset, point `librispeech/dataprep.py` at additional splits. +> The 360-hour and 500-hour LibriSpeech subsets (`train.360`, `train.other.500`) are not bundled here. To extend, point `librispeech/dataprep.py` at additional splits. ## Schema @@ -39,109 +46,205 @@ Lance-formatted version of the LibriSpeech ASR `clean` configuration (sourced fr ## Pre-built indices -- `IVF_PQ` on `text_emb` — `metric=cosine` -- `INVERTED` (FTS) on `text` -- `BTREE` on `id`, `speaker_id`, `chapter_id` +- `IVF_PQ` on `text_emb` — semantic transcript search (cosine) +- `INVERTED` (FTS) on `text` — keyword and hybrid search +- `BTREE` on `id`, `speaker_id`, `chapter_id` — fast lookup by identifier + +## Why Lance? -## Quick start +1. **Blazing Fast Random Access**: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation. +2. **Native Multimodal Support**: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search. +3. 
**Native Index Support**: Lance comes with fast, on-disk, scalable vector and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them. +4. **Efficient Data Evolution**: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time. +5. **Versatile Querying**: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes. +6. **Data Versioning**: Every mutation commits a new version; previous versions remain intact on disk. Tags pin a snapshot by name, so retrieval systems and training runs can reproduce against an exact slice of history. + +## Load with `datasets.load_dataset` + +You can load Lance datasets via the standard HuggingFace `datasets` interface, suitable when your pipeline already speaks `Dataset` / `IterableDataset` or you want a quick streaming sample. ```python -import lance +import datasets -ds = lance.dataset("hf://datasets/lance-format/librispeech-clean-lance/data/test_clean.lance") -print(ds.count_rows(), ds.schema.names, ds.list_indices()) +hf_ds = datasets.load_dataset("lance-format/librispeech-clean-lance", split="test_clean", streaming=True) +for row in hf_ds.take(3): + print(row["id"], row["text"][:80]) ``` ## Load with LanceDB -These tables can also be consumed by [LanceDB](https://lancedb.github.io/lancedb/), the multimodal lakehouse and embedded search library built on top of Lance, for simplified vector search and other queries. Each `.lance` file in `data/` is a table — open by name (e.g., `test_clean`, `train_clean_100`). +LanceDB is the embedded retrieval library built on top of the Lance format ([docs](https://lancedb.com/docs)), and is the interface most users interact with. 
Each `.lance` file in `data/` is a table — open by name (`dev_clean`, `test_clean`, `train_clean_100`). The same handle is used by the Search, Curate, Evolve, Versioning, and Materialize-a-subset sections below. ```python import lancedb db = lancedb.connect("hf://datasets/lance-format/librispeech-clean-lance/data") -tbl = db.open_table("test_clean") -print(f"LanceDB table opened with {len(tbl)} utterances") +tbl = db.open_table("train_clean_100") +print(len(tbl)) ``` -## Read one utterance and play it +## Load with Lance + +`pylance` is the Python binding for the Lance format and works directly with the format's lower-level APIs. Reach for it when you want to inspect dataset internals — schema, scanner, fragments, the list of pre-built indices. ```python -from pathlib import Path import lance -ds = lance.dataset("hf://datasets/lance-format/librispeech-clean-lance/data/test_clean.lance") -row = ds.take([0], columns=["id", "audio", "text", "speaker_id"]).to_pylist()[0] - -Path(f"{row['id']}.flac").write_bytes(row["audio"]) -print("speaker:", row["speaker_id"]) -print("transcript:", row["text"]) +ds = lance.dataset("hf://datasets/lance-format/librispeech-clean-lance/data/train_clean_100.lance") +print(ds.count_rows(), ds.schema.names) +print(ds.list_indices()) ``` -You can decode the FLAC bytes in-memory with `soundfile` and feed them straight into a model: +> **Tip — for production use, download locally first.** Streaming from the Hub works for exploration, but heavy random access, ANN search, and audio decoding are far faster against a local copy: +> ```bash +> hf download lance-format/librispeech-clean-lance --repo-type dataset --local-dir ./librispeech-clean +> ``` +> Then point Lance or LanceDB at `./librispeech-clean/data`. + +## Search + +The bundled `IVF_PQ` index on `text_emb` makes semantic transcript retrieval a single call. 
In production you would encode a query string through the same sentence-transformers model used at ingest (`all-MiniLM-L6-v2`, cosine-normalized), then pass the resulting 384-d vector to `tbl.search(...)`. The example below uses the embedding from row 42 as a runnable stand-in. ```python -import io -import soundfile as sf +import lancedb + +db = lancedb.connect("hf://datasets/lance-format/librispeech-clean-lance/data") +tbl = db.open_table("train_clean_100") + +seed = ( + tbl.search() + .select(["text_emb", "text"]) + .limit(1) + .offset(42) + .to_list()[0] +) -samples, sr = sf.read(io.BytesIO(row["audio"])) -print(samples.shape, sr) +hits = ( + tbl.search(seed["text_emb"], vector_column_name="text_emb") + .metric("cosine") + .select(["id", "speaker_id", "text"]) + .limit(10) + .to_list() +) +print("query transcript:", seed["text"][:80]) +for r in hits: + print(f" {r['id']} spk={r['speaker_id']} {r['text'][:80]}") ``` -## Semantic transcript retrieval +The `audio` blob is never touched. A top-10 semantic search moves a few kilobytes of transcript text rather than the FLAC bytes for every candidate. + +Because the dataset also ships an `INVERTED` index on `text`, the same query can be issued as a hybrid search that combines the dense vector with a keyword query — useful when a name or domain term must literally appear in the transcript but you still want the semantic side to rank the rest. 
```python -import lance -import pyarrow as pa -from sentence_transformers import SentenceTransformer +hybrid_hits = ( + tbl.search(query_type="hybrid", vector_column_name="text_emb") + .vector(seed["text_emb"]) + .text("astronomy") + .select(["id", "speaker_id", "text"]) + .limit(10) + .to_list() +) +for r in hybrid_hits: + print(f" {r['id']} spk={r['speaker_id']} {r['text'][:80]}") +``` -encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cuda") -q = encoder.encode(["a person talking about astronomy"], normalize_embeddings=True)[0] +Tune `metric`, `nprobes`, and `refine_factor` on the vector side to trade recall against latency. -ds = lance.dataset("hf://datasets/lance-format/librispeech-clean-lance/data/train_clean_100.lance") -emb_field = ds.schema.field("text_emb") -hits = ds.scanner( - nearest={"column": "text_emb", "q": pa.array([q.tolist()], type=emb_field.type)[0], "k": 5}, - columns=["id", "speaker_id", "text"], -).to_table().to_pylist() -for h in hits: - print(h) -``` +## Curate -### LanceDB semantic transcript retrieval +Building a focused subset of utterances usually means combining content with structure — pick utterances by a single speaker, or above a minimum transcript length, or matching a topic. Stacking predicates inside a single filtered scan keeps the result small and explicit, and the bounded `.limit(500)` makes it cheap to inspect. 
```python import lancedb -from sentence_transformers import SentenceTransformer - -encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cuda") -q = encoder.encode(["a person talking about astronomy"], normalize_embeddings=True)[0] db = lancedb.connect("hf://datasets/lance-format/librispeech-clean-lance/data") tbl = db.open_table("train_clean_100") -results = ( - tbl.search(q.tolist(), vector_column_name="text_emb") - .metric("cosine") - .select(["id", "speaker_id", "text"]) - .limit(5) +candidates = ( + tbl.search() + .where("speaker_id = 1272 AND num_chars >= 60", prefilter=True) + .select(["id", "chapter_id", "num_chars", "text"]) + .limit(500) + .with_row_id(True) .to_list() ) +print(f"{len(candidates)} utterances; first: {candidates[0]['text'][:80]}") ``` -## Full-text and per-speaker filtering +The scan never reads the `audio` column. Lance stores binary columns independently, so a metadata-only curation pass moves only the transcript text and scalar fields across the wire — even though the underlying table includes hours of inline FLAC audio. + +## Evolve + +Lance stores each column independently, so a new column can be appended without rewriting the existing data. The lightest form is a SQL expression: derive the new column from columns that already exist, and Lance computes it once and persists it. The example below adds an `is_long_utterance` flag and a coarse `length_bucket`, either of which can then be used directly in `where` clauses without re-evaluating the predicate on every query. + +> **Note:** Mutations require a local copy of the dataset, since the Hub mount is read-only. See the Materialize-a-subset section at the end of this card for a streaming pattern that downloads only the rows and columns you need. ```python -ds = lance.dataset("hf://datasets/lance-format/librispeech-clean-lance/data/train_clean_100.lance") +import lancedb -# Word search via the FTS index.
-hits = ds.scanner(full_text_query="universe stars", columns=["id", "text"], limit=10).to_table() +db = lancedb.connect("./librispeech-clean/data") # local copy required for writes +tbl = db.open_table("train_clean_100") -# All utterances by a given speaker. -sp = ds.scanner(filter="speaker_id = 1272", columns=["id", "chapter_id", "text"], limit=10).to_table() +tbl.add_columns({ + "is_long_utterance": "num_chars >= 200", + "length_bucket": ( + "CASE WHEN num_chars < 80 THEN 'short' " + "WHEN num_chars < 200 THEN 'medium' ELSE 'long' END" + ), +}) ``` -### LanceDB full-text search and per-speaker filtering +If the values you want to attach already live in another table (alternate transcripts, speaker embeddings, model predictions), merge them in by joining on `id`: + +```python +import pyarrow as pa + +predictions = pa.table({ + "id": pa.array(["1272-128104-0000", "1272-128104-0001"]), + "wer": pa.array([0.04, 0.12]), +}) +tbl.merge(predictions, on="id") +``` + +The original columns and indices are untouched, so existing code that does not reference the new columns continues to work unchanged. For column values that require a Python computation (e.g., running a speaker embedding model over the FLAC bytes), Lance provides a batch-UDF API — see the [Lance data evolution docs](https://lance.org/guide/data_evolution/). + +## Train + +A common pattern for audio training is to pre-extract decoded features once into a derived LanceDB table — one row per training-ready window of log-mel frames or raw PCM samples — and train against that table with the regular projection-based dataloader. `take_blobs` is the mechanism that makes the extraction step tractable: each utterance's FLAC bytes are randomly addressable, so the pass can subset audio on demand and write decoded windows into a fresh table without an external file store. 
Other workflows project `audio` directly through `select_columns(...)` and decode at the batch boundary, or skip audio entirely and train on the cached transcript embeddings — the right shape is workload-specific. The actual training loop is the same `Permutation.identity(tbl).select_columns(...)` snippet in every case; only the source table and the column list change. + +Against a pre-extracted features table: + +```python +import lancedb +from lancedb.permutation import Permutation +from torch.utils.data import DataLoader + +db = lancedb.connect("./librispeech-features") # local table produced by the one-time extraction +tbl = db.open_table("train") + +train_ds = Permutation.identity(tbl).select_columns(["log_mel", "text", "speaker_id"]) +loader = DataLoader(train_ds, batch_size=32, shuffle=True, num_workers=4) +``` + +Against the cached transcript embeddings on the source table (no audio decode): + +```python +import lancedb +from lancedb.permutation import Permutation +from torch.utils.data import DataLoader + +src_db = lancedb.connect("hf://datasets/lance-format/librispeech-clean-lance/data") +src_tbl = src_db.open_table("train_clean_100") + +train_ds = Permutation.identity(src_tbl).select_columns(["text_emb", "speaker_id"]) +loader = DataLoader(train_ds, batch_size=256, shuffle=True, num_workers=4) +``` + +The inline `audio` storage and `take_blobs` still earn their place around the training process — listening back to an utterance in a notebook, sampling for human review, one-off evaluation against a held-out set, and the pre-extraction pass itself. Each of those reads a small, explicit set of blobs once. What the Train section above keeps off the per-batch hot path is exactly that raw-audio decode: paying it every step is what the pre-extracted features are designed to avoid. + +## Versioning + +Every mutation to a Lance dataset, whether it adds a column, merges labels, or builds an index, commits a new version. Previous versions remain intact on disk. 
You can list versions and inspect the history directly from the Hub copy; creating new tags requires a local copy since tags are writes. ```python import lancedb @@ -149,29 +252,50 @@ import lancedb db = lancedb.connect("hf://datasets/lance-format/librispeech-clean-lance/data") tbl = db.open_table("train_clean_100") -# Word search via the FTS index. -hits = ( - tbl.search("universe stars") - .select(["id", "text"]) - .limit(10) - .to_list() -) +print("Current version:", tbl.version) +print("History:", tbl.list_versions()) +print("Tags:", tbl.tags.list()) +``` -# All utterances by a given speaker. -sp = ( - tbl.search() +Once you have a local copy, tag a version for reproducibility: + +```python +local_db = lancedb.connect("./librispeech-clean/data") +local_tbl = local_db.open_table("train_clean_100") +local_tbl.tags.create("minilm-v1", local_tbl.version) +``` + +A tagged version can be opened by name, or any version reopened by its number, against either the Hub copy or a local one: + +```python +tbl_v1 = db.open_table("train_clean_100", version="minilm-v1") +tbl_v5 = db.open_table("train_clean_100", version=5) +``` + +Pinning supports two workflows. A retrieval system locked to `minilm-v1` keeps returning stable results while the dataset evolves in parallel. A training experiment pinned to the same tag can be rerun later against the exact same utterances, so changes in metrics reflect model changes rather than data drift. + +## Materialize a subset + +Reads from the Hub are lazy, so exploratory queries only transfer the columns and row groups they touch. Mutating operations (Evolve, tag creation) need a writable backing store, and a training pipeline benefits from a local copy with fast random access to the FLAC bytes. Both can be served by a subset of the dataset rather than the full split. 
The pattern is to stream a filtered query through `.to_batches()` into a new local table; only the projected columns and matching row groups cross the wire, and the bytes never fully materialize in Python memory — including the `audio` column, which streams through Arrow record batches rather than being assembled in a single buffer. + +```python +import lancedb + +remote_db = lancedb.connect("hf://datasets/lance-format/librispeech-clean-lance/data") +remote_tbl = remote_db.open_table("train_clean_100") + +batches = ( + remote_tbl.search() .where("speaker_id = 1272") - .select(["id", "chapter_id", "text"]) - .limit(10) - .to_list() + .select(["id", "audio", "sampling_rate", "text", "speaker_id", "chapter_id", "text_emb"]) + .to_batches() ) -``` -## Why Lance? +local_db = lancedb.connect("./librispeech-speaker-1272") +local_db.create_table("train", batches) +``` -- One dataset for audio + transcripts + embeddings + indices — no parallel folder of FLAC files plus a transcript JSON. -- On-disk vector and full-text indices live next to the data, so search works on local copies and on the Hub. -- Schema evolution: add columns (alternate transcripts, speaker embeddings, model predictions) without rewriting the data. +The resulting `./librispeech-speaker-1272` is a first-class LanceDB database. Every snippet in the Evolve, Train, and Versioning sections above works against it by swapping `hf://datasets/lance-format/librispeech-clean-lance/data` for `./librispeech-speaker-1272`. ## Source & license diff --git a/docs/datasets/mnist.mdx b/docs/datasets/mnist.mdx index 522daaf..f544f72 100644 --- a/docs/datasets/mnist.mdx +++ b/docs/datasets/mnist.mdx @@ -1,7 +1,7 @@ --- title: "MNIST" sidebarTitle: "MNIST" -description: "A Lance-formatted version of the classic MNIST handwritten-digit dataset with 70,000 28×28 grayscale digits stored inline alongside CLIP image embeddings and a pre-built ANN index." 
+description: "A Lance-formatted version of the classic MNIST handwritten-digit dataset covering 70,000 28×28 grayscale digits across ten balanced classes. Each row carries inline PNG bytes, the digit label, the human-readable class name, and a cosine-normalized…" --- -A Lance-formatted version of the classic [MNIST handwritten-digit dataset](https://huggingface.co/datasets/ylecun/mnist) with **70,000 28×28 grayscale digits** stored inline alongside CLIP image embeddings and a pre-built ANN index. +A Lance-formatted version of the classic [MNIST handwritten-digit dataset](https://huggingface.co/datasets/ylecun/mnist) covering 70,000 28×28 grayscale digits across ten balanced classes. Each row carries inline PNG bytes, the digit label, the human-readable class name, and a cosine-normalized CLIP image embedding, all backed by a bundled `IVF_PQ` vector index plus scalar indices on the label columns and available directly from the Hub at `hf://datasets/lance-format/mnist-lance/data`. ## Key features -- All multimodal data (image bytes + embeddings) stored **inline** in the same Lance dataset — no sidecar files, no external image folders. -- **Pre-computed CLIP embeddings** (OpenCLIP `ViT-B-32` / `laion2b_s34b_b79k`, 512-dim, L2-normalized) shipped with an `IVF_PQ` index for instant similarity search. -- **BTREE index on `label`** and **BITMAP index on `label_name`** for sub-millisecond filtering. -- Standard train/test splits, ready to use with `lance.dataset(...)` or `datasets.load_dataset(...)`. +- **Inline PNG bytes** in the `image` column — no sidecar files, no image folders. +- **Pre-computed CLIP image embeddings** (OpenCLIP `ViT-B-32` / `laion2b_s34b_b79k`, 512-dim, cosine-normalized) with a bundled `IVF_PQ` index. +- **Scalar indices on both label columns** — `BTREE` on `label` and `BITMAP` on `label_name` — so digit filters and digit-conditioned search use index lookups instead of full scans.
+- **One columnar dataset** — scan labels cheaply, then fetch image bytes only for the rows you want. ## Splits | Split | Rows | |-------|------| -| `train` | 60,000 | -| `test` | 10,000 | +| `train.lance` | 60,000 | +| `test.lance` | 10,000 | ## Schema | Column | Type | Notes | |---|---|---| -| `id` | `int64` | Row index within the split | +| `id` | `int64` | Row index within the split (natural join key for merges) | | `image` | `large_binary` | Inline PNG bytes (28×28 grayscale) | -| `label` | `int32` | Digit class id (0-9) | -| `label_name` | `string` | Human-readable class (`"0".."9"`) | +| `label` | `int32` | Digit class id (0–9) | +| `label_name` | `string` | Human-readable class (`"0"`..`"9"`) | | `image_emb` | `fixed_size_list` | CLIP image embedding (cosine-normalized) | ## Pre-built indices -- `IVF_PQ` on `image_emb` — vector similarity search (`metric=cosine`) -- `BTREE` on `label` — fast equality / range filters -- `BITMAP` on `label_name` — fast filters on the 10 class names +- `IVF_PQ` on `image_emb` — vector similarity search (cosine) +- `BTREE` on `label` — fast equality and range filters on the digit id +- `BITMAP` on `label_name` — fast filters across the ten class names + +## Why Lance? + +1. **Blazing Fast Random Access**: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation. +2. **Native Multimodal Support**: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search. +3. **Native Index Support**: Lance comes with fast, on-disk, scalable vector and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them. +4. **Efficient Data Evolution**: Add new columns and backfill data without rewriting the entire dataset. 
This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time. +5. **Versatile Querying**: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes. +6. **Data Versioning**: Every mutation commits a new version; previous versions remain intact on disk. Tags pin a snapshot by name, so retrieval systems and training runs can reproduce against an exact slice of history. ## Load with `datasets.load_dataset` +You can load Lance datasets via the standard HuggingFace `datasets` interface, suitable if your pipeline already speaks `Dataset` / `IterableDataset` or you want a quick streaming sample without installing anything Lance-specific. + ```python import datasets @@ -54,18 +65,10 @@ for row in hf_ds.take(3): print(row["label"], row["label_name"]) ``` -## Load directly with Lance (recommended) - -```python -import lance - -ds = lance.dataset("hf://datasets/lance-format/mnist-lance/data/train.lance") -print(ds.count_rows(), ds.schema.names) -print(ds.list_indices()) -``` - ## Load with LanceDB +LanceDB is the embedded retrieval library built on top of the Lance format ([docs](https://lancedb.com/docs)), and is the interface most users interact with. It wraps the dataset as a queryable table with search and filter builders, and is the entry point used by the Search, Curate, Evolve, Train, Versioning, and Materialize-a-subset sections below. + ```python import lancedb @@ -74,31 +77,27 @@ tbl = db.open_table("train") print(len(tbl)) ``` -> **Tip — for production use, download locally first.** Streaming from the Hub works for exploration, but heavy random access and ANN search are far faster against a local copy: -> ```bash -> hf download lance-format/mnist-lance --repo-type dataset --local-dir ./mnist-lance -> ``` -> Then `lance.dataset("./mnist-lance/data/train.lance")`. 
+## Load with Lance -## Vector search example +`pylance` is the Python binding for the Lance format and works directly with the format's lower-level APIs. Reach for it when you want to inspect or operate on dataset internals — schema, scanner, fragments, and the list of pre-built indices. ```python import lance -import pyarrow as pa ds = lance.dataset("hf://datasets/lance-format/mnist-lance/data/train.lance") -emb_field = ds.schema.field("image_emb") -ref = ds.take([0], columns=["image_emb"]).to_pylist()[0]["image_emb"] -query = pa.array([ref], type=emb_field.type) - -neighbors = ds.scanner( - nearest={"column": "image_emb", "q": query[0], "k": 5, "nprobes": 16, "refine_factor": 30}, - columns=["id", "label", "label_name"], -).to_table().to_pylist() -print(neighbors) +print(ds.count_rows(), ds.schema.names) +print(ds.list_indices()) ``` -### LanceDB vector search +> **Tip — for production use, download locally first.** Streaming from the Hub works for exploration, but heavy random access and ANN search are far faster against a local copy: +> ```bash +> hf download lance-format/mnist-lance --repo-type dataset --local-dir ./mnist-lance +> ``` +> Then point Lance or LanceDB at `./mnist-lance/data`. + +## Search + +The bundled `IVF_PQ` index on `image_emb` turns nearest-neighbor lookup on the 512-d CLIP space into a single call. In production you would encode a query digit through OpenCLIP `ViT-B-32` at runtime and pass the resulting vector to `tbl.search(...)`. The example below uses the embedding already stored in row 42 as a runnable stand-in so the snippet works without any model loaded. 
```python import lancedb @@ -106,64 +105,160 @@ import lancedb db = lancedb.connect("hf://datasets/lance-format/mnist-lance/data") tbl = db.open_table("train") -ref = tbl.search().limit(1).select(["image_emb"]).to_list()[0] -query_embedding = ref["image_emb"] +seed = ( + tbl.search() + .select(["image_emb", "label"]) + .limit(1) + .offset(42) + .to_list()[0] +) -results = ( - tbl.search(query_embedding) +hits = ( + tbl.search(seed["image_emb"]) .metric("cosine") .select(["id", "label", "label_name"]) - .limit(5) + .limit(10) .to_list() ) -for row in results: - print(row["id"], row["label"], row["label_name"]) +print("query digit:", seed["label"]) +for r in hits: + print(f" id={r['id']:>5} label={r['label']}") ``` -## Filter by class +Because the embeddings are cosine-normalized and MNIST digits cluster tightly in CLIP space, near-neighbors of a seed image are dominated by the seed's own digit class — a useful sanity check before swapping in a real query encoder. Tune `metric`, `nprobes`, and `refine_factor` to trade recall against latency. -```python -ds = lance.dataset("hf://datasets/lance-format/mnist-lance/data/train.lance") -sevens = ds.scanner(filter="label = 7", columns=["id"], limit=10).to_table() -print(sevens) -``` +## Curate -### Filter by class with LanceDB +A typical curation pass for a digit-classification workflow narrows the table to a single digit (or a small set of confusable digits like 4/9 or 3/8) before sampling. Because both label columns are indexed, the filter resolves without scanning the embedding or image bytes; the bounded `.limit(500)` keeps the output small enough to inspect or hand off as a manifest of row ids. 
```python import lancedb db = lancedb.connect("hf://datasets/lance-format/mnist-lance/data") tbl = db.open_table("train") -sevens = ( + +candidates = ( tbl.search() - .where("label = 7") - .select(["id"]) - .limit(10) + .where("label IN (4, 9)", prefilter=True) + .select(["id", "label", "label_name"]) + .limit(500) .to_list() ) -print(sevens) +print(f"{len(candidates)} 4/9 candidates") ``` -## Working with images +The result is a plain list of dictionaries, ready to inspect, persist as a manifest of `id`s, or feed into the Evolve and Train workflows below. The `image` and `image_emb` columns are never read, so the network traffic for a 500-row candidate scan is dominated by the tiny label payload. + +## Evolve + +Lance stores each column independently, so a new column can be appended without rewriting the existing data. The lightest form is a SQL expression: derive the new column from columns that already exist, and Lance computes it once and persists it. The example below adds an `is_target_class` flag for binary one-vs-rest experiments and an `is_curvy_digit` flag that groups digits with curved strokes, either of which can then be used directly in `where` clauses without recomputing the predicate on every query. + +> **Note:** Mutations require a local copy of the dataset, since the Hub mount is read-only. See the Materialize-a-subset section at the end of this card for a streaming pattern that downloads only the rows and columns you need, or use `hf download` to pull the full corpus first. 
```python -from pathlib import Path -import lance +import lancedb -ds = lance.dataset("hf://datasets/lance-format/mnist-lance/data/train.lance") -row = ds.take([0], columns=["image", "label"]).to_pylist()[0] -Path("digit_0.png").write_bytes(row["image"]) -print("label =", row["label"]) +db = lancedb.connect("./mnist-lance/data") # local copy required for writes +tbl = db.open_table("train") + +tbl.add_columns({ + "is_target_class": "label = 7", + "is_curvy_digit": "label IN (0, 3, 6, 8, 9)", +}) ``` -Images are stored inline as PNG bytes; scanning columns like `label` does not pay the I/O cost of loading image bytes. +If the values you want to attach already live in another table (offline labels from a stronger model, classifier predictions, per-row confidence scores), merge them in by joining on the `id` column: -## Why Lance? +```python +import pyarrow as pa + +predictions = pa.table({ + "id": pa.array([0, 1, 2], type=pa.int64()), + "pred_label": pa.array([5, 0, 4], type=pa.int32()), + "pred_conf": pa.array([0.97, 0.88, 0.82]), +}) +tbl.merge(predictions, on="id") +``` + +The original columns and indices are untouched, so existing code that does not reference the new columns continues to work unchanged. New columns become visible to every reader as soon as the operation commits. For column values that require a Python computation (e.g., running a second image encoder over the inline PNG bytes), Lance provides a batch-UDF API in the underlying library — see the [Lance data evolution docs](https://lance.org/guide/data_evolution/) for that pattern. + +## Train + +Projection lets a training loop read only the columns each step actually needs. LanceDB tables expose this through `Permutation.identity(tbl).select_columns([...])`, which plugs straight into the standard `torch.utils.data.DataLoader` so prefetch, shuffling, and batching behave as in any PyTorch pipeline. Columns added in the Evolve section above cost nothing per batch until they are explicitly projected. 
+ +```python +import lancedb +from lancedb.permutation import Permutation +from torch.utils.data import DataLoader + +db = lancedb.connect("hf://datasets/lance-format/mnist-lance/data") +tbl = db.open_table("train") + +train_ds = Permutation.identity(tbl).select_columns(["image", "label"]) +loader = DataLoader(train_ds, batch_size=256, shuffle=True, num_workers=4) + +for batch in loader: + # batch carries only the projected columns; image_emb stays on disk. + # decode the PNG bytes, normalize to [0, 1], forward, backward... + ... +``` + +Switching feature sets is a configuration change: passing `["image_emb", "label"]` to `select_columns(...)` on the next run skips PNG decoding entirely and reads only the cached 512-d vectors, which is the right shape for training a linear probe or a lightweight reranker on top of frozen CLIP features. + +## Versioning + +Every mutation to a Lance dataset, whether it adds a column, merges labels, or builds an index, commits a new version. Previous versions remain intact on disk. You can list versions and inspect the history directly from the Hub copy; creating new tags requires a local copy since tags are writes. + +```python +import lancedb + +db = lancedb.connect("hf://datasets/lance-format/mnist-lance/data") +tbl = db.open_table("train") + +print("Current version:", tbl.version) +print("History:", tbl.list_versions()) +print("Tags:", tbl.tags.list()) +``` + +Once you have a local copy, tag a version for reproducibility: + +```python +local_db = lancedb.connect("./mnist-lance/data") +local_tbl = local_db.open_table("train") +local_tbl.tags.create("clip-vitb32-v1", local_tbl.version) +``` + +A tagged version can be opened by name, or any version reopened by its number, against either the Hub copy or a local one: + +```python +tbl_v1 = db.open_table("train", version="clip-vitb32-v1") +tbl_v5 = db.open_table("train", version=5) +``` + +Pinning supports two workflows. 
A retrieval system locked to `clip-vitb32-v1` keeps returning stable results while the dataset evolves in parallel; newly added prediction columns or relabelings do not change what the tag resolves to. A training experiment pinned to the same tag can be rerun later against the exact same digits and labels, so changes in metrics reflect model changes rather than data drift. Neither workflow needs shadow copies or external manifest tracking. + +## Materialize a subset + +Reads from the Hub are lazy, so exploratory queries only transfer the columns and row groups they touch. Mutating operations (Evolve, tag creation) need a writable backing store, and a training loop benefits from a local copy with fast random access. Both can be served by a subset of the dataset rather than the full split. The pattern is to stream a filtered query through `.to_batches()` into a new local table; only the projected columns and matching row groups cross the wire, and the bytes never fully materialize in Python memory. + +```python +import lancedb + +remote_db = lancedb.connect("hf://datasets/lance-format/mnist-lance/data") +remote_tbl = remote_db.open_table("train") + +batches = ( + remote_tbl.search() + .where("label IN (4, 9)") + .select(["id", "image", "label", "label_name", "image_emb"]) + .to_batches() +) + +local_db = lancedb.connect("./mnist-4-vs-9") +local_db.create_table("train", batches) +``` -- One dataset for images + embeddings + indices + metadata — no sidecar files to manage. -- On-disk vector and full-text indices live next to the data, so search works on both local copies and the Hub. -- Schema evolution lets you add new columns (fresh embeddings, augmentations, model predictions) without rewriting the data ([docs](https://lance.org/guide/data_evolution/)). +The resulting `./mnist-4-vs-9` is a first-class LanceDB database. 
Every snippet in the Evolve, Train, and Versioning sections above works against it by swapping `hf://datasets/lance-format/mnist-lance/data` for `./mnist-4-vs-9`. ## Source & license diff --git a/docs/datasets/ms-marco-v2.mdx b/docs/datasets/ms-marco-v2.mdx index c54701d..4bdf039 100644 --- a/docs/datasets/ms-marco-v2.mdx +++ b/docs/datasets/ms-marco-v2.mdx @@ -1,7 +1,7 @@ --- title: "MS MARCO v2.1" sidebarTitle: "MS MARCO v2.1" -description: "Lance-formatted version of MS MARCO v2.1 — Microsoft's machine reading comprehension benchmark — with MiniLM query embeddings stored inline alongside the candidate passages and human-written answers." +description: "A Lance-formatted version of MS MARCO v2.1 — Microsoft's machine-reading-comprehension benchmark built from anonymized Bing query logs. Each row is one user query, the up-to-10 candidate passages Bing retrieved for it with relevance flags, and the…" --- -Lance-formatted version of [MS MARCO v2.1](https://huggingface.co/datasets/microsoft/ms_marco) — Microsoft's machine reading comprehension benchmark — with **MiniLM query embeddings** stored inline alongside the candidate passages and human-written answers. +A Lance-formatted version of [MS MARCO v2.1](https://huggingface.co/datasets/microsoft/ms_marco) — Microsoft's machine-reading-comprehension benchmark built from anonymized Bing query logs. Each row is one user query, the up-to-10 candidate passages Bing retrieved for it with relevance flags, and the human-written reference answers, with MiniLM query embeddings stored inline and pre-built ANN/FTS indices, available directly from the Hub at `hf://datasets/lance-format/ms-marco-v2.1-lance/data`. -## Why this version? +## Key features -- **One self-contained Lance dataset** with ~900 k queries; each row is a query, the 10 candidate passages retrieved by Bing, the relevance flags, and the human-written reference answers. 
-- **Pre-computed query embeddings** (`sentence-transformers/all-MiniLM-L6-v2`, 384-dim, L2-normalized) with an `IVF_PQ` index — semantic query lookup without re-embedding. -- **Full-text inverted indices** on the query and the first selected passage. -- Designed for both retrieval research (use the index) and RAG / answer eval (use the passage list + answers). +- **Self-contained passage-ranking rows** — each query carries up to 10 candidate passages in parallel `passage_text` / `passage_url` / `passage_is_selected` columns, alongside the human-written `answers` and `well_formed_answers`. +- **First relevant passage promoted to its own field** in `selected_passage`, so RAG / answer-evaluation workflows can read the gold context without indexing into the parallel passage lists. +- **Pre-computed 384-dim query embeddings** (`query_emb`, `sentence-transformers/all-MiniLM-L6-v2`, cosine-normalized) with a bundled `IVF_PQ` index for semantic query lookup. +- **One columnar dataset** — scan query metadata cheaply, defer the heavy passage text reads to the rows that matter. 
## Splits | Split | Rows | |-------|------| -| `train.lance` | 808,731 | +| `train.lance` | 808,731 | | `validation.lance` | 101,093 | ## Schema @@ -41,132 +41,256 @@ Lance-formatted version of [MS MARCO v2.1](https://huggingface.co/datasets/micro | `passage_url` | `list` | Source URLs for each candidate | | `passage_is_selected` | `list` | `1` if Bing labelled the passage relevant | | `selected_passage` | `string?` | First relevant passage (null if none) | -| `query_emb` | `fixed_size_list` | MiniLM embedding of `query` (cosine-normalized) | +| `query_emb` | `fixed_size_list` | MiniLM query embedding | ## Pre-built indices -- `IVF_PQ` on `query_emb` — `metric=cosine` -- `INVERTED` on `query` and `selected_passage` -- `BTREE` on `query_id` -- `BITMAP` on `query_type` +- `IVF_PQ` on `query_emb` — semantic query lookup (cosine) +- `INVERTED` (FTS) on `query` and `selected_passage` — keyword and hybrid search +- `BTREE` on `query_id` — stable lookup by identifier +- `BITMAP` on `query_type` — cheap predicate evaluation for query class -## Quick start +## Why Lance? + +1. **Blazing Fast Random Access**: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation. +2. **Native Multimodal Support**: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search. +3. **Native Index Support**: Lance comes with fast, on-disk, scalable vector and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them. +4. **Efficient Data Evolution**: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time. +5. 
**Versatile Querying**: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes. +6. **Data Versioning**: Every mutation commits a new version; previous versions remain intact on disk. Tags pin a snapshot by name, so retrieval systems and training runs can reproduce against an exact slice of history. + +## Load with `datasets.load_dataset` + +You can load Lance datasets via the standard HuggingFace `datasets` interface, suitable when your pipeline already speaks `Dataset` / `IterableDataset` or you want a quick streaming sample. ```python -import lance +import datasets -ds = lance.dataset("hf://datasets/lance-format/ms-marco-v2.1-lance/data/validation.lance") -print(ds.count_rows(), ds.schema.names, ds.list_indices()) +hf_ds = datasets.load_dataset("lance-format/ms-marco-v2.1-lance", split="validation", streaming=True) +for row in hf_ds.take(3): + print(row["query"], "->", row["answers"]) ``` ## Load with LanceDB -These tables can also be consumed by [LanceDB](https://lancedb.github.io/lancedb/), the multimodal lakehouse and embedded search library built on top of Lance, for simplified vector search and other queries. +LanceDB is the embedded retrieval library built on top of the Lance format ([docs](https://lancedb.com/docs)), and is the interface most users interact with. Each `.lance` file in `data/` is a table — open by name (`train`, `validation`). The same handle is used by the Search, Curate, Evolve, Versioning, and Materialize-a-subset sections below. ```python import lancedb db = lancedb.connect("hf://datasets/lance-format/ms-marco-v2.1-lance/data") tbl = db.open_table("validation") -print(f"LanceDB table opened with {len(tbl)} queries") +print(len(tbl)) ``` -## Semantic query lookup +## Load with Lance + +`pylance` is the Python binding for the Lance format and works directly with the format's lower-level APIs. 
Reach for it when you want to inspect dataset internals — schema, scanner, fragments, the list of pre-built indices. ```python import lance -import pyarrow as pa -from sentence_transformers import SentenceTransformer - -encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cuda") -q = encoder.encode(["how to compute determinant of a 3x3 matrix"], normalize_embeddings=True)[0] ds = lance.dataset("hf://datasets/lance-format/ms-marco-v2.1-lance/data/validation.lance") -emb_field = ds.schema.field("query_emb") -hits = ds.scanner( - nearest={"column": "query_emb", "q": pa.array([q.tolist()], type=emb_field.type)[0], "k": 5, "nprobes": 16, "refine_factor": 30}, - columns=["query_id", "query", "selected_passage", "answers"], -).to_table().to_pylist() -for h in hits: - print(h["query"]) - print(" selected:", (h.get("selected_passage") or "")[:120]) +print(ds.count_rows(), ds.schema.names) +print(ds.list_indices()) ``` -### LanceDB semantic query lookup +> **Tip — for production use, download locally first.** Streaming from the Hub works for exploration, but heavy random access and ANN search are far faster against a local copy: +> ```bash +> hf download lance-format/ms-marco-v2.1-lance --repo-type dataset --local-dir ./ms-marco-v2.1-lance +> ``` +> Then point Lance or LanceDB at `./ms-marco-v2.1-lance/data`. + +## Search + +The bundled `IVF_PQ` index on `query_emb` makes nearest-neighbour query lookup a single call. In production you would encode an incoming user query through the same 384-dim MiniLM encoder used at ingest and pass the resulting vector to `tbl.search(...)`. The example below uses the embedding from row 42 as a runnable stand-in so the snippet works without loading a model. 
```python import lancedb -from sentence_transformers import SentenceTransformer - -encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cuda") -q = encoder.encode(["how to compute determinant of a 3x3 matrix"], normalize_embeddings=True)[0] db = lancedb.connect("hf://datasets/lance-format/ms-marco-v2.1-lance/data") tbl = db.open_table("validation") -results = ( - tbl.search(q.tolist(), vector_column_name="query_emb") +seed = ( + tbl.search() + .select(["query_emb", "query"]) + .limit(1) + .offset(42) + .to_list()[0] +) + +hits = ( + tbl.search(seed["query_emb"], vector_column_name="query_emb") .metric("cosine") + .where("query_type = 'NUMERIC'", prefilter=True) .select(["query_id", "query", "selected_passage", "answers"]) - .limit(5) + .limit(10) .to_list() ) +for r in hits: + print(r["query"], "->", (r["selected_passage"] or "")[:120]) ``` -### LanceDB full-text search +The result set carries only the projected columns; the 384-d `query_emb` is never read on the result side, and the full `passage_text` list is left untouched, keeping the working set small even when the underlying scan touches every row of the validation split. + +Because the dataset also ships an `INVERTED` index on both `query` and `selected_passage`, the same query can be issued as a hybrid search that combines the dense vector with a keyword query against the gold passage. LanceDB merges the two result lists and reranks them in a single call, which is useful when a phrase must literally appear in the relevant passage but the dense side still does most of the ranking. + +```python +hybrid_hits = ( + tbl.search(query_type="hybrid") + .vector(seed["query_emb"]) + .text("determinant matrix") + .select(["query", "selected_passage", "answers"]) + .limit(10) + .to_list() +) +for r in hybrid_hits: + print(r["query"]) +``` + +Tune `metric`, `nprobes`, and `refine_factor` on the vector side to trade recall against latency for your workload. 
+ +## Curate + +A typical curation pass over MS MARCO starts by combining metadata filters with structural predicates over the parallel passage lists before any heavy text gets read. Lance evaluates the filter inside a single scan, so the candidate set comes back already filtered, and the bounded `.limit(1000)` keeps the output small enough to inspect. The example below assembles a set of numeric questions for which Bing labelled at least one passage relevant and the annotators produced a well-formed reference answer. ```python import lancedb db = lancedb.connect("hf://datasets/lance-format/ms-marco-v2.1-lance/data") -tbl = db.open_table("validation") +tbl = db.open_table("train") -results = ( - tbl.search("determinant matrix") - .select(["query", "selected_passage"]) - .limit(10) +candidates = ( + tbl.search() + .where( + "query_type = 'NUMERIC' " + "AND selected_passage IS NOT NULL " + "AND array_length(well_formed_answers) > 0 " + "AND length(query) >= 30", + prefilter=True, + ) + .select(["query_id", "query", "answers", "well_formed_answers"]) + .limit(1000) .to_list() ) +print(f"{len(candidates)} candidates; first: {candidates[0]['query']}") ``` -## Get all candidate passages for a query +The result is a plain list of dictionaries, ready to inspect, persist as a manifest of `query_id`s, or hand to the Evolve and Train sections below. Neither `passage_text` nor `query_emb` is read by this scan, so a 1000-row curation pass against the Hub moves only kilobytes of metadata. + +## Evolve + +Lance stores each column independently, so a new column can be appended without rewriting the existing data. The lightest form is a SQL expression: derive the new column from columns that already exist, and Lance computes it once and persists it. The example below adds a `query_length` column and a `num_selected` count over the parallel `passage_is_selected` list, either of which can then be used directly in `where` clauses without recomputing the predicate on every query. 
+ +> **Note:** Mutations require a local copy of the dataset, since the Hub mount is read-only. See the Materialize-a-subset section at the end of this card for a streaming pattern that downloads only the rows and columns you need, or use `hf download` to pull the full corpus. ```python -import lance -ds = lance.dataset("hf://datasets/lance-format/ms-marco-v2.1-lance/data/validation.lance") -row = ds.scanner(filter="query_id = 1185869", columns=["query", "passage_text", "passage_is_selected"]).to_table().to_pylist()[0] -for text, sel in zip(row["passage_text"], row["passage_is_selected"]): - print("[selected]" if sel else "[other]", text[:120]) +import lancedb + +db = lancedb.connect("./ms-marco-v2.1-lance/data") # local copy required for writes +tbl = db.open_table("train") + +tbl.add_columns({ + "query_length": "length(query)", + # NOTE: array_length returns the length of the list (the number of candidate passages), + # not the number of entries flagged 1 in passage_is_selected. + "num_selected": "array_length(passage_is_selected)", + "has_well_formed": "array_length(well_formed_answers) > 0", +}) ``` -## Filter by query_type +If the values you want to attach already live in another table (cross-encoder reranker scores, generated-answer judgments, alternate embeddings from a stronger model), merge them in by joining on `query_id`:
For column values that require a Python computation (e.g., running a different encoder over the query text), Lance provides a batch-UDF API — see the [Lance data evolution docs](https://lance.org/guide/data_evolution/). + +## Train + +Projection lets a training loop read only the columns each step actually needs. LanceDB tables expose this through `Permutation.identity(tbl).select_columns([...])`, which plugs straight into the standard `torch.utils.data.DataLoader` so prefetching, shuffling, and batching behave as in any PyTorch pipeline. For a reader-style QA model the natural projection is the query plus the gold passage and the answer; for a query-encoder retraining loop the precomputed embedding is enough on its own. ```python import lancedb +from lancedb.permutation import Permutation +from torch.utils.data import DataLoader db = lancedb.connect("hf://datasets/lance-format/ms-marco-v2.1-lance/data") tbl = db.open_table("train") -numeric = ( - tbl.search() - .where("query_type = 'NUMERIC'") - .select(["query"]) - .limit(5) - .to_list() -) + +train_ds = Permutation.identity(tbl).select_columns(["query", "selected_passage", "answers"]) +loader = DataLoader(train_ds, batch_size=32, shuffle=True, num_workers=4) + +for batch in loader: + # batch carries only the projected columns; tokenize, forward, backward... + ... ``` -## Why Lance? +Switching feature sets is a configuration change: passing `["query_emb", "passage_text", "passage_is_selected"]` to `select_columns(...)` on the next run reads only those columns, which is the right shape for training a passage reranker on cached query embeddings. Columns added in Evolve cost nothing per batch until they are explicitly projected. + +## Versioning + +Every mutation to a Lance dataset, whether it adds a column, merges labels, or builds an index, commits a new version. Previous versions remain intact on disk. 
You can list versions and inspect the history directly from the Hub copy; creating new tags requires a local copy since tags are writes. + +```python +import lancedb + +db = lancedb.connect("hf://datasets/lance-format/ms-marco-v2.1-lance/data") +tbl = db.open_table("train") + +print("Current version:", tbl.version) +print("History:", tbl.list_versions()) +print("Tags:", tbl.tags.list()) +``` + +Once you have a local copy, tag a version for reproducibility: + +```python +local_db = lancedb.connect("./ms-marco-v2.1-lance/data") +local_tbl = local_db.open_table("train") +local_tbl.tags.create("numeric-v1", local_tbl.version) +``` + +A tagged version can be opened by name, or any version reopened by its number, against either the Hub copy or a local one: + +```python +tbl_v1 = db.open_table("train", version="numeric-v1") +tbl_v5 = db.open_table("train", version=5) +``` + +Pinning supports two workflows. A retrieval system locked to `numeric-v1` keeps returning stable passages while the dataset evolves in parallel — newly added reranker scores or labels do not change what the tag resolves to. A training experiment pinned to the same tag can be rerun later against the exact same queries and passages, so changes in metrics reflect model changes rather than data drift. Neither workflow needs shadow copies or external manifest tracking. + +## Materialize a subset + +Reads from the Hub are lazy, so exploratory queries only transfer the columns and row groups they touch. Mutating operations (Evolve, tag creation) need a writable backing store, and a training loop benefits from a local copy with fast random access. Both can be served by a subset of the dataset rather than the full corpus. The pattern is to stream a filtered query through `.to_batches()` into a new local table; only the projected columns and matching row groups cross the wire, and the bytes never fully materialize in Python memory. 
+ +```python +import lancedb + +remote_db = lancedb.connect("hf://datasets/lance-format/ms-marco-v2.1-lance/data") +remote_tbl = remote_db.open_table("train") + +batches = ( + remote_tbl.search() + .where( + "query_type = 'NUMERIC' " + "AND selected_passage IS NOT NULL " + "AND array_length(well_formed_answers) > 0" + ) + .select(["query_id", "query", "query_type", "answers", "well_formed_answers", "selected_passage", "query_emb"]) + .to_batches() +) + +local_db = lancedb.connect("./ms-marco-numeric") +local_db.create_table("train", batches) +``` -- One dataset carries queries + passages + answers + embeddings + indices — no sidecar files. -- On-disk vector and full-text indices live next to the data, so search works on local copies and on the Hub. -- Schema evolution: add columns (alternate embeddings, generated answers, model predictions) without rewriting the data. +The resulting `./ms-marco-numeric` is a first-class LanceDB database. Every snippet in the Search, Evolve, Train, and Versioning sections above works against it by swapping `hf://datasets/lance-format/ms-marco-v2.1-lance/data` for `./ms-marco-numeric`. ## Source & license diff --git a/docs/datasets/natural-questions-val.mdx b/docs/datasets/natural-questions-val.mdx index 14c30fb..399058c 100644 --- a/docs/datasets/natural-questions-val.mdx +++ b/docs/datasets/natural-questions-val.mdx @@ -1,7 +1,7 @@ --- title: "Natural Questions Validation" sidebarTitle: "Natural Questions Validation" -description: "Lance-formatted version of the Natural Questions validation split — 7,830 real Google search queries with their full Wikipedia articles and 1–5 annotator labels per question. Sourced from google-research-datasets/natural_questions." +description: "A Lance-formatted version of the Natural Questions validation split — 7,830 real Google search queries paired with the full Wikipedia article a human used to answer them, plus 1–5 annotator labels per question. 
MiniLM question embeddings are stored…" --- -Lance-formatted version of the [Natural Questions](https://ai.google.com/research/NaturalQuestions/) **validation split** — 7,830 real Google search queries with their full Wikipedia articles and 1–5 annotator labels per question. Sourced from [`google-research-datasets/natural_questions`](https://huggingface.co/datasets/google-research-datasets/natural_questions). +A Lance-formatted version of the [Natural Questions](https://ai.google.com/research/NaturalQuestions/) validation split — 7,830 real Google search queries paired with the full Wikipedia article a human used to answer them, plus 1–5 annotator labels per question. MiniLM question embeddings are stored inline and the dataset ships with pre-built ANN/FTS indices, all available directly from the Hub at `hf://datasets/lance-format/natural-questions-val-lance/data`. Sourced from [`google-research-datasets/natural_questions`](https://huggingface.co/datasets/google-research-datasets/natural_questions). -> The NQ **train** split is 143 GB (307,373 rows); it is intentionally not bundled here. Add it via `natural_questions/dataprep.py --splits train` once disk + bandwidth allow. +> The NQ **train** split is 143 GB (307,373 rows); it is intentionally not bundled here. Add it via `natural_questions/dataprep.py --splits train` once disk and bandwidth allow. + +## Key features + +- **Real Google search queries** with the full Wikipedia article that answers each one — `document_html` carries the inline UTF-8 HTML, so no sidecar files or external lookups are needed at query time. +- **Annotator answer summaries** — `short_answers` aggregates and dedupes spans across all annotators, `yes_no_answer` carries the majority vote, and the `has_short_answer` / `has_long_answer` flags make annotation-coverage filters a single predicate. 
+- **Pre-computed 384-dim question embeddings** (`question_emb`, `sentence-transformers/all-MiniLM-L6-v2`, cosine-normalized) with a bundled `IVF_PQ` index for semantic question lookup. +- **One columnar dataset** — scan question metadata cheaply, then read the heavy `document_html` only for the rows you actually want. ## Splits @@ -36,57 +43,118 @@ Lance-formatted version of the [Natural Questions](https://ai.google.com/researc | `has_short_answer` | `bool` | At least one annotator provided a short-answer span | | `has_long_answer` | `bool` | At least one annotator selected a long-answer candidate | | `yes_no_answer` | `string` | `YES` / `NO` / `NONE` — majority vote across annotators | -| `question_emb` | `fixed_size_list` | sentence-transformers `all-MiniLM-L6-v2` (cosine-normalized) | +| `question_emb` | `fixed_size_list` | MiniLM question embedding | ## Pre-built indices -- `IVF_PQ` on `question_emb` — `metric=cosine` -- `INVERTED` (FTS) on `question` -- `BTREE` on `id`, `document_title` -- `BITMAP` on `yes_no_answer`, `has_short_answer`, `has_long_answer` +- `IVF_PQ` on `question_emb` — semantic question lookup (cosine) +- `INVERTED` (FTS) on `question` — keyword and hybrid search +- `BTREE` on `id`, `document_title` — stable lookup by identifier +- `BITMAP` on `yes_no_answer`, `has_short_answer`, `has_long_answer` — cheap predicate evaluation for annotation coverage + +## Why Lance? -## Quick start +1. **Blazing Fast Random Access**: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation. +2. **Native Multimodal Support**: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search. +3. 
**Native Index Support**: Lance comes with fast, on-disk, scalable vector and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them. +4. **Efficient Data Evolution**: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time. +5. **Versatile Querying**: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes. +6. **Data Versioning**: Every mutation commits a new version; previous versions remain intact on disk. Tags pin a snapshot by name, so retrieval systems and training runs can reproduce against an exact slice of history. + +## Load with `datasets.load_dataset` + +You can load Lance datasets via the standard HuggingFace `datasets` interface, suitable when your pipeline already speaks `Dataset` / `IterableDataset` or you want a quick streaming sample. ```python -import lance -ds = lance.dataset("hf://datasets/lance-format/natural-questions-val-lance/data/validation.lance") -print(ds.count_rows(), ds.schema.names, ds.list_indices()) +import datasets + +hf_ds = datasets.load_dataset("lance-format/natural-questions-val-lance", split="validation", streaming=True) +for row in hf_ds.take(3): + print(row["question"], "->", row["short_answers"]) ``` ## Load with LanceDB -These tables can also be consumed by [LanceDB](https://lancedb.github.io/lancedb/), the multimodal lakehouse and embedded search library built on top of Lance, for simplified vector search and other queries. +LanceDB is the embedded retrieval library built on top of the Lance format ([docs](https://lancedb.com/docs)), and is the interface most users interact with. 
It wraps the dataset as a queryable table with search and filter builders, and is the entry point used by the Search, Curate, Evolve, Versioning, and Materialize-a-subset sections below. ```python import lancedb db = lancedb.connect("hf://datasets/lance-format/natural-questions-val-lance/data") tbl = db.open_table("validation") -print(f"LanceDB table opened with {len(tbl)} questions") +print(len(tbl)) ``` -### LanceDB semantic question search +## Load with Lance + +`pylance` is the Python binding for the Lance format and works directly with the format's lower-level APIs. Reach for it when you want to inspect dataset internals — schema, scanner, fragments, the list of pre-built indices. ```python -import lancedb -from sentence_transformers import SentenceTransformer +import lance -encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cuda") -q = encoder.encode(["who wrote the declaration of independence"], normalize_embeddings=True)[0] +ds = lance.dataset("hf://datasets/lance-format/natural-questions-val-lance/data/validation.lance") +print(ds.count_rows(), ds.schema.names) +print(ds.list_indices()) +``` + +> **Tip — for production use, download locally first.** Streaming from the Hub works for exploration, but heavy random access, ANN search, and HTML decoding are far faster against a local copy: +> ```bash +> hf download lance-format/natural-questions-val-lance --repo-type dataset --local-dir ./natural-questions-val-lance +> ``` +> Then point Lance or LanceDB at `./natural-questions-val-lance/data`. + +## Search + +The bundled `IVF_PQ` index on `question_emb` makes nearest-neighbour question lookup a single call. In production you would encode an incoming user query through the same 384-dim MiniLM encoder used at ingest and pass the resulting vector to `tbl.search(...)`. The example below uses the embedding from row 42 as a runnable stand-in so the snippet works without loading a model. 
+ +```python +import lancedb db = lancedb.connect("hf://datasets/lance-format/natural-questions-val-lance/data") tbl = db.open_table("validation") -results = ( - tbl.search(q.tolist(), vector_column_name="question_emb") +seed = ( + tbl.search() + .select(["question_emb", "question"]) + .limit(1) + .offset(42) + .to_list()[0] +) + +hits = ( + tbl.search(seed["question_emb"], vector_column_name="question_emb") .metric("cosine") + .where("has_short_answer = TRUE", prefilter=True) .select(["question", "short_answers", "document_title"]) - .limit(5) + .limit(10) .to_list() ) +for r in hits: + print(r["question"], "->", r["short_answers"]) ``` -### LanceDB full-text search +The result set carries only the projected columns; the 384-d `question_emb` is never read on the result side, and the heavy `document_html` is left untouched, keeping the working set small even though each row carries a full Wikipedia article inline. + +Because the dataset also ships an `INVERTED` index on `question`, the same query can be issued as a hybrid search that combines the dense vector with a keyword query against the question text. LanceDB merges the two result lists and reranks them in a single call, which is useful when a named entity must literally appear in the query but the dense side still does most of the ranking. + +```python +hybrid_hits = ( + tbl.search(query_type="hybrid") + .vector(seed["question_emb"]) + .text("declaration of independence") + .select(["question", "short_answers", "document_title"]) + .limit(10) + .to_list() +) +for r in hybrid_hits: + print(r["question"]) +``` + +Tune `metric`, `nprobes`, and `refine_factor` on the vector side to trade recall against latency for your workload. + +## Curate + +A typical curation pass over NQ starts with annotation-coverage filters before any HTML gets read. Lance evaluates the filter inside a single scan, so the candidate set comes back already filtered, and the bounded `.limit(500)` keeps the output small enough to inspect. 
The example below assembles a set of factoid questions with at least one short-answer span and a non-yes/no resolution. ```python import lancedb @@ -94,52 +162,138 @@ import lancedb db = lancedb.connect("hf://datasets/lance-format/natural-questions-val-lance/data") tbl = db.open_table("validation") -results = ( - tbl.search("declaration of independence") - .select(["question", "document_title"]) - .limit(10) +candidates = ( + tbl.search() + .where( + "has_short_answer = TRUE " + "AND yes_no_answer = 'NONE' " + "AND array_length(short_answers) >= 1 " + "AND length(question) >= 30", + prefilter=True, + ) + .select(["id", "question", "short_answers", "document_title", "document_url"]) + .limit(500) .to_list() ) +print(f"{len(candidates)} candidates; first: {candidates[0]['question']}") ``` -## Get only questions with short-answer spans +The result is a plain list of dictionaries, ready to inspect, persist as a manifest of NQ example ids, or hand to the Evolve and Train sections below. The large `document_html` column is not read by this scan, so a 500-row curation pass against the Hub moves only kilobytes of metadata even though each row holds an entire Wikipedia article. + +## Evolve + +Lance stores each column independently, so a new column can be appended without rewriting the existing data. The lightest form is a SQL expression: derive the new column from columns that already exist, and Lance computes it once and persists it. The example below adds a `question_length` column, a `first_short_answer_length` derived from the deduped span list, and an `is_factoid` flag that combines the annotation flags, any of which can then be used directly in `where` clauses without recomputing the predicate on every query. + +> **Note:** Mutations require a local copy of the dataset, since the Hub mount is read-only. 
See the Materialize-a-subset section at the end of this card for a streaming pattern that downloads only the rows and columns you need, or use `hf download` to pull the full corpus. ```python -import lance -ds = lance.dataset("hf://datasets/lance-format/natural-questions-val-lance/data/validation.lance") -short = ds.scanner( - filter="has_short_answer = true", - columns=["question", "short_answers", "document_title"], - limit=10, -).to_table().to_pylist() +import lancedb + +db = lancedb.connect("./natural-questions-val-lance/data") # local copy required for writes +tbl = db.open_table("validation") + +tbl.add_columns({ + "question_length": "length(question)", + "first_short_answer_length": "length(short_answers[1])", + "is_factoid": "has_short_answer = TRUE AND yes_no_answer = 'NONE'", +}) ``` -### Filter with LanceDB +If the values you want to attach already live in another table (offline retriever scores, generated-answer judgments, alternate embeddings from a stronger model), merge them in by joining on `id`: + +```python +import pyarrow as pa + +retriever_scores = pa.table({ + "id": pa.array(["797803103333068850", "5225754983651766092"]), + "bm25_top1_score": pa.array([14.2, 8.7]), +}) +tbl.merge(retriever_scores, on="id") +``` + +The original columns and indices are untouched, so existing code that does not reference the new columns continues to work unchanged. New columns become visible to every reader as soon as the operation commits. For column values that require a Python computation (e.g., extracting the long-answer paragraph from `document_html`), Lance provides a batch-UDF API — see the [Lance data evolution docs](https://lance.org/guide/data_evolution/). + +## Train + +Projection lets a training loop read only the columns each step actually needs. 
LanceDB tables expose this through `Permutation.identity(tbl).select_columns([...])`, which plugs straight into the standard `torch.utils.data.DataLoader` so prefetching, shuffling, and batching behave as in any PyTorch pipeline. For an open-domain QA reader the natural projection is the question plus the full document HTML and the answer spans; for a question-encoder retraining loop the precomputed embedding is enough on its own, and skipping `document_html` keeps each batch small. ```python import lancedb +from lancedb.permutation import Permutation +from torch.utils.data import DataLoader db = lancedb.connect("hf://datasets/lance-format/natural-questions-val-lance/data") tbl = db.open_table("validation") -short = ( - tbl.search() - .where("has_short_answer = true") - .select(["question", "short_answers", "document_title"]) - .limit(10) - .to_list() -) + +train_ds = Permutation.identity(tbl).select_columns(["question", "document_html", "short_answers"]) +loader = DataLoader(train_ds, batch_size=4, shuffle=True, num_workers=2) + +for batch in loader: + # batch carries only the projected columns; tokenize, forward, backward... + ... ``` -## Read the full Wikipedia HTML for one question +Switching feature sets is a configuration change: passing `["question_emb", "short_answers"]` to `select_columns(...)` on the next run reads only the 384-d vectors and the answer spans, which is the right shape for fine-tuning a retrieval head on cached embeddings without paying for the multi-megabyte `document_html` per row. Columns added in Evolve cost nothing per batch until they are explicitly projected. + +## Versioning + +Every mutation to a Lance dataset, whether it adds a column, merges labels, or builds an index, commits a new version. Previous versions remain intact on disk. You can list versions and inspect the history directly from the Hub copy; creating new tags requires a local copy since tags are writes. 
```python -import lance -ds = lance.dataset("hf://datasets/lance-format/natural-questions-val-lance/data/validation.lance") -row = ds.take([0], columns=["question", "document_html", "document_url"]).to_pylist()[0] -print(row["question"], "->", row["document_url"]) -print(row["document_html"][:500].decode("utf-8", errors="replace")) +import lancedb + +db = lancedb.connect("hf://datasets/lance-format/natural-questions-val-lance/data") +tbl = db.open_table("validation") + +print("Current version:", tbl.version) +print("History:", tbl.list_versions()) +print("Tags:", tbl.tags.list()) +``` + +Once you have a local copy, tag a version for reproducibility: + +```python +local_db = lancedb.connect("./natural-questions-val-lance/data") +local_tbl = local_db.open_table("validation") +local_tbl.tags.create("factoid-v1", local_tbl.version) ``` +A tagged version can be opened by name, or any version reopened by its number, against either the Hub copy or a local one: + +```python +tbl_v1 = db.open_table("validation", version="factoid-v1") +tbl_v5 = db.open_table("validation", version=5) +``` + +Pinning supports two workflows. A QA system locked to `factoid-v1` keeps returning stable answer spans while the dataset evolves in parallel — newly added retriever scores or labels do not change what the tag resolves to. An evaluation experiment pinned to the same tag can be rerun later against the exact same questions and articles, so changes in metrics reflect model changes rather than data drift. Neither workflow needs shadow copies or external manifest tracking. + +## Materialize a subset + +Reads from the Hub are lazy, so exploratory queries only transfer the columns and row groups they touch. Mutating operations (Evolve, tag creation) need a writable backing store, and a training loop benefits from a local copy with fast random access. Both can be served by a subset of the dataset rather than the full corpus. 
The pattern is to stream a filtered query through `.to_batches()` into a new local table; only the projected columns and matching row groups cross the wire, and the bytes never fully materialize in Python memory. + +```python +import lancedb + +remote_db = lancedb.connect("hf://datasets/lance-format/natural-questions-val-lance/data") +remote_tbl = remote_db.open_table("validation") + +batches = ( + remote_tbl.search() + .where( + "has_short_answer = TRUE " + "AND yes_no_answer = 'NONE' " + "AND array_length(short_answers) >= 1" + ) + .select(["id", "question", "document_title", "document_url", "short_answers", "question_emb"]) + .to_batches() +) + +local_db = lancedb.connect("./nq-factoid") +local_db.create_table("validation", batches) +``` + +The resulting `./nq-factoid` is a first-class LanceDB database. Every snippet in the Search, Evolve, Train, and Versioning sections above works against it by swapping `hf://datasets/lance-format/natural-questions-val-lance/data` for `./nq-factoid`. Note that this projection deliberately omits `document_html`; include it in the `.select(...)` list when the downstream task needs the article body. + ## Source & license Converted from [`google-research-datasets/natural_questions`](https://huggingface.co/datasets/google-research-datasets/natural_questions). NQ is released under [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) (matching the Wikipedia source). diff --git a/docs/datasets/openvid.mdx b/docs/datasets/openvid.mdx index 3816fe1..cb91814 100644 --- a/docs/datasets/openvid.mdx +++ b/docs/datasets/openvid.mdx @@ -1,7 +1,7 @@ --- title: "OpenVid-1M" sidebarTitle: "OpenVid-1M" -description: "Lance format version of the OpenVid dataset with 937,957 high-quality videos stored with inline video blobs, embeddings, and rich metadata." 
+description: "A Lance-formatted version of the OpenVid-1M corpus — 937,957 high-quality clips with inline MP4 bytes, 1024-dim video embeddings, captions, and rich per-clip quality signals — available directly from the Hub at…" --- -Lance format version of the [OpenVid dataset](https://huggingface.co/datasets/nkp37/OpenVid-1M) with **937,957 high-quality videos** stored with inline video blobs, embeddings, and rich metadata. +A Lance-formatted version of the [OpenVid-1M](https://huggingface.co/datasets/nkp37/OpenVid-1M) corpus — **937,957 high-quality clips** with inline MP4 bytes, 1024-dim video embeddings, captions, and rich per-clip quality signals — available directly from the Hub at `hf://datasets/lance-format/openvid-lance/data/train.lance`. ![](https://huggingface.co/datasets/nkp37/OpenVid-1M/resolve/main/OpenVid-1M.png) -**Key Features:** -The dataset is stored in lance format with inline video blobs, video embeddings, and rich metadata. +## Key features -- **Videos stored inline as blobs** - No external files to manage -- **Efficient column access** - Load metadata without touching video data -- **Prebuilt indices available** - IVF_PQ index for similarity search, FTS index on captions -- **Fast random access** - Read any video instantly by index -- **HuggingFace integration** - Load directly from the Hub +- **Inline MP4 bytes** in the `video_blob` column, stored in a side blob file and surfaced as lazy `BlobFile` handles via `take_blobs` — metadata scans, search, and filtering never read a single byte of video data. +- **Pre-computed 1024-dim video embeddings** in `embedding` with a bundled `IVF_PQ` ANN index. +- **Pre-built `INVERTED` (FTS) index on `caption`** for keyword and hybrid search. +- **Rich quality signals** — `aesthetic_score`, `motion_score`, `temporal_consistency_score`, `camera_motion`, `fps`, `seconds` — that downstream filters can stack on. 
-## Load lance dataset using `datasets.load_dataset` +## Splits + +`train.lance` + +## Schema + +| Column | Type | Notes | +|---|---|---| +| `video_blob` | `large_binary` (blob-encoded) | Inline MP4 bytes; stored in a separate blob file and read lazily through `take_blobs` | +| `video_path` | `string` | Original file path / object key | +| `caption` | `string` | Text description of the clip | +| `embedding` | `fixed_size_list` | Video embedding | +| `aesthetic_score` | `float64` | Visual quality, roughly 0–6 | +| `motion_score` | `float64` | Amount of motion, 0–1 | +| `temporal_consistency_score` | `float64` | Frame-to-frame stability, 0–1 | +| `camera_motion` | `string` | `pan`, `zoom`, `static`, etc. | +| `fps` | `float64` | Frames per second | +| `seconds` | `float64` | Clip duration | +| `frame` | `int64` | Total frame count | + +## Pre-built indices + +- `IVF_PQ` on `embedding` — video similarity (L2) +- `INVERTED` (FTS) on `caption` — keyword and hybrid search + +## Why Lance? + +1. **Blazing Fast Random Access**: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation. +2. **Native Multimodal Support**: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search. +3. **Native Index Support**: Lance comes with fast, on-disk, scalable vector and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them. +4. **Efficient Data Evolution**: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time. +5. 
**Versatile Querying**: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes. +6. **Data Versioning**: Every mutation commits a new version; previous versions remain intact on disk. Tags pin a snapshot by name, so retrieval systems and training runs can reproduce against an exact slice of history. + +## Load with `datasets.load_dataset` + +You can load Lance datasets via the standard HuggingFace `datasets` interface, suitable when your pipeline already speaks `Dataset` / `IterableDataset` or you just want a quick streaming sample. ```python import datasets -hf_ds = datasets.load_dataset( - "lance-format/openvid-lance", - split="train", - streaming=True, -) -# Take first three rows and print captions +hf_ds = datasets.load_dataset("lance-format/openvid-lance", split="train", streaming=True) for row in hf_ds.take(3): print(row["caption"]) ``` -You can also load lance datasets from HF hub using native API when you want blob bytes or advanced indexing while still pointing at the same dataset on the Hub: +## Load with LanceDB -```python -import lance - -lance_ds = lance.dataset("hf://datasets/lance-format/openvid-lance/data/train.lance") -blob_file = lance_ds.take_blobs("video_blob", ids=[0])[0] -video_bytes = blob_file.read() -``` - -These tables can also be consumed by [LanceDB](https://lancedb.github.io/lancedb/), the multimodal lakehouse and embedded search library built on top of Lance, for simplified vector search and other queries. +LanceDB is the embedded retrieval library built on top of the Lance format ([docs](https://lancedb.com/docs)), and is the interface most users interact with. It wraps the dataset as a queryable table with search and filter builders, and is the entry point used by the Search, Curate, Evolve, Versioning, and Materialize-a-subset sections below. 
```python import lancedb db = lancedb.connect("hf://datasets/lance-format/openvid-lance/data") tbl = db.open_table("train") -print(f"LanceDB table opened with {len(tbl)} videos") +print(len(tbl)) ``` +## Load with Lance -## Why Lance? - -- Optimized for AI workloads: Lance keeps multimodal data and vector search-ready storage in the same columnar format designed for accelerator-era retrieval (see [lance.org](https://lance.org)). -- Images + embeddings + metadata travel as one tabular dataset. -- On-disk, scalable ANN index means -- Schema evolution lets you add new features/columns (moderation tags, embeddings, etc.) without rewriting the raw data. +`pylance` is the Python binding for the Lance format and works directly with the format's lower-level APIs. Reach for it when you want to inspect dataset internals — schema, scanner, fragments, the list of pre-built indices — or when you need the blob-level `take_blobs` entry point that streams video bytes lazily from inline storage. +```python +import lance -## Lance Blob API +ds = lance.dataset("hf://datasets/lance-format/openvid-lance/data/train.lance") +print(ds.count_rows(), ds.schema.names) +print(ds.list_indices()) +``` -Lance stores videos as **inline blobs** - binary data embedded directly in the dataset. This provides: +> **Tip — for production use, download locally first.** Streaming from the Hub works for exploration, but heavy random access, ANN search, and video decoding are far faster against a local copy: +> ```bash +> hf download lance-format/openvid-lance --repo-type dataset --local-dir ./openvid +> ``` +> Then point Lance or LanceDB at `./openvid/data`. 
-- **Single source of truth** - Videos and metadata together in one dataset -- **Lazy loading** - Videos only loaded when you explicitly request them -- **Efficient storage** - Optimized encoding for large binary data -- **Transactional consistency** - Query and retrieve in one atomic operation +## Search +The bundled `IVF_PQ` index on `embedding` makes approximate-nearest-neighbor search a single call. In production you would encode a text prompt through a text-to-video model or a reference clip through the same video encoder used at ingest, and pass the resulting 1024-d vector to `tbl.search(...)`. The example below uses the embedding from row 42 as a runnable stand-in. ```python -import lance - -ds = lance.dataset("hf://datasets/lance-format/openvid-lance") - -# 1. Browse metadata without loading video data -metadata = ds.scanner( - columns=["caption", "aesthetic_score"], # No video_blob column! - filter="aesthetic_score >= 4.5", - limit=10 -).to_table().to_pylist() +import lancedb -# 2. User selects video to watch -selected_index = 3 +db = lancedb.connect("hf://datasets/lance-format/openvid-lance/data") +tbl = db.open_table("train") -# 3. Load only that video blob -blob_file = ds.take_blobs("video_blob", ids=[selected_index])[0] -video_bytes = blob_file.read() +seed = ( + tbl.search() + .select(["embedding", "caption"]) + .limit(1) + .offset(42) + .to_list()[0] +) -# 4. Save to disk -with open("video.mp4", "wb") as f: - f.write(video_bytes) +hits = ( + tbl.search(seed["embedding"]) + .metric("L2") + .select(["caption", "aesthetic_score", "camera_motion", "seconds"]) + .limit(10) + .to_list() +) +for r in hits: + print(f"{r['aesthetic_score']:.2f} | {r['camera_motion']:>8} | {r['caption'][:60]}") ``` -## Quick Start +The result set carries only the projected columns. The `video_blob` column is never read, so the network traffic for a top-10 search is dominated by a few kilobytes of caption text, not by megabytes of MP4. 
The lazy blob fetch comes later — see Curate below. -```python -import lance +Because OpenVid also ships an `INVERTED` index on `caption`, the same query can be issued as a hybrid search that combines the dense vector with a keyword query. LanceDB merges the two result lists and reranks them in a single call. -ds = lance.dataset("hf://datasets/lance-format/openvid-lance/data/train.lance") -print(f"Total videos: {ds.count_rows():,}") +```python +hybrid_hits = ( + tbl.search(query_type="hybrid") + .vector(seed["embedding"]) + .text("sunset over the ocean") + .select(["caption", "aesthetic_score", "seconds"]) + .limit(10) + .to_list() +) +for r in hybrid_hits: + print(f"{r['aesthetic_score']:.2f} | {r['seconds']:.1f}s | {r['caption'][:60]}") ``` -> **⚠️ HuggingFace Streaming Note** -> -> When streaming from HuggingFace (as shown above), some operations use minimal parameters to avoid rate limits: -> - `nprobes=1` for vector search (lowest value) -> - Column selection to reduce I/O -> -> **You may still hit rate limits on HuggingFace's free tier.** For best performance and to avoid rate limits, **download the dataset locally**: -> -> ```bash -> # Download once -> huggingface-cli download lance-format/openvid-lance --repo-type dataset --local-dir ./openvid -> -> # Then load locally -> ds = lance.dataset("./openvid") -> ``` -> -> Streaming is recommended only for quick exploration and testing. - +Tune `metric`, `nprobes`, and `refine_factor` on the vector side to trade recall against latency. -## Dataset Schema +## Curate -Each row contains: -- `video_blob` - Video file as binary blob (inline storage) -- `caption` - Text description of the video -- `embedding` - 1024-dim vector embedding -- `aesthetic_score` - Visual quality score (0-5+) -- `motion_score` - Amount of motion (0-1) -- `temporal_consistency_score` - Frame consistency (0-1) -- `camera_motion` - Camera movement type (pan, zoom, static, etc.) 
-- `fps`, `seconds`, `frame` - Video properties +Curation for a video workflow almost always starts as a metadata filter — pick the dynamic, high-aesthetic, well-stabilized clips first, then decide what to do with the video bytes. Stacking predicates inside a single filtered scan keeps the result small and explicit, and the bounded `.limit(200)` makes it cheap to inspect or hand off. -## Usage Examples +```python +import lancedb -### 1. Browse Metadata quickly (Fast - No Video Loading) +db = lancedb.connect("hf://datasets/lance-format/openvid-lance/data") +tbl = db.open_table("train") -```python -# Load only metadata without heavy video blobs -scanner = ds.scanner( - columns=["caption", "aesthetic_score", "motion_score"], - limit=10 +candidates = ( + tbl.search() + .where( + "aesthetic_score >= 4.5 " + "AND motion_score >= 0.3 " + "AND temporal_consistency_score >= 0.9", + prefilter=True, + ) + .select(["caption", "camera_motion", "aesthetic_score", "fps", "seconds"]) + .limit(200) + .with_row_id(True) + .to_list() ) -videos = scanner.to_table().to_pylist() - -for video in videos: - print(f"{video['caption']} - Quality: {video['aesthetic_score']:.2f}") +print(f"{len(candidates)} clips selected") ``` -### 2. Export Videos from Blobs +The scan above never reads the `video_blob` column. Lance stores blobs in a separate side file referenced by the dataset, so column-projected reads skip them entirely until they are explicitly requested. That is what makes "find me the right clips" a metadata-only operation against a million-row video corpus. + +Once the candidate set is fixed, pull the actual video bytes through pylance's `take_blobs`. It returns one `BlobFile` per row — a file-like handle that streams from inline blob storage on demand rather than reading the full clip into Python memory up front. 
For video specifically, this is the operation that matters: a video model trainer or a dataloader inspecting a few seconds of each clip should never have to materialize entire MP4s in memory just to inspect or decode part of them. ```python -# Load specific videos by index -indices = [0, 100, 500] -blob_files = ds.take_blobs("video_blob", ids=indices) - -# Save to disk -for i, blob_file in enumerate(blob_files): - with open(f"video_{i}.mp4", "wb") as f: - f.write(blob_file.read()) +import lance + +ds = lance.dataset("hf://datasets/lance-format/openvid-lance/data/train.lance") + +row_ids = [r["_rowid"] for r in candidates[:10]] +blob_files = ds.take_blobs("video_blob", ids=row_ids) ``` -### 3. Open inline videos with PyAV and run seeks directly on the blob file +Each `BlobFile` implements the file protocol, so it can be passed straight to a decoder like PyAV without first being copied through a `bytes` object. The decoder seeks and reads against the underlying handle, which means a 2-second sample from a 30-second clip moves only the bytes the decoder actually touches — not the whole MP4. 
```python import av -selected_index = 123 -blob_file = ds.take_blobs("video_blob", ids=[selected_index])[0] - -with av.open(blob_file) as container: +with av.open(blob_files[0]) as container: stream = container.streams.video[0] - for seconds in (0.0, 1.0, 2.5): - target_pts = int(seconds / stream.time_base) - container.seek(target_pts, stream=stream) - - frame = None - for candidate in container.decode(stream): - if candidate.time is None: - continue - frame = candidate - if frame.time >= seconds: - break - - print( - f"Seek {seconds:.1f}s -> {frame.width}x{frame.height} " - f"(pts={frame.pts}, time={frame.time:.2f}s)" + target = int(seconds / stream.time_base) + container.seek(target, stream=stream) + frame = next( + (f for f in container.decode(stream) if f.time is not None and f.time >= seconds), + None, ) + if frame is not None: + print(f" seek {seconds:.1f}s -> {frame.width}x{frame.height} @ {frame.time:.2f}s") ``` -### 3.5. Inspecting Existing Indices - -You can inspect the prebuilt indices on the dataset: +If you only need the raw bytes (e.g., to persist a hand-picked subset to disk), call `.read()` on each handle. The lazy semantics are the same; `read()` simply materializes the full blob for that one row. ```python -import lance +for r_id, blob in zip(row_ids, blob_files): + with open(f"clip_{r_id}.mp4", "wb") as f: + f.write(blob.read()) +``` -# Open the dataset -dataset = lance.dataset("hf://datasets/lance-format/openvid-lance/data/train.lance") +## Evolve -# List all indices -indices = dataset.list_indices() -print(indices) -``` +Lance stores each column independently, so a new column can be appended without rewriting the existing data — including the video blobs, which stay exactly where they are. The lightest form is a SQL expression: derive the new column from columns that already exist, and Lance computes it once and persists it. 
The example below adds a `duration_bucket` column and an `is_high_quality` flag, either of which can then be used directly in `where` clauses without re-evaluating the predicate on every query. -While this dataset comes with pre-built indices, you can also create your own custom indices if needed. For example: +> **Note:** Mutations require a local copy of the dataset, since the Hub mount is read-only. See the Materialize-a-subset section at the end of this card for a streaming pattern that downloads only the rows and columns you need, or use `hf download` to pull the full corpus first. ```python -# ds is a local Lance dataset -ds.create_index( - "embedding", - index_type="IVF_PQ", - num_partitions=256, - num_sub_vectors=96, - replace=True, -) +import lancedb + +db = lancedb.connect("./openvid/data") # local copy required for writes +tbl = db.open_table("train") + +tbl.add_columns({ + "duration_bucket": ( + "CASE WHEN seconds < 5 THEN 'short' " + "WHEN seconds < 15 THEN 'medium' ELSE 'long' END" + ), + "is_high_quality": ( + "aesthetic_score >= 4.5 AND temporal_consistency_score >= 0.9" + ), +}) ``` -### 4.
Vector Similarity Search +If the values you want to attach already live in another table (offline labels, safety classifications, a second embedding from a different encoder), merge them in by joining on `video_path`: ```python import pyarrow as pa -# Find similar videos -ref_video = ds.take([0], columns=["embedding"]).to_pylist()[0] -query_vector = pa.array([ref_video['embedding']], type=pa.list_(pa.float32(), 1024)) - -results = ds.scanner( - nearest={ - "column": "embedding", - "q": query_vector[0], - "k": 5, - "nprobes": 1, - "refine_factor": 1 - } -).to_table().to_pylist() - -for video in results[1:]: # Skip first (query itself) - print(video['caption']) +labels = pa.table({ + "video_path": pa.array(["s3://openvid/clips/00001.mp4", "s3://openvid/clips/00002.mp4"]), + "scene_label": pa.array(["beach", "city"]), +}) +tbl.merge(labels, on="video_path") ``` -### LanceDB Vector Similarity Search +The original columns and the `video_blob` side file are untouched, so existing code that does not reference the new columns continues to work unchanged. New columns become visible to every reader as soon as the operation commits. For column values that require a Python computation (e.g., running an alternative video encoder over the inline bytes), Lance provides a batch-UDF API — see the [Lance data evolution docs](https://lance.org/guide/data_evolution/). + +## Train + +A common pattern for video training is to pre-extract decoded frames once into a derived LanceDB table, and train against that table with the regular projection-based dataloader. `take_blobs` is the mechanism that makes the extraction step tractable: each clip's MP4 is randomly addressable, so the pass can subset bytes on demand and write decoded windows into a fresh table without an external file store. 
Other workflows project `video_blob` directly through `select_columns(...)` and decode at the batch boundary, or skip pixels entirely and train on the cached embeddings — the right shape is workload-specific. The actual training loop is the same `Permutation.identity(tbl).select_columns(...)` snippet in every case; only the source table and the column list change. + +Against a pre-extracted frames table: ```python import lancedb +from lancedb.permutation import Permutation +from torch.utils.data import DataLoader -db = lancedb.connect("hf://datasets/lance-format/openvid-lance/data") +db = lancedb.connect("./openvid-frames") # local table produced by the one-time extraction tbl = db.open_table("train") -# Get a video to use as a query -ref_video = tbl.limit(1).select(["embedding", "caption"]).to_pandas().to_dict('records')[0] -query_embedding = ref_video["embedding"] - -results = tbl.search(query_embedding) \ - .metric("L2") \ - .nprobes(1) \ - .limit(5) \ - .to_list() - -for video in results[1:]: # Skip first (query itself) - print(f"{video['caption'][:60]}...") +train_ds = Permutation.identity(tbl).select_columns(["frames", "caption", "aesthetic_score"]) +loader = DataLoader(train_ds, batch_size=8, shuffle=True, num_workers=4) ``` -### 5. 
Full-Text Search +Against the cached embeddings on the source table (no pre-extraction): ```python -# Search captions using FTS index -results = ds.scanner( - full_text_query="sunset beach", - columns=["caption", "aesthetic_score"], - limit=10, - fast_search=True -).to_table().to_pylist() - -for video in results: - print(f"{video['caption']} - {video['aesthetic_score']:.2f}") +import lancedb +from lancedb.permutation import Permutation +from torch.utils.data import DataLoader + +src_db = lancedb.connect("hf://datasets/lance-format/openvid-lance/data") +src_tbl = src_db.open_table("train") + +train_ds = Permutation.identity(src_tbl).select_columns(["embedding", "caption"]) +loader = DataLoader(train_ds, batch_size=256, shuffle=True, num_workers=4) ``` -### LanceDB Full-Text Search +The inline `video_blob` storage and `take_blobs` still earn their place outside of the training loop — random-access inspection of a clip in a notebook, sampling for human review, one-off evaluation against a held-out set, and the pre-extraction step itself — but they are not the dataloader. + +## Versioning + +Every mutation to a Lance dataset, whether it adds a column, merges labels, or builds an index, commits a new version. Previous versions remain intact on disk, with the same blob handles still valid. You can list versions and inspect the history directly from the Hub copy; creating new tags requires a local copy since tags are writes. ```python import lancedb @@ -297,77 +293,54 @@ import lancedb db = lancedb.connect("hf://datasets/lance-format/openvid-lance/data") tbl = db.open_table("train") -results = tbl.search("sunset beach") \ - .select(["caption", "aesthetic_score"]) \ - .limit(10) \ - .to_list() - -for video in results: - print(f"{video['caption']} - {video['aesthetic_score']:.2f}") +print("Current version:", tbl.version) +print("History:", tbl.list_versions()) +print("Tags:", tbl.tags.list()) ``` -### 6. 
Filter by Quality +Once you have a local copy, tag a version for reproducibility: ```python -# Get high-quality videos -high_quality = ds.scanner( - filter="aesthetic_score >= 4.5 AND motion_score >= 0.3", - columns=["caption", "aesthetic_score", "camera_motion"], - limit=20 -).to_table().to_pylist() +local_db = lancedb.connect("./openvid/data") +local_tbl = local_db.open_table("train") +local_tbl.tags.create("quality-v1", local_tbl.version) ``` -## Dataset Evolution - -Lance supports flexible schema and data evolution ([docs](https://lance.org/guide/data_evolution/?h=evol)). You can add/drop columns, backfill with SQL or Python, rename fields, or change data types without rewriting the whole dataset. In practice this lets you: -- Introduce fresh metadata (moderation labels, embeddings, quality scores) as new signals become available. -- Add new columns to existing datasets without re-exporting terabytes of video. -- Adjust column names or shrink storage (e.g., cast embeddings to float16) while keeping previous snapshots queryable for reproducibility. +A tagged version can be opened by name, or any version reopened by its number, against either the Hub copy or a local one: ```python -import lance -import pyarrow as pa -import numpy as np +tbl_v1 = db.open_table("train", version="quality-v1") +tbl_v5 = db.open_table("train", version=5) +``` -base = pa.table({"id": pa.array([1, 2, 3])}) -dataset = lance.write_dataset(base, "openvid_evolution", mode="overwrite") +Pinning supports two workflows. A retrieval system locked to `quality-v1` keeps returning stable results while the dataset evolves in parallel — newly added columns or labels do not change what the tag resolves to. A training experiment pinned to the same tag can be rerun later against the exact same clips, so changes in metrics reflect model changes rather than data drift. Neither workflow needs shadow copies or external manifest tracking. -# 1. 
Grow the schema instantly (metadata-only) -dataset.add_columns(pa.field("quality_bucket", pa.string())) +## Materialize a subset -# 2. Backfill with SQL expressions or constants -dataset.add_columns({"status": "'active'"}) +Reads from the Hub are lazy, so exploratory queries only transfer the columns and row groups they touch. Mutating operations (Evolve, tag creation) need a writable backing store, and a training loop benefits from a local copy with fast random access into the blob file. Both can be served by a subset of the dataset rather than the full corpus. The pattern is to stream a filtered query through `.to_batches()` into a new local table; only the projected columns and matching row groups cross the wire, and the bytes never fully materialize in Python memory — including the `video_blob` column, which streams through Arrow record batches rather than being assembled in a single buffer. -# 3. Generate rich columns via Python batch UDFs -@lance.batch_udf() -def random_embedding(batch): - arr = np.random.rand(batch.num_rows, 128).astype("float32") - return pa.RecordBatch.from_arrays( - [pa.FixedSizeListArray.from_arrays(arr.ravel(), 128)], - names=["embedding"], - ) +```python +import lancedb -dataset.add_columns(random_embedding) +remote_db = lancedb.connect("hf://datasets/lance-format/openvid-lance/data") +remote_tbl = remote_db.open_table("train") -# 4. Bring in offline annotations with merge -labels = pa.table({ - "id": pa.array([1, 2, 3]), - "label": pa.array(["horse", "rabbit", "cat"]), -}) -dataset.merge(labels, "id") +batches = ( + remote_tbl.search() + .where("aesthetic_score >= 4.5 AND motion_score >= 0.3") + .select(["caption", "embedding", "video_blob", "aesthetic_score", "camera_motion"]) + .to_batches() +) -# 5. 
Rename or cast columns as needs change -dataset.alter_columns({"path": "quality_bucket", "name": "quality_tier"}) -dataset.alter_columns({"path": "embedding", "data_type": pa.list_(pa.float16(), 128)}) +local_db = lancedb.connect("./openvid-subset") +local_db.create_table("train", batches) ``` -These operations are automatically versioned, so prior experiments can still point to earlier versions while OpenVid keeps evolving. - - +The resulting `./openvid-subset` is a first-class LanceDB database. Every snippet in the Evolve, Train, and Versioning sections above works against it by swapping `hf://datasets/lance-format/openvid-lance/data` for `./openvid-subset`. The same `take_blobs` pattern from Curate also works against the local copy — and runs faster, because the blob side file is now on local disk. ## Citation -```bibtex +``` @article{nan2024openvid, title={OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation}, author={Nan, Kepan and Xie, Rui and Zhou, Penghao and Fan, Tiehan and Yang, Zhenheng and Chen, Zhijie and Li, Xiang and Yang, Jian and Tai, Ying}, @@ -376,7 +349,6 @@ These operations are automatically versioned, so prior experiments can still poi } ``` - ## License -Please check the original OpenVid dataset license for usage terms. +Content inherits the original OpenVid-1M dataset license. Review the [upstream dataset card](https://huggingface.co/datasets/nkp37/OpenVid-1M) before downstream use. diff --git a/docs/datasets/oxford-pets.mdx b/docs/datasets/oxford-pets.mdx index e79c226..d973a27 100644 --- a/docs/datasets/oxford-pets.mdx +++ b/docs/datasets/oxford-pets.mdx @@ -1,7 +1,7 @@ --- title: "Oxford-IIIT Pet" sidebarTitle: "Oxford-IIIT Pet" -description: "Lance-formatted version of the Oxford-IIIT Pet dataset — 7,390 cat & dog photos across 37 breeds — sourced from pcuenq/oxford-pets." 
+description: "A Lance-formatted version of the Oxford-IIIT Pet dataset — 7,390 cat and dog photos across 37 breeds — sourced from pcuenq/oxford-pets. Each row carries the inline JPEG bytes, the breed name, a species flag distinguishing cats from dogs, and a…" --- -Lance-formatted version of the [Oxford-IIIT Pet dataset](https://www.robots.ox.ac.uk/~vgg/data/pets/) — 7,390 cat & dog photos across 37 breeds — sourced from [`pcuenq/oxford-pets`](https://huggingface.co/datasets/pcuenq/oxford-pets). +A Lance-formatted version of the [Oxford-IIIT Pet](https://www.robots.ox.ac.uk/~vgg/data/pets/) dataset — 7,390 cat and dog photos across 37 breeds — sourced from [`pcuenq/oxford-pets`](https://huggingface.co/datasets/pcuenq/oxford-pets). Each row carries the inline JPEG bytes, the breed name, a species flag distinguishing cats from dogs, and a cosine-normalized CLIP image embedding, all available directly from the Hub at `hf://datasets/lance-format/oxford-pets-lance/data`. + +## Key features + +- **Inline JPEG bytes** in the `image` column — no sidecar files, no image folders. +- **Pre-computed CLIP image embeddings** (`image_emb`, OpenCLIP `ViT-B-32`, 512-dim, cosine-normalized) with a bundled `IVF_PQ` index for similarity search. +- **Both breed and species labels** (`label_name`, `is_dog`) so a query can target a specific breed, all dogs, or all cats by stacking simple predicates. +- **Bitmap indices on both label columns** make species- and breed-based curation a cheap predicate rather than a full scan. + +## Splits + +| Split | Rows | Notes | +|-------|------|-------| +| `train.lance` | 7,390 | The `pcuenq/oxford-pets` source mirror ships a single split; the canonical Oxford-IIIT trainval/test partition is not pre-applied here. 
| ## Schema | Column | Type | Notes | |---|---|---| -| `id` | `int64` | Row index | -| `image` | `large_binary` | Inline JPEG bytes | +| `id` | `int64` | Row index within split (natural join key for merges) | +| `image` | `large_binary` | Inline JPEG bytes (quality 92) | | `label_name` | `string` | One of 37 breeds, underscore-spaced (`british_shorthair`, `golden_retriever`, …) | | `is_dog` | `bool` | `true` for dog breeds, `false` for cat breeds | -| `path` | `string?` | Original filename in the source dataset | -| `image_emb` | `fixed_size_list` | OpenCLIP `ViT-B-32` embedding (cosine-normalized) | +| `path` | `string?` | Original filename from the source dataset | +| `image_emb` | `fixed_size_list` | OpenCLIP `ViT-B-32` image embedding (cosine-normalized) | ## Pre-built indices -- `IVF_PQ` on `image_emb` — `metric=cosine` -- `BITMAP` on `label_name` and `is_dog` +- `IVF_PQ` on `image_emb` — vector similarity search (cosine) +- `BITMAP` on `label_name` — fast lookup by breed +- `BITMAP` on `is_dog` — fast species filter + +## Why Lance? -## Quick start +1. **Blazing Fast Random Access**: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation. +2. **Native Multimodal Support**: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search. +3. **Native Index Support**: Lance comes with fast, on-disk, scalable vector and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them. +4. **Efficient Data Evolution**: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time. +5. 
**Versatile Querying**: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes. +6. **Data Versioning**: Every mutation commits a new version; previous versions remain intact on disk. Tags pin a snapshot by name, so retrieval systems and training runs can reproduce against an exact slice of history. + +## Load with `datasets.load_dataset` + +You can load Lance datasets via the standard HuggingFace `datasets` interface, suitable when your pipeline already speaks `Dataset` / `IterableDataset` or you want a quick streaming sample without installing anything Lance-specific. ```python -import lance -ds = lance.dataset("hf://datasets/lance-format/oxford-pets-lance/data/train.lance") -print(ds.count_rows(), ds.schema.names, ds.list_indices()) +import datasets + +hf_ds = datasets.load_dataset("lance-format/oxford-pets-lance", split="train", streaming=True) +for row in hf_ds.take(3): + print(row["label_name"], row["is_dog"]) ``` ## Load with LanceDB -These tables can also be consumed by [LanceDB](https://lancedb.github.io/lancedb/), the multimodal lakehouse and embedded search library built on top of Lance, for simplified vector search and other queries. +LanceDB is the embedded retrieval library built on top of the Lance format ([docs](https://lancedb.com/docs)), and is the interface most users interact with. It wraps the dataset as a queryable table with search and filter builders, and is the entry point used by the Search, Curate, Evolve, Train, Versioning, and Materialize-a-subset sections below. ```python import lancedb db = lancedb.connect("hf://datasets/lance-format/oxford-pets-lance/data") tbl = db.open_table("train") -print(f"LanceDB table opened with {len(tbl)} images") +print(len(tbl)) ``` -## Filter — only dogs, only golden retrievers, etc. +## Load with Lance + +`pylance` is the Python binding for the Lance format and works directly with the format's lower-level APIs. 
Reach for it when you want to inspect or operate on dataset internals — schema, scanner, fragments, and the list of pre-built indices. ```python import lance + ds = lance.dataset("hf://datasets/lance-format/oxford-pets-lance/data/train.lance") -dogs = ds.scanner(filter="is_dog = true", columns=["label_name"], limit=5).to_table() -goldens = ds.scanner(filter="label_name = 'golden_retriever'", columns=["id"], limit=5).to_table() +print(ds.count_rows(), ds.schema.names) +print(ds.list_indices()) +``` + +> **Tip — for production use, download locally first.** Streaming from the Hub works for exploration, but heavy random access and ANN search are far faster against a local copy: +> ```bash +> hf download lance-format/oxford-pets-lance --repo-type dataset --local-dir ./oxford-pets-lance +> ``` +> Then point Lance or LanceDB at `./oxford-pets-lance/data`. + +## Search + +The bundled `IVF_PQ` index on `image_emb` makes approximate-nearest-neighbor search a single call. In production you would encode a query photo through the same OpenCLIP `ViT-B-32` model used at ingest and pass the resulting 512-d vector to `tbl.search(...)`. The example below uses the embedding stored in row 0 as a runnable stand-in so the snippet works without a model loaded; on a clean run the first hit is expected to be the seed image itself, which is a useful sanity check on the index. 
+ +```python +import lancedb + +db = lancedb.connect("hf://datasets/lance-format/oxford-pets-lance/data") +tbl = db.open_table("train") + +seed = ( + tbl.search() + .select(["image_emb", "label_name", "is_dog"]) + .limit(1) + .to_list()[0] +) + +hits = ( + tbl.search(seed["image_emb"]) + .metric("cosine") + .select(["id", "label_name", "is_dog"]) + .limit(10) + .to_list() +) +print(f"seed: {seed['label_name']} (is_dog={seed['is_dog']})") +for r in hits: + print(f" {r['id']:>5} {r['label_name']:<22} is_dog={r['is_dog']}") ``` -### Filter with LanceDB +Tune `metric`, `nprobes`, and `refine_factor` to trade recall against latency for your workload. + +## Curate + +A typical curation pass for a fine-grained pet classifier stacks the species predicate and a breed predicate inside a single filtered scan. With bitmap indices on both `label_name` and `is_dog`, the result comes back in milliseconds, and the bounded `.limit(200)` keeps it small enough to inspect or hand off to a training run. ```python import lancedb db = lancedb.connect("hf://datasets/lance-format/oxford-pets-lance/data") tbl = db.open_table("train") -dogs = tbl.search().where("is_dog = true").select(["label_name"]).limit(5).to_list() -goldens = tbl.search().where("label_name = 'golden_retriever'").select(["id"]).limit(5).to_list() + +candidates = ( + tbl.search() + .where("is_dog = true AND label_name IN ('golden_retriever', 'beagle', 'pug')") + .select(["id", "label_name", "is_dog", "path"]) + .limit(200) + .to_list() +) +print(f"{len(candidates)} candidates; first: {candidates[0]['label_name']}") ``` -## Visual similarity search +The result is a plain list of dictionaries, ready to inspect, persist as a manifest of `id`s, or feed into the Evolve and Train workflows below. The `image` and `image_emb` columns are never read by this query, so the network traffic is dominated by the small label fields rather than JPEG bytes or vectors. 
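Persisting a curated result as a manifest of `id`s can be sketched with only the standard library. This is a minimal illustration under stated assumptions: the `candidates` list is a hand-written stand-in shaped like the query output above, and the `write_manifest` and `read_manifest` helpers, the file name, and the sample paths are hypothetical, not part of LanceDB.

```python
import json
import tempfile
from pathlib import Path


def write_manifest(rows, path):
    """Write the sorted, de-duplicated ids of `rows` to a JSON file and return them."""
    ids = sorted({row["id"] for row in rows})
    Path(path).write_text(json.dumps({"ids": ids}, indent=2))
    return ids


def read_manifest(path):
    """Load the id list back from a manifest written by write_manifest."""
    return json.loads(Path(path).read_text())["ids"]


# Stand-in rows with the same keys the curation query selects (values are made up).
candidates = [
    {"id": 42, "label_name": "beagle", "is_dog": True, "path": "beagle_1.jpg"},
    {"id": 7, "label_name": "pug", "is_dog": True, "path": "pug_3.jpg"},
    {"id": 42, "label_name": "beagle", "is_dog": True, "path": "beagle_1.jpg"},
]

manifest_path = Path(tempfile.mkdtemp()) / "curated_ids.json"
saved_ids = write_manifest(candidates, manifest_path)
loaded_ids = read_manifest(manifest_path)

# The ids can be turned back into a SQL-style filter string for a later query.
where_clause = f"id IN ({', '.join(str(i) for i in loaded_ids)})"
print(where_clause)
```

Storing only the ids keeps the manifest small and independent of the image bytes; the rebuilt filter string has the same form as the predicates passed to `.where(...)` elsewhere in this card.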
+ +## Evolve + +Lance stores each column independently, so a new column can be appended without rewriting the existing data. The lightest form is a SQL expression: derive the new column from columns that already exist, and Lance computes it once and persists it. The example below derives a `species` string from the `is_dog` boolean and adds a coarse breed-group flag for terriers, either of which can then be used directly in `where` clauses without recomputing the predicate on every query. + +> **Note:** Mutations require a local copy of the dataset, since the Hub mount is read-only. See the Materialize-a-subset section at the end of this card for a streaming pattern that downloads only the rows and columns you need, or use `hf download` to pull the full split first. ```python -import lance, pyarrow as pa -ds = lance.dataset("hf://datasets/lance-format/oxford-pets-lance/data/train.lance") -emb_field = ds.schema.field("image_emb") -ref = ds.take([0], columns=["image_emb", "label_name"]).to_pylist()[0] -neighbors = ds.scanner( - nearest={"column": "image_emb", "q": pa.array([ref["image_emb"]], type=emb_field.type)[0], "k": 5}, - columns=["id", "label_name"], -).to_table().to_pylist() +import lancedb + +db = lancedb.connect("./oxford-pets-lance/data") # local copy required for writes +tbl = db.open_table("train") + +tbl.add_columns({ + "species": "CASE WHEN is_dog THEN 'dog' ELSE 'cat' END", + "is_terrier": "label_name LIKE '%terrier%'", +}) +``` + +If the values you want to attach already live in another table (offline labels, classifier predictions, an integer class id), merge them in by joining on `id`: + +```python +import pyarrow as pa + +class_ids = pa.table({ + "id": pa.array([0, 1, 2]), + "label_int": pa.array([0, 0, 17]), +}) +tbl.merge(class_ids, on="id") ``` -### LanceDB visual similarity search +The original columns and indices are untouched, so existing code that does not reference the new columns continues to work unchanged. 
New columns become visible to every reader as soon as the operation commits. For column values that require a Python computation (e.g., running a second embedding model over the JPEG bytes), Lance provides a batch-UDF API in the underlying library — see the [Lance data evolution docs](https://lance.org/guide/data_evolution/) for that pattern. + +## Train + +Projection lets a training loop read only the columns each step actually needs. LanceDB tables expose this through `Permutation.identity(tbl).select_columns([...])`, which plugs straight into the standard `torch.utils.data.DataLoader` so prefetch, shuffling, and batching behave as in any PyTorch pipeline. For a from-scratch breed classifier, project the JPEG bytes and the string breed label; for a linear probe on top of frozen CLIP features, swap the projection to the embedding column and skip JPEG decoding entirely. ```python import lancedb +from lancedb.permutation import Permutation +from torch.utils.data import DataLoader db = lancedb.connect("hf://datasets/lance-format/oxford-pets-lance/data") tbl = db.open_table("train") -ref = tbl.search().limit(1).select(["image_emb", "label_name"]).to_list()[0] -query_embedding = ref["image_emb"] +train_ds = Permutation.identity(tbl).select_columns(["image", "label_name"]) +loader = DataLoader(train_ds, batch_size=64, shuffle=True, num_workers=4) -results = ( - tbl.search(query_embedding) - .metric("cosine") - .select(["id", "label_name"]) - .limit(5) - .to_list() +for batch in loader: + # batch carries only the projected columns; image_emb stays on disk. + # decode the JPEG bytes, map label_name -> int via a class list, forward, cross-entropy... + ... +``` + +Switching feature sets is a configuration change: passing `["image_emb", "label_name"]` to `select_columns(...)` on the next run reads only the cached 512-d vectors and the label, which is the right shape for a linear probe or a lightweight reranker. 
Projecting `["image", "is_dog"]` reduces the task to binary species classification on the same data. + +## Versioning + +Every mutation to a Lance dataset, whether it adds a column, merges labels, or builds an index, commits a new version. Previous versions remain intact on disk. You can list versions and inspect the history directly from the Hub copy; creating new tags requires a local copy since tags are writes. + +```python +import lancedb + +db = lancedb.connect("hf://datasets/lance-format/oxford-pets-lance/data") +tbl = db.open_table("train") + +print("Current version:", tbl.version) +print("History:", tbl.list_versions()) +print("Tags:", tbl.tags.list()) +``` + +Once you have a local copy, tag a version for reproducibility: + +```python +local_db = lancedb.connect("./oxford-pets-lance/data") +local_tbl = local_db.open_table("train") +local_tbl.tags.create("clip-vitb32-v1", local_tbl.version) +``` + +A tagged version can be opened by name, or any version reopened by its number, against either the Hub copy or a local one: + +```python +tbl_v1 = db.open_table("train", version="clip-vitb32-v1") +tbl_v5 = db.open_table("train", version=5) +``` + +Pinning supports two workflows. A retrieval system locked to `clip-vitb32-v1` keeps returning stable results while the dataset evolves in parallel; newly added columns or labels do not change what the tag resolves to. A training experiment pinned to the same tag can be rerun later against the exact same images and labels, so changes in metrics reflect model changes rather than data drift. Neither workflow needs shadow copies or external manifest tracking. + +## Materialize a subset + +Reads from the Hub are lazy, so exploratory queries only transfer the columns and row groups they touch. Mutating operations (Evolve, tag creation) need a writable backing store, and a training loop benefits from a local copy with fast random access. Both can be served by a subset of the dataset rather than the full split. 
The pattern is to stream a filtered query through `.to_batches()` into a new local table; only the projected columns and matching row groups cross the wire, and the bytes never fully materialize in Python memory. + +```python +import lancedb + +remote_db = lancedb.connect("hf://datasets/lance-format/oxford-pets-lance/data") +remote_tbl = remote_db.open_table("train") + +batches = ( + remote_tbl.search() + .where("is_dog = true") + .select(["id", "image", "label_name", "is_dog", "image_emb"]) + .to_batches() ) + +local_db = lancedb.connect("./oxford-pets-dogs-subset") +local_db.create_table("train", batches) ``` +The resulting `./oxford-pets-dogs-subset` is a first-class LanceDB database. Every snippet in the Evolve, Train, and Versioning sections above works against it by swapping `hf://datasets/lance-format/oxford-pets-lance/data` for `./oxford-pets-dogs-subset`. + ## Source & license Converted from [`pcuenq/oxford-pets`](https://huggingface.co/datasets/pcuenq/oxford-pets). Released under [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/). diff --git a/docs/datasets/pascal-voc-2012-segmentation.mdx b/docs/datasets/pascal-voc-2012-segmentation.mdx index eb90784..d389491 100644 --- a/docs/datasets/pascal-voc-2012-segmentation.mdx +++ b/docs/datasets/pascal-voc-2012-segmentation.mdx @@ -1,7 +1,7 @@ --- title: "Pascal VOC 2012 Segmentation" sidebarTitle: "Pascal VOC 2012 Segmentation" -description: "A Lance-formatted version of the Pascal VOC 2012 semantic segmentation split (sourced from nateraw/pascal-voc-2012) — 2,913 image / mask pairs with CLIP image embeddings stored inline and a pre-built IVF_PQ ANN index." +description: "A Lance-formatted version of the Pascal VOC 2012 semantic segmentation split, sourced from nateraw/pascal-voc-2012. 
Each row pairs an inline JPEG image with the per-pixel PNG segmentation mask and a cosine-normalized OpenCLIP ViT-B-32 image…" --- -A Lance-formatted version of the [Pascal VOC 2012 semantic segmentation split](http://host.robots.ox.ac.uk/pascal/VOC/voc2012/) (sourced from [`nateraw/pascal-voc-2012`](https://huggingface.co/datasets/nateraw/pascal-voc-2012)) — **2,913 image / mask pairs** with CLIP image embeddings stored inline and a pre-built `IVF_PQ` ANN index. +A Lance-formatted version of the [Pascal VOC 2012 semantic segmentation split](http://host.robots.ox.ac.uk/pascal/VOC/voc2012/), sourced from [`nateraw/pascal-voc-2012`](https://huggingface.co/datasets/nateraw/pascal-voc-2012). Each row pairs an inline JPEG image with the per-pixel PNG segmentation mask and a cosine-normalized OpenCLIP `ViT-B-32` image embedding, so a single columnar table carries both annotation modalities and the features needed to retrieve, curate, and train against them — all available directly from the Hub at `hf://datasets/lance-format/pascal-voc-2012-segmentation-lance/data`. -## Why segmentation? +## Key features -VOC 2012 ships several tasks (classification, detection, segmentation, action). We focus on the **semantic segmentation** subset because every row carries a paired mask image and the dataset is small enough to convert quickly with full embeddings — useful as a smoke test or a small benchmark. +- **Inline JPEG bytes and inline PNG mask bytes in the same row** — image and per-pixel segmentation travel together with no sidecar folders or mask lookups. +- **Pre-computed CLIP image embeddings** (`image_emb`, 512-dim, cosine-normalized) with a bundled `IVF_PQ` index for visual similarity search. +- **Standard VOC class encoding** — mask pixel values are class ids in `0..20` plus `255` for void, identical to the official VOC palette. +- **One columnar dataset** — scan image-level metadata cheaply, then fetch image or mask bytes only for the rows you actually want. 
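The class encoding described above can be sketched in plain Python. This is an illustration, not the card's own tooling: it assumes the PNG mask has already been decoded into a flat sequence of integer pixel values (the decoding step, for example with an image library, is not shown), and the `summarize_mask` helper and the toy pixel list are hypothetical. The class order follows the standard VOC list.

```python
from collections import Counter

# Index 0 is background; ids 1..20 follow the standard VOC class order.
VOC_CLASSES = [
    "background", "aeroplane", "bicycle", "bird", "boat", "bottle", "bus",
    "car", "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike",
    "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor",
]
VOID_ID = 255  # boundary / "don't care" pixels, usually ignored in the loss


def summarize_mask(pixels):
    """Count pixels per class name, reporting void (255) pixels separately."""
    counts = Counter(pixels)
    void = counts.pop(VOID_ID, 0)
    unknown = [v for v in counts if not 0 <= v < len(VOC_CLASSES)]
    if unknown:
        raise ValueError(f"unexpected mask values: {sorted(unknown)}")
    per_class = {VOC_CLASSES[v]: n for v, n in sorted(counts.items())}
    return per_class, void


# Toy mask: three background pixels, two cat, one dog, two void.
toy_pixels = [0, 0, 0, 8, 8, 12, 255, 255]
per_class, void = summarize_mask(toy_pixels)
print(per_class, void)
```

Keeping void pixels out of the per-class counts mirrors the usual practice of excluding value `255` when computing segmentation losses or metrics.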
+ +The 20 Pascal VOC foreground classes are: `aeroplane`, `bicycle`, `bird`, `boat`, `bottle`, `bus`, `car`, `cat`, `chair`, `cow`, `diningtable`, `dog`, `horse`, `motorbike`, `person`, `pottedplant`, `sheep`, `sofa`, `train`, `tvmonitor`. ## Splits -| Split | Rows | -|-------|------| -| `train.lance` | 1,464 | -| `validation.lance` | 1,449 | +| Split | Rows | Notes | +|-------|------|-------| +| `train.lance` | 1,464 | Official VOC 2012 segmentation train | +| `validation.lance` | 1,449 | Official VOC 2012 segmentation val | ## Schema | Column | Type | Notes | |---|---|---| -| `id` | `int64` | Row index within the split | +| `id` | `int64` | Row index within the split (natural join key for merges) | | `image` | `large_binary` | Inline JPEG bytes | | `mask` | `large_binary` | Inline PNG bytes — class id per pixel (0=background, 1-20=VOC classes, 255=void) | | `image_emb` | `fixed_size_list` | OpenCLIP `ViT-B-32` image embedding (cosine-normalized) | -The 20 Pascal VOC classes are: `aeroplane`, `bicycle`, `bird`, `boat`, `bottle`, `bus`, `car`, `cat`, `chair`, `cow`, `diningtable`, `dog`, `horse`, `motorbike`, `person`, `pottedplant`, `sheep`, `sofa`, `train`, `tvmonitor`. - ## Pre-built indices - `IVF_PQ` on `image_emb` — `metric=cosine` -> Note: the small dataset size (≤1,464 rows per split) is below Lance's -> default partition count, so the helper falls back to a smaller -> `num_partitions` automatically. For higher recall, build the index with -> `num_partitions=16` against a local copy. +> Note: the small split sizes (≤1,464 rows) sit below Lance's default partition count, so the helper falls back to a smaller `num_partitions` automatically. For higher recall, rebuild the index with `num_partitions=16` against a local copy. + +## Why Lance? + +1. **Blazing Fast Random Access**: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation. +2. 
**Native Multimodal Support**: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search. +3. **Native Index Support**: Lance comes with fast, on-disk, scalable vector and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them. +4. **Efficient Data Evolution**: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time. +5. **Versatile Querying**: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes. +6. **Data Versioning**: Every mutation commits a new version; previous versions remain intact on disk. Tags pin a snapshot by name, so retrieval systems and training runs can reproduce against an exact slice of history. + +## Load with `datasets.load_dataset` -## Quick start +You can load Lance datasets via the standard HuggingFace `datasets` interface, suitable when your pipeline already speaks `Dataset` / `IterableDataset` or you want a quick streaming sample. ```python -import lance +import datasets -ds = lance.dataset("hf://datasets/lance-format/pascal-voc-2012-segmentation-lance/data/train.lance") -print(ds.count_rows(), ds.schema.names, ds.list_indices()) +hf_ds = datasets.load_dataset("lance-format/pascal-voc-2012-segmentation-lance", split="train", streaming=True) +for row in hf_ds.take(3): + print(row["id"], len(row["image"]), len(row["mask"])) ``` ## Load with LanceDB -These tables can also be consumed by [LanceDB](https://lancedb.github.io/lancedb/), the multimodal lakehouse and embedded search library built on top of Lance, for simplified vector search and other queries. 
+LanceDB is the embedded retrieval library built on top of the Lance format ([docs](https://lancedb.com/docs)), and is the interface most users interact with. It wraps the dataset as a queryable table with search and filter builders, and is the entry point used by the Search, Curate, Evolve, Versioning, and Materialize-a-subset sections below. ```python import lancedb db = lancedb.connect("hf://datasets/lance-format/pascal-voc-2012-segmentation-lance/data") tbl = db.open_table("train") -print(f"LanceDB table opened with {len(tbl)} image-mask pairs") +print(len(tbl)) ``` -## Working with images and masks +## Load with Lance + +`pylance` is the Python binding for the Lance format and works directly with the format's lower-level APIs. Reach for it when you want to inspect or operate on dataset internals — schema, scanner, fragments, and the list of pre-built indices. ```python -from pathlib import Path import lance -from PIL import Image -import io ds = lance.dataset("hf://datasets/lance-format/pascal-voc-2012-segmentation-lance/data/train.lance") -row = ds.take([0], columns=["image", "mask"]).to_pylist()[0] -Path("img.jpg").write_bytes(row["image"]) -Path("mask.png").write_bytes(row["mask"]) - -import numpy as np -mask = np.array(Image.open(io.BytesIO(row["mask"]))) -print("classes present:", np.unique(mask).tolist()) +print(ds.count_rows(), ds.schema.names) +print(ds.list_indices()) ``` -## Vector search example +> **Tip — for production use, download locally first.** Streaming from the Hub works for exploration, but heavy random access and ANN search are far faster against a local copy: +> ```bash +> hf download lance-format/pascal-voc-2012-segmentation-lance --repo-type dataset --local-dir ./pascal-voc-2012-segmentation-lance +> ``` +> Then point Lance or LanceDB at `./pascal-voc-2012-segmentation-lance/data`. + +## Search + +The bundled `IVF_PQ` index on `image_emb` makes visual nearest-neighbour search a single call. 
In production you would encode a query image (or a class prototype) through OpenCLIP `ViT-B-32` at runtime and pass the resulting 512-d cosine-normalized vector to `tbl.search(...)`. The example below uses the embedding from row 42 as a runnable stand-in. ```python -import lance -import pyarrow as pa +import lancedb -ds = lance.dataset("hf://datasets/lance-format/pascal-voc-2012-segmentation-lance/data/train.lance") -emb_field = ds.schema.field("image_emb") -ref = ds.take([0], columns=["image_emb"]).to_pylist()[0]["image_emb"] -query = pa.array([ref], type=emb_field.type) - -neighbors = ds.scanner( - nearest={"column": "image_emb", "q": query[0], "k": 5}, - columns=["id"], -).to_table().to_pylist() +db = lancedb.connect("hf://datasets/lance-format/pascal-voc-2012-segmentation-lance/data") +tbl = db.open_table("train") + +seed = ( + tbl.search() + .select(["image_emb"]) + .limit(1) + .offset(42) + .to_list()[0] +) + +hits = ( + tbl.search(seed["image_emb"]) + .metric("cosine") + .select(["id"]) + .limit(10) + .to_list() +) +for r in hits: + print(r["id"]) ``` -### LanceDB vector search +Because the embeddings are cosine-normalized, `metric="cosine"` is the natural choice and the first hit is typically the seed row itself — a useful sanity check before tuning `nprobes` and `refine_factor` for recall. + +## Curate + +A typical curation pass for a segmentation workflow combines visual similarity with a structural filter on the row. Stacking both inside a single filtered scan keeps the candidate set small and explicit, and the bounded `.limit(200)` makes it cheap to inspect before committing to anything downstream. The snippet below seeds from row 42 and restricts the candidates to rows whose mask payload is non-trivially sized — a cheap proxy for masks that actually carry foreground annotation. 
```python import lancedb @@ -109,23 +137,136 @@ import lancedb db = lancedb.connect("hf://datasets/lance-format/pascal-voc-2012-segmentation-lance/data") tbl = db.open_table("train") -ref = tbl.search().limit(1).select(["image_emb"]).to_list()[0] -query_embedding = ref["image_emb"] +seed = ( + tbl.search() + .select(["image_emb"]) + .limit(1) + .offset(42) + .to_list()[0] +) -results = ( - tbl.search(query_embedding) +candidates = ( + tbl.search(seed["image_emb"]) .metric("cosine") + .where("octet_length(mask) > 2000", prefilter=True) .select(["id"]) - .limit(5) + .limit(200) .to_list() ) +print(f"{len(candidates)} candidates") ``` -## Why Lance? +The result is a plain list of dictionaries, ready to inspect, persist as a manifest of `id`s, or feed into the Evolve and Train workflows below. The `image` and `mask` columns are never read in the candidate scan, so the network traffic stays dominated by the embedding vectors rather than image or mask bytes. + +## Evolve + +Lance stores each column independently, so a new column can be appended without rewriting the existing data. The lightest form is a SQL expression: derive the new column from columns that already exist, and Lance computes it once and persists it. The example below adds an `image_bytes` size and a `has_mask` flag, both of which can then be used directly in `where` clauses without recomputing the predicate on every query. + +> **Note:** Mutations require a local copy of the dataset, since the Hub mount is read-only. See the Materialize-a-subset section at the end of this card for a streaming pattern that downloads only the rows and columns you need, or use `hf download` to pull the full split first. 
+ +```python +import lancedb + +db = lancedb.connect("./pascal-voc-2012-segmentation-lance/data") # local copy required for writes +tbl = db.open_table("train") + +tbl.add_columns({ + "image_bytes": "octet_length(image)", + "has_mask": "octet_length(mask) > 1024", +}) +``` + +For class-level statistics — for example, a per-row list of class ids present in the mask, or a per-class pixel count — the values cannot be derived in SQL because they require decoding the PNG. Compute them once in an external table and join in by `id`: + +```python +import pyarrow as pa + +class_stats = pa.table({ + "id": pa.array([0, 1, 2], type=pa.int64()), + "classes_present": pa.array([[15], [7, 15], [9]], type=pa.list_(pa.int8())), +}) +tbl.merge(class_stats, on="id") +``` + +The original columns and indices are untouched, so existing code that does not reference the new columns continues to work unchanged. New columns become visible to every reader as soon as the operation commits. For column values that require running a model over the image bytes (a second-pass embedding, an instance segmentation, a depth prediction), Lance also provides a batch-UDF API — see the [Lance data evolution docs](https://lance.org/guide/data_evolution/). + +## Train + +Projection lets a training loop read only the columns each step actually needs. LanceDB tables expose this through `Permutation.identity(tbl).select_columns([...])`, which plugs straight into the standard `torch.utils.data.DataLoader` so prefetching, shuffling, and batching behave as in any PyTorch pipeline. For a segmentation run, project the JPEG bytes and the PNG mask bytes together; everything else, including the CLIP embeddings, stays on disk until you opt in. 
+ +```python +import lancedb +from lancedb.permutation import Permutation +from torch.utils.data import DataLoader + +db = lancedb.connect("hf://datasets/lance-format/pascal-voc-2012-segmentation-lance/data") +tbl = db.open_table("train") + +train_ds = Permutation.identity(tbl).select_columns(["image", "mask"]) +loader = DataLoader(train_ds, batch_size=16, shuffle=True, num_workers=4) + +for batch in loader: + # batch carries only the projected columns; image_emb stays on disk. + # decode the JPEGs and PNGs, build (image, label) tensors, forward, backward... + ... +``` + +Switching feature sets is a configuration change: passing `["image_emb"]` to `select_columns(...)` on the next run skips JPEG and PNG decoding entirely and reads only the cached 512-d vectors, which is the right shape for training a lightweight linear probe over frozen CLIP features. + +## Versioning + +Every mutation to a Lance dataset, whether it adds a column, merges class statistics, or builds an index, commits a new version. Previous versions remain intact on disk. You can list versions and inspect the history directly from the Hub copy; creating new tags requires a local copy since tags are writes. 
+ +```python +import lancedb + +db = lancedb.connect("hf://datasets/lance-format/pascal-voc-2012-segmentation-lance/data") +tbl = db.open_table("train") + +print("Current version:", tbl.version) +print("History:", tbl.list_versions()) +print("Tags:", tbl.tags.list()) +``` + +Once you have a local copy, tag a version for reproducibility: + +```python +local_db = lancedb.connect("./pascal-voc-2012-segmentation-lance/data") +local_tbl = local_db.open_table("train") +local_tbl.tags.create("voc2012-clip-vitb32-v1", local_tbl.version) +``` + +A tagged version can be opened by name, or any version reopened by its number, against either the Hub copy or a local one: + +```python +tbl_v1 = db.open_table("train", version="voc2012-clip-vitb32-v1") +tbl_v5 = db.open_table("train", version=5) +``` + +Pinning supports two workflows. A retrieval system locked to `voc2012-clip-vitb32-v1` keeps returning stable results while the dataset evolves in parallel — newly added class statistics or alternative embeddings do not change what the tag resolves to. A training experiment pinned to the same tag can be rerun later against the exact same image/mask pairs, so changes in metrics reflect model changes rather than data drift. Neither workflow needs shadow copies or external manifest tracking. + +## Materialize a subset + +Reads from the Hub are lazy, so exploratory queries only transfer the columns and row groups they touch. Mutating operations (Evolve, tag creation) need a writable backing store, and a training loop benefits from a local copy with fast random access. Both can be served by a subset of the dataset rather than the full split. The pattern is to stream a filtered query through `.to_batches()` into a new local table; only the projected columns and matching row groups cross the wire, and the bytes never fully materialize in Python memory. 
+ +```python +import lancedb + +remote_db = lancedb.connect("hf://datasets/lance-format/pascal-voc-2012-segmentation-lance/data") +remote_tbl = remote_db.open_table("train") + +batches = ( + remote_tbl.search() + .where("octet_length(mask) > 2000") + .select(["id", "image", "mask", "image_emb"]) + .to_batches() +) + +local_db = lancedb.connect("./voc-subset") +local_db.create_table("train", batches) +``` -- One dataset carries images + masks + embeddings + indices — no sidecar files. -- On-disk vector and full-text indices live next to the data, so search works on local copies and on the Hub. -- Schema evolution: add columns (instance masks, alternate embeddings, model predictions) without rewriting the data. +The resulting `./voc-subset` is a first-class LanceDB database. Every snippet in the Evolve, Train, and Versioning sections above works against it by swapping `hf://datasets/lance-format/pascal-voc-2012-segmentation-lance/data` for `./voc-subset`. ## Source & license diff --git a/docs/datasets/squad-v2.mdx b/docs/datasets/squad-v2.mdx index 6e9ee6f..ec0267a 100644 --- a/docs/datasets/squad-v2.mdx +++ b/docs/datasets/squad-v2.mdx @@ -1,7 +1,7 @@ --- title: "SQuAD v2" sidebarTitle: "SQuAD v2" -description: "Lance-formatted version of SQuAD v2 — Stanford Question Answering Dataset, version 2 — with MiniLM sentence embeddings stored inline alongside the questions, contexts, and answers." +description: "A Lance-formatted version of SQuAD v2 — the Stanford Question Answering Dataset with both answerable and deliberately unanswerable questions over Wikipedia passages — with MiniLM question embeddings stored inline and ready for retrieval at…" --- -Lance-formatted version of [SQuAD v2](https://huggingface.co/datasets/rajpurkar/squad_v2) — Stanford Question Answering Dataset, version 2 — with **MiniLM sentence embeddings** stored inline alongside the questions, contexts, and answers. 
+A Lance-formatted version of [SQuAD v2](https://huggingface.co/datasets/rajpurkar/squad_v2) — the Stanford Question Answering Dataset with both answerable and deliberately unanswerable questions over Wikipedia passages — with MiniLM question embeddings stored inline and ready for retrieval at `hf://datasets/lance-format/squad-v2-lance/data`. -## Why this version? +## Key features -- **One self-contained Lance dataset** with 130k+ Wikipedia-grounded questions and reference answers. -- **Pre-computed text embeddings** (`sentence-transformers/all-MiniLM-L6-v2`, 384-dim, L2-normalized) on the question column with an `IVF_PQ` index — instant semantic question retrieval. -- **Full-text inverted indices** on both `question` and `context` for keyword search. -- **BITMAP** on `is_impossible` for fast filtering between answerable and unanswerable questions. +- **Span-extraction QA over Wikipedia** with 130k+ training questions and an `is_impossible` flag that cleanly separates answerable from unanswerable items. +- **Pre-computed 384-dim question embeddings** (`question_emb`, `sentence-transformers/all-MiniLM-L6-v2`, cosine-normalized) with a bundled `IVF_PQ` index for semantic question retrieval. +- **Full-text inverted indices** on both `question` and `context` for keyword search alongside dense retrieval. +- **One columnar dataset** carrying questions, contexts, answer spans, and embeddings together — project only the columns each query needs. 
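
The embedding bullet above describes the vectors as normalized for cosine search. Assuming this means each vector has unit L2 norm (the earlier wording of the card said "L2-normalized"), cosine similarity reduces to a plain dot product. The sketch below checks that on toy 2-d vectors; the helper names are illustrative only and the real embeddings are 384-dimensional.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine similarity computed the long way, with explicit norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])

# For unit vectors the two quantities agree up to floating-point error.
print(round(dot(a, b), 6), round(cosine_similarity(a, b), 6))
```

This is why a query vector should be normalized the same way as the stored embeddings before it is passed to a cosine search.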
## Splits @@ -32,7 +32,7 @@ Lance-formatted version of [SQuAD v2](https://huggingface.co/datasets/rajpurkar/ | Column | Type | Notes | |---|---|---| -| `id` | `string` | SQuAD question id | +| `id` | `string` | SQuAD question id (natural join key for merges) | | `title` | `string` | Wikipedia article title | | `context` | `string` | Paragraph the question was generated from | | `question` | `string` | The question text | @@ -43,85 +43,113 @@ Lance-formatted version of [SQuAD v2](https://huggingface.co/datasets/rajpurkar/ ## Pre-built indices -- `IVF_PQ` on `question_emb` — `metric=cosine` -- `INVERTED` on `question` and `context` -- `BTREE` on `id` and `title` -- `BITMAP` on `is_impossible` +- `IVF_PQ` on `question_emb` — `metric=cosine`, vector similarity search +- `INVERTED` on `question` and `context` — full-text search +- `BTREE` on `id` and `title` — point lookups and prefix scans +- `BITMAP` on `is_impossible` — fast filtering between answerable and unanswerable -## Quick start +## Why Lance? + +1. **Blazing Fast Random Access**: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation. +2. **Native Multimodal Support**: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search. +3. **Native Index Support**: Lance comes with fast, on-disk, scalable vector and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them. +4. **Efficient Data Evolution**: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time. +5. 
**Versatile Querying**: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes. +6. **Data Versioning**: Every mutation commits a new version; previous versions remain intact on disk. Tags pin a snapshot by name, so retrieval systems and training runs can reproduce against an exact slice of history. + +## Load with `datasets.load_dataset` + +You can load Lance datasets via the standard HuggingFace `datasets` interface, suitable when your pipeline already speaks `Dataset` / `IterableDataset` or you want a quick streaming sample. ```python -import lance +import datasets -ds = lance.dataset("hf://datasets/lance-format/squad-v2-lance/data/validation.lance") -print(ds.count_rows(), ds.schema.names, ds.list_indices()) +hf_ds = datasets.load_dataset("lance-format/squad-v2-lance", split="validation", streaming=True) +for row in hf_ds.take(3): + print(row["question"], "->", row["answers"]) ``` ## Load with LanceDB -These tables can also be consumed by [LanceDB](https://lancedb.github.io/lancedb/), the multimodal lakehouse and embedded search library built on top of Lance, for simplified vector search and other queries. +LanceDB is the embedded retrieval library built on top of the Lance format ([docs](https://lancedb.com/docs)), and is the interface most users interact with. It wraps the dataset as a queryable table with search and filter builders, and is the entry point used by the Search, Curate, Evolve, Versioning, and Materialize-a-subset sections below. ```python import lancedb db = lancedb.connect("hf://datasets/lance-format/squad-v2-lance/data") -tbl = db.open_table("validation") -print(f"LanceDB table opened with {len(tbl)} questions") +tbl = db.open_table("train") +print(len(tbl)) ``` -## Semantic question retrieval +## Load with Lance + +`pylance` is the Python binding for the Lance format and works directly with the format's lower-level APIs. 
Reach for it when you want to inspect dataset internals — schema, scanner, fragments, the list of pre-built indices. ```python import lance -import pyarrow as pa -from sentence_transformers import SentenceTransformer - -encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cuda") -q_vec = encoder.encode(["what year was the eiffel tower built?"], normalize_embeddings=True)[0] ds = lance.dataset("hf://datasets/lance-format/squad-v2-lance/data/train.lance") -emb_field = ds.schema.field("question_emb") -query = pa.array([q_vec.tolist()], type=emb_field.type) - -hits = ds.scanner( - nearest={"column": "question_emb", "q": query[0], "k": 10, "nprobes": 16, "refine_factor": 30}, - columns=["id", "title", "question", "answers"], -).to_table().to_pylist() +print(ds.count_rows(), ds.schema.names) +print(ds.list_indices()) ``` -### LanceDB semantic question retrieval +> **Tip — for production use, download locally first.** Streaming from the Hub works for exploration, but heavy random access and ANN search are far faster against a local copy: +> ```bash +> hf download lance-format/squad-v2-lance --repo-type dataset --local-dir ./squad-v2-lance +> ``` +> Then point Lance or LanceDB at `./squad-v2-lance/data`. + +## Search + +The bundled `IVF_PQ` index on `question_emb` turns semantic question retrieval into a single call. In production you would encode an incoming question through the same MiniLM encoder used at ingest and pass the resulting 384-dim vector to `tbl.search(...)`. The example below uses the embedding from row 42 as a runnable stand-in, then restricts the result to answerable items so the response always carries a usable span. 
```python import lancedb -from sentence_transformers import SentenceTransformer - -encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cuda") -q_vec = encoder.encode(["what year was the eiffel tower built?"], normalize_embeddings=True)[0] db = lancedb.connect("hf://datasets/lance-format/squad-v2-lance/data") tbl = db.open_table("train") -results = ( - tbl.search(q_vec.tolist(), vector_column_name="question_emb") +seed = ( + tbl.search() + .select(["question_emb", "question"]) + .limit(1) + .offset(42) + .to_list()[0] +) + +hits = ( + tbl.search(seed["question_emb"], vector_column_name="question_emb") .metric("cosine") + .where("is_impossible = false", prefilter=True) .select(["id", "title", "question", "answers"]) .limit(10) .to_list() ) +for r in hits: + print(f"{r['title']:30s} | {r['question'][:80]}") ``` -## Full-text search on contexts +Because the recommended setup also builds an `INVERTED` index on both `question` and `context`, the same query can be issued as a hybrid search that combines the dense vector with a keyword query. LanceDB merges and reranks the two result lists in a single call, which is useful when a literal phrase must appear in the passage but the dense side should still drive ranking. ```python -ds = lance.dataset("hf://datasets/lance-format/squad-v2-lance/data/train.lance") -hits = ds.scanner( - full_text_query="great pyramid of giza", - columns=["title", "question", "context"], - limit=5, -).to_table().to_pylist() +hybrid_hits = ( + tbl.search(query_type="hybrid") + .vector(seed["question_emb"]) + .text("eiffel tower") + .where("is_impossible = false", prefilter=True) + .select(["id", "title", "question", "context", "answers"]) + .limit(10) + .to_list() +) +for r in hybrid_hits: + print(f"{r['title']:30s} | {r['question'][:80]}") ``` -### LanceDB full-text search +Tune `metric`, `nprobes`, and `refine_factor` on the vector side to trade recall against latency. 
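
The hybrid query above returns one merged list built from a dense ranking and a keyword ranking. One common way to fuse two rankings is reciprocal rank fusion (RRF); the sketch below shows the idea on toy id lists. It is an illustration only: which reranker LanceDB actually applies depends on the library version and configuration, and the function name `rrf_merge`, the constant `k=60`, and the ids are assumptions made for this example.

```python
def rrf_merge(rankings, k=60):
    """Fuse several ranked id lists with reciprocal rank fusion.

    Each id scores 1 / (k + rank) per list it appears in (rank starts at 1);
    scores are summed across lists and ids are returned best first.
    Ties are broken by id so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result ids from the dense side and the keyword side.
dense_hits = ["q3", "q1", "q7"]
keyword_hits = ["q1", "q9", "q3"]

print(rrf_merge([dense_hits, keyword_hits]))
```

Ids that appear in both lists rise to the top, which matches the intuition that a hit supported by both the dense and the keyword side is the strongest candidate.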
+ +## Curate + +SQuAD v2 has a natural split between answerable and unanswerable questions, and the `is_impossible` boolean — backed by a `BITMAP` index — makes either subset cheap to extract. Stacking predicates inside a single filtered scan keeps the result small and explicit, and the bounded `.limit(1000)` makes it easy to inspect or hand off. ```python import lancedb @@ -129,46 +157,134 @@ import lancedb db = lancedb.connect("hf://datasets/lance-format/squad-v2-lance/data") tbl = db.open_table("train") -results = ( - tbl.search("great pyramid of giza") - .select(["title", "question", "context"]) - .limit(5) +impossible = ( + tbl.search() + .where("is_impossible = true AND length(question) >= 40", prefilter=True) + .select(["id", "title", "question", "context"]) + .limit(1000) .to_list() ) +print(f"{len(impossible)} hard unanswerable questions; first title: {impossible[0]['title']}") ``` -## Filter answerable vs impossible questions +The mirror query — long, well-grounded answerable questions — looks identical with the boolean flipped, and the `question_emb` vector is never read by either scan. The result is a plain list of dictionaries, ready to inspect, persist as a manifest of question ids, or hand to the Materialize-a-subset section below for export to a writable local copy. + +## Evolve + +Lance stores each column independently, so a new column can be appended without rewriting the existing data. The lightest form is a SQL expression: derive the new column from columns that already exist, and Lance computes it once and persists it. The example below adds a `question_length`, a `num_answers` count, and a `has_answer` flag — any of which can then be used directly in `where` clauses without recomputing the predicate on every query. + +> **Note:** Mutations require a local copy of the dataset, since the Hub mount is read-only. 
See the Materialize-a-subset section at the end of this card for a streaming pattern that downloads only the rows and columns you need, or use `hf download` to pull the full corpus. ```python -ds = lance.dataset("hf://datasets/lance-format/squad-v2-lance/data/validation.lance") -impossible = ds.scanner(filter="is_impossible = true", columns=["question"], limit=5).to_table() +import lancedb + +db = lancedb.connect("./squad-v2-lance/data") # local copy required for writes +tbl = db.open_table("train") + +tbl.add_columns({ + "question_length": "length(question)", + "num_answers": "array_length(answers)", + "has_answer": "NOT is_impossible", +}) ``` -### Filter with LanceDB +If the values you want to attach already live in another table (offline reader-model predictions, alternate embeddings, span-level labels), merge them in by joining on `id`: + +```python +import pyarrow as pa + +scores = pa.table({ + "id": pa.array(["56be4db0acb8001400a502ec", "56be4db0acb8001400a502ed"]), + "reader_score": pa.array([0.91, 0.42]), +}) +tbl.merge(scores, on="id") +``` + +The original columns and indices are untouched, so existing code that does not reference the new columns continues to work unchanged. New columns become visible to every reader as soon as the operation commits. For column values that require a Python computation (e.g., running a different embedding model over the questions), Lance provides a batch-UDF API — see the [Lance data evolution docs](https://lance.org/guide/data_evolution/). + +## Train + +Projection lets a training loop read only the columns each step actually needs. LanceDB tables expose this through `Permutation.identity(tbl).select_columns([...])`, which plugs straight into the standard `torch.utils.data.DataLoader` so prefetching, shuffling, and batching behave as in any PyTorch pipeline. 
For a reading-comprehension model the natural projection is the question, the context, and the answer spans together; for a retriever or reranker on top of frozen features, project the precomputed embedding instead. ```python import lancedb +from lancedb.permutation import Permutation +from torch.utils.data import DataLoader db = lancedb.connect("hf://datasets/lance-format/squad-v2-lance/data") -tbl = db.open_table("validation") -impossible = ( - tbl.search() - .where("is_impossible = true") - .select(["question"]) - .limit(5) - .to_list() +tbl = db.open_table("train") + +train_ds = Permutation.identity(tbl).select_columns( + ["question", "context", "answers", "answer_starts", "is_impossible"] ) +loader = DataLoader(train_ds, batch_size=32, shuffle=True, num_workers=4) + +for batch in loader: + # batch carries only the projected columns; question_emb stays on disk. + # tokenize question+context, build span labels from answer_starts, forward, backward... + ... ``` -## Why Lance? +Switching feature sets is a configuration change: passing `["question_emb"]` (optionally with `["answers"]` for hard-negative mining) to `select_columns(...)` on the next run reads only the 384-d vectors and skips the bulky `context` strings entirely, which is the right shape for training a retrieval head or reranker on cached embeddings. Columns added in Evolve cost nothing per batch until they are explicitly projected. + +## Versioning + +Every mutation to a Lance dataset, whether it adds a column, merges labels, or builds an index, commits a new version. Previous versions remain intact on disk. You can list versions and inspect the history directly from the Hub copy; creating new tags requires a local copy since tags are writes. 
+ +```python +import lancedb + +db = lancedb.connect("hf://datasets/lance-format/squad-v2-lance/data") +tbl = db.open_table("train") + +print("Current version:", tbl.version) +print("History:", tbl.list_versions()) +print("Tags:", tbl.tags.list()) +``` + +Once you have a local copy, tag a version for reproducibility: + +```python +local_db = lancedb.connect("./squad-v2-lance/data") +local_tbl = local_db.open_table("train") +local_tbl.tags.create("baseline-v1", local_tbl.version) +``` + +A tagged version can be opened by name, or any version reopened by its number, against either the Hub copy or a local one: + +```python +tbl_v1 = db.open_table("train", version="baseline-v1") +tbl_v5 = db.open_table("train", version=5) +``` + +Pinning supports two workflows. A retrieval system locked to `baseline-v1` keeps returning stable results while the dataset evolves in parallel — newly added reader scores or labels do not change what the tag resolves to. A training experiment pinned to the same tag can be rerun later against the exact same questions and contexts, so changes in metrics reflect model changes rather than data drift. Neither workflow needs shadow copies or external manifest tracking. + +## Materialize a subset + +Reads from the Hub are lazy, so exploratory queries only transfer the columns and row groups they touch. Mutating operations (Evolve, tag creation) need a writable backing store, and a training loop benefits from a local copy with fast random access. Both can be served by a subset of the dataset rather than the full corpus. The pattern is to stream a filtered query through `.to_batches()` into a new local table; only the projected columns and matching row groups cross the wire, and the bytes never fully materialize in Python memory. 
+ +```python +import lancedb + +remote_db = lancedb.connect("hf://datasets/lance-format/squad-v2-lance/data") +remote_tbl = remote_db.open_table("train") + +batches = ( + remote_tbl.search() + .where("is_impossible = false AND length(question) >= 30") + .select(["id", "title", "context", "question", "answers", "answer_starts", "question_emb"]) + .to_batches() +) + +local_db = lancedb.connect("./squad-v2-answerable") +local_db.create_table("train", batches) +``` -- One dataset carries questions + contexts + answers + embeddings + indices — no sidecar files. -- On-disk vector and full-text indices live next to the data, so search works on local copies and on the Hub. -- Schema evolution: add columns (alternate embeddings, model predictions, task labels) without rewriting the data. +The resulting `./squad-v2-answerable` is a first-class LanceDB database. Every snippet in the Search, Evolve, Train, and Versioning sections above works against it by swapping `hf://datasets/lance-format/squad-v2-lance/data` for `./squad-v2-answerable`. ## Source & license -Converted from [`rajpurkar/squad_v2`](https://huggingface.co/datasets/rajpurkar/squad_v2). SQuAD v2 is released under [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/). +Converted from [`rajpurkar/squad_v2`](https://huggingface.co/datasets/rajpurkar/squad_v2). SQuAD v2 is distributed under [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/). 
## Citation @@ -177,6 +293,6 @@ Converted from [`rajpurkar/squad_v2`](https://huggingface.co/datasets/rajpurkar/ title={Know What You Don't Know: Unanswerable Questions for SQuAD}, author={Rajpurkar, Pranav and Jia, Robin and Liang, Percy}, journal={Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Short Papers)}, - year={2018}, + year={2018} } ``` diff --git a/docs/datasets/stanford-cars.mdx b/docs/datasets/stanford-cars.mdx index 7701095..67e7497 100644 --- a/docs/datasets/stanford-cars.mdx +++ b/docs/datasets/stanford-cars.mdx @@ -1,7 +1,7 @@ --- title: "Stanford Cars" sidebarTitle: "Stanford Cars" -description: "Lance-formatted version of the Stanford Cars dataset — 8,144 training images across 196 fine-grained car make/model/year classes — sourced from Multimodal-Fatima/StanfordCars_train." +description: "A Lance-formatted version of the Stanford Cars fine-grained benchmark — 8,144 photographs across 196 make/model/year classes — sourced from Multimodal-Fatima/StanfordCars_train. Each row carries the inline JPEG bytes, the integer class id, a…" --- -Lance-formatted version of the [Stanford Cars dataset](https://web.archive.org/web/20210212183835/http://ai.stanford.edu/~jkrause/cars/car_dataset.html) — 8,144 training images across 196 fine-grained car make/model/year classes — sourced from [`Multimodal-Fatima/StanfordCars_train`](https://huggingface.co/datasets/Multimodal-Fatima/StanfordCars_train). +A Lance-formatted version of the [Stanford Cars](https://web.archive.org/web/20210212183835/http://ai.stanford.edu/~jkrause/cars/car_dataset.html) fine-grained benchmark — 8,144 photographs across 196 make/model/year classes — sourced from [`Multimodal-Fatima/StanfordCars_train`](https://huggingface.co/datasets/Multimodal-Fatima/StanfordCars_train). 
Each row carries the inline JPEG bytes, the integer class id, a BLIP-generated caption inherited from the source mirror, and a cosine-normalized CLIP image embedding, all available directly from the Hub at `hf://datasets/lance-format/stanford-cars-lance/data`. + +## Key features + +- **Inline JPEG bytes** in the `image` column — no sidecar files, no image folders. +- **Pre-computed CLIP image embeddings** (`image_emb`, OpenCLIP `ViT-B-32`, 512-dim, cosine-normalized) with a bundled `IVF_PQ` index for similarity search. +- **BLIP captions in `blip_caption`** with a full-text index, so keyword search on visual descriptions composes with vector search in a single query. +- **A bundled scalar index on `label`** makes class-based curation a cheap predicate rather than a full scan. + +## Splits + +| Split | Rows | Notes | +|-------|------|-------| +| `train.lance` | 8,144 | The source mirror redistributes a single split; the original Stanford Cars test split is not included here. | ## Schema | Column | Type | Notes | |---|---|---| -| `id` | `int64` | Row index | -| `image` | `large_binary` | Inline JPEG bytes | -| `label` | `int32` | Class id (0-195) | +| `id` | `int64` | Row index within split (natural join key for merges) | +| `image` | `large_binary` | Inline JPEG bytes (quality 92) | +| `label` | `int32` | Class id (0–195), one per Make Model Year combination | | `blip_caption` | `string?` | BLIP-generated caption (beam=5) carried through from the source mirror | -| `image_emb` | `fixed_size_list` | OpenCLIP `ViT-B-32` embedding (cosine-normalized) | +| `image_emb` | `fixed_size_list` | OpenCLIP `ViT-B-32` image embedding (cosine-normalized) | ## Pre-built indices -- `IVF_PQ` on `image_emb` — `metric=cosine` -- `INVERTED` (FTS) on `blip_caption` -- `BTREE` on `label` +- `IVF_PQ` on `image_emb` — vector similarity search (cosine) +- `INVERTED` (FTS) on `blip_caption` — keyword and hybrid search +- `BTREE` on `label` — fast lookup by class id + +## Why Lance? + +1. 
**Blazing Fast Random Access**: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation. +2. **Native Multimodal Support**: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search. +3. **Native Index Support**: Lance comes with fast, on-disk, scalable vector and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them. +4. **Efficient Data Evolution**: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time. +5. **Versatile Querying**: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes. +6. **Data Versioning**: Every mutation commits a new version; previous versions remain intact on disk. Tags pin a snapshot by name, so retrieval systems and training runs can reproduce against an exact slice of history. -## Quick start +## Load with `datasets.load_dataset` + +You can load Lance datasets via the standard HuggingFace `datasets` interface, suitable when your pipeline already speaks `Dataset` / `IterableDataset` or you want a quick streaming sample without installing anything Lance-specific. 
```python -import lance -ds = lance.dataset("hf://datasets/lance-format/stanford-cars-lance/data/train.lance") -print(ds.count_rows(), ds.schema.names, ds.list_indices()) +import datasets + +hf_ds = datasets.load_dataset("lance-format/stanford-cars-lance", split="train", streaming=True) +for row in hf_ds.take(3): + print(row["label"], row["blip_caption"]) ``` ## Load with LanceDB -These tables can also be consumed by [LanceDB](https://lancedb.github.io/lancedb/), the multimodal lakehouse and embedded search library built on top of Lance, for simplified vector search and other queries. +LanceDB is the embedded retrieval library built on top of the Lance format ([docs](https://lancedb.com/docs)), and is the interface most users interact with. It wraps the dataset as a queryable table with search and filter builders, and is the entry point used by the Search, Curate, Evolve, Train, Versioning, and Materialize-a-subset sections below. ```python import lancedb db = lancedb.connect("hf://datasets/lance-format/stanford-cars-lance/data") tbl = db.open_table("train") -print(f"LanceDB table opened with {len(tbl)} car images") +print(len(tbl)) ``` -## Caption-based filtering +## Load with Lance + +`pylance` is the Python binding for the Lance format and works directly with the format's lower-level APIs. Reach for it when you want to inspect or operate on dataset internals — schema, scanner, fragments, and the list of pre-built indices. 
```python import lance + ds = lance.dataset("hf://datasets/lance-format/stanford-cars-lance/data/train.lance") -hits = ds.scanner(full_text_query="red sports car", columns=["id", "blip_caption"], limit=10).to_table() +print(ds.count_rows(), ds.schema.names) +print(ds.list_indices()) ``` -### LanceDB full-text search +> **Tip — for production use, download locally first.** Streaming from the Hub works for exploration, but heavy random access and ANN search are far faster against a local copy: +> ```bash +> hf download lance-format/stanford-cars-lance --repo-type dataset --local-dir ./stanford-cars-lance +> ``` +> Then point Lance or LanceDB at `./stanford-cars-lance/data`. + +## Search + +The bundled `IVF_PQ` index on `image_emb` makes approximate-nearest-neighbor search a single call. In production you would encode a query photo through the same OpenCLIP `ViT-B-32` model used at ingest and pass the resulting 512-d vector to `tbl.search(...)`. The example below uses the embedding stored in row 0 as a runnable stand-in so the snippet works without any model loaded. ```python import lancedb @@ -66,28 +104,45 @@ import lancedb db = lancedb.connect("hf://datasets/lance-format/stanford-cars-lance/data") tbl = db.open_table("train") -results = ( - tbl.search("red sports car") - .select(["id", "blip_caption"]) +seed = ( + tbl.search() + .select(["image_emb", "blip_caption"]) + .limit(1) + .to_list()[0] +) + +hits = ( + tbl.search(seed["image_emb"]) + .metric("cosine") + .select(["id", "label", "blip_caption"]) .limit(10) .to_list() ) +print("seed caption:", seed["blip_caption"]) +for r in hits: + print(f" {r['id']:>6} label={r['label']:>3} {r['blip_caption'][:60]}") ``` -## Visual similarity search +Tune `metric`, `nprobes`, and `refine_factor` to trade recall against latency for your workload. + +Because the dataset also ships an `INVERTED` index on `blip_caption`, the same query can be issued as a hybrid search that combines the dense vector with a keyword query. 
LanceDB merges the two result lists and reranks them in a single call, which is useful when a phrase like "red convertible" must literally appear in the caption but you still want CLIP to do the heavy lifting on visual similarity. ```python -import lance, pyarrow as pa -ds = lance.dataset("hf://datasets/lance-format/stanford-cars-lance/data/train.lance") -emb_field = ds.schema.field("image_emb") -ref = ds.take([0], columns=["image_emb", "blip_caption"]).to_pylist()[0] -neighbors = ds.scanner( - nearest={"column": "image_emb", "q": pa.array([ref["image_emb"]], type=emb_field.type)[0], "k": 5}, - columns=["id", "blip_caption"], -).to_table().to_pylist() +hybrid_hits = ( + tbl.search(query_type="hybrid") + .vector(seed["image_emb"]) + .text("red convertible") + .select(["id", "label", "blip_caption"]) + .limit(10) + .to_list() +) +for r in hybrid_hits: + print(f" {r['id']:>6} label={r['label']:>3} {r['blip_caption'][:60]}") ``` -### LanceDB visual similarity search +## Curate + +A typical curation pass for a fine-grained classifier combines a class-based filter with a content filter on the caption. Stacking both inside a single filtered scan keeps the result small and explicit, and the bounded `.limit(200)` makes it cheap to inspect before committing the subset to anything downstream. The `BTREE` on `label` and the `INVERTED` index on `blip_caption` make both predicates effectively free. 
```python import lancedb @@ -95,21 +150,134 @@ import lancedb db = lancedb.connect("hf://datasets/lance-format/stanford-cars-lance/data") tbl = db.open_table("train") -ref = tbl.search().limit(1).select(["image_emb", "blip_caption"]).to_list()[0] -query_embedding = ref["image_emb"] - -results = ( - tbl.search(query_embedding) - .metric("cosine") - .select(["id", "blip_caption"]) - .limit(5) +candidates = ( + tbl.search("convertible OR coupe") + .where("label IN (12, 47, 89)", prefilter=True) + .select(["id", "label", "blip_caption"]) + .limit(200) .to_list() ) +print(f"{len(candidates)} candidates; first: {candidates[0]['blip_caption'][:80]}") +``` + +The result is a plain list of dictionaries, ready to inspect, persist as a manifest of `id`s, or feed into the Evolve and Train workflows below. The `image` column is never read, so the network traffic for a 200-row candidate scan is dominated by the small caption strings rather than JPEG bytes. + +## Evolve + +Lance stores each column independently, so a new column can be appended without rewriting the existing data. The lightest form is a SQL expression: derive the new column from columns that already exist, and Lance computes it once and persists it. Stanford Cars class strings often encode the model year as a trailing four-digit token in the caption; the example below uses a SQL regex to lift that year into its own column (`caption_year`), and adds an `is_long_caption` flag for captions of 80 characters or more. Either can then be used directly in `where` clauses without recomputing the predicate on every query. + +> **Note:** Mutations require a local copy of the dataset, since the Hub mount is read-only. See the Materialize-a-subset section at the end of this card for a streaming pattern that downloads only the rows and columns you need, or use `hf download` to pull the full split first.
+ +```python +import lancedb + +db = lancedb.connect("./stanford-cars-lance/data") # local copy required for writes +tbl = db.open_table("train") + +tbl.add_columns({ + "caption_year": "CAST(regexp_extract(blip_caption, '(\\d{4})', 1) AS INTEGER)", + "is_long_caption": "length(blip_caption) >= 80", +}) +``` + +If the values you want to attach already live in another table (offline labels, classifier predictions, the Make Model Year strings from the original Stanford metadata), merge them in by joining on `id`: + +```python +import pyarrow as pa + +class_strings = pa.table({ + "id": pa.array([0, 1, 2]), + "class_name": pa.array([ + "AM General Hummer SUV 2000", + "Acura RL Sedan 2012", + "Acura TL Sedan 2012", + ]), +}) +tbl.merge(class_strings, on="id") +``` + +The original columns and indices are untouched, so existing code that does not reference the new columns continues to work unchanged. New columns become visible to every reader as soon as the operation commits. For column values that require a Python computation (e.g., running a second captioner over the image bytes), Lance provides a batch-UDF API in the underlying library — see the [Lance data evolution docs](https://lance.org/guide/data_evolution/) for that pattern. + +## Train + +Projection lets a training loop read only the columns each step actually needs. LanceDB tables expose this through `Permutation.identity(tbl).select_columns([...])`, which plugs straight into the standard `torch.utils.data.DataLoader` so prefetch, shuffling, and batching behave as in any PyTorch pipeline. For a from-scratch fine-grained classifier, project the JPEG bytes and the integer label; for a linear probe on top of frozen CLIP features, swap the projection to the embedding column and skip JPEG decoding entirely. 
+ +```python +import lancedb +from lancedb.permutation import Permutation +from torch.utils.data import DataLoader + +db = lancedb.connect("hf://datasets/lance-format/stanford-cars-lance/data") +tbl = db.open_table("train") + +train_ds = Permutation.identity(tbl).select_columns(["image", "label"]) +loader = DataLoader(train_ds, batch_size=64, shuffle=True, num_workers=4) + +for batch in loader: + # batch carries only the projected columns; image_emb and blip_caption stay on disk. + # decode the JPEG bytes, forward, cross-entropy against `label`... + ... +``` + +Switching feature sets is a configuration change: passing `["image_emb", "label"]` to `select_columns(...)` on the next run reads only the cached 512-d vectors and the label, which is the right shape for a linear probe or a lightweight reranker. + +## Versioning + +Every mutation to a Lance dataset, whether it adds a column, merges labels, or builds an index, commits a new version. Previous versions remain intact on disk. You can list versions and inspect the history directly from the Hub copy; creating new tags requires a local copy since tags are writes. + +```python +import lancedb + +db = lancedb.connect("hf://datasets/lance-format/stanford-cars-lance/data") +tbl = db.open_table("train") + +print("Current version:", tbl.version) +print("History:", tbl.list_versions()) +print("Tags:", tbl.tags.list()) +``` + +Once you have a local copy, tag a version for reproducibility: + +```python +local_db = lancedb.connect("./stanford-cars-lance/data") +local_tbl = local_db.open_table("train") +local_tbl.tags.create("clip-vitb32-v1", local_tbl.version) +``` + +A tagged version can be opened by name, or any version reopened by its number, against either the Hub copy or a local one: + +```python +tbl_v1 = db.open_table("train", version="clip-vitb32-v1") +tbl_v5 = db.open_table("train", version=5) +``` + +Pinning supports two workflows. 
A retrieval system locked to `clip-vitb32-v1` keeps returning stable results while the dataset evolves in parallel; newly added columns or captions do not change what the tag resolves to. A training experiment pinned to the same tag can be rerun later against the exact same images and labels, so changes in metrics reflect model changes rather than data drift. Neither workflow needs shadow copies or external manifest tracking. + +## Materialize a subset + +Reads from the Hub are lazy, so exploratory queries only transfer the columns and row groups they touch. Mutating operations (Evolve, tag creation) need a writable backing store, and a training loop benefits from a local copy with fast random access. Both can be served by a subset of the dataset rather than the full split. The pattern is to stream a filtered query through `.to_batches()` into a new local table; only the projected columns and matching row groups cross the wire, and the bytes never fully materialize in Python memory. + +```python +import lancedb + +remote_db = lancedb.connect("hf://datasets/lance-format/stanford-cars-lance/data") +remote_tbl = remote_db.open_table("train") + +batches = ( + remote_tbl.search("convertible OR coupe") + .select(["id", "image", "label", "blip_caption", "image_emb"]) + .to_batches() +) + +local_db = lancedb.connect("./stanford-cars-sports-subset") +local_db.create_table("train", batches) ``` +The resulting `./stanford-cars-sports-subset` is a first-class LanceDB database. Every snippet in the Evolve, Train, and Versioning sections above works against it by swapping `hf://datasets/lance-format/stanford-cars-lance/data` for `./stanford-cars-sports-subset`. + ## Source & license -Converted from [`Multimodal-Fatima/StanfordCars_train`](https://huggingface.co/datasets/Multimodal-Fatima/StanfordCars_train), itself a parquet redistribution of the Stanford Cars test split. 
The original dataset license is for non-commercial research use; review the [Stanford Cars terms](https://github.com/jhoffman/stanford-cars) before redistribution. +Converted from [`Multimodal-Fatima/StanfordCars_train`](https://huggingface.co/datasets/Multimodal-Fatima/StanfordCars_train), itself a redistribution of the Stanford Cars dataset. The original dataset license is for non-commercial research use; review the [Stanford Cars terms](https://github.com/jhoffman/stanford-cars) before redistribution. ## Citation diff --git a/docs/datasets/textvqa.mdx b/docs/datasets/textvqa.mdx index f3f1e0c..42d91d0 100644 --- a/docs/datasets/textvqa.mdx +++ b/docs/datasets/textvqa.mdx @@ -1,7 +1,7 @@ --- title: "TextVQA" sidebarTitle: "TextVQA" -description: "Lance-formatted version of TextVQA — VQA where the question requires reading text in the image — sourced from lmms-lab/textvqa." +description: "A Lance-formatted version of TextVQA — visual question answering where the question requires reading text in the image (street signs, product labels, screen captures) — sourced from lmms-lab/textvqa. Each row carries the image bytes, the question…" --- -Lance-formatted version of [TextVQA](https://textvqa.org/) — VQA where the question requires *reading* text in the image — sourced from [`lmms-lab/textvqa`](https://huggingface.co/datasets/lmms-lab/textvqa). +A Lance-formatted version of [TextVQA](https://textvqa.org/) — visual question answering where the question requires *reading* text in the image (street signs, product labels, screen captures) — sourced from [`lmms-lab/textvqa`](https://huggingface.co/datasets/lmms-lab/textvqa). Each row carries the image bytes, the question, the 10 reference annotator answers, the OCR tokens detected by the source pre-processing, OpenImages-style scene tags, and paired CLIP image and question embeddings — all available directly from the Hub at `hf://datasets/lance-format/textvqa-lance/data`. 
-Each row carries the image bytes, the question, the 10 reference answers, the OCR tokens detected by the dataset's pre-processing, and CLIP image + question embeddings. +## Key features + +- **Inline JPEG bytes** in the `image` column — no sidecar files or image folders. +- **Paired CLIP embeddings in the same row** — `image_emb` and `question_emb` (OpenCLIP ViT-B/32, 512-dim, cosine-normalized) — so cross-modal text→image retrieval is one indexed lookup. +- **OCR tokens travel with the row** in `ocr_tokens`, which makes OCR-aware filtering and reranking a single SQL predicate alongside the visual and textual features. +- **Pre-built ANN, FTS, and scalar indices** covering both embeddings, the question and canonical answer, and the source partition. ## Splits | Split | Rows | |-------|------| | `validation.lance` | 5,000 | -| `train.lance` | 34,602 | +| `train.lance` | 34,602 | ## Schema @@ -33,7 +38,7 @@ Each row carries the image bytes, the question, the 10 reference answers, the OC | `question_id` | `string?` | TextVQA question id | | `question` | `string` | The question text | | `answers` | `list` | 10 annotator answers | -| `answer` | `string` | First answer — used as canonical / FTS target | +| `answer` | `string` | First annotator answer (used as canonical / FTS target) | | `ocr_tokens` | `list` | OCR tokens detected on the image | | `image_classes` | `list` | OpenImages-style scene tags from the source | | `set_name` | `string?` | Source partition (`train`, `val`) | @@ -42,75 +47,114 @@ Each row carries the image bytes, the question, the 10 reference answers, the OC ## Pre-built indices -- `IVF_PQ` on `image_emb` and `question_emb` — `metric=cosine` -- `INVERTED` (FTS) on `question` and `answer` -- `BTREE` on `image_id`, `question_id`, `set_name` +- `IVF_PQ` on `image_emb` — image-side vector search (cosine) +- `IVF_PQ` on `question_emb` — question-side vector search (cosine) +- `INVERTED` (FTS) on `question` and `answer` — keyword and hybrid search +- 
`BTREE` on `image_id`, `question_id`, `set_name` — fast lookup by id and partition + +## Why Lance? + +1. **Blazing Fast Random Access**: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation. +2. **Native Multimodal Support**: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search. +3. **Native Index Support**: Lance comes with fast, on-disk, scalable vector and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them. +4. **Efficient Data Evolution**: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time. +5. **Versatile Querying**: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes. +6. **Data Versioning**: Every mutation commits a new version; previous versions remain intact on disk. Tags pin a snapshot by name, so retrieval systems and training runs can reproduce against an exact slice of history. + +## Load with `datasets.load_dataset` -## Quick start +You can load Lance datasets via the standard HuggingFace `datasets` interface, suitable when your pipeline already speaks `Dataset` / `IterableDataset` or you want a quick streaming sample. 
```python -import lance -ds = lance.dataset("hf://datasets/lance-format/textvqa-lance/data/validation.lance") -print(ds.count_rows(), ds.schema.names, ds.list_indices()) +import datasets + +hf_ds = datasets.load_dataset("lance-format/textvqa-lance", split="validation", streaming=True) +for row in hf_ds.take(3): + print(row["question"], "->", row["answer"]) ``` ## Load with LanceDB -These tables can also be consumed by [LanceDB](https://lancedb.github.io/lancedb/), the multimodal lakehouse and embedded search library built on top of Lance, for simplified vector search and other queries. +LanceDB is the embedded retrieval library built on top of the Lance format ([docs](https://lancedb.com/docs)), and is the interface most users interact with. It wraps the dataset as a queryable table with search and filter builders, and is the entry point used by the Search, Curate, Evolve, Train, Versioning, and Materialize-a-subset sections below. ```python import lancedb db = lancedb.connect("hf://datasets/lance-format/textvqa-lance/data") tbl = db.open_table("validation") -print(f"LanceDB table opened with {len(tbl)} image-question pairs") +print(len(tbl)) ``` -## Cross-modal text→image search +## Load with Lance -```python -import lance, pyarrow as pa, open_clip, torch +`pylance` is the Python binding for the Lance format and works directly with the format's lower-level APIs. Reach for it when you want to inspect dataset internals: schema, scanner, fragments, and the list of pre-built indices.
-model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="laion2b_s34b_b79k") -tokenizer = open_clip.get_tokenizer("ViT-B-32") -model = model.eval().cuda().half() -with torch.no_grad(): - q = model.encode_text(tokenizer(["what brand is on this billboard?"]).cuda()) - q = (q / q.norm(dim=-1, keepdim=True)).float().cpu().numpy()[0] +```python +import lance ds = lance.dataset("hf://datasets/lance-format/textvqa-lance/data/validation.lance") -emb_field = ds.schema.field("image_emb") -hits = ds.scanner( - nearest={"column": "image_emb", "q": pa.array([q.tolist()], type=emb_field.type)[0], "k": 10}, - columns=["question", "answer", "ocr_tokens"], -).to_table().to_pylist() +print(ds.count_rows(), ds.schema.names) +print(ds.list_indices()) ``` -### LanceDB cross-modal text→image search +> **Tip — for production use, download locally first.** Streaming from the Hub works for exploration, but heavy random access and ANN search are far faster against a local copy: +> ```bash +> hf download lance-format/textvqa-lance --repo-type dataset --local-dir ./textvqa-lance +> ``` +> Then point Lance or LanceDB at `./textvqa-lance/data`. -```python -import lancedb, open_clip, torch +## Search -model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="laion2b_s34b_b79k") -tokenizer = open_clip.get_tokenizer("ViT-B-32") -model = model.eval().cuda().half() -with torch.no_grad(): - q = model.encode_text(tokenizer(["what brand is on this billboard?"]).cuda()) - q = (q / q.norm(dim=-1, keepdim=True)).float().cpu().numpy()[0] +The bundled `IVF_PQ` index on `image_emb` makes cross-modal text→image retrieval a single call: encode a question with the same OpenCLIP model used at ingest (ViT-B/32 `laion2b_s34b_b79k`, cosine-normalized), then pass the resulting 512-d vector to `tbl.search(...)` and target `image_emb`. 
The example below uses the `question_emb` already stored in row 42 as a runnable stand-in for "the CLIP encoding of a question", so the snippet works without any model loaded. + +```python +import lancedb db = lancedb.connect("hf://datasets/lance-format/textvqa-lance/data") tbl = db.open_table("validation") -results = ( - tbl.search(q.tolist(), vector_column_name="image_emb") +seed = ( + tbl.search() + .select(["question_emb", "question", "answer"]) + .limit(1) + .offset(42) + .to_list()[0] +) + +hits = ( + tbl.search(seed["question_emb"], vector_column_name="image_emb") .metric("cosine") - .select(["question", "answer", "ocr_tokens"]) + .select(["image_id", "question", "answer", "ocr_tokens"]) + .limit(10) + .to_list() +) +print("query question:", seed["question"], "->", seed["answer"]) +for r in hits: + print(f" {r['image_id']:>14} {r['question'][:60]} ocr={r['ocr_tokens'][:5]}") +``` + +Because the CLIP embeddings are cosine-normalized, cosine is the right metric and the first hit will often be the source row itself. Swap `vector_column_name="image_emb"` for `question_emb` to find paraphrased questions instead of visually similar images. + +Because the dataset also ships an `INVERTED` index on `question` and `answer`, the same query can be issued as a hybrid search that combines the dense vector with a literal keyword match. This is particularly useful for TextVQA, where a brand name or sign content like "stop" must literally appear in the question (or eventually, the answer) while CLIP handles the visual side. 
+ +```python +hybrid_hits = ( + tbl.search(query_type="hybrid", vector_column_name="image_emb") + .vector(seed["question_emb"]) + .text("brand name") + .select(["image_id", "question", "answer", "ocr_tokens"]) .limit(10) .to_list() ) +for r in hybrid_hits: + print(f" {r['image_id']:>14} {r['question'][:60]} -> {r['answer']}") ``` -### LanceDB full-text search +Tune `metric`, `nprobes`, and `refine_factor` on the vector side to trade recall against latency. + +## Curate + +Curation passes for TextVQA usually combine an OCR-driven structural filter (does this image actually contain a meaningful amount of detected text?) with a content predicate on the question or the canonical answer, so the candidate set is both visually interesting and topically focused. Stacking both inside a single filtered scan keeps the result small and explicit, and the bounded `.limit(500)` makes it cheap to inspect before committing the subset to anything downstream. ```python import lancedb @@ -118,19 +162,134 @@ import lancedb db = lancedb.connect("hf://datasets/lance-format/textvqa-lance/data") tbl = db.open_table("validation") -results = ( - tbl.search("brand name") - .select(["question", "answer"]) - .limit(10) +candidates = ( + tbl.search() + .where( + "array_length(ocr_tokens) >= 5 AND question LIKE '%brand%' AND length(answer) > 0", + prefilter=True, + ) + .select(["question_id", "image_id", "question", "answer", "ocr_tokens"]) + .limit(500) .to_list() ) +print(f"{len(candidates)} candidates; first: {candidates[0]['question']} -> {candidates[0]['answer']}") ``` -## Why Lance? +The result is a plain list of dictionaries, ready to inspect, persist as a manifest of `question_id`s, or feed into the Evolve and Train workflows below. The `image` column is never read, so the network traffic for a 500-row candidate scan is dominated by question, answer, and OCR-token strings rather than JPEG bytes. 
+ +## Evolve + +Lance stores each column independently, so a new column can be appended without rewriting the existing data. The lightest form is a SQL expression: derive the new column from columns that already exist, and Lance computes it once and persists it. The example below adds an `ocr_token_count`, an `is_yes_no_question` flag, and `answer_length` and `question_length` integers, any of which can then be used directly in `where` clauses without recomputing the predicate on every query. + +> **Note:** Mutations require a local copy of the dataset, since the Hub mount is read-only. See the Materialize-a-subset section at the end of this card for a streaming pattern that downloads only the rows and columns you need, or use `hf download` to pull the full split first. + +```python +import lancedb + +db = lancedb.connect("./textvqa-lance/data") # local copy required for writes +tbl = db.open_table("validation") + +tbl.add_columns({ + "ocr_token_count": "array_length(ocr_tokens)", + "is_yes_no_question": "lower(answer) IN ('yes', 'no')", + "answer_length": "length(answer)", + "question_length": "length(question)", +}) +``` + +If the values you want to attach already live in another table (a stronger OCR system's tokens, a model's predicted answer, an annotator-disagreement score), merge them in by joining on `question_id`: + +```python +import pyarrow as pa + +predictions = pa.table({ + "question_id": pa.array(["34602", "34603"]), + "model_answer": pa.array(["pepsi", "exit"]), + "model_confidence": pa.array([0.88, 0.72]), +}) +tbl.merge(predictions, on="question_id") +``` + +The original columns and indices are untouched, so existing code that does not reference the new columns continues to work unchanged. New columns become visible to every reader as soon as the operation commits.
For column values that require a Python computation (e.g., running an alternate OCR engine over the image bytes), Lance provides a batch-UDF API — see the [Lance data evolution docs](https://lance.org/guide/data_evolution/). + +## Train + +Projection lets a training loop read only the columns each step actually needs. LanceDB tables expose this through `Permutation.identity(tbl).select_columns([...])`, which plugs straight into the standard `torch.utils.data.DataLoader` so prefetching, shuffling, and batching behave as in any PyTorch pipeline. For a TextVQA fine-tune that needs OCR conditioning, project the JPEG bytes, the question, the OCR tokens, and the canonical answer; columns added in the Evolve section above cost nothing per batch until they are explicitly projected. + +```python +import lancedb +from lancedb.permutation import Permutation +from torch.utils.data import DataLoader + +db = lancedb.connect("hf://datasets/lance-format/textvqa-lance/data") +tbl = db.open_table("train") + +train_ds = Permutation.identity(tbl).select_columns(["image", "question", "ocr_tokens", "answer"]) +loader = DataLoader(train_ds, batch_size=64, shuffle=True, num_workers=4) + +for batch in loader: + # batch carries only the projected columns; decode the JPEG bytes, + # tokenize the question and OCR tokens, forward through the VLM, + # compute the loss against `answer`... + ... +``` + +Switching feature sets is a configuration change: passing `["image_emb", "question_emb", "answer"]` to `select_columns(...)` on the next run skips JPEG decoding entirely and reads only the cached 512-d vectors, which is the right shape for training a lightweight reranker or a linear probe on top of frozen CLIP features. + +## Versioning + +Every mutation to a Lance dataset, whether it adds a column, merges labels, or builds an index, commits a new version. Previous versions remain intact on disk. 
You can list versions and inspect the history directly from the Hub copy; creating new tags requires a local copy since tags are writes. + +```python +import lancedb + +db = lancedb.connect("hf://datasets/lance-format/textvqa-lance/data") +tbl = db.open_table("validation") + +print("Current version:", tbl.version) +print("History:", tbl.list_versions()) +print("Tags:", tbl.tags.list()) +``` + +Once you have a local copy, tag a version for reproducibility: + +```python +local_db = lancedb.connect("./textvqa-lance/data") +local_tbl = local_db.open_table("validation") +local_tbl.tags.create("openclip-vitb32-v1", local_tbl.version) +``` + +A tagged version can be opened by name, or any version reopened by its number, against either the Hub copy or a local one: + +```python +tbl_v1 = db.open_table("validation", version="openclip-vitb32-v1") +tbl_v5 = db.open_table("validation", version=5) +``` + +Pinning supports two workflows. A retrieval system locked to `openclip-vitb32-v1` keeps returning stable results while the dataset evolves in parallel — newly added OCR systems or model predictions do not change what the tag resolves to. A training experiment pinned to the same tag can be rerun later against the exact same images, questions, and OCR tokens, so changes in metrics reflect model changes rather than data drift. Neither workflow needs shadow copies or external manifest tracking. + +## Materialize a subset + +Reads from the Hub are lazy, so exploratory queries only transfer the columns and row groups they touch. Mutating operations (Evolve, tag creation) need a writable backing store, and a training loop benefits from a local copy with fast random access. Both can be served by a subset of the dataset rather than the full split. The pattern is to stream a filtered query through `.to_batches()` into a new local table; only the projected columns and matching row groups cross the wire, and the bytes never fully materialize in Python memory. 
+ +```python +import lancedb + +remote_db = lancedb.connect("hf://datasets/lance-format/textvqa-lance/data") +remote_tbl = remote_db.open_table("train") + +batches = ( + remote_tbl.search() + .where("array_length(ocr_tokens) >= 5") + .select(["question_id", "image_id", "image", "question", "answer", "ocr_tokens", "image_emb", "question_emb"]) + .to_batches() +) + +local_db = lancedb.connect("./textvqa-ocr-rich") +local_db.create_table("train", batches) +``` -- One dataset for images + questions + answers + OCR + dual embeddings + indices — no JSON/feature folders. -- Cross-modal search and OCR-text filtering work on the same dataset on the Hub. -- Schema evolution: add columns (alternate OCR systems, model predictions) without rewriting the data. +The resulting `./textvqa-ocr-rich` is a first-class LanceDB database. Every snippet in the Evolve, Train, and Versioning sections above works against it by swapping `hf://datasets/lance-format/textvqa-lance/data` for `./textvqa-ocr-rich`. ## Source & license diff --git a/docs/datasets/trivia-qa.mdx b/docs/datasets/trivia-qa.mdx index 4e5024e..373748f 100644 --- a/docs/datasets/trivia-qa.mdx +++ b/docs/datasets/trivia-qa.mdx @@ -1,7 +1,7 @@ --- title: "TriviaQA" sidebarTitle: "TriviaQA" -description: "Lance-formatted version of TriviaQA (rc.nocontext config) — a question-answering dataset of trivia questions paired with answer aliases — with MiniLM sentence embeddings stored inline." +description: "A Lance-formatted version of TriviaQA (rc.nocontext config) — a large reading-comprehension dataset of trivia questions paired with a canonical answer, accepted aliases, and entity-type metadata — with MiniLM question embeddings stored inline and…" --- -Lance-formatted version of [TriviaQA](https://nlp.cs.washington.edu/triviaqa/) (`rc.nocontext` config) — a question-answering dataset of trivia questions paired with answer aliases — with **MiniLM sentence embeddings** stored inline. 
+A Lance-formatted version of [TriviaQA](https://nlp.cs.washington.edu/triviaqa/) (`rc.nocontext` config) — a large reading-comprehension dataset of trivia questions paired with a canonical answer, accepted aliases, and entity-type metadata — with MiniLM question embeddings stored inline and ready for retrieval at `hf://datasets/lance-format/trivia-qa-lance/data`. The `rc.nocontext` slice is the standard reading-comprehension form without the multi-gigabyte `entity_pages` / `search_results` payloads, which keeps the dataset compact while preserving everything needed for closed-book QA, retrieval research, and as a search target. -## Why `rc.nocontext`? +## Key features -The full TriviaQA dataset bundles entire Wikipedia / web pages per question (`entity_pages`, `search_results`), which makes it tens of GB. The `rc.nocontext` slice keeps the question + answer + answer aliases in a compact form — ideal for closed-book QA, retrieval research, and as a search target. +- **138k+ trivia questions** with a canonical `answer_value`, normalized form for exact-match scoring, and a list of accepted `answer_aliases`. +- **Pre-computed 384-dim question embeddings** (`question_emb`, `sentence-transformers/all-MiniLM-L6-v2`, cosine-normalized) with a bundled `IVF_PQ` index for semantic question retrieval. +- **Full-text inverted index** on `question` for keyword search and hybrid retrieval. +- **One columnar dataset** carrying questions, canonical answers, aliases, types, and embeddings together — project only the columns each query needs. ## Splits @@ -29,9 +32,9 @@ The full TriviaQA dataset bundles entire Wikipedia / web pages per question (`en | Column | Type | Notes | |---|---|---| -| `question_id` | `string` | TriviaQA question id (e.g. `tc_1`) | +| `question_id` | `string` | TriviaQA question id (e.g. 
`tc_1`); natural join key for merges | | `question` | `string` | The trivia question | -| `question_source` | `string` | URL / source where the question came from | +| `question_source` | `string` | URL or source the question came from | | `answer_value` | `string` | Canonical answer | | `answer_aliases` | `list` | Other accepted phrasings (e.g. `["Sinclair Lewis", "Harry Sinclair Lewis"]`) | | `normalized_answer` | `string` | Lowercased / normalized form for exact-match scoring | @@ -40,72 +43,111 @@ The full TriviaQA dataset bundles entire Wikipedia / web pages per question (`en ## Pre-built indices -- `IVF_PQ` on `question_emb` — `metric=cosine` -- `INVERTED` on `question` -- `BTREE` on `question_id` and `answer_value` -- `BITMAP` on `answer_type` +- `IVF_PQ` on `question_emb` — `metric=cosine`, vector similarity search +- `INVERTED` on `question` — full-text search +- `BTREE` on `question_id` and `answer_value` — point lookups and prefix scans +- `BITMAP` on `answer_type` — fast filtering by entity type -## Quick start +## Why Lance? + +1. **Blazing Fast Random Access**: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation. +2. **Native Multimodal Support**: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search. +3. **Native Index Support**: Lance comes with fast, on-disk, scalable vector and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them. +4. **Efficient Data Evolution**: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time. +5. 
**Versatile Querying**: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes. +6. **Data Versioning**: Every mutation commits a new version; previous versions remain intact on disk. Tags pin a snapshot by name, so retrieval systems and training runs can reproduce against an exact slice of history. + +## Load with `datasets.load_dataset` + +You can load Lance datasets via the standard HuggingFace `datasets` interface, suitable when your pipeline already speaks `Dataset` / `IterableDataset` or you want a quick streaming sample. ```python -import lance +import datasets -ds = lance.dataset("hf://datasets/lance-format/trivia-qa-lance/data/train.lance") -print(ds.count_rows(), ds.schema.names, ds.list_indices()) +hf_ds = datasets.load_dataset("lance-format/trivia-qa-lance", split="train", streaming=True) +for row in hf_ds.take(3): + print(row["question"], "->", row["answer_value"]) ``` ## Load with LanceDB -These tables can also be consumed by [LanceDB](https://lancedb.github.io/lancedb/), the multimodal lakehouse and embedded search library built on top of Lance, for simplified vector search and other queries. +LanceDB is the embedded retrieval library built on top of the Lance format ([docs](https://lancedb.com/docs)), and is the interface most users interact with. It wraps the dataset as a queryable table with search and filter builders, and is the entry point used by the Search, Curate, Evolve, Versioning, and Materialize-a-subset sections below. ```python import lancedb db = lancedb.connect("hf://datasets/lance-format/trivia-qa-lance/data") tbl = db.open_table("train") -print(f"LanceDB table opened with {len(tbl)} trivia questions") +print(len(tbl)) ``` -## Semantic search over questions +## Load with Lance + +`pylance` is the Python binding for the Lance format and works directly with the format's lower-level APIs. 
Reach for it when you want to inspect dataset internals — schema, scanner, fragments, the list of pre-built indices. ```python import lance -import pyarrow as pa -from sentence_transformers import SentenceTransformer - -encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cuda") -q = encoder.encode(["who painted the sistine chapel ceiling"], normalize_embeddings=True)[0] ds = lance.dataset("hf://datasets/lance-format/trivia-qa-lance/data/train.lance") -emb_field = ds.schema.field("question_emb") -hits = ds.scanner( - nearest={"column": "question_emb", "q": pa.array([q.tolist()], type=emb_field.type)[0], "k": 5}, - columns=["question", "answer_value", "answer_aliases"], -).to_table().to_pylist() +print(ds.count_rows(), ds.schema.names) +print(ds.list_indices()) ``` -### LanceDB semantic search +> **Tip — for production use, download locally first.** Streaming from the Hub works for exploration, but heavy random access and ANN search are far faster against a local copy: +> ```bash +> hf download lance-format/trivia-qa-lance --repo-type dataset --local-dir ./trivia-qa-lance +> ``` +> Then point Lance or LanceDB at `./trivia-qa-lance/data`. + +## Search + +The bundled `IVF_PQ` index on `question_emb` turns semantic retrieval over trivia questions into a single call. In production you would encode an incoming question through the same MiniLM encoder used at ingest and pass the resulting 384-dim vector to `tbl.search(...)`. The example below uses the embedding from row 42 as a runnable stand-in. 
```python import lancedb -from sentence_transformers import SentenceTransformer - -encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cuda") -q = encoder.encode(["who painted the sistine chapel ceiling"], normalize_embeddings=True)[0] db = lancedb.connect("hf://datasets/lance-format/trivia-qa-lance/data") tbl = db.open_table("train") -results = ( - tbl.search(q.tolist(), vector_column_name="question_emb") +seed = ( + tbl.search() + .select(["question_emb", "question"]) + .limit(1) + .offset(42) + .to_list()[0] +) + +hits = ( + tbl.search(seed["question_emb"], vector_column_name="question_emb") .metric("cosine") - .select(["question", "answer_value", "answer_aliases"]) - .limit(5) + .select(["question_id", "question", "answer_value", "answer_aliases"]) + .limit(10) + .to_list() +) +for r in hits: + print(f"{r['answer_value']:30s} | {r['question'][:80]}") +``` + +Because the recommended setup also builds an `INVERTED` index on `question`, the same query can be issued as a hybrid search that combines the dense vector with a keyword query. LanceDB merges and reranks the two result lists in a single call, which is useful when a specific named entity must literally appear in the question but the dense side should still drive ranking. + +```python +hybrid_hits = ( + tbl.search(query_type="hybrid") + .vector(seed["question_emb"]) + .text("sistine chapel") + .select(["question_id", "question", "answer_value", "answer_aliases"]) + .limit(10) .to_list() ) +for r in hybrid_hits: + print(f"{r['answer_value']:30s} | {r['question'][:80]}") ``` -### LanceDB full-text search +Tune `metric`, `nprobes`, and `refine_factor` on the vector side to trade recall against latency. + +## Curate + +TriviaQA's `answer_type` column — backed by a `BITMAP` index — makes it cheap to slice the dataset by entity category, and the question text itself is a useful predicate for filtering out very short or unusually long items. 
Stacking predicates inside a single filtered scan keeps the result small and explicit, and the bounded `.limit(1000)` makes it easy to inspect or hand off. ```python import lancedb @@ -113,46 +155,141 @@ import lancedb db = lancedb.connect("hf://datasets/lance-format/trivia-qa-lance/data") tbl = db.open_table("train") -results = ( - tbl.search("sistine chapel") - .select(["question", "answer_value"]) - .limit(10) +candidates = ( + tbl.search() + .where( + "answer_type = 'WikipediaEntity' " + "AND length(question) BETWEEN 60 AND 300", + prefilter=True, + ) + .select(["question_id", "question", "answer_value", "answer_aliases"]) + .limit(1000) .to_list() ) +print(f"{len(candidates)} candidates; first answer: {candidates[0]['answer_value']}") ``` -## Filter by answer type +Neither the `question_emb` vector nor the unused alias fields drive this scan, so a 1000-row curation pass against the Hub moves only the projected text columns. The result is a plain list of dictionaries, ready to inspect, persist as a manifest of question ids, or hand to the Materialize-a-subset section below for export to a writable local copy. + +## Evolve + +Lance stores each column independently, so a new column can be appended without rewriting the existing data. The lightest form is a SQL expression: derive the new column from columns that already exist, and Lance computes it once and persists it. The example below adds a `question_length`, a `num_aliases` count, and a `has_aliases` flag — any of which can then be used directly in `where` clauses without recomputing the predicate on every query. + +> **Note:** Mutations require a local copy of the dataset, since the Hub mount is read-only. See the Materialize-a-subset section at the end of this card for a streaming pattern that downloads only the rows and columns you need, or use `hf download` to pull the full corpus. 
```python -ds = lance.dataset("hf://datasets/lance-format/trivia-qa-lance/data/train.lance") -wiki = ds.scanner(filter="answer_type = 'WikipediaEntity'", columns=["question"], limit=5).to_table() +import lancedb + +db = lancedb.connect("./trivia-qa-lance/data") # local copy required for writes +tbl = db.open_table("train") + +tbl.add_columns({ + "question_length": "length(question)", + "num_aliases": "array_length(answer_aliases)", + "has_aliases": "array_length(answer_aliases) > 0", +}) +``` + +If the values you want to attach already live in another table (offline reader-model predictions, alternate embeddings, retrieval scores from a different system), merge them in by joining on `question_id`: + +```python +import pyarrow as pa + +scores = pa.table({ + "question_id": pa.array(["tc_1", "tc_2"]), + "retriever_score": pa.array([0.88, 0.31]), +}) +tbl.merge(scores, left_on="question_id") ``` -### Filter with LanceDB +The original columns and indices are untouched, so existing code that does not reference the new columns continues to work unchanged. New columns become visible to every reader as soon as the operation commits. For column values that require a Python computation (e.g., running a different embedding model over the questions), Lance provides a batch-UDF API — see the [Lance data evolution docs](https://lance.org/guide/data_evolution/). + +## Train + +Projection lets a training loop read only the columns each step actually needs. LanceDB tables expose this through `Permutation.identity(tbl).select_columns([...])`, which plugs straight into the standard `torch.utils.data.DataLoader` so prefetching, shuffling, and batching behave as in any PyTorch pipeline. For a closed-book QA model the natural projection is the question, the canonical answer, and the alias list (the aliases serve as additional supervision targets during loss computation or evaluation); for a retriever or reranker on top of frozen features, project the precomputed embedding instead.
```python import lancedb +from lancedb.permutation import Permutation +from torch.utils.data import DataLoader db = lancedb.connect("hf://datasets/lance-format/trivia-qa-lance/data") tbl = db.open_table("train") -wiki = ( - tbl.search() - .where("answer_type = 'WikipediaEntity'") - .select(["question"]) - .limit(5) - .to_list() + +train_ds = Permutation.identity(tbl).select_columns( + ["question", "answer_value", "answer_aliases"] ) +loader = DataLoader(train_ds, batch_size=64, shuffle=True, num_workers=4) + +for batch in loader: + # batch carries only the projected columns; question_emb stays on disk. + # tokenize question and answer, forward, backward... + ... ``` -## Why Lance? +Switching feature sets is a configuration change: passing `["question_emb", "answer_value"]` to `select_columns(...)` on the next run reads only the 384-d vectors and the canonical answer string, which is the right shape for training a retrieval head or reranker on cached embeddings. Columns added in Evolve cost nothing per batch until they are explicitly projected. + +## Versioning + +Every mutation to a Lance dataset, whether it adds a column, merges labels, or builds an index, commits a new version. Previous versions remain intact on disk. You can list versions and inspect the history directly from the Hub copy; creating new tags requires a local copy since tags are writes. 
+ +```python +import lancedb + +db = lancedb.connect("hf://datasets/lance-format/trivia-qa-lance/data") +tbl = db.open_table("train") + +print("Current version:", tbl.version) +print("History:", tbl.list_versions()) +print("Tags:", tbl.tags.list()) +``` + +Once you have a local copy, tag a version for reproducibility: + +```python +local_db = lancedb.connect("./trivia-qa-lance/data") +local_tbl = local_db.open_table("train") +local_tbl.tags.create("baseline-v1", local_tbl.version) +``` + +A tagged version can be opened by name, or any version reopened by its number, against either the Hub copy or a local one: + +```python +tbl_v1 = db.open_table("train", version="baseline-v1") +tbl_v5 = db.open_table("train", version=5) +``` + +Pinning supports two workflows. A retrieval system locked to `baseline-v1` keeps returning stable results while the dataset evolves in parallel — newly added scores or alternate embeddings do not change what the tag resolves to. A training experiment pinned to the same tag can be rerun later against the exact same questions and answers, so changes in metrics reflect model changes rather than data drift. Neither workflow needs shadow copies or external manifest tracking. + +## Materialize a subset + +Reads from the Hub are lazy, so exploratory queries only transfer the columns and row groups they touch. Mutating operations (Evolve, tag creation) need a writable backing store, and a training loop benefits from a local copy with fast random access. Both can be served by a subset of the dataset rather than the full corpus. The pattern is to stream a filtered query through `.to_batches()` into a new local table; only the projected columns and matching row groups cross the wire, and the bytes never fully materialize in Python memory. 
+ +```python +import lancedb + +remote_db = lancedb.connect("hf://datasets/lance-format/trivia-qa-lance/data") +remote_tbl = remote_db.open_table("train") + +batches = ( + remote_tbl.search() + .where("answer_type = 'WikipediaEntity' AND length(question) >= 60") + .select( + ["question_id", "question", "answer_value", "answer_aliases", + "normalized_answer", "answer_type", "question_emb"] + ) + .to_batches() +) + +local_db = lancedb.connect("./trivia-qa-wiki") +local_db.create_table("train", batches) +``` -- One dataset carries questions + answers + aliases + embeddings + indices — no sidecar files. -- On-disk vector and full-text indices live next to the data, so search works on local copies and on the Hub. -- Schema evolution: add columns (alternate embeddings, generated answers, task labels) without rewriting the data. +The resulting `./trivia-qa-wiki` is a first-class LanceDB database. Every snippet in the Search, Evolve, Train, and Versioning sections above works against it by swapping `hf://datasets/lance-format/trivia-qa-lance/data` for `./trivia-qa-wiki`. ## Source & license -Converted from [`mandarjoshi/trivia_qa`](https://huggingface.co/datasets/mandarjoshi/trivia_qa) (`rc.nocontext`). TriviaQA is released under the Apache 2.0 license. +Converted from [`mandarjoshi/trivia_qa`](https://huggingface.co/datasets/mandarjoshi/trivia_qa) (`rc.nocontext` config). TriviaQA is released under the Apache 2.0 license. ## Citation diff --git a/docs/datasets/vqav2.mdx b/docs/datasets/vqav2.mdx index 09bdd60..1fdc11d 100644 --- a/docs/datasets/vqav2.mdx +++ b/docs/datasets/vqav2.mdx @@ -1,7 +1,7 @@ --- title: "VQAv2" sidebarTitle: "VQAv2" -description: "Lance-formatted version of VQAv2 — Visual Question Answering on COCO images, sourced from lmms-lab/VQAv2. 
Each row is a (image, question, 10 answers) triple with two CLIP embeddings (image + question text) so the same dataset supports both visual…" +description: "A Lance-formatted version of VQAv2 — open-ended visual question answering on COCO images — sourced from lmms-lab/VQAv2. Each row is one (image, question, 10 annotator answers) triple with paired CLIP image and question embeddings drawn from the…" --- -Lance-formatted version of [VQAv2](https://visualqa.org/) — Visual Question Answering on COCO images, sourced from [`lmms-lab/VQAv2`](https://huggingface.co/datasets/lmms-lab/VQAv2). Each row is a `(image, question, 10 answers)` triple with **two** CLIP embeddings (image + question text) so the same dataset supports both visual retrieval and question-similarity retrieval. +A Lance-formatted version of [VQAv2](https://visualqa.org/) — open-ended visual question answering on COCO images — sourced from [`lmms-lab/VQAv2`](https://huggingface.co/datasets/lmms-lab/VQAv2). Each row is one `(image, question, 10 annotator answers)` triple with paired CLIP image and question embeddings drawn from the same shared space, plus the VQAv2 `question_type` / `answer_type` taxonomy and the consensus `multiple_choice_answer` — all available directly from the Hub at `hf://datasets/lance-format/vqav2-lance/data`. + +## Key features + +- **Inline JPEG bytes** in the `image` column — no sidecar files or image folders. +- **Paired CLIP embeddings in the same row** — `image_emb` and `question_emb` (OpenCLIP ViT-B/32, 512-dim, cosine-normalized) — so cross-modal text→image retrieval and question-similarity retrieval both work as a single indexed lookup. +- **Both raw and consensus answers** — the 10 annotator answers in `answers` alongside the canonical `multiple_choice_answer`, with parallel `answer_confidences`. +- **Pre-built ANN, FTS, scalar, and bitmap indices** covering both embeddings, the question text, the answer taxonomy, and the COCO and VQAv2 ids. 
## Splits @@ -20,13 +27,7 @@ Lance-formatted version of [VQAv2](https://visualqa.org/) — Visual Question An |-------|------| | `validation.lance` | 214,354 | -> **Train split note.** `lmms-lab/VQAv2` ships `train`, `validation`, `testdev`, -> and `test` parquet shards but only declares the eval splits in its -> `dataset_info`, so `datasets.load_dataset(..., split="train")` raises. The -> `vqav2/dataprep.py` script in this repo builds the validation split today; -> the train split (444k rows) can be enabled in a follow-up by reading the -> `data/train-*.parquet` shards directly with PyArrow or by switching to -> `Multimodal-Fatima/VQAv2_train`. Track progress in `TRACKED_DATASETS.md`. +The `lmms-lab/VQAv2` redistribution declares only the eval splits in its `dataset_info`, so the train shards (~444 k rows) are not bundled here today; they can be enabled by reading `data/train-*.parquet` directly with PyArrow or by switching to `Multimodal-Fatima/VQAv2_train`. Track progress in `TRACKED_DATASETS.md`. ## Schema @@ -40,104 +41,124 @@ Lance-formatted version of [VQAv2](https://visualqa.org/) — Visual Question An | `question_type` | `string` | First few tokens of the question (e.g. 
`what is`, `is the`) | | `answer_type` | `string` | One of `yes/no`, `number`, `other` | | `multiple_choice_answer` | `string` | Canonical (most-common) answer | -| `answers` | `list` | Raw answers from 10 annotators | +| `answers` | `list` | 10 annotator answers | | `answer_confidences` | `list` | Parallel confidence list (`yes` / `maybe` / `no`) | -| `image_emb` | `fixed_size_list` | OpenCLIP `ViT-B-32` image embedding (cosine-normalized) | -| `question_emb` | `fixed_size_list` | OpenCLIP `ViT-B-32` text embedding of the question (cosine-normalized) | +| `image_emb` | `fixed_size_list` | OpenCLIP image embedding (cosine-normalized) | +| `question_emb` | `fixed_size_list` | OpenCLIP text embedding of the question (cosine-normalized) | -Because both embeddings come from the same CLIP model, they share an embedding space and cross-modal retrieval (image→question or question→image) works out of the box. +Because both embeddings come from the same CLIP model, they share an embedding space and cross-modal retrieval (image→question or question→image) works without any additional alignment. ## Pre-built indices -- `IVF_PQ` on `image_emb` and `question_emb` — `metric=cosine` -- `INVERTED` (FTS) on `question` -- `BTREE` on `image_id`, `question_id`, `multiple_choice_answer` -- `BITMAP` on `question_type`, `answer_type` +- `IVF_PQ` on `image_emb` — image-side vector search (cosine) +- `IVF_PQ` on `question_emb` — question-side vector search (cosine) +- `INVERTED` (FTS) on `question` — keyword and hybrid search +- `BITMAP` on `question_type`, `answer_type` — fast categorical filters over the VQAv2 taxonomy +- `BTREE` on `image_id`, `question_id`, `multiple_choice_answer` — fast lookup by id and canonical answer + +## Why Lance? + +1. **Blazing Fast Random Access**: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation. +2. 
**Native Multimodal Support**: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search. +3. **Native Index Support**: Lance comes with fast, on-disk, scalable vector and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them. +4. **Efficient Data Evolution**: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time. +5. **Versatile Querying**: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes. +6. **Data Versioning**: Every mutation commits a new version; previous versions remain intact on disk. Tags pin a snapshot by name, so retrieval systems and training runs can reproduce against an exact slice of history. + +## Load with `datasets.load_dataset` -## Quick start +You can load Lance datasets via the standard HuggingFace `datasets` interface, suitable when your pipeline already speaks `Dataset` / `IterableDataset` or you want a quick streaming sample. ```python -import lance +import datasets -ds = lance.dataset("hf://datasets/lance-format/vqav2-lance/data/validation.lance") -print(ds.count_rows(), ds.schema.names, ds.list_indices()) +hf_ds = datasets.load_dataset("lance-format/vqav2-lance", split="validation", streaming=True) +for row in hf_ds.take(3): + print(row["question"], "->", row["multiple_choice_answer"]) ``` ## Load with LanceDB -These tables can also be consumed by [LanceDB](https://lancedb.github.io/lancedb/), the multimodal lakehouse and embedded search library built on top of Lance, for simplified vector search and other queries. 
+LanceDB is the embedded retrieval library built on top of the Lance format ([docs](https://lancedb.com/docs)), and is the interface most users interact with. It wraps the dataset as a queryable table with search and filter builders, and is the entry point used by the Search, Curate, Evolve, Versioning, and Materialize-a-subset sections below. ```python import lancedb db = lancedb.connect("hf://datasets/lance-format/vqav2-lance/data") tbl = db.open_table("validation") -print(f"LanceDB table opened with {len(tbl)} image-question pairs") +print(len(tbl)) ``` -## Cross-modal: find an image for a free-form question +## Load with Lance + +`pylance` is the Python binding for the Lance format and works directly with the format's lower-level APIs. Reach for it when you want to inspect dataset internals — schema, scanner, fragments, and the list of pre-built indices. ```python import lance -import pyarrow as pa -import open_clip -import torch - -model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="laion2b_s34b_b79k") -tokenizer = open_clip.get_tokenizer("ViT-B-32") -model = model.eval().cuda().half() -with torch.no_grad(): - q = model.encode_text(tokenizer(["what color is the dog?"]).cuda()) - q = (q / q.norm(dim=-1, keepdim=True)).float().cpu().numpy()[0] ds = lance.dataset("hf://datasets/lance-format/vqav2-lance/data/validation.lance") -emb_field = ds.schema.field("image_emb") -hits = ds.scanner( - nearest={"column": "image_emb", "q": pa.array([q.tolist()], type=emb_field.type)[0], "k": 5}, - columns=["image_id", "question", "multiple_choice_answer"], -).to_table().to_pylist() +print(ds.count_rows(), ds.schema.names) +print(ds.list_indices()) ``` -### LanceDB cross-modal search +> **Tip — for production use, download locally first.** Streaming from the Hub works for exploration, but heavy random access and ANN search are far faster against a local copy: +> ```bash +> hf download lance-format/vqav2-lance --repo-type dataset --local-dir ./vqav2-lance 
+> ``` +> Then point Lance or LanceDB at `./vqav2-lance/data`. -```python -import lancedb, open_clip, torch +## Search -model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="laion2b_s34b_b79k") -tokenizer = open_clip.get_tokenizer("ViT-B-32") -model = model.eval().cuda().half() -with torch.no_grad(): - q = model.encode_text(tokenizer(["what color is the dog?"]).cuda()) - q = (q / q.norm(dim=-1, keepdim=True)).float().cpu().numpy()[0] +The bundled `IVF_PQ` index on `image_emb` makes cross-modal text→image retrieval a single call: encode a question with the same OpenCLIP model used at ingest (ViT-B/32 `laion2b_s34b_b79k`, cosine-normalized), then pass the resulting 512-d vector to `tbl.search(...)` and target `image_emb`. The example below uses the `question_emb` already stored in row 42 as a runnable stand-in for "the CLIP encoding of a question", so the snippet works without any model loaded. + +```python +import lancedb db = lancedb.connect("hf://datasets/lance-format/vqav2-lance/data") tbl = db.open_table("validation") -results = ( - tbl.search(q.tolist(), vector_column_name="image_emb") +seed = ( + tbl.search() + .select(["question_emb", "question", "multiple_choice_answer"]) + .limit(1) + .offset(42) + .to_list()[0] +) + +hits = ( + tbl.search(seed["question_emb"], vector_column_name="image_emb") .metric("cosine") - .select(["image_id", "question", "multiple_choice_answer"]) - .limit(5) + .select(["image_id", "question", "multiple_choice_answer", "answer_type"]) + .limit(10) .to_list() ) +print("query question:", seed["question"], "->", seed["multiple_choice_answer"]) +for r in hits: + print(f" {r['image_id']:>12} [{r['answer_type']}] {r['question'][:60]} -> {r['multiple_choice_answer']}") ``` -## Question similarity (text→text) +Because the CLIP embeddings are cosine-normalized, cosine is the right metric. 
Swap `vector_column_name="image_emb"` for `question_emb` to do question→question retrieval against the validation set instead, which is useful for clustering paraphrases or spotting near-duplicate questions across COCO images. + +Because the dataset also ships an `INVERTED` index on `question`, the same query can be issued as a hybrid search that combines the dense vector with a literal keyword match. This is useful when a noun like "dog" must appear in the question text but you still want CLIP to handle visual similarity over the candidate set. ```python -ds = lance.dataset("hf://datasets/lance-format/vqav2-lance/data/validation.lance") -ref = ds.take([0], columns=["question_emb", "question"]).to_pylist()[0] -emb_field = ds.schema.field("question_emb") -neighbors = ds.scanner( - nearest={"column": "question_emb", "q": pa.array([ref["question_emb"]], type=emb_field.type)[0], "k": 5}, - columns=["question", "multiple_choice_answer"], -).to_table().to_pylist() -print("query:", ref["question"]) -for n in neighbors: - print(n) +hybrid_hits = ( + tbl.search(query_type="hybrid", vector_column_name="image_emb") + .vector(seed["question_emb"]) + .text("dog") + .select(["image_id", "question", "multiple_choice_answer"]) + .limit(10) + .to_list() +) +for r in hybrid_hits: + print(f" {r['image_id']:>12} {r['question'][:60]} -> {r['multiple_choice_answer']}") ``` -### LanceDB question similarity +Tune `metric`, `nprobes`, and `refine_factor` on the vector side to trade recall against latency. + +## Curate + +A typical curation pass for VQAv2 combines a structural filter on the answer taxonomy (e.g. only yes/no questions, or only counting questions) with a content predicate on the question text or the consensus answer, so the candidate set is both categorically uniform and topically focused. Stacking both inside a single filtered scan keeps the result small and explicit, and the bounded `.limit(500)` makes it cheap to inspect before committing the subset to anything downstream. 
```python import lancedb @@ -145,54 +166,134 @@ import lancedb db = lancedb.connect("hf://datasets/lance-format/vqav2-lance/data") tbl = db.open_table("validation") -ref = tbl.search().limit(1).select(["question_emb", "question"]).to_list()[0] -query_embedding = ref["question_emb"] - -results = ( - tbl.search(query_embedding, vector_column_name="question_emb") - .metric("cosine") - .select(["question", "multiple_choice_answer"]) - .limit(5) +candidates = ( + tbl.search() + .where( + "answer_type = 'yes/no' AND question_type = 'is the' AND multiple_choice_answer IN ('yes', 'no')", + prefilter=True, + ) + .select(["question_id", "image_id", "question", "multiple_choice_answer"]) + .limit(500) .to_list() ) +print(f"{len(candidates)} 'is the' yes/no candidates; first: {candidates[0]['question']} -> {candidates[0]['multiple_choice_answer']}") ``` -## Filter by question / answer type +The result is a plain list of dictionaries, ready to inspect, persist as a manifest of `question_id`s, or feed into the Evolve and Train workflows below. The `image` and embedding columns are never read, so the network traffic for a 500-row candidate scan is dominated by question and answer strings rather than JPEG bytes or vectors. + +## Evolve + +Lance stores each column independently, so a new column can be appended without rewriting the existing data. The lightest form is a SQL expression: derive the new column from columns that already exist, and Lance computes it once and persists it. The example below adds an `is_binary_answer` flag, `question_length` and `answer_length` integers, and a `num_unique_answers` count taken from the `answers` array, any of which can then be used directly in `where` clauses without recomputing the predicate on every query. + +> **Note:** Mutations require a local copy of the dataset, since the Hub mount is read-only. See the Materialize-a-subset section at the end of this card for a streaming pattern that downloads only the rows and columns you need, or use `hf download` to pull the full split first.
```python -ds = lance.dataset("hf://datasets/lance-format/vqav2-lance/data/validation.lance") -yesno = ds.scanner(filter="answer_type = 'yes/no'", columns=["question", "multiple_choice_answer"], limit=5).to_table() -counts = ds.scanner(filter="answer_type = 'number'", columns=["question", "multiple_choice_answer"], limit=5).to_table() +import lancedb + +db = lancedb.connect("./vqav2-lance/data") # local copy required for writes +tbl = db.open_table("validation") + +tbl.add_columns({ + "is_binary_answer": "multiple_choice_answer IN ('yes', 'no')", + "question_length": "length(question)", + "answer_length": "length(multiple_choice_answer)", + "num_unique_answers": "array_length(answers)", +}) ``` -### Filter with LanceDB +If the values you want to attach already live in another table (a model's predicted answer, an annotator-agreement score, or a difficulty rating), merge them in by joining on `question_id`: + +```python +import pyarrow as pa + +predictions = pa.table({ + "question_id": pa.array([262148000, 262148001], type=pa.int64()), + "model_answer": pa.array(["yes", "2"]), + "model_confidence": pa.array([0.87, 0.64]), +}) +tbl.merge(predictions, on="question_id") +``` + +The original columns and indices are untouched, so existing code that does not reference the new columns continues to work unchanged. New columns become visible to every reader as soon as the operation commits. For column values that require a Python computation (e.g., running an alternate VLM over the image bytes), Lance provides a batch-UDF API — see the [Lance data evolution docs](https://lance.org/guide/data_evolution/). + +## Train + +Projection lets a training loop read only the columns each step actually needs. LanceDB tables expose this through `Permutation.identity(tbl).select_columns([...])`, which plugs straight into the standard `torch.utils.data.DataLoader` so prefetching, shuffling, and batching behave as in any PyTorch pipeline. 
For a VQA fine-tune, project the JPEG bytes, the question, and the consensus answer; columns added in the Evolve section above cost nothing per batch until they are explicitly projected. ```python import lancedb +from lancedb.permutation import Permutation +from torch.utils.data import DataLoader db = lancedb.connect("hf://datasets/lance-format/vqav2-lance/data") tbl = db.open_table("validation") -yesno = ( - tbl.search() - .where("answer_type = 'yes/no'") - .select(["question", "multiple_choice_answer"]) - .limit(5) - .to_list() -) -counts = ( - tbl.search() + +train_ds = Permutation.identity(tbl).select_columns(["image", "question", "multiple_choice_answer"]) +loader = DataLoader(train_ds, batch_size=64, shuffle=True, num_workers=4) + +for batch in loader: + # batch carries only the projected columns; decode the JPEG bytes, + # tokenize the question, forward through the VLM, compute the loss + # against `multiple_choice_answer`... + ... +``` + +Switching feature sets is a configuration change: passing `["image_emb", "question_emb", "multiple_choice_answer"]` to `select_columns(...)` on the next run skips JPEG decoding entirely and reads only the cached 512-d vectors, which is the right shape for training a lightweight reranker or a linear probe on top of frozen CLIP features. + +## Versioning + +Every mutation to a Lance dataset, whether it adds a column, merges labels, or builds an index, commits a new version. Previous versions remain intact on disk. You can list versions and inspect the history directly from the Hub copy; creating new tags requires a local copy since tags are writes. 
+ +```python +import lancedb + +db = lancedb.connect("hf://datasets/lance-format/vqav2-lance/data") +tbl = db.open_table("validation") + +print("Current version:", tbl.version) +print("History:", tbl.list_versions()) +print("Tags:", tbl.tags.list()) +``` + +Once you have a local copy, tag a version for reproducibility: + +```python +local_db = lancedb.connect("./vqav2-lance/data") +local_tbl = local_db.open_table("validation") +local_tbl.tags.create("openclip-vitb32-v1", local_tbl.version) +``` + +A tagged version can be opened by name, or any version reopened by its number, against either the Hub copy or a local one: + +```python +tbl_v1 = db.open_table("validation", version="openclip-vitb32-v1") +tbl_v5 = db.open_table("validation", version=5) +``` + +Pinning supports two workflows. A retrieval system locked to `openclip-vitb32-v1` keeps returning stable results while the dataset evolves in parallel — newly added model predictions or alternative annotations do not change what the tag resolves to. A training experiment pinned to the same tag can be rerun later against the exact same images, questions, and consensus answers, so changes in metrics reflect model changes rather than data drift. Neither workflow needs shadow copies or external manifest tracking. + +## Materialize a subset + +Reads from the Hub are lazy, so exploratory queries only transfer the columns and row groups they touch. Mutating operations (Evolve, tag creation) need a writable backing store, and a training loop benefits from a local copy with fast random access. Both can be served by a subset of the dataset rather than the full split. The pattern is to stream a filtered query through `.to_batches()` into a new local table; only the projected columns and matching row groups cross the wire, and the bytes never fully materialize in Python memory. 
+ +```python +import lancedb + +remote_db = lancedb.connect("hf://datasets/lance-format/vqav2-lance/data") +remote_tbl = remote_db.open_table("validation") + +batches = ( + remote_tbl.search() .where("answer_type = 'number'") - .select(["question", "multiple_choice_answer"]) - .limit(5) - .to_list() + .select(["question_id", "image_id", "image", "question", "multiple_choice_answer", "answers", "image_emb", "question_emb"]) + .to_batches() ) -``` -## Why Lance? +local_db = lancedb.connect("./vqav2-counting-subset") +local_db.create_table("validation", batches) +``` -- One dataset for images + questions + answers + dual embeddings + indices — no JSON/CSV sidecars. -- On-disk vector and FTS indices live next to the data, so search works on local copies and on the Hub. -- Schema evolution: add columns (alternate embeddings, model predictions, generated answers) without rewriting the data. +The resulting `./vqav2-counting-subset` is a first-class LanceDB database. Every snippet in the Evolve, Train, and Versioning sections above works against it by swapping `hf://datasets/lance-format/vqav2-lance/data` for `./vqav2-counting-subset`. 
## Source & license From ca80e40bada32431c035e672e251cbcaa9521d36 Mon Sep 17 00:00:00 2001 From: prrao87 <35005448+prrao87@users.noreply.github.com> Date: Fri, 15 May 2026 11:49:29 -0700 Subject: [PATCH 2/2] Update source spec file name --- .github/workflows/docs-ci.yml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/.github/workflows/docs-ci.yml b/.github/workflows/docs-ci.yml index 54907b8..393df80 100644 --- a/.github/workflows/docs-ci.yml +++ b/.github/workflows/docs-ci.yml @@ -30,8 +30,8 @@ jobs: - name: Copy lance-namespace REST API spec run: | set -euo pipefail - test -s lance-namespace/docs/src/rest.yaml - cp lance-namespace/docs/src/rest.yaml docs/api-reference/rest/openapi.yml + test -s lance-namespace/docs/src/spec.yaml + cp lance-namespace/docs/src/spec.yaml docs/api-reference/rest/openapi.yml - name: Install Mintlify CLI run: npm install -g mintlify@latest