DataMetaMap is a Python library for representing datasets in a shared vector space, so you can compare datasets (and tasks) using standard distances and similarity metrics.
It includes multiple dataset embedding algorithms implemented on top of PyTorch:
- Dataset2Vec (tabular datasets)
- Task2Vec (supervised tasks via Fisher information)
- Wasserstein Task Embedding (Optimal Transport based)
- MMD (used as a baseline in some workflows)
If you can measure similarity between datasets, you can:
- retrieve the most similar dataset(s) to a target dataset
- choose better pretraining sources
- cluster tasks and datasets, and visualize the dataset landscape
- track dataset drift over time
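Once datasets are embedded, retrieval reduces to nearest-neighbour search over embedding vectors. A minimal sketch with random stand-in embeddings (plain NumPy, independent of the library's API):

```python
import numpy as np

def most_similar(target: np.ndarray, candidates: np.ndarray) -> int:
    """Index of the candidate embedding closest to `target` under cosine similarity."""
    t = target / np.linalg.norm(target)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return int(np.argmax(c @ t))  # highest cosine similarity wins

rng = np.random.default_rng(0)
library = rng.normal(size=(5, 16))                 # 5 candidate dataset embeddings
target = library[3] + 0.01 * rng.normal(size=16)   # a slightly perturbed copy of #3
print(most_similar(target, library))               # retrieves index 3
```

The same embeddings feed directly into clustering or 2-D projections for visualizing the dataset landscape.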
Implemented methods, with pointers to the original papers:
- Maximum Mean Discrepancy, also see 📝 review
- Task2Vec, also see 📝 paper
- Dataset2Vec, also see 📝 paper
- Wasserstein Task Embedding, also see 📝 paper
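As a concrete illustration of the MMD baseline, here is a self-contained NumPy estimate of squared MMD with an RBF kernel; this is an illustrative sketch, not the library's implementation:

```python
import numpy as np

def mmd_rbf(X: np.ndarray, Y: np.ndarray, gamma: float = 1.0) -> float:
    """Biased estimate of squared Maximum Mean Discrepancy with an RBF kernel."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(0)
same_a = rng.normal(size=(100, 2))
same_b = rng.normal(size=(100, 2))             # drawn from the same distribution
shifted = rng.normal(loc=2.0, size=(100, 2))   # mean shifted by 2

print(mmd_rbf(same_a, same_b))    # close to 0: same distribution
print(mmd_rbf(same_a, shifted))   # clearly larger: distributions differ
```

This biased estimator includes the kernel diagonal, so the same-distribution value is small but not exactly zero.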
Requires Python 3.10+.
Recommended: install into an isolated virtual environment.
macOS / Linux:
python3 -m venv .venv
source .venv/bin/activate
python -m pip install -U pip

Windows (PowerShell):
py -m venv .venv
.\.venv\Scripts\Activate.ps1
py -m pip install -U pip

Install from source:

git clone https://github.com/intsystems/DataMetaMap.git
cd DataMetaMap
python -m pip install .

For a development setup (editable install with extras):

python -m pip install -e ".[dev,viz]"

Dataset2VecEmbedder trains on a collection of tabular datasets, then embeds a single dataset as a vector.
import numpy as np
import torch
from data_meta_map.models import get_model
from data_meta_map.dataset2vec_embedder import Dataset2VecEmbedder
# Model for tabular embedding
model = get_model("dataset2vec")
embedder = Dataset2VecEmbedder(model, max_epochs=1, batch_size=8, n_batches=5)
# Each training dataset: last column is the target
train_ds1 = np.random.randn(64, 6).astype(np.float32)
train_ds2 = np.random.randn(64, 6).astype(np.float32)
embedder.fit([train_ds1, train_ds2])
# Embed a new dataset: feature matrix X and labels y are passed separately
X = torch.randn(32, 5)
y = torch.randint(0, 2, (32,)).float()
z = embedder.embed(X, y)
print(z.shape)  # (output_size,)

WassersteinEmbedder can compute class statistics from a dataset and build embeddings via a distance matrix.
See demo/wasserstein/simple_example1 (1).ipynb for an end-to-end notebook.
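The underlying idea can be illustrated without the library: for 1-D Gaussian class statistics, the 2-Wasserstein distance has a closed form, W2² = (μ₁ − μ₂)² + (σ₁ − σ₂)². A standalone sketch (not the WassersteinEmbedder API):

```python
import numpy as np

def w2_gaussian_1d(mu1: float, sigma1: float, mu2: float, sigma2: float) -> float:
    """Closed-form 2-Wasserstein distance between two 1-D Gaussians."""
    return float(np.sqrt((mu1 - mu2) ** 2 + (sigma1 - sigma2) ** 2))

# Class statistics (mean, std) estimated from two datasets
print(w2_gaussian_1d(0.0, 1.0, 3.0, 1.0))  # 3.0: equal spread, mean shifted by 3
```

Pairwise distances like this fill the distance matrix from which the embedding is built.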
Task2Vec computes a task embedding based on the Fisher information of a probe network. See demo/task2vec/simple_example.ipynb for an example workflow.
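To build intuition for the Fisher-based embedding, here is a hypothetical standalone sketch that computes the diagonal Fisher information of a logistic-regression "probe" in NumPy; Task2Vec does the analogous computation over a probe network's weights, and this is not the library's implementation:

```python
import numpy as np

def fisher_diag_logistic(X: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Diagonal of the Fisher information of a logistic model p = sigmoid(X @ w).

    For logistic regression, F = X.T @ diag(p * (1 - p)) @ X; keeping only the
    diagonal mirrors how Task2Vec summarizes a probe network's parameters.
    """
    p = 1.0 / (1.0 + np.exp(-X @ w))
    fw = p * (1.0 - p)                       # per-example Fisher weight
    return (fw[:, None] * X ** 2).sum(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                # 200 examples, 4 probe parameters
w = rng.normal(size=4)
emb = fisher_diag_logistic(X, w)
print(emb.shape)                             # (4,): one Fisher value per parameter
```

The resulting vector plays the role of the task embedding: tasks whose Fisher diagonals are close exercise the probe's parameters similarly.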
Notebooks are in:
- demo/dataset2vec/simple_example.ipynb
- demo/task2vec/simple_example.ipynb
- demo/wasserstein/simple_example1 (1).ipynb
Benchmark notebooks and scripts live in benchmarks/. In particular, see benchmarks/pretrain_benchmark/ for experiments comparing transfer performance between pretraining sources and target tasks.
- Vladislav Minashkin (Project planning, Benchmarking, Algorithms)
- Papay Ivan (Documentation writing, Code writing, Algorithms)
- Meshkov Vlad (Blog post, Demo, Algorithms)
- Stepanov Ilya (Tech. report, Code writing, Algorithms)
- You are welcome to contribute to our project!
Run tests:
pytest -q

Run tests with coverage:

pytest -q --cov=src/data_meta_map --cov-report=term-missing