DataMetaMap is a Python library for representing datasets in a shared vector space, so you can compare datasets (and tasks) using standard distances and similarity metrics.
It includes multiple dataset embedding algorithms implemented on top of PyTorch:
- Dataset2Vec (tabular datasets)
- Task2Vec (supervised tasks via Fisher information)
- Wasserstein Task Embedding (Optimal Transport based)
- MMD (used as a baseline in some workflows)
If you can measure similarity between datasets, you can:
- retrieve the most similar dataset(s) to a target dataset
- choose better pretraining sources
- cluster tasks and datasets, and visualize the dataset landscape
- track dataset drift over time
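Once datasets are embedded, retrieval reduces to nearest-neighbour search over embedding vectors. A minimal sketch with random stand-in embeddings (plain NumPy, independent of the library's API):

```python
import numpy as np

def most_similar(target: np.ndarray, candidates: np.ndarray) -> int:
    """Index of the candidate embedding closest to `target` under cosine similarity."""
    t = target / np.linalg.norm(target)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return int(np.argmax(c @ t))  # highest cosine similarity wins

rng = np.random.default_rng(0)
library = rng.normal(size=(5, 16))                 # 5 candidate dataset embeddings
target = library[3] + 0.01 * rng.normal(size=16)   # a slightly perturbed copy of #3
print(most_similar(target, library))               # retrieves index 3
```

The same embeddings feed directly into clustering or 2-D projections for visualizing the dataset landscape.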
Implemented methods, with pointers to the original papers:
- Maximum Mean Discrepancy, also see 📝 review
- Task2Vec, also see 📝 paper
- Dataset2Vec, also see 📝 paper
- Wasserstein Task Embedding, also see 📝 paper
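As a concrete illustration of the MMD baseline, here is a self-contained NumPy estimate of squared MMD with an RBF kernel; this is an illustrative sketch, not the library's implementation:

```python
import numpy as np

def mmd_rbf(X: np.ndarray, Y: np.ndarray, gamma: float = 1.0) -> float:
    """Biased estimate of squared Maximum Mean Discrepancy with an RBF kernel."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(0)
same_a = rng.normal(size=(100, 2))
same_b = rng.normal(size=(100, 2))             # drawn from the same distribution
shifted = rng.normal(loc=2.0, size=(100, 2))   # mean shifted by 2

print(mmd_rbf(same_a, same_b))    # close to 0: same distribution
print(mmd_rbf(same_a, shifted))   # clearly larger: distributions differ
```

This biased estimator includes the kernel diagonal, so the same-distribution value is small but not exactly zero.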
Requires Python 3.10+.
Recommended: install into an isolated virtual environment.
macOS / Linux:
python3 -m venv .venv
source .venv/bin/activate
python -m pip install -U pip

Windows (PowerShell):
py -m venv .venv
.\.venv\Scripts\Activate.ps1
py -m pip install -U pip

Install from source:

git clone https://github.com/intsystems/DataMetaMap.git
cd DataMetaMap
python -m pip install .

For a development setup (editable install with extras):

python -m pip install -e ".[dev,viz]"

Dataset2VecEmbedder trains on a collection of tabular datasets, then embeds a single dataset as a vector.
import numpy as np
import torch
from data_meta_map.models import get_model
from data_meta_map.dataset2vec_embedder import Dataset2VecEmbedder
# Model for tabular embedding
model = get_model("dataset2vec")
embedder = Dataset2VecEmbedder(model, max_epochs=1, batch_size=8, n_batches=5)
# Each training dataset: last column is the target
train_ds1 = np.random.randn(64, 6).astype(np.float32)
train_ds2 = np.random.randn(64, 6).astype(np.float32)
embedder.fit([train_ds1, train_ds2])
# Embed a new dataset: feature matrix X and labels y are passed separately
X = torch.randn(32, 5)
y = torch.randint(0, 2, (32,)).float()
z = embedder.embed(X, y)
print(z.shape)  # (output_size,)

WassersteinEmbedder can compute class statistics from a dataset and build embeddings via a distance matrix.
See demo/wasserstein/simple_example1 (1).ipynb for an end-to-end notebook.
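The underlying idea can be illustrated without the library: for 1-D Gaussian class statistics, the 2-Wasserstein distance has a closed form, W2² = (μ₁ − μ₂)² + (σ₁ − σ₂)². A standalone sketch (not the WassersteinEmbedder API):

```python
import numpy as np

def w2_gaussian_1d(mu1: float, sigma1: float, mu2: float, sigma2: float) -> float:
    """Closed-form 2-Wasserstein distance between two 1-D Gaussians."""
    return float(np.sqrt((mu1 - mu2) ** 2 + (sigma1 - sigma2) ** 2))

# Class statistics (mean, std) estimated from two datasets
print(w2_gaussian_1d(0.0, 1.0, 3.0, 1.0))  # 3.0: equal spread, mean shifted by 3
```

Pairwise distances like this fill the distance matrix from which the embedding is built.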
Task2Vec computes a task embedding based on the Fisher information of a probe network. See demo/task2vec/simple_example.ipynb for an example workflow.
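To build intuition for the Fisher-based embedding, here is a hypothetical standalone sketch that computes the diagonal Fisher information of a logistic-regression "probe" in NumPy; Task2Vec does the analogous computation over a probe network's weights, and this is not the library's implementation:

```python
import numpy as np

def fisher_diag_logistic(X: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Diagonal of the Fisher information of a logistic model p = sigmoid(X @ w).

    For logistic regression, F = X.T @ diag(p * (1 - p)) @ X; keeping only the
    diagonal mirrors how Task2Vec summarizes a probe network's parameters.
    """
    p = 1.0 / (1.0 + np.exp(-X @ w))
    fw = p * (1.0 - p)                       # per-example Fisher weight
    return (fw[:, None] * X ** 2).sum(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                # 200 examples, 4 probe parameters
w = rng.normal(size=4)
emb = fisher_diag_logistic(X, w)
print(emb.shape)                             # (4,): one Fisher value per parameter
```

The resulting vector plays the role of the task embedding: tasks whose Fisher diagonals are close exercise the probe's parameters similarly.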
Notebooks are in:
- demo/dataset2vec/simple_example.ipynb
- demo/task2vec/simple_example.ipynb
- demo/wasserstein/simple_example1 (1).ipynb
Benchmark notebooks and scripts live in benchmarks/. In particular, see benchmarks/pretrain_benchmark/ for experiments comparing transfer performance between pretraining sources and target tasks.
- Vladislav Minashkin (Project planning, Benchmarking, Algorithms)
- Papay Ivan (Documentation writing, Code writing, Algorithms)
- Meshkov Vlad (Blog post, Demo, Algorithms)
- Stepanov Ilya (Tech. report, Code writing, Algorithms)
- You are welcome to contribute to our project!
Run tests:
pytest -q

Run tests with coverage:

pytest -q --cov=src/data_meta_map --cov-report=term-missing