Business Application Data Dictionary

A comprehensive, open-source business data dictionary stored in SQLite (datadict.db, with a full schema + data dump in datadict.sql). It collects standardized business data items across major industries and functional areas, extracted only from public / open-source resources.

3,688 data items · 12 categories · 9 open-source standards | 31 items corroborated across multiple sources.

Contents by category

Category	Items	Category	Items
Manufacturing	648	Supply Chain / Logistics	226
Finance / Accounting	612	Human Resources	195
Sales / Order Management	538	Inventory / Warehouse	131
Customer Relationship Management (CRM)	484	Procurement / Purchasing	111
Healthcare	331	Quality Management	81
Product Master Data	299	Maintenance / Asset Management	32

Sources (all public / open-source)

Standard	License	Items
Microsoft Common Data Model (CDM)	CDLA-Permissive-2.0	712
Tryton	GPL-3.0	634
ERPNext / Frappe Health	GPL-3.0	610
Odoo	LGPL-3.0	546
Stripe API (OpenAPI)	MIT	393
Schema.org	CC BY-SA 3.0	373
GS1 (Barcode Syntax Dictionary)	Apache-2.0	220
ISA-95 / B2MML	Royalty-free (MESA)	122
HL7 FHIR (R4)	CC0 (public domain)	116

Full provenance and extraction notes: sources.md.

Diagrams

Schema and source/category maps live in DATA_MODEL.md (Mermaid, renders on GitHub) with static SVG/PNG exports in diagrams/:


ER diagram	two-table star: `Categories` → `DataItems`
Categories	items per category (pie)
Description coverage	description provenance: from-source vs curated (pie)
Source→Category map	which standards feed which categories

Every item is described (100% coverage) — 3,234 descriptions come straight from the upstream source and 454 are curated editorial text added where the source provided none:

Quick start

# Build / refresh datadict.db and datadict.sql from the seed modules
python3 build_dict.py

# Print summary statistics only (no rebuild)
python3 build_dict.py --stats

# Query the database
sqlite3 datadict.db "SELECT * FROM DataItems WHERE CategoryID = 1 LIMIT 10;"

The build is idempotent — re-running never creates duplicates. To rebuild from scratch, delete datadict.db first (it is fully regenerable from the seeds). See the Query Cookbook for ready-to-run SQL.

Find fields by business term

tools/find.py looks up data items by a plain business word and tells you how it matched. A term resolves through four layers, first hit wins:

category — inventory, accounting, quality → every item in that category
entity — patient, work_order, invoice → that object's field family
alias — business vocabulary the source systems don't name directly (billing, payee, shipping, appointment, …) is mapped to real entities via a reviewable SEARCH_ALIASES map (~55 terms)
keyword — substring fallback across name/title/description (flagged as scattered)

python3 tools/find.py billing                 # alias -> invoice/charge/...
python3 tools/find.py patient invoice          # multiple terms (union)
python3 tools/find.py appointment --ddl        # emit a CREATE TABLE
python3 tools/find.py inventory --limit 0      # list every match

$ python3 tools/find.py billing --limit 3
=== 'billing'  [~ alias -> [invoice, invoiceitem, charge, payment_intent]]  382 items, 5 entities, 1 categories
    invoice.account              INTEGER   Account
    invoice.account_country      VARCHAR   account_country
    invoice.account_id           VARCHAR   Account
    ... +379 more

The --ddl flag turns a match into a ready-to-run CREATE TABLE per entity — types, lengths, NOT NULL, and enum CHECKs all come from the dictionary metadata — handy for scaffolding an app schema:

$ python3 tools/find.py appointment --ddl
-- patient_appointment (48 fields)
CREATE TABLE patient_appointment (
    patient_appointment_id INTEGER PRIMARY KEY,
    appointment_date  DATE NOT NULL,
    appointment_for   VARCHAR(255) NOT NULL
                        CHECK (appointment_for IN ('Practitioner', 'Department', 'Service Unit')),
    ...
);

Business terms that don't yet resolve well are a one-line addition to SEARCH_ALIASES in tools/find.py.

For a full multi-table schema with foreign keys, use tools/export_ddl.py. It resolves the same terms, then emits one CREATE TABLE per entity with real FOREIGN KEY constraints derived from the dictionary's relationship hints, ordered so the DDL loads, in your choice of dialect:

python3 tools/export_ddl.py manufacturing                 # SQLite (self-validated)
python3 tools/export_ddl.py billing patient --dialect postgres
python3 tools/export_ddl.py inventory --dialect mysql --out schema.sql

Schema

CREATE TABLE Categories (
    CategoryID  INTEGER PRIMARY KEY,
    Name        TEXT NOT NULL UNIQUE,   -- "Manufacturing", "Finance", ...
    Description TEXT,
    Source      TEXT
);

CREATE TABLE DataItems (
    DataItemID   INTEGER PRIMARY KEY,
    CategoryID   INTEGER NOT NULL REFERENCES Categories(CategoryID),
    Name         TEXT NOT NULL,         -- normalized "entity.field" (snake_case)
    Title        TEXT,
    Description  TEXT,
    DataType     TEXT,                  -- VARCHAR, INTEGER, DECIMAL, DATE, ...
    ByteLength   INTEGER,
    DecimalScale INTEGER,
    IsRequired   BOOLEAN DEFAULT FALSE,
    IsNullable   BOOLEAN DEFAULT TRUE,
    DefaultValue TEXT,
    AllowedValues TEXT,                 -- JSON array, or per-source JSON object
    FormatMask    TEXT,                 -- e.g. "YYYY-MM-DD", GS1 "N14,csum"
    SourceStandard TEXT,                -- one or more "; "-joined standards
    SourceURL      TEXT,                -- one or more " | "-joined URLs
    Version        TEXT,
    CreatedAt DATETIME DEFAULT CURRENT_TIMESTAMP,
    UpdatedAt DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_dataitems_category ON DataItems(CategoryID);
CREATE INDEX idx_dataitems_name     ON DataItems(Name);
-- Idempotency key for upserts:
CREATE UNIQUE INDEX ux_dataitems_natural
    ON DataItems(CategoryID, Name, SourceStandard);

Two field conventions worth knowing

Name is normalized to entity.field in snake_case (e.g. account.account_id, product.gtin). The entity is everything before the last dot.
SourceStandard / SourceURL hold a ; - / |-joined list when an item was corroborated by more than one source (a cross-source merge).
AllowedValues is usually a JSON array (["male","female","other"]). When merged sources have divergent enums, it becomes a JSON object keyed by source (e.g. purchase_order.state). Consumers should handle both.

UI projection (`ui_datadict.db`)

A second, derived database aimed at UI / form generation and resource governance. tools/build_ui_dict.py reads datadict.db (read-only) and rebuilds ui_datadict.db + ui_datadict.sql from scratch:

python3 tools/build_ui_dict.py
sqlite3 ui_datadict.db "SELECT * FROM UI_DataItems LIMIT 5;"

It reshapes the dictionary into a 3-level hierarchy — Categories → Groups → UI_DataItems (12 categories, 120 groups, 3,688 items):

Groups — one per entity (the part before the last dot of Name). The 43 transient wizard / relation (junction) entities collapse into a single Wizard group; gs1 is kept. Each group carries its CategoryID.
UI_DataItems — Name is reduced to just the field; every item is treated as UTF-8 with an implied DataType, plus governance columns:
- CharLength — declared VARCHAR length, else a per-type default
- ByteLength = CharLength × 4 (UTF-8 worst case)
- ValidationSpecs — the source mask (e.g. GS1) when present, else generated from type / allowed values / scale

Groupname	Name	DataType	CharLength	ByteLength	ValidationSpecs
patient	dob	DATE	10	40	`^\d{4}-\d{2}-\d{2}$`
patient	first_name	VARCHAR	255	1020	`maxlength: 255`
patient	blood_group	VARCHAR	255	1020	`one of: A Positive\|A Negative\|…`

datadict.db is never modified; CI rebuilds and validates ui_datadict.db (completeness, FK integrity, ByteLength = CharLength*4, unique (GroupID, Name)).

How it's built

build_dict.py          # orchestrator: schema, load seeds, normalize, export, stats
normalize.py           # Phase 3: snake_case naming, entity/field aliases,
                        #          cross-source merge, per-source enum union
seeds/                 # one module per source (CATEGORIES + ITEMS); some
                        #   are auto-generated by the tools below
tools/fetch_*.py       # generators that (re)build seed modules from upstream
datadict.db            # the SQLite database (build output)
datadict.sql           # full schema + INSERTs (build output)
sources.md             # every source, license, and extraction method
NORMALIZATION_REPORT.md # auto-generated: aliases, merges, related concepts
DATA_MODEL.md          # auto-generated Mermaid ER + category/source diagrams
QUERY_COOKBOOK.md      # ready-to-run SQL recipes
PROGRESS.md            # running build log

Pipeline (4 phases)

Discovery — explore each open repo/spec for business entities.
Extraction — tools/fetch_*.py parse upstream schemas (XSD, JSON-LD, Python ast, DocType JSON, OpenAPI, FHIR StructureDefinitions) into seed modules. Each is re-runnable to refresh from upstream.
Normalization (normalize.py) — consistent snake_case names; deliberate, reviewable entity aliases (e.g. account.invoice → invoice) and field aliases (e.g. gs1.gtin → product.gtin); conservative cross-source merge (same entity.field in the same category) that keeps the richest metadata and records all sources; per-source enum union.
SQLite generation — datadict.db + datadict.sql, plus stats and the normalization report.

Regenerating a single source

python3 tools/fetch_cdm.py        # Microsoft CDM   -> seeds/cdm_microsoft.py
python3 tools/fetch_odoo.py       # Odoo            -> seeds/odoo.py
python3 tools/fetch_frappe.py     # ERPNext+Health  -> seeds/frappe.py
python3 tools/fetch_schemaorg.py  # Schema.org      -> seeds/schemaorg.py
python3 tools/fetch_tryton.py     # Tryton          -> seeds/tryton.py
python3 tools/fetch_gs1.py        # GS1             -> seeds/gs1.py
python3 tools/fetch_fhir.py       # HL7 FHIR        -> seeds/fhir.py
python3 tools/fetch_openapi.py    # OpenAPI (Stripe)-> seeds/openapi.py
python3 build_dict.py             # then rebuild the DB
python3 tools/gen_diagram.py      # refresh DATA_MODEL.md from the DB
python3 tools/render_diagrams.py  # render diagrams/*.svg + *.png (Chromium or mmdc)

(ISA-95 / B2MML lives in a hand-curated seeds/isa95_b2mml.py.)

Adding a new source

Create a seeds/<name>.py exposing two lists — CATEGORIES and ITEMS (each item a dict with at least category and name) — then run python3 build_dict.py. For OpenAPI/Swagger specs, just add an entry to SPECS in tools/fetch_openapi.py.

Requirements

Python 3 standard library only (sqlite3, json, ast, urllib, re). No third-party packages. The tools/fetch_*.py generators need network access to reach the upstream repos; build_dict.py works fully offline from the committed seed modules.

Contributing

Contributions — new open-source sources, corrections, better descriptions, tooling — are welcome. See CONTRIBUTING.md for the seed-module contract, how to add a source, the aliasing/merge rules, and the pre-PR checklist. Core rule: public/open-source sources only, accuracy over quantity, always record provenance.

Changelog

Release-facing changes are in CHANGELOG.md (latest release: v1.3.0). The granular development log lives in PROGRESS.md.

Contributors

Michael Anderson (@cloud3000) — creator & maintainer
Claude (Anthropic) — AI pair-programmer; contributions are co-authored via Co-Authored-By trailers in the commit history

The full, always-current list is on the contributors graph. New contributors welcome — see CONTRIBUTING.md.

License & attribution

This repository is dual-licensed to reflect its two layers:

Code — the build/generator scripts (build_dict.py, normalize.py, tools/*.py) and the project's documentation are licensed under the MIT License.
Data — datadict.db / datadict.sql is a compilation derived from multiple open sources. The original compilation, arrangement, and this project's own descriptions are licensed under Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).

Upstream terms still apply per item. Individual data items remain subject to the license of the source they came from, recorded in each row's SourceStandard / SourceURL and summarized in sources.md: HL7 FHIR (CC0), Stripe (MIT), GS1 (Apache-2.0), Microsoft CDM (CDLA-Permissive-2.0), Schema.org (CC BY-SA 3.0), Odoo (LGPL-3.0), ERPNext/Frappe Health & Tryton (GPL-3.0), ISA-95/B2MML (MESA royalty-free). If you redistribute the data, provide attribution (the per-item source fields + sources.md satisfy this) and honor each upstream license — particularly Schema.org's ShareAlike and the copyleft (GPL) sources.

No paywalled, proprietary, or scraped commercial content is included.

Not legal advice. This dual-license setup is a good-faith, conservative reflection of the sources; consult a professional for your specific use.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Business Application Data Dictionary

Contents by category

Sources (all public / open-source)

Diagrams

Quick start

Find fields by business term

Schema

Two field conventions worth knowing

UI projection (`ui_datadict.db`)

How it's built

Pipeline (4 phases)

Regenerating a single source

Adding a new source

Requirements

Contributing

Changelog

Contributors

License & attribution

About

Releases

Packages

Contributors

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 61 Commits
.github/workflows		.github/workflows
diagrams		diagrams
seeds		seeds
tools		tools
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CLAUDE.md		CLAUDE.md
CONTRIBUTING.md		CONTRIBUTING.md
DATA_LICENSE		DATA_LICENSE
DATA_MODEL.md		DATA_MODEL.md
LICENSE		LICENSE
NORMALIZATION_REPORT.md		NORMALIZATION_REPORT.md
PROGRESS.md		PROGRESS.md
QUERY_COOKBOOK.md		QUERY_COOKBOOK.md
README.md		README.md
build_dict.py		build_dict.py
datadict.db		datadict.db
datadict.sql		datadict.sql
normalize.py		normalize.py
sources.md		sources.md
ui_datadict.db		ui_datadict.db
ui_datadict.sql		ui_datadict.sql

Folders and files

Latest commit

History

Repository files navigation

Business Application Data Dictionary

Contents by category

Sources (all public / open-source)

Diagrams

Quick start

Find fields by business term

Schema

Two field conventions worth knowing

UI projection (ui_datadict.db)

How it's built

Pipeline (4 phases)

Regenerating a single source

Adding a new source

Requirements

Contributing

Changelog

Contributors

License & attribution

About

Topics

Resources

Contributing

Stars

Watchers

Forks

Releases

Packages

Contributors

Languages

UI projection (`ui_datadict.db`)