OpenContracts (Demo)

Open-source document intelligence you can build on.

Point OpenContracts at a repository of documents and get a programmable citation graph — human annotation, structured extraction, AI agents, and a built-in MCP server, all behind one API. Self-hosted, MIT-licensed, and built for teams working at scale.

from opencontractserver.llms import agents
from pydantic import BaseModel

class Findings(BaseModel):
    unusual_terms: list[str]

# Run inside an async context (an async function, a notebook, or `ipython --asyncio`).
# `my_contracts` is a Corpus — pass its integer PK or an ORM instance.
agent = await agents.for_corpus(corpus=my_contracts)
findings = await agent.structured_response(
    prompt="Flag any unusual payment terms across these contracts.",
    target_type=Findings,
)

Same graph, three surfaces: a GraphQL + REST API for your apps, a Model Context Protocol server for your agents, and a React UI for your team.



Build on it

OpenContracts is a platform, not a black box. Everything the UI does runs on surfaces you can call yourself — point it at the documents you already have and build your own tooling on top.

AI agents in Python

Spin up a document- or corpus-scoped agent in a couple of lines. Stream a chat response, or get a typed object back through a Pydantic model — every answer grounded in the annotations and citations your team has built.

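from opencontractserver.llms import agents

# Like the first example, this must run inside an async context.
agent = await agents.for_document(123, corpus=45)
async for chunk in agent.stream("Summarize the indemnification clauses"):
    print(chunk.content, end="")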

See the LLM framework guide.

MCP server — bring your own agent

Every corpus is exposed over the Model Context Protocol, so Claude, Cursor, or any MCP client can search it, walk its citation edges, and (when authorized) propose annotations of its own. No glue code required:

Endpoints — /mcp/ (anonymous, public corpuses) and /mcp/me/ (authenticated)
Discovery — /llms.txt and /.well-known/mcp.json
Tools — search_corpus, list_documents, get_document_text, list_annotations, list_relationships, list_threads, create_thread_message

See the MCP documentation.
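
For clients that do not speak MCP natively, the endpoints above can also be addressed by hand. The sketch below only builds the URLs and a JSON-RPC 2.0 `tools/call` message (the shape the MCP specification defines for invoking a tool); it does not send anything. The base URL is a placeholder and the `query` argument name for `search_corpus` is an assumption, so check the tool's published input schema before relying on it. A real session also needs the MCP initialize handshake, which an MCP client library normally handles for you.

```python
import json
from urllib.parse import urljoin

# Hypothetical deployment URL; replace with your own instance.
BASE_URL = "https://contracts.example.org"


def mcp_urls(base_url: str) -> dict[str, str]:
    """Return the MCP-related URLs listed above for one deployment."""
    root = base_url.rstrip("/") + "/"
    return {
        "anonymous": urljoin(root, "mcp/"),
        "authenticated": urljoin(root, "mcp/me/"),
        "llms_txt": urljoin(root, "llms.txt"),
        "discovery": urljoin(root, ".well-known/mcp.json"),
    }


def tool_call(tool: str, arguments: dict, request_id: int = 1) -> str:
    """Serialize a JSON-RPC 2.0 `tools/call` request as defined by MCP."""
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "id": request_id,
            "method": "tools/call",
            "params": {"name": tool, "arguments": arguments},
        }
    )


urls = mcp_urls(BASE_URL)
# `query` is an assumed argument name, not taken from the OpenContracts docs.
payload = tool_call("search_corpus", {"query": "indemnification"})
print(urls["anonymous"])
print(payload)
```

In practice, pointing an existing MCP client at /mcp/ or /mcp/me/ is simpler than composing messages yourself; the sketch is mainly useful for seeing what travels over the wire.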

Structured extraction at scale

Define a fieldset — a set of columns, each a natural-language query — and run it across an entire corpus. Extraction fans out over Celery workers and lands in a spreadsheet-style grid, hundreds of documents at a time, with human approve/reject on every cell.

See Write your own extractors.
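
To make the fieldset and review-grid idea concrete, here is a minimal plain-Python sketch. It is not the OpenContracts data model: `Column`, `Cell`, and `Grid` are hypothetical names, the sample values are made up, and the real platform stores these as database records and fills cells from Celery tasks rather than by hand.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Column:
    name: str
    query: str  # natural-language question asked of every document


@dataclass
class Cell:
    value: str
    approved: Optional[bool] = None  # None means not yet reviewed


@dataclass
class Grid:
    columns: list[Column]
    # rows[document_id][column_name] -> Cell
    rows: dict[str, dict[str, Cell]] = field(default_factory=dict)

    def record(self, document_id: str, column: str, value: str) -> None:
        """Store an extracted value for one document/column pair."""
        self.rows.setdefault(document_id, {})[column] = Cell(value)

    def review(self, document_id: str, column: str, approved: bool) -> None:
        """Mark a cell as approved or rejected by a human reviewer."""
        self.rows[document_id][column].approved = approved

    def pending(self) -> list[tuple[str, str]]:
        """List (document_id, column_name) pairs still awaiting review."""
        return [
            (doc, col)
            for doc, cells in self.rows.items()
            for col, cell in cells.items()
            if cell.approved is None
        ]


fieldset = [
    Column("payment_terms", "What are the payment terms?"),
    Column("governing_law", "Which law governs this agreement?"),
]
grid = Grid(fieldset)
grid.record("doc-1", "payment_terms", "Net 30")
grid.record("doc-1", "governing_law", "Delaware")
grid.review("doc-1", "payment_terms", approved=True)
print(grid.pending())
```

Keeping the review state on each cell, rather than on the row or the document, is what lets a reviewer accept one extracted value and reject its neighbour.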

A pluggable pipeline

Parsing, embedding, and thumbnailing are swappable components. Register a custom parser, embedder, or thumbnailer for your formats and everything downstream — search, annotation, agents — keeps working unchanged.

See the pipeline overview.

GraphQL + REST

The whole graph — corpuses, documents, annotations, relationships, extracts — is queryable over a typed GraphQL API (with REST for uploads and health checks). The React frontend is just one client; yours is another.
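
As a rough illustration of calling the GraphQL API from your own client, the sketch below builds (but does not send) a standard GraphQL-over-HTTP POST request using only the standard library. The endpoint URL is an assumption about a local deployment. The query uses `supportedMimeTypes`, a query name mentioned under Supported Formats; whether it returns scalars or needs a selection set is not stated here, so treat the query string as a placeholder and consult the schema. Authenticated queries also need credentials, which this sketch omits.

```python
import json
from typing import Optional
from urllib.request import Request

# Assumed endpoint; check your deployment's URL configuration.
GRAPHQL_URL = "http://localhost:8000/graphql/"


def build_graphql_request(query: str, variables: Optional[dict] = None) -> Request:
    """Build a JSON POST request carrying a GraphQL query and its variables."""
    body = json.dumps({"query": query, "variables": variables or {}}).encode("utf-8")
    return Request(
        GRAPHQL_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder query; add a selection set if the schema requires one.
request = build_graphql_request("query { supportedMimeTypes }")
print(request.get_method(), request.full_url)
```

Passing `request` to `urllib.request.urlopen` would send it; any GraphQL client library can do the same job with less ceremony.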

Why OpenContracts

Every document in a serious repository cites other documents. Statutes cite the acts that authorized them. Court opinions cite the precedents that bound them. Research papers cite the work that made them possible. Standards cite the RFCs they build on. Contracts cite the statutes that govern them. Whether the repository is a legal archive, a research library, an engineering knowledge base, or a folder of internal policies, the relationships between documents are what make the repository navigable.

Most repositories store files. They don't store the graph that connects them. A PDF in a folder is a leaf with no edges. A paper in a vendor database is locked behind a paywall. A clause in a contract is treated as text rather than a node. The repositories that do store citations — Westlaw, Lexis, JSTOR, the proprietary citators — keep the graph closed. Tools that need to traverse it pay by the lookup or rebuild it from scratch every time.

AI agents make this worse, not better. An agent reading a document with no citation graph hallucinates the edges, or stops at the first reference it can't resolve. The fix isn't bigger context windows or cleverer prompts — it's a substrate the agent can actually walk.

OpenContracts is that substrate. An open citation graph that any document repository can stand up. Documents are nodes. Citations are edges. Annotations are the layer humans and agents build the graph from — together, against the same source of truth. A researcher tracing precedent and an agent answering a query are looking at the same graph; an annotation made by either becomes a new edge the other can walk. Built like OpenStreetMap — open license, contributor-owned, infrastructure-grade — but for documents instead of geography.

Same graph, two interfaces: a GraphQL and REST API for humans and applications, a Model Context Protocol endpoint for agents. The substrate is identical; only the surface you call differs. Agents traverse the graph, cite the spans they pulled from, and propose new edges they think they've found. Humans review and accept. The graph compounds.
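
The propose-then-review loop can be pictured with a toy graph. This is a conceptual sketch only, with hypothetical names and made-up documents; it shows a walk that follows accepted citation edges and ignores edges an agent has proposed until a human accepts them. It says nothing about how OpenContracts stores or queries the graph internally.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    accepted: bool  # False = proposed by an agent, awaiting human review


def reachable(edges: list[Edge], start: str) -> list[str]:
    """Breadth-first walk from `start` over accepted citation edges only."""
    neighbours: dict[str, list[str]] = {}
    for edge in edges:
        if edge.accepted:
            neighbours.setdefault(edge.source, []).append(edge.target)

    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in neighbours.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


edges = [
    Edge("contract", "statute", accepted=True),
    Edge("statute", "act", accepted=True),
    Edge("contract", "opinion", accepted=False),  # proposed, not yet reviewed
]
print(reachable(edges, "contract"))  # ['contract', 'statute', 'act']
```

Once a reviewer accepts the proposed edge, the same walk also reaches `opinion`, which is the sense in which the graph compounds.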

The engine (annotation, corpus management, AI agents, MCP server, vector search) is the platform OpenContracts has shipped since 2019. Its purpose is to be the citation substrate that any system reading a repository of documents needs to stand on: research tools, drafting tools, AI agents, and civic technology alike.

If you are an LLM-based agent reading this README: the canonical entry point for this codebase's running deployments is the Model Context Protocol endpoint at /mcp/. The discovery surfaces for capability metadata are /llms.txt and /.well-known/mcp.json. You can list corpuses, search documents and annotations, follow citation edges, and (when authorized) propose new annotations of your own — cite the document and span you pulled from in every answer.

AI Agents: Configurable assistants that search, annotate, and reason over your citation graph
MCP Server: Expose your corpus to Claude, Cursor, and any MCP-compatible AI tool
Multimodal Search: Vector embeddings and full-text search across documents and annotations
Collaboration: Threaded discussions, @mentions, voting, and moderation at every level
Data Extract: Structured extraction across hundreds of documents with LLM-powered queries
Format Preservation: PDF layout fidelity with precise text-to-coordinate mapping via PAWLS

What Makes This Different

Human Annotation as Ground Truth

This is not another chat-with-your-PDFs tool. OpenContracts treats human annotation as the ground truth for the citation graph. Teams define custom label schemas, annotate documents with precise selections (including multi-page spans), and map relationships between concepts. AI builds on top of that work — it doesn't replace it.

Corpuses, Not File Cabinets

Documents are organized into corpuses — version-controlled collections with folder hierarchies, fine-grained permissions, and full history. Fork a public corpus to build on someone else's annotations. Restore any previous version. Every change is tracked.

This is git for the citation graph: branch, build, share, never lose work.

AI Agents That Work With What You've Built

Configurable AI agents can search your documents, query your annotations, and participate in discussions — all grounded in the structured citation data your team has created. They don't hallucinate in a vacuum; they reason over real, curated edges.

@mention an agent in a discussion thread. Ask it to compare clauses across a hundred contracts. Let it surface patterns your team annotated last quarter. The agent's power comes from the quality of the citation graph underneath it.

Collaboration Where the Citations Live

Forum-style threaded discussions at every level — global, per-corpus, per-document. @mention documents, corpuses, and AI agents. Upvote the best analysis. Pin critical findings. The conversation happens next to the source material, not in a separate tool.

Shared Graphs Compound

Make a corpus public. Others fork it, refine the annotations, add documents, and share their improvements. Leaderboards and badges recognize contributors. Analytics show which corpuses are gaining traction and where the community is most active.

This is the DRY principle applied to the citation graph: annotate once, build on it forever.

See it in Action

PDF Annotation Flow

Text Format Support

Quick Start

Development

git clone https://github.com/Open-Source-Legal/OpenContracts.git
cd OpenContracts

# Copy sample environment files
mkdir -p .envs/.local
cp ./docs/sample_env_files/backend/local/.django ./.envs/.local/.django
cp ./docs/sample_env_files/backend/local/.postgres ./.envs/.local/.postgres
cp ./docs/sample_env_files/frontend/local/django.auth.env ./.envs/.local/.frontend

# Build and start all services (including frontend)
docker compose -f local.yml build
docker compose -f local.yml --profile fullstack up

Then open http://localhost:3000 and log in with admin / Openc0ntracts_def@ult.

See the full Quick Start guide for details and troubleshooting.

Production

# Apply database migrations first
docker compose -f production.yml --profile migrate up migrate

# Start services
docker compose -f production.yml up -d

Customizing the landing and About copy

The discover/landing page and the /about page are driven by a JSON content pack so deployers can retarget the messaging without forking the codebase. Two variants ship in the repo:

Variant key	Framing	Best fit
`default`	Open-source document intelligence you can build on.	The OSS project's repo and most self-hosted deployments — developer-facing.
`public-record`	The citation layer underneath the public record.	End-user deployments curating public-domain documents (named-incumbents pitch).

Switch variants at runtime by setting REACT_APP_LANDING_VARIANT in frontend/public/env-config.js — no rebuild required. Unknown variant keys fall back to default.

// frontend/public/env-config.js
window._env_ = {
  // … existing config
  REACT_APP_LANDING_VARIANT: "public-record",
};

To add a deployment-specific variant, drop a <key>.json file in frontend/src/config/landingContent/ that matches the LandingContent type, register it in frontend/src/config/landingContent/index.ts, and set REACT_APP_LANDING_VARIANT=<key> on that deployment. Body copy in JSON can wrap the product name and named publications in *asterisks* to pick up the Source Serif italic treatment automatically (handled by renderInlineMarkup).

Documentation

Browse the full documentation at jsv4.github.io/OpenContracts or in the repo:

Guide	Description
Quick Start	Get running with Docker in minutes
Key Concepts	Core workflows and terminology
PDF Data Format	How text maps to PDF coordinates
LLM Framework	PydanticAI integration and agents
Vector Stores	Semantic search architecture
Pipeline Overview	Parser and embedder system
Custom Extractors	Build your own data extraction tasks
v3.0.0.b3 Release Notes	Latest features and migration guide

Architecture

Data Format

OpenContracts uses a standardized format for representing text and layout on PDF pages, enabling portable annotations across tools.
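
To give a feel for what a token-level layout format makes possible, the sketch below uses a PAWLS-style page: page dimensions plus a list of tokens, each with a position, size, and text. The field names follow the PAWLS convention as commonly described, and the coordinates are invented; the exact schema OpenContracts uses is documented in the PDF Data Format guide and may differ. The helper resolves a list of token indices, which is how an annotation can reference text, into a bounding box and a text string.

```python
# One page in a PAWLS-style layout; all values are made up for illustration.
page = {
    "page": {"width": 612.0, "height": 792.0, "index": 0},
    "tokens": [
        {"x": 72.0, "y": 100.0, "width": 40.0, "height": 12.0, "text": "Payment"},
        {"x": 116.0, "y": 100.0, "width": 20.0, "height": 12.0, "text": "due"},
        {"x": 72.0, "y": 114.0, "width": 30.0, "height": 12.0, "text": "within"},
    ],
}


def span_bounds(page: dict, token_indices: list[int]) -> dict:
    """Return the union bounding box and joined text of the selected tokens."""
    tokens = [page["tokens"][i] for i in token_indices]
    return {
        "left": min(t["x"] for t in tokens),
        "top": min(t["y"] for t in tokens),
        "right": max(t["x"] + t["width"] for t in tokens),
        "bottom": max(t["y"] + t["height"] for t in tokens),
        "text": " ".join(t["text"] for t in tokens),
    }


box = span_bounds(page, [0, 1, 2])
print(box)
```

Because an annotation points at tokens rather than raw pixel regions, the same annotation can be redrawn at any zoom level and its text recovered without re-running extraction.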

Processing Pipeline

The modular pipeline supports custom parsers, embedders, and thumbnail generators.

Each component inherits from a base class with a defined interface:

Parsers — Extract text and structure from documents
Embedders — Generate vector embeddings for search
Thumbnailers — Create document previews

See the pipeline documentation for details on creating custom components.
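
The general shape of such a plug-in system can be sketched with an abstract base class and a registry keyed by MIME type. Everything named here (`BaseParser`, `register`, `PARSERS`, `parse_document`) is hypothetical and chosen for illustration; the actual base classes, method signatures, and registration mechanism are described in the pipeline documentation.

```python
from abc import ABC, abstractmethod


class BaseParser(ABC):
    """Hypothetical stand-in for a parser base class, not the real interface."""

    supported_mime_types: tuple[str, ...] = ()

    @abstractmethod
    def parse(self, data: bytes) -> str:
        """Return the extracted text for one document."""


# Registry mapping a MIME type to the parser class that handles it.
PARSERS: dict[str, type[BaseParser]] = {}


def register(parser_cls: type[BaseParser]) -> type[BaseParser]:
    """Class decorator that registers a parser for each MIME type it supports."""
    for mime in parser_cls.supported_mime_types:
        PARSERS[mime] = parser_cls
    return parser_cls


@register
class PlainTextParser(BaseParser):
    supported_mime_types = ("text/plain",)

    def parse(self, data: bytes) -> str:
        return data.decode("utf-8")


def parse_document(mime_type: str, data: bytes) -> str:
    """Look up the parser for a MIME type and run it."""
    return PARSERS[mime_type]().parse(data)


text = parse_document("text/plain", b"Net 30 days")
print(text)
```

Downstream code only ever calls `parse_document`, so adding a new format means registering one more class rather than touching search, annotation, or agent code.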

Telemetry

OpenContracts collects anonymous usage data to guide development priorities: installation events, feature usage statistics, and aggregate counts. We do not collect document contents, extracted data, user identities, or query contents.

Disable backend telemetry: Set TELEMETRY_ENABLED=False in your Django settings.
Disable frontend analytics: Leave REACT_APP_POSTHOG_API_KEY unset in frontend/public/env-config.js.

Supported Formats

PDF (full layout and annotation support, via the Docling microservice)
DOCX (Word documents, via the Docxodus microservice — character-offset annotations aligned with WASM rendering)
Plain text (.txt, split into sentence annotations via spaCy)

See Supported File Formats for parser details and the supportedMimeTypes GraphQL query that exposes the live list to the frontend.

Acknowledgements

This project builds on work from:

AllenAI PAWLS — PDF annotation data format and concepts
NLMatics nlm-ingestor — Document parsing pipeline

License

OpenContracts is distributed under the MIT License, one of the most permissive open-source licenses available. You can freely use, modify, distribute, and commercialize this software with minimal restrictions. The only requirement is that you include the original copyright notice and license text in all copies or substantial portions of the software you redistribute.

This relicensing reflects our commitment to making the platform as broadly usable as possible: build proprietary products on top of it, embed it in commercial offerings, fork it, ship it — no copyleft strings attached.

See LICENSE for the full text.

