176 changes: 176 additions & 0 deletions .github/workflows/weekly-index.yml
@@ -0,0 +1,176 @@
name: Weekly Index

on:
  schedule:
    - cron: '0 3 * * 1'  # Every Monday at 03:00 UTC
  workflow_dispatch:     # Manual trigger via GitHub UI

jobs:
  index:
    runs-on: ubuntu-latest

    env:
      PHYSLIB_REPO: https://github.com/leanprover-community/physlib
      JIXIA_REPO: https://github.com/frenzymath/jixia
      MODULE_NAMES: Physlib
      DRY_RUN: 'false'
      CHROMA_PATH: chroma
      CONNECTION_STRING: ${{ secrets.DATABASE_URL }}
      GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
      GEMINI_MODEL: ${{ vars.GEMINI_MODEL || 'gemini-2.5-flash-preview-04-17' }}
      GEMINI_FAST_MODEL: ${{ vars.GEMINI_FAST_MODEL || 'gemini-2.5-flash-preview-04-17' }}
      GEMINI_EMBEDDING_MODEL: ${{ vars.GEMINI_EMBEDDING_MODEL || 'gemini-embedding-2-preview' }}
Copilot AI Apr 22, 2026

This workflow runs python3 -m database ..., which executes database/__main__.py. That module currently requires LOG_FILENAME, LOG_FILEMODE, and LOG_LEVEL to be present in the environment (it uses os.environ[...]), but they are not set here, so the job will crash with KeyError. Either set these env vars in the workflow (e.g., LOG_LEVEL: WARNING, etc.) or make database/__main__.py use os.environ.get(...) defaults.

Suggested change
GEMINI_EMBEDDING_MODEL: ${{ vars.GEMINI_EMBEDDING_MODEL || 'gemini-embedding-2-preview' }}
GEMINI_EMBEDDING_MODEL: ${{ vars.GEMINI_EMBEDDING_MODEL || 'gemini-embedding-2-preview' }}
LOG_FILENAME: weekly-index.log
LOG_FILEMODE: a
LOG_LEVEL: WARNING
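
Alternatively, the fix can live on the Python side. A minimal sketch of that variant (assuming database/__main__.py feeds these three variables straight into logging.basicConfig; the default values here are illustrative, not from the PR):

import logging
import os

# Fall back to defaults instead of raising KeyError when the workflow
# does not define the LOG_* variables.
logging.basicConfig(
    filename=os.environ.get("LOG_FILENAME", "weekly-index.log"),
    filemode=os.environ.get("LOG_FILEMODE", "a"),
    level=os.environ.get("LOG_LEVEL", "WARNING"),
)

Either approach works: setting the variables in the workflow keeps the module's current strictness, while the os.environ.get variant makes the module forgiving for every caller.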


    steps:
      - name: Checkout main
        uses: actions/checkout@v4
        with:
          ref: main

      # Authenticate the heroku remote so we can read/write chroma/ storage
      - name: Authenticate Heroku remote
        env:
          HEROKU_API_KEY: ${{ secrets.HEROKU_API_KEY }}
        run: |
          git remote set-url heroku "https://heroku:$HEROKU_API_KEY@git.heroku.com/physlibsearch.git"
Copilot AI Apr 22, 2026

git remote set-url heroku ... will fail on a fresh Actions checkout because there is no heroku remote by default. Add the remote if it doesn't exist (e.g., git remote add heroku ... || git remote set-url heroku ...) so subsequent git fetch heroku main works reliably.

Suggested change
-          git remote set-url heroku "https://heroku:$HEROKU_API_KEY@git.heroku.com/physlibsearch.git"
+          git remote add heroku "https://heroku:$HEROKU_API_KEY@git.heroku.com/physlibsearch.git" \
+            || git remote set-url heroku "https://heroku:$HEROKU_API_KEY@git.heroku.com/physlibsearch.git"


      # Restore chroma/ and the SHA tracker from the heroku git remote
      - name: Restore ChromaDB and SHA tracker
        run: |
          git fetch heroku main
          git checkout heroku/main -- chroma/ .last-physlib-sha 2>/dev/null \
            || echo "No prior index on heroku/main — starting fresh."
Comment on lines +40 to +42
Copilot AI Apr 22, 2026

git fetch heroku main will hard-fail the workflow if the Heroku remote/branch doesn't exist yet (e.g., first run or app not initialized), even though later steps try to handle a missing prior index. Consider making the fetch/checkout resilient (e.g., tolerate missing branch and proceed with an empty chroma/ and no .last-physlib-sha) so the workflow can bootstrap from scratch.

Suggested change
-          git fetch heroku main
-          git checkout heroku/main -- chroma/ .last-physlib-sha 2>/dev/null \
-            || echo "No prior index on heroku/main — starting fresh."
+          if git fetch heroku main; then
+            git checkout heroku/main -- chroma/ .last-physlib-sha 2>/dev/null || {
+              echo "No prior index files on heroku/main — starting fresh."
+              mkdir -p "$CHROMA_PATH"
+              rm -f .last-physlib-sha
+            }
+          else
+            echo "No heroku/main branch available yet — starting fresh."
+            mkdir -p "$CHROMA_PATH"
+            rm -f .last-physlib-sha
+          fi


      # Skip everything if PhysLib has not changed since the last run
      - name: Check PhysLib for new commits
        id: check
        run: |
          CURRENT_SHA=$(git ls-remote "$PHYSLIB_REPO" HEAD | cut -f1)
          LAST_SHA=$(cat .last-physlib-sha 2>/dev/null || echo "")
          echo "current_sha=$CURRENT_SHA" >> "$GITHUB_OUTPUT"
          if [ "$CURRENT_SHA" = "$LAST_SHA" ]; then
            echo "has_changes=false" >> "$GITHUB_OUTPUT"
            echo "PhysLib unchanged at $CURRENT_SHA — nothing to do."
          else
            echo "has_changes=true" >> "$GITHUB_OUTPUT"
            echo "PhysLib changed: $LAST_SHA -> $CURRENT_SHA"
          fi

      - name: Set up Python
        if: steps.check.outputs.has_changes == 'true'
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'
          cache: pip

      - name: Install Python dependencies
        if: steps.check.outputs.has_changes == 'true'
        run: pip install -r requirements.txt

      # PhysLib must be cloned before the cache steps so hashFiles() can read lean-toolchain
      - name: Clone PhysLib
        if: steps.check.outputs.has_changes == 'true'
        run: git clone --depth 1 "$PHYSLIB_REPO" physlib

      - name: Cache elan toolchains
        if: steps.check.outputs.has_changes == 'true'
        uses: actions/cache@v4
        with:
          path: ~/.elan
          key: elan-${{ hashFiles('physlib/lean-toolchain') }}

      - name: Cache PhysLib lake build
        if: steps.check.outputs.has_changes == 'true'
        uses: actions/cache@v4
        with:
          path: physlib/.lake/build
          key: physlib-lake-${{ steps.check.outputs.current_sha }}
          restore-keys: physlib-lake-

      - name: Install elan (Lean version manager)
        if: steps.check.outputs.has_changes == 'true'
        run: |
          if ! command -v elan &>/dev/null; then
            curl -sSf https://raw.githubusercontent.com/leanprover/elan/master/elan-init.sh \
              | sh -s -- -y --no-modify-path
          fi
          echo "$HOME/.elan/bin" >> "$GITHUB_PATH"

      - name: Build PhysLib
        if: steps.check.outputs.has_changes == 'true'
        run: cd physlib && lake build
        timeout-minutes: 90

      # jixia must be cloned before its cache step so hashFiles() can read its lakefile
      - name: Clone jixia
        if: steps.check.outputs.has_changes == 'true'
        run: git clone --depth 1 "$JIXIA_REPO" jixia

      - name: Cache jixia build
        if: steps.check.outputs.has_changes == 'true'
        uses: actions/cache@v4
        with:
          path: jixia/.lake/build
          key: jixia-${{ hashFiles('physlib/lean-toolchain') }}-${{ hashFiles('jixia/lakefile.lean', 'jixia/lakefile.toml') }}
          restore-keys: jixia-${{ hashFiles('physlib/lean-toolchain') }}-

      - name: Build jixia
        if: steps.check.outputs.has_changes == 'true'
        run: cd jixia && lake build
        timeout-minutes: 30

      - name: Set JIXIA_PATH and LEAN_SYSROOT
        if: steps.check.outputs.has_changes == 'true'
        run: |
          echo "JIXIA_PATH=$(pwd)/jixia/.lake/build/bin/jixia" >> "$GITHUB_ENV"
          TOOLCHAIN=$(cat physlib/lean-toolchain)
          echo "LEAN_SYSROOT=$HOME/.elan/toolchains/$TOOLCHAIN" >> "$GITHUB_ENV"

      # --- Incremental pipeline ---
      # Each step skips rows/vectors that already exist, so only new theorems are processed.

      - name: Create/update schema
        if: steps.check.outputs.has_changes == 'true'
        run: python3 -m database schema

      - name: Load jixia data into PostgreSQL
        if: steps.check.outputs.has_changes == 'true'
        run: python3 -m database jixia ./physlib "$MODULE_NAMES"

      - name: Informalize new declarations
        if: steps.check.outputs.has_changes == 'true'
        run: python3 -m database informal --batch-size 50

      - name: Embed new declarations into ChromaDB
        if: steps.check.outputs.has_changes == 'true'
        run: python3 -m database vector-db --batch-size 8

      # Persist the updated chroma/ and SHA tracker back to the heroku git remote
      - name: Push updated index to Heroku git remote
        if: steps.check.outputs.has_changes == 'true'
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git worktree add /tmp/heroku-tree heroku/main
          rsync -a --delete chroma/ /tmp/heroku-tree/chroma/
          echo "${{ steps.check.outputs.current_sha }}" > /tmp/heroku-tree/.last-physlib-sha
          cd /tmp/heroku-tree
          git add -f chroma/ .last-physlib-sha
          git diff --cached --quiet \
            || git commit -m "chore: weekly index $(date -u +%Y-%m-%d) [physlib ${{ steps.check.outputs.current_sha }}]"
          git push heroku HEAD:main
          cd -
          git worktree remove /tmp/heroku-tree

      # Rebuild and release the Docker image so the live app gets the new chroma/
      - name: Install Heroku CLI
        if: steps.check.outputs.has_changes == 'true'
        run: curl https://cli-assets.heroku.com/install.sh | sh

      - name: Build and release Docker image
        if: steps.check.outputs.has_changes == 'true'
        env:
          HEROKU_API_KEY: ${{ secrets.HEROKU_API_KEY }}
        run: |
          heroku container:push web --app physlibsearch
          heroku container:release web --app physlibsearch
37 changes: 24 additions & 13 deletions database/vector_db.py
@@ -14,16 +14,19 @@ def create_vector_db(conn: Connection, path: str, batch_size: int):

     client = chromadb.PersistentClient(path)
     try:
-        client.delete_collection("physlibsearch")
-        logger.info("deleted existing physlibsearch collection")
+        collection = client.get_collection(name="physlibsearch", embedding_function=None)
+        existing_ids = set(collection.get(include=[])["ids"])
+        logger.warning("using existing physlibsearch collection (%d vectors)", len(existing_ids))
Comment on lines +18 to +19
Copilot AI Apr 22, 2026

Building existing_ids via set(collection.get(include=[])["ids"]) loads every ID into memory up front; for large collections this can be slow and memory-intensive. Consider avoiding a full scan by checking existence in batch (e.g., per DB batch: call collection.get(ids=batch_ids, include=[]) and filter missing IDs) or by paging through IDs if you must precompute the set.
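
A sketch of the per-batch variant (hypothetical helper; it relies on the fact that collection.get(ids=...) returns only the IDs it actually finds):

def missing_ids(collection, candidate_ids: list[str]) -> list[str]:
    # Chroma returns just the subset of requested IDs that exist,
    # so the difference is exactly what still needs embedding.
    found = set(collection.get(ids=candidate_ids, include=[])["ids"])
    return [i for i in candidate_ids if i not in found]

Inside the fetchmany loop this replaces the global existing_ids lookup: compute the batch's vec_ids, call missing_ids, and embed only those.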

     except Exception:
-        pass
-    collection = client.create_collection(
-        name="physlibsearch",
-        metadata={"hnsw:space": "cosine"},
-        embedding_function=None,
-    )
+        collection = client.create_collection(
+            name="physlibsearch",
+            metadata={"hnsw:space": "cosine"},
+            embedding_function=None,
+        )
Comment on lines 16 to +25
Copilot AI Apr 22, 2026

except Exception here is too broad: if get_collection(...) succeeds but collection.get(...) fails (e.g., transient I/O / corruption / API change), the code falls into the except block and then calls create_collection(...), which will typically fail because the collection already exists. Split this into separate error handling (only create the collection when the specific "collection not found" error is raised), and surface unexpected errors instead of masking them.
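
One way to avoid classifying the exception at all is chromadb's get_or_create_collection, which makes the try/except unnecessary (a sketch; note the metadata only takes effect when the collection is first created):

collection = client.get_or_create_collection(
    name="physlibsearch",
    metadata={"hnsw:space": "cosine"},
    embedding_function=None,
)
# A transient failure in get() now propagates as an error instead of
# being silently misread as "collection does not exist".
existing_ids = set(collection.get(include=[])["ids"])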

+        existing_ids = set()
+        logger.warning("created new physlibsearch collection")
 
+    added = 0
     with conn.cursor() as cursor:
         cursor.execute("""
             SELECT s.name, d.module_name, d.index, s.kind, d.signature, s.type, i.name, i.description
@@ -40,12 +43,20 @@ def create_vector_db(conn: Connection, path: str, batch_size: int):
             for name, module_name, index, kind, signature, tp, informal_name, informal_description in batch:
                 if signature is None:
                     signature = tp
-                batch_doc.append(f"{kind} {name} {signature}\n{informal_name}: {informal_description}")
                 # NOTE: the space character is not used in names from Physlib and its dependencies
-                batch_id.append(" ".join(str(x) for x in name))
-                if os.environ["DRY_RUN"] == "true":
-                    logger.info("DRY_RUN:skipped embedding: %s", f"{kind} {name} {signature} {informal_name}")
+                vec_id = " ".join(str(x) for x in name)
+                if vec_id in existing_ids:
+                    continue
+                batch_doc.append(f"{kind} {name} {signature}\n{informal_name}: {informal_description}")
Comment on lines +47 to +50
Copilot AI Apr 22, 2026

Skipping any vec_id already present means you will never refresh embeddings for declarations whose content/signature/description changed across PhysLib commits, so the vector DB can become stale even though the workflow reruns on new commits. If updated embeddings are required, consider switching to upsert (always re-embed) or storing a content hash in Chroma metadata and only re-embedding when the hash differs.
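
A rough sketch of the content-hash variant (illustrative, not part of this PR): store a SHA-256 of each document in the vector's metadata and upsert whenever the hash changes.

import hashlib

doc = f"{kind} {name} {signature}\n{informal_name}: {informal_description}"
doc_hash = hashlib.sha256(doc.encode()).hexdigest()

# Re-embed only when the stored hash differs from the current content.
stored = collection.get(ids=[vec_id], include=["metadatas"])
if stored["ids"] and stored["metadatas"][0].get("doc_hash") == doc_hash:
    continue
batch_doc.append(doc)
batch_id.append(vec_id)
batch_meta.append({"doc_hash": doc_hash})

# ...and after embedding, replace collection.add(...) with:
# collection.upsert(embeddings=batch_embedding, ids=batch_id, metadatas=batch_meta)

The per-ID get adds one lookup per declaration; batching those lookups would offset the cost.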

+                batch_id.append(vec_id)
+            if not batch_doc:
+                continue
+            if os.environ["DRY_RUN"] == "true":
+                for doc in batch_doc:
+                    logger.info("DRY_RUN:skipped embedding: %s", doc)
+                return
             batch_embedding = embedding.embed(batch_doc)
             collection.add(embeddings=batch_embedding, ids=batch_id)
+            added += len(batch_id)
 
+    logger.warning("vector-db: added %d new vectors, %d already existed", added, len(existing_ids))
Copilot AI Apr 22, 2026

The final log line reports len(existing_ids) as "already existed", but that is the total pre-existing vector count, not the number actually skipped during this run. If you want accurate reporting, track a separate skipped counter (or compute skipped as total candidates - added) and log that instead.
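
A minimal sketch of the counter fix:

added, skipped = 0, 0
# ...inside the per-declaration loop:
if vec_id in existing_ids:
    skipped += 1
    continue
# ...after the loop, with added incremented per batch as before:
logger.warning("vector-db: added %d new vectors, skipped %d already indexed", added, skipped)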
