Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
65 changes: 24 additions & 41 deletions .github/workflows/weekly-index.yml
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@ name: Weekly Index
on:
schedule:
- cron: '0 3 * * 1' # Every Monday at 03:00 UTC
workflow_dispatch: # Manual trigger via GitHub UI
workflow_dispatch:

jobs:
index:
Expand All @@ -27,26 +27,31 @@ jobs:
with:
ref: main

# Add the heroku remote (it doesn't exist in a fresh Actions checkout)
- name: Add Heroku remote
- name: Install Heroku CLI
run: curl https://cli-assets.heroku.com/install.sh | sh
Comment on lines +30 to +31
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Installing the Heroku CLI via curl ... | sh executes a remote script without verification, which is a supply-chain risk. Prefer a pinned install method (e.g., package manager with version pinning, or downloading a specific release artifact and verifying its checksum/signature).

Suggested change
- name: Install Heroku CLI
run: curl https://cli-assets.heroku.com/install.sh | sh
- name: Set up Node.js
uses: actions/setup-node@v4
with:
node-version: '20'
- name: Install Heroku CLI
run: npm install --global heroku@9.3.0

Copilot uses AI. Check for mistakes.

# Pull chroma/ from the live Docker image so the pipeline runs incrementally
- name: Extract ChromaDB from current Heroku image
env:
HEROKU_API_KEY: ${{ secrets.HEROKU_API_KEY }}
run: |
git remote add heroku "https://heroku:$HEROKU_API_KEY@git.heroku.com/physlibsearch.git"

# Restore chroma/ and the SHA tracker from the heroku git remote
- name: Restore ChromaDB and SHA tracker
run: |
git fetch heroku main
git checkout heroku/main -- chroma/ .last-physlib-sha 2>/dev/null \
|| echo "No prior index on heroku/main — starting fresh."
heroku container:login
docker pull registry.heroku.com/physlibsearch/web || echo "No existing image — starting fresh."
CID=$(docker create registry.heroku.com/physlibsearch/web 2>/dev/null) || true
Comment on lines +39 to +40
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docker pull ... || echo "No existing image" treats any pull failure (network hiccup, rate limit, registry outage) as "no existing image" and proceeds with a fresh index, which can hide operational issues and cause unexpectedly long runs. Consider failing on pull errors except for the specific "manifest unknown"/"not found" case.

Suggested change
docker pull registry.heroku.com/physlibsearch/web || echo "No existing image — starting fresh."
CID=$(docker create registry.heroku.com/physlibsearch/web 2>/dev/null) || true
CID=""
PULL_OUTPUT=$(docker pull registry.heroku.com/physlibsearch/web 2>&1)
PULL_STATUS=$?
if [ "$PULL_STATUS" -eq 0 ]; then
echo "$PULL_OUTPUT"
CID=$(docker create registry.heroku.com/physlibsearch/web 2>/dev/null) || true
elif printf '%s\n' "$PULL_OUTPUT" | grep -qiE 'manifest unknown|not found'; then
echo "No existing image — starting fresh."
else
echo "$PULL_OUTPUT"
exit "$PULL_STATUS"
fi

Copilot uses AI. Check for mistakes.
if [ -n "$CID" ]; then
docker cp "$CID:/app/chroma" . 2>/dev/null || echo "No chroma/ in image — starting fresh."
docker rm "$CID"
fi
Comment on lines +33 to +44
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ChromaDB extraction step runs unconditionally, even when PhysLib hasn't changed and the rest of the job is skipped. This adds a potentially large docker pull to every scheduled run; consider moving the PhysLib change check earlier (or gating the extraction step on has_changes) so the workflow can exit quickly when there's nothing to update.

Copilot uses AI. Check for mistakes.

# Skip everything if PhysLib has not changed since the last run
# Check if PhysLib has changed since the last successful run.
# The last SHA is stored as a Heroku config var to avoid git commits.
- name: Check PhysLib for new commits
id: check
env:
HEROKU_API_KEY: ${{ secrets.HEROKU_API_KEY }}
run: |
CURRENT_SHA=$(git ls-remote "$PHYSLIB_REPO" HEAD | cut -f1)
Comment on lines 52 to 53
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CURRENT_SHA=$(git ls-remote ... | cut -f1) can silently produce an empty SHA if git ls-remote fails, because without pipefail the pipeline exit status is from cut. That can make the workflow treat a transient network failure as a change and later persist an empty/incorrect LAST_PHYSLIB_SHA. Add set -o pipefail (or avoid the pipeline) and explicitly fail if CURRENT_SHA is empty.

Suggested change
run: |
CURRENT_SHA=$(git ls-remote "$PHYSLIB_REPO" HEAD | cut -f1)
run: |
set -o pipefail
CURRENT_SHA=$(git ls-remote "$PHYSLIB_REPO" HEAD | cut -f1)
if [ -z "$CURRENT_SHA" ]; then
echo "Failed to resolve current PhysLib HEAD SHA from $PHYSLIB_REPO" >&2
exit 1
fi

Copilot uses AI. Check for mistakes.
LAST_SHA=$(cat .last-physlib-sha 2>/dev/null || echo "")
LAST_SHA=$(heroku config:get LAST_PHYSLIB_SHA --app physlibsearch 2>/dev/null || echo "")
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

heroku config:get ... 2>/dev/null || echo "" suppresses all errors (including auth/permission issues) and treats them as "no prior SHA", which can trigger an unnecessary full reindex and then overwrite the stored SHA. Consider only defaulting to empty when the config var is genuinely missing, and otherwise failing the job so misconfiguration doesn't silently reset state.

Suggested change
LAST_SHA=$(heroku config:get LAST_PHYSLIB_SHA --app physlibsearch 2>/dev/null || echo "")
LAST_SHA_OUTPUT=$(heroku config:get LAST_PHYSLIB_SHA --app physlibsearch 2>/dev/null)
HEROKU_CONFIG_GET_STATUS=$?
if [ "$HEROKU_CONFIG_GET_STATUS" -ne 0 ]; then
echo "Failed to read LAST_PHYSLIB_SHA from Heroku app physlibsearch." >&2
exit "$HEROKU_CONFIG_GET_STATUS"
fi
LAST_SHA=$LAST_SHA_OUTPUT

Copilot uses AI. Check for mistakes.
echo "current_sha=$CURRENT_SHA" >> "$GITHUB_OUTPUT"
if [ "$CURRENT_SHA" = "$LAST_SHA" ]; then
echo "has_changes=false" >> "$GITHUB_OUTPUT"
Expand All @@ -67,7 +72,7 @@ jobs:
if: steps.check.outputs.has_changes == 'true'
run: pip install -r requirements.txt

# PhysLib must be cloned before the cache steps so hashFiles() can read lean-toolchain
# PhysLib must be cloned before cache steps so hashFiles() can read lean-toolchain
- name: Clone PhysLib
if: steps.check.outputs.has_changes == 'true'
run: git clone --depth 1 "$PHYSLIB_REPO" physlib
Expand All @@ -87,7 +92,7 @@ jobs:
key: physlib-lake-${{ steps.check.outputs.current_sha }}
restore-keys: physlib-lake-

- name: Install elan (Lean version manager)
- name: Install elan
if: steps.check.outputs.has_changes == 'true'
run: |
if ! command -v elan &>/dev/null; then
Expand All @@ -101,7 +106,7 @@ jobs:
run: cd physlib && lake build
timeout-minutes: 90

# jixia must be cloned before its cache step so hashFiles() can read its lakefile
# jixia must be cloned before its cache step
- name: Clone jixia
if: steps.check.outputs.has_changes == 'true'
run: git clone --depth 1 "$JIXIA_REPO" jixia
Expand All @@ -126,9 +131,7 @@ jobs:
TOOLCHAIN=$(cat physlib/lean-toolchain)
echo "LEAN_SYSROOT=$HOME/.elan/toolchains/$TOOLCHAIN" >> "$GITHUB_ENV"

# --- Incremental pipeline ---
# Each step skips rows/vectors that already exist, so only new theorems are processed.

# Incremental pipeline — each step skips already-processed items
- name: Create/update schema
if: steps.check.outputs.has_changes == 'true'
run: python3 -m database schema
Expand All @@ -145,32 +148,12 @@ jobs:
if: steps.check.outputs.has_changes == 'true'
run: python3 -m database vector-db --batch-size 8

# Persist the updated chroma/ and SHA tracker back to the heroku git remote
- name: Push updated index to Heroku git remote
if: steps.check.outputs.has_changes == 'true'
run: |
git config user.name "github-actions[bot]"
git config user.email "github-actions[bot]@users.noreply.github.com"
git worktree add /tmp/heroku-tree heroku/main
rsync -a --delete chroma/ /tmp/heroku-tree/chroma/
echo "${{ steps.check.outputs.current_sha }}" > /tmp/heroku-tree/.last-physlib-sha
cd /tmp/heroku-tree
git add -f chroma/ .last-physlib-sha
git diff --cached --quiet \
|| git commit -m "chore: weekly index $(date -u +%Y-%m-%d) [physlib ${{ steps.check.outputs.current_sha }}]"
git push heroku HEAD:main
cd -
git worktree remove /tmp/heroku-tree

# Rebuild and release the Docker image so the live app gets the new chroma/
- name: Install Heroku CLI
if: steps.check.outputs.has_changes == 'true'
run: curl https://cli-assets.heroku.com/install.sh | sh

# Rebuild the Docker image with the updated chroma/ and deploy
- name: Build and release Docker image
if: steps.check.outputs.has_changes == 'true'
env:
HEROKU_API_KEY: ${{ secrets.HEROKU_API_KEY }}
run: |
heroku container:push web --app physlibsearch
heroku container:release web --app physlibsearch
heroku config:set LAST_PHYSLIB_SHA=${{ steps.check.outputs.current_sha }} --app physlibsearch
Comment on lines 157 to +159
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Heroku app name (physlibsearch) is hard-coded in several commands (registry path, config:get, container:*). To reduce duplication and the chance of inconsistent edits later, consider defining it once as an env var (e.g., HEROKU_APP) and reusing it across steps.

Copilot uses AI. Check for mistakes.
Loading