Reference image library for Space Invader mosaics where we know the official identifier, such as AIX_05 or AVI_13.
Store:
- several reference images per invader from different angles
- stable metadata for each invader
- provenance for each image and identifier match
references/
  AIX/
    AIX_01/
      metadata.json
      sources/
      images/
    AIX_02/
  AVI/
metadata/
  places.json
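
The mapping from an identifier to its directory in this tree can be sketched as follows. This is a minimal illustration, not the project's own code: the function names are hypothetical, and the identifier pattern (uppercase place code, underscore, digits) is an assumption inferred from examples such as AIX_05 and AVI_13.

```typescript
// Hypothetical helpers; the ID pattern is inferred from examples like AIX_05.
interface InvaderId {
  placeId: string; // e.g. "AIX"
  number: number; // e.g. 5
}

// Assumption: an uppercase place code, an underscore, then digits.
const INVADER_ID_PATTERN = /^([A-Z]+)_(\d+)$/;

function parseInvaderId(id: string): InvaderId | null {
  const match = INVADER_ID_PATTERN.exec(id);
  if (match === null) return null;
  return { placeId: match[1], number: Number(match[2]) };
}

// Builds the references/<CITY>/<ID> path used by the layout above.
function invaderDir(id: string): string | null {
  const parsed = parseInvaderId(id);
  return parsed === null ? null : `references/${parsed.placeId}/${id}`;
}
```

If real place codes contain other characters, the pattern would need to be loosened accordingly.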
Each invader directory should contain a metadata.json file with:
{
  "place_id": "AIX",
  "invader_id": "AIX_01",
  "city": "Aix-en-Provence",
  "country": "France",
  "status": "confirmed",
  "sources": [],
  "images": []
}

- Prefer primary or near-primary sources first:
  - official Space Invader pages
  - Spotter Invader
  - Instagram posts with explicit identifier confirmation
- Keep source attribution for every image.
- Do not rename downloaded originals destructively; store the original file and record normalized metadata separately.
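
The metadata.json fields listed earlier could be checked with something like the sketch below. The field names come from the example; the validation rules (non-empty strings, arrays for sources and images, invader_id prefixed by place_id) and the function name are assumptions, and the entries inside sources and images are left untyped because their shape is not specified here.

```typescript
// Field names follow the metadata.json example; the checks are assumptions.
interface InvaderMetadata {
  place_id: string;
  invader_id: string;
  city: string;
  country: string;
  status: string;
  sources: unknown[];
  images: unknown[];
}

const STRING_FIELDS = ["place_id", "invader_id", "city", "country", "status"] as const;
const ARRAY_FIELDS = ["sources", "images"] as const;

// Returns a list of problems; an empty list means these checks passed.
function validateMetadata(value: unknown): string[] {
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return ["metadata must be a JSON object"];
  }
  const record = value as Record<string, unknown>;
  const problems: string[] = [];
  for (const field of STRING_FIELDS) {
    if (typeof record[field] !== "string" || record[field] === "") {
      problems.push(`${field} must be a non-empty string`);
    }
  }
  for (const field of ARRAY_FIELDS) {
    if (!Array.isArray(record[field])) problems.push(`${field} must be an array`);
  }
  const placeId = record["place_id"];
  const invaderId = record["invader_id"];
  if (
    typeof placeId === "string" &&
    typeof invaderId === "string" &&
    !invaderId.startsWith(`${placeId}_`)
  ) {
    problems.push("invader_id must start with place_id followed by an underscore");
  }
  return problems;
}

const example: InvaderMetadata = {
  place_id: "AIX",
  invader_id: "AIX_01",
  city: "Aix-en-Provence",
  country: "France",
  status: "confirmed",
  sources: [],
  images: [],
};
const exampleProblems = validateMetadata(example);
```

Returning a list of problems rather than throwing keeps the check usable from an audit that wants to report every issue in one pass.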
Project planning and status tracking for the reference-corpus automation work live in:
- docs/ref-corpus-automation-plan.md
- docs/ref-corpus-status.md
First pass for the spotter site city index:
npm run scrape:cities
This writes:
data/cities.json
Per-city listing scrape:
npm run scrape:city -- AIX
This writes:
data/cities/AIX.json
To also sync the scraped invaders into references/<CITY>/...:
npm run scrape:city -- AIX --sync-references
To additionally download the referenced spotter images into each invader directory:
npm run scrape:city -- AIX --sync-references --download-images
Daily automation for new invaders:
npm run daily:new-mosaics
This first parses news.php and looks for recent green mosaic IDs (a.ok) to target only likely new additions, then scrapes only the impacted city tails. If news parsing fails, it falls back to the city-count delta strategy. The script writes a report to tmp/daily-new-mosaics-report.json.
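
The news-first strategy with its fallback can be sketched roughly as below. All names are hypothetical and the script's real internals are not shown in this README. In particular, treating the city-count delta as "cities whose live count exceeds the stored count" is an assumption about what that strategy compares.

```typescript
// All names are hypothetical; this only illustrates the control flow.
type Discovery = { strategy: "news" | "city-delta"; cities: string[] };

// Collapse mosaic IDs such as "AIX_05" into the sorted set of city codes.
function impactedCities(mosaicIds: string[]): string[] {
  const cities = new Set<string>();
  for (const id of mosaicIds) {
    const separator = id.lastIndexOf("_");
    if (separator > 0) cities.add(id.slice(0, separator));
  }
  return Array.from(cities).sort();
}

// Assumed meaning of the delta strategy: live count is higher than stored.
function citiesWithCountDelta(
  stored: Record<string, number>,
  live: Record<string, number>,
): string[] {
  return Object.keys(live)
    .filter((city) => live[city] > (stored[city] ?? 0))
    .sort();
}

function discover(
  parseNews: () => string[],
  stored: Record<string, number>,
  live: Record<string, number>,
  newsEnabled = true,
): Discovery {
  if (newsEnabled) {
    try {
      return { strategy: "news", cities: impactedCities(parseNews()) };
    } catch {
      // News parsing failed: fall through to the city-count delta strategy.
    }
  }
  return { strategy: "city-delta", cities: citiesWithCountDelta(stored, live) };
}
```

Passing the news parser in as a function keeps the fallback path easy to exercise without any network access.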
You can inspect the raw news parser output directly:
npm run discover:news -- --max-days=10
To force the old city-delta strategy:
npm run daily:new-mosaics -- --disable-news-discovery
The scheduled GitHub Action in .github/workflows/daily-new-mosaics.yml uploads the new grosplan images to R2, commits the updated reference tree back to the canonical repo, and sends an ntfy notification when something new is found.
For a notification-only test run, set the workflow dispatch input test_notification=true and override ntfy_url so that it points at a disposable ntfy topic.
Instagram tag pages currently redirect anonymous requests to login. To test a saved logged-in browser profile:
- Open a persistent Playwright browser and log in manually:
  npm run instagram:login -- https://www.instagram.com/explore/tags/aix_06/
- After login, press Enter in the terminal to save the profile.
- Reuse that saved session in headless mode to probe a tag page:
  npm run instagram:test -- https://www.instagram.com/explore/tags/aix_06/
The browser profile is stored under profiles/instagram/ and is ignored by git.
To scrape post links and media for a tag using the saved session:
npm run instagram:scrape -- aix_06 --limit=12 --download
This writes results under data/instagram/<tag>/.
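
The commands above use aix_06 as the tag, which looks like the lowercase form of the identifier AIX_06. Assuming that convention holds for every invader, the tag, tag URL, and output directory could be derived as in this sketch (the function names are hypothetical):

```typescript
// Assumption: the Instagram tag is the lowercase invader identifier.
function instagramTag(invaderId: string): string {
  return invaderId.toLowerCase();
}

// Tag page URL in the same form as the commands above.
function instagramTagUrl(invaderId: string): string {
  const tag = encodeURIComponent(instagramTag(invaderId));
  return `https://www.instagram.com/explore/tags/${tag}/`;
}

// Output directory following the data/instagram/<tag>/ convention.
function instagramOutputDir(invaderId: string): string {
  return `data/instagram/${instagramTag(invaderId)}/`;
}
```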
Run a reference-library audit to measure:
- city coverage against the scraped city list
- live-feed overlap against collect-si-live-data
- per-invader metadata and asset quality issues
npm run audit:library
This writes:
data/audits/reference-library-audit.json
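
As an illustration of the first audit measure, city coverage can be computed by comparing the scraped city codes with the city directories present under references/. This is a sketch under assumptions: the function name and report shape are hypothetical and may not match the fields in the generated audit file.

```typescript
// Hypothetical report shape; the real audit JSON may use other fields.
interface CoverageReport {
  covered: string[];
  missing: string[];
  ratio: number;
}

function cityCoverage(scrapedCities: string[], referenceCities: string[]): CoverageReport {
  const present = new Set(referenceCities);
  const covered = scrapedCities.filter((city) => present.has(city));
  const missing = scrapedCities.filter((city) => !present.has(city));
  // With no scraped cities the ratio is defined as 0 here to avoid dividing by zero.
  const ratio = scrapedCities.length === 0 ? 0 : covered.length / scrapedCities.length;
  return { covered, missing, ratio };
}
```

For example, `cityCoverage(["AIX", "AVI", "PA"], ["AIX", "AVI"])` lists PA as missing.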
To send an ntfy notification from a generated report:
npm run notify:ntfy -- --report-path tmp/daily-new-mosaics-report.json