A dbt project that turns GitHub Archive event data into clean, tested analytics tables: daily repo activity and contributor summaries. Built as a portfolio piece showing layered SQL modeling, data quality gates, and a path from local dev to BigQuery at scale.
The public GitHub Archive is multi-terabyte; most tutorials never touch real scale. This repo shows how to scope sensibly (days/months, not the full history), model event types separately in staging, dedupe midnight boundary duplicates, classify bot actors, and enforce probabilistic quality tests (mostly-style thresholds). It runs locally on DuckDB today and documents the exact switch to BigQuery + optional Python models when you enable cloud.
seeds/raw_github_events.csv ← local dev (BigQuery githubarchive.day.* when cloud on)
│
▼
stg_push_events / stg_pr_events / stg_watch_events (views)
│
▼
int_events_deduped → int_repo_names_normalized → int_events_with_keys (tables)
│
▼
int_actor_login_cleaned (table — SQL locally; Python optional on BigQuery)
│
├──► mart_daily_repo_activity
└──► mart_contributor_summary
Lineage screenshot: After
dbt docs generate, rundbt docs serveand capture the DAG for your portfolio README.
From docs/exploration.md:
| Metric | Value |
|---|---|
| Raw rows | 21 (20 unique events after dedup) |
| Event mix | 76% Push, 10% PR, 14% Watch |
| Bot-like actors | 3 (14% of all events) |
| Invalid repo slugs | 1 (5%) |
| Date span | 2024-01-01 → 2024-01-03 |
Designed to scale to 3–6 months of githubarchive.day.* partitions without redesign.
cd C:\Users\gejuj\dbt-github-insights
py -3.11 -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
$env:DBT_PROFILES_DIR = (Get-Location).Path
copy profiles.yml.example profiles.yml # if missing
dbt deps
dbt seed
dbt run
dbt test
dbt docs generate
dbt docs serve- Strict:
unique,not_null, regex onevent_id, date ranges - Probabilistic: singular tests
assert_actor_login_mostly_not_null(≥98%) andassert_push_commit_count_mostly_in_range(≥99.5%) — same production intent asmostly:in dbt-expectations - Volume: table row count bounds on
stg_push_events(relaxed for local seed; tighten for BigQuery)
Run dbt test and save all-green output for your portfolio.
| Decision | Why |
|---|---|
| DuckDB + seed first | Full DAG and tests without GCP billing, keys, or network during development. |
mostly via singular tests |
Current dbt_expectations + dbt 1.11 macro API rejects inline mostly:; singular SQL tests preserve the probabilistic DQ story. |
| SQL actor cleaning locally | Same regex semantics as the planned Python model; avoids BigQuery Python runtime on Windows. Restore py_actor_login_cleaned.py on cloud for the SQL+Python hybrid narrative. |
| Intermediate as tables, staging as views | Cheap fresh staging; materialize heavier transforms once. |
| One staging model per event type | Clear lineage and simpler downstream joins. |
Repo name dbt-github-insights |
Public portfolio repo (plan suggested dbt-github-archive; same purpose). |
models/
sources.yml
staging/ stg_*_events.sql + .yml
intermediate/ int_* + int_actor_login_cleaned.sql
marts/ mart_*.sql
tests/ singular mostly-* tests
seeds/ raw_github_events.csv
docs/ exploration.md, PHASE_STATUS.md, bigquery_staging_snippets.md
See docs/PHASE_STATUS.md for per-phase completion.
- Phases 2–4, 6–7, 9 (local)
- Phase 1 profile (local seed)
- Phase 0 GCP (skipped — optional)
- Phase 5 Python model on BigQuery (documented, not required locally)
- Phase 8 lineage screenshot in README
- Follow
profiles.yml.bigquery.example - Use SQL in
docs/bigquery_staging_snippets.md - Run exploration queries in
docs/exploration.mdand update the profile table - Optional: add
models/intermediate/py_actor_login_cleaned.pyper original Phase 5 spec
Portfolio project. GitHub Archive data subject to GitHub Terms of Service.