basic pipeline #77

Open
zehroque21 wants to merge 1 commit into sbms4d:main from zehroque21:solution/sql-test

@zehroque21

SQL test solution

Quick notes on how I approached each file and the trade-offs I hit along the way.

Setup

Ran everything against the trinodb/trino image exactly as the README says:

docker run --name=sexi-silverbullet -d trinodb/trino
docker exec sexi-silverbullet trino -f create_employees.sql
docker exec sexi-silverbullet trino -f create_expenses.sql
docker exec sexi-silverbullet trino -f create_invoices.sql
docker exec sexi-silverbullet trino -f find_manager_cycles.sql
docker exec sexi-silverbullet trino -f calculate_largest_expensors.sql
docker exec sexi-silverbullet trino -f generate_supplier_payment_plans.sql

All six files run clean from a fresh docker restart.

Data loading (create_employees / create_expenses / create_invoices)

One thing worth calling out: the default memory connector doesn't read files from disk, so I couldn't just point it at the CSV or the receipt .txt files. I inlined the rows as VALUES and kept the
original file name as a comment next to each row, so the mapping back to the source file is still obvious. If I had a hive/iceberg catalog available I'd have used CREATE TABLE ... WITH
(external_location = ...) instead. Happy to rewrite it that way if you set one up for the review.

Other decisions:

  • supplier_id is assigned with row_number() over (order by name) on the distinct supplier names so the alphabetical requirement is enforced in SQL, not by hand.
  • due_date is computed from CURRENT_DATE with last_day_of_month(date_add('month', N, current_date)), so the dataset still makes sense if someone runs it next month.
  • Expenses are joined back to employee by first + last name — a bit awkward, but the receipts don't carry an id and spelling out the join makes the mapping auditable.
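
For anyone who wants to sanity-check the first two decisions outside Trino, here is a minimal Python sketch of the same logic. It is not the SQL in this PR: the function names are made up for illustration, the supplier names other than Catering Plus are hypothetical, and Python's string ordering is only assumed to agree with Trino's for these names.

```python
import calendar
from datetime import date


def assign_supplier_ids(names):
    """Number the distinct supplier names 1..n in alphabetical order.

    Mirrors row_number() over (order by name) applied to distinct names.
    """
    return {name: i for i, name in enumerate(sorted(set(names)), start=1)}


def due_date(today, months_ahead):
    """Last day of the month that lies `months_ahead` months after `today`.

    Mirrors last_day_of_month(date_add('month', N, current_date)).
    """
    months = today.year * 12 + (today.month - 1) + months_ahead
    year, month = divmod(months, 12)
    month += 1  # back to a 1-based month number
    return date(year, month, calendar.monthrange(year, month)[1])


# Hypothetical supplier names, duplicated on purpose to show the dedup.
ids = assign_supplier_ids(["Party Animals", "Catering Plus", "Party Animals"])
print(ids)
print(due_date(date(2024, 1, 15), 1))
```

Because the due date is derived from the run date, rerunning the load in a later month shifts every due date forward by the same amount.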

find_manager_cycles

Recursive CTE that walks manager_id upwards and carries the visited ids in an array. The moment the next manager is already in the array, the path has closed on itself.

The dataset has exactly one cycle (Ian → Darren → Umberto → Ian), and every employee on that cycle would naturally produce a rotated version of the same loop. I rotate each detected loop
so it starts at the smallest employee_id, which collapses 1,4,2, 4,2,1 and 2,1,4 into a single canonical string and makes dedup trivial. Output: three rows, one per cycle member, all
sharing cycle_path = '1,4,2'.
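
The walk-and-rotate idea can be sketched in plain Python as below. This is an illustration of the approach rather than the recursive CTE itself: the function name is made up, the 1 → 4 → 2 → 1 links are inferred from the cycle_path above, and employees 3 and 5 are hypothetical rows added to show that non-members are left out.

```python
def find_manager_cycles(manager_of):
    """Return {employee_id: canonical cycle path} for employees on a cycle.

    `manager_of` maps employee_id -> manager_id (None when there is no manager).
    """
    cycles = {}
    for start in manager_of:
        path = [start]
        current = manager_of.get(start)
        # Walk upwards until the chain ends or revisits an id already on the path.
        while current is not None and current not in path:
            path.append(current)
            current = manager_of.get(current)
        # Only a walk that returns to its own start puts `start` on the loop;
        # an employee who merely reports into a loop is skipped.
        if current == start:
            pivot = path.index(min(path))
            rotated = path[pivot:] + path[:pivot]  # start at the smallest id
            cycles[start] = ",".join(str(i) for i in rotated)
    return cycles


# 3 reports into the loop and 5 has no manager (both hypothetical).
manager_of = {1: 4, 4: 2, 2: 1, 3: 1, 5: None}
print(find_manager_cycles(manager_of))
```

Rotating to the smallest id is what makes the three walks, which start at different members, produce the same string.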

The query throws a "stages exceeds soft limit" warning — it's a consequence of the recursive CTE pattern on Trino, not a correctness problem, and at this data size it runs instantly.

calculate_largest_expensors

Group by employee, having sum(unit_price * quantity) > 1000, left join back to employee to resolve the manager name (left so a missing manager doesn't silently drop the row). Sorted
descending by total. Only Alex Jacobson clears the threshold, at 1682.00.
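
A rough Python equivalent of the group / filter / left-join shape, for reviewers who prefer to read it procedurally. Everything in the sample data is hypothetical (it is not the receipts from this repo), and the function name and tuple layout are assumptions made for the sketch.

```python
from collections import defaultdict
from decimal import Decimal


def largest_expensors(expenses, employees, threshold=Decimal("1000")):
    """expenses: (employee_id, unit_price, quantity) rows.
    employees: {employee_id: (name, manager_id)}.
    Returns (employee_id, name, manager_name, total) rows, largest total first.
    """
    totals = defaultdict(Decimal)
    for emp_id, unit_price, quantity in expenses:
        totals[emp_id] += unit_price * quantity
    rows = []
    for emp_id, total in totals.items():
        if total > threshold:  # strictly greater, like HAVING ... > 1000
            name, manager_id = employees[emp_id]
            # A missing manager yields None instead of dropping the row,
            # which is the point of the LEFT JOIN.
            manager = employees.get(manager_id)
            rows.append((emp_id, name, manager[0] if manager else None, total))
    return sorted(rows, key=lambda r: r[3], reverse=True)


# Hypothetical data: employee 3 points at a manager id that does not exist.
employees = {1: ("A", 2), 2: ("B", None), 3: ("C", 99)}
expenses = [
    (1, Decimal("500.00"), 2),
    (1, Decimal("1.00"), 1),
    (2, Decimal("1000.00"), 1),
    (3, Decimal("300.50"), 4),
]
print(largest_expensors(expenses, employees))
```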

generate_supplier_payment_plans

This one took the most thinking. The worked example in the README is the key — Catering Plus has a 2000 invoice due in 2 months and a 1500 invoice due in 3 months, and the expected plan
is 1500 / 1500 / 500. That only works if each invoice is amortised into N equal instalments where N is the number of months from the first payment to its own due month, and then the
monthly payments are aggregated per supplier.

So:

  • Invoice 2000 → 2 instalments of 1000
  • Invoice 1500 → 3 instalments of 500
  • Supplier total per month: 1500, 1500, 500

That's what the query does — fan out each invoice into its instalment rows with unnest(sequence(...)), sum by supplier + month, then subtract the running sum from the supplier's total to
get balance_outstanding (defined as the balance after the payment, so the final month lands at zero).
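
To make the arithmetic concrete, here is a small Python sketch of the fan-out, aggregate and running-balance steps using the Catering Plus numbers from the README example. It is a model of the logic, not the Trino query: the function name and the (amount, months_until_due) input shape are assumptions, month 1 stands for the first payment month, and rounding of instalments that don't divide evenly is not handled.

```python
from collections import defaultdict
from decimal import Decimal


def payment_plan(invoices):
    """invoices: iterable of (amount, months_until_due) for one supplier.

    Returns [(month_number, payment, balance_outstanding)], where the balance
    is the amount still owed after that month's payment.
    """
    monthly = defaultdict(Decimal)
    for amount, months in invoices:
        instalment = amount / months
        # Fan the invoice out into one row per instalment month,
        # like unnest(sequence(...)) does in the query.
        for month in range(1, months + 1):
            monthly[month] += instalment
    total = sum(monthly.values(), Decimal("0"))
    plan, paid = [], Decimal("0")
    for month in sorted(monthly):
        paid += monthly[month]
        plan.append((month, monthly[month], total - paid))
    return plan


# Catering Plus: 2000 due in 2 months, 1500 due in 3 months.
for row in payment_plan([(Decimal("2000"), 2), (Decimal("1500"), 3)]):
    print(row)
```

The payments come out as 1500, 1500 and 500, with the balance going 2000, 500 and then 0 in the final month.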

Verified Catering Plus matches the README example exactly.

What I'd change in a real environment

  • Read the CSV/txt files via an actual filesystem-aware connector instead of inline VALUES.
  • Promote plan_anchor / first_payment into a session variable so it's easy to backdate for testing.
  • Add a check query after each DDL that fails loudly if row counts don't match expectations (e.g. assert count(*) = 9 on employee), since silent loads are scary.

Let me know if anything needs clarifying.
