basic pipeline#77
Open
zehroque21 wants to merge 1 commit into
SQL test solution
Quick notes on how I approached each file and the trade-offs I hit along the way.
Setup
Ran everything against the trinodb/trino image exactly as the README says:
docker run --name=sexi-silverbullet -d trinodb/trino
docker exec sexi-silverbullet trino -f create_employees.sql
docker exec sexi-silverbullet trino -f create_expenses.sql
docker exec sexi-silverbullet trino -f create_invoices.sql
docker exec sexi-silverbullet trino -f find_manager_cycles.sql
docker exec sexi-silverbullet trino -f calculate_largest_expensors.sql
docker exec sexi-silverbullet trino -f generate_supplier_payment_plans.sql
All six files run clean from a fresh docker restart.
Data loading (create_employees / create_expenses / create_invoices)
One thing worth calling out: the default memory connector doesn't read files from disk, so I couldn't just point at the CSV or the receipt .txt files. I inlined the rows as VALUES and kept the
original file name as a comment next to each row so the mapping back to the source file is still obvious. If I had a hive/iceberg catalog available I'd have used CREATE TABLE ... WITH
(external_location = ...) instead. Happy to rewrite it that way if you set one up for the review.
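For reference, the inlined-VALUES pattern looks roughly like the sketch below. The column layout, row values and file names are placeholders for illustration, not the contents of the actual files; only unit_price and quantity are names taken from the queries described further down. It assumes the memory catalog and its default schema that ship with the image.

```sql
-- Sketch only: rows and file names are placeholders, not the real data.
CREATE TABLE memory.default.expense AS
SELECT *
FROM (
    VALUES
        (1, DECIMAL '10.00', 2),  -- source: receipt_a.txt (hypothetical name)
        (2, DECIMAL '25.50', 1)   -- source: receipt_b.txt (hypothetical name)
) AS t (employee_id, unit_price, quantity);
```

Because the memory connector keeps tables in RAM only, the create scripts have to be re-run after every container restart.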
Other decisions:
find_manager_cycles
Recursive CTE that walks manager_id upwards and carries the visited ids in an array. The moment the next manager is already in the array, the path has closed on itself.
The dataset has exactly one cycle (Ian → Darren → Umberto → Ian), and every employee on that cycle would naturally produce a rotated version of the same loop. I rotate each detected loop
so it starts at the smallest employee_id, which collapses the three rotations (1,4,2 / 4,2,1 / 2,1,4) into a single canonical string and makes dedup trivial. Output: three rows, one per
cycle member, all sharing cycle_path = '1,4,2'.
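A minimal sketch of the walk-and-rotate idea, not the exact query in the PR. The employee rows are inlined so the snippet runs on its own: the three cycle rows follow the ids quoted above, and the fourth row is a hypothetical addition to show that someone who merely reports into the cycle is not flagged. The CTE names walk and closed are my own.

```sql
WITH RECURSIVE
employee (employee_id, manager_id) AS (
    VALUES
        (1, 4),  -- on the cycle
        (4, 2),  -- on the cycle
        (2, 1),  -- on the cycle
        (3, 1)   -- hypothetical extra row: reports into the cycle, not on it
),
-- Walk upwards from every employee, carrying the visited ids in an array.
-- Expansion stops as soon as the next manager is already in the array.
walk (start_id, next_id, path) AS (
    SELECT employee_id, manager_id, ARRAY[employee_id]
    FROM employee
    UNION ALL
    SELECT w.start_id, e.manager_id, w.path || e.employee_id
    FROM walk w
    JOIN employee e ON e.employee_id = w.next_id
    WHERE NOT contains(w.path, w.next_id)
),
-- A path has closed on its own start only when the next manager is the start.
closed AS (
    SELECT start_id, path, array_position(path, array_min(path)) AS pos
    FROM walk
    WHERE next_id = start_id
)
-- Rotate each loop so it begins at the smallest id, then render it as text.
SELECT
    start_id AS employee_id,
    array_join(
        slice(path, pos, cardinality(path) - pos + 1) || slice(path, 1, pos - 1),
        ','
    ) AS cycle_path
FROM closed
ORDER BY employee_id;
```

On this toy input the query should return employees 1, 2 and 4, each with cycle_path '1,4,2'. Employee 3 is dropped because its walk ends on manager 1 rather than back on itself.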
The query triggers a "stages exceeds soft limit" warning. It's a consequence of the recursive CTE pattern on Trino, not a correctness problem, and at this data size it runs instantly.
calculate_largest_expensors
Group by employee, having sum(unit_price * quantity) > 1000, left join back to employee to resolve the manager name (left so a missing manager doesn't silently drop the row). Sorted
descending by total. Only Alex Jacobson clears the threshold, at 1682.00.
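The shape of that query, as a sketch with inlined placeholder rows rather than the real data. The column names full_name and total_expensed_amount and the CTE name totals are assumptions; unit_price, quantity and manager_id are taken from the notes above.

```sql
WITH
employee (employee_id, full_name, manager_id) AS (
    VALUES
        (1, 'Employee A', NULL),  -- placeholder: big spender with no manager
        (2, 'Employee B', 1)      -- placeholder: stays under the threshold
),
expense (employee_id, unit_price, quantity) AS (
    VALUES
        (1, DECIMAL '600.00', 2),  -- 1200.00 in total
        (2, DECIMAL '50.00', 3)    --  150.00 in total
),
totals AS (
    SELECT employee_id, sum(unit_price * quantity) AS total_expensed_amount
    FROM expense
    GROUP BY employee_id
    HAVING sum(unit_price * quantity) > 1000
)
SELECT
    e.employee_id,
    e.full_name AS employee_name,
    m.employee_id AS manager_id,
    m.full_name AS manager_name,
    t.total_expensed_amount
FROM totals t
JOIN employee e ON e.employee_id = t.employee_id
-- LEFT JOIN so an employee without a resolvable manager is still reported.
LEFT JOIN employee m ON m.employee_id = e.manager_id
ORDER BY t.total_expensed_amount DESC;
```

With these placeholder rows only Employee A is returned, with NULL manager columns. An inner join on the manager would have dropped that row.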
generate_supplier_payment_plans
This one took the most thinking. The worked example in the README is the key — Catering Plus has a 2000 invoice due in 2 months and a 1500 invoice due in 3 months, and the expected plan
is 1500 / 1500 / 500. That only works if each invoice is amortised into N equal instalments where N is the number of months from the first payment to its own due month, and then the
monthly payments are aggregated per supplier.
So:
That's what the query does — fan out each invoice into its instalment rows with unnest(sequence(...)), sum by supplier + month, then subtract the running sum from the supplier's total to
get balance_outstanding (defined as the balance after the payment, so the final month lands at zero).
Verified Catering Plus matches the README example exactly.
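The fan-out and running balance can be sketched as follows, using the two Catering Plus invoices from the README example. The columns months_to_due and month_no are simplifications I've assumed to keep the snippet date-free; the real query works with actual due dates.

```sql
WITH
invoice (supplier, invoice_amount, months_to_due) AS (
    VALUES
        ('Catering Plus', DECIMAL '2000.00', 2),
        ('Catering Plus', DECIMAL '1500.00', 3)
),
-- One row per invoice per instalment month: 2 x 1000.00 and 3 x 500.00.
instalment AS (
    SELECT
        i.supplier,
        s.month_no,
        i.invoice_amount / i.months_to_due AS payment
    FROM invoice i
    CROSS JOIN UNNEST(sequence(1, i.months_to_due)) AS s (month_no)
),
monthly AS (
    SELECT supplier, month_no, sum(payment) AS payment_amount
    FROM instalment
    GROUP BY supplier, month_no
)
SELECT
    supplier,
    month_no,
    payment_amount,
    -- Supplier total minus the running sum: the balance after this payment.
    sum(payment_amount) OVER (PARTITION BY supplier)
        - sum(payment_amount) OVER (PARTITION BY supplier ORDER BY month_no)
        AS balance_outstanding
FROM monthly
ORDER BY supplier, month_no;
```

This should give payments of 1500.00, 1500.00 and 500.00 with balances of 2000.00, 500.00 and 0.00. Amounts that don't divide evenly would be rounded to two decimals by the division, so a real plan would need to put the remainder on one of the instalments.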
What I'd change in a real environment
Let me know if anything needs clarifying.