feat: enable bounded-borrow task admission #693
Signed-off-by: Eric W. Tramel <eric.tramel@gmail.com>
PR #693 Review: enable bounded-borrow task admission
Summary
Promotes the bounded-borrow task admission policy from an opt-in alternative to the default for the production
Touched files:
Findings
Correctness
Style / conventions
Tests
Benchmark script
Performance
Architecture / structural impact
Verdict
Approve with minor follow-ups. The implementation is correct under the cases I traced, well-covered by new unit tests, and aligned with the architectural guidance in
The intentional utilization regression for solo-heavy traffic is well-documented in the PR description and benchmark, and is the right tradeoff for the #650 goal.
Greptile Summary
This PR activates
| Filename | Overview |
|---|---|
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/task_policies.py | Core logic rewrite: adds strict-share computation with synthesized-peer solo treatment, marginal borrow-debt via min(amount, projected−strict_share), dynamic reserve-derived ceiling, and per-resource peer-pressure gating. Logic is correct; debt increment and ceiling are consistent across acquire/release paths. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py | Minimal change: threads BoundedBorrowTaskAdmissionPolicyConfig() into the default TaskAdmissionConfig and adds the task_admission_config read-only property for test introspection. |
| packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_task_admission.py | Adds five new focused tests covering solo reservation, dynamic-ceiling reserve values, explicit-ceiling marginal semantics, peer-arrival queue selection, and updates three existing tests to opt into floor rounding to preserve prior behavior. |
| scripts/benchmarks/benchmark_bounded_borrow_admission.py | New deterministic event-driven benchmark; correctly models task dispatch, lease lifecycle, and per-column zero-inflight idle time. |
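The marginal borrow-debt rule named in the table, `min(amount, projected − strict_share)`, can be illustrated with a small sketch. The function name, the clamp at zero, and the example strict share of 4 (a solo group on an 8-slot resource sharing with one synthesized equal-weight peer) are assumptions for illustration and are not taken from `task_policies.py`.

```python
def marginal_borrow_debt(amount: int, inflight: int, strict_share: int) -> int:
    """Debt added by one admission: only the part that lands above strict share."""
    projected = inflight + amount
    # Clamp at zero so admissions that stay within strict share add no debt (assumed).
    return max(0, min(amount, projected - strict_share))


debt = 0
inflight = 0
# Admit seven single-slot tasks, the solo ceiling for an 8-slot resource.
for _ in range(7):
    debt += marginal_borrow_debt(amount=1, inflight=inflight, strict_share=4)
    inflight += 1
print(inflight, debt)
```

Under these assumptions, only the fifth, sixth, and seventh admissions exceed strict share, so the loop accrues 3 debt slots, which is consistent with the 8-slot example in the PR's verification report.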
Reviews (4): Last reviewed commit: "address bounded-borrow review feedback"
Thanks for putting this together, @eric-tramel. The strict-share + dynamic reserve story holds up well, and the benchmark gives reviewers exactly the evidence we need.
Summary
Enables bounded-borrow as the default async-scheduler task admission policy. Strict share now computes per-resource fair share from queue/admission snapshots and rounds up by default, while solo groups can borrow up to a capacity-derived reserve (
Findings
Warnings: Worth addressing
Suggestions: Take it or leave it
What Looks Good
Verdict
Needs changes. Let's address the two warnings before merge:
The suggestions can land later or in a follow-up; they're not blocking.
@nabinchha thanks for the detailed review. I pushed
I left the optional accumulated denial diagnostics out of this PR. The point is valid, but the main blocked-queue path currently aggregates only denial reasons, so making that operationally useful would require broader
I reran the focused checks and benchmark, then refreshed the PR description with the current benchmark evidence SHA.
📋 Summary
Enables bounded-borrow task admission as the default async scheduler policy for #650. The policy computes per-resource strict share from scheduler capacity, queued competing groups, and group weights, rounds strict share up by default to prevent avoidable capacity loss, then permits solo groups to borrow only up to a capacity-derived reserve so heavy root work cannot fill the whole scheduler window before downstream or peer work becomes ready.
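The capacity-derived reserve can be sketched in a few lines. Only the formula `min(8, max(1, ceil(resource_limit * 0.125)))` comes from this PR description; the function names below are hypothetical and are not the names used in `task_policies.py`.

```python
import math


def reserved_slots(resource_limit: int) -> int:
    """Slots held back from solo borrowing (formula quoted in the PR description)."""
    return min(8, max(1, math.ceil(resource_limit * 0.125)))


def solo_borrow_ceiling(resource_limit: int) -> int:
    """Most slots a solo group may occupy under the dynamic reserve."""
    return resource_limit - reserved_slots(resource_limit)


# Print limit, reserve, and solo ceiling for a few illustrative limits.
for limit in (8, 64, 256):
    print(limit, reserved_slots(limit), solo_borrow_ceiling(limit))
```

For a limit of 8 this gives a reserve of 1 and a solo ceiling of 7. The cap of 8 reserved slots means larger limits give up a shrinking fraction of capacity to the reserve.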
This PR is the scheduler-side admission-control guardrail that #651 can generalize from coarse task-stage resources to provider/model/resource-vector resources; it is not claiming to solve endpoint/GPU utilization by itself.
🔗 Related Issue
Closes #650
Follow-up design: #651
🔄 Changes
- `BoundedBorrowTaskAdmissionPolicyConfig` in the default `AsyncTaskScheduler` task-admission config.
- `SchedulerResourceRequest`, `QueueView`, `TaskAdmissionView`, and `TaskGroupSpec` without moving DAG traversal or resource ownership into `FairTaskQueue`.
- `ceil`, keeping the older `floor` behavior available only as an internal policy config for focused tests/experiments.
- `reserved_slots = min(8, max(1, ceil(resource_limit * 0.125)))`; solo groups may borrow up to `resource_limit - reserved_slots`.
- `1` admit the number of borrowed slots they advertise.
- `default_borrow_ceiling` and `borrow_ceiling_by_group_resource` for tests/experiments, while leaving public `RunConfig` unchanged.
- `TaskAdmissionConfig(..., bounded_borrow=None)` for tests and benchmark comparison.
- `scripts/benchmarks/benchmark_bounded_borrow_admission.py` comparing strict fair admission against the default bounded-borrow policy for heavy-root peer arrival and neutral ready-at-start traffic.
- `architecture/dataset-builders.md` and expose the effective scheduler task-admission config through a read-only engine accessor for tests/diagnostics.

🔍 Attention Areas
- `task_policies.py`: strict-share computation, ceil rounding default, dynamic reserve-derived ceiling, marginal borrow-debt accounting, and admission decisions.
- `async_scheduler.py`: default policy selection for production async scheduling.
- `benchmark_bounded_borrow_admission.py`: deterministic evidence path for strict versus default bounded behavior and per-generation-column idle time.

📈 Bounded Borrow Verification Report
Plan alignment against `plans/645`
- `TaskAdmissionController` still supports strict fair admission when `bounded_borrow=None` for the #644 (Implement TaskAdmissionController lease boundary for async scheduler) lease-only comparison path.
- No `RunConfig` or public API field was added; the change stays engine-internal as requested by #650 (Add bounded-borrow admission policy for heavy-root async workloads).
- `FairTaskQueue` still owns only ready ordering. The policy consumes queue/admission snapshots and does not inspect the DAG, generators, model registry, provider registry, or request-admission AIMD state.
- `8`, the policy reserves `1` slot, admits up to `7` solo tasks, and records `3` marginal over-share debt slots.

Bridge to #651 resource-vector work
Bounded borrow establishes the control-law shape that #651 should generalize: strict share, bounded solo prefill, peer-pressure yielding, and borrow-debt repayment. Today this operates over coarse scheduler task-stage resources such as `llm_wait`; #651 should decide how the same policy maps onto provider/model/resource-vector identities without duplicating request-stage AIMD.
The benchmark now reports per-generation-column zero-inflight idle as a proxy for endpoint/GPU idleness. #651 should carry that acceptance criterion forward for provider/model resources: per-resource idle, total idle, max idle, and baseline deltas should be part of the resource-vector evidence gate.
Benchmark command
Evidence was generated from commit `96b1a4db58592189f633530601202f6e0a285c15` on macOS with Python 3.12.11.
Benchmark metric definition
Generation-column idle is the amount of workflow wall time where that column has no in-flight generation task. This is the benchmark proxy for endpoint/GPU time with no tokens being generated for that model resource. The benchmark reports each generation column independently, plus total and max column idle.
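As an illustration of this metric, the sketch below computes zero-inflight idle time for each column from a list of task intervals, then derives the total and max. The interval representation, the function name, and the sample data are hypothetical; the benchmark script may compute the metric differently.

```python
def column_idle_seconds(intervals: list[tuple[float, float]], wall_time: float) -> float:
    """Wall time in [0, wall_time] during which no task interval is active."""
    busy = 0.0
    cursor = 0.0  # end of the time range already counted as busy
    for start, end in sorted(intervals):
        start = max(start, cursor)  # skip any overlap with earlier intervals
        end = min(end, wall_time)   # ignore time past the workflow end
        if end > start:
            busy += end - start
            cursor = end
    return wall_time - busy


# Two illustrative columns: "a" has gaps, "b" is busy for the whole run.
columns = {
    "a": [(0.0, 2.0), (1.0, 3.0), (5.0, 6.0)],
    "b": [(0.0, 10.0)],
}
idle = {name: column_idle_seconds(iv, wall_time=10.0) for name, iv in columns.items()}
total_idle = sum(idle.values())
max_idle = max(idle.values())
print(idle, total_idle, max_idle)
```

Overlapping tasks in the same column count once toward busy time, so a column is idle only when it has no in-flight task at all.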
Benchmark results
Interpretation
- `8` scheduler slots with hot/root work before peer work arrived. Default bounded borrow admitted `7` hot tasks, preserved one slot, and dispatched peer work immediately when it became ready.
- `33.450s` to `11.950s`, a `64.3%` reduction. First peer wait improved from `0.450s` to `0.000s`.
- `26.850s` to `24.050s` (`-2.800s`) and reduced max column idle from `26.400s` to `24.050s` (`-2.350s`) in the heavy-root scenario.
- `36.700s` instead of `46.550s` for the ceil/fixed-ceiling version, while keeping immediate peer dispatch.

🧪 Testing
- `.venv/bin/ruff check packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/task_policies.py packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_task_admission.py packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py scripts/benchmarks/benchmark_bounded_borrow_admission.py`
- `.venv/bin/ruff format --check packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/task_policies.py packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_task_admission.py packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py scripts/benchmarks/benchmark_bounded_borrow_admission.py`
- `.venv/bin/ruff check packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/task_policies.py packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_task_admission.py packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_task_policies.py scripts/benchmarks/benchmark_bounded_borrow_admission.py`
- `.venv/bin/ruff format --check packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/task_policies.py packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_task_admission.py packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_task_policies.py scripts/benchmarks/benchmark_bounded_borrow_admission.py`
- `.venv/bin/ruff check scripts/benchmarks/benchmark_bounded_borrow_admission.py`
- `.venv/bin/ruff format --check scripts/benchmarks/benchmark_bounded_borrow_admission.py`
- `.venv/bin/ruff check packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/task_policies.py packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_task_admission.py`
- `.venv/bin/ruff format --check packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/task_policies.py packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_task_admission.py`
- `git diff --check`
- `.venv/bin/pytest packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_task_admission.py packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_task_policies.py -q`: 24 passed
- `.venv/bin/pytest packages/data-designer-engine/tests/engine/dataset_builders/scheduling packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py -q`: 169 passed
- `.venv/bin/pytest packages/data-designer-engine/tests/engine/dataset_builders/test_async_builder_integration.py packages/data-designer-engine/tests/engine/test_capacity.py -q`: 10 passed
- `.venv/bin/python scripts/benchmarks/benchmark_bounded_borrow_admission.py --output-dir .scratch/bounded-borrow-admission`

✅ Checklist
plans/645; no public docs/API updates required