
feat!: Replace data_batch state machine with RAII lock accessor types #99

Draft
bwyogatama wants to merge 1 commit into main from data-batch-lock-refactor

Conversation

@bwyogatama
Contributor

@bwyogatama bwyogatama commented Apr 6, 2026

Summary

Replace the 4-state finite state machine (batch_state: idle, task_created, processing, in_transit) with a 3-class design where data_batch is the "idle" state and all data access requires acquiring a lock through RAII accessor types.

Core invariant: it is impossible to read or mutate batch data without holding the appropriate lock, and move semantics make stale references a compile error.

Design

data_batch (idle — owns shared_mutex + data representation)
├── Lock-free public API: get_batch_id(), subscribe(), unsubscribe()
├── Private data accessors: get_data(), get_current_tier(), set_data(), convert_to()
│   └── Only accessible by friend accessor classes
│
├── read_only_data_batch<PtrType> (RAII shared lock)
│   └── Named const accessors: get_batch_id(), get_current_tier(), get_data(), clone()
│
└── mutable_data_batch<PtrType> (RAII exclusive lock)
    └── Named mutating accessors: set_data(), convert_to(), plus all read accessors

State transitions are static methods that move the smart pointer, nullifying the source:

auto ro = data_batch::to_read_only(std::move(batch));   // batch is now null
batch = data_batch::to_idle(std::move(ro));             // ro is now consumed

Non-blocking variants (try_to_read_only, try_to_mutable) return std::optional.

PtrType-agnostic: accessors work with both shared_ptr<data_batch> and unique_ptr<data_batch> — no enable_shared_from_this required.

Breaking changes

  • data_batch is now the top-level type — all code that used data_batch with the old state machine must use the new accessor-based API
  • batch_state enum removed — no state machine
  • idata_batch_probe, data_batch_processing_handle, lock_for_processing_result/status removed
  • pop_data_batch() no longer takes a batch_state parameter — simple FIFO pop; callers acquire locks after popping
  • pop_data_batch_by_id() and get_data_batch_by_id() no longer take target_state
  • data_repository_manager::add_data_batch_impl uses if constexpr instead of SFINAE

Files changed

  • Core type system (data_batch.hpp, data_batch.cpp): 3-class rewrite with RAII accessors
  • Repository layer (data_repository.hpp, data_repository_manager.hpp): simplified pop/get APIs, if constexpr dispatch
  • Repository impl (data_repository.cpp, data_repository_manager.cpp): updated type references
  • Converter (representation_converter.cpp): added new batch type support
  • Memory (memory_space.cpp): minor cleanup
  • Tests (test_data_batch.cpp, test_data_repository.cpp, test_data_repository_manager.cpp): full rewrite for new API

Test plan

  • pixi run build compiles cleanly (all targets)
  • pixi run test passes (all tests)
  • Accessor types enforce const-correctness at compile time
  • Non-blocking try_to_read_only / try_to_mutable return nullopt when lock is held
  • Move semantics nullify source pointer after transition

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 (1M context) noreply@anthropic.com

@copy-pr-bot

copy-pr-bot bot commented Apr 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@mbrobbel
Member

mbrobbel commented Apr 7, 2026

/ok to test 5380f13

@bwyogatama bwyogatama requested a review from dhruv9vats April 7, 2026 22:33
@bwyogatama
Contributor Author

I incorporated a lot of @dhruv9vats's and @aminaramoon's feedback into this PR.

@dhruv9vats there are 2 changes I made to your idea:

  1. I think we cannot do enable_shared_from_this because we want data_batch to work for shared_ptr<data_batch> and unique_ptr<data_batch>. Sirius uses shared_ptr<data_batch>, but other engines might want to use unique_ptr<data_batch>.
  2. For transitioning between readonly_batch and mutable_batch, I tried to implement it so that we don't go back to the synchronized_data_batch first. Let me know if I am missing something but I thought it's better to make it simpler for the user and minimize mistakes.

Reviews are appreciated; I wouldn't be surprised if I am missing something while working on this.

@bwyogatama
Contributor Author

bwyogatama commented Apr 7, 2026

One thing that I still don't know is the best API to pop a data_batch from the repository with this new model. Previously we passed in the batch_state that we wanted to transition it into, which I think is not a good API, but it allowed us to skip a data_batch that did not meet the state machine requirements for that transition.

For example, when popping a data batch from the repo for downgrade, it allowed us to skip a data batch that's currently in the in_processing state.

But right now, since the state is gone, we will just pop a data_batch from the repository in a FIFO manner and it's up to the caller what they want to do with it. The problem is that now we might run into 'pop and then insert again' scenarios where we pop a data batch, notice that we cannot downgrade it because a task is executing on it, and have to insert it again into the repo.

Let me know if people have some ideas @aminaramoon @dhruv9vats

@dhruv9vats
Member

Thanks for the refactor @bwyogatama!

I think we cannot do enable_shared_from_this because we want data_batch to work for shared_ptr<data_batch> and unique_ptr<data_batch>. Sirius uses shared_ptr<data_batch>, but other engines might want to use unique_ptr<data_batch>.

These are, IMO, 2 different use cases. The current use case we are addressing is: we want to be able to have multiple readers + enforce a single writer in a streamlined fashion using standard sync utilities (shared_lock / unique_lock). If we remove shared_from_this, we lose the ability to extend the lifetime of the synchronized_data_batch when we create a read_only (shallow) copy of the object. Which means the responsibility of managing the lifetime of the original object now again lies in the hands of the end user, since we are propagating raw pointers when we get the read-only / mutable views of the synchronized_data_batch. I believe we don't want to do that, especially since we have the option to ensure it ourselves.

@dhruv9vats
Member

dhruv9vats commented Apr 13, 2026

idata_batch_probe, ... removed

We would need the probing ability for planned observability in the near future. We should think about how we want to do that.

* that expose the data_batch state when certain events occur, like state transitions.
* Key characteristics:
* - Compiler-enforced read/write separation via accessor types
* - PtrType agnostic — works with shared_ptr, unique_ptr, or stack allocation
Member


I think we don't want this to be PtrType agnostic; we are intentionally keeping it so this is only shared_ptr.

{
}
private:
friend class synchronized_data_batch;
Member


It is okay to be explicit, but I believe we don't need the friend class declarations.


/**
* @brief Destructor decrements processing count and potentially transitions state.
* @brief RAII read-only accessor. Borrows the parent, does not extend lifetime.
Member


We want to extend the lifetime.

data_batch_processing_handle& operator=(const data_batch_processing_handle&) = delete;
/**
* @brief Downgrade from mutable → read-only.
* Internally releases unique lock, then blocks until shared lock acquired.
Member


then blocks until

I feel the user should be allowed to make this decision, and inspect if they can get a write lock quickly or not.

};
/**
* @brief Upgrade from read-only → mutable.
* Internally releases shared lock, then blocks until unique lock acquired.
Member


Again, should be the users choice whether they want to wait or not.

* as try_to_create_task(). Always succeeds when it returns.
*/
void wait_to_create_task();
// -- Immutable field exposed on wrapper (lock-free, for repository lookups) --
Member


The use of const for the batch_id_ field in the data_batch should be considered.

Comment on lines +189 to +191
bool subscribe();
void unsubscribe();
size_t get_subscriber_count() const;
Member

@dhruv9vats dhruv9vats Apr 13, 2026


Could you add some doc string to elaborate on the use case of these explicit counters? Is this something where the use_count provided by shared_ptr (and unique ownership) will not suffice? Or did you intend to keep counters of the number of read copies that are currently active, or something like that?

@dhruv9vats
Member

dhruv9vats commented Apr 13, 2026

One thing that I still don't know is the best API to pop a data_batch from the repository with this new model. Previously we passed in the batch_state that we wanted to transition it into, which I think is not a good API, but it allowed us to skip a data_batch that did not meet the state machine requirements for that transition.

For example, when popping a data batch from the repo for downgrade, it allowed us to skip a data batch that's currently in the in_processing state.

But right now, since the state is gone, we will just pop a data_batch from the repository in a FIFO manner and it's up to the caller what they want to do with it. The problem is that now we might run into 'pop and then insert again' scenarios where we pop a data batch, notice that we cannot downgrade it because a task is executing on it, and have to insert it again into the repo.

Let me know if people have some ideas @aminaramoon @dhruv9vats

We can circumvent this by having an API on the data repository along the lines of try_pop that basically goes through all the batches it has in some order (maybe even a configurable order) and tries to obtain a write lock on each. The first batch on which a write lock is obtained is returned. So basically:

std::optional<mutable_data_batch> try_pop() {
	std::optional<mutable_data_batch> result = std::nullopt;
	auto position = std::ranges::find_if(batches_, [&result](std::shared_ptr<synchronized_data_batch>& batch) {
		result = batch->try_get_mutable();
		return result != std::nullopt;
	});
	if (position != batches_.end()) {
		batches_.erase(position);
	}

	return result;
}

or something like this. And then have a blocking API that can be used if none of the batches can be readily downgraded.

Redesign data_batch concurrency model with compile-time enforced data
access safety. The new 3-class design uses data_batch (idle/unlocked),
read_only_data_batch (shared lock), and mutable_data_batch (exclusive
lock). All data access requires acquiring a lock through RAII accessor
types, and move semantics make stale references a compile error.

- Rewrite data_batch.hpp/cpp with new type system
- Migrate data_repository and data_repository_manager to new types
- Rewrite all test files for the new API
- Add representation_converter support for new batch types
- Fix pixi.toml channel configuration for cudf dependencies

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@bwyogatama bwyogatama force-pushed the data-batch-lock-refactor branch from 31ffd86 to 85316e8 on April 14, 2026 23:13
@bwyogatama bwyogatama changed the title from "feat!: Replace data_batch state machine with synchronized_data_batch accessor types" to "feat!: Replace data_batch state machine with RAII lock accessor types" on Apr 14, 2026

3 participants