
feat!: Replace data_batch state machine with RAII lock accessor types #99

Draft
bwyogatama wants to merge 1 commit into main from data-batch-lock-refactor

Conversation

@bwyogatama
Contributor

@bwyogatama bwyogatama commented Apr 6, 2026

Summary

Replace the 4-state finite state machine (batch_state: idle, task_created, processing, in_transit) with a 3-class design where data_batch is the "idle" state and all data access requires acquiring a lock through RAII accessor types.

Core invariant: it is impossible to read or mutate batch data without holding the appropriate lock, and move semantics make stale references a compile error.

Design

data_batch (idle — owns shared_mutex + data representation)
├── Lock-free public API: get_batch_id(), subscribe(), unsubscribe()
├── Private data accessors: get_data(), get_current_tier(), set_data(), convert_to()
│   └── Only accessible by friend accessor classes
│
├── read_only_data_batch<PtrType> (RAII shared lock)
│   └── Named const accessors: get_batch_id(), get_current_tier(), get_data(), clone()
│
└── mutable_data_batch<PtrType> (RAII exclusive lock)
    └── Named mutating accessors: set_data(), convert_to(), plus all read accessors

State transitions are static methods that move the smart pointer, nullifying the source:

auto ro = data_batch::to_read_only(std::move(batch));   // batch is now null
batch = data_batch::to_idle(std::move(ro));             // ro is now consumed

Non-blocking variants (try_to_read_only, try_to_mutable) return std::optional.

PtrType-agnostic: accessors work with both shared_ptr<data_batch> and unique_ptr<data_batch> — no enable_shared_from_this required.

Breaking changes

  • data_batch is now the top-level type — all code that used data_batch with the old state machine must use the new accessor-based API
  • batch_state enum removed — no state machine
  • idata_batch_probe, data_batch_processing_handle, lock_for_processing_result/status removed
  • pop_data_batch() no longer takes a batch_state parameter — simple FIFO pop; callers acquire locks after popping
  • pop_data_batch_by_id() and get_data_batch_by_id() no longer take target_state
  • data_repository_manager::add_data_batch_impl uses if constexpr instead of SFINAE

Files changed

  • Core type system (data_batch.hpp, data_batch.cpp): 3-class rewrite with RAII accessors
  • Repository layer (data_repository.hpp, data_repository_manager.hpp): simplified pop/get APIs, if constexpr dispatch
  • Repository impl (data_repository.cpp, data_repository_manager.cpp): updated type references
  • Converter (representation_converter.cpp): added new batch type support
  • Memory (memory_space.cpp): minor cleanup
  • Tests (test_data_batch.cpp, test_data_repository.cpp, test_data_repository_manager.cpp): full rewrite for new API

Test plan

  • pixi run build compiles cleanly (all targets)
  • pixi run test passes (all tests)
  • Accessor types enforce const-correctness at compile time
  • Non-blocking try_to_read_only / try_to_mutable return nullopt when lock is held
  • Move semantics nullify source pointer after transition

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 (1M context) noreply@anthropic.com

@copy-pr-bot

copy-pr-bot bot commented Apr 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@mbrobbel
Member

mbrobbel commented Apr 7, 2026

/ok to test 5380f13

@bwyogatama bwyogatama requested a review from dhruv9vats April 7, 2026 22:33
@bwyogatama
Contributor Author

I incorporated a lot of @dhruv9vats's and @aminaramoon's feedback into this PR.

@dhruv9vats there are 2 changes I made to your idea:

  1. I think we cannot do enable_shared_from_this because we want data_batch to work for shared_ptr<data_batch> and unique_ptr<data_batch>. Sirius uses shared_ptr<data_batch>, but other engines might want to use unique_ptr<data_batch>.
  2. For transitioning between readonly_batch and mutable_batch, I tried to implement it so that we don't go back to the synchronized_data_batch first. Let me know if I am missing something but I thought it's better to make it simpler for the user and minimize mistakes.

Reviews are appreciated; I wouldn't be surprised if I am missing something while working on this.

@bwyogatama
Contributor Author

bwyogatama commented Apr 7, 2026

One thing that I still don't know is the best API to pop a data_batch from the repository with this new model. Previously we passed in the batch_state that we wanted to transition it into, which I think is not a good API, but it allowed us to skip a data_batch that did not meet the state machine requirements for that transition.

For example, when popping a data batch from the repo for downgrade, it allowed us to skip a data batch that's currently in the in_processing state.

But right now, since the state is gone, we will just pop a data_batch from the repository in a FIFO manner and it's up to the caller what they want to do with it. The problem is that now we might run into 'pop and then insert again' scenarios where we pop a data batch, notice that we cannot downgrade it because a task is executing on it, and have to insert it again into the repo.

Let me know if people have some ideas @aminaramoon @dhruv9vats

@dhruv9vats
Member

Thanks for the refactor @bwyogatama!

I think we cannot do enable_shared_from_this because we want data_batch to work for shared_ptr<data_batch> and unique_ptr<data_batch>. Sirius uses shared_ptr<data_batch>, but other engines might want to use unique_ptr<data_batch>.

These are, IMO, 2 different use cases. The current use case we are addressing is: we want to be able to have multiple readers + enforce a single writer in a streamlined fashion using standard sync utilities (shared_lock / unique_lock). If we remove shared_from_this, we lose the ability to extend the lifetime of the synchronized_data_batch when we create a read_only (shallow) copy of the object. Which means the responsibility of managing the lifetime of the original object now again lies in the hands of the end user, since we are propagating raw pointers when we get the read-only / mutable views of the synchronized_data_batch. I believe we don't want to do that, especially since we have the option to ensure it ourselves.

@dhruv9vats
Member

dhruv9vats commented Apr 13, 2026

idata_batch_probe, ... removed

We would need the probing ability for planned observability in the near future. We should think about how we want to do that.

* that expose the data_batch state when certain events occur, like state transitions.
* Key characteristics:
* - Compiler-enforced read/write separation via accessor types
* - PtrType agnostic — works with shared_ptr, unique_ptr, or stack allocation
Member


I think we don't want this to be PtrType agnostic; we are intentionally keeping it so this is only shared_ptr.

{
}
private:
friend class synchronized_data_batch;
Member


It is okay to be explicit, but I believe we don't need the friend class declarations.


/**
* @brief Destructor decrements processing count and potentially transitions state.
* @brief RAII read-only accessor. Borrows the parent, does not extend lifetime.
Member


We want to extend the lifetime.

data_batch_processing_handle& operator=(const data_batch_processing_handle&) = delete;
/**
* @brief Downgrade from mutable → read-only.
* Internally releases unique lock, then blocks until shared lock acquired.
Member


then blocks until

I feel the user should be allowed to make this decision, and inspect if they can get a write lock quickly or not.

};
/**
* @brief Upgrade from read-only → mutable.
* Internally releases shared lock, then blocks until unique lock acquired.
Member


Again, should be the users choice whether they want to wait or not.

* as try_to_create_task(). Always succeeds when it returns.
*/
void wait_to_create_task();
// -- Immutable field exposed on wrapper (lock-free, for repository lookups) --
Member


The use of const for the batch_id_ field in the data_batch should be considered.

Comment on lines +189 to +191
bool subscribe();
void unsubscribe();
size_t get_subscriber_count() const;
Member

@dhruv9vats dhruv9vats Apr 13, 2026


Could you add some doc string to elaborate on the use case of these explicit counters? Is this something where the use_count provided by shared_ptr (and unique ownership) will not suffice? Or did you intend to keep counters of the number of read copies that are currently active, or something like that?

@dhruv9vats
Member

dhruv9vats commented Apr 13, 2026

One thing that I still don't know is the best API to pop a data_batch from the repository with this new model. Previously we passed in the batch_state that we wanted to transition it into, which I think is not a good API, but it allowed us to skip a data_batch that did not meet the state machine requirements for that transition.

For example, when popping a data batch from the repo for downgrade, it allowed us to skip a data batch that's currently in the in_processing state.

But right now, since the state is gone, we will just pop a data_batch from the repository in a FIFO manner and it's up to the caller what they want to do with it. The problem is that now we might run into 'pop and then insert again' scenarios where we pop a data batch, notice that we cannot downgrade it because a task is executing on it, and have to insert it again into the repo.

Let me know if people have some ideas @aminaramoon @dhruv9vats

We can circumvent this by having an API on the data repository along the lines of try_pop that basically goes through all the batches it has in some order (maybe even a configurable order) and tries to obtain a write lock on each. The first batch on which a write lock is obtained is returned. So basically:

std::optional<mutable_data_batch> try_pop() {
	std::optional<mutable_data_batch> result = std::nullopt;
	auto position = std::ranges::find_if(batches_, [&result](std::shared_ptr<synchronized_data_batch>& batch) {
		result = batch->try_get_mutable();
		return result != std::nullopt;
	});
	if (position != batches_.end()) {
		batches_.erase(position);
	}

	return result;
}

or something like this. And then have a blocking API that can be used if none of the batches can be readily downgraded.

Redesign data_batch concurrency model with compile-time enforced data
access safety. The new 3-class design uses data_batch (idle/unlocked),
read_only_data_batch (shared lock), and mutable_data_batch (exclusive
lock). All data access requires acquiring a lock through RAII accessor
types, and move semantics make stale references a compile error.

- Rewrite data_batch.hpp/cpp with new type system
- Migrate data_repository and data_repository_manager to new types
- Rewrite all test files for the new API
- Add representation_converter support for new batch types
- Fix pixi.toml channel configuration for cudf dependencies

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@bwyogatama bwyogatama force-pushed the data-batch-lock-refactor branch from 31ffd86 to 85316e8 on April 14, 2026 23:13
@bwyogatama bwyogatama changed the title from "feat!: Replace data_batch state machine with synchronized_data_batch accessor types" to "feat!: Replace data_batch state machine with RAII lock accessor types" on Apr 14, 2026

3 participants