CASSANDRA-21134: Direct I/O for background SSTable writes#4815
CASSANDRA-21134: Direct I/O for background SSTable writes#4815samueldlightfoot wants to merge 4 commits into
Conversation
3349322 to
005f4e1
Compare
b95f990 to
3d95be1
Compare
Adds an opt-in O_DIRECT write path for background SSTable producers, bypassing the OS page cache for data that is unlikely to be re-read soon after being written. Memtable flushes remain buffered. Enabled via two new YAML knobs: - background_write_disk_access_mode: standard (default) | direct - direct_write_buffer_size: 1MiB (default; aligned up to FS block size, auto-grown to chunk_length) The path is gated by config, table compression being enabled, and an OperationType allowlist in DataComponent. The allowlist is exhaustive: any new OperationType with writesData=true that is not classified will fail static initialization. Operations on the DIO path: COMPACTION, MAJOR_COMPACTION, TOMBSTONE_COMPACTION, ANTICOMPACTION, GARBAGE_COLLECT, CLEANUP, UPGRADE_SSTABLES, WRITE, STREAM (chunked receiver only). Operations off the DIO path: - FLUSH (policy: just-flushed data is hot, keep in page cache) - SCRUB (correctness: tryAppend needs mark/resetAndTruncate) - Zero-Copy Streaming (bypasses DataComponent.buildWriter) - Uncompressed writers (only CompressedSequentialWriter has a DIO subclass in this change) StartupChecks fails fast if 'direct' is requested on a platform/FS that does not support O_DIRECT. patch by Sam Lightfoot; reviewed by <reviewers> for CASSANDRA-21134
a49806c to
ca8ef09
Compare
Fail fast on unsupported background_write_disk_access_mode values, drop "scrub" from the DIO operation docs (it's UNSUPPORTED_CORRECTNESS), and tighten AntiCompactionTest/CompactionsTest to assert Direct I/O actually engages rather than passing spuriously.
…st.yaml Add background_write_disk_access_mode: direct to cassandra_latest.yaml so new-install/latest configurations exercise Direct I/O by default, and keep the test latest config in sync (test/conf/latest_diff.yaml and the DTEST_JVM_DTESTS_USE_LATEST block in InstanceConfig.java). The cassandra.yaml default remains standard.
aweisberg
left a comment
There was a problem hiding this comment.
Mostly just nits but at least one or two things worth doing and then I shared a pastebin with you of an analysis of the DirectCompressedSequentialWriterTest code and boundary coverage that is worth considering.
| } | ||
|
|
||
| @Test | ||
| public void testInitializeBackgroundWriteDiskAccessModeRejectsZeroBufferSize() |
There was a problem hiding this comment.
Technically haven't exercised the checks of the paths of isDirectIOSupported which would require being able to intercept that and provide test answers to whether paths support that. Maybe something that can be done with jimfs.
The auto path which we discussed might be worth changing.
There was a problem hiding this comment.
I think the best we could do is a custom hook that captures what open options were requested, but not via JimFS, since getBlockSize throws an exception. Let me know how strongly you feel on this one.
My plan with auto was to have it resolve to standard now, and then if/when we decide to use direct as the default, we can have it check DIO support and fallback to standard if not.
|
|
||
| import static org.assertj.core.api.Assertions.assertThat; | ||
|
|
||
| public class StreamingDirectWriteTest extends TestBaseImpl |
There was a problem hiding this comment.
Curious why Streaming needed a dedicated test. Incoming buffers is just incoming buffers right? Not complaining about the coverage though.
There was a problem hiding this comment.
It doesn't increase coverage; the logic is already tested. I tried to cover all operations with a test for said op - happy to drop the test if you don't think worthwhile.
- Reuse ChecksumWriter for O_DIRECT compressed writes via DirectChecksumWriter - Centralize Direct I/O write paths and drop redundant background-write support validation - Rely on startup checks for engagement; remove per-operation log and its reflective test seam - Exercise isDirectIOSupported path checks in DatabaseDescriptorTest
95279ff to
83a3175
Compare
CASSANDRA-21134: Direct I/O for background SSTable writes
Summary
Opt-in
O_DIRECTwrite path for background SSTable producers, bypassing the OS page cache for write-once read-never data. Memtable flushes remain buffered (hot data benefits from the cache).Gated by (1) config, (2) table compression enabled, (3) an
OperationTypeallowlist (DataComponent#DIRECT_WRITE_SUPPORT). Selection is central inDataComponent.buildWriter; producers are unchanged.Performance
Benchmark results are attached to the JIRA. Significant p99 read latency improvements under throttled compaction.
Operations covered (DIO eligible)
WRITECQLSSTableWriterDaemonTest(parameterised on disk mode)COMPACTIONCompactionsTest(parameterised on disk mode)MAJOR_COMPACTIONCompactionsTest.testCompactionWithSizeLimitedRewriterCLEANUP,GARBAGE_COLLECT,TOMBSTONE_COMPACTION,UPGRADE_SSTABLESCompactionsTest(transitive)ANTICOMPACTIONAntiCompactionTest.testAntiCompactionWithCompressedTableAndDirectWritesSTREAMStreamingDirectWriteTestThe allowlist is exhaustive: any new
OperationTypewithwritesData == truethat is not classified fails static initialization (AssertionError).Operations NOT covered
FLUSH(memtable)UNSUPPORTED_POLICYDataComponentDirectWriteSelectionTestSCRUBUNSUPPORTED_CORRECTNESStryAppendneedsmark()/resetAndTruncate(), which DIO cannot satisfy.DataComponentDirectWriteSelectionTestDataComponent.buildWriter.StreamingDirectWriteTest(disables ZCS)CompressedSequentialWriterhas a DIO subclass.DataComponentDirectWriteSelectionTest(compression gate)Removing an
UNSUPPORTED_CORRECTNESSentry requires code changes;UNSUPPORTED_POLICYis a policy decision.Key code
io/DirectIoSupport.java— eligibility enum (SUPPORTED/UNSUPPORTED_CORRECTNESS/UNSUPPORTED_POLICY/NOT_APPLICABLE).io/sstable/format/DataComponent.java— selection, allowlist, exhaustiveness check;per-op first-activation log.
io/compress/DirectCompressedSequentialWriter.java— new writer; aligned buffers,mark()/resetAndTruncate()unsupported.io/compress/CompressedSequentialWriter.java— refactored so the DIO subclass canoverride the write-chunk path;
writeChunkcontract documented and asserted.config/Config.java,config/DatabaseDescriptor.java— new knobs, validation, startupwiring; buffer size aligned to FS block size, auto-grown to chunk length.
service/StartupChecks.java— fails fast ifdirectis requested on a platform/FSthat does not support
O_DIRECT.Tests introduced
vs. the buffered writer over compressors × chunk lengths × random payload sizes;
seed-logged for repro (
DirectCompressedSequentialWriterTest).flushCompleteBlocks(DirectCompressedSequentialWriterTest).OperationTypeeligibility, allowlist exhaustiveness,compression gate, config-mode gate (
DataComponentDirectWriteSelectionTest).WRITE(CQLSSTableWriterDirectWriteTest),STREAM(in-JVM dtest,StreamingDirectWriteTest), and the compaction family +ANTICOMPACTION(extendedCompactionsTest,AntiCompactionTest).block-size rejection, once-per-JVM undersized-buffer warn, SCRUB-gating canaries
(
DirectCompressedSequentialWriterTest).BufferPoolMXBeancheck that the off-heap alignedbuffer is returned on close (
DirectCompressedSequentialWriterTest).in
DatabaseDescriptorTest.Not in scope
Reviewer notes
Findings from the Cassandra bug-hunting skills (Opus 4.7 xhigh & kimi-k2.6:cloud) were addressed prior to
review.