
WIP: PERF: OOC-optimized algorithm variants for 30+ filters #1575

Draft
joeykleingers wants to merge 42 commits into BlueQuartzSoftware:develop from joeykleingers:ooc-filter-optimizations


@joeykleingers joeykleingers commented Apr 2, 2026

Summary

Adds out-of-core (OOC) optimized algorithm variants for 30+ filters, using DispatchAlgorithm to select between in-core (Direct/BFS) and OOC (Scanline/CCL) code paths at runtime based on data store type. A preparatory rename commit enables git rename tracking, so GitHub shows meaningful diffs against the original algorithm code.

This PR contains only the filter optimization layer. The core OOC infrastructure (copyIntoBuffer/copyFromBuffer API, HDF5ChunkedStore, OocDataIOManager, etc.) is in a separate ooc-architecture-rewrite branch that this PR stacks on top of.

Branch Structure

develop
  └── ooc-architecture-rewrite (core OOC architecture, store management, file import, recovery)
       └── ooc-filter-optimizations (this PR — rename + 30+ filter optimizations)

Commit 0 — Rename for Git Tracking

Renames 13 algorithm files to their in-core variant names before any logic changes, so that when dispatch variants are introduced, GitHub shows proper diffs against the original code instead of "new file" with no context.

| Original | Renamed To |
|---|---|
| FillBadData | FillBadDataBFS |
| IdentifySample | IdentifySampleBFS |
| ComputeBoundaryCells | ComputeBoundaryCellsDirect |
| ComputeFeatureNeighbors | ComputeFeatureNeighborsDirect |
| ComputeSurfaceAreaToVolume | ComputeSurfaceAreaToVolumeDirect |
| ComputeSurfaceFeatures | ComputeSurfaceFeaturesDirect |
| SurfaceNets | SurfaceNetsDirect |
| QuickSurfaceMesh | QuickSurfaceMeshDirect |
| DBSCAN | DBSCANDirect |
| ComputeKMedoids | ComputeKMedoidsDirect |
| MultiThresholdObjects | MultiThresholdObjectsDirect |
| BadDataNeighborOrientationCheck | BadDataNeighborOrientationCheckWorklist |
| ComputeGBCDPoleFigure | ComputeGBCDPoleFigureDirect |

Bug Fixes

OOC import of legacy SIMPL files with multi-dimensional component arrays

Legacy SIMPL .dream3d files store multi-dimensional component arrays (e.g., GBCD with componentShape [10,10,10,20,20,2]) with HDF5 physical dimensions in reversed order relative to the ComponentDimensions attribute.

Two fixes address this at different layers:

  • AbstractOocStore::readHdf5 (SimplnxOoc): Detects shape mismatch between logical and physical dimensions before the streaming import path. Falls back to flat bulk read (H5S_ALL) when shapes differ, preserving correct byte order.
  • ImportH5ObjectPathsAction::backfillReadOnlyOocStores (simplnx): The read-only reference store optimization creates stores pointing directly at the source file. For mismatched arrays, the N-D hyperslabs would be out-of-bounds. Detects the mismatch and creates a writable OOC store populated via readHdf5 (which triggers the flat-read fallback) instead of a read-only reference.

Filter Optimizations

Group B — Face-Neighbor Filters (5 filters)

Split into Direct (in-core) and Scanline (OOC) algorithm classes using DispatchAlgorithm. Scanline variants use Z-slice rolling windows (prev/cur/next) for cross-slice neighbor access with zero per-element OOC overhead.

Filters: ComputeBoundaryCells, ComputeSurfaceFeatures, ComputeFeatureNeighbors, ComputeSurfaceAreaToVolume, BadDataNeighborOrientationCheck
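
The prev/cur/next rolling window described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `readSlice` and `countBoundaryFaces` are hypothetical names, a plain `std::vector` stands in for the OOC store, and the output is written straight into a full-size vector rather than buffered per slice.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for one bulk read of a Z-slice from a store.
static void readSlice(const std::vector<std::int32_t>& store, std::size_t z, std::size_t sliceSize, std::vector<std::int32_t>& out)
{
  const std::int32_t* first = store.data() + z * sliceSize;
  std::copy(first, first + sliceSize, out.begin());
}

// For every voxel, counts the face neighbors that belong to a different feature.
std::vector<std::int8_t> countBoundaryFaces(const std::vector<std::int32_t>& featureIds, std::size_t nx, std::size_t ny, std::size_t nz)
{
  assert(featureIds.size() == nx * ny * nz);
  const std::size_t sliceSize = nx * ny;
  std::vector<std::int8_t> result(featureIds.size(), 0);
  if(featureIds.empty())
  {
    return result;
  }

  // Three slice buffers: only these are touched in the inner loop.
  std::vector<std::int32_t> prev(sliceSize), cur(sliceSize), next(sliceSize);
  readSlice(featureIds, 0, sliceSize, cur);

  for(std::size_t z = 0; z < nz; z++)
  {
    const bool hasNext = (z + 1 < nz);
    if(hasNext)
    {
      readSlice(featureIds, z + 1, sliceSize, next);
    }
    for(std::size_t y = 0; y < ny; y++)
    {
      for(std::size_t x = 0; x < nx; x++)
      {
        const std::size_t i = y * nx + x;
        const std::int32_t id = cur[i];
        std::int8_t count = 0;
        if(x > 0 && cur[i - 1] != id) count++;
        if(x + 1 < nx && cur[i + 1] != id) count++;
        if(y > 0 && cur[i - nx] != id) count++;
        if(y + 1 < ny && cur[i + nx] != id) count++;
        if(z > 0 && prev[i] != id) count++;
        if(hasNext && next[i] != id) count++;
        result[z * sliceSize + i] = count;
      }
    }
    // Rotate the window: cur becomes prev, next becomes cur.
    std::swap(prev, cur);
    std::swap(cur, next);
  }
  return result;
}
```

Each slice is read once, and every neighbor lookup in the inner loop hits one of the three local buffers.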

Group C — Morphological / Neighbor Replacement (5 filters)

Z-slice rolling buffers serve all 6 face-neighbor reads from RAM. SliceBufferedTransfer provides type-dispatched bulk tuple copy.

Filters: ErodeDilateBadData, ErodeDilateCoordinationNumber, ErodeDilateMask, ReplaceElementAttributesWithNeighborValues, NeighborOrientationCorrelation

Group D — CCL Segmentation (5 filters)

Chunk-sequential Connected Component Labeling using UnionFind equivalence tracking, replacing BFS/DFS flood fill for OOC data.

Filters: ScalarSegmentFeatures, EBSDSegmentFeatures, CAxisSegmentFeatures, FillBadData, IdentifySample

Group E — AlignSections Family (4 filters)

Bulk slice read/write via AlignSectionsTransferDataOocImpl. Per-filter OOC findShifts with 2-slice buffers and bulk mask reads.

Filters: AlignSectionsMisorientation, AlignSectionsMutualInformation, AlignSectionsFeatureCentroid, AlignSectionsListFilter

QuickSurfaceMesh

DispatchAlgorithm<QuickSurfaceMeshDirect, QuickSurfaceMeshScanline>. Scanline eliminates the O(volume) nodeIds array (7.5 GB for 1000³) with rolling 2-plane node buffers (16 MB). Two-pass architecture: counting pass + mesh creation pass. All output arrays (triangle connectivity, faceLabels, vertex coordinates, nodeTypes) buffered per z-slice and flushed with copyFromBuffer. Batch quickSurfaceTransferBatch API added to TupleTransfer for bulk source-read/dest-write of cell and feature data.
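
The memory figures quoted above can be reproduced with a short calculation. The sketch assumes one 8-byte node id per grid node and (n + 1) nodes per axis for an n³ voxel volume; both are assumptions made for illustration, not details confirmed by this PR.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: each node id occupies 8 bytes.
constexpr std::uint64_t k_BytesPerNodeId = 8;

// Full O(volume) node-id array for an n x n x n voxel grid.
constexpr std::uint64_t fullNodeArrayBytes(std::uint64_t n)
{
  return (n + 1) * (n + 1) * (n + 1) * k_BytesPerNodeId;
}

// Two rolling node planes, as used by a slice-by-slice scan.
constexpr std::uint64_t rollingPlaneBytes(std::uint64_t n)
{
  return 2 * (n + 1) * (n + 1) * k_BytesPerNodeId;
}

// 1000^3 voxels: about 7.5 GiB for the full array, about 16 MB for two planes.
static_assert(fullNodeArrayBytes(1000) == 8024024008ULL, "full node array size");
static_assert(rollingPlaneBytes(1000) == 16032016ULL, "two-plane buffer size");
```

Under these assumptions the rolling buffers are smaller than the full array by a factor of roughly n / 2.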

SurfaceNets

DispatchAlgorithm<SurfaceNetsDirect, SurfaceNetsScanline>. Scanline is a complete reimplementation (881 lines) eliminating the O(n) Cell[] array — uses O(surface) hash map + vertex vectors with slice-by-slice FeatureIds reading. All output arrays (vertices, nodeTypes, triangle connectivity, faceLabels) buffered and flushed with copyFromBuffer. Batch surfaceNetsTransferBatch API added to TupleTransfer for bulk I/O.

Mesh Infrastructure (RepairTriangleWinding + GeometryHelpers)

  • RepairTriangleWinding: Bulk-reads triangle face list and faceLabels into local buffers; all BFS work operates on local memory; modified triangles written back via copyFromBuffer.
  • FindElementsContainingVert / FindElementNeighbors (GeometryHelpers.hpp): Chunked bulk I/O with 65K-element chunks for sequential passes. Random neighbor lookups check if candidate is in the current chunk (cache hit) before falling back to per-element copyIntoBuffer. Together with RepairTriangleWinding buffering, this reduced SurfaceNets Winding from 515s to 2.9s.

Clustering Filters (3 filters)

  • DBSCAN: DispatchAlgorithm<DBSCANDirect, DBSCANScanline> — chunked grid construction, on-demand per-grid-cell coordinate reads in canMerge. 653s → 12s (54x)
  • ComputeKMedoids: DispatchAlgorithm<Direct, Scanline> — chunked findClusters, per-cluster optimizeClusters with O(max_cluster_size) peak memory. 74s → 13s (5.7x)
  • ComputeFeatureClustering: Single implementation with feature-level array caching. 203s → 77s (2.6x)

Pipeline Prerequisite Filters (2 filters)

  • MultiThresholdObjects: DispatchAlgorithm<Direct, Scanline> — eliminates O(n) tempResultVector in OOC path
  • ConvertOrientations: Single implementation with chunked bulk I/O in macro-generated Convertor classes (4096-tuple chunks)

Together these reduced the AlignSectionsMisorientation pipeline test from 635s to 5.9s (107x).

OrientationAnalysis Misc (10 filters)

  • ComputeTwinBoundaries: Bulk-read all face/feature/ensemble arrays into local vectors. 179s → 44s (4x)
  • ComputeKernelAvgMisorientations: Slab-based bulk I/O with cached CrystalStructures
  • ComputeAvgCAxes: Already OOC-optimized (chunked reads, cached feature output). Compute-bound.
  • ReadH5Ebsd: copyFromBuffer in CopyData template, phase copy, Euler interleaving. 463s → 241s (1.9x)
  • ComputeGBCDPoleFigure: DispatchAlgorithm<Direct, Scanline> — Direct caches full GBCD, Scanline caches only the phase-of-interest slice (bounded by bin resolution, not cell count). 853s → 0.9s (948x)
  • ComputeFeatureReferenceCAxisMisorientations: Z-slice buffered I/O for all cell-level arrays (featureIds, cellPhases, quats, output). Cached ensemble/feature-level arrays (crystalStructures, avgCAxes). 196s → 5.4s (36x)
  • ComputeFeatureNeighborCAxisMisalignments: Bulk-read all feature-level arrays (featurePhases, featureAvgQuat, crystalStructures) and buffered avgCAxisMisalignment output.
  • MergeTwins: Chunked bulk I/O for voxel-level parent ID fill and assignment loop. Feature-level featureParentIds cached locally for lookup. 67s → 1.8s (37x)
  • ReadCtfData: Bulk copyFromBuffer for all cell arrays (phases, euler angles, bands, error, MAD, BC, BS, X, Y). Euler angle interleave uses chunked 64K buffer. Crystal structures cached locally for hex correction. 231s → 0.25s
  • ReadAngData: Same bulk copyFromBuffer pattern. Phase validation done in-place on EbsdLib buffer before single bulk write. Euler interleave chunked.

Pipeline-Critical Filters (6 filters)

Optimizations targeting the filters responsible for OOC pipeline timeouts (4 of 5 timed-out pipelines blocked by ComputeIPFColors):

  • ComputeIPFColors: DispatchAlgorithm<ComputeIPFColorsDirect, ComputeIPFColorsScanline>. Direct keeps the ParallelDataAlgorithm path for in-core data; Scanline uses chunked sequential bulk I/O (65K-tuple chunks) with locally cached crystal structures. ForceOocAlgorithmGuard added to the test. 1,937ms → 90ms (21.5x)
  • ComputeFeatureSizes: Chunked copyIntoBuffer for featureIds (ImageGeom path) and featureIds + elemSizes (RectGridGeom path with Kahan summation preserved). 813ms → 28ms (29x)
  • ComputeAvgOrientations: Chunked featureIds/phases/quats reads, locally cached crystal structures and avgQuats (feature-level). Bulk copyFromBuffer for output arrays.
  • ComputeFeatureReferenceMisorientations: Chunked reads and writes for all cell-level arrays (featureIds, phases, quats, GB distances, output misorientations). Locally cached crystal structures, avgQuats, and center quaternions (all feature/ensemble-level). 106ms → 1ms (106x)
  • ComputeFeatureCentroids: Replaced AbstractDataStore intermediate arrays (sum, center, count, rangeX/Y/Z) with plain std::vector — eliminates ~119M virtual dispatch calls per run. Chunked featureIds reads. Inline coordinate computation from spacing/origin. 39,724ms → 25ms (1,589x)
  • RequireMinimumSizeFeatures: Three-part optimization:
    • removeSmallFeatures: Chunked featureIds read/write (65K-tuple batches)
    • assignBadVoxels: 3-slice rolling slab buffer for neighbor voting scan (O(slice) memory), sparse changed-voxel tracking to skip full-volume transfer when few/no voxels changed. 14,592ms → 142ms (103x)
    • RemoveInactiveObjects (shared utility in DataGroupUtilities.cpp): Chunked featureIds renumbering with copyIntoBuffer/copyFromBuffer. 5,573ms → 50ms (111x)
    • Combined: 20,184ms → 210ms (96x)
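
The ComputeFeatureCentroids change listed above (plain std::vector accumulators, chunked featureIds reads, inline coordinates from spacing/origin) might look roughly like this. It is a sketch under assumptions: `computeCentroids` and `Centroid` are hypothetical names, a `std::vector` stands in for the store, and cell centers are taken at `origin + (index + 0.5) * spacing`.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Centroid
{
  double x = 0.0;
  double y = 0.0;
  double z = 0.0;
};

std::vector<Centroid> computeCentroids(const std::vector<std::int32_t>& featureIds, const std::array<std::size_t, 3>& dims,
                                       const std::array<double, 3>& spacing, const std::array<double, 3>& origin,
                                       std::size_t numFeatures, std::size_t chunkSize)
{
  assert(chunkSize > 0);
  assert(featureIds.size() == dims[0] * dims[1] * dims[2]);

  // Feature-level accumulators live in plain vectors: no virtual calls per voxel.
  std::vector<double> sumX(numFeatures, 0.0), sumY(numFeatures, 0.0), sumZ(numFeatures, 0.0);
  std::vector<std::uint64_t> count(numFeatures, 0);
  std::vector<std::int32_t> buffer(chunkSize);

  for(std::size_t start = 0; start < featureIds.size(); start += chunkSize)
  {
    const std::size_t n = std::min(chunkSize, featureIds.size() - start);
    // Stand-in for one bulk read of n feature ids into the local buffer.
    std::copy(featureIds.data() + start, featureIds.data() + start + n, buffer.data());
    for(std::size_t i = 0; i < n; i++)
    {
      const std::size_t idx = start + i;
      const std::size_t id = static_cast<std::size_t>(buffer[i]);
      assert(id < numFeatures);
      // Recover (x, y, z) from the linear index instead of reading coordinate arrays.
      const std::size_t x = idx % dims[0];
      const std::size_t y = (idx / dims[0]) % dims[1];
      const std::size_t z = idx / (dims[0] * dims[1]);
      sumX[id] += origin[0] + (static_cast<double>(x) + 0.5) * spacing[0];
      sumY[id] += origin[1] + (static_cast<double>(y) + 0.5) * spacing[1];
      sumZ[id] += origin[2] + (static_cast<double>(z) + 0.5) * spacing[2];
      count[id]++;
    }
  }

  std::vector<Centroid> centroids(numFeatures);
  for(std::size_t f = 0; f < numFeatures; f++)
  {
    if(count[f] > 0)
    {
      const double c = static_cast<double>(count[f]);
      centroids[f].x = sumX[f] / c;
      centroids[f].y = sumY[f] / c;
      centroids[f].z = sumZ[f] / c;
    }
  }
  return centroids;
}
```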

Additional Filters

  • ComputeEuclideanDistMap: Bulk-read featureIds and distance stores into local vectors; flood-fill operates on local memory; bulk-write output. 116s → 1.1s (105x)
  • AppendImageGeometry: Bulk I/O for mirror operations (scanline-based reversal instead of per-tuple swaps). 469s → 113s (4.2x)

GBCD Filter Group (5 filters)

All five GBCD filters optimized for OOC with zero cell-level O(n) allocations, cancel checking, and progress messaging:

  • ComputeGBCDPoleFigure: DispatchAlgorithm<Direct, Scanline> with ForceOocAlgorithmGuard in test. Scanline caches only the phase-of-interest GBCD slice via copyIntoBuffer.
  • WriteGBCDGMTFile: Phase-of-interest GBCD slice cached via copyIntoBuffer; crystal structures cached locally.
  • WriteGBCDTriangleData: Chunked triangle I/O (8K chunks), feature-level euler cache, buffered file output via fmt::format_to + fmt::memory_buffer.
  • ComputeGBCD: Feature-level caching (eulers, phases, crystalStructures), chunked triangle array reads per 50K-triangle iteration, GBCD output accumulated in local buffer (bounded by phases × bins) then written back via copyFromBuffer.
  • ComputeGBCDMetricBased: Eliminated O(n) triIncluded allocation (replaced with per-chunk sequential area accumulation). Feature-level caching (phases, eulers, crystalStructures, featureFaceLabels). Chunked triangle I/O in totalFaceArea scan. Raw pointer access in parallel TrianglesSelector worker.

HDF5 Import + Pole Figure Filters (3 filters)

  • FillOocDataStore (shared infrastructure): Streaming chunked HDF5 hyperslab reads + copyFromBuffer, with zero O(n) temp allocations — batched reads even for partial hyperslabs. Benefits all HDF5 import paths.
  • ReadH5EspritData: copyFromBuffer bulk writes from raw HDF5 reader buffers, replacing 9+ per-element operator[] writes per point.
  • WritePoleFigure: Chunked iteration over eulerAngles/phases/mask per-phase using bounded buffers (no O(n) pre-caching); bulk-write intensity and image outputs via copyFromBuffer.
  • ReadHDF5Dataset: Cancel checking + per-dataset progress messages.
  • Test comparison loops in WritePoleFigureTest and ReadHDF5DatasetTest optimized with copyIntoBuffer.

Core Utilities + Geometry Filters

  • ImportFromBinaryFile: copyFromBuffer instead of per-element writes. ReadRawBinary Case1: 1076s → 29s (37x)
  • CropImageGeometry: Row-based bulk I/O. 27s → 2.6s (10x)
  • RandomizeFeatureIds (ClusteringUtilities): Chunked bulk I/O for both overloads — benefits all callers (segmentation filters, SharedFeatureFace, MergeTwins).
  • AppendData/CopyData/mirror swaps: Runtime OOC check — chunked bulk I/O for OOC, original code for in-core (verified zero in-core regression)
  • TupleTransfer: Added quickSurfaceTransferBatch and surfaceNetsTransferBatch batch APIs with bulk copyIntoBuffer/copyFromBuffer for source reads and destination writes. Used by QuickSurfaceMeshScanline and SurfaceNetsScanline.

Cancel + Progress Messaging

All in-core and OOC algorithm variants now have:

  • m_ShouldCancel checks at the top of major outer loops
  • ThrottledMessenger-based progress reporting with descriptive phase messages and percentage completion

OOC Performance Results

All benchmarks on arm64 Release build with forceOocData = true.

Mesh Generation Filters (full ctest wall-clock, OOC build)

| Test | Before (s) | After (s) | Speedup |
|---|---|---|---|
| QuickSurfaceMesh: Base | 11.30 | 0.19 | 59x |
| QuickSurfaceMesh: Winding | 22.70 | 0.22 | 103x |
| QuickSurfaceMesh: Problem Voxels | 11.18 | 0.19 | 59x |
| QuickSurfaceMesh: Winding+PV | 21.96 | 0.22 | 100x |
| SurfaceNets: Default | 176 | 2.40 | 73x |
| SurfaceNets: Smoothing | 224 | 2.62 | 85x |
| SurfaceNets: Winding | 515 | 2.86 | 180x |
| SurfaceNets: Winding Smoothing | 416 | 3.22 | 129x |

Groups B–E (200³ dataset, filter.execute() only)

| Filter | Before (s) | After (s) | Speedup |
|---|---|---|---|
| ComputeBoundaryCells | 6.69 | 0.25 | 27x |
| ComputeSurfaceFeatures | 4.01 | 0.28 | 14x |
| ComputeFeatureNeighbors | 8.93 | 0.81 | 11x |
| ComputeSurfaceAreaToVolume | 8.59 | 0.24 | 36x |
| BadDataNeighborOrientationCheck | 97.1 | 5.25 | 18x |
| ErodeDilateBadData | 25.09 | 3.80 | 7x |
| ErodeDilateCoordinationNumber | 12.43 | 2.30 | 5x |
| ErodeDilateMask | 6.43 | 0.40 | 16x |
| ReplaceElementAttrsWithNeighborValues | 6.05 | 4.00 | 1.5x |
| NeighborOrientationCorrelation | 67.94 | 5.70 | 12x |
| ScalarSegmentFeatures | 708.3 | 1.77 | 400x |
| EBSDSegmentFeatures | 972.6 | 2.10 | 463x |
| CAxisSegmentFeatures | 824.1 | 1.39 | 593x |
| FillBadData | 8.6 | 2.26 | 4x |
| IdentifySample | 825.0 | 0.27 | 3056x |
| AlignSectionsMisorientation | 32.89 | 0.80 | 41x |
| AlignSectionsMutualInformation | 15.61 | 0.81 | 19x |
| AlignSectionsFeatureCentroid | 8.41 | 0.39 | 22x |
| AlignSectionsListFilter | 7.50 | 0.39 | 19x |

Pipeline-Critical Filters (filter.execute() only, OOC build)

| Filter | Before | After | Speedup |
|---|---|---|---|
| ComputeFeatureCentroids | 39.7s | 25ms | 1,589x |
| RequireMinimumSizeFeatures | 20.2s | 210ms | 96x |
| ComputeIPFColors | 1.94s | 90ms | 21.5x |
| ComputeFeatureSizes | 813ms | 28ms | 29x |
| ComputeFeatureReferenceMisorientations (AvgOri) | 106ms | 1ms | 106x |
| ComputeFeatureReferenceMisorientations (EuclDist) | 136ms | 1ms | 136x |

OrientationAnalysis Filters (full ctest wall-clock, OOC build)

| Filter | Before (s) | After (s) | Speedup |
|---|---|---|---|
| ComputeFeatureReferenceCAxisMisorientations | 196 | 5.4 | 36x |
| ComputeEuclideanDistMap | 116 | 1.1 | 105x |

GBCD Filter Group (full ctest wall-clock)

| Filter | Before (s) | After (s) | Speedup |
|---|---|---|---|
| ComputeGBCDPoleFigure | 833 (fail) | 2.4 | 350x |
| ComputeGBCD | 1500 (timeout) | ~10 | 150x |
| WriteGBCDGMTFile | 162 (fail) | 6.0 | 27x |
| ComputeGBCDMetricBased | 38.1 | 28.9 | 1.3x |
| WriteGBCDTriangleData | 23.5 | 19.2 | 1.2x |

HDF5 Import + Pole Figure Filters (full ctest wall-clock)

| Filter | Before (s) | After (s) | Speedup |
|---|---|---|---|
| WritePoleFigure (3 tests) | 4500 (timeout) | 11.7 | 385x |
| ReadH5EspritData (3 tests) | 2060 (timeout) | 6.8 | 303x |
| ReadHDF5Dataset | 1500 (timeout) | 6.7 | 224x |

Additional Optimizations (full ctest wall-clock)

| Filter | Before (s) | After (s) | Speedup |
|---|---|---|---|
| ReadRawBinary (Case1) | 1076 | 29 | 37x |
| ComputeGBCDPoleFigure | 853 | 0.9 | 948x |
| DBSCAN 3D | 653 | 12 | 54x |
| AlignSectionsMisorientation Pipeline | 635 | 5.9 | 107x |
| ReadH5Ebsd | 463 | 2.1 | 220x |
| ReadCtfData | 231 | 0.25 | 924x |
| AppendImageGeometry | 469 | 113 | 4.2x |
| ComputeFeatureClustering | 203 | 77 | 2.6x |
| ComputeTwinBoundaries | 179 | 44 | 4x |
| MergeTwins | 67 | 1.8 | 37x |
| ComputeKMedoids | 74 | 13 | 5.7x |
| CropImageGeometry (X) | 27 | 2.6 | 10x |
| WriteAvizoRectilinear | 22.8 | 2.3 | 10x |
| WriteAvizoUniform | 22.3 | 2.0 | 11x |

Test Infrastructure

Rotation Filter Bulk I/O

  • RotateSampleRefFrame: Slab-based bulk I/O in RotateImageGeometryWithNearestNeighbor — reads source Z-slabs via copyIntoBuffer, processes output slices into local buffers, writes via copyFromBuffer. No O(n) allocation.
  • RotateEulerRefFrame: Chunked copyIntoBuffer/copyFromBuffer (65K tuples per chunk). 19.5s → 4.8s (4x)
  • Together these reduced ReadH5Ebsd from 241s to 2.1s (117x).

Comparison Function Bulk I/O

CompareFloatArraysWithNans, CompareArrays, and CompareDataArraysByComponent in UnitTestCommon.hpp were doing per-element operator[] access, causing extreme slowdowns when comparing OOC-backed arrays. Replaced with chunked copyIntoBuffer reads (40K elements per chunk), matching the existing CompareDataArrays pattern. This alone reduced the ComputeGBCD test from 1500s (timeout) to ~10s — the filter itself runs in ~3s.
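
A chunked comparison in the spirit of the change described above could be sketched as follows. This is not the code in UnitTestCommon.hpp: `compareFloatsWithNans` is a hypothetical name, `std::vector` stands in for the OOC-backed arrays, and treating two NaNs at the same index as equal is an assumption about the intended semantics.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Compares two float sequences chunk by chunk instead of element by element.
bool compareFloatsWithNans(const std::vector<float>& a, const std::vector<float>& b, float epsilon, std::size_t chunkSize = 40000)
{
  assert(chunkSize > 0);
  if(a.size() != b.size())
  {
    return false;
  }
  std::vector<float> bufA(chunkSize);
  std::vector<float> bufB(chunkSize);
  for(std::size_t start = 0; start < a.size(); start += chunkSize)
  {
    const std::size_t n = std::min(chunkSize, a.size() - start);
    // Stand-ins for two bulk reads, one per array.
    std::copy(a.data() + start, a.data() + start + n, bufA.data());
    std::copy(b.data() + start, b.data() + start + n, bufB.data());
    for(std::size_t i = 0; i < n; i++)
    {
      const bool nanA = std::isnan(bufA[i]);
      const bool nanB = std::isnan(bufB[i]);
      if(nanA || nanB)
      {
        if(nanA != nanB)
        {
          return false; // NaN on one side only
        }
        continue; // NaN on both sides counts as a match
      }
      if(std::fabs(bufA[i] - bufB[i]) > epsilon)
      {
        return false;
      }
    }
  }
  return true;
}
```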

  • ForceOocAlgorithmGuard coverage in all optimized filter tests for both algorithm paths
  • SIMPLNX_TEST_ALGORITHM_PATH CMake option (0=Both, 1=OOC-only, 2=InCore-only) for build-specific test path control
  • Programmatic test data builders with Z-slice batched bulk writes for OOC efficiency

Test Plan

  • Tests pass on in-core build (SIMPLNX_TEST_ALGORITHM_PATH=2)
  • Tests pass on out-of-core build (SIMPLNX_TEST_ALGORITHM_PATH=1)
  • Tests pass with both algorithm paths (SIMPLNX_TEST_ALGORITHM_PATH=0)
  • All optimized filters produce correct results on both algorithm paths
  • In-core performance verified: no regression on utility changes

@joeykleingers joeykleingers marked this pull request as draft April 2, 2026 00:57
@joeykleingers joeykleingers changed the title PERF: OOC-optimized algorithm variants for 30+ filters WIP: PERF: OOC-optimized algorithm variants for 30+ filters Apr 2, 2026
@joeykleingers joeykleingers force-pushed the ooc-filter-optimizations branch 7 times, most recently from 838a49f to f145122 Compare April 8, 2026 17:43
@joeykleingers joeykleingers force-pushed the ooc-filter-optimizations branch 8 times, most recently from 7a5a3c7 to 5bc9a95 Compare April 16, 2026 13:13
@joeykleingers joeykleingers force-pushed the ooc-filter-optimizations branch 3 times, most recently from 0e1310e to afc4257 Compare April 23, 2026 17:01
Replace the chunk-based DataStore API with a plugin-driven hook
architecture that cleanly separates OOC policy (in the SimplnxOoc
plugin) from mechanism (in the core library). The old API required
every caller to understand chunk geometry; the new design hides OOC
details behind bulk I/O primitives and plugin-registered callbacks.

--- AbstractDataStore / IDataStore API ---

Remove the entire chunk API from AbstractDataStore and IDataStore:
loadChunk, getNumberOfChunks, getChunkLowerBounds, getChunkUpperBounds,
getChunkShape, getChunkSize, getChunkTupleShape, getChunkExtents, and
convertChunkToDataStore. Replace with two bulk I/O primitives:
copyIntoBuffer(startIndex, span<T>) and copyFromBuffer(startIndex,
span<const T>), implemented in DataStore (std::copy on raw memory) and
EmptyDataStore (throws). This shifts the abstraction from "load a
chunk, then index into it" to "copy a contiguous range into a caller-
owned buffer," which works identically for in-core and OOC stores.
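
A minimal sketch of the two primitives and a typical caller, for illustration only. `MiniStore` and `sumChunked` are hypothetical names, the store is fixed to `float`, and a pointer plus count stands in for the span parameters named above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical in-core store exposing only the two bulk I/O primitives.
class MiniStore
{
public:
  explicit MiniStore(std::size_t size)
  : m_Data(size, 0.0f)
  {
  }

  std::size_t size() const
  {
    return m_Data.size();
  }

  // Copies `count` values starting at `startIndex` into a caller-owned buffer.
  void copyIntoBuffer(std::size_t startIndex, float* buffer, std::size_t count) const
  {
    assert(startIndex + count <= m_Data.size());
    std::copy(m_Data.data() + startIndex, m_Data.data() + startIndex + count, buffer);
  }

  // Copies `count` values from a caller-owned buffer into the store.
  void copyFromBuffer(std::size_t startIndex, const float* buffer, std::size_t count)
  {
    assert(startIndex + count <= m_Data.size());
    std::copy(buffer, buffer + count, m_Data.data() + startIndex);
  }

private:
  std::vector<float> m_Data;
};

// Caller pattern: walk the store in fixed-size chunks through a local buffer.
double sumChunked(const MiniStore& store, std::size_t chunkSize)
{
  assert(chunkSize > 0);
  std::vector<float> buffer(chunkSize);
  double total = 0.0;
  for(std::size_t start = 0; start < store.size(); start += chunkSize)
  {
    const std::size_t count = std::min(chunkSize, store.size() - start);
    store.copyIntoBuffer(start, buffer.data(), count);
    for(std::size_t i = 0; i < count; i++)
    {
      total += buffer[i];
    }
  }
  return total;
}
```

The caller never sees chunk geometry, so the same loop would work against an out-of-core store that implements the two methods with file reads and writes.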

Simplify StoreType to three values (InMemory, OutOfCore, Empty) by
removing EmptyOutOfCore. IsOutOfCore() now checks StoreType instead
of testing getChunkShape().has_value(). Add getRecoveryMetadata()
virtual to IDataStore for crash-recovery attribute persistence.

--- Plugin Hook System (DataIOCollection / IDataIOManager) ---

Add three plugin-registered callback hooks to DataIOCollection:

  FormatResolverFnc: Decides storage format for a given array based on
    type, shape, and size. Called from DataStoreUtilities::CreateDataStore
    and CreateListStore. Replaces the removed checkStoreDataFormat() and
    TryForceLargeDataFormatFromPrefs — format decisions now live entirely
    in the plugin, with core only calling resolveFormat() when no format
    is already set.

  BackfillHandlerFnc: Post-import callback that lets the plugin finalize
    placeholder stores after all HDF5 objects are read. Called from
    ImportH5ObjectPathsAction after importing all paths. Replaces the
    removed backfillReadOnlyOocStores core implementation.

  WriteArrayOverrideFnc: Intercepts HDF5 writes during recovery file
    creation, allowing the plugin to write lightweight placeholder
    datasets instead of full array data. Activated via RAII
    WriteArrayOverrideGuard, wired into DataStructureWriter.
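
As a rough illustration of the format-resolver hook, the sketch below registers a callback and consults it only when no format has been requested. All names (`MiniIOCollection`, `chooseFormat`, the "Example-OOC" format string) are hypothetical, and the resolver is reduced to a size-only signature.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Simplified resolver: picks a format name from the array size in bytes.
using FormatResolver = std::function<std::string(std::size_t)>;

class MiniIOCollection
{
public:
  void setFormatResolver(FormatResolver resolver)
  {
    m_Resolver = std::move(resolver);
  }

  // Returns the plugin's choice, or an empty string when no plugin registered one.
  std::string resolveFormat(std::size_t dataSizeBytes) const
  {
    return m_Resolver ? m_Resolver(dataSizeBytes) : std::string();
  }

private:
  FormatResolver m_Resolver;
};

// Core-side call site: the resolver is consulted only when no format is set.
std::string chooseFormat(const MiniIOCollection& io, const std::string& requested, std::size_t dataSizeBytes)
{
  if(!requested.empty())
  {
    return requested;
  }
  return io.resolveFormat(dataSizeBytes);
}
```

Keeping the policy in the callback means the core call site stays the same whether or not an OOC plugin is loaded.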

Add factory registration on IDataIOManager for ListStoreRefCreateFnc,
StringStoreCreateFnc, and FinalizeStoresFnc, with delegating creation
methods on DataIOCollection. Guard against reserved format name
"Simplnx-Default-In-Memory" during IO manager registration.

--- EmptyStringStore Placeholder ---

Add EmptyStringStore, a placeholder class for OOC string array import
that stores only tuple shape metadata. All data access
methods throw std::runtime_error. isPlaceholder() returns true (vs
false for StringStore). StringArrayIO creates EmptyStringStore in OOC mode instead of
allocating numValues empty strings.
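
The placeholder idea can be sketched as below. The interface is hypothetical and much smaller than the real one: it keeps a tuple count instead of a full tuple shape, and `IStringStoreLike`, `at`, and `size` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical minimal interface shared by real and placeholder stores.
class IStringStoreLike
{
public:
  virtual ~IStringStoreLike() = default;
  virtual std::size_t size() const = 0;
  virtual const std::string& at(std::size_t index) const = 0;
  virtual bool isPlaceholder() const = 0;
};

// Real store: owns one string per tuple.
class MiniStringStore : public IStringStoreLike
{
public:
  explicit MiniStringStore(std::vector<std::string> values)
  : m_Values(std::move(values))
  {
  }
  std::size_t size() const override { return m_Values.size(); }
  const std::string& at(std::size_t index) const override { return m_Values.at(index); }
  bool isPlaceholder() const override { return false; }

private:
  std::vector<std::string> m_Values;
};

// Placeholder: remembers only how many tuples exist; any data access throws.
class MiniEmptyStringStore : public IStringStoreLike
{
public:
  explicit MiniEmptyStringStore(std::size_t numTuples)
  : m_NumTuples(numTuples)
  {
  }
  std::size_t size() const override { return m_NumTuples; }
  const std::string& at(std::size_t) const override
  {
    throw std::runtime_error("placeholder store holds no data");
  }
  bool isPlaceholder() const override { return true; }

private:
  std::size_t m_NumTuples;
};
```

The placeholder costs a few bytes regardless of the tuple count, whereas a real store of the same size would allocate one string per tuple.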

--- HDF5 I/O ---

DataStoreIO::ReadDataStore gains two interception paths before the
normal in-core load: (1) recovery file detection via OocBackingFilePath
HDF5 attributes, creating a read-only reference store pointing at the
backing file; (2) OOC format resolution via resolveFormat(), creating a
read-only reference store directly from the source .dream3d file with
no temp copy.

DataArrayIO::writeData always calls WriteDataStore
directly — OOC stores materialize their data through the plugin's
writeHdf5() method; recovery writes use WriteArrayOverrideFnc.

NeighborListIO gains OOC interception: computes total neighbor count,
calls resolveFormat(), and creates a read-only ref list store when an
OOC format is available. Legacy NeighborList reading passes a preflight
flag through the entire call chain (readLegacyNeighborList ->
createLegacyNeighborList -> ReadHdf5Data) so legacy .dream3d imports
create EmptyListStore placeholders instead of eagerly loading per-
element via setList().

DataStructureWriter checks WriteArrayOverrideFnc before normal writes,
giving the registered plugin callback first chance to handle each
data object.

Add explicit template instantiations for DatasetIO::createEmptyDataset
and DatasetIO::writeSpanHyperslab for all numeric types plus bool.
These are needed by the SimplnxOoc plugin's AbstractOocStore::writeHdf5(),
which cannot use writeSpan() because the full array is not in memory.
Instead it creates an empty dataset, then fills it region-by-region
via hyperslab writes as it streams data from the backing file.

--- Preferences ---

Add unified oocMemoryBudgetBytes preference (default 8 GB) that
the ChunkCache, visualization, and stride cache all use. Add k_InMemoryFormat
sentinel constant for explicit in-core format choice. Add migration
logic to erase legacy empty-string and "In-Memory" preference values.
checkUseOoc() now tests against k_InMemoryFormat.
setLargeDataFormat("") removes the key so plugin defaults take effect.

--- Algorithm Infrastructure ---

AlgorithmDispatch: Add ForceInCoreAlgorithm/ForceOocAlgorithm global
flags with RAII guards. Add DispatchAlgorithm template that selects
Direct (in-core) vs Scanline (OOC) algorithm variant based on store
types and force flags. Add SIMPLNX_TEST_ALGORITHM_PATH CMake option
(0=both, 1=OOC-only, 2=InCore-only) for dual-dispatch test control.
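
A reduced sketch of the dispatch-plus-guard idea, assuming a single force flag and algorithms that take no constructor arguments. The names `forceOocFlag`, `ForceOocGuard`, and `dispatchAlgorithm` are hypothetical, and the in-core force flag mentioned above is left out for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

enum class StoreType
{
  InMemory,
  OutOfCore,
  Empty
};

// Hypothetical global flag that forces the OOC path (used by tests).
inline bool& forceOocFlag()
{
  static bool s_Flag = false;
  return s_Flag;
}

// RAII guard: sets the flag on construction and restores the old value on destruction.
class ForceOocGuard
{
public:
  ForceOocGuard()
  : m_Previous(forceOocFlag())
  {
    forceOocFlag() = true;
  }
  ~ForceOocGuard()
  {
    forceOocFlag() = m_Previous;
  }
  ForceOocGuard(const ForceOocGuard&) = delete;
  ForceOocGuard& operator=(const ForceOocGuard&) = delete;

private:
  bool m_Previous;
};

// Runs ScanlineT if any store is out-of-core or the flag is set, DirectT otherwise.
template <class DirectT, class ScanlineT>
std::string dispatchAlgorithm(const std::vector<StoreType>& stores)
{
  const bool anyOoc = std::any_of(stores.begin(), stores.end(), [](StoreType type) { return type == StoreType::OutOfCore; });
  if(forceOocFlag() || anyOoc)
  {
    ScanlineT algorithm;
    return algorithm();
  }
  DirectT algorithm;
  return algorithm();
}

// Toy algorithms that only report which path ran.
struct DirectAlgo
{
  std::string operator()() const { return "direct"; }
};

struct ScanlineAlgo
{
  std::string operator()() const { return "scanline"; }
};
```

A test can wrap a filter run in a guard scope to exercise the Scanline path on small in-memory data, and the flag reverts automatically when the scope ends.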

IParallelAlgorithm: Remove blanket TBB disabling for OOC data — OOC
stores are now thread-safe via ChunkCache + HDF5 global mutex.
CheckStoresInMemory/CheckArraysInMemory use StoreType instead of
getDataFormat().

VtkUtilities: Rewrite binary write path to read into 4096-element
buffers via copyIntoBuffer, byte-swap in the buffer, and fwrite —
replacing direct DataStore data() pointer access.
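
The buffer, swap, and fwrite sequence might be sketched like this for 32-bit floats. It assumes a little-endian host writing big-endian output; `byteSwap32`, `byteSwapBuffer`, and `writeSwapped` are hypothetical names, and a `std::vector` stands in for the store.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

// Reverses the byte order of one 32-bit word.
inline std::uint32_t byteSwap32(std::uint32_t v)
{
  return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) | ((v & 0x00FF0000u) >> 8) | ((v & 0xFF000000u) >> 24);
}

// Byte-swaps `count` floats in place, going through memcpy to avoid aliasing issues.
void byteSwapBuffer(float* data, std::size_t count)
{
  for(std::size_t i = 0; i < count; i++)
  {
    std::uint32_t bits = 0;
    std::memcpy(&bits, &data[i], sizeof(bits));
    bits = byteSwap32(bits);
    std::memcpy(&data[i], &bits, sizeof(bits));
  }
}

// Streams `source` to `file` in chunks; returns the number of elements written.
std::size_t writeSwapped(const std::vector<float>& source, std::FILE* file, std::size_t chunkSize = 4096)
{
  assert(file != nullptr && chunkSize > 0);
  std::vector<float> buffer(chunkSize);
  std::size_t written = 0;
  for(std::size_t start = 0; start < source.size(); start += chunkSize)
  {
    const std::size_t count = std::min(chunkSize, source.size() - start);
    // Stand-in for a bulk read from the store into the local buffer.
    std::copy(source.data() + start, source.data() + start + count, buffer.data());
    byteSwapBuffer(buffer.data(), count);
    written += std::fwrite(buffer.data(), sizeof(float), count, file);
  }
  return written;
}
```

Only the small local buffer is modified, so the source data keeps its native byte order.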

--- Filter Algorithm Updates ---

FillBadData: Rewrite phaseOneCCL and phaseThreeRelabeling to use
Z-slab buffered I/O via copyIntoBuffer/copyFromBuffer instead of
the removed chunk API (loadChunk, getChunkLowerBounds, etc.).
operator()() scans feature counts in 64K-element chunks via
copyIntoBuffer.

QuickSurfaceMesh: Remove getChunkShape() call in generateTripleLines()
that set ParallelData3DAlgorithm chunk size, as the chunk API no
longer exists on AbstractDataStore.

--- File Import ---

ImportH5ObjectPathsAction: Add deferred-load pattern. When a backfill
handler is registered, pass preflight=true to create placeholder stores
during import, then call runBackfillHandler() after all paths are
imported to let the plugin finalize.

Dream3dIO: Add WriteRecoveryFile() that wraps WriteFile with WriteArrayOverrideGuard.

--- Utility Changes ---

DataStoreUtilities: Remove TryForceLargeDataFormatFromPrefs entirely.
CreateDataStore and CreateListStore call resolveFormat() on the IO
collection. ArrayCreationUtilities: check k_InMemoryFormat sentinel
before skipping memory checks.

ITKArrayHelper/ITKTestBase: OOC checks use getStoreType() instead of
getDataFormat().empty(). IsArrayInMemory simplified from a 40-line
DataType switch to a single StoreType check.

ArraySelectionParameter: Remove EmptyOutOfCore handling; simplify to
just StoreType::Empty.

--- Tests ---

Add EmptyStringStore tests (6 cases: metadata, zero tuples, throwing
access, deep copy placeholder preservation, resize, isPlaceholder).
Add DataIOCollection hooks tests (format resolver, backfill handler).
Add IOFormat tests (7 cases: InMemory sentinel, empty format,
resolveFormat with/without plugin). Add IParallelAlgorithm OOC tests
(8 cases with MockOocDataStore: TBB enablement for in-memory, OOC,
and mixed arrays/stores).

Remove the "Target DataStructure Size" test from IOFormat.cpp — it
was a tautology that re-implemented the same arithmetic as
updateMemoryDefaults() without testing any edge case or behavior.

Fix RodriguesConvertorTest exemplar data: add missing expected values
for the 4th tuple (indices 12-15). The old CompareDataArrays broke
on the first floating-point mismatch regardless of magnitude, masking
this incomplete exemplar. The new chunked comparison correctly
continues past epsilon-close differences, exposing the missing data.

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
Add comprehensive documentation to all new methods, type aliases,
classes, and algorithms introduced in the OOC architecture rewrite.
Every new public API now has Doxygen explaining what it does, how it
works, and why it is needed. Algorithm implementations have step-by-
step inline comments explaining the logic.

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
…ation layer

Move the format resolver call site from the low-level DataStoreUtilities::
CreateDataStore/CreateListStore functions up to the array creation layer
(ArrayCreationUtilities::CreateArray and ImportH5ObjectPathsAction). This
is a prerequisite for the upcoming data store import handler refactor.

Key architectural changes:

1. FormatResolverFnc signature expanded to (DataStructure, DataPath,
   DataType, dataSizeBytes). The resolver can now walk parent objects to
   determine geometry type, enabling it to force in-core for unstructured/
   poly geometry arrays without caller-side checks.

2. Format resolution removed from DataStoreUtilities::CreateDataStore and
   CreateListStore. These are now simple factories that take an already-
   resolved format string. Callers are responsible for calling the resolver.

3. CreateArrayAction no longer carries a dataFormat member or constructor
   parameter. The k_DefaultDataFormat constant is removed. Format is
   resolved at execute time inside ArrayCreationUtilities::CreateArray.

4. ImportH5ObjectPathsAction gains a format-resolver loop that iterates
   Empty-store DataArrays after preflight import, consulting the resolver
   to decide which arrays to eager-load (in-core) vs leave for the
   backfill handler (OOC).

5. DataStoreIO::ReadDataStore and NeighborListIO::finishImportingData lose
   their inline format-resolution and OOC reference-store creation code.
   Format decisions for imported data are now made at the action level,
   not during raw HDF5 I/O.

6. Geometry actions (CreateGeometry1D/2D/3DAction, CreateVertexGeometry,
   CreateRectGridGeometry) lose their createdDataFormat parameter. They
   now materialize OOC topology arrays into in-core stores when the source
   arrays have StoreType::OutOfCore, since unstructured/poly geometry
   topology must be in-core for the visualization layer.

7. CheckMemoryRequirement simplified to a pure RAM check. OOC fallback
   logic removed since the resolver handles format decisions upstream.

All filter callers updated to drop the dataFormat argument from
CreateArrayAction constructors. Python binding updated (data_format
parameter renamed to fill_value). Test files updated for new
resolveFormat signature.
@joeykleingers joeykleingers force-pushed the ooc-filter-optimizations branch from ed27778 to be9ed24 Compare May 5, 2026 02:41
joeykleingers and others added 28 commits May 5, 2026 09:28
Adds a unified memory budget manager to simplnx. Tracks allocations
registered by cache subsystems (chunking-layer cache, visualization
stride cache, etc.) and evicts globally-oldest entries via callbacks
when memory pressure exceeds the configured budget. Singleton; thread-
safe; the public API matches the previous OocMemoryBudgetManager but
lives in simplnx core so non-OOC builds and visualization code can use
it without depending on the SimplnxOoc plugin.
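
The register-and-evict behavior described above could be sketched as follows. This is a simplified model, not the manager's API: `MiniBudgetManager` and `registerAllocation` are hypothetical names, "oldest" is taken to mean registration order, and the singleton aspect is omitted.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

class MiniBudgetManager
{
public:
  using EvictFn = std::function<void()>;

  explicit MiniBudgetManager(std::uint64_t budgetBytes)
  : m_Budget(budgetBytes)
  {
  }

  // Records an allocation; evicts the oldest entries while usage exceeds the budget.
  void registerAllocation(std::uint64_t bytes, EvictFn onEvict)
  {
    std::vector<EvictFn> toRun;
    {
      std::lock_guard<std::mutex> lock(m_Mutex);
      m_Entries.push_back(Entry{bytes, std::move(onEvict)});
      m_Used += bytes;
      // Never evict the entry that was just registered.
      while(m_Used > m_Budget && m_Entries.size() > 1)
      {
        Entry oldest = std::move(m_Entries.front());
        m_Entries.pop_front();
        m_Used -= oldest.bytes;
        toRun.push_back(std::move(oldest.onEvict));
      }
    }
    // Callbacks run outside the lock so they may call back into the manager.
    for(EvictFn& fn : toRun)
    {
      if(fn)
      {
        fn();
      }
    }
  }

  std::uint64_t usedBytes() const
  {
    std::lock_guard<std::mutex> lock(m_Mutex);
    return m_Used;
  }

private:
  struct Entry
  {
    std::uint64_t bytes;
    EvictFn onEvict;
  };

  mutable std::mutex m_Mutex;
  std::deque<Entry> m_Entries;
  std::uint64_t m_Budget;
  std::uint64_t m_Used = 0;
};
```

Because the entries of every cache subsystem share one queue, pressure from one subsystem can evict the oldest entry of another, which is the point of a unified budget.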

Also renames the Preferences accessors and key constant for consistency
with the new manager name:
  - Preferences::oocMemoryBudgetBytes()       -> memoryBudgetBytes()
  - Preferences::setOocMemoryBudgetBytes(...) -> setMemoryBudgetBytes(...)
  - Preferences::k_OocMemoryBudgetBytes_Key   -> k_MemoryBudgetBytes_Key
  - Persisted key string "ooc_memory_budget_bytes" -> "memory_budget_bytes"
    (older saved user preferences won't load; the manager falls back to
    50% of system RAM the first time the new key is read.)

Preferences::memoryBudgetBytes() now defaults to 50% of system RAM when
the user has not saved an explicit budget. Previously, the fallback
was a hard-coded 8 GB and the SimplnxOoc plugin pushed the 50%-RAM
default at startup; with the manager living in simplnx core the
default can be computed in-place via MemoryBudgetManager::defaultBudgetBytes(),
eliminating the plugin-startup race.

Test moved from simplnx-ooc/test/OocMemoryBudgetManagerTest.cpp.
Rename 13 algorithm files to their in-core variant names in preparation
for adding OOC (out-of-core) dispatch alternatives. This enables git
rename tracking so that subsequent optimization commits show proper
diffs against the original algorithm code.

Renames (SimplnxCore):
  FillBadData -> FillBadDataBFS
  IdentifySample -> IdentifySampleBFS
  ComputeBoundaryCells -> ComputeBoundaryCellsDirect
  ComputeFeatureNeighbors -> ComputeFeatureNeighborsDirect
  ComputeSurfaceAreaToVolume -> ComputeSurfaceAreaToVolumeDirect
  ComputeSurfaceFeatures -> ComputeSurfaceFeaturesDirect
  SurfaceNets -> SurfaceNetsDirect
  QuickSurfaceMesh -> QuickSurfaceMeshDirect
  DBSCAN -> DBSCANDirect
  ComputeKMedoids -> ComputeKMedoidsDirect
  MultiThresholdObjects -> MultiThresholdObjectsDirect

Renames (OrientationAnalysis):
  BadDataNeighborOrientationCheck -> BadDataNeighborOrientationCheckWorklist

No logic changes. InputValues structs and filter classes unchanged.
…ntationAnalysis

Replace per-element DataStore access with chunked bulk I/O
(copyIntoBuffer/copyFromBuffer) across 60+ algorithm files to eliminate
virtual dispatch overhead and HDF5 chunk thrashing when arrays are backed
by out-of-core storage.

--- Architecture ---

DispatchAlgorithm pattern (Direct/Scanline):
  11 algorithms gain a base dispatcher class that selects between an
  in-core Direct implementation and an OOC Scanline variant at runtime
  based on IsOutOfCore()/ForceOocAlgorithm():
    SimplnxCore: ComputeBoundaryCells, ComputeFeatureNeighbors,
      ComputeKMedoids, ComputeSurfaceAreaToVolume, ComputeSurfaceFeatures,
      DBSCAN, MultiThresholdObjects, QuickSurfaceMesh, SurfaceNets
    OrientationAnalysis: BadDataNeighborOrientationCheck, ComputeIPFColors
  ComputeGBCDPoleFigure dispatches directly from its filter executeImpl().
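
The dispatch idea described above can be sketched roughly as follows. This is a minimal illustration, not the actual simplnx `DispatchAlgorithm` signature: `ArrayInfo`, `forceOocFlag()`, `DirectAlgo`, `ScanlineAlgo`, and `dispatch()` are all hypothetical names introduced here.

```cpp
#include <cassert>
#include <initializer_list>
#include <string>

// Hypothetical stand-in for a data array; it only records where the data lives.
struct ArrayInfo
{
  bool outOfCore = false;
};

// Hypothetical global override, similar in spirit to ForceOocAlgorithm().
inline bool& forceOocFlag()
{
  static bool flag = false;
  return flag;
}

// Two interchangeable variants of the same (dummy) algorithm.
struct DirectAlgo
{
  std::string operator()() const { return "Direct"; }
};
struct ScanlineAlgo
{
  std::string operator()() const { return "Scanline"; }
};

// Run the OOC variant if any input array is out-of-core or the override is
// set; otherwise run the in-core variant.
template <class InCoreT, class OocT>
std::string dispatch(std::initializer_list<const ArrayInfo*> arrays)
{
  bool useOoc = forceOocFlag();
  for(const ArrayInfo* array : arrays)
  {
    assert(array != nullptr);
    useOoc = useOoc || array->outOfCore;
  }
  return useOoc ? OocT{}() : InCoreT{}();
}
```

Calling `dispatch<DirectAlgo, ScanlineAlgo>({&a, &b})` picks the Scanline variant as soon as one input is out-of-core, which is why the in-core path pays no extra cost beyond the check itself.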

Connected Component Labeling (CCL) pattern:
  4 algorithms gain a two-pass CCL variant as an OOC alternative to
  random-access BFS/DFS flood-fill:
    SimplnxCore: FillBadData (BFS/CCL), IdentifySample (BFS/CCL)
    OrientationAnalysis: EBSDSegmentFeatures, CAxisSegmentFeatures
  The CCL engine in SegmentFeatures::executeCCL() scans voxels in Z-Y-X
  order with a 2-slice rolling buffer and UnionFind equivalence tracking,
  giving sequential I/O access patterns. Supports Face and FaceEdgeVertex
  connectivity with optional periodic boundaries.
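
To make the two-pass CCL idea concrete, here is a simplified sketch: a 2D, in-memory, face-connected labeling that uses a vector-based union-find with union-by-rank and path halving. It is not the `SegmentFeatures::executeCCL()` implementation (which works on 3D data with a 2-slice rolling buffer); the class layout and `labelComponents()` are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Illustrative disjoint-set structure (union-by-rank + path halving).
class UnionFind
{
public:
  int32_t makeSet()
  {
    m_Parent.push_back(static_cast<int32_t>(m_Parent.size()));
    m_Rank.push_back(0);
    return m_Parent.back();
  }
  int32_t find(int32_t x)
  {
    while(m_Parent[x] != x)
    {
      m_Parent[x] = m_Parent[m_Parent[x]]; // path halving
      x = m_Parent[x];
    }
    return x;
  }
  void unite(int32_t a, int32_t b)
  {
    a = find(a);
    b = find(b);
    if(a == b) { return; }
    if(m_Rank[a] < m_Rank[b]) { std::swap(a, b); }
    m_Parent[b] = a; // attach the shallower tree under the deeper one
    if(m_Rank[a] == m_Rank[b]) { ++m_Rank[a]; }
  }
  size_t size() const { return m_Parent.size(); }

private:
  std::vector<int32_t> m_Parent;
  std::vector<uint8_t> m_Rank;
};

// Two-pass labeling of a row-major 2D mask. Output: 0 = background,
// 1..N = component ids in first-seen scan order.
std::vector<int32_t> labelComponents(const std::vector<uint8_t>& mask, size_t width, size_t height)
{
  assert(mask.size() == width * height);
  std::vector<int32_t> labels(mask.size(), -1);
  UnionFind uf;
  // Pass 1: sequential scan; only already-visited neighbors are inspected.
  for(size_t y = 0; y < height; y++)
  {
    for(size_t x = 0; x < width; x++)
    {
      const size_t i = y * width + x;
      if(mask[i] == 0) { continue; }
      const int32_t left = (x > 0) ? labels[i - 1] : -1;
      const int32_t up = (y > 0) ? labels[i - width] : -1;
      if(left < 0 && up < 0) { labels[i] = uf.makeSet(); }
      else if(left >= 0 && up >= 0) { labels[i] = left; uf.unite(left, up); }
      else { labels[i] = (left >= 0) ? left : up; }
    }
  }
  // Pass 2: resolve provisional labels to roots and compact them to 1..N.
  std::vector<int32_t> compact(uf.size(), 0);
  int32_t next = 0;
  for(int32_t& label : labels)
  {
    if(label < 0) { label = 0; continue; }
    const int32_t root = uf.find(label);
    if(compact[root] == 0) { compact[root] = ++next; }
    label = compact[root];
  }
  return labels;
}
```

Both passes walk the data strictly in storage order, which is the property that makes the approach friendly to chunked, out-of-core storage compared with a flood fill that jumps between distant voxels.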

--- New utility infrastructure ---

- UnionFind (src/simplnx/Utilities/UnionFind.hpp):
  Vector-based disjoint set with union-by-rank and path-halving.

- SliceBufferedTransfer (src/simplnx/Utilities/SliceBufferedTransfer.hpp):
  Z-slice buffered tuple transfer for propagating neighbor voxel data
  used by ErodeDilate, FillBadData, MinNeighbors, and ReplaceElements.

- TupleTransfer batch API (Filters/Algorithms/TupleTransfer.hpp):
  Batch bulk I/O methods for QuickSurfaceMesh and SurfaceNets mesh
  generation attribute transfer.

- SegmentFeaturesTestUtils.hpp:
  Shared test builder functions for segmentation filter test suites.

--- Bulk I/O conversions (existing algorithms) ---

Core utilities:
  DataArrayUtilities (ImportFromBinaryFile, AppendData, CopyData,
    mirror ops), DataGroupUtilities (RemoveInactiveObjects),
  ClusteringUtilities (RandomizeFeatureIds), GeometryHelpers
    (FindElementsContainingVert, FindElementNeighbors),
  AlignSections (Z-slice OOC transfer path),
  ImageRotationUtilities (source slab caching for nearest-neighbor),
  TriangleUtilities (bulk-load triangles/labels for winding repair),
  H5DataStore (streaming row-batch FillOocDataStore replacing full-
    dataset allocation)

SimplnxCore algorithms:
  AlignSectionsFeatureCentroid, ComputeEuclideanDistMap,
  ComputeFeatureCentroids, ComputeFeatureClustering, ComputeFeatureSizes,
  CropImageGeometry, ErodeDilateBadData, ErodeDilateCoordinationNumber,
  ErodeDilateMask, RegularGridSampleSurfaceMesh, RequireMinimumSizeFeatures,
  ReplaceElementAttributesWithNeighborValues, ScalarSegmentFeatures,
  WriteAvizoRectilinearCoordinate, WriteAvizoUniformCoordinate

OrientationAnalysis algorithms:
  AlignSectionsMisorientation, AlignSectionsMutualInformation,
  ComputeAvgCAxes, ComputeAvgOrientations, ComputeCAxisLocations,
  ComputeFeatureNeighborCAxisMisalignments,
  ComputeFeatureReferenceCAxisMisorientations,
  ComputeFeatureReferenceMisorientations, ComputeGBCD,
  ComputeGBCDMetricBased, ComputeKernelAvgMisorientations,
  ComputeTwinBoundaries, ConvertOrientations, MergeTwins,
  NeighborOrientationCorrelation, RotateEulerRefFrame, WriteGBCDGMTFile,
  WriteGBCDTriangleData, WritePoleFigure

EBSD readers:
  ReadAngData, ReadCtfData, ReadH5Ebsd, ReadH5EspritData

--- Test infrastructure ---

- UnitTestCommon: ExpectedStoreType()/RequireExpectedStoreType() helpers,
  TestFileSentinel reference-counted decompression, CompareDataArrays
  rewritten with chunked bulk I/O for OOC-safe comparison.

- 29 test files updated with OOC dual-path testing:
  ForceOocAlgorithmGuard + GENERATE(from_range(k_ForceOocTestValues))
  runs every test case in both in-core and forced-OOC modes.
… bugs

Add CreateResolvedDataStore utility that runs the IOCollection format
resolver before creating a DataStore, matching the path filter actions
use. Update test builder functions to call it so that test-constructed
arrays become OOC stores when the OOC plugin is active.

Fix three bugs in the OOC ComputeAvgOrientations Rodrigues average:
- Allow featureId 0 in accumulation (matching architecture branch)
- Start normalization loop from featureId 0
- Add missing continue for zero-count features to avoid divide-by-zero

Fix stale GetIOCollection API call in UnitTestCommon (shared_ptr to ref).

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
…mized algorithms

Adds extensive documentation across all out-of-core optimized filter
algorithms explaining what each algorithm does and why the OOC variant
works the way it does. Targets readers with no prior OOC knowledge.

- Headers: Doxygen @class, @brief, @param on all classes, methods,
  InputValues structs, and member variables
- Source files: file-level overviews, Doxygen on operator()(), and
  inline comments explaining rolling windows, buffer strategies,
  dispatch logic, and OOC rationale
- Filter docs: Algorithm sections with In-Core/Out-of-Core/Performance
  subsections added to ~45 filter markdown files
- Key utilities: SliceBufferedTransfer.hpp and TupleTransfer.hpp
  documented as core OOC infrastructure
…writes

WritePoleFigure was missed in the d4f2cce OOC-optimization sweep and
was timing out (>300 s) on OOC-backed inputs under the OocOnly test
path. Four Catch2 tests exercising it (Discrete, Discrete-Masked,
Color, Color-Masked) were failing with Timeout on OOC-Release ctest.

Hot paths replaced:

1. Cell-phase + Euler-angle gather loops (algorithm body). Two passes
   per phase over `numPoints` cells that used `phases[i]` +
   `eulerAngles[i * 3 + {0,1,2}]` per-element reads — one HDF5 hit
   per element on OOC stores, so 4 hits per cell and 4N per pass. Replaced with a
   chunk-sequential stream: 64K-tuple `std::vector` buffers filled
   via `copyIntoBuffer()` once per chunk; the inner loops iterate
   over the in-memory buffers. Peak auxiliary memory bounded to
   ~1 MB regardless of input size; NOT an O(N) bulk allocation.

2. Intensity-plot write-back (`std::copy(..., array.begin())` on
   three `Float64Array` outputs per phase). Per-element operator[]
   writes on OOC-backed `Float64Array` targets are one HDF5 write
   per pixel; a 512x512 image emits 262K hits per plot, ~786K per
   phase. Replaced with `copyFromBuffer()` per output array — one
   write per image.

3. Composite RGB image pack (for `figures.size() == 3` phases). The
   loop that interleaved RGBA source bytes into RGB-packed
   `UInt8Array` output wrote three `imageData[...] = ...` elements
   per pixel. Now builds the packed RGB buffer in a local
   `std::vector<uint8>` first, then emits one `copyFromBuffer()`
   call for the whole image.

No algorithm changes. `metaDataArrayRef[phase] = ...` kept as-is —
that's a single write keyed by phase count, not a hot loop.
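
The chunk-sequential streaming pattern in item 1 might look roughly like the sketch below. `MockStore` and its `copyIntoBuffer(start, dest, count)` signature are assumptions standing in for an OOC-backed store (the real simplnx signature may differ); the counter only exists to show how few bulk reads are issued.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical store that counts bulk reads.
struct MockStore
{
  std::vector<int32_t> data;
  size_t bulkReads = 0;

  // Copy `count` values starting at `start` into `dest` (assumed signature).
  void copyIntoBuffer(size_t start, int32_t* dest, size_t count)
  {
    assert(start + count <= data.size());
    std::copy_n(data.begin() + static_cast<std::ptrdiff_t>(start), count, dest);
    ++bulkReads;
  }
};

// Count the cells that belong to `phase`, reading fixed-size chunks so the
// auxiliary memory stays bounded by the chunk size, not the input size.
size_t countPhase(MockStore& phases, int32_t phase, size_t chunkTuples = 65536)
{
  assert(chunkTuples > 0);
  std::vector<int32_t> buffer(chunkTuples);
  const size_t total = phases.data.size();
  size_t count = 0;
  for(size_t start = 0; start < total; start += chunkTuples)
  {
    const size_t n = std::min(chunkTuples, total - start);
    phases.copyIntoBuffer(start, buffer.data(), n); // one bulk read per chunk
    count += static_cast<size_t>(std::count(buffer.begin(), buffer.begin() + static_cast<std::ptrdiff_t>(n), phase));
  }
  return count;
}
```

For 200,000 cells and 64K-tuple chunks this issues 4 bulk reads instead of 200,000 per-element reads.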

Test results on OOC-Release (InCore continues to pass — no in-core
regression):

  WritePoleFigureFilter-Discrete         2.88 s (was: Timeout)
  WritePoleFigureFilter-Discrete-Masked  4.39 s (was: Timeout)
  WritePoleFigureFilter-Color            3.29 s (was: Timeout)
  WritePoleFigureFilter-Color-Masked     4.74 s (was: Timeout)

All four now pass within the 300 s ctest timeout.

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
* Add download_test_data entries for fill_bad_data_exemplars.tar.gz
  and identify_sample_exemplars.tar.gz to SimplnxCore test CMakeLists
* Add download_test_data entries for segment_features_exemplars.tar.gz
  to both SimplnxCore and OrientationAnalysis test CMakeLists
* Fixes CI failures where FillBadData, IdentifySampleFilter,
  ScalarSegmentFeatures, CAxisSegmentFeatures, and EBSDSegmentFeatures
  tests could not locate their exemplar archives

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
* Replace per-row copyIntoBuffer/copyFromBuffer pair (one pair per
  (z,y) scanline) with a K=32 Z-slice batched slab I/O pattern
* Read K full source Z-slices per bulk call, extract the crop region
  via std::memcpy per row, and write K destination Z-slices in one
  bulk call
* Working-set bound is O(n^(2/3)):
  (k_ZSliceBatch + 2) * X * Y * numComps * sizeof(T) bytes
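
A simplified sketch of the slab-batched crop follows. It assumes a row-major (Z-Y-X) in-memory volume, models the bulk read as one contiguous copy of K source slices, and extracts rows straight into the output instead of staging K destination slices; `Box` and `cropBatched()` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Half-open crop bounds in voxel coordinates.
struct Box
{
  size_t x0, x1, y0, y1, z0, z1;
};

std::vector<float> cropBatched(const std::vector<float>& src, size_t dimX, size_t dimY, const Box& box, size_t batch = 32)
{
  const size_t outX = box.x1 - box.x0;
  const size_t outY = box.y1 - box.y0;
  const size_t outZ = box.z1 - box.z0;
  const size_t srcSlice = dimX * dimY;
  const size_t outSlice = outX * outY;
  assert(batch > 0 && box.x1 <= dimX && box.y1 <= dimY && box.z1 * srcSlice <= src.size());

  std::vector<float> dst(outSlice * outZ);
  std::vector<float> slab(batch * srcSlice); // working set: `batch` full slices
  for(size_t z = box.z0; z < box.z1; z += batch)
  {
    const size_t k = std::min(batch, box.z1 - z);
    // "Bulk read": one contiguous copy of k full source Z-slices.
    std::memcpy(slab.data(), src.data() + z * srcSlice, k * srcSlice * sizeof(float));
    for(size_t dz = 0; dz < k; dz++)
    {
      for(size_t y = 0; y < outY; y++)
      {
        // Extract the cropped X range of one row from the in-memory slab.
        const float* row = slab.data() + dz * srcSlice + (box.y0 + y) * dimX + box.x0;
        float* out = dst.data() + (z - box.z0 + dz) * outSlice + y * outX;
        std::memcpy(out, row, outX * sizeof(float));
      }
    }
  }
  return dst;
}
```

The slab buffer scales with `batch * X * Y`, never with the number of Z-slices, which is where the sub-linear working-set bound comes from.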

Tests: 16/16 pass on OOC build.

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
* Remove the O(n) dense neighborsVoxelIndex array in favor of sparse
  parallel vectors (changedVoxels + neighborVoxelIdxs), saving ~16 GB
  on CT_align-scale volumes
* Hoist slabBuf above the convergence while-loop so it is no longer
  re-allocated every iteration
* Delete dead class RequireMinimumSizeFeaturesTransferDataImpl
* Add ChunkedTransferWorker<T> doing Z-batched bulk I/O for the
  transfer phase, dispatched via ExecuteParallelFunction with
  type-based dispatch and parallelized per cell-level array via
  ParallelTaskAlgorithm; 64 MB/task/array slab budget

Tests: 1/1 pass. Unit-test time unchanged at 0.23 s.

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
* Replace the dense triNewIndex (8 B per triangle) with a triMask
  bitset (1 bit per triangle) plus a sparse triPrefixSum popcount
  table for ~6x memory reduction on the triangle side
* Keep vertNewIndex as a dense 8 B per vertex map to preserve the
  invariant that triangle 0's three fresh vertices are assigned
  compact indices 0, 1, 2 (other filters depend on this ordering)
* Stream all passes with chunked bulk I/O; vertex copies use bulk
  source reads + per-vertex dest writes (required by the ordering
  invariant), triangle copies use bulk reads and bulk writes
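
The bitset-plus-prefix-sum remap can be illustrated with the sketch below. The table layout here (one running count per 64-bit word) is an assumption chosen for brevity and may not match the sparser table used in the filter; `TriangleRemap` and its members are hypothetical names.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct TriangleRemap
{
  std::vector<uint64_t> mask;   // 1 bit per triangle: 1 = kept
  std::vector<uint64_t> prefix; // kept-triangle count before each 64-bit word

  explicit TriangleRemap(const std::vector<bool>& keep)
  : mask((keep.size() + 63) / 64, 0)
  , prefix((keep.size() + 63) / 64, 0)
  {
    for(size_t i = 0; i < keep.size(); i++)
    {
      if(keep[i]) { mask[i / 64] |= (uint64_t{1} << (i % 64)); }
    }
    uint64_t running = 0;
    for(size_t w = 0; w < mask.size(); w++)
    {
      prefix[w] = running;
      running += std::bitset<64>(mask[w]).count();
    }
  }

  bool kept(size_t tri) const { return ((mask[tri / 64] >> (tri % 64)) & 1u) != 0; }

  // Compact new index of a kept triangle: table lookup + popcount of the
  // lower bits within the triangle's own word.
  uint64_t remapIndex(size_t tri) const
  {
    assert(kept(tri));
    const uint64_t below = mask[tri / 64] & ((uint64_t{1} << (tri % 64)) - 1);
    return prefix[tri / 64] + std::bitset<64>(below).count();
  }
};
```

With kept triangles at positions 0, 3, 64, and 129, `remapIndex()` returns 0, 1, 2, and 3 respectively, without ever storing an 8-byte index per triangle.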

Tests: 4/4 pass.

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
* Mirror the nearest-neighbor slab-cache pattern in the trilinear
  interpolation path with a +/-2 Z-slice margin so all 8 corner
  neighbors remain resident, avoiding per-voxel random reads
* Slide the slab cache window when consecutive output slices shift
  the needed source Z range — memmove the surviving slices in the
  buffer and read only the delta slices instead of re-reading the
  full slab
* Parallelize the inner output-row loop for both the nearest-neighbor
  and trilinear paths via ParallelDataAlgorithm; threads share the
  cached slab (read-only) and write disjoint Y-row ranges of the
  local output slice buffer
* Replace per-element at()/setValue() calls in the node-geometry
  convert path with 16 K-vertex chunked copyIntoBuffer/copyFromBuffer
  bulk I/O

CT_align (1.97 B-voxel trilinear rotation): 133 s -> 20 s (~6.6x).

Tests: 14/14 pass on both in-core and OOC builds.

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
* Replace per-triangle getFaceCoordinates() random reads with a
  chunked pipeline: bulk-read 65K triangle connectivity indices per
  pass, determine the referenced vertex-index span, and bulk-load
  that vertex range into a local buffer
* Parallelize the area compute on the local buffer (reads/writes
  touch plain C++ arrays only, so threads are safe — no DataStore
  access inside the parallel region)
* Bulk-write the chunk's area output in one call
* Guard against pathological meshes whose vertex indexing spans
  more than 16M entries per chunk with a serial per-triangle
  fallback; filter-generated meshes stay well under this cap

CT_align (mesh-scale triangle areas): 26 s -> <1 s (>26x).

Tests: 1/1 pass on both in-core and OOC builds.

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
… inputs

Bump the FeatureId (and RectGrid element-size) bulk-read chunk size
from 64K to 256K tuples. The voxel-counting pass is I/O-bound on
OOC-backed stores; larger chunks reduce copyIntoBuffer() round-trip
overhead on datasets with tens of thousands of chunks while keeping
per-chunk working-set memory bounded (1 MB for the int32 buffer,
and an additional 1 MB for the float32 element-size buffer on the
RectGrid path).

CT_align (1.97 B voxels, Image path): 14 s -> 13 s.

Tests: 9/9 pass on the OOC build.

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
* Rewrite the markdown Algorithm section to explain the crop as a 3D
  subarray copy from first principles, teach the Z-slice-batched
  bulk I/O strategy step-by-step, and quantify why batching by
  K Z-slices collapses HDF5 chunk-op overhead
* Add a Doxygen block on CropImageGeomDataArray describing the
  per-pass pipeline (bulk read slab -> in-memory extract -> bulk
  write) and the O(slab), non-O(volume) memory bound

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
Rewrite the Algorithm section so a reader unfamiliar with the filter
can follow the two-phase pipeline end-to-end:

* Phase 1 (feature removal): motivate why small features get pruned,
  describe the 64K-tuple chunked scan, and explain the "skip write
  when chunk unchanged" optimization
* Phase 2 (gap fill by majority-vote): teach the rolling 3-slice
  buffer scan, the sparse parallel vectors that replace the old
  O(n) dense index array, the per-array ChunkedTransferWorker with
  its +/-1 Z-margin slab read + interior-only write-back, and the
  outer ParallelTaskAlgorithm across arrays
* Add a memory-footprint summary clarifying that every data
  structure is O(slice) or O(iteration bad count), never O(volume)

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
Add a new Algorithm section that teaches the filter from scratch:

* Explain conceptually which triangles are kept (all three vertices
  inside the user-specified node-type range) and what the output
  geometry looks like (compact vertex list, compact triangle list,
  remapped connectivity)
* Document the downstream invariant that forces vertNewIndex to stay
  a dense per-vertex map (triangle 0's three fresh vertices land at
  new indices 0..2 in traversal order)
* Explain the triMask bitset + triPrefixSum sparse popcount table
  that replaces the legacy dense triangle map for ~6.4x memory
  savings, and how remapIndex() turns an O(1) table lookup plus a
  small popcount into each triangle's compact new index
* Walk the six streaming passes (vertex-ok mask, triangle scan +
  vertex-index assignment, prefix-sum build, vertex copy, triangle
  remap copy, per-vertex/per-triangle attached-array copy)
* Summarize the memory footprint so the vertNewIndex dominance is
  clear on very large meshes

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
Add a comprehensive Algorithm section covering both the node-geometry
and image-geometry paths from first principles:

* Describe how every supported transform (rotation, scale, manual
  matrix, etc.) collapses to a single 4x4 homogeneous matrix M and
  how M composes with prior transforms
* Node geometries: walk the 16K-vertex chunked read -> multiply ->
  write pipeline and explain why leaving topology and attribute data
  in place is correct
* Image geometries: teach the re-gridding problem (why output voxels
  need to look up source values via M^-1), and contrast nearest-
  neighbor vs. trilinear interpolation
* Z-slice slab cache: analytically derive the per-output-slice
  source-Z range and the +/-2 trilinear margin
* Sliding-window slab updates via memmove + delta copyIntoBuffer
  reads when consecutive output slices overlap heavily
* Intra-slice parallelism via ParallelDataAlgorithm with thread
  safety argued from shared-read + disjoint-write access patterns
  and per-thread pValues scratch

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
Add an Algorithm section that walks the chunked pipeline step-by-step
for a reader unfamiliar with the optimization:

* Establish the closed-form per-triangle math (0.5 * |(A-B) x (A-C)|)
  so there is no confusion about the compute
* Quantify the naive access pattern (six OOC chunk-cache hits per
  triangle, hundreds of millions of virtual dispatches on CT-scale
  meshes) to motivate the chunking
* Walk the five-step per-chunk pipeline: bulk triangle connectivity
  read -> analyze vertex-index span -> span-bounded bulk vertex
  coords read -> parallel compute on plain buffers -> bulk area
  write
* Explain the 16M-vertex span cap and the serial per-triangle
  fallback for pathological meshes
* Summarize memory footprint (bounded O(chunk), not O(mesh))
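
As a rough illustration of the per-chunk pipeline, the sketch below computes areas for one chunk of triangles after loading only the vertex span that the chunk references. It is serial and in-memory; `chunkAreas()` and the flat `tris`/`verts` layouts (3 entries per triangle, 3 coordinates per vertex) are assumptions for this example.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

using Vec3 = std::array<double, 3>;

// Area of triangle ABC: 0.5 * |(A-B) x (A-C)|.
double triangleArea(const Vec3& a, const Vec3& b, const Vec3& c)
{
  const Vec3 u = {a[0] - b[0], a[1] - b[1], a[2] - b[2]};
  const Vec3 v = {a[0] - c[0], a[1] - c[1], a[2] - c[2]};
  const Vec3 n = {u[1] * v[2] - u[2] * v[1], u[2] * v[0] - u[0] * v[2], u[0] * v[1] - u[1] * v[0]};
  return 0.5 * std::sqrt(n[0] * n[0] + n[1] * n[1] + n[2] * n[2]);
}

// Areas for one chunk of triangles.
std::vector<double> chunkAreas(const std::vector<uint64_t>& tris, const std::vector<double>& verts)
{
  assert(tris.size() % 3 == 0 && verts.size() % 3 == 0);
  if(tris.empty()) { return {}; }
  // Determine the vertex-index span referenced by this chunk.
  const auto range = std::minmax_element(tris.begin(), tris.end());
  const size_t minV = static_cast<size_t>(*range.first);
  const size_t maxV = static_cast<size_t>(*range.second);
  assert((maxV + 1) * 3 <= verts.size());
  // One contiguous "bulk read" of just that span into a local buffer.
  const std::vector<double> local(verts.begin() + static_cast<std::ptrdiff_t>(minV * 3),
                                  verts.begin() + static_cast<std::ptrdiff_t>((maxV + 1) * 3));
  auto vertex = [&local, minV](uint64_t index) {
    const size_t o = (static_cast<size_t>(index) - minV) * 3;
    return Vec3{local[o], local[o + 1], local[o + 2]};
  };
  // Compute on plain buffers only; this loop is what could be parallelized.
  std::vector<double> areas(tris.size() / 3);
  for(size_t t = 0; t < areas.size(); t++)
  {
    areas[t] = triangleArea(vertex(tris[3 * t]), vertex(tris[3 * t + 1]), vertex(tris[3 * t + 2]));
  }
  return areas;
}
```

A 3-4-5 right triangle with vertices (0,0,0), (3,0,0), (0,4,0) yields an area of 6, which is a convenient sanity check for the cross-product formula.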

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
Rewrite the Algorithm section to fully teach the filter:

* State what the three output arrays (NumElements, Volume,
  EquivalentDiameter) represent and show the spherical/circular
  diameter formulas
* Image Geometry path: explain the uniform-voxel-volume shortcut
  that lets the filter skip per-voxel volume computations, then
  walk the 256K-tuple chunked count pass and the per-feature
  output pass; cover the 2D fallback rules and the
  two-empty-dimensions preflight error
* RectGrid path: contrast with the Image case, describe the
  lockstep FeatureIds + elementSizes chunked read, and explain
  why Kahan summation is needed to avoid float32 rounding error
  on billion-voxel volumes
* Justify the 256K chunk size choice based on HDF5 chunk-lookup
  overhead vs. L2 cache residency
* Summarize memory footprint
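
The motivation for Kahan summation on the RectGrid path can be seen in a small sketch: accumulating many small float32 values naively drifts once the running sum is large, while the compensated version carries the lost low-order bits forward. The function names are illustrative, and the example assumes the compiler is not run with value-unsafe options such as `-ffast-math`.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Plain float32 accumulation: rounding error grows with the running sum.
float naiveSum(const std::vector<float>& values)
{
  float sum = 0.0f;
  for(float v : values) { sum += v; }
  return sum;
}

// Kahan (compensated) summation: `c` tracks the rounding error of each add.
float kahanSum(const std::vector<float>& values)
{
  float sum = 0.0f;
  float c = 0.0f;
  for(float v : values)
  {
    const float y = v - c;   // apply the correction from the previous step
    const float t = sum + y; // low-order bits of y may be lost here
    c = (t - sum) - y;       // recover what was lost
    sum = t;
  }
  return sum;
}
```

Summing ten million copies of `0.1f` is a common way to see the difference: the compensated result stays within one unit of 1,000,000, while the naive result does not.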

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
Three sites in the algorithm multiplied double-precision resolution and
angle values by k_PiOver180F (a float constant). Float promotion to
double preserves the quantized float bits rather than recovering the
true M_PI/180.0, introducing a ~1e-10 deviation from the legacy
SIMPL algorithm (which uses double k_PiOver180). Over ~756k triangles
times ~2300 symmetry operations the deviation flipped two near-boundary
triangles in/out of the selected set, shifting a handful of
distribution bin values by ~3e-2.

Switch the three multiplications to the existing k_PiOver180D double
constant so the resolution thresholds and fixed-misorientation angle
are computed at full double precision.
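
The size of the deviation can be checked with a short sketch. The constants below reuse the names `k_PiOver180F` and `k_PiOver180D` from the commit but are defined locally for illustration; `conversionError()` is a hypothetical helper.

```cpp
#include <cassert>
#include <cmath>

constexpr double k_Pi = 3.14159265358979323846;
constexpr float k_PiOver180F = static_cast<float>(k_Pi / 180.0); // quantized to float32
constexpr double k_PiOver180D = k_Pi / 180.0;                    // full double precision

// Absolute error from converting degrees to radians with the float constant.
double conversionError(double degrees)
{
  assert(std::isfinite(degrees));
  const double viaFloat = degrees * k_PiOver180F; // float is promoted, but its bits stay quantized
  const double viaDouble = degrees * k_PiOver180D;
  return std::fabs(viaFloat - viaDouble);
}
```

For a 1 degree input the error is on the order of 1e-10 radians, small in isolation but enough to flip a comparison that sits right on a threshold.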

The stored 6_6_find_gbcd_metric_based.tar.gz exemplar was generated by
the original float-precision DREAM3D FindGBCDMetricBased filter and no
longer matches the simplnx algorithm after this fix. Publish a fresh
exemplar from the double-precision legacy pipeline and repoint the
tests at it.

* Rename archive and top-level folder from 6_6_find_gbcd_metric_based
  to compute_gbcd_metric_based (drops the legacy 6_6_ prefix in
  accordance with current archive-naming conventions).
* Drop the 6_6_ prefix from the stored .dat exemplar filenames;
  input .dream3d filename follows the folder name.
* ComputeGBPDMetricBasedTest's InValid section reuses the GBCD archive
  (for crystal-structures and mesh input); update its paths too.
* CMakeLists.txt download_test_data entry bumped to the new archive
  name and SHA512.

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
Adds a simple erase-by-key helper on Preferences to support callers
that need to restore a preference to its "absent" state (rather than
overwrite it with a specific value).
Introduces the ArgumentType and help-text surface for a new
--ooc-memory-budget / -b flag. This commit only wires parsing and
help; the override behavior is applied in a follow-up commit.
Applies the parsed --ooc-memory-budget value to the in-memory
Preferences object before plugins load so SimplnxOocPlugin picks it
up at construction. A stack-local RAII guard restores the pre-override
preference state before the Application singleton saves preferences
on shutdown, so the user's saved preference is untouched.
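
A minimal sketch of such an RAII guard is shown below, using a hypothetical `Prefs` class rather than the real `Preferences` API. It relies on an erase-by-key operation to restore the "absent" state when no value existed before the override.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical, minimal preference store.
class Prefs
{
public:
  bool contains(const std::string& key) const { return m_Values.count(key) != 0; }
  uint64_t get(const std::string& key) const { return m_Values.at(key); }
  void set(const std::string& key, uint64_t value) { m_Values[key] = value; }
  void erase(const std::string& key) { m_Values.erase(key); }

private:
  std::map<std::string, uint64_t> m_Values;
};

// Applies an override on construction and restores the previous state
// (including "key absent") on destruction.
class ScopedOverride
{
public:
  ScopedOverride(Prefs& prefs, std::string key, uint64_t value)
  : m_Prefs(prefs)
  , m_Key(std::move(key))
  , m_HadValue(prefs.contains(m_Key))
  , m_OldValue(m_HadValue ? prefs.get(m_Key) : 0)
  {
    m_Prefs.set(m_Key, value);
  }
  ~ScopedOverride()
  {
    if(m_HadValue) { m_Prefs.set(m_Key, m_OldValue); }
    else { m_Prefs.erase(m_Key); }
  }
  ScopedOverride(const ScopedOverride&) = delete;
  ScopedOverride& operator=(const ScopedOverride&) = delete;

private:
  Prefs& m_Prefs;
  std::string m_Key;
  bool m_HadValue;
  uint64_t m_OldValue;
};
```

Because the guard is stack-local, the restore runs on every exit path, so anything that saves preferences after the guard goes out of scope sees the original state.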
…g garbage

std::stod("nan") returns NaN, which passed the previous gb <= 0.0 check
(NaN compares false to everything) and then became undefined behavior when
cast to uint64. std::stod("8abc") also succeeded with a partial parse,
silently accepting junk input. Use the pos out-parameter to require the
entire argument be consumed, and std::isfinite to reject NaN and infinity.
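
The stricter parse can be sketched as follows. `parseGigabytes()` is a hypothetical helper, not the function in the commit; it shows the two checks described above, full consumption via the `pos` out-parameter and `std::isfinite`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <exception>
#include <string>

// Parse a positive, finite gigabyte value. Returns false on any invalid
// input and leaves `outGb` untouched in that case.
bool parseGigabytes(const std::string& text, double& outGb)
{
  try
  {
    size_t pos = 0;
    const double gb = std::stod(text, &pos);
    if(pos != text.size())
    {
      return false; // partial parse, e.g. "8abc"
    }
    if(!std::isfinite(gb) || gb <= 0.0)
    {
      return false; // NaN, infinity, zero, or negative
    }
    outGb = gb;
    return true;
  } catch(const std::exception&)
  {
    return false; // std::invalid_argument or std::out_of_range from std::stod
  }
}
```

The explicit `std::isfinite` check matters because `NaN <= 0.0` is false, so a range check alone lets NaN through.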
std::isfinite requires <cmath>, which was being picked up only
transitively via other STL headers. Include it directly to avoid
future breakage if the transitive chain changes. Drop <stdexcept>;
the only catch handler uses const std::exception& which is available
via <exception>.
…x bool-mask bulk I/O

Three logically related changes that finish reconciling the rebased
branch with Nathan Young's PR BlueQuartzSoftware#1590 (ENH: Standardize 2D Image Handling)
and fix one resulting OOC perf cliff:

1. Wholesale port of PR BlueQuartzSoftware#1590's two algorithm rewrites into the renamed
   in-core dispatch variants:
   - ComputeFeatureNeighborsDirect.cpp gets Nathan's templated
     ComputeFeatureNeighborsFunctor<ImageDimensionStateT> and ProcessVoxels
     dispatcher in place of the OOC-commit-era custom in-core logic.
   - IdentifySampleBFS.cpp gets Nathan's templated IdentifySampleFunctor
     plus the corresponding ProcessVoxels dispatch.
   The Scanline OOC variant of ComputeFeatureNeighbors is updated to
   reference the namespaced VoxelNeighbors<Image3D>:: constants while
   preserving its Z-slice rolling-window bulk-I/O structure.

2. Reapply PR BlueQuartzSoftware#1590's constexpr/const cleanups across the algorithm
   files where the rebase took --theirs (the OOC commit version) at the
   2aa00ee conflict and dropped Nathan's small adjustments:
     SimplnxCore: ComputeBoundaryCellsDirect, ErodeDilateBadData,
       ErodeDilateCoordinationNumber, ErodeDilateMask,
       ReplaceElementAttributesWithNeighborValues,
       RequireMinimumSizeFeatures
     OrientationAnalysis: BadDataNeighborOrientationCheckWorklist,
       NeighborOrientationCorrelation
   The pattern is uniform: promote the inlined `6` neighbor-array sizes
   to use VoxelNeighbors<Image3D>::k_FaceNeighborCount via a local
   k_NumFaceNeighbors alias, make neighborVoxelIndexOffsets const,
   make faceNeighborInternalIdx constexpr, make isValidFaceNeighbor
   const where it is not mutated, drop the now-unused DataGroup.hpp
   include, and const-ify NeighborOrientationCorrelation's orientationOps.
   ComputeFeatureNeighborsFilter.md picks up Nathan's all-dimension
   note about user-set spacing for shared surface area calculation.

3. Fix a per-element OOC fallback in BadDataNeighborOrientationCheckScanline
   that was triggered whenever the input mask was a BoolArray rather
   than a UInt8Array. The previous code routed bool masks through
   maskCompare->isTrue / maskCompare->setValue per voxel per Z-slice,
   causing chunk thrashing under chunked OOC storage. The Small_IN100
   pipeline test (a 189x201x117 volume with a bool mask produced by
   MultiThresholdObjects) ran in 4.7 s on simplnx-Rel but 3+ minutes
   on simplnx-ooc-Rel. AbstractDataStore<bool> already exposes
   copyIntoBuffer/copyFromBuffer just like AbstractDataStore<uint8>;
   the comment claiming otherwise was stale. Resolve a typed
   AbstractDataStore<bool>* alongside the existing uint8 store pointer
   and route both load and write-back through bulk I/O, with a small
   per-slice std::unique_ptr<bool[]> scratch buffer bridging between
   the algorithm's uint8 slice buffers and the bool data store's typed
   bulk API. With this change Small_IN100 OOC drops to 4.6 s
   (~1.6x in-core, in line with normal OOC overhead).

Tests updated:
  - IdentifySampleTest.cpp adopts Nathan's PR BlueQuartzSoftware#1590 hand-built 2D Empty
    Z/Y/X Non-Square regression tests plus the parameterized
    identify_sample_v2 exemplar test and the SIMPL Backwards Compatibility
    test, all wrapped with the OOC dual-path pattern (ForceOocAlgorithmGuard
    + GENERATE(from_range(k_ForceOocTestValues))). The pre-existing
    200x200x200 large-scale OOC validation test is retained.

Verified: simplnx-Rel and simplnx-ooc-Rel preset builds both clean.
All 43 affected-filter tests pass on simplnx-Rel; all 86 affected-filter
tests pass on simplnx-ooc-Rel (regex covering ComputeFeatureNeighbors,
IdentifySample, BadDataNeighborOrientation, ComputeBoundaryCells,
ErodeDilate*, NeighborOrientationCorrelation,
ReplaceElementAttributesWithNeighborValues, RequireMinimumSizeFeatures).
* Replace CreateDataStore + CreateResolvedDataStore with a single
  resolver-aware CreateDataStore(DataStructure, DataPath, ...) that
  always consults the registered format resolver. Old explicit-format
  overload deleted.
* Replace CreateListStore similarly so NeighborList backing storage
  is OOC-eligible when the OOC plugin is loaded and thresholds permit.
* Inline action-layer caller in ArrayCreationUtilities::CreateArray
  using GetIOCollection().createDataStoreWithType directly.
* Migrate 23 CreateResolvedDataStore call sites (mechanical rename).
* Migrate 13 cell-level test fixtures that were silently in-memory in
  OOC builds to the resolver-aware path so OOC builds actually exercise
  OOC stores.
* Migrate 6 in-memory non-test callers (ComputeFeatureCentroids scratch
  buffers, HDF5 readers in DataStoreIO and DatasetIO) to direct
  std::make_shared<DataStore<T>> since they have no DataStructure
  context.
* Migrate 2 NeighborListIO HDF5 readers to std::make_shared<ListStore<T>>
  for the same reason (in-core branch of the import pipeline).
* Wire CreateNeighbors action helper through the resolver-aware
  CreateListStore.
* Rewrite IOFormat.cpp tests to exercise the resolver path.

ImageGeom and RectGridGeom findElementSizes now route through the new
CreateDataStore so the voxel-sizes array can go OOC for very large
structured grids. RectGridGeom's inner loop is also refactored from
per-voxel setValue calls to per-axis precompute + Z-slice
copyFromBuffer to avoid catastrophic OOC perf when the array is
OOC-backed.

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
@joeykleingers joeykleingers force-pushed the ooc-filter-optimizations branch from be9ed24 to dd7119e Compare May 5, 2026 13:30