Fix memory safety issues#3188
Conversation
aa54fef to
30d26c8
Compare
| hex.match(/.{2}/g)!.map((b) => parseInt(b, 16)), | ||
| ); | ||
| await expect(perspective.table(bytes)).rejects.toThrow( | ||
| /row count exceeds maximum supported size/, |
There was a problem hiding this comment.
This is brittle and irrelevent, we don't cate about the message.
| expect(paths.length).toEqual(3); | ||
| expect(paths).toContainEqual([]); | ||
| expect(paths).toContainEqual(["x"]); | ||
| expect(paths).toContainEqual(["y"]); |
There was a problem hiding this comment.
These data structures are small, for readability its worth just asserting the entire structure instead of multiple playwright assertion calls in a row.
| const auto validation = m_table->ValidateFull(); | ||
| if (!validation.ok()) { | ||
| PSP_COMPLAIN_AND_ABORT(validation.ToString()); | ||
| } |
There was a problem hiding this comment.
I checked that this has a minimal (nonexistent) performance impact.
| if (n < 0 | ||
| || static_cast<std::uint64_t>(n) | ||
| > std::numeric_limits<std::uint32_t>::max()) { | ||
| PSP_COMPLAIN_AND_ABORT( |
There was a problem hiding this comment.
This is not something Perspective itself will check internally, e.g. a subsequent update call will still cause the same issue this verbose error messages "fixes".
| if (v1.size() < 3 || v2.size() < 3 || out.size() < 3) { | ||
| rval.m_status = STATUS_CLEAR; | ||
| return rval; | ||
| } |
There was a problem hiding this comment.
This is an actual good catch but we can validate this better by just checking the exact cases and erroring else?
| case AGGTYPE_WEIGHTED_MEAN: { | ||
| auto pkeys = get_pkeys(nidx); | ||
|
|
||
| if (spec.get_dependencies().size() < 2) { |
There was a problem hiding this comment.
I don't think empty categories can be constructed with the current config but this fix satisfies the analysis.
Signed-off-by: Andrew Stein <steinlink@gmail.com>
30d26c8 to
c4cb7a3
Compare
I asked Claude to find memory safety bugs in Perspective and write a PR, while I played Balatro on my phone. Here's its own summary of what it found:
Fixes
Uneven column lengths in columnar table create/update (
table.cpp—Table::from_cols,Table::update_cols): the table was sized from a singlecolumn's length while every column was filled to its own length. Now the
table is sized to the longest column and shorter columns are null-padded, so
all writes stay in bounds.
NDJSON row under-allocation (
table.cpp—Table::from_ndjson): capacitywas estimated from the newline count but one row was written per parsed
object. The table is now grown per row (amortized O(1) via geometric capacity
growth), so concatenated objects without newline separators can no longer
overrun the buffer.
Arrow row-count truncation (
arrow_loader.cpp—ArrowLoader::row_count):Arrow's 64-bit row count was silently truncated to 32 bits, under-sizing the
destination table relative to the data written during fill. Oversized/negative
counts are now rejected instead of truncated.
Arrow
time32element width (arrow_loader.cpp—copy_array):time32values are 32-bit and map to a 4-byte column, but the loader copied 8 bytes
per element, over-reading the source and over-writing the destination. Now
copies 4 bytes per element.
first/lastaggregate with a missing sort dependency (sparse_tree.cpp—
first_last_helper): the helper assumed the aggregate spec always carriedboth a value and a sort dependency and indexed the dependency list
unconditionally. A view whose sort column falls outside the visible set could
produce a spec without the sort dependency, causing an out-of-bounds read. Now
guarded (covers
first,last, andlast − first).countaggregate over a string column reads at the wrong stride(
aggregate.cpp—t_aggregate::init,AGGTYPE_COUNT/DTYPE_STR): astring column's backing store holds 4-byte
t_uindexvocabulary indices, butthe
countaggregate was instantiated ast_aggimpl_count<std::uint64_t, …>,so
build_aggregate→t_column::fillread the input column at an 8-byte(
uint64) stride — double the real element width — and ran off the end of thebuffer. The read is pure waste (
t_aggimpl_count::reducediscards the values;the count is filled in later from
nstrands), but it still tripped a storagebounds check. The raw type is now
t_uindex, matching the storage width. Thiswas a latent over-read for years, masked by
push_back's ~20% capacityheadroom; tighter column sizing in a later optimization removed the slack and
exposed it. It is reachable from every pivoted (
group_by/split_by) viewwith a string column, since the default per-column aggregate for a string is
countand that aggregate is built during view construction — so the abortfired before the view's actual query (
get_min_max,to_columns, etc.) everran. Found by updating and enabling
PSP_STORAGE_VERIFYin debug builds.Residency eviction data race (
residency.cpp,residency.h): theshared pending-eviction vectors were cleared outside the manager's mutex, so
concurrent request-thread eviction passes could double-free. All mutations now
occur under the lock, and a dedicated mutex serializes each eviction cycle.
Unvalidated Arrow input (
arrow_loader.cpp—ArrowLoader::initialize):Arrow's IPC reader does not validate buffer contents, yet the fill paths index
value/offset/dictionary buffers directly. The loaded (remotely supplied) table
is now fully validated (
ValidateFull()) before those buffers are trusted, soa malformed payload — bad offsets, out-of-range dictionary indices, inconsistent
chunk lengths — is rejected instead of read out of bounds. This is the systemic
defense behind the
time32and row-count fixes above. Note: validation isO(data) per ingested payload;
Validate()plus targeted bounds checks would bea cheaper alternative if ingest throughput matters.
Out-of-bounds access in expression vector functions (
computed_function.cpp—
diff3,norm3,cross_product3,dot_product3): these operate on3-element vectors and indexed
v[0..2]unconditionally, but their exprtkparameter sequences (
"VVV"/"VV"/"V") enforce vector type, not length.A user expression can declare shorter vectors (e.g.
var v[2] := {1,2}; norm3(v)), causing an out-of-bounds read — and fordiff3/cross_product3,an out-of-bounds write to the (short) output vector. Each function now
clears its result for vectors shorter than 3 before indexing; this surfaces
as an invalid expression (rejected at view creation) rather than an
out-of-bounds access. Vectors of length 3 are unaffected.