refactor(signature): Overhaul cache keying and async verification pipeline #681
Open
rabbitstack wants to merge 4 commits into
What is the purpose of this PR / why it is needed?
The signature verification subsystem has been redesigned to address
correctness, performance, and observability concerns that accumulated
as the codebase grew.
Cache keying was fundamentally broken because it relied on the module
base address as the primary index. ASLR randomises base addresses across
processes, causing every process spawn to produce a full cache miss
for system DLLs that were already verified moments earlier. The new
key is derived entirely from data available in the image-load
event, including a normalised NT device path, PE timestamp, checksum,
and mapped image size. This requires zero additional syscalls and no
file handle acquisition at key construction time. It also eliminates
stale-hit and false-miss scenarios caused by DLL unload/reload at
the same virtual address, and collapses redundant verification work
for the same physical file loaded across multiple processes.
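A minimal sketch of what such a key could look like in Go. The `ImageKey` type, its field names, and the lower-casing in `normalisePath` are assumptions made for illustration and are not taken from the PR; the values in `main` are made up.

```go
package main

import (
	"fmt"
	"strings"
)

// ImageKey is a hypothetical cache key built only from fields carried by the
// image-load event. All fields are comparable, so it can index a map directly.
type ImageKey struct {
	Path      string // normalised NT device path
	Timestamp uint32 // PE timestamp
	Checksum  uint32 // PE checksum
	Size      uint64 // mapped image size in bytes
}

// normalisePath is an assumed normalisation step: it only lower-cases the
// path so that differently cased spellings of one file produce the same key.
func normalisePath(p string) string {
	return strings.ToLower(p)
}

// NewImageKey builds the key without touching the file system.
func NewImageKey(path string, timestamp, checksum uint32, size uint64) ImageKey {
	return ImageKey{
		Path:      normalisePath(path),
		Timestamp: timestamp,
		Checksum:  checksum,
		Size:      size,
	}
}

func main() {
	// The same file loaded into two processes yields the same key, no matter
	// where ASLR placed the module, because the base address is not part of it.
	a := NewImageKey(`\Device\HarddiskVolume3\Windows\System32\ntdll.dll`, 0x5f1e2d3c, 0x1f4a2b, 0x1f8000)
	b := NewImageKey(`\Device\HarddiskVolume3\WINDOWS\system32\ntdll.dll`, 0x5f1e2d3c, 0x1f4a2b, 0x1f8000)
	fmt.Println(a == b) // true
}
```

Because the base address is absent from the key, a DLL that is unloaded and replaced by a different image at the same virtual address can no longer alias the earlier entry.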
The synchronisation model has been reworked throughout. The global
mutex has been replaced with a reader-writer lock, and the signature's
last-accessed timestamp is updated via an atomic store, eliminating the
need to upgrade to a write lock on every cache hit. All fields on
the Signature struct that are written by worker goroutines and read
by the rule engine are now accessed through the appropriate atomic
primitives, closing a class of data races that the previous implementation
left undetected.
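The locking scheme described above can be sketched as follows. This is an illustration only: the `Cache` type, the field names on `Signature`, and the string key are assumptions, and the typed atomics require Go 1.19 or later.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// Signature holds fields shared between worker goroutines and readers.
// The field names are hypothetical; each one is accessed atomically.
type Signature struct {
	verified atomic.Bool  // set by a worker once verification completes
	accessed atomic.Int64 // last access time, Unix nanoseconds
}

// Cache guards the map with a reader-writer lock.
type Cache struct {
	mu   sync.RWMutex
	sigs map[string]*Signature
}

func NewCache() *Cache {
	return &Cache{sigs: make(map[string]*Signature)}
}

// Get takes only the read lock. The access timestamp is refreshed with an
// atomic store, so a cache hit never needs the write lock.
func (c *Cache) Get(key string) (*Signature, bool) {
	c.mu.RLock()
	sig, ok := c.sigs[key]
	c.mu.RUnlock()
	if ok {
		sig.accessed.Store(time.Now().UnixNano())
	}
	return sig, ok
}

// Put takes the write lock because it mutates the map itself.
func (c *Cache) Put(key string, sig *Signature) {
	c.mu.Lock()
	c.sigs[key] = sig
	c.mu.Unlock()
}

func main() {
	c := NewCache()
	c.Put("ntdll", &Signature{})

	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if sig, ok := c.Get("ntdll"); ok {
				sig.verified.Store(true) // stand-in for a worker publishing its result
			}
		}()
	}
	wg.Wait()

	sig, _ := c.Get("ntdll")
	fmt.Println(sig.verified.Load()) // true
}
```

Running a program like this under `go run -race` is one way to check that concurrent readers and writers of the struct fields no longer race.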
Certificate parsing no longer holds the write lock across file I/O
and WinTrust calls, which previously serialised all cache readers
for the full duration of a catalog hash operation.
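One common way to keep slow work out of the critical section is to perform it with no lock held and take the write lock only to publish the result. The sketch below assumes that pattern; the `Store` and `Cert` types and the `parse` callback are hypothetical stand-ins for the file I/O and WinTrust calls, not the PR's actual code.

```go
package main

import (
	"fmt"
	"sync"
)

// Cert is a placeholder for parsed certificate data.
type Cert struct{ Subject string }

// Store caches parsed certificates behind a reader-writer lock.
type Store struct {
	mu    sync.RWMutex
	certs map[string]*Cert
}

func NewStore() *Store {
	return &Store{certs: make(map[string]*Cert)}
}

// Resolve returns the cached certificate, or calls parse to produce one.
// parse runs with no lock held, so readers are never blocked behind it.
func (s *Store) Resolve(key string, parse func() (*Cert, error)) (*Cert, error) {
	s.mu.RLock()
	cert, ok := s.certs[key]
	s.mu.RUnlock()
	if ok {
		return cert, nil
	}

	parsed, err := parse() // slow path: executed outside any lock
	if err != nil {
		return nil, err
	}

	s.mu.Lock()
	defer s.mu.Unlock()
	if existing, ok := s.certs[key]; ok {
		return existing, nil // another goroutine published first; keep its entry
	}
	s.certs[key] = parsed
	return parsed, nil
}

func main() {
	s := NewStore()
	parse := func() (*Cert, error) {
		// Stand-in for reading the file and querying the trust provider.
		return &Cert{Subject: "Example Publisher"}, nil
	}
	cert, err := s.Resolve("ntdll", parse)
	fmt.Println(cert.Subject, err)
}
```

The trade-off in this sketch is that two goroutines missing on the same key at the same time may both run `parse`; the second result is discarded when the write lock is taken.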
CreateFile events with the FILE_OVERWRITE or FILE_OVERWRITE_IF
disposition now invalidate the cache, removing stale signatures from the store.
The async verification pipeline has been designed around the
observation that the enrichment stages between raw ETW delivery
and rule engine evaluation (stack walk correlation, symbol
resolution, and evasion scanner execution) provide a natural
latency budget that covers the median catalog hash duration. Image
rundown events emitted at session start are used to prewarm the
cache during bootstrap, ensuring that common system modules are
fully verified before the first live event reaches the rule engine.
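The shape of such a pipeline can be sketched as a submit-then-await pair: verification starts as soon as the image-load event arrives, and the consumer blocks only if the result is not ready by the time it is needed. Everything below (`Verifier`, `Submit`, `Await`, `Prewarm`, and the `verify` callback) is a hypothetical illustration and not the PR's API.

```go
package main

import (
	"fmt"
	"sync"
)

// pending tracks one verification that is in flight or already complete.
type pending struct {
	done    chan struct{} // closed once trusted has been written
	trusted bool
}

// Verifier runs verifications in the background and lets callers wait on them.
type Verifier struct {
	mu       sync.Mutex
	inflight map[string]*pending
	verify   func(path string) bool // stand-in for the catalog hash check
}

func NewVerifier(verify func(string) bool) *Verifier {
	return &Verifier{inflight: make(map[string]*pending), verify: verify}
}

// Submit starts verification for path unless it was already started.
// It returns immediately so the event can move on to enrichment stages.
func (v *Verifier) Submit(path string) {
	v.mu.Lock()
	if _, ok := v.inflight[path]; ok {
		v.mu.Unlock()
		return
	}
	p := &pending{done: make(chan struct{})}
	v.inflight[path] = p
	v.mu.Unlock()

	go func() {
		p.trusted = v.verify(path)
		close(p.done) // closing the channel publishes p.trusted to waiters
	}()
}

// Await blocks until the result for path is ready.
// ok is false if the path was never submitted.
func (v *Verifier) Await(path string) (trusted, ok bool) {
	v.mu.Lock()
	p, ok := v.inflight[path]
	v.mu.Unlock()
	if !ok {
		return false, false
	}
	<-p.done
	return p.trusted, true
}

// Prewarm submits every module reported by the rundown at session start.
func (v *Verifier) Prewarm(paths []string) {
	for _, path := range paths {
		v.Submit(path)
	}
}

func main() {
	v := NewVerifier(func(path string) bool { return path != "unsigned.dll" })
	v.Prewarm([]string{"ntdll.dll", "kernel32.dll"})

	v.Submit("unsigned.dll") // a live image-load event
	// ... enrichment stages would run here, overlapping with verification ...
	trusted, ok := v.Await("unsigned.dll")
	fmt.Println(trusted, ok) // false true
}
```

In this sketch `Await` only blocks when verification takes longer than the work done between `Submit` and `Await`, which is the latency budget the description refers to.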
What type of change does this PR introduce?
/kind refactor (non-breaking change that restructures the code, while not changing the original functionality)
/kind improvement
Any specific area of the project related to this PR?
/area telemetry
/area filters
/area event
Special notes for the reviewer
Does this PR introduce a user-facing change?