Skip to content

perf(redis-lua): pool *lua.LState to cut GC pressure#580

Open
bootjp wants to merge 4 commits intomainfrom
perf/lua-vm-pool
Open

perf(redis-lua): pool *lua.LState to cut GC pressure#580
bootjp wants to merge 4 commits intomainfrom
perf/lua-vm-pool

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented Apr 21, 2026

Summary

  • Pools *lua.LState across EVAL / EVALSHA invocations, reusing the expensive base-lib plus redis/cjson/cmsgpack initialisation instead of minting a fresh VM every call.
  • Heap profiles in production showed ~34% of in-use heap in gopher-lua allocations (newFuncContext 11.8%, newRegistry 11.3%, newFunctionProto 10.5%); a BullMQ-style workload (~10 scripts/s) is exactly the shape that this wasted.
  • Micro-bench on darwin/arm64 (M1 Max, go 1.26): ~5x fewer allocs/op, ~5x less B/op, ~4.5x faster per eval (see below).

Security invariant

No script state must leak from one evaluator to the next. On release, the pool:

  1. Walks the state global table and deletes any key not present in the snapshot captured at pool-fill time -- this removes user-introduced globals such as KEYS, ARGV, or a leaking GLOBAL_LEAK.
  2. Restores every snapshot key to its original value, so a script that rebinds allowed globals (e.g. redis = nil, string = {upper = evil}) cannot poison the next user.
  3. Truncates the value stack and unbinds the per-eval *luaScriptContext so stale redis.call / redis.pcall invocations cannot fire.

The pre-registered redis.call / redis.pcall closures dispatch the per-eval context via a *lua.LState to *luaScriptContext binding map (set on acquire, cleared on release), which is what makes pooling safe without re-registering closures every eval.

Benchmark

BenchmarkLuaState_NewVsPooled (darwin/arm64, Apple M1 Max, go 1.26):

new_state_per_call   59255 ns/op   194417 B/op   543 allocs/op
pooled_state         13306 ns/op    39700 B/op   102 allocs/op

Rough extrapolation to the 34% heap slice cited in the profile: the pooled path allocates roughly 1/5 the bytes per script, so that slice should compress to ~7-8% of in-use heap in steady state (actual number depends on script mix and sync.Pool eviction rate under GC).

Tests added

  • TestLua_VMReuseDoesNotLeakGlobals -- the load-bearing leak test. Script A sets GLOBAL_LEAK = 42 plus a LEAKY_TABLE, script B asserts both are nil on a recycled VM.
  • TestLua_VMReuseRestoresRebindsWhitelistedGlobals -- sabotages redis and string.upper on the pooled state and confirms they are restored.
  • TestLua_PoolRecordsReuseVsAllocation -- proves via the hit counter that the pool actually hands back existing VMs.
  • TestRedis_LuaPoolNoGlobalLeakEndToEnd -- drives the full EVAL path on a real RedisServer and confirms cross-script leakage cannot happen through the user-facing protocol.

Test plan

  • go test -race -short ./adapter/... (green locally, ~60s)
  • golangci-lint --config=.golangci.yaml run ./adapter/... -> 0 issues
  • go test -run='^$' -bench=BenchmarkLuaState_NewVsPooled -benchmem ./adapter/ shows ~5x reduction in B/op and allocs/op
  • CI green on this branch

EVAL / EVALSHA previously minted a fresh *lua.LState per call.
Production heap profiles showed ~34% of in-use heap in
gopher-lua allocs (newFuncContext 11.8%, newRegistry 11.3%,
newFunctionProto 10.5%), driven by BullMQ-style ~10 scripts/s
workloads. Each Eval allocating a VM is pure waste.

This change introduces a sync.Pool-backed *lua.LState pool.
Pooled states are pre-initialised once with base libs + nil-ed
dangerous loaders + redis/cjson/cmsgpack modules. The redis
module's call/pcall closures resolve the per-eval
*luaScriptContext out of a global binding map (set on acquire,
cleared on release) instead of capturing it, so the state does
not need to be rewired with fresh closures every eval -- which
is what makes pooling both safe and cheap.

Reset logic (security invariant): on release the pool
  1) walks the state's global table and deletes any key not
     present in the snapshot captured at pool-fill time -- this
     removes user-introduced globals such as KEYS, ARGV, or a
     leaking GLOBAL_LEAK,
  2 restores every snapshot key to its original value, so a
     script that rebinds allowed globals (redis = nil,
     string = {upper = evil}) cannot poison the next user, and
  3) truncates the value stack and unbinds the script context
     so stale redis.call invocations cannot fire.

No script from evaluator A can observe or influence state in
evaluator B's run. Tests:
  - TestLua_VMReuseDoesNotLeakGlobals covers the
    load-bearing invariant for GLOBAL_LEAK-style cases.
  - TestLua_VMReuseRestoresRebindsWhitelistedGlobals covers
    sabotage of whitelisted globals.
  - TestLua_PoolRecordsReuseVsAllocation proves the pool is
    actually being used via a hit counter.
  - TestRedis_LuaPoolNoGlobalLeakEndToEnd drives the full
    EVAL path on a live server.

BenchmarkLuaState_NewVsPooled (darwin/arm64, Apple M1 Max):
  new_state_per_call   59255 ns/op   194417 B/op   543 allocs/op
  pooled_state         13306 ns/op    39700 B/op   102 allocs/op
-> ~5x fewer allocs, ~5x less B/op, ~4.5x faster per eval.
EOF
)
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 21, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 53 minutes and 0 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 53 minutes and 0 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c035e2ad-95be-45dc-8f63-df0296ecccf4

📥 Commits

Reviewing files that changed from the base of the PR and between bbd7850 and eacc395.

📒 Files selected for processing (4)
  • adapter/redis.go
  • adapter/redis_lua.go
  • adapter/redis_lua_pool.go
  • adapter/redis_lua_pool_test.go
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch perf/lua-vm-pool

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a luaStatePool to reuse *lua.LState instances, significantly reducing heap and GC pressure for high-rate Lua script workloads. The implementation includes a mechanism to snapshot and reset global variables to ensure isolation between script executions. Feedback highlights several critical areas: the current reset logic only handles string-keyed globals and fails to revert mutations within standard library tables, potentially leading to state leakage or environment poisoning. Additionally, there are potential nil pointer dereferences in redis.call and redis.pcall when the script context is missing, and the use of a global mutex for context lookup may become a performance bottleneck under high concurrency.

Comment thread adapter/redis_lua_pool.go Outdated
// not attempt to deep-clone them; if a script mutates
// `table.insert = nil`, the restore re-binds the original function
// value which is still alive.
func snapshotGlobals(state *lua.LState) map[string]lua.LValue {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

snapshotGlobals only captures string-keyed globals. Lua allows any value type (except nil and NaN) as a key in the global table _G. A script could intentionally or accidentally leak state by setting a non-string key (e.g., _G[42] = "secret"), which would bypass the current reset logic and persist across pooled state reuses.

References
  1. Avoid silently dropping entries during normalization or state capture to ensure state consistency and prevent data leaks.

Comment thread adapter/redis_lua_pool.go
// the whitelisted ones, then truncates the value stack. It is the
// heart of the security invariant: anything the script did to globals
// must not be observable by the next user.
func (p *pooledLuaState) reset() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

The reset logic is vulnerable to table poisoning. While it restores global variable references (e.g., the string table itself), it does not undo mutations to the fields within those tables (e.g., string.upper = function() return "pwned" end).

Since standard library tables are mutable in gopher-lua and the snapshot only holds a reference to the table object, any mutation to the table's contents will persist and poison subsequent script executions. The test TestLua_VMReuseRestoresRebindsWhitelistedGlobals only verifies that rebinding the global name string is handled, but it does not check for mutation of the original table's fields.

Comment thread adapter/redis_lua_pool.go Outdated
Comment on lines +263 to +266
if !ok {
scriptState.RaiseError("redis.call invoked without an active script context")
return 0
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

If luaBindContext is called with a nil context (as seen in the benchmarks), luaLookupContext will return nil, true. The current check if !ok will pass, leading to a nil pointer dereference when luaRedisCommand attempts to use the context. The check should explicitly verify that the context is non-nil.

Suggested change
if !ok {
scriptState.RaiseError("redis.call invoked without an active script context")
return 0
}
if !ok || ctx == nil {
scriptState.RaiseError("redis.call invoked without an active script context")
return 0
}
References
  1. When using context.Context for managing operations, ensure it is correctly propagated and validated to prevent runtime errors.

Comment thread adapter/redis_lua_pool.go Outdated
Comment on lines +271 to +274
if !ok {
scriptState.Push(luaErrorTable(scriptState, "redis.pcall invoked without an active script context"))
return 1
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Similar to redis.call, redis.pcall does not check if the resolved context is nil. If luaLookupContext returns nil, true, this will result in a nil pointer dereference in luaRedisCommand.

Suggested change
if !ok {
scriptState.Push(luaErrorTable(scriptState, "redis.pcall invoked without an active script context"))
return 1
}
if !ok || ctx == nil {
scriptState.Push(luaErrorTable(scriptState, "redis.pcall invoked without an active script context"))
return 1
}
References
  1. When using context.Context for managing operations, ensure it is correctly propagated and validated to prevent runtime errors.

Comment thread adapter/redis_lua_pool.go Outdated
Comment on lines +81 to +84
var luaStateBindings = struct {
sync.RWMutex
m map[*lua.LState]*luaScriptContext
}{m: map[*lua.LState]*luaScriptContext{}}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using a global map protected by a single sync.RWMutex for luaStateBindings will likely become a performance bottleneck under high concurrency. Every redis.call and redis.pcall across all concurrent scripts must acquire an RLock on this global mutex to resolve the context.

Consider storing the *luaScriptContext directly within the *lua.LState's registry (using lua.RegistryIndex) when the state is acquired. This localizes the context lookup to the specific Lua state and eliminates the need for global locking.

References
  1. For frequently accessed fields, prefer mechanisms that avoid global mutex contention to improve performance on hot paths.

Comment thread adapter/redis_lua_pool.go Outdated

// Collect all current string keys first; mutating the table in
// ForEach is unsafe.
currentKeys := make([]string, 0, len(p.globalsSnapshot)+luaResetKeySlack)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The currentKeys slice is allocated on every call to reset. To further reduce GC pressure (the primary goal of this PR), consider storing this slice within the pooledLuaState struct and reusing it (after clearing it) across resets.

References
  1. Pre-allocating or reusing buffers is acceptable to manage memory bounds and reduce allocation overhead.

Addresses three findings from the gemini-code-assist review on #580.

1. Non-string-keyed globals survived reset. The old snapshot only
   captured string keys, so a script doing `_G[42] = "secret"` or
   `_G[true] = "bad"` leaked across pool reuse. Snapshot now uses an
   LValue-keyed map and reset iterates every key type. Switched to
   RawSet (not RawSetH) because gopher-lua stores integer keys in an
   internal array slice that RawSetH does not touch.

2. Table poisoning survived reset. Restoring the `string` / `math` /
   `redis` / etc. GLOBAL references did not undo in-place mutations of
   those tables' fields (e.g. `string.upper = function() return
   "pwned" end`). We now snapshot every whitelisted nested table's
   shallow field set at pool init and restore it on reset (Option A /
   B from the review), keyed by *LTable pointer so the fix still
   applies even if a script rebinds the outer global name.

3. `redis.call` / `redis.pcall` nil-deref. A nil *luaScriptContext
   stored via luaBindContext(nil) -- which the bench path does --
   produced a (nil, true) lookup, then panicked inside luaRedisCommand.
   Guards now explicitly check `!ok || ctx == nil`.

New regression tests (all pass under -race -count=5):
- TestLua_VMReuseNonStringGlobalKeysAreWiped
- TestLua_VMReuseDoesNotPoisonStringLib
- TestLua_VMReuseDoesNotPoisonRedisModule
- TestLua_PoolNilContextProducesErrorNotPanic
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 21, 2026

Addressed the three flagged findings (2x SECURITY-HIGH + 1x HIGH) in c56a7f0. Picked Option B for the table-poisoning fix (snapshot every whitelisted nested table's shallow field set at init, restore on reset) over Option C because the fix is tractable, the reset cost is O(fields) per table -- tiny in practice -- and shipping the pool off by default would forfeit the 4-5x heap win the PR is about.

1. Non-string-keyed globals leaking (SECURITY-HIGH, line 171)

  • snapshotGlobals now returns map[lua.LValue]lua.LValue and captures every key the init-time _G.ForEach yields, not just LString ones.
  • Reset collects current keys (any type) via ForEach and wipes the diff against the snapshot.
  • Subtle gopher-lua gotcha caught while writing the test: integer keys live in an internal array slice, and RawSetH(LNumber(42), LNil) only touches dict, leaving the array entry alive. Switched the reset loops to RawSet (which dispatches by key type).
  • Verified by TestLua_VMReuseNonStringGlobalKeysAreWiped: script A sets _G[42] = \"leak\"; _G[true] = \"bad\", script B asserts both are nil.

2. Table poisoning surviving reset (SECURITY-HIGH, line 186)

  • Added tableSnapshots map[*lua.LTable]map[lua.LValue]lua.LValue: at init we shallow-copy the fields of every LTable-valued global (string, math, table, redis, cjson, cmsgpack; _G itself is skipped since it's the globals table).
  • Reset runs resetTableContents on every snapshotted table BEFORE top-level globals are restored -- the outer map is keyed by *LTable, so even if a script did string = nil, the original string table is still reachable via the snapshot and its fields are restored.
  • Verified by TestLua_VMReuseDoesNotPoisonStringLib (script A replaces string.upper with a pwned function + adds a rogue string.pwn field; script B asserts string.upper(\"x\") == \"X\" and string.pwn == nil) and a sibling TestLua_VMReuseDoesNotPoisonRedisModule that does the same for redis.sha1hex.

3. Nil context nil-pointer deref (HIGH, lines 266 / 274)

  • Changed both closures to if !ok || ctx == nil as suggested. Matches the bench path that calls pool.get(nil).
  • Verified by TestLua_PoolNilContextProducesErrorNotPanic, which drives both redis.call (must raise a Lua error) and redis.pcall (must push an error table, not panic).

Verification

  • go test -race -count=1 -short ./adapter/... -- 384 tests, all pass.
  • go test -race -count=5 -short -run 'TestLua_VMReuse|TestLua_Pool|TestRedis_LuaPool' -v ./adapter/ -- the 9 pool-relevant tests (incl. the live-cluster e2e leak test) pass 5x under race with no flakes.
  • make lint clean (mnd caught one magic number, now a named constant).
  • A full -race -count=5 ./adapter/... ran hot into port exhaustion from the 3-node raft fixtures (listen tcp: address 65538: invalid port) -- pre-existing infra limit of the integration harness, unrelated to this PR.

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a Lua state pool to optimize Redis EVAL operations by reusing Lua virtual machines, significantly reducing heap allocations and GC pressure. The implementation includes a security mechanism to snapshot and restore the Lua environment, preventing state leakage between executions. Reviewer feedback identifies a potential concurrency bottleneck in the global context mapping and recommends using the Lua registry for thread-local storage, alongside suggestions to use reusable slices during the environment reset to further minimize memory overhead.

Comment thread adapter/redis_lua_pool.go Outdated
Comment on lines +114 to +136
var luaStateBindings = struct {
sync.RWMutex
m map[*lua.LState]*luaScriptContext
}{m: map[*lua.LState]*luaScriptContext{}}

func luaBindContext(state *lua.LState, ctx *luaScriptContext) {
luaStateBindings.Lock()
luaStateBindings.m[state] = ctx
luaStateBindings.Unlock()
}

func luaUnbindContext(state *lua.LState) {
luaStateBindings.Lock()
delete(luaStateBindings.m, state)
luaStateBindings.Unlock()
}

func luaLookupContext(state *lua.LState) (*luaScriptContext, bool) {
luaStateBindings.RLock()
ctx, ok := luaStateBindings.m[state]
luaStateBindings.RUnlock()
return ctx, ok
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The use of a global map with a sync.RWMutex for mapping *lua.LState to *luaScriptContext introduces a global synchronization point that will become a bottleneck under high concurrency. Every redis.call or redis.pcall invocation inside a Lua script performs an RLock on this global mutex, which can lead to significant contention.

A more efficient and scalable approach is to store the *luaScriptContext directly within the *lua.LState using the Lua registry (lua.RegistryIndex). This is local to the state and requires no synchronization.

Suggested implementation pattern:

  1. In newPooledLuaState, pre-allocate an *lua.LUserData in the registry: state.RawSet(lua.RegistryIndex, lua.LString("elastickv_ctx"), state.NewUserData(nil)).
  2. In luaBindContext, simply update the Value of that pre-allocated LUserData.
  3. In luaLookupContext, read the LUserData from the registry and return its Value.

This also improves memory safety by ensuring the context reference is cleaned up when the LState is destroyed, avoiding potential leaks in a global map.

References
  1. For frequently accessed fields or hot paths, prefer mechanisms that avoid global mutex contention (like local storage or atomic operations) to improve performance.

Comment thread adapter/redis_lua_pool.go
Comment on lines +83 to +103
type pooledLuaState struct {
state *lua.LState
// globalsSnapshot is a copy of every entry reachable via the
// state's global table at init, keyed by LValue (not just string)
// so scripts cannot smuggle state across evals via non-string
// keys such as _G[42] = "secret".
globalsSnapshot map[lua.LValue]lua.LValue
// tableSnapshots holds the shallow field sets of well-known
// whitelisted tables (string, math, table, redis, cjson,
// cmsgpack) captured at init. On reset we restore each to its
// original contents so a script doing e.g.
// `string.upper = function() return "pwned" end` cannot poison
// subsequent pooled reuses.
//
// The outer map is keyed by the *LTable pointer of the parent
// (e.g. the `string` table) so tableSnapshots survives even if a
// script rebinds the global name (`string = nil`) -- the reset
// restores the global name first, then restores the table's
// internal contents from this snapshot.
tableSnapshots map[*lua.LTable]map[lua.LValue]lua.LValue
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To further reduce GC pressure and achieve zero-allocation cleanup, consider adding a scratchKeys []lua.LValue field to the pooledLuaState struct. This slice can be reused in reset and resetTableContents to collect keys before mutation, avoiding the current per-execution allocations. To prevent unbounded memory growth, ensure this collection remains within reasonable bounds.

type pooledLuaState struct {
	state *lua.LState
	// globalsSnapshot is a copy of every entry reachable via the
	// state's global table at init, keyed by LValue (not just string)
	// so scripts cannot smuggle state across evals via non-string
	// keys such as _G[42] = "secret".
	globalsSnapshot map[lua.LValue]lua.LValue
	// tableSnapshots holds the shallow field sets of well-known
	// whitelisted tables (string, math, table, redis, cjson,
	// cmsgpack) captured at init. On reset we restore each to its
	// original contents so a script doing e.g.
	// string.upper = function() return "pwned" end cannot poison
	// subsequent pooled reuses.
	//
	// The outer map is keyed by the *LTable pointer of the parent
	// (e.g. the string table) so tableSnapshots survives even if a
	// script rebinds the global name (string = nil) -- the reset
	// restores the global name first, then restores the table's
	// internal contents from this snapshot.
	tableSnapshots map[*lua.LTable]map[lua.LValue]lua.LValue
	// scratchKeys is a reusable slice for collecting table keys during
	// reset to avoid allocations.
	scratchKeys []lua.LValue
}
References
  1. To prevent unbounded memory growth and potential OOM issues, apply a fixed bound to collections that can grow from external requests.

Comment thread adapter/redis_lua_pool.go Outdated
Comment on lines +278 to +281
currentKeys := make([]lua.LValue, 0, len(p.globalsSnapshot)+luaResetKeySlack)
globals.ForEach(func(k, _ lua.LValue) {
currentKeys = append(currentKeys, k)
})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Reuse the scratchKeys slice from pooledLuaState here to avoid allocating a new slice on every reset. The same pattern should be applied to resetTableContents at line 318.

Suggested change
currentKeys := make([]lua.LValue, 0, len(p.globalsSnapshot)+luaResetKeySlack)
globals.ForEach(func(k, _ lua.LValue) {
currentKeys = append(currentKeys, k)
})
currentKeys := p.scratchKeys[:0]
globals.ForEach(func(k, _ lua.LValue) {
currentKeys = append(currentKeys, k)
})
p.scratchKeys = currentKeys

Move the per-eval *luaScriptContext binding off the global
luaStateBindings map (guarded by sync.RWMutex) and into a
pre-allocated *lua.LUserData stashed in each pooled state's own Lua
registry under a fixed key. redis.call / redis.pcall now resolve the
context via a lock-free, state-local read, eliminating the global
RLock contention point that would otherwise be hit on every
redis.call (e.g. ~50 lookups/s/script under BullMQ fan-out).

Also reuse pooledLuaState.scratchKeys across resets so
resetPooledLuaState / resetTableContents no longer allocate a fresh
[]lua.LValue per call. The retained backing array is dropped if it
grows past luaScratchKeysMaxCap (1024) to guard against pathological
scripts pinning unbounded memory on a pooled state.

Adds regression tests:
- TestLua_PoolConcurrentContextIsolation: 64 goroutines x 100 lookups
  must each observe their own *luaScriptContext (never another
  goroutine's, proved under -race -count=5.
- TestLua_PoolContextIsRegistryBacked: pins the contract that the
  binding lives in the state registry, not a package-level map.
- TestLua_PoolScratchKeysReused: asserts reuse + max-cap bound.
- BenchmarkLuaLookupContext_Concurrent: measures concurrent lookup
  cost (lock-free after this change).
EOF
)
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 22, 2026

Addressing Gemini's 2026-04-21 review — commit 3064b91f

HIGH: global luaStateBindings map (line 136)

Picked Option B (per-state Lua registry binding):

  • Pre-allocate one *lua.LUserData per pooled state during newPooledLuaState, stashed in the state's own registry under luaCtxRegistryKey = "elastickv_ctx".
  • luaStatePool.get / put set/clear the userdata's .Value (a single pointer write — no mutex, no global map).
  • luaLookupContext does state.GetField(state.Get(lua.RegistryIndex), luaCtxRegistryKey) + type assertion. Each pooled *lua.LState is used by at most one goroutine at a time, so the per-state registry is lock-free.
  • The global luaStateBindings map + sync.RWMutex are gone.

Why B over A: lua.LState.SetContext exists in gopher-lua v1.1.2 but takes a context.Context, not our *luaScriptContext. Using the registry keeps the existing type system and, crucially, removes the cross-state contention entirely rather than just sharding it (Option C).

Evidence of correctnessTestLua_PoolConcurrentContextIsolation: 64 goroutines x 100 lookups each, with distinct *luaScriptContext pointers; assertion that every lookup observed its own pointer. Passes under go test -race -count=5 -run TestLua_Pool. TestLua_PoolContextIsRegistryBacked pins the registry contract so a future refactor cannot silently reintroduce the global map.

Bench (darwin/arm64, M1 Max, -benchtime=3s):

BenchmarkLuaLookupContext_Concurrent-10    1,927,508 ops    1918 ns/op    1491 B/op    93 allocs/op
BenchmarkLuaState_NewVsPooled/pooled-10      132,823 ops   18853 ns/op   39967 B/op   162 allocs/op

No regression vs. the pre-change pooled path; lookups are now lock-free.

MEDIUM: reuse scratchKeys (lines 103 / 281 / 318)

Added scratchKeys []lua.LValue to pooledLuaState. reset borrows it, passes it into resetTableContents (now takes and returns a scratch slice), and stores it back sliced to [:0] for the next reset. Capped by luaScratchKeysMaxCap = 1024 — past that the backing array is dropped so a pathological script cannot pin unbounded memory on a pooled state. Covered by TestLua_PoolScratchKeysReused.

Gates

  • go test -race -count=1 -short ./adapter/... — passes (48s).
  • go test -race -count=5 -run TestLua_Pool -timeout 300s ./adapter/... — passes.
  • make lint — 0 issues.

/gemini review

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Introduces a pooled *lua.LState implementation for Redis EVAL/EVALSHA to reduce allocation rate and GC pressure by reusing initialized Lua VMs while enforcing a “no cross-script state leakage” reset invariant.

Changes:

  • Add luaStatePool / pooledLuaState with global/table snapshotting, per-state registry-backed context binding, and reset logic.
  • Wire Redis script execution to acquire/release pooled Lua states instead of creating/closing a fresh VM per script.
  • Add unit/integration tests and benchmarks validating reuse, isolation, and concurrency behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
adapter/redis_lua_pool.go Implements the pooled Lua VM, snapshotting, context binding, and reset logic.
adapter/redis_lua_pool_test.go Adds safety/concurrency regression tests and benchmarks for pooling behavior.
adapter/redis_lua.go Switches EVAL execution to use the pool and removes the per-invocation Lua VM init path.
adapter/redis.go Adds RedisServer fields for the Lua pool and initializes it in the constructor.

Comment thread adapter/redis_lua_pool.go
Comment on lines +300 to +344
func (p *pooledLuaState) reset() {
globals := p.state.G.Global

// (1) Restore inner contents of every snapshotted whitelisted
// table. This defeats poisoning attacks like
// `string.upper = function() return "pwned" end`.
//
// resetTableContents borrows p.scratchKeys as a working slice.
// We pass it in and receive the (possibly grown) backing array
// back so successive calls within the same reset share one
// allocation.
scratch := p.scratchKeys[:0]
for tbl, originalFields := range p.tableSnapshots {
scratch = resetTableContents(tbl, originalFields, scratch[:0])
}

// (2) Collect all current global keys (of any type). Mutating
// the table inside ForEach is unsafe, so snapshot keys first.
scratch = scratch[:0]
globals.ForEach(func(k, _ lua.LValue) {
scratch = append(scratch, k)
})

// Delete any key not in the init-time snapshot: these are
// user-introduced globals (KEYS, ARGV, GLOBAL_LEAK, _G[42],
// _G[true], ...).
//
// We use RawSet (not RawSetH) because gopher-lua stores integer
// keys in an internal `array` slice rather than `dict`; RawSetH
// only touches `dict`, so a call like RawSetH(LNumber(42), LNil)
// leaves the array entry intact. RawSet dispatches to the right
// storage by key type.
for _, k := range scratch {
if _, keep := p.globalsSnapshot[k]; !keep {
globals.RawSet(k, lua.LNil)
}
}

// (3) Restore every whitelisted global to its original value.
// This covers the case where a script rebinds an allowed global
// (e.g. `redis = something`) -- we simply put the original back.
for k, v := range p.globalsSnapshot {
globals.RawSet(k, v)
}

Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reset logic restores global/table entries but does not restore table metatables. A script can call setmetatable(_G, ...) or setmetatable(string, ...) (base lib exposes setmetatable) and leak/poison behavior across pooled evals without changing any table fields, breaking the stated isolation invariant. Consider snapshotting metatables at pool-fill time (for _G and each table in tableSnapshots) and restoring them during reset (or explicitly clearing to the init-time metatable) in addition to restoring table fields.

Copilot uses AI. Check for mistakes.
Comment thread adapter/redis_lua_pool_test.go Outdated
// than string. Lua permits any non-nil, non-NaN value as a table key,
// so a script doing `_G[42] = "leak"` or `_G[true] = "bad"` bypasses a
// naive string-only snapshot/wipe. The LValue-keyed snapshot + the
// RawSetH-based reset in pool.reset must catch these.
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test comment says the pool reset is "RawSetH-based", but the implementation uses RawSet to correctly clear integer keys in the array part (see redis_lua_pool.go reset()). Updating the comment will avoid confusion when diagnosing regressions.

Suggested change
// RawSetH-based reset in pool.reset must catch these.
// RawSet-based reset in pool.reset must catch these.

Copilot uses AI. Check for mistakes.
The Lua VM pool's reset logic snapshotted and restored _G entries plus
the fields of whitelisted tables, but it did not touch metatables.
Because setmetatable is in the base library, a script could install
a poisoned __index on _G (or on string/math/table/redis/cjson/cmsgpack)
and leak attacker-controlled behaviour into every subsequent pooled
eval that dereferenced an undefined key.

Capture each table's raw metatable at pool-fill time (LTable.Metatable,
bypassing __metatable dispatch) and restore it on reset. When the
original had no metatable, restore lua.LNil so any metatable the
script installed is stripped.

Also fixes the misleading "RawSetH-based" comment in
TestLua_VMReuseNonStringGlobalKeysAreWiped -- the implementation uses
RawSet so integer keys in gopher-lua's array part are actually cleared.

Covered tables: _G, string, table, math, redis, cjson, cmsgpack.
Regression tests TestLua_VMReuseDoesNotPoisonGlobalsMetatable and
TestLua_VMReuseDoesNotPoisonStringMetatable verify that both attack
vectors no longer leak across pool reuse, and both reproduce the
vulnerability when the metatable-restore loop is stubbed out.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 22, 2026

/gemini review

Addressed Copilot review items (commit eacc395)

1. Metatable snapshot/restore (security-relevant)

newPooledLuaState now captures the raw metatable of _G and of
every snapshotted whitelisted table (string, table, math,
redis, cjson, cmsgpack) alongside the existing field snapshot,
reading LTable.Metatable directly to bypass __metatable
indirection.

reset() restores each metatable as the first step, before reseting
table fields and globals. When the init-time metatable was absent,
lua.LNil is stored in the snapshot and passed to SetMetatable,
which strips any metatable the script installed.

Blocks:

  • setmetatable(_G, { __index = function() return "pwned" end })
  • setmetatable(string, { ... }) and the same for the other
    whitelisted tables.

Regression tests:

  • TestLua_VMReuseDoesNotPoisonGlobalsMetatable
  • TestLua_VMReuseDoesNotPoisonStringMetatable

Both were verified to fail without the fix (by stubbing out the
restore loop) and pass after. Full targeted run:
go test -race -count=5 -run TestLua_Pool ./adapter/... is green,
as is the full short adapter suite and make lint.

2. Test comment fix

The comment in TestLua_VMReuseNonStringGlobalKeysAreWiped that said
"RawSetH-based reset" has been updated to "RawSet-based reset",
matching the implementation (which uses RawSet so gopher-lua's
integer-array part is correctly cleared).

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a Lua state pool to the Redis adapter to reduce heap and GC pressure by reusing *lua.LState instances during script execution. The implementation includes a robust reset mechanism that snapshots and restores globals, nested tables, and metatables to ensure security and isolation between scripts. Additionally, it replaces a mutex-guarded global map with a lock-free context lookup using the Lua registry to improve performance under high concurrency. One issue was identified in the hit/miss tracking logic, where the use of sync.Pool.New causes all acquisitions to be recorded as hits, even when a new state is allocated.

Comment thread adapter/redis_lua_pool.go
Comment on lines +476 to +483
pls, ok := p.pool.Get().(*pooledLuaState)
if !ok || pls == nil {
// New func never returns nil, but defend against misuse.
pls = newPooledLuaState()
p.misses.Add(1)
} else {
p.hits.Add(1)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The hit/miss tracking logic is currently inaccurate because sync.Pool.New is set in newLuaStatePool. When p.pool.Get() is called, if the pool is empty, it automatically invokes the New function and returns a fresh state. Consequently, ok will always be true and pls will never be nil, causing every acquisition to be recorded as a 'hit' even when a new allocation occurs. To accurately track reuse, you should either remove the New function from the pool and handle allocation manually in get, or use a flag within pooledLuaState to distinguish fresh instances.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants