Merged
107 changes: 79 additions & 28 deletions docs/design/2026_04_20_implemented_lease_read.md
@@ -91,26 +91,39 @@ holds:

```go
type leaseState struct {
gen atomic.Uint64 // bumped by invalidate()
expiry atomic.Pointer[time.Time] // nil = expired / invalidated
gen atomic.Uint64 // bumped by invalidate()
expiryNanos atomic.Int64 // 0 = expired / invalidated; else monoclock.Instant nanos
}
```

- `expiry == nil` or `time.Now() >= *expiry`: lease is expired. The next
`LeaseRead` falls back to `LinearizableRead` and refreshes the lease on
success.
- `time.Now() < *expiry`: lease is valid. `LeaseRead` returns immediately
without contacting the Raft layer.
- `invalidate()` increments `gen` before clearing `expiry`. `extend()`
captures `gen` at entry and, after its CAS lands, undoes its own
write (via CAS on the pointer it stored) iff `gen` has moved. This
prevents a Dispatch that succeeded just before a leader-loss
invalidate from resurrecting the lease milliseconds after it was
cleared. A fresh `extend()` that captured the post-invalidate
generation is left intact because it stored a different pointer.

The lock-free form lets readers do one atomic load + one wall-clock compare
on the fast path.
All timestamps on the lease path come from `internal/monoclock`, which
reads `CLOCK_MONOTONIC_RAW` via `clock_gettime(3)` on Linux and Darwin
(falling back to Go's runtime monotonic on other platforms — FreeBSD
included, since `golang.org/x/sys/unix` does not export
`CLOCK_MONOTONIC_RAW` on FreeBSD).
The raw monotonic clock is immune to NTP rate adjustment and wall-clock
step events — TiKV's lease path makes the same choice. Go's
`time.Now()` is not sufficient: its embedded monotonic component is
still NTP-slewed at up to 500 ppm under POSIX, and a misconfigured or
abused time daemon can exceed that cap. See §3.2 on why the safety
argument should not rest on NTP behaving.

- `expiryNanos == 0` or `monoclock.Now() >= expiry`: lease is expired.
The next `LeaseRead` falls back to `LinearizableRead` and refreshes
the lease on success.
- `monoclock.Now() < expiry`: lease is valid. `LeaseRead` returns
immediately without contacting the Raft layer.
- `invalidate()` increments `gen` before clearing `expiryNanos`.
`extend()` captures `gen` at entry and, after its CAS lands, undoes
its own write (via CAS on the exact value it wrote) iff `gen` has
moved. This prevents a Dispatch that succeeded just before a
leader-loss invalidate from resurrecting the lease milliseconds
after it was cleared. A fresh `extend()` that captured the
post-invalidate generation is left intact because its CAS already
replaced the earlier target.

The lock-free form lets readers do one atomic load + one monotonic-raw
Comment on lines 92 to +125

⚠️ Potential issue | 🟡 Minor

Update the design doc to describe the implementation that ships in this PR.

Section 3.1 still shows the old gen/expiryNanos atomic layout, and the Section 3.5 pseudocode still describes only the caller-side lease path. The code in kv/lease_state.go and kv/coordinator.go now uses atomic.Pointer[leaseSlot] plus the primary LastQuorumAck() fast path, so the document is already out of sync.

Also applies to: 272-304

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/2026_04_20_implemented_lease_read.md` around lines 92 - 125, The
design doc still describes the old atomic layout (leaseState with
gen/expiryNanos) and caller-only pseudocode, but the shipped implementation uses
atomic.Pointer[leaseSlot] and the primary LastQuorumAck() fast path; update
Sections 3.1 and 3.5 (and lines referenced 272-304) to match the implementation
by replacing the gen/expiryNanos description with the leaseSlot/atomic.Pointer
design, document how invalidate(), extend(), and LeaseRead interact with the
atomic.Pointer-based slot and how LastQuorumAck() provides the fast path, and
update the pseudocode to reflect the coordinator-side behavior in lease_state.go
and coordinator.go (including how CAS on leaseSlot and generation/versioning is
performed and how fallbacks to LinearizableRead occur).

compare on the fast path.
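
The `invalidate()` / `extend()` interplay described in this section can be sketched as below. This is a minimal illustration of the `gen` / `expiryNanos` layout as the doc text describes it, not the code in `kv/lease_state.go` (which, per the review comment above, uses a different layout). Instants are plain `int64` nanosecond counters here, and the "never shorten an existing expiry" rule in `extend` is an assumption made for the sketch.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// leaseState follows the layout shown in the doc: a generation counter
// plus an expiry stored as raw monotonic nanoseconds (0 = expired).
type leaseState struct {
	gen         atomic.Uint64
	expiryNanos atomic.Int64
}

// valid is the reader fast path: one atomic load and one compare.
// A zero `now` is treated as "clock unavailable" and never validates.
func (s *leaseState) valid(now int64) bool {
	exp := s.expiryNanos.Load()
	return now != 0 && exp != 0 && now < exp
}

// invalidate bumps gen BEFORE clearing the expiry, so a racing extend
// can detect that it crossed an invalidation.
func (s *leaseState) invalidate() {
	s.gen.Add(1)
	s.expiryNanos.Store(0)
}

// extend captures gen at entry, installs the new expiry with a CAS, and
// undoes its own write (CAS on the exact value it wrote) iff gen moved.
func (s *leaseState) extend(until int64) {
	genAtEntry := s.gen.Load()
	for {
		cur := s.expiryNanos.Load()
		if cur >= until {
			return // assumption: an equal or later expiry is kept as-is
		}
		if s.expiryNanos.CompareAndSwap(cur, until) {
			break
		}
	}
	if s.gen.Load() != genAtEntry {
		// An invalidate ran while we were extending; retract only our
		// own value so a newer extend is not clobbered.
		s.expiryNanos.CompareAndSwap(until, 0)
	}
}

func main() {
	var s leaseState
	s.extend(1000)
	fmt.Println(s.valid(500)) // true
	s.invalidate()
	fmt.Println(s.valid(500)) // false
	s.extend(2000)
	fmt.Println(s.valid(1500)) // true
}
```

The undo step uses a compare-and-swap rather than a plain store so that it is a no-op whenever a later `extend()` has already replaced the value.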

### 3.2 Lease duration

@@ -134,13 +147,44 @@ config: `10ms * 100 - 300ms = 700 ms`.
`leaseSafetyMargin` (proposed: 300 ms) absorbs:

- Goroutine scheduling delay between heartbeat ack and lease refresh.
- Wall-clock skew between leader and the partition's new leader candidate.
- Clock skew between leader and the partition's new leader candidate.
(Both read `CLOCK_MONOTONIC_RAW` on their own hosts — the skew here
is between two independent monotonic oscillators, not NTP-adjusted
wall clocks. Per-host drift of quartz oscillators is ≤50 ppm, so
the two sides can diverge from each other by at most ~100 ppm, which
is about 0.1 ms across a 1 s electionTimeout window.)
- GC pauses on the leader.

The margin is conservative; reducing it shortens the post-write quiet window
during which lease reads still hit local state, at the cost of a smaller
safety buffer.

#### Why CLOCK_MONOTONIC_RAW, not time.Now()

Go's `time.Now()` embeds the kernel's NTP-adjusted monotonic clock
(`CLOCK_MONOTONIC` on Linux), which is rate-slewed at up to 500 ppm
under POSIX. Inside a 700 ms lease window that amounts to ~0.35 ms —
comfortably smaller than the 300 ms safety margin on paper. But the
safety case for the lease should not depend on NTP being well-behaved:

1. POSIX caps slew rate at 500 ppm, but a misconfigured or malicious
`adjtimex` call can exceed that cap.
2. A lease-read regression is observable only when the lease boundary
overshoots `electionTimeout`, i.e. under exactly the conditions
(partition + clock drift) where NTP is most likely to be wrong.
3. TiKV, whose lease-read design we otherwise track, already uses
`CLOCK_MONOTONIC_RAW` for this reason. Matching their choice keeps
the door open for tightening `leaseSafetyMargin` below ~5 ms, at
which point NTP slew alone becomes comparable to the margin.

The `internal/monoclock` package wraps `clock_gettime(CLOCK_MONOTONIC_RAW, &ts)`
(`unix.ClockGettime` from `golang.org/x/sys/unix`) on Linux and
Darwin. Other platforms — including FreeBSD, where `x/sys/unix` does
not export the `CLOCK_MONOTONIC_RAW` constant — fall back to Go's
runtime monotonic clock; on those platforms lease safety reverts to
the NTP-slewed baseline, which is still sufficient at the current
margin.

### 3.3 Refresh triggers

The lease is refreshed on:
@@ -152,8 +196,9 @@ The lease is refreshed on:
confirmation than ReadIndex.

Both refreshes base the new expiry on `preOpInstant + LeaseDuration()`,
where `preOpInstant` is captured BEFORE the quorum operation starts, not
after it returns. This is strictly conservative: any real quorum
where `preOpInstant` is a `monoclock.Now()` reading
(`CLOCK_MONOTONIC_RAW`) captured BEFORE the quorum operation starts,
not after it returns. This is strictly conservative: any real quorum
confirmation must happen at or after `preOpInstant`, so the lease window
can only be shorter than the true safety window, never longer.
Post-operation sampling would let apply-queue depth / scheduling jitter
@@ -234,11 +279,13 @@ func (c *Coordinate) LeaseRead(ctx context.Context) (uint64, error) {
// Misconfigured tick settings disable the lease entirely.
return c.LinearizableRead(ctx)
}
// Capture time.Now() AND lease.generation() exactly once before
// any quorum work. The generation guard prevents a leader-loss
// callback that fires during LinearizableRead from being
// silently overwritten by the post-op extend.
now := time.Now()
// Capture monoclock.Now() AND lease.generation() exactly once
// before any quorum work. The monotonic-raw sample keeps the
// window safe against NTP rate adjustment / wall-clock steps;
// the generation guard prevents a leader-loss callback that
// fires during LinearizableRead from being silently overwritten
// by the post-op extend.
now := monoclock.Now()
expectedGen := c.lease.generation()
if c.lease.valid(now) {
return lp.AppliedIndex(), nil
@@ -357,9 +404,13 @@ write to commit. However:
out of leader and invalidates the lease.
- Clock skew exceeding `leaseSafetyMargin`: lease may extend beyond
`electionTimeout`, allowing a stale read after a successor leader has
accepted writes. Mitigation: keep `leaseSafetyMargin` larger than the
documented clock-skew SLO of the deployment. Default 300 ms is consistent
with the HLC physical window of 3 s used elsewhere.
accepted writes. Because the lease path uses `CLOCK_MONOTONIC_RAW`
(see §3.2), this hazard is bounded by inter-host oscillator drift
(~50 ppm quartz-spec ceiling), not by NTP's 500 ppm slew or
operator-driven `settimeofday` jumps — a misconfigured time daemon
can no longer push the lease past its safety window. The 300 ms
default margin remains consistent with the HLC physical window of
3 s used elsewhere.

---

61 changes: 61 additions & 0 deletions internal/monoclock/monoclock.go
@@ -0,0 +1,61 @@
// Package monoclock exposes a monotonic-raw clock for the lease-read
// path.
//
// Go's time.Now() returns a wall-clock value backed internally by the
// kernel's CLOCK_MONOTONIC (Linux) or its equivalent — which is
// rate-adjusted ("slewed") by NTP at up to 500 ppm. That slew is small
// in steady state (~0.35 ms over a 700 ms lease window), but the safety
// case for leader-local lease reads should not depend on NTP being
// well-behaved: a misconfigured or abused time daemon can push the
// slew rate far past the 500 ppm POSIX cap, and other monotonic time
// sources (e.g. CLOCK_MONOTONIC_COARSE) can compound the error.
// CLOCK_MONOTONIC_RAW is immune to NTP rate adjustment and step events
// and is what TiKV's lease path uses.
//
// Instant values are opaque int64 nanosecond counters. They are only
// comparable within the same process lifetime and MUST NOT be
// persisted, serialized, or sent over the wire — the zero point is
// arbitrary and changes across processes. Callers that need an
// externally-meaningful timestamp should sample time.Now() separately;
// Instant is only for intra-process lease-safety reasoning.
package monoclock

import "time"

// Instant is a reading from the monotonic-raw clock. The zero value
// represents "no reading" and compares equal to Zero.
type Instant struct {
ns int64
}

// Zero is the unset Instant.
var Zero = Instant{}

// Now returns the current monotonic-raw instant.
func Now() Instant { return Instant{ns: nowNanos()} }

// IsZero reports whether i is the zero Instant.
func (i Instant) IsZero() bool { return i.ns == 0 }

// After reports whether i is strictly after j.
func (i Instant) After(j Instant) bool { return i.ns > j.ns }

// Before reports whether i is strictly before j.
func (i Instant) Before(j Instant) bool { return i.ns < j.ns }

// Sub returns i - j as a Duration. Meaningful only when neither i nor
// j is the zero Instant; callers must guard with IsZero first.
func (i Instant) Sub(j Instant) time.Duration { return time.Duration(i.ns - j.ns) }

// Add returns i advanced by d.
func (i Instant) Add(d time.Duration) Instant { return Instant{ns: i.ns + int64(d)} }

// Nanos returns the raw int64 counter. Intended for atomic.Int64
// storage where a whole Instant struct cannot be stored atomically
// (see internal/raftengine/etcd/quorum_ack.go).
func (i Instant) Nanos() int64 { return i.ns }

// FromNanos reconstructs an Instant from a raw counter previously
// obtained via Nanos(). Counterpart to Nanos; the same intra-process
// caveats apply.
func FromNanos(ns int64) Instant { return Instant{ns: ns} }
19 changes: 19 additions & 0 deletions internal/monoclock/monoclock_fallback.go
@@ -0,0 +1,19 @@
//go:build !(linux || darwin)

package monoclock

import "time"

// epoch anchors the fallback monotonic counter. time.Since uses Go's
// runtime monotonic component and is step-immune, though unlike
// CLOCK_MONOTONIC_RAW it is still subject to NTP rate adjustment. On
// platforms where golang.org/x/sys/unix does not export
// CLOCK_MONOTONIC_RAW (FreeBSD, Windows, Plan 9, ...) this is the
// closest portable substitute; lease safety on those platforms
// therefore matches the pre-#551 behaviour. Linux and Darwin use
// the raw clock (monoclock_unix.go).
var epoch = time.Now()

func nowNanos() int64 {
return int64(time.Since(epoch))
}
59 changes: 59 additions & 0 deletions internal/monoclock/monoclock_test.go
@@ -0,0 +1,59 @@
package monoclock

import (
"testing"
"time"

"github.com/stretchr/testify/require"
)

func TestInstant_ZeroIsZero(t *testing.T) {
t.Parallel()
require.True(t, Zero.IsZero())
var i Instant
require.True(t, i.IsZero())
require.True(t, FromNanos(0).IsZero())
}

func TestNow_IsNonZeroAndMonotonic(t *testing.T) {
t.Parallel()
// CLOCK_MONOTONIC_RAW must advance across two Now() calls (modulo
// nanosecond-granularity ties; use a sleep to ensure monotonic
// progress). A regression that returns 0 or runs the clock
// backwards would break every lease-read safety guard.
a := Now()
require.False(t, a.IsZero(), "Now must return non-zero instant on supported platforms")
time.Sleep(100 * time.Microsecond)
b := Now()
require.False(t, b.Before(a), "monotonic-raw clock must not regress across calls")
require.True(t, b.After(a) || b == a)
}

func TestInstant_AddAndSub(t *testing.T) {
t.Parallel()
base := FromNanos(1_000_000)
later := base.Add(250 * time.Millisecond)
require.True(t, later.After(base))
require.Equal(t, 250*time.Millisecond, later.Sub(base))
require.Equal(t, -250*time.Millisecond, base.Sub(later))
}

func TestInstant_NanosRoundtrip(t *testing.T) {
t.Parallel()
i := FromNanos(42)
require.Equal(t, int64(42), i.Nanos())
}

func TestInstant_BeforeAfterOrdering(t *testing.T) {
t.Parallel()
a := FromNanos(100)
b := FromNanos(200)
require.True(t, a.Before(b))
require.True(t, b.After(a))
require.False(t, a.After(b))
require.False(t, b.Before(a))
// Equal instants: neither Before nor After.
c := FromNanos(100)
require.False(t, a.Before(c))
require.False(t, a.After(c))
}
27 changes: 27 additions & 0 deletions internal/monoclock/monoclock_unix.go
@@ -0,0 +1,27 @@
//go:build linux || darwin

package monoclock

import "golang.org/x/sys/unix"

// nowNanos reads CLOCK_MONOTONIC_RAW via clock_gettime(3). Only Linux
// and Darwin export this constant in golang.org/x/sys/unix; FreeBSD
// lacks the binding (its kernel exposes CLOCK_MONOTONIC_PRECISE, a
// different clock) and all other platforms use the portable fallback
// in monoclock_fallback.go.
//
// A non-nil error from ClockGettime should be essentially impossible
// on supported platforms — the syscall fails only on invalid clock
// IDs (compile-time constant here) or EFAULT on the timespec pointer
// (stack-allocated here). The realistic failure mode is a
// seccomp/sandbox profile that denies clock_gettime. We return 0 in
// that case: callers (leaseState.valid, engineLeaseAckValid) treat a
// zero Instant as "clock unavailable" and force the slow path, so a
// persistent syscall failure cannot leave a warmed lease valid.
func nowNanos() int64 {
var ts unix.Timespec
if err := unix.ClockGettime(unix.CLOCK_MONOTONIC_RAW, &ts); err != nil {

P1: Use a suspend-aware clock for lease expiration

Selecting CLOCK_MONOTONIC_RAW in internal/monoclock/monoclock_unix.go makes lease age checks depend on a clock that does not advance during system suspend, but kv/engineLeaseAckValid and lease checks use now.Sub(ack) < leaseDur as if elapsed real time always advances. In a VM/host pause or cgroup freezer event, a former leader can resume with StateLeader and a still-“fresh” lease even though another node may have elected and committed writes during the pause, allowing stale reads until step-down traffic is processed. This should use a suspend-aware source (or explicitly invalidate leases on resume) to preserve lease safety across suspend/resume.


return 0
Comment on lines +23 to +24

P1: Avoid returning zero when monotonic clock read fails

Returning 0 on ClockGettime error turns monoclock.Now() into the minimum possible instant, and that does not fail closed in the lease path: Coordinate.LeaseRead falls back to the caller-side check c.lease.valid(now) (kv/coordinator.go) where now=0 is treated as "before expiry" for any non-zero lease, so once a lease is warmed the node can keep serving local reads without LinearizableRead under persistent clock failures (for example, sandbox/seccomp environments where clock_gettime is denied). This can bypass lease expiration safety during leader isolation; the error path should invalidate/disable lease fast-paths instead of producing a valid timestamp.
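
A small sketch of the failure mode described here, next to the `!now.IsZero()` guard that the `LeaseProvider` doc comment in this diff calls for. The `instant` type and both function names are hypothetical stand-ins for `monoclock.Instant` and the lease check; they are not taken from `kv/coordinator.go`.

```go
package main

import "fmt"

// instant mirrors the shape of monoclock.Instant for illustration only.
type instant struct{ ns int64 }

func (i instant) isZero() bool { return i.ns == 0 }

// validNaive compares now against the expiry without checking now itself.
// A zero now (failed clock read) sorts before every non-zero expiry, so
// the lease stays "valid" for as long as the clock keeps failing.
func validNaive(now, expiry instant) bool {
	return !expiry.isZero() && now.ns < expiry.ns
}

// validFailClosed treats a zero now as "clock unavailable" and forces
// the slow path.
func validFailClosed(now, expiry instant) bool {
	return !now.isZero() && !expiry.isZero() && now.ns < expiry.ns
}

func main() {
	expiry := instant{ns: 5_000}
	failed := instant{} // what Now() yields when clock_gettime errors

	fmt.Println(validNaive(failed, expiry))                  // true: fails open
	fmt.Println(validFailClosed(failed, expiry))             // false: fails closed
	fmt.Println(validFailClosed(instant{ns: 4_000}, expiry)) // true: normal fast path
}
```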


}
return ts.Nano()
}
51 changes: 32 additions & 19 deletions internal/raftengine/engine.go
@@ -5,6 +5,8 @@ import (
"errors"
"io"
"time"

"github.com/bootjp/elastickv/internal/monoclock"
)

// Shared sentinel errors that both engine implementations should wrap
@@ -94,31 +96,42 @@ type LeaseProvider interface {
LeaseDuration() time.Duration
// AppliedIndex returns the highest log index applied to the local FSM.
AppliedIndex() uint64
// LastQuorumAck returns the instant at which the engine most recently
// observed majority liveness on the leader -- i.e. the wall-clock time
// by which a quorum of follower Progress entries had responded. The
// engine maintains this in the background from MsgHeartbeatResp /
// MsgAppResp traffic on the leader, so a fast-path lease read does
// not need to issue its own ReadIndex to "warm" the lease.
// LastQuorumAck returns the monotonic-raw instant at which the
// engine most recently observed majority liveness on the leader
// -- i.e. the CLOCK_MONOTONIC_RAW reading at which a quorum of
// follower Progress entries had responded. The engine maintains
// this in the background from MsgHeartbeatResp / MsgAppResp traffic
// on the leader, so a fast-path lease read does not need to issue
// its own ReadIndex to "warm" the lease.
//
// Safety: callers must verify the lease against a single
// `now := time.Now()` sample:
// `now := monoclock.Now()` sample:
// state == raftengine.StateLeader &&
// !ack.IsZero() && !ack.After(now) && now.Sub(ack) < LeaseDuration()
// !now.IsZero() && !ack.IsZero() && !ack.After(now) &&
// now.Sub(ack) < LeaseDuration()
//
// The !now.IsZero() guard fails closed when the caller's
// clock_gettime read errored (e.g. seccomp denies it) and
// monoclock.Now() returned the zero Instant; without it, a
// persistent clock failure could keep a once-warmed lease valid
// forever. See kv.engineLeaseAckValid.
//
// The !ack.After(now) guard matters because LastQuorumAck() may be
// reconstructed from UnixNano (no monotonic component): a backwards
// wall-clock adjustment would otherwise make now.Sub(ack) negative
// and pass the duration check against a stale ack. The LeaseDuration
// is bounded by electionTimeout - safety_margin, which guarantees
// that any new leader candidate cannot yet accept writes during
// The monotonic-raw clock (CLOCK_MONOTONIC_RAW on Linux / Darwin;
// runtime-monotonic fallback on FreeBSD / Windows / others, see
// internal/monoclock) is immune to NTP rate adjustment and
// wall-clock step events on the raw-clock platforms, so the
// comparison stays safe even if the system's time daemon slews
// or steps the wall clock. The !ack.After(now) guard remains as
// a defensive fail-closed for a zero / bogus ack reading.
// LeaseDuration is bounded by electionTimeout - safety_margin,
// guaranteeing no successor leader has accepted writes within
// that window.
//
// Returns the zero time when no quorum has been confirmed yet or
// when the local node is not the leader. Single-node LEADERS may
// return a recent time.Now() since self is the quorum; non-leader
// single-node replicas still return the zero time.
LastQuorumAck() time.Time
// Returns the zero Instant when no quorum has been confirmed yet
// or when the local node is not the leader. Single-node LEADERS
// may return a recent monoclock.Now() since self is the quorum;
// non-leader single-node replicas still return the zero Instant.
LastQuorumAck() monoclock.Instant
Comment thread
coderabbitai[bot] marked this conversation as resolved.
// RegisterLeaderLossCallback registers fn to be invoked whenever the
// local node leaves the leader role (graceful transfer, partition
// step-down, or shutdown). Callers use this to invalidate any