grep across the repo confirms zero external callers of bintrie.NodeStore,
NewNodeStore, NodeFlushFn, or NodeResolverFn. The arena is purely an
implementation detail of BinaryTrie; unexport the top-level names so
the package's external surface stays confined to BinaryTrie plus the
EIP-7864 helpers (ChunkifyCode, GetBinaryTreeKey*).
Methods on *nodeStore remain capitalized for now — with nodeStore
itself unexported, external code has no way to hold a *nodeStore
pointer, so the methods are effectively internal despite their case.
Method case is a cosmetic follow-up.
The invariant "mutating a value slot must mark the stem for re-hash
and re-flush" was enforced by every caller remembering to set both
flags after setValue. Moving the flip into setValue itself makes it
structurally impossible to forget, and drops the duplicate flag-sets
at each callsite.
decodeNode's on-disk load path still writes directly to sn.values
because loaded stems must retain whatever mustRecompute/dirty state
the caller asked for (typically both false).
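
A minimal sketch of the idea, assuming a simplified stem node; `loadValue` is a hypothetical name standing in for the decode path's direct write, and the real types carry more fields:

```go
package main

import "fmt"

const stemNodeWidth = 256

// stemNode is a simplified stand-in for the arena's stem node.
type stemNode struct {
	values        [stemNodeWidth][]byte
	mustRecompute bool // cached hash is stale
	dirty         bool // on-disk blob is stale
}

// setValue writes a slot and flips both flags, so callers cannot forget.
func (sn *stemNode) setValue(suffix byte, v []byte) {
	sn.values[suffix] = v
	sn.mustRecompute = true
	sn.dirty = true
}

// loadValue (hypothetical) mimics the decode path: it fills a slot
// without touching the flags, leaving them as the loader set them.
func (sn *stemNode) loadValue(suffix byte, v []byte) {
	sn.values[suffix] = v
}

func main() {
	var sn stemNode
	sn.loadValue(1, []byte{0xaa})
	fmt.Println(sn.mustRecompute, sn.dirty) // false false
	sn.setValue(2, []byte{0xbb})
	fmt.Println(sn.mustRecompute, sn.dirty) // true true
}
```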
Comparing against emptyRef (the single value makeRef(kindEmpty, 0))
only works because no other nodeRef with kindEmpty is ever constructed.
That invariant is easy to break if future code ever produces a
kindEmpty ref with nonzero index. Test the kind directly so any
kindEmpty ref reads as empty regardless of index.
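
The difference between the two checks can be sketched as follows. The bit layout (kind tag in the top 2 bits) and the order of the kind constants are assumptions for illustration, as is the `isEmptyByValue` helper:

```go
package main

import "fmt"

type nodeKind uint8

const (
	kindEmpty nodeKind = iota
	kindInternal
	kindStem
	kindHashed
)

const kindShift = 30 // assumed layout: kind tag in the top 2 bits

type nodeRef uint32

const emptyRef nodeRef = 0 // kindEmpty with index 0 under the assumed layout

func (r nodeRef) kind() nodeKind { return nodeKind(r >> kindShift) }

// isEmptyByValue is the fragile check: only the canonical emptyRef matches.
func (r nodeRef) isEmptyByValue() bool { return r == emptyRef }

// isEmpty tests the kind tag, so any kindEmpty ref reads as empty.
func (r nodeRef) isEmpty() bool { return r.kind() == kindEmpty }

func main() {
	odd := nodeRef(7) // kindEmpty tag with a nonzero index
	fmt.Println(odd.isEmptyByValue(), odd.isEmpty()) // false true
}
```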
Silently masking the index with indexMask would let an oversized idx
collide with emptyRef (makeRef(kindEmpty, 0)): e.g. makeRef(kindEmpty,
1<<30) returned emptyRef, which IsEmpty would accept as absent even
though the caller meant to reference a real node. Panic instead.
allocInternal/allocStem/allocHashed already panic on pool overflow,
so this is a belt-and-suspenders guard for any direct callers.
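
A hedged sketch of the guard, assuming the same 2-bit kind / 30-bit index layout; the `panics` helper and the panic message are illustrative only:

```go
package main

import "fmt"

type nodeKind uint8

const (
	kindEmpty nodeKind = iota
	kindInternal
)

const (
	kindShift = 30
	indexMask = 1<<kindShift - 1
)

type nodeRef uint32

// makeRef packs kind and index, refusing indices that do not fit in
// 30 bits instead of silently masking them.
func makeRef(kind nodeKind, idx uint32) nodeRef {
	if idx > indexMask {
		panic(fmt.Sprintf("makeRef: index %d exceeds %d-bit range", idx, kindShift))
	}
	return nodeRef(uint32(kind)<<kindShift | idx)
}

func (r nodeRef) index() uint32 { return uint32(r) & indexMask }

// panics reports whether f panicked (helper for the demo only).
func panics(f func()) (p bool) {
	defer func() { p = recover() != nil }()
	f()
	return false
}

func main() {
	fmt.Println(makeRef(kindInternal, 5).index())             // 5
	fmt.Println(panics(func() { makeRef(kindEmpty, 1<<30) })) // true
}
```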
runtime.NumCPU() + bits.Len were recomputed on every hashInternal
call. NumCPU is immutable after startup; hoist to a package var
computed once. Also fixes a minor style wart: the constant is now a
value, not a zero-arg function.
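
Roughly, the hoisted form looks like the sketch below. The variable name and the cap of 8 are taken from the formula quoted elsewhere in this log and may not match the code exactly:

```go
package main

import (
	"fmt"
	"math/bits"
	"runtime"
)

// parallelDepth is evaluated once at package initialization rather than
// on every hashInternal call. Name and cap are assumptions.
var parallelDepth = min(bits.Len(uint(runtime.NumCPU())), 8)

func main() {
	fmt.Println("parallel hashing depth:", parallelDepth)
}
```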
Only freeHashed is written to (via freeHashedNode); the internal and
stem lists are declared, consumed in alloc paths, and copied in
NodeStore.Copy, but no callsite ever appends to them. Under current
semantics (no delete, stem-split keeps the old stem deeper in the
tree) there is no path that would free an internal or stem slot, so
the recycle branch was dead code. Drop it to avoid misleading future
contributors; the infrastructure is easy to restore if a delete path
is ever added.
splitStemValuesInsert increments existing.depth before recursing into
insertValuesAtStem. If the recursion fails, the depth stays incremented
but the tree is not re-rooted through the new internal, so a retry
reads bitStem at the wrong offset and can place the stem on the wrong
side of a fresh split. Roll back existing.depth on error to keep the
stem consistent across retries.
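
The rollback pattern, sketched with hypothetical stand-ins (`stem`, `insertBelow`, `splitInsert`) for the real node type and recursive helpers:

```go
package main

import (
	"errors"
	"fmt"
)

// stem is a stand-in carrying only the field that matters here.
type stem struct{ depth int }

// insertBelow stands in for the recursive insert; fail simulates an
// error such as a resolver failure.
func insertBelow(s *stem, fail bool) error {
	if fail {
		return errors.New("resolve failed")
	}
	return nil
}

// splitInsert bumps the depth before recursing and restores it on error,
// so a retry sees the stem at its original depth.
func splitInsert(s *stem, fail bool) error {
	s.depth++
	if err := insertBelow(s, fail); err != nil {
		s.depth-- // roll back: the tree was not re-rooted
		return err
	}
	return nil
}

func main() {
	s := &stem{depth: 3}
	fmt.Println(splitInsert(s, true) != nil, s.depth) // true 3
	fmt.Println(splitInsert(s, false) != nil, s.depth) // false 4
}
```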
If s.computeHash(node.left) panics inside the goroutine and a caller
higher up recovers the panic, the parent would be stuck forever in
wg.Wait() — no log, no error. defer wg.Done() releases the waiter
unconditionally so Wait returns even on panic.
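
A small illustration of why the deferred form matters. `runAndWait` is a hypothetical helper, and the in-goroutine `recover` is an assumption about how the panic is contained:

```go
package main

import (
	"fmt"
	"sync"
)

// runAndWait runs work in a goroutine and waits for it. With a plain
// wg.Done() after work(), a recovered panic would skip the call and
// Wait would block forever; the deferred form always releases it.
func runAndWait(work func()) (recovered any) {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()                          // runs last, even on panic
		defer func() { recovered = recover() }() // runs first (LIFO)
		work()
	}()
	wg.Wait()
	return recovered
}

func main() {
	fmt.Println(runAndWait(func() {}))                // <nil>
	fmt.Println(runAndWait(func() { panic("boom") })) // boom
}
```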
Deep branch used a pooled hash.Hash: h.Write(lh[:]) passed a subslice
through the interface, forcing lh/rh to heap; h.Sum(nil) allocated
32 B per rehash; empty children allocated 32 B via make([]byte, HashSize).
Mirror the shallow branch: write left/right hashes into a stack
[64]byte via copy() and call sha256.Sum256 in one shot. No interface
writes, no pool round-trip, no Sum(nil), no empty-child make.
Benchmark delta (M4 Pro, go1.24.0, --count=5 --benchtime=5s):
before: 9133 ns/op 6526 B/op 95 allocs/op
after: 8783 ns/op 5623 B/op 67 allocs/op
vs upstream/master@53ff723cc: allocs/op -50.0% (was -29.1%),
bytes/op -85.1% (was -82.7%), time/op +18.7% (was +23.4%).
The pooled hash.Hash interface forced the local [StemNodeWidth]common.Hash
data array to escape to the heap: h.Sum(data[i][:0]) passes a subslice of
data into an interface method, so escape analysis conservatively moves the
whole array. pprof (post-rollback) showed this single allocation as 52%
of total bytes (5 GB over BenchmarkCollectNodesSparseWrite).
Switch to sha256.Sum256 (takes []byte, returns [32]byte by value) — no
slice into data ever leaves the frame, so data stays on stack. Also
drops per-Hash h.Sum(nil) allocs and the sync.Pool Get/Put round-trip
for stems.
Benchmark delta (M4 Pro, go1.24.0, --count=5 --benchtime=5s):
before: 9095 ns/op 15008 B/op 106 allocs/op
after: 9133 ns/op 6526 B/op 95 allocs/op
vs upstream/master@53ff723cc: bytes/op -82.7% (was -60%),
allocs/op -29.1% (was -20.9%).
Port the same-shape benchmark from master (PR #34754) to the arena. A
benchstat-able pair lets reviewers verify the arena's allocation win is
preserved after the Group B + C rollbacks, and guards against future
regressions on the sparse-write commit path.
Gballet's three empty 'suggestion' blocks (comments 3101685618,
3101734697, 3101736436) mark the unexported wrapper declarations on
getValuesAtStem and insertValuesAtStem plus one temporary-var line.
Apply:
- Inline the unexported getValuesAtStem body into GetValuesAtStem (start
the walk at s.root directly instead of via a two-arg helper). The
function is not self-recursive, so the wrapper was pure indirection.
- Tighten InsertValuesAtStem to two lines using the 's.root, err = ...'
idiom — the recursive helper stays (it IS self-recursive), only the
public entry point gets the cleanup.
Adds docstrings on both public entry points.
Gballet (comment 3101708418): the hasParent check in the kindHashed
branch never fires — NewBinaryTrie resolves the root eagerly at open
time, so any HashedNode we encounter during a getValuesAtStem walk is
necessarily a child of a previously-visited internal (parentIdx /
parentIsLeft set on the prior kindInternal iteration).
Drop the hasParent flag and its setter; replace the check with a short
comment stating the invariant.
Per gballet's comment 3101751325 on PR #34055: the *Single functions
are essentially the same thing as *ValuesAtStem with one slot set. The
original design dispatched through *ValuesAtStem for dedup; this commit
restores that shape on the arena side.
- GetValue now delegates to GetValuesAtStem and indexes the returned
256-slot array header (no allocation — the stem node returns its own
inline values array as a slice).
- InsertSingle now builds a stack-allocated [StemNodeWidth][]byte with
only the target slot set and delegates to InsertValuesAtStem.
- Delete the insertSingleInternal tree walker (~90 LOC) and the whole
splitStemInsert (~60 LOC) — the *ValuesAtStem / splitStemValuesInsert
pair already handles every case.
Addresses gballet comments 3101751325, 3101739001, 3101724199, 3101721238
(the last three subsumed by the consolidation — the duplicated helper
bodies no longer exist).
Net: ~150 LOC removed from store_ops.go. Allocation cost for InsertSingle
is bounded by the stack-allocated 256-slot array (one stack frame, no
heap allocation on the hot path).
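
A rough sketch of the delegation shape. A map keyed by stem stands in for the real tree walk, and the 31-byte stem plus 1-byte suffix key split is assumed; method names are lowercased stand-ins, not the actual API:

```go
package main

import "fmt"

const stemNodeWidth = 256

// store maps a 31-byte stem to its slots; the map replaces the tree walk.
type store struct {
	stems map[[31]byte]*[stemNodeWidth][]byte
}

func (s *store) insertValuesAtStem(stem [31]byte, values [][]byte) {
	slots, ok := s.stems[stem]
	if !ok {
		slots = new([stemNodeWidth][]byte)
		s.stems[stem] = slots
	}
	for i, v := range values {
		if v != nil {
			slots[i] = v
		}
	}
}

func (s *store) getValuesAtStem(stem [31]byte) [][]byte {
	if slots, ok := s.stems[stem]; ok {
		return slots[:] // slice over the node's own array
	}
	return nil
}

// insertSingle fills one slot of a local array and delegates.
func (s *store) insertSingle(key [32]byte, value []byte) {
	var values [stemNodeWidth][]byte
	values[key[31]] = value
	s.insertValuesAtStem([31]byte(key[:31]), values[:])
}

// getValue delegates and indexes the returned slots by suffix.
func (s *store) getValue(key [32]byte) []byte {
	values := s.getValuesAtStem([31]byte(key[:31]))
	if values == nil {
		return nil
	}
	return values[key[31]]
}

func main() {
	s := &store{stems: map[[31]byte]*[stemNodeWidth][]byte{}}
	key := [32]byte{0: 1, 31: 7}
	s.insertSingle(key, []byte{0x42})
	fmt.Println(s.getValue(key)) // [66]
}
```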
Gballet asked (comment 3101679920) to fold the unexported getSingle helper
into its single caller, and (comment 3101677731) to rename GetSingle ('bad
name: get single what?') with a top-level docstring.
- Inline getSingle into GetValue (one function instead of two).
- Rename GetSingle → GetValue and add a docstring.
- Drop the hasParent tracker that was only used for the 'hashed at root'
guard — that case is now handled by kindEmpty / the top-level
NewBinaryTrie-time root resolution, so remove the check rather than
keep dead state.
CE2 will later fold this into GetValuesAtStem; this commit closes the
naming + inline asks independently.
Gballet asked on PR #34055 (comments 3100043116, 3100050542, and the
bit-check dedup at 3100114416 / 3100878310) to revert StemNode from the
packed-bytes representation to the straightforward array-of-slices.
Before: StemNode carried a bitmap, a concatenated valueData []byte, a
count, and a shared COW flag. Every read/write went through a bit-count
posInData lookup; every mutation through ensureWritable COW.
After: values [StemNodeWidth][]byte — 256 slots, nil == absent. No
bitmap lookup, no COW. Direct sn.values[suffix] access.
Supporting changes:
- Drop posInData, ensureWritable; rewrite getValue/hasValue/allValues/
setValue as trivial slice access.
- Hash() iterates sn.values directly, matching master's shape.
- SerializeNode emits the bitmap + concatenated bytes on the wire from
the array-of-slices at serialize time; wire format unchanged.
- decodeNode populates sn.values[i] slots by aliasing the serialized
buffer (zero-copy).
- NodeStore.Copy deep-copies each slot.
- splitStemValuesInsert + the insertSingleInternal paths write directly
to sn.values[i].
Trade-off: stems now carry 256 []byte headers (6144 B) instead of 1
concatenated slice (~32 B) + bitmap. Stem-pool scan cost returns to
parity with master (the existing valueData pointer already made the
pool non-noscan; rollback adds 255 more pointers per stem). The primary
arena win — pointer-free InternalNode pool — is preserved.
Gballet asked (comment 3099953085) to leave the sha256Sum256 / constant
parallelHashDepth optimisation out of this PR: it's an orthogonal
microbenchmark concern that should be revisited post-group-depth under
Go 1.26.
- Delete the sha256Sum256 helper from hasher.go.
- Delete the const parallelHashDepth = 4 from hasher.go.
- Restore master's dynamic parallelDepth() helper in store_commit.go
(copy verbatim — min(bits.Len(NumCPU), 8)).
- In hashInternal's shallow-parallel branch, call sha256.Sum256 directly
(std-lib, stack-allocated [32]byte; common.Hash is defined as a
[32]byte array, so the result is assignable without a conversion).
- In hashInternal's deep-sequential branch, use the pooled newSha256 /
returnSha256 hasher (matches master's internal_node.go:170-185).
Intentional trade-off: the deep branch now re-introduces per-hash
sync.Pool Get/Put plus a 32-byte h.Sum(nil) allocation. Zero regression
vs master; foregoes the arena's proposed stack-based hashing until
Go 1.26 + post-group-depth benchmarks.
Gballet asked on PR #34055 to unexport nodeRef, nodeKind, and makeRef
(comments 3099846639, 3099847640, 3100717855) — none are used outside
trie/bintrie. Cascade to the internal-only support symbols and methods:
NodeKind → nodeKind
KindEmpty/... → kindEmpty/...
NodeRef → nodeRef
EmptyRef → emptyRef
MakeRef → makeRef
NodeStore.Root → deleted; inlined to s.root field access (same pkg)
NodeStore.SetRoot → deleted; inlined to s.root = ref
NodeStore.ComputeHash/SerializeNode/DeserializeNode(WithHash)/
CollectNodes/ToDot/GetHeight → lowercased
All 9 methods took or returned nodeRef, so leaving them exported would
have tripped revive:unexported-return after the type rename. Zero
external callers means no API break. The existing private
deserializeNode helper was renamed to decodeNode to free the name for
the formerly public DeserializeNode, which is now deserializeNode.
Pure rename; no behaviour change.
Master added (via PR #34754) a dirty bool to InternalNode/StemNode plus a
CollectNodes short-circuit that skips clean subtrees — the arena branch
diverged before that landed. Port the semantics onto the arena shape:
- Add dirty bool to InternalNode and StemNode.
- Wire dirty=true alongside every existing mustRecompute=true setter in
node_store.go (newInternalRef, newStemRef) and store_ops.go (8 mutation
sites across InsertSingle/insertSingleInternal/InsertValuesAtStem/
insertValuesAtStem/splitStemInsert/splitStemValuesInsert).
- Add 'if !node.dirty { return nil }' gate at the top of CollectNodes for
both KindInternal and KindStem; clear dirty after flushfn runs.
- Plumb a dirty parameter through deserializeNode; DeserializeNode passes
dirty=true (safe default), DeserializeNodeWithHash passes dirty=false
(loaded from disk, blob matches).
The arena test in trie_test.go that was auto-merged from master used
master-shape struct literals (tr.root, NewBinaryNode) that don't exist on
arena; delete those and replace with TestCommitSkipCleanSubtrees, an
arena-native version that asserts first-Commit flushes all nodes, no-op
Commit flushes none, and single-leaf Commit flushes only the root-to-leaf
path.
A prior commit aggressively trimmed comments. This restores the
ones that carried real information — ownership contracts on
returned slices, the index-tracking semantics inside Next(),
the Parent() grandparent note, and the "at a leaf" stem rule —
so future readers aren't left guessing. Also elaborates the
parallel-hashing rationale in hashInternal.
Replace the single-field struct with a defined type whose underlying
type is common.Hash. Both have identical layout (32 bytes, no pointers)
and noscan span placement, but the new form matches master's style and
reads more naturally. A zero-arg Hash() method keeps call sites terse.
Replace the BinaryNode interface (which uses Go interface pointers that
the GC must scan) with NodeRef uint32 indices into typed arena pools.
NodeRef packs a 2-bit kind tag and 30-bit pool index into a single
uint32, making it invisible to the garbage collector.
NodeStore manages chunked typed pools per node kind:
- InternalNode pool: ZERO Go pointers (children are NodeRef, hash is
[32]byte) → allocated in noscan spans, GC skips entirely
- HashedNode pool: ZERO Go pointers → noscan spans
- StemNode pool: ONE pointer per node (valueData []byte) → minimal GC
For a trie with 25K InternalNodes, this reduces GC-scanned pointer-words
from ~125K to ~10K (85% reduction). CPU profiling showed 44% of time
in GC; this refactor directly addresses that bottleneck.
Serialization format is unchanged — the on-disk representation is
fully compatible. All existing tests pass.
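
A simplified sketch of a chunked, pointer-free pool addressed by uint32 refs. The chunk size and field set are assumptions, and the 2-bit kind tag is left out for brevity:

```go
package main

import "fmt"

type nodeRef uint32

// internalNode holds no Go pointers: children are refs, hash is an array.
type internalNode struct {
	left, right nodeRef
	hash        [32]byte
}

const chunkSize = 1024 // assumed chunk size

// internalPool grows by whole chunks, so existing nodes never move.
type internalPool struct {
	chunks [][]internalNode
	n      uint32
}

// alloc hands out the next free slot and returns its ref.
func (p *internalPool) alloc() nodeRef {
	if int(p.n)/chunkSize == len(p.chunks) {
		p.chunks = append(p.chunks, make([]internalNode, chunkSize))
	}
	idx := p.n
	p.n++
	return nodeRef(idx)
}

// get resolves a ref to the node stored in its chunk.
func (p *internalPool) get(r nodeRef) *internalNode {
	return &p.chunks[r/chunkSize][r%chunkSize]
}

func main() {
	var p internalPool
	root, left := p.alloc(), p.alloc()
	p.get(root).left = left
	fmt.Println(p.get(root).left == left, len(p.chunks)) // true 1
}
```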
## Problem
`BinaryTrie.Commit` unconditionally walked every resolved in-memory node
and flushed it into the `NodeSet`, producing one Pebble write per
resolved internal + stem node on every block — even when the node's
on-disk blob was bitwise identical to the previous commit. On a warm
400M-state workload this meant tens of thousands of redundant 65-byte
writes per block, compounding Pebble compaction pressure on every
commit.
The existing `mustRecompute` flag tracks *hash* staleness, not
*disk-blob* staleness: after `Hash()` completes, `mustRecompute` is
cleared even though the fresh blob has not been persisted. It is
therefore insufficient for a skip-flush optimization.
## Fix
Mirror the MPT committer pattern (`trie/committer.go:51-56`) by adding a
`dirty` flag on `InternalNode` and `StemNode` with the semantics *the
on-disk blob is stale*. The flag is:
- set to `true` wherever the node is created or structurally modified
(the same call sites that already set `mustRecompute = true`);
- set to `false` only after the node has been passed to the `flushfn`
inside `CollectNodes`;
- left `false` on nodes produced by `DeserializeNodeWithHash`, matching
the *loaded from disk, already persisted* semantics.
`CollectNodes` short-circuits on `!dirty` subtrees. The propagation
invariant (an ancestor of any dirty node is itself dirty) is already
maintained by the existing `InsertValuesAtStem` / `Insert` paths, which
now mirror every `mustRecompute = true` setter with a `dirty = true`
setter.
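
The skip-clean-subtree walk can be sketched as below. Plain pointers are used instead of the real node representation to keep the example short; `collect` and `node` are illustrative names:

```go
package main

import "fmt"

// node is a minimal binary tree node with a "blob is stale" flag.
type node struct {
	left, right *node
	dirty       bool
}

// collect flushes dirty nodes bottom-up and skips clean subtrees.
// It relies on the invariant that every ancestor of a dirty node is dirty.
func collect(n *node, flush func(*node)) {
	if n == nil || !n.dirty {
		return
	}
	collect(n.left, flush)
	collect(n.right, flush)
	flush(n)
	n.dirty = false // cleared only after the flush callback ran
}

func main() {
	leaf := &node{dirty: true}
	root := &node{left: leaf, right: &node{dirty: true}, dirty: true}
	count := func() (c int) {
		collect(root, func(*node) { c++ })
		return c
	}
	fmt.Println(count(), count()) // 3 0
	root.dirty, leaf.dirty = true, true
	fmt.Println(count()) // 2
}
```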
## Benchmark
New `BenchmarkCollectNodes_SparseWrite` measures commit cost when only
one leaf changes between blocks — the common case for state updates.
10,000-stem trie, one-leaf modification + Commit per iteration, Apple M4
Pro:
| | before | after | delta |
|---|---|---|---|
| time / op | 12,653,000 ns | 7,336 ns | **~1,725×** |
| bytes / op | 107,224,740 B | 37,774 B | **~2,839×** |
| allocs / op | 80,953 | 134 | **~604×** |
End-to-end impact on a real workload depends on the
resolved-footprint-to-dirty-path ratio; the new
`TestBinaryTrieCommitIncremental` provides a structural regression guard
(asserts that a Commit following a single-leaf modification flushes a
root-to-leaf path, not the whole tree).
---
I found all of this while bloating my #34706 DB to run some
benchmarks, and saw we were spending A LOT OF TIME on hashing.
Hope this helps the perf a bit. Will rebase the flat-state PR on top of
this once merged.
This PR implements some prerequisite changes for #34004: split the
`CachingDB` into a `MerkleDB` and a `UBTDB`, so that very different
behaviors don't clash as much.
The transition isn't handled by this PR, but after talking to Gary we
agreed that `UBTDB` should receive another `triedb`, which will only be
loaded if the `Ended` flag is set to false in the conversion contract.
If this is too hard to achieve, it makes sense to load it regardless,
and then loading can be prevented at a later stage by adding a
`UBTTransitionFinalizationTime` in `ChainConfig`.
---------
Co-authored-by: Gary Rong <garyrong0905@gmail.com>
Fix `GetAccount` returning **wrong account data** for non-existent
addresses when the trie root is a `StemNode` (single-account trie) — the
`StemNode` branch returned `r.Values` without verifying the queried
address's stem matches.
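
The missing check amounts to comparing the queried key's stem with the node's stem before returning its values. A sketch with hypothetical names (`valuesFor`, a lowercase `stemNode`) and an assumed 31-byte stem:

```go
package main

import (
	"bytes"
	"fmt"
)

// stemNode is a simplified stand-in for a root that is a single stem.
type stemNode struct {
	stem   [31]byte
	values [256][]byte
}

// valuesFor returns the node's values only when the key's stem matches;
// otherwise the key is absent from this single-stem trie.
func (sn *stemNode) valuesFor(key [32]byte) [][]byte {
	if !bytes.Equal(sn.stem[:], key[:31]) {
		return nil
	}
	return sn.values[:]
}

func main() {
	sn := &stemNode{stem: [31]byte{0: 0xaa}}
	sn.values[0] = []byte{1}
	present := [32]byte{0: 0xaa}
	absent := [32]byte{0: 0xbb}
	fmt.Println(sn.valuesFor(present) != nil, sn.valuesFor(absent) != nil) // true false
}
```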
Co-authored-by: Guillaume Ballet <3272758+gballet@users.noreply.github.com>
`BinaryTrie.DeleteAccount` was a no-op, silently ignoring the caller's
deletion request and leaving the old `BasicData` and `CodeHash` in the
trie.
Co-authored-by: Guillaume Ballet <3272758+gballet@users.noreply.github.com>
Fix three issues in the binary trie NodeIterator:
1. Empty nodes now properly backtrack to parent and continue iteration
instead of terminating the entire walk early.
2. `HashedNode` resolver handles `nil` data (all-zeros hash) gracefully
by treating it as Empty rather than panicking.
3. Parent update after node resolution guards against stack underflow
when resolving the root node itself.
---------
Co-authored-by: tellabg <249254436+tellabg@users.noreply.github.com>
## Summary
At tree depths below `log2(NumCPU)` (clamped to [2, 8]), hash the left
subtree in a goroutine while hashing the right subtree inline. This
exploits available CPU cores for the top levels of the tree where
subtree hashing is most expensive. On single-core machines, the parallel
path is disabled entirely.
Deeper nodes use sequential hashing with the existing `sync.Pool` hasher
where goroutine overhead would exceed the hash computation cost. The
parallel path uses `sha256.Sum256` with a stack-allocated buffer to
avoid pool contention across goroutines.
**Safety:**
- Left/right subtrees are disjoint — no shared mutable state
- `sync.WaitGroup` provides happens-before guarantee for the result
- `defer wg.Done()` + `recover()` prevents goroutine panics from
crashing the process
- `!bt.mustRecompute` early return means clean nodes never enter the
parallel path
- Hash results are deterministic regardless of computation order — no
consensus risk
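
A compact sketch of the depth-gated scheme, using a pointer-based tree with illustrative names (`node`, `hashNode`, `build`). The threshold is passed as a parameter here so the sequential and parallel paths can be compared directly:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sync"
)

type node struct {
	left, right *node
	leaf        [32]byte // used when both children are nil
}

// hashNode hashes the left subtree in a goroutine while hashing the
// right one inline, but only above the parallelDepth threshold.
func hashNode(n *node, depth, parallelDepth int) [32]byte {
	if n == nil {
		return [32]byte{}
	}
	if n.left == nil && n.right == nil {
		return n.leaf
	}
	var lh, rh [32]byte
	if depth < parallelDepth {
		var wg sync.WaitGroup
		wg.Add(1)
		go func() {
			defer wg.Done()
			lh = hashNode(n.left, depth+1, parallelDepth)
		}()
		rh = hashNode(n.right, depth+1, parallelDepth)
		wg.Wait() // happens-before for the write to lh
	} else {
		lh = hashNode(n.left, depth+1, parallelDepth)
		rh = hashNode(n.right, depth+1, parallelDepth)
	}
	var buf [64]byte
	copy(buf[:32], lh[:])
	copy(buf[32:], rh[:])
	return sha256.Sum256(buf[:])
}

// build creates a complete tree with distinct leaves (demo helper).
func build(depth int, next *byte) *node {
	if depth == 0 {
		n := &node{}
		n.leaf[0] = *next
		*next++
		return n
	}
	return &node{left: build(depth-1, next), right: build(depth-1, next)}
}

func main() {
	var next byte
	root := build(4, &next)
	fmt.Println(hashNode(root, 0, 0) == hashNode(root, 0, 3)) // true
}
```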
## Benchmark (AMD EPYC 48-core, 500K entries, `--benchtime=10s --count=3`, post-H01 baseline)
| Metric | Baseline | Parallel | Delta |
|--------|----------|----------|-------|
| Approve (Mgas/s) | 224.5 ± 7.1 | **259.6 ± 2.4** | **+15.6%** |
| BalanceOf (Mgas/s) | 982.9 ± 5.1 | 954.3 ± 10.8 | -2.9% (noise, clean nodes skip parallel path) |
| Allocs/op (approve) | ~810K | ~700K | -13.6% |
Binary tree hashing is quite slow, owing to many factors. One of them is
the GC pressure that is the consequence of allocating many hashers, as a
binary tree has 4x the size of an MPT. This PR introduces an
optimization that already exists for the MPT: keep a pool of hashers, in
order to reduce the number of allocations.