The invariant "mutating a value slot must mark the stem for re-hash
and re-flush" was enforced by every caller remembering to set both
flags after setValue. Moving the flip into setValue itself makes it
structurally impossible to forget, and drops the duplicate flag-sets
at each callsite.
decodeNode's on-disk load path still writes directly to sn.values
because loaded stems must retain whatever mustRecompute/dirty state
the caller asked for (typically both false).
The pooled hash.Hash interface forced the local [StemNodeWidth]common.Hash
data array to escape to the heap: h.Sum(data[i][:0]) passes a subslice of
data into an interface method, so escape analysis conservatively moves the
whole array. pprof (post-rollback) showed this single allocation as 52%
of total bytes (5 GB over BenchmarkCollectNodesSparseWrite).
Switch to sha256.Sum256 (takes []byte, returns [32]byte by value) — no
slice into data ever leaves the frame, so data stays on stack. Also
drops per-Hash h.Sum(nil) allocs and the sync.Pool Get/Put round-trip
for stems.
Benchmark delta (M4 Pro, go1.24.0, --count=5 --benchtime=5s):
before: 9095 ns/op 15008 B/op 106 allocs/op
after: 9133 ns/op 6526 B/op 95 allocs/op
vs upstream/master@53ff723cc: bytes/op -82.7% (was -60%),
allocs/op -29.1% (was -20.9%).
Gballet asked on PR #34055 (comments 3100043116, 3100050542, and the
bit-check dedup at 3100114416 / 3100878310) to revert StemNode from the
packed-bytes representation to the straightforward array-of-slices.
Before: StemNode carried a bitmap, a concatenated valueData []byte, a
count, and a shared COW flag. Every read/write went through a bit-count
posInData lookup; every mutation through ensureWritable COW.
After: values [StemNodeWidth][]byte — 256 slots, nil == absent. No
bitmap lookup, no COW. Direct sn.values[suffix] access.
Supporting changes:
- Drop posInData, ensureWritable; rewrite getValue/hasValue/allValues/
setValue as trivial slice access.
- Hash() iterates sn.values directly, matching master's shape.
- SerializeNode emits the bitmap + concatenated bytes on the wire from
the array-of-slices at serialize time; wire format unchanged.
- decodeNode populates sn.values[i] slots by aliasing the serialized
buffer (zero-copy).
- NodeStore.Copy deep-copies each slot.
- splitStemValuesInsert + the insertSingleInternal paths write directly
to sn.values[i].
Trade-off: stems now carry 256 []byte headers (6144 B) instead of 1
concatenated slice (~32 B) + bitmap. Stem-pool scan cost returns to
parity with master (the existing valueData pointer already made the
pool non-noscan; rollback adds 255 more pointers per stem). The primary
arena win — pointer-free InternalNode pool — is preserved.
Master added (via PR #34754) a dirty bool to InternalNode/StemNode plus a
CollectNodes short-circuit that skips clean subtrees — the arena branch
diverged before that landed. Port the semantics onto the arena shape:
- Add dirty bool to InternalNode and StemNode.
- Wire dirty=true alongside every existing mustRecompute=true setter in
node_store.go (newInternalRef, newStemRef) and store_ops.go (8 mutation
sites across InsertSingle/insertSingleInternal/InsertValuesAtStem/
insertValuesAtStem/splitStemInsert/splitStemValuesInsert).
- Add 'if !node.dirty { return nil }' gate at the top of CollectNodes for
both KindInternal and KindStem; clear dirty after flushfn runs.
- Plumb a dirty parameter through deserializeNode; DeserializeNode passes
dirty=true (safe default), DeserializeNodeWithHash passes dirty=false
(loaded from disk, blob matches).
The arena test in trie_test.go that was auto-merged from master used
master-shape struct literals (tr.root, NewBinaryNode) that don't exist on
arena; delete those and replace with TestCommitSkipCleanSubtrees, an
arena-native version that asserts first-Commit flushes all nodes, no-op
Commit flushes none, and single-leaf Commit flushes only the root-to-leaf
path.
Replace the BinaryNode interface (which uses Go interface pointers that
the GC must scan) with NodeRef uint32 indices into typed arena pools.
NodeRef packs a 2-bit kind tag and 30-bit pool index into a single
uint32, making it invisible to the garbage collector.
NodeStore manages chunked typed pools per node kind:
- InternalNode pool: ZERO Go pointers (children are NodeRef, hash is
[32]byte) → allocated in noscan spans, GC skips entirely
- HashedNode pool: ZERO Go pointers → noscan spans
- StemNode pool: ONE pointer per node (valueData []byte) → minimal GC
For a trie with 25K InternalNodes, this reduces GC-scanned pointer-words
from ~125K to ~10K (85% reduction). CPU profiling showed 44% of time
in GC; this refactor directly addresses that bottleneck.
Serialization format is unchanged — the on-disk representation is
fully compatible. All existing tests pass.
## Problem
`BinaryTrie.Commit` unconditionally walked every resolved in-memory node
and flushed it into the `NodeSet`, producing one Pebble write per
resolved internal + stem node on every block — even when the node's
on-disk blob was bitwise identical to the previous commit. On a warm
400M-state workload this meant tens of thousands of redundant 65-byte
writes per block, compounding Pebble compaction pressure on every
commit.
The existing `mustRecompute` flag tracks *hash* staleness, not
*disk-blob* staleness: after `Hash()` completes, `mustRecompute` is
cleared even though the fresh blob has not been persisted. It is
therefore insufficient for a skip-flush optimization.
## Fix
Mirror the MPT committer pattern (`trie/committer.go:51-56`) by adding a
`dirty` flag on `InternalNode` and `StemNode` with the semantics *the
on-disk blob is stale*. The flag is:
- set to `true` wherever the node is created or structurally modified
(the same call sites that already set `mustRecompute = true`);
- set to `false` only after the node has been passed to the `flushfn`
inside `CollectNodes`;
- left `false` on nodes produced by `DeserializeNodeWithHash`, matching
the *loaded from disk, already persisted* semantics.
`CollectNodes` short-circuits on `!dirty` subtrees. The propagation
invariant (an ancestor of any dirty node is itself dirty) is already
maintained by the existing `InsertValuesAtStem` / `Insert` paths, which
now mirror every `mustRecompute = true` setter with a `dirty = true`
setter.
## Benchmark
New `BenchmarkCollectNodes_SparseWrite` measures commit cost when only
one leaf changes between blocks — the common case for state updates.
10,000-stem trie, one-leaf modification + Commit per iteration, Apple M4
Pro:
| | before | after | delta |
|---|---|---|---|
| time / op | 12,653,000 ns | 7,336 ns | **~1,725×** |
| bytes / op | 107,224,740 B | 37,774 B | **~2,839×** |
| allocs / op | 80,953 | 134 | **~604×** |
End-to-end impact on a real workload depends on the
resolved-footprint-to-dirty-path ratio; the new
`TestBinaryTrieCommitIncremental` provides a structural regression guard
(asserts that a Commit following a single-leaf modification flushes a
root-to-leaf path, not the whole tree).
---
Found all of this stuff while bloating my #34706 DB to make some
benchmarks. And saw we were spending A LOT OF TIME on hashing.
Hope this helps the perf a bit. Will rebase the flat-state PR on top of
this once merged.
Fix `GetAccount` returning **wrong account data** for non-existent
addresses when the trie root is a `StemNode` (single-account trie) — the
`StemNode` branch returned `r.Values` without verifying the queried
address's stem matches.
Co-authored-by: Guillaume Ballet <3272758+gballet@users.noreply.github.com>
Binary tree hashing is quite slow, owing to many factors. One of them is
the GC pressure that is the consequence of allocating many hashers, as a
binary tree has 4x the size of an MPT. This PR introduces an
optimization that already exists for the MPT: keep a pool of hashers, in
order to reduce the amount of allocations.
This is an optimization that existed for verkle and the MPT, but that
got dropped during the rebase.
Mark the nodes that were modified as needing recomputation, and skip the
hash computation if this is not needed. Otherwise, the whole tree is
hashed, which kills performance.
This is broken off of #31730 to only focus on testing networks that
start with verkle at genesis.
The PR has seen a lot of work since its creation, and it now targets
creating and re-executing tests for a binary tree testnet without the
transition (so it starts at genesis). The transition tree has been moved
to its own package. It also replaces verkle with the binary tree for
this specific application.
---------
Co-authored-by: Gary Rong <garyrong0905@gmail.com>
Implement the binary tree as specified in [eip-7864](https://eips.ethereum.org/EIPS/eip-7864).
This will gradually replace verkle trees in the codebase. This is only
running the tests and will not be executed in production, but will help
me rebase some of my work, so that it doesn't bitrot as much.
---------
Signed-off-by: Guillaume Ballet
Co-authored-by: Parithosh Jayanthi <parithosh.jayanthi@ethereum.org>
Co-authored-by: rjl493456442 <garyrong0905@gmail.com>