This PR addresses one of the biggest performance issues with binary
tries: storing each internal node individually bloats the index and the
disk, and causes a lot of write amplification. To fix this, this PR
serializes groups of nodes together.
Because we are still looking for the ideal group size, the "depth" of
the group tree is made a parameter, but that will be removed in the
future, once the perfect size is known.
This is a rebase of #33658
---------
Co-authored-by: Copilot <copilot@github.com>
The nodes were named using the byte representation of the path, instead
of the binary representation. This was confusing to other client devs
trying to achieve interop.
## Summary
Replace the `BinaryNode` interface with `NodeRef uint32` indices into
typed arena pools, eliminating GC-scanned pointers from binary trie
nodes.
Inspired by [fjl's
observation](https://github.com/ethereum/go-ethereum/pull/34034#issuecomment-4075176446):
> *"if the binary trie produces such a large graph, it should probably
be changed so that the trie node type does not contain pointers. The
runtime does not scan objects that do not contain pointers, so it can
really help with the performance to build it this way."*
### The problem
CPU profiling of the binary trie (EIP-7864) showed **44% of CPU time in
garbage collection**. Each `InternalNode` held two `BinaryNode`
interface values (2 pointer-words each), and the GC scanned every one.
With ~25K `InternalNode`s in memory during block processing, this
created enormous GC pressure.
### The solution
`NodeRef` is a compact `uint32` (2-bit kind tag + 30-bit pool index).
`NodeStore` manages chunked typed pools per node kind:
- **InternalNode pool**: ZERO Go pointers (children are `NodeRef`, hash
is `[32]byte`) → noscan spans
- **HashedNode pool**: ZERO Go pointers → noscan spans
- **StemNode pool**: retains `Values [][]byte` (matching existing
format)
The serialization format is unchanged — flat InternalNode
`[type][leftHash][rightHash]` = 65 bytes.
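For illustration, a minimal sketch of the packing described above; the kind values and helper names are assumptions, not the PR's exact code:

```go
package bintrie

// NodeRef packs a node reference into a single pointer-free word:
// the low 2 bits carry the node kind, the high 30 bits index into the
// per-kind pool. The concrete kind values below are illustrative.
type NodeRef uint32

const (
	kindBits = 2
	kindMask = (1 << kindBits) - 1

	kindEmpty    = 0
	kindInternal = 1
	kindStem     = 2
	kindHashed   = 3
)

// makeRef builds a reference from a kind tag and a pool index.
func makeRef(kind, index uint32) NodeRef {
	return NodeRef(index<<kindBits | kind&kindMask)
}

// kind and index unpack the reference; neither touches the heap, so the
// structs holding NodeRefs stay pointer-free and land in noscan spans.
func (r NodeRef) kind() uint32  { return uint32(r) & kindMask }
func (r NodeRef) index() uint32 { return uint32(r) >> kindBits }
```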
## Benchmark: Apple M4 Pro (`--benchtime=10s --count=3`, on top of
#34021)
| Metric | Baseline | Arena | Delta |
|--------|----------|-------|-------|
| Approve (Mgas/s) | 374 | 382 | **+2.1%** |
| BalanceOf (Mgas/s) | 885 | 901 | **+1.8%** |
| Approve allocs/op | 775K | **607K** | **-21.7%** |
| BalanceOf allocs/op | 265K | **228K** | **-14.0%** |
## Benchmark: AMD EPYC 48-core (50GB state, execution-specs ERC-20, on
top of #34021 + #34032)
| Benchmark | Baseline | Arena | Delta |
|-----------|----------|-------|-------|
| erc20_approve (write) | 22.4 Mgas/s | **27.0 Mgas/s** | **+20.5%** |
| mixed_sload_sstore | 62.9 Mgas/s | **97.3 Mgas/s** | **+54.7%** |
| erc20_balanceof (read) | 180.8 Mgas/s | 167.6 Mgas/s | -7.3% (cold cache variance) |
The arena benefit scales with heap size — the EPYC (larger heap, more GC
pressure) shows much larger gains than the M4 Pro (efficient unified
memory). The mixed workload baseline was unstable (62.9 vs 16.3 Mgas/s
between runs due to GC-induced throughput collapse); the arena
eliminates this entirely (95-97 Mgas/s, stable).
## Dependencies
Benchmarked with #34021 (H01 N+1 fix) + #34032 (R14 parallel hashing).
No code dependency — applies independently to master.
All test suites pass (`trie/bintrie` with `-race`, `core/state`,
`triedb/pathdb`, `cmd/geth`).
---------
Co-authored-by: Guillaume Ballet <3272758+gballet@users.noreply.github.com>
## Problem
`BinaryTrie.Commit` unconditionally walked every resolved in-memory node
and flushed it into the `NodeSet`, producing one Pebble write per
resolved internal + stem node on every block — even when the node's
on-disk blob was bitwise identical to the previous commit. On a warm
400M-state workload this meant tens of thousands of redundant 65-byte
writes per block, compounding Pebble compaction pressure on every
commit.
The existing `mustRecompute` flag tracks *hash* staleness, not
*disk-blob* staleness: after `Hash()` completes, `mustRecompute` is
cleared even though the fresh blob has not been persisted. It is
therefore insufficient for a skip-flush optimization.
## Fix
Mirror the MPT committer pattern (`trie/committer.go:51-56`) by adding a
`dirty` flag on `InternalNode` and `StemNode` with the semantics *the
on-disk blob is stale*. The flag is:
- set to `true` wherever the node is created or structurally modified
(the same call sites that already set `mustRecompute = true`);
- set to `false` only after the node has been passed to the `flushfn`
inside `CollectNodes`;
- left `false` on nodes produced by `DeserializeNodeWithHash`, matching
the *loaded from disk, already persisted* semantics.
`CollectNodes` short-circuits on `!dirty` subtrees. The propagation
invariant (an ancestor of any dirty node is itself dirty) is already
maintained by the existing `InsertValuesAtStem` / `Insert` paths, which
now mirror every `mustRecompute = true` setter with a `dirty = true`
setter.
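For illustration, a hedged sketch of the skip-flush walk; the method name, node fields, and the `serialize` helper are assumptions based on the description above:

```go
// collectNodes flushes only subtrees whose on-disk blob is stale.
func (n *InternalNode) collectNodes(path []byte, flushfn func(path, blob []byte)) {
	if !n.dirty {
		return // the persisted blob for this whole subtree is still current
	}
	if n.left != nil {
		n.left.collectNodes(append(append([]byte(nil), path...), 0), flushfn)
	}
	if n.right != nil {
		n.right.collectNodes(append(append([]byte(nil), path...), 1), flushfn)
	}
	flushfn(path, n.serialize()) // hypothetical serializer producing the node blob
	n.dirty = false              // blob handed to the committer, no longer stale
}
```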
## Benchmark
New `BenchmarkCollectNodes_SparseWrite` measures commit cost when only
one leaf changes between blocks — the common case for state updates.
10,000-stem trie, one-leaf modification + Commit per iteration, Apple M4
Pro:
| | before | after | delta |
|---|---|---|---|
| time / op | 12,653,000 ns | 7,336 ns | **~1,725×** |
| bytes / op | 107,224,740 B | 37,774 B | **~2,839×** |
| allocs / op | 80,953 | 134 | **~604×** |
End-to-end impact on a real workload depends on the
resolved-footprint-to-dirty-path ratio; the new
`TestBinaryTrieCommitIncremental` provides a structural regression guard
(asserts that a Commit following a single-leaf modification flushes a
root-to-leaf path, not the whole tree).
---
Found all of this stuff while bloating my #34706 DB to run some
benchmarks, and saw we were spending A LOT OF TIME on hashing.
Hope this helps the perf a bit. I will rebase the flat-state PR on top of
this once merged.
This PR implements some prerequisite changes for #34004: split the
`CachingDB` into a `MerkleDB` and a `UBTDB`, so that their very different
behaviors don't clash as much.
The transition isn't handled by this PR, but after talking to Gary we
agreed that `UBTDB` should receive another `triedb`, which will only be
loaded if the `Ended` flag is set to false in the conversion contract.
If this is too hard to achieve, it makes sense to load it regardless,
and then loading can be prevented at a later stage by adding a
`UBTTransitionFinalizationTime` in `ChainConfig`.
---------
Co-authored-by: Gary Rong <garyrong0905@gmail.com>
Fix `GetAccount` returning **wrong account data** for non-existent
addresses when the trie root is a `StemNode` (single-account trie) — the
`StemNode` branch returned `r.Values` without verifying the queried
address's stem matches.
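For illustration, a hedged sketch of the added check (the type assertion, the 31-byte stem length, and the `decodeAccount` helper are assumptions):

```go
if stem, ok := root.(*StemNode); ok {
	if !bytes.Equal(stem.Stem, key[:31]) {
		return nil, nil // different stem: the queried account does not exist
	}
	return decodeAccount(stem.Values), nil // hypothetical decoder for the stem's values
}
```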
Co-authored-by: Guillaume Ballet <3272758+gballet@users.noreply.github.com>
`BinaryTrie.DeleteAccount` was a no-op, silently ignoring the caller's
deletion request and leaving the old `BasicData` and `CodeHash` in the
trie.
Co-authored-by: Guillaume Ballet <3272758+gballet@users.noreply.github.com>
Fix three issues in the binary trie NodeIterator:
1. Empty nodes now properly backtrack to parent and continue iteration
instead of terminating the entire walk early.
2. `HashedNode` resolver handles `nil` data (all-zeros hash) gracefully
by treating it as Empty rather than panicking.
3. Parent update after node resolution guards against stack underflow
when resolving the root node itself.
---------
Co-authored-by: tellabg <249254436+tellabg@users.noreply.github.com>
## Summary
At tree depths below `log2(NumCPU)` (clamped to [2, 8]), hash the left
subtree in a goroutine while hashing the right subtree inline. This
exploits available CPU cores for the top levels of the tree where
subtree hashing is most expensive. On single-core machines, the parallel
path is disabled entirely.
Deeper nodes use sequential hashing with the existing `sync.Pool` hasher
where goroutine overhead would exceed the hash computation cost. The
parallel path uses `sha256.Sum256` with a stack-allocated buffer to
avoid pool contention across goroutines.
**Safety:**
- Left/right subtrees are disjoint — no shared mutable state
- `sync.WaitGroup` provides happens-before guarantee for the result
- `defer wg.Done()` + `recover()` prevents goroutine panics from
crashing the process
- `!bt.mustRecompute` early return means clean nodes never enter the
parallel path
- Hash results are deterministic regardless of computation order — no
consensus risk
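For illustration, a hedged sketch of the split (panic recovery omitted; the node fields and exact hashing rule are assumptions):

```go
package bintrie

import (
	"crypto/sha256"
	"math/bits"
	"runtime"
	"sync"
)

// Illustrative node type; the real InternalNode carries more state.
type InternalNode struct {
	left, right *InternalNode
}

// parallelDepthLimit approximates log2(NumCPU) clamped to [2, 8].
var parallelDepthLimit = min(8, max(2, bits.Len(uint(runtime.NumCPU()))-1))

func (n *InternalNode) hashSubtree(depth int) [32]byte {
	if n == nil {
		return [32]byte{}
	}
	if runtime.NumCPU() == 1 || depth >= parallelDepthLimit {
		// Deep (or single-core): sequential hashing, goroutine overhead
		// would exceed the hash computation cost here.
		left := n.left.hashSubtree(depth + 1)
		right := n.right.hashSubtree(depth + 1)
		return sha256.Sum256(append(left[:], right[:]...))
	}
	var (
		wg   sync.WaitGroup
		left [32]byte
	)
	wg.Add(1)
	go func() {
		defer wg.Done()
		left = n.left.hashSubtree(depth + 1) // disjoint subtree, no shared state
	}()
	right := n.right.hashSubtree(depth + 1) // hash the right subtree inline
	wg.Wait()                               // happens-before edge for `left`
	return sha256.Sum256(append(left[:], right[:]...))
}
```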
## Benchmark (AMD EPYC 48-core, 500K entries, `--benchtime=10s
--count=3`, post-H01 baseline)
| Metric | Baseline | Parallel | Delta |
|--------|----------|----------|-------|
| Approve (Mgas/s) | 224.5 ± 7.1 | **259.6 ± 2.4** | **+15.6%** |
| BalanceOf (Mgas/s) | 982.9 ± 5.1 | 954.3 ± 10.8 | -2.9% (noise, clean nodes skip parallel path) |
| Allocs/op (approve) | ~810K | ~700K | -13.6% |
Binary tree hashing is quite slow, owing to many factors. One of them is
the GC pressure caused by allocating many hashers, since a binary tree is
4x the size of an MPT. This PR introduces an optimization that already
exists for the MPT: keep a pool of hashers, in order to reduce the number
of allocations.
This is an optimization that existed for verkle and the MPT, but that
got dropped during the rebase.
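For illustration, the pattern looks roughly like this (a sketch; the real pooled hasher carries more per-node state):

```go
package bintrie

import (
	"crypto/sha256"
	"hash"
	"sync"
)

// hasherPool recycles hashers across nodes instead of allocating one per node.
var hasherPool = sync.Pool{
	New: func() any { return sha256.New() },
}

// hashData hashes a serialized node with a pooled hasher.
func hashData(data []byte) [32]byte {
	h := hasherPool.Get().(hash.Hash)
	defer func() {
		h.Reset()
		hasherPool.Put(h)
	}()
	h.Write(data)
	var out [32]byte
	h.Sum(out[:0])
	return out
}
```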
Mark the nodes that were modified as needing recomputation, and skip the
hash computation if this is not needed. Otherwise, the whole tree is
hashed, which kills performance.
The computation of `MAIN_STORAGE_OFFSET` was incorrect, causing the last
byte of the stem to be dropped. This means that there would be a
collision in the hash computation (at the preimage level, not a hash
collision of course) if two keys differed only at byte 31.
Pebble maintains a batch pool to recycle batch objects. Unfortunately,
a batch must be explicitly returned to that pool via the `batch.Close`
function. This PR extends the batch interface by adding a `Close`
function and also invokes `batch.Close` in some critical code paths.
Memory allocation must be measured before merging this change. What's
more, it's an open question whether we should call `batch.Close` as much
as possible in every invocation.
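A hedged usage sketch of the new method (`db`, `key` and `value` are placeholders; only `Close` is the addition described here):

```go
func writeOne(db ethdb.KeyValueStore, key, value []byte) error {
	batch := db.NewBatch()
	defer batch.Close() // return the underlying pebble.Batch to its pool

	if err := batch.Put(key, value); err != nil {
		return err
	}
	return batch.Write()
}
```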
I removed `Iterator.Count` in #33840, because it appeared to be unused
and did not provide the documented invariant: the returned count should
always be an upper bound on the number of iterations allowed by `Next`.
In order to make `Count` work, the semantics of `CountValues` have to
change to return the number of items up to and including the invalid
one. I have reviewed all call sites of `CountValues` to assess whether
changing this is safe. There aren't that many, and the only call that
doesn't check the error and return is in the trie node parser,
`trie.decodeNodeUnsafe`. There, we distinguish the node type based on
the number of items, and it previously returned an error for item count
zero. In order to avoid any potential issue that could result from this
change, I'm adding an error check in that function, though it isn't
necessary.
GetStorage and DeleteStorage used GetBinaryTreeKey to compute the tree
key, while UpdateStorage used GetBinaryTreeKeyStorageSlot. The latter
applies storage slot remapping (header offset for slots <64, main
storage prefix for the rest), so reads and deletes were targeting
different tree locations than writes.
Replace GetBinaryTreeKey with GetBinaryTreeKeyStorageSlot in both
GetStorage and DeleteStorage to match UpdateStorage. Add a regression
test that verifies the write→read→delete→read round-trip for main
storage slots.
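For illustration, a hedged sketch of the change; the receiver and lookup helpers are assumptions, only the two key-derivation functions are named above:

```go
func (t *BinaryTrie) GetStorage(addr common.Address, key []byte) ([]byte, error) {
	// Before: GetBinaryTreeKey(addr, key), which applies no slot remapping,
	// so reads targeted a different tree location than UpdateStorage wrote to.
	k := GetBinaryTreeKeyStorageSlot(addr, key)
	return t.get(k) // hypothetical lookup helper
}

func (t *BinaryTrie) DeleteStorage(addr common.Address, key []byte) error {
	k := GetBinaryTreeKeyStorageSlot(addr, key) // now matches UpdateStorage
	return t.del(k) // hypothetical delete helper
}
```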
The `decodeRef` function used `size > hashLen` to reject oversized
embedded nodes, but this incorrectly allowed nodes of exactly 32 bytes
through. The encoding side (hasher.go, stacktrie.go) consistently uses
`len(enc) < 32` to decide whether to embed a node inline, meaning nodes
of 32+ bytes are always hash-referenced. The error message itself
already stated `want size < 32`, confirming the intended threshold.
Changed `size > hashLen` to `size >= hashLen` in `decodeRef` to align
the decoding validation with the encoding logic, the Yellow Paper spec,
and the surrounding comments.
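The changed bound in context, roughly (a sketch; the surrounding `decodeRef` code is abbreviated):

```go
// Embedded nodes must encode to fewer than 32 bytes; a reference of exactly
// hashLen bytes is always a hash, so it is rejected here as well.
if size >= hashLen { // previously: size > hashLen
	return nil, buf, fmt.Errorf("oversized embedded node (size is %d bytes, want size < %d)", size, hashLen)
}
```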
The upstream library has removed the assembly-based implementation of
keccak. We need to maintain our own library to avoid a performance
regression.
---------
Co-authored-by: lightclient <lightclient@protonmail.com>
The `Witness` method was not implemented for the binary tree, which
caused `debug_executionWitness` to panic. This PR fixes that.
Note that the `TransitionTrie` version isn't implemented, and that's on
purpose: more thought must be given to what should go in the global
witness.
Based on [EIP-7864](https://eips.ethereum.org/EIPS/eip-7864), the tree
index should be 32 bytes instead of 31 bytes.
```python
def get_tree_key(address: Address32, tree_index: int, sub_index: int):
    # Assumes STEM_SUBTREE_WIDTH = 256
    return tree_hash(address + tree_index.to_bytes(32, "little"))[:31] + bytes(
        [sub_index]
    )
```
This PR optimizes memory allocation in StateTrie.PrefetchAccount() and
StateTrie.PrefetchStorage() by preallocating slice capacity when the
final size is known.
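A hedged illustration of the pattern with placeholder names:

```go
// The number of keys is known up front, so the slice is allocated with its
// final capacity instead of being grown append by append.
keys := make([][]byte, 0, len(slots))
for _, slot := range slots {
	keys = append(keys, append([]byte(nil), slot...))
}
```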
This pull request introduces a mechanism to compress trienode history by
storing only the node diffs between consecutive versions.
- For full nodes, only the modified children are recorded in the history;
- For short nodes, only the modified value is stored;
If the node type has changed, or if the node is newly created or
deleted, the entire node value is stored instead.
To mitigate the overhead of reassembling nodes from diffs during history
reads, checkpoints are introduced by periodically storing full node values.
The current checkpoint interval is set to every 16 mutations, though
this parameter may be made configurable in the future.
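For illustration, a hedged sketch of the write-side decision; only the checkpoint-every-16-mutations rule comes from the description above, the function and parameter names are hypothetical:

```go
const checkpointInterval = 16

// storeFullValue reports whether the history entry for a mutation should carry
// the complete node value rather than a diff against the previous version.
func storeFullValue(version uint64, prevExists, curExists, typeChanged bool) bool {
	switch {
	case version%checkpointInterval == 0:
		return true // periodic checkpoint to bound diff replay cost on reads
	case !prevExists || !curExists:
		return true // newly created or deleted node: no meaningful diff
	case typeChanged:
		return true // full/short node type changed between versions
	default:
		return false // otherwise record only the modified children or value
	}
}
```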
In order to reduce the amount of code that is embedded into the keeper
binary, I am removing all the verkle code that uses go-verkle and
go-ipa. This will be followed by further PRs that are more like stubs to
replace code when the keeper build is detected.
I'm keeping the binary tree of course. This means that you will still
see `isVerkle` variables all over the codebase, but they will be renamed
when code is touched (i.e. this is not an invitation for 30+ AI slop
PRs).
---------
Co-authored-by: Gary Rong <garyrong0905@gmail.com>
This is broken off of #31730 to only focus on testing networks that
start with verkle at genesis.
The PR has seen a lot of work since its creation, and it now targets
creating and re-executing tests for a binary tree testnet without the
transition (so it starts at genesis). The transition tree has been moved
to its own package. It also replaces verkle with the binary tree for
this specific application.
---------
Co-authored-by: Gary Rong <garyrong0905@gmail.com>
This change addresses critical issues in the state object duplication
process specific to Verkle trie implementations. Without these
modifications, updates to state objects fail to propagate correctly
through the trie structure after a statedb copy operation, leading to
inaccuracies in the computation of the state root hash.
---------
Co-authored-by: Guillaume Ballet <3272758+gballet@users.noreply.github.com>
- Correct the error message in TestOneElementProof to expect 'v' instead
of 'k'.
- The trie is updated with key "k" and value "v"; on mismatch the
expected value must be 'v'.
- Aligns the message with the actual test logic and other similar checks
in this file, reducing confusion during test failures. No behavioral
changes.
- Adds `NodeIteratorWithPrefix()` method to support iterating only nodes
within a specific key prefix
- Adds `NodeIteratorWithRange()` method to support iterating only nodes
within a specific key range
Current `NodeIterator` always traverses the entire remaining trie from a
start position. For non-ethereum applications using the trie implementation,
there's no way to limit iteration to just a subtree with a specific prefix.
**Usage:**
```go
// Only iterate nodes with prefix "key1"
iter, err := trie.NodeIteratorWithPrefix([]byte("key1"))
```
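A hedged companion example for the range variant (the exact parameter form, start and end keys, is an assumption):

```go
// Only iterate nodes whose keys fall within the given range
iter, err := trie.NodeIteratorWithRange([]byte("key1"), []byte("key5"))
```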
Testing: Comprehensive test suite covering edge cases and boundary conditions.
Closes #32484
---------
Co-authored-by: gballet <guillaume.ballet@gmail.com>
Co-authored-by: Gary Rong <garyrong0905@gmail.com>
This PR is the first step in the trienode history series.
It introduces the `nodeWithOrigin` struct in the path database, which tracks
the original values of dirty nodes to support trienode history construction.
Note that the original value is always empty in this PR, so it won't break the
existing journal encoding and decoding. Journal compatibility will be
handled in a follow-up PR.
Implement the binary tree as specified in [eip-7864](https://eips.ethereum.org/EIPS/eip-7864).
This will gradually replace verkle trees in the codebase. For now it only
runs in tests and will not be executed in production, but it will help
me rebase some of my work, so that it doesn't bitrot as much.
---------
Signed-off-by: Guillaume Ballet
Co-authored-by: Parithosh Jayanthi <parithosh.jayanthi@ethereum.org>
Co-authored-by: rjl493456442 <garyrong0905@gmail.com>
This pull request introduces a `Prefetch` operation in the trie to prefetch trie
nodes in parallel. It is used by the `triePrefetcher` to accelerate state
loading and improve overall chain processing performance.
This adds some of the changes that were missing from #31634. It
introduces the `TransitionTrie`, which is a façade pattern between the
current MPT trie and the overlay tree.
---------
Signed-off-by: Guillaume Ballet <3272758+gballet@users.noreply.github.com>
Co-authored-by: rjl493456442 <garyrong0905@gmail.com>
The main changes made in this PR are highlighted below.
The trie tracer is split into two distinct structs: opTracer and prevalueTracer.
The former is specific to MPT, while the latter is generic and applicable to all
trie implementations.
The original values of dirty nodes are tracked in a NodeSet. This serves
as the foundation for both full archive node implementations and the state live
tracer.
This pull request optimizes trie hashing by reducing memory allocation
overhead. Specifically:
- define a fullNodeEncoder pool to reuse encoders and avoid memory
allocations.
- simplify the encoding logic for shortNode and fullNode by getting rid
of the Go interfaces.
The optimization tried to defer allocating the cache map until it was used for the
first time. It's a relic from earlier times, when tries were copied often. This seems
unnecessary now, so we can just create the map when the trie is created.
---------
Co-authored-by: Felix Lange <fjl@twurst.com>
As the preimage will only be stored if `t.preimages != nil`, there is no need
to save it into the local cache when preimages are not enabled. This reduces
the memory wasted on copying the bytes.
---------
Signed-off-by: jsvisa <delweng@gmail.com>