2 commits

Author SHA1 Message Date
weiihann
bdb7b64173 trie/bintrie: mark promoted stem dirty during splitStemValuesInsert
When splitStemValuesInsert inserts a new stem that shares a prefix with
an existing stem, it increments the existing stem's depth and inserts a
new internal node above it. The existing stem's on-disk path is derived
from its depth via collectChildGroups + extendPathToGroupLeaf, so
promoting its depth means it should be flushed at a new path.

Previously, only the new stem (created in the divergence branch) was
marked dirty. The promoted existing stem retained whatever dirty value
it had — false if it was just deserialized from disk via a HashedNode
resolve. collectNodes would then skip flushing the existing stem at its
new path, while the new ancestor internal blob (also dirty) overwrites
the existing stem's old blob at the prior path. The stem's data is
left with no on-disk home, breaking subsequent reads with
"missing trie node".

The bug surfaces in the integration-test harness (state-actor builds a
DB with single-stem-per-slot at depth 8, geth then mutates by adding a
new stem that shares ≥8 prefix bits with the existing stem). After
mutation, geth's `getValuesAtStem` resolves a HashedNode whose blob
should be at the extended-depth path but isn't on disk.

Set `existing.dirty = true` when promoting the depth so that collectNodes
re-flushes the stem at its new path.
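
The failure mode can be reproduced in a toy model. The sketch below is not the geth code: `stem`, `promote`, `flush` and the string-keyed `disk` map are hypothetical stand-ins, and the path is assumed to be simply the first `depth` bits of the key. It only illustrates why a clean, promoted stem ends up with no blob at its new path unless it is marked dirty.

```go
package main

import "fmt"

// stem is a simplified, hypothetical stand-in for a bintrie stem node.
type stem struct {
	key   string // key as a bit string, e.g. "1100"
	depth int    // number of key bits consumed above this stem
	dirty bool   // whether the stem must be written on the next flush
}

// path derives the on-disk path from the depth (assumption: the first
// depth bits of the key).
func (s *stem) path() string { return s.key[:s.depth] }

// promote pushes an existing stem one level down because a new stem
// sharing its prefix is being inserted. markDirty toggles the fix.
func promote(s *stem, markDirty bool) {
	s.depth++
	if markDirty {
		s.dirty = true
	}
}

// flush writes the new internal node at internalPath and then every
// dirty stem at its current path; clean stems are skipped.
func flush(disk map[string]string, internalPath string, stems ...*stem) {
	disk[internalPath] = "internal"
	for _, s := range stems {
		if s.dirty {
			disk[s.path()] = "stem:" + s.key
			s.dirty = false
		}
	}
}

// run inserts a second stem next to one that was loaded clean from disk
// and returns the resulting disk contents.
func run(markDirty bool) map[string]string {
	existing := &stem{key: "1100", depth: 2} // clean: just resolved from disk
	disk := map[string]string{existing.path(): "stem:1100"}

	oldPath := existing.path()
	promote(existing, markDirty)
	fresh := &stem{key: "1110", depth: 3, dirty: true}

	// The new internal node takes over the existing stem's old path.
	flush(disk, oldPath, existing, fresh)
	return disk
}

func main() {
	_, withoutFix := run(false)["110"]
	_, withFix := run(true)["110"]
	fmt.Println("stem at new path without fix:", withoutFix)
	fmt.Println("stem at new path with fix:", withFix)
}
```

Without the flag, the internal node overwrites path `11` and nothing is written at `110`, which corresponds to the "missing trie node" read failure described above.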

Verification: the 100MB integration-test harness (which previously
failed at blocks 9-10 with "missing trie node bdaf89... (path c96010)")
now runs cleanly through 200+ blocks of ERC20 deploys and bloat
transactions without any missing-trie-node errors.
2026-05-13 09:39:31 +08:00
CPerezz
b6d415c88d trie/bintrie: replace BinaryNode interface with GC-free NodeRef arena (#34055)
## Summary

Replace the `BinaryNode` interface with `NodeRef uint32` indices into
typed arena pools, eliminating GC-scanned pointers from binary trie
nodes.

Inspired by [fjl's
observation](https://github.com/ethereum/go-ethereum/pull/34034#issuecomment-4075176446):
> *"if the binary trie produces such a large graph, it should probably
be changed so that the trie node type does not contain pointers. The
runtime does not scan objects that do not contain pointers, so it can
really help with the performance to build it this way."*

### The problem

CPU profiling of the binary trie (EIP-7864) showed **44% of CPU time in
garbage collection**. Each `InternalNode` held two `BinaryNode`
interface values (2 pointer-words each), and the GC scanned every one.
With ~25K `InternalNode`s in memory during block processing, this
created enormous GC pressure.

### The solution

`NodeRef` is a compact `uint32` (2-bit kind tag + 30-bit pool index).
`NodeStore` manages chunked typed pools per node kind:
- **InternalNode pool**: ZERO Go pointers (children are `NodeRef`, hash
is `[32]byte`) → noscan spans
- **HashedNode pool**: ZERO Go pointers → noscan spans
- **StemNode pool**: retains `Values [][]byte` (matching existing
format)
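
A minimal sketch of the idea, not the code from this PR: the tag is assumed to sit in the two high bits, and the `Kind` values, `MakeRef`, `internalPool`, and `chunkSize` are hypothetical names chosen for illustration. It shows how a `uint32` reference can replace a pointer and why the node chunks themselves contain no Go pointers.

```go
package main

import "fmt"

// Kind is the 2-bit node-kind tag. The concrete values are assumptions.
type Kind uint32

const (
	KindEmpty Kind = iota
	KindInternal
	KindStem
	KindHashed
)

const (
	indexBits = 30
	indexMask = 1<<indexBits - 1
)

// NodeRef packs a 2-bit kind tag (assumed: high bits) and a 30-bit pool index.
type NodeRef uint32

// MakeRef builds a reference; it panics if idx does not fit in 30 bits.
func MakeRef(k Kind, idx uint32) NodeRef {
	if idx > indexMask {
		panic("pool index exceeds 30 bits")
	}
	return NodeRef(uint32(k)<<indexBits | idx)
}

func (r NodeRef) Kind() Kind    { return Kind(uint32(r) >> indexBits) }
func (r NodeRef) Index() uint32 { return uint32(r) & indexMask }

// internalNode holds no Go pointers: children are NodeRefs and the hash
// is a fixed-size array, so arrays of it need no GC scanning.
type internalNode struct {
	left, right NodeRef
	hash        [32]byte
}

const chunkSize = 1024 // hypothetical chunk size

// internalPool is a chunked pool: growing appends a new chunk and never
// moves existing nodes, so previously issued references stay valid.
type internalPool struct {
	chunks []*[chunkSize]internalNode
	n      uint32 // number of nodes allocated so far
}

// alloc reserves the next slot and returns a reference to it.
func (p *internalPool) alloc() NodeRef {
	if int(p.n/chunkSize) == len(p.chunks) {
		p.chunks = append(p.chunks, new([chunkSize]internalNode))
	}
	ref := MakeRef(KindInternal, p.n)
	p.n++
	return ref
}

// get resolves a reference to the node it names.
func (p *internalPool) get(r NodeRef) *internalNode {
	i := r.Index()
	return &p.chunks[i/chunkSize][i%chunkSize]
}

func main() {
	var pool internalPool
	root := pool.alloc()
	child := pool.alloc()
	pool.get(root).left = child
	fmt.Println(root.Kind() == KindInternal, pool.get(root).left.Index())
}
```

Only the small `chunks` slice holds pointers; each chunk is a pointer-free array, which is what lets the runtime place it in a noscan span.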

The serialization format is unchanged — flat InternalNode
`[type][leftHash][rightHash]` = 65 bytes.
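
The 65-byte figure is 1 type byte plus two 32-byte hashes. As a rough sketch of that layout (the type-byte value `nodeTypeInternal` and the function names are assumptions made for illustration, not taken from the PR):

```go
package main

import "fmt"

// nodeTypeInternal is a hypothetical type-byte value for internal nodes.
const nodeTypeInternal byte = 2

// encodeInternal lays out [type][leftHash][rightHash]: 1 + 32 + 32 = 65 bytes.
func encodeInternal(left, right [32]byte) []byte {
	buf := make([]byte, 0, 65)
	buf = append(buf, nodeTypeInternal)
	buf = append(buf, left[:]...)
	buf = append(buf, right[:]...)
	return buf
}

// decodeInternal reverses encodeInternal and rejects blobs with the
// wrong length or type byte.
func decodeInternal(blob []byte) (left, right [32]byte, err error) {
	if len(blob) != 65 || blob[0] != nodeTypeInternal {
		return left, right, fmt.Errorf("invalid internal node blob (len %d)", len(blob))
	}
	copy(left[:], blob[1:33])
	copy(right[:], blob[33:65])
	return left, right, nil
}

func main() {
	var l, r [32]byte
	l[0], r[31] = 0xaa, 0xbb
	blob := encodeInternal(l, r)
	gotL, gotR, err := decodeInternal(blob)
	fmt.Println(len(blob), gotL == l, gotR == r, err)
}
```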

## Benchmark: Apple M4 Pro (`--benchtime=10s --count=3`, on top of #34021)

| Metric | Baseline | Arena | Delta |
|--------|----------|-------|-------|
| Approve (Mgas/s) | 374 | 382 | **+2.1%** |
| BalanceOf (Mgas/s) | 885 | 901 | **+1.8%** |
| Approve allocs/op | 775K | **607K** | **-21.7%** |
| BalanceOf allocs/op | 265K | **228K** | **-14.0%** |

## Benchmark: AMD EPYC 48-core (50GB state, execution-specs ERC-20, on top of #34021 + #34032)

| Benchmark | Baseline | Arena | Delta |
|-----------|----------|-------|-------|
| erc20_approve (write) | 22.4 Mgas/s | **27.0 Mgas/s** | **+20.5%** |
| mixed_sload_sstore | 62.9 Mgas/s | **97.3 Mgas/s** | **+54.7%** |
| erc20_balanceof (read) | 180.8 Mgas/s | 167.6 Mgas/s | -7.3% (cold cache variance) |

The arena benefit scales with heap size — the EPYC (larger heap, more GC
pressure) shows much larger gains than the M4 Pro (efficient unified
memory). The mixed workload baseline was unstable (62.9 vs 16.3 Mgas/s
between runs due to GC-induced throughput collapse); the arena
eliminates this entirely (95-97 Mgas/s, stable).

## Dependencies

Benchmarked with #34021 (H01 N+1 fix) + #34032 (R14 parallel hashing).
No code dependency — applies independently to master.

All test suites pass (`trie/bintrie` with `-race`, `core/state`,
`triedb/pathdb`, `cmd/geth`).

---------

Co-authored-by: Guillaume Ballet <3272758+gballet@users.noreply.github.com>
2026-04-20 14:08:30 +02:00