go-ethereum

forks/go-ethereum

mirror of https://github.com/ethereum/go-ethereum.git synced 2026-05-24 08:49:29 +00:00

Commit graph

Author

SHA1

Message

Date

weiihann

bdb7b64173

trie/bintrie: mark promoted stem dirty during splitStemValuesInsert

When splitStemValuesInsert inserts a new stem that shares a prefix with
an existing stem, it increments the existing stem's depth and inserts a
new internal node above it. The existing stem's on-disk path is derived
from its depth via collectChildGroups + extendPathToGroupLeaf, so
promoting its depth means it should be flushed at a new path.

Previously, only the new stem (created in the divergence branch) was
marked dirty. The promoted existing stem retained whatever dirty value
it had — false if it was just deserialized from disk via a HashedNode
resolve. collectNodes would then skip flushing the existing stem at its
new path, while the new ancestor internal blob (also dirty) would
overwrite the existing stem's old blob at the prior path. The stem's
data was left with no on-disk home, breaking subsequent reads with
"missing trie node".

The bug surfaces in the integration-test harness (state-actor builds a
DB with single-stem-per-slot at depth 8, geth then mutates by adding a
new stem that shares ≥8 prefix bits with the existing stem). After
mutation, geth's `getValuesAtStem` resolves a HashedNode whose blob
should be at the extended-depth path but isn't on disk.

Mark `existing.dirty = true` when promoting the depth so collectNodes
re-flushes the stem at its new path.
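
The failure mode can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the `stem` type, `path`, `flush`, and `promote` below are hypothetical stand-ins, not the real `trie/bintrie` code, and the disk is modeled as a plain map keyed by a depth-derived path.

```go
package main

import "fmt"

// stem is a toy stand-in for a stem node (hypothetical, not the bintrie type).
type stem struct {
	depth int
	dirty bool
	data  string
}

// path derives a toy on-disk key from the stem's depth.
func path(s *stem) string { return fmt.Sprintf("depth-%d", s.depth) }

// flush writes only dirty nodes, mirroring how collectNodes is described above.
func flush(disk map[string]string, nodes ...*stem) {
	for _, n := range nodes {
		if n.dirty {
			disk[path(n)] = n.data
			n.dirty = false
		}
	}
}

// promote bumps the stem's depth and lets the new ancestor's blob take over
// the stem's old path. markDirty toggles the fix.
func promote(disk map[string]string, existing *stem, markDirty bool) {
	old := path(existing)
	existing.depth++
	if markDirty {
		existing.dirty = true
	}
	disk[old] = "internal-node-blob"
}

// stemReadable reports whether the stem's data is on disk at its new path.
func stemReadable(markDirty bool) bool {
	disk := map[string]string{"depth-8": "stem-values"}
	// Freshly resolved from disk, so it starts out clean.
	existing := &stem{depth: 8, dirty: false, data: "stem-values"}
	promote(disk, existing, markDirty)
	flush(disk, existing)
	_, ok := disk[path(existing)]
	return ok
}

func main() {
	fmt.Println(stemReadable(false), stemReadable(true)) // prints "false true"
}
```

Without the dirty mark the toy stem disappears from disk after the promotion, which corresponds to the "missing trie node" symptom.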

Verification: the 100MB integration-test harness (which previously
failed at block 9-10 with "missing trie node bdaf89... (path c96010)")
now runs cleanly through 200+ blocks of ERC20 deploys and bloat
transactions without any missing-trie-node errors.

2026-05-13 09:39:31 +08:00

CPerezz

b6d415c88d

trie/bintrie: replace BinaryNode interface with GC-free NodeRef arena (#34055)

## Summary

Replace the `BinaryNode` interface with `NodeRef uint32` indices into
typed arena pools, eliminating GC-scanned pointers from binary trie
nodes.

Inspired by [fjl's
observation](https://github.com/ethereum/go-ethereum/pull/34034#issuecomment-4075176446):
> *"if the binary trie produces such a large graph, it should probably
be changed so that the trie node type does not contain pointers. The
runtime does not scan objects that do not contain pointers, so it can
really help with the performance to build it this way."*

### The problem

CPU profiling of the binary trie (EIP-7864) showed **44% of CPU time in
garbage collection**. Each `InternalNode` held two `BinaryNode`
interface values (2 pointer-words each), and the GC scanned every one.
With ~25K `InternalNode`s in memory during block processing, this
created enormous GC pressure.

### The solution

`NodeRef` is a compact `uint32` (2-bit kind tag + 30-bit pool index).
`NodeStore` manages chunked typed pools per node kind:
- **InternalNode pool**: ZERO Go pointers (children are `NodeRef`, hash
is `[32]byte`) → noscan spans
- **HashedNode pool**: ZERO Go pointers → noscan spans
- **StemNode pool**: retains `Values [][]byte` (matching existing
format)
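
A minimal sketch of the packing described above may help. The bit layout (tag in the top two bits), the `Kind` constants, and the `MakeRef` helper are assumptions for illustration; the commit message only states the 2-bit tag and 30-bit index split.

```go
package main

import "fmt"

// NodeRef packs a 2-bit kind tag and a 30-bit pool index into a uint32.
// Placing the tag in the top two bits is an assumption of this sketch.
type NodeRef uint32

// Kind identifies which typed pool a reference points into.
// These names and values are hypothetical.
type Kind uint8

const (
	KindInternal Kind = iota
	KindStem
	KindHashed
	KindEmpty
)

const (
	indexBits = 30
	indexMask = 1<<indexBits - 1
)

// MakeRef combines a kind and a pool index. It panics if the index
// does not fit in 30 bits.
func MakeRef(k Kind, idx uint32) NodeRef {
	if idx > indexMask {
		panic("pool index exceeds 30 bits")
	}
	return NodeRef(uint32(k)<<indexBits | idx)
}

// Kind extracts the 2-bit tag.
func (r NodeRef) Kind() Kind { return Kind(r >> indexBits) }

// Index extracts the 30-bit pool index.
func (r NodeRef) Index() uint32 { return uint32(r) & indexMask }

// internalNode contains no Go pointers: two uint32 references and a
// fixed-size hash array, so the GC has nothing to scan inside it.
type internalNode struct {
	left, right NodeRef
	hash        [32]byte
}

func main() {
	r := MakeRef(KindStem, 12345)
	n := internalNode{left: r, right: MakeRef(KindEmpty, 0)}
	fmt.Println(r.Kind() == KindStem, r.Index(), n.left == r)
}
```

Because a reference is only an index, resolving it requires a lookup in the owning pool, which is the trade-off for keeping the node structs pointer-free.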

The serialization format is unchanged — flat InternalNode
`[type][leftHash][rightHash]` = 65 bytes.
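
The 65-byte figure follows from 1 type byte plus two 32-byte hashes. The sketch below shows one way such a blob could be encoded and decoded; the `typeInternal` tag value and both function names are hypothetical, not taken from the go-ethereum source.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	hashLen         = 32
	internalBlobLen = 1 + 2*hashLen // 65 bytes: [type][leftHash][rightHash]

	// typeInternal is a placeholder tag value chosen for this sketch.
	typeInternal byte = 2
)

// encodeInternal lays out the flat blob: one type byte followed by both hashes.
func encodeInternal(left, right [hashLen]byte) []byte {
	buf := make([]byte, 0, internalBlobLen)
	buf = append(buf, typeInternal)
	buf = append(buf, left[:]...)
	buf = append(buf, right[:]...)
	return buf
}

// decodeInternal validates the length and tag, then splits out the two hashes.
func decodeInternal(blob []byte) (left, right [hashLen]byte, err error) {
	if len(blob) != internalBlobLen {
		return left, right, errors.New("internal node blob must be 65 bytes")
	}
	if blob[0] != typeInternal {
		return left, right, errors.New("unexpected node type tag")
	}
	copy(left[:], blob[1:1+hashLen])
	copy(right[:], blob[1+hashLen:])
	return left, right, nil
}

func main() {
	var l, r [hashLen]byte
	l[0], r[31] = 0xaa, 0xbb
	blob := encodeInternal(l, r)
	gotL, gotR, err := decodeInternal(blob)
	fmt.Println(len(blob), gotL == l, gotR == r, err)
}
```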

## Benchmark: Apple M4 Pro (`--benchtime=10s --count=3`, on top of #34021)

| Metric | Baseline | Arena | Delta |
|--------|----------|-------|-------|
| Approve (Mgas/s) | 374 | 382 | **+2.1%** |
| BalanceOf (Mgas/s) | 885 | 901 | **+1.8%** |
| Approve allocs/op | 775K | **607K** | **-21.7%** |
| BalanceOf allocs/op | 265K | **228K** | **-14.0%** |

## Benchmark: AMD EPYC 48-core (50GB state, execution-specs ERC-20, on top of #34021 + #34032)

| Benchmark | Baseline | Arena | Delta |
|-----------|----------|-------|-------|
| erc20_approve (write) | 22.4 Mgas/s | **27.0 Mgas/s** | **+20.5%** |
| mixed_sload_sstore | 62.9 Mgas/s | **97.3 Mgas/s** | **+54.7%** |
| erc20_balanceof (read) | 180.8 Mgas/s | 167.6 Mgas/s | -7.3% (cold cache variance) |

The arena benefit scales with heap size — the EPYC (larger heap, more GC
pressure) shows much larger gains than the M4 Pro (efficient unified
memory). The mixed workload baseline was unstable (62.9 vs 16.3 Mgas/s
between runs due to GC-induced throughput collapse); the arena
eliminates this entirely (95-97 Mgas/s, stable).

## Dependencies

Benchmarked with #34021 (H01 N+1 fix) + #34032 (R14 parallel hashing).
No code dependency — applies independently to master.

All test suites pass (`trie/bintrie` with `-race`, `core/state`,
`triedb/pathdb`, `cmd/geth`).

---------

Co-authored-by: Guillaume Ballet <3272758+gballet@users.noreply.github.com>

2026-04-20 14:08:30 +02:00

2 commits