go-ethereum

mirror of https://github.com/ethereum/go-ethereum.git synced 2026-05-09 17:46:37 +00:00

Author	SHA1	Message	Date
Guillaume Ballet	a15778c52f	trie: group 2^N binary trie nodes in serialization (#34794 ) Some checks are pending / Linux Build (push) Waiting to run Details / Linux Build (arm) (push) Waiting to run Details / Keeper Build (push) Waiting to run Details / Windows Build (push) Waiting to run Details / Docker Image (push) Waiting to run Details This PR addresses one of the biggest performance issue with binary tries: storing each internal node individually bloats the index, the disk, and triggers a lot of write amplifications. To fix this issue, this PR serializes groups of nodes together. Because we are still looking for the ideal group size, the "depth" of the group tree is made a parameter, but that will be removed in the future, once the perfect size is known. This is a rebase of #33658 --------- Co-authored-by: Copilot <copilot@github.com>	2026-05-01 15:28:19 +02:00
CPerezz	b6d415c88d	trie/bintrie: replace BinaryNode interface with GC-free NodeRef arena (#34055 ) ## Summary Replace the `BinaryNode` interface with `NodeRef uint32` indices into typed arena pools, eliminating GC-scanned pointers from binary trie nodes. Inspired by [fjl's observation](https://github.com/ethereum/go-ethereum/pull/34034#issuecomment-4075176446): > "if the binary trie produces such a large graph, it should probably be changed so that the trie node type does not contain pointers. The runtime does not scan objects that do not contain pointers, so it can really help with the performance to build it this way." ### The problem CPU profiling of the binary trie (EIP-7864) showed 44% of CPU time in garbage collection. Each `InternalNode` held two `BinaryNode` interface values (2 pointer-words each), and the GC scanned every one. With ~25K `InternalNode`s in memory during block processing, this created enormous GC pressure. ### The solution `NodeRef` is a compact `uint32` (2-bit kind tag + 30-bit pool index). `NodeStore` manages chunked typed pools per node kind: - InternalNode pool: ZERO Go pointers (children are `NodeRef`, hash is `[32]byte`) → noscan spans - HashedNode pool: ZERO Go pointers → noscan spans - StemNode pool: retains `Values [][]byte` (matching existing format) The serialization format is unchanged — flat InternalNode `[type][leftHash][rightHash]` = 65 bytes. ## Benchmark: Apple M4 Pro (`--benchtime=10s --count=3`, on top of #34021) \| Metric \| Baseline \| Arena \| Delta \| \|--------\|----------\|-------\|-------\| \| Approve (Mgas/s) \| 374 \| 382 \| +2.1% \| \| BalanceOf (Mgas/s) \| 885 \| 901 \| +1.8% \| \| Approve allocs/op \| 775K \| 607K \| -21.7% \| \| BalanceOf allocs/op \| 265K \| 228K \| -14.0% \| ## Benchmark: AMD EPYC 48-core (50GB state, execution-specs ERC-20, on top of #34021 + #34032) \| Benchmark \| Baseline \| Arena \| Delta \| \|-----------\|----------\|-------\|-------\| \| erc20_approve (write) \| 22.4 Mgas/s \| 27.0 Mgas/s \| +20.5% \| \| mixed_sload_sstore \| 62.9 Mgas/s \| 97.3 Mgas/s \| +54.7% \| \| erc20_balanceof (read) \| 180.8 Mgas/s \| 167.6 Mgas/s \| -7.3% (cold cache variance) \| The arena benefit scales with heap size — the EPYC (larger heap, more GC pressure) shows much larger gains than the M4 Pro (efficient unified memory). The mixed workload baseline was unstable (62.9 vs 16.3 Mgas/s between runs due to GC-induced throughput collapse); the arena eliminates this entirely (95-97 Mgas/s, stable). ## Dependencies Benchmarked with #34021 (H01 N+1 fix) + #34032 (R14 parallel hashing). No code dependency — applies independently to master. All test suites pass (`trie/bintrie` with `-race`, `core/state`, `triedb/pathdb`, `cmd/geth`). --------- Co-authored-by: Guillaume Ballet <3272758+gballet@users.noreply.github.com>	2026-04-20 14:08:30 +02:00
CPerezz	61bfacc52f	trie/bintrie: skip clean nodes in CollectNodes to reduce commit write amplification (#34754 ) Some checks are pending / Linux Build (push) Waiting to run Details / Linux Build (arm) (push) Waiting to run Details / Keeper Build (push) Waiting to run Details / Windows Build (push) Waiting to run Details / Docker Image (push) Waiting to run Details ## Problem `BinaryTrie.Commit` unconditionally walked every resolved in-memory node and flushed it into the `NodeSet`, producing one Pebble write per resolved internal + stem node on every block — even when the node's on-disk blob was bitwise identical to the previous commit. On a warm 400M-state workload this meant tens of thousands of redundant 65-byte writes per block, compounding Pebble compaction pressure on every commit. The existing `mustRecompute` flag tracks hash staleness, not disk-blob staleness: after `Hash()` completes, `mustRecompute` is cleared even though the fresh blob has not been persisted. It is therefore insufficient for a skip-flush optimization. ## Fix Mirror the MPT committer pattern (`trie/committer.go:51-56`) by adding a `dirty` flag on `InternalNode` and `StemNode` with the semantics the on-disk blob is stale. The flag is: - set to `true` wherever the node is created or structurally modified (the same call sites that already set `mustRecompute = true`); - set to `false` only after the node has been passed to the `flushfn` inside `CollectNodes`; - left `false` on nodes produced by `DeserializeNodeWithHash`, matching the loaded from disk, already persisted semantics. `CollectNodes` short-circuits on `!dirty` subtrees. The propagation invariant (an ancestor of any dirty node is itself dirty) is already maintained by the existing `InsertValuesAtStem` / `Insert` paths, which now mirror every `mustRecompute = true` setter with a `dirty = true` setter. ## Benchmark New `BenchmarkCollectNodes_SparseWrite` measures commit cost when only one leaf changes between blocks — the common case for state updates. 10,000-stem trie, one-leaf modification + Commit per iteration, Apple M4 Pro: \| \| before \| after \| delta \| \|---\|---\|---\|---\| \| time / op \| 12,653,000 ns \| 7,336 ns \| ~1,725× \| \| bytes / op \| 107,224,740 B \| 37,774 B \| ~2,839× \| \| allocs / op \| 80,953 \| 134 \| ~604× \| End-to-end impact on a real workload depends on the resolved-footprint-to-dirty-path ratio; the new `TestBinaryTrieCommitIncremental` provides a structural regression guard (asserts that a Commit following a single-leaf modification flushes a root-to-leaf path, not the whole tree). --- Found all of this stuff while bloating my #34706 DB to make some benchmarks. And saw we were spending A LOT OF TIME on hashing. Hope this helps the perf a bit. Will rebase the flat-state PR on top of this once merged.	2026-04-18 11:42:58 +02:00
Guillaume Ballet	3f1871524f	trie/bintrie: cache hashes of clean nodes so as not to rehash the whole tree (#33961 ) This is an optimization that existed for verkle and the MPT, but that got dropped during the rebase. Mark the nodes that were modified as needing recomputation, and skip the hash computation if this is not needed. Otherwise, the whole tree is hashed, which kills performance.	2026-03-06 18:06:24 +01:00
Guillaume Ballet	bd4b17907f	trie/bintrie: add eip7864 binary trees and run its tests (#32365 ) Implement the binary tree as specified in [eip-7864](https://eips.ethereum.org/EIPS/eip-7864). This will gradually replace verkle trees in the codebase. This is only running the tests and will not be executed in production, but will help me rebase some of my work, so that it doesn't bitrot as much. --------- Signed-off-by: Guillaume Ballet Co-authored-by: Parithosh Jayanthi <parithosh.jayanthi@ethereum.org> Co-authored-by: rjl493456442 <garyrong0905@gmail.com>	2025-09-01 21:06:51 +08:00

5 commits