Gballet asked (comment 3099953085) to leave the sha256Sum256 / constant
parallelHashDepth optimisation out of this PR: it's an orthogonal
microbenchmark concern that should be revisited post-group-depth under
Go 1.26.
- Delete the sha256Sum256 helper from hasher.go.
- Delete the const parallelHashDepth = 4 from hasher.go.
- Restore master's dynamic parallelDepth() helper in store_commit.go
(copy verbatim — min(bits.Len(NumCPU), 8)).
- In hashInternal's shallow-parallel branch, call sha256.Sum256 directly
(std-lib, stack-allocated [32]byte; common.Hash is a defined type whose
underlying type is [32]byte, so the result is assignable without an
explicit conversion).
- In hashInternal's deep-sequential branch, use the pooled newSha256 /
returnSha256 hasher (matches master's internal_node.go:170-185).
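
A minimal sketch of the two restored pieces, for illustration only. The
body of parallelDepth() follows the formula quoted above but is not copied
from master; Hash is a stand-in for common.Hash, and hashPair is a
hypothetical helper, not the real hashInternal. The builtin min needs
Go 1.21 or later.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"math/bits"
	"runtime"
)

// Hash stands in for common.Hash: a defined type whose underlying
// type is [32]byte (assumption made for this sketch).
type Hash [32]byte

// parallelDepth follows the formula quoted in the commit message:
// min(bits.Len(NumCPU), 8). The exact body on master is an assumption.
func parallelDepth() int {
	return min(bits.Len(uint(runtime.NumCPU())), 8)
}

// hashPair is a hypothetical helper that hashes two child hashes with
// sha256.Sum256, using no pooled hasher.
func hashPair(left, right Hash) Hash {
	var buf [64]byte
	copy(buf[:32], left[:])
	copy(buf[32:], right[:])
	// sha256.Sum256 returns the unnamed type [32]byte, which is
	// assignable to Hash without an explicit conversion.
	return sha256.Sum256(buf[:])
}

func main() {
	fmt.Println("parallel depth:", parallelDepth())
	fmt.Printf("%x\n", hashPair(Hash{}, Hash{}))
}
```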
Intentional trade-off: the deep branch now re-introduces per-hash
sync.Pool Get/Put plus a 32-byte h.Sum(nil) allocation. No regression
relative to master; forgoes the arena's proposed stack-based hashing
until Go 1.26 + post-group-depth benchmarks.
Replace the BinaryNode interface (which uses Go interface pointers that
the GC must scan) with NodeRef uint32 indices into typed arena pools.
NodeRef packs a 2-bit kind tag and 30-bit pool index into a single
uint32, making it invisible to the garbage collector.
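
The packing can be sketched as follows. This is an illustration under
assumptions: placing the tag in the top two bits, the kind names, and the
MakeRef/Kind/Index helpers are hypothetical and may differ from the actual
code.

```go
package main

import "fmt"

// NodeRef packs a 2-bit kind tag and a 30-bit pool index into a uint32.
// It holds no Go pointer, so the GC has nothing to trace through it.
type NodeRef uint32

// NodeKind names are hypothetical; the source only says there is a 2-bit tag.
type NodeKind uint8

const (
	KindInternal NodeKind = iota
	KindStem
	KindHashed
	KindEmpty
)

const (
	kindShift = 30              // assumption: tag lives in the top 2 bits
	indexMask = 1<<kindShift - 1 // low 30 bits
)

// MakeRef builds a reference; it panics if either field does not fit.
func MakeRef(kind NodeKind, index uint32) NodeRef {
	if kind > KindEmpty || index > indexMask {
		panic("NodeRef field out of range")
	}
	return NodeRef(uint32(kind)<<kindShift | index)
}

// Kind extracts the 2-bit tag.
func (r NodeRef) Kind() NodeKind { return NodeKind(r >> kindShift) }

// Index extracts the 30-bit pool index.
func (r NodeRef) Index() uint32 { return uint32(r) & indexMask }

func main() {
	r := MakeRef(KindStem, 12345)
	fmt.Println(r.Kind() == KindStem, r.Index())
}
```

With 30 index bits, each pool can address up to 2^30 nodes of a given kind.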
NodeStore manages chunked typed pools per node kind:
- InternalNode pool: ZERO Go pointers (children are NodeRef, hash is
[32]byte) → allocated in noscan spans, GC skips entirely
- HashedNode pool: ZERO Go pointers → noscan spans
- StemNode pool: ONE pointer per node (valueData []byte) → minimal GC
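
A rough sketch of what a chunked typed pool could look like. Everything
here beyond the field shapes named above is an assumption: the generic
chunkedPool type, the chunk size, the Stem field, and the alloc/get method
names are hypothetical and not taken from the PR.

```go
package main

import "fmt"

const chunkSize = 1024 // hypothetical chunk size

type NodeRef uint32

// InternalNode contains no Go pointers, so slices of it can be
// allocated in noscan spans.
type InternalNode struct {
	Left, Right NodeRef
	Hash        [32]byte
}

// StemNode carries exactly one pointer-bearing field.
type StemNode struct {
	Stem      [31]byte // hypothetical field
	ValueData []byte
}

// chunkedPool grows by appending fixed-size chunks, so nodes that were
// already allocated never move and pointers to them stay valid.
type chunkedPool[T any] struct {
	chunks [][]T
	count  uint32
}

// alloc reserves the next slot and returns its index.
func (p *chunkedPool[T]) alloc() uint32 {
	idx := p.count
	if int(idx/chunkSize) == len(p.chunks) {
		p.chunks = append(p.chunks, make([]T, chunkSize))
	}
	p.count++
	return idx
}

// get returns a pointer to the slot at idx.
func (p *chunkedPool[T]) get(idx uint32) *T {
	return &p.chunks[idx/chunkSize][idx%chunkSize]
}

// NodeStore holds one pool per node kind.
type NodeStore struct {
	internals chunkedPool[InternalNode]
	stems     chunkedPool[StemNode]
}

func main() {
	var s NodeStore
	i := s.internals.alloc()
	s.internals.get(i).Left = 7
	j := s.stems.alloc()
	s.stems.get(j).ValueData = []byte{1, 2, 3}
	fmt.Println(i, s.internals.get(i).Left, len(s.stems.get(j).ValueData))
}
```

Chunking avoids copying the whole pool on growth; only the small slice of
chunk headers is reallocated.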
For a trie with 25K InternalNodes, this reduces GC-scanned pointer-words
from ~125K to ~10K (85% reduction). CPU profiling showed 44% of time
in GC; this refactor directly addresses that bottleneck.
Serialization format is unchanged — the on-disk representation is
fully compatible. All existing tests pass.
Binary tree hashing is slow for several reasons. One of them is GC
pressure from allocating many hashers, since a binary tree is 4x the
size of an MPT. This PR introduces an optimization that already exists
for the MPT: keep a pool of hashers to reduce the number of
allocations.
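
A minimal sketch of such a hasher pool, built on sync.Pool. The names
newSha256 and returnSha256 match the helpers mentioned in this history, but
their bodies here are an assumption, and hashChildren is a hypothetical
caller added only to show the Get/Put pattern.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"hash"
	"sync"
)

// sha256Pool keeps idle hashers so that hot paths reuse them
// instead of allocating a new one per hash.
var sha256Pool = sync.Pool{
	New: func() any { return sha256.New() },
}

// newSha256 takes a hasher from the pool and resets its state.
func newSha256() hash.Hash {
	h := sha256Pool.Get().(hash.Hash)
	h.Reset()
	return h
}

// returnSha256 hands the hasher back for reuse.
func returnSha256(h hash.Hash) {
	sha256Pool.Put(h)
}

// hashChildren is a hypothetical caller: it hashes two child hashes
// with a pooled hasher.
func hashChildren(left, right [32]byte) [32]byte {
	h := newSha256()
	defer returnSha256(h)
	h.Write(left[:])
	h.Write(right[:])
	var out [32]byte
	copy(out[:], h.Sum(nil)) // Sum(nil) allocates the 32-byte result
	return out
}

func main() {
	var l, r [32]byte
	l[0], r[0] = 1, 2
	fmt.Printf("%x\n", hashChildren(l, r))
}
```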