Optimize the NOMT storage engine while preserving correctness (all
triecompare cross-validation tests pass at 10K+ scale):
- Pool SHA256 hashers via sync.Pool in HashInternal and HashStem
- Replace allStems map with sorted slice + O(N+M) merge (in-place fast
path for incremental updates avoids allocation entirely)
- Add UpdateSorted to db.DB, skipping redundant sort of pre-sorted ops
- Simplify canonicalRoot to use pre-sorted allStems directly
- Optimize StemSharedBits with byte-level XOR + bits.LeadingZeros8
- Replace stemLess loops with bytes.Compare in all locations
- Eliminate per-stem map alloc in groupAndHashStems (use [256]bool dirty)
- Use stack-allocated [248]bool for downBits in BuildInternalTree
- Remove unused stemPathCmp function
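As a rough illustration of the StemSharedBits and stemLess items above, the following sketch counts shared leading bits with a byte-level XOR plus bits.LeadingZeros8 and orders stems with bytes.Compare. The names sharedBits and stemLess and the []byte stem representation are assumptions made for this example, not the engine's actual signatures.

```go
package main

import (
	"bytes"
	"fmt"
	"math/bits"
)

// sharedBits returns the number of leading bits that a and b have in common.
// Hypothetical stand-in for StemSharedBits; stems are assumed to be []byte.
func sharedBits(a, b []byte) int {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	for i := 0; i < n; i++ {
		// The first non-zero XOR byte holds the first differing bit.
		if x := a[i] ^ b[i]; x != 0 {
			return i*8 + bits.LeadingZeros8(x)
		}
	}
	return n * 8
}

// stemLess reports whether a sorts before b in lexicographic byte order.
func stemLess(a, b []byte) bool {
	return bytes.Compare(a, b) < 0
}

func main() {
	a := []byte{0xAB, 0xF0}
	b := []byte{0xAB, 0xF8}
	fmt.Println(sharedBits(a, b), stemLess(a, b))
}
```

Working one byte at a time means identical prefixes are skipped with a single comparison per byte rather than eight per-bit checks.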
BenchmarkHash/10000/nomt: 9.8ms → 8.2ms (-16%)
BenchmarkBlockWorkload/nomt: 7.7ms → 6.6ms (-14%)
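For reference, the O(N+M) merge of two sorted stem slices mentioned in the list above could look roughly like this. It is a minimal sketch: mergeStems is a hypothetical name, stems are assumed to be []byte values, and the in-place fast path for incremental updates is not reproduced here.

```go
package main

import (
	"bytes"
	"fmt"
)

// mergeStems merges two sorted, duplicate-free stem slices into a single
// sorted, duplicate-free slice using one linear pass over both inputs.
func mergeStems(existing, added [][]byte) [][]byte {
	out := make([][]byte, 0, len(existing)+len(added))
	i, j := 0, 0
	for i < len(existing) && j < len(added) {
		c := bytes.Compare(existing[i], added[j])
		switch {
		case c < 0:
			out = append(out, existing[i])
			i++
		case c > 0:
			out = append(out, added[j])
			j++
		default:
			// Same stem on both sides: keep one copy.
			out = append(out, existing[i])
			i++
			j++
		}
	}
	// At most one of these tails is non-empty.
	out = append(out, existing[i:]...)
	out = append(out, added[j:]...)
	return out
}

func main() {
	existing := [][]byte{{0x01}, {0x03}}
	added := [][]byte{{0x02}, {0x03}}
	for _, s := range mergeStems(existing, added) {
		fmt.Printf("%x\n", s)
	}
}
```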
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove leaf compaction from PageWalker.compactStep, replace KeyValue
with StemKeyValue throughout the merkle engine and worker, and update
DB.Update to accept pre-hashed stem key-value pairs.
Key change: singleThreadedUpdate now uses the same depth-7 child-index
partitioning as the parallel path, ensuring identical intermediate
hashes without leaf compaction.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Parallelize the PageWalker trie update across multiple goroutines by
partitioning sorted operations by the root page's 64 child subtrees
(first 6 bits of each key path).
Each worker runs an independent PageWalker constrained to child pages
below the root (using the parentPage mechanism), producing ChildPageRoots.
After all workers complete, a root walker places the child roots using
AdvanceAndPlaceNode and concludes with the final trie root.
Workers operate on disjoint page subtrees, so no synchronization is
needed during computation; a sync.WaitGroup is used only to join the
goroutines.
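The partition-and-join scheme described above can be sketched as follows. This is illustrative only: op, childIndex, partition, processSubtree and run are hypothetical names, key paths are assumed to be 32 bytes, and the worker returns an operation count in place of a real child page root.

```go
package main

import (
	"fmt"
	"sync"
)

// op is a hypothetical trie operation keyed by an assumed 32-byte path.
type op struct {
	path  [32]byte
	value []byte
}

// childIndex returns the first 6 bits of the path, a value in 0..63.
func childIndex(path [32]byte) int {
	return int(path[0] >> 2)
}

// partition splits ops, which must already be sorted by path, into 64
// contiguous sub-slices, one per child subtree of the root page.
func partition(ops []op) [64][]op {
	var buckets [64][]op
	start := 0
	for start < len(ops) {
		idx := childIndex(ops[start].path)
		end := start + 1
		for end < len(ops) && childIndex(ops[end].path) == idx {
			end++
		}
		buckets[idx] = ops[start:end]
		start = end
	}
	return buckets
}

// processSubtree stands in for a per-subtree walker; it only counts ops.
func processSubtree(ops []op) int {
	return len(ops)
}

// run processes each non-empty bucket in its own goroutine. Every worker
// writes to a distinct results slot, so the WaitGroup join is the only
// coordination required.
func run(ops []op) [64]int {
	buckets := partition(ops)
	var results [64]int
	var wg sync.WaitGroup
	for i, b := range buckets {
		if len(b) == 0 {
			continue
		}
		wg.Add(1)
		go func(i int, b []op) {
			defer wg.Done()
			results[i] = processSubtree(b)
		}(i, b)
	}
	wg.Wait()
	return results
}

func main() {
	var ops []op
	for _, first := range []byte{0x00, 0x01, 0x04, 0xFC} {
		var o op
		o.path[0] = first
		ops = append(ops, o)
	}
	res := run(ops)
	fmt.Println(res[0], res[1], res[63])
}
```

Because the input is sorted, each bucket is a contiguous sub-slice of the original operations, so partitioning copies no data.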
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>