Performance optimizations to the NOMT storage engine while preserving
correctness (all triecompare cross-validation tests pass at 10K+ scale):
- Pool SHA256 hashers via sync.Pool in HashInternal and HashStem
- Replace allStems map with sorted slice + O(N+M) merge (in-place fast
path for incremental updates avoids allocation entirely)
- Add UpdateSorted to db.DB, skipping redundant sort of pre-sorted ops
- Simplify canonicalRoot to use pre-sorted allStems directly
- Optimize StemSharedBits with byte-level XOR + bits.LeadingZeros8
- Replace stemLess loops with bytes.Compare in all locations
- Eliminate per-stem map alloc in groupAndHashStems (use [256]bool dirty)
- Use stack-allocated [248]bool for downBits in BuildInternalTree
- Remove unused stemPathCmp function
BenchmarkHash/10000/nomt: 9.8ms → 8.2ms (-16%)
BenchmarkBlockWorkload/nomt: 7.7ms → 6.6ms (-14%)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>