At tree depths below log2(NumCPU) (capped at 8), when both children
need rehashing, hash the left subtree in a goroutine while hashing the
right subtree inline. This exploits available CPU cores at the top
levels of the tree, where subtree hashing is most expensive. When only
one child is dirty, the goroutine is skipped to avoid its overhead.
Deeper nodes use sequential hashing with the existing sync.Pool hasher.
The parallel path uses sha256.Sum256 with a stack-allocated buffer to
avoid pool contention across goroutines.
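
A minimal sketch of the scheme described above, not the actual patch.
The node type, rehash, parallelDepth, hasherPool and hashPooled are
hypothetical names chosen for illustration. The sketch also assumes
that the cap of 8 applies to the depth threshold and that log2 is
rounded down; neither detail is spelled out above.

```go
package main

import (
	"crypto/sha256"
	"hash"
	"math/bits"
	"runtime"
	"sync"
)

// node is a hypothetical binary Merkle node; the real tree type differs.
type node struct {
	left, right *node
	value       []byte // leaf payload
	hash        [32]byte
	dirty       bool // true when hash must be recomputed
}

// hasherPool stands in for the existing sync.Pool hasher (assumed shape).
var hasherPool = sync.Pool{New: func() any { return sha256.New() }}

// parallelDepth returns floor(log2(NumCPU)), capped at 8 (assumption).
func parallelDepth() int {
	d := bits.Len(uint(runtime.NumCPU())) - 1
	if d > 8 {
		d = 8
	}
	return d
}

// hashPooled combines two child hashes using a pooled hasher.
func hashPooled(l, r [32]byte) [32]byte {
	h := hasherPool.Get().(hash.Hash)
	defer hasherPool.Put(h)
	h.Reset()
	h.Write(l[:])
	h.Write(r[:])
	var out [32]byte
	h.Sum(out[:0]) // appends into out's backing array
	return out
}

// rehash recomputes dirty hashes below n. Above maxPar, and only when
// both children are dirty, the left subtree is hashed in a goroutine.
func (n *node) rehash(depth, maxPar int) [32]byte {
	if n == nil {
		return [32]byte{}
	}
	if !n.dirty {
		return n.hash
	}
	if n.left == nil && n.right == nil {
		n.hash = sha256.Sum256(n.value)
		n.dirty = false
		return n.hash
	}
	bothDirty := n.left != nil && n.right != nil &&
		n.left.dirty && n.right.dirty
	if depth < maxPar && bothDirty {
		var l, r [32]byte
		var wg sync.WaitGroup
		wg.Add(1)
		go func() {
			defer wg.Done()
			l = n.left.rehash(depth+1, maxPar)
		}()
		r = n.right.rehash(depth+1, maxPar)
		wg.Wait()
		// Fixed-size buffer plus Sum256: no pooled hasher is touched here.
		var buf [64]byte
		copy(buf[:32], l[:])
		copy(buf[32:], r[:])
		n.hash = sha256.Sum256(buf[:])
	} else {
		// Sequential path: at most one goroutine-free recursion per child.
		l := n.left.rehash(depth+1, maxPar)
		r := n.right.rehash(depth+1, maxPar)
		n.hash = hashPooled(l, r)
	}
	n.dirty = false
	return n.hash
}
```

A caller would invoke root.rehash(0, parallelDepth()). Passing 0 as
maxPar forces the sequential path everywhere, which is a convenient
way to check that both paths produce the same root hash.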