mirror of
https://github.com/ethereum/go-ethereum.git
synced 2026-06-06 06:58:39 +00:00
## Overview
This PR fixes a race condition during blockchain shutdown where snapshot
generation could continue accessing the trie database after it has been
closed, leading to iterator errors. We noticed this in one of our nodes
on https://github.com/ava-labs/avalanchego, which relies on an older
version of geth with the same issue (so this behavior does happen!).
During node shutdown, the following sequence occurs:
1. `BlockChain.Stop()` calls `snaps.Release()` to clean up snapshot
resources
2. `Release()` only resets the cache but doesn't stop the generator
goroutine
3. The trie database is then closed via `triedb.Close()`
4. The still-running generator attempts to iterate storage tries
5. Iterator fails because the database is closed (`"Generator failed to
iterate storage trie"`)
## Problem
There are three related bugs:
1. `Release()` doesn't stop generation: The `diskLayer.Release()` method
only resets the cache without stopping ongoing snapshot generation,
leaving the generator goroutine running after database closure.
2. `stopGeneration()` has an incorrect completion check: The
`stopGeneration()` method checks `genMarker != nil` to determine if
generation is running. However, `genMarker` is set to nil when
generation completes successfully, even though the generator goroutine
is still waiting for the abort signal at the end of `generate()`. See
line 705 in `generate.go`:
eaaa5b716d/core/state/snapshot/generate.go (L699-L707)
This means `stopGeneration()` returns early without sending the abort
signal.
3. Node shutdown doesn't stop generation: During shutdown, no code path
calls `stopGeneration()` or sends the abort signal to the generator,
causing the generator to access a closed database and error.
## Fix
- Modified `diskLayer.Release()` to call `stopGeneration()` before
releasing resources
- Added cancelation architecture, removing reliance on someone having to
wait
- Fixed `stopGeneration()` to properly and safely stop snapshot
generation
- Added `TestGenerateGoroutineLeak` to verify the fix and prevent
regression. The test fails without the fix and passes with it.
- The test creates a snapshot with active generation, waits for
completion, then calls `Release()`, and uses `go.uber.org/goleak` to
assert no generator goroutine survives.
- Without the fix, the test fails: `Release()` returns without stopping
the generator, which stays parked at `generate.go:705` waiting for an
abort signal that never comes:
```
--- FAIL: TestGenerateGoroutineLeak (0.88s)
generate_test.go: found unexpected goroutines:
[Goroutine 6 in state chan receive, with
core/state/snapshot.(*diskLayer).generate on top of the stack:
core/state/snapshot.(*diskLayer).generate(...)
core/state/snapshot/generate.go:705
created by core/state/snapshot.generateSnapshot
core/state/snapshot/generate.go:79 ]
```
- With the fix, the test passes: `Release()` -> `stopGeneration()`
blocks until the generator goroutine has fully exited, so nothing leaks
Note that this fix follows the same pattern used in `Tree.Disable()` in
https://github.com/ethereum/go-ethereum/pull/30040, which introduced
`stopGeneration()` for use in `Disable()` and `Rebuild()` but didn't
address the shutdown path.
The test follows the same pattern used in
`TestCheckSimBackendGoroutineLeak`
222 lines
7.3 KiB
Go
222 lines
7.3 KiB
Go
// Copyright 2019 The go-ethereum Authors
|
|
// This file is part of the go-ethereum library.
|
|
//
|
|
// The go-ethereum library is free software: you can redistribute it and/or modify
|
|
// it under the terms of the GNU Lesser General Public License as published by
|
|
// the Free Software Foundation, either version 3 of the License, or
|
|
// (at your option) any later version.
|
|
//
|
|
// The go-ethereum library is distributed in the hope that it will be useful,
|
|
// but WITHOUT ANY WARRANTY; without even the implied warranty of
|
|
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
|
// GNU Lesser General Public License for more details.
|
|
//
|
|
// You should have received a copy of the GNU Lesser General Public License
|
|
// along with the go-ethereum library. If not, see <http://www.gnu.org/licenses/>.
|
|
|
|
package snapshot
|
|
|
|
import (
|
|
"bytes"
|
|
"sync"
|
|
|
|
"github.com/VictoriaMetrics/fastcache"
|
|
"github.com/ethereum/go-ethereum/common"
|
|
"github.com/ethereum/go-ethereum/core/rawdb"
|
|
"github.com/ethereum/go-ethereum/core/types"
|
|
"github.com/ethereum/go-ethereum/ethdb"
|
|
"github.com/ethereum/go-ethereum/rlp"
|
|
"github.com/ethereum/go-ethereum/triedb"
|
|
)
|
|
|
|
// diskLayer is a low level persistent snapshot built on top of a key-value store.
|
|
type diskLayer struct {
|
|
diskdb ethdb.KeyValueStore // Key-value store containing the base snapshot
|
|
triedb *triedb.Database // Trie node cache for reconstruction purposes
|
|
cache *fastcache.Cache // Cache to avoid hitting the disk for direct access
|
|
|
|
root common.Hash // Root hash of the base snapshot
|
|
stale bool // Signals that the layer became stale (state progressed)
|
|
|
|
genMarker []byte // Marker for the state that's indexed during initial layer generation
|
|
genPending chan struct{} // Notification channel when generation is done (test synchronicity)
|
|
|
|
// Generator lifecycle management:
|
|
// - [cancel] is closed to request termination (broadcast).
|
|
// - [done] is closed by the generator goroutine on exit.
|
|
cancel chan struct{}
|
|
done chan struct{}
|
|
cancelOnce sync.Once
|
|
|
|
genStats *generatorStats // Stats for snapshot generation (generation aborted/finished if non-nil)
|
|
|
|
lock sync.RWMutex
|
|
}
|
|
|
|
// Release releases underlying resources; specifically the fastcache requires
|
|
// Reset() in order to not leak memory.
|
|
// OBS: It does not invoke Close on the diskdb
|
|
func (dl *diskLayer) Release() error {
|
|
// Stop any ongoing snapshot generation to prevent it from accessing
|
|
// the database after it's closed during shutdown
|
|
dl.stopGeneration()
|
|
|
|
if dl.cache != nil {
|
|
dl.cache.Reset()
|
|
}
|
|
return nil
|
|
}
|
|
|
|
// Root returns root hash for which this snapshot was made.
|
|
func (dl *diskLayer) Root() common.Hash {
|
|
return dl.root
|
|
}
|
|
|
|
// Parent always returns nil as there's no layer below the disk.
|
|
func (dl *diskLayer) Parent() snapshot {
|
|
return nil
|
|
}
|
|
|
|
// Stale return whether this layer has become stale (was flattened across) or if
|
|
// it's still live.
|
|
func (dl *diskLayer) Stale() bool {
|
|
dl.lock.RLock()
|
|
defer dl.lock.RUnlock()
|
|
|
|
return dl.stale
|
|
}
|
|
|
|
// markStale sets the stale flag as true.
|
|
func (dl *diskLayer) markStale() {
|
|
dl.lock.Lock()
|
|
defer dl.lock.Unlock()
|
|
|
|
dl.stale = true
|
|
}
|
|
|
|
// Account directly retrieves the account associated with a particular hash in
|
|
// the snapshot slim data format.
|
|
func (dl *diskLayer) Account(hash common.Hash) (*types.SlimAccount, error) {
|
|
data, err := dl.AccountRLP(hash)
|
|
if err != nil {
|
|
return nil, err
|
|
}
|
|
if len(data) == 0 { // can be both nil and []byte{}
|
|
return nil, nil
|
|
}
|
|
account := new(types.SlimAccount)
|
|
if err := rlp.DecodeBytes(data, account); err != nil {
|
|
panic(err)
|
|
}
|
|
return account, nil
|
|
}
|
|
|
|
// AccountRLP directly retrieves the account RLP associated with a particular
|
|
// hash in the snapshot slim data format.
|
|
func (dl *diskLayer) AccountRLP(hash common.Hash) ([]byte, error) {
|
|
dl.lock.RLock()
|
|
defer dl.lock.RUnlock()
|
|
|
|
// If the layer was flattened into, consider it invalid (any live reference to
|
|
// the original should be marked as unusable).
|
|
if dl.stale {
|
|
return nil, ErrSnapshotStale
|
|
}
|
|
// If the layer is being generated, ensure the requested hash has already been
|
|
// covered by the generator.
|
|
if dl.genMarker != nil && bytes.Compare(hash[:], dl.genMarker) > 0 {
|
|
return nil, ErrNotCoveredYet
|
|
}
|
|
// If we're in the disk layer, all diff layers missed
|
|
snapshotDirtyAccountMissMeter.Mark(1)
|
|
|
|
// Try to retrieve the account from the memory cache
|
|
if blob, found := dl.cache.HasGet(nil, hash[:]); found {
|
|
snapshotCleanAccountHitMeter.Mark(1)
|
|
snapshotCleanAccountReadMeter.Mark(int64(len(blob)))
|
|
return blob, nil
|
|
}
|
|
// Cache doesn't contain account, pull from disk and cache for later
|
|
blob := rawdb.ReadAccountSnapshot(dl.diskdb, hash)
|
|
dl.cache.Set(hash[:], blob)
|
|
|
|
snapshotCleanAccountMissMeter.Mark(1)
|
|
if n := len(blob); n > 0 {
|
|
snapshotCleanAccountWriteMeter.Mark(int64(n))
|
|
} else {
|
|
snapshotCleanAccountInexMeter.Mark(1)
|
|
}
|
|
return blob, nil
|
|
}
|
|
|
|
// Storage directly retrieves the storage data associated with a particular hash,
|
|
// within a particular account.
|
|
func (dl *diskLayer) Storage(accountHash, storageHash common.Hash) ([]byte, error) {
|
|
dl.lock.RLock()
|
|
defer dl.lock.RUnlock()
|
|
|
|
// If the layer was flattened into, consider it invalid (any live reference to
|
|
// the original should be marked as unusable).
|
|
if dl.stale {
|
|
return nil, ErrSnapshotStale
|
|
}
|
|
key := append(accountHash[:], storageHash[:]...)
|
|
|
|
// If the layer is being generated, ensure the requested hash has already been
|
|
// covered by the generator.
|
|
if dl.genMarker != nil && bytes.Compare(key, dl.genMarker) > 0 {
|
|
return nil, ErrNotCoveredYet
|
|
}
|
|
// If we're in the disk layer, all diff layers missed
|
|
snapshotDirtyStorageMissMeter.Mark(1)
|
|
|
|
// Try to retrieve the storage slot from the memory cache
|
|
if blob, found := dl.cache.HasGet(nil, key); found {
|
|
snapshotCleanStorageHitMeter.Mark(1)
|
|
snapshotCleanStorageReadMeter.Mark(int64(len(blob)))
|
|
return blob, nil
|
|
}
|
|
// Cache doesn't contain storage slot, pull from disk and cache for later
|
|
blob := rawdb.ReadStorageSnapshot(dl.diskdb, accountHash, storageHash)
|
|
dl.cache.Set(key, blob)
|
|
|
|
snapshotCleanStorageMissMeter.Mark(1)
|
|
if n := len(blob); n > 0 {
|
|
snapshotCleanStorageWriteMeter.Mark(int64(n))
|
|
} else {
|
|
snapshotCleanStorageInexMeter.Mark(1)
|
|
}
|
|
return blob, nil
|
|
}
|
|
|
|
// Update creates a new layer on top of the existing snapshot diff tree with
|
|
// the specified data items. Note, the maps are retained by the method to avoid
|
|
// copying everything.
|
|
func (dl *diskLayer) Update(blockHash common.Hash, accounts map[common.Hash][]byte, storage map[common.Hash]map[common.Hash][]byte) *diffLayer {
|
|
return newDiffLayer(dl, blockHash, accounts, storage)
|
|
}
|
|
|
|
// stopGeneration requests cancellation of any running snapshot generation and
|
|
// blocks until the generator goroutine (if running) has fully terminated.
|
|
//
|
|
// Concurrency guarantees:
|
|
// - Thread-safe: May be called concurrently from multiple goroutines
|
|
// - Idempotent: Safe to call multiple times; subsequent calls have no effect
|
|
// - Blocking: Returns only after the generator goroutine (if any) has exited
|
|
// - Safe to call at any time, including when no generation is running
|
|
//
|
|
// After return, it is **guaranteed** that:
|
|
// - The generator goroutine has terminated
|
|
// - It is safe to proceed with cleanup operations (e.g. closing databases)
|
|
func (dl *diskLayer) stopGeneration() {
|
|
cancel := dl.cancel
|
|
done := dl.done
|
|
if cancel == nil || done == nil {
|
|
return
|
|
}
|
|
|
|
dl.cancelOnce.Do(func() {
|
|
close(cancel)
|
|
})
|
|
<-done
|
|
}
|