The refactor from `for el := plist.Front(); ...; el = el.Next()` to the
new `iterList` iterator in #34743 silently dropped two things needed by
resetTimeout:
1. `nextTimeout = el.Value.(*replyMatcher)` at the top of the loop. This
assignment is what gives `nextTimeout` its documented meaning ("head of
plist when timeout was last reset"), and what makes the early-return
optimization at the top of resetTimeout work. Without it, nextTimeout is
only ever written to nil, so `nextTimeout == plist.Front().Value` is
always false and the optimization is dead.
2. `nextTimeout.errc <- errClockWarp` in the clock-warp branch now reads
a stale or nil pointer. Prior to the refactor, the inner assignment kept
nextTimeout pointing at the current matcher so its errc was the right
channel to receive the errClockWarp signal. After the refactor, on first
entry into the clock-warp branch nextTimeout is nil, which panics the
UDPv4 loop goroutine with a nil pointer deref and takes discv4 down.
Re-assign `nextTimeout = p` at the head of the loop (restoring the
documented invariant) and send the clock-warp error on `p.errc` rather
than the now-stale `nextTimeout.errc`.
The clock-warp branch triggers only when the system clock jumps backward
after a deadline is assigned (deadline - time.Now() >= 2*respTimeout,
i.e. at least ~500ms backward jump), which is why this regression
slipped past CI - it is not exercised by any existing unit test, and
writing one would require plumbing a clock through the loop.
Save `el.Next()` before calling `plist.Remove(el)` so iteration
continues correctly. Previously the loop exited after removing the first
expired matcher because `Remove` invalidates the element's links.
---------
Co-authored-by: Felix Lange <fjl@twurst.com>
Fixes#34108
The UDPv5 test harness (`newUDPV5Test`) uses the default `PingInterval`
of 3 seconds. When tests like `TestUDPv5_findnodeHandling` insert nodes
into the routing table via `fillTable`, the table's revalidation loop
may schedule PING packets for those nodes. Under the race detector or on
slow CI runners, the test runs long enough for revalidation to fire,
causing background pings to be written to the test pipe. The `close()`
method then finds these as unmatched packets and fails.
The fix sets `PingInterval` to a very large value in the test harness so
revalidation never fires during tests.
Verified locally: 100 iterations with `-race -count=100` pass reliably,
where previously the test would fail within ~50 iterations.
This changes the challenge resend logic again to use the existing
`ChallengeData` field of `v5wire.Whoareyou` instead of storing a second
copy of the packet in `Whoareyou.Encoded`. It's more correct this way
since `ChallengeData` is supposed to be the data that is used by the ID
verification procedure.
Also adapts the cross-client test to verify this behavior.
Follow-up to #31543
This changes the p2p protocol handlers to delay message decoding. It's
the first part of a larger change that will delay decoding all the way
through message processing. For responses, we delay the decoding until
it is confirmed that the response matches an active request and does not
exceed its limits.
In order to make this work, all messages have been changed to use
rlp.RawList instead of a slice of the decoded item type. For block
bodies specifically, the decoding has been delayed all the way until
after verification of the response hash.
The role of p2p/tracker.Tracker changes significantly in this PR. The
Tracker's original purpose was to maintain metrics about requests and
responses in the peer-to-peer protocols. Each protocol maintained a
single global Tracker instance. As of this change, the Tracker is now
always active (regardless of metrics collection), and there is a
separate instance of it for each peer. Whenever a response arrives, it
is first verified that a request exists for it in the tracker. The
tracker is also the place where limits are kept.
The upstream libray has removed the assembly-based implementation of
keccak. We need to maintain our own library to avoid a peformance
regression.
---------
Co-authored-by: lightclient <lightclient@protonmail.com>
Fix ECIES invalid-curve handling in RLPx handshake (reject invalid
ephemeral pubkeys early)
- Add curve validation in crypto/ecies.GenerateShared to reject invalid
public keys before ECDH.
- Update RLPx PoC test to assert invalid curve points fail with
ErrInvalidPublicKey.
Motivation / Context
RLPx handshake uses ECIES decryption on unauthenticated network input.
Prior to this change, an invalid-curve ephemeral public key would
proceed into ECDH and only fail at MAC verification, returning
ErrInvalidMessage. This allows an oracle on decrypt success/failure and
leaves the code path vulnerable to invalid-curve/small-subgroup attacks.
The fix enforces IsOnCurve validation up front.
XORBytes was added to package crypto/subtle in Go 1.20, and it's faster
than our bitutil.XORBytes. There is only one use of this function
across go-ethereum so we can simply deprecate the custom implementation.
---------
Co-authored-by: Felix Lange <fjl@twurst.com>
The random-port retry loop in addAnyPortMapping shadowed the err
variable, causing the function to return (0, nil) when all attempts
failed. This change removes the shadowing and preserves the last error
across both the fixed-port and random-port retries, ensuring failures
are reported to callers correctly.
This fixes a regression introduced in #32518. In that PR, we removed the
slowdown logic that would throttle lookups when the table runs empty.
Said logic was originally added in #20389.
Usually it's fine, but there exist pathological cases, such as hive
tests, where the node can only discover one other node, so it can only
ever query that node and won't get any results. In cases like these, we
need to throttle the creation of lookups to avoid crazy CPU usage.
Description:
We found a occasionally node hang issue on BSC, I think Geth may
also have the issue, so pick the fix patch here.
The fix on BSC repo: https://github.com/bnb-chain/bsc/pull/3347
When the hang occurs, there are two routines stuck.
- routine 1: AsyncFilter(...)
On node start, it will run part of the DiscoveryV4 protocol, which could
take considerable time, here is its hang callstack:
```
goroutine 9711 [chan receive]: // this routine was stuck on read channel: `<-f.slots`
github.com/ethereum/go-ethereum/p2p/enode.AsyncFilter.func1()
github.com/ethereum/go-ethereum/p2p/enode/iter.go:206 +0x125
created by github.com/ethereum/go-ethereum/p2p/enode.AsyncFilter in goroutine 1
github.com/ethereum/go-ethereum/p2p/enode/iter.go:192 +0x205
```
- Routine 2: Node Stop
It is the main routine to shutdown the process, but it got stuck when it
tries to shutdown the discovery components, as it tries to drain the
channel of `<-f.slots`, but the extra 1 slot will never have chance to
be resumed.
```
goroutine 11796 [chan receive]:
github.com/ethereum/go-ethereum/p2p/enode.(*asyncFilterIter).Close.func1()
github.com/ethereum/go-ethereum/p2p/enode/iter.go:248 +0x5c
sync.(*Once).doSlow(0xc032a97cb8?, 0xc032a97d18?)
sync/once.go:78 +0xab
sync.(*Once).Do(...)
sync/once.go:69
github.com/ethereum/go-ethereum/p2p/enode.(*asyncFilterIter).Close(0xc092ff8d00?)
github.com/ethereum/go-ethereum/p2p/enode/iter.go:244 +0x36
github.com/ethereum/go-ethereum/p2p/enode.(*bufferIter).Close.func1()
github.com/ethereum/go-ethereum/p2p/enode/iter.go:299 +0x24
sync.(*Once).doSlow(0x11a175f?, 0x2bfe63e?)
sync/once.go:78 +0xab
sync.(*Once).Do(...)
sync/once.go:69
github.com/ethereum/go-ethereum/p2p/enode.(*bufferIter).Close(0x30?)
github.com/ethereum/go-ethereum/p2p/enode/iter.go:298 +0x36
github.com/ethereum/go-ethereum/p2p/enode.(*FairMix).Close(0xc0004bfea0)
github.com/ethereum/go-ethereum/p2p/enode/iter.go:379 +0xb7
github.com/ethereum/go-ethereum/eth.(*Ethereum).Stop(0xc000997b00)
github.com/ethereum/go-ethereum/eth/backend.go:960 +0x4a
github.com/ethereum/go-ethereum/node.(*Node).stopServices(0xc0001362a0, {0xc012e16330, 0x1, 0xc000111410?})
github.com/ethereum/go-ethereum/node/node.go:333 +0xb3
github.com/ethereum/go-ethereum/node.(*Node).Close(0xc0001362a0)
github.com/ethereum/go-ethereum/node/node.go:263 +0x167
created by github.com/ethereum/go-ethereum/cmd/utils.StartNode.func1.1 in goroutine 9729
github.com/ethereum/go-ethereum/cmd/utils/cmd.go:101 +0x78
```
The rootcause of the hang is caused by the extra 1 slot, which was
designed to make sure the routines in `AsyncFilter(...)` can be
finished. This PR fixes it by making sure the extra 1 shot can always be
resumed when node shutdown.
The lookup would add self into the replyBuffer if returned by another node.
Avoid doing that by marking self as seen.
With the changed initialization behavior of lookup, the lookupIterator needs to yield the
buffer right after creation. This fixes the smallNetConvergence test, where all results
are straight out of the local table.
Refresh is doing some lookups and thus it could block for some time. We
do not want the initializer of an iterator to block. If there is
something blocking, it should happen when calling Next.
Here, next will start a lookup, which will wait if needed (no nodes),
making sure the iterator's Next is not creating a busy loop.
Signed-off-by: Csaba Kiraly <csaba.kiraly@gmail.com>
Supersedes #32470.
### What
- snap: shorten stall watchdog in `eth/protocols/snap/sync_test.go` from
1m to 10s.
- discover/v5: consolidate FINDNODE negative tests into a single
table-driven test:
- `TestUDPv5_findnodeCall_InvalidNodes` covers:
- invalid IP (unspecified `0.0.0.0`) → ignored
- low UDP port (`<=1024`) → ignored
### Why
- Addresses TODOs:
- “Make tests smaller” (reduce long 1m timeout).
- “check invalid IPs”; also cover low port per `verifyResponseNode`
rules (UDP must be >1024).
### How it’s validated
- Test-only changes; no production code touched.
- Local runs:
- `go test ./p2p/discover -count=1 -timeout=300s` → ok
- `go test ./eth/protocols/snap -count=1 -timeout=600s` → ok
- Lint:
- `go run build/ci.go lint` → 0 issues on modified files.
### Notes
- The test harness uses `enode.ValidSchemesForTesting` (which includes
the “null” scheme), so records signed with `enode.SignNull` are
signature-valid; failures here are due to IP/port validation in
`verifyResponseNode` and `netutil.CheckRelayAddr`.
- Tests are written as a single table-driven function for clarity; no
helpers or environment switching.
---------
Co-authored-by: lightclient <lightclient@protonmail.com>