Commit graph

257 commits

Author SHA1 Message Date
Csaba Kiraly
33785aab21
p2p/discover: document BFS choice, add RandomWorkers split
Two related changes to CrawlIterator:

(1) Add a file-level commentary block explaining why the iterator uses a
FIFO queue (BFS over the FINDNODE-response graph) and what it is *not*
suitable for (target-directed lookup -- use RandomNodes() / the alpha=3
lookup iterator for that). The choice was inherited from dcrawl.nim
without explicit reasoning; making it visible avoids future readers
re-deriving the survey-vs-lookup distinction.

The BFS rationale is two-fold:

 - Coverage: BFS reaches every peer within N hops of the seeds in
   order, so a time-bounded run produces a representative sample of the
   reachable graph rather than a deep tendril through one sub-region.
 - Adversarial resilience: a peer returning malicious "neighbour"
   claims, dead-end peers, or eclipse-style sub-graphs cannot
   monopolise the worker pool, because pending work from other branches
   sits ahead of the attacker's responses in the queue. DFS would
   amplify each of these attacks.

(2) Add a RandomWorkers field to CrawlOptions. Of the Workers-sized
worker pool, the first (Workers - RandomWorkers) workers pop the FIFO
front (BFS), while RandomWorkers workers pop a uniform-random queue
index via swap-and-pop (O(1)). Total worker count is unchanged.

Default RandomWorkers = Workers / 4 (4 of 16 with the default
parallelism). At this ratio:

 - Cold-start cost is negligible: 12 of 16 workers still drain FIFO,
   so the first ~1s of a fresh crawl behaves like pure BFS.
 - 25% of pops break strict FIFO ordering, providing a mild
   anti-fingerprint defence against an attacker who could otherwise
   predict our processing order from the contents of their own
   FINDNODE responses.

Operators can override per-run via the new --random-workers CLI flag
on `devp2p discv4 crawl` and `discv5 crawl`. Negative value forces
pure BFS; positive value selects an explicit count.

The new TestCrawlIteratorRandomWorkers covers four pop-policy
configurations (all-fifo, all-random, half-half, default) and
asserts the iterator still terminates and emits each node exactly
once in each.
2026-05-07 14:41:58 +02:00
Csaba Kiraly
b026ef6bb7
p2p/discover: drop discv4 prefix-bit grind from CrawlIterator
The original CrawlIterator on the discv4 path generated FINDNODE targets
by grinding random pubkeys until their Keccak256 had a specific top-N-bit
prefix matching a per-call rotation index, then sending them. The aim was
to anchor each peer's response to a different /16 region of the global
keyspace.

Empirically (3 x 5-minute runs against mainnet bootnodes):

    mode          total mean ± std    mainnet mean ± std
    fast (grind)  5714 ± 117          549 ±  33
    fast-random   5306 ± 366          521 ± 124

Means are within 1σ of each other. The grind's only measurable benefit
is reduced run-to-run variance, not higher yield. For long-running
curated crawls (the production use case for cmd/devp2p) the variance
amortises away, so the simplification is worth taking.

Replace the grind with a plain crand.Read on the v4 target, drop the
randomTargetWithPrefix helper, log2Pow2 helper, and the v4-side
prefix-bit math from withDefaults. Drange becomes a v5-only knob and
its doc is updated to say so; the power-of-two requirement is gone.

discv5 is unchanged: it uses native distance rotation, not target
hashes, and was never affected by the grind.
2026-05-07 14:41:58 +02:00
Csaba Kiraly
6c0d848d9c
p2p/discover: add CrawlIterator for breadth-first FINDNODE walks
Add an enode.Iterator that drives discovery by issuing a single
FINDNODE per discovered peer, rotating the target through Drange
sub-regions of the keyspace. Compared to RandomNodes (which wraps an
alpha=3 Kademlia lookup that converges on a single target), this
shape is geared for breadth: each peer is asked about a different
slice of the keyspace, so aggregate coverage grows quickly without
per-peer overlap.

The two protocols expose different FINDNODE primitives, so the
iterator threads a per-protocol queryFn:

 * discv5 takes a list of distances natively, so we just pass
   [256-d] for d in 0..Drange-1.
 * discv4 takes a target NodeID and replies with the K closest. To
   get an equivalent rotation, we pick a random pubkey whose
   Keccak256 starts with the desired prefix nibble. With Drange=16
   that's ~16 random draws per call -- negligible compared to the
   network round trip.

Concurrency is bounded by Workers (default 16). There is intentionally
no rate limit: pacing is RTT-driven, ~Workers/RTT on the wire.

Termination is implicit: when the work queue is empty AND no FINDNODE
is in flight, the iterator closes its output and Next returns false.
Close() short-circuits this for callers that want to bail early.

Adapts the algorithm from github.com/cskiraly/fast-ethereum-crawler
(dcrawl.nim) -- the prefix-rotation idea -- but drops its 1000 req/s
rate limit in favour of the bounded worker pool.
2026-05-07 14:41:58 +02:00
rayoo
60db25b070
p2p/discover: restore nextTimeout update in UDPv4 resetTimeout loop (#34878)
Some checks are pending
/ Linux Build (push) Waiting to run
/ Linux Build (arm) (push) Waiting to run
/ Keeper Build (push) Waiting to run
/ Windows Build (push) Waiting to run
/ Docker Image (push) Waiting to run
The refactor from `for el := plist.Front(); ...; el = el.Next()` to the
new `iterList` iterator in #34743 silently dropped two things needed by
resetTimeout:

1. `nextTimeout = el.Value.(*replyMatcher)` at the top of the loop. This
assignment is what gives `nextTimeout` its documented meaning ("head of
plist when timeout was last reset"), and what makes the early-return
optimization at the top of resetTimeout work. Without it, nextTimeout is
only ever written to nil, so `nextTimeout == plist.Front().Value` is
always false and the optimization is dead.

2. `nextTimeout.errc <- errClockWarp` in the clock-warp branch now reads
a stale or nil pointer. Prior to the refactor, the inner assignment kept
nextTimeout pointing at the current matcher so its errc was the right
channel to receive the errClockWarp signal. After the refactor, on first
entry into the clock-warp branch nextTimeout is nil, which panics the
UDPv4 loop goroutine with a nil pointer deref and takes discv4 down.

Re-assign `nextTimeout = p` at the head of the loop (restoring the
documented invariant) and send the clock-warp error on `p.errc` rather
than the now-stale `nextTimeout.errc`.

The clock-warp branch triggers only when the system clock jumps backward
after a deadline is assigned (deadline - time.Now() >= 2*respTimeout,
i.e. at least ~500ms backward jump), which is why this regression
slipped past CI - it is not exercised by any existing unit test, and
writing one would require plumbing a clock through the loop.
2026-05-05 15:28:28 +02:00
Rahman
51c97216c5
p2p/discover: fix timeout loop early exit when removing expired matchers (#34743)
Save `el.Next()` before calling `plist.Remove(el)` so iteration
continues correctly. Previously the loop exited after removing the first
expired matcher because `Remove` invalidates the element's links.

---------

Co-authored-by: Felix Lange <fjl@twurst.com>
2026-04-28 10:57:58 +02:00
Charles Dusek
e1fe4a1a98
p2p/discover: fix flaky TestUDPv5_findnodeHandling (#34109)
Fixes #34108

The UDPv5 test harness (`newUDPV5Test`) uses the default `PingInterval`
of 3 seconds. When tests like `TestUDPv5_findnodeHandling` insert nodes
into the routing table via `fillTable`, the table's revalidation loop
may schedule PING packets for those nodes. Under the race detector or on
slow CI runners, the test runs long enough for revalidation to fire,
causing background pings to be written to the test pipe. The `close()`
method then finds these as unmatched packets and fails.

The fix sets `PingInterval` to a very large value in the test harness so
revalidation never fires during tests.

Verified locally: 100 iterations with `-race -count=100` pass reliably,
where previously the test would fail within ~50 iterations.
2026-04-14 09:43:44 +02:00
Charles Dusek
a2496852e9
p2p/discover: resolve DNS hostnames for bootstrap nodes (#34101)
Fixes #31208
2026-03-28 11:37:39 +01:00
Felix Lange
00cbd2e6f4
p2p/discover/v5wire: use Whoareyou.ChallengeData instead of storing encoded packet (#31547)
Some checks are pending
/ Linux Build (push) Waiting to run
/ Linux Build (arm) (push) Waiting to run
/ Keeper Build (push) Waiting to run
/ Windows Build (push) Waiting to run
/ Docker Image (push) Waiting to run
This changes the challenge resend logic again to use the existing
`ChallengeData` field of `v5wire.Whoareyou` instead of storing a second
copy of the packet in `Whoareyou.Encoded`. It's more correct this way
since `ChallengeData` is supposed to be the data that is used by the ID
verification procedure.

Also adapts the cross-client test to verify this behavior.

Follow-up to #31543
2026-02-22 21:58:47 +01:00
oxBoni
1468331f9d
p2p/discover/v5wire: remove redundant bytes clone in WHOAREYOU encoding (#33180)
head.AuthData is assigned later in the function, so the earlier assignment
can safely be removed.
2025-11-26 15:34:11 +01:00
Felix Lange
7c107c2691
p2p/discover: remove hot-spin in table refresh trigger (#32912)
Some checks are pending
/ Linux Build (push) Waiting to run
/ Linux Build (arm) (push) Waiting to run
/ Windows Build (push) Waiting to run
/ Docker Image (push) Waiting to run
This fixes a regression introduced in #32518. In that PR, we removed the
slowdown logic that would throttle lookups when the table runs empty.
Said logic was originally added in #20389.

Usually it's fine, but there exist pathological cases, such as hive
tests, where the node can only discover one other node, so it can only
ever query that node and won't get any results. In cases like these, we
need to throttle the creation of lookups to avoid crazy CPU usage.
2025-10-15 11:51:33 +02:00
Delweng
6337577434
p2p/discover: wait for bootstrap to be done (#32881)
Some checks are pending
/ Linux Build (push) Waiting to run
/ Linux Build (arm) (push) Waiting to run
/ Windows Build (push) Waiting to run
/ Docker Image (push) Waiting to run
This ensures the node is ready to accept other nodes into the
table before it is used in a test.

Closes #32863
2025-10-13 19:58:50 +02:00
Delweng
85e9977fae
p2p: rm unused var seedMinTableTime (#32876) 2025-10-13 16:40:08 +08:00
cui
64c6de7747
p2p: using testing.B.Loop (#32664) 2025-09-19 16:38:36 -06:00
Csaba Kiraly
de9fb9722b
revert to using table parameter
using it.lookup.tab inside is unsafe

Signed-off-by: Csaba Kiraly <csaba.kiraly@gmail.com>
2025-09-17 09:04:41 +02:00
Csaba Kiraly
3589c0d59b
p2p/discover: expose timeout in lookupFailed
Signed-off-by: Csaba Kiraly <csaba.kiraly@gmail.com>

# Conflicts:
#	p2p/discover/lookup.go
2025-09-16 14:03:11 +02:00
Felix Lange
0643427965 p2p/discover: continue 2025-09-12 12:50:07 +02:00
Felix Lange
68c18ede06
Update lookup.go 2025-09-12 11:34:44 +02:00
Csaba Kiraly
97afa2815b
Revert "p2p/discover: add test for lookup returning immediately"
This reverts commit 3eab4616a6.
2025-09-12 11:29:43 +02:00
Csaba Kiraly
3eab4616a6
p2p/discover: add test for lookup returning immediately
Signed-off-by: Csaba Kiraly <csaba.kiraly@gmail.com>
2025-09-12 10:59:29 +02:00
Csaba Kiraly
72d3e881b3
p2p/discover: clarify lookup behavior on empty table
We have changed this behavior, better clarify in comment.

Signed-off-by: Csaba Kiraly <csaba.kiraly@gmail.com>
2025-09-12 10:52:53 +02:00
Felix Lange
a9f9e0d589 p2p/discover: add imports in test 2025-09-10 20:10:51 +02:00
Felix Lange
3133fd369a p2p/discover: remove print in test 2025-09-10 20:10:51 +02:00
Felix Lange
3946708935 p2p/discover: fix two bugs in lookup iterator
The lookup would add self into the replyBuffer if returned by another node.
Avoid doing that by marking self as seen.

With the changed initialization behavior of lookup, the lookupIterator needs to yield the
buffer right after creation. This fixes the smallNetConvergence test, where all results
are straight out of the local table.
2025-09-10 20:10:51 +02:00
Felix Lange
cf0503da7c p2p/discover: track missing nodes in test 2025-09-10 20:10:51 +02:00
Felix Lange
721c8de738 p2p/discover: trigger refresh in lookupIterator 2025-09-10 20:10:51 +02:00
Felix Lange
e58e7f7927 p2p/discover: fix bug in lookup 2025-09-10 20:10:51 +02:00
Felix Lange
4ed8f5ee2b p2p/discover: improve iterator 2025-09-10 20:10:51 +02:00
Felix Lange
f4046b0cfb p2p/discover: move wait condition to lookupIterator 2025-09-10 20:10:51 +02:00
Felix Lange
f8e0e8dc55 p2p/discover: add context in waitForNodes 2025-09-10 20:10:51 +02:00
Felix Lange
46e4f0b5c1 p2p/discover: add waitForNodes 2025-09-10 20:10:51 +02:00
Csaba Kiraly
1f7f95d718
p2p/discover: remove delay from discv5 RandomNodes (#32517)
Some checks are pending
/ Linux Build (push) Waiting to run
/ Linux Build (arm) (push) Waiting to run
/ Windows Build (push) Waiting to run
/ Docker Image (push) Waiting to run
Refresh is doing some lookups and thus it could block for some time. We
do not want the initializer of an iterator to block. If there is
something blocking, it should happen when calling Next.

Here, next will start a lookup, which will wait if needed (no nodes),
making sure the iterator's Next is not creating a busy loop.

Signed-off-by: Csaba Kiraly <csaba.kiraly@gmail.com>
2025-09-10 19:51:04 +02:00
cui
9b2e8e7ce3
p2p: use slices.Clone (#32428)
Some checks are pending
/ Linux Build (push) Waiting to run
/ Linux Build (arm) (push) Waiting to run
/ Windows Build (push) Waiting to run
/ Docker Image (push) Waiting to run
Replaces a helper method with slices.Clone
2025-08-25 11:30:51 +02:00
Ocenka
276ed4848c
p2p/discover: add discv5 invalid findnodes result test cases (#32481)
Some checks failed
/ Linux Build (push) Has been cancelled
/ Linux Build (arm) (push) Has been cancelled
/ Windows Build (push) Has been cancelled
/ Docker Image (push) Has been cancelled
Supersedes #32470.

### What
- snap: shorten stall watchdog in `eth/protocols/snap/sync_test.go` from
1m to 10s.
- discover/v5: consolidate FINDNODE negative tests into a single
table-driven test:
  - `TestUDPv5_findnodeCall_InvalidNodes` covers:
    - invalid IP (unspecified `0.0.0.0`) → ignored
    - low UDP port (`<=1024`) → ignored

### Why
- Addresses TODOs:
  - “Make tests smaller” (reduce long 1m timeout).
- “check invalid IPs”; also cover low port per `verifyResponseNode`
rules (UDP must be >1024).

### How it’s validated
- Test-only changes; no production code touched.
- Local runs:
  - `go test ./p2p/discover -count=1 -timeout=300s` → ok
  - `go test ./eth/protocols/snap -count=1 -timeout=600s` → ok
- Lint:
  - `go run build/ci.go lint` → 0 issues on modified files.

### Notes
- The test harness uses `enode.ValidSchemesForTesting` (which includes
the “null” scheme), so records signed with `enode.SignNull` are
signature-valid; failures here are due to IP/port validation in
`verifyResponseNode` and `netutil.CheckRelayAddr`.
- Tests are written as a single table-driven function for clarity; no
helpers or environment switching.

---------

Co-authored-by: lightclient <lightclient@protonmail.com>
2025-08-22 11:44:11 -06:00
asamuj
d7db10ddbd
eth/protocols/snap, p2p/discover: improve zero time checks (#32214)
Some checks are pending
/ Linux Build (push) Waiting to run
/ Linux Build (arm) (push) Waiting to run
/ Windows Build (push) Waiting to run
/ Docker Image (push) Waiting to run
2025-07-15 14:20:45 +08:00
thinkAfCod
d2176f463b
p2p/discover: pass node instead of node ID to TALKREQ handler (#31075)
This is for the implementation of Portal Network in the Shisui client.
Their handler needs access to the node object in order to send further
calls to the requesting node. This is a breaking API change but it
should be fine, since there are basically no known users of TALKREQ
outside of Portal network.

---------

Signed-off-by: thinkAfCod <q315xia@163.com>
Co-authored-by: Felix Lange <fjl@twurst.com>
2025-04-02 14:56:21 +02:00
thinkAfCod
3e4fbce034
p2p/discover: repeat exact encoding when resending WHOAREYOU packet (#31543)
When resending the WHOAREYOU packet, a new nonce and random IV should not
be generated. The sent packet needs to match the previously-sent one exactly
in order to make the handshake retry work.

---------

Co-authored-by: Felix Lange <fjl@twurst.com>
2025-04-02 13:47:44 +02:00
Felix Lange
9eb610f0a9
p2p/discover: repeat WHOAREYOU challenge when handshake in progress (#31356)
This fixes the handshake in a scenario where the remote end sends two unknown
packets in a row. When this happens, we would previously respond to both with
a WHOAREYOU challenge, but keep only the latest sent challenge. Transmission is
assumed to be unreliable, so any client that sends two request packets simultaneously
has to be prepared to follow up on whichever request leads to a handshake. With
this fix, we force them to do the handshake that we can actually complete.

Fixes #30581
2025-03-20 17:11:40 +01:00
Chen Kai
5117f77af9
p2p/discover: expose discv5 functions for portal JSON-RPC interface (#31117)
Fixes #31093

Here we add some API functions on the UDPv5 object for the purpose of implementing
the Portal Network JSON-RPC API in the shisui client.

---------

Signed-off-by: Chen Kai <281165273grape@gmail.com>
2025-03-13 15:16:01 +01:00
Chen Kai
22b9354494
p2p/discover: make discv5 response timeout configurable (#31119) 2025-02-11 13:52:43 +01:00
Harry Ngo
d2ca7cf9f1
p2p/discover: remove unused parameter in revalidationList.get (#31155) 2025-02-11 13:45:44 +01:00
georgehao
1843f27766
all: fix some typos in comments and names (#31023) 2025-01-14 14:16:15 +01:00
Martin HS
9045b79bc2
metrics, cmd/geth: change init-process of metrics (#30814)
This PR modifies how the metrics library handles `Enabled`: previously,
the package `init` decided whether to serve real metrics or just
dummy-types.

This has several drawbacks: 
- During pkg init, we need to determine whether metrics are enabled or
not. So we first hacked in a check if certain geth-specific
commandline-flags were enabled. Then we added a similar check for
geth-env-vars. Then we almost added a very elaborate check for
toml-config-file, plus toml parsing.

- Using "real" types and dummy types interchangeably means that
everything is hidden behind interfaces. This has a performance penalty,
and also it just adds a lot of code.

This PR removes the interface stuff, uses concrete types, and allows for
the setting of Enabled to happen later. It is still assumed that
`metrics.Enable()` is invoked early on.

The somewhat 'heavy' operations, such as ticking meters and exp-decay,
now checks the enable-flag to prevent resource leak.

The change may be large, but it's mostly pretty trivial, and from the
last time I gutted the metrics, I ensured that we have fairly good test
coverage.

---------

Co-authored-by: Felix Lange <fjl@twurst.com>
2024-12-10 13:27:29 +01:00
Martin HS
5adc314817
build: update to golangci-lint 1.61.0 (#30587)
Changelog: https://golangci-lint.run/product/changelog/#1610 

Removes `exportloopref` (no longer needed), replaces it with
`copyloopvar` which is basically the opposite.

Also adds: 
- `durationcheck`
- `gocheckcompilerdirectives`
- `reassign`
- `mirror`
- `tenv`

---------

Co-authored-by: Marius van der Wijden <m.vanderwijden@live.de>
2024-10-14 19:25:22 +02:00
Felix Lange
6b61b54dc7
p2p/discover: add config option for disabling FINDNODE liveness check (#30512)
This is for fixing Prysm integration tests.
2024-09-30 10:56:14 +02:00
Martin HS
b5a88dafae
p2p/discover: fix flaky tests writing to test.log after completion (#30506)
This PR fixes two tests, which had a tendency to sometimes write to the `*testing.T` `log` facility after the test function had completed, which is not allowed. This PR fixes it by using waitgroups to ensure that the handler/logwriter terminates before the test exits.

closes #30505
2024-09-26 08:12:12 +02:00
Nicolas Gotchac
87377c58bc
p2p/discover: fix Write method in metered connection (#30355)
`WriteToUDP` was never called, since `meteredUdpConn` exposed directly
all the methods from the underlying `UDPConn` interface.

This fixes the `discover/egress` metric never being updated.
2024-08-27 14:10:32 +02:00
lightclient
00294e9d28
cmd/utils,p2p: enable discv5 by default (#30327) 2024-08-20 16:02:54 +02:00
Daniel Knopik
de6d597679
p2p/discover: schedule revalidation also when all nodes are excluded (#30239)
## Issue

If `nextTime` has passed, but all nodes are excluded, `get` would return
`nil` and `run` would therefore not invoke `schedule`. Then, we schedule
a timer for the past, as neither `nextTime` value has been updated. This
creates a busy loop, as the timer immediately returns.

## Fix

With this PR, revalidation will be also rescheduled when all nodes are
excluded.

---------

Co-authored-by: lightclient <lightclient@protonmail.com>
2024-07-31 21:38:23 +02:00
Felix Lange
ad49c708f5
p2p/discover: remove type encPubkey (#30172)
The pubkey type was moved to package v4wire a long time ago. Remaining uses of
encPubkey were probably left in due to laziness.
2024-07-18 11:09:02 +02:00
Halimao
a71f6f91fd
p2p/discover: improve flaky revalidation tests (#30023) 2024-06-21 15:29:07 +02:00