Two related changes to CrawlIterator:
(1) Add a file-level commentary block explaining why the iterator uses a
FIFO queue (BFS over the FINDNODE-response graph) and what it is *not*
suitable for (target-directed lookup -- use RandomNodes() / the alpha=3
lookup iterator for that). The choice was inherited from dcrawl.nim
without explicit reasoning; making it visible avoids future readers
re-deriving the survey-vs-lookup distinction.
The BFS rationale is two-fold:
- Coverage: BFS visits every peer within N hops of the seeds before
any peer at N+1 hops, so a time-bounded run produces a representative
sample of the reachable graph rather than a deep tendril through one
sub-region.
- Adversarial resilience: peers returning malicious "neighbour"
claims, dead-end peers, and eclipse-style sub-graphs cannot
monopolise the worker pool, because pending work from other branches
sits ahead of the attacker's responses in the queue. DFS would
amplify each of these attacks.
(2) Add a RandomWorkers field to CrawlOptions. Of the Workers-sized
worker pool, the first (Workers - RandomWorkers) workers pop the FIFO
front (BFS), while the remaining RandomWorkers workers pop a
uniform-random queue index via swap-and-pop (O(1)). Total worker count
is unchanged.
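The two pop policies can be sketched as below. This is a minimal
illustration, not the code in this change: the queue type, its string
items, and the pick callback are hypothetical stand-ins for the
iterator's real queue and random source.

```go
package main

import (
	"fmt"
	"math/rand"
)

// queue is a hypothetical stand-in for the iterator's pending-work queue.
type queue struct {
	items []string
}

// popFront removes and returns the oldest item (the BFS policy).
func (q *queue) popFront() (string, bool) {
	if len(q.items) == 0 {
		return "", false
	}
	it := q.items[0]
	q.items = q.items[1:]
	return it, true
}

// popRandom removes the item at index pick(len) in O(1): the chosen slot
// is overwritten with the last element and the slice shrinks by one.
// pick must return a value in [0, n); rand.Intn satisfies that.
func (q *queue) popRandom(pick func(n int) int) (string, bool) {
	n := len(q.items)
	if n == 0 {
		return "", false
	}
	i := pick(n)
	it := q.items[i]
	q.items[i] = q.items[n-1]
	q.items = q.items[:n-1]
	return it, true
}

func main() {
	q := &queue{items: []string{"a", "b", "c", "d", "e"}}
	rng := rand.New(rand.NewSource(1))

	first, _ := q.popFront()
	random, _ := q.popRandom(rng.Intn)
	fmt.Println("fifo pop:", first, "random pop:", random, "left:", q.items)
}
```

One property of swap-and-pop worth keeping in mind: moving the tail
element into the vacated slot also changes the order the FIFO workers
see afterwards, so the queue is only approximately FIFO once random
pops are enabled.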
Default RandomWorkers = Workers / 4 (4 of 16 with the default
parallelism). At this ratio:
- Cold-start cost is negligible: 12 of 16 workers still drain FIFO,
so the first ~1s of a fresh crawl behaves like pure BFS.
- Roughly 25% of pops break strict FIFO ordering, providing a mild
anti-fingerprint defence against an attacker who could otherwise
predict our processing order from the contents of their own
FINDNODE responses.
Operators can override per-run via the new --random-workers CLI flag
on `devp2p discv4 crawl` and `devp2p discv5 crawl`. A negative value
forces pure BFS; a positive value selects an explicit count.
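The mapping from the flag value to a worker split could look roughly
like the sketch below. resolveRandomWorkers is a hypothetical helper,
not a function from this change. Two behaviours are assumptions that
the text above does not state: that zero selects the Workers / 4
default, and that a count larger than Workers is clamped.

```go
package main

import "fmt"

// resolveRandomWorkers is a hypothetical helper that turns the requested
// RandomWorkers value into the number of random-pop workers.
func resolveRandomWorkers(workers, requested int) int {
	switch {
	case requested < 0:
		return 0 // negative: pure BFS, every worker pops the FIFO front
	case requested == 0:
		return workers / 4 // assumption: zero means "use the default"
	case requested > workers:
		return workers // assumption: clamp to the pool size
	default:
		return requested // positive: explicit count
	}
}

func main() {
	const workers = 16
	for _, requested := range []int{-1, 0, 8, 16} {
		random := resolveRandomWorkers(workers, requested)
		fmt.Printf("requested=%d random=%d fifo=%d\n",
			requested, random, workers-random)
	}
}
```

With Workers = 16, the four requested values in main correspond to the
all-fifo, default, half-half, and all-random configurations.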
The new TestCrawlIteratorRandomWorkers covers four pop-policy
configurations (all-fifo, all-random, half-half, default) and
asserts that in every configuration the iterator still terminates
and emits each node exactly once.
Add an enode.Iterator that drives discovery by issuing a single
FINDNODE per discovered peer, rotating the target through Drange
sub-regions of the keyspace. Compared to RandomNodes (which wraps an
alpha=3 Kademlia lookup that converges on a single target), this
shape is geared for breadth: each peer is asked about a different
slice of the keyspace, so aggregate coverage grows quickly without
per-peer overlap.
The two protocols expose different FINDNODE primitives, so the
iterator threads a per-protocol queryFn:
* discv5 takes a list of distances natively, so we just pass
[256-d] for d in 0..Drange-1.
* discv4 takes a target NodeID and replies with the K closest. To
get an equivalent rotation, we pick a random pubkey whose
Keccak256 starts with the desired prefix nibble. With Drange=16
that's ~16 random draws per call on average (each draw matches a
given nibble with probability 1/16) -- negligible compared to the
network round trip.
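A rough sketch of the two target shapes follows. It is illustrative
only and makes several assumptions to stay stdlib-only: SHA-256 stands
in for Keccak256, random 64-byte strings stand in for real secp256k1
public keys, rotation step d is mapped directly to prefix nibble d, and
all function names are hypothetical.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"fmt"
)

// dRange is the number of keyspace slices in the rotation (16 in the text).
const dRange = 16

// discv5Distances returns the distance list for rotation step d,
// following the "[256-d] for d in 0..Drange-1" description.
func discv5Distances(d int) []uint {
	return []uint{uint(256 - d)}
}

// prefixNibble returns the top 4 bits of the hash of key.
// Assumption: SHA-256 is a stand-in for Keccak256.
func prefixNibble(key []byte) byte {
	h := sha256.Sum256(key)
	return h[0] >> 4
}

// randomTargetWithNibble draws random 64-byte keys until the hash of one
// starts with the wanted nibble (rejection sampling). It also reports how
// many draws were needed; the expectation is 16 for a 4-bit prefix.
func randomTargetWithNibble(nibble byte) (key [64]byte, draws int, err error) {
	for {
		draws++
		if _, err = rand.Read(key[:]); err != nil {
			return key, draws, err
		}
		if prefixNibble(key[:]) == nibble&0x0f {
			return key, draws, nil
		}
	}
}

func main() {
	for d := 0; d < dRange; d++ {
		key, draws, err := randomTargetWithNibble(byte(d))
		if err != nil {
			panic(err)
		}
		fmt.Printf("step %2d: discv5 distances=%v, discv4 prefix=%x after %d draws\n",
			d, discv5Distances(d), prefixNibble(key[:]), draws)
	}
}
```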
Concurrency is bounded by Workers (default 16). There is intentionally
no rate limit: pacing is RTT-driven, roughly Workers/RTT requests per
second on the wire (about 160 req/s with 16 workers and a 100 ms RTT).
Termination is implicit: when the work queue is empty AND no FINDNODE
is in flight, the iterator closes its output and Next returns false.
Close() short-circuits this for callers that want to bail early.
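The termination rule can be sketched as follows. This is a simplified
model under stated assumptions, not the iterator itself: an in-memory
adjacency map stands in for FINDNODE round trips, crawl is a
hypothetical function name, all workers pop the FIFO front, and the
early-exit path of Close() is not modelled.

```go
package main

import (
	"fmt"
	"sync"
)

// crawl visits every node reachable from seeds using a bounded worker
// pool and returns once the queue is empty AND no query is in flight.
func crawl(graph map[string][]string, seeds []string, workers int) []string {
	var (
		mu       sync.Mutex
		cond     = sync.NewCond(&mu)
		queue    []string
		seen     = map[string]bool{}
		inFlight int
		out      []string
		wg       sync.WaitGroup
	)
	for _, s := range seeds {
		if !seen[s] {
			seen[s] = true
			queue = append(queue, s)
		}
	}
	worker := func() {
		defer wg.Done()
		mu.Lock()
		defer mu.Unlock()
		for {
			// Nothing to pop, but an in-flight query may still add work.
			for len(queue) == 0 && inFlight > 0 {
				cond.Wait()
			}
			if len(queue) == 0 {
				// Empty queue and nothing in flight: the crawl is done.
				cond.Broadcast()
				return
			}
			node := queue[0]
			queue = queue[1:]
			inFlight++
			mu.Unlock()
			neighbours := graph[node] // stands in for one FINDNODE round trip
			mu.Lock()
			inFlight--
			out = append(out, node)
			for _, n := range neighbours {
				if !seen[n] {
					seen[n] = true
					queue = append(queue, n)
				}
			}
			cond.Broadcast()
		}
	}
	wg.Add(workers)
	for i := 0; i < workers; i++ {
		go worker()
	}
	wg.Wait()
	return out
}

func main() {
	graph := map[string][]string{
		"a": {"b", "c"}, "b": {"d"}, "c": {"d"}, "d": {"a"},
	}
	fmt.Println(len(crawl(graph, []string{"a"}, 4)), "nodes emitted")
}
```

Checking only the queue length would not be enough: a worker could see
an empty queue while another worker's query is about to enqueue new
peers, which is why the in-flight counter is part of the condition.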
Adapts the algorithm from github.com/cskiraly/fast-ethereum-crawler
(dcrawl.nim) -- the prefix-rotation idea -- but drops its 1000 req/s
rate limit in favour of the bounded worker pool.