Preallocates slices with known capacity in `stateSet.encode()` and
`StateSetWithOrigin.encode()` methods to eliminate redundant
reallocations during serialization.
Preallocate capacity for `keyOffsets` and `valOffsets` slices in
`decodeRestartTrailer` since the exact size (`nRestarts`) is known
upfront.
---------
Co-authored-by: rjl493456442 <garyrong0905@gmail.com>
This PR optimizes the historical trie node reader by reworking how data
is accessed and memory is managed, reducing allocation overhead
significantly.
Specifically:
- Instead of decoding an entire history object to locate a specific trie node,
the reader now searches directly within the history.
- Besides, slice pre-allocation can avoid unnecessary deep-copy significantly.
The bitmap is used in compact-encoded trie nodes to indicate which elements
have been modified. The bitmap format has been updated to use big-endian
encoding.
Bit positions are numbered from 0 to 15, where position 0 corresponds to
the most significant bit of b[0], and position 15 corresponds to the least
significant bit of b[1].
This pull request introduces a mechanism to compress trienode history by
storing only the node diffs between consecutive versions.
- For full nodes, only the modified children are recorded in the history;
- For short nodes, only the modified value is stored;
If the node type has changed, or if the node is newly created or
deleted, the entire node value is stored instead.
To mitigate the overhead of reassembling nodes from diffs during history
reads, checkpoints are introduced by periodically storing full node values.
The current checkpoint interval is set to every 16 mutations, though
this parameter may be made configurable in the future.
It's a PR based on #33303 and introduces an approach for trienode
history indexing.
---
In the current archive node design, resolving a historical trie node at
a specific block
involves the following steps:
- Look up the corresponding trie node index and locate the first entry
whose state ID
is greater than the target state ID.
- Resolve the trie node from the associated trienode history object.
A naive approach would be to store mutation records for every trie node,
similar to
how flat state mutations are recorded. However, the total number of trie
nodes is
extremely large (approximately 2.4 billion), and the vast majority of
them are rarely
modified. Creating an index entry for each individual trie node would be
very wasteful
in both storage and indexing overhead. To address this, we aggregate
multiple trie
nodes into chunks and index mutations at the chunk level instead.
---
For a storage trie, the trie is vertically partitioned into multiple sub
tries, each spanning
three consecutive levels. The top three levels (1 + 16 + 256 nodes) form
the first chunk,
and every subsequent three-level segment forms another chunk.
```
Original trie structure
Level 0 [ ROOT ] 1 node
Level 1 [0] [1] [2] ... [f] 16 nodes
Level 2 [00] [01] ... [0f] [10] ... [ff] 256 nodes
Level 3 [000] [001] ... [00f] [010] ... [fff] 4096 nodes
Level 4 [0000] ... [000f] [0010] ... [001f] ... [ffff] 65536 nodes
Vertical split into chunks (3 levels per chunk)
Level0 [ ROOT ] 1 chunk
Level3 [000] ... [fff] 4096 chunks
Level6 [000000] ... [fffffff] 16777216 chunks
```
Within each chunk, there are 273 nodes in total, regardless of the
chunk's depth in the trie.
```
Level 0 [ 0 ] 1 node
Level 1 [ 1 ] … [ 16 ] 16 nodes
Level 2 [ 17 ] … … [ 272 ] 256 nodes
```
Each chunk is uniquely identified by the path prefix of the root node of
its corresponding
sub-trie. Within a chunk, nodes are identified by a numeric index
ranging from 0 to 272.
For example, suppose that at block 100, the nodes with paths `[]`,
`[0]`, `[f]`, `[00]`, and `[ff]`
are modified. The mutation record for chunk 0 is then appended with the
following entry:
`[100 → [0, 1, 16, 17, 272]]`, `272` is the numeric ID of path `[ff]`.
Furthermore, due to the structural properties of the Merkle Patricia
Trie, if a child node
is modified, all of its ancestors along the same path must also be
updated. As a result,
in the above example, recording mutations for nodes `00` and `ff` alone
is sufficient,
as this implicitly indicates that their ancestor nodes `[]`, `[0]` and
`[f]` were also
modified at block 100.
---
Query processing is slightly more complicated. Since trie nodes are
indexed at the chunk
level, each individual trie node lookup requires an additional filtering
step to ensure that
a given mutation record actually corresponds to the target trie node.
As mentioned earlier, mutation records store only the numeric
identifiers of leaf nodes,
while ancestor nodes are omitted for storage efficiency. Consequently,
when querying
an ancestor node, additional checks are required to determine whether
the mutation
record implicitly represents a modification to that ancestor.
Moreover, since trie nodes are indexed at the chunk level, some trie
nodes may be
updated frequently, causing their mutation records to dominate the
index. Queries
targeting rarely modified trie nodes would then scan a large amount of
irrelevant
index data, significantly degrading performance.
To address this issue, a bitmap is introduced for each index block and
stored in the
chunk's metadata. Before loading a specific index block, the bitmap is
checked to
determine whether the block contains mutation records relevant to the
target trie node.
If the bitmap indicates that the block does not contain such records,
the block is skipped entirely.
This pull request optimizes history indexing by splitting a single large
database
batch into multiple smaller chunks.
Originally, the indexer will resolve a batch of state histories and
commit all
corresponding index entries atomically together with the indexing
marker.
While indexing more state histories in a single batch improves
efficiency, excessively
large batches can cause significant memory issues.
To mitigate this, the pull request splits the mega-batch into several
smaller batches
and flushes them independently during indexing. However, this introduces
a potential
inconsistency that some index entries may be flushed while the indexing
marker is not,
and an unclean shutdown may leave the database in a partially updated
state.
This can corrupt index data.
To address this, head truncation is introduced. After a restart, any
excessive index
entries beyond the expected indexing marker are removed, ensuring the
index remains
consistent after an unclean shutdown.
This change introduces an iterator for the history index in the pathdb.
It provides sequential access to historical entries, enabling efficient
scanning and future features built on top of historical state traversal.
This is broken off of #31730 to only focus on testing networks that
start with verkle at genesis.
The PR has seen a lot of work since its creation, and it now targets
creating and re-executing tests for a binary tree testnet without the
transition (so it starts at genesis). The transition tree has been moved
to its own package. It also replaces verkle with the binary tree for
this specific application.
---------
Co-authored-by: Gary Rong <garyrong0905@gmail.com>
In this PR, several changes have been made:
(a) restructure the trienode history header section
Previously, the offsets of the key and value sections were recorded before
encoding data into these sections. As a result, these offsets referred to the
start position of each chunk rather than the end position.
This caused an issue where the end position of the last chunk was
unknown, making it incompatible with the freezer partial-read APIs.
With this update, all offsets now refer to the end position, and the
start position of the first chunk is always 0.
(b) Enable partial freezer read for trienode data retrieval
The partial freezer read feature is now utilized in trienode data
retrieval, improving efficiency.
In this PR, the database batch for writing the history index data is
pre-allocated.
It's observed that database batch repeatedly grows the size of the
mega-batch,
causing significant memory allocation pressure. This approach can
effectively
mitigate the overhead.
The limit check for `MaxUint32` is done after the cast to `int`. On 64
bits machines, that will work without a problem. On 32 bits machines,
that will always fail. The compiler catches it and refuses to build.
Note that this only fixes the compiler build. ~~If the limit is above
`MaxInt32` but strictly below `MaxUint32` then this will fail at runtime
and we have another issue.~~ I checked and this should not happen during
regular execution, although it might happen in tests.
This PR implements the partial read functionalities in the freezer, optimizing
the state history reader by resolving less data from freezer.
---------
Signed-off-by: jsvisa <delweng@gmail.com>
Co-authored-by: Gary Rong <garyrong0905@gmail.com>
This pull request is based on #32306 , is the second part for shipping
trienode history.
Specifically, this pull request generalize the existing index mechanism,
making is usable
by both state history and trienode history in the near future.
This pull request addresses the corrupted path database with log
indicating:
`history head truncation out of range, tail: 122557, head: 212208,
target: 212557`
This is a rare edge case where the in-memory layers, including the write
buffer
in the disk layer, are fully persisted (e.g., written to file), but the
state history
freezer is not properly closed (e.g., Geth is terminated after
journaling but
before freezer.Close). In this situation, the recent state history
writes will be
truncated on the next startup, while the in-memory layers resolve
correctly.
As a result, the state history falls behind the disk layer (including
the write buffer).
In this pull request, the state history freezer is always synced before
journal,
ensuring the state history writes are always persisted before the
others.
Edit:
It's confirmed that devops team has 10s container termination setting.
It
explains why Geth didn't finish the entire termination without state
history
being closed.
https://github.com/ethpandaops/fusaka-devnets/pull/63/files
Add state size tracking and retrieve api, start geth with `--state.size-tracking`,
the initial bootstrap is required (around 1h on mainnet), after the bootstrap,
use `debug_stateSize()` RPC to retrieve the state size:
```
> debug.stateSize()
{
accountBytes: "0x39681967b",
accountTrienodeBytes: "0xc57939f0c",
accountTrienodes: "0x198b36ac",
accounts: "0x129da14a",
blockNumber: "0x1635e90",
contractCodeBytes: "0x2b63ef481",
contractCodes: "0x1c7b45",
stateRoot: "0x9c36a3ec3745d72eea8700bd27b90dcaa66de0494b187c5600750044151e620a",
storageBytes: "0x18a6e7d3f1",
storageTrienodeBytes: "0x2e7f53fae6",
storageTrienodes: "0x6e49a234",
storages: "0x517859c5"
}
```
---------
Signed-off-by: jsvisa <delweng@gmail.com>
Co-authored-by: Gary Rong <garyrong0905@gmail.com>
This PR is the first step in the trienode history series.
It introduces the `nodeWithOrigin` struct in the path database, which tracks
the original values of dirty nodes to support trienode history construction.
Note, the original value is always empty in this PR, so it won't break the
existing journal for encoding and decoding. The compatibility of journal
should be handled in the following PR.
### Summary
Fixes long-standing ETA calculation errors in progress indicators that
have been present since February 2021. The current implementation
produces increasingly inaccurate estimates due to integer division
precision loss.
### Problem
3aeccadd04/triedb/pathdb/history_indexer.go (L541-L553)
The ETA calculation has two critical issues:
1. **Integer division precision loss**: `speed` is calculated as
`uint64`
2. **Off-by-one**: `speed` uses `+ 1`(2 times) to avoid division by
zero, however it makes mistake in the final calculation
This results in wildly inaccurate time estimates that don't improve as
progress continues.
### Example
Current output during state history indexing:
```
lvl=info msg="Indexing state history" processed=16858580 left=41802252 elapsed=18h22m59.848s eta=11h36m42.252s
```
**Expected calculation:**
- Speed: 16858580 ÷ 66179848ms = 0.255 blocks/ms
- ETA: 41802252 ÷ 0.255 = ~45.6 hours
**Current buggy calculation:**
- Speed: rounds to 1 block/ms
- ETA: 41802252 ÷ 1 = ~11.6 hours ❌
### Solution
- Created centralized `CalculateETA()` function in common package
- Replaced all 8 duplicate code copies across the codebase
### Testing
Verified accurate ETA calculations during archive node reindexing with
significantly improved time estimates.
This pull request preserves the root->ID mappings in the path database
even after the associated state histories are truncated, regardless of
whether the truncation occurs at the head or the tail.
The motivation is to support an additional history type, trienode history.
Since the root->ID mappings are shared between two history instances,
they must not be removed by either one.
As a consequence, the root->ID mappings remain in the database even
after the corresponding histories are pruned. While these mappings may
become dangling, it is safe and cheap to keep them.
Additionally, this pull request enhances validation during historical
reader construction, ensuring that only canonical historical state will be
served.
This pull request refactors the internal implementation in path database
a bit, specifically:
- purge the state index data in batch
- simplify the logic of state history construction and index, make it more readable
This is a internal refactoring PR, renaming the history to stateHistory.
It's a pre-requisite PR for merging trienode history, avoid the name
conflict.
Seems the `signal.result` was not sent back in shorten case, this will
cause a deadlock.
---------
Signed-off-by: jsvisa <delweng@gmail.com>
Co-authored-by: Gary Rong <garyrong0905@gmail.com>
`binary.AppendUvarint` offers better performance than using append
directly, because it avoids unnecessary memory allocation and copying.
In our case, it can increase the performance by +35.8% for the
`blockWriter.append` function:
```
benchmark old ns/op new ns/op delta
BenchmarkBlockWriterAppend-8 5.97 3.83 -35.80%
```
---------
Signed-off-by: jsvisa <delweng@gmail.com>
Co-authored-by: Gary Rong <garyrong0905@gmail.com>
The implementation of `parseIndexBlock` used a reverse loop with slice
appends to build the restart points, which was less cache-friendly and
involved unnecessary allocations and operations. In this PR we change
the implementation to read and validate the restart points in one single
forward loop.
Here is the benchmark test:
```bash
go test -benchmem -bench=BenchmarkParseIndexBlock ./triedb/pathdb/
```
The result as below:
```
benchmark old ns/op new ns/op delta
BenchmarkParseIndexBlock-8 52.9 37.5 -29.05%
```
about 29% improvements
---------
Signed-off-by: jsvisa <delweng@gmail.com>
Introduce file-based state journal in path database, fixing
the Pebble restriction when the journal size exceeds 4GB.
---------
Signed-off-by: jsvisa <delweng@gmail.com>
Co-authored-by: Gary Rong <garyrong0905@gmail.com>
Fix the issue after initial snap sync with `gcmode=archive` enabled.
```
NewPayload: inserting block failed error="history indexing is out of order, last: null, requested: 1"
```
---------
Signed-off-by: Delweng <delweng@gmail.com>
Co-authored-by: Gary Rong <garyrong0905@gmail.com>
This pull request tracks the state indexing progress in eth_syncing
RPC response, i.e. we will return non-null syncing status until indexing
has finished.
This pull request fixes a flaw in the PBSS state iterator, which
could return empty account or storage data.
In PBSS, multiple in-memory diff layers and a write buffer are
maintained. These layers are persisted to the database and reloaded after
node restarts. However, since the state data is encoded using RLP, the
distinction between nil and an empty byte slice is lost during the encode/decode
process. As a result, invalid state values such as `[]byte{}` can appear in PBSS
and ultimately be returned by the state iterator.
Checkout
https://github.com/ethereum/go-ethereum/blob/master/triedb/pathdb/iterator_fast.go#L270
for more iterator details.
It's a long-term existent issue and now be activated since the snapshot
integration.
The error `err="range contains deletion"` will occur when Geth tries to
serve other
peers with SNAP protocol request.
---------
Co-authored-by: Felix Lange <fjl@twurst.com>
This pull request fixes a flaw in PBSS archive mode that significantly
degrades performance when the mode is enabled.
Originally, in hash mode, the dirty trie cache is completely disabled
when archive mode is active, in order to disable the in-memory garbage
collection mechanism. However, the internal logic in path mode differs
significantly, and the dirty trie node cache is essential for maintaining
chain insertion performance. Therefore, the cache is now retained in
path mode.