It's a PR based on #33303 and introduces an approach for trienode
history indexing.
---
In the current archive node design, resolving a historical trie node at
a specific block
involves the following steps:
- Look up the corresponding trie node index and locate the first entry
whose state ID
is greater than the target state ID.
- Resolve the trie node from the associated trienode history object.
A naive approach would be to store mutation records for every trie node,
similar to
how flat state mutations are recorded. However, the total number of trie
nodes is
extremely large (approximately 2.4 billion), and the vast majority of
them are rarely
modified. Creating an index entry for each individual trie node would be
very wasteful
in both storage and indexing overhead. To address this, we aggregate
multiple trie
nodes into chunks and index mutations at the chunk level instead.
---
For a storage trie, the trie is vertically partitioned into multiple sub
tries, each spanning
three consecutive levels. The top three levels (1 + 16 + 256 nodes) form
the first chunk,
and every subsequent three-level segment forms another chunk.
```
Original trie structure
Level 0 [ ROOT ] 1 node
Level 1 [0] [1] [2] ... [f] 16 nodes
Level 2 [00] [01] ... [0f] [10] ... [ff] 256 nodes
Level 3 [000] [001] ... [00f] [010] ... [fff] 4096 nodes
Level 4 [0000] ... [000f] [0010] ... [001f] ... [ffff] 65536 nodes
Vertical split into chunks (3 levels per chunk)
Level0 [ ROOT ] 1 chunk
Level3 [000] ... [fff] 4096 chunks
Level6 [000000] ... [fffffff] 16777216 chunks
```
Within each chunk, there are 273 nodes in total, regardless of the
chunk's depth in the trie.
```
Level 0 [ 0 ] 1 node
Level 1 [ 1 ] … [ 16 ] 16 nodes
Level 2 [ 17 ] … … [ 272 ] 256 nodes
```
Each chunk is uniquely identified by the path prefix of the root node of
its corresponding
sub-trie. Within a chunk, nodes are identified by a numeric index
ranging from 0 to 272.
For example, suppose that at block 100, the nodes with paths `[]`,
`[0]`, `[f]`, `[00]`, and `[ff]`
are modified. The mutation record for chunk 0 is then appended with the
following entry:
`[100 → [0, 1, 16, 17, 272]]`, `272` is the numeric ID of path `[ff]`.
Furthermore, due to the structural properties of the Merkle Patricia
Trie, if a child node
is modified, all of its ancestors along the same path must also be
updated. As a result,
in the above example, recording mutations for nodes `00` and `ff` alone
is sufficient,
as this implicitly indicates that their ancestor nodes `[]`, `[0]` and
`[f]` were also
modified at block 100.
---
Query processing is slightly more complicated. Since trie nodes are
indexed at the chunk
level, each individual trie node lookup requires an additional filtering
step to ensure that
a given mutation record actually corresponds to the target trie node.
As mentioned earlier, mutation records store only the numeric
identifiers of leaf nodes,
while ancestor nodes are omitted for storage efficiency. Consequently,
when querying
an ancestor node, additional checks are required to determine whether
the mutation
record implicitly represents a modification to that ancestor.
Moreover, since trie nodes are indexed at the chunk level, some trie
nodes may be
updated frequently, causing their mutation records to dominate the
index. Queries
targeting rarely modified trie nodes would then scan a large amount of
irrelevant
index data, significantly degrading performance.
To address this issue, a bitmap is introduced for each index block and
stored in the
chunk's metadata. Before loading a specific index block, the bitmap is
checked to
determine whether the block contains mutation records relevant to the
target trie node.
If the bitmap indicates that the block does not contain such records,
the block is skipped entirely.
This pull request optimizes history indexing by splitting a single large
database
batch into multiple smaller chunks.
Originally, the indexer will resolve a batch of state histories and
commit all
corresponding index entries atomically together with the indexing
marker.
While indexing more state histories in a single batch improves
efficiency, excessively
large batches can cause significant memory issues.
To mitigate this, the pull request splits the mega-batch into several
smaller batches
and flushes them independently during indexing. However, this introduces
a potential
inconsistency that some index entries may be flushed while the indexing
marker is not,
and an unclean shutdown may leave the database in a partially updated
state.
This can corrupt index data.
To address this, head truncation is introduced. After a restart, any
excessive index
entries beyond the expected indexing marker are removed, ensuring the
index remains
consistent after an unclean shutdown.
In this PR, the database batch for writing the history index data is
pre-allocated.
It's observed that database batch repeatedly grows the size of the
mega-batch,
causing significant memory allocation pressure. This approach can
effectively
mitigate the overhead.
This pull request is based on #32306 , is the second part for shipping
trienode history.
Specifically, this pull request generalize the existing index mechanism,
making is usable
by both state history and trienode history in the near future.
### Summary
Fixes long-standing ETA calculation errors in progress indicators that
have been present since February 2021. The current implementation
produces increasingly inaccurate estimates due to integer division
precision loss.
### Problem
3aeccadd04/triedb/pathdb/history_indexer.go (L541-L553)
The ETA calculation has two critical issues:
1. **Integer division precision loss**: `speed` is calculated as
`uint64`
2. **Off-by-one**: `speed` uses `+ 1`(2 times) to avoid division by
zero, however it makes mistake in the final calculation
This results in wildly inaccurate time estimates that don't improve as
progress continues.
### Example
Current output during state history indexing:
```
lvl=info msg="Indexing state history" processed=16858580 left=41802252 elapsed=18h22m59.848s eta=11h36m42.252s
```
**Expected calculation:**
- Speed: 16858580 ÷ 66179848ms = 0.255 blocks/ms
- ETA: 41802252 ÷ 0.255 = ~45.6 hours
**Current buggy calculation:**
- Speed: rounds to 1 block/ms
- ETA: 41802252 ÷ 1 = ~11.6 hours ❌
### Solution
- Created centralized `CalculateETA()` function in common package
- Replaced all 8 duplicate code copies across the codebase
### Testing
Verified accurate ETA calculations during archive node reindexing with
significantly improved time estimates.
This pull request refactors the internal implementation in path database
a bit, specifically:
- purge the state index data in batch
- simplify the logic of state history construction and index, make it more readable
This is a internal refactoring PR, renaming the history to stateHistory.
It's a pre-requisite PR for merging trienode history, avoid the name
conflict.
Seems the `signal.result` was not sent back in shorten case, this will
cause a deadlock.
---------
Signed-off-by: jsvisa <delweng@gmail.com>
Co-authored-by: Gary Rong <garyrong0905@gmail.com>
Fix the issue after initial snap sync with `gcmode=archive` enabled.
```
NewPayload: inserting block failed error="history indexing is out of order, last: null, requested: 1"
```
---------
Signed-off-by: Delweng <delweng@gmail.com>
Co-authored-by: Gary Rong <garyrong0905@gmail.com>
This pull request tracks the state indexing progress in eth_syncing
RPC response, i.e. we will return non-null syncing status until indexing
has finished.
This pull request fixes a flaw in PBSS archive mode that significantly
degrades performance when the mode is enabled.
Originally, in hash mode, the dirty trie cache is completely disabled
when archive mode is active, in order to disable the in-memory garbage
collection mechanism. However, the internal logic in path mode differs
significantly, and the dirty trie node cache is essential for maintaining
chain insertion performance. Therefore, the cache is now retained in
path mode.