Skip to content
This repository was archived by the owner on Jul 14, 2023. It is now read-only.
Closed
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
336 changes: 336 additions & 0 deletions proposals/0000-transient-storages.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,336 @@
---
Title: transient storage host functions
Number: 0
Status: Proposed
Authors:
- cheme
Created: 2023-04-xx
Category: Runtime Environment
Requires:
Replaces:
---


Current Implementation: https://github.com/cheme/substrate/tree/transient


## Summary

This PPPs defines new host functions to keep trace of transient data accross a block processing.

It is derived and modified from initial writing https://github.com/paritytech/substrate/issues/12577, credits to Gavin Wood.

It exposes two new sets of host functions, blob storage an api around bytes storage (similar to rust `Vec<u8>`), and ordered key values storage (similar to rust `BTreeMap<Vec<u8>, Vec<u8>>` storage).

Each storage structure are defined and address by a name with no limit in number except the block weight.
Each structure can be hashed or reduced to a single merkle root to possible store it on state.
Each structure can define if it should be send to client storage.

## Motivation

From https://github.com/paritytech/substrate/issues/5396:
"Right now we abuse storage for intra-block data such as block number, parent hash and block author as well as various housekeeping information and flags like whether we set the uncles Authorship::DidSetUncles.

When initially writing, this incurs an extra trie lookup, which is slow. Instead there should be another host API, which works exactly like set_storage/get_storage but has no trie backing, so it never tries to lookup the value in the trie, nor does it write the value at the end of the block." Credits Gavin Wood.

Additionally content such as events often are used in a log based manner (append only) with possibly a larger size than usual content.

Hashing through host function involves passing all data at once, and is not memory efficient.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From this sentence, it seems that one alternative could be to provide new host functions that allow hashing data progressively, which seems tremendously more simple than all this machinery.

If this alternative is not viable, this section really should explain why.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I propose in the paragraph "Implementation of Blob storage hashing". This part is not really strictly needed for transient storage, but it did not make sense to me to have blob (potentially big) hashing with the current api.

Maybe I should extract it in a separate PPP (progressive hashing host function).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point is that the "Motivation" and/or "Alternatives" section should explain what problem is being solved, but also why this specific design was chosen and not a different one. The objective is to make sure that this isn't an XY problem.

If the problem that is being solved is just that "hashing through host function isn't memory efficient", then the solution in this RFC is way overkill.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is more another issue that I tackled when implementing and force push in this PPP, it is a bit confusing to have all in it.
Similarily the ordered map solve a precise issue, and blob a slightly different one.
Clearly would make sense to me to have the transient ordered map in a first time and blob in a second time: could be two different PPPs (even if this helps factoring).
I think I will extract the hashing part as a first step, not sure about splitting between ordered map and blob.

Copy link
Contributor

@tomaka tomaka Apr 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarily the ordered map solve a precise issue, and blob a slightly different one.

But again, my point is that it should be written in the PPP which issue is being solved.
It's not possible to have an opinion on whether a proposal is good if you don't know which use-cases it has and problems it solves.


## Implementation

Transient storage act as current state storage, but without a persistent backend.

This implies that the storage must support commiting and reverting transaction with `ext_storage_commit_transaction` or `ext_storage_rollback_transaction`.
This transactional support is both at transient storage content and at transient storage definition (a delete transient storage will be restore on rollback).

Btree and blob are using a specific `Mode`, either `drop` or `archive` passed respectively as the byte 0 or 1. When using `drop` the data will not be send from the runtime executor to the calling client. When using `archive` the committed state of the transient storage will be passed as a change set to the client calling runtime executor.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When using archive the committed state of the transient storage will be passed as a change set to the client calling runtime executor.

Isn't that an implementation detail that isn't relevant here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we could stick to "archive indicate that information should be storable and fetch-able from the client".


In archive mode it is the client that choose its strategy for storing the block transient storages final states.


### Implementation of Btree storage

- `ext_btree_storage_new` with parameters:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are all the functions named btree? This implies a binary tree, whereas the implementation doesn't actually need to use a binary tree.
Naming them like tree or ext_transient_storage_* would be more appropriate.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you are right, it is good to make it clear that the structure is same as rust btree in term of api and capability, but it is indeed incorrect.

What do you think of "ordered_map" (actually it is a bit long but no better idea right now)?

- name : a pointer size to the name of the new transient storage.
- mode : `Mode` as an u8 (either 0 `drop` or 1 `archive).
No result.
Allows using a transient storage for a given `name` and `mode`.
If a transient storage already exists with the same `name`, it is overwritten.

- `ext_btree_storage_exists` with parameters:
- name : a pointer size to the name of a transient storage.
Result is a boolean indicating if transient storage was instantiated.

- `ext_btree_storage_delete` with parameters:
- name : a pointer size to the name of a transient storage.
Result true if a transient storage did exist and was removed, and false if no
transient storage did exist.


- `ext_btree_storage_clone` with parameters:
- name : a pointer size to the name of a transient storage to clone.
- target_name : a pointer size to the new transient storage to use.
Result is a true if operation succeed and false if there was no storage at `name`.
Clone keep same `Mode`. Clone copy all content from a storage to another storage.
If a transient storage is present at `target_name` it is overwritten.

This operation cost is high, the implementation do not try to avoid copy.

- `ext_btree_storage_rename` with parameters:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this function necessary? Especially if transient storages aren't stored across blocks, I have trouble seeing why you would need to rename a storage.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially thought only to have a file like api, but thought of a possible usage.

You got a blob define globally: let s say system pallet defines "log" blob.
Then every extrinsic can access it and append to it.
But if a pallet want to use code from other pallet but not appending to log, then it just need to rename "log" do something on another "log" and afterward restore "log", at pretty much 0 cost.
So in this case to isolate a storage and restore it afterward. Same thing with potentially switching context for the remaining of a block processing.

But I would not say it is strictly necessary. And considering it is the more involved implementation detail (to keep it 0 cost), I would not drop the idea of not having it.

- name : a pointer size to the name of a transient storage to rename.
- target_name : the new name to use.
Result is a true if operation succeed and false if there was no storage at `name`.

Renaming iternally rename the storage. As `name` is the main way to address a storage,
it is very likelly to be implemented as a move in an indexing structure.
As all this need to be transactional and revertable.

If a transient storage is present at `target_name` it is overwritten.

This operation cost is small, there should be no copy of storage content.


- `ext_btree_storage_insert_item` with parameters:
- name : a pointer size to the name of a transient storage to rename.
- key : a pointer size to the key of the content to insert.
- value : a pointer size to the value of the content to insert.
Returns false if there is no btree storage defined for this `name`, true otherwhise (success).

This insert a new key value content.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
This insert a new key value content.
This insert a new key value content. Does nothing if the btree storage of this `name` doesn't exist.

Needs to be explicited.

If an item already exists for the `key` it is overwritten.

- `ext_btree_storage_remove_item` with parameters:
- name : a pointer size to the name of a transient storage to rename.
- key : a pointer size to the key of the content to insert.
Returns true when a key and content where removed, false otherwhise.

This attempts to remove a content at a given key.

- `ext_btree_storage_contains_item` with parameters:
- name : a pointer size to the name of a transient storage to rename.
- key : a pointer size to the key of the content to insert.
Returns true when a key and content exists, false otherwhise.

- `ext_btree_storage_get_item` with parameters:
- name : a pointer size to the name of a transient storage to rename.
- key : a pointer size to the key of the content to insert.
Returns an optional bytes content containing the value associated with the given key.

- `ext_btree_storage_read_item` with parameters:
- name : a pointer size to the name of a transient storage to rename.
- key : a pointer size to the key of the content to insert.
- `value_out` : a pointer-size containing the buffer to which the value will be written to.
- `value_offset` : a u32 offset on the value.
Returns an optional size of content written in the buffer (None if the content associated with key is not found).
The size written can only differ from the buffer size if there is not enough content to read.

- `ext_btree_storage_len_item` with parameters:
- name : a pointer size to the name of a transient storage to rename.
- key : a pointer size to the key of the content to insert.
Returns an optional size of value content when the key exists.


- `ext_btree_storage_count` with parameters:
- name : a pointer size to the name of a transient storage to rename.
Returns an optinal u32 number of element currently in the btree.
This should be a small cost operation (implementation needs to maintain a count of content).


- `ext_btree_storage_hash32_item` with parameters:
- name : a pointer size to the name of a transient storage to rename.
- key : a pointer size to the key of the content to insert.
- algorithm: `Hash32Algorithm` passed as a byte, 0 for Blake2b256.
Returns an optional 32 bytes buffer containing the hash of the value when the content exists.

This can be a costy on big value, but then it would be the blob usecase, therefore no caching of result is expecting from implementation.

- `ext_btree_storage_root32_item` with parameters:
- name : a pointer size to the name of a transient storage to rename.
- structure: `Root32Structure` passed as a byte, 0 for `SubstrateDefault` which is the same merkle trie implementation as the substrate trie.
Copy link
Contributor

@tomaka tomaka Apr 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are Root32Structure and SubstrateDefault?
I shouldn't have to read the Substrate code in order to understand the spec.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea behind "Root32Structure" is to allow multiple way to calculate a merkle root from key value content.

"SubstrateDefault" is just using the merkle trie V1 that we currently have in state.
But this is not a great merkle structure, clearly a binary trie will be way better (smaller proof), but this is straight forward implementation as a start.

Returns an optional 32 bytes buffer containing the root hash for the current stae or nothing if there is no transient storage for the given name.

This call is very costy. Implementation shall at least cache hash result when state did not change.

- `ext_btree_storage_next_keys` with parameters:
- name : a pointer size to the name of a transient storage to rename.
- key: a pointer size to the previously accessed key.
- count: a u32 indicating the maximum number of key to return.
Returns a scale encoded optional sized array of keys (rust `Option<Vec<Vec<u8>>>`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Returns a scale encoded optional sized array of keys (rust `Option<Vec<Vec<u8>>>`).
Returns a scale encoded optional sized array of keys (rust `Option<Vec<Vec<u8>>>`). Returns a SCALE-encoded `None` either if there is no transient btree with that name or if the `key` is the last key of that transient btree.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intention was more to:
Returns a SCALE-encoded Some(vec![]) if the key is the last key of that transient storage.



- `ext_btree_storage_root32_dump` with parameters:
- name : a pointer size to the name of a transient storage to rename.
Returns an scale encoded optional vector of key value with all the key value content from the transient btree (in rust `Option<Vec<(Vec<u8>, Vec<u8>)>>`).

- `ext_btree_storage_root32_dump_hashed` with parameters:
- name : a pointer size to the name of a transient storage to rename.
- algorithm: `Hash32Algorithm` passed as a byte, 0 for Blake2b256.
Returns an scale encoded optional vector of key value with hash of key and hash of value from the transient btree (in rust `Option<Vec<([u8; 32], [u8; 32])>>`).

### Implementation of Blob storage

If it sounds easy to use Btree storage to store a blob of bytes, a different api is defined to focus on managing larger bytes array efficiently.

Internally blob storage is acting as an array of bytes chunk. This sets the granularity for transactional support. The chunk size is 256 bytes.

Most of the api is working the same way as for the btree storage, please notice that `name` are isolated from btree transient `name`: a btree and
a blob can use the same name without conflicts.

- `ext_blob_storage_new`, same as `ext_btree_storage_new` for a blob.
- `ext_blob_storage_exists`, same as `ext_btree_storage_exists` for a blob.
- `ext_blob_storage_delete`, same as `ext_btree_storage_delete` for a blob.
- `ext_blob_storage_clone`, same as `ext_btree_storage_clone` for a blob.
- `ext_blob_storage_rename`, same as `ext_btree_storage_rename` for a blob.


- `ext_blob_storage_set` with parameters:
- name : a pointer size to the name of a transient storage to rename.
- value : a pointer size to a buffer of bytes to write.
- offset : the position where the value shall be written
Returns true if written, false otherwhise.

Reason for returning false can be either no instantiated blob for `name` or an offset bigger than the current length of the blob.

When value is written on existing blob content, it overwrites it. If value is too big to be written in the blog, the blog size will increase.

If final blob size is bigger than u32::max, the calls is a noops and return false.

- `ext_blob_storage_truncate` with parameters:
- name : a pointer size to the name of a transient storage to rename.
- at : the new size for the blob storage as u32.
Truncate the blob to a new size. Return true if content was truncated, false otherwhise.

- `ext_blob_storage_read` with parameters:
- name : a pointer size to the name of a transient storage to rename.
- `value_out` : a pointer-size containing the buffer to which the value will be written to.
- `offset` : a u32 offset in the blog.

Returns an optional size of content written in the buffer (None if the blob is not found).
The size written can only differ from the buffer size if there is not enough content to read.
0 is also return for size if `offset` is bigger than the blob size.

- `ext_blob_storage_get` with parameters:
- name : a pointer size to the name of a transient storage to rename.

Returns an optional full blob value.


- `ext_blob_storage_len` with parameters:
- name : a pointer size to the name of a transient storage to rename.

Returns an optional byte size of the current blob.

- `ext_blob_storage_len` with parameters:
- name : a pointer size to the name of a transient storage to rename.
- algorithm: `Hash32Algorithm` passed as a byte, 0 for Blake2b256.
Returns an optional 32 byte hash of the current blob.

The hash shall be cached between two call with no blob content changes.


### Implementation of Blob storage hashing

To avoid passing big chunk of bytes to the hasher new host functions allows incremental hashing from the runtime.

- `ext_hashing_get_hasher` with parameters:
- algorithm: `Hash32Algorithm` passed as a byte, 0 for Blake2b256.
Returns an optional `HasherHandle`.
Handle is a u32 identifier.
Handle is obtain by incrementing a global handle counter and is always the highest used handle.
Counter only decrement for the latest handles.
Only u32 max value handle are obtainable, return none if past this limit.

- `ext_hashing_drop_hasher` with parameters:
- hasher: an optional hasher handle (rust `Option<u32>`).
If hasher handler is passed, remove the hasher from the host for this handle.
If hasher handle is `None`, drop all instantiated hasher from the host.
When an handle is removed, its handle could only be used again after it was in last position.
For instance if handles 0 to 6 are instantiated and we drop handle 5, next call to `ext_hashing_get_hasher` will return 7.
It is only if handler 7 and 6 are dropped or finalized that handle 5 will be finalize in turn and could be return again
by `ext_hashing_get_hasher`.

- `ext_hashing_hasher_update` with parameters:
- hasher: u32 hasher handle to use.
- data: a pointer-size containing a buffer of byte content to send in the hasher.
Return false if there is no instantiated hasher for his handle.
Return true otherwhise.

- `ext_hashing_hasher_finaliez` with parameters:
- hasher: u32 hasher handle to use.
Return an optional u32 byte array containing the final hash of all content send to the
hasher with update.
The hasher is dropped afterward (another call to hasher_update will return false).

## Security considerations

- blob access can grow in memory

- hasher opaque handle value can be used from an extrinsic and lead to non determinism (when processing a block the inner value are deterministic but not runing things at a transaction level, but it this is also true for any transient storage).

## Alternatives

## Questions and open Discussions

- `new` behavior on existing content is discutable, could just return `false` and noops if a transient storage already exists. I would actually quiet like this change.
Similarilly the overwriting of the destination on clone or move is discutable.

- `exists` (not in original issue) seems needed, but maybe a few changes would remove this need.
An alternative would also be to return Option<`Mode`> so you can see the current mode from an extrinsic.

- `rename` is not the most trivial when it comes to transactional implementation (but there is certainly way to have simplier implementation).

- `clone` we could have a smarter implementation where we Ref count content and clone only on write.

- ext_btree_storage_root32_dump and ext_btree_storage_root32_dump_hashed: I am really not sure if those should be exposed (looks more like debugging calls that should not be part of any production runtime).

- ext_btree_storage_next_keys: Would be a good idea to pass the previous key as Option<&[u8]> so we can include the very first value. Currently value &[] need to be queried manually if it can be defined. I am not sure it is needed (and not sure if Option<&[u8]> passing efficiently is easy).

- ext_btree_storage_next_keys: Maybe result should be a Vec<u8> and a Vec<u32> to avoid scale encoding (all keys in same buff and an array of key offset).

- ext_btree_storage_remove_item or ext_btree_storage_contains_item or ext_btree_storage_get_item, the result do not allow to distinguish between a non instantiated transient storage and a key not present in the storage.

- ext_btree_storage_hash32_item: in its current form (no caching), it does not provide much over just accessing content and then calling the host hash function. At this point, I would suggest removing it.

- ext_btree_storage_root32_item: the caching is rudimentary. A plain trie structure could be kept in memory for minimal work between two calls. Currently I only did cache the hash if the btree do not change.

- Hash32Algorithm currently is only blake2b 256, probably need other variant to make sense, I would suggest blake3 (even if the current api is not handling its internal tree structure).

- ext_blob_storage_set: the limit size at u32max is way too much as we probably don't allow that much wasm memory, should define another one.

- ext_blob_truncate: maybe return true if the truncate size is the same as the current size.

- ext_blob_storage_get: involve concatenating all blob chunks, maybe a chunk based api could be better.

- read operation: maybe return false when offset is bigger than value size (instead of 0 length written).

- The new hashing host functions for blob are very discutable: they are only really usefull when using cumulus (otherwhise the data is already in the host memory) to avoid
passing large array to the host.
Alternative could be to store in the host from cumulus, but does not sounds good (very impactful in term of memory).
Second alternative would be to pass the hasher internal state instead of an opaque pointer, so on the host side we don't have to keep state of all open hashers.
Third alternative is just to pass the memory at once and use old host function.

- Transient items are explicitely instantiated and access to undefined transient structure do not create them (even if access is appending byte or inserting key value).
This is not how for instant current child trie works (undefined is always just an empty structure).
This force instantiating specifically. I think it is a good direction since the implicitely instantiation and removal of child trie was often criticized.

- In current implementation transient items are managed as child state (see branch for design in primitive/storage/src/lib.rs).

- The size of the blob chunk (and also the need for a blob chunk management) can probably be change. The current need I see is that it would be alignable with 64 bytes word and with the blake3 block size (64k).

256 byte might be way to small, probably could be pushed to 1024.

- The api allow accessing more than a blob chunk at a time, the current implementation then instantiate and concat chunk, maybe being more restrictive (single chunk access) would make the impact more evident.
For instance accessing a full blob means concatenating all chunks each time, my opinion is that if we want to cache such access a btree item can be use for it.

- Instead of chunking we could store variable length diff in the transaction layer, at first glance it looks a bit less efficient to me considering access. But we could favor size and change this to works with incremental variable length diffs.

- Archive storage: the default implementation for substrate is just storing and applying pruning as standard storage (same pruning range as state).
Key for blob are `b"TransientBlobsItem" ++ block_hash ++ blob_name`.
Key for btree items are `b"TransientBtreesItem" ++ block_hash ++ btree_name ++ btree_item_key`.
Data is put in `AUX` column.
Trait `TransientStorageHook` to allows extending or replacing this behavior (for instance to index specific items).
This trait is usable through library dynamic linking.
Here a big question is this library linking as it is always awkward to support (please refer to implementation in `client/db/src/transient_storage.rs`).

- `HasherHandle` are reclaimed when free and on terminal position, but they could just be never reclaimed.