herodotusdev / hdp

Herodotus Data Processor Toolkit. Enhance zk-offchain compute for verifiable onchain data using zkVMs

Home Page: https://docs.herodotus.dev/herodotus-docs/developers/herodotus-data-processor-hdp

License: GNU General Public License v3.0

Rust 99.21% Shell 0.18% Dockerfile 0.61%
data-processor ethereum rust

hdp's People

Contributors

okm165, rkdud007


hdp's Issues

Optimize fetcher

Goal: handle massive block ranges at robust speed.

  • use multi-threaded concurrency to handle independent tasks
  • make memory fetcher read access synchronous
  • handle batch requests with the abstract fetcher and the RPC fetcher

[tx data lake] Add start_tx_index

we need the flexibility to specify the index of the transaction where we want to start the range

so that we can have the tx range: {start_tx_index .. last tx in the block}.step(increment) (a small sketch follows the struct below)

pub struct TransactionsInBlockDatalake {
    // target block number
    pub target_block: u64,
    // ex. "tx.to", "tx.gas_price" or "tx_receipt.success", "tx_receipt.cumulative_gas_used"
    pub sampled_property: TransactionsCollection,
    // start index of the transaction range
    pub tx_start_index: u64,
    // increment of transactions
    pub increment: u64,
}
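
As a quick illustration, here is a minimal sketch of how the sampled transaction indexes could be derived from tx_start_index and increment; the tx_count value (number of transactions in the target block) is an assumption, fetched elsewhere:

/// Sketch: indexes of the sampled transactions inside the target block,
/// i.e. {tx_start_index .. last tx in the block}.step(increment).
/// `tx_count` is assumed to come from the block body / provider.
fn sampled_tx_indexes(tx_start_index: u64, increment: u64, tx_count: u64) -> Vec<u64> {
    (tx_start_index..tx_count)
        .step_by(increment.max(1) as usize)
        .collect()
}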

support `dry-run` on preprocessing step

dry-run

  • identifies the values needed by the cairo1 module, using cairo-run.
  • This is an unsound environment.
  • returns each task mapped to a list of keys that can later be used to retrieve the values from the provider (see the sketch below)
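
A minimal sketch of what the dry-run output could look like; the type and field names are hypothetical, and the actual key type depends on the provider interface:

use std::collections::HashMap;

/// Hypothetical key identifying a value the cairo1 module touched during dry-run,
/// e.g. a (block_number, account, storage_slot) tuple encoded as a string here.
type ProviderKey = String;

/// Sketch of the dry-run result: each task commitment maps to the list of keys
/// whose values must later be fetched (with proofs) from the provider.
#[derive(Debug, Default)]
pub struct DryRunOutput {
    pub tasks: HashMap<String, Vec<ProviderKey>>,
}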

deep rework: `hdp-core` and `hdp-primitives` refactoring

I realize now it's really time to go through a deep rework. The rework goals are:

  1. clear categorization of each component's scope
  2. a design that scales to multiple datalake types
  3. yes, I wrote some spaghetti code to ship fast; value consumption should be reconsidered

This issue will also be used as groundwork for the documentation.

Design

Primitives

Block

Ethereum block fields that are needed for datalakes.

  1. RPC type that the provider deserializes responses into (depends on alloy-rpc-types)
  2. Consensus type that is RLP encodable/decodable. This should be compatible with the actual Ethereum trie implementation (depends on alloy-consensus)

Note: neither Datalake nor Task will have compile logic in the primitives scope

Datalake

each datalake type folder

  1. Acceptable fields: these field indexes will be kept in sync across Cairo and Solidity. Each datalake's collection will have its own property field enum
  2. Datalake collection: bound to sampled_property; can be serialized to and deserialized from a Vec format
  3. Datalake instance: supports encode / decode / commit

root

  1. DatalakeEnvelope: enum that embeds the different datalake types (sketched below)
  2. DatalakeType: enum that simply identifies the type
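
A minimal sketch of the two root enums, assuming the variant set discussed in this issue (block-sampled and transactions-in-block); everything beyond the two enum names is illustrative:

// Placeholders for the concrete datalake structs defined elsewhere in the crate.
pub struct BlockSampledDatalake;
pub struct TransactionsInBlockDatalake;

/// Identifies a datalake type without carrying its data.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DatalakeType {
    BlockSampled,
    TransactionsInBlock,
}

/// Embeds the different concrete datalake instances behind one type.
pub enum DatalakeEnvelope {
    BlockSampled(BlockSampledDatalake),
    TransactionsInBlock(TransactionsInBlockDatalake),
}

impl DatalakeEnvelope {
    /// Map an envelope back to its plain type identifier.
    pub fn datalake_type(&self) -> DatalakeType {
        match self {
            DatalakeEnvelope::BlockSampled(_) => DatalakeType::BlockSampled,
            DatalakeEnvelope::TransactionsInBlock(_) => DatalakeType::TransactionsInBlock,
        }
    }
}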

Task

  1. ComputationalTaskWithDatalake: the type that embeds the datalake envelope; supports encode / decode / commit
  2. ComputationalTask: supports encode / decode / commit

Core

Note: now this is where we should use the Provider

DatalakeCompiler

DatalakeCompiler is responsible for fetching the relevant proofs and values for a given datalake type and request, using the Provider. It returns a CompiledDatalake, which stacks all the relevant onchain data required by the datalake.

each datalake type folder

  1. Compiled type: the Evaluator depends on this type to return a result format that can be serialized into JSON, and the Compiler returns a compatible compiled type as its result
  2. Compiler: takes a datalake instance, fetches and compiles it, and returns the compiled type (it does not consume the datalake or the provider; see the sketch below)
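
A minimal sketch of the compiler shape described above; the trait name, error type, and async signature are assumptions, but it shows the borrow-only contract (neither the datalake nor the provider is consumed):

use anyhow::Result;

/// Sketch of a per-datalake-type compiler. The associated types stand in for the
/// concrete datalake instance, the provider, and the compiled result the
/// Evaluator consumes; all names here are illustrative.
#[async_trait::async_trait]
pub trait Compile {
    type Datalake;
    type Provider;
    type Compiled;

    /// Fetch proofs/values for `datalake` through `provider` and stack them into
    /// the compiled type. Both are taken by reference, so neither value is consumed.
    async fn compile(
        &self,
        datalake: &Self::Datalake,
        provider: &Self::Provider,
    ) -> Result<Self::Compiled>;
}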

Evaluator

The Evaluator is responsible for returning an EvaluationResult that can be serialized into JSON compatible with the input.json format.

EvaluationResult embeds the compiled type, but also contains the encoded values produced by the evaluator. It is serialized directly into the JSON file output.json or input.json.

Provider

(Not in scope for the current refactoring.) Fetches proofs and values.

Dynamic layout compiler

DynamicLayout Datalake

There might be some cases where someone would like to iterate through, let's say, a Solidity mapping or array.

For this reason we allow such a datalake; maybe the name is not perfect.

An example of such datalake is:

IterativeDynamicLayoutDatalake(
		block_number=10231740,
		account_address="0x7b2f05ce9ae365c3dbf30657e2dc6449989e83d6",
		slot_index=5,
		initial_key=0,
		key_boundry=3,
)

Such a datalake is scoped to a specific block, as it would be very weird to iterate a mapping across multiple blocks or different smart contracts.

In Solidity, every variable, including dynamic ones such as mappings and arrays, has its own unique slot index.

To read more about storage layouts please see:

[Layout of State Variables in Storage — Solidity 0.8.24 documentation](https://docs.soliditylang.org/en/latest/internals/layout_in_storage.html)

initial_key is the first key to be read from the mapping; key_boundry is the key at which, once reached, the loop will stop.

It also takes an increment argument, with 1 being the default value.
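
For reference, a minimal sketch of how the storage slot of each mapping key can be derived and iterated, following the Solidity storage layout rules linked above (for value-type keys, the element at key k of a mapping declared at slot p lives at keccak256(pad32(k) ++ pad32(p))); the function names and the use of alloy-primitives are assumptions:

use alloy_primitives::{keccak256, B256, U256};

/// Slot of `some_mapping[key]` for a mapping whose declaration occupies `slot_index`:
/// keccak256(32-byte big-endian key ++ 32-byte big-endian slot_index).
fn mapping_element_slot(key: U256, slot_index: U256) -> B256 {
    let mut buf = [0u8; 64];
    buf[..32].copy_from_slice(&key.to_be_bytes::<32>());
    buf[32..].copy_from_slice(&slot_index.to_be_bytes::<32>());
    keccak256(buf)
}

/// Storage slots touched by the example datalake above:
/// keys initial_key..key_boundry, stepping by `increment`.
fn sampled_slots(slot_index: U256, initial_key: u64, key_boundry: u64, increment: u64) -> Vec<B256> {
    (initial_key..key_boundry)
        .step_by(increment.max(1) as usize)
        .map(|k| mapping_element_slot(U256::from(k), slot_index))
        .collect()
}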

[tx data lake] sender field as sampled property

In a transaction there are v, r, s fields from which we can recover the signer. So, as a more user-friendly approach, we can have an additional sender field: if the user requests sender, we compute the sender address from the v, r, s fields.

resolve slow http response parse time ( potentially enable gzip feature)

Note: the http response size is more than 10 MB

Fortunately this seems like a solution: https://users.rust-lang.org/t/using-json-with-response-data-in-reqwest-is-slow/95346/19 --- one questionable thing is that, compared to that case, we need to parse all of the response anyway; not sure how much improvement it will give.

With the latest version of the indexer, we can request many more blocks per response, and this works well in curl.

However, with the exact same query (fetching something like 2k ~ 16k blocks), the indexer wrapper in hdp doesn't have the same performance.

I left a log: basically the fetching time itself is very similar to curl, however the parsing part takes a lot of time...

get_headers_proof took 3021 ms
start parsing response
json response took 21787 ms
2024-06-26T14:29:37.982150Z  INFO hdp_provider::evm::provider: Time taken (Headers Proofs Fetch): 26.856025709s

so, this single line takes like 20s !!

 let body: Value = response.json().await.map_err(IndexerError::ReqwestError)?;
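
One possible direction (a sketch only, not measured): enable reqwest's gzip feature so the 10MB+ body travels compressed, and parse the body from raw bytes with serde_json instead of going through Response::json. Whether this removes the ~20s is exactly what the linked thread leaves open, since we still parse the whole response; the reqwest version below is an assumption.

// Cargo.toml (assumed): reqwest = { version = "0.12", features = ["gzip", "json"] }
use serde_json::Value;

async fn get_indexer_response(url: &str) -> anyhow::Result<Value> {
    // gzip(true) sends `Accept-Encoding: gzip` and transparently decompresses the body.
    let client = reqwest::Client::builder().gzip(true).build()?;
    let response = client.get(url).send().await?;

    // Pull the full body as bytes first, then parse; this separates network time from
    // parse time and lets us swap in a faster JSON parser later if needed.
    let bytes = response.bytes().await?;
    let body: Value = serde_json::from_slice(&bytes)?;
    Ok(body)
}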

Multicall syntax serialization for BlockSampledDatalake

BlockSampledDatalake memory datalake1 = BlockSampledDatalake({
            blockRangeStart: 10399990,
            blockRangeEnd: 10400000,
            increment: 1,
            sampledProperties: BlockSampledDatalakeCodecs
                .encodeSampledPropertyForStorage(
                    address(0x7b2f05cE9aE365c3DBF30657e2DC6449989e83D6),
                    bytes32(myInt),
                    bytes32(myInt),
                    bytes32(myInt)
                )
        });

What if we could put multiple fields in sampledProperties? The compiler could be much more efficient.

This is different from the compilation pipeline; it's specific to BlockSampledDatalake. When we decode the RLP we get all of the fields anyway, so why send only one corresponding field if they live in the same structure (a header structure or account structure that can be retrieved from the RLP at once)?

Match block header rlp that can retrieve the block hash

Just to move quickly, I currently implemented the RLP encode/decode logic with alloy-rlp, but the RLP representation is incorrect for retrieving the block hash. This needs to be resolved later, when it's time to connect with Cairo HDP and provide it as input.
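
For reference, the block hash is the keccak256 of the RLP-encoded consensus header, so a correct encoding can be sanity-checked against the hash returned by the RPC. A minimal sketch, assuming alloy-consensus / alloy-rlp / alloy-primitives:

use alloy_consensus::Header;
use alloy_primitives::{keccak256, B256};
use alloy_rlp::Encodable;

/// block_hash = keccak256(rlp(header)). If this does not match the hash reported
/// by the node for the same block, the RLP representation is off.
fn block_hash(header: &Header) -> B256 {
    let mut buf = Vec::new();
    header.encode(&mut buf);
    keccak256(&buf)
}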

TransactionsInBlock

Note: issue #48 will be deprecated by this, due to the limitation that verifying a signature in a Cairo program costs an enormous number of steps.

Milestone

  • transaction datalake interface
  • transaction provider support
  • transaction datalake compiler

Definition

TransactionsDatalake

This datalake iterates over all the transactions in a specific block.

An example usage of TransactionsInBlock is as follows:

TransactionsDatalake(
    target_block="10000",
    sampled_property="recipient",
    increment=1  # optional, default value 1
)

Here are some examples of sampled_property fields:
For transactions:

pub enum TransactionField {
    // ===== Transaction fields =====
    Nonce,
    GasPrice,
    GasLimit,
    To,
    Value,
    Input,
    V,
    R,
    S,
    ChainId,
    // Not for legacy transactions
    AccessList,

    // EIP-1559 transactions and EIP-4844 transactions
    MaxFeePerGas,
    // EIP-1559 transactions and EIP-4844 transactions
    MaxPriorityFeePerGas,

    // Only for EIP-4844 transactions
    BlobVersionedHashes,
    // Only for EIP-4844 transactions
    MaxFeePerBlobGas,
}

Decoder

  • implement the decoder for the datalake, and write a unit test for it
  • implement the decoder for the task, and write a unit test for it

TransactionsDatalake

Milestone

  • transaction datalake interface ( #49 )
  • transaction provider support (#52 )
  • transaction datalake compiler ( transaction #53 )
  • transaction datalake compiler ( transaction receipt )

Definition

TransactionsDatalake

This datalake facilitates iterating over a set of transactions originating from a specific Ethereum address, denoted as sender. The iteration begins at a specified nonce and concludes at another specified nonce. Note that sampled_property can include all fields from both transactions and transaction receipts.

An example usage of TransactionsBySenderDatalake is as follows:

TransactionsDatalake(
    account_type="sender",
    address="0x7b2f05ce9ae365c3dbf30657e2dc6449989e83d6",
    from_nonce=0,
    to_nonce=20,
    sampled_property="recipient",
    increment=1  # optional, default value 1
)

Here are some examples of sampled_property fields:
For transactions:

pub enum TransactionField {
    // ===== Transaction fields =====
    Nonce,
    GasPrice,
    GasLimit,
    To,
    Value,
    Input,
    V,
    R,
    S,
    ChainId,
    // Not for legacy transactions
    AccessList,

    // EIP-1559 transactions and EIP-4844 transactions
    MaxFeePerGas,
    // EIP-1559 transactions and EIP-4844 transactions
    MaxPriorityFeePerGas,

    // Only for EIP-4844 transactions
    BlobVersionedHashes,
    // Only for EIP-4844 transactions
    MaxFeePerBlobGas,
}

For transaction receipts:

pub enum TransactionReceiptField {
    Success,
    CumulativeGasUsed,
    Logs,
    Bloom,
}

Implementation

An important detail in the implementation is that, although we use the same interface as TransactionsBySenderDatalake, the sampled_property fields determine whether we should obtain proof against the transaction trie or the transaction receipt trie.

Here is the step-by-step implementation (see the sketch after this list for step 1):

  1. Scope block numbers: get the nonce via eth_getTransactionCount while iterating the blocks. We only have to iterate from the block where the account was deployed, to reduce irrelevant RPC calls. Track the nonce difference and scope down to the block numbers that actually contain a tx sent by the sender.
  2. Header proofs: get the relevant MMR proofs and header data from the indexer.
  3. Transaction / transaction receipt proofs: get the tx proof or tx receipt proof for each scoped block number. Whether it's the tx or the tx receipt is determined by sampled_property.
    If the property requires a transaction receipt, call eth_getBlockReceipts for all scoped block numbers, filter for receipts whose sender matches the given address, and get proofs for the matched ones.
    If the property requires a transaction, first get the filtered tx_hash from the step above, and call eth_getTransactionByHash to get the detailed tx.
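
A minimal sketch of step 1, assuming a hypothetical async `nonce_at(block)` helper wrapping eth_getTransactionCount; it walks the blocks from the deployment block and keeps only those where the sender's nonce increased inside the requested nonce window:

use anyhow::Result;
use std::future::Future;

/// Sketch: return the block numbers in which `nonce_at` (eth_getTransactionCount
/// at that block) increased inside [from_nonce, to_nonce). `deploy_block` is the
/// block where the sender account first appeared, `head` the latest block.
async fn scope_blocks<F, Fut>(
    nonce_at: F,
    deploy_block: u64,
    head: u64,
    from_nonce: u64,
    to_nonce: u64,
) -> Result<Vec<u64>>
where
    F: Fn(u64) -> Fut,
    Fut: Future<Output = Result<u64>>,
{
    let mut scoped = Vec::new();
    let mut prev = nonce_at(deploy_block).await?;
    for block in deploy_block + 1..=head {
        let nonce = nonce_at(block).await?;
        // A nonce increase in this block means the sender sent at least one tx here;
        // keep it only if the tx nonces overlap the requested window.
        if nonce > prev && prev < to_nonce && nonce > from_nonce {
            scoped.push(block);
        }
        prev = nonce;
        if prev >= to_nonce {
            break; // past the requested nonce range, no need to keep scanning
        }
    }
    Ok(scoped)
}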

Open to discuss

Based on the implementation steps listed above: since no eth_getTransactionCount-equivalent method exists for receivers, we couldn't scope the minimum set of target block numbers by a given nonce. We could instead iterate all blocks since the receiver account was deployed using eth_getAccount, and scope the target block numbers that way.

  1. Is all this iteration (1. finding target block numbers using the nonce, 2. filtering by sender or receiver address) really necessary? Can't we reduce it somehow?
  2. eth_getAccount vs. TransactionsByReceiverDatalake need to be compared, as both provide the sender's nonce information
  3. Another interface suggestion: we could put the identifier in sampled_property instead of next to the address, but personally I think it's less intuitive than putting the identifier in account_type:
TransactionsDatalake(
    address="0x7b2f05ce9ae365c3dbf30657e2dc6449989e83d6",
    from_nonce=0,
    to_nonce=20,
    sampled_property="sender.recipient",
    increment=1  # optional, default value 1
)

Filter non-provable `BlockSampledDataLake` request

Context

Some of the header fields might return None (e.g. when requesting EIP fields that are not supported for a given block), or return data that isn't handled properly (e.g. the ExtraData field returns "0x" in some situations), and there may be more such cases.

Since the HDP Rust repository should filter out non-provable data before handing it to the Cairo program, it is important to identify these potential issues and return a proper error message for non-provable cases.

Generate output files

The cli should take an output_path argument which defines where the output JSON object should be persisted.
Such a JSON object would include two main fields (see the sketch below the list).

  • results // Hashmap from task_hash to its result; the keys should preserve a FIFO order.
  • prover_input // input to the cairo program @petscheit
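
A minimal sketch of what this output object could look like. Since a std HashMap does not preserve insertion order, the sketch uses IndexMap (an assumption, as is every type/field name besides results and prover_input) to keep the FIFO key order when serializing:

use indexmap::IndexMap; // Cargo.toml (assumed): indexmap = { version = "2", features = ["serde"] }
use serde::Serialize;
use serde_json::Value;

#[derive(Serialize)]
pub struct Output {
    /// task_hash -> result, serialized in insertion (FIFO) order.
    pub results: IndexMap<String, String>,
    /// Input handed to the cairo program; left as raw JSON here.
    pub prover_input: Value,
}

pub fn persist(output: &Output, output_path: &std::path::Path) -> anyhow::Result<()> {
    std::fs::write(output_path, serde_json::to_string_pretty(output)?)?;
    Ok(())
}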

[core] serialize/deserialize with trait import

Solidity type sync serde/deserde

Datalake & Task definitions are strictly tied to the Solidity bytes representation.

  • encode: object into its bytes representation
  • decode: bytes representation back into the object
  • commit: object into FixedBytes<32>

So, meaning: we should keep all the interface types tied to Solidity primitives, and import the Display or Serialize trait where printing is required. This would require a refactor around utils, as most type conversions should not live in util (= pure functions) and can mostly be handled by the appropriate traits.

Explore with cairo vm in rust

Trace generation is super slow. Maybe taking the Cairo VM in Rust as a dependency and generating the trace from it can solve the CPU-bound bottleneck of the infra.

Version and type sync with `alloy`

alloy now has a v0.1.0 crate, which means I need to consider removing the legacy dependencies; I can also consider switching over RpcType and ConsensusType.

But this can only happen when the 2 cases below are satisfied, otherwise it will add redundant complexity:

  • type casting from RpcType -> ConsensusType should be working
  • rlp encode/decode of ConsensusType should be working

I will work on this if all 5 types I need (header, account, storage, tx, receipt) satisfy the constraints above.

clean up cli interface

Clean up the cli interface on the encode / run part to be simpler and more intuitive to use.
Especially the chain id / rpc url parameters are not that easy to use currently.

One more note: errors have to be handled clearly.

Benchmark for HDP provider

Context

Add a proper benchmark for the provider.
Split headers vs account vs storage so that we can calculate the upper bound easily.

Fetcher

Retrieves proofs for the generated tasks.

Handle `mmr_meta` if it's not single

Context

The current input.json Cairo format assumes that a single MMR contains all the batched tasks.

However, the goal of this batching is to have multiple independent tasks proven in one process, and unlike mainnet, testnet has a chunked number of MMRs, so the assumption that all batched tasks are contained in one MMR might be incorrect.

Suggestion

mmr_meta should be an array of multiple MMRs,
and each header object should have a field that identifies the MMR id and size (sketched below).
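
A minimal sketch of input structures matching this suggestion; all field names beyond mmr_meta are assumptions about what the input.json shape could look like, not the current format:

use serde::{Deserialize, Serialize};

/// One MMR entry in the `mmr_meta` array.
#[derive(Serialize, Deserialize)]
pub struct MmrMeta {
    pub mmr_id: u64,
    pub mmr_size: u64,
    pub mmr_root: String,
    pub mmr_peaks: Vec<String>,
}

/// Each header now points at the MMR it is accumulated in.
#[derive(Serialize, Deserialize)]
pub struct HeaderWithMmr {
    pub rlp: String,
    /// Identifies which entry of `mmr_meta` this header belongs to.
    pub mmr_id: u64,
    pub inclusion_proof: Vec<String>,
}

#[derive(Serialize, Deserialize)]
pub struct Input {
    /// Previously a single object; now an array of MMRs.
    pub mmr_meta: Vec<MmrMeta>,
    pub headers: Vec<HeaderWithMmr>,
}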

hdp CLI upgrade

  • dev mode: allow people to generate serialized datalakes, tasks, etc.

`hdp-provider` crate to be stateful

Context

The hdp-provider crate is responsible for taking a request and fetching the MPT/MMR proofs. As we have introduced a caching mechanism for fetching proofs, I suggest putting this data structure implementation in the provider crate, so that we can interact with sn-mpt via a library call during the provider step of processing.

This gives two benefits:

  • utilize state (a db) in the preprocessor, so that splitting the steps is not limited by a stateless environment like now
  • we don't need an unnecessary instance or IO calls to run sn-mpt as a separate server deployment

Fetch MMR metadata & header data from indexer api hdp endpoint

  • Call the /mmr-meta-and-proof endpoint when calling the fetcher
  • Parse the result of the endpoint: 1) get the mmr metadata 2) use the rlp encoded header to compile the value
  • Check the full flow of contract <> server <> cli with the fetched value & mmr metadata
  • Generate the file in the correct format; pass --output-file as an option and allow passing a path
  • Check the full flow again with the server using an altered generated file
curl -X 'GET' \
  'https://rs-indexer.api.herodotus.cloud/accumulators/mmr-meta-and-proof?deployed_on_chain=11155111&accumulates_chain=11155111&block_numbers=4952103&block_numbers=4952102&block_numbers=4952101&block_numbers=4952100&hashing_function=poseidon&contract_type=AGGREGATOR' \
  -H 'accept: application/json'

Testing response : https://rs-indexer.api.herodotus.cloud/accumulators/mmr-meta-and-proof?deployed_on_chain=11155111&accumulates_chain=11155111&block_numbers=4952100&block_numbers=4952101&block_numbers=4952102&block_numbers=4952103&hashing_function=poseidon&contract_type=AGGREGATOR

Cross compile

Upload a pre-built binary to the remote release. The binary should be cross-compilable, as it will be used in an Ubuntu-based Docker container environment.

Evaluator

Processes tasks with their proofs and evaluates results.

feat: Optimize account and storage fetching as much as possible

Context

As can be seen in #35, most of the latency in hdp-cli comes from fetching account proofs and storage proofs. Currently we haven't implemented a persistent db, and those rpc calls are not handled in parallel (a sketch of parallelizing them follows below).

Yes, the hdp-cli fetcher is a really dumb implementation atm.
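
A minimal sketch of the parallelization point, assuming a hypothetical async `get_account_proof(block, address)` helper wrapping eth_getProof; the per-block calls are independent, so they can be issued concurrently with futures::future::try_join_all instead of being awaited one by one:

use anyhow::Result;
use futures::future::try_join_all;

/// Hypothetical proof type and fetcher; stand-ins for the real provider calls.
pub struct AccountProof; // placeholder

async fn get_account_proof(_block: u64, _address: &str) -> Result<AccountProof> {
    unimplemented!("placeholder: wraps eth_getProof for one block")
}

/// Fetch the account proof for every block in the range concurrently.
pub async fn fetch_account_proofs(blocks: &[u64], address: &str) -> Result<Vec<AccountProof>> {
    let futures = blocks
        .iter()
        .map(|&block| get_account_proof(block, address));
    // All requests are in flight at once; fails fast on the first error.
    try_join_all(futures).await
}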

Some Steps

Another approach

http request cache

`cairo-run` execution in parallel

Context

Currently, once we finish preprocessing and generate the input, we run the binary in the same thread. Ideally we should spawn multiple threads so that other requests coming in via input files do not get blocked (see the sketch below).

And since spawning multiple runs means we are splitting the processing step (take input.json -> generate output.json & PIE) into separately exposed pieces, this should happen after #80 makes it stateful and divides the preprocess and process steps.
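
A minimal sketch of running several cairo-run executions concurrently with tokio; the binary name, its arguments, and the input paths are placeholders, not the real CLI flags:

use anyhow::Result;
use tokio::process::Command;

/// Run one `cairo-run` invocation per input file, all concurrently, so a long
/// run does not block the other requests. Arguments here are illustrative only.
pub async fn run_cairo_batch(input_files: Vec<String>) -> Result<()> {
    let mut handles = Vec::new();
    for input in input_files {
        handles.push(tokio::spawn(async move {
            // Placeholder command line; replace with the actual cairo-run flags.
            let status = Command::new("cairo-run")
                .arg(&input)
                .status()
                .await?;
            anyhow::ensure!(status.success(), "cairo-run failed for {input}");
            Ok::<(), anyhow::Error>(())
        }));
    }
    for handle in handles {
        handle.await??; // propagate panics and run errors
    }
    Ok(())
}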

proper `count_if`

  • update schema for task
struct ComputationalTask {
    bytes32 aggregateFnId;
    bytes1 operator;
    uint256 valueToCompare;
}
pub struct ComputationalTask {
    pub aggregate_fn_id: FixedBytes<32>,
    pub operator: Option<FixedBytes<1>>,
    pub value_to_compare: Option<U256>,
}
  • update operator bytes representation
/// - 00 : Not operator ( to avoid collision with solidity )
/// - 01: Equal (=)
/// - 02: Not equal (!=)
/// - 03: Greater than (>)
/// - 04: Greater than or equal (>=)
/// - 05: Less than (<)
/// - 06: Less than or equal (<=)
  • make sure to encode/decode properly (see the sketch after this list)
  • cross check with TS <> Solidity <> RS
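
A minimal sketch of the operator byte mapping from the comment above, as a Rust enum with round-trip conversion (names are illustrative):

/// Comparison operator encoded as a single byte, matching the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum Operator {
    None = 0x00, // no operator (avoids collision with solidity default)
    Eq = 0x01,
    Neq = 0x02,
    Gt = 0x03,
    Gte = 0x04,
    Lt = 0x05,
    Lte = 0x06,
}

impl Operator {
    pub fn to_byte(self) -> u8 {
        self as u8
    }

    pub fn from_byte(byte: u8) -> Option<Self> {
        match byte {
            0x00 => Some(Self::None),
            0x01 => Some(Self::Eq),
            0x02 => Some(Self::Neq),
            0x03 => Some(Self::Gt),
            0x04 => Some(Self::Gte),
            0x05 => Some(Self::Lt),
            0x06 => Some(Self::Lte),
            _ => None,
        }
    }
}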

process type sync to `mpt_key`

The following fields defined in ProcessedType will be renamed to mpt_key:

  • account_key
  • storage_key
  • key (defined in both transaction and tx receipt)

Support Multi Network

Context

The current version assumes BlockSampledDataLake is only for the Ethereum Sepolia network. However, as we are planning on supporting other networks in the future, we need to specify the exact chain id in the BlockSampledDataLake fields.

Moreover, before the Ethereum hardfork that introduced it there was no chain_id, so not passing chain_id would result in an unsound TransactionsInBlockDatalake.

Here are the todos:

BlockSampledDatalake

  • (hdp-cli) add chain_id as field of BlockSampledDataLake struct.
pub struct BlockSampledDatalake {
    pub chain_id: u64,
    pub block_range_start: u64,
    pub block_range_end: u64,
    pub increment: u64,
    pub sampled_property: String,
}
  • (Cairo Program) add chain_id as field of BlockSampledDataLake struct.
  • (Contract) add chain_id as field of BlockSampledDataLake struct.
struct BlockSampledDatalake {
    uint256 chain_id;
    uint256 blockRangeStart;
    uint256 blockRangeEnd;
    uint256 increment;
    bytes sampledProperty;
}
  • (Server) add chain_id as field of BlockSampledDataLake struct.

TransactionsInBlockDatalake

  • (hdp-cli) add chain_id as field of TransactionsInBlockDatalake struct.
pub struct TransactionsInBlockDatalake {
    pub chain_id: u64,
    pub target_block: u64,
    pub sampled_property: TransactionsCollection,
    pub increment: u64,
}
  • (Cairo Program) add chain_id as field of TransactionsInBlockDatalake struct.
  • (Contract) add chain_id as field of TransactionsInBlockDatalake struct.
struct TransactionsInBlockDatalake {
    uint256 chain_id;
    uint256 target_block;
    uint256 increment;
    bytes sampledProperty;
}
  • (Server) add chain_id as field of TransactionsInBlockDatalake struct.

Test check

Make sure the hash is the same across the whole flow.

[tx data lake] Filter out specific transaction type

Different EIPs introduced different transaction types, which hold different fields to decode. By default we get all the types within the range, but ideally the user should have the flexibility to filter for specific transaction types.

[SLR support] support SLR base on new flow design

We figured out that with a custom cairo1 module, we cannot just run the exact computation in the preprocessing step. Meaning, the result & results root can only be fetched after the cairo run.

we need these methods in the cli:

  • build input file: we need to route based on whether a custom compute module is included, and if a module exists, we don't return results_root in input.json
  • run cairo-run with the input file + compiled cairo -> return the output file & pie: this runs the compiled cairo program with the given input, then constructs output.json and the pie. This output file will contain compiled_result and the results root for all compute function cases.

Now we can just pass the pie object to the server and wait for queuing with sharp.

Note that hardcoding the compiled program is not doable, as the compiled program fully contains absolute paths.

Handle edge cases

  • contract initiated & slot not initiated: even though the storage value returns as 0x0, if the proof is vec![], this means Cairo cannot prove the proof.

make cli great again

Note: this is a bit of a tricky problem; I'd love to hear more feedback.

Trying to find the sweet spot between "try not to introduce too many breaking changes" and "have a more intuitive cli with v2 <> v1 generalization".

And also trying to remove unnecessary serde/deserde and commands.

Background

These are the cases where we use hdp most often:

  1. hdp-test : pass 1 raw task-> preprocess
  2. hdp-server(onchain) : pass N encoded tasks -> preprocess -> process
  3. later : pass N tasks input.json -> process

we will have 4 commands

hdp encode --request-query {FILE_PATH} --encoded-query {FILE_PATH}
hdp decode --encoded-query {FILE_PATH}  --request-query {FILE_PATH}

and the run command will support 2 types, each of which can be called with either an encoded request or a raw request

hdp run-datalake --encoded-datalakes {BYTES} --encoded-tasks {BYTES} ... some etc arguments  
hdp run-module  --encoded-modules {BYTES} ... some etc arguments 

and raw server request

hdp run --request-query {FILE_PATH}

idea: HDP cli as optional server

Context

As of right now, since the thread stops after one request call to the cli, the InMemoryProvider is useless. There is not much chance that a proof provided in memory will be requested again within the few seconds before it gets reset.

Idea

What if we make the cli optionally able to spawn a dedicated server on localhost and receive requests on the same thread? That way it could leverage caching and, later, a persistent db, and the fetching speed would be hugely faster.

Ideally we could have a command to run server

hdp run-server 

and it starts a dedicated server on localhost:8080.

And the run command that we originally have would, in this case, be calling the API (a minimal sketch follows below).
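
A minimal sketch of what `hdp run-server` could spin up, using axum as an assumed web framework; the route, request shape, and handler are all illustrative, not an existing API:

use axum::{routing::post, Json, Router};
use serde::{Deserialize, Serialize};

/// Illustrative request/response shapes for a local run endpoint.
#[derive(Deserialize)]
struct RunRequest {
    encoded_tasks: String,
    encoded_datalakes: String,
}

#[derive(Serialize)]
struct RunResponse {
    results_root: String,
}

async fn run_handler(Json(req): Json<RunRequest>) -> Json<RunResponse> {
    // Here the existing `run` logic would be invoked, reusing the long-lived
    // provider (and its cache) held by the server process.
    let _ = (req.encoded_tasks, req.encoded_datalakes);
    Json(RunResponse { results_root: "0x..".to_string() })
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let app = Router::new().route("/run", post(run_handler));
    let listener = tokio::net::TcpListener::bind("127.0.0.1:8080").await?;
    axum::serve(listener, app).await?;
    Ok(())
}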

Consideration

We have an HDP-Server which exports an api interface. This might create a conflict of responsibility.
-> Another thought on this: we could split HDP-Server for business purposes and keep a server on top of the HDP cli for open-source purposes, since everyone could then run and spawn the full HDP logic locally.

But if we want to delegate this caching logic to the Server, we would need to implement logic around an HDP Runner, which also means saving the proof results to a DB or in-memory.
I believe the workload of implementing this in either the HDP cli or HDP-Server would not differ much, but since the HDP cli takes on fetching proofs, I'd prefer the approach of making the Provider support both fetching and caching so that it can deliver robust speed.

Conclusion

Adding a server on top of the HDP cli doesn't mean just adding an interface. I believe it could make sense to have this as an HDP Core that supports 1) a cli for dev purposes, 2) a server for running the HDP flow, 3) a dedicated DB and Providers to optimize fetching, and 4) connections to many backends as options for zkVMs.

support custom module function as hash

Switch SLR support from a raw aggregate function index to a hash, and connect the module registry to make it work.
It might result in some refactoring around the building blocks from the sketch crate.

Clean up configuration for components

Clean up core/config.rs, and simplify the configuration that is passed through PreProcessor and Processor.
Think about value consumption so that duplicated configuration can just be passed by reference, etc.
If the config struct is somewhat unnecessary, try to simplify it.

IMO it's better to consider parsing configuration from something like a TOML file rather than from env (a sketch follows below).
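
A minimal sketch of a TOML-backed configuration, assuming the toml and serde crates; the field names are illustrative, loosely based on the chain id / rpc url parameters mentioned in the cli issue above:

use serde::Deserialize;

/// Illustrative config shape; adjust fields to whatever PreProcessor/Processor need.
#[derive(Debug, Deserialize)]
pub struct Config {
    pub chain_id: u64,
    pub rpc_url: String,
    pub output_path: Option<String>,
}

impl Config {
    /// Load the config from a TOML file, e.g. `hdp.toml`, instead of env vars.
    pub fn from_file(path: &std::path::Path) -> anyhow::Result<Self> {
        let raw = std::fs::read_to_string(path)?;
        Ok(toml::from_str(&raw)?)
    }
}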
