mystenlabs / narwhal Goto Github PK
View Code? Open in Web Editor NEWNarwhal & Tusk are a high throughput mempool & consensus, used in the Sui smart contract platform
License: Apache License 2.0
Narwhal & Tusk are a high throughput mempool & consensus, used in the Sui smart contract platform
License: Apache License 2.0
Epic: #44
In order to allow our customers to interface with our mempool we need to introduce a series of endpoints. One of those is the read_causal
which will takes a collection digest and returns the collection digests encountered in a BFS of the graph, up to a commit boundary. On high level this endpoint will something like:
read_causal( collection_boundary: Digest ) -> Vec<Digest>
The behaviour of this endpoint on high level will work like this:
Parent story: #67
Currently the block_waiter
returns an error when a certificate is not found. As part of this task the block_waiter
should be modified in order to leverage the block_synchronizer
to fetch the missing blocks. The process will be best effort since we might still fail to fetch the missing blocks. To ensure that we avoid unnecessary latency, the block_waiter
should continue the normal processing on fetching the found the blocks and continue fetching the retrieved blocks from the block_synchronizer
once they become available.
In order to improve the block_remover
robustness, we want to retry a few times to fetch a batch from a worker in case the request has been failed or not response has been received in time.
As part of this task this logic should be implemented on the block_remover
. The number of retries should be configurable.
Story: #67
Following the comments on the PR #66 , we should consider whether we should stick with the current storage model , meaning storing the WorkerMessage::Batch(Batch)
serialised in database, or just store the Batch
directly to allow us directly retrieve the required quantity.
Points that should be taken into account:
WorkerMessage::Batch(Batch)
as is to avoid the serialisation costThis is a sub-part of #44 .
We should expose the ability for the primary to read the contents of a collection (a Narwhal block) locally.
The transactions for a block:
Reading the collection should therefore involve mapping all of this in reverse:
An extension of this issue will include a cache at the primary, see #44.
Upstream issue:
facebookresearch/narwhal#13
Related issue
MystenLabs/sui#5184
Note:
Worker fault tolerance will require data-tracking at the primary, and C&C for replication of worker's data at other workers. It would be nice if it could extend the rails built here.
A new macro has been introduced as part of the typed-store
to allow us reopen multiple column families at once and avoid code duplication.
As part of this issue, the macro reopen!
should be used instead of handling the reopen of column families one by one.
For example, a place where this could be used:
Lines 123 to 141 in 6b08995
#11 introduces a typed DB interface that supports keying a key-value map by ColumnFamily
This issue is about refactoring the general storage interface:
https://github.com/MystenLabs/narwhal/blob/main/store/src/lib.rs
So that it acts as an asynchronous front-end to the typed DB interface.
Parent story: ##67
As part of this task it should be implemented the logic on the primary nodes to handle the requests for fetch the missing certificates. It is reminded that the first step of the protocol is for a primary node to ask their peers for a list of certificate ids (block ids) . Then the peers should respond back with the certificate them selfs (or empty for the ones that are not found).
Story: #67
At the moment there is no way to pass error messages across the network communication between nodes (either primary <-> primary, worker <-> worker, primary <-> worker) . So far there wasn't any need to propagate any errors between nodes. However, as we progress with features that demand some request-response communication, this becomes necessary.
As part of this issue it should be provided a common way to wrap and send errors across the network for any node communication. As reference is provided the way that Sui handles errors and the corresponding SuiError type.
We store many digests as hashes deduced from their contents, where the deduction is driven by an impl<T> Hash for T
.
This applies for votes, certificates, headers ... It would be good to be able to differentiate between a Certificate Digest
and a Vote Digest
. Using phantom types, we should be able to do this at no runtime cost.
Currently on the repository Tusk is running by default without having the option to disable it. All the information is kept on its internal DAG. However, not all of our customers will want to run Narwhal with Tusk - in fact at the moment no-one will immediately want to. For that reason we should refactor the codebase to:
read_causal
requests #69 and remove_collections
#70 . Those 2 issues are basically blocked until this work is finished firstOn a high level we should aim to:
The Celo implementation supports BLS12-377 for validator signatures only : https://docs.celo.org/validator-guide/summary
Sommelier uses Tendermint Core, which supports either of Ed25519 signatures or Secp256k1 ECDSA signatures: https://docs.tendermint.com/v0.35/spec/core/encoding.html#public-key-cryptography
Our implementation currently considers only one variant : Ed25519 signatures. As a consequence the code base will need enough crypto agility to support signing with multiple types of keypairs.
This umbrella issue aims at referring further issues on the topic.
Prior APIs that allow this:
https://github.com/diem/diem/blob/main/crypto/crypto/src/traits.rs (though YMMV)
This task is part of the light client story #95
The subscriber handler deals with errors with a lot of ad-hoc prints and breaks.
We encode and decode key material in base 64:
Lines 43 to 58 in 822acf2
Using base64ct:
https://docs.rs/base64ct/latest/base64ct/f
We should check base64ct is using (or can be configured to use) canonical encodings as well to avoid the information leakage through padding.
#85 introduces a DAG but special cases leaves as not removable / compressible.
This hinders actual removals of nodes from the dag at the end of a parent-pointer chain (suffix removals).
On the coat tails of #85, Implement:
We have now used the DBKey
and DBBatch
to implement persistence in fastnft. Some feedback:
multi_get
, it is used often enough.Along with the certificate DB:
Lines 281 to 284 in 46f2457
The only state that the consensus needs to recover is the last_voted_round table:
https://github.com/MystenLabs/narwhal/blob/main/consensus/src/lib.rs#L32
This fixed-sized element can be recovered from a memory-mapped file.
This PR should implement:
Epic: #44
In order to ensure that:
we are looking to introduce an endpoint that will allow a validator to remove the specified collections both from the internal DAG
and the data store. At the moment more details can be found on the doc here.
The endpoint on high level will look like:
remove_collections(collection_refs: Vec<CollectionDigest>) -> Result((), Error)
where the collection_refs
is a vector with the collection digests that should be removed.
Given that the majority of the spawned tasks on the codebase do not return a JoinHandle
, I believe it would make sense to open a ticket and refactor them, what do you think?
Originally posted by @akichidis in #77 (comment)
The current block waiter fetches transactions' data in batches in an implicit (but deterministic) order. We should make this order explicit and choose a way to order them. We should also think of the best place to order those batches.
The async typed storage API is very nice and we can use it in other projects too. It would be nice to provide a WriteBatch functionality that allows for multiple writes / deletes potentially across many Column Families to be executed atomically. This is low priority since for the moment we do not need this feature in Narwhal / Tusk.
Parent story: #67
When the get_collections
request is received, it is possible for some of the requested collections to be missing from the validator node that servers the request. On the first iteration we have handled this by emitting back an error message that says the collection is missing. However, this is not acceptable as it can block the consensus protocol. Legitimate and valid certificates that are missing from a validator node should be fetched instead and populated internally before responding back to the get_collections request. If there is no way to retrieve them, then an error will be returned.
On this design doc 4 different options have been presented. We are going to implement option 1 for now:
https://docs.google.com/document/d/1DpwOPgI-YSIlt2SrQaT3ioF6G-rlNJoYtHls3eZogkw/edit#heading=h.rgoyj9yf0kb8
As part of this task we should focus on implementing the want-block protocolling for fetching the missing certificates. Syncing the missing batches is part of another issue. The blocker_waiter
should be extended to:
certificate_store
)It has to be noted that we should allow fetching the batches for the certificates that have already be found to ensure that we can work in parallel while we are fetching the missing ones.
Most blockchains have a mempool size limit which they use as backpressure.
This is necessary because as a point of decentralization most blockchains' block are bound by a size limit and a block going over the limit is considered invalid.
Tusk commits, through
can consist of a large amount of blocks, and, transitively, transactions. Even if we assumed we had a way to price those transactions in gas ahead of time (which often requires an execution attempt "in real life"), we do not have a way to present the contents of the mempool under a certain size limit. Would it be possible to design one?
/cc @gdanezis
Story: #74
As part of this task we should be able to disable Tusk and just run a primary node without any consensus algorithm. We should also try to run a test how the server will respond. A command line argument should be introduced that will enable/disable the Tusk (consensus). When disabled, the produced certificated should still be consumed but basically not processed.
The sanitize_header
function processes all headers received by a primary:
Lines 328 to 340 in 03e91f8
This block round is managed using an AtomicU64
in the core's fields:
Line 44 in 03e91f8
A simple but large integer interval between a received header's claimed round and the current round should be enough to avoid spam (which is the main goal of this issue). Determining what "large enough" looks like will require analyzing how fast the block round evolves over time. The idea is that a small gap in round numbers is acceptable (though the header will not be voted on, it is not invalid, and post verification should contribute to an increase of the node's own round number).
The error produced here should clearly indicate the cause of the rejection, and produce a new variant in DagError
:
Line 27 in 03e91f8
It should also emit logs, ideally with context allowed since the switch to tracing (#26).
Story: #67
On the first iteration we were aiming to retrieve only one collection for simplicity. As part of this task we want to improve the functionality and make the endpoint receive a list of digests (collection ids) and return back the corresponding collections. That will ensure we save roundtrips and improve the overall performance.
Expose an API for Narwhal to communicate with an external execution layer. This API should be general enough to support a wide variety of execution layers, but two known clients are:
One important factor is the interaction with XO vs. OX models of executions (see: https://arxiv.org/abs/1906.11229 for toponymy)
See Google Doc
The block remover tests may execute concurrently and create databases conflicting for the same path
One solution would be to make the tests operate in different crates using https://crates.io/crates/tempfile or whatever we use in the code base already.
Run the block remover tests
running 2 tests
test src/block_remover.rs - block_remover::BlockRemover (line 94) ... FAILED
test src/block_waiter.rs - block_waiter::BlockWaiter (line 122) ... ok
failures:
---- src/block_remover.rs - block_remover::BlockRemover (line 94) stdout ----
Test executable failed (exit code 101).
stderr:
thread 'main' panicked at 'Failed creating database: RocksDBError("IO error: While lock file: /tmp/LOCK: Resource temporarily unavailable")', src/block_remover.rs:25:10
stack backtrace:
0: rust_begin_unwind
at /rustc/ee915c34e2f33a07856a9e39be7e35e648bfbd5d/library/std/src/panicking.rs:584:5
1: core::panicking::panic_fmt
at /rustc/ee915c34e2f33a07856a9e39be7e35e648bfbd5d/library/core/src/panicking.rs:143:14
2: core::result::unwrap_failed
at /rustc/ee915c34e2f33a07856a9e39be7e35e648bfbd5d/library/core/src/result.rs:1785:5
3: core::result::Result<T,E>::expect
4: rust_out::main::{{closure}}
5: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
6: <core::pin::Pin<P> as core::future::future::Future>::poll
7: tokio::runtime::basic_scheduler::CoreGuard::block_on::{{closure}}::{{closure}}::{{closure}}
8: tokio::coop::with_budget::{{closure}}
9: std::thread::local::LocalKey<T>::try_with
10: std::thread::local::LocalKey<T>::with
11: tokio::runtime::basic_scheduler::CoreGuard::block_on::{{closure}}::{{closure}}
12: tokio::runtime::basic_scheduler::Context::enter
13: tokio::runtime::basic_scheduler::CoreGuard::block_on::{{closure}}
14: tokio::runtime::basic_scheduler::CoreGuard::enter::{{closure}}
15: tokio::macros::scoped_tls::ScopedKey<T>::set
16: tokio::runtime::basic_scheduler::CoreGuard::enter
17: tokio::runtime::basic_scheduler::CoreGuard::block_on
18: tokio::runtime::basic_scheduler::BasicScheduler::block_on
19: tokio::runtime::Runtime::block_on
20: rust_out::main
21: core::ops::function::FnOnce::call_once
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
failures:
src/block_remover.rs - block_remover::BlockRemover (line 94)
test result: FAILED. 1 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 17.37s
error: test failed, to rerun pass '--doc'
We would like to make sure the customer can implement a priority-based block proposal.
This is described in detail in this talk: https://drive.google.com/drive/u/1/folders/1xf8wJdQTsN_nvlMMOOGztIDdg81EAatJ
read_causal
to make complete graph walks in between two remove_collections
calls (the later signals committed data that should move off the mempool),remove_collections
call therefore needs to remove suffixes of the graph, to avoid graph cuts,remove_collections
call that remove the storage of transactions but keeps the headers,At the moment, Narwhal only verifies its own transactions and votes.
This is problematic because in production the payload of those messages is itself a set of transactions (whether received through a mempool or directly by workers), and those should at minima have their signatures verified before hitting the proposer.
We should support the following as user signatures:
But that also includes subtleties like EIP-155
(see also CIP-35)
reminder: go has a single type, ECDSA, to represent public keys on both secp256k1 and secp256r1, and dynamically injects the correct verification impl at runtime. ↩
Parent story: #67
Following the implementation of fetching the missing certificates from peer nodes #97 , as part of this task we should focus on syncing the missing batches. The approach that should be followed is the one described in Option 1 on the design doc.
We should reuse the existing mechanism to sync batches but with the difference that we'll extend the existing solution to dictate the peer from which we want to fetch the batches from, as on the current solution the batch is fetched via the original author of the header. Also, as part of this ticket we should also extend the block_waiter
to wait for the batches to be synced before going ahead and requesting them to eventually fulfil the request.
We already have identified two missing bits of our cryptographic inventory story:
This umbrella issue is to map out the inventory of the cryptographic building blocks we need as a minimalistic MVP of Narwhal:
The lighter our "footprint" the better: an anti-example of this design task would be to require a random beacon fed by a Joint-Feldman DKG, which:
this in fact maps the "forever" constraint that in so far as we integrate Narwhal/Tusk in other chains / code bases, we'll need to operate within their validators' crypto tooling environment. ↩
They should all be updated to include (first line only where a pre-existing stanza referring to FB exists):
// Copyright(C) 2021, Facebook, Inc. and its associates
// Copyright(C) 2022, Mysten Labs
// SPDX-License-Identifier: Apache-2.0
On the current implementation when a worker forms a new batch it broadcasts it to the worker nodes of the same id of other validators. That assumption is propagated across the codebase - meaning that irrespective of which primary node we query, we assume that a specific batch will be found to the worker of same id (which is basically dictated by the propagated header)
The above assumption is making scaling quite restrictive and eventually we want it removed. On our network our primary node should keep the mapping it self to which local worker a batch has been stored and basically decouple the storying/query semantics from the ones that the author produces (worker id). It has to be noted that a batch is identified by two parameters: the batch digest and the worker_id (author). We still want to be able to identify batches based on both those two properties, but for our primary node we should be able to translate that to which local worker the batch has been stored.
This task is part of the light client story #95
Right now only the subscriber core limits the number of subscribers. Our network is however a nice DoS target.
This is a sister issue of MystenLabs/sui#382 (which is cross-repos)
We would need some light client support to integrate Narwhal into other products. That is, allow other systems to use Narwhal & Tusk as a black-box sequencer.
A (crude) example of what would be nice to have is the following: https://github.com/MystenLabs/sui/blob/5b112c50eb326c29b2e9db8bf483cc4c638fdf07/test_utils/src/sequencer.rs#L247
This issue is part of the light client story #95
The subscriber core can handle a fixed (maximum) number of light clients. Right now, when this maximum is reached, additional light clients' connections are simply refused. It would be nice to instead send them back an error message.
On various places in our codebase we have the need to construct request/look up keys that are based on some other arbitrary (but deterministically defined) data. One of the examples is on the block_remover
component the construct_request_key
method which forms a key based on provided block (certificate) ids. Those ids are used to form a unique request id which is further used on map look ups. This is now done in a bit naive way by just concatenating the id bytes, which is not ideal since:
As part of this task we want to define a way of creating more concise keys.
Note: Ideally we would like to implement a proper request/response pattern so we don't have to handle this logic on the core application level.
Epic: #44
In order to allow our customers to integrate with Narwhal, one of the endpoints that we would like to introduce is the get_collections
. As part of this story we are looking to expose an endpoint that will allow our clients to retrieve the transactions for a list of given headers / certificates. We'll call this a collection
.
Currently we do have a way to deploy Narwhal to Amazon and run a benchmark, which is definitely great and helpful. However, we are lacking the ability to have a constantly running network of Narwhal nodes in a way that "staging" (or development) environments do in a more traditional enterprise setup. Ideally, we would like to deploy in a (or a set of) AWS Amazon nodes a few Narwhal nodes in a primary-worker setup and let them run uninterrupted. Ideally we would like to have a nice CI/CD way to deploy after merging to a staging branch - or for simplicity we could just do it after the main
branch.
Why is this important?
Approach
On its simplest form we would like to be able to trigger a (re)deployment after a CI/CD run on a main
or staging
branch. I am aware that things can get quite complex and certainly we are not looking for SRE level of work here, but something simple enough and robust that will allow us to deploy Narwhal in AWS Amazon. I would suggest looking into the following direction (some thoughts):
Docker images
for our Narwhal node that will allow us to run either as primary or workerAWS ECR
ECS
to deploy the nodes. Alternatively, if we want to really abstract the deployment and node management process , we can use something like AWS Fargateprimary --> workers
node. We can expand to multiple primaries network as follow upCloudwatch
would be an extra bonus hereParent story: #70
To remove collections from a Narwhal node, for each collection the following should happen on primary (ignore order for now):
certificates_store
headers_store
payload_store
batch_store
to achieve the above a mechanism / module should be build to orchestrate the deletion of all those resources. As part of this task we should focus mainly on the part of orchestrating the deletion of batches from the worker nodes. On high level the primary should:
then the results can be used on further processes to perform deletions on the primary node storage it self.
Dealing with the actual deletion of the batches from the worker nodes should be addressed on different issue.
Parent story: #70
Following the work on #80 , as part of this task the mechanism should be built in Worker nodes to actually delete from their internal store batch_store
, the batches dictated by the received message. The worker node should reply back to the primary with the successful response or error of the operation.
It has to be noted that the operations on the storage should be atomic to guarantee that either all or none will succeed (see multi_remove )
The consensus crate is growing and there are now a number of testing facilities spread across the unit test files (e.g., consensus_tests.rs
, subscriber_tests.rs
, etc). These testing facilities are used multiple times in various test files (not only in the one where they are defined). It would great to group all facilities into a single file common.rs
.
This task is part of the light client story #95
The way the subscriber works increases the latency of commit since we wait for the consensus to output in order to start downloading batches at the executor. Can't we use the DAG to start earlier than consensus and ideally be ready to execute the moment the consensus outputs? From @LefKok
Epic: #44
In order to ensure that:
Our customers will only receive collections that haven't already consumed and proposed (or discarded)
We reclaim space in Narwhal nodes
we are looking to introduce an endpoint that will allow a validator to remove the specified collections both from the internal DAG and the data store. At the moment more details can be found on the doc here.
The endpoint on high level will look like:
remove_collections(collection_refs: Vec<CollectionDigest>) -> Result((), Error)
where the collection_refs is a vector with the collection digests that should be removed.
Our customers want to integrate with Narwhal as a mempool first, which requires their consensus modules, co-located with a primary, to be able to query the mempool and update it with the outcome of consensus. Note: in this context, we call Narwhal blocks "collections" (because the consensus has a lock on the word block).
We have identified this will require at least the following API, exposed by Narwhal to consensus:
read_causal( bd: Digest ) -> Vec<Digest>
takes a collection digest and returns the collection digests encountered in a BFS of the graph, up to a commit boundary (see also MystenLabs/sui#5187 for the current implementation of this walk).get_collection(Digest) -> Vec<Transaction>
takes a collection digest and returns the transactions contained in the collection.drop_causal_read( bd: Digest )
takes a collection digest and registers it to update the "commit boundary" of the aforementioned BFS, thereby making sure the collections in question are no longer considered valid for proposal. This should be called by the co-located validator's consensus module when voting for such a proposal.drop_collections( bd: Vec<Digest> )
takes a collection digest and drops the contents of the collection from storage. This should be called by the co-located consensus module when witnessing a commit, and thereby enshrining the content of a block (proposed beforehand through collections) into some ledger storage.The present umbrella epic is about building the API in question. It involves three broad areas of work:
More details here:
https://docs.google.com/document/d/1BuD7JaZxes8UtWOJZd8D2ZJkcj7tqYSj4601ffuceTg/edit?usp=sharing
Following the work on the PR , we should introduce the necessary functionality to the worker nodes in order to be able to process the received RequestBatch
messages and respond back to the primary with the results.
The RequestBatch
message (received over the TCP communication) is basically dictating to a worker node to retrieve from its internal store a batch by the provided id. Then the worker should respond back to the primary node with a RequestedBatch
piggy backed with the transactions fetched from the retrieved batch.
If the requested batch
has not been found (although not really expected this to happen), then the worker should respond with an error of batch not found
.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.