concordium / concordium-node

The main concordium node implementation.

License: GNU Affero General Public License v3.0

Shell 0.95% Haskell 84.01% Rust 14.51% Dockerfile 0.13% Makefile 0.01% PowerShell 0.07% C 0.01% Objective-C 0.20% AppleScript 0.11%
blockchain blockchain-node concordium haskell rust

concordium-node's Introduction

concordium-node


This repository contains the implementation of the Concordium P2P node with its dependencies and auxiliaries. The node is split into two parts:

  • concordium-consensus is a Haskell package that contains the implementation of the consensus with its dependencies. This includes basic consensus, finalization, scheduler, implementations of block and tree storage, and auxiliaries.
  • concordium-node is a Rust package containing a number of executables, the chief among them being concordium-node.rs, the program that participates in the Concordium network and runs consensus, finalization, and other components. It uses concordium-consensus as a package, linked either dynamically or statically depending on the build configuration. The main feature added by concordium-node is the network layer.

The auxiliary packages are:

  • collector The collector is a service that queries the node for some information and publishes data to the collector-backend. The collector runs alongside the node.
  • collector-backend The collector backend listens for data from the collectors and serves a summary of it. This component is used by the concordium-network-dashboard to display the network overview.
  • macos_logger_wrapper provides a wrapper around the macOS logging facility using os_log_create. It is used by both the node and the collector so that the macOS distribution package logs to the system logging service.

Submodules

concordium-base is a direct dependency of both concordium-consensus and concordium-node. Because concordium-base is also used by other components, it is kept in a separate repository and brought in as a submodule.

The concordium-grpc-api is a simple repository that defines the external GRPC API of the node, in terms of .proto files. Because this is used by other components, it is also a small separate repository brought in as a submodule.

Remember to clone recursively, or run git submodule update --init --recursive after cloning this repository or after changing branches.

Configurations and scripts

  • The jenkinsfiles directory contains Jenkins configurations for deployment and testing.
  • The scripts directory contains a variety of bash scripts, Dockerfiles, and similar, to build different configurations of the node for testing and deployment.

Building the node

See concordium-node/README.md.

Contributing

To contribute, start a new branch from main, make your changes, and open a merge request. A person familiar with the codebase should review the changes before they are merged.

Haskell workflow

We typically use stack to build, run, and test the code. In order to build the Haskell libraries, the Rust dependencies must be pre-built; this is done automatically by the Cabal setup script.

Code should be formatted using fourmolu version 0.13.1.0 with the config fourmolu.yaml found in the project root. The CI is set up to ensure the code follows this style.

To check the formatting locally run the following command from the project root:

On unix-like systems:

$ fourmolu --mode check $(git ls-files '*.hs')

To format run the following command from the project root:

On unix-like systems:

$ fourmolu --mode inplace $(git ls-files '*.hs')

Lines should be at most 100 characters; naming and code style should follow the conventions that already exist.

We do not use any linting tool on the CI. Running hlint might uncover common issues.

Rust workflow

We use the stable version of Rust (1.73) to compile the code.

The CI is configured to check two things:

  • the clippy tool is run to check for common mistakes and issues. We try to have no clippy warnings. Sometimes code that clippy flags is in fact necessary; in that case you should explicitly disable the warning at that site (a function or module), e.g. #[allow(clippy::too_many_arguments)] (see the sketch after this list), but that is a method of last resort. Try to resolve the issue in a different way first.

  • the rustfmt tool is run to check the formatting. Unfortunately the stable version of the tool is quite outdated, so we use a nightly version, which is updated a few times a year. Thus, in order for the CI to pass, you will need to install the relevant nightly version (see the rustfmt job in .github/workflows/build-test.yaml and look for nightly-...).
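To illustrate the first point, a warning can be silenced at a single site as in the following minimal sketch; the function and its parameters are purely hypothetical, not node code.

#[allow(clippy::too_many_arguments)]
fn connect_peer(
    host: &str,
    port: u16,
    max_peers: usize,
    timeout_ms: u64,
    retries: u32,
    use_proxy: bool,
    log_level: u8,
    node_name: &str,
) {
    // clippy::too_many_arguments fires at eight or more parameters by default;
    // the attribute above silences it for this function only.
    println!("connecting {} to {}:{}", node_name, host, port);
    let _ = (max_peers, timeout_ms, retries, use_proxy, log_level);
}

fn main() {
    connect_peer("198.51.100.1", 8888, 50, 5_000, 3, false, 2, "my-node");
}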

concordium-node's People

Contributors

abhconcordium, abizjak, amaurremi, andreaslyn, annenkov, bargsteen, bisgardo, chrmatt, eb-concordium, fdibat, jasagredo, kryptomouse, lassemand, limemloh, lottekh, mh-concordium, milkywaypirate, mkmks, nadimkobeissi, omahs, rasmus-kirk, rimbi, smh1001, soerenbf, td-concordium, td202, thahara, tschudid, vikt0r0


concordium-node's Issues

Replace integer-simple static libraries with integer-gmp

Description

Currently the linux nodes use static libraries built using the integer-simple variant of the Haskell compiler.
This was historically necessary due to licensing issues (GMP being LGPL licensed).

The main downside of this is abysmal performance compared to GMP based integer libraries.

Seeing that the node is now fully open-source, the main reason for using integer-simple no longer applies.

Thus we should:

  • update the static library build scripts to use the integer-gmp variant of the compiler (we still need the custom-built compiler since it has to be built with -fPIC support for the base libraries)
  • update all users of the static library scripts. Likely a gmp package will need to be installed, either in the Docker images we build, or added as a dependency to the .deb packages.

Revise GRPC APIs to account for protocol updates

Task description

The current GRPC APIs do not expose information about protocol updates that have occurred. They should be revised accordingly.

Sub-tasks

  • Identify what should be exposed and how.
  • Implement the change in the node.
  • Update dependent components as necessary.

Stale connections may linger

Bug Description

A node may erroneously consider itself connected to a peer for an extended period after the connection was dropped. This prevents the node from reconnecting.

Steps to Reproduce

  • Run a node on stagenet (or a small network) so that it will connect to all available nodes.
  • Interrupt the network connection (e.g. disconnect from VPN)
  • Shut down the node
  • Restart the node
  • Reconnect the network connection

Expected Result

The node should be able to reconnect to its peers within a short amount of time.

Actual Result

Reconnecting to peers takes >10 minutes. The peers show as connected to the node, but the node does not show as connected to them on the network dashboard.

Versions

  • Software Version 1.1.1

Reuse transaction database connection pool on protocol update

Task description

Currently, on protocol update, each protocol version will have its own connection pool to the transaction database (if transaction logging is enabled). This does not seem to cause issues, but it would likely be preferable to reuse the connection pool across updates.

Native Mac Node: Remove the need to enter a name when a node is not going to be run anyway.

Description
When running the installer for the native Mac node, the installer requires names in both the Mainnet and Testnet name fields, even if one of the networks is not going to be used.

It is a bit confusing that you have to enter a name for a node which is not going to be started anyway.


Steps to Reproduce

  • Download installer
  • Browse through the process until the screen with the Mainnet + Testnet setup is shown.
  • Deselect all checkmarks for one of the nets. Click continue.
  • The installer will throw an error.

Expected Result

  • The installer lets the user continue without an error if all checkmarks and the name field are left empty for one net.

Actual Result

  • The installer gives an error if one of the name fields is empty, even if all checkmarks of the given net are deselected.

Versions

  • Software Version: 1.1.1
  • OS: MacOS 11.4

Node crashes when queried at inopportune times.

Bug Description

When a node is queried for any data that requires LMDB database access during finalization, it tends to crash with a corrupt database, leaving only the option of deleting the database.

This happens because we are using the LMDB API somewhat incorrectly: mdb_env_set_mapsize (http://www.lmdb.tech/doc/group__mdb.html#gaa2506ec8dab3d969b0e609cd82e619e5) requires that there are no outstanding database transactions when it is executed.

Steps to Reproduce

It is hard to reproduce deterministically without modifying the node, but it can be reliably reproduced by running queries getBlocksAtHeight + getBlockInfo + getBlockSummary in sequence by increasing block height while the node is catching up on the testnet. Likely just running getBlocksAtHeight would do the same.

This leads to a segmentation fault; analyzing the core dump gives the following backtrace:

[Current thread is 1 (Thread 0x7fb2a0b86700 (LWP 3313041))]
(gdb) backtrace
#0  0x00005644d2edd6fd in mdb_txn_renew0 ()
#1  0x00005644d2ede423 in mdb_txn_begin ()
#2  0x00007fb2a9a78d52 in ?? () at hsrc_lib/Database/LMDB/Raw.hsc:855

Expected Result

The node does not crash as a result of external queries.

Actual Result

The node crashes.

Versions

  • Software Version: 1.1.2 (but I think any version starting with 1.0.0 is affected)

`node_info` reports incorrect state upon an unrecognized protocol update

Bug Description

Upon an unrecognized protocol update the baker stops, but this is not reflected in the status.

Steps to Reproduce

  1. Run a node, send a protocol update with an unrecognized specification hash.
  2. Query concordium-client raw GetNodeInfo.

Expected Result

Baker running should be false since the baker has stopped at that point. Finalization committee status is reported correctly as False.

Actual Result

Baker ID: 0
Peer type: "Node"
Baker running: True
Consensus running: True
Consensus type: "Active"
Baker committee member: NodeInfoResponse'ACTIVE_IN_COMMITTEE
Finalization committee member: False

Versions

  • Software Version: 1, 1.1

Optimize and improve the use of LMDB database

Description

Upon each finalization the node (in the persistent state configuration) currently performs at least two LMDB transactions: the first stores the finalization record via addFinalization, and then one is performed for each block that is finalized via markFinalized. If there are transactions in a block an additional DB transaction is performed to mark all the contained transactions as finalized.

This has a significant effect on catch-up performance when there is low load on the chain. Block import is dominated by database operations; in particular, fsync (or fdatasync), which is called on each transaction, seems to dominate the runtime.

With a bit of change we could reduce the number of database transactions we perform per finalization to 1, which would improve initial catch-up by about 30% when most blocks are empty (this is based on benchmarks on a single machine, so the exact number may vary). A sketch illustrating the idea is at the end of this issue.

Reducing the number of DB transactions would also improve the robustness of block and transaction storage:

  • all blocks that are finalized would be written in a single transaction. Currently they are written by decreasing height, i.e., if block at height 30 is finalized and the previous finalized block is at 27 then the order of writes is block at height 30, then at 29, then at 28.
  • since they are written by decreasing height it can happen that we have a block at height n but not at height n-1 if the node crashes at an unfortunate time
  • since transactions are marked as finalized in a separate transaction (and by increasing block height) it can happen that all blocks and the finalization record that finalizes them are written, but transactions are not marked as such.

Some things need to be considered.

  • is there a limit on transaction size in LMDB?
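The node's storage layer is written in Haskell, so the following is purely an illustration of the single-write-transaction idea, sketched in Rust with the lmdb crate; the database names, keys, and values are hypothetical.

use lmdb::{DatabaseFlags, Environment, Transaction, WriteFlags};
use std::fs;
use std::path::Path;

fn main() -> Result<(), lmdb::Error> {
    fs::create_dir_all("./finalization-db").expect("could not create database directory");
    let env = Environment::new()
        .set_max_dbs(2)
        .open(Path::new("./finalization-db"))?;
    let blocks = env.create_db(Some("blocks"), DatabaseFlags::empty())?;
    let finalization = env.create_db(Some("finalization"), DatabaseFlags::empty())?;

    // One write transaction covers the finalization record and every block it
    // finalizes, so a crash leaves the database either fully updated or untouched,
    // and only one fsync is paid per finalization.
    let mut txn = env.begin_rw_txn()?;
    txn.put(finalization, b"record-0012", b"<finalization record bytes>", WriteFlags::empty())?;
    for height in 28u64..=30 {
        txn.put(blocks, &height.to_be_bytes(), b"<block bytes>", WriteFlags::empty())?;
    }
    txn.commit()
}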

Have the node read environment variables directly

The node currently mostly accepts parameters via command-line arguments.

Because of this there is usually a start script associated with the node that essentially only maps environment variables to command-line arguments and starts the node. One such script is used in the debian package, another for docker images.

We should have the node read environment variables directly. This is essentially trivial to achieve with the structopt crate we already use, and it will greatly simplify the start scripts, or in some cases eliminate them entirely.

The effect of this change will be

  • simplified node distribution and deployment
  • easier and less error-prone updates of parameters

The following needs to be done:

  • for each command-line option, add an env = "..." argument (see the sketch after this list).
  • update start scripts to use env vars directly instead of mapping to command line arguments
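A minimal sketch of what this looks like with structopt; the option name, environment variable, and default below are hypothetical, not the node's actual flags.

use structopt::StructOpt;

#[derive(StructOpt, Debug)]
struct Config {
    /// Port to listen on; can be given as --listen-port or via the
    /// CONCORDIUM_NODE_LISTEN_PORT environment variable.
    #[structopt(long = "listen-port", env = "CONCORDIUM_NODE_LISTEN_PORT", default_value = "8888")]
    listen_port: u16,
}

fn main() {
    // Command-line arguments take precedence over the environment variable.
    let config = Config::from_args();
    println!("listening on port {}", config.listen_port);
}

With something like this in place, a systemd unit or Docker image can set the environment variable directly and the start script reduces to invoking the binary.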

More informative endpoint responses

Task description

Some endpoints, such as SendTransaction, provide very little information when the action fails. At the very least we could log a warning with the exact reason for rejection on the node's end, but ideally the response to the client would include the reason as well, and not just a boolean.

We must maintain backwards compatibility though, so care must be taken.

Once the node has better error reporting, if the API has been extended, add the handling to concordium-client as well.

Remove serde serialization of network messages

Task description

We currently have serde_cbor and serde_msgpack variants of network message serialization. They are behind features and have never been used other than for benchmarks a long time ago.

They incur a maintenance burden so we should remove them. They are almost certainly in disrepair. Flatbuffers works well for us.

Block state hash reported as "000...0" for genesis blocks

Bug Description

The reported block state hash for genesis blocks is "0000000000000000000000000000000000000000000000000000000000000000".
Instead, the correct block state hash should be reported.

Steps to Reproduce
Call GetBlockInfo for the genesis block.

Expected Result
The actual hash of the genesis block state.

Actual Result
"0000000000000000000000000000000000000000000000000000000000000000".

Versions

  • Software Version 1.x

Average throughput is not maintained correctly

Bug Description
The send/receive throughput reported via the GRPC APIs seems to be inaccurate, and generally reports 1 bps sent/received.

Steps to Reproduce
Run a node. Observe the average sent/received values in the node dashboard.

(Note, as far as I can tell, these stats are not currently available through a query from concordium-client query.)

Expected Result
Reasonable values > 0.001 kB/s.

Actual Result
0.001 kB/s.

Versions

  • Software Version: 0.7.2
  • OS: Windows

Relatively high memory use per contract instance

Creating 100000 smart contract instances from the same module, whose states are simple integers, leads the node to use around 1GB of memory.

Restarting decreases the memory use to about 700MB. Block state stays reasonably small, around 42MB.

This means around 7-10kB of additional RAM per trivial smart contract instance, which seems excessive given that all contracts are created from the same module (whose size is around 13kB).

It would be useful to both understand where this is coming from, and hopefully reduce the overhead, especially since it persists after restart.

I do not think this is high priority, but I wanted to record the data and observations.

Replace failure with anyhow

Task description

The failure crate is deprecated. Its functionality is largely replaced by anyhow, but there are some minor differences in error handling; a sketch follows the sub-task list below.

Sub-tasks

  • replace failure with anyhow + thiserror in concordium-base/rust-src
  • replace failure with anyhow in the node
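A minimal sketch of the anyhow style; the file name and checks are hypothetical. Result<T> is shorthand for Result<T, anyhow::Error>, .with_context attaches a message much like failure's context did, and bail! returns an ad-hoc error.

use anyhow::{bail, Context, Result};
use std::fs;

fn read_config(path: &str) -> Result<String> {
    let contents = fs::read_to_string(path)
        .with_context(|| format!("failed to read configuration file {}", path))?;
    if contents.trim().is_empty() {
        bail!("configuration file {} is empty", path);
    }
    Ok(contents)
}

fn main() {
    if let Err(err) = read_config("node.toml") {
        // The alternate formatter prints the error together with its context chain.
        eprintln!("error: {:#}", err);
    }
}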

Streamline baker key registration/changes on the node

Task description

It is currently cumbersome both to register as a baker and to change keys. It involves node restarts and copying files to a specific place. concordium-client and other tools (e.g., the desktop wallet) already handle baker keys when sending transactions, so they could update the node as well.

The node already supports some baker management commands, such as start and stop baker. They are not really used much at this point, but they could be revised to support key management as well.

Subtasks

  • Revise the node API/internals to support registering a baker. It is currently not possible to switch a node from a passive configuration to an active one, so registering a baker might be more work than we are willing to spend. But changing keys does not have this restriction.
  • Use the new API to control the baker from concordium-client.

Out of band catchup in debian packages

Task description

Currently the Debian packages do not integrate out-of-band catch-up. For the intended use case this is not an issue, but we should:

Sub-tasks

  • investigate whether the added complexity is worth it and should be integrated into the package
  • if we decide to do so, implement it

Handle corrupt database on startup

We sometimes experience database corruption from which the node currently has no recovery mechanism. This happens partly because not all state is written in a single transaction, since the node uses multiple databases. Some failures will be fixed if #91 is merged, but other issues will remain.

We need to devise and implement a mechanism for recovery. There is an obvious approach: go back in history via parent block pointers and find the last block and block state we can load. However, we need to ensure that

  • we do not delete data without backing it up first unless the user chooses to force a reset
  • we clean up the databases to establish all of the invariants global state needs to maintain. This in particular means removing blocks, transactions, transaction statuses, and possibly some of the block state.

The longer the chain becomes the more important this will be, since catching up from the start is going to become infeasible.

Regression: Expired transactions are not rejected.

Bug Description

Already expired transactions are not rejected.

This is a regression introduced by #1.

Steps to Reproduce

Send an expired transaction to the node.

Expected Result

The node rejects and does not retransmit the transaction.

Actual Result

The transaction is accepted and later on deleted when the baker bakes a block (or by the purge mechanism) since it is expired.

Versions

  • Software Version: 1.0.2+

Consider storing account maps separately from block state files

Task description

Currently account maps, mapping addresses to account indices, are stored in the block state file directly.

It might be more efficient (space wise, possibly performance-wise) if they were stored separately.
The following points should be considered

  • account maps can in principle always be recovered from the account table.
  • accounts are never deleted, so account map is always just extended.

Sub-tasks

  • Analyze if this would be the case.
  • Implement if necessary, or close.

Improve sandboxing of the debian package

Task description

The Debian package uses systemd to manage the node and the collector. The collector is effectively sandboxed: it does not have write access anywhere, and it runs with DynamicUser as well.

The node service is also heavily sandboxed: it only has read access to most of the system and limited system call availability, but it still runs as root.

The node only needs read and write access to its config & data directories. We should:

  • run the service with DynamicUser=yes
  • have config & data directories specified with StateDirectory (or similar flags if we need to set the current working directory)
  • make sure to update instructions on how to become a baker. With sandboxing it is important to make sure the node can read its keys on start. Perhaps using ConfigurationDirectory to specify that makes sense.

Performance with static libraries (in particular integer-simple)

It appears that the performance of some node operations is substantially worse (a thousand times or more) when the node is built with static Haskell libraries.

The cause of this seems to be the use of integer-simple as opposed to GMP.

This is especially noticeable when genesis time is sufficiently far in the past. The time it takes to produce the first block is many times longer (I don't have precise numbers at the moment) with a node built with static libraries compared to shared ones. The main difference between the two is the use of integer-simple vs GMP.

The reason for using integer-simple is the licensing of the distribution; however, that does not really affect us when we are running our own nodes, so we should probably use GMP-based builds.

It is unclear what effect this has on the general performance of the node, so it would be worth investigating that as well.

Node does not stop during out of band catchup

Bug Description

The node cannot be stopped in a nice way when doing out-of-band catchup. It can be killed of course, but it does not respond to SIGINT to be shut down gracefully.

This is due to the way the signal handler is installed. We install the handler that terminates the main node loop upon SIGINT, then import all blocks, and only then start the main node loop. Thus the signal is handled, but it only takes effect after all the blocks have been imported.
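A minimal sketch of the desired behaviour, assuming the ctrlc crate and a stand-in import loop (this is not the node's actual code): the handler only records the request, and the loop checks the flag between blocks.

use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

fn main() {
    let stop = Arc::new(AtomicBool::new(false));
    let stop_flag = stop.clone();
    // The handler only sets a flag; it does not terminate anything itself.
    ctrlc::set_handler(move || stop_flag.store(true, Ordering::SeqCst))
        .expect("failed to install the SIGINT handler");

    // Stand-in for the list of blocks to import out of band.
    for block in 0u64..1_000_000 {
        if stop.load(Ordering::SeqCst) {
            println!("shutdown requested; stopping after block {}", block);
            return;
        }
        // import_block(block) would go here.
    }
    println!("out-of-band import finished");
}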

Steps to Reproduce

  • Start the node with --import-blocks-from.
  • Press Ctrl+C during block import

Expected Result

The expected behaviour (from a user perspective) would be that the node shuts down in a reasonable amount of time, e.g., after importing a few more blocks.

Actual Result

The node finishes importing all blocks, then terminates.

Versions

  • Software Version: 1.0.*
  • OS: Linux

Inconsistent precision of time values

The block info query returns a number of values of interest, including blockSlotTime, blockArriveTime, and blockReceiveTime. The first one has millisecond precision (which is needed, because slot times are in milliseconds), but the latter two round down to seconds.

While this is not wrong per se, it leads to some warts in that for most blocks baked by the baker the blockArriveTime value is earlier than blockSlotTime.

I think it would make sense to unify the precision to remove these warts.

Native Mac node processes not visible in the Activity Monitor

Bug Description
For some reason the node-collector and concordium-node processes could not be found in the Activity Monitor, even though they were supposed to be shown in the list.

Steps to Reproduce

  • Install the native Mac node using the installer
  • Launch node
  • Go to Activity Monitor and check if node-collector or concordium-node can be found in the list.

Expected Result

  • The processes are shown in the Activity Monitor list, to make sure the node is running as intended.

Actual Result

  • The processes are not found in the Activity Monitor list, even though the node is running as it should.

Versions

  • Software Version: 1.1.1
  • OS: MacOS 11.4 and 11.5.2

Improve awkward ban handling

Task description

The ban handling is not entirely clean. There are two types, BanId and PersistedBanId. The main use for BanId is soft bans; the main use for PersistedBanId is long-term bans. The latter is only supported by IP.

But then BanId is repurposed as the type to use for bans supplied via the RPC interface, leading to awkward ad-hoc functions like drop_and_maybe_ban. We should improve this to remove the ad-hoc handling.

Either we should remove bans by node id completely, or introduce a new intermediate ban type (a sketch follows below).

Unban is currently only possible by IP, but the protobuf definition is more liberal, leading to error cases that should be disallowed by typing. This should also be improved.
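A hypothetical sketch of such an intermediate ban type; the names, fields, and durations are purely illustrative, not the node's actual API.

use std::net::IpAddr;
use std::time::Duration;

/// A single type covering both kinds of bans.
enum BanTarget {
    /// Short-lived, in-memory ban keyed by node id.
    SoftNodeId { id: u64, duration: Duration },
    /// Long-term ban by IP address, persisted to the database.
    PersistedIp(IpAddr),
}

/// Validate RPC input into a well-typed value up front, so the connection
/// handling code never has to guess which kind of ban it was given.
fn ban_from_rpc(node_id: Option<u64>, ip: Option<IpAddr>) -> Option<BanTarget> {
    match (node_id, ip) {
        (Some(id), None) => Some(BanTarget::SoftNodeId { id, duration: Duration::from_secs(300) }),
        (None, Some(ip)) => Some(BanTarget::PersistedIp(ip)),
        _ => None, // ambiguous or empty requests are rejected at the boundary
    }
}

fn main() {
    let ban = ban_from_rpc(None, "203.0.113.7".parse().ok());
    println!("accepted ban request: {}", ban.is_some());
}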

Cleanup obsolete code regarding how a node holds its peers

Task description

Currently the node holds its peers in Buckets, which is always hardcoded to be of size 1, so there is no way to profit from this type. As a consequence, the code is not as straightforward as it could be, which makes it harder to understand what is going on.

The Buckets type should be removed and replaced with a simpler approach focusing on what is required right now, and not at some point in the future.

[CB-1082] Protocol update

Task description
Add support for updating the chain protocol via the Renovatio mechanism.

From the perspective of a node, the update flow should be something like:

  1. Protocol update transaction is posted to chain and finalized.
  2. Node software determines if the protocol update is supported. Warn if not.
  3. First explicitly finalized block with block time after the effective time of the update is finalized.
  4. Finalization is shut down on the original chain.
  5. Genesis block for new chain is created from the last finalized block of the old chain. (If the update is not supported, log an error instead.)
  6. Node's new tree state is initialised with new genesis block.
  7. Non-finalized transactions from old tree state are transferred to new tree state (possibly requiring conversion or filtering). From this point, incoming transactions are handled with respect to the new tree state.
  8. Catch up with peers with respect to the new genesis index.
  9. Commence baking and finalization on the new chain.
  10. The tree state of the old chain is archived.

Code clean-up

Task description
Various code clean-up tasks

Sub-tasks

  • Reconsider the use of isFull in NominationSet.
  • Restructure the global state package in line with: globalstate.txt

Update documentation of data storage

Task description

With the protocol update branch, data storage will receive some refinement: each regenesis generates a new pair of tree and block state databases.

This needs to be reflected in the description of data storage in docs/data-storage.md, both to make it up-to-date, and also to describe the process by which these databases are generated.

Separate the node and the bootstrapper

The behavior of the bootstrapper and the node is quite different, but the way they are currently entangled means that in a few places we check "if we are a bootstrapper, respond like this, otherwise respond in some other way".

This makes it hard to get an overview of what is going on. Moreover it would be useful if the bootstrapper could be built without building consensus, since it does not need it.

This is a large task. It would mean factoring the functionality in such a way that the core code could be reused between a bootstrap node and a normal node, while making the distinction cleaner so that we would not check in many places whether we are a bootstrapper or not.

Add transaction metadata

This task is to add three new transaction types that will be enabled after a protocol update instruction is in effect.
The extension is conservative in the sense that any valid transaction will remain valid with the same outcomes as before. Valid here means that the changes will be effected as intended. If a transaction payload was malformed, with a non-existing transaction tag, then it could become valid according to the new protocol. To understand this, recall that a transaction can be included in a block as soon as it is correctly signed and paid for, even if the rest of the payload is uninterpretable.

  • Add three new transaction types that extend Transfer, EncryptedTransfer, and TransferWithSchedule with an additional metadata field. This field is mandatory in the new types, since the existing transactions already allow sending transfers without a memo. The metadata field is an unstructured byte array and its contents are not validated as part of transaction validation.

  • Extend the Event type with an additional Metadata event which is generated by the three new transactions. This means that the transaction outcome of a TransferWithMetadata will be a list of two elements:

    • transfer event (i.e., transfer x GTU from A to B)
    • metadata

    Since serialization is derived for this Event type we must take care to add the new event at the end of the type so that existing tags do not change. We must review the derived serialization and make sure that any value of the old type is serialized in the same way for the new type.
    (The reason serialization of this type is important is that transaction outcomes are hashed by being serialized and the resulting bytestring hashed. Thus it will change semantics if we change serialization, and until the update the semantics must stay the same.)

  • The decodePayload function that takes a serialized payload should be parametrized by the protocol version: it should act as it does now for the current version, and allow deserializing the three new transaction types for the new protocol version.

  • Add handlers in the scheduler for the new transaction types that log the memo, but otherwise act as handlers did before.

  • The cost of the new transactions uses the same per-transaction-type constants as the corresponding existing transactions. The memo size is accounted for in the size of the contract.

No other changes should be necessary in the node transaction handling and storage itself.

"WARN: The high priority inbound consensus queue is full!

Bug Description
An error seems to occur on baker nodes running Concordium Client v0.4.11 (testnet) on Windows when they are stopped and restarted. This happens when the node is out of sync with the consensus nodes on the chain, and it fails to catch up with them.

Running a node from scratch works fine. This error seems to happen when there is already node data that isn't flushed during the execution of the concordium-client app. Running with or without "--no-block-state-import" produces the same error.

An example of the series of events that lead to the error is below:

$ concordium-node --no-block-state-import --listen-node-port
Checking whether docker is accessible on your PATH
OK
Checking whether Docker is running
OK
Checking whether we can access docker
OK
Validating system credentials
OK
Currently your node name is set to gidiguru. Press enter to keep it or enter a new one:
OK
Latest version of the concordium client is 0.4.11
Using your local image
Do you want to flush the local node database before starting? (y/N): n
Now going to boot up node with name gidiguru running v0.4.11 (testnet)
INFO: Consensus layer started
INFO: RPC server started
INFO: Starting out of band catch-up
ERROR: External: Error importing block: OtherError ResultDuplicate
INFO: Starting the P2P layer
INFO: Attempting to bootstrap
INFO: Using bootstrapper 54.77.133.220:8888
INFO: Using bootstrapper 52.17.42.154:8888
INFO: Commencing baking
INFO: Runner: Starting baker thread
ERROR: [sending to x.x.x.x:8888] Connection refused (os error 111)
INFO: Not enough peers - sending GetPeers requests
ERROR: [sending to x.x.x.x:8888] Connection refused (os error 111)
No peers at all - retrying bootstrapping
INFO: Sent a direct message to peer containing a catch-up status message
WARN: Couldn't process a finalization message from peer due to error code InvalidResult
ERROR: [sending to x.x.x.x:8888] Connection refused (os error 111)
WARN: The high priority inbound consensus queue is full!
WARN: The high priority inbound consensus queue is full!
WARN: The high priority inbound consensus queue is full!

Steps to Reproduce
Any of the following produces the same result
concordium-node --no-block-state-import --listen-node-port
concordium-node --listen-node-port
concordium-node --no-block-state-import --listen-grpc-port
concordium-node --listen-grpc-port

Expected Result
Once the node is restarted, it catches up with the other nodes in the chain.

Actual Result
It doesn't catch up.

Versions
Software Version is Concordium Client v0.4.11

OS
Windows OS running GitBash

Node Logs
https://drive.google.com/file/d/1gIykGV_sdY9KNJt9u7-2FWpb1MC7JqEC/view?usp=sharing

Maintain or remove features

Task description

The following features from concordium-node are not routinely tested as part of the CI workflow, and have started to experience bit rot. These features should either be updated if they still serve a purpose, or removed otherwise:

  • test_utils
  • genesis_tester

Revise logging to be more useful

Task description

The current node logging does not serve any specific purpose very well. Some of the issues that routinely come up:

  • logs tend to get large quickly even in INFO mode
  • we have found that INFO logging is of very limited value when debugging issues
  • on the other hand, INFO logging is quite spammy: it logs each received block, printing a bunch of statistics, but at the same time it does not log very clearly which block the node produced itself (this has to be discerned from follow-up log items)
  • DEBUG and TRACE logging are more useful for debugging issues, but many things of dubious value are logged, while some useful things are not. For example, we do not log why transactions were rejected at the gRPC layer, which would often be useful.
  • some things that are logged as ERROR should probably not be. For example, failing to send on a connection is not really a critical error; it most likely means the peer disconnected.

This issue is meant to collect the various concrete improvements we could do to logging to make it more useful.

Sub-tasks

  • Come to an agreement on what the purpose of the different logging levels is (the current situation is not really designed, it has grown organically), and write a document describing the agreement.
  • Make followup tasks to revise the logging according to the agreement.

Concordium node has no Avg Ping and Peers

Hi Team,

I have been running a Concordium node since this morning, on Ubuntu 20.04 LTS with 8GB RAM and a 250GB HDD.
As I can see, there are no peers and the Avg Ping time shows as n/a for my node named "blocktech_dev_om". Is there any reason behind this?
Please let me know if you have any suggestions.

Thanks.
Screenshot from 2021-09-14 15-01-21

Fix or remove the use of the envoy proxy

Task description

In the distribution docker image we use the envoy proxy to proxy GRPC requests from the node dashboard. The way this is built is not properly versioned. It uses getenvoy which is installed from an unversioned source, which then leads to frequent breakage. It also looks like versions are frequently deprecated.

In order for the build not to break all the time, we need to pin the versions in some way so that things don't break every two months.

Ideally we'd use just nginx, but it does not support proxying gRPC over HTTP/1, and browsers do not support HTTP/2 over unencrypted connections. And we shouldn't complicate things with HTTPS on localhost.

Builds are not reproducible

The following .gitignore rules currently exclude a number of dependency management lock files from the repo:

  • **/**/Cargo.lock (note that this excludes all Cargo.lock files in the repo)
  • /concordium-consensus/stack.yaml.lock
  • /concordium-consensus/stack.integer-simple.yaml.lock

The files record the resolved versions of all libraries used in the build. Without these files, builds are not reproducible.

It's not clear to me which exact lock files should be included, but it seems that at least concordium-node/Cargo.lock and concordium-base/rust-src/Cargo.lock (excluded in concordium-base) should be.

For example, the most recently tagged commit (1.0.1-0) no longer builds with Rust 1.45.2. The reason is the dependency chain

concordium_node v1.0.1
-> crypto_common v0.1.0
-> ed25519-dalek v1.0.1
-> curve25519-dalek v3.1.0
-> zeroize v1.4.0

The newly released v1.4.0 of zeroize requires Rust 1.51+. As curve25519-dalek depends on the library with just version "1", Cargo automatically picks up the new minor version, which raises the minimum supported Rust version (they explicitly allow themselves to do that).

Had Cargo.lock been checked in, this would not have been an issue.

Windows node distribution

Task description

Provide a distribution of the node for Windows.
This includes a service wrapper that will run as a background service and start and monitor nodes.

Sub-tasks

  • A service wrapper for running nodes based on a configuration file, running as a Windows service.
  • Shortcuts for starting, stopping and configuring nodes.
  • An installer/uninstaller.
  • A pipeline for reproducible, automatic builds on Windows.

Update flatbuffers compiler and dependency

Task description

The node currently needs a very specific old version of the flatbuffers compiler to build.
We should investigate whether we can remove that restriction and update the flatbuffers dependency to above 0.7

Sub-tasks

  • Make code compatible with recent versions of flatc
  • Update the flatbuffers dependency

Use a verifier instead of catch_unwind when deserializing network messages.

Task description

Use a flatbuffers verifier instead of relying on catch_unwind.

Previously this was not easily possible since verifiers were not exposed, but version 2.0 of the flatbuffers dependency does expose a verifier, which we should use.

The verifier should check that internal offsets are valid and do not lead to out-of-bounds errors or other problematic behaviour.
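A minimal sketch, assuming the Rust flatbuffers crate at version 2.0 or later; the generic helper below is illustrative, and any generated root type would be passed in for T.

use flatbuffers::{root, Follow, InvalidFlatbuffer, Verifiable};

/// Parse a buffer as the generated flatbuffers root type T, running the
/// built-in verifier first. Malformed offsets produce an Err instead of an
/// out-of-bounds panic, so no catch_unwind is needed around field access.
fn parse_verified<'buf, T>(buf: &'buf [u8]) -> Result<T::Inner, InvalidFlatbuffer>
where
    T: Follow<'buf> + Verifiable + 'buf,
{
    root::<T>(buf)
}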

Revise DNS resolution

Description
The node currently uses a custom wrapper around unbound for DNS resolution. The reason for this approach is likely a now-defunct intention to use DNSSEC. Since this is neither necessary nor really used in the current implementation, we should probably switch to the system resolver.

We have also seen issues that have been triggered by slow DNS resolution. The custom DNS resolution code can be quite slow, and even if it's not removed, the performance could well be improved. For instance, at least on Windows, the address may be resolved against the same DNS server multiple times.

The proposal is to replace the custom DNS resolver code with the system resolver via standard Rust APIs (such as ToSocketAddrs).
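A minimal sketch using only the standard library; the host name below is just a placeholder.

use std::net::ToSocketAddrs;

fn main() -> std::io::Result<()> {
    // The system resolver handles /etc/hosts, the OS cache, and the configured
    // DNS servers, so no custom unbound wrapper is needed.
    for addr in "bootstrap.example.com:8888".to_socket_addrs()? {
        println!("resolved bootstrapper address: {}", addr);
    }
    Ok(())
}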

Some possible benefits include:

  • Likely to be faster. This is important because resolution can delay the main network processing loop when bootstrapping is attempted.
  • Reduces maintenance burden. (We have previously had memory leaks in the DNS resolution code.)
  • Removes dependency on unbound library.

Quicker verification of Network packets

Today most verification is only carried out by the scheduler, i.e., after the transaction has already been added to the transaction table and is about to be executed.

In order to prevent congestion on the network, we should carry out more verification of incoming transactions before adding them to the transaction table. The verification results can, to some extent, be cached and reused by the scheduler, thus lowering the computational cost of executing transactions.

Furthermore, it could alleviate the current lack of RPC statuses for send_transaction and thus help integrators with more detailed error messages.

Note: extensive testing has to be carried out, as much of the transaction flow is rather subtle.

Sub-tasks

  • Implement an earlier verification process for incoming transactions. A heuristic has to be created for when to reject incoming transactions and when not to.
  • Create a document which describes the transaction acceptance heuristic.

Expose protocol version information via API

In order to support clients upgrading from v0 to v1 protocol we need to expose additional information to the clients.

  • Consensus info query should expose the current protocol version

  • Expose a new endpoint that can be queried with

    • block item version
    • block item type (account creation/update/account transaction)
    • type of the item/payload (e.g., Transfer, InitialCredential, ...)
      The node should respond whether it currently supports the given transaction type.
