Comments (8)

perseus-algol commented on June 9, 2024

I can try to investigate it, @jpraynaud you can assign me on it

from mithril.

jpraynaud commented on June 9, 2024

Steps to reproduce

To reproduce this issue it is enough just to run the devnet and watch logs:

You are right, running the devnet is not enough. To reproduce the problem, you can run the end-to-end test:

cargo build --release -p mithril-aggregator -p mithril-signer -p mithril-client-cli && cargo run -p mithril-end-to-end -- -vvvv --bin-directory target/release/ --work-directory=./artifacts --devnet-scripts-directory=./mithril-test-lab/mithril-devnet

or this one to keep the network alive:

cargo build --release -p mithril-aggregator -p mithril-signer -p mithril-client-cli && cargo run -p mithril-end-to-end -- -vvvv --bin-directory target/release/ --work-directory=./artifacts --devnet-scripts-directory=./mithril-test-lab/mithril-devnet --run-only

The logs will be accessible in the artifacts/devnet/node_pool{N}.log files.

The problem happens on the signer node, so you should probably investigate what is going on between the Mithril signer and the Cardano node. The signer asks the node for the current epoch at regular intervals (usually 2 min in production), and this is done by the ChainObserver, which handles the n2c communications.

jpraynaud commented on June 9, 2024

@falcucci could this be related to the way pallas communicates with the Cardano node?
(The Mithril signer retrieves the epoch from the node at a pace of once every minute.)

jpraynaud commented on June 9, 2024

@scarmuega FYI, this is the issue we talked about during Office Hours

ch1bo commented on June 9, 2024

This could be caused by a non-graceful shutdown of the local state query. We should double-check whether the mithril-signer sends a MsgRelease and a MsgDone after the query.
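
To make the graceful-shutdown check concrete, here is a minimal Rust sketch of the client side of a local state query session as a small state machine. It is a toy model, not pallas or mithril-signer code: the names ClientState, ClientMsg, step and ends_gracefully are hypothetical, and the server replies are folded into the client transitions for brevity. It only illustrates that a session ends cleanly when the client releases the acquired state and then sends Done before closing the socket.

```rust
/// Client-visible states of a local state query session (simplified, hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ClientState {
    Idle,
    Acquired,
    Done,
}

/// Messages the client may send (server replies are folded in for brevity).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ClientMsg {
    Acquire,
    Query,
    Release,
    Done,
}

/// Applies one client message, or returns None if it is not allowed in `state`.
fn step(state: ClientState, msg: ClientMsg) -> Option<ClientState> {
    match (state, msg) {
        (ClientState::Idle, ClientMsg::Acquire) => Some(ClientState::Acquired),
        (ClientState::Acquired, ClientMsg::Query) => Some(ClientState::Acquired),
        (ClientState::Acquired, ClientMsg::Release) => Some(ClientState::Idle),
        (ClientState::Idle, ClientMsg::Done) => Some(ClientState::Done),
        _ => None,
    }
}

/// True only if every message is legal and the session finishes in `Done`.
fn ends_gracefully(msgs: &[ClientMsg]) -> bool {
    let mut state = ClientState::Idle;
    for &msg in msgs {
        match step(state, msg) {
            Some(next) => state = next,
            None => return false,
        }
    }
    state == ClientState::Done
}
```

In this model the sequence Acquire, Query, Release, Done ends gracefully, while a session that stops right after Query does not. If the signer closed the socket at that point, it would be consistent with the node reporting MuxBearerClosed, but that remains to be verified against the actual client code.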

jpraynaud commented on June 9, 2024

Also seen on the devnet:

[jp:cardano.node.LocalErrorPolicy:Error:64] [2024-04-16 13:22:14.42 UTC] IP LocalAddress "node-pool2/ipc/node.sock@11042" ErrorPolicyUnhandledApplicationException (MuxError MuxBearerClosed "<socket: 34> closed when reading data, waiting on next header True")
[jp:cardano.node.LocalErrorPolicy:Error:64] [2024-04-16 13:22:14.42 UTC] IP LocalAddress "node-pool2/ipc/node.sock@11043" ErrorPolicyUnhandledApplicationException (MuxError MuxBearerClosed "<socket: 35> closed when reading data, waiting on next header True")
[jp:cardano.node.LocalErrorPolicy:Error:64] [2024-04-16 13:22:14.46 UTC] IP LocalAddress "node-pool2/ipc/node.sock@11044" ErrorPolicyUnhandledApplicationException (MuxError MuxBearerClosed "<socket: 36> closed when reading data, waiting on next header True")

perseus-algol commented on June 9, 2024

Steps to reproduce

To reproduce this issue it is enough just to run the devnet and watch logs:

# Terminal 1
cd mithril-test-lab/mithril-devnet
SKIP_CARDANO_BIN_DOWNLOAD=true \
FORCE_DELETE_ARTIFACTS_DIR=true \
NUM_BFT_NODES=1 NUM_POOL_NODES=2 ./devnet-run.sh

# Terminal 2
cd mithril-test-lab/mithril-devnet/artifacts && tail -n 62 -f ./node-pool1/node.log

# Terminal 3
docker logs -f artifacts-mithril-signer-node-pool1-1

After some time, an error appears in Terminal 2:

[andrew-d:cardano.node.LocalErrorPolicy:Error:62] [2024-04-18 13:51:30.14 UTC] IP LocalAddress "node-pool1/ipc/node.sock@3" ErrorPolicyUnhandledApplicationException (MuxError MuxBearerClosed "<socket: 28> closed when reading data, waiting on next header True")

and in Terminal 3:

could not retrieve epoch settings at epoch Epoch(1) with nested error: Epoch service was not initialized, the function inform_epoch must be called first.

Investigating the causes

The signer node calls the /epoch-settings HTTP endpoint of the aggregator, and the aggregator responds with the error "Epoch service was not initialized, the function inform_epoch must be called first", which corresponds to this error variant:

    /// Raised when service has not collected data at least once.
    #[error("Epoch service was not initialized, the function `inform_epoch` must be called first")]
    NotYetInitialized,

So, evidently, epoch_service was not initialized by calling inform_epoch first.

There is a self.runner.inform_new_epoch(new_time_point.epoch).await?; call in mithril-aggregator/src/runtime/state_machine.rs, and this seems to be the only place where inform_epoch is called.

The aggregator's logs contain two error messages very close to each other, in this order:

  • 2024-04-18T15:00:44.925938221Z Epoch service could not obtain current protocol parameters for epoch 4
  • 2024-04-18T15:00:45.073509748Z Epoch service was not initialized, the function inform_epoch must be called first.

The stack backtrace of the first one is:

   2: mithril_aggregator::services::epoch_service::MithrilEpochService::get_protocol_parameters::{{closure}}
/mithril/mithril-aggregator/src/services/epoch_service.rs:136:26
   3: <mithril_aggregator::services::epoch_service::MithrilEpochService as mithril_aggregator::services::epoch_service::EpochService>::inform_epoch::{{closure}}
/mithril/mithril-aggregator/src/services/epoch_service.rs:193:14
   4: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/future/future.rs:124:9
   5: <mithril_aggregator::runtime::runner::AggregatorRunner as mithril_aggregator::runtime::runner::AggregatorRunnerTrait>::inform_new_epoch::{{closure}}
/mithril/mithril-aggregator/src/runtime/runner.rs:460:14
   6: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/future/future.rs:124:9
   7: mithril_aggregator::runtime::state_machine::AggregatorRuntime::try_transition_from_idle_to_ready::{{closure}}
/mithril/mithril-aggregator/src/runtime/state_machine.rs:287:64
   8: mithril_aggregator::runtime::state_machine::AggregatorRuntime::cycle::{{closure}}
/mithril/mithril-aggregator/src/runtime/state_machine.rs:177:18
   9: mithril_aggregator::runtime::state_machine::AggregatorRuntime::run::{{closure}}
/mithril/mithril-aggregator/src/runtime/state_machine.rs:110:42

This means that when the aggregator runs try_transition_from_idle_to_ready, it calls inform_epoch with epoch 4, but get_protocol_parameters fails.
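
The chain of events described above can be sketched with a toy epoch service in Rust. This is not the mithril-aggregator implementation: ToyEpochService, its fields, the epoch_settings method and the MissingProtocolParameters variant are hypothetical names chosen for illustration, and only NotYetInitialized and inform_epoch come from the code quoted above. The sketch assumes the service becomes initialized only when inform_epoch succeeds.

```rust
use std::collections::HashMap;

/// Simplified, hypothetical stand-in for the epoch service error type.
#[derive(Debug, PartialEq)]
enum EpochServiceError {
    /// No protocol parameters are stored for the requested epoch (hypothetical variant).
    MissingProtocolParameters(u64),
    /// Data was requested before `inform_epoch` ever succeeded.
    NotYetInitialized,
}

struct ToyEpochService {
    /// Hypothetical store: epoch -> protocol parameters (here just a label).
    stored_parameters: HashMap<u64, String>,
    /// Set only after `inform_epoch` has succeeded at least once.
    current: Option<(u64, String)>,
}

impl ToyEpochService {
    fn new(stored_parameters: HashMap<u64, String>) -> Self {
        Self { stored_parameters, current: None }
    }

    /// Loads the parameters for `epoch`; on failure the service state is left untouched.
    fn inform_epoch(&mut self, epoch: u64) -> Result<(), EpochServiceError> {
        let params = self
            .stored_parameters
            .get(&epoch)
            .cloned()
            .ok_or(EpochServiceError::MissingProtocolParameters(epoch))?;
        self.current = Some((epoch, params));
        Ok(())
    }

    /// What a settings request would read; fails until `inform_epoch` has succeeded.
    fn epoch_settings(&self) -> Result<&(u64, String), EpochServiceError> {
        self.current.as_ref().ok_or(EpochServiceError::NotYetInitialized)
    }
}
```

In this model, a failed inform_epoch(4) on a service with no stored parameters leaves it uninitialized, so a later call to epoch_settings returns NotYetInitialized. That would match the order of the two log messages above, although the real cause of the missing protocol parameters is still open.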

I'm continuing to explore

...

perseus-algol commented on June 9, 2024

@jpraynaud, thank you for pointing me to the e2e tests and in the right direction. I left a comment in #1644. I am not 100% sure that this is the correct solution, but it seems to work and the error message no longer appears.
