
Glados

Network health monitoring tool for the Portal Network

Project Overview

The project is split up into a few different crates.

  • glados-core: Contains code shared by the other crates.
  • glados-web: The web application that serves the HTML dashboard.
  • glados-monitor: The long-running system processes that pull in chain data and audit the portal network.

Technology Choices

  • sea-orm - ORM and database migrations. The entity and migration crates are sea-orm conventions.
  • axum - Web framework for serving HTML.
  • askama - Templating for HTML pages.
  • web3 - For querying an Ethereum provider for chain data.
  • tokio - Async runtime.
  • tracing - Structured logging.

For our database, we use Postgres in both development and production.

Architecture

The rough shape of Glados is as follows:

The glados-monitor crate implements a long-running process which continually follows the tip of the chain and computes the ContentID/ContentKey values for new content as new blocks are added to the canonical chain. These values are inserted into a relational database.
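For reference, the derivation itself is small; a minimal sketch, assuming the history network convention that the content id is the SHA-256 hash of the SSZ-encoded content key (the sha2 crate here is illustrative, not necessarily what glados-core uses):

```rs
// Sketch: derive a 32-byte content id from the raw content key bytes.
use sha2::{Digest, Sha256};

fn content_id(encoded_content_key: &[u8]) -> [u8; 32] {
    Sha256::digest(encoded_content_key).into()
}
```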

The glados-audit process then queries the database for content to "audit", determining whether the content can be successfully retrieved from the network. The audit process uses the Portal Network JSON-RPC API to query the portal network for the given content and records in the database whether the content could be successfully retrieved. The database is structured such that a piece of content can be audited many times, giving a historical view over the lifetime of the content showing the times when it was or was not available.
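The retrieval step boils down to a single JSON-RPC call per audit. A minimal sketch, assuming an HTTP transport and the spec's portal_historyRecursiveFindContent method (the endpoint URL and the use of reqwest are assumptions):

```rs
// Sketch: ask a portal node for content by key over JSON-RPC.
use serde_json::{json, Value};

async fn fetch_content(client: &reqwest::Client, content_key_hex: &str) -> anyhow::Result<Value> {
    let request = json!({
        "jsonrpc": "2.0",
        "id": 1,
        "method": "portal_historyRecursiveFindContent",
        "params": [content_key_hex],
    });
    let response: Value = client
        .post("http://127.0.0.1:8545") // portal node HTTP endpoint (assumed)
        .json(&request)
        .send()
        .await?
        .json()
        .await?;
    Ok(response["result"].clone())
}
```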

The glados-web crate implements a web application to display information from the database about the audits. The goal is to have a dashboard that provides a single high level overview of the network health, as well as the ability to drill down into specific pieces of content to see the individual audit history.

Running Things

For specific examples, see the SETUP_GUIDE.md.

Quick Deploy via Docker:

See the DOCKER_GUIDE.md

Basics

Glados needs a Postgres database. To run a Postgres instance locally using Docker:

docker run --name postgres -e POSTGRES_DB=glados -e POSTGRES_PASSWORD=password -d -p 5432:5432 postgres

This postgres instance can be accessed via postgres://postgres:password@localhost:5432/glados. This value will be referred to as the DATABASE_URL.
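All of the crates consume this value through sea-orm. A minimal sketch of the connection, assuming the URL above:

```rs
// Sketch: open a sea-orm connection using the DATABASE_URL.
use sea_orm::{Database, DatabaseConnection, DbErr};

async fn connect() -> Result<DatabaseConnection, DbErr> {
    Database::connect("postgres://postgres:password@localhost:5432/glados").await
}
```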

In most cases, you will want to set the environment variable RUST_LOG to enable debug-level logs. RUST_LOG=glados_monitor=debug is a good way to enable debug logs for only a specific crate/namespace.

Running glados-monitor

The glados-monitor crate can be run as follows to populate a local database with content ids.

The CLI needs a DATABASE_URL to know what relational database to connect to, as well as an HTTP_PROVIDER_URI to connect to an Ethereum JSON-RPC provider (not a portal node).

$ cargo run -p glados-monitor -- --database-url <DATABASE_URL> follow-head --provider-url <HTTP_PROVIDER_URI>

For example, if an Ethereum execution client is running on localhost port 8545:

$ cargo run -p glados-monitor -- --database-url postgres://postgres:password@localhost:5432/glados follow-head --provider-url http://127.0.0.1:8545

Importing the pre-merge accumulators

The pre-merge epoch accumulators can be found here: https://github.com/njgheorghita/portal-accumulators

They can be imported with this command:

$ cargo run -p glados-monitor -- --database-url <DATABASE_URL> import-pre-merge-accumulators --path /path/to/portal-accumulators/bridge_content

Running glados-web

The CLI needs a DATABASE_URL to know what relational database to connect to.

This has only been tested using the trin portal network client.

$ cargo run -p glados-web -- --database-url <DATABASE_URL>

This must be run from the project root, or static assets will fail to load with 404 errors.

You should then be able to view the web application at http://127.0.0.1:3001/ in your browser.

Running a census with glados-cartographer

First, launch a portal client, like trin, with an HTTP endpoint. Assuming you already launched postgres using Docker, the cartographer command would look like:

cargo run -p glados-cartographer -- --database-url postgres://postgres:password@localhost:5432/glados --transport http --http-url http://localhost:8545 --concurrency 10

Running an audit with glados-audit

First, launch a portal client, like trin, with an HTTP endpoint. Assuming you already launched postgres using Docker, the audit command would look like:

cargo run -p glados-audit -- --database-url postgres://postgres:password@localhost:5432/glados --history-strategy latest --portal-client http://localhost:8545


Issues

Add a mode to `glados-monitor` for backfilling missing chain information.

depends on: #35

We need a mode for glados-monitor that looks at the database and finds missing historical blocks and backfills them.

My initial thoughts on how we model this in the CLI would be:

# normal mode that follows the head of the chain (this can be the default)
glados-monitor --mode follow-head

# backfill mode that fills until it reaches the head of the chain and then either exits, or sleeps and then restarts its search for missing information
glados-monitor --mode backfill

Audits do not record the originating strategy

Description

One suggestion for a future PR is that we include within the audit's DB record which selection strategy it was triggered by.

When an audit fails, it may fail exclusively under a particular strategy. By incorporating the source strategy, such correlations could be interrogated.
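A minimal sea-orm migration sketch of the suggestion; the table and column identifiers are assumptions, not the actual glados schema:

```rs
// Hypothetical migration: add a nullable strategy column to the audit table.
use sea_orm_migration::prelude::*;

#[derive(DeriveMigrationName)]
pub struct Migration;

#[async_trait::async_trait]
impl MigrationTrait for Migration {
    async fn up(&self, manager: &SchemaManager) -> Result<(), DbErr> {
        manager
            .alter_table(
                Table::alter()
                    .table(ContentAudit::Table)
                    // Nullable, so existing audit rows remain valid.
                    .add_column(ColumnDef::new(ContentAudit::StrategyUsed).integer().null())
                    .to_owned(),
            )
            .await
    }

    async fn down(&self, manager: &SchemaManager) -> Result<(), DbErr> {
        manager
            .alter_table(
                Table::alter()
                    .table(ContentAudit::Table)
                    .drop_column(ContentAudit::StrategyUsed)
                    .to_owned(),
            )
            .await
    }
}

// Identifier names here are illustrative assumptions.
#[derive(Iden)]
enum ContentAudit {
    Table,
    StrategyUsed,
}
```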

Add Radius and "Should store" column in Audit page

The audit page is very useful to track how content is available over the network and how many hops are done for a recursive find content.

Now when a No Progress node occurs, it would be nice to know whether that node should have had the content or not.

I think that to achieve this we could add a radius column that could optionally be filled in if the Origin node happens to know the radius of the No Progress (or other) nodes. Additionally, the calculation, based on distance and radius of this node, can be done to see whether this node should store this content in theory. This could be indicated in another column, or perhaps be indicated with a color, e.g. marking the whole row red if it didn't have the content but should have had it.
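For reference, the "should store" calculation reduces to comparing the XOR distance against the radius; a minimal sketch, assuming 32-byte big-endian values for node id, content id, and radius:

```rs
// Sketch: a node should (in theory) store content whose XOR distance from
// its node id is within its advertised radius.
fn should_store(node_id: &[u8; 32], content_id: &[u8; 32], radius: &[u8; 32]) -> bool {
    let mut distance = [0u8; 32];
    for i in 0..32 {
        distance[i] = node_id[i] ^ content_id[i];
    }
    // Big-endian byte arrays compare the same way as the integers they encode.
    distance <= *radius
}
```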

Additional summaries could be made based on how many nodes failed to fulfill their task, etc.

I assume that the portal_historyTraceRecursiveFindContent call would need a slight adaptation to include the optional radius information.

edit: Actually, the information is equally useful for Responded nodes.

A way to view audit metrics over time

Glados is good at showing us how well the network is doing right now.

I think the main index/dashboard should be updated so that it provides context for how well we are doing with respect to how well we previously were doing. A simple view of this might be showing a digest for the last month, with each day segmented off and showing aggregate success fail metrics.

  • This week 70% success
  • Last week 60% success
  • The week before that 62% success

The current "last hour", "last day" and "last week" metrics are still good, but I think they become less interesting/important than looking at how well we've been doing over longer time arcs.

Web view for DHT census data

We should have a web view to explore census data.

  • Paginated list of all past census data showing (Id/CreatedAt/NumNodes)
  • Detail page for a census that shows meta stats (CreatedAt/NumNodes), client diversity graph, and a list of all found ENR records

Add block metadata to web view

The web application should be updated to include the new block metadata in any contentkey/id displays so that when we view a content key/id in the web application we can know if it is a header/body/receipts/accumulator

How to handle Portal node timeouts

Description

Should a timeout be entered into the database as a failed audit or logged as an error?

glados-audit can receive either:

  • A "no content" ("0x") response from the Portal node.
  • A timeout. E.g., I have observed that an HTTP connection usually dies at 60s with Trin. Sometimes a message comes through (with the content or with the no-content message) at ~50 seconds, so it seems plausible that for some content a timeout could occur despite the content existing but being inaccessible within the timeout.

Current behaviour: a timeout is logged as an error in glados-audit.

I am inclined to keep this behaviour.
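For illustration, the current behaviour amounts to a client-side deadline around the lookup; a sketch using tokio, with the 60s figure taken from the observation above (everything else is assumed):

```rs
use std::time::Duration;
use tokio::time::timeout;

// Sketch: a lookup that exceeds the deadline surfaces as an error to log,
// rather than being recorded as a failed audit.
async fn lookup_with_deadline<F, T>(lookup: F) -> Result<T, &'static str>
where
    F: std::future::Future<Output = T>,
{
    match timeout(Duration::from_secs(60), lookup).await {
        Ok(value) => Ok(value),
        Err(_elapsed) => Err("portal node timed out"),
    }
}
```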

Backfill metadata tables

What is wrong

With #45 merged and deployed, we now have lots of ContentKey/Id records in the production glados instance that don't have these populated.

How can it be fixed

Write a script that can be run against an existing database which will find ContentKey rows that don't have associated meta-data, and that backfills this data.

Database is not initialized for in-memory db

Description

In glados-monitor, when an in-memory database is created using the "sqlite::memory:" argument, as in:

$ cargo run -p glados-monitor -- --database-url sqlite::memory: follow-head --provider-url http://127.0.0.1:8545

The following error is encountered:

Query(SqlxError(Database(SqliteError { code: 1, message: "no such table: content_key" })))

path
glados/entity/src/contentkey.rs:46:10

This indicates that upon lookup of the first key in the table, the action fails as the result of a non-existent table. 
```rs
// glados/entity/src/contentkey.rs
async fn get_or_create(content_key_raw: &impl ContentKey, conn: &DatabaseConnection) -> Model {
    // First try to look up an existing entry.
    let content_key = Entity::find()
        .filter(Column::ContentKey.eq(content_key_raw.encode()))
        .one(conn)
        .await
        .unwrap(); // <-- panics here: "no such table: content_key"
    // snip
}
```

Resolution

Some options I see:

  1. Require that the CLI flags prevent in-memory-db + no-migration combination
  2. Run db migration if in-memory-db is selected (this initialises the db)
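Option 2 might look roughly like this, using the migration crate's MigratorTrait (the flag plumbing is assumed):

```rs
// Sketch: run migrations automatically when the in-memory backend is chosen,
// so the schema exists before the first query.
use migration::{Migrator, MigratorTrait};
use sea_orm::{Database, DatabaseConnection, DbErr};

async fn connect_and_init(database_url: &str) -> Result<DatabaseConnection, DbErr> {
    let conn = Database::connect(database_url).await?;
    if database_url == "sqlite::memory:" {
        // An in-memory database always starts empty; initialise the schema.
        Migrator::up(&conn, None).await?;
    }
    Ok(conn)
}
```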

Write tests to make sure glados is validating content correctly

Currently we have a high success rate on our testnet, but glados has a bug where it reports that an audit failed even though we can see in the trace that it did find the content.

(Screenshot elided: the trace view, where a green bubble indicates the content was found.)

The reason we want this test is that if we update library versions, we know the code will still pass. We want to catch bugs like this at the PR stage to prevent headaches and to be sure a failure is an issue with the network, not with glados itself.

Add support for distance queries against content

Now that #114 is merged, we need to do the same for content.

  • add a new content_id_high: i64 field to the content table (one possible encoding is sketched after this list).
  • add ability to query for content closest to a specific node-id
  • add ability to query for nodes closest to a specific content-id
  • add ability to query for content closest to other content.
  • add all of these to the web views in some way.
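One possible encoding for the content_id_high field mentioned above (an assumption, not necessarily what #114 did for nodes): take the top 8 bytes of the content id and flip the sign bit so ordering survives storage in a signed column.

```rs
// Sketch: high 64 bits of a content id, order-preserving in an i64 column.
fn content_id_high(content_id: &[u8; 32]) -> i64 {
    let mut high = [0u8; 8];
    high.copy_from_slice(&content_id[..8]);
    // Flipping the top bit maps u64 ordering onto i64 ordering.
    (u64::from_be_bytes(high) ^ (1 << 63)) as i64
}
```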

Add meta/context information to the content-key/content-id information in the database.

We need to add some new information to our database that lets us have a richer view into what each content-key/content-id represents.

The new tables should loosely be something like:

execution_header
---
block_number: uint64 (unique)
block_hash: [u8, 32] (unique)
content_id: foreign_key_to(content_id) (unique)

execution_body
---
block_number: uint64 (unique)
block_hash: [u8, 32] (unique)
content_id: foreign_key_to(content_id) (unique)

execution_receipts
---
block_number: uint64 (unique)
block_hash: [u8, 32] (unique)
content_id: foreign_key_to(content_id) (unique)

The glados-monitor process needs to be updated to populate these tables with information.

Improve web view of audit information.

Some notes on the information we want to be able to glean from glados.

  • in the last 1-hour/24-hours/7-days
    • total number of new content items
    • total number of audits performed with success/fail percentages
  • the N most recent:
    • successful audits
    • failed audits
    • new content items

Need links to view content:

  • List views of content that show:
    • content key / content-id / content-type / first-available-at / created-at / most-recent-audit-result
  • Ability to "filter/sort" content
    • filter by kind/type (multi-select)
    • filter by datetime range
    • sort by first-available-at
    • sort by created-at
    • filter by most-recent-audit-result (success/failure)

Content Dashboard loads slowly

http://glados.ethportal.net/content/ loads slowly.

The endpoint for this page has 8 queries. With 3.9 million audit rows, the time they currently take is as follows:

Content ID: 120 ms
Content: 1641 ms
Audits: 451 ms
Successes: 303 ms
Failures: 377 ms
Hour stats: 941 ms
Day stats: 938 ms
Week stats: 958 ms
Total: 5729 ms

The two priorities should be improving:

  • "Content", which is the most recent content that has been audited.
  • Stats: note that these numbers are already with #134 applied, so despite the speed-up it's still about a second per stat period.

Number of audit tasks generated-per-min is not configurable

Description

glados-audit can be viewed as a funnel as follows:

```mermaid
flowchart TD
    subgraph generate[Trigger every AUDIT_SELECTION_PERIOD_SECONDS]
        s1[Strategy Latest]
        s2[Strategy Random]
        s3[Strategy Random]
    end
    s1 & s2 & s3 --> |send KEYS_PER_PERIOD tasks|chan[Audit task channel]

    chan --> |take 1 task|a1 & a2 & a3 & a4
    subgraph fulfill[Continuously replenish with new threads once they complete]
        a1[Auditing thread 1]
        a2[Auditing thread 2]
        a3[...]
        a4[Auditing thread CONCURRENCY]
    end
    a1 & a2 & a3 & a4 --> node[Portal node]
```

At present, the CLI can control the throughput as follows:

  • --concurrency <n> flag controls the maximum funnel output rate.
  • --strategy <strat> flag controls the nature of the tasks generated (limited effect on throughput; e.g., setting multiple --strategy random).

The two variables that control the maximum funnel input rate are:

  • KEYS_PER_PERIOD. Currently hard coded as 10.
  • AUDIT_SELECTION_PERIOD_SECONDS. Currently hard coded as 120 (seconds)

Thus the maximum audits per minute can be calculated:

  • With one active strategy, the funnel is filled at 10/120 * 60 = 5 tasks (individual content key audits) per minute.
  • The current default is three active strategies, so the funnel is filled at 3 * 10/120 * 60 = 15 tasks (individual content key audits) per minute.

Note that the observed audits/min rate will be lower, because audits that time out against a portal node are not recorded as pass/fail.

The funnel has a "rim height" set to overflow at 100 pending tasks. That is, when the channel has 100 pending tasks, newly generated tasks will be discarded.

Resolution

Expose funnel input control from the CLI. Options flag to:

  1. Expose KEYS_PER_PERIOD variable.
  2. Expose AUDIT_SELECTION_PERIOD_SECONDS variable.
  3. Expose KEYS_PER_PERIOD and AUDIT_SELECTION_PERIOD_SECONDS variables.
  4. New --max-task-rate <n = max audits per min> flag that controls maximum audits per minute that are generated.
    a. Titrate AUDIT_SELECTION_PERIOD_SECONDS to n, taking into account number of strategies and KEYS_PER_PERIOD
    b. Titrate KEYS_PER_PERIOD to n, taking into account number of strategies and AUDIT_SELECTION_PERIOD_SECONDS.
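A minimal clap sketch of option 3; the flag names are assumptions:

```rs
// Hypothetical flags exposing both funnel-input knobs.
use clap::Parser;

#[derive(Parser, Debug)]
struct AuditArgs {
    /// Content keys each strategy selects per generation period.
    #[arg(long, default_value_t = 10)]
    keys_per_period: u32,

    /// Seconds between task-generation rounds.
    #[arg(long, default_value_t = 120)]
    audit_selection_period_seconds: u64,
}
```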

Current flags

Usage: glados-audit [OPTIONS] --transport <TRANSPORT>

Options:
  -d, --database-url <DATABASE_URL>
          [default: sqlite::memory:]

  -i, --ipc-path <IPC_PATH>
          

  -u, --http-url <HTTP_URL>
          

  -t, --transport <TRANSPORT>
          [possible values: ipc, http]

  -c, --concurrency <CONCURRENCY>
          number of auditing threads
          
          [default: 4]

  -s, --strategy <STRATEGY>
          Specific strategy to use. Default is to use all available strategies. May be passed multiple times for multiple strategies (--strategy latest --strategy random). Duplicates are permitted (--strategy random --strategy random).

          Possible values:
          - latest:
            Content that is: 1. Not yet audited 2. Sorted by date entered into glados database (newest first)
          - random:
            Randomly selected content
          - failed:
            Content that looks for failed audits and checks whether the data is still missing. 1. Key was audited previously 2. Latest audit for the key failed (data absent) 3. Keys sorted by date audited (keys with oldest failed audit first)
          - select_oldest_unaudited:
            Content that is: 1. Not yet audited. 2. Sorted by date entered into glados database (oldest first)

  -h, --help
          Print help information (use `-h` for a summary)

  -V, --version
          Print version information

Timestamp problem

Dec 14 18:44:07 localhost run_glados_monitor.sh[699373]: thread 'tokio-runtime-worker' panicked at 'Error inserting new content key: Query(SqlxError(ColumnDecode { index: "\"created_at\"", source: "mismatched types; Rust type `core::option::Option<chrono::datetime::DateTime<chrono::offset::utc::Utc>>` (as SQL type `TIMESTAMPTZ`) is not compatible with SQL type `TIMESTAMP`" }))', /root/glados/entity/src/contentkey.rs:64:14

Just another issue that showed up when switching from sqlite3 to postgres.

Unify logging formatting

Go through all the logging and make sure that everything is being logged in ergonomic ways.

  • binary data should be hex encoded.
  • structured logging should be using the same key/value pairs
  • lists are better with 3 things.

Audit & Display Beacon Data


Table hierarchy

Description

Need to decide whether content_id (current) or content_key is the subject matter for operations in glados.

The top-level item in each project:

  • Trin: content_key. In ethportal-api we define OverlayContentKey, which has the method content_id. So content_key is the top level.
  • Glados: content_id. In glados/entities we define a table contentid with a "has_many" content_keys relationship. So content_id is the top level.

This seems to be inverted in glados.

Discussion

  • What are the downsides to staying with content_id?
    • Mental overhead: Function names are content_key-centric
    • Additional lookups: block data -> content_key table -> content_id table

Make auditing compatible with different sub-protocols

Description

The auditing is specific to the History network. Making it agnostic to sub-protocols would save work later on.
The cause of the current limitation is that we:

  1. ✅ (glados-monitor) Follow a Portal node and record content keys/ids/metadata in a glados-db
  2. ✅ (glados-audit) Employ different strategies to decide on what glados-db content to audit.
  3. (glados-audit) Presume all content in the glados-db is from the History sub-protocol and send HistoryContentKeys in the mpsc audit channel. ❌ Would not handle other sub-protocols.
  4. (glados-audit) For each task in the audit channel, perform a portal_historyRecursiveFindContent request to a Portal node. ❌ Would not handle other sub-protocols.

Solution

In three parts:

Record the sub-protocol in the glados-db.

Currently we do this with the execution metadata tables, for header, body & receipts but not EpochAccumulator. Options are:
A. Treat these tables as sub-protocol identifiers. Would need to add a table for EpochAccumulator items, and any items added for different sub-protocols
B. Add a new table for sub-protocol

Option B seems cleaner because in the event of a second sub-protocol, you do not need to check multiple tables to find matches.

Make mpsc channel more broad

The channel currently handles HistoryContentKey. It could send database IDs in the channel.
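A sketch of the broader payload; the type and variant names, and the method names beyond the history one, are assumptions:

```rs
use tokio::sync::mpsc;

// Assumed shape: which overlay network a piece of content belongs to.
#[derive(Clone, Copy, Debug)]
enum SubProtocol {
    History,
    Beacon,
    State,
}

// Assumed shape: the channel carries a database id plus its sub-protocol
// instead of a concrete HistoryContentKey.
#[derive(Clone, Copy, Debug)]
struct AuditTask {
    content_key_db_id: i32,
    protocol: SubProtocol,
}

async fn demo() {
    let (tx, mut rx) = mpsc::channel::<AuditTask>(100);
    tx.send(AuditTask { content_key_db_id: 1, protocol: SubProtocol::History })
        .await
        .unwrap();
    drop(tx); // close the channel so the receive loop below terminates

    while let Some(task) = rx.recv().await {
        // Dispatch on the sub-protocol to pick the JSON-RPC method.
        let _method = match task.protocol {
            SubProtocol::History => "portal_historyRecursiveFindContent",
            SubProtocol::Beacon => "portal_beaconRecursiveFindContent",
            SubProtocol::State => "portal_stateRecursiveFindContent",
        };
    }
}
```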

Lookup sub-protocols at audit time

The audit task can operate as follows:

    1. Look up what sub-protocol the item is from
    2. Call the appropriate portal_*RecursiveFindContent
    3. Log the appropriate metadata (based on the sub-protocol)

Considerations

Any key entered into the glados-db may clash with an identical key on a different sub-protocol. To handle this, keys should AFAICT have a many-to-one relationship with a sub-protocol identifier.

  • Decide on a key to audit
    • Look up what sub-protocol it is from
    • Audit on that sub-protocol (E.g., portal_historyRecursiveFindContent)
    • Record audit result for that content
  • Glados monitor creates a second key that is identical (to the above) but on a different subprotocol
    • Need to store the key, but in a way so as to not inherit the audit from the similar key
      • Perhaps by storing the sub-protocol along with the audit date/result data.

Not yet sure of the best way to organise a sub-protocol table / foreign key to achieve this.

Alternatives

Maybe there are other better solutions.

  • ? Separate database for sub-protocols

Local validation of content

Currently glados is happy with a non-zero response when it audits content.

This should be changed to actually check that the content returned from the network correctly passes validation.

  • For headers, we should reconstruct the RLP header and verify that it hashes to the block hash (see the sketch below).
  • For bodies, we should reconstruct the transaction and uncle tries and verify them against the corresponding header's fields.
  • For receipts, we should reconstruct the trie and verify it against the corresponding header field.
  • For the accumulator, we should verify that the epoch hash matches the one from the master accumulator.
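The header check is the simplest of these; a minimal sketch, assuming the sha3 crate for Keccak-256:

```rs
// Sketch: an RLP-encoded header is valid iff it hashes back to the block hash.
use sha3::{Digest, Keccak256};

fn header_matches_hash(rlp_encoded_header: &[u8], expected_block_hash: &[u8; 32]) -> bool {
    let computed: [u8; 32] = Keccak256::digest(rlp_encoded_header).into();
    &computed == expected_block_hash
}
```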

Improve DHT census code routing table enumeration

Once #117 is merged there is improvement that can be done to the routing table enumeration.

Presently, we simply send a FIND_NODES request for every bucket between 245 and 256. We should be able to cut this number of requests down significantly.

A smarter tactic would be to make this range dynamic. Querying bucket 256 is rarely useful, because everything in that bucket will almost always appear in someone else's closer, lower-numbered bucket. Similarly, once we start hitting the lower bucket numbers and a request comes back empty, we should be able to exit early.

A census algorithm that still reliably gets us 99.9% visibility into the nodes of the network while requiring 10x fewer network requests would be an improvement.

Roadmap notes

End goal is roughly:

  • configure application with any number of JSON-RPC connections to running nodes.
  • web application serves data from database
  • long-running process collects information from the various running clients.

Starting point

  • single running client
  1. basic node routing table information
  2. network explorer enumerating ENR records and node ids
  3. explore data radius values

Monitor may miss blocks

Description

The way errors are handled in glados-monitor, within follow_chain_head() and retrieve_new_blocks(), leads to situations where a new block is not stored in the database.

For example:

  • When transmitting a new block number to the retrieval thread, the message may fail.
  • If the block contents are not received properly during retrieval, there is no attempt to try again later.

Discussion

I suspect that this is a non-issue, because glados performs a sampling-based audit rather than a completeness audit.

That is, glados creates a record of keys to challenge the portal node with. It should pass all keys tested (not all canonical keys). The tested keys are a subset of all keys, so if glados fails to record every block at the chain head, the sampling will still be valid.

Actions

  1. If this is the right way to think about it, this issue can be closed.
  2. If glados should strive for completeness, I can take a look at making glados-monitor handle these error cases by remembering/retrying rather than moving on.

Glados audit incompatibility with Fluffy - fix out of date portal JSON-RPC API?

Latest version of glados is no longer compatible with Fluffy.

Each audit returns a failure even though there is clearly data arriving:

[2023-09-20T12:37:11Z DEBUG hyper::proto::h1::conn] incoming body is content-length (73007 bytes)
[2023-09-20T12:37:11Z DEBUG hyper::proto::h1::conn] incoming body completed
[2023-09-20T12:37:11Z ERROR glados_audit] Problem requesting content from Portal node. content.key="0x01e57dc6f3241f3a4f8c55293a3ec3afe21563f3614c0cfa049facc7314ee2460b" err=ContainsNone

From some further debugging it looks like the JSON response parsing (as_str) is failing: https://github.com/ethereum/glados/blob/master/glados-core/src/jsonrpc.rs#L200

There has been a related change to the JSON-RPC API: ethereum/portal-network-specs@92b79b8

Retesting this with a modified fluffy build that returns just the content (as before the spec change), not the object, works.

Can it be that the portal JSON-RPC API here needs an update?

In fluffy we implemented this change because otherwise portal-hive was failing, so I am not sure how this works for Trin; perhaps it is due to the usage of portal_historyTraceRecursiveFindContent in glados?

OverlayContentKey use does not include selector

Description

A T: OverlayContentKey passed into a function does not have a way to get the actual content_key (e.g., the bytes including the selector).

The conversion uses .into(), the counterpart of From, which has the following implementation:

```rs
// trin/ethportal-api/src/types/content_key.rs
impl From<HistoryContentKey> for Vec<u8> {
    fn from(val: HistoryContentKey) -> Self {
        val.as_ssz_bytes()
    }
}
```

This gets the bytes of the enum, using the derived Encode:

```rs
/// A content key in the history overlay network.
#[derive(Clone, Debug, Decode, Encode, Eq, PartialEq)]
#[ssz(enum_behaviour = "union")]
pub enum HistoryContentKey {
    // ...variants elided
}
```

This encoding does not include the selector.

Add new timestamp field to content to represent first moment the data should have been available.

We currently have a created_at field on the ContentKey model, which is set to the timestamp when the database entry was created.

I'd like to add another timestamp field first_available_at that is set to the time when the content should have first been available.

  • For headers/bodies/receipts this should be set to the timestamp of the corresponding block.
  • For epoch accumulators, this should be set to the timestamp of the last block in the accumulator.

The glados-monitor will need to be updated to correctly populate this field for all new entries.

The import_pre_merge_accumulators script in glados-monitor/src/lib.rs will need to be updated to populate this field.

We also need a script that can be used to populate existing records in the database for which this value is missing. This script should probably take something like an Infura URI or some other way of getting at a running JSON-RPC API to be able to fetch block data.

  • Step 1: Add new timestamp field as nullable.
  • Step 2: run script to populate historical records
  • Step 3: modify database field to be non-nullable

I believe this will also depend on us having a script that will backfill missing content-key metadata, since #45 is now merged and deployed, but the main instance of glados is missing most of the metadata for values that were already present in the database. Depends on https://github.com/pipermerriam/glados/issues/65

Absent 0x prefix in portal network request

Description

In a call to "portal_historyRecursiveFindContent", the parameters sent do not include the "0x" prefix. This results in an error:

Error while processing portal_historyRecursiveFindContent: Error returned from chain history subnetwork: \"Invalid RecursiveFindContent params: \\\"Unable to decode content_key\\\"\"

Occurs in: glados_core::jsonrpc::PortalClient::get_content().

Solution

  • Likely fixed/obsoleted by upcoming json-rpc changes in trin.
  • Implies that an OverlayContentKey type should have an "as_hex_string()" convenience method to prevent similar issues.
  • May be quick-fixed by adding "0x" to the string (a sketch follows below).
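The quick fix amounts to one line; a sketch assuming the hex crate, with the method name mirroring the (hypothetical) convenience method suggested above:

```rs
// Hypothetical convenience method: "0x"-prefixed hex encoding of key bytes.
fn as_hex_string(encoded_key: &[u8]) -> String {
    format!("0x{}", hex::encode(encoded_key))
}
```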

Add ability to run audits with different selection strategy

(low priority, only useful once the network is actually mostly working and most audits are successful)

depends on: https://github.com/pipermerriam/glados/issues/37

Let's call the strategy laid out in #37 the latest strategy, as it focuses on auditing the newest content.

We want two more strategies, and to modify glados-audit in any way necessary to allow us to run multiple audit processes concurrently (which might require database locking of some sort).

  • A random strategy, that simply randomly audits things.
  • A missing strategy that looks for failed audits and checks whether the data is still missing.

Latest content validation causes deserialize failure due to sequence instead of string

Error seen when running local glados instance, due to validation addition in #99.

[2023-04-07T08:52:31Z WARN  glados_audit::validation] could not deserialize content bytes content.value="0x080000001b020000f9021..." err=Error("invalid type: sequence, expected a string", line: 0, column: 0)

At first glance, it seems to be an issue with what is expected as JSON data here: https://github.com/ethereum/glados/blob/master/glados-audit/src/validation.rs#L10.

I might be wrong as I don't really know this code base, but I think the data coming from get_content (https://github.com/ethereum/glados/blob/master/glados-audit/src/lib.rs) is just the raw SSZ-encoded bytes, and thus likely not accepted by the JSON parsing in the validation code?

Support for running multiple clients for glados-audit

With #85 being added, it seems we now will want the ability to run glados-audit with multiple clients.

  • Change the audit schema to allow storing information about the client that was used for the audit. I suggest we store both a reference to the ENR record and the client version string.
  • Change to allow "multiple" clients (requiring at least 1) to be specified via --ipc-path or --http-url
  • Change the audit process to round-robin audits across all of the available clients

Accept `null` as "no content" response

Description

Glados treats null responses from a Portal Network node's portal_*RecursiveFindContent call (RecursiveFindContentResult) as an error. It should treat these the same as "0x" (interpreted as "content not found").

I had followed some discussion in the following PR (ethereum/portal-network-specs#176), but now I see that this is specifically for portal_*LocalContent, which uses a special "0x0" response.

Looking at the spec

  "RecursiveFindContentResult": {
    "name": "recursiveFindContentResult",
    "description": "The data corresponding to the lookup target",
    "schema": {
      "title": "Encoded target content data",
      "$ref": "#/components/schemas/hexString"
    }
  },

Where hexString is:

  "hexString": {
    "title": "Hex string",
    "type": "string",
    "pattern": "^0x[0-9a-f]$"
  }

So null and "0x" both seem valid, and Glados should interpret both to mean "content is absent", rather than the current behaviour of accepting only "0x".
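A sketch of the proposed interpretation, assuming the result arrives as a serde_json::Value (the enum is illustrative):

```rs
use serde_json::Value;

enum LookupOutcome {
    Absent,
    Found(String),
}

// Sketch: treat both null and "0x" as "content is absent".
fn interpret_result(result: &Value) -> Result<LookupOutcome, String> {
    match result {
        Value::Null => Ok(LookupOutcome::Absent),
        Value::String(s) if s == "0x" => Ok(LookupOutcome::Absent),
        Value::String(s) => Ok(LookupOutcome::Found(s.clone())),
        other => Err(format!("unexpected result: {other}")),
    }
}
```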

Relevant code:

https://github.com/ethereum/glados/blob/master/glados-core/src/jsonrpc.rs#L166-L186

Cannot create Non-Unique indices using SeaORM

There are a few SQL indices that we want for performance purposes that we do not want to have UNIQUE constraints:

(content_audit::CreatedAt, content_audit::Result)
(content::FirstAvailableAt, content::ProtocolId)
key_value::Key
execution_metadata::BlockNumber

Creating a non-unique index in seaORM via e.g.:

```rs
.index(
    Index::create()
        .name("idx_execution-block_number")
        .col(ExecutionMetadata::BlockNumber),
)
```

results in the error:

thread 'main' panicked at 'Database migration failed: Exec(SqlxError(Database(PgDatabaseError { severity: Error, code: "42601", message: "syntax error at or near \"(\"", detail: None, hint: None, position: Some(Original(172)), where: None, schema: None, table: None, column: None, data_type: None, constraint: None, file: Some("scan.l"), line: Some(1188), routine: Some("scanner_yyerror") })))', glados-monitor/src/main.rs:47:14

The offending SQL appears to be:

CREATE TABLE "execution_metadata" (
      "id" serial NOT NULL PRIMARY KEY,
      "content" integer NOT NULL,
      "block_number" integer NOT NULL,
      CONSTRAINT "idx_execution-block_number" ("block_number"),
      CONSTRAINT "idx-unique-metadata" UNIQUE ("content"),
      CONSTRAINT "FK_executionmetadata_content" FOREIGN KEY ("content") REFERENCES "content" ("id") ON DELETE
      SET
        NULL ON
      UPDATE
        CASCADE
    )

while the working SQL generated with a unique constraint looks like:

CREATE TABLE "execution_metadata" (
      "id" serial NOT NULL PRIMARY KEY,
      "content" integer NOT NULL,
      "block_number" integer NOT NULL,
      CONSTRAINT "idx_execution-block_number" UNIQUE ("block_number"),
      CONSTRAINT "idx-unique-metadata" UNIQUE ("content"),
      CONSTRAINT "FK_executionmetadata_content" FOREIGN KEY ("content") REFERENCES "content" ("id") ON DELETE
      SET
        NULL ON
      UPDATE
        CASCADE
    )

The only difference is the UNIQUE on the fifth line, so how that results in a "syntax error at or near \"(\"" is unclear. (Likely because a bare CONSTRAINT with no constraint type is not valid SQL: Postgres has no inline non-unique index syntax in CREATE TABLE, so a non-unique index has to be created with a separate CREATE INDEX statement.)

Change audit strategy

The current selection strategy for content auditing needs to be updated. The priority for auditing should be as follows.

  • Query content_id table and order by number of audits that have been performed (fewer first, more last)
  • Secondary ordering by creation date content_id.created_at in descending order (newer first, older last)

This ensures that we focus on auditing the latest stuff first that has never been audited, and then once everything in the database has been audited, it starts focusing on re-auditing things with priority on the newest of those items.

problem with migrations

Execution Error: error returned from database: foreign key constraint "fk_enr_id_node_id" cannot be implemented

New block_number feature clash with postgres

Description

When main branch is run with a (new/empty) Postgres backend the following error occurs:

Query Error: error occurred while decoding column "block_number": mismatched types; 
Rust type `core::option::Option<i64>` (as SQL type `INT8`) is not compatible with SQL type `INT4`

Origin

This was introduced by PR #45, which was only tested using sqlite. Root cause TBD.

Full logged error

Failed to create database record 
content.key="0x004be441720f239cdc201bcafc41a29d86f5f4056005d0af29176ce0e19ade2c33" 
content.kind="header_metadata" 
err=Query Error: error occurred while decoding column "block_number": mismatched types; 
Rust type `core::option::Option<i64>` (as SQL type `INT8`) is not compatible with SQL type `INT4`

Solution

  • Possibly downgrade block_number to a 32-bit rather than 64-bit integer.

Add concurrency to auditing

What is wrong

Currently, I believe our auditing is effectively single-threaded, aka, we only ever hit our connected portal node (trin) with a single running content lookup at a time. Since content lookups can take a while due to needing to traverse the network, we should probably run several of them concurrently.

How can it be fixed

Change glados-audit to be able to run multiple lookups concurrently.

Add a new flag to glados-audit such as --concurrency N which sets the number of concurrent lookups that can be running at any given time.
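A sketch of what --concurrency N could look like, bounding in-flight lookups with a semaphore (the task type and wiring are assumptions):

```rs
use std::sync::Arc;
use tokio::sync::{mpsc, Semaphore};

// Sketch: never more than `concurrency` lookups in flight at once.
async fn run_audits(mut tasks: mpsc::Receiver<u64>, concurrency: usize) {
    let slots = Arc::new(Semaphore::new(concurrency));
    while let Some(task_id) = tasks.recv().await {
        let permit = slots.clone().acquire_owned().await.unwrap();
        tokio::spawn(async move {
            // Perform the content lookup for task_id here.
            let _ = task_id;
            drop(permit); // releases the slot when the lookup completes
        });
    }
}
```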

Failure of BlockHeader validation since Shanghai

Running glados since Shanghai / Capella fork results in failure of header validation.

I did not investigate this further, but presumably this is because of the withdrawals_root field added to the BlockHeader (see EIP-4895).

Block bodies and receipts still work. Bodies did change as well (withdrawals were added), but we haven't actually altered those yet in the Portal specifications/implementations. The BlockHeader requires the added field immediately, however, because otherwise the block hash itself is no longer valid.

Example error:

[2023-04-13T13:07:32Z WARN  glados_audit::validation] computed header hash did not match expected content.key="0x00327395c9900fbd349f338069b0ecc98a547e52ad0e9f430ff13a5fe314669176" content.value="0x080000003d020000f90232a032a4a6a8ebb57c8fb34ecb81e4d331bef22f80bc87bc98d650ac7ea982bc8522a01dcc4de8dec75d7aab85b567b6ccd41ad312451b948a7413f0a142fd40d4934794388c818ca8b9251b393131c08a736a67ccb19297a049cba5a8779350f39813bb849fa94e1020288e79d9de694740462d26f55eee2fa022ec8930142a33421dba2ddfd4284bcb4d7dcf40c5718ce4d800d18101464bf9a0759d23c1d8bcaacda19cb90a64d29558f1a36cfe447a44553f5b4fe4e313039ab90100bcb9528483419a03966a0d00a65567ab2a31a899d4913a2ac881330464012039923413dc2514902124a03f18340abd5a0b83bc1a8b8a3cc972960640307853012686c6026a1a986ba9b64819d5386ef69c81088691411ec992e35b659c00a259ae60a01612222457252f52a0124aa9690253687d5454af0aa60710da8808fd086146c0d0131495ec08f8cf75d39a20b703a22d818344db1875cfbada4538703bab5a61d27965e24714e177c4591cadef7d61c2c002c5817a13af4472b0bc00d8e5c803073343ba721d12c50b30dc7209d6ee9902e444061ce4a112269a17383010d8e63b8852400d16ccb389a4e20405e877109f10c8507422cc91f8e45646a480840103fd7b8401c9c380839c7a1f846437fe378f6265617665726275696c642e6f7267a0726e88cc67d9fcf2c4fd23b50158911fc1633a991f131aa6dc4f8c43d4a288ef880000000000000000850b097be493a06f6a7789c8b169158508f74372ca3bae7de88a4ad7a802d1a1b49030db3b250d00"

Gracefully handle failed lookups in glados-web

What is wrong

Trying to view a content key/id in the web interface for something that isn't present in the database results in a panic.

How can it be fixed

Provide a reasonable 404 not found page instead of exploding
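Given axum from the tech choices above, the fix could be as small as mapping a missing database row to a 404; a sketch with assumed types:

```rs
use axum::{http::StatusCode, response::IntoResponse};

// Sketch: a handler that degrades to a 404 instead of panicking on a miss.
async fn content_detail(row: Option<String>) -> impl IntoResponse {
    match row {
        Some(html) => (StatusCode::OK, html),
        None => (StatusCode::NOT_FOUND, "content key not found".to_string()),
    }
}
```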

Network explorer

Need to write the network explorer pieces.

  • process that regularly (every N minutes) "walks" the network using RFN (recursive find nodes) to enumerate all knowable ENR records.
  • a process that looks up ENR records from the database and uses PING messages to determine liveness. At present, we will fail on any node behind a NAT since we don't have traversal.

Some web views that allow exploration of this data.

Reduce channel buffer size for `glados-audit`

What is wrong?

Glados seems to lag roughly 4 minutes behind "now" for auditing "latest" content.

How can it be fixed

This is just a guess but...

let (collation_tx, collation_rx) = mpsc::channel::<AuditTask>(100);

and

let (tx, rx) = mpsc::channel::<AuditTask>(100);

The channel size of 100, with glados running at roughly 50 audits per minute, means that tasks can sit in the main collation channel for as much as 2 minutes, after already waiting at least as long in the individual strategy channel before being picked up. This is too long!

Let's try something dumb/simple like changing both of these numbers to 4.
