bestarch-ae / cacherpc

Solana JSON-RPC caching server

License: MIT License

Rust 96.38% Lua 3.62%

solana

cacherpc's Introduction

Solana JSON-RPC caching server

Disclaimer: This project is an early stage Work-In-Progress and is not ready for production use.

This cache server aims to provide a general solution for offloading the Solana validator's RPC service while improving the overall speed and stability of the RPC. It does so by caching some of the heaviest and most frequent requests and keeping the cached data up to date through the PubSub API.

The server itself is a single binary designed to be deployed in front of the validator as its public RPC entry point.

Running the server

To build and run the server you will need the Cargo package manager, which ships with the Rust compiler. Both can be obtained by following the "Installing Rust" guide.

# build
$ cargo build --release
# run
$ ./target/release/rpccache

Configuration

The server supports a number of configuration options, which are the following:

-r, --rpc-api-url — validator or cluster JSON-RPC HTTP endpoint.
-w, --websocket-url — validator or cluster PubSub endpoint.
-l, --listen — cache server bind address.
-a, --account-request-limit — sets the maximum number of concurrent getAccountInfo requests the cacher is allowed to send to the cluster/validator.
-p, --program-request-limit — sets the maximum number of concurrent getProgramAccounts requests the cacher is allowed to send to the cluster/validator.
-A, --account-request-queue-size — sets the maximum number of getAccountInfo requests that are allowed to wait for a permit to be sent to the validator.
-P, --program-request-queue-size — sets the maximum number of getProgramAccounts requests that are allowed to wait for a permit to be sent to the validator.
-b, --body-cache-size — sets the maximum amount of cached responses.
-c, --websocket-connections — sets the number of websocket connections to the validator
-t, --time-to-live — duration of time for which values will be kept in cache
-d, --slot-distance — sets the maximum slot distance for health check purposes
--log-file - file to which the generated logs are written
--config - path to a TOML configuration file with limit-related settings
--ignore-base58-limit — flag whether to ignore base58 overflowing size limit
--log-format — the format, in which to output the logs: plain | json
--request-timeout - duration after which passthrough requests are aborted and the client is notified of the request timeout; the default is 60 seconds. Timeouts for getAccountInfo and getProgramAccounts requests are configured separately via the configuration file.
--rules — path to firewall rules written in lua
--identity — optional identity key for cacherpc service, should be base58 encoded public key
--control-socket-path — path to socket file, e.g. /run/cacherpc.sock
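
The request limit and queue size options above interact: a request first tries to take a permit, then a queue slot. The following is a minimal sketch of that interaction, not the server's actual implementation. The names RequestLimiter and Admission are hypothetical, and what happens to a request once both the permits and the queue are full is an assumption (it is simply reported as rejected here).

```rust
use std::sync::Mutex;

/// Outcome of asking the limiter for permission to contact the validator.
#[derive(Debug, PartialEq, Eq)]
pub enum Admission {
    /// A permit was free; the request may be sent now.
    Run,
    /// All permits are taken; the request waits in the queue.
    Queued,
    /// Permits and queue are both full (assumed behaviour: turn the request away).
    Rejected,
}

/// Hypothetical limiter: `limit` mirrors --account-request-limit,
/// `queue_size` mirrors --account-request-queue-size.
pub struct RequestLimiter {
    limit: usize,
    queue_size: usize,
    /// (requests in flight, requests waiting for a permit)
    state: Mutex<(usize, usize)>,
}

impl RequestLimiter {
    pub fn new(limit: usize, queue_size: usize) -> Self {
        Self { limit, queue_size, state: Mutex::new((0, 0)) }
    }

    /// Decide what happens to a newly arrived request.
    pub fn admit(&self) -> Admission {
        let mut state = self.state.lock().unwrap();
        if state.0 < self.limit {
            state.0 += 1;
            Admission::Run
        } else if state.1 < self.queue_size {
            state.1 += 1;
            Admission::Queued
        } else {
            Admission::Rejected
        }
    }

    /// Called when an in-flight request finishes.
    pub fn finish(&self) {
        let mut state = self.state.lock().unwrap();
        if state.1 > 0 {
            // A waiting request takes over the freed permit,
            // so the in-flight count stays the same.
            state.1 -= 1;
        } else if state.0 > 0 {
            state.0 -= 1;
        }
    }
}
```

With a limit of 1 and a queue size of 1, the first request runs, the second waits, and the third is turned away until one of the earlier requests finishes.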

Configuration file

Some configuration parameters can be loaded from a TOML-formatted file and re-read from it at runtime, so that they are reapplied dynamically. Example configuration:

[rpc.request_limits]
account_info = 10 # concurrent getAccountInfo requests to validator
program_accounts = 50 # concurrent getProgramAccounts requests to validator

[rpc.request_queue_size]
account_info = 10 # number of getAccountInfo requests that can wait in queue before making request to validator
program_accounts = 10 # number of getProgramAccounts requests that can wait in queue before making request to validator

[rpc]
ignore_base58_limit = true

[rpc.timeouts]
account_info_request = 30 # timeout in seconds, before getAccountInfo is aborted
program_accounts_request = 60 # timeout in seconds, before getProgramAccounts is aborted
account_info_backoff = 30 # time duration during which getAccountInfo will be repeatedly retried, in case of failure
program_accounts_backoff = 60 # time duration during which getProgramAccounts will be repeatedly retried, in case of failure
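
The backoff settings describe a window during which a failed request keeps being retried. A rough sketch of that idea follows, assuming a fixed pause between attempts. The function name retry_within and the pause parameter are made up for illustration, and the server may well space its retries differently.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Retries `op` until it succeeds or `window` has elapsed since the first attempt.
/// `window` plays the role of `account_info_backoff`; `pause` is a hypothetical
/// fixed delay between attempts.
pub fn retry_within<T, E>(
    window: Duration,
    pause: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let started = Instant::now();
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Give up once the window is used up, returning the last error.
            Err(err) if started.elapsed() >= window => return Err(err),
            // Otherwise wait a little and try again.
            Err(_) => sleep(pause),
        }
    }
}
```

The operation is always attempted at least once, even with a zero-length window.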

Commands

A running instance of the caching server supports several commands that can be sent to it via a unix domain socket:

cache-rpc config-reload - reload limits related configuration from file (must have been started with --config <path> option)
cache-rpc waf-reload - reload WAF rules from lua file (must have been started with --rules <path> option)
cache-rpc subscriptions off - prevent caching server from initiating new subscriptions after fetching data via rpc requests
cache-rpc subscriptions on - allow caching server to initiate new subscriptions after fetching data via rpc requests (default)
cache-rpc subscriptions status - print out the current status of subscriptions allowance (on or off)
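
The command set above is small enough to model as an enum. The sketch below parses the arguments that follow cache-rpc into such an enum. The type and function names are hypothetical, and nothing here reflects the actual wire format used on the socket.

```rust
/// The commands listed above, as a hypothetical enum.
#[derive(Debug, PartialEq, Eq)]
pub enum Command {
    ConfigReload,
    WafReload,
    Subscriptions(SubscriptionsAction),
}

#[derive(Debug, PartialEq, Eq)]
pub enum SubscriptionsAction {
    On,
    Off,
    Status,
}

/// Parses the arguments that follow `cache-rpc` on the command line.
/// Returns `None` for anything that is not one of the listed commands.
pub fn parse_command(args: &[&str]) -> Option<Command> {
    match args {
        ["config-reload"] => Some(Command::ConfigReload),
        ["waf-reload"] => Some(Command::WafReload),
        ["subscriptions", "on"] => Some(Command::Subscriptions(SubscriptionsAction::On)),
        ["subscriptions", "off"] => Some(Command::Subscriptions(SubscriptionsAction::Off)),
        ["subscriptions", "status"] => Some(Command::Subscriptions(SubscriptionsAction::Status)),
        _ => None,
    }
}
```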

Metrics

The caching server exposes various metrics in a Prometheus-compatible format. They can be retrieved via the /metrics HTTP endpoint.
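
For readers unfamiliar with the Prometheus text format, here is a small sketch of how a single counter is rendered in it. The function render_counter and the metric name used in the usage note are invented for illustration and are not the names the server exports.

```rust
/// Renders one counter in the Prometheus text exposition format:
/// a HELP line, a TYPE line, then the sample itself.
pub fn render_counter(name: &str, help: &str, value: u64) -> String {
    format!(
        "# HELP {0} {1}\n# TYPE {0} counter\n{0} {2}\n",
        name, help, value
    )
}
```

Calling render_counter("example_cache_hits_total", "Requests served from cache.", 3) produces three lines: the HELP line, the TYPE line, and "example_cache_hits_total 3".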

Features

Implemented methods

In the current version caching is implemented for these methods:

getAccountInfo
getProgramAccounts

Requests to other methods are passed through to the validator.

Unlikely to be implemented

Features that are unlikely to be implemented:

Root and Single commitments.

cacherpc's People

Contributors

Stargazers

Watchers

Forkers

polachok 00nktk sol-crystal rpcpool dappiolab bridgesplit denniswon tttlkkkl anselsol

cacherpc's Issues

Use proper backoff instead of handrolled loop with sleep

There're some nice backoff libraries on crates.io.

Log updated config

1. When the config is reloaded, log the new config.
2. When the config is applied, log the new values (limits are already being logged).

Track subscriptions by key

The problem:

We use the subscription_active method to check whether the value for a particular key can be retrieved from cache. Currently it checks an AtomicBool flag on the websocket connection to which the key is routed. This flag is updated in update_status by comparing the number of active subscriptions to the number of desired subscriptions. As a result, if a single subscription confirmation is delayed, all keys routed to that worker are considered inactive and cannot be served from cache, which is clearly undesirable.

Proposed solution:

Track subscription status by key.
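
One possible shape for per-key tracking is sketched below. The type SubscriptionTracker and its methods are hypothetical, and keys are plain strings for simplicity. It only illustrates that a delayed confirmation for one key no longer affects the others.

```rust
use std::collections::HashMap;

/// Per-key subscription state (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SubStatus {
    /// Subscribe request sent, confirmation not yet received.
    Pending,
    /// Confirmation received; updates for this key are flowing.
    Active,
}

#[derive(Default)]
pub struct SubscriptionTracker {
    by_key: HashMap<String, SubStatus>,
}

impl SubscriptionTracker {
    /// Record that a subscription was requested; an already active key stays active.
    pub fn requested(&mut self, key: &str) {
        self.by_key.entry(key.to_string()).or_insert(SubStatus::Pending);
    }

    /// Record that the validator confirmed the subscription for `key`.
    pub fn confirmed(&mut self, key: &str) {
        if let Some(status) = self.by_key.get_mut(key) {
            *status = SubStatus::Active;
        }
    }

    /// After a reconnect every subscription has to be confirmed again.
    pub fn connection_reset(&mut self) {
        for status in self.by_key.values_mut() {
            *status = SubStatus::Pending;
        }
    }

    /// Only this key's own status decides whether it may be served from cache.
    pub fn subscription_active(&self, key: &str) -> bool {
        self.by_key.get(key) == Some(&SubStatus::Active)
    }
}
```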

Detect situations when ws subscriptions are active but no updates are received

The cacher relies on cache entries having corresponding active ws subscriptions to decide whether a cache entry might be stale. Sometimes, however, a subscription exists but no updates are received from it, which may indicate that the cache entry has become stale.

We need a way to detect such situations and to stop serving requests from cache when they occur.

Content type with charset returns 415

We're seeing 415 errors when the content-type header specifies a charset. This happens on requests where the content-type reads:

application/json;charset=….
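
A check that tolerates parameters such as charset could look like the sketch below. The function name is hypothetical and this is not the server's actual header handling. It compares only the media type before the first semicolon.

```rust
/// Returns true when the Content-Type header names JSON,
/// ignoring any parameters (such as a charset) after the media type.
pub fn is_json_content_type(header: &str) -> bool {
    // Everything before the first ';' is the media type itself.
    let media_type = header.split(';').next().unwrap_or("").trim();
    media_type.eq_ignore_ascii_case("application/json")
}
```

Both "application/json" and a value with a charset parameter, for example "application/json;charset=utf-8", pass this check, while "text/plain" does not.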

`filter::tests::tree_matches_overall` can randomly fail

Currently filter::tests::tree_matches_overall can randomly panic with 'Uniform::new called with `low >= high`' message, which probably means that proptest generates x..x strategy at some point. My guess is that this is related to arb_non_matching_filters strategy.

This is not a critical issue since the panic is not caused by the test logic, but is annoying. For the time being, broken CI can be fixed with a restart.

Add pubsub metrics

HTTP/rpc code has metrics, websocket/pubsub has none.

Cannot request Stake11111 (Gateway timeout)

When requesting getProgramAccounts(Stake11111...) without filters via the cache I get a gateway timeout error, however it works without the cache.

In logs I can see the following error:

Sep 15 12:39:24.950  WARN cache_rpc::rpc: request: Request { jsonrpc: "2.0", id: Num(1), method: "getProgramAccounts", params: Some((Pubkey([6, 161, 216, 23, 145, 55, 84, 42, 152, 52, 55, 189, 254, 42, 122, 178, 85, 127, 83, 92, 138, 120, 114, 43, 104, 164, 157, 192, 0, 0, 0, 0]), ProgramAccountsConfig { encoding: Base64, commitment: None, data_slice: None, filters: None, with_context: Some(true) })) } error: Overflow
Sep 15 12:39:24.950  INFO cache_rpc::rpc: reporting gateway timeout req.id=Num(1)

This is the RPC request:

curl api.mainnet-beta.solana.com -X POST -H "Content-Type: application/json" -d '
  {"jsonrpc":"2.0","id":1, "method":"getProgramAccounts", "params":["Stake11111111111111111111111111111111111111", { "encoding": "base64" }]}
'

Unix socket based, command listener implementation

To make managing the application's dynamic settings more convenient, the proposal is to implement a unix socket listener that listens for external commands and changes the settings accordingly. It would be a considerably more flexible alternative to signal handling.

Batch requests support implementation in WAF

Currently the cacher doesn't preprocess batch requests; it simply forwards them to the validator cluster.
WAF rule evaluation needs to support batch requests, so that the rules apply to all requests regardless of their type and composition.

Pubsub reconnect backoff settings

Make ExponentialBackoff max_elapsed_time infinite with max_interval = 10sec.
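
To make the requested behaviour concrete, here is a hand-rolled sketch of a delay sequence that never stops and is capped at 10 seconds. It is not the backoff crate's ExponentialBackoff. The type ReconnectBackoff, the doubling factor, and the 500 ms starting delay in the usage note are assumptions, and no jitter is applied.

```rust
use std::time::Duration;

/// Endless sequence of reconnect delays that doubles up to `max_interval`.
pub struct ReconnectBackoff {
    current: Duration,
    max_interval: Duration,
}

impl ReconnectBackoff {
    pub fn new(initial: Duration, max_interval: Duration) -> Self {
        Self { current: initial, max_interval }
    }
}

impl Iterator for ReconnectBackoff {
    type Item = Duration;

    /// Never returns `None`: the equivalent of an infinite max_elapsed_time.
    fn next(&mut self) -> Option<Duration> {
        let delay = self.current.min(self.max_interval);
        // Double the delay for the next attempt, but never exceed the cap.
        self.current = (delay * 2).min(self.max_interval);
        Some(delay)
    }
}
```

Starting from 500 ms with a 10 second cap, the delays are 0.5 s, 1 s, 2 s, 4 s, 8 s, and then 10 s for every further attempt.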

Pubsub reconnect counter (and possibly rpc)

Add metric to count reconnect attempts

getAccountInfo occasionally returns error

It's been reported that getAccountInfo occasionally returns

Error: failed to get info about account 2CSEjyDtAtgCjTKyXHZyfUVe7EERtL7rjYjJSgcBPYLf: Invalid params: invalid type: map, expected string at line 1 column 34

Gist to reproduce here https://gist.github.com/barry-the-sender/d877a87f23d8d52d32fdd71c01f71481.

Implement metrics endpoint with Prometheus format

Allow monitoring caching performance by keeping metrics and printing them in Prometheus (and/or other) format

Slots & commitments

We currently store and report the latest slot seen either in a validator response or in a pubsub notification (per commitment level). This leads to the slot in our responses lagging behind the validator if there are no changes for cached accounts.

1. We should find out how to get the latest slot for a commitment level, preferably via pubsub.

2. If we have account A cached with slot = N, commitment = Processed, and later we receive an update for account B with slot N+1, commitment = Finalized, it seems that slot N is also Finalized, which in turn means that we can return account A's data for a request with a Finalized commitment requirement.
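
The inference in point 2 can be sketched as follows. It takes the issue's own assumption at face value (a slot at or below the highest slot seen at some commitment level has reached that level too) and ignores forks and skipped slots. The names SlotTracker and satisfies are hypothetical.

```rust
use std::collections::HashMap;

/// Commitment levels ordered from weakest to strongest.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub enum Commitment {
    Processed,
    Confirmed,
    Finalized,
}

/// Remembers the highest slot observed for each commitment level.
#[derive(Default)]
pub struct SlotTracker {
    latest: HashMap<Commitment, u64>,
}

impl SlotTracker {
    /// Record a slot seen in a validator response or a pubsub notification.
    pub fn observe(&mut self, commitment: Commitment, slot: u64) {
        let entry = self.latest.entry(commitment).or_insert(slot);
        *entry = (*entry).max(slot);
    }

    /// Under the assumption above, data cached at `cached_slot` may answer a
    /// request for `wanted` if some slot at that level or stronger is at or
    /// beyond `cached_slot`.
    pub fn satisfies(&self, cached_slot: u64, wanted: Commitment) -> bool {
        self.latest
            .iter()
            .any(|(level, slot)| *level >= wanted && cached_slot <= *slot)
    }
}
```

In the scenario from the issue, observing slot N+1 at Finalized makes data cached at slot N eligible for Finalized requests, while data cached at a later slot is not.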

Add ability to reload waf script without restart

Add command like cacherpd reload waf or something

Cleanup client code in get_account_info / get_program_accounts

It's basically the same code, could be refactored into a function.

Detect infinite reconnections on websocket

Sometimes worker threads that manage websocket connections go into an infinite loop trying to establish a working websocket connection to the validator. We need to figure out why this happens and implement detection and correction for such cases.

Currently cacherpc returns 404 on any path apart from "/". We work around it in our proxy by overwriting the path component of any request we pass to zubr cache.
