
doyoubi / undermoon

696 stars · 17 watchers · 37 forks · 2.4 MB

Modern Redis Cluster solution for easy operation.

License: Apache License 2.0

Rust 94.63% Makefile 0.21% Shell 0.65% Python 2.57% Go 0.53% Kotlin 0.15% Java 0.99% Jinja 0.26%
redis redis-cluster rust proxy slot migration redis-clusters redis-instances redis-protocol redis-proxy

undermoon's People

Contributors

cfeitong, dependabot[bot], doyoubi, traceming2


undermoon's Issues

Specify Node ID

Add another parameter to specify the node ID in CLUSTER NODES instead of using domains, as the prefixes of different domains may be the same.

FutureGroupHandle should signal in `drop` function.

The current FutureGroupHandle implementation requires it to be the outermost future fed directly into tokio::spawn.
To support nested groups, we need to send the signal in the `drop` function of FutureGroupHandle.
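
A minimal sketch of the idea, assuming a tokio oneshot channel carries the signal; the struct shape here is hypothetical, not the actual undermoon type:

```rust
use tokio::sync::oneshot;

// Hypothetical shape; the real FutureGroupHandle holds more state.
struct FutureGroupHandle {
    stop_signal: Option<oneshot::Sender<()>>,
}

impl Drop for FutureGroupHandle {
    fn drop(&mut self) {
        // Signal on drop, so the group no longer needs to be the outermost
        // future fed into tokio::spawn; nested groups still get notified.
        if let Some(sender) = self.stop_signal.take() {
            let _ = sender.send(());
        }
    }
}
```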

New Migration Process through SCAN command

Current Migration Limitations

There are two drawbacks to the current migration process, which works by triggering replication between the source Redis and the destination Redis:

  • We can only double the cluster size, because multiple FULL SYNCs to a single Redis node can't preserve all the data: a later sync removes some of the data correctly transferred by earlier ones.
  • We have to wait a fixed, long time and HOPE that the replication between the two Redis nodes is done, because there's no way to know for sure. (Well, actually we could use a block-and-check method.)

Solutions

There are two solutions:

  • (1) Implement the Redis Replication Protocol which consists of two parts:
    • RDB parsing (hard and time-consuming to maintain)
    • Replication network protocol (not that hard, and feasible)
  • (2) Use SCAN, DUMP, RESTORE to mimic the replication.

I don't want to use the first solution since there is just too much work in RDB parsing, and we would need to keep updating the code as the RDB format changes.

The second one should remain compatible with future versions of Redis.

The SCAN command has a useful guarantee: every key set before the first SCAN call will eventually be returned, though possibly multiple times. We can perform a three-stage migration to mimic replication:

  • Wait for all the commands to be finished by Redis.
  • Start the scanning and forward the data to the peer Redis (see the sketch after this list).
  • Redirect all the write operations after the first SCAN to the peer Redis.
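
A minimal sketch of the scan-and-forward stage, assuming the synchronous `redis` crate with direct connections; the real implementation would go through the proxy's queues:

```rust
// Copy all keys from the source to the destination with SCAN + DUMP + RESTORE.
fn scan_migrate(
    src: &mut redis::Connection,
    dst: &mut redis::Connection,
) -> redis::RedisResult<()> {
    let mut cursor: u64 = 0;
    loop {
        let (next, keys): (u64, Vec<String>) =
            redis::cmd("SCAN").arg(cursor).query(src)?;
        for key in keys {
            // DUMP returns nil if the key disappeared between SCAN and DUMP.
            let payload: Option<Vec<u8>> =
                redis::cmd("DUMP").arg(&key).query(src)?;
            if let Some(data) = payload {
                // REPLACE handles keys returned by SCAN more than once.
                redis::cmd("RESTORE")
                    .arg(&key)
                    .arg(0) // TTL 0: no expiry
                    .arg(data)
                    .arg("REPLACE")
                    .query::<()>(dst)?;
            }
        }
        if next == 0 {
            return Ok(());
        }
        cursor = next;
    }
}
```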

But it also has some problems:

  • It will have some impact on latency while scanning the data, but that should be tunable.
  • If a large collection key gets updated frequently, we need to store the large data from DUMP again and again, which could consume a large amount of memory.

Detailed Steps for Scanning

  • Redirect all the later commands to Queue_block1.
  • Wait for all the existing commands to be finished. We can maintain a counter for running commands to achieve that.
  • Start sending SCAN and DUMP to retrieve the data, and add RESTORE commands to Queue1. When the SCAN is done, mark Queue1 as SENT_FINISHED.
  • When the first SCAN command gets the reply, release all the commands in Queue_block1.
  • Route all the commands in Queue_block1 and the commands after the first SCAN to another send function which, for every write request, does the following atomically via a Lua script (see the sketch after this list):
    • apply the write operation
    • get the latest version of the key by DUMP and forward the new data to Queue2
  • Forward all the RESTORE commands in Queue1. Once Queue1 is marked SENT_FINISHED and becomes empty, start forwarding the RESTORE commands in Queue2.
  • When Queue1 is marked SENT_FINISHED and is empty, start blocking the commands in Queue_block2 and wait for all the commands in Queue2 to be sent.
  • The migration source proxy commits the process with the destination proxy. The destination starts handling the new slots. The source proxy starts redirecting requests for the migrated slots to the destination proxy.
  • Release the commands in Queue_block2. Redirect all the keys inside the migrated slots to the destination proxy.
  • INFOMGR starts to return success.
  • Wait for the Coordinator to commit the migration.
  • Delete the migrated-out data on the source proxy.
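
The atomic write-plus-DUMP step could look like this minimal sketch, assuming the `redis` crate's `Script` type; the key and the LPUSH command are illustrative, not undermoon's actual code:

```rust
use redis::Script;

// Apply the original write and DUMP the key in one Lua script, so no
// concurrent write can slip in between the two steps.
fn write_and_dump(con: &mut redis::Connection) -> redis::RedisResult<Option<Vec<u8>>> {
    let script = Script::new(
        r#"
        -- ARGV holds the original write command, e.g. LPUSH mykey value1
        redis.call(unpack(ARGV))
        -- Return the fresh serialized value, to be forwarded to Queue2
        return redis.call('DUMP', KEYS[1])
        "#,
    );
    script
        .key("mykey")
        .arg(&["LPUSH", "mykey", "value1"][..])
        .invoke(con)
}
```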

Add compatibility tests

We can replay the same command script against both undermoon and Redis and check whether the results are the same.
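
A minimal sketch of such a harness, assuming the synchronous `redis` crate; the addresses and the command script are illustrative:

```rust
// Replay the same commands against undermoon and a bare Redis, then compare.
fn compare_backends() -> redis::RedisResult<()> {
    let mut um = redis::Client::open("redis://127.0.0.1:5299/")?.get_connection()?;
    let mut rd = redis::Client::open("redis://127.0.0.1:6379/")?.get_connection()?;
    let script = vec![
        vec!["SET", "k", "v"],
        vec!["GET", "k"],
        vec!["INCR", "counter"],
    ];
    for args in script {
        let mut cmd = redis::cmd(args[0]);
        for a in &args[1..] {
            cmd.arg(*a);
        }
        let um_reply: redis::Value = cmd.query(&mut um)?;
        let rd_reply: redis::Value = cmd.query(&mut rd)?;
        assert_eq!(um_reply, rd_reply, "mismatch for {:?}", args);
    }
    Ok(())
}
```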

Support blocking command

Since #81, complex commands are much easier to implement than before.
We can implement blocking commands by transforming them into non-blocking commands.

For example, BLPOP can be implemented by repeatedly calling LPOP until it returns a non-Nil value.
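
A minimal sketch of that transformation, assuming the synchronous `redis` crate; the polling interval is a tuning knob, not a Redis setting:

```rust
use std::time::{Duration, Instant};

// Emulate BLPOP by polling LPOP until a value appears or the timeout expires.
fn blpop_emulated(
    con: &mut redis::Connection,
    key: &str,
    timeout: Duration,
) -> redis::RedisResult<Option<String>> {
    let deadline = Instant::now() + timeout;
    loop {
        let value: Option<String> = redis::cmd("LPOP").arg(key).query(con)?;
        if value.is_some() || Instant::now() >= deadline {
            return Ok(value);
        }
        std::thread::sleep(Duration::from_millis(10));
    }
}
```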

Report server proxy epoch to broker

The metadata storage needs to know the real epoch for server proxy for the following reasons:

  • For scaling down, the broker needs to know when the latest metadata has been synchronized to all the related server proxies, so it can safely remove server proxies from the cluster.
  • The metadata storage needs to know all the up-to-date server proxies to provide an API to query all the ready-to-use server proxy endpoints.
  • The metadata storage can use real epochs to detect inconsistent metadata.

Amend setting role.

Now the replicator module just keeps sending the SLAVEOF command to the backend Redis, triggering the following Redis log line again and again:

REPLICAOF would result into synchronization with the master we are already connected with. No operation performed.

Maybe we need to check whether the role is incorrect first. But the address in ROLE is the replica-announce-ip, so we need to use CONFIG GET to fetch replica-announce-ip from the peer master first.
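
A minimal sketch of the check, assuming the synchronous `redis` crate and parsing INFO replication instead of ROLE; the replica-announce-ip lookup via CONFIG GET is left out for brevity:

```rust
// Only send SLAVEOF when the node is not already replicating the intended
// master, to avoid re-triggering the log line above.
fn ensure_replica(
    con: &mut redis::Connection,
    master_ip: &str,
    master_port: u16,
) -> redis::RedisResult<()> {
    let info: String = redis::cmd("INFO").arg("replication").query(con)?;
    let field = |name: &str| {
        info.lines()
            .find_map(|l| l.strip_prefix(name).map(|v| v.trim().to_string()))
    };
    let already_replica = field("role:").as_deref() == Some("slave")
        && field("master_host:").as_deref() == Some(master_ip)
        && field("master_port:") == Some(master_port.to_string());
    if !already_replica {
        redis::cmd("SLAVEOF").arg(master_ip).arg(master_port).query::<()>(con)?;
    }
    Ok(())
}
```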

Let SETREPL trigger SLAVEOF command to redis directly

Now SETREPL only sets up an asynchronous task to periodically send the SLAVEOF command to Redis.
This could leave a short window during which a promoted master is still a slave but needs to serve requests.
We need to trigger SLAVEOF once directly before SETREPL returns.

Only one cluster in a server proxy

The original design of undermoon was to support multiple logical clusters in a single server-side proxy for multi-tenancy. That has turned out to be a bad idea for the following reasons:

  • Maintaining metadata for multiple clusters is not easy, especially when it comes to migration states.
  • A rolling upgrade becomes more difficult, as it affects multiple clusters at the same time.

I'd better remove the support for multiple logical clusters.

Track spawned futures

Build a future wrapper to track spawned futures and detect future leaks, analogous to goroutine leaks in Go.
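
A minimal sketch of such a wrapper, assuming a global counter is enough to surface leaks; the type name is hypothetical:

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::task::{Context, Poll};

// Number of live tracked futures; a leak shows up as a count that never drops.
static LIVE_FUTURES: AtomicUsize = AtomicUsize::new(0);

struct Tracked<F> {
    inner: F,
}

impl<F: Future> Tracked<F> {
    fn new(inner: F) -> Self {
        LIVE_FUTURES.fetch_add(1, Ordering::Relaxed);
        Tracked { inner }
    }
}

impl<F> Drop for Tracked<F> {
    fn drop(&mut self) {
        LIVE_FUTURES.fetch_sub(1, Ordering::Relaxed);
    }
}

impl<F: Future> Future for Tracked<F> {
    type Output = F::Output;
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<F::Output> {
        // SAFETY: we never move `inner` out of the pinned wrapper.
        unsafe { self.map_unchecked_mut(|t| &mut t.inner) }.poll(cx)
    }
}
```

Spawning everything through `Tracked::new` and periodically reporting LIVE_FUTURES would make a leak visible as a counter that never comes back down.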

v0.3 Roadmap

The main change in v0.3 will be the broker API, which will break compatibility. It will be done together with overmoon v0.2.

  • Broker API change
  • Add version to broker api
  • Make SlotRange an arbitrary range for better migration performance.
  • Build mem_broker and make it usable.
  • Add a host field in ProxyResource of mem_broker.
  • Save mem_broker data in a file.
  • Add an API to bump global epoch and cluster epoch to force proxies to update their metadata.
  • Add a broker api to check whether the current free proxies are sufficient for any host failure.
  • Support redis-cluster-proxy (not fully tested yet)
  • Amend INFO command
  • Balance masters
  • Remove `{:?}` debug output exposed to the users.
  • Support some blocking commands
  • More unit tests.
  • More docs.
  • Benchmark Graph
  • Migration Time Graph
  • Add docs about samaritan.
  • Add pagination for the API getting proxy addresses.
  • Compress metadata (done in v0.5)
  • Support Replication for memory broker
  • Support dynamically changing the broker address of the coordinator.
  • Add docs for config.
  • Support active redirection for Redis single instance clients.
  • Delete keys on the destination proxy during migration to make migration faster.
  • Need more docs on the migration.

Manage freed nodes

  • After the replicas get freed, they are not changed back to masters and keep replicating.
  • The data inside the freed nodes is not cleared.

Should it be managed by the coordinator and server_proxy?

Migration uses wrong offset field.

Now we use the lag field from INFO to determine whether the replication has finished, which is wrong. We should compare master_repl_offset with the offset of each replica.
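
A minimal sketch of that comparison, parsing the master's INFO replication output; the field parsing is simplified:

```rust
// Replication is done when every replica's offset has caught up with
// master_repl_offset, rather than when `lag` looks small.
fn replication_done(info: &str) -> bool {
    let mut master_offset: i64 = -1;
    let mut replica_offsets: Vec<i64> = Vec::new();
    for line in info.lines() {
        if let Some(v) = line.strip_prefix("master_repl_offset:") {
            master_offset = v.trim().parse().unwrap_or(-1);
        } else if line.starts_with("slave") {
            // e.g. slave0:ip=127.0.0.1,port=6380,state=online,offset=4096,lag=0
            for field in line.split(&[':', ','][..]) {
                if let Some(v) = field.strip_prefix("offset=") {
                    replica_offsets.push(v.trim().parse().unwrap_or(-1));
                }
            }
        }
    }
    master_offset >= 0
        && !replica_offsets.is_empty()
        && replica_offsets.iter().all(|&o| o >= master_offset)
}
```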

Rename Host to Proxy

At first, I thought we would only deploy one proxy per host, so host and proxy were the same and were used interchangeably in the code and API.
Now, to support clients and redis cluster proxies that do not support the AUTH command for the backend clusters, we need to deploy multiple proxies on the same machine to support multiple tenants.
I have changed the API in #21.
Later we need to change host in the code to proxy.

v0.2 Roadmap

Undermoon v0.2 focuses on supporting arbitrary slot migration, whose key functionality has been done in #65.

There are still two problems to solve:

  • (1) The migration speed is too slow: around 450 keys per second. (Now 4,000 keys per second, which could also be tunable.)
  • (2) The blocking phase to ensure consistency is not implemented yet.

The first one needs Redis pipelining to increase throughput. The second one needs complex synchronization. Neither is easy to implement without async/await.

Thus, in v0.2 Undermoon will change to futures-rs 0.3.

v0.2 Roadmap

  • Move to the new future api.
  • Optimize Resp by only storing the index in the data to eliminate data copy.
  • Let CmdTask support multiple commands as a single request.
  • Refactor the API from the executor to the backend sender. Make them return a Pin<Box> for future functionality such as MGET and blocking commands.
  • More unit tests. (moved to v0.3)
  • More docs. (moved to v0.3)

Optimization

  • Change RwLock on server proxy meta to
  • Batch sendto syscall.
  • Optimize memory copy
  • Let Resp objects just store the index of the Redis packet to reduce memory allocation.

Return peer proxies in GET /api/proxies/meta/<server_proxy_address>

Now, get_peer in the coordinator uses separate HTTP calls to get the peer server proxies:

  • get the cluster name from host metadata
  • get the cluster metadata

This could lead to inconsistent data.

We should return the metadata of peer proxies in /api/proxies/meta/<server_proxy_address> directly.

False Negative Failure After Recovering Proxy

When proxies are tagged as failed and then recover, the client pool in the coordinator might get a stale connection and fail to send PING, which causes a false negative failure report.
This is confusing but could be fine. It might be fixed later.

Migration could potentially recover deleted keys

Since during key migration a key can be written from the source shard to the destination shard multiple times, a key deleted by users could be recovered again.

The overall process is:

  • The key gets migrated to the destination by the RESTORE command.
  • Users delete the key.
  • The key gets migrated again via the RESTORE command, since SCAN can return the same key multiple times, or because the first migration was actively triggered by the destination shard. The deleted key is thus restored.

Compress metadata

When the metadata of a large cluster is synchronized from the HTTP broker to the coordinator, and from the coordinator to the server proxy, it may need to be compressed to reduce the data size.
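
A minimal sketch, assuming gzip via the flate2 crate; the actual codec choice is open:

```rust
use flate2::{read::GzDecoder, write::GzEncoder, Compression};
use std::io::{Read, Write};

// Compress serialized metadata (e.g. JSON) before sending it over HTTP.
fn compress_meta(raw: &[u8]) -> std::io::Result<Vec<u8>> {
    let mut enc = GzEncoder::new(Vec::new(), Compression::default());
    enc.write_all(raw)?;
    enc.finish()
}

// Decompress it on the receiving side (coordinator or server proxy).
fn decompress_meta(gz: &[u8]) -> std::io::Result<Vec<u8>> {
    let mut out = Vec::new();
    GzDecoder::new(gz).read_to_end(&mut out)?;
    Ok(out)
}
```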

Amend broker api

  • Rename host to proxy
  • Rename database to cluster
  • Rename db_name to cluster_name

Sharded Coordinator

If the whole undermoon cluster has more than 100k server proxies, the coordinator might not be able to hold that many connections.

We need to divide coordinators into different shards by clusters and server proxies.

Need to delete some part of the data after migration

When scaling, the proxies just migrate all the data from one node to another, leaving the two involved proxies each holding half of the data they don't own.
We need to delete this data after migration using the SCAN and DEL commands, as sketched below.
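
A minimal sketch of that cleanup, assuming the synchronous `redis` crate; `owned` is a caller-supplied predicate over slot numbers, and the slot function ignores `{...}` hash tags for brevity:

```rust
// Redis cluster's CRC16 (XMODEM variant: poly 0x1021, init 0).
fn crc16_xmodem(data: &[u8]) -> u16 {
    let mut crc: u16 = 0;
    for &b in data {
        crc ^= (b as u16) << 8;
        for _ in 0..8 {
            crc = if crc & 0x8000 != 0 { (crc << 1) ^ 0x1021 } else { crc << 1 };
        }
    }
    crc
}

fn key_slot(key: &str) -> u16 {
    crc16_xmodem(key.as_bytes()) % 16384
}

// Walk the keyspace and delete every key whose slot we no longer own.
fn delete_unowned_keys(
    con: &mut redis::Connection,
    owned: impl Fn(u16) -> bool,
) -> redis::RedisResult<()> {
    let mut cursor: u64 = 0;
    loop {
        let (next, keys): (u64, Vec<String>) = redis::cmd("SCAN")
            .arg(cursor)
            .arg("COUNT")
            .arg(100)
            .query(con)?;
        for key in keys {
            if !owned(key_slot(&key)) {
                redis::cmd("DEL").arg(&key).query::<()>(con)?;
            }
        }
        if next == 0 {
            return Ok(());
        }
        cursor = next;
    }
}
```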

Amend INFO command

  • Amend the formatting of INFO command.
  • Move UMCTL INFOREPL to INFO command.

Optimize UMFORWARD

Now we use the UMFORWARD command to carry additional attributes for implementing max_redirections, which results in command wrapping and unwrapping and suboptimal performance.
Maybe we can implement RESP3 and use its attributes to optimize this.

Task for deleting keys running with migration task

After migration, a task for deleting keys is started, which currently causes some problems:

(1) Data inconsistency when scaling up and down (fixed by #158)

If a cluster is scaling up and down frequently, a migration task could run concurrently with a key-deletion task covering the same slots, which could result in losing some keys.

PR #158 fixes it by checking whether there is any key-deletion task before starting a migration in the API.

(2) High CPU Usage

Improvements

  • Limit the number of threads.
  • Expose the inner metadata of server_proxy.
  • Set TCP_NODELAY.
  • Make the channel size configurable.
