
Architectural info? (indeep, open)

cbluth commented on June 16, 2024

Comments (1)

anqurvanillapy commented on June 16, 2024

Hi @cbluth, thanks for reaching out!

We created this project to verify that our understanding is correct, since we're on the same team that is responsible for an in-house object storage system serving a massive number of users.

What's Object Storage

About data

An object, or blob, or file, whatever you'd like to call it, consists of the following content:

  • Object meta:
    • Size, length
    • ETag: Entity tag, identifying the object's content; e.g. the same ETag under different file names could indicate that the underlying physical data is the same
    • Content type, content disposition, expires, etc., namely the usual HTTP headers
    • User meta: User-provided meta, prefixed with X-Amz-Meta- (an AWS S3 convention)
    • ..., etc.
  • Data: the actual content, the object body
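
To make the shape of an object concrete, here is a minimal Go sketch of how one might model it; the field names are our own illustration, not the actual indeep types:

package objectstore

import "time"

// ObjectMeta is a hypothetical sketch of per-object metadata, roughly
// mirroring what S3-compatible services keep.
type ObjectMeta struct {
    Size               int64             // object length in bytes
    ETag               string            // entity tag identifying the content
    ContentType        string            // e.g. "image/jpeg"
    ContentDisposition string            // HTTP Content-Disposition
    Expires            time.Time         // HTTP Expires
    UserMeta           map[string]string // user-provided, the X-Amz-Meta-* headers
}

// Object bundles the meta with the actual content (the object body).
type Object struct {
    Meta ObjectMeta
    Data []byte
}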

About operations

When we talk about filesystems, there are two features that are impossible to support in object storage:

  • Access: both reads and writes can be sequential or random
  • Namespacing: Manipulating directories is super fast

So you could only have these in object storage:

  • Sequential and random reads
  • Whole-file writes: if you want to update a tiny portion of the file, you have to upload the whole file to overwrite it
  • Pseudo-namespacing: e.g. in /image/foo.jpg, the /image/ part is just a string prefix; you can manipulate objects by filtering on prefixes, but this will be super slow
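
As a rough illustration, the operation surface of an object store boils down to something like the following interface; this is a sketch of ours, not the actual indeep API. Note there is no partial write and no real directory operation:

package objectstore

import "io"

// Store is a hypothetical sketch of the operations an object store can
// actually offer, in contrast with a filesystem.
type Store interface {
    // Get supports sequential and random reads via a byte range.
    Get(key string, offset, length int64) (io.ReadCloser, error)
    // Put always overwrites the whole object; there is no partial update.
    Put(key string, body io.Reader) error
    // Delete removes the whole object.
    Delete(key string) error
    // List filters keys by a string prefix such as "/image/"; this is
    // pseudo-namespacing, and it can be super slow compared to a real directory.
    List(prefix string) ([]string, error)
}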

Why Distributed?

  • To achieve high availability
  • Easy to scale in and out

Remember that our service is stateful, just like a database, so it's generally hard to achieve the goals above.

So How?

Availability: Replication and consensus

You can replicate the data (or the operation logs) to some backup servers, and use some protocol to make sure the backup servers have the same data as the main servers.

In 2023, Raft is still the usual starting point for replication and consensus. You have two ways to implement your service:

  • In-memory states and periodic snapshotting: If your states won't blow up the memory, you can keep them in memory and let the Raft framework tell you when to persist a snapshot. Typically snapshotting won't block log replication or the running state machine; they can run in parallel
  • Durable states and no-op snapshotting: You can keep your states in a fast persistent DB like RocksDB inside your state machine, and upon snapshotting you simply record a few important Raft states wherever you like. These Raft states include the term, the current configuration (namely the info of all the peers in the cluster), the configuration index (the configuration is just an array, so the index indicates where my server sits in the cluster), etc. Obviously, durable states are the better fit when you want to use less memory on your machines. Both strategies are sketched below
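
To make the two strategies concrete, here is a hedged Go sketch of the state-machine side; the StateMachine interface is our own simplification, not the API of any particular Raft framework:

package consensus

import "encoding/json"

// StateMachine is a simplified sketch of what a Raft framework asks your
// service to implement; real frameworks differ in the details.
type StateMachine interface {
    Apply(logEntry []byte) error   // called for every committed log entry
    Snapshot() ([]byte, error)     // called periodically so old logs can be truncated
    Restore(snapshot []byte) error // called when recovering from a snapshot
}

// InMemoryFSM keeps all states in memory; Snapshot serializes everything so
// the framework can persist it. Snapshotting can run in parallel with log
// replication and the state machine itself.
type InMemoryFSM struct {
    kv map[string][]byte
}

func (m *InMemoryFSM) Snapshot() ([]byte, error) {
    return json.Marshal(m.kv)
}

// DurableFSM writes states into a fast persistent DB (e.g. RocksDB) inside
// Apply, so Snapshot only needs to record a handful of Raft states (term,
// current configuration, configuration index), which is nearly a no-op.
type DurableFSM struct {
    raftState []byte // the small bookkeeping blob mentioned above
}

func (m *DurableFSM) Snapshot() ([]byte, error) {
    return m.raftState, nil
}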

Scalability: Partitioning (sharding) and rebalancing

If you're sure that reading from the followers is consistent enough for your business, you can route reads to a random peer in your cluster. However, all writes can only go to the leader, which drains that server under very high load when all your customers are flooding your service. In this case you can only scale up, not scale out, which means you need to buy more expensive servers instead of adding more cheap ones.

So to solve this problem, we can partition our data across multiple leaders. Upon high load, we add a new follower to the configuration and rebalance the whole cluster to take load off the running servers.

Generally you have 3 rules to partition your data:

  • Range-based: Your data are indexed by keys that are strings or monotonically increasing integers, so it's natural to set up ranges for partitioning
  • Hash-based: Hash your keys and place the data into the corresponding buckets; very intuitive
  • Range-hash-based: yup, the hybrid way, if you want to shard your data at multiple levels

You only have these rules to follow; the actual partitioning algorithm is business-defined, so just test and adjust.
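
As a hedged sketch (the boundary keys and the hash function here are arbitrary choices, not what our production system uses), range-based and hash-based routing look roughly like this:

package partition

import "hash/fnv"

// rangePartition picks a partition by comparing the key against sorted range
// boundaries; e.g. boundaries ["g", "p"] split the key space into
// [..., "g"), ["g", "p"), ["p", ...).
func rangePartition(key string, boundaries []string) int {
    for i, b := range boundaries {
        if key < b {
            return i
        }
    }
    return len(boundaries)
}

// hashPartition hashes the key and places it into one of n buckets.
func hashPartition(key string, n int) int {
    h := fnv.New32a()
    h.Write([]byte(key))
    return int(h.Sum32()) % n
}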

With respect to partitioning a Raft cluster, that's where multiraft kicks in: a single server can host thousands of Raft peers, each of which may be a follower or a leader. A Raft group is just a data partition, nothing fancy. The locations of all Raft leaders (the address of the server that hosts multiple Raft groups, not a particular Raft node) should be stored in an independently running service called a PD (placement driver), or controller, or orchestrator, or locator; just pick your favorite name. That's because our client has to locate the correct partition and write the data to that partition's Raft leader.
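
A minimal sketch of what that lookup could look like from the client's point of view (these types are hypothetical, just to show the routing idea):

package pd

// Partition describes one Raft group: the key range it owns and the address
// of the server currently hosting its leader.
type Partition struct {
    GroupID    uint64
    StartKey   string // inclusive
    EndKey     string // exclusive; "" means unbounded
    LeaderAddr string
}

// Locator is a sketch of the in-memory view that PD serves to clients.
type Locator struct {
    partitions []Partition // kept sorted by StartKey
}

// Locate finds the partition whose range contains the key, so the client can
// send the write to that Raft group's leader.
func (l *Locator) Locate(key string) (Partition, bool) {
    for _, p := range l.partitions {
        if key >= p.StartKey && (p.EndKey == "" || key < p.EndKey) {
            return p, true
        }
    }
    return Partition{}, false
}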

But when would partitioning happen? And how to rebalance them?

In the very beginning, you have only one Raft group, #0, spread over, for instance, 4 servers (A, B, C, D): servers A, B, and C own this Raft group, and the leader is on server A. You set up a threshold for partitioning the data, e.g. a maximum number of keys inside a partition, or a maximum partition size if we consider the data volume. When the threshold is hit, you split the partition in two and create a new Raft group #1, forcing server A to take its leadership, so it's unnecessary to transfer the second half of the data to other servers, which would be costly; we split the data and create the new Raft group in place. If that succeeds, we tell PD that a new Raft group has been created, along with its key range, so clients can now route writes to 2 partitions.
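
A hedged sketch of that split trigger (the threshold, the choice of midKey, and the new group ID allocation are left to the caller and are purely illustrative):

package split

// maxKeysPerPartition is an illustrative threshold, not a production value.
const maxKeysPerPartition = 1 << 20

type partition struct {
    groupID  uint64
    startKey string
    endKey   string
    numKeys  int
}

// maybeSplit checks the partition against the threshold; when it is exceeded,
// the range is split at midKey in place, and the returned right half becomes
// a fresh Raft group (newGroupID) led by the same server, so no data has to
// move immediately. The caller then reports the new range to PD.
func maybeSplit(p *partition, midKey string, newGroupID uint64) (*partition, bool) {
    if p.numKeys < maxKeysPerPartition {
        return nil, false
    }
    right := &partition{
        groupID:  newGroupID,
        startKey: midKey,
        endKey:   p.endKey,
        numKeys:  p.numKeys / 2,
    }
    p.endKey = midKey
    p.numKeys -= right.numKeys
    return right, true
}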

Then our long-running rebalancing worker (maybe in PD? not sure yet) somehow finds that we have 2 partitions while server D is idle; that's not good. We create a learner of Raft group #1 on server D, which receives all the logs and snapshots from the other peers. Once it has caught up, we transfer the leadership from server A to server D (or another server) and let server A quit the configuration, finishing the rebalancing.
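
And a hedged sketch of those rebalancing steps in order (the admin interface is hypothetical, not the API of any real framework):

package rebalance

import "context"

// RaftAdmin is a hypothetical administrative interface over one Raft group.
type RaftAdmin interface {
    AddLearner(ctx context.Context, group uint64, server string) error
    WaitCaughtUp(ctx context.Context, group uint64, server string) error
    TransferLeadership(ctx context.Context, group uint64, to string) error
    RemovePeer(ctx context.Context, group uint64, server string) error
}

// rebalance moves Raft group `group` from busy server `from` to idle server
// `to`: add a learner, wait for it to catch up on logs and snapshots, transfer
// the leadership, and finally let the old server quit the configuration.
func rebalance(ctx context.Context, admin RaftAdmin, group uint64, from, to string) error {
    if err := admin.AddLearner(ctx, group, to); err != nil {
        return err
    }
    if err := admin.WaitCaughtUp(ctx, group, to); err != nil {
        return err
    }
    if err := admin.TransferLeadership(ctx, group, to); err != nil {
        return err
    }
    return admin.RemovePeer(ctx, group, from)
}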

Distributed Object Storage

Now that we know how to build a distributed system with high availability and scalability, how do we use these toys to create a distributed object storage?

We create the following components:

  • Gateway: Stateless service that implements the S3 protocol and locates the correct servers to read and write the data. It can scale out by adding new servers
  • Placer: Typically a 3-node Raft cluster; it stores the mapping from key ranges to the locations of all partitions, plus the state of all dataservers for future rebalancing and other administrative work
  • Dataserver: The multiraft cluster, containing the object metas and the object data

But to our knowledge, it's not feasible to keep object meta and object data together inside the dataserver, because objects can be both huge and tiny, and deletes are massive every day. So we separate it into a dataserver and a metaserver: the latter is responsible for object meta only, and the former acts like the following Go function:

// ObjectID identifies a stored object body.
type ObjectID string

// alloc stores the immutable data and returns its ID.
func alloc(data []byte) ObjectID

The dataserver only stores the data and its ID, and the metaserver stores the mapping between object keys (user-defined, just like filenames) and object IDs.
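
To show how the pieces fit together, here is a hedged sketch of the gateway's write and read paths (the interfaces are our own simplifications of the real services):

package gateway

// ObjectID identifies an object body, as in the alloc function above.
type ObjectID string

// MetaServer maps user-visible object keys to internal object IDs.
type MetaServer interface {
    Put(key string, id ObjectID) error
    Get(key string) (ObjectID, error)
}

// DataServer stores immutable object bodies and hands back their IDs.
type DataServer interface {
    Alloc(data []byte) (ObjectID, error)
    Read(id ObjectID) ([]byte, error)
}

// putObject writes the body to the dataserver first, then records the
// key -> ObjectID mapping in the metaserver.
func putObject(meta MetaServer, data DataServer, key string, body []byte) error {
    id, err := data.Alloc(body)
    if err != nil {
        return err
    }
    return meta.Put(key, id)
}

// getObject resolves the key to an ObjectID via the metaserver, then reads
// the body from the dataserver.
func getObject(meta MetaServer, data DataServer, key string) ([]byte, error) {
    id, err := meta.Get(key)
    if err != nil {
        return nil, err
    }
    return data.Read(id)
}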

The design of the dataserver from our team is complicated; I want to simplify it and make an intuitive, minimal one without reading their code. Some key notes of the design, sketched in code below, would be:

  • Data are stored sequentially within a super long fixed-size block circular buffer
  • Blocks are indexed by a table in memory for object ID lookup
  • Delete is an operation that marks tombstones on blocks
  • GC acts like a copying collector (similar to how Haskell's GC works): since our object data is immutable, we just need to move the live values to the head of the block queue; the GC workers run every midnight
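
A minimal sketch of that design, under the assumption of a single in-memory index and ignoring buffer wrap-around and concurrency:

package blockstore

// entry records where one object lives inside the block buffer.
type entry struct {
    offset    int64
    length    int64
    tombstone bool // set by Delete; the bytes stay in place until GC
}

// BlockStore appends immutable object data sequentially into a fixed-size
// circular buffer and indexes it in memory by object ID.
type BlockStore struct {
    buf   []byte           // the super long fixed-size block circular buffer
    head  int64            // next write position
    index map[string]entry // object ID -> location, kept in memory
}

// Delete only marks a tombstone; the space is reclaimed later by GC.
func (s *BlockStore) Delete(id string) {
    if e, ok := s.index[id]; ok {
        e.tombstone = true
        s.index[id] = e
    }
}

// gc behaves like a copying collector: live (non-tombstoned) values are moved
// to the head of the block queue and tombstoned entries are dropped. In the
// real system this would run every midnight.
func (s *BlockStore) gc() {
    for id, e := range s.index {
        if e.tombstone {
            delete(s.index, id)
            continue
        }
        copy(s.buf[s.head:], s.buf[e.offset:e.offset+e.length])
        s.index[id] = entry{offset: s.head, length: e.length}
        s.head += e.length
    }
}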

What Else?

  • DR (disaster recovery): We could add learners for metaserver and dataserver instances in different DCs, so data are replicated per partition and we can switch the traffic over upon DC failures
  • EC (erasure coding): To keep dataservers highly available at a low cost, we need to support EC to lower the budget (with 3-replica Raft only 33% of the raw capacity holds unique data, while 4+2 EC achieves about 66%; that's a huge boost)

Further Reading

  • The design of TiKV is really inspiring; you can see this post for more insights

