mit-distributed-systems-labs's Issues

Lab 2: Raft

In this lab you'll implement Raft as a Go object type with associated methods, meant to be used as a module in a larger service. A set of Raft instances talk to each other with RPC to maintain replicated logs. Your Raft interface will support an indefinite sequence of numbered commands, also called log entries. The entries are numbered with index numbers. The log entry with a given index will eventually be committed. At that point, your Raft should send the log entry to the larger service for it to execute.

Follow the design in https://pdos.csail.mit.edu/6.824/papers/raft-extended.pdf, paying particular attention to Figure 2. We will not implement "cluster membership changes" (Section 6).
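As a rough sketch, the resulting Go object might look like the following. The type and method names follow the 6.824 skeleton handout (the real skeleton constructor also takes the RPC peer list and a persister, omitted here); treat this as an illustration, not the required code.

```go
package raft

// ApplyMsg is what Raft sends to the larger service once a log entry has
// been committed, so the service can execute the command.
type ApplyMsg struct {
	CommandValid bool
	Command      interface{}
	CommandIndex int // the index number assigned to this log entry
}

// Raft is one peer's state. The persistent state from Figure 2
// (currentTerm, votedFor, log[]) and the volatile state would live here.
type Raft struct {
	applyCh chan ApplyMsg // committed entries are delivered on this channel
}

// Start proposes a new command for the replicated log. It returns immediately
// with the index the command will occupy if it is ever committed, the current
// term, and whether this peer believes it is the leader.
func (rf *Raft) Start(command interface{}) (index int, term int, isLeader bool) {
	return -1, -1, false // placeholder
}

// GetState reports the current term and whether this peer thinks it is leader.
func (rf *Raft) GetState() (term int, isLeader bool) {
	return 0, false // placeholder
}
```

Once an entry at some index is committed, the peer sends an ApplyMsg for it on applyCh; that is how "send the log entry to the larger service" is realized.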

Resources:

GFS paper notes

Design assumptions

  • Must be fault-tolerant
  • Can handle millions of large files efficiently
  • Supports large reads from a contiguous region in the file, and small reads at random locations in the file
  • Supports large, sequential writes to a file. Also supports small writes to random offsets, but does not need to be efficient
  • Concurrently appending to a file must be efficient and have well-defined semantics. Atomicity with minimal overhead is required
  • High sustained bandwidth is more important than low latency

Architecture

A cluster consists of:

  • 1 master
  • Multiple chunkservers
  • Multiple clients

File creation

When a file is created, it is divided into chunks, and each chunk is assigned a globally unique ID (the chunk handle) by the master. Chunkservers store chunks as Linux files. The master holds the mapping from file names to chunk handles, along with other metadata.

Clients ask the master for metadata, but they talk to the chunkservers directly to read and write data.

Single master

  • Single master is easy to manage
  • But reads and writes must not go through the master; otherwise it becomes a bottleneck

Interactions for requesting a file

  • Client transforms the file name and byte offset into a chunk index
  • Client sends the file name and chunk index to master
  • Master replies with chunk handle and chunkserver locations
  • Client caches this info
  • Client reaches out to the chunkservers with the chunk handle and a byte range within that chunk (see the sketch after this list)
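A rough Go sketch of that read path; every name here (Client, askMaster, readChunk, and so on) is an illustrative assumption, not an API from the paper:

```go
package gfs

const chunkSize int64 = 64 << 20 // 64 MB chunks (see "Chunk size" below)

type chunkLocation struct {
	handle  uint64   // globally unique chunk handle assigned by the master
	servers []string // chunkserver addresses holding replicas of the chunk
}

type cacheKey struct {
	file       string
	chunkIndex int64
}

type Client struct {
	cache map[cacheKey]chunkLocation
	// These stand in for the RPCs to the master and to a chunkserver.
	askMaster func(file string, chunkIndex int64) (chunkLocation, error)
	readChunk func(server string, handle uint64, off int64, n int) ([]byte, error)
}

// Read fetches n bytes of file starting at offset.
func (c *Client) Read(file string, offset int64, n int) ([]byte, error) {
	// 1. Translate the byte offset into a chunk index within the file.
	chunkIndex := offset / chunkSize

	// 2. Get the chunk handle and replica locations from the cache if
	//    possible, otherwise from the master.
	key := cacheKey{file, chunkIndex}
	loc, ok := c.cache[key]
	if !ok {
		var err error
		if loc, err = c.askMaster(file, chunkIndex); err != nil {
			return nil, err
		}
		c.cache[key] = loc
	}

	// 3. Read the byte range within that chunk directly from a chunkserver.
	return c.readChunk(loc.servers[0], loc.handle, offset%chunkSize, n)
}
```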

Chunk size

64 MB

  • Reduces the number of requests between the client and the master: the client only needs to ask the master for a chunk's location once
  • Reduces the size of the metadata in the master, which allows the metadata to be kept in memory and brings further advantages (see Metadata below)

Small files may only have 1 chunk. Chunkservers storing these chunks may become hot spots if many clients are reading the same file.

Metadata

Master stores 3 main types of metadata:

  1. File and chunk namespaces
  2. Mapping from files to chunks
  3. Chunk locations

In-memory datastore

All three are stored in memory. Types 1 and 2 are also persisted to the operation log to help with recovery when the master crashes. The data for type 3 is gathered when the master starts up or when a new chunkserver joins the cluster.
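A rough Go sketch of those three kinds of metadata; the type and field names are illustrative assumptions, not taken from the paper:

```go
package gfs

// masterMetadata is an illustrative sketch of the master's in-memory state.
type masterMetadata struct {
	// 1. File and chunk namespaces (here reduced to the set of path names).
	namespace map[string]bool

	// 2. Mapping from each file to the ordered list of its chunk handles.
	fileChunks map[string][]uint64

	// 3. Chunk locations: which chunkservers currently hold each chunk.
	//    Not persisted; rebuilt by asking chunkservers at startup or join.
	chunkLocations map[uint64][]string
}
```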

Storing metadata in memory

Pros:

  • Operations involving this data are fast
  • Keeps the design simple

Cons:

  • The capacity of the file system is capped by the amount of memory the master has. This has not been a problem so far because the metadata is very compact

Chunk locations

The master knows the chunk locations because:

  • It controls where those chunks are placed
  • It periodically polls the chunkservers for this data

Operation log

The operation log contains a historical record of critical metadata changes. Files and chunks are uniquely and eternally identified by the logical times at which they were created, as recorded in the log.

The operation log is replicated across many machines. The master will not respond to a client operation until the log has been flushed to disk locally and remotely. The master batches several log records together for flushing to reduce the overall impact on throughput.

The master recovers its state by using a checkpoint and replaying the operations in the log. Once the log reaches a certain size, the master will produce a checkpoint which will be replicated to remote machines. The log can then be reduced to only contain operations after the checkpoint.
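A hedged sketch of that recovery path, reusing the masterMetadata sketch above; the log-record format and the function names are assumptions made for illustration:

```go
package gfs

// logRecord is one persisted metadata mutation. The real record format is
// not described in these notes, so this is an assumption.
type logRecord struct {
	op   string   // e.g. "create", "delete", "addChunk"
	path string   // file the mutation applies to
	args []string // operation-specific arguments
}

// recoverMaster rebuilds the master's metadata from the newest checkpoint
// plus the operation-log records written after that checkpoint was taken.
func recoverMaster(checkpoint *masterMetadata, tail []logRecord) *masterMetadata {
	m := checkpoint // start from the checkpointed in-memory state
	for _, rec := range tail {
		applyRecord(m, rec) // re-apply each logged mutation, in order
	}
	return m
}

// applyRecord mutates the in-memory metadata according to one log record.
func applyRecord(m *masterMetadata, rec logRecord) {
	switch rec.op {
	case "create":
		m.namespace[rec.path] = true
		m.fileChunks[rec.path] = nil
	// ...other record types would be handled here
	}
}
```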

Question: How do they process writes to the metadata data structure while checkpointing is going on?

Consistency Model

Guarantees by GFS

File namespace mutations (e.g., file creation) are atomic

The state of a file region after a data mutation depends on the type of mutation, whether it succeeds or fails, and whether there are concurrent mutations.

A file region is consistent if all clients will always see the same data, regardless of which replicas they read from.

A region is defined after a file data mutation if it is consistent and clients will see what the mutation writes in its entirety.

  • When a mutation succeeds without interference from concurrent writers, the affected region is defined (and by implication consistent): all clients will always see what the mutation has written.
  • Concurrent successful mutations leave the region undefined but consistent: all clients see the same data, but it may not reflect what any one mutation has written. Typically, it consists of mingled fragments from multiple mutations.
  • A failed mutation makes the region inconsistent (hence also undefined): different clients may see different data at different times.

Because clients cache chunk locations, they may read from a stale replica. This window is limited by the cache entry's timeout and by the cache being refreshed the next time the file is opened.

System interactions

Leases and mutation order

Summarize the requirements for lab 1

Summarize the requirements for the coordinator and worker in lab 1.

Coordinator requirements

  • Only 1 coordinator process
  • Runs on same machine as workers
  • Detect that a worker has not completed its task within the time limit (hard-coded 10s), and re-assign the task to another worker (see the sketch after this list)
  • The coordinator knows what the input files are. Input files are stored on the same disk as coordinator and worker
  • Each input file corresponds to one split of the data, i.e. no need for us to write code to split the data into chunks
  • Able to handle workers that crash while performing a task
  • The coordinator does not need to spin up the workers by itself. It may only need to tell the workers that all tasks are complete so the workers should exit
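A minimal sketch of the 10-second re-assignment requirement, assuming a task table guarded by a mutex; the type and function names are illustrative, not part of the lab's required API:

```go
package mr

import (
	"sync"
	"time"
)

type taskState int

const (
	idle taskState = iota
	inProgress
	completed
)

type task struct {
	state     taskState
	startedAt time.Time // when the task was last handed to a worker
}

type Coordinator struct {
	mu    sync.Mutex
	tasks []task
}

// reapStragglers marks tasks that have been in progress for more than 10
// seconds as idle again so they can be re-assigned to another worker. It
// could run periodically or be inlined into the task-assignment RPC handler.
func (c *Coordinator) reapStragglers() {
	c.mu.Lock()
	defer c.mu.Unlock()
	for i := range c.tasks {
		t := &c.tasks[i]
		if t.state == inProgress && time.Since(t.startedAt) > 10*time.Second {
			t.state = idle // crashed or slow worker: hand the task out again
		}
	}
}
```

This also covers workers that crash mid-task, since a crashed worker simply never reports completion.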

Worker requirements

  • Many worker processes
  • Runs on same machine as coordinator
  • Ask coordinator for a task
  • Read the task input from one or more files
  • Execute the task
  • Write the task output to one or more files
  • Workers write their outputs to files named like mr-out-*
  • The map phase should divide the intermediate keys into buckets for nReduce reduce tasks, where nReduce is the number of reduce tasks -- the argument that main/mrcoordinator.go passes to MakeCoordinator(). Each mapper should create nReduce intermediate files for consumption by the reduce tasks (see the sketch after this list).
  • The worker implementation should put the output of the X'th reduce task in the file mr-out-X.
  • A mr-out-X file should contain one line per Reduce function output. The line should be generated with the Go "%v %v" format, called with the key and value. Have a look in main/mrsequential.go for the line commented "this is the correct format". The test script will fail if your implementation deviates too much from this format.
  • The worker should put intermediate Map output in files in the current directory, where your worker can later read them as input to Reduce tasks.
  • main/mrcoordinator.go expects mr/coordinator.go to implement a Done() method that returns true when the MapReduce job is completely finished; at that point, mrcoordinator.go will exit.
  • When the job is completely finished, the worker processes should exit. A simple way to implement this is to use the return value from call(): if the worker fails to contact the coordinator, it can assume that the coordinator has exited because the job is done, so the worker can terminate too. Depending on your design, you might also find it helpful to have a "please exit" pseudo-task that the coordinator can give to workers.
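A sketch of the bucketing and the mr-out-X output format described above. KeyValue and ihash mirror helpers from the lab's worker skeleton; the other function names are illustrative:

```go
package mr

import (
	"fmt"
	"hash/fnv"
	"os"
)

// KeyValue is the pair type produced by Map and consumed by Reduce.
type KeyValue struct {
	Key   string
	Value string
}

// ihash picks the reduce bucket for a key: map output for a key goes to
// reduce task ihash(key) % nReduce.
func ihash(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() & 0x7fffffff)
}

// bucketIntermediate splits one map task's output into nReduce buckets,
// one per reduce task.
func bucketIntermediate(kvs []KeyValue, nReduce int) [][]KeyValue {
	buckets := make([][]KeyValue, nReduce)
	for _, kv := range kvs {
		r := ihash(kv.Key) % nReduce
		buckets[r] = append(buckets[r], kv)
	}
	return buckets
}

// writeReduceOutput writes one reduce task's results to mr-out-X using the
// "%v %v" line format that the test script expects.
func writeReduceOutput(x int, results []KeyValue) error {
	f, err := os.Create(fmt.Sprintf("mr-out-%d", x))
	if err != nil {
		return err
	}
	defer f.Close()
	for _, kv := range results {
		fmt.Fprintf(f, "%v %v\n", kv.Key, kv.Value)
	}
	return nil
}
```

A worker would call bucketIntermediate after running Map, write each bucket to an intermediate file in the current directory, and later call writeReduceOutput once Reduce finishes for bucket X.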
