mit-distributed-systems-labs's Issues

Lab 2: Raft

In this lab you'll implement Raft as a Go object type with associated methods, meant to be used as a module in a larger service. A set of Raft instances talk to each other with RPC to maintain replicated logs. Your Raft interface will support an indefinite sequence of numbered commands, also called log entries. The entries are numbered with index numbers. The log entry with a given index will eventually be committed. At that point, your Raft should send the log entry to the larger service for it to execute.

Follow the design in https://pdos.csail.mit.edu/6.824/papers/raft-extended.pdf, paying particular attention to Figure 2. We will not implement "cluster membership changes" (Section 6).
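As a rough sketch, the resulting Go object might look like the following. The type and method names follow the 6.824 skeleton handout (the real skeleton constructor also takes the RPC peer list and a persister, omitted here); treat this as an illustration, not the required code.

```go
package raft

// ApplyMsg is what Raft sends to the larger service once a log entry has
// been committed, so the service can execute the command.
type ApplyMsg struct {
	CommandValid bool
	Command      interface{}
	CommandIndex int // the index number assigned to this log entry
}

// Raft is one peer's state. The persistent state from Figure 2
// (currentTerm, votedFor, log[]) and the volatile state would live here.
type Raft struct {
	applyCh chan ApplyMsg // committed entries are delivered on this channel
}

// Start proposes a new command for the replicated log. It returns immediately
// with the index the command will occupy if it is ever committed, the current
// term, and whether this peer believes it is the leader.
func (rf *Raft) Start(command interface{}) (index int, term int, isLeader bool) {
	return -1, -1, false // placeholder
}

// GetState reports the current term and whether this peer thinks it is leader.
func (rf *Raft) GetState() (term int, isLeader bool) {
	return 0, false // placeholder
}
```

Once an entry at some index is committed, the peer sends an ApplyMsg for it on applyCh; that is how "send the log entry to the larger service" is realized.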

Resources:

GFS paper notes

Design assumptions

  • Must be fault-tolerant
  • Can handle millions of large files efficiently
  • Supports large reads from a contiguous region in the file, and small reads at random locations in the file
  • Supports large, sequential writes to a file. Also supports small writes to random offsets, but does not need to be efficient
  • Concurrently appending to a file must be efficient and have well-defined semantics. Atomicity with minimal overhead is required
  • High sustained bandwidth is more important than low latency

Architecture

A cluster consists of:

  • 1 master
  • Multiple chunkservers
  • Multiple clients

File creation

When a file is created, it is divided into chunks, and each chunk is assigned a globally unique ID (the chunk handle) by the master. Chunkservers store chunks as Linux files. The master holds the mapping from file names to chunk handles, along with other metadata.

Clients ask the master for metadata, but they talk to the chunkservers directly to read and write data.

Single master

  • Single master is easy to manage
  • But reads and writes must not go through the master; otherwise it becomes a bottleneck

Interactions for requesting a file

  • Client transforms the file name and byte offset into a chunk index
  • Client sends the file name and chunk index to master
  • Master replies with chunk handle and chunkserver locations
  • Client caches this info
  • Client reaches out to the chunkservers with the chunk handle and a byte range within that chunk (see the sketch after this list)
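A rough Go sketch of that read path; every name here (Client, askMaster, readChunk, and so on) is an illustrative assumption, not an API from the paper:

```go
package gfs

const chunkSize int64 = 64 << 20 // 64 MB chunks (see "Chunk size" below)

type chunkLocation struct {
	handle  uint64   // globally unique chunk handle assigned by the master
	servers []string // chunkserver addresses holding replicas of the chunk
}

type cacheKey struct {
	file       string
	chunkIndex int64
}

type Client struct {
	cache map[cacheKey]chunkLocation
	// These stand in for the RPCs to the master and to a chunkserver.
	askMaster func(file string, chunkIndex int64) (chunkLocation, error)
	readChunk func(server string, handle uint64, off int64, n int) ([]byte, error)
}

// Read fetches n bytes of file starting at offset.
func (c *Client) Read(file string, offset int64, n int) ([]byte, error) {
	// 1. Translate the byte offset into a chunk index within the file.
	chunkIndex := offset / chunkSize

	// 2. Get the chunk handle and replica locations from the cache if
	//    possible, otherwise from the master.
	key := cacheKey{file, chunkIndex}
	loc, ok := c.cache[key]
	if !ok {
		var err error
		if loc, err = c.askMaster(file, chunkIndex); err != nil {
			return nil, err
		}
		c.cache[key] = loc
	}

	// 3. Read the byte range within that chunk directly from a chunkserver.
	return c.readChunk(loc.servers[0], loc.handle, offset%chunkSize, n)
}
```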

Chunk size

64 MB

  • Reduces the number of requests between the client and the master: the client only needs to ask the master for a chunk's location once
  • Reduces the size of the metadata in the master, which allows the metadata to be kept in memory and brings further advantages (see Metadata below)

Small files may only have 1 chunk. Chunkservers storing these chunks may become hot spots if many clients are reading the same file.

Metadata

Master stores 3 main types of metadata:

  1. File and chunk namespaces
  2. Mapping from files to chunks
  3. Chunk locations

In-memory datastore

All three are stored in memory. Types 1 and 2 are also persisted to the operation log to help with recovery when the master crashes. The data for type 3 is gathered when the master starts up or when a new chunkserver joins the cluster.
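A rough Go sketch of those three kinds of metadata; the type and field names are illustrative assumptions, not taken from the paper:

```go
package gfs

// masterMetadata is an illustrative sketch of the master's in-memory state.
type masterMetadata struct {
	// 1. File and chunk namespaces (here reduced to the set of path names).
	namespace map[string]bool

	// 2. Mapping from each file to the ordered list of its chunk handles.
	fileChunks map[string][]uint64

	// 3. Chunk locations: which chunkservers currently hold each chunk.
	//    Not persisted; rebuilt by asking chunkservers at startup or join.
	chunkLocations map[uint64][]string
}
```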

Storing metadata in memory

Pros:

  • Operations involving this data are fast
  • Keeps the design simple

Cons:

  • The capacity of the file system is capped by the amount of memory the master has. This has not been a problem so far because the metadata is very compact

Chunk locations

The master knows the chunk locations because:

  • It controls where those chunks are placed
  • It periodically polls the chunkservers for this data

Operation log

The operation log contains a historical record of critical metadata changes. Files and chunks are uniquely and eternally identified by the logical times at which they were created, as recorded in the log.

The operation log is replicated across many machines. The master will not respond to a client operation until the log has been flushed to disk locally and remotely. The master batches several log records together for flushing to reduce the overall impact on throughput.

The master recovers its state by using a checkpoint and replaying the operations in the log. Once the log reaches a certain size, the master will produce a checkpoint which will be replicated to remote machines. The log can then be reduced to only contain operations after the checkpoint.
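A hedged sketch of that recovery path, reusing the masterMetadata sketch above; the log-record format and the function names are assumptions made for illustration:

```go
package gfs

// logRecord is one persisted metadata mutation. The real record format is
// not described in these notes, so this is an assumption.
type logRecord struct {
	op   string   // e.g. "create", "delete", "addChunk"
	path string   // file the mutation applies to
	args []string // operation-specific arguments
}

// recoverMaster rebuilds the master's metadata from the newest checkpoint
// plus the operation-log records written after that checkpoint was taken.
func recoverMaster(checkpoint *masterMetadata, tail []logRecord) *masterMetadata {
	m := checkpoint // start from the checkpointed in-memory state
	for _, rec := range tail {
		applyRecord(m, rec) // re-apply each logged mutation, in order
	}
	return m
}

// applyRecord mutates the in-memory metadata according to one log record.
func applyRecord(m *masterMetadata, rec logRecord) {
	switch rec.op {
	case "create":
		m.namespace[rec.path] = true
		m.fileChunks[rec.path] = nil
	// ...other record types would be handled here
	}
}
```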

Question: How do they process writes to the metadata data structure while checkpointing is going on?

Consistency Model

Guarantees by GFS

File namespace mutations (e.g., file creation) are atomic

The state of a file region after a data mutation depends on the type of mutation, whether it succeeds or fails, and whether there are concurrent mutations.

A file region is consistent if all clients will always see the same data, regardless of which replicas they read from.

A region is defined after a file data mutation if it is consistent and clients will see what the mutation writes in its entirety.

  • When a mutation succeeds without interference from concurrent writers, the affected region is defined (and by implication consistent): all clients will always see what the mutation has written.
  • Concurrent successful mutations leave the region undefined but consistent: all clients see the same data, but it may not reflect what any one mutation has written. Typically, it consists of mingled fragments from multiple mutations.
  • A failed mutation makes the region inconsistent (hence also undefined): different clients may see different data at different times.

Because clients cache chunk locations, they may read from a stale replica. This window is limited by the cache entry's timeout and by the cache being refreshed the next time the file is opened.

System interactions

Leases and mutation order

Summarize the requirements for lab 1

Summarize the requirements for the coordinator and worker in lab 1.

Coordinator requirements

  • Only 1 coordinator process
  • Runs on same machine as workers
  • Detect that a worker has not completed its task within the time limit (hard-coded 10s), and re-assign the task to another worker (see the sketch after this list)
  • The coordinator knows what the input files are. Input files are stored on the same disk as coordinator and worker
  • Each input file corresponds to one split of the data, i.e. no need for us to write code to split the data into chunks
  • Able to handle workers that crash while performing a task
  • The coordinator does not need to spin up the workers by itself. It may only need to tell the workers that all tasks are complete so the workers should exit
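A minimal sketch of the 10-second re-assignment requirement, assuming a task table guarded by a mutex; the type and function names are illustrative, not part of the lab's required API:

```go
package mr

import (
	"sync"
	"time"
)

type taskState int

const (
	idle taskState = iota
	inProgress
	completed
)

type task struct {
	state     taskState
	startedAt time.Time // when the task was last handed to a worker
}

type Coordinator struct {
	mu    sync.Mutex
	tasks []task
}

// reapStragglers marks tasks that have been in progress for more than 10
// seconds as idle again so they can be re-assigned to another worker. It
// could run periodically or be inlined into the task-assignment RPC handler.
func (c *Coordinator) reapStragglers() {
	c.mu.Lock()
	defer c.mu.Unlock()
	for i := range c.tasks {
		t := &c.tasks[i]
		if t.state == inProgress && time.Since(t.startedAt) > 10*time.Second {
			t.state = idle // crashed or slow worker: hand the task out again
		}
	}
}
```

This also covers workers that crash mid-task, since a crashed worker simply never reports completion.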

Worker requirements

  • Many worker processes
  • Runs on same machine as coordinator
  • Ask coordinator for a task
  • Read the task input from one or more files
  • Execute the task
  • Write the task output to one or more files
  • Workers write their outputs to files named like mr-out-*
  • The map phase should divide the intermediate keys into buckets for nReduce reduce tasks, where nReduce is the number of reduce tasks -- the argument that main/mrcoordinator.go passes to MakeCoordinator(). Each mapper should create nReduce intermediate files for consumption by the reduce tasks (see the sketch after this list).
  • The worker implementation should put the output of the X'th reduce task in the file mr-out-X.
  • A mr-out-X file should contain one line per Reduce function output. The line should be generated with the Go "%v %v" format, called with the key and value. Have a look in main/mrsequential.go for the line commented "this is the correct format". The test script will fail if your implementation deviates too much from this format.
  • The worker should put intermediate Map output in files in the current directory, where your worker can later read them as input to Reduce tasks.
  • main/mrcoordinator.go expects mr/coordinator.go to implement a Done() method that returns true when the MapReduce job is completely finished; at that point, mrcoordinator.go will exit.
  • When the job is completely finished, the worker processes should exit. A simple way to implement this is to use the return value from call(): if the worker fails to contact the coordinator, it can assume that the coordinator has exited because the job is done, so the worker can terminate too. Depending on your design, you might also find it helpful to have a "please exit" pseudo-task that the coordinator can give to workers.
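A sketch of the bucketing and the mr-out-X output format described above. KeyValue and ihash mirror helpers from the lab's worker skeleton; the other function names are illustrative:

```go
package mr

import (
	"fmt"
	"hash/fnv"
	"os"
)

// KeyValue is the pair type produced by Map and consumed by Reduce.
type KeyValue struct {
	Key   string
	Value string
}

// ihash picks the reduce bucket for a key: map output for a key goes to
// reduce task ihash(key) % nReduce.
func ihash(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() & 0x7fffffff)
}

// bucketIntermediate splits one map task's output into nReduce buckets,
// one per reduce task.
func bucketIntermediate(kvs []KeyValue, nReduce int) [][]KeyValue {
	buckets := make([][]KeyValue, nReduce)
	for _, kv := range kvs {
		r := ihash(kv.Key) % nReduce
		buckets[r] = append(buckets[r], kv)
	}
	return buckets
}

// writeReduceOutput writes one reduce task's results to mr-out-X using the
// "%v %v" line format that the test script expects.
func writeReduceOutput(x int, results []KeyValue) error {
	f, err := os.Create(fmt.Sprintf("mr-out-%d", x))
	if err != nil {
		return err
	}
	defer f.Close()
	for _, kv := range results {
		fmt.Fprintf(f, "%v %v\n", kv.Key, kv.Value)
	}
	return nil
}
```

A worker would call bucketIntermediate after running Map, write each bucket to an intermediate file in the current directory, and later call writeReduceOutput once Reduce finishes for bucket X.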
