
cse221notes's Introduction

cse221notes

  1. paper summaries
  2. lecture notes
  3. part of my notes is in the issues

Caveats:

Reading the original papers will give you a better understanding of why these systems were designed the way they are.

cse221notes's People

Contributors

byshen


cse221notes's Issues

soft updates & RIO file cache

metadata consistency/reliability

FFS synchronized writes => LFS solves the problem but may cause consistency issues.

opportunity/benefit => cache metadata writes, (async) update every 30s

update dependence
update sequence/order

roll back & roll forward: a new update in the file cache may depend on an update that has not yet reached disk, so before writing a block, undo (roll back) the dependent updates to that portion of the block, write the updates in order, then roll forward to the most recent state (like a time machine on that portion of the block). The block is locked to prevent applications from seeing the rolled-back state.

constraints/rules:

  • no dangling pointers => e.g. "create": update the inode first, then the directory block pointer; "delete" is the reverse; this may create dependencies between blocks
  • never reuse a resource until all old pointers to it are removed
  • never reset the last pointer to a resource until the new one is on disk

=> track dependencies at the granularity of a pointer: keep all changes to the pointer, not the entire inode/directory block.
Roll back to a point where no cyclic dependency exists, then apply a batch of updates together.
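The ordering rules above can be sketched as a tiny dependency-tracked write-back cache (the names and structure here are illustrative, not the actual soft updates implementation):

```python
# Sketch of soft-updates-style ordering (hypothetical, simplified):
# a cached metadata block may only be written once every block it
# depends on has already reached disk.

class Block:
    def __init__(self, name):
        self.name = name
        self.depends_on = []   # blocks that must hit disk first
        self.on_disk = False

def flush(block, disk_log):
    """Write `block`, first flushing everything it depends on."""
    for dep in block.depends_on:
        if not dep.on_disk:
            flush(dep, disk_log)
    block.on_disk = True
    disk_log.append(block.name)

# "create": the new inode must be on disk before the directory
# entry that points to it (no dangling pointers).
inode = Block("inode")
dirent = Block("dirent")
dirent.depends_on.append(inode)

log = []
flush(dirent, log)
print(log)   # the inode is written before the directory entry
```

"delete" is the reverse: the directory entry is cleared on disk before the inode is freed, satisfying the never-reuse rule.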

fsync: all pending updates for that file must go to disk before the application can continue.

lottery scheduling

Simple and well suited to the problem it solves: randomized decisions that achieve good performance statistically.

Goal: proportional scheduling to the execution rate of computations. (statistically)

Tickets -> currencies. => the representation.
A thread's tickets decide how much CPU it gets and how long it waits.

Priority inversion: threads waiting for a lock donate their tickets to the thread holding the lock, giving it a higher effective priority.

Can be used for CPU, network, memory, disk, etc. Uniformity.
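A minimal sketch of a lottery draw, assuming a flat list of (client, tickets) pairs rather than the paper's full currency hierarchy:

```python
import random

def lottery_pick(clients, rng=random):
    """Pick a client with probability proportional to its ticket count."""
    total = sum(t for _, t in clients)
    winner = rng.randrange(total)        # draw a winning ticket number
    for name, tickets in clients:
        if winner < tickets:
            return name
        winner -= tickets

# Two clients with a 3:1 ticket ratio should win ~75% / ~25% of draws.
random.seed(0)
clients = [("A", 75), ("B", 25)]
wins = {"A": 0, "B": 0}
for _ in range(10_000):
    wins[lottery_pick(clients)] += 1
print(wins)  # roughly 7500 / 2500
```

The proportionality only holds statistically, over many draws, which is exactly the paper's "probabilistically fair" claim.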

A priority-based scheduler is more likely to cause blocking; a randomized scheduler avoids this statistically.

Why is it not used?
It is hard to express how many tickets a thread should have.

implementing RPC

Client diagram for how RPC works

Why RPC?

  • well-defined semantics (programmers do not need to take care of programming communications); familiarity with the semantics is a big advantage
  • principle -> make it as much like a local call as possible! Just like remote library calls.

How RPC works? (see the diagram)

  • Calls in the client are registered in the run time as libraries
  • the server stub just takes care of buffer reads & writes in the runtime
  • RPC Runtime does all the communication
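The flow in the diagram can be sketched with JSON standing in for the marshaled packet format and a direct function call standing in for the network (all names here are illustrative):

```python
import json

# The real server-side procedure (the "callee").
def add(a, b):
    return a + b

SERVER_PROCS = {"add": add}   # registered with the server's runtime

def server_dispatch(packet):
    """Server-side runtime: unpack the call, run it, pack the result."""
    msg = json.loads(packet)
    result = SERVER_PROCS[msg["proc"]](*msg["args"])
    return json.dumps({"result": result})

def rpc_add(a, b):
    """Client stub: marshals arguments, 'sends' them, unmarshals the
    reply. Here the network is just a function call; a real runtime
    handles packets, retransmission, and bindings."""
    packet = json.dumps({"proc": "add", "args": [a, b]})
    reply = server_dispatch(packet)
    return json.loads(reply)["result"]

print(rpc_add(2, 3))  # looks exactly like a local call
```

The point of the stub is visible here: the caller never touches the wire format, which is what "as much like local" means in practice.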

What do we do when the types (passing as the parameters) are complex?

  • need to generate the code automatically for both client/server stubs, for every application
  • e.g. strings/arrays/pointer-based data structures need to be linearized into the buffer; the data structures are then recovered on the server side. => higher overhead.
  • e.g. function pointers? Some systems need to migrate the code and recompile it... then you need distributed garbage collection... too complex, so no one uses it. (better to keep data and code local)
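A sketch of linearizing a pointer-based structure, assuming a singly linked list and JSON as a stand-in wire format (the real systems used formats like XDR):

```python
import json

class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def linearize(head):
    """Flatten a pointer-based list into a buffer the runtime can send.
    Pointers cannot cross the wire, so only the values go."""
    out = []
    while head is not None:
        out.append(head.value)
        head = head.next
    return json.dumps(out)

def rebuild(buf):
    """Server side: recover an equivalent structure from the buffer."""
    head = None
    for value in reversed(json.loads(buf)):
        head = Node(value, head)
    return head

lst = Node(1, Node(2, Node(3)))
copy = rebuild(linearize(lst))
print(copy.value, copy.next.value, copy.next.next.value)  # 1 2 3
```

The walk-and-copy on one side plus the rebuild on the other is exactly the "higher overhead" the notes mention.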

Procedural-based languages => object-oriented languages

  • things to consider: do we need to copy the object's address space, or just migrate the objects?

how to support multiple languages (like gRPC)

  • network data representation (XDR): agreement on how data on the network is interpreted
  • compiler support (same rules for different compilers)

Bindings => client-server bindings before communications.

Optimizations:

  • process: Use of idle processes in caller and callee machines to reduce process creation and process swaps.
  • communications: design its own specialized communication protocols. (1) simple calls: fit into one packet. (2) complex calls: (not common)
  • connections: avoid the cost of establishing and terminating connections by just sending requests; stateless servers (an influence on NFS server design); minimize the impact of many idle clients.

microkernel and exokernel

OS-MK(VARIANTS)-EXOKERNEL
L4
push as much as possible to user level.

  • low performance penalty of the microkernel: L4Linux ≈ Linux (<10% overhead)

Benefits

  • portability - multiple architectures
  • extensibility (benefit of user level) + specialization (e.g. pipes: less generality, less overhead)

Why is a microkernel slow?

  • The app uses IPC to trap into the mk, which then sends a message to the OS server (4 domain crossings). (A monolithic OS colocated with the mk has less overhead: only 2 crossings.)

MK components

  • IPC
  • Thread <-> address spaces (managing page tables)
    • context switch - changing address spaces causes lots of TLB flushes (L4 will try to use physical addresses as much as possible, hence fewer TLB flushes; think about having many address spaces)
  • Memory
    • one additional step: when the OS wants to update a page table entry, it traps into the L4 kernel (which runs at the higher privilege level)
    • this means that L4 actually manages all address spaces
  • Events
    • normally a syscall/exception traps into the OS
    • now the OS is at user level: the app traps into the mk, the mk encapsulates the event into an IPC message, then invokes an OS handler
  • CPU -> Sched
    • just normal time sharing among threads

Mach (OS server) for comparison


ExoKernel

  • expose h/w
  • management

Need three functionalities

  • track resource ownership
    • simple, track in the exokernel
  • protection
    • secure bindings (applications can securely bind to machine resources and handle events.)
    • implemented by hardware mechanisms, software caching, and downloading application code.
    • Examples, PTE/TLB, packet filters
  • resource revocation (abort protocol)
  • memory

    • the libOS wants to update the TLB; the exokernel checks that the TLB entry is owned by the app
    • s/w TLB caching
    • software-managed TLBs (the exo mechanism); with hardware-managed TLBs like x86, this does not work
  • network

    • packet filters
  • disk

  • want to minimize the kernel

  • each app has its own libOS; the OS has only the features that the app needs

Micro v.s. Exo Difference?

  • M: uses IPC to communicate. Easy to share files.
  • Exo: just library calls. Each application has its own file system, so it is hard to share files. No kernel crossing, no IPC, much faster to call.

Linux scalability to many cores

Many cores

  • new OS design (Corey exo)
  • wait, maybe monolithic is ok
  • what are the fundamental limitations? (1. OS implementation, 2. OS interface, 3. a fundamental problem)

1. locks on shared data structures

more cores -> more waiting (RCU does not work well with more copies)

2. Waiting on shared memory

cache coherence (RCU does not solve this problem; serialization still happens in one place)

  • sloppy counters (1. distribute the counts so each core updates only its own counter and does not synchronize with other cores; 2. computing the global sum is expensive, but that is acceptable)
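A sketch of the sloppy-counter idea (the structure is illustrative; the real kernel version also pads each slot to its own cache line):

```python
class SloppyCounter:
    """Per-core counts: each core updates only its own slot, so the
    fast path needs no cross-core synchronization. Reading the global
    value is expensive (it sums every slot) but is assumed to be rare."""
    def __init__(self, ncores):
        self.local = [0] * ncores

    def inc(self, core, n=1):
        self.local[core] += n      # fast: touches only this core's slot

    def total(self):
        return sum(self.local)     # slow: reads every core's slot

c = SloppyCounter(4)
for core in range(4):
    c.inc(core, 10)
print(c.total())  # 40
```

The trade is exactly the one in the bullet: cheap updates, expensive (but rare) global reads.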

cache line aliasing - false sharing

  • cache coherence works at the granularity of a cache line; some data structures are not actually shared but sit on the same cache line, so different cores ping-pong the line between their caches, causing more cache misses.

3. limitation for finite resources

more contention

  • cache capacity - memory bus/PCI bandwidth

4. not enough tasks but use more cores

more idleness

GFS

GFS -> HDFS

Requirements:

  • large scale (1k - 100k nodes)
  • components always failing
  • very large files
  • read dominated
  • writes append, many writers - (record append)
  • bandwidth > latency

scheduler activations

  • "classic" - e.g., ls: a short, small process
  • "one kernel thread per user thread" - e.g. the JVM: many user thread stacks, each corresponding to one kernel thread stack, but all these kernel stacks are in the same kernel address space (this is not efficient because the OS does not know which kernel thread is optimal to run next; that is actually determined by the corresponding user thread)
  • user-level threads (libraries) - e.g. pthreads
  • "M - N": M user-level threads multiplexed onto N kernel threads (on N cores); just put the M user threads into a ready queue, and kernel threads pick them up when they are available to run. (This still has the drawback of a kernel thread being blocked by a user thread the OS cannot see.)
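The M:N model can be sketched as kernel threads draining a ready queue of user threads (round-robin here just for determinism; all names are illustrative):

```python
from collections import deque

def run_mn(user_threads, n_kernel):
    """Assign M ready user threads to N kernel threads round-robin.
    A real system does this dynamically as kernel threads free up;
    this static version just shows the multiplexing."""
    ready = deque(user_threads)            # M user threads, ready to run
    schedule = {k: [] for k in range(n_kernel)}
    k = 0
    while ready:
        schedule[k].append(ready.popleft())  # next free kernel thread
        k = (k + 1) % n_kernel
    return schedule

sched = run_mn(["u0", "u1", "u2", "u3", "u4"], 2)
print(sched)  # 5 user threads shared across 2 kernel threads
```

Scheduler activations fix the blocking drawback by having the kernel notify the user scheduler (via an upcall) whenever one of these kernel threads blocks.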

User:

  • Lightweight => fast => just procedure calls
  • invisible to the OS, which only knows the kernel thread they run in => bad scheduling (e.g. one thread reading from disk blocks all the threads)

Kernel

  • system calls => more expensive to cross domains
  • obvious
  • physical parallelism => each processor has one kernel thread to run on

IX- a dataplane OS

RPCs are widely used in datacenters(DC)

  • multiple languages
  • efficient network

question: the hardware is really fast, so why is communication slow? - too MUCH overhead in the communication path.
(Linux puts TCP/IP inside the OS, so communication between apps must cross domains. => some designs move TCP/IP to the user/app level, achieving ZERO-COPY but losing protection.)

Goals

  • high throughput - batching !
  • low latency - batching, run to completion
  • protection - root ring 0
  • many connections
  • elastic resources

designs:
(1) Run-to-completion: improves data-cache locality; removes scheduling unpredictability. Just do not stop the queues on the NICs.
(2) Batching: as large as the cache size, to improve locality; enables prefetching.
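A sketch of run-to-completion batching, with an in-memory list standing in for the NIC receive queue and the batch bound standing in for IX's cache-sized limit:

```python
BATCH = 4  # bounded batch size (IX bounds batches to limit latency)

def run_to_completion(rx_queue, handler):
    """Pull a bounded batch of packets, then run each packet all the
    way through (protocol processing + app handler) before touching
    the next one -- no intermediate queueing or rescheduling."""
    done = []
    while rx_queue:
        batch, rx_queue = rx_queue[:BATCH], rx_queue[BATCH:]
        for pkt in batch:
            done.append(handler(pkt))  # packet runs to completion here
    return done

out = run_to_completion([1, 2, 3, 4, 5], lambda p: p * 2)
print(out)
```

Processing a whole batch back-to-back is what keeps the relevant code and data hot in cache; finishing each packet before the next is what removes scheduling unpredictability.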

protection:

  • libix/apps on ring 3
  • control path still on ring 0
  • tcp/ip and the IX dataplane on ring 0 (non-root). But they share the same address space as the apps (like an exokernel). Thus efficient data sharing and protection are both achieved.

Berkeley FFS & log-structured FS

cylinder groups are the tracks across cylinders. (Moving across the surface is faster than a long seek => the new layout used to organize the FS in FFS.)

Question: the difference between seek and move between tracks?

Problems & solutions:
see table here.
Notes
cylinder group => an inode and its data are in the same group.
superblocks are at fixed locations and point to the inodes; the inodes themselves are no longer at fixed locations.
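The placement idea in arithmetic form (the numbers are illustrative, not FFS's actual parameters):

```python
# FFS-style placement sketch: inodes are spread across cylinder groups
# instead of sitting in one fixed region at the front of the disk, so
# an inode and its data blocks can live in the same group and be read
# without a long seek between them.
INODES_PER_GROUP = 128  # illustrative; FFS computes this from geometry

def group_of_inode(inum):
    """Which cylinder group holds a given inode number."""
    return inum // INODES_PER_GROUP

print(group_of_inode(5))     # 0: an early inode, in the first group
print(group_of_inode(300))   # 2: lives in a different cylinder group
```

Because inode locations are now computed rather than fixed, the per-group bookkeeping (the replicated superblock area) is what records where everything lives.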
