MIT 6.824: Distributed Systems 2020


    1. GFS
    1. Fault Tolerance Raft (2)
    1. Zookeeper
    1. Cloud Replicated DB, Aurora
    1. Distributed Transactions
    1. Spanner
    1. Big Data: Spark

Lecture 1: Introduction


  • Infrastructure
    • Storage
    • Communication
    • Computation
  • Implementations
    • RPC
    • threads
    • concurrency

  • performance and scalability

    • double computers, double throughputs
    • Fault Tolerance
      • Availability
      • Recoverability
      • Non-volatile Storage
      • Replication
  • Consistency

    • example key value service

Google MapReduce

  • simple for a programmer to write a parallel distributed computation
    • programmer gives a map function and a reduce function
    • the MapReduce program distributes the workload to Google servers
      • the programmer does not need to worry about the distributed system itself!

  • word count is the classic example

  • no need to worry about fault-tolerance!
    • if a single worker fails, rerun it!
  • GFS - file system that is distributed

  • have to do column store
    • This is a shuffle, but expensive to do

Lecture 3: GFS

  • GFS published in 2003, MapReduce in 2004, Hadoop/HDFS in 2006

Big Storage

  • performance - sharding
  • fault - tolerance
  • tolerance - replication
  • replication - inconsistency
  • consistency - low performance
  • strong vs weak conssitency

  • last write corrupt


  • big, fast
  • global
  • sharding
  • automatic recovery
  • single data center (really :O)
  • internal use
  • big sequential accces (not random)
  • single master!
    • map reduce has as single master too, but failure is so unlikely its fine to rerun all operations

  • master data
    • file name
    • __ handles__
      • list of chunk server (cs)
      • primary version number (v)
      • lease expiration
      • append to log efficiently
  • READ
    • name of master
    • master ot list of servers
    • gets chunk server which sends data back

    • no primary - on master
    • find up to date replicas
    • pick p, s
    • increment version #
  • problem of split brain
    • network partition
    • give a primary a lease (has a timer)
      • primary know who has the lease and can wait for it to expire

  • these are secondaries
  • mostly appends
  • ask if they can do it
    • only write if they promise they can
    • what if primary crashes

Lecture 6: Fault Tolerance: Raft (1)

  • RAFT
    • state machine replication correct
  • MapReduce and GFS require single master to choose
    • simple, but single point of failure :/
  • Split Brain

  • two clients and two servers
    • what if a client crash?
    • what if a server crash?

  • split brain c1-s1, c2-s2, the data: 1 can be inconsistent
    • this is a network partition
      • solution: majority vote with odd number of servers
        • example 2 out of 3 majority
        • quorum
        • with 2f+1, you can have f failures, because f+1 agree to a majority
      • Paxos
      • VSR


  • K/V server
    • table
    • Raft layer
      • log of operations
    • replicated
  • Clients: C1, C2
    • put(k, v)
    • get(k)

  • for an operation
    • enters master server, enters RAFT layer
      • when have majority vote coordination
        • send up to the key/value state

  • AE: append entry
    • with 1 master and two other servers, only need one to reply

  • go interface
  • applyChannel, applyMsg with index
  • election timer -> start election
  • leader election
    • term++, request votes

  • choosing random times
  • newly elected leader, but has divergent logs
  • handling server crashes

  • S1 lost 3 on log

  • can become leader and crash


  • green client sends value to blue server
  • distributed consensus is agreeing on a value with multiple nodes
  • nodes can be in 3 states

  • follower state

  • candidate state

  • leader state
  • all nodes start in follower state

  • it sends a request vote to the other nodes
    • they reply with their vote

  • if it gets a majority it becomes leader
    • this is leader election
    • all changes now go through the leader

  • each change is written to a log (uncommitted)

  • to commit the change is replicated to followers

  • after the leader commits, it notifies the followers the entry is committed
    • they commit in the process, called log replication

Leader Election

  • first election timeout
    • amount of time follower waits until becoming candidate
    • randomized between 150ms and 300ms

  • and votes for itself

  • the nodes also reset election timeout

  • like a server crashes

  • no heartbeat sent. one node starts and election because its timeout was randomly faster

  • example of a split vote

  • when this occurs a revote happens and Node D is leader

Log Replication

  • done using append entries messages

  • a response is sent to the client

  • network partitions

  • the log entry stays uncommitted!

  • once the partition is healed, B steps down

Lecture 7: Fault Tolerance: Raft (2)

  • Why not longest log as leader?

  • example vot yes only if high log entry
  • AE Reply
  • Persistent
    • log
    • correct Term
    • voted For
  • synchronize disk update - persisting
    • write(fd, _)
    • fsync(fd)

  • storing a log
    • snapshot every once in a while for efficiency

  • linearizability
    • linear execution history
    • operations that finish first, and reads must come after start of writes
  • reads and writes in lineariable order
  • not linearizable - has a cycle

Lecture 8: Zookeeper

  • The history is linearizable

  • not linearizable - cycle in the graph

  • Zookeeper layer
    • ZAB layer
  • many clients
  • more servers -> slower
  • allow client to send any read only request to any replica

  • log

  • guarantees
    • linearizable writes
    • cheap read
    • want fresh data
      • can send a sync operation

  • write order

  • read order?
    • exist("ready")

  • if it ever exists, PLEASE send me a notification
    • at the correct point in the log to the client's reads!

Lecture 10: Cloud Replicated DB, Aurora

  • Amazon Web Services
    • Custom Backend! - Not AWS Services
    • EC2 - general server/vm
    • S3 - store snapshots

  • persist to EBS
    • can load a new EC2
  • Availability Zones

  • Transaction (XACTION)
    • crash recovery
    • to log

  • RDS
    • replicated to another AZ

  • replicas with quorum
    • just log entries
    • only need 4 out of 6!
  • doing this instead of having to replicate the page table (like RDS) 35x performance!

  • goals of survival of failure of datacenter (AZ)
    • transient slow replicas (?)

  • quorum replication
    • n replicas
    • read and write quorum must overlap
    • W and R

  • databases vm/servers can crash
    • spin up new server log says redo these in the transactions
    • many many pages

  • copy of A, B, etc on different servers
    • replacement, pick 100 segment services
    • pick 1 replica from each
  • reads more common than writes
  • atomicly

Lecture 12: Distributed Transactions

  • Distributed Transactions
    • Concurrency Control
    • Atomic Commit
    • Above done in an abstraction call a transaction

  • END_X
  • example of a read only transaction

  • ACID
    • Atomic - all or nothing vs failure
    • Consistent
    • Isolated - Serializable
    • Durable

  • serial order of execution of transactions

  • Concurrency Control
    • pessimistic - locking
    • optimistic - occ

  • concurrency control with 2 phase locking

  • distributed transactions
    • want atomic all or nothing - but what if failure
    • two-phase commit!

  • transaction coordinator
    • sends prepare messages to everybody
    • yes/no
    • commit
    • ack
    • lock/unlock - but can crash
      • use a log
      • that is why we have yes/no and ack response

  • individually replicated - raft/group
  • sharding?

Lecture 13: Spanner

  • globally replicated database

  • each data center

    • copies for redundancy and to be near clients
    • sharded - throughput
    • has a leader/follower
  • R/W transactions

    • BEGIN
      • x = x + 1
      • y = y + 1
    • END

  • 2 Phase Commit and 2 Phase Locking in Paxos Groups

  • do a read transaction
    • to master of paxos group - replicate to others
    • 2PC with other master of paxos group

  • commits and leaders of paxos group can replicate to copies

  • simple wishes - wanted constraints
  • serializable
  • external consistency

  • can't just read latest value rip

  • Snapshot Isolation
    • imagine all time is synchronized

  • can now do timestamp

  • need clock synchronization
  • gps master
  • Google has TrueTime
    • returns (earliest, latest)

  • get latest
  • start
  • commit
  • delay time stamp such that the earliest it could possibly be is greater that the latest another transaction could be

  • the latest for the T1 timestamp is 10
    • so read after 12

Lecture 15: Big Data: Spark

  • PageRank in Spark
    • scala
    • graph
    • simulation in parallel iteratively

  • run example

  • page1 has highest rank

val lines ="in").rdd

// MapPartitionRDD[5]
// not actually executed! lazy lineage graph!

// get an array rep

val links1 ={ s => val parts = s.split("\\s+"); (parts(0), parts(1)) }

val links2 = links1.distinct()

val links3 = links2.groupByKey()

// persist to disk ?
val links4 = links3.cache()

var ranks = links4.mapValues(v => 1.0)

val jj = links4.join(ranks)

val contribs = jj.values.flatMap{ case (usls, rank) => => (url, rank / urls.size)) }

ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)

val output = rank.collect()

output.foreach(tup => println(s"${tup._1} has rank: ${tup._2} ."))

  • the lineage graph

  • store in HDFS (like GFS)
  • parallel RDD
    • read
    • map
  • shuffle
    • expensive
    • distinct
    • wide dependencies

  • complete

  • failed worker
    • wide dependencies problem!

  • store on HDFS - fault tolerant storage
    • checkpoint
    • if a worker fail
    • have the lineage graph so we can reconstruct

Lecture 16: Cache Consistency: Memcached at Facebook

  • Facebook's goal is to making features and selling ads
    • Only scaling if they need to


  • Simple, single server


  • Multiple Front End Servers
  • Still a Single backend database
    • Don't have to worry about things like distributed transactions


  • Multiple Front End Servers, Multiple Databases Shared
    • a-g for database 1, g-q for database 2, q-z for database 3


  • MEMCACHE Servers
    • get(key, value)
    • put(value)
    • a cache for data that is hit often "hot" data
  • but if a MEMCACHE fails, the database will fail
    • the database usually only sees a few percent that the caching layers sees!!!


  • Stale data doesn't really matter
    • just some pictures from friends from last hour, etc.
  • Two datacenters
    • Try to avoid cross nation communication

Read and Write Lookaside Caching

  • delete for overwriting, because the data in the cache is stale
    • performance by:
      • partition
        • RAM efficient
        • not efficient for hot keys
      • replication
        • good for hot keys
        • few TCP connections

  • replication between datacenters

In One Region

  • sharded database
  • multiple clusters
    • FE to MC connections is \( O(N^2) \) so have to limit it and make it in clusters
    • Regional Pool Cache

Thundering Herd

  • a problem where a miss causes other front end servers to overload the database

  • instead use a lease
    • a number
    • invalidate the lease
    • when trying to write note lease is invalid, write is invalid

  • when deleting, use gutter memcache


  • we get into a data race
    • problem - data is out of date forever!
    • use lease again
      • grant a lease (a number)


Extra 1: Scalability Harvard CS 75

  • VPS Service
    • for hosting
    • Amazon Web Services :)

Vertical Scaling

  • Just throw a bigger computers
  • more RAM, more cores, more storage space
  • SAS and SSD drives for write heavy operations like databases

Horizontal Scaling

  • a bunch of cheaper slower machines

Load Balancer

  • load balancer has its own IP address
    • from client to load balancer to server 1
    • could do a copy of everything on each server
    • could do instead and have a server for each part

  • could load balance with our DNS server
  • Round Robin: to server 1, to server 2, to server 3, then back to server 1
  • What if server 1 gets heavy weight users?
    • Let Load Balancer decide instead of round robin

  • Can use AWS Load Balancer,
    • Application Load Balancer (ALB), Network Load Balancer (NLB)

Sticky Sessions

  • session cookies


  • RAID 0: 2 hard drives, stripe to drive 1 then drive 2
  • RAID 1: 2 hard drives, mirror data, if one dies, you still have a copy of your data
  • RAID 5: 5 drives, stripe to 4, and have 1 for redundancy
  • RAID 6: 2 hard drives can die
  • NFS file system, distributed file system
  • What if you trip over a power cord?


  • static page instead of dynamic page from PHP, HTML file is static
  • problem: changing the style would have to change hundreds of thousands of HTML files
  • MySQL cache: for identically executed queries

  • memcached: memory cache, stores in RAM
    • use extensively by Facebook

Data Replication

  • avoid single points of failure

  • Master-Master

  • High avaliability with Heartbeat Master-Master

  • Load Balancing + Replication
  • heartbeat
    • Active/Passive
    • if server dies, heartbeat ends, Passive becomes Active

  • Partition and Server slaves for replications

Layers of Replication and Redundancy

  • Multiple Load Balancers
  • Multiple Backend Servers
  • Multiple Load Balancers to Databases
  • Multiple Databases
  • Cross Connected
  • Multiple Network Swtiches
  • In a single Datacenter
    • AWS Avalibility Zone
      • Then different regions, East, West, Asia, Europe
  • To do Geography based Load Balancing by doing a Load Balancer with DNS
  • Avoid Single Points of Failure!


  • want TCP, SSL to Load Balancer
    • Now everything is in HTTP instead


Extra 2: Chord Algorithm Berkeley CS 162

  • Consistent Hashing
    • ring space of \(0\) to \( 2^m - 1 \)

  • take an id like ip, hash it and put it on the ring
  • in our key value store, where to store keys?
  • node 20 knows key-values for [16, 20] since the previous node is at 15

  • correctness
    • because it is a ring!
    • always a node that hold they keys
  • performance
    • \(O(log(n))\)

  • routing
    • start at any node which will contain where the next node is
    • worst case is \(O(n)\) - not great

  • correctness with stabilization algorithm
    • updates predecestor and successor

  • a new node gets one node from DNS
    • it calls another node, which calls another node, etc.
      • the final node connects to the new node and updates its successor as the new node

  • finger table
    • a table that knows probabilisticly approximately halfway, quarter, eighth, etc.
    • now instead of having to hope each node
      • you can jump at most half way, then another quarter, etc.
      • \(\text{key}+2^i ~ \text{mod} ~ \text{ringsize}\)

  • fault tolerance if the node fails, successor and precessor has a copy!

  • Dynamo used Chord's Consistent Hashing but using a SLA agreement

Extra 3: Dynamo Amazon’s Highly Available Key-value Store

  • problem: reads scale with read replicates but writes too much overhead
    • another problem was it doesn't have auto-scaling
    • wanted to do data modeling of shopping cart
      • as a key-value operation

  • app at top
    • key value operations
    • shared (A-F, G-K, etc)

  • problem: wanted partitioning
    • solution/technique: consistent hashing
  • prob: high availability for writes
    • sol: vector clocks with conflict resolution on read
  • prob: handling failures (temporary)
    • sol: hinted handoff
  • prob: recovering from failures (permanent)
    • sol: anti-entropy with merkle-trees
  • prob: membership and failure dectection
    • sol: gossip-bsaed protocol

  • partitioning with consistent hashing
    • instead of just choosing a server from N server with hash(k) mod N
      • have a ring [0, L]
      • hash(k) mod L
      • distribute N servers randomly among the L ring and handle all key calls from ring section of previous node to current node

  • need a quorum to consider a read or write successful

  • vector clocks
    • conflict if servers think have different items
    • consistenty by sending messages as vectors
      • there is a property to timely order operations and get servers to be consistent

  • handling failures
    • temporary: hinted handoff, gives it to the next node in the consistent hash
    • permanent: sync using merkle tree
      • periodicly exchange key ranges
      • just pass root of tree instead of whole ring
        • if inconsistent, pass the children and repeat until find when inconsistent
        • pass only a logorithmic size of data and only the root in best case

  • membership and failure decection
    • could send heart beats
      • could have single node to handle listening like zookeeper
        • single point of failure
      • instead just gossip it and in logorithmic time, realize can identify node is dead

Extra 4: CAP Theorem

  • A distributed system coan have at msot two out of the three properties
    • Consistency
    • Availability
    • Partition Tolerance
  • BigTable, MongoDB, and Redis are CP
    • Spanner is CP but highly available
  • Dynamo, Cassandra are AP
  • Amazon Aurora? Physalia in AWS EBS (Looks to be CP but highly available)?

  • servers \(G_1\), \(G_2\) and client with initial value \(v_0\)

  • write to any server. Only \(G_1\) is updated in this case


  • any read operation that begins after a write operation completes must return that value, or the result of a later writer operation

  • we write to \(G_1\) but read from \(G_2\) and get a different result

  • we write to \(G_1\) which replicates to \(G_2\) and sends an acknowledgement which \(G_1\) sends back to ghe client
    • now the client can read from \(G_2\) and get a consistent result


  • every request received by a non-failing node in the system must result in a response
    • server cannot ignore client's requests

Partition Tolerance

  • the network will be allowed to lose arbitarily many messages sent from one node to another


  • proof by contradiction. Assume a system can be CAP

  • begin by partitioning our system.

  • with a partition, the data is inconsistent

Extra 5: Serverless, Coordination-free Distributed Computing, and the CALM Theorem

  • 4 parts
    • Serverless
    • Avoiding Coordination
    • CALM Theorem
    • Hydro

  • new platform, new language!
    • super computers with new programming paradigms!

  • how will people program the cloud!?
    • building a program modeling is hard
    • distributed systems, consistent, and partial failure

  • more popular than Map Reduce


  • fine grain resource usage and efficiency
    • new economy models for cloud providers and users

  • auto scaling!
  • not unbounded distributed computing

  • pay per IO
    • can't do the "batch" to disks
  • No inbound network communcation
    • makes distributed comp difficult
      • embarrassingly parallel like Map work
      • Reduce doesn't :(
        • communication heavy, like shuffles in Spark :'(

Avoiding Coordination

  • How do you embrace state
    • data gravity
    • consistency - hard :(

  • consistency over long distances is hard
    • split brain problem

  • coordination based consistency is bad!

  • make consistency as small as possible

  • coordination has really bad tail latency
    • slowdown cascades

  • instead reason about application semantics
    • rich application logic to READS and WRITES
    • formalize semantics!

CALM Theorem

  • Consistency as Logical Monotonicity
    • if they are logically monotonic, it's consistent!

  • programming confluence
  • only care about outcomes

  • distribtued deadlock detection
    • checks cycles
    • there exists
  • garbage collection
    • reference between objects on different machines
    • 05 to 06 are garbage
      • but machine 2 can't say its garbage until machine 3
      • it requires coordination

  • you can get crazy parallelism!
    • share nothing!
    • how to write in logic language instead of declarative language?
      • like SQL
      • maybe it'll be internal language IR for compiler
        • like databases too!
    • have our cake and eat it too!

Hydro: Stateful Serverless and Beyond

  • Anna autoscaling multi-tier KVS
  • Cloudburst Stateful FaaS with caches
  • HydroLogic, an IR (doesn't depend on order)
  • Hydrolysis, compiler

  • Anna be like Redis and S3
  • CALM consistency of simple lattices
    • autoscaling
    • best-of-conference
    • multi-tiered!
      • can be in fast memory
      • or slow persistent disk

  • shared nothing at all scales and threads!!!
  • under contention, cache thrashing problems

  • auto scaling
    • cost 350x performance

  • robot motion planning

  • serverless jupyter
  • each cell is running a cloudburst lambda! :O

  • can handle all the memroy pursue

  • sharing model state!

  • motion planning with lambdas

  • run compute, then share state

  • need a coordination EC2, bottleneck

  • much quicker cost

Extra 6: Stellar

  • We take for granted currency that's stable
    • good investments

  • want equitable access to financial systems
  • open membership
  • issuer-enforced finality
    • security
  • cross-issuer atomicity
    • good market to trade

  • ACH
    • requires national regulations

  • can't trade across systems paypal -> venmo?


  • Replicated State Machine
    • keep ledger safe
    • public key authorize operations
    • accounts can issue assets
  • Transaction guarantee atomicity

  • only works if everyone agrees on ledge double-spend attack

  • think of the internet
    • transitive connections
    • China -> Stanford
    • China -/-> Google
    • Google -> Stanford

  • Stellar Consensus Protocol SCP
  • Byztentime Hypothesis
    • quorum slice - majority

  • quorum slice

  • is a quorum slice but not quorum

  • quorum tier
    • like the internet

  • citi bank sends an attack

  • ACLU doesn't agree because it has no quorum from other non-profits

  • both nodes depend on V7

  • liveness?
    • sufficiency problem
      • must be of an intact set (non-evil)

  • cascade theorem
  • federated voting
    • can get stuck
  • balloting
  • nomination

  • quorum intersecting
  • cascade until it includes all the nodes

  • crappy consensus

  • intertwined nodes cannot confirm contradictory statement
    • optimal safety

  • differ vote

  • green is in a quorum

  • will eventually cascade

  • balloting
    • federated voting can get stuck
    • invariant: all stuck and decided ballots must chose same value

  • prepare(n, x)
  • commit(n, x)
  • synchronize with timer
    • cascades

  • 1,000 operations/ledger
  • 133 nodes, 74 validators, 17 "tier-one" by 5 organizations
  • hash signing

Extra 7: CockroachDB, Spanner, MongoDB

  • timestamp 18:44
  • Amazon
    • fork of MySQL or PostgreSQL
    • custom storage on EBS
      • transaction level on EBS
  • Redis
    • In memory
    • single thread engine


  • Database Don't die
  • Open Source like Spanner
  • Distributed Database
    • Decentralized Shared Nothing
    • Log Structured Storage Architecture at individual nodes (RocksDB)
    • Concurrency Control Model: Multi-Version OCC
    • Serializable Isolation

  • multi-layer
    • RocksDB storage manager (low level)
    • Raft - replication and consensus
    • key-value api
      • for pages to fetch data on

  • hybrid clock
    • order transaction globally
  • transaction stage intent check conflict commits (?)

  • global keyspace
    • key -> data
  • leader
    • raft replication
  • instead of buffer manager -> disk (your classic single node database)
    • get from distributed key value store system
  • FoundationDB does the same thing (distributed database)
  • engineering! done well, fault tolerant and all the edge cases!


  • Cloud Spanner
  • Google wrote BigTable in 2006
    • give up SQL
    • give up joins
    • column-based database
  • Adwords ran on sharded MySQL
    • needed transactions
    • Megastore

  • 2011
  • geo-replicated
  • schematized, semi-relational ?
  • log structured on disk
  • strict 2PC MVCC Multi-Paxos 2PC
    • Paxos groups
    • External Consistency
      • global write_transaction synchronous replication
    • lock-free read only transactions

  • joins are slow :'(
    • have to go to another node/table
  • interleave
    • single page efficient physical denormalization

  • wound-wait deadlock prevention
    • don't need deadlock detection
    • ordering throuhg unique timestamps from atomic clocks and GPS devices
  • tablets (shards)
    • paxos - elect leader in tablet group
    • 2PC for txn spanning tablets

  • completely physical wall-clock time
    • necessary for linearizability
    • paxos group decide order transaction commit

  • internal Google API
    • time range

  • wait long enough
    • then can commit
      • at commit + release locks

  • F1 has OCC
    • now SQL
    • Built for the ad system
  • Spanner SQL
    • for everyone else

  • good benchmarks but expensive!


  • shared nothing document database (2007)
    • json document
    • now can do multi-action transactions

  • don't join

  • instead embed single document
  • pre join

  • the json example

  • no query optimizer
    • just heuristics
  • instead generate a bunch of query plans
    • execute all of them!
      • whichever is fastest return

  • shared nothing architecture
  • master slave replication
  • auto-sharding
    • partition by hash or range
    • automaticly split if a shard is too big
  • startups liked this, oh if I grow, it'll shard automaticly

  • mmap - OS storage manage
    • changed
    • storage backend is WiredTiger but could replace with RocksDB

  • avoid premature optimizations!
    • based on months in the future
  • MySQL and Postgres should be fine for most cases
  • If need to ACTUALLY scale, you have money - instead find and pay smart people to help

Extra 8: Scalability! But at What COST?

  • GraphX paper
    • lots of cores 128

  • repeated look at each label in the graph
    • flows through

  • ran a single core in Rust and it was fast!

  • with better algorithms! - Even faster

  • scalability fetishised by community

  • at 16 cores it is the same performance at a single core

  • 1, 2 cores
    • very very fast! micro-seconds

  • still high performance on a single laptop!

  • When you are focued - high performance compiled languages, caches, not JVM

  • step 1. get 100x on your computer
  • step 2. then 100x from 100x computers
    • but that's expensive!

Extra 9: Ray: A Distributed Execution Framework for AI

  • Task Parallel and Actors Abstraction
    • compatible with TensorFlow, PyTorch etc!

  • million of tasks per seconds
  • stateful compution
  • fault tolerance


  • take python functions
    • add @ray.remote
    • returns a future
    • ray.get(id3)

  • task scheduled
    • think about tasks and futures

  • parameter server basically key-value
import ray


  • now a ray.remote
    • get with ray.get(ps.get_params.remote())

  • fake update

Ray Architecture

  • workers, object manager and scheduler
  • global control store


  • Apache Arrow
    • serializing, like Python C++

  • pip install ray

  • linear based fault tolerance

    • rerun
    • can rerun


Extra 10: Cluster Management with Borg and Kubernetes

  • Borg

  • Google Data Centers

  • Even Bigger

  • program to run
  • cell/custesr
  • cmdline args

  • 10,000 copies of hello world

  • Borg
  • BorgMaster
  • Borglet
  • all managed by Config file
  • And Blaze?

  • things can go wrong

  • a cell
    • usage - cpu/vm

  • lots of waste
  • overhead

  • bigger cells are better, less partitioning overhead

  • bucket to next order of 2

  • some people run very little resource
  • some people have nice round numbers
  • used best fit scheduling algorithm
    • but it could be gamed

  • people overallocate

  • be more aggressive

  • yellow safety margin
    • smaller on week 2
    • then set it back to original settings

  • configs
  • monitor
  • binaries
  • you only need to focus on the app


  • open source Borg

  • it schedules and allocates
    • you don't need to car about the machine

  • replication

  • Google Kubernetes Service

  • resiliency
  • efficiency
  • containers

Extra 11: Introduction to Apache Kafka

  • needs to be ledger over just a stream

  • horizontal scaling

  • Kafka - Event Ledge
    • Distributed Commit Log

  • Records
    • tuple (key, value, timestamp)
    • Immutable, Append Only, Persisted

  • Producers and Consumers

  • Topics can contain multiple partitions

  • Offsets

  • Delivery Guarantees

  • Akka Streams

  • aggregating from Kafka

Extra 12: Introduction to Apache Cassandra

  • Cassandra was a blend of distributed features of Dynamo and Data Model of BigTable

  • Shared nothing, add, remove as needed

  • Data written to commit log

  • Data ritten to memtable

  • Server acknowledges to client

  • Memtable flushed to disk
    • to SSTable
    • In Sequential Write
    • Really good for time-series

  • How to Deal with duplicates?

  • Compaction of Sequential Append only SSTables
    • With Merge Sort

  • Compacted file written

  • Clean up old files

  • Partitioning wiht primary key

  • With Key hashing

  • ring

  • Replication factor = 3?

  • Can do virtual nodes

  • Coordinated reads

  • Consistency Level
    • Quorum is when more than 51% replicas acknowledge data read
  • ALL - All replicas ack - full consistency - usually not necessary

  • Quorum and Avalability

  • Rapid Read Protection

  • CQL Cassandra Query Language

  • Inserts will always overwrite

Extra 13: Anna KVS and Cloudburst

Acolyer The Morning Paper



Extra 14: Physalia Millions of Tiny Databases

Extra 15: Facebook

Tao: Facebook’s Distributed Data Store for the Social Graph. required
N. Bronson, Z. Amsden, G. Cabrera, P. Chakka, P. Dimov, H. Ding, J. Ferris, A. Giardullo,
S. Kulkarni, H. Li, M. Marchukov, D. Petrov, L. Puzar, Y. J. Song, and V. Venkataramani.

Scaling Memcache at Facebook. supplemental
R. Nishtala, H. Fugal, S. Grimm, M. Kwiatkowski, H. Lee, H. C. Li, R. McElroy,
M. Paleczny, D. Peek, P. Saab, D. Stafford, T. Tung, and V. Venkataramani.
NSDI, 2013.

Existential Consistency: Measuring and Understanding Consistency at Facebook. supplemental
Haonan Lu, Kaushik Veeraraghavan, Philippe Ajoux, Jim Hunt,
Yee Jiun Song, Wendy Tobagus, Sanjeev Kumar, and Wyatt Lloyd.
SOSP, 2015.

Finding a Needle in Haystack: Facebook’s Photo Storage. reference
D. Beaver, S. Kumar, H. C. Li, J. Sobel, and P. Vajgel.
OSDI, 2010.

An Analysis of Facebook Photo Caching. reference
Q. Huang, K. Birman, R. van Renesse, W. Lloyd, S. Kumar, and H. C. Li.
SOSP, 2013.

f4: Facebook’s Warm BLOB Storage System. reference
Subramanian Muralidhar, Wyatt Lloyd, Sabyasachi Roy, Cory Hill, Ernest Lin, Weiwen Liu,
Satadru Pan, Shiva Shankar, Viswanath Sivakumar, Linpeng Tang, and Sanjeev Kumar.
OSDI, 2014.

RIPQ: Advanced Photo Caching on Flash for Facebook. reference
Linpeng Tang, Qi Huang, Wyatt Lloyd, Sanjeev Kumar, and Kai Li.
FAST, 2015.

Analysis of HDFS Under HBase: A Facebook Messages Case Study. reference
T. Harter, D. Borthakur, S. Dong, A. S. Aiyer, L. Tang, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau.
FAST, 2014.

Wormhole: Reliable pub-sub to support geo-replicated internet service. reference
Yogeshwer Sharma, Philippe Ajoux, Petchean Ang, David Callies, Abhishek Choudhary, Laurent Demailly, et al.

Extra 16: RocksDB




ByteDance TerarkDB

Netflix EVCache


Extra 17: Snowflake and FoundationDB



Extra 18: Chaos Engineering


Extra 19: Networking

Extra 20: Materialize and Timely/Differential Dataflow

Extra 21: DynamoDB

Amazon DynamoDB powers multiple high-traffic Amazon properties and systems including Alexa, the sites, and all Amazon fulfillment centers. Over the course of the 66-hour Prime Day, these sources made 16.4 trillion calls to the DynamoDB API, peaking at 80.1 million requests per second.

On the block storage side, Amazon Elastic Block Store (EBS) added 241 petabytes of storage in preparation for Prime Day; the resulting fleet handled 6.2 trillion requests per day and transferred 563 petabytes per day.

Extra 22: Memcache and Redis

Extra 23: YouTube Vietess

Extra 24: Delta Lake and Bolt-On Consistency


Miscellaneous 1: Exponential Backoff And Jitter

Miscellaneous 2: Post Mortems

Miscellaneous 3: Cloud Computing Course from Stanford

Miscellaneous 4: Machine Learning Systems Design

Miscellaneous 5: Learned Indexes