Goal: connect physically separated machines

--> sharing
--> increasing capacity through parallelism
--> tolerating faults
--> achieving security via isolation

Historical Context

  • Local area networks (1980s)
  • Data centers, big web sites (1990s)
    • e.g. web search, shopping
    • driving force: lots of data, lots of users
  Next era:
  • Cloud computing (2000s)
  • move computation & data into cloud servers
  • driving force: start websites easily and cheaply, machine-learning apps, high-performance computing
  Current state:
  • real problems remain
  • booming field


Challenges:

  • many concurrent parts
  • must deal with partial failure -> complexity
  • tricky to realize the performance benefits


Lab 1: distributed big-data framework (like MapReduce)
Lab 2: fault tolerance library using replication (Raft)
Lab 3: a simple fault-tolerant database
Lab 4: scalable database performance via sharding

Course focus and main topics

This is a course about infrastructure for applications.

  • Storage.
  • Communication.
  • Computation.

Main topics:

Fault tolerance

  • make systems highly available, despite failures
    • replication
  • recoverability: recover the previous state after a crash
    • logging/transactions
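The recoverability idea above can be sketched in a few lines. This is a toy illustration of my own (the class `LoggedKV` is not from the notes, and a real system would put the log on durable storage): every Put is appended to a log before being applied, so a restarted server can rebuild its previous state by replaying the log.

```python
class LoggedKV:
    """Toy key/value store that recovers its state by replaying a log."""

    def __init__(self, log=None):
        self.log = log if log is not None else []  # durable in a real system
        self.state = {}
        for k, v in self.log:                      # recovery: replay the log
            self.state[k] = v

    def put(self, k, v):
        self.log.append((k, v))  # write-ahead: append to the log first...
        self.state[k] = v        # ...then apply to in-memory state

    def get(self, k):
        return self.state.get(k)


# Simulate a crash: only the log survives, and a fresh server
# reconstructs the previous state from it.
server = LoggedKV()
server.put("x", 1)
server.put("x", 2)
recovered = LoggedKV(log=server.log)
print(recovered.get("x"))  # 2
```

Logging the operation before applying it is what makes the recovery replay safe: anything visible in the state is guaranteed to also be in the log.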


Consistency

General-purpose infrastructure needs well-defined behavior.
E.g. “Get(k) yields the value from the most recent Put(k,v).”
Achieving good behavior is hard!
“Replica” servers are hard to keep identical.
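A tiny demonstration (my own toy example, not from the notes) of why replicas are hard to keep identical: if two replicas apply the same concurrent Put(k,v) operations in different orders, their states diverge, and Get(k) no longer yields one well-defined “most recent” value.

```python
def apply_ops(ops):
    """Apply a sequence of Put(k, v) operations to an empty store."""
    state = {}
    for k, v in ops:
        state[k] = v
    return state


ops = [("k", "v1"), ("k", "v2")]      # two concurrent writes to the same key
replica_a = apply_ops(ops)            # replica A sees v1 then v2
replica_b = apply_ops(reversed(ops))  # network reorders: B sees v2 then v1

print(replica_a["k"], replica_b["k"])  # v2 v1 -- the replicas disagree
```

Protocols like Raft (Lab 2) exist precisely to force all replicas to apply operations in the same order.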


Scalable throughput


Replication improves availability, but costs performance.
Another performance goal: reduce latency.
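One standard route to scalable throughput is sharding, as in Lab 4. Below is a minimal sketch (names like `shard_for` are my own, not from the notes): keys are partitioned across servers by a stable hash, so N servers each handle roughly 1/N of the load.

```python
import hashlib

NSERVERS = 3
servers = [dict() for _ in range(NSERVERS)]  # one toy k/v store per server


def shard_for(key, nservers=NSERVERS):
    # Stable hash, so every client maps a given key to the same server.
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return h % nservers


def put(k, v):
    servers[shard_for(k)][k] = v


def get(k):
    return servers[shard_for(k)].get(k)


for i in range(100):
    put(f"key{i}", i)

print(get("key42"))               # 42
print([len(s) for s in servers])  # the 100 keys are spread over the servers
```

With independent keys, the servers can serve requests in parallel, which is where the throughput scaling comes from; hot keys and cross-shard operations are what make it tricky in practice.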


This material comes up a lot in the real world.
All big web sites and cloud providers are experts at distributed systems.
Many big open source projects are built around these ideas.
We’ll read multiple papers from industry.
And industry has adopted many ideas from academia.

Case Study: MapReduce
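Before diving in, here is a toy sequential sketch of the MapReduce programming model (the function names reflect the standard map/reduce roles, not any particular framework's API, and a real system would distribute the phases across machines): map emits (key, value) pairs, the framework groups pairs by key, and reduce combines each group.

```python
from collections import defaultdict


def map_fn(document):
    """Word count: emit (word, 1) for every word in the document."""
    return [(word, 1) for word in document.split()]


def reduce_fn(key, values):
    """Combine all the counts emitted for one word."""
    return sum(values)


def mapreduce(documents):
    groups = defaultdict(list)
    for doc in documents:          # map phase
        for k, v in map_fn(doc):
            groups[k].append(v)    # shuffle: group intermediate pairs by key
    # reduce phase: one reduce call per distinct key
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}


counts = mapreduce(["a b a", "b c"])
print(counts)  # {'a': 2, 'b': 2, 'c': 1}
```

The point of the model is that map and reduce are pure functions over their inputs, so the framework is free to run them in parallel and to re-run them after failures.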

Chivier Humber
Posted on March 8, 2023