Is this a video course?

No. This is an interactive, slide-based learning platform. Each lesson has rich text, animated diagrams, live code editors, and quizzes. You learn by reading, interacting, and doing, not by watching videos passively.

How long do I have access?

Forever. Both pricing tiers are one-time payments with lifetime access. This includes all 766 current lessons and any future content we add.

What level of experience do I need?

None. We start from absolute basics like 'What is latency?' and build up to distributed consensus protocols. The Foundation level assumes zero prior knowledge of system design.

How much does the system design course cost?

5 US dollars for lifetime access globally, or 299 Indian rupees for lifetime access in India. One-time payment, no subscription, no hidden fees. 11 lessons are free with no signup required.

What technologies are covered?

Everything from DNS and load balancers to Kubernetes, Kafka, distributed databases, consensus protocols, stream processing, security architecture, and observability. We cover principles and real-world implementations used at Netflix, Google, Amazon, Uber, Stripe, and more.

Is this useful for system design interview preparation?

Yes. The lessons are structured around the exact topics asked in system design interviews at FAANG and top-tier companies. Interactive diagrams help you practice whiteboard-style explanations. The lessons cover everything from URL shortener design to distributed payment systems.

How is this different from ByteByteGo or Educative?

This course offers 766 interactive lessons (4x more than most competitors), 16 different diagram types that build step by step, and real production examples from Netflix, Google, Amazon, Uber, and Stripe. You get lifetime access for a one-time payment of 5 dollars instead of an annual subscription costing 100 to 200 dollars per year.

What is the difference between consensus and a quorum?

A quorum is a counting rule: require enough nodes (usually a majority) to acknowledge an operation so any two operations share at least one node. Consensus is the full agreement protocol, like Raft or Paxos, that lets nodes agree on an ordered sequence of values even while some crash or lag. Consensus algorithms use quorums internally, but a quorum by itself does not handle leader changes, log ordering, or recovery. Think of quorum as one rule inside the larger consensus machine.
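To make the counting rule concrete, here is a minimal Python sketch. The helper names `majority` and `quorums_overlap` are made up for this illustration and do not come from any library. It only checks the overlap property; it does none of the leader-change, log-ordering, or recovery work that a consensus protocol adds on top.

```python
from itertools import combinations

def majority(n: int) -> int:
    """Smallest number of nodes that forms a majority of n."""
    return n // 2 + 1

def quorums_overlap(nodes: list) -> bool:
    """Check that every pair of majority-sized groups shares at least one node."""
    size = majority(len(nodes))
    groups = [set(g) for g in combinations(nodes, size)]
    return all(a & b for a in groups for b in groups)

nodes = ["a", "b", "c", "d", "e"]
print(majority(len(nodes)))    # 3
print(quorums_overlap(nodes))  # True
```

Because any two groups of 3 out of 5 nodes must share a member, a later operation always reaches at least one node that saw the earlier one.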

What does the CAP theorem actually force me to choose?

CAP says that during a network partition, when nodes cannot reach each other, you must pick either consistency or availability. You cannot have both at that moment. A CP system rejects requests it cannot safely serve, so clients see errors but never stale data. An AP system keeps answering and accepts that replicas may temporarily disagree, then reconciles later through mechanisms like read repair and anti-entropy. When the network is healthy you get both; CAP only bites during the partition.
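The difference between the two choices can be sketched in a few lines of Python. The `Replica` class, its `mode` labels, and the `partitioned` flag are assumptions invented for this example, not the API of any real database.

```python
class Replica:
    def __init__(self, mode: str, value: str):
        self.mode = mode          # "CP" or "AP" (labels assumed for this sketch)
        self.value = value
        self.partitioned = False  # True when this replica cannot reach its peers

    def read(self) -> str:
        if self.partitioned and self.mode == "CP":
            # Consistency first: refuse rather than risk returning stale data.
            raise RuntimeError("unavailable: cannot confirm latest value")
        # AP mode, or a healthy network: answer from the local copy.
        return self.value

cp = Replica("CP", "v1")
ap = Replica("AP", "v1")
cp.partitioned = ap.partitioned = True

print(ap.read())  # v1 (possibly stale)
try:
    cp.read()
except RuntimeError as err:
    print(err)    # unavailable: cannot confirm latest value
```

Once `partitioned` is set back to `False`, both replicas answer normally, which matches the point that CAP only bites during the partition.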

When should I use the saga pattern instead of two-phase commit?

Use two-phase commit when the operation spans resources that support a shared transaction coordinator and you can tolerate blocking if the coordinator fails, which is more common inside a single database cluster. Use the saga pattern in microservices where each service owns its own database and a global lock is not practical. A saga breaks the work into local transactions, each with a compensating action to undo it, trading strict atomicity for availability and looser coupling. Most modern service architectures pick sagas for exactly this reason.
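A saga can be sketched as a list of local steps, each paired with its compensating action. The `run_saga` function and the checkout step names below are hypothetical and chosen only for illustration; a production saga would also persist its progress and retry compensations, which this sketch leaves out.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order; undo completed steps on failure."""
    done = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            # Undo, in reverse order, the steps that already committed locally.
            for undo in reversed(done):
                undo()
            return False
        done.append(compensate)
    return True

log = []

def fail_shipping():
    raise RuntimeError("shipping service down")

steps = [
    (lambda: log.append("reserve inventory"), lambda: log.append("release inventory")),
    (lambda: log.append("charge card"), lambda: log.append("refund card")),
    (fail_shipping, lambda: log.append("cancel shipment")),
]

ok = run_saga(steps)
print(ok)   # False
print(log)  # ['reserve inventory', 'charge card', 'refund card', 'release inventory']
```

Note that the first two steps really did commit before being undone, so other services could briefly observe them. That is the strict atomicity a saga gives up.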

Why can't I just use timestamps to order events across servers?

Server clocks drift and are never perfectly in sync, so a wall-clock timestamp from one machine cannot reliably tell you whether its event happened before another machine's event. Lamport timestamps fix this by assigning logical counters that respect causality, and vector clocks go further by revealing when two events are truly concurrent rather than one causing the other. That concurrency information is what lets a system detect conflicting writes and decide how to resolve them.
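Here is a minimal sketch of a Lamport clock in Python. The class and method names are our own choices for this illustration. The rule it follows is the standard one: increment on every local event or send, and on receive jump to one past the larger of the local and received values.

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        """Local event or send: advance the counter."""
        self.time += 1
        return self.time

    def receive(self, sent_time: int) -> int:
        """On receive: move past both the local time and the sender's timestamp."""
        self.time = max(self.time, sent_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
a.tick()                   # a does some local work, a.time is now 1
stamp = a.tick()           # a sends a message stamped 2
b_time = b.receive(stamp)  # b had time 0, so it jumps to 3
print(stamp, b_time)       # 2 3
```

The receive always gets a larger timestamp than the send, no matter how far apart the two machines' wall clocks are.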

What is split-brain and how do systems prevent it?

Split-brain happens when a network partition leaves two halves of a cluster unable to communicate, and each half elects its own leader and keeps accepting writes. You end up with two diverging copies of the data that are hard to merge. The standard defense is to require a majority quorum before a node can act as leader, so a minority partition cannot elect one and goes read-only or rejects writes. Fencing tokens and lease-based locks add another layer by invalidating an old leader's actions once a new one is chosen.
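The fencing-token idea can be sketched as follows. `FencedStore` is a hypothetical storage service written for this example, and the token values are arbitrary. The assumption is that each newly chosen leader receives a strictly larger token than the one before it.

```python
class FencedStore:
    """A storage service that rejects writes carrying an outdated fencing token."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token: int, key: str, value: str) -> bool:
        if token < self.highest_token:
            return False  # stale leader: a newer leader has already written
        self.highest_token = token
        self.data[key] = value
        return True

store = FencedStore()
old_leader_token, new_leader_token = 33, 34  # tokens grow with each new leader

print(store.write(new_leader_token, "x", "from new leader"))  # True
print(store.write(old_leader_token, "x", "from old leader"))  # False
print(store.data["x"])                                        # from new leader
```

Even if the old leader still believes it holds the lease, the store itself refuses its writes, so the check does not depend on the old leader noticing it was replaced.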

Do I need to learn all of these topics, or is some of it optional?

Start with CAP theorem, quorum, leader election, and consensus, since almost every other topic builds on them. Then learn the replica repair set (gossip, hinted handoff, read repair, anti-entropy) if you work with eventually consistent databases, and the transaction set (two-phase commit, sagas, event sourcing, CQRS) if you build services that coordinate state. The clocks and actor model lessons are valuable for deeper reasoning but can come later. The lessons are ordered so each one prepares you for the next.

advanced

Distributed Systems Core

The moment your system runs on more than one machine, a new set of problems shows up that single-server code never has to think about. Machines crash mid-write. The network drops messages or delivers them out of order. Two nodes both decide they are in charge. A payment gets debited on one server but the confirmation never reaches another. These are not edge cases you can patch later. They are the default behavior of any system spread across processes, racks, or regions, and getting them wrong is how companies lose money, corrupt data, or take down an entire region during an outage.

This category covers the ideas that hold distributed systems together when individual parts fail. You will learn how nodes agree on a single answer when none of them can fully trust the others, how to order events without a shared clock, how to keep data consistent across replicas, and how to coordinate work that spans many services without one giant lock. These are the same building blocks behind Kafka, Cassandra, DynamoDB, etcd, and every database that promises to survive a server dying at 3 AM.

Distributed Systems Core: the landscape

What distributed systems core actually means

A distributed system is a group of independent computers that work together but can fail independently. The hard part is not splitting work across them. The hard part is that there is no shared memory, no shared clock, and no guarantee that a message you sent ever arrived. Each node sees its own slightly different view of the world, and the system has to behave correctly anyway.

The lessons here start from the formal models that describe this behavior. A State Machine and a Finite State Machine give you a precise way to reason about what a node can do and which transitions are legal, which matters because most replication and consensus protocols are built on the idea of replicating a state machine across nodes so they all end up in the same state.
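As a small illustration of that idea, the Python sketch below defines a toy finite state machine and replays the same log of events on three "nodes". The order states, event names, and helper functions are invented for this example. Real replication protocols add the hard part, which is getting every node to agree on the log in the first place.

```python
# Legal transitions for a toy order state machine (states and events are illustrative).
TRANSITIONS = {
    ("created", "pay"): "paid",
    ("paid", "ship"): "shipped",
    ("created", "cancel"): "cancelled",
    ("paid", "cancel"): "cancelled",
}

def apply(state: str, event: str) -> str:
    """Return the next state, or raise if the transition is not legal."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {event!r} in state {state!r}") from None

def replay(log: list, start: str = "created") -> str:
    """Apply every event in the log, in order, starting from `start`."""
    state = start
    for event in log:
        state = apply(state, event)
    return state

log = ["pay", "ship"]
replicas = [replay(log) for _ in range(3)]  # three nodes apply the same log
print(replicas)  # ['shipped', 'shipped', 'shipped']
```

Because `apply` is deterministic, identical logs always produce identical states, which is why agreeing on the log is enough to keep replicas in sync.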

From there the category splits into the recurring problems: agreeing on values, ordering events, replicating data, and coordinating transactions. Every topic in this hub is one answer to the question of how independent machines stay correct together when the network and the hardware are working against them.

The hard problems: partitions, ordering, and agreement

The CAP Theorem is the starting point for almost every design decision in this space. When the network splits and two halves of your cluster cannot talk to each other, a state called Network Partitioning, you can keep serving requests or you can stay consistent, but not both. The lesson on Split-Brain shows what happens when a system gets this wrong: two nodes both believe they are the leader and both accept writes, and now you have two diverging copies of reality that are painful to merge back together.

Ordering is the second hard problem. Without a single shared clock, you cannot trust wall-clock timestamps to tell you what happened first. Lamport Timestamps give you a logical ordering of events, and Vector Clocks go further by letting you detect when two events are genuinely concurrent, which is exactly the information you need to spot and resolve conflicting writes.
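Detecting concurrency with vector clocks comes down to an element-wise comparison. In the sketch below, a vector clock is represented as a plain dictionary from node name to counter; the representation and the `compare` function are assumptions made for this example.

```python
def compare(a: dict, b: dict) -> str:
    """Compare two vector clocks: 'before', 'after', 'equal', or 'concurrent'."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a happened before b
    if b_le_a:
        return "after"       # b happened before a
    return "concurrent"      # neither saw the other: a potential conflict

print(compare({"n1": 1}, {"n1": 2, "n2": 1}))           # before
print(compare({"n1": 2, "n2": 0}, {"n1": 1, "n2": 1}))  # concurrent
```

A "concurrent" result is the signal that two writes were made without knowledge of each other, so the system has to pick a resolution strategy rather than simply keeping the later one.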

Agreement ties it together. Quorum is the simplest tool: require a majority of nodes to acknowledge a read or write so any two operations overlap on at least one node. Leader Election picks a single coordinator so the cluster has one source of truth, and Consensus Algorithms like Raft and Paxos let a set of nodes agree on a sequence of values even while some of them are crashing or slow. Distributed Locks build on these to make sure only one process touches a resource at a time across the whole cluster.
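The sketch below shows why majority voting yields at most one leader per election round. It is a heavily simplified model loosely inspired by Raft-style voting: the `Node` class, the `term` numbering, and `run_election` are illustrative assumptions, and real protocols also handle timeouts, log comparisons, and retries.

```python
class Node:
    def __init__(self):
        self.voted_in_term = {}  # term -> candidate this node voted for

    def request_vote(self, term: int, candidate: str) -> bool:
        """Grant at most one vote per term."""
        choice = self.voted_in_term.setdefault(term, candidate)
        return choice == candidate

def run_election(nodes, term, candidate, reachable):
    """The candidate wins only with votes from a majority of the whole cluster."""
    votes = sum(nodes[i].request_vote(term, candidate) for i in reachable)
    return votes >= len(nodes) // 2 + 1

nodes = [Node() for _ in range(5)]
# A partition splits the cluster 3 / 2, and each side has a candidate in term 1.
print(run_election(nodes, 1, "A", reachable=[0, 1, 2]))  # True
print(run_election(nodes, 1, "B", reachable=[3, 4]))     # False
```

The two-node side can never collect three votes, so it cannot elect a second leader while the partition lasts.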

Keeping replicas in sync, and coordinating work across services

Replicating data is not a one-time copy. Replicas drift apart because writes land on different nodes at different times, so distributed databases run a constant repair cycle. Gossip Protocol spreads membership and state information node to node, the same way news travels through a crowd, so the cluster knows who is alive without a central registry. Hinted Handoff stores writes meant for a temporarily down node and delivers them when it returns. Read Repair fixes stale data at read time, Anti-Entropy runs background comparisons to reconcile differences, and Sloppy Quorum keeps a system available by accepting writes on backup nodes during a partition. These five together are the machinery that powers Cassandra- and DynamoDB-style availability.
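Of these mechanisms, read repair is the easiest to show in a few lines. The sketch below assumes each replica is a dictionary mapping a key to a `(version, value)` pair, where the integer version stands in for whatever timestamp or clock a real store would use. The function name and data layout are invented for this example.

```python
def read_with_repair(replicas: list, key: str):
    """Read a key from all replicas, return the newest value, and fix stale copies.

    Each replica maps key -> (version, value). The sketch assumes at least
    one replica holds the key.
    """
    newest = max((r[key] for r in replicas if key in r), key=lambda vv: vv[0])
    for r in replicas:
        if r.get(key) != newest:
            r[key] = newest  # repair the stale or missing copy
    return newest[1]

replicas = [
    {"cart": (2, "book, pen")},
    {"cart": (1, "book")},  # stale copy
    {},                     # missed the write entirely
]
print(read_with_repair(replicas, "cart"))  # book, pen
print(replicas[1]["cart"])                 # (2, 'book, pen')
```

The repair piggybacks on a read the client was making anyway, which is why keys that are rarely read still need a background anti-entropy process.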

Coordinating writes across multiple services is the other major theme. Distributed Transactions ask how you commit an operation that touches several databases as a unit. Two-Phase Commit gives you atomicity but blocks if the coordinator dies, and Three-Phase Commit tries to remove that blocking at the cost of more network round trips. The Saga Pattern takes a different route entirely, breaking one big transaction into a sequence of local steps each with a compensating undo, which is the standard approach in microservices where a single locking transaction across services is not realistic.
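The two phases of Two-Phase Commit can be sketched in memory as follows. The `Participant` class and `two_phase_commit` function are hypothetical names for this illustration, and the sketch deliberately ignores crashes and timeouts, which is where the blocking problem described above comes from.

```python
class Participant:
    def __init__(self, name: str, can_commit: bool = True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self) -> bool:
        """Phase 1: vote yes only if the local work can be made durable."""
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants) -> bool:
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only if every vote was yes, otherwise abort everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False

orders = Participant("orders")
billing = Participant("billing", can_commit=False)
ok = two_phase_commit([orders, billing])
print(ok, orders.state, billing.state)  # False aborted aborted
```

Between voting yes and hearing the coordinator's decision, a participant is stuck in the "prepared" state and cannot safely decide on its own, which is the window Three-Phase Commit tries to close.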

Finally, the category covers patterns for modeling state itself. CQRS separates the write path from the read path so each can scale and be shaped independently. Event Sourcing stores every change as an immutable event rather than overwriting state, giving you a full audit history and the ability to rebuild state at any point in time. The Actor Model and Reactive Systems round it out with concurrency and resilience patterns for systems that must stay responsive under load and partial failure.
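Rebuilding state from events is the heart of Event Sourcing, and a short Python sketch shows the idea. The `Event` type, the event kinds, and the `balance` function are assumptions invented for this example; a real system would store the events durably rather than in a list.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)  # frozen: events are immutable once recorded
class Event:
    kind: str    # "deposited" or "withdrew"
    amount: int

def balance(events: list, upto: Optional[int] = None) -> int:
    """Rebuild the balance by replaying events; `upto` limits how many are applied."""
    total = 0
    for e in events[:upto]:
        total += e.amount if e.kind == "deposited" else -e.amount
    return total

events = [Event("deposited", 100), Event("withdrew", 30), Event("deposited", 5)]
print(balance(events))          # 75
print(balance(events, upto=2))  # 70, the state as of the second event
```

Nothing is ever overwritten, so the same log answers both "what is the balance now" and "what was it at any earlier point".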

How real companies use these ideas

These are not academic curiosities. etcd, the data store behind Kubernetes, uses the Raft consensus algorithm to keep cluster configuration consistent even when control plane nodes fail. Apache ZooKeeper uses leader election and quorum writes to coordinate large fleets at companies like Yahoo and LinkedIn. Cassandra and other databases modeled on Amazon's Dynamo design lean on gossip, hinted handoff, read repair, and anti-entropy to stay available across regions, accepting eventual consistency as a deliberate trade for uptime.

Google Spanner combines consensus with tightly synchronized clocks to offer strong consistency across continents, while Kafka relies on a leader-and-follower model in which a write counts as committed only after the in-sync replicas have acknowledged it, so committed messages survive a broker failure. Payment and order systems at companies running microservices use the saga pattern because a customer's checkout might touch inventory, billing, and shipping services that each own their own database. Knowing which of these patterns a system uses tells you immediately what it will do during a network partition, and that is exactly the kind of reasoning that interviews test and on-call work demands.

Frequently asked questions

Learn Distributed Systems Core the interactive way

All 24 lessons with step-by-step diagrams, runnable code, and quizzes. One payment of ₹299 in India or $5 worldwide. Lifetime access, no subscription.
