Distributed Systems Core
The moment your system runs on more than one machine, a new set of problems shows up that single-server code never has to think about. Machines crash mid-write. The network drops messages or delivers them out of order. Two nodes both decide they are in charge. A payment gets debited on one server but the confirmation never reaches another. These are not edge cases you can patch later. They are the default behavior of any system spread across processes, racks, or regions, and getting them wrong is how companies lose money, corrupt data, or take down an entire region during an outage.
This category covers the ideas that hold distributed systems together when individual parts fail. You will learn how nodes agree on a single answer when none of them can fully trust the others, how to order events without a shared clock, how to keep data consistent across replicas, and how to coordinate work that spans many services without one giant lock. These are the same building blocks behind Kafka, Cassandra, DynamoDB, etcd, and every database that promises to survive a server dying at 3 AM.
What distributed systems core actually means
A distributed system is a group of independent computers that work together but can fail independently. The hard part is not splitting work across them. The hard part is that there is no shared memory, no shared clock, and no guarantee that a message you sent ever arrived. Each node sees its own slightly different view of the world, and the system has to behave correctly anyway.
The lessons here start from the formal models that describe this behavior. A State Machine and a Finite State Machine give you a precise way to reason about what a node can do and which transitions are legal, which matters because most replication and consensus protocols are built on the idea of replicating a state machine across nodes so they all end up in the same state.
From there the category splits into the recurring problems: agreeing on values, ordering events, replicating data, and coordinating transactions. Every topic in this hub is one answer to the question of how independent machines stay correct together when the network and the hardware are working against them.
The hard problems: partitions, ordering, and agreement
The CAP Theorem is the starting point for almost every design decision in this space. When the network splits and two halves of your cluster cannot talk to each other, a state called Network Partitioning, you can keep serving requests or you can stay consistent, but not both. The lesson on Split-Brain shows what happens when a system gets this wrong: two nodes both believe they are the leader and both accept writes, and now you have two diverging copies of reality that are painful to merge back together.
Ordering is the second hard problem. Without a single shared clock, you cannot trust wall-clock timestamps to tell you what happened first. Lamport Timestamps give you a logical ordering of events, and Vector Clocks go further by letting you detect when two events are genuinely concurrent, which is exactly the information you need to spot and resolve conflicting writes.
Agreement ties it together. Quorum is the simplest tool: require a majority of nodes to acknowledge a read or write so any two operations overlap on at least one node. Leader Election picks a single coordinator so the cluster has one source of truth, and Consensus Algorithms like Raft and Paxos let a set of nodes agree on a sequence of values even while some of them are crashing or slow. Distributed Locks build on these to make sure only one process touches a resource at a time across the whole cluster.
Keeping replicas in sync, and coordinating work across services
Replicating data is not a one-time copy. Replicas drift apart because writes land on different nodes at different times, so distributed databases run a constant repair cycle. Gossip Protocol spreads membership and state information node to node, the same way news travels through a crowd, so the cluster knows who is alive without a central registry. Hinted Handoff stores writes meant for a temporarily down node and delivers them when it returns. Read Repair fixes stale data at read time, Anti-Entropy runs background comparisons to reconcile differences, and Sloppy Quorum keeps a system available by accepting writes on backup nodes during a partition. These five together are the machinery that powers Cassandra and DynamoDB style availability.
Coordinating writes across multiple services is the other major theme. Distributed Transactions ask how you commit an operation that touches several databases as a unit. Two-Phase Commit gives you atomicity but blocks if the coordinator dies, and Three-Phase Commit tries to remove that blocking at the cost of more network round trips. The Saga Pattern takes a different route entirely, breaking one big transaction into a sequence of local steps each with a compensating undo, which is the standard approach in microservices where a single locking transaction across services is not realistic.
Finally, the category covers patterns for modeling state itself. CQRS separates the write path from the read path so each can scale and be shaped independently. Event Sourcing stores every change as an immutable event rather than overwriting state, giving you a full audit history and the ability to rebuild state at any point in time. The Actor Model and Reactive Systems round it out with concurrency and resilience patterns for systems that must stay responsive under load and partial failure.
How real companies use these ideas
These are not academic curiosities. etcd, the data store behind Kubernetes, uses the Raft consensus algorithm to keep cluster configuration consistent even when control plane nodes fail. Apache ZooKeeper uses leader election and quorum writes to coordinate large fleets at companies like Yahoo and LinkedIn. Cassandra and Amazon DynamoDB lean on gossip, hinted handoff, read repair, and anti-entropy to stay available across regions, accepting eventual consistency as a deliberate trade for uptime.
Google Spanner combines consensus with tightly synchronized clocks to offer strong consistency across continents, while Kafka relies on a leader-and-replica model with quorum acknowledgment to never lose committed messages. Payment and order systems at companies running microservices use the saga pattern because a customer's checkout might touch inventory, billing, and shipping services that each own their own database. Knowing which of these patterns a system uses tells you immediately what it will do during a network partition, and that is exactly the kind of reasoning interviewers and on-call engineers are tested on.