Core Fundamentals
Every large system that has ever fallen over did so for a reason that traces back to something on this page. A checkout page that times out on Black Friday, an API that double-charges a customer because a retry fired twice, a database that locks up the moment traffic triples. None of those are exotic problems. They are failures to respect the basics: how fast a system answers, how much it can handle at once, what happens when you add a second server, and what a request is allowed to assume about the request before it.
The core fundamentals are the vocabulary and the mental model that everything else is built on. Before you can reason about message queues, sharding, or consensus, you need to know the difference between latency and throughput, why a stateless service scales and a stateful one fights you, and when ACID guarantees are worth the cost versus when you trade them away for availability. The 26 lessons here are the foundation. Get them right and the advanced material reads like common sense. Skip them and you will keep relearning the same lessons in production, at 3 AM, with customers watching.
Performance: latency, throughput, and bandwidth are not the same thing
People use these three words interchangeably and then make bad decisions because of it. Latency is how long one request takes from send to response, usually measured in milliseconds. Throughput is how many requests you can complete per second. Bandwidth is the raw capacity of the pipe: how much data can move through it per second. A system can have low latency and low throughput, or high throughput and terrible latency. Improving one does not automatically improve the others.
The classic trap is optimizing one and assuming the other followed. You speed up a single database query and feel good, but under load the server is queuing requests and the latency a real user sees has tripled. Or you add bandwidth expecting things to feel faster, but the bottleneck was processing time, not the network, so nothing changes. The slowest component dominates total latency. A 500ms database call makes your 1ms network irrelevant.
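The arithmetic behind that last point can be sketched in a few lines. This is a minimal illustration, not a performance model: the 500 ms database call and the 1 ms network hop come from the paragraph above, while the 4 ms of application time, the pool of 8 workers, and the function names are assumptions made up for the example.

```python
def total_latency_ms(components_ms: dict[str, float]) -> float:
    """Latency of one request that passes through each component in sequence."""
    return sum(components_ms.values())


def max_throughput_rps(workers: int, latency_ms: float) -> float:
    """Upper bound on requests per second when each worker serves one request at a time."""
    return workers * 1000 / latency_ms


# The 500 ms database call and 1 ms network hop come from the text;
# the 4 ms of application time and the 8 workers are assumed for illustration.
request_path = {"network": 1.0, "app": 4.0, "database": 500.0}
total = total_latency_ms(request_path)
database_share = request_path["database"] / total

print(f"total latency: {total} ms, database share: {database_share:.1%}")
print(f"throughput ceiling with 8 workers: {max_throughput_rps(8, total):.1f} req/s")

# Removing the network hop entirely barely moves the total.
without_network = total_latency_ms({"app": 4.0, "database": 500.0})
print(f"latency with a free network: {without_network} ms")
```

Note that the throughput ceiling depends on both latency and the number of workers, which is why the two numbers can move in different directions.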
The lessons on synchronous and asynchronous processing connect directly here. A synchronous call makes the caller wait for the result, which is simple to reason about but ties up resources and stacks latency. Asynchronous processing lets the caller move on while the work happens in the background, which is how you keep throughput high when individual operations are slow. Knowing which one a given workload needs is one of the most common real design decisions you will make.
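As a rough sketch of the difference, the example below runs three slow operations first one after another and then concurrently, using Python's asyncio. The operation names and the 0.1 second delay are placeholders, and asyncio.sleep stands in for real I/O such as a network call.

```python
import asyncio
import time

NAMES = ("a", "b", "c")


async def slow_operation(name: str, delay_s: float) -> str:
    """Stand-in for a slow I/O call; asyncio.sleep simulates the wait."""
    await asyncio.sleep(delay_s)
    return f"{name} done"


async def run_one_at_a_time(delay_s: float) -> list[str]:
    """The caller waits for each result before starting the next operation."""
    results = []
    for name in NAMES:
        results.append(await slow_operation(name, delay_s))
    return results


async def run_concurrently(delay_s: float) -> list[str]:
    """All operations are started together and their waits overlap."""
    return list(await asyncio.gather(*(slow_operation(n, delay_s) for n in NAMES)))


def timed(coro_fn, delay_s: float) -> tuple[list[str], float]:
    """Run a coroutine function to completion and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(coro_fn(delay_s))
    return results, time.perf_counter() - start


sequential_results, sequential_s = timed(run_one_at_a_time, 0.1)
concurrent_results, concurrent_s = timed(run_concurrently, 0.1)
print(f"one at a time: {sequential_s:.2f}s, concurrent: {concurrent_s:.2f}s")
```

Both versions return the same results. Only the waiting overlaps, which is why this approach helps most when the work is dominated by I/O rather than CPU.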
Scaling: vertical, horizontal, and the statelessness that makes it possible
When traffic grows, you have two moves. Vertical scaling means making one machine bigger: more CPU, more memory. It is the easy answer and it works until it does not, because there is a ceiling and a single machine is a single point of failure. Horizontal scaling means adding more machines and spreading the work across them. Its ceiling is far higher, but it forces you to answer a hard question: when a user's second request lands on a different server than their first, does anything break?
That question is why stateless versus stateful is on this list right next to scaling. A stateless service keeps no memory of past requests between calls, so any server can handle any request and you can add or remove servers freely. A stateful service remembers things locally, which means a specific user is tied to a specific server, and now horizontal scaling becomes a fight. Session management is the practical version of this problem. Where do you keep a logged-in user's session so that any server in the pool can serve them?
Scalability, elasticity, and load balancing complete the picture. Scalability is whether your system can grow at all. Elasticity is whether it can grow and shrink automatically as demand changes, which is what cloud autoscaling sells you. Load balancing is the traffic cop that spreads incoming requests across your pool so no single server gets buried. Caching sits alongside all of it as the cheapest performance win there is: store the answer once, serve it many times, and take the load off everything downstream.
Correctness: idempotency, timeouts, and the data guarantees behind them
Distributed systems fail in ways single programs do not. The network drops a response, the client never hears back, so it retries. Now the same operation runs twice. If that operation was charging a credit card, you have a furious customer. Idempotency is the property that running the same request twice has the same effect as running it once. It is not optional at scale. It is what makes retries safe, and retries are unavoidable.
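A widely used way to get this property is an idempotency key: the client attaches a unique key to the operation, and the server remembers the result for each key it has already processed. The sketch below is illustrative only. PaymentService and its fields are hypothetical, and a real implementation would need a durable store and an atomic check-and-record step.

```python
class PaymentService:
    """Toy payment service that deduplicates requests by idempotency key."""

    def __init__(self) -> None:
        self._results: dict[str, dict] = {}         # idempotency key -> stored result
        self.charges: list[tuple[str, int]] = []    # charges actually performed

    def charge(self, idempotency_key: str, customer: str, amount_cents: int) -> dict:
        # A repeated key means a retry: return the earlier result, do not charge again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]

        self.charges.append((customer, amount_cents))
        result = {"status": "charged", "customer": customer, "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


service = PaymentService()

# The client never saw the first response, so it retries with the same key.
first = service.charge("order-1001", "alice", 2500)
retry = service.charge("order-1001", "alice", 2500)
print(first == retry, "charges made:", len(service.charges))
```

The key has to be chosen by the client and reused across retries of the same logical operation, otherwise the server cannot tell a retry from a new request.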
Timeouts are the other half of surviving failure. A connection timeout caps how long you wait to establish a connection. A request timeout caps how long you wait for the answer. Without them, one slow dependency can hold every thread hostage and cascade into a full outage. Set them too tight and you get false failures. Set them too loose and a sick service drags everything down with it. There is no universal number, only a trade-off you have to reason about per dependency.
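As a small sketch of a request timeout, the example below caps how long a caller waits on a slow dependency using asyncio.wait_for. The call_dependency function and the delay values are placeholders, and the fallback string is just one possible way to react. Connection timeouts are configured separately, typically on the client or socket that opens the connection.

```python
import asyncio


async def call_dependency(delay_s: float) -> str:
    """Stand-in for a downstream call; the delay simulates a slow or sick service."""
    await asyncio.sleep(delay_s)
    return "ok"


async def call_with_timeout(delay_s: float, timeout_s: float) -> str:
    """Wait at most timeout_s for the dependency, then give up instead of hanging."""
    try:
        return await asyncio.wait_for(call_dependency(delay_s), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The pending call is cancelled, so the caller's resources are freed.
        return "timed out"


healthy = asyncio.run(call_with_timeout(delay_s=0.01, timeout_s=0.5))
sick = asyncio.run(call_with_timeout(delay_s=0.5, timeout_s=0.05))
print(healthy, sick)
```

The numbers here are arbitrary. In practice the limit for each dependency is usually chosen from its observed latency distribution rather than guessed.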
Under all of this sit the data guarantees. ACID properties (atomicity, consistency, isolation, durability) are the strong promises a traditional database makes: a transaction either fully happens or fully does not, and once committed it stays committed. BASE (basically available, soft state, eventual consistency) is the looser model many distributed systems choose so they can stay available under partition and scale wide. SQL versus NoSQL is largely this same choice expressed as a database category. You pick based on whether your workload needs strict consistency and rich queries or whether it needs to scale horizontally and tolerate eventual consistency.
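Atomicity, the "fully happens or fully does not" promise, can be seen with Python's built-in sqlite3 module. This is a minimal sketch with an invented accounts table and transfer function. It relies on the documented behavior that using a connection as a context manager commits on success and rolls back when an exception is raised.

```python
import sqlite3


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> bool:
    """Move money between accounts; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


def balances(conn: sqlite3.Connection) -> dict[str, int]:
    return dict(conn.execute("SELECT name, balance FROM accounts"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # succeeds and is committed
failed = transfer(conn, "alice", "bob", 500)  # overdraws, so both updates are rolled back
print(ok, failed, balances(conn))
```

The failed transfer leaves no trace: the credit to the second account is undone along with the debit, which is exactly the guarantee an eventually consistent store does not give you for free.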
Contracts: APIs, schemas, and validation at the boundary
A system is only as reliable as the agreements between its parts. REST is the dominant style for those agreements over HTTP, a set of conventions for how services expose resources and how clients talk to them. The value of a convention is predictability. Anyone who knows REST can pick up your API and guess how it works, which is why it became the default.
But a convention is not enough on its own. JSON Schema and XML Schema let you write down exactly what a valid message looks like, so both sides agree on shape and types before anything goes wrong. API documentation through Swagger and OpenAPI turns that contract into something humans and tools can read, generate clients from, and test against. Semantic versioning is how you evolve the contract without breaking everyone who depends on it: a patch release for backward-compatible fixes, a minor release for backward-compatible features, and a major release for changes that break compatibility.
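The semantic versioning rule is mechanical enough to sketch in code. The classify_upgrade function below is a hypothetical helper, not part of any library. It handles only plain MAJOR.MINOR.PATCH strings and ignores pre-release tags, build metadata, and the looser expectations that apply to 0.x versions.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Split a plain 'MAJOR.MINOR.PATCH' string into three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def classify_upgrade(old: str, new: str) -> str:
    """Say what kind of change a version bump signals to consumers of the API."""
    old_major, old_minor, old_patch = parse_version(old)
    new_major, new_minor, new_patch = parse_version(new)
    if new_major != old_major:
        return "breaking"   # major bump: clients may need to change
    if new_minor != old_minor:
        return "feature"    # minor bump: new functionality, still compatible
    if new_patch != old_patch:
        return "fix"        # patch bump: compatible bug fix
    return "same"


for new_version in ("1.4.3", "1.5.0", "2.0.0"):
    print("1.4.2 ->", new_version, classify_upgrade("1.4.2", new_version))
```

A dependency manager can use exactly this kind of check to decide which upgrades are safe to apply automatically and which need a human to look first.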
The boundary is also where security lives. Input validation means never trusting what comes in from outside, checking every field before it touches your logic, because attackers send malformed and malicious data on purpose. Output encoding means safely formatting what you send back so that data can never be misread as code, which is the core defense against injection attacks such as cross-site scripting. Validate on the way in, encode on the way out. These two habits prevent an entire family of breaches and belong in your reflexes from day one.
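Both habits fit in a short sketch. The comment payload, its field names, and the length limits below are assumptions made for illustration. The encoding step uses html.escape from the Python standard library, which is appropriate for HTML output. Other output contexts, such as SQL or shell commands, need their own mechanisms.

```python
import html


def validate_comment(payload: dict) -> dict:
    """Check every field on the way in; reject anything that does not fit the contract."""
    author = payload.get("author")
    text = payload.get("text")
    if not isinstance(author, str) or not (1 <= len(author) <= 50):
        raise ValueError("author must be a string of 1 to 50 characters")
    if not isinstance(text, str) or not (1 <= len(text) <= 500):
        raise ValueError("text must be a string of 1 to 500 characters")
    return {"author": author, "text": text}


def render_comment(comment: dict) -> str:
    """Encode on the way out so user data is displayed as text, never run as markup."""
    author = html.escape(comment["author"])
    text = html.escape(comment["text"])
    return f"<p><b>{author}</b>: {text}</p>"


# A well-formed payload that still carries a script tag in its text.
comment = validate_comment({"author": "mallory", "text": "<script>alert(1)</script>"})
rendered = render_comment(comment)
print(rendered)
```

The payload passes validation because it has the right shape, which is why validation alone is not enough: the encoding step is what keeps the script tag from being interpreted by a browser.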