Dream11 System Design Interview: Fantasy Sports at Scale
Just before a big cricket match, millions of Dream11 users rush to lock their fantasy teams in the final minutes, and one event, the toss, can trigger almost all of them at once. Dream11 reported handling more than 5.5 million concurrent users during the IPL 2020 final and around 100 million requests per minute at its edge, on a base of over 200 million registered users.
Designing Dream11 is an extreme-spike problem. The load is not smooth: for a popular match, a large share of the day's users create or edit their fantasy teams in the last few minutes before the deadline, and the toss right before the match makes everyone act at once. On top of that, once the match starts, the platform has to score every user's team from live ball-by-ball data and continuously update ranks across thousands of contests for millions of users. The interview is about absorbing a predictable but enormous surge, keeping contest joins and money correct under that load, and fanning live scores out to huge leaderboards. Note that Dream11 paused paid contests in August 2025 after a change in Indian law, so this describes how the platform was engineered during its paid-contest era.
Asked at: Dream11, other gaming and fantasy-sports companies, and ticketing or flash-sale businesses. The general form of the question (design a system for a massive predictable traffic spike, or design a live leaderboard) shows up at Amazon, Google, and most product companies in SDE2 and SDE3 rounds. It is a favorite because the spike is both extreme and predictable, which turns the interview into a real capacity and pre-scaling conversation rather than a generic one.
Why this question is asked
Most systems are designed for load that rises and falls gradually. Dream11 is the opposite: the demand is spiky, correlated, and tied to an external event no one controls. Interviewers use it to check three things. First, can you handle a surge where millions of users act in the same few minutes before a deadline, which breaks reactive autoscaling because servers cannot be provisioned fast enough once the spike has started. Second, can you keep the write path correct under that load, so a contest with a fixed number of spots is never oversold and money is never lost. Third, can you design the live-scoring fan-out, where a single ball in the real match changes the points and rank of millions of fantasy teams across thousands of contests at once. It separates candidates who can only design steady-state systems from those who can reason about a predictable megaspike and a real-time scoring pipeline.
Requirements
Always clarify these in the first 5 minutes of the interview. Do not start drawing boxes until both lists are agreed.
Functional requirements
- User browses upcoming matches, picks a fantasy team of players within a budget and role rules, and edits it until the deadline
- User joins one or more contests for a match, paying an entry fee, with each contest holding a fixed number of spots and a defined prize structure
- Team edits and contest joins are locked at the match deadline, after which no changes are allowed
- As the real match plays, each fantasy team earns points from live ball-by-ball events
- Live leaderboards show every user their rank within each contest, updating through the match
- At match end, ranks are finalized and winnings are credited to user wallets
- User manages a wallet: deposits, entry fees, winnings, and withdrawals
- Social features such as following other users and viewing their activity
Non-functional requirements
- Absorb a massive, predictable spike in the final minutes before a match deadline, when a large share of users join or edit teams at once
- The toss just before the match can trigger a correlated surge, so capacity must be ready before it, not provisioned after
- Contest joins must be correct under extreme concurrency: a fixed-size contest is never oversold, and entry fees and winnings never drift
- Live scoring must fan out to millions of users across thousands of contests with fresh ranks through the match
- Very high availability during matches, since downtime during a big game is extremely costly
- Read-heavy screens such as leaderboards and team views must stay fast during the surge
- Cost efficiency, since the peak capacity needed for a big match is many times the everyday baseline
Back-of-envelope scale estimates
Show your math. Pulling numbers from thin air signals you have not thought about the load.
Registered users
200M+ (2023)
Dream11 reported crossing 200 million registered users in October 2023, up from 45 million in 2018. This is the base from which a match-day spike draws.
Peak concurrent users
5.5M (IPL 2020 final), ~10.5M later
Dream11 and AWS reported more than 5.5 million concurrent users during the IPL 2020 final, with later peaks reported around 10.5 million. Concurrency, not total users, is what the match-day design is sized for.
Peak edge requests
~100M requests/minute
Dream11 reported around 100 million requests per minute at its edge services during peak match traffic. This is the request rate the front door and caches must absorb.
Compute at peak
30,000+ resources, 700 ASGs
Dream11 reported using more than 30,000 compute resources at peak, managed as roughly 700 auto-scaling groups behind about 200 load balancers, across 100-plus sporting events a day, at 99.99 percent uptime.
Real-time data store throughput
~1.8M ops/s, sub-15ms
Dream11's Aerospike layer was reported to serve around 1.8 million operations per second at peak with sub-15-millisecond latency, handling about 35 terabytes of data a day. It is the hot store behind the match-day load.
Cost of downtime
~$1M per minute
A minute of downtime during peak was reported to cost Dream11 around one million dollars, which is why the design targets 99.99 percent uptime and pre-provisions capacity.
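The reported figures above are easier to reason about once converted to per-second and per-user rates. The short calculation below is a sketch of that arithmetic. It assumes the 100 million requests per minute and the 5.5 million concurrent users describe the same peak, which the sources suggest but do not state outright, and it uses decimal terabytes.

```python
# Back-of-envelope conversions of the reported figures quoted above.
EDGE_REQUESTS_PER_MIN = 100_000_000   # ~100M requests/minute at the edge
PEAK_CONCURRENT_USERS = 5_500_000     # IPL 2020 final
DATA_PER_DAY_TB = 35                  # Aerospike layer, decimal TB per day
UPTIME_TARGET = 0.9999                # 99.99 percent

# Requests per second the front door must absorb.
edge_rps = EDGE_REQUESTS_PER_MIN / 60

# Assumption: both figures describe the same peak window.
req_per_user_per_min = EDGE_REQUESTS_PER_MIN / PEAK_CONCURRENT_USERS

# Average data rate implied by 35 TB/day (1 TB = 1,000,000 MB).
avg_mb_per_sec = DATA_PER_DAY_TB * 1_000_000 / 86_400

# Downtime allowed per year at the uptime target, in minutes.
downtime_min_per_year = (1 - UPTIME_TARGET) * 365 * 24 * 60

print(f"edge: ~{edge_rps / 1e6:.2f}M requests/second")
print(f"per user: ~{req_per_user_per_min:.0f} requests/minute")
print(f"data: ~{avg_mb_per_sec:.0f} MB/second on average")
print(f"downtime budget: ~{downtime_min_per_year:.1f} minutes/year")
```

Under those assumptions the edge sees roughly 1.67 million requests per second, each concurrent user generates about 18 requests per minute, and a 99.99 percent target leaves under an hour of downtime per year. At the reported cost of a bad minute, that is why the whole budget is guarded so carefully.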
High-level architecture
Design Dream11 around one fact: the load is a predictable megaspike, not a steady stream. Everything follows from that. All of this runs on AWS, and the numbers below are what Dream11 reported during its paid-contest era.
The first problem is capacity for the spike. For a popular match, a large share of users create or edit teams and join contests in the final minutes, and the toss just before the match can make them all act at once. Reactive autoscaling does not work here, because by the time metrics show the surge, it is too late to boot servers. So Dream11 built an in-house scaling platform, Scaler, that pre-provisions capacity before the match using prediction. A forecasting model based on XGBoost iterates over 200-plus variables, such as the number and popularity of matches, active users, and transaction sizes, to forecast concurrency, and Scaler scales capacity out in phases against slabs of expected concurrency. It uses fully-baked machine images, a diverse mix of instance types including spot instances, and weighted DNS across load-balancer shards to add capacity fast, which cut scale-out time from hours to about five minutes and let the platform forecast roughly 10 million active users about two hours before a major match and be ready for it.
The second problem is the write path: joining contests and locking teams. A contest has a fixed number of spots and an entry fee, so under a surge you must never oversell it or lose money. This is a strongly consistent, contention-heavy path. Transactional data such as registrations, team selections, and payments is held in MySQL with the ACID guarantees that money and fixed-capacity contests need, while the fast, high-volume reads and writes around the match are served by Aerospike, which Dream11 adopted as a combined persistence and caching layer at around 1.8 million operations per second with sub-15-millisecond latency. A wallet ledger records entry fees and winnings.
The third problem is live scoring. Once the match starts, a stream of ball-by-ball events arrives, and each event can change the points and rank of millions of fantasy teams across thousands of contests. Dream11 described a leaderboard engine, called Aryabhata, built on Apache Spark that recomputes points, groups by contest, and ranks users, generating leaderboards across thousands of contests on a tight cycle, in its published design processing tens of millions of records within a 60-second service level. The scoring is designed as an immutable, append-only pipeline using multi-version concurrency control and snapshot isolation, so events can be processed concurrently and rolled back cleanly if a scoring decision is reversed. Cassandra was the store for this leaderboard system in the published design, with Redis caching in front, and a separate social graph, the follow network, was built on Amazon Neptune with Redis sorted sets for fast follower lookups.
In a real interview, sketch this on the whiteboard before diving into any single box.
Core components
Walk through each service. The interviewer wants to hear what each one owns, not just the names.
Scaler: prediction-based pre-scaling
Dream11's in-house platform that provisions capacity before a match rather than reacting to it. A forecasting model based on XGBoost, over 200-plus variables, predicts concurrency, and Scaler scales out in phases against concurrency slabs using pre-baked images, mixed and spot instance types, and weighted DNS across load-balancer shards. It cut scale-out time from hours to about five minutes.
Edge and API layer
The front door that absorbed around 100 million requests per minute at peak, fronted by roughly 200 load balancers and about 700 auto-scaling groups. It handles authentication, routing, and read-heavy traffic, and is the layer Scaler pre-provisions most aggressively before a match.
Contest and team service (MySQL)
The strongly consistent write path for joining contests and locking teams. Held in MySQL for ACID guarantees, because a fixed-size contest must never be oversold and entry fees must be exact. It enforces the deadline lock after which no team edits are allowed.
Aerospike real-time store
The combined persistence and caching layer for the hot match-day workload, adopted in place of an earlier relational persistence setup. Reported at around 1.8 million operations per second with sub-15-millisecond latency and about 35 terabytes of data a day, it serves the high-volume reads and writes that a big match generates.
Live scoring engine (Aryabhata on Spark)
The pipeline that turns live ball-by-ball match events into fantasy points and ranks. Built on Apache Spark, it recomputes points across thousands of contests on a tight cycle, in Dream11's published design processing tens of millions of records within a 60-second service level, and fans the results out to leaderboards for millions of users.
Leaderboard store and cache
In the published leaderboard design, Apache Cassandra was the primary store, chosen for high write throughput and tunable consistency, with Redis caching in front so the constant leaderboard reads during a match stay fast and do not overload the store.
Wallet and payments
The ledger that tracks deposits, entry fees, winnings, and withdrawals per user. Strongly consistent, because it is real money, and it settles winnings after a match once ranks are final.
Social graph (Neptune + Redis)
The follow network, built on Amazon Neptune as a graph database with Redis sorted sets in front for fast follower lists and counts. It is kept separate from the contest and scoring path so social traffic does not compete with match-critical work.
Data model
Pick the right store per table. Justify each choice with the access pattern, not by reflex.
users
user_id (PK), name, wallet_balance_paise, kyc_status, created_at
The player account and wallet balance. Transactional data held in MySQL. The base from which a match-day spike draws.
matches
match_id (PK), sport, teams, start_time, deadline, state
The real-world match. The deadline is when team edits and contest joins lock, and start_time is close to the toss that triggers the final surge. State drives whether joins are open, locked, live, or settled.
contests
contest_id (PK), match_id, entry_fee_paise, total_spots, filled_spots, prize_structure, state
A contest for a match with a fixed number of spots. filled_spots must never exceed total_spots even under a surge of joins, so this is the fixed-capacity, contention-heavy write. Held in MySQL for ACID.
fantasy_teams
team_id (PK), user_id, match_id, player_ids[], captain_id, vice_captain_id, locked_at
The eleven players a user picked for a match, within budget and role rules. Editable until the deadline, then locked. The captain and vice-captain earn multiplied points.
contest_entries
entry_id (PK), contest_id (FK), user_id, team_id, joined_at
A user's team entered into a specific contest. Creating this row is the contest-join write that must respect the contest's fixed capacity and charge the entry fee exactly once.
player_scores
score_id (PK), match_id, player_id, event, points_delta, ts
Append-only stream of scoring events derived from live ball-by-ball data. Immutable, so events can be processed concurrently and rolled back cleanly. This is the input to the scoring engine.
leaderboards
contest_id, user_id, points, rank, updated_at
The ranked standing per contest, recomputed on a tight cycle during the match and served from cache. The highest-fan-out data in the system, since one scoring event can change ranks for millions of entries.
wallet_ledger
entry_id (PK), user_id, direction (credit|debit), amount_paise, reason, created_at
Append-only record of money movement: deposits, entry fees, winnings, withdrawals. Strong consistency required. The balance is always reconstructable from the ledger.
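The claim that the balance is always reconstructable from the ledger can be made concrete with a small sketch. The `LedgerEntry` class and `balance_paise` function below are illustrative names, not Dream11's code. They mirror the wallet_ledger columns and store amounts as integer paise so no floating-point rounding creeps into money.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: a ledger row is never mutated after it is written
class LedgerEntry:
    entry_id: int
    user_id: int
    direction: str      # "credit" or "debit"
    amount_paise: int   # integer paise, never floats, for money
    reason: str

def balance_paise(ledger, user_id):
    """Rebuild a user's balance by replaying every ledger row for that user."""
    total = 0
    for entry in ledger:
        if entry.user_id != user_id:
            continue
        if entry.direction == "credit":
            total += entry.amount_paise
        elif entry.direction == "debit":
            total -= entry.amount_paise
        else:
            raise ValueError(f"unknown direction: {entry.direction!r}")
    return total

# Illustrative rows: a deposit, one entry fee, and one payout.
ledger = [
    LedgerEntry(1, 42, "credit", 50_000, "deposit"),
    LedgerEntry(2, 42, "debit", 4_900, "entry_fee"),
    LedgerEntry(3, 42, "credit", 10_000, "winnings"),
    LedgerEntry(4, 7, "credit", 20_000, "deposit"),
]
print(balance_paise(ledger, 42))  # 55100
```

In this framing, the wallet_balance_paise column on users is a cached projection of the ledger, which makes it auditable: if the two ever disagree, the ledger wins.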
Deep dives
These are the conversations the interviewer is steering you toward. Practice each one until you can talk through it without notes.
The pre-match spike and why reactive autoscaling fails
The defining challenge is that demand is not smooth. For a popular match, a large share of users create or edit teams and join contests in the final minutes before the deadline, and the toss just before the match is an external trigger that makes them act at nearly the same moment. Reactive autoscaling, which adds servers when metrics cross a threshold, cannot keep up, because provisioning takes minutes and the spike has already peaked by then. Dream11's answer was to pre-provision using prediction. Its Scaler platform runs a forecasting model based on XGBoost over 200-plus variables, such as how many matches are on, how popular they are, active users, and transaction sizes, to forecast concurrency ahead of time, then scales capacity out in phases against slabs of expected concurrency. It uses fully-baked machine images so new instances are ready instantly, a diverse mix of instance types including spot instances for cost, and weighted DNS across load-balancer shards to spread the new capacity. The reported result was scale-out time cut from hours to about five minutes, and forecasting roughly 10 million active users about two hours before a major match. The interview point is that a predictable spike should be met with prediction and pre-scaling, not reaction.
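Scaling out in phases against slabs of expected concurrency can be sketched in a few lines. Everything numeric below is hypothetical: the slab boundaries, the users-per-instance figure, the headroom, and the number of phases are placeholders chosen for illustration, not values Dream11 published, and the forecast is simply passed in rather than produced by a model.

```python
from bisect import bisect_left

# Hypothetical slab upper bounds, in forecast concurrent users.
SLAB_UPPER_BOUNDS = [1_000_000, 2_500_000, 5_000_000, 7_500_000, 10_000_000]
USERS_PER_INSTANCE = 2_000   # assumed capacity of one instance
HEADROOM_PERCENT = 120       # assumed: provision 20 percent above the slab

def ceil_div(a, b):
    """Integer ceiling division, avoiding float rounding."""
    return -(-a // b)

def slab_for(forecast_users):
    """Round the forecast up to the slab that contains it."""
    i = bisect_left(SLAB_UPPER_BOUNDS, forecast_users)
    if i == len(SLAB_UPPER_BOUNDS):
        raise ValueError("forecast exceeds the largest planned slab")
    return SLAB_UPPER_BOUNDS[i]

def instances_for(forecast_users):
    """Target instance count for the slab, including headroom."""
    slab = slab_for(forecast_users)
    return ceil_div(slab * HEADROOM_PERCENT, 100 * USERS_PER_INSTANCE)

def phased_plan(forecast_users, phases=3):
    """Cumulative instance targets, one per phase, ending at the full target."""
    target = instances_for(forecast_users)
    return [ceil_div(target * step, phases) for step in range(1, phases + 1)]

print(slab_for(6_200_000))       # 7500000
print(phased_plan(6_200_000))    # [1500, 3000, 4500]
```

Provisioning to the slab rather than to the exact forecast means a small forecasting error does not change the plan, and the phased targets let capacity arrive in steps before the deadline instead of in one large request.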
Contest joins under extreme concurrency
A contest has a fixed number of spots and an entry fee, so when millions try to join in the final minutes, the write path must stay correct: the contest cannot be oversold, and each entry fee must be charged exactly once. This is a fixed-capacity, high-contention problem, the same shape as a flash sale. The correct approach keeps this data in a store with real transactional guarantees, which is why Dream11 used MySQL for registrations, team selection, and payments, and treats the join as an atomic operation that checks and increments the filled-spots count within a transaction, or uses an equivalent guarded counter, so two joins cannot both take the last spot. The entry-fee debit is idempotent so a retry during the surge does not double-charge. Around this strongly consistent core, the high-volume reads and less-critical writes are served by the fast Aerospike layer, so the surge does not all land on the transactional database.
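The guarded counter and the idempotent debit can be sketched as one transaction. The example uses Python's built-in sqlite3 as a stand-in so it runs anywhere; the source names MySQL, where the same conditional UPDATE pattern would apply, but the schema, the `idempotency_key` column, and the `join_contest` function are assumptions made for illustration.

```python
import sqlite3

class JoinRejected(Exception):
    """Raised inside the transaction so every partial write is rolled back."""

def open_db():
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE contests (
            contest_id INTEGER PRIMARY KEY,
            entry_fee_paise INTEGER NOT NULL,
            total_spots INTEGER NOT NULL,
            filled_spots INTEGER NOT NULL DEFAULT 0);
        CREATE TABLE wallets (
            user_id INTEGER PRIMARY KEY,
            balance_paise INTEGER NOT NULL);
        CREATE TABLE contest_entries (
            idempotency_key TEXT PRIMARY KEY,  -- one row per logical join attempt
            contest_id INTEGER NOT NULL,
            user_id INTEGER NOT NULL,
            team_id INTEGER NOT NULL);
    """)
    return db

def join_contest(db, idempotency_key, contest_id, user_id, team_id):
    try:
        with db:  # commits on success, rolls back if an exception escapes
            seen = db.execute(
                "SELECT 1 FROM contest_entries WHERE idempotency_key = ?",
                (idempotency_key,)).fetchone()
            if seen:
                return "already_joined"  # a retry: do not charge again
            row = db.execute(
                "SELECT entry_fee_paise FROM contests WHERE contest_id = ?",
                (contest_id,)).fetchone()
            if row is None:
                raise JoinRejected("no_such_contest")
            fee = row[0]
            # Guarded counter: the check and the increment are one statement.
            took_spot = db.execute(
                "UPDATE contests SET filled_spots = filled_spots + 1 "
                "WHERE contest_id = ? AND filled_spots < total_spots",
                (contest_id,)).rowcount
            if took_spot == 0:
                raise JoinRejected("contest_full")
            # Guarded debit: never lets the balance go negative.
            paid = db.execute(
                "UPDATE wallets SET balance_paise = balance_paise - ? "
                "WHERE user_id = ? AND balance_paise >= ?",
                (fee, user_id, fee)).rowcount
            if paid == 0:
                raise JoinRejected("insufficient_funds")
            db.execute(
                "INSERT INTO contest_entries VALUES (?, ?, ?, ?)",
                (idempotency_key, contest_id, user_id, team_id))
        return "joined"
    except JoinRejected as rejected:
        return str(rejected)
    except sqlite3.IntegrityError:
        return "already_joined"  # a concurrent retry inserted the key first

db = open_db()
with db:
    db.execute("INSERT INTO contests (contest_id, entry_fee_paise, total_spots) "
               "VALUES (1, 4900, 2)")
    db.executemany("INSERT INTO wallets VALUES (?, ?)",
                   [(1, 10_000), (2, 10_000), (3, 10_000)])

results = [
    join_contest(db, "u1-c1", 1, 1, 101),
    join_contest(db, "u1-c1", 1, 1, 101),  # client retry with the same key
    join_contest(db, "u2-c1", 1, 2, 102),
    join_contest(db, "u3-c1", 1, 3, 103),  # contest has only two spots
]
print(results)  # ['joined', 'already_joined', 'joined', 'contest_full']
```

The important detail is that the capacity check lives in the WHERE clause of the UPDATE, so two joins racing for the last spot cannot both succeed, and a rejected join rolls back the spot it briefly held.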
Live scoring and the leaderboard fan-out
Once the match starts, scoring is a fan-out problem. A single real-world event, a wicket or a boundary, changes the points for every fantasy team that picked those players, which changes ranks across thousands of contests for millions of users. Dream11 described a leaderboard engine, Aryabhata, built on Apache Spark, that takes the stream of scoring events, recomputes points, groups entries by contest, and ranks them, generating leaderboards across thousands of contests on a tight cycle, in its published design within a 60-second service level over tens of millions of records. Rather than update every user the instant a ball is bowled, it recomputes and publishes on a short, fixed cycle, which turns an impossible per-event fan-out into a bounded, repeatable batch. The results are written to the leaderboard store, which in the published design was Cassandra with Redis caching in front, so the constant leaderboard reads during the match stay fast.
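One cycle of that batch (group entries by contest, then rank) can be sketched in plain Python. This is a toy stand-in for the Spark job, not Aryabhata itself: the function name and the tie rule are assumptions. Ties share a rank and the next rank is skipped, which is one common convention and may differ from Dream11's.

```python
from collections import defaultdict

def recompute_leaderboards(entries, team_points):
    """Run one recompute cycle.

    entries: iterable of (contest_id, user_id, team_id)
    team_points: dict mapping team_id to its current total points
    Returns {contest_id: [(rank, user_id, points), ...]} sorted best first.
    """
    # Group every entry under its contest, attaching the team's current points.
    by_contest = defaultdict(list)
    for contest_id, user_id, team_id in entries:
        by_contest[contest_id].append((user_id, team_points.get(team_id, 0)))

    boards = {}
    for contest_id, rows in by_contest.items():
        # Highest points first; user_id breaks ties for a stable order.
        rows.sort(key=lambda row: (-row[1], row[0]))
        board = []
        for position, (user_id, points) in enumerate(rows, start=1):
            # Assumed tie rule: equal points share the earlier rank.
            tied = board and board[-1][2] == points
            rank = board[-1][0] if tied else position
            board.append((rank, user_id, points))
        boards[contest_id] = board
    return boards

entries = [
    ("c1", "u1", "t1"), ("c1", "u2", "t2"), ("c1", "u3", "t3"),
    ("c2", "u1", "t1"), ("c2", "u3", "t3"),
]
team_points = {"t1": 120, "t2": 150, "t3": 120}
boards = recompute_leaderboards(entries, team_points)
print(boards["c1"])  # [(1, 'u2', 150), (2, 'u1', 120), (2, 'u3', 120)]
```

Because each contest is ranked independently, the work partitions cleanly by contest_id, which is what lets a distributed engine spread thousands of contests across workers and still finish inside a fixed cycle.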
The immutable, append-only scoring pipeline
Live sports scoring has a hard requirement most systems do not: decisions can be reversed. A third umpire overturns a call, a catch is deemed not out, and the points already awarded must be corrected. Dream11 designed the scoring as an immutable, append-only pipeline using multi-version concurrency control and snapshot isolation. Every scoring event is an insert, never an in-place update, so the state at any point is reconstructable, events can be processed concurrently without corrupting each other, and a reversal is handled by appending a correcting event and recomputing, rather than trying to mutate shared state under load. This is what lets the platform score safely and quickly while the match is still in flux.
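Handling a reversal by appending a correcting event can be illustrated with a small in-memory log. This is a simplified sketch: the `ScoreLog` class, the sequence numbers, and the point values are illustrative assumptions, and reading the log up to a sequence number only gestures at the snapshot reads that real multi-version concurrency control provides.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: an event is never edited after it is written
class ScoreEvent:
    seq: int            # position in the log, strictly increasing
    player_id: str
    event: str
    points_delta: int

class ScoreLog:
    """Append-only log of scoring events; totals are always derived, never stored."""

    def __init__(self):
        self._events = []

    def append(self, player_id, event, points_delta):
        ev = ScoreEvent(len(self._events) + 1, player_id, event, points_delta)
        self._events.append(ev)
        return ev

    def reverse(self, original):
        """Correct an earlier event by appending its negation, not by mutating it."""
        return self.append(original.player_id,
                           f"reversal_of:{original.seq}",
                           -original.points_delta)

    def totals(self, as_of=None):
        """Player totals from the whole log, or from a snapshot up to seq as_of."""
        out = defaultdict(int)
        for ev in self._events:
            if as_of is not None and ev.seq > as_of:
                break
            out[ev.player_id] += ev.points_delta
        return dict(out)

log = ScoreLog()
wicket = log.append("bowler_1", "wicket", 25)   # illustrative point values
catch = log.append("fielder_7", "catch", 8)
log.append("batter_3", "four", 1)
# The third umpire overturns the dismissal: append corrections.
log.reverse(wicket)
log.reverse(catch)

print(log.totals(as_of=3))  # {'bowler_1': 25, 'fielder_7': 8, 'batter_3': 1}
print(log.totals())         # {'bowler_1': 0, 'fielder_7': 0, 'batter_3': 1}
```

The state before the reversal is still readable at `as_of=3`, so a leaderboard cycle that started from that snapshot stays internally consistent, and the next cycle picks up the correction without any in-place update.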
Data stores: matching each workload to the right engine
Dream11's design uses different stores for different jobs rather than one database for everything. MySQL holds the transactional data, registrations, team selection, contests, and payments, where ACID guarantees matter because money and fixed-capacity contests cannot drift. Aerospike serves the hot match-day workload as a combined persistence and caching layer, reported at around 1.8 million operations per second with sub-15-millisecond latency and 35 terabytes a day, adopted in place of an earlier relational persistence setup that could not keep up. In the published leaderboard design, Cassandra stored the leaderboard data for its high write throughput and tunable consistency, with Redis caching the hot reads. The social follow network is built separately on Amazon Neptune, a graph database, with Redis sorted sets for fast follower counts. The lesson is to pick the store per access pattern: strong consistency for money, a fast key-value layer for the match-day hot path, a write-optimized store for leaderboards, and a graph store for the social network.
Availability, cost, and the price of a bad minute
During a big match, downtime is extraordinarily expensive: Dream11 reported that a minute of downtime at peak cost around one million dollars, which is why the design targets 99.99 percent uptime and pre-provisions capacity rather than risk running short. Cost is the other side of the same coin, because the capacity needed for a big match is many times the everyday baseline, and paying for that around the clock would be wasteful. Dream11 managed this with the mix of spot and on-demand instances in Scaler and by scaling capacity in and out around each match, and separately cut compute cost by around 42 percent by migrating to more efficient instance types. The interview framing is that extreme reliability during the spike and cost efficiency the rest of the time are both requirements, and pre-scaling with a diverse instance mix is how you get both.
Trade-offs to discuss
Every senior interviewer expects you to surface at least 3 of these. Pick the decisions, state the alternatives, and justify your choice.
Prediction-based pre-scaling versus reactive autoscaling
Reactive autoscaling is standard and simple, but it provisions too slowly for a spike that peaks in minutes and is triggered by an external event like the toss. Predicting concurrency ahead of time and pre-provisioning against it, as Dream11 did with Scaler, means capacity is ready before the surge. The cost is the complexity of building and trusting a forecasting model and paying for capacity slightly ahead of need, accepted because being caught short during a big match is far more expensive.
MySQL for contests and money versus a single scalable NoSQL store
A NoSQL store scales writes easily, but a fixed-capacity contest and a wallet cannot tolerate the weak guarantees, since overselling spots or losing entry fees is unacceptable. Dream11 kept this data in MySQL for ACID guarantees, and offloaded the high-volume, less-critical match-day traffic to Aerospike. The cost is running more than one kind of store and deciding what belongs where, accepted because correctness on money and capacity is non-negotiable.
Recomputing leaderboards on a fixed cycle versus updating per event
Updating every affected user the instant a ball is bowled would give the freshest ranks but is an impossible fan-out at this scale, since one event touches millions of entries across thousands of contests. Recomputing on a short fixed cycle, as Dream11 did within a 60-second service level, bounds the work into a repeatable batch that finishes predictably. The cost is that ranks can be up to a cycle stale, which is an acceptable trade for being able to serve them to everyone at all.
Immutable append-only scoring versus mutable in-place updates
Updating scores in place is simpler, but live sports decisions get reversed, and mutating shared score state under heavy concurrent processing is error-prone. An immutable, append-only pipeline with snapshot isolation lets events be processed concurrently and corrections be handled by appending and recomputing, so the state is always reconstructable and safe. The cost is more storage and a recompute step, accepted because correctness under reversal and concurrency is essential.
Aerospike as a combined persistence and cache versus a relational persistence layer
A relational store as the persistence layer is familiar, but it could not keep up with the match-day operation rate. Aerospike as a combined persistence and caching layer served around 1.8 million operations per second at sub-15-millisecond latency, removing a separate cache tier for that workload. The cost is operating a specialized store and keeping the data model suited to it, accepted because the match-day throughput demanded it.
Spot and mixed instances versus all on-demand capacity
Running all peak capacity on on-demand instances is simplest and most reliable but very expensive for capacity used only around matches. Using a diverse mix that includes spot instances, as Scaler does, cuts cost sharply. The cost is that spot capacity can be reclaimed, so the design must tolerate losing some instances and diversify across types and pools, which Dream11 accepted to make peak capacity affordable.
How Dream11 actually does it
Most of this comes from Dream11's own engineering blog and AWS case studies, and describes the platform during its paid-contest era. Dream11 reported crossing 200 million registered users in October 2023, handling more than 5.5 million concurrent users during the IPL 2020 final, with later peaks reported around 10.5 million, and around 100 million requests per minute at its edge.
It described its in-house Scaler platform using a forecasting model based on XGBoost over 200-plus variables to pre-provision capacity in phases against concurrency slabs, using pre-baked images, mixed and spot instances, and weighted DNS, cutting scale-out time from hours to about five minutes and forecasting roughly 10 million active users about two hours before a major match, with more than 30,000 compute resources, about 700 auto-scaling groups, and around 200 load balancers at peak, at 99.99 percent uptime across 100-plus events a day.
Its published leaderboard engine, Aryabhata, used Apache Spark to recompute points and ranks across thousands of contests within a 60-second service level over tens of millions of records, with Cassandra as the store and Redis caching, and an immutable append-only design using multi-version concurrency control and snapshot isolation.
It reported adopting Aerospike as a combined persistence and caching layer at around 1.8 million operations per second with sub-15-millisecond latency and 35 terabytes a day, using MySQL for transactional data, running a social follow network on Amazon Neptune with Redis, and cutting compute cost by about 42 percent through an instance migration. All of it runs on AWS.
Three accuracy notes for the interview. First, the leaderboard figures are from Dream11's earlier published design and the platform will have evolved, so present them as how Dream11 described its live-scoring engine. Second, the scaling model is XGBoost, not a neural network, and some widely repeated figures such as a fixed joins-per-second number are not confirmed by primary sources, so avoid them. Third, Dream11 paused paid contests in August 2025 after India's online gaming law changed, so describe the paid-contest scale in past terms.
Sources
- AWS Game Tech, How Dream11 uses an in-house scaling platform (Scaler): pre-scaling, 190M+ users, 100-plus events a day, 700 auto-scaling groups, 99.99 percent uptime
- Dream11 Engineering, To Scale In Or Scale Out: 5.5M concurrent, 100M requests per minute, XGBoost prediction, the toss-driven spike
- Dream11 Engineering, Leaderboard at Dream11 (Aryabhata): Spark plus Cassandra plus Redis, 60-second recompute, snapshot isolation
- Aerospike, Dream11 customer story: about 10.5M concurrent, 1.8M operations per second, sub-15ms, 35TB a day, one million dollars per minute of downtime
- AWS Game Tech, Dream11 saved 42 percent compute cost by migrating instance families: 5.5M concurrent IPL 2020 final, 120M+ users
- AWS Database Blog, Dream11 scales its social network with Amazon Neptune and ElastiCache: graph plus Redis sorted sets
- Wikipedia, Dream11: user growth timeline, 200M users in October 2023, and the August 2025 paid-contest pause