Design Swiggy: System Design Interview Guide
Swiggy serves 200,000+ restaurants across 580+ cities with a fleet of 390,000+ delivery partners, and dinner alone drove 215 million orders in 2024 — roughly 29% more than lunch — with demand crushed into two 90-minute windows a day.
Designing Swiggy means solving three coupled problems at once: serviceability (which restaurants can even reach this customer), dispatch (which delivery partner picks up which order, often batched), and a three-party order state machine that survives a restaurant rejecting an order, a partner going offline mid-trip, or a payment webhook arriving late. The hard part is that most of the load (70-80%) lands in two short meal peaks, so the system is sized for a sustained peak of roughly 5-6x its daily average, with bursts running higher, and sits mostly idle the rest of the day.
Asked at: Swiggy, Zomato, Zepto, Uber, DoorDash, Grab, Rapido, Meesho, and most Indian product companies (Flipkart, PhonePe, Razorpay), typically in SDE2 and SDE3 rounds. It is the canonical hyperlocal logistics / geospatial matching problem for the Indian market, and a favorite because the lunch/dinner peak concentration forces a real capacity conversation.
Why this question is asked
Interviewers reach for Design Swiggy because a generic three-tier diagram falls apart immediately. You have three independent actors (customer, restaurant, delivery partner) whose actions interleave, a geospatial layer that has to answer "can this restaurant serve this address" in single-digit milliseconds, and an assignment problem that is genuinely an optimization (cost minimization under capacity and time-window constraints), not a lookup. On top of that, the traffic profile is brutal: 70-80% of a day's orders arrive in two narrow windows, so the candidate has to talk about peak provisioning, batching for efficiency, and graceful degradation when a city's kitchens are all slammed at 8:30 PM. It separates people who have only memorized a CDN diagram from people who can reason about a live multi-party system.
Requirements
Always clarify these in the first 5 minutes of the interview. Do not start drawing boxes until both lists are agreed.
Functional requirements
- Customer enters a delivery address; system returns only restaurants that can actually serve that address (serviceability)
- Customer browses a menu, builds a cart, and places an order with online payment or cash on delivery
- Restaurant receives the order on a partner tablet and accepts or rejects it within a short window
- System assigns a delivery partner to pick up the order, batching nearby orders where it improves efficiency
- Customer sees a live ETA and the delivery partner's location on a map once the order is picked up
- Order moves through a strict state machine: placed, confirmed, food being prepared, ready, picked up, on the way, delivered (or canceled / refunded)
- Delivery partner app shows the pickup, the drop, optimized route, and earnings for the trip
- Customer can rate the restaurant and the delivery partner after delivery
- Surge / dynamic delivery fee applied during peak demand when partner supply is tight
Non-functional requirements
- Serviceability and restaurant-listing reads under 100 ms at p99 (this is the home-screen hot path)
- Dispatch assignment decision within a few seconds of an order becoming ready-to-assign
- Live tracking location updates every 4-5 seconds per active trip with sub-second delivery to the customer app
- System sized for meal-peak load: sustained roughly 5-6x the daily average (bursts closer to 9-10x), concentrated in two 90-minute windows
- Strong consistency on order state transitions and payment capture; no double charges, no lost orders
- 99.9%+ availability for order placement and tracking; degrade gracefully (e.g., wider ETA, fewer batches) rather than fail under peak
- Geo-partitioned by city so a hot city does not starve others
Back-of-envelope scale estimates
Show your math. Pulling numbers from thin air signals you have not thought about the load.
Daily orders
~6-7M/day
Reported public figures put Swiggy food delivery in the low single-digit millions of orders per day in 2024, so ~6M is a deliberately generous round working number for capacity math; an interviewer cares about the method, not the exact figure.
Peak orders per second
~600-700 OPS
If 70% of ~6M daily orders land in roughly 3 hours of combined lunch+dinner peak, that is ~4.2M orders / ~10,800 s ≈ 390 OPS average across the peak, and bursts inside the peak push the instantaneous rate to ~600-700 OPS. Average over a full day is only ~70 OPS, which is the whole point: provision for the peak, not the mean.
Active delivery partners (peak concurrent)
~150K-200K
390,000+ total partners; peak concurrency around 40-50% during dinner. Each emits a location ping every 4-5 seconds while on an active trip or idle-but-online.
Location pings per second (peak)
~35K-45K/s
~180K online partners / 4.5 s ping interval. Written to an in-memory geo index, never to the durable order store at full fidelity.
Serviceability + listing reads
~50K-100K reads/s at peak
Every home-screen open and address change triggers a point-in-polygon serviceability check plus a restaurant-list fetch. This dwarfs order writes (~600 OPS) by two orders of magnitude, so it is the read path that must be cached aggressively.
Restaurants / cities
200K+ restaurants, 580+ cities
Public 2024 figures. Drives the size of the serviceability polygon index and the per-city sharding strategy.
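The arithmetic behind these estimates fits in a few lines. The sketch below only replays this section's working numbers (6M orders a day, 70% of them in about 3 peak hours, 180K online partners, a 4.5 s ping interval); the constant and function names are illustrative and not taken from any Swiggy source.

```python
# Back-of-envelope capacity math using the working numbers from this section.
DAILY_ORDERS = 6_000_000      # round working number, not a reported figure
PEAK_SHARE = 0.70             # fraction of orders that land inside the meal peaks
PEAK_SECONDS = 3 * 3600       # ~3 hours of combined lunch + dinner peak
SECONDS_PER_DAY = 24 * 3600

ONLINE_PARTNERS = 180_000     # mid-point of the 150K-200K concurrency estimate
PING_INTERVAL_S = 4.5         # mid-point of the 4-5 s ping cadence


def sustained_peak_ops() -> float:
    """Average orders per second across the peak windows."""
    return DAILY_ORDERS * PEAK_SHARE / PEAK_SECONDS


def daily_average_ops() -> float:
    """Average orders per second if load were spread over the whole day."""
    return DAILY_ORDERS / SECONDS_PER_DAY


def peak_pings_per_second() -> float:
    """Location pings per second from all online partners."""
    return ONLINE_PARTNERS / PING_INTERVAL_S


if __name__ == "__main__":
    ratio = sustained_peak_ops() / daily_average_ops()
    print(f"sustained peak: {sustained_peak_ops():.0f} orders/s")
    print(f"daily average:  {daily_average_ops():.0f} orders/s")
    print(f"peak-to-average ratio: {ratio:.1f}x")
    print(f"location pings: {peak_pings_per_second():.0f}/s")
```

Run as is, this gives about 389 orders/s sustained across the peak, about 69 orders/s averaged over the day (a 5.6x ratio), and 40,000 pings/s. The 600-700 OPS burst figure is a judgment call layered on top of the sustained number, not something the arithmetic produces.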
High-level architecture
Mentally split Swiggy into four planes that talk through an event backbone (Kafka), because conflating them is where candidates lose the room.
Plane 1 is the discovery / read path. A customer opens the app, the client sends a delivery lat/lng, and the Serviceability service answers "which restaurants can reach you." This is a point-in-polygon problem: cities are carved into delivery zones (polygons), and doing a raw PIP check against thousands of polygons per request would be too slow, so Swiggy builds a GeoHash index where each polygon is registered against the GeoHash cells it overlaps. At request time you compute the customer's GeoHash key, fetch the small candidate set of overlapping zones from memory, run the precise PIP only on those, and return the eligible restaurants. The catalog (menus, prices, ratings) is served from a search/listing store (think Elasticsearch + a heavy read cache) because this path runs at 50K-100K reads/s and must stay under 100 ms.
Plane 2 is order placement and the state machine. The customer places an order; an Order service writes the order durably (Aurora/Postgres-class transactional store, sharded by city), captures payment idempotently, and emits an order.placed event. The restaurant's partner tablet, connected over a push channel, receives the order and accepts or rejects. Every transition (placed -> confirmed -> preparing -> ready -> picked_up -> on_the_way -> delivered, plus rejected/canceled/refunded branches) is a guarded finite-state-machine step persisted atomically, with each change appended to an order_events log for audit. This is the strongly-consistent core; it cannot be eventually consistent or you get double charges and lost orders.
Plane 3 is dispatch, the genuinely hard, genuinely interesting part. Assignment does NOT fire greedily the instant an order is placed. The system waits until the order is close to ready (so the partner arrives just as the food is, minimizing both customer wait and partner idle time), pools the orders that are ready-to-assign in a city zone with the partners available nearby, and solves a Mixed-Integer Program every few seconds: a cost matrix of (partner x batch) where the objective minimizes total wait/cost across the zone, subject to capacity (a partner's bag/load limit), uniqueness (one batch to exactly one partner), and time windows (arrive after food is ready, before it gets cold). Batching (combining two orders from one restaurant going to nearby drops, or two nearby restaurants to nearby drops) is what makes peak economics work; it is a small Vehicle Routing Problem inside each batch. The ETA shown to the customer comes from a separate ML model (Swiggy's is a multi-output network that jointly predicts assignment delay, first-mile, kitchen wait, and last-mile).
Plane 4 is live tracking. Once a partner is on a trip, their app pings location every few seconds into an in-memory geo store; a tracking service streams that to the customer app over WebSocket/push and recomputes ETA. Pings are never written at full fidelity to the order DB; they live in Redis-class memory and a downsampled trace goes to the data lake for analytics.
Across all four planes, Kafka carries the events (order placed, accepted, assigned, picked up, delivered) that fan out to notifications, surge computation, analytics, and the partner-payout pipeline.
In a real interview, sketch this on the whiteboard before diving into any single box.
Core components
Walk through each service. The interviewer wants to hear what each one owns, not just the names.
Serviceability service
Answers 'which restaurants can deliver to this address' in under 100 ms. Cities are modeled as delivery zone polygons; a GeoHash index registers each polygon against the cells it overlaps so a request resolves to a tiny candidate set, then runs precise point-in-polygon only on those. Also computes road-distance (not straight-line) from restaurant to drop to decide eligibility and base delivery fee.
Catalog / listing + search
Serves restaurant lists, menus, prices, photos, and ratings. Read-heavy (50K-100K reads/s at peak), backed by Elasticsearch for search/filter and an aggressive cache (Redis/CDN) for the home feed. Personalization and ranking are layered on top. Decoupled from the transactional order store.
Order service + state machine
The strongly-consistent core. Persists the order, captures payment idempotently, and enforces the finite state machine across all three actors. Each transition is atomic and appended to an immutable order_events log. Sharded by city; this is where double-charge and lost-order bugs live, so it gets the strictest consistency guarantees.
Restaurant partner channel
Push connection (long-lived socket / FCM fallback) to the restaurant tablet. Delivers new orders, collects accept/reject, and surfaces prep-time signals (the orders-placed vs orders-prepared ratio that feeds kitchen-stress into the ETA model). A reject triggers customer notification and refund flow.
Dispatch / assignment engine
Runs the optimization. Pools ready-to-assign orders and nearby available partners per zone and solves a Mixed-Integer Program every few seconds — cost matrix over (partner x batch), minimizing total cost/wait under capacity, uniqueness, and time-window constraints. Owns just-in-time assignment timing and order batching (a small VRP per batch).
Geo / location index
In-memory store of every online partner's live position, keyed by GeoHash/H3 cell so proximity queries are cheap. Ingests 35K-45K pings/s at peak via Kafka. Feeds both dispatch (find nearby idle partners) and tracking. Never durably persisted at full fidelity.
ETA prediction service
ML model that jointly predicts five outputs via a multi-input/multi-output network: four legs of delivery time (O2A, i.e. order-to-assignment; first-mile; kitchen wait; last-mile) plus their total, order-to-reach. Features: restaurant type, item count/complexity, live kitchen stress, partner availability in the zone, historical road speeds, live GPS. Optimizes not just MAE but also the rate of jarring ETA 'bumps' shown to the customer.
Live tracking service
Streams partner location and live ETA to the customer app over WebSocket/push at 4-5 s cadence. Reads from the in-memory geo index, not the order DB. Handles GPS gaps and partner disconnects by holding the last known position and widening the ETA.
Surge / pricing service
A streaming job (Flink-class) aggregates demand (orders in a zone) vs supply (available partners in a zone) per minute per GeoHash cell, publishes a smoothed, clamped delivery-fee multiplier to a cache, and the order service reads it at checkout. Locked in at order time so the customer pays what they agreed to.
Notification + event backbone
Kafka carries order.placed, .confirmed, .assigned, .picked_up, .delivered events that fan out to push notifications (customer + restaurant + partner), surge computation, partner-payout accrual, and the analytics lake. Decouples slow side-effects from the synchronous order path.
Data model
Pick the right store per table. Justify each choice with the access pattern, not by reflex.
restaurants
Columns: restaurant_id (PK), name, lat, lng, city_id, zone_id, prep_time_p50_minutes, is_online, rating, geohash
Listing/discovery data. Lives in the catalog store + search index, heavily cached. geohash and zone_id let serviceability narrow the candidate set fast. is_online flips frequently (kitchen busy, closed, out of an item).
delivery_zones
Columns: zone_id (PK), city_id, polygon (geometry), geohash_cells[], is_active
The serviceability polygons. polygon is the precise boundary used for point-in-polygon; geohash_cells is the denormalized list of cells the polygon overlaps, used to build the in-memory GeoHash index that avoids scanning every polygon per request.
orders
Columns: order_id (PK), customer_id, restaurant_id, zone_id, state, items (jsonb), subtotal_cents, delivery_fee_cents, surge_multiplier, payment_id, placed_at, delivered_at
The transactional core. Sharded by city_id / zone_id. state is the current FSM state. Strong consistency required. Append-mostly; mutations are state transitions, each mirrored into order_events.
order_events
Columns: event_id (PK), order_id (FK), from_state, to_state, actor (customer|restaurant|partner|system), reason, created_at
Immutable audit log of every state transition. Lets you reconstruct exactly who did what when, which is essential for disputes, refunds, and replaying state if the orders row is ever corrupted.
delivery_partners
Columns: partner_id (PK), name, vehicle_type, is_online, current_zone_id, bag_capacity, rating
Static-ish partner profile. Live position is NOT here; it lives in the in-memory geo index. bag_capacity feeds the dispatch capacity constraint. is_online and current_zone_id scope the candidate pool for assignment.
partner_locations
Columns: partner_id (PK), lat, lng, heading, geohash, updated_at
In-memory only (Redis/custom geo service), keyed by geohash so proximity queries are O(small). Overwritten every 4-5 s. A downsampled trace (one point per ~30 s, completed trips only) is logged to the data lake; idle pings are discarded on expiry.
assignments
Columns: assignment_id (PK), partner_id, batch_id, order_ids[], assigned_at, pickup_eta, drop_eta, status
Output of the dispatch optimizer. A batch can hold multiple orders (the batching case). status tracks accepted/picked_up/completed/reassigned. Reassignment (partner cancels / goes unreachable) creates a new assignment row and re-enters the order into the next optimization round.
payments
Columns: payment_id (PK), order_id (FK), amount_cents, method (upi|card|cod|wallet), status, idempotency_key, provider_ref
Strongly consistent. Indexed by order_id and idempotency_key so a retried capture never double-charges. UPI dominates in the Indian market; COD bypasses pre-capture and reconciles on delivery. Provider webhooks update status asynchronously.
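To make the idempotency_key column concrete, here is a minimal in-memory sketch of an idempotent capture. PaymentLedger, its capture method, and the provider_calls counter are hypothetical names for illustration; a real system would rely on a unique index on idempotency_key inside a database transaction rather than a Python dict.

```python
from dataclasses import dataclass


@dataclass
class Payment:
    payment_id: str
    order_id: str
    amount_cents: int
    status: str


class PaymentLedger:
    """Toy stand-in for the payments table, keyed by idempotency_key."""

    def __init__(self) -> None:
        self._by_key: dict[str, Payment] = {}
        self.provider_calls = 0  # counts how often we would hit the payment provider

    def capture(self, idempotency_key: str, order_id: str, amount_cents: int) -> Payment:
        existing = self._by_key.get(idempotency_key)
        if existing is not None:
            # A retry with the same key returns the first result instead of charging again.
            if (existing.order_id, existing.amount_cents) != (order_id, amount_cents):
                raise ValueError("idempotency key reused with different parameters")
            return existing
        self.provider_calls += 1  # stands in for the real provider charge call
        payment = Payment(f"pay_{self.provider_calls}", order_id, amount_cents, "captured")
        self._by_key[idempotency_key] = payment
        return payment
```

Calling capture twice with the same key and parameters yields the same Payment and leaves provider_calls at 1, which is the property a late or retried webhook depends on.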
Deep dives
These are the conversations the interviewer is steering you toward. Practice each one until you can talk through it without notes.
Serviceability: answering 'can this restaurant reach me' in under 100 ms
This is the home-screen hot path and runs at 50K-100K reads/s, far above order writes, so it must be fast and cacheable. Cities are carved into delivery zone polygons. The naive approach — run point-in-polygon for the customer's location against every polygon — is O(zones) per request and dies at scale. Swiggy's published approach builds a GeoHash index: pick a GeoHash resolution, and for each zone, register it against every GeoHash cell its polygon overlaps. At request time you compute the customer's GeoHash cell key, fetch from memory the small set of zones associated with that cell, then run precise PIP only on that handful. From the matched zone(s) you do a directional discovery of restaurant clusters and return the eligible list. Two refinements interviewers love: (1) eligibility uses actual road distance the partner will travel, not straight-line haversine, because a river or highway can make a 1 km haversine into a 6 km drive; (2) the whole result is cacheable per GeoHash cell for a short TTL, so two customers in the same cell share the computation.
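The lookup described above can be sketched in a few dozen lines. This is an illustration under stated assumptions, not Swiggy's implementation: a fixed lat/lng grid (CELL_DEG) stands in for GeoHash cells, each zone is registered against every cell its bounding box touches (a superset of the cells it truly overlaps, which the precise check then filters), and ZoneIndex, cell_key, and the zone ids are hypothetical names.

```python
import math
from collections import defaultdict

CELL_DEG = 0.01  # hypothetical cell size in degrees; stands in for a GeoHash precision

Point = tuple[float, float]  # (lat, lng)


def cell_key(lat: float, lng: float) -> tuple[int, int]:
    """Grid cell containing the point; a stand-in for a GeoHash cell key."""
    return (math.floor(lat / CELL_DEG), math.floor(lng / CELL_DEG))


def point_in_polygon(pt: Point, polygon: list[Point]) -> bool:
    """Ray casting: toggle 'inside' each time a ray from pt crosses an edge."""
    lat, lng = pt
    inside = False
    for i in range(len(polygon)):
        lat1, lng1 = polygon[i]
        lat2, lng2 = polygon[(i + 1) % len(polygon)]
        if (lat1 > lat) != (lat2 > lat):
            # Longitude at which this edge crosses the point's latitude.
            cross_lng = lng1 + (lat - lat1) * (lng2 - lng1) / (lat2 - lat1)
            if lng < cross_lng:
                inside = not inside
    return inside


class ZoneIndex:
    def __init__(self) -> None:
        self._zones: dict[str, list[Point]] = {}
        self._cells: dict[tuple[int, int], set[str]] = defaultdict(set)

    def add_zone(self, zone_id: str, polygon: list[Point]) -> None:
        """Register the zone against every cell its bounding box touches."""
        self._zones[zone_id] = polygon
        lo = cell_key(min(p[0] for p in polygon), min(p[1] for p in polygon))
        hi = cell_key(max(p[0] for p in polygon), max(p[1] for p in polygon))
        for row in range(lo[0], hi[0] + 1):
            for col in range(lo[1], hi[1] + 1):
                self._cells[(row, col)].add(zone_id)

    def zones_for(self, lat: float, lng: float) -> list[str]:
        """Fetch the cell's candidates, then run precise PIP only on those."""
        candidates = self._cells.get(cell_key(lat, lng), ())
        return sorted(z for z in candidates if point_in_polygon((lat, lng), self._zones[z]))
```

The index must be rebuilt when zone polygons change, and the cell size trades memory against candidate-set size. Road-distance eligibility and per-cell result caching would sit on top of zones_for and are not shown.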
Just-in-time dispatch and the assignment optimization
The single most common mistake is assigning a partner the instant the order is placed. If the kitchen needs 20 minutes and you assign immediately, the partner sits idle at the restaurant for 18 minutes — wasted supply during a peak when supply is the bottleneck. So assignment is deliberately delayed until the order is close to ready, using the predicted prep time. When a pool of orders in a zone becomes ready-to-assign, the engine builds a cost matrix over (available partner x candidate batch) and solves a Mixed-Integer Program roughly every few seconds. The objective minimizes total cost / wait across the zone (a weighted sum of partner idle time, customer wait, and travel cost). Constraints: capacity (a partner can't carry more than their bag/load limit), uniqueness (a batch goes to exactly one partner, a partner gets at most one batch this round), and time windows (arrive after the food is ready but before it cools). This is solved with a MIP/linear-sum-assignment solver, not a greedy nearest-partner loop, because greedy is locally fine but globally leaves money on the table across a zone with hundreds of simultaneous orders. The key talking point: batch-and-optimize per zone every few seconds, don't decide per-order in isolation.
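A toy comparison shows why greedy leaves money on the table. In this sketch, brute force over permutations stands in for the MIP or linear-sum-assignment solver, so it is only usable for tiny pools; the cost values are made up, rows are batches and columns are partners, and it assumes at least as many partners as batches.

```python
from itertools import permutations

Cost = list[list[float]]  # cost[batch][partner], e.g. minutes of wait plus travel


def total_cost(cost: Cost, assignment: list[int]) -> float:
    """Sum the cost of giving batch i to partner assignment[i]."""
    return sum(cost[b][p] for b, p in enumerate(assignment))


def greedy_assign(cost: Cost) -> list[int]:
    """Each batch, in arrival order, grabs its cheapest still-free partner."""
    taken: set[int] = set()
    assignment: list[int] = []
    for row in cost:
        free = [p for p in range(len(row)) if p not in taken]
        partner = min(free, key=lambda p: row[p])
        taken.add(partner)
        assignment.append(partner)
    return assignment


def optimal_assign(cost: Cost) -> list[int]:
    """Try every one-to-one assignment and keep the cheapest (tiny pools only)."""
    n_batches, n_partners = len(cost), len(cost[0])
    candidates = permutations(range(n_partners), n_batches)
    return list(min(candidates, key=lambda a: total_cost(cost, list(a))))


if __name__ == "__main__":
    cost = [[1, 2], [2, 10]]  # made-up costs for two batches and two partners
    for name, fn in (("greedy", greedy_assign), ("optimal", optimal_assign)):
        result = fn(cost)
        print(name, result, total_cost(cost, result))
```

With the made-up matrix, greedy gives batch 0 its nearest partner and strands batch 1 with a cost of 10, for a total of 11; the global assignment swaps them for a total of 4. A production engine would use a proper solver with a latency budget, plus the capacity and time-window constraints this sketch omits.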
Order batching and the Vehicle Routing Problem inside it
Batching is how Swiggy makes peak economics work — one trip carrying two orders roughly halves the per-order delivery cost. Two batchable cases: (a) two orders from the SAME restaurant going to nearby drops (pick up once, drop twice), and (b) orders from two NEARBY restaurants to nearby drops (pick up twice, drop twice). Once a batch is formed, you have a small Vehicle Routing / TSP problem: in what sequence does the partner visit the pickups and drops to minimize total time without letting the first order get cold while fetching the second? Constraints make it tractable: batch sizes are tiny (2, occasionally 3), so you can brute-force the few route permutations in milliseconds. The danger is over-batching during a peak: stuffing three orders on one partner to save cost can blow the ETA on the first order and tank the customer experience. So batching is gated by the same time-window constraint as assignment — a batch is only valid if every order in it still meets its freshness window.
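Because batches are tiny, the route search really can be brute force. The sketch below enumerates every stop order for a batch, discards orders that drop before pickup or exceed a freshness limit, and keeps the fastest. It assumes straight-line travel on a flat plane at a constant speed, which a real system would replace with road-network travel times; best_route, max_food_min, and speed_km_per_min are hypothetical names.

```python
import math
from itertools import permutations


def best_route(start, pickups, drops, max_food_min, speed_km_per_min=0.5):
    """Return (total_minutes, route) for the fastest valid stop order, or None.

    pickups and drops map order_id -> (x_km, y_km) on a flat plane.
    A route is valid if every pickup precedes its drop and no order spends
    more than max_food_min between pickup and drop.
    """
    stops = [("pickup", oid) for oid in pickups] + [("drop", oid) for oid in drops]
    best = None
    for route in permutations(stops):
        pos, clock, picked_at, valid = start, 0.0, {}, True
        for kind, oid in route:
            if kind == "drop" and oid not in picked_at:
                valid = False  # cannot drop food that has not been picked up
                break
            target = pickups[oid] if kind == "pickup" else drops[oid]
            clock += math.dist(pos, target) / speed_km_per_min
            pos = target
            if kind == "pickup":
                picked_at[oid] = clock
            elif clock - picked_at[oid] > max_food_min:
                valid = False  # this order would miss its freshness window
                break
        if valid and (best is None or clock < best[0]):
            best = (clock, route)
    return best
```

A None result is the gate described above: if no stop order keeps every order inside its freshness window, the batch is invalid and the orders should be dispatched separately. Two orders mean 24 permutations and three mean 720, so exhaustive search stays within a millisecond-scale budget at these sizes.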
Decomposed ETA prediction (and why a single number is wrong)
A naive ETA predicts one number end-to-end. Swiggy decomposes delivery time as Max(assignment_delay + first_mile, prep_time) + last_mile. The Max captures that the kitchen cooking and the partner riding to the restaurant happen IN PARALLEL, so the binding constraint is whichever finishes later. Concretely the model predicts five interdependent outputs: four legs, namely O2A (order-to-assignment), first-mile (partner to restaurant), kitchen wait, and last-mile (restaurant to customer), plus their total, order-to-reach. (The four legs sum to the Max expression if kitchen wait is read as the time the partner spends waiting at the restaurant after arriving, which is zero when the food is ready first.) These are trained jointly with a multi-input/multi-output network rather than five separate models, because the legs are coupled: the dispatch engine deliberately times assignment so the partner arrives as food is ready, so O2A and first-mile depend on each other. Features include restaurant type (cloud kitchen vs dine-in), item count and prep complexity, live kitchen stress (orders placed vs prepared ratio), partner availability in the zone, historical road speeds around the locations, and live GPS pings. Beyond minimizing mean absolute error, the team explicitly tracks 'inaccurate bumps', sudden ETA jumps that make the customer anxious, because a stable-but-slightly-wrong ETA beats a jittery one.
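The decomposition is easy to check numerically. The sketch below encodes the Max formula and, as an assumption about how to read "kitchen wait", treats it as the time the partner waits at the restaurant after arriving. Under that reading the four legs add up to the same order-to-reach value. Function names and the sample minutes are illustrative.

```python
def order_to_reach(o2a: float, first_mile: float, prep_time: float, last_mile: float) -> float:
    """Max(assignment_delay + first_mile, prep_time) + last_mile, all in minutes."""
    return max(o2a + first_mile, prep_time) + last_mile


def partner_wait(o2a: float, first_mile: float, prep_time: float) -> float:
    """Minutes the partner waits at the restaurant; zero if the food is ready first."""
    return max(0.0, prep_time - (o2a + first_mile))


if __name__ == "__main__":
    o2a, fm, prep, lm = 8, 7, 20, 12  # made-up minutes for one order
    wait = partner_wait(o2a, fm, prep)
    print("order-to-reach:", order_to_reach(o2a, fm, prep, lm))
    print("sum of legs:   ", o2a + fm + wait + lm)
```

With a 20-minute prep the kitchen is the binding constraint (the partner arrives at minute 15 and waits 5), so order-to-reach is 32 minutes either way. Drop prep to 10 minutes and the partner's arrival binds instead, giving 27 minutes with zero wait.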
The three-party order state machine and failure branches
The order FSM has a clean happy path — placed -> confirmed (restaurant accepts) -> preparing -> ready -> picked_up -> on_the_way -> delivered — but the interview is really about the unhappy branches, because three independent humans can each break the flow. Restaurant rejects (out of an item, too busy): order goes to rejected, payment auto-refunds, customer is notified, and if it was a batch, the batch is re-optimized. No partner accepts / assigned partner cancels: the order re-enters the next dispatch round; after N failures or a timeout, escalate (surge the fee, widen the radius, or cancel with refund). Partner goes unreachable mid-trip (phone dies, GPS gap): tracking holds last-known position and widens ETA; if no ping for a threshold, ops/auto-reassign kicks in. Payment webhook arrives late: the order can sit in a pending_payment state with a timeout, and idempotency keys ensure a delayed-then-retried capture never double-charges. Every transition is guarded (you can't go from preparing straight to delivered) and written atomically with an entry in order_events, so the system can always answer 'what state is this order in and how did it get there.'
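A guarded state machine can be sketched as a transition table plus one method that refuses anything not in it. The happy-path state names come from the text above; which states allow cancellation, and the pending_payment -> placed edge, are assumptions made for illustration. In a real store, the state update and the order_events append would share one database transaction.

```python
from dataclasses import dataclass, field

# Allowed transitions. The cancel edges are an illustrative assumption.
TRANSITIONS: dict[str, set[str]] = {
    "pending_payment": {"placed", "canceled"},
    "placed": {"confirmed", "rejected", "canceled"},
    "confirmed": {"preparing", "canceled"},
    "preparing": {"ready", "canceled"},
    "ready": {"picked_up", "canceled"},
    "picked_up": {"on_the_way"},
    "on_the_way": {"delivered"},
    "rejected": {"refunded"},
    "canceled": {"refunded"},
    "delivered": set(),
    "refunded": set(),
}


class IllegalTransition(Exception):
    """Raised when a transition is not in the table."""


@dataclass
class Order:
    order_id: str
    state: str = "placed"
    # Each event is (from_state, to_state, actor), mirroring order_events.
    events: list[tuple[str, str, str]] = field(default_factory=list)

    def transition(self, to_state: str, actor: str) -> None:
        if to_state not in TRANSITIONS[self.state]:
            raise IllegalTransition(f"{self.state} -> {to_state} by {actor}")
        # Append the audit event and flip the state together.
        self.events.append((self.state, to_state, actor))
        self.state = to_state
```

An attempt to jump from preparing straight to delivered raises IllegalTransition and leaves both the state and the event log untouched, so the log always explains how an order reached its current state.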
Surviving the lunch/dinner peak (the part most candidates skip)
This is the requirement that makes Swiggy distinct from Uber. ~70-80% of a day's orders land in two ~90-minute windows; dinner peak in 2024 was ~29% larger than lunch. Daily-average sizing (~70 OPS) is irrelevant — you must provision for instantaneous peak (~600-700 OPS) and a partner pool that's 40-50% concurrent. Concrete tactics: (1) Autoscale the stateless order, listing, and tracking services ahead of the peak on a schedule, not reactively, because reactive scaling lags the 8 PM cliff. (2) Lean on batching harder during peak — it's both an efficiency win and a load-shedding lever, because each batch is one trip instead of two. (3) Apply surge to flatten demand at the edges of the peak and pull more partners online. (4) Degrade gracefully under extreme load: widen ETAs, temporarily mark slammed kitchens unavailable rather than letting them accumulate a 90-minute backlog, and shed non-critical work (defer analytics, recommendations) to protect the order path. (5) The dispatch optimizer's run cadence and zone size are tunable knobs — smaller zones and faster rounds during peak. The honest framing for an interviewer: this system is overprovisioned and idle most of the day, and that's an accepted cost of hyperlocal food delivery.
Trade-offs to discuss
Every senior interviewer expects you to surface at least 3 of these. Pick the decisions, state the alternatives, and justify your choice.
GeoHash polygon index vs raw point-in-polygon vs PostGIS
Raw PIP against every zone is simplest but O(zones) per request and too slow at 50K-100K reads/s. PostGIS with a spatial index works but adds DB load on the hottest read path. An in-memory GeoHash index (precompute which cells each polygon overlaps, look up by cell, then PIP only the few candidates) keeps the request in memory and sub-100ms. Cost: you must rebuild the index when zones change and pick a GeoHash resolution that balances candidate-set size against memory. Swiggy chose the in-memory GeoHash approach.
Just-in-time delayed assignment vs assign-on-order-placed
Assigning immediately is simpler and feels faster, but it pins a partner idle at the restaurant for the entire prep time — catastrophic when supply is the peak bottleneck. Delaying assignment until the order is near-ready, driven by predicted prep time, keeps partners productive but risks under-supply if the prediction is wrong and no partner is free when the food is ready. The delayed approach wins because partner idle time is the dominant cost lever at peak; you mitigate the risk by widening the candidate pool as readiness approaches.
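The timing trade-off reduces to simple arithmetic. This sketch assumes the system has a predicted ready time and a first-mile estimate for the order; the function names and the 2-minute safety buffer are hypothetical.

```python
def assignment_trigger_min(predicted_ready_min: float, first_mile_min: float,
                           buffer_min: float = 2.0) -> float:
    """Minutes after placement at which the order should enter the dispatch pool."""
    return max(0.0, predicted_ready_min - first_mile_min - buffer_min)


def partner_idle_min(assigned_at_min: float, first_mile_min: float, ready_min: float) -> float:
    """Minutes the partner waits at the restaurant before the food is ready."""
    return max(0.0, ready_min - (assigned_at_min + first_mile_min))


if __name__ == "__main__":
    ready, first_mile = 20.0, 2.0  # 20-minute kitchen, 2-minute ride to the restaurant
    print("assign on placement, idle:", partner_idle_min(0.0, first_mile, ready))
    trigger = assignment_trigger_min(ready, first_mile)
    print("just-in-time trigger at minute:", trigger)
    print("just-in-time idle:", partner_idle_min(trigger, first_mile, ready))
```

For a 20-minute kitchen and a 2-minute ride, assigning on placement idles the partner for 18 minutes, while triggering at minute 16 cuts that to the 2-minute buffer. The buffer is the hedge against a wrong prep-time prediction: shrink it and idle time drops, but the risk of food waiting for a partner rises.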
MIP / global optimization per zone vs greedy nearest-partner
Greedy (assign each order to its nearest free partner) is trivial to build and reason about, but it's locally optimal and globally wasteful — across a zone with hundreds of simultaneous orders it leaves batching and total-cost gains unrealized. A Mixed-Integer Program over the (partner x batch) cost matrix finds a near-global optimum every few seconds. Cost: solver complexity, a hard latency budget, and the need to bound problem size per zone. At Swiggy's order density the MIP pays for itself; a tiny new market could start greedy.
Batch orders aggressively vs single-order trips
Batching roughly halves per-order delivery cost and sheds load during peak (fewer trips), which is why it's essential to the economics. But over-batching delays the first order in the batch and degrades freshness and CX. The resolution is to allow batching only when every order in the batch still satisfies its freshness/time-window constraint, and to cap batch size small (2-3). Efficiency is bounded by experience, not maximized blindly.
In-memory geo index for live location vs durable writes
Writing 35K-45K pings/s to the durable order store would crush it and almost nothing reads historical pings in real time. Keeping live positions only in an in-memory GeoHash-keyed store gives cheap proximity queries and cheap writes; a downsampled trace goes to the data lake for analytics. Cost: a node failure loses live positions, but partners re-ping within seconds, so the index self-heals — an acceptable trade for the throughput.
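The self-healing property falls out of a very small data structure. The sketch below keeps the latest ping per partner, buckets partners by grid cell (a stand-in for GeoHash/H3 cells), and ignores anything older than a TTL. PartnerGeoIndex, cell_deg, and ttl_s are hypothetical; a production system would more likely sit on Redis or a dedicated geo service.

```python
import math


class PartnerGeoIndex:
    """In-memory latest-position index; nothing here is durable by design."""

    def __init__(self, cell_deg: float = 0.01, ttl_s: float = 15.0) -> None:
        self._cell_deg = cell_deg
        self._ttl_s = ttl_s
        self._pos: dict[str, tuple[float, float, float]] = {}  # id -> (lat, lng, updated_at)
        self._cells: dict[tuple[int, int], set[str]] = {}

    def _cell(self, lat: float, lng: float) -> tuple[int, int]:
        return (math.floor(lat / self._cell_deg), math.floor(lng / self._cell_deg))

    def ping(self, partner_id: str, lat: float, lng: float, now: float) -> None:
        """Overwrite the partner's position, moving them between cells if needed."""
        old = self._pos.get(partner_id)
        if old is not None:
            self._cells[self._cell(old[0], old[1])].discard(partner_id)
        self._pos[partner_id] = (lat, lng, now)
        self._cells.setdefault(self._cell(lat, lng), set()).add(partner_id)

    def nearby(self, lat: float, lng: float, now: float) -> list[str]:
        """Partners with a fresh ping in the 3x3 block of cells around the point."""
        row, col = self._cell(lat, lng)
        found = []
        for r in (row - 1, row, row + 1):
            for c in (col - 1, col, col + 1):
                for pid in self._cells.get((r, c), ()):
                    if now - self._pos[pid][2] <= self._ttl_s:
                        found.append(pid)
        return sorted(found)
```

If a node holding this index restarts empty, every online partner reappears on their next ping a few seconds later, which is why losing it is tolerable. The TTL check also keeps partners who silently went offline from being offered to dispatch.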
Strong consistency on orders/payments vs eventual everywhere
Discovery, listings, and tracking tolerate eventual consistency and stale caches fine. The order FSM and payment capture cannot — eventual consistency there means double charges, lost orders, or an order stuck between two states. So you split the system: eventually-consistent, cached, horizontally-scaled read planes around a small strongly-consistent transactional core (sharded SQL with atomic transitions and idempotency keys). Don't pay for strong consistency where you don't need it; never skip it on money and order state.
Surge to flatten peak demand vs fixed pricing
Fixed delivery fees are simpler and feel fairer to customers, but during the dinner cliff they leave demand unbounded while partner supply is fixed, producing long ETAs and failed assignments. A smoothed, clamped surge multiplier shaves demand at the peak's edges and pulls more partners online. Cost: customer friction and the risk of a runaway feedback loop, mitigated by smoothing over a window, capping the multiplier, and locking the price in at order time.
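The three mitigations (smooth, clamp, lock at order time) are simple to express. In this sketch the raw target grows linearly with the demand/supply ratio, an exponential moving average damps it, and a cap bounds it; the sensitivity, alpha, and cap values are invented for illustration and say nothing about Swiggy's actual pricing.

```python
def raw_multiplier(open_orders: int, available_partners: int, sensitivity: float = 0.5) -> float:
    """Unsmoothed target: 1.0 when supply covers demand, rising with the shortfall."""
    ratio = open_orders / max(available_partners, 1)
    return 1.0 + sensitivity * max(0.0, ratio - 1.0)


def next_multiplier(previous: float, open_orders: int, available_partners: int,
                    alpha: float = 0.3, cap: float = 2.0) -> float:
    """Move a fraction alpha toward the raw target, then clamp to [1.0, cap]."""
    target = raw_multiplier(open_orders, available_partners)
    smoothed = previous + alpha * (target - previous)
    return min(max(smoothed, 1.0), cap)


def locked_fee_cents(base_fee_cents: int, multiplier_at_checkout: float) -> int:
    """Fee stored on the order at checkout; later surge changes do not touch it."""
    return round(base_fee_cents * multiplier_at_checkout)
```

With 300 open orders against 100 partners and a previous multiplier of 1.0, the raw target is 2.0 but the published value only moves to about 1.3 in that minute, which is what prevents a one-minute spike from whipsawing prices. The streaming job would call next_multiplier once per minute per cell, and checkout would store the result of locked_fee_cents on the order.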
How Swiggy actually does it
Swiggy's engineering blog (Swiggy Bytes) documents most of this directly. Their serviceability platform really does use a GeoHash index over delivery-zone polygons with a point-in-polygon resolution step, computing actual road distance rather than straight-line. Dispatch is framed as a Mixed-Integer Program: a cost matrix matching delivery partners to batches, minimizing total wait/cost subject to capacity, uniqueness, and time-window constraints, re-solved every few seconds, with order batching modeled as a small Vehicle Routing Problem. The ETA system decomposes delivery time as Max(assignment_delay + first_mile, prep_time) + last_mile and predicts five outputs (O2A, first-mile, kitchen wait, last-mile, and the order-to-reach total) jointly via a multi-input/multi-output neural network that evolved from gradient-boosted trees; it optimizes for both MAE and the rate of jarring ETA 'bumps.' The backbone is Kafka, with transactional data in an Aurora/Postgres-class store and catalog/search in Elasticsearch. Scale figures cited (200K+ restaurants, 580+ cities, 390K+ delivery partners, dinner ~29% above lunch with ~215M dinner orders in 2024) are from Swiggy's own 2024 year-in-review and press. Order-per-second numbers here are estimates derived from public daily-order figures and the peak-concentration assumption; treat them as back-of-envelope, which is exactly what an interviewer wants you to show.
Sources
- Swiggy Bytes: Designing the Serviceability Platform at Swiggy for High Scale (Part 1)
- Swiggy Bytes: What Serviceability means at Swiggy
- Swiggy Bytes: The Swiggy Delivery Challenge (Part One)
- Swiggy Bytes: The Swiggy Delivery Challenge (Part Two)
- Swiggy Bytes: Logistic Zones for Assignment
- Swiggy Corporate: How India Swiggy'd Its Way Through 2024 (scale figures)
- Swiggy Annual Report FY 2023-24