Is PhonePe really the biggest UPI app in India?

Yes. NPCI data reported for January 2025 put PhonePe at about 48 percent of UPI transactions by volume and just over 50 percent by value, ahead of Google Pay and Paytm. The top three apps together are roughly 95 percent of all UPI, and PhonePe leads both on volume and, by a clearer margin, on value.

How does a UPI payment actually work end to end on PhonePe?

PhonePe acts as a Payment Service Provider between you, NPCI, and the banks. In a pay flow you initiate and approve with your UPI PIN, and the debit and credit happen at the banks over NPCI. In a collect flow a request is pushed to you to approve. The final status comes back asynchronously through an NPCI callback that can be delayed. So the payment is recorded first as pending with an idempotency key, and only moved to success or failure once the true outcome is known. This is the standard UPI flow that every app implements.

Why does PhonePe use sharded MySQL instead of one big database?

Because one database becomes a hard ceiling at national UPI volume. PhonePe has written that it runs sharded MySQL as a shared-nothing architecture, with a common sharding library across all databases, no scatter-gather queries on the user path so latency stays predictable, and no local data on service containers so data and services scale independently. The result is that no single database is a bottleneck and the system grows by adding shards.

What does PhonePe use for fast, real-time reads and fraud checks?

A layer built on Aerospike. PhonePe engineers have reported running many Aerospike clusters per site with over a trillion records, replicated active-active across sites, serving more than 500,000 queries per second with sub-millisecond reads. It holds balance and session data, a feature store, and fraud signals, so the real-time risk engine can score a payment inline before the money moves. Note that 500,000 is queries per second on this read layer, not payments per second.

What happens if my payment gets stuck or shows as pending?

NPCI can mark a transaction as pending or deemed, meaning the outcome is not yet known. The system must not just show failed and let you retry, because the money may have moved and a blind retry could double charge. Instead the transaction stays pending and a reconciliation service polls NPCI for the true status, matches the callback when it arrives, and drives the payment to a final state. Every step is idempotent, so a duplicate or late callback never marks it paid twice.

Does PhonePe run on the public cloud?

Mostly no. PhonePe runs its own on-premises data centers, reported as three sites including Mumbai and Bangalore plus a hybrid environment, orchestrated with Mesos and Marathon. At its scale and steady volume, owning the data centers can be cheaper over time and gives tighter control over latency and hardware, at the cost of large upfront investment and slower elasticity than the cloud.

How is designing PhonePe different from designing Paytm?

Both are UPI payment problems and share the same correctness core: idempotent money movement, a double-entry ledger, and reconciliation of pending payments. The PhonePe framing leans harder on scale and infrastructure, because PhonePe is the largest UPI app and has published a lot about how it scales: sharded shared-nothing MySQL, an Aerospike fast layer at hundreds of thousands of queries per second, a Kafka backbone at 100 billion events a day, and its own data centers. A strong PhonePe answer combines the payments correctness story with a credible horizontal-scale story.

System Design Interview Guide

PhonePe System Design Interview: UPI at National Scale

PhonePe handles close to half of all UPI payments in India. In January 2025 it processed more than 8 billion UPI transactions in a single month, about 48 percent of all UPI volume and just over 50 percent by value, and its event pipeline alone moves roughly 100 billion events a day.

Designing PhonePe is the India payments problem at national scale. You have to move real money over the NPCI UPI rails without ever creating or losing a rupee, make every step idempotent so a retry never double charges, and reconcile against the bank when a callback arrives late. On top of that correctness core, PhonePe is a study in scaling: a shared-nothing sharded MySQL ledger, an Aerospike layer serving real-time reads and fraud checks at very high throughput, a Kafka backbone carrying about 100 billion events a day, and its own on-premises data centers. The interview is as much about horizontal scale and availability as it is about money.

Asked at: PhonePe, Paytm, Razorpay, Cred, Google Pay, Amazon Pay, and most India fintech and FAANG-India teams, for SDE2 and above. It is the standard UPI payments and high-scale infrastructure interview in the Indian market. The PhonePe variant leans harder on the scaling and data-store questions, because PhonePe is the largest UPI player and has published a lot about how it scales.

Why this question is asked

Money movement is the one domain where eventual consistency is not an acceptable answer for the core state. The interviewer wants to see that you understand a payment as a distributed transaction across systems you do not control (the payer's bank, NPCI, and the payee's bank), that the network will fail in the middle, and that the only acceptable outcomes are fully done or fully reversed, with a customer who can see exactly what happened. The PhonePe framing adds a second axis: national scale. You are expected to talk about how the transactional store is sharded so no single database is a bottleneck, how a fast in-memory layer serves balance checks and fraud lookups in under a millisecond during a festival peak, how an event backbone carries about 100 billion events a day without the read load hurting writes, and why a company might run its own data centers instead of the public cloud. You earn the offer by combining idempotency, a strict ledger, and reconciliation with a credible horizontal-scale story. You lose it by drawing one database box and moving on.

Requirements

Always clarify these in the first 5 minutes of the interview. Do not start drawing boxes until both lists are agreed.

Functional requirements

User links a bank account, creates a UPI ID (VPA) like name@ybl, and sets a UPI PIN
User pays a person or merchant by VPA, QR code, phone number, or bank account (push or pay)
User approves a collect request raised by a merchant or another user (pull or collect) with the UPI PIN
Scan-and-pay at a merchant QR for person-to-merchant payments
User can check balance, see transaction history, and get an instant status for every payment
Value-added flows such as recharges, bill payments, and merchant checkout run on top of the payment core
Support and dispute handling for a payment that is stuck, failed, or debited-but-not-credited
Adjacent products such as insurance, lending, and stockbroking run on the same identity and payment rails

Non-functional requirements

Every payment is idempotent, so a retried request never causes a second debit
Strong consistency and durability on money movement and the ledger, so a rupee is never created or lost
Correctly handle NPCI callbacks that arrive late or out of order, and reconcile the true status of every pending payment
Sub-millisecond reads on the hot real-time paths, such as balance checks and fraud lookups, even at festival peak
Very high availability on the payment path, with non-critical features degraded first under stress
Horizontal scale to national UPI volume, with no single database or queue acting as a bottleneck
Real-time fraud and risk decisioning inline with the payment, not after the fact

Back-of-envelope scale estimates

Show your math. Pulling numbers from thin air signals you have not thought about the load.

UPI market position

~48% by volume, 50%+ by value

NPCI data reported for January 2025: PhonePe held about 48 percent of UPI transaction volume and just over 50 percent by value, the number one position by a wide margin. This is the most readily verifiable figure about PhonePe and the anchor for the scale story.

Monthly UPI transactions

8B+ / month (Jan 2025)

PhonePe processed more than 8 billion UPI transactions in January 2025 out of roughly 17 billion across all of UPI that month. Divided across the month that is on the order of 3,000 transactions per second on average, with festival and month-start peaks several times higher.
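
The division above can be written out explicitly. This is a minimal sketch of the arithmetic only: the 8 billion figure is the floor quoted in the text, and the peak multiplier is a hypothetical placeholder, since the text says only "several times higher".

```python
# Back-of-envelope: average UPI transactions per second for January 2025.
MONTHLY_TXNS = 8_000_000_000      # "more than 8 billion", used here as a floor
DAYS_IN_MONTH = 31                # January
SECONDS_PER_DAY = 24 * 60 * 60

avg_tps = MONTHLY_TXNS / (DAYS_IN_MONTH * SECONDS_PER_DAY)

# Hypothetical multiplier: the source does not publish a peak figure.
ASSUMED_PEAK_FACTOR = 5
peak_tps = avg_tps * ASSUMED_PEAK_FACTOR

print(f"average ~ {avg_tps:,.0f} TPS")
print(f"assumed peak ~ {peak_tps:,.0f} TPS")
```

In an interview, say the multiplier out loud as an assumption and size the system for the peak, not the average.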

Registered users

500M+

PhonePe crossed 500 million lifetime registered users in November 2023, described as roughly one in three Indians. This drives identity, VPA, and account-linking scale.

Merchants and coverage

47M+ merchants, 98.61% of pin codes

PhonePe's IPO filing reported about 47.19 million registered merchants covering 98.61 percent of India's pin codes as of September 2025. This drives the person-to-merchant payment path and the QR acceptance network.

Event backbone volume

~100B events/day on Kafka

PhonePe's engineering blog states its Kafka pipeline carries about 100 billion events per day. This is what feeds fraud, reconciliation, analytics, and downstream flows, and it is why the read and write paths are separated.

Real-time read throughput

500,000+ QPS, sub-ms reads

PhonePe engineers, quoted by Aerospike, report more than 500,000 queries per second on real-time transactional workloads with sub-millisecond reads. Note that this is queries per second on the fast read layer, not payment transactions per second, which PhonePe does not publish. Do not confuse the two in an interview.

High-level architecture

Split PhonePe into the money-movement core and the scaling infrastructure around it, because the interview rewards both. One honesty note first: the UPI PSP flow, the pending or DEEMED state, and reconciliation described below are the standard pattern that any UPI app must implement, not internals PhonePe has published. The data-store and infrastructure details, meaning sharded MySQL, Aerospike, the Kafka backbone, the in-house platforms, and the on-premises data centers, are things PhonePe has actually written about on its engineering blog, and those are called out as such.

The money-movement core is a UPI PSP flow. When a user pays, the app talks to PhonePe as a Payment Service Provider, which talks to NPCI, which routes to the payer bank and the payee bank. There are two shapes. In the pay or intent flow, the payer initiates and approves with a UPI PIN, and the debit and credit happen over NPCI. In the collect flow, a request is pushed to the payer, who approves it. In both cases the final status comes back asynchronously through an NPCI callback, and it can be delayed or arrive out of order. So the payment is written first as a pending record with an idempotency key, and only moved to success or failure when the true status is known. A double-entry ledger records every debit and matching credit so the books always balance, and every state change is appended to an immutable event log for audit and dispute handling.

The scaling infrastructure is what makes PhonePe distinctive. The primary transactional store is sharded MySQL, run as a strict shared-nothing architecture. PhonePe uses a common sharding library across all its MySQL databases, avoids scatter-gather queries in the user path, and keeps zero local data on the service containers, so a service and its data are decoupled and each can scale on its own. A fast layer built on Aerospike serves the reads that must be quick, such as balance checks, session data, a feature store, and fraud lookups, at more than 500,000 queries per second with sub-millisecond latency, replicated active-active across sites. An event backbone built on Kafka carries about 100 billion events a day, split into separate read and write clusters per data center so that scaling the read side never slows down writes, fed by an in-house two-tier ingestion path.

On top of these sit PhonePe's own platforms: a Payments Orchestrator that models any money flow, a Risk and Decisioning Engine that aggregates across the billions of daily events in real time to score fraud inline, and a Fulfillment Engine for the steps around a payment. All of this runs in PhonePe's own on-premises data centers, orchestrated with Mesos and Marathon and fronted by an edge stack of NGINX and Traefik with AnyCast routing and circuit breakers.

In a real interview, sketch this on the whiteboard before diving into any single box.

Core components

Walk through each service. The interviewer wants to hear what each one owns, not just the names.

API edge and gateway

The entry point for every app request. PhonePe has described an edge stack that uses Mesos and Marathon as the data-center orchestration layer, NGINX and Traefik as edge routers, AnyCast with a routing component for traffic steering, and Hystrix-style circuit breaking to shed load. It authenticates the request, applies rate limits, and routes to the right service.
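
To make the load-shedding idea concrete, here is a minimal sketch of a generic circuit breaker. It is not PhonePe's code or the Hystrix API; the class name, thresholds, and injectable clock are all illustrative assumptions.

```python
import time

class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, then rejects calls
    until `reset_after` seconds pass, when one trial call is allowed."""

    def __init__(self, max_failures=3, reset_after=5.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of piling more load on a sick dependency.
                raise RuntimeError("circuit open: shedding load")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The point to make in the interview is that a failing downstream service gets rejected quickly at the edge, which protects the payment path from queueing behind it.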

Payments Orchestrator

PhonePe's in-house platform, built on a flexible framework, that models any money flow as a sequence of steps. It coordinates the debit and credit legs, writes the pending record, applies the idempotency key, and drives the payment through its state machine. Keeping this generic is what lets PhonePe add new payment types and products without rebuilding the core.

Ledger and transaction store on sharded MySQL

The strongly consistent core, held in sharded MySQL under a shared-nothing design. A common sharding library spreads users and transactions across shards so no single database is a bottleneck, the user path avoids scatter-gather queries, and services hold no local data. The ledger is double-entry, so every debit has a matching credit and the books always balance.

UPI switch and NPCI connector

The component that speaks the UPI protocol to NPCI and the banks. It initiates the pay or collect flow, carries the UPI PIN step through to the payer's bank (the bank, not the PSP, is the party that verifies the PIN), and processes the asynchronous callback that reports the true outcome. It has to treat every callback as possibly duplicated or late, which is why idempotency and reconciliation live close to it. This flow is the standard UPI PSP pattern rather than a PhonePe-published internal.

Aerospike real-time store

PhonePe's low-latency layer, used for real-time transactions, balance and session reads, a feature store, and fraud lookups. PhonePe engineers report running many Aerospike clusters per site with over a trillion records, replicated active-active across sites, serving more than 500,000 queries per second with sub-millisecond reads. It replaced heavier workloads and cut the server footprint sharply.

Risk and Decisioning Engine

PhonePe's in-house fraud and risk platform, built on a generic entity store that does high-velocity real-time aggregation across the billions of events the system produces each day. It scores a payment for fraud inline, before the money moves, so a suspicious transaction can be held or blocked rather than reversed afterward.

Kafka event backbone

The asynchronous spine of the system, carrying about 100 billion events a day. PhonePe moved from a single Kafka cluster to separate write and read clusters per data center, so heavy read consumers never slow down the write path, and uses an in-house two-tier ingestion model with a disk buffer for durability and a fast direct path for latency-critical producers.

Reconciliation service

The safety net for money movement. It polls NPCI for the true status of any payment still pending past a threshold, matches every callback against the pending record, and drives the transaction to a final state. It must be idempotent, because a duplicate or late callback should never mark a payment paid twice. This is the standard UPI reconciliation pattern.

Fulfillment Engine and product services

The use-case-agnostic platform that handles the steps around a payment, with pre-checkout, checkout, and post-checkout stages, plus the services for recharges, bill payments, insurance, lending, and stockbroking that reuse the same identity and rails.

Data model

Pick the right store per table. Justify each choice with the access pattern, not by reflex.

accounts

account_id (PK), user_id, bank_ref, vpa, status, shard_key

The linked bank account and its VPA (UPI ID). Sharded by user or account id via the common sharding library. The vpa is the routable address other users pay to.

transactions

txn_id (PK), payer_account_id, payee_vpa, amount_paise, flow (pay|collect), state, idempotency_key, npci_ref, created_at

The payment record. Written first as pending with an idempotency key, then moved to success or failure once NPCI confirms. Sharded by payer. Strong consistency required. The idempotency_key is indexed so a retried request maps to the same transaction.

ledger_entries

entry_id (PK), txn_id (FK), account_id, direction (debit|credit), amount_paise, balance_after, created_at

Double-entry accounting. Every transaction produces matching debit and credit rows so the books always balance. Append-only, never updated in place, so the full money history is auditable.
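
A minimal in-memory sketch of the double-entry rule, assuming a simplified subset of the ledger_entries columns. The class and method names are hypothetical, and a real store would write both legs inside one database transaction.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class LedgerEntry:
    txn_id: str
    account_id: str
    direction: str      # "debit" or "credit"
    amount_paise: int   # integer paise, never floats, to avoid rounding drift

class Ledger:
    def __init__(self):
        self._entries: List[LedgerEntry] = []   # append-only, never updated in place

    def post_transfer(self, txn_id, payer, payee, amount_paise):
        if amount_paise <= 0:
            raise ValueError("amount must be positive")
        # Both legs are appended together so debits and credits always match.
        self._entries.append(LedgerEntry(txn_id, payer, "debit", amount_paise))
        self._entries.append(LedgerEntry(txn_id, payee, "credit", amount_paise))

    def is_balanced(self):
        debits = sum(e.amount_paise for e in self._entries if e.direction == "debit")
        credits = sum(e.amount_paise for e in self._entries if e.direction == "credit")
        return debits == credits

    def net_position(self, account_id):
        """Credits minus debits for one account, relative to its opening balance."""
        total = 0
        for e in self._entries:
            if e.account_id == account_id:
                total += e.amount_paise if e.direction == "credit" else -e.amount_paise
        return total
```

Because entries are only ever appended, any balance can be recomputed from history, which is what makes the ledger auditable.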

txn_events

event_id (PK), txn_id (FK), from_state, to_state, source (app|npci|recon|risk), created_at

An immutable log of every state change on a transaction, so you can reconstruct exactly what happened and when. This is what a dispute or a reconciliation run reads.

risk_signals

entity_id (PK), entity_type (user|device|vpa|merchant), features (jsonb), score, updated_at

Held in the Aerospike layer for sub-millisecond reads. Real-time aggregated features per entity that the Risk and Decisioning Engine reads inline to score a payment before it clears.

idempotency_keys

idempotency_key (PK), txn_id, request_hash, status, created_at

Maps a client request to a single transaction. A retried pay request with the same key returns the existing transaction rather than creating a new debit. Central to never double charging.
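
The lookup can be sketched as follows, using in-memory dictionaries in place of the idempotency_keys and transactions tables. The PaymentService class, the PENDING state name, and the conflict rule for a reused key with a different body are illustrative assumptions, not a published PhonePe design.

```python
import hashlib
import json
import uuid

class IdempotencyConflict(Exception):
    """Raised when a key is reused with a different request body."""

class PaymentService:
    def __init__(self):
        self._keys = {}   # idempotency_key -> (request_hash, txn_id)
        self._txns = {}   # txn_id -> transaction record

    @staticmethod
    def _hash(request):
        # Canonical JSON so the same logical request always hashes the same.
        return hashlib.sha256(json.dumps(request, sort_keys=True).encode()).hexdigest()

    def pay(self, idempotency_key, request):
        request_hash = self._hash(request)
        existing = self._keys.get(idempotency_key)
        if existing is not None:
            stored_hash, txn_id = existing
            if stored_hash != request_hash:
                raise IdempotencyConflict("same key, different request body")
            return self._txns[txn_id]          # retry: return the original transaction
        # First time this key is seen: create the pending record.
        # In a real database this check-and-insert must be atomic,
        # typically via a unique constraint on the key.
        txn_id = str(uuid.uuid4())
        txn = {"txn_id": txn_id, "state": "PENDING", **request}
        self._txns[txn_id] = txn
        self._keys[idempotency_key] = (request_hash, txn_id)
        return txn
```

Storing the request hash alongside the key lets the service tell a genuine retry apart from a client bug that reuses a key for a different payment.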

Deep dives

These are the conversations the interviewer is steering you toward. Practice each one until you can talk through it without notes.

The UPI PSP flow and why every step must be idempotent

A UPI payment is a distributed transaction across parties PhonePe does not control: the payer bank, NPCI, and the payee bank. PhonePe acts as the Payment Service Provider. In the pay flow the payer initiates and approves with a UPI PIN; in the collect flow a request is pushed to the payer to approve. In both, the debit and credit happen at the banks over NPCI, and the final status returns asynchronously through a callback that can be delayed, duplicated, or out of order. That is why the payment is written first as a pending record tied to an idempotency key. If the app retries because it did not get a response, the same key maps to the same transaction, so there is no second debit. The status only moves to success or failure once the true outcome is known. This end-to-end flow is the standard UPI pattern that every PSP implements, rather than a PhonePe-published internal, but it is the correctness backbone the interviewer is looking for.
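
One way to make callback handling safe against duplicates and late arrivals is a small state machine that refuses any transition out of a final state. This is a sketch under assumptions: the state names and the ALLOWED table are hypothetical, not NPCI's actual status codes.

```python
TERMINAL = {"SUCCESS", "FAILED"}

# Which transitions are legal from each non-final state (illustrative only).
ALLOWED = {
    "PENDING": {"DEEMED", "SUCCESS", "FAILED"},
    "DEEMED": {"SUCCESS", "FAILED"},
}

def apply_callback(txn, new_state, events):
    """Apply a status callback to `txn` in place.

    Returns True if the state changed, False if the callback was a no-op
    (a duplicate, a late arrival, or an illegal transition).
    """
    current = txn["state"]
    if current in TERMINAL:
        # Already final: a duplicate or late callback must not change anything.
        # A real system would also flag a conflicting final status for review.
        return False
    if new_state not in ALLOWED.get(current, set()):
        return False
    # Record the change in the append-only event log, then move the state.
    events.append((txn["txn_id"], current, new_state))
    txn["state"] = new_state
    return True
```

Because the function is a no-op once the transaction is final, replaying the same callback any number of times leaves the payment and the event log unchanged.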

Sharded shared-nothing MySQL as the transactional core

PhonePe has written that its primary transactional store is sharded MySQL, run as a strict shared-nothing architecture. Three choices matter. First, a common sharding library is used across all MySQL databases, so sharding logic is consistent and not re-invented per service. Second, the user path avoids scatter-gather queries, meaning a single request does not fan out to every shard and wait for the slowest, which is what keeps latency predictable at scale. Third, services store zero local data, so the data is decoupled from the service container and both can be scaled independently. The payoff is that no single database is a bottleneck and the system grows by adding shards. The cost is that cross-shard operations, such as a payment where payer and payee live on different shards, need care, and analytics that would want to scan across shards have to be served elsewhere, which is part of why the event backbone exists.
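
A minimal sketch of shard routing by user id, to show what "no scatter-gather on the user path" means in code. PhonePe's sharding library is not public, so the hash-modulo scheme, the shard count, and the function names here are assumptions; production libraries usually add a logical-shard mapping so shards can be moved without rehashing every key.

```python
import hashlib

NUM_SHARDS = 64  # hypothetical shard count

def shard_for(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user id to one shard with a stable hash.

    Python's built-in hash() is randomized per process, so a
    cryptographic digest is used to keep routing deterministic.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def transactions_for_user(shards, user_id):
    """User-path read: goes to exactly one shard, never fans out."""
    shard = shards[shard_for(user_id, len(shards))]
    return shard.get(user_id, [])
```

Every request that carries the shard key touches exactly one database, so its latency does not depend on the slowest shard in the fleet.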

Aerospike for real-time reads and inline fraud at scale

Not every read can wait for a sharded SQL query. Balance checks, session lookups, feature-store reads, and fraud signals have to return in well under a millisecond, and they run at very high volume during a festival peak. PhonePe uses Aerospike for this layer. Engineers have reported running many Aerospike clusters per site holding over a trillion records, replicated active-active across sites with strong consistency where it is required, and serving more than 500,000 queries per second with sub-millisecond reads. Aerospike replaced heavier workloads and cut the server footprint substantially. The interview point is the split: keep the durable, auditable money state in the sharded SQL ledger, and put the hot, high-throughput reads that feed the live payment decision in a fast key-value layer next to it.

A Kafka backbone at 100 billion events a day

Behind the synchronous payment path, PhonePe runs an event backbone on Kafka that carries about 100 billion events a day. Two design decisions stand out. First, PhonePe moved from a single Kafka cluster to separate write and read clusters per data center, so heavy read consumers, such as analytics and reconciliation jobs, cannot slow down the write path that the live system depends on. Second, ingestion is a two-tier model: a client library writes to a local disk-backed buffer for durability, an ingestor forwards to Kafka, and latency-critical producers get a faster direct path. This backbone is how fraud scoring, reconciliation, notifications, and downstream products all get the payment events without loading the transactional store.
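
The two-tier ingestion idea can be sketched with in-memory stand-ins. Everything here is hypothetical: FakeBroker stands in for a Kafka producer, the deque stands in for the local disk-backed buffer, and the fallback from the direct path to the buffer is an assumed design choice rather than something PhonePe has described.

```python
from collections import deque

class FakeBroker:
    """Stand-in for a Kafka topic so the sketch runs without a cluster."""
    def __init__(self):
        self.log = []
        self.available = True

    def send(self, event):
        if not self.available:
            raise ConnectionError("broker unavailable")
        self.log.append(event)

class TwoTierClient:
    def __init__(self, broker):
        self.broker = broker
        self.buffer = deque()   # stands in for a local disk-backed buffer

    def emit(self, event, latency_critical=False):
        if latency_critical:
            try:
                self.broker.send(event)   # fast direct path
                return
            except ConnectionError:
                pass                      # assumed fallback: use the buffered path
        self.buffer.append(event)         # durable path: buffer first, forward later

    def run_ingestor_once(self):
        """Forward buffered events in order; stop at the first failure
        so nothing is dropped while the broker is down."""
        forwarded = 0
        while self.buffer:
            try:
                self.broker.send(self.buffer[0])
            except ConnectionError:
                break
            self.buffer.popleft()         # remove only after a successful send
            forwarded += 1
        return forwarded
```

The buffered path trades a little latency for durability during broker outages, while the direct path serves producers that cannot afford the extra hop.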

The pending or DEEMED state and reconciliation

The hardest part of a UPI payment is the case where the outcome is uncertain. NPCI can mark a transaction deemed, meaning the result is not yet known rather than a clean success or failure. The system must not show failed and let the user retry, because the money may in fact have moved, and a blind retry could double debit. So the transaction sits in a pending state, and a reconciliation service takes over: it polls NPCI for the true status, matches the callback against the pending record when it arrives, and drives the transaction to a final state. Every step is idempotent, so a duplicate or late callback never marks a payment paid twice, and the ledger is only finalized when the truth is known. This reconciliation behavior is the standard UPI pattern rather than a PhonePe-specific published design, but handling it correctly is what separates a real payments answer from a naive one.
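
A minimal sketch of one reconciliation pass. The fetch_status callable is a hypothetical stand-in for a status-check call to NPCI, and the state names and the 60-second threshold are illustrative assumptions.

```python
def reconcile(pending_txns, fetch_status, now, threshold_seconds=60):
    """Drive stale pending transactions to a final state.

    `fetch_status(txn_id)` is assumed to return "SUCCESS", "FAILED",
    or "PENDING". Returns the ids finalized in this pass.
    """
    finalized = []
    for txn in pending_txns:
        if txn["state"] not in ("PENDING", "DEEMED"):
            continue        # already final: running the pass again is a no-op
        if now - txn["created_at"] < threshold_seconds:
            continue        # still inside the normal callback window
        status = fetch_status(txn["txn_id"])
        if status in ("SUCCESS", "FAILED"):
            txn["state"] = status
            finalized.append(txn["txn_id"])
        # Any other answer leaves the transaction pending for the next pass.
    return finalized
```

The pass never guesses: a transaction moves only when the status check gives a definite answer, and a second run over the same data changes nothing.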

Running your own data centers instead of the public cloud

PhonePe runs its own on-premises data centers, reported as three sites including Mumbai and Bangalore plus a hybrid environment, orchestrated with Mesos and Marathon. This is a real differentiator worth discussing. The reasons a payments company at this scale might choose on-premises are cost at very high and steady volume, control over latency and hardware for sub-millisecond workloads, data residency and regulatory comfort, and predictable performance during national peak events. The costs are large upfront capital, the need to build and run the orchestration, networking, and failover that a cloud would otherwise provide, and slower elasticity than the cloud gives. The honest framing is that on-premises at national scale is a deliberate trade PhonePe made, not the default choice for a smaller product.

Trade-offs to discuss

Every senior interviewer expects you to surface at least 3 of these. Pick the decisions, state the alternatives, and justify your choice.

Sharded MySQL versus a single large database versus a NoSQL store

A single large SQL database is simplest and gives easy transactions, but it becomes a hard ceiling at national UPI volume. A NoSQL store scales writes easily but makes the strong consistency and relational integrity that money needs harder to guarantee. PhonePe chose sharded MySQL with a shared-nothing design: it keeps SQL consistency per shard while scaling horizontally by adding shards. The cost is that cross-shard work and any query that wants to span shards need deliberate handling, which is why scatter-gather is banned on the user path and analytics run off the event backbone.

Own on-premises data centers versus the public cloud

The public cloud gives fast elasticity and no capital outlay, which is right for most products. At PhonePe's scale and steadiness of volume, running its own data centers can be cheaper over time, gives tighter control over latency and hardware for sub-millisecond workloads, and helps with data residency. The costs are heavy upfront investment and having to build the orchestration, networking, and failover the cloud would otherwise handle, plus slower elasticity. It is a trade that only makes sense at very large, sustained scale.

Aerospike fast layer versus a cache over the primary database

A simple read-through cache over the sharded SQL store is easy to add, but under a festival peak a cache miss storm can hammer the database, and a cache does not give the durable, replicated, strongly-consistent-where-needed store that balance and fraud reads want. A purpose-built low-latency store like Aerospike serves those reads at very high throughput with sub-millisecond latency and its own replication. The cost is another system to operate and keep consistent with the source of truth, justified by the volume of hot reads on the payment path.

Separate read and write Kafka clusters versus one cluster

One Kafka cluster is simpler to run, but heavy read consumers such as analytics and reconciliation can steal capacity from the write path that the live payment flow depends on. Splitting into write and read clusters isolates the two so a spike in downstream reading never slows down producers. The cost is more clusters to operate and the need to replicate data from write to read side, which PhonePe accepted at 100 billion events a day.

Strong consistency on money versus eventual consistency elsewhere

The ledger and payment state cannot be eventually consistent, because that is where double charges, lost money, and stuck transactions come from, so they live in the strongly consistent sharded SQL core with idempotency and double-entry accounting. Reads that tolerate slight staleness, such as transaction history views or aggregated features, can be served from the fast layer or the event backbone. Splitting the system this way avoids paying for strong consistency where it is not needed while never giving it up on money.

Inline synchronous fraud check versus scoring after the payment

Scoring fraud after a payment clears is simpler and keeps the payment path fast, but it means a fraudulent transaction has already moved money and must be clawed back. Scoring inline, before the money moves, can block or hold a suspicious payment, which is why PhonePe built a real-time risk engine reading from the sub-millisecond Aerospike layer. The cost is that the fraud check is now on the critical path and must itself be extremely fast and highly available, which is exactly why it reads from the fast store rather than the SQL ledger.
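
To illustrate what "inline" means, here is a toy decision function that runs before the debit is attempted. The feature names, point values, and thresholds are invented for the sketch, the dictionary stands in for the fast feature store, and a real risk engine would use aggregated features and models rather than three hand-written rules.

```python
def score_payment(features, amount_paise):
    """Toy rule-based risk score from 0 to 100 (integer points)."""
    score = 0
    if features.get("txns_last_minute", 0) > 10:
        score += 50                      # unusual burst of payments
    if features.get("device_age_days", 365) < 1:
        score += 30                      # device first seen today
    if amount_paise > 5_000_000:
        score += 30                      # large amount (over 50,000 rupees)
    return min(score, 100)

def decide(feature_store, payer_id, amount_paise, hold_at=50, block_at=80):
    """Return "ALLOW", "HOLD", or "BLOCK" before any money moves."""
    features = feature_store.get(payer_id, {})   # stand-in for a sub-millisecond read
    score = score_payment(features, amount_paise)
    if score >= block_at:
        return "BLOCK"
    if score >= hold_at:
        return "HOLD"
    return "ALLOW"
```

The orchestrator would call decide first and only start the debit leg on "ALLOW", which is why the feature read has to be fast enough to sit on the critical path.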

How PhonePe actually does it

Several parts of this are documented on PhonePe's own engineering blog. PhonePe has written that its primary transactional store is sharded MySQL under a shared-nothing architecture, with a common sharding library, no scatter-gather queries on the user path, and no local data on service containers. It has described in-house platforms including a Payments Orchestrator, a Risk and Decisioning Engine that aggregates across its event stream in real time, and a Fulfillment Engine, along with an edge stack built on Mesos and Marathon with NGINX and Traefik. Its Kafka backbone carries about 100 billion events a day and is split into separate write and read clusters per data center with a two-tier ingestion path.

The Aerospike figures, meaning many clusters per site, over a trillion records, active-active replication across sites, and more than 500,000 queries per second with sub-millisecond reads, come from PhonePe engineers quoted in Aerospike case studies, so they are attributed engineer statements rather than audited numbers, and different write-ups give slightly different snapshots. The market position, meaning about 48 percent of UPI by volume and just over 50 percent by value in January 2025, comes from NPCI data reported in the press. The user and merchant figures, meaning 500 million registered users in November 2023 and about 47.19 million merchants across 98.61 percent of pin codes in the 2025 IPO filing, come from PhonePe press releases and its filing.

Two honesty notes for the interview. First, PhonePe does not publish a payment transactions-per-second figure, so the 500,000 number is queries per second on the fast read layer, not payments per second. Second, the NPCI PSP flow, the pending or DEEMED state, and reconciliation are the standard UPI pattern that every app implements, not internals PhonePe has published, so present them as the correct design and reserve the phrase "PhonePe does this" for the data-store and infrastructure work it has actually written about.

Lessons to study before this interview

If any of these topics are fuzzy, the interviewer will catch it. Each lesson is 15 to 60 minutes with diagrams, code, and a quiz.

Idempotency

foundation / core fundamentals

Database Sharding

foundation / database fundamentals

Distributed Transactions

advanced / distributed systems core

Design a Payment System

capstone / capstone

Retry Patterns

advanced / reliability resilience

High Availability

advanced / reliability resilience

Rate Limiting for Resilience

advanced / reliability resilience
