LLM serving infrastructure
Model weights are big. A 70B-parameter model needs about 140 GB of GPU memory in FP16 for the weights alone (70 billion parameters at 2 bytes each), or 35 to 40 GB if you quantize down to INT4. KV cache and activations come on top of that. The weight footprint alone forces most architectural decisions: which GPUs you can use, how many you need per replica, and whether you shard with tensor parallelism across multiple devices or fit on a single H100 with 80 GB of HBM3.
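The arithmetic behind those figures can be sketched in a few lines. This is a rough estimate, not a capacity planner: the function names and the 0.9 usable-memory fraction are assumptions made for illustration, and the result covers weights only, with no KV cache, activations, or quantization metadata.

```python
import math

# Bytes per parameter for common weight formats. Real INT4 checkpoints also
# carry scales and zero points, which is why the text quotes 35 to 40 GB.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def min_gpus_for_weights(num_params: float, dtype: str,
                         gpu_gb: float = 80.0,
                         usable_fraction: float = 0.9) -> int:
    """Smallest GPU count whose combined usable memory holds the weights.

    usable_fraction is an assumed allowance for runtime overhead.
    """
    usable_gb = gpu_gb * usable_fraction
    return math.ceil(weight_memory_gb(num_params, dtype) / usable_gb)


if __name__ == "__main__":
    for dtype in ("fp16", "int4"):
        gb = weight_memory_gb(70e9, dtype)
        gpus = min_gpus_for_weights(70e9, dtype)
        print(f"70B in {dtype}: {gb:.0f} GB of weights, at least {gpus} x 80 GB GPU(s)")
```

Under these assumptions the FP16 model needs at least two 80 GB devices, while the INT4 model fits on one with room left for the KV cache.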
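Once the model is loaded, the next problem is throughput. A naive one-request-at-a-time loop wastes most of the GPU. Continuous batching, introduced as iteration-level scheduling in Orca, popularized by vLLM, and now standard in TensorRT-LLM (where it is called in-flight batching) and TGI, lets the scheduler add new requests to a running batch at every decoding step instead of waiting for the whole batch to finish. The vLLM paper reports 2 to 4 times higher throughput than FasterTransformer and Orca at the same level of latency, a gain it attributes to PagedAttention's management of KV cache memory.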
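A toy simulation shows why admitting requests at every step helps. This is a minimal sketch and not how vLLM, TensorRT-LLM, or TGI implement their schedulers: it counts decode steps only, ignores prefill and memory limits, and assumes every request needs at least one token. The function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Decode steps when each batch must run until its longest request ends."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Decode steps when freed slots are refilled before every step."""
    waiting = deque(lengths)      # tokens still to generate, per queued request
    running: list[int] = []       # tokens still to generate, per active request
    steps = 0
    while waiting or running:
        # Admit queued requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step: every active request emits one token,
        # and finished requests leave the batch.
        running = [remaining - 1 for remaining in running if remaining > 1]
        steps += 1
    return steps


if __name__ == "__main__":
    # One long request followed by seven short ones, two slots per batch.
    lengths = [8, 1, 1, 1, 1, 1, 1, 1]
    print("static:", static_batching_steps(lengths, batch_size=2))
    print("continuous:", continuous_batching_steps(lengths, batch_size=2))
```

With this workload the static scheduler holds a slot idle for seven steps while the long request finishes, whereas the continuous scheduler serves the short requests in that slot.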
In interviews, expect questions on KV cache sizing, PagedAttention, speculative decoding, and how prefill and decode get scheduled differently. Walk the interviewer through the GPU memory budget, then through the batcher, then through how you would shard the model across nodes if it does not fit on one.
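For the KV cache sizing question, a back-of-the-envelope calculation is usually enough. The sketch below assumes standard attention, where each layer stores one key and one value vector per KV head for every token. The model shape (80 layers, 8 KV heads, head dimension 128) is an illustrative assumption resembling a 70B model with grouped-query attention, and the function names are hypothetical.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_cached_tokens(free_bytes: float, per_token_bytes: int) -> int:
    """Tokens that fit in the memory left after weights are loaded."""
    return int(free_bytes // per_token_bytes)


if __name__ == "__main__":
    # Assumed shape: 80 layers, 8 KV heads, head_dim 128, FP16 cache.
    per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
    # Assumed budget: 2 x 80 GB GPUs minus 140 GB of FP16 weights, before
    # any allowance for activations or fragmentation.
    free_bytes = 2 * 80e9 - 140e9
    print(f"KV cache per token: {per_token / 1024:.0f} KiB")
    print(f"Tokens that fit in the remainder: {max_cached_tokens(free_bytes, per_token)}")
```

The token count is shared across all concurrent requests, so it bounds batch size times context length, which is the link between the memory budget and the batcher.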