Rishiraj's blog

Optimized LLM Inference Engines

Unlike my opinion blogs, this blog is written with AI assistance. If even in 2026 you're allergic to AI assistance in technical writing, you can ask an AI to introduce flaws here and then read. For the rest, there is a specific kind of heartbreak that every AI engineer experiences exactly once. It happens when you take the code that worked perfectly in your Jupyter Notebook—where you loaded Llama-3-70B, sent a prompt, and got a cool response—and try to wrap it in a FastAPI endpoint for production.

Suddenly, everything falls apart. You send three simultaneous requests, and the latency spikes to five seconds. You try to batch five requests, and your GPU, which theoretically has enough VRAM, crashes with an Out Of Memory (OOM) error. You realize that the standard Hugging Face transformers pipeline, while brilliant for research, treats your GPU memory with the efficiency of a toddler packing a suitcase.

The gap between "it runs" and "it serves" is massive. Standard pipelines suffer from two fundamental problems: memory fragmentation and scheduling inefficiency.

When you generate text, the model creates a Key-Value (KV) cache for every token. Naive implementations reserve a contiguous block of VRAM for the maximum sequence length because they don’t know how long your answer will be. If you reserve space for 2,048 tokens but only generate 50, roughly 97% of that reservation sits idle. Estimates suggest standard transformers waste 60% to 80% of KV-cache memory this way. Worse, batching is usually synchronous: if you batch a short request (10 tokens) with a long one (500 tokens), the short request finishes early but its slot stays blocked until the long one is done.
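
Some quick back-of-the-envelope arithmetic makes the waste concrete. This sketch assumes a hypothetical 7B-class model (32 layers, 32 KV heads, head dimension 128, FP16 weights); the exact numbers vary per architecture, but the shape of the problem doesn't:

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    # One K vector and one V vector per layer, stored at FP16 (2 bytes) by default
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token(32, 32, 128)     # 524,288 bytes = 0.5 MB per token
reserved, generated = 2048, 50
wasted_gb = (reserved - generated) * per_token / 1024**3   # ~0.98 GB for ONE request
waste_frac = 1 - generated / reserved                      # ~97.6% of the reservation
```

One short request strands nearly a gigabyte; a batch of them strands your whole card.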

To fix this, we have to leave the comfort of standard PyTorch and look at the four specialized engines that run the modern LLM world: Llama.cpp, Ollama, vLLM, and SGLang.

They aren't just wrappers; they represent completely different engineering philosophies. Here is how they actually work under the hood.


The "Run Anywhere" Stack: Llama.cpp and Ollama

If your goal is to run a model on hardware that you actually own—a MacBook Pro, a gaming rig with a 3090, or even a Raspberry Pi—you are likely stepping into the ecosystem built around llama.cpp.

Llama.cpp: The Bare Metal Miracle

Llama.cpp started in early 2023 as a hack to get Meta’s LLaMA model running on a MacBook, but it evolved into a masterclass in low-level C++ optimization. It bypasses Python and PyTorch entirely to interact directly with the hardware.

Its superpower is GGUF and k-quantization. Standard models run in FP16 (16-bit floating point). A 70B parameter model in FP16 demands about 140GB of VRAM. Unless you have an A100 cluster in your basement, that’s a non-starter. Llama.cpp aggressively quantizes weights into 2, 3, 4, or 8-bit integers.
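
The memory math behind that claim is straightforward. This ignores the small per-block scale overhead that real GGUF k-quants add, so treat it as a lower bound:

```python
def weight_gb(n_params, bits):
    # Raw weight storage only; real GGUF k-quant files add a few percent
    # for the per-block scale/offset metadata.
    return n_params * bits / 8 / 1e9

for bits in (16, 8, 4, 2):
    print(f"70B @ {bits}-bit: {weight_gb(70e9, bits):.1f} GB")
```

At 4-bit, that 140GB model becomes a 35GB file, which suddenly fits on a pair of consumer GPUs or a 64GB MacBook.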

But it’s not just about file size. Llama.cpp optimizes the compute:

  1. Apple Silicon Native: It ships a Metal backend with hand-tuned GPU kernels. Unlike a typical CUDA setup, which shuffles data between CPU RAM and VRAM, Apple’s Unified Memory means the CPU and GPU share one physical pool, and llama.cpp is tuned to keep that memory bus saturated without redundant copies.
  2. Split Computation: This is a lifesaver for older hardware. If a model doesn't fit in your 24GB VRAM, Llama.cpp allows you to offload, say, 40 layers to the GPU and process the remaining 40 on the CPU. It’s slower than pure GPU, but it prevents the crash.
  3. SIMD: On the CPU side, it leverages vectorized instructions (AVX2 and AVX-512 on x86, NEON on ARM) to parallelize the quantized matrix multiplications.

It now includes a server mode—a single binary that spins up an OpenAI-compatible API. It supports speculative decoding (using a small draft model to guess tokens for a large model) and grammar constraints. However, it’s not built for high concurrency; by default it processes requests largely one at a time. It’s about portability, not throughput.
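
The speculative decoding idea is worth a sketch. This is the greedy-verification variant only, with toy callable "models" standing in for real draft and target networks (each takes a token list and returns the next token id); real implementations also handle sampling via rejection tests:

```python
def speculative_step(draft, target, prefix, k=4):
    """One round of greedy speculative decoding with stub models."""
    ctx, proposed = list(prefix), []
    for _ in range(k):                      # cheap draft model guesses k tokens
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    accepted, ctx = [], list(prefix)
    for t in proposed:                      # big model verifies the guesses
        want = target(ctx)
        if want != t:
            accepted.append(want)           # first disagreement: keep target's token
            break
        accepted.append(t)
        ctx.append(t)
    else:
        accepted.append(target(ctx))        # all k accepted: one free bonus token
    return accepted

# Toy models that both just count upward, so every draft token is accepted:
count_up = lambda ctx: ctx[-1] + 1
out = speculative_step(count_up, count_up, [0], k=3)   # -> [1, 2, 3, 4]
```

When the draft agrees, you get k+1 tokens for roughly one large-model pass; when it disagrees, you still make progress, so the output is never worse than decoding normally.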

Ollama: The Manager

If Llama.cpp is the engine, Ollama is the dashboard. It wraps the llama.cpp libraries (and occasionally others) in a Go-based server that manages the lifecycle of the models.

Ollama solves "configuration hell." Running Llama.cpp raw requires passing flags for thread counts, GPU layers, and context windows. Ollama automates this by detecting your hardware (e.g., "Oh, you have an M2 Max") and optimizing the launch.

Its killer feature is the Modelfile. Think of this like a Dockerfile for LLMs. You can pull a base model, set the temperature, define a strict system prompt, and package it as a new distinct model. When you hit the Ollama API:

  1. It checks if the model is loaded.
  2. If not, it unloads whatever is currently hogging VRAM.
  3. It loads your requested model and passes the query to the C++ backend.
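
A minimal Modelfile looks like this (the base model name and prompt are just illustrative):

```
FROM llama3
PARAMETER temperature 0.2
PARAMETER num_ctx 4096
SYSTEM """You are a terse senior code reviewer. Reply in bullet points."""
```

Then `ollama create reviewer -f Modelfile` builds it and `ollama run reviewer` serves it, exactly like tagging and running a Docker image.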

While Ollama creates a seamless developer experience (Mac/Linux/Windows support, Docker integration), it inherits the performance profile of its backend. It prioritizes accessibility. On an NVIDIA H100, Ollama (via llama.cpp) might hit 75 tokens/second on a 14B model, which is respectable, but it struggles to scale with concurrent users because it lacks sophisticated batching. It is the gold standard for local development, not high-traffic deployment.


The Production Stack: vLLM and SGLang

If you are paying for cloud GPUs (H100s, A100s) and need to serve API traffic to hundreds of users, the local stack won't cut it. You need throughput. You need to saturate the GPU. This is where vLLM and SGLang come in.

vLLM: Virtual Memory for GPUs

vLLM came out of UC Berkeley in 2023 with a solution to the memory fragmentation problem: PagedAttention.

They looked at how Operating Systems handle RAM. OSs don't give programs contiguous blocks of physical memory; they chop memory into pages and use a page table to map them. vLLM applied this to the KV cache.

PagedAttention chops each sequence’s KV cache into small fixed-size blocks, allocates them only as tokens are actually generated, and maps logical positions to physical blocks through a per-sequence block table. Waste shrinks from ~60-80% to nearly zero—only the unfilled tail of a sequence’s last block can go unused—and because the memory is virtualized, vLLM can fit significantly more sequences into the GPU at once.
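
A toy allocator in the spirit of PagedAttention makes the mechanism obvious (block size of 16 tokens is vLLM's default; everything else here is simplified):

```python
class PagedKV:
    """Toy paged KV allocator: physical blocks are handed out on demand
    and tracked through a per-sequence block table."""
    def __init__(self, n_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(n_blocks))        # pool of physical blocks
        self.tables = {}                         # seq_id -> (block list, token count)

    def append_token(self, seq_id):
        blocks, n = self.tables.get(seq_id, ([], 0))
        if n % self.block_size == 0:             # last block full (or sequence is new)
            blocks = blocks + [self.free.pop()]  # allocate exactly one more block
        self.tables[seq_id] = (blocks, n + 1)

    def release(self, seq_id):
        blocks, _ = self.tables.pop(seq_id)
        self.free.extend(blocks)                 # memory is reusable immediately

kv = PagedKV(n_blocks=128)
for _ in range(50):                              # a request that generates 50 tokens
    kv.append_token("req-1")
blocks_used = len(kv.tables["req-1"][0])         # 4 blocks (64 slots), not 128
```

The 50-token request above holds 4 blocks instead of a 2,048-token contiguous reservation; the only waste is the 14 empty slots in its last block.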

Once the memory problem was solved, they implemented Continuous Batching (iteration-level scheduling). In the old world, a batch waited for the slowest request. In vLLM, the scheduler works per-token.

  1. Request A finishes.
  2. The scheduler immediately evicts A and pulls Request C from the queue to join Request B.
  3. The GPU never stops.
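
The scheduling loop above can be sketched in a few lines. Each iteration stands in for one decode step across the whole batch; requests are hypothetical (name, tokens_remaining) pairs:

```python
from collections import deque

def continuous_batching(requests, max_batch=2):
    """Toy iteration-level scheduler: finished requests are swapped out
    and queued requests swapped in between individual decode steps."""
    queue = deque(requests)
    running, trace = [], []
    while queue or running:
        while queue and len(running) < max_batch:    # top up before every step
            running.append(list(queue.popleft()))
        trace.append([name for name, _ in running])  # who occupies the GPU this step
        for req in running:
            req[1] -= 1                              # each generates one token
        running = [r for r in running if r[1] > 0]   # evict finished requests
    return trace

steps = continuous_batching([("A", 2), ("B", 4), ("C", 3)])
```

With requests of length 2, 4, and 3, the trace finishes in 5 steps; naive static batching (run A+B to completion, then C) would take 7, because A's slot would sit empty for two steps while B finished.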

The results are stark: benchmarks show vLLM delivering up to 24x higher throughput than standard Hugging Face pipelines. It supports multi-GPU scaling (tensor parallelism) and is the default choice for most industry-standard OpenAI-compatible endpoints today.

SGLang: Structure and Control

SGLang (Structured Generation Language) is the newest heavyweight, developed by the LMSYS team (the creators of Vicuna and Chatbot Arena). It takes the vLLM foundation and optimizes it for complex, "agentic" workloads.

vLLM is great for independent chat requests. But what if you are building an agent that loops? Or a system that sends the same massive system prompt 50 times with different user questions?

SGLang introduces RadixAttention. While vLLM manages the KV cache per individual sequence, RadixAttention organizes the cache as a radix tree (a compressed prefix tree) keyed by token sequences. When requests share a prefix—the same giant system prompt, the earlier turns of a conversation—the KV cache for that prefix is computed once, stored as a path in the tree, and reused by every later request that matches it.

For multi-turn agents or RAG systems with shared context, this automatic reuse can result in 5x higher throughput.
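
A stripped-down prefix cache shows the idea. Here each tree node stands in for one token's cached KV entry; a real implementation compresses runs of tokens into single edges and attaches eviction policies, none of which is modeled here:

```python
class RadixNode:
    def __init__(self):
        self.children = {}   # token id -> RadixNode

class RadixCache:
    """Toy prefix cache: a path from the root represents a cached token prefix."""
    def __init__(self):
        self.root = RadixNode()

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())

    def match_prefix(self, tokens):
        # Length of the longest cached prefix: these tokens skip prefill entirely
        node, n = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            n += 1
        return n

cache = RadixCache()
system_prompt = list(range(100))          # pretend 100-token system prompt
cache.insert(system_prompt + [7, 8, 9])   # first request, fully computed
hit = cache.match_prefix(system_prompt + [7, 5])   # second request reuses 101 tokens
```

The second request only has to prefill the tokens past the match point, which is why fan-out workloads (one system prompt, fifty questions) see such dramatic speedups.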

Furthermore, SGLang co-designs the frontend and backend. It includes a domain-specific language (DSL) that allows you to program the generation. If you need the LLM to output valid JSON, SGLang constructs a Finite State Machine (FSM) based on your regex or JSON schema. During inference, the model is physically prevented from sampling a token that would break the JSON structure. This is faster and more reliable than retry loops.
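
A miniature version of constrained decoding fits in a dozen lines. This hand-built, character-level DFA enforces the regex ("yes"|"no"); in SGLang the transition table is compiled automatically from your regex or JSON schema and operates on tokenizer tokens, not characters:

```python
# DFA for the regex ("yes"|"no"): state -> {allowed char -> next state}
DFA = {0: {"y": 1, "n": 4}, 1: {"e": 2}, 2: {"s": 3}, 4: {"o": 3}}
ACCEPT = {3}

def constrained_decode(model, state=0):
    """model(output_so_far, allowed_chars) must return one char; everything
    outside allowed_chars is masked before sampling, so an invalid output
    is impossible by construction."""
    out = ""
    while state not in ACCEPT:
        allowed = set(DFA[state])
        ch = model(out, allowed)
        out += ch
        state = DFA[state][ch]
    return out

# A "model" that greedily prefers "n" whenever the mask allows it:
out = constrained_decode(lambda so_far, allowed: "n" if "n" in allowed else min(allowed))
```

After the model picks "n", the only legal continuation is "o", so it emits "no" no matter what it would have preferred—the same guarantee SGLang gives you for a full JSON schema.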

It’s battle-tested, too. Elon Musk’s xAI uses SGLang to serve Grok, and Microsoft Azure uses it for DeepSeek-R1. It supports NVIDIA and AMD GPUs, quantization (FP8, INT4), and consistently beats vLLM in complex, multi-step benchmarks (up to 3.1x faster on Llama-70B).


The Verdict: Which one do you choose?

The ecosystem has fragmented because the use cases have diverged. There is no single "best" engine, but there is definitely a correct tool for your specific constraint.

1. The "MacBook Engineer" / Local Dev: Use Ollama. It is the path of least resistance. You don't need to compile C++. You don't need to understand quantization parameters. You just run ollama run llama3 and you have an API. It wraps the efficiency of llama.cpp in a package that respects your time.

2. The Edge Deployer / IoT: Use Llama.cpp Server. If you are deploying to a Raspberry Pi, a Jetson Nano, or a strict air-gapped environment where every megabyte of RAM counts, go to the source. The single binary simplicity and the ability to run on almost any CPU architecture make it the only viable choice for the extreme edge.

3. The Standard Production API: Use vLLM. If you are building a ChatGPT clone for your company and expect high traffic, vLLM is the reliable workhorse. It maximizes the ROI on your expensive H100s via PagedAttention and provides a stable, drop-in replacement for OpenAI’s API.

4. The Agent Architect: Use SGLang. If your application involves complex reasoning loops, tool use, or heavy structured output (JSON), SGLang is currently superior. The RadixAttention caching changes the economics of long-context agents, and the FSM-based decoding ensures your downstream code doesn't break because the LLM forgot a closing bracket.

We are long past the days of model.generate(). The inference layer is now a sophisticated piece of infrastructure, and choosing the right engine is just as important as choosing the right model.