Fix a Slow Self-Hosted LLM Under Load

Stand up a local LLM and it feels snappy for one user, then collapses the moment your team hits it together. The culprit is almost always the serving layer, not your hardware. Here is why concurrency breaks naive deployments, what continuous batching actually does, and how to pick and tune a production inference server like vLLM on a single-GPU box such as an NVIDIA DGX Spark.

If you have ever stood up a local large language model, served it to your team, and watched it fall apart the moment more than one person hit it at the same time, you are not alone. This is one of the most common pitfalls teams run into when they move an LLM from "works on my machine" to "serves my whole department." The good news is that the problem is almost always the serving layer, not your hardware, and the fix is well understood.

This post walks through why concurrency breaks naive LLM deployments, what continuous batching actually does, and how to pick and configure a production-grade inference server. The examples assume a single-GPU box like an NVIDIA DGX Spark running open-weight models such as Qwen or Phi-4, but the principles apply to almost any self-hosted setup.

Rolling out a self-hosted LLM for your team? Book a free 30-minute architecture review and we will help you size the hardware and serving stack before it bites you in production.

The Symptom: Great for One User, Terrible for Many

A typical story goes like this. You install Ollama, pull a model, and the responses are snappy. You share the endpoint with your team. As long as one person is using it, everything feels fine. Then two or three colleagues send requests at the same time, and suddenly everyone is staring at a spinner. Latency balloons, and the GPU utilization graph looks oddly calm despite the slowdown.

That calm GPU is the clue. Your expensive accelerator is not actually saturated. It is sitting idle between requests because the serving layer is feeding it work one piece at a time.

Why This Happens

Tools like Ollama are fantastic for local experimentation and single-user workflows. They are easy to install, manage models cleanly, and expose a simple API. But they were designed primarily for that single-user, one-request-at-a-time experience. When multiple requests arrive together, they get queued and largely serialized. Request two waits for request one to finish, request three waits for two, and so on.

This matters because of how text generation works. A language model produces output one token at a time, and each token requires a full pass through the model. Generating a 500-token answer means 500 sequential steps. If your server handles one conversation at a time, a second user's request simply sits in line until the first is completely done.

The frustrating part is that GPUs are massively parallel processors. They are built to do enormous batches of math simultaneously. When you generate tokens for a single request, you are using a tiny fraction of the chip's capacity. The hardware is bored, and your users are waiting. The bottleneck is the orchestration, not the silicon.

The Fix: Continuous Batching

The solution that modern inference servers implement is called continuous batching, sometimes called in-flight batching. The idea is elegant. Instead of processing requests one at a time, the server combines many requests into a single batch and generates tokens for all of them in the same forward pass through the model.

Crucially, "continuous" means requests can join and leave the batch dynamically. When one user's response finishes, that slot is immediately freed and the next waiting request slots in, without waiting for the entire batch to complete. The GPU stays busy, throughput climbs dramatically, and individual users barely notice that they are sharing the hardware.

Alongside continuous batching, the best servers add a second optimization: smarter memory management for the key-value cache. Every token a model generates has to remember the tokens before it, and that memory (the KV cache) grows with conversation length and number of users. Techniques like PagedAttention allocate this memory in small reusable blocks rather than reserving large contiguous chunks up front, which lets you fit far more simultaneous conversations into the same amount of VRAM.

Together, these two ideas are the difference between a server that chokes at three users and one that comfortably handles dozens.

Choosing a Production Inference Server

Once you accept that you need a real serving layer, you have a few strong options. All of them expose an OpenAI-compatible API, which means switching is mostly a matter of changing an endpoint URL rather than rewriting your application code.

vLLM

vLLM is the most popular open-source choice and a sensible default for most teams. It pioneered PagedAttention, has excellent continuous batching, supports a huge range of models out of the box, and has a large active community. If you are not sure where to start, start here.

A minimal launch looks like this:

pip install vllm

vllm serve Qwen/Qwen2.5-14B-Instruct \
  --max-num-seqs 32 \
  --gpu-memory-utilization 0.90

That command starts an OpenAI-compatible server. Point your existing client at it and you are done.

SGLang

SGLang is a close competitor that often edges out vLLM on high-concurrency workloads and on tasks involving structured output or complex prompting patterns. It is worth benchmarking against vLLM for your specific models and traffic shape if you want to squeeze out maximum throughput.

NVIDIA NIM and TensorRT-LLM

If you are running on NVIDIA hardware like a DGX Spark, NVIDIA's own stack is worth a look. NIM packages optimized models as ready-to-run containers, and underneath it uses TensorRT-LLM, which compiles models specifically for your GPU architecture for the best possible raw throughput. The tradeoff is more setup complexity and a narrower set of supported models compared to vLLM, but the integration with the NVIDIA ecosystem is clean and the performance ceiling is high.

Tuning for Concurrency

Picking the right server is most of the battle, but a few configuration choices will meaningfully affect how many simultaneous users you can support.

Quantization. Running your models in a reduced-precision format such as AWQ, GPTQ, or FP8 shrinks their memory footprint substantially. That freed-up VRAM goes straight into the KV cache budget, which means more concurrent conversations. Quality loss with modern quantization methods is usually small and well worth the capacity gain for internal tooling.
Maximum batched sequences. This is the maximum number of sequences the server will batch together, exposed in vLLM as `--max-num-seqs`. Raising this number lets more requests share each forward pass. Push it up until you start hitting memory limits or until per-user latency climbs beyond what you find acceptable, then back off slightly. This is the single most important knob for concurrent throughput.
Memory utilization. Controlled by `--gpu-memory-utilization`, setting this high (around 0.90) tells the server it can claim most of the GPU's memory for the model weights and KV cache, maximizing how many users you can pack in. Leave a little headroom so the system does not run out of memory under bursty load.

A Note on Hardware Honesty

There is one caveat worth stating plainly, especially for DGX Spark owners. The Spark is a wonderful development and prototyping machine, and its large unified memory lets you load sizeable models. But its memory bandwidth is modest compared to a full data-center GPU server, and memory bandwidth is the real constraint on how fast you can serve many users at once.

Continuous batching will transform your concurrent throughput regardless. You should expect a large improvement just by switching off a single-user-oriented tool. But if your team grows to many heavy simultaneous users, you may eventually hit a wall that no amount of software tuning can move. At that point the honest answer is that you have outgrown the box, and the fix is hardware rather than configuration. Knowing where that line sits for your workload is worth measuring early.

Putting It Together

If your self-hosted LLM falls apart under concurrent load, the diagnosis is almost always the same: a single-user serving tool is serializing requests and starving your GPU. The cure is a production inference server that does continuous batching and smart KV-cache management.

For most teams the path is straightforward. Move from Ollama to vLLM, serve a quantized version of your model, set `--max-num-seqs` and `--gpu-memory-utilization` appropriately for your hardware, and point your existing OpenAI-compatible client at the new endpoint. Benchmark SGLang or NVIDIA's NIM stack if you want to push performance further. Measure your real concurrency under load, and let those numbers tell you whether you have a software problem you have now solved, or a hardware ceiling you are approaching.

Either way, you will have turned a frustrating spinner into a system your whole team can actually use.

Scaling self-hosted AI for your organization? Talk to TunerLabs - we design and deploy production LLM serving infrastructure for businesses worldwide, from server selection and quantization to load testing and capacity planning.

Topics:

vLLMOllamaLLM servingcontinuous batchingPagedAttentionNVIDIA DGX SparkSGLangTensorRT-LLMquantizationself-hosted LLM

Free Guide

Master Claude Code

The complete architecture guide — Skills, Agents, Memory & the full Tools reference. Everything in one beautiful page.

Read the Guide

Curated Models as Departmental Brains: How Specialized AI Agents Are Accelerating Medical Diagnosis

SkillSpector: NVIDIA's Open Source Security Scanner for AI Agent Skills

Menu

Why Your Self-Hosted LLM Slows to a Crawl Under Load (and How to Fix It)