A complete guide to running Ollama with Qwen Coder locally on an NVIDIA DGX Spark and wiring it into Claude Code - covering hardware buy links, what Qwen is and why it matters for coding, the full deployment steps, and how to access the model remotely.
Running a capable coding model on your own hardware used to mean stitching together multiple GPUs, fighting CUDA driver versions, and praying the model fit in VRAM. The NVIDIA DGX Spark changes that math. With 128GB of unified memory in a desktop-sized box, it can hold large coding models comfortably, and Ollama makes serving them a one-line affair. Pair that with Qwen Coder as the model and Claude Code as the client, and you have a private, fast, local coding assistant that never sends a line of your source to a third party.
This guide walks through the entire pipeline end to end: what Qwen is, where to buy a DGX Spark, how to install and run Ollama on it, how to safely expose the API over your network, and how to point Claude Code at your self-hosted model.
1. What Is Qwen (and Qwen Coder)?
Qwen is a family of open-weight large language models developed by Alibaba Cloud. Most of the models are released under the permissive Apache 2.0 license (a few sizes use Qwen-specific licenses, so check the model card before commercial use), published on Hugging Face, and documented on the official Qwen blog. They are among the strongest open models you can run on your own hardware, and they ship in a wide range of parameter sizes so you can match the model to the silicon you have.
Qwen Coder is the code-specialized branch of the family. The Qwen2.5-Coder series was trained on a large corpus of source code and code-related text, and it is tuned for real engineering tasks: code generation, completion, debugging, refactoring, and reasoning across multiple files. The newer Qwen3-Coder generation pushes this further with stronger agentic and long-context behavior.
Why Qwen Coder for a local setup?
- Open weights. You download the model once and run it offline. No per-token billing, no rate limits, no data leaving your machine.
- Size options. Qwen2.5-Coder ships in 0.5B, 1.5B, 3B, 7B, 14B, and 32B variants. On a 128GB DGX Spark you can comfortably run the 32B model.
- Strong coding benchmarks. The 32B variant is competitive with much larger proprietary models on code generation and repair tasks.
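- Long context. The 7B and larger Qwen2.5-Coder models support context windows of up to 128K tokens, which matters when you feed an agent multiple files at once. Ollama's default context window is much smaller than the model maximum, so raise it (the num_ctx parameter) for agentic use.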
For this guide we use qwen2.5-coder:32b as the primary model, with qwen2.5-coder:7b as a lighter fallback. If a newer Qwen Coder release is available in the Ollama model library when you read this, prefer the latest tag.
2. NVIDIA DGX Spark: Specs and Where to Buy
The NVIDIA DGX Spark is a compact AI development system built around the GB10 Grace Blackwell Superchip. It is designed to put data-center-class AI capability on a desk, and its large unified memory pool is exactly what makes serving big models practical without a multi-GPU rig.
Key specifications
| Component | Specification |
|---|---|
| Superchip | NVIDIA GB10 Grace Blackwell |
| GPU | Blackwell architecture with 5th-gen Tensor Cores |
| CPU | 20-core Arm (10x Cortex-X925 + 10x Cortex-A725) |
| Unified memory | 128 GB LPDDR5x, coherent CPU + GPU |
| AI performance | Up to 1 petaFLOP (1,000 TOPS) at FP4, with sparsity |
| Storage | Up to 4 TB NVMe |
| Networking | NVIDIA ConnectX-7 (up to 200 Gb/s), Wi-Fi, 10GbE |
| OS | NVIDIA DGX OS (Ubuntu-based) |
The 128GB of coherent unified memory is the headline feature for our purposes: the GPU can address the full pool, so a 32B model and its context fit without offloading. Two DGX Spark units can also be linked over ConnectX-7 to run even larger models.
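To see why 128 GB is comfortable, a back-of-the-envelope estimate helps. The sketch below assumes weight memory is simply parameter count times bits per weight; it ignores the KV cache, activations, and runtime overhead, all of which add to the real total. The function name and the precision levels shown are illustrative, not something Ollama reports.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in decimal gigabytes."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # Rough weight footprint of a 32B-parameter model at common precisions.
    for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
        print(f"32B @ {label}: ~{weight_memory_gb(32, bits):.0f} GB")
```

Even at FP16 the weights come to roughly 64 GB, which leaves headroom in a 128 GB pool for context; quantized builds need far less.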
Where to buy
Buy from official or reputable channels only. Prices and availability vary by region.
- NVIDIA official product page: nvidia.com/en-us/products/workstations/dgx-spark - start here for the canonical specs and the current buy / notify links.
- NVIDIA Founders Edition (order / notify): use the "Order Now" or "Notify Me" link on the same NVIDIA product page.
- ASUS Ascent GX10: asus.com - ASUS's DGX Spark variant.
- Dell Pro Max with GB10: dell.com - Dell's DGX Spark-class system.
- Lenovo: lenovo.com - Lenovo DGX Spark-based workstation offerings.
- HP: hp.com - HP DGX Spark-class systems.
- Micro Center: microcenter.com - retail availability in the US.
Always confirm the configuration (memory, storage) and warranty on the vendor's own checkout page before buying.
3. Deploy Ollama on the DGX Spark
Ollama is the simplest way to download, manage, and serve open models with GPU acceleration. It exposes an HTTP API on port 11434 by default. Since the DGX Spark runs an Ubuntu-based, Arm64 NVIDIA DGX OS, both the native installer and the Docker image work well.
Option A: Native install (recommended)
SSH into the DGX Spark, then run the official installer:
curl -fsSL https://ollama.com/install.sh | sh
The installer detects the NVIDIA GPU, installs the Ollama service via systemd, and starts it. Verify it is running:
systemctl status ollama
ollama --version
Pull the Qwen Coder model. On 128GB of memory the 32B model is the sweet spot:
ollama pull qwen2.5-coder:32b
Run a quick smoke test to confirm the model responds:
ollama run qwen2.5-coder:32b "Write a Python function that reverses a linked list."
Option B: Docker install
If you prefer containers, make sure the NVIDIA Container Toolkit is installed (it ships with DGX OS), then run:
docker run -d --gpus=all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama
Pull the model inside the container:
docker exec -it ollama ollama pull qwen2.5-coder:32b
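Either way, you now have an Ollama server listening on port 11434. The native install binds to 127.0.0.1 only, while the Docker command above publishes the port on every host interface, so lock it down as described in the next section.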
4. Expose the Ollama Port (11434) Safely
By default Ollama binds only to localhost. To reach it from another machine you must bind it to the network interface and open the port. Treat this as you would any internal service: do not expose it raw to the public internet.
Step 1: Bind Ollama to all interfaces
For the native systemd service, override the host environment variable:
sudo systemctl edit ollama
Add the following block, then save:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Reload and restart:
sudo systemctl daemon-reload
sudo systemctl restart ollama
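(For the Docker option, the -p 11434:11434 flag already publishes the port on every host interface, and the official image listens on 0.0.0.0 inside the container, so no extra OLLAMA_HOST setting is needed. To keep a containerized Ollama local-only, publish it as -p 127.0.0.1:11434:11434 instead. Be aware that ports published by Docker are not filtered by UFW rules by default.)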
Step 2: Allow the port through the firewall
Restrict access to your trusted subnet rather than opening it to everyone. With UFW:
sudo ufw allow from 192.168.1.0/24 to any port 11434 proto tcp
sudo ufw reload
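A quick way to reason about which clients the rule above admits is to test addresses against the same CIDR block. This is only a sketch using Python's standard ipaddress module; the client addresses are made-up examples, and it models the subnet match only, not UFW itself.

```python
import ipaddress

# Same CIDR block as the UFW rule in this step.
TRUSTED_SUBNET = ipaddress.ip_network("192.168.1.0/24")


def is_trusted(client_ip: str) -> bool:
    """Return True if client_ip falls inside the trusted subnet."""
    return ipaddress.ip_address(client_ip) in TRUSTED_SUBNET


if __name__ == "__main__":
    # Hypothetical client addresses, for illustration only.
    for ip in ["192.168.1.50", "192.168.2.10", "10.0.0.5"]:
        print(ip, "allowed" if is_trusted(ip) else "blocked")
```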
Step 3: Put Nginx in front (recommended)
A reverse proxy lets you add TLS and basic authentication, which matters because Ollama has no auth of its own. Install Nginx and create a site config:
server {
    listen 443 ssl;
    server_name ollama.internal.example.com;
    ssl_certificate /etc/nginx/ssl/ollama.crt;
    ssl_certificate_key /etc/nginx/ssl/ollama.key;
    location / {
        proxy_pass http://127.0.0.1:11434;
        # Ollama bound to localhost can reject unfamiliar Host headers;
        # the Ollama FAQ's Nginx example pins Host to the local address.
        proxy_set_header Host localhost:11434;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header Authorization $http_authorization;
        # Long-lived streaming responses
        proxy_buffering off;
        proxy_read_timeout 600s;
    }
}
Add HTTP basic auth with an auth_basic directive and an htpasswd file if you want a shared secret in front of the API. Reload Nginx:
sudo nginx -t && sudo systemctl reload nginx
Now the model is reachable at https://ollama.internal.example.com with TLS. If Nginx is your only entry point, you can skip Step 1, leave Ollama bound to localhost behind the proxy, and allow port 443 through the firewall instead of 11434. If you also bind Ollama directly to 0.0.0.0, keep the firewall rule scoped to your subnet.
5. Use Ollama From a Remote Machine
From your laptop or any machine on the network, point requests at the DGX Spark. Replace the host with your server's address or Nginx hostname.
Quick check with curl
List the models the server has downloaded (use /api/ps to see which ones are currently loaded in memory):
curl http://192.168.1.50:11434/api/tags
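The same check can be scripted. The sketch below uses only the standard library and assumes the /api/tags response is a JSON object with a "models" list whose entries carry a "name" field; the helper names are illustrative, and the address is the example one used in this guide.

```python
import json
import urllib.request
from typing import List

OLLAMA_URL = "http://192.168.1.50:11434"  # example address from this guide


def model_names(tags_payload: dict) -> List[str]:
    """Extract model tags from a parsed /api/tags response."""
    return [m["name"] for m in tags_payload.get("models", [])]


def fetch_tags(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> dict:
    """GET /api/tags and return the parsed JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


# Example (requires a reachable server):
#   names = model_names(fetch_tags())
#   print("qwen2.5-coder:32b" in names)
```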
Native Ollama generate API
curl http://192.168.1.50:11434/api/generate -d '{
  "model": "qwen2.5-coder:32b",
  "prompt": "Explain the difference between a process and a thread.",
  "stream": false
}'
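With "stream": false the server returns a single JSON object. If you leave streaming on, Ollama instead sends newline-delimited JSON objects, each carrying a text fragment in "response" and ending with a chunk where "done" is true. The sketch below reassembles such a stream; the sample chunks are hand-written for illustration, not captured output, and join_stream is a hypothetical helper name.

```python
import json
from typing import Iterable, Union


def join_stream(lines: Iterable[Union[bytes, str]]) -> str:
    """Concatenate the "response" fragments from a newline-delimited JSON stream."""
    parts = []
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line:
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # final chunk reached
    return "".join(parts)


# Hand-written sample chunks shaped like streamed /api/generate output.
sample = [
    '{"response": "A process ", "done": false}',
    '{"response": "owns its memory.", "done": false}',
    '{"response": "", "done": true}',
]
print(join_stream(sample))
```

Because an HTTP response object from urllib.request.urlopen iterates line by line as bytes, it can be passed to join_stream directly.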
OpenAI-compatible chat endpoint
Ollama also exposes an OpenAI-compatible API under /v1, which is what most tooling expects:
curl http://192.168.1.50:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder:32b",
    "messages": [
      { "role": "user", "content": "Refactor this loop into a list comprehension." }
    ]
  }'
If you put Nginx with TLS and basic auth in front, swap the URL for https://ollama.internal.example.com and add an Authorization header.
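For scripted access through the proxy, the request looks roughly like the sketch below. It assumes HTTP basic auth is configured in Nginx and that the response follows the OpenAI chat-completions shape (choices[0].message.content); the username, password, and helper names are placeholders.

```python
import base64
import json
import urllib.request
from typing import Optional


def basic_auth_header(user: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def build_chat_request(base_url: str, model: str, prompt: str,
                       auth: Optional[str] = None) -> urllib.request.Request:
    """Prepare a POST to the OpenAI-compatible chat endpoint."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if auth is not None:
        headers["Authorization"] = auth
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions", data=body, headers=headers, method="POST"
    )


def chat(req: urllib.request.Request, timeout: float = 600.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]


# Example (requires a reachable server; credentials are placeholders):
#   req = build_chat_request("https://ollama.internal.example.com",
#                            "qwen2.5-coder:32b", "Say hello.",
#                            basic_auth_header("user", "pass"))
#   print(chat(req))
```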
6. Integrate With Claude Code
Claude Code talks to an Anthropic-style API. Ollama serves an OpenAI-compatible API, so the two do not speak the same protocol directly. The reliable pattern is to run a small translation proxy that accepts Anthropic-format requests from Claude Code and forwards them to Ollama, then point Claude Code at that proxy with environment variables.
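To make the mismatch concrete, here is a deliberately simplified sketch of the request-side translation such a proxy performs. It assumes the Anthropic format carries the system prompt as a top-level field and message content as either a string or a list of text blocks, while the OpenAI format expects the system prompt as the first message. Real proxies also handle tool calls, images, and streaming events, which this sketch ignores; the function names are illustrative.

```python
from typing import Any, Dict, List


def _text(content: Any) -> str:
    """Flatten Anthropic-style content (a string or a list of blocks) to plain text."""
    if isinstance(content, str):
        return content
    return "".join(b.get("text", "") for b in content if b.get("type") == "text")


def anthropic_to_openai(req: Dict[str, Any], target_model: str) -> Dict[str, Any]:
    """Convert a minimal Anthropic-style request into an OpenAI-style chat request."""
    messages: List[Dict[str, str]] = []
    if req.get("system"):
        # Top-level system prompt becomes the first chat message.
        messages.append({"role": "system", "content": _text(req["system"])})
    for msg in req.get("messages", []):
        messages.append({"role": msg["role"], "content": _text(msg["content"])})
    out: Dict[str, Any] = {"model": target_model, "messages": messages}
    # Pass through the sampling options both formats share by name.
    for key in ("max_tokens", "temperature", "stream"):
        if key in req:
            out[key] = req[key]
    return out
```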
Step 1: Run an Anthropic-to-OpenAI proxy
Tools like claude-code-router or LiteLLM translate between the Anthropic message format and Ollama's OpenAI-compatible endpoint. With LiteLLM, a minimal config pointing at your DGX Spark looks like this:
model_list:
  - model_name: qwen-coder
    litellm_params:
      model: openai/qwen2.5-coder:32b
      api_base: http://192.168.1.50:11434/v1
      api_key: "ollama"
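Install LiteLLM with its proxy extras (pip install 'litellm[proxy]'), save the config above as litellm.config.yaml, and start the proxy (it defaults to port 4000):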
litellm --config litellm.config.yaml
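Before involving Claude Code, it can help to send the proxy one Anthropic-style request by hand. The sketch below assumes your LiteLLM version exposes an Anthropic-compatible /v1/messages route and returns a "content" list of text blocks; verify both against the LiteLLM documentation for your release. The token value mirrors the placeholder key in the config, and the helper names are illustrative.

```python
import json
import urllib.request

PROXY_URL = "http://localhost:4000"  # LiteLLM default port from the step above


def build_messages_request(prompt: str, model: str = "qwen-coder",
                           token: str = "ollama") -> urllib.request.Request:
    """Prepare an Anthropic-style messages request aimed at the proxy."""
    body = {
        "model": model,
        "max_tokens": 128,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{PROXY_URL}/v1/messages",  # assumed route; check your LiteLLM version
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
            "anthropic-version": "2023-06-01",
        },
        method="POST",
    )


def first_text(response_payload: dict) -> str:
    """Return the first text block of an Anthropic-style response, or an empty string."""
    for block in response_payload.get("content", []):
        if block.get("type") == "text":
            return block["text"]
    return ""


# Example (requires the proxy to be running):
#   with urllib.request.urlopen(build_messages_request("Reply with: pong")) as resp:
#       print(first_text(json.load(resp)))
```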
Step 2: Point Claude Code at the proxy
Create or edit .claude/settings.json in your project. The env block sets the base URL, auth token, and model that Claude Code will use:
{
  "env": {
    "ANTHROPIC_BASE_URL": "http://localhost:4000",
    "ANTHROPIC_AUTH_TOKEN": "ollama",
    "ANTHROPIC_MODEL": "qwen-coder",
    "ANTHROPIC_SMALL_FAST_MODEL": "qwen-coder"
  }
}
You can also export the same variables in your shell instead of using settings.json:
export ANTHROPIC_BASE_URL="http://localhost:4000"
export ANTHROPIC_AUTH_TOKEN="ollama"
export ANTHROPIC_MODEL="qwen-coder"
export ANTHROPIC_SMALL_FAST_MODEL="qwen-coder"
Step 3: Launch Claude Code
claude
Claude Code now routes its requests through the proxy to Qwen Coder running on your DGX Spark. Your prompts, your code, and the model responses all stay inside your own network.
A note on MCP
If you want Claude Code to reach your Ollama box as a tool rather than as the underlying model, you can register an MCP server in .mcp.json that wraps the Ollama API. That keeps Claude as the reasoning model while letting it call your local Qwen Coder for specialized code completion. For most setups, though, the proxy approach above is simpler and is the recommended starting point.
Wrapping Up
You now have a fully local coding assistant: Qwen Coder served by Ollama on an NVIDIA DGX Spark, exposed safely over your network behind Nginx, and wired into Claude Code through a translation proxy. The result is a fast, private, no-per-token-cost workflow where none of your source code leaves your hardware.
Verify the versions of Ollama, the Qwen Coder tag, and your proxy at the time you build this, since all three move quickly. Start with the 7B model to validate the pipeline, then graduate to the 32B model once the wiring works end to end.
Building a private LLM platform for your engineering team? Talk to TunerLabs - we design and deploy self-hosted AI infrastructure and agentic coding workflows for businesses worldwide.