Deploy Ollama with Qwen Coder on NVIDIA DGX Spark and Use It With Claude Code

12 min read | TunerLabs Engineering | June 13, 2026

A complete guide to running Ollama with Qwen Coder locally on an NVIDIA DGX Spark and wiring it into Claude Code - covering hardware buy links, what Qwen is and why it matters for coding, the full deployment steps, and how to access the model remotely.

Running a capable coding model on your own hardware used to mean stitching together multiple GPUs, fighting CUDA driver versions, and praying the model fit in VRAM. The NVIDIA DGX Spark changes that math. With 128GB of unified memory in a desktop-sized box, it can hold large coding models comfortably, and Ollama makes serving them a one-line affair. Pair that with Qwen Coder as the model and Claude Code as the client, and you have a private, fast, local coding assistant that never sends a line of your source to a third party.

This guide walks through the entire pipeline end to end: what Qwen is, where to buy a DGX Spark, how to install and run Ollama on it, how to safely expose the API over your network, and how to point Claude Code at your self-hosted model.

Want help standing up a private LLM stack for your team? Book a free 30-minute architecture review and we will scope it with one of our engineers.


1. What Is Qwen (and Qwen Coder)?

Qwen is a family of open-weight large language models developed by Alibaba Cloud. Most of the models are released under the permissive Apache 2.0 license (check the model card, since a few sizes carry Qwen's own license terms), published on Hugging Face, and documented at the official Qwen blog. They are among the strongest open models you can run on your own hardware, and they ship in a wide range of parameter sizes so you can match the model to the silicon you have.

Qwen Coder is the code-specialized branch of the family. The Qwen2.5-Coder series was trained on a large corpus of source code and code-related text, and it is tuned for real engineering tasks: code generation, completion, debugging, refactoring, and reasoning across multiple files. The newer Qwen3-Coder generation pushes this further with stronger agentic and long-context behavior.

Why Qwen Coder for a local setup?

  • Open weights. You download the model once and run it offline. No per-token billing, no rate limits, no data leaving your machine.
  • Size options. Qwen2.5-Coder ships in 0.5B, 1.5B, 3B, 7B, 14B, and 32B variants. On a 128GB DGX Spark you can comfortably run the 32B model.
  • Strong coding benchmarks. The 32B variant is competitive with much larger proprietary models on code generation and repair tasks.
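  • Long context. Qwen documents context windows of up to 128K tokens for the larger Qwen2.5-Coder variants, which matters when you feed an agent multiple files at once. Ollama's default context length is smaller than that, so raise it (for example with the num_ctx parameter) if your prompts get truncated.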

For this guide we use qwen2.5-coder:32b as the primary model, with qwen2.5-coder:7b as a lighter fallback. If a newer Qwen Coder release is available in the Ollama model library when you read this, prefer the latest tag.


2. NVIDIA DGX Spark: Specs and Where to Buy

The NVIDIA DGX Spark is a compact AI development system built around the GB10 Grace Blackwell Superchip. It is designed to put data-center-class AI capability on a desk, and its large unified memory pool is exactly what makes serving big models practical without a multi-GPU rig.

Key specifications

Component        Specification
Superchip        NVIDIA GB10 Grace Blackwell
GPU              Blackwell architecture with 5th-gen Tensor Cores
CPU              20-core Arm (10x Cortex-X925 + 10x Cortex-A725)
Unified memory   128 GB LPDDR5x, coherent CPU + GPU
AI performance   Up to 1 petaFLOP (1000 TOPS) at FP4
Storage          Up to 4 TB NVMe
Networking       NVIDIA ConnectX-7 (up to 200 Gb/s), Wi-Fi, 10GbE
OS               NVIDIA DGX OS (Ubuntu-based)

The 128GB of coherent unified memory is the headline feature for our purposes: the GPU can address the full pool, so a 32B model and its context fit without offloading. Two DGX Spark units can also be linked over ConnectX-7 to run even larger models.
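
As a rough back-of-the-envelope check on that claim, the sketch below estimates the memory needed for model weights from the parameter count and the bits stored per weight. The bit widths are illustrative assumptions, not the measured size of any particular Ollama tag, and the estimate ignores the KV cache and runtime overhead, which grow with context length.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


UNIFIED_MEMORY_GB = 128  # DGX Spark unified memory, from the spec table

# Assumed bit widths for illustration; check the actual quantization of your tag.
for label, bits in [("4-bit", 4), ("8-bit", 8), ("16-bit", 16)]:
    need = weight_memory_gb(32, bits)
    fits = need < UNIFIED_MEMORY_GB
    print(f"32B at {label}: about {need:.0f} GB of weights, fits in memory: {fits}")
```

Even at 16 bits per weight the estimate leaves headroom for context, which is why the 32B model is a comfortable choice on this machine.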

Where to buy

Buy from official or reputable channels only. Prices and availability vary by region.

  • NVIDIA official product page: nvidia.com/en-us/products/workstations/dgx-spark - start here for the canonical specs and the current buy / notify links.
  • NVIDIA Founders Edition (order / notify): nvidia.com DGX Spark page - use the "Order Now" / "Notify Me" link on NVIDIA's official product page for the Founders Edition.
  • ASUS Ascent GX10: asus.com - ASUS's DGX Spark variant.
  • Dell Pro Max with GB10: dell.com - Dell's DGX Spark-class system.
  • Lenovo: lenovo.com - Lenovo DGX Spark-based workstation offerings.
  • HP: hp.com - HP DGX Spark-class systems.
  • Micro Center: microcenter.com - retail availability in the US.

Always confirm the configuration (memory, storage) and warranty on the vendor's own checkout page before buying.


3. Deploy Ollama on the DGX Spark

Ollama is the simplest way to download, manage, and serve open models with GPU acceleration. It exposes an HTTP API on port 11434 by default. Since the DGX Spark runs an Ubuntu-based, Arm64 NVIDIA DGX OS, both the native installer and the Docker image work well.

Option A: Native install (recommended)

SSH into the DGX Spark, then run the official installer:

curl -fsSL https://ollama.com/install.sh | sh

The installer detects the NVIDIA GPU, installs the Ollama service via systemd, and starts it. Verify it is running:

systemctl status ollama
ollama --version

Pull the Qwen Coder model. On 128GB of memory the 32B model is the sweet spot:

ollama pull qwen2.5-coder:32b

Run a quick smoke test to confirm the model responds:

ollama run qwen2.5-coder:32b "Write a Python function that reverses a linked list."

Option B: Docker install

If you prefer containers, make sure the NVIDIA Container Toolkit is installed (it ships with DGX OS), then run:

docker run -d --gpus=all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

Pull the model inside the container:

docker exec -it ollama ollama pull qwen2.5-coder:32b

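Either way, you now have an Ollama server listening on port 11434. The native install binds to 127.0.0.1 only, while the Docker command above publishes the port on all of the host's interfaces.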
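
A quick way to confirm that something is actually accepting connections on that port is a plain TCP connect. This is a minimal sketch using only the Python standard library; the helper name `port_open` is ours and is not part of Ollama.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Running `port_open("127.0.0.1", 11434)` on the DGX Spark itself should return True once the service is up. The same call from another machine, using the server's LAN address, tells you whether the port is reachable over the network.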


4. Expose the Ollama Port (11434) Safely

By default the native Ollama service binds only to localhost. To reach it from another machine you must bind it to a network interface and open the port. Treat this as you would any internal service: do not expose it raw to the public internet.

Step 1: Bind Ollama to all interfaces

For the native systemd service, override the host environment variable:

sudo systemctl edit ollama

Add the following block, then save:

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"

Reload and restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama

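(For the Docker option, the -p 11434:11434 flag already publishes the port on all host interfaces, and the official image is set up to listen on 0.0.0.0 inside the container, so no extra OLLAMA_HOST setting is normally needed. Be aware that Docker-published ports are handled by Docker's own iptables rules and are typically not filtered by UFW; publish with -p 127.0.0.1:11434:11434 instead if the container should only be reachable through a local reverse proxy.)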

Step 2: Allow the port through the firewall

Restrict access to your trusted subnet rather than opening it to everyone. With UFW:

sudo ufw allow from 192.168.1.0/24 to any port 11434 proto tcp
sudo ufw reload
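
To sanity-check which clients a rule like this covers, the standard-library `ipaddress` module can evaluate subnet membership. This is only an illustration of the CIDR match; it does not read or modify the firewall, and the subnet is the example value from the rule above.

```python
import ipaddress

# Same example subnet as the UFW rule; replace with your own trusted range.
ALLOWED = ipaddress.ip_network("192.168.1.0/24")


def is_allowed(client_ip: str) -> bool:
    """Return True if client_ip falls inside the allowed subnet."""
    return ipaddress.ip_address(client_ip) in ALLOWED
```

For example, `is_allowed("192.168.1.50")` is inside the /24, while an address such as 192.168.2.50 is not and would be dropped by the rule.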

Step 3: Put Nginx in front (recommended)

A reverse proxy lets you add TLS and basic authentication, which matters because Ollama has no auth of its own. Install Nginx and create a site config:

server {
    listen 443 ssl;
    server_name ollama.internal.example.com;

    ssl_certificate     /etc/nginx/ssl/ollama.crt;
    ssl_certificate_key /etc/nginx/ssl/ollama.key;

    location / {
        proxy_pass         http://127.0.0.1:11434;
        proxy_set_header   Host $host;
        proxy_set_header   X-Real-IP $remote_addr;
        proxy_set_header   Authorization $http_authorization;

        # Long-lived streaming responses
        proxy_buffering off;
        proxy_read_timeout 600s;
    }
}

Add HTTP basic auth with the auth_basic and auth_basic_user_file directives and an htpasswd file if you want a shared secret in front of the API. Reload Nginx:

sudo nginx -t && sudo systemctl reload nginx

Now the model is reachable at https://ollama.internal.example.com with TLS. If Nginx is the only entry point you need, leave OLLAMA_HOST at its default so that Ollama itself stays bound to localhost behind the proxy. If you also bind Ollama directly to 0.0.0.0, keep the firewall rule scoped to your subnet.


5. Use Ollama From a Remote Machine

From your laptop or any machine on the network, point requests at the DGX Spark. Replace the host with your server's address or Nginx hostname.

Quick check with curl

List the models the server has downloaded:

curl http://192.168.1.50:11434/api/tags
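
The same check can be scripted. The sketch below fetches /api/tags with the Python standard library and pulls out the model names; the function names are ours, and it assumes the response is a JSON object with a "models" list whose entries carry a "name" field.

```python
import json
import urllib.request


def model_names(tags_response: dict) -> list[str]:
    """Extract model names from a parsed /api/tags response."""
    return [m["name"] for m in tags_response.get("models", [])]


def fetch_model_names(base_url: str, timeout: float = 10.0) -> list[str]:
    """GET {base_url}/api/tags and return the names of the downloaded models."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return model_names(json.load(resp))
```

Calling `fetch_model_names("http://192.168.1.50:11434")` should include qwen2.5-coder:32b once the pull from section 3 has finished.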

Native Ollama generate API

curl http://192.168.1.50:11434/api/generate -d '{
  "model": "qwen2.5-coder:32b",
  "prompt": "Explain the difference between a process and a thread.",
  "stream": false
}'
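
For scripts, the equivalent call in Python needs nothing beyond the standard library. This is a minimal sketch mirroring the curl request above; the helper names are ours, and it assumes the non-streaming reply carries the generated text in a "response" field.

```python
import json
import urllib.request


def build_generate_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url: str, model: str, prompt: str, timeout: float = 600.0) -> str:
    """Send the request and return the generated text."""
    req = build_generate_request(base_url, model, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["response"]
```

The generous timeout is deliberate: with "stream" set to false, nothing comes back until the whole completion is ready, and the first request also pays for loading the model into memory.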

OpenAI-compatible chat endpoint

Ollama also exposes an OpenAI-compatible API under /v1, which is what most tooling expects:

curl http://192.168.1.50:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "qwen2.5-coder:32b",
  "messages": [
    { "role": "user", "content": "Refactor this loop into a list comprehension." }
  ]
}'
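
The reply follows the OpenAI chat-completion shape, so the text you usually want sits under choices[0].message.content. A small sketch of building the request and extracting that field, with helper names of our own choosing:

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_text: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": user_text}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(completion: dict) -> str:
    """Return the assistant text from a parsed chat-completion response."""
    return completion["choices"][0]["message"]["content"]


def chat(base_url: str, model: str, user_text: str, timeout: float = 600.0) -> str:
    """Send one user message and return the assistant's reply."""
    req = build_chat_request(base_url, model, user_text)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return extract_reply(json.load(resp))
```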

If you put Nginx with TLS and basic auth in front, swap the URL for https://ollama.internal.example.com and add an Authorization header.
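
For HTTP basic auth, the Authorization header is the word "Basic" followed by the base64 encoding of "user:password". A minimal sketch of building it in Python; the credentials shown in the usage note are placeholders, not values from this setup.

```python
import base64


def basic_auth_header(user: str, password: str) -> dict:
    """Build the Authorization header for HTTP basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}
```

Merge the result of `basic_auth_header("alice", "s3cret")` into the headers of any request sent to the Nginx hostname. Basic auth is only an encoding, not encryption, so use it over the TLS endpoint and never over plain HTTP.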


6. Integrate With Claude Code

Claude Code talks to an Anthropic-style API. Ollama serves an OpenAI-compatible API, so the two do not speak the same protocol directly. The reliable pattern is to run a small translation proxy that accepts Anthropic-format requests from Claude Code and forwards them to Ollama, then point Claude Code at that proxy with environment variables.
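
To make the mismatch concrete, here is a deliberately simplified sketch of the request-side translation such a proxy performs. It is our own illustration, not how LiteLLM or claude-code-router implement it: it handles only a plain-string system prompt and text content blocks, and ignores tools, images, and streaming.

```python
def anthropic_to_openai(req: dict) -> dict:
    """Convert a simple Anthropic-style messages request to OpenAI chat format."""
    messages = []

    # Anthropic carries the system prompt as a top-level field;
    # OpenAI expects it as the first message with role "system".
    system = req.get("system")
    if isinstance(system, str) and system:
        messages.append({"role": "system", "content": system})

    for msg in req.get("messages", []):
        content = msg["content"]
        if isinstance(content, list):
            # Content may be a list of typed blocks; keep only the text ones here.
            content = "".join(
                block.get("text", "") for block in content if block.get("type") == "text"
            )
        messages.append({"role": msg["role"], "content": content})

    out = {"model": req["model"], "messages": messages}
    if "max_tokens" in req:
        out["max_tokens"] = req["max_tokens"]
    return out
```

A real proxy also has to translate the response back, map tool calls in both directions, and convert streaming events, which is why using an existing tool is the practical choice.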

Step 1: Run an Anthropic-to-OpenAI proxy

Tools like claude-code-router or LiteLLM translate between the Anthropic message format and Ollama's OpenAI-compatible endpoint. With LiteLLM, a minimal config pointing at your DGX Spark looks like this:

model_list:
  - model_name: qwen-coder
    litellm_params:
      model: openai/qwen2.5-coder:32b
      api_base: http://192.168.1.50:11434/v1
      api_key: "ollama"

Start the proxy (it defaults to port 4000):

litellm --config litellm.config.yaml

Step 2: Point Claude Code at the proxy

Create or edit .claude/settings.json in your project. The env block sets the base URL, auth token, and model that Claude Code will use:

{
  "env": {
    "ANTHROPIC_BASE_URL": "http://localhost:4000",
    "ANTHROPIC_AUTH_TOKEN": "ollama",
    "ANTHROPIC_MODEL": "qwen-coder",
    "ANTHROPIC_SMALL_FAST_MODEL": "qwen-coder"
  }
}
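
If the project already has a settings.json with other keys, you may prefer to merge the env block rather than overwrite the file. The sketch below is one way to do that with the standard library; the helper names are ours, and the values are the ones from the example above.

```python
import json
from pathlib import Path

PROXY_ENV = {
    "ANTHROPIC_BASE_URL": "http://localhost:4000",
    "ANTHROPIC_AUTH_TOKEN": "ollama",
    "ANTHROPIC_MODEL": "qwen-coder",
    "ANTHROPIC_SMALL_FAST_MODEL": "qwen-coder",
}


def merge_env(settings: dict, env: dict) -> dict:
    """Return a copy of settings with env merged into its "env" block."""
    merged = dict(settings)
    merged["env"] = {**settings.get("env", {}), **env}
    return merged


def update_settings(path: Path, env: dict) -> None:
    """Merge env into the settings file at path, creating the file if needed."""
    settings = json.loads(path.read_text()) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(merge_env(settings, env), indent=2) + "\n")
```

Calling `update_settings(Path(".claude/settings.json"), PROXY_ENV)` from the project root leaves unrelated keys untouched and replaces only the variables that overlap.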

You can also export the same variables in your shell instead of using settings.json:

export ANTHROPIC_BASE_URL="http://localhost:4000"
export ANTHROPIC_AUTH_TOKEN="ollama"
export ANTHROPIC_MODEL="qwen-coder"

Step 3: Launch Claude Code

claude

Claude Code now routes its requests through the proxy to Qwen Coder running on your DGX Spark. Your prompts, your code, and the model responses all stay inside your own network.

A note on MCP

If you want Claude Code to reach your Ollama box as a tool rather than as the underlying model, you can register an MCP server in .mcp.json that wraps the Ollama API. That keeps Claude as the reasoning model while letting it call your local Qwen Coder for specialized code completion. For most setups, though, the proxy approach above is simpler and is the recommended starting point.


Wrapping Up

You now have a fully local coding assistant: Qwen Coder served by Ollama on an NVIDIA DGX Spark, exposed safely over your network behind Nginx, and wired into Claude Code through a translation proxy. The result is a fast, private, no-per-token-cost workflow where none of your source code leaves your hardware.

Verify the versions of Ollama, the Qwen Coder tag, and your proxy at the time you build this, since all three move quickly. Start with the 7B model to validate the pipeline, then graduate to the 32B model once the wiring works end to end.

Building a private LLM platform for your engineering team? Talk to TunerLabs - we design and deploy self-hosted AI infrastructure and agentic coding workflows for businesses worldwide.

Topics: Ollama, Qwen, Qwen Coder, NVIDIA DGX Spark, Claude Code, local LLM, self-hosted AI, remote access, AI engineering, developer tools