Skip to content
Patrick Desjardins Blog
Patrick Desjardins picture from a conference

Running Local LLM on a Nvidia 5080 for Coding

Posted on: 2026-02-09

Goal

Few weekends ago, I wanted to see what would be the best model of coding that I could run locally on my Intel 9 with 32 gig of ram and my Nvidia 5080 with 8gig of vram. The idea was to run Open Code to QWEN 3.0 to have a local Claude. The result was impressive in term of how fast the response occured but the final product wasn't the quality of Claude.

How to do it?

I started with QWEN 2.5 and the result was horrible in term of connection with tools like simply reading and writing files. It was fine with general question but that wasn't the goal.

Moving to Qwen3-8B-AWQ did the job but with some tweaking, otherwise it was crashing or returning response too slow.

Installation:

sh
nvidia-smi
pip install --upgrade pip
pip install torch torchvision torchaudio \
  --index-url https://download.pytorch.org/whl/cu121
pip install vllm
pip install autoawq
pip install huggingface_hub
huggingface-cli login

Everytime to use:

cd ~/llm/llm
pyenv activate vllm

unset VLLM_ATTENTION_BACKEND
unset VLLM_USE_FLASHINFER_SAMPLER

pkill -f vllm

vllm serve \
  Qwen/Qwen3-8B-AWQ \
  --quantization awq \
  --dtype float16 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 32768 \
  --port 7555 \
  --api-key "opencode_local" \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --reasoning-parser qwen3

Loading everything takes less than 30 seconds:

(EngineCore_DP0 pid=36318) INFO 02-03 19:53:24 [default_loader.py:291] Loading weights took 4.14 seconds
(EngineCore_DP0 pid=36318) INFO 02-03 19:53:24 [gpu_model_runner.py:3905] Model loading took 5.71 GiB memory and 5.326229 seconds
(EngineCore_DP0 pid=36318) INFO 02-03 19:53:30 [backends.py:644] Using cache directory: /home/miste/.cache/vllm/torch_compile_cache/252055f4c9/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=36318) INFO 02-03 19:53:30 [backends.py:704] Dynamo bytecode transform time: 5.01 s
(EngineCore_DP0 pid=36318) INFO 02-03 19:53:35 [backends.py:226] Directly load the compiled graph(s) for compile range (1, 2048) from the cache, took 1.098 s
(EngineCore_DP0 pid=36318) INFO 02-03 19:53:35 [monitor.py:34] torch.compile takes 6.11 s in total
(EngineCore_DP0 pid=36318) INFO 02-03 19:53:36 [gpu_worker.py:358] Available KV cache memory: 7.19 GiB
(EngineCore_DP0 pid=36318) INFO 02-03 19:53:36 [kv_cache_utils.py:1305] GPU KV cache size: 52,368 tokens
(EngineCore_DP0 pid=36318) INFO 02-03 19:53:36 [kv_cache_utils.py:1310] Maximum concurrency for 32,768 tokens per request: 1.60x
(EngineCore_DP0 pid=36318) 2026-02-03 19:53:36,340 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=36318) 2026-02-03 19:53:36,363 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:07<00:00,  6.51it/s]
Capturing CUDA graphs (decode, FULL): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:05<00:00,  6.77it/s]
(EngineCore_DP0 pid=36318) INFO 02-03 19:53:49 [gpu_model_runner.py:4856] Graph capturing finished in 13 secs, took 0.00 GiB
(EngineCore_DP0 pid=36318) INFO 02-03 19:53:49 [core.py:273] init engine (profile, create kv cache, warmup model) took 25.12 seconds
(EngineCore_DP0 pid=36318) INFO 02-03 19:53:51 [vllm.py:630] Asynchronous scheduling is enabled.
(APIServer pid=36148) INFO 02-03 19:53:51 [api_server.py:1014] Supported tasks: ['generate']
(APIServer pid=36148) WARNING 02-03 19:53:51 [model.py:1358] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=36148) INFO 02-03 19:53:51 [serving_responses.py:224] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
(APIServer pid=36148) INFO 02-03 19:53:51 [serving_engine.py:271] "auto" tool choice has been enabled.
(APIServer pid=36148) INFO 02-03 19:53:51 [serving_engine.py:271] "auto" tool choice has been enabled.
(APIServer pid=36148) INFO 02-03 19:53:51 [serving_chat.py:146] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
(APIServer pid=36148) INFO 02-03 19:53:51 [serving_chat.py:182] Warming up chat template processing...
(APIServer pid=36148) INFO 02-03 19:53:52 [chat_utils.py:599] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=36148) INFO 02-03 19:53:52 [serving_chat.py:218] Chat template warmup completed in 1254.3ms
(APIServer pid=36148) INFO 02-03 19:53:53 [serving_completion.py:78] Using default completion sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
(APIServer pid=36148) INFO 02-03 19:53:53 [serving_engine.py:271] "auto" tool choice has been enabled.
(APIServer pid=36148) INFO 02-03 19:53:53 [serving_chat.py:146] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
(APIServer pid=36148) INFO 02-03 19:53:53 [api_server.py:1346] Starting vLLM API server 0 on http://0.0.0.0:7555

Configuring OpenCode:

nano ~/.config/opencode/config.json

By copying:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "vllm_local": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "vLLM (local)",
      "options": {
        "baseURL": "http://127.0.0.1:7555/v1",
        "apiKey": "opencode_local"
      },
      "models": {
        "Qwen/Qwen3-8B-AWQ": {
          "name": "Qwen 3.0 (8B Agent)",
          "tools": {
            "write": true,
            "bash": true,
            "read": true,
            "edit": true
          },
          "limit": {
            "context": 32768,
            "output": 4096
          },
          "supportsToolCalling": true,
          "supportsReasoning": true
        }
      }
    }
  }
}

The OpenCode configuration required a lot of trial and error. Even ChatGPT could not find the right configuration. Moving to QWEN 3.0 and explicitly enabling tools helped move things forward.

Then, running OpenCode. Here is a command that asks how many unit tests are in the repository.