How to (easily) run a fully local, private AI assistant with OpenClaw and QVAC

What an agent harness is, and what OpenClaw does

A language model on its own can only produce text. It cannot open a file, run a command, call an API, or remember what it did five minutes ago. An agent harness is the layer that closes that gap. It takes the model’s text output, turns it into real actions (read this file, run this command, search this folder), feeds the results back to the model, and loops until the task is done. The model is the brain. The harness is the hands.

OpenClaw is one of these harnesses, and right now it is the most widely used one. It crossed a large install base in a few months and sits at the top of agent usage charts. You give it a goal in plain language, and it plans, calls tools, and reports back. It speaks to any model that exposes an OpenAI-compatible API, which is the detail that matters for this guide: you are not locked to a single cloud vendor. You point it at whatever model endpoint you want, including one running on your own machine.

That is the whole idea here. OpenClaw for the agent loop, QVAC for the model, both on hardware you control.

What people use OpenClaw for

The harness is general, so the use cases are broad. The common ones:

Coding tasks. Read a repo, write a function, run the tests, fix what broke. The agent edits files and runs commands directly.
File and system chores. Rename a batch of files, summarize a folder of documents, reorganize a directory, pull a number out of a spreadsheet.
Local research and drafting. Read a set of notes and produce a summary, a draft, or a structured table.
Glue work. Chain a few steps together that would otherwise be a manual sequence: generate something, save it, open it, move it.
Always-on assistants. Run it on a small machine that stays on, and reach it from your laptop or phone.

In every one of these, the agent is taking actions on real files and a real system. Which is exactly why where the model runs starts to matter.

Why local AI is the right default for an agent

When a model only chats, where it runs is mostly a question of privacy. When a model drives an agent that reads your files and runs commands, it becomes a question of control. The agent acts on your machine, and if the model behind it lives in someone else’s datacenter, every file it reads and every command it runs depends on a system you do not own: its uptime, its latency, its terms, and its access policy, any of which can change without you. Running the model locally keeps the agent answerable to your hardware, not to a service that can throttle it, change it, or cut it off.

Running the model locally changes the property, not just the policy:

Your data stays on the machine. The files the agent reads, the code it writes, the commands it runs: none of it leaves your disk to reach the model. There is no prompt log on a server you cannot see.
It works offline. Once the model is downloaded, you can disconnect entirely and the agent keeps working. Enable airplane mode and ask it a question. It still answers.
No per-token bill. The model runs on hardware you already own. A long agent session that would cost real money against a metered API costs nothing locally. And no risk of seeing the bill inflate over time due to rising costs for serving inference at scale.
No silent model swaps. The weights are a file on your disk. They do not change under you between sessions, a practice that some providers run to decrease their cost.

The trade is latency and raw capability: a model that fits on a laptop is smaller than a frontier cloud model, so it is slower and less sharp. For a large class of real work, that trade is worth it, and it keeps getting better as small models improve.

QVAC is the piece that makes the local half practical. It is an open-source SDK and CLI from Tether that runs models on your own device across every major GPU backend (NVIDIA, AMD, Intel, Adreno (Qualcomm) and Mali on Linux, Windows and Android through Vulkan, and Apple Silicon through Metal), and it ships an OpenAI-compatible server. That server is the bridge OpenClaw plugs into.

How it fits together

Three pieces, two of them long-running:

The QVAC server holds the model in memory and answers OpenAI-compatible requests on your local port of your own machine. You start it once and leave it running.
The OpenClaw gateway runs the agent loop. You point it at the QVAC server during setup, then leave it running too.
Your commands go to the gateway. You ask a question or hand it a task, and it works against the local model.

Nothing in this chain calls out to the internet for inference. The model file is downloaded once, then everything happens on the machine.

Which model to run

The setup uses Qwen3-8B, which is the best balance for most laptops. If you have less memory, drop to the 4B. If you have a workstation and want stronger results on multi-step coding tasks, step up to the Qwen3.6-27B multimodal model. To switch, change the model name in the config file (step 2 of the setup) to the constant in the last column.

Use case	Model	Recommended RAM	Required storage	Config name
Light and fast, mostly conversation	Qwen3-4B (4-bit)	8 GB	~3 GB	`QWEN3_4B_INST_Q4_K_M`
Recommended default, good all-round balance	Qwen3-8B (4-bit)	16 GB	~6 GB	`QWEN3_8B_INST_Q4_K_M`
Heavier coding, multi-step agents, and vision	Qwen3.6-27B multimodal (4-bit)	48 GB	~18 GB	`QWEN3_6_27B_MULTIMODAL_Q4_K_XL`

A GPU helps a lot but is not required. QVAC uses Apple Metal on Apple Silicon and Vulkan on NVIDIA, AMD, and Intel. On a CPU-only machine it still runs, just slower. Run qvac doctor to see what your hardware supports.

Set it up

The walkthrough below is pure copy-paste. Pick your operating system in the tabs, run each step in order, and you will have a local coding agent in a few minutes. The only thing that takes real time is the first model download. The setup uses Qwen3-8B quantized at 4 bits (about 4.7 GB), which downloads once and is cached after that.

There are two paths through it, and the block handles both:

If you do not have OpenClaw yet, run the optional install step.
If you already have OpenClaw, skip that one step. Everything else is the same.

Install the QVAC CLI

One command. Brings the qvac tool and the SDK it runs on.

npm install -g @qvac/cli @qvac/sdk

Create a model config

Makes a folder and writes one small config file that tells QVAC which model to serve.

mkdir -p ~/qvac-openclaw && cd ~/qvac-openclaw
cat > qvac.config.json <<'EOF'
{
  "plugins": ["@qvac/sdk/llamacpp-completion/plugin"],
  "serve": {
    "models": {
      "qwen3-8B-Q4-chat": {
        "model": "QWEN3_8B_INST_Q4_K_M",
        "type": "llamacpp-completion",
        "preload": true,
        "config": { "tools": true, "toolsMode": "static", "ctx_size": 16384, "gpu_layers": -1, "reasoning_budget": 0 }
      }
    }
  }
}
EOF

mkdir -p ~/qvac-openclaw && cd ~/qvac-openclaw
cat > qvac.config.json <<'EOF'
{
  "plugins": ["@qvac/sdk/llamacpp-completion/plugin"],
  "serve": {
    "models": {
      "qwen3-8B-Q4-chat": {
        "model": "QWEN3_8B_INST_Q4_K_M",
        "type": "llamacpp-completion",
        "preload": true,
        "config": { "tools": true, "toolsMode": "static", "ctx_size": 16384, "gpu_layers": -1, "reasoning_budget": 0 }
      }
    }
  }
}
EOF

mkdir $HOME\qvac-openclaw; cd $HOME\qvac-openclaw
@'
{
  "plugins": ["@qvac/sdk/llamacpp-completion/plugin"],
  "serve": {
    "models": {
      "qwen3-8B-Q4-chat": {
        "model": "QWEN3_8B_INST_Q4_K_M",
        "type": "llamacpp-completion",
        "preload": true,
        "config": { "tools": true, "toolsMode": "static", "ctx_size": 16384, "gpu_layers": -1, "reasoning_budget": 0 }
      }
    }
  }
}
'@ | Set-Content -Encoding utf8 qvac.config.json

Start the QVAC server

qwen3-8B-Q4-chat is the alias you defined in step 2; it serves Qwen3-8B at 4-bit (Q4_K_M). Run this in the folder you just made and leave this terminal open. The first run downloads the model once (about 4.7 GB), so give it a few minutes. You are ready when you see QVAC API server listening.

qvac serve openai --model qwen3-8B-Q4-chat

Install OpenClaw Optional

Skip this step if you already have OpenClaw. Otherwise, install it once.

npm install -g openclaw

Point OpenClaw at QVAC

Open a new terminal for this. The first command connects OpenClaw to your local server. The next three keep the agent fast and reliable on a local model.

openclaw onboard --auth-choice custom-api-key --custom-base-url http://127.0.0.1:11434/v1 --custom-model-id qwen3-8B-Q4-chat --custom-api-key "qvac" --non-interactive --accept-risk --skip-channels --skip-daemon --skip-search --skip-ui --skip-skills --skip-health

Two values here tie back to step 2. --custom-model-id qwen3-8B-Q4-chat must match the model alias in your config (the key under serve.models). --custom-api-key "qvac" is only a placeholder: the local QVAC server does not require a key, but OpenClaw’s setup needs the field filled, so any non-empty value works.

openclaw config set tools.profile coding
openclaw config set tools.allow '["write","read","exec"]' --strict-json
openclaw config set models.providers.custom-127-0-0-1-11434.timeoutSeconds 600

Start the agent gateway

Open another new terminal and leave this one open too. You are ready when you see ready.

openclaw gateway run

Talk to your local agent

One more new terminal. Ask it anything. The answer comes back from the model running on your own machine. Try turning off your network and asking again.

openclaw agent --agent main --message "Are you running in the cloud or on my machine? Answer in one sentence."

Have it build something Optional

Hand it a real task. This asks the agent to write a small animated web page and open it in your browser, with no further input from you. On a local model it takes a minute or two.

openclaw agent --agent main --message "Create an HTML file at ~/openclaw_lobster.html showing a large lobster emoji at 140px pulsing with a CSS scale animation on a dark #0f1410 background, with the text 'openclaw running on local with QVAC' in teal #16E3C1 monospace below it, fully visible immediately. Then run the shell command: open ~/openclaw_lobster.html"

openclaw agent --agent main --message "Create an HTML file at ~/openclaw_lobster.html showing a large lobster emoji at 140px pulsing with a CSS scale animation on a dark #0f1410 background, with the text 'openclaw running on local with QVAC' in teal #16E3C1 monospace below it, fully visible immediately. Then run the shell command: xdg-open ~/openclaw_lobster.html"

openclaw agent --agent main --message "Create an HTML file at ~/openclaw_lobster.html showing a large lobster emoji at 140px pulsing with a CSS scale animation on a dark #0f1410 background, with the text 'openclaw running on local with QVAC' in teal #16E3C1 monospace below it, fully visible immediately. Then run the shell command: start ~/openclaw_lobster.html"

What you just built, and what to try next

You now have a coding agent running entirely on your own machine. A good first test is to confirm the obvious: ask it whether it is running in the cloud or on your machine, then turn off your network and ask again. It keeps answering, because nothing was ever being sent to a server in the first place.

For something you can see, hand it a small build task and watch it write a file and open it. A coding task like generating a small web page runs in a minute or two on a typical laptop, fully offline, because the model is doing real work locally rather than streaming from a datacenter. The setup block includes a ready-made example: it asks the agent to write a tiny animated web page and open it in your browser, with no further input from you.

From here, the same setup handles the rest of what OpenClaw does. Point it at a project folder, ask it to read and edit code, have it summarize a directory, or give it a multi-step chore. The agent loop is identical. The only thing that changed is that the intelligence behind it is yours.

Notes and limits

A model that fits on a laptop is not a frontier cloud model. Expect it to be slower and to need clearer instructions. For agent work, give it focused, single-goal tasks rather than long open-ended ones.
Depending on your hardware, a local response can take longer than a cloud API today. Treat that as a temporary gap, not a permanent cost: local hardware keeps getting faster and capable models keep getting smaller while holding their quality, so the gap is closing fast. The physics also runs in local's favor. A request to a datacenter is bounded by the speed of light, a round trip no provider can engineer away. For anything that has to react in real time, a robot, a control loop, a live interface, that round trip is a hard floor, and local is the only option that stays reliable.
QVAC is Apache 2.0 and free. The model you run is downloaded once and cached locally. Source and docs: github.com/tetherto/qvac and docs.qvac.tether.io.