
Gemma + Ollama + OpenClaw: Local AI on Mac in 15 Minutes

Key Takeaways:
  • Gemma 4 8B runs comfortably on any Apple Silicon Mac with 16 GB or more of unified memory — installation takes about 15 minutes from a fresh machine using just Homebrew and a single Ollama pull command
  • Metal GPU acceleration is automatic on Apple Silicon — Ollama offloads roughly 86% of inference to the GPU without any configuration, giving you 30–60 tokens per second on an M2 Pro and faster on M3/M4/M5 chips
  • Gemma 4 ships with a 131,072-token context window (about 350 pages of text), enough to feed an entire codebase or document collection into a single conversation without RAG plumbing
  • Connecting OpenClaw to Ollama takes one short entry in your model config — once wired up, Gemma behaves like any other model in the OpenClaw model registry and works with all standard skills
  • A hybrid setup that routes simple tasks to local Gemma and complex reasoning to Claude Sonnet via subagents typically cuts your monthly API spend by 60–80% while keeping sensitive prompts on-device
  • Once Gemma is loaded, ongoing costs are zero — no API keys, no token billing, no rate limits — making it the cheapest production-grade option for AI agents you run all day every day

You can run a fully local AI agent on any Apple Silicon Mac in about 15 minutes: install Ollama via Homebrew, pull the Gemma 4 model with one command, and point your OpenClaw config at http://localhost:11434. No cloud, no API keys, no token bills — just an agent that lives entirely on your own hardware.

This guide walks through every step, including the gotchas that catch most people on their first try. By the end you will have a working local OpenClaw agent backed by Gemma 4, and you will know how to extend it with a hybrid local-plus-cloud routing strategy for the few tasks where Gemma is not the right model.

---

Why Run Gemma Locally

Three reasons consistently come up for moving inference on-device.

Privacy. Anything you send to a cloud LLM provider — client emails, internal notes, source code under NDA, financial figures — leaves your machine and is processed on hardware you do not control. For most personal use this is acceptable; for regulated work or sensitive client data it often is not. A local model means the prompt and the response never leave the box you typed them on.

Offline reliability. Cloud providers go down. API keys get rotated mid-flight. Rate limits hit during the demo. A local model has none of these failure modes. If your Mac is on, your agent is responsive.

Zero per-token cost. A heavy OpenClaw user routing everything through Claude Sonnet can spend $30–80 per month on API fees. A local Gemma model costs you the electricity to run your laptop — call it $1–3 per month, depending on your utility rate. Over a year, that is the price of the Mac itself.

These are real benefits, but they come with a tradeoff: Gemma 4 8B is excellent for most everyday agent work and not as strong as Claude Sonnet 4.5 for the hardest reasoning tasks. The hybrid setup later in this article lets you keep both.

---

What You Need

This guide is written for Apple Silicon. Intel Macs technically work but are slow enough that we do not recommend them for production agent use.

Hardware:
  • Apple Silicon Mac (M1, M2, M3, M4, or M5 — any generation)
  • 16 GB unified memory minimum (24 GB or more is comfortable for running other apps alongside)
  • ~12 GB of free disk space for the model and Ollama itself
Software:
  • macOS Sonoma 14 or newer (Sequoia or Tahoe recommended)
  • Homebrew (the install command below assumes it is already on your machine — if not, head to brew.sh first)
  • An existing OpenClaw installation, or willingness to install one as part of this walkthrough
If you are on a Mac mini in particular, our Mac Mini setup guide has additional notes on running OpenClaw as a 24/7 service via launchd.

---

Step 1. Install Ollama

Ollama is the runtime that loads and serves local LLMs. It handles model downloads, GPU acceleration, and the HTTP API that OpenClaw will talk to. Install it with one Homebrew command:

bash
brew install ollama
brew services start ollama

The second command starts the Ollama background service and configures it to launch at boot. You can verify it is running:

bash
curl http://localhost:11434

A response of Ollama is running means you are good. If the command hangs or fails, check brew services list and restart if needed.
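
If the service is stopped, these two commands show its state and bring it back up (ollama is the service name the Homebrew formula registers):

bash
brew services list
brew services restart ollama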

---

Step 2. Download Gemma 4

Gemma 4 is Google DeepMind's open-weights model family, released under the Gemma Terms of Use. The 8B variant is the right size for an Apple Silicon Mac with 16 GB of memory — large enough to handle real reasoning tasks, small enough to leave headroom for the rest of your system.

Pull the model:

bash
ollama pull gemma:8b

The download is roughly 9.6 GB and takes 5–15 minutes depending on your connection. The model is stored in ~/.ollama/models/ and persists across reboots — you only do this once.
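
To confirm the pull landed, ollama list prints every local model along with its size:

bash
ollama list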

When the download finishes, run a quick sanity check:

bash
ollama run gemma:8b "Write a one-sentence haiku about local AI."

If you see a coherent response within a few seconds, the model is loaded and inference is working.

---

Step 3. Verify GPU Acceleration

Metal GPU acceleration on Apple Silicon is automatic — Ollama detects the GPU on launch and offloads inference layers to it without any flags. But it is worth confirming that it is actually happening, because CPU-only inference is roughly 10x slower and will make your agent feel sluggish.

While Gemma is generating a response, run this in a second terminal:

bash
ollama ps

Look at the PROCESSOR column. You want to see 100% GPU or close to it (a small CPU percentage is normal for the first few layers). If you see 100% CPU, something is wrong with Metal — usually a too-old macOS version or a corrupted Ollama install. The fix is almost always brew reinstall ollama followed by brew services restart ollama.

On an M2 Pro you should see roughly 30–60 tokens per second on Gemma 8B. On M3 and M4 chips, expect 50–90 tokens per second. M5 numbers are higher still but vary by configuration.
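
To put a number on your own machine, ollama run accepts a --verbose flag that prints timing statistics after each response, including an eval rate line in tokens per second:

bash
ollama run gemma:8b --verbose "Explain unified memory in two sentences."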

---

Step 4. Connect OpenClaw to Ollama

OpenClaw talks to LLMs through its model registry — a JSON or YAML file that lists each model along with its endpoint, API key (if any), and capabilities. To add Gemma, edit your models.json (or the equivalent file in your installation) and append a new entry:

json
{
  "name": "gemma-local",
  "provider": "openai-compatible",
  "base_url": "http://localhost:11434/v1",
  "model_id": "gemma:8b",
  "api_key": "ollama",
  "context_window": 131072
}

A few notes on this config:

  • provider: openai-compatible is the trick — Ollama exposes an OpenAI-compatible chat completions endpoint at /v1/chat/completions, so OpenClaw treats it like any other OpenAI-style provider.
  • api_key: ollama is a placeholder. Ollama does not require authentication, but the OpenAI client library that OpenClaw uses expects the field to be non-empty.
  • context_window: 131072 unlocks the full 128K context that Gemma 4 supports. Without this, OpenClaw will default to a much smaller window.
Reload OpenClaw (pm2 restart openclaw or your equivalent), and check the logs for a line confirming gemma-local is registered.
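
If the registration fails, it is worth checking that the OpenAI-compatible endpoint responds outside OpenClaw. This curl call hits the same URL the config points at; the prompt is just a throwaway:

bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma:8b",
    "messages": [{"role": "user", "content": "Reply with the single word: ready"}]
  }'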

---

Step 5. Assign Gemma to Your Agent

In your agent's soul.md or agent config file, point the model field at the new entry:

yaml
name: "My Local Agent"
model: "gemma-local"
skills:
  - file-operations
  - calendar
  - notes

Send the agent a message in Telegram (or whatever messenger you have wired up) and watch the logs. You should see the request hitting localhost:11434, the model generating a response, and the agent replying — all without a single packet leaving your network.

If the response is slow on the first message, that is expected. Ollama lazy-loads the model on first request and keeps it in memory afterward; subsequent messages will be much faster.
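
By default Ollama unloads an idle model after a few minutes. If you want Gemma resident all day, Ollama's documented keep_alive parameter handles that: an empty generate request with keep_alive set to -1 loads the model and keeps it in memory until you unload it or restart the service:

bash
curl http://localhost:11434/api/generate -d '{"model": "gemma:8b", "keep_alive": -1}'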

---

Using the 128K Context Window

One of the underrated wins of Gemma 4 over older local models is the 131,072-token context window — roughly 350 pages of text in one prompt. For an agent this changes what is practical:

  • Whole-codebase questions. "Why does this function fail when called from the worker?" — feed the entire repo and let Gemma trace it.
  • Long-form research summaries. Feed a stack of PDFs or a year of emails into the prompt directly, no RAG required.
  • Multi-step planning. A 50-step plan with full context on each step fits in one conversation.
The practical limit is memory. A 128K context fully loaded uses about 12 GB of additional RAM beyond the model itself, so on a 16 GB Mac you will start swapping if you push the window to its limit. On 24 GB or 32 GB systems it is comfortable.
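
As a rough sketch of what a whole-directory prompt looks like against the Ollama API (the notes/ path is a placeholder, jq is assumed to be installed for safe JSON escaping, and num_ctx is raised explicitly because Ollama's default context is much smaller than 128K):

bash
jq -n --arg prompt "Summarize the key decisions in these notes: $(cat notes/*.md)" \
  '{model: "gemma:8b", prompt: $prompt, stream: false, options: {num_ctx: 131072}}' \
  | curl -s http://localhost:11434/api/generate -d @-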

---

Hybrid Setup: Local Plus Cloud

Gemma 8B is excellent for the bulk of everyday agent work — message drafting, calendar lookups, file operations, news summaries, simple code edits. Where it is not as strong as Claude Sonnet 4.5 is the hardest reasoning: complex code generation, multi-step planning across many constraints, or tasks that benefit from the largest available model.

The right answer is not to pick just one. OpenClaw's subagent system lets you route tasks dynamically.

A typical hybrid config looks like this:

yaml
name: "My Hybrid Agent"
model: "gemma-local"
subagents:
  - name: "deep-thinker"
    model: "claude-sonnet-4.5"
    delegate_when:
      - "complex code generation"
      - "multi-step planning with 5+ dependencies"
      - "tasks requiring web search synthesis"

The main agent runs on local Gemma and handles 90% of the work. When it hits a task it judges to be beyond its strengths, it delegates to the Claude subagent, which fires off a single API call. The user sees one continuous conversation.

In practice this routing typically cuts API spend by 60–80% versus running everything through Claude — you only pay for the hard cases. And the prompts that hit Claude are usually the ones with less personal context (code snippets, planning steps), while the ones carrying sensitive data tend to stay local. Your API bill and your privacy exposure both go down.

---

Common Problems and Fixes

  • Symptom: First response is slow, then everything is fast. Cause: Ollama lazy-loads the model on first request. Normal — not a problem.
  • Symptom: Memory pressure warnings; system slowdown. Cause: You have other heavy apps open, or you are pushing context near the 128K limit on a 16 GB machine. Close apps, or upgrade to 24 GB+ if this is your daily setup.
  • Symptom: OpenClaw says it cannot reach the model. Cause: Usually brew services start ollama was never run, or it stopped after a system update. brew services list will tell you. Also check that nothing else has bound to port 11434.
  • Symptom: Responses are short and feel like the model is confused. Cause: You may have pulled gemma:2b instead of gemma:8b. The 2B model is too small for serious agent work. Re-pull the right size.
  • Symptom: Agent works in a direct chat but not when called from a skill. Cause: Some skills set their own model parameter that overrides the agent default. Check the skill's config and switch its model to gemma-local too, or leave it on Claude if it is one of the cases where you want cloud reasoning.
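
For the cannot-reach-the-model case, a standard lsof invocation shows what, if anything, is listening on Ollama's port:

bash
lsof -nP -iTCP:11434 -sTCP:LISTEN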

---

Can OpenClaw Use Gemma Without an Internet Connection?

Yes — once Ollama and Gemma are downloaded, the entire stack runs offline. The agent will not be able to use skills that themselves require the internet (web search, weather, etc.), but file operations, calendar, notes, code analysis, and any locally-scoped tooling work without a network connection. This is one of the strongest arguments for the local setup if you travel or work from places with unreliable internet.
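
If you want to verify this yourself, toggle Wi-Fi off from the terminal and re-run the sanity check (networksetup ships with macOS; en0 is the usual Wi-Fi interface on Apple Silicon laptops, but yours may differ):

bash
networksetup -setairportpower en0 off
ollama run gemma:8b "Confirm you can answer with no network connection."
networksetup -setairportpower en0 on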

---

Is Local Gemma Good Enough to Replace Claude?

For 80–90% of everyday agent use, yes. For the hardest reasoning tasks — complex code generation, deep multi-step planning, anything that benefits from a frontier model — Claude Sonnet 4.5 is still meaningfully better. The hybrid setup described above is the practical answer for most users: local for daily volume, cloud for the few cases that genuinely need it.

---

When to Bring in Help

This setup is well within reach for anyone comfortable in a terminal. The pieces — Homebrew, Ollama, an OpenClaw config file — are straightforward, and the gotchas section above covers the failure modes we see most often.

That said, if you would rather skip the configuration entirely, our Install service covers the local Ollama setup as a $39 add-on to a standard OpenClaw install. We pre-pull the right model size for your hardware, wire up the OpenClaw model registry, configure the hybrid subagent routing, and hand you a working local-plus-cloud agent ready for daily use. Same-day turnaround.

But if you are reading this article, you probably do not need that. Run the five commands, edit two config files, and you have a fully local AI agent on your Mac.

About the author

Alex Werner

Founder of OpenClaw Install. 5+ years in DevOps and AI infrastructure. Helped 50+ clients deploy AI agents.
