Run Gemma 4 on Your Mac: Getting Started with Local AI via Ollama
How to run Google Gemma 4 locally on your Mac
April 4, 2026 · Tutorial
Every time I use ChatGPT or Claude, a thought nags at me: “This conversation is probably stored on a server somewhere.” When I’m summarizing personal notes, analyzing work code, or processing sensitive documents, that matters. So I decided to run AI directly on my Mac.
Google’s Gemma 4 is an open-source model released under the Apache 2.0 license. It comes in four sizes: 31B dense, 26B MoE, E4B, and E2B. The E2B and E4B variants are designed to run on edge devices like laptops. With a tool called Ollama, you can launch them with a single terminal command.
This guide covers everything: Ollama installation, running Gemma 4, comparing model sizes, and troubleshooting. Even if you’ve never used a terminal, you can follow along step by step.
- Gemma 4 models: 31B dense, 26B MoE, E4B, E2B (edge-optimized)
- Install Ollama: `brew install ollama` or download from ollama.com
- Run: `ollama run gemma4:4b` (exact model names may change; check ollama.com/library)
- Minimum: Apple M1+, 8GB RAM (for E2B/E4B)
Why run AI locally?
The biggest advantage of local execution is privacy. Your input data never leaves your machine. It’s not transmitted to a server, not stored, and not used as training data. For processing internal company code or personal documents, this is the safest option.
Second, it’s free. No API charges. Just electricity. Third, it works offline. Once the model is downloaded, you can use AI on a plane, in a coffee shop without WiFi, or anywhere else.
The trade-off: it’s slower than cloud APIs. On a MacBook Pro M3, the E4B model generates roughly 15-25 tokens per second. That’s noticeably slower than ChatGPT or Claude, and complex reasoning tasks show a performance gap. But for simple questions, text cleanup, and code reviews, it’s more than enough.
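To get a feel for what those speeds mean in practice, here's a quick bit of arithmetic using the 15-25 tokens/s range quoted above (measured on an M3; your numbers will vary):

```python
# Rough wait-time arithmetic for local generation, using the
# 15-25 tokens/s range quoted above (assumed figures, not a benchmark).

def wait_time_seconds(n_tokens: int, tokens_per_sec: float) -> float:
    """Seconds needed to generate n_tokens at a given rate."""
    return n_tokens / tokens_per_sec

# A ~300-token answer, roughly one screen of text:
fast = wait_time_seconds(300, 25)  # best case
slow = wait_time_seconds(300, 15)  # worst case
print(f"~{fast:.0f}-{slow:.0f} seconds")  # prints "~12-20 seconds"
```

So a screen-length answer takes on the order of 12-20 seconds, which matches the "slower than cloud, but fine for everyday tasks" experience described here.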
Gemma 4 model sizes compared
| Model | Parameters | Architecture | Min RAM | Best for |
|---|---|---|---|---|
| Gemma 4 E2B | 2B | Dense | 8GB | Light testing, embedded |
| Gemma 4 E4B | 4B | Dense | 8GB | General use (recommended) |
| Gemma 4 26B | 26B | MoE | 16GB | Coding, analysis |
| Gemma 4 31B | 31B | Dense | 32GB | High-performance tasks |
Start with the E4B model. It handles most general questions well. If you have 16GB+, try the 26B MoE model too.
The "E" in E2B/E4B stands for Edge: models designed for edge devices like phones and laptops. MoE (Mixture of Experts) activates only a subset of parameters per inference, so the 26B model uses less compute than its parameter count suggests.
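The RAM column in the table above can be sanity-checked with some back-of-envelope math. This sketch assumes 4-bit quantization (roughly 0.5 bytes per parameter, a common default for Ollama builds); treat the results as estimates, not official figures:

```python
# Rough weight-memory estimates for the models in the table above.
# Assumes 4-bit quantization (~0.5 bytes per parameter), a common
# default for Ollama model builds; real files add some overhead.

BYTES_PER_PARAM_Q4 = 0.5  # 4-bit weights, in bytes per parameter

def weight_gb(params_billion: float) -> float:
    """Approximate weight size in GB for a 4-bit quantized model."""
    return params_billion * BYTES_PER_PARAM_Q4

for name, params in [("E2B", 2), ("E4B", 4), ("26B MoE", 26), ("31B dense", 31)]:
    print(f"{name:9s} ~{weight_gb(params):4.1f} GB of weights")
```

The minimum-RAM figures in the table are higher than the raw weight sizes because the KV cache and the OS need headroom. Note also that MoE saves compute, not memory: the 26B model still loads all ~13 GB of weights even though only a subset is active per token, which is why it needs 16GB+ RAM.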
Step 1: Install Ollama
Ollama is a free, open-source tool for running LLMs locally. No Docker required. It supports Mac, Windows, and Linux.
Option 1: Install via Homebrew (recommended)
```shell
# Install Homebrew first, if you don't already have it
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Install Ollama
brew install ollama

# Verify installation
ollama --version
# Expected output: ollama version 0.6.x
```
Option 2: Download from the official site
Head to ollama.com and download the Mac installer. The CLI is added automatically when you install the app.
Step 2: Download and run Gemma 4
With Ollama installed, open your terminal and type one line.
```shell
ollama run gemma4:4b
# First run downloads the model (~2-3GB, takes 1-5 min)
# pulling manifest...
# pulling abc123def456... 100%
# verifying sha256 digest...
# writing manifest...
# success

# Chat starts immediately after download
>>> What's the capital of South Korea?
The capital of South Korea is Seoul.
```
Ollama model names can change over time. If `gemma4:4b` doesn't work, run `ollama list` to see what's already on your machine, or check ollama.com/library for the current tags.
You can run other sizes the same way.
```shell
# E2B (needs 8GB RAM)
ollama run gemma4:2b

# 26B MoE (needs 16GB+ RAM)
ollama run gemma4:26b

# 31B dense (needs 32GB+ RAM)
ollama run gemma4:31b

# Exit the chat
>>> /bye
```
Step 3: Put it to work
Once the model is running, use it like any chat interface. Here are some practical examples.
```shell
# Code review
>>> Review this Python code:
def calc(x,y): return x+y if x>0 else x-y

# Summarization
>>> Summarize this document in 3 bullet points: (paste text)

# Translation
>>> Translate this email to Korean: Dear team...

# Set a system prompt (advanced)
>>> /set system You are a Python code review expert.
```
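For reference, the `calc` one-liner in the review example really does have issues worth flagging: a cryptic name, no documentation, and a dense conditional expression. A hand-written cleanup (written here by me, not actual model output) might look like:

```python
def calc(x: float, y: float) -> float:
    """Add y when x is positive, otherwise subtract it.

    Cleaned-up version of the one-liner from the review example:
    descriptive structure, type hints, and an explicit branch.
    """
    if x > 0:
        return x + y
    return x - y

print(calc(2, 3))   # x > 0, so addition: prints 5
print(calc(-2, 3))  # x <= 0, so subtraction: prints -5
```

If the model's review mentions these same points, you know it's earning its keep.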
With E4B, simple questions get answers in 1-3 seconds. Longer text generation takes 10-30 seconds. Slower than cloud APIs, but free and private.
Ollama also supports an API server mode. Run `ollama serve` in a separate terminal, and you can send API requests to `http://localhost:11434`. Useful if you want to integrate local AI into an app you're building.
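As a minimal sketch of what that integration looks like, here's a Python client using only the standard library. The endpoint and JSON shape follow Ollama's documented `/api/generate` route; the model tag is the same placeholder used throughout this guide and may differ on your machine:

```python
# Minimal client for Ollama's local REST API, standard library only.
# Requires `ollama serve` running in another terminal.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model: str, prompt: str) -> dict:
    # stream=False returns one JSON object instead of a token stream
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """Send a prompt to the local Ollama server and return its reply."""
    data = json.dumps(build_payload(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (uncomment with the server running):
# print(generate("gemma4:4b", "Say hello in one sentence."))
```

Because everything goes through localhost, the same privacy guarantees apply: prompts sent this way never leave your machine.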
Troubleshooting
“Error: model not found”
```shell
# See which models are already on your machine
ollama list

# The Ollama CLI has no search command; browse ollama.com/library
# in a browser to find the correct tag, then run it:
ollama run [model-name-from-library]
```
Running too slowly
If you’re running out of RAM, the system swaps to disk and performance tanks. Check Activity Monitor for memory pressure. If it’s yellow or red, switch to a smaller model. Drop from 26B to E4B, or from E4B to E2B.
Removing downloaded models
```shell
# Remove a model to free disk space
ollama rm gemma4:4b

# See all downloaded models
ollama list
```
Alternatives to Ollama
LM Studio has a graphical interface, great for people who aren't comfortable with terminals. Model browsing, downloading, and chatting all happen through a GUI. Free at lmstudio.ai.
With llama.cpp, you build from C++ source and run models directly. It offers fine-grained performance tuning and quantization controls, and it's best for developers who want maximum control. If you're not a developer, Ollama is much simpler.
All three tools run the same underlying models, so output quality is comparable at the same quantization. Ease of use ranks: Ollama > LM Studio > llama.cpp. Customizability is the reverse. For first-timers, go with Ollama.
FAQ
Q. What is Ollama?
A free, open-source tool for running LLMs locally. One terminal command to download and run any supported model. Works on Mac, Windows, and Linux.
Q. How much RAM do I need?
E2B/E4B: 8GB minimum. 26B MoE: 16GB+. 31B dense: 32GB+. Apple Silicon's unified memory handles this workload more efficiently than Intel Macs do.
Q. Do I need internet?
Only for the initial download. After that, it runs completely offline.
Q. Is it better than ChatGPT or Claude?
For raw performance, no. The E4B model is good for simple tasks, but complex reasoning is weaker. The trade-off: it’s free and your data stays on your machine.
Q. Are there alternatives to Ollama?
LM Studio (GUI-based) and llama.cpp (build from source). LM Studio if you dislike terminals, llama.cpp if you want performance tuning. All run the same models.
Final thoughts
Running AI on your Mac is easier than you'd think. Install Ollama, type `ollama run gemma4:4b`, and you're done. Performance won't match ChatGPT, but if privacy matters or you want to avoid API costs, it's a solid option.
Open-source AI models are improving fast. A year ago, running AI on a laptop was a novelty. Now it’s practical. And after Gemma 4, better models will keep coming. Once Ollama is installed, trying a new model is always just one command away.
Published April 4, 2026. Gemma 4 specs based on Google's official announcement. Ollama model names may change over time; check ollama.com/library for the latest. Speed benchmarks are approximate, based on a MacBook Pro M3 with 16GB RAM.
GoCodeLab covers AI tools with firsthand experience. Subscribe for more updates.