
Run Gemma 4 on Your Mac — Getting Started with Local AI via Ollama

How to run Google Gemma 4 locally on your Mac

April 4, 2026 · Tutorial

Every time I use ChatGPT or Claude, a thought nags at me: “This conversation is probably stored on a server somewhere.” When I’m summarizing personal notes, analyzing work code, or processing sensitive documents, that matters. So I decided to run AI directly on my Mac.

Google’s Gemma 4 is an open-source model released under the Apache 2.0 license. It comes in four sizes: 31B dense, 26B MoE, E4B, and E2B. The E2B and E4B variants are designed to run on edge devices like laptops. With a tool called Ollama, you can launch them with a single terminal command.

This guide covers everything: Ollama installation, running Gemma 4, comparing model sizes, and troubleshooting. Even if you’ve never used a terminal, you can follow along step by step.

Key takeaways

– Gemma 4 models: 31B dense, 26B MoE, E4B, E2B (edge-optimized)
– Install Ollama: `brew install ollama` or download from ollama.com
– Run: `ollama run gemma4:4b` (exact model names may change — check ollama.com/library)
– Minimum: Apple M1+, 8GB RAM (for E2B/E4B)


Why run AI locally?

The biggest advantage of local execution is privacy. Your input data never leaves your machine. It’s not transmitted to a server, not stored, and not used as training data. For processing internal company code or personal documents, this is the safest option.

Second, it’s free. No API charges. Just electricity. Third, it works offline. Once the model is downloaded, you can use AI on a plane, in a coffee shop without WiFi, or anywhere else.

The trade-off: it’s slower than cloud APIs. On a MacBook Pro M3, the E4B model generates roughly 15-25 tokens per second. That’s noticeably slower than ChatGPT or Claude, and complex reasoning tasks show a performance gap. But for simple questions, text cleanup, and code reviews, it’s more than enough.

Gemma 4 model sizes compared

| Model | Parameters | Architecture | Min RAM | Best for |
|---|---|---|---|---|
| Gemma 4 E2B | 2B | Dense | 8GB | Light testing, embedded |
| Gemma 4 E4B | 4B | Dense | 8GB | General use (recommended) |
| Gemma 4 26B | 26B | MoE | 16GB | Coding, analysis |
| Gemma 4 31B | 31B | Dense | 32GB | High-performance tasks |
Recommended for MacBook Air/Pro M1-M3 with 8GB:
Start with the E4B model. It handles most general questions well. If you have 16GB+, try the 26B MoE model too.
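The table above folds into a quick rule of thumb. Here is a small Python sketch of that decision (the RAM thresholds come from the table; the Ollama tags are the ones assumed throughout this guide and may differ in practice):

```python
def pick_gemma_model(ram_gb: int) -> str:
    """Suggest a Gemma 4 Ollama tag based on available RAM.

    Thresholds mirror the sizing table: 8GB for E2B/E4B,
    16GB for the 26B MoE, 32GB for the 31B dense model.
    """
    if ram_gb >= 32:
        return "gemma4:31b"  # 31B dense: high-performance tasks
    if ram_gb >= 16:
        return "gemma4:26b"  # 26B MoE: coding, analysis
    if ram_gb >= 8:
        return "gemma4:4b"   # E4B: recommended general use
    return "gemma4:2b"       # E2B: light testing, embedded

print(pick_gemma_model(8))   # gemma4:4b
```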

The “E” in E2B/E4B stands for Edge — models designed for edge devices like phones and laptops. MoE (Mixture of Experts) activates only a subset of parameters per inference, so the 26B model uses less compute than its parameter count suggests.
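To make the MoE idea concrete, here is a toy sketch of top-k expert routing in Python. All numbers are illustrative, not Gemma's actual architecture: a gate scores every expert, but only the best k run for a given token, so most parameters sit idle on any single inference step.

```python
import math

def softmax(scores):
    """Turn raw gate scores into mixing weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(gate_scores, k=2):
    """Return indices of the k experts with the highest gate scores."""
    ranked = sorted(range(len(gate_scores)),
                    key=lambda i: gate_scores[i], reverse=True)
    return sorted(ranked[:k])

# 8 experts in the layer, but only 2 are activated for this token
gate = [0.1, 2.3, -0.5, 1.7, 0.0, 0.9, -1.2, 0.4]
active = route_top_k(gate, k=2)
weights = softmax([gate[i] for i in active])  # mix only the active experts
print(active)  # [1, 3]
```

This is why a 26B MoE model can need less compute per token than a 26B dense one: only the routed experts' weights participate in each forward pass.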

Step 1: Install Ollama

Ollama is a free, open-source tool for running LLMs locally. No Docker required. It supports Mac, Windows, and Linux.

Option 1: Install via Homebrew (recommended)

# Install Homebrew first if you don’t have it
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Install Ollama
brew install ollama

# Verify installation
ollama --version
# Expected output: ollama version 0.6.x

Option 2: Download from the official site

Head to ollama.com and download the Mac installer. The CLI is added automatically when you install the app.

Step 2: Download and run Gemma 4

With Ollama installed, open your terminal and type one line.

# Download + run Gemma 4 E4B
ollama run gemma4:4b

# First run downloads the model (~2-3GB, takes 1-5 min)
# pulling manifest…
# pulling abc123def456… 100%
# verifying sha256 digest…
# writing manifest…
# success

# Chat starts immediately after download
>>> What’s the capital of South Korea?
The capital of South Korea is Seoul.

Model names may change
Ollama model names can change over time. If `gemma4:4b` doesn’t work, run `ollama list` to see the models you’ve already downloaded, or check ollama.com/library for the latest names.

You can run other sizes the same way.

# E2B (lightest)
ollama run gemma4:2b

# 26B MoE (needs 16GB+ RAM)
ollama run gemma4:26b

# 31B Dense (needs 32GB+ RAM)
ollama run gemma4:31b

# Exit the chat
>>> /bye

Step 3: Put it to work

Once the model is running, use it like any chat interface. Here are some practical examples.

# Code review
>>> Review this Python code:
def calc(x,y): return x+y if x>0 else x-y

# Summarization
>>> Summarize this document in 3 bullet points: (paste text)

# Translation
>>> Translate this email to Korean: Dear team…

# Set a system prompt (advanced)
>>> /set system You are a Python code review expert.

With E4B, simple questions get answers in 1-3 seconds. Longer text generation takes 10-30 seconds. Slower than cloud APIs, but free and private.

Ollama also supports an API server mode. Run `ollama serve` in a separate terminal, and you can send API requests to `http://localhost:11434`. Useful if you want to integrate local AI into an app you’re building.
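As a minimal sketch of that API mode, here is a Python snippet using only the standard library against Ollama's default `/api/generate` endpoint. The `gemma4:4b` tag is the one assumed throughout this guide; the network call is wrapped so the script degrades gracefully if the server isn't running:

```python
import json
import urllib.request

def build_payload(model: str, prompt: str) -> dict:
    # "stream": False asks for one complete JSON reply
    # instead of a token-by-token stream
    return {"model": model, "prompt": prompt, "stream": False}

def ask_ollama(prompt: str, model: str = "gemma4:4b",
               url: str = "http://localhost:11434/api/generate") -> str:
    data = json.dumps(build_payload(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    try:
        print(ask_ollama("Say hello in one word."))
    except OSError:
        print("Ollama server not reachable — run `ollama serve` first.")
```

Swap the prompt and model tag for whatever you pulled; the same pattern works from any language that can POST JSON.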

Troubleshooting

“Error: model not found”

# List the models you've already downloaded
ollama list

# Browse ollama.com/library for available Gemma tags,
# then run with the exact name shown there
ollama run [model-name-from-library]

Running too slowly
If you’re running out of RAM, the system swaps to disk and performance tanks. Check Activity Monitor for memory pressure. If it’s yellow or red, switch to a smaller model. Drop from 26B to E4B, or from E4B to E2B.
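A back-of-the-envelope way to predict memory pressure before downloading: a quantized model's weights take roughly parameters × bits-per-weight ÷ 8 bytes (the bit width here is an assumption; 4-bit quantization is common for local models, and it matches the ~2-3GB download mentioned earlier for the 4B model). KV cache and the OS add overhead on top.

```python
def model_size_gb(params_billion: float, bits_per_weight: int = 4) -> float:
    """Rough weight-only footprint of a quantized model, in GB.

    Ignores KV cache and runtime overhead, so treat the result
    as a lower bound on the RAM you need free.
    """
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

print(round(model_size_gb(4), 1))   # 2.0  -> fits in 8GB with headroom
print(round(model_size_gb(26), 1))  # 13.0 -> why the 26B model wants 16GB+
```

If the estimate plus a few GB of overhead exceeds your RAM, expect swapping and pick a smaller model.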

Removing downloaded models

# Delete a specific model
ollama rm gemma4:4b

# See all downloaded models
ollama list

Alternatives to Ollama

LM Studio — has a graphical interface, great for people who aren’t comfortable with terminals. Model browsing, downloading, and chatting all happen through a GUI. Free at lmstudio.ai.

llama.cpp — build from C++ source and run directly. Offers fine-grained performance tuning and quantization controls. Best for developers who want maximum control. If you’re not a developer, Ollama is much simpler.

All three tools can run the same underlying model weights, so output quality is essentially the same; they differ in interface and tuning options. Ease of use ranks: Ollama > LM Studio > llama.cpp. Customizability is the reverse. For first-timers, go with Ollama.

FAQ

Q. What is Ollama?
A free, open-source tool for running LLMs locally. One terminal command to download and run any supported model. Works on Mac, Windows, and Linux.

Q. How much RAM do I need?
E2B/E4B: 8GB minimum. 26B MoE: 16GB+. 31B dense: 32GB+. Apple Silicon’s unified memory is more efficient than Intel for this workload.

Q. Do I need internet?
Only for the initial download. After that, it runs completely offline.

Q. Is it better than ChatGPT or Claude?
For raw performance, no. The E4B model is good for simple tasks, but complex reasoning is weaker. The trade-off: it’s free and your data stays on your machine.

Q. Are there alternatives to Ollama?
LM Studio (GUI-based) and llama.cpp (build from source). LM Studio if you dislike terminals, llama.cpp if you want performance tuning. All run the same models.

Final thoughts

Running AI on your Mac is easier than you’d think. Install Ollama, type `ollama run gemma4:4b`, and you’re done. Performance won’t match ChatGPT, but if privacy matters or you want to avoid API costs, it’s a solid option.

Open-source AI models are improving fast. A year ago, running AI on a laptop was a novelty. Now it’s practical. And after Gemma 4, better models will keep coming. Once Ollama is installed, trying a new model is always just one command away.

Official Sources

Published April 4, 2026. Gemma 4 specs based on Google’s official announcement. Ollama model names may change over time — check ollama.com/library for the latest. Speed benchmarks are approximate, based on a MacBook Pro M3 with 16GB RAM.

GoCodeLab covers AI tools with firsthand experience. Subscribe for more updates.

Lazy Developer Series
Like reading AI reviews? I also build apps with AI. Here is how I built a SaaS in 7 days with Claude.

EP.04: I Built a SaaS in 7 Days →