Ollama vs LM Studio vs llama.cpp — A No-BS Guide to Running LLMs Locally on macOS
3 minute read
Let’s cut the fluff: if you care about privacy, speed, and having full control over your stack, running LLMs locally is no longer optional — it’s survival. Cloud’s nice until it’s not. Especially when your data is the product and your API bill explodes overnight.
I put this guide together because I wanted LLMs running locally, even with limited hardware — no vendor lock-in, no middlemen sniffing packets, just raw local compute. And yes, I run this stuff daily on a MacBook Air M1 with 16GB RAM. Modest? Yep. Enough? Hell yes.
📊 Ollama vs LM Studio vs llama.cpp — Real Talk
Feature | Ollama | LM Studio | llama.cpp |
---|---|---|---|
Interface | CLI + API | GUI (Electron) | Pure CLI |
Open Source | Yes (MIT, server included) | Nope | Yes (MIT) |
Quantization | GGUF via llama.cpp | GGUF via llama.cpp | Native GGUF |
Model Install | One-liner (`ollama run <model>`) | Manual (GUI) | Manual (CLI/scripts) |
Modelfile Support | Yes (customizable) | No | Yes (limited) |
Customization | Medium | Basic | Full control |
Offline | Yes | Yes | Yes |
Local API | Yes (port 11434) | Yes (via server) | Yes (via server binary) |
RAM Usage (idle) | ~1GB | ~2GB | ~100MB |
Complexity | Low | Baby mode | Medium to high |
Prod Ready | Yes (with tuning) | No | Absolutely |
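To make the "one-liner install" and "local API" rows concrete, here's roughly what the Ollama path looks like (the model name is just an example):

```bash
# Pull and chat with a model in one command
ollama run phi3

# Ollama also exposes a local REST API on port 11434
curl http://localhost:11434/api/generate -d '{
  "model": "phi3",
  "prompt": "Explain recursion in Python.",
  "stream": false
}'
```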
🚀 Why Run It Local?
Here’s why you should stop relying on the cloud when it comes to LLMs:
- 🔐 Privacy — No data leaves your machine. Period.
- ⚡ Low latency — Local means instant, no round trips.
- 💸 Free — No monthly API bloodsucking.
- 🛠️ Customizable — Tweak it all, from tokens to system prompts.
- 🔌 Offline — No signal? No problem.
GGUF quantized models are a game changer. You can run 7B models without melting your Mac. It’s not just viable — it’s smooth.
🔧 Setting Up llama.cpp on macOS (The Fast Way)
What you need:
- macOS 12.6+ (Apple Silicon strongly recommended)
- Homebrew
- Xcode Command Line Tools: `xcode-select --install`
- CMake: `brew install cmake`
Build llama.cpp with Metal support (recent versions enable Metal by default on Apple Silicon and have renamed the option to GGML_METAL, so drop or adjust the flag if CMake complains):

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build && cd build
cmake .. -DLLAMA_METAL=ON
cmake --build . --config Release -j$(sysctl -n hw.logicalcpu)
```
🔧 The `-j$(sysctl -n hw.logicalcpu)` bit uses all your cores, because we don't waste time.
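Once the build finishes, a quick sanity check never hurts (this assumes you're still sitting in the `build/` directory):

```bash
# The binaries land in build/bin/
ls bin/ | grep llama
# Should print the version/build info if everything linked correctly
./bin/llama-cli --version
```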
Download a model (example: Phi-3):

```bash
mkdir models && cd models
curl -LO https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-fp16.gguf
cd ..
```
🧠 Go with a `Q4_K_M` quant for the best balance. Forget fp16/float32 unless you like RAM pain (the fp16 file above alone weighs in around 7–8GB).
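If all you can find is an fp16 GGUF (like the one above), you can shrink it yourself with llama.cpp's quantize tool. A minimal sketch, assuming the build from earlier (older builds name the binary plain `quantize`, and the output filename is your choice):

```bash
# Convert the fp16 GGUF down to Q4_K_M
./bin/llama-quantize \
  models/Phi-3-mini-4k-instruct-fp16.gguf \
  models/Phi-3-mini-4k-instruct-Q4_K_M.gguf \
  Q4_K_M
```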
Run the damn thing
```bash
./bin/llama-cli -m models/Phi-3-mini-4k-instruct-fp16.gguf -p "Explain recursion in Python." -n 200
```
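Once the basic run works, a few flags are worth knowing. The values below are reasonable starting points for an M1 with 16GB, not gospel:

```bash
# Same run with a few knobs exposed (values are examples, tune for your machine)
#   -c     context window size in tokens
#   -ngl   layers to offload to the GPU via Metal (99 = as many as fit)
#   --temp sampling temperature
./bin/llama-cli -m models/Phi-3-mini-4k-instruct-fp16.gguf \
  -p "Explain recursion in Python." -n 200 -c 4096 -ngl 99 --temp 0.7
```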
Need an API? Just start the bundled server (older builds call the binary plain `server`):

```bash
./bin/llama-server -m models/Phi-3-mini-4k-instruct-fp16.gguf
```

Then hit http://localhost:8080.
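A quick smoke test against that server; the `/completion` endpoint is llama.cpp's native one (it also speaks an OpenAI-style `/v1/chat/completions`), and the prompt and parameters here are just examples:

```bash
# Ask the local llama.cpp server for a completion
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain recursion in Python.",
    "n_predict": 200
  }'
```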
🧠 Models That Actually Work on a Mac M1
Model | Size (quantized GGUF, approx.) | Use Case |
---|---|---|
Phi-3-mini-4k | ~2.3GB | General tasks / instruction |
Mistral-7B-Instruct | ~4.2GB | Chat / coding / workflows |
CodeLlama-7B | ~3.8GB | Code assistant |
DeepSeek-Coder-6.7B | ~3.5GB | Dev work / tech-heavy prompts |
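If you'd rather not babysit GGUF files yourself, the same class of models is a one-liner away in Ollama (the tags are examples, check the Ollama library for exact names):

```bash
# Rough Ollama equivalents of the table above
ollama run phi3
ollama run mistral
ollama run codellama:7b
ollama run deepseek-coder:6.7b
```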
🧩 Final Words
Running LLMs locally isn’t hype — it’s rebellion. It’s freedom.
llama.cpp gives you total control. Ollama and LM Studio? Handy, but still wrappers.
You choose the tools. You own the data.
No middleman. No leash.
Want to go deeper? Build your own local RAG pipeline. Add voice. Add vision. Build like it’s 1999 again — but with 2025 firepower.
Welcome to the resistance.