> **Bottom line:** Google's newly released Gemma 4 (9B) model natively runs on Apple Silicon at 85 tokens per second via the MLX framework, completely bypassing Apple's cloud infrastructure and third-party API subscriptions.
While Apple attempts to funnel users toward its proprietary Apple Intelligence cloud tiers, open-source engineers have proven that an M3 Mac with 18GB of RAM can handle advanced reasoning offline.
If you are paying $20 a month for AI, you are funding a server farm for tasks your laptop can already execute locally.
I cancelled my $20-a-month AI subscriptions last Tuesday.
Not because I stopped needing LLMs to write shell scripts or debug my Terraform configs, but because I finally ran the numbers on what my M3 MacBook Pro was actually capable of doing while sitting on an airplane with no Wi-Fi.
I spent the last two years blindly paying OpenAI and Anthropic, convinced my laptop couldn't handle real engineering tasks.
We've all been conditioned to believe that serious AI requires server farms the size of Rhode Island, so we dutifully pay for the privilege of pinging their data centers every time we need a regex explanation.
But something shifted last week when Google released Gemma 4, and the open-source community immediately ported it to Apple's MLX framework.
The results are staggering, and they expose a massive conflict of interest in Apple's current product strategy.
**Your Mac is secretly an AI supercomputer**, and the industry is heavily incentivized to make sure you never realize it.
Let's back up to what happened at WWDC when Apple first announced Apple Intelligence.
The pitch was seamless integration: your phone and your Mac would handle the "small" requests locally, while the heavy lifting would be securely offloaded to Apple's Private Cloud Compute.
It sounded like a brilliant architectural compromise for hardware that couldn't keep up with massive parameter counts.
But the hardware absolutely could keep up. When you try to run an LLM on a traditional Windows PC, you are fighting physics if the model size exceeds the dedicated GPU's VRAM capacity.
In those cases, the model weights have to be moved from system RAM across the PCIe bus into the GPU, which creates a massive bottleneck.
We've seen this exact cycle before in the history of computing. Mainframes gave way to personal computers because the physics of local compute eventually outpaced the convenience of time-sharing.
Right now, Apple and Microsoft are treating AI like a mainframe era technology, insisting that all the meaningful processing must happen on their iron.
But the M-series architecture breaks that model entirely. Apple's unified memory means the CPU and GPU don't have to copy data back and forth over a slow bridge.
The model weights sit in one place, and both processors can access them instantly, which is exactly how you want to design a machine for neural network inference.
This architectural advantage is so significant that it makes consumer MacBooks competitive with enterprise hardware.
This means an M3 Max with 128GB of RAM effectively has 128GB of VRAM—a configuration that would cost upwards of $30,000 in dedicated Nvidia server cards.
Apple built the perfect machine for local AI and then built a software ecosystem that ignores it.
They want you routing through their cloud infrastructure, where they control the ecosystem, the privacy narrative, and eventually, the subscription tiers.
Meanwhile, open-weights models have proven that the heavy lifting doesn't actually need to leave your desk.
When Google released the Gemma 4 weights earlier this month, the technical specs didn't initially look like a massive breakthrough.
It's a 9-billion parameter model, which sounds tiny compared to the trillion-parameter behemoths running in the cloud.
But parameter count is no longer the sole metric of intelligence, and the quantization community has gotten terrifyingly good at their jobs.
The difference with Gemma 4 is its training data efficiency. Google aggressively filtered the training sets to prioritize reasoning and code generation over encyclopedic memorization.
Instead of knowing the capital of every obscure country, the model focuses its neural pathways on understanding logical structures, Python syntax, and system architecture.
By using 4-bit quantization, engineers shrunk this highly focused model down to a file size of just under 6 gigabytes.
This means it fits entirely within the unified memory of even a base-model M2 or M3 Mac, leaving plenty of overhead for your actual operating system and IDE.
I downloaded the GGUF file, spun up LM Studio, and turned off my Wi-Fi to see what it could do.
The performance wasn't just acceptable; it was blistering.
**On my M3 Pro, Gemma 4 spits out code and analysis at 85 tokens per second.** For context, human reading speed maxes out around 10 to 15 tokens per second.
The model is generating complex Python scripts and Kubernetes manifests faster than my eyes can physically track the characters on the screen.
It feels less like querying a remote database and more like having a senior engineer living inside your motherboard.
What makes this entire situation bizarre is Apple's total silence on the matter.
During their recent keynotes, executives spent hours hyping up cloud-based Apple Intelligence features that generate custom emojis or rewrite your emails.
Not once did they mention that the silicon inside the machine you just bought is capable of running state-of-the-art open-weights models.
The secret sauce making this possible isn't just the model; it's Apple's own open-source machine learning framework, MLX.
Ironically, while Apple's product marketing team is pushing cloud integration, Apple's internal research engineers built a framework that treats Apple Silicon like a dedicated AI accelerator.
Before MLX, running models on a Mac meant relying on standard CPU threads, which was fantastic but heavily bound by processor limits.
MLX changed the math entirely by allowing models like Gemma 4 to natively hook into the GPU cores of your Mac.
There is no virtualization overhead, no Docker container translation layer, and no inefficient translation paths.
It is bare-metal inference running directly on the most efficient consumer silicon on the planet.
If you want to see this for yourself, the barrier to entry is shockingly low. You don't need to compile C++ or fight with CUDA drivers. You just need a terminal and a few lines of Python:
```bash pip install mlx-lm mlx_lm.generate --model google/gemma-4-9b-it-mlx \
--prompt "Write a Python script to monitor AWS EC2 costs" \ --max-tokens 500 ```
That's it. You are now running an LLM that rivals last year's enterprise cloud models, completely locally, with zero latency, and absolutely zero subscription fees.
So why isn't this the default? Why is every tech company, including Apple, pushing us toward cloud-based AI assistants? The answer is a predictable mix of ecosystem lock-in and recurring revenue.
If developers realize they can run world-class code generation and text analysis locally, the entire economic model of "AI as a Service" begins to fracture.
The big tech companies have spent tens of billions of dollars on Nvidia clusters, and they need to amortize those costs by charging you a monthly fee.
**They are selling you water by the bottle while you're standing next to a perfectly good tap.**
Think about the unit economics of a ChatGPT Pro or Claude Pro subscription. You are paying $20 a month primarily for the compute costs associated with generating tokens.
If 80% of your daily queries are simple tasks like "write a bash script" or "format this JSON", you are wildly overpaying for a service your hardware can execute for free.
Furthermore, local AI is a massive privacy win that corporations simply cannot monetize.
When I run Gemma 4 on my machine, I can feed it proprietary company code, unredacted database schemas, and sensitive customer data.
There are no SOC2 compliance headaches, no data processing agreements to sign, and no risk of my company's intellectual property ending up in someone else's training data.
The data never leaves my motherboard. For enterprise security teams, this isn't just a nice-to-have; it completely eliminates the third-party vendor risk of AI adoption.
You don't have to trust Apple or Google's privacy policies when you literally sever the network connection.
Beyond the financial and privacy benefits, local execution fundamentally changes how you interact with AI.
When you rely on a cloud API, you are at the mercy of network latency, server loads, and rate limits. Every prompt has an inherent friction: you hit "Enter" and wait for the HTTP request to round-trip.
When the model is running in your laptop's unified memory, the friction drops to zero.
**The AI becomes a seamless extension of your operating system rather than a remote service you query.** You can set up local scripts that constantly monitor your log files and summarize errors in real-time, all without worrying about hitting an API rate limit.
I recently piped my entire terminal history through Gemma 4 just to figure out why a specific Terraform deployment failed last week.
Try doing that with a cloud model and watch how fast you hit a token limit or get flagged by an automated security filter.
Local compute gives you the freedom to be messy and exploratory with your data.
I need to be clear about the limitations, because the hype around local AI can quickly veer into delusion. Gemma 4 is a spectacular model, but it is not a magical replacement for everything.
If you are trying to write a novel from scratch, or you need to reason through a massively complex architectural problem spanning hundreds of files, a 9-billion parameter local model is going to hallucinate.
It does not have the sheer contextual depth or the vast world knowledge of a 1.5-trillion parameter frontier model like Claude 4.6 or ChatGPT 5.
If you need a model to read an entire 500-page API documentation PDF and instantly build a wrapper library, you still need to pay Anthropic. Local models are tactical, not strategic.
They are exceptional at focused, context-rich tasks: "Refactor this function," "explain this error trace," or "write a unit test for this edge case." But if you ask Gemma 4 to design a distributed database system from first principles, you are going to get a very confident, very flawed hallucination.
**You have to treat local AI like a highly capable junior developer.**
You give it specific, constrained tasks with clear inputs and expected outputs. You don't ask it to architect the entire system.
When you understand that boundary, the local model becomes an indispensable daily driver, allowing you to reserve the expensive cloud models for the truly complex architectural heavy lifting.
We are at a strange inflection point in the tech industry right now. Hardware has outpaced the narrative.
The M-series chips in our backpacks are capable of feats that required a dedicated server rack just three years ago, yet we are still paying rent to cloud providers to do our thinking for us.
It's time to reclaim your compute. If you have an M1, M2, M3, or M4 Mac with at least 16GB of RAM, you already own an AI accelerator.
You don't need permission from Apple, and you don't need a subscription from Google or OpenAI to use it.
Here is the exact workflow I use to replace my cloud subscriptions today. First, download LM Studio or Ollama.
These tools have abstracted away all the command-line complexity of running local models, making it literally a one-click install.
Next, pull the Gemma 4 (9B) GGUF file from Hugging Face.
Make sure you grab the 4-bit quantized version to keep the memory footprint light, leaving enough RAM for your Docker containers and Electron apps. Finally, integrate it into your daily workflow.
Use tools like Continue.dev or local shell scripts to point your IDE's autocomplete and chat features at your local port instead of an external API.
You can even set up local whisper models to dictate code directly into your editor, completely bypassing cloud transcription services.
Once you experience the speed of bare-metal inference and the peace of mind that comes with complete data privacy, paying a monthly fee for a cloud service starts to feel absurd.
Apple may not want to advertise it, and the cloud providers certainly don't want you to realize it, but the era of localized, powerful AI is already here.
You just have to be willing to turn off your Wi-Fi and run the code yourself.
Has anyone else made the jump to running models purely locally, or are the cloud APIs still too convenient to give up? Let's talk in the comments.
---
Hey friends, thanks heaps for reading this one! 🙏
Appreciate you taking the time. If it resonated, sparked an idea, or just made you nod along — let's keep the conversation going in the comments! ❤️