The Open Source Flood: 12 Capable Models in Six Weeks
Free models are shipping faster than paid ones, and your laptop just got more powerful without any new hardware.
Twelve capable open source or open weight AI models shipped in the last six weeks. You probably heard about zero of them.
The ones everyone talks about (GPT, Claude, Gemini) are only half of what is happening in AI right now. The other half is a dozen labs quietly releasing models you can download, run on your own computer, and never send a single byte of data to anyone. Several of those models are now genuinely competitive with the paid tools, and that half is the one most small business owners are missing.
What “open” actually buys you
The practical version: when a model is open, you can download it, load it on your own laptop or server, and ask it questions without sending your words to anyone’s API. No subscription. No rate limits. No “we may use your conversations to improve our products.” Your data stays where it is, and your electricity bill goes up a little.
For a small business, that’s not a minor detail. If you handle client financials, health information, legal drafts, or anything sensitive, running a model locally sidesteps a whole category of data-handling concern. You don’t need to read a vendor’s privacy policy if the vendor isn’t involved.
The catch used to be quality. Open models were free but noticeably worse. That gap has closed, and the six-week release window above is the clearest evidence yet.
The size tiers, and what each one is for
Not every open model is trying to do the same thing. They come in five rough tiers, each designed for different hardware and different jobs. Understanding the tiers is the difference between picking a tool that fits your setup and wasting an afternoon trying to run a 400B-parameter model on a laptop.
A few things worth noticing.
The “Pocket” tier is mostly invisible, but it is real. When your phone autocompletes a message, summarizes a notification, or transcribes voice offline, there is probably a sub-1B model doing it. These are the models that show up in consumer products without being announced.
The “Small” tier is where most small business use cases actually live. Drafting an email, summarizing a transcript, classifying inbound leads, writing a first pass at a blog post. A modern 7B or 8B model handles all of that, and it runs on any laptop you would buy today.
The “Mid” and “Large” tiers are where the newest efficiency tricks pay off. Mixture-of-Experts (MoE) models are doing a lot of the work here. Alibaba’s Qwen 3.6-35B-A3B has 35 billion parameters total, but only 3 billion are active for any given token. That means it thinks like a 35B model but generates at roughly the speed of a 3B one, though all 35 billion parameters still have to fit in memory. It scores 73.4% on SWE-Bench Verified, a coding benchmark where most paid models still struggle. Meta’s Llama 4 Scout tells the same story at a different scale: 109B total parameters, 17B active.
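To make the memory side of that trade-off concrete, here is a rough back-of-the-envelope sketch. It estimates the memory needed for the weights alone from the total parameter counts quoted above. The bit widths (16-bit and 4-bit) are illustrative assumptions, and real quantized files carry some extra overhead, so treat the numbers as ballpark figures rather than exact file sizes.

```python
# Rough estimate of memory needed just to hold a model's weights.
# Assumption: every parameter is stored at the same bit width, with no
# per-block overhead (real quantized formats add a little on top).

def weights_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    total_bytes = total_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

# Total parameter counts come from the article; an MoE model must keep
# all of them in memory even though only a few are active per token.
models = {
    "Qwen 3.6-35B-A3B": 35,
    "Llama 4 Scout": 109,
}

for name, total_b in models.items():
    for bits in (16, 4):  # assumed bit widths, for illustration only
        print(f"{name} at {bits}-bit: ~{weights_gb(total_b, bits):.1f} GB")
```

Under these assumptions the 35B model needs about 17.5 GB at 4 bits, which is why it lands in laptop territory, while the 109B model needs a much larger memory pool even though it runs with only 17B active parameters.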
The “Frontier” tier used to require a data center. It still mostly does, but quantization (we’ll get to that) is pushing the edge of what a serious workstation can handle.
Your Mac got faster without a hardware upgrade
Here is the part that surprised me most when I went looking. The hardware you already own can run bigger, better, longer-context models today than it could six months ago. You did not buy a new machine. Nothing in the silicon changed. The software caught up.
Two things drove this.
MLX got fast. MLX is Apple’s machine learning framework for Apple Silicon, and the ecosystem around it matured dramatically in 2026. Independent benchmarks show MLX delivering up to 2 to 2.5 times faster prompt evaluation and generation than older llama.cpp builds, with gains of 20–87% reported for models under 14B parameters. Ollama now ships an MLX backend. What used to require a beta tool and a willingness to troubleshoot now works out of the box.
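Q8 KV cache halves your memory use. The KV cache is the working memory a model keeps for the conversation so far, and it grows with every token of context. Storing it at 8 bits per value instead of 16 cuts that part of the footprint roughly in half.
Practically, that means your machine can now handle about twice the context length it could before, using the exact same model. Or you can fit a bigger model into the memory you have. Either way, you get more out of the same hardware.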
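The arithmetic behind that claim can be sketched in a few lines. The standard estimate for KV cache size is two tensors (keys and values) per layer, each holding one vector per KV head per token. The model dimensions below (32 layers, 8 KV heads, head dimension 128) are hypothetical values chosen for illustration, not the specs of any model named in this post, and the sketch ignores the small scaling overhead that real 8-bit formats add.

```python
# Back-of-the-envelope KV cache size.
# Assumptions (hypothetical, for illustration): 32 layers, 8 KV heads,
# head dimension 128. Real models vary; check the model card.

def kv_cache_bytes(context_len: int, bytes_per_value: int,
                   n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128) -> int:
    """Bytes needed to cache keys and values for `context_len` tokens."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_len

def max_context(budget_bytes: int, bytes_per_value: int) -> int:
    """Longest context that fits in a fixed cache budget."""
    per_token = kv_cache_bytes(1, bytes_per_value)
    return budget_bytes // per_token

GIB = 2**30
for label, width in (("16-bit", 2), ("8-bit (Q8)", 1)):
    size = kv_cache_bytes(8192, width) / GIB
    fits = max_context(1 * GIB, width)
    print(f"{label}: 8,192 tokens use {size:.2f} GiB; 1 GiB holds {fits:,} tokens")
```

With these assumed dimensions, 8,192 tokens of context take 1 GiB at 16 bits and 0.5 GiB at 8 bits, so the same 1 GiB budget stretches from 8,192 to 16,384 tokens.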
Put the two improvements together and the picture changes meaningfully.
What Your Mac Can Run Now vs. a Few Months Ago
The same story is true on Windows and Linux with Nvidia hardware, where flash attention improvements and similar KV cache tricks landed over the last several months. Apple Silicon just happens to be the most dramatic case, because MLX was younger and had more room to grow.
Why this is all happening now
Three forces, converging.
Chinese labs competing on efficiency. Alibaba (Qwen), Zhipu (GLM), DeepSeek, and MiniMax have been shipping relentlessly. The Chinese market for proprietary American AI is restricted, so these labs have a real incentive to release competitive open alternatives. Every release forces the next lab to be more efficient or more capable, and open models benefit from the arms race.
Google finally went fully permissive. Gemma 4 is the first Gemma generation under Apache 2.0, which means it can be used commercially without the license headaches that plagued earlier releases. That is a significant policy shift from Google, and it puts pressure on Meta’s Llama license, which is open-ish but has its own custom terms.
MoE architectures are changing the math. The old assumption was that bigger always means slower and more expensive to run. Mixture-of-Experts breaks that. A 400B MoE model with 17B active parameters does not spend 400B worth of compute on each token. It spends 17B worth. That means you can get frontier-tier quality at mid-tier speed, and a lot of the new releases are built this way specifically.
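A quick sketch of that math, using the common rule of thumb that a forward pass costs roughly 2 floating-point operations per active parameter per token. The rule of thumb is an approximation that ignores attention over long contexts and routing overhead, so read the result as an order-of-magnitude comparison.

```python
# Approximate compute per generated token: ~2 FLOPs per active parameter.
# This is a rule of thumb, not a measurement.

def flops_per_token(active_params_billion: float) -> float:
    """Rough forward-pass FLOPs for one token."""
    return 2 * active_params_billion * 1e9

dense_400b = flops_per_token(400)  # a dense model uses all 400B parameters
moe_17b_active = flops_per_token(17)  # the MoE model uses only 17B per token

ratio = dense_400b / moe_17b_active
print(f"Dense 400B: {dense_400b:.2e} FLOPs per token")
print(f"MoE, 17B active: {moe_17b_active:.2e} FLOPs per token")
print(f"The MoE model does about {ratio:.1f}x less compute per token")
```

The ratio works out to about 23.5x less compute per token, which is the gap between "needs a data center" speed and something much closer to a mid-tier model.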
Where the paid tools still win
Honesty section. Paid models still lead in a few places: the absolute frontier of multi-step reasoning, the most polished multimodal handling, voice and live conversation, and the integrated tooling (web search, code execution, file handling) that comes bundled with a ChatGPT or Claude subscription. If your job depends on the very best model on a hard task, the open ones haven’t caught the closed ones at the very top.
The point isn’t that open replaces paid. It’s that for a meaningful slice of what people are paying for, the free option is now good enough, and the slice is getting bigger every month.
What to do with this
You do not need to become an AI researcher to benefit from any of this. You need to spend an afternoon trying one thing.
Pick one task. Something you currently pay a subscription to do. Drafting emails. Summarizing meeting notes. Rewriting content in different tones. First-pass research.
Install LM Studio or Ollama. Both are free. Both work on Mac, Windows, and Linux. LM Studio has the friendlier interface for non-developers. Ollama is closer to a command-line tool but has broader compatibility.
Pull the right tier for your machine. Look at the hardware chart above, find your row, and download one of the models that fits. Start at the Small tier if you’re unsure. Gemma 4 7B and Qwen 3 8B are both solid first picks.
Run your task, compare the output. Not against perfection. Against what you’re already paying for. If it gets you 80% of the way there on something you do often, that is already a meaningful result.
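If you went the Ollama route and are comfortable with a little Python, the steps above can be sketched as a short script. It assumes Ollama is running locally on its default port (11434) and that you have already pulled a model. The model tag you pass in is up to you, and the helper names here (build_payload, run_local) are made up for this example.

```python
import json
import urllib.request

# Ollama's default local endpoint; change this if your setup differs.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model: str, task_prompt: str) -> dict:
    """Request body for a single, non-streaming generation."""
    return {"model": model, "prompt": task_prompt, "stream": False}

def run_local(model: str, task_prompt: str,
              url: str = OLLAMA_URL, timeout: int = 300) -> str:
    """Send one prompt to the local model and return its text reply."""
    body = json.dumps(build_payload(model, task_prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    # Nothing leaves your machine: the request goes to localhost only.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["response"]
```

To try it, call run_local with the tag of whatever model you pulled and a real task from your week, for example a meeting transcript followed by "Summarize this in five bullet points." Then paste the same prompt into the paid tool you already use and compare the two results side by side.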
Spend 20 minutes this week running a free model on a task you do every day. You’ll have a much better answer to “does this matter for my business?” than any blog post can give you.