
The Open Source Flood: 12 Capable Models in Six Weeks

Free models are shipping faster than paid ones, and your laptop just got more powerful without any new hardware.

Max April 25, 2026

Twelve capable open source or open weight AI models shipped in the last six weeks. You probably heard about zero of them.

The ones everyone talks about (GPT, Claude, Gemini) are only half of what is happening in AI right now. The other half is a dozen labs quietly releasing models you can download, run on your own computer, and never send a single byte of data to anyone. Several of those models are now genuinely competitive with the paid tools, and that half is the one most small business owners are missing.

Release timeline, Mar 11 to Apr 25 (12 releases, 45 days)

Mar 11: NVIDIA, Nemotron 3 Super (120B, 60.47% SWE-Bench)
Mar 22: Lightricks, LTX 2.3 (open 4K video)
Mar 25: ByteDance, Helios (real-time video)
Mar 28: DeepSeek, V3.2 update (reasoning boost)
Mar 31: Alibaba, Qwen 3.6 Plus (1M context)
Apr 2: Google, Gemma 4 (Apache 2.0)

Apr 5

What “open” actually buys you

The practical version: when a model is open, you can download it, load it on your own laptop or server, and ask it questions without sending your words to anyone’s API. No subscription. No rate limits. No “we may use your conversations to improve our products.” Your data stays where it is, and your electricity bill goes up a little.

For a small business, that’s not a minor detail. If you handle client financials, health information, legal drafts, or anything sensitive, running a model locally sidesteps a whole category of data-handling concern. You don’t need to read a vendor’s privacy policy if the vendor isn’t involved.

The catch used to be quality. Open models were free but noticeably worse. For everyday work, that gap has largely closed, and the six-week release window above is the clearest evidence yet.

The size tiers, and what each one is for

Not every open model is trying to do the same thing. They come in five rough tiers, each designed for different hardware and different jobs. Understanding the tiers is the difference between picking a tool that fits your setup and wasting an afternoon trying to run a 400B-parameter model on a laptop.

Model Size Tiers

Every tier below has capable open models shipping right now.

Pocket (under 1B)
Runs on: Phones, always-on devices
Good for: Smart replies, quick classification, offline voice commands
Examples: Gemma 4 Nano, Qwen 3.5 0.8B

Small (1B to 8B)
Runs on: Any modern laptop, 16GB RAM
Good for: Drafting, summarizing, simple Q&A, most everyday chat
Examples: Gemma 4 7B, Qwen 3 8B, Phi-4

Mid (8B to 30B)
Runs on: Good laptop or single consumer GPU
Good for: Longer documents, coding help, research synthesis
Examples: Mistral Small 4, Qwen 3 14B, Qwen 3.6-35B MoE

Large (30B to 120B)
Runs on: Workstation with 64GB+ or a pair of GPUs
Good for: Agentic coding, long-context reasoning, near-frontier quality
Examples: Llama 4 Scout (109B MoE), Nemotron 3 Super 120B

Frontier (120B and up, usually MoE)
Runs on: Server or cloud, but runs quantized on a tower
Good for: Anything a paid frontier model does
Examples: Llama 4 Maverick (400B), GLM-5.1, Qwen 3.6 Plus

Every tier has capable open releases from the last six weeks. Pick the smallest one that does the job.
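A rough way to sanity-check a tier against your own machine is to estimate how much memory the weights alone need: parameter count times bits per weight, divided by eight. The sketch below does that arithmetic. The function names and the 70% "usable RAM" share are assumptions made for illustration, and the estimate leaves out the context cache and everything else your computer is running, so treat the result as a floor.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits_in_ram(params_billions: float, bits_per_weight: float,
                ram_gb: float, usable_fraction: float = 0.7) -> bool:
    """True if the weights fit in the share of RAM we assume is free for the model.

    usable_fraction is an illustrative guess, not a measured value.
    """
    return weight_memory_gb(params_billions, bits_per_weight) <= ram_gb * usable_fraction


# Compare a few sizes at 4-bit quantization on a 16GB and a 32GB machine.
for params in (8, 14, 35):
    gb = weight_memory_gb(params, 4)
    print(f"{params}B at 4-bit: about {gb:.1f} GB of weights, "
          f"16GB ok: {fits_in_ram(params, 4, 16)}, "
          f"32GB ok: {fits_in_ram(params, 4, 32)}")
```

If a model only just fits, drop down a tier or a quantization level before concluding that your machine is too slow.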

A few things worth noticing.

The “Pocket” tier is mostly invisible, but it is real. When your phone autocompletes a message, summarizes a notification, or transcribes voice offline, there is probably a sub-1B model doing it. These are the models that show up in consumer products without being announced.

The “Small” tier is where most small business use cases actually live. Drafting an email, summarizing a transcript, classifying inbound leads, writing a first pass at a blog post. A modern 7B or 8B model handles all of that, and it runs on any laptop you would buy today.

The “Mid” and “Large” tiers are where the newest efficiency tricks pay off. Mixture-of-Experts models (MoE) are doing a lot of the work here. Alibaba’s Qwen 3.6-35B-A3B has 35 billion parameters total, but only 3 billion are active for any given token. That means it thinks like a 35B model but runs at close to the speed of a 3B one, and it scores 73.4% on SWE-Bench Verified, a coding benchmark where most paid models still struggle. Meta’s Llama 4 Scout tells the same story at a different scale: 109B total parameters, 17B active.
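The total-versus-active arithmetic can be made concrete with a few lines of Python. This is a back-of-the-envelope sketch: it assumes per-token compute scales with the number of active parameters, which ignores routing overhead and attention cost. The class and method names are invented for illustration, and the parameter counts are the ones quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEModel:
    name: str
    total_params_b: float   # billions of parameters across all experts
    active_params_b: float  # billions of parameters used for each token

    def active_fraction(self) -> float:
        """Share of the model that does work on any single token."""
        return self.active_params_b / self.total_params_b

    def compute_vs_dense(self) -> float:
        """Roughly how many times less per-token compute than a dense model of the same total size."""
        return self.total_params_b / self.active_params_b


models = [
    MoEModel("Qwen 3.6-35B-A3B", 35, 3),
    MoEModel("Llama 4 Scout", 109, 17),
]

for m in models:
    print(f"{m.name}: {m.active_fraction():.0%} of parameters active, "
          f"about {m.compute_vs_dense():.1f}x less compute per token than dense")
```

The saving is in compute, not storage: every expert still has to be loaded, so the memory needed follows the total parameter count.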

The “Frontier” tier used to require a data center. It still mostly does, but quantization (we’ll get to that) is pushing the edge of what a serious workstation can handle.

Your Mac got faster without a hardware upgrade

Here is the part that surprised me most when I went looking. The hardware you already own can run bigger, better, longer-context models today than it could six months ago. You did not buy a new machine. Nothing in the silicon changed. The software caught up.

Two things drove this.

MLX got fast. MLX is Apple’s machine learning framework for Apple Silicon, and the ecosystem around it matured dramatically in 2026. Independent benchmarks show MLX delivering up to 2 to 2.5 times faster prompt eval and generation than older llama.cpp builds in the best cases, and 20–87% faster for models under 14B parameters. Ollama now ships an MLX backend. What used to require a beta tool and a willingness to troubleshoot now works out of the box.

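Q8 KV cache halves your cache memory. The KV cache is the running record of the conversation that a model keeps in memory, and it grows with every token of context. It has usually been stored at 16-bit precision; storing it at 8 bits (Q8) takes about half the space, typically with little noticeable loss in quality.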

Practically, that means your machine can now handle twice the context length it could before, using the exact same model. Or you can fit a bigger model into the memory you have. Either way, you get more out of the same hardware.
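To see where the "twice the context" figure comes from, here is a minimal sketch of the standard KV cache size calculation: two tensors (keys and values) per layer, each holding one vector per attention head per token. The layer count, head count, and head size below are illustrative values for a hypothetical 8B-class model, not the specs of any release named in this article, and the sketch ignores the small per-block scale factors that some 8-bit formats also store.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int) -> int:
    """Bytes of KV cache added by each token of context (keys plus values)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def max_context_tokens(budget_bytes: int, layers: int, kv_heads: int,
                       head_dim: int, bytes_per_value: int) -> int:
    """How many tokens of context fit in a fixed cache budget."""
    return budget_bytes // kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value)


# Hypothetical model shape, chosen only to make the numbers concrete.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
budget = 4 * GIB  # memory we are willing to spend on the cache

fp16_tokens = max_context_tokens(budget, LAYERS, KV_HEADS, HEAD_DIM, 2)  # 16-bit cache
q8_tokens = max_context_tokens(budget, LAYERS, KV_HEADS, HEAD_DIM, 1)    # 8-bit cache

print(f"16-bit cache: {fp16_tokens} tokens in 4 GiB")
print(f"8-bit cache:  {q8_tokens} tokens in 4 GiB")
```

Halving the bytes per value doubles the tokens that fit in the same budget, which is the whole trick.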

Put the two improvements together and the picture changes meaningfully.

Same hardware, better results

What Your Mac Can Run Now vs. a Few Months Ago

MacBook Air, 16GB (real daily driver)
Late 2025: Llama 3 8B at Q4 with short context. Usable but tight.
Now (MLX + Q8 KV cache): Qwen 3 8B or Gemma 4 7B at Q4 with 3 to 4 times the context, or faster generation on the same task.

MacBook Pro, 32GB (new capability)
Late 2025: Qwen 2.5 14B, comfortable. 30B only with aggressive quantization.
Now (MLX + Q8 KV cache): Qwen 3.6-35B-A3B MoE runs at 14B speeds. Mistral Small 4 with full context. Closer to paid-tool quality.

Mac Studio, 128GB (frontier-tier)
Late 2025: Llama 3 70B at Q5, comfortable. 405B only with aggressive quant and a lot of patience.
Now (MLX + Q8 KV cache): Llama 4 Maverick (400B MoE) at Q4 with full context. 70B-class models at Q6 with twice the prompt eval speed. Genuinely frontier-tier.

Approximate comparison. Real performance varies by specific chip, quantization level, and context length. The pattern is what matters.

The same story is true on Windows and Linux with NVIDIA hardware, where flash attention improvements and similar KV cache tricks landed over the last several months. Apple Silicon just happens to be the most dramatic case, because MLX was younger and had more room to grow.

Why this is all happening now

Three forces, converging.

Chinese labs competing on efficiency. Alibaba (Qwen), Zhipu (GLM), DeepSeek, and MiniMax have been shipping relentlessly. The Chinese market for proprietary American AI is restricted, so these labs have a real incentive to release competitive open alternatives. Every release forces the next lab to be more efficient or more capable, and open models benefit from the arms race.

Google finally went fully permissive. Gemma 4 is the first Gemma generation under Apache 2.0, which means it can be used commercially without the license headaches that plagued earlier releases. That is a significant policy shift from Google, and it puts pressure on Meta’s Llama license, which is open-ish but has its own custom terms.

MoE architectures are changing the math. The old assumption was that bigger always means slower and more expensive to run. Mixture-of-Experts breaks that. A 400B MoE model with 17B active parameters does not spend 400B worth of compute on each token; it spends 17B worth. The full 400B still has to fit in memory, but you get frontier-tier quality at closer to mid-tier speed, and a lot of the new releases are built this way specifically.

Where the paid tools still win

Honesty section. Paid models still lead in a few places: the absolute frontier of multi-step reasoning, the most polished multimodal handling, voice and live conversation, and the integrated tooling (web search, code execution, file handling) that comes bundled with a ChatGPT or Claude subscription. If your job depends on the very best model on a hard task, the open ones haven’t caught the closed ones at the very top.

The point isn’t that open replaces paid. It’s that for a meaningful slice of what people are paying for, the free option is now good enough, and the slice is getting bigger every month.

What to do with this

You do not need to become an AI researcher to benefit from any of this. You need to spend an afternoon trying one thing.

Pick one task. Something you currently pay a subscription to do. Drafting emails. Summarizing meeting notes. Rewriting content in different tones. First-pass research.

Install LM Studio or Ollama. Both are free. Both work on Mac, Windows, and Linux. LM Studio has the friendlier interface for non-developers. Ollama is closer to a command-line tool but has broader compatibility.

Pull the right tier for your machine. Look at the hardware chart above, find your row, and download one of the models that fits. Start at the Small tier if you’re unsure. Gemma 4 7B or Qwen 3 8B are both solid first picks.

Run your task, compare the output. Not against perfection. Against what you’re already paying for. If it gets you 80% of the way there on something you do often, that is already a meaningful result.
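If you went the Ollama route and are comfortable with a few lines of Python, the steps above can be scripted. Ollama serves a local HTTP API, by default on port 11434, and the sketch below sends one prompt to its /api/generate endpoint and returns the reply. The function names are invented for illustration, and the model tag in the usage comment is an assumption: substitute whatever tag you actually pulled.

```python
import json
import urllib.request

# Ollama's default local endpoint; nothing leaves your machine.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build the POST request for a single, non-streaming completion."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_local_model(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local Ollama server and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"]


# Example usage (requires Ollama to be running and the model already pulled):
#   notes = open("meeting_notes.txt", encoding="utf-8").read()
#   print(ask_local_model("qwen3:8b", "Summarize these notes in three bullets:\n" + notes))
```

Run the same prompt through the tool you already pay for and put the two outputs side by side. That comparison is the whole experiment.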

Spend 20 minutes this week running a free model on a task you do every day. You’ll have a much better answer to “does this matter for my business?” than any blog post can give you.