By SpaceTime · Editorial team
What Google Gemma 4 Is, and What the Four Sizes Mean for Developers
Google DeepMind shipped Gemma 4 on April 2, 2026: Apache 2.0 open weights, four sizes from edge to workstation, multimodal inputs, and agent-style tooling—here is how the lineup differs and who each tier is for.
Open-weight models are no longer a niche experiment for researchers who like to tinker on weekends. They are a product category, with licensing, hardware targets, and ecosystem support discussed in the same breath as proprietary APIs. On April 2, 2026, Google DeepMind released Gemma 4, positioning it as the most capable Gemma generation so far—and explicitly tying it to the same research stack that powers Gemini 3, while keeping weights available under a commercially permissive Apache 2.0 license.
The practical question for builders is not whether the announcement is loud; it is whether the four-size lineup is coherent enough to plan around. This article walks through what shipped, how the architectures differ, and where each tier is meant to land in real deployments.
What’s new
Gemma 4 is a family of open-weight, multimodal models. All four sizes accept text, image, and video inputs; the smaller Effective 2B (E2B) and Effective 4B (E4B) variants add native audio input for speech-oriented workflows, per Google’s model documentation. Outputs are text in every case.
Google released four distinct checkpoints:
- E2B and E4B — “Effective” parameter counts aimed at phones, embedded boards, and other latency-sensitive environments. The documentation describes Per-Layer Embeddings (PLE) as a way to keep effective compute during inference smaller than total parameter counts would suggest, trading extra embedding lookups for memory and speed where that tradeoff matters on-device.
- 26B Mixture-of-Experts (MoE), labeled 26B A4B — 25.2B parameters total, with 3.8B active during inference in the configuration described in the model card. The design targets high token throughput relative to a dense model of similar headline size.
- 31B Dense — a dense model aimed at maximum quality within this family tier, sized so that unquantized bfloat16 weights can fit on a single 80GB NVIDIA H100 GPU in Google’s public guidance—an explicit nod to “run it on hardware you can actually rent or own.”
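The memory claims above are easy to sanity-check with back-of-envelope math. The sketch below uses the parameter counts cited in this article and the standard 2 bytes per bfloat16 parameter; it deliberately ignores KV cache, activations, and runtime overhead, so real usage will be higher. One design point worth noting: for the MoE tier, all 25.2B parameters must be resident in memory even though only 3.8B are computed per token, so MoE saves compute, not weight memory.

```python
# Back-of-envelope weight-memory estimates for the Gemma 4 tiers,
# using the parameter counts cited above. bfloat16 stores each
# parameter in 2 bytes; KV cache, activations, and runtime overhead
# are ignored, so real memory usage will be higher.

BYTES_PER_PARAM_BF16 = 2

def weights_gib(params_billions: float) -> float:
    """Raw weight footprint in GiB for a bf16 checkpoint."""
    return params_billions * 1e9 * BYTES_PER_PARAM_BF16 / 2**30

tiers = {
    "26B MoE (total)": 25.2,   # all experts must be resident in memory
    "26B MoE (active)": 3.8,   # parameters actually computed per token
    "31B Dense": 31.0,
}

for name, params in tiers.items():
    print(f"{name}: ~{weights_gib(params):.1f} GiB")
```

The 31B dense checkpoint works out to roughly 58 GiB of raw weights, which is consistent with Google's "fits on one 80GB H100" guidance while leaving headroom for KV cache at long context lengths.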
Licensing remains Apache 2.0, which matters for teams that need redistribution, modification, and commercial use without negotiating a separate contract with Google.
Google also emphasized ecosystem readiness: its launch materials name day-one support across Hugging Face, vLLM, llama.cpp, MLX, and Ollama; cloud paths through Vertex AI and related Google Cloud services; and edge tooling such as AI Edge Gallery and Android-oriented previews for on-device workflows.
Why it matters now
Open models are competing on two axes at once: raw capability and cost-to-serve. Gemma 4’s framing leans hard on the second—intelligence per token, intelligence per watt, and intelligence per dollar of GPU.
For engineers, the release is another data point that multimodal and agent-shaped capabilities (structured outputs, function calling, system-role control) are becoming table stakes even outside closed APIs. For investors, it is a signal that Google continues to treat open weights as a distribution channel for developer mindshare and ecosystem pull-through to cloud and devices, not only as an academic exercise.
Technical breakdown: what is different in practice
Context windows split by tier in Google’s documentation: 128K tokens on the E2B/E4B models and up to 256K on the 26B MoE and 31B Dense models—long enough that “entire repo in one prompt” stops being a joke and starts being a workload design constraint.
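Treating the context window as a workload design constraint means budgeting tokens before you build. A crude feasibility check, using an assumed ~4 characters per token (a rough heuristic for English text and code, not Gemma's actual tokenizer), looks like this:

```python
# Rough feasibility check: does a codebase fit in a 256K-token window?
# Uses a crude ~4 characters-per-token heuristic (an assumption; real
# tokenizers vary by language and content), not Gemma's tokenizer.

CHARS_PER_TOKEN = 4  # heuristic, not a real tokenizer

def estimated_tokens(num_chars: int) -> int:
    return num_chars // CHARS_PER_TOKEN

def fits_in_window(num_chars: int, window_tokens: int = 256_000,
                   reserve_for_output: int = 8_000) -> bool:
    """Check against the window minus headroom for the model's output."""
    return estimated_tokens(num_chars) <= window_tokens - reserve_for_output

# A 1 MB repo is ~250K estimated tokens: right at the edge of 256K,
# and over budget once output headroom is reserved.
print(fits_in_window(1_000_000))
```

The point of the sketch is the reserve: a window you fill to the last token leaves no room for the model to answer, so "entire repo in one prompt" still needs a margin.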
Multimodal coverage is unified at the family level for text and vision; the smaller models extend into audio where microphones and speech are part of the product surface.
Agentic features are highlighted at the framework level: native system instructions, function calling, and structured JSON outputs are part of the story Google wants developers to build on—less “chat widget,” more “orchestrated tool use.”
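The "orchestrated tool use" pattern above has a common application-side shape regardless of vendor: declare a tool schema, let the model emit a structured JSON call, and validate that call before executing anything. The sketch below illustrates that pattern with generic placeholder names; the schema shape and field names are assumptions, not Gemma's actual wire format.

```python
import json

# Illustrative function-calling loop, application side: validate a
# model-emitted JSON tool call against a declared schema before
# executing it. Schema shape and field names are placeholders, not
# Gemma's actual wire format.

WEATHER_TOOL = {
    "name": "get_weather",
    "parameters": {"city": str, "unit": str},  # expected argument types
}

def parse_tool_call(raw: str, tool: dict) -> dict:
    """Parse and validate a model-emitted tool call; raise on mismatch."""
    call = json.loads(raw)
    if call.get("name") != tool["name"]:
        raise ValueError(f"unexpected tool: {call.get('name')}")
    args = call.get("arguments", {})
    for param, expected_type in tool["parameters"].items():
        if not isinstance(args.get(param), expected_type):
            raise ValueError(f"bad or missing argument: {param}")
    return args

# A well-formed structured output from the model would look like:
model_output = '{"name": "get_weather", "arguments": {"city": "Oslo", "unit": "C"}}'
print(parse_tool_call(model_output, WEATHER_TOOL))
# → {'city': 'Oslo', 'unit': 'C'}
```

Structured-output modes reduce how often this validation fails, but they do not remove the need for it: the boundary between model output and tool execution is where agent bugs become production incidents.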
Reasoning is presented as a first-class capability across the family, including references to configurable “thinking” modes in the model card. The accompanying benchmark tables show large jumps over prior Gemma generations on several math and coding benchmarks—useful as directional signal, with the usual caveat that your own task will not match any single benchmark mix.
Google’s launch blog also cited Arena AI leaderboard rankings for open models as of April 1, 2026 for the 31B and 26B MoE instruction-tuned variants. Treat public leaderboards as moving snapshots, not eternal scores—but they do capture how vendors want to be compared in the current open-model horserace.
Business and platform implications
Google is not shy about pairing open weights with Google Cloud and Android paths. That pairing is the strategic through-line: local-first and sovereign narratives for regulated customers, cloud-scale training and serving for everyone else, and on-device reach for consumer OEM stories.
For organizations comparing open versus API-only strategies, Gemma 4 is another entry where the license terms and weight availability allow private data to stay inside a boundary you control—while still pushing you toward Google’s tooling and infrastructure for anything that outgrows a single workstation.
Practical takeaway
If you are choosing a checkpoint, start from constraints, not from the biggest number on the card.
- Need offline or embedded with audio and tight memory? Start with E2B/E4B and validate latency on your real hardware, not a demo laptop.
- Need fast throughput on a single GPU without always running a full dense forward pass at the largest width? The 26B MoE is the throughput-focused middle tier on paper.
- Need maximum quality within this family for fine-tuning or offline “one big GPU” setups? The 31B Dense is the natural fit—at the cost of heavier inference.
Across all four, the same decision rule applies: measure on your prompts, your documents, and your tools. Open weights remove the billing meter, not the engineering work.
Sources / references
- Google DeepMind — “Gemma 4: Byte for byte, the most capable open models” (Apr 2, 2026) — https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/
- Google AI for Developers — Gemma 4 model card — https://ai.google.dev/gemma/docs/core/model_card_4
- Google Developers Blog — “Bring state-of-the-art agentic skills to the edge with Gemma 4” — https://developers.googleblog.com/bring-state-of-the-art-agentic-skills-to-the-edge-with-gemma-4/
- Google Cloud Blog — “Gemma 4 available on Google Cloud” — https://cloud.google.com/blog/products/ai-machine-learning/gemma-4-available-on-google-cloud