Development

Google DeepMind releases DiffusionGemma, a model that runs local AI 4x faster

June 11, 2026 Development Source: Ars Technica

Share this article

Join Group Now

Facebook

Another day, another AI model from Google. This time, Google DeepMind has released a new member of the Gemma 4 open model family, but it’s fundamentally different from the rest of the lineup. DiffusionGemma doesn’t generate outputs linearly like most AI models. Instead, it can produce an entire block of text in parallel. Google says this makes it faster and more efficient when running on local hardware like an Nvidia DGX or a humble gaming GPU. Most AI models are designed to be autoregressive—they generate text left to right one token at a time. DiffusionGemma has more in common with image generation models, which start with static and then denoise it to create the desired content. This model takes a field of placeholder tokens running over the canvas multiple times to generate likely tokens and using those to improve estimation of others. At the end of the process, the model finalizes its token outputs in one large block—the “denoised” text canvas. If diffusion is so much faster, why isn’t Google using it in big cloud-based Gemini models? Google has experimented with this, but there are a few drawbacks to text diffusion, including a higher error rate. In image diffusion models, a single badly predicted pixel doesn’t make the image useless, but language is discrete. An equivalent error in text can make a block of tokens meaningless and force you to start over to get a better output. Diffusion models also waste resources when the desired output is only a few tokens long. They have to do a lot more parallel work to whittle down to, say, five tokens that an autoregressive model does from beginning to end in just five steps.