# Gemma 4 12B is Google's first mid-sized model to support native audio input

*genai · news · 2026-06-04 · Android Authority*

## Key points

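- Google's Gemma 4 12B is the company's first mid-sized model to support native audio input.
- Gemma 4 12B uses an encoder-free architecture, which reduces the latency and memory requirements of multimodal input.
- Google claims the model performs close to the larger Gemma 4 26B MoE in benchmarks while remaining small enough for consumer laptops.
- Gemma 4 12B can run on laptops with 16GB of RAM, with no dedicated AI hardware required.
- The Gemma 4 12B model weights are available to download from Hugging Face and Kaggle.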

TL;DR

- Google has released the Gemma 4 12B model, aimed at consumer laptops with at least 16GB of RAM.
- Gemma 4 12B is the company’s first mid-sized model to support native audio input.
- It uses an encoder-free architecture to offer multimodal performance without the latency introduced by encoders.
- The new model performs close to the Gemma 4 26B MoE model in benchmarks.

Back in April, Google released its mobile-friendly Gemma E2B and E4B models, bringing on-device multimodal AI to Android and iOS devices. It also released the 26B Mixture of Experts (MoE) and 31B Dense models for higher-end devices with dedicated AI GPUs. Now, the company is launching another Gemma model that sits between those two tiers.

Google today announced the Gemma 4 12B model, aimed at bringing on-device AI capabilities to laptops. It offers multimodal features and is the first mid-sized model from Google to support native audio input.

The company claims that its 12B model delivers performance similar to the 26B MoE model in benchmarks while being small enough to run on ordinary consumer laptops with 16GB of RAM. To achieve this, Google had to support multimodal inputs without increasing latency and memory usage. Gemma 4 12B uses an encoder-free architecture to avoid the memory costs of the encoders that most multimodal AI models rely on.

For vision, it uses a lightweight module that relies on “single matrix multiplication, positional embedding, and normalizations,” allowing image data to be passed to the LLM without an encoder in the middle. It also does away with encoding for audio inputs entirely: Google was able to project the raw audio signal directly into the same dimensional space as text tokens.

What that means is that Gemma 4 12B can handle multimodal inputs, just like the other Gemma models, but without the added overhead of encoding such inputs. This should result in much better performance on laptops without the need for dedicated AI hardware.

Interested users can try the new model right now in LM Studio, Ollama, Google AI Edge Gallery, and more. If you want to run it locally on your laptop, the weights are available to download from Hugging Face and Kaggle.
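To make the encoder-free idea above more concrete, here is a minimal toy sketch in plain Python. It is not Google's implementation. The dimensions, the random weights, the choice of RMS normalization, and the splitting of audio into fixed-size frames are all assumptions made for illustration. It only shows the general shape of the technique: image patches go through one matrix multiplication, a positional embedding, and a normalization, while raw audio samples are projected straight into the same vector width as the text tokens.

```python
import math
import random

D_MODEL = 8   # hypothetical embedding width shared with text tokens
PATCH = 2     # hypothetical patch side length in pixels
FRAME = 4     # hypothetical number of raw audio samples per frame


def make_matrix(rows, cols, rng):
    """Random weights standing in for learned parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(vec, matrix):
    """Multiply a row vector (length = rows) by a rows x cols matrix."""
    cols = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vec, matrix)) for j in range(cols)]


def rms_norm(vec, eps=1e-6):
    """Scale a vector to roughly unit root-mean-square (an assumed normalization)."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]


def image_to_patches(image, patch):
    """Cut an H x W grayscale image into flattened patch vectors, row-major."""
    patches = []
    for top in range(0, len(image), patch):
        for left in range(0, len(image[0]), patch):
            patches.append(
                [image[top + r][left + c] for r in range(patch) for c in range(patch)]
            )
    return patches


def embed_image(image, proj, pos_emb):
    """One matmul per patch, plus a positional embedding, then normalization."""
    out = []
    for i, p in enumerate(image_to_patches(image, PATCH)):
        x = matvec(p, proj)
        x = [a + b for a, b in zip(x, pos_emb[i])]
        out.append(rms_norm(x))
    return out


def embed_audio(samples, proj):
    """Project fixed-size frames of raw samples directly into the token space."""
    frames = [samples[i:i + FRAME] for i in range(0, len(samples) - FRAME + 1, FRAME)]
    return [rms_norm(matvec(f, proj)) for f in frames]


rng = random.Random(0)
image = [[rng.random() for _ in range(4)] for _ in range(4)]  # 4x4 image -> 4 patches
audio = [math.sin(0.3 * t) for t in range(12)]                # 12 samples -> 3 frames

vision_proj = make_matrix(PATCH * PATCH, D_MODEL, rng)
pos_emb = make_matrix(4, D_MODEL, rng)
audio_proj = make_matrix(FRAME, D_MODEL, rng)

# Stand-in text token embeddings of the same width.
text_tokens = [rms_norm([rng.gauss(0, 1) for _ in range(D_MODEL)]) for _ in range(5)]

# All three modalities end up as one sequence of same-width vectors.
sequence = embed_image(image, vision_proj, pos_emb) + embed_audio(audio, audio_proj) + text_tokens
print(len(sequence), len(sequence[0]))
```

In this toy version, the only per-modality parameters are a small projection matrix and, for images, a positional table. That is the intuition behind the claimed latency and memory savings: there is no separate multi-layer encoder to load or run before the language model sees the input.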

**Companies:** Google
**Countries:** USA

[Read the full story on Android Authority](https://www.androidauthority.com/google-gemma-4-12b-multimodal-ai-model-3674379/)

---

Canonical: https://newsio.io/zh-TW/n/850bee48-dc01-4509-8cd0-7c161a14b6c9/gemma-4-12b
Summarized by Newsio from Android Authority. https://newsio.io/how-it-works