# Apple researchers release the Massive Multitask Agent Understanding (MMAU) benchmark

*genai · news · 2024-08-02 · Analytics India Magazine*

## Key points

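- Apple introduces the MMAU benchmark to evaluate large language models on five capabilities across five domains.
- MMAU comprises 20 tasks and more than 3,000 prompts, offering a more granular evaluation than earlier benchmarks.
- GPT-4 and other API-based models consistently outperform open-source models in challenging domains.
- Self-correction remains a major challenge, while problem-solving is more broadly achievable for current large language models.
- Apple also introduced LazyLLM, a new technique that speeds up large language model inference without sacrificing accuracy.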

Researchers from Apple have recently unveiled the Massive Multitask Agent Understanding (MMAU) benchmark, a new evaluation framework designed to assess the capabilities of large language models (LLMs) as intelligent agents across diverse domains and skills.

MMAU evaluates models on five key capabilities: understanding, reasoning, planning, problem-solving, and self-correction. It spans five domains: tool use, directed acyclic graph question answering, data science and machine learning coding, contest-level programming, and mathematics. The benchmark comprises 20 carefully designed tasks with over 3,000 distinct prompts, offering a more granular assessment of LLM capabilities than existing benchmarks. By isolating and testing specific skills, MMAU aims to show where model failures stem from.

Evaluating 18 models on MMAU showed that commercial API-based models such as GPT-4 consistently outperformed open-source models across various domains. Proficiency varied by capability: problem-solving was more universally achievable, while self-correction posed significant challenges for many models. High-quality planning also boosted performance for all models in mathematical tasks. Interestingly, larger models did not always perform better, underscoring the importance of training strategies and model architectures.

The researchers emphasise that MMAU is designed to complement, not replace, existing interactive evaluations. They acknowledge limitations in the current scope and call for future work to expand into more domains and refine capability decomposition methods. By providing a comprehensive and granular evaluation framework, MMAU aims to drive progress in developing more capable and well-rounded AI agents. The datasets and evaluation scripts have been made publicly available to facilitate further research in this area.

Apple also recently introduced LazyLLM, a novel technique aimed at improving the efficiency of LLM inference. The approach seeks to accelerate response generation in transformer-based language models while maintaining accuracy.
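
To make the idea of capability-level scoring in MMAU more concrete, here is a minimal Python sketch of how per-prompt outcomes might be rolled up into one score per capability. It is not taken from Apple's released evaluation scripts: the function `capability_scores`, the `(capability, passed)` record shape, and the sample outcomes are all hypothetical, and only the five capability names come from the article.

```python
from collections import defaultdict

# The five capability names are from the article; everything else is hypothetical.
CAPABILITIES = (
    "understanding",
    "reasoning",
    "planning",
    "problem-solving",
    "self-correction",
)


def capability_scores(results):
    """Return the pass rate per capability.

    `results` is an iterable of (capability, passed) pairs, one per prompt.
    Capabilities with no prompts are left out of the returned dict.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for capability, passed in results:
        if capability not in CAPABILITIES:
            raise ValueError(f"unknown capability: {capability}")
        totals[capability] += 1
        passes[capability] += int(passed)
    return {name: passes[name] / totals[name] for name in totals}


# Made-up outcomes for illustration only; these are not MMAU results.
example = [
    ("problem-solving", True),
    ("problem-solving", True),
    ("self-correction", False),
    ("self-correction", True),
    ("planning", True),
]
scores = capability_scores(example)
```

Reporting one number per capability, rather than a single aggregate, is what would let a reader see a pattern such as strong problem-solving alongside weak self-correction.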

**Companies:** Apple
**Countries:** United States

[Read the full story on Analytics India Magazine](https://analyticsindiamag.com/ai-news-updates/apple-unveils-mmau-a-new-benchmark-for-evaluating-language-model-agents-across-diverse-domains/)

---

Canonical: https://newsio.io/zh-TW/n/db989e64-f0e6-4982-abf0-e15a3f2b04cd/mmaummau
Summarized by Newsio from Analytics India Magazine. https://newsio.io/how-it-works
