1. Why Is Agent Evaluation 10× Harder Than LLM Evaluation?
LLM evaluation is relatively straightforward: feed N prompts, compare the outputs to gold answers, and compute accuracy.
Agents are different because their output is a sequence of actions. The same final answer can come from an elegant 3-step path or from 30 chaotic steps of trial and error. You have to evaluate both "did it get the right answer?" and "was the process sensible and affordable?"
LangChain's 2026 report says 57% of organizations have agents in production, yet 32% name "quality" as their biggest bottleneck. More than 40% of agentic AI projects are projected to be cancelled by the end of 2027, and most die from "great demo, falls apart once it runs for a while."
This chapter breaks agent evaluation into three axes: outcome, trajectory, and production observability.
2. Outcome Metrics vs Trajectory Metrics
| | Outcome | Trajectory |
|---|---|---|
| Question | "Is the answer right?" | "Was the process sensible?" |
| Examples | task_success_rate, F1, BLEU, pass@1 | tool_call_precision, plan_quality, step_count |
| Pros | Objective, easy to automate | Locates why a run failed |
| Cons | Blind to process | Needs a ground-truth trajectory, which is expensive to produce |
Google Vertex AI's pattern pairs three trajectory metrics (trajectory_exact_match, trajectory_precision, trajectory_recall) with two outcome metrics (task_success_rate and response_quality).
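To make the three trajectory metrics concrete, here is a minimal sketch in plain Python. It assumes a trajectory is just an ordered list of tool names and uses simple membership-based definitions. The exact formulas in Vertex AI may differ (for example, in how tool arguments or repeated calls are handled), and the tool names are hypothetical.

```python
def trajectory_exact_match(predicted, reference):
    """1.0 only if the agent called exactly the reference tools in the same order."""
    return 1.0 if list(predicted) == list(reference) else 0.0


def trajectory_precision(predicted, reference):
    """Share of the agent's tool calls that also appear in the reference trajectory."""
    if not predicted:
        return 0.0  # assumption: an empty prediction scores 0
    reference_set = set(reference)
    return sum(1 for tool in predicted if tool in reference_set) / len(predicted)


def trajectory_recall(predicted, reference):
    """Share of the reference tool calls that the agent actually made."""
    if not reference:
        return 0.0  # assumption: an empty reference scores 0
    predicted_set = set(predicted)
    return sum(1 for tool in reference if tool in predicted_set) / len(reference)


# Hypothetical refund task: the agent made one unnecessary knowledge-base search.
reference = ["lookup_order", "issue_refund"]
predicted = ["lookup_order", "search_kb", "issue_refund"]

exact = trajectory_exact_match(predicted, reference)
precision = trajectory_precision(predicted, reference)
recall = trajectory_recall(predicted, reference)
```

In this example the agent did everything it needed to (recall of 1.0) but also did something it did not need to, so precision drops to 2/3 and the exact match fails. That is the kind of signal an outcome metric alone cannot give.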
3. Five Agent Benchmarks to Know in 2026
🐝 SWE-bench
2,294 bug-fix tasks drawn from real GitHub issues and pull requests in Python repositories. The agent must read the codebase, locate the bug, write a patch, and get every test to pass before the task counts as solved. SWE-bench Verified is the expert-vetted subset. 2026 SOTA ≈ 70% (human experts ≈ 90%).
🌐 GAIA
Written by teams from Meta, HuggingFace, and AutoGPT: 466 real-world research questions (e.g., "How many of the NeurIPS 2023 Best Paper award winners are Japanese?"). It tests web browsing, multimodal understanding, and reasoning. 2026 SOTA is about 80% on Level 1 and about 30% on Level 3.
🛠️ TAU-bench / τ²-bench
Proposed by Sierra, it simulates enterprise customer service, booking, and order modification through multi-turn tool calls. The 2026 τ²-bench adds voice and retrieval domains and includes a 38-model leaderboard (updated April 2026).
📋 AgentBench
An 8-domain benchmark from Tsinghua University (OS, DB, KG, card game, web shopping, and others). It is academia's broad-capability test for agents: wider than SWE-bench but shallower in each domain.
🌐 WebArena
A browser-agent evaluation in realistic site environments (GitLab, Reddit, an e-commerce sandbox). It tests long-horizon web operations. 2026 SOTA is still below 60%, so browser agents remain a hard problem.
⚖️ Limits of static benchmarks
Frontier models improve so fast that a static benchmark saturates within months of release. Kili Technology's 2026 report found a 37% performance gap between lab benchmarks and real deployment, and up to a 50× difference in cost at the same accuracy. Static benchmarks are a necessary starting point, not the finish line.
4. Build an Eval Set for Your Own Agent
Industry consensus: generic benchmarks only allow a rough comparison between models, so every production agent needs its own eval set. Steps:
- Collect 100–500 real conversations from production logs, support tickets, and user reports.
- Label in layers: (a) the expected final answer, (b) tools that must be called, (c) tools that must NOT be called, (d) the conversation's risk tier.
- Build a failure-mode list: "hallucinated price," "unauthorized send," "loop," "no citation."
- Run regressions regularly: every prompt change, model swap, or tool update triggers a full pass over the eval set.
- Augment with LLM-as-Judge: use GPT-5 / Claude Opus as a judge to score open-ended questions. Calibrate the judge by comparing it against a sample of human ratings.
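The layered labels and the failure-mode list above can be checked mechanically before any LLM judge is involved. The following is a rough sketch under stated assumptions: `EvalCase`, `check_tools`, and `has_loop` are hypothetical names, a tool call is reduced to its name, and "loop" is approximated as the same tool being called more than `max_repeats` times in a row.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    """One labeled conversation from the eval set (hypothetical schema)."""
    user_input: str
    expected_answer: str                                     # label (a)
    required_tools: set[str] = field(default_factory=set)   # label (b)
    forbidden_tools: set[str] = field(default_factory=set)  # label (c)
    risk_tier: str = "low"                                   # label (d)


def check_tools(case, tools_called):
    """Compare the tools the agent called against labels (b) and (c)."""
    called = set(tools_called)
    return {
        "missing_required": sorted(case.required_tools - called),
        "forbidden_called": sorted(case.forbidden_tools & called),
    }


def has_loop(tools_called, max_repeats=3):
    """Flag the 'loop' failure mode: one tool called more than max_repeats times in a row."""
    run_length, previous = 0, None
    for name in tools_called:
        run_length = run_length + 1 if name == previous else 1
        if run_length > max_repeats:
            return True
        previous = name
    return False


case = EvalCase(
    user_input="I want a refund for order #4421",
    expected_answer="Refund issued. Funds in 3-5 business days.",
    required_tools={"lookup_order", "issue_refund"},
    forbidden_tools={"send_email"},
    risk_tier="high",
)
report = check_tools(case, ["lookup_order", "send_email"])
```

Here `report` shows that `issue_refund` was never called and that the forbidden `send_email` was, which maps directly onto the "unauthorized send" failure mode.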
    # pip install deepeval
    from deepeval import assert_test
    from deepeval.metrics import AnswerRelevancyMetric, GEval, ToolCorrectnessMetric
    from deepeval.test_case import LLMTestCase, LLMTestCaseParams, ToolCall


    def test_refund_flow():
        user_input = "I want a refund for order #4421"
        case = LLMTestCase(
            input=user_input,
            # my_agent is your own agent entry point; it is not part of deepeval.
            actual_output=my_agent(user_input),
            # Hardcoded here for brevity; in a real test, read tools_called from the agent's trace.
            tools_called=[ToolCall(name="lookup_order"), ToolCall(name="issue_refund")],
            expected_tools=[ToolCall(name="lookup_order"), ToolCall(name="issue_refund")],
            expected_output="Refund issued. Funds in 3-5 business days.",
        )
        assert_test(case, metrics=[
            ToolCorrectnessMetric(threshold=0.9),
            AnswerRelevancyMetric(threshold=0.85),
            # GEval needs evaluation_params to know which test-case fields the judge should read.
            GEval(
                name="Politeness",
                criteria="Reply is polite and apologetic",
                evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
                threshold=0.8,
            ),
        ])
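Calibrating an LLM judge against human samples usually comes down to measuring agreement on the same set of cases. One common choice is Cohen's kappa, which corrects raw agreement for chance. The sketch below is an assumption about how you might do this, not a prescribed method; the pass/fail labels are made-up illustration data.

```python
def cohens_kappa(human, judge):
    """Chance-corrected agreement between two label lists of equal length."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    labels = set(human) | set(judge)
    # Agreement expected if both raters labeled at random with their own label frequencies.
    expected = sum((human.count(label) / n) * (judge.count(label) / n) for label in labels)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Illustration data: 1 = "pass", 0 = "fail" on ten sampled conversations.
human_labels = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
judge_labels = [1, 1, 0, 1, 1, 0, 1, 0, 0, 1]

raw_agreement = sum(h == j for h, j in zip(human_labels, judge_labels)) / len(human_labels)
kappa = cohens_kappa(human_labels, judge_labels)
```

With these labels the raw agreement is 0.8 while kappa is roughly 0.58, which shows why raw agreement alone can overstate how well the judge tracks human raters. What kappa counts as "good enough" is a team decision.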
5. Four Must-Haves for Agent Production Observability
① Full trace
For each user-to-agent session, log every LLM call, every tool call, inputs and outputs, latency, and token counts. LangSmith, Langfuse, Arize, and OpenAI Traces all provide this.
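If you want to see what such a trace contains before adopting one of those platforms, a tiny in-memory version is enough. This is a sketch only: `Trace` and `Span` are hypothetical classes, not the data model of any of the tools named above, and the recorded values are placeholders.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    """One LLM call or tool call inside a session."""
    kind: str            # "llm" or "tool"
    name: str
    input_text: str
    output_text: str = ""
    latency_ms: float = 0.0
    tokens: int = 0


@dataclass
class Trace:
    """All spans recorded for one user-to-agent session."""
    session_id: str
    spans: list[Span] = field(default_factory=list)

    @contextmanager
    def span(self, kind, name, input_text):
        record = Span(kind=kind, name=name, input_text=input_text)
        start = time.perf_counter()
        try:
            yield record  # the caller fills in output_text and tokens
        finally:
            # Record latency even if the wrapped call raises.
            record.latency_ms = (time.perf_counter() - start) * 1000
            self.spans.append(record)


trace = Trace(session_id="demo-session")
with trace.span("tool", "lookup_order", "order #4421") as s:
    s.output_text = '{"status": "delivered"}'   # placeholder tool result
with trace.span("llm", "draft_reply", "Write the refund confirmation") as s:
    s.output_text = "Refund issued."            # placeholder model output
    s.tokens = 42

total_tokens = sum(span.tokens for span in trace.spans)
```

The `finally` block matters: a span that fails is exactly the one you most want to see in the trace.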
② Cost dashboard
Break down tokens and dollars by user, task type, and model. Without this, you will not see the consequences of a model upgrade.
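The breakdown itself is a group-by over usage records. The sketch below assumes each record already carries a user, a task type, a model name, and a token count. The model names and per-token prices are invented for illustration and do not reflect any vendor's pricing.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1,000 tokens.
PRICE_PER_1K_TOKENS = {"small-model": 0.0005, "large-model": 0.01}


def cost_by(records, key):
    """Sum dollar cost per value of `key` ("user", "task_type", or "model")."""
    totals = defaultdict(float)
    for record in records:
        price = PRICE_PER_1K_TOKENS[record["model"]]
        totals[record[key]] += record["tokens"] / 1000 * price
    return dict(totals)


usage = [
    {"user": "u1", "task_type": "refund", "model": "large-model", "tokens": 2000},
    {"user": "u1", "task_type": "faq", "model": "small-model", "tokens": 4000},
    {"user": "u2", "task_type": "refund", "model": "large-model", "tokens": 1000},
]

by_task = cost_by(usage, "task_type")
by_model = cost_by(usage, "model")
by_user = cost_by(usage, "user")
```

Running the same three views before and after a model upgrade is what makes the upgrade's cost impact visible.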
③ Failure alerts
Send automatic Slack/PagerDuty notifications when a run hits max_iterations, a tool fails several times in a row, the agent calls escalate_to_human, or cost spikes.
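These four conditions translate into a small rule function that runs on each finished session. The sketch below is one possible shape: the run summary fields and the thresholds (10 iterations, 2 consecutive failures, $0.50) are assumptions, and the actual Slack/PagerDuty delivery is left out.

```python
def alerts_for(run, max_iterations=10, max_consecutive_tool_failures=2, cost_limit_usd=0.50):
    """Return the names of the alert rules that a finished run triggers."""
    alerts = []
    if run["iterations"] >= max_iterations:
        alerts.append("max_iterations_hit")

    # Longest streak of failed tool calls (False = failure) in call order.
    streak = longest = 0
    for succeeded in run["tool_results"]:
        streak = 0 if succeeded else streak + 1
        longest = max(longest, streak)
    if longest >= max_consecutive_tool_failures:
        alerts.append("consecutive_tool_failures")

    if run["escalated_to_human"]:
        alerts.append("escalate_to_human")
    if run["cost_usd"] > cost_limit_usd:
        alerts.append("cost_spike")
    return alerts


# Hypothetical run summary, e.g. derived from a stored trace.
run = {
    "iterations": 10,
    "tool_results": [True, False, False, True],
    "escalated_to_human": False,
    "cost_usd": 0.12,
}
triggered = alerts_for(run)
```

A fixed dollar limit is the simplest way to define a cost spike; comparing against a rolling average per task type would be a reasonable alternative.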
④ A/B and canary
Roll a new prompt or model out to 5% of traffic first, and watch trajectory and outcome metrics before scaling up. LangSmith and Statsig both support this.
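If you are not using a platform for traffic splitting, a deterministic hash bucket is a common way to get a stable 5% slice. This sketch assumes users are identified by a string ID. `in_canary` and the `salt` value are hypothetical and are not how LangSmith or Statsig implement it.

```python
import hashlib


def in_canary(user_id, percent=5, salt="prompt-v4"):
    """Deterministically place a user in the canary group for `percent`% of traffic."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100   # stable bucket in 0..99
    return bucket < percent


def choose_variant(user_id):
    """Route a request to the new or the current agent configuration."""
    return "candidate" if in_canary(user_id) else "production"


# Share of 10,000 synthetic user IDs that land in the canary group.
user_ids = [f"user-{i}" for i in range(10_000)]
canary_share = sum(in_canary(uid) for uid in user_ids) / len(user_ids)
```

Hashing instead of random sampling keeps each user on the same variant across requests, and changing the salt per experiment reshuffles who lands in the canary group.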
6. Visualizing an Agent Scorecard
The table below shows three versions of an agent scored on the same eval set; v3 is currently in production. The lesson: never look at just one metric.
| Metric | v1 (baseline) | v2 (more tools) | v3 (better prompt) |
|---|---|---|---|
| Task Success | 62% | 71% | 78% |
| Tool Precision | 0.65 | 0.51 ⚠ | 0.82 |
| Avg Steps | 4.2 | 7.8 ⚠ | 3.9 |
| Cost / Task | $0.04 | $0.11 ⚠ | $0.05 |
| Hallucination Rate | 7% | 6% | 2% |
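The same comparison can be automated as a release gate that checks every metric against the baseline. The sketch below copies the numbers from the table; the `regressions` helper and its `tolerance` parameter are hypothetical.

```python
SCORECARD = {
    "v1": {"task_success": 0.62, "tool_precision": 0.65, "avg_steps": 4.2,
           "cost_per_task": 0.04, "hallucination_rate": 0.07},
    "v2": {"task_success": 0.71, "tool_precision": 0.51, "avg_steps": 7.8,
           "cost_per_task": 0.11, "hallucination_rate": 0.06},
    "v3": {"task_success": 0.78, "tool_precision": 0.82, "avg_steps": 3.9,
           "cost_per_task": 0.05, "hallucination_rate": 0.02},
}

# For every other metric in the scorecard, lower is better.
HIGHER_IS_BETTER = {"task_success", "tool_precision"}


def regressions(candidate, baseline, tolerance=0.0):
    """List metrics where the candidate is worse than the baseline by more than `tolerance` (relative)."""
    worse = []
    for metric, base in baseline.items():
        new = candidate[metric]
        if metric in HIGHER_IS_BETTER:
            if new < base * (1 - tolerance):
                worse.append(metric)
        elif new > base * (1 + tolerance):
            worse.append(metric)
    return worse


v2_regressions = regressions(SCORECARD["v2"], SCORECARD["v1"])
v3_regressions = regressions(SCORECARD["v3"], SCORECARD["v1"])
v3_with_slack = regressions(SCORECARD["v3"], SCORECARD["v1"], tolerance=0.3)
```

v2 improves task success yet regresses on tool precision, step count, and cost, which is the pattern the ⚠ marks highlight. v3 is better on everything except a slightly higher cost per task than v1, so whether it passes depends on how much slack the team allows.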
🎓 Chapter Quiz
Q1. What capability does SWE-bench primarily measure?
Q2. Why can't generic benchmarks be used on their own to evaluate a production agent?
Q3. What problem do trajectory metrics solve that outcome metrics cannot?