Step 9: Agent Evaluation — AI Agents Tutorial

Starting Point

1. Why Is Agent Evaluation 10× Harder Than LLM Evaluation?

LLM evaluation is relatively straightforward: feed the model N prompts, compare each output to a gold answer, and compute accuracy.

Agents are different because their output is a sequence of actions. The same final answer can come from an elegant 3-step path or from 30 chaotic steps of trial and error, so you have to evaluate both whether the agent got the right answer and whether the process was sensible and affordable.

LangChain's 2026 report found that 57% of organizations have agents in production, yet 32% name quality as their top deployment blocker. Gartner separately forecasts that more than 40% of agentic AI projects will be cancelled by the end of 2027. Most of them die the same way: a great demo that turns terrible after a month of real use.

This chapter covers three evaluation axes: outcome, trajectory, and production observability.

Two Kinds of Metrics

2. Outcome Metrics vs Trajectory Metrics

	Outcome	Trajectory
Question	"Is the answer right?"	"Was the process sensible?"
Examples	task_success_rate, F1, BLEU, pass@1	tool_call_precision, plan_quality, step_count
Pros	Objective, easy to automate	Locates why a run failed
Cons	Blind to the process	Needs a ground-truth trajectory, which is expensive to label

📐

Google Vertex AI pattern: combine three trajectory metrics (trajectory_exact_match, trajectory_precision, trajectory_recall) with the outcome metrics task_success_rate and response_quality.
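To make the three trajectory metrics concrete, here is a minimal pure-Python sketch. It assumes a trajectory is just an ordered list of tool names and uses one common multiset-overlap definition of precision and recall; it is not Vertex AI's implementation, and the tool names are illustrative.

```python
from collections import Counter
from typing import Sequence


def trajectory_exact_match(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """1.0 only if the same tools were called in the same order."""
    return 1.0 if list(predicted) == list(reference) else 0.0


def _overlap(predicted: Sequence[str], reference: Sequence[str]) -> int:
    """Number of calls shared by both trajectories, counting repeats."""
    return sum((Counter(predicted) & Counter(reference)).values())


def trajectory_precision(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Share of the agent's calls that also appear in the reference."""
    if not predicted:
        return 0.0
    return _overlap(predicted, reference) / len(predicted)


def trajectory_recall(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Share of the reference calls that the agent actually made."""
    if not reference:
        return 0.0  # design choice of this sketch: an empty reference scores 0
    return _overlap(predicted, reference) / len(reference)


# Illustrative trajectories: the agent reached the goal but took a detour.
reference = ["lookup_order", "issue_refund"]
predicted = ["lookup_order", "search_faq", "lookup_order", "issue_refund"]

exact = trajectory_exact_match(predicted, reference)
precision = trajectory_precision(predicted, reference)
recall = trajectory_recall(predicted, reference)
print(exact, precision, recall)
```

In this example recall is perfect while precision is not: the agent made every required call, but half of its calls were unnecessary. That is the kind of signal an outcome metric alone cannot show.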

Major Benchmarks

3. Five Agent Benchmarks to Know in 2026

🐝 SWE-bench

2,294 bug-fix tasks drawn from real pull requests in Python repositories on GitHub. The agent must read the codebase, locate the bug, write a patch, and get the tests to pass. SWE-bench Verified is the expert-vetted subset of 500 tasks. 2026 SOTA ≈ 70% (human experts ≈ 90%).

🌐 GAIA

From Meta, HuggingFace, and the AutoGPT team: 466 real-world research questions (for example, "How many of the NeurIPS 2023 Best Paper award winners are Japanese?"). It tests web browsing, multimodal understanding, and reasoning. 2026 SOTA: ≈ 80% on Level 1, ≈ 30% on Level 3.

🛠️ TAU-bench / τ²-bench

From Sierra. It simulates enterprise customer service, booking, and order modification through multi-turn tool calls. The 2026 τ²-bench adds voice and retrieval domains and maintains a 38-model leaderboard (updated April 2026).

📋 AgentBench

From Tsinghua University: an 8-domain benchmark (OS, DB, KG, card game, web shopping, and others) that serves as academia's broad-capability test for agents. It is wider than SWE-bench but shallower in each domain.

🌐 WebArena

A browser-agent evaluation in realistic site environments (GitLab, Reddit, and an e-commerce sandbox). It tests long-horizon web operations. 2026 SOTA is still below 60%, so browser agents remain a hard problem.

⚖️ Limits of static benchmarks

Frontier models improve so fast that a static benchmark saturates within months of release. Kili Technology's 2026 report found a 37% performance gap between lab benchmarks and real deployment, and up to 50× cost variation at similar accuracy. Static benchmarks are a necessary starting point, not the finish line.

Building Your Own Eval Set

4. Build an Eval Set for Your Own Agent

Industry consensus: generic benchmarks only allow a rough comparison between models, so every production agent needs its own eval set. The steps:

1. Collect 100–500 real conversations from production logs, support tickets, and user reports.
2. Label each one in layers: (a) the expected final answer, (b) tools that must be called, (c) tools that must not be called, (d) the conversation's risk tier.
3. Build a failure-mode list: "hallucinated price," "unauthorized send," "loop," "no citation."
4. Run regressions regularly: every change to the prompt, model, or tools triggers a full pass over the eval set.
5. Augment with LLM-as-Judge: have GPT-5 or Claude Opus score open-ended questions, and calibrate the judge against a sample of human ratings.
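The layered labels and the regression run can be sketched without any framework. The following is a minimal illustration, not a reference design: `EvalCase`, `AgentRun`, `check_case`, `regression_run`, and the stand-in `fake_agent` are hypothetical names, and the substring check on the answer is a deliberately crude placeholder for a real answer-quality metric.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EvalCase:
    """One labeled conversation; field names are illustrative."""
    user_input: str
    expected_answer: str
    required_tools: set[str] = field(default_factory=set)
    forbidden_tools: set[str] = field(default_factory=set)
    risk_tier: str = "low"


@dataclass
class AgentRun:
    """What the agent did on one case: final reply plus tool names called."""
    answer: str
    tools_called: list[str]


def check_case(case: EvalCase, run: AgentRun) -> list[str]:
    """Return the failure modes triggered by this run (empty list = pass)."""
    failures = []
    called = set(run.tools_called)
    missing = case.required_tools - called
    if missing:
        failures.append(f"missing_tool:{','.join(sorted(missing))}")
    banned = case.forbidden_tools & called
    if banned:
        failures.append(f"forbidden_tool:{','.join(sorted(banned))}")
    if case.expected_answer.lower() not in run.answer.lower():
        failures.append("answer_mismatch")
    return failures


def regression_run(cases: list[EvalCase], agent: Callable[[str], AgentRun]) -> float:
    """Run every case and return the pass rate (assumes a non-empty case list)."""
    passed = sum(1 for case in cases if not check_case(case, agent(case.user_input)))
    return passed / len(cases)


# Hypothetical stand-in agent, used only to make the sketch executable.
def fake_agent(user_input: str) -> AgentRun:
    if "refund" in user_input:
        return AgentRun("Refund issued. Funds in 3-5 business days.",
                        ["lookup_order", "issue_refund"])
    return AgentRun("Here is your invoice.", ["send_email"])


cases = [
    EvalCase("I want a refund for order #4421", "refund issued",
             {"lookup_order", "issue_refund"}, {"send_email"}, "high"),
    EvalCase("Show my invoice", "invoice",
             {"lookup_order"}, {"send_email"}, "low"),
]
pass_rate = regression_run(cases, fake_agent)
print(pass_rate)
```

Returning named failure modes instead of a bare pass/fail makes it possible to count how often each entry on the failure-mode list occurs across regression runs.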

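# pip install deepeval
# Note: AnswerRelevancyMetric and GEval use an LLM judge, so a judge model must be configured.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, GEval, ToolCorrectnessMetric
from deepeval.test_case import LLMTestCase, LLMTestCaseParams, ToolCall

def test_refund_flow():
    user_input = "I want a refund for order #4421"
    case = LLMTestCase(
        input=user_input,
        # my_agent is a placeholder for your own agent's entry point.
        actual_output=my_agent(user_input),
        # Hard-coded for brevity; in a real test, fill this from the agent's recorded tool calls.
        tools_called=[ToolCall(name="lookup_order"), ToolCall(name="issue_refund")],
        expected_tools=[ToolCall(name="lookup_order"), ToolCall(name="issue_refund")],
        expected_output="Refund issued. Funds in 3-5 business days.",
    )
    assert_test(case, metrics=[
        ToolCorrectnessMetric(threshold=0.9),
        AnswerRelevancyMetric(threshold=0.85),
        # GEval needs to know which test-case fields the judge should read.
        GEval(
            name="Politeness",
            criteria="Reply is polite and apologetic",
            evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
            threshold=0.8,
        ),
    ])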
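An LLM judge such as the politeness metric above is only useful once it has been calibrated against human ratings. One simple way to do that, sketched here under the assumption that both the judge and a human reviewer assign a categorical label to the same sampled cases, is to compute raw agreement and Cohen's kappa. The label lists below are made-up illustration data, not results.

```python
from collections import Counter


def cohen_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between two label lists of equal length."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_freq, judge_freq = Counter(human), Counter(judge)
    # Agreement expected if both raters labeled at random with their own label rates.
    expected = sum(human_freq[label] * judge_freq[label] for label in human_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels for eight sampled conversations.
human_labels = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
judge_labels = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "pass"]

raw_agreement = sum(h == j for h, j in zip(human_labels, judge_labels)) / len(human_labels)
kappa = cohen_kappa(human_labels, judge_labels)
print(raw_agreement, round(kappa, 3))
```

Raw agreement can look high simply because most cases pass; kappa corrects for that, which is why it is the more cautious number to track when deciding whether the judge can stand in for human review.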

Production Observability

5. Four Must-Haves for Agent Observability in Production

① Full trace

For every user-to-agent session, log every LLM call, every tool call, their inputs and outputs, latency, and token counts. LangSmith, Langfuse, Arize, and OpenAI Traces all provide this.
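As a rough picture of what such a trace contains, here is a minimal in-memory sketch. It does not use the API of any of the tools named above; `Span`, `Trace`, and the demo `lookup_order` tool are hypothetical, and the token count is passed in by hand rather than read from a model response.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Span:
    """One LLM or tool call inside a session."""
    kind: str          # "llm" or "tool"
    name: str
    inputs: Any
    output: Any
    latency_s: float
    tokens: int = 0


@dataclass
class Trace:
    """All spans recorded for one user-to-agent session."""
    session_id: str
    spans: list[Span] = field(default_factory=list)

    def record(self, kind: str, name: str, fn: Callable[..., Any],
               *args: Any, tokens: int = 0) -> Any:
        """Call fn(*args), time it, and append a span with its inputs and output."""
        start = time.perf_counter()
        output = fn(*args)
        latency = time.perf_counter() - start
        self.spans.append(Span(kind, name, args, output, latency, tokens))
        return output

    def total_tokens(self) -> int:
        return sum(span.tokens for span in self.spans)


# Hypothetical tool used only for the demonstration.
def lookup_order(order_id: str) -> dict:
    return {"order_id": order_id, "status": "delivered"}


trace = Trace(session_id="demo-1")
order = trace.record("tool", "lookup_order", lookup_order, "4421")
reply = trace.record(
    "llm", "draft_reply",
    lambda o: f"Order {o['order_id']} is {o['status']}.",  # stands in for a model call
    order, tokens=42,
)
print(len(trace.spans), trace.total_tokens(), reply)
```

This sketch records nothing when the wrapped call raises; a production tracer would also capture the exception and the partial timing.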

② Cost dashboard

Break token and dollar cost down by user, task type, and model. Without this view, the cost impact of a model upgrade stays invisible.
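The breakdown itself is a group-by over logged calls. A minimal sketch follows; the model names and per-token prices are placeholders chosen for easy arithmetic, not real vendor pricing, and the log records are invented.

```python
from collections import defaultdict

# Assumed prices in USD per 1,000 tokens; placeholders, not real vendor pricing.
PRICE_PER_1K_TOKENS = {"small-model": 0.001, "large-model": 0.01}


def cost_usd(model: str, tokens: int) -> float:
    """Dollar cost of one call at the assumed per-token price."""
    return PRICE_PER_1K_TOKENS[model] * tokens / 1000


def cost_breakdown(calls: list[dict], key: str) -> dict[str, float]:
    """Sum dollar cost per value of `key` ("user", "task_type", or "model")."""
    totals: dict[str, float] = defaultdict(float)
    for call in calls:
        totals[call[key]] += cost_usd(call["model"], call["tokens"])
    return dict(totals)


# Invented log records for illustration.
calls = [
    {"user": "alice", "task_type": "refund", "model": "large-model", "tokens": 2000},
    {"user": "alice", "task_type": "faq", "model": "small-model", "tokens": 1000},
    {"user": "bob", "task_type": "refund", "model": "small-model", "tokens": 4000},
]

by_model = cost_breakdown(calls, "model")
by_task = cost_breakdown(calls, "task_type")
by_user = cost_breakdown(calls, "user")
print(by_model, by_task, by_user)
```

Keeping the price table separate from the aggregation means a model upgrade only changes one dictionary entry, and the same logs can be re-priced to compare before and after.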

③ Failure alerts

Send automatic Slack or PagerDuty notifications when a run hits max_iterations, a tool fails on consecutive calls, the agent triggers escalate_to_human, or cost spikes.
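The four alert conditions can be expressed as a small rule function over a per-session summary. This is a sketch under assumed thresholds: `SessionSummary`, the default limits, and the alert names are hypothetical, and delivery to Slack or PagerDuty is left out.

```python
from dataclasses import dataclass


@dataclass
class SessionSummary:
    """Per-session facts an alert rule needs; field names are illustrative."""
    steps: int
    tool_results: list[bool]   # True = tool call succeeded, in call order
    escalated_to_human: bool
    cost_usd: float


def failure_alerts(s: SessionSummary, max_iterations: int = 10,
                   max_consecutive_failures: int = 2,
                   cost_limit_usd: float = 0.50) -> list[str]:
    """Return the names of all alert rules this session triggers."""
    alerts = []
    if s.steps >= max_iterations:
        alerts.append("max_iterations_hit")
    # Track the longest run of back-to-back tool failures.
    streak = longest = 0
    for ok in s.tool_results:
        streak = 0 if ok else streak + 1
        longest = max(longest, streak)
    if longest >= max_consecutive_failures:
        alerts.append("consecutive_tool_failures")
    if s.escalated_to_human:
        alerts.append("escalate_to_human")
    if s.cost_usd > cost_limit_usd:
        alerts.append("cost_spike")
    return alerts


session = SessionSummary(steps=10, tool_results=[True, False, False, True],
                         escalated_to_human=False, cost_usd=0.12)
alerts = failure_alerts(session)
print(alerts)
```

A fixed dollar limit is the simplest possible cost rule; comparing against a rolling average per task type would catch spikes relative to normal behavior instead.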

④ A/B and canary

Roll a new prompt or model out to 5% of traffic first, and watch trajectory and outcome metrics before scaling up. LangSmith and Statsig both support this.
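A 5% rollout is usually made sticky per user, so the same person always sees the same variant. One way to do that without any external service is hash-based bucketing, sketched below; the function names, the salt value, and the user IDs are hypothetical, and this does not reflect how LangSmith or Statsig implement it.

```python
import hashlib


def in_canary(user_id: str, percent: float = 5.0, salt: str = "prompt-v4") -> bool:
    """Deterministically place about `percent`% of users in the canary group."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # bucket in 0..9999
    return bucket < percent * 100


def pick_variant(user_id: str) -> str:
    """Route a user to the candidate or the current production configuration."""
    return "candidate" if in_canary(user_id) else "production"


# Check the split on synthetic user IDs.
users = [f"user-{i}" for i in range(20_000)]
canary_share = sum(in_canary(u) for u in users) / len(users)
print(f"canary share: {canary_share:.3f}")
```

Changing the salt for each experiment reshuffles the buckets, so the same 5% of users are not always the ones exposed to untested changes.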

Interactive Demo

6. Visualizing the Agent Scorecard

The table below shows three versions of an agent on the same eval set; v3 is the one currently in production. The takeaway: never judge on a single metric.

Metric	v1 (baseline)	v2 (more tools)	v3 (better prompt)
Task Success	62%	71%	78%
Tool Precision	0.65	0.51 ⚠	0.82
Avg Steps	4.2	7.8 ⚠	3.9
Cost / Task	$0.04	$0.11 ⚠	$0.05
Hallucination Rate	7%	6%	2%

⚠️

Observation: v2 added more tools and raised task success, but tool precision dropped sharply (0.65 → 0.51), average steps nearly doubled, and cost per task nearly tripled: a classic metric trade-off. v3 used a better prompt to sharpen tool selection and leads on every metric except cost per task, where it is one cent above v1 ($0.05 vs $0.04).
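Reading a scorecard like this by eye does not scale to every release, so the comparison can be turned into a simple regression gate. The sketch below copies the numbers from the table; the `regressions` helper and the choice of which metrics count as "higher is better" are assumptions of this example.

```python
# Scorecard values copied from the table above.
SCORECARD = {
    "v1": {"task_success": 0.62, "tool_precision": 0.65, "avg_steps": 4.2,
           "cost_per_task": 0.04, "hallucination_rate": 0.07},
    "v2": {"task_success": 0.71, "tool_precision": 0.51, "avg_steps": 7.8,
           "cost_per_task": 0.11, "hallucination_rate": 0.06},
    "v3": {"task_success": 0.78, "tool_precision": 0.82, "avg_steps": 3.9,
           "cost_per_task": 0.05, "hallucination_rate": 0.02},
}
# For every other metric, lower is better.
HIGHER_IS_BETTER = {"task_success", "tool_precision"}


def regressions(candidate: str, baseline: str, tolerance: float = 0.0) -> list[str]:
    """Metrics on which `candidate` is worse than `baseline` by more than `tolerance`."""
    worse = []
    for metric, base in SCORECARD[baseline].items():
        cand = SCORECARD[candidate][metric]
        improvement = (cand - base) if metric in HIGHER_IS_BETTER else (base - cand)
        if improvement < -tolerance:
            worse.append(metric)
    return worse


v2_vs_v1 = regressions("v2", "v1")
v3_vs_v1 = regressions("v3", "v1")
v3_vs_v2 = regressions("v3", "v2")
cost_ratio_v2 = SCORECARD["v2"]["cost_per_task"] / SCORECARD["v1"]["cost_per_task"]
print(v2_vs_v1, v3_vs_v1, v3_vs_v2, round(cost_ratio_v2, 2))
```

On the table's numbers, the gate flags tool precision, average steps, and cost for v2 against v1, nothing for v3 against v2, and only cost per task for v3 against v1. A per-metric tolerance would let a team accept a small cost increase in exchange for a large quality gain.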

🎓 Chapter Quiz

Q1. What does SWE-bench primarily measure?

A) Image understanding

B) Fixing bugs in GitHub repos and passing the tests

C) Customer-service conversation quality

D) Creative writing

✅ B. SWE-bench consists of bug-fix tasks taken from real GitHub PRs and is the flagship benchmark for coding agents.

Q2. Why aren't generic benchmarks enough on their own to evaluate a production agent?

A) Benchmarks are untrustworthy

B) Lab and real-deployment results differ significantly, and every use case is different

C) Benchmarks have no license

D) They only look at outcomes

✅ B. You need your own eval set plus production observability to get a signal that reflects real quality.

Q3. What problem does a trajectory metric solve that an outcome metric cannot?

A) It fully replaces outcome metrics

B) It locates why a run failed

C) It is cheaper than outcome metrics

D) It needs no labeling

✅ B. "Right answer, messy process" and "wrong answer, but close" are both visible only in the trajectory.
