STEP 9 / 12 · Advanced Capabilities

Agent Evaluation

Outcome vs trajectory metrics, SWE-bench / GAIA / TAU-bench, observability, and cost tracking: turning agents from a "demo" into a "measurable product."

1. Why Is Agent Evaluation 10× Harder Than LLM Evaluation?

LLM evaluation is relatively straightforward: feed in N prompts, compare the outputs against gold answers, and compute accuracy.

Agents are different because their output is a sequence of actions. The same final result can come from an elegant, minimal 3-step path or from 30 chaotic steps of trial and error. You have to evaluate both "did it get the right answer" and "was the process sensible and affordable."

LangChain's 2026 report found that 57% of organizations have deployed agents to production, yet 32% name "quality" as their biggest bottleneck. More than 40% of agentic AI projects are projected to be cancelled by the end of 2027, and most die from "great demo, falls apart after running for a while."

This chapter covers three axes of agent evaluation: outcome, trajectory, and production observability.

2. Outcome Metrics vs Trajectory Metrics

|          | Outcome                              | Trajectory                                                    |
|----------|--------------------------------------|---------------------------------------------------------------|
| Question | "Is the answer right?"               | "Was the process sensible?"                                   |
| Examples | task_success_rate, F1, BLEU, pass@1  | tool_call_precision, plan_quality, step_count                 |
| Pros     | Objective, easy to automate          | Locates "why it failed"                                       |
| Cons     | Blind to process                     | Needs ground-truth trajectories, which are expensive to label |
📐 Google Vertex AI pattern: use the three trajectory metrics trajectory_exact_match, trajectory_precision, and trajectory_recall alongside the outcome metrics task_success_rate and response_quality.
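
To make the three trajectory metrics concrete, here is a minimal sketch that scores a predicted tool-call sequence against a reference one. It is a simplified illustration that compares tool names only; it is not Vertex AI's implementation, and the function signatures and sample tool names (`lookup_order`, `search_faq`, `issue_refund`) are assumptions made for this example.

```python
from typing import Sequence


def trajectory_exact_match(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """1.0 only if the same tools were called in the same order."""
    return 1.0 if list(predicted) == list(reference) else 0.0


def trajectory_precision(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Share of predicted calls that also appear in the reference trajectory."""
    if not predicted:
        return 0.0
    reference_tools = set(reference)
    return sum(1 for tool in predicted if tool in reference_tools) / len(predicted)


def trajectory_recall(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Share of reference calls that the agent actually made."""
    if not reference:
        return 0.0
    predicted_tools = set(predicted)
    return sum(1 for tool in reference if tool in predicted_tools) / len(reference)


# Hypothetical trajectories: the agent took a detour through an FAQ search.
reference = ["lookup_order", "issue_refund"]
predicted = ["lookup_order", "search_faq", "lookup_order", "issue_refund"]

print(trajectory_exact_match(predicted, reference))  # 0.0
print(trajectory_precision(predicted, reference))    # 0.75
print(trajectory_recall(predicted, reference))       # 1.0
```

In this example the agent did everything it needed to (recall 1.0) but one of its four calls was unnecessary (precision 0.75), which is exactly the kind of signal an outcome metric alone would hide.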

3. Five Agent Benchmarks to Know in 2026

🐝 SWE-bench

2,294 bug-fix tasks drawn from real pull requests in Python repositories on GitHub. The agent must read the codebase, locate the bug, write a patch, and get the test suite fully green to count as a success. SWE-bench Verified is the expert-vetted subset. 2026 SOTA ≈ 70% (human experts ≈ 90%).

🌐 GAIA

Written by teams from Meta, HuggingFace, and AutoGPT: 466 real-world research questions (e.g., "How many of the NeurIPS 2023 Best Paper Award winners are Japanese?"). Tests web browsing, multimodal understanding, and reasoning. 2026 SOTA: ~80% on Level 1, ~30% on Level 3.

🛠️ TAU-bench / τ²-bench

Proposed by Sierra. Simulates multi-turn tool calling in enterprise scenarios such as customer service, booking, and order modification. The 2026 τ²-bench adds voice and retrieval domains and includes a 38-model leaderboard (updated April 2026).

📋 AgentBench

An 8-domain benchmark from Tsinghua University (OS, DB, KG, card game, web shopping, etc.). It is academia's broad-capability test for agents: wider than SWE-bench but shallower in each domain.

🌐 WebArena

A browser-agent evaluation in realistic website environments (GitLab, Reddit, an e-commerce sandbox). Tests long-horizon web operations. 2026 SOTA is still below 60%, so browser agents remain a hard problem.

⚖️ Limits of static benchmarks

Frontier models improve so quickly that static benchmarks saturate within months of release. Kili Technology's 2026 report found a 37% performance gap between lab benchmarks and real deployment, and up to 50× cost variation at the same accuracy. Static benchmarks are a necessary starting point, not the finish line.

4. Building an Eval Set for Your Own Agent

Industry consensus: generic benchmarks only allow a rough comparison between models, and every production agent needs its own eval set. Steps to build one:

  1. Collect 100–500 real conversations from production logs, support tickets, and user reports.
  2. Label in layers: (a) expected final answer (b) tools that must be called (c) tools that must NOT be called (d) conversation risk tier.
  3. Build a failure-mode list: "hallucinated price," "unauthorized send," "loop," "no citation."
  4. Run regressions regularly: every prompt change, model swap, or tool update triggers a full eval-set pass.
  5. Augment with LLM-as-Judge: use GPT-5 / Claude Opus as a judge to score open-ended questions. Calibrate the judge by comparing it against a sample of human ratings.
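
# pip install deepeval
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, GEval, ToolCorrectnessMetric
from deepeval.test_case import LLMTestCase, LLMTestCaseParams, ToolCall

# my_agent is your own agent entry point (not defined here): it takes the
# user message and returns the reply text.

def test_refund_flow():
    user_message = "I want a refund for order #4421"
    case = LLMTestCase(
        input=user_message,
        actual_output=my_agent(user_message),
        # Hard-coded for brevity; in a real test, read tools_called from the agent's trace.
        tools_called=[ToolCall(name="lookup_order"), ToolCall(name="issue_refund")],
        expected_tools=[ToolCall(name="lookup_order"), ToolCall(name="issue_refund")],
        expected_output="Refund issued. Funds in 3-5 business days.",
    )
    assert_test(case, metrics=[
        ToolCorrectnessMetric(threshold=0.9),
        AnswerRelevancyMetric(threshold=0.85),
        # GEval is an LLM-as-judge metric; evaluation_params says which fields the judge sees.
        GEval(
            name="Politeness",
            criteria="Reply is polite and apologetic",
            evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
            threshold=0.8,
        ),
    ])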
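
The layered labels from step 2 (required tools, forbidden tools, risk tier) can also be checked without any framework. The following is a minimal sketch under stated assumptions: the `EvalCase` fields, the `check_tool_labels` helper, and the tool name `send_email` are all hypothetical names chosen for illustration, not part of DeepEval or any other library.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    """One labeled conversation in a home-grown eval set (hypothetical schema)."""
    user_input: str
    expected_answer: str
    required_tools: set[str]
    forbidden_tools: set[str] = field(default_factory=set)
    risk_tier: str = "low"


def check_tool_labels(case: EvalCase, tools_called: list[str]) -> list[str]:
    """Return a list of label violations; an empty list means the run passed."""
    called = set(tools_called)
    failures = []
    missing = case.required_tools - called
    if missing:
        failures.append(f"missing required tools: {sorted(missing)}")
    used_forbidden = case.forbidden_tools & called
    if used_forbidden:
        failures.append(f"called forbidden tools: {sorted(used_forbidden)}")
    return failures


case = EvalCase(
    user_input="I want a refund for order #4421",
    expected_answer="Refund issued. Funds in 3-5 business days.",
    required_tools={"lookup_order", "issue_refund"},
    forbidden_tools={"send_email"},
    risk_tier="high",
)

# A run that skipped the refund and sent an email it should not have.
for failure in check_tool_labels(case, ["lookup_order", "send_email"]):
    print(failure)
```

Keeping "must call" and "must not call" as separate label sets lets a regression run report each violation by name, which maps directly onto the failure-mode list from step 3.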

5. Four Must-Haves for Agent Production Observability

Full traces

For each user → agent session, record every LLM call, every tool call, inputs and outputs, latency, and token counts. LangSmith, Langfuse, Arize, and OpenAI Traces all provide this.
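
As a rough picture of what such a trace contains, here is a minimal in-memory sketch. It does not use the API of LangSmith, Langfuse, Arize, or OpenAI Traces; the `Trace` and `Span` classes and their field names are assumptions made for this example.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Span:
    """One LLM or tool call inside a session (hypothetical schema)."""
    kind: str                 # "llm" or "tool"
    name: str
    input_data: Any
    output: Any = None
    latency_s: float = 0.0
    tokens: int = 0


@dataclass
class Trace:
    session_id: str
    spans: list[Span] = field(default_factory=list)

    @contextmanager
    def span(self, kind: str, name: str, input_data: Any):
        """Time the wrapped block and append the finished span to the trace."""
        record = Span(kind=kind, name=name, input_data=input_data)
        start = time.perf_counter()
        try:
            yield record
        finally:
            record.latency_s = time.perf_counter() - start
            self.spans.append(record)


trace = Trace(session_id="demo-session")

with trace.span("tool", "lookup_order", {"order_id": "4421"}) as s:
    s.output = {"status": "delivered"}   # stand-in for a real tool result

with trace.span("llm", "draft_reply", "Summarize the order status") as s:
    s.output = "Your order was delivered."  # stand-in for a real model reply
    s.tokens = 42

for span in trace.spans:
    print(span.kind, span.name, span.tokens, f"{span.latency_s:.6f}s")
```

The `finally` clause means a span is still recorded when the wrapped call raises, which matters because failed calls are the ones you most want to see in a trace.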

Cost dashboard

Break down tokens and dollars by user, task type, and model. Without this, you cannot see the consequences of a model upgrade.
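
The aggregation behind such a dashboard can be sketched in a few lines. The model names, the per-million-token prices, and the shape of the call log below are all hypothetical placeholders; substitute your provider's real price sheet and your own log schema.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1M tokens; not any provider's real pricing.
PRICE_PER_MTOK = {
    "model-small": {"input": 0.50, "output": 1.50},
    "model-large": {"input": 5.00, "output": 15.00},
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single LLM call."""
    price = PRICE_PER_MTOK[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def cost_by(calls: list[dict], key: str) -> dict[str, float]:
    """Total dollar cost grouped by one dimension: 'user', 'task_type', or 'model'."""
    totals: dict[str, float] = defaultdict(float)
    for call in calls:
        totals[call[key]] += call_cost(
            call["model"], call["input_tokens"], call["output_tokens"]
        )
    return dict(totals)


# Hypothetical call log, e.g. extracted from traces.
calls = [
    {"user": "alice", "task_type": "refund", "model": "model-large",
     "input_tokens": 2000, "output_tokens": 500},
    {"user": "alice", "task_type": "faq", "model": "model-small",
     "input_tokens": 1000, "output_tokens": 200},
    {"user": "bob", "task_type": "refund", "model": "model-large",
     "input_tokens": 4000, "output_tokens": 1000},
]

for dimension in ("user", "task_type", "model"):
    print(dimension, cost_by(calls, dimension))
```

Grouping the same log along several dimensions is what makes a model swap visible: the per-model totals shift immediately, while per-user and per-task totals show who absorbs the change.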

Failure alerts

Send automatic Slack/PagerDuty notifications when a run hits max_iterations, a tool fails several times in a row, the agent triggers escalate_to_human, or cost spikes.
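
These four alert conditions can be expressed as one small rule function. This is a sketch under assumptions: the `RunSummary` fields, the threshold of two consecutive failures, and the 3× cost-spike factor are illustrative choices, and the actual Slack/PagerDuty delivery is left out.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    """Per-run facts pulled from a trace (hypothetical schema)."""
    iterations: int
    max_iterations: int
    tool_results: list[bool]      # True = tool call succeeded, in call order
    escalated_to_human: bool
    cost_usd: float


def alert_reasons(
    run: RunSummary,
    baseline_cost_usd: float,
    cost_spike_factor: float = 3.0,       # assumed threshold
    max_consecutive_failures: int = 2,    # assumed threshold
) -> list[str]:
    """Return the alert conditions this run triggered (empty list = healthy)."""
    reasons = []
    if run.iterations >= run.max_iterations:
        reasons.append("max_iterations hit")

    # Track the longest streak of back-to-back tool failures.
    streak = longest = 0
    for succeeded in run.tool_results:
        streak = 0 if succeeded else streak + 1
        longest = max(longest, streak)
    if longest >= max_consecutive_failures:
        reasons.append("consecutive tool failures")

    if run.escalated_to_human:
        reasons.append("escalate_to_human")
    if run.cost_usd > cost_spike_factor * baseline_cost_usd:
        reasons.append("cost spike")
    return reasons


stuck_run = RunSummary(
    iterations=10, max_iterations=10,
    tool_results=[True, False, False],
    escalated_to_human=False, cost_usd=0.20,
)
print(alert_reasons(stuck_run, baseline_cost_usd=0.05))
# ['max_iterations hit', 'consecutive tool failures', 'cost spike']
```

Returning the list of reasons, rather than a single boolean, lets the notification say which condition fired.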

A/B tests and canaries

Roll a new prompt or model out to 5% of traffic first, and watch both trajectory and outcome metrics before scaling up. LangSmith and Statsig both support this.
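
One common way to carve out a stable 5% slice is to hash the user ID into a bucket. The sketch below shows that idea only; it is not how LangSmith or Statsig implement rollouts, and the `in_canary` function, the `salt` value, and the user ID format are assumptions for illustration.

```python
import hashlib


def in_canary(user_id: str, percent: float = 5.0, salt: str = "prompt-v4") -> bool:
    """Deterministically assign a user to the canary group.

    The same user always gets the same answer for a given salt, so a
    conversation does not flip between the old and new version mid-session.
    Changing the salt reshuffles the assignment for the next experiment.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # bucket in 0..9999
    return bucket < percent * 100                        # 5% -> buckets 0..499


# Rough check of the split on synthetic user IDs.
users = [f"user-{i}" for i in range(20_000)]
share = sum(in_canary(u) for u in users) / len(users)
print(f"canary share: {share:.3f}")
```

Hashing rather than random sampling keeps the assignment reproducible, which makes it possible to compare trajectory and outcome metrics for the same users before and after the change.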

6. Visualizing an Agent Scorecard

The table below shows three agent versions scored on the same eval set; v3 is currently in production. The lesson to take away: don't look at a single metric.

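| Metric             | v1 (baseline) | v2 (more tools) | v3 (better prompt) |
|--------------------|---------------|-----------------|--------------------|
| Task Success       | 62%           | 71%             | 78%                |
| Tool Precision     | 0.65          | 0.51 ⚠          | 0.82               |
| Avg Steps          | 4.2           | 7.8 ⚠           | 3.9                |
| Cost / Task        | $0.04         | $0.11 ⚠         | $0.05              |
| Hallucination Rate | 7%            | 6%              | 2%                 |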
⚠️ Observation: v2 added more tools and task success went up, but tool precision plummeted, step count nearly doubled, and cost per task nearly tripled ($0.04 → $0.11). This is the classic cross-metric trade-off. v3 used a better prompt to sharpen tool selection: it beats v2 on every metric and beats v1 on everything except a slightly higher cost per task ($0.05 vs $0.04).
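
The "don't look at a single metric" rule can be turned into an automated gate that lists every metric on which a candidate is worse than a baseline. The sketch below uses the numbers from the scorecard; the dictionary keys, the `regressions` helper, and the choice to flag any worsening (with no tolerance band) are assumptions for illustration.

```python
# Scorecard values copied from the table; percentages expressed as fractions.
SCORECARD = {
    "v1": {"task_success": 0.62, "tool_precision": 0.65, "avg_steps": 4.2,
           "cost_per_task": 0.04, "hallucination_rate": 0.07},
    "v2": {"task_success": 0.71, "tool_precision": 0.51, "avg_steps": 7.8,
           "cost_per_task": 0.11, "hallucination_rate": 0.06},
    "v3": {"task_success": 0.78, "tool_precision": 0.82, "avg_steps": 3.9,
           "cost_per_task": 0.05, "hallucination_rate": 0.02},
}

# For the remaining metrics (steps, cost, hallucination rate), lower is better.
HIGHER_IS_BETTER = {"task_success", "tool_precision"}


def regressions(candidate: dict[str, float], baseline: dict[str, float]) -> list[str]:
    """Names of the metrics on which the candidate is worse than the baseline."""
    worse = []
    for metric, base_value in baseline.items():
        new_value = candidate[metric]
        if metric in HIGHER_IS_BETTER:
            regressed = new_value < base_value
        else:
            regressed = new_value > base_value
        if regressed:
            worse.append(metric)
    return worse


print(regressions(SCORECARD["v2"], SCORECARD["v1"]))
# ['tool_precision', 'avg_steps', 'cost_per_task']
print(regressions(SCORECARD["v3"], SCORECARD["v2"]))
# []
print(regressions(SCORECARD["v3"], SCORECARD["v1"]))
# ['cost_per_task']
```

A gate like this would have flagged v2 despite its higher task success. It also surfaces that v3 costs one cent more per task than v1, a regression a team might accept knowingly in exchange for the gains elsewhere.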

🎓 Chapter Quiz

Q1. What does SWE-bench primarily measure?

A) Image understanding
B) Fixing bugs in GitHub repos and passing the tests
C) Customer-service conversation quality
D) Creative writing
✅ B. SWE-bench consists of bug-fix tasks from real GitHub PRs and is the flagship coding-agent benchmark.

Q2. Why can't generic benchmarks be used on their own to evaluate production agents?

A) Benchmarks can't be trusted
B) There is a significant gap between lab and real deployment, and every scenario is different
C) Benchmarks have no license
D) They only look at outcomes
✅ B. You need your own eval set plus production observability to reflect real quality.

Q3. What problem do trajectory metrics solve that outcome metrics can't?

A) They fully replace outcome metrics
B) They locate "why it failed"
C) They are cheaper than outcome metrics
D) They need no labeling
✅ B. "Right answer, messy process" and "wrong answer, but close" are both visible only when you look at the trajectory.