Step 1: What is an AI Agent? — AI Agents Tutorial

Starting Point

1. An Agent Is More Than a Chatbot

"AI Agent" is one of the most overloaded buzzwords of 2026. The same label can describe a weather chatbot or a fully autonomous engineer that reads code, runs tests, and opens PRs. To avoid the confusion, return to first principles.

The Russell & Norvig textbook defines an agent as: anything that perceives its environment through sensors and acts upon it through actuators. This is broad enough to include Roombas, self-driving cars, and frogs.

When practitioners say "AI Agent" in 2026 they almost always mean a more specific thing: a system that uses an LLM as its reasoning engine, can call tools, retain memory, and plan multi-step tasks autonomously — stronger than a plain chat LLM, but more grounded than the AGI moonshot.

💡

Working definition for this tutorial: AI Agent = LLM (brain) + Tools (hands) + Memory (notebook) + Loop (planning loop). The next 11 chapters dissect these four components one by one.
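The working definition can be sketched as a small data structure. This is only an illustration under stated assumptions: `Agent`, `scripted_llm`, and the `"final:"` convention are hypothetical names invented here, and the "LLM" is a scripted stand-in rather than a real model call.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Agent:
    llm: Callable[[list[str]], str]                   # brain: reads memory, returns a decision
    tools: dict[str, Callable[[str], str]]            # hands: named actions
    memory: list[str] = field(default_factory=list)   # notebook
    max_steps: int = 5                                # hard bound on the loop

    def run(self, task: str) -> str:
        self.memory.append(f"task: {task}")
        for _ in range(self.max_steps):               # loop: decide, act, observe
            decision = self.llm(self.memory)
            if decision.startswith("final:"):         # hypothetical stop convention
                return decision.removeprefix("final:").strip()
            name, _, arg = decision.partition(" ")
            tool = self.tools.get(name)
            observation = tool(arg) if tool else f"unknown tool {name}"
            self.memory.append(f"{decision} -> {observation}")
        return "stopped: step budget exhausted"


def scripted_llm(memory: list[str]) -> str:
    """Hypothetical stand-in for a real model: one tool call, then an answer."""
    if len(memory) == 1:
        return "upper hello"
    return "final: " + memory[-1].split(" -> ")[1]


agent = Agent(llm=scripted_llm, tools={"upper": str.upper})
answer = agent.run("shout hello")
print(answer)
```

Swapping `scripted_llm` for a function that calls a real model API leaves the other three components unchanged, which is the point of separating them.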

Architecture Analysis

2. PEAS: A Universal Framework for Analyzing Any Agent

Russell & Norvig's PEAS framework is the checklist you fill in before designing or evaluating any agent. It forces four questions:

🎯

P · Performance

What counts as success: task completion rate, human satisfaction, cost, latency? Without a defined P, you cannot evaluate the agent.

🌍

E · Environment

What environment does the agent operate in: web browser, terminal, database, the real world? Is it discrete or continuous, fully or partially observable?

🦾

A · Actuators

What actions can the agent take: HTTP requests, shell commands, mouse clicks, database queries? These become the tools covered in Step 4.

📡

S · Sensors

What can the agent observe: text, images, API responses, the file system, sensor data? This determines the shape of the context.

Case study: describing a bug-fixing code agent with PEAS

PEAS	Content
P (Performance)	All unit tests pass + no new lint warnings + diff touches ≤ 5 files
E (Environment)	Local Git repo, Linux shell, CI, issue tracker
A (Actuators)	read_file, write_file, run_bash, run_pytest, open_pr
S (Sensors)	File contents, test output, CI logs, human reviewer comments
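The PEAS table above can also be written down as data, which makes the performance row checkable. A minimal sketch: the `PEAS` class and `meets_performance` function are illustrative names, not part of any library, and the thresholds simply mirror the P row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PEAS:
    performance: list[str]
    environment: list[str]
    actuators: list[str]
    sensors: list[str]


bug_fix_agent = PEAS(
    performance=["all unit tests pass", "no new lint warnings", "diff touches <= 5 files"],
    environment=["local Git repo", "Linux shell", "CI", "issue tracker"],
    actuators=["read_file", "write_file", "run_bash", "run_pytest", "open_pr"],
    sensors=["file contents", "test output", "CI logs", "human reviewer comments"],
)


def meets_performance(tests_failed: int, new_lint_warnings: int, files_changed: int) -> bool:
    """Check one run against the P row of the table."""
    return tests_failed == 0 and new_lint_warnings == 0 and files_changed <= 5


print(meets_performance(tests_failed=0, new_lint_warnings=0, files_changed=3))
```

Writing P as a function early forces the success criteria to be measurable before any agent code exists.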

Classic Taxonomy

3. Four Classic Agent Types

From simplest to most capable, classical AI sorts agents into four tiers. Knowing the ladder helps you diagnose: "Which tier is my LLM agent stuck on?"

① Simple Reflex

A pure rule system of the form "if you see X, do Y". No memory, no planning. Example: a robot vacuum that turns when it hits a wall.

② Model-Based Reflex

Keeps an internal state and uses past observations to infer parts of the world it cannot currently observe (a partially observable environment). Example: a self-driving car remembering "there was a car on the right" even after the sensors lose sight of it.
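The difference between the first two tiers fits in a few lines. This is a toy sketch: the percept strings (`"wall"`, `"car_right"`, `"right_clear"`, `"no_signal"`) and the class name are invented for illustration.

```python
def simple_reflex(percept: str) -> str:
    """Rule table only: the same percept always yields the same action."""
    return "turn" if percept == "wall" else "forward"


class ModelBasedReflex:
    """Keeps an internal state so it can act on things it no longer observes."""

    def __init__(self) -> None:
        self.car_on_right = False  # belief about the right lane

    def act(self, percept: str, want_merge_right: bool = False) -> str:
        # Update the belief only when the sensor actually reports something;
        # "no_signal" leaves the previous belief in place.
        if percept == "car_right":
            self.car_on_right = True
        elif percept == "right_clear":
            self.car_on_right = False
        if want_merge_right and self.car_on_right:
            return "wait"
        return "merge_right" if want_merge_right else "keep_lane"


car = ModelBasedReflex()
first = car.act("car_right")
second = car.act("no_signal", want_merge_right=True)   # sensor is blind, memory is not
third = car.act("right_clear", want_merge_right=True)
print(simple_reflex("wall"), first, second, third)
```

The second call is the interesting one: the percept alone carries no information about the right lane, so a simple reflex agent would have nothing to base "wait" on.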

③ Goal-Based

Has an explicit goal and uses search or planning to find a sequence of actions that achieves it. Examples: A* pathfinding, chess engines.
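A goal-based agent can be sketched with the simplest search that finds an action sequence. The example below uses breadth-first search rather than A* (no heuristic, so it is shorter to read). The grid, the wall set, and the `plan` function are assumptions made up for this illustration.

```python
from collections import deque


def plan(start, goal, walls, width, height):
    """Breadth-first search: shortest list of moves from start to goal, or None."""
    moves = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        (x, y), path = frontier.popleft()
        if (x, y) == goal:
            return path
        for name, (dx, dy) in moves.items():
            nxt = (x + dx, y + dy)
            inside = 0 <= nxt[0] < width and 0 <= nxt[1] < height
            if inside and nxt not in walls and nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [name]))
    return None  # goal unreachable


# 3x2 grid with a wall directly between start and goal
route = plan(start=(0, 0), goal=(2, 0), walls={(1, 0)}, width=3, height=2)
print(route)
```

The agent does not react to the wall; it computes the whole detour before moving, which is what separates this tier from the reflex tiers.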

④ Utility-Based and Learning

Goes beyond binary success or failure: uses a utility function to grade how good different outcomes are, and can learn from experience to improve its policy. Modern LLM agents with RL fine-tuning sit here.
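To make the contrast with the goal-based tier concrete, here is a small sketch in which two candidate plans both reach the goal but differ in utility. The weights and the outcome numbers are invented for illustration, not measured values.

```python
def utility(outcome: dict) -> float:
    """Hypothetical utility: reward passing tests, penalize cost and latency."""
    return 10.0 * outcome["tests_passed"] - 2.0 * outcome["cost_usd"] - 0.1 * outcome["latency_s"]


candidates = {
    "quick_patch":   {"tests_passed": 1, "cost_usd": 0.10, "latency_s": 5},
    "full_refactor": {"tests_passed": 1, "cost_usd": 1.50, "latency_s": 60},
    "do_nothing":    {"tests_passed": 0, "cost_usd": 0.00, "latency_s": 0},
}

# A goal-based agent sees quick_patch and full_refactor as equally good
# (both pass the tests); a utility-based agent can rank them.
best = max(candidates, key=lambda name: utility(candidates[name]))
print(best)
```

The learning half of this tier would adjust the weights, or the policy that proposes candidates, from observed outcomes; that part is not shown here.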

🧭

Where do today's LLM agents fit? Most production LLM agents sit between ③ goal-based and ④ utility-based: the user gives a natural-language goal, and the agent searches and plans with ReAct or Plan-and-Execute. Whether they "learn" depends on whether their policy updates after each interaction, and most deployed agents do not do this.

Modern Anatomy

4. The Four-Layer Anatomy of a Modern LLM Agent

Production agents in 2026 typically follow a "four-layer" architecture (Redis, Oracle, IBM all describe variants of this):

┌──────────────────────────────────────────────┐
│ 🧠 Reasoning Layer (LLM, prompts, sampling) │
├──────────────────────────────────────────────┤
│ 🎼 Orchestration Layer (Loop / Graph / DAG) │
├──────────────────────────────────────────────┤
│ 📚 Memory & Data Layer (Context / VectorDB) │
├──────────────────────────────────────────────┤
│ 🔧 Tool Layer (APIs / Shell / MCP servers) │
└──────────────────────────────────────────────┘

🧠 Reasoning Layer

The core LLM (Claude, GPT, Gemini, Llama). It reads instructions, decides the next step, and produces the final answer. Prompt engineering, chain-of-thought (CoT), and structured output all live here. See Step 2 and Step 3.

🎼 Orchestration Layer

Chains single LLM calls into multi-step workflows. The implementation can be a while loop, a directed graph (LangGraph), or role-based (CrewAI). This layer decides when to call tools and when to stop. See Step 6.
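For the directed-graph style of orchestration, a plain-Python sketch may help. It does not use LangGraph's API. `run_graph`, the node names, and the state keys are all hypothetical, and the nodes are trivial lambdas standing in for LLM or tool calls.

```python
def run_graph(nodes, edges, state, start="plan", max_hops=10):
    """nodes: name -> fn(state) -> new state.
    edges: name -> fn(state) -> next node name, or None to finish."""
    current = start
    for _ in range(max_hops):          # bound the walk so a cycle cannot run forever
        state = nodes[current](state)
        current = edges[current](state)
        if current is None:
            return state
    raise RuntimeError("graph did not terminate within max_hops")


nodes = {
    "plan":  lambda s: {**s, "todo": 2},
    "act":   lambda s: {**s, "todo": s["todo"] - 1, "log": s["log"] + ["act"]},
    "check": lambda s: s,
}
edges = {
    "plan":  lambda s: "act",
    "act":   lambda s: "check",
    "check": lambda s: "act" if s["todo"] > 0 else None,   # conditional edge: loop or stop
}

final = run_graph(nodes, edges, {"log": []})
print(final)
```

The conditional edge out of `check` is where the stop decision lives; in a while-loop implementation the same decision would be the loop condition.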

📚 Memory and Data Layer

The context window is "short-term memory"; a vector DB, knowledge graph, or SQL store is "long-term memory". RAG operates in this layer. See Step 5 and Step 7.
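The split between short-term and long-term memory can be sketched without any database. In this toy version a bounded deque stands in for the context window and a plain list stands in for the long-term store. Retrieval uses word overlap instead of embedding similarity, which is a deliberate simplification. The `Memory` class and its method names are assumptions for illustration.

```python
from collections import deque


class Memory:
    def __init__(self, window: int = 3) -> None:
        self.short_term = deque(maxlen=window)   # stands in for the context window
        self.long_term: list[str] = []           # stands in for a vector DB / SQL store

    def remember(self, text: str) -> None:
        self.short_term.append(text)             # oldest entry falls out when full
        self.long_term.append(text)

    def retrieve(self, query: str, k: int = 1) -> list[str]:
        """Toy retrieval: rank stored notes by words shared with the query."""
        q = set(query.lower().split())
        ranked = sorted(self.long_term,
                        key=lambda t: len(q & set(t.lower().split())),
                        reverse=True)
        return ranked[:k]

    def build_context(self, query: str) -> list[str]:
        """Recalled notes first, then whatever is still in the window."""
        recalled = [t for t in self.retrieve(query) if t not in self.short_term]
        return recalled + list(self.short_term)


mem = Memory(window=2)
for note in ["user prefers pytest", "repo uses Python 3.12", "CI runs on Linux", "lint with ruff"]:
    mem.remember(note)

context = mem.build_context("which test runner does the user prefer")
print(context)
```

The first note has already been pushed out of the two-item window, yet it reappears in the context because retrieval pulled it back from the long-term store.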

🔧 Tool Layer

The interface where the agent actually acts: APIs, shell, browser, robotic arms. Since Anthropic released the Model Context Protocol (MCP) in November 2024, this layer has been standardizing. See Step 4 and Step 11.

Interactive Simulation

5. Chat LLM vs Agent: A Capability Checklist

Which architecture you need depends on which of the capabilities below your use case requires:

Multi-step tasks
External tool calls
Cross-conversation memory
Autonomous stop decisions
Multi-agent collaboration

Common Myths

6. Five Common Misconceptions

Tool calling is only one of the necessary conditions for an agent. Without a planning loop, error recovery, and clear stop criteria, an LLM with many tools is still a chatbot with buttons.
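The three missing ingredients named above (a loop, error recovery, stop criteria) can be shown in one short function. This is a sketch under assumptions: `run_with_guards`, the tuple-based action format, and the scripted `decide` function are invented here, and a real agent would put a model call where `scripted_decide` is.

```python
def run_with_guards(decide, tools, task, max_steps=4):
    """decide(history) returns ("call", tool_name, arg) or ("stop", answer)."""
    history = [("task", task)]
    for step in range(max_steps):                 # stop criterion 1: step budget
        action = decide(history)
        if action[0] == "stop":                   # stop criterion 2: the policy says done
            return {"status": "done", "answer": action[1], "steps": step}
        _, name, arg = action
        try:
            observation = tools[name](arg)
        except Exception as exc:                  # error recovery: report it, do not crash
            observation = f"ERROR {type(exc).__name__}: {exc}"
        history.append((name, observation))
    return {"status": "budget_exhausted", "answer": None, "steps": max_steps}


def divide(expr: str) -> str:
    a, b = expr.split("/")
    return str(int(a) / int(b))


def scripted_decide(history):
    """Hypothetical policy: try a bad input, read the error, then retry."""
    last = history[-1]
    if last[0] == "task":
        return ("call", "divide", "1/0")
    if last[1].startswith("ERROR"):
        return ("call", "divide", "1/4")
    return ("stop", last[1])


result = run_with_guards(scripted_decide, {"divide": divide}, "compute a ratio")
print(result)
```

Because the exception is turned into an observation, the policy gets a chance to change its next action; an unknown tool name is handled the same way, since the lookup sits inside the `try`.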

Frameworks are an option, not a requirement. OpenAI, Anthropic, and Google all offer native agent SDKs, and for simple flows a plain Python while loop is enough. See Step 10.

Multi-agent architectures introduce communication overhead and error propagation. Get a single agent with a good prompt working first, and only then consider splitting it. OpenAI's 2024 Swarm report also recommends "monolith first."

Unless you do RL fine-tuning, the agent does not remember what you "taught it" last time; it only read it in context. For learning that persists, write successful trajectories to memory or update the prompt template.
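Writing successful trajectories to memory can be as simple as an append-only file that later feeds the prompt. The sketch below assumes a JSON Lines file. The function names, the record fields, and the prompt layout are all hypothetical choices made for illustration.

```python
import json
import tempfile
from pathlib import Path


def save_trajectory(store: Path, task: str, steps: list[str], success: bool) -> None:
    """Append only successful runs; failed ones are not stored."""
    if not success:
        return
    with store.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"task": task, "steps": steps}) + "\n")


def build_prompt(store: Path, task: str, limit: int = 2) -> str:
    """Prepend the most recent stored trajectories as worked examples."""
    examples = []
    if store.exists():
        lines = store.read_text(encoding="utf-8").splitlines()
        examples = [json.loads(line) for line in lines[-limit:]]
    shots = "\n".join(
        f"Past task: {e['task']}\nSteps: {' -> '.join(e['steps'])}" for e in examples
    )
    return f"{shots}\n\nNew task: {task}" if shots else f"New task: {task}"


store = Path(tempfile.mkdtemp()) / "trajectories.jsonl"
save_trajectory(store, "fix failing test",
                ["run_pytest", "read_file", "write_file", "run_pytest"], success=True)
save_trajectory(store, "rename module", ["run_bash"], success=False)

prompt = build_prompt(store, "fix flaky test")
print(prompt)
```

Nothing about the model changes here; the "learning" lives entirely in what gets written to the store and read back into the next prompt.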

Autonomy is a slider, not a switch. OWASP's Top 10 for LLM Applications lists "Excessive Agency" (LLM06 in the 2025 edition): over-broad permissions with no human checkpoint are a recurring source of production incidents. See Step 12.
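The slider idea can be made concrete with a permission gate placed in front of every tool call. This is a sketch only: the three levels, the `RISKY_TOOLS` set, and the `authorize` function are assumptions chosen for illustration, and `approve` stands in for whatever human checkpoint a real deployment uses.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    SUGGEST_ONLY = 0    # every action needs human approval
    CONFIRM_RISKY = 1   # low-risk tools run automatically, risky ones need approval
    FULL_AUTO = 2       # no human checkpoint


# Hypothetical risk classification of the tools from the PEAS case study
RISKY_TOOLS = {"run_bash", "write_file", "open_pr"}


def authorize(tool: str, level: Autonomy, approve) -> bool:
    """approve(tool) is the human checkpoint; it returns True to allow the call."""
    if level == Autonomy.FULL_AUTO:
        return True
    if level == Autonomy.CONFIRM_RISKY and tool not in RISKY_TOOLS:
        return True
    return bool(approve(tool))


def deny(tool: str) -> bool:
    return False


print(authorize("read_file", Autonomy.CONFIRM_RISKY, deny))   # allowed without asking
print(authorize("run_bash", Autonomy.CONFIRM_RISKY, deny))    # blocked by the human
```

Moving the slider is then a one-line configuration change per deployment, rather than a rewrite of the agent loop.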

Hands-On

7. A Minimal Agent: Feel the Core Loop in About 30 Lines

Below is a framework-free minimal agent, first in Python with the Anthropic SDK and then in JavaScript with the OpenAI SDK: a direct LLM API call plus two tools (one in the JavaScript variant), illustrating the "think → act → observe → think" loop. Steps 4 and 6 go deeper.

# pip install anthropic requests
from anthropic import Anthropic
import math, requests

client = Anthropic()

# 1) Define tools the agent can call
TOOLS = [{
  "name": "calculator",
  "description": "Evaluate a math expression. Input: a Python-safe expression string.",
  "input_schema": {"type":"object","properties":{"expr":{"type":"string"}},"required":["expr"]}
},{
  "name": "web_get",
  "description": "HTTP GET a URL and return text.",
  "input_schema": {"type":"object","properties":{"url":{"type":"string"}},"required":["url"]}
}]

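def run_tool(name, args):
    # Demo only: eval with emptied builtins is NOT a sandbox.
    if name == "calculator": return str(eval(args["expr"], {"__builtins__":{}}, vars(math)))
    # Time out so a slow host cannot hang the loop; truncate to keep the context small.
    if name == "web_get":  return requests.get(args["url"], timeout=10).text[:2000]
    return "unknown tool"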

# 2) The core agent loop
messages = [{"role":"user","content":"What is 2026's most-cited AI paper, and is its DOI a prime?"}]
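while True:
    resp = client.messages.create(model="claude-sonnet-4-6", max_tokens=1024, tools=TOOLS, messages=messages)
    # Continue only when the model asked for a tool; any other stop reason
    # (end_turn, max_tokens, ...) ends the loop instead of sending an empty turn.
    if resp.stop_reason != "tool_use": break
    messages.append({"role":"assistant","content":resp.content})
    tool_results = []
    for block in resp.content:
        if block.type == "tool_use":
            result = run_tool(block.name, block.input)
            tool_results.append({"type":"tool_result","tool_use_id":block.id,"content":result})
    messages.append({"role":"user","content":tool_results})

# Print all text blocks of the final response (the last block is not guaranteed to be text)
print("".join(block.text for block in resp.content if block.type == "text"))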

// npm install openai
import OpenAI from "openai";
const client = new OpenAI();

const tools = [{
  type: "function",
  function: {
    name: "calculator",
    description: "Evaluate a math expression.",
    parameters: { type:"object", properties:{expr:{type:"string"}}, required:["expr"] }
  }
}];

async function runTool(name, args) {
  if (name === "calculator") return String(eval(args.expr));
  return "unknown";
}

let messages = [{ role:"user", content:"Compute 17 * 23 then check primality of the result." }];
while (true) {
  const resp = await client.chat.completions.create({
    model:"gpt-4.1", messages, tools, tool_choice:"auto"
  });
  const msg = resp.choices[0].message;
  messages.push(msg);
  if (!msg.tool_calls) break;
  for (const tc of msg.tool_calls) {
    const out = await runTool(tc.function.name, JSON.parse(tc.function.arguments));
    messages.push({ role:"tool", tool_call_id:tc.id, content:out });
  }
}
console.log(messages.at(-1).content);

⚠️

Note: The eval() calls are for demonstration only. Never pass user or model-generated input straight to eval() in production; emptying __builtins__ in the Python version does not make it a sandbox. See the security chapter, Step 12.
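One common alternative to eval() for a calculator tool is to parse the expression with Python's standard `ast` module and evaluate only a whitelist of node types. The sketch below is one possible shape for that, not the tutorial's own implementation. `safe_calc` is a hypothetical name, and exponentiation is left out on purpose because very large powers can exhaust CPU or memory.

```python
import ast
import operator

# Whitelisted operators; anything else (calls, names, attributes) is rejected.
_BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
            ast.Mult: operator.mul, ast.Div: operator.truediv, ast.Mod: operator.mod}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_calc(expr: str):
    """Evaluate +, -, *, /, % on numeric literals; raise ValueError otherwise."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return walk(ast.parse(expr, mode="eval"))


print(safe_calc("17 * 23"))
```

Function calls, attribute access, and names never reach an evaluator at all, so an input such as `__import__('os').system('ls')` is rejected with a `ValueError` instead of being executed.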

🎓 Chapter Quiz (3 questions)

Q1. Which of the following is not a required component of an AI Agent in our working definition?

A) LLM (reasoning engine)

B) Tools

C) GPU (hardware accelerator)

D) Loop (planning loop)

✅ C. A GPU is infrastructure, not a conceptual agent component. The core is LLM + Tools + Memory + Loop.

Q2. What does the "P" in PEAS stand for?

A) Planning

B) Performance measure

C) Prompt

D) Python

✅ B. PEAS = Performance, Environment, Actuators, Sensors.

Q3. Why is the claim "an LLM with five tools is an agent" inaccurate?

A) An agent needs at least ten tools

B) Tools are only one necessary condition; a planning loop and stop criteria are also required

C) Agents cannot use tools

D) Tools only count if they use MCP

✅ B. Without a planning loop, error recovery, and explicit stop criteria, a multi-tool LLM is still just a chatbot.
