1. Why does it have to be an LLM?
Before 2020, building an agent that "understands natural-language instructions" meant combining an intent classifier, slot filling, and a rule engine (Dialogflow and Rasa are the typical examples). Every new intent required relabeling data, retraining, and redeploying, so these systems scaled poorly.
Large language models (LLMs) changed the rules: a single model, configured only by prompts, can handle open-domain tasks. An LLM is not just a powerful semantic parser; it is a general reasoning engine that can translate a natural-language goal into a concrete sequence of actions. That is why modern agents generally put an LLM at the core.
LLMs are not magic, though. To use one as the brain of an agent, you must understand what it can do, what it cannot do, and when it fails. This chapter is that blueprint.
2. The core of an LLM: next-token prediction
Despite having hundreds of billions of parameters, an LLM does essentially one thing: given all preceding tokens, it predicts a probability distribution over the next token.
Think of it as an extremely well-read autocomplete. It does not "think" in any literal sense, but because the training corpus contains an enormous number of examples of human reasoning, it has learned a statistical process that looks like reasoning.
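To make "predict a distribution over the next token, then repeat" concrete, here is a minimal sketch. It assumes a toy bigram counter in place of a real neural network, so it conditions only on the last token rather than the whole context; the corpus and every function name are hypothetical.

```python
# Toy stand-in for an LLM: bigram counts from a tiny, made-up corpus.
CORPUS = "the cat sat on the mat . the cat ate the fish .".split()


def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = {}
    for prev, nxt in zip(tokens, tokens[1:]):
        counts.setdefault(prev, {})
        counts[prev][nxt] = counts[prev].get(nxt, 0) + 1
    return counts


def next_token_distribution(counts, context):
    """Return P(next token | context). This toy only looks at the last token."""
    followers = counts.get(context[-1], {})
    total = sum(followers.values())
    return {tok: n / total for tok, n in followers.items()}


def generate(counts, prompt, max_new_tokens=5):
    """Autoregressive loop: predict, append the chosen token, predict again."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(counts, tokens)
        if not dist:
            break  # no known continuation
        tokens.append(max(dist, key=dist.get))  # greedy choice
    return tokens


counts = train_bigram(CORPUS)
print(next_token_distribution(counts, ["the"]))
print(generate(counts, ["on"], max_new_tokens=2))
```

A real LLM replaces the lookup table with a network that conditions on the entire context, but the outer loop has the same shape.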
3. Tokenization: your prompt is not made of words
An LLM sees neither characters nor words. It sees tokens: sub-units that sit between characters and words. Common tokenizers are based on BPE (Byte-Pair Encoding) or SentencePiece.
- English: 1 token ≈ 0.75 words ("hello world" = 2 tokens)
- Chinese: 1 Han character is often 1–3 tokens (depending on the tokenizer)
- Code: indentation and symbols often count as 1 token each
- Emoji: 1 emoji is typically 2–5 tokens
Why does this matter? Because API pricing, context window limits, and latency are all measured in tokens. The same passage in Chinese often consumes 2–3× as many tokens as in English, which means 2–3× the cost.
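The sub-word units mentioned above come from merging frequent symbol pairs. The following is a simplified sketch of BPE training on a made-up word-frequency table. Production tokenizers typically work on bytes, use far larger corpora, and apply extra pre-tokenization rules, so treat the function names and data here as illustrative assumptions only.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Start from single characters and apply `num_merges` greedy merges."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


# Hypothetical word frequencies, chosen so that "est" becomes one token.
merges, words = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 2)
print(merges)
print(words)
```

After two merges the frequent suffix "est" is a single symbol, while rarer sequences are still split into characters. This is why common English words tend to map to one token and less common strings map to several.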
Rough estimate of the token count for a piece of text: about 4 characters per token for English, and about 2 tokens per Chinese character.
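The rough estimate above can be written as a small helper. This is a heuristic sketch, not a real tokenizer: it assumes that "Chinese character" means the basic CJK Unified Ideographs range, and both function names and the price argument are hypothetical. For billing-grade counts, use the tokenizer of the model you actually call.

```python
import math


def estimate_tokens(text: str) -> int:
    """Heuristic only: 2 tokens per Han character, 1 token per 4 other characters."""
    han = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    other = len(text) - han
    return han * 2 + math.ceil(other / 4)


def estimate_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Convert a token count into dollars at a given per-million-token price."""
    return tokens * usd_per_million_tokens / 1_000_000


print(estimate_tokens("hello world"))
print(estimate_tokens("你好世界"))
```

Note that the heuristic gives 3 for "hello world", while the tokenizer figure quoted earlier is 2. That gap is expected for a rule of thumb and is a reason to use it only for budgeting.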
4. Context window: the cap on an agent's "short-term memory"
The context window is the maximum number of tokens an LLM can read at once. It has to hold the system prompt, the chat history, the tool descriptions, the tool results, and your current question.
Approximate ranges for flagship models in 2026:
- Claude Sonnet 4.6: 200K tokens (roughly a short novel)
- GPT-4.1 / 5: 1M tokens (flagship variant)
- Gemini 2.5 Pro: 2M tokens
- Open-source Llama / Mistral: typically 32K – 128K
But "fits" does not mean "effective." The "lost in the middle" phenomenon describes how, as the context approaches the limit, the model's recall of information in the middle drops noticeably. In practice, an agent must actively trim, summarize, and retrieve to keep the effective context short.
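One simple way to trim is to always keep the system prompt and then add history from newest to oldest until a token budget is exhausted. The sketch below assumes that policy; `trim_history`, the budget, and the whitespace-based `count_words` stand-in for a tokenizer are all hypothetical.

```python
def trim_history(system_prompt, messages, budget, count_tokens):
    """Keep the system prompt plus the most recent messages that fit in `budget`."""
    used = count_tokens(system_prompt)
    kept = []
    for msg in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(msg)
        if used + cost > budget:
            break  # this message and everything older is dropped
        kept.append(msg)
        used += cost
    return [system_prompt] + kept[::-1]  # restore chronological order


def count_words(text):
    """Stand-in token counter: one token per whitespace-separated word."""
    return len(text.split())


history = ["a b c", "d e", "f g h i"]
print(trim_history("be brief", history, budget=8, count_tokens=count_words))
```

Dropping old turns outright loses information. A common refinement is to replace the dropped prefix with a short summary, or to move it into a store the agent can retrieve from.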
5. Temperature and sampling: controllable randomness
An LLM outputs a probability distribution, so we can sample from it with different strategies:
- Greedy: always pick the highest-probability token. Deterministic, but prone to repetitive, monotonous output.
- Temperature (T): flattens (T>1) or sharpens (T<1) the distribution. T=0 ≈ greedy; T=1 is the raw distribution; T=2 is very random.
- top-p (nucleus sampling): sample only from the smallest set of tokens whose cumulative probability is ≥ p, which trims the long tail dynamically.
- top-k: sample only from the k highest-probability tokens.
Rule of thumb: use 0–0.3 for factual tasks and tool calling; use 0.7–1.2 for creative writing.
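The sampling strategies in this section can be sketched in plain Python. This is an illustrative implementation over a hand-written list of logits, not any provider's actual decoding code; the function names are hypothetical, and temperature 0 is treated as greedy by convention because dividing by zero is undefined.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Divide logits by T, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; handle T=0 as greedy")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def _renormalize(probs, keep):
    """Zero out every index not in `keep` and rescale the rest to sum to 1."""
    keep = set(keep)
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability is >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalize(probs, keep)


def sample(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Return the index of the sampled token."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])  # greedy
    probs = softmax_with_temperature(logits, temperature)
    if top_k is not None:
        probs = top_k_filter(probs, top_k)
    if top_p is not None:
        probs = top_p_filter(probs, top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Printing the three distributions shows the top token's share shrinking as T rises, which is the flattening effect described in the list above.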
6. Mainstream LLMs in 2026: comparison table
| Model | Context | Strengths | Price / 1M tokens (input / output) | Agent fit |
|---|---|---|---|---|
| Claude Sonnet 4.6 | 200K | Strong reasoning, tool use, code | $3 / $15 | ★★★★★ top pick for engineering agents |
| Claude Opus 4.6 | 200K | Top reasoning, complex planning | $15 / $75 | ★★★★★ high-difficulty planning tasks |
| Claude Haiku 4.5 | 200K | Cheap and fast, good for routing | $0.8 / $4 | ★★★★ subtasks / router |
| GPT-4.1 | 1M | Very long context, strong tool use | $2 / $8 | ★★★★★ document-heavy tasks |
| GPT-5 mini | 400K | Cheap, decent tool use | $0.5 / $2 | ★★★★ high-volume apps |
| Gemini 2.5 Pro | 2M | Largest context, multimodal | $1.25 / $10 | ★★★★ long video / documents |
| Llama 4 70B (open) | 128K | Self-hosted, private | Your own GPU cost | ★★★ on-prem / medical |
7. Four LLM failure modes every agent developer must know
🌀 Hallucination
LLMs confidently invent non-existent facts, APIs, and package names. This is especially serious in agents, which may "call" a tool that does not exist at all. Mitigation: validate tool calls against a schema, use structured output, and ground answers with RAG.
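As a sketch of the schema-validation idea, the snippet below checks a model's raw tool call against a small registry before anything is dispatched. The registry, the tool names, and the `{"name": ..., "arguments": ...}` shape are assumptions for illustration; a production system would more likely use a full JSON Schema validator.

```python
import json

# Hypothetical tool registry: tool name -> required argument names and types.
TOOLS = {
    "get_weather": {"city": str},
    "search_docs": {"query": str, "limit": int},
}


def validate_tool_call(call, tools=TOOLS):
    """Return a list of problems; an empty list means the call may be dispatched."""
    name = call.get("name")
    if name not in tools:
        return [f"unknown tool: {name!r}"]
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]
    schema = tools[name]
    errors = []
    for key, expected in schema.items():
        if key not in args:
            errors.append(f"missing argument: {key}")
        elif not isinstance(args[key], expected):
            errors.append(f"{key} should be {expected.__name__}")
    for key in args:
        if key not in schema:
            errors.append(f"unexpected argument: {key}")
    return errors


def check_model_output(raw):
    """Parse the model's raw text as JSON, then validate it as a tool call."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["tool call must be a JSON object"]
    return validate_tool_call(call)


print(check_model_output('{"name": "book_flight", "arguments": {}}'))
```

Returning the error list to the model as an observation, instead of raising, gives it a chance to correct the call on the next step.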
♾️ Infinite loops
The agent repeatedly calls the same failing tool, or oscillates between two actions. Mitigation: cap max_iterations, detect repeated actions, and force a reflection step.
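A minimal sketch of the first two guards follows. It assumes the agent is driven by two hypothetical callbacks, `choose_action` (the LLM) and `execute` (the tool runner), and that actions are comparable values. Forced reflection is not implemented here.

```python
def run_agent(choose_action, execute, max_iterations=10, max_repeats=2):
    """Run an agent loop with an iteration cap and repeated-action detection."""
    history = []  # list of (action, observation) pairs
    for _ in range(max_iterations):
        action = choose_action(history)
        if action == "finish":
            return "finished", history
        recent = [past_action for past_action, _ in history[-max_repeats:]]
        if len(recent) == max_repeats and all(a == action for a in recent):
            # The same action is about to run again after max_repeats in a row.
            return "stopped: repeated action", history
        history.append((action, execute(action)))
    return "stopped: max_iterations", history


# A stuck agent that keeps retrying the same failing tool.
status, history = run_agent(lambda h: "search", lambda a: "error")
print(status, len(history))
```

The repeat check only catches identical consecutive actions. An agent that alternates between two actions slips past it and is stopped by the `max_iterations` cap instead, so both guards are worth keeping.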
⏱️ Latency and cost blow-up
A 10-step ReAct loop means 10 LLM calls. Because the context snowballs, later steps may cost 5× as much as early ones. Mitigation: summarize, use a sliding window, and switch to a cheaper model for sub-steps.
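The snowball effect is easy to check with arithmetic. The sketch below uses made-up numbers: a 1,000-token starting context, 500 tokens added per step, and an input price of $3 per million tokens. It counts input tokens only and ignores output tokens and caching.

```python
def react_loop_costs(base_tokens, tokens_added_per_step, steps, usd_per_million_input):
    """Input cost of each step when every step re-sends the whole growing context."""
    costs = []
    for step in range(steps):
        context_tokens = base_tokens + step * tokens_added_per_step
        costs.append(context_tokens * usd_per_million_input / 1_000_000)
    return costs


costs = react_loop_costs(1_000, 500, steps=10, usd_per_million_input=3.0)
print(f"first step: ${costs[0]:.4f}, last step: ${costs[-1]:.4f}")
print(f"last / first: {costs[-1] / costs[0]:.1f}x, total: ${sum(costs):.4f}")
```

Under these assumptions the last step reads 5,500 tokens against 1,000 for the first, so it costs 5.5 times as much, which is in line with the "5×" figure above.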
📉 Lost in the middle
Recall drops for information in the middle of a long context. By turn 5 the agent may "forget" the system instructions from turn 1. Mitigation: place critical instructions at both the start and the end, restate them periodically, and give the agent a tool for reviewing its history.
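The "start and end, restated periodically" advice can be sketched as a prompt-assembly helper. The `RULES:` and `REMINDER:` labels, the function name, and the every-4-turns interval are arbitrary assumptions; real chat APIs take structured message lists, so adapt the idea instead of copying the string format.

```python
def build_prompt(critical_rules, history, user_message, restate_every=4):
    """Put critical rules at the start and end, and restate them every few turns."""
    parts = ["RULES:\n" + critical_rules]
    for i, turn in enumerate(history, start=1):
        parts.append(turn)
        if i % restate_every == 0:
            parts.append("REMINDER:\n" + critical_rules)  # periodic restatement
    parts.append(user_message)
    parts.append("REMINDER:\n" + critical_rules)  # closing restatement
    return "\n\n".join(parts)


rules = "Never delete files without confirmation."
turns = [f"turn {n}" for n in range(1, 6)]
print(build_prompt(rules, turns, "Clean up the temp folder."))
```

Restating rules costs tokens on every call, so this trades a little context budget for better adherence. Keeping the rules short makes that trade cheaper.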
🎓 Chapter quiz (3 questions)
Q1. Which of the following best describes "temperature = 0"?
Q2. Why should an agent actively summarize the conversation history?
Q3. What is the most direct reason that the same passage in Chinese consumes 2–3× as many tokens as in English?