STEP 2 / 12 · Fundamentals

The LLM as the Brain

Tokens, context window, temperature, sampling: the key properties and limits of the agent's reasoning engine.

1. Why does it have to be an LLM?

Before 2020, building an agent that "understands natural-language instructions" meant combining an intent classifier, slot filling, and a rule engine (Dialogflow and Rasa are the typical examples). Every new intent required relabeling data, retraining, and redeploying, so these systems scaled poorly.

Large language models (LLMs) changed the rules: one model, steered by prompts alone, can handle open-domain tasks. An LLM is not just a strong semantic parser; it is a general reasoning engine that can translate a natural-language goal into a concrete sequence of actions. That is why modern agents generally put an LLM at the core.

LLMs are not magic, though. To use one as the brain of an agent you must understand what it can do, what it cannot do, and when it fails. This chapter is that blueprint.

2. The core of an LLM: next-token prediction

Despite having hundreds of billions of parameters, an LLM does exactly one thing: given all preceding tokens, it predicts a probability distribution over the next token.

Think of it as an extremely well-read autocomplete. It does not "think" in any literal sense, but because the training corpus contains enormous amounts of human reasoning, it has learned a statistical process that looks like reasoning.

[ "The", " capital", " of", " France", " is" ]  →  P(next | context)
  ├─ " Paris"     p = 0.93
  ├─ " a"         p = 0.03
  ├─ " located"   p = 0.02
  └─ ...others    p = 0.02
🔍
Key insight: the entire agent stack sits on top of this next-token probability. Tool calls, JSON outputs, ReAct reasoning: all of them are just tokens. Every prompt engineering technique is, at heart, a way of nudging the token distribution toward what you want.
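To make the diagram above concrete, here is a minimal sketch of how raw model scores (logits) become a next-token distribution via softmax. The logit values are invented for illustration and were chosen only so that the result resembles the diagram; a real model produces scores over its whole vocabulary.

```python
import math

def softmax(logits):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - top) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical logits for the context "The capital of France is".
logits = {" Paris": 6.0, " a": 2.6, " located": 2.2, " the": 2.0}

probs = softmax(logits)
next_token = max(probs, key=probs.get)  # greedy choice: most likely token

for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tok!r:12} p = {p:.2f}")
print("next token:", repr(next_token))
```

Generation is this step in a loop: append the chosen token to the context and predict again.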

3. Tokenization: your prompt is not made of words

An LLM sees neither characters nor words. It sees tokens, sub-units that sit between characters and words, typically produced by a BPE (Byte-Pair Encoding) tokenizer or a SentencePiece model.

  • English: 1 token ≈ 0.75 words ("hello world" = 2 tokens)
  • Chinese: 1 Han character is often 1–3 tokens (depends on the tokenizer)
  • Code: symbols and runs of indentation often take a token each
  • Emoji: 1 emoji is usually 2–5 tokens

Why does this matter? Because API pricing, context-window limits, and latency are all measured in tokens. The same paragraph in Chinese can consume 2–3× as many tokens as its English equivalent, which means 2–3× the cost.

For a quick estimate without a tokenizer, a rough rule is about 4 characters per token for English and about 2 tokens per Chinese character.
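The rough estimate above (about 4 characters per token for English, about 2 tokens per Chinese character, priced at $3 per million tokens) can be sketched as follows. This is a heuristic only, not a real tokenizer: the function names are hypothetical, and it treats only the basic CJK Unified Ideographs range as "Chinese".

```python
import math

def estimate_tokens(text: str) -> int:
    """Rough token estimate: 2 per Han character, 1 per 4 other characters."""
    han = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    other = len(text) - han
    return han * 2 + math.ceil(other / 4)

def estimate_cost(tokens: int, usd_per_million: float = 3.0) -> float:
    """Cost in USD at a flat per-million-token price (an assumed rate)."""
    return tokens * usd_per_million / 1_000_000

for sample in ["hello world", "你好,世界"]:
    n = estimate_tokens(sample)
    print(f"{sample!r}: {len(sample)} chars, ~{n} tokens, ${estimate_cost(n):.6f}")
```

The heuristic overestimates short English strings ("hello world" comes out as 3 rather than 2), so use a real tokenizer whenever exact budgets matter.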

4. Context window: the upper bound on an agent's "short-term memory"

The context window is the maximum number of tokens an LLM can read at once. It has to hold everything: system prompt + chat history + tool descriptions + tool results + your current question.

Approximate ranges for 2026 flagship models:

  • Claude Sonnet 4.6: 200K tokens (roughly a short novel)
  • GPT-4.1 / 5: 1M tokens (flagship variant)
  • Gemini 2.5 Pro: 2M tokens
  • Open-source Llama / Mistral: typically 32K – 128K

But "fits" does not mean "effective." The "lost in the middle" phenomenon shows that in long contexts, models recall information placed in the middle noticeably worse than information near the start or the end. In practice, agents must actively trim, summarize, and retrieve to keep the effective context short.

⚠️
Rule of thumb: even with 1M tokens available, try to keep the prompt + history you actually use under about 30K. Beyond that, quality drops noticeably and cost climbs steeply. For long conversations, summarize proactively or retrieve with RAG.
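One simple way to enforce such a budget is to keep the system prompt and only the most recent messages that still fit. The sketch below assumes messages are plain strings and uses a crude 4-characters-per-token counter; `trim_history` and `rough_count` are hypothetical names, and a production agent would plug in a real tokenizer and likely summarize what it drops.

```python
def rough_count(text: str) -> int:
    """Crude token estimate (about 4 characters per token); an assumption."""
    return max(1, len(text) // 4)

def trim_history(system_prompt, messages, budget, count_tokens=rough_count):
    """Keep the system prompt plus the newest messages that fit the budget."""
    used = count_tokens(system_prompt)
    kept = []
    for msg in reversed(messages):      # walk from newest to oldest
        cost = count_tokens(msg)
        if used + cost > budget:
            break                       # everything older is dropped
        kept.append(msg)
        used += cost
    kept.reverse()                      # restore chronological order
    return [system_prompt] + kept

history = ["a" * 40, "b" * 40, "c" * 40]   # about 10 tokens each
context = trim_history("s" * 8, history, budget=25)
print(len(context), "items kept")
```

Dropping the oldest turns first is the simplest policy; replacing them with a summary preserves more information at the price of an extra LLM call.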

5. Temperature and sampling: controllable randomness

An LLM outputs a probability distribution, so the next token can be drawn from it with different strategies:

  • Greedy: always pick the highest-probability token. Deterministic, but prone to repetitive, monotonous output.
  • Temperature (T): divides the logits by T before the softmax, which flattens (T>1) or sharpens (T<1) the distribution. T=0 ≈ greedy; T=1 is the raw distribution; T=2 is very random.
  • top-p (nucleus sampling): sample only from the smallest set of tokens whose cumulative probability is ≥ p, which trims the long tail dynamically.
  • top-k: sample only from the k highest-probability tokens.
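The strategies listed above can be sketched in a few lines of plain Python. This is an illustrative implementation over a toy four-token vocabulary with made-up logits, not any provider's actual sampler; real APIs apply these filters over the full vocabulary and may combine them in a different order.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Divide logits by T before softmax: T<1 sharpens, T>1 flattens."""
    if temperature <= 0:
        # Treat T=0 as greedy: all probability on the top-scoring token.
        best = max(logits, key=logits.get)
        return {tok: 1.0 if tok == best else 0.0 for tok in logits}
    scaled = {tok: score / temperature for tok, score in logits.items()}
    top = max(scaled.values())
    exps = {tok: math.exp(s - top) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def renormalize(probs):
    total = sum(probs.values())
    return {tok: p / total for tok, p in probs.items()}

def top_k(probs, k):
    """Keep only the k most likely tokens."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return renormalize(dict(ranked[:k]))

def top_p(probs, p):
    """Keep the smallest set of tokens whose cumulative probability >= p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = {}, 0.0
    for tok, prob in ranked:
        kept[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    return renormalize(kept)

def sample(probs, rng=random):
    """Draw one token according to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

# Hypothetical logits for the context "The capital of France is".
logits = {" Paris": 6.0, " a": 2.6, " located": 2.2, " the": 2.0}
raw = softmax_with_temperature(logits, 1.0)
sharp = softmax_with_temperature(logits, 0.5)
flat = softmax_with_temperature(logits, 2.0)
print(sample(top_p(flat, 0.9)))
```

Note how top-p adapts to the shape of the distribution: with the raw distribution a single token already covers p = 0.9, while after flattening with T=2 several tokens are needed.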

[Interactive demo: how temperature reshapes the token distribution (green = sampled token)]

Rule of thumb: use a temperature of 0–0.3 for factual tasks and tool calling, and 0.7–1.2 for creative writing.

6. Mainstream LLMs in 2026: comparison table

Model | Context | Strengths | Price / 1M tok (input / output) | Agent fit
Claude Sonnet 4.6 | 200K | Strong reasoning, tool use, code | $3 / $15 | ★★★★★ top pick for engineering agents
Claude Opus 4.6 | 200K | Top reasoning, complex planning | $15 / $75 | ★★★★★ high-difficulty planning
Claude Haiku 4.5 | 200K | Cheap and fast, good for routing | $0.8 / $4 | ★★★★ subtasks / router
GPT-4.1 | 1M | Huge context, strong tool use | $2 / $8 | ★★★★★ document-heavy tasks
GPT-5 mini | 400K | Cheap, decent tool use | $0.5 / $2 | ★★★★ high-volume apps
Gemini 2.5 Pro | 2M | Largest context, multimodal | $1.25 / $10 | ★★★★ long video / docs
Llama 4 70B (open) | 128K | Self-hosted, private | your own GPU cost | ★★★ on-prem / medical
💡
Selection advice: not every subtask needs the flagship model. A common pattern is "Router + Workhorse + Bulk executor": Haiku/mini for routing, Sonnet/4.1 for core reasoning, and Llama for high-volume, low-risk batch jobs.
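The "Router + Workhorse + Bulk executor" pattern can be sketched as a small dispatch function. Everything here is hypothetical: the task labels, the tier names, and the placeholder model identifiers are assumptions for illustration, and in a real system the routing decision is often made by a small model rather than by fixed rules.

```python
from dataclasses import dataclass

@dataclass
class Task:
    kind: str                # "route", "reason", or "batch" (assumed labels)
    high_risk: bool = False

# Placeholder identifiers; substitute the models you actually deploy.
TIERS = {
    "router": "small-fast-model",
    "workhorse": "mid-tier-model",
    "bulk": "self-hosted-model",
}

def pick_tier(task: Task) -> str:
    """Send cheap work to cheap tiers; default to the workhorse."""
    if task.kind == "route":
        return "router"
    if task.kind == "batch" and not task.high_risk:
        return "bulk"
    return "workhorse"

for task in [Task("route"), Task("reason"), Task("batch"), Task("batch", True)]:
    tier = pick_tier(task)
    print(task, "->", tier, f"({TIERS[tier]})")
```

Defaulting to the workhorse tier means an unrecognized or risky task is never silently handed to the cheapest model.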

7. Four LLM failure modes every agent developer must know

🌀 Hallucination

LLMs confidently invent facts, APIs, and package names that do not exist. In agents this is especially serious, because the model will "call" tools that do not exist at all. Mitigation: validate every tool call against a schema, use structured output, and ground answers with RAG.
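A minimal sketch of the validation step: before executing anything, check the model's tool call against a registry of known tools and their argument types. The registry, the tool names, and the `{"tool": ..., "args": ...}` message shape are assumptions made for this example; a production system would more likely use full JSON Schema validation.

```python
import json

# Hypothetical tool registry: tool name -> required arguments and types.
TOOLS = {
    "get_weather": {"city": str},
    "search_docs": {"query": str, "limit": int},
}

def validate_tool_call(raw: str):
    """Return (ok, message) for a JSON tool call emitted by the model."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"
    if not isinstance(call, dict):
        return False, "tool call must be a JSON object"
    name = call.get("tool")
    if not isinstance(name, str) or name not in TOOLS:
        return False, f"unknown tool: {name!r}"   # hallucinated tool name
    args = call.get("args", {})
    if not isinstance(args, dict):
        return False, "args must be a JSON object"
    schema = TOOLS[name]
    for key, expected in schema.items():
        if key not in args:
            return False, f"missing argument: {key}"
        if not isinstance(args[key], expected):
            return False, f"argument {key} must be {expected.__name__}"
    extra = set(args) - set(schema)
    if extra:
        return False, f"unexpected arguments: {sorted(extra)}"
    return True, "ok"

print(validate_tool_call('{"tool": "get_weather", "args": {"city": "Paris"}}'))
print(validate_tool_call('{"tool": "send_email", "args": {}}'))
```

Feeding the error message back to the model as the tool result gives it a chance to correct the call instead of failing silently.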

♾️ Infinite loops

The agent repeatedly calls the same failing tool, or oscillates between two actions. Mitigation: set a max_iterations cap, detect repeated identical actions, and force a reflection step.
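Two of these guards, the iteration cap and repeat-action detection, can be sketched as a wrapper around the agent loop. The `step` callback and its `(action, payload)` return shape are assumptions for this example, with a plain function standing in for the LLM call; payloads must be hashable here.

```python
def run_agent(step, max_iterations=10, repeat_limit=3):
    """Run step(i) until it returns ("done", result) or a guard trips."""
    history = []
    for i in range(max_iterations):
        action, payload = step(i)
        if action == "done":
            return "done", payload
        history.append((action, payload))
        recent = history[-repeat_limit:]
        # Same action with the same payload repeat_limit times in a row.
        if len(recent) == repeat_limit and len(set(recent)) == 1:
            return "stopped", f"repeated {action!r} {repeat_limit} times"
    return "stopped", "max_iterations reached"

# A stand-in "agent" that keeps retrying the same failing call.
print(run_agent(lambda i: ("call_api", "GET /orders")))

# A stand-in "agent" that oscillates between two actions.
print(run_agent(lambda i: ("a" if i % 2 == 0 else "b", ""), max_iterations=6))
```

The repeat check only catches identical consecutive actions; oscillation between two actions is caught by the iteration cap, or could be detected by also comparing against the action two steps back.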

⏱️ Latency and cost blow-up

A 10-step ReAct loop means 10 LLM calls. Because each step appends to the context, the context snowballs, and the last steps can cost 5× as much as the first. Mitigation: summarize, use a sliding window, and hand sub-steps to a cheaper model.
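The snowball effect is easy to see with a little arithmetic. The sketch below assumes a 2,000-token starting context that grows by 1,000 tokens per step (both numbers are invented for illustration) and counts only input tokens, ignoring output tokens and any prompt caching.

```python
def react_input_tokens(steps, base_tokens, tokens_per_step):
    """Input tokens sent on each call when every step appends to the context."""
    return [base_tokens + i * tokens_per_step for i in range(steps)]

# Assumed sizes: 2,000-token initial prompt, +1,000 tokens per step.
per_call = react_input_tokens(steps=10, base_tokens=2000, tokens_per_step=1000)
total = sum(per_call)
ratio = per_call[-1] / per_call[0]

print("input tokens per call:", per_call)
print("total input tokens:", total)
print(f"last call vs first call: {ratio:.1f}x")
```

Under these assumptions the total grows quadratically with the number of steps, which is why summarizing or windowing the history pays off quickly in long loops.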

📉 Lost in the middle

Recall drops for information in the middle of a long context. By turn 5 the agent may "forget" the system instructions from turn 1. Mitigation: place critical instructions at both the start and the end, restate them periodically, and give the agent a tool for reviewing its history.
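Placing critical rules at both ends of the prompt can be as simple as the following sketch. The section labels ("RULES", "REMINDER", "QUESTION") and the function name are hypothetical; the only idea being illustrated is that the rules appear before and after the long middle section.

```python
def build_prompt(rules: str, history: list[str], question: str) -> str:
    """Assemble a prompt with the critical rules at the start and near the end."""
    parts = [f"RULES:\n{rules}"]
    parts.extend(history)                       # the long middle section
    parts.append(f"REMINDER (same rules as above):\n{rules}")
    parts.append(f"QUESTION:\n{question}")
    return "\n\n".join(parts)

prompt = build_prompt(
    rules="Never call delete_file without user confirmation.",
    history=[f"turn {i}: ..." for i in range(1, 6)],
    question="Clean up the temp directory.",
)
print(prompt)
```

Repeating the rules costs a few extra tokens per call, which is usually a small price compared with an agent that ignores a safety instruction.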

🎓 Chapter quiz (3 questions)

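Q1. Which best describes "temperature = 0"?

A) The model produces no output at all
B) Close to greedy sampling; output tends to be deterministic
C) Output is most random
D) The LLM is skipped and rules are used instead
✅ Answer: B. T=0 sharpens the distribution to the extreme, so the model almost always picks the highest-probability token.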

Q2. Why should an agent actively summarize its conversation history?

A) Tokens will always run out
B) An overlong context causes lost-in-the-middle and runaway cost
C) To make the LLM smarter
D) To comply with GDPR
✅ Answer: B. Degraded recall and the snowballing token cost of each call are the two main reasons.

Q3. What is the most direct reason that "the same passage in Chinese consumes 2–3× as many tokens as in English"?

A) Chinese models are less capable
B) BPE tokenizers split Chinese characters more finely
C) Chinese prompts always have to be longer
D) GPUs do not support Chinese
✅ Answer: B. Training corpora are mostly English, so Chinese characters hold a small share of the vocabulary and get split into more pieces.