Step 6: Planning & Reasoning — AI Agents Tutorial

Starting point

1. The planning loop is the agent's "control flow"

A single LLM call can only think once. To solve a multi-step task, an agent has to run "think → act → observe" as a loop. How that loop is organized is the planning pattern.

As of 2026, four patterns dominate production use:

ReAct: interleaves thinking and acting; the most widely used
Plan-and-Execute: writes the full plan first, then executes it
Tree-of-Thoughts (ToT): explores multiple paths in parallel
Reflection: self-critiques after acting, then redoes the work

Different tasks suit different patterns. This chapter breaks down all four, gives a selection guide, and includes example code.

Pattern 1

2. ReAct: Reasoning + Acting

ReAct (Yao et al. 2023) is the most influential agent planning pattern. The core prompt structure:

Thought: I need to find X. Let me call tool Y.
Action: Y[args]
Observation:
Thought: Now I have X. I still need Z.
Action: ...
...
Thought: I have enough. Final answer.
Final Answer:
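When the model emits this format as plain text rather than through a native tool-calling API, the agent has to parse each step itself. The following is a minimal sketch of such a parser, assuming the `Action: tool[args]` convention shown above; `ReActStep` and `parse_react_step` are hypothetical names introduced for illustration, and a real model may drift from the format in ways this does not handle.

```python
import re
from typing import NamedTuple, Optional


class ReActStep(NamedTuple):
    thought: str
    action: Optional[str]        # tool name, or None when the model is finished
    args: Optional[str]          # raw text between the square brackets
    final_answer: Optional[str]  # set only on the last step


# Patterns for the "Thought / Action / Final Answer" convention (assumed format).
_THOUGHT = re.compile(r"^Thought:\s*(.*)$", re.MULTILINE)
_ACTION = re.compile(r"^Action:\s*(\w+)\[(.*)\]\s*$", re.MULTILINE)
_FINAL = re.compile(r"^Final Answer:\s*(.*)", re.MULTILINE | re.DOTALL)


def parse_react_step(text: str) -> ReActStep:
    """Split one model turn into its thought plus either an action or a final answer."""
    thought_match = _THOUGHT.search(text)
    thought = thought_match.group(1).strip() if thought_match else ""

    final = _FINAL.search(text)
    if final:
        return ReActStep(thought, None, None, final.group(1).strip())

    action = _ACTION.search(text)
    if action:
        return ReActStep(thought, action.group(1), action.group(2), None)

    # Neither marker found: let the caller decide whether to retry or stop.
    raise ValueError("no Action or Final Answer found in model output")
```

Raising on malformed output, instead of guessing, lets the surrounding loop feed the error back to the model as an Observation.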

Pros: can be implemented with a single LLM; the reasoning trace is auditable; error recovery comes naturally, because an observed failure feeds into the next Thought, which adjusts course.

Cons: every step is an LLM call, which makes it slow and expensive; long loops on hard tasks can fall into endless repetition; reasoning quality depends on prompt quality.

Best for: tool-call-heavy tasks of moderate difficulty that take roughly 3–8 steps. It is also LangChain's default mode.

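import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MAX_STEPS = 10                  # assumed cap so the loop cannot run forever

REACT_SYS = """You are a ReAct agent. Solve the user's task by alternating:
Thought: ...
Action: tool_name[args]   (one tool call)
After receiving the tool result you'll see an Observation, then continue.
When you have the answer, output:
Final Answer: ..."""

# TOOLS (tool schemas), run_tool (dispatcher) and user_question are defined elsewhere.
# The Messages API takes the system prompt as a top-level parameter, not as a message.
messages = [{"role": "user", "content": user_question}]
for step in range(MAX_STEPS):
    resp = client.messages.create(model="claude-sonnet-4-6", system=REACT_SYS,
                                  tools=TOOLS, max_tokens=1024, messages=messages)
    messages.append({"role": "assistant", "content": resp.content})
    if resp.stop_reason != "tool_use":  # "end_turn", "max_tokens", ...: nothing to run
        break
    # Run every requested tool and send the results back as the next user turn.
    tool_results = [{"type": "tool_result",
                     "tool_use_id": b.id,
                     "content": run_tool(b.name, b.input)}
                    for b in resp.content if b.type == "tool_use"]
    messages.append({"role": "user", "content": tool_results})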
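
The loop above relies on `TOOLS` and `run_tool`, which this chapter does not define. One possible shape is sketched below, assuming the Anthropic tool schema (`name`, `description`, `input_schema`); the `add` tool, `_add`, and `_HANDLERS` are hypothetical and exist only for illustration.

```python
import json

# Hypothetical tool registry: one toy tool described with a JSON Schema.
TOOLS = [
    {
        "name": "add",
        "description": "Add two numbers and return the sum.",
        "input_schema": {
            "type": "object",
            "properties": {"a": {"type": "number"}, "b": {"type": "number"}},
            "required": ["a", "b"],
        },
    },
]


def _add(a: float, b: float) -> float:
    return a + b


# Maps each tool name to the Python function that implements it.
_HANDLERS = {"add": _add}


def run_tool(name: str, tool_input: dict) -> str:
    """Dispatch a tool call and always return a string the model can read."""
    handler = _HANDLERS.get(name)
    if handler is None:
        return f"error: unknown tool {name!r}"
    try:
        return json.dumps(handler(**tool_input))
    except Exception as exc:
        # Report failures as text so the model can see them and adjust.
        return f"error: {type(exc).__name__}: {exc}"
```

Returning errors as strings rather than raising is what makes ReAct's error recovery work in practice: the failure becomes the next Observation.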

Pattern 2

3. Plan-and-Execute: plan first, then execute

This pattern decouples planning from execution:

Planner LLM: decomposes the user's goal into an explicit 5–15 step plan, output as JSON.
Executor: runs the steps one by one (a cheaper model or a plain script will do).
Re-planner: goes back and re-plans when execution diverges from the plan.

Pros: the planner uses an expensive model to think things through once, and the executor then makes many calls on a cheap model, which cuts cost substantially; the plan can be reviewed by a human; independent steps can run in parallel.

Cons: if the initial plan is wrong, every later step inherits the error; information discovered mid-execution is hard to use without re-planning.

Best for: long (>10 steps), stable, predictable flows. LangGraph's planner/executor example uses this pattern.

User goal ─► [Planner LLM] ─► PLAN[step1, step2, ..., stepN]
                                   │
                                   ▼
                          ┌─ Execute step1 ─┐
                          ├─ Execute step2 ─┤ ◄─ re-plan if needed
                          └─ Execute stepN ─┘
                                   ▼
                              Final result
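
The planner / executor / re-planner split can be sketched as a small driver function. This is a minimal illustration, not LangGraph's implementation: `plan_and_execute`, its parameters, and the convention that the executor returns `(ok, result)` are all assumptions, and the planner and executor are passed in as plain callables so that any model or script can stand behind them.

```python
import json
from typing import Callable


def plan_and_execute(
    goal: str,
    planner: Callable[[str], str],                # expensive model: prompt -> JSON list of steps
    executor: Callable[[str], tuple[bool, str]],  # cheap model or script: step -> (ok, result)
    max_replans: int = 2,
) -> list[str]:
    """Run a JSON plan step by step, re-planning the remainder when a step fails."""
    results: list[str] = []
    replans = 0
    plan = json.loads(planner(goal))
    i = 0
    while i < len(plan):
        ok, result = executor(plan[i])
        if ok:
            results.append(result)
            i += 1
            continue
        if replans >= max_replans:
            raise RuntimeError(f"step failed after {replans} re-plans: {plan[i]}")
        replans += 1
        # Re-planner: tell the planner what is done and what failed,
        # then replace the failed step and everything after it.
        context = f"{goal}\nDone: {results}\nFailed step: {plan[i]} ({result})"
        plan = plan[:i] + json.loads(planner(context))
    return results
```

Capping re-plans matters here for the same reason iteration caps matter in ReAct: a planner that keeps producing a failing step would otherwise loop forever.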

Pattern 3

4. Tree-of-Thoughts (ToT): branching search

When a task requires exploring several possible solutions (math puzzles, games, creative writing), ReAct's linear reasoning gets stuck: one wrong turn and the whole attempt is lost. ToT (Yao et al. 2023) turns reasoning into a tree search:

Generate several candidate thoughts at each step (3–5 branches)
Have the LLM rate how promising each thought is
Expand the most promising branches with BFS / DFS
Stop on reaching the goal or exhausting the budget

Pros: markedly outperforms linear, ReAct-style reasoning on tasks that need backtracking (Sudoku, Game of 24, creative composition).

Cons: the number of calls explodes (10×–100× that of ReAct); it is slow and costly; most practical tasks do not need it.

Best for: discrete search spaces whose intermediate states can be evaluated clearly. Rare in production, common in academic papers.
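
The four steps above map onto a breadth-first search with a fixed beam. The sketch below is a simplified illustration under stated assumptions: `propose`, `score`, and `is_goal` are hypothetical callables standing in for LLM calls (or a programmatic checker), and states are plain strings.

```python
from typing import Callable, Optional


def tree_of_thoughts_bfs(
    root: str,
    propose: Callable[[str], list[str]],  # state -> candidate next states (the branches)
    score: Callable[[str], float],        # state -> promise estimate, higher is better
    is_goal: Callable[[str], bool],
    beam_width: int = 3,
    max_depth: int = 4,
) -> Optional[str]:
    """Breadth-first ToT: expand every state in the frontier, keep the best few."""
    frontier = [root]
    for _ in range(max_depth):
        candidates = [child for state in frontier for child in propose(state)]
        if not candidates:
            return None  # dead end: nothing left to expand
        for state in candidates:
            if is_goal(state):
                return state
        # Prune to the most promising branches before going one level deeper.
        frontier = sorted(candidates, key=score, reverse=True)[:beam_width]
    return None  # depth budget exhausted
```

With `beam_width` states per level and several proposals per state, each level costs one `propose` call per kept state plus one `score` call per candidate, which is where the large call-count multiplier comes from.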

Pattern 4

5. Reflection: self-critique + redo

Reflexion (Shinn et al. 2023) and Self-Refine (Madaan et al. 2023) popularized this idea: after a task ends, the agent evaluates itself, writes "what to avoid next time / what to change" into short-term memory, and tries again.

Typical flow:

The agent produces a first draft, A1
A critic LLM (which can be the same model) reviews A1 and lists its problems
The critique is added to the context, and the agent produces A2
Repeat N times or until the critic is satisfied

Pros: effective for writing, code, and long reasoning tasks (gains of around +15% in unit-test pass rate are commonly reported for code).

Cons: costs 2–5× as much; if the critic itself is wrong, the loop reinforces the error.

Best for: quality-critical tasks that can run offline and have an automatic evaluation signal (tests, rules).
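
The draft / critique / redo flow above can be sketched as a short loop. This is an illustrative sketch, not the Reflexion or Self-Refine reference code: `reflect_and_retry`, `generate`, and `critic` are hypothetical names, and the critic is assumed to return an empty list when it is satisfied.

```python
from typing import Callable


def reflect_and_retry(
    task: str,
    generate: Callable[[str, list[str]], str],  # (task, critiques so far) -> draft
    critic: Callable[[str, str], list[str]],    # (task, draft) -> issues; empty means satisfied
    max_rounds: int = 3,
) -> tuple[str, int]:
    """Return the final draft and the number of critic reviews that were run."""
    critiques: list[str] = []
    draft = generate(task, critiques)
    for round_no in range(1, max_rounds + 1):
        issues = critic(task, draft)
        if not issues:
            return draft, round_no
        # Keep every critique in context so earlier mistakes are not repeated.
        critiques.extend(issues)
        draft = generate(task, critiques)
    # Budget exhausted: the last draft is returned without a final review.
    return draft, max_rounds
```

Passing the critic in as a callable makes it easy to swap an LLM critic for an automatic signal such as a unit-test runner, which avoids the failure mode where a mistaken critic reinforces errors.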

Selection guide

6. Decision tree: which planning pattern should you choose?

🌳 Choosing a planning pattern

Q1: Does the task take more than 10 steps and follow a predictable flow? ▶ Plan-and-Execute

Q2: Otherwise, does it need backtracking or branching search (puzzles, games)? ▶ Tree-of-Thoughts

Q3: Otherwise, is it quality-critical and automatically evaluable? ▶ Reflection

Q4: None of the above (a typical 3–8 step tool-calling task) ▶ ReAct
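
The decision tree can be written down directly as a function, which makes the order of the questions explicit. `choose_pattern` and its parameter names are hypothetical; the thresholds are the ones stated in Q1 to Q4.

```python
def choose_pattern(
    steps: int,
    predictable: bool,
    needs_search: bool,
    quality_critical: bool,
    auto_eval: bool,
) -> str:
    """Apply Q1 to Q4 in order; the first question answered 'yes' wins."""
    if steps > 10 and predictable:        # Q1
        return "Plan-and-Execute"
    if needs_search:                      # Q2
        return "Tree-of-Thoughts"
    if quality_critical and auto_eval:    # Q3
        return "Reflection"
    return "ReAct"                        # Q4: the default
```

For example, a 5-step support workflow with no search and no automatic grader falls through to ReAct, while a 20-step predictable ETL job stops at Q1.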

💡

Hybrid patterns: production agents often mix patterns. For example, an outer Plan-and-Execute layer splits the work into high-level steps, each step runs ReAct internally, and a final Reflection layer self-audits the result. LangGraph was designed to support this kind of composition.

Practical tips

7. Five fuses against infinite loops

Whichever planning pattern you use, an agent can get stuck in a loop. Production systems should install all five:

max_iterations cap: force a stop after N steps (typically 10–30).
Repeat-action detection: end the run when the same (tool, args) call occurs three times in a row.
Token / cost cap: escalate once accumulated cost exceeds a threshold.
Progress check: every K steps, ask the LLM "how far are we from the goal?"; with no progress, stop or re-plan.
Human-in-the-loop checkpoint: require human approval before any sensitive action.
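
Four of these five safeguards need no model call and can live in one small guard object that the agent loop consults before every tool call. The sketch below is one possible arrangement, not a standard API: `LoopGuard`, `LoopStopped`, the default thresholds, and the sensitive tool names are all hypothetical. The progress check is left out because it requires an extra LLM call.

```python
import json
from dataclasses import dataclass


class LoopStopped(Exception):
    """Raised when any safeguard trips; the caller decides how to escalate."""


@dataclass
class LoopGuard:
    max_iterations: int = 20
    max_repeats: int = 3
    max_cost_usd: float = 1.00
    sensitive_tools: frozenset = frozenset({"delete_file", "send_email"})  # hypothetical names
    steps: int = 0
    cost_usd: float = 0.0
    _last_call: str = ""
    _repeat_count: int = 0

    def check(self, tool: str, args: dict, step_cost_usd: float, approved: bool = False) -> None:
        """Call once before each tool execution; raises LoopStopped to halt the agent."""
        self.steps += 1
        self.cost_usd += step_cost_usd
        if self.steps > self.max_iterations:
            raise LoopStopped("max_iterations exceeded")
        if self.cost_usd > self.max_cost_usd:
            raise LoopStopped("cost cap exceeded")
        # Serialize (tool, args) so identical consecutive calls compare equal.
        call = json.dumps([tool, args], sort_keys=True)
        self._repeat_count = self._repeat_count + 1 if call == self._last_call else 1
        self._last_call = call
        if self._repeat_count >= self.max_repeats:
            raise LoopStopped(f"same call repeated {self._repeat_count} times")
        if tool in self.sensitive_tools and not approved:
            raise LoopStopped(f"{tool} needs human approval")
```

In an agent loop, `guard.check(b.name, b.input, step_cost)` would run just before the tool is executed, with the `LoopStopped` handler deciding whether to stop, re-plan, or hand off to a human.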

🎓 Chapter quiz

Q1. What is the core structure of the ReAct pattern?

A) Running several reasoning paths at the same time

B) An alternating Thought → Action → Observation cycle

C) Writing the full plan first, then executing it

D) Self-critiquing, then redoing the work

✅ B. ReAct = Reasoning + Acting: thinking and acting alternate.

Q2. Why is Tree-of-Thoughts rarely used in production agents?

A) The concept is too complex

B) It needs 10–100× as many calls, so cost and latency are too high

C) Models do not support it

D) It violates OWASP

✅ B. Branching search means a large number of LLM calls, which most practical tasks do not justify.

Q3. Which of the following is not an effective safeguard against infinite agent loops?

A) A max_iterations cap

B) Detecting repeated (tool, args) calls

C) A cap on accumulated cost

D) Setting temperature to 0

✅ D. Temperature 0 can actually make the agent keep choosing the same wrong action.
