Step 5: Memory Systems — AI Agents Tutorial

Starting point

1. Why do agents need "memory"?

A bare LLM is stateless: every API call is a fresh start, and the model itself remembers nothing of yesterday's conversation. The context window is its only "memory," and every call has to refill it from scratch.

That is fine for a single-turn chatbot but catastrophic for an agent:

Multi-step tasks need to remember what was done in earlier steps
Cross-session personalization (user preferences) needs persistent storage
Once a long conversation outgrows the context window, old information is pushed out
Collaboration between agents requires shared state

A memory system is the design that turns "ephemeral tokens" into "persistent knowledge." This chapter breaks down four memory roles.

Four kinds of memory

2. The four agent memory types compared

Borrowing from cognitive science, the agent community commonly divides memory into four types, in line with the CoALA framework:

🧠 Working memory

Holds: the current conversation history, tool results, and a scratchpad.
Implementation: the messages array in the context window.
Lifespan: a single task.
Key issue: token budgeting and trimming strategy.

📖 Episodic memory

Holds: full traces of past tasks (query → steps → outcome).
Implementation: vector DB + timestamp index.
Lifespan: persists across sessions.
Key issue: relevance of retrieval, privacy retention period.

📚 Semantic memory

Holds: abstract facts and knowledge (user preferences, domain rules, document knowledge).
Implementation: vector DB + knowledge graph + RAG.
Lifespan: persistent.
Key issue: resolving conflicting updates, evicting stale information.

🛠️ Procedural memory

Holds: "how to do it": successful prompt templates, tool sequences, skill libraries.
Implementation: prompt library, skill registry, fine-tuned weights.
Lifespan: persistent.
Key issue: generalization, versioning.

🧭 Human analogy: working memory ≈ what you are holding in mind right now; episodic memory ≈ "I discussed X with Alice last week"; semantic memory ≈ "Paris is the capital of France"; procedural memory ≈ "how to ride a bike."
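
One way to make the four roles concrete is a small container with one field per role. The sketch below is illustrative only: `AgentMemory`, its field names, and `end_task` are hypothetical and do not come from CoALA or any library, and plain Python lists and dicts stand in for the vector DB and skill registry a real system would use.

```python
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    """Hypothetical container with one field per memory role."""
    working: list[dict] = field(default_factory=list)      # messages for the current task
    episodic: list[dict] = field(default_factory=list)     # time-stamped traces of past tasks
    semantic: dict[str, str] = field(default_factory=dict)  # abstract facts and preferences
    procedural: dict[str, list[str]] = field(default_factory=dict)  # skill name -> tool sequence

    def end_task(self, outcome: str, date: str) -> None:
        """Archive the finished task as an episode, then clear working memory."""
        self.episodic.append(
            {"date": date, "steps": list(self.working), "outcome": outcome}
        )
        self.working.clear()
```

The `end_task` step mirrors the lifespans listed above: working memory lasts one task, while the episode it produces persists.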

Implementation: working memory

3. Trimming strategies for the context window

When the messages array grows past the context limit, or critical content drifts into positions where it is less useful, trim proactively. Three common strategies:

Strategy	How it works	Pros / Cons
Sliding window	Keep only the last N turns	✅ simple, predictable / ❌ loses all early context
Summarization	Every K turns, summarize old messages with an LLM and replace them	✅ retains the gist / ❌ summaries distort, extra cost
Hierarchical / selective	Push old messages to a vector DB and retrieve on demand	✅ no hard limit, high fidelity / ❌ engineering complexity

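import anthropic
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def render(messages):
    # Flatten messages into "role: content" lines for the summarizer
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)

def count_tokens(messages):
    # Rough estimate (~4 characters per token); swap in a real tokenizer for production
    return len(render(messages)) // 4

def trim_messages(messages, max_tokens=12000, keep_last=8):
    # Assumes messages[0] is the system message and keep_last >= 1
    if count_tokens(messages) <= max_tokens or len(messages) <= keep_last + 1:
        return messages
    system, *rest = messages
    head, tail = rest[:-keep_last], rest[-keep_last:]
    summary = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=600,
        messages=[{"role": "user",
                   "content": "Summarize the following conversation in <200 words, "
                              "preserve all decisions and tool results:\n" + render(head)}]
    ).content[0].text
    return [system, {"role": "user", "content": f"[Earlier conversation summary] {summary}"}, *tail]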
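
The function above implements the summarization strategy. For comparison, the sliding-window strategy from the table could be sketched as follows. This is a minimal sketch, assuming `messages[0]` is the system message and that counting individual messages is an acceptable stand-in for counting turns; the name `sliding_window` is made up for illustration.

```python
def sliding_window(messages, keep_last=8):
    """Keep the system message plus the most recent `keep_last` messages."""
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    system, *rest = messages
    # Negative slicing returns all of `rest` when it is shorter than keep_last
    return [system, *rest[-keep_last:]]
```

Pinning the system message keeps the instructions in place, but everything else older than the window, including the user's original task description, is still dropped. That is the trade-off the table lists for this strategy.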

Implementation: long-term memory

4. Implementing episodic + semantic memory with a vector database

The standard architecture for long-term memory: convert each piece of "memory-worthy" content into a vector with an embedding model and store it in a vector DB. At the start of a new conversation, or whenever the agent needs to recall something, embed the current query as well and retrieve the top-K most similar items into the context.

Mainstream vector DBs in 2026:

Pinecone: managed, production-grade, pricey
Qdrant / Milvus: open source, self-hosted
Weaviate: built-in hybrid search
pgvector: Postgres extension, a first choice below 1M vectors
Chroma / LanceDB: lightweight, for local development

# pip install chromadb openai
import chromadb
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
db     = chromadb.PersistentClient(path="./agent_memory")
mem    = db.get_or_create_collection("episodes")

def embed(text):
    # Returns one embedding vector for the input string
    return client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding

def remember(episode_id, text, metadata):
    # upsert: re-running with the same id overwrites the entry instead of duplicating it
    mem.upsert(ids=[episode_id], documents=[text], embeddings=[embed(text)], metadatas=[metadata])

def recall(query, k=5, where=None):
    # `where` is an optional metadata filter, e.g. {"user_id": "u123"}
    res = mem.query(query_embeddings=[embed(query)], n_results=k, where=where)
    return [{"text": d, "meta": m} for d, m in zip(res["documents"][0], res["metadatas"][0])]

# Usage in agent loop
remember("task_2026_05_12_001",
         "User asked to triage BRCA1 c.5266dupC. Agent used PubMed + ClinVar. Verdict: pathogenic.",
         {"user_id": "u123", "date": "2026-05-12", "domain": "variant"})

# On new conversation start
system_prompt = "You are a helpful assistant."  # placeholder base prompt
recent = recall("BRCA variant questions", k=3, where={"user_id": "u123"})
past = "\n".join(f"- [{r['meta']['date']}] {r['text']}" for r in recent)
system_prompt += f"\n\nRelevant past tasks:\n{past}"
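
To show what "retrieve the top-K most similar items" means underneath the database call, here is a dependency-free sketch. It assumes cosine similarity as the metric (one common choice; a given vector DB may default to a different distance), and the three-dimensional vectors are made up for illustration rather than produced by a real embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k=2):
    """store: list of (text, vector) pairs; returns the k most similar texts."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy vectors, not real embeddings
store = [
    ("BRCA1 triage", [0.9, 0.1, 0.0]),
    ("hotel booking", [0.0, 0.2, 0.9]),
    ("TP53 variant review", [0.8, 0.3, 0.1]),
]
print(top_k([1.0, 0.2, 0.0], store))  # ['BRCA1 triage', 'TP53 variant review']
```

A production vector DB replaces the full sort with an approximate nearest-neighbor index, which is what keeps retrieval fast as the store grows.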

Memory write policy

5. What should go into long-term memory?

"Write everything" sounds convenient, but it turns the vector DB into a noisy junk drawer. Guidance from Anthropic and LangChain suggests writing the following:

The user explicitly asks "please remember X"
Preference facts: the user's preferred language, timezone, professional background
Non-obvious decisions: "our H2 freeze starts 6/1," "this repo uses pnpm, not npm"
Failure or correction trajectories: the last attempt with X failed, and only Y worked

Do NOT write:

Facts that can always be looked up live (use RAG or an API instead)
Sensitive personal data (medical, financial, home address) unless the user has explicitly consented
Ephemeral conversation state (working memory is enough)
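
The write and do-not-write rules above can be expressed as a small gate in front of the memory store. This is a sketch under stated assumptions: the `Candidate` type, the `kind` labels, and the `sensitive` and `consent` flags are all hypothetical, and it assumes some upstream step (a classifier or an LLM call) has already assigned them.

```python
from dataclasses import dataclass

# Hypothetical labels mirroring the four "write" categories above
PERSIST_KINDS = {"explicit_request", "preference", "decision", "correction"}


@dataclass
class Candidate:
    text: str
    kind: str                # assigned upstream; anything outside PERSIST_KINDS is skipped
    sensitive: bool = False  # medical, financial, home address, ...
    consent: bool = False    # the user explicitly agreed to storage


def should_persist(c: Candidate) -> bool:
    """Return True only for memory-worthy, consent-safe candidates."""
    if c.sensitive and not c.consent:
        return False  # sensitive data is never stored without consent
    return c.kind in PERSIST_KINDS
```

For example, `should_persist(Candidate("this repo uses pnpm, not npm", "decision"))` passes the gate, while a live-lookup fact or ephemeral state labeled with any other kind does not.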

Interactive demo

6. Memory layers working together: a simulation

The simulation below shows how the memory layers cooperate when an agent receives "Book that hotel I stayed at last time":

🧠 Working Memory

Holds: the current turn, "Book that hotel I stayed at last time"

📖 Episodic Memory recall

Recall: on 2026-03-14, booked Tokyo Hyatt Regency, room 1502, 4 nights

📚 Semantic Memory recall

Recall: user preferences: window side, high floor, bed no smaller than King

🛠️ Procedural Memory recall

Recall: booking skill = check_availability → confirm → send_to_user

✅ Final behavior

The agent simply asks "Tokyo Hyatt Regency, 4 nights, same preferences?" for a zero-friction experience
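
The same walkthrough can be written as a toy script. Everything here is a stand-in: the in-memory `episodic`, `semantic`, and `procedural` structures and the `handle` function are hypothetical, and "most recent stay" replaces the similarity search a real episodic store would run.

```python
# In-memory stand-ins for the three long-term stores (illustrative data from the walkthrough)
episodic = [
    {"date": "2026-03-14", "hotel": "Tokyo Hyatt Regency", "room": "1502", "nights": 4},
]
semantic = {"preferences": ["window side", "high floor", "King bed or larger"]}
procedural = {"book_hotel": ["check_availability", "confirm", "send_to_user"]}


def handle(user_turn: str) -> dict:
    """Gather what each memory layer contributes to one user turn."""
    working = [{"role": "user", "content": user_turn}]   # working memory: the current turn
    last_stay = max(episodic, key=lambda e: e["date"])   # ISO dates sort chronologically as strings
    question = f"{last_stay['hotel']}, {last_stay['nights']} nights, same preferences?"
    return {
        "working": working,
        "episode": last_stay,
        "preferences": semantic["preferences"],
        "plan": procedural["book_hotel"],
        "question": question,
    }


result = handle("Book that hotel I stayed at last time")
print(result["question"])  # Tokyo Hyatt Regency, 4 nights, same preferences?
```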

🎓 Chapter quiz

Q1. Which of the following is not suitable for long-term memory?

A) The user's preferred programming language

B) The correction trajectory from the last task

C) Medical records stored without consent

D) The project's special naming conventions

✅ C. Sensitive PII must never be written to long-term memory without explicit consent.

Q2. What is the biggest drawback of sliding-window trimming?

A) Too expensive

B) Loses all early context

C) Violates GDPR

D) Cannot be combined with tools

✅ B. System instructions and the initial task description get cut, so the agent "forgets" its goal.

Q3. "The user discussed X with me last week" is which kind of memory?

A) Working

B) Episodic

C) Semantic

D) Procedural

✅ B. A concrete, time-stamped event is episodic memory.
