Step 7: RAG — AI Agents Tutorial

Starting point

1. Why do you need RAG?

An LLM's weights are frozen at its training cutoff. To let an agent work with internal company documents, the latest regulations, or live data, there are three options:

Fine-tuning: train the knowledge into the weights. Expensive and slow, needs retraining for every change, and cannot update in real time.
Stuffing all the data into the context: even a 1M-token context cannot hold 100 MB of documents, and every call becomes slow and costly.
RAG (Retrieval-Augmented Generation): chunk the data into a vector DB and, for each query, retrieve the top-K most relevant snippets into the prompt.

RAG is the best answer in roughly 90% of scenarios: knowledge can be updated dynamically, cost is low, answers are auditable (you can see which passages the agent cited), and hallucinations drop by an estimated 60–80%.

This chapter breaks RAG into 5 stages: chunk → embed → retrieve → rerank → generate.

Five stages

2. The five stages of the RAG pipeline

┌─ INDEX TIME (offline) ──────────────────────────────────────────┐
│ Docs → ① Chunk → ② Embed → store in Vector DB + Keyword Index   │
└─────────────────────────────────────────────────────────────────┘
┌─ QUERY TIME (per request) ──────────────────────────────────────┐
│ Query → ③ Retrieve (hybrid) → ④ Rerank → ⑤ Generate (LLM)       │
└─────────────────────────────────────────────────────────────────┘

① Chunking

Split long documents into chunks of 200–800 tokens with 10–15% overlap. Naive fixed-length splitting tends to cut through semantic units; prefer semantic chunking along paragraph, heading, or sentence boundaries.
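As a rough illustration of the chunking advice above, the sketch below packs whole sentences into chunks up to a size budget and carries the tail of the previous chunk forward as overlap. It assumes whitespace-separated words as a stand-in for tokens (a real pipeline would count with the embedding model's tokenizer), and `chunk_text`, `max_tokens`, and `overlap_ratio` are illustrative names, not part of any library.

```python
import re

def chunk_text(text: str, max_tokens: int = 400, overlap_ratio: float = 0.1) -> list[str]:
    """Pack whole sentences into chunks of about max_tokens words, with overlap.

    Assumption: one whitespace-separated word counts as one token.
    A single sentence longer than max_tokens is kept whole rather than cut.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    overlap = int(max_tokens * overlap_ratio)
    chunks: list[str] = []
    current: list[str] = []  # words of the chunk being built
    fresh = 0                # words added since the last flush (overlap excluded)
    for sentence in sentences:
        words = sentence.split()
        if fresh > 0 and len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            # Start the next chunk with the tail of the one just emitted.
            current = current[-overlap:] if overlap else []
            fresh = 0
        current.extend(words)
        fresh += len(words)
    if fresh > 0:
        chunks.append(" ".join(current))
    return chunks
```

Splitting on sentence boundaries keeps each chunk readable on its own; the overlap means a fact that straddles two chunks still appears complete in at least one of them.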

② Embedding

An embedding model (OpenAI text-embedding-3, Voyage, BGE, Cohere) maps each chunk to a vector, typically of 1024–3072 dimensions. Semantically similar chunks end up close together in this vector space.
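To make "close in vector space" concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are made up for illustration; real embeddings come from one of the models above and have far more dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values, not produced by a real model).
refund_policy = [0.9, 0.1, 0.0]
return_rules = [0.8, 0.2, 0.1]
gpu_benchmark = [0.0, 0.1, 0.9]

# The two policy-like vectors point in nearly the same direction,
# so they score higher with each other than with the unrelated one.
related = cosine_similarity(refund_policy, return_rules)
unrelated = cosine_similarity(refund_policy, gpu_benchmark)
```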

③ Retrieve

Embed the query and find the K nearest chunks by cosine similarity. The 2026 industry consensus is that pure vector retrieval fails about 40% of the time, so always add BM25 keyword retrieval to make it a hybrid search.
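The keyword half of hybrid search can be sketched with a small Okapi BM25 scorer. This is a teaching sketch under simple assumptions (lowercased whitespace tokenization, the common defaults `k1=1.5` and `b=0.75`); production systems rely on the search engine's own BM25 implementation, and `bm25_scores` is an illustrative name.

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Return one BM25 score per document for the given query."""
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # exact-token match is required, unlike vector search
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(toks) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores
```

Because a term contributes only when it literally appears in the document, rare exact strings such as identifiers get a high score in the one document that contains them and zero everywhere else.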

④ Rerank

A cross-encoder (Cohere Rerank, Voyage Rerank, bge-reranker) re-orders the top-50 candidates and keeps the top 5. A cross-encoder scores the query and a chunk together, which makes it much more accurate than a bi-encoder but also slower, so run it only on a small candidate set.
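The two-stage shape (cheap retrieval first, expensive pairwise scoring second) can be sketched without any model. In the code below `score_pair` is a hypothetical hook where a real cross-encoder would go, and `toy_overlap_score` is only a stand-in so the example runs on its own.

```python
from typing import Callable

def rerank(
    query: str,
    candidates: list[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 5,
) -> list[str]:
    """Score every (query, candidate) pair jointly and keep the best top_n."""
    ranked = sorted(candidates, key=lambda doc: score_pair(query, doc), reverse=True)
    return ranked[:top_n]

def toy_overlap_score(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder: fraction of query words present in doc."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0
```

The cost is one `score_pair` call per candidate, which is why the candidate set is kept to a few dozen items rather than the whole corpus.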

⑤ Generate

Put the top-K chunks into the prompt with an instruction such as: "Answer using the following sources. If the answer is not there, say 'not in the source.'" Require the LLM to cite its sources.
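A minimal sketch of how such a prompt could be assembled, assuming a chat-style message list with `role` and `content` keys. `build_messages` is an illustrative helper, and the exact instruction wording is one possible phrasing.

```python
def build_messages(question: str, chunks: list[str]) -> list[dict[str, str]]:
    """Number the retrieved chunks and wrap them in a grounded-answer prompt."""
    # Numbering starts at 1 so that [n] in the answer maps back to chunks[n - 1].
    sources = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    system = (
        "Answer using only the numbered sources. Cite them inline as [n]. "
        "If the answer is not in the sources, say 'not in the source.'"
    )
    user = f"Sources:\n{sources}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]
```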

Implementation

3. Complete RAG example (hybrid + rerank)

# pip install weaviate-client cohere openai
import cohere
import weaviate
from openai import OpenAI

oai = OpenAI()                     # reads OPENAI_API_KEY from the environment
co  = cohere.Client()              # reads CO_API_KEY from the environment
db  = weaviate.connect_to_local()  # call db.close() when finished
papers = db.collections.get("Papers")

def rag(query: str, k_retrieve: int = 30, k_final: int = 5):
    # ③ Hybrid retrieve: vector + BM25 fused (alpha=1 is pure vector, alpha=0 is pure BM25)
    hits = papers.query.hybrid(query=query, alpha=0.5, limit=k_retrieve).objects
    docs = [h.properties["text"] for h in hits]
    if not docs:
        # Nothing retrieved: skip the rerank and LLM calls entirely.
        return "No matching sources found.", []

    # ④ Rerank with cross-encoder (result named `reranked` to avoid clashing with the stdlib `re` module)
    reranked = co.rerank(model="rerank-english-v3.0", query=query, documents=docs, top_n=k_final)
    top = [docs[r.index] for r in reranked.results]

    # ⑤ Generate with citation requirement
    ctx = "\n\n".join(f"[{i+1}] {d}" for i, d in enumerate(top))
    resp = oai.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": "Answer ONLY from the sources. Cite [n] inline. If absent, say so."},
            {"role": "user", "content": f"Sources:\n{ctx}\n\nQuestion: {query}"},
        ],
    )
    return resp.choices[0].message.content, top

# Bare-bones vector RAG with Chroma (a separate, standalone script)
# pip install chromadb openai
import chromadb
from openai import OpenAI

oai = OpenAI()
col = chromadb.PersistentClient().get_or_create_collection("docs")

def embed(text: str) -> list[float]:
    # The collection must have been indexed with this same embedding model.
    return oai.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding

def rag(q: str, k: int = 5) -> str:
    # Nearest-neighbour lookup by query embedding; no BM25, no rerank.
    docs = col.query(query_embeddings=[embed(q)], n_results=k)["documents"][0]
    ctx = "\n".join(f"- {d}" for d in docs)
    r = oai.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": f"Use:\n{ctx}\n\nQ: {q}"}],
    )
    return r.choices[0].message.content

Interactive simulation

4. Hybrid search: interactive simulation

Alpha (α) sets the balance between the vector weight and the BM25 weight; changing it shifts which results rank first.

Observation: α=1 (pure vector) promotes documents that are conceptually similar even when the exact tokens never appear; α=0 (pure BM25) favors exact strings such as c.5266dupC. Production systems typically use α=0.4–0.6.
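The effect of α can be reproduced with a small fusion sketch. It assumes the two score lists are min-max normalized and then combined linearly, which is one simple way to fuse them; real engines may fuse differently (for example by rank), and `hybrid_rank` plus the scores in the usage example are made up for illustration.

```python
def min_max(scores: list[float]) -> list[float]:
    """Rescale scores to the 0..1 range so the two signals are comparable."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_rank(
    doc_ids: list[str],
    vector_scores: list[float],
    bm25_scores: list[float],
    alpha: float = 0.5,
) -> list[str]:
    """Order documents by alpha * vector + (1 - alpha) * BM25."""
    v = min_max(vector_scores)
    k = min_max(bm25_scores)
    fused = [alpha * vs + (1 - alpha) * ks for vs, ks in zip(v, k)]
    order = sorted(range(len(doc_ids)), key=lambda i: fused[i], reverse=True)
    return [doc_ids[i] for i in order]

# Hypothetical scores: one document matches the concept, another the exact string.
ids = ["concept_doc", "exact_doc", "other"]
vec = [0.9, 0.5, 0.1]
kw = [0.0, 7.0, 0.5]
pure_vector = hybrid_rank(ids, vec, kw, alpha=1.0)
pure_bm25 = hybrid_rank(ids, vec, kw, alpha=0.0)
balanced = hybrid_rank(ids, vec, kw, alpha=0.5)
```

With these toy numbers the concept-matching document leads at α=1 and the exact-string document leads at α=0, which mirrors the observation above.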

Advanced

5. Agentic RAG: letting the agent decide how to retrieve

From 2024 on, the limits of single-shot RAG (one retrieval, then generation) became clear:

"Compare A and B" needs two independent retrievals;
"Q1 2026 sales" needs the definition retrieved first, then the numbers;
when retrieved results conflict, the agent has to judge which one to trust.

Agentic RAG turns retrieval itself into a tool of the agent: the LLM decides what to query, how many times, and how to merge the results. Common patterns:

Self-querying: the LLM rewrites the user query into a better search query (adding synonyms, decomposing it).
Multi-step retrieval: when the first round is insufficient, reformulate the query and search again.
Routing: route by query type to different indexes (documents, API, SQL).
Verification: use a second retrieval to verify the first answer.
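The multi-step pattern above can be sketched as a small control loop. Everything here is a hypothetical skeleton: `search`, `is_sufficient`, and `rewrite` are pluggable callables that, in a real agent, would be a retriever and two LLM calls.

```python
from typing import Callable

def multi_step_retrieve(
    question: str,
    search: Callable[[str], list[str]],
    is_sufficient: Callable[[str, list[str]], bool],
    rewrite: Callable[[str, list[str]], str],
    max_rounds: int = 3,
) -> tuple[list[str], list[str]]:
    """Retrieve, check, reformulate, retry. Returns (evidence, queries tried)."""
    query = question
    evidence: list[str] = []
    tried: list[str] = []
    for _ in range(max_rounds):
        tried.append(query)
        for chunk in search(query):
            if chunk not in evidence:  # merge rounds without duplicates
                evidence.append(chunk)
        if is_sufficient(question, evidence):
            break
        # Not enough yet: ask for a better query given what was found so far.
        query = rewrite(question, evidence)
    return evidence, tried
```

The `max_rounds` cap matters in practice: without it, a question the index cannot answer would keep the agent rewriting and searching indefinitely.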

2026 trend: outlets such as VentureBeat have started talking about "the end of the RAG era". For long-lived, stable knowledge (e.g. product docs), a "compilation-stage knowledge layer" (compiling knowledge directly into model weights or a cache) may replace dynamic retrieval. For fresh, fast-changing knowledge, RAG remains irreplaceable.

Common mistakes

6. Common mistakes that make RAG fail

① Chunks too large or too small

Too large: one chunk covers several topics and the similarity score blurs. Too small: the chunk loses its context. Rule of thumb: 300–600 tokens with 10% overlap.

② No hybrid search

Pure vector retrieval fails on exact IDs, rare words, and names of people. Hybrid is the 2026 default.

③ No rerank

The first-stage top-30 usually contains a lot of noise. Adding a cross-encoder reranker can raise Recall@5 from about 60% to about 85%.
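Recall@5 is easy to measure on your own data before and after adding a reranker. The sketch below assumes you have a labeled set of relevant document IDs per query; the IDs in the example are made up.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Share of the relevant documents that appear in the first k results."""
    if not relevant:
        return 0.0
    found = set(retrieved[:k]) & relevant
    return len(found) / len(relevant)

# Hypothetical ranking for one query, before and after reranking.
relevant_ids = {"d1", "d2", "d3"}
stage1_order = ["d7", "d2", "d9", "d4", "d8", "d1", "d3"]
reranked_order = ["d1", "d2", "d3", "d7", "d9"]

before = recall_at_k(stage1_order, relevant_ids, k=5)
after = recall_at_k(reranked_order, relevant_ids, k=5)
```

Averaging this number over a set of test queries gives a figure you can compare across chunking, retrieval, and reranking settings.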

④ Prompt does not require citations

Have the LLM cite chunks as [1][2]; otherwise you cannot audit the answer or tell whether it is hallucinated.
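Once answers carry [n] markers, a first automated audit is cheap. The sketch below only checks that citations exist and point at a real source number; it does not verify that the cited chunk actually supports the claim. `check_citations` is an illustrative helper name.

```python
import re

def check_citations(answer: str, num_sources: int) -> dict:
    """Extract [n] markers and flag any that fall outside 1..num_sources."""
    cited = sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})
    invalid = [n for n in cited if not 1 <= n <= num_sources]
    return {
        "cited": cited,                # distinct source numbers referenced
        "invalid": invalid,            # numbers with no matching source
        "has_citation": bool(cited),   # False suggests an ungrounded answer
    }
```

An answer with no citations, or with a citation number that has no matching chunk, is a good candidate for rejection or a retry.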

🎓 Chapter quiz

Q1. What is the chunk size recommended in 2026?

A) The bigger the better

B) 300–600 tokens + 10% overlap

C) One sentence per chunk

D) Must be 8192 tokens

✅ B. Too large blurs similarity, too small loses context; 300–600 tokens is the industry consensus.

Q2. Why is hybrid search better than pure vector search?

A) Pure vector search is obsolete

B) BM25 works well on exact strings and rare words, covering the weak spot of vectors

C) Hybrid is cheaper

D) It needs no embeddings

✅ B. Vectors capture semantics but are weak on exact tokens; BM25 is the opposite, so the two are complementary.

Q3. What is the core upgrade of Agentic RAG over single-shot RAG?

A) Using larger embeddings

B) Skipping retrieval

C) Turning retrieval into a tool of the agent

D) Requiring Pinecone

✅ C. Retrieval changes from a fixed pre-step into a tool the agent can call on its own.
