1. Why Do You Need RAG?
An LLM's weights are frozen at its training cutoff. For an agent to work with internal company documents, the latest regulations, or live data, there are three options:
- Fine-tuning: bake the knowledge into the weights. Expensive and slow, needs retraining for every change, and cannot reflect live updates.
- Stuffing all the data into the context window: even a 1M-token context cannot hold 100 MB of documents, and long prompts are slow and costly.
- RAG (Retrieval-Augmented Generation): split the data into chunks, store them in a vector DB, and for each query retrieve the top-K most relevant chunks and insert them into the prompt.
RAG is the best answer in roughly 90% of scenarios: knowledge can be updated dynamically, cost is low, answers are auditable (you can see which passage the agent cited), and hallucinations drop by 60–80%.
This chapter breaks RAG into five stages: chunk → embed → retrieve → rerank → generate.
2. The Five Stages of the RAG Pipeline
① Chunking
Split long documents into chunks of 200–800 tokens with 10–15% overlap. Naive fixed-length splitting tends to cut through semantic units; prefer semantic chunking along paragraph, heading, or sentence boundaries.
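As a rough illustration of sentence-aware chunking with overlap, the sketch below packs whole sentences into chunks and carries the tail of each chunk into the next one. The function name `chunk_text` and its parameters are hypothetical, and whitespace-separated words stand in for model tokens, so sizes are only approximate.

```python
import re

def chunk_text(text: str, max_tokens: int = 400, overlap_ratio: float = 0.1) -> list[str]:
    """Pack whole sentences into chunks of roughly max_tokens words.

    Assumption: whitespace-separated words approximate tokens. A chunk can
    exceed max_tokens only if a single sentence is longer than the budget
    left after the overlap is carried over.
    """
    # Split after sentence-ending punctuation, or at blank lines (paragraphs).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+|\n{2,}", text) if s.strip()]
    keep = int(max_tokens * overlap_ratio)  # words repeated at the start of the next chunk
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        words = sentence.split()
        if current and len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            current = current[-keep:] if keep else []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks

sample = " ".join(f"Sentence {i} has five words." for i in range(10))
for chunk in chunk_text(sample, max_tokens=12, overlap_ratio=0.25):
    print(chunk)
```

In a real pipeline the word count would be replaced by the tokenizer of the embedding model in use, and headings would usually be treated as hard boundaries.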
② Embedding
An embedding model (OpenAI text-embedding-3, Voyage, BGE, Cohere) maps each chunk to a vector of 1024–3072 dimensions. Semantically similar chunks sit close together in this vector space.
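To make "close in vector space" concrete, here is a minimal sketch of cosine similarity. The three-dimensional vectors are made up for illustration; real embedding models return vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)

# Made-up toy embeddings (not produced by any real model).
refund_policy = [0.9, 0.1, 0.0]
return_rules = [0.8, 0.2, 0.1]
weather_report = [0.0, 0.1, 0.9]

print(cosine_similarity(refund_policy, return_rules))    # high: related topics
print(cosine_similarity(refund_policy, weather_report))  # low: unrelated topics
```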
③ Retrieve
Embed the query and find the K nearest chunks by cosine similarity. The 2026 industry consensus is that pure vector retrieval fails about 40% of the time, so always combine it with BM25 keyword search in a hybrid search.
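The keyword half of hybrid search can be sketched with a small BM25 scorer. This is a simplified textbook version (lowercased whitespace tokens, the common `k1` and `b` defaults), not the implementation used by any particular vector DB, and the sample documents are invented. It shows why an exact identifier is easy for keyword search to catch.

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with BM25. Assumes docs is non-empty."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

docs = [
    "Error code E-4012 means the payment gateway timed out",
    "Payment failures are usually caused by network problems",
    "Refunds take five business days",
]
print(bm25_scores("E-4012", docs))  # only the first document matches the ID
```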
④ Rerank
A cross-encoder (Cohere Rerank, Voyage Rerank, bge-reranker) scores each query-chunk pair jointly, reorders the top-50 candidates, and keeps the top 5. Cross-encoders are much more accurate than bi-encoders but slower, so run them only on a small candidate set.
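The two-stage shape (retrieve wide, rerank narrow) can be sketched with a pluggable pair scorer. Here `toy_pair_score` is a hypothetical stand-in based on word overlap; a real cross-encoder is a neural model that reads the query and the chunk together, so treat this only as an illustration of the control flow.

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score_pair: Callable[[str, str], float], top_n: int = 5) -> list[str]:
    """Score every (query, candidate) pair and keep the top_n best candidates."""
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep input order
    return [doc for _, doc in scored[:top_n]]

def toy_pair_score(query: str, doc: str) -> float:
    """Hypothetical stand-in for a cross-encoder: Jaccard overlap of lowercased words."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0

candidates = [
    "the cat sat",
    "reset your password in settings",
    "password policy",
]
print(rerank("how to reset password", candidates, toy_pair_score, top_n=2))
```

Because the scorer is called once per candidate, the cost grows linearly with the candidate count, which is why reranking is limited to a few dozen candidates.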
⑤ Generate
Insert the top-K chunks into the prompt: "Answer the question using the following sources. If the answer is not in the sources, say 'not in the sources.'" Require the LLM to cite its sources.
3. Complete RAG Example (Hybrid + Rerank)
# pip install weaviate-client cohere openai
import weaviate, cohere
from openai import OpenAI

oai = OpenAI()                    # reads OPENAI_API_KEY from the environment
co = cohere.Client()              # expects CO_API_KEY in the environment
db = weaviate.connect_to_local()  # call db.close() when finished
papers = db.collections.get("Papers")

def rag(query: str, k_retrieve=30, k_final=5):
    # ③ Hybrid retrieve: vector + BM25 fused (alpha=0.5 weights both equally)
    hits = papers.query.hybrid(query=query, alpha=0.5, limit=k_retrieve).objects
    docs = [h.properties["text"] for h in hits]
    if not docs:
        # Nothing retrieved: skip reranking and generation
        return "Not in the sources.", []
    # ④ Rerank with cross-encoder
    reranked = co.rerank(model="rerank-english-v3.0", query=query,
                         documents=docs, top_n=k_final)
    top = [docs[r.index] for r in reranked.results]
    # ⑤ Generate with citation requirement
    ctx = "\n\n".join(f"[{i+1}] {d}" for i, d in enumerate(top))
    resp = oai.chat.completions.create(model="gpt-4.1", messages=[
        {"role": "system",
         "content": "Answer ONLY from the sources. Cite [n] inline. If absent, say so."},
        {"role": "user", "content": f"Sources:\n{ctx}\n\nQuestion: {query}"},
    ])
    # Return the answer together with the chunks it was allowed to cite
    return resp.choices[0].message.content, top

# Bare-bones vector RAG with Chroma
import chromadb
from openai import OpenAI

oai = OpenAI()
col = chromadb.PersistentClient().get_or_create_collection("docs")

def embed(t):
    return oai.embeddings.create(model="text-embedding-3-small", input=t).data[0].embedding

def rag(q, k=5):
    docs = col.query(query_embeddings=[embed(q)], n_results=k)["documents"][0]
    ctx = "\n".join(f"- {d}" for d in docs)
    r = oai.chat.completions.create(model="gpt-4.1-mini",
        messages=[{"role": "user", "content": f"Use:\n{ctx}\n\nQ: {q}"}])
    return r.choices[0].message.content
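4. Hybrid Search: Tuning alpha
alpha sets the balance between the vector score and the BM25 score (in the Weaviate hybrid call above, alpha=1 is pure vector search and alpha=0 is pure keyword search). Changing it shifts which results come back.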
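The effect of alpha can be simulated offline. The sketch below min-max normalizes each score list and blends them as alpha * vector + (1 - alpha) * BM25. This is one plausible fusion scheme chosen for illustration, not necessarily what a given vector DB does internally, and the document IDs and scores are invented.

```python
def hybrid_rank(vector_scores: dict[str, float], bm25_scores: dict[str, float],
                alpha: float) -> list[str]:
    """Rank document IDs by a weighted blend of normalized scores.

    Assumes both score dicts are non-empty. alpha=1 uses only vector scores,
    alpha=0 uses only BM25 scores.
    """
    def normalize(scores: dict[str, float]) -> dict[str, float]:
        lo, hi = min(scores.values()), max(scores.values())
        span = hi - lo
        return {doc: (s - lo) / span if span else 0.0 for doc, s in scores.items()}

    vec, kw = normalize(vector_scores), normalize(bm25_scores)
    fused = {doc: alpha * vec.get(doc, 0.0) + (1 - alpha) * kw.get(doc, 0.0)
             for doc in set(vec) | set(kw)}
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(fused, key=lambda doc: (-fused[doc], doc))

# Invented scores: doc_a is semantically closest, doc_c contains the exact keyword.
vector_scores = {"doc_a": 0.82, "doc_b": 0.79, "doc_c": 0.40}
keyword_scores = {"doc_a": 0.0, "doc_b": 1.2, "doc_c": 7.5}

for alpha in (0.0, 0.5, 1.0):
    print(alpha, hybrid_rank(vector_scores, keyword_scores, alpha))
```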
5. Agentic RAG: Letting the Agent Decide How to Retrieve
Since 2024, the limits of single-shot RAG (one retrieval, then generation) have become clear:
- "Compare A and B" needs two independent retrievals;
- "Q1 2026 sales" needs to retrieve the definition first and then the numbers;
- When retrieved results conflict, the agent has to judge which source to trust.
Agentic RAG turns retrieval itself into one of the agent's tools: the LLM decides what to query, how many times, and how to merge the results. Common patterns:
- Self-querying: the LLM rewrites the user query into a better search query (adding synonyms, decomposing it).
- Multi-step retrieval: when the first retrieval is insufficient, reformulate the query and search again.
- Routing: route by query type to different indexes (docs, API, SQL).
- Verification: a second retrieval checks the first answer.
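The control loop behind these patterns can be sketched as follows. The names `agentic_retrieve`, `toy_plan`, and `toy_search` are hypothetical; in a real agent the planner would be an LLM call that sees the question and the evidence so far, and the search function would hit a vector DB. The stubs here only make the loop runnable.

```python
from typing import Callable

def agentic_retrieve(question: str,
                     plan: Callable[[str, list[str]], list[str]],
                     search: Callable[[str], list[str]],
                     max_rounds: int = 3) -> list[str]:
    """Let a planner choose search queries round by round until it asks for none."""
    evidence: list[str] = []
    for _ in range(max_rounds):
        queries = plan(question, evidence)  # planner decides what to look up next
        if not queries:                     # empty list means "enough evidence"
            break
        for query in queries:
            for hit in search(query):
                if hit not in evidence:     # merge results without duplicates
                    evidence.append(hit)
    return evidence

# Toy stand-ins so the sketch runs without an LLM or a database.
CORPUS = {
    "plan a pricing": ["Plan A costs $10 per seat."],
    "plan b pricing": ["Plan B costs $25 per seat."],
}

def toy_search(query: str) -> list[str]:
    return CORPUS.get(query.lower(), [])

def toy_plan(question: str, evidence: list[str]) -> list[str]:
    # Round one: decompose the comparison into two lookups. Afterwards: stop.
    return [] if evidence else ["Plan A pricing", "Plan B pricing"]

print(agentic_retrieve("Compare Plan A and Plan B pricing", toy_plan, toy_search))
```

The `max_rounds` cap matters in practice: without it, a planner that keeps reformulating queries can loop indefinitely.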
6. Common Mistakes That Make RAG Fail
① Chunks too large or too small
Too large, and one chunk covers several topics, so its similarity score becomes unreliable. Too small, and the chunk loses its context. Rule of thumb: 300–600 tokens with 10% overlap.
② No hybrid search
Pure vector retrieval fails on exact IDs, rare words, and names. Hybrid search is the 2026 default.
③ No reranking
The first-stage top-30 usually contains a lot of noise. Adding a cross-encoder reranker can raise Recall@5 from about 60% to about 85%.
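Recall@k is the share of relevant chunks that appear among the first k results, so reranking helps when it moves relevant chunks from deeper in the candidate list into the top 5. A minimal sketch of the metric, using invented chunk IDs and relevance labels:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant IDs that appear in the first k retrieved IDs."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

relevant = {"d1", "d2", "d3"}                             # hand-labeled ground truth (invented)
first_stage = ["d7", "d2", "d9", "d4", "d8", "d1", "d3"]  # order before reranking
reranked = ["d2", "d1", "d3", "d7", "d9", "d4", "d8"]     # same candidates, reordered

print(recall_at_k(first_stage, relevant, k=5))
print(recall_at_k(reranked, relevant, k=5))
```

Measuring this on a labeled query set before and after adding a reranker is the way to check whether the improvement holds for your own data.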
④ Prompt does not require citations
Have the LLM mark citations such as [1][2] against the chunks; otherwise you cannot audit the answer or tell whether it is hallucinated.
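Once answers carry [n] markers, a simple audit becomes possible. The sketch below is a hypothetical helper that assumes citations are written before the sentence-ending punctuation; it flags citation numbers that point at no source, sources that were never cited, and sentences with no citation at all.

```python
import re

def audit_citations(answer: str, num_sources: int) -> dict:
    """Check the [n] markers in an answer against the number of sources supplied."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    valid = set(range(1, num_sources + 1))
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    return {
        "invalid": sorted(cited - valid),   # cited numbers with no matching source
        "unused": sorted(valid - cited),    # sources the answer never cited
        "uncited_sentences": [s for s in sentences if not re.search(r"\[\d+\]", s)],
    }

answer = "Refunds take five days [1]. Shipping is free [4]. Contact support for help."
print(audit_citations(answer, num_sources=3))
```

A check like this only confirms that citations exist and are in range; verifying that the cited chunk actually supports the sentence still needs a human or a second model pass.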
🎓 Chapter Quiz
Q1. What chunk size is recommended as of 2026?
Q2. Why is hybrid search better than pure vector search?
Q3. What is the core improvement of Agentic RAG over single-shot RAG?