STEP 7 / 12 · Advanced Capabilities

Retrieval-Augmented Generation (RAG)

Chunking, embedding, hybrid search, rerank, and Agentic RAG: grounding agents in trustworthy, updatable knowledge.

1. Why RAG is a must

An LLM's weights are frozen at its training cutoff. To let an agent work with internal company documents, the latest regulations, or live data, there are three options:

  1. Fine-tuning: bake the knowledge into the weights. Expensive, slow, needs retraining for every change, and cannot update in real time.
  2. Stuff all the data into the context: even a 1M-token context cannot hold 100MB of documents, and it is slow and costly.
  3. RAG (Retrieval-Augmented Generation): chunk the data into a vector DB, then for each query retrieve the top-K most relevant snippets and insert them into the prompt.

RAG is the best answer for the large majority of scenarios (about 90% by this chapter's rule of thumb): knowledge can be updated dynamically, cost is low, answers are auditable (you can see which passage the agent cited), and, by this chapter's estimate, hallucinations drop by 60–80%.

This chapter breaks RAG into 5 stages: chunk → embed → retrieve → rerank → generate.

2. The five stages of the RAG pipeline

┌─ INDEX TIME (offline) ──────────────────────────────────────────┐
│ Docs → ① Chunk → ② Embed → store in Vector DB + Keyword Index   │
└─────────────────────────────────────────────────────────────────┘
┌─ QUERY TIME (per request) ──────────────────────────────────────┐
│ Query → ③ Retrieve (hybrid) → ④ Rerank → ⑤ Generate (LLM)       │
└─────────────────────────────────────────────────────────────────┘

① Chunking

Split long documents into chunks of 200–800 tokens with 10–15% overlap. Naive fixed-length splitting tends to cut through semantic units; prefer semantic chunking that breaks at paragraphs, headings, or sentence-ending punctuation.
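As a rough illustration of sentence-boundary chunking with overlap, the sketch below packs whole sentences into a chunk until a token budget is reached, then carries the tail of that chunk into the next one. It approximates tokens with whitespace-separated words, which is an assumption; `chunk_text`, `max_tokens`, and `overlap_ratio` are illustrative names, not part of any library.

```python
import re

def chunk_text(text: str, max_tokens: int = 400, overlap_ratio: float = 0.1) -> list[str]:
    """Greedy sentence-boundary chunking with overlap.

    Assumption: one whitespace-separated word counts as one token.
    Swap in a real tokenizer to measure the budget in model tokens.
    """
    # Split after sentence-ending punctuation (Latin or CJK) or on blank lines.
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?。!?])\s+|\n{2,}", text)
                 if s.strip()]
    overlap = int(max_tokens * overlap_ratio)
    chunks: list[str] = []
    current: list[str] = []           # words of the chunk being built
    for sentence in sentences:
        words = sentence.split()
        if current and len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            # Carry the last `overlap` words forward so context spans the boundary.
            current = current[-overlap:] if overlap else []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks

# Usage: 30 ten-word sentences, budget of 50 words, 10% overlap (5 words).
sentence = " ".join(f"w{i}" for i in range(9)) + " end."
demo_chunks = chunk_text(" ".join([sentence] * 30), max_tokens=50, overlap_ratio=0.1)
print(len(demo_chunks), [len(c.split()) for c in demo_chunks])
```

The budget is soft: a single sentence longer than `max_tokens` is kept whole rather than split, which is one possible design choice among several.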

② Embedding

An embedding model (OpenAI text-embedding-3, Voyage, BGE, Cohere) maps each chunk to a dense vector, typically 1024–3072 dimensions. Semantically similar chunks end up close together in the vector space.
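To make "close in the vector space" concrete, here is a minimal sketch of cosine similarity and a brute-force nearest-chunk lookup. The three-dimensional vectors and chunk IDs are made up for illustration; real embeddings come from one of the models above, and a vector DB replaces the linear scan.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0                    # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 2):
    """Brute-force scan: score every chunk, return the k most similar."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical toy "embeddings" (3 dimensions instead of 1024+).
toy_chunks = {
    "refund-policy":  [0.9, 0.1, 0.0],
    "return-window":  [0.8, 0.3, 0.1],
    "gpu-benchmarks": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.2, 0.0]
nearest = top_k(toy_query, toy_chunks, k=2)
print(nearest)
```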

③ Retrieve

Embed the query and find the K nearest chunks by cosine similarity. The 2026 industry consensus: pure vector retrieval fails roughly 40% of the time, so always add BM25 keyword search and run the two together as hybrid search.
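For the keyword half of hybrid search, the following is a minimal BM25 scorer, written out to show why exact tokens matter. It uses the common Lucene-style IDF and naive whitespace tokenization, both of which are assumptions; the function name `bm25_scores` and the sample documents are hypothetical, and a production system would rely on the search engine's own BM25 implementation.

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with BM25 (lowercased whitespace tokens)."""
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue              # term absent: contributes nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

# Hypothetical documents: only the first contains the exact identifier.
sample_docs = [
    "lab report: variant c.5266dupC detected in the sample",
    "frameshift variants can truncate the encoded protein",
    "guidelines for reporting sequence variants",
]
exact_scores = bm25_scores("c.5266dupC", sample_docs)
print(exact_scores)
```

An embedding model may place all three documents near each other, whereas BM25 gives a non-zero score only to the document that contains the literal string.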

④ Rerank

A cross-encoder (Cohere Rerank, Voyage Rerank, bge-reranker) re-scores the top-50 candidates and keeps the top 5. Because a cross-encoder reads the query and the candidate together, it is much more accurate than a bi-encoder but also much slower, so it is run only on a small candidate set.
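The shape of the rerank step can be sketched without a model: score every (query, candidate) pair jointly, sort, and keep the best few. In the sketch below, `toy_overlap_score` is only a stand-in for a cross-encoder (it is plain token overlap, not a neural model), and all names are illustrative; in practice `score_pair` would wrap a call to one of the rerankers listed above.

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score_pair: Callable[[str, str], float],
           top_n: int = 5) -> list[tuple[str, float]]:
    """Score each (query, candidate) pair, then keep the top_n highest."""
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

def toy_overlap_score(query: str, doc: str) -> float:
    """Stand-in scorer (NOT a cross-encoder): Jaccard overlap of lowercase tokens."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if (q | d) else 0.0

# Hypothetical stage-1 candidates, in the order the retriever returned them.
stage1 = [
    "Shipping times vary by region.",
    "Refunds are issued within 14 days of a return request.",
    "How to request a refund for a damaged item.",
]
best = rerank("how do I request a refund", stage1, toy_overlap_score, top_n=2)
print(best)
```

Keeping the scorer pluggable means the same function works whether the pair score comes from a hosted rerank API or a local model.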

⑤ Generate

Insert the top-K chunks into the prompt with an instruction such as: "Answer using the following sources. If the answer is not there, say 'not in the source.'" Require the LLM to cite its sources.
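One way to assemble such a prompt is sketched below: number each chunk so the model can cite it as [n], and put the grounding rules in the system message. The helper name `build_grounded_prompt` and the exact wording of the instructions are assumptions; the returned list follows the common role/content chat-message shape.

```python
def build_grounded_prompt(question: str, chunks: list[str]) -> list[dict[str, str]]:
    """Build chat messages that restrict the model to numbered sources."""
    # Number sources from 1 so citations read naturally as [1], [2], ...
    sources = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    system = (
        "Answer ONLY from the numbered sources. Cite sources inline as [n]. "
        "If the answer is not in the sources, say 'not in the source.'"
    )
    user = f"Sources:\n{sources}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

# Usage with two hypothetical chunks.
messages = build_grounded_prompt(
    "How long do refunds take?",
    ["Refunds are issued within 14 days.", "Shipping is free over $50."],
)
print(messages[1]["content"])
```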

3. Full RAG example (hybrid + rerank)

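# pip install weaviate-client cohere openai
# Assumes API keys for OpenAI and Cohere are configured in the environment, and that a
# local Weaviate instance (v4 client) holds a "Papers" collection with a "text" property.
import weaviate, cohere
from openai import OpenAI

oai = OpenAI()
co  = cohere.Client()
db  = weaviate.connect_to_local()   # call db.close() when finished
papers = db.collections.get("Papers")

def rag(query: str, k_retrieve=30, k_final=5):
    # ③ Hybrid retrieve: vector + BM25 fused (alpha=1 is pure vector, alpha=0 is pure BM25)
    hits = papers.query.hybrid(query=query, alpha=0.5, limit=k_retrieve).objects
    docs = [h.properties["text"] for h in hits]
    if not docs:                    # nothing retrieved: skip rerank and generation
        return "Not in the sources.", []

    # ④ Rerank with cross-encoder; keep only the k_final best candidates
    # (named `reranked` to avoid shadowing the stdlib `re` module)
    reranked = co.rerank(model="rerank-english-v3.0", query=query, documents=docs, top_n=k_final)
    top = [docs[r.index] for r in reranked.results]

    # ⑤ Generate with citation requirement; sources are numbered so the model can cite [n]
    ctx = "\n\n".join(f"[{i+1}] {d}" for i,d in enumerate(top))
    resp = oai.chat.completions.create(model="gpt-4.1",
        messages=[
            {"role":"system","content":"Answer ONLY from the sources. Cite [n] inline. If absent, say so."},
            {"role":"user","content":f"Sources:\n{ctx}\n\nQuestion: {query}"}
        ])
    return resp.choices[0].message.content, top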
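
# Bare-bones vector RAG with Chroma (a separate, minimal alternative to the example above)
# pip install chromadb openai
# Assumes the "docs" collection was already populated with text-embedding-3-small vectors.
import chromadb
from openai import OpenAI

oai = OpenAI()
col = chromadb.PersistentClient().get_or_create_collection("docs")

def embed(t):
    # Embed one string; use the same model that indexed the collection
    return oai.embeddings.create(model="text-embedding-3-small", input=t).data[0].embedding

def rag(q, k=5):
    # Pure vector retrieval: the k chunks nearest to the query embedding
    docs = col.query(query_embeddings=[embed(q)], n_results=k)["documents"][0]
    ctx  = "\n".join(f"- {d}" for d in docs)
    r = oai.chat.completions.create(model="gpt-4.1-mini",
        messages=[{"role":"user","content":f"Use:\n{ctx}\n\nQ: {q}"}])
    return r.choices[0].message.content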

4. Hybrid search: interactive simulation

Adjust alpha (vector weight vs BM25 weight) and watch how the result set shifts:

📐 Observation: α=1 (pure vector) promotes documents that are conceptually similar even when the exact tokens never appear; α=0 (pure BM25) favors exact strings such as c.5266dupC. Production systems typically use α=0.4–0.6.
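The effect of alpha described above can be reproduced offline with a small fusion sketch. It min-max normalises each score list and takes a weighted sum, which is one plausible fusion rule and an assumption here; search engines may fuse differently (for example by rank rather than score). Document IDs and scores are made up.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so vector and BM25 scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}   # no signal if all scores tie
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def hybrid_rank(vector_scores: dict[str, float],
                keyword_scores: dict[str, float],
                alpha: float) -> list[str]:
    """alpha=1 -> pure vector ranking, alpha=0 -> pure keyword (BM25) ranking."""
    v, kw = min_max(vector_scores), min_max(keyword_scores)
    fused = {doc_id: alpha * v.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
             for doc_id in set(v) | set(kw)}
    # Sort by fused score descending; break ties by ID for a stable order.
    return sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))

# Hypothetical scores: doc_concept is semantically close, doc_exact has the literal string.
vector_scores  = {"doc_concept": 0.82, "doc_exact": 0.55, "doc_other": 0.40}
keyword_scores = {"doc_concept": 0.0,  "doc_exact": 7.1,  "doc_other": 1.2}

for alpha in (1.0, 0.5, 0.0):
    print(alpha, hybrid_rank(vector_scores, keyword_scores, alpha))
```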

5. Agentic RAG: letting the agent decide how to retrieve

From 2024 on, the limits of pure single-shot RAG (one retrieval, then one generation) became clear:

  • "Compare A and B" needs two independent retrievals;
  • "Q1 2026 sales" needs to retrieve a definition first and the numbers second;
  • When retrieved results conflict, the agent must judge which one to trust.

Agentic RAG turns retrieval itself into a tool of the agent: the LLM decides what to query, how many times, and how to merge the results. Common patterns:

  1. Self-querying: the LLM rewrites the user query into a better search query (adding synonyms, decomposing it).
  2. Multi-step retrieval: when the first round is insufficient, reformulate the query and search again.
  3. Routing: route by query type to different indexes (docs, API, SQL).
  4. Verification: use a second retrieval to verify the first answer.
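The control flow behind these patterns can be sketched as a loop in which a planner chooses the next search query, or stops, based on the evidence gathered so far. Everything below is hypothetical scaffolding: `toy_planner` is a rule-based stand-in for the LLM, `toy_search` stands in for a real retriever, and the corpus is invented so the sketch runs without any external service.

```python
from typing import Callable, Optional

def agentic_retrieve(question: str,
                     plan_next_query: Callable[[str, list[str]], Optional[str]],
                     search: Callable[[str], list[str]],
                     max_steps: int = 4) -> list[str]:
    """Let the planner call the search tool repeatedly until it has enough evidence."""
    evidence: list[str] = []
    for _ in range(max_steps):                  # hard cap so the loop always terminates
        next_query = plan_next_query(question, evidence)
        if next_query is None:                  # planner judges the evidence sufficient
            break
        for hit in search(next_query):
            if hit not in evidence:             # merge step: de-duplicate across rounds
                evidence.append(hit)
    return evidence

# --- Hypothetical stand-ins (no LLM, no vector DB) ---
CORPUS = {
    "plan a": "Plan A costs $10 per seat and includes SSO.",
    "plan b": "Plan B costs $25 per seat and adds audit logs.",
}

def toy_search(query: str) -> list[str]:
    """Return every corpus entry whose key appears in the query."""
    return [text for key, text in CORPUS.items() if key in query.lower()]

def toy_planner(question: str, evidence: list[str]) -> Optional[str]:
    """Decompose a comparison: issue one sub-query per entity not yet covered."""
    for entity in ("Plan A", "Plan B"):
        mentioned = entity.lower() in question.lower()
        covered = any(entity in item for item in evidence)
        if mentioned and not covered:
            return f"{entity} pricing"
    return None                                 # every mentioned entity is covered

gathered = agentic_retrieve("Compare Plan A and Plan B", toy_planner, toy_search)
print(gathered)
```

With a real model, `plan_next_query` would typically be a tool-calling turn in which the LLM either emits a search call or decides to answer.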
🚀 2026 trend: outlets such as VentureBeat have started talking about the end of the RAG era. For long-lived, stable knowledge (e.g. product docs), a "compilation-stage knowledge layer" (compiling knowledge directly into model weights or a cache) may replace dynamic retrieval. For fresh, fast-changing knowledge, RAG remains irreplaceable.

6. Four common mistakes that make RAG fail

Chunks too large or too small

Too large → one chunk covers several topics and its similarity score blurs. Too small → the chunk loses its context. Rule of thumb: 300–600 tokens + 10% overlap.

No hybrid search

Pure vector retrieval breaks down on exact IDs, rare words, and personal names. Hybrid is the 2026 default.

No rerank

The stage-1 top-30 usually contains a lot of noise. Adding a cross-encoder reranker can lift Recall@5 from about 60% to about 85%.
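Recall@5 here means the fraction of truly relevant chunks that appear in the first five results. A minimal way to measure it before and after reranking is sketched below; the chunk IDs and rankings are invented for illustration and do not reproduce the 60% and 85% figures.

```python
def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top k of the ranking."""
    if not relevant_ids:
        return 0.0
    found = set(ranked_ids[:k]) & relevant_ids
    return len(found) / len(relevant_ids)

# Hypothetical labelled example: chunks c3 and c7 are the relevant ones.
relevant = {"c3", "c7"}
stage1_order   = ["c1", "c2", "c4", "c3", "c5", "c6", "c7", "c8"]   # retriever output
reranked_order = ["c3", "c7", "c1", "c2", "c4", "c5", "c6", "c8"]   # after reranking

before = recall_at_k(stage1_order, relevant, k=5)
after = recall_at_k(reranked_order, relevant, k=5)
print(before, after)
```

Averaging this value over a labelled query set gives the kind of before/after comparison quoted above.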

Prompt doesn't require citations

Have the LLM tag the chunks it uses with [1][2] citations; otherwise you cannot audit the answer or tell whether it is hallucinated.
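Once answers carry [n] markers, a first-pass audit can be automated. The sketch below extracts the markers and flags any that point outside the supplied sources; `audit_citations` is a hypothetical helper, and it checks only that citations exist and are in range, not that the cited chunk actually supports the claim.

```python
import re

def audit_citations(answer: str, num_sources: int) -> dict:
    """Extract [n] markers and separate valid source numbers from out-of-range ones."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    valid = {n for n in cited if 1 <= n <= num_sources}
    return {
        "cited": sorted(valid),               # sources the answer points to
        "invalid": sorted(cited - valid),     # markers with no matching source
        "has_citations": bool(cited),         # False means the answer is unauditable
    }

# Usage: three sources were supplied, but the answer also cites a non-existent [4].
report = audit_citations("Refunds take 14 days [1]. Shipping is free [4].", num_sources=3)
print(report)
```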

🎓 Chapter quiz

Q1. What chunk size does the industry recommend in 2026?

A) The bigger the better
B) 300–600 tokens + 10% overlap
C) One sentence per chunk
D) Must be 8192 tokens
✅ B. Too large blurs similarity, too small loses context; 300–600 is the consensus.

Q2. Why is hybrid search better than pure vector search?

A) Pure vector search is obsolete
B) BM25 handles exact strings and rare terms, covering the weak spot of vectors
C) Hybrid is cheaper
D) It needs no embeddings
✅ B. Vectors capture semantics but are weak on exact tokens; BM25 is the reverse, so the two are complementary.

Q3. What is the core improvement of Agentic RAG over single-shot RAG?

A) Larger embeddings
B) Skipping retrieval
C) Turning retrieval into a tool the agent can call
D) Pinecone is required
✅ C. Retrieval changes from a fixed pre-step into a tool the agent can call on its own initiative.