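What Is Multimodal RAG?

Traditional RAG (Retrieval-Augmented Generation) usually handles plain text only. But real-world knowledge bases contain more than text: charts inside PDFs, screenshots in slide decks, architecture diagrams in technical documents, even scanned paper documents. Multimodal RAG addresses exactly this problem: it lets AI understand and retrieve data in multiple formats.

I recently worked on an enterprise knowledge base project with more than 5,000 PDFs (containing charts and tables), 3,000 technical architecture diagrams, and a large number of Word documents. Traditional text-only RAG simply could not find information that lives inside images. After we introduced Multimodal RAG, users could ask "Where is last quarter's revenue trend chart?" and the system would go straight to the matching chart.

If you're not yet comfortable with the basic RAG architecture, I'd suggest reading the RAG document chunking strategies post first, then coming back to multimodal once you have the foundation.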

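Multimodal RAG Architecture Design

A Multimodal RAG architecture is considerably more complex than a text-only one. There are three main design approaches:

Option 1: Unified Embedding Space

Use a multimodal model such as CLIP or ImageBind to map text and images into the same vector space. At query time, whether the query is text or an image, you can compare similarity directly.

Pros: simple architecture, fast queries
Cons: the original CLIP was trained mostly on English data, so Chinese support is weak, and its grasp of fine image detail is limited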
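
To make the "shared vector space" idea concrete, here is a minimal sketch. It assumes every item, text or image, has already been embedded into the same space; the three-dimensional vectors and file names below are made up for illustration and stand in for real CLIP embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings (hypothetical values).
# Images and text live in the same index because they share one space.
index = {
    "revenue_chart.png": [0.9, 0.1, 0.0],
    "architecture_diagram.png": [0.1, 0.9, 0.2],
    "quarterly_summary.txt": [0.8, 0.2, 0.1],
}

def search(query_vector, k=2):
    """Return the k items closest to the query, regardless of modality."""
    scored = [(name, cosine_similarity(query_vector, vec)) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Pretend this is the embedding of the text query "revenue trend"
query = [1.0, 0.0, 0.0]
top = search(query)
```

With these toy values the chart image and the text summary both rank above the architecture diagram, which is the whole point: one query, one index, mixed modalities.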

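Option 2: Text Descriptions as an Intermediary

First use a vision-language model (such as GPT-4V or Claude Vision) to generate a text description of every image, then retrieve with ordinary text RAG.

Pros: reuses your existing text RAG stack; Chinese-friendly
Cons: retrieval quality depends on description quality; preprocessing is slow

Option 3: Hybrid Architecture (Recommended)

Combine Options 1 and 2: for each image, store both a CLIP embedding and an embedding of the GPT-4V-generated description. At query time, search both vector spaces and merge the results.

I personally recommend Option 3 because it balances precision and recall. That said, if your budget is limited, Option 2 is already good enough on its own.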

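PDF Parsing Strategies

PDF is one of the trickiest formats because it is fundamentally a layout format, not a semantic one. Here are a few commonly used parsing tools: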

Unstructured

from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="report.pdf",
    strategy="hi_res",
    extract_images_in_pdf=True,
    extract_image_block_types=["Image", "Table"],
    extract_image_block_output_dir="./extracted_images"
)

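Unstructured is among the most full-featured document parsing libraries available. It automatically identifies different element types such as paragraphs, titles, images, and tables. The hi_res strategy runs a layout detection model and applies OCR where needed, which is what lets it handle scanned documents.

PyMuPDF (fitz)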

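import fitz  # PyMuPDF

doc = fitz.open("report.pdf")
for page_num, page in enumerate(doc):
    text = page.get_text()  # plain text of the page
    # Each entry describes one embedded image; its first field is the xref
    images = page.get_images(full=True)
    for img_idx, img in enumerate(images):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        # PNG cannot store CMYK, so convert such images to RGB first
        if pix.n - pix.alpha >= 4:
            pix = fitz.Pixmap(fitz.csRGB, pix)
        pix.save(f"page{page_num}_img{img_idx}.png")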

PyMuPDF is much faster but comparatively basic. If your PDFs are well structured (not scans), PyMuPDF is the better choice.
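
If you want to pick a parser automatically, one rough heuristic is to check how much text comes out of each page: scanned PDFs tend to yield almost none. The sketch below is only an illustration of that idea. The function names and both thresholds are my own assumptions, not part of either library, and you would want to tune them on your own documents.

```python
def looks_scanned(page_texts, min_chars_per_page=50, max_empty_ratio=0.5):
    """Guess whether a PDF is a scan: most pages yield almost no extractable text.

    page_texts is a list with one extracted-text string per page
    (for example, collected from page.get_text() in PyMuPDF).
    Both thresholds are hypothetical defaults.
    """
    if not page_texts:
        return True  # nothing extractable at all, so treat it as a scan
    empty_pages = sum(1 for text in page_texts if len(text.strip()) < min_chars_per_page)
    return empty_pages / len(page_texts) > max_empty_ratio

def choose_parser(page_texts):
    """Route scans to the slower OCR-capable path, everything else to the fast path."""
    return "unstructured-hi_res" if looks_scanned(page_texts) else "pymupdf"

scanned_choice = choose_parser(["", "  ", "x" * 200])
digital_choice = choose_parser(["x" * 200] * 3)
```

The cheap text extraction pass costs very little, so running it first and only falling back to the heavy parser for likely scans can save a lot of processing time on a large corpus.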

Image Processing and Vectorization

There are two main ways to turn images into searchable vectors:

CLIP Embedding

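import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Load the image to embed (example path; any extracted image works)
image = Image.open("page0_img0.png").convert("RGB")

# Embed the image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    image_embedding = model.get_image_features(**inputs)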

Generating Descriptions with a Vision LLM

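import base64

import anthropic

# Read the image and base64-encode it (example path)
with open("page0_img0.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=500,
    messages=[{
        "role": "user",
        "content": [{
            "type": "image",
            "source": {"type": "base64", "media_type": "image/png", "data": img_b64}
        }, {
            "type": "text",
            "text": "Describe this image in detail, including all text, data, and visual elements."
        }]
    }]
)
description = message.content[0].text  # the generated description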

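My advice is to use both. CLIP embeddings handle the coarse filtering (quickly finding visually similar images), and the Vision LLM descriptions handle the fine ranking (making sure the semantics match). For choosing a vector database, see the vector database comparison post.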
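
The coarse-then-fine idea can be sketched as a two-stage function. This is a simplified illustration that assumes you already have, for each image, a CLIP similarity score and a caption similarity score for the current query. The function name, file names, and scores are made up.

```python
def two_stage_search(clip_scores, caption_scores, coarse_k=3, final_k=2):
    """Coarse-filter by CLIP score, then rerank the survivors by caption score."""
    # Stage 1: keep only the coarse_k images with the highest CLIP similarity
    candidates = sorted(clip_scores, key=clip_scores.get, reverse=True)[:coarse_k]
    # Stage 2: reorder those candidates by caption similarity (missing captions score 0)
    reranked = sorted(candidates, key=lambda img: caption_scores.get(img, 0.0), reverse=True)
    return reranked[:final_k]

# Hypothetical scores for one query
clip_scores = {"a.png": 0.91, "b.png": 0.88, "c.png": 0.85, "d.png": 0.40}
caption_scores = {"a.png": 0.55, "b.png": 0.92, "c.png": 0.70, "d.png": 0.99}

result = two_stage_search(clip_scores, caption_scores)
```

Note the trade-off this makes visible: d.png has the best caption score but never reaches the rerank stage because it was cut in the coarse pass. If that worries you, raise coarse_k or merge the two rankings instead of chaining them.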

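Table Extraction

Tables in PDFs are another big challenge. Commonly used tools include:

Camelot: built specifically for PDF tables; supports two modes, lattice (tables with ruling lines) and stream (tables without ruling lines)
Tabula: a Java-based table extraction tool with a Python wrapper
Unstructured: automatically detects table regions and extracts them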

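import camelot

tables = camelot.read_pdf("report.pdf", pages="1-5", flavor="lattice")
for table in tables:
    df = table.df  # pandas DataFrame holding the extracted cells
    markdown = df.to_markdown()  # requires the tabulate package
    # Store the Markdown table in the vector database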

In my experience, converting tables to Markdown before embedding works best, because LLMs understand Markdown tables much better than tables flattened into plain text.
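
One detail to watch: as far as I can tell, Camelot puts the header row in the first data row of the DataFrame and leaves the column labels as integers, so a plain to_markdown() call can produce a table headed by 0, 1, 2. Below is a small dependency-free sketch that treats the first row as the header instead. The rows_to_markdown helper is my own; with Camelot you would feed it something like table.df.values.tolist().

```python
def rows_to_markdown(rows):
    """Render a list of rows as a Markdown table, using the first row as the header."""
    if not rows:
        return ""

    def fmt(row):
        # Escape pipes and flatten newlines so a cell cannot break the table layout
        cells = [str(cell).replace("|", "\\|").replace("\n", " ").strip() for cell in row]
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    separator = "| " + " | ".join("---" for _ in header) + " |"
    return "\n".join([fmt(header), separator, *(fmt(row) for row in body)])

# Example rows; the values are made up
rows = [["Quarter", "Revenue"], ["Q1", "120"], ["Q2", "135"]]
table_md = rows_to_markdown(rows)
```

Keeping the real header in the Markdown matters for retrieval, since column names like "Revenue" are often exactly the words a user searches for.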

Vector Databases and Hybrid Search

Multimodal data needs its own indexing strategy. I recommend a vector database that supports metadata filtering:

from langchain_community.vectorstores import Chroma

# One collection per data type.
# text_embeddings and clip_embeddings are LangChain Embeddings objects created elsewhere.
text_store = Chroma(collection_name="texts", embedding_function=text_embeddings)
image_store = Chroma(collection_name="images", embedding_function=clip_embeddings)

# Hybrid search
def hybrid_search(query, k=5):
    text_results = text_store.similarity_search(query, k=k)
    image_results = image_store.similarity_search(query, k=k)
    # Merge the two ranked lists with RRF (Reciprocal Rank Fusion);
    # reciprocal_rank_fusion is a helper you need to define yourself
    return reciprocal_rank_fusion(text_results, image_results)
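
The reciprocal_rank_fusion helper called above is not defined in the snippet. Here is a minimal sketch of how it could look. Each item scores the sum of 1 / (k + rank) over every list it appears in, and k=60 is the commonly used default constant. The key parameter is my own addition so that unhashable results can be identified by some field; for LangChain Document objects you might pass key=lambda d: d.page_content, which is an assumption about what makes two results "the same".

```python
def reciprocal_rank_fusion(*result_lists, k=60, key=lambda item: item):
    """Merge ranked lists: each item scores sum(1 / (k + rank)) across the lists."""
    scores = {}
    items = {}
    for results in result_lists:
        for rank, item in enumerate(results, start=1):
            item_id = key(item)
            scores[item_id] = scores.get(item_id, 0.0) + 1.0 / (k + rank)
            items.setdefault(item_id, item)  # keep the first occurrence of each item
    ranked_ids = sorted(scores, key=scores.get, reverse=True)
    return [items[item_id] for item_id in ranked_ids]

# Toy ranked lists standing in for the two search results
text_hits = ["doc_a", "doc_b", "doc_c"]
image_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion(text_hits, image_hits)
```

RRF only looks at ranks, not raw scores, which is why it suits this setup: similarity scores from a text embedding space and a CLIP space are not directly comparable, but positions in a ranked list are.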

Metadata matters a lot in Multimodal RAG. Tag every chunk with its source (which PDF, which page), its type (text/image/table), and any relevant context.
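
As a rough illustration of what that tagging might look like, here is a sketch of a metadata record plus a simple in-memory filter. The field names and the filter_chunks helper are my own choices, not a required schema. I keep every value a flat string or integer on the assumption that your vector store, like most, only accepts scalar metadata values.

```python
from dataclasses import dataclass, asdict

@dataclass
class ChunkMetadata:
    source: str        # file the chunk came from
    page: int          # 1-based page number within that file
    type: str          # "text", "image", or "table"
    context: str = ""  # nearby heading or caption, if any

def filter_chunks(chunks, **conditions):
    """Keep chunks whose metadata matches every key=value condition."""
    return [
        chunk for chunk in chunks
        if all(chunk["metadata"].get(key) == value for key, value in conditions.items())
    ]

# Example chunks; the contents are made up
chunks = [
    {"text": "Revenue grew in Q3.",
     "metadata": asdict(ChunkMetadata("report.pdf", 3, "text", "Financial summary"))},
    {"text": "Line chart of quarterly revenue.",
     "metadata": asdict(ChunkMetadata("report.pdf", 4, "image", "Figure 2"))},
    {"text": "| Quarter | Revenue |",
     "metadata": asdict(ChunkMetadata("slides.pdf", 1, "table"))},
]

images_in_report = filter_chunks(chunks, source="report.pdf", type="image")
```

In a real system the same conditions would go into the vector database's own metadata filter rather than a Python loop, so the filtering happens before the similarity search.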

Complete Implementation Example

Putting it all together:

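from langchain_anthropic import ChatAnthropic
from langchain_community.vectorstores import Chroma
from unstructured.partition.pdf import partition_pdf

class MultimodalRAG:
    def __init__(self, text_embeddings):
        # Both stores index text (image captions are text too), so they share
        # one LangChain Embeddings object, passed in by the caller
        self.text_store = Chroma(collection_name="texts",
                                 embedding_function=text_embeddings)
        self.image_store = Chroma(collection_name="images",
                                  embedding_function=text_embeddings)
        self.llm = ChatAnthropic(model="claude-sonnet-4-20250514")

    def ingest_pdf(self, pdf_path):
        elements = partition_pdf(filename=pdf_path, strategy="hi_res",
                                 infer_table_structure=True,  # fills text_as_html
                                 extract_image_block_types=["Image"],
                                 extract_image_block_output_dir="./extracted_images")
        for elem in elements:
            if elem.category == "Image":
                # generate_caption: the Vision LLM call from the earlier section,
                # wrapped as a method that takes an image path (not shown here)
                caption = self.generate_caption(elem.metadata.image_path)
                self.image_store.add_texts([caption],
                    metadatas=[{"source": pdf_path, "type": "image"}])
            elif elem.category == "Table":
                # Keep the HTML rendering so the row/column structure survives
                html = elem.metadata.text_as_html or elem.text
                self.text_store.add_texts([html],
                    metadatas=[{"source": pdf_path, "type": "table"}])
            elif elem.text:  # skip empty elements
                self.text_store.add_texts([elem.text],
                    metadatas=[{"source": pdf_path, "type": "text"}])

    def query(self, question):
        # hybrid_search: the function from the previous section, turned into a
        # method that searches self.text_store and self.image_store (not shown here)
        docs = self.hybrid_search(question)
        context = "\n\n".join(doc.page_content for doc in docs)
        response = self.llm.invoke(
            f"Answer the question based on the following material:\n\n{context}\n\nQuestion: {question}"
        )
        return response.content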

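Performance Optimization and Best Practices

Batch image processing: don't call the Vision API one image at a time; batch wherever possible to save time and cost
Cache embeddings: the same image never needs its embedding computed twice
Tiered retrieval: coarse-filter with cheap CLIP first, then fine-rank with the expensive Vision LLM
Build parent-child links: record in each image's metadata which document and page it came from, so answers can include the full context
Evaluate regularly: use RAG evaluation metrics to monitor retrieval quality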
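
For the embedding cache in particular, one simple approach is to key the cache on a hash of the image bytes, so identical images are recognized even when they appear under different file names. The sketch below is an in-memory illustration only. The EmbeddingCache class and the fake_embed stand-in are hypothetical, and a production version would persist the cache to disk or a database.

```python
import hashlib

class EmbeddingCache:
    """Cache embeddings by content hash so identical images are embedded once."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # any function mapping image bytes to a vector
        self._cache = {}
        self.misses = 0           # how many times embed_fn was actually called

    def get(self, image_bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self.embed_fn(image_bytes)
        return self._cache[key]

def fake_embed(image_bytes):
    # Stand-in for a real CLIP call; returns a dummy one-dimensional vector
    return [float(len(image_bytes))]

cache = EmbeddingCache(fake_embed)
first = cache.get(b"same image bytes")
second = cache.get(b"same image bytes")   # served from the cache
third = cache.get(b"a different image")
```

The same pattern works for Vision LLM captions, where the savings are larger because each call costs real money.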

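Conclusion

Multimodal RAG is a natural evolution of RAG. As Vision LLMs grow more capable, the barrier to handling multimodal data keeps dropping. If your knowledge base holds a lot of non-text data, now is a good time to adopt Multimodal RAG. Start with the simplest approach, text descriptions as an intermediary, and upgrade step by step to the hybrid architecture once the system is stable. Remember: the best architecture is not the most complex one but the one that best fits your data.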