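Most of the world's data exists as images, audio, and video, not just text. Multimodal LLM systems are evolving to handle this complexity. These models can process several types of input, where each modality refers to a specific data type, such as text, sound, or images.

When working with multimodal LLMs, we usually start with prompt engineering, which often achieves reasonable accuracy. However, even these powerful models can hallucinate.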

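There are two main reasons an LLM may hallucinate:

The model does not have the knowledge needed to answer the question.
The model has the knowledge but still hallucinates.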

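One way to reduce LLM hallucinations is to give the LLM new knowledge. An LLM can acquire new information in at least two ways: (1) weight updates (for example, fine-tuning) and (2) RAG, which passes relevant context to the LLM through the prompt.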
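To make the second route concrete, here is a minimal sketch of how retrieved context can be passed to an LLM through the prompt. The function name `build_rag_prompt` and the prompt wording are illustrative assumptions, not part of any library.

```python
from typing import List


def build_rag_prompt(query: str, contexts: List[str]) -> str:
    """Combine a user query with retrieved context snippets into one prompt."""
    # Number each snippet so the individual pieces of context stay distinguishable.
    context_block = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {query}"
    )


prompt = build_rag_prompt(
    "What is this item supposed to do?",
    ["A compact cooler designed to keep food cold while travelling."],
)
print(prompt)
```

The model's weights stay untouched; only the prompt changes from one query to the next.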

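In this article, we will look at three ways to build a multimodal RAG system and implement each of them in code.

The idea behind multimodal RAG is to let the RAG system inject multiple forms of information into the LLM. For simplicity, we will focus on the image modality alongside text input.

Approaches to building a multimodal RAG system

Below is a brief overview of three ways to build a multimodal RAG pipeline.

These approaches can be classified along two factors: how images and text are embedded for retrieval (multimodal embeddings vs. text-only embeddings) and what data is used to generate the response (text only vs. text plus image).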

Option 1

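Retrieval: Images and text descriptions are processed with a multimodal embedding model (for example, CLIP), which maps images and/or text into a shared embedding space. These embeddings are stored in a vector database. At inference time, given the user's image and query, the system performs retrieval using the shared embeddings of the images and text summaries.

Generation: A multimodal LLM (such as Google Gemini or GPT-4o) then takes the user's image together with the text descriptions of similar images retrieved from the vector database and creates a response.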

Option 2

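Retrieval: When building the embedding database, the images are first passed through a multimodal LLM to generate text summaries. These summaries are then embedded with a text-only embedding model and stored in a vector database. At inference time, given the user's image and query, the system performs retrieval based only on text embeddings.

Generation: Because the retrieved content is plain text, a text-only LLM (it does not need to be multimodal) can generate the response. The user's image is left out of both retrieval and generation.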

Option 3

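Retrieval: Same as Option 2.

Generation: A multimodal LLM (such as Google Gemini or GPT-4o) then takes the user's image together with the text descriptions of similar images retrieved from the vector database and creates a response (as in Option 1).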

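Key takeaways

Option 1: multimodal embeddings for retrieval, multimodal LLM for generation.
Option 2: a multimodal LLM converts images into text summaries, and a text-only embedding model is used for retrieval. A text-only LLM is used for generation.
Option 3: same retrieval as Option 2, but a multimodal LLM is used for generation.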
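The three options differ only in what is embedded for retrieval and what reaches the LLM at generation time. The sketch below records that as data; the `RagOption` class and its field names are hypothetical and serve only as a compact summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RagOption:
    name: str
    retrieval_embedding: str   # what kind of embedding is used for retrieval
    generation_model: str      # which kind of LLM writes the answer
    image_at_generation: bool  # is the user's image sent to the LLM?


OPTIONS = [
    RagOption("Option 1", "multimodal", "multimodal LLM", True),
    RagOption("Option 2", "text-only (image summaries)", "text-only LLM", False),
    RagOption("Option 3", "text-only (image summaries)", "multimodal LLM", True),
]

for opt in OPTIONS:
    print(f"{opt.name}: retrieval={opt.retrieval_embedding}, "
          f"generation={opt.generation_model}, image_sent={opt.image_at_generation}")
```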

Code implementation!

Let's get hands-on. First, we need to import the relevant libraries:

import base64
import json
import os
import pickle
from typing import List, Tuple, Union
import clip
import faiss
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
from openai import OpenAI
from PIL import Image
from torch.utils.data import DataLoader
from tqdm import tqdm

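We will use a small set of sample images and create a description for each one to build our database. To create the summaries, I pass each image to a multimodal LLM and ask it to summarize the image.

Images can be provided to the model in two main ways: by passing a link to the image or by passing the base64-encoded image in the request. Base64 is an encoding scheme that represents binary data, such as an image, as a string of printable ASCII characters.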
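As a quick illustration of what base64 does, the sketch below encodes a few raw bytes (a stand-in for the contents of an image file) with Python's standard `base64` module and decodes them again.

```python
import base64

raw = b"abc"  # stand-in for the bytes of an image file
encoded = base64.b64encode(raw).decode("utf-8")
print(encoded)  # YWJj

# Decoding restores the original bytes exactly.
decoded = base64.b64decode(encoded)
print(decoded == raw)  # True

# Every 3 input bytes become 4 output characters, so the payload grows by about a third.
payload = bytes(range(256)) * 3  # 768 bytes
encoded_len = len(base64.b64encode(payload))
print(len(payload), encoded_len)  # 768 1024
```

The size increase is worth keeping in mind when sending large images inside a request body.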

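Because my images are stored locally, we encode each image as a base64 data URL. The code below reads an image file, converts it to a data URL, and sends it to GPT-4o to generate a short description: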

import os
import json
import base64
from mimetypes import guess_type
# Function to encode a local image into a data URL
def local_image_to_data_url(image_path):
    mime_type, _ = guess_type(image_path)
    if mime_type is None:
        mime_type = "application/octet-stream"
    with open(image_path, "rb") as image_file:
        base64_encoded_data = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{base64_encoded_data}"

base_dir = "../"
description_data = []
path = os.path.join(base_dir, "figures")
openai_api_key = os.getenv('OPEN_AI_KEY')
client = OpenAI(api_key = openai_api_key)
if os.path.isdir(path):
    for image_file in os.listdir(path):
        image_path = os.path.join(path, image_file)
        try:
            data_url = local_image_to_data_url(image_path)
            
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=[
                    {
                        "role": "user",
                        "content": [
                            {
                                "type": "text",
                                "text": "You are tasked with summarizing the description of the images. Give a concise summary of the images provided to you."
                            },
                            {
                                "type": "image_url",
                                "image_url": {
                                    "url": data_url
                                }
                            }
                        ]
                    }
                ],
                max_tokens=30
            )
            
            content = response.choices[0].message.content
            
            description_data.append({
                "image_path": image_path,
                "description": content
            })
        
        except Exception as e:
            print(f"Error processing image {image_path}: {e}")
# Save the descriptions; later sections read them from description_data.json
with open('description_data.json', 'w') as file:
    json.dump(description_data, file, indent=4)

The result is a `description_data.json` file containing one entry for every image in the database. Each entry has two keys, `image_path` and `description`, mapping each image to its description.
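For orientation, here is a sketch of what such a file might contain and how it can be turned into a lookup table. The paths and descriptions below are made up for illustration; real entries depend on your images and on what the model returns.

```python
import json

# Hypothetical file contents; the paths and descriptions are invented examples.
sample_json = """
[
    {"image_path": "../figures/example_1.jpg", "description": "A portable cooler with a carry handle."},
    {"image_path": "../figures/example_2.jpg", "description": "A stainless steel water bottle."}
]
"""

entries = json.loads(sample_json)

# Build a lookup so a description can be fetched directly from an image path.
description_by_path = {e["image_path"]: e["description"] for e in entries}
print(description_by_path["../figures/example_1.jpg"])
```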

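Now our dataset is ready. Let's implement the different options for building a multimodal RAG system described in the previous section.

Option 1: multimodal embeddings for retrieval, multimodal LLM for generation.

We first load the CLIP model, a multimodal embedding model: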

#load model on device. The device you are running inference/training on is either a CPU or GPU if you have.
os.environ["OMP_NUM_THREADS"] = "1" # <-- this is only to make clip compatible for mac users
device = "cpu"
model, preprocess = clip.load("ViT-B/32",device=device)

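Next, we create an embedding for every entry in the dataset and store it in a vector database. This will be the knowledge base we search to give users information about the images they upload.

We start by writing a function that collects all image paths in a given directory: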

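def get_image_paths(directory: str, number: int = None) -> List[str]:
    # Return all file paths in `directory`. If `number` is given, return only
    # the path at that position, wrapped in a one-element list.
    image_paths = []
    count = 0
    # Sort the listing so the order is deterministic and stays aligned with
    # the positions used in the vector index.
    for filename in sorted(os.listdir(directory)):
        image_paths.append(os.path.join(directory, filename))
        if number is not None and count == number:
            return [image_paths[-1]]
        count += 1
    return image_paths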

image_directory = '../figures/'
image_paths = get_image_paths(image_directory)

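Given the paths, the next step is to generate image embeddings with CLIP. In the function below, we first preprocess each image with CLIP's preprocessing function. We then stack the preprocessed images and pass them to the CLIP model in a single batch. The output is an array of embeddings: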

def get_features_from_image_path(image_paths):
    # Preprocess each image into a tensor that CLIP accepts
    images = [preprocess(Image.open(image_path).convert("RGB")) for image_path in image_paths]
    # Stack into one batch of shape (N, 3, H, W) on the model's device
    image_input = torch.stack(images).to(device)
    with torch.no_grad():
        image_features = model.encode_image(image_input).float()
    return image_features

image_features = get_features_from_image_path(image_paths)
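Passing every image through the model in one batch is fine for a small sample set, but memory use grows with the number of images. A minimal sketch of splitting the paths into fixed-size batches follows; the helper name `batched` and the file names are assumptions for illustration.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


paths = [f"img_{i}.jpg" for i in range(5)]  # hypothetical file names
batches = list(batched(paths, 2))
print(batches)
```

Each batch could then be passed to `get_features_from_image_path` and the resulting tensors joined with `torch.cat`.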

With the image embeddings in hand, we can now create the vector database. We will use `faiss` for this:

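# FAISS expects float32 NumPy arrays, so convert the torch tensor first
image_features_np = image_features.cpu().numpy().astype("float32")
index = faiss.IndexFlatIP(image_features_np.shape[1])
index.add(image_features_np)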
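One detail worth knowing: an inner-product index ranks by the raw dot product, which equals cosine similarity only when the vectors have unit length. The pure-Python sketch below uses toy 2-D vectors (not real CLIP embeddings) to show how the ranking can change once vectors are L2-normalized; all helper names are illustrative.

```python
import math
from typing import List, Tuple


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def inner_product(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k(query: List[float], database: List[List[float]], k: int) -> List[Tuple[int, float]]:
    """Brute-force search: score every row, highest inner product first."""
    scores = [(i, inner_product(query, row)) for i, row in enumerate(database)]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


database = [[10.0, 0.0], [1.0, 1.0]]  # toy 2-D "embeddings"
query = [1.0, 1.0]

# Raw inner product favours the long vector at index 0.
raw_best = top_k(query, database, 1)[0][0]

# After normalization the ranking follows direction only, so index 1 wins.
normalized_db = [l2_normalize(v) for v in database]
cos_best = top_k(l2_normalize(query), normalized_db, 1)[0][0]

print(raw_best, cos_best)  # 0 1
```

With FAISS, `faiss.normalize_L2` normalizes a float32 NumPy array in place and can be applied before `index.add` and `index.search` if cosine ranking is what you want.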

Good! Now that the vector database is built, we can bring in GPT-4o. Let's create an `image_query` function that lets us query the LLM with a text query and an image input:

def image_query(query, image_path):
    response = client.chat.completions.create(
        model='gpt-4o',
        messages=[
            {
            "role": "user",
            "content": [
                {
                "type": "text",
                "text": query,
                },
                {
                "type": "image_url",
                "image_url": {
                    # reuse the data-URL helper defined earlier
                    "url": local_image_to_data_url(image_path),
                },
                }
            ],
            }
        ],
        max_tokens=300,
    )
    return response.choices[0].message.content

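Now let's run a similarity search to find the images in our knowledge base that are most similar to the user's image. We do this by computing the embedding of the user's image and retrieving the indices and similarity scores (which FAISS calls distances) of similar images in the database. Note that I embed only the image here, but you could also combine image and text!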

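# image_path is the path to the user's query image
image_search_embedding = get_features_from_image_path([image_path])
# FAISS expects a float32 NumPy array of shape (n_queries, dim)
query_vector = image_search_embedding.cpu().numpy().astype("float32").reshape(1, -1)
distances, indices = index.search(query_vector, 2)
distances = distances[0]
indices = indices[0]
indices_distances = list(zip(indices, distances))
# With an inner-product index, a larger score means a closer match
indices_distances.sort(key=lambda x: x[1], reverse=True)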
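On the remark about combining image and text: because CLIP places both modalities in the same vector space, one simple possibility (an assumption here, not something this article evaluates) is to average the normalized image and text embeddings. The pure-Python sketch below uses toy vectors; `fuse_embeddings` and `image_weight` are hypothetical names.

```python
import math
from typing import List


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_embeddings(image_vec: List[float], text_vec: List[float],
                    image_weight: float = 0.5) -> List[float]:
    """Weighted average of two unit-normalized vectors, renormalized to unit length."""
    if len(image_vec) != len(text_vec):
        raise ValueError("embeddings must have the same dimension")
    img = l2_normalize(image_vec)
    txt = l2_normalize(text_vec)
    mixed = [image_weight * i + (1.0 - image_weight) * t for i, t in zip(img, txt)]
    return l2_normalize(mixed)


# Toy 2-D vectors standing in for an image embedding and a text embedding.
fused = fuse_embeddings([3.0, 0.0], [0.0, 4.0])
print(fused)
```

With CLIP, the text vector would come from `model.encode_text(clip.tokenize([...]))`, and the fused vector would be used as the search query in place of the image-only embedding.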

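Using the most similar image, we look up its description and pass the description and the user query (combined in the prompt), together with the user's image, to GPT-4o: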

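# find_entry is not shown in this article; this is a minimal assumed version
def find_entry(data, key, value):
    for entry in data:
        if entry[key] == value:
            return entry
    return None

similar_path = image_paths[int(indices_distances[0][0])]  # for simplicity, this is also the user image
element = find_entry(description_data, 'image_path', similar_path)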

user_query = 'What is this item supposed to do?'
prompt = f"""
Below is a user query, I want you to answer the query using the description and image provided.

user query:
{user_query}

description:
{element['description']}
"""
image_query(prompt, similar_path)

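That's it! You have just built your first multimodal RAG system using Option 1. It appears to answer the user's question as expected. Great!

Now let's try Option 2!

Option 2: multimodal LLM for image descriptions, text-only embedding model for retrieval, and text-only LLM for generation

Since we already used a multimodal LLM in the previous step to create image descriptions, let's go ahead and build a vector store from the text descriptions in the description_data.json file. We use `text-embedding-3-small` as the embedding model and Chroma as the vector database: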

import uuid
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# initialize embeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small", api_key = openai_api_key)

# create vectorstore for summaries
vectorstore = Chroma(
    collection_name="image_summaries", 
    embedding_function=embeddings
)

# create storage for original data
store = InMemoryStore()
id_key = "image_id"

# initialize the retriever
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)

image_ids = [str(uuid.uuid4()) for _ in description_data]

# create documents using text description of the images
summary_docs = [
    Document(
        page_content=item["description"],
        metadata={
            id_key: image_ids[i],
            "image_path": item["image_path"]
        }
    )
    for i, item in enumerate(description_data)
]

# add those documents to the vectorstore
retriever.vectorstore.add_documents(summary_docs)

# store original data
original_data = [(id_, item) for id_, item in zip(image_ids, description_data)]
retriever.docstore.mset(original_data)

Now, given a user query, use the retriever to fetch the relevant text description:

similar_docs = retriever.invoke("portable refrigerator for travelling")[0]
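The retriever searches over the short summaries but returns the original record linked through the shared id. The pure-Python sketch below illustrates only that indirection: a toy word-overlap score stands in for embedding similarity, and all ids, paths, and texts are invented. It is not how `MultiVectorRetriever` is implemented.

```python
from typing import Dict, List

# Search index: short summaries, each tagged with the id of its original record.
summaries = [
    {"image_id": "a1", "text": "portable refrigerator for camping and travelling"},
    {"image_id": "b2", "text": "stainless steel water bottle"},
]

# Document store: id -> original record (hypothetical paths and descriptions).
docstore: Dict[str, Dict[str, str]] = {
    "a1": {"image_path": "../figures/fridge.jpg",
           "description": "portable refrigerator for camping and travelling"},
    "b2": {"image_path": "../figures/bottle.jpg",
           "description": "stainless steel water bottle"},
}


def overlap_score(query: str, text: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(query: str, k: int = 1) -> List[Dict[str, str]]:
    """Rank summaries against the query, then return the linked original records."""
    ranked = sorted(summaries, key=lambda s: overlap_score(query, s["text"]), reverse=True)
    return [docstore[s["image_id"]] for s in ranked[:k]]


best = retrieve("portable refrigerator for travelling")[0]
print(best["image_path"])
```

This is why the retrieved item exposes keys such as `description` and `image_path`: it is the stored original record rather than the summary document.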

For Option 2, we pass only the description of the most similar image and the user query to GPT-4o. Let's create a function to do this:

def text_query(query, description):
    response = client.chat.completions.create(
        model='gpt-4o',
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": query + "\n\nContext: " + description
                    }
                ]
            }
        ],
        max_tokens=30,
    )
    return response.choices[0].message.content

user_query = 'What is this item supposed to do?'
prompt = f"""
Below is a user query, I want you to answer the query using the description provided.

user query:
{user_query}

description:
{similar_docs['description']}
"""

# Generate response using only text
text_query(prompt, similar_docs['description'])

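Voilà! Another multimodal RAG pipeline, this time without a multimodal embedding model!

Great! Now it's time to look at Option 3.

Option 3: multimodal LLM for image descriptions, text-only embeddings for retrieval, and multimodal LLM for generation

We use the same retriever as in Option 2.

After getting the description of the most similar image in the database, we want to pass this description and the user query (both combined in the prompt) along with the user's image to GPT-4o (as in Option 1). Let's define a function that takes the user query, the image, and the description as input: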

def text_image_query(query, image_path, description):
    response = client.chat.completions.create(
        model='gpt-4o',
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": query,
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            # reuse the data-URL helper defined earlier
                            "url": local_image_to_data_url(image_path),
                        }
                    }
                ]
            }
        ],
        max_tokens=300,
    )
    return response.choices[0].message.content


user_query = 'What is this item supposed to do?'
prompt = f"""
Below is a user query, I want you to answer the query using the description provided.

user query:
{user_query}

description:
{similar_docs['description']}
"""

# Generate response using the prompt (user query + retrieved description) and the user's image
text_image_query(prompt, image_path, similar_docs['description'])

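That's it! In this article, you learned how to build a simple interactive system that lets you chat with a corpus of images!

Conclusion

In this article, we explored three ways to build a multimodal RAG system.

Each approach has its strengths, and which one to choose ultimately depends on your specific use case. In my experience, Option 1 gives good results: retrieval is precise, not only for the closest match but across the top n results. Option 2 is lightweight and may work well when visual details are less important. For example, if you ask "How fast can this car go?", the system may only need the summary of an image in the database to return a good answer.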

