Ask Questions of Your PDF, Part 6: Storing Embeddings in a Vector DB | SkillhubAI

Step 5: Storing Embeddings in a Vector DB

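Next, we store the embeddings we created in a vector DB. Once they are stored, we can turn any piece of text into an embedding and quickly search for stored embeddings similar to it (in other words, similar documents).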

What is a vector DB, and why do we need one?

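An ordinary database stores things like names and addresses. A vector DB is a specialized database that stores sequences of numbers (embeddings) and lets you search by asking, "Is there anything similar to this?"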

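For example, when the question "What is Toyota's revenue?" comes in, the vector DB quickly finds the stored chunks whose embeddings are closest to the question's embedding, which in practice are the ones about "Toyota" and "revenue". Pretty neat, right?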
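
To make "search for something similar" concrete, here is a minimal sketch of what a vector DB does conceptually: score every stored vector against the query and return the best matches. The three-dimensional vectors and the helper names `cosine_similarity` and `most_similar` are made up for illustration. Real embeddings have far more dimensions, and a real vector DB uses indexes so it does not have to compare against every vector one by one.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query, stored, top_k=1):
    """Return the top_k (text, score) pairs, best match first."""
    scored = [(text, cosine_similarity(query, vec)) for text, vec in stored]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Made-up 3-dimensional "embeddings" (illustration only)
stored = [
    ("Toyota's revenue rose this fiscal year.", [0.9, 0.1, 0.0]),
    ("How to bake sourdough bread.", [0.0, 0.2, 0.9]),
    ("Honda announced a new motorcycle.", [0.6, 0.6, 0.1]),
]

# Stands in for the embedding of "What is Toyota's revenue?"
query = [0.8, 0.2, 0.0]

best = most_similar(query, stored, top_k=2)
for text, score in best:
    print(f"{score:.3f}  {text}")
```

The Toyota sentence scores highest because its vector points in nearly the same direction as the query vector, while the bread sentence scores close to zero.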

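Let's try Qdrant

There are many kinds of vector DBs, but this time we'll use Qdrant. Why Qdrant?

It runs easily on your local machine (your own PC)
Moving to the cloud is easy (you barely have to change the code)
It has a generous free tier
It is simple to use

Now let's implement it: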

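import uuid

import streamlit as st  # needed for the st.info / st.error / st.success calls below
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

# Settings
QDRANT_PATH = "./local_qdrant"  # where the local DB is stored
COLLECTION_NAME = "my_collection"  # collection name (similar to a table name)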

def init_qdrant():
    """Initialize the Qdrant client and create the collection if needed."""
    try:
        # Create the client (data is stored locally)
        client = QdrantClient(path=QDRANT_PATH)

        # Look up the existing collections
        collections = client.get_collections().collections
        collection_names = [collection.name for collection in collections]

        # Create the collection if it does not exist yet
        if COLLECTION_NAME not in collection_names:
            client.create_collection(
                collection_name=COLLECTION_NAME,
                vectors_config=VectorParams(
                    size=1536,  # number of dimensions of text-embedding-3-small
                    distance=Distance.COSINE  # use cosine similarity
                ),
            )
            st.info("📦 Created a new collection")

        return client

    except Exception as e:
        st.error(f"Qdrant initialization error: {e}")
        return None

def store_embeddings(chunks, embeddings):
    """Store chunks and their embeddings in the vector DB."""
    client = init_qdrant()
    if client is None:
        return False

    try:
        # Prepare the points (records) to store
        points = []
        for chunk, embedding in zip(chunks, embeddings):
            point = PointStruct(
                id=str(uuid.uuid4()),  # unique ID
                vector=embedding,  # embedding vector
                payload={"text": chunk}  # original text (metadata)
            )
            points.append(point)

        # Write all points to the database in one batch
        client.upsert(
            collection_name=COLLECTION_NAME,
            points=points
        )

        st.success(f"✅ Stored {len(points)} chunks in the vector DB!")
        return True

    except Exception as e:
        st.error(f"Storage error: {e}")
        return False

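The key points here are:
- PointStruct: represents one record (a chunk plus its embedding)
- payload: lets you store metadata such as the original text
- upsert: inserts a point, or updates it if a point with the same ID already exists (handy!)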
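
One thing worth noticing: because the code above generates IDs with `uuid.uuid4()`, every run produces brand-new IDs, so `upsert` always inserts and processing the same PDF twice stores every chunk twice. If you would rather have a re-run overwrite the earlier points, one option is to derive the ID from the content. The following is only a sketch: the helper `chunk_id`, the `source` argument, and the namespace string are hypothetical names, not part of the lesson's code.

```python
import uuid

# Hypothetical namespace for this app; any fixed UUID works
APP_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "pdf-qa.example")

def chunk_id(source: str, index: int, chunk: str) -> str:
    """Deterministic UUID for a chunk: the same inputs always give the same ID."""
    key = f"{source}:{index}:{chunk}"
    return str(uuid.uuid5(APP_NAMESPACE, key))

chunks = ["First chunk of text.", "Second chunk of text."]

# Simulate processing the same file twice
ids_first_run = [chunk_id("report.pdf", i, c) for i, c in enumerate(chunks)]
ids_second_run = [chunk_id("report.pdf", i, c) for i, c in enumerate(chunks)]

print(ids_first_run == ids_second_run)
```

To use it, you would pass `id=chunk_id(...)` to `PointStruct` in place of `str(uuid.uuid4())`. The trade-off is that editing a chunk's text changes its ID, so stale points from an older version of the file are not cleaned up automatically.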

Moving from local to cloud is easy too

Running locally during development and in the cloud for production is also easy to set up:

# Local version (for development)
client = QdrantClient(path="./local_qdrant")

# Cloud version (for production): just swap path for url and add an API key!
client = QdrantClient(
    url="https://xxx.qdrant.io",
    api_key="your-api-key"
)
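
If you want the same code to run in both environments without editing it, one possible approach is to pick the connection settings from environment variables. This is a minimal sketch under assumptions: the variable names `QDRANT_URL`, `QDRANT_API_KEY`, and `QDRANT_PATH` and the helper `qdrant_client_kwargs` are hypothetical, chosen here for illustration.

```python
import os

def qdrant_client_kwargs(env=os.environ):
    """Choose QdrantClient keyword arguments from environment variables.

    QDRANT_URL, QDRANT_API_KEY, and QDRANT_PATH are hypothetical names.
    """
    url = env.get("QDRANT_URL")
    if url:
        # Cloud: connect to a hosted instance with an API key
        return {"url": url, "api_key": env.get("QDRANT_API_KEY")}
    # Local: store the data in a folder on disk
    return {"path": env.get("QDRANT_PATH", "./local_qdrant")}

# Nothing set: falls back to the local folder
local_kwargs = qdrant_client_kwargs({})

# URL set: switches to the cloud settings
cloud_kwargs = qdrant_client_kwargs(
    {"QDRANT_URL": "https://xxx.qdrant.io", "QDRANT_API_KEY": "your-api-key"}
)

print(local_kwargs)
print(cloud_kwargs)
```

The client would then be created with `QdrantClient(**qdrant_client_kwargs())`, which also keeps the API key out of the source code.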
