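DocArray is a versatile open-source tool for managing multimodal data. It lets you shape your data however you want and gives you the flexibility to store and search it using various document index backends. Beyond that, you can use your DocArray document index to create a DocArrayRetriever and build LangChain apps on top of it.
This notebook is split into two sections. The first section introduces the five supported document index backends. It explains how to set up and index each backend, and how to build a DocArrayRetriever for finding relevant documents. The second section selects one of these backends and shows how to use it through a basic example.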
Document Index Backends
import random
from docarray import BaseDoc
from docarray.typing import NdArray
from langchain_community.embeddings import FakeEmbeddings
from langchain_community.retrievers import DocArrayRetriever
embeddings = FakeEmbeddings(size=32)
class MyDoc(BaseDoc):
    title: str
    title_embedding: NdArray[32]
    year: int
    color: str
InMemoryExactNNIndex
InMemoryExactNNIndex stores all Documents in memory. It is a good starting point for small datasets when you do not want to launch a database server.
Learn more: docs.docarray.org/user_guide/storing/index_in_memory/
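To make "exact" concrete: an exact nearest-neighbor search compares the query against every stored vector instead of consulting an approximate structure. The following is a minimal pure-Python sketch of that idea, not DocArray's implementation. The names `cosine_similarity` and `exact_top_k` are illustrative, and cosine similarity is assumed as the metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_top_k(query, vectors, k=1):
    """Score every stored vector against the query and return the k best (index, score) pairs."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Three toy 2-dimensional "embeddings" (hypothetical data).
stored = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
nearest = exact_top_k([0.9, 0.1], stored, k=2)
print([i for i, _ in nearest])
```

Because every vector is scored, the cost grows linearly with the number of stored documents, which is why this approach suits small datasets.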
from docarray.index import InMemoryExactNNIndex
# initialize the index
db = InMemoryExactNNIndex[MyDoc]()
# index data
db.index(
    [
        MyDoc(
            title=f"My document {i}",
            title_embedding=embeddings.embed_query(f"query {i}"),
            year=i,
            color=random.choice(["red", "green", "blue"]),
        )
        for i in range(100)
    ]
)
# optionally, you can create a filter query
filter_query = {"year": {"$lte": 90}}
# create a retriever
retriever = DocArrayRetriever(
    index=db,
    embeddings=embeddings,
    search_field="title_embedding",
    content_field="title",
    filters=filter_query,
)
# find the relevant document
doc = retriever.invoke("some query")
print(doc)
[Document(page_content='My document 56', metadata={'id': '1f33e58b6468ab722f3786b96b20afe6', 'year': 56, 'color': 'red'})]
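The filter used with this backend, `{"year": {"$lte": 90}}`, follows a MongoDB-like operator syntax. As a rough illustration of what such a filter means, the sketch below evaluates it against plain dictionaries. The `OPERATORS` table and the `matches` helper are hypothetical names introduced here, and this is not how DocArray evaluates filters internally.

```python
import operator

# Hypothetical operator table covering the operators that appear in this notebook
# ($lte, $eq, $gte) plus their obvious siblings.
OPERATORS = {
    "$eq": operator.eq,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$gt": operator.gt,
    "$gte": operator.ge,
}


def matches(record, filter_query):
    """Return True if every field condition in filter_query holds for record."""
    for field, conditions in filter_query.items():
        for op_name, expected in conditions.items():
            if not OPERATORS[op_name](record[field], expected):
                return False
    return True


# Same shape as the indexed documents: one record per year from 0 to 99.
records = [{"title": f"My document {i}", "year": i} for i in range(100)]
kept = [r for r in records if matches(r, {"year": {"$lte": 90}})]
print(len(kept))
```

Only the records that pass the filter remain candidates for the vector search, so the document returned above always has a `year` of 90 or less.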
HnswDocumentIndex
HnswDocumentIndex is a lightweight Document Index implementation that runs fully locally and is best suited for small to medium-sized datasets. It stores vectors on disk in hnswlib and stores all other data in SQLite.
Learn more: docs.docarray.org/user_guide/storing/index_hnswlib/
from docarray.index import HnswDocumentIndex
# initialize the index
db = HnswDocumentIndex[MyDoc](work_dir="hnsw_index")
# index data
db.index(
    [
        MyDoc(
            title=f"My document {i}",
            title_embedding=embeddings.embed_query(f"query {i}"),
            year=i,
            color=random.choice(["red", "green", "blue"]),
        )
        for i in range(100)
    ]
)
# optionally, you can create a filter query
filter_query = {"year": {"$lte": 90}}
# create a retriever
retriever = DocArrayRetriever(
    index=db,
    embeddings=embeddings,
    search_field="title_embedding",
    content_field="title",
    filters=filter_query,
)
# find the relevant document
doc = retriever.invoke("some query")
print(doc)
[Document(page_content='My document 28', metadata={'id': 'ca9f3f4268eec7c97a7d6e77f541cb82', 'year': 28, 'color': 'red'})]
WeaviateDocumentIndex
WeaviateDocumentIndex is a document index built on top of the Weaviate vector database.
Learn more: docs.docarray.org/user_guide/storing/index_weaviate/
# There's a small difference with the Weaviate backend compared to the others.
# Here, you need to 'mark' the field used for vector search with 'is_embedding=True'.
# So, let's create a new schema for Weaviate that takes care of this requirement.
from pydantic import Field
class WeaviateDoc(BaseDoc):
    title: str
    title_embedding: NdArray[32] = Field(is_embedding=True)
    year: int
    color: str
from docarray.index import WeaviateDocumentIndex
# initialize the index
dbconfig = WeaviateDocumentIndex.DBConfig(host="http://localhost:8080")
db = WeaviateDocumentIndex[WeaviateDoc](db_config=dbconfig)
# index data (use WeaviateDoc so the documents match the index schema)
db.index(
    [
        WeaviateDoc(
            title=f"My document {i}",
            title_embedding=embeddings.embed_query(f"query {i}"),
            year=i,
            color=random.choice(["red", "green", "blue"]),
        )
        for i in range(100)
    ]
)
# optionally, you can create a filter query
filter_query = {"path": ["year"], "operator": "LessThanEqual", "valueInt": "90"}
# create a retriever
retriever = DocArrayRetriever(
    index=db,
    embeddings=embeddings,
    search_field="title_embedding",
    content_field="title",
    filters=filter_query,
)
# find the relevant document
doc = retriever.invoke("some query")
print(doc)
[Document(page_content='My document 17', metadata={'id': '3a5b76e85f0d0a01785dc8f9d965ce40', 'year': 17, 'color': 'red'})]
ElasticDocIndex
ElasticDocIndex is a document index built on top of ElasticSearch.
Learn more in the DocArray documentation.
from docarray.index import ElasticDocIndex
# initialize the index
db = ElasticDocIndex[MyDoc](
    hosts="http://localhost:9200", index_name="docarray_retriever"
)
# index data
db.index(
    [
        MyDoc(
            title=f"My document {i}",
            title_embedding=embeddings.embed_query(f"query {i}"),
            year=i,
            color=random.choice(["red", "green", "blue"]),
        )
        for i in range(100)
    ]
)
# optionally, you can create a filter query
filter_query = {"range": {"year": {"lte": 90}}}
# create a retriever
retriever = DocArrayRetriever(
    index=db,
    embeddings=embeddings,
    search_field="title_embedding",
    content_field="title",
    filters=filter_query,
)
# find the relevant document
doc = retriever.invoke("some query")
print(doc)
[Document(page_content='My document 46', metadata={'id': 'edbc721bac1c2ad323414ad1301528a4', 'year': 46, 'color': 'green'})]
QdrantDocumentIndex
QdrantDocumentIndex is a document index built on top of the Qdrant vector database.
Learn more in the DocArray documentation.
from docarray.index import QdrantDocumentIndex
from qdrant_client.http import models as rest
# initialize the index
qdrant_config = QdrantDocumentIndex.DBConfig(path=":memory:")
db = QdrantDocumentIndex[MyDoc](qdrant_config)
# index data
db.index(
    [
        MyDoc(
            title=f"My document {i}",
            title_embedding=embeddings.embed_query(f"query {i}"),
            year=i,
            color=random.choice(["red", "green", "blue"]),
        )
        for i in range(100)
    ]
)
# optionally, you can create a filter query
filter_query = rest.Filter(
    must=[
        rest.FieldCondition(
            key="year",
            range=rest.Range(
                gte=10,
                lt=90,
            ),
        )
    ]
)
WARNING:root:Payload indexes have no effect in the local Qdrant. Please use server Qdrant if you need payload indexes.
# create a retriever
retriever = DocArrayRetriever(
    index=db,
    embeddings=embeddings,
    search_field="title_embedding",
    content_field="title",
    filters=filter_query,
)
# find the relevant document
doc = retriever.invoke("some query")
print(doc)
[Document(page_content='My document 80', metadata={'id': '97465f98d0810f1f330e4ecc29b13d20', 'year': 80, 'color': 'blue'})]
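Each backend above expects its filter in a different shape. As a side-by-side summary, the sketch below builds the "year at most N" condition in the three dictionary-based syntaxes shown in this section. The helper name `year_at_most` and the dictionary keys are hypothetical; the filter shapes are copied from the cells above, including the string-valued `valueInt` used in the Weaviate cell. The Qdrant filter is left out because it is built from `qdrant_client` model objects rather than plain dictionaries.

```python
def year_at_most(limit):
    """Build a 'year <= limit' filter in each dictionary-based syntax shown above."""
    return {
        # InMemoryExactNNIndex and HnswDocumentIndex
        "in_memory_or_hnsw": {"year": {"$lte": limit}},
        # WeaviateDocumentIndex (value passed as a string, mirroring the cell above)
        "weaviate": {
            "path": ["year"],
            "operator": "LessThanEqual",
            "valueInt": str(limit),
        },
        # ElasticDocIndex
        "elastic": {"range": {"year": {"lte": limit}}},
    }


filters = year_at_most(90)
for backend, filter_query in filters.items():
    print(backend, filter_query)
```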
Movie Retrieval using HnswDocumentIndex
movies = [
    {
        "title": "Inception",
        "description": "A thief who steals corporate secrets through the use of dream-sharing technology is given the task of planting an idea into the mind of a CEO.",
        "director": "Christopher Nolan",
        "rating": 8.8,
    },
    {
        "title": "The Dark Knight",
        "description": "When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice.",
        "director": "Christopher Nolan",
        "rating": 9.0,
    },
    {
        "title": "Interstellar",
        "description": "Interstellar explores the boundaries of human exploration as a group of astronauts venture through a wormhole in space. In their quest to ensure the survival of humanity, they confront the vastness of space-time and grapple with love and sacrifice.",
        "director": "Christopher Nolan",
        "rating": 8.6,
    },
    {
        "title": "Pulp Fiction",
        "description": "The lives of two mob hitmen, a boxer, a gangster's wife, and a pair of diner bandits intertwine in four tales of violence and redemption.",
        "director": "Quentin Tarantino",
        "rating": 8.9,
    },
    {
        "title": "Reservoir Dogs",
        "description": "When a simple jewelry heist goes horribly wrong, the surviving criminals begin to suspect that one of them is a police informant.",
        "director": "Quentin Tarantino",
        "rating": 8.3,
    },
    {
        "title": "The Godfather",
        "description": "An aging patriarch of an organized crime dynasty transfers control of his empire to his reluctant son.",
        "director": "Francis Ford Coppola",
        "rating": 9.2,
    },
]
import getpass
import os
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
OpenAI API Key: ········
from docarray import BaseDoc, DocList
from docarray.typing import NdArray
from langchain_openai import OpenAIEmbeddings
# define schema for your movie documents
class MyDoc(BaseDoc):
    title: str
    description: str
    description_embedding: NdArray[1536]
    rating: float
    director: str
embeddings = OpenAIEmbeddings()
# get "description" embeddings, and create documents
docs = DocList[MyDoc](
    [
        MyDoc(
            description_embedding=embeddings.embed_query(movie["description"]), **movie
        )
        for movie in movies
    ]
)
from docarray.index import HnswDocumentIndex
# initialize the index
db = HnswDocumentIndex[MyDoc](work_dir="movie_search")
# add data
db.index(docs)
Normal Retriever
from langchain_community.retrievers import DocArrayRetriever
# create a retriever
retriever = DocArrayRetriever(
    index=db,
    embeddings=embeddings,
    search_field="description_embedding",
    content_field="description",
)
# find the relevant document
doc = retriever.invoke("movie about dreams")
print(doc)
[Document(page_content='A thief who steals corporate secrets through the use of dream-sharing technology is given the task of planting an idea into the mind of a CEO.', metadata={'id': 'f1649d5b6776db04fec9a116bbb6bbe5', 'title': 'Inception', 'rating': 8.8, 'director': 'Christopher Nolan'})]
Retriever with Filters
from langchain_community.retrievers import DocArrayRetriever
# create a retriever
retriever = DocArrayRetriever(
    index=db,
    embeddings=embeddings,
    search_field="description_embedding",
    content_field="description",
    filters={"director": {"$eq": "Christopher Nolan"}},
    top_k=2,
)
# find relevant documents
docs = retriever.invoke("space travel")
print(docs)
[Document(page_content='Interstellar explores the boundaries of human exploration as a group of astronauts venture through a wormhole in space. In their quest to ensure the survival of humanity, they confront the vastness of space-time and grapple with love and sacrifice.', metadata={'id': 'ab704cc7ae8573dc617f9a5e25df022a', 'title': 'Interstellar', 'rating': 8.6, 'director': 'Christopher Nolan'}), Document(page_content='A thief who steals corporate secrets through the use of dream-sharing technology is given the task of planting an idea into the mind of a CEO.', metadata={'id': 'f1649d5b6776db04fec9a116bbb6bbe5', 'title': 'Inception', 'rating': 8.8, 'director': 'Christopher Nolan'})]
Retriever with MMR search
from langchain_community.retrievers import DocArrayRetriever
# create a retriever
retriever = DocArrayRetriever(
    index=db,
    embeddings=embeddings,
    search_field="description_embedding",
    content_field="description",
    filters={"rating": {"$gte": 8.7}},
    search_type="mmr",
    top_k=3,
)
# find relevant documents
docs = retriever.invoke("action movies")
print(docs)
[Document(page_content="The lives of two mob hitmen, a boxer, a gangster's wife, and a pair of diner bandits intertwine in four tales of violence and redemption.", metadata={'id': 'e6aa313bbde514e23fbc80ab34511afd', 'title': 'Pulp Fiction', 'rating': 8.9, 'director': 'Quentin Tarantino'}), Document(page_content='A thief who steals corporate secrets through the use of dream-sharing technology is given the task of planting an idea into the mind of a CEO.', metadata={'id': 'f1649d5b6776db04fec9a116bbb6bbe5', 'title': 'Inception', 'rating': 8.8, 'director': 'Christopher Nolan'}), Document(page_content='When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice.', metadata={'id': '91dec17d4272041b669fd113333a65f7', 'title': 'The Dark Knight', 'rating': 9.0, 'director': 'Christopher Nolan'})]
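MMR (maximal marginal relevance) picks results one at a time, rewarding similarity to the query and penalizing similarity to results already chosen. The sketch below is a minimal pure-Python version of that selection loop under stated assumptions; it is not the retriever's own code. The names `cosine` and `mmr`, the `lambda_mult` weighting parameter, and the toy vectors are all illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query, candidates, k=3, lambda_mult=0.5):
    """Return indices of up to k candidates chosen by maximal marginal relevance.

    lambda_mult=1.0 ranks purely by relevance to the query; smaller values
    penalize candidates that resemble ones already selected.
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        best, best_score = None, -math.inf
        for i in remaining:
            relevance = cosine(query, candidates[i])
            # Similarity to the closest already-selected item (0.0 if none yet).
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return selected


# Candidates 0 and 1 are near-duplicates; candidate 2 points in a different direction.
candidates = [[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]]
query = [1.0, 0.1]
diverse = mmr(query, candidates, k=2, lambda_mult=0.3)
relevant_only = mmr(query, candidates, k=2, lambda_mult=1.0)
print(diverse, relevant_only)
```

With a low `lambda_mult` the second pick skips the near-duplicate in favor of the dissimilar candidate, while `lambda_mult=1.0` reduces to plain similarity ranking.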