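Tencent Cloud VectorDB is a fully managed, self-developed, enterprise-grade distributed database service designed for storing, retrieving, and analyzing multi-dimensional vector data. The database supports multiple index types and similarity calculation methods. A single index can hold up to 1 billion vectors and can serve millions of QPS with millisecond-level query latency. Tencent Cloud Vector Database can provide an external knowledge base that improves the accuracy of large-model responses, and it can also be used widely in AI fields such as recommendation systems, NLP services, computer vision, and intelligent customer service.
This notebook shows how to use the functionality related to the Tencent vector database. To run it, you need a database instance.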

Basic Usage

!pip3 install tcvectordb langchain-community
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings.fake import FakeEmbeddings
from langchain_community.vectorstores import TencentVectorDB
from langchain_community.vectorstores.tencentvectordb import ConnectionParams
from langchain_text_splitters import CharacterTextSplitter
Load the document and split it into chunks.
loader = TextLoader("../../how_to/state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
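To make the chunking step more concrete, here is a simplified sketch of separator-based splitting with chunk_size and no overlap. It is an illustration only, not the CharacterTextSplitter implementation; the function name split_by_separator and the "\n\n" separator default are assumptions made for this example.

```python
from typing import List


def split_by_separator(text: str, chunk_size: int, separator: str = "\n\n") -> List[str]:
    """Greedily pack separator-delimited pieces into chunks.

    Hypothetical helper: pieces are merged until adding the next one would
    exceed chunk_size characters. A single piece longer than chunk_size is
    kept whole rather than cut mid-paragraph.
    """
    chunks: List[str] = []
    current = ""
    for piece in text.split(separator):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\ncccc"
print(split_by_separator(sample, chunk_size=10))
# → ['aaaa\n\nbbbb', 'cccc']
```

With chunk_overlap=0, as in the notebook, no text is repeated between neighbouring chunks.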
Two ways of embedding documents are supported:
  • Use any Embeddings model compatible with LangChain Embeddings.
  • Specify the name of an Embedding model provided by Tencent VectorStore DB. The available options are:
    • bge-base-zh, dimension: 768
    • m3e-base, dimension: 768
    • text2vec-large-chinese, dimension: 1024
    • e5-large-v2, dimension: 1024
    • multilingual-e5-base, dimension: 768
The following code shows both ways of embedding documents; choose one by commenting out the other:
##  you can use a LangChain Embeddings model, like OpenAIEmbeddings:

# from langchain_community.embeddings.openai import OpenAIEmbeddings
#
# embeddings = OpenAIEmbeddings()
# t_vdb_embedding = None

## Or you can use a Tencent Embedding model, like `bge-base-zh`:

t_vdb_embedding = "bge-base-zh"  # bge-base-zh is the default model
embeddings = None
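When a Tencent embedding model is chosen by name, it can help to keep its vector dimension at hand. The following is a minimal sketch of a lookup table built from the model list above; the dictionary and the helper embedding_dimension are hypothetical names for this example and are not part of tcvectordb or LangChain.

```python
# Dimensions copied from the model list in this notebook.
# TENCENT_EMBEDDING_DIMENSIONS and embedding_dimension are hypothetical names.
TENCENT_EMBEDDING_DIMENSIONS = {
    "bge-base-zh": 768,
    "m3e-base": 768,
    "text2vec-large-chinese": 1024,
    "e5-large-v2": 1024,
    "multilingual-e5-base": 768,
}


def embedding_dimension(model_name: str) -> int:
    """Return the vector dimension for a listed model name, or raise ValueError."""
    try:
        return TENCENT_EMBEDDING_DIMENSIONS[model_name]
    except KeyError:
        known = ", ".join(sorted(TENCENT_EMBEDDING_DIMENSIONS))
        raise ValueError(f"Unknown model {model_name!r}; expected one of: {known}") from None


print(embedding_dimension("bge-base-zh"))
# → 768
```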
Now you can create a TencentVectorDB instance. You must provide at least one of the embeddings or t_vdb_embedding parameters. If both are provided, the embeddings parameter is used:
conn_params = ConnectionParams(
    url="http://10.0.X.X",
    key="eC4bLRy2va******************************",
    username="root",
    timeout=20,
)

vector_db = TencentVectorDB.from_documents(
    docs, embeddings, connection_params=conn_params, t_vdb_embedding=t_vdb_embedding
)
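The precedence between the two embedding parameters can be pictured with a small sketch. This only mirrors the rule described in this notebook (embeddings wins when both are given, and at least one is required); it is not the library's code, and the function name and the "langchain"/"tencent" labels are assumptions made for illustration.

```python
from typing import Any, Optional, Tuple


def choose_embedding_source(
    embeddings: Optional[Any], t_vdb_embedding: Optional[str]
) -> Tuple[str, Any]:
    """Hypothetical helper showing which embedding setting takes effect."""
    if embeddings is not None:
        # A LangChain Embeddings object takes priority when both are set.
        return ("langchain", embeddings)
    if t_vdb_embedding is not None:
        # Otherwise fall back to the Tencent model identified by name.
        return ("tencent", t_vdb_embedding)
    raise ValueError("Provide at least one of embeddings or t_vdb_embedding")


print(choose_embedding_source(None, "bge-base-zh"))
# → ('tencent', 'bge-base-zh')
```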
query = "What did the president say about Ketanji Brown Jackson"
docs = vector_db.similarity_search(query)
docs[0].page_content
'Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.'
vector_db = TencentVectorDB(embeddings, conn_params)

vector_db.add_texts(["Ankush went to Princeton"])
query = "Where did Ankush go to college?"
docs = vector_db.max_marginal_relevance_search(query)
docs[0].page_content
'Ankush went to Princeton'
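max_marginal_relevance_search balances relevance to the query against diversity among the returned documents. Below is a minimal sketch of the general maximal marginal relevance selection rule on plain Python lists. It is not the TencentVectorDB or LangChain implementation; the names mmr_select and lambda_mult and the toy vectors are assumptions chosen for this example.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(
    query_vec: Sequence[float],
    candidate_vecs: Sequence[Sequence[float]],
    k: int = 2,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Return indices of up to k candidates chosen by maximal marginal relevance.

    lambda_mult=1.0 ranks purely by relevance; lower values penalise
    candidates that resemble ones already selected.
    """
    selected: List[int] = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:
        best_idx, best_score = remaining[0], -math.inf
        for i in remaining:
            relevance = cosine(query_vec, candidate_vecs[i])
            redundancy = max(
                (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1.0 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


query = [1.0, 0.5]
candidates = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]  # the first two are near-duplicates
print(mmr_select(query, candidates, k=2, lambda_mult=0.5))
print(mmr_select(query, candidates, k=2, lambda_mult=1.0))
```

With the toy vectors above, the balanced setting skips the near-duplicate second pick in favour of the dissimilar vector, while lambda_mult=1.0 simply returns the two most relevant candidates.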

Metadata and filtering

Tencent VectorDB supports metadata and filtering. You can add metadata to documents and filter search results based on that metadata. The following code creates a new TencentVectorDB collection with metadata and demonstrates how to filter search results by metadata:
from langchain_community.vectorstores.tencentvectordb import (
    META_FIELD_TYPE_STRING,
    META_FIELD_TYPE_UINT64,
    ConnectionParams,
    MetaField,
    TencentVectorDB,
)
from langchain_core.documents import Document

meta_fields = [
    MetaField(name="year", data_type=META_FIELD_TYPE_UINT64, index=True),
    MetaField(name="rating", data_type=META_FIELD_TYPE_STRING, index=False),
    MetaField(name="genre", data_type=META_FIELD_TYPE_STRING, index=True),
    MetaField(name="director", data_type=META_FIELD_TYPE_STRING, index=True),
]

docs = [
    Document(
        page_content="The Shawshank Redemption is a 1994 American drama film written and directed by Frank Darabont.",
        metadata={
            "year": 1994,
            "rating": "9.3",
            "genre": "drama",
            "director": "Frank Darabont",
        },
    ),
    Document(
        page_content="The Godfather is a 1972 American crime film directed by Francis Ford Coppola.",
        metadata={
            "year": 1972,
            "rating": "9.2",
            "genre": "crime",
            "director": "Francis Ford Coppola",
        },
    ),
    Document(
        page_content="The Dark Knight is a 2008 superhero film directed by Christopher Nolan.",
        metadata={
            "year": 2008,
            "rating": "9.0",
            "genre": "superhero",
            "director": "Christopher Nolan",
        },
    ),
    Document(
        page_content="Inception is a 2010 science fiction action film written and directed by Christopher Nolan.",
        metadata={
            "year": 2010,
            "rating": "8.8",
            "genre": "science fiction",
            "director": "Christopher Nolan",
        },
    ),
]

vector_db = TencentVectorDB.from_documents(
    docs,
    None,
    connection_params=ConnectionParams(
        url="http://10.0.X.X",
        key="eC4bLRy2va******************************",
        username="root",
        timeout=20,
    ),
    collection_name="movies",
    meta_fields=meta_fields,
)

query = "film about dream by Christopher Nolan"

# you can use the tencentvectordb filtering syntax with the `expr` parameter:
result = vector_db.similarity_search(query, expr='director="Christopher Nolan"')

# alternatively, you can use the langchain filtering syntax with the `filter` parameter:
# result = vector_db.similarity_search(query, filter='eq("director", "Christopher Nolan")')

result
[Document(page_content='The Dark Knight is a 2008 superhero film directed by Christopher Nolan.', metadata={'year': 2008, 'rating': '9.0', 'genre': 'superhero', 'director': 'Christopher Nolan'}),
 Document(page_content='The Dark Knight is a 2008 superhero film directed by Christopher Nolan.', metadata={'year': 2008, 'rating': '9.0', 'genre': 'superhero', 'director': 'Christopher Nolan'}),
 Document(page_content='The Dark Knight is a 2008 superhero film directed by Christopher Nolan.', metadata={'year': 2008, 'rating': '9.0', 'genre': 'superhero', 'director': 'Christopher Nolan'}),
 Document(page_content='Inception is a 2010 science fiction action film written and directed by Christopher Nolan.', metadata={'year': 2010, 'rating': '8.8', 'genre': 'science fiction', 'director': 'Christopher Nolan'})]
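The output above lists the same document more than once, which may happen when a collection already contains the documents from an earlier run. If repeated hits are unwanted, one option is to drop them on the client side. The sketch below uses plain dictionaries as stand-ins for Document objects; dedupe_by_content is a hypothetical helper, not a TencentVectorDB feature.

```python
from typing import Any, Dict, List


def dedupe_by_content(results: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first occurrence of each page_content, preserving result order."""
    seen = set()
    unique: List[Dict[str, Any]] = []
    for doc in results:
        content = doc["page_content"]
        if content not in seen:
            seen.add(content)
            unique.append(doc)
    return unique


# Stand-in results shaped like the output above (metadata shortened).
results = [
    {"page_content": "The Dark Knight is a 2008 superhero film directed by Christopher Nolan.", "metadata": {"year": 2008}},
    {"page_content": "The Dark Knight is a 2008 superhero film directed by Christopher Nolan.", "metadata": {"year": 2008}},
    {"page_content": "Inception is a 2010 science fiction action film written and directed by Christopher Nolan.", "metadata": {"year": 2010}},
]
print(len(dedupe_by_content(results)))
# → 2
```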
