MyScale는 오픈 소스 ClickHouse 위에 구축된, AI 애플리케이션과 솔루션에 최적화된 클라우드 기반 데이터베이스입니다.
이 노트북은 MyScale 벡터 데이터베이스와 관련된 기능을 사용하는 방법을 보여줍니다.

환경 설정

pip install -qU  clickhouse-connect langchain-community
OpenAIEmbeddings를 사용하려면 OpenAI API Key가 필요합니다.
import getpass
import os

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
if "OPENAI_API_BASE" not in os.environ:
    os.environ["OPENAI_API_BASE"] = getpass.getpass("OpenAI Base:")
if "MYSCALE_HOST" not in os.environ:
    os.environ["MYSCALE_HOST"] = getpass.getpass("MyScale Host:")
if "MYSCALE_PORT" not in os.environ:
    os.environ["MYSCALE_PORT"] = getpass.getpass("MyScale Port:")
if "MYSCALE_USERNAME" not in os.environ:
    os.environ["MYSCALE_USERNAME"] = getpass.getpass("MyScale Username:")
if "MYSCALE_PASSWORD" not in os.environ:
    os.environ["MYSCALE_PASSWORD"] = getpass.getpass("MyScale Password:")
myscale index의 파라미터를 설정하는 방법은 두 가지가 있습니다.
  1. 환경 변수 앱을 실행하기 전에 export로 환경 변수를 설정하세요: export MYSCALE_HOST='<your-endpoints-url>' MYSCALE_PORT=<your-endpoints-port> MYSCALE_USERNAME=<your-username> MYSCALE_PASSWORD=<your-password> ... 계정, 비밀번호 및 기타 정보는 SaaS에서 쉽게 확인할 수 있습니다. 자세한 내용은 이 문서를 참고하세요. MyScaleSettings의 모든 속성은 MYSCALE_ 접두사로 설정할 수 있으며, 대소문자를 구분하지 않습니다.
  2. 파라미터로 MyScaleSettings 객체 생성하기
    from langchain_community.vectorstores import MyScale, MyScaleSettings
    config = MyScaleSetting(host="<your-backend-url>", port=8443, ...)
    index = MyScale(embedding_function, config)
    index.add_documents(...)
    
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import MyScale
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader

loader = TextLoader("../../how_to/state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
for d in docs:
    d.metadata = {"some": "metadata"}
docsearch = MyScale.from_documents(docs, embeddings)

query = "What did the president say about Ketanji Brown Jackson"
docs = docsearch.similarity_search(query)
Inserting data...: 100%|██████████| 42/42 [00:15<00:00,  2.66it/s]
print(docs[0].page_content)
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.

연결 정보와 데이터 스키마 가져오기

print(str(docsearch))

필터링

myscale SQL의 WHERE 구문에 직접 접근할 수 있습니다. 표준 SQL에 따라 WHERE 절을 작성할 수 있습니다. NOTE: SQL injection에 유의하세요. 이 인터페이스는 최종 사용자가 직접 호출하지 않도록 해야 합니다. 설정에서 column_map을 커스터마이즈했다면, 다음과 같이 필터를 사용해 검색할 수 있습니다:
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import MyScale

loader = TextLoader("../../how_to/state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()

for i, d in enumerate(docs):
    d.metadata = {"doc_id": i}

docsearch = MyScale.from_documents(docs, embeddings)
Inserting data...: 100%|██████████| 42/42 [00:15<00:00,  2.68it/s]

점수 포함 유사도 검색

반환되는 distance score는 cosine distance입니다. 따라서 값이 낮을수록 더 좋습니다.
meta = docsearch.metadata_column
output = docsearch.similarity_search_with_relevance_scores(
    "What did the president say about Ketanji Brown Jackson?",
    k=4,
    where_str=f"{meta}.doc_id<10",
)
for d, dist in output:
    print(dist, d.metadata, d.page_content[:20] + "...")
0.229655921459198 {'doc_id': 0} Madam Speaker, Madam...
0.24506962299346924 {'doc_id': 8} And so many families...
0.24786919355392456 {'doc_id': 1} Groups of citizens b...
0.24875116348266602 {'doc_id': 6} And I’m taking robus...

데이터 삭제

.drop() 메서드로 테이블을 삭제(drop)하거나, .delete() 메서드로 일부 데이터를 삭제할 수 있습니다.
# use directly a `where_str` to delete
docsearch.delete(where_str=f"{docsearch.metadata_column}.doc_id < 5")
meta = docsearch.metadata_column
output = docsearch.similarity_search_with_relevance_scores(
    "What did the president say about Ketanji Brown Jackson?",
    k=4,
    where_str=f"{meta}.doc_id<10",
)
for d, dist in output:
    print(dist, d.metadata, d.page_content[:20] + "...")
0.24506962299346924 {'doc_id': 8} And so many families...
0.24875116348266602 {'doc_id': 6} And I’m taking robus...
0.26027143001556396 {'doc_id': 7} We see the unity amo...
0.26390212774276733 {'doc_id': 9} And unlike the $2 Tr...
docsearch.drop()

Connect these docs programmatically to Claude, VSCode, and more via MCP for real-time answers.
I