StarRocks

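StarRocks is a high-performance analytical database: a next-generation, sub-second MPP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries.

StarRocks is usually categorized as an OLAP database and has shown excellent performance in ClickBench, a benchmark for analytical DBMSs. Because it has a very fast vectorized execution engine, it can also be used as a fast vectordb.

Here we show how to use the StarRocks Vector Store.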

Setup

pip install -qU pymysql langchain langchain-community langchain-openai langchain-text-splitters "unstructured[md]" tiktoken

Set update_vectordb = False at the beginning. If no documents have been updated, there is no need to rebuild their embeddings.

from langchain.chains import RetrievalQA
from langchain_community.document_loaders import (
    DirectoryLoader,
    UnstructuredMarkdownLoader,
)
from langchain_community.vectorstores import StarRocks
from langchain_community.vectorstores.starrocks import StarRocksSettings
from langchain_openai import OpenAI, OpenAIEmbeddings
from langchain_text_splitters import TokenTextSplitter

update_vectordb = False
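Rather than toggling the flag by hand, one option is to compare the modification times of the source files with the time of the last embedding build. The sketch below only illustrates that idea: the helper name `needs_rebuild` and the marker-file convention are assumptions made for this example, not part of LangChain or StarRocks.

```python
from pathlib import Path


def needs_rebuild(docs_dir: str, marker_file: str) -> bool:
    """Return True if any markdown file is newer than the marker file.

    Hypothetical helper: the marker file is assumed to be touched after
    each successful embedding build.
    """
    marker = Path(marker_file)
    if not marker.exists():
        # No record of a previous build, so the embeddings must be built.
        return True
    last_build = marker.stat().st_mtime
    return any(
        path.stat().st_mtime > last_build
        for path in Path(docs_dir).rglob("*.md")
    )
```

Under these assumptions, the flag could be set with update_vectordb = needs_rebuild("./docs", ".embeddings_built"), where the marker file name is again only a placeholder.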


Load docs and split them into tokens

Load all markdown files under the docs directory. For the StarRocks documentation, you can clone the repo from github.com/StarRocks/starrocks; it contains a docs directory.

loader = DirectoryLoader(
    "./docs", glob="**/*.md", loader_cls=UnstructuredMarkdownLoader
)
documents = loader.load()

Split the docs into tokens, and set update_vectordb = True because there are now new docs/tokens.

# load text splitter and split docs into snippets of text
text_splitter = TokenTextSplitter(chunk_size=400, chunk_overlap=50)
split_docs = text_splitter.split_documents(documents)

# tell vectordb to update text embeddings
update_vectordb = True

split_docs[-20]

Document(page_content='Compile StarRocks with Docker\n\nThis topic describes how to compile StarRocks using Docker.\n\nOverview\n\nStarRocks provides development environment images for both Ubuntu 22.04 and CentOS 7.9. With the image, you can launch a Docker container and compile StarRocks in the container.\n\nStarRocks version and DEV ENV image\n\nDifferent branches of StarRocks correspond to different development environment images provided on StarRocks Docker Hub.\n\nFor Ubuntu 22.04:\n\n| Branch name | Image name              |\n  | --------------- | ----------------------------------- |\n  | main            | starrocks/dev-env-ubuntu:latest     |\n  | branch-3.0      | starrocks/dev-env-ubuntu:3.0-latest |\n  | branch-2.5      | starrocks/dev-env-ubuntu:2.5-latest |\n\nFor CentOS 7.9:\n\n| Branch name | Image name                       |\n  | --------------- | ------------------------------------ |\n  | main            | starrocks/dev-env-centos7:latest     |\n  | branch-3.0      | starrocks/dev-env-centos7:3.0-latest |\n  | branch-2.5      | starrocks/dev-env-centos7:2.5-latest |\n\nPrerequisites\n\nBefore compiling StarRocks, make sure the following requirements are satisfied:\n\nHardware\n\n', metadata={'source': 'docs/developers/build-starrocks/Build_in_docker.md'})

print("# docs  = %d, # splits = %d" % (len(documents), len(split_docs)))

# docs  = 657, # splits = 2802
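To make the roles of chunk_size and chunk_overlap concrete, here is a minimal sketch of an overlapping sliding window over a token list. It is an illustration of the general technique, not the TokenTextSplitter implementation; the function name `split_tokens` is hypothetical, and whitespace-separated words stand in for real tokenizer output.

```python
def split_tokens(tokens, chunk_size=400, chunk_overlap=50):
    """Split a token list into windows that overlap by chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
        start += step
    return chunks


# Whitespace-separated words stand in for real tokens here.
words = "a b c d e f g h i j".split()
for chunk in split_tokens(words, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each window starts chunk_size - chunk_overlap tokens after the previous one, so neighbouring chunks share chunk_overlap tokens and text that straddles a boundary still appears whole in at least one chunk.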

Create vectordb instance

Use StarRocks as vectordb

def gen_starrocks(update_vectordb, embeddings, settings):
    if update_vectordb:
        docsearch = StarRocks.from_documents(split_docs, embeddings, config=settings)
    else:
        docsearch = StarRocks(embeddings, settings)
    return docsearch

Convert tokens into embeddings and put them into vectordb

Here we use StarRocks as the vectordb; the StarRocks instance is configured via StarRocksSettings. Configuring a StarRocks instance is much like configuring a MySQL instance. You need to specify:

host/port
username (default: 'root')
password (default: '')
database (default: 'default')
table (default: 'langchain')
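If you prefer not to hard-code these values, they could be collected from environment variables first. The sketch below is one possible approach under stated assumptions: the function name and the SR_* variable names are hypothetical, and the host and port fallbacks are assumptions (9030 is the usual StarRocks FE query port). Only the username, password, database, and table defaults come from the list above.

```python
import os


def starrocks_params_from_env(env=os.environ):
    """Collect connection parameters, falling back to defaults.

    The SR_* variable names are hypothetical. The host and port defaults
    are assumptions; the remaining defaults follow the list above.
    """
    return {
        "host": env.get("SR_HOST", "localhost"),
        "port": int(env.get("SR_PORT", "9030")),
        "username": env.get("SR_USERNAME", "root"),
        "password": env.get("SR_PASSWORD", ""),
        "database": env.get("SR_DATABASE", "default"),
        "table": env.get("SR_TABLE", "langchain"),
    }
```

Each entry of the returned dictionary can then be assigned to the matching attribute of a StarRocksSettings object.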

embeddings = OpenAIEmbeddings()

# configure starrocks settings(host/port/user/pw/db)
settings = StarRocksSettings()
settings.port = 41003
settings.host = "127.0.0.1"
settings.username = "root"
settings.password = ""
settings.database = "zya"
docsearch = gen_starrocks(update_vectordb, embeddings, settings)

print(docsearch)

update_vectordb = False

Inserting data...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2802/2802 [02:26<00:00, 19.11it/s]

zya.langchain @ 127.0.0.1:41003

username: root

Table Schema:
----------------------------------------------------------------------------
|name                    |type                    |key                     |
----------------------------------------------------------------------------
|id                      |varchar(65533)          |true                    |
|document                |varchar(65533)          |false                   |
|embedding               |array<float>            |false                   |
|metadata                |varchar(65533)          |false                   |
----------------------------------------------------------------------------
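The embedding column holds one float array per chunk. Conceptually, retrieval ranks those stored vectors by their similarity to the embedding of the query. The following plain-Python sketch illustrates that idea with cosine similarity; it is an assumption-based illustration of the concept, not the SQL that the StarRocks integration issues, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query, rows, k=4):
    """rows is a list of (document, embedding) pairs; return the k best documents."""
    ranked = sorted(
        rows,
        key=lambda row: cosine_similarity(query, row[1]),
        reverse=True,  # highest similarity first
    )
    return [doc for doc, _ in ranked[:k]]
```

In the real setup the ranking runs inside the database over all stored rows, so only the best matches travel back to the client.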

Build QA and ask it a question

llm = OpenAI()
qa = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=docsearch.as_retriever()
)
query = "is profile enabled by default? if not, how to enable profile?"
resp = qa.run(query)
print(resp)

 No, profile is not enabled by default. To enable profile, set the variable `enable_profile` to `true` using the command `set enable_profile = true;`
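With chain_type="stuff", the retrieved chunks are concatenated ("stuffed") into a single prompt together with the question. A rough sketch of that idea follows; the function name and the prompt wording are made up for illustration and are not the template LangChain uses.

```python
def build_stuff_prompt(question, chunks):
    """Join all retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

Because every retrieved chunk ends up in one prompt, the combined text has to fit within the model's context window, which is one reason the chunk size chosen during splitting matters.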
