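Docling parses PDF, DOCX, PPTX, HTML, and other formats into a rich unified representation that includes document layout, tables, and more, making them ready for generative AI workflows such as RAG. This integration provides Docling's capabilities via the DoclingLoader document loader.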

Overview

Integration details

Class | Package | Local | Serializable | JS support
langchain_docling.DoclingLoader | langchain-docling | | |

Loader features

Source | Document Lazy Loading | Native Async Support
DoclingLoader | |
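With the provided DoclingLoader component, you can:
  • use various document types in your LLM applications with ease and speed, and
  • leverage Docling's rich format for advanced, document-native grounding.
DoclingLoader supports two different export modes:
  • ExportType.DOC_CHUNKS (default): if you want each input document chunked, with each individual chunk captured as a separate LangChain Document, or
  • ExportType.MARKDOWN: if you want each input document captured as a separate LangChain Document.
The example lets you explore both modes via the EXPORT_TYPE parameter; depending on the value set, the example pipeline is set up accordingly.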

Setup

pip install -qU langchain-docling
Note: you may need to restart the kernel to use updated packages.
For the best conversion speed, use GPU acceleration whenever it is available. For example, when running on Colab, use a GPU-enabled runtime.

Initialization

Basic initialization looks as follows:
from langchain_docling import DoclingLoader

FILE_PATH = "https://arxiv.org/pdf/2408.09869"

loader = DoclingLoader(file_path=FILE_PATH)
For advanced usage, DoclingLoader has the following parameters:
  • file_path: the source, given as a single str (URL or local file) or an iterable of them
  • converter (optional): a specific Docling converter instance to use
  • convert_kwargs (optional): specific kwargs for the conversion run
  • export_type (optional): the export mode to use: ExportType.DOC_CHUNKS (default) or ExportType.MARKDOWN
  • md_export_kwargs (optional): specific Markdown export kwargs (for Markdown mode)
  • chunker (optional): a specific Docling chunker instance to use (for doc-chunk mode)
  • meta_extractor (optional): a specific metadata extractor to use
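The file_path parameter accepts either one string or an iterable of strings. The sketch below shows one way calling code could normalize both shapes before passing them on. `normalize_sources` is a hypothetical helper written for illustration, not part of langchain-docling, and `report.docx` is a made-up local path.

```python
from collections.abc import Iterable
from typing import List, Union


def normalize_sources(file_path: Union[str, Iterable]) -> List[str]:
    """Return the sources as a list, whether one string or several were given."""
    # A str is itself iterable (character by character), so test for it first.
    if isinstance(file_path, str):
        return [file_path]
    return list(file_path)


single = normalize_sources("https://arxiv.org/pdf/2408.09869")
# "report.docx" is a hypothetical local file used only for illustration.
several = normalize_sources(["https://arxiv.org/pdf/2408.09869", "report.docx"])

print(single)
print(several)
```

Checking for str before treating the value as an iterable avoids accidentally splitting a single URL into characters.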

Load

docs = loader.load()
Token indices sequence length is longer than the specified maximum sequence length for this model (1041 > 512). Running this sequence through the model will result in indexing errors
Note: the message "Token indices sequence length is longer than the specified maximum sequence length..." can be ignored in this case; see here for details.
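If the message is distracting, one option is to raise the log level of the logger that emits it. This is a minimal sketch using only the standard logging module. It assumes the warning comes from the `transformers.tokenization_utils_base` logger, which may differ between transformers versions, and it also hides any other warnings from that module.

```python
import logging

# Assumption: the tokenizer length warning is emitted through this logger name.
TOKENIZER_LOGGER = "transformers.tokenization_utils_base"

tokenizer_logger = logging.getLogger(TOKENIZER_LOGGER)
# Only records at ERROR level or above will now pass through this logger.
tokenizer_logger.setLevel(logging.ERROR)
```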
Inspecting some sample documents:
for d in docs[:3]:
    print(f"- {d.page_content=}")
- d.page_content='arXiv:2408.09869v5  [cs.CL]  9 Dec 2024'
- d.page_content='Docling Technical Report\nVersion 1.0\nChristoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar\nAI4K Group, IBM Research R¨uschlikon, Switzerland'
- d.page_content='Abstract\nThis technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.'

Lazy Load

Documents can also be loaded lazily:
doc_iter = loader.lazy_load()
for doc in doc_iter:
    pass  # you can operate on `doc` here
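Lazy loading is most useful when documents are consumed in small groups, for example when sending them to a vector store in batches. The sketch below batches any iterator without materializing it. `batched` and `fake_lazy_load` are hypothetical names for illustration; `fake_lazy_load` stands in for `loader.lazy_load()` and yields plain strings instead of LangChain Documents.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of up to `size` items without reading the whole input first."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def fake_lazy_load() -> Iterator[str]:
    # Hypothetical stand-in for loader.lazy_load().
    for i in range(5):
        yield f"chunk-{i}"


batches = list(batched(fake_lazy_load(), size=2))
print(batches)
```

With a real loader, the same helper could wrap `loader.lazy_load()` so that only one batch of documents is held in memory at a time.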

End-to-end Example

import os

# https://github.com/huggingface/transformers/issues/5486:
os.environ["TOKENIZERS_PARALLELISM"] = "false"
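  • The following example pipeline uses HuggingFace's Inference API. To increase your LLM quota, you can provide a token via the environment variable HF_TOKEN.
  • The dependencies for this pipeline can be installed as shown below (--no-warn-conflicts is meant for Colab's preconfigured Python environment; you can remove it for stricter usage):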
pip install -q --progress-bar off --no-warn-conflicts langchain-core langchain-huggingface langchain-milvus langchain python-dotenv
Note: you may need to restart the kernel to use updated packages.
Defining the pipeline parameters:
from pathlib import Path
from tempfile import mkdtemp

from dotenv import load_dotenv
from langchain_core.prompts import PromptTemplate
from langchain_docling.loader import ExportType


def _get_env_from_colab_or_os(key):
    try:
        from google.colab import userdata

        try:
            return userdata.get(key)
        except userdata.SecretNotFoundError:
            pass
    except ImportError:
        pass
    return os.getenv(key)


load_dotenv()

HF_TOKEN = _get_env_from_colab_or_os("HF_TOKEN")
FILE_PATH = ["https://arxiv.org/pdf/2408.09869"]  # Docling Technical Report
EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
GEN_MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"
EXPORT_TYPE = ExportType.DOC_CHUNKS
QUESTION = "Which are the main AI models in Docling?"
PROMPT = PromptTemplate.from_template(
    "Context information is below.\n---------------------\n{context}\n---------------------\nGiven the context information and not prior knowledge, answer the query.\nQuery: {input}\nAnswer:\n",
)
TOP_K = 3
MILVUS_URI = str(Path(mkdtemp()) / "docling.db")
Now we can instantiate the loader and load the documents:
from docling.chunking import HybridChunker
from langchain_docling import DoclingLoader

loader = DoclingLoader(
    file_path=FILE_PATH,
    export_type=EXPORT_TYPE,
    chunker=HybridChunker(tokenizer=EMBED_MODEL_ID),
)

docs = loader.load()
Token indices sequence length is longer than the specified maximum sequence length for this model (1041 > 512). Running this sequence through the model will result in indexing errors
Determining the splits:
if EXPORT_TYPE == ExportType.DOC_CHUNKS:
    splits = docs
elif EXPORT_TYPE == ExportType.MARKDOWN:
    from langchain_text_splitters import MarkdownHeaderTextSplitter

    splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[
            ("#", "Header_1"),
            ("##", "Header_2"),
            ("###", "Header_3"),
        ],
    )
    splits = [split for doc in docs for split in splitter.split_text(doc.page_content)]
else:
    raise ValueError(f"Unexpected export type: {EXPORT_TYPE}")
Inspecting some sample splits:
for d in splits[:3]:
    print(f"- {d.page_content=}")
print("...")
- d.page_content='arXiv:2408.09869v5  [cs.CL]  9 Dec 2024'
- d.page_content='Docling Technical Report\nVersion 1.0\nChristoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar\nAI4K Group, IBM Research R¨uschlikon, Switzerland'
- d.page_content='Abstract\nThis technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.'
...

Ingestion

import json

from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_milvus import Milvus

embedding = HuggingFaceEmbeddings(model_name=EMBED_MODEL_ID)

# Reuse MILVUS_URI from the pipeline parameters defined earlier.
vectorstore = Milvus.from_documents(
    documents=splits,
    embedding=embedding,
    collection_name="docling_demo",
    connection_args={"uri": MILVUS_URI},
    index_params={"index_type": "FLAT"},
    drop_old=True,
)

RAG

from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_huggingface import HuggingFaceEndpoint

retriever = vectorstore.as_retriever(search_kwargs={"k": TOP_K})
llm = HuggingFaceEndpoint(
    repo_id=GEN_MODEL_ID,
    huggingfacehub_api_token=HF_TOKEN,
    task="text-generation",
)


def clip_text(text, threshold=100):
    """Truncate text to `threshold` characters, appending "..." when clipped."""
    return f"{text[:threshold]}..." if len(text) > threshold else text


question_answer_chain = create_stuff_documents_chain(llm, PROMPT)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
resp_dict = rag_chain.invoke({"input": QUESTION})

clipped_answer = clip_text(resp_dict["answer"], threshold=350)
print(f"Question:\n{resp_dict['input']}\n\nAnswer:\n{clipped_answer}")
for i, doc in enumerate(resp_dict["context"]):
    print()
    print(f"Source {i + 1}:")
    print(f"  text: {json.dumps(clip_text(doc.page_content, threshold=350))}")
    for key in doc.metadata:
        if key != "pk":
            val = doc.metadata.get(key)
            clipped_val = clip_text(val) if isinstance(val, str) else val
            print(f"  {key}: {clipped_val}")
Question:
Which are the main AI models in Docling?

Answer:
The main AI models in Docling are a layout analysis model, which is an accurate object-detector for page elements, and TableFormer, a state-of-the-art table structure recognition model.

Source 1:
  text: "3.2 AI models\nAs part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure re..."
  dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/50', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 3, 'bbox': {'l': 108.0, 't': 405.1419982910156, 'r': 504.00299072265625, 'b': 330.7799987792969, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 608]}]}], 'headings': ['3.2 AI models'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 11465328351749295394, 'filename': '2408.09869v5.pdf'}}
  source: https://arxiv.org/pdf/2408.09869

Source 2:
  text: "3 Processing pipeline\nDocling implements a linear pipeline of operations, which execute sequentially on each given document (see Fig. 1). Each document is first parsed by a PDF backend, which retrieves the programmatic text tokens, consisting of string content and its coordinates on the page, and also renders a bitmap image of each page to support ..."
  dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/26', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 2, 'bbox': {'l': 108.0, 't': 273.01800537109375, 'r': 504.00299072265625, 'b': 176.83799743652344, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 796]}]}], 'headings': ['3 Processing pipeline'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 11465328351749295394, 'filename': '2408.09869v5.pdf'}}
  source: https://arxiv.org/pdf/2408.09869

Source 3:
  text: "6 Future work and contributions\nDocling is designed to allow easy extension of the model library and pipelines. In the future, we plan to extend Docling with several more models, such as a figure-classifier model, an equationrecognition model, a code-recognition model and more. This will help improve the quality of conversion for specific types of ..."
  dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/76', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 5, 'bbox': {'l': 108.0, 't': 322.468994140625, 'r': 504.00299072265625, 'b': 259.0169982910156, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 543]}]}, {'self_ref': '#/texts/77', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 5, 'bbox': {'l': 108.0, 't': 251.6540069580078, 'r': 504.00299072265625, 'b': 198.99200439453125, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 402]}]}], 'headings': ['6 Future work and contributions'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 11465328351749295394, 'filename': '2408.09869v5.pdf'}}
  source: https://arxiv.org/pdf/2408.09869
Notice that the sources contain rich grounding information, including the passage headings (i.e. the section), the page, and the precise bounding box.
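To use this grounding information programmatically, the dl_meta dictionary can be walked directly. The sketch below assumes the structure printed in the sources above (`headings`, `doc_items`, `prov`, `page_no`, `bbox`). `summarize_grounding` is a hypothetical helper, and the sample is an abbreviated copy of the Source 1 metadata with rounded coordinates.

```python
from typing import Any, Dict, List


def summarize_grounding(dl_meta: Dict[str, Any]) -> Dict[str, Any]:
    """Collect headings, page numbers, and bounding boxes from one dl_meta dict."""
    pages: List[int] = []
    boxes: List[Dict[str, Any]] = []
    for item in dl_meta.get("doc_items", []):
        for prov in item.get("prov", []):
            pages.append(prov["page_no"])
            boxes.append(prov["bbox"])
    return {
        "headings": dl_meta.get("headings", []),
        "pages": sorted(set(pages)),
        "boxes": boxes,
    }


# Abbreviated from the Source 1 output; coordinates are rounded for readability.
sample = {
    "doc_items": [
        {
            "self_ref": "#/texts/50",
            "label": "text",
            "prov": [
                {
                    "page_no": 3,
                    "bbox": {
                        "l": 108.0,
                        "t": 405.14,
                        "r": 504.0,
                        "b": 330.78,
                        "coord_origin": "BOTTOMLEFT",
                    },
                }
            ],
        }
    ],
    "headings": ["3.2 AI models"],
}

summary = summarize_grounding(sample)
print(summary["headings"], summary["pages"])
```

In the pipeline above, the same function could be applied to `doc.metadata["dl_meta"]` for each retrieved document, provided the vector store returns that field as a dictionary, as the printed output suggests.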

API reference

