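ZeroxPDFLoader

ZeroxPDFLoader is a document loader that uses the Zerox library. Zerox converts a PDF document into images, processes them with a vision-capable language model, and produces a structured Markdown representation. The loader supports asynchronous operation and extracts documents page by page.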

Integration details

Class | Package | Local | Serializable | JS support
ZeroxPDFLoader | langchain_community | | |

Loader features

Source | Document Lazy Loading | Native Async Support
ZeroxPDFLoader | |

Setup

Credentials

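You must set the appropriate credentials as environment variables. The loader supports a range of models and model providers. See the examples under Initialization and API reference below, or consult the Zerox documentation for the full list of supported models.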
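As a minimal sketch of the credential step, the helper below reports which required variables are still unset before the loader is built. The name `missing_env_vars` and the variable list are illustrative assumptions, not part of Zerox or LangChain; which variables are actually required depends on the provider you choose.

```python
import os
from typing import Iterable, List, Mapping, Optional


def missing_env_vars(
    names: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the names from `names` that are unset or empty in `env`.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Assumption: an OpenAI model is used, so OPENAI_API_KEY is the variable to check.
required = ["OPENAI_API_KEY"]
missing = missing_env_vars(required)
if missing:
    print("Set these environment variables first:", ", ".join(missing))
```

Running a check like this before constructing the loader turns a late provider error into an early, readable message.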

Installation

To use ZeroxPDFLoader, install the zerox package, and make sure langchain-community is installed as well.
pip install zerox langchain-community

Initialization

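ZeroxPDFLoader extracts text from a PDF with a vision-capable language model by converting each page into an image and processing the pages asynchronously. To use the loader, specify a model and configure the environment variables Zerox needs, such as an API key. If you are working in an environment such as Jupyter Notebook, you may need nest_asyncio to run the asynchronous code. You can set it up as follows: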
import nest_asyncio
nest_asyncio.apply()
import os

# use nest_asyncio (only necessary inside of jupyter notebook)
import nest_asyncio
from langchain_community.document_loaders.pdf import ZeroxPDFLoader

nest_asyncio.apply()

# Specify the URL or file path of the PDF you want to process.
# Here we use a PDF hosted on the web.
file_path = "https://assets.ctfassets.net/f1df9zr7wr1a/soP1fjvG1Wu66HJhu3FBS/034d6ca48edb119ae77dec5ce01a8612/OpenAI_Sacra_Teardown.pdf"

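# Set the environment variables required by your vision model's provider
os.environ["OPENAI_API_KEY"] = "your-api-key"  # replace with your own key

# Initialize ZeroxPDFLoader with the desired model.
# Note: an "azure/" model is served through Azure OpenAI and may need Azure-specific
# environment variables instead of OPENAI_API_KEY; check the Zerox documentation.
loader = ZeroxPDFLoader(file_path=file_path, model="azure/gpt-4o-mini")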

Load

# Load the document and look at the first page:
documents = loader.load()
documents[0]
Document(metadata={'source': 'https://assets.ctfassets.net/f1df9zr7wr1a/soP1fjvG1Wu66HJhu3FBS/034d6ca48edb119ae77dec5ce01a8612/OpenAI_Sacra_Teardown.pdf', 'page': 1, 'num_pages': 5}, page_content='# OpenAI\n\nOpenAI is an AI research laboratory.\n\n#ai-models #ai\n\n## Revenue\n- **$1,000,000,000**  \n  2023\n\n## Valuation\n- **$28,000,000,000**  \n  2023\n\n## Growth Rate (Y/Y)\n- **400%**  \n  2023\n\n## Funding\n- **$11,300,000,000**  \n  2023\n\n---\n\n## Details\n- **Headquarters:** San Francisco, CA\n- **CEO:** Sam Altman\n\n[Visit Website](#)\n\n---\n\n## Revenue\n### ARR ($M)  | Growth\n--- | ---\n$1000M  | 456%\n$750M   | \n$500M   | \n$250M   | $36M\n$0     | $200M\n\nis on track to hit $1B in annual recurring revenue by the end of 2023, up about 400% from an estimated $200M at the end of 2022.\n\nOpenAI overall lost about $540M last year while developing ChatGPT, and those losses are expected to increase dramatically in 2023 with the growth in popularity of their consumer tools, with CEO Sam Altman remarking that OpenAI is likely to be "the most capital-intensive startup in Silicon Valley history."\n\nThe reason for that is operating ChatGPT is massively expensive. One analysis of ChatGPT put the running cost at about $700,000 per day taking into account the underlying costs of GPU hours and hardware. 
That amount—derived from the 175 billion parameter-large architecture of GPT-3—would be even higher with the 100 trillion parameters of GPT-4.\n\n---\n\n## Valuation\nIn April 2023, OpenAI raised its latest round of $300M at a roughly $29B valuation from Sequoia Capital, Andreessen Horowitz, Thrive and K2 Global.\n\nAssuming OpenAI was at roughly $300M in ARR at the time, that would have given them a 96x forward revenue multiple.\n\n---\n\n## Product\n\n### ChatGPT\n| Examples                       | Capabilities                        | Limitations                        |\n|---------------------------------|-------------------------------------|------------------------------------|\n| "Explain quantum computing in simple terms" | "Remember what users said earlier in the conversation" | May occasionally generate incorrect information |\n| "What can you give me for my dad\'s birthday?" | "Allows users to follow-up questions" | Limited knowledge of world events after 2021 |\n| "How do I make an HTTP request in JavaScript?" | "Trained to provide harmless requests" |                                    |')
# Let's look at parsed first page
print(documents[0].page_content)
# OpenAI

OpenAI is an AI research laboratory.

#ai-models #ai

## Revenue
- **$1,000,000,000**
  2023

## Valuation
- **$28,000,000,000**
  2023

## Growth Rate (Y/Y)
- **400%**
  2023

## Funding
- **$11,300,000,000**
  2023

---

## Details
- **Headquarters:** San Francisco, CA
- **CEO:** Sam Altman

[Visit Website](#)

---

## Revenue
### ARR ($M)  | Growth
--- | ---
$1000M  | 456%
$750M   |
$500M   |
$250M   | $36M
$0     | $200M

is on track to hit $1B in annual recurring revenue by the end of 2023, up about 400% from an estimated $200M at the end of 2022.

OpenAI overall lost about $540M last year while developing ChatGPT, and those losses are expected to increase dramatically in 2023 with the growth in popularity of their consumer tools, with CEO Sam Altman remarking that OpenAI is likely to be "the most capital-intensive startup in Silicon Valley history."

The reason for that is operating ChatGPT is massively expensive. One analysis of ChatGPT put the running cost at about $700,000 per day taking into account the underlying costs of GPU hours and hardware. That amount—derived from the 175 billion parameter-large architecture of GPT-3—would be even higher with the 100 trillion parameters of GPT-4.

---

## Valuation
In April 2023, OpenAI raised its latest round of $300M at a roughly $29B valuation from Sequoia Capital, Andreessen Horowitz, Thrive and K2 Global.

Assuming OpenAI was at roughly $300M in ARR at the time, that would have given them a 96x forward revenue multiple.

---

## Product

### ChatGPT
| Examples                       | Capabilities                        | Limitations                        |
|---------------------------------|-------------------------------------|------------------------------------|
| "Explain quantum computing in simple terms" | "Remember what users said earlier in the conversation" | May occasionally generate incorrect information |
| "What can you give me for my dad's birthday?" | "Allows users to follow-up questions" | Limited knowledge of world events after 2021 |
| "How do I make an HTTP request in JavaScript?" | "Trained to provide harmless requests" |                                    |

Lazy Load

This loader always fetches results lazily. The .load() method is equivalent to .lazy_load() and returns the resulting documents as a list.
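The relationship between the two methods can be sketched with a stand-in class. `FakePageLoader` below is hypothetical and only mimics the shape of a page-wise loader (pages as plain dicts, no PDF parsing and no model call); it shows a generator-based `lazy_load()` that yields one page at a time and a `load()` that drains the same generator into a list. The metadata keys mirror those in the sample output above.

```python
from typing import Dict, Iterator, List


class FakePageLoader:
    """Hypothetical stand-in for a page-wise loader; not the real ZeroxPDFLoader."""

    def __init__(self, pages: List[str], source: str) -> None:
        self.pages = pages
        self.source = source

    def lazy_load(self) -> Iterator[Dict]:
        # Yield one page at a time; nothing is produced until the caller iterates.
        for number, text in enumerate(self.pages, start=1):
            yield {
                "page_content": text,
                "metadata": {
                    "source": self.source,
                    "page": number,
                    "num_pages": len(self.pages),
                },
            }

    def load(self) -> List[Dict]:
        # Eager variant: collect every lazily produced page into a list.
        return list(self.lazy_load())


loader = FakePageLoader(["# Page one", "# Page two"], source="example.pdf")
first = next(loader.lazy_load())  # only the first page is produced
all_pages = loader.load()         # both pages, in order
```

Iterating over `lazy_load()` is the better fit when pages are handled one by one; `load()` is convenient when the whole list is needed at once.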

API reference

ZeroxPDFLoader

This loader class is initialized with a file path and a model type, and supports custom configuration through zerox_kwargs for Zerox-specific parameters. Arguments:
  • file_path (Union[str, Path]): Path to the PDF file.
  • model (str): The vision-capable model to use for processing, in <provider>/<model> format. Some examples of valid values:
    • model = "gpt-4o-mini" ## openai model
    • model = "azure/gpt-4o-mini"
    • model = "gemini/gemini-1.5-flash"
    • model="claude-3-opus-20240229"
    • model = "vertex_ai/gemini-1.5-flash-001"
    • See the Zerox documentation for details.
    • Defaults to "gpt-4o-mini".
  • **zerox_kwargs (dict): Additional Zerox-specific parameters, such as an API key or endpoint.
Methods:
  • lazy_load: Returns an iterator of Document instances, each representing one page of the PDF, with metadata that includes the page number and source.
For the full API documentation, see the ZeroxPDFLoader API reference.
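To make the <provider>/<model> convention concrete, here is a small sketch that splits a model string into its two parts. `split_model_string` is a hypothetical helper, not a Zerox or LangChain function; it returns `None` for the provider when no prefix is present, because the unprefixed names in the list above belong to different providers and guessing one would be unsafe.

```python
from typing import Optional, Tuple


def split_model_string(model: str) -> Tuple[Optional[str], str]:
    """Split "<provider>/<model>" into (provider, model).

    Hypothetical helper for illustration only. Strings without a "/" are
    returned with provider=None rather than guessing a provider.
    """
    provider, sep, name = model.partition("/")
    if not sep:
        return None, model
    return provider, name


for value in ["gpt-4o-mini", "azure/gpt-4o-mini", "vertex_ai/gemini-1.5-flash-001"]:
    print(value, "->", split_model_string(value))
```

A split like this can help decide which provider-specific environment variables to check before the loader is created.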

Notes

  • Model Compatibility: Zerox supports a range of vision-capable models. See Zerox's GitHub documentation for the list of supported models and configuration details.
  • Environment Variables: Set the required environment variables, such as API_KEY or endpoint details, as specified in the Zerox documentation.
  • Asynchronous Processing: If you run into event-loop errors in Jupyter Notebook, you may need to apply nest_asyncio as shown in the Initialization section.

Troubleshooting

  • RuntimeError: This event loop is already running: Call nest_asyncio.apply() to avoid asynchronous loop conflicts in environments such as Jupyter.
  • Configuration Errors: Check that zerox_kwargs matches the arguments expected by the chosen model and that all required environment variables are set.
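The event-loop conflict can be detected before it surfaces as an error. The sketch below uses only the standard library; `loop_is_running` is a hypothetical helper name, and the suggestion to call `nest_asyncio.apply()` when it returns `True` follows the troubleshooting note above rather than anything the loader does itself.

```python
import asyncio


def loop_is_running() -> bool:
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises RuntimeError when no loop is running.
        return False
    return True


async def _check_inside_loop() -> bool:
    # Inside a coroutine a loop is running, similar to code in a Jupyter cell.
    return loop_is_running()


outside = loop_is_running()                 # plain script: no loop is running
inside = asyncio.run(_check_inside_loop())  # inside a coroutine: a loop is running
print(outside, inside)
```

If `loop_is_running()` returns `True` in your environment, apply nest_asyncio before calling `loader.load()`.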
