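Diffbot is a suite of ML-based products that make it easy to structure web data.
Diffbot's Extract API is a service that structures and normalizes data from web pages.
Unlike traditional web scraping tools, Diffbot Extract does not require any rules to read the content of a page. It uses a computer vision model to classify a page into one of 20 possible types, and then transforms the raw HTML markup into JSON. The resulting structured JSON follows a consistent type-based ontology, which makes it easy to extract data from many different web sources with the same schema.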
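To make the request and response shape concrete, here is a minimal sketch. It assumes the v3 Article endpoint (`https://api.diffbot.com/v3/article`) called with `token` and `url` query parameters, and a JSON response whose first entry under `objects` carries a `text` field. The helper names and the sample response are hypothetical and hand-written for illustration; they are not captured API output.

```python
from urllib.parse import urlencode

# Assumption: the v3 Article endpoint; other Extract endpoints may differ.
DIFFBOT_ARTICLE_ENDPOINT = "https://api.diffbot.com/v3/article"


def build_extract_url(page_url: str, api_token: str) -> str:
    """Build the GET URL for extracting a single page (hypothetical helper)."""
    query = urlencode({"token": api_token, "url": page_url})
    return f"{DIFFBOT_ARTICLE_ENDPOINT}?{query}"


def first_object_text(response_json: dict) -> str:
    """Return the text of the first extracted object, or "" if there is none."""
    objects = response_json.get("objects") or []
    return objects[0].get("text", "") if objects else ""


# Hand-written sample shaped like the assumed response, not real output.
sample_response = {"objects": [{"type": "article", "text": "Hello from the page."}]}

request_url = build_extract_url("https://python.langchain.com/", "YOUR_TOKEN")
print(request_url)
print(first_object_text(sample_response))
```

No request is sent here; the sketch only shows how a page URL and token map onto a request and how text might be read back from the structured JSON.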

Overview

This guide covers how to extract data from a list of URLs with the Diffbot Extract API and turn it into structured JSON that can be used downstream.

Setting Up

Start by installing the required packages.
pip install -qU langchain-community
Diffbot's Extract API requires an API token. Follow the instructions for getting a free API token, then set it as an environment variable.
%env DIFFBOT_API_TOKEN REPLACE_WITH_YOUR_TOKEN
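The `%env` magic only works inside a notebook. For a plain Python script, a minimal sketch of reading the token and failing early when it is missing could look like this; `require_diffbot_token` is a hypothetical helper, not part of LangChain or Diffbot.

```python
import os


def require_diffbot_token(env_var: str = "DIFFBOT_API_TOKEN") -> str:
    """Return the token from the environment, or raise if it is unset or empty."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable before loading.")
    return token
```

Checking up front avoids passing `None` as the token and discovering the problem only when a request fails.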

Using the Document Loader

Import the DiffbotLoader module and instantiate it with a list of URLs and your Diffbot token.
import os

from langchain_community.document_loaders import DiffbotLoader

urls = [
    "https://python.langchain.com/",
]

loader = DiffbotLoader(urls=urls, api_token=os.environ.get("DIFFBOT_API_TOKEN"))
Call the .load() method to see the loaded documents.
loader.load()
[Document(page_content="LangChain is a framework for developing applications powered by large language models (LLMs).\nLangChain simplifies every stage of the LLM application lifecycle:\nDevelopment: Build your applications using LangChain's open-source building blocks and components. Hit the ground running using third-party integrations and Templates.\nProductionization: Use LangSmith to inspect, monitor and evaluate your chains, so that you can continuously optimize and deploy with confidence.\nDeployment: Turn any chain into an API with LangServe.\nlangchain-core: Base abstractions and LangChain Expression Language.\nlangchain-community: Third party integrations.\nPartner packages (e.g. langchain-openai, langchain-anthropic, etc.): Some integrations have been further split into their own lightweight packages that only depend on langchain-core.\nlangchain: Chains, agents, and retrieval strategies that make up an application's cognitive architecture.\nlanggraph: Build robust and stateful multi-actor applications with LLMs by modeling steps as edges and nodes in a graph.\nlangserve: Deploy LangChain chains as REST APIs.\nThe broader ecosystem includes:\nLangSmith: A developer platform that lets you debug, test, evaluate, and monitor LLM applications and seamlessly integrates with LangChain.\nGet started\nWe recommend following our Quickstart guide to familiarize yourself with the framework by building your first LangChain application.\nSee here for instructions on how to install LangChain, set up your environment, and start building.\nnote\nThese docs focus on the Python LangChain library. Head here for docs on the JavaScript LangChain library.\nUse cases\nIf you're looking to build something specific or are more of a hands-on learner, check out our use-cases. 
They're walkthroughs and techniques for common end-to-end tasks, such as:\nQuestion answering with RAG\nExtracting structured output\nChatbots\nand more!\nExpression Language\nLangChain Expression Language (LCEL) is the foundation of many of LangChain's components, and is a declarative way to compose chains. LCEL was designed from day 1 to support putting prototypes in production, with no code changes, from the simplest “prompt + LLM” chain to the most complex chains.\nGet started: LCEL and its benefits\nRunnable interface: The standard interface for LCEL objects\nPrimitives: More on the primitives LCEL includes\nand more!\nEcosystem\n🦜🛠️ LangSmith\nTrace and evaluate your language model applications and intelligent agents to help you move from prototype to production.\n🦜🕸️ LangGraph\nBuild stateful, multi-actor applications with LLMs, built on top of (and intended to be used with) LangChain primitives.\n🦜🏓 LangServe\nDeploy LangChain runnables and chains as REST APIs.\nSecurity\nRead up on our Security best practices to make sure you're developing safely with LangChain.\nAdditional resources\nComponents\nLangChain provides standard, extendable interfaces and integrations for many different components, including:\nIntegrations\nLangChain is part of a rich ecosystem of tools that integrate with our framework and build on top of it. Check out our growing list of integrations.\nGuides\nBest practices for developing with LangChain.\nAPI reference\nHead to the reference section for full documentation of all classes and methods in the LangChain and LangChain Experimental Python packages.\nContributing\nCheck out the developer's guide for guidelines on contributing and help getting your dev environment set up.\nHelp us out by providing feedback on this documentation page:", metadata={'source': 'https://python.langchain.com/'})]
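Each returned Document exposes `page_content` and `metadata`, as the output above shows. The following sketch previews results without printing the full text. `preview` is a hypothetical helper, and `SimpleNamespace` objects stand in for real Documents so the snippet runs without an API call.

```python
from types import SimpleNamespace


def preview(docs, width: int = 60) -> list[str]:
    """Return one line per document: its source URL and the start of its text."""
    lines = []
    for doc in docs:
        source = doc.metadata.get("source", "<unknown>")
        # Truncate first, then flatten newlines so each preview stays on one line.
        snippet = doc.page_content[:width].replace("\n", " ")
        lines.append(f"{source}: {snippet}")
    return lines


# Stand-in for loader.load(); real Documents carry the same two attributes.
docs = [
    SimpleNamespace(
        page_content="LangChain is a framework for developing applications.\nMore text.",
        metadata={"source": "https://python.langchain.com/"},
    )
]
for line in preview(docs):
    print(line)
```

With a real loader, `preview(loader.load())` should work the same way, assuming the documents keep the `source` key in their metadata.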

Transforming Extracted Text into a Graph Document

The structured page content can be processed further with DiffbotGraphTransformer to extract entities and relationships into a graph.
pip install -qU langchain-experimental
from langchain_experimental.graph_transformers.diffbot import DiffbotGraphTransformer

diffbot_nlp = DiffbotGraphTransformer(
    diffbot_api_key=os.environ.get("DIFFBOT_API_TOKEN")
)
graph_documents = diffbot_nlp.convert_to_graph_documents(loader.load())
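A rough sketch of inspecting the result, assuming each graph document carries `nodes` (each with `id` and `type`) and `relationships` (each with `source`, `target`, and `type`), as in LangChain's graph document classes. The helper names are hypothetical, and the sample data is made up so the snippet runs without calling Diffbot.

```python
from collections import Counter
from types import SimpleNamespace


def count_node_types(graph_documents) -> Counter:
    """Count how many nodes of each entity type appear across all documents."""
    counts = Counter()
    for graph_doc in graph_documents:
        counts.update(node.type for node in graph_doc.nodes)
    return counts


def relationship_triples(graph_documents) -> list[tuple[str, str, str]]:
    """Flatten relationships into (source id, relationship type, target id)."""
    return [
        (rel.source.id, rel.type, rel.target.id)
        for graph_doc in graph_documents
        for rel in graph_doc.relationships
    ]


# Made-up stand-in data, not real Diffbot output.
alice = SimpleNamespace(id="Alice", type="Person")
acme = SimpleNamespace(id="Acme", type="Organization")
sample_graph_documents = [
    SimpleNamespace(
        nodes=[alice, acme],
        relationships=[SimpleNamespace(source=alice, target=acme, type="WORKS_FOR")],
    )
]

print(count_node_types(sample_graph_documents))
print(relationship_triples(sample_graph_documents))
```

Passing the real `graph_documents` in place of `sample_graph_documents` should give a quick summary of what was extracted before loading it anywhere.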
To continue loading the data into a Knowledge Graph, see the DiffbotGraphTransformer guide.