This guide provides a quick overview for getting started with the ScrapeGraph tools. For detailed documentation of all ScrapeGraph features and configurations, head to the API reference. For more information about ScrapeGraph AI, see scrapegraphai.com.

Overview

Integration details

| Class | Package |
| --- | --- |
| SmartScraperTool | langchain-scrapegraph |
| SmartCrawlerTool | langchain-scrapegraph |
| MarkdownifyTool | langchain-scrapegraph |
| AgenticScraperTool | langchain-scrapegraph |
| GetCreditsTool | langchain-scrapegraph |

All tools ship in the langchain-scrapegraph package; the latest released version is available on PyPI.

Tool features

| Tool | Purpose | Input | Output |
| --- | --- | --- | --- |
| SmartScraperTool | Extract structured data from websites | URL + prompt | JSON |
| SmartCrawlerTool | Extract data from multiple pages via crawling | URL + prompt + crawl options | JSON |
| MarkdownifyTool | Convert webpages to markdown | URL | Markdown text |
| GetCreditsTool | Check API credits | None | Credit info |

Setup

This integration requires the following package:
pip install --quiet -U langchain-scrapegraph
Note: you may need to restart the kernel to use updated packages.

Credentials

You'll need a ScrapeGraph AI API key to use these tools. Get one at scrapegraphai.com.
import getpass
import os

if not os.environ.get("SGAI_API_KEY"):
    os.environ["SGAI_API_KEY"] = getpass.getpass("ScrapeGraph AI API key:\n")
It's also helpful (but not required) to set up LangSmith for best-in-class observability:
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = getpass.getpass()

Instantiation

Here we show how to create instances of the ScrapeGraph tools:
from scrapegraph_py.logger import sgai_logger
import json

from langchain_scrapegraph.tools import (
    GetCreditsTool,
    MarkdownifyTool,
    SmartCrawlerTool,
    SmartScraperTool,
)

sgai_logger.set_logging(level="INFO")

smartscraper = SmartScraperTool()
smartcrawler = SmartCrawlerTool()
markdownify = MarkdownifyTool()
credits = GetCreditsTool()

Invocation

Invoke directly with args

Let's try each tool individually:

SmartCrawler Tool

The SmartCrawlerTool allows you to crawl multiple pages of a website and extract structured data using advanced crawling options such as depth control, page limits, and domain restrictions.
# SmartScraper
result = smartscraper.invoke(
    {
        "user_prompt": "Extract the company name and description",
        "website_url": "https://scrapegraphai.com",
    }
)
print("SmartScraper Result:", result)

# Markdownify
markdown = markdownify.invoke({"website_url": "https://scrapegraphai.com"})
print("\nMarkdownify Result (first 200 chars):", markdown[:200])

# SmartCrawler
url = "https://scrapegraphai.com/"
prompt = (
    "What does the company do? and I need text content from their privacy and terms"
)

# Use the tool with crawling parameters
result_crawler = smartcrawler.invoke(
    {
        "url": url,
        "prompt": prompt,
        "cache_website": True,
        "depth": 2,
        "max_pages": 2,
        "same_domain_only": True,
    }
)

print("\nSmartCrawler Result:")
print(json.dumps(result_crawler, indent=2))

# Check credits
credits_info = credits.invoke({})
print("\nCredits Info:", credits_info)
SmartScraper Result: {'company_name': 'ScrapeGraphAI', 'description': "ScrapeGraphAI is a powerful AI web scraping tool that turns entire websites into clean, structured data through a simple API. It's designed to help developers and AI companies extract valuable data from websites efficiently and transform it into formats that are ready for use in LLM applications and data analysis."}

Markdownify Result (first 200 chars): [![ScrapeGraphAI Logo](https://scrapegraphai.com/images/scrapegraphai_logo.svg)ScrapeGraphAI](https://scrapegraphai.com/)

PartnersPricingFAQ[Blog](https://scrapegraphai.com/blog)DocsLog inSign up

Op
SmartCrawler Result: {'company_name': 'Company Name', 'description': 'We are a technology company focused on AI solutions.', 'contact': {'email': '[email protected]', 'phone': '(555) 123-4567'}}

Credits Info: {'remaining_credits': 49679, 'total_credits_used': 914}
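
GetCreditsTool returns a plain dict, so you can use it to guard expensive calls. A minimal sketch reusing the credits instance from above (the threshold of 100 is an arbitrary assumption, not part of the API):

# Stop early if the account is running low on credits.
credits_info = credits.invoke({})
if credits_info["remaining_credits"] < 100:
    raise RuntimeError("ScrapeGraph credits running low; top up before scraping.")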
The SmartCrawlerTool can also be run as a standalone example:
# SmartCrawler example
from scrapegraph_py.logger import sgai_logger
import json

from langchain_scrapegraph.tools import SmartCrawlerTool

sgai_logger.set_logging(level="INFO")

# Will automatically get SGAI_API_KEY from environment
tool = SmartCrawlerTool()

# Example based on the provided code snippet
url = "https://scrapegraphai.com/"
prompt = (
    "What does the company do? and I need text content from their privacy and terms"
)

# Use the tool with crawling parameters
result = tool.invoke(
    {
        "url": url,
        "prompt": prompt,
        "cache_website": True,
        "depth": 2,
        "max_pages": 2,
        "same_domain_only": True,
    }
)

print(json.dumps(result, indent=2))

Invoke with ToolCall

We can also invoke the tool with a model-generated ToolCall:
model_generated_tool_call = {
    "args": {
        "user_prompt": "Extract the main heading and description",
        "website_url": "https://scrapegraphai.com",
    },
    "id": "1",
    "name": smartscraper.name,
    "type": "tool_call",
}
smartscraper.invoke(model_generated_tool_call)
ToolMessage(content='{"main_heading": "Get the data you need from any website", "description": "Easily extract and gather information with just a few lines of code with a simple api. Turn websites into clean and usable structured data."}', name='SmartScraper', tool_call_id='1')
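
As the output above shows, the tool's payload comes back as a JSON string in the ToolMessage content, so you can parse it back into a dict when you need the fields programmatically. A minimal sketch, reusing the objects defined above:

import json

tool_msg = smartscraper.invoke(model_generated_tool_call)
parsed = json.loads(tool_msg.content)  # content is a JSON string, as shown above
print(parsed["main_heading"])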

Chaining

Let's use our tools with an LLM to analyze a website:
# | output: false
# | echo: false

# pip install -qU langchain langchain-openai
from langchain.chat_models import init_chat_model

model = init_chat_model(model="gpt-4o", model_provider="openai")
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableConfig, chain

prompt = ChatPromptTemplate(
    [
        (
            "system",
            "You are a helpful assistant that can use tools to extract structured information from websites.",
        ),
        ("human", "{user_input}"),
        ("placeholder", "{messages}"),
    ]
)

model_with_tools = model.bind_tools([smartscraper], tool_choice=smartscraper.name)
model_chain = prompt | model_with_tools


@chain
def tool_chain(user_input: str, config: RunnableConfig):
    input_ = {"user_input": user_input}
    ai_msg = model_chain.invoke(input_, config=config)
    tool_msgs = smartscraper.batch(ai_msg.tool_calls, config=config)
    return model_chain.invoke({**input_, "messages": [ai_msg, *tool_msgs]}, config=config)


tool_chain.invoke(
    "What does ScrapeGraph AI do? Extract this information from their website https://scrapegraphai.com"
)
AIMessage(content='ScrapeGraph AI is an AI-powered web scraping tool that efficiently extracts and converts website data into structured formats via a simple API. It caters to developers, data scientists, and AI researchers, offering features like easy integration, support for dynamic content, and scalability for large projects. It supports various website types, including business, e-commerce, and educational sites. Contact: [email protected].', additional_kwargs={'tool_calls': [{'id': 'call_shkRPyjyAtfjH9ffG5rSy9xj', 'function': {'arguments': '{"user_prompt":"Extract details about the products, services, and key features offered by ScrapeGraph AI, as well as any unique selling points or innovations mentioned on the website.","website_url":"https://scrapegraphai.com"}', 'name': 'SmartScraper'}, 'type': 'function'}], 'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 47, 'prompt_tokens': 480, 'total_tokens': 527, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-2024-08-06', 'system_fingerprint': 'fp_c7ca0ebaca', 'finish_reason': 'stop', 'logprobs': None}, id='run-45a12c86-d499-4273-8c59-0db926799bc7-0', tool_calls=[{'name': 'SmartScraper', 'args': {'user_prompt': 'Extract details about the products, services, and key features offered by ScrapeGraph AI, as well as any unique selling points or innovations mentioned on the website.', 'website_url': 'https://scrapegraphai.com'}, 'id': 'call_shkRPyjyAtfjH9ffG5rSy9xj', 'type': 'tool_call'}], usage_metadata={'input_tokens': 480, 'output_tokens': 47, 'total_tokens': 527, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}})
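
If you'd rather not wire the tool-calling loop by hand, the same tools can be handed to a prebuilt agent. A minimal sketch, assuming langgraph is installed (an alternative pattern, not part of the original example):

# pip install -qU langgraph
from langgraph.prebuilt import create_react_agent

# The agent alternates between the model and tool execution until it can answer.
agent = create_react_agent(model, [smartscraper, markdownify, credits])
response = agent.invoke(
    {"messages": [("human", "What does ScrapeGraph AI do? Check https://scrapegraphai.com")]}
)
print(response["messages"][-1].content)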

API reference

For detailed documentation of all ScrapeGraph features and configurations, head to the LangChain API reference. You can also check out the official SDK repo.