---
title: Scrapeless Universal Scraping
---

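**Scrapeless** provides flexible, feature-rich data collection services with extensive parameter customization and multi-format export support. These capabilities help LangChain integrate and use external data more effectively. The core functional modules are:

**DeepSerp**

- **Google Search**: Enables comprehensive extraction of Google SERP data across all result types.
  - Supports selecting a localized Google domain (e.g., `google.com`, `google.ad`) to retrieve region-specific search results.
  - Supports pagination to retrieve results beyond the first page.
  - Supports a search result filtering toggle that controls whether duplicate or similar content is excluded.
- **Google Trends**: Retrieves keyword trend data from Google, including popularity over time, regional interest, and related searches.
  - Supports multi-keyword comparison.
  - Supports multiple data types: `interest_over_time`, `interest_by_region`, `related_queries`, `related_topics`.
  - Allows filtering by specific Google properties (Web, YouTube, News, Shopping) for source-specific trend analysis.

**Universal Scraping**

- Designed for modern, JavaScript-heavy websites, enabling dynamic content extraction.
  - Global premium proxy support to bypass geo-restrictions and improve reliability.

**Crawler**

- **Crawl**: Recursively crawls a website and its linked pages to extract site-wide content.
  - Supports configurable crawl depth and scoped URL targeting.
- **Scrape**: Extracts content from a single web page with high precision.
  - Supports "main content only" extraction that excludes ads, footers, and other non-essential elements.
  - Allows batch scraping of multiple standalone URLs.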

## Overview

### Integration details

| Class | Package | Serializable | JS support |  Version |
| :--- | :--- | :---: | :---: | :---: |
| [ScrapelessUniversalScrapingTool](https://pypi.org/project/langchain-scrapeless/) | [langchain-scrapeless](https://pypi.org/project/langchain-scrapeless/) | ✅ | ❌ |  ![PyPI - Version](https://img.shields.io/pypi/v/langchain-scrapeless?style=flat-square&label=%20) |

### Tool features

|Native async|Returns artifact|Return data|
|:-:|:-:|:-:|
|✅|✅|html, markdown, links, metadata, structured content|

## Setup

This integration lives in the `langchain-scrapeless` package.

```bash
pip install langchain-scrapeless
```

### Credentials

To use this tool, you need a Scrapeless API key. You can set it as an environment variable:

```python
import os

os.environ["SCRAPELESS_API_KEY"] = "your-api-key"
```
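
If you would rather not hard-code the key, a minimal sketch like the following prompts for it only when the variable is unset. The helper name `ensure_api_key` is hypothetical, and `getpass` comes from the Python standard library, not from `langchain-scrapeless`.

```python
import getpass
import os


def ensure_api_key(env_var: str = "SCRAPELESS_API_KEY") -> str:
    """Return the key from the environment, prompting once if it is missing.

    Hypothetical helper for illustration; it is not part of langchain-scrapeless.
    """
    key = os.environ.get(env_var)
    if not key:
        # getpass does not echo the typed value to the terminal.
        key = getpass.getpass(f"Enter your {env_var}: ")
        os.environ[env_var] = key
    return key
```

Calling `ensure_api_key()` once before creating the tool leaves the key available in `os.environ` for the rest of the session.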

## Instantiation

Here we show how to instantiate the Scrapeless Universal Scraping Tool. The tool scrapes any website using a headless browser with JavaScript rendering, customizable output types, and geo-specific proxy support. It accepts the following parameters, which the examples on this page pass when invoking the tool:

- `url` (required, str): The URL of the website to scrape.
- `headless` (optional, bool): Whether to use a headless browser. Defaults to `True`.
- `js_render` (optional, bool): Whether to enable JavaScript rendering. Defaults to `True`.
- `js_wait_until` (optional, str): Defines when the JavaScript-rendered page is considered ready. Defaults to `'domcontentloaded'`. Options:
  - `load`: Wait until the page is fully loaded.
  - `domcontentloaded`: Wait until the DOM is fully loaded.
  - `networkidle0`: Wait until the network is idle.
  - `networkidle2`: Wait until the network has been idle for 2 seconds.
- `outputs` (optional, str): The specific type of data to extract from the page. Options:
  - `phone_numbers`
  - `headings`
  - `images`
  - `audios`
  - `videos`
  - `links`
  - `menus`
  - `hashtags`
  - `emails`
  - `metadata`
  - `tables`
  - `favicon`
- `response_type` (optional, str): Defines the format of the response. Defaults to `'html'`. Options:
  - `html`: Returns the raw HTML of the page.
  - `plaintext`: Returns the plain text content.
  - `markdown`: Returns a Markdown version of the page.
  - `png`: Returns a PNG screenshot.
  - `jpeg`: Returns a JPEG screenshot.
- `response_image_full_page` (optional, bool): Whether to capture and return a full-page image when using a screenshot output (`png` or `jpeg`). Defaults to `False`.
- `selector` (optional, str): A CSS selector that scopes scraping to part of the page. Defaults to `None`.
- `proxy_country` (optional, str): Two-letter country code for geo-specific proxy access (e.g., `'us'`, `'gb'`, `'de'`, `'jp'`). Defaults to `'ANY'`.
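
The options listed above can be checked client-side before a request is sent. The sketch below is one way to do that, assuming the tool accepts a dictionary keyed by these parameter names (as the invocation examples on this page do). The helper `build_scrape_input` and the three constant sets are hypothetical; they only mirror the documented values, and the service itself may accept more.

```python
from typing import Any, Dict, Optional

# Option values copied from the parameter list above.
JS_WAIT_UNTIL = {"load", "domcontentloaded", "networkidle0", "networkidle2"}
RESPONSE_TYPES = {"html", "plaintext", "markdown", "png", "jpeg"}
OUTPUTS = {
    "phone_numbers", "headings", "images", "audios", "videos", "links",
    "menus", "hashtags", "emails", "metadata", "tables", "favicon",
}


def build_scrape_input(
    url: str,
    response_type: str = "html",
    js_wait_until: str = "domcontentloaded",
    outputs: Optional[str] = None,
    proxy_country: str = "ANY",
) -> Dict[str, Any]:
    """Validate a few documented options and return a tool input dict.

    Hypothetical helper for illustration; not part of langchain-scrapeless.
    """
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"url must start with http:// or https://, got {url!r}")
    if response_type not in RESPONSE_TYPES:
        raise ValueError(f"unknown response_type: {response_type!r}")
    if js_wait_until not in JS_WAIT_UNTIL:
        raise ValueError(f"unknown js_wait_until: {js_wait_until!r}")
    if outputs is not None and outputs not in OUTPUTS:
        raise ValueError(f"unknown outputs value: {outputs!r}")
    # Either the documented default 'ANY' or a two-letter country code.
    if proxy_country != "ANY" and not (len(proxy_country) == 2 and proxy_country.isalpha()):
        raise ValueError(f"proxy_country must be 'ANY' or a two-letter code, got {proxy_country!r}")

    payload: Dict[str, Any] = {
        "url": url,
        "response_type": response_type,
        "js_wait_until": js_wait_until,
        "proxy_country": proxy_country,
    }
    if outputs is not None:
        payload["outputs"] = outputs
    return payload


print(build_scrape_input("https://example.com", response_type="markdown", proxy_country="us"))
```

The returned dictionary could then be passed to the tool, for example `tool.invoke(build_scrape_input("https://example.com", response_type="markdown"))`.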

## Invocation

### Basic Usage

```python
from langchain_scrapeless import ScrapelessUniversalScrapingTool

tool = ScrapelessUniversalScrapingTool()

# Basic usage: pass the URL directly; the default response type is raw HTML
result = tool.invoke("https://example.com")
print(result)
```

```html
<!DOCTYPE html><html><head>
    <title>Example Domain</title>

    <meta charset="utf-8">
    <meta http-equiv="Content-type" content="text/html; charset=utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;

    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>
    <p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>


</body></html>
```
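
Raw HTML like the result above usually needs a parsing step before it is useful downstream. As a minimal sketch, the standard-library `html.parser` module can pull out the `<title>` and `<h1>` text. The class `HeadingExtractor` and the function `extract_headings` are hypothetical names, and `sample` is a shortened stand-in for the scraped page.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class HeadingExtractor(HTMLParser):
    """Collect the text content of <title> and <h1> elements."""

    def __init__(self) -> None:
        super().__init__()
        self._current: Optional[str] = None
        self.found: Dict[str, List[str]] = {"title": [], "h1": []}

    def handle_starttag(self, tag, attrs):
        # Start a new text buffer when a tracked element opens.
        if tag in self.found:
            self._current = tag
            self.found[tag].append("")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        # Append text only while inside a tracked element.
        if self._current:
            self.found[self._current][-1] += data


def extract_headings(html: str) -> Dict[str, List[str]]:
    parser = HeadingExtractor()
    parser.feed(html)
    parser.close()
    return {tag: [text.strip() for text in texts] for tag, texts in parser.found.items()}


# Shortened stand-in for the HTML string returned by the tool.
sample = (
    "<html><head><title>Example Domain</title></head>"
    "<body><div><h1>Example Domain</h1></div></body></html>"
)
print(extract_headings(sample))
```

For anything beyond simple tag lookups, requesting `outputs` or a `selector` from the tool, or using a dedicated HTML library, is likely a better fit than hand-rolled parsing.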

### Advanced Usage with Parameters

```python
from langchain_scrapeless import ScrapelessUniversalScrapingTool

tool = ScrapelessUniversalScrapingTool()

# Pass a dict to set parameters; here the page is returned as Markdown
result = tool.invoke({"url": "https://exmaple.com", "response_type": "markdown"})
print(result)
```

```markdown
# Well hello there.

Welcome to exmaple.com.
Chances are you got here by mistake (example.com, anyone?)
```
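
The Tool features table lists native async support, and LangChain tools expose an `ainvoke` method. A minimal sketch of scraping several pages concurrently could look like the following. The helper `scrape_many` is hypothetical, and `EchoTool` is only a stand-in so the snippet runs without an API key; in practice you would pass a `ScrapelessUniversalScrapingTool` instance instead.

```python
import asyncio
from typing import Any, Dict, List


async def scrape_many(tool: Any, urls: List[str], response_type: str = "markdown") -> Dict[str, Any]:
    """Call tool.ainvoke once per URL concurrently and map each URL to its result."""
    tasks = [tool.ainvoke({"url": url, "response_type": response_type}) for url in urls]
    # gather preserves the order of its inputs, so results line up with urls.
    results = await asyncio.gather(*tasks)
    return dict(zip(urls, results))


class EchoTool:
    """Stand-in exposing the same ainvoke shape as a LangChain tool (illustration only)."""

    async def ainvoke(self, tool_input: Dict[str, Any]) -> str:
        await asyncio.sleep(0)  # yield control, as a real network call would
        return f"scraped {tool_input['url']} as {tool_input['response_type']}"


pages = asyncio.run(
    scrape_many(EchoTool(), ["https://example.com", "https://www.scrapeless.com/en"])
)
for url, content in pages.items():
    print(url, "->", content)
```

Inside a notebook or another environment that already runs an event loop, `await scrape_many(...)` would replace the `asyncio.run` call.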

### Use within an agent

```python
from langchain_openai import ChatOpenAI
from langchain_scrapeless import ScrapelessUniversalScrapingTool
from langchain.agents import create_agent


model = ChatOpenAI()

tool = ScrapelessUniversalScrapingTool()

# Use the tool with an agent
tools = [tool]
agent = create_agent(model, tools)

# Stream the agent's messages and print the latest one at each step
for chunk in agent.stream(
    {
        "messages": [
            (
                "human",
                "Use the scrapeless scraping tool to fetch https://www.scrapeless.com/en and extract the h1 tag.",
            )
        ]
    },
    stream_mode="values",
):
    chunk["messages"][-1].pretty_print()
```

```text
================================ Human Message =================================

Use the scrapeless scraping tool to fetch https://www.scrapeless.com/en and extract the h1 tag.
================================== Ai Message ==================================
Tool Calls:
  scrapeless_universal_scraping (call_jBrvMVL2ixhvf6gklhi7Gqtb)
 Call ID: call_jBrvMVL2ixhvf6gklhi7Gqtb
  Args:
    url: https://www.scrapeless.com/en
    outputs: headings
================================= Tool Message =================================
Name: scrapeless_universal_scraping

{"headings":["Effortless Web Scraping Toolkitfor Business and Developers","4.8","4.5","8.5","A Flexible Toolkit for Accessing Public Web Data","Deep SerpApi","Scraping Browser","Universal Scraping API","Customized Services","From Simple Data Scraping to Complex Anti-Bot Challenges, Scrapeless Has You Covered.","Fully Compatible with Key Programming Languages and Tools","Enterprise-level Data Scraping Solution","Customized Data Scraping Solutions","High Concurrency and High-Performance Scraping","Data Cleaning and Transformation","Real-Time Data Push and API Integration","Data Security and Privacy Protection","Enterprise-level SLA","Why Scrapeless: Simplify Your Data Flow Effortlessly.","Articles","Organized Fresh Data","Prices","No need to hassle with browser maintenance","Reviews","Only pay for successful requests","Products","Fully scalable","Unleash Your Competitive Edgein Data within the Industry","Regulate Compliance for All Users","Web Scraping Blog","Scrapeless MCP Server Is Officially Live! Build Your Ultimate AI-Web Connector","Product Updates | New Profile Feature","How to Track Your Ranking on ChatGPT?","For Scraping","For Data","For AI","Top Scraper API","Learning Center","Legal"]}
================================== Ai Message ==================================

The h1 tag extracted from the website https://www.scrapeless.com/en is "Effortless Web Scraping Toolkit for Business and Developers".
```
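
In the run above, the tool message is a JSON string with a `headings` list, and the model picked the first entry as the h1. If you want that step in plain code rather than leaving it to the model, a minimal sketch follows. It assumes the message content has the shape shown above; `headings_from_tool_message` is a hypothetical helper, and `sample` is a shortened copy of the real output. Since the list mixes all heading levels, treating the first item as the h1 is itself an assumption.

```python
import json
from typing import List


def headings_from_tool_message(content: str) -> List[str]:
    """Parse a '{"headings": [...]}' JSON string and return the non-empty headings."""
    data = json.loads(content)
    headings = data.get("headings", [])
    # Collapse runs of whitespace and drop entries that are empty after stripping.
    return [" ".join(h.split()) for h in headings if h.strip()]


# Shortened copy of the tool message content shown above.
sample = (
    '{"headings": ["Effortless Web Scraping Toolkitfor Business and Developers", '
    '"4.8", "Deep SerpApi"]}'
)
headings = headings_from_tool_message(sample)
print(headings[0])
```

Note that the raw heading lacks the space in "Toolkitfor"; the model's final answer added it, which plain parsing will not do.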

## API reference
