SitemapLoader

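SitemapLoader extends WebBaseLoader. It loads a sitemap from a given URL, then scrapes and loads every page listed in the sitemap, returning each page as a Document. Scraping is done concurrently. Concurrent requests are capped at a reasonable limit, 2 per second by default. If you aren't concerned about being a good citizen, or you control the server you are scraping, or you don't care about load, you can raise this limit. Note that raising it speeds up scraping but may get you blocked by the server. Be careful!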

Overview

Integration details

Class	Package	Local	Serializable	JS support
SitemapLoader	langchain-community	✅	❌	✅

Loader features

Source	Document Lazy Loading	Native Async Support
SitemapLoader	✅	❌

Setup

To access the SitemapLoader document loader, you need to install the langchain-community integration package.

Credentials

No credentials are needed to run this. To enable automated tracing of your model calls, set your LangSmith API key:

import getpass
import os

os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
os.environ["LANGSMITH_TRACING"] = "true"

Installation

Install langchain-community.

pip install -qU langchain-community

Fix notebook asyncio bug

import nest_asyncio

nest_asyncio.apply()

Initialization

Now we can instantiate our loader object and load documents:

from langchain_community.document_loaders.sitemap import SitemapLoader

sitemap_loader = SitemapLoader(web_path="https://api.python.langchain.com/sitemap.xml")

Load

docs = sitemap_loader.load()
docs[0]

Fetching pages: 100%|##########| 28/28 [00:04<00:00,  6.83it/s]

Document(metadata={'source': 'https://api.python.langchain.com/en/stable/', 'loc': 'https://api.python.langchain.com/en/stable/', 'lastmod': '2024-05-15T00:29:42.163001+00:00', 'changefreq': 'weekly', 'priority': '1'}, page_content='\n\n\n\n\n\n\n\n\n\nLangChain Python API Reference Documentation.\n\n\nYou will be automatically redirected to the new location of this page.\n\n')

print(docs[0].metadata)

{'source': 'https://api.python.langchain.com/en/stable/', 'loc': 'https://api.python.langchain.com/en/stable/', 'lastmod': '2024-05-15T00:29:42.163001+00:00', 'changefreq': 'weekly', 'priority': '1'}

You can raise the maximum number of concurrent requests by changing the requests_per_second parameter, and use requests_kwargs to pass kwargs when sending requests.

sitemap_loader.requests_per_second = 2
# Optional: avoid `[SSL: CERTIFICATE_VERIFY_FAILED]` issue
sitemap_loader.requests_kwargs = {"verify": False}
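
To make the effect of a concurrency cap concrete, here is a minimal standard-library sketch. It assumes the cap works like a semaphore that bounds the number of requests in flight; that is an assumption for illustration, not a statement about the loader's internals. The `fetch_all` helper and the example.com URLs are hypothetical, and `asyncio.sleep` stands in for a real HTTP request.

```python
import asyncio


async def fetch_all(urls, max_concurrent=2):
    """Simulate fetching `urls` with at most `max_concurrent` requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def fetch(url):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for the real HTTP request
            in_flight -= 1
            return f"contents of {url}"

    pages = await asyncio.gather(*(fetch(u) for u in urls))
    return pages, peak


# Hypothetical URLs, used only to drive the simulation.
urls = [f"https://example.com/page-{i}" for i in range(6)]
pages, peak = asyncio.run(fetch_all(urls))
print(len(pages), peak)
```

Raising `max_concurrent` lets more simulated requests overlap, which is the same trade-off described above: faster scraping in exchange for more load on the server.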

Lazy Load

You can also load pages lazily to minimize the memory load.

page = []
for doc in sitemap_loader.lazy_load():
    page.append(doc)
    if len(page) >= 10:
        # do some paged operation, e.g.
        # index.upsert(page)

        page = []

Fetching pages: 100%|##########| 28/28 [00:01<00:00, 19.06it/s]
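
Note that the loop above resets page only after a full batch of 10, so with 28 pages the last 8 documents would still be sitting in page when the loop ends. One way to handle the remainder is a small batching helper, sketched below. `batched` is a hypothetical helper, not part of LangChain, and `fake_docs` is a stand-in for `sitemap_loader.lazy_load()` so the example runs without network access.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of up to `size` items, including the final partial batch."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Stand-in for sitemap_loader.lazy_load(): any iterator of documents works.
fake_docs = (f"doc-{i}" for i in range(28))

batch_sizes = [len(batch) for batch in batched(fake_docs, 10)]
print(batch_sizes)  # [10, 10, 8]
```

Because the helper pulls from the iterator only as each batch is requested, at most one batch is held in memory at a time.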

Filtering sitemap URLs

Sitemaps can be massive files with thousands of URLs. Often you don't need every one of them. You can filter the URLs by passing a list of strings or regex patterns to the filter_urls parameter. Only URLs that match one of the patterns are loaded.

loader = SitemapLoader(
    web_path="https://api.python.langchain.com/sitemap.xml",
    filter_urls=["https://api.python.langchain.com/en/latest"],
)
documents = loader.load()

documents[0]

Document(page_content='\n\n\n\n\n\n\n\n\n\nLangChain Python API Reference Documentation.\n\n\nYou will be automatically redirected to the new location of this page.\n\n', metadata={'source': 'https://api.python.langchain.com/en/latest/', 'loc': 'https://api.python.langchain.com/en/latest/', 'lastmod': '2024-02-12T05:26:10.971077+00:00', 'changefreq': 'daily', 'priority': '0.9'})
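
The matching rule can be sketched with the standard library alone. The sketch assumes each pattern is applied with `re.match`, that is, anchored at the start of the URL; this is an assumption about the matching semantics, so check the API reference before relying on it. `filter_sitemap_urls` is a hypothetical helper, and the `index.html` URL is made up for illustration.

```python
import re
from typing import List


def filter_sitemap_urls(urls: List[str], patterns: List[str]) -> List[str]:
    """Keep URLs that match at least one pattern, anchored at the start of the URL."""
    return [url for url in urls if any(re.match(p, url) for p in patterns)]


urls = [
    "https://api.python.langchain.com/en/stable/",
    "https://api.python.langchain.com/en/latest/",
    "https://api.python.langchain.com/en/latest/index.html",  # illustrative URL
]
kept = filter_sitemap_urls(urls, ["https://api.python.langchain.com/en/latest"])
print(kept)
```

Under this assumption a plain string behaves as a prefix pattern, and an unescaped `.` matches any character. Wrap the string in `re.escape` if you need a strictly literal match.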

Add custom scraping rules

SitemapLoader uses beautifulsoup4 for scraping, and by default it scrapes every element on the page. The SitemapLoader constructor accepts a custom scraping function, which is useful for tailoring the scraping process to your needs. For example, you might want to avoid scraping headers or navigation elements. The following example shows how to write and use a custom function that skips navigation and header elements. Import the beautifulsoup4 library and define the custom function.

pip install beautifulsoup4

from bs4 import BeautifulSoup


def remove_nav_and_header_elements(content: BeautifulSoup) -> str:
    # Find all 'nav' and 'header' elements in the BeautifulSoup object
    nav_elements = content.find_all("nav")
    header_elements = content.find_all("header")

    # Remove each 'nav' and 'header' element from the BeautifulSoup object
    for element in nav_elements + header_elements:
        element.decompose()

    return str(content.get_text())

Add the custom function to the SitemapLoader object.

loader = SitemapLoader(
    "https://api.python.langchain.com/sitemap.xml",
    filter_urls=["https://api.python.langchain.com/en/latest/"],
    parsing_function=remove_nav_and_header_elements,
)
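
A parsing function can be tried out on a small HTML string before it is handed to the loader. The sketch below shows a different rule from the one above: it keeps only the text inside `<main>` and falls back to the whole page when there is none. `extract_main_text` and the sample HTML are hypothetical, and the sketch assumes the function receives a BeautifulSoup object and returns a string, as in the example above.

```python
from bs4 import BeautifulSoup


def extract_main_text(content: BeautifulSoup) -> str:
    """Return text from the <main> element, or from the whole page if there is none."""
    main = content.find("main")
    target = main if main is not None else content
    # Join text fragments with single spaces and trim surrounding whitespace.
    return target.get_text(separator=" ", strip=True)


# Hypothetical page used only to exercise the function offline.
html = (
    "<html><body><nav>Menu</nav>"
    "<main><h1>Title</h1><p>Body text.</p></main>"
    "<footer>Footer</footer></body></html>"
)
soup = BeautifulSoup(html, "html.parser")
text = extract_main_text(soup)
print(text)  # Title Body text.
```

Once it behaves as expected, the function can be passed as parsing_function in the same way as remove_nav_and_header_elements.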

Local Sitemap

The sitemap loader can also be used to load local files.

sitemap_loader = SitemapLoader(web_path="example_data/sitemap.xml", is_local=True)

docs = sitemap_loader.load()
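
For reference, a sitemap file is an XML document with one `<url>` entry per page. The sketch below reads such entries with the standard library to show where metadata keys like loc, lastmod, changefreq and priority come from. It uses values taken from the output shown earlier; `read_sitemap_entries` is a hypothetical helper, and the loader itself may parse the file differently.

```python
import xml.etree.ElementTree as ET

# A one-entry sitemap built from the metadata shown in the Load section.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://api.python.langchain.com/en/stable/</loc>
    <lastmod>2024-05-15T00:29:42.163001+00:00</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1</priority>
  </url>
</urlset>
"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def read_sitemap_entries(xml_text: str) -> list:
    """Return one dict per <url> entry, keyed by child tag name without namespace."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        entry = {}
        for child in url:
            tag = child.tag.split("}", 1)[-1]  # drop the "{namespace}" prefix
            entry[tag] = (child.text or "").strip()
        entries.append(entry)
    return entries


entries = read_sitemap_entries(SITEMAP_XML)
print(entries[0]["loc"])
```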

API reference

For detailed documentation of all SitemapLoader features and configurations, see the API reference: python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.sitemap.SitemapLoader.html#langchain_community.document_loaders.sitemap.SitemapLoader
