WebBaseLoader를 사용하여 HTML 웹페이지의 모든 텍스트를 다운스트림에서 사용할 수 있는 document 형식으로 로드하는 방법을 다룹니다. 웹페이지 로딩을 위한 더 맞춤화된 로직은 IMSDbLoader, AZLyricsLoader, CollegeConfidentialLoader와 같은 자식 클래스 예제를 참고하세요. 웹사이트 크롤링, JS 차단 사이트 우회, 데이터 정리에 대해 걱정하고 싶지 않다면 FireCrawlLoader 또는 더 빠른 옵션인 SpiderLoader 사용을 고려하세요.

Overview

Integration details

  • TODO: Fill in table features.
  • TODO: Remove JS support link if not relevant, otherwise ensure link is correct.
  • TODO: Make sure API reference links are correct.
ClassPackageLocalSerializableJS support
WebBaseLoaderlangchain-community

Loader features

SourceDocument Lazy LoadingNative Async Support
WebBaseLoader

Setup

Credentials

WebBaseLoader는 자격 증명이 필요하지 않습니다.

Installation

WebBaseLoader를 사용하려면 먼저 langchain-community python package를 설치해야 합니다.
pip install -qU langchain-community beautifulsoup4

Initialization

이제 model object를 인스턴스화하고 document를 로드할 수 있습니다:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://www.example.com/")
가져오는 동안 SSL 검증 오류를 우회하려면 “verify” 옵션을 설정할 수 있습니다: loader.requests_kwargs = {'verify':False}

Initialization with multiple pages

여러 페이지 목록을 전달하여 로드할 수도 있습니다.
loader_multiple_pages = WebBaseLoader(
    ["https://www.example.com/", "https://google.com"]
)

Load

docs = loader.load()

docs[0]
Document(metadata={'source': 'https://www.example.com/', 'title': 'Example Domain', 'language': 'No language found.'}, page_content='\n\n\nExample Domain\n\n\n\n\n\n\n\nExample Domain\nThis domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission.\nMore information...\n\n\n\n')
print(docs[0].metadata)
{'source': 'https://www.example.com/', 'title': 'Example Domain', 'language': 'No language found.'}

Load multiple urls concurrently

여러 url을 동시에 스크래핑하고 파싱하여 스크래핑 프로세스를 가속화할 수 있습니다. 동시 요청에는 합리적인 제한이 있으며, 기본값은 초당 2개입니다. 좋은 시민이 되는 것에 관심이 없거나, 스크래핑하는 서버를 제어하고 부하를 신경 쓰지 않는다면 requests_per_second 매개변수를 변경하여 최대 동시 요청 수를 늘릴 수 있습니다. 이렇게 하면 스크래핑 프로세스가 빨라지지만 서버가 차단할 수 있습니다. 주의하세요!
pip install -qU  nest_asyncio

# fixes a bug with asyncio and jupyter
import nest_asyncio

nest_asyncio.apply()
Note: you may need to restart the kernel to use updated packages.
loader = WebBaseLoader(["https://www.example.com/", "https://google.com"])
loader.requests_per_second = 1
docs = loader.aload()
docs
Fetching pages: 100%|###########################################################################| 2/2 [00:00<00:00,  8.28it/s]
[Document(metadata={'source': 'https://www.example.com/', 'title': 'Example Domain', 'language': 'No language found.'}, page_content='\n\n\nExample Domain\n\n\n\n\n\n\n\nExample Domain\nThis domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission.\nMore information...\n\n\n\n'),
 Document(metadata={'source': 'https://google.com', 'title': 'Google', 'description': "Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for.", 'language': 'en'}, page_content='GoogleSearch Images Maps Play YouTube News Gmail Drive More »Web History | Settings | Sign in\xa0Advanced search5 ways Gemini can help during the HolidaysAdvertisingBusiness SolutionsAbout Google© 2024 - Privacy - Terms  ')]

Loading a xml file, or using a different BeautifulSoup parser

xml 파일을 로드하는 예제는 SitemapLoader를 참고할 수도 있으며, 이는 이 기능을 사용하는 예제입니다.
loader = WebBaseLoader(
    "https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml"
)
loader.default_parser = "xml"
docs = loader.load()
docs
[Document(metadata={'source': 'https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml'}, page_content='\n\n10\nEnergy\n3\n2018-01-01\n2018-01-01\nfalse\nUniform test method for the measurement of energy efficiency of commercial packaged boilers.\n§ 431.86\nSection § 431.86\n\nEnergy\nDEPARTMENT OF ENERGY\nENERGY CONSERVATION\nENERGY EFFICIENCY PROGRAM FOR CERTAIN COMMERCIAL AND INDUSTRIAL EQUIPMENT\nCommercial Packaged Boilers\nTest Procedures\n\n\n\n\n§\u2009431.86\nUniform test method for the measurement of energy efficiency of commercial packaged boilers.\n(a) Scope. This section provides test procedures, pursuant to the Energy Policy and Conservation Act (EPCA), as amended, which must be followed for measuring the combustion efficiency and/or thermal efficiency of a gas- or oil-fired commercial packaged boiler.\n(b) Testing and Calculations. Determine the thermal efficiency or combustion efficiency of commercial packaged boilers by conducting the appropriate test procedure(s) indicated in Table 1 of this section.\n\nTable 1—Test Requirements for Commercial Packaged Boiler Equipment Classes\n\nEquipment category\nSubcategory\nCertified rated inputBtu/h\n\nStandards efficiency metric(§\u2009431.87)\n\nTest procedure(corresponding to\nstandards efficiency\nmetric required\nby §\u2009431.87)\n\n\n\nHot Water\nGas-fired\n≥300,000 and ≤2,500,000\nThermal Efficiency\nAppendix A, Section 2.\n\n\nHot Water\nGas-fired\n>2,500,000\nCombustion Efficiency\nAppendix A, Section 3.\n\n\nHot Water\nOil-fired\n≥300,000 and ≤2,500,000\nThermal Efficiency\nAppendix A, Section 2.\n\n\nHot Water\nOil-fired\n>2,500,000\nCombustion Efficiency\nAppendix A, Section 3.\n\n\nSteam\nGas-fired (all*)\n≥300,000 and ≤2,500,000\nThermal Efficiency\nAppendix A, Section 2.\n\n\nSteam\nGas-fired (all*)\n>2,500,000 and ≤5,000,000\nThermal Efficiency\nAppendix A, Section 2.\n\n\n\u2003\n\n>5,000,000\nThermal Efficiency\nAppendix A, Section 2.OR\nAppendix A, Section 3 with Section 2.4.3.2.\n\n\n\nSteam\nOil-fired\n≥300,000 and ≤2,500,000\nThermal Efficiency\nAppendix A, Section 2.\n\n\nSteam\nOil-fired\n>2,500,000 and ≤5,000,000\nThermal Efficiency\nAppendix A, Section 2.\n\n\n\u2003\n\n>5,000,000\nThermal Efficiency\nAppendix A, Section 2.OR\nAppendix A, Section 3. with Section 2.4.3.2.\n\n\n\n*\u2009Equipment classes for commercial packaged boilers as of July 22, 2009 (74 FR 36355) distinguish between gas-fired natural draft and all other gas-fired (except natural draft).\n\n(c) Field Tests. The field test provisions of appendix A may be used only to test a unit of commercial packaged boiler with rated input greater than 5,000,000 Btu/h.\n[81 FR 89305, Dec. 9, 2016]\n\n\nEnergy Efficiency Standards\n\n')]

Lazy Load

메모리 요구 사항을 최소화하기 위해 lazy loading을 사용하여 한 번에 하나의 페이지만 로드할 수 있습니다.
pages = []
for doc in loader.lazy_load():
    pages.append(doc)

print(pages[0].page_content[:100])
print(pages[0].metadata)
10
Energy
3
2018-01-01
2018-01-01
false
Uniform test method for the measurement of energy efficien
{'source': 'https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml'}

Async

pages = []
async for doc in loader.alazy_load():
    pages.append(doc)

print(pages[0].page_content[:100])
print(pages[0].metadata)
Fetching pages: 100%|###########################################################################| 1/1 [00:00<00:00, 10.51it/s]
10
Energy
3
2018-01-01
2018-01-01
false
Uniform test method for the measurement of energy efficien
{'source': 'https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml'}

Using proxies

때때로 IP 차단을 우회하기 위해 proxy를 사용해야 할 수 있습니다. loader(및 내부의 requests)에 proxy dictionary를 전달하여 사용할 수 있습니다.
loader = WebBaseLoader(
        "https://www.walmart.com/search?q=parrots",
        proxies={
        "http": "http://{username}:{password}:@proxy.service.com:6666/",
        "https": "https://{username}:{password}:@proxy.service.com:6666/",
    },
)
docs = loader.load()

API reference

모든 WebBaseLoader 기능 및 구성에 대한 자세한 문서는 API reference를 참조하세요: python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.web_base.WebBaseLoader.html
Connect these docs programmatically to Claude, VSCode, and more via MCP for real-time answers.
I