---
title: MediaWiki Dump
---

>[MediaWiki XML Dumps](https://www.mediawiki.org/wiki/Manual:Importing_XML_dumps)는 위키의 콘텐츠(모든 수정 버전이 포함된 위키 페이지)를 포함하지만 사이트 관련 데이터는 포함하지 않습니다. XML dump는 위키 데이터베이스의 전체 백업을 생성하지 않으며, 사용자 계정, 이미지, 편집 로그 등은 포함하지 않습니다.

이 문서는 MediaWiki XML dump 파일을 다운스트림에서 사용할 수 있는 document 형식으로 로드하는 방법을 다룹니다.

`mediawiki-utilities``mwxml`을 사용하여 dump하고 `earwig``mwparserfromhell`을 사용하여 MediaWiki wikicode를 파싱합니다.

Dump 파일은 dumpBackup.php를 사용하거나 Wiki의 Special:Statistics 페이지에서 얻을 수 있습니다.

```python
# mediawiki-utilities supports XML schema 0.11 in unmerged branches
pip install -qU git+https://github.com/mediawiki-utilities/python-mwtypes@updates_schema_0.11
# mediawiki-utilities mwxml has a bug, fix PR pending
pip install -qU git+https://github.com/gdedrouas/python-mwxml@xml_format_0.11
pip install -qU mwparserfromhell
from langchain_community.document_loaders import MWDumpLoader
loader = MWDumpLoader(
    file_path="example_data/testmw_pages_current.xml",
    encoding="utf8",
    # namespaces = [0,2,3] Optional list to load only specific namespaces. Loads all namespaces by default.
    skip_redirects=True,  # will skip over pages that just redirect to other pages (or not if False)
    stop_on_error=False,  # will skip over pages that cause parsing errors (or not if False)
)
documents = loader.load()
print(f"You have {len(documents)} document(s) in your data ")
You have 177 document(s) in your data
documents[:5]
[Document(page_content='\t\n\t\n\tArtist\n\tReleased\n\tRecorded\n\tLength\n\tLabel\n\tProducer', metadata={'source': 'Album'}),
 Document(page_content='{| class="article-table plainlinks" style="width:100%;"\n|- style="font-size:18px;"\n! style="padding:0px;" | Template documentation\n|-\n| Note: portions of the template sample may not be visible without values provided.\n|-\n| View or edit this documentation. (About template documentation)\n|-\n| Editors can experiment in this template\'s [ sandbox] and [ test case] pages.\n|}Category:Documentation templates', metadata={'source': 'Documentation'}),
 Document(page_content='Description\nThis template is used to insert descriptions on template pages.\n\nSyntax\nAdd <noinclude></noinclude> at the end of the template page.\n\nAdd <noinclude></noinclude> to transclude an alternative page from the /doc subpage.\n\nUsage\n\nOn the Template page\nThis is the normal format when used:\n\nTEMPLATE CODE\n<includeonly>Any categories to be inserted into articles by the template</includeonly>\n<noinclude>{{Documentation}}</noinclude>\n\nIf your template is not a completed div or table, you may need to close the tags just before {{Documentation}} is inserted (within the noinclude tags).\n\nA line break right before {{Documentation}} can also be useful as it helps prevent the documentation template "running into" previous code.\n\nOn the documentation page\nThe documentation page is usually located on the /doc subpage for a template, but a different page can be specified with the first parameter of the template (see Syntax).\n\nNormally, you will want to write something like the following on the documentation page:\n\n==Description==\nThis template is used to do something.\n\n==Syntax==\nType <code>{{t|templatename}}</code> somewhere.\n\n==Samples==\n<code><nowiki>{{templatename|input}}</nowiki></code> \n\nresults in...\n\n{{templatename|input}}\n\n<includeonly>Any categories for the template itself</includeonly>\n<noinclude>[[Category:Template documentation]]</noinclude>\n\nUse any or all of the above description/syntax/sample output sections. You may also want to add "see also" or other sections.\n\nNote that the above example also uses the Template:T template.\n\nCategory:Documentation templatesCategory:Template documentation', metadata={'source': 'Documentation/doc'}),
 Document(page_content='Description\nA template link with a variable number of parameters (0-20).\n\nSyntax\n \n\nSource\nImproved version not needing t/piece subtemplate developed on Templates wiki see the list of authors. Copied here via CC-By-SA 3.0 license.\n\nExample\n\nCategory:General wiki templates\nCategory:Template documentation', metadata={'source': 'T/doc'}),
 Document(page_content='\t\n\t\t    \n\t\n\t\t    Aliases\n\t    Relatives\n\t    Affiliation\n        Occupation\n    \n            Biographical information\n        Marital status\n    \tDate of birth\n        Place of birth\n        Date of death\n        Place of death\n    \n            Physical description\n        Species\n        Gender\n        Height\n        Weight\n        Eye color\n\t\n           Appearances\n       Portrayed by\n       Appears in\n       Debut\n    ', metadata={'source': 'Character'})]

---

<Callout icon="pen-to-square" iconType="regular">
    [Edit the source of this page on GitHub.](https://github.com/langchain-ai/docs/edit/main/src/oss/python/integrations/document_loaders/mediawikidump.mdx)
</Callout>
<Tip icon="terminal" iconType="regular">
    [Connect these docs programmatically](/use-these-docs) to Claude, VSCode, and more via MCP for    real-time answers.
</Tip>
I