
Recursive URL

The RecursiveUrlLoader lets you recursively scrape all child links from a root URL and parse them into Documents.


The RecursiveUrlLoader lives in the langchain-community package. No other packages are required, though you will get richer default Document metadata if beautifulsoup4 is installed as well.

%pip install -qU langchain-community beautifulsoup4



from langchain_community.document_loaders import RecursiveUrlLoader

loader = RecursiveUrlLoader(
    "",  # root URL to crawl
    # Optional keyword arguments:
    # max_depth=2,
    # use_async=False,
    # extractor=None,
    # metadata_extractor=None,
    # exclude_dirs=(),
    # timeout=10,
    # check_response_status=True,
    # continue_on_failure=True,
    # prevent_outside=True,
    # base_url=None,
    # ...
)


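Use .load() to synchronously load all Documents into memory, with one Document per visited URL. Starting from the initial URL, the loader recurses through all linked URLs up to the specified max_depth.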
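To make the traversal concrete, here is a minimal sketch of a depth-limited recursive crawl over an in-memory link graph. It is not the loader's actual implementation: `LINKS`, `crawl`, and the example.com URLs are hypothetical, and a real crawl would fetch each page over HTTP and extract its links instead of reading them from a dictionary.

```python
from typing import Dict, List, Optional, Set

# Hypothetical link graph standing in for real pages: URL -> links found on that page.
LINKS: Dict[str, List[str]] = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/a/deep", "https://example.com/"],
    "https://example.com/b": [],
    "https://example.com/a/deep": ["https://example.com/a/deeper"],
    "https://example.com/a/deeper": [],
}


def crawl(
    url: str,
    max_depth: int,
    depth: int = 0,
    visited: Optional[Set[str]] = None,
) -> List[str]:
    """Return URLs visited depth-first, each at most once, stopping at max_depth."""
    if visited is None:
        visited = set()
    # Stop once the depth budget is used up or the page was already seen.
    if depth >= max_depth or url in visited:
        return []
    visited.add(url)
    pages = [url]
    for link in LINKS.get(url, []):
        pages.extend(crawl(link, max_depth, depth + 1, visited))
    return pages


visited_pages = crawl("https://example.com/", max_depth=2)
print(visited_pages)
```

In this sketch a depth budget of 2 covers the root page and its direct children; the `visited` set keeps the back-link from `/a` to the root from causing a loop.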

Let's run through a basic example of how to use the RecursiveUrlLoader on the Python 3.9 documentation.

docs = loader.load()
/Users/bagatur/.pyenv/versions/3.9.1/lib/python3.9/html/ XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
k = self.parse_starttag(i)
{'source': '',
'content_type': 'text/html',
'title': '3.9.19 Documentation',
'language': None}


{'source': '',
'content_type': 'text/html',
'title': 'Python Setup and Usage — Python 3.9.19 documentation',
'language': None}

That URL looks like a child of our root page, which is great! Let's move on from metadata and examine the content of one of our documents.


<!DOCTYPE html>

<html xmlns="">
<meta charset="utf-8" /><title>3.9.19 Documentation</title><meta name="viewport" content="width=device-width, initial-scale=1.0">

<link rel="stylesheet" href="_static/pydoctheme.css" type="text/css" />
<link rel=

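That certainly looks like HTML that comes from the URL, which is what we expected. Let's now look at some variations on the basic example that can be helpful in different situations.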


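By default the loader sets the raw HTML from each link as the Document page content. To parse this HTML into a more human/LLM-friendly format, you can pass in a custom extractor method: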

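import re

from bs4 import BeautifulSoup


def bs4_extractor(html: str) -> str:
    # Parse the HTML and keep only its text (the "lxml" parser requires the lxml package).
    soup = BeautifulSoup(html, "lxml")
    # Collapse runs of blank lines into a single blank line.
    return re.sub(r"\n\n+", "\n\n", soup.text).strip()


loader = RecursiveUrlLoader("", extractor=bs4_extractor)
docs = loader.load()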
3.9.19 Documentation


Python 3.13 (in development)
Python 3.12 (stable)
Python 3.11 (security-fixes)
Python 3.10 (security-fixes)
Python 3.9 (security


You can similarly pass in a metadata_extractor to customize how Document metadata is extracted from the HTTP response. See the API reference for more on this.
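As a rough illustration, a custom metadata extractor could look like the sketch below. It assumes the callable form that receives the raw HTML, the page URL, and the HTTP response object; check the API reference for the signatures your installed version accepts. The names `simple_metadata_extractor`, `FakeResponse`, and `root_url` are hypothetical and exist only for this example.

```python
import re
from typing import Any, Dict


def simple_metadata_extractor(raw_html: str, url: str, response: Any) -> Dict[str, Any]:
    """Build a metadata dict from the page HTML, its URL, and the HTTP response."""
    metadata: Dict[str, Any] = {"source": url}
    # Pull the <title> text with a regex so the sketch needs no HTML parser.
    match = re.search(r"<title[^>]*>(.*?)</title>", raw_html, re.IGNORECASE | re.DOTALL)
    if match:
        metadata["title"] = match.group(1).strip()
    # Read headers defensively so the function also works with stand-in objects.
    headers = getattr(response, "headers", None) or {}
    metadata["content_type"] = headers.get("Content-Type")
    return metadata


class FakeResponse:
    """Hypothetical stand-in for an HTTP response, used only for this demo."""

    headers = {"Content-Type": "text/html"}


demo = simple_metadata_extractor(
    "<html><head><title> Demo page </title></head></html>",
    "https://example.com/",
    FakeResponse(),
)
print(demo)

# Assumed usage with the loader (not executed here; root_url is a placeholder):
# loader = RecursiveUrlLoader(root_url, metadata_extractor=simple_metadata_extractor)
```

Keeping the extractor a plain function of its inputs makes it easy to check against a stand-in response before running a full crawl.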



# Lazily load Documents one at a time and process them in pages of 10.
page = []
for doc in loader.lazy_load():
    page.append(doc)
    if len(page) >= 10:
        # do some paged operation, e.g.
        # index.upsert(page)

        page = []
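The paging loop above can also be factored into a small generic helper. This is a sketch that assumes `loader.lazy_load()` behaves like any other Python iterator; `batched` and the page size of 10 are illustrative choices, not part of LangChain. Unlike the loop above, the helper also yields the final partial page.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of at most `size` items, including a final partial page."""
    page: List[T] = []
    for item in items:
        page.append(item)
        if len(page) >= size:
            yield page
            page = []
    if page:
        # Flush whatever is left once the iterator is exhausted.
        yield page


# Stand-in for loader.lazy_load(): any iterable works the same way.
pages = list(batched(range(25), size=10))
print([len(p) for p in pages])
```

With the loader, the same helper would be used as `for page in batched(loader.lazy_load(), 10): ...`, so at most one page of Documents is held by the helper at a time.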


API reference

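These examples show just a few of the ways you can modify the default RecursiveUrlLoader, but there are many more modifications that can better fit your use case. The link_regex and exclude_dirs parameters can help you filter out unwanted URLs, aload() and alazy_load() can be used for asynchronous loading, and more.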
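To give a feel for directory-based filtering, the sketch below applies a simple prefix check to a list of candidate URLs. It illustrates the idea only and is an assumption about how such a filter might work, not the loader's own code; `should_skip` and the example.com URLs are hypothetical.

```python
from typing import List, Sequence


def should_skip(url: str, exclude_dirs: Sequence[str]) -> bool:
    """Return True when the URL falls under any excluded directory prefix."""
    return any(url.startswith(prefix) for prefix in exclude_dirs)


candidates: List[str] = [
    "https://example.com/docs/intro",
    "https://example.com/docs/api/loader",
    "https://example.com/blog/post-1",
]
excluded = ("https://example.com/blog/",)

# Keep only the URLs that are not under an excluded prefix.
kept = [url for url in candidates if not should_skip(url, excluded)]
print(kept)
```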

For detailed information on configuring and calling the RecursiveUrlLoader, please see the API reference.


