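Generating sitemap.xml with a Python Script

A sitemap is not strictly required for a site to be indexed, but it helps search engines find the site's pages faster and more accurately. For a site developed in Python, a Python script can generate the sitemap.xml file automatically.

Implementation

Install the required libraries

Before generating sitemap.xml, make sure the following libraries are installed in your Python environment: beautifulsoup4, lxml, and requests. If any of them is missing, run the following command in a terminal to install it.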

pip install beautifulsoup4 lxml requests

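Parse the site

Next, write a Python script that parses the site. We recommend the beautifulsoup4 library, which makes it easy to extract the information you need from HTML.

The example script below fetches a page's HTML with requests, parses it with beautifulsoup4, extracts the link targets (the href values of the <a> tags), and stores them in a list. Note that it reads only the single page at url; it does not follow links to other pages.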

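from bs4 import BeautifulSoup
import requests

url = "https://example.com"

def get_links(url):
    """Return the href value of every <a> tag on the page at url."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early on HTTP errors
    soup = BeautifulSoup(response.content, "lxml")
    link_list = []
    # href=True skips anchors that have no href attribute
    for link in soup.find_all("a", href=True):
        link_list.append(link["href"])
    return link_list

link_list = get_links(url)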
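
The raw href values returned above can be relative paths, in-page anchors, mailto: links, or links to other sites, so they usually need cleaning before they go into a sitemap. The following is a minimal sketch of one way to do that with the standard library's urllib.parse. The helper name normalize_links and the sample hrefs are hypothetical and only serve as illustration; the same-host rule is an assumption you may want to relax.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url; keep unique same-host http(s) URLs."""
    base_host = urlparse(base_url).netloc
    seen = set()
    result = []
    for href in hrefs:
        if not href:
            continue  # anchors without an href yield None or ""
        # Make the link absolute, then drop any "#fragment" part.
        absolute, _fragment = urldefrag(urljoin(base_url, href.strip()))
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skips mailto:, javascript:, tel: and similar
        if parts.netloc != base_host:
            continue  # assumption: only pages on the same host are wanted
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Hypothetical sample input, standing in for the output of get_links().
sample = ["/about", "news/today.html", "#top", None,
          "mailto:hi@example.com", "https://other.example.org/x", "/about#team"]
print(normalize_links("https://example.com/", sample))
```

Resolving against the page URL with urljoin, rather than concatenating strings, handles both relative and absolute hrefs with the same code path.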

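Generate sitemap.xml

The last step is to save the collected links in sitemap.xml format in the root directory of the site. The sitemap protocol limits a single sitemap file to 50,000 URLs and 50 MB uncompressed, so a larger site has to split its sitemap into several files and list them in a sitemap index file.

Below is example code that generates sitemap.xml. The file is written with indentation, so each tag lines up under its parent, which makes the output easier to read and maintain. The root element declares the sitemap XML namespace, which the sitemap protocol requires.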

from urllib.parse import urljoin

from lxml import etree

site_url = "https://example.com"

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

# Declare the namespaces through nsmap rather than as plain attributes.
urlset = etree.Element(
    f"{{{SITEMAP_NS}}}urlset",
    nsmap={None: SITEMAP_NS, "xsi": XSI_NS},
)
urlset.set(
    f"{{{XSI_NS}}}schemaLocation",
    f"{SITEMAP_NS} {SITEMAP_NS}/sitemap.xsd",
)

# link_list comes from get_links() in the previous step.
for link in link_list:
    if not link:
        continue  # skip anchors that had no href
    url_el = etree.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
    loc = etree.SubElement(url_el, f"{{{SITEMAP_NS}}}loc")
    # urljoin resolves relative hrefs and leaves absolute URLs unchanged.
    loc.text = urljoin(site_url, link)

with open("sitemap.xml", "wb") as f:
    f.write(etree.tostring(urlset, pretty_print=True, xml_declaration=True, encoding="UTF-8"))
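
For sites that exceed the per-file limit, the splitting mentioned earlier could look roughly like the sketch below. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra dependencies. The function name write_split_sitemaps, the file names sitemap-N.xml and sitemap_index.xml, and the public_base parameter are assumptions made for illustration, not part of the original script.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50000  # per-file URL limit in the sitemap protocol


def write_split_sitemaps(urls, out_dir, public_base, chunk_size=MAX_URLS_PER_FILE):
    """Write sitemap-N.xml files plus a sitemap_index.xml that lists them.

    public_base is the URL under which the files will be served
    (hypothetical parameter); returns the list of sitemap file names.
    """
    ET.register_namespace("", SITEMAP_NS)  # serialize with a default xmlns
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    index = ET.Element(f"{{{SITEMAP_NS}}}sitemapindex")
    written = []
    for start in range(0, len(urls), chunk_size):
        urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
        for page_url in urls[start:start + chunk_size]:
            url_el = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
            ET.SubElement(url_el, f"{{{SITEMAP_NS}}}loc").text = page_url
        name = f"sitemap-{len(written) + 1}.xml"
        ET.ElementTree(urlset).write(
            str(out_dir / name), encoding="UTF-8", xml_declaration=True)
        written.append(name)
        # Each sitemap file gets one <sitemap><loc> entry in the index.
        entry = ET.SubElement(index, f"{{{SITEMAP_NS}}}sitemap")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = (
            f"{public_base.rstrip('/')}/{name}")
    ET.ElementTree(index).write(
        str(out_dir / "sitemap_index.xml"), encoding="UTF-8", xml_declaration=True)
    return written
```

A call such as write_split_sitemaps(all_urls, "public", "https://example.com") would then leave the index file and the numbered sitemap files in the public directory, and only the index needs to be submitted to search engines.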

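Examples

The two examples below show how to use the approach above to generate a sitemap.xml file.

Example 1

We generate a sitemap.xml file from the Douban Books site, https://book.douban.com/, so that its book pages can be submitted to search engines for indexing. The script collects the links found on the home page.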

from bs4 import BeautifulSoup
import requests
from lxml import etree

site_url = "https://book.douban.com"

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

def get_links(url):
    """Collect the unique absolute links on the page that start with site_url."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop on HTTP errors (some sites refuse scripted requests)
    soup = BeautifulSoup(response.content, "lxml")
    link_list = []
    # href=True skips anchors without an href, which would otherwise be None
    for link in soup.find_all("a", href=True):
        href = link["href"]
        if href.startswith(site_url) and href not in link_list:
            link_list.append(href)
    return link_list

link_list = get_links(site_url)

# Declare the namespaces through nsmap rather than as plain attributes.
urlset = etree.Element(
    f"{{{SITEMAP_NS}}}urlset",
    nsmap={None: SITEMAP_NS, "xsi": XSI_NS},
)
urlset.set(f"{{{XSI_NS}}}schemaLocation", f"{SITEMAP_NS} {SITEMAP_NS}/sitemap.xsd")

for link in link_list:
    url_el = etree.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
    loc = etree.SubElement(url_el, f"{{{SITEMAP_NS}}}loc")
    loc.text = link

with open("douban_book_sitemap.xml", "wb") as f:
    f.write(etree.tostring(urlset, pretty_print=True, xml_declaration=True, encoding="UTF-8"))

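Example 2

We generate a sitemap.xml file from Bilibili, https://www.bilibili.com/, so that its video pages can be submitted to search engines for indexing. As in Example 1, the script collects the links found on the home page.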

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from lxml import etree

site_url = "https://www.bilibili.com"

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

def get_links(site_url):
    """Collect the unique bilibili links from the page at site_url."""
    response = requests.get(site_url, timeout=10)
    response.raise_for_status()  # stop early on HTTP errors
    soup = BeautifulSoup(response.content, "html.parser")
    links = []
    for link in soup.find_all("a"):
        href = link.get("href")
        if not href or href.startswith(("#", "mailto:")) or "javascript:" in href:
            continue  # skip empty, in-page, mail, and script links
        # Resolve relative and protocol-relative ("//host/path") hrefs.
        href = urljoin(site_url, href)
        if "bilibili" in href and href not in links:
            links.append(href)
    return links

links = get_links(site_url)

# Declare the namespaces through nsmap rather than as plain attributes.
urlset = etree.Element(
    f"{{{SITEMAP_NS}}}urlset",
    nsmap={None: SITEMAP_NS, "xsi": XSI_NS},
)
urlset.set(f"{{{XSI_NS}}}schemaLocation", f"{SITEMAP_NS} {SITEMAP_NS}/sitemap.xsd")

for link in links:
    url_el = etree.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
    loc = etree.SubElement(url_el, f"{{{SITEMAP_NS}}}loc")
    loc.text = link

with open("bilibili_video_sitemap.xml", "wb") as f:
    f.write(etree.tostring(urlset, pretty_print=True, xml_declaration=True, encoding="UTF-8"))

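In the two examples above, we generated sitemap.xml files for Douban Books and Bilibili so that their book and video pages can be submitted to search engines for indexing. In practice, you can of course adapt the scripts to your own needs.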
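
After running either example, it can be worth reading the generated file back to confirm that it parses and contains the expected entries. The sketch below does this with the standard library's xml.etree.ElementTree; the helper name read_sitemap_locs is hypothetical. It looks up elements in the sitemap namespace, so an empty result for a non-empty file would suggest the namespace was not declared correctly.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI mapping used only for the lookups below.
SITEMAP_NS_MAP = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def read_sitemap_locs(path):
    """Return the text of every <url>/<loc> element in a sitemap file."""
    root = ET.parse(path).getroot()
    locs = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS_MAP):
        text = (loc.text or "").strip()
        if text:
            locs.append(text)
    return locs
```

For instance, len(read_sitemap_locs("douban_book_sitemap.xml")) gives the number of URLs written by Example 1.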

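Unless otherwise noted, articles on this site are original. If you repost this article, please credit the source: Generating sitemap.xml with a Python Script - Python技术站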
