A Detailed Guide to Processing XML Data in Python

What is XML

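XML stands for eXtensible Markup Language and is used mainly to describe data. Like HTML, XML is a markup language, but it is designed to store and transport data rather than to display it. Unlike HTML, XML has no predefined tags; users define the tags they need.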

Python modules for processing XML

Python's standard library supports XML processing through three interfaces: DOM, SAX, and ElementTree.

DOM parser

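DOM stands for Document Object Model. A DOM parser converts the whole XML document into a tree, so you can select any node and search, delete, or insert around it. Because the entire document has to be held in memory as a tree, this approach uses a lot of memory on large documents.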

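The following example parses XML with the DOM parser. Judging from the code, example.xml is assumed to hold a movie collection: a root element with a shelf attribute and movie child elements.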

import xml.dom.minidom

# Open the XML document with the minidom parser
DOMTree = xml.dom.minidom.parse("example.xml")
collection = DOMTree.documentElement
if collection.hasAttribute("shelf"):
   print("Root element : %s" % collection.getAttribute("shelf"))

movies = collection.getElementsByTagName("movie")

for movie in movies:
   print("*****Movie*****")
   if movie.hasAttribute("title"):
      print("Title: %s" % movie.getAttribute("title"))

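   # Named movie_type / movie_format to avoid shadowing the built-ins type() and format()
   movie_type = movie.getElementsByTagName('type')[0]
   print("Type: %s" % movie_type.childNodes[0].nodeValue)
   movie_format = movie.getElementsByTagName('format')[0]
   print("Format: %s" % movie_format.childNodes[0].nodeValue)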
   rating = movie.getElementsByTagName('rating')[0]
   print("Rating: %s" % rating.childNodes[0].nodeValue)
   description = movie.getElementsByTagName('description')[0]
   print("Description: %s" % description.childNodes[0].nodeValue)
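
The search, delete, and insert operations mentioned above can be sketched with minidom as follows. This is a minimal illustration, not part of the original example: the inline SAMPLE document, the movie titles, and the edit_collection helper are all hypothetical, chosen so the code runs without example.xml.

```python
import xml.dom.minidom

# Hypothetical sample document, inlined so the sketch runs without example.xml.
SAMPLE = (
    '<collection shelf="New Arrivals">'
    '<movie title="Movie A"><type>Drama</type></movie>'
    '<movie title="Movie B"><type>Comedy</type></movie>'
    '</collection>'
)


def edit_collection(xml_text):
    """Remove the first movie, append a new one, and return the root element."""
    dom = xml.dom.minidom.parseString(xml_text)
    root = dom.documentElement

    # Search: getElementsByTagName returns matching descendants in document order.
    movies = root.getElementsByTagName("movie")

    # Delete: a node is removed through its parent.
    root.removeChild(movies[0])

    # Insert: new nodes are created by the document, then attached to the tree.
    new_movie = dom.createElement("movie")
    new_movie.setAttribute("title", "Movie C")
    movie_type = dom.createElement("type")
    movie_type.appendChild(dom.createTextNode("Thriller"))
    new_movie.appendChild(movie_type)
    root.appendChild(new_movie)
    return root


root = edit_collection(SAMPLE)
titles = [m.getAttribute("title") for m in root.getElementsByTagName("movie")]
print(titles)
```

Every change happens on the in-memory tree, which is why DOM is convenient for editing but costly for very large files.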

SAX parser

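SAX stands for Simple API for XML. A SAX parser does not load the whole document into memory. It reads the document as a stream and reports events (element start, element end, character data) to a handler, so only a small part of the document is processed at any moment. This makes it well suited to large XML files.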

The following example processes XML with the SAX parser:

import xml.sax

class MovieHandler(xml.sax.ContentHandler):
   def __init__(self):
      self.CurrentData = ""
      self.type = ""
      self.format = ""
      self.year = ""
      self.rating = ""
      self.stars = ""
      self.description = ""

   # Called when an element starts
   def startElement(self, tag, attributes):
      self.CurrentData = tag
      if tag == "movie":
         print("*****Movie*****")
         title = attributes["title"]
         print("Title:", title)

   # Called when an element ends
   def endElement(self, tag):
      if self.CurrentData == "type":
         print("Type:", self.type)
      elif self.CurrentData == "format":
         print("Format:", self.format)
      elif self.CurrentData == "year":
         print("Year:", self.year)
      elif self.CurrentData == "rating":
         print("Rating:", self.rating)
      elif self.CurrentData == "stars":
         print("Stars:", self.stars)
      elif self.CurrentData == "description":
         print("Description:", self.description)
      self.CurrentData = ""

   # Called with the character data inside an element
   def characters(self, content):
      if self.CurrentData == "type":
         self.type = content
      elif self.CurrentData == "format":
         self.format = content
      elif self.CurrentData == "year":
         self.year = content
      elif self.CurrentData == "rating":
         self.rating = content
      elif self.CurrentData == "stars":
         self.stars = content
      elif self.CurrentData == "description":
         self.description = content

# Create an XMLReader
parser = xml.sax.make_parser()
# Turn off namespace processing
parser.setFeature(xml.sax.handler.feature_namespaces, 0)

# Install our ContentHandler subclass
Handler = MovieHandler()
parser.setContentHandler( Handler )

parser.parse("example.xml")
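
One detail worth knowing: a SAX parser is allowed to deliver the text of a single element in several characters() calls, so assigning the content directly, as the handler above does, can keep only the last chunk. A minimal sketch of a handler that accumulates chunks instead is shown below. The TextCollector class, the inline SAMPLE document, and its movie data are hypothetical and exist only for illustration.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Collect the text of each <type> element (hypothetical handler)."""

    def __init__(self):
        super().__init__()
        self.parts = []       # text chunks of the element being read
        self.in_type = False  # True while inside a <type> element
        self.types = []       # completed <type> values

    def startElement(self, tag, attributes):
        if tag == "type":
            self.in_type = True
            self.parts = []

    def characters(self, content):
        # May be called several times per element, so append instead of assign.
        if self.in_type:
            self.parts.append(content)

    def endElement(self, tag):
        if tag == "type":
            self.types.append("".join(self.parts).strip())
            self.in_type = False


# Hypothetical sample document, inlined so the sketch runs without example.xml.
SAMPLE = (
    b'<collection>'
    b'<movie title="A"><type>War, Thriller</type></movie>'
    b'<movie title="B"><type>Anime</type></movie>'
    b'</collection>'
)

handler = TextCollector()
xml.sax.parseString(SAMPLE, handler)
print(handler.types)
```

Here xml.sax.parseString parses in-memory data; for a file on disk, xml.sax.parse or the make_parser approach shown earlier works with the same handler.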

ElementTree

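ElementTree is one of the most commonly used modules for working with XML in Python. It has shipped with the standard library since Python 2.5 and makes XML files fairly convenient to work with.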

The following example processes an XML file with ElementTree:

import xml.etree.ElementTree as ET

tree = ET.parse('example.xml')

root = tree.getroot()
print("Root tag:", root.tag)

for child in root:
    print(child.tag, child.attrib)

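The code above parses the XML file with ElementTree, then prints the tag of the root element and the tag and attributes of each of its direct children.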
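
Beyond reading a file, ElementTree can also parse a string, read attributes, add elements, and serialize the result. The sketch below illustrates this under stated assumptions: the inline SAMPLE document and the added movie are hypothetical and only stand in for a real example.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, inlined so the sketch runs without example.xml.
SAMPLE = (
    '<collection shelf="New Arrivals">'
    '<movie title="A"/><movie title="B"/>'
    '</collection>'
)

# fromstring returns the root Element directly (no ElementTree wrapper).
root = ET.fromstring(SAMPLE)

# Attributes are read with get(); iter() walks all matching descendants.
shelf = root.get("shelf")
titles = [movie.get("title") for movie in root.iter("movie")]

# SubElement creates a child and attaches it to the parent in one step.
new_movie = ET.SubElement(root, "movie", {"title": "C"})
new_movie.text = "added in memory"

# encoding="unicode" makes tostring return str instead of bytes.
serialized = ET.tostring(root, encoding="unicode")
print(shelf, titles)
print(serialized)
```

To save the modified tree back to disk, the root can be wrapped in ET.ElementTree(root) and written with its write() method.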

Examples

Example 1

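Suppose we have an XML file containing several book elements, each with title and author child elements among others. We want to walk through the file and print the title and author of every book.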

XML file contents:

<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.</description>
   </book>
   <book id="bk103">
      <author>Corets, Eva</author>
      <title>Maeve Ascendant</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-11-17</publish_date>
      <description>After the collapse of a nanotechnology 
      society in England, the young survivors lay the 
      foundation for a new society.</description>
   </book>
</catalog>

Code example:

import xml.etree.ElementTree as ET

tree = ET.parse('example.xml')
root = tree.getroot()

for book in root.findall('book'):
    title = book.find('title').text
    author = book.find('author').text
    print("Title: {}, Author: {}".format(title, author))

Output:

Title: XML Developer's Guide, Author: Gambardella, Matthew
Title: Midnight Rain, Author: Ralls, Kim
Title: Maeve Ascendant, Author: Corets, Eva

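The code above walks the XML document with ElementTree: findall locates every book element directly under the root, find locates the title and author children of each book, and the two values are then printed.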
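
Note that find returns None when a child element is missing, so book.find('author').text would raise AttributeError for an incomplete record. A more defensive variant, sketched here, uses findtext with a default value. The list_books helper, the second book (bk999), and the placeholder strings are hypothetical additions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the second book deliberately has no <author> element.
SAMPLE = """<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
   </book>
   <book id="bk999">
      <title>Untitled Draft</title>
   </book>
</catalog>"""


def list_books(xml_text):
    """Return (id, title, author) tuples, tolerating missing child elements."""
    root = ET.fromstring(xml_text)
    rows = []
    for book in root.findall("book"):
        # findtext returns the default instead of failing on a missing child.
        title = book.findtext("title", default="(no title)")
        author = book.findtext("author", default="(unknown)")
        rows.append((book.get("id"), title, author))
    return rows


rows = list_books(SAMPLE)
for book_id, title, author in rows:
    print("{}: Title: {}, Author: {}".format(book_id, title, author))
```

Whether to substitute a placeholder or skip incomplete records depends on the data; the defaults here are just one possible choice.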

Example 2

Suppose we have an XML file containing several student elements, each with name, age, and gender child elements. We want to write the contents of every student element to a CSV file.

XML file contents:

<?xml version="1.0"?>
<class>
   <student>
      <name>John</name>
      <gender>male</gender>
      <age>15</age>
   </student>
   <student>
      <name>Alice</name>
      <gender>female</gender>
      <age>16</age>
   </student>
   <student>
      <name>Mike</name>
      <gender>male</gender>
      <age>17</age>
   </student>
</class>

Code example:

import csv
import xml.etree.ElementTree as ET

tree = ET.parse('example.xml')
root = tree.getroot()

# Open the CSV file for writing
with open('students.csv', 'w', newline='') as csvfile:
    # Create the CSV writer and write the header row
    writer = csv.writer(csvfile)
    writer.writerow(['Name', 'Age', 'Gender'])

    # Walk the XML document
    for student in root.findall('student'):
        name = student.find('name').text
        age = student.find('age').text
        gender = student.find('gender').text
        writer.writerow([name, age, gender])

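The code above uses the csv module: it opens a CSV file, walks the XML document, and writes the contents of each student element as one row. The with statement closes the file automatically when the block ends, so no explicit close call is needed.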

The resulting CSV file:

Name,Age,Gender
John,15,male
Alice,16,female
Mike,17,male
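
The same conversion can be written so that the column order is declared once and reused for both the header and the rows. The sketch below uses csv.DictWriter and writes to an in-memory buffer instead of students.csv so that it is easy to test. The FIELDS list, the students_to_csv helper, and the shortened SAMPLE document are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical shortened sample of the student document.
SAMPLE = """<class>
   <student><name>John</name><gender>male</gender><age>15</age></student>
   <student><name>Alice</name><gender>female</gender><age>16</age></student>
</class>"""

# Column order is declared once; tag names double as CSV header names.
FIELDS = ["name", "age", "gender"]


def students_to_csv(xml_text):
    """Convert <student> elements to CSV text with one row per student."""
    root = ET.fromstring(xml_text)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for student in root.findall("student"):
        # A missing child becomes an empty cell instead of raising an error.
        row = {field: student.findtext(field, default="") for field in FIELDS}
        writer.writerow(row)
    return buffer.getvalue()


csv_text = students_to_csv(SAMPLE)
print(csv_text)
```

Writing to a real file works the same way: pass a file opened with newline='' to DictWriter in place of the buffer.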

Summary

This article covered three ways to process XML data in Python: DOM, SAX, and ElementTree, with a code example for each. Knowing these approaches makes it easier to work with XML files.

Unless otherwise stated, articles on this site are original work. If you repost this article, please credit the source: Python处理XML格式数据的方法详解 - Python技术站
