Introduction
It seems like everyone needs to parse XML these days. They’re either saving their own information in XML or loading in someone else’s data. This is why I was glad to learn that as of Python 2.5, the ElementTree XML package has been added to the standard library in the XML module.What I like about the ElementTree module is that it just seems to make sense. This might seem like a strange thing to say about an XML module, but I’ve had to parse enough XML in my time to know that if an XML module makes sense the first time you use it, it’s probably a keeper. The ElementTree module allows me to work with XML data in a way that is similar to how I think about XML data.
A subset of the full ElementTree module is available in the Python 2.5 standard library as
xml.etree
, but you don’t have to use Python 2.5 in order to use the ElementTree module. If you are still using an older version of Python (1.5.2 or later) you can simply download the module from its website and manually install it on your system. The website also has very easy to follow installation instructions, which you should consult to avoid issues while installing ElementTree.In general, the ElementTree module treats XML data as a list of lists. All XML has a root element that will have zero or more subelements (or child elements). Each of those subelements may in turn have subelements of their own. The best way to think about this is with a brief example.
First let’s take a look at some sample XML data:
<root> <child>One</child> <child>Two</child> </root>
Reading XML data
Now let’s use the ElementTree package to parse this XML and print the text data associated with each child element. To start, we’ll create a Python file with the contents shown in Listing 1.Listing 1
#!/usr/bin/env python def main(): pass if __name__ == "__main__": main()
This is basically a template that I use for many of my simple “*.py” files. It doesn’t actually do anything except set up the script so that when the file is run, the
main
method will be executed. Some people like to use the Python interactive interpreter for simple hacking like this. Personally, I prefer having my code stored in a handy file so I can make simple changes and re-run the entire script when I am just playing around.The first thing that we need to do in our Python code is import the ElementTree module:
from xml.etree import ElementTree as ET
from elementtree import ElementTree as ET
ElementTree
. Using ET is demonstrated in the Python 2.5 “What’s new” documentation[1] and I think it’s a great way to eliminate some key strokes.Now we’ll begin writing code in the
main
method. The first step is to load the XML data described above. Normally you will be working with a file or URL; for now we want to keep this simple and load the XML data directly from the text:
element = ET.XML( "<root><child>One</child><child>Two</child></root>")
The
XML
function is described in the ElementTree documentation as follows: “Parses an XML document from a string constant. This function can be used to embed “XML literals” in Python code”[2].Be careful here! The
XML
function returns an Element object, and not an ElementTree object as one might expect. Element objects are used to represent XML elements, whereas the ElementTree object is used to represent the entire XML document. Element objects may represent the entire XML document if they are the root element but will not if they are a subelement. ElementTree objects also add “some extra support for serialization to and from standard XML.”[3] The Element object that is returned represents the
element in our XML data.Thankfully, the Element object is an iterator object so we can use a
for
loop to loop through all of its child elements:for subelement in element:
text
attribute:
for subelement in element: print subelement.text
To recap, have a look at the code in Listing 2.
Listing 2
#!/usr/bin/env python from xml.etree import ElementTree as ET def main(): element = ET.XML("<root><child>One</child><child>Two</child></root>") for subelement in element: print subelement.text if __name__ == "__main__": # Someone is launching this directly main()
Once you run the code you should get the following output:
One
Two
If an XML element does not have any text associated with it, like our root element, the Element object’s
text
attribute will be set to None
. If you want to check if an element had any text associated with it, you can do the following:if element.text is not None: print element.text
Reading XML Attributes
Let’s alter the XML that we are working with to add attributes to the elements and look at how we would parse that information.If the XML uses attributes in addition to, or instead of, inner text they can be accessed using the Element object’s
attrib
attribute. The attrib
attribute is a Python dictionary and is relatively easy to use:
def main(): element = ET.XML( '<root><child val="One"/><child val="Two"/></root>') for subelement in element: print subelement.attrib
When you run the code you get the following output:
{'val': 'One'} {'val': 'Two'}
These are the attributes for each child element stored in a dictionary. Being able to work with an XML element’s attributes as a Python dictionary is a great feature and fits well with the dynamic nature of XML attributes.
Writing XML
Now that we’ve tried our hand at reading XML, let’s try creating some. If you understand the reading process, you should have no trouble understanding the creation process because it works in much the same manner. What we are going to do in this example is recreate the XML data that we were working with above.The first step is to create our
element:
#create the root <root>root_element = ET.Element("root")
After this code is executed, the variable
root_element
is an Element object, just like the Element objects that we used earlier to parse the XML.The next step is to create the two child elements. There are two ways to do this.
In the first method, if you know exactly what you are creating, it’s easiest to use the
SubElement
method, which creates an Element object that is a subelement (or child) of another Element object:
#create the first child <child>One</child>child = ET.SubElement(root_element, "child")
This will create a
Element that is a child of root_element
. We then need to set the text associated with that element. To do this we use the same text attribute that we used in the first parsing example. However, instead of simply reading the text attribute we set its value:
child.text = "One"
The second approach to creating a child element is to create an Element object separately (rather than a sub element) and append it to a parent Element object. The results are exactly the same – this is simply a different approach that may come in handy when creating your XML,or working with two sets of XML data.
First we create an Element object in the same way that we created the root element:
#create the second child <child>Two</child>child = ET.Element("child")child.text = "Two"
This creates the
child
Element object and sets its text to “Two”. We then append it to the root element:
#now appendroot_element.append(child)
Pretty simple! Now, if we want to look at the contents of our
root_element
(or any other Element object for that matter) we can use the handy tostring
function. It does exactly what it says that it does: it converts an Element object into a human readable string.
#Let's see the resultsprint ET.tostring(root_element)
Listing 3
#!/usr/bin/env python from xml.etree import ElementTree as ET def main(): #create the root </root><root> root_element = ET.Element("root") #create the first child <child>One</child> child = ET.SubElement(root_element, "child") child.text = "One" #create the second child <child>Two</child> child = ET.Element("child") child.text = "Two" #now append root_element.append(child) #Let's see the results
print ET.tostring(root_element)
if __name__ == "__main__":
# Someone is launching this directly main()
</root><root><child>One</child><child>Two</child></root>
To recap, have a look at the code in Listing 3. When you run this code you will get the following output:
Writing XML Attributes
If you want to create the XML with attributes (as illustrated in the second reading example), you can use the Element object’sset
method. To add the val
attribute to the first element, use the following:child.set("val","One")
child = ET.Element("child", val="One")
Reading XML Files
Most of the time you won’t be working with XML data that you explicitly create in your code, instead you will usually read the XML data in from a data source, work with it, and then save it back out when you are done. Fortunately, configuring ElementTree to work with different data sources is very easy. For example, let’s take the XML data that we first used and save it into a file namedour.xml
in the same location as our Python file.There are a few methods that we can use to load XML data from a file. We are going to use the
parse
function. This function is nice because it will accept, as a parameter, the path to a file OR a “file-like” object. The term “file-like” is used on purpose because the object does not have to be a file object per se – it simply has to be an object that behaves in a file-like manner. A “file-like” object is an object that implements a “file-like” interface, meaning that it shares many (if not all) methods with the file object. If an object is “file-like” this fact will usually be prominently mentioned in its documentation.The first thing that we need in order to load the XML data is determine the full path to the
our.xml
file. In order to calculate this, we determine the full path of our Python source file, strip the filename from it, and then append our.xml
to the path. This is rather simple given that the __file__
attribute (available in Python 2.2 and later) is the relative path and filename of our Python source file. Although the __file__
attribute will be a relative path, we can use it to calculate the absolute path using the standard os module:import os
abspath
function to get the absolute path:xml_file = os.path.abspath(__file__)
xml_file = os.path.dirname(xml_file)
our.xml
file resides, all we have to do is append the our.xml
filename to the xml_file
variable. However, instead of just doing something like:xml_file += "/our.xml"
xml_file = os.path.join(xml_file, "our.xml")
our.xml
is doing, try printing out xml_file
after each of the above lines and it should become clear.We now have the full path to the
our.xml
file. In order to load its XML data we simply pass the path to the parse
function:tree = ET.parse(xml_file)
Since we are working with files, we should watch out for incorrect paths, I/O errors, or the parse function failing for any other reason. If you wish to be extra careful, you can wrap the parse function in a try/except block in order to catch any exceptions that may be thrown:
try: tree = ET.parse("sar")except Exception, inst: print "Unexpected error opening %s: %s" % (xml_file, inst) return
IOError
exception).Writing XML Data to a File
Now that we know how to read in XML data, we should look at how one writes XML data out to a file. Let’s assume that after reading in theout.xml
fiie we want to add another item to the XML file that we just read in:child = ET.SubElement(tree.getroot(), "child")child.text = "Three"
getroot
function. The getroot
function simply returns the root Element object of the XML data.Now that we have a third child element, let’s write the XML data back out to
our.xml
. Thanks to ElementTree this is a painless experience:tree.write(xml_file)
If we want to be really careful when writing the XML data out to a file, we’ll watch out for exceptions. However most of the time the
write
method will succeed without throwing an exception; it is more important to be sure that the path used is correct. Often times, instead of getting the exception that you want, you end up with an XML file stored in some far off and strange location on your hard drive because your path was incorrect or you did not specify the full path. But, as is often the case when programming, better safe than sorry:try: tree.write(xml_file)except Exception, inst: print "Unexpected error writing to file %s: %s" % (xml_file, inst) return
To recap you can find all of the code from this section in Listing 4.
Listing 4
#!/usr/bin/env python from xml.etree import ElementTree as ETimport os def main(): xml_file = os.path.abspath(__file__) xml_file = os.path.dirname(xml_file) xml_file = os.path.join(xml_file, "our.xml") try: tree = ET.parse(xml_file) except Exception, inst: print "Unexpected error opening %s: %s" % (xml_file, inst) return child = ET.SubElement(tree.getroot(), "child")
child.text = "Three"
try: tree.write(xml_file) except Exception, inst: print "Unexpected error writing to file %s: %s" % (xml_file, inst) return \if __name__ == "__main__":
# Someone is launching this directly main()
our.xml
file you should see that the the third child element has been added:<root> <child>One</child> <child>Two</child> <child>Three</child> </root>
Reading from the Web
Working with a local file is very useful, but you might also be in a situation where you will have to work with an XML file that is located on the Internet, perhaps an RSS feed. Fortunately, since the
parse
function explained above works with file-like elements, loading a URL is very easy.First off, you need to import the urllib module; a standard module that allows you to open URLs in a method similar to opening local files:
import urllib
feed = urllib.urlopen("http://pythonmagazine.com/c/news/atom")tree = ET.parse(feed)
No comments:
Post a Comment