Monday, 26 September 2011

XML Parsing using ElementTree.

Introduction

It seems like everyone needs to parse XML these days. They’re either saving their own information in XML or loading in someone else’s data. This is why I was glad to learn that as of Python 2.5, the ElementTree XML package has been added to the standard library as part of the xml package.
What I like about the ElementTree module is that it just seems to make sense. This might seem like a strange thing to say about an XML module, but I’ve had to parse enough XML in my time to know that if an XML module makes sense the first time you use it, it’s probably a keeper. The ElementTree module allows me to work with XML data in a way that is similar to how I think about XML data.
A subset of the full ElementTree module is available in the Python 2.5 standard library as xml.etree, but you don’t have to use Python 2.5 in order to use the ElementTree module. If you are still using an older version of Python (1.5.2 or later) you can simply download the module from its website and manually install it on your system. The website also has very easy to follow installation instructions, which you should consult to avoid issues while installing ElementTree.
In general, the ElementTree module treats XML data as a list of lists. All XML has a root element that will have zero or more subelements (or child elements). Each of those subelements may in turn have subelements of their own. The best way to think about this is with a brief example.
First let’s take a look at some sample XML data:
<root>
<child>One</child>
<child>Two</child>
</root>
Here we have a root element with two child elements. Each child element has some text associated with it, seen here as “One” and “Two”. If we examine the XML as a hierarchical list of lists, we see that we have one element, “root”, in our root list. Within the “root” element we have a list containing two subelements, “child” and “child”. The two “child” elements would then contain empty lists representing their lack of subelements. Not too complicated so far, is it?
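The list-of-lists picture can be checked directly in code. The following is a small sketch of my own rather than one of the article's listings; it relies only on the fact that an Element supports len() and indexing the way a list does. The variable names are just for illustration, and the print calls are written so that the snippet runs on both Python 2 and Python 3.

```python
from xml.etree import ElementTree as ET

root = ET.XML("<root><child>One</child><child>Two</child></root>")

# The root behaves like a list holding its two child elements.
child_count = len(root)

# Each child behaves like an empty list because it has no subelements.
grandchild_counts = [len(child) for child in root]

# Children can be indexed like list items.
first_tag = root[0].tag

print(child_count)
print(grandchild_counts)
print(first_tag)
```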

Reading XML data

Now let’s use the ElementTree package to parse this XML and print the text data associated with each child element. To start, we’ll create a Python file with the contents shown in Listing 1.

Listing 1
#!/usr/bin/env python
def main():
 pass

if __name__ == "__main__":
 main()

This is basically a template that I use for many of my simple “*.py” files. It doesn’t actually do anything except set up the script so that when the file is run, the main method will be executed. Some people like to use the Python interactive interpreter for simple hacking like this. Personally, I prefer having my code stored in a handy file so I can make simple changes and re-run the entire script when I am just playing around.
The first thing that we need to do in our Python code is import the ElementTree module:
from xml.etree import ElementTree as ET
Note: If you are not using Python 2.5 and have installed the ElementTree module on your own, you should import the ElementTree module as follows:
from elementtree import ElementTree as ET
This will import the ElementTree section of the module into your program aliased as ET. However, you don’t have to import ElementTree using an alias; you can simply import it and access it as ElementTree. Using ET is demonstrated in the Python 2.5 “What’s new” documentation[1] and I think it’s a great way to eliminate some keystrokes.
Now we’ll begin writing code in the main method. The first step is to load the XML data described above. Normally you will be working with a file or URL; for now we want to keep this simple and load the XML data directly from the text:
 
element = ET.XML(
       "<root><child>One</child><child>Two</child></root>")

The XML function is described in the ElementTree documentation as follows: “Parses an XML document from a string constant. This function can be used to embed “XML literals” in Python code”[2].
Be careful here! The XML function returns an Element object, and not an ElementTree object as one might expect. Element objects are used to represent XML elements, whereas the ElementTree object is used to represent the entire XML document. Element objects may represent the entire XML document if they are the root element but will not if they are a subelement. ElementTree objects also add “some extra support for serialization to and from standard XML.”[3] The Element object that is returned represents the root element in our XML data.
Thankfully, the Element object is iterable, so we can use a for loop to loop through all of its child elements:
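If the distinction feels abstract, a quick check helps. This sketch is my own addition: it uses ET.iselement and the ElementTree class to show that ET.XML hands back an Element, and that wrapping that Element in an ElementTree gives you the document-level object whose getroot method returns the same Element.

```python
from xml.etree import ElementTree as ET

element = ET.XML("<root><child>One</child><child>Two</child></root>")

# ET.XML hands back the root Element, not an ElementTree.
is_element = ET.iselement(element)
is_tree = isinstance(element, ET.ElementTree)

# Wrapping the Element in an ElementTree gives a document-level object.
tree = ET.ElementTree(element)
same_root = tree.getroot() is element

print(is_element)
print(is_tree)
print(same_root)
```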
for subelement in element:
This will give us all the child elements in the root element. As mentioned earlier, each element in the XML tree is represented as an Element object, so as we iterate through the root element’s child elements we are getting Element objects with which to work. Each pass through the for loop gives us the next child element in the form of an Element object until there are no more children left. In order to print out the text associated with an Element object we simply have to access the Element object’s text attribute:
 
for subelement in element:
       print subelement.text

To recap, have a look at the code in Listing 2.

Listing 2
#!/usr/bin/env python
from xml.etree import ElementTree as ET
def main():
 element = ET.XML("<root><child>One</child><child>Two</child></root>")
 for subelement in element:
  print subelement.text
if __name__ == "__main__":
 # Someone is launching this directly
 main()

Once you run the code you should get the following output:
One
Two

If an XML element does not have any text associated with it, like our root element, the Element object’s text attribute will be set to None. If you want to check if an element had any text associated with it, you can do the following:
if element.text is not None:
       print element.text
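When you are looping over many elements, it can be handy to fold that check into a small helper. The function below, text_or_default, is a hypothetical name of my own and not part of ElementTree; it is only a sketch of one way to avoid sprinkling None checks through your code.

```python
from xml.etree import ElementTree as ET


def text_or_default(element, default=""):
    """Return the element's text, or default when the text is None."""
    if element.text is not None:
        return element.text
    return default


# The second child is empty, so its text attribute is None.
root = ET.XML("<root><child>One</child><child/></root>")
texts = [text_or_default(child, "(no text)") for child in root]
print(texts)
```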

Reading XML Attributes

Let’s alter the XML that we are working with to add attributes to the elements and look at how we would parse that information.
If the XML uses attributes in addition to, or instead of, inner text they can be accessed using the Element object’s attrib attribute. The attrib attribute is a Python dictionary and is relatively easy to use:
 
def main():
       element = ET.XML(
               '<root><child val="One"/><child val="Two"/></root>')
       for subelement in element:
               print subelement.attrib

When you run the code you get the following output:
{'val': 'One'}
{'val': 'Two'}

These are the attributes for each child element stored in a dictionary. Being able to work with an XML element’s attributes as a Python dictionary is a great feature and fits well with the dynamic nature of XML attributes.
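Usually you want one particular attribute rather than the whole dictionary. As a sketch (the sample XML with a missing val attribute is my own), the Element object’s get method works like a dictionary’s get and lets you supply a default for elements that lack the attribute:

```python
from xml.etree import ElementTree as ET

# The second child deliberately has no "val" attribute.
root = ET.XML('<root><child val="One"/><child/></root>')

values = []
for child in root:
    # Element.get mirrors dict.get: it returns the default when the
    # attribute is missing instead of raising KeyError.
    values.append(child.get("val", "missing"))

# attrib is an ordinary dictionary, so membership tests work too.
has_val = "val" in root[0].attrib

print(values)
print(has_val)
```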

Writing XML

Now that we’ve tried our hand at reading XML, let’s try creating some. If you understand the reading process, you should have no trouble understanding the creation process because it works in much the same manner. What we are going to do in this example is recreate the XML data that we were working with above.
The first step is to create our element:
 
#create the root <root>
root_element = ET.Element("root")

After this code is executed, the variable root_element is an Element object, just like the Element objects that we used earlier to parse the XML.
The next step is to create the two child elements. There are two ways to do this.
In the first method, if you know exactly what you are creating, it’s easiest to use the SubElement method, which creates an Element object that is a subelement (or child) of another Element object:
 
#create the first child <child>One</child>
child = ET.SubElement(root_element, "child")

This will create an Element that is a child of root_element. We then need to set the text associated with that element. To do this we use the same text attribute that we used in the first parsing example. However, instead of simply reading the text attribute we set its value:
 
child.text = "One"

The second approach to creating a child element is to create an Element object separately (rather than as a subelement) and append it to a parent Element object. The results are exactly the same; this is simply a different approach that may come in handy when creating your XML, or when working with two sets of XML data.
First we create an Element object in the same way that we created the root element:
 
#create the second child <child>Two</child>
child = ET.Element("child")
child.text = "Two"

This creates the child Element object and sets its text to “Two”. We then append it to the root element:
 
#now append
root_element.append(child)

Pretty simple! Now, if we want to look at the contents of our root_element (or any other Element object for that matter) we can use the handy tostring function. It does exactly what it says that it does: it converts an Element object into a human readable string.
 
#Let's see the results
print ET.tostring(root_element)
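One caveat that this article, written for Python 2, does not cover: as far as I know, on Python 3 the tostring function returns a bytes object by default rather than a text string. The sketch below is my own variation on the steps above. It builds the children in a loop and decodes the result, so that it behaves the same on Python 2 and Python 3.

```python
from xml.etree import ElementTree as ET

root_element = ET.Element("root")

# Create one <child> per label using SubElement.
for label in ("One", "Two"):
    child = ET.SubElement(root_element, "child")
    child.text = label

# tostring returns a byte string here; decoding it yields text
# on both Python 2 and Python 3.
xml_text = ET.tostring(root_element).decode("ascii")
print(xml_text)
```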

To recap, have a look at the code in Listing 3.

Listing 3
#!/usr/bin/env python
from xml.etree import ElementTree as ET
def main():
 #create the root <root>
 root_element = ET.Element("root")
 #create the first child <child>One</child>
 child = ET.SubElement(root_element, "child")
 child.text = "One"
 #create the second child <child>Two</child>
 child = ET.Element("child")
 child.text = "Two"
 #now append
 root_element.append(child)
 #Let's see the results
 print ET.tostring(root_element)
if __name__ == "__main__":
 # Someone is launching this directly
 main()

When you run this code you will get the following output:
<root><child>One</child><child>Two</child></root>

Writing XML Attributes

If you want to create the XML with attributes (as illustrated in the second reading example), you can use the Element object’s set method. To add the val attribute to the first element, use the following:
child.set("val","One")
You may also set attributes when you create Element objects:
child = ET.Element("child", val="One")
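Putting both approaches side by side, here is a short sketch of my own that recreates the attribute-based XML from the second reading example. It assumes nothing beyond the set method and the keyword-argument form shown above (SubElement accepts the same keyword arguments as Element), and it parses the result back in to confirm that the attributes were written.

```python
from xml.etree import ElementTree as ET

root_element = ET.Element("root")

# Attribute supplied at creation time as a keyword argument.
first = ET.SubElement(root_element, "child", val="One")

# Attribute added afterwards with set().
second = ET.SubElement(root_element, "child")
second.set("val", "Two")

xml_text = ET.tostring(root_element).decode("ascii")
print(xml_text)

# Parse the generated XML again to check the attributes survived.
round_trip = [child.get("val") for child in ET.XML(xml_text)]
print(round_trip)
```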

Reading XML Files

Most of the time you won’t be working with XML data that you explicitly create in your code. Instead, you will usually read the XML data in from a data source, work with it, and then save it back out when you are done. Fortunately, configuring ElementTree to work with different data sources is very easy. For example, let’s take the XML data that we first used and save it into a file named our.xml in the same location as our Python file.
There are a few methods that we can use to load XML data from a file. We are going to use the parse function. This function is nice because it will accept, as a parameter, the path to a file OR a “file-like” object. The term “file-like” is used on purpose because the object does not have to be a file object per se – it simply has to be an object that behaves in a file-like manner. A “file-like” object is an object that implements a “file-like” interface, meaning that it shares many (if not all) methods with the file object. If an object is “file-like” this fact will usually be prominently mentioned in its documentation.
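To make the “file-like” idea concrete, here is a sketch of my own that hands parse an in-memory io.BytesIO object instead of a real file. The choice of BytesIO is mine (it needs Python 2.6 or later), and getroot, which returns the root Element of the tree, is covered later in this article.

```python
import io
from xml.etree import ElementTree as ET

# io.BytesIO is an in-memory object with the same read() interface
# as a file opened in binary mode.
fake_file = io.BytesIO(b"<root><child>One</child><child>Two</child></root>")

# parse accepts it exactly as it would accept a path or an open file.
tree = ET.parse(fake_file)
texts = [child.text for child in tree.getroot()]
print(texts)
```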
The first thing that we need to do in order to load the XML data is to determine the full path to the our.xml file. To calculate this, we determine the full path of our Python source file, strip the filename from it, and then append our.xml to the path. This is rather simple given that the __file__ attribute (available in Python 2.2 and later) holds the path and filename of our Python source file. Although the __file__ attribute may be a relative path, we can use it to calculate the absolute path using the standard os module:
import os
We then call the abspath function to get the absolute path:
xml_file = os.path.abspath(__file__)
However, since we only want the directory name (not the full path and filename of our Python source file) we have to strip off the filename:
xml_file = os.path.dirname(xml_file)
Now that we have the directory in which the our.xml file resides, all we have to do is append the our.xml filename to the xml_file variable. However, instead of just doing something like:
xml_file += "/our.xml"
we will use the os module to join the two paths so that the resulting path is always correct regardless of what operating system our code is executed on:
xml_file = os.path.join(xml_file, "our.xml")
Note: If you have any trouble understanding what any of the code used to determine the path of our.xml is doing, try printing out xml_file after each of the above lines and it should become clear.
We now have the full path to the our.xml file. In order to load its XML data we simply pass the path to the parse function:
tree = ET.parse(xml_file)
We now have an ElementTree object instance that represents our XML file.
Since we are working with files, we should watch out for incorrect paths, I/O errors, or the parse function failing for any other reason. If you wish to be extra careful, you can wrap the parse function in a try/except block in order to catch any exceptions that may be thrown:
 
try:
       tree = ET.parse(xml_file)
except Exception, inst:
       print "Unexpected error opening %s: %s" % (xml_file, inst)
       return
In the except block, I catch the Exception base class so that I catch any and all exceptions that may be thrown (in the case of a missing file it will most likely be an IOError exception).
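If you would rather tell a missing file apart from a malformed one, you can catch the two cases separately. This is only a sketch under a few assumptions of mine: load_tree is a hypothetical helper name, the path in the example is made up, and ET.ParseError is available in Python 2.7 and 3.2 or later but not in the 2.5 release this article started with.

```python
from xml.etree import ElementTree as ET


def load_tree(path):
    """Return the parsed tree, or None if the file cannot be read or parsed."""
    try:
        return ET.parse(path)
    except IOError as inst:
        # Missing file or unreadable path.
        print("Could not open %s: %s" % (path, inst))
    except ET.ParseError as inst:
        # The file was read but does not contain well-formed XML.
        print("Could not parse %s: %s" % (path, inst))
    return None


# A path that is assumed not to exist, to exercise the first branch.
result = load_tree("no-such-directory/our.xml")
print(result)
```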

Writing XML Data to a File

Now that we know how to read in XML data, we should look at how one writes XML data out to a file. Let’s assume that after reading in the our.xml file we want to add another item to the XML data that we just read in:
child = ET.SubElement(tree.getroot(), "child")
child.text = "Three"
Notice that in order to add a child to the root element we used the ElementTree object’s getroot function. The getroot function simply returns the root Element object of the XML data.
Now that we have a third child element, let’s write the XML data back out to our.xml. Thanks to ElementTree this is a painless experience:
tree.write(xml_file)
That’s it!
If we want to be really careful when writing the XML data out to a file, we’ll watch out for exceptions. However, most of the time the write method will succeed without throwing an exception; it is more important to be sure that the path used is correct. Oftentimes, instead of getting the exception that you want, you end up with an XML file stored in some far off and strange location on your hard drive because your path was incorrect or you did not specify the full path. But, as is often the case when programming, better safe than sorry:
try:
       tree.write(xml_file)
except Exception, inst:
       print "Unexpected error writing to file %s: %s" % (xml_file, inst)
       return
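One way to guard against the “far off and strange location” problem is to write to an explicit absolute path and then read the file straight back. The sketch below is my own illustration. It uses a temporary directory from the standard tempfile module in place of the script’s directory, and a small in-memory tree in place of the one loaded from our.xml.

```python
import os
import tempfile
from xml.etree import ElementTree as ET

# Build a small tree in memory, standing in for the one read from our.xml.
tree = ET.ElementTree(ET.XML("<root><child>One</child></root>"))
child = ET.SubElement(tree.getroot(), "child")
child.text = "Two"

# Write to an explicit absolute path so there is no doubt where the file goes.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "our.xml")
tree.write(out_path)

# Read the file back to confirm what was actually saved.
saved_texts = [c.text for c in ET.parse(out_path).getroot()]
print(out_path)
print(saved_texts)
```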

To recap, you can find all of the code from this section in Listing 4.

Listing 4
#!/usr/bin/env python
from xml.etree import ElementTree as ET
import os
def main():
 xml_file = os.path.abspath(__file__)
 xml_file = os.path.dirname(xml_file)
 xml_file = os.path.join(xml_file, "our.xml")

 try:
  tree = ET.parse(xml_file)
 except Exception, inst:
  print "Unexpected error opening %s: %s" % (xml_file, inst)
  return

 child = ET.SubElement(tree.getroot(), "child")
 child.text = "Three"

 try:
  tree.write(xml_file)
 except Exception, inst:
  print "Unexpected error writing to file %s: %s" % (xml_file, inst)
  return

if __name__ == "__main__":
 # Someone is launching this directly
 main()

When you run the code and take a look at the our.xml file you should see that the third child element has been added:
<root>
<child>One</child>
<child>Two</child>
<child>Three</child>
</root>

Reading from the Web

Working with a local file is very useful, but you might also be in a situation where you will have to work with an XML file that is located on the Internet, perhaps an RSS feed. Fortunately, since the parse function explained above works with file-like objects, loading a URL is very easy.
First off, you need to import the urllib module, a standard module that allows you to open URLs in a manner similar to opening local files:
import urllib
In order to open a URL we use:
feed = urllib.urlopen("http://pythonmagazine.com/c/news/atom")
tree = ET.parse(feed)
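The urllib.urlopen call above is the Python 2 spelling; as far as I know, on Python 3 the same function lives in urllib.request. The sketch below wraps the two steps in a hypothetical helper of my own, load_feed, with an import that works on either version. It also closes the connection once parsing is finished.

```python
from xml.etree import ElementTree as ET

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib import urlopen          # Python 2


def load_feed(url):
    """Open the URL and parse the response as XML, closing it afterwards."""
    feed = urlopen(url)
    try:
        # The response object is file-like, so parse can read from it.
        return ET.parse(feed)
    finally:
        feed.close()
```

You would call it as tree = load_feed("http://pythonmagazine.com/c/news/atom"), with the caveat that I cannot promise the feed address used in this article is still being served.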

Conclusion

And that’s that! This concludes our brief introduction to XML parsing using the ElementTree module. Hopefully throughout this article you have seen how easy it is to create and manipulate XML using ElementTree …and I’ve only scratched the surface. For more information take a look at the official Python documentation and some of the great examples on the effbot website. I’m sure you’ll be an XML wizard in no time.
