The lxml.etree Tutorial
Author:
Stefan Behnel
This is a tutorial on XML processing with lxml.etree. It briefly
overviews the main concepts of the ElementTree API, and some simple
enhancements that make your life as a programmer easier.
For a complete reference of the API, see the generated API
documentation.
Contents
The Element class
Elements are lists
Elements carry attributes as a dict
Elements contain text
Using XPath to find text
Tree iteration
Serialisation
The ElementTree class
Parsing from strings and files
The fromstring() function
The XML() function
The parse() function
Parser objects
Incremental parsing
Event-driven parsing
Namespaces
The E-factory
ElementPath
A common way to import lxml.etree is as follows:
>>> from lxml import etree
If your code only uses the ElementTree API and does not rely on any
functionality that is specific to lxml.etree, you can also use (any part
of) the following import chain as a fall-back to the original ElementTree:
try:
    from lxml import etree
    print("running with lxml.etree")
except ImportError:
    try:
        # Python 2.5
        import xml.etree.cElementTree as etree
        print("running with cElementTree on Python 2.5+")
    except ImportError:
        try:
            # Python 2.5
            import xml.etree.ElementTree as etree
            print("running with ElementTree on Python 2.5+")
        except ImportError:
            try:
                # normal cElementTree install
                import cElementTree as etree
                print("running with cElementTree")
            except ImportError:
                try:
                    # normal ElementTree install
                    import elementtree.ElementTree as etree
                    print("running with ElementTree")
                except ImportError:
                    print("Failed to import ElementTree from any known place")
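On Python 3, the same fall-back idea can be sketched with a two-step chain. This is an illustration rather than part of the original import chain; the variable name backend is made up for the example, and only ElementTree-compatible calls are used after the import.

```python
# Prefer lxml.etree, fall back to the standard library's ElementTree.
try:
    from lxml import etree
    backend = "lxml.etree"          # hypothetical name, for illustration only
except ImportError:
    import xml.etree.ElementTree as etree
    backend = "xml.etree.ElementTree"

# Only calls that both implementations provide are used below.
root = etree.Element("root")
etree.SubElement(root, "child")
print("running with", backend)
print("root has", len(root), "child element(s)")
```

Code written this way must avoid the calls that this tutorial marks as "lxml.etree only".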
To aid in writing portable code, this tutorial makes it clear in the examples
which part of the presented API is an extension of lxml.etree over the
original ElementTree API, as defined by Fredrik Lundh’s ElementTree
library.
An Element is the main container object for the ElementTree API. Most of
the XML tree functionality is accessed through this class. Elements are
easily created through the Element factory:
>>> root = etree.Element("root")
The XML tag name of elements is accessed through the tag property:
>>> print(root.tag)
root
Elements are organised in an XML tree structure. To create child elements and
add them to a parent element, you can use the append() method:
>>> root.append(etree.Element("child1"))
However, this is so common that there is a shorter and much more efficient way
to do this: the SubElement factory. It accepts the same arguments as the
Element factory, but additionally requires the parent as first argument:
>>> child2 = etree.SubElement(root, "child2")
>>> child3 = etree.SubElement(root, "child3")
To see that this is really XML, you can serialise the tree you have created:
>>> print(etree.tostring(root, pretty_print=True).decode())
<root>
  <child1/>
  <child2/>
  <child3/>
</root>
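The Element and SubElement factories combine naturally in a small helper. The following is a minimal sketch; build_tree is a hypothetical name that is not part of lxml, and the import falls back to the standard library so that the example also runs without lxml installed.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree


def build_tree(root_tag, child_tags):
    """Create a root element with one empty child per tag name."""
    root = etree.Element(root_tag)
    for tag in child_tags:
        # SubElement creates the child and appends it to root in one step.
        etree.SubElement(root, tag)
    return root


root = build_tree("root", ["child1", "child2", "child3"])
print([child.tag for child in root])
print(etree.tostring(root))
```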
To make the access to these subelements easy and straightforward,
elements mimic the behaviour of normal Python lists as closely as
possible:
>>> child = root[0]
>>> print(child.tag)
child1
>>> print(len(root))
3
>>> root.index(root[1])  # lxml.etree only!
1
>>> children = list(root)
>>> for child in root:
...     print(child.tag)
child1
child2
child3
>>> root.insert(0, etree.Element("child0"))
>>> start = root[:1]
>>> end = root[-1:]
>>> print(start[0].tag)
child0
>>> print(end[0].tag)
child3
Prior to ElementTree 1.3 and lxml 2.0, you could also check the truth value of
an Element to see if it has children, i.e. if the list of children is not empty:
if root:   # this no longer works!
    print("The root element has children")
This is no longer supported as people tend to expect that a “something”
evaluates to True and expect Elements to be “something”, may they have
children or not. So, many users find it surprising that any Element
would evaluate to False in an if-statement like the above. Instead,
use len(element), which is both more explicit and less error prone.
>>> print(etree.iselement(root))  # test if it's some kind of Element
True
>>> if len(root):                 # test if it has children
...     print("The root element has children")
The root element has children
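The two questions "is there an element at all?" and "does it have children?" can be kept apart explicitly. The sketch below assumes a hypothetical helper named describe; it uses find(), which returns None when no matching child exists, and len() for the child count.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree


def describe(element):
    """Classify a lookup result without relying on Element truthiness."""
    if element is None:          # nothing was found
        return "missing"
    if len(element) == 0:        # an element without children
        return "leaf"
    return "parent of %d" % len(element)


root = etree.Element("root")
etree.SubElement(root, "child")

print(describe(root))                  # element with one child
print(describe(root.find("child")))    # element without children
print(describe(root.find("nothing")))  # find() returned None
```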
There is another important case where the behaviour of Elements in lxml
(in 2.0 and later) deviates from that of lists and from that of the
original ElementTree (prior to version 1.3 or Python 2.7/3.2):
>>> root[0] = root[-1]  # this moves the element in lxml.etree!
In this example, the last element is moved to a different position,
instead of being copied, i.e. it is automatically removed from its
previous position when it is put in a different place. In lists,
objects can appear in multiple positions at the same time, and the
above assignment would just copy the item reference into the first
position, so that both contain the exact same item:
>>> l = [0, 1, 2, 3]
>>> l[0] = l[-1]
>>> l
[3, 1, 2, 3]
Note that in the original ElementTree, a single Element object can sit
in any number of places in any number of trees, which allows for the same
copy operation as with lists. The obvious drawback is that modifications
to such an Element will apply to all places where it appears in a tree,
which may or may not be intended.
The upside of this difference is that an Element in lxml.etree always
has exactly one parent, which can be queried through the getparent()
method. This is not supported in the original ElementTree.
>>> root is root[0].getparent()  # lxml.etree only!
True
If you want to copy an element to a different position in lxml.etree,
consider creating an independent deep copy using the copy module
from Python’s standard library:
>>> from copy import deepcopy
>>> element = etree.Element("neu")
>>> element.append(deepcopy(root[1]))
>>> print(element[0].tag)
child1
>>> print([c.tag for c in root])
['child3', 'child1', 'child2']
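The difference between moving and copying can be seen side by side in a short sketch. It relies on the lxml.etree behaviour described above (one parent per Element), so it needs lxml itself; the tag names are made up for the example.

```python
from copy import deepcopy

from lxml import etree

root = etree.XML("<root><a/><b/><c/></root>")
other = etree.Element("other")

# Appending an element that already has a parent moves it out of the old tree.
other.append(root[0])
tags_after_move = [el.tag for el in root]

# Appending a deep copy leaves the source tree untouched.
other.append(deepcopy(root[0]))
tags_after_copy = [el.tag for el in root]

print(tags_after_move)
print(tags_after_copy)
print([el.tag for el in other])
```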
The siblings (or neighbours) of an element are accessed as next and previous
elements:
>>> root[0] is root[1].getprevious()  # lxml.etree only!
True
>>> root[1] is root[0].getnext()  # lxml.etree only!
True
XML elements support attributes. You can create them directly in the Element
factory:
>>> root = etree.Element("root", interesting="totally")
>>> etree.tostring(root)
b'<root interesting="totally"/>'
Attributes are just unordered name-value pairs, so a very convenient way
of dealing with them is through the dictionary-like interface of Elements:
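>>> print(root.get("interesting"))
totally
>>> print(root.get("hello"))
None
>>> root.set("hello", "Huhu")
>>> print(root.get("hello"))
Huhu
>>> etree.tostring(root)
b'<root interesting="totally" hello="Huhu"/>'
>>> sorted(root.keys())
['hello', 'interesting']
>>> for name, value in sorted(root.items()):
...     print('%s = %r' % (name, value))
hello = 'Huhu'
interesting = 'totally'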
For the cases where you want to do item lookup or have other reasons for
getting a ‘real’ dictionary-like object, e.g. for passing it around,
you can use the attrib property:
>>> attributes = root.attrib
>>> print(attributes["interesting"])
totally
>>> print(attributes.get("no-such-attribute"))
None
>>> attributes["hello"] = "Guten Tag"
>>> print(attributes["hello"])
Guten Tag
>>> print(root.get("hello"))
Guten Tag
Note that attrib is a dict-like object backed by the Element itself.
This means that any changes to the Element are reflected in attrib
and vice versa. It also means that the XML tree stays alive in memory
as long as the attrib of one of its Elements is in use. To get an
independent snapshot of the attributes that does not depend on the XML
tree, copy it into a dict:
>>> d = dict(root.attrib)
>>> sorted(d.items())
[('hello', 'Guten Tag'), ('interesting', 'totally')]
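The difference between the live attrib view and a copied snapshot shows up as soon as the Element changes afterwards. A minimal sketch, with variable names chosen only for this example:

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

root = etree.Element("root", interesting="totally")

live = root.attrib            # backed by the Element itself
snapshot = dict(root.attrib)  # independent copy, taken right now

root.set("hello", "Huhu")     # change the Element afterwards

print("hello" in live)        # the view sees the new attribute
print("hello" in snapshot)    # the snapshot does not
```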
Elements can contain text:
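>>> root = etree.Element("root")
>>> root.text = "TEXT"
>>> print(root.text)
TEXT
>>> etree.tostring(root)
b'<root>TEXT</root>'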
In many XML documents (data-centric documents), this is the only place where
text can be found. It is encapsulated by a leaf tag at the very bottom of the
tree hierarchy.
However, if XML is used for tagged text documents such as (X)HTML, text can
also appear between different elements, right in the middle of the tree:
<html><body>Hello<br/>World</body></html>
Here, the <br/> tag is surrounded by text. This is often referred to as
document-style or mixed-content XML. Elements support this through their
tail property. It contains the text that directly follows the element, up
to the next element in the XML tree:
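>>> html = etree.Element("html")
>>> body = etree.SubElement(html, "body")
>>> body.text = "TEXT"
>>> etree.tostring(html)
b'<html><body>TEXT</body></html>'
>>> br = etree.SubElement(body, "br")
>>> etree.tostring(html)
b'<html><body>TEXT<br/></body></html>'
>>> br.tail = "TAIL"
>>> etree.tostring(html)
b'<html><body>TEXT<br/>TAIL</body></html>'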
The two properties .text and .tail are enough to represent any
text content in an XML document. This way, the ElementTree API does
not require any special text nodes in addition to the Element
class, that tend to get in the way fairly often (as you might know
from classic DOM APIs).
However, there are cases where the tail text also gets in the way.
For example, when you serialise an Element from within the tree, you
do not always want its tail text in the result (although you would
still want the tail text of its children). For this purpose, the
tostring() function accepts the keyword argument with_tail:
>>> etree.tostring(br)
b'<br/>TAIL'
>>> etree.tostring(br, with_tail=False)  # lxml.etree only!
b'<br/>'
If you want to read only the text, i.e. without any intermediate
tags, you have to recursively concatenate all text and tail
attributes in the correct order. Again, the tostring() function
comes to the rescue, this time using the method keyword:
>>> etree.tostring(html, method="text")
b'TEXTTAIL'
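The recursive concatenation that tostring(method="text") performs for you can also be written out by hand, which makes the role of text and tail explicit. The generator below is a sketch with a hypothetical name, iter_text_chunks; it is not an lxml function.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

html = etree.Element("html")
body = etree.SubElement(html, "body")
body.text = "TEXT"
br = etree.SubElement(body, "br")
br.tail = "TAIL"


def iter_text_chunks(element):
    """Yield the text and tail strings below element in document order."""
    if element.text:
        yield element.text
    for child in element:
        # A child's own content comes first, then the text that follows it.
        yield from iter_text_chunks(child)
        if child.tail:
            yield child.tail


chunks = list(iter_text_chunks(html))
print(chunks)
print("".join(chunks))
```

The tail of the element passed in is deliberately left out, mirroring the with_tail=False case described above.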
Another way to extract the text content of a tree is XPath, which
also allows you to extract the separate text chunks into a list:
>>> print(html.xpath("string()"))  # lxml.etree only!
TEXTTAIL
>>> print(html.xpath("//text()"))  # lxml.etree only!
['TEXT', 'TAIL']
If you want to use this more often, you can wrap it in a function:
>>> build_text_list = etree.XPath("//text()")  # lxml.etree only!
>>> print(build_text_list(html))
['TEXT', 'TAIL']
Note that a string result returned by XPath is a special ‘smart’
object that knows about its origins. You can ask it where it came
from through its getparent() method, just as you would with
Elements:
>>> texts = build_text_list(html)
>>> print(texts[0])
TEXT
>>> parent = texts[0].getparent()
>>> print(parent.tag)
body
>>> print(texts[1])
TAIL
>>> print(texts[1].getparent().tag)
br
You can also find out if it’s normal text content or tail text:
>>> print(texts[0].is_text)
True
>>> print(texts[1].is_text)
False
>>> print(texts[1].is_tail)
True
While this works for the results of the text() function, lxml will
not tell you the origin of a string value that was constructed by the
XPath functions string() or concat():
>>> stringify = etree.XPath("string()")
>>> print(stringify(html))
TEXTTAIL
>>> print(stringify(html).getparent())
None
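The origin information of the smart strings can be collected into a plain data structure, for example to report where each text chunk of a document lives. A small sketch that needs lxml itself; the tuple layout is an arbitrary choice for this example.

```python
from lxml import etree

html = etree.XML("<html><body>TEXT<br/>TAIL</body></html>")

origins = []
for chunk in html.xpath("//text()"):
    # Each result is a string that also remembers its parent element
    # and whether it was that element's text or its tail.
    kind = "text" if chunk.is_text else "tail"
    origins.append((str(chunk), kind, chunk.getparent().tag))

print(origins)
```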
For problems like the above, where you want to recursively traverse the tree
and do something with its elements, tree iteration is a very convenient
solution. Elements provide a tree iterator for this purpose. It yields
elements in document order, i.e. in the order their tags would appear if you
serialised the tree to XML:
>>> root = etree.Element("root")
>>> etree.SubElement(root, "child").text = "Child 1"
>>> etree.SubElement(root, "child").text = "Child 2"
>>> etree.SubElement(root, "another").text = "Child 3"
>>> for element in root.iter():
...     print("%s - %s" % (element.tag, element.text))
root - None
child - Child 1
child - Child 2
another - Child 3
If you know you are only interested in a single tag, you can pass its name to
iter() to have it filter for you. Starting with lxml 3.0, you can also
pass more than one tag to intercept on multiple tags during iteration.
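>>> for element in root.iter("child"):
...     print("%s - %s" % (element.tag, element.text))
child - Child 1
child - Child 2
>>> for element in root.iter("another", "child"):
...     print("%s - %s" % (element.tag, element.text))
child - Child 1
child - Child 2
another - Child 3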
By default, iteration yields all nodes in the tree, including
ProcessingInstructions, Comments and Entity instances. If you want to
make sure only Element objects are returned, you can pass the
Element factory as tag parameter:
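>>> root.append(etree.Entity("#234"))
>>> root.append(etree.Comment("some comment"))
>>> for element in root.iter():
...     if isinstance(element.tag, str):  # 'basestring' in Python 2
...         print("%s - %s" % (element.tag, element.text))
...     else:
...         print("SPECIAL: %s - %s" % (element, element.text))
root - None
child - Child 1
child - Child 2
another - Child 3
SPECIAL: &#234; - &#234;
SPECIAL: <!--some comment--> - some comment
>>> for element in root.iter(tag=etree.Element):
...     print("%s - %s" % (element.tag, element.text))
root - None
child - Child 1
child - Child 2
another - Child 3
>>> for element in root.iter(tag=etree.Entity):
...     print(element.text)
&#234;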
Note that passing a wildcard “*” tag name will also yield all
Element nodes (and only elements).
In lxml.etree, elements provide further iterators for all directions in the
tree: children, parents (or rather ancestors) and siblings.
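These directional iterators are available as methods on lxml.etree Elements. The sketch below walks from one element in each direction; the document and its tag names are made up for the example, and it needs lxml itself.

```python
from lxml import etree

root = etree.XML("<root><a><b/><c/><d/></a></root>")
c = root[0][1]

ancestors = [el.tag for el in c.iterancestors()]            # parent first
following = [el.tag for el in c.itersiblings()]             # later siblings
preceding = [el.tag for el in c.itersiblings(preceding=True)]
children = [el.tag for el in root.iterchildren()]           # direct children only
descendants = [el.tag for el in root.iterdescendants()]     # whole subtree

print(ancestors, following, preceding, children, descendants)
```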
Serialisation commonly uses the tostring() function that returns a
string, or the ElementTree.write() method that writes to a file, a
file-like object, or a URL (via FTP PUT or HTTP POST). Both calls accept
the same keyword arguments like pretty_print for formatted output
or encoding to select a specific output encoding other than plain
ASCII:
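>>> root = etree.XML('<root><a><b/></a></root>')
>>> etree.tostring(root)
b'<root><a><b/></a></root>'
>>> print(etree.tostring(root, xml_declaration=True).decode())
<?xml version='1.0' encoding='ASCII'?>
<root><a><b/></a></root>
>>> print(etree.tostring(root, encoding='iso-8859-1').decode('iso-8859-1'))
<?xml version='1.0' encoding='iso-8859-1'?>
<root><a><b/></a></root>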
Note that pretty printing appends a newline at the end.
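Writing to a file-like object works the same way as writing to a file. The following sketch serialises into an in-memory buffer and parses the result back; the buffer stands in for a real file, and the import falls back to the standard library so that the example also runs without lxml installed.

```python
import io

try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

root = etree.XML("<root><a><b/></a></root>")

# Wrap the root in an ElementTree and write it, including an XML declaration.
buffer = io.BytesIO()
etree.ElementTree(root).write(buffer, encoding="utf-8", xml_declaration=True)
data = buffer.getvalue()

# Parse the serialised bytes again to check the round trip.
reparsed = etree.fromstring(data)
print(data)
print(reparsed[0][0].tag)
```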
For more fine-grained control over the pretty-printing, you can add
whitespace indentation to the tree before serialising it, using the
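indent() function (added in lxml 4.5):
>>> root = etree.XML('<root><a><b/>\n</a></root>')
>>> etree.tostring(root)
b'<root><a><b/>\n</a></root>'
>>> etree.indent(root)
>>> etree.tostring(root)
b'<root>\n  <a>\n    <b/>\n  </a>\n</root>'
>>> root.text
'\n  '
>>> root[0].text
'\n    '
>>> etree.indent(root, space="    ")
>>> etree.tostring(root)
b'<root>\n    <a>\n        <b/>\n    </a>\n</root>'
>>> etree.indent(root, space="\t")
>>> etree.tostring(root)
b'<root>\n\t<a>\n\t\t<b/>\n\t</a>\n</root>'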
In lxml 2.0 and later (as well as ElementTree 1.3), the serialisation
functions can do more than XML serialisation. You can serialise to
HTML or extract the text content by passing the method keyword:
>>> root = (… ‘