An Introduction to XPath

Published Jan 22, 2016. Last updated Mar 14, 2017
Introduction

XPath is a standard for accessing and obtaining data from XML documents. It takes into account that XML is structured hierarchically as a tree. By defining paths across the tree, it allows identifying specific parts of an XML document.

Path Expressions

A path expression locates nodes inside the hierarchical structure of an XML document. Each expression consists of one or more steps across the tree, separated by “/”.

 step1/step2/… 

If the path expression starts with “/”, its evaluation starts from the root of the XML document, the node that contains the root element.

/step1/step2/…

Each step consists of an axis, a node test and zero or more predicates:

axis::nodetest[predicate1][predicate2]

Steps are evaluated in relation to the set of nodes produced by the previous step in the expression, the one to the left of the current one.
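The way each step consumes the node set left by the previous one can be sketched in Python with the standard library's xml.etree.ElementTree. This is only an illustration under stated assumptions: the sample document is a simplified, hand-written stand-in without namespaces, the second artist name is a placeholder, and evaluate_steps is a hypothetical helper that handles child-axis name tests only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample (no namespaces); not real MusicBrainz output.
SAMPLE = """<metadata>
  <artist><name>Adele</name></artist>
  <artist><name>Placeholder Artist</name></artist>
</metadata>"""


def evaluate_steps(context_nodes, steps):
    """Apply child-axis name tests one step at a time, left to right."""
    current = list(context_nodes)
    for step in steps:
        # Each node left by the previous step is a context node for this step.
        current = [child for node in current for child in node if child.tag == step]
    return current


root = ET.fromstring(SAMPLE)  # the metadata root element
names = evaluate_steps([root], ["artist", "name"])
name_texts = [name.text for name in names]
print(name_texts)
```

ElementTree's own root.findall("artist/name") should select the same elements; the loop only makes the intermediate node sets visible.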

For instance, the following expression selects, from the specified XML document, the text content of the name elements inside artist elements that are children of the metadata root element:

$doc/child::mmd:metadata/child::mmd:artist/child::mmd:name/child::text()

The XPath examples in this tutorial use the XML generated by MusicBrainz. All the MusicBrainz schema elements are defined in a namespace, so the examples bind the mmd prefix to it. The $doc variable stores the XML to work with, which is retrieved using the MusicBrainz API. For instance, to retrieve information about the artist Adele:

http://musicbrainz.org/ws/2/artist/cc2c9c3c-b7bc-4b8b-84d8-4fbd8779e493?inc=release-groups+releases

Consequently, all the examples will start with the following two lines:

declare namespace mmd="http://musicbrainz.org/ns/mmd-2.0#";
declare variable $doc := doc("http://musicbrainz.org/ws/2/artist/cc2c9c3c-b7bc-4b8b-84d8-4fbd8779e493?inc=release-groups+releases");
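Outside XQuery, the same prefix-to-namespace binding has to be supplied by the host language. As a rough sketch, Python's standard xml.etree.ElementTree accepts it as a dictionary. The inline sample is an assumption: a hand-written fragment that only imitates the shape of the MusicBrainz response, so the example runs without network access. ElementTree implements just a subset of XPath, and its paths are relative to the element they are called on.

```python
import xml.etree.ElementTree as ET

# Prefix binding, playing the role of the "declare namespace" line.
NS = {"mmd": "http://musicbrainz.org/ns/mmd-2.0#"}

# Hand-written stand-in for the document stored in $doc (assumption).
SAMPLE = """<metadata xmlns="http://musicbrainz.org/ns/mmd-2.0#">
  <artist type="Person">
    <name>Adele</name>
  </artist>
</metadata>"""

root = ET.fromstring(SAMPLE)  # root is the mmd:metadata element

# Relative to mmd:metadata, so the path starts at its mmd:artist children.
artist_names = [name.text for name in root.findall("mmd:artist/mmd:name", NS)]

# Without the prefix, the unqualified name "artist" matches nothing.
unqualified = root.findall("artist/name")
print(artist_names, len(unqualified))
```

The empty second result is the reason the namespace needs to be declared: the elements are named in the MusicBrainz namespace, not by their bare local names.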

They can be tested online with an XPath interpreter like:
http://www.semwebtech.org/xquery-demo

Or using Java:
XQueryHelper.java Gist

Axes

The axis in a path expression step specifies the direction in which the evaluation proceeds: up or down the hierarchy, including the current node or not, and so on. The axes are presented next, and their application is illustrated in the figure at the beginning of this tutorial (axes are applied in relation to the current node, like the one selected by self in the figure):

  • ancestor: this axis selects all ancestors (parent, grandparent, etc.) of the current node
  • ancestor-or-self: selects all ancestors (parent, grandparent, etc.) of the current node and the current node itself
  • attribute: selects all attribute nodes of the current node
  • child: selects all children of the current node
  • descendant: selects all descendants (children, grandchildren, etc.) of the current node
  • descendant-or-self: selects all descendants (children, grandchildren, etc.) of the current node and the current node itself
  • following: selects everything in the document after the closing tag of the current node
  • following-sibling: selects all siblings after the current node
  • namespace: selects all namespace nodes of the current node
  • parent: selects the parent of the current node
  • preceding: selects everything in the document that is before the start tag of the current node, excluding its ancestors
  • preceding-sibling: selects all siblings before the current node
  • self: selects the current node
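What some of these axes select can be made concrete with a small hand-rolled sketch in Python. It is not how an XPath engine is implemented: it models element nodes only (no attribute, text, namespace or document nodes), the sample is a hypothetical document without namespaces, and ancestors, descendants and siblings_split are illustrative helper names.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample (no namespaces); not real MusicBrainz output.
SAMPLE = """<metadata>
  <artist>
    <name>Adele</name>
    <release-list>
      <release id="r1"/>
      <release id="r2"/>
      <release id="r3"/>
    </release-list>
  </artist>
</metadata>"""

root = ET.fromstring(SAMPLE)
# Elements do not know their parent, so build a child -> parent lookup first.
parent_of = {child: parent for parent in root.iter() for child in parent}


def ancestors(node):
    """ancestor axis (elements only), nearest ancestor first."""
    found = []
    while node in parent_of:
        node = parent_of[node]
        found.append(node)
    return found


def descendants(node):
    """descendant axis (elements only), in document order."""
    return [n for n in node.iter() if n is not node]


def siblings_split(node):
    """Return (preceding-sibling, following-sibling) element lists."""
    parent = parent_of.get(node)
    if parent is None:  # the root element has no siblings
        return [], []
    children = list(parent)
    position = children.index(node)
    return children[:position], children[position + 1:]


current = root.find(".//release[@id='r2']")  # plays the role of "self"
ancestor_tags = [n.tag for n in ancestors(current)]
preceding, following = siblings_split(current)
print(ancestor_tags, [n.get("id") for n in preceding], [n.get("id") for n in following])
```

The -or-self variants would simply add the current node to the corresponding list.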

The more common axes can be abbreviated, or in some cases omitted:

  • child: this is the default axis so it can be omitted.
    • For instance, to get just the text contained in all name elements inside an artist node inside the metadata root element of the specified XML document, it is possible to abbreviate the XPath to:
$doc/mmd:metadata/mmd:artist/mmd:name/text()
  • attribute: it can be abbreviated as “@”.
    • For instance, to get the attribute named “type” for all artists in the XML document:
$doc/mmd:metadata/mmd:artist/@type
  • self::node(): is equivalent to a dot “.”
  • parent::node(): can be replaced with two dots “..”
  • /descendant-or-self::node()/: is equivalent to “//”
    • For instance, to get the attribute named “count” for the parent node of any release element, anywhere in the XML document:
$doc//mmd:release/../@count
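Several of these abbreviations are also understood by Python's standard xml.etree.ElementTree, which gives a quick way to try them. The sketch below assumes a hand-written sample that only imitates the shape of the MusicBrainz data. ElementTree covers a subset of XPath: paths are relative to the element they are called on, and a path cannot end in an attribute step, so attributes are read with get() instead.

```python
import xml.etree.ElementTree as ET

NS = {"mmd": "http://musicbrainz.org/ns/mmd-2.0#"}

# Hand-written stand-in that imitates the shape of the data (assumption).
SAMPLE = """<metadata xmlns="http://musicbrainz.org/ns/mmd-2.0#">
  <artist type="Person">
    <name>Adele</name>
    <release-list count="2">
      <release id="r1"/>
      <release id="r2"/>
    </release-list>
  </artist>
</metadata>"""

root = ET.fromstring(SAMPLE)  # the mmd:metadata element

# Default child axis: mmd:artist/mmd:name, then read the text content.
names = [n.text for n in root.findall("mmd:artist/mmd:name", NS)]

# "@type": select the elements, then read the attribute with get().
types = [a.get("type") for a in root.findall("mmd:artist", NS)]

# "//" and "..": parents of every mmd:release, anywhere below the root.
parents = root.findall(".//mmd:release/..", NS)
counts = {p.get("count") for p in parents}  # a set, so duplicates collapse

# ".": the context element itself.
same = root.find(".")
print(names, types, counts)
```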

Node Tests

These tests are used to include or exclude nodes selected by an axis. The result, after applying a node test, is a subset of the nodes selected by the axis, those that satisfy the test. The available tests are:

  • node-name: where node-name is the actual name of the nodes to be selected.
    • For instance “release-group” will select all nodes named like that.
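  • node(): matches any node, whatever its type.
  • *: matches any node of the principal type of the axis, that is, any element for most axes or any attribute for the attribute axis.
    • For instance child::*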
  • text(): matches any text node.
  • comment(): matches any comment node.
  • element(): matches any element node (XPath 2.0 and later).
  • attribute(): matches any attribute node (XPath 2.0 and later).
  • attribute(price): matches any attribute whose name is price (XPath 2.0 and later).
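The difference between a name test and “*” can be tried out with Python's standard xml.etree.ElementTree, as in the following sketch. The sample is a hypothetical, simplified document without namespaces. ElementTree supports only name tests and “*” as steps; it has no text() step, so the text is read from the element's text property instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample (no namespaces); not real MusicBrainz output.
SAMPLE = """<artist>
  <name>Adele</name>
  <release-group-list>
    <release-group type="Album"/>
    <release-group type="Single"/>
  </release-group-list>
</artist>"""

artist = ET.fromstring(SAMPLE)

# Name test: only the children called "name" survive.
by_name = [child.tag for child in artist.findall("name")]

# "*": every child element passes, whatever its name.
any_element = [child.tag for child in artist.findall("*")]

# Name test applied to the descendants: all release-group elements.
groups = artist.findall(".//release-group")

# No text() step in ElementTree; the text is read from the element.
name_text = artist.find("name").text
print(by_name, any_element, len(groups), name_text)
```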

Predicates

A predicate further restricts the set of nodes selected by the combination of axis and node test it is attached to. It states a condition that each selected node must satisfy to remain in the result.

The predicates are included in the expression between “[” and “]”. In order to build the logical expressions in predicates, it is possible to combine axes and node-tests with operators and functions defined by the XPath specification.

These are the operators provided by XPath:

  • |: joins two node-sets, for instance //release | //release-group
  • +: addition, for instance 6 + 4
  • -: subtraction, for instance 6 - 4
  • *: multiplication, for instance 6 * 4
  • div: division, for instance 8 div 4
  • =: equal, for instance count = 9
  • !=: not equal, for instance count != 9
  • <: less than, for instance count < 9
  • <=: less than or equal to, for instance count <= 9
  • >: greater than, for instance count > 9
  • >=: greater than or equal to, for instance count >= 9
  • or: logical OR, for instance count < 8 or count > 10
  • and: logical AND, for instance count > 8 and count < 10
  • mod: modulus, division remainder, for instance 5 mod 2

For instance, the following two expressions use predicates to narrow the selection: the first keeps only the release-group elements whose type attribute equals “Album”, and the second keeps only the release-list elements whose count attribute is greater than 10:

$doc//mmd:release-group[@type="Album"]
$doc//mmd:release-list[@count>10]
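Both predicates can be approximated with Python's standard xml.etree.ElementTree, as sketched below on a hand-written sample that only imitates the shape of the MusicBrainz data (assumption). ElementTree handles equality predicates directly, but it has no “>” operator in predicates, so the numeric comparison is done in Python after an attribute-existence test.

```python
import xml.etree.ElementTree as ET

NS = {"mmd": "http://musicbrainz.org/ns/mmd-2.0#"}

# Hand-written stand-in that imitates the shape of the data (assumption).
SAMPLE = """<metadata xmlns="http://musicbrainz.org/ns/mmd-2.0#">
  <artist>
    <release-group-list count="3">
      <release-group type="Album" id="g1"/>
      <release-group type="Single" id="g2"/>
      <release-group type="Album" id="g3"/>
    </release-group-list>
    <release-list count="12"/>
  </artist>
</metadata>"""

root = ET.fromstring(SAMPLE)

# //mmd:release-group[@type="Album"]: equality predicates are supported.
albums = root.findall(".//mmd:release-group[@type='Album']", NS)
album_ids = [g.get("id") for g in albums]

# //mmd:release-list[@count>10]: keep the [@count] existence test and
# compare the number in Python, since ElementTree lacks ">".
big_lists = [
    r for r in root.findall(".//mmd:release-list[@count]", NS)
    if int(r.get("count")) > 10
]

# Chained predicates act as a logical AND.
g3 = root.findall(".//mmd:release-group[@type='Album'][@id='g3']", NS)
print(album_ids, [r.get("count") for r in big_lists], [g.get("id") for g in g3])
```

A full XPath engine would evaluate the comparison itself, including the conversion of the attribute value to a number.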