Check out my first novel, midnight's simulacra!

Xmlstarlet

From dankwiki

XMLStarlet is a command line tool for manipulation of XML. It exposes XPath, XSLT, and validation among many other capabilities. On FreeBSD, it is generally invoked as xml and is available from the Ports Collection in textproc/xmlstarlet; on Linux, it is generally invoked as xmlstarlet (throughout the rest of this article, I will use xml). XMLstarlet invocations take the form: xml command [command options] [filename]. It is intended for use within pipelines; any diagnostic output generated will be placed on stderr. To get help on a command, run xml command --help.

As more and more tools generate XML-based output (subversion, nmap, etc), xmlstarlet becomes a useful and flexible tool for working with XML in a stream-editing context.

Escaping and Unescaping

Due to entity references, text must be processed prior to being embedded within an element as data. Conversion to and from entity references is managed with xml esc and xml unesc. Neither of these commands take any options. The input has no well-formedness requirements.

[diaconicon](0) $ xmlstarlet esc \<\>\& && echo \<\>\& | xmlstarlet esc
&lt;&gt;&amp;
&lt;&gt;&amp;
[diaconicon](0) $ xmlstarlet esc \<\>\& | xmlstarlet unesc && echo \<\>\& | xmlstarlet esc | xargs xmlstarlet unesc
<>&
<>&
[diaconicon](0) $

Document Presentation

Canonical Representation

xml c14n can be used to canonicalize XML (this primarily consists of removing whitespace leading or trailing the root node, and expanding empty nodes which use the <foo/> notation). The input must be well-formed. Comments can either be preserved or stripped via the with-comments and without-comments command-line options.

[diaconicon](0) $ echo -e "\n<foo>  <bar  />\n<qux> crap\tcrap </qux></foo>" | xmlstarlet c14n
<foo>  <bar></bar>
<qux> crap      crap </qux></foo>[diaconicon](0) $

Element Structure

xml el will display the element structure of well-formed XML. Attributes and their values can also be displayed; use the -a option to enable attribute output, and the -v option to enable attribute + value output. Each element and its complete ancestry will appear on its own line, with generations demarcated with foreslashes. Attributes and their values, if enabled, follow the element path in square brackets.

[diaconicon](0) $ echo '<foo><bar qux="quux" fux="fuux">quuux</bar><baz/></foo>' | xmlstarlet el -v
foo
foo/bar[@qux='quux' and @fux='fuux']
foo/baz
[diaconicon](0) $

Line-oriented (PyX)

The PyX (PostScript + Python + TeX) notation can be used to prepare XML data for processing by line-oriented tools such as grep or wc. Each (preserved) syntactic element in the XML source results in a single line of text prefaced with a PyX classifier; this mapping is bijective:

XML Syntax PyX Classifier XML Example PyX Equivalent
start-tag '(' <foo>, <foo/> (foo
end-tag ')' </foo>, <foo/> )foo
attribute 'A' foo="bar" Afoo bar
character data '-' <foo>bar</foo> -bar
processing instruction '?' <?A4TypeSetter PageBreak?> ?A4TypeSetter PageBreak

Use xml pyx to convert from XML to PyX and xml p2x to convert from PyX to XML. The input must be well-formed in either case.

[diaconicon](0) $ echo '<foo><?xml-stylesheet href="headlines.css" type="text/css"?><bar qux="quux">quuux</bar><baz/></foo>' | xmlstarlet pyx
(foo
?xml-stylesheet href="headlines.css" type="text/css"
(bar
Aqux quux
-quuux
)bar
(baz
)baz
)foo
[diaconicon](0) $ echo '<foo><?xml-stylesheet href="headlines.css" type="text/css"?><bar qux="quux">quuux</bar><baz/></foo>' | xmlstarlet pyx | xmlstarlet p2x
<foo><?xml-stylesheet href="headlines.css" type="text/css"?>
<bar qux="quux">quuux</bar><baz></baz></foo>
[diaconicon](0) $

Pretty Printing

xml fo can pretty-print well-formed input using either spaces or tabs. Consult xml fo --help; it's pretty self-explanatory.

XML Input

Directory Listings

xml ls will output a verbose listing of the current directory's contents, displaying all times in ISO 8601 format. It takes no options. The return value is, for reasons unfathomable, the number of files listed as opposed to a success/error indicator -- don't get caught unawares.

[diaconicon](0) $ xmlstarlet ls
<dir>
<d p="rwxr-xr-x" a="20070421T065820Z" m="20070415T190332Z" s="4096" n="trunk"/>
<d p="rwxr-xr-x" a="20070421T044013Z" m="20070419T181636Z" s="4096" n=".svn"/>
</dir>
[diaconicon](2) $

Recovering Documents

xml fo -R attempts to magically fix up malformed XML. It's pretty badass.

[diaconicon](0) $ echo "<fucked> this is all <fubar> </fukked>" | xmlstarlet fo -R
-:1: parser error : Opening and ending tag mismatch: fubar line 1 and fukked
<fucked> this is all <fubar> </fukked>
                                      ^
-:2: parser error : Premature end of data in tag fucked line 1

^
<?xml version="1.0"?>
<fucked> this is all <fubar> </fubar>
</fucked>
[diaconicon](0) $

Subversion

svn log and svn status both accept the --xml command line option. See the subversion page for some recipes.

[diaconicon](0) $ svn status --xml
<?xml version="1.0"?>
<status>
<target
   path=".">
<entry
   path="trunk/clients/log.xml">
<wc-status
   props="none"
   item="unversioned">
</wc-status>
</entry>
</target>
</status>
[diaconicon](0) $ svn log -r299 --xml
<?xml version="1.0"?>
<log>
<logentry
   revision="299">
<author>dank</author>
<date>2007-04-19T16:47:10.226838Z</date>
<msg>reenable the exitcode etc buildreport handlers</msg>
</logentry>
</log>
[diaconicon](0) $

Validation

All forms of the xml val command as used below can be provided the -e command line option to verbosely print the causes of validation failures on stderr.

Well-formedness

Well-formedness of an XML document can be verified with xml val. Well-formedness mandates a single root node, balancing of all element start- and end-tags, proper syntax, proper use of character entities where necessary, etc.

RELAX-NG

If a RELAX-NG schema has been provided, validation can be performed against it using xml val -r schema.

Transformation

XSLT is awesomely powerful.

Dump all content

Delimited by spaces: xml sel -t -m //\*\[not\(\*\)\] -v . -o ' '

Querying

XPath can be used to programmatically explore XML.

Counting

  • Count all nodes in a document:xml sel -t -v "count(//node())"
  • Count all occurrences of somenode nodes:xml sel -t -v "count(//somenode)"

See Also