I See Dead Code

… as sounding brass, or a tinkling cymbal.


XML parsers? But I know regular expressions…

May 31st, 2007

Dear WordPress developers, I won’t even explain to you what is wrong with the statement above.

CDATA and Doubly-Escaped Entities

So let’s see. In theory, the XML snippets

<bla><![CDATA[&auml;]]></bla>

and

<bla>&amp;auml;</bla>

represent the same DOM tree.

To see that this is actually true, one can use the Swiss Army knife of XML handling, xmllint, to convert each XML file to its canonical form (C14N) and compare the results:

$ echo '<bla><![CDATA[&auml;]]></bla>'|xmllint --c14n -|md5sum
02ccfa2fff45fcb19b96a096637e3925  -
$ echo '<bla>&amp;auml;</bla>'|xmllint --c14n -|md5sum
02ccfa2fff45fcb19b96a096637e3925  -
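
The same check can be sketched in Python with ElementTree from the standard library. This is only an illustration of the equivalence, not part of the original xmllint session; the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

cdata_form = "<bla><![CDATA[&auml;]]></bla>"
escaped_form = "<bla>&amp;auml;</bla>"

# Both documents parse to a single <bla> element whose text is the
# literal six characters "&auml;" (no entity is expanded to "ä").
text_from_cdata = ET.fromstring(cdata_form).text
text_from_escaped = ET.fromstring(escaped_form).text

print(text_from_cdata == text_from_escaped)  # True

# Serializing the tree built from the CDATA form yields the escaped form,
# because ElementTree does not write CDATA sections.
print(ET.tostring(ET.fromstring(cdata_form), encoding="unicode"))
```

A real parser hands the application the same text in both cases, so code that consumes the parsed tree cannot even tell which spelling was used in the file.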
How Can I Break It?

Normally, XML parsers handle this correctly, and it does not matter whether a file contains CDATA sections or a lot of doubly-escaped entities like &amp;lt;. Unfortunately, all the niceness disappears when people decide that it’s a good idea to “parse” XML with a few regexes and simply assume that everybody uses CDATA sections for certain elements. This dreadful technique is used in the RSS importer of WordPress. Adding injury to insult, ElementTree does not support creating XML files with CDATA sections, because its author (rightfully) claims that they are not really needed. Because of this, I had to use a rather blunt workaround to finally get the XML into the format needed for a successful and good-looking import.
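
To make the failure mode concrete, here is a minimal sketch in Python. The regex is hypothetical and only written in the spirit of such an importer; it is not the actual WordPress code (which is PHP). The `wrap_in_cdata` helper is likewise just one possible blunt workaround, assumed for illustration; the workaround actually used for the import is not shown in this post.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import unescape

# Hypothetical pattern: it only matches when the element content is
# wrapped in a CDATA section, which is exactly the fragile assumption.
NAIVE_PATTERN = re.compile(
    r"<description><!\[CDATA\[(.*?)\]\]></description>", re.S
)


def naive_extract(xml_text):
    """Return the description text, or None if the CDATA assumption fails."""
    match = NAIVE_PATTERN.search(xml_text)
    return match.group(1) if match else None


def wrap_in_cdata(xml_text, tag):
    """Blunt post-processing: rewrite <tag>escaped text</tag> as CDATA.

    Assumes the element has no attributes and no child elements, and that
    its text does not contain "]]>".
    """
    name = re.escape(tag)
    pattern = re.compile(r"<%s>(.*?)</%s>" % (name, name), re.S)

    def replace(match):
        # Undo the &lt; / &gt; / &amp; escaping before wrapping the text.
        return "<%s><![CDATA[%s]]></%s>" % (tag, unescape(match.group(1)), tag)

    return pattern.sub(replace, xml_text)


# ElementTree serializes the text with escaped entities, never with CDATA.
item = ET.Element("item")
ET.SubElement(item, "description").text = "<p>Gr&uuml;&szlig;e</p>"
serialized = ET.tostring(item, encoding="unicode")

print(naive_extract(serialized))  # None: the regex "parser" finds nothing

patched = wrap_in_cdata(serialized, "description")
print(naive_extract(patched))     # <p>Gr&uuml;&szlig;e</p>
```

The patched string is still well-formed XML that parses to the same tree, so a real parser would not notice the difference; only the regex-based consumer does.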

What do we learn?
  • Although WordPress looks quite good on the outside and works quite well, the authors do exactly what I expect from PHP programmers.
  • Supporting even obscure and seemingly superfluous parts of a standard is important, because other programs might depend on exactly that feature, rendering a library totally useless in some cases.
ElementTree and Namespaces

Originally, this post was supposed to be about ElementTree and how I finally got around to handling XML namespaces correctly with it. Now that has to wait till tomorrow.

Tags: lang:en · programming · rant
