Convert HTML named entities to numeric in PHP

XML doesn't recognise most named HTML entities (e.g. &nbsp;), so if you're taking HTML content and presenting it as XML, you will need to either declare those entities yourself or convert them. The easiest way to deal with them is to decode them with PHP's built-in html_entity_decode() function; XML's default encoding (UTF-8) can represent any Unicode character, so the only entities you still need are the ones for the XML special characters >, <, and &.

If you're preparing content and then using something like SimpleXML or XMLWriter, leave out the htmlspecialchars() call or you'll double-escape everything and look silly. In this example I'm converting all quotes (" and ') to entities (&quot; and &apos;) because it's the paranoid option, which is always nice for example code. If you aren't actually printing inside an XML attribute then you can safely use ENT_NOQUOTES instead.
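
A minimal sketch of the decode-then-escape approach described above, assuming UTF-8 input and PHP 5.4 or later (for the ENT_HTML5 and ENT_XML1 flags). The function name html_to_xml_text() is hypothetical, chosen only for illustration:

```php
<?php
// Hypothetical helper: decode every HTML entity to a real UTF-8 character,
// then re-escape only the characters that are special in XML.
function html_to_xml_text(string $html): string
{
    // ENT_HTML5 selects the HTML5 entity table; ENT_QUOTES also decodes
    // the quote entities.
    $decoded = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // With ENT_QUOTES | ENT_XML1 this escapes & < > " ' as
    // &amp; &lt; &gt; &quot; &apos;.
    // Skip this step if SimpleXML or XMLWriter will do the escaping for you.
    return htmlspecialchars($decoded, ENT_QUOTES | ENT_XML1, 'UTF-8');
}

echo html_to_xml_text('caf&eacute; &amp; "tea"'), "\n";
```

Swapping ENT_QUOTES for ENT_NOQUOTES in the htmlspecialchars() call leaves both kinds of quote unescaped, which is fine for element content but not for attribute values.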

There are two drawbacks to this approach. The first is invalid entities: html_entity_decode() won't touch them, which means you'll still get XML errors. The second is encoding. I suppose it's possible that you don't actually want UTF-8. You should, because it's neat, but maybe you have a good reason. If you don't tell html_entity_decode() to use UTF-8, it won't convert entities whose characters don't exist in the character set you specify. If you tell it to output UTF-8 and then use something like iconv() to convert the result, you'll lose any characters that aren't in the output encoding.
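
Both drawbacks are easy to reproduce. The snippet below is only an illustration; the entity name &bogus; and the variable names are made up for the demonstration:

```php
<?php
// Drawback 1: an entity name that HTML does not define is left exactly as is,
// so it would still break an XML parser.
$unknown = html_entity_decode('&bogus;', ENT_QUOTES | ENT_HTML5, 'UTF-8');
var_dump($unknown === '&bogus;'); // bool(true)

// Drawback 2: the euro sign has no code point in ISO-8859-1, so &euro; stays
// undecoded, while &eacute; becomes the single byte 0xE9.
$latin1 = html_entity_decode('&euro; &eacute;', ENT_QUOTES | ENT_HTML401, 'ISO-8859-1');
var_dump($latin1 === "&euro; \xE9"); // bool(true)

// With UTF-8 as the target, both entities are decoded.
$utf8 = html_entity_decode('&euro; &eacute;', ENT_QUOTES | ENT_HTML401, 'UTF-8');
var_dump($utf8 === "\u{20AC} \u{E9}"); // bool(true)
```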

The pair of functions below converts all named entities to numeric entities, and gets rid of all invalid entities. It should leave existing numeric entities alone, so it's safe (but pointless) to run it multiple times on the same input. There are some notes about the code further down.
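
One way such a pair of functions could be written is sketched here. It is an illustration of the technique, not necessarily the original code: the function names are hypothetical, and it assumes PHP 7.2 or later with the mbstring extension (for mb_ord()). Unknown named entities are removed outright, the five entities that XML predefines are kept, and numeric entities never match the pattern.

```php
<?php
// Callback for one matched named entity (hypothetical name).
function named_entity_to_numeric(array $match): string
{
    $entity = $match[0];

    // XML predefines these five, so they can stay as they are.
    if (in_array($entity, ['&amp;', '&lt;', '&gt;', '&quot;', '&apos;'], true)) {
        return $entity;
    }

    $decoded = html_entity_decode($entity, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    if ($decoded === $entity) {
        return ''; // not a known HTML entity: drop it
    }

    // A few HTML5 entities decode to two code points, so convert each
    // character of the decoded string separately.
    $numeric = '';
    foreach (preg_split('//u', $decoded, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        $numeric .= '&#' . mb_ord($char, 'UTF-8') . ';';
    }
    return $numeric;
}

// Replace every named entity in $html with its numeric equivalent.
function html_entities_to_numeric(string $html): string
{
    // Names start with a letter, so &#160; and &#xA0; are never matched.
    return preg_replace_callback(
        '/&[A-Za-z][A-Za-z0-9]*;/',
        'named_entity_to_numeric',
        $html
    ) ?? $html;
}

echo html_entities_to_numeric('a&nbsp;b &eacute; &bogus; &amp; &#169;'), "\n";
// prints: a&#160;b &#233;  &amp; &#169;
```

Because the output contains only numeric entities and the five XML ones, feeding it back through html_entities_to_numeric() returns it unchanged.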

Notes


Disclaimer

It should go without saying, but any example code shown on this site is yours to use without obligation or warranty of any kind. As far as it's possible to do so, I release it into the public domain.