Convert HTML named entities to numeric in PHP

XML doesn't recognise most named HTML entities (e.g. &nbsp;), so if you're taking HTML content and presenting it as XML, you will need to either declare those entities yourself or convert them. The easiest way to deal with them is to decode them with PHP's built-in html_entity_decode() function; XML's default encoding (UTF-8) can represent just about any character, so you don't need any entities except those for the XML special characters >, <, and &.
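As a rough sketch of the decode-then-escape approach described above: decode every entity PHP knows about, then re-escape only the XML special characters. The function name html_to_xml_text() is made up for illustration, and the code assumes the input is already UTF-8.

```php
<?php
// Hypothetical helper name; assumes $html is UTF-8 encoded.
function html_to_xml_text(string $html): string
{
    // Turn named (and numeric) entities into real UTF-8 characters.
    $text = html_entity_decode($html, ENT_QUOTES, 'UTF-8');

    // Re-escape the characters XML treats as special. ENT_QUOTES also
    // escapes both kinds of quote, so the result is safe in attributes.
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

$xmlText = html_to_xml_text('Caf&eacute; &amp; bar: &pound;5 for "two" &lt;b&gt;');
```

The accented letter and the pound sign come out as literal UTF-8 characters, while the ampersand, angle brackets and quotes come out as entities.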

If you're preparing content and then using something like SimpleXML or XMLWriter, leave out the htmlspecialchars() call or you'll double-escape everything and look silly. In this example I'm converting all quotes (" and ') to entities (&quot; and &#039;) because it's the paranoid option, which is always nice for example code. If you aren't actually printing inside an XML attribute then you can safely use ENT_NOQUOTES instead.
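To illustrate the point about libraries that do their own escaping, here is a minimal sketch using XMLWriter. It assumes the xmlwriter extension is available; the element name "title" and the sample string are invented for the example.

```php
<?php
// Decode only; XMLWriter escapes &, < and > itself when writing text.
$text = html_entity_decode('Fish &amp; chips &lt;fresh&gt;', ENT_QUOTES, 'UTF-8');

$writer = new XMLWriter();
$writer->openMemory();
$writer->writeElement('title', $text); // no htmlspecialchars() here
$xml = $writer->outputMemory();
```

Passing the string through htmlspecialchars() first would give you &amp;amp; in the output, which is the double-escaping problem mentioned above.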

There are two drawbacks to this approach. The first is invalid entities: html_entity_decode() won't touch them, which means you'll still get XML errors. The second is encoding. I suppose it's possible that you don't actually want UTF-8. You should, because it's neat, but maybe you have a good reason. If you don't tell html_entity_decode() to use UTF-8, it won't convert entities that don't exist in the character set you specify. If you tell it to output in UTF-8 and then use something like iconv() to convert it, then you'll lose any characters that aren't in the output encoding.
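Both drawbacks are easy to see in a few lines. This is only a sketch: &bogus; is a made-up entity name, and ISO-8859-1 stands in for "some encoding that isn't UTF-8".

```php
<?php
// 1. Invalid entities pass straight through, valid ones are decoded.
$invalid = html_entity_decode('&bogus; &eacute;', ENT_QUOTES, 'UTF-8');

// 2. With a non-UTF-8 target, only entities that exist in that
//    character set are decoded. ISO-8859-1 has an e-acute...
$latin1Ok = html_entity_decode('&eacute;', ENT_QUOTES, 'ISO-8859-1');

//    ...but no euro sign, so that entity is left exactly as it was.
$latin1Left = html_entity_decode('&euro;', ENT_QUOTES, 'ISO-8859-1');
```

In the first case the string still contains &bogus;, so an XML parser will still reject it. In the last case the string still contains &euro;, which XML doesn't know either.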

The pair of functions below converts all named entities to numeric entities, and gets rid of all invalid entities. It should leave existing numeric entities alone, so it's safe (but pointless) to run it multiple times on the same input. There are some notes about the code further down.
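The following is a sketch of how such a pair of functions might look, not necessarily the original code. Both function names are invented here. It assumes PHP 7.2 or later with the mbstring extension (for mb_ord()), and it asks html_entity_decode() to resolve each name using the HTML5 entity list rather than using a hand-written static lookup table, so its performance will differ from a table-based version.

```php
<?php
// Callback: receives one match such as array('&eacute;').
function named_entity_to_numeric(array $match): string
{
    $decoded = html_entity_decode($match[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Unchanged means PHP doesn't know the name: drop the invalid entity.
    if ($decoded === $match[0]) {
        return '';
    }

    // A few HTML5 entities decode to two code points, so emit one
    // numeric reference per character.
    $out = '';
    foreach (preg_split('//u', $decoded, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        $out .= '&#' . mb_ord($char, 'UTF-8') . ';';
    }
    return $out;
}

// Converts named entities to numeric ones and removes unknown names.
// Numeric entities never match the pattern, so they are left alone.
function named_entities_to_numeric(string $text): string
{
    return preg_replace_callback(
        '/&[A-Za-z][A-Za-z0-9]*;/',
        'named_entity_to_numeric',
        $text
    );
}

$converted = named_entities_to_numeric('caf&eacute;&nbsp;&amp;&bogus;&#169;');
```

Because the output contains only numeric entities, running the result through named_entities_to_numeric() again returns it unchanged.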

Notes

Yes you're free to use this code, no you don't have to credit me with anything. Just don't sue me if you use it and it goes wrong. In as much as it's possible to do in the UK, I release this code into the public domain and waive all rights granted to me (and obligations required of me) as its creator.
You might think that using str_replace() or strtr() with static lookup tables would be much faster than preg_replace_callback(), but (in PHP 5.3 and 5.4 at least) the exact opposite is true, by at least an order of magnitude once the number of calls gets large. As far as I can figure out, the PHP functions are copying the table by value on each call, whereas the regex version shown above only sets it up once. It's very noticeable on 1,000 calls (3 seconds with regex, 11 with strtr, 16 with str_replace) and absolutely crippling on 50,000 calls. Of course, if you really care about performance, you probably shouldn't be using PHP in the first place ;)
What's really really messing with my head is that the regex version above appears to be much faster than calling html_entity_decode() and then a preg_replace() to remove invalid entities. That doesn't make any sense at all to me.

Disclaimer

It should go without saying, but any example code shown on this site is yours to use without obligation or warranty of any kind. As far as it's possible to do so, I release it into the public domain.