inanimatt.com • Matt Robinson

Thou shalt always(ish) use UTF-8

One “charming quirk” of PHP is that while it's pretty good about handling character sets, it'll let you blithely carry on without even knowing what they are. That can mean you end up spitting out some pretty crazy weirdities that leave you scratching your head, or just getting annoyed. Increasingly, web editing tools produce UTF-8, while PHP (before version 5.4, at any rate) assumes the catchily named ISO-8859-1 unless told otherwise. So what happens is that after writing PHP for years, you finally find out that you're supposed to run htmlspecialchars() on everything you send to the browser, and when you do, you find it's mangled your € and £ symbols. You have to tell it to behave. Here's how.

Why UTF-8?

UTF-8 isn't actually a character set; it's an encoding of the Unicode character set. Unicode aims to solve the problem of character sets by containing every possible kind of character and symbol from every language. So basically, if you write your pages in UTF-8, you can display any language and character without using HTML entities (like &copy;). It's one huge thing less to think about.

Make sure your web page is UTF-8

There are two parts to this. First, your HTML editing tool (Dreamweaver, Emacs, TextMate, whatever) needs to save the file with the UTF-8 encoding. I mean, obviously you don't want to be telling everyone you're using UTF-8 if you aren't. If you've got a Mac, this is usually the default because that's what the operating system uses. Notepad's a bit of a bitch about UTF-8 because it squirts in a bit of invisible code (aptly named a BOM) at the start of the file that sometimes doesn't turn out to be invisible at all, so go out and get yourself a slightly better text editor. I'll wait.

Part two is making sure that the web browser knows that the document is UTF-8. This is done by adding a tag to your document's HEAD section:
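A minimal sketch of such a tag, wrapped in a tiny PHP page. It assumes an HTML5 doctype, where the short `<meta charset>` form is enough; the `render_page()` helper is a made-up name for illustration, and the extra `header()` call is an optional belt-and-braces step, not something the tag itself needs.

```php
<?php
// Hypothetical helper: returns a minimal HTML5 page that declares UTF-8.
function render_page(string $title, string $body): string
{
    // Escape the title; $body is assumed to be HTML that is already safe.
    $safeTitle = htmlspecialchars($title, ENT_COMPAT, 'UTF-8');

    return <<<HTML
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>{$safeTitle}</title>
</head>
<body>
{$body}
</body>
</html>
HTML;
}

// Optional: say the same thing in the HTTP header so the two agree.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

echo render_page('Prices', '<p>€5 or £4</p>');
```

On older doctypes the longer form, `<meta http-equiv="Content-Type" content="text/html; charset=utf-8">`, does the same job.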

Making PHP output UTF-8 too

This part is pretty straightforward. You just print/echo. Except that you don't, because you use htmlspecialchars() so that no one can take over your script and use it to steal stuff from your visitors, right? So all you have to do is fill in the encoding argument of that function. Like this:
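A minimal sketch of the call, assuming the text to print lives in a variable called `$userInput` (a made-up name for this example):

```php
<?php
// Pretend this came from a visitor; it mixes markup with non-ASCII symbols.
$userInput = '<b>Only €5 & £4!</b>';

// The third argument names the encoding, so multibyte characters are left
// alone while &, <, > and double quotes are escaped.
echo htmlspecialchars($userInput, ENT_COMPAT, 'UTF-8');
// prints: &lt;b&gt;Only €5 &amp; £4!&lt;/b&gt;
```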

If you get tired of writing that every time instead of echo or print, you could make a couple of functions that do it for you. Go wild. Check the PHP Manual on what ENT_COMPAT means and why it's probably what you mean, and why you're not using htmlentities() instead of htmlspecialchars().
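One way those wrapper functions might look. The names `h()` and `e()` are invented for this sketch, so pick whatever suits your own code:

```php
<?php
// Hypothetical helper names; nothing in PHP itself defines h() or e().

// Returns $text escaped for HTML, treating it as UTF-8.
function h(string $text): string
{
    return htmlspecialchars($text, ENT_COMPAT, 'UTF-8');
}

// Escapes and prints in one go, as a stand-in for echo.
function e(string $text): void
{
    echo h($text);
}

e('Fish & chips: £4');
// prints: Fish &amp; chips: £4
```

Keeping the returning version separate from the printing one means you can still build up strings before sending them anywhere.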

Talking UTF-8 to your database

Well, I say database, but really I mean MySQL, which is more or less the same thing except that people who manage real databases for a living will look down their noses at you. If you aren't using MySQL, the method varies and you'll have to look it up for yourself, sorry.

Gotcha

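Everyone in the world writes UTF-8 like `UTF-8`. Except MySQL. They write it `utf8`. No hyphen. Tsk. (Worse, MySQL's `utf8` only stores characters of up to three bytes; from MySQL 5.5.3 onwards, `utf8mb4` is the character set that covers all of Unicode.)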

If you want to be really good about the whole thing, you make the database use UTF-8 internally too (start with CREATE DATABASE mydb CHARACTER SET utf8;), but really it doesn't matter as long as the database converts to UTF-8 when it's talking to PHP.

All you have to do is send an SQL statement when you connect to your database: SET NAMES utf8. How you do this depends on how you connect to your database, but here's an example using PDO:
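A minimal sketch, assuming a MySQL server on localhost and the `mydb` database from above. The `connect_utf8()` helper, the user name and the password are all placeholders made up for the example:

```php
<?php
// Hypothetical helper: opens a MySQL connection and switches it to UTF-8.
function connect_utf8(string $dsn, string $user, string $password): PDO
{
    $pdo = new PDO($dsn, $user, $password);

    // Throw exceptions on errors instead of failing silently.
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    // Ask MySQL to talk UTF-8 on this connection. "utf8" is MySQL's
    // spelling; "utf8mb4" is its name for full four-byte UTF-8.
    $pdo->exec('SET NAMES utf8');

    return $pdo;
}

// Usage (needs a running MySQL server, so it is left commented out):
// $pdo = connect_utf8('mysql:host=localhost;dbname=mydb', 'myuser', 'mypassword');
```

From PHP 5.3.6 onwards, adding `charset=utf8` to the DSN should have the same effect without the extra statement.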

A quick note on forms

Web browsers will (or ought to) send you form data using the same encoding as the page is set to, so if your HTML is in UTF-8, form data will be too.
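Since "ought to" isn't a guarantee, here is a rough sketch of checking a form field before trusting it. It assumes the mbstring extension is available; `utf8_field()` and the `name` field are invented for the example:

```php
<?php
// Hypothetical helper: returns the form field if it is a valid UTF-8
// string, or null if it is missing, not a string, or not valid UTF-8.
function utf8_field(array $form, string $key): ?string
{
    if (!isset($form[$key]) || !is_string($form[$key])) {
        return null;
    }

    return mb_check_encoding($form[$key], 'UTF-8') ? $form[$key] : null;
}

// null here means "reject it or ask again", whichever suits the form.
$name = utf8_field($_POST, 'name');
```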

Turning other stuff into UTF-8

This is actually quite tricky, because if you're reading a file from a hard disk, the software pretty much has to guess what character set it's in. Which is a bit crappy. Fortunately, as a PHP coder, you hardly ever have to do this bit yourself (and if you do, you can get mb_detect_encoding() to do it for you).
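A rough sketch of that guessing, assuming the mbstring extension is loaded and that the text can only be UTF-8 or ISO-8859-1. The `guess_encoding()` wrapper is a made-up name, and the result is still only a guess:

```php
<?php
// Hypothetical helper: guesses the encoding of a string from a short list.
// Strict mode (the third argument) rejects encodings in which the bytes
// are not valid.
function guess_encoding(string $bytes): string
{
    $guess = mb_detect_encoding($bytes, ['UTF-8', 'ISO-8859-1'], true);

    return $guess === false ? 'unknown' : $guess;
}

echo guess_encoding("caf\xC3\xA9"), "\n"; // input: UTF-8 bytes for "café"
echo guess_encoding("caf\xE9"), "\n";     // input: the same word in ISO-8859-1
```

The shorter the candidate list, the better the guess tends to be, because many byte sequences are valid in more than one encoding.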

If you know what it is, and it's not UTF-8, you can use PHP's iconv() or mb_convert_encoding(). One of the really sweet things about XML is that when you write an XML file, the first line has the encoding in it, and if it doesn't, it defaults to UTF-8 so you don't have to guess.
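A short sketch of both conversion functions, assuming the source text is known to be ISO-8859-1 and that the iconv and mbstring extensions are available. Watch the argument order, which differs between the two:

```php
<?php
// "café" as ISO-8859-1 bytes: the é is the single byte 0xE9.
$latin1 = "caf\xE9";

// iconv() takes the source encoding first, then the target.
$viaIconv = iconv('ISO-8859-1', 'UTF-8', $latin1);

// mb_convert_encoding() takes the target first, then the source.
$viaMb = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

// In UTF-8, é becomes the two bytes 0xC3 0xA9.
var_dump($viaIconv === "caf\xC3\xA9"); // bool(true)
var_dump($viaMb === $viaIconv);        // bool(true)
```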