Some Iñtërnâtiônàlizætiøn hints

Jon Ramsey

<jonathon.ramsey@gmail.com>

PHP London

Some provisos

Not definitive at all, I'm still confused and grasping for answers.

Some of this depends on access to your server configuration (or on having a nice host)

The presentation is much longer than it should've been, but it's now very late and I must sleep.

The Problem

My output is mashed! What are all of these little boxes?

What inconsiderate person has chosen to use non-English text in my app!

Quick definition

Internationalisation (I18N)
building an app so that it can handle different linguistic and cultural conventions; separation of language and data from source code

I won't be looking at localisation (L10N) at all, but at a more basic level of character representation; this could help you more than it may sound like it will. Possibly.

Coded character sets

A representation of a set of characters

Letters map to code points; representation of code points is left up to encodings.
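To make the split between code point and encoding concrete, here is a small sketch. It assumes PHP 7.2+ with the mbstring extension loaded; the variable names are only for illustration. One character has one code point but different bytes under different encodings.

```php
<?php
// Assumes PHP 7.2+ (mb_ord, "\u{...}" escapes) and the mbstring extension.
// The escape keeps the example independent of this file's own encoding.
$char = "\u{00E9}"; // LATIN SMALL LETTER E WITH ACUTE

// The code point: an abstract number, the same whatever the encoding.
$codePoint = mb_ord($char, 'UTF-8');

// The bytes: these depend entirely on the encoding chosen.
$utf8Bytes   = bin2hex($char);
$latin1Bytes = bin2hex(mb_convert_encoding($char, 'ISO-8859-1', 'UTF-8'));
$utf16Bytes  = bin2hex(mb_convert_encoding($char, 'UTF-16BE', 'UTF-8'));

printf(
    "U+%04X  utf-8: %s  iso-8859-1: %s  utf-16be: %s\n",
    $codePoint,
    $utf8Bytes,
    $latin1Bytes,
    $utf16Bytes
);
```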

Coded character sets; encodings

... --- ...

ASCII - represents every unaccented English letter, numeral, and some punctuation and control characters with a number between 0 and 127; encoded as a 7-bit binary number.

This encoding leaves a bit spare to be used in all sorts of snazzy ways...

Coded character sets; encodings

iso-8859-1 (latin-1); Western European (accents etc.)

Windows CP1252; very similar to iso-8859-1, 27 differences to catch the unwary; problems pasting from Word into text areas? Mmmmm, dig the smart quotes.

Unicode

A single character set to include every possible character (6' x 12' posters available).

UTF-8; help arrives

An encoding for storing Unicode code points (i.e. every character) in memory as sequences of 8-bit bytes.

Code points 0 - 127 stored in a single byte; the same as ASCII; efficient for English text

Code points from 128 upwards are stored in 2-4 bytes (the original design allowed up to 6; RFC 3629 now caps UTF-8 at 4).
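As a minimal sketch of how a code point becomes bytes, assuming today's definition of UTF-8 (RFC 3629, which stops at 4 bytes per code point): the function name utf8_from_code_point is made up for this example, and in real code mb_chr() or a "\u{...}" escape does the same job.

```php
<?php
// Hypothetical helper: encode one Unicode code point as UTF-8 by hand.
function utf8_from_code_point(int $cp): string
{
    // Reject values outside Unicode and the UTF-16 surrogate range.
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException("Not a Unicode scalar value: $cp");
    }
    if ($cp < 0x80) {       // 1 byte: 0xxxxxxx (identical to ASCII)
        return chr($cp);
    }
    if ($cp < 0x800) {      // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {    // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// A, e-acute, the euro sign, and an emoji: 1, 2, 3 and 4 bytes.
foreach ([0x41, 0xE9, 0x20AC, 0x1F600] as $cp) {
    $bytes = utf8_from_code_point($cp);
    printf("U+%04X -> %d byte(s): %s\n", $cp, strlen($bytes), bin2hex($bytes));
}
```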

So, we'll use UTF-8 for everything, problem solved … doubles all round!

Wait! We need to do some work …

Be aware of the encodings that you're dealing with, specify UTF-8 to browsers and use their handling capabilities where possible.

Text editor

Save your documents in utf-8 encoding.

For GNU Emacs:

;; utf-8 encoding as default
(setq locale-coding-system 'utf-8)
(set-terminal-coding-system 'utf-8)
(set-keyboard-coding-system 'utf-8)
(prefer-coding-system 'utf-8)

in your .emacs

HTML

Use a meta tag to stand in for the HTTP Content-Type header. It should be the first tag in the head: on reaching it, the browser backs up and re-scans the document from the start.

<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>

HTML

Forms should use an accept-charset attribute:

<form accept-charset="utf-8">

This works as a hint to the browser to send us UTF-8 encoded data.

Not bullet-proof, especially for application/x-www-form-urlencoded data.

But hey, we do what we can do …
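Since accept-charset is only a hint, one possible safety net is to check what actually arrived. The sketch below assumes the mbstring extension; the function names and the choice of Windows-1252 as the fallback guess are assumptions for illustration, not a rule.

```php
<?php
// Hypothetical helper: is this byte string well-formed UTF-8?
function is_valid_utf8(string $bytes): bool
{
    // With the u modifier PCRE validates the subject first;
    // preg_match() returns false (not 0) for malformed UTF-8.
    return preg_match('//u', $bytes) === 1;
}

// Hypothetical helper: pass valid UTF-8 through, otherwise treat the
// input as the fallback encoding (a guess!) and convert it.
function form_field_to_utf8(string $raw, string $fallback = 'Windows-1252'): string
{
    if (is_valid_utf8($raw)) {
        return $raw;
    }
    return mb_convert_encoding($raw, 'UTF-8', $fallback);
}

$good = "caf\u{00E9}"; // "café" as UTF-8
$bad  = "caf\xE9";     // "café" as latin-1 / CP1252 bytes

var_dump(is_valid_utf8($good), is_valid_utf8($bad));
echo bin2hex(form_field_to_utf8($bad)), "\n";
```

The fallback can only ever be a guess: bytes that are not valid UTF-8 do not say which legacy encoding they came from.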

XML

Specify the encoding in your XML prolog:

<?xml version="1.0" encoding="utf-8"?>

UTF-8 is the default for documents that don't specify an encoding (and that don't have a BOM ... whole other story)

Apache

Default httpd.conf:

AddDefaultCharset On

Adds an iso-8859-1 charset header to all text/plain and text/html documents, except the standard error pages, which are given a charset apt to their content.

This overrides the charset we specified in our HTML!

We need to specify:

AddDefaultCharset utf-8

Apache

eg (in the virtual host definition)

<VirtualHost *>
    ServerName roxxor
    DocumentRoot /home/jon/blah
    AddDefaultCharset utf-8
    php_value default_charset utf-8
    php_value mbstring.internal_encoding utf-8
</VirtualHost>

PHP values explained later, this is a handy place to set them …

MySQL (4.1)

utf-8 is utf8 in MySQL.

Collation: sort order, may be case sensitive or not

To find out your current setup:

SHOW VARIABLES LIKE 'character_set_database';
SHOW VARIABLES LIKE 'character_set_client';

MySQL

To see available character sets and collations

SHOW CHARACTER SET;
SHOW COLLATION LIKE 'utf8%';

MySQL

We can set character set and collation per server, database, table, or connection:

Server (/etc/my.cnf):

[mysqld]
...
default-character-set=utf8
default-collation=utf8_general_ci

MySQL

Database:

(CREATE | ALTER) DATABASE ... DEFAULT CHARACTER SET utf8

Table:

(CREATE | ALTER) TABLE ... DEFAULT CHARACTER SET utf8

MySQL

Connection:

SET NAMES 'utf8';

According to Bertrand Mansion this can be set on a server-wide basis in your my.cnf:

[mysqld]
init_connect="SET NAMES utf8"

however, init_connect is not run for users with the SUPER privilege, so this will not apply if you connect as the MySQL root user.

PHP mysql connection (I think) defaults to a latin1 connection, so, first query after connection:

mysql_query("SET NAMES 'utf8'");

MySQL

CONVERT() function for converting between charsets, eg:

INSERT INTO utf8table (utf8column) SELECT CONVERT(latin1field USING utf8) FROM latin1table;

MySQL

Consider field sizes - MySQL's utf8 stores up to 3 bytes per character (UTF-8 in general allows up to 4), so any size limits counted in bytes need to allow for this.

PHP

Support is, ahem, a bit lacking at the moment; inbuilt expectation of latin-1.

UTF-8 characters are represented by 1-4 bytes, but some PHP functions don't know about this; they see

character == byte

However, all is not lost. And better Unicode support is promised for PHP5.

PHP

Need to set

default_charset = utf-8

Setting is PHP_INI_ALL, so ini_set()'ll do it. See above for apache setting.

For MySQL connections, as above, first query after connection:

mysql_query("SET NAMES 'utf8'");
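A rough sketch of the same setup on a current PHP, where the mysql_* functions used on these slides no longer exist (they were removed in PHP 7.0). It assumes the mysqli extension; connect_utf8 and its parameters are made-up names for illustration.

```php
<?php
// Send "charset=utf-8" in PHP's default Content-Type header.
ini_set('default_charset', 'utf-8');

// Hypothetical helper: open a connection and switch it to utf8
// before any query runs. Host and credentials are the caller's own.
function connect_utf8(string $host, string $user, string $pass, string $db): mysqli
{
    $link = new mysqli($host, $user, $pass, $db);

    // Same effect as SET NAMES, but it also tells the client library
    // which charset is in use, which matters for escaping.
    if (!$link->set_charset('utf8')) {
        throw new RuntimeException('Could not switch the connection to utf8');
    }
    return $link;
}
```

The function is only defined here, not called, so nothing in the sketch needs a running server.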

PHP

PCRE, use u modifier for unicode

$pattern = "/pattern/u";

When using htmlspecialchars, supply encoding as 3rd param:

htmlspecialchars($string, ENT_QUOTES, 'utf-8');

Not sure about the need for this …
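A quick sketch of what the u modifier changes, assuming PHP 7+ for the "\u{...}" escapes; the variable names are illustrative only. Without u, PCRE treats each byte as a character; with it, each UTF-8 sequence counts as one.

```php
<?php
$eAcute = "\u{00E9}"; // one character, two bytes in UTF-8

// "Exactly one character": fails byte-wise, succeeds with /u.
$withoutU = preg_match('/^.$/', $eAcute);
$withU    = preg_match('/^.$/u', $eAcute);

// Splitting into characters rather than bytes.
$chars = preg_split('//u', "na\u{00EF}ve", -1, PREG_SPLIT_NO_EMPTY);

// htmlspecialchars with the encoding spelled out: only the markup
// characters are escaped, the multibyte character passes through.
$escaped = htmlspecialchars("caf\u{00E9} <b>", ENT_QUOTES, 'UTF-8');

var_dump($withoutU, $withU, count($chars));
echo $escaped, "\n";
```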

PHP

String handling functions assume character == byte, so won't work correctly with multibyte encoded character sets.

Use the mbstring extension if available; it provides multibyte-aware equivalents (e.g. mb_strlen) of the string functions (e.g. strlen).
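To see character == byte go wrong, here is a small sketch assuming the mbstring extension and PHP 7+; the sample word and variable names are just for illustration.

```php
<?php
$word = "na\u{00EF}ve"; // 5 characters; the i-diaeresis takes 2 bytes

$byteCount = strlen($word);             // counts bytes
$charCount = mb_strlen($word, 'UTF-8'); // counts characters

// Cutting by bytes can slice a character in half...
$brokenPrefix = substr($word, 0, 3);
// ...cutting by characters cannot.
$prefix = mb_substr($word, 0, 3, 'UTF-8');

// Case mapping that knows about non-ASCII letters.
$upper = mb_strtoupper($word, 'UTF-8');

printf("%d bytes, %d characters\n", $byteCount, $charCount);
printf("substr: %s  mb_substr: %s\n", bin2hex($brokenPrefix), bin2hex($prefix));
echo $upper, "\n";
```

Passing the encoding explicitly, as here, sidesteps whatever mbstring.internal_encoding happens to be set to.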

PHP

If using mbstring, set internal_encoding:

mbstring.internal_encoding = utf-8

php.net says that the default for the internal encoding is null, but my setup seemed to default to iso-8859-1.

PHP

mbstring.func_overload setting looks good;

"Overloads a set of single byte functions by the mbstring counterparts."

Unfortunately stability may be an issue at present.

PHP

iconv extension for conversion between character sets.

Useful for conversion of existing data, or where you can't control input data charset (eg. feeds)??

See also, mb_convert_encoding().
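A sketch of both conversion routes, using the pasted-from-Word smart quotes as the input. It assumes the iconv and mbstring extensions are loaded and that the local iconv library knows the name CP1252; the variable names are illustrative.

```php
<?php
// Smart quotes as Word's CP1252 bytes: 0x93 and 0x94.
$fromWord = "\x93smart\x94 quotes";

// iconv: (from, to, string)
$viaIconv = iconv('CP1252', 'UTF-8', $fromWord);

// mbstring: (string, to, from) - note the different argument order.
$viaMbstring = mb_convert_encoding($fromWord, 'UTF-8', 'Windows-1252');

// Guessing iso-8859-1 instead still "works", but 0x93 and 0x94 become
// invisible C1 control characters rather than quotation marks.
$misread = mb_convert_encoding($fromWord, 'UTF-8', 'ISO-8859-1');

echo bin2hex($viaIconv), "\n";
echo bin2hex($misread), "\n";
```

Both functions need to be told the source encoding; neither can detect it reliably from the bytes alone.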

PHP

But I don't have these extensions!

Try Dokuwiki utf-8 helpers or Scott Reynen's approach.

PHP

Of course, all of this also applies to any libraries/programs that you're using... Aaarrgghh!

Finnish

Only scratched the surface, but hopefully enough to give you some ideas/nightmares.

Links (and lots of them)
