Encoding trouble

Apparently for historic reasons WordPress’s database connection doesn’t use UTF-8 as it’s character encoding. This has led to all kinds of odd problems for people using UTF-8 as character encoding for their blog.

  1. If you dump your database e.g. using the WordPress Database Backup plugin you will get a nice, gzipped .sql file. Opening this text file with a good text editor will result in auto-detection of it’s UTF-8 format. Even though you may believe your data is stored in UTF-8, importing it using PHPMyAdmin as UTF-8 will result in a real mess: all non-ASCII characters are mangled. You have to import using “latin1” which is very counter-intuitive.
  2. Searching behaves oddly in some cases. For example searching the post „Über uns“ (German for „About us“) finds it only if you search with a capital Ü not a lowercase ü. From MySQL’s point of view it’s behaving perfectly fine. All it sees is two byte sequences that have nothing much in common in the context of the latin1 collation: C3 9C and C3 BC.
  3. The same applies for any SELECTs you may be doing in a plugin you are developing. That’s the way I stumbled upon this.

The reasons for these problems are:

  • All the blog’s tables in MySQL are set to some latin1 encoding.
  • WordPress’s database connection uses latin1 as well.

I have had a nice few days’ work figuring this out as I couldn’t find the lines in WordPress’s source that were to blame. Which was futile anyway. The fault didn’t lie in any of WordPress‘ code but in the absence of the correct code.

In the end I dumped the blog’s tables, changed all the encoding flags to „utf8_general_ci“ (which may not be strictly necessary) and applied this patch to my version 2.0.4 code. Then I imported my SQL dump back into the database using UTF-8 as encoding this time.

The patch uses „SET NAMES ‘utf8′“ to tell MySQL to use UTF-8 for all communication with WordPress. This may be better than the more obvious „SET CHARACTER SET ‘utf8′“. The MySQL manual hints at problems with the latter in MySQL 4.1 that were fixed in 4.1.1. „SET NAMES“ also seems to force character_set_connection to the given encoding instead of setting it to the database’s encoding.

Here are the respective parts from the MYSQL manual

NAMES {‘charset_name’ [COLLATE ‘collation_name’} | DEFAULT}
SET NAMES sets the three session system variables character_set_client, character_set_connection, and character_set_results to the given character set. Setting character_set_connection to charset_name also sets collation_connection to the default collation for charset_name. The optional COLLATE clause may be used to specify a collation explicitly. See Section 10.4, “Connection Character Sets and Collations”.

CHARACTER SET {charset_name | DEFAULT}
This maps all strings from and to the client with the given mapping. Before MySQL 4.1, the only allowable value for charset_name is cp1251_koi8, but you can add new mappings by editing the sql/convert.cc file in the MySQL source distribution. As of MySQL 4.1.1, SET CHARACTER SET sets three session system variables: character_set_client and character_set_results are set to the given character set, and character_set_connection to the value of character_set_database. See Section 10.4, “Connection Character Sets and Collations”.
The default mapping can be restored by using the value DEFAULT. The default depends on the server configuration.
Note that the syntax for SET CHARACTER SET differs from that for setting most other options.

All this mess resolved and there still is a problem left:
This works:
SELECT ID FROM wp_talentboerse_posts WHERE post_status = ‘static’ AND post_title = ‘Über uns’
This doesn’t
SELECT ID FROM wp_talentboerse_posts WHERE post_status = ‘static’ AND post_title = ‘Über uns’
The SQL statements seem identical. The reason for this is that the Unicode character “Ü” can be stored in a pre-composed or a decomposed way: either as “Ü” (U+00DC) itself or alternatively as “U” and “¨” (U+0055 U+0308). The “¨” knows how to merge with preceding glyphs that support it. The result is an “Ü”.

These statements may or may not work when this text reaches you. It’s always possible that decomposed characters get converted to pre-composed ones (or vice versa) by the text processing that is applied at the many stages until you finally read them.

You could suggest doing this sort of conversion by default. That doesn’t solve the underlying problem, though. You would have to do this in every conceivable data path into your database. It might become slow. Thus it’s just a workaround.

On an aside: Firefox seems to have a display problem for the decomposed “Ü”. It’s display of pre-composed and decomposed characters is not identical. At least up to version 1.5.0.6: ÜÜ.

Update: After another few days I have found someone else describing this MySQL bug. This user has the same problem with Ös. ;)

Update 2: Now that I could verify the location of the problem I implemented the workaround described above. Everything appears to be working as it should now. I may release this mysterious plugin after some more in depth testing.

Related links
Unicode’s dirty little secret – Anne’s Weblog

This entry was posted in WordPress. Bookmark the permalink. Post a comment or leave a trackback: Trackback URL.

Post a Comment

Your email is never published nor shared. Required fields are marked *

*
*

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>