unicode normalization (anti)

Do you know the difference between 郎 (láng) and 郎 (láng)? One is the Chinese character for fiancé and the other one is a so-called compatibility form, which is used in Korea. There is a slight difference in the appearance of the characters, but depending on your font they might look exactly the same.

Why would one care? Usually one would not. But suppose you copy (Chinese) text from a web page and turn it into a PDF file using pdflatex with the CJK package, and you get errors about missing glyphs in the font, along with METAFONT errors. You finally track this down to these characters, which GTK “helpfully” both turns into the same 郎 when pasting. (Luckily XEmacs did not; use C-u C-x = to see information about the character at point.)
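If no editor is at hand that shows character details, a few lines of Python can do the same inspection. This is only a sketch: the helper name `describe` is made up for illustration, and I am assuming the compatibility form in question is U+F92C (the post does not give code points). The characters are written as escapes so that copy and paste cannot silently merge them.

```python
import unicodedata

def describe(text):
    """Return (code point, name, decomposition mapping) for each character."""
    return [
        (
            f"U+{ord(ch):04X}",
            unicodedata.name(ch, "<unnamed>"),
            unicodedata.decomposition(ch),  # empty string if there is none
        )
        for ch in text
    ]

# Assumed pair: compatibility form U+F92C and unified ideograph U+90CE.
for row in describe("\uF92C\u90CE"):
    print(row)
```

The third field is the interesting one: a non-empty decomposition means the character is mapped to another one under normalization.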

How to fix? One option is to get a font with all those compatibility glyphs. But since I can’t tell the difference anyway, and might not even recognize the character properly, I just normalize them back to their “typical” characters. Perl does the job quickly and nicely: with Unicode::Normalize loaded and the input decoded as UTF-8, just run s/\p{Han}+/NFKD($&)/ge to replace each run of Han characters with its “compatibility decomposition” (+ rather than * avoids pointless empty matches). See man Unicode::Normalize for details.
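For those without Perl at hand, roughly the same substitution can be sketched in Python. This is not the Perl one-liner itself but a stand-in under stated assumptions: Python's `re` module has no `\p{Han}`, so the character class below is a hand-written approximation covering only the main Han blocks in the Basic Multilingual Plane, and the names `HAN_RUN` and `normalize_han` are made up for illustration.

```python
import re
import unicodedata

# Approximation of Perl's \p{Han}: CJK Extension A, the main unified block,
# and the compatibility ideographs block. Supplementary planes are left out.
HAN_RUN = re.compile(r"[\u3400-\u4DBF\u4E00-\u9FFF\uF900-\uFAFF]+")

def normalize_han(text):
    """Apply NFKD to runs of Han characters only; leave everything else alone."""
    return HAN_RUN.sub(
        lambda m: unicodedata.normalize("NFKD", m.group(0)),
        text,
    )

if __name__ == "__main__":
    sample = "caf\u00E9 \uF92C"  # precomposed é plus the assumed compatibility form
    print([f"U+{ord(ch):04X}" for ch in normalize_han(sample)])
```

Restricting the normalization to Han runs, as the Perl expression does, keeps the rest of the text untouched; normalizing the whole string with NFKD would also decompose accented Latin letters and other compatibility characters.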