Unicode Best Practices
notes date: 2013-08-17
source links:
source date: 2013-06-13
- use v5.8
- Contains everything you need.
- Contains ‘the unicode bug’, where it’s hard to tell whether a string has unicode semantics or not.
- use v5.12
- largely fixes ‘the unicode bug’
- use v5.14
- ‘the unicode bug’ is gone
- ideal pattern: random input strings -> decode to character strings as soon as possible -> do whatever you need to do -> encode back to UTF-8 (or whatever) at the very last possible moment
- use utf8;
- means your perl file is interpreted as UTF-8
- so literals and regexes can contain unicode rather than escape sequences
- (variable names can also be unicode, but please don’t)
- use /x modifier with your regexes, so you can use whitespace and comments
- =encoding UTF-8 in POD
- non-ASCII characters will work in your doc
- use open qw( :encoding(UTF-8) :std );
- Perl IO layer assumes encoding to be UTF-8, automatically deals with all filehandles as UTF-8
- other libraries (e.g., decode_json)
- many do it automatically
- HTTP libraries providing $res->decoded_content() convenience methods
- if the library does nothing by default and provides no convenience method, call decode('UTF-8', $arg) yourself to get a character string out of $arg
- use charnames ':full';
- Lets you use named unicode characters, e.g. \N{ARABIC KASRA}
- particularly necessary for combining characters, control characters, non-printing characters
- 5.16 assumes full charnames for you
- Regex
- \d - digits – includes Tibetan and Lao numerals
- \w - word characters – not just the English alphabet
- \p{PerlWord} – the traditional ascii perl word
- \s - whitespace
- \R - any line ending – \n, \r, \f, \r\n, and others
- . - any codepoint except a newline (add /s to make it match newlines too)
- \X - any grapheme cluster (what a user thinks of as a single character, potentially multiple codepoints)
- \p - for matching character properties
- \p{ASCII} - any 1 ASCII codepoint
- \P{ASCII} - any 1 codepoint that is not an ASCII character
- \p{General_Category=Letter} (same as \p{Letter} or \p{L} or \pL) - any codepoint having the General_Category property with value Letter
- General Categories: L (Letter), M (Mark), N (Number), P (Punctuation), S (Symbol), Z (Separator), C (Other)
- Subcategories: Sm (Math_Symbol), Sc (Currency_Symbol), Sk (Modifier_Symbol), So (Other_Symbol)
- Script property - matches characters belonging to a specific writing system
- Latin (Latn), Hiragana (Hira), Katakana (Kana), Han (Hani), Common (Zyyy), Arabic (Arab), Bengali (Beng), Devanagari (Deva), Egyptian hieroglyphs (Egyp), Ethiopic (Ethi), Greek (Grek), Hangul (Hang), Cyrillic (Cyrl), ….
- Casing
- uc('große') returns 'GROSSE', but if you lc() that, you get 'grosse'. Uh-oh.
- use Unicode::CaseFold; fc(), short for "fold case", case-folds a string in a way that's appropriate for comparison (but not for display: for display, keep using lc() and uc())
- Also useful for well-normalized sorting
- fc() is core in 5.16
- use Unicode::Normalize; provides NFC() and NFD(), which allow comparison of two strings in the case where one uses a combining character sequence and the other uses a precomposed character
- NFD() decomposes (canonical decomposition), NFC() composes; these are normalization forms, not byte encoding/decoding
- So now a recommended workflow might be: input -> decode to character string -> NFD() -> hack hack hack -> NFC() -> encode to UTF-8 -> output
- use Unicode::Collate – implements the standard Unicode Collation Algorithm (the UCA)
- use Unicode::Collate; my $c = Unicode::Collate->new(); @countries = $c->sort( @countries );
- use Unicode::Collate::Locale lets you use a specific locale’s collation rules for sorting, etc.
- predefined levels of support for collation
- e.g., level 2 ignores case
Q&A
- Unicode::UCD module in core (Unicode Character Database) with a lot of utility functions (e.g., convert from any type of number to one that you can use for arithmetic, etc.)