What is this?

Get the current version from here:

 http://training.perl.com/OSCON2011/index.html

I’ll often use 🐪 as an abbreviation for Perl, and sometimes for perl — as in, “I’m running 🐪 v5.14.”
See Appendix 1: Fonts at the end of this talk for instructions on font set-up.
These slides were written in pod using vi, with the help of a score of 🐪 Unicode tools available via the link in Appendix 2: Tools.
This slide show was built using Pod::S5 by Tom Linden, which in turn uses S5 by Eric Meyer. Slideshow controls appear if you hover near the bottom right corner.
I’ve used next to no HTML tricks here; nearly all fancy‐qua‐weird stuﬀ you see here is actually Unicode.

Welcome to Unicode!

Unicode isn’t just “here” — Unicode is now 20 years old, and it’s here to stay.
Every programmer needs to understand it, at least a little.
Unicode is not just for non‐English; look closely at this page.
This talk has its roots in a StackOverﬂow posting.
I have a whole ’nother talk on Thursday at 4:10pm in Portland 256 entirely dedicated to Unicode in Perl regular expressions.
Therefore, this talk won’t say too much about regexes, but if we run out of material this morning before we run out of morning, I’ll go through some of them here.

The Bad News

There exists no:
- envariable to set
- CLI switch to turn on
- pragma to declare
- nor module to load
that will somehow “enable Unicode by default” in your code.
The very idea is a category error, because so long as you think of Unicode as merely “like ASCII/Latin1/CP1252/anything, but with more characters”, you will never grasp it. That’s closer to ISO‐10646, but Unicode is much more than that.
No matter how much boilerplate you introduce, you are never going to get your head into Unicode until you stop thinking of it as nothing more than a bigger character set, which means you are never going to do things “right”.

Surprise!

Besides having more characters, Unicode also includes
- rules for casemapping and casefolding
- grapheme clusters, including combining characters
- several diﬀerent kinds of normalization forms
- highly customizable rules for collation
- rules for word‐ & line‐breaking
- special rules for regular expressions
- 1000s of properties, including names and scripts
- numeric equivalences (e.g., telling you that U+216B, “Ⅻ”, has the value 12)
- print widths
- bidirectionality
- glyph variants
- & much, much more‼
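To make a couple of those bullets concrete, here is a minimal sketch using only core modules. It assumes perl v5.14 or later (for the num function in Unicode::UCD); the variable names are mine, purely for illustration.

```perl
use v5.14;
use strict;
use warnings;
use charnames ();            # for charnames::viacode
use Unicode::UCD qw(num);    # numeric values from the Unicode Character Database

my $twelve = "\x{216B}";                       # the Roman numeral from the list above
my $name   = charnames::viacode(ord $twelve);  # a property: the character's name
my $value  = num($twelve);                     # a numeric equivalence
my $upper  = uc "stra\x{DF}e";                 # casemapping may change the length

binmode(STDOUT, ":encoding(UTF-8)");
say "U+216B is $name, worth $value";
say "uc turned a 6-character word into ", length($upper), " characters";
```

None of that is something a mere “bigger character set” could tell you.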

The Good News 😻

The good news is 🐪 currently has the best Unicode support of any programming language. If you think we’ve got it bad, just try any of the other guys. Come to my Unicode Support Shootout talk at 5pm on Thursday, and see why.
That doesn’t mean Unicode in Perl (or anything) is hassle‐free. A small set of common blunders exists that are far too easy to get stuck on, seemingly forever.
This tutorial is front‐loaded to get across a few simple, basic settings so that that never happens to you. It’s to make the easy stuﬀ easy.
There is hard stuﬀ involved with Unicode, and for the most part you can do it in 🐪. It just takes more work and care than the easy stuﬀ.
Most of the really hard parts are beyond the scope of this tutorial, but we may touch on a few of those toward the end.

Included 🐪 Unicode Tools

I’ve included a directory full of 🐪 scripts which I ﬁnd useful in Appendix 2: Tools:

Probing the UCD: unichars, uninames, uniprops
For use with the charnames pragma: unicore/html_alias.pl
Unix tool rewrites: tcgrep, ucsort, unifmt, unilook, uniquote (a better cat ‑v or od), uniwc
Normalization ﬁlters and tools: nfc, nfd, nfkc, nfkd, plus nfcheck
Casing and font‐fun ﬁlters: uc, lc, tc, plus titulate; unifont, leo, unicaps, uninarrow, uniwide, unisubs, unisupers
Test/demo code: es-sort, hantest, havshpx, hypertest, nunez (should be called núñez or even better, búsqueda-libre, but for ﬁlesys issues)

🐪 Unicode Tools: Setup

If you grab the tools, you can run the various examples of them I mention in this talk.

When I run them, I have this environment set up:

for bash

 $ export PERL_UNICODE=SA       # but see below
 $ export LESS=MQeicsnf
 $ alias pc='perl5.14.0 -Mcharnames=:full,:short,latin,greek -E'
 $ alias ug=uninames
 $ alias um='ucsort | less -r'

for tcsh

 % alias um 'ucsort | less -r'
 % alias ug uninames
 % alias pc 'perl5.14.0 -Mcharnames=:full,:short,latin,greek -E'
 % setenv PERL_UNICODE SA
 % setenv LESS MQeicsnf

🐪 Unicode Tools: 𝔈𝔵𝔢𝔪𝔭𝔩𝔦 𝔊𝔯𝔞𝔱𝔦𝔞

Here are simple examples to play with:

 % pc 'say "\N{long s} \N{ae} \N{Omega} \N{omega} \N{UPWARDS ARROW}"'

 % echo exempli gratia | tc | unifont 

 % tcgrep -n '\P{ASCII}' pue.pod | uniquote -x | less
 % tcgrep -n '\P{ASCII}' pue.pod | uniquote -v | less

 % echo you are not expected to read this | leo

 % echo these are small caps | tc | unicaps
 % echo supers                    | unisupers
 % echo 123a                      | unisubs

🐪 Unicode Tools: 𝔈𝔵𝔢𝔪𝔭𝔩𝔦 𝔊𝔯𝔞𝔱𝔦𝔞 (cont¹…)

Here are more interesting examples to play with:

 % uninames latin ligat
 % uninames SIGN -arabic
 % uninames arrow -combining

 % uniprops 'MICRO SIGN'
 % uniprops -a 2010

 % unichars '\pL' '\p{Greek}'
 % unichars '\pL' '\p{Greek}' | um
 % unichars '\p{Age=6.0}'     | um
 % unichars -gasn NUM

 % unilook glob
 % unilook /glob
 % unilook -v '\pM'
 % unilook -v '\N{acute}'
 % unilook -v whom
 % unilook -vpn run

Basic Recommendations

Set your PERL_UNICODE envariable to AS. This makes all 🐪 scripts decode @ARGV as UTF‐8 strings, and sets the encoding of all three of stdin, stdout, and stderr to UTF‐8. You may have to turn it oﬀ at times, though. I don’t recommend D.
At the top of your source ﬁle (program, module, library, doohickey), prominently assert that you are running perl version 5.12 or better via:
```
 use v5.12;  # minimal for unicode_strings feature
 use v5.14;  # optimal for unicode_strings feature 
```
Declare this source unit as UTF‐8. Once upon a time, this pragma did other things, but it now serves this singular purpose alone and no other:
```
 use utf8; 
```

Basic Recommendations (cont¹…)

Enable (or reënable) strictures.
```
 use strict; 
```
Enable compiler warnings; we’ll get to runtime warnings in a moment.
```
 use warnings; 
```
Fatalize UTF‐8 warnings.
```
 use warnings qw( FATAL utf8 ); 
```
Under v5.14, the utf8 warning class comprises three subwarnings — nonchar, surrogate, and non_unicode — which you may sometimes wish to exert greater (i.e., separate) control over.
```
 no warnings "non_unicode"; 
```

Basic Recommendations (cont²…)

Declare that ﬁlehandles opened within this lexical scope but not elsewhere are in UTF‐8, until and unless you say otherwise. The :std adds in STDIN, STDOUT, and STDERR. This critical step implicitly decodes incoming data and encodes outgoing data as UTF‐8.
```
 use open qw( :encoding(UTF-8) :std ); 
```
Enable named characters via \N{CHARNAME}.
```
 use charnames qw( :full ); 
```
If you have a DATA handle, you must explicitly set its encoding. If you want this to be UTF‐8, then say:
```
 binmode(DATA, ":encoding(UTF-8)"); 
```

Unicode Template

The next few slides show how I start all my 🐪 code these days.
They combine the previous directives and add a few niceties.
Some of the later elements may get omitted in some code, but not the earlier ones.
The #! line is debatable. Consider it a shortcut for whatever you need on your system: I try not to inﬂict perlrun’s eval exec hack on people. 😒
🐜 Unfortunately, a known bug prevents the open pragma from working correctly if you’ve also used the autodie pragma:
```
 https://rt.cpan.org/Public/Bug/Display.html?id=54777 
```

Unicode Template: Pragmas

Don’t take my #! here too seriously; it has Issues.

 #!/usr/bin/env perl

 use v5.14;
 use utf8;

 use strict;
 use autodie;
 use warnings; 

 use warnings    qw< FATAL utf8 >;
 use open        qw< :std :encoding(UTF-8) >;
 use charnames   qw< :full >;
 use feature     qw< unicode_strings >;

Unicode Template: Modules

The ﬁrst of these is almost always needed; the rest, not so much.

 use Unicode::Normalize  qw< NFD NFC >;
 use Encode              qw< encode decode >;

 use Carp                qw< carp croak confess cluck >;
 use File::Basename      qw< basename >;

 $0 = basename($0);  # shorter messages

Unicode Template: DATA, @ARGV

Don’t make $| hot if you have a lot of output on STDOUT.

 binmode(DATA, ":encoding(UTF-8)");
 
 # This works like perl -CA: note that it
 # assumes your terminal is set to use UTF-8
 if (grep /\P{ASCII}/ => @ARGV) { 
    @ARGV = map { decode("UTF-8", $_) } @ARGV;
 }
 
 $| = 1;   # comment out for performance

 END { close STDOUT }

Unicode Template: Set Traps

This avoids compile‐time 🐜 “bugs” in the pragma:

 # XXX: use warnings FATAL => "all";
 
 local $SIG{__DIE__} = sub {
     confess "Uncaught exception: @_" unless $^S;
 };
 
 local $SIG{__WARN__} = sub {
     if ($^S) { cluck   "Trapped warning: @_" } 
     else     { confess "Deadly warning: @_"  }
 };

Unicode Template: Filters

I use this on normal CLI ﬁlters:

 if (@ARGV == 0 && -t STDIN && -t STDERR) {
    print STDERR "$0: reading input from tty, type ^D for EOF...\n";
 }

 while (<>)  {
     chomp;
     $_ = NFD($_);
     ...
 } continue {
     say NFC($_);
 }
 
 __END__
 𝖍𝖎𝖈 𝖏𝖆𝖈𝖊𝖓𝖙 𝖉𝖆𝖙𝖆 𝖚𝖓𝖎𝖈𝖔𝖉𝖎𝖈𝖆

🐪 Runtime Environment

The only 🐪‐related envariable I normally run with is PERL_UNICODE, which I have set to "SA". That’s equivalent to running with the -CSA command‐line option. Possible values are
- 0 = turn oﬀ all ﬂags (that’s a DIGIT ZERO)
- I = STDIN is assumed to be in UTF‐8
- O = STDOUT will be in UTF‐8
- E = STDERR will be in UTF‐8
- S = I + O + E
- i = UTF‐8 is the default PerlIO layer for input streams
- o = UTF‐8 is the default PerlIO layer for output streams
- D = i + o
- A = the @ARGV elements are expected to be strings encoded in UTF‐8
- L = makes "IOEioA" conditional on the locale environment variables (LC_ALL, LC_CTYPE, and LANG, in order of decreasing precedence); if the variables indicate UTF‐8, then the selected "IOEioA" are in eﬀect.
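If you ever wonder which of those flags a running program actually got, perl exposes them as a bitmask in the ${^UNICODE} variable. Here is a small sketch that turns the mask back into letters, assuming the bit values given in perlrun for the -C switch; the unicode_flags helper is my own invention, not part of any module.

```perl
use strict;
use warnings;

# Bit values as documented in perlrun for the -C switch.
my @FLAGS = (
    [ I => 1 ],  [ O => 2 ],  [ E => 4 ], [ i => 8 ],
    [ o => 16 ], [ A => 32 ], [ L => 64 ],
);

# Turn a ${^UNICODE} bitmask back into its flag letters (hypothetical helper).
sub unicode_flags {
    my ($mask) = @_;
    return join "", map { $_->[0] } grep { $mask & $_->[1] } @FLAGS;
}

printf "This perl was started with -C flags: %s\n",
    unicode_flags(${^UNICODE}) || "(none)";
```

With PERL_UNICODE set to SA you should see the letters for S (that is, I, O, and E) plus A.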

🐪 Unicode Laundry List

That was the easy stuﬀ, but you’ve got to get it out of the way before you can go on.
Now we’ll do the real work.

🐜 Ixnay on the Ugbay Already! 🐜

🐪 used to have something called “The Unicode Bug” 🐜 .
Essentially, it caused code points in the range 128–255 to be treated as binary, not text.
If you follow my instructions above, you should no longer be aﬀected by it.
The critical ﬁx that makes it all possible is
```
 use feature "unicode_strings"; 
```
This feature was ﬁrst introduced in v5.12, but didn’t come up to full functionality until v5.14.
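Here is a small sketch of what the bug looks like in practice, and of the feature making it go away. It uses a single code point in the 128–255 range held in a plain native string; the variable names are just for illustration.

```perl
use strict;
use warnings;

my $e_acute = "\xE9";   # one code point in 128-255, stored as a native byte

my ($old_uc, $old_word, $new_uc, $new_word);
{
    no feature "unicode_strings";       # the old semantics
    $old_uc   = uc($e_acute) eq "\xC9"; # does uc() know about it?
    $old_word = $e_acute =~ /^\w$/;     # does \w know about it?
}
{
    use feature "unicode_strings";      # the fix
    $new_uc   = uc($e_acute) eq "\xC9";
    $new_word = $e_acute =~ /^\w$/;
}

printf "without the feature: uc works? %s, \\w matches? %s\n",
    ($old_uc ? "yes" : "no"), ($old_word ? "yes" : "no");
printf "with the feature:    uc works? %s, \\w matches? %s\n",
    ($new_uc ? "yes" : "no"), ($new_word ? "yes" : "no");
```

Same string, same code, different answers depending only on the lexical scope.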

Core Pragmas

Key core pragmas for Unicode are:
- v5.14
- utf8
- feature
- charnames
- open
- re "/flags"
- encoding::warnings
Probably best to stay clear of these, though:
- bytes
- encoding
- locale

Program Literals

Specify Unicode literals any of these ways:

As literal UTF‐8 under the recommended utf8 pragma, allowing you to write "à contre-cœur", "Ångström", or "👪 💗 🐪" directly.
As wicked “magic numbers” like chr(0x1F4A9), "\x{2639}", or "\N{U+A0}".
Using the charnames pragma and the \N{CHARNAME} construct, strings like "\N{LATIN SMALL LETTER A WITH GRAVE} contre-c\N{LATIN SMALL LIGATURE OE}ur", "A\N{COMBINING RING ABOVE}ngstro\N{COMBINING DIAERESIS}m", and "\N{FAMILY} \N{GROWING HEART} \N{DROMEDARY CAMEL}".
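A quick sketch to show that the three spellings really do produce the same string. It assumes the source file itself is saved as UTF‐8, which is what the utf8 pragma promises perl.

```perl
use v5.14;
use utf8;                       # this source file is UTF-8
use warnings;
use charnames qw(:full);

my $literal = "à contre-cœur";
my $numeric = "\x{E0} contre-c\x{153}ur";
my $named   = "\N{LATIN SMALL LETTER A WITH GRAVE} contre-c\N{LATIN SMALL LIGATURE OE}ur";

binmode(STDOUT, ":encoding(UTF-8)");
say "all three spellings are eq"
    if $literal eq $numeric && $numeric eq $named;
say "length in code points: ", length $literal;
```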

The charnames Pragma

Works on any interpolated string, including regexes.
Must be available at compile‐time.
⚠ Diﬀerent scopes can have diﬀerent name bindings.
Import any combination of :full, :short, SCRIPTNAME, or :alias.
Also provides several functions (not for import):
- charnames::string_vianame(name) for runtime lookup of either a character name or a named character sequence, returning its string representation
- charnames::vianame(name) for runtime lookup of a character name (but not a named character sequence) to get its ordinal value (code point)
- charnames::viacode(code) for runtime lookup of a code point to get its Unicode name.
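A minimal sketch of those three runtime functions, going from name to code point, from name to string, and from code point back to name. I have assumed v5.14 here, since that is where string_vianame first appears.

```perl
use v5.14;
use warnings;
use charnames qw(:full);

my $code   = charnames::vianame("GREEK CAPITAL LETTER DELTA");         # a number
my $string = charnames::string_vianame("GREEK CAPITAL LETTER DELTA");  # a string
my $name   = charnames::viacode(0x394);                                # a name

printf "vianame gives U+%04X\n", $code;
printf "string_vianame gives a string of length %d\n", length $string;
say    "viacode gives $name";
```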

Basic charnames Examples

Using full or short character names

 use charnames ":full";
 print "\N{GREEK CAPITAL LETTER DELTA} is delta.\n";
    # Δ is delta.

 use charnames ':short';
 print "\N{greek:Delta} is an upper-case delta.\n";
    # Δ is an upper-case delta.

By script name

  use charnames qw(cyrillic greek);
  print "Sigmata are \N{Sigma}, \N{sigma}, and \N{final sigma}.\n";
    # Sigmata are Σ, σ, and ς.
  print "\N{Be} and \N{be} are Cyrillic B's.\n";
    # Б and б are Cyrillic B's.

Custom charnames Examples

Customization via :alias and a hash:

 use charnames ":full", ":alias" => {
     e_ACUTE => "LATIN SMALL LETTER E WITH ACUTE",
     E_ACUTE => "LATIN CAPITAL LETTER E WITH ACUTE",
 };
 print "I'll have the \N{e_ACUTE}touff\N{e_ACUTE}e.\n";
    # I'll have the étouffée.

Customization via :alias and a string looks for a corresponding ﬁle to require from unicore/, which must be a subdirectory under your @INC path. For example, :html would look for a ﬁle named unicore/html_alias.pl.
```
 use charnames ":alias" => ":html";
 print "\N{frac14} and \N{frac12} are \N{frac34}.\n";
    # ¼ and ½ are ¾. 
```

Core Modules

Key core modules for Unicode are:

Encode
Unicode::Normalize
Unicode::Collate
Unicode::Collate::Locale
Unicode::UCD
DBM_Filter::utf8

The Encode module

The Encode module is most often used implicitly: it’s loaded automatically whenever you pass an :encoding(ENC) argument to binmode or to open.

 binmode(STDIN,       ":encoding(cp1252)")
               || die "can't binmode STDIN: $!";

 open(OUTPUT, "> :raw :encoding(UTF-16LE) :crlf", $filename) 
               || die "can't open $filename: $!";

 print OUTPUT while <STDIN>;

 close(OUTPUT) || die "couldn't close $filename: $!";
 close(STDIN)  || die "couldn't close STDIN: $!";

The Encode module (cont¹…)

The Encode module provides functions for when you need to manually decode incoming data and to manually encode outgoing data.

The functions I most often use from it are encode, decode, and find_encoding.

 use Encode qw< find_encoding >;
 for my $alias (qw< utf8 UTF-8 utf16le >) {
     my $obj = find_encoding($alias);
     my $name = $obj ? $obj->name() : "UNKNOWN";
     printf "%-8s is really %s.\n", $alias, $name;
 } 

 # utf8     is really utf8.
 # UTF-8    is really utf-8-strict.
 # utf16le  is really UTF-16LE.

Try running my byte2uni tool like this for a blast from the past:
```
 % byte2uni -a -e nextstep | less 
```

The Encode module (cont²…)

The MacRoman encoding is a bit weird:

 use charnames qw< :full >;
 use Encode (
     "decode",   # $unicode = decode("scheme", $bytes);
     "encode",   # $bytes   = encode("scheme", $unicode);
 );

 my $permil = "\N{PER MILLE SIGN}";
 printf "A permille %s is U+%vX in Unicode", $permil, $permil;

 my $bytes = encode("macroman", $permil);
 printf " but is 0x%vX in Macroman\n", $bytes;

    # A permille ‰ is U+2030 in Unicode but is 0xE4 in Macroman

The Encode module (cont³…)

The MacRoman encoding is still a bit weird:

 use charnames qw< :full >;
 use Encode (
     "decode",   # $unicode = decode("scheme", $bytes);
     "encode",   # $bytes   = encode("scheme", $unicode);
 );

 my $byte = chr(0x8E);
 my $char = decode("macroman", $byte);

 printf "An %vX in MacRoman is %vX in Unicode\n", $byte, $char;
 printf "Which is really a %s\n", charnames::viacode(ord $char);

    # An 8E in MacRoman is E9 in Unicode
    # Which is really a LATIN SMALL LETTER E WITH ACUTE

The Unicode::Normalize module

Because equivalent grapheme clusters can be written multiple ways, you almost always want to normalize your data using functions from the standard Unicode::Normalize module.

That’s why my standard template said:

 while (<>)  {
     chomp;
     $_ = NFD($_);
     ...
 } continue {
     say NFC($_);
 }

Canonical Conundra

Just as one example, consider all these variants of a Latin small letter o with tilde:

N	Glyph	NFC?	NFD?	🐪🐪🐪🐪🐪	Code Points
1	õ	✓	─	`"\x{F5}"`	`LATIN SMALL LETTER O WITH TILDE`
2	õ	─	✓	`"o\x{303}"`	`LATIN SMALL LETTER O, COMBINING TILDE`
3	ȭ	✓	─	`"\x{22D}"`	`LATIN SMALL LETTER O WITH TILDE AND MACRON`
4	ȭ	─	─	`"\x{F5}\x{304}"`	`LATIN SMALL LETTER O WITH TILDE, COMBINING MACRON`
5	ȭ	─	✓	`"o\x{303}\x{304}"`	`LATIN SMALL LETTER O, COMBINING TILDE, COMBINING MACRON`
6	ō̃	─	✓	`"o\x{304}\x{303}"`	`LATIN SMALL LETTER O, COMBINING MACRON, COMBINING TILDE`
7	ō̃	✓	─	`"\x{14D}\x{303}"`	`LATIN SMALL LETTER O WITH MACRON, COMBINING TILDE`

Oh shucks...

Assuming you enforce NFD on input, then 1 shows up as 2, both of 3 & 4 show up as 5, and 7 shows up as 6.
Assuming you enforce NFC on output, then 2 shows up as 1, both of 4 & 5 show up as 3, and 6 shows up as 7.
That means that by normalizing to either NFD or NFC, you can do a simple eq to get 1 & 2, 3–5, and 6 & 7 to each respectively test equal to one another.
Notice, however, that that’s three diﬀerent sets.
Number 4 is in neither NFC nor NFD. These things happen. It gets worse. Normalize. Always. Please.
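You don’t have to take my word for the table. This sketch feeds the seven variants through NFD and groups the row numbers by the result; the hash names are mine, and only the core Unicode::Normalize module is assumed.

```perl
use v5.14;
use warnings;
use Unicode::Normalize qw(NFD NFC);

# The seven variants from the table, keyed by row number.
my %variant = (
    1 => "\x{F5}",
    2 => "o\x{303}",
    3 => "\x{22D}",
    4 => "\x{F5}\x{304}",
    5 => "o\x{303}\x{304}",
    6 => "o\x{304}\x{303}",
    7 => "\x{14D}\x{303}",
);

# Group row numbers by NFD form; rows in one group are canonically equivalent.
my %group;
for my $n (sort keys %variant) {
    my $key = join " ", map { sprintf "%04X", ord } split //, NFD($variant{$n});
    push @{ $group{$key} }, $n;
}

for my $key (sort keys %group) {
    say "NFD <$key>: rows @{ $group{$key} }";
}
```

Three groups come out, which is the “three diﬀerent sets” point above.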

... and more shucks!

In a regex, all 7 of those will be completely matched by \X, an extended grapheme cluster. Yes, but now what? 😭 I’m afraid this is where it stops being easy. NFD is assumed and required for the following to work:

/^o/ reports that all 7 start with an o.
/^o\N{COMBINING TILDE}/ reports that 1–5 start with an o and a tilde, but that misses 6 & 7.
You’d need /^o\pM*?\N{COMBINING TILDE}/ to get all 7 matching.
This is still just a stab, with various issues still unresolved (like using \p{Grapheme_Extend} instead of \pM — and, were there any, using \p{Grapheme_Base} instead of \PM):
```
 $o_tilde_rx = qr{ o \pM *? \N{COMBINING TILDE} \pM* }x; 
```
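Here is a sketch that tries both kinds of pattern against the seven variants after enforcing NFD. The row numbering follows the table; the variable names are mine.

```perl
use v5.14;
use warnings;
use charnames qw(:full);
use Unicode::Normalize qw(NFD);

# Rows 1..7 of the table, in order.
my @variant = ("\x{F5}", "o\x{303}", "\x{22D}", "\x{F5}\x{304}",
               "o\x{303}\x{304}", "o\x{304}\x{303}", "\x{14D}\x{303}");

my $strict_rx = qr/^o\N{COMBINING TILDE}/;         # tilde must come right after the o
my $loose_rx  = qr/^o\pM*?\N{COMBINING TILDE}/;    # other marks may intervene

my @strict = grep { NFD($variant[$_ - 1]) =~ $strict_rx } 1 .. 7;
my @loose  = grep { NFD($variant[$_ - 1]) =~ $loose_rx  } 1 .. 7;

say "tilde-right-after-o pattern matches rows: @strict";
say "marks-may-intervene pattern matches rows: @loose";
```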

👽 Filesystems Hate You

You are going to have ﬁlesystem issues, especially on 👽 ﬁlesystems.
Some ﬁlesystems silently enforce a conversion to NFC; others silently enforce a conversion to NFD. And others do something else still.
Some even ignore the matter altogether, which leads to even greater problems.
So you really do have to do your own NFC/NFD handling to keep sane. I think. Maybe.

The Unicode::Collate module

Partly for reasons just shown, string comparisons on Unicode are pretty much always the wrong way to go.
That includes eq, ne, le, gt, cmp, sort, &c &c. 😓
Enter the standard Unicode::Collate module. It’s super‐fancy, so I’ll just show the simplest approaches here.
You can get a taste for how it works by playing around with my ucsort utility.

Replacing sort

Whenever you’ve an array of text strings to sort, as in @a = sort @b, just swap that code out for this and all will be well:
```
 use Unicode::Collate;
 @sorted = Unicode::Collate::->new->sort(@unsorted); 
```
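To see the diﬀerence, here is a tiny sketch with a made‐up word list, sorted once by code point and once with the default collator.

```perl
use v5.14;
use utf8;
use warnings;
use Unicode::Collate;

my @unsorted = ("zoo", "Zebra", "apple", "Äpfel");   # sample data, mine

my @by_codepoint = sort @unsorted;
my @by_uca       = Unicode::Collate->new->sort(@unsorted);

binmode(STDOUT, ":encoding(UTF-8)");
say "code point order: @by_codepoint";
say "UCA order:        @by_uca";
```

The built‐in sort puts the capital Z first and the Ä dead last; the collator files everything alphabetically, the way a person would.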

There’s also a standard Unicode::Collate::Locale module for national sorts.

 use Unicode::Collate::Locale;

 state $coll = new Unicode::Collate::Locale::
                   locale => "fr",
       # lots of other parameters possible here
               ;

 my @bons_mots = $coll->sort(our @mots);

Replacing cmp

Most real‐world sorts are more complex than those.

 @srecs = sort {
     $b->{AGE}   <=>  $a->{AGE}
                 ||
     $a->{NAME}  cmp  $b->{NAME}
 } @recs;

Enter the getSortKey method:

 my $collator = Unicode::Collate::->new();
 for my $rec (@recs) {
     $rec->{NAME_key} = $collator->getSortKey($rec->{NAME});
 } 
 @srecs = sort {
     $b->{AGE}       <=>  $a->{AGE}
                     ||
     $a->{NAME_key}  cmp  $b->{NAME_key}
 } @recs;
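Here is the same idea as a complete, runnable sketch: oldest first, then names in UCA order. The records are made up for the occasion.

```perl
use v5.14;
use utf8;
use warnings;
use Unicode::Collate;

my @recs = (                       # sample data, mine
    { NAME => "Zoë",   AGE => 30 },
    { NAME => "adam",  AGE => 30 },
    { NAME => "Émile", AGE => 40 },
    { NAME => "eve",   AGE => 30 },
);

# Compute each sort key once; the keys are binary strings made for cmp.
my $collator = Unicode::Collate->new;
$_->{NAME_key} = $collator->getSortKey($_->{NAME}) for @recs;

my @srecs = sort {
    $b->{AGE}      <=> $a->{AGE}          # oldest first
                   ||
    $a->{NAME_key} cmp $b->{NAME_key}     # then by UCA order
} @recs;

binmode(STDOUT, ":encoding(UTF-8)");
say "$_->{AGE} $_->{NAME}" for @srecs;
```

A plain cmp on NAME would have put Zoë ahead of adam among the thirty‐year‐olds.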

Program Options to ucsort

These are its literal Getopt::Long arguments:

  # collator constructor options
    --backwards-levels=i
    --collation-level|level|l=i
    --katakana-before-hiragana
    --normalization|n=s
    --override-CJK=s
    --override-Hangul=s
    --preprocess|P=s
    --upper-before-lower|u
    --variable=s

  # program specific options
    --case-insensitive|insensitive|i
    --input-encoding|e=s
    --locale|L=s
    --paragraph|p
    --reverse-fields|last
    --reverse-output|r
    --right-to-left|reverse-input

CPAN Modules

CPAN modules for handling Unicode include:

Unicode::LineBreak, which includes Unicode::GCString. These respectively solve “the format problem” and “the printf problem”.
Unicode::Casing for things like lc ΣΤΙΓΜΑΣ ⇒ στιγμας in Greek, or uc i ⇒ İ in the Turkic languages.
Unicode::Unihan, and if you liked the last one, you might want to look into Lingua::JA::Romanize::Japanese, Lingua::KO::Hangul::Util, Lingua::KO::Romanize::Hangul, Lingua::ZH::Romanize::Pinyin, &c.
Unicode::Stringprep

No POSIX Locales, por favor

Please don't (try to) use POSIX locales’ collation. Use Unicode’s.
Normalization won’t always help you enough. For example, you can’t use it to get o, õ, and ø to look the same, because LATIN SMALL LETTER O WITH STROKE has no decomposition to something with an o in it.
When comparing whether letters are the same, Unicode::Collate does count o, õ, and ø as the same letter — normally. Not in Swedish or Hungarian, though.
Similarly with d and ð — you can’t decompose LATIN SMALL LETTER ETH to anything with a d in it, but the UCA treats them as the same letter. Er, except in Icelandic (the "is" locale), where d and ð are now diﬀerent letters in their own right.

Unicode Locales

What about ae & æ \x{E6}, or oe & œ \x{153}? Those aren’t compatibility equivalents of each other the way ij and ĳ \x{133} are, and there’s no useful decomposition, either. But Unicode::Collate will treat them alike.
Usually, that is. However, in the “German phonebook” locale, now ae and æ are diﬀerent — but ae and ä (whether written \x{E4} or a\x{308}) are the same. No kidding.

Here’s how to do a locale compare:

 state $coll = new Unicode::Collate::Locale:: 
                locale => "de__phonebook", 
            ;

 if ($coll->eq($a, $b)) { ... }
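As a hedged sketch of what that buys you, compare two spellings of one German surname under the default collator and under the phonebook tailoring. I am assuming a comparison at level 1 (base letters only), since at the default level the two spellings are expected to compare as close but not identical; the names are sample data.

```perl
use v5.14;
use utf8;
use warnings;
use Unicode::Collate;
use Unicode::Collate::Locale;

# level => 1 compares base letters only, ignoring accents and case.
my $plain     = Unicode::Collate->new(level => 1);
my $phonebook = Unicode::Collate::Locale->new(
                    locale => "de__phonebook",
                    level  => 1,
                );

my ($with_umlaut, $spelled_out) = ("Müller", "Mueller");

my $plain_same     = $plain->eq($with_umlaut, $spelled_out);
my $phonebook_same = $phonebook->eq($with_umlaut, $spelled_out);

say "default UCA:      ", $plain_same     ? "same" : "different";
say "German phonebook: ", $phonebook_same ? "same" : "different";
```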

Now what? It’s tough, I tell you. See my collation examples in my FixString.pm module and my es-sort, núñez, and unilook tools.

Unicode Regex Gotchas

Code that uses \p{Lu} is almost as wrong as code that uses [A-Za-z]. You need to use \p{Upper} instead, and likewise \p{Lower} rather than \p{Ll}, because \p{Lowercase} (≡ \p{Lower}) is diﬀerent from \p{Lowercase_Letter} (≡ \p{Ll}) by 159 code points:
```
 % unichars '\p{Lowercase}' '\P{Lowercase_Letter}' 
 % unichars '\p{Lower}'     '\P{Ll}'  # same but easier to type 
```
Code that uses [a-zA-Z] is even worse. And it can’t use \pL or \p{Letter}; it needs to use \p{Alphabetic}. Not all alphabetics are letters:
```
 % unichars -a '\p{alphabetic}' '\P{Letter}' | wc -l # 1006 code points 
```
If you are looking for 🐪 variables with /[\$\@%]\w+/, then you have a problem (or two).
- You need to look for /[\$\@%]\p{IDS}\p{IDC}*/
- Even that isn’t thinking about the punctuation variables or package variables.
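If you don’t have unichars handy, a couple of single code points make the same point. This is just a sketch; I picked these two characters myself as examples whose properties I expect to be stable across Unicode versions.

```perl
use v5.14;
use warnings;

my $mod_a  = "\x{1D43}";   # MODIFIER LETTER SMALL A
my $roman1 = "\x{2160}";   # ROMAN NUMERAL ONE

# Lowercase, yet not a Lowercase_Letter.
my $lower_not_Ll     = $mod_a  =~ /\p{Lower}/      && $mod_a  !~ /\p{Ll}/;

# Alphabetic, yet not a Letter.
my $alpha_not_letter = $roman1 =~ /\p{Alphabetic}/ && $roman1 !~ /\pL/;

say "U+1D43 is Lower but not Ll"        if $lower_not_Ll;
say "U+2160 is Alphabetic but not a Letter" if $alpha_not_letter;
```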

Unicode Regex Gotchas (cont¹…)

If you are checking for whitespace, then you should choose between \h and \v, depending. And you should never use \s to mean all possible Unicode whitespace.

For historical reasons, \s does not mean [\h\v]. These both tell the same tale:

 % unichars '\S' '[\v\h]' 
  ---- U+000B LINE TABULATION

 % unichars '\S' '\p{space}'   
  ---- U+000B LINE TABULATION

Unicode Regex Gotchas (cont²…)

If you are using \n for a line boundary, or even \r\n, then you are doing it wrong.

The Unicode linebreak sequence metacharacter is \R. It means (?:\r\n|\v).

 % unichars '\R'
  ---- U+000A LINE FEED (LF)
  ---- U+000B LINE TABULATION
  ---- U+000C FORM FEED (FF)
  ---- U+000D CARRIAGE RETURN (CR)
  ---- U+0085 NEXT LINE (NEL)
  ---- U+2028 LINE SEPARATOR
  ---- U+2029 PARAGRAPH SEPARATOR

You could always canonicalize to linefeeds:

 my $slurpy = `cat somefile`;    # pretend I didn’t do this :)
    $slurpy =~ s/\R/\n/g;        # convert Unicode linebreaks
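Here is a self‐contained sketch of the same thing on a string that mixes four diﬀerent line terminators; the sample text is mine.

```perl
use v5.14;
use warnings;

# CRLF, LINE SEPARATOR, LF, and NEL all in one string.
my $text = "one\r\ntwo\x{2028}three\nfour\x{85}five";

my @lines = split /\R/, $text;          # \R treats CRLF as a single break
(my $unix = $text) =~ s/\R/\n/g;        # canonicalize a copy to linefeeds

say scalar(@lines), " lines: @lines";
```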

Unicode Antipatterns 💩

People make millions of broken assumptions about Unicode. Until they understand these things, their 🐪 code will be broken. Look for these Unicode antipatterns:

Code that assumes it can open a text ﬁle without specifying the encoding is broken.
Code that assumes the default encoding is some sort of native platform encoding is broken.
Code that assumes web pages in Japanese or Chinese take up less space in UTF‐16 than in UTF‐8 is wrong.
Code that assumes 🐪 uses UTF‐8 internally is wrong.
Code that assumes encoding errors will always raise an exception is wrong.
Code that assumes 🐪 code points are limited to 0x10_FFFF is wrong.
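On the encoding‐errors point: here is a sketch of Encode’s decode quietly papering over a bad byte by default, and raising an exception only when asked to. The byte string is a made‐up example.

```perl
use v5.14;
use warnings;
use Encode qw(decode);

my $bad_bytes = "caf\xFF";      # 0xFF can never appear in well-formed UTF-8

# Default behaviour: no exception, just a U+FFFD substitution.
my $lenient = decode("UTF-8", $bad_bytes);

# Ask for an exception explicitly. Decode a copy, because passing a
# CHECK argument allows Encode to modify its source buffer.
my $copy      = $bad_bytes;
my $strict_ok = eval { decode("UTF-8", $copy, Encode::FB_CROAK()); 1 };

printf "lenient decode ends in U+%04X\n", ord substr($lenient, -1);
say    "strict decode ", $strict_ok ? "succeeded" : "threw an exception";
```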

Antipatterns (cont¹…)

Code that assumes you can set $/ to something that will work with any valid line separator is wrong. \R only works in patterns.
Code that assumes roundtrip equality on casefolding, like lc(uc($s)) eq $s or uc(lc($s)) eq $s, is completely broken and wrong. Consider that uc("σ") and uc("ς") are both "Σ", but lc("Σ") cannot possibly return both of those.
Code that assumes every lowercase code point has a distinct uppercase one, or vice versa, is broken. For example, "ª" is a lowercase letter with no uppercase. Kinda.
Whereas both "ᵃ" and "ᴬ" are Cased letters, they casemap only to themselves. Both are Lowercase, and Letters, but they are not Lowercase_Letters.
Got that? They are not \p{Lowercase_Letter}, despite being both \p{Letter}s and \p{Lowercase}. They’re \p{Modifier_Letter}s, actually. Honest.
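A sketch of the sigma problem, plus one lowercase code point that has nowhere to go when uppercased. Variable names are mine.

```perl
use v5.14;
use warnings;

my ($sigma, $final_sigma, $cap_sigma) = ("\x{3C3}", "\x{3C2}", "\x{3A3}");

my $both_to_cap  = uc($sigma) eq $cap_sigma && uc($final_sigma) eq $cap_sigma;
my $roundtrip_ok = lc(uc $final_sigma) eq $final_sigma;

my $ordinal      = "\x{AA}";    # FEMININE ORDINAL INDICATOR
my $stays_put    = $ordinal =~ /\p{Lowercase}/ && uc($ordinal) eq $ordinal;

say "both sigmas uppercase to the same capital" if $both_to_cap;
say "final sigma survives a roundtrip: ", $roundtrip_ok ? "yes" : "no";
say "U+00AA is Lowercase but uc leaves it alone" if $stays_put;
```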

Antipatterns (cont²…)

Code that assumes changing the case doesn’t change the length of the string is broken.
```
 % unichars -gas 'grep { length > 1 } lc, ucfirst, uc' 
```
Code that assumes there are only two cases is broken. There’s also titlecase.
```
 % unichars -gas 'uc ne ucfirst' 
```
Code that assumes only letters have Case is broken. Beyond just letters, it turns out that numbers, symbols, and even marks have case.
```
 % unichars -gas '\PL' '\p{Cased}' 
```
Code that assumes casemapping a Cased code point always gives a diﬀerent code point is broken. This shows there are 1299 un–case‐changing cased code points:
```
 % unichars -gas '\p{Cased}' '[^\p{CWL}\p{CWT}\p{CWU}]' 
```
Code that assumes case is never locale‐dependent is broken, as is code that assumes Unicode gives a ﬁckle ﬂying ﬁg about legacy POSIX locales.
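Two single code points are enough to sketch the first two of those: one that grows when uppercased, and one that has three cases. I chose the examples; the behavior comes from the casing tables shipped with perl.

```perl
use v5.14;
use warnings;

my $ffi = "\x{FB03}";          # LATIN SMALL LIGATURE FFI
my $dz  = "\x{1C6}";           # LATIN SMALL LETTER DZ WITH CARON

my $upper_ffi = uc $ffi;       # one code point becomes three
my $title_dz  = ucfirst $dz;   # titlecase form
my $upper_dz  = uc $dz;        # uppercase form, a different code point

printf "uc of U+FB03 has length %d\n", length $upper_ffi;
printf "U+01C6: titlecase U+%04X, uppercase U+%04X\n",
    ord $title_dz, ord $upper_dz;
```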

Antipatterns (cont³…)

Code that uses something like y/\000-\177/\200-\377/ is broken and wrong. Try tr[\0-\x{10_FFFF}][\x{20_0000}-\x{30_FFFF}] if you dare.
Code that assumes it can remove Marks to get “ASCII” letters is evil and rude.

Code that assumes diacritics \p{Diacritic} and marks \p{Mark} are the same thing is broken.

 % unichars -gas '\p{mark}' '\P{DIACRITIC}'   # 1068 code points
 % unichars -gas '\P{MARK}' '\p{diacritic}'   #  209 code points

Code that assumes the general category \p{GC=Dash_Punctuation} covers as much as the binary property \p{Dash=Yes} is broken.
Code that naïvely assumes dashes, hyphens, and minuses are all the same thing as each other, or that there is only one of each, is broken and wrong.
```
 % unichars -gas '\p{Dash}' 
```

Antipatterns (cont⁴…)

Code that assumes every code point takes up no more than one print column is broken.
Code that assumes all \p{Mark} characters take up zero print columns is broken.
```
 % unichars -gas '\pM' '\P{BC=NSM}' 
```
Code that assumes characters which look alike are alike is broken.
Code that assumes characters which do not look alike are not alike is broken.
Code that assumes there is a limit to the number of code points in a row that just one \X can match is wrong.
Code that assumes \X can never start with a \p{Mark} character is wrong.
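Here is a sketch that counts \X matches in a few awkward strings: a base with fifty marks stacked on it, a mark with no base at all, and a CR LF pair, which \X also takes as a single cluster. The clusters helper is my own, not a library function.

```perl
use v5.14;
use warnings;

# Count extended grapheme clusters in a string (hypothetical helper).
sub clusters {
    my ($str) = @_;
    my $n = () = $str =~ /\X/g;
    return $n;
}

my $stacked   = "o" . ("\x{301}" x 50);   # one base plus fifty combining acutes
my $lone_mark = "\x{301}";                # a mark with nothing before it
my $crlf      = "\r\n";                   # two code points, neither a Mark

printf "%2d code points, %d cluster(s)\n", length($_), clusters($_)
    for $stacked, $lone_mark, $crlf;
```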

Antipatterns (cont⁵…)

Code that assumes \X can never hold two non‐\p{Mark} characters is wrong.
Code that assumes it cannot use "\x{FFFF}" is wrong.
Code that assumes a non‐BMP code point requiring two UTF‐16 (surrogate) “code units” will encode to two separate UTF‐8 characters, one per code unit, is wrong. It doesn’t: it encodes to a single code point. Or should.
Code that transcodes from UTF‐16 or UTF‐32 with leading BOMs into UTF‐8 is broken if it puts a BOM at the start of the resulting UTF‐8, because it just changed the number of code points in the data! Wʀᴏɴɢ‼
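A sketch of the surrogate point using the camel itself, which lives outside the BMP. Only the core Encode module is assumed.

```perl
use v5.14;
use warnings;
use Encode qw(encode decode);

my $camel   = "\x{1F42A}";                  # DROMEDARY CAMEL
my $utf16   = encode("UTF-16LE", $camel);   # two 16-bit code units
my $utf8    = encode("UTF-8",    $camel);   # one four-byte sequence
my $decoded = decode("UTF-8", $utf8);       # and back to one code point

printf "UTF-16LE: %d bytes (%d code units)\n", length $utf16, length($utf16) / 2;
printf "UTF-8:    %d bytes, decoding to %d code point(s)\n",
    length $utf8, length $decoded;
```

Two code units in UTF‐16, but still exactly one character in UTF‐8, not two.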

Antipatterns (cont⁶…)

Code that assumes CESU‐8 is a valid UTF encoding is wrong. Likewise, code that thinks encoding U+0000 as bytes "\xC0\x80" is UTF‐8 is broken and wrong.

Code that assumes characters like > always point to the right and < always point to the left is wrong, because in fact they do not.

 % perl -Mcharnames=:full -E 'say "\N{RLE}", "12 < 345 < 6789"'
 6789 > 345 > 12
 % perl -Mcharnames=:full -E 'say "\N{RLO}", "12 < 345 < 6789"'
 9876 > 543 > 21

Code that assumes if you ﬁrst output character X and then character Y, that those will show up as XY is wrong. Sometimes they don’t.
Code that assumes ASCII is good enough for writing English properly is stupid, shortsighted, illiterate, broken, evil, and wrong.
I have stronger words for it, too.

Antipatterns (cont⁷…)

Code that assumes all \p{Math} code points are visible characters is wrong.
Code that assumes \w contains only letters, digits, and underscores is wrong — unless you use the /a modiﬁer or
```
    use re "/a"; 
```
Code that assumes ^ and ~ are punctuation marks is wrong.
Code that assumes ë has an umlaut character in it is wrong, thrice.
Code that believes symbols like ㏂, ℉, ㎨, ₨, & ™ contain any letters in them is wrong — except in NFKD:
```
  % unichars -gas '\pS' 'NFKD =~ /\p{Latin}/' | ucsort | less -r 
```
Code that believes \p{InLatin} is the same as \p{Latin} is heinously broken.
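To sketch the \w point from this slide: the same word matched with and without the /a modiﬁer. It assumes v5.14, where /a ﬁrst became available; the word is just a sample.

```perl
use v5.14;
use warnings;

my $word = "na\x{EF}ve";                # "naive" with a diaeresis on the i

my $unicode_w = $word =~ /^\w+$/;       # Unicode rules: the accented i is a word character
my $ascii_w   = $word =~ /^\w+$/a;      # /a restricts \w to [A-Za-z0-9_]

say "plain \\w+ matches the whole word: ", $unicode_w ? "yes" : "no";
say "\\w+ under /a matches it:          ", $ascii_w   ? "yes" : "no";
```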

Antipatterns (cont⁸…)

Code that believes \p{InLatin} is almost ever useful is almost certainly wrong.
Code that believes that, given $FIRST_LETTER as the ﬁrst letter in some alphabet and $LAST_LETTER as the last letter in that same alphabet, writing [${FIRST_LETTER}-${LAST_LETTER}] has any meaning whatsoever is almost always completely broken and wrong and meaningless.
Code that believes someone’s name can only contain certain characters is oﬀensive, broken, and wrong.

Antipatterns (cont⁹…)

Code that believes there’s some way to pretend textﬁle encodings don’t exist is broken and dangerous.
Code that converts unknown characters to ? is broken, stupid, braindead, and runs contrary to the standard recommendation, which says not to do that! So don’t.
Code that believes it can reliably guess the encoding of an unmarked textﬁle is guilty of a fatal mélange of hubris and naïveté that only a lightning bolt from Zeus will ﬁx.

Antipatterns (cont¹⁰...)

Code that believes you can use 🐪 printf widths to pad and justify Unicode data is broken and wrong. Use Unicode::GCString to count columns.
Code that believes once you successfully create a ﬁle by a given name, that when you run ls or readdir on its enclosing directory, you’ll actually ﬁnd that ﬁle with the name you created it under is buggy, broken, and wrong. Stop being surprised by this!
Code that believes UTF‐16 is a ﬁxed‐width encoding is stupid, broken, and wrong.
Code that treats code points from one plane one whit diﬀerently than those from any other plane is ipso facto broken and wrong.

Antipatterns (cont¹¹...)

Code that believes that stuﬀ like /s/i can match only "S" or "s" is broken and wrong. You’d be surprised!
Code that uses \PM\pM to ﬁnd grapheme clusters instead of using \X is broken and wrong.
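In case you’d rather not be surprised, here is a sketch of two characters that case‐insensitive ASCII‐looking patterns will match. I picked the examples; the matching follows from Unicode casefolding.

```perl
use v5.14;
use warnings;

my $long_s = "\x{17F}";     # LATIN SMALL LETTER LONG S
my $kelvin = "\x{212A}";    # KELVIN SIGN

my $s_matches = $long_s =~ /^s$/i;
my $k_matches = $kelvin =~ /^k$/i;

say "/s/i matches U+017F" if $s_matches;
say "/k/i matches U+212A" if $k_matches;
```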

Appendix 1: Font suggestions

I recommend two free fonts from George Douros at users.teilar.gr/~g1951d/ known to work with this presentation: his Alﬁos font for regular text, and his Symbola font for fancy emoji. If any of these don’t look right to you, you probably need to supplement your system fonts:
- Ligatures: ﬁ ﬃ ﬀ ﬄ ﬂ ß ẞ ﬅ ﬆ
- Math letters: 𝒜 𝒟 𝔅 𝔎 𝔼 𝔽
- Gothic & Deseret: 𐌸𐌼𐌽𐍂, 𐐔𐐯𐑅𐐨𐑉𐐯𐐻
- Symbols: ✔ ✅ 🐪 📖 🛂 🐍
- Emotica: 😇 😈 😉 😨 😭 😱
- Upside‐down: ¡pɐəɥ ɹnoʎ uo ƃuᴉpuɐʇs ʎq sᴉɥʇ pɐəᴚ
- Combining characters: ◌̂,◌̃,◌⃞,◌̲,◌︀,◌̵,◌̷
The last line with combining characters is especially hard to get to look right. You may ﬁnd that the shareware font Everson Mono works when all else fails.

Appendix 2: Tools

I wrote a huge bucketful of tools to make your life with Unicode not just easier, but more fun.
These are available in the Unicode::Tussle bundle from CPAN, where they come with documentation, or you can get them individually from training.perl.com/scripts if you’d prefer.
Thanks very much to brian d foy for putting that CPAN bundle together for me.

Contact Information

I’m Tom Christiansen. You can reach me at tchrist@perl.com.
All three talks, as well as these instructions, are available from training.perl.com/OSCON2011.

YANETUT

文字化け 𝀬 𝀳 𝀵 𝀷 𝀺 𝁁 𝁩 𝁭 𝁲 𝄁 𝄃 𝄈 𝄍 𝄏 𝄒 𝄔 𝄗 𝄙 𝄬 𝄯 𝄱 𝄵 𝄹 𝄻 𝅀 𝅘𝅥𝅲 𝅫 𝆄 𝆒 𝆕 𝆗 𝆚 𝆶 𝆹𝅥 𝆺𝅥𝅯 🀀 🀄 🀈 🀋 🀍 🀐 🀒 🀕 🀚 🀞 🀠 🀣 🀦 🀨 🀫 🂩 🂬 🂮 🂱 🂵 🂺 🂼 🃁 🃃 🃆 🃈 🃍 🃓 🃕 🃘 🃚 🃝 🃟 🌀 🌁 🌂 🌃 🌄 🌅 🌆 🌇 🌈 🌉 🌊 🌋 🌌 🌍 🌎 🌏 🌐 🌑 🌒 🌓 🌔 🌕 🌖 🌗 🌘 🌙 🌚 🌛 🌜 🌝 🌞 🌟 🌠 🌰 🌱 🌲 🌳 🌴 🌵 🌷 🌸 🌹 🌺 🌻 🌼 🌽 🌾 🌿 🍀 🍁 🍂 🍃 🍄 🍅 🍆 🍇 🍈 🍉 🍊 🍋 🍌 🍍 🍎 🍏 🍐 🍑 🍒 🍓 🍔 🍕 🍖 🍗 🍘 🍙 🍚 🍛 🍜 🍝 🍞 🍟 🍠 🍡 🍢 🍣 🍤 🍥 🍦 🍧 🍨 🍩 🍪 🍫 🍬 🍭 🍮 🍯 🍰 🍱 🍲 🍳 🍴 🍵 🍶 🍷 🍸 🍹 🍺 🍻 🍼 🎀 🎁 🎂 🎃 🎄 🎅 🎆 🎇 🎈 🎉 🎊 🎋 🎌 🎍 🎎 🎏 🎐 🎑 🎒 🎓 🎠 🎡 🎢 🎣 🎤 🎥 🎦 🎧 🎨 🎩 🎪 🎫 🎬 🎭 🎮 🎯 🎰 🎱 🎲 🎳 🎴 🎵 🎶 🎷 🎸 🎹 🎺 🎻 🎼 🎽 🎾 🎿 🏀 🏁 🏂 🏃 🏄 🏆 🏇 🏈 🏉 🏊 🏠 🏡 🏢 🏣 🏤 🏥 🏦 🏧 🏨 🏩 🏪 🏫 🏬 🏭 🏮 🏯 🏰 🐀 🐁 🐂 🐃 🐄 🐅 🐆 🐇 🐈 🐉 🐊 🐋 🐌 🐍 🐎 🐏 🐐 🐑 🐒 🐓 🐔 🐕 🐖 🐗 🐘 🐙 🐚 🐛 🐜 🐝 🐞 🐟 🐠 🐡 🐢 🐣 🐤 🐥 🐦 🐧 🐨 🐩 🐪 🐫 🐬 🐭 🐮 🐯 🐰 🐱 🐲 🐳 🐴 🐵 🐶 🐷 🐸 🐹 🐺 🐻 🐼 🐽 🐾 👀 👂 👃 👄 👅 👆 👇 👈 👉 👊 👋 👌 👍 👎 👏 👐 👑 👒 👓 👔 👕 👖 👗 👘 👙 👚 👛 👜 👝 👞 👟 👠 👡 👢 👣 👤 👥 👦 👧 👨 👩 👪 👫 👬 👭 👮 👯 👰 👱 👲 👳 👴 👵 👶 👷 👸 👹 👺 👻 👼 👽 👾 👿 💀 💁 💂 💃 💄 💅 💆 💇 💈 💉 💊 💋 💌 💍 💎 💏 💐 💑 💒 💓 💔 💕 💖 💗 💘 💙 💚 💛 💜 💝 💞 💟 💠 💡 💢 💣 💤 💥 💦 💧 💨 💩 💪 💫 💬 💭 💮 💯 💰 💱 💲 💳 💴 💵 💶 💷 💸 💹 💺 💻 💼 💽 💾 💿 📀 📁 📂 📃 📄 📅 📆 📇 📈 📉 📊 📋 📌 📍 📎 📏 📐 📑 📒 📓 📔 📕 📖 📗 📘 📙 📚 📛 📜 📝 📞 📟 📠 📡 📢 📣 📤 📥 📦 📧 📨 📩 📪 📫 📬 📭 📮 📯 📰 📱 📲 📳 📴 📵 📶 📷 📹 📺 📻 📼 🔀 🔁 🔂 🔃 🔄 🔅 🔆 🔇 🔈 🔉 🔊 🔋 🔌 🔍 🔎 🔏 🔐 🔑 🔒 🔓 🔔 🔕 🔖 🔗 🔘 🔙 🔚 🔛 🔜 🔝 🔞 🔟 🔠 🔡 🔢 🔣 🔤 🔥 🔦 🔧 🔨 🔩 🔪 🔫 🔬 🔭 🔮 🔯 🔰 🔱 🔲 🔳 🔴 🔵 🔶 🔷 🔸 🔹 🔺 🔻 🔼 🔽 🕐 🕑 🕒 🕓 🕔 🕕 🕖 🕗 🕘 🕙 🕚 🕛 🕜 🕝 🕞 🕟 🕠 🕡 🕢 🕣 🕤 🕥 🕦 🕧 🗻 🗼 🗽 🗾 🗿 😁 😂 😃 😄 😅 😆 😇 😈 😉 😊 😋 😌 😍 😎 😏 😐 😒 😓 😔 😖 😘 😚 😜 😝 😞 😠 😡 😢 😣 😤 😥 😨 😩 😪 😫 😭 😰 😱 😲 😳 😵 😶 😷 😸 😹 😺 😻 😼 😽 😾 😿 🙀 🙅 🙆 🙊 🙍 🙏 🚁 🚃 🚇 🚈 🚌 🚏 🚑 🚔 🚖 🚙 🚞 🚢 🚤 🚧 🚪 🚬 🚯 🚳 🚸 🜂 🜃 🜇 🜈 🜐 🜓 🜕 🜘 🜝 🜡 🜤 🜦 🜩 🜫 🜮 🜲 🜳 🜷 🜺 🜼 🜿 🝁 🝈 🝌 🝍 🝐 🝒 🝗 🝟 🝢 🝩 🝮 🝰 🝳 〠 ꝣ 𐌴 𐌶 𐌹 𐌻 𐌾 𐍃 𐍄 𐍈 𐍊 𒂊 𒂭 𒂱 𒃔 𒇉 𒋧 𒋼 𒌣

OSCON • Tuesday, 26 July 2011

Perl Unicode Essentials

Tom Christiansen <tchrist@perl.com>