Unicode & UTF

Unicode is a system to map every graphemes (character/glyph/alphabet/symbol/letter) from hopefully all human writing systems into numeric values, a.k.a. code points, hence the name, universal code.

For example a (formally known as Latin Small Letter A) has a code point with value 61, and we usually use this U+<hex digits> annotation for code points, which would be U+0061.

Now we have turned human writing systems into a bunch of numbers, but that’s not enough for computers to handle the data. We hear terms like UTF-{8,16,32}. These are encoding systems, which turn code points (not characters) into bytes in the memory or disk.

UTF-8 uses 1-4 bytes for one code point, meaning at least 8 bits for one code point. UTF-16 uses 2 or 4 bytes, and UTF-32 always uses 4 bytes, or 32 bits, thus the names.

Therefore, seeking within a string in UTF-8 can be difficult (e.g. to find the next 100 characters, or to truncate the to a limit of 140 characters), and UTF-32 on the other hand wastes a lot of space.

Further reading

Plances and blocks

In order to organize the code points, we divide the entire codespace into different planes and blocks.

Planes have larger space, and every plane has the same size (2^16). There’re 17 planes in total, most of them are still empty for now.

Blocks are the real meaningful groups/ranges of code points, e.g. Kangxi Radicals (U+2F00–U+2FDF) and Mathematical Alphanumeric Symbols (U+1D400–U+1D7FF).

Character combining with modifiérs

There could be more than one ways to describe one character, and one character could be mapped into multiple code points.

Take character á. It could be represented as one precomposed character U+00E1 latin small letter a with acute.

It could also be described as a sequence of one base letter and one combining character:

  • a Latin Small Letter A (U+0061)
  • ◌́ Combining Acute Accent (U+0301)*

  • the acute accent can’t be rendered properly by itself so this is actually combined with a dotted circle. </small>

Control characters

Non-printing control characters are not some new concept invented by unicode. Almost every developer have to deal with \n at some point, which is a typical control character.

Control characters are like commands for the layout engine or printer. This may sound like a modern technique compared to primitive ASCII, but it’s basically the same solution to print á with a typewriter (type: a; cmd: move paper; type: ◌́).

Control characters can move the virtual cursor (e.g. tabulation, line breaking), set a functional mark (e.g. LTR), and join/combine characters (e.g. ZWJ).

For example, ZWJ (zero width joiner) is a widely used joiner. The rainbow flag emoji (🏳️‍🌈) is actually a sequence of a white flag (🏳️), a ZWJ, and a rainbow (🌈).

Emotion, Emoticon, Emoji

People started to use combinations of characters like punctuation marks to express emotions long before the invention of computers.

Emoticons, as the name suggests, are human emotions zipped into icons ;). Emoji is a romanized japanese word coincidentally means the similar concept.

Different organizations have defined characters in different character sets (CP437) and fonts (Wingdings). Eventually, everybody accepted the dominance of unicode defined emojis.

Emojis are no different from other unicode characters, but it do expose many issues with unicode characters handling (e.g. cursor position, validation, truncation) into the real world and enforces general text/code editing solutions to adapt.

An emoji can be a single character (👩). Some could be combined with a modifier (👋 + Medium Skin Tone -> 👋🏽); Some could be joined into a sequence and being displayed as a single character (🐈 ZWJ -> 🐈‍⬛); Some could be grouped into a multi-person grouping, so a single character on screen for a family could be a complex sequence of woman, medium skin tones, ZWJ, woman, dark skin tones, ZWJ, boy, ZWJ, girl, light skin tones.

Supports

Not all emojis are widely supported, for example some flags, private logos, and newly proposed characters. A subset of emojis, RGI (recommended for general interchange), contains widely supported emojis.

Further reading

Normalization & canonically equivalent

With these combinators, diacritics, variations, and control characters altogether, determining the equivalence of characters would be hard, since simply comparing the value of code points is not enough.

Unicode deals with this problem by normalization and defining what’s canonically equivalence and compatibility equivalence.

Canonically equivalent answers the question: are they the same character? For example, composed character ü (U+00FC) and a sequence of u (U+0075) + ` ̈ (U+0308)` are the same character only being described with different values.

Compatibility equivalance is a superset which answers are they represent the same thing? For example, ¼ and 1/4, and アンペア are not the same characters but they have the same meaning.

Further reading

Unicode meet regex

Regular expression supports matching unicode using hex style code points so it’s possible for patterns like \u{3040}, [\u{3040}-\u{309F}] and [ぁ-ゖ]. You could also use properties to match by category (\p{Punctuation}), block or script (\p{Script=Greek}).

Differencies between languages/regex libraries plus the state of (?u) flag make writing consistent patterns difficult. It’s not trivial to determine what do you really want your self defined var regex_number to match. Should it be [0-9], [0-9a-e] or [0-9a-e0-9፩-፱...]?

General recommended rules:

  • Assume user input includes any unicode characters including non-printable control ones
  • Turn on the unicode flag by default, it’s better than assuming it’s off and matches unexpected values
  • Use exact ranges ([0-9]) or POSIX classes ([[:digit:]]) instead of regular classes (\d)
  • Check our the pre-defined constants in your language’s regex module
  • Prefer to use properties (\p{digit}, \p{gc=Basic_Emoji}) if supported
  • Be careful with the properties, never assume the value based on the name. {Number} and {Nd} (Deciman Digit Number) are different
  • Be explicit your constants’ names. If you only want 0-9 then don’t name it DIGITS but ASCII_DIGITS

Further reading

Security issue #1: visual spoofing with confusables

Visual spoofing is an often malicious technique to swap characters with visually similar ones.

So you could have a string looks like leցaⅼ-domain.com in url but it’s actually not because that is actually U+217C SMALL ROMAN NUMERAL FIFTY and the ց is U+0581 ARMENIAN SMALL LETTER CO.

This trick to swap l with 1 or | already existed before unicode era, but unicode offers much more possibilities. For example, there are 74 confusables for l or 1:

𝚰 𝘭 І 𝖨 𝐥 ﺍ ﺎ 𝔩 ℐ ℑ 𐊊 Ⲓ 𐌉 ℓ 𝜤 Ɩ 𝞘 Ι 𝚕 𝟏 ∣ ا I 𝗅 𝕀 1 𝙄 𝓁 𐌠 𝐼 𞸀 𞺀 ׀ 𝑰 ǀ Ӏ ᛁ 𝟭 𝕴 I ߊ l 𝛪 ⵏ 𝝞 𝕝 𝟣 ו 𞣇 𝙡 𝓘 𝗜 𝟙 𝑙 ן Ⅰ 𝘐 ١ 𝒍 𝖑 │ 🯱 𝐈 l ۱ ꓲ 𖼨 𝙸 𝟷 𝓵 ⅼ ⏽ 𝗹

Unicode officially provides this utility to look up for potential confusables.

Security issue #2: trojan source - WYSINWYG

Unicode makes it hard to maintain consistency between different environments (systems, languages, fonts, etc.) for variety of support coverage, there’re even security vulnerbilities leveraging the unicode system.

There’s one control character RTL isolate to force following text to be interpreted in the RTL order. If your code editor ignores this character (and many of them do), then the code for you to see (and even the diffs for your fellow to review) will be different with the code for the compiler to interprete.

Meticulously designed source code could be used to execute malicious functions. This vulnerbility is called trojan source.

Toolbox & read more

There’re many sites/tools to show data for unicode characters. But the only one I’d like to recommend is the offical Unicode Utilities, including helpers for characters, confusables, emoji, regex, etc.

I also couldn’t recommend the Unicode® Technical Reports enough. I think every developer could learn something from those hidden gems.

Updated:

Comments