-rw-r--r--	content/normalization.article		280
-rw-r--r--	content/normalization/table1.html	20
-rw-r--r--	content/normalization/table2.html	23
3 files changed, 323 insertions, 0 deletions
diff --git a/content/normalization.article b/content/normalization.article
new file mode 100644
index 0000000..b84ce0c
--- /dev/null
+++ b/content/normalization.article
@@ -0,0 +1,280 @@
+Text normalization in Go
+26 Nov 2013
+
+Marcel van Lohuizen
+
+* Introduction
+
+An earlier [[http://blog.golang.org/strings][post]] talked about strings, bytes
+and characters in Go. I've been working on various packages for multilingual
+text processing for the go.text repository. Several of these packages deserve a
+separate blog post, but today I want to focus on
+[[http://godoc.org/code.google.com/p/go.text/unicode/norm][go.text/unicode/norm]],
+which handles normalization, a topic touched on in the
+[[http://blog.golang.org/strings][strings article]] and the subject of this
+post. Normalization works at a higher level of abstraction than raw bytes.
+
+To learn pretty much everything you ever wanted to know about normalization
+(and then some), [[http://unicode.org/reports/tr15/][Annex 15 of the Unicode Standard]]
+is a good read. A more approachable article is the corresponding
+[[http://en.wikipedia.org/wiki/Unicode_equivalence][Wikipedia page]]. Here we
+focus on how normalization relates to Go.
+
+* What is normalization?
+
+There are often several ways to represent the same string. For example, an é
+(e-acute) can be represented in a string as a single rune ("\u00e9") or an 'e'
+followed by an acute accent ("e\u0301"). According to the Unicode standard,
+these two are "canonically equivalent" and should be treated as equal.
+
+Using a byte-to-byte comparison to determine equality would clearly not give
+the right result for these two strings. Unicode defines a set of normal forms
+such that if two strings are canonically equivalent and are normalized to the
+same normal form, their byte representations are the same.
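+
+As a minimal sketch (using the go.text unicode/norm package discussed below),
+the two encodings of 'é' compare unequal byte for byte but equal once both are
+brought into the same normal form:
+
+ a, b := "\u00e9", "e\u0301"
+ fmt.Println(a == b)                                   // false: different bytes
+ fmt.Println(norm.NFC.String(a) == norm.NFC.String(b)) // true: same NFC bytes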
+
+Unicode also defines a "compatibility equivalence" to equate characters that
+represent the same abstract character but may differ in visual appearance. For
+example, the superscript digit '⁹' and the regular digit '9' are equivalent in
+this form.
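+
+For instance, a quick sketch with the unicode/norm package introduced below:
+
+ fmt.Println(norm.NFKC.String("\u2079")) // "9": superscript nine folds to '9'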
+
+For each of these two equivalence forms, Unicode defines a composing and
+decomposing form. The former replaces runes that can combine into a single rune
+with this single rune. The latter breaks runes apart into their components.
+This table shows the names, all starting with NF, by which the Unicode
+Consortium identifies these forms:
+
+.html normalization/table1.html
+
+* Go's approach to normalization
+
+As mentioned in the strings blog post, Go does not guarantee that characters in
+a string are normalized. However, the go.text packages can compensate. For
+example, the
+[[http://godoc.org/code.google.com/p/go.text/collate][collate]] package, which
+can sort strings in a language-specific way, works correctly even with
+unnormalized strings. The packages in go.text do not always require normalized
+input, but in general normalization may be necessary for consistent results.
+
+Normalization isn't free but it is fast, particularly for collation and
+searching or if a string is either in NFD or in NFC and can be converted to NFD
+by decomposing without reordering its bytes. In practice,
+[[http://www.macchiato.com/unicode/nfc-faq#TOC-How-much-text-is-already-NFC-][99.98%]] of
+the web's HTML page content is in NFC form (not counting markup, in which case
+it would be more). By far most NFC can be decomposed to NFD without the need
+for reordering (which requires allocation). Also, it is efficient to detect
+when reordering is necessary, so we can save time by doing it only for the rare
+segments that need it.
+
+To make things even better, the collation package typically does not use the
+norm package directly, but instead uses the norm package to interleave
+normalization information with its own tables. Interleaving the two problems
+allows for reordering and normalization on the fly with almost no impact on
+performance. The cost of on-the-fly normalization is compensated by not having
+to normalize text beforehand and ensuring that the normal form is maintained
+upon edits. The latter can be tricky. For instance, the result of concatenating
+two NFC-normalized strings is not guaranteed to be in NFC.
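+
+A small sketch of that pitfall, using the package's `AppendString` to do the
+concatenation correctly (each input on its own is valid NFC):
+
+ a, b := "cafe", "\u0301" // a combining acute accent
+ fmt.Println(norm.NFC.IsNormalString(a + b)) // false: 'e'+accent at the seam
+ t := norm.NFC.AppendString([]byte(a), b)    // normalizes across the join
+ fmt.Println(norm.NFC.IsNormal(t))           // true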
+
+Of course, we can also avoid the overhead outright if we know in advance that a
+string is already normalized, which is often the case.
+
+* Why bother?
+
+After all this discussion about avoiding normalization, you might ask why it's
+worth worrying about at all. The reason is that there are cases where
+normalization is required, and it is important to understand what those cases
+are and how to normalize correctly.
+
+Before discussing those, we must first clarify the concept of 'character'.
+
+* What is a character?
+
+As was mentioned in the strings blog post, characters can span multiple runes.
+For example, an 'e' and '◌́' (acute "\u0301") can combine to form 'é' ("e\u0301"
+in NFD). Together these two runes are one character. The definition of a
+character may vary depending on the application. For normalization we will
+define it as a sequence of runes that starts with a starter, a rune that does
+not modify or combine backwards with any other rune, followed by a possibly
+empty sequence of non-starters, that is, runes that do (typically accents). The
+normalization algorithm processes one character at a time.
+
+Theoretically, there is no bound to the number of runes that can make up a
+Unicode character. In fact, there are no restrictions on the number of
+modifiers that can follow a character and a modifier may be repeated, or
+stacked. Ever seen an 'e' with three acutes? Here you go: 'é́́'. That is a
+perfectly valid 4-rune character according to the standard.
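+
+A quick way to see this, using the standard `unicode/utf8` package to count
+the runes:
+
+ s := "e\u0301\u0301\u0301" // 'e' with three acute accents
+ fmt.Println(utf8.RuneCountInString(s)) // 4 runes, but a single character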
+
+As a consequence, even at the lowest level, text needs to be processed in
+chunks of unbounded size. This is especially awkward with a streaming approach
+to text processing, as used by Go's standard Reader and Writer interfaces,
+as that model potentially requires any intermediate buffers to have unbounded
+size as well. Also, a straightforward implementation of normalization will
+have an O(n²) running time.
+
+In practice, such long sequences of modifiers have no meaningful
+interpretation. Unicode defines a Stream-Safe Text
+format, which allows capping the number of modifiers (non-starters) to at most
+30, more than enough for any practical purpose. Subsequent modifiers will be
+placed after a freshly inserted Combining Grapheme Joiner (CGJ or U+034F). Go
+adopts this approach for all normalization algorithms. This decision gives up a
+little conformance but gains a little safety.
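+
+As a sketch of the cap in action (assuming, as I read the package, that its
+transformations apply the Stream-Safe Text Format), a run of more than 30
+combining marks comes back with a CGJ spliced in:
+
+ s := "e" + strings.Repeat("\u0301", 31)
+ t := norm.NFC.String(s)
+ fmt.Println(strings.ContainsRune(t, '\u034f')) // true: a CGJ was inserted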
+
+* Writing in normal form
+
+Even if you don't need to normalize text within your Go code, you might still
+want to do so when communicating to the outside world. For example, normalizing
+to NFC might compact your text, making it cheaper to send down a wire. For some
+languages, like Korean, the savings can be substantial. Also, some external
+APIs might expect text in a certain normal form. Or you might just want to fit
+in and output your text as NFC like the rest of the world.
+
+To write your text as NFC, use the
+[[http://godoc.org/code.google.com/p/go.text/unicode/norm][unicode/norm]] package
+to wrap your `io.Writer` of choice:
+
+ wc := norm.NFC.Writer(w)
+ defer wc.Close()
+ // write as before...
+
+If you have a small string and want to do a quick conversion, you can use this simpler form:
+
+ b = norm.NFC.Bytes(b)
+
+Package norm provides various other methods for normalizing text. Pick the one that suits your needs best.
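+
+A sketch of a few of them (with `s` a string and `b` a byte slice):
+
+ ok := norm.NFC.IsNormalString(s)   // report whether s is already in NFC
+ n := norm.NFC.QuickSpan(b)         // length of the prefix of b that is in NFC
+ d := norm.NFD.AppendString(nil, s) // decompose s into a fresh buffer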
+
+* Catching look-alikes
+
+Can you tell the difference between 'K' ("\u004B") and 'K' (Kelvin sign
+"\u212A") or 'Ω' ("\u03a9") and 'Ω' (Ohm sign "\u2126")? It is easy to overlook
+the sometimes minute differences between variants of the same underlying
+character. It is generally a good idea to disallow such variants in identifiers
+or anything where deceiving users with such look-alikes can pose a security
+hazard.
+
+The compatibility normal forms, NFKC and NFKD, will map many visually nearly
+identical forms to a single value. Note that they will not do so when two symbols
+look alike, but are really from two different alphabets. For example the Latin
+'o', Greek 'ο', and Cyrillic 'о' are still different characters as defined by
+these forms.
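+
+A short sketch of both behaviors:
+
+ fmt.Println(norm.NFKC.String("\u212a") == "K") // true: Kelvin sign folds to 'K'
+ fmt.Println(norm.NFKC.String("\u03bf") == "o") // false: omicron is not Latin 'o'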
+
+* Correct text modifications
+
+The norm package might also come to the rescue when one needs to modify text.
+Consider a case where you want to search and replace the word "cafe" with its
+plural form "cafes". A code snippet could look like this:
+
+ s := "We went to eat at multiple cafe"
+ cafe := "cafe"
+ if p := strings.Index(s, cafe); p != -1 {
+ 	p += len(cafe)
+ 	s = s[:p] + "s" + s[p:]
+ }
+ fmt.Println(s)
+
+This prints "We went to eat at multiple cafes" as desired and expected. Now
+consider our text contains the French spelling "café" in NFD form:
+
+ s := "We went to eat at multiple cafe\u0301"
+
+Using the same code as above, the plural "s" would still be inserted after
+the 'e', but before the acute, resulting in "We went to eat at multiple
+cafeś". This behavior is undesirable.
+
+The problem is that the code does not respect the boundaries between multi-rune
+characters and inserts a rune in the middle of a character.  Using the norm
+package, we can rewrite this piece of code as follows:
+
+ s := "We went to eat at multiple cafe\u0301"
+ cafe := "cafe"
+ if p := strings.Index(s, cafe); p != -1 {
+ 	p += len(cafe)
+ 	// Skip past any combining marks so the insertion point falls on
+ 	// a character boundary.
+ 	if bp := norm.NFC.FirstBoundaryInString(s[p:]); bp > 0 {
+ 		p += bp
+ 	}
+ 	s = s[:p] + "s" + s[p:]
+ }
+ fmt.Println(s)
+
+This may be a contrived example, but the gist should be clear. Be mindful of
+the fact that characters can span multiple runes. Generally these kinds of
+problems can be avoided by using search functionality that respects character
+boundaries (such as the planned go.text/search package).
+
+* Iteration
+
+Another tool provided by the norm package that may help dealing with character
+boundaries is its iterator,
+[[http://godoc.org/code.google.com/p/go.text/unicode/norm#Iter][`norm.Iter`]].
+It iterates over characters one at a time in the normal form of choice.
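+
+A minimal sketch of its use, printing one character per line:
+
+ var it norm.Iter
+ it.InitString(norm.NFD, "café")
+ for !it.Done() {
+ 	fmt.Printf("%q\n", it.Next()) // "c", "a", "f", then "e\u0301"
+ }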
+
+* Performing magic
+
+As mentioned earlier, most text is in NFC form, where base characters and
+modifiers are combined into a single rune whenever possible.  For the purpose
+of analyzing characters, it is often easier to handle runes after decomposition
+into their smallest components. This is where the NFD form comes in handy. For
+example, the following piece of code creates a `transform.Transformer` that
+decomposes text into its smallest parts, removes all accents, and then
+recomposes the text into NFC:
+
+ import (
+ 	"unicode"
+
+ 	"code.google.com/p/go.text/transform"
+ 	"code.google.com/p/go.text/unicode/norm"
+ )
+
+ isMn := func(r rune) bool {
+ 	return unicode.Is(unicode.Mn, r) // Mn: nonspacing marks
+ }
+ t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC)
+
+The resulting `Transformer` can be used to remove accents from an `io.Reader`
+of choice as follows:
+
+ r = transform.NewReader(r, t)
+ // read as before ...
+
+This will, for example, convert any mention of "cafés" in the text to "cafes",
+regardless of the normal form in which the original text was encoded.
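+
+For one-off strings, the same transformer can be applied directly with
+`transform.String`, a quick sketch:
+
+ s, _, _ := transform.String(t, "résumé")
+ fmt.Println(s) // "resume"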
+
+* Normalization info
+
+As mentioned earlier, some packages precompute normalizations into their tables
+to minimize the need for normalization at run time. The type `norm.Properties`
+provides access to the per-rune information needed by these packages, most
+notably the Canonical Combining Class and decomposition information. Read the
+[[http://godoc.org/code.google.com/p/go.text/unicode/norm/#Properties][documentation]]
+for this type if you want to dig deeper.
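+
+As a sketch, querying the properties of the rune 'é':
+
+ p := norm.NFD.Properties([]byte("\u00e9"))
+ fmt.Println(p.CCC())                  // 0: 'é' is a starter
+ fmt.Printf("%q\n", p.Decomposition()) // "e" + U+0301, its canonical decomposition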
+
+* Performance
+
+To give an idea of the performance of normalization, we compare it against the
+performance of `strings.ToLower`. The sample in the first row is both lowercase
+and NFC and can in every case be returned as is. The second sample is neither
+and requires writing a new version.
+
+.html normalization/table2.html
+
+The column with the results for the iterator shows measurements both with and
+without initialization of the iterator; the iterator contains buffers that do
+not need to be reinitialized upon reuse.
+
+As you can see, detecting whether a string is normalized can be quite
+efficient. A lot of the cost of normalizing in the second row is for the
+initialization of buffers, the cost of which is amortized when one is
+processing larger strings. As it turns out, these buffers are rarely needed, so
+we may change the implementation at some point to speed up the common case for
+small strings even further.
+
+* Conclusion
+
+If you're dealing with text inside Go, you generally do not have to use the
+unicode/norm package to normalize your text. The package may still be useful
+for tasks such as ensuring that strings are normalized before sending them out,
+or for doing advanced text manipulation.
+
+This article briefly mentioned the existence of other go.text packages as well
+as multilingual text processing and it may have raised more questions than it
+has given answers. The discussion of these topics, however, will have to wait
+until another day.
+
diff --git a/content/normalization/table1.html b/content/normalization/table1.html
new file mode 100644
index 0000000..a1f0178
--- /dev/null
+++ b/content/normalization/table1.html
@@ -0,0 +1,20 @@
+<style>
+ .padtable td, .padtable th { padding-right: 10px; }
+</style>
+<table class="codetable padtable">
+ <tr>
+ <th>&nbsp;</th>
+ <th>Composing</th>
+ <th>Decomposing</th>
+ </tr>
+ <tr>
+ <th>Canonical equivalence</th>
+ <td>NFC</td>
+ <td>NFD</td>
+ </tr>
+ <tr>
+ <th>Compatibility equivalence</th>
+ <td>NFKC</td>
+ <td>NFKD</td>
+ </tr>
+</table>
diff --git a/content/normalization/table2.html b/content/normalization/table2.html
new file mode 100644
index 0000000..2ecf294
--- /dev/null
+++ b/content/normalization/table2.html
@@ -0,0 +1,23 @@
+<table class="codetable padtable">
+ <tr>
+ <th>Input</th>
+ <th>ToLower</th>
+ <th>NFC Append</th>
+ <th>NFC Transform</th>
+ <th>NFC Iter</th>
+ </tr>
+ <tr>
+ <td>nörmalization</td>
+ <td>199 ns</td>
+ <td>137 ns</td>
+ <td>133 ns</td>
+ <td>251 ns (621 ns)</td>
+ </tr>
+ <tr>
+ <td>No\u0308rmalization</td>
+ <td>427 ns</td>
+ <td>836 ns</td>
+ <td>845 ns</td>
+ <td>573 ns (948 ns)</td>
+ </tr>
+</table>