-rw-r--r--  content/normalization.article      | 280
-rw-r--r--  content/normalization/table1.html  |  20
-rw-r--r--  content/normalization/table2.html  |  23
3 files changed, 323 insertions, 0 deletions
diff --git a/content/normalization.article b/content/normalization.article
new file mode 100644
index 0000000..b84ce0c
--- /dev/null
+++ b/content/normalization.article
@@ -0,0 +1,280 @@

Text normalization in Go
26 Nov 2013

Marcel van Lohuizen

* Introduction

An earlier [[http://blog.golang.org/strings][post]] talked about strings, bytes
and characters in Go. I've been working on various packages for multilingual
text processing for the go.text repository. Several of these packages deserve a
separate blog post, but today I want to focus on
[[http://godoc.org/code.google.com/p/go.text/unicode/norm][go.text/unicode/norm]],
which handles normalization, a topic touched on in the
[[http://blog.golang.org/strings][strings article]] and the subject of this
post. Normalization works at a higher level of abstraction than raw bytes.

To learn pretty much everything you ever wanted to know about normalization
(and then some), [[http://unicode.org/reports/tr15/][Annex 15 of the Unicode Standard]]
is a good read. A more approachable article is the corresponding
[[http://en.wikipedia.org/wiki/Unicode_equivalence][Wikipedia page]]. Here we
focus on how normalization relates to Go.

* What is normalization?

There are often several ways to represent the same string. For example, an é
(e-acute) can be represented in a string as a single rune ("\u00e9") or an 'e'
followed by an acute accent ("e\u0301"). According to the Unicode standard,
these two are "canonically equivalent" and should be treated as equal.

Using a byte-to-byte comparison to determine equality would clearly not give
the right result for these two strings. Unicode defines a set of normal forms
such that if two strings are canonically equivalent and are normalized to the
same normal form, their byte representations are the same.

Unicode also defines a "compatibility equivalence" to equate characters that
represent the same character, but may have a different visual appearance.
For example, the superscript digit '⁹' and the regular digit '9' are equivalent
in this form.

For each of these two equivalence forms, Unicode defines a composing and a
decomposing form. The former replaces runes that can combine into a single rune
with this single rune. The latter breaks runes apart into their components.
This table shows the names, all starting with NF, by which the Unicode
Consortium identifies these forms:

.html normalization/table1.html

* Go's approach to normalization

As mentioned in the strings blog post, Go does not guarantee that characters in
a string are normalized. However, the go.text packages can compensate. For
example, the
[[http://godoc.org/code.google.com/p/go.text/collate][collate]] package, which
can sort strings in a language-specific way, works correctly even with
unnormalized strings. The packages in go.text do not always require normalized
input, but in general normalization may be necessary for consistent results.

Normalization isn't free but it is fast, particularly for collation and
searching or if a string is either in NFD or in NFC and can be converted to NFD
by decomposing without reordering its bytes. In practice,
[[http://www.macchiato.com/unicode/nfc-faq#TOC-How-much-text-is-already-NFC-][99.98%]] of
the web's HTML page content is in NFC form (not counting markup, in which case
it would be more). By far most NFC can be decomposed to NFD without the need
for reordering (which requires allocation). Also, it is efficient to detect
when reordering is necessary, so we can save time by doing it only for the rare
segments that need it.

To make things even better, the collation package typically does not use the
norm package directly, but instead uses the norm package to interleave
normalization information with its own tables. Interleaving the two problems
allows for reordering and normalization on the fly with almost no impact on
performance.
The cost of on-the-fly normalization is compensated by not having
to normalize text beforehand and ensuring that the normal form is maintained
upon edits. The latter can be tricky. For instance, the result of concatenating
two NFC-normalized strings is not guaranteed to be in NFC.

Of course, we can also avoid the overhead outright if we know in advance that a
string is already normalized, which is often the case.

* Why bother?

After all this discussion about avoiding normalization, you might ask why it's
worth worrying about at all. The reason is that there are cases where
normalization is required and it is important to understand what those are, and
in turn how to do it correctly.

Before discussing those, we must first clarify the concept of 'character'.

* What is a character?

As was mentioned in the strings blog post, characters can span multiple runes.
For example, an 'e' and '◌́' (acute "\u0301") can combine to form 'é' ("e\u0301"
in NFD). Together these two runes are one character. The definition of a
character may vary depending on the application. For normalization we will
define it as a sequence of runes that starts with a starter, a rune that does
not modify or combine backwards with any other rune, followed by a possibly
empty sequence of non-starters, that is, runes that do (typically accents). The
normalization algorithm processes one character at a time.

Theoretically, there is no bound to the number of runes that can make up a
Unicode character. In fact, there are no restrictions on the number of
modifiers that can follow a character and a modifier may be repeated, or
stacked. Ever seen an 'e' with three acutes? Here you go: 'é́́'. That is a
perfectly valid 4-rune character according to the standard.

As a consequence, even at the lowest level, text needs to be processed in
increments of unbounded chunk sizes.
This is especially awkward with a
streaming approach to text processing, as used by Go's standard Reader and
Writer interfaces, as that model potentially requires any intermediate buffers
to have unbounded size as well. Also, a straightforward implementation of
normalization will have an O(n²) running time.

There are really no meaningful interpretations for such large sequences of
modifiers for practical applications. Unicode defines a Stream-Safe Text
format, which allows capping the number of modifiers (non-starters) to at most
30, more than enough for any practical purpose. Subsequent modifiers will be
placed after a freshly inserted Combining Grapheme Joiner (CGJ or U+034F). Go
adopts this approach for all normalization algorithms. This decision gives up a
little conformance but gains a little safety.

* Writing in normal form

Even if you don't need to normalize text within your Go code, you might still
want to do so when communicating to the outside world. For example, normalizing
to NFC might compact your text, making it cheaper to send down a wire. For some
languages, like Korean, the savings can be substantial. Also, some external
APIs might expect text in a certain normal form. Or you might just want to fit
in and output your text as NFC like the rest of the world.

To write your text as NFC, use the
[[http://godoc.org/code.google.com/p/go.text/unicode/norm][unicode/norm]] package
to wrap your `io.Writer` of choice:

	wc := norm.NFC.Writer(w)
	defer wc.Close()
	// write as before...

If you have a small string and want to do a quick conversion, you can use this
simpler form:

	norm.NFC.Bytes(b)

Package norm provides various other methods for normalizing text. Pick the one
that suits your needs best.

* Catching look-alikes

Can you tell the difference between 'K' ("\u004B") and 'K' (Kelvin sign
"\u212A") or 'Ω' ("\u03a9") and 'Ω' (Ohm sign "\u2126")?
It is easy to overlook
the sometimes minute differences between variants of the same underlying
character. It is generally a good idea to disallow such variants in identifiers
or anything where deceiving users with such look-alikes can pose a security
hazard.

The compatibility normal forms, NFKC and NFKD, will map many visually nearly
identical forms to a single value. Note that they will not do so when two
symbols look alike, but are really from two different alphabets. For example
the Latin 'o', Greek 'ο', and Cyrillic 'о' are still different characters as
defined by these forms.

* Correct text modifications

The norm package might also come to the rescue when one needs to modify text.
Consider a case where you want to search and replace the word "cafe" with its
plural form "cafes". A code snippet could look like this.

	s := "We went to eat at multiple cafe"
	cafe := "cafe"
	if p := strings.Index(s, cafe); p != -1 {
		p += len(cafe)
		s = s[:p] + "s" + s[p:]
	}
	fmt.Println(s)

This prints "We went to eat at multiple cafes" as desired and expected. Now
consider our text contains the French spelling "café" in NFD form:

	s := "We went to eat at multiple cafe\u0301"

Using the same code from above, the plural "s" would still be inserted after
the 'e', but before the acute, resulting in "We went to eat at multiple
cafeś". This behavior is undesirable.

The problem is that the code does not respect the boundaries between multi-rune
characters and inserts a rune in the middle of a character. Using the norm
package, we can rewrite this piece of code as follows:

	s := "We went to eat at multiple cafe\u0301"
	cafe := "cafe"
	if p := strings.Index(s, cafe); p != -1 {
		p += len(cafe)
		if bp := norm.FirstBoundary(s[p:]); bp > 0 {
			p += bp
		}
		s = s[:p] + "s" + s[p:]
	}
	fmt.Println(s)

This may be a contrived example, but the gist should be clear. Be mindful of
the fact that characters can span multiple runes.
Generally these kinds of
problems can be avoided by using search functionality that respects character
boundaries (such as the planned go.text/search package).

* Iteration

Another tool provided by the norm package that may help when dealing with
character boundaries is its iterator,
[[http://godoc.org/code.google.com/p/go.text/norm/#Iter][`norm.Iter`]].
It iterates over characters one at a time in the normal form of choice.

* Performing magic

As mentioned earlier, most text is in NFC form, where base characters and
modifiers are combined into a single rune whenever possible. For the purpose
of analyzing characters, it is often easier to handle runes after decomposition
into their smallest components. This is where the NFD form comes in handy. For
example, the following piece of code creates a transform.Transformer that
decomposes text into its smallest parts, removes all accents, and then
recomposes the text into NFC:

	import (
		"unicode"

		"code.google.com/p/go.text/transform"
		"code.google.com/p/go.text/unicode/norm"
	)

	isMn := func(r rune) bool {
		return unicode.Is(unicode.Mn, r) // Mn: nonspacing marks
	}
	t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC)

The resulting `Transformer` can be used to remove accents from an `io.Reader`
of choice as follows:

	r = transform.NewReader(r, t)
	// read as before ...

This will, for example, convert any mention of "cafés" in the text to "cafes",
regardless of the normal form in which the original text was encoded.

* Normalization info

As mentioned earlier, some packages precompute normalizations into their tables
to minimize the need for normalization at run time. The type norm.Properties
provides access to the per-rune information needed by these packages, most
notably the Canonical Combining Class and decomposition information.
Read the
[[http://godoc.org/code.google.com/p/go.text/unicode/norm/#Properties][documentation]]
for this type if you want to dig deeper.

* Performance

To give an idea of the performance of normalization, we compare it against the
performance of strings.ToLower. The sample in the first row is both lowercase
and NFC and can in every case be returned as is. The second sample is neither
and requires writing a new version.

.html normalization/table2.html

The column with the results for the iterator shows both the measurement with
and without initialization of the iterator, which contains buffers that don't
need to be reinitialized upon reuse.

As you can see, detecting whether a string is normalized can be quite
efficient. A lot of the cost of normalizing in the second row is for the
initialization of buffers, the cost of which is amortized when one is
processing larger strings. As it turns out, these buffers are rarely needed, so
we may change the implementation at some point to speed up the common case for
small strings even further.

* Conclusion

If you're dealing with text inside Go, you generally do not have to use the
unicode/norm package to normalize your text. The package may still be useful
for things like ensuring that strings are normalized before sending them out or
to do advanced text manipulation.

This article briefly mentioned the existence of other go.text packages as well
as multilingual text processing and it may have raised more questions than it
has given answers. The discussion of these topics, however, will have to wait
until another day.
diff --git a/content/normalization/table1.html b/content/normalization/table1.html
new file mode 100644
index 0000000..a1f0178
--- /dev/null
+++ b/content/normalization/table1.html
@@ -0,0 +1,20 @@

<style>
  .padtable td, .padtable th { padding-right: 10px; }
</style>
<table class="codetable padtable">
  <tr>
    <th> </th>
    <th>Composing</th>
    <th>Decomposing</th>
  </tr>
  <tr>
    <th>Canonical equivalence</th>
    <td>NFC</td>
    <td>NFD</td>
  </tr>
  <tr>
    <th>Compatibility equivalence</th>
    <td>NFKC</td>
    <td>NFKD</td>
  </tr>
</table>

diff --git a/content/normalization/table2.html b/content/normalization/table2.html
new file mode 100644
index 0000000..2ecf294
--- /dev/null
+++ b/content/normalization/table2.html
@@ -0,0 +1,23 @@

<table class="codetable padtable">
  <tr>
    <th>Input</th>
    <th>ToLower</th>
    <th>NFC Append</th>
    <th>NFC Transform</th>
    <th>NFC Iter</th>
  </tr>
  <tr>
    <td>nörmalization</td>
    <td>199 ns</td>
    <td>137 ns</td>
    <td>133 ns</td>
    <td>251 ns (621 ns)</td>
  </tr>
  <tr>
    <td>No\u0308rmalization</td>
    <td>427 ns</td>
    <td>836 ns</td>
    <td>845 ns</td>
    <td>573 ns (948 ns)</td>
  </tr>
</table>