author    Russ Cox <rsc@golang.org>    2020-03-09 23:54:35 -0400
committer Russ Cox <rsc@golang.org>    2020-03-17 20:58:37 +0000
commit    af5018f64e406aaa646dae066f28de57321ea5ce
tree      8db7b1f049d83d215fa9abf68851efce7b5ccadb /content/normalization.article
parent    86e424fac66fa90ddcb7e8d7febd4c2b07d7c59e
content: convert to Markdown-enabled present inputs
Converted blog to Markdown-enabled present (CL 222846) using present2md (CL 222847).

For golang/go#33955.

Change-Id: Ib39fa1ddd9a46f9c7a62a2ca7b96e117635553e8
Reviewed-on: https://go-review.googlesource.com/c/blog/+/222848
Run-TryBot: Russ Cox <rsc@golang.org>
Reviewed-by: Andrew Bonventre <andybons@golang.org>
Diffstat (limited to 'content/normalization.article')
-rw-r--r--  content/normalization.article | 49
1 file changed, 25 insertions(+), 24 deletions(-)
diff --git a/content/normalization.article b/content/normalization.article
index 636eac8..0076d62 100644
--- a/content/normalization.article
+++ b/content/normalization.article
@@ -1,27 +1,28 @@
-Text normalization in Go
+# Text normalization in Go
26 Nov 2013
Tags: strings, bytes, runes, characters
+Summary: An earlier [post](https://blog.golang.org/strings) talked about strings, bytes and characters in Go. I've been working on various packages for multilingual text processing for the go.text repository. Several of these packages deserve a separate blog post, but today I want to focus on [go.text/unicode/norm](https://godoc.org/code.google.com/p/go.text/unicode/norm), which handles normalization, a topic touched in the [strings article](https://blog.golang.org/strings) and the subject of this post. Normalization works at a higher level of abstraction than raw bytes.
Marcel van Lohuizen
-* Introduction
+## Introduction
-An earlier [[https://blog.golang.org/strings][post]] talked about strings, bytes
+An earlier [post](https://blog.golang.org/strings) talked about strings, bytes
and characters in Go. I've been working on various packages for multilingual
text processing for the go.text repository. Several of these packages deserve a
separate blog post, but today I want to focus on
-[[https://godoc.org/code.google.com/p/go.text/unicode/norm][go.text/unicode/norm]],
+[go.text/unicode/norm](https://godoc.org/code.google.com/p/go.text/unicode/norm),
which handles normalization, a topic touched in the
-[[https://blog.golang.org/strings][strings article]] and the subject of this
+[strings article](https://blog.golang.org/strings) and the subject of this
post. Normalization works at a higher level of abstraction than raw bytes.
To learn pretty much everything you ever wanted to know about normalization
-(and then some), [[http://unicode.org/reports/tr15/][Annex 15 of the Unicode Standard]]
+(and then some), [Annex 15 of the Unicode Standard](http://unicode.org/reports/tr15/)
is a good read. A more approachable article is the corresponding
-[[http://en.wikipedia.org/wiki/Unicode_equivalence][Wikipedia page]]. Here we
+[Wikipedia page](http://en.wikipedia.org/wiki/Unicode_equivalence). Here we
focus on how normalization relates to Go.
-* What is normalization?
+## What is normalization?
There are often several ways to represent the same string. For example, an é
(e-acute) can be represented in a string as a single rune ("\u00e9") or an 'e'
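The two spellings of é compare unequal as raw strings but become identical after normalization. A minimal sketch of the idea, using the current golang.org/x/text import path for the go.text packages referenced in this post:

```go
package main

import (
	"fmt"

	"golang.org/x/text/unicode/norm"
)

func main() {
	composed := "\u00e9"    // é as one precomposed rune
	decomposed := "e\u0301" // 'e' followed by a combining acute accent

	fmt.Println(composed == decomposed)                  // false: the bytes differ
	fmt.Println(norm.NFC.String(decomposed) == composed) // true: NFC composes
	fmt.Println(norm.NFD.String(composed) == decomposed) // true: NFD decomposes
}
```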
@@ -46,12 +47,12 @@ Consortium identifies these forms:
.html normalization/table1.html
-* Go's approach to normalization
+## Go's approach to normalization
As mentioned in the strings blog post, Go does not guarantee that characters in
a string are normalized. However, the go.text packages can compensate. For
example, the
-[[https://godoc.org/code.google.com/p/go.text/collate][collate]] package, which
+[collate](https://godoc.org/code.google.com/p/go.text/collate) package, which
can sort strings in a language-specific way, works correctly even with
unnormalized strings. The packages in go.text do not always require normalized
input, but in general normalization may be necessary for consistent results.
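As a sketch of the collation claim above: the collator sorts mixed-form input correctly without an explicit normalization step. The language.English tag is an arbitrary choice for illustration, and the import paths are the current golang.org/x/text ones:

```go
package main

import (
	"fmt"

	"golang.org/x/text/collate"
	"golang.org/x/text/language"
)

func main() {
	// "résumé" in NFD and NFC: same text, different byte sequences.
	words := []string{"re\u0301sume\u0301", "resume", "r\u00e9sum\u00e9"}
	collate.New(language.English).SortStrings(words)
	fmt.Println(words) // sorted correctly despite the mixed normal forms
}
```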
@@ -59,7 +60,7 @@ input, but in general normalization may be necessary for consistent results.
Normalization isn't free but it is fast, particularly for collation and
searching or if a string is either in NFD or in NFC and can be converted to NFD
by decomposing without reordering its bytes. In practice,
-[[http://www.macchiato.com/unicode/nfc-faq#TOC-How-much-text-is-already-NFC-][99.98%]] of
+[99.98%](http://www.macchiato.com/unicode/nfc-faq#TOC-How-much-text-is-already-NFC-) of
the web's HTML page content is in NFC form (not counting markup, in which case
it would be more). By far most NFC can be decomposed to NFD without the need
for reordering (which requires allocation). Also, it is efficient to detect
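Since almost all real-world text is already NFC, a cheap check before converting usually avoids the work entirely. A sketch of that pattern:

```go
package main

import (
	"fmt"

	"golang.org/x/text/unicode/norm"
)

// toNFC normalizes only when needed: IsNormalString is an
// inexpensive check that succeeds for the vast majority of input.
func toNFC(s string) string {
	if norm.NFC.IsNormalString(s) {
		return s // common case: no allocation, no work
	}
	return norm.NFC.String(s)
}

func main() {
	fmt.Println(toNFC("cafe\u0301")) // "café" in NFC
}
```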
@@ -78,7 +79,7 @@ two NFC-normalized strings is not guaranteed to be in NFC.
Of course, we can also avoid the overhead outright if we know in advance that a
string is already normalized, which is often the case.
-* Why bother?
+## Why bother?
After all this discussion about avoiding normalization, you might ask why it's
worth worrying about at all. The reason is that there are cases where
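The concatenation caveat in this hunk's header line (NFC plus NFC need not be NFC) can be handled without renormalizing everything; the norm package's Append variants repair the seam. A sketch:

```go
package main

import (
	"fmt"

	"golang.org/x/text/unicode/norm"
)

func main() {
	// Both halves are valid NFC, but their naive concatenation is not:
	// the combining acute at the start of b must compose with the 'e'.
	a, b := "cafe", "\u0301s"
	fmt.Println(norm.NFC.IsNormalString(a + b)) // false

	// AppendString renormalizes around the join point only.
	s := norm.NFC.AppendString([]byte(a), b)
	fmt.Println(string(s), norm.NFC.IsNormal(s)) // cafés true
}
```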
@@ -87,7 +88,7 @@ in turn how to do it correctly.
Before discussing those, we must first clarify the concept of 'character'.
-* What is a character?
+## What is a character?
As was mentioned in the strings blog post, characters can span multiple runes.
For example, an 'e' and '◌́' (acute "\u0301") can combine to form 'é' ("e\u0301"
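Ranging over such a string makes the distinction concrete: one user-perceived character, two runes:

```go
package main

import "fmt"

func main() {
	for i, r := range "e\u0301" { // 'é' built from two runes
		fmt.Printf("byte offset %d: %U\n", i, r)
	}
	// byte offset 0: U+0065
	// byte offset 1: U+0301
}
```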
@@ -119,7 +120,7 @@ placed after a freshly inserted Combining Grapheme Joiner (CGJ or U+034F). Go

adopts this approach for all normalization algorithms. This decision gives up a
little conformance but gains a little safety.
-* Writing in normal form
+## Writing in normal form
Even if you don't need to normalize text within your Go code, you might still
want to do so when communicating to the outside world. For example, normalizing
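The CGJ safeguard described at the start of this hunk can be observed directly. A sketch, assuming the norm package's documented cap on the number of combining marks per character (Unicode's Stream-Safe Text Format uses 30):

```go
package main

import (
	"fmt"
	"strings"

	"golang.org/x/text/unicode/norm"
)

func main() {
	// An 'e' followed by an excessive run of combining acutes.
	s := "e" + strings.Repeat("\u0301", 40)
	out := norm.NFC.String(s)
	// Rather than buffer without bound, the normalizer inserts
	// a Combining Grapheme Joiner (U+034F) into the output.
	fmt.Println(strings.ContainsRune(out, '\u034f'))
}
```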
@@ -129,7 +130,7 @@ APIs might expect text in a certain normal form. Or you might just want to fit
in and output your text as NFC like the rest of the world.
To write your text as NFC, use the
-[[https://godoc.org/code.google.com/p/go.text/unicode/norm][unicode/norm]] package
+[unicode/norm](https://godoc.org/code.google.com/p/go.text/unicode/norm) package
to wrap your `io.Writer` of choice:
wc := norm.NFC.Writer(w)
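Filled out into a runnable form, with os.Stdout standing in for the io.Writer of choice:

```go
package main

import (
	"fmt"
	"os"

	"golang.org/x/text/unicode/norm"
)

func main() {
	wc := norm.NFC.Writer(os.Stdout)
	defer wc.Close()               // flush: trailing runes may be buffered awaiting composition
	fmt.Fprintln(wc, "cafe\u0301") // reaches stdout as NFC: "café"
}
```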
@@ -144,7 +145,7 @@ simpler form:
Package norm provides various other methods for normalizing text.
Pick the one that suits your needs best.
-* Catching look-alikes
+## Catching look-alikes
Can you tell the difference between 'K' ("\u004B") and 'K' (Kelvin sign
"\u212A") or 'Ω' ("\u03a9") and 'Ω' (Ohm sign "\u2126")? It is easy to overlook
@@ -159,7 +160,7 @@ look alike, but are really from two different alphabets. For example the Latin
'o', Greek 'ο', and Cyrillic 'о' are still different characters as defined by
these forms.
-* Correct text modifications
+## Correct text modifications
The norm package might also come to the rescue when one needs to modify text.
Consider a case where you want to search and replace the word "cafe" with its
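By contrast, the cross-alphabet look-alikes survive normalization unchanged, so an identifier checker still has to treat them separately:

```go
package main

import (
	"fmt"

	"golang.org/x/text/unicode/norm"
)

func main() {
	latin, greek, cyrillic := "o", "\u03bf", "\u043e"
	// NFKC leaves all three as-is: distinct characters, not variants.
	fmt.Println(norm.NFKC.String(greek) == latin)    // false
	fmt.Println(norm.NFKC.String(cyrillic) == latin) // false
}
```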
@@ -202,14 +203,14 @@ the fact that characters can span multiple runes. Generally these kinds of
problems can be avoided by using search functionality that respects character
boundaries (such as the planned go.text/search package.)
-* Iteration
+## Iteration
Another tool provided by the norm package that may help dealing with character
boundaries is its iterator,
-[[https://godoc.org/code.google.com/p/go.text/unicode/norm#Iter][`norm.Iter`]].
+[`norm.Iter`](https://godoc.org/code.google.com/p/go.text/unicode/norm#Iter).
It iterates over characters one at a time in the normal form of choice.
-* Performing magic
+## Performing magic
As mentioned earlier, most text is in NFC form, where base characters and
modifiers are combined into a single rune whenever possible.  For the purpose
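A sketch of the iterator in use, walking a mixed-form string one character at a time in the chosen normal form:

```go
package main

import (
	"fmt"

	"golang.org/x/text/unicode/norm"
)

func main() {
	var it norm.Iter
	it.InitString(norm.NFC, "re\u0301sume\u0301") // decomposed input
	for !it.Done() {
		// Next returns one character, which may span multiple runes.
		fmt.Printf("%q\n", it.Next())
	}
}
```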
@@ -240,16 +241,16 @@ of choice as follows:
This will, for example, convert any mention of "cafés" in the text to "cafes",
regardless of the normal form in which the original text was encoded.
-* Normalization info
+## Normalization info
As mentioned earlier, some packages precompute normalizations into their tables
to minimize the need for normalization at run time. The type `norm.Properties`
provides access to the per-rune information needed by these packages, most
notably the Canonical Combining Class and decomposition information. Read the
-[[https://godoc.org/code.google.com/p/go.text/unicode/norm/#Properties][documentation]]
+[documentation](https://godoc.org/code.google.com/p/go.text/unicode/norm/#Properties)
for this type if you want to dig deeper.
-* Performance
+## Performance
To give an idea of the performance of normalization, we compare it against the
performance of strings.ToLower. The sample in the first row is both lowercase
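The "cafés" to "cafes" conversion mentioned in the hunk above comes from a transform chain in the full article: decompose, drop the nonspacing marks, recompose. A sketch with today's x/text packages, where runes.Remove stands in for the post-era transform.RemoveFunc:

```go
package main

import (
	"fmt"
	"unicode"

	"golang.org/x/text/runes"
	"golang.org/x/text/transform"
	"golang.org/x/text/unicode/norm"
)

func main() {
	// NFD splits é into 'e' + U+0301; runes.Remove drops the
	// nonspacing mark (category Mn); NFC recomposes what remains.
	t := transform.Chain(norm.NFD, runes.Remove(runes.In(unicode.Mn)), norm.NFC)
	for _, s := range []string{"caf\u00e9s", "cafe\u0301s"} { // NFC and NFD spellings
		out, _, err := transform.String(t, s)
		if err != nil {
			panic(err)
		}
		fmt.Println(out) // "cafes" both times
	}
}
```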
@@ -269,7 +270,7 @@ processing larger strings. As it turns out, these buffers are rarely needed, so
we may change the implementation at some point to speed up the common case for
small strings even further.
-* Conclusion
+## Conclusion
If you're dealing with text inside Go, you generally do not have to use the
unicode/norm package to normalize your text. The package may still be useful