diff options
author | Rob Pike <r@golang.org> | 2013-10-22 23:05:47 -0700 |
---|---|---|
committer | Rob Pike <r@golang.org> | 2013-10-22 23:05:47 -0700 |
commit | e81e54f7e23b828c2b42d5cf277251dda7511be1 (patch) | |
tree | b52b973a65277b630c96b0671405c864332e16ae /content/strings.article | |
parent | 88eb9799109ea8009377f18e86be137ec64b3ac4 (diff) |
go.blog/strings: blog post about strings
R=golang-dev, dan.kortschak, iant
CC=golang-dev
https://golang.org/cl/15460049
Diffstat (limited to 'content/strings.article')
-rw-r--r-- | content/strings.article | 331 |
1 files changed, 331 insertions, 0 deletions
diff --git a/content/strings.article b/content/strings.article new file mode 100644 index 0000000..1cfd227 --- /dev/null +++ b/content/strings.article @@ -0,0 +1,331 @@ +Strings, bytes, runes and characters in Go. +23 Oct 2013 +Tags: strings, bytes, runes, characters + +Rob Pike + +* Introduction + +The [[http://blog.golang.org/slices][previous blog post]] explained how slices +work in Go, using a number of examples to illustrate the mechanism behind +their implementation. +Building on that background, this post discusses strings in Go. +At first, strings might seem too simple a topic for a blog post, but to use +them well requires understanding not only how they work, +but also the difference between a byte, a character, and a rune, +the difference between Unicode and UTF-8, +the difference between a string and a string literal, +and other even more subtle distinctions. + +One way to approach this topic is to think of it as an answer to the frequently +asked question, "When I index a Go string at position _n_, why don't I get the +_nth_ character?" +As you'll see, this question leads us to many details about how text works +in the modern world. + +An excellent introduction to some of these issues, independent of Go, +is Joel Spolsky's famous blog post, +[[http://www.joelonsoftware.com/articles/Unicode.html][The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)]]. +Many of the points he raises will be echoed here. + +* What is a string? + +Let's start with some basics. + +In Go, a string is in effect a read-only slice of bytes. +If you're at all uncertain about what a slice of bytes is or how it works, +please read the [[http://blog.golang.org/slices][previous blog post]]; +we'll assume here that you have. + +It's important to state right up front that a string holds _arbitrary_ bytes. +It is not required to hold Unicode text, UTF-8 text, or any other predefined format. +As far as the content of a string is concerned, it is exactly equivalent to a +slice of bytes. + +Here is a string literal (more about those soon) that uses the +`\xNN` notation to define a string constant holding some peculiar byte values. +(Of course, bytes range from hexadecimal values 00 through FF, inclusive.) + +.code strings/basic.go /const sample/ + +* Printing strings + +Because some of the bytes in our sample string are not valid ASCII, not even +valid UTF-8, printing the string directly will produce ugly output. +The simple print statement + +.code strings/basic.go /println/,/println/ + +produces this mess (whose exact appearance varies with the environment): + + ��=� ⌘ + +To find out what that string really holds, we need to take it apart and examine the pieces. +There are several ways to do this. +The most obvious is to loop over its contents and pull out the bytes +individually, as in this `for` loop: + +.code strings/basic.go /byte loop/,/byte loop/ + +As implied up front, indexing a string accesses individual bytes, not +characters. We'll return to that topic in detail below. For now, let's +stick with just the bytes. +This is the output from the byte-by-byte loop: + + bd b2 3d bc 20 e2 8c 98 + +Notice how the individual bytes match the +hexadecimal escapes that defined the string. + +A shorter way to generate presentable output for a messy string +is to use the `%x` (hexadecimal) format verb of `fmt.Printf`. +It just dumps out the sequential bytes of the string as hexadecimal +digits, two per byte. + +.code strings/basic.go /percent x/,/percent x/ + +Compare its output to that above: + + bdb23dbc20e28c98 + +A nice trick is to use the "space" flag in that format, putting a +space between the `%` and the `x`. Compare the format string +used here to the one above, + +.code strings/basic.go /percent space x/,/percent space x/ + +and notice how the bytes come +out with spaces between, making the result a little less imposing: + + bd b2 3d bc 20 e2 8c 98 + +There's more. The `%q` (quoted) verb will escape any non-printable +byte sequences in a string so the output is unambiguous. + +.code strings/basic.go /percent q/,/percent q/ + +This technique is handy when much of the string is +intelligible as text but there are peculiarities to root out; it produces: + + "\xbd\xb2=\xbc ⌘" + +If we squint at that, we can see that buried in the noise is one ASCII equals sign, +along with a regular space, and at the end appears the well-known Swedish "Place of Interest" +symbol. +That symbol has Unicode value U+2318, encoded as UTF-8 by the bytes +after the space (hex value `20`): `e2` `8c` `98`. + +If we are unfamiliar or confused by strange values in the string, +we can use the "plus" flag to the `%q` verb. This flag causes the output to escape +not only non-printable sequences, but also any non-ASCII bytes, all +while interpreting UTF-8. +The result is that it exposes the Unicode values of properly formatted UTF-8 +that represents non-ASCII data in the string: + +.code strings/basic.go /percent plus q/,/percent plus q/ + +With that format, the Unicode value of the Swedish symbol shows up as a +`\u` escape: + + "\xbd\xb2=\xbc \u2318" + +These printing techiques are good to know when debugging +the contents of strings, and will be handy in the discussion that follows. +It's worth pointing out as well that all these methods behave exactly the +same for byte slices as they do for strings. + +Here's the full set of printing options we've listed, presented as +a complete program you can run (and edit) right in the browser: + +.play -edit strings/basic.go /package/,/^}/ + +[Exercise: Modify the examples above to use a slice of bytes +instead of a string. Hint: Use a conversion to create the slice.] + +[Exercise: Loop over the string using the `%q` format on each byte. +What does the output tell you?] + +* UTF-8 and string literals + +As we saw, indexing a string yields its bytes, not its characters: a string is just a +bunch of bytes. +That means that when we store a character value in a string, +we store its byte-at-a-time representation. +Let's look at a more controlled example to see how that happens. + +Here's a simple program that prints a string constant with a single character +three different ways, once as a plain string, once as an ASCII-only quoted +string, and once as individual bytes in hexadecimal. +To avoid any confusion, we create a "raw string", enclosed by back quotes, +so it can contain only literal text. (Regular strings, enclosed by double +quotes, can contain escape sequences as we showed above.) + +.play -edit strings/utf8.go /^func/,/^}/ + +The output is: + + plain string: ⌘ + quoted string: "\u2318" + hex bytes: e2 8c 98 + +which reminds us that the Unicode character value U+2318, the "Place +of Interest" symbol ⌘, is represented by the bytes `e2` `8c` `98`, and +that those bytes are the UTF-8 encoding of the hexadecimal +value 2318. + +It may be obvious or it may be subtle, depending on your familiarity with +UTF-8, but it's worth taking a moment to explain how the UTF-8 representation +of the string was created. +The simple fact is: it was created when the source code was written. + +Source code in Go is _defined_ to be UTF-8 text; no other representation is +allowed. That implies that when, in the source code, we write the text + + `⌘` + +the text editor used to create the program places the UTF-8 encoding +of the symbol ⌘ into the source text. +When we print out the hexadecimal bytes, we're just dumping the +data the editor placed in the file. + +In short, Go source code is UTF-8, so +_the_source_code_for_the_string_literal_is_UTF-8_text_. +If that string literal contains no escape sequences, which a raw +string cannot, the constructed string will hold exactly the +source text between the quotes. +Thus by definition and +by construction the raw string will always contain a valid UTF-8 +representation of its contents. +Similarly, unless it contains UTF-8-breaking escapes like those +from the previous section, a regular string literal will also always +contain valid UTF-8. + +Some people think Go strings are always UTF-8, but they +are not: only string literals are UTF-8. +As we showed in the previous section, string _values_ can contain arbitrary +bytes; +as we showed in this one, string _literals_ always contain UTF-8 text +as long as they have no byte-level escapes. + +To summarize, strings can contain arbitrary bytes, but when constructed +from string literals, those bytes are (almost always) UTF-8. + +* Code points, characters, and runes + +We've been very careful so far in how we use the words "byte" and "character". +That's partly because strings hold bytes, and partly because the idea of "character" +is a little hard to define. +The Unicode standard uses the term "code point" to refer to the item represented +by a single value. +The code point U+2318, with hexadecimal value 2318, represents the symbol ⌘. +(For lots more information about that code point, see +[[http://unicode.org/cldr/utility/character.jsp?a=2318][its Unicode page]].) + +To pick a more prosaic example, the Unicode code point U+0061 is the lower +case Latin letter 'A': a. + +But what about the lower case grave-accented letter 'A', à? +That's a character, and it's also a code point (U+00E0), but it has other +representations. +For example we can use the "combining" grave accent code point, U+0300, +and attach it to the lower case letter a, U+0061, to create the same character à. +In general, a character may be represented by a number of different +sequences of code points, and therefore different sequences of UTF-8 bytes. + +The concept of character in computing is therefore ambiguous, or at least +confusing, so we use it with care. +To make things dependable, there are _normalization_ techniques that guarantee that +a given character is always represented by the same code points, but that +subject takes us too far off the topic for now. +A later blog post will explain how the Go libraries address normalization. + +"Code point" is a bit of a mouthful, so Go introduces a shorter term for the +concept: _rune_. +The term appears in the libraries and source code, and means exactly +the same as "code point", with one interesting addition. + +The Go language defines the word `rune` as an alias for the type `int32`, so +programs can be clear when an integer value represents a code point. +Moreover, what you might think of as a character constant is called a +_rune_constant_ in Go. +The type and value of the expression + + '⌘' + +is `rune` with integer value `0x2318`. + +To summarize, here are the salient points: + +- Go source code is always UTF-8. +- A string holds arbitrary bytes. +- A string literal, absent byte-level escapes, always holds valid UTF-8 sequences. +- Those sequences represent Unicode code points, called runes. +- No guarantee is made in Go that characters in strings are normalized. + + +* Range loops + +Besides the axiomatic detail that Go source code is UTF-8, +there's really only one way that Go treats UTF-8 specially, and that is when using +a `for` `range` loop on a string. + +We've seen what happens with a regular `for` loop. +A `for` `range` loop, by contrast, decodes one UTF-8-encoded rune on each +iteration. +Each time around the loop, the index of the loop is the starting position of the +current rune, measured in bytes, and the code point is its value. +Here's an example using yet another handy `Printf` format, `%#U`, which shows +the code point's Unicode value and its printed representation: + +.play -edit strings/range.go /const/,/}/ + +The output shows how each code point occupies multiple bytes: + + U+65E5 '日' starts at byte position 0 + U+672C '本' starts at byte position 3 + U+8A9E '語' starts at byte position 6 + +[Exercise: Put an invalid UTF-8 byte sequence into the string. (How?) +What happens to the iterations of the loop?] + +* Libraries + +Go's standard library provides strong support for interpreting UTF-8 text. +If a `for` `range` loop isn't sufficient for your purposes, +chances are the facility you need is provided by a package in the library. + +The most important such package is +[[http://golang.org/pkg/unicode/utf8/][`unicode/utf8`]], +which contains +helper routines to validate, disassemble, and reassemble UTF-8 strings. +Here is a program equivalent to the `for` `range` example above, +but using the `DecodeRuneInString` function from that package to +do the work. +The return values from the function are the rune and its width in +UTF-8-encoded bytes. + +.play -edit strings/encoding.go /const/,/}/ + +Run it to see that it performs the same. +The `for` `range` loop and `DecodeRuneInString` are defined to produce +exactly the same iteration sequence. + +Look at the +[[http://golang.org/pkg/unicode/utf8/][documentation]] +for the `unicode/utf8` package to see what +other facilities it provides. + +* Conclusion + +To answer the question posed at the beginning: Strings are built from bytes +so indexing them yields bytes, not characters. +A string might not even hold characters. +In fact, the definition of "character" is ambiguous and it would +be a mistake to try to resolve the ambiguity by defining that strings are made +of characters. + +There's much more to say about Unicode, UTF-8, and the world of multilingual +text processing, but it can wait for another post. +For now, we hope you have a better understanding of how Go strings behave +and that, although they may contain arbitrary bytes, UTF-8 is a central part +of their design. |