diff --git a/content/gob.article b/content/gob.article
new file mode 100644
index 0000000..3b5c405
--- /dev/null
+++ b/content/gob.article
@@ -0,0 +1,304 @@
+# Gobs of data
+24 Mar 2011
+Tags: gob, json, protobuf, xml, technical
+Summary: Introducing gob, a high-speed Go-to-Go wire encoding format.
+OldURL: /gobs-of-data
+
+Rob Pike
+
+## Introduction
+
+To transmit a data structure across a network or to store it in a file,
+it must be encoded and then decoded again.
+There are many encodings available, of course:
+[JSON](http://www.json.org/), [XML](http://www.w3.org/XML/),
+Google's [protocol buffers](http://code.google.com/p/protobuf), and more.
+And now there's another, provided by Go's [gob](https://golang.org/pkg/encoding/gob/) package.
+
+Why define a new encoding? It's a lot of work and redundant at that.
+Why not just use one of the existing formats? Well,
+for one thing, we do!
+Go has [packages](https://golang.org/pkg/) supporting all the encodings
+just mentioned (the [protocol buffer package](http://github.com/golang/protobuf)
+is in a separate repository but it's one of the most frequently downloaded).
+And for many purposes, including communicating with tools and systems written in other languages,
+they're the right choice.
+
+But for a Go-specific environment, such as communicating between two servers written in Go,
+there's an opportunity to build something much easier to use and possibly more efficient.
+
+Gobs work with the language in a way that an externally-defined,
+language-independent encoding cannot.
+At the same time, there are lessons to be learned from the existing systems.
+
+## Goals
+
+The gob package was designed with a number of goals in mind.
+
+First, and most obvious, it had to be very easy to use.
+Because Go has reflection, there is no need for a separate interface
+definition language or "protocol compiler".
+The data structure itself is all the package should need to figure out how
+to encode and decode it.
+On the other hand, this approach means that gobs will never work as well
+with other languages, but that's OK:
+gobs are unashamedly Go-centric.
+
+Efficiency is also important. Textual representations,
+exemplified by XML and JSON, are too slow to put at the center of an efficient
+communications network.
+A binary encoding is necessary.
+
+Gob streams must be self-describing. Each gob stream,
+read from the beginning, contains sufficient information that the entire
+stream can be parsed by an agent that knows nothing a priori about its contents.
+This property means that you will always be able to decode a gob stream stored in a file,
+even long after you've forgotten what data it represents.
+
+There were also some things to learn from our experiences with Google protocol buffers.
+
+## Protocol buffer misfeatures
+
+Protocol buffers had a major effect on the design of gobs,
+but have three features that were deliberately avoided.
+(Leaving aside the property that protocol buffers aren't self-describing:
+if you don't know the data definition used to encode a protocol buffer,
+you might not be able to parse it.)
+
+First, protocol buffers only work on the data type we call a struct in Go.
+You can't encode an integer or array at the top level,
+only a struct with fields inside it.
+That seems a pointless restriction, at least in Go.
+If all you want to send is an array of integers,
+why should you have to put it into a struct first?
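+
+With gob there is no such restriction: a plain slice of integers can be encoded
+and decoded at the top level.
+Here is a minimal sketch (the names are only illustrative):
+
+	package main
+
+	import (
+		"bytes"
+		"encoding/gob"
+		"fmt"
+		"log"
+	)
+
+	func main() {
+		var network bytes.Buffer // Stand-in for a network connection.
+		// Encode a plain slice of integers; no wrapping struct is required.
+		if err := gob.NewEncoder(&network).Encode([]int{1, 2, 3}); err != nil {
+			log.Fatal("encode error:", err)
+		}
+		// Decode it back into a slice.
+		var nums []int
+		if err := gob.NewDecoder(&network).Decode(&nums); err != nil {
+			log.Fatal("decode error:", err)
+		}
+		fmt.Println(nums) // [1 2 3]
+	}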
+
+Next, a protocol buffer definition may specify that fields `T.x` and `T.y`
+are required to be present whenever a value of type `T` is encoded or decoded.
+Although such required fields may seem like a good idea,
+they are costly to implement because the codec must maintain a separate
+data structure while encoding and decoding,
+to be able to report when required fields are missing.
+They're also a maintenance problem. Over time,
+one may want to modify the data definition to remove a required field,
+but that may cause existing clients of the data to crash.
+It's better not to have them in the encoding at all.
+(Protocol buffers also have optional fields.
+But if we don't have required fields, all fields are optional and that's that.
+There will be more to say about optional fields a little later.)
+
+The third protocol buffer misfeature is default values.
+If a protocol buffer omits the value for a "defaulted" field,
+then the decoded structure behaves as if the field were set to that value.
+This idea works nicely when you have getter and setter methods to control
+access to the field,
+but is harder to handle cleanly when the container is just a plain idiomatic struct.
+Default values are also tricky to implement:
+where does one define the default values,
+what types do they have (is text UTF-8? uninterpreted bytes? how many bits
+in a float?) and despite the apparent simplicity,
+there were a number of complications in their design and implementation
+for protocol buffers.
+We decided to leave them out of gobs and fall back to Go's trivial but effective defaulting rule:
+unless you set something otherwise, it has the "zero value" for that type -
+and it doesn't need to be transmitted.
+
+So gobs end up looking like a sort of generalized, simplified protocol buffer. How do they work?
+
+## Values
+
+The encoded gob data isn't about types like `int8` and `uint16`.
+Instead, somewhat analogous to constants in Go,
+its integer values are abstract, sizeless numbers,
+either signed or unsigned.
+When you encode an `int8`, its value is transmitted as an unsized,
+variable-length integer.
+When you encode an `int64`, its value is also transmitted as an unsized,
+variable-length integer.
+(Signed and unsigned are treated distinctly,
+but the same unsized-ness applies to unsigned values too.) If both have the value 7,
+the bits sent on the wire will be identical.
+When the receiver decodes that value, it puts it into the receiver's variable,
+which may be of arbitrary integer type.
+Thus an encoder may send a 7 that came from an `int8`,
+but the receiver may store it in an `int64`.
+This is fine: the value is an integer and as long as it fits, everything works.
+(If it doesn't fit, an error results.) This decoupling from the size of
+the variable gives some flexibility to the encoding:
+we can expand the type of the integer variable as the software evolves,
+but still be able to decode old data.
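+
+Here is a minimal sketch of that decoupling: a value encoded from an `int8`
+is decoded into an `int64` (the variable names are only illustrative):
+
+	package main
+
+	import (
+		"bytes"
+		"encoding/gob"
+		"fmt"
+		"log"
+	)
+
+	func main() {
+		var network bytes.Buffer // Stand-in for a network connection.
+		enc := gob.NewEncoder(&network)
+		dec := gob.NewDecoder(&network)
+		var small int8 = 7
+		if err := enc.Encode(small); err != nil { // sent as an unsized signed integer
+			log.Fatal("encode error:", err)
+		}
+		var big int64
+		if err := dec.Decode(&big); err != nil { // received into a larger integer type
+			log.Fatal("decode error:", err)
+		}
+		fmt.Println(big) // 7
+	}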
+
+This flexibility also applies to pointers.
+Before transmission, all pointers are flattened.
+Values of type `int8`, `*int8`, `**int8`,
+`****int8`, etc. are all transmitted as an integer value,
+which may then be stored in `int` of any size,
+or `*int`, or `******int`, etc.
+Again, this allows for flexibility.
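+
+Reusing the `enc` and `dec` from the sketch above, a doubly indirect field can be
+received into a plain integer field
+(the type names `Wire` and `Flat` are made up for illustration):
+
+	type Wire struct{ N **int8 } // two levels of indirection on the sending side
+	type Flat struct{ N int64 }  // a plain integer on the receiving side
+
+	v := int8(7)
+	p := &v
+	if err := enc.Encode(Wire{N: &p}); err != nil {
+		log.Fatal("encode error:", err)
+	}
+	var f Flat
+	if err := dec.Decode(&f); err != nil {
+		log.Fatal("decode error:", err)
+	}
+	fmt.Println(f.N) // 7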
+
+Flexibility also happens because, when decoding a struct,
+only those fields that are sent by the encoder are stored in the destination. Given the value
+
+ type T struct{ X, Y, Z int } // Only exported fields are encoded and decoded.
+ var t = T{X: 7, Y: 0, Z: 8}
+
+the encoding of `t` sends only the 7 and 8.
+Because it's zero, the value of `Y` isn't even sent;
+there's no need to send a zero value.
+
+The receiver could instead decode the value into this structure:
+
+ type U struct{ X, Y *int8 } // Note: pointers to int8s
+ var u U
+
+and acquire a value of `u` with only `X` set (to the address of an `int8` variable set to 7);
+the `Z` field is ignored - where would you put it? When decoding structs,
+fields are matched by name and compatible type,
+and only fields that exist in both are affected.
+This simple approach finesses the "optional field" problem:
+as the type `T` evolves by adding fields,
+out of date receivers will still function with the part of the type they recognize.
+Thus gobs provide the important result of optional fields - extensibility -
+without any additional mechanism or notation.
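+
+Here is the `T`-to-`U` exchange above written out as a small, self-contained sketch:
+
+	package main
+
+	import (
+		"bytes"
+		"encoding/gob"
+		"fmt"
+		"log"
+	)
+
+	type T struct{ X, Y, Z int }
+	type U struct{ X, Y *int8 }
+
+	func main() {
+		var network bytes.Buffer // Stand-in for a network connection.
+		if err := gob.NewEncoder(&network).Encode(T{X: 7, Y: 0, Z: 8}); err != nil {
+			log.Fatal("encode error:", err)
+		}
+		var u U
+		if err := gob.NewDecoder(&network).Decode(&u); err != nil {
+			log.Fatal("decode error:", err)
+		}
+		// X was sent and stored through a freshly allocated pointer;
+		// Y was zero, so it was never transmitted and stays nil;
+		// Z was sent but has no matching field, so it is dropped.
+		fmt.Println(*u.X, u.Y) // 7 <nil>
+	}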
+
+From integers we can build all the other types:
+bytes, strings, arrays, slices, maps, even floats.
+Floating-point values are represented by their IEEE 754 floating-point bit pattern,
+stored as an integer, which works fine as long as you know their type, which we always do.
+By the way, that integer is sent in byte-reversed order because common values
+of floating-point numbers,
+such as small integers, have a lot of zeros at the low end that we can avoid transmitting.
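+
+The following sketch only illustrates why the reversal helps;
+it is not the gob implementation itself.
+The bit pattern of a small value such as 2.0 has all its zeros at the low end,
+and reversing the bytes moves them to the high end, where they need not be sent:
+
+	package main
+
+	import (
+		"fmt"
+		"math"
+		"math/bits"
+	)
+
+	func main() {
+		b := math.Float64bits(2.0)
+		fmt.Printf("%016x\n", b)                      // 4000000000000000: zeros at the low end
+		fmt.Printf("%016x\n", bits.ReverseBytes64(b)) // 0000000000000040: zeros now lead and can be dropped
+	}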
+
+One nice feature of gobs that Go makes possible is that they allow you to
+define your own encoding by having your type satisfy the [GobEncoder](https://golang.org/pkg/encoding/gob/#GobEncoder)
+and [GobDecoder](https://golang.org/pkg/encoding/gob/#GobDecoder) interfaces,
+in a manner analogous to the [JSON](https://golang.org/pkg/encoding/json/)
+package's [Marshaler](https://golang.org/pkg/encoding/json/#Marshaler)
+and [Unmarshaler](https://golang.org/pkg/encoding/json/#Unmarshaler) and
+also to the [Stringer](https://golang.org/pkg/fmt/#Stringer) interface
+from [package fmt](https://golang.org/pkg/fmt/).
+This facility makes it possible to represent special features,
+enforce constraints, or hide secrets when you transmit data.
+See the [documentation](https://golang.org/pkg/encoding/gob/) for details.
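+
+As an illustrative sketch (the `Celsius` type and its string format are invented for this example),
+a type can take over its own representation by implementing both interfaces:
+
+	package main
+
+	import (
+		"bytes"
+		"encoding/gob"
+		"fmt"
+		"log"
+	)
+
+	// Celsius transmits itself as a decimal string rather than as a
+	// floating-point bit pattern.
+	type Celsius float64
+
+	func (c Celsius) GobEncode() ([]byte, error) {
+		return []byte(fmt.Sprintf("%g", float64(c))), nil
+	}
+
+	func (c *Celsius) GobDecode(b []byte) error {
+		_, err := fmt.Sscanf(string(b), "%g", (*float64)(c))
+		return err
+	}
+
+	func main() {
+		var network bytes.Buffer // Stand-in for a network connection.
+		if err := gob.NewEncoder(&network).Encode(Celsius(36.6)); err != nil {
+			log.Fatal("encode error:", err)
+		}
+		var c Celsius
+		if err := gob.NewDecoder(&network).Decode(&c); err != nil {
+			log.Fatal("decode error:", err)
+		}
+		fmt.Println(c) // 36.6
+	}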
+
+## Types on the wire
+
+The first time you send a given type, the gob package includes in the data
+stream a description of that type.
+In fact, what happens is that the encoder is used to encode,
+in the standard gob encoding format, an internal struct that describes the
+type and gives it a unique number.
+(Basic types, plus the layout of the type description structure,
+are predefined by the software for bootstrapping.) After the type is described,
+it can be referenced by its type number.
+
+Thus when we send our first type `T`, the gob encoder sends a description
+of `T` and tags it with a type number, say 127.
+All values, including the first, are then prefixed by that number,
+so a stream of `T` values looks like:
+
+ ("define type id" 127, definition of type T)(127, T value)(127, T value), ...
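+
+One way to observe this is to watch how much the stream grows with each value;
+only the first `Encode` carries the definition of `T`.
+The exact byte counts depend on the encoding, so this sketch just prints them:
+
+	package main
+
+	import (
+		"bytes"
+		"encoding/gob"
+		"fmt"
+		"log"
+	)
+
+	type T struct{ X, Y, Z int }
+
+	func main() {
+		var network bytes.Buffer // Stand-in for a network connection.
+		enc := gob.NewEncoder(&network)
+		if err := enc.Encode(T{7, 0, 8}); err != nil {
+			log.Fatal("encode error:", err)
+		}
+		fmt.Println("after first value: ", network.Len(), "bytes") // includes the description of T
+		if err := enc.Encode(T{7, 0, 8}); err != nil {
+			log.Fatal("encode error:", err)
+		}
+		fmt.Println("after second value:", network.Len(), "bytes") // just the value, tagged with T's type number
+	}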
+
+These type numbers make it possible to describe recursive types and send
+values of those types.
+Thus gobs can encode types such as trees:
+
+ type Node struct {
+ Value int
+ Left, Right *Node
+ }
+
+(It's an exercise for the reader to discover how the zero-defaulting rule makes this work,
+even though gobs don't represent pointers.)
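+
+As a small sketch, a three-node tree round-trips like any other value:
+
+	package main
+
+	import (
+		"bytes"
+		"encoding/gob"
+		"fmt"
+		"log"
+	)
+
+	type Node struct {
+		Value       int
+		Left, Right *Node
+	}
+
+	func main() {
+		var network bytes.Buffer // Stand-in for a network connection.
+		tree := &Node{Value: 2, Left: &Node{Value: 1}, Right: &Node{Value: 3}}
+		if err := gob.NewEncoder(&network).Encode(tree); err != nil {
+			log.Fatal("encode error:", err)
+		}
+		var got Node
+		if err := gob.NewDecoder(&network).Decode(&got); err != nil {
+			log.Fatal("decode error:", err)
+		}
+		fmt.Println(got.Left.Value, got.Value, got.Right.Value) // 1 2 3
+	}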
+
+With the type information, a gob stream is fully self-describing except
+for the set of bootstrap types,
+which is a well-defined starting point.
+
+## Compiling a machine
+
+The first time you encode a value of a given type,
+the gob package builds a little interpreted machine specific to that data type.
+It uses reflection on the type to construct that machine,
+but once the machine is built it does not depend on reflection.
+The machine uses package unsafe and some trickery to convert the data into
+the encoded bytes at high speed.
+It could use reflection and avoid unsafe,
+but would be significantly slower.
+(A similar high-speed approach is taken by the protocol buffer support for Go,
+whose design was influenced by the implementation of gobs.) Subsequent values
+of the same type use the already-compiled machine,
+so they can be encoded right away.
+
+[Update: As of Go 1.4, package unsafe is no longer used by the gob package, with a modest performance drop.]
+
+Decoding is similar but harder. When you decode a value,
+the gob package holds a byte slice representing a value of a given encoder-defined type to decode,
+plus a Go value into which to decode it.
+The gob package builds a machine for that pair:
+the gob type sent on the wire crossed with the Go type provided for decoding.
+Once that decoding machine is built, though,
+it's again a reflectionless engine that uses unsafe methods to get maximum speed.
+
+## Use
+
+There's a lot going on under the hood, but the result is an efficient,
+easy-to-use encoding system for transmitting data.
+Here's a complete example showing differing encoded and decoded types.
+Note how easy it is to send and receive values;
+all you need to do is present values and variables to the [gob package](https://golang.org/pkg/encoding/gob/)
+and it does all the work.
+
+ package main
+
+ import (
+ "bytes"
+ "encoding/gob"
+ "fmt"
+ "log"
+ )
+
+ type P struct {
+ X, Y, Z int
+ Name string
+ }
+
+ type Q struct {
+ X, Y *int32
+ Name string
+ }
+
+ func main() {
+ // Initialize the encoder and decoder. Normally enc and dec would be
+ // bound to network connections and the encoder and decoder would
+ // run in different processes.
+ var network bytes.Buffer // Stand-in for a network connection
+ enc := gob.NewEncoder(&network) // Will write to network.
+ dec := gob.NewDecoder(&network) // Will read from network.
+ // Encode (send) the value.
+ err := enc.Encode(P{3, 4, 5, "Pythagoras"})
+ if err != nil {
+ log.Fatal("encode error:", err)
+ }
+ // Decode (receive) the value.
+ var q Q
+ err = dec.Decode(&q)
+ if err != nil {
+ log.Fatal("decode error:", err)
+ }
+ fmt.Printf("%q: {%d,%d}\n", q.Name, *q.X, *q.Y)
+ }
+
+You can compile and run this example code in the [Go Playground](http://play.golang.org/p/_-OJV-rwMq).
+
+The [rpc package](https://golang.org/pkg/net/rpc/) builds on gobs to turn
+this encode/decode automation into transport for method calls across the network.
+That's a subject for another article.
+
+## Details
+
+The [gob package documentation](https://golang.org/pkg/encoding/gob/),
+especially the file [doc.go](https://golang.org/src/pkg/encoding/gob/doc.go),
+expands on many of the details described here and includes a full worked
+example showing how the encoding represents data.
+If you are interested in the innards of the gob implementation,
+that's a good place to start.