author    Russ Cox <rsc@golang.org>  2020-03-15 15:50:36 -0400
committer Russ Cox <rsc@golang.org>  2020-03-17 20:58:46 +0000
commit    972d42d925e6cae3f8eebd9b21d445e06c2eb386 (patch)
tree      737af27f0d49318b612efec874b1d1328c699d1a /content/pprof.article
parent    faf1e2da2d911edc717993e8edb24fe88f99b2b5 (diff)
content: rename articles to reinforce convention of short URLs
The Go blog started out on Blogger
(http://web.archive.org/web/20100325005843/http://blog.golang.org/).
Later, we moved to the current self-hosted blog server with extra
Go-specific functionality like playground snippets.

The old Blogger posts have very long URLs that Blogger chose for us,
such as "go-programming-language-turns-two" or
"two-go-talks-lexical-scanning-in-go-and", predating the convention of
giving posts shorter, more share-friendly, typeable names. The
conversion of the old Blogger posts also predated the convention of
putting supporting files in a subdirectory. The result is that although
we've established new conventions, you wouldn't know by listing the
directory - the old Blogger content presents a conflicting picture.

This commit renames the posts with very long names to have shorter,
more share-friendly names, and it moves all supporting files to
subdirectories. It also adds a README documenting the conventions.

For example, blog.golang.org/go-programming-language-turns-two is now
blog.golang.org/2years, matching our more recent birthday post URLs,
and its supporting files are moved to the new 2years/ directory. The
old URLs redirect to the new ones.

Change-Id: I9f46a790c2c8fab8459aeda73d4e3d2efc86d88f
Reviewed-on: https://go-review.googlesource.com/c/blog/+/223599
Run-TryBot: Russ Cox <rsc@golang.org>
Reviewed-by: Andrew Bonventre <andybons@golang.org>
Diffstat (limited to 'content/pprof.article')
-rw-r--r--  content/pprof.article  656
1 file changed, 656 insertions(+), 0 deletions(-)
diff --git a/content/pprof.article b/content/pprof.article
new file mode 100644
index 0000000..ef79de1
--- /dev/null
+++ b/content/pprof.article
@@ -0,0 +1,656 @@
+# Profiling Go Programs
+24 Jun 2011
+Tags: benchmark, pprof, profiling, technical
+Summary: How to use Go's built-in profiler to understand and optimize your programs.
+OldURL: /profiling-go-programs
+
+Russ Cox, July 2011; updated by Shenghou Ma, May 2013
+
+##
+
+At Scala Days 2011, Robert Hundt presented a paper titled
+[Loop Recognition in C++/Java/Go/Scala.](http://research.google.com/pubs/pub37122.html)
+The paper implemented a specific loop finding algorithm, such as you might use
+in a flow analysis pass of a compiler, in C++, Go, Java, and Scala, and then used
+those programs to draw conclusions about typical performance concerns in these
+languages.
+The Go program presented in that paper runs quite slowly, making it
+an excellent opportunity to demonstrate how to use Go's profiling tools to take
+a slow program and make it faster.
+
+_By using Go's profiling tools to identify and correct specific bottlenecks, we can make the Go loop finding program run an order of magnitude faster and use 6x less memory._
+(Update: Due to recent optimizations of `libstdc++` in `gcc`, the memory reduction is now 3.7x.)
+
+Hundt's paper does not specify which versions of the C++, Go, Java, and Scala
+tools he used.
+In this blog post, we will be using the most recent weekly snapshot of the `6g`
+Go compiler and the version of `g++` that ships with the Ubuntu Natty
+distribution.
+(We will not be using Java or Scala, because we are not skilled at writing efficient
+programs in either of those languages, so the comparison would be unfair.
+Since C++ was the fastest language in the paper, the comparisons here with C++ should
+suffice.)
+(Update: In this updated post, we will be using the most recent development snapshot
+of the Go compiler on amd64 and the most recent version of `g++` -- 4.8.0, which was
+released in March 2013.)
+
+ $ go version
+ go version devel +08d20469cc20 Tue Mar 26 08:27:18 2013 +0100 linux/amd64
+ $ g++ --version
+ g++ (GCC) 4.8.0
+ Copyright (C) 2013 Free Software Foundation, Inc.
+ ...
+ $
+
+The programs are run on a computer with a 3.4GHz Core i7-2600 CPU and 16 GB of
+RAM running Gentoo Linux's 3.8.4-gentoo kernel.
+The machine is running with CPU frequency scaling disabled via
+
+ $ sudo bash
+ # for i in /sys/devices/system/cpu/cpu[0-7]
+ do
+ echo performance > $i/cpufreq/scaling_governor
+ done
+ #
+
+We've taken [Hundt's benchmark programs](https://github.com/hundt98847/multi-language-bench)
+in C++ and Go, combined each into a single source file, and removed all but one
+line of output.
+We'll time the program using Linux's `time` utility with a format that shows user time,
+system time, real time, and maximum memory usage:
+
+ $ cat xtime
+ #!/bin/sh
+ /usr/bin/time -f '%Uu %Ss %er %MkB %C' "$@"
+ $
+
+ $ make havlak1cc
+ g++ -O3 -o havlak1cc havlak1.cc
+ $ ./xtime ./havlak1cc
+ # of loops: 76002 (total 3800100)
+ loop-0, nest: 0, depth: 0
+ 17.70u 0.05s 17.80r 715472kB ./havlak1cc
+ $
+
+ $ make havlak1
+ go build havlak1.go
+ $ ./xtime ./havlak1
+ # of loops: 76000 (including 1 artificial root node)
+ 25.05u 0.11s 25.20r 1334032kB ./havlak1
+ $
+
+The C++ program runs in 17.80 seconds and uses 700 MB of memory.
+The Go program runs in 25.20 seconds and uses 1302 MB of memory.
+(These measurements are difficult to reconcile with the ones in the paper, but the
+point of this post is to explore how to use `go tool pprof`, not to reproduce the
+results from the paper.)
+
+To start tuning the Go program, we have to enable profiling.
+If the code used the [Go testing package](https://golang.org/pkg/testing/)'s
+benchmarking support, we could use `go test`'s standard `-cpuprofile` and `-memprofile`
+flags.
+In a standalone program like this one, we have to import `runtime/pprof` and add a few
+lines of code:
+
+ var cpuprofile = flag.String("cpuprofile", "", "write cpu profile to file")
+
+ func main() {
+ flag.Parse()
+ if *cpuprofile != "" {
+ f, err := os.Create(*cpuprofile)
+ if err != nil {
+ log.Fatal(err)
+ }
+ pprof.StartCPUProfile(f)
+ defer pprof.StopCPUProfile()
+ }
+ ...
+
+The new code defines a flag named `cpuprofile`, calls the
+[Go flag library](https://golang.org/pkg/flag/) to parse the command line flags,
+and then, if the `cpuprofile` flag has been set on the command line,
+[starts CPU profiling](https://golang.org/pkg/runtime/pprof/#StartCPUProfile)
+redirected to that file.
+The profiler requires a final call to
+[`StopCPUProfile`](https://golang.org/pkg/runtime/pprof/#StopCPUProfile) to
+flush any pending writes to the file before the program exits; we use `defer`
+to make sure this happens as `main` returns.
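+
+For completeness, the snippet needs only a handful of imports; assuming
+nothing else in the file, the import block would be:
+
+	import (
+		"flag"
+		"log"
+		"os"
+		"runtime/pprof"
+	)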
+
+After adding that code, we can run the program with the new `-cpuprofile` flag
+and then run `go tool pprof` to interpret the profile.
+
+ $ make havlak1.prof
+ ./havlak1 -cpuprofile=havlak1.prof
+ # of loops: 76000 (including 1 artificial root node)
+ $ go tool pprof havlak1 havlak1.prof
+ Welcome to pprof! For help, type 'help'.
+ (pprof)
+
+The `go tool pprof` program is a slight variant of
+[Google's `pprof` C++ profiler](https://github.com/gperftools/gperftools).
+The most important command is `topN`, which shows the top `N` samples in the profile:
+
+ (pprof) top10
+ Total: 2525 samples
+ 298 11.8% 11.8% 345 13.7% runtime.mapaccess1_fast64
+ 268 10.6% 22.4% 2124 84.1% main.FindLoops
+ 251 9.9% 32.4% 451 17.9% scanblock
+ 178 7.0% 39.4% 351 13.9% hash_insert
+ 131 5.2% 44.6% 158 6.3% sweepspan
+ 119 4.7% 49.3% 350 13.9% main.DFS
+ 96 3.8% 53.1% 98 3.9% flushptrbuf
+ 95 3.8% 56.9% 95 3.8% runtime.aeshash64
+ 95 3.8% 60.6% 101 4.0% runtime.settype_flush
+ 88 3.5% 64.1% 988 39.1% runtime.mallocgc
+
+When CPU profiling is enabled, the Go program stops about 100 times per second
+and records a sample consisting of the program counters on the currently executing
+goroutine's stack.
+The profile has 2525 samples, so it was running for a bit over 25 seconds.
+In the `go tool pprof` output, there is a row for each function that appeared in
+a sample.
+The first two columns show the number of samples in which the function was running
+(as opposed to waiting for a called function to return), as a raw count and as a
+percentage of total samples.
+The `runtime.mapaccess1_fast64` function was running during 298 samples, or 11.8%.
+The `top10` output is sorted by this sample count.
+The third column shows the running total during the listing:
+the first three rows account for 32.4% of the samples.
+The fourth and fifth columns show the number of samples in which the function appeared
+(either running or waiting for a called function to return).
+The `main.FindLoops` function was running in 10.6% of the samples, but it was on the
+call stack (it or functions it called were running) in 84.1% of the samples.
+
+To sort by the fourth and fifth columns, use the `-cum` (for cumulative) flag:
+
+ (pprof) top5 -cum
+ Total: 2525 samples
+ 0 0.0% 0.0% 2144 84.9% gosched0
+ 0 0.0% 0.0% 2144 84.9% main.main
+ 0 0.0% 0.0% 2144 84.9% runtime.main
+ 0 0.0% 0.0% 2124 84.1% main.FindHavlakLoops
+ 268 10.6% 10.6% 2124 84.1% main.FindLoops
+
+In fact the total for `main.FindLoops` and `main.main` should have been 100%, but
+each stack sample only includes the bottom 100 stack frames; during about a quarter
+of the samples, the recursive `main.DFS` function was more than 100 frames deeper
+than `main.main` so the complete trace was truncated.
+
+The stack trace samples contain more interesting data about function call relationships
+than the text listings can show.
+The `web` command writes a graph of the profile data in SVG format and opens it in a web
+browser.
+(There is also a `gv` command that writes PostScript and opens it in Ghostview.
+For either command, you need [graphviz](http://www.graphviz.org/) installed.)
+
+ (pprof) web
+
+A small fragment of
+[the full graph](https://rawgit.com/rsc/benchgraffiti/master/havlak/havlak1.svg) looks like:
+
+.image pprof/havlak1a-75.png
+
+Each box in the graph corresponds to a single function, and the boxes are sized
+according to the number of samples in which the function was running.
+An edge from box X to box Y indicates that X calls Y; the number along the edge is
+the number of times that call appears in a sample.
+If a call appears multiple times in a single sample, such as during recursive function
+calls, each appearance counts toward the edge weight.
+That explains the 21342 on the self-edge from `main.DFS` to itself.
+
+Just at a glance, we can see that the program spends much of its time in hash
+operations, which correspond to use of Go's `map` values.
+We can tell `web` to use only samples that include a specific function, such as
+`runtime.mapaccess1_fast64`, which clears some of the noise from the graph:
+
+ (pprof) web mapaccess1
+
+.image pprof/havlak1-hash_lookup-75.png
+
+If we squint, we can see that the calls to `runtime.mapaccess1_fast64` are being
+made by `main.FindLoops` and `main.DFS`.
+
+Now that we have a rough idea of the big picture, it's time to zoom in on a particular
+function.
+Let's look at `main.DFS` first, just because it is a shorter function:
+
+ (pprof) list DFS
+ Total: 2525 samples
+ ROUTINE ====================== main.DFS in /home/rsc/g/benchgraffiti/havlak/havlak1.go
+ 119 697 Total samples (flat / cumulative)
+ 3 3 240: func DFS(currentNode *BasicBlock, nodes []*UnionFindNode, number map[*BasicBlock]int, last []int, current int) int {
+ 1 1 241: nodes[current].Init(currentNode, current)
+ 1 37 242: number[currentNode] = current
+ . . 243:
+ 1 1 244: lastid := current
+ 89 89 245: for _, target := range currentNode.OutEdges {
+ 9 152 246: if number[target] == unvisited {
+ 7 354 247: lastid = DFS(target, nodes, number, last, lastid+1)
+ . . 248: }
+ . . 249: }
+ 7 59 250: last[number[currentNode]] = lastid
+ 1 1 251: return lastid
+ (pprof)
+
+The listing shows the source code for the `DFS` function (really, for every function
+matching the regular expression `DFS`).
+The first three columns are the number of samples taken while running that line, the
+number of samples taken while running that line or in code called from that line, and
+the line number in the file.
+The related command `disasm` shows a disassembly of the function instead of a source
+listing; when there are enough samples this can help you see which instructions are
+expensive.
+The `weblist` command mixes the two modes: it shows
+[a source listing in which clicking a line shows the disassembly](https://rawgit.com/rsc/benchgraffiti/master/havlak/havlak1.html).
+
+Since we already know that the time is going into map lookups implemented by the
+hash runtime functions, we care most about the second column.
+A large fraction of time is spent in recursive calls to `DFS` (line 247), as would be
+expected from a recursive traversal.
+Excluding the recursion, it looks like the time is going into the accesses to the
+`number` map on lines 242, 246, and 250.
+For that particular lookup, a map is not the most efficient choice.
+Just as they would be in a compiler, the basic block structures have unique sequence
+numbers assigned to them.
+Instead of using a `map[*BasicBlock]int` we can use a `[]int`, a slice indexed by the
+block number.
+There's no reason to use a map when an array or slice will do.
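+
+As a hedged sketch of the change (indexing by the `Name` field, which the
+later listings show holding the block's sequence number):
+
+	number := make(map[*BasicBlock]int) // before: hash table keyed by pointer
+	number[currentNode] = current
+
+becomes
+
+	number := make([]int, size) // after: direct indexing by sequence number
+	number[currentNode.Name] = current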
+
+Changing `number` from a map to a slice requires editing seven lines in the
+program and cuts its run time by nearly a factor of two:
+
+ $ make havlak2
+ go build havlak2.go
+ $ ./xtime ./havlak2
+ # of loops: 76000 (including 1 artificial root node)
+ 16.55u 0.11s 16.69r 1321008kB ./havlak2
+ $
+
+(See the [diff between `havlak1` and `havlak2`](https://github.com/rsc/benchgraffiti/commit/58ac27bcac3ffb553c29d0b3fb64745c91c95948))
+
+We can run the profiler again to confirm that `main.DFS` is no longer a significant
+part of the run time:
+
+ $ make havlak2.prof
+ ./havlak2 -cpuprofile=havlak2.prof
+ # of loops: 76000 (including 1 artificial root node)
+ $ go tool pprof havlak2 havlak2.prof
+ Welcome to pprof! For help, type 'help'.
+ (pprof)
+ (pprof) top5
+ Total: 1652 samples
+ 197 11.9% 11.9% 382 23.1% scanblock
+ 189 11.4% 23.4% 1549 93.8% main.FindLoops
+ 130 7.9% 31.2% 152 9.2% sweepspan
+ 104 6.3% 37.5% 896 54.2% runtime.mallocgc
+ 98 5.9% 43.5% 100 6.1% flushptrbuf
+ (pprof)
+
+The entry `main.DFS` no longer appears in the profile, and the rest of the program
+runtime has dropped too.
+Now the program is spending most of its time allocating memory and garbage collecting
+(`runtime.mallocgc`, which both allocates and runs periodic garbage collections,
+accounts for 54.2% of the time).
+To find out why the garbage collector is running so much, we have to find out what is
+allocating memory.
+One way is to add memory profiling to the program.
+We'll arrange that if the `-memprofile` flag is supplied, the program stops after one
+iteration of the loop finding, writes a memory profile, and exits:
+
+ var memprofile = flag.String("memprofile", "", "write memory profile to this file")
+ ...
+
+ FindHavlakLoops(cfgraph, lsgraph)
+ if *memprofile != "" {
+ f, err := os.Create(*memprofile)
+ if err != nil {
+ log.Fatal(err)
+ }
+ pprof.WriteHeapProfile(f)
+ f.Close()
+ return
+ }
+
+We invoke the program with the `-memprofile` flag to write a profile:
+
+ $ make havlak3.mprof
+ go build havlak3.go
+ ./havlak3 -memprofile=havlak3.mprof
+ $
+
+(See the [diff from havlak2](https://github.com/rsc/benchgraffiti/commit/b78dac106bea1eb3be6bb3ca5dba57c130268232))
+
+We use `go tool pprof` exactly the same way. Now the samples we are examining are
+memory allocations, not clock ticks.
+
+ $ go tool pprof havlak3 havlak3.mprof
+ Adjusting heap profiles for 1-in-524288 sampling rate
+ Welcome to pprof! For help, type 'help'.
+ (pprof) top5
+ Total: 82.4 MB
+ 56.3 68.4% 68.4% 56.3 68.4% main.FindLoops
+ 17.6 21.3% 89.7% 17.6 21.3% main.(*CFG).CreateNode
+ 8.0 9.7% 99.4% 25.6 31.0% main.NewBasicBlockEdge
+ 0.5 0.6% 100.0% 0.5 0.6% itab
+ 0.0 0.0% 100.0% 0.5 0.6% fmt.init
+ (pprof)
+
+The command `go tool pprof` reports that `FindLoops` has allocated approximately
+56.3 of the 82.4 MB in use; `CreateNode` accounts for another 17.6 MB.
+To reduce overhead, the memory profiler only records information for approximately
+one block per half megabyte allocated (the “1-in-524288 sampling rate”), so these
+are approximations to the actual counts.
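+
+If we wanted finer-grained sampling at the cost of extra overhead, we could
+lower `runtime.MemProfileRate` before the allocations of interest; a sketch,
+with the rate chosen arbitrarily:
+
+	import "runtime"
+
+	func init() {
+		// Sample roughly one allocation per 64 kB allocated instead
+		// of the default 512 kB; this must run before the allocations
+		// we want to observe.
+		runtime.MemProfileRate = 64 * 1024
+	}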
+
+To find the memory allocations, we can list those functions.
+
+ (pprof) list FindLoops
+ Total: 82.4 MB
+ ROUTINE ====================== main.FindLoops in /home/rsc/g/benchgraffiti/havlak/havlak3.go
+ 56.3 56.3 Total MB (flat / cumulative)
+ ...
+ 1.9 1.9 268: nonBackPreds := make([]map[int]bool, size)
+ 5.8 5.8 269: backPreds := make([][]int, size)
+ . . 270:
+ 1.9 1.9 271: number := make([]int, size)
+ 1.9 1.9 272: header := make([]int, size, size)
+ 1.9 1.9 273: types := make([]int, size, size)
+ 1.9 1.9 274: last := make([]int, size, size)
+ 1.9 1.9 275: nodes := make([]*UnionFindNode, size, size)
+ . . 276:
+ . . 277: for i := 0; i < size; i++ {
+ 9.5 9.5 278: nodes[i] = new(UnionFindNode)
+ . . 279: }
+ ...
+ . . 286: for i, bb := range cfgraph.Blocks {
+ . . 287: number[bb.Name] = unvisited
+ 29.5 29.5 288: nonBackPreds[i] = make(map[int]bool)
+ . . 289: }
+ ...
+
+It looks like the current bottleneck is the same as the last one: using maps where
+simpler data structures suffice.
+`FindLoops` is allocating about 29.5 MB of maps.
+
+As an aside, if we run `go tool pprof` with the `--inuse_objects` flag, it will
+report allocation counts instead of sizes:
+
+ $ go tool pprof --inuse_objects havlak3 havlak3.mprof
+ Adjusting heap profiles for 1-in-524288 sampling rate
+ Welcome to pprof! For help, type 'help'.
+ (pprof) list FindLoops
+ Total: 1763108 objects
+ ROUTINE ====================== main.FindLoops in /home/rsc/g/benchgraffiti/havlak/havlak3.go
+ 720903 720903 Total objects (flat / cumulative)
+ ...
+ . . 277: for i := 0; i < size; i++ {
+ 311296 311296 278: nodes[i] = new(UnionFindNode)
+ . . 279: }
+ . . 280:
+ . . 281: // Step a:
+ . . 282: // - initialize all nodes as unvisited.
+ . . 283: // - depth-first traversal and numbering.
+ . . 284: // - unreached BB's are marked as dead.
+ . . 285: //
+ . . 286: for i, bb := range cfgraph.Blocks {
+ . . 287: number[bb.Name] = unvisited
+ 409600 409600 288: nonBackPreds[i] = make(map[int]bool)
+ . . 289: }
+ ...
+ (pprof)
+
+Since the ~200,000 maps account for 29.5 MB, it looks like the initial map allocation
+takes about 150 bytes.
+That's reasonable when a map is being used to hold key-value pairs, but not when a map
+is being used as a stand-in for a simple set, as it is here.
+
+Instead of using a map, we can use a simple slice to list the elements.
+In all but one of the cases where maps are being used, it is impossible for the algorithm
+to insert a duplicate element.
+In the one remaining case, we can write a simple variant of the `append` built-in function:
+
+ func appendUnique(a []int, x int) []int {
+ for _, y := range a {
+ if x == y {
+ return a
+ }
+ }
+ return append(a, x)
+ }
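+
+A hypothetical call site (the variable names are ours), replacing an
+insertion into a `map[int]bool`, would read:
+
+	nonBackPreds[w] = appendUnique(nonBackPreds[w], v)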
+
+In addition to writing that function, changing the Go program to use slices instead
+of maps requires changing just a few lines of code.
+
+ $ make havlak4
+ go build havlak4.go
+ $ ./xtime ./havlak4
+ # of loops: 76000 (including 1 artificial root node)
+ 11.84u 0.08s 11.94r 810416kB ./havlak4
+ $
+
+(See the [diff from havlak3](https://github.com/rsc/benchgraffiti/commit/245d899f7b1a33b0c8148a4cd147cb3de5228c8a))
+
+We're now 2.11x faster than when we started. Let's look at a CPU profile again.
+
+ $ make havlak4.prof
+ ./havlak4 -cpuprofile=havlak4.prof
+ # of loops: 76000 (including 1 artificial root node)
+ $ go tool pprof havlak4 havlak4.prof
+ Welcome to pprof! For help, type 'help'.
+ (pprof) top10
+ Total: 1173 samples
+ 205 17.5% 17.5% 1083 92.3% main.FindLoops
+ 138 11.8% 29.2% 215 18.3% scanblock
+ 88 7.5% 36.7% 96 8.2% sweepspan
+ 76 6.5% 43.2% 597 50.9% runtime.mallocgc
+ 75 6.4% 49.6% 78 6.6% runtime.settype_flush
+ 74 6.3% 55.9% 75 6.4% flushptrbuf
+ 64 5.5% 61.4% 64 5.5% runtime.memmove
+ 63 5.4% 66.8% 524 44.7% runtime.growslice
+ 51 4.3% 71.1% 51 4.3% main.DFS
+ 50 4.3% 75.4% 146 12.4% runtime.MCache_Alloc
+ (pprof)
+
+Now memory allocation and the consequent garbage collection (`runtime.mallocgc`)
+account for 50.9% of our run time.
+Another way to look at why the system is garbage collecting is to look at the
+allocations that are causing the collections, the ones that spend most of the time
+in `mallocgc`:
+
+ (pprof) web mallocgc
+
+.image pprof/havlak4a-mallocgc.png
+
+It's hard to tell what's going on in that graph, because there are many nodes with
+small sample numbers obscuring the big ones.
+We can tell `go tool pprof` to ignore nodes that don't account for at least 10% of
+the samples:
+
+ $ go tool pprof --nodefraction=0.1 havlak4 havlak4.prof
+ Welcome to pprof! For help, type 'help'.
+ (pprof) web mallocgc
+
+.image pprof/havlak4a-mallocgc-trim.png
+
+We can follow the thick arrows easily now, to see that `FindLoops` is triggering
+most of the garbage collection.
+If we list `FindLoops` we can see that much of it is right at the beginning:
+
+ (pprof) list FindLoops
+ ...
+ . . 270: func FindLoops(cfgraph *CFG, lsgraph *LSG) {
+ . . 271: if cfgraph.Start == nil {
+ . . 272: return
+ . . 273: }
+ . . 274:
+ . . 275: size := cfgraph.NumNodes()
+ . . 276:
+ . 145 277: nonBackPreds := make([][]int, size)
+ . 9 278: backPreds := make([][]int, size)
+ . . 279:
+ . 1 280: number := make([]int, size)
+ . 17 281: header := make([]int, size, size)
+ . . 282: types := make([]int, size, size)
+ . . 283: last := make([]int, size, size)
+ . . 284: nodes := make([]*UnionFindNode, size, size)
+ . . 285:
+ . . 286: for i := 0; i < size; i++ {
+ 2 79 287: nodes[i] = new(UnionFindNode)
+ . . 288: }
+ ...
+ (pprof)
+
+Every time `FindLoops` is called, it allocates some sizable bookkeeping structures.
+Since the benchmark calls `FindLoops` 50 times, these add up to a significant amount
+of garbage, so a significant amount of work for the garbage collector.
+
+Having a garbage-collected language doesn't mean you can ignore memory allocation
+issues.
+In this case, a simple solution is to introduce a cache so that each call to `FindLoops`
+reuses the previous call's storage when possible.
+(In fact, in Hundt's paper, he explains that the Java program needed just this change to
+get anything like reasonable performance, but he did not make the same change in the
+other garbage-collected implementations.)
+
+We'll add a global `cache` structure:
+
+ var cache struct {
+ size int
+ nonBackPreds [][]int
+ backPreds [][]int
+ number []int
+ header []int
+ types []int
+ last []int
+ nodes []*UnionFindNode
+ }
+
+and then have `FindLoops` consult it as a replacement for allocation:
+
+ if cache.size < size {
+ cache.size = size
+ cache.nonBackPreds = make([][]int, size)
+ cache.backPreds = make([][]int, size)
+ cache.number = make([]int, size)
+ cache.header = make([]int, size)
+ cache.types = make([]int, size)
+ cache.last = make([]int, size)
+ cache.nodes = make([]*UnionFindNode, size)
+ for i := range cache.nodes {
+ cache.nodes[i] = new(UnionFindNode)
+ }
+ }
+
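+	// Reslice the cached slices to the needed length; truncating each
+	// inner slice to [:0] below keeps its backing array for reuse.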
+ nonBackPreds := cache.nonBackPreds[:size]
+ for i := range nonBackPreds {
+ nonBackPreds[i] = nonBackPreds[i][:0]
+ }
+ backPreds := cache.backPreds[:size]
+	for i := range backPreds {
+ backPreds[i] = backPreds[i][:0]
+ }
+ number := cache.number[:size]
+ header := cache.header[:size]
+ types := cache.types[:size]
+ last := cache.last[:size]
+ nodes := cache.nodes[:size]
+
+Such a global variable is bad engineering practice, of course: it means that
+concurrent calls to `FindLoops` are now unsafe.
+For now, we are making the minimal possible changes in order to understand what
+is important for the performance of our program; this change is simple and mirrors
+the code in the Java implementation.
+The final version of the Go program will use a separate `LoopFinder` instance to
+track this memory, restoring the possibility of concurrent use.
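+
+A hedged sketch of that shape (the real `havlak6.go` differs in detail):
+
+	// Each LoopFinder owns its scratch memory, so separate instances
+	// can be used concurrently.
+	type LoopFinder struct {
+		size         int
+		nonBackPreds [][]int
+		backPreds    [][]int
+		number       []int
+		header       []int
+		types        []int
+		last         []int
+		nodes        []*UnionFindNode
+	}
+
+	func (f *LoopFinder) FindLoops(cfgraph *CFG, lsgraph *LSG) {
+		// ... the algorithm above, consulting f's fields instead of
+		// the global cache ...
+	}
+
+Because the scratch memory hangs off an instance rather than a package-level
+variable, two finders no longer share state.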
+
+ $ make havlak5
+ go build havlak5.go
+ $ ./xtime ./havlak5
+ # of loops: 76000 (including 1 artificial root node)
+ 8.03u 0.06s 8.11r 770352kB ./havlak5
+ $
+
+(See the [diff from havlak4](https://github.com/rsc/benchgraffiti/commit/2d41d6d16286b8146a3f697dd4074deac60d12a4))
+
+There's more we can do to clean up the program and make it faster, but none of
+it requires profiling techniques that we haven't already shown.
+The work list used in the inner loop can be reused across iterations and across
+calls to `FindLoops`, and it can be combined with the separate “node pool” generated
+during that pass.
+Similarly, the loop graph storage can be reused on each iteration instead of reallocated.
+In addition to these performance changes, the
+[final version](https://github.com/rsc/benchgraffiti/blob/master/havlak/havlak6.go)
+is written using idiomatic Go style, using data structures and methods.
+The stylistic changes have only a minor effect on the run time: the algorithm and
+constraints are unchanged.
+
+The final version runs in 2.29 seconds and uses 351 MB of memory:
+
+ $ make havlak6
+ go build havlak6.go
+ $ ./xtime ./havlak6
+ # of loops: 76000 (including 1 artificial root node)
+ 2.26u 0.02s 2.29r 360224kB ./havlak6
+ $
+
+That's 11 times faster than the program we started with.
+Even if we disable reuse of the generated loop graph, so that the only cached memory
+is the loop finding bookkeeping, the program still runs 6.7x faster than the original
+and uses 1.5x less memory.
+
+ $ ./xtime ./havlak6 -reuseloopgraph=false
+ # of loops: 76000 (including 1 artificial root node)
+ 3.69u 0.06s 3.76r 797120kB ./havlak6 -reuseloopgraph=false
+ $
+
+Of course, it's no longer fair to compare this Go program to the original C++
+program, which used inefficient data structures like `set`s where `vector`s would
+be more appropriate.
+As a sanity check, we translated the final Go program into
+[equivalent C++ code](https://github.com/rsc/benchgraffiti/blob/master/havlak/havlak6.cc).
+Its execution time is similar to the Go program's:
+
+ $ make havlak6cc
+ g++ -O3 -o havlak6cc havlak6.cc
+ $ ./xtime ./havlak6cc
+ # of loops: 76000 (including 1 artificial root node)
+ 1.99u 0.19s 2.19r 387936kB ./havlak6cc
+
+The Go program runs almost as fast as the C++ program.
+Because the C++ program uses automatic deletes and allocation instead of an
+explicit cache, it is a bit shorter and easier to write, but not dramatically so:
+
+ $ wc havlak6.cc; wc havlak6.go
+ 401 1220 9040 havlak6.cc
+ 461 1441 9467 havlak6.go
+ $
+
+(See [havlak6.cc](https://github.com/rsc/benchgraffiti/blob/master/havlak/havlak6.cc)
+and [havlak6.go](https://github.com/rsc/benchgraffiti/blob/master/havlak/havlak6.go))
+
+Benchmarks are only as good as the programs they measure.
+We used `go tool pprof` to study an inefficient Go program and then to improve its
+performance by an order of magnitude and to reduce its memory usage by a factor of 3.7.
+A subsequent comparison with an equivalently optimized C++ program shows that Go can be
+competitive with C++ when programmers are careful about how much garbage is generated
+by inner loops.
+
+The program sources, Linux x86-64 binaries, and profiles used to write this post
+are available in the [benchgraffiti project on GitHub](https://github.com/rsc/benchgraffiti/).
+
+As mentioned above, [`go test`](https://golang.org/cmd/go/#Test_packages) includes
+these profiling flags already: define a
+[benchmark function](https://golang.org/pkg/testing/) and you're all set.
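+
+For example, a hypothetical benchmark for this program (the setup helper
+`buildTestGraphs` is our invention) might look like:
+
+	func BenchmarkHavlak(b *testing.B) {
+		cfgraph, lsgraph := buildTestGraphs() // hypothetical setup
+		b.ResetTimer()
+		for i := 0; i < b.N; i++ {
+			FindHavlakLoops(cfgraph, lsgraph)
+		}
+	}
+
+Running `go test -bench=. -cpuprofile=cpu.prof` writes a profile that
+`go tool pprof` can read directly.
+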
+There is also a standard HTTP interface to profiling data. In an HTTP server, adding
+
+ import _ "net/http/pprof"
+
+will install handlers for a few URLs under `/debug/pprof/`.
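+
+A minimal sketch of such a server (the port matches the commands below; the
+blank import's side effect registers the handlers on the default mux):
+
+	package main
+
+	import (
+		"log"
+		"net/http"
+		_ "net/http/pprof" // registers /debug/pprof/ handlers
+	)
+
+	func main() {
+		log.Fatal(http.ListenAndServe("localhost:6060", nil))
+	}
+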
+Then you can run `go tool pprof` with a single argument, the URL to your
+server's profiling data, and it will download and examine a live profile.
+
+ go tool pprof http://localhost:6060/debug/pprof/profile # 30-second CPU profile
+ go tool pprof http://localhost:6060/debug/pprof/heap # heap profile
+ go tool pprof http://localhost:6060/debug/pprof/block # goroutine blocking profile
+
+The goroutine blocking profile will be explained in a future post. Stay tuned.