runtime: use heap scan size as estimate of GC scan work

Currently, the GC uses a moving average of recent scan work ratios to estimate the total scan work required by this cycle. This is in turn used to compute how much scan work should be done by mutators when they allocate in order to perform all expected scan work by the time the allocated heap reaches the heap goal. However, our current scan work estimate can be arbitrarily wrong if the heap topography changes significantly from one cycle to the next. For example, in the go1 benchmarks, at the beginning of each benchmark, the heap is dominated by a 256MB no-scan object, so the GC learns that the scan density of the heap is very low. In benchmarks that then rapidly allocate pointer-dense objects, by the time of the next GC cycle, our estimate of the scan work can be too low by a large factor. This in turn lets the mutator allocate faster than the GC can collect, allowing it to get arbitrarily far ahead of the scan work estimate, which leads to very long GC cycles with very little mutator assist that can overshoot the heap goal by large margins. This is particularly easy to demonstrate with BinaryTree17: $ GODEBUG=gctrace=1 ./go1.test -test.bench BinaryTree17 gc #1 @0.017s 2%: 0+0+0+0+0 ms clock, 0+0+0+0/0/0+0 ms cpu, 4->262->262 MB, 4 MB goal, 1 P gc #2 @0.026s 3%: 0+0+0+0+0 ms clock, 0+0+0+0/0/0+0 ms cpu, 262->262->262 MB, 524 MB goal, 1 P testing: warning: no tests to run PASS BenchmarkBinaryTree17 gc #3 @1.906s 0%: 0+0+0+0+7 ms clock, 0+0+0+0/0/0+7 ms cpu, 325->325->287 MB, 325 MB goal, 1 P (forced) gc #4 @12.203s 20%: 0+0+0+10067+10 ms clock, 0+0+0+0/2523/852+10 ms cpu, 430->2092->1950 MB, 574 MB goal, 1 P 1 9150447353 ns/op Change this estimate to instead use the *current* scannable heap size. This has the advantage of being based solely on the current state of the heap, not on past densities or reachable heap sizes, so it isn't susceptible to falling behind during these sorts of phase changes. This is strictly an over-estimate, but it's better to over-estimate and get more assist than necessary than it is to under-estimate and potentially spiral out of control. Experiments with scaling this estimate back showed no obvious benefit for mutator utilization, heap size, or assist time. This new estimate has little effect for most benchmarks, including most go1 benchmarks, x/benchmarks, and the 6g benchmark. It has a huge effect for benchmarks that triggered the bad pacer behavior: name old mean new mean delta BinaryTree17 10.0s × (1.00,1.00) 3.5s × (0.98,1.01) -64.93% (p=0.000) Fannkuch11 2.74s × (1.00,1.01) 2.65s × (1.00,1.00) -3.52% (p=0.000) FmtFprintfEmpty 56.4ns × (0.99,1.00) 57.8ns × (1.00,1.01) +2.43% (p=0.000) FmtFprintfString 187ns × (0.99,1.00) 185ns × (0.99,1.01) -1.19% (p=0.010) FmtFprintfInt 184ns × (1.00,1.00) 183ns × (1.00,1.00) (no variance) FmtFprintfIntInt 321ns × (1.00,1.00) 315ns × (1.00,1.00) -1.80% (p=0.000) FmtFprintfPrefixedInt 266ns × (1.00,1.00) 263ns × (1.00,1.00) -1.22% (p=0.000) FmtFprintfFloat 353ns × (1.00,1.00) 353ns × (1.00,1.00) -0.13% (p=0.035) FmtManyArgs 1.21µs × (1.00,1.00) 1.19µs × (1.00,1.00) -1.33% (p=0.000) GobDecode 9.69ms × (1.00,1.00) 9.59ms × (1.00,1.00) -1.07% (p=0.000) GobEncode 7.89ms × (0.99,1.01) 7.74ms × (1.00,1.00) -1.92% (p=0.000) Gzip 391ms × (1.00,1.00) 392ms × (1.00,1.00) ~ (p=0.522) Gunzip 97.1ms × (1.00,1.00) 97.0ms × (1.00,1.00) -0.10% (p=0.000) HTTPClientServer 55.7µs × (0.99,1.01) 56.7µs × (0.99,1.01) +1.81% (p=0.001) JSONEncode 19.1ms × (1.00,1.00) 19.0ms × (1.00,1.00) -0.85% (p=0.000) JSONDecode 66.8ms × (1.00,1.00) 66.9ms × (1.00,1.00) ~ (p=0.288) Mandelbrot200 4.13ms × (1.00,1.00) 4.12ms × (1.00,1.00) -0.08% (p=0.000) GoParse 3.97ms × (1.00,1.01) 4.01ms × (1.00,1.00) +0.99% (p=0.000) RegexpMatchEasy0_32 114ns × (1.00,1.00) 115ns × (0.99,1.00) ~ (p=0.070) RegexpMatchEasy0_1K 376ns × (1.00,1.00) 376ns × (1.00,1.00) ~ (p=0.900) RegexpMatchEasy1_32 94.9ns × (1.00,1.00) 96.3ns × (1.00,1.01) +1.53% (p=0.001) RegexpMatchEasy1_1K 568ns × (1.00,1.00) 567ns × (1.00,1.00) -0.22% (p=0.001) RegexpMatchMedium_32 159ns × (1.00,1.00) 159ns × (1.00,1.00) ~ (p=0.178) RegexpMatchMedium_1K 46.4µs × (1.00,1.00) 46.6µs × (1.00,1.00) +0.29% (p=0.000) RegexpMatchHard_32 2.37µs × (1.00,1.00) 2.37µs × (1.00,1.00) ~ (p=0.722) RegexpMatchHard_1K 71.1µs × (1.00,1.00) 71.2µs × (1.00,1.00) ~ (p=0.229) Revcomp 565ms × (1.00,1.00) 562ms × (1.00,1.00) -0.52% (p=0.000) Template 81.0ms × (1.00,1.00) 80.2ms × (1.00,1.00) -0.97% (p=0.000) TimeParse 380ns × (1.00,1.00) 380ns × (1.00,1.00) ~ (p=0.148) TimeFormat 405ns × (0.99,1.00) 385ns × (0.99,1.00) -5.00% (p=0.000) Change-Id: I11274158bf3affaf62662e02de7af12d5fb789e4 Reviewed-on: https://go-review.googlesource.com/9696Reviewed-by: Russ Cox <rsc@golang.org> Run-TryBot: Austin Clements <austin@google.com>

runtime: use heap scan size as estimate of GC scan work
Currently, the GC uses a moving average of recent scan work ratios to estimate the total scan work required by this cycle. This is in turn used to compute how much scan work should be done by mutators when they allocate in order to perform all expected scan work by the time the allocated heap reaches the heap goal. However, our current scan work estimate can be arbitrarily wrong if the heap topography changes significantly from one cycle to the next. For example, in the go1 benchmarks, at the beginning of each benchmark, the heap is dominated by a 256MB no-scan object, so the GC learns that the scan density of the heap is very low. In benchmarks that then rapidly allocate pointer-dense objects, by the time of the next GC cycle, our estimate of the scan work can be too low by a large factor. This in turn lets the mutator allocate faster than the GC can collect, allowing it to get arbitrarily far ahead of the scan work estimate, which leads to very long GC cycles with very little mutator assist that can overshoot the heap goal by large margins. This is particularly easy to demonstrate with BinaryTree17: $ GODEBUG=gctrace=1 ./go1.test -test.bench BinaryTree17 gc #1 @0.017s 2%: 0+0+0+0+0 ms clock, 0+0+0+0/0/0+0 ms cpu, 4->262->262 MB, 4 MB goal, 1 P gc #2 @0.026s 3%: 0+0+0+0+0 ms clock, 0+0+0+0/0/0+0 ms cpu, 262->262->262 MB, 524 MB goal, 1 P testing: warning: no tests to run PASS BenchmarkBinaryTree17 gc #3 @1.906s 0%: 0+0+0+0+7 ms clock, 0+0+0+0/0/0+7 ms cpu, 325->325->287 MB, 325 MB goal, 1 P (forced) gc #4 @12.203s 20%: 0+0+0+10067+10 ms clock, 0+0+0+0/2523/852+10 ms cpu, 430->2092->1950 MB, 574 MB goal, 1 P 1 9150447353 ns/op Change this estimate to instead use the *current* scannable heap size. This has the advantage of being based solely on the current state of the heap, not on past densities or reachable heap sizes, so it isn't susceptible to falling behind during these sorts of phase changes. This is strictly an over-estimate, but it's better to over-estimate and get more assist than necessary than it is to under-estimate and potentially spiral out of control. Experiments with scaling this estimate back showed no obvious benefit for mutator utilization, heap size, or assist time. This new estimate has little effect for most benchmarks, including most go1 benchmarks, x/benchmarks, and the 6g benchmark. It has a huge effect for benchmarks that triggered the bad pacer behavior: name old mean new mean delta BinaryTree17 10.0s × (1.00,1.00) 3.5s × (0.98,1.01) -64.93% (p=0.000) Fannkuch11 2.74s × (1.00,1.01) 2.65s × (1.00,1.00) -3.52% (p=0.000) FmtFprintfEmpty 56.4ns × (0.99,1.00) 57.8ns × (1.00,1.01) +2.43% (p=0.000) FmtFprintfString 187ns × (0.99,1.00) 185ns × (0.99,1.01) -1.19% (p=0.010) FmtFprintfInt 184ns × (1.00,1.00) 183ns × (1.00,1.00) (no variance) FmtFprintfIntInt 321ns × (1.00,1.00) 315ns × (1.00,1.00) -1.80% (p=0.000) FmtFprintfPrefixedInt 266ns × (1.00,1.00) 263ns × (1.00,1.00) -1.22% (p=0.000) FmtFprintfFloat 353ns × (1.00,1.00) 353ns × (1.00,1.00) -0.13% (p=0.035) FmtManyArgs 1.21µs × (1.00,1.00) 1.19µs × (1.00,1.00) -1.33% (p=0.000) GobDecode 9.69ms × (1.00,1.00) 9.59ms × (1.00,1.00) -1.07% (p=0.000) GobEncode 7.89ms × (0.99,1.01) 7.74ms × (1.00,1.00) -1.92% (p=0.000) Gzip 391ms × (1.00,1.00) 392ms × (1.00,1.00) ~ (p=0.522) Gunzip 97.1ms × (1.00,1.00) 97.0ms × (1.00,1.00) -0.10% (p=0.000) HTTPClientServer 55.7µs × (0.99,1.01) 56.7µs × (0.99,1.01) +1.81% (p=0.001) JSONEncode 19.1ms × (1.00,1.00) 19.0ms × (1.00,1.00) -0.85% (p=0.000) JSONDecode 66.8ms × (1.00,1.00) 66.9ms × (1.00,1.00) ~ (p=0.288) Mandelbrot200 4.13ms × (1.00,1.00) 4.12ms × (1.00,1.00) -0.08% (p=0.000) GoParse 3.97ms × (1.00,1.01) 4.01ms × (1.00,1.00) +0.99% (p=0.000) RegexpMatchEasy0_32 114ns × (1.00,1.00) 115ns × (0.99,1.00) ~ (p=0.070) RegexpMatchEasy0_1K 376ns × (1.00,1.00) 376ns × (1.00,1.00) ~ (p=0.900) RegexpMatchEasy1_32 94.9ns × (1.00,1.00) 96.3ns × (1.00,1.01) +1.53% (p=0.001) RegexpMatchEasy1_1K 568ns × (1.00,1.00) 567ns × (1.00,1.00) -0.22% (p=0.001) RegexpMatchMedium_32 159ns × (1.00,1.00) 159ns × (1.00,1.00) ~ (p=0.178) RegexpMatchMedium_1K 46.4µs × (1.00,1.00) 46.6µs × (1.00,1.00) +0.29% (p=0.000) RegexpMatchHard_32 2.37µs × (1.00,1.00) 2.37µs × (1.00,1.00) ~ (p=0.722) RegexpMatchHard_1K 71.1µs × (1.00,1.00) 71.2µs × (1.00,1.00) ~ (p=0.229) Revcomp 565ms × (1.00,1.00) 562ms × (1.00,1.00) -0.52% (p=0.000) Template 81.0ms × (1.00,1.00) 80.2ms × (1.00,1.00) -0.97% (p=0.000) TimeParse 380ns × (1.00,1.00) 380ns × (1.00,1.00) ~ (p=0.148) TimeFormat 405ns × (0.99,1.00) 385ns × (0.99,1.00) -5.00% (p=0.000) Change-Id: I11274158bf3affaf62662e02de7af12d5fb789e4 Reviewed-on: https://go-review.googlesource.com/9696Reviewed-by: Russ Cox <rsc@golang.org> Run-TryBot: Austin Clements <austin@google.com>
17db6e04 · Austin Clements · 3be3cbd5 · 17db6e04
Commit 17db6e04 authored May 04, 2015 by Austin Clements
Hide whitespace changes
Inline Side-by-side

Showing with 15 additions and 38 deletions

mgc.go src/runtime/mgc.go +15 -38

No files found.
--- a/src/runtime/mgc.go
+++ b/src/runtime/mgc.go
@@ -238,13 +238,6 @@ const (
 // GOMAXPROCS. The high-level design of this algorithm is documented
 // at http://golang.org/s/go15gcpacing.
 var gcController = gcControllerState{
-	// Initial work ratio guess.
-	//
-	// TODO(austin): This is based on the work ratio of the
-	// compiler on ./all.bash. Run a wider variety of programs and
-	// see what their work ratios are.
-	workRatioAvg: 0.5 / float64(ptrSize),
-
 	// Initial trigger ratio guess.
 	triggerRatio: 7 / 8.0,
 }
@@ -254,6 +247,10 @@ type gcControllerState struct {
 	// is updated atomically during the cycle. Updates may be
 	// batched arbitrarily, since the value is only read at the
 	// end of the cycle.
+	//
+	// Currently this is the bytes of heap scanned. For most uses,
+	// this is an opaque unit of work, but for estimation the
+	// definition is important.
 	scanWork int64

 	// bgScanCredit is the scan work credit accumulated by the
@@ -299,10 +296,6 @@ type gcControllerState struct {
 	// dedicated mark workers get started.
 	dedicatedMarkWorkersNeeded int64

-	// workRatioAvg is a moving average of the scan work ratio
-	// (scan work per byte marked).
-	workRatioAvg float64
-
 	// assistRatio is the ratio of allocated bytes to scan work
 	// that should be performed by mutator assists. This is
 	// computed at the beginning of each cycle.
@@ -399,21 +392,16 @@ func (c *gcControllerState) startCycle() {
 // improved estimates. This should be called periodically during
 // concurrent mark.
 func (c *gcControllerState) revise() {
-	// Estimate the size of the marked heap. We don't have much to
-	// go on, so at the beginning of the cycle this uses the
-	// marked heap size from last cycle. If the reachable heap has
-	// grown since last cycle, we'll eventually mark more than
-	// this and we can revise our estimate. This way, if we
-	// overshoot our initial estimate, the assist ratio will climb
-	// smoothly and put more pressure on mutator assists to finish
-	// the cycle.
-	heapMarkedEstimate := memstats.heap_marked
-	if heapMarkedEstimate < work.bytesMarked {
-		heapMarkedEstimate = work.bytesMarked
-	}
-
-	// Compute the expected work based on this estimate.
-	scanWorkExpected := uint64(float64(heapMarkedEstimate) * c.workRatioAvg)
+	// Compute the expected scan work. This is a strict upper
+	// bound on the possible scan work in the current heap.
+	//
+	// You might consider dividing this by 2 (or by
+	// (100+GOGC)/100) to counter this over-estimation, but
+	// benchmarks show that this has almost no effect on mean
+	// mutator utilization, heap size, or assist time and it
+	// introduces the danger of under-estimating and letting the
+	// mutator outpace the garbage collector.
+	scanWorkExpected := memstats.heap_scan

 	// Compute the mutator assist ratio so by the time the mutator
 	// allocates the remaining heap bytes up to next_gc, it will
@@ -443,9 +431,6 @@ func (c *gcControllerState) endCycle() {
 	// transient changes. Values near 1 may be unstable.
 	const triggerGain = 0.5

-	// EWMA weight given to this cycle's scan work ratio.
-	const workRatioWeight = 0.75
-
 	// Stop the revise timer
 	deltimer(&c.reviseTimer)

@@ -484,12 +469,6 @@ func (c *gcControllerState) endCycle() {
 		c.triggerRatio = goalGrowthRatio * 0.95
 	}

-	// Compute the scan work ratio for this cycle.
-	workRatio := float64(c.scanWork) / float64(work.bytesMarked)
-
-	// Update EWMA of recent scan work ratios.
-	c.workRatioAvg = workRatioWeight*workRatio + (1-workRatioWeight)*c.workRatioAvg
-
 	if debug.gcpacertrace > 0 {
 		// Print controller state in terms of the design
 		// document.
@@ -502,14 +481,12 @@ func (c *gcControllerState) endCycle() {
 		u_a := utilization
 		u_g := gcGoalUtilization
 		W_a := c.scanWork
-		w_a := workRatio
-		w_ewma := c.workRatioAvg
 		print("pacer: H_m_prev=", H_m_prev,
 			" h_t=", h_t, " H_T=", H_T,
 			" h_a=", h_a, " H_a=", H_a,
 			" h_g=", h_g, " H_g=", H_g,
 			" u_a=", u_a, " u_g=", u_g,
-			" W_a=", W_a, " w_a=", w_a, " w_ewma=", w_ewma,
+			" W_a=", W_a,
 			" goalΔ=", goalGrowthRatio-h_t,
 			" actualΔ=", h_a-h_t,
 			" u_a/u_g=", u_a/u_g,