Commit 9108ae77 authored by Austin Clements

runtime: eliminate mark 2 and fix mark termination race

The mark 2 phase was originally introduced as a way to reduce the
chance of entering STW mark termination while there was still marking
work to do. It works by flushing and disabling all local work caches
so that all enqueued work becomes immediately globally visible.
However, mark 2 is not only slow (disabling caches makes both marking
and the write barrier much more expensive) but also imperfect. There
is still a rare but possible race (~once per all.bash) that can cause
GC to enter mark termination while there is still marking work. This
race is detailed at
https://github.com/golang/proposal/blob/master/design/17503-eliminate-rescan.md#appendix-mark-completion-race
As a result, mark termination must still cope with the possibility
that work remains after the concurrent mark phase. Dealing with this
both lengthens the STW pause and adds complexity to mark termination.
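
For illustration, the flush-and-disable step amounts to something like
the following sketch (hypothetical Go only; P, localWork, globalWork,
and cachingDisabled stand in for the runtime's per-P gcWork caches and
global work list, and are not real runtime APIs):

    type P struct {
        localWork       []uintptr // privately cached grey pointers
        cachingDisabled bool
    }

    var globalWork []uintptr // work visible to all workers

    // enterMark2 flushes and disables every P's local cache.
    func enterMark2(ps []*P) {
        for _, p := range ps {
            globalWork = append(globalWork, p.localWork...) // flush
            p.localWork = nil
            // With caching disabled, every later enqueue goes straight
            // to the global queue; this is why marking and the write
            // barrier get much more expensive in mark 2.
            p.cachingDisabled = true
        }
    }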

Furthermore, a similar but far more likely race can cause an early
transition from mark 1 to mark 2. This is unfortunate because the
cost of mark 2 makes performance unstable.

This CL fixes both problems by replacing mark 2 with a distributed
termination detection algorithm. This algorithm is correct, so it
eliminates the mark termination race, and it doesn't require
disabling local caches. It
ensures that there are no grey objects upon entering mark termination.
With this change, we're one step closer to eliminating marking from
mark termination entirely (it's still used by STW GC and checkmarks
mode).
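
In outline, the new completion check works like the sketch below
(illustrative Go only, simplified from gcMarkDone in the diff below;
it reuses the hypothetical P, localWork, and globalWork names from
the sketch above, allP and busyWorkers are likewise stand-ins, and
the markDoneSema serialization used by the real code is omitted):

    var (
        allP        []*P
        busyWorkers int // workers still actively draining mark work
    )

    // tryMarkTermination reports whether GC may enter mark termination.
    func tryMarkTermination() bool {
        for {
            // Check the completion condition: no globally visible work
            // and no worker still draining.
            if len(globalWork) != 0 || busyWorkers != 0 {
                return false // still marking; try again later
            }
            // Ragged barrier: every P publishes its privately cached
            // work and reports whether it actually had any.
            flushed := 0
            for _, p := range allP {
                if len(p.localWork) != 0 {
                    globalWork = append(globalWork, p.localWork...)
                    p.localWork = nil
                    flushed++
                }
            }
            if flushed == 0 {
                // No P was hiding work, so there are no grey objects
                // and the write barrier cannot create new ones.
                return true
            }
            // Some P published new grey objects; loop and re-check
            // (the new global work fails the check above and marking
            // resumes).
        }
    }

Unlike mark 2, a failed check simply resumes concurrent marking, so it
can be retried cheaply without disabling caches.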

This CL does not eliminate the gcBlackenPromptly global flag, though
it is always set to false now. It will be removed in a cleanup CL.

This led to only minor variations in the go1 benchmarks
(https://perf.golang.org/search?q=upload:20180909.1) and compilebench
benchmarks (https://perf.golang.org/search?q=upload:20180910.2).

This significantly improves performance of the garbage benchmark, with
no impact on STW times:

name                        old time/op    new time/op   delta
Garbage/benchmem-MB=64-12    2.21ms ± 1%   2.05ms ± 1%   -7.38% (p=0.000 n=18+19)
Garbage/benchmem-MB=1024-12  2.30ms ±16%   2.20ms ± 7%   -4.51% (p=0.001 n=20+20)

name                        old STW-ns/GC  new STW-ns/GC  delta
Garbage/benchmem-MB=64-12      138k ±44%     141k ±23%     ~    (p=0.309 n=19+20)
Garbage/benchmem-MB=1024-12    159k ±25%     178k ±98%     ~    (p=0.798 n=16+18)

name                        old STW-ns/op  new STW-ns/op                delta
Garbage/benchmem-MB=64-12     4.42k ±44%    4.24k ±23%     ~    (p=0.531 n=19+20)
Garbage/benchmem-MB=1024-12     591 ±24%      636 ±111%    ~    (p=0.309 n=16+18)

(https://perf.golang.org/search?q=upload:20180910.1)

Updates #26903.
Updates #17503.

Change-Id: Icbd1e12b7a12a76f423c9bf033b13cb363e4cd19
Reviewed-on: https://go-review.googlesource.com/c/134318
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
parent a2a2901b
@@ -574,8 +574,8 @@ func BenchmarkWriteBarrier(b *testing.B) {
n := &node{mkTree(level - 1), mkTree(level - 1)}
if level == 10 {
// Seed GC with enough early pointers so it
// doesn't accidentally switch to mark 2 when
// it only has the top of the tree.
// doesn't start termination barriers when it
// only has the top of the tree.
wbRoots = append(wbRoots, n)
}
return n
@@ -28,8 +28,7 @@
// b. Sweep any unswept spans. There will only be unswept spans if
// this GC cycle was forced before the expected time.
//
// 2. GC performs the "mark 1" sub-phase. In this sub-phase, Ps are
// allowed to locally cache parts of the work queue.
// 2. GC performs the mark phase.
//
// a. Prepare for the mark phase by setting gcphase to _GCmark
// (from _GCoff), enabling the write barrier, enabling mutator
@@ -54,28 +53,21 @@
// object to black and shading all pointers found in the object
// (which in turn may add those pointers to the work queue).
//
// 3. Once the global work queue is empty (but local work queue caches
// may still contain work), GC performs the "mark 2" sub-phase.
// e. Because GC work is spread across local caches, GC uses a
// distributed termination algorithm to detect when there are no
// more root marking jobs or grey objects (see gcMarkDone). At this
// point, GC transitions to mark termination.
//
// a. GC stops all workers, disables local work queue caches,
// flushes each P's local work queue cache to the global work queue
// cache, and reenables workers.
//
// b. GC again drains the work queue, as in 2d above.
//
// 4. Once the work queue is empty, GC performs mark termination.
// 3. GC performs mark termination.
//
// a. Stop the world.
//
// b. Set gcphase to _GCmarktermination, and disable workers and
// assists.
//
// c. Drain any remaining work from the work queue (typically there
// will be none).
//
// d. Perform other housekeeping like flushing mcaches.
// c. Perform housekeeping like flushing mcaches.
//
// 5. GC performs the sweep phase.
// 4. GC performs the sweep phase.
//
// a. Prepare for the sweep phase by setting gcphase to _GCoff,
// setting up sweep state and disabling the write barrier.
@@ -86,7 +78,7 @@
// c. GC does concurrent sweeping in the background and in response
// to allocation. See description below.
//
// 6. When sufficient allocation has taken place, replay the sequence
// 5. When sufficient allocation has taken place, replay the sequence
// starting with 1 above. See discussion of GC rate below.
// Concurrent sweep.
@@ -996,8 +988,7 @@ var work struct {
// startSema protects the transition from "off" to mark or
// mark termination.
startSema uint32
// markDoneSema protects transitions from mark 1 to mark 2 and
// from mark 2 to mark termination.
// markDoneSema protects transitions from mark to mark termination.
markDoneSema uint32
bgMarkReady note // signal background mark worker has started
@@ -1385,93 +1376,86 @@ func gcStart(mode gcMode, trigger gcTrigger) {
semrelease(&work.startSema)
}
// gcMarkDone transitions the GC from mark 1 to mark 2 and from mark 2
// to mark termination.
// gcMarkDoneFlushed counts the number of P's with flushed work.
//
// Ideally this would be a captured local in gcMarkDone, but forEachP
// escapes its callback closure, so it can't capture anything.
//
// This is protected by markDoneSema.
var gcMarkDoneFlushed uint32
// gcMarkDone transitions the GC from mark to mark termination if all
// reachable objects have been marked (that is, there are no grey
// objects and can be no more in the future). Otherwise, it flushes
// all local work to the global queues where it can be discovered by
// other workers.
//
// This should be called when all mark work has been drained. In mark
// 1, this includes all root marking jobs, global work buffers, and
// active work buffers in assists and background workers; however,
// work may still be cached in per-P work buffers. In mark 2, per-P
// caches are disabled.
// This should be called when all local mark work has been drained and
// there are no remaining workers. Specifically, when
//
// work.nwait == work.nproc && !gcMarkWorkAvailable(p)
//
// The calling context must be preemptible.
//
// Note that it is explicitly okay to have write barriers in this
// function because completion of concurrent mark is best-effort
// anyway. Any work created by write barriers here will be cleaned up
// by mark termination.
// Flushing local work is important because idle Ps may have local
// work queued. This is the only way to make that work visible and
// drive GC to completion.
//
// It is explicitly okay to have write barriers in this function. If
// it does transition to mark termination, then all reachable objects
// have been marked, so the write barrier cannot shade any more
// objects.
func gcMarkDone() {
top:
// Ensure only one thread is running the ragged barrier at a
// time.
semacquire(&work.markDoneSema)
top:
// Re-check transition condition under transition lock.
//
// It's critical that this checks the global work queues are
// empty before performing the ragged barrier. Otherwise,
// there could be global work that a P could take after the P
// has passed the ragged barrier.
if !(gcphase == _GCmark && work.nwait == work.nproc && !gcMarkWorkAvailable(nil)) {
semrelease(&work.markDoneSema)
return
}
// Disallow starting new workers so that any remaining workers
// in the current mark phase will drain out.
//
// TODO(austin): Should dedicated workers keep an eye on this
// and exit gcDrain promptly?
atomic.Xaddint64(&gcController.dedicatedMarkWorkersNeeded, -0xffffffff)
prevFractionalGoal := gcController.fractionalUtilizationGoal
gcController.fractionalUtilizationGoal = 0
if !gcBlackenPromptly {
// Transition from mark 1 to mark 2.
//
// The global work list is empty, but there can still be work
// sitting in the per-P work caches.
// Flush and disable work caches.
// Disallow caching workbufs and indicate that we're in mark 2.
gcBlackenPromptly = true
// Prevent completion of mark 2 until we've flushed
// cached workbufs.
atomic.Xadd(&work.nwait, -1)
// GC is set up for mark 2. Let Gs blocked on the
// transition lock go while we flush caches.
semrelease(&work.markDoneSema)
// Flush all local buffers and collect flushedWork flags.
gcMarkDoneFlushed = 0
systemstack(func() {
// Flush all currently cached workbufs and
// ensure all Ps see gcBlackenPromptly. This
// also blocks until any remaining mark 1
// workers have exited their loop so we can
// start new mark 2 workers.
forEachP(func(_p_ *p) {
// Flush the write barrier buffer, since this may add
// work to the gcWork.
wbBufFlush1(_p_)
// Flush the gcWork, since this may create global work
// and set the flushedWork flag.
//
// TODO(austin): Break up these workbufs to
// better distribute work.
_p_.gcw.dispose()
// Collect the flushedWork flag.
if _p_.gcw.flushedWork {
atomic.Xadd(&gcMarkDoneFlushed, 1)
_p_.gcw.flushedWork = false
}
})
})
// Check that roots are marked. We should be able to
// do this before the forEachP, but based on issue
// #16083 there may be a (harmless) race where we can
// enter mark 2 while some workers are still scanning
// stacks. The forEachP ensures these scans are done.
//
// TODO(austin): Figure out the race and fix this
// properly.
gcMarkRootCheck()
// Now we can start up mark 2 workers.
atomic.Xaddint64(&gcController.dedicatedMarkWorkersNeeded, 0xffffffff)
gcController.fractionalUtilizationGoal = prevFractionalGoal
incnwait := atomic.Xadd(&work.nwait, +1)
if incnwait == work.nproc && !gcMarkWorkAvailable(nil) {
// This loop will make progress because
// gcBlackenPromptly is now true, so it won't
// take this same "if" branch.
if gcMarkDoneFlushed != 0 {
// More grey objects were discovered since the
// previous termination check, so there may be more
// work to do. Keep going. It's possible the
// transition condition became true again during the
// ragged barrier, so re-check it.
goto top
}
} else {
// Transition to mark termination.
// There was no global work, no local work, and no Ps
// communicated work since we took markDoneSema. Therefore
// there are no grey objects and no more objects can be
// shaded. Transition to mark termination.
now := nanotime()
work.tMarkTerm = now
work.pauseStart = now
@@ -1500,13 +1484,13 @@ top:
// world again.
semrelease(&work.markDoneSema)
// endCycle depends on all gcWork cache stats being
// flushed. This is ensured by mark 2.
// endCycle depends on all gcWork cache stats being flushed.
// The termination algorithm above ensured this, up to any
// allocations made since the ragged barrier.
nextTriggerRatio := gcController.endCycle()
// Perform mark termination. This will restart the world.
gcMarkTermination(nextTriggerRatio)
}
}
func gcMarkTermination(nextTriggerRatio float64) {
@@ -1940,23 +1924,23 @@ func gcMark(start_time int64) {
if work.full == 0 && work.nDataRoots+work.nBSSRoots+work.nSpanRoots+work.nStackRoots == 0 {
// There's no work on the work queue and no root jobs
// that can produce work, so don't bother entering the
// getfull() barrier.
//
// This will be the situation the vast majority of the
// time after concurrent mark. However, we still need
// a fallback for STW GC and because there are some
// known races that occasionally leave work around for
// mark termination.
// getfull() barrier. There will be flushCacheRoots
// work, but that doesn't gray anything.
//
// We're still hedging our bets here: if we do
// accidentally produce some work, we'll still process
// it, just not necessarily in parallel.
//
// TODO(austin): Fix the races and remove
// work draining from mark termination so we don't
// need the fallback path.
// This should always be the situation after
// concurrent mark.
work.helperDrainBlock = false
} else {
// There's marking work to do. This is the case during
// STW GC and in checkmark mode. Instruct GC workers
// to block in getfull until all GC workers are in getfull.
//
// TODO(austin): Move STW and checkmark marking out of
// mark termination and eliminate this code path.
if !useCheckmark && debug.gcstoptheworld == 0 && debug.gcrescanstacks == 0 {
print("runtime: full=", hex(work.full), " nDataRoots=", work.nDataRoots, " nBSSRoots=", work.nBSSRoots, " nSpanRoots=", work.nSpanRoots, " nStackRoots=", work.nStackRoots, "\n")
panic("non-empty mark queue after concurrent mark")
}
work.helperDrainBlock = true
}
@@ -1991,16 +1975,35 @@ func gcMark(start_time int64) {
// Record that at least one root marking pass has completed.
work.markrootDone = true
// Double-check that all gcWork caches are empty. This should
// be ensured by mark 2 before we enter mark termination.
// Clear out buffers and double-check that all gcWork caches
// are empty. This should be ensured by gcMarkDone before we
// enter mark termination.
//
// TODO: We could clear out buffers just before mark if this
// has a non-negligible impact on STW time.
for _, p := range allp {
// The write barrier may have buffered pointers since
// the gcMarkDone barrier. However, since the barrier
// ensured all reachable objects were marked, all of
// these must be pointers to black objects. Hence we
// can just discard the write barrier buffer.
if debug.gccheckmark > 0 {
// For debugging, flush the buffer and make
// sure it really was all marked.
wbBufFlush1(p)
} else {
p.wbBuf.reset()
}
gcw := &p.gcw
if !gcw.empty() {
throw("P has cached GC work at end of mark termination")
}
if gcw.scanWork != 0 || gcw.bytesMarked != 0 {
throw("P has unflushed stats at end of mark termination")
}
// There may still be cached empty buffers, which we
// need to flush since we're going to free them. Also,
// there may be non-zero stats because we allocated
// black after the gcMarkDone barrier.
gcw.dispose()
}
cachestats()
@@ -107,6 +107,11 @@ func (b *wbBuf) discard() {
b.next = uintptr(unsafe.Pointer(&b.buf[0]))
}
// empty returns true if b contains no pointers.
func (b *wbBuf) empty() bool {
return b.next == uintptr(unsafe.Pointer(&b.buf[0]))
}
// putFast adds old and new to the write barrier buffer and returns
// false if a flush is necessary. Callers should use this as:
//