runtime: use separate count and note for forEachP

Currently, forEachP reuses the stopwait and stopnote fields from stopTheWorld to track how many Ps have not responded to the safe-point request and to sleep until all Ps have responded. It was assumed this was safe because both stopTheWorld and forEachP must occur under the worlsema and hence stopwait and stopnote cannot be used for both purposes simultaneously and callers could always determine the appropriate use based on sched.gcwaiting (which is only set by stopTheWorld). However, this is not the case, since it's possible for there to be a window between when an M observes that gcwaiting is set and when it checks stopwait during which stopwait could have changed meanings. When this happens, the M decrements stopwait and may wakeup stopnote, but does not otherwise participate in the forEachP protocol. As a result, stopwait is decremented too many times, so it may reach zero before all Ps have run the safe-point function, causing forEachP to wake up early. It will then either observe that some P has not run the safe-point function and panic with "P did not run fn", or the remaining P (or Ps) will run the safe-point function before it wakes up and it will observe that stopwait is negative and panic with "not stopped". Fix this problem by giving forEachP its own safePointWait and safePointNote fields. One known sequence of events that can cause this race is as follows. It involves three actors: G1 is running on M1 on P1. P1 has an empty run queue. G2/M2 is in a blocked syscall and has lost its P. (The details of this don't matter, it just needs to be in a position where it needs to grab an idle P.) GC just started on G3/M3/P3. (These aren't very involved, they just have to be separate from the other G's, M's, and P's.) 1. GC calls stopTheWorld(), which sets sched.gcwaiting to 1. Now G1/M1 begins to enter a syscall: 2. G1/M1 invokes reentersyscall, which sets the P1's status to _Psyscall. 3. G1/M1's reentersyscall observes gcwaiting != 0 and calls entersyscall_gcwait. 4. G1/M1's entersyscall_gcwait blocks acquiring sched.lock. Back on GC: 5. stopTheWorld cas's P1's status to _Pgcstop, does other stuff, and returns. 6. GC does stuff and then calls startTheWorld(). 7. startTheWorld() calls procresize(), which sets P1's status to _Pidle and puts P1 on the idle list. Now G2/M2 returns from its syscall and takes over P1: 8. G2/M2 returns from its blocked syscall and gets P1 from the idle list. 9. G2/M2 acquires P1, which sets P1's status to _Prunning. 10. G2/M2 starts a new syscall and invokes reentersyscall, which sets P1's status to _Psyscall. Back on G1/M1: 11. G1/M1 finally acquires sched.lock in entersyscall_gcwait. At this point, G1/M1 still thinks it's running on P1. P1's status is _Psyscall, which is consistent with what G1/M1 is doing, but it's _Psyscall because *G2/M2* put it in to _Psyscall, not G1/M1. This is basically an ABA race on P1's status. Because forEachP currently shares stopwait with stopTheWorld. G1/M1's entersyscall_gcwait observes the non-zero stopwait set by forEachP, but mistakes it for a stopTheWorld. It cas's P1's status from _Psyscall (set by G2/M2) to _Pgcstop and proceeds to decrement stopwait one more time than forEachP was expecting. Fixes #10618. (See the issue for details on why the above race is safe when forEachP is not involved.) Prior to this commit, the command stress ./runtime.test -test.run TestFutexsleep\|TestGoroutineProfile would reliably fail after a few hundred runs. With this commit, it ran for over 2 million runs and never crashed. Change-Id: I9a91ea20035b34b6e5f07ef135b144115f281f30 Reviewed-on: https://go-review.googlesource.com/10157Reviewed-by: Russ Cox <rsc@golang.org>

runtime: use separate count and note for forEachP
Currently, forEachP reuses the stopwait and stopnote fields from stopTheWorld to track how many Ps have not responded to the safe-point request and to sleep until all Ps have responded. It was assumed this was safe because both stopTheWorld and forEachP must occur under the worlsema and hence stopwait and stopnote cannot be used for both purposes simultaneously and callers could always determine the appropriate use based on sched.gcwaiting (which is only set by stopTheWorld). However, this is not the case, since it's possible for there to be a window between when an M observes that gcwaiting is set and when it checks stopwait during which stopwait could have changed meanings. When this happens, the M decrements stopwait and may wakeup stopnote, but does not otherwise participate in the forEachP protocol. As a result, stopwait is decremented too many times, so it may reach zero before all Ps have run the safe-point function, causing forEachP to wake up early. It will then either observe that some P has not run the safe-point function and panic with "P did not run fn", or the remaining P (or Ps) will run the safe-point function before it wakes up and it will observe that stopwait is negative and panic with "not stopped". Fix this problem by giving forEachP its own safePointWait and safePointNote fields. One known sequence of events that can cause this race is as follows. It involves three actors: G1 is running on M1 on P1. P1 has an empty run queue. G2/M2 is in a blocked syscall and has lost its P. (The details of this don't matter, it just needs to be in a position where it needs to grab an idle P.) GC just started on G3/M3/P3. (These aren't very involved, they just have to be separate from the other G's, M's, and P's.) 1. GC calls stopTheWorld(), which sets sched.gcwaiting to 1. Now G1/M1 begins to enter a syscall: 2. G1/M1 invokes reentersyscall, which sets the P1's status to _Psyscall. 3. G1/M1's reentersyscall observes gcwaiting != 0 and calls entersyscall_gcwait. 4. G1/M1's entersyscall_gcwait blocks acquiring sched.lock. Back on GC: 5. stopTheWorld cas's P1's status to _Pgcstop, does other stuff, and returns. 6. GC does stuff and then calls startTheWorld(). 7. startTheWorld() calls procresize(), which sets P1's status to _Pidle and puts P1 on the idle list. Now G2/M2 returns from its syscall and takes over P1: 8. G2/M2 returns from its blocked syscall and gets P1 from the idle list. 9. G2/M2 acquires P1, which sets P1's status to _Prunning. 10. G2/M2 starts a new syscall and invokes reentersyscall, which sets P1's status to _Psyscall. Back on G1/M1: 11. G1/M1 finally acquires sched.lock in entersyscall_gcwait. At this point, G1/M1 still thinks it's running on P1. P1's status is _Psyscall, which is consistent with what G1/M1 is doing, but it's _Psyscall because *G2/M2* put it in to _Psyscall, not G1/M1. This is basically an ABA race on P1's status. Because forEachP currently shares stopwait with stopTheWorld. G1/M1's entersyscall_gcwait observes the non-zero stopwait set by forEachP, but mistakes it for a stopTheWorld. It cas's P1's status from _Psyscall (set by G2/M2) to _Pgcstop and proceeds to decrement stopwait one more time than forEachP was expecting. Fixes #10618. (See the issue for details on why the above race is safe when forEachP is not involved.) Prior to this commit, the command stress ./runtime.test -test.run TestFutexsleep\|TestGoroutineProfile would reliably fail after a few hundred runs. With this commit, it ran for over 2 million runs and never crashed. Change-Id: I9a91ea20035b34b6e5f07ef135b144115f281f30 Reviewed-on: https://go-review.googlesource.com/10157Reviewed-by: Russ Cox <rsc@golang.org>
f0dd0028 · Austin Clements · 277acca2 · f0dd0028 · f0dd0028
Commit f0dd0028 authored May 15, 2015 by Austin Clements
Hide whitespace changes
Inline Side-by-side

Showing with 18 additions and 16 deletions

proc1.go src/runtime/proc1.go +15 -15

runtime2.go src/runtime/runtime2.go +3 -1

No files found.
--- a/src/runtime/proc1.go
+++ b/src/runtime/proc1.go
@@ -790,10 +790,10 @@ func forEachP(fn func(*p)) {
 	_p_ := getg().m.p.ptr()

 	lock(&sched.lock)
-	if sched.stopwait != 0 {
-		throw("forEachP: sched.stopwait != 0")
+	if sched.safePointWait != 0 {
+		throw("forEachP: sched.safePointWait != 0")
 	}
-	sched.stopwait = gomaxprocs - 1
+	sched.safePointWait = gomaxprocs - 1
 	sched.safePointFn = fn

 	// Ask all Ps to run the safe point function.
@@ -813,11 +813,11 @@ func forEachP(fn func(*p)) {
 	for p := sched.pidle.ptr(); p != nil; p = p.link.ptr() {
 		if cas(&p.runSafePointFn, 1, 0) {
 			fn(p)
-			sched.stopwait--
+			sched.safePointWait--
 		}
 	}

-	wait := sched.stopwait > 0
+	wait := sched.safePointWait > 0
 	unlock(&sched.lock)

 	// Run fn for the current P.
@@ -843,15 +843,15 @@ func forEachP(fn func(*p)) {
 		for {
 			// Wait for 100us, then try to re-preempt in
 			// case of any races.
-			if notetsleep(&sched.stopnote, 100*1000) {
-				noteclear(&sched.stopnote)
+			if notetsleep(&sched.safePointNote, 100*1000) {
+				noteclear(&sched.safePointNote)
 				break
 			}
 			preemptall()
 		}
 	}
-	if sched.stopwait != 0 {
-		throw("forEachP: not stopped")
+	if sched.safePointWait != 0 {
+		throw("forEachP: not done")
 	}
 	for i := 0; i < int(gomaxprocs); i++ {
 		p := allp[i]
@@ -887,9 +887,9 @@ func runSafePointFn() {
 	}
 	sched.safePointFn(p)
 	lock(&sched.lock)
-	sched.stopwait--
-	if sched.stopwait == 0 {
-		notewakeup(&sched.stopnote)
+	sched.safePointWait--
+	if sched.safePointWait == 0 {
+		notewakeup(&sched.safePointNote)
 	}
 	unlock(&sched.lock)
 }
@@ -1262,9 +1262,9 @@ func handoffp(_p_ *p) {
 	}
 	if _p_.runSafePointFn != 0 && cas(&_p_.runSafePointFn, 1, 0) {
 		sched.safePointFn(_p_)
-		sched.stopwait--
-		if sched.stopwait == 0 {
-			notewakeup(&sched.stopnote)
+		sched.safePointWait--
+		if sched.safePointWait == 0 {
+			notewakeup(&sched.safePointNote)
 		}
 	}
 	if sched.runqsize != 0 {

--- a/src/runtime/runtime2.go
+++ b/src/runtime/runtime2.go
@@ -441,7 +441,9 @@ type schedt struct {

 	// safepointFn should be called on each P at the next GC
 	// safepoint if p.runSafePointFn is set.
-	safePointFn func(*p)
+	safePointFn   func(*p)
+	safePointWait int32
+	safePointNote note

 	profilehz int32 // cpu profiling rate