    [release-branch.go1.8] runtime: do not call wakep from enlistWorker, to avoid possible deadlock · bcda91c1
    Russ Cox authored
    We have seen one instance of a production job suddenly spinning to
    100% CPU and becoming unresponsive. In that one instance, a SIGQUIT
    was sent after 328 minutes of spinning, and the stacks showed a single
    goroutine in "IO wait (scan)" state.
    
    Looking for things that might get stuck if a goroutine got stuck in
    scanning a stack, we found that injectglist does:
    
    	lock(&sched.lock)
    	var n int
    	for n = 0; glist != nil; n++ {
    		gp := glist
    		glist = gp.schedlink.ptr()
    		casgstatus(gp, _Gwaiting, _Grunnable)
    		globrunqput(gp)
    	}
    	unlock(&sched.lock)
    
    and that casgstatus spins on gp.atomicstatus until the _Gscan bit goes
    away. Essentially, this code locks sched.lock and then while holding
    sched.lock, waits to lock gp.atomicstatus.
    
    The code that is doing the scan is:
    
    	if castogscanstatus(gp, s, s|_Gscan) {
    		if !gp.gcscandone {
    			scanstack(gp, gcw)
    			gp.gcscandone = true
    		}
    		restartg(gp)
    		break loop
    	}
    
    More analysis showed that scanstack can, in a rare case, end up
    calling back into code that acquires sched.lock. For example:
    
    	runtime.scanstack at proc.go:866
    	calls runtime.gentraceback at mgcmark.go:842
    	calls runtime.scanstack$1 at traceback.go:378
    	calls runtime.scanframeworker at mgcmark.go:819
    	calls runtime.scanblock at mgcmark.go:904
    	calls runtime.greyobject at mgcmark.go:1221
    	calls (*runtime.gcWork).put at mgcmark.go:1412
    	calls (*runtime.gcControllerState).enlistWorker at mgcwork.go:127
    	calls runtime.wakep at mgc.go:632
    	calls runtime.startm at proc.go:1779
    	acquires runtime.sched.lock at proc.go:1675
    
    This path was found with an automated deadlock-detecting tool.
    There are many such paths but they all go through enlistWorker -> wakep.
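
    To make the cycle concrete, here is a reduced, self-contained
    sketch of the pattern (schedLock, status, and gscanBit are invented
    stand-ins for sched.lock, gp.atomicstatus, and _Gscan, not the
    runtime's declarations). One goroutine plays the injectglist role:
    it takes the lock and then spins waiting for the scan bit to clear.
    The other plays the scanner role: it sets the scan bit and then,
    partway through the "scan", needs the same lock, as the
    enlistWorker -> wakep -> startm path does. Neither side can make
    progress.
    
    	package main
    
    	import (
    		"fmt"
    		"sync"
    		"sync/atomic"
    		"time"
    	)
    
    	// Invented stand-ins for illustration; not the runtime's declarations.
    	const gscanBit uint32 = 0x1000 // plays the role of _Gscan
    
    	func main() {
    		var schedLock sync.Mutex // plays the role of sched.lock
    		var status uint32        // plays the role of gp.atomicstatus
    
    		// Scanner side: mark the goroutine as being scanned, then reach
    		// code that needs schedLock partway through the scan (the
    		// enlistWorker -> wakep -> startm path above). The scan bit is
    		// cleared only after that lock is acquired.
    		atomic.StoreUint32(&status, gscanBit)
    		go func() {
    			time.Sleep(10 * time.Millisecond) // let the other side take schedLock first
    			schedLock.Lock()                  // blocks forever: schedLock is held below
    			atomic.StoreUint32(&status, 0)    // never reached: would end the "scan"
    			schedLock.Unlock()
    		}()
    
    		// injectglist side: take schedLock, then spin until the scan bit
    		// clears, as casgstatus does under sched.lock.
    		done := make(chan struct{})
    		go func() {
    			schedLock.Lock()
    			for atomic.LoadUint32(&status)&gscanBit != 0 {
    				// spinning while holding schedLock; the scanner can never
    				// clear the bit, because it is waiting for schedLock
    			}
    			schedLock.Unlock()
    			close(done)
    		}()
    
    		select {
    		case <-done:
    			fmt.Println("no deadlock (interleaving did not reproduce it)")
    		case <-time.After(2 * time.Second):
    			fmt.Println("deadlocked: lock held while spinning on the scan bit")
    		}
    	}
    
    The 2-second watchdog is only there so the sketch terminates and
    reports the outcome instead of hanging forever.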
    
    The evidence strongly suggests that one of these paths is what caused
    the deadlock we observed. We're running those jobs with
    GOTRACEBACK=crash now to try to get more information if it happens
    again.
    
    Further refinement and analysis shows that if we drop the wakep call
    from enlistWorker, the remaining few deadlock cycles found by the tool
    are all false positives caused by not understanding the effect of calls
    to func variables.
    
    The enlistWorker -> wakep call was intended only as a performance
    optimization; it rarely executes, and if it does execute at just the
    wrong time it can (and plausibly did) cause the deadlock we saw.
    
    Comment it out, to avoid the potential deadlock.
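
    For reference, the change is roughly of the following shape in
    (*gcControllerState).enlistWorker; this is a sketch of the intent,
    not the exact diff (the precise guard condition and comment are in
    the CL below):
    
    	func (c *gcControllerState) enlistWorker() {
    		// If there are idle Ps, wake one so it will run an idle worker.
    		// NOTE: Disabled for now: wakep can end up acquiring sched.lock,
    		// and this can be reached with the _Gscan bit held (see the call
    		// chain above), which risks the deadlock in #19112.
    		//
    		//	if atomic.Load(&sched.npidle) != 0 && atomic.Load(&sched.nmspinning) == 0 {
    		//		wakep()
    		//		return
    		//	}
    
    		// ... rest of enlistWorker unchanged ...
    	}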
    
    Fixes #19112.
    Unfixes #14179.
    
    Change-Id: I6f7e10b890b991c11e79fab7aeefaf70b5d5a07b
    Reviewed-on: https://go-review.googlesource.com/37093
    Run-TryBot: Russ Cox <rsc@golang.org>
    Reviewed-by: Austin Clements <austin@google.com>
    Reviewed-on: https://go-review.googlesource.com/37022
    TryBot-Result: Gobot Gobot <gobot@golang.org>