1. 06 Jan, 2017 7 commits
    • Russ Cox's avatar
      runtime: fix corruption crash/race between select and stack growth · b902a63a
      Russ Cox authored
      To implement the blocking of a select, a goroutine builds a list of
      offers to communicate (pseudo-g's, aka sudog), one for each case,
      queues them on the corresponding channels, and waits for another
      goroutine to complete one of those cases and wake it up. Obviously it
      is not OK for two other goroutines to complete multiple cases and both
      wake the goroutine blocked in select. To make sure that only one
      branch of the select is chosen, all the sudogs contain a pointer to a
      shared (single) 'done uint32', which is atomically cas'ed by any
      interested goroutines. The goroutine that wins the cas race gets to
      wake up the select. A complication is that 'done uint32' is stored on
      the stack of the goroutine running the select, and that stack can move
      during the select due to stack growth or stack shrinking.
      
      The relevant ordering to block and unblock in select is:
      
      	1. Lock all channels.
      	2. Create list of sudogs and queue sudogs on all channels.
      	3. Switch to system stack, mark goroutine as asleep,
      	   unlock all channels.
      	4. Sleep until woken.
      	5. Wake up on goroutine stack.
      	6. Lock all channels.
      	7. Dequeue sudogs from all channels.
      	8. Free list of sudogs.
      	9. Unlock all channels.
      
      There are two kinds of stack moves: stack growth and stack shrinking.
      Stack growth happens while the original goroutine is running.
      Stack shrinking happens asynchronously, during garbage collection.
      
      While a channel listing a sudog is locked by select in this process,
      no other goroutine can attempt to complete communication on that
      channel, because that other goroutine doesn't hold the lock and can't
      find the sudog. If the stack moves while all the channel locks are
      held or when the sudogs are not yet or no longer queued in the
      channels, no problem, because no goroutine can get to the sudogs and
      therefore to selectdone. We only need to worry about the stack (and
      'done uint32') moving with the sudogs queued in unlocked channels.
      
      Stack shrinking can happen any time the goroutine is stopped.
      That code already acquires all the channel locks before doing the
      stack move, so it avoids this problem.
      
      Stack growth can happen essentially any time the original goroutine is
      running on its own stack (not the system stack). In the first half of
      the select, all the channels are locked before any sudogs are queued,
      and the channels are not unlocked until the goroutine has stopped
      executing on its own stack and is asleep, so that part is OK. In the
      second half of the select, the goroutine wakes up on its own goroutine
      stack and immediately locks all channels. But the actual call to lock
      might grow the stack, before acquiring any locks. In that case, the
      stack is moving with the sudogs queued in unlocked channels. Not good.
      One goroutine has already won a cas on the old stack (that goroutine
      woke up the selecting goroutine, moving it out of step 4), and the
      fact that done = 1 now should prevent any other goroutines from
      completing any other select cases. During the stack move, however,
      sudog.selectdone is moved from pointing to the old done variable on
      the old stack to a new memory location on the new stack. Another
      goroutine might observe the moved pointer before the new memory
      location has been initialized. If the new memory word happens to be
      zero, that goroutine might win a cas on the new location, thinking it
      can now complete the select (again). It will then complete a second
      communication (reading from or writing to the goroutine stack
      incorrectly) and then attempt to wake up the selecting goroutine,
      which is already awake.
      
      The scribbling over the goroutine stack unexpectedly is already bad,
      but likely to go unnoticed, at least immediately. As for the second
      wakeup, there are a variety of ways it might play out.
      
      * The goroutine might not be asleep.
      That will produce a runtime crash (throw) like in #17007:
      
      	runtime: gp: gp=0xc0422dcb60, goid=2299, gp->atomicstatus=8
      	runtime:  g:  g=0xa5cfe0, goid=0,  g->atomicstatus=0
      	fatal error: bad g->status in ready
      
      Here, atomicstatus=8 is copystack; the second, incorrect wakeup is
      observing that the selecting goroutine is in state "Gcopystack"
      instead of "Gwaiting".
      
      * The goroutine might be sleeping in a send on a nil chan.
      If it wakes up, it will crash with 'fatal error: unreachable'.
      
      * The goroutine might be sleeping in a send on a non-nil chan.
      If it wakes up, it will crash with 'fatal error: chansend:
      spurious wakeup'.
      
      * The goroutine might be sleeping in a receive on a nil chan.
      If it wakes up, it will crash with 'fatal error: unreachable'.
      
      * The goroutine might be sleeping in a receive on a non-nil chan.
      If it wakes up, it will silently (incorrectly!) continue as if it
      received a zero value from a closed channel, leaving a sudog queued on
      the channel pointing at that zero vaue on the goroutine's stack; that
      space will be reused as the goroutine executes, and when some other
      goroutine finally completes the receive, it will do a stray write into
      the goroutine's stack memory, which may cause problems. Then it will
      attempt the real wakeup of the goroutine, leading recursively to any
      of the cases in this list.
      
      * The goroutine might have been running a select in a finalizer
      (I hope not!) and might now be sleeping waiting for more things to
      finalize. If it wakes up, as long as it goes back to sleep quickly
      (before the real GC code tries to wake it), the spurious wakeup does
      no harm (but the stack was still scribbled on).
      
      * The goroutine might be sleeping in gcParkAssist.
      If it wakes up, that will let the goroutine continue executing a bit
      earlier than we would have liked. Eventually the GC will attempt the
      real wakeup of the goroutine, leading recursively to any of the cases
      in this list.
      
      * The goroutine cannot be sleeping in bgsweep, because the background
      sweepers never use select.
      
      * The goroutine might be sleeping in netpollblock.
      If it wakes up, it will crash with 'fatal error: netpollblock:
      corrupted state'.
      
      * The goroutine might be sleeping in main as another thread crashes.
      If it wakes up, it will exit(0) instead of letting the other thread
      crash with a non-zero exit status.
      
      * The goroutine cannot be sleeping in forcegchelper,
      because forcegchelper never uses select.
      
      * The goroutine might be sleeping in an empty select - select {}.
      If it wakes up, it will return to the next line in the program!
      
      * The goroutine might be sleeping in a non-empty select (again).
      In this case, it will wake up spuriously, with gp.param == nil (no
      reason for wakeup), but that was fortuitously overloaded for handling
      wakeup due to a closing channel and the way it is handled is to rerun
      the select, which (accidentally) handles the spurious wakeup
      correctly:
      
      	if cas == nil {
      		// This can happen if we were woken up by a close().
      		// TODO: figure that out explicitly so we don't need this loop.
      		goto loop
      	}
      
      Before looping, it will dequeue all the sudogs on all the channels
      involved, so that no other goroutine will attempt to wake it.
      Since the goroutine was blocked in select before, being blocked in
      select again when the spurious wakeup arrives may be quite likely.
      In this case, the spurious wakeup does no harm (but the stack was
      still scribbled on).
      
      * The goroutine might be sleeping in semacquire (mutex slow path).
      If it wakes up, that is taken as a signal to try for the semaphore
      again, not a signal that the semaphore is now held, but the next
      iteration around the loop will queue the sudog a second time, causing
      a cycle in the wakeup list for the given address. If that sudog is the
      only one in the list, when it is eventually dequeued, it will
      (due to the precise way the code is written) leave the sudog on the
      queue inactive with the sudog broken. But the sudog will also be in
      the free list, and that will eventually cause confusion.
      
      * The goroutine might be sleeping in notifyListWait, for sync.Cond.
      If it wakes up, (*Cond).Wait returns. The docs say "Unlike in other
      systems, Wait cannot return unless awoken by Broadcast or Signal,"
      so the spurious wakeup is incorrect behavior, but most callers do not
      depend on that fact. Eventually the condition will happen, attempting
      the real wakeup of the goroutine and leading recursively to any of the
      cases in this list.
      
      * The goroutine might be sleeping in timeSleep aka time.Sleep.
      If it wakes up, it will continue running, leaving a timer ticking.
      When that time bomb goes off, it will try to ready the goroutine
      again, leading to any one of the cases in this list.
      
      * The goroutine cannot be sleeping in timerproc,
      because timerproc never uses select.
      
      * The goroutine might be sleeping in ReadTrace.
      If it wakes up, it will print 'runtime: spurious wakeup of trace
      reader' and return nil. All future calls to ReadTrace will print
      'runtime: ReadTrace called from multiple goroutines simultaneously'.
      Eventually, when trace data is available, a true wakeup will be
      attempted, leading to any one of the cases in this list.
      
      None of these fatal errors appear in any of the trybot or dashboard
      logs. The 'bad g->status in ready' that happens if the goroutine is
      running (the most likely scenario anyway) has happened once on the
      dashboard and eight times in trybot logs. Of the eight, five were
      atomicstatus=8 during net/http tests, so almost certainly this bug.
      The other three were atomicstatus=2, all near code in select,
      but in a draft CL by Dmitry that was rewriting select and may or may
      not have had its own bugs.
      
      This bug has existed since Go 1.4. Until then the select code was
      implemented in C, 'done uint32' was a C stack variable 'uint32 done',
      and C stacks never moved. I believe it has become more common recently
      because of Brad's work to run more and more tests in net/http in
      parallel, which lengthens race windows.
      
      The fix is to run step 6 on the system stack,
      avoiding possibility of stack growth.
      
      Fixes #17007 and possibly other mysterious failures.
      
      Change-Id: I9d6575a51ac96ae9d67ec24da670426a4a45a317
      Reviewed-on: https://go-review.googlesource.com/34835
      Run-TryBot: Russ Cox <rsc@golang.org>
      Reviewed-by: 's avatarAustin Clements <austin@google.com>
      b902a63a
    • Austin Clements's avatar
      runtime: expand HACKING.md · 5dd978a2
      Austin Clements authored
      This adds high-level descriptions of the scheduler structures, the
      user and system stacks, error handling, and synchronization.
      
      Change-Id: I1eed97c6dd4a6e3d351279e967b11c6e64898356
      Reviewed-on: https://go-review.googlesource.com/34290Reviewed-by: 's avatarRick Hudson <rlh@golang.org>
      5dd978a2
    • Austin Clements's avatar
      runtime: update big mgc.go comment · 618c2915
      Austin Clements authored
      The comment describing the overall GC algorithm at the top of mgc.go
      has gotten woefully out-of-date (and was possibly never
      correct/complete). Update it to reflect the current workings of the
      GC and the set of phases that we now divide it into.
      
      Change-Id: I02143c0ebefe9d4cd7753349dab8045f0973bf95
      Reviewed-on: https://go-review.googlesource.com/34711Reviewed-by: 's avatarRick Hudson <rlh@golang.org>
      618c2915
    • Russ Cox's avatar
      net/http: better failure in TestTransportPersistConnLeak · cb91dccd
      Russ Cox authored
      If one of the c.Get(ts.URL) results in an error, the child goroutine
      calls t.Errorf, but the test goroutine gets stuck waiting for <-gotReqCh,
      so the test hangs and the program is eventually killed (after 10 minutes!).
      Whatever might have been printed to t.Errorf is never seen.
      Adjust test so that the test fails cleanly in this case.
      
      Still trying to debug why c.Get might fail.
      It seems to have something to do with occasional connection
      failures on macOS Sierra.
      
      Change-Id: Ia797787bd51ea7cd6deb1192aec89c331c4f2c48
      Reviewed-on: https://go-review.googlesource.com/34836
      Run-TryBot: Russ Cox <rsc@golang.org>
      TryBot-Result: Gobot Gobot <gobot@golang.org>
      Reviewed-by: 's avatarBrad Fitzpatrick <bradfitz@golang.org>
      cb91dccd
    • Austin Clements's avatar
      runtime: use 4K as the boundary of legal pointers · 7aefdfde
      Austin Clements authored
      Currently, the check for legal pointers in stack copying uses
      _PageSize (8K) as the minimum legal pointer. By default, Linux won't
      let you map under 64K, but
      
      1) it's less clear what other OSes allow or will allow in the future;
      
      2) while mapping the first page is a terrible idea, mapping anywhere
      above that is arguably more justifiable;
      
      3) the compiler only assumes the first physical page (4K) is never
      mapped.
      
      Make the runtime consistent with the compiler and more robust by
      changing the bad pointer check to use 4K as the minimum legal pointer.
      
      This came out of discussions on CLs 34663 and 34719.
      
      Change-Id: Idf721a788bd9699fb348f47bdd083cf8fa8bd3e5
      Reviewed-on: https://go-review.googlesource.com/34890
      Run-TryBot: Austin Clements <austin@google.com>
      TryBot-Result: Gobot Gobot <gobot@golang.org>
      Reviewed-by: 's avatarRuss Cox <rsc@golang.org>
      7aefdfde
    • Kevin Burke's avatar
      net: Fix grammar error · 867dcb55
      Kevin Burke authored
      Change-Id: I1c2e17b25ca91be37a18c47e70678c3753070fb8
      Reviewed-on: https://go-review.googlesource.com/34827Reviewed-by: 's avatarJoe Tsai <thebrokentoaster@gmail.com>
      867dcb55
    • Mikio Hara's avatar
      net: display the complete BUGS section on every platform · b07363da
      Mikio Hara authored
      We cannot assume that the platform running documentation service is
      the target platform.
      
      Change-Id: I241ed6f8778169faac9ef49e11dcd40f7422cccc
      Reviewed-on: https://go-review.googlesource.com/34750
      Run-TryBot: Mikio Hara <mikioh.mikioh@gmail.com>
      TryBot-Result: Gobot Gobot <gobot@golang.org>
      Reviewed-by: 's avatarIan Lance Taylor <iant@golang.org>
      b07363da
  2. 05 Jan, 2017 8 commits
  3. 04 Jan, 2017 6 commits
  4. 03 Jan, 2017 5 commits
  5. 02 Jan, 2017 3 commits
  6. 31 Dec, 2016 1 commit
  7. 30 Dec, 2016 1 commit
  8. 29 Dec, 2016 1 commit
  9. 28 Dec, 2016 1 commit
  10. 24 Dec, 2016 2 commits
  11. 23 Dec, 2016 3 commits
  12. 22 Dec, 2016 2 commits
    • Raul Silvera's avatar
      cmd/pprof: Re-enable weblist and disasm · 8887be46
      Raul Silvera authored
      Previous changes started using the full filename for object files
      on graph nodes, instead of just the file basename. The basename
      was still being used when selecting mappings to disassemble for
      weblist and disasm commands, causing a mismatch.
      
      This fixes #18385. It was already fixed on the upstream pprof.
      
      Change-Id: I1664503634f2c8cd31743561301631f12c4949c9
      Reviewed-on: https://go-review.googlesource.com/34665Reviewed-by: 's avatarBrad Fitzpatrick <bradfitz@golang.org>
      Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
      TryBot-Result: Gobot Gobot <gobot@golang.org>
      8887be46
    • Brad Fitzpatrick's avatar
      net/http: restore Transport's Request.Body byte sniff in limited cases · 6e36811c
      Brad Fitzpatrick authored
      In Go 1.8, we'd removed the Transport's Request.Body
      one-byte-Read-sniffing to disambiguate between non-nil Request.Body
      with a ContentLength of 0 or -1. Previously, we tried to see whether a
      ContentLength of 0 meant actually zero, or just an unset by reading a
      single byte of the Request.Body and then stitching any read byte back
      together with the original Request.Body.
      
      That historically has caused many problems due to either data races,
      blocking forever (#17480), or losing bytes (#17071). Thus, we removed
      it in both HTTP/1 and HTTP/2 in Go 1.8. Unfortunately, during the Go
      1.8 beta, we've found that a few people have gotten bitten by the
      behavior change on requests with methods typically not containing
      request bodies (e.g. GET, HEAD, DELETE). The most popular example is
      the aws-go SDK, which always set http.Request.Body to a non-nil value,
      even on such request methods. That was causing Go 1.8 to send such
      requests with Transfer-Encoding chunked bodies, with zero bytes,
      confusing popular servers (including but limited to AWS).
      
      This CL partially reverts the no-byte-sniffing behavior and restores
      it only for GET/HEAD/DELETE/etc requests, and only when there's no
      Transfer-Encoding set, and the Content-Length is 0 or -1.
      
      Updates #18257 (aws-go) bug
      And also private bug reports about non-AWS issues.
      
      Updates #18407 also, but haven't yet audited things enough to declare
      it fixed.
      
      Change-Id: Ie5284d3e067c181839b31faf637eee56e5738a6a
      Reviewed-on: https://go-review.googlesource.com/34668
      Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
      TryBot-Result: Gobot Gobot <gobot@golang.org>
      Reviewed-by: 's avatarIan Lance Taylor <iant@golang.org>
      6e36811c