1. 09 Mar, 2018 15 commits
  2. 08 Mar, 2018 25 commits
    • runtime: explain and enforce that _panic values live on the stack · 5d22cebb
      Austin Clements authored
      It's a bit mysterious that _defer.sp is a uintptr that gets
      stack-adjusted explicitly while _panic.argp is an unsafe.Pointer that
      doesn't, but this difference turns out to be critically important when
      a deferred function grows the stack before doing a recover.
      
      Add a comment explaining that this works because _panic values live on
      the stack. Enforce this by marking _panic go:notinheap.
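
      The scenario described above can be reproduced from user code. The
      following is a minimal sketch, not part of the CL: growStack and
      recoverAfterGrowth are hypothetical names, and the recursion depth is
      only assumed to be large enough to make the runtime grow the stack.

```go
package main

import "fmt"

// growStack recurses with a padded frame so that the goroutine stack
// should have to grow (and therefore move) while it runs.
func growStack(n int) int {
	var pad [256]byte
	if n == 0 {
		return int(pad[0])
	}
	pad[n%len(pad)] = byte(n)
	return growStack(n-1) + int(pad[0])
}

// recoverAfterGrowth panics, then recovers inside a deferred function
// that first forces stack growth. recover is still called directly by
// the deferred function, as the language requires.
func recoverAfterGrowth() (recovered interface{}) {
	defer func() {
		growStack(10000)
		recovered = recover()
	}()
	panic("boom")
}

func main() {
	fmt.Println(recoverAfterGrowth())
}
```

      The recover only works if the runtime's bookkeeping for the panic
      stays consistent across the stack move, which is what the comment
      added by this CL explains.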
      
      Change-Id: I9ca49e84ee1f86d881552c55dccd0662b530836b
      Reviewed-on: https://go-review.googlesource.com/99735
      Run-TryBot: Austin Clements <austin@google.com>
      TryBot-Result: Gobot Gobot <gobot@golang.org>
      Reviewed-by: Matthew Dempsky <mdempsky@google.com>
    • runtime: ensure abort actually crashes the process · 60a9e5d6
      Austin Clements authored
      On all non-x86 arches, runtime.abort simply reads from nil.
      Unfortunately, if this happens on a user stack, the signal handler
      will dutifully turn this into a panicmem, which lets user defers run
      and which user code can even recover from.
      
      To fix this, add an explicit check to the signal handler that turns
      faults in abort into hard crashes directly in the signal handler. This
      has the added benefit of giving a register dump at the abort point.
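
      For context, this is how an ordinary nil dereference on a user stack
      behaves: it becomes a run-time panic that deferred code can recover.
      The sketch below only illustrates that user-visible behavior, which is
      what abort must not trigger. derefRecovered is a hypothetical helper,
      not runtime code.

```go
package main

import (
	"fmt"
	"runtime"
)

// derefRecovered dereferences p and converts a run-time panic (such as
// a nil pointer dereference) into an ordinary error.
func derefRecovered(p *int) (v int, err error) {
	defer func() {
		if r := recover(); r != nil {
			re, ok := r.(runtime.Error)
			if !ok {
				panic(r) // not a run-time error: re-panic
			}
			err = re
		}
	}()
	return *p, nil
}

func main() {
	_, err := derefRecovered(nil)
	fmt.Println("recovered:", err != nil)
}
```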
      
      Change-Id: If26a7f13790745ee3867db7f53b72d8281176d70
      Reviewed-on: https://go-review.googlesource.com/93661
      Run-TryBot: Austin Clements <austin@google.com>
      TryBot-Result: Gobot Gobot <gobot@golang.org>
      Reviewed-by: Keith Randall <khr@golang.org>
    • runtime: call abort instead of raw INT $3 or bad MOV · c950a90d
      Austin Clements authored
      Everything except for amd64, amd64p32, and 386 currently defines and
      uses an abort function. This CL makes these match. The next CL will
      recognize the abort function to make this more useful.
      
      Change-Id: I7c155871ea48919a9220417df0630005b444f488
      Reviewed-on: https://go-review.googlesource.com/93660
      Run-TryBot: Austin Clements <austin@google.com>
      TryBot-Result: Gobot Gobot <gobot@golang.org>
      Reviewed-by: Keith Randall <khr@golang.org>
    • runtime: make throw safer to call · 7f1b2738
      Austin Clements authored
      Currently, throw may grow the stack, which means whenever we call it
      from a context where it's not safe to grow the stack, we first have to
      switch to the system stack. This is pretty easy to get wrong.
      
      Fix this by making throw switch to the system stack so it doesn't grow
      the stack and is hence safe to call without a system stack switch at
      the call site.
      
      The only thing this complicates is badsystemstack itself, which would
      now go into an infinite loop before printing anything (previously it
      would also go into an infinite loop, but would at least print the
      error first). Fix this by making badsystemstack do a direct write and
      then crash hard.
      
      Change-Id: Ic5b4a610df265e47962dcfa341cabac03c31c049
      Reviewed-on: https://go-review.googlesource.com/93659
      Run-TryBot: Austin Clements <austin@google.com>
      TryBot-Result: Gobot Gobot <gobot@golang.org>
      Reviewed-by: Keith Randall <khr@golang.org>
    • runtime: move unrecoverable panic handling to the system stack · 9d59234c
      Austin Clements authored
      Currently parts of unrecoverable panic handling (notably, printing
      panic messages) can happen on the user stack. This may grow the stack,
      which is generally fine, but if we're handling a runtime panic, it's
      better to do as little as possible in case the runtime is in an
      inconsistent state.
      
      Hence, this commit rearranges the handling of unrecoverable panics so
      that it's done entirely on the system stack.
      
      This is mostly a matter of shuffling code a bit so everything can move
      into a systemstack block. The one slight subtlety is in the "panic
      during panic" case, where we now depend on startpanic_m's caller to
      print the stack rather than startpanic_m itself. To make this work,
      startpanic_m now returns a boolean indicating that the caller should
      avoid trying to print any panic messages and get right to the stack
      trace. Since the caller is already in a position to do this, this
      actually simplifies things a little.
      
      Change-Id: Id72febe8c0a9fb31d9369b600a1816d65a49bfed
      Reviewed-on: https://go-review.googlesource.com/93658
      Run-TryBot: Austin Clements <austin@google.com>
      TryBot-Result: Gobot Gobot <gobot@golang.org>
      Reviewed-by: Keith Randall <khr@golang.org>
    • cmd/compile: simplify OpSlicemask optimization · da022da9
      Austin Clements authored
      The previous CL introduced isConstDelta. Use it to simplify the
      OpSlicemask optimization in the prove pass. This passes toolstash
      -cmp.
      
      Change-Id: If2aa762db4cdc0cd1c581a536340530a9831081b
      Reviewed-on: https://go-review.googlesource.com/87481
      Reviewed-by: Keith Randall <khr@golang.org>
    • cmd/compile: add fence-post implications to prove · 6436270d
      Austin Clements authored
      This adds four new deductions to the prove pass, all related to adding
      or subtracting one from a value. This is the first hint of actual
      arithmetic relations in the prove pass.
      
      The most effective of these is
      
         x-1 >= w && x > min  ⇒  x > w
      
      This helps eliminate bounds checks in code like
      
        if x > 0 {
          // do something with s[x-1]
        }
      
      Altogether, these deductions prove an additional 260 branches in std
      and cmd. Furthermore, they will let us eliminate some tricky
      compiler-inserted panics in the runtime that are interfering with
      static analysis.
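
      As a concrete instance of the pattern above, here is a small sketch
      with a hypothetical helper named prev. Whether the bounds check on
      s[x-1] is actually removed depends on the compiler version; building
      with -gcflags=-d=ssa/check_bce/debug=1 is one way to see which checks
      remain.

```go
package main

import "fmt"

// prev returns s[x-1] when x is in the range 1..len(s).
func prev(s []int, x int) (int, bool) {
	if x > 0 && x <= len(s) {
		// x > 0 gives x-1 >= 0, and x <= len(s) gives x-1 < len(s),
		// so the index below is provably in bounds.
		return s[x-1], true
	}
	return 0, false
}

func main() {
	s := []int{10, 20, 30}
	fmt.Println(prev(s, 3))
	fmt.Println(prev(s, 0))
}
```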
      
      Fixes #23354.
      
      Change-Id: I7088223e0e0cd6ff062a75c127eb4bb60e6dce02
      Reviewed-on: https://go-review.googlesource.com/87480
      Reviewed-by: Keith Randall <khr@golang.org>
      Reviewed-by: Alexandru Moșoi <alexandru@mosoi.ro>
    • cmd/compile: derive unsigned limits from signed limits in prove · 941fc129
      Austin Clements authored
      This adds a few simple deductions to the prove pass' fact table to
      derive unsigned concrete limits from signed concrete limits where
      possible.
      
      This tweak lets the pass prove 70 additional branch conditions in std
      and cmd.
      
      This is based on a comment from the recently-deleted factsTable.get:
      "// TODO: also use signed data if lim.min >= 0".
      
      Change-Id: Ib4340249e7733070f004a0aa31254adf5df8a392
      Reviewed-on: https://go-review.googlesource.com/87479
      Reviewed-by: Alexandru Moșoi <alexandru@mosoi.ro>
      Reviewed-by: Keith Randall <khr@golang.org>
    • cmd/compile: make prove pass use unsatisfiability · 669db2ce
      Austin Clements authored
      Currently the prove pass uses implication queries. For each block, it
      collects the set of branch conditions leading to that block, and
      queries this fact table for whether any of these facts imply the
      block's own branch condition (or its inverse). This works remarkably
      well considering it doesn't do any deduction on these facts, but it
      has various downsides:
      
      1. It requires an implementation both of adding facts to the table and
         determining implications. These are very nearly duals of each
         other, but require separate implementations. Likewise, the process
         of asserting facts of dominating branch conditions is very nearly
         the dual of the process of querying implied branch conditions.
      
      2. It leads to less effective use of derived facts. For example, the
         prove pass currently derives facts about the relations between len
         and cap, but can't make use of these unless a branch condition is
         in the exact form of a derived fact. If one of these derived facts
         contradicts another fact, it won't notice or make use of this.
      
      This CL changes the approach of the prove pass to use
      *contradiction* instead of implication. Rather than ever querying a
      branch condition, it simply adds branch conditions to the fact table.
      If this leads to a contradiction (specifically, it makes the fact set
      unsatisfiable), that branch is impossible and can be cut. As a result,
      
      1. We can eliminate the code for determining implications
         (factsTable.get disappears entirely). Also, there is now a single
         implementation of visiting and asserting branch conditions, since
         we don't have to flip them around to treat them as facts in one
         place and queries in another.
      
      2. Derived facts can be used effectively. It doesn't matter *why* the
         fact table is unsatisfiable; a contradiction in any of the facts is
         enough.
      
      3. As an added benefit, it's now quite easy to avoid traversing beyond
         provably-unreachable blocks. In contrast, the current
         implementation always visits all blocks.
      
      The prove pass already has nearly all of the mechanism necessary to
      compute unsatisfiability, which means this both simplifies the code
      and makes it more powerful.
      
      The only complication is that the current implication procedure has a
      hack for dealing with the 0 <= Args[0] condition of OpIsInBounds and
      OpIsSliceInBounds. We replace this with asserting the appropriate fact
      when we process one of these conditions. This seems much cleaner
      anyway, and works because we can now take advantage of derived facts.
      
      This has no measurable effect on compiler performance.
      
      Effectiveness:
      
      There is exactly one condition in all of std and cmd that this fails
      to prove that the old implementation could: (int64(^uint(0)>>1) < x)
      in encoding/gob. This can never be true because x is an int, and it's
      basically coincidence that the old code gets this. (For example, it
      fails to prove the similar (x < ^int64(^uint(0)>>1)) condition that
      immediately precedes it, and even though the conditions are logically
      unrelated, it wouldn't get the second one if it hadn't first processed
      the first!)
      
      It does, however, prove a few dozen additional branches. These come
      from facts that are added to the fact table about the relations
      between len and cap. These were almost never queried directly before,
      but could lead to contradictions, which the unsat-based approach is
      able to use.
      
      There are exactly two branches in std and cmd that this implementation
      proves in the *other* direction. This sounds scary, but is okay
      because both occur in already-unreachable blocks, so it doesn't matter
      what we choose. Because the fact table logic is sound but incomplete,
      it fails to prove that the block isn't reachable, even though it is
      able to prove that both outgoing branches are impossible. We could
      turn these blocks into BlockExit blocks, but it doesn't seem worth the
      trouble of the extra proof effort for something that happens twice in
      all of std and cmd.
      
      Tests:
      
      This CL updates test/prove.go to change the expected messages because
      it can no longer give a "reason" why it proved or disproved a
      condition. It also adds a new test of a branch it couldn't prove
      before.
      
      It mostly guts test/sliceopt.go, removing everything related to slice
      bounds optimizations and moving a few relevant tests to test/prove.go.
      Much of this test is actually unreachable. The new prove pass figures
      this out and doesn't try to prove anything about the unreachable
      parts. The output on the unreachable parts is already suspect because
      anything can be proved at that point, so it's really just a regression
      test for an algorithm the compiler no longer uses.
      
      This is a step toward fixing #23354. That issue is quite easy to fix
      once we can use derived facts effectively.
      
      Change-Id: Ia48a1b9ee081310579fe474e4a61857424ff8ce8
      Reviewed-on: https://go-review.googlesource.com/87478
      Reviewed-by: Keith Randall <khr@golang.org>
    • cmd/compile: simplify limit logic in prove · 2e9cf5f6
      Austin Clements authored
      This replaces the open-coded intersection of limits in the prove pass
      with a general limit intersection operation. This should get identical
      results except in one case where it's more precise: when handling an
      equality relation, if the value is *outside* the existing range, this
      will reduce the range to empty rather than resetting it. This will be
      important to a follow-up CL where we can take advantage of empty
      ranges.
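
      A minimal sketch of the equality case described above, using a
      hypothetical interval type rather than the compiler's limit
      structure: intersecting an existing range with a point outside it
      yields an empty range instead of discarding what was known.

```go
package main

import "fmt"

// interval is a closed range [min, max]; it is empty when min > max.
type interval struct{ min, max int64 }

// intersect returns the values common to a and b.
func (a interval) intersect(b interval) interval {
	if b.min > a.min {
		a.min = b.min
	}
	if b.max < a.max {
		a.max = b.max
	}
	return a
}

// empty reports whether the interval contains no values.
func (a interval) empty() bool { return a.min > a.max }

func main() {
	known := interval{0, 10}   // existing range for x
	equal := interval{20, 20}  // new fact: x == 20
	r := known.intersect(equal)
	fmt.Println(r, r.empty())
}
```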
      
      For #23354.
      
      Change-Id: I3d3d75924f61b1da1cb604b3a9d189b26fb3a14e
      Reviewed-on: https://go-review.googlesource.com/87477
      Run-TryBot: Austin Clements <austin@google.com>
      TryBot-Result: Gobot Gobot <gobot@golang.org>
      Reviewed-by: Keith Randall <khr@golang.org>
      Reviewed-by: Alexandru Moșoi <alexandru@mosoi.ro>
    • cmd/compile: more String methods for prove types · 44e20b64
      Austin Clements authored
      These aid in debugging.
      
      Change-Id: Ieb38c996765f780f6103f8c3292639d408c25123
      Reviewed-on: https://go-review.googlesource.com/87476
      Run-TryBot: Austin Clements <austin@google.com>
      TryBot-Result: Gobot Gobot <gobot@golang.org>
      Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
      Reviewed-by: Keith Randall <khr@golang.org>
    • cmd/compile: minor comment improvements/corrections · 491f409a
      Austin Clements authored
      Change-Id: Ie0934f1528d58d4971cdef726d3e2d23cf3935d3
      Reviewed-on: https://go-review.googlesource.com/87475
      Run-TryBot: Austin Clements <austin@google.com>
      TryBot-Result: Gobot Gobot <gobot@golang.org>
      Reviewed-by: David Chase <drchase@google.com>
      Reviewed-by: Keith Randall <khr@golang.org>
      Reviewed-by: Alexandru Moșoi <alexandru@mosoi.ro>
    • Revert "cmd/compile: cleanup nodpc and nodfp" · b55eedd1
      Matthew Dempsky authored
      This reverts commit dcac984b.
      
      Reason for revert: broke LR architectures (arm64, ppc64, s390x)
      
      Change-Id: I531d311c9053e81503c8c78d6cf044b318fc828b
      Reviewed-on: https://go-review.googlesource.com/99695
      Run-TryBot: Matthew Dempsky <mdempsky@google.com>
      TryBot-Result: Gobot Gobot <gobot@golang.org>
      Reviewed-by: Austin Clements <austin@google.com>
    • math/big: allocate less in Float.Sqrt · 010579c2
      Alberto Donizetti authored
      The Newton sqrtInverse procedure we use to compute Float.Sqrt should
      not allocate a number of times proportional to the number of Newton
      iterations we need to reach the desired precision.
      
      At the beginning of the function the target precision is known, so even
      if we do want to perform the early steps at low precisions (to save
      time), it's still possible to pre-allocate larger backing arrays, both
      for the temp variables in the loop and the variable that'll hold the
      final result.
      
      There's one complication. At the following line:
      
        u.Sub(three, u)
      
      the Sub method will allocate, because the receiver aliases one of the
      arguments, and the large backing array we initially allocated for u
      will be replaced by a smaller one allocated by Sub. We can work around
      this by introducing a second temp variable u2 that we use to hold the
      Sub call result.
      
      Overall, the sqrtInverse procedure still allocates a number of times
      proportional to the number of Newton steps, because unfortunately a
      few of the Mul calls in the Newton function allocate; but at least we
      allocate less in the function itself.
      
      name                 old time/op    new time/op    delta
      FloatSqrt/256-4        1.97µs ± 1%    1.84µs ± 1%   -6.61%  (p=0.000 n=8+8)
      FloatSqrt/1000-4       4.80µs ± 3%    4.28µs ± 1%  -10.78%  (p=0.000 n=8+8)
      FloatSqrt/10000-4      40.0µs ± 1%    38.3µs ± 1%   -4.15%  (p=0.000 n=8+8)
      FloatSqrt/100000-4      955µs ± 1%     932µs ± 0%   -2.49%  (p=0.000 n=8+7)
      FloatSqrt/1000000-4    79.8ms ± 1%    79.4ms ± 1%     ~     (p=0.105 n=8+8)
      
      name                 old alloc/op   new alloc/op   delta
      FloatSqrt/256-4          816B ± 0%      512B ± 0%  -37.25%  (p=0.000 n=8+8)
      FloatSqrt/1000-4       2.50kB ± 0%    1.47kB ± 0%  -41.03%  (p=0.000 n=8+8)
      FloatSqrt/10000-4      23.5kB ± 0%    18.2kB ± 0%  -22.62%  (p=0.000 n=8+8)
      FloatSqrt/100000-4      251kB ± 0%     173kB ± 0%  -31.26%  (p=0.000 n=8+8)
      FloatSqrt/1000000-4    4.61MB ± 0%    2.86MB ± 0%  -37.90%  (p=0.000 n=8+8)
      
      name                 old allocs/op  new allocs/op  delta
      FloatSqrt/256-4          12.0 ± 0%       8.0 ± 0%  -33.33%  (p=0.000 n=8+8)
      FloatSqrt/1000-4         19.0 ± 0%       9.0 ± 0%  -52.63%  (p=0.000 n=8+8)
      FloatSqrt/10000-4        35.0 ± 0%      14.0 ± 0%  -60.00%  (p=0.000 n=8+8)
      FloatSqrt/100000-4       55.0 ± 0%      23.0 ± 0%  -58.18%  (p=0.000 n=8+8)
      FloatSqrt/1000000-4       122 ± 0%        75 ± 0%  -38.52%  (p=0.000 n=8+8)
      
      Change-Id: I950dbf61a40267a6cca82ae72524c3024bcb149c
      Reviewed-on: https://go-review.googlesource.com/87659
      Reviewed-by: Robert Griesemer <gri@golang.org>
    • math/big: speedup nat.setBytes for bigger slices · d2a5263a
      isharipo authored
      Set up to _S (number of bytes in Uint) bytes at a time
      by using BigEndian.Uint32 and BigEndian.Uint64.
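
      A rough sketch of the technique, assuming 64-bit words. The function
      wordsFromBytes is hypothetical and is not the nat.setBytes
      implementation; it converts a big-endian byte slice into
      little-endian-ordered words, consuming eight bytes per step where
      possible and falling back to a byte loop for the leftover prefix.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// wordsFromBytes interprets buf as a big-endian unsigned integer and
// returns its 64-bit words, least significant word first.
func wordsFromBytes(buf []byte) []uint64 {
	const s = 8 // bytes per word
	z := make([]uint64, (len(buf)+s-1)/s)

	i, k := len(buf), 0
	// Fast path: take a whole word from the tail of buf at a time.
	for ; i >= s; i -= s {
		z[k] = binary.BigEndian.Uint64(buf[i-s : i])
		k++
	}
	// Slow path: fewer than s leading bytes remain.
	if i > 0 {
		var d uint64
		for shift := uint(0); i > 0; shift += 8 {
			d |= uint64(buf[i-1]) << shift
			i--
		}
		z[k] = d
	}
	return z
}

func main() {
	z := wordsFromBytes([]byte{1, 2, 3, 4, 5, 6, 7, 8, 9})
	fmt.Printf("%#x\n", z)
}
```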
      
      The performance improves for slices bigger than _S bytes.
      This is the case for 128/256-bit arithmetic that initializes
      its objects from bytes.
      
      name               old time/op  new time/op  delta
      NatSetBytes/8-4    29.8ns ± 1%  11.4ns ± 0%  -61.63%  (p=0.000 n=9+8)
      NatSetBytes/24-4    109ns ± 1%    56ns ± 0%  -48.75%  (p=0.000 n=9+8)
      NatSetBytes/128-4   420ns ± 2%   110ns ± 1%  -73.83%  (p=0.000 n=10+10)
      NatSetBytes/7-4    26.2ns ± 1%  21.3ns ± 2%  -18.63%  (p=0.000 n=8+9)
      NatSetBytes/23-4    106ns ± 1%    67ns ± 1%  -36.93%  (p=0.000 n=9+10)
      NatSetBytes/127-4   410ns ± 2%   121ns ± 0%  -70.46%  (p=0.000 n=9+8)
      
      Found this optimization opportunity by looking at ethereum_corevm
      community benchmark cpuprofile.
      
      name        old time/op  new time/op  delta
      OpDiv256-4   715ns ± 1%   596ns ± 1%  -16.57%  (p=0.008 n=5+5)
      OpDiv128-4   373ns ± 1%   314ns ± 1%  -15.83%  (p=0.008 n=5+5)
      OpDiv64-4    301ns ± 0%   285ns ± 1%   -5.12%  (p=0.008 n=5+5)
      
      Change-Id: I8e5a680ae6284c8233d8d7431d51253a8a740b57
      Reviewed-on: https://go-review.googlesource.com/98775
      Run-TryBot: Iskander Sharipov <iskander.sharipov@intel.com>
      Reviewed-by: Robert Griesemer <gri@golang.org>
      TryBot-Result: Gobot Gobot <gobot@golang.org>
    • cmd/compile: cleanup nodpc and nodfp · dcac984b
      Matthew Dempsky authored
      Instead of creating a new &nodfp expression for every recover() call,
      or a new nodpc variable for every function instrumented by the race
      detector, this CL introduces two new uintptr-typed pseudo-variables
      callerSP and callerPC. These pseudo-variables act just like calls to
      the runtime's getcallersp() and getcallerpc() functions.
      
      For consistency, change runtime.gorecover's builtin stub's parameter
      type from "*int32" to "uintptr".
      
      Passes toolstash-check, but toolstash-check -race fails because of
      register allocator changes.
      
      Change-Id: I985d644653de2dac8b7b03a28829ad04dfd4f358
      Reviewed-on: https://go-review.googlesource.com/99416
      Run-TryBot: Matthew Dempsky <mdempsky@google.com>
      TryBot-Result: Gobot Gobot <gobot@golang.org>
      Reviewed-by: Daniel Martí <mvdan@mvdan.cc>
      Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
    • cmd/compile: remove two out-of-phase calls to walk · 6a5cfa8b
      Matthew Dempsky authored
      All calls to walkstmt/walkexpr/etc should be rooted from funccompile,
      whereas transformclosure and fninit are called by main.
      
      Passes toolstash-check.
      
      Change-Id: Ic880e2d2d83af09618ce4daa8e7716f6b389e53e
      Reviewed-on: https://go-review.googlesource.com/99418
      Run-TryBot: Matthew Dempsky <mdempsky@google.com>
      TryBot-Result: Gobot Gobot <gobot@golang.org>
      Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
    • cmd/compile: remove state.exitCode · 8b766e5d
      Matthew Dempsky authored
      We're holding onto the function's complete AST anyway, so might as
      well grab the exit code from there.
      
      Passes toolstash-check.
      
      Change-Id: I851b5dfdb53f991e9cd9488d25d0d2abc2a8379f
      Reviewed-on: https://go-review.googlesource.com/99417
      Run-TryBot: Matthew Dempsky <mdempsky@google.com>
      TryBot-Result: Gobot Gobot <gobot@golang.org>
      Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
    • cmd/compile: fuse escape analysis parameter tagging loops · e3127f02
      Matthew Dempsky authored
      Simplifies the code somewhat and allows removing Param.Field.
      
      Passes toolstash-check.
      
      Change-Id: Id854416aea8afd27ce4830ff0f5ff940f7353792
      Reviewed-on: https://go-review.googlesource.com/99336
      Run-TryBot: Matthew Dempsky <mdempsky@google.com>
      TryBot-Result: Gobot Gobot <gobot@golang.org>
      Reviewed-by: Cherry Zhang <cherryyz@google.com>
    • net/http: panic when a nil handler is passed to (*ServeMux)HandleFunc · 7d654af5
      Kunpei Sakai authored
      Fixes #24297
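
      The behavior named in the subject line can be observed as follows.
      This is a small sketch with a hypothetical helper, registerNil. It
      checks only that registration panics and does not rely on the exact
      panic message.

```go
package main

import (
	"fmt"
	"net/http"
)

// registerNil registers a nil handler function on a fresh ServeMux and
// returns whatever value the call panicked with, or nil if it did not.
func registerNil() (recovered interface{}) {
	defer func() { recovered = recover() }()
	mux := http.NewServeMux()
	mux.HandleFunc("/", nil)
	return nil
}

func main() {
	fmt.Println("panicked:", registerNil() != nil)
}
```

      Panicking at registration time surfaces the mistake where it is
      made, rather than as a nil function call when the first request
      arrives.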
      
      Change-Id: I759e88655632fda97dced240b3f13392b2785d0a
      Reviewed-on: https://go-review.googlesource.com/99575
      Reviewed-by: Andrew Bonventre <andybons@golang.org>
      Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
      Run-TryBot: Andrew Bonventre <andybons@golang.org>
      TryBot-Result: Gobot Gobot <gobot@golang.org>
    • time: add support for parsing timezones denoted by sign and offset · 9f2c611f
      Michael Kasch authored
      IANA Zoneinfo does not provide names for all timezones. Some are denoted
      by a sign and an offset only. E.g., Europe/Turkey is currently +03 and
      America/La_Paz is -04 (https://data.iana.org/time-zones/releases/tzdata2018c.tar.gz).
      
      Fixes #24071
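
      A short sketch of what this enables, assuming a Go release that
      includes the change. zoneAbbrev is a hypothetical helper and the
      layout string is chosen only for illustration. The example inspects
      just the parsed abbreviation, since the offset attached to an
      abbreviation depends on how time.Parse resolves it.

```go
package main

import (
	"fmt"
	"time"
)

// zoneAbbrev parses value with a layout whose zone element is "MST" and
// returns the zone abbreviation recorded in the result.
func zoneAbbrev(value string) (string, error) {
	t, err := time.Parse("2006-01-02 15:04 MST", value)
	if err != nil {
		return "", err
	}
	name, _ := t.Zone()
	return name, nil
}

func main() {
	fmt.Println(zoneAbbrev("2018-03-08 12:00 +03"))
	fmt.Println(zoneAbbrev("2018-03-08 12:00 -04"))
}
```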
      
      Change-Id: I9c230a719945e1263c5b52bab82084d22861be3e
      Reviewed-on: https://go-review.googlesource.com/98157
      Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
    • runtime: use systemstack around throw in sysSigaction · 3d69ef37
      Ian Lance Taylor authored
      Try to fix the build on ppc64-linux and ppc64le-linux, avoiding:
      
      --- FAIL: TestInlinedRoutineRecords (2.12s)
      	dwarf_test.go:97: build: # command-line-arguments
      		runtime.systemstack: nosplit stack overflow
      			752	assumed on entry to runtime.sigtrampgo (nosplit)
      			480	after runtime.sigtrampgo (nosplit) uses 272
      			400	after runtime.sigfwdgo (nosplit) uses 80
      			264	after runtime.setsig (nosplit) uses 136
      			208	after runtime.sigaction (nosplit) uses 56
      			136	after runtime.sysSigaction (nosplit) uses 72
      			88	after runtime.throw (nosplit) uses 48
      			16	after runtime.dopanic (nosplit) uses 72
      			-16	after runtime.systemstack (nosplit) uses 32
      
      	dwarf_test.go:98: build error: exit status 2
      --- FAIL: TestAbstractOriginSanity (10.22s)
      	dwarf_test.go:97: build: # command-line-arguments
      		runtime.systemstack: nosplit stack overflow
      			752	assumed on entry to runtime.sigtrampgo (nosplit)
      			480	after runtime.sigtrampgo (nosplit) uses 272
      			400	after runtime.sigfwdgo (nosplit) uses 80
      			264	after runtime.setsig (nosplit) uses 136
      			208	after runtime.sigaction (nosplit) uses 56
      			136	after runtime.sysSigaction (nosplit) uses 72
      			88	after runtime.throw (nosplit) uses 48
      			16	after runtime.dopanic (nosplit) uses 72
      			-16	after runtime.systemstack (nosplit) uses 32
      
      	dwarf_test.go:98: build error: exit status 2
      FAIL
      FAIL	cmd/link/internal/ld	13.404s
      
      Change-Id: I4840604adb0e9f68a8d8e24f2f2a1a17d1634a58
      Reviewed-on: https://go-review.googlesource.com/99415
      Reviewed-by: Austin Clements <austin@google.com>
    • test/codegen: port 2^n muls tests to codegen harness · 3772b2e1
      Alberto Donizetti authored
      And delete them from the asm_test.go file.
      
      Change-Id: I124c8c352299646ec7db0968cdb0fe59a3b5d83d
      Reviewed-on: https://go-review.googlesource.com/99475
      Run-TryBot: Alberto Donizetti <alb.donizetti@gmail.com>
      TryBot-Result: Gobot Gobot <gobot@golang.org>
      Reviewed-by: Giovanni Bajo <rasky@develer.com>
    • math/big: optimize addVW and subVW on arm64 · 0585d41c
      erifan01 authored
      The biggest hot spot of the existing implementation is its "load"
      operations, which lead to poor performance.
      By unrolling the loop 4 times and 2 times, and using the "LDP" and "STP" instructions,
      this CL can reduce the "load" cost and improve performance.
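
      The CL itself is arm64 assembly. The portable Go sketch below only
      illustrates the shape of a 4-way unrolled vector-plus-word addition
      with carry propagation. The addVW here is a hypothetical stand-in
      that assumes 64-bit words and len(z) >= len(x); it says nothing about
      LDP/STP or the measured speedups.

```go
package main

import (
	"fmt"
	"math/bits"
)

// addVW computes z = x + y for a multi-word x and a single word y,
// returning the final carry. The main loop handles four words per
// iteration; a scalar loop handles the remainder.
func addVW(z, x []uint64, y uint64) (c uint64) {
	c = y // the single word enters as the initial "carry"
	i := 0
	for ; i+4 <= len(x); i += 4 {
		z[i], c = bits.Add64(x[i], c, 0)
		z[i+1], c = bits.Add64(x[i+1], c, 0)
		z[i+2], c = bits.Add64(x[i+2], c, 0)
		z[i+3], c = bits.Add64(x[i+3], c, 0)
	}
	for ; i < len(x); i++ {
		z[i], c = bits.Add64(x[i], c, 0)
	}
	return c
}

func main() {
	max := ^uint64(0)
	x := []uint64{max, max, max, max, 1}
	z := make([]uint64, len(x))
	c := addVW(z, x, 1)
	fmt.Println(z, c)
}
```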
      
      Benchmarks:
      
      name                              old time/op    new time/op     delta
      AddVV/1-8                           21.5ns ± 0%     21.5ns ± 0%      ~     (all equal)
      AddVV/2-8                           13.5ns ± 0%     13.5ns ± 0%      ~     (all equal)
      AddVV/3-8                           15.5ns ± 0%     15.5ns ± 0%      ~     (all equal)
      AddVV/4-8                           17.5ns ± 0%     17.5ns ± 0%      ~     (all equal)
      AddVV/5-8                           19.5ns ± 0%     19.5ns ± 0%      ~     (all equal)
      AddVV/10-8                          29.5ns ± 0%     29.5ns ± 0%      ~     (all equal)
      AddVV/100-8                          217ns ± 0%      217ns ± 0%      ~     (all equal)
      AddVV/1000-8                        2.02µs ± 0%     2.02µs ± 0%      ~     (all equal)
      AddVV/10000-8                       20.3µs ± 0%     20.3µs ± 0%      ~     (p=0.603 n=5+5)
      AddVV/100000-8                       223µs ± 7%      228µs ± 8%      ~     (p=0.548 n=5+5)
      AddVW/1-8                           9.32ns ± 0%     9.26ns ± 0%    -0.64%  (p=0.008 n=5+5)
      AddVW/2-8                           19.8ns ± 3%     10.5ns ± 0%   -46.92%  (p=0.008 n=5+5)
      AddVW/3-8                           11.5ns ± 0%     11.0ns ± 0%    -4.35%  (p=0.008 n=5+5)
      AddVW/4-8                           13.0ns ± 0%     12.0ns ± 0%    -7.69%  (p=0.008 n=5+5)
      AddVW/5-8                           14.5ns ± 0%     12.5ns ± 0%   -13.79%  (p=0.008 n=5+5)
      AddVW/10-8                          22.0ns ± 0%     15.5ns ± 0%   -29.55%  (p=0.008 n=5+5)
      AddVW/100-8                          167ns ± 0%       81ns ± 0%   -51.44%  (p=0.008 n=5+5)
      AddVW/1000-8                        1.52µs ± 0%     0.64µs ± 0%   -57.58%  (p=0.008 n=5+5)
      AddVW/10000-8                       15.1µs ± 0%      7.2µs ± 0%   -52.55%  (p=0.008 n=5+5)
      AddVW/100000-8                       150µs ± 0%       71µs ± 0%   -52.95%  (p=0.008 n=5+5)
      SubVW/1-8                           9.32ns ± 0%     9.26ns ± 0%    -0.64%  (p=0.008 n=5+5)
      SubVW/2-8                           19.7ns ± 2%     10.5ns ± 0%   -46.70%  (p=0.008 n=5+5)
      SubVW/3-8                           11.5ns ± 0%     11.0ns ± 0%    -4.35%  (p=0.008 n=5+5)
      SubVW/4-8                           13.0ns ± 0%     12.0ns ± 0%    -7.69%  (p=0.008 n=5+5)
      SubVW/5-8                           14.5ns ± 0%     12.5ns ± 0%   -13.79%  (p=0.008 n=5+5)
      SubVW/10-8                          22.0ns ± 0%     15.5ns ± 0%   -29.55%  (p=0.008 n=5+5)
      SubVW/100-8                          167ns ± 0%       81ns ± 0%   -51.44%  (p=0.008 n=5+5)
      SubVW/1000-8                        1.52µs ± 0%     0.64µs ± 0%   -57.58%  (p=0.008 n=5+5)
      SubVW/10000-8                       15.1µs ± 0%      7.2µs ± 0%   -52.49%  (p=0.008 n=5+5)
      SubVW/100000-8                       150µs ± 0%       71µs ± 0%   -52.91%  (p=0.008 n=5+5)
      AddMulVVW/1-8                       32.4ns ± 1%     32.6ns ± 1%      ~     (p=0.119 n=5+5)
      AddMulVVW/2-8                       57.0ns ± 0%     57.0ns ± 0%      ~     (p=0.643 n=5+5)
      AddMulVVW/3-8                       90.8ns ± 0%     90.7ns ± 0%      ~     (p=0.524 n=5+5)
      AddMulVVW/4-8                        118ns ± 0%      118ns ± 1%      ~     (p=1.000 n=4+5)
      AddMulVVW/5-8                        144ns ± 1%      144ns ± 0%      ~     (p=0.794 n=5+4)
      AddMulVVW/10-8                       294ns ± 1%      296ns ± 0%    +0.48%  (p=0.040 n=5+5)
      AddMulVVW/100-8                     2.73µs ± 0%     2.73µs ± 0%      ~     (p=0.278 n=5+5)
      AddMulVVW/1000-8                    26.0µs ± 0%     26.5µs ± 0%    +2.14%  (p=0.008 n=5+5)
      AddMulVVW/10000-8                    297µs ± 0%      297µs ± 0%    +0.24%  (p=0.008 n=5+5)
      AddMulVVW/100000-8                  3.15ms ± 1%     3.13ms ± 0%      ~     (p=0.690 n=5+5)
      DecimalConversion-8                  311µs ± 2%      309µs ± 2%      ~     (p=0.310 n=5+5)
      FloatString/100-8                   2.55µs ± 2%     2.54µs ± 2%      ~     (p=1.000 n=5+5)
      FloatString/1000-8                  58.1µs ± 0%     58.1µs ± 0%      ~     (p=0.151 n=5+5)
      FloatString/10000-8                 4.59ms ± 0%     4.59ms ± 0%      ~     (p=0.151 n=5+5)
      FloatString/100000-8                 446ms ± 0%      446ms ± 0%    +0.01%  (p=0.016 n=5+5)
      FloatAdd/10-8                        183ns ± 0%      183ns ± 0%      ~     (p=0.333 n=4+5)
      FloatAdd/100-8                       187ns ± 1%      192ns ± 2%      ~     (p=0.056 n=5+5)
      FloatAdd/1000-8                      369ns ± 0%      371ns ± 0%    +0.54%  (p=0.016 n=4+5)
      FloatAdd/10000-8                    1.88µs ± 0%     1.88µs ± 0%    -0.14%  (p=0.000 n=4+5)
      FloatAdd/100000-8                   17.2µs ± 0%     17.1µs ± 0%    -0.37%  (p=0.008 n=5+5)
      FloatSub/10-8                        147ns ± 0%      147ns ± 0%      ~     (all equal)
      FloatSub/100-8                       145ns ± 0%      146ns ± 0%      ~     (p=0.238 n=5+4)
      FloatSub/1000-8                      241ns ± 0%      241ns ± 0%      ~     (p=0.333 n=5+4)
      FloatSub/10000-8                    1.06µs ± 0%     1.06µs ± 0%      ~     (p=0.444 n=5+5)
      FloatSub/100000-8                   9.50µs ± 0%     9.48µs ± 0%    -0.14%  (p=0.008 n=5+5)
      ParseFloatSmallExp-8                28.4µs ± 2%     28.5µs ± 1%      ~     (p=0.690 n=5+5)
      ParseFloatLargeExp-8                 125µs ± 1%      124µs ± 1%      ~     (p=0.095 n=5+5)
      GCD10x10/WithoutXY-8                 277ns ± 2%      278ns ± 3%      ~     (p=0.937 n=5+5)
      GCD10x10/WithXY-8                   2.08µs ± 3%     2.15µs ± 3%      ~     (p=0.056 n=5+5)
      GCD10x100/WithoutXY-8                592ns ± 3%      613ns ± 4%      ~     (p=0.056 n=5+5)
      GCD10x100/WithXY-8                  3.40µs ± 2%     3.42µs ± 4%      ~     (p=0.841 n=5+5)
      GCD10x1000/WithoutXY-8              1.37µs ± 2%     1.35µs ± 3%      ~     (p=0.460 n=5+5)
      GCD10x1000/WithXY-8                 7.34µs ± 2%     7.33µs ± 4%      ~     (p=0.841 n=5+5)
      GCD10x10000/WithoutXY-8             8.52µs ± 0%     8.51µs ± 1%      ~     (p=0.421 n=5+5)
      GCD10x10000/WithXY-8                27.5µs ± 2%     27.2µs ± 1%      ~     (p=0.151 n=5+5)
      GCD10x100000/WithoutXY-8            78.3µs ± 1%     78.5µs ± 1%      ~     (p=0.690 n=5+5)
      GCD10x100000/WithXY-8                231µs ± 0%      229µs ± 1%    -1.11%  (p=0.016 n=5+5)
      GCD100x100/WithoutXY-8              1.86µs ± 2%     1.86µs ± 2%      ~     (p=0.881 n=5+5)
      GCD100x100/WithXY-8                 27.1µs ± 2%     27.2µs ± 1%      ~     (p=0.421 n=5+5)
      GCD100x1000/WithoutXY-8             4.44µs ± 2%     4.41µs ± 1%      ~     (p=0.310 n=5+5)
      GCD100x1000/WithXY-8                36.3µs ± 1%     36.2µs ± 1%      ~     (p=0.310 n=5+5)
      GCD100x10000/WithoutXY-8            22.6µs ± 2%     22.5µs ± 1%      ~     (p=0.690 n=5+5)
      GCD100x10000/WithXY-8                145µs ± 1%      145µs ± 1%      ~     (p=1.000 n=5+5)
      GCD100x100000/WithoutXY-8            195µs ± 0%      196µs ± 1%      ~     (p=0.548 n=5+5)
      GCD100x100000/WithXY-8              1.10ms ± 0%     1.10ms ± 0%    -0.30%  (p=0.016 n=5+5)
      GCD1000x1000/WithoutXY-8            25.0µs ± 1%     25.2µs ± 2%      ~     (p=0.222 n=5+5)
      GCD1000x1000/WithXY-8                520µs ± 0%      520µs ± 1%      ~     (p=0.151 n=5+5)
      GCD1000x10000/WithoutXY-8           57.0µs ± 1%     56.9µs ± 1%      ~     (p=0.690 n=5+5)
      GCD1000x10000/WithXY-8              1.21ms ± 0%     1.21ms ± 1%      ~     (p=0.881 n=5+5)
      GCD1000x100000/WithoutXY-8           358µs ± 0%      359µs ± 1%      ~     (p=0.548 n=5+5)
      GCD1000x100000/WithXY-8             8.73ms ± 0%     8.73ms ± 0%      ~     (p=0.548 n=5+5)
      GCD10000x10000/WithoutXY-8           686µs ± 0%      687µs ± 0%      ~     (p=0.548 n=5+5)
      GCD10000x10000/WithXY-8             15.9ms ± 0%     15.9ms ± 0%      ~     (p=0.841 n=5+5)
      GCD10000x100000/WithoutXY-8         2.08ms ± 0%     2.08ms ± 0%      ~     (p=1.000 n=5+5)
      GCD10000x100000/WithXY-8            86.7ms ± 0%     86.7ms ± 0%      ~     (p=1.000 n=5+5)
      GCD100000x100000/WithoutXY-8        51.1ms ± 0%     51.0ms ± 0%      ~     (p=0.151 n=5+5)
      GCD100000x100000/WithXY-8            1.23s ± 0%      1.23s ± 0%      ~     (p=0.841 n=5+5)
      Hilbert-8                           2.41ms ± 1%     2.42ms ± 2%      ~     (p=0.690 n=5+5)
      Binomial-8                          4.86µs ± 1%     4.86µs ± 1%      ~     (p=0.889 n=5+5)
      QuoRem-8                            7.09µs ± 0%     7.08µs ± 0%    -0.09%  (p=0.024 n=5+5)
      Exp-8                                161ms ± 0%      161ms ± 0%    -0.08%  (p=0.032 n=5+5)
      Exp2-8                               161ms ± 0%      161ms ± 0%      ~     (p=1.000 n=5+5)
      Bitset-8                            40.7ns ± 0%     40.6ns ± 0%      ~     (p=0.095 n=4+5)
      BitsetNeg-8                          159ns ± 4%      148ns ± 0%    -6.92%  (p=0.016 n=5+4)
      BitsetOrig-8                         378ns ± 1%      378ns ± 1%      ~     (p=0.937 n=5+5)
      BitsetNegOrig-8                      647ns ± 5%      647ns ± 4%      ~     (p=1.000 n=5+5)
      ModSqrt225_Tonelli-8                7.26ms ± 0%     7.27ms ± 0%      ~     (p=1.000 n=5+5)
      ModSqrt224_3Mod4-8                  2.24ms ± 0%     2.24ms ± 0%      ~     (p=0.690 n=5+5)
      ModSqrt5430_Tonelli-8                62.8s ± 1%      62.5s ± 0%      ~     (p=0.063 n=5+4)
      ModSqrt5430_3Mod4-8                  20.8s ± 0%      20.8s ± 0%      ~     (p=0.310 n=5+5)
      Sqrt-8                               101µs ± 1%      101µs ± 0%    -0.35%  (p=0.032 n=5+5)
      IntSqr/1-8                          32.3ns ± 1%     32.5ns ± 1%      ~     (p=0.421 n=5+5)
      IntSqr/2-8                           157ns ± 5%      156ns ± 5%      ~     (p=0.651 n=5+5)
      IntSqr/3-8                           292ns ± 2%      291ns ± 3%      ~     (p=0.881 n=5+5)
      IntSqr/5-8                           738ns ± 6%      740ns ± 5%      ~     (p=0.841 n=5+5)
      IntSqr/8-8                          1.82µs ± 4%     1.83µs ± 4%      ~     (p=0.730 n=5+5)
      IntSqr/10-8                         2.92µs ± 1%     2.93µs ± 1%      ~     (p=0.643 n=5+5)
      IntSqr/20-8                         6.28µs ± 2%     6.28µs ± 2%      ~     (p=1.000 n=5+5)
      IntSqr/30-8                         13.8µs ± 2%     13.9µs ± 3%      ~     (p=1.000 n=5+5)
      IntSqr/50-8                         37.8µs ± 4%     37.9µs ± 4%      ~     (p=0.690 n=5+5)
      IntSqr/80-8                         95.9µs ± 1%     95.8µs ± 1%      ~     (p=0.841 n=5+5)
      IntSqr/100-8                         148µs ± 1%      148µs ± 1%      ~     (p=0.310 n=5+5)
      IntSqr/200-8                         586µs ± 1%      586µs ± 1%      ~     (p=0.841 n=5+5)
      IntSqr/300-8                        1.32ms ± 0%     1.31ms ± 0%      ~     (p=0.222 n=5+5)
      IntSqr/500-8                        2.48ms ± 0%     2.48ms ± 0%      ~     (p=0.556 n=5+4)
      IntSqr/800-8                        4.68ms ± 0%     4.68ms ± 0%      ~     (p=0.548 n=5+5)
      IntSqr/1000-8                       7.57ms ± 0%     7.56ms ± 0%      ~     (p=0.421 n=5+5)
      Mul-8                                311ms ± 0%      311ms ± 0%      ~     (p=0.548 n=5+5)
      Exp3Power/0x10-8                     559ns ± 1%      560ns ± 1%      ~     (p=0.984 n=5+5)
      Exp3Power/0x40-8                     641ns ± 1%      634ns ± 1%      ~     (p=0.063 n=5+5)
      Exp3Power/0x100-8                   1.39µs ± 2%     1.40µs ± 2%      ~     (p=0.381 n=5+5)
      Exp3Power/0x400-8                   8.27µs ± 1%     8.26µs ± 0%      ~     (p=0.571 n=5+5)
      Exp3Power/0x1000-8                  59.9µs ± 0%     59.7µs ± 0%    -0.23%  (p=0.008 n=5+5)
      Exp3Power/0x4000-8                   816µs ± 0%      816µs ± 0%      ~     (p=1.000 n=5+5)
      Exp3Power/0x10000-8                 7.77ms ± 0%     7.77ms ± 0%      ~     (p=0.841 n=5+5)
      Exp3Power/0x40000-8                 73.4ms ± 0%     73.4ms ± 0%      ~     (p=0.690 n=5+5)
      Exp3Power/0x100000-8                 665ms ± 0%      664ms ± 0%    -0.14%  (p=0.008 n=5+5)
      Exp3Power/0x400000-8                 5.98s ± 0%      5.98s ± 0%    -0.09%  (p=0.008 n=5+5)
      Fibo-8                               116ms ± 0%      116ms ± 0%    -0.25%  (p=0.008 n=5+5)
      NatSqr/1-8                           115ns ± 3%      116ns ± 2%      ~     (p=0.238 n=5+5)
      NatSqr/2-8                           237ns ± 1%      237ns ± 1%      ~     (p=0.683 n=5+5)
      NatSqr/3-8                           367ns ± 3%      368ns ± 3%      ~     (p=0.817 n=5+5)
      NatSqr/5-8                           807ns ± 3%      812ns ± 3%      ~     (p=0.913 n=5+5)
      NatSqr/8-8                          1.93µs ± 2%     1.93µs ± 3%      ~     (p=0.651 n=5+5)
      NatSqr/10-8                         2.98µs ± 2%     2.99µs ± 2%      ~     (p=0.690 n=5+5)
      NatSqr/20-8                         6.49µs ± 2%     6.46µs ± 2%      ~     (p=0.548 n=5+5)
      NatSqr/30-8                         14.4µs ± 2%     14.3µs ± 2%      ~     (p=0.690 n=5+5)
      NatSqr/50-8                         38.6µs ± 2%     38.7µs ± 2%      ~     (p=0.841 n=5+5)
      NatSqr/80-8                         96.1µs ± 2%     95.8µs ± 2%      ~     (p=0.548 n=5+5)
      NatSqr/100-8                         149µs ± 1%      149µs ± 1%      ~     (p=0.841 n=5+5)
      NatSqr/200-8                         593µs ± 1%      590µs ± 1%      ~     (p=0.421 n=5+5)
      NatSqr/300-8                        1.32ms ± 0%     1.32ms ± 1%      ~     (p=0.222 n=5+5)
      NatSqr/500-8                        2.49ms ± 0%     2.49ms ± 0%      ~     (p=0.690 n=5+5)
      NatSqr/800-8                        4.69ms ± 0%     4.69ms ± 0%      ~     (p=1.000 n=5+5)
      NatSqr/1000-8                       7.59ms ± 0%     7.58ms ± 0%      ~     (p=0.841 n=5+5)
      ScanPi-8                             322µs ± 0%      321µs ± 0%      ~     (p=0.095 n=5+5)
      StringPiParallel-8                  71.4µs ± 5%     68.8µs ± 4%      ~     (p=0.151 n=5+5)
      Scan/10/Base2-8                     1.10µs ± 0%     1.09µs ± 0%    -0.36%  (p=0.032 n=5+5)
      Scan/100/Base2-8                    7.78µs ± 0%     7.79µs ± 0%    +0.14%  (p=0.008 n=5+5)
      Scan/1000/Base2-8                   78.8µs ± 0%     79.0µs ± 0%    +0.24%  (p=0.008 n=5+5)
      Scan/10000/Base2-8                  1.22ms ± 0%     1.22ms ± 0%      ~     (p=0.056 n=5+5)
      Scan/100000/Base2-8                 55.1ms ± 0%     55.0ms ± 0%    -0.15%  (p=0.008 n=5+5)
      Scan/10/Base8-8                      514ns ± 0%      515ns ± 0%      ~     (p=0.079 n=5+5)
      Scan/100/Base8-8                    2.89µs ± 0%     2.89µs ± 0%    +0.15%  (p=0.008 n=5+5)
      Scan/1000/Base8-8                   31.0µs ± 0%     31.1µs ± 0%    +0.12%  (p=0.008 n=5+5)
      Scan/10000/Base8-8                   740µs ± 0%      740µs ± 0%      ~     (p=0.222 n=5+5)
      Scan/100000/Base8-8                 50.6ms ± 0%     50.5ms ± 0%    -0.06%  (p=0.016 n=4+5)
      Scan/10/Base10-8                     492ns ± 1%      490ns ± 1%      ~     (p=0.310 n=5+5)
      Scan/100/Base10-8                   2.67µs ± 0%     2.67µs ± 0%      ~     (p=0.056 n=5+5)
      Scan/1000/Base10-8                  28.7µs ± 0%     28.7µs ± 0%      ~     (p=1.000 n=5+5)
      Scan/10000/Base10-8                  717µs ± 0%      716µs ± 0%      ~     (p=0.222 n=5+5)
      Scan/100000/Base10-8                50.2ms ± 0%     50.3ms ± 0%    +0.05%  (p=0.008 n=5+5)
      Scan/10/Base16-8                     442ns ± 1%      442ns ± 0%      ~     (p=0.468 n=5+5)
      Scan/100/Base16-8                   2.46µs ± 0%     2.45µs ± 0%      ~     (p=0.159 n=5+5)
      Scan/1000/Base16-8                  27.2µs ± 0%     27.2µs ± 0%      ~     (p=0.841 n=5+5)
      Scan/10000/Base16-8                  721µs ± 0%      722µs ± 0%      ~     (p=0.548 n=5+5)
      Scan/100000/Base16-8                52.6ms ± 0%     52.6ms ± 0%    +0.07%  (p=0.008 n=5+5)
      String/10/Base2-8                    244ns ± 1%      242ns ± 1%      ~     (p=0.103 n=5+5)
      String/100/Base2-8                  1.48µs ± 0%     1.48µs ± 1%      ~     (p=0.786 n=5+5)
      String/1000/Base2-8                 13.3µs ± 1%     13.3µs ± 0%      ~     (p=0.222 n=5+5)
      String/10000/Base2-8                 132µs ± 1%      132µs ± 1%      ~     (p=1.000 n=5+5)
      String/100000/Base2-8               1.30ms ± 1%     1.30ms ± 1%      ~     (p=1.000 n=5+5)
      String/10/Base8-8                    167ns ± 1%      168ns ± 1%      ~     (p=0.135 n=5+5)
      String/100/Base8-8                   623ns ± 1%      626ns ± 1%      ~     (p=0.151 n=5+5)
      String/1000/Base8-8                 5.24µs ± 1%     5.24µs ± 0%      ~     (p=1.000 n=5+5)
      String/10000/Base8-8                50.0µs ± 1%     50.0µs ± 1%      ~     (p=1.000 n=5+5)
      String/100000/Base8-8                492µs ± 1%      489µs ± 1%      ~     (p=0.056 n=5+5)
      String/10/Base10-8                   503ns ± 1%      501ns ± 0%      ~     (p=0.183 n=5+5)
      String/100/Base10-8                 1.96µs ± 0%     1.97µs ± 0%      ~     (p=0.389 n=5+5)
      String/1000/Base10-8                12.4µs ± 1%     12.4µs ± 1%      ~     (p=0.841 n=5+5)
      String/10000/Base10-8               56.7µs ± 1%     56.6µs ± 0%      ~     (p=1.000 n=5+5)
      String/100000/Base10-8              25.6ms ± 0%     25.6ms ± 0%      ~     (p=0.222 n=5+5)
      String/10/Base16-8                   147ns ± 0%      148ns ± 2%      ~     (p=1.000 n=4+5)
      String/100/Base16-8                  505ns ± 0%      505ns ± 1%      ~     (p=0.778 n=5+5)
      String/1000/Base16-8                3.94µs ± 0%     3.94µs ± 0%      ~     (p=0.841 n=5+5)
      String/10000/Base16-8               37.4µs ± 1%     37.2µs ± 1%      ~     (p=0.095 n=5+5)
      String/100000/Base16-8               367µs ± 1%      367µs ± 0%      ~     (p=1.000 n=5+5)
      LeafSize/0-8                        6.64ms ± 0%     6.65ms ± 0%      ~     (p=0.690 n=5+5)
      LeafSize/1-8                        72.5µs ± 1%     72.4µs ± 1%      ~     (p=0.841 n=5+5)
      LeafSize/2-8                        72.6µs ± 1%     72.6µs ± 1%      ~     (p=1.000 n=5+5)
      LeafSize/3-8                         377µs ± 0%      377µs ± 0%      ~     (p=0.421 n=5+5)
      LeafSize/4-8                        71.2µs ± 1%     71.3µs ± 0%      ~     (p=0.278 n=5+5)
      LeafSize/5-8                         469µs ± 0%      469µs ± 0%      ~     (p=0.310 n=5+5)
      LeafSize/6-8                         376µs ± 0%      376µs ± 0%      ~     (p=0.841 n=5+5)
      LeafSize/7-8                         244µs ± 0%      244µs ± 0%      ~     (p=0.841 n=5+5)
      LeafSize/8-8                        71.9µs ± 1%     72.1µs ± 1%      ~     (p=0.548 n=5+5)
      LeafSize/9-8                         536µs ± 0%      536µs ± 0%      ~     (p=0.151 n=5+5)
      LeafSize/10-8                        470µs ± 0%      471µs ± 0%    +0.10%  (p=0.032 n=5+5)
      LeafSize/11-8                        458µs ± 0%      458µs ± 0%      ~     (p=0.881 n=5+5)
      LeafSize/12-8                        376µs ± 0%      376µs ± 0%      ~     (p=0.548 n=5+5)
      LeafSize/13-8                        341µs ± 0%      342µs ± 0%      ~     (p=0.222 n=5+5)
      LeafSize/14-8                        246µs ± 0%      245µs ± 0%      ~     (p=0.167 n=5+5)
      LeafSize/15-8                        168µs ± 0%      168µs ± 0%      ~     (p=0.548 n=5+5)
      LeafSize/16-8                       72.1µs ± 1%     72.2µs ± 1%      ~     (p=0.690 n=5+5)
      LeafSize/32-8                       81.5µs ± 1%     81.4µs ± 1%      ~     (p=1.000 n=5+5)
      LeafSize/64-8                        133µs ± 1%      134µs ± 1%      ~     (p=0.690 n=5+5)
      ProbablyPrime/n=0-8                 44.3ms ± 0%     44.2ms ± 0%    -0.28%  (p=0.008 n=5+5)
      ProbablyPrime/n=1-8                 64.8ms ± 0%     64.7ms ± 0%    -0.15%  (p=0.008 n=5+5)
      ProbablyPrime/n=5-8                  147ms ± 0%      147ms ± 0%    -0.11%  (p=0.008 n=5+5)
      ProbablyPrime/n=10-8                 250ms ± 0%      250ms ± 0%      ~     (p=0.056 n=5+5)
      ProbablyPrime/n=20-8                 456ms ± 0%      455ms ± 0%    -0.05%  (p=0.008 n=5+5)
      ProbablyPrime/Lucas-8               23.6ms ± 0%     23.5ms ± 0%    -0.29%  (p=0.008 n=5+5)
      ProbablyPrime/MillerRabinBase2-8    20.6ms ± 0%     20.6ms ± 0%      ~     (p=0.690 n=5+5)
      FloatSqrt/64-8                      2.01µs ± 1%     2.02µs ± 1%      ~     (p=0.421 n=5+5)
      FloatSqrt/128-8                     4.43µs ± 2%     4.38µs ± 2%      ~     (p=0.222 n=5+5)
      FloatSqrt/256-8                     6.64µs ± 1%     6.68µs ± 2%      ~     (p=0.516 n=5+5)
      FloatSqrt/1000-8                    31.9µs ± 0%     31.8µs ± 0%      ~     (p=0.095 n=5+5)
      FloatSqrt/10000-8                    595µs ± 0%      594µs ± 0%      ~     (p=0.056 n=5+5)
      FloatSqrt/100000-8                  17.9ms ± 0%     17.9ms ± 0%      ~     (p=0.151 n=5+5)
      FloatSqrt/1000000-8                  1.52s ± 0%      1.52s ± 0%      ~     (p=0.841 n=5+5)
      
      name                              old speed      new speed       delta
      AddVV/1-8                         2.97GB/s ± 0%   2.97GB/s ± 0%      ~     (p=0.971 n=4+4)
      AddVV/2-8                         9.47GB/s ± 0%   9.47GB/s ± 0%    +0.01%  (p=0.016 n=5+5)
      AddVV/3-8                         12.4GB/s ± 0%   12.4GB/s ± 0%      ~     (p=0.548 n=5+5)
      AddVV/4-8                         14.6GB/s ± 0%   14.6GB/s ± 0%      ~     (p=1.000 n=5+5)
      AddVV/5-8                         16.4GB/s ± 0%   16.4GB/s ± 0%      ~     (p=1.000 n=5+5)
      AddVV/10-8                        21.7GB/s ± 0%   21.7GB/s ± 0%      ~     (p=0.548 n=5+5)
      AddVV/100-8                       29.4GB/s ± 0%   29.4GB/s ± 0%      ~     (p=1.000 n=5+5)
      AddVV/1000-8                      31.7GB/s ± 0%   31.7GB/s ± 0%      ~     (p=0.524 n=5+4)
      AddVV/10000-8                     31.5GB/s ± 0%   31.5GB/s ± 0%      ~     (p=0.690 n=5+5)
      AddVV/100000-8                    28.8GB/s ± 7%   28.1GB/s ± 8%      ~     (p=0.548 n=5+5)
      AddVW/1-8                          859MB/s ± 0%    864MB/s ± 0%    +0.61%  (p=0.008 n=5+5)
      AddVW/2-8                          809MB/s ± 2%   1520MB/s ± 0%   +87.78%  (p=0.008 n=5+5)
      AddVW/3-8                         2.08GB/s ± 0%   2.18GB/s ± 0%    +4.54%  (p=0.008 n=5+5)
      AddVW/4-8                         2.46GB/s ± 0%   2.66GB/s ± 0%    +8.33%  (p=0.016 n=4+5)
      AddVW/5-8                         2.76GB/s ± 0%   3.20GB/s ± 0%   +16.03%  (p=0.008 n=5+5)
      AddVW/10-8                        3.63GB/s ± 0%   5.15GB/s ± 0%   +41.83%  (p=0.008 n=5+5)
      AddVW/100-8                       4.79GB/s ± 0%   9.87GB/s ± 0%  +106.12%  (p=0.008 n=5+5)
      AddVW/1000-8                      5.27GB/s ± 0%  12.42GB/s ± 0%  +135.74%  (p=0.008 n=5+5)
      AddVW/10000-8                     5.31GB/s ± 0%  11.19GB/s ± 0%  +110.71%  (p=0.008 n=5+5)
      AddVW/100000-8                    5.32GB/s ± 0%  11.32GB/s ± 0%  +112.56%  (p=0.008 n=5+5)
      SubVW/1-8                          859MB/s ± 0%    864MB/s ± 0%    +0.61%  (p=0.008 n=5+5)
      SubVW/2-8                          812MB/s ± 2%   1520MB/s ± 0%   +87.09%  (p=0.008 n=5+5)
      SubVW/3-8                         2.08GB/s ± 0%   2.18GB/s ± 0%    +4.55%  (p=0.008 n=5+5)
      SubVW/4-8                         2.46GB/s ± 0%   2.66GB/s ± 0%    +8.33%  (p=0.008 n=5+5)
      SubVW/5-8                         2.75GB/s ± 0%   3.20GB/s ± 0%   +16.03%  (p=0.008 n=5+5)
      SubVW/10-8                        3.63GB/s ± 0%   5.15GB/s ± 0%   +41.82%  (p=0.008 n=5+5)
      SubVW/100-8                       4.79GB/s ± 0%   9.87GB/s ± 0%  +106.13%  (p=0.008 n=5+5)
      SubVW/1000-8                      5.27GB/s ± 0%  12.42GB/s ± 0%  +135.74%  (p=0.008 n=5+5)
      SubVW/10000-8                     5.31GB/s ± 0%  11.17GB/s ± 0%  +110.44%  (p=0.008 n=5+5)
      SubVW/100000-8                    5.32GB/s ± 0%  11.31GB/s ± 0%  +112.35%  (p=0.008 n=5+5)
      AddMulVVW/1-8                     1.97GB/s ± 1%   1.96GB/s ± 1%      ~     (p=0.151 n=5+5)
      AddMulVVW/2-8                     2.24GB/s ± 0%   2.25GB/s ± 0%      ~     (p=0.095 n=5+5)
      AddMulVVW/3-8                     2.11GB/s ± 0%   2.12GB/s ± 0%      ~     (p=0.548 n=5+5)
      AddMulVVW/4-8                     2.17GB/s ± 1%   2.17GB/s ± 1%      ~     (p=0.548 n=5+5)
      AddMulVVW/5-8                     2.22GB/s ± 1%   2.21GB/s ± 1%      ~     (p=0.421 n=5+5)
      AddMulVVW/10-8                    2.17GB/s ± 1%   2.16GB/s ± 0%      ~     (p=0.095 n=5+5)
      AddMulVVW/100-8                   2.35GB/s ± 0%   2.35GB/s ± 0%      ~     (p=0.421 n=5+5)
      AddMulVVW/1000-8                  2.47GB/s ± 0%   2.41GB/s ± 0%    -2.09%  (p=0.008 n=5+5)
      AddMulVVW/10000-8                 2.16GB/s ± 0%   2.15GB/s ± 0%    -0.23%  (p=0.008 n=5+5)
      AddMulVVW/100000-8                2.03GB/s ± 1%   2.04GB/s ± 0%      ~     (p=0.690 n=5+5)
      
      name                              old alloc/op   new alloc/op    delta
      FloatString/100-8                     400B ± 0%       400B ± 0%      ~     (all equal)
      FloatString/1000-8                  3.22kB ± 0%     3.22kB ± 0%      ~     (all equal)
      FloatString/10000-8                 55.6kB ± 0%     55.5kB ± 0%      ~     (p=0.206 n=5+5)
      FloatString/100000-8                 627kB ± 0%      627kB ± 0%      ~     (all equal)
      FloatAdd/10-8                        0.00B           0.00B           ~     (all equal)
      FloatAdd/100-8                       0.00B           0.00B           ~     (all equal)
      FloatAdd/1000-8                      0.00B           0.00B           ~     (all equal)
      FloatAdd/10000-8                     0.00B           0.00B           ~     (all equal)
      FloatAdd/100000-8                    0.00B           0.00B           ~     (all equal)
      FloatSub/10-8                        0.00B           0.00B           ~     (all equal)
      FloatSub/100-8                       0.00B           0.00B           ~     (all equal)
      FloatSub/1000-8                      0.00B           0.00B           ~     (all equal)
      FloatSub/10000-8                     0.00B           0.00B           ~     (all equal)
      FloatSub/100000-8                    0.00B           0.00B           ~     (all equal)
      FloatSqrt/64-8                        416B ± 0%       416B ± 0%      ~     (all equal)
      FloatSqrt/128-8                       720B ± 0%       720B ± 0%      ~     (all equal)
      FloatSqrt/256-8                       816B ± 0%       816B ± 0%      ~     (all equal)
      FloatSqrt/1000-8                    2.50kB ± 0%     2.50kB ± 0%      ~     (all equal)
      FloatSqrt/10000-8                   23.5kB ± 0%     23.5kB ± 0%      ~     (all equal)
      FloatSqrt/100000-8                   251kB ± 0%      251kB ± 0%      ~     (all equal)
      FloatSqrt/1000000-8                 4.61MB ± 0%     4.61MB ± 0%      ~     (all equal)
      
      name                              old allocs/op  new allocs/op   delta
      FloatString/100-8                     8.00 ± 0%       8.00 ± 0%      ~     (all equal)
      FloatString/1000-8                    10.0 ± 0%       10.0 ± 0%      ~     (all equal)
      FloatString/10000-8                   42.0 ± 0%       42.0 ± 0%      ~     (all equal)
      FloatString/100000-8                   346 ± 0%        346 ± 0%      ~     (all equal)
      FloatAdd/10-8                         0.00            0.00           ~     (all equal)
      FloatAdd/100-8                        0.00            0.00           ~     (all equal)
      FloatAdd/1000-8                       0.00            0.00           ~     (all equal)
      FloatAdd/10000-8                      0.00            0.00           ~     (all equal)
      FloatAdd/100000-8                     0.00            0.00           ~     (all equal)
      FloatSub/10-8                         0.00            0.00           ~     (all equal)
      FloatSub/100-8                        0.00            0.00           ~     (all equal)
      FloatSub/1000-8                       0.00            0.00           ~     (all equal)
      FloatSub/10000-8                      0.00            0.00           ~     (all equal)
      FloatSub/100000-8                     0.00            0.00           ~     (all equal)
      FloatSqrt/64-8                        9.00 ± 0%       9.00 ± 0%      ~     (all equal)
      FloatSqrt/128-8                       13.0 ± 0%       13.0 ± 0%      ~     (all equal)
      FloatSqrt/256-8                       12.0 ± 0%       12.0 ± 0%      ~     (all equal)
      FloatSqrt/1000-8                      19.0 ± 0%       19.0 ± 0%      ~     (all equal)
      FloatSqrt/10000-8                     35.0 ± 0%       35.0 ± 0%      ~     (all equal)
      FloatSqrt/100000-8                    55.0 ± 0%       55.0 ± 0%      ~     (all equal)
      FloatSqrt/1000000-8                    122 ± 0%        122 ± 0%      ~     (all equal)
      
      Change-Id: I6888d84c037d91f9e2199f3492ea3f6a0ed77b24
      Reviewed-on: https://go-review.googlesource.com/77832
      Reviewed-by: Vlad Krasnov <vlad@cloudflare.com>
      Reviewed-by: Cherry Zhang <cherryyz@google.com>
      Run-TryBot: Cherry Zhang <cherryyz@google.com>
      TryBot-Result: Gobot Gobot <gobot@golang.org>
      0585d41c
    • Lynn Boger's avatar
      cmd/asm, cmd/internal/obj/ppc64: avoid unnecessary load zeros · 5b14c7b3
      Lynn Boger authored
      When the instructions add, and, or, xor, and movd have
      constant operands, the assembler in some cases generates
      more instructions than necessary.
      
      This adds more opcode/operand combinations to the optab
      and improves code generation for the cases where the
      size and sign of the constant allow the use of one
      instruction instead of two.
      
      Example of previous code:
      	oris r3, r0, 0
      	ori  r3, r3, 65533
      
      now:
      	ori r3, r0, 65533
      
      This does not significantly reduce the overall binary size
      because the improvement depends on the constant value.
      Some procedures show a 1-2% reduction in size. This improvement
      could also be significant in cases where the extra instructions
      occur in a critical loop.
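
      How the size and sign of a constant affect the instruction count
      can be sketched roughly as follows. This is a simplified model, not
      the optab logic in cmd/internal/obj/ppc64: the helper name insnCount
      is hypothetical, it assumes the source register holds zero (as r0
      does in the example above), and it covers only constants that fit
      in 32 bits.

```go
package main

import "fmt"

// insnCount is a hypothetical helper, not the assembler's optab logic.
// It estimates how many ppc64 instructions are needed to place the
// constant c in a register, assuming the source register holds zero.
// It returns 0 for constants outside the 32-bit range, which this
// sketch does not model.
func insnCount(c int64) int {
	switch {
	case c >= 0 && c <= 0xFFFF:
		return 1 // ori: 16-bit unsigned immediate
	case c >= -0x8000 && c < 0:
		return 1 // addi: 16-bit signed immediate
	case c&0xFFFF == 0 && c >= -0x80000000 && c <= 0xFFFFFFFF:
		return 1 // addis or oris: only the upper halfword is set
	case c >= -0x80000000 && c <= 0xFFFFFFFF:
		return 2 // upper halfword first, then ori for the lower one
	default:
		return 0 // wider constants are out of scope here
	}
}

func main() {
	// 65533 is the constant from the commit message example.
	for _, c := range []int64{65533, -3, 0x10000, 0x12345} {
		fmt.Printf("constant %#x: %d instruction(s)\n", c, insnCount(c))
	}
}
```

      In this model, 65533 fits in an unsigned 16-bit immediate, so a
      single ori is enough and the leading oris with a zero immediate
      is redundant.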
      
      Testcase ppc64enc.s was added to cmd/asm/internal/asm/testdata
      with the variations affected by this change.
      
      Updates #23845
      
      Change-Id: I7fdf2320c95815d99f2755ba77d0c6921cd7fad7
      Reviewed-on: https://go-review.googlesource.com/95135
      Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
      TryBot-Result: Gobot Gobot <gobot@golang.org>
      Reviewed-by: David Chase <drchase@google.com>
      5b14c7b3