1. 06 Mar, 2018 2 commits
    • Fangming.Fang's avatar
      crypto/aes: optimize arm64 AES implementation · 917e7269
      Fangming.Fang authored
      This patch makes use of arm64 AES instructions to accelerate AES computation
      and only supports optimization on Linux for arm64
      
      name        old time/op    new time/op     delta
      Encrypt-32     255ns ± 0%       26ns ± 0%   -89.73%
      Decrypt-32     256ns ± 0%       26ns ± 0%   -89.77%
      Expand-32      990ns ± 5%      901ns ± 0%    -9.05%
      
      name        old speed      new speed       delta
      Encrypt-32  62.5MB/s ± 0%  610.4MB/s ± 0%  +876.39%
      Decrypt-32  62.3MB/s ± 0%  610.2MB/s ± 0%  +879.6%
      
      Fixes #18498
      
      Change-Id: If416e5a151785325527b32ff72f6da3812493ed0
      Reviewed-on: https://go-review.googlesource.com/64490
      Run-TryBot: Cherry Zhang <cherryyz@google.com>
      TryBot-Result: Gobot Gobot <gobot@golang.org>
      Reviewed-by: 's avatarCherry Zhang <cherryyz@google.com>
      917e7269
    • erifan01's avatar
      math/big: optimize addVV and subVV on arm64 · c4f3fe95
      erifan01 authored
      The biggest hot spot of the existing implementation is "load" operations, which lead to poor performance.
      By unrolling the cycle 4x and 2x, and using "LDP", "STP" instructions, this CL can reduce the "load" cost and improve performance.
      
      Benchmarks:
      
      name                              old time/op    new time/op     delta
      AddVV/1-8                           21.5ns ± 0%     11.5ns ± 0%   -46.51%  (p=0.008 n=5+5)
      AddVV/2-8                           13.5ns ± 0%     12.0ns ± 0%   -11.11%  (p=0.008 n=5+5)
      AddVV/3-8                           15.5ns ± 0%     13.0ns ± 0%   -16.13%  (p=0.008 n=5+5)
      AddVV/4-8                           17.5ns ± 0%     13.5ns ± 0%   -22.86%  (p=0.008 n=5+5)
      AddVV/5-8                           19.5ns ± 0%     14.5ns ± 0%   -25.64%  (p=0.008 n=5+5)
      AddVV/10-8                          29.5ns ± 0%     18.0ns ± 0%   -38.98%  (p=0.008 n=5+5)
      AddVV/100-8                          217ns ± 0%       94ns ± 0%   -56.64%  (p=0.008 n=5+5)
      AddVV/1000-8                        2.02µs ± 0%     1.03µs ± 0%   -48.85%  (p=0.008 n=5+5)
      AddVV/10000-8                       20.5µs ± 0%     11.3µs ± 0%   -44.70%  (p=0.008 n=5+5)
      AddVV/100000-8                       247µs ± 3%      154µs ± 0%   -37.52%  (p=0.008 n=5+5)
      SubVV/1-8                           21.5ns ± 0%     11.5ns ± 0%      ~     (p=0.079 n=4+5)
      SubVV/2-8                           13.5ns ± 0%     12.0ns ± 0%   -11.11%  (p=0.008 n=5+5)
      SubVV/3-8                           15.5ns ± 0%     13.0ns ± 0%   -16.13%  (p=0.008 n=5+5)
      SubVV/4-8                           17.5ns ± 0%     13.5ns ± 0%   -22.86%  (p=0.008 n=5+5)
      SubVV/5-8                           19.5ns ± 0%     14.5ns ± 0%   -25.64%  (p=0.008 n=5+5)
      SubVV/10-8                          29.5ns ± 0%     18.0ns ± 0%   -38.98%  (p=0.008 n=5+5)
      SubVV/100-8                          217ns ± 0%       94ns ± 0%   -56.64%  (p=0.008 n=5+5)
      SubVV/1000-8                        2.02µs ± 0%     0.80µs ± 0%   -60.50%  (p=0.008 n=5+5)
      SubVV/10000-8                       20.5µs ± 0%     11.3µs ± 0%   -44.99%  (p=0.008 n=5+5)
      SubVV/100000-8                       221µs ±11%      223µs ±16%      ~     (p=0.690 n=5+5)
      AddVW/1-8                           9.32ns ± 0%     9.32ns ± 0%      ~     (all equal)
      AddVW/2-8                           19.7ns ± 1%     19.7ns ± 0%      ~     (p=0.381 n=5+4)
      AddVW/3-8                           11.5ns ± 0%     11.5ns ± 0%      ~     (all equal)
      AddVW/4-8                           13.0ns ± 0%     13.0ns ± 0%      ~     (all equal)
      AddVW/5-8                           14.5ns ± 0%     14.5ns ± 0%      ~     (all equal)
      AddVW/10-8                          22.0ns ± 0%     22.0ns ± 0%      ~     (all equal)
      AddVW/100-8                          167ns ± 0%      167ns ± 0%      ~     (all equal)
      AddVW/1000-8                        1.52µs ± 0%     1.52µs ± 0%    +0.40%  (p=0.008 n=5+5)
      AddVW/10000-8                       15.1µs ± 0%     15.1µs ± 0%      ~     (p=0.556 n=5+4)
      AddVW/100000-8                       152µs ± 1%      152µs ± 1%      ~     (p=0.690 n=5+5)
      AddMulVVW/1-8                       33.3ns ± 0%     32.7ns ± 1%    -1.86%  (p=0.008 n=5+5)
      AddMulVVW/2-8                       59.3ns ± 1%     56.9ns ± 1%    -4.15%  (p=0.008 n=5+5)
      AddMulVVW/3-8                       80.5ns ± 1%     85.4ns ± 3%    +6.19%  (p=0.008 n=5+5)
      AddMulVVW/4-8                        127ns ± 0%      111ns ± 1%   -13.19%  (p=0.008 n=5+5)
      AddMulVVW/5-8                        144ns ± 0%      149ns ± 0%    +3.47%  (p=0.016 n=4+5)
      AddMulVVW/10-8                       298ns ± 1%      283ns ± 0%    -4.77%  (p=0.008 n=5+5)
      AddMulVVW/100-8                     3.06µs ± 0%     2.99µs ± 0%    -2.21%  (p=0.008 n=5+5)
      AddMulVVW/1000-8                    31.3µs ± 0%     26.9µs ± 0%   -14.17%  (p=0.008 n=5+5)
      AddMulVVW/10000-8                    316µs ± 0%      305µs ± 0%    -3.51%  (p=0.008 n=5+5)
      AddMulVVW/100000-8                  3.17ms ± 0%     3.17ms ± 1%      ~     (p=0.690 n=5+5)
      DecimalConversion-8                  316µs ± 1%      313µs ± 2%      ~     (p=0.095 n=5+5)
      FloatString/100-8                   2.53µs ± 1%     2.56µs ± 2%      ~     (p=0.222 n=5+5)
      FloatString/1000-8                  58.4µs ± 0%     58.5µs ± 0%      ~     (p=0.206 n=5+5)
      FloatString/10000-8                 4.59ms ± 0%     4.58ms ± 0%    -0.31%  (p=0.008 n=5+5)
      FloatString/100000-8                 446ms ± 0%      444ms ± 0%    -0.31%  (p=0.008 n=5+5)
      FloatAdd/10-8                        184ns ± 0%      172ns ± 0%    -6.30%  (p=0.008 n=5+5)
      FloatAdd/100-8                       189ns ± 2%      191ns ± 4%      ~     (p=0.381 n=5+5)
      FloatAdd/1000-8                      371ns ± 0%      347ns ± 1%    -6.42%  (p=0.008 n=5+5)
      FloatAdd/10000-8                    1.87µs ± 0%     1.68µs ± 0%   -10.16%  (p=0.008 n=5+5)
      FloatAdd/100000-8                   17.1µs ± 0%     15.6µs ± 0%    -8.74%  (p=0.016 n=5+4)
      FloatSub/10-8                        152ns ± 0%      138ns ± 0%    -9.47%  (p=0.000 n=4+5)
      FloatSub/100-8                       148ns ± 0%      142ns ± 0%    -4.05%  (p=0.000 n=5+4)
      FloatSub/1000-8                      245ns ± 1%      217ns ± 0%   -11.28%  (p=0.000 n=5+4)
      FloatSub/10000-8                    1.07µs ± 0%     0.88µs ± 1%   -18.14%  (p=0.008 n=5+5)
      FloatSub/100000-8                   9.58µs ± 0%     7.96µs ± 0%   -16.84%  (p=0.008 n=5+5)
      ParseFloatSmallExp-8                28.8µs ± 1%     29.0µs ± 1%      ~     (p=0.095 n=5+5)
      ParseFloatLargeExp-8                 126µs ± 1%      126µs ± 1%      ~     (p=0.841 n=5+5)
      GCD10x10/WithoutXY-8                 277ns ± 2%      281ns ± 4%      ~     (p=0.746 n=5+5)
      GCD10x10/WithXY-8                   2.10µs ± 1%     2.12µs ± 3%      ~     (p=0.548 n=5+5)
      GCD10x100/WithoutXY-8                615ns ± 3%      607ns ± 2%      ~     (p=0.135 n=5+5)
      GCD10x100/WithXY-8                  3.50µs ± 2%     3.62µs ± 5%      ~     (p=0.151 n=5+5)
      GCD10x1000/WithoutXY-8              1.39µs ± 2%     1.39µs ± 3%      ~     (p=0.690 n=5+5)
      GCD10x1000/WithXY-8                 7.39µs ± 1%     7.34µs ± 2%      ~     (p=0.135 n=5+5)
      GCD10x10000/WithoutXY-8             8.66µs ± 1%     8.68µs ± 1%      ~     (p=0.421 n=5+5)
      GCD10x10000/WithXY-8                28.1µs ± 2%     27.0µs ± 2%    -3.81%  (p=0.008 n=5+5)
      GCD10x100000/WithoutXY-8            79.3µs ± 1%     79.3µs ± 1%      ~     (p=0.841 n=5+5)
      GCD10x100000/WithXY-8                238µs ± 0%      227µs ± 1%    -4.74%  (p=0.008 n=5+5)
      GCD100x100/WithoutXY-8              1.89µs ± 1%     1.88µs ± 2%      ~     (p=0.968 n=5+5)
      GCD100x100/WithXY-8                 26.7µs ± 1%     27.0µs ± 1%    +1.44%  (p=0.032 n=5+5)
      GCD100x1000/WithoutXY-8             4.48µs ± 1%     4.45µs ± 2%      ~     (p=0.341 n=5+5)
      GCD100x1000/WithXY-8                36.3µs ± 1%     35.1µs ± 1%    -3.27%  (p=0.008 n=5+5)
      GCD100x10000/WithoutXY-8            22.8µs ± 0%     22.7µs ± 1%      ~     (p=0.056 n=5+5)
      GCD100x10000/WithXY-8                145µs ± 1%      133µs ± 1%    -8.33%  (p=0.008 n=5+5)
      GCD100x100000/WithoutXY-8            198µs ± 0%      195µs ± 0%    -1.56%  (p=0.008 n=5+5)
      GCD100x100000/WithXY-8              1.11ms ± 0%     1.00ms ± 0%   -10.04%  (p=0.008 n=5+5)
      GCD1000x1000/WithoutXY-8            25.2µs ± 1%     24.8µs ± 1%    -1.63%  (p=0.016 n=5+5)
      GCD1000x1000/WithXY-8                513µs ± 0%      517µs ± 2%      ~     (p=0.421 n=5+5)
      GCD1000x10000/WithoutXY-8           57.0µs ± 0%     52.7µs ± 1%    -7.56%  (p=0.008 n=5+5)
      GCD1000x10000/WithXY-8              1.20ms ± 0%     1.10ms ± 0%    -8.70%  (p=0.008 n=5+5)
      GCD1000x100000/WithoutXY-8           358µs ± 0%      318µs ± 1%   -11.03%  (p=0.008 n=5+5)
      GCD1000x100000/WithXY-8             8.71ms ± 0%     7.65ms ± 0%   -12.19%  (p=0.008 n=5+5)
      GCD10000x10000/WithoutXY-8           690µs ± 0%      630µs ± 0%    -8.71%  (p=0.008 n=5+5)
      GCD10000x10000/WithXY-8             16.0ms ± 1%     14.9ms ± 0%    -6.85%  (p=0.008 n=5+5)
      GCD10000x100000/WithoutXY-8         2.09ms ± 0%     1.75ms ± 0%   -16.09%  (p=0.016 n=5+4)
      GCD10000x100000/WithXY-8            86.8ms ± 0%     76.3ms ± 0%   -12.09%  (p=0.008 n=5+5)
      GCD100000x100000/WithoutXY-8        51.1ms ± 0%     46.0ms ± 0%    -9.97%  (p=0.008 n=5+5)
      GCD100000x100000/WithXY-8            1.25s ± 0%      1.15s ± 0%    -7.92%  (p=0.008 n=5+5)
      Hilbert-8                           2.45ms ± 1%     2.49ms ± 1%    +1.99%  (p=0.008 n=5+5)
      Binomial-8                          4.98µs ± 3%     4.90µs ± 2%      ~     (p=0.421 n=5+5)
      QuoRem-8                            7.10µs ± 0%     6.21µs ± 0%   -12.55%  (p=0.016 n=5+4)
      Exp-8                                161ms ± 0%      161ms ± 0%      ~     (p=0.421 n=5+5)
      Exp2-8                               161ms ± 0%      161ms ± 0%      ~     (p=0.151 n=5+5)
      Bitset-8                            40.4ns ± 0%     40.3ns ± 0%      ~     (p=0.190 n=5+5)
      BitsetNeg-8                          163ns ± 3%      137ns ± 2%   -15.91%  (p=0.008 n=5+5)
      BitsetOrig-8                         377ns ± 1%      372ns ± 1%    -1.22%  (p=0.024 n=5+5)
      BitsetNegOrig-8                      631ns ± 1%      605ns ± 1%    -4.09%  (p=0.008 n=5+5)
      ModSqrt225_Tonelli-8                7.26ms ± 0%     7.26ms ± 0%      ~     (p=0.548 n=5+5)
      ModSqrt224_3Mod4-8                  2.24ms ± 0%     2.24ms ± 0%      ~     (p=1.000 n=5+5)
      ModSqrt5430_Tonelli-8                62.4s ± 0%      62.4s ± 0%      ~     (p=0.841 n=5+5)
      ModSqrt5430_3Mod4-8                  20.8s ± 0%      20.7s ± 0%      ~     (p=0.056 n=5+5)
      Sqrt-8                               101µs ± 0%       89µs ± 0%   -12.17%  (p=0.008 n=5+5)
      IntSqr/1-8                          32.5ns ± 1%     32.7ns ± 1%      ~     (p=0.056 n=5+5)
      IntSqr/2-8                           160ns ± 5%      158ns ± 0%      ~     (p=0.397 n=5+4)
      IntSqr/3-8                           298ns ± 4%      296ns ± 4%      ~     (p=0.667 n=5+5)
      IntSqr/5-8                           737ns ± 5%      761ns ± 3%    +3.34%  (p=0.016 n=5+5)
      IntSqr/8-8                          1.87µs ± 4%     1.90µs ± 3%      ~     (p=0.222 n=5+5)
      IntSqr/10-8                         2.96µs ± 4%     2.92µs ± 6%      ~     (p=0.310 n=5+5)
      IntSqr/20-8                         6.28µs ± 3%     6.21µs ± 2%      ~     (p=0.310 n=5+5)
      IntSqr/30-8                         14.0µs ± 2%     13.9µs ± 2%      ~     (p=0.548 n=5+5)
      IntSqr/50-8                         37.7µs ± 3%     38.3µs ± 2%      ~     (p=0.095 n=5+5)
      IntSqr/80-8                         95.9µs ± 2%     95.1µs ± 1%      ~     (p=0.310 n=5+5)
      IntSqr/100-8                         148µs ± 1%      148µs ± 1%      ~     (p=0.841 n=5+5)
      IntSqr/200-8                         586µs ± 1%      587µs ± 1%      ~     (p=1.000 n=5+5)
      IntSqr/300-8                        1.32ms ± 0%     1.31ms ± 1%    -0.73%  (p=0.032 n=5+5)
      IntSqr/500-8                        2.48ms ± 0%     2.45ms ± 0%    -1.15%  (p=0.008 n=5+5)
      IntSqr/800-8                        4.68ms ± 0%     4.62ms ± 0%    -1.23%  (p=0.008 n=5+5)
      IntSqr/1000-8                       7.57ms ± 0%     7.50ms ± 0%    -0.84%  (p=0.008 n=5+5)
      Mul-8                                311ms ± 0%      308ms ± 0%    -0.81%  (p=0.008 n=5+5)
      Exp3Power/0x10-8                     574ns ± 1%      578ns ± 2%      ~     (p=0.500 n=5+5)
      Exp3Power/0x40-8                     640ns ± 1%      646ns ± 0%      ~     (p=0.056 n=5+5)
      Exp3Power/0x100-8                   1.42µs ± 1%     1.42µs ± 1%      ~     (p=0.246 n=5+5)
      Exp3Power/0x400-8                   8.30µs ± 1%     8.29µs ± 1%      ~     (p=0.802 n=5+5)
      Exp3Power/0x1000-8                  60.0µs ± 0%     59.9µs ± 0%    -0.24%  (p=0.016 n=5+5)
      Exp3Power/0x4000-8                   817µs ± 0%      816µs ± 0%    -0.17%  (p=0.008 n=5+5)
      Exp3Power/0x10000-8                 7.80ms ± 1%     7.70ms ± 0%    -1.23%  (p=0.008 n=5+5)
      Exp3Power/0x40000-8                 73.4ms ± 0%     72.5ms ± 0%    -1.28%  (p=0.008 n=5+5)
      Exp3Power/0x100000-8                 665ms ± 0%      656ms ± 0%    -1.34%  (p=0.008 n=5+5)
      Exp3Power/0x400000-8                 5.99s ± 0%      5.90s ± 0%    -1.40%  (p=0.008 n=5+5)
      Fibo-8                               116ms ± 0%       50ms ± 0%   -57.09%  (p=0.008 n=5+5)
      NatSqr/1-8                           112ns ± 4%      112ns ± 2%      ~     (p=0.968 n=5+5)
      NatSqr/2-8                           251ns ± 2%      250ns ± 1%      ~     (p=0.571 n=5+5)
      NatSqr/3-8                           378ns ± 2%      379ns ± 2%      ~     (p=0.794 n=5+5)
      NatSqr/5-8                           829ns ± 3%      827ns ± 2%      ~     (p=1.000 n=5+5)
      NatSqr/8-8                          1.97µs ± 2%     1.95µs ± 2%      ~     (p=0.310 n=5+5)
      NatSqr/10-8                         3.02µs ± 2%     2.99µs ± 2%      ~     (p=0.421 n=5+5)
      NatSqr/20-8                         6.51µs ± 2%     6.49µs ± 1%      ~     (p=0.841 n=5+5)
      NatSqr/30-8                         14.1µs ± 2%     14.0µs ± 2%      ~     (p=0.841 n=5+5)
      NatSqr/50-8                         38.1µs ± 2%     38.3µs ± 3%      ~     (p=0.690 n=5+5)
      NatSqr/80-8                         95.5µs ± 2%     96.0µs ± 1%      ~     (p=0.421 n=5+5)
      NatSqr/100-8                         150µs ± 1%      148µs ± 2%      ~     (p=0.095 n=5+5)
      NatSqr/200-8                         588µs ± 1%      590µs ± 1%      ~     (p=0.421 n=5+5)
      NatSqr/300-8                        1.32ms ± 1%     1.31ms ± 1%      ~     (p=0.841 n=5+5)
      NatSqr/500-8                        2.50ms ± 0%     2.47ms ± 0%    -1.03%  (p=0.008 n=5+5)
      NatSqr/800-8                        4.70ms ± 0%     4.64ms ± 0%    -1.31%  (p=0.008 n=5+5)
      NatSqr/1000-8                       7.60ms ± 0%     7.52ms ± 0%    -1.01%  (p=0.008 n=5+5)
      ScanPi-8                             326µs ± 0%      326µs ± 0%      ~     (p=0.841 n=5+5)
      StringPiParallel-8                  70.3µs ± 5%     63.8µs ±10%      ~     (p=0.056 n=5+5)
      Scan/10/Base2-8                     1.09µs ± 0%     1.09µs ± 0%      ~     (p=0.317 n=5+5)
      Scan/100/Base2-8                    7.79µs ± 0%     7.78µs ± 0%      ~     (p=0.063 n=5+5)
      Scan/1000/Base2-8                   79.0µs ± 0%     78.9µs ± 0%    -0.18%  (p=0.008 n=5+5)
      Scan/10000/Base2-8                  1.22ms ± 0%     1.22ms ± 0%    -0.15%  (p=0.008 n=5+5)
      Scan/100000/Base2-8                 55.1ms ± 0%     55.2ms ± 0%    +0.20%  (p=0.008 n=5+5)
      Scan/10/Base8-8                      512ns ± 0%      512ns ± 1%      ~     (p=0.810 n=5+5)
      Scan/100/Base8-8                    2.89µs ± 0%     2.89µs ± 0%      ~     (p=0.810 n=5+5)
      Scan/1000/Base8-8                   31.0µs ± 0%     31.0µs ± 0%      ~     (p=0.151 n=5+5)
      Scan/10000/Base8-8                   740µs ± 0%      741µs ± 0%    +0.10%  (p=0.008 n=5+5)
      Scan/100000/Base8-8                 50.6ms ± 0%     50.6ms ± 0%    +0.08%  (p=0.008 n=5+5)
      Scan/10/Base10-8                     487ns ± 0%      487ns ± 0%      ~     (p=0.571 n=5+5)
      Scan/100/Base10-8                   2.67µs ± 0%     2.67µs ± 0%      ~     (p=0.810 n=5+5)
      Scan/1000/Base10-8                  28.7µs ± 0%     28.7µs ± 0%    +0.06%  (p=0.008 n=5+5)
      Scan/10000/Base10-8                  716µs ± 0%      717µs ± 0%      ~     (p=0.222 n=5+5)
      Scan/100000/Base10-8                50.3ms ± 0%     50.3ms ± 0%    +0.10%  (p=0.008 n=5+5)
      Scan/10/Base16-8                     438ns ± 0%      437ns ± 1%      ~     (p=0.786 n=5+5)
      Scan/100/Base16-8                   2.47µs ± 0%     2.47µs ± 0%    -0.19%  (p=0.048 n=5+5)
      Scan/1000/Base16-8                  27.2µs ± 0%     27.3µs ± 0%      ~     (p=0.087 n=5+5)
      Scan/10000/Base16-8                  722µs ± 0%      722µs ± 0%    +0.11%  (p=0.008 n=5+5)
      Scan/100000/Base16-8                52.6ms ± 0%     52.7ms ± 0%    +0.15%  (p=0.008 n=5+5)
      String/10/Base2-8                    247ns ± 2%      248ns ± 1%      ~     (p=0.437 n=5+5)
      String/100/Base2-8                  1.51µs ± 0%     1.51µs ± 0%    -0.37%  (p=0.024 n=5+5)
      String/1000/Base2-8                 13.6µs ± 1%     13.5µs ± 0%      ~     (p=0.095 n=5+5)
      String/10000/Base2-8                 135µs ± 0%      135µs ± 1%      ~     (p=0.841 n=5+5)
      String/100000/Base2-8               1.32ms ± 1%     1.32ms ± 1%      ~     (p=0.690 n=5+5)
      String/10/Base8-8                    169ns ± 1%      169ns ± 1%      ~     (p=1.000 n=5+5)
      String/100/Base8-8                   636ns ± 0%      634ns ± 1%      ~     (p=0.413 n=5+5)
      String/1000/Base8-8                 5.33µs ± 1%     5.32µs ± 0%      ~     (p=0.222 n=5+5)
      String/10000/Base8-8                50.9µs ± 1%     50.7µs ± 0%      ~     (p=0.151 n=5+5)
      String/100000/Base8-8                500µs ± 1%      497µs ± 0%      ~     (p=0.421 n=5+5)
      String/10/Base10-8                   516ns ± 1%      513ns ± 0%    -0.62%  (p=0.016 n=5+4)
      String/100/Base10-8                 1.97µs ± 0%     1.96µs ± 0%      ~     (p=0.667 n=4+5)
      String/1000/Base10-8                12.5µs ± 0%     11.5µs ± 0%    -7.92%  (p=0.008 n=5+5)
      String/10000/Base10-8               57.7µs ± 0%     52.5µs ± 0%    -8.93%  (p=0.008 n=5+5)
      String/100000/Base10-8              25.6ms ± 0%     21.6ms ± 0%   -15.94%  (p=0.008 n=5+5)
      String/10/Base16-8                   150ns ± 1%      149ns ± 0%      ~     (p=0.413 n=5+4)
      String/100/Base16-8                  514ns ± 1%      514ns ± 1%      ~     (p=0.849 n=5+5)
      String/1000/Base16-8                4.01µs ± 0%     4.01µs ± 0%      ~     (p=0.421 n=5+5)
      String/10000/Base16-8               37.8µs ± 1%     37.8µs ± 1%      ~     (p=0.841 n=5+5)
      String/100000/Base16-8               373µs ± 2%      373µs ± 0%      ~     (p=0.421 n=5+5)
      LeafSize/0-8                        6.63ms ± 0%     6.63ms ± 0%      ~     (p=0.730 n=4+5)
      LeafSize/1-8                        74.0µs ± 0%     67.7µs ± 1%    -8.53%  (p=0.008 n=5+5)
      LeafSize/2-8                        74.2µs ± 0%     68.3µs ± 1%    -7.99%  (p=0.008 n=5+5)
      LeafSize/3-8                         379µs ± 0%      309µs ± 0%   -18.52%  (p=0.008 n=5+5)
      LeafSize/4-8                        72.7µs ± 1%     66.7µs ± 0%    -8.37%  (p=0.008 n=5+5)
      LeafSize/5-8                         471µs ± 0%      384µs ± 0%   -18.55%  (p=0.008 n=5+5)
      LeafSize/6-8                         378µs ± 0%      308µs ± 0%   -18.59%  (p=0.008 n=5+5)
      LeafSize/7-8                         245µs ± 0%      204µs ± 1%   -16.75%  (p=0.008 n=5+5)
      LeafSize/8-8                        73.4µs ± 0%     66.9µs ± 1%    -8.79%  (p=0.008 n=5+5)
      LeafSize/9-8                         538µs ± 0%      437µs ± 0%   -18.75%  (p=0.008 n=5+5)
      LeafSize/10-8                        472µs ± 0%      396µs ± 1%   -16.01%  (p=0.008 n=5+5)
      LeafSize/11-8                        460µs ± 0%      374µs ± 0%   -18.58%  (p=0.008 n=5+5)
      LeafSize/12-8                        378µs ± 0%      308µs ± 0%   -18.38%  (p=0.008 n=5+5)
      LeafSize/13-8                        343µs ± 0%      284µs ± 0%   -17.30%  (p=0.008 n=5+5)
      LeafSize/14-8                        248µs ± 0%      206µs ± 0%   -16.94%  (p=0.008 n=5+5)
      LeafSize/15-8                        169µs ± 0%      144µs ± 0%   -14.69%  (p=0.008 n=5+5)
      LeafSize/16-8                       72.9µs ± 0%     66.8µs ± 1%    -8.27%  (p=0.008 n=5+5)
      LeafSize/32-8                       82.5µs ± 0%     76.7µs ± 0%    -7.04%  (p=0.008 n=5+5)
      LeafSize/64-8                        134µs ± 0%      129µs ± 0%    -3.80%  (p=0.008 n=5+5)
      ProbablyPrime/n=0-8                 44.2ms ± 0%     43.4ms ± 0%    -1.95%  (p=0.008 n=5+5)
      ProbablyPrime/n=1-8                 64.9ms ± 0%     64.0ms ± 0%    -1.27%  (p=0.008 n=5+5)
      ProbablyPrime/n=5-8                  147ms ± 0%      146ms ± 0%    -0.58%  (p=0.008 n=5+5)
      ProbablyPrime/n=10-8                 250ms ± 0%      249ms ± 0%    -0.35%  (p=0.008 n=5+5)
      ProbablyPrime/n=20-8                 456ms ± 0%      455ms ± 0%    -0.18%  (p=0.008 n=5+5)
      ProbablyPrime/Lucas-8               23.6ms ± 0%     22.7ms ± 0%    -3.74%  (p=0.008 n=5+5)
      ProbablyPrime/MillerRabinBase2-8    20.7ms ± 0%     20.6ms ± 0%      ~     (p=0.421 n=5+5)
      FloatSqrt/64-8                      2.25µs ± 1%     2.29µs ± 0%    +1.48%  (p=0.008 n=5+5)
      FloatSqrt/128-8                     4.86µs ± 1%     4.92µs ± 1%    +1.21%  (p=0.032 n=5+5)
      FloatSqrt/256-8                     13.6µs ± 0%     13.7µs ± 1%    +1.31%  (p=0.032 n=5+5)
      FloatSqrt/1000-8                    70.0µs ± 1%     70.1µs ± 0%      ~     (p=0.690 n=5+5)
      FloatSqrt/10000-8                   1.92ms ± 0%     1.90ms ± 0%    -0.59%  (p=0.008 n=5+5)
      FloatSqrt/100000-8                  55.3ms ± 0%     54.8ms ± 0%    -1.01%  (p=0.008 n=5+5)
      FloatSqrt/1000000-8                  4.56s ± 0%      4.50s ± 0%    -1.28%  (p=0.008 n=5+5)
      
      name                              old speed      new speed       delta
      AddVV/1-8                         2.97GB/s ± 0%   5.56GB/s ± 0%   +86.85%  (p=0.008 n=5+5)
      AddVV/2-8                         9.47GB/s ± 0%  10.66GB/s ± 0%   +12.50%  (p=0.008 n=5+5)
      AddVV/3-8                         12.4GB/s ± 0%   14.7GB/s ± 0%   +19.10%  (p=0.008 n=5+5)
      AddVV/4-8                         14.6GB/s ± 0%   18.9GB/s ± 0%   +29.63%  (p=0.016 n=4+5)
      AddVV/5-8                         16.4GB/s ± 0%   22.0GB/s ± 0%   +34.47%  (p=0.016 n=5+4)
      AddVV/10-8                        21.7GB/s ± 0%   35.5GB/s ± 0%   +63.89%  (p=0.008 n=5+5)
      AddVV/100-8                       29.4GB/s ± 0%   68.0GB/s ± 0%  +131.38%  (p=0.008 n=5+5)
      AddVV/1000-8                      31.7GB/s ± 0%   61.9GB/s ± 0%   +95.43%  (p=0.008 n=5+5)
      AddVV/10000-8                     31.2GB/s ± 0%   56.4GB/s ± 0%   +80.83%  (p=0.008 n=5+5)
      AddVV/100000-8                    25.9GB/s ± 3%   41.4GB/s ± 0%   +59.98%  (p=0.008 n=5+5)
      SubVV/1-8                         2.97GB/s ± 0%   5.56GB/s ± 0%   +86.97%  (p=0.016 n=4+5)
      SubVV/2-8                         9.47GB/s ± 0%  10.66GB/s ± 0%   +12.51%  (p=0.008 n=5+5)
      SubVV/3-8                         12.4GB/s ± 0%   14.8GB/s ± 0%   +19.23%  (p=0.016 n=4+5)
      SubVV/4-8                         14.6GB/s ± 0%   18.9GB/s ± 0%   +29.56%  (p=0.008 n=5+5)
      SubVV/5-8                         16.4GB/s ± 0%   22.0GB/s ± 0%   +34.47%  (p=0.016 n=4+5)
      SubVV/10-8                        21.7GB/s ± 0%   35.5GB/s ± 0%   +63.89%  (p=0.008 n=5+5)
      SubVV/100-8                       29.4GB/s ± 0%   68.0GB/s ± 0%  +131.38%  (p=0.008 n=5+5)
      SubVV/1000-8                      31.6GB/s ± 0%   80.1GB/s ± 0%  +153.08%  (p=0.008 n=5+5)
      SubVV/10000-8                     31.2GB/s ± 0%   56.7GB/s ± 0%   +81.79%  (p=0.008 n=5+5)
      SubVV/100000-8                    29.1GB/s ±10%   29.0GB/s ±18%      ~     (p=0.690 n=5+5)
      AddVW/1-8                          859MB/s ± 0%    859MB/s ± 0%    -0.01%  (p=0.008 n=5+5)
      AddVW/2-8                          811MB/s ± 1%    814MB/s ± 0%      ~     (p=0.413 n=5+4)
      AddVW/3-8                         2.08GB/s ± 0%   2.08GB/s ± 0%      ~     (p=0.206 n=5+5)
      AddVW/4-8                         2.46GB/s ± 0%   2.46GB/s ± 0%      ~     (p=0.056 n=5+5)
      AddVW/5-8                         2.75GB/s ± 0%   2.75GB/s ± 0%      ~     (p=0.508 n=5+5)
      AddVW/10-8                        3.63GB/s ± 0%   3.63GB/s ± 0%      ~     (p=0.214 n=5+5)
      AddVW/100-8                       4.79GB/s ± 0%   4.79GB/s ± 0%      ~     (p=0.500 n=5+5)
      AddVW/1000-8                      5.27GB/s ± 0%   5.25GB/s ± 0%    -0.43%  (p=0.008 n=5+5)
      AddVW/10000-8                     5.30GB/s ± 0%   5.30GB/s ± 0%      ~     (p=0.397 n=5+5)
      AddVW/100000-8                    5.27GB/s ± 1%   5.25GB/s ± 1%      ~     (p=0.690 n=5+5)
      AddMulVVW/1-8                     1.92GB/s ± 0%   1.96GB/s ± 1%    +1.95%  (p=0.008 n=5+5)
      AddMulVVW/2-8                     2.16GB/s ± 1%   2.25GB/s ± 1%    +4.32%  (p=0.008 n=5+5)
      AddMulVVW/3-8                     2.39GB/s ± 1%   2.25GB/s ± 3%    -5.79%  (p=0.008 n=5+5)
      AddMulVVW/4-8                     2.00GB/s ± 0%   2.31GB/s ± 1%   +15.31%  (p=0.008 n=5+5)
      AddMulVVW/5-8                     2.22GB/s ± 0%   2.14GB/s ± 0%    -3.86%  (p=0.008 n=5+5)
      AddMulVVW/10-8                    2.15GB/s ± 1%   2.25GB/s ± 0%    +5.03%  (p=0.008 n=5+5)
      AddMulVVW/100-8                   2.09GB/s ± 0%   2.14GB/s ± 0%    +2.25%  (p=0.008 n=5+5)
      AddMulVVW/1000-8                  2.04GB/s ± 0%   2.38GB/s ± 0%   +16.52%  (p=0.008 n=5+5)
      AddMulVVW/10000-8                 2.03GB/s ± 0%   2.10GB/s ± 0%    +3.64%  (p=0.008 n=5+5)
      AddMulVVW/100000-8                2.02GB/s ± 0%   2.02GB/s ± 1%      ~     (p=0.690 n=5+5)
      
      Change-Id: Ie482d67a7dbb5af6f5d81af2b3d9d14bd66336db
      Reviewed-on: https://go-review.googlesource.com/77831Reviewed-by: 's avatarCherry Zhang <cherryyz@google.com>
      Run-TryBot: Cherry Zhang <cherryyz@google.com>
      TryBot-Result: Gobot Gobot <gobot@golang.org>
      c4f3fe95
  2. 05 Mar, 2018 10 commits
  3. 04 Mar, 2018 7 commits
  4. 03 Mar, 2018 7 commits
    • Giovanni Bajo's avatar
      test: port a nil-check interface test from asm_test · 8ce74b7d
      Giovanni Bajo authored
      Change-Id: I69c1688506d1aeca655047acf35d1bff966fc01e
      Reviewed-on: https://go-review.googlesource.com/98442
      Run-TryBot: Giovanni Bajo <rasky@develer.com>
      TryBot-Result: Gobot Gobot <gobot@golang.org>
      Reviewed-by: 's avatarKeith Randall <khr@golang.org>
      8ce74b7d
    • Giovanni Bajo's avatar
      test: use the version of Go used to run run.go · ec0b8c05
      Giovanni Bajo authored
      Currently, the top-level testsuite always uses whatever version
      of Go is found in the PATH to execute all the tests. This
      forces the developers to tweak the PATH to run the testsuite.
      
      Change it to use the same version of Go used to run run.go.
      This allows developers to run the testsuite using the tip
      compiler by simply saying "../bin/go run run.go".
      
      I think this is a better solution compared to always forcing
      "../bin/go", because it allows developers to run the testsuite
      using different Go versions, for instance to check if a new
      test is fixed in tip compared to the installed compiler.
      
      Fixes #24217
      
      Change-Id: I41b299c753b6e77c41e28be9091b2b630efea9d2
      Reviewed-on: https://go-review.googlesource.com/98439
      Run-TryBot: Giovanni Bajo <rasky@develer.com>
      TryBot-Result: Gobot Gobot <gobot@golang.org>
      Reviewed-by: 's avatarKeith Randall <khr@golang.org>
      ec0b8c05
    • Pascal S. de Kloe's avatar
      encoding/json: apply conventional error handling in decoder · 74a92b8e
      Pascal S. de Kloe authored
      name                            old time/op    new time/op    delta
      CodeEncoder-12                    1.89ms ± 1%    1.91ms ± 0%   +1.16%  (p=0.000 n=20+19)
      CodeMarshal-12                    2.09ms ± 1%    2.12ms ± 0%   +1.63%  (p=0.000 n=17+18)
      CodeDecoder-12                    8.43ms ± 1%    8.32ms ± 1%   -1.35%  (p=0.000 n=18+20)
      UnicodeDecoder-12                  399ns ± 0%     339ns ± 0%  -15.00%  (p=0.000 n=20+19)
      DecoderStream-12                   281ns ± 1%     231ns ± 0%  -17.91%  (p=0.000 n=20+16)
      CodeUnmarshal-12                  9.35ms ± 2%    9.15ms ± 2%   -2.11%  (p=0.000 n=20+20)
      CodeUnmarshalReuse-12             8.41ms ± 2%    8.29ms ± 2%   -1.34%  (p=0.000 n=20+20)
      UnmarshalString-12                81.2ns ± 2%    74.0ns ± 4%   -8.89%  (p=0.000 n=20+20)
      UnmarshalFloat64-12               71.1ns ± 2%    64.3ns ± 1%   -9.60%  (p=0.000 n=20+19)
      UnmarshalInt64-12                 60.6ns ± 2%    53.2ns ± 0%  -12.28%  (p=0.000 n=18+18)
      Issue10335-12                     96.9ns ± 0%    87.7ns ± 1%   -9.52%  (p=0.000 n=17+20)
      Unmapped-12                        247ns ± 4%     231ns ± 3%   -6.34%  (p=0.000 n=20+20)
      TypeFieldsCache/MissTypes1-12     11.1µs ± 0%    11.1µs ± 0%     ~     (p=0.376 n=19+20)
      TypeFieldsCache/MissTypes10-12    33.9µs ± 0%    33.8µs ± 0%   -0.32%  (p=0.000 n=18+9)
      
      name                            old speed      new speed      delta
      CodeEncoder-12                  1.03GB/s ± 1%  1.01GB/s ± 0%   -1.15%  (p=0.000 n=20+19)
      CodeMarshal-12                   930MB/s ± 1%   915MB/s ± 0%   -1.60%  (p=0.000 n=17+18)
      CodeDecoder-12                   230MB/s ± 1%   233MB/s ± 1%   +1.37%  (p=0.000 n=18+20)
      UnicodeDecoder-12               35.0MB/s ± 0%  41.2MB/s ± 0%  +17.60%  (p=0.000 n=20+19)
      CodeUnmarshal-12                 208MB/s ± 2%   212MB/s ± 2%   +2.16%  (p=0.000 n=20+20)
      
      name                            old alloc/op   new alloc/op   delta
      Issue10335-12                       184B ± 0%      184B ± 0%     ~     (all equal)
      Unmapped-12                         216B ± 0%      216B ± 0%     ~     (all equal)
      
      name                            old allocs/op  new allocs/op  delta
      Issue10335-12                       3.00 ± 0%      3.00 ± 0%     ~     (all equal)
      Unmapped-12                         4.00 ± 0%      4.00 ± 0%     ~     (all equal)
      
      Change-Id: I4b1a87a205da2ef9a572f86f85bc833653c61570
      Reviewed-on: https://go-review.googlesource.com/98440Reviewed-by: 's avatarBrad Fitzpatrick <bradfitz@golang.org>
      Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
      TryBot-Result: Gobot Gobot <gobot@golang.org>
      74a92b8e
    • Tobias Klauser's avatar
      runtime: use vDSO for clock_gettime on linux/arm · 51b02711
      Tobias Klauser authored
      Use the __vdso_clock_gettime fast path via the vDSO on linux/arm to
      speed up nanotime and walltime. This results in the following
      performance improvement for time.Now on a RaspberryPi 3 (running
      32bit Raspbian, i.e. GOOS=linux/GOARCH=arm):
      
      name     old time/op  new time/op  delta
      TimeNow  0.99µs ± 0%  0.39µs ± 1%  -60.74%  (p=0.000 n=12+20)
      
      Change-Id: I3598278a6c88d7f6a6ce66c56b9d25f9dd2f4c9a
      Reviewed-on: https://go-review.googlesource.com/98095Reviewed-by: 's avatarIan Lance Taylor <iant@golang.org>
      Run-TryBot: Ian Lance Taylor <iant@golang.org>
      TryBot-Result: Gobot Gobot <gobot@golang.org>
      51b02711
    • Tobias Klauser's avatar
      runtime: remove unused __vdso_time_sym · c69f60d0
      Tobias Klauser authored
      It's unused since https://golang.org/cl/99320043
      
      Change-Id: I74d69ff894aa2fb556f1c2083406c118c559d91b
      Reviewed-on: https://go-review.googlesource.com/98195
      Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com>
      TryBot-Result: Gobot Gobot <gobot@golang.org>
      Reviewed-by: 's avatarIan Lance Taylor <iant@golang.org>
      c69f60d0
    • Keith Randall's avatar
      internal/bytealg: move equal functions to bytealg · 1dfa380e
      Keith Randall authored
      Move bytes.Equal, runtime.memequal, and runtime.memequal_varlen
      to the bytealg package.
      
      Update #19792
      
      Change-Id: Ic4175e952936016ea0bda6c7c3dbb33afdc8e4ac
      Reviewed-on: https://go-review.googlesource.com/98355
      Run-TryBot: Keith Randall <khr@golang.org>
      TryBot-Result: Gobot Gobot <gobot@golang.org>
      Reviewed-by: 's avatarBrad Fitzpatrick <bradfitz@golang.org>
      1dfa380e
    • Joe Tsai's avatar
      encoding/json: use sync.Map for field cache · f0756ca2
      Joe Tsai authored
      The previous type cache is quadratic in time in the situation where
      new types are continually encountered. Now that it is possible to dynamically
      create new types with the reflect package, this can cause json to
      perform very poorly.
      
      Switch to sync.Map which does well when the cache has hit steady state,
      but also handles occasional updates in better than quadratic time.
      
      benchmark                                     old ns/op      new ns/op     delta
      BenchmarkTypeFieldsCache/MissTypes1-8         14817          16202         +9.35%
      BenchmarkTypeFieldsCache/MissTypes10-8        70926          69144         -2.51%
      BenchmarkTypeFieldsCache/MissTypes100-8       976467         208973        -78.60%
      BenchmarkTypeFieldsCache/MissTypes1000-8      79520162       1750371       -97.80%
      BenchmarkTypeFieldsCache/MissTypes10000-8     6873625837     16847806      -99.75%
      BenchmarkTypeFieldsCache/HitTypes1000-8       7.51           8.80          +17.18%
      BenchmarkTypeFieldsCache/HitTypes10000-8      7.58           8.68          +14.51%
      
      The old implementation takes 12 minutes just to build a cache of size 1e5
      due to the quadratic behavior. I did not bother benchmark sizes above that.
      
      Change-Id: I5e6facc1eb8e1b80e5ca285e4dd2cc8815618dad
      Reviewed-on: https://go-review.googlesource.com/76850
      Run-TryBot: Joe Tsai <thebrokentoaster@gmail.com>
      Reviewed-by: 's avatarBryan Mills <bcmills@google.com>
      TryBot-Result: Gobot Gobot <gobot@golang.org>
      f0756ca2
  5. 02 Mar, 2018 14 commits