    math/big: optimize addVW and subVW on arm64 · 0585d41c
    erifan01 authored
    The main hot spot in the existing implementation is its load operations, which limit performance.
    By unrolling the loop by factors of 4 and 2, and using the LDP and STP (load pair / store pair)
    instructions, this CL reduces the load cost and improves performance.
    
    Benchmarks:
    
    name                              old time/op    new time/op     delta
    AddVV/1-8                           21.5ns ± 0%     21.5ns ± 0%      ~     (all equal)
    AddVV/2-8                           13.5ns ± 0%     13.5ns ± 0%      ~     (all equal)
    AddVV/3-8                           15.5ns ± 0%     15.5ns ± 0%      ~     (all equal)
    AddVV/4-8                           17.5ns ± 0%     17.5ns ± 0%      ~     (all equal)
    AddVV/5-8                           19.5ns ± 0%     19.5ns ± 0%      ~     (all equal)
    AddVV/10-8                          29.5ns ± 0%     29.5ns ± 0%      ~     (all equal)
    AddVV/100-8                          217ns ± 0%      217ns ± 0%      ~     (all equal)
    AddVV/1000-8                        2.02µs ± 0%     2.02µs ± 0%      ~     (all equal)
    AddVV/10000-8                       20.3µs ± 0%     20.3µs ± 0%      ~     (p=0.603 n=5+5)
    AddVV/100000-8                       223µs ± 7%      228µs ± 8%      ~     (p=0.548 n=5+5)
    AddVW/1-8                           9.32ns ± 0%     9.26ns ± 0%    -0.64%  (p=0.008 n=5+5)
    AddVW/2-8                           19.8ns ± 3%     10.5ns ± 0%   -46.92%  (p=0.008 n=5+5)
    AddVW/3-8                           11.5ns ± 0%     11.0ns ± 0%    -4.35%  (p=0.008 n=5+5)
    AddVW/4-8                           13.0ns ± 0%     12.0ns ± 0%    -7.69%  (p=0.008 n=5+5)
    AddVW/5-8                           14.5ns ± 0%     12.5ns ± 0%   -13.79%  (p=0.008 n=5+5)
    AddVW/10-8                          22.0ns ± 0%     15.5ns ± 0%   -29.55%  (p=0.008 n=5+5)
    AddVW/100-8                          167ns ± 0%       81ns ± 0%   -51.44%  (p=0.008 n=5+5)
    AddVW/1000-8                        1.52µs ± 0%     0.64µs ± 0%   -57.58%  (p=0.008 n=5+5)
    AddVW/10000-8                       15.1µs ± 0%      7.2µs ± 0%   -52.55%  (p=0.008 n=5+5)
    AddVW/100000-8                       150µs ± 0%       71µs ± 0%   -52.95%  (p=0.008 n=5+5)
    SubVW/1-8                           9.32ns ± 0%     9.26ns ± 0%    -0.64%  (p=0.008 n=5+5)
    SubVW/2-8                           19.7ns ± 2%     10.5ns ± 0%   -46.70%  (p=0.008 n=5+5)
    SubVW/3-8                           11.5ns ± 0%     11.0ns ± 0%    -4.35%  (p=0.008 n=5+5)
    SubVW/4-8                           13.0ns ± 0%     12.0ns ± 0%    -7.69%  (p=0.008 n=5+5)
    SubVW/5-8                           14.5ns ± 0%     12.5ns ± 0%   -13.79%  (p=0.008 n=5+5)
    SubVW/10-8                          22.0ns ± 0%     15.5ns ± 0%   -29.55%  (p=0.008 n=5+5)
    SubVW/100-8                          167ns ± 0%       81ns ± 0%   -51.44%  (p=0.008 n=5+5)
    SubVW/1000-8                        1.52µs ± 0%     0.64µs ± 0%   -57.58%  (p=0.008 n=5+5)
    SubVW/10000-8                       15.1µs ± 0%      7.2µs ± 0%   -52.49%  (p=0.008 n=5+5)
    SubVW/100000-8                       150µs ± 0%       71µs ± 0%   -52.91%  (p=0.008 n=5+5)
    AddMulVVW/1-8                       32.4ns ± 1%     32.6ns ± 1%      ~     (p=0.119 n=5+5)
    AddMulVVW/2-8                       57.0ns ± 0%     57.0ns ± 0%      ~     (p=0.643 n=5+5)
    AddMulVVW/3-8                       90.8ns ± 0%     90.7ns ± 0%      ~     (p=0.524 n=5+5)
    AddMulVVW/4-8                        118ns ± 0%      118ns ± 1%      ~     (p=1.000 n=4+5)
    AddMulVVW/5-8                        144ns ± 1%      144ns ± 0%      ~     (p=0.794 n=5+4)
    AddMulVVW/10-8                       294ns ± 1%      296ns ± 0%    +0.48%  (p=0.040 n=5+5)
    AddMulVVW/100-8                     2.73µs ± 0%     2.73µs ± 0%      ~     (p=0.278 n=5+5)
    AddMulVVW/1000-8                    26.0µs ± 0%     26.5µs ± 0%    +2.14%  (p=0.008 n=5+5)
    AddMulVVW/10000-8                    297µs ± 0%      297µs ± 0%    +0.24%  (p=0.008 n=5+5)
    AddMulVVW/100000-8                  3.15ms ± 1%     3.13ms ± 0%      ~     (p=0.690 n=5+5)
    DecimalConversion-8                  311µs ± 2%      309µs ± 2%      ~     (p=0.310 n=5+5)
    FloatString/100-8                   2.55µs ± 2%     2.54µs ± 2%      ~     (p=1.000 n=5+5)
    FloatString/1000-8                  58.1µs ± 0%     58.1µs ± 0%      ~     (p=0.151 n=5+5)
    FloatString/10000-8                 4.59ms ± 0%     4.59ms ± 0%      ~     (p=0.151 n=5+5)
    FloatString/100000-8                 446ms ± 0%      446ms ± 0%    +0.01%  (p=0.016 n=5+5)
    FloatAdd/10-8                        183ns ± 0%      183ns ± 0%      ~     (p=0.333 n=4+5)
    FloatAdd/100-8                       187ns ± 1%      192ns ± 2%      ~     (p=0.056 n=5+5)
    FloatAdd/1000-8                      369ns ± 0%      371ns ± 0%    +0.54%  (p=0.016 n=4+5)
    FloatAdd/10000-8                    1.88µs ± 0%     1.88µs ± 0%    -0.14%  (p=0.000 n=4+5)
    FloatAdd/100000-8                   17.2µs ± 0%     17.1µs ± 0%    -0.37%  (p=0.008 n=5+5)
    FloatSub/10-8                        147ns ± 0%      147ns ± 0%      ~     (all equal)
    FloatSub/100-8                       145ns ± 0%      146ns ± 0%      ~     (p=0.238 n=5+4)
    FloatSub/1000-8                      241ns ± 0%      241ns ± 0%      ~     (p=0.333 n=5+4)
    FloatSub/10000-8                    1.06µs ± 0%     1.06µs ± 0%      ~     (p=0.444 n=5+5)
    FloatSub/100000-8                   9.50µs ± 0%     9.48µs ± 0%    -0.14%  (p=0.008 n=5+5)
    ParseFloatSmallExp-8                28.4µs ± 2%     28.5µs ± 1%      ~     (p=0.690 n=5+5)
    ParseFloatLargeExp-8                 125µs ± 1%      124µs ± 1%      ~     (p=0.095 n=5+5)
    GCD10x10/WithoutXY-8                 277ns ± 2%      278ns ± 3%      ~     (p=0.937 n=5+5)
    GCD10x10/WithXY-8                   2.08µs ± 3%     2.15µs ± 3%      ~     (p=0.056 n=5+5)
    GCD10x100/WithoutXY-8                592ns ± 3%      613ns ± 4%      ~     (p=0.056 n=5+5)
    GCD10x100/WithXY-8                  3.40µs ± 2%     3.42µs ± 4%      ~     (p=0.841 n=5+5)
    GCD10x1000/WithoutXY-8              1.37µs ± 2%     1.35µs ± 3%      ~     (p=0.460 n=5+5)
    GCD10x1000/WithXY-8                 7.34µs ± 2%     7.33µs ± 4%      ~     (p=0.841 n=5+5)
    GCD10x10000/WithoutXY-8             8.52µs ± 0%     8.51µs ± 1%      ~     (p=0.421 n=5+5)
    GCD10x10000/WithXY-8                27.5µs ± 2%     27.2µs ± 1%      ~     (p=0.151 n=5+5)
    GCD10x100000/WithoutXY-8            78.3µs ± 1%     78.5µs ± 1%      ~     (p=0.690 n=5+5)
    GCD10x100000/WithXY-8                231µs ± 0%      229µs ± 1%    -1.11%  (p=0.016 n=5+5)
    GCD100x100/WithoutXY-8              1.86µs ± 2%     1.86µs ± 2%      ~     (p=0.881 n=5+5)
    GCD100x100/WithXY-8                 27.1µs ± 2%     27.2µs ± 1%      ~     (p=0.421 n=5+5)
    GCD100x1000/WithoutXY-8             4.44µs ± 2%     4.41µs ± 1%      ~     (p=0.310 n=5+5)
    GCD100x1000/WithXY-8                36.3µs ± 1%     36.2µs ± 1%      ~     (p=0.310 n=5+5)
    GCD100x10000/WithoutXY-8            22.6µs ± 2%     22.5µs ± 1%      ~     (p=0.690 n=5+5)
    GCD100x10000/WithXY-8                145µs ± 1%      145µs ± 1%      ~     (p=1.000 n=5+5)
    GCD100x100000/WithoutXY-8            195µs ± 0%      196µs ± 1%      ~     (p=0.548 n=5+5)
    GCD100x100000/WithXY-8              1.10ms ± 0%     1.10ms ± 0%    -0.30%  (p=0.016 n=5+5)
    GCD1000x1000/WithoutXY-8            25.0µs ± 1%     25.2µs ± 2%      ~     (p=0.222 n=5+5)
    GCD1000x1000/WithXY-8                520µs ± 0%      520µs ± 1%      ~     (p=0.151 n=5+5)
    GCD1000x10000/WithoutXY-8           57.0µs ± 1%     56.9µs ± 1%      ~     (p=0.690 n=5+5)
    GCD1000x10000/WithXY-8              1.21ms ± 0%     1.21ms ± 1%      ~     (p=0.881 n=5+5)
    GCD1000x100000/WithoutXY-8           358µs ± 0%      359µs ± 1%      ~     (p=0.548 n=5+5)
    GCD1000x100000/WithXY-8             8.73ms ± 0%     8.73ms ± 0%      ~     (p=0.548 n=5+5)
    GCD10000x10000/WithoutXY-8           686µs ± 0%      687µs ± 0%      ~     (p=0.548 n=5+5)
    GCD10000x10000/WithXY-8             15.9ms ± 0%     15.9ms ± 0%      ~     (p=0.841 n=5+5)
    GCD10000x100000/WithoutXY-8         2.08ms ± 0%     2.08ms ± 0%      ~     (p=1.000 n=5+5)
    GCD10000x100000/WithXY-8            86.7ms ± 0%     86.7ms ± 0%      ~     (p=1.000 n=5+5)
    GCD100000x100000/WithoutXY-8        51.1ms ± 0%     51.0ms ± 0%      ~     (p=0.151 n=5+5)
    GCD100000x100000/WithXY-8            1.23s ± 0%      1.23s ± 0%      ~     (p=0.841 n=5+5)
    Hilbert-8                           2.41ms ± 1%     2.42ms ± 2%      ~     (p=0.690 n=5+5)
    Binomial-8                          4.86µs ± 1%     4.86µs ± 1%      ~     (p=0.889 n=5+5)
    QuoRem-8                            7.09µs ± 0%     7.08µs ± 0%    -0.09%  (p=0.024 n=5+5)
    Exp-8                                161ms ± 0%      161ms ± 0%    -0.08%  (p=0.032 n=5+5)
    Exp2-8                               161ms ± 0%      161ms ± 0%      ~     (p=1.000 n=5+5)
    Bitset-8                            40.7ns ± 0%     40.6ns ± 0%      ~     (p=0.095 n=4+5)
    BitsetNeg-8                          159ns ± 4%      148ns ± 0%    -6.92%  (p=0.016 n=5+4)
    BitsetOrig-8                         378ns ± 1%      378ns ± 1%      ~     (p=0.937 n=5+5)
    BitsetNegOrig-8                      647ns ± 5%      647ns ± 4%      ~     (p=1.000 n=5+5)
    ModSqrt225_Tonelli-8                7.26ms ± 0%     7.27ms ± 0%      ~     (p=1.000 n=5+5)
    ModSqrt224_3Mod4-8                  2.24ms ± 0%     2.24ms ± 0%      ~     (p=0.690 n=5+5)
    ModSqrt5430_Tonelli-8                62.8s ± 1%      62.5s ± 0%      ~     (p=0.063 n=5+4)
    ModSqrt5430_3Mod4-8                  20.8s ± 0%      20.8s ± 0%      ~     (p=0.310 n=5+5)
    Sqrt-8                               101µs ± 1%      101µs ± 0%    -0.35%  (p=0.032 n=5+5)
    IntSqr/1-8                          32.3ns ± 1%     32.5ns ± 1%      ~     (p=0.421 n=5+5)
    IntSqr/2-8                           157ns ± 5%      156ns ± 5%      ~     (p=0.651 n=5+5)
    IntSqr/3-8                           292ns ± 2%      291ns ± 3%      ~     (p=0.881 n=5+5)
    IntSqr/5-8                           738ns ± 6%      740ns ± 5%      ~     (p=0.841 n=5+5)
    IntSqr/8-8                          1.82µs ± 4%     1.83µs ± 4%      ~     (p=0.730 n=5+5)
    IntSqr/10-8                         2.92µs ± 1%     2.93µs ± 1%      ~     (p=0.643 n=5+5)
    IntSqr/20-8                         6.28µs ± 2%     6.28µs ± 2%      ~     (p=1.000 n=5+5)
    IntSqr/30-8                         13.8µs ± 2%     13.9µs ± 3%      ~     (p=1.000 n=5+5)
    IntSqr/50-8                         37.8µs ± 4%     37.9µs ± 4%      ~     (p=0.690 n=5+5)
    IntSqr/80-8                         95.9µs ± 1%     95.8µs ± 1%      ~     (p=0.841 n=5+5)
    IntSqr/100-8                         148µs ± 1%      148µs ± 1%      ~     (p=0.310 n=5+5)
    IntSqr/200-8                         586µs ± 1%      586µs ± 1%      ~     (p=0.841 n=5+5)
    IntSqr/300-8                        1.32ms ± 0%     1.31ms ± 0%      ~     (p=0.222 n=5+5)
    IntSqr/500-8                        2.48ms ± 0%     2.48ms ± 0%      ~     (p=0.556 n=5+4)
    IntSqr/800-8                        4.68ms ± 0%     4.68ms ± 0%      ~     (p=0.548 n=5+5)
    IntSqr/1000-8                       7.57ms ± 0%     7.56ms ± 0%      ~     (p=0.421 n=5+5)
    Mul-8                                311ms ± 0%      311ms ± 0%      ~     (p=0.548 n=5+5)
    Exp3Power/0x10-8                     559ns ± 1%      560ns ± 1%      ~     (p=0.984 n=5+5)
    Exp3Power/0x40-8                     641ns ± 1%      634ns ± 1%      ~     (p=0.063 n=5+5)
    Exp3Power/0x100-8                   1.39µs ± 2%     1.40µs ± 2%      ~     (p=0.381 n=5+5)
    Exp3Power/0x400-8                   8.27µs ± 1%     8.26µs ± 0%      ~     (p=0.571 n=5+5)
    Exp3Power/0x1000-8                  59.9µs ± 0%     59.7µs ± 0%    -0.23%  (p=0.008 n=5+5)
    Exp3Power/0x4000-8                   816µs ± 0%      816µs ± 0%      ~     (p=1.000 n=5+5)
    Exp3Power/0x10000-8                 7.77ms ± 0%     7.77ms ± 0%      ~     (p=0.841 n=5+5)
    Exp3Power/0x40000-8                 73.4ms ± 0%     73.4ms ± 0%      ~     (p=0.690 n=5+5)
    Exp3Power/0x100000-8                 665ms ± 0%      664ms ± 0%    -0.14%  (p=0.008 n=5+5)
    Exp3Power/0x400000-8                 5.98s ± 0%      5.98s ± 0%    -0.09%  (p=0.008 n=5+5)
    Fibo-8                               116ms ± 0%      116ms ± 0%    -0.25%  (p=0.008 n=5+5)
    NatSqr/1-8                           115ns ± 3%      116ns ± 2%      ~     (p=0.238 n=5+5)
    NatSqr/2-8                           237ns ± 1%      237ns ± 1%      ~     (p=0.683 n=5+5)
    NatSqr/3-8                           367ns ± 3%      368ns ± 3%      ~     (p=0.817 n=5+5)
    NatSqr/5-8                           807ns ± 3%      812ns ± 3%      ~     (p=0.913 n=5+5)
    NatSqr/8-8                          1.93µs ± 2%     1.93µs ± 3%      ~     (p=0.651 n=5+5)
    NatSqr/10-8                         2.98µs ± 2%     2.99µs ± 2%      ~     (p=0.690 n=5+5)
    NatSqr/20-8                         6.49µs ± 2%     6.46µs ± 2%      ~     (p=0.548 n=5+5)
    NatSqr/30-8                         14.4µs ± 2%     14.3µs ± 2%      ~     (p=0.690 n=5+5)
    NatSqr/50-8                         38.6µs ± 2%     38.7µs ± 2%      ~     (p=0.841 n=5+5)
    NatSqr/80-8                         96.1µs ± 2%     95.8µs ± 2%      ~     (p=0.548 n=5+5)
    NatSqr/100-8                         149µs ± 1%      149µs ± 1%      ~     (p=0.841 n=5+5)
    NatSqr/200-8                         593µs ± 1%      590µs ± 1%      ~     (p=0.421 n=5+5)
    NatSqr/300-8                        1.32ms ± 0%     1.32ms ± 1%      ~     (p=0.222 n=5+5)
    NatSqr/500-8                        2.49ms ± 0%     2.49ms ± 0%      ~     (p=0.690 n=5+5)
    NatSqr/800-8                        4.69ms ± 0%     4.69ms ± 0%      ~     (p=1.000 n=5+5)
    NatSqr/1000-8                       7.59ms ± 0%     7.58ms ± 0%      ~     (p=0.841 n=5+5)
    ScanPi-8                             322µs ± 0%      321µs ± 0%      ~     (p=0.095 n=5+5)
    StringPiParallel-8                  71.4µs ± 5%     68.8µs ± 4%      ~     (p=0.151 n=5+5)
    Scan/10/Base2-8                     1.10µs ± 0%     1.09µs ± 0%    -0.36%  (p=0.032 n=5+5)
    Scan/100/Base2-8                    7.78µs ± 0%     7.79µs ± 0%    +0.14%  (p=0.008 n=5+5)
    Scan/1000/Base2-8                   78.8µs ± 0%     79.0µs ± 0%    +0.24%  (p=0.008 n=5+5)
    Scan/10000/Base2-8                  1.22ms ± 0%     1.22ms ± 0%      ~     (p=0.056 n=5+5)
    Scan/100000/Base2-8                 55.1ms ± 0%     55.0ms ± 0%    -0.15%  (p=0.008 n=5+5)
    Scan/10/Base8-8                      514ns ± 0%      515ns ± 0%      ~     (p=0.079 n=5+5)
    Scan/100/Base8-8                    2.89µs ± 0%     2.89µs ± 0%    +0.15%  (p=0.008 n=5+5)
    Scan/1000/Base8-8                   31.0µs ± 0%     31.1µs ± 0%    +0.12%  (p=0.008 n=5+5)
    Scan/10000/Base8-8                   740µs ± 0%      740µs ± 0%      ~     (p=0.222 n=5+5)
    Scan/100000/Base8-8                 50.6ms ± 0%     50.5ms ± 0%    -0.06%  (p=0.016 n=4+5)
    Scan/10/Base10-8                     492ns ± 1%      490ns ± 1%      ~     (p=0.310 n=5+5)
    Scan/100/Base10-8                   2.67µs ± 0%     2.67µs ± 0%      ~     (p=0.056 n=5+5)
    Scan/1000/Base10-8                  28.7µs ± 0%     28.7µs ± 0%      ~     (p=1.000 n=5+5)
    Scan/10000/Base10-8                  717µs ± 0%      716µs ± 0%      ~     (p=0.222 n=5+5)
    Scan/100000/Base10-8                50.2ms ± 0%     50.3ms ± 0%    +0.05%  (p=0.008 n=5+5)
    Scan/10/Base16-8                     442ns ± 1%      442ns ± 0%      ~     (p=0.468 n=5+5)
    Scan/100/Base16-8                   2.46µs ± 0%     2.45µs ± 0%      ~     (p=0.159 n=5+5)
    Scan/1000/Base16-8                  27.2µs ± 0%     27.2µs ± 0%      ~     (p=0.841 n=5+5)
    Scan/10000/Base16-8                  721µs ± 0%      722µs ± 0%      ~     (p=0.548 n=5+5)
    Scan/100000/Base16-8                52.6ms ± 0%     52.6ms ± 0%    +0.07%  (p=0.008 n=5+5)
    String/10/Base2-8                    244ns ± 1%      242ns ± 1%      ~     (p=0.103 n=5+5)
    String/100/Base2-8                  1.48µs ± 0%     1.48µs ± 1%      ~     (p=0.786 n=5+5)
    String/1000/Base2-8                 13.3µs ± 1%     13.3µs ± 0%      ~     (p=0.222 n=5+5)
    String/10000/Base2-8                 132µs ± 1%      132µs ± 1%      ~     (p=1.000 n=5+5)
    String/100000/Base2-8               1.30ms ± 1%     1.30ms ± 1%      ~     (p=1.000 n=5+5)
    String/10/Base8-8                    167ns ± 1%      168ns ± 1%      ~     (p=0.135 n=5+5)
    String/100/Base8-8                   623ns ± 1%      626ns ± 1%      ~     (p=0.151 n=5+5)
    String/1000/Base8-8                 5.24µs ± 1%     5.24µs ± 0%      ~     (p=1.000 n=5+5)
    String/10000/Base8-8                50.0µs ± 1%     50.0µs ± 1%      ~     (p=1.000 n=5+5)
    String/100000/Base8-8                492µs ± 1%      489µs ± 1%      ~     (p=0.056 n=5+5)
    String/10/Base10-8                   503ns ± 1%      501ns ± 0%      ~     (p=0.183 n=5+5)
    String/100/Base10-8                 1.96µs ± 0%     1.97µs ± 0%      ~     (p=0.389 n=5+5)
    String/1000/Base10-8                12.4µs ± 1%     12.4µs ± 1%      ~     (p=0.841 n=5+5)
    String/10000/Base10-8               56.7µs ± 1%     56.6µs ± 0%      ~     (p=1.000 n=5+5)
    String/100000/Base10-8              25.6ms ± 0%     25.6ms ± 0%      ~     (p=0.222 n=5+5)
    String/10/Base16-8                   147ns ± 0%      148ns ± 2%      ~     (p=1.000 n=4+5)
    String/100/Base16-8                  505ns ± 0%      505ns ± 1%      ~     (p=0.778 n=5+5)
    String/1000/Base16-8                3.94µs ± 0%     3.94µs ± 0%      ~     (p=0.841 n=5+5)
    String/10000/Base16-8               37.4µs ± 1%     37.2µs ± 1%      ~     (p=0.095 n=5+5)
    String/100000/Base16-8               367µs ± 1%      367µs ± 0%      ~     (p=1.000 n=5+5)
    LeafSize/0-8                        6.64ms ± 0%     6.65ms ± 0%      ~     (p=0.690 n=5+5)
    LeafSize/1-8                        72.5µs ± 1%     72.4µs ± 1%      ~     (p=0.841 n=5+5)
    LeafSize/2-8                        72.6µs ± 1%     72.6µs ± 1%      ~     (p=1.000 n=5+5)
    LeafSize/3-8                         377µs ± 0%      377µs ± 0%      ~     (p=0.421 n=5+5)
    LeafSize/4-8                        71.2µs ± 1%     71.3µs ± 0%      ~     (p=0.278 n=5+5)
    LeafSize/5-8                         469µs ± 0%      469µs ± 0%      ~     (p=0.310 n=5+5)
    LeafSize/6-8                         376µs ± 0%      376µs ± 0%      ~     (p=0.841 n=5+5)
    LeafSize/7-8                         244µs ± 0%      244µs ± 0%      ~     (p=0.841 n=5+5)
    LeafSize/8-8                        71.9µs ± 1%     72.1µs ± 1%      ~     (p=0.548 n=5+5)
    LeafSize/9-8                         536µs ± 0%      536µs ± 0%      ~     (p=0.151 n=5+5)
    LeafSize/10-8                        470µs ± 0%      471µs ± 0%    +0.10%  (p=0.032 n=5+5)
    LeafSize/11-8                        458µs ± 0%      458µs ± 0%      ~     (p=0.881 n=5+5)
    LeafSize/12-8                        376µs ± 0%      376µs ± 0%      ~     (p=0.548 n=5+5)
    LeafSize/13-8                        341µs ± 0%      342µs ± 0%      ~     (p=0.222 n=5+5)
    LeafSize/14-8                        246µs ± 0%      245µs ± 0%      ~     (p=0.167 n=5+5)
    LeafSize/15-8                        168µs ± 0%      168µs ± 0%      ~     (p=0.548 n=5+5)
    LeafSize/16-8                       72.1µs ± 1%     72.2µs ± 1%      ~     (p=0.690 n=5+5)
    LeafSize/32-8                       81.5µs ± 1%     81.4µs ± 1%      ~     (p=1.000 n=5+5)
    LeafSize/64-8                        133µs ± 1%      134µs ± 1%      ~     (p=0.690 n=5+5)
    ProbablyPrime/n=0-8                 44.3ms ± 0%     44.2ms ± 0%    -0.28%  (p=0.008 n=5+5)
    ProbablyPrime/n=1-8                 64.8ms ± 0%     64.7ms ± 0%    -0.15%  (p=0.008 n=5+5)
    ProbablyPrime/n=5-8                  147ms ± 0%      147ms ± 0%    -0.11%  (p=0.008 n=5+5)
    ProbablyPrime/n=10-8                 250ms ± 0%      250ms ± 0%      ~     (p=0.056 n=5+5)
    ProbablyPrime/n=20-8                 456ms ± 0%      455ms ± 0%    -0.05%  (p=0.008 n=5+5)
    ProbablyPrime/Lucas-8               23.6ms ± 0%     23.5ms ± 0%    -0.29%  (p=0.008 n=5+5)
    ProbablyPrime/MillerRabinBase2-8    20.6ms ± 0%     20.6ms ± 0%      ~     (p=0.690 n=5+5)
    FloatSqrt/64-8                      2.01µs ± 1%     2.02µs ± 1%      ~     (p=0.421 n=5+5)
    FloatSqrt/128-8                     4.43µs ± 2%     4.38µs ± 2%      ~     (p=0.222 n=5+5)
    FloatSqrt/256-8                     6.64µs ± 1%     6.68µs ± 2%      ~     (p=0.516 n=5+5)
    FloatSqrt/1000-8                    31.9µs ± 0%     31.8µs ± 0%      ~     (p=0.095 n=5+5)
    FloatSqrt/10000-8                    595µs ± 0%      594µs ± 0%      ~     (p=0.056 n=5+5)
    FloatSqrt/100000-8                  17.9ms ± 0%     17.9ms ± 0%      ~     (p=0.151 n=5+5)
    FloatSqrt/1000000-8                  1.52s ± 0%      1.52s ± 0%      ~     (p=0.841 n=5+5)
    
    name                              old speed      new speed       delta
    AddVV/1-8                         2.97GB/s ± 0%   2.97GB/s ± 0%      ~     (p=0.971 n=4+4)
    AddVV/2-8                         9.47GB/s ± 0%   9.47GB/s ± 0%    +0.01%  (p=0.016 n=5+5)
    AddVV/3-8                         12.4GB/s ± 0%   12.4GB/s ± 0%      ~     (p=0.548 n=5+5)
    AddVV/4-8                         14.6GB/s ± 0%   14.6GB/s ± 0%      ~     (p=1.000 n=5+5)
    AddVV/5-8                         16.4GB/s ± 0%   16.4GB/s ± 0%      ~     (p=1.000 n=5+5)
    AddVV/10-8                        21.7GB/s ± 0%   21.7GB/s ± 0%      ~     (p=0.548 n=5+5)
    AddVV/100-8                       29.4GB/s ± 0%   29.4GB/s ± 0%      ~     (p=1.000 n=5+5)
    AddVV/1000-8                      31.7GB/s ± 0%   31.7GB/s ± 0%      ~     (p=0.524 n=5+4)
    AddVV/10000-8                     31.5GB/s ± 0%   31.5GB/s ± 0%      ~     (p=0.690 n=5+5)
    AddVV/100000-8                    28.8GB/s ± 7%   28.1GB/s ± 8%      ~     (p=0.548 n=5+5)
    AddVW/1-8                          859MB/s ± 0%    864MB/s ± 0%    +0.61%  (p=0.008 n=5+5)
    AddVW/2-8                          809MB/s ± 2%   1520MB/s ± 0%   +87.78%  (p=0.008 n=5+5)
    AddVW/3-8                         2.08GB/s ± 0%   2.18GB/s ± 0%    +4.54%  (p=0.008 n=5+5)
    AddVW/4-8                         2.46GB/s ± 0%   2.66GB/s ± 0%    +8.33%  (p=0.016 n=4+5)
    AddVW/5-8                         2.76GB/s ± 0%   3.20GB/s ± 0%   +16.03%  (p=0.008 n=5+5)
    AddVW/10-8                        3.63GB/s ± 0%   5.15GB/s ± 0%   +41.83%  (p=0.008 n=5+5)
    AddVW/100-8                       4.79GB/s ± 0%   9.87GB/s ± 0%  +106.12%  (p=0.008 n=5+5)
    AddVW/1000-8                      5.27GB/s ± 0%  12.42GB/s ± 0%  +135.74%  (p=0.008 n=5+5)
    AddVW/10000-8                     5.31GB/s ± 0%  11.19GB/s ± 0%  +110.71%  (p=0.008 n=5+5)
    AddVW/100000-8                    5.32GB/s ± 0%  11.32GB/s ± 0%  +112.56%  (p=0.008 n=5+5)
    SubVW/1-8                          859MB/s ± 0%    864MB/s ± 0%    +0.61%  (p=0.008 n=5+5)
    SubVW/2-8                          812MB/s ± 2%   1520MB/s ± 0%   +87.09%  (p=0.008 n=5+5)
    SubVW/3-8                         2.08GB/s ± 0%   2.18GB/s ± 0%    +4.55%  (p=0.008 n=5+5)
    SubVW/4-8                         2.46GB/s ± 0%   2.66GB/s ± 0%    +8.33%  (p=0.008 n=5+5)
    SubVW/5-8                         2.75GB/s ± 0%   3.20GB/s ± 0%   +16.03%  (p=0.008 n=5+5)
    SubVW/10-8                        3.63GB/s ± 0%   5.15GB/s ± 0%   +41.82%  (p=0.008 n=5+5)
    SubVW/100-8                       4.79GB/s ± 0%   9.87GB/s ± 0%  +106.13%  (p=0.008 n=5+5)
    SubVW/1000-8                      5.27GB/s ± 0%  12.42GB/s ± 0%  +135.74%  (p=0.008 n=5+5)
    SubVW/10000-8                     5.31GB/s ± 0%  11.17GB/s ± 0%  +110.44%  (p=0.008 n=5+5)
    SubVW/100000-8                    5.32GB/s ± 0%  11.31GB/s ± 0%  +112.35%  (p=0.008 n=5+5)
    AddMulVVW/1-8                     1.97GB/s ± 1%   1.96GB/s ± 1%      ~     (p=0.151 n=5+5)
    AddMulVVW/2-8                     2.24GB/s ± 0%   2.25GB/s ± 0%      ~     (p=0.095 n=5+5)
    AddMulVVW/3-8                     2.11GB/s ± 0%   2.12GB/s ± 0%      ~     (p=0.548 n=5+5)
    AddMulVVW/4-8                     2.17GB/s ± 1%   2.17GB/s ± 1%      ~     (p=0.548 n=5+5)
    AddMulVVW/5-8                     2.22GB/s ± 1%   2.21GB/s ± 1%      ~     (p=0.421 n=5+5)
    AddMulVVW/10-8                    2.17GB/s ± 1%   2.16GB/s ± 0%      ~     (p=0.095 n=5+5)
    AddMulVVW/100-8                   2.35GB/s ± 0%   2.35GB/s ± 0%      ~     (p=0.421 n=5+5)
    AddMulVVW/1000-8                  2.47GB/s ± 0%   2.41GB/s ± 0%    -2.09%  (p=0.008 n=5+5)
    AddMulVVW/10000-8                 2.16GB/s ± 0%   2.15GB/s ± 0%    -0.23%  (p=0.008 n=5+5)
    AddMulVVW/100000-8                2.03GB/s ± 1%   2.04GB/s ± 0%      ~     (p=0.690 n=5+5)
    
    name                              old alloc/op   new alloc/op    delta
    FloatString/100-8                     400B ± 0%       400B ± 0%      ~     (all equal)
    FloatString/1000-8                  3.22kB ± 0%     3.22kB ± 0%      ~     (all equal)
    FloatString/10000-8                 55.6kB ± 0%     55.5kB ± 0%      ~     (p=0.206 n=5+5)
    FloatString/100000-8                 627kB ± 0%      627kB ± 0%      ~     (all equal)
    FloatAdd/10-8                        0.00B           0.00B           ~     (all equal)
    FloatAdd/100-8                       0.00B           0.00B           ~     (all equal)
    FloatAdd/1000-8                      0.00B           0.00B           ~     (all equal)
    FloatAdd/10000-8                     0.00B           0.00B           ~     (all equal)
    FloatAdd/100000-8                    0.00B           0.00B           ~     (all equal)
    FloatSub/10-8                        0.00B           0.00B           ~     (all equal)
    FloatSub/100-8                       0.00B           0.00B           ~     (all equal)
    FloatSub/1000-8                      0.00B           0.00B           ~     (all equal)
    FloatSub/10000-8                     0.00B           0.00B           ~     (all equal)
    FloatSub/100000-8                    0.00B           0.00B           ~     (all equal)
    FloatSqrt/64-8                        416B ± 0%       416B ± 0%      ~     (all equal)
    FloatSqrt/128-8                       720B ± 0%       720B ± 0%      ~     (all equal)
    FloatSqrt/256-8                       816B ± 0%       816B ± 0%      ~     (all equal)
    FloatSqrt/1000-8                    2.50kB ± 0%     2.50kB ± 0%      ~     (all equal)
    FloatSqrt/10000-8                   23.5kB ± 0%     23.5kB ± 0%      ~     (all equal)
    FloatSqrt/100000-8                   251kB ± 0%      251kB ± 0%      ~     (all equal)
    FloatSqrt/1000000-8                 4.61MB ± 0%     4.61MB ± 0%      ~     (all equal)
    
    name                              old allocs/op  new allocs/op   delta
    FloatString/100-8                     8.00 ± 0%       8.00 ± 0%      ~     (all equal)
    FloatString/1000-8                    10.0 ± 0%       10.0 ± 0%      ~     (all equal)
    FloatString/10000-8                   42.0 ± 0%       42.0 ± 0%      ~     (all equal)
    FloatString/100000-8                   346 ± 0%        346 ± 0%      ~     (all equal)
    FloatAdd/10-8                         0.00            0.00           ~     (all equal)
    FloatAdd/100-8                        0.00            0.00           ~     (all equal)
    FloatAdd/1000-8                       0.00            0.00           ~     (all equal)
    FloatAdd/10000-8                      0.00            0.00           ~     (all equal)
    FloatAdd/100000-8                     0.00            0.00           ~     (all equal)
    FloatSub/10-8                         0.00            0.00           ~     (all equal)
    FloatSub/100-8                        0.00            0.00           ~     (all equal)
    FloatSub/1000-8                       0.00            0.00           ~     (all equal)
    FloatSub/10000-8                      0.00            0.00           ~     (all equal)
    FloatSub/100000-8                     0.00            0.00           ~     (all equal)
    FloatSqrt/64-8                        9.00 ± 0%       9.00 ± 0%      ~     (all equal)
    FloatSqrt/128-8                       13.0 ± 0%       13.0 ± 0%      ~     (all equal)
    FloatSqrt/256-8                       12.0 ± 0%       12.0 ± 0%      ~     (all equal)
    FloatSqrt/1000-8                      19.0 ± 0%       19.0 ± 0%      ~     (all equal)
    FloatSqrt/10000-8                     35.0 ± 0%       35.0 ± 0%      ~     (all equal)
    FloatSqrt/100000-8                    55.0 ± 0%       55.0 ± 0%      ~     (all equal)
    FloatSqrt/1000000-8                    122 ± 0%        122 ± 0%      ~     (all equal)
    
    Change-Id: I6888d84c037d91f9e2199f3492ea3f6a0ed77b24
    Reviewed-on: https://go-review.googlesource.com/77832
    Reviewed-by: Vlad Krasnov <vlad@cloudflare.com>
    Reviewed-by: Cherry Zhang <cherryyz@google.com>
    Run-TryBot: Cherry Zhang <cherryyz@google.com>
    TryBot-Result: Gobot Gobot <gobot@golang.org>
arith_test.go 9.87 KB