-
erifan01 authored
The biggest hot spot of the existing implementation is "load" operations, which lead to poor performance. By unrolling the cycle 4 times and 2 times, and using "LDP", "STP" instructions, this CL can reduce the "load" cost and improve performance. Benchmarks: name old time/op new time/op delta AddVV/1-8 21.5ns ± 0% 21.5ns ± 0% ~ (all equal) AddVV/2-8 13.5ns ± 0% 13.5ns ± 0% ~ (all equal) AddVV/3-8 15.5ns ± 0% 15.5ns ± 0% ~ (all equal) AddVV/4-8 17.5ns ± 0% 17.5ns ± 0% ~ (all equal) AddVV/5-8 19.5ns ± 0% 19.5ns ± 0% ~ (all equal) AddVV/10-8 29.5ns ± 0% 29.5ns ± 0% ~ (all equal) AddVV/100-8 217ns ± 0% 217ns ± 0% ~ (all equal) AddVV/1000-8 2.02µs ± 0% 2.02µs ± 0% ~ (all equal) AddVV/10000-8 20.3µs ± 0% 20.3µs ± 0% ~ (p=0.603 n=5+5) AddVV/100000-8 223µs ± 7% 228µs ± 8% ~ (p=0.548 n=5+5) AddVW/1-8 9.32ns ± 0% 9.26ns ± 0% -0.64% (p=0.008 n=5+5) AddVW/2-8 19.8ns ± 3% 10.5ns ± 0% -46.92% (p=0.008 n=5+5) AddVW/3-8 11.5ns ± 0% 11.0ns ± 0% -4.35% (p=0.008 n=5+5) AddVW/4-8 13.0ns ± 0% 12.0ns ± 0% -7.69% (p=0.008 n=5+5) AddVW/5-8 14.5ns ± 0% 12.5ns ± 0% -13.79% (p=0.008 n=5+5) AddVW/10-8 22.0ns ± 0% 15.5ns ± 0% -29.55% (p=0.008 n=5+5) AddVW/100-8 167ns ± 0% 81ns ± 0% -51.44% (p=0.008 n=5+5) AddVW/1000-8 1.52µs ± 0% 0.64µs ± 0% -57.58% (p=0.008 n=5+5) AddVW/10000-8 15.1µs ± 0% 7.2µs ± 0% -52.55% (p=0.008 n=5+5) AddVW/100000-8 150µs ± 0% 71µs ± 0% -52.95% (p=0.008 n=5+5) SubVW/1-8 9.32ns ± 0% 9.26ns ± 0% -0.64% (p=0.008 n=5+5) SubVW/2-8 19.7ns ± 2% 10.5ns ± 0% -46.70% (p=0.008 n=5+5) SubVW/3-8 11.5ns ± 0% 11.0ns ± 0% -4.35% (p=0.008 n=5+5) SubVW/4-8 13.0ns ± 0% 12.0ns ± 0% -7.69% (p=0.008 n=5+5) SubVW/5-8 14.5ns ± 0% 12.5ns ± 0% -13.79% (p=0.008 n=5+5) SubVW/10-8 22.0ns ± 0% 15.5ns ± 0% -29.55% (p=0.008 n=5+5) SubVW/100-8 167ns ± 0% 81ns ± 0% -51.44% (p=0.008 n=5+5) SubVW/1000-8 1.52µs ± 0% 0.64µs ± 0% -57.58% (p=0.008 n=5+5) SubVW/10000-8 15.1µs ± 0% 7.2µs ± 0% -52.49% (p=0.008 n=5+5) SubVW/100000-8 150µs ± 0% 71µs ± 0% -52.91% (p=0.008 n=5+5) AddMulVVW/1-8 32.4ns ± 1% 32.6ns ± 1% ~ (p=0.119 n=5+5) AddMulVVW/2-8 57.0ns ± 0% 57.0ns ± 0% ~ (p=0.643 n=5+5) AddMulVVW/3-8 90.8ns ± 0% 90.7ns ± 0% ~ (p=0.524 n=5+5) AddMulVVW/4-8 118ns ± 0% 118ns ± 1% ~ (p=1.000 n=4+5) AddMulVVW/5-8 144ns ± 1% 144ns ± 0% ~ (p=0.794 n=5+4) AddMulVVW/10-8 294ns ± 1% 296ns ± 0% +0.48% (p=0.040 n=5+5) AddMulVVW/100-8 2.73µs ± 0% 2.73µs ± 0% ~ (p=0.278 n=5+5) AddMulVVW/1000-8 26.0µs ± 0% 26.5µs ± 0% +2.14% (p=0.008 n=5+5) AddMulVVW/10000-8 297µs ± 0% 297µs ± 0% +0.24% (p=0.008 n=5+5) AddMulVVW/100000-8 3.15ms ± 1% 3.13ms ± 0% ~ (p=0.690 n=5+5) DecimalConversion-8 311µs ± 2% 309µs ± 2% ~ (p=0.310 n=5+5) FloatString/100-8 2.55µs ± 2% 2.54µs ± 2% ~ (p=1.000 n=5+5) FloatString/1000-8 58.1µs ± 0% 58.1µs ± 0% ~ (p=0.151 n=5+5) FloatString/10000-8 4.59ms ± 0% 4.59ms ± 0% ~ (p=0.151 n=5+5) FloatString/100000-8 446ms ± 0% 446ms ± 0% +0.01% (p=0.016 n=5+5) FloatAdd/10-8 183ns ± 0% 183ns ± 0% ~ (p=0.333 n=4+5) FloatAdd/100-8 187ns ± 1% 192ns ± 2% ~ (p=0.056 n=5+5) FloatAdd/1000-8 369ns ± 0% 371ns ± 0% +0.54% (p=0.016 n=4+5) FloatAdd/10000-8 1.88µs ± 0% 1.88µs ± 0% -0.14% (p=0.000 n=4+5) FloatAdd/100000-8 17.2µs ± 0% 17.1µs ± 0% -0.37% (p=0.008 n=5+5) FloatSub/10-8 147ns ± 0% 147ns ± 0% ~ (all equal) FloatSub/100-8 145ns ± 0% 146ns ± 0% ~ (p=0.238 n=5+4) FloatSub/1000-8 241ns ± 0% 241ns ± 0% ~ (p=0.333 n=5+4) FloatSub/10000-8 1.06µs ± 0% 1.06µs ± 0% ~ (p=0.444 n=5+5) FloatSub/100000-8 9.50µs ± 0% 9.48µs ± 0% -0.14% (p=0.008 n=5+5) ParseFloatSmallExp-8 28.4µs ± 2% 28.5µs ± 1% ~ (p=0.690 n=5+5) ParseFloatLargeExp-8 125µs ± 1% 124µs ± 1% ~ (p=0.095 n=5+5) GCD10x10/WithoutXY-8 277ns ± 2% 278ns ± 3% ~ (p=0.937 n=5+5) GCD10x10/WithXY-8 2.08µs ± 3% 2.15µs ± 3% ~ (p=0.056 n=5+5) GCD10x100/WithoutXY-8 592ns ± 3% 613ns ± 4% ~ (p=0.056 n=5+5) GCD10x100/WithXY-8 3.40µs ± 2% 3.42µs ± 4% ~ (p=0.841 n=5+5) GCD10x1000/WithoutXY-8 1.37µs ± 2% 1.35µs ± 3% ~ (p=0.460 n=5+5) GCD10x1000/WithXY-8 7.34µs ± 2% 7.33µs ± 4% ~ (p=0.841 n=5+5) GCD10x10000/WithoutXY-8 8.52µs ± 0% 8.51µs ± 1% ~ (p=0.421 n=5+5) GCD10x10000/WithXY-8 27.5µs ± 2% 27.2µs ± 1% ~ (p=0.151 n=5+5) GCD10x100000/WithoutXY-8 78.3µs ± 1% 78.5µs ± 1% ~ (p=0.690 n=5+5) GCD10x100000/WithXY-8 231µs ± 0% 229µs ± 1% -1.11% (p=0.016 n=5+5) GCD100x100/WithoutXY-8 1.86µs ± 2% 1.86µs ± 2% ~ (p=0.881 n=5+5) GCD100x100/WithXY-8 27.1µs ± 2% 27.2µs ± 1% ~ (p=0.421 n=5+5) GCD100x1000/WithoutXY-8 4.44µs ± 2% 4.41µs ± 1% ~ (p=0.310 n=5+5) GCD100x1000/WithXY-8 36.3µs ± 1% 36.2µs ± 1% ~ (p=0.310 n=5+5) GCD100x10000/WithoutXY-8 22.6µs ± 2% 22.5µs ± 1% ~ (p=0.690 n=5+5) GCD100x10000/WithXY-8 145µs ± 1% 145µs ± 1% ~ (p=1.000 n=5+5) GCD100x100000/WithoutXY-8 195µs ± 0% 196µs ± 1% ~ (p=0.548 n=5+5) GCD100x100000/WithXY-8 1.10ms ± 0% 1.10ms ± 0% -0.30% (p=0.016 n=5+5) GCD1000x1000/WithoutXY-8 25.0µs ± 1% 25.2µs ± 2% ~ (p=0.222 n=5+5) GCD1000x1000/WithXY-8 520µs ± 0% 520µs ± 1% ~ (p=0.151 n=5+5) GCD1000x10000/WithoutXY-8 57.0µs ± 1% 56.9µs ± 1% ~ (p=0.690 n=5+5) GCD1000x10000/WithXY-8 1.21ms ± 0% 1.21ms ± 1% ~ (p=0.881 n=5+5) GCD1000x100000/WithoutXY-8 358µs ± 0% 359µs ± 1% ~ (p=0.548 n=5+5) GCD1000x100000/WithXY-8 8.73ms ± 0% 8.73ms ± 0% ~ (p=0.548 n=5+5) GCD10000x10000/WithoutXY-8 686µs ± 0% 687µs ± 0% ~ (p=0.548 n=5+5) GCD10000x10000/WithXY-8 15.9ms ± 0% 15.9ms ± 0% ~ (p=0.841 n=5+5) GCD10000x100000/WithoutXY-8 2.08ms ± 0% 2.08ms ± 0% ~ (p=1.000 n=5+5) GCD10000x100000/WithXY-8 86.7ms ± 0% 86.7ms ± 0% ~ (p=1.000 n=5+5) GCD100000x100000/WithoutXY-8 51.1ms ± 0% 51.0ms ± 0% ~ (p=0.151 n=5+5) GCD100000x100000/WithXY-8 1.23s ± 0% 1.23s ± 0% ~ (p=0.841 n=5+5) Hilbert-8 2.41ms ± 1% 2.42ms ± 2% ~ (p=0.690 n=5+5) Binomial-8 4.86µs ± 1% 4.86µs ± 1% ~ (p=0.889 n=5+5) QuoRem-8 7.09µs ± 0% 7.08µs ± 0% -0.09% (p=0.024 n=5+5) Exp-8 161ms ± 0% 161ms ± 0% -0.08% (p=0.032 n=5+5) Exp2-8 161ms ± 0% 161ms ± 0% ~ (p=1.000 n=5+5) Bitset-8 40.7ns ± 0% 40.6ns ± 0% ~ (p=0.095 n=4+5) BitsetNeg-8 159ns ± 4% 148ns ± 0% -6.92% (p=0.016 n=5+4) BitsetOrig-8 378ns ± 1% 378ns ± 1% ~ (p=0.937 n=5+5) BitsetNegOrig-8 647ns ± 5% 647ns ± 4% ~ (p=1.000 n=5+5) ModSqrt225_Tonelli-8 7.26ms ± 0% 7.27ms ± 0% ~ (p=1.000 n=5+5) ModSqrt224_3Mod4-8 2.24ms ± 0% 2.24ms ± 0% ~ (p=0.690 n=5+5) ModSqrt5430_Tonelli-8 62.8s ± 1% 62.5s ± 0% ~ (p=0.063 n=5+4) ModSqrt5430_3Mod4-8 20.8s ± 0% 20.8s ± 0% ~ (p=0.310 n=5+5) Sqrt-8 101µs ± 1% 101µs ± 0% -0.35% (p=0.032 n=5+5) IntSqr/1-8 32.3ns ± 1% 32.5ns ± 1% ~ (p=0.421 n=5+5) IntSqr/2-8 157ns ± 5% 156ns ± 5% ~ (p=0.651 n=5+5) IntSqr/3-8 292ns ± 2% 291ns ± 3% ~ (p=0.881 n=5+5) IntSqr/5-8 738ns ± 6% 740ns ± 5% ~ (p=0.841 n=5+5) IntSqr/8-8 1.82µs ± 4% 1.83µs ± 4% ~ (p=0.730 n=5+5) IntSqr/10-8 2.92µs ± 1% 2.93µs ± 1% ~ (p=0.643 n=5+5) IntSqr/20-8 6.28µs ± 2% 6.28µs ± 2% ~ (p=1.000 n=5+5) IntSqr/30-8 13.8µs ± 2% 13.9µs ± 3% ~ (p=1.000 n=5+5) IntSqr/50-8 37.8µs ± 4% 37.9µs ± 4% ~ (p=0.690 n=5+5) IntSqr/80-8 95.9µs ± 1% 95.8µs ± 1% ~ (p=0.841 n=5+5) IntSqr/100-8 148µs ± 1% 148µs ± 1% ~ (p=0.310 n=5+5) IntSqr/200-8 586µs ± 1% 586µs ± 1% ~ (p=0.841 n=5+5) IntSqr/300-8 1.32ms ± 0% 1.31ms ± 0% ~ (p=0.222 n=5+5) IntSqr/500-8 2.48ms ± 0% 2.48ms ± 0% ~ (p=0.556 n=5+4) IntSqr/800-8 4.68ms ± 0% 4.68ms ± 0% ~ (p=0.548 n=5+5) IntSqr/1000-8 7.57ms ± 0% 7.56ms ± 0% ~ (p=0.421 n=5+5) Mul-8 311ms ± 0% 311ms ± 0% ~ (p=0.548 n=5+5) Exp3Power/0x10-8 559ns ± 1% 560ns ± 1% ~ (p=0.984 n=5+5) Exp3Power/0x40-8 641ns ± 1% 634ns ± 1% ~ (p=0.063 n=5+5) Exp3Power/0x100-8 1.39µs ± 2% 1.40µs ± 2% ~ (p=0.381 n=5+5) Exp3Power/0x400-8 8.27µs ± 1% 8.26µs ± 0% ~ (p=0.571 n=5+5) Exp3Power/0x1000-8 59.9µs ± 0% 59.7µs ± 0% -0.23% (p=0.008 n=5+5) Exp3Power/0x4000-8 816µs ± 0% 816µs ± 0% ~ (p=1.000 n=5+5) Exp3Power/0x10000-8 7.77ms ± 0% 7.77ms ± 0% ~ (p=0.841 n=5+5) Exp3Power/0x40000-8 73.4ms ± 0% 73.4ms ± 0% ~ (p=0.690 n=5+5) Exp3Power/0x100000-8 665ms ± 0% 664ms ± 0% -0.14% (p=0.008 n=5+5) Exp3Power/0x400000-8 5.98s ± 0% 5.98s ± 0% -0.09% (p=0.008 n=5+5) Fibo-8 116ms ± 0% 116ms ± 0% -0.25% (p=0.008 n=5+5) NatSqr/1-8 115ns ± 3% 116ns ± 2% ~ (p=0.238 n=5+5) NatSqr/2-8 237ns ± 1% 237ns ± 1% ~ (p=0.683 n=5+5) NatSqr/3-8 367ns ± 3% 368ns ± 3% ~ (p=0.817 n=5+5) NatSqr/5-8 807ns ± 3% 812ns ± 3% ~ (p=0.913 n=5+5) NatSqr/8-8 1.93µs ± 2% 1.93µs ± 3% ~ (p=0.651 n=5+5) NatSqr/10-8 2.98µs ± 2% 2.99µs ± 2% ~ (p=0.690 n=5+5) NatSqr/20-8 6.49µs ± 2% 6.46µs ± 2% ~ (p=0.548 n=5+5) NatSqr/30-8 14.4µs ± 2% 14.3µs ± 2% ~ (p=0.690 n=5+5) NatSqr/50-8 38.6µs ± 2% 38.7µs ± 2% ~ (p=0.841 n=5+5) NatSqr/80-8 96.1µs ± 2% 95.8µs ± 2% ~ (p=0.548 n=5+5) NatSqr/100-8 149µs ± 1% 149µs ± 1% ~ (p=0.841 n=5+5) NatSqr/200-8 593µs ± 1% 590µs ± 1% ~ (p=0.421 n=5+5) NatSqr/300-8 1.32ms ± 0% 1.32ms ± 1% ~ (p=0.222 n=5+5) NatSqr/500-8 2.49ms ± 0% 2.49ms ± 0% ~ (p=0.690 n=5+5) NatSqr/800-8 4.69ms ± 0% 4.69ms ± 0% ~ (p=1.000 n=5+5) NatSqr/1000-8 7.59ms ± 0% 7.58ms ± 0% ~ (p=0.841 n=5+5) ScanPi-8 322µs ± 0% 321µs ± 0% ~ (p=0.095 n=5+5) StringPiParallel-8 71.4µs ± 5% 68.8µs ± 4% ~ (p=0.151 n=5+5) Scan/10/Base2-8 1.10µs ± 0% 1.09µs ± 0% -0.36% (p=0.032 n=5+5) Scan/100/Base2-8 7.78µs ± 0% 7.79µs ± 0% +0.14% (p=0.008 n=5+5) Scan/1000/Base2-8 78.8µs ± 0% 79.0µs ± 0% +0.24% (p=0.008 n=5+5) Scan/10000/Base2-8 1.22ms ± 0% 1.22ms ± 0% ~ (p=0.056 n=5+5) Scan/100000/Base2-8 55.1ms ± 0% 55.0ms ± 0% -0.15% (p=0.008 n=5+5) Scan/10/Base8-8 514ns ± 0% 515ns ± 0% ~ (p=0.079 n=5+5) Scan/100/Base8-8 2.89µs ± 0% 2.89µs ± 0% +0.15% (p=0.008 n=5+5) Scan/1000/Base8-8 31.0µs ± 0% 31.1µs ± 0% +0.12% (p=0.008 n=5+5) Scan/10000/Base8-8 740µs ± 0% 740µs ± 0% ~ (p=0.222 n=5+5) Scan/100000/Base8-8 50.6ms ± 0% 50.5ms ± 0% -0.06% (p=0.016 n=4+5) Scan/10/Base10-8 492ns ± 1% 490ns ± 1% ~ (p=0.310 n=5+5) Scan/100/Base10-8 2.67µs ± 0% 2.67µs ± 0% ~ (p=0.056 n=5+5) Scan/1000/Base10-8 28.7µs ± 0% 28.7µs ± 0% ~ (p=1.000 n=5+5) Scan/10000/Base10-8 717µs ± 0% 716µs ± 0% ~ (p=0.222 n=5+5) Scan/100000/Base10-8 50.2ms ± 0% 50.3ms ± 0% +0.05% (p=0.008 n=5+5) Scan/10/Base16-8 442ns ± 1% 442ns ± 0% ~ (p=0.468 n=5+5) Scan/100/Base16-8 2.46µs ± 0% 2.45µs ± 0% ~ (p=0.159 n=5+5) Scan/1000/Base16-8 27.2µs ± 0% 27.2µs ± 0% ~ (p=0.841 n=5+5) Scan/10000/Base16-8 721µs ± 0% 722µs ± 0% ~ (p=0.548 n=5+5) Scan/100000/Base16-8 52.6ms ± 0% 52.6ms ± 0% +0.07% (p=0.008 n=5+5) String/10/Base2-8 244ns ± 1% 242ns ± 1% ~ (p=0.103 n=5+5) String/100/Base2-8 1.48µs ± 0% 1.48µs ± 1% ~ (p=0.786 n=5+5) String/1000/Base2-8 13.3µs ± 1% 13.3µs ± 0% ~ (p=0.222 n=5+5) String/10000/Base2-8 132µs ± 1% 132µs ± 1% ~ (p=1.000 n=5+5) String/100000/Base2-8 1.30ms ± 1% 1.30ms ± 1% ~ (p=1.000 n=5+5) String/10/Base8-8 167ns ± 1% 168ns ± 1% ~ (p=0.135 n=5+5) String/100/Base8-8 623ns ± 1% 626ns ± 1% ~ (p=0.151 n=5+5) String/1000/Base8-8 5.24µs ± 1% 5.24µs ± 0% ~ (p=1.000 n=5+5) String/10000/Base8-8 50.0µs ± 1% 50.0µs ± 1% ~ (p=1.000 n=5+5) String/100000/Base8-8 492µs ± 1% 489µs ± 1% ~ (p=0.056 n=5+5) String/10/Base10-8 503ns ± 1% 501ns ± 0% ~ (p=0.183 n=5+5) String/100/Base10-8 1.96µs ± 0% 1.97µs ± 0% ~ (p=0.389 n=5+5) String/1000/Base10-8 12.4µs ± 1% 12.4µs ± 1% ~ (p=0.841 n=5+5) String/10000/Base10-8 56.7µs ± 1% 56.6µs ± 0% ~ (p=1.000 n=5+5) String/100000/Base10-8 25.6ms ± 0% 25.6ms ± 0% ~ (p=0.222 n=5+5) String/10/Base16-8 147ns ± 0% 148ns ± 2% ~ (p=1.000 n=4+5) String/100/Base16-8 505ns ± 0% 505ns ± 1% ~ (p=0.778 n=5+5) String/1000/Base16-8 3.94µs ± 0% 3.94µs ± 0% ~ (p=0.841 n=5+5) String/10000/Base16-8 37.4µs ± 1% 37.2µs ± 1% ~ (p=0.095 n=5+5) String/100000/Base16-8 367µs ± 1% 367µs ± 0% ~ (p=1.000 n=5+5) LeafSize/0-8 6.64ms ± 0% 6.65ms ± 0% ~ (p=0.690 n=5+5) LeafSize/1-8 72.5µs ± 1% 72.4µs ± 1% ~ (p=0.841 n=5+5) LeafSize/2-8 72.6µs ± 1% 72.6µs ± 1% ~ (p=1.000 n=5+5) LeafSize/3-8 377µs ± 0% 377µs ± 0% ~ (p=0.421 n=5+5) LeafSize/4-8 71.2µs ± 1% 71.3µs ± 0% ~ (p=0.278 n=5+5) LeafSize/5-8 469µs ± 0% 469µs ± 0% ~ (p=0.310 n=5+5) LeafSize/6-8 376µs ± 0% 376µs ± 0% ~ (p=0.841 n=5+5) LeafSize/7-8 244µs ± 0% 244µs ± 0% ~ (p=0.841 n=5+5) LeafSize/8-8 71.9µs ± 1% 72.1µs ± 1% ~ (p=0.548 n=5+5) LeafSize/9-8 536µs ± 0% 536µs ± 0% ~ (p=0.151 n=5+5) LeafSize/10-8 470µs ± 0% 471µs ± 0% +0.10% (p=0.032 n=5+5) LeafSize/11-8 458µs ± 0% 458µs ± 0% ~ (p=0.881 n=5+5) LeafSize/12-8 376µs ± 0% 376µs ± 0% ~ (p=0.548 n=5+5) LeafSize/13-8 341µs ± 0% 342µs ± 0% ~ (p=0.222 n=5+5) LeafSize/14-8 246µs ± 0% 245µs ± 0% ~ (p=0.167 n=5+5) LeafSize/15-8 168µs ± 0% 168µs ± 0% ~ (p=0.548 n=5+5) LeafSize/16-8 72.1µs ± 1% 72.2µs ± 1% ~ (p=0.690 n=5+5) LeafSize/32-8 81.5µs ± 1% 81.4µs ± 1% ~ (p=1.000 n=5+5) LeafSize/64-8 133µs ± 1% 134µs ± 1% ~ (p=0.690 n=5+5) ProbablyPrime/n=0-8 44.3ms ± 0% 44.2ms ± 0% -0.28% (p=0.008 n=5+5) ProbablyPrime/n=1-8 64.8ms ± 0% 64.7ms ± 0% -0.15% (p=0.008 n=5+5) ProbablyPrime/n=5-8 147ms ± 0% 147ms ± 0% -0.11% (p=0.008 n=5+5) ProbablyPrime/n=10-8 250ms ± 0% 250ms ± 0% ~ (p=0.056 n=5+5) ProbablyPrime/n=20-8 456ms ± 0% 455ms ± 0% -0.05% (p=0.008 n=5+5) ProbablyPrime/Lucas-8 23.6ms ± 0% 23.5ms ± 0% -0.29% (p=0.008 n=5+5) ProbablyPrime/MillerRabinBase2-8 20.6ms ± 0% 20.6ms ± 0% ~ (p=0.690 n=5+5) FloatSqrt/64-8 2.01µs ± 1% 2.02µs ± 1% ~ (p=0.421 n=5+5) FloatSqrt/128-8 4.43µs ± 2% 4.38µs ± 2% ~ (p=0.222 n=5+5) FloatSqrt/256-8 6.64µs ± 1% 6.68µs ± 2% ~ (p=0.516 n=5+5) FloatSqrt/1000-8 31.9µs ± 0% 31.8µs ± 0% ~ (p=0.095 n=5+5) FloatSqrt/10000-8 595µs ± 0% 594µs ± 0% ~ (p=0.056 n=5+5) FloatSqrt/100000-8 17.9ms ± 0% 17.9ms ± 0% ~ (p=0.151 n=5+5) FloatSqrt/1000000-8 1.52s ± 0% 1.52s ± 0% ~ (p=0.841 n=5+5) name old speed new speed delta AddVV/1-8 2.97GB/s ± 0% 2.97GB/s ± 0% ~ (p=0.971 n=4+4) AddVV/2-8 9.47GB/s ± 0% 9.47GB/s ± 0% +0.01% (p=0.016 n=5+5) AddVV/3-8 12.4GB/s ± 0% 12.4GB/s ± 0% ~ (p=0.548 n=5+5) AddVV/4-8 14.6GB/s ± 0% 14.6GB/s ± 0% ~ (p=1.000 n=5+5) AddVV/5-8 16.4GB/s ± 0% 16.4GB/s ± 0% ~ (p=1.000 n=5+5) AddVV/10-8 21.7GB/s ± 0% 21.7GB/s ± 0% ~ (p=0.548 n=5+5) AddVV/100-8 29.4GB/s ± 0% 29.4GB/s ± 0% ~ (p=1.000 n=5+5) AddVV/1000-8 31.7GB/s ± 0% 31.7GB/s ± 0% ~ (p=0.524 n=5+4) AddVV/10000-8 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.690 n=5+5) AddVV/100000-8 28.8GB/s ± 7% 28.1GB/s ± 8% ~ (p=0.548 n=5+5) AddVW/1-8 859MB/s ± 0% 864MB/s ± 0% +0.61% (p=0.008 n=5+5) AddVW/2-8 809MB/s ± 2% 1520MB/s ± 0% +87.78% (p=0.008 n=5+5) AddVW/3-8 2.08GB/s ± 0% 2.18GB/s ± 0% +4.54% (p=0.008 n=5+5) AddVW/4-8 2.46GB/s ± 0% 2.66GB/s ± 0% +8.33% (p=0.016 n=4+5) AddVW/5-8 2.76GB/s ± 0% 3.20GB/s ± 0% +16.03% (p=0.008 n=5+5) AddVW/10-8 3.63GB/s ± 0% 5.15GB/s ± 0% +41.83% (p=0.008 n=5+5) AddVW/100-8 4.79GB/s ± 0% 9.87GB/s ± 0% +106.12% (p=0.008 n=5+5) AddVW/1000-8 5.27GB/s ± 0% 12.42GB/s ± 0% +135.74% (p=0.008 n=5+5) AddVW/10000-8 5.31GB/s ± 0% 11.19GB/s ± 0% +110.71% (p=0.008 n=5+5) AddVW/100000-8 5.32GB/s ± 0% 11.32GB/s ± 0% +112.56% (p=0.008 n=5+5) SubVW/1-8 859MB/s ± 0% 864MB/s ± 0% +0.61% (p=0.008 n=5+5) SubVW/2-8 812MB/s ± 2% 1520MB/s ± 0% +87.09% (p=0.008 n=5+5) SubVW/3-8 2.08GB/s ± 0% 2.18GB/s ± 0% +4.55% (p=0.008 n=5+5) SubVW/4-8 2.46GB/s ± 0% 2.66GB/s ± 0% +8.33% (p=0.008 n=5+5) SubVW/5-8 2.75GB/s ± 0% 3.20GB/s ± 0% +16.03% (p=0.008 n=5+5) SubVW/10-8 3.63GB/s ± 0% 5.15GB/s ± 0% +41.82% (p=0.008 n=5+5) SubVW/100-8 4.79GB/s ± 0% 9.87GB/s ± 0% +106.13% (p=0.008 n=5+5) SubVW/1000-8 5.27GB/s ± 0% 12.42GB/s ± 0% +135.74% (p=0.008 n=5+5) SubVW/10000-8 5.31GB/s ± 0% 11.17GB/s ± 0% +110.44% (p=0.008 n=5+5) SubVW/100000-8 5.32GB/s ± 0% 11.31GB/s ± 0% +112.35% (p=0.008 n=5+5) AddMulVVW/1-8 1.97GB/s ± 1% 1.96GB/s ± 1% ~ (p=0.151 n=5+5) AddMulVVW/2-8 2.24GB/s ± 0% 2.25GB/s ± 0% ~ (p=0.095 n=5+5) AddMulVVW/3-8 2.11GB/s ± 0% 2.12GB/s ± 0% ~ (p=0.548 n=5+5) AddMulVVW/4-8 2.17GB/s ± 1% 2.17GB/s ± 1% ~ (p=0.548 n=5+5) AddMulVVW/5-8 2.22GB/s ± 1% 2.21GB/s ± 1% ~ (p=0.421 n=5+5) AddMulVVW/10-8 2.17GB/s ± 1% 2.16GB/s ± 0% ~ (p=0.095 n=5+5) AddMulVVW/100-8 2.35GB/s ± 0% 2.35GB/s ± 0% ~ (p=0.421 n=5+5) AddMulVVW/1000-8 2.47GB/s ± 0% 2.41GB/s ± 0% -2.09% (p=0.008 n=5+5) AddMulVVW/10000-8 2.16GB/s ± 0% 2.15GB/s ± 0% -0.23% (p=0.008 n=5+5) AddMulVVW/100000-8 2.03GB/s ± 1% 2.04GB/s ± 0% ~ (p=0.690 n=5+5) name old alloc/op new alloc/op delta FloatString/100-8 400B ± 0% 400B ± 0% ~ (all equal) FloatString/1000-8 3.22kB ± 0% 3.22kB ± 0% ~ (all equal) FloatString/10000-8 55.6kB ± 0% 55.5kB ± 0% ~ (p=0.206 n=5+5) FloatString/100000-8 627kB ± 0% 627kB ± 0% ~ (all equal) FloatAdd/10-8 0.00B 0.00B ~ (all equal) FloatAdd/100-8 0.00B 0.00B ~ (all equal) FloatAdd/1000-8 0.00B 0.00B ~ (all equal) FloatAdd/10000-8 0.00B 0.00B ~ (all equal) FloatAdd/100000-8 0.00B 0.00B ~ (all equal) FloatSub/10-8 0.00B 0.00B ~ (all equal) FloatSub/100-8 0.00B 0.00B ~ (all equal) FloatSub/1000-8 0.00B 0.00B ~ (all equal) FloatSub/10000-8 0.00B 0.00B ~ (all equal) FloatSub/100000-8 0.00B 0.00B ~ (all equal) FloatSqrt/64-8 416B ± 0% 416B ± 0% ~ (all equal) FloatSqrt/128-8 720B ± 0% 720B ± 0% ~ (all equal) FloatSqrt/256-8 816B ± 0% 816B ± 0% ~ (all equal) FloatSqrt/1000-8 2.50kB ± 0% 2.50kB ± 0% ~ (all equal) FloatSqrt/10000-8 23.5kB ± 0% 23.5kB ± 0% ~ (all equal) FloatSqrt/100000-8 251kB ± 0% 251kB ± 0% ~ (all equal) FloatSqrt/1000000-8 4.61MB ± 0% 4.61MB ± 0% ~ (all equal) name old allocs/op new allocs/op delta FloatString/100-8 8.00 ± 0% 8.00 ± 0% ~ (all equal) FloatString/1000-8 10.0 ± 0% 10.0 ± 0% ~ (all equal) FloatString/10000-8 42.0 ± 0% 42.0 ± 0% ~ (all equal) FloatString/100000-8 346 ± 0% 346 ± 0% ~ (all equal) FloatAdd/10-8 0.00 0.00 ~ (all equal) FloatAdd/100-8 0.00 0.00 ~ (all equal) FloatAdd/1000-8 0.00 0.00 ~ (all equal) FloatAdd/10000-8 0.00 0.00 ~ (all equal) FloatAdd/100000-8 0.00 0.00 ~ (all equal) FloatSub/10-8 0.00 0.00 ~ (all equal) FloatSub/100-8 0.00 0.00 ~ (all equal) FloatSub/1000-8 0.00 0.00 ~ (all equal) FloatSub/10000-8 0.00 0.00 ~ (all equal) FloatSub/100000-8 0.00 0.00 ~ (all equal) FloatSqrt/64-8 9.00 ± 0% 9.00 ± 0% ~ (all equal) FloatSqrt/128-8 13.0 ± 0% 13.0 ± 0% ~ (all equal) FloatSqrt/256-8 12.0 ± 0% 12.0 ± 0% ~ (all equal) FloatSqrt/1000-8 19.0 ± 0% 19.0 ± 0% ~ (all equal) FloatSqrt/10000-8 35.0 ± 0% 35.0 ± 0% ~ (all equal) FloatSqrt/100000-8 55.0 ± 0% 55.0 ± 0% ~ (all equal) FloatSqrt/1000000-8 122 ± 0% 122 ± 0% ~ (all equal) Change-Id: I6888d84c037d91f9e2199f3492ea3f6a0ed77b24 Reviewed-on: https://go-review.googlesource.com/77832Reviewed-by: Vlad Krasnov <vlad@cloudflare.com> Reviewed-by: Cherry Zhang <cherryyz@google.com> Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>
0585d41c