• Wei Xiao's avatar
    bytes: add optimized Equal for arm64 · 78ddf274
    Wei Xiao authored
    Use SIMD instructions when comparing chunks bigger than 16 bytes.
    Benchmark results of bytes:
    
    name                 old time/op    new time/op    delta
    Equal/0-8              6.52ns ± 1%    5.51ns ± 0%   -15.43%  (p=0.000 n=8+9)
    Equal/1-8              11.5ns ± 0%    10.5ns ± 0%    -8.70%  (p=0.000 n=10+10)
    Equal/6-8              19.0ns ± 0%    13.5ns ± 0%   -28.95%  (p=0.000 n=10+10)
    Equal/9-8              31.0ns ± 0%    13.5ns ± 0%   -56.45%  (p=0.000 n=10+10)
    Equal/15-8             40.0ns ± 0%    15.5ns ± 0%   -61.25%  (p=0.000 n=10+10)
    Equal/16-8             41.5ns ± 0%    14.5ns ± 0%   -65.06%  (p=0.000 n=10+10)
    Equal/20-8             47.5ns ± 0%    17.0ns ± 0%   -64.21%  (p=0.000 n=10+10)
    Equal/32-8             65.6ns ± 0%    17.0ns ± 0%   -74.09%  (p=0.000 n=10+10)
    Equal/4K-8             6.17µs ± 0%    0.57µs ± 1%   -90.76%  (p=0.000 n=10+10)
    Equal/4M-8             6.41ms ± 0%    1.11ms ±14%   -82.71%  (p=0.000 n=8+10)
    Equal/64M-8             104ms ± 0%      33ms ± 0%   -68.64%  (p=0.000 n=10+10)
    EqualPort/1-8          13.0ns ± 0%    13.0ns ± 0%      ~     (all equal)
    EqualPort/6-8          22.0ns ± 0%    22.7ns ± 0%    +3.06%  (p=0.000 n=8+9)
    EqualPort/32-8         78.1ns ± 0%    78.1ns ± 0%      ~     (all equal)
    EqualPort/4K-8         7.54µs ± 0%    7.61µs ± 0%    +0.92%  (p=0.000 n=10+8)
    EqualPort/4M-8         8.16ms ± 2%    8.05ms ± 1%    -1.31%  (p=0.023 n=10+10)
    EqualPort/64M-8         142ms ± 0%     142ms ± 0%    +0.37%  (p=0.000 n=10+10)
    CompareBytesEqual-8    39.0ns ± 0%    41.6ns ± 2%    +6.67%  (p=0.000 n=9+10)
    
    name                 old speed      new speed      delta
    Equal/1-8            86.9MB/s ± 0%  95.2MB/s ± 0%    +9.53%  (p=0.000 n=8+8)
    Equal/6-8             315MB/s ± 0%   444MB/s ± 0%   +40.74%  (p=0.000 n=9+10)
    Equal/9-8             290MB/s ± 0%   666MB/s ± 0%  +129.63%  (p=0.000 n=8+10)
    Equal/15-8            375MB/s ± 0%   967MB/s ± 0%  +158.09%  (p=0.000 n=10+10)
    Equal/16-8            385MB/s ± 0%  1103MB/s ± 0%  +186.24%  (p=0.000 n=10+9)
    Equal/20-8            421MB/s ± 0%  1175MB/s ± 0%  +179.44%  (p=0.000 n=9+10)
    Equal/32-8            488MB/s ± 0%  1881MB/s ± 0%  +285.34%  (p=0.000 n=10+8)
    Equal/4K-8            664MB/s ± 0%  7181MB/s ± 1%  +981.32%  (p=0.000 n=10+10)
    Equal/4M-8            654MB/s ± 0%  3822MB/s ±16%  +484.15%  (p=0.000 n=8+10)
    Equal/64M-8           645MB/s ± 0%  2056MB/s ± 0%  +218.90%  (p=0.000 n=10+10)
    EqualPort/1-8        76.8MB/s ± 0%  76.7MB/s ± 0%    -0.09%  (p=0.023 n=10+10)
    EqualPort/6-8         272MB/s ± 0%   264MB/s ± 0%    -2.94%  (p=0.000 n=8+10)
    EqualPort/32-8        410MB/s ± 0%   410MB/s ± 0%    +0.01%  (p=0.004 n=9+10)
    EqualPort/4K-8        543MB/s ± 0%   538MB/s ± 0%    -0.91%  (p=0.000 n=9+9)
    EqualPort/4M-8        514MB/s ± 2%   521MB/s ± 1%    +1.31%  (p=0.023 n=10+10)
    EqualPort/64M-8       473MB/s ± 0%   472MB/s ± 0%    -0.37%  (p=0.000 n=10+10)
    
    Benchmark results of go1:
    
    name                     old time/op    new time/op    delta
    BinaryTree17-8              6.53s ± 0%     6.52s ± 2%    ~     (p=0.286 n=4+5)
    Fannkuch11-8                6.35s ± 1%     6.33s ± 0%    ~     (p=0.690 n=5+5)
    FmtFprintfEmpty-8           108ns ± 1%      99ns ± 1%  -8.31%  (p=0.008 n=5+5)
    FmtFprintfString-8          172ns ± 1%     188ns ± 0%  +9.43%  (p=0.016 n=5+4)
    FmtFprintfInt-8             207ns ± 0%     202ns ± 0%  -2.42%  (p=0.008 n=5+5)
    FmtFprintfIntInt-8          277ns ± 1%     271ns ± 1%  -2.02%  (p=0.008 n=5+5)
    FmtFprintfPrefixedInt-8     386ns ± 0%     380ns ± 0%  -1.55%  (p=0.008 n=5+5)
    FmtFprintfFloat-8           492ns ± 0%     494ns ± 1%    ~     (p=0.175 n=4+5)
    FmtManyArgs-8              1.32µs ± 1%    1.31µs ± 2%    ~     (p=0.651 n=5+5)
    GobDecode-8                16.8ms ± 2%    16.9ms ± 1%    ~     (p=0.310 n=5+5)
    GobEncode-8                14.1ms ± 1%    14.1ms ± 1%    ~     (p=1.000 n=5+5)
    Gzip-8                      788ms ± 0%     789ms ± 0%    ~     (p=0.548 n=5+5)
    Gunzip-8                   83.6ms ± 0%    83.6ms ± 0%    ~     (p=0.548 n=5+5)
    HTTPClientServer-8          120µs ± 0%     120µs ± 1%    ~     (p=0.690 n=5+5)
    JSONEncode-8               33.2ms ± 0%    33.6ms ± 0%  +1.20%  (p=0.008 n=5+5)
    JSONDecode-8                152ms ± 1%     146ms ± 1%  -3.70%  (p=0.008 n=5+5)
    Mandelbrot200-8            10.0ms ± 0%    10.0ms ± 0%    ~     (p=0.151 n=5+5)
    GoParse-8                  7.97ms ± 0%    8.06ms ± 0%  +1.15%  (p=0.008 n=5+5)
    RegexpMatchEasy0_32-8       233ns ± 1%     239ns ± 4%    ~     (p=0.135 n=5+5)
    RegexpMatchEasy0_1K-8      1.86µs ± 0%    1.86µs ± 0%    ~     (p=0.167 n=5+5)
    RegexpMatchEasy1_32-8       250ns ± 0%     263ns ± 1%  +5.28%  (p=0.008 n=5+5)
    RegexpMatchEasy1_1K-8      2.28µs ± 0%    2.13µs ± 0%  -6.64%  (p=0.000 n=4+5)
    RegexpMatchMedium_32-8      332ns ± 1%     319ns ± 0%  -3.97%  (p=0.008 n=5+5)
    RegexpMatchMedium_1K-8     85.5µs ± 2%    79.1µs ± 1%  -7.42%  (p=0.008 n=5+5)
    RegexpMatchHard_32-8       4.34µs ± 1%    4.42µs ± 7%    ~     (p=0.881 n=5+5)
    RegexpMatchHard_1K-8        130µs ± 1%     127µs ± 0%  -2.18%  (p=0.008 n=5+5)
    Revcomp-8                   1.35s ± 1%     1.34s ± 0%  -0.58%  (p=0.016 n=5+4)
    Template-8                  160ms ± 2%     158ms ± 1%    ~     (p=0.222 n=5+5)
    TimeParse-8                 795ns ± 2%     772ns ± 2%  -2.87%  (p=0.024 n=5+5)
    TimeFormat-8                782ns ± 0%     784ns ± 0%    ~     (p=0.198 n=5+5)
    
    name                     old speed      new speed      delta
    GobDecode-8              45.8MB/s ± 2%  45.5MB/s ± 1%    ~     (p=0.310 n=5+5)
    GobEncode-8              54.3MB/s ± 1%  54.4MB/s ± 1%    ~     (p=0.984 n=5+5)
    Gzip-8                   24.6MB/s ± 0%  24.6MB/s ± 0%    ~     (p=0.540 n=5+5)
    Gunzip-8                  232MB/s ± 0%   232MB/s ± 0%    ~     (p=0.548 n=5+5)
    JSONEncode-8             58.4MB/s ± 0%  57.7MB/s ± 0%  -1.19%  (p=0.008 n=5+5)
    JSONDecode-8             12.8MB/s ± 1%  13.3MB/s ± 1%  +3.85%  (p=0.008 n=5+5)
    GoParse-8                7.27MB/s ± 0%  7.18MB/s ± 0%  -1.13%  (p=0.008 n=5+5)
    RegexpMatchEasy0_32-8     137MB/s ± 1%   134MB/s ± 4%    ~     (p=0.151 n=5+5)
    RegexpMatchEasy0_1K-8     551MB/s ± 0%   550MB/s ± 0%    ~     (p=0.222 n=5+5)
    RegexpMatchEasy1_32-8     128MB/s ± 0%   121MB/s ± 1%  -5.09%  (p=0.008 n=5+5)
    RegexpMatchEasy1_1K-8     449MB/s ± 0%   481MB/s ± 0%  +7.12%  (p=0.016 n=4+5)
    RegexpMatchMedium_32-8   3.00MB/s ± 0%  3.13MB/s ± 0%  +4.33%  (p=0.016 n=4+5)
    RegexpMatchMedium_1K-8   12.0MB/s ± 2%  12.9MB/s ± 1%  +7.98%  (p=0.008 n=5+5)
    RegexpMatchHard_32-8     7.38MB/s ± 1%  7.25MB/s ± 7%    ~     (p=0.952 n=5+5)
    RegexpMatchHard_1K-8     7.88MB/s ± 1%  8.05MB/s ± 0%  +2.21%  (p=0.008 n=5+5)
    Revcomp-8                 188MB/s ± 1%   189MB/s ± 0%  +0.58%  (p=0.016 n=5+4)
    Template-8               12.2MB/s ± 2%  12.3MB/s ± 1%    ~     (p=0.183 n=5+5)
    
    Change-Id: I65e79f3f8f8b2914678311c4f1b0a2d98459e220
    Reviewed-on: https://go-review.googlesource.com/71110Reviewed-by: 's avatarCherry Zhang <cherryyz@google.com>
    Run-TryBot: Cherry Zhang <cherryyz@google.com>
    78ddf274
asm_arm64.s 24.1 KB