• wei xiao's avatar
    runtime: IndexByte and memclr perf improvements on arm64 · d259815c
    wei xiao authored
    Update runtime asm_arm64.s and memclr_arm64.s to improve performance by using
    SIMD instructions to do more in parallel. It shows improvement on bytes, html
    and go1 benchmarks (particualrly regexp, which uses IndexByte frequently).
    
    Benchmark results of bytes:
    
    name                     old time/op   new time/op    delta
    IndexByte/10-8            28.5ns ± 0%    19.5ns ± 0%   -31.58%  (p=0.000 n=10+10)
    IndexByte/32-8            52.6ns ± 0%    19.0ns ± 0%   -63.88%  (p=0.000 n=10+10)
    IndexByte/4K-8            4.12µs ± 0%    0.49µs ± 0%   -88.16%  (p=0.000 n=10+10)
    IndexByte/4M-8            4.29ms ± 1%    0.70ms ±26%   -83.65%  (p=0.000 n=10+10)
    IndexByte/64M-8           69.7ms ± 0%    16.0ms ± 0%   -76.97%  (p=0.000 n=9+10)
    IndexBytePortable/10-8    34.0ns ± 0%    34.0ns ± 0%      ~     (all equal)
    IndexBytePortable/32-8    66.1ns ± 0%    66.1ns ± 0%      ~     (p=0.471 n=9+9)
    IndexBytePortable/4K-8    6.17µs ± 0%    6.17µs ± 0%      ~     (all equal)
    IndexBytePortable/4M-8    6.33ms ± 0%    6.35ms ± 0%    +0.21%  (p=0.002 n=10+9)
    IndexBytePortable/64M-8    103ms ± 0%     103ms ± 0%    +0.01%  (p=0.017 n=9+10)
    
    name                     old speed     new speed      delta
    IndexByte/10-8           351MB/s ± 0%   512MB/s ± 0%   +46.14%  (p=0.000 n=9+10)
    IndexByte/32-8           609MB/s ± 0%  1683MB/s ± 0%  +176.40%  (p=0.000 n=10+10)
    IndexByte/4K-8           994MB/s ± 0%  8378MB/s ± 0%  +742.75%  (p=0.000 n=10+10)
    IndexByte/4M-8           977MB/s ± 1%  6149MB/s ±32%  +529.29%  (p=0.000 n=10+10)
    IndexByte/64M-8          963MB/s ± 0%  4182MB/s ± 0%  +334.29%  (p=0.000 n=9+10)
    IndexBytePortable/10-8   294MB/s ± 0%   294MB/s ± 0%    +0.17%  (p=0.000 n=8+8)
    IndexBytePortable/32-8   484MB/s ± 0%   484MB/s ± 0%      ~     (p=0.877 n=9+9)
    IndexBytePortable/4K-8   664MB/s ± 0%   664MB/s ± 0%      ~     (p=0.242 n=8+9)
    IndexBytePortable/4M-8   662MB/s ± 0%   661MB/s ± 0%    -0.21%  (p=0.002 n=10+9)
    IndexBytePortable/64M-8  652MB/s ± 0%   652MB/s ± 0%      ~     (p=0.065 n=10+10)
    
    Benchmark results of html:
    
    name              old time/op  new time/op  delta
    Escape-8          62.0µs ± 1%  61.0µs ± 1%   -1.69%  (p=0.000 n=9+10)
    EscapeNone-8      10.2µs ± 0%  10.2µs ± 0%   -0.09%  (p=0.022 n=9+10)
    Unescape-8        71.9µs ± 0%  68.7µs ± 0%   -4.35%  (p=0.000 n=10+10)
    UnescapeNone-8    4.03µs ± 0%  0.48µs ± 0%  -88.08%  (p=0.000 n=10+10)
    UnescapeSparse-8  10.7µs ± 2%   7.1µs ± 3%  -33.91%  (p=0.000 n=10+10)
    UnescapeDense-8   53.2µs ± 1%  53.5µs ± 1%     ~     (p=0.143 n=10+10)
    
    Benchmark results of go1:
    
    name                     old time/op    new time/op    delta
    BinaryTree17-8              6.53s ± 0%     6.48s ± 2%      ~     (p=0.190 n=4+5)
    Fannkuch11-8                6.35s ± 1%     6.35s ± 0%      ~     (p=1.000 n=5+5)
    FmtFprintfEmpty-8           108ns ± 1%     101ns ± 2%    -6.32%  (p=0.008 n=5+5)
    FmtFprintfString-8          172ns ± 1%     182ns ± 2%    +5.70%  (p=0.008 n=5+5)
    FmtFprintfInt-8             207ns ± 0%     207ns ± 0%      ~     (p=0.444 n=5+5)
    FmtFprintfIntInt-8          277ns ± 1%     276ns ± 1%      ~     (p=0.873 n=5+5)
    FmtFprintfPrefixedInt-8     386ns ± 0%     382ns ± 1%    -1.04%  (p=0.024 n=5+5)
    FmtFprintfFloat-8           492ns ± 0%     492ns ± 1%      ~     (p=0.571 n=4+5)
    FmtManyArgs-8              1.32µs ± 1%    1.33µs ± 0%      ~     (p=0.087 n=5+5)
    GobDecode-8                16.8ms ± 2%    16.7ms ± 1%      ~     (p=1.000 n=5+5)
    GobEncode-8                14.1ms ± 1%    14.0ms ± 1%      ~     (p=0.056 n=5+5)
    Gzip-8                      788ms ± 0%     802ms ± 0%    +1.71%  (p=0.008 n=5+5)
    Gunzip-8                   83.6ms ± 0%    83.9ms ± 0%    +0.40%  (p=0.008 n=5+5)
    HTTPClientServer-8          120µs ± 0%     120µs ± 1%      ~     (p=0.548 n=5+5)
    JSONEncode-8               33.2ms ± 0%    33.0ms ± 1%    -0.71%  (p=0.008 n=5+5)
    JSONDecode-8                152ms ± 1%     152ms ± 1%      ~     (p=1.000 n=5+5)
    Mandelbrot200-8            10.0ms ± 0%    10.0ms ± 0%    -0.05%  (p=0.008 n=5+5)
    GoParse-8                  7.97ms ± 0%    7.98ms ± 0%      ~     (p=0.690 n=5+5)
    RegexpMatchEasy0_32-8       233ns ± 1%     206ns ± 0%   -11.44%  (p=0.016 n=5+4)
    RegexpMatchEasy0_1K-8      1.86µs ± 0%    0.77µs ± 1%   -58.54%  (p=0.008 n=5+5)
    RegexpMatchEasy1_32-8       250ns ± 0%     205ns ± 0%   -18.07%  (p=0.008 n=5+5)
    RegexpMatchEasy1_1K-8      2.28µs ± 0%    1.11µs ± 0%   -51.09%  (p=0.029 n=4+4)
    RegexpMatchMedium_32-8      332ns ± 1%     301ns ± 2%    -9.45%  (p=0.008 n=5+5)
    RegexpMatchMedium_1K-8     85.5µs ± 2%    78.8µs ± 0%    -7.83%  (p=0.008 n=5+5)
    RegexpMatchHard_32-8       4.34µs ± 1%    4.27µs ± 0%    -1.49%  (p=0.008 n=5+5)
    RegexpMatchHard_1K-8        130µs ± 1%     127µs ± 0%    -2.53%  (p=0.008 n=5+5)
    Revcomp-8                   1.35s ± 1%     1.13s ± 1%   -16.17%  (p=0.008 n=5+5)
    Template-8                  160ms ± 2%     162ms ± 2%      ~     (p=0.222 n=5+5)
    TimeParse-8                 795ns ± 2%     778ns ± 1%      ~     (p=0.095 n=5+5)
    TimeFormat-8                782ns ± 0%     786ns ± 1%    +0.59%  (p=0.040 n=5+5)
    
    name                     old speed      new speed      delta
    GobDecode-8              45.8MB/s ± 2%  45.9MB/s ± 1%      ~     (p=1.000 n=5+5)
    GobEncode-8              54.3MB/s ± 1%  55.0MB/s ± 1%      ~     (p=0.056 n=5+5)
    Gzip-8                   24.6MB/s ± 0%  24.2MB/s ± 0%    -1.69%  (p=0.008 n=5+5)
    Gunzip-8                  232MB/s ± 0%   231MB/s ± 0%    -0.40%  (p=0.008 n=5+5)
    JSONEncode-8             58.4MB/s ± 0%  58.8MB/s ± 1%    +0.71%  (p=0.008 n=5+5)
    JSONDecode-8             12.8MB/s ± 1%  12.8MB/s ± 1%      ~     (p=1.000 n=5+5)
    GoParse-8                7.27MB/s ± 0%  7.26MB/s ± 0%      ~     (p=0.762 n=5+5)
    RegexpMatchEasy0_32-8     137MB/s ± 1%   155MB/s ± 0%   +12.93%  (p=0.008 n=5+5)
    RegexpMatchEasy0_1K-8     551MB/s ± 0%  1329MB/s ± 1%  +141.11%  (p=0.008 n=5+5)
    RegexpMatchEasy1_32-8     128MB/s ± 0%   156MB/s ± 0%   +22.00%  (p=0.008 n=5+5)
    RegexpMatchEasy1_1K-8     449MB/s ± 0%   920MB/s ± 0%  +104.68%  (p=0.016 n=4+5)
    RegexpMatchMedium_32-8   3.00MB/s ± 0%  3.32MB/s ± 2%   +10.60%  (p=0.016 n=4+5)
    RegexpMatchMedium_1K-8   12.0MB/s ± 2%  13.0MB/s ± 0%    +8.48%  (p=0.008 n=5+5)
    RegexpMatchHard_32-8     7.38MB/s ± 1%  7.49MB/s ± 0%    +1.49%  (p=0.008 n=5+5)
    RegexpMatchHard_1K-8     7.88MB/s ± 1%  8.08MB/s ± 0%    +2.59%  (p=0.008 n=5+5)
    Revcomp-8                 188MB/s ± 1%   224MB/s ± 1%   +19.29%  (p=0.008 n=5+5)
    Template-8               12.2MB/s ± 2%  12.0MB/s ± 2%      ~     (p=0.206 n=5+5)
    
    Change-Id: I94116620a287d173a6f60510684362e500f54887
    Reviewed-on: https://go-review.googlesource.com/33597
    Run-TryBot: Cherry Zhang <cherryyz@google.com>
    TryBot-Result: Gobot Gobot <gobot@golang.org>
    Reviewed-by: 's avatarCherry Zhang <cherryyz@google.com>
    d259815c
memclr_arm64.s 1.18 KB