• wei xiao's avatar
    runtime: IndexByte and memclr perf improvements on arm64 · d259815c
    wei xiao authored
    Update runtime asm_arm64.s and memclr_arm64.s to improve performance by using
    SIMD instructions to do more in parallel. It shows improvement on bytes, html
    and go1 benchmarks (particualrly regexp, which uses IndexByte frequently).
    
    Benchmark results of bytes:
    
    name                     old time/op   new time/op    delta
    IndexByte/10-8            28.5ns ± 0%    19.5ns ± 0%   -31.58%  (p=0.000 n=10+10)
    IndexByte/32-8            52.6ns ± 0%    19.0ns ± 0%   -63.88%  (p=0.000 n=10+10)
    IndexByte/4K-8            4.12µs ± 0%    0.49µs ± 0%   -88.16%  (p=0.000 n=10+10)
    IndexByte/4M-8            4.29ms ± 1%    0.70ms ±26%   -83.65%  (p=0.000 n=10+10)
    IndexByte/64M-8           69.7ms ± 0%    16.0ms ± 0%   -76.97%  (p=0.000 n=9+10)
    IndexBytePortable/10-8    34.0ns ± 0%    34.0ns ± 0%      ~     (all equal)
    IndexBytePortable/32-8    66.1ns ± 0%    66.1ns ± 0%      ~     (p=0.471 n=9+9)
    IndexBytePortable/4K-8    6.17µs ± 0%    6.17µs ± 0%      ~     (all equal)
    IndexBytePortable/4M-8    6.33ms ± 0%    6.35ms ± 0%    +0.21%  (p=0.002 n=10+9)
    IndexBytePortable/64M-8    103ms ± 0%     103ms ± 0%    +0.01%  (p=0.017 n=9+10)
    
    name                     old speed     new speed      delta
    IndexByte/10-8           351MB/s ± 0%   512MB/s ± 0%   +46.14%  (p=0.000 n=9+10)
    IndexByte/32-8           609MB/s ± 0%  1683MB/s ± 0%  +176.40%  (p=0.000 n=10+10)
    IndexByte/4K-8           994MB/s ± 0%  8378MB/s ± 0%  +742.75%  (p=0.000 n=10+10)
    IndexByte/4M-8           977MB/s ± 1%  6149MB/s ±32%  +529.29%  (p=0.000 n=10+10)
    IndexByte/64M-8          963MB/s ± 0%  4182MB/s ± 0%  +334.29%  (p=0.000 n=9+10)
    IndexBytePortable/10-8   294MB/s ± 0%   294MB/s ± 0%    +0.17%  (p=0.000 n=8+8)
    IndexBytePortable/32-8   484MB/s ± 0%   484MB/s ± 0%      ~     (p=0.877 n=9+9)
    IndexBytePortable/4K-8   664MB/s ± 0%   664MB/s ± 0%      ~     (p=0.242 n=8+9)
    IndexBytePortable/4M-8   662MB/s ± 0%   661MB/s ± 0%    -0.21%  (p=0.002 n=10+9)
    IndexBytePortable/64M-8  652MB/s ± 0%   652MB/s ± 0%      ~     (p=0.065 n=10+10)
    
    Benchmark results of html:
    
    name              old time/op  new time/op  delta
    Escape-8          62.0µs ± 1%  61.0µs ± 1%   -1.69%  (p=0.000 n=9+10)
    EscapeNone-8      10.2µs ± 0%  10.2µs ± 0%   -0.09%  (p=0.022 n=9+10)
    Unescape-8        71.9µs ± 0%  68.7µs ± 0%   -4.35%  (p=0.000 n=10+10)
    UnescapeNone-8    4.03µs ± 0%  0.48µs ± 0%  -88.08%  (p=0.000 n=10+10)
    UnescapeSparse-8  10.7µs ± 2%   7.1µs ± 3%  -33.91%  (p=0.000 n=10+10)
    UnescapeDense-8   53.2µs ± 1%  53.5µs ± 1%     ~     (p=0.143 n=10+10)
    
    Benchmark results of go1:
    
    name                     old time/op    new time/op    delta
    BinaryTree17-8              6.53s ± 0%     6.48s ± 2%      ~     (p=0.190 n=4+5)
    Fannkuch11-8                6.35s ± 1%     6.35s ± 0%      ~     (p=1.000 n=5+5)
    FmtFprintfEmpty-8           108ns ± 1%     101ns ± 2%    -6.32%  (p=0.008 n=5+5)
    FmtFprintfString-8          172ns ± 1%     182ns ± 2%    +5.70%  (p=0.008 n=5+5)
    FmtFprintfInt-8             207ns ± 0%     207ns ± 0%      ~     (p=0.444 n=5+5)
    FmtFprintfIntInt-8          277ns ± 1%     276ns ± 1%      ~     (p=0.873 n=5+5)
    FmtFprintfPrefixedInt-8     386ns ± 0%     382ns ± 1%    -1.04%  (p=0.024 n=5+5)
    FmtFprintfFloat-8           492ns ± 0%     492ns ± 1%      ~     (p=0.571 n=4+5)
    FmtManyArgs-8              1.32µs ± 1%    1.33µs ± 0%      ~     (p=0.087 n=5+5)
    GobDecode-8                16.8ms ± 2%    16.7ms ± 1%      ~     (p=1.000 n=5+5)
    GobEncode-8                14.1ms ± 1%    14.0ms ± 1%      ~     (p=0.056 n=5+5)
    Gzip-8                      788ms ± 0%     802ms ± 0%    +1.71%  (p=0.008 n=5+5)
    Gunzip-8                   83.6ms ± 0%    83.9ms ± 0%    +0.40%  (p=0.008 n=5+5)
    HTTPClientServer-8          120µs ± 0%     120µs ± 1%      ~     (p=0.548 n=5+5)
    JSONEncode-8               33.2ms ± 0%    33.0ms ± 1%    -0.71%  (p=0.008 n=5+5)
    JSONDecode-8                152ms ± 1%     152ms ± 1%      ~     (p=1.000 n=5+5)
    Mandelbrot200-8            10.0ms ± 0%    10.0ms ± 0%    -0.05%  (p=0.008 n=5+5)
    GoParse-8                  7.97ms ± 0%    7.98ms ± 0%      ~     (p=0.690 n=5+5)
    RegexpMatchEasy0_32-8       233ns ± 1%     206ns ± 0%   -11.44%  (p=0.016 n=5+4)
    RegexpMatchEasy0_1K-8      1.86µs ± 0%    0.77µs ± 1%   -58.54%  (p=0.008 n=5+5)
    RegexpMatchEasy1_32-8       250ns ± 0%     205ns ± 0%   -18.07%  (p=0.008 n=5+5)
    RegexpMatchEasy1_1K-8      2.28µs ± 0%    1.11µs ± 0%   -51.09%  (p=0.029 n=4+4)
    RegexpMatchMedium_32-8      332ns ± 1%     301ns ± 2%    -9.45%  (p=0.008 n=5+5)
    RegexpMatchMedium_1K-8     85.5µs ± 2%    78.8µs ± 0%    -7.83%  (p=0.008 n=5+5)
    RegexpMatchHard_32-8       4.34µs ± 1%    4.27µs ± 0%    -1.49%  (p=0.008 n=5+5)
    RegexpMatchHard_1K-8        130µs ± 1%     127µs ± 0%    -2.53%  (p=0.008 n=5+5)
    Revcomp-8                   1.35s ± 1%     1.13s ± 1%   -16.17%  (p=0.008 n=5+5)
    Template-8                  160ms ± 2%     162ms ± 2%      ~     (p=0.222 n=5+5)
    TimeParse-8                 795ns ± 2%     778ns ± 1%      ~     (p=0.095 n=5+5)
    TimeFormat-8                782ns ± 0%     786ns ± 1%    +0.59%  (p=0.040 n=5+5)
    
    name                     old speed      new speed      delta
    GobDecode-8              45.8MB/s ± 2%  45.9MB/s ± 1%      ~     (p=1.000 n=5+5)
    GobEncode-8              54.3MB/s ± 1%  55.0MB/s ± 1%      ~     (p=0.056 n=5+5)
    Gzip-8                   24.6MB/s ± 0%  24.2MB/s ± 0%    -1.69%  (p=0.008 n=5+5)
    Gunzip-8                  232MB/s ± 0%   231MB/s ± 0%    -0.40%  (p=0.008 n=5+5)
    JSONEncode-8             58.4MB/s ± 0%  58.8MB/s ± 1%    +0.71%  (p=0.008 n=5+5)
    JSONDecode-8             12.8MB/s ± 1%  12.8MB/s ± 1%      ~     (p=1.000 n=5+5)
    GoParse-8                7.27MB/s ± 0%  7.26MB/s ± 0%      ~     (p=0.762 n=5+5)
    RegexpMatchEasy0_32-8     137MB/s ± 1%   155MB/s ± 0%   +12.93%  (p=0.008 n=5+5)
    RegexpMatchEasy0_1K-8     551MB/s ± 0%  1329MB/s ± 1%  +141.11%  (p=0.008 n=5+5)
    RegexpMatchEasy1_32-8     128MB/s ± 0%   156MB/s ± 0%   +22.00%  (p=0.008 n=5+5)
    RegexpMatchEasy1_1K-8     449MB/s ± 0%   920MB/s ± 0%  +104.68%  (p=0.016 n=4+5)
    RegexpMatchMedium_32-8   3.00MB/s ± 0%  3.32MB/s ± 2%   +10.60%  (p=0.016 n=4+5)
    RegexpMatchMedium_1K-8   12.0MB/s ± 2%  13.0MB/s ± 0%    +8.48%  (p=0.008 n=5+5)
    RegexpMatchHard_32-8     7.38MB/s ± 1%  7.49MB/s ± 0%    +1.49%  (p=0.008 n=5+5)
    RegexpMatchHard_1K-8     7.88MB/s ± 1%  8.08MB/s ± 0%    +2.59%  (p=0.008 n=5+5)
    Revcomp-8                 188MB/s ± 1%   224MB/s ± 1%   +19.29%  (p=0.008 n=5+5)
    Template-8               12.2MB/s ± 2%  12.0MB/s ± 2%      ~     (p=0.206 n=5+5)
    
    Change-Id: I94116620a287d173a6f60510684362e500f54887
    Reviewed-on: https://go-review.googlesource.com/33597
    Run-TryBot: Cherry Zhang <cherryyz@google.com>
    TryBot-Result: Gobot Gobot <gobot@golang.org>
    Reviewed-by: 's avatarCherry Zhang <cherryyz@google.com>
    d259815c
Name
Last commit
Last update
.github Loading commit data...
api Loading commit data...
doc Loading commit data...
lib/time Loading commit data...
misc Loading commit data...
src Loading commit data...
test Loading commit data...
.gitattributes Loading commit data...
.gitignore Loading commit data...
AUTHORS Loading commit data...
CONTRIBUTING.md Loading commit data...
CONTRIBUTORS Loading commit data...
LICENSE Loading commit data...
PATENTS Loading commit data...
README.md Loading commit data...
favicon.ico Loading commit data...
robots.txt Loading commit data...