    cmd/compile/internal/ssa: optimize arm64 with FNMULS/FNMULD · ebb77aa8
    Ben Shi authored
    FNMULS and FNMULD are efficient arm64 instructions that can be used
    to improve FP performance. This CL uses them to optimize pairs of
    neg-mul operations.
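
    For illustration only, here is a minimal sketch of the kind of Go source
    that produces a neg-mul pair. The function names are hypothetical, and
    which exact shapes the new rewrite rules match is an assumption, not
    something stated in this message. Both float64 forms give the same result
    under the default round-to-nearest mode, which is why a single negated
    multiply could stand in for either.

```go
package main

import "fmt"

// negOfProduct negates a product: -(x * y).
// Assumed to be a candidate for a single FNMULD on arm64.
func negOfProduct(x, y float64) float64 {
	return -(x * y)
}

// productOfNeg multiplies a negated operand: (-x) * y.
// Numerically equal to negOfProduct for the same inputs.
func productOfNeg(x, y float64) float64 {
	return (-x) * y
}

// negOfProduct32 is the float32 counterpart (assumed candidate for FNMULS).
func negOfProduct32(x, y float32) float32 {
	return -(x * y)
}

func main() {
	fmt.Println(negOfProduct(2, 3), productOfNeg(2, 3), negOfProduct32(2, 3))
}
```

    Building such a file with GOARCH=arm64 and inspecting the output of
    "go tool compile -S" is one way to check which instruction is emitted.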
    
    Here are benchmark test results on Raspberry Pi 3 with ArchLinux.
    
    1. A special test case gets about 15% improvement.
    (https://github.com/benshi001/ugo1/blob/master/fpmul_test.go)
    FPMul-4                     485µs ± 0%     410µs ± 0%  -15.49%  (p=0.000 n=26+23)
    
    2. There is little regression in the go1 benchmark (excluding noise).
    name                     old time/op    new time/op    delta
    BinaryTree17-4              42.0s ± 3%     42.1s ± 2%    ~     (p=0.542 n=39+40)
    Fannkuch11-4                33.3s ± 3%     32.9s ± 1%    ~     (p=0.200 n=40+32)
    FmtFprintfEmpty-4           534ns ± 0%     534ns ± 0%    ~     (all equal)
    FmtFprintfString-4         1.09µs ± 1%    1.09µs ± 0%    ~     (p=0.950 n=32+32)
    FmtFprintfInt-4            1.14µs ± 0%    1.14µs ± 1%    ~     (p=0.571 n=32+31)
    FmtFprintfIntInt-4         1.79µs ± 3%    1.76µs ± 0%  -1.42%  (p=0.004 n=40+34)
    FmtFprintfPrefixedInt-4    2.17µs ± 0%    2.17µs ± 0%    ~     (p=0.073 n=31+34)
    FmtFprintfFloat-4          3.33µs ± 3%    3.28µs ± 0%  -1.46%  (p=0.001 n=40+34)
    FmtManyArgs-4              7.28µs ± 6%    7.19µs ± 0%    ~     (p=0.641 n=40+33)
    GobDecode-4                96.5ms ± 4%    96.5ms ± 9%    ~     (p=0.214 n=40+40)
    GobEncode-4                79.5ms ± 0%    80.7ms ± 4%  +1.51%  (p=0.000 n=34+40)
    Gzip-4                      4.53s ± 4%     4.56s ± 4%  +0.60%  (p=0.000 n=40+40)
    Gunzip-4                    451ms ± 3%     442ms ± 0%  -1.93%  (p=0.000 n=40+32)
    HTTPClientServer-4          530µs ± 1%     535µs ± 1%  +0.88%  (p=0.000 n=39+39)
    JSONEncode-4                214ms ± 4%     211ms ± 0%    ~     (p=0.059 n=40+31)
    JSONDecode-4                865ms ± 5%     864ms ± 4%  -0.06%  (p=0.003 n=40+40)
    Mandelbrot200-4            52.0ms ± 3%    52.1ms ± 3%    ~     (p=0.556 n=40+40)
    GoParse-4                  43.1ms ± 8%    42.1ms ± 0%    ~     (p=0.083 n=40+33)
    RegexpMatchEasy0_32-4      1.02µs ± 3%    1.02µs ± 4%  +0.06%  (p=0.020 n=40+40)
    RegexpMatchEasy0_1K-4      3.90µs ± 0%    3.96µs ± 3%  +1.58%  (p=0.000 n=31+40)
    RegexpMatchEasy1_32-4       967ns ± 4%     981ns ± 3%  +1.40%  (p=0.000 n=40+40)
    RegexpMatchEasy1_1K-4      6.41µs ± 4%    6.43µs ± 3%    ~     (p=0.386 n=40+40)
    RegexpMatchMedium_32-4     1.76µs ± 3%    1.78µs ± 3%  +1.08%  (p=0.000 n=40+40)
    RegexpMatchMedium_1K-4      561µs ± 0%     562µs ± 0%  +0.09%  (p=0.003 n=34+31)
    RegexpMatchHard_32-4       31.5µs ± 2%    31.1µs ± 4%  -1.17%  (p=0.000 n=30+40)
    RegexpMatchHard_1K-4        960µs ± 3%     950µs ± 4%  -1.02%  (p=0.016 n=40+40)
    Revcomp-4                   7.79s ± 7%     7.79s ± 4%    ~     (p=0.859 n=40+40)
    Template-4                  889ms ± 6%     872ms ± 3%  -1.86%  (p=0.025 n=40+31)
    TimeParse-4                4.80µs ± 0%    4.89µs ± 3%  +1.71%  (p=0.001 n=31+40)
    TimeFormat-4               4.70µs ± 1%    4.78µs ± 3%  +1.57%  (p=0.000 n=33+40)
    [Geo mean]                  710µs          709µs       -0.13%
    
    name                     old speed      new speed      delta
    GobDecode-4              7.96MB/s ± 4%  7.96MB/s ± 9%    ~     (p=0.174 n=40+40)
    GobEncode-4              9.65MB/s ± 0%  9.51MB/s ± 4%  -1.45%  (p=0.000 n=34+40)
    Gzip-4                   4.29MB/s ± 4%  4.26MB/s ± 4%  -0.59%  (p=0.000 n=40+40)
    Gunzip-4                 43.0MB/s ± 3%  43.9MB/s ± 0%  +1.90%  (p=0.000 n=40+32)
    JSONEncode-4             9.09MB/s ± 4%  9.22MB/s ± 0%    ~     (p=0.429 n=40+31)
    JSONDecode-4             2.25MB/s ± 5%  2.25MB/s ± 4%    ~     (p=0.278 n=40+40)
    GoParse-4                1.35MB/s ± 7%  1.37MB/s ± 0%    ~     (p=0.071 n=40+25)
    RegexpMatchEasy0_32-4    31.5MB/s ± 3%  31.5MB/s ± 4%  -0.08%  (p=0.018 n=40+40)
    RegexpMatchEasy0_1K-4     263MB/s ± 0%   259MB/s ± 3%  -1.51%  (p=0.000 n=31+40)
    RegexpMatchEasy1_32-4    33.1MB/s ± 4%  32.6MB/s ± 3%  -1.38%  (p=0.000 n=40+40)
    RegexpMatchEasy1_1K-4     160MB/s ± 4%   159MB/s ± 3%    ~     (p=0.364 n=40+40)
    RegexpMatchMedium_32-4    565kB/s ± 3%   562kB/s ± 2%    ~     (p=0.208 n=40+40)
    RegexpMatchMedium_1K-4   1.82MB/s ± 0%  1.82MB/s ± 0%  -0.27%  (p=0.000 n=34+31)
    RegexpMatchHard_32-4     1.02MB/s ± 3%  1.03MB/s ± 4%  +1.04%  (p=0.000 n=32+40)
    RegexpMatchHard_1K-4     1.07MB/s ± 4%  1.08MB/s ± 4%  +0.94%  (p=0.003 n=40+40)
    Revcomp-4                32.6MB/s ± 7%  32.6MB/s ± 4%    ~     (p=0.965 n=40+40)
    Template-4               2.18MB/s ± 6%  2.22MB/s ± 3%  +1.83%  (p=0.020 n=40+31)
    [Geo mean]               7.77MB/s       7.78MB/s       +0.16%
    
    3. There is little change in the compilecmp benchmark (excluding noise).
    name        old time/op       new time/op       delta
    Template          2.37s ± 3%        2.35s ± 4%    ~     (p=0.529 n=10+10)
    Unicode           1.38s ± 8%        1.36s ± 5%    ~     (p=0.247 n=10+10)
    GoTypes           8.10s ± 2%        8.10s ± 2%    ~     (p=0.971 n=10+10)
    Compiler          40.5s ± 4%        40.8s ± 1%    ~     (p=0.529 n=10+10)
    SSA                115s ± 2%         115s ± 3%    ~     (p=0.684 n=10+10)
    Flate             1.45s ± 5%        1.46s ± 3%    ~     (p=0.796 n=10+10)
    GoParser          1.86s ± 4%        1.84s ± 2%    ~     (p=0.095 n=9+10)
    Reflect           5.11s ± 2%        5.13s ± 2%    ~     (p=0.315 n=10+10)
    Tar               2.22s ± 3%        2.23s ± 1%    ~     (p=0.299 n=9+7)
    XML               2.72s ± 3%        2.72s ± 3%    ~     (p=0.912 n=10+10)
    [Geo mean]        5.03s             5.02s       -0.21%
    
    name        old user-time/op  new user-time/op  delta
    Template          2.92s ± 2%        2.89s ± 1%    ~     (p=0.247 n=10+10)
    Unicode           1.71s ± 5%        1.69s ± 4%    ~     (p=0.393 n=10+10)
    GoTypes           9.78s ± 2%        9.76s ± 2%    ~     (p=0.631 n=10+10)
    Compiler          49.1s ± 2%        49.1s ± 1%    ~     (p=0.796 n=10+10)
    SSA                144s ± 1%         144s ± 2%    ~     (p=0.796 n=10+10)
    Flate             1.74s ± 2%        1.73s ± 3%    ~     (p=0.842 n=10+9)
    GoParser          2.23s ± 3%        2.25s ± 2%    ~     (p=0.143 n=10+10)
    Reflect           5.93s ± 3%        5.98s ± 2%    ~     (p=0.211 n=10+9)
    Tar               2.65s ± 2%        2.69s ± 3%  +1.51%  (p=0.010 n=9+10)
    XML               3.25s ± 2%        3.21s ± 1%  -1.24%  (p=0.035 n=10+9)
    [Geo mean]        6.07s             6.07s       -0.08%
    
    name        old text-bytes    new text-bytes    delta
    HelloSize         641kB ± 0%        641kB ± 0%    ~     (all equal)
    
    name        old data-bytes    new data-bytes    delta
    HelloSize        9.46kB ± 0%       9.46kB ± 0%    ~     (all equal)
    
    name        old bss-bytes     new bss-bytes     delta
    HelloSize         125kB ± 0%        125kB ± 0%    ~     (all equal)
    
    name        old exe-bytes     new exe-bytes     delta
    HelloSize        1.24MB ± 0%       1.24MB ± 0%    ~     (all equal)
    
    Change-Id: Id095d998c380eef929755124084df02446a6b7c1
    Reviewed-on: https://go-review.googlesource.com/92555
    Run-TryBot: Cherry Zhang <cherryyz@google.com>
    TryBot-Result: Gobot Gobot <gobot@golang.org>
    Reviewed-by: Cherry Zhang <cherryyz@google.com>