    runtime: adjust the arm64 memmove and memclr to operate by word as much as they can · 168a51b3
    Michael Hudson-Doyle authored
    Not only is this an obvious optimization:
    
    benchmark                           old MB/s     new MB/s     speedup
    BenchmarkMemmove1-4                 35.35        29.65        0.84x
    BenchmarkMemmove2-4                 63.78        52.53        0.82x
    BenchmarkMemmove3-4                 89.72        73.96        0.82x
    BenchmarkMemmove4-4                 109.94       95.73        0.87x
    BenchmarkMemmove5-4                 127.60       112.80       0.88x
    BenchmarkMemmove6-4                 143.59       126.67       0.88x
    BenchmarkMemmove7-4                 157.90       138.92       0.88x
    BenchmarkMemmove8-4                 167.18       231.81       1.39x
    BenchmarkMemmove9-4                 175.23       252.07       1.44x
    BenchmarkMemmove10-4                165.68       261.10       1.58x
    BenchmarkMemmove11-4                174.43       263.31       1.51x
    BenchmarkMemmove12-4                180.76       267.56       1.48x
    BenchmarkMemmove13-4                189.06       284.93       1.51x
    BenchmarkMemmove14-4                186.31       284.72       1.53x
    BenchmarkMemmove15-4                195.75       281.62       1.44x
    BenchmarkMemmove16-4                202.96       439.23       2.16x
    BenchmarkMemmove32-4                264.77       775.77       2.93x
    BenchmarkMemmove64-4                306.81       1209.64      3.94x
    BenchmarkMemmove128-4               357.03       1515.41      4.24x
    BenchmarkMemmove256-4               380.77       2066.01      5.43x
    BenchmarkMemmove512-4               385.05       2556.45      6.64x
    BenchmarkMemmove1024-4              381.23       2804.10      7.36x
    BenchmarkMemmove2048-4              379.06       2814.83      7.43x
    BenchmarkMemmove4096-4              387.43       3064.96      7.91x
    BenchmarkMemmoveUnaligned1-4        28.91        25.40        0.88x
    BenchmarkMemmoveUnaligned2-4        56.13        47.56        0.85x
    BenchmarkMemmoveUnaligned3-4        74.32        69.31        0.93x
    BenchmarkMemmoveUnaligned4-4        97.02        83.58        0.86x
    BenchmarkMemmoveUnaligned5-4        110.17       103.62       0.94x
    BenchmarkMemmoveUnaligned6-4        124.95       113.26       0.91x
    BenchmarkMemmoveUnaligned7-4        142.37       130.82       0.92x
    BenchmarkMemmoveUnaligned8-4        151.20       205.64       1.36x
    BenchmarkMemmoveUnaligned9-4        166.97       215.42       1.29x
    BenchmarkMemmoveUnaligned10-4       148.49       221.22       1.49x
    BenchmarkMemmoveUnaligned11-4       159.47       239.57       1.50x
    BenchmarkMemmoveUnaligned12-4       163.52       247.32       1.51x
    BenchmarkMemmoveUnaligned13-4       167.55       256.54       1.53x
    BenchmarkMemmoveUnaligned14-4       175.12       251.03       1.43x
    BenchmarkMemmoveUnaligned15-4       192.10       267.13       1.39x
    BenchmarkMemmoveUnaligned16-4       190.76       378.87       1.99x
    BenchmarkMemmoveUnaligned32-4       259.02       562.98       2.17x
    BenchmarkMemmoveUnaligned64-4       317.72       842.44       2.65x
    BenchmarkMemmoveUnaligned128-4      355.43       1274.49      3.59x
    BenchmarkMemmoveUnaligned256-4      378.17       1815.74      4.80x
    BenchmarkMemmoveUnaligned512-4      362.15       2180.81      6.02x
    BenchmarkMemmoveUnaligned1024-4     376.07       2453.58      6.52x
    BenchmarkMemmoveUnaligned2048-4     381.66       2568.32      6.73x
    BenchmarkMemmoveUnaligned4096-4     398.51       2669.36      6.70x
    BenchmarkMemclr5-4                  113.83       107.93       0.95x
    BenchmarkMemclr16-4                 223.84       389.63       1.74x
    BenchmarkMemclr64-4                 421.99       1209.58      2.87x
    BenchmarkMemclr256-4                525.94       2411.58      4.59x
    BenchmarkMemclr4096-4               581.66       4372.20      7.52x
    BenchmarkMemclr65536-4              565.84       4747.48      8.39x
    BenchmarkGoMemclr5-4                194.63       160.31       0.82x
    BenchmarkGoMemclr16-4               295.30       630.07       2.13x
    BenchmarkGoMemclr64-4               480.24       1884.03      3.92x
    BenchmarkGoMemclr256-4              540.23       2926.49      5.42x
    
    but it turns out that it is also necessary: a byte-at-a-time copy can leave
    a window in which the GC observes a partially written pointer.
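
    For illustration, here is a minimal Go sketch of the word-at-a-time idea
    (hypothetical: the real change is arm64 assembly in the runtime, and this
    sketch handles only a forward, non-overlapping copy). When source and
    destination are word-aligned, every pointer-sized slot is written with a
    single 8-byte store, so a concurrently scanning GC can never observe a
    half-written pointer:

        package main

        import "unsafe"

        // wordMove copies n bytes from src to dst a word at a time and
        // finishes the tail byte by byte. It assumes dst and src are
        // word-aligned and that the regions do not overlap.
        func wordMove(dst, src unsafe.Pointer, n uintptr) {
            const wordSize = unsafe.Sizeof(uintptr(0)) // 8 on arm64
            var i uintptr
            // Core loop: one aligned word (and hence any whole pointer
            // slot) per store.
            for ; i+wordSize <= n; i += wordSize {
                *(*uintptr)(unsafe.Add(dst, i)) = *(*uintptr)(unsafe.Add(src, i))
            }
            // Tail: the remaining 0-7 bytes, which by construction never
            // split a pointer-aligned, pointer-sized slot.
            for ; i < n; i++ {
                *(*byte)(unsafe.Add(dst, i)) = *(*byte)(unsafe.Add(src, i))
            }
        }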
    
    It's of course possible to be more sophisticated (using ldp/stp to move 16
    bytes at a time in the core loop and unrolling the tail-copying loops being
    the obvious ideas), but I wanted something simple and (reasonably) obviously
    correct.
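
    The memclr side of the change follows the same shape; a matching sketch,
    under the same (hypothetical) assumptions as the copy above:

        // wordClear zeroes n bytes at ptr: whole aligned words first,
        // then the byte tail, so pointer slots are cleared atomically.
        func wordClear(ptr unsafe.Pointer, n uintptr) {
            const wordSize = unsafe.Sizeof(uintptr(0))
            var i uintptr
            for ; i+wordSize <= n; i += wordSize {
                *(*uintptr)(unsafe.Add(ptr, i)) = 0
            }
            for ; i < n; i++ {
                *(*byte)(unsafe.Add(ptr, i)) = 0
            }
        }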
    
    Fixes #12552
    
    Change-Id: Iaeaf8a812cd06f4747ba2f792de1ded738890735
    Reviewed-on: https://go-review.googlesource.com/14813
    Reviewed-by: Austin Clements <austin@google.com>