    runtime, cmd/internal/obj/arm: improve arm function prologue · 5353cde0
    Josh Bleecher Snyder authored
    When stack growth is not needed, as it usually is not,
    execute only a single conditional branch
    rather than three conditional instructions.
    This adds 4 bytes to every function,
    but might speed up execution in the common case.
    
    Sample disassembly for
    
    func f() {
    	_ = [128]byte{}
    }
    
    Before:
    
    TEXT main.f(SB) x.go
    	x.go:3	0x2000	e59a1008	MOVW 0x8(R10), R1
    	x.go:3	0x2004	e59fb028	MOVW 0x28(R15), R11
    	x.go:3	0x2008	e08d200b	ADD R11, R13, R2
    	x.go:3	0x200c	e1520001	CMP R1, R2
    	x.go:3	0x2010	91a0300e	MOVW.LS R14, R3
    	x.go:3	0x2014	9b0118a9	BL.LS runtime.morestack_noctxt(SB)
    	x.go:3	0x2018	9afffff8	B.LS main.f(SB)
    	x.go:3	0x201c	e52de084	MOVW.W R14, -0x84(R13)
    	x.go:4	0x2020	e28d1004	ADD $4, R13, R1
    	x.go:4	0x2024	e3a00000	MOVW $0, R0
    	x.go:4	0x2028	eb012255	BL 0x4a984
    	x.go:5	0x202c	e49df084	RET #132
    	x.go:5	0x2030	eafffffe	B 0x2030
    	x.go:5	0x2034	ffffff7c	?
    
    After:
    
    TEXT main.f(SB) x.go
    	x.go:3	0x2000	e59a1008	MOVW 0x8(R10), R1
    	x.go:3	0x2004	e59fb02c	MOVW 0x2c(R15), R11
    	x.go:3	0x2008	e08d200b	ADD R11, R13, R2
    	x.go:3	0x200c	e1520001	CMP R1, R2
    	x.go:3	0x2010	9a000004	B.LS 0x2028
    	x.go:3	0x2014	e52de084	MOVW.W R14, -0x84(R13)
    	x.go:4	0x2018	e28d1004	ADD $4, R13, R1
    	x.go:4	0x201c	e3a00000	MOVW $0, R0
    	x.go:4	0x2020	eb0124dc	BL 0x4b398
    	x.go:5	0x2024	e49df084	RET #132
    	x.go:5	0x2028	e1a0300e	MOVW R14, R3
    	x.go:5	0x202c	eb011b0d	BL runtime.morestack_noctxt(SB)
    	x.go:5	0x2030	eafffff2	B main.f(SB)
    	x.go:5	0x2034	eafffffe	B 0x2034
    	x.go:5	0x2038	ffffff7c	?
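The numbers in the two listings can be cross-checked with a little arithmetic. The sketch below is not part of the change itself. It assumes the usual ARM conventions (4-byte instructions, PC reading as the instruction address plus 8, a signed 24-bit word offset in branch encodings), and the helper names `size`, `literal`, `pcRelTarget`, and `branchTarget` are made up for illustration. Under those assumptions the trailing `?` word appears to be the constant -132 loaded by the second `MOVW`, and the function grows by exactly 4 bytes.

```go
package main

import "fmt"

// Addresses of the first and last words in each listing above.
const (
	beforeStart, beforeLast = 0x2000, 0x2034
	afterStart, afterLast   = 0x2000, 0x2038
)

// size returns the bytes occupied by the 4-byte words from start through last.
func size(start, last uint32) uint32 { return last - start + 4 }

// literal reinterprets a data word as a signed 32-bit value.
func literal(word uint32) int32 { return int32(word) }

// pcRelTarget returns the address read by a PC-relative load,
// assuming PC reads as the instruction address plus 8.
func pcRelTarget(instrAddr, offset uint32) uint32 { return instrAddr + 8 + offset }

// branchTarget decodes the destination of an ARM B/BL instruction word:
// the low 24 bits are a signed word offset relative to instruction address + 8.
func branchTarget(addr, word uint32) uint32 {
	off := int32(word<<8) >> 6 // sign-extend imm24, then multiply by 4
	return uint32(int32(addr) + 8 + off)
}

func main() {
	// Code size growth: 60 bytes after vs. 56 bytes before.
	fmt.Println(size(afterStart, afterLast) - size(beforeStart, beforeLast)) // 4

	// The trailing word is the negated frame offset (compare RET #132).
	fmt.Println(literal(0xffffff7c)) // -132

	// MOVW 0x28(R15) and MOVW 0x2c(R15) both point at that trailing word.
	fmt.Printf("%#x %#x\n", pcRelTarget(0x2004, 0x28), pcRelTarget(0x2004, 0x2c)) // 0x2034 0x2038

	// New B.LS jumps to the out-of-line morestack call; the B after it retries f.
	fmt.Printf("%#x %#x\n", branchTarget(0x2010, 0x9a000004), branchTarget(0x2030, 0xeafffff2)) // 0x2028 0x2000
}
```

In the "After" listing the common path therefore runs from 0x2000 through the `RET` at 0x2024 with a single untaken `B.LS`, while the `morestack` call sits past the return at 0x2028.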
    
    Updates #10587.
    
    package sort benchmarks on an iPhone 6:
    
    name            old time/op  new time/op  delta
    SortString1K     569µs ± 0%   565µs ± 1%  -0.75%  (p=0.000 n=23+24)
    StableString1K   872µs ± 1%   870µs ± 1%  -0.16%  (p=0.009 n=23+24)
    SortInt1K        317µs ± 2%   316µs ± 2%    ~     (p=0.410 n=26+26)
    StableInt1K      343µs ± 1%   339µs ± 1%  -1.07%  (p=0.000 n=22+23)
    SortInt64K      30.0ms ± 1%  30.0ms ± 1%    ~     (p=0.091 n=25+24)
    StableInt64K    30.2ms ± 0%  30.0ms ± 0%  -0.69%  (p=0.000 n=22+22)
    Sort1e2          147µs ± 1%   146µs ± 0%  -0.48%  (p=0.000 n=25+24)
    Stable1e2        290µs ± 1%   286µs ± 1%  -1.30%  (p=0.000 n=23+24)
    Sort1e4         29.5ms ± 2%  29.7ms ± 1%  +0.71%  (p=0.000 n=23+23)
    Stable1e4       88.7ms ± 4%  88.6ms ± 8%  -0.07%  (p=0.022 n=26+26)
    Sort1e6          4.81s ± 7%   4.83s ± 7%    ~     (p=0.192 n=26+26)
    Stable1e6        18.3s ± 1%   18.1s ± 1%  -0.76%  (p=0.000 n=25+23)
    SearchWrappers   318ns ± 1%   344ns ± 1%  +8.14%  (p=0.000 n=23+26)
    
    package sort benchmarks on a first-generation Raspberry Pi:
    
    name            old time/op  new time/op  delta
    SearchWrappers  4.13µs ± 0%  3.95µs ± 0%   -4.42%  (p=0.000 n=15+13)
    SortString1K    5.81ms ± 1%  5.82ms ± 2%     ~     (p=0.400 n=14+15)
    StableString1K  9.69ms ± 1%  9.73ms ± 0%     ~     (p=0.121 n=15+11)
    SortInt1K       3.30ms ± 2%  3.66ms ±19%  +10.82%  (p=0.000 n=15+14)
    StableInt1K     5.97ms ±15%  4.17ms ± 8%  -30.05%  (p=0.000 n=15+15)
    SortInt64K       319ms ± 1%   295ms ± 1%   -7.65%  (p=0.000 n=15+15)
    StableInt64K     343ms ± 0%   332ms ± 0%   -3.26%  (p=0.000 n=12+13)
    Sort1e2         3.36ms ± 2%  3.22ms ± 4%   -4.10%  (p=0.000 n=15+15)
    Stable1e2       6.74ms ± 1%  6.43ms ± 2%   -4.67%  (p=0.000 n=15+15)
    Sort1e4          247ms ± 1%   247ms ± 1%     ~     (p=0.331 n=15+14)
    Stable1e4        864ms ± 0%   820ms ± 0%   -5.15%  (p=0.000 n=14+15)
    Sort1e6          41.2s ± 0%   41.2s ± 0%   +0.15%  (p=0.000 n=13+14)
    Stable1e6         192s ± 0%    182s ± 0%   -5.07%  (p=0.000 n=14+14)
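For reading the delta column, a minimal sketch of the relative-change calculation follows. It assumes delta is (new - old) / old expressed in percent; `percentDelta` is a hypothetical helper, not the tool that produced the tables. Because it starts from the rounded means printed above, its results land close to the reported deltas but not exactly on them.

```go
package main

import "fmt"

// percentDelta returns the relative change from oldT to newT, in percent.
// Negative values mean the new code is faster.
func percentDelta(oldT, newT float64) float64 {
	return (newT - oldT) / oldT * 100
}

func main() {
	// SearchWrappers on the iPhone 6: 318ns -> 344ns (table reports +8.14%).
	fmt.Printf("%+.2f%%\n", percentDelta(318, 344))

	// SearchWrappers on the first generation rpi: 4.13µs -> 3.95µs (table reports -4.42%).
	fmt.Printf("%+.2f%%\n", percentDelta(4.13, 3.95))
}
```

The same benchmark moves in opposite directions on the two machines, which fits the commit's cautious "might speed up execution".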
    
    Change-Id: I8a9db77e1d4ea1956575895893bc9d04bd81204b
    Reviewed-on: https://go-review.googlesource.com/10497
    Reviewed-by: Russ Cox <rsc@golang.org>