-
Josh Bleecher Snyder authored
It is faster to execute MOVQ AX,(DI) MOVQ AX,8(DI) MOVQ AX,16(DI) MOVQ AX,24(DI) ADDQ $32,DI than STOSQ STOSQ STOSQ STOSQ However, in order to be able to jump into the middle of a block of MOVQs, the call site needs to pre-adjust DI. If we're clearing a small area, the cost of that DI pre-adjustment isn't repaid. This CL switches the DUFFZERO implementation to use a hybrid strategy, in which small clears use STOSQ as before, but large clears use mostly MOVQ/ADDQ blocks. benchmark old ns/op new ns/op delta BenchmarkClearFat8 0.55 0.55 +0.00% BenchmarkClearFat12 0.82 0.83 +1.22% BenchmarkClearFat16 0.55 0.55 +0.00% BenchmarkClearFat24 0.82 0.82 +0.00% BenchmarkClearFat32 2.20 1.94 -11.82% BenchmarkClearFat40 1.92 1.66 -13.54% BenchmarkClearFat48 2.21 1.93 -12.67% BenchmarkClearFat56 3.03 2.20 -27.39% BenchmarkClearFat64 3.26 2.48 -23.93% BenchmarkClearFat72 3.57 2.76 -22.69% BenchmarkClearFat80 3.83 3.05 -20.37% BenchmarkClearFat88 4.14 3.30 -20.29% BenchmarkClearFat128 5.54 4.69 -15.34% BenchmarkClearFat256 9.95 9.09 -8.64% BenchmarkClearFat512 18.7 17.9 -4.28% BenchmarkClearFat1024 36.2 35.4 -2.21% Change-Id: Ic786406d9b3cab68d5a231688f9e66fcd1bd7103 Reviewed-on: https://go-review.googlesource.com/2585Reviewed-by: Keith Randall <khr@golang.org>
7e0c11c3