runtime: IndexByte and memclr perf improvements on arm64

Update runtime asm_arm64.s and memclr_arm64.s to improve performance by using SIMD instructions to do more in parallel. It shows improvement on bytes, html and go1 benchmarks (particualrly regexp, which uses IndexByte frequently). Benchmark results of bytes: name old time/op new time/op delta IndexByte/10-8 28.5ns ± 0% 19.5ns ± 0% -31.58% (p=0.000 n=10+10) IndexByte/32-8 52.6ns ± 0% 19.0ns ± 0% -63.88% (p=0.000 n=10+10) IndexByte/4K-8 4.12µs ± 0% 0.49µs ± 0% -88.16% (p=0.000 n=10+10) IndexByte/4M-8 4.29ms ± 1% 0.70ms ±26% -83.65% (p=0.000 n=10+10) IndexByte/64M-8 69.7ms ± 0% 16.0ms ± 0% -76.97% (p=0.000 n=9+10) IndexBytePortable/10-8 34.0ns ± 0% 34.0ns ± 0% ~ (all equal) IndexBytePortable/32-8 66.1ns ± 0% 66.1ns ± 0% ~ (p=0.471 n=9+9) IndexBytePortable/4K-8 6.17µs ± 0% 6.17µs ± 0% ~ (all equal) IndexBytePortable/4M-8 6.33ms ± 0% 6.35ms ± 0% +0.21% (p=0.002 n=10+9) IndexBytePortable/64M-8 103ms ± 0% 103ms ± 0% +0.01% (p=0.017 n=9+10) name old speed new speed delta IndexByte/10-8 351MB/s ± 0% 512MB/s ± 0% +46.14% (p=0.000 n=9+10) IndexByte/32-8 609MB/s ± 0% 1683MB/s ± 0% +176.40% (p=0.000 n=10+10) IndexByte/4K-8 994MB/s ± 0% 8378MB/s ± 0% +742.75% (p=0.000 n=10+10) IndexByte/4M-8 977MB/s ± 1% 6149MB/s ±32% +529.29% (p=0.000 n=10+10) IndexByte/64M-8 963MB/s ± 0% 4182MB/s ± 0% +334.29% (p=0.000 n=9+10) IndexBytePortable/10-8 294MB/s ± 0% 294MB/s ± 0% +0.17% (p=0.000 n=8+8) IndexBytePortable/32-8 484MB/s ± 0% 484MB/s ± 0% ~ (p=0.877 n=9+9) IndexBytePortable/4K-8 664MB/s ± 0% 664MB/s ± 0% ~ (p=0.242 n=8+9) IndexBytePortable/4M-8 662MB/s ± 0% 661MB/s ± 0% -0.21% (p=0.002 n=10+9) IndexBytePortable/64M-8 652MB/s ± 0% 652MB/s ± 0% ~ (p=0.065 n=10+10) Benchmark results of html: name old time/op new time/op delta Escape-8 62.0µs ± 1% 61.0µs ± 1% -1.69% (p=0.000 n=9+10) EscapeNone-8 10.2µs ± 0% 10.2µs ± 0% -0.09% (p=0.022 n=9+10) Unescape-8 71.9µs ± 0% 68.7µs ± 0% -4.35% (p=0.000 n=10+10) UnescapeNone-8 4.03µs ± 0% 0.48µs ± 0% -88.08% (p=0.000 n=10+10) UnescapeSparse-8 10.7µs ± 2% 7.1µs ± 3% -33.91% (p=0.000 n=10+10) UnescapeDense-8 53.2µs ± 1% 53.5µs ± 1% ~ (p=0.143 n=10+10) Benchmark results of go1: name old time/op new time/op delta BinaryTree17-8 6.53s ± 0% 6.48s ± 2% ~ (p=0.190 n=4+5) Fannkuch11-8 6.35s ± 1% 6.35s ± 0% ~ (p=1.000 n=5+5) FmtFprintfEmpty-8 108ns ± 1% 101ns ± 2% -6.32% (p=0.008 n=5+5) FmtFprintfString-8 172ns ± 1% 182ns ± 2% +5.70% (p=0.008 n=5+5) FmtFprintfInt-8 207ns ± 0% 207ns ± 0% ~ (p=0.444 n=5+5) FmtFprintfIntInt-8 277ns ± 1% 276ns ± 1% ~ (p=0.873 n=5+5) FmtFprintfPrefixedInt-8 386ns ± 0% 382ns ± 1% -1.04% (p=0.024 n=5+5) FmtFprintfFloat-8 492ns ± 0% 492ns ± 1% ~ (p=0.571 n=4+5) FmtManyArgs-8 1.32µs ± 1% 1.33µs ± 0% ~ (p=0.087 n=5+5) GobDecode-8 16.8ms ± 2% 16.7ms ± 1% ~ (p=1.000 n=5+5) GobEncode-8 14.1ms ± 1% 14.0ms ± 1% ~ (p=0.056 n=5+5) Gzip-8 788ms ± 0% 802ms ± 0% +1.71% (p=0.008 n=5+5) Gunzip-8 83.6ms ± 0% 83.9ms ± 0% +0.40% (p=0.008 n=5+5) HTTPClientServer-8 120µs ± 0% 120µs ± 1% ~ (p=0.548 n=5+5) JSONEncode-8 33.2ms ± 0% 33.0ms ± 1% -0.71% (p=0.008 n=5+5) JSONDecode-8 152ms ± 1% 152ms ± 1% ~ (p=1.000 n=5+5) Mandelbrot200-8 10.0ms ± 0% 10.0ms ± 0% -0.05% (p=0.008 n=5+5) GoParse-8 7.97ms ± 0% 7.98ms ± 0% ~ (p=0.690 n=5+5) RegexpMatchEasy0_32-8 233ns ± 1% 206ns ± 0% -11.44% (p=0.016 n=5+4) RegexpMatchEasy0_1K-8 1.86µs ± 0% 0.77µs ± 1% -58.54% (p=0.008 n=5+5) RegexpMatchEasy1_32-8 250ns ± 0% 205ns ± 0% -18.07% (p=0.008 n=5+5) RegexpMatchEasy1_1K-8 2.28µs ± 0% 1.11µs ± 0% -51.09% (p=0.029 n=4+4) RegexpMatchMedium_32-8 332ns ± 1% 301ns ± 2% -9.45% (p=0.008 n=5+5) RegexpMatchMedium_1K-8 85.5µs ± 2% 78.8µs ± 0% -7.83% (p=0.008 n=5+5) RegexpMatchHard_32-8 4.34µs ± 1% 4.27µs ± 0% -1.49% (p=0.008 n=5+5) RegexpMatchHard_1K-8 130µs ± 1% 127µs ± 0% -2.53% (p=0.008 n=5+5) Revcomp-8 1.35s ± 1% 1.13s ± 1% -16.17% (p=0.008 n=5+5) Template-8 160ms ± 2% 162ms ± 2% ~ (p=0.222 n=5+5) TimeParse-8 795ns ± 2% 778ns ± 1% ~ (p=0.095 n=5+5) TimeFormat-8 782ns ± 0% 786ns ± 1% +0.59% (p=0.040 n=5+5) name old speed new speed delta GobDecode-8 45.8MB/s ± 2% 45.9MB/s ± 1% ~ (p=1.000 n=5+5) GobEncode-8 54.3MB/s ± 1% 55.0MB/s ± 1% ~ (p=0.056 n=5+5) Gzip-8 24.6MB/s ± 0% 24.2MB/s ± 0% -1.69% (p=0.008 n=5+5) Gunzip-8 232MB/s ± 0% 231MB/s ± 0% -0.40% (p=0.008 n=5+5) JSONEncode-8 58.4MB/s ± 0% 58.8MB/s ± 1% +0.71% (p=0.008 n=5+5) JSONDecode-8 12.8MB/s ± 1% 12.8MB/s ± 1% ~ (p=1.000 n=5+5) GoParse-8 7.27MB/s ± 0% 7.26MB/s ± 0% ~ (p=0.762 n=5+5) RegexpMatchEasy0_32-8 137MB/s ± 1% 155MB/s ± 0% +12.93% (p=0.008 n=5+5) RegexpMatchEasy0_1K-8 551MB/s ± 0% 1329MB/s ± 1% +141.11% (p=0.008 n=5+5) RegexpMatchEasy1_32-8 128MB/s ± 0% 156MB/s ± 0% +22.00% (p=0.008 n=5+5) RegexpMatchEasy1_1K-8 449MB/s ± 0% 920MB/s ± 0% +104.68% (p=0.016 n=4+5) RegexpMatchMedium_32-8 3.00MB/s ± 0% 3.32MB/s ± 2% +10.60% (p=0.016 n=4+5) RegexpMatchMedium_1K-8 12.0MB/s ± 2% 13.0MB/s ± 0% +8.48% (p=0.008 n=5+5) RegexpMatchHard_32-8 7.38MB/s ± 1% 7.49MB/s ± 0% +1.49% (p=0.008 n=5+5) RegexpMatchHard_1K-8 7.88MB/s ± 1% 8.08MB/s ± 0% +2.59% (p=0.008 n=5+5) Revcomp-8 188MB/s ± 1% 224MB/s ± 1% +19.29% (p=0.008 n=5+5) Template-8 12.2MB/s ± 2% 12.0MB/s ± 2% ~ (p=0.206 n=5+5) Change-Id: I94116620a287d173a6f60510684362e500f54887 Reviewed-on: https://go-review.googlesource.com/33597 Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>

runtime: IndexByte and memclr perf improvements on arm64
Update runtime asm_arm64.s and memclr_arm64.s to improve performance by using SIMD instructions to do more in parallel. It shows improvement on bytes, html and go1 benchmarks (particualrly regexp, which uses IndexByte frequently). Benchmark results of bytes: name old time/op new time/op delta IndexByte/10-8 28.5ns ± 0% 19.5ns ± 0% -31.58% (p=0.000 n=10+10) IndexByte/32-8 52.6ns ± 0% 19.0ns ± 0% -63.88% (p=0.000 n=10+10) IndexByte/4K-8 4.12µs ± 0% 0.49µs ± 0% -88.16% (p=0.000 n=10+10) IndexByte/4M-8 4.29ms ± 1% 0.70ms ±26% -83.65% (p=0.000 n=10+10) IndexByte/64M-8 69.7ms ± 0% 16.0ms ± 0% -76.97% (p=0.000 n=9+10) IndexBytePortable/10-8 34.0ns ± 0% 34.0ns ± 0% ~ (all equal) IndexBytePortable/32-8 66.1ns ± 0% 66.1ns ± 0% ~ (p=0.471 n=9+9) IndexBytePortable/4K-8 6.17µs ± 0% 6.17µs ± 0% ~ (all equal) IndexBytePortable/4M-8 6.33ms ± 0% 6.35ms ± 0% +0.21% (p=0.002 n=10+9) IndexBytePortable/64M-8 103ms ± 0% 103ms ± 0% +0.01% (p=0.017 n=9+10) name old speed new speed delta IndexByte/10-8 351MB/s ± 0% 512MB/s ± 0% +46.14% (p=0.000 n=9+10) IndexByte/32-8 609MB/s ± 0% 1683MB/s ± 0% +176.40% (p=0.000 n=10+10) IndexByte/4K-8 994MB/s ± 0% 8378MB/s ± 0% +742.75% (p=0.000 n=10+10) IndexByte/4M-8 977MB/s ± 1% 6149MB/s ±32% +529.29% (p=0.000 n=10+10) IndexByte/64M-8 963MB/s ± 0% 4182MB/s ± 0% +334.29% (p=0.000 n=9+10) IndexBytePortable/10-8 294MB/s ± 0% 294MB/s ± 0% +0.17% (p=0.000 n=8+8) IndexBytePortable/32-8 484MB/s ± 0% 484MB/s ± 0% ~ (p=0.877 n=9+9) IndexBytePortable/4K-8 664MB/s ± 0% 664MB/s ± 0% ~ (p=0.242 n=8+9) IndexBytePortable/4M-8 662MB/s ± 0% 661MB/s ± 0% -0.21% (p=0.002 n=10+9) IndexBytePortable/64M-8 652MB/s ± 0% 652MB/s ± 0% ~ (p=0.065 n=10+10) Benchmark results of html: name old time/op new time/op delta Escape-8 62.0µs ± 1% 61.0µs ± 1% -1.69% (p=0.000 n=9+10) EscapeNone-8 10.2µs ± 0% 10.2µs ± 0% -0.09% (p=0.022 n=9+10) Unescape-8 71.9µs ± 0% 68.7µs ± 0% -4.35% (p=0.000 n=10+10) UnescapeNone-8 4.03µs ± 0% 0.48µs ± 0% -88.08% (p=0.000 n=10+10) UnescapeSparse-8 10.7µs ± 2% 7.1µs ± 3% -33.91% (p=0.000 n=10+10) UnescapeDense-8 53.2µs ± 1% 53.5µs ± 1% ~ (p=0.143 n=10+10) Benchmark results of go1: name old time/op new time/op delta BinaryTree17-8 6.53s ± 0% 6.48s ± 2% ~ (p=0.190 n=4+5) Fannkuch11-8 6.35s ± 1% 6.35s ± 0% ~ (p=1.000 n=5+5) FmtFprintfEmpty-8 108ns ± 1% 101ns ± 2% -6.32% (p=0.008 n=5+5) FmtFprintfString-8 172ns ± 1% 182ns ± 2% +5.70% (p=0.008 n=5+5) FmtFprintfInt-8 207ns ± 0% 207ns ± 0% ~ (p=0.444 n=5+5) FmtFprintfIntInt-8 277ns ± 1% 276ns ± 1% ~ (p=0.873 n=5+5) FmtFprintfPrefixedInt-8 386ns ± 0% 382ns ± 1% -1.04% (p=0.024 n=5+5) FmtFprintfFloat-8 492ns ± 0% 492ns ± 1% ~ (p=0.571 n=4+5) FmtManyArgs-8 1.32µs ± 1% 1.33µs ± 0% ~ (p=0.087 n=5+5) GobDecode-8 16.8ms ± 2% 16.7ms ± 1% ~ (p=1.000 n=5+5) GobEncode-8 14.1ms ± 1% 14.0ms ± 1% ~ (p=0.056 n=5+5) Gzip-8 788ms ± 0% 802ms ± 0% +1.71% (p=0.008 n=5+5) Gunzip-8 83.6ms ± 0% 83.9ms ± 0% +0.40% (p=0.008 n=5+5) HTTPClientServer-8 120µs ± 0% 120µs ± 1% ~ (p=0.548 n=5+5) JSONEncode-8 33.2ms ± 0% 33.0ms ± 1% -0.71% (p=0.008 n=5+5) JSONDecode-8 152ms ± 1% 152ms ± 1% ~ (p=1.000 n=5+5) Mandelbrot200-8 10.0ms ± 0% 10.0ms ± 0% -0.05% (p=0.008 n=5+5) GoParse-8 7.97ms ± 0% 7.98ms ± 0% ~ (p=0.690 n=5+5) RegexpMatchEasy0_32-8 233ns ± 1% 206ns ± 0% -11.44% (p=0.016 n=5+4) RegexpMatchEasy0_1K-8 1.86µs ± 0% 0.77µs ± 1% -58.54% (p=0.008 n=5+5) RegexpMatchEasy1_32-8 250ns ± 0% 205ns ± 0% -18.07% (p=0.008 n=5+5) RegexpMatchEasy1_1K-8 2.28µs ± 0% 1.11µs ± 0% -51.09% (p=0.029 n=4+4) RegexpMatchMedium_32-8 332ns ± 1% 301ns ± 2% -9.45% (p=0.008 n=5+5) RegexpMatchMedium_1K-8 85.5µs ± 2% 78.8µs ± 0% -7.83% (p=0.008 n=5+5) RegexpMatchHard_32-8 4.34µs ± 1% 4.27µs ± 0% -1.49% (p=0.008 n=5+5) RegexpMatchHard_1K-8 130µs ± 1% 127µs ± 0% -2.53% (p=0.008 n=5+5) Revcomp-8 1.35s ± 1% 1.13s ± 1% -16.17% (p=0.008 n=5+5) Template-8 160ms ± 2% 162ms ± 2% ~ (p=0.222 n=5+5) TimeParse-8 795ns ± 2% 778ns ± 1% ~ (p=0.095 n=5+5) TimeFormat-8 782ns ± 0% 786ns ± 1% +0.59% (p=0.040 n=5+5) name old speed new speed delta GobDecode-8 45.8MB/s ± 2% 45.9MB/s ± 1% ~ (p=1.000 n=5+5) GobEncode-8 54.3MB/s ± 1% 55.0MB/s ± 1% ~ (p=0.056 n=5+5) Gzip-8 24.6MB/s ± 0% 24.2MB/s ± 0% -1.69% (p=0.008 n=5+5) Gunzip-8 232MB/s ± 0% 231MB/s ± 0% -0.40% (p=0.008 n=5+5) JSONEncode-8 58.4MB/s ± 0% 58.8MB/s ± 1% +0.71% (p=0.008 n=5+5) JSONDecode-8 12.8MB/s ± 1% 12.8MB/s ± 1% ~ (p=1.000 n=5+5) GoParse-8 7.27MB/s ± 0% 7.26MB/s ± 0% ~ (p=0.762 n=5+5) RegexpMatchEasy0_32-8 137MB/s ± 1% 155MB/s ± 0% +12.93% (p=0.008 n=5+5) RegexpMatchEasy0_1K-8 551MB/s ± 0% 1329MB/s ± 1% +141.11% (p=0.008 n=5+5) RegexpMatchEasy1_32-8 128MB/s ± 0% 156MB/s ± 0% +22.00% (p=0.008 n=5+5) RegexpMatchEasy1_1K-8 449MB/s ± 0% 920MB/s ± 0% +104.68% (p=0.016 n=4+5) RegexpMatchMedium_32-8 3.00MB/s ± 0% 3.32MB/s ± 2% +10.60% (p=0.016 n=4+5) RegexpMatchMedium_1K-8 12.0MB/s ± 2% 13.0MB/s ± 0% +8.48% (p=0.008 n=5+5) RegexpMatchHard_32-8 7.38MB/s ± 1% 7.49MB/s ± 0% +1.49% (p=0.008 n=5+5) RegexpMatchHard_1K-8 7.88MB/s ± 1% 8.08MB/s ± 0% +2.59% (p=0.008 n=5+5) Revcomp-8 188MB/s ± 1% 224MB/s ± 1% +19.29% (p=0.008 n=5+5) Template-8 12.2MB/s ± 2% 12.0MB/s ± 2% ~ (p=0.206 n=5+5) Change-Id: I94116620a287d173a6f60510684362e500f54887 Reviewed-on: https://go-review.googlesource.com/33597 Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>
d259815c · wei xiao · Brad Fitzpatrick · 466e299d · d259815c · d259815c
Commit d259815c authored Nov 28, 2016 by wei xiao Committed by Brad Fitzpatrick Nov 15, 2017
Hide whitespace changes
Inline Side-by-side

Showing with 159 additions and 62 deletions

asm_arm64.s src/runtime/asm_arm64.s +109 -34

memclr_arm64.s src/runtime/memclr_arm64.s +50 -28

No files found.
--- a/src/runtime/asm_arm64.s
+++ b/src/runtime/asm_arm64.s
@@ -802,48 +802,123 @@ samebytes:
 //
 TEXT bytes·IndexByte(SB),NOSPLIT,$0-40
 	MOVD	b+0(FP), R0
-	MOVD	b_len+8(FP), R1
-	MOVBU	c+24(FP), R2	// byte to find
-	MOVD	R0, R4		// store base for later
-	ADD	R0, R1		// end
-loop:
-	CMP	R0, R1
-	BEQ	notfound
-	MOVBU.P	1(R0), R3
-	CMP	R2, R3
-	BNE	loop
-
-	SUB	$1, R0		// R0 will be one beyond the position we want
-	SUB	R4, R0		// remove base
-	MOVD	R0, ret+32(FP)
-	RET
-
-notfound:
-	MOVD	$-1, R0
-	MOVD	R0, ret+32(FP)
-	RET
+	MOVD	b_len+8(FP), R2
+	MOVBU	c+24(FP), R1
+	MOVD	$ret+32(FP), R8
+	B	runtime·indexbytebody<>(SB)

 TEXT strings·IndexByte(SB),NOSPLIT,$0-32
 	MOVD	s+0(FP), R0
-	MOVD	s_len+8(FP), R1
-	MOVBU	c+16(FP), R2	// byte to find
-	MOVD	R0, R4		// store base for later
-	ADD	R0, R1		// end
+	MOVD	s_len+8(FP), R2
+	MOVBU	c+16(FP), R1
+	MOVD	$ret+24(FP), R8
+	B	runtime·indexbytebody<>(SB)
+
+// input:
+//   R0: data
+//   R1: byte to search
+//   R2: data len
+//   R8: address to put result
+TEXT runtime·indexbytebody<>(SB),NOSPLIT,$0
+	// Core algorithm:
+	// For each 32-byte chunk we calculate a 64-bit syndrome value,
+	// with two bits per byte. For each tuple, bit 0 is set if the
+	// relevant byte matched the requested character and bit 1 is
+	// not used (faster than using a 32bit syndrome). Since the bits
+	// in the syndrome reflect exactly the order in which things occur
+	// in the original string, counting trailing zeros allows to
+	// identify exactly which byte has matched.
+
+	CBZ	R2, fail
+	MOVD	R0, R11
+	// Magic constant 0x40100401 allows us to identify
+	// which lane matches the requested byte.
+	// 0x40100401 = ((1<<0) + (4<<8) + (16<<16) + (64<<24))
+	// Different bytes have different bit masks (i.e: 1, 4, 16, 64)
+	MOVD	$0x40100401, R5
+	VMOV	R1, V0.B16
+	// Work with aligned 32-byte chunks
+	BIC	$0x1f, R0, R3
+	VMOV	R5, V5.S4
+	ANDS	$0x1f, R0, R9
+	AND	$0x1f, R2, R10
+	BEQ	loop
+
+	// Input string is not 32-byte aligned. We calculate the
+	// syndrome value for the aligned 32 bytes block containing
+	// the first bytes and mask off the irrelevant part.
+	VLD1.P	(R3), [V1.B16, V2.B16]
+	SUB	$0x20, R9, R4
+	ADDS	R4, R2, R2
+	VCMEQ	V0.B16, V1.B16, V3.B16
+	VCMEQ	V0.B16, V2.B16, V4.B16
+	VAND	V5.B16, V3.B16, V3.B16
+	VAND	V5.B16, V4.B16, V4.B16
+	VADDP	V4.B16, V3.B16, V6.B16 // 256->128
+	VADDP	V6.B16, V6.B16, V6.B16 // 128->64
+	VMOV	V6.D[0], R6
+	// Clear the irrelevant lower bits
+	LSL	$1, R9, R4
+	LSR	R4, R6, R6
+	LSL	R4, R6, R6
+	// The first block can also be the last
+	BLS	masklast
+	// Have we found something already?
+	CBNZ	R6, tail
+
 loop:
-	CMP	R0, R1
-	BEQ	notfound
-	MOVBU.P	1(R0), R3
-	CMP	R2, R3
-	BNE	loop
+	VLD1.P	(R3), [V1.B16, V2.B16]
+	SUBS	$0x20, R2, R2
+	VCMEQ	V0.B16, V1.B16, V3.B16
+	VCMEQ	V0.B16, V2.B16, V4.B16
+	// If we're out of data we finish regardless of the result
+	BLS	end
+	// Use a fast check for the termination condition
+	VORR	V4.B16, V3.B16, V6.B16
+	VADDP	V6.D2, V6.D2, V6.D2
+	VMOV	V6.D[0], R6
+	// We're not out of data, loop if we haven't found the character
+	CBZ	R6, loop
+
+end:
+	// Termination condition found, let's calculate the syndrome value
+	VAND	V5.B16, V3.B16, V3.B16
+	VAND	V5.B16, V4.B16, V4.B16
+	VADDP	V4.B16, V3.B16, V6.B16
+	VADDP	V6.B16, V6.B16, V6.B16
+	VMOV	V6.D[0], R6
+	// Only do the clear for the last possible block with less than 32 bytes
+	// Condition flags come from SUBS in the loop
+	BHS	tail
+
+masklast:
+	// Clear the irrelevant upper bits
+	ADD	R9, R10, R4
+	AND	$0x1f, R4, R4
+	SUB	$0x20, R4, R4
+	NEG	R4<<1, R4
+	LSL	R4, R6, R6
+	LSR	R4, R6, R6

-	SUB	$1, R0		// R0 will be one beyond the position we want
-	SUB	R4, R0		// remove base
-	MOVD	R0, ret+24(FP)
+tail:
+	// Check that we have found a character
+	CBZ	R6, fail
+	// Count the trailing zeros using bit reversing
+	RBIT	R6, R6
+	// Compensate the last post-increment
+	SUB	$0x20, R3, R3
+	// And count the leading zeros
+	CLZ	R6, R6
+	// R6 is twice the offset into the fragment
+	ADD	R6>>1, R3, R0
+	// Compute the offset result
+	SUB	R11, R0, R0
+	MOVD	R0, (R8)
 	RET

-notfound:
+fail:
 	MOVD	$-1, R0
-	MOVD	R0, ret+24(FP)
+	MOVD	R0, (R8)
 	RET

 // Equal(a, b []byte) bool

--- a/src/runtime/memclr_arm64.s
+++ b/src/runtime/memclr_arm64.s
@@ -6,32 +6,54 @@

 // void runtime·memclrNoHeapPointers(void*, uintptr)
 TEXT runtime·memclrNoHeapPointers(SB),NOSPLIT,$0-16
-	MOVD	ptr+0(FP), R3
-	MOVD	n+8(FP), R4
-	// TODO(mwhudson): this is written this way to avoid tickling
-	// warnings from addpool when written as AND $7, R4, R6 (see
-	// https://golang.org/issue/12708)
-	AND	$~7, R4, R5	// R5 is N&~7
-	SUB	R5, R4, R6	// R6 is N&7
-
-	CMP	$0, R5
-	BEQ	nowords
-
-	ADD	R3, R5, R5
-
-wordloop: // TODO: Optimize for unaligned ptr.
-	MOVD.P	$0, 8(R3)
-	CMP	R3, R5
-	BNE	wordloop
-nowords:
-        CMP	$0, R6
-        BEQ	done
-
-	ADD	R3, R6, R6
-
-byteloop:
-	MOVBU.P	$0, 1(R3)
-	CMP	R3, R6
-	BNE	byteloop
-done:
+	MOVD	ptr+0(FP), R0
+	MOVD	n+8(FP), R1
+	// If size is less than 16 bytes, use tail_zero to zero what remains
+	CMP	$16, R1
+	BLT	tail_zero
+	// Get buffer offset into 16 byte aligned address for better performance
+	ANDS	$15, R0, ZR
+	BNE	unaligned_to_16
+aligned_to_16:
+	LSR	$4, R1, R2
+zero_by_16:
+	STP.P	(ZR, ZR), 16(R0)
+	SUBS	$1, R2, R2
+	BNE	zero_by_16
+
+	ANDS	$15, R1, R1
+	BEQ	ending
+
+	// Zero buffer with size=R1 < 16
+tail_zero:
+	TBZ	$3, R1, tail_zero_4
+	MOVD.P	ZR, 8(R0)
+
+tail_zero_4:
+	TBZ	$2, R1, tail_zero_2
+	MOVW.P	ZR, 4(R0)
+
+tail_zero_2:
+	TBZ	$1, R1, tail_zero_1
+	MOVH.P	ZR, 2(R0)
+
+tail_zero_1:
+	TBZ	$0, R1, ending
+	MOVB	ZR, (R0)
+
+ending:
 	RET
+
+unaligned_to_16:
+	MOVD	R0, R2
+head_loop:
+	MOVBU.P	ZR, 1(R0)
+	ANDS	$15, R0, ZR
+	BNE	head_loop
+	// Adjust length for what remains
+	SUB	R2, R0, R3
+	SUB	R3, R1
+	// If size is less than 16 bytes, use tail_zero to zero what remains
+	CMP	$16, R1
+	BLT	tail_zero
+	B	aligned_to_16