cmd/compile: don't fold address of global into load/store on PPC64

On PPC64 (and a few other architectures), accessing global requires multiple instructions and use of temp register. The compiler emits a single MOV prog, and the assembler expands it to multiple instructions. If globals are accessed multiple times, each time it generates a reload of the temp register. As this is done by the assembler, the compiler cannot optimize it. This CL makes the compiler not fold address of global into load and store. If a global is accessed multiple times, or multiple fields of a struct are accessed, the compiler can CSE the address. Currently, this doesn't help the case where different globals are accessed, even though they may be close to each other in the address space (which we don't know at compile time). It helps a little bit in go1 benchmark: name old time/op new time/op delta BinaryTree17-2 4.84s ± 1% 4.84s ± 1% ~ (p=0.796 n=10+10) Fannkuch11-2 4.10s ± 0% 4.08s ± 0% -0.58% (p=0.000 n=9+8) FmtFprintfEmpty-2 97.9ns ± 1% 96.8ns ± 1% -1.08% (p=0.000 n=10+10) FmtFprintfString-2 147ns ± 0% 147ns ± 1% ~ (p=0.129 n=9+10) FmtFprintfInt-2 152ns ± 0% 152ns ± 0% ~ (p=0.294 n=10+8) FmtFprintfIntInt-2 218ns ± 1% 217ns ± 0% -0.64% (p=0.000 n=10+8) FmtFprintfPrefixedInt-2 263ns ± 1% 256ns ± 0% -2.77% (p=0.000 n=10+8) FmtFprintfFloat-2 375ns ± 1% 368ns ± 0% -1.95% (p=0.000 n=10+7) FmtManyArgs-2 849ns ± 0% 850ns ± 0% ~ (p=0.621 n=8+9) GobDecode-2 12.3ms ± 1% 12.2ms ± 1% -0.94% (p=0.003 n=10+10) GobEncode-2 10.3ms ± 1% 10.5ms ± 1% +2.03% (p=0.000 n=10+10) Gzip-2 414ms ± 1% 414ms ± 0% ~ (p=0.842 n=9+10) Gunzip-2 66.3ms ± 0% 66.4ms ± 0% ~ (p=0.077 n=9+9) HTTPClientServer-2 66.3µs ± 5% 66.4µs ± 1% ~ (p=0.661 n=10+9) JSONEncode-2 23.9ms ± 1% 23.9ms ± 1% ~ (p=0.905 n=10+9) JSONDecode-2 119ms ± 1% 116ms ± 0% -2.65% (p=0.000 n=10+10) Mandelbrot200-2 5.11ms ± 0% 4.92ms ± 0% -3.71% (p=0.000 n=10+10) GoParse-2 5.81ms ± 1% 5.84ms ± 1% ~ (p=0.052 n=10+10) RegexpMatchEasy0_32-2 315ns ± 0% 317ns ± 0% +0.67% (p=0.000 n=10+10) RegexpMatchEasy0_1K-2 658ns ± 0% 638ns ± 0% -3.01% (p=0.000 n=9+9) RegexpMatchEasy1_32-2 315ns ± 1% 317ns ± 0% +0.56% (p=0.000 n=9+9) RegexpMatchEasy1_1K-2 935ns ± 0% 926ns ± 0% -0.96% (p=0.000 n=9+9) RegexpMatchMedium_32-2 394ns ± 0% 396ns ± 1% +0.46% (p=0.001 n=10+10) RegexpMatchMedium_1K-2 65.1µs ± 0% 64.5µs ± 0% -0.90% (p=0.000 n=9+9) RegexpMatchHard_32-2 3.16µs ± 0% 3.17µs ± 0% +0.35% (p=0.000 n=10+9) RegexpMatchHard_1K-2 89.4µs ± 0% 89.3µs ± 0% ~ (p=0.136 n=9+9) Revcomp-2 703ms ± 2% 694ms ± 2% -1.41% (p=0.009 n=10+10) Template-2 107ms ± 1% 107ms ± 1% ~ (p=0.053 n=9+10) TimeParse-2 526ns ± 0% 524ns ± 0% -0.34% (p=0.002 n=9+9) TimeFormat-2 534ns ± 0% 504ns ± 1% -5.51% (p=0.000 n=10+10) [Geo mean] 93.8µs 93.1µs -0.70% It also helps in the case mentioned in issue #17110, main.main in package math's test. Now it generates 4 loads of R31 instead of 10, for the same piece of code. This causes a slight increase of binary size: cmd/go increases 0.66%. If this is a good idea, we should do it on other architectures where accessing global is expensive. Updates #17110. Change-Id: I2687af6eafc04f2a57c19781ec300c33567094b6 Reviewed-on: https://go-review.googlesource.com/68250 Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>

cmd/compile: don't fold address of global into load/store on PPC64
On PPC64 (and a few other architectures), accessing global requires multiple instructions and use of temp register. The compiler emits a single MOV prog, and the assembler expands it to multiple instructions. If globals are accessed multiple times, each time it generates a reload of the temp register. As this is done by the assembler, the compiler cannot optimize it. This CL makes the compiler not fold address of global into load and store. If a global is accessed multiple times, or multiple fields of a struct are accessed, the compiler can CSE the address. Currently, this doesn't help the case where different globals are accessed, even though they may be close to each other in the address space (which we don't know at compile time). It helps a little bit in go1 benchmark: name old time/op new time/op delta BinaryTree17-2 4.84s ± 1% 4.84s ± 1% ~ (p=0.796 n=10+10) Fannkuch11-2 4.10s ± 0% 4.08s ± 0% -0.58% (p=0.000 n=9+8) FmtFprintfEmpty-2 97.9ns ± 1% 96.8ns ± 1% -1.08% (p=0.000 n=10+10) FmtFprintfString-2 147ns ± 0% 147ns ± 1% ~ (p=0.129 n=9+10) FmtFprintfInt-2 152ns ± 0% 152ns ± 0% ~ (p=0.294 n=10+8) FmtFprintfIntInt-2 218ns ± 1% 217ns ± 0% -0.64% (p=0.000 n=10+8) FmtFprintfPrefixedInt-2 263ns ± 1% 256ns ± 0% -2.77% (p=0.000 n=10+8) FmtFprintfFloat-2 375ns ± 1% 368ns ± 0% -1.95% (p=0.000 n=10+7) FmtManyArgs-2 849ns ± 0% 850ns ± 0% ~ (p=0.621 n=8+9) GobDecode-2 12.3ms ± 1% 12.2ms ± 1% -0.94% (p=0.003 n=10+10) GobEncode-2 10.3ms ± 1% 10.5ms ± 1% +2.03% (p=0.000 n=10+10) Gzip-2 414ms ± 1% 414ms ± 0% ~ (p=0.842 n=9+10) Gunzip-2 66.3ms ± 0% 66.4ms ± 0% ~ (p=0.077 n=9+9) HTTPClientServer-2 66.3µs ± 5% 66.4µs ± 1% ~ (p=0.661 n=10+9) JSONEncode-2 23.9ms ± 1% 23.9ms ± 1% ~ (p=0.905 n=10+9) JSONDecode-2 119ms ± 1% 116ms ± 0% -2.65% (p=0.000 n=10+10) Mandelbrot200-2 5.11ms ± 0% 4.92ms ± 0% -3.71% (p=0.000 n=10+10) GoParse-2 5.81ms ± 1% 5.84ms ± 1% ~ (p=0.052 n=10+10) RegexpMatchEasy0_32-2 315ns ± 0% 317ns ± 0% +0.67% (p=0.000 n=10+10) RegexpMatchEasy0_1K-2 658ns ± 0% 638ns ± 0% -3.01% (p=0.000 n=9+9) RegexpMatchEasy1_32-2 315ns ± 1% 317ns ± 0% +0.56% (p=0.000 n=9+9) RegexpMatchEasy1_1K-2 935ns ± 0% 926ns ± 0% -0.96% (p=0.000 n=9+9) RegexpMatchMedium_32-2 394ns ± 0% 396ns ± 1% +0.46% (p=0.001 n=10+10) RegexpMatchMedium_1K-2 65.1µs ± 0% 64.5µs ± 0% -0.90% (p=0.000 n=9+9) RegexpMatchHard_32-2 3.16µs ± 0% 3.17µs ± 0% +0.35% (p=0.000 n=10+9) RegexpMatchHard_1K-2 89.4µs ± 0% 89.3µs ± 0% ~ (p=0.136 n=9+9) Revcomp-2 703ms ± 2% 694ms ± 2% -1.41% (p=0.009 n=10+10) Template-2 107ms ± 1% 107ms ± 1% ~ (p=0.053 n=9+10) TimeParse-2 526ns ± 0% 524ns ± 0% -0.34% (p=0.002 n=9+9) TimeFormat-2 534ns ± 0% 504ns ± 1% -5.51% (p=0.000 n=10+10) [Geo mean] 93.8µs 93.1µs -0.70% It also helps in the case mentioned in issue #17110, main.main in package math's test. Now it generates 4 loads of R31 instead of 10, for the same piece of code. This causes a slight increase of binary size: cmd/go increases 0.66%. If this is a good idea, we should do it on other architectures where accessing global is expensive. Updates #17110. Change-Id: I2687af6eafc04f2a57c19781ec300c33567094b6 Reviewed-on: https://go-review.googlesource.com/68250 Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
622cfd88 · Cherry Zhang · 4745604b · 622cfd88 · 622cfd88
Commit 622cfd88 authored Oct 04, 2017 by Cherry Zhang
Expand all Hide whitespace changes
Inline Side-by-side

Showing with 41 additions and 18 deletions

PPC64.rules src/cmd/compile/internal/ssa/gen/PPC64.rules +41 -18

rewritePPC64.go src/cmd/compile/internal/ssa/rewritePPC64.go +0 -0

No files found.
--- a/src/cmd/compile/internal/ssa/gen/PPC64.rules
+++ b/src/cmd/compile/internal/ssa/gen/PPC64.rules
@@ -760,35 +760,54 @@
 (FMOVSstore [off1] {sym} (ADDconst [off2] ptr) val mem) && is16Bit(off1+off2) -> (FMOVSstore [off1+off2] {sym} ptr val mem)
 (FMOVDstore [off1] {sym} (ADDconst [off2] ptr) val mem) && is16Bit(off1+off2) -> (FMOVDstore [off1+off2] {sym} ptr val mem)
-(MOVBstore [off1] {sym1} (MOVDaddr [off2] {sym2} ptr) val mem) && canMergeSym(sym1,sym2) ->
+// Fold address into load/store.
+// The assembler needs to generate several instructions and use
+// temp register for accessing global, and each time it will reload
+// the temp register. So don't fold address of global, unless there
+// is only one use.
+(MOVBstore [off1] {sym1} p:(MOVDaddr [off2] {sym2} ptr) val mem) && canMergeSym(sym1,sym2)
+	&& (ptr.Op != OpSB || p.Uses == 1) ->
        (MOVBstore [off1+off2] {mergeSym(sym1,sym2)} ptr val mem)
-(MOVHstore [off1] {sym1} (MOVDaddr [off2] {sym2} ptr) val mem) && canMergeSym(sym1,sym2) ->
+(MOVHstore [off1] {sym1} p:(MOVDaddr [off2] {sym2} ptr) val mem) && canMergeSym(sym1,sym2)
+	&& (ptr.Op != OpSB || p.Uses == 1) ->
        (MOVHstore [off1+off2] {mergeSym(sym1,sym2)} ptr val mem)
-(MOVWstore [off1] {sym1} (MOVDaddr [off2] {sym2} ptr) val mem) && canMergeSym(sym1,sym2) ->
+(MOVWstore [off1] {sym1} p:(MOVDaddr [off2] {sym2} ptr) val mem) && canMergeSym(sym1,sym2)
+	&& (ptr.Op != OpSB || p.Uses == 1) ->
        (MOVWstore [off1+off2] {mergeSym(sym1,sym2)} ptr val mem)
-(MOVDstore [off1] {sym1} (MOVDaddr [off2] {sym2} ptr) val mem) && canMergeSym(sym1,sym2) ->
+(MOVDstore [off1] {sym1} p:(MOVDaddr [off2] {sym2} ptr) val mem) && canMergeSym(sym1,sym2)
+	&& (ptr.Op != OpSB || p.Uses == 1) ->
        (MOVDstore [off1+off2] {mergeSym(sym1,sym2)} ptr val mem)
-(FMOVSstore [off1] {sym1} (MOVDaddr [off2] {sym2} ptr) val mem) && canMergeSym(sym1,sym2) ->
+(FMOVSstore [off1] {sym1} p:(MOVDaddr [off2] {sym2} ptr) val mem) && canMergeSym(sym1,sym2)
+	&& (ptr.Op != OpSB || p.Uses == 1) ->
        (FMOVSstore [off1+off2] {mergeSym(sym1,sym2)} ptr val mem)
-(FMOVDstore [off1] {sym1} (MOVDaddr [off2] {sym2} ptr) val mem) && canMergeSym(sym1,sym2) ->
+(FMOVDstore [off1] {sym1} p:(MOVDaddr [off2] {sym2} ptr) val mem) && canMergeSym(sym1,sym2)
+	&& (ptr.Op != OpSB || p.Uses == 1) ->
        (FMOVDstore [off1+off2] {mergeSym(sym1,sym2)} ptr val mem)
-(MOVBZload [off1] {sym1} (MOVDaddr [off2] {sym2} ptr) mem) && canMergeSym(sym1,sym2) ->
+(MOVBZload [off1] {sym1} p:(MOVDaddr [off2] {sym2} ptr) mem) && canMergeSym(sym1,sym2)
+	&& (ptr.Op != OpSB || p.Uses == 1) ->
        (MOVBZload [off1+off2] {mergeSym(sym1,sym2)} ptr mem)
-(MOVHload [off1] {sym1} (MOVDaddr [off2] {sym2} ptr) mem) && canMergeSym(sym1,sym2) ->
+(MOVHload [off1] {sym1} p:(MOVDaddr [off2] {sym2} ptr) mem) && canMergeSym(sym1,sym2)
+	&& (ptr.Op != OpSB || p.Uses == 1) ->
        (MOVHload [off1+off2] {mergeSym(sym1,sym2)} ptr mem)
-(MOVHZload [off1] {sym1} (MOVDaddr [off2] {sym2} ptr) mem) && canMergeSym(sym1,sym2) ->
+(MOVHZload [off1] {sym1} p:(MOVDaddr [off2] {sym2} ptr) mem) && canMergeSym(sym1,sym2)
+	&& (ptr.Op != OpSB || p.Uses == 1) ->
        (MOVHZload [off1+off2] {mergeSym(sym1,sym2)} ptr mem)
-(MOVWload [off1] {sym1} (MOVDaddr [off2] {sym2} ptr) mem) && canMergeSym(sym1,sym2) ->
+(MOVWload [off1] {sym1} p:(MOVDaddr [off2] {sym2} ptr) mem) && canMergeSym(sym1,sym2)
+	&& (ptr.Op != OpSB || p.Uses == 1) ->
        (MOVWload [off1+off2] {mergeSym(sym1,sym2)} ptr mem)
-(MOVWZload [off1] {sym1} (MOVDaddr [off2] {sym2} ptr) mem) && canMergeSym(sym1,sym2) ->
+(MOVWZload [off1] {sym1} p:(MOVDaddr [off2] {sym2} ptr) mem) && canMergeSym(sym1,sym2)
+	&& (ptr.Op != OpSB || p.Uses == 1) ->
        (MOVWZload [off1+off2] {mergeSym(sym1,sym2)} ptr mem)
-(MOVDload [off1] {sym1} (MOVDaddr [off2] {sym2} ptr) mem) && canMergeSym(sym1,sym2) ->
+(MOVDload [off1] {sym1} p:(MOVDaddr [off2] {sym2} ptr) mem) && canMergeSym(sym1,sym2)
+	&& (ptr.Op != OpSB || p.Uses == 1) ->
        (MOVDload [off1+off2] {mergeSym(sym1,sym2)} ptr mem)
-(FMOVSload [off1] {sym1} (MOVDaddr [off2] {sym2} ptr) mem) && canMergeSym(sym1,sym2) ->
+(FMOVSload [off1] {sym1} p:(MOVDaddr [off2] {sym2} ptr) mem) && canMergeSym(sym1,sym2)
+	&& (ptr.Op != OpSB || p.Uses == 1) ->
        (FMOVSload [off1+off2] {mergeSym(sym1,sym2)} ptr mem)
-(FMOVDload [off1] {sym1} (MOVDaddr [off2] {sym2} ptr) mem) && canMergeSym(sym1,sym2) ->
+(FMOVDload [off1] {sym1} p:(MOVDaddr [off2] {sym2} ptr) mem) && canMergeSym(sym1,sym2)
+	&& (ptr.Op != OpSB || p.Uses == 1) ->
        (FMOVDload [off1+off2] {mergeSym(sym1,sym2)} ptr mem)
 // Fold offsets for loads.
@@ -819,13 +838,17 @@
    (MOVBstorezero [off1+off2] {sym} x mem)
 // Fold symbols into storezero
-(MOVDstorezero [off1] {sym1} (MOVDaddr [off2] {sym2} x) mem) && canMergeSym(sym1,sym2) ->
+(MOVDstorezero [off1] {sym1} p:(MOVDaddr [off2] {sym2} x) mem) && canMergeSym(sym1,sym2)
+	&& (x.Op != OpSB || p.Uses == 1) ->
    (MOVDstorezero [off1+off2] {mergeSym(sym1,sym2)} x mem)
-(MOVWstorezero [off1] {sym1} (MOVDaddr [off2] {sym2} x) mem) && canMergeSym(sym1,sym2) ->
+(MOVWstorezero [off1] {sym1} p:(MOVDaddr [off2] {sym2} x) mem) && canMergeSym(sym1,sym2)
+	&& (x.Op != OpSB || p.Uses == 1) ->
    (MOVWstorezero [off1+off2] {mergeSym(sym1,sym2)} x mem)
-(MOVHstorezero [off1] {sym1} (MOVDaddr [off2] {sym2} x) mem) && canMergeSym(sym1,sym2) ->
+(MOVHstorezero [off1] {sym1} p:(MOVDaddr [off2] {sym2} x) mem) && canMergeSym(sym1,sym2)
+	&& (x.Op != OpSB || p.Uses == 1) ->
    (MOVHstorezero [off1+off2] {mergeSym(sym1,sym2)} x mem)
-(MOVBstorezero [off1] {sym1} (MOVDaddr [off2] {sym2} x) mem) && canMergeSym(sym1,sym2) ->
+(MOVBstorezero [off1] {sym1} p:(MOVDaddr [off2] {sym2} x) mem) && canMergeSym(sym1,sym2)
+	&& (x.Op != OpSB || p.Uses == 1) ->
    (MOVBstorezero [off1+off2] {mergeSym(sym1,sym2)} x mem)
 // atomic intrinsics

--- a/src/cmd/compile/internal/ssa/rewritePPC64.go
+++ b/src/cmd/compile/internal/ssa/rewritePPC64.go