Commit 53f8a6ae authored by Keith Randall

cmd/compile: automatically handle commuting ops in rewrite rules

Note that this is a redo of an undo of the original buggy CL 38666.

We have lots of rewrite rules that vary only in the fact that
we have 2 versions for the 2 different orderings of various
commuting ops. For example:

(ADDL x (MOVLconst [c])) -> (ADDLconst [c] x)
(ADDL (MOVLconst [c]) x) -> (ADDLconst [c] x)

It can get unwieldy quickly, especially when there is more than
one commuting op in a rule.
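
For example, the ARM rule

(ADD (MUL x y) a) -> (MULA x y a)

has two commuting ops (ADD and MUL), so covering every argument
order by hand would take four spellings:

(ADD (MUL x y) a) -> (MULA x y a)
(ADD (MUL y x) a) -> (MULA x y a)
(ADD a (MUL x y)) -> (MULA x y a)
(ADD a (MUL y x)) -> (MULA x y a)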

Our existing "fix" for this problem is to have rules that
canonicalize the operations first. For example:

(Eq64 x (Const64 <t> [c])) && x.Op != OpConst64 -> (Eq64 (Const64 <t> [c]) x)

Subsequent rules can then assume that if there is a constant arg to
Eq64, it will be the first one. This fix kinda works, but it is
fragile: it depends on us remembering to include the required extra
rules.

The fundamental problem is that the rule matcher doesn't
know anything about commuting ops. This CL fixes that fact.

We already have information about which ops commute. (The register
allocator takes advantage of commutativity.) The rule generator now
automatically generates multiple rules for a single source rule when
there are commutative ops in the rule. We can now drop all of our
almost-duplicate source-level rules and the canonicalization rules.
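
For example, the ADDL pair above collapses to the single source rule

(ADDL x (MOVLconst [c])) -> (ADDLconst [c] x)

and the generator emits the swapped match
(ADDL (MOVLconst [c]) x) automatically.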

I have some CLs in progress that will be a lot less verbose when
the rule generator handles commutativity for me.

I had to reorganize the load-combining rules a bit. The 8-way OR rules
generated 128 different reorderings, which was causing the generator
to put too much code in the rewrite*.go files (the big ones were going
from 25K lines to 132K lines). Instead I reorganized the rules to
combine pairs of loads at a time. The generated rule files are now
actually a bit (5%) smaller.
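
As a sketch of the new pairwise shape (illustrative, using 386-style
op names; the Uses, clobber, and mergePoint side conditions are
abbreviated here):

(ORL x0:(MOVBload [i0] {s} p mem)
     s0:(SHLLconst [8] x1:(MOVBload [i1] {s} p mem)))
  && i1 == i0+1 && ...
  -> @mergePoint(b,x0,x1) (MOVWload [i0] {s} p mem)

Two byte loads merge into one 2-byte load; the same idea applied
again merges 2-byte loads into 4-byte loads, and so on, so the 8-way
case needs only a short chain of pairwise rules instead of 128 flat
variants.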

Make.bash times are ~unchanged.

Compiler benchmarks are not observably different, probably because
we don't spend much compiler time in rule matching anyway.

I've also done a pass over all of our ops adding commutative markings
for ops which hadn't had them previously.

Fixes #18292

Change-Id: Ic1c0e43fbf579539f459971625f69690c9ab8805
Reviewed-on: https://go-review.googlesource.com/38801
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
parent 63a72fd4
@@ -193,10 +193,10 @@ func init() {
{name: "MULL", argLength: 2, reg: gp21, asm: "IMULL", commutative: true, resultInArg0: true, clobberFlags: true}, // arg0 * arg1
{name: "MULLconst", argLength: 1, reg: gp11, asm: "IMULL", aux: "Int32", resultInArg0: true, clobberFlags: true}, // arg0 * auxint
{name: "HMULL", argLength: 2, reg: gp21hmul, asm: "IMULL", clobberFlags: true}, // (arg0 * arg1) >> width
{name: "HMULLU", argLength: 2, reg: gp21hmul, asm: "MULL", clobberFlags: true}, // (arg0 * arg1) >> width
{name: "HMULL", argLength: 2, reg: gp21hmul, commutative: true, asm: "IMULL", clobberFlags: true}, // (arg0 * arg1) >> width
{name: "HMULLU", argLength: 2, reg: gp21hmul, commutative: true, asm: "MULL", clobberFlags: true}, // (arg0 * arg1) >> width
{name: "MULLQU", argLength: 2, reg: gp21mul, asm: "MULL", clobberFlags: true}, // arg0 * arg1, high 32 in result[0], low 32 in result[1]
{name: "MULLQU", argLength: 2, reg: gp21mul, commutative: true, asm: "MULL", clobberFlags: true}, // arg0 * arg1, high 32 in result[0], low 32 in result[1]
{name: "AVGLU", argLength: 2, reg: gp21, commutative: true, resultInArg0: true, clobberFlags: true}, // (arg0 + arg1) / 2 as unsigned, all 32 result bits
@@ -229,9 +229,9 @@ func init() {
{name: "UCOMISS", argLength: 2, reg: fp2flags, asm: "UCOMISS", typ: "Flags", usesScratch: true}, // arg0 compare to arg1, f32
{name: "UCOMISD", argLength: 2, reg: fp2flags, asm: "UCOMISD", typ: "Flags", usesScratch: true}, // arg0 compare to arg1, f64
{name: "TESTL", argLength: 2, reg: gp2flags, asm: "TESTL", typ: "Flags"}, // (arg0 & arg1) compare to 0
{name: "TESTW", argLength: 2, reg: gp2flags, asm: "TESTW", typ: "Flags"}, // (arg0 & arg1) compare to 0
{name: "TESTB", argLength: 2, reg: gp2flags, asm: "TESTB", typ: "Flags"}, // (arg0 & arg1) compare to 0
{name: "TESTL", argLength: 2, reg: gp2flags, commutative: true, asm: "TESTL", typ: "Flags"}, // (arg0 & arg1) compare to 0
{name: "TESTW", argLength: 2, reg: gp2flags, commutative: true, asm: "TESTW", typ: "Flags"}, // (arg0 & arg1) compare to 0
{name: "TESTB", argLength: 2, reg: gp2flags, commutative: true, asm: "TESTB", typ: "Flags"}, // (arg0 & arg1) compare to 0
{name: "TESTLconst", argLength: 1, reg: gp1flags, asm: "TESTL", typ: "Flags", aux: "Int32"}, // (arg0 & auxint) compare to 0
{name: "TESTWconst", argLength: 1, reg: gp1flags, asm: "TESTW", typ: "Flags", aux: "Int16"}, // (arg0 & auxint) compare to 0
{name: "TESTBconst", argLength: 1, reg: gp1flags, asm: "TESTB", typ: "Flags", aux: "Int8"}, // (arg0 & auxint) compare to 0
@@ -314,7 +314,7 @@ func init() {
{name: "PXOR", argLength: 2, reg: fp21, asm: "PXOR", commutative: true, resultInArg0: true}, // exclusive or, applied to X regs for float negation.
{name: "LEAL", argLength: 1, reg: gp11sb, aux: "SymOff", rematerializeable: true, symEffect: "Addr"}, // arg0 + auxint + offset encoded in aux
{name: "LEAL1", argLength: 2, reg: gp21sb, aux: "SymOff", symEffect: "Addr"}, // arg0 + arg1 + auxint + aux
{name: "LEAL1", argLength: 2, reg: gp21sb, commutative: true, aux: "SymOff", symEffect: "Addr"}, // arg0 + arg1 + auxint + aux
{name: "LEAL2", argLength: 2, reg: gp21sb, aux: "SymOff", symEffect: "Addr"}, // arg0 + 2*arg1 + auxint + aux
{name: "LEAL4", argLength: 2, reg: gp21sb, aux: "SymOff", symEffect: "Addr"}, // arg0 + 4*arg1 + auxint + aux
{name: "LEAL8", argLength: 2, reg: gp21sb, aux: "SymOff", symEffect: "Addr"}, // arg0 + 8*arg1 + auxint + aux
@@ -331,17 +331,17 @@ func init() {
{name: "MOVLstore", argLength: 3, reg: gpstore, asm: "MOVL", aux: "SymOff", typ: "Mem", faultOnNilArg0: true, symEffect: "Write"}, // store 4 bytes in arg1 to arg0+auxint+aux. arg2=mem
// indexed loads/stores
{name: "MOVBloadidx1", argLength: 3, reg: gploadidx, asm: "MOVBLZX", aux: "SymOff", symEffect: "Read"}, // load a byte from arg0+arg1+auxint+aux. arg2=mem
{name: "MOVWloadidx1", argLength: 3, reg: gploadidx, asm: "MOVWLZX", aux: "SymOff", symEffect: "Read"}, // load 2 bytes from arg0+arg1+auxint+aux. arg2=mem
{name: "MOVWloadidx2", argLength: 3, reg: gploadidx, asm: "MOVWLZX", aux: "SymOff", symEffect: "Read"}, // load 2 bytes from arg0+2*arg1+auxint+aux. arg2=mem
{name: "MOVLloadidx1", argLength: 3, reg: gploadidx, asm: "MOVL", aux: "SymOff", symEffect: "Read"}, // load 4 bytes from arg0+arg1+auxint+aux. arg2=mem
{name: "MOVLloadidx4", argLength: 3, reg: gploadidx, asm: "MOVL", aux: "SymOff", symEffect: "Read"}, // load 4 bytes from arg0+4*arg1+auxint+aux. arg2=mem
{name: "MOVBloadidx1", argLength: 3, reg: gploadidx, commutative: true, asm: "MOVBLZX", aux: "SymOff", symEffect: "Read"}, // load a byte from arg0+arg1+auxint+aux. arg2=mem
{name: "MOVWloadidx1", argLength: 3, reg: gploadidx, commutative: true, asm: "MOVWLZX", aux: "SymOff", symEffect: "Read"}, // load 2 bytes from arg0+arg1+auxint+aux. arg2=mem
{name: "MOVWloadidx2", argLength: 3, reg: gploadidx, asm: "MOVWLZX", aux: "SymOff", symEffect: "Read"}, // load 2 bytes from arg0+2*arg1+auxint+aux. arg2=mem
{name: "MOVLloadidx1", argLength: 3, reg: gploadidx, commutative: true, asm: "MOVL", aux: "SymOff", symEffect: "Read"}, // load 4 bytes from arg0+arg1+auxint+aux. arg2=mem
{name: "MOVLloadidx4", argLength: 3, reg: gploadidx, asm: "MOVL", aux: "SymOff", symEffect: "Read"}, // load 4 bytes from arg0+4*arg1+auxint+aux. arg2=mem
// TODO: sign-extending indexed loads
{name: "MOVBstoreidx1", argLength: 4, reg: gpstoreidx, asm: "MOVB", aux: "SymOff", symEffect: "Write"}, // store byte in arg2 to arg0+arg1+auxint+aux. arg3=mem
{name: "MOVWstoreidx1", argLength: 4, reg: gpstoreidx, asm: "MOVW", aux: "SymOff", symEffect: "Write"}, // store 2 bytes in arg2 to arg0+arg1+auxint+aux. arg3=mem
{name: "MOVWstoreidx2", argLength: 4, reg: gpstoreidx, asm: "MOVW", aux: "SymOff", symEffect: "Write"}, // store 2 bytes in arg2 to arg0+2*arg1+auxint+aux. arg3=mem
{name: "MOVLstoreidx1", argLength: 4, reg: gpstoreidx, asm: "MOVL", aux: "SymOff", symEffect: "Write"}, // store 4 bytes in arg2 to arg0+arg1+auxint+aux. arg3=mem
{name: "MOVLstoreidx4", argLength: 4, reg: gpstoreidx, asm: "MOVL", aux: "SymOff", symEffect: "Write"}, // store 4 bytes in arg2 to arg0+4*arg1+auxint+aux. arg3=mem
{name: "MOVBstoreidx1", argLength: 4, reg: gpstoreidx, commutative: true, asm: "MOVB", aux: "SymOff", symEffect: "Write"}, // store byte in arg2 to arg0+arg1+auxint+aux. arg3=mem
{name: "MOVWstoreidx1", argLength: 4, reg: gpstoreidx, commutative: true, asm: "MOVW", aux: "SymOff", symEffect: "Write"}, // store 2 bytes in arg2 to arg0+arg1+auxint+aux. arg3=mem
{name: "MOVWstoreidx2", argLength: 4, reg: gpstoreidx, asm: "MOVW", aux: "SymOff", symEffect: "Write"}, // store 2 bytes in arg2 to arg0+2*arg1+auxint+aux. arg3=mem
{name: "MOVLstoreidx1", argLength: 4, reg: gpstoreidx, commutative: true, asm: "MOVL", aux: "SymOff", symEffect: "Write"}, // store 4 bytes in arg2 to arg0+arg1+auxint+aux. arg3=mem
{name: "MOVLstoreidx4", argLength: 4, reg: gpstoreidx, asm: "MOVL", aux: "SymOff", symEffect: "Write"}, // store 4 bytes in arg2 to arg0+4*arg1+auxint+aux. arg3=mem
// TODO: add size-mismatched indexed loads, like MOVBstoreidx4.
// For storeconst ops, the AuxInt field encodes both
@@ -487,16 +487,13 @@
(MOVWloadshiftRA ptr idx [c] (MOVWstoreshiftRA ptr2 idx [d] x _)) && c==d && isSamePtr(ptr, ptr2) -> x
// fold constant into arithmetic ops
(ADD (MOVWconst [c]) x) -> (ADDconst [c] x)
(ADD x (MOVWconst [c])) -> (ADDconst [c] x)
(SUB (MOVWconst [c]) x) -> (RSBconst [c] x)
(SUB x (MOVWconst [c])) -> (SUBconst [c] x)
(RSB (MOVWconst [c]) x) -> (SUBconst [c] x)
(RSB x (MOVWconst [c])) -> (RSBconst [c] x)
(ADDS (MOVWconst [c]) x) -> (ADDSconst [c] x)
(ADDS x (MOVWconst [c])) -> (ADDSconst [c] x)
(SUBS (MOVWconst [c]) x) -> (RSBSconst [c] x)
(SUBS x (MOVWconst [c])) -> (SUBSconst [c] x)
(ADC (MOVWconst [c]) x flags) -> (ADCconst [c] x flags)
@@ -504,11 +501,8 @@
(SBC (MOVWconst [c]) x flags) -> (RSCconst [c] x flags)
(SBC x (MOVWconst [c]) flags) -> (SBCconst [c] x flags)
(AND (MOVWconst [c]) x) -> (ANDconst [c] x)
(AND x (MOVWconst [c])) -> (ANDconst [c] x)
(OR (MOVWconst [c]) x) -> (ORconst [c] x)
(OR x (MOVWconst [c])) -> (ORconst [c] x)
(XOR (MOVWconst [c]) x) -> (XORconst [c] x)
(OR x (MOVWconst [c])) -> (ORconst [c] x)
(XOR x (MOVWconst [c])) -> (XORconst [c] x)
(BIC x (MOVWconst [c])) -> (BICconst [c] x)
@@ -568,17 +562,6 @@
(MUL x (MOVWconst [c])) && c%7 == 0 && isPowerOfTwo(c/7) && is32Bit(c) -> (SLLconst [log2(c/7)] (RSBshiftLL <x.Type> x x [3]))
(MUL x (MOVWconst [c])) && c%9 == 0 && isPowerOfTwo(c/9) && is32Bit(c) -> (SLLconst [log2(c/9)] (ADDshiftLL <x.Type> x x [3]))
(MUL (MOVWconst [c]) x) && int32(c) == -1 -> (RSBconst [0] x)
(MUL (MOVWconst [0]) _) -> (MOVWconst [0])
(MUL (MOVWconst [1]) x) -> x
(MUL (MOVWconst [c]) x) && isPowerOfTwo(c) -> (SLLconst [log2(c)] x)
(MUL (MOVWconst [c]) x) && isPowerOfTwo(c-1) && int32(c) >= 3 -> (ADDshiftLL x x [log2(c-1)])
(MUL (MOVWconst [c]) x) && isPowerOfTwo(c+1) && int32(c) >= 7 -> (RSBshiftLL x x [log2(c+1)])
(MUL (MOVWconst [c]) x) && c%3 == 0 && isPowerOfTwo(c/3) && is32Bit(c) -> (SLLconst [log2(c/3)] (ADDshiftLL <x.Type> x x [1]))
(MUL (MOVWconst [c]) x) && c%5 == 0 && isPowerOfTwo(c/5) && is32Bit(c) -> (SLLconst [log2(c/5)] (ADDshiftLL <x.Type> x x [2]))
(MUL (MOVWconst [c]) x) && c%7 == 0 && isPowerOfTwo(c/7) && is32Bit(c) -> (SLLconst [log2(c/7)] (RSBshiftLL <x.Type> x x [3]))
(MUL (MOVWconst [c]) x) && c%9 == 0 && isPowerOfTwo(c/9) && is32Bit(c) -> (SLLconst [log2(c/9)] (ADDshiftLL <x.Type> x x [3]))
(MULA x (MOVWconst [c]) a) && int32(c) == -1 -> (SUB a x)
(MULA _ (MOVWconst [0]) a) -> a
(MULA x (MOVWconst [1]) a) -> (ADD x a)
@@ -841,17 +824,11 @@
// absorb shifts into ops
(ADD x (SLLconst [c] y)) -> (ADDshiftLL x y [c])
(ADD (SLLconst [c] y) x) -> (ADDshiftLL x y [c])
(ADD x (SRLconst [c] y)) -> (ADDshiftRL x y [c])
(ADD (SRLconst [c] y) x) -> (ADDshiftRL x y [c])
(ADD x (SRAconst [c] y)) -> (ADDshiftRA x y [c])
(ADD (SRAconst [c] y) x) -> (ADDshiftRA x y [c])
(ADD x (SLL y z)) -> (ADDshiftLLreg x y z)
(ADD (SLL y z) x) -> (ADDshiftLLreg x y z)
(ADD x (SRL y z)) -> (ADDshiftRLreg x y z)
(ADD (SRL y z) x) -> (ADDshiftRLreg x y z)
(ADD x (SRA y z)) -> (ADDshiftRAreg x y z)
(ADD (SRA y z) x) -> (ADDshiftRAreg x y z)
(ADC x (SLLconst [c] y) flags) -> (ADCshiftLL x y [c] flags)
(ADC (SLLconst [c] y) x flags) -> (ADCshiftLL x y [c] flags)
(ADC x (SRLconst [c] y) flags) -> (ADCshiftRL x y [c] flags)
@@ -865,17 +842,11 @@
(ADC x (SRA y z) flags) -> (ADCshiftRAreg x y z flags)
(ADC (SRA y z) x flags) -> (ADCshiftRAreg x y z flags)
(ADDS x (SLLconst [c] y)) -> (ADDSshiftLL x y [c])
(ADDS (SLLconst [c] y) x) -> (ADDSshiftLL x y [c])
(ADDS x (SRLconst [c] y)) -> (ADDSshiftRL x y [c])
(ADDS (SRLconst [c] y) x) -> (ADDSshiftRL x y [c])
(ADDS x (SRAconst [c] y)) -> (ADDSshiftRA x y [c])
(ADDS (SRAconst [c] y) x) -> (ADDSshiftRA x y [c])
(ADDS x (SLL y z)) -> (ADDSshiftLLreg x y z)
(ADDS (SLL y z) x) -> (ADDSshiftLLreg x y z)
(ADDS x (SRL y z)) -> (ADDSshiftRLreg x y z)
(ADDS (SRL y z) x) -> (ADDSshiftRLreg x y z)
(ADDS x (SRA y z)) -> (ADDSshiftRAreg x y z)
(ADDS (SRA y z) x) -> (ADDSshiftRAreg x y z)
(SUB x (SLLconst [c] y)) -> (SUBshiftLL x y [c])
(SUB (SLLconst [c] y) x) -> (RSBshiftLL x y [c])
(SUB x (SRLconst [c] y)) -> (SUBshiftRL x y [c])
@@ -925,43 +896,24 @@
(RSB x (SRA y z)) -> (RSBshiftRAreg x y z)
(RSB (SRA y z) x) -> (SUBshiftRAreg x y z)
(AND x (SLLconst [c] y)) -> (ANDshiftLL x y [c])
(AND (SLLconst [c] y) x) -> (ANDshiftLL x y [c])
(AND x (SRLconst [c] y)) -> (ANDshiftRL x y [c])
(AND (SRLconst [c] y) x) -> (ANDshiftRL x y [c])
(AND x (SRAconst [c] y)) -> (ANDshiftRA x y [c])
(AND (SRAconst [c] y) x) -> (ANDshiftRA x y [c])
(AND x (SLL y z)) -> (ANDshiftLLreg x y z)
(AND (SLL y z) x) -> (ANDshiftLLreg x y z)
(AND x (SRL y z)) -> (ANDshiftRLreg x y z)
(AND (SRL y z) x) -> (ANDshiftRLreg x y z)
(AND x (SRA y z)) -> (ANDshiftRAreg x y z)
(AND (SRA y z) x) -> (ANDshiftRAreg x y z)
(OR x (SLLconst [c] y)) -> (ORshiftLL x y [c])
(OR (SLLconst [c] y) x) -> (ORshiftLL x y [c])
(OR x (SRLconst [c] y)) -> (ORshiftRL x y [c])
(OR (SRLconst [c] y) x) -> (ORshiftRL x y [c])
(OR x (SRAconst [c] y)) -> (ORshiftRA x y [c])
(OR (SRAconst [c] y) x) -> (ORshiftRA x y [c])
(OR x (SLL y z)) -> (ORshiftLLreg x y z)
(OR (SLL y z) x) -> (ORshiftLLreg x y z)
(OR x (SRL y z)) -> (ORshiftRLreg x y z)
(OR (SRL y z) x) -> (ORshiftRLreg x y z)
(OR x (SRA y z)) -> (ORshiftRAreg x y z)
(OR (SRA y z) x) -> (ORshiftRAreg x y z)
(XOR x (SLLconst [c] y)) -> (XORshiftLL x y [c])
(XOR (SLLconst [c] y) x) -> (XORshiftLL x y [c])
(XOR x (SRLconst [c] y)) -> (XORshiftRL x y [c])
(XOR (SRLconst [c] y) x) -> (XORshiftRL x y [c])
(XOR x (SRAconst [c] y)) -> (XORshiftRA x y [c])
(XOR (SRAconst [c] y) x) -> (XORshiftRA x y [c])
(XOR x (SRRconst [c] y)) -> (XORshiftRR x y [c])
(XOR (SRRconst [c] y) x) -> (XORshiftRR x y [c])
(XOR x (SLL y z)) -> (XORshiftLLreg x y z)
(XOR (SLL y z) x) -> (XORshiftLLreg x y z)
(XOR x (SRL y z)) -> (XORshiftRLreg x y z)
(XOR (SRL y z) x) -> (XORshiftRLreg x y z)
(XOR x (SRA y z)) -> (XORshiftRAreg x y z)
(XOR (SRA y z) x) -> (XORshiftRAreg x y z)
(BIC x (SLLconst [c] y)) -> (BICshiftLL x y [c])
(BIC x (SRLconst [c] y)) -> (BICshiftRL x y [c])
(BIC x (SRAconst [c] y)) -> (BICshiftRA x y [c])
@@ -1213,7 +1165,6 @@
// generic simplifications
(ADD x (RSBconst [0] y)) -> (SUB x y)
(ADD (RSBconst [0] y) x) -> (SUB x y)
(ADD <t> (RSBconst [c] x) (RSBconst [d] y)) -> (RSBconst [c+d] (ADD <t> x y))
(SUB x x) -> (MOVWconst [0])
(RSB x x) -> (MOVWconst [0])
@@ -1223,10 +1174,8 @@
(BIC x x) -> (MOVWconst [0])
(ADD (MUL x y) a) -> (MULA x y a)
(ADD a (MUL x y)) -> (MULA x y a)
(AND x (MVN y)) -> (BIC x y)
(AND (MVN y) x) -> (BIC x y)
// simplification with *shift ops
(SUBshiftLL x (SLLconst x [c]) [d]) && c==d -> (MOVWconst [0])
@@ -1248,11 +1197,8 @@
(BICshiftRL x (SRLconst x [c]) [d]) && c==d -> (MOVWconst [0])
(BICshiftRA x (SRAconst x [c]) [d]) && c==d -> (MOVWconst [0])
(AND x (MVNshiftLL y [c])) -> (BICshiftLL x y [c])
(AND (MVNshiftLL y [c]) x) -> (BICshiftLL x y [c])
(AND x (MVNshiftRL y [c])) -> (BICshiftRL x y [c])
(AND (MVNshiftRL y [c]) x) -> (BICshiftRL x y [c])
(AND x (MVNshiftRA y [c])) -> (BICshiftRA x y [c])
(AND (MVNshiftRA y [c]) x) -> (BICshiftRA x y [c])
// floating point optimizations
(CMPF x (MOVFconst [0])) -> (CMPF0 x)
@@ -591,16 +591,11 @@
(MOVWreg x) && x.Uses == 1 -> (MOVWnop x)
// fold constant into arithmetic ops
(ADD (MOVWconst [c]) x) -> (ADDconst [c] x)
(ADD x (MOVWconst [c])) -> (ADDconst [c] x)
(SUB x (MOVWconst [c])) -> (SUBconst [c] x)
(AND (MOVWconst [c]) x) -> (ANDconst [c] x)
(AND x (MOVWconst [c])) -> (ANDconst [c] x)
(OR (MOVWconst [c]) x) -> (ORconst [c] x)
(OR x (MOVWconst [c])) -> (ORconst [c] x)
(XOR (MOVWconst [c]) x) -> (XORconst [c] x)
(XOR x (MOVWconst [c])) -> (XORconst [c] x)
(NOR (MOVWconst [c]) x) -> (NORconst [c] x)
(NOR x (MOVWconst [c])) -> (NORconst [c] x)
(SLL _ (MOVWconst [c])) && uint32(c)>=32 -> (MOVWconst [0])
@@ -616,8 +611,8 @@
(SGTU x (MOVWconst [0])) -> (SGTUzero x)
// mul with constant
(Select1 (MULTU x (MOVWconst [c]))) && x.Op != OpMIPSMOVWconst-> (Select1 (MULTU (MOVWconst [c]) x ))
(Select0 (MULTU x (MOVWconst [c]))) && x.Op != OpMIPSMOVWconst-> (Select0 (MULTU (MOVWconst [c]) x ))
(Select1 (MULTU x (MOVWconst [c]))) && x.Op != OpMIPSMOVWconst -> (Select1 (MULTU (MOVWconst [c]) x ))
(Select0 (MULTU x (MOVWconst [c]))) && x.Op != OpMIPSMOVWconst -> (Select0 (MULTU (MOVWconst [c]) x ))
(Select1 (MULTU (MOVWconst [0]) _ )) -> (MOVWconst [0])
(Select0 (MULTU (MOVWconst [0]) _ )) -> (MOVWconst [0])
@@ -635,7 +630,6 @@
// generic simplifications
(ADD x (NEG y)) -> (SUB x y)
(ADD (NEG y) x) -> (SUB x y)
(SUB x x) -> (MOVWconst [0])
(SUB (MOVWconst [0]) x) -> (NEG x)
(AND x x) -> x
@@ -729,12 +723,12 @@
// conditional move
(CMOVZ _ b (MOVWconst [0])) -> b
(CMOVZ a _ (MOVWconst [c])) && c!=0-> a
(CMOVZ a _ (MOVWconst [c])) && c!=0 -> a
(CMOVZzero _ (MOVWconst [0])) -> (MOVWconst [0])
(CMOVZzero a (MOVWconst [c])) && c!=0-> a
(CMOVZzero a (MOVWconst [c])) && c!=0 -> a
(CMOVZ a (MOVWconst [0]) c) -> (CMOVZzero a c)
// atomic
(LoweredAtomicStore ptr (MOVWconst [0]) mem) -> (LoweredAtomicStorezero ptr mem)
(LoweredAtomicAdd ptr (MOVWconst [c]) mem) && is16Bit(c)-> (LoweredAtomicAddconst [c] ptr mem)
(LoweredAtomicAdd ptr (MOVWconst [c]) mem) && is16Bit(c) -> (LoweredAtomicAddconst [c] ptr mem)
@@ -579,16 +579,11 @@
(MOVVreg x) && x.Uses == 1 -> (MOVVnop x)
// fold constant into arithmetic ops
(ADDV (MOVVconst [c]) x) && is32Bit(c) -> (ADDVconst [c] x)
(ADDV x (MOVVconst [c])) && is32Bit(c) -> (ADDVconst [c] x)
(SUBV x (MOVVconst [c])) && is32Bit(c) -> (SUBVconst [c] x)
(AND (MOVVconst [c]) x) && is32Bit(c) -> (ANDconst [c] x)
(AND x (MOVVconst [c])) && is32Bit(c) -> (ANDconst [c] x)
(OR (MOVVconst [c]) x) && is32Bit(c) -> (ORconst [c] x)
(OR x (MOVVconst [c])) && is32Bit(c) -> (ORconst [c] x)
(XOR (MOVVconst [c]) x) && is32Bit(c) -> (XORconst [c] x)
(XOR x (MOVVconst [c])) && is32Bit(c) -> (XORconst [c] x)
(NOR (MOVVconst [c]) x) && is32Bit(c) -> (NORconst [c] x)
(NOR x (MOVVconst [c])) && is32Bit(c) -> (NORconst [c] x)
(SLLV _ (MOVVconst [c])) && uint64(c)>=64 -> (MOVVconst [0])
@@ -620,7 +615,6 @@
// generic simplifications
(ADDV x (NEGV y)) -> (SUBV x y)
(ADDV (NEGV y) x) -> (SUBV x y)
(SUBV x x) -> (MOVVconst [0])
(SUBV (MOVVconst [0]) x) -> (NEGV x)
(AND x x) -> x
@@ -558,6 +558,28 @@
(Move [1] dst src mem) -> (MOVBstore dst (MOVBZload src mem) mem)
(Move [2] dst src mem) ->
(MOVHstore dst (MOVHZload src mem) mem)
(Move [4] {t} dst src mem) && t.(Type).Alignment()%4 == 0 ->
(MOVWstore dst (MOVWload src mem) mem)
(Move [4] {t} dst src mem) && t.(Type).Alignment()%2 == 0 ->
(MOVHstore [2] dst (MOVHZload [2] src mem)
(MOVHstore dst (MOVHZload src mem) mem))
(Move [4] dst src mem) ->
(MOVBstore [3] dst (MOVBZload [3] src mem)
(MOVBstore [2] dst (MOVBZload [2] src mem)
(MOVBstore [1] dst (MOVBZload [1] src mem)
(MOVBstore dst (MOVBZload src mem) mem))))
(Move [8] {t} dst src mem) && t.(Type).Alignment()%8 == 0 ->
(MOVDstore dst (MOVDload src mem) mem)
(Move [8] {t} dst src mem) && t.(Type).Alignment()%4 == 0 ->
(MOVWstore [4] dst (MOVWZload [4] src mem)
(MOVWstore dst (MOVWZload src mem) mem))
(Move [8] {t} dst src mem) && t.(Type).Alignment()%2 == 0 ->
(MOVHstore [6] dst (MOVHZload [6] src mem)
(MOVHstore [4] dst (MOVHZload [4] src mem)
(MOVHstore [2] dst (MOVHZload [2] src mem)
(MOVHstore dst (MOVHZload src mem) mem))))
(Move [3] dst src mem) ->
(MOVBstore [2] dst (MOVBZload [2] src mem)
(MOVHstore dst (MOVHload src mem) mem))
@@ -607,9 +629,6 @@
(AND x (MOVDconst [c])) && isU16Bit(c) -> (ANDconst [c] x)
(XOR x (MOVDconst [c])) && isU32Bit(c) -> (XORconst [c] x)
(OR x (MOVDconst [c])) && isU32Bit(c) -> (ORconst [c] x)
(AND (MOVDconst [c]) x) && isU16Bit(c) -> (ANDconst [c] x)
(XOR (MOVDconst [c]) x) && isU32Bit(c) -> (XORconst [c] x)
(OR (MOVDconst [c]) x) && isU32Bit(c) -> (ORconst [c] x)
// Simplify consts
(ANDconst [c] (ANDconst [d] x)) -> (ANDconst [c&d] x)
@@ -675,7 +694,6 @@
// Arithmetic constant ops
(ADD (MOVDconst [c]) x) && is32Bit(c) -> (ADDconst [c] x)
(ADD x (MOVDconst [c])) && is32Bit(c) -> (ADDconst [c] x)
(ADDconst [c] (ADDconst [d] x)) && is32Bit(c+d) -> (ADDconst [c+d] x)
(ADDconst [0] x) -> x
@@ -847,9 +865,7 @@
(AND x:(MOVBZload _ _) (MOVDconst [c])) -> (ANDconst [c&0xFF] x)
// floating-point fused multiply-add/sub
(FADD z (FMUL x y)) -> (FMADD x y z)
(FADD (FMUL x y) z) -> (FMADD x y z)
(FSUB (FMUL x y) z) -> (FMSUB x y z)
(FADDS z (FMULS x y)) -> (FMADDS x y z)
(FADDS (FMULS x y) z) -> (FMADDS x y z)
(FSUBS (FMULS x y) z) -> (FMSUBS x y z)
@@ -224,7 +224,7 @@ func init() {
{name: "ANDN", argLength: 2, reg: gp21, asm: "ANDN"}, // arg0&^arg1
{name: "OR", argLength: 2, reg: gp21, asm: "OR", commutative: true}, // arg0|arg1
{name: "ORN", argLength: 2, reg: gp21, asm: "ORN"}, // arg0|^arg1
{name: "NOR", argLength: 2, reg: gp21, asm: "NOR"}, // ^(arg0|arg1)
{name: "NOR", argLength: 2, reg: gp21, asm: "NOR", commutative: true}, // ^(arg0|arg1)
{name: "XOR", argLength: 2, reg: gp21, asm: "XOR", typ: "Int64", commutative: true}, // arg0^arg1
{name: "EQV", argLength: 2, reg: gp21, asm: "EQV", typ: "Int64", commutative: true}, // arg0^^arg1
{name: "NEG", argLength: 1, reg: gp11, asm: "NEG"}, // -arg0 (integer)
@@ -244,48 +244,12 @@
(Neq16 (Const16 <t> [c]) (Add16 (Const16 <t> [d]) x)) -> (Neq16 (Const16 <t> [int64(int16(c-d))]) x)
(Neq8 (Const8 <t> [c]) (Add8 (Const8 <t> [d]) x)) -> (Neq8 (Const8 <t> [int64(int8(c-d))]) x)
// canonicalize: swap arguments for commutative operations when one argument is a constant.
(Eq64 x (Const64 <t> [c])) && x.Op != OpConst64 -> (Eq64 (Const64 <t> [c]) x)
(Eq32 x (Const32 <t> [c])) && x.Op != OpConst32 -> (Eq32 (Const32 <t> [c]) x)
(Eq16 x (Const16 <t> [c])) && x.Op != OpConst16 -> (Eq16 (Const16 <t> [c]) x)
(Eq8 x (Const8 <t> [c])) && x.Op != OpConst8 -> (Eq8 (Const8 <t> [c]) x)
(Neq64 x (Const64 <t> [c])) && x.Op != OpConst64 -> (Neq64 (Const64 <t> [c]) x)
(Neq32 x (Const32 <t> [c])) && x.Op != OpConst32 -> (Neq32 (Const32 <t> [c]) x)
(Neq16 x (Const16 <t> [c])) && x.Op != OpConst16 -> (Neq16 (Const16 <t> [c]) x)
(Neq8 x (Const8 <t> [c])) && x.Op != OpConst8 -> (Neq8 (Const8 <t> [c]) x)
// AddPtr is not canonicalized because nilcheck ptr checks the first argument to be non-nil.
(Add64 x (Const64 <t> [c])) && x.Op != OpConst64 -> (Add64 (Const64 <t> [c]) x)
(Add32 x (Const32 <t> [c])) && x.Op != OpConst32 -> (Add32 (Const32 <t> [c]) x)
(Add16 x (Const16 <t> [c])) && x.Op != OpConst16 -> (Add16 (Const16 <t> [c]) x)
(Add8 x (Const8 <t> [c])) && x.Op != OpConst8 -> (Add8 (Const8 <t> [c]) x)
(Mul64 x (Const64 <t> [c])) && x.Op != OpConst64 -> (Mul64 (Const64 <t> [c]) x)
(Mul32 x (Const32 <t> [c])) && x.Op != OpConst32 -> (Mul32 (Const32 <t> [c]) x)
(Mul16 x (Const16 <t> [c])) && x.Op != OpConst16 -> (Mul16 (Const16 <t> [c]) x)
(Mul8 x (Const8 <t> [c])) && x.Op != OpConst8 -> (Mul8 (Const8 <t> [c]) x)
// Canonicalize x-const to x+(-const)
(Sub64 x (Const64 <t> [c])) && x.Op != OpConst64 -> (Add64 (Const64 <t> [-c]) x)
(Sub32 x (Const32 <t> [c])) && x.Op != OpConst32 -> (Add32 (Const32 <t> [int64(int32(-c))]) x)
(Sub16 x (Const16 <t> [c])) && x.Op != OpConst16 -> (Add16 (Const16 <t> [int64(int16(-c))]) x)
(Sub8 x (Const8 <t> [c])) && x.Op != OpConst8 -> (Add8 (Const8 <t> [int64(int8(-c))]) x)
(And64 x (Const64 <t> [c])) && x.Op != OpConst64 -> (And64 (Const64 <t> [c]) x)
(And32 x (Const32 <t> [c])) && x.Op != OpConst32 -> (And32 (Const32 <t> [c]) x)
(And16 x (Const16 <t> [c])) && x.Op != OpConst16 -> (And16 (Const16 <t> [c]) x)
(And8 x (Const8 <t> [c])) && x.Op != OpConst8 -> (And8 (Const8 <t> [c]) x)
(Or64 x (Const64 <t> [c])) && x.Op != OpConst64 -> (Or64 (Const64 <t> [c]) x)
(Or32 x (Const32 <t> [c])) && x.Op != OpConst32 -> (Or32 (Const32 <t> [c]) x)
(Or16 x (Const16 <t> [c])) && x.Op != OpConst16 -> (Or16 (Const16 <t> [c]) x)
(Or8 x (Const8 <t> [c])) && x.Op != OpConst8 -> (Or8 (Const8 <t> [c]) x)
(Xor64 x (Const64 <t> [c])) && x.Op != OpConst64 -> (Xor64 (Const64 <t> [c]) x)
(Xor32 x (Const32 <t> [c])) && x.Op != OpConst32 -> (Xor32 (Const32 <t> [c]) x)
(Xor16 x (Const16 <t> [c])) && x.Op != OpConst16 -> (Xor16 (Const16 <t> [c]) x)
(Xor8 x (Const8 <t> [c])) && x.Op != OpConst8 -> (Xor8 (Const8 <t> [c]) x)
// fold negation into comparison operators
(Not (Eq64 x y)) -> (Neq64 x y)
(Not (Eq32 x y)) -> (Neq32 x y)
@@ -635,50 +599,14 @@
(And32 x (And32 x y)) -> (And32 x y)
(And16 x (And16 x y)) -> (And16 x y)
(And8 x (And8 x y)) -> (And8 x y)
(And64 x (And64 y x)) -> (And64 x y)
(And32 x (And32 y x)) -> (And32 x y)
(And16 x (And16 y x)) -> (And16 x y)
(And8 x (And8 y x)) -> (And8 x y)
(And64 (And64 x y) x) -> (And64 x y)
(And32 (And32 x y) x) -> (And32 x y)
(And16 (And16 x y) x) -> (And16 x y)
(And8 (And8 x y) x) -> (And8 x y)
(And64 (And64 x y) y) -> (And64 x y)
(And32 (And32 x y) y) -> (And32 x y)
(And16 (And16 x y) y) -> (And16 x y)
(And8 (And8 x y) y) -> (And8 x y)
(Or64 x (Or64 x y)) -> (Or64 x y)
(Or32 x (Or32 x y)) -> (Or32 x y)
(Or16 x (Or16 x y)) -> (Or16 x y)
(Or8 x (Or8 x y)) -> (Or8 x y)
(Or64 x (Or64 y x)) -> (Or64 x y)
(Or32 x (Or32 y x)) -> (Or32 x y)
(Or16 x (Or16 y x)) -> (Or16 x y)
(Or8 x (Or8 y x)) -> (Or8 x y)
(Or64 (Or64 x y) x) -> (Or64 x y)
(Or32 (Or32 x y) x) -> (Or32 x y)
(Or16 (Or16 x y) x) -> (Or16 x y)
(Or8 (Or8 x y) x) -> (Or8 x y)
(Or64 (Or64 x y) y) -> (Or64 x y)
(Or32 (Or32 x y) y) -> (Or32 x y)
(Or16 (Or16 x y) y) -> (Or16 x y)
(Or8 (Or8 x y) y) -> (Or8 x y)
(Xor64 x (Xor64 x y)) -> y
(Xor32 x (Xor32 x y)) -> y
(Xor16 x (Xor16 x y)) -> y
(Xor8 x (Xor8 x y)) -> y
(Xor64 x (Xor64 y x)) -> y
(Xor32 x (Xor32 y x)) -> y
(Xor16 x (Xor16 y x)) -> y
(Xor8 x (Xor8 y x)) -> y
(Xor64 (Xor64 x y) x) -> y
(Xor32 (Xor32 x y) x) -> y
(Xor16 (Xor16 x y) x) -> y
(Xor8 (Xor8 x y) x) -> y
(Xor64 (Xor64 x y) y) -> x
(Xor32 (Xor32 x y) y) -> x
(Xor16 (Xor16 x y) y) -> x
(Xor8 (Xor8 x y) y) -> x
(Trunc64to8 (And64 (Const64 [y]) x)) && y&0xFF == 0xFF -> (Trunc64to8 x)
(Trunc64to16 (And64 (Const64 [y]) x)) && y&0xFFFF == 0xFFFF -> (Trunc64to16 x)
@@ -732,9 +660,7 @@
// user nil checks
(NeqPtr p (ConstNil)) -> (IsNonNil p)
(NeqPtr (ConstNil) p) -> (IsNonNil p)
(EqPtr p (ConstNil)) -> (Not (IsNonNil p))
(EqPtr (ConstNil) p) -> (Not (IsNonNil p))
(IsNonNil (ConstNil)) -> (ConstBool [0])
// slice and interface comparisons
@@ -912,7 +838,6 @@
// Get rid of Convert ops for pointer arithmetic on unsafe.Pointer.
(Convert (Add64 (Convert ptr mem) off) mem) -> (Add64 ptr off)
(Convert (Add64 off (Convert ptr mem)) mem) -> (Add64 ptr off)
(Convert (Convert ptr mem) mem) -> ptr
// Decompose compound argument values
@@ -1238,35 +1163,11 @@
// Reassociate expressions involving
// constants such that constants come first,
// exposing obvious constant-folding opportunities.
// First, re-write (op x (op y z)) to (op (op y z) x) if
// the op is commutative, to reduce the number of subsequent
// matching rules for folding. Then, reassociate
// (op (op y C) x) to (op C (op x y)) or similar, where C
// Reassociate (op (op y C) x) to (op C (op x y)) or similar, where C
// is constant, which pushes constants to the outside
// of the expression. At that point, any constant-folding
// opportunities should be obvious.
(Add64 x l:(Add64 _ _)) && (x.Op != OpAdd64 && x.Op != OpConst64) -> (Add64 l x)
(Add32 x l:(Add32 _ _)) && (x.Op != OpAdd32 && x.Op != OpConst32) -> (Add32 l x)
(Add16 x l:(Add16 _ _)) && (x.Op != OpAdd16 && x.Op != OpConst16) -> (Add16 l x)
(Add8 x l:(Add8 _ _)) && (x.Op != OpAdd8 && x.Op != OpConst8) -> (Add8 l x)
(And64 x l:(And64 _ _)) && (x.Op != OpAnd64 && x.Op != OpConst64) -> (And64 l x)
(And32 x l:(And32 _ _)) && (x.Op != OpAnd32 && x.Op != OpConst32) -> (And32 l x)
(And16 x l:(And16 _ _)) && (x.Op != OpAnd16 && x.Op != OpConst16) -> (And16 l x)
(And8 x l:(And8 _ _)) && (x.Op != OpAnd8 && x.Op != OpConst8) -> (And8 l x)
(Or64 x l:(Or64 _ _)) && (x.Op != OpOr64 && x.Op != OpConst64) -> (Or64 l x)
(Or32 x l:(Or32 _ _)) && (x.Op != OpOr32 && x.Op != OpConst32) -> (Or32 l x)
(Or16 x l:(Or16 _ _)) && (x.Op != OpOr16 && x.Op != OpConst16) -> (Or16 l x)
(Or8 x l:(Or8 _ _)) && (x.Op != OpOr8 && x.Op != OpConst8) -> (Or8 l x)
(Xor64 x l:(Xor64 _ _)) && (x.Op != OpXor64 && x.Op != OpConst64) -> (Xor64 l x)
(Xor32 x l:(Xor32 _ _)) && (x.Op != OpXor32 && x.Op != OpConst32) -> (Xor32 l x)
(Xor16 x l:(Xor16 _ _)) && (x.Op != OpXor16 && x.Op != OpConst16) -> (Xor16 l x)
(Xor8 x l:(Xor8 _ _)) && (x.Op != OpXor8 && x.Op != OpConst8) -> (Xor8 l x)
(Mul64 x l:(Mul64 _ _)) && (x.Op != OpMul64 && x.Op != OpConst64) -> (Mul64 l x)
(Mul32 x l:(Mul32 _ _)) && (x.Op != OpMul32 && x.Op != OpConst32) -> (Mul32 l x)
(Mul16 x l:(Mul16 _ _)) && (x.Op != OpMul16 && x.Op != OpConst16) -> (Mul16 l x)
(Mul8 x l:(Mul8 _ _)) && (x.Op != OpMul8 && x.Op != OpConst8) -> (Mul8 l x)
// x + (C + z) -> C + (x + z)
(Add64 (Add64 i:(Const64 <t>) z) x) && (z.Op != OpConst64 && x.Op != OpConst64) -> (Add64 i (Add64 <t> z x))
(Add32 (Add32 i:(Const32 <t>) z) x) && (z.Op != OpConst32 && x.Op != OpConst32) -> (Add32 i (Add32 <t> z x))
@@ -1379,19 +1280,13 @@
// floating point optimizations
(Add32F x (Const32F [0])) -> x
(Add32F (Const32F [0]) x) -> x
(Add64F x (Const64F [0])) -> x
(Add64F (Const64F [0]) x) -> x
(Sub32F x (Const32F [0])) -> x
(Sub64F x (Const64F [0])) -> x
(Mul32F x (Const32F [f2i(1)])) -> x
(Mul32F (Const32F [f2i(1)]) x) -> x
(Mul64F x (Const64F [f2i(1)])) -> x
(Mul64F (Const64F [f2i(1)]) x) -> x
(Mul32F x (Const32F [f2i(-1)])) -> (Neg32F x)
(Mul32F (Const32F [f2i(-1)]) x) -> (Neg32F x)
(Mul64F x (Const64F [f2i(-1)])) -> (Neg64F x)
(Mul64F (Const64F [f2i(-1)]) x) -> (Neg64F x)
(Mul32F x (Const32F [f2i(2)])) -> (Add32F x x)
(Mul64F x (Const64F [f2i(2)])) -> (Add64F x x)
(Div32F x (Const32F <t> [c])) && reciprocalExact32(float32(i2f(c))) -> (Mul32F x (Const32F <t> [f2i(1/i2f(c))]))
@@ -28,8 +28,8 @@ var genericOps = []opData{
{name: "Add32", argLength: 2, commutative: true},
{name: "Add64", argLength: 2, commutative: true},
{name: "AddPtr", argLength: 2}, // For address calculations. arg0 is a pointer and arg1 is an int.
{name: "Add32F", argLength: 2},
{name: "Add64F", argLength: 2},
{name: "Add32F", argLength: 2, commutative: true},
{name: "Add64F", argLength: 2, commutative: true},
{name: "Sub8", argLength: 2}, // arg0 - arg1
{name: "Sub16", argLength: 2},
@@ -43,24 +43,25 @@ var genericOps = []opData{
{name: "Mul16", argLength: 2, commutative: true},
{name: "Mul32", argLength: 2, commutative: true},
{name: "Mul64", argLength: 2, commutative: true},
{name: "Mul32F", argLength: 2},
{name: "Mul64F", argLength: 2},
{name: "Mul32F", argLength: 2, commutative: true},
{name: "Mul64F", argLength: 2, commutative: true},
{name: "Div32F", argLength: 2}, // arg0 / arg1
{name: "Div64F", argLength: 2},
{name: "Hmul32", argLength: 2},
{name: "Hmul32u", argLength: 2},
{name: "Hmul64", argLength: 2},
{name: "Hmul64u", argLength: 2},
{name: "Hmul32", argLength: 2, commutative: true},
{name: "Hmul32u", argLength: 2, commutative: true},
{name: "Hmul64", argLength: 2, commutative: true},
{name: "Hmul64u", argLength: 2, commutative: true},
{name: "Mul32uhilo", argLength: 2, typ: "(UInt32,UInt32)"}, // arg0 * arg1, returns (hi, lo)
{name: "Mul64uhilo", argLength: 2, typ: "(UInt64,UInt64)"}, // arg0 * arg1, returns (hi, lo)
{name: "Mul32uhilo", argLength: 2, typ: "(UInt32,UInt32)", commutative: true}, // arg0 * arg1, returns (hi, lo)
{name: "Mul64uhilo", argLength: 2, typ: "(UInt64,UInt64)", commutative: true}, // arg0 * arg1, returns (hi, lo)
// Weird special instructions for use in the strength reduction of divides.
// These ops compute unsigned (arg0 + arg1) / 2, correct to all
// 32/64 bits, even when the intermediate result of the add has 33/65 bits.
// These ops can assume arg0 >= arg1.
// Note: these ops aren't commutative!
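// For example, Avg32u(0xFFFFFFFF, 1) = 0x80000000, even though
// the sum 0xFFFFFFFF + 1 needs 33 bits.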
{name: "Avg32u", argLength: 2, typ: "UInt32"}, // 32-bit platforms only
{name: "Avg64u", argLength: 2, typ: "UInt64"}, // 64-bit platforms only
@@ -159,8 +160,8 @@ var genericOps = []opData{
{name: "EqPtr", argLength: 2, commutative: true, typ: "Bool"},
{name: "EqInter", argLength: 2, typ: "Bool"}, // arg0 or arg1 is nil; other cases handled by frontend
{name: "EqSlice", argLength: 2, typ: "Bool"}, // arg0 or arg1 is nil; other cases handled by frontend
{name: "Eq32F", argLength: 2, typ: "Bool"},
{name: "Eq64F", argLength: 2, typ: "Bool"},
{name: "Eq32F", argLength: 2, commutative: true, typ: "Bool"},
{name: "Eq64F", argLength: 2, commutative: true, typ: "Bool"},
{name: "Neq8", argLength: 2, commutative: true, typ: "Bool"}, // arg0 != arg1
{name: "Neq16", argLength: 2, commutative: true, typ: "Bool"},
@@ -169,8 +170,8 @@ var genericOps = []opData{
{name: "NeqPtr", argLength: 2, commutative: true, typ: "Bool"},
{name: "NeqInter", argLength: 2, typ: "Bool"}, // arg0 or arg1 is nil; other cases handled by frontend
{name: "NeqSlice", argLength: 2, typ: "Bool"}, // arg0 or arg1 is nil; other cases handled by frontend
{name: "Neq32F", argLength: 2, typ: "Bool"},
{name: "Neq64F", argLength: 2},
{name: "Neq32F", argLength: 2, commutative: true, typ: "Bool"},
{name: "Neq64F", argLength: 2, commutative: true, typ: "Bool"},
{name: "Less8", argLength: 2, typ: "Bool"}, // arg0 < arg1, signed
{name: "Less8U", argLength: 2, typ: "Bool"}, // arg0 < arg1, unsigned
@@ -30,7 +30,7 @@ import (
// sexpr [&& extra conditions] -> [@block] sexpr
//
// sexpr are s-expressions (lisp-like parenthesized groupings)
// sexpr ::= (opcode sexpr*)
// sexpr ::= [variable:](opcode sexpr*)
// | variable
// | <type>
// | [auxint]
@@ -39,7 +39,7 @@ import (
// aux ::= variable | {code}
// type ::= variable | {code}
// variable ::= some token
// opcode ::= one of the opcodes from ../op.go (without the Op prefix)
// opcode ::= one of the opcodes from the *Ops.go files
// extra conditions is just a chunk of Go that evaluates to a boolean. It may use
// variables declared in the matching sexpr. The variable "v" is predefined to be
@@ -119,15 +119,17 @@ func genRules(arch arch) {
}
loc := fmt.Sprintf("%s.rules:%d", arch.name, ruleLineno)
r := Rule{rule: rule, loc: loc}
if rawop := strings.Split(rule, " ")[0][1:]; isBlock(rawop, arch) {
blockrules[rawop] = append(blockrules[rawop], r)
} else {
// Do fancier value op matching.
match, _, _ := r.parse()
op, oparch, _, _, _, _ := parseValue(match, arch, loc)
opname := fmt.Sprintf("Op%s%s", oparch, op.name)
oprules[opname] = append(oprules[opname], r)
for _, crule := range commute(rule, arch) {
r := Rule{rule: crule, loc: loc}
if rawop := strings.Split(crule, " ")[0][1:]; isBlock(rawop, arch) {
blockrules[rawop] = append(blockrules[rawop], r)
} else {
// Do fancier value op matching.
match, _, _ := r.parse()
op, oparch, _, _, _, _ := parseValue(match, arch, loc)
opname := fmt.Sprintf("Op%s%s", oparch, op.name)
oprules[opname] = append(oprules[opname], r)
}
}
rule = ""
ruleLineno = 0
@@ -754,3 +756,169 @@ func isVariable(s string) bool {
}
return b
}
// commute returns all equivalent rules to r after applying all possible
// argument swaps to the commutable ops in r.
// Potentially exponential (a rule with n commutative ops can expand
// to up to 2^n variants), be careful.
func commute(r string, arch arch) []string {
match, cond, result := Rule{rule: r}.parse()
a := commute1(match, varCount(match), arch)
for i, m := range a {
if cond != "" {
m += " && " + cond
}
m += " -> " + result
a[i] = m
}
if len(a) == 1 && normalizeWhitespace(r) != normalizeWhitespace(a[0]) {
fmt.Println(normalizeWhitespace(r))
fmt.Println(normalizeWhitespace(a[0]))
panic("commute() is not the identity for noncommuting rule")
}
if false && len(a) > 1 {
fmt.Println(r)
for _, x := range a {
fmt.Println(" " + x)
}
}
return a
}
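
// Illustration (hypothetical call, not part of this change): given an
// ARM arch value, a match with one commutative op expands to both
// argument orders:
//
//   commute("(ADD x (MOVWconst [c])) -> (ADDconst [c] x)", arch)
//
// returns
//
//   (ADD x (MOVWconst [c])) -> (ADDconst [c] x)
//   (ADD (MOVWconst [c]) x) -> (ADDconst [c] x)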
func commute1(m string, cnt map[string]int, arch arch) []string {
if m[0] == '<' || m[0] == '[' || m[0] == '{' || isVariable(m) {
return []string{m}
}
// Split up input.
var prefix string
colon := strings.Index(m, ":")
if colon >= 0 && isVariable(m[:colon]) {
prefix = m[:colon+1]
m = m[colon+1:]
}
if m[0] != '(' || m[len(m)-1] != ')' {
panic("non-compound expr in commute1: " + m)
}
s := split(m[1 : len(m)-1])
op := s[0]
// Figure out if the op is commutative or not.
commutative := false
for _, x := range genericOps {
if op == x.name {
if x.commutative {
commutative = true
}
break
}
}
if arch.name != "generic" {
for _, x := range arch.ops {
if op == x.name {
if x.commutative {
commutative = true
}
break
}
}
}
var idx0, idx1 int
if commutative {
// Find indexes of two args we can swap.
for i, arg := range s {
if i == 0 || arg[0] == '<' || arg[0] == '[' || arg[0] == '{' {
continue
}
if idx0 == 0 {
idx0 = i
continue
}
if idx1 == 0 {
idx1 = i
break
}
}
if idx1 == 0 {
panic("couldn't find first two args of commutative op " + s[0])
}
if cnt[s[idx0]] == 1 && cnt[s[idx1]] == 1 || s[idx0] == s[idx1] && cnt[s[idx0]] == 2 {
// When we have (Add x y) with no other uses of x and y in the matching rule,
// then we can skip the commutative match (Add y x).
commutative = false
}
}
// Recursively commute arguments.
a := make([][]string, len(s))
for i, arg := range s {
a[i] = commute1(arg, cnt, arch)
}
// Choose all possibilities from all args.
r := crossProduct(a)
// If commutative, do that again with its two args reversed.
if commutative {
a[idx0], a[idx1] = a[idx1], a[idx0]
r = append(r, crossProduct(a)...)
}
// Construct result.
for i, x := range r {
r[i] = prefix + "(" + x + ")"
}
return r
}
// varCount returns a map which counts the number of occurrences of
// Value variables in m.
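// For example, varCount("(ADD x (ADD x y))") returns
// map[string]int{"x": 2, "y": 1}.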
func varCount(m string) map[string]int {
cnt := map[string]int{}
varCount1(m, cnt)
return cnt
}
func varCount1(m string, cnt map[string]int) {
if m[0] == '<' || m[0] == '[' || m[0] == '{' {
return
}
if isVariable(m) {
cnt[m]++
return
}
// Split up input.
colon := strings.Index(m, ":")
if colon >= 0 && isVariable(m[:colon]) {
cnt[m[:colon]]++
m = m[colon+1:]
}
if m[0] != '(' || m[len(m)-1] != ')' {
panic("non-compound expr in commute1: " + m)
}
s := split(m[1 : len(m)-1])
for _, arg := range s[1:] {
varCount1(arg, cnt)
}
}
// crossProduct returns all possible values
// x[0][i] + " " + x[1][j] + " " + ... + " " + x[len(x)-1][k]
// for all valid values of i, j, ..., k.
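// For example, crossProduct([][]string{{"a", "b"}, {"c"}})
// returns []string{"a c", "b c"}.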
func crossProduct(x [][]string) []string {
if len(x) == 1 {
return x[0]
}
var r []string
for _, tail := range crossProduct(x[1:]) {
for _, first := range x[0] {
r = append(r, first+" "+tail)
}
}
return r
}
// normalizeWhitespace replaces 2+ whitespace sequences with a single space.
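// It also tightens spaces inside parens: normalizeWhitespace("(ADD  x   y )")
// returns "(ADD x y)".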
func normalizeWhitespace(x string) string {
x = strings.Join(strings.Fields(x), " ")
x = strings.Replace(x, "( ", "(", -1)
x = strings.Replace(x, " )", ")", -1)
return x
}
@@ -36,7 +36,7 @@ func rewriteValuedec(v *Value) bool {
return false
}
func rewriteValuedec_OpComplexImag(v *Value) bool {
// match: (ComplexImag (ComplexMake _ imag ))
// match: (ComplexImag (ComplexMake _ imag))
// cond:
// result: imag
for {
@@ -53,7 +53,7 @@ func rewriteValuedec_OpComplexImag(v *Value) bool {
return false
}
func rewriteValuedec_OpComplexReal(v *Value) bool {
// match: (ComplexReal (ComplexMake real _ ))
// match: (ComplexReal (ComplexMake real _))
// cond:
// result: real
for {
@@ -276,7 +276,7 @@ func rewriteValuedec_OpSliceLen(v *Value) bool {
return false
}
func rewriteValuedec_OpSlicePtr(v *Value) bool {
// match: (SlicePtr (SliceMake ptr _ _ ))
// match: (SlicePtr (SliceMake ptr _ _))
// cond:
// result: ptr
for {