    crypto/rc4: optimize generic implementation slightly · 41402b59
    Michael Munday authored
    The compiler currently can't figure out that it can eliminate both c.s
    loads (using store-to-load forwarding) in the second line of the
    following code:
    
    	...
    	c.s[i], c.s[j] = c.s[j], c.s[i]
    	x := c.s[j] + c.s[i]
    	...
    
    The compiler eliminates the second load of c.s[j] (using the original
    value of c.s[i]). However, the load of c.s[i] remains because the
    compiler doesn't know that c.s[i] and c.s[j] either overlap completely
    or not at all.
    
    Introducing temporaries to make this explicit slightly improves the
    performance of the generic code. The goal is to remove the assembly in
    this package in the future. This change also hoists a bounds check out
    of the main loop, which gives a slight performance boost and makes the
    behaviour identical to the assembly implementation when len(dst) <
    len(src).
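
    As a rough illustration of the two ideas described above, the sketch
    below keeps the swapped values in temporaries (x and y) so that nothing
    has to be reloaded from the state array, and performs a single bounds
    check on dst before the loop. The cipher type, newCipher and
    xorKeyStream are hypothetical names chosen for this example; it is not
    the package's actual code, and newCipher assumes a non-empty key.

```go
package main

// cipher is a hypothetical stand-in for the RC4 state (an assumption for
// this sketch, not the package's real type).
type cipher struct {
	s    [256]uint32
	i, j uint8
}

// newCipher runs the standard RC4 key schedule. It assumes len(key) > 0.
func newCipher(key []byte) *cipher {
	c := &cipher{}
	for i := range c.s {
		c.s[i] = uint32(i)
	}
	var j uint8
	for i := 0; i < 256; i++ {
		j += uint8(c.s[i]) + key[i%len(key)]
		c.s[i], c.s[j] = c.s[j], c.s[i]
	}
	return c
}

// xorKeyStream XORs src with the key stream and writes the result to dst.
func (c *cipher) xorKeyStream(dst, src []byte) {
	if len(src) == 0 {
		return
	}
	i, j := c.i, c.j
	// One bounds check up front: panics before any state is modified
	// if dst is shorter than src.
	_ = dst[len(src)-1]
	dst = dst[:len(src)] // lets the compiler drop the per-iteration check
	for k, v := range src {
		i++
		x := c.s[i] // value at i before the swap
		j += uint8(x)
		y := c.s[j] // value at j before the swap
		// Swap using the temporaries; if i == j then x == y and the
		// swap is a no-op, so the aliasing case needs no special handling.
		c.s[i], c.s[j] = y, x
		// x + y equals the sum of the two swapped entries, so neither
		// c.s[i] nor c.s[j] is loaded again here.
		dst[k] = v ^ uint8(c.s[uint8(x+y)])
	}
	c.i, c.j = i, j
}
```

    With the key "Key" and the input "Plaintext", this sketch should
    reproduce the widely published RC4 test vector bbf316e8d940af0ad3.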
    
    name       old speed     new speed     delta
    RC4_128-4  491MB/s ± 3%  596MB/s ± 5%  +21.51%  (p=0.000 n=9+9)
    RC4_1K-4   504MB/s ± 2%  616MB/s ± 1%  +22.33%  (p=0.000 n=10+10)
    RC4_8K-4   509MB/s ± 1%  630MB/s ± 2%  +23.85%  (p=0.000 n=8+9)
    
    Change-Id: I27adc775713b2e74a1a94e0c1de0909fb4379463
    Reviewed-on: https://go-review.googlesource.com/102335
    Run-TryBot: Michael Munday <mike.munday@ibm.com>
    TryBot-Result: Gobot Gobot <gobot@golang.org>
    Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
rc4.go 1.81 KB