    cmd/compile: implement OnesCount{8,16,32,64} intrinsics on s390x · 6f9b94ab
    Michael Munday authored
    This CL implements the math/bits.OnesCount{8,16,32,64} functions
    as intrinsics on s390x using the 'population count' (popcnt)
    instruction. This instruction was introduced with the
    'population-count' facility, which shares facility bit 45 with the
    'distinct-operands' facility, a prerequisite for Go on s390x. We
    can therefore use it without a feature check.
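
For readers unfamiliar with facility bits, the following is a minimal sketch of what a feature check would look like if one were needed. It assumes the facility list is held as a slice of 64-bit doublewords with bit 0 as the most significant bit of the first doubleword; `hasFacility` and the sample list are hypothetical names used for illustration and are not part of this CL.

```go
package main

import "fmt"

// hasFacility reports whether facility bit n is set in a facility list.
// Assumption: bits are numbered from the most significant bit of the
// first doubleword, so bit n lives in list[n/64] at position 63-n%64.
func hasFacility(list []uint64, n uint) bool {
	word := n / 64
	if int(word) >= len(list) {
		return false
	}
	return list[word]&(1<<(63-n%64)) != 0
}

func main() {
	// Hypothetical facility list in which only bit 45 is set.
	list := []uint64{1 << (63 - 45)}
	fmt.Println(hasFacility(list, 45))
	fmt.Println(hasFacility(list, 44))
}
```

Because bit 45 is already required for Go on s390x, the compiler can emit popcnt unconditionally and skip a check of this kind.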
    
    The s390x popcnt instruction treats a 64-bit register as a vector
    of 8 bytes, summing the number of ones in each byte individually.
    It then writes the results to the corresponding bytes in the
    output register. Therefore to implement OnesCount{16,32,64} we
    need to sum the individual byte counts using some extra
    instructions. To do this efficiently I've added some additional
    pseudo operations to the s390x SSA backend.
    
    Unlike on other architectures, the new instruction sequence is
    faster for OnesCount8, so it is also implemented as an intrinsic.
    
    name         old time/op  new time/op  delta
    OnesCount    3.21ns ± 1%  1.35ns ± 0%  -58.00%  (p=0.000 n=20+20)
    OnesCount8   0.91ns ± 1%  0.81ns ± 0%  -11.43%  (p=0.000 n=20+20)
    OnesCount16  1.51ns ± 3%  1.21ns ± 0%  -19.71%  (p=0.000 n=20+17)
    OnesCount32  1.91ns ± 0%  1.12ns ± 1%  -41.60%  (p=0.000 n=19+20)
    OnesCount64  3.18ns ± 4%  1.35ns ± 0%  -57.52%  (p=0.000 n=20+20)
    
    Change-Id: Id54f0bd28b6db9a887ad12c0d72fcc168ef9c4e0
    Reviewed-on: https://go-review.googlesource.com/114675
    Run-TryBot: Michael Munday <mike.munday@ibm.com>
    TryBot-Result: Gobot Gobot <gobot@golang.org>
    Reviewed-by: Cherry Zhang <cherryyz@google.com>
a.out.go 11.9 KB