    crypto/aes: implement AES-GCM AEAD for arm64 · 4f1f5033
    Vlad Krasnov authored
    Use the dedicated AES* and PMULL* instructions to accelerate AES-GCM
    
    name              old time/op    new time/op      delta
    AESGCMSeal1K-46     12.1µs ± 0%       0.9µs ± 0%    -92.66%  (p=0.000 n=9+10)
    AESGCMOpen1K-46     12.1µs ± 0%       0.9µs ± 0%    -92.43%  (p=0.000 n=10+10)
    AESGCMSign8K-46     58.6µs ± 0%       2.1µs ± 0%    -96.41%  (p=0.000 n=9+8)
    AESGCMSeal8K-46     92.8µs ± 0%       5.7µs ± 0%    -93.86%  (p=0.000 n=9+9)
    AESGCMOpen8K-46     92.9µs ± 0%       5.7µs ± 0%    -93.84%  (p=0.000 n=8+9)
    
    name              old speed      new speed        delta
    AESGCMSeal1K-46   84.7MB/s ± 0%  1153.4MB/s ± 0%  +1262.21%  (p=0.000 n=9+10)
    AESGCMOpen1K-46   84.4MB/s ± 0%  1115.2MB/s ± 0%  +1220.53%  (p=0.000 n=10+10)
    AESGCMSign8K-46    140MB/s ± 0%    3894MB/s ± 0%  +2687.50%  (p=0.000 n=9+10)
    AESGCMSeal8K-46   88.2MB/s ± 0%  1437.5MB/s ± 0%  +1529.30%  (p=0.000 n=9+9)
    AESGCMOpen8K-46   88.2MB/s ± 0%  1430.5MB/s ± 0%  +1522.01%  (p=0.000 n=8+9)
    
    This change mirrors the current amd64 implementation and provides optimal performance
    on a range of arm64 processors, including Centriq 2400 and Apple A12. By and large, it is
    implicitly tested by the robustness of the existing amd64 implementation.
    
    The implementation interleaves GHASH with CTR mode to achieve the highest possible
    throughput. It also aggregates GHASH with a factor of 8 to decrease the cost of the
    reduction step.
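    
    The aggregation relies on an algebraic identity that can be sketched in plain Go.
    This is only an illustration, not the assembly from this change: block, gfMul,
    ghashSerial and ghashAggregated are hypothetical names, and gfMul is a slow
    bit-by-bit GF(2^128) multiply that reduces on every call. The sketch only checks
    that processing 8 blocks at once against precomputed powers of H gives the same
    result as the block-by-block form; an optimized version would sum the unreduced
    products and reduce once per group.

```go
package main

import "fmt"

// block is one 128-bit GHASH element; hi holds the first 8 bytes (big-endian).
type block struct{ hi, lo uint64 }

func xor(a, b block) block { return block{a.hi ^ b.hi, a.lo ^ b.lo} }

// gfMul multiplies x and y in GF(2^128) using GCM's bit ordering
// (shift-and-add, reducing with R = 0xe1 || 0^120 at every step).
func gfMul(x, y block) block {
	var z block
	v := y
	for i := 0; i < 128; i++ {
		var bit uint64
		if i < 64 {
			bit = (x.hi >> (63 - uint(i))) & 1
		} else {
			bit = (x.lo >> (127 - uint(i))) & 1
		}
		if bit == 1 {
			z = xor(z, v)
		}
		lsb := v.lo & 1
		v.lo = v.lo>>1 | v.hi<<63
		v.hi >>= 1
		if lsb == 1 {
			v.hi ^= 0xe100000000000000
		}
	}
	return z
}

// ghashSerial is the block-by-block form: Y = (Y ^ X_i) * H.
func ghashSerial(h block, xs []block) block {
	var y block
	for _, x := range xs {
		y = gfMul(xor(y, x), h)
	}
	return y
}

const factor = 8 // aggregation factor named in the commit message

// ghashAggregated consumes 8 blocks per step:
// Y = (Y ^ X_1)*H^8 ^ X_2*H^7 ^ ... ^ X_8*H.
func ghashAggregated(h block, xs []block) block {
	var pow [factor]block // pow[k] = H^(k+1)
	pow[0] = h
	for k := 1; k < factor; k++ {
		pow[k] = gfMul(pow[k-1], h)
	}
	var y block
	for len(xs) >= factor {
		acc := gfMul(xor(y, xs[0]), pow[factor-1])
		for j := 1; j < factor; j++ {
			acc = xor(acc, gfMul(xs[j], pow[factor-1-j]))
		}
		y = acc
		xs = xs[factor:]
	}
	// Leftover blocks (fewer than 8) fall back to the serial form.
	for _, x := range xs {
		y = gfMul(xor(y, x), h)
	}
	return y
}

// testBlocks fills n blocks with deterministic pseudo-random data.
func testBlocks(n int) []block {
	xs := make([]block, n)
	s := uint64(0x9e3779b97f4a7c15)
	for i := range xs {
		s = s*6364136223846793005 + 1442695040888963407
		xs[i].hi = s
		s = s*6364136223846793005 + 1442695040888963407
		xs[i].lo = s
	}
	return xs
}

func main() {
	h := block{0x66e94bd4ef8a2c3b, 0x884cfa59ca342b2e} // arbitrary example hash key
	xs := testBlocks(19)                               // two full groups plus a tail
	fmt.Println(ghashSerial(h, xs) == ghashAggregated(h, xs))
}
```

    Both forms perform the same number of multiplications; the benefit of aggregation
    comes from deferring the reduction, which this sketch deliberately does not model.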
    
    Even though there is a significant amount of assembly, the code reuses the Go
    code for the amd64 implementation, so there is little additional Go code.
    
    Since AES-GCM is critical for the performance of all web servers, this change is
    required to level the playing field for arm64 CPUs, where amd64 currently enjoys an
    unfair advantage.
    
    Ideally both amd64 and arm64 codepaths could be replaced by hypothetical AES and
    CLMUL intrinsics, with a few additional vector instructions.
    
    Fixes #18498
    Fixes #19840
    
    Change-Id: Icc57b868cd1f67ac695c1ac163a8e215f74c7910
    Reviewed-on: https://go-review.googlesource.com/107298
    Run-TryBot: Vlad Krasnov <vlad@cloudflare.com>
    TryBot-Result: Gobot Gobot <gobot@golang.org>
    Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>