Comment 5 for bug 2030784

Adrien Nader (adrien) wrote:

Thanks a lot for the tests, that's very appreciated.

I ran that on my laptop (11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz) which quite surprisingly has all these CPU features. Mostly idle, dynamic CPU governor but no thermal throttling at all (and if there were, it would probably slow down the AVX-512 code anyway), and tests are long enough for CPU governors to not matter much.

============================================================

* AES-128-GCM | AES-256-GCM
 - Baseline - Requires VAES and VPCLMULQDQ features present on ICX or newer platform. This should be the most performant flow.
AES-128-GCM 855360.29k 3158479.88k 6093932.91k 8905067.37k 13336828.91k 13788498.58k

 - Individual VAES Disabled and VPCLMULQDQ Disabled should fall back to AVX AESNI flow and should have equivalent performance
AES-128-GCM 785422.85k 1936140.78k 4404423.77k 6481577.18k 7732716.48k 7873213.39k
AES-128-GCM 790775.41k 1942054.64k 4404868.20k 6484287.87k 7711803.10k 7778795.52k

 - AESNI and VAESNI Disabled should fall back to 'C code' performance
AES-128-GCM 150183.11k 167807.25k 598198.71k 662922.19k 681574.40k 678182.91k

* RSA 2K/3K/4K Sign Performance
 - Baseline - Requires AVX512F, AVX512VL, AVX512DQ, and AVX512IFMA features on ICX or newer platform. This should be the most performant flow.
rsa 2048 bits 0.000246s 0.000015s 4057.2 65278.3
rsa 3072 bits 0.000701s 0.000032s 1426.4 31247.7
rsa 4096 bits 0.001434s 0.000055s 697.4 18052.7

 - Individual AVX512F, AVX512VL, and AVX512IFMA features should yield equivalent performance. This flow will use the ADOX/ADCX/MULX RSA flow.
rsa 2048 bits 0.000523s 0.000015s 1910.4 65748.2
rsa 3072 bits 0.001579s 0.000032s 633.3 31158.1
rsa 4096 bits 0.003529s 0.000055s 283.4 18093.6

rsa 2048 bits 0.000524s 0.000015s 1909.0 66310.8
rsa 3072 bits 0.001577s 0.000032s 634.1 31309.7
rsa 4096 bits 0.003568s 0.000055s 280.2 18120.4

rsa 2048 bits 0.000523s 0.000015s 1913.3 65234.3
rsa 3072 bits 0.001583s 0.000032s 631.7 31094.6
rsa 4096 bits 0.003607s 0.000055s 277.3 18076.8

rsa 2048 bits 0.000524s 0.000015s 1907.6 66299.6
rsa 3072 bits 0.001577s 0.000032s 634.1 31214.4
rsa 4096 bits 0.003586s 0.000055s 278.9 18096.1

============================================================

We see the expected behavior (AFAIU, all features must be available at the same time for the changes to have effect).

I'm not comparing everything number by number because I don't think we're looking for specific percentages of improvements.

Overall we see up to a ~2.4x performance improvement, and we always see large improvements (double-digit percentages).
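For reference, the ratios behind these figures can be recomputed from the numbers above. This is only a minimal sketch: it assumes the usual `openssl speed` column layout (third RSA column is signs per second, last AES-GCM column is throughput for the largest block size), and the `speedup` helper and variable names are made up for illustration.

```python
# RSA sign/s: (all AVX-512 features enabled, first run with one feature disabled)
# Values copied from the mantic results above.
rsa_sign_per_s = {
    2048: (4057.2, 1910.4),
    3072: (1426.4, 633.3),
    4096: (697.4, 283.4),
}

# AES-128-GCM, last column: (VAES/VPCLMULQDQ flow, first AVX AESNI fallback run)
aes_gcm_last_col = (13788498.58, 7873213.39)


def speedup(fast, slow):
    """Return how many times faster `fast` is than `slow`."""
    return fast / slow


rsa_speedups = {bits: speedup(*pair) for bits, pair in rsa_sign_per_s.items()}
aes_speedup = speedup(*aes_gcm_last_col)

for bits, ratio in rsa_speedups.items():
    print(f"rsa {bits} bits sign: {ratio:.2f}x")
print(f"AES-128-GCM (largest block): {aes_speedup:.2f}x")
```

The largest ratio comes from RSA 4096 signing; the AES-GCM gain at the largest block size is smaller but still well above the fallback flow.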

As a control I also ran that on lunar, therefore without the patches (I acknowledge this is not the same openssl version and there are also other changes but I do not think this matters here).

============================================================

* AES-128-GCM | AES-256-GCM
 - Baseline - Requires VAES and VPCLMULQDQ features present on ICX or newer platform. This should be the most performant flow.
AES-128-GCM 782474.44k 1938211.66k 4430867.84k 6402298.54k 7685819.33k 7840186.37k

 - Individual VAES Disabled and VPCLMULQDQ Disabled should fall back to AVX AESNI flow and should have equivalent performance
AES-128-GCM 750028.44k 1926234.78k 4365867.67k 6383893.16k 7742842.78k 7843146.41k
AES-128-GCM 786910.34k 1934779.33k 4421411.45k 6389114.88k 7650086.87k 7797479.86k

 - AESNI and VAESNI Disabled should fall back to 'C code' performance
AES-128-GCM 147889.72k 167843.85k 599710.04k 663642.45k 679072.96k 680631.91k

* RSA 2K/3K/4K Sign Performance
 - Baseline - Requires AVX512F, AVX512VL, AVX512DQ, and AVX512IFMA features on ICX or newer platform. This should be the most performant flow.
rsa 2048 bits 0.000247s 0.000015s 4050.8 66072.6
rsa 3072 bits 0.001596s 0.000032s 626.5 31144.2
rsa 4096 bits 0.003534s 0.000056s 282.9 18003.6

 - Individual AVX512F, AVX512VL, and AVX512IFMA features should yield equivalent performance. This flow will use the ADOX/ADCX/MULX RSA flow.
rsa 2048 bits 0.000528s 0.000015s 1892.3 66008.3
rsa 3072 bits 0.001573s 0.000032s 635.6 31094.2
rsa 4096 bits 0.003534s 0.000055s 282.9 18073.8

rsa 2048 bits 0.000522s 0.000015s 1914.7 65763.4
rsa 3072 bits 0.001575s 0.000032s 635.0 31237.8
rsa 4096 bits 0.003530s 0.000055s 283.2 18093.1

rsa 2048 bits 0.000522s 0.000015s 1917.4 65826.2
rsa 3072 bits 0.001575s 0.000032s 635.0 31177.2
rsa 4096 bits 0.003549s 0.000055s 281.8 18109.9

rsa 2048 bits 0.000522s 0.000015s 1915.1 65760.4
rsa 3072 bits 0.001575s 0.000032s 635.0 31180.2
rsa 4096 bits 0.003538s 0.000055s 282.6 18109.9

============================================================

We can see there is no change with the CPU feature flags, except for the test that disables AESNI, in which case the performance is the same in lunar and mantic. That the CPU feature flags don't change the performance except in the one aforementioned case indicates that these patches are responsible for the large performance increase we have seen. We can also see that they don't otherwise degrade performance on this machine.
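The AES-GCM part of this comparison can be checked with a few lines. This is a rough sketch covering only the last AES-128-GCM column (assumed to be the largest block size) from both runs; the `rel_spread` helper and the variable names are illustrative, not part of any tool.

```python
def rel_spread(values):
    """Relative gap between the largest and smallest value in `values`."""
    return (max(values) - min(values)) / min(values)


# lunar, last column: baseline and the two "feature disabled" runs
lunar_aes = [7840186.37, 7843146.41, 7797479.86]

# 'C code' fallback (AESNI disabled), last column: mantic vs lunar
c_code = [678182.91, 680631.91]

# mantic AVX AESNI fallback (first run) vs lunar baseline, last column
fallback_vs_lunar = [7873213.39, 7840186.37]

lunar_spread = rel_spread(lunar_aes)
c_code_gap = rel_spread(c_code)
fallback_gap = rel_spread(fallback_vs_lunar)

print(f"lunar spread across feature flags: {lunar_spread:.2%}")
print(f"C code, mantic vs lunar:           {c_code_gap:.2%}")
print(f"mantic fallback vs lunar baseline: {fallback_gap:.2%}")
```

All three gaps stay below one percent, which is well within run-to-run noise for this kind of test.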