x86_64: Replace rej_uniform intrinsics with assembly#1014
Conversation
There was a problem hiding this comment.
Arm Cortex-A76 (Raspberry Pi 5) benchmarks (opt)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
113118 cycles |
113013 cycles |
1.00 |
ML-DSA-44 sign |
355649 cycles |
355605 cycles |
1.00 |
ML-DSA-44 verify |
117801 cycles |
117682 cycles |
1.00 |
ML-DSA-65 keypair |
196381 cycles |
196214 cycles |
1.00 |
ML-DSA-65 sign |
589557 cycles |
588943 cycles |
1.00 |
ML-DSA-65 verify |
194604 cycles |
194375 cycles |
1.00 |
ML-DSA-87 keypair |
322210 cycles |
322148 cycles |
1.00 |
ML-DSA-87 sign |
752493 cycles |
752763 cycles |
1.00 |
ML-DSA-87 verify |
320055 cycles |
319900 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Arm Cortex-A76 (Raspberry Pi 5) benchmarks (no-opt)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
212361 cycles |
212622 cycles |
1.00 |
ML-DSA-44 sign |
760716 cycles |
760066 cycles |
1.00 |
ML-DSA-44 verify |
228743 cycles |
228987 cycles |
1.00 |
ML-DSA-65 keypair |
379384 cycles |
379665 cycles |
1.00 |
ML-DSA-65 sign |
1250617 cycles |
1249827 cycles |
1.00 |
ML-DSA-65 verify |
371531 cycles |
372045 cycles |
1.00 |
ML-DSA-87 keypair |
604335 cycles |
605426 cycles |
1.00 |
ML-DSA-87 sign |
1593243 cycles |
1591413 cycles |
1.00 |
ML-DSA-87 verify |
618270 cycles |
617375 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
AMD EPYC 3rd gen (c6a)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
66830 cycles |
68874 cycles |
0.97 |
ML-DSA-44 sign |
184077 cycles |
187594 cycles |
0.98 |
ML-DSA-44 verify |
65562 cycles |
68993 cycles |
0.95 |
ML-DSA-65 keypair |
111959 cycles |
119089 cycles |
0.94 |
ML-DSA-65 sign |
292002 cycles |
299488 cycles |
0.98 |
ML-DSA-65 verify |
108472 cycles |
115385 cycles |
0.94 |
ML-DSA-87 keypair |
185520 cycles |
203754 cycles |
0.91 |
ML-DSA-87 sign |
379630 cycles |
396462 cycles |
0.96 |
ML-DSA-87 verify |
177291 cycles |
196231 cycles |
0.90 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Graviton4
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
68316 cycles |
68121 cycles |
1.00 |
ML-DSA-44 sign |
202487 cycles |
202429 cycles |
1.00 |
ML-DSA-44 verify |
70722 cycles |
70691 cycles |
1.00 |
ML-DSA-65 keypair |
121061 cycles |
121050 cycles |
1.00 |
ML-DSA-65 sign |
331574 cycles |
332242 cycles |
1.00 |
ML-DSA-65 verify |
117810 cycles |
118169 cycles |
1.00 |
ML-DSA-87 keypair |
198140 cycles |
198283 cycles |
1.00 |
ML-DSA-87 sign |
427941 cycles |
428124 cycles |
1.00 |
ML-DSA-87 verify |
194637 cycles |
194645 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
AMD EPYC 3rd gen (c6a) (no-opt)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
134578 cycles |
135123 cycles |
1.00 |
ML-DSA-44 sign |
523923 cycles |
523989 cycles |
1.00 |
ML-DSA-44 verify |
147640 cycles |
147421 cycles |
1.00 |
ML-DSA-65 keypair |
228634 cycles |
227032 cycles |
1.01 |
ML-DSA-65 sign |
864042 cycles |
860343 cycles |
1.00 |
ML-DSA-65 verify |
236700 cycles |
234883 cycles |
1.01 |
ML-DSA-87 keypair |
371955 cycles |
371568 cycles |
1.00 |
ML-DSA-87 sign |
1080535 cycles |
1079389 cycles |
1.00 |
ML-DSA-87 verify |
383811 cycles |
383403 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Intel Xeon 3rd gen (c6i)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
56863 cycles |
56287 cycles |
1.01 |
ML-DSA-44 sign |
181063 cycles |
181562 cycles |
1.00 |
ML-DSA-44 verify |
61140 cycles |
61061 cycles |
1.00 |
ML-DSA-65 keypair |
98291 cycles |
98770 cycles |
1.00 |
ML-DSA-65 sign |
298368 cycles |
299116 cycles |
1.00 |
ML-DSA-65 verify |
100343 cycles |
100251 cycles |
1.00 |
ML-DSA-87 keypair |
152430 cycles |
153265 cycles |
0.99 |
ML-DSA-87 sign |
354719 cycles |
355417 cycles |
1.00 |
ML-DSA-87 verify |
153124 cycles |
153884 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Graviton4 (no-opt)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
128315 cycles |
128272 cycles |
1.00 |
ML-DSA-44 sign |
447513 cycles |
447600 cycles |
1.00 |
ML-DSA-44 verify |
138123 cycles |
144678 cycles |
0.95 |
ML-DSA-65 keypair |
220541 cycles |
220481 cycles |
1.00 |
ML-DSA-65 sign |
726484 cycles |
726951 cycles |
1.00 |
ML-DSA-65 verify |
222926 cycles |
223461 cycles |
1.00 |
ML-DSA-87 keypair |
366142 cycles |
366604 cycles |
1.00 |
ML-DSA-87 sign |
927541 cycles |
927414 cycles |
1.00 |
ML-DSA-87 verify |
374016 cycles |
373875 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Graviton3
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
72353 cycles |
72235 cycles |
1.00 |
ML-DSA-44 sign |
212424 cycles |
212375 cycles |
1.00 |
ML-DSA-44 verify |
75754 cycles |
75714 cycles |
1.00 |
ML-DSA-65 keypair |
127646 cycles |
127612 cycles |
1.00 |
ML-DSA-65 sign |
351030 cycles |
350845 cycles |
1.00 |
ML-DSA-65 verify |
125627 cycles |
125755 cycles |
1.00 |
ML-DSA-87 keypair |
205980 cycles |
208476 cycles |
0.99 |
ML-DSA-87 sign |
444778 cycles |
450018 cycles |
0.99 |
ML-DSA-87 verify |
205601 cycles |
205843 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Intel Xeon 3rd gen (c6i) (no-opt)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
157499 cycles |
157541 cycles |
1.00 |
ML-DSA-44 sign |
549244 cycles |
549413 cycles |
1.00 |
ML-DSA-44 verify |
169448 cycles |
168865 cycles |
1.00 |
ML-DSA-65 keypair |
268437 cycles |
268818 cycles |
1.00 |
ML-DSA-65 sign |
903422 cycles |
903672 cycles |
1.00 |
ML-DSA-65 verify |
275283 cycles |
274680 cycles |
1.00 |
ML-DSA-87 keypair |
448241 cycles |
448464 cycles |
1.00 |
ML-DSA-87 sign |
1158654 cycles |
1157970 cycles |
1.00 |
ML-DSA-87 verify |
458704 cycles |
458043 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
AMD EPYC 4th gen (c7a)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
42142 cycles |
40662 cycles |
1.04 |
ML-DSA-44 sign |
134317 cycles |
132808 cycles |
1.01 |
ML-DSA-44 verify |
44844 cycles |
43607 cycles |
1.03 |
ML-DSA-65 keypair |
72940 cycles |
71859 cycles |
1.02 |
ML-DSA-65 sign |
213861 cycles |
213367 cycles |
1.00 |
ML-DSA-65 verify |
73729 cycles |
72847 cycles |
1.01 |
ML-DSA-87 keypair |
107003 cycles |
109237 cycles |
0.98 |
ML-DSA-87 sign |
250851 cycles |
254550 cycles |
0.99 |
ML-DSA-87 verify |
107681 cycles |
109371 cycles |
0.98 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
AMD EPYC 4th gen (c7a) (no-opt)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
120754 cycles |
120325 cycles |
1.00 |
ML-DSA-44 sign |
447570 cycles |
447576 cycles |
1.00 |
ML-DSA-44 verify |
130511 cycles |
130561 cycles |
1.00 |
ML-DSA-65 keypair |
205040 cycles |
205018 cycles |
1.00 |
ML-DSA-65 sign |
728790 cycles |
729474 cycles |
1.00 |
ML-DSA-65 verify |
210029 cycles |
209605 cycles |
1.00 |
ML-DSA-87 keypair |
337610 cycles |
336678 cycles |
1.00 |
ML-DSA-87 sign |
925517 cycles |
924223 cycles |
1.00 |
ML-DSA-87 verify |
347563 cycles |
347399 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Graviton3 (no-opt)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
138744 cycles |
138561 cycles |
1.00 |
ML-DSA-44 sign |
483982 cycles |
484140 cycles |
1.00 |
ML-DSA-44 verify |
148574 cycles |
162388 cycles |
0.91 |
ML-DSA-65 keypair |
241921 cycles |
241950 cycles |
1.00 |
ML-DSA-65 sign |
792702 cycles |
792591 cycles |
1.00 |
ML-DSA-65 verify |
240763 cycles |
241288 cycles |
1.00 |
ML-DSA-87 keypair |
396106 cycles |
397138 cycles |
1.00 |
ML-DSA-87 sign |
1013453 cycles |
1013569 cycles |
1.00 |
ML-DSA-87 verify |
403446 cycles |
403178 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Graviton2
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
113189 cycles |
113255 cycles |
1.00 |
ML-DSA-44 sign |
355791 cycles |
356042 cycles |
1.00 |
ML-DSA-44 verify |
117978 cycles |
117969 cycles |
1.00 |
ML-DSA-65 keypair |
196342 cycles |
196623 cycles |
1.00 |
ML-DSA-65 sign |
589183 cycles |
589242 cycles |
1.00 |
ML-DSA-65 verify |
194553 cycles |
194559 cycles |
1.00 |
ML-DSA-87 keypair |
322537 cycles |
322281 cycles |
1.00 |
ML-DSA-87 sign |
753613 cycles |
753546 cycles |
1.00 |
ML-DSA-87 verify |
320115 cycles |
320070 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Graviton2 (no-opt)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
213219 cycles |
212521 cycles |
1.00 |
ML-DSA-44 sign |
761553 cycles |
760970 cycles |
1.00 |
ML-DSA-44 verify |
241351 cycles |
234237 cycles |
1.03 |
ML-DSA-65 keypair |
380573 cycles |
379762 cycles |
1.00 |
ML-DSA-65 sign |
1252452 cycles |
1252199 cycles |
1.00 |
ML-DSA-65 verify |
372839 cycles |
371797 cycles |
1.00 |
ML-DSA-87 keypair |
607341 cycles |
604584 cycles |
1.00 |
ML-DSA-87 sign |
1596680 cycles |
1595561 cycles |
1.00 |
ML-DSA-87 verify |
619175 cycles |
618927 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'Graviton2 (no-opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 verify |
241351 cycles |
234237 cycles |
1.03 |
This comment was automatically generated by workflow using github-action-benchmark.
CBMC Results (ML-DSA-44)Full Results (194 proofs)
|
CBMC Results (ML-DSA-87)Full Results (194 proofs)
|
CBMC Results (ML-DSA-65)Full Results (194 proofs)
|
There was a problem hiding this comment.
Intel Xeon 4th gen (c7i)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
34764 cycles |
34374 cycles |
1.01 |
ML-DSA-44 sign |
120113 cycles |
120132 cycles |
1.00 |
ML-DSA-44 verify |
38092 cycles |
38166 cycles |
1.00 |
ML-DSA-65 keypair |
61138 cycles |
60500 cycles |
1.01 |
ML-DSA-65 sign |
201844 cycles |
199945 cycles |
1.01 |
ML-DSA-65 verify |
62783 cycles |
62429 cycles |
1.01 |
ML-DSA-87 keypair |
93501 cycles |
94486 cycles |
0.99 |
ML-DSA-87 sign |
236815 cycles |
239500 cycles |
0.99 |
ML-DSA-87 verify |
95619 cycles |
96894 cycles |
0.99 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Intel Xeon 4th gen (c7i) (no-opt)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
93930 cycles |
93842 cycles |
1.00 |
ML-DSA-44 sign |
333310 cycles |
333119 cycles |
1.00 |
ML-DSA-44 verify |
100022 cycles |
100025 cycles |
1.00 |
ML-DSA-65 keypair |
159902 cycles |
160115 cycles |
1.00 |
ML-DSA-65 sign |
543114 cycles |
543227 cycles |
1.00 |
ML-DSA-65 verify |
160989 cycles |
161060 cycles |
1.00 |
ML-DSA-87 keypair |
266666 cycles |
266874 cycles |
1.00 |
ML-DSA-87 sign |
704974 cycles |
706010 cycles |
1.00 |
ML-DSA-87 verify |
270510 cycles |
269779 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'AMD EPYC 4th gen (c7a)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
42142 cycles |
40662 cycles |
1.04 |
This comment was automatically generated by workflow using github-action-benchmark.
7951c08 to
5607508
Compare
CBMC Results (ML-DSA-65, REDUCE-RAM)Full Results (194 proofs)
|
CBMC Results (ML-DSA-87, REDUCE-RAM)Full Results (194 proofs)
|
CBMC Results (ML-DSA-44, REDUCE-RAM)Full Results (194 proofs)
|
55f0028 to
991e2e9
Compare
|
@jakemas, thanks for getting this into shape! |
|
@mkannwischer ok, ready for review. The instruction PR will need to land in s2n-bignum first, while we wait I'll try the constant time proof -- but I'm really happy to be able to PR the conversion with a hol-light proof. Let me know if anything is missing. |
|
ahh just saw you comment! Yes got the proof in, runs in ~12min! |
There was a problem hiding this comment.
I left this name generic, as it could be shared between arm/x86 proof.
Thanks! Yes, I agree that it's great to get the conversion and the proof in at the same time. Would be great if we can do the same for the remaining proofs. Note that a constant-time proof is not needed for this function (all inputs are public), but we do want a memory safety proof like here: https://github.com/pq-code-package/mlkem-native/blob/2bf8e59f4330697b3924c572924136c96eb96960/proofs/hol_light/x86_64/proofs/rej_uniform_avx2_asm.ml#L1562 |
mkannwischer
left a comment
There was a problem hiding this comment.
Thanks @jakemas. Here is a first set of comments.
| __contract__( | ||
| requires(memory_no_alias(r, sizeof(int32_t) * MLDSA_N)) | ||
| requires(memory_no_alias(buf, MLD_AVX2_REJ_UNIFORM_BUFLEN)) | ||
| requires(memory_no_alias(table, 256 * sizeof(uint64_t))) |
There was a problem hiding this comment.
the table actually needs to be == mld_rej_uniform_table
There was a problem hiding this comment.
Fixed — requires(table == (const uint8_t *)mld_rej_uniform_table) in both dev/ and mldsa/src/native/ copies.
| jmp rej_uniform_avx2_asm_scalar | ||
|
|
||
| rej_uniform_avx2_asm_done: | ||
| vzeroupper |
There was a problem hiding this comment.
We don't have vzeroupper in any other routine. I don't know enough about x86_64 to know how important it is, but we should either have it everywhere or nowhere.
There was a problem hiding this comment.
Fixed — dropped vzeroupper from both dev/ and mldsa/src/native/ copies of the .S for consistency with the other ML-DSA routines. Proof adjusted accordingly.
| version = "f3c5acff6948d559194245237f6aaa7ebf7fcae8"; | ||
| # Pinned to https://github.com/awslabs/s2n-bignum/pull/387 head, | ||
| # which adds VMOVMSKPS, VPMOVZXBD, and VZEROUPPER instruction models | ||
| # required by the x86_64 rej_uniform proof. | ||
| version = "4c4fe1dfc8b79720013517a7b4dec9014c85fcf2"; | ||
| src = fetchFromGitHub { | ||
| owner = "awslabs"; | ||
| repo = "s2n-bignum"; | ||
| rev = "${version}"; | ||
| hash = "sha256-kfc8X2e+voefttshSUdifDc3Qn+dx0Gq5ENNLhWIdw0="; | ||
| hash = "sha256-64MJOqoDunpn6fx1j9P4+fDoRNZ8GRTB/d4C2JWvxFA="; |
There was a problem hiding this comment.
Leaving a comment here to remind us that we still have to change this.
There was a problem hiding this comment.
This file should be autogenerated via autogen
There was a problem hiding this comment.
Fixed — added rej_uniform_avx2_asm.S to the x86_64 joblist in scripts/autogen (joblist_x86_64), so proofs/hol_light/x86_64/mldsa/rej_uniform_avx2_asm.S is now regenerated by scripts/autogen.
| (* Lookup table for ML-DSA rejection uniform sampling. *) | ||
| (* Each entry is 8 bytes: permutation indices for VPERMD. *) | ||
|
|
||
| let mldsa_rej_uniform_table = (REWRITE_RULE[MAP] o define) |
There was a problem hiding this comment.
This file should be autogenerated via autogen
There was a problem hiding this comment.
Fixed — added gen_avx2_hol_light_rej_uniform_table to scripts/autogen, invoked from gen_zeta_tables. proofs/hol_light/x86_64/proofs/mldsa_rej_uniform_table.ml is now regenerated alongside the C/aarch64 lookup tables (mirrors the mlkem-native pattern).
| let REJ_SAMPLE = define | ||
| `REJ_SAMPLE l = FILTER (\x:int32. val x < 8380417) | ||
| (MAP (\x:24 word. word(val x MOD 2 EXP 23):int32) l)`;; |
There was a problem hiding this comment.
The spec should be moved to mldsa_specs.ml so we can re-use it for the aarch64 proof.
There was a problem hiding this comment.
Fixed — REJ_SAMPLE, REJ_SAMPLE_EMPTY, REJ_SAMPLE_APPEND now live in proofs/hol_light/common/mldsa_specs.ml (shared between arches, matching the shape used by s2n-bignum #378 for aarch64). The x86-only derived lemmas (REJ_SAMPLE_SPLIT, REJ_SAMPLE_PREFIX_256, REJ_SAMPLE_STEP_LE) stay in rej_uniform_avx2_asm.ml since they're only used by the AVX2 scalar-tail analysis.
| (let outlist = SUB_LIST(0,256) (REJ_SAMPLE inlist) in | ||
| let outlen = LENGTH outlist in | ||
| C_RETURN s = word outlen /\ | ||
| read(memory :> bytes(res,4 * outlen)) s = |
There was a problem hiding this comment.
Please add an explicit bound post-condition that all coefficients < q here to match the CBMC spec.
There was a problem hiding this comment.
think i got this, testing now added:
(!i. i < outlen
==> val(read(memory :> bytes32
(word_add res (word(4 * i)))) s) < 8380417)))
| const uint8_t *table) | ||
| __contract__( | ||
| requires(memory_no_alias(r, sizeof(int32_t) * MLDSA_N)) | ||
| requires(memory_no_alias(buf, MLD_AVX2_REJ_UNIFORM_BUFLEN)) |
There was a problem hiding this comment.
Let's change this to say 840 so it matches the HOL-light spec exactly.
There was a problem hiding this comment.
Fixed — requires(memory_no_alias(buf, 840)) (literal 840 matches the HOL-Light spec exactly), in both dev/ and mldsa/src/native/ copies.
| unsigned mld_rej_uniform_avx2_asm( | ||
| int32_t *r, const uint8_t buf[MLD_AVX2_REJ_UNIFORM_BUFLEN], | ||
| const uint8_t *table) | ||
| __contract__( |
There was a problem hiding this comment.
Add the comment here that this needs to be kept in sync with the HOL-light spec.
There was a problem hiding this comment.
Fixed — added /* This contract must be kept in sync with the HOL-Light specification in proofs/hol_light/x86_64/proofs/rej_uniform_avx2_asm.ml */ above the __contract__ block in both dev/ and mldsa/src/native/ copies.
| MAYCHANGE [memory :> bytes(res,1024)])`, | ||
| X86_PROMOTE_RETURN_NOSTACK_TAC mldsa_rej_uniform_tmc | ||
| MLDSA_REJ_UNIFORM_CORRECT);; | ||
|
|
There was a problem hiding this comment.
Add a comment here that this needs to be kept in sync wityh the CBMC spec.
There was a problem hiding this comment.
Fixed — added a comment block above SUBROUTINE_CORRECT variants noting these specifications must be kept in sync with the CBMC contract in dev/x86_64/src/arith_native_x86_64.h / mldsa/src/native/x86_64/src/arith_native_x86_64.h.
Reviewer-requested cleanup for the x86_64 rej_uniform assembly and
HOL Light proof:
Contract tightening (dev and mldsa copies of arith_native_x86_64.h):
- requires(memory_no_alias(buf, 840)) instead of
memory_no_alias(buf, MLD_AVX2_REJ_UNIFORM_BUFLEN) so the literal
matches the HOL Light spec exactly.
- requires(table == (const uint8_t *)mld_rej_uniform_table) pinning
the table to the exported rejection-sampling table, replacing the
looser memory_no_alias(table, 256 * sizeof(uint64_t)).
- Clarify sync comment.
vzeroupper removal: none of the other asm routines issue vzeroupper;
drop it from rej_uniform for consistency. This shifts the function
length by 3 bytes, so the HOL Light proof's nonoverlapping 246 / pc+245
references in mldsa_rej_uniform.ml become 243 / pc+242 accordingly, and
the two X86_STEPS_TAC invocations that stepped the vzeroupper byte are
removed. Bytecode regenerated via autogen --update-hol-light-bytecode.
Autogen plumbing: register rej_uniform_avx2_asm.S in the x86_64 HOL
Light asm joblist so the proofs/hol_light/x86_64/mldsa/ copy is
regenerated by scripts/autogen. Add gen_avx2_hol_light_rej_uniform_table
to regenerate proofs/hol_light/x86_64/proofs/mldsa_rej_uniform_table.ml
alongside the C/aarch64 lookup tables (matches mlkem-native's pattern).
Cross-reference comment in proofs/hol_light/x86_64/proofs/
rej_uniform_avx2_asm.ml pointing at the CBMC contract.
Proof runtime: ~5-6 min in the CI native build.
Signed-off-by: Jake Massimo <jakemas@amazon.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the x86_64 AVX2 rej_uniform implementation (previously written
in C with intrinsics) with a hand-written assembly routine, and adds a
functional correctness proof in HOL Light on top of the s2n-bignum
infrastructure.
Highlights:
- dev/x86_64/src/rej_uniform_avx2_asm.S and
mldsa/src/native/x86_64/src/rej_uniform_avx2_asm.S: new .S file
exposing mld_rej_uniform_avx2_asm (replaces the intrinsics-based
rej_uniform_avx2.c).
- proofs/hol_light/x86_64/mldsa/rej_uniform_avx2_asm.S and
proofs/hol_light/x86_64/proofs/rej_uniform_avx2_asm.ml: HOL Light
proof of MLDSA_REJ_UNIFORM_{,NOIBT_}SUBROUTINE_CORRECT, with no
remaining CHEATs.
- proofs/cbmc/rej_uniform_native_x86_64/: CBMC contract proof (249/249
passing).
- CI: hol_light.yml and Makefile updated for the new bytecode dump and
autogen instruction-decode format; s2n-bignum pin bumped to include
the supporting tactics.
Naming follows the asm-suffix convention introduced on main
(eada109 / e810d00): symbol mld_rej_uniform_avx2_asm, label prefix
rej_uniform_avx2_asm_.
Signed-off-by: Jake Massimo <jakemas@amazon.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reviewer-requested cleanup for the x86_64 rej_uniform assembly and
HOL Light proof:
Contract tightening (dev and mldsa copies of arith_native_x86_64.h):
- requires(memory_no_alias(buf, 840)) instead of
memory_no_alias(buf, MLD_AVX2_REJ_UNIFORM_BUFLEN) so the literal
matches the HOL Light spec exactly.
- requires(table == (const uint8_t *)mld_rej_uniform_table) pinning
the table to the exported rejection-sampling table, replacing the
looser memory_no_alias(table, 256 * sizeof(uint64_t)).
- Clarify sync comment.
vzeroupper removal: none of the other asm routines issue vzeroupper;
drop it from rej_uniform for consistency. This shifts the function
length by 3 bytes, so the HOL Light proof's nonoverlapping 246 / pc+245
references in mldsa_rej_uniform.ml become 243 / pc+242 accordingly, and
the two X86_STEPS_TAC invocations that stepped the vzeroupper byte are
removed. Bytecode regenerated via autogen --update-hol-light-bytecode.
Autogen plumbing: register rej_uniform_avx2_asm.S in the x86_64 HOL
Light asm joblist so the proofs/hol_light/x86_64/mldsa/ copy is
regenerated by scripts/autogen. Add gen_avx2_hol_light_rej_uniform_table
to regenerate proofs/hol_light/x86_64/proofs/mldsa_rej_uniform_table.ml
alongside the C/aarch64 lookup tables (matches mlkem-native's pattern).
Cross-reference comment in proofs/hol_light/x86_64/proofs/
rej_uniform_avx2_asm.ml pointing at the CBMC contract.
Proof runtime: ~5-6 min in the CI native build.
Signed-off-by: Jake Massimo <jakemas@amazon.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
31b4b4c to
967fa0b
Compare
Reviewer-requested cleanup for the x86_64 rej_uniform assembly and
HOL Light proof:
Contract tightening (dev and mldsa copies of arith_native_x86_64.h):
- requires(memory_no_alias(buf, 840)) instead of
memory_no_alias(buf, MLD_AVX2_REJ_UNIFORM_BUFLEN) so the literal
matches the HOL Light spec exactly.
- requires(table == (const uint8_t *)mld_rej_uniform_table) pinning
the table to the exported rejection-sampling table, replacing the
looser memory_no_alias(table, 256 * sizeof(uint64_t)).
- Clarify sync comment.
vzeroupper removal: none of the other asm routines issue vzeroupper;
drop it from rej_uniform for consistency. This shifts the function
length by 3 bytes, so the HOL Light proof's nonoverlapping 246 / pc+245
references in mldsa_rej_uniform.ml become 243 / pc+242 accordingly, and
the two X86_STEPS_TAC invocations that stepped the vzeroupper byte are
removed. Bytecode regenerated via autogen --update-hol-light-bytecode.
Autogen plumbing: register rej_uniform_avx2_asm.S in the x86_64 HOL
Light asm joblist so the proofs/hol_light/x86_64/mldsa/ copy is
regenerated by scripts/autogen. Add gen_avx2_hol_light_rej_uniform_table
to regenerate proofs/hol_light/x86_64/proofs/mldsa_rej_uniform_table.ml
alongside the C/aarch64 lookup tables (matches mlkem-native's pattern).
Cross-reference comment in proofs/hol_light/x86_64/proofs/
rej_uniform_avx2_asm.ml pointing at the CBMC contract.
Proof runtime: ~5-6 min in the CI native build.
Signed-off-by: Jake Massimo <jakemas@amazon.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
967fa0b to
b1b11e9
Compare
- proofs/hol_light/README.md: add rej_uniform_avx2_asm.S to the x86_64 arithmetic proofs section. - proofs/hol_light/common/mldsa_specs.ml: add REJ_SAMPLE, REJ_SAMPLE_EMPTY, REJ_SAMPLE_APPEND. These match what's used in s2n-bignum #378 (aarch64) so the aarch64 rej_uniform proof can share the shape. - proofs/hol_light/x86_64/proofs/rej_uniform_avx2_asm.ml: needs the new mldsa_specs dependency; drop the duplicate REJ_SAMPLE definition. The x86-only REJ_SAMPLE_SPLIT / REJ_SAMPLE_PREFIX_256 / REJ_SAMPLE_STEP_LE (scalar-tail analysis helpers) stay here. - .github/workflows/hol_light.yml: add mldsa_specs.ml to the rej_uniform_avx2_asm needs list. Proof still passes in ~8 min native build. Signed-off-by: Jake Massimo <jakemas@amazon.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Prove that every element of SUB_LIST (0,256) (REJ_SAMPLE inlist) has val c < 8380417 directly from the FILTER definition. Provides the coefficient bound property requested in the review; callers can specialize to per-index via EL / MEM_EL. Kept as a standalone lemma rather than adding a per-index postcondition to MLDSA_REJ_UNIFORM_CORRECT to avoid touching the inner Hoare triple. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Jake Massimo <jakemas@amazon.com>
Strengthen the postcondition of MLDSA_REJ_UNIFORM_CORRECT and
MLDSA_REJ_UNIFORM_(NOIBT_)SUBROUTINE_CORRECT to include the
per-coefficient bound
!i. i < outlen ==>
val(read(memory :> bytes32 (word_add res (word(4 * i)))) s) < 8380417
matching the CBMC contract
ensures(array_bound(buf, 0, len, 0, 8380417))
in arith_native_x86_64.h.
Uses the same layering pattern as poly_use_hint_32_aarch64_asm
(ENSURES_STRENGTHEN_POST): introduces ENSURES_STRENGTHEN_POST_X86,
a memory->list-element bridge VAL_READ_BYTES32_FROM_WORDLIST, and
the combinatorial lemma REJ_SAMPLE_COEFF_BOUND, then derives
MLDSA_REJ_UNIFORM_CORRECT_BOUND by showing the old
num_of_wordlist-based postcondition implies the new per-index bound.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Jake Massimo <jakemas@amazon.com>
2d393ab to
b10421a
Compare
- Pin s2n-bignum to awslabs/s2n-bignum@ccef2456 (upstream main with USHLL/MOVI/VPCMPGTD instruction models merged) - Add HOL Light eta rejection table generation to autogen, matching the pattern from the x86 rej_uniform table in PR #1014 Signed-off-by: Jake Massimo <jakemas@amazon.com> Signed-off-by: Ubuntu <ubuntu@ip-172-31-31-118.us-west-2.compute.internal>
- Pin s2n-bignum to awslabs/s2n-bignum@ccef2456 (upstream main with USHLL/MOVI/VPCMPGTD instruction models merged) - Add HOL Light eta rejection table generation to autogen, matching the pattern from the x86 rej_uniform table in PR #1014 Signed-off-by: Jake Massimo <jakemas@amazon.com> Signed-off-by: Ubuntu <ubuntu@ip-172-31-31-118.us-west-2.compute.internal>
- Pin s2n-bignum to awslabs/s2n-bignum@ccef2456 (upstream main with USHLL/MOVI/VPCMPGTD instruction models merged) - Add HOL Light eta rejection table generation to autogen, matching the pattern from the x86 rej_uniform table in PR #1014 Signed-off-by: Jake Massimo <jakemas@amazon.com> Signed-off-by: Ubuntu <ubuntu@ip-172-31-31-118.us-west-2.compute.internal>
Summary
Resolves #926 and #418 (?)
Hol-light proof needs instructions from awslabs/s2n-bignum#387
rej_uniformwith hand-written x86_64 assembly.rodatasection), enabling future HOL-Light formal verification#defines with#undefcleanup for SCU builds (following mlkem-native pattern)poly_uniformto component benchmarkML-DSA's 23-bit coefficients require 32-bit lanes, which naturally fills a 256-bit YMM register for 8 elements per iteration. This led to the choice of AVX2 over SSE — with SSE's 128-bit registers and 32-bit lanes, we'd only get 4 coefficients per iteration vs 8 with AVX2.
Performance
AMD EPYC 3rd gen (c6a) — opt
Proof
Includes HOL-Light and CBMC proofs, written by claude opus 4.7.
HOL-Light / x86_64 HOL Light proof for mldsa_rej_uniform.S (pull_request) Successful in 12m
No constant time/SAFE proof yet. Will continue to work on it as the instruction PR lands in s2n-bignum.