bg-nsieve-vm3jit-2026-05-19
BG nsieve, vm3 JIT closure via OpListSetI64 admission, 2026-05-19
Apple M4, darwin/arm64. Bench commands:
go test ./runtime/jit/vm3jit -run='^$' \
-bench='BenchmarkCorpusJITRunner/nsieve_n' -benchtime=3s -count=10 -cpu=1
go test ./compiler3/corpus -run='^$' \
-bench='BenchmarkGoKernels/nsieve_n' -benchtime=3s -count=10 -cpu=1
| N | vm3 JIT ns/op (median of 10) | Go ns/op (median of 10) | vm3 JIT / Go | vm3 interp / Go | vm2 / Go (baseline) | reduction vs vm2 |
|---|---|---|---|---|---|---|
| 1000 | 5064 | 3499 | 1.45x | 75.4x | 126.05x | -98.8% |
| 10000 | 74769 | 40530 | 1.85x | 58.4x | 134.10x | -98.6% |
Under 2x at both sizes. The closure path is purely the JIT admission of OpListSetI64: one whitelist entry plus a 14-line ARM64 lowering that mirrors OpListGetI64 (no cap check, no len++, just a packed Int48 store at cells.ptr[idx]).
checkCellBankAdmissible whitelist (compile.go:230):
case vm3.OpListGetI64, vm3.OpListPushI64, vm3.OpListSetI64:
continue
Hot inner-mark loop encoding (3-word hottest form, when cells.ptr is pinned in x22):
MOVZ x16, #0xFFFA, LSL #48 ; tagInt48
BFI x16, xVal, #0, #48 ; pack 48-bit payload
STR x16, [x22, xIdx, LSL #3] ; cells[idx] = packed
The inner mark loop at PC 14..18 of nsieve.go thus lowers to:
14 CMP xJ, xN ; if j > n -> skip
B.GT skip
15 MOV x4, #1 ; val = 1
16 MOVZ x16, #0xFFFA, LSL #48 ; OpListSetI64 (3 words)
BFI x16, x4, #0, #48
STR x16, [x22, xJ, LSL #3]
17 ADD xJ, xJ, xI ; j += i
18 B pc14 ; back to inner_test
8 ARM64 instructions per mark iteration. Bit-identical to c2corpus.ExpectNsieve across N in 1000 (TestNsieveJITCompiles in runtime/jit/vm3jit/nsieve_jit_test.go is the gate; TestMathKernelsMatchVm2/nsieve in compiler3/corpus covers the interp path).
Why this is a generic compiler optimization (no super-op): OpListSetI64 is the dual of OpListGetI64, already admitted. The lowering uses the same tagInt48 mask + BFI packing that lists_fill_sum's push path uses, the same hoisted-cells.ptr register pinning that lists_fill_sum's get path uses. Nothing in the lowering knows about nsieve's algorithm; any cell-bank function with a single list and an xs[i] = v op in its hot loop benefits identically.
Residual gap to Go:
- Storage density: vm3 stores marks as 8-byte
Cell(NaN-boxed i64). Go uses[]boolat 1 byte. 8x cache footprint on the inner mark loop. Closing this fully requires the Phase 3.6 bytes bank (regs<U8> / OpBytesSetU8). - Fill loop overhead: nsieve pushes n+1 zeros via
OpListPushI64in a loop. Go usesmake([]bool, n+1)(single bulk allocation). Closing this needs a generic "push-N-zeros" detection in the JIT, or a newOpListResizeop.
Both are post-2x-gate improvements; the current closure is via JIT admission alone, with no algorithmic divergence from the vm3 interp.
Raw output:
goos: darwin
goarch: arm64
pkg: mochi/runtime/jit/vm3jit
cpu: Apple M4
BenchmarkCorpusJITRunner/nsieve_n1000 814204 5049 ns/op
BenchmarkCorpusJITRunner/nsieve_n1000 609294 5412 ns/op
BenchmarkCorpusJITRunner/nsieve_n1000 787413 5079 ns/op
BenchmarkCorpusJITRunner/nsieve_n1000 792477 4940 ns/op
BenchmarkCorpusJITRunner/nsieve_n1000 802581 5108 ns/op
BenchmarkCorpusJITRunner/nsieve_n1000 785716 5734 ns/op
BenchmarkCorpusJITRunner/nsieve_n1000 709171 5560 ns/op
BenchmarkCorpusJITRunner/nsieve_n1000 798172 4949 ns/op
BenchmarkCorpusJITRunner/nsieve_n1000 725142 4583 ns/op
BenchmarkCorpusJITRunner/nsieve_n1000 809112 4742 ns/op
BenchmarkCorpusJITRunner/nsieve_n10000 49568 66037 ns/op
BenchmarkCorpusJITRunner/nsieve_n10000 48631 75036 ns/op
BenchmarkCorpusJITRunner/nsieve_n10000 53604 66918 ns/op
BenchmarkCorpusJITRunner/nsieve_n10000 67396 71540 ns/op
BenchmarkCorpusJITRunner/nsieve_n10000 36874 130344 ns/op
BenchmarkCorpusJITRunner/nsieve_n10000 43196 76741 ns/op
BenchmarkCorpusJITRunner/nsieve_n10000 46508 65960 ns/op
BenchmarkCorpusJITRunner/nsieve_n10000 49557 107668 ns/op
BenchmarkCorpusJITRunner/nsieve_n10000 33358 106754 ns/op
BenchmarkCorpusJITRunner/nsieve_n10000 49192 74501 ns/op
pkg: mochi/compiler3/corpus
BenchmarkGoKernels/nsieve_n1000 1598046 2402 ns/op
BenchmarkGoKernels/nsieve_n1000 1578606 2242 ns/op
BenchmarkGoKernels/nsieve_n1000 1670983 2568 ns/op
BenchmarkGoKernels/nsieve_n1000 1000000 3213 ns/op
BenchmarkGoKernels/nsieve_n1000 965792 5187 ns/op
BenchmarkGoKernels/nsieve_n1000 1000000 5048 ns/op
BenchmarkGoKernels/nsieve_n1000 1000000 3541 ns/op
BenchmarkGoKernels/nsieve_n1000 1000000 5518 ns/op
BenchmarkGoKernels/nsieve_n1000 1000000 3596 ns/op
BenchmarkGoKernels/nsieve_n1000 1000000 3456 ns/op
BenchmarkGoKernels/nsieve_n10000 116004 42641 ns/op
BenchmarkGoKernels/nsieve_n10000 130692 36838 ns/op
BenchmarkGoKernels/nsieve_n10000 124808 59485 ns/op
BenchmarkGoKernels/nsieve_n10000 111558 40727 ns/op
BenchmarkGoKernels/nsieve_n10000 71994 54628 ns/op
BenchmarkGoKernels/nsieve_n10000 85353 37588 ns/op
BenchmarkGoKernels/nsieve_n10000 90094 37917 ns/op
BenchmarkGoKernels/nsieve_n10000 90289 39533 ns/op
BenchmarkGoKernels/nsieve_n10000 81726 44736 ns/op
BenchmarkGoKernels/nsieve_n10000 99751 40333 ns/op