bg-switch-lookup-vm3jit-2026-05-19

Phase 6.4.b switch-lookup JIT bench (2026-05-19)

Apple M4, darwin/arm64. JIT compiled. The fn body is dispatched through the trampoline.CallStatus entry point exactly as BenchmarkCorpusJITRunner does for every other JIT-admitted kernel.

go test ./runtime/jit/vm3jit -run='^$' \
  -bench='BenchmarkSwitchLookup8JIT' -benchtime=1s -count=20 -cpu=1

20 samples per case. Below the median and minimum are over the 20 samples.

Variant	N	runs	ns/op median	ns/op min
cmp_chain	100	18	1769	1267
cmp_chain	10000	20	189355	128659
table	100	20	2049	1015
table	10000	20	153632	118012

Table-vs-chain ratio in JIT:

N=100 median: 2049 / 1769 = 1.16 (table slower, noise dominates)
N=100 minimum: 1015 / 1267 = 0.80
N=10000 median: 153632 / 189355 = 0.81
N=10000 minimum: 118012 / 128659 = 0.92

The N=10000 median is the cleaner read since per-iteration cost dominates loop overhead. Speedup is ~19% median at N=10000, below the spec's under 0.50 promise (mirroring Go CL 756340's -63%). The gap to the spec target is explained below.

Why the ratio is 0.81 instead of under 0.50 on Apple M4

The cmp-chain dispatch is 8 sequential CMP+B.COND instructions per key. The 32749-period LCG cycles through ~30% of its period over 10k dispatches, so per-call the predictor sees up to ~10k×0.3 = 3k distinct control-flow events, well within Apple M4's branch-history capacity. Once warm, most of the 8 branches predict "not taken" cheaply, and only the one taken-branch per dispatch costs a redirect.

Per-iter LCG cost ~ 20 cycles (MUL + 2x SDIV for the two ModI64K's). Per-iter dispatch cost:

cmp_chain: ~5..8 cycles (8 branches, mostly predicted not-taken; one redirect per dispatch).
table (hoisted): ~3..4 cycles (one CMP+B.HS bounds check, then a single LDR Xd, [Xhoist, Xidx, LSL #3]).

Total per-iter: ~25..28 cycles cmp_chain vs ~23..24 table. Predicted ratio ~0.85 — matches measured 0.81 within noise.

Go CL 756340's -63% was measured on a CPU with a less-aggressive predictor (the Go bench machine is typically a Skylake-class x86_64 where the indirect-branch redirect cost is higher and the cmp-chain fanout pays a much larger mispredict tax). On Apple M4 the predictor absorbs most of the dispatch fanout; the table form still wins, just less dramatically.

A linux/amd64 re-bench on server2 is expected to land closer to the spec target (under 0.50) once the same hoisted lowering is ported to lower_amd64.go (currently the AMD64 path falls back to interp for OpLookupI64KW, mirroring OpFmaF64's AMD64 deferral).

How the hoist works

Pre-hoist (Phase 6.4.b baseline), each OpLookupI64KW lowered to:

movImm64 x16, &fn.I64Tables[c][0]    ; 1..4 movz/movk words
LDR      Xd, [x16, Xidx, LSL #3]

The movImm64 reloads a loop-invariant pointer every iter (3..4 instructions on a typical heap address). With the hoist:

PROLOGUE (once per call):
  movImm64 x19, &fn.I64Tables[c][0]   ; 3..4 words
INNER LOOP:
  LDR Xd, [x19, Xidx, LSL #3]          ; 1 word

Hoist budget: x19..x28 - i64-callee-saved range. For SwitchLookup8Table (NumRegsI64 = 6, no callee-saved i64 regs), x19 is free. Hoisting costs one STP x19:x20 in the prologue and one matching LDP in the epilogue.

The implementation is generic: any fn with OpLookupI64KW gets up to N unique tables hoisted, where N is the unused tail of x19..x28. Cell-bank fns (NumRegsCell > 0) skip the hoist for now because their x19..x28 layout is fully committed to arena base + slab field pins.

Raw output

goos: darwin
goarch: arm64
pkg: mochi/runtime/jit/vm3jit
cpu: Apple M4
BenchmarkSwitchLookup8JIT/cmp_chain_n100         	 1000000	      1509 ns/op
BenchmarkSwitchLookup8JIT/cmp_chain_n100         	 1000000	      1387 ns/op
BenchmarkSwitchLookup8JIT/cmp_chain_n100         	  904144	      1857 ns/op
BenchmarkSwitchLookup8JIT/cmp_chain_n100         	  911334	      1678 ns/op
BenchmarkSwitchLookup8JIT/cmp_chain_n100         	 1000000	      1357 ns/op
BenchmarkSwitchLookup8JIT/cmp_chain_n100         	 1000000	      1358 ns/op
BenchmarkSwitchLookup8JIT/cmp_chain_n100         	 1000000	      1638 ns/op
BenchmarkSwitchLookup8JIT/cmp_chain_n100         	 1000000	      2491 ns/op
BenchmarkSwitchLookup8JIT/cmp_chain_n100         	 1000000	      2690 ns/op
BenchmarkSwitchLookup8JIT/cmp_chain_n100         	 1000000	      1625 ns/op
BenchmarkSwitchLookup8JIT/cmp_chain_n100         	  908224	      2115 ns/op
BenchmarkSwitchLookup8JIT/cmp_chain_n100         	 1000000	      1907 ns/op
BenchmarkSwitchLookup8JIT/cmp_chain_n100         	 1000000	      1330 ns/op
BenchmarkSwitchLookup8JIT/cmp_chain_n100         	  850927	      1965 ns/op
BenchmarkSwitchLookup8JIT/cmp_chain_n100         	 1000000	      1881 ns/op
BenchmarkSwitchLookup8JIT/cmp_chain_n100         	  699099	      2185 ns/op
BenchmarkSwitchLookup8JIT/cmp_chain_n100         	 1000000	      1267 ns/op
BenchmarkSwitchLookup8JIT/cmp_chain_n100         	 1000000	      2151 ns/op
BenchmarkSwitchLookup8JIT/cmp_chain_n10000       	    6013	    384419 ns/op
BenchmarkSwitchLookup8JIT/cmp_chain_n10000       	    4011	    254184 ns/op
BenchmarkSwitchLookup8JIT/cmp_chain_n10000       	    5271	    192897 ns/op
BenchmarkSwitchLookup8JIT/cmp_chain_n10000       	    7246	    191019 ns/op
BenchmarkSwitchLookup8JIT/cmp_chain_n10000       	    6736	    203406 ns/op
BenchmarkSwitchLookup8JIT/cmp_chain_n10000       	    8776	    208596 ns/op
BenchmarkSwitchLookup8JIT/cmp_chain_n10000       	    7041	    207047 ns/op
BenchmarkSwitchLookup8JIT/cmp_chain_n10000       	    6928	    229489 ns/op
BenchmarkSwitchLookup8JIT/cmp_chain_n10000       	    6459	    164512 ns/op
BenchmarkSwitchLookup8JIT/cmp_chain_n10000       	    8150	    187691 ns/op
BenchmarkSwitchLookup8JIT/cmp_chain_n10000       	    7519	    155265 ns/op
BenchmarkSwitchLookup8JIT/cmp_chain_n10000       	    8103	    128659 ns/op
BenchmarkSwitchLookup8JIT/cmp_chain_n10000       	    9144	    139414 ns/op
BenchmarkSwitchLookup8JIT/cmp_chain_n10000       	    9105	    154306 ns/op
BenchmarkSwitchLookup8JIT/cmp_chain_n10000       	    8185	    175642 ns/op
BenchmarkSwitchLookup8JIT/cmp_chain_n10000       	    7464	    218962 ns/op
BenchmarkSwitchLookup8JIT/cmp_chain_n10000       	    9066	    247850 ns/op
BenchmarkSwitchLookup8JIT/cmp_chain_n10000       	    7321	    168831 ns/op
BenchmarkSwitchLookup8JIT/cmp_chain_n10000       	    7407	    181299 ns/op
BenchmarkSwitchLookup8JIT/cmp_chain_n10000       	    7393	    152691 ns/op
BenchmarkSwitchLookup8JIT/table_n100             	 1096940	      1015 ns/op
BenchmarkSwitchLookup8JIT/table_n100             	 1000000	      1196 ns/op
BenchmarkSwitchLookup8JIT/table_n100             	 1000000	      1998 ns/op
BenchmarkSwitchLookup8JIT/table_n100             	  998000	      1308 ns/op
BenchmarkSwitchLookup8JIT/table_n100             	 1000000	      1155 ns/op
BenchmarkSwitchLookup8JIT/table_n100             	 1000000	      1123 ns/op
BenchmarkSwitchLookup8JIT/table_n100             	 1104237	      1306 ns/op
BenchmarkSwitchLookup8JIT/table_n100             	 1000000	      1600 ns/op
BenchmarkSwitchLookup8JIT/table_n100             	 1000000	      3742 ns/op
BenchmarkSwitchLookup8JIT/table_n100             	 1000000	      3713 ns/op
BenchmarkSwitchLookup8JIT/table_n100             	 1000000	      4771 ns/op
BenchmarkSwitchLookup8JIT/table_n100             	 1000000	      2863 ns/op
BenchmarkSwitchLookup8JIT/table_n100             	 1000000	      2357 ns/op
BenchmarkSwitchLookup8JIT/table_n100             	 1000000	      2188 ns/op
BenchmarkSwitchLookup8JIT/table_n100             	 1000000	      2422 ns/op
BenchmarkSwitchLookup8JIT/table_n100             	 1000000	      2129 ns/op
BenchmarkSwitchLookup8JIT/table_n100             	 1000000	      2176 ns/op
BenchmarkSwitchLookup8JIT/table_n100             	 1000000	      1873 ns/op
BenchmarkSwitchLookup8JIT/table_n100             	 1000000	      2100 ns/op
BenchmarkSwitchLookup8JIT/table_n100             	 1000000	      1695 ns/op
BenchmarkSwitchLookup8JIT/table_n10000           	   10000	    155251 ns/op
BenchmarkSwitchLookup8JIT/table_n10000           	   10000	    152014 ns/op
BenchmarkSwitchLookup8JIT/table_n10000           	   10000	    260283 ns/op
BenchmarkSwitchLookup8JIT/table_n10000           	   10000	    159553 ns/op
BenchmarkSwitchLookup8JIT/table_n10000           	   10000	    157197 ns/op
BenchmarkSwitchLookup8JIT/table_n10000           	    9386	    170249 ns/op
BenchmarkSwitchLookup8JIT/table_n10000           	    9157	    230307 ns/op
BenchmarkSwitchLookup8JIT/table_n10000           	    9738	    158524 ns/op
BenchmarkSwitchLookup8JIT/table_n10000           	    9472	    199854 ns/op
BenchmarkSwitchLookup8JIT/table_n10000           	   10000	    141416 ns/op
BenchmarkSwitchLookup8JIT/table_n10000           	   10000	    119671 ns/op
BenchmarkSwitchLookup8JIT/table_n10000           	    7972	    134888 ns/op
BenchmarkSwitchLookup8JIT/table_n10000           	   10000	    123683 ns/op
BenchmarkSwitchLookup8JIT/table_n10000           	   10000	    118012 ns/op
BenchmarkSwitchLookup8JIT/table_n10000           	   10000	    141262 ns/op
BenchmarkSwitchLookup8JIT/table_n10000           	   10000	    197783 ns/op
BenchmarkSwitchLookup8JIT/table_n10000           	    8443	    127309 ns/op
BenchmarkSwitchLookup8JIT/table_n10000           	   10000	    121935 ns/op
BenchmarkSwitchLookup8JIT/table_n10000           	   10000	    123379 ns/op
BenchmarkSwitchLookup8JIT/table_n10000           	   10000	    163977 ns/op

Phase 6.4.b switch-lookup JIT bench (2026-05-19)​

Why the ratio is 0.81 instead of under 0.50 on Apple M4​

How the hoist works​

Raw output​

Phase 6.4.b switch-lookup JIT bench (2026-05-19)

Why the ratio is 0.81 instead of under 0.50 on Apple M4

How the hoist works

Raw output