MEP 29. VM2 Dispatch Strategy - Measured Results

Field     Value
MEP       29
Title     VM2 Dispatch Strategy - Measured Results
Author    Mochi core
Status    Informational
Type      Informational
Created   2026-05-16

Abstract

MEP-25 specifies the vm2 data-model shapes and defers the dispatch strategy choice to three competing standards-track MEPs (MEP-26 Migration, MEP-27 Inline Cache, MEP-28 AOT). This MEP reports the head-to-head benchmark numbers from prototype implementations of all three strategies across four containers: lists, strings, maps, and sets. The containers point at different conclusions, and together they yield a per-container playbook for which dispatch strategy to actually ship.

Four-line summary.

  • Lists (lists/fill_sum): AOT wins narrowly over Baseline; IC and Migration regress.
  • Strings (strings/concat_loop): the rope shape beats today's Flat-only baseline by 2-3x; all three dispatch strategies on top of the rope shape are within 25% of each other (AOT > IC > Mig).
  • Maps (maps/fill_sum): the today-shape generic map[Cell]Cell Baseline wins; int specialization is in the noise because Go's map op cost drowns dispatch cost.
  • Sets (sets/fill_probe): same regime as maps; the generic shape is within 8% of the raw floor, so specialization is not worth the cost.

Top-line result. On lists/fill_sum N=1024 on Apple M4 (Go 1.25):

Strategy                      ns/op   vs theoretical floor   vs current baseline
Theoretical floor ([]int64)   935.6   1.00x                  0.25x faster
Option 3 (AOT)                3584    3.83x                  1.06x faster
Current baseline ([]Cell)     3805    4.07x                  1.00x
Option 1 (Migration)          4670    4.99x                  1.23x slower
Option 2 (Inline Cache)       5883    6.29x                  1.55x slower

The headline is that Option 3 (AOT) wins; Option 1 (Migration) regresses vs the current generic baseline because its multi-arm type switch costs more per op than the win from int-specialized storage on this Go runtime; and Option 2 (Inline Cache) regresses substantially because Go's interface dispatch is already as fast as an IC hit, so the IC state byte and miss-handling branch are pure overhead.

This is not the predicted ordering from the design-stage MEPs (26/27/28 all predicted Migration would win). The disagreement is informative: it tells us that the dispatch-cost model assumed in the design MEPs (where the hot path is C/asm and a type-switch costs more than an indirect call) does not hold inside a Go interpreter, where the type-switch is one of the cheapest operations available.

Experimental setup

Code layout

Three sibling packages under runtime/vm2/listopt/, plus a baseline package:

runtime/vm2/listopt/
├── doc.go              - package-level explanation
├── baseline/           - reference points: today's []Cell + raw []int64
│   ├── baseline.go
│   └── baseline_test.go
├── mig/                - Option 1: one-way migration
│   ├── mig.go
│   └── mig_test.go
├── ic/                 - Option 2: per-call-site inline cache
│   ├── ic.go
│   └── ic_test.go
└── aot/                - Option 3: AOT specialized handlers
    ├── aot.go
    └── aot_test.go

Each strategy package is fully self-contained. There is no shared kernel; the non-list code (Cell encoding, Objects table, FillSum driver) is copied verbatim so that the only measured delta between strategies is the list-dispatch path. This isolation also makes each strategy trivial to enable/disable and gives a clean template for the next container (maps): copy listopt/ to mapopt/, swap the data shapes, re-measure.

Workload

The canonical list benchmark: build a list of N ints by repeated push, then sum the elements:

func (vm *VM) FillSum(n int64) int64 {
	xs := vm.newList(int(n))
	for i := int64(0); i < n; i++ {
		vm.listPush(xs, CInt(i))
	}
	var sum int64
	m := vm.listLen(xs)
	for i := int64(0); i < m; i++ {
		sum += vm.listGet(xs, i).Int()
	}
	return sum
}

Two sizes measured: N=128 (matches the corpus default) and N=1024 (8x scale, exercises slice growth past initial capacity).

The driver is hand-rolled Go, not bytecode. Each strategy's list ops are inlined into the driver in the same way a JIT would inline them after specialization. This isolates each strategy's dispatch cost from the interpreter loop's own dispatch cost.

Strategies, as implemented

Option 1, Migration (listopt/mig). List opcodes do a Go switch x.(type) on Objects[idx] for every operation. On *vmListI64, push appends to []int64 or migrates to *vmList on a non-int value. On *vmList, append to []Cell. The migration is one-shot, in place via Objects[idx] = newCell. Approximate hot-path cost: one type switch (~3-5 cycles in Go) + the specialized handler.
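The migration push can be sketched as follows. This is a simplified stand-in, not the prototype's actual code: the type names mirror the description above, but `any` replaces the Cell encoding and the `VM`/`Objects` shapes are illustrative.

```go
package main

import "fmt"

// Simplified stand-ins for the prototype's shapes (hypothetical layout).
type vmListI64 struct{ elems []int64 } // int-specialized storage
type vmList struct{ elems []any }      // generic storage (stands in for []Cell)

type VM struct{ Objects []any }

// listPush dispatches via a type switch and migrates one-shot, in place,
// the first time a non-int value lands in an int-specialized list.
func (vm *VM) listPush(idx int, v any) {
	switch l := vm.Objects[idx].(type) {
	case *vmListI64:
		if n, ok := v.(int64); ok {
			l.elems = append(l.elems, n) // hot path: int into int storage
			return
		}
		g := &vmList{elems: make([]any, 0, len(l.elems)+1)}
		for _, n := range l.elems {
			g.elems = append(g.elems, n) // widen existing elements
		}
		g.elems = append(g.elems, v)
		vm.Objects[idx] = g // one-shot: the old shape never comes back
	case *vmList:
		l.elems = append(l.elems, v)
	}
}

func main() {
	vm := &VM{Objects: []any{&vmListI64{}}}
	vm.listPush(0, int64(1))
	vm.listPush(0, "not an int") // triggers the one-shot migration
	_, migrated := vm.Objects[0].(*vmList)
	fmt.Println(migrated) // true
}
```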

Option 2, Inline Cache (listopt/ic). Each call site has its own ICSlot{State uint8}. The handler branches on slot.State, then performs a single-case type assertion (l, ok := obj.(*vmListI64)) and the specialized work. On miss (cache says ListI64 but object is *vmList), the slot is invalidated and the handler retried. K=1 cache (one shape per slot); state 3 is "polymorphic / fall through". Approximate hot-path cost: one byte-tag compare + one branch + one single-case type assertion + the specialized handler.
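A minimal sketch of the K=1 slot, under the same caveats (hypothetical type names, `any` in place of Cell). Note the point the paragraph makes: even on a hit, Go forces the single-case assertion before the fields can be touched.

```go
package main

import "fmt"

type vmListI64 struct{ elems []int64 } // int-specialized shape
type vmList struct{ elems []any }      // generic shape

// IC states matching the description: 0 uninit, 1 mono-I64,
// 2 mono-generic, 3 polymorphic / fall through.
const (
	icUninit uint8 = iota
	icI64
	icGeneric
	icPoly
)

// ICSlot is the per-call-site cache: one byte of state.
type ICSlot struct{ State uint8 }

// listPushIC: on a hit the state byte selects the arm, but the
// single-case assertion cannot be skipped in Go.
func listPushIC(slot *ICSlot, obj any, v int64) {
	if slot.State == icI64 {
		if l, ok := obj.(*vmListI64); ok { // still one itab compare
			l.elems = append(l.elems, v)
			return
		}
		slot.State = icPoly // miss: invalidate, retry below
	}
	switch l := obj.(type) { // slow path: classify and fill the slot
	case *vmListI64:
		if slot.State == icUninit {
			slot.State = icI64
		}
		l.elems = append(l.elems, v)
	case *vmList:
		if slot.State == icUninit {
			slot.State = icGeneric
		}
		l.elems = append(l.elems, any(v))
	}
}

func main() {
	slot := &ICSlot{}
	xs := &vmListI64{}
	for i := int64(0); i < 4; i++ {
		listPushIC(slot, xs, i)
	}
	fmt.Println(slot.State, xs.elems) // 1 [0 1 2 3]
}
```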

Option 3, AOT (listopt/aot). The driver calls specialized methods (listPushI64, listGetI64, listLenI64) directly. Each handler does a single-case type assertion on Objects[idx] (it must: Go offers no way to skip the itab check on a heterogeneous slice element) and the specialized work. No shape switch, no IC. Approximate hot-path cost: one single-case type assertion + the specialized handler.
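The AOT handlers reduce to almost nothing; a sketch with the same hypothetical stand-in types as above:

```go
package main

import "fmt"

type vmListI64 struct{ elems []int64 } // hypothetical stand-in shape

type VM struct{ Objects []any }

// The specialized handlers: no shape switch, no IC slot. The one
// unavoidable cost is the single-case assertion on the heterogeneous
// Objects slice element.
func (vm *VM) listPushI64(idx int, v int64) {
	l := vm.Objects[idx].(*vmListI64) // one itab compare; panics on mismatch
	l.elems = append(l.elems, v)
}

func (vm *VM) listGetI64(idx int, i int64) int64 {
	l := vm.Objects[idx].(*vmListI64)
	return l.elems[i]
}

func main() {
	vm := &VM{Objects: []any{&vmListI64{}}}
	var sum int64
	for i := int64(0); i < 5; i++ {
		vm.listPushI64(0, i)
	}
	for i := int64(0); i < 5; i++ {
		sum += vm.listGetI64(0, i)
	}
	fmt.Println(sum) // 10
}
```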

Baseline, Generic []Cell (listopt/baseline). Today's MEP-24 §3 shape, no int specialization. Push allocates a Cell; read unpacks via .Int(). Approximate hot-path cost: one single-case type assertion (the only shape) + the Cell pack/unpack.

Theoretical floor, Raw []int64. No Cell, no Objects table, no dispatch. The fastest Go can possibly do this workload.

Measurement

go test -bench=. -benchmem -benchtime=2s -count=8 ./runtime/vm2/listopt/...

Apple M4, Go 1.25, darwin/arm64, 8 samples per benchmark. Median ns/op reported; mean within ±2% of median in all cases.

Results

Raw numbers (median ns/op)

Strategy       N=128   N=1024   Allocs (N=1024)   Bytes (N=1024)
Raw []int64    118.6   935.6    0                 0
AOT            486.0   3584     2                 8.0 KiB
Baseline       534.7   3805     2                 8.0 KiB
Migration      614.0   4670     2                 8.0 KiB
Inline Cache   750.1   5883     2                 8.0 KiB

Per-op cost decomposition

FillSum(1024) performs approximately 2049 list operations (1024 push + 1024 get + 1 len). Dividing the ns/op by 2049 gives a per-list-op cost:

Strategy       ns/list-op   Overhead vs raw
Raw []int64    0.46         -
AOT            1.75         +1.29 ns
Baseline       1.86         +1.40 ns
Migration      2.28         +1.82 ns
Inline Cache   2.87         +2.41 ns

The ~1.3 ns floor between Raw and AOT is the cost of the Cell pack/unpack and the Objects[idx].(*vmListI64) type assertion. Everything above that is dispatch overhead:

  • AOT pays only the single-case type assertion. Best possible without giving up the Objects-indirected heap.
  • Migration adds ~0.5 ns/op for the multi-arm switch x.(type). On a 2-arm switch with a monomorphic shape, Go's branch predictor catches the hot arm, but the dispatch is still slightly heavier than a single comma-ok assertion.
  • IC adds ~1.1 ns/op for the slot.State load + branch + the type assertion (which it still performs). The IC's payoff in production runtimes (skipping a virtual call) does not apply in Go where interface dispatch is already inlined.

Reproducibility

The benchmark data is checked in at /runtime/vm2/listopt/{baseline,mig,ic,aot}/*_test.go. Re-run with:

go test -bench=. -benchtime=2s -count=8 ./runtime/vm2/listopt/...

Numbers will vary by ±5% across runs on the same M4 due to thermal and scheduler noise; the strategy ordering is stable across all observed runs.

Analysis

Why Migration regresses vs Baseline

The Migration prototype switches on two list shapes (*vmListI64 and *vmList). Even with a perfectly-predicted hot arm, the switch x.(type) lowers to two itab compares (one per case) before falling into the matched arm. The Baseline has a single shape, so its obj.(*vmList) lowers to one itab compare. The extra compare costs more than the int-vs-Cell storage saves on this Go runtime.

This is a Go-specific result. In a C/asm interpreter, the multi-arm switch would lower to a compare-and-branch tower of the same cost as the single-arm assertion, and Migration would win by the storage saving. The MEP-26 design-stage estimate assumed the C/asm cost model.

Why Inline Cache regresses substantially

Two compounding issues:

  1. Go's interface dispatch is already cheap. A v, ok := x.(*T) assertion is a single itab compare. The IC was designed to replace a function-pointer dispatch (which in C++/asm is a 5-10 cycle indirect call); it has nothing to replace in Go.

  2. The IC adds work on every hit. The state byte read + the branch on state + the type assertion (the IC does not skip the assertion in our implementation; Go has no safe way to coerce any to *T without one) means every hit costs more than the no-IC version.

A C port of the IC would skip the comma-ok check inside the cached arm because the IC promises the shape. Our prototype cannot make that promise safely in Go.

Why AOT wins narrowly over Baseline

AOT and Baseline both perform exactly one single-case type assertion per list op. The remaining ~0.1 ns/op advantage for AOT (about 5%) comes from the int64-direct read on OpListGet: the int64 is packed into a Cell once and returned, avoiding the .Int() unpacking that Baseline does on every read.

This is small, but it is the upper bound of what int specialization can win on this workload in the current design. To extract more, the calling code would have to consume int64 directly without packing, which is what a JIT would do, but is not measurable through the Cell-returning interface here.

What this tells us about the future JIT

The JIT story for all three options is identical to the no-JIT story:

  • AOT: the JIT inlines the specialized handler directly. Zero shape check; the win is the same as today.
  • Migration: the JIT can hoist the shape guard out of the loop, so the in-loop cost is one int load. Estimated JIT win: ~30% over interpreter Migration.
  • IC: the JIT consumes the IC's recorded shape, generates the same specialized code as AOT, and replaces the IC with a hoisted guard. Estimated JIT win: identical to JIT Migration.

In other words: post-JIT, all three options converge to roughly the same performance because all three give the JIT the same monomorphic shape information. The interpreter-level differences measured here are what matters until the JIT lands; after the JIT lands, the difference is the engineering complexity each option imposes.

Strings (strings/concat_loop)

The second container exercised is strings, on the canonical strings/concat_loop workload: build "x" repeated N times via N single-char concats and return the final length. Unlike lists/fill_sum, this workload is genuinely polymorphic: the left operand transitions Inline (iters 1..5) → Flat (iter 6) → Rope (iter 7..N) as it grows. Each strategy sees three distinct shape tuples over the lifetime of one call.

Code layout

runtime/vm2/stropt/
├── doc.go      - package-level explanation
├── baseline/   - today's MEP-24 §2 shape: Inline + Flat (no Rope)
├── mig/        - Option 1: MEP-26 multi-arm switch over Inline/Flat/Rope
├── ic/         - Option 2: MEP-27 K=4 IC over (leftShape, rightShape) tuples
└── aot/        - Option 3: MEP-28 3×3 specialized concat table

Strategies, as implemented

Baseline (stropt/baseline). Today's vm2 string shape. Inline-pack ≤5 bytes, else allocate a fresh *vmString and copy both operands. No rope. This is the O(N²) trap that motivates the rope-aware strategies.

Migration (stropt/mig). Full MEP-25 §1 shape lattice (Inline/Flat/Rope). The concat handler switches on both operands' shapes via a Go type switch, then dispatches the MEP-25 §1 decision tree: pack inline if both fit, allocate flat if combined ≤64 bytes, otherwise build a Rope unless depth would exceed 32. Pays a 3-way type switch per operand per concat.
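The decision tree described above can be sketched as follows. This is a simplified model, not the prototype: the type, field, and constant names (vmString, flatConcatThreshold, etc.) are illustrative, and the over-depth case crudely flattens instead of rebalancing.

```go
package main

import "fmt"

const (
	inlineMax           = 5  // ≤5 bytes pack inline
	flatConcatThreshold = 64 // combined size that still allocates flat
	maxRopeDepth        = 32
)

type strKind uint8

const (
	kindInline strKind = iota
	kindFlat
	kindRope
)

// vmString is a simplified stand-in for the MEP-25 shape lattice.
type vmString struct {
	kind          strKind
	flat          []byte    // Inline/Flat payload
	left, right   *vmString // Rope children
	length, depth int
}

// bytes materializes the string (ropes flatten recursively).
func (s *vmString) bytes() []byte {
	if s.kind != kindRope {
		return s.flat
	}
	buf := make([]byte, 0, s.length)
	buf = append(buf, s.left.bytes()...)
	buf = append(buf, s.right.bytes()...)
	return buf
}

// concat: pack inline if both fit, allocate flat if combined ≤64 bytes,
// otherwise build a Rope unless depth would exceed 32.
func concat(a, b *vmString) *vmString {
	n := a.length + b.length
	if n <= flatConcatThreshold {
		buf := make([]byte, 0, n)
		buf = append(buf, a.bytes()...)
		buf = append(buf, b.bytes()...)
		k := kindFlat
		if n <= inlineMax {
			k = kindInline
		}
		return &vmString{kind: k, flat: buf, length: n}
	}
	d := max(a.depth, b.depth) + 1
	if d > maxRopeDepth { // crude rebalance: flatten
		buf := make([]byte, 0, n)
		buf = append(buf, a.bytes()...)
		buf = append(buf, b.bytes()...)
		return &vmString{kind: kindFlat, flat: buf, length: n}
	}
	return &vmString{kind: kindRope, left: a, right: b, length: n, depth: d}
}

func main() {
	s := &vmString{kind: kindInline, flat: []byte("x"), length: 1}
	x := &vmString{kind: kindInline, flat: []byte("x"), length: 1}
	for i := 0; i < 99; i++ {
		s = concat(s, x) // Inline → Flat → Rope as s grows
	}
	fmt.Println(s.kind == kindRope, s.length) // true 100
}
```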

Inline Cache (stropt/ic). K=4 IC slot per call site, keyed by (leftShape*3 + rightShape). On a hit the cached tuple pre-classifies the operands so the specialized arm skips the shape switch. On miss the IC extends (up to 4 tuples), then marks megamorphic. Because concat_loop produces exactly three tuples ((Inline, Inline), (Flat, Inline), (Rope, Inline)), K=4 holds the full working set and never goes megamorphic (verified by TestICTuplesObserved).
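The tuple cache reduces each hit to a short scan of byte compares; a sketch under the same keying scheme (names illustrative):

```go
package main

import "fmt"

// Shape codes; the tuple key is leftShape*3 + rightShape, as above.
type shape uint8

const (
	shapeInline shape = iota
	shapeFlat
	shapeRope
)

// ICSlot caches up to K=4 observed (left, right) tuples as single bytes;
// a hit is a byte compare in a linear scan, not an itab compare.
type ICSlot struct {
	keys [4]uint8
	n    int
	mega bool
}

// observe reports true on a hit or a successful extension, false once
// the slot has gone megamorphic (more than K tuples seen).
func (s *ICSlot) observe(left, right shape) bool {
	key := uint8(left)*3 + uint8(right)
	for i := 0; i < s.n; i++ {
		if s.keys[i] == key {
			return true // hit
		}
	}
	if s.mega {
		return false
	}
	if s.n < len(s.keys) {
		s.keys[s.n] = key // miss: extend the cache
		s.n++
		return true
	}
	s.mega = true // fall back to the generic path from now on
	return false
}

func main() {
	slot := &ICSlot{}
	// concat_loop's three tuples: (Inline,Inline), (Flat,Inline), (Rope,Inline).
	tuples := [][2]shape{
		{shapeInline, shapeInline},
		{shapeFlat, shapeInline},
		{shapeRope, shapeInline},
	}
	for i := 0; i < 1000; i++ {
		slot.observe(tuples[i%3][0], tuples[i%3][1])
	}
	fmt.Println(slot.n, slot.mega) // 3 false
}
```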

AOT (stropt/aot). Nine specialized concat handlers (concatII, concatIF, concatIR, concatFI, …, concatRR) reached via a 3×3 function-pointer table indexed by (leftShape, rightShape). Each handler is a leaf with shape-known prologue, no internal type switches. The driver still resolves operand shapes per call (which a real AOT compiler would have elided at emit time from IR types), so the measured cost is upper-bound AOT under the polymorphic workload.

Results

go test -bench=. -benchtime=2s -count=5 -run=^$ ./runtime/vm2/stropt/...

Apple M4, Go 1.25, darwin/arm64, 5 samples per benchmark. Median ns/op:

Strategy    N=30   N=200   N=1000   Allocs (N=1000)   Bytes (N=1000)
Baseline    1380   12538   113761   1990              554 KiB
AOT         1410   7697    35698    1198              66 KiB
IC          1762   9440    43064    1198              66 KiB
Migration   1938   8435    53997    1198              66 KiB

Analysis

The algorithmic win dominates. At N=1000, all three rope-aware strategies beat baseline by 2.1x–3.2x in time and 8.4x in allocated bytes. Baseline's O(N²) byte-copy cost (each concat allocates len(left) + 1 bytes and copies) crosses over rope dispatch overhead around N=60-80 and runs away from there. This is the opposite of the lists result, where dispatch overhead dwarfed the algorithmic delta (both list strategies are O(N)).

AOT wins clearly. The 3×3 specialized handlers avoid the internal type switch entirely. At N=1000 AOT beats Migration by 1.5x and beats IC by 1.2x. The win is real because each concat's hot path is short (half a dozen field loads plus an allocation), so the saved type switch is a measurable fraction of the per-op cost.

IC beats Migration here, the opposite of the lists result. The reason: this workload's polymorphism punishes Migration. Migration's switch x.(type) has to compare against three itabs (the average hit position is the second arm). The IC's linear scan over at most 3 cached tuples is a tighter loop, and the cached tuples are byte-sized comparisons rather than itab pointers.

At N=30 baseline still wins narrowly, because at small N the rope-tree allocation overhead has not yet been amortized by the avoided copies. The crossover is around N=60, which is exactly the flatConcatThreshold (below threshold every concat is still a flat allocation; above it ropes start paying off).

What this means

Strings are an algorithmic problem, not a dispatch problem. The right question is "do we have ropes?", not "which dispatch strategy?". Any of the three strategies plus a rope shape beats the current baseline by ≥2x at production sizes; the dispatch-strategy choice between them is a secondary optimization worth ~25% relative.

This inverts the design-stage MEP-26 prediction for strings (which assumed dispatch cost would dominate). It also inverts the lists conclusion (where IC was the loser): on a workload that is genuinely polymorphic and where each handler does substantive work, IC's per-tuple compare is cheaper than Migration's per-itab compare. The IC-vs-Migration ordering is workload-dependent; neither is universally better.

Maps (maps/fill_sum)

The third container exercised is maps, on maps/fill_sum: build a map of N int to int entries via repeated set, then sum the values. Like lists/fill_sum this workload is monomorphic (the map is vmMapI64 throughout), so the dispatch sites see exactly one shape after warmup.

Code layout

runtime/vm2/mapopt/
├── doc.go      - package-level explanation
├── baseline/   - today's MEP-24 §4 generic map[Cell]Cell + raw map[int64]int64 floor
├── mig/        - Option 1: switch over Empty/I64/Generic shapes
├── ic/         - Option 2: K=1 IC slot per call site (state byte)
└── aot/        - Option 3: specialized I64xI64 handlers

Strategies, as implemented

Baseline (mapopt/baseline). Generic map[Cell]Cell; every set/get pays a CInt pack and an itab assertion. Plus a FillSumRaw reference that uses map[int64]int64 with no Cell or Objects table.

Migration (mapopt/mig). *vmMapI64 (map[int64]int64) starts as the default; a non-int key or value triggers in-place migration to *vmMap (map[Cell]Cell). Every op pays a switch obj.(type) over the two shapes (vmMapI64, vmMap).

Inline Cache (mapopt/ic). K=1 IC slot per call site (ICSlot{State uint8}, with state 0=uninit, 1=monomorphic I64, 2=monomorphic Generic, 3=polymorphic). After warmup the slot stays at 1 and every op pays: one byte-tag compare + one branch + one single-case type assertion.

AOT (mapopt/aot). The driver calls specialized mapSetI64I64 and mapGetI64I64 directly with raw int64 keys and values, skipping Cell pack/unpack entirely. Each handler pays one single-case type assertion.
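The specialized map handlers have the same shape as the list AOT handlers; a sketch with hypothetical stand-in types:

```go
package main

import "fmt"

type vmMapI64 struct{ entries map[int64]int64 } // hypothetical stand-in shape

type VM struct{ Objects []any }

// Raw int64 keys and values in, so no Cell pack/unpack; the only
// dispatch cost left is the single-case assertion on Objects[idx].
func (vm *VM) mapSetI64I64(idx int, k, v int64) {
	m := vm.Objects[idx].(*vmMapI64)
	m.entries[k] = v
}

func (vm *VM) mapGetI64I64(idx int, k int64) int64 {
	m := vm.Objects[idx].(*vmMapI64)
	return m.entries[k]
}

func main() {
	vm := &VM{Objects: []any{&vmMapI64{entries: map[int64]int64{}}}}
	var sum int64
	for i := int64(0); i < 5; i++ {
		vm.mapSetI64I64(0, i, i)
	}
	for i := int64(0); i < 5; i++ {
		sum += vm.mapGetI64I64(0, i)
	}
	fmt.Println(sum) // 10
}
```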

Results

go test -bench=. -benchtime=2s -count=5 -run=^$ ./runtime/vm2/mapopt/...

Apple M4, Go 1.25, darwin/arm64, 5 samples per benchmark. Median ns/op:

Strategy              N=128   N=1024   Allocs (N=1024)   Bytes (N=1024)
Raw map[int64]int64   1355    10668    5                 37.0 KiB
Baseline (Generic)    1450    11677    7                 37.0 KiB
AOT                   1691    12611    7                 37.0 KiB
IC                    1909    15262    7                 37.0 KiB
Migration             2000    16621    7                 37.0 KiB

Per-op cost at N=1024 (workload is ~2049 map ops):

Strategy              ns/map-op   Overhead vs raw
Raw map[int64]int64   5.21        -
Baseline (Generic)    5.70        +0.49 ns
AOT                   6.16        +0.95 ns
IC                    7.45        +2.24 ns
Migration             8.11        +2.90 ns

Analysis

The today-shape Baseline wins. This is the biggest surprise of the three containers, and it inverts the recommendation that the lists data suggested. For Go's map:

  • map[Cell]Cell (where Cell is uint64) and map[int64]int64 use the same underlying hashmap implementation with the same hash function speed. The Cell pack (tagInt | uint64(v)) and unpack (mask + sign-extend) are nearly free.
  • The Go map operation itself (~5 ns/op for an int key) dominates the per-op cost. Dispatch overhead is a small fraction.
  • AOT's win in lists came from skipping the per-element Cell pack on every push/get, which mattered when the underlying slice op was cheap (~0.5 ns). On maps the underlying op is 10x more expensive, so the saved pack is in the noise.
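The "nearly free" pack/unpack claim follows from its shape: a mask plus a tag OR on the way in, a mask plus a sign-extending shift pair on the way out. A sketch with a hypothetical 48-bit payload layout (not the real vm2 Cell encoding; only the cost shape matters):

```go
package main

import "fmt"

// Hypothetical Cell layout: tag above bit 48, 48-bit signed payload below.
type Cell uint64

const (
	tagShift    = 48
	tagInt      = Cell(1) << tagShift
	payloadMask = uint64(1)<<tagShift - 1
)

// CInt packs: one mask + one OR.
func CInt(v int64) Cell { return tagInt | Cell(uint64(v)&payloadMask) }

// Int unpacks: one mask + one sign-extending shift pair.
func (c Cell) Int() int64 {
	p := (uint64(c) & payloadMask) << (64 - tagShift) // payload to the top bits
	return int64(p) >> (64 - tagShift)                // arithmetic shift sign-extends
}

func main() {
	fmt.Println(CInt(12345).Int(), CInt(-7).Int()) // 12345 -7
}
```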

AOT regresses slightly vs Baseline here because both pay one type assertion plus the map op, but AOT additionally calls through method wrappers that the Go inliner doesn't unify as well as Baseline's tight call site. This is within ±10% and partially noise, but the directional finding holds across all 5 samples.

IC and Migration regress meaningfully (+30-40%) for the same reason as in lists: their dispatch overhead is pure cost on a Go runtime where the alternative (one type assertion) is already as cheap as dispatch gets.

What this means

For maps in this VM, int specialization is not worth the implementation cost. The current MEP-24 §4 generic map[Cell]Cell shape is within 10% of the raw map[int64]int64 floor. The 40-90 ns/op gap to Lua (per the MEP-23 sweep) is therefore not coming from dispatch and not coming from key boxing; it is in the surrounding interpreter loop, the Cell ABI, or GC. Optimizing the map subsystem itself yields at most a 5-10% win.

This is the opposite of what the lists data suggested for maps in this MEP's earlier draft. The reason for the flip: in lists, the per-element cost was small (slice append/index) so dispatch overhead was a meaningful fraction; in maps the per-element cost is large (hashing + bucket walk), so dispatch overhead drowns in the noise.

Sets (sets/fill_probe)

The fourth container exercised is sets, on sets/fill_probe: insert N int members into a set, then probe the same N members and count hits. Like lists/fill_sum and maps/fill_sum this workload is monomorphic (the set stays vmSetI64 throughout), so dispatch sites see one shape after warmup.
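The driver has the same two-loop shape as FillSum; a self-contained sketch against a bare map[int64]struct{} (the real drivers go through the VM's Objects table and Cell encoding, elided here):

```go
package main

import "fmt"

// FillProbe: insert N int members, then probe the same N members and
// count hits. Mirrors the FillSum structure; names are illustrative.
func FillProbe(n int64) int64 {
	s := make(map[int64]struct{}, n)
	for i := int64(0); i < n; i++ {
		s[i] = struct{}{} // N adds
	}
	var hits int64
	for i := int64(0); i < n; i++ {
		if _, ok := s[i]; ok { // N probes
			hits++
		}
	}
	return hits
}

func main() {
	fmt.Println(FillProbe(1024)) // 1024
}
```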

Code layout

runtime/vm2/setopt/
├── doc.go      - package-level explanation
├── baseline/   - today's MEP-24 §4-style generic map[Cell]struct{} + raw map[int64]struct{} floor
├── mig/        - Option 1: switch over SetI64/Generic shapes
├── ic/         - Option 2: K=1 IC slot per call site (state byte)
└── aot/        - Option 3: specialized I64 add/has handlers

Strategies, as implemented

Baseline (setopt/baseline). Generic map[Cell]struct{}; every add/has pays a CInt pack and one itab assertion. Plus a FillProbeRaw reference that uses map[int64]struct{} with no Cell or Objects table.

Migration (setopt/mig). *vmSetI64 (map[int64]struct{}) is the default; adding a non-int member migrates in-place to *vmSet (map[Cell]struct{}). Every op pays a switch obj.(type) over the two shapes.

Inline Cache (setopt/ic). K=1 IC slot per call site (ICSlot{State uint8}, with state 0=uninit, 1=monomorphic I64, 2=monomorphic Generic, 3=polymorphic). After warmup the slot stays at 1 and every op pays: one byte-tag compare + one branch + one single-case type assertion.

AOT (setopt/aot). The driver calls specialized setAddI64 and setHasI64 directly with raw int64 members, skipping Cell pack entirely. Each handler pays one single-case type assertion.

Results

go test -bench=. -benchtime=2s -count=5 -run=^$ ./runtime/vm2/setopt/...

Apple M4, Go 1.25, darwin/arm64, 5 samples per benchmark. Median ns/op:

Strategy                 N=128   N=1024   Allocs (N=1024)   Bytes (N=1024)
Raw map[int64]struct{}   1325    10780    5                 36.1 KiB
AOT                      1424    11373    7                 36.1 KiB
Baseline (Generic)       1429    11614    7                 36.1 KiB
IC                       1626    13413    7                 36.1 KiB
Migration                1549    13610    7                 36.1 KiB

Per-op cost at N=1024 (workload is 2048 set ops, N adds + N probes):

Strategy                 ns/set-op   Overhead vs raw
Raw map[int64]struct{}   5.26        -
AOT                      5.55        +0.29 ns
Baseline (Generic)       5.67        +0.41 ns
IC                       6.55        +1.29 ns
Migration                6.65        +1.39 ns

Analysis

Sets confirm the maps heuristic. The forward prediction from the maps section ("if the underlying Go op already costs ~5 ns/op, dispatch strategy doesn't matter") lands. AOT and Baseline tie within noise, both within 8% of the raw floor. IC and Migration regress by ~18% for the usual reason: their dispatch overhead is pure cost on a Go runtime where the alternative (one single-case type assertion) is already as cheap as dispatch gets.

AOT shows a hair of a win this time (~2% over Baseline), unlike maps where Baseline edged AOT. The reason is that the set workload's hot op (map[int64]struct{} insert/lookup) is a touch cheaper than map[int64]int64 (no value byte to copy on insert, no value to load on lookup), so the saved Cell pack becomes a slightly larger fraction. The gap is too small to act on; treat sets and maps as the same regime.

What this means

For sets in this VM, int specialization is not worth the implementation cost, same conclusion as maps. The current MEP-24 generic map[Cell]struct{} shape is within 8% of the raw map[int64]struct{} floor. Wherever the Lua gap on sets/* workloads comes from, it is not in the set shape choice. Keep the generic shape and move on.

Recommendations

Four containers, three distinct lessons (sets repeat the maps lesson), one cross-cutting pattern.

For lists

The algorithmic delta between shapes is small (both list shapes are O(N)) and dispatch cost is a meaningful fraction of each op. AOT is the right answer: it skips the per-element Cell pack/unpack which is measurable when the underlying op (slice append, slice index) is cheap. Skip the IC entirely; its payoff model (replace virtual call with cached direct call) does not apply in Go. Defer Migration until a workload actually mixes shapes at runtime; the current corpus does not.

For strings (and any container with non-trivial algorithmic shape choices)

The algorithmic delta dominates dispatch. Add the rope shape first; that is what produces the 2-3x win. The choice of dispatch strategy among Mig/IC/AOT is a secondary 25% optimization. AOT remains the cleanest answer; IC is the runner-up because the rope workload is genuinely polymorphic and IC's tuple compare beats Migration's itab compare.

For maps

Do not bother specializing. Today's generic map[Cell]Cell shape is within 10% of the raw map[int64]int64 floor. The Go map op cost (~5 ns/op) drowns out dispatch overhead, and Cell pack/unpack is nearly free for ints. AOT shows no measurable win and IC/Migration both regress by 30-40%. The 40-90 ns/op gap to Lua on maps/fill_sum lives elsewhere (interpreter loop, Cell ABI, GC), not in map shape choice. Keep the MEP-24 §4 shape and move on.

This is the opposite of the recommendation MEP-29 made before the maps data landed. The earlier recommendation extrapolated from the lists pattern (where AOT won by skipping pack/unpack). The extrapolation broke because the per-op cost in maps is 10x the per-op cost in lists; what was a meaningful fraction in lists is in the noise in maps.

For sets

Same answer as maps: do not bother specializing. Today's generic map[Cell]struct{} shape is within 8% of the raw map[int64]struct{} floor; AOT shows a ~2% improvement that is not worth the implementation cost; IC/Migration both regress by ~18%. The maps heuristic ("if the underlying Go op already costs ~5 ns/op, dispatch strategy doesn't matter") transferred cleanly. Keep the MEP-24 set shape.

For the next container (structs)

Predict by per-op cost. Per-field load/store in Go is sub-nanosecond (slice index, struct field), much faster than a map op. If structs end up as []Cell indexed by slot, they will look like lists, and AOT-style slot-typed specialization should pay off. If structs end up as Go struct types reached via any, they will look like maps and not pay off. Build the three-package suite and let the numbers decide.

Caveats

  • Driver inlining. The benchmark driver calls list ops as Go methods, which Go inlines. A real vm2 bytecode interpreter dispatches through a for { switch ins.Op { ... } } loop, which inhibits some inlining and tends to widen the per-op overhead. This benchmark therefore measures the floor of each strategy; the absolute numbers in a bytecode interpreter would be higher across the board, but the ordering should be preserved (because the dispatch-loop overhead is the same constant added to every strategy).

  • One workload. fill_sum is the canonical list benchmark and matches what MEP-23 measures, but it is one workload. Map-heavy or string-heavy workloads might surface a different ordering. A follow-on MEP can broaden the corpus.

  • Go-only finding. The result that IC < Migration < AOT is contingent on Go's interface dispatch being cheap. On a C/asm interpreter or a JIT, the predicted MEP-26 / MEP-27 / MEP-28 ordering may still hold.

  • No GC pressure measured. Allocs/op is identical across strategies (2 allocations: the *vmListI64/*vmList header + the backing slice). Long-lived workloads with many short-lived lists would shift the balance toward Migration (which can be cheaper to allocate when starting from the int-specialized shape with a small initial slice).

Files added

  • runtime/vm2/listopt/doc.go, package-level documentation
  • runtime/vm2/listopt/baseline/baseline.go + test, reference points
  • runtime/vm2/listopt/mig/mig.go + test, MEP-26 prototype
  • runtime/vm2/listopt/ic/ic.go + test, MEP-27 prototype
  • runtime/vm2/listopt/aot/aot.go + test, MEP-28 prototype
  • runtime/vm2/stropt/doc.go, strings package documentation
  • runtime/vm2/stropt/baseline/baseline.go + test, today's MEP-24 §2 string shape
  • runtime/vm2/stropt/mig/mig.go + test, MEP-26 string prototype
  • runtime/vm2/stropt/ic/ic.go + test, MEP-27 string prototype (K=4 IC over shape tuples)
  • runtime/vm2/stropt/aot/aot.go + test, MEP-28 string prototype (3×3 specialized table)
  • runtime/vm2/mapopt/doc.go, maps package documentation
  • runtime/vm2/mapopt/baseline/baseline.go + test, today's MEP-24 §4 generic map shape
  • runtime/vm2/mapopt/mig/mig.go + test, MEP-26 map prototype (Empty/I64/Generic switch)
  • runtime/vm2/mapopt/ic/ic.go + test, MEP-27 map prototype (K=1 IC slot)
  • runtime/vm2/mapopt/aot/aot.go + test, MEP-28 map prototype (specialized I64xI64 handlers)
  • runtime/vm2/setopt/doc.go, sets package documentation
  • runtime/vm2/setopt/baseline/baseline.go + test, today's MEP-24-style generic set shape
  • runtime/vm2/setopt/mig/mig.go + test, MEP-26 set prototype (SetI64/Generic switch)
  • runtime/vm2/setopt/ic/ic.go + test, MEP-27 set prototype (K=1 IC slot)
  • runtime/vm2/setopt/aot/aot.go + test, MEP-28 set prototype (specialized I64 handlers)
  • website/docs/mep/mep-0029.md, this MEP

References

  • MEP-23, the cross-language baseline that established lists/fill_sum and strings/concat_loop as canonical workloads.
  • MEP-24 §2/§3, the current Flat-only string and generic []Cell list shapes that the Baseline packages mirror.
  • MEP-25, the data-model shape spec these prototypes implement, including the Inline/Flat/Rope string lattice.
  • MEP-26, MEP-27, MEP-28, the three design-stage strategy specs being evaluated here.

Open questions

  • Bytecode-level measurement. Re-run the comparison through the actual vm2 dispatch loop (not a Go-method driver) to verify the per-strategy ordering holds. Expected outcome: same ordering, larger absolute numbers.
  • Map equivalent (since resolved). The same three-way comparison has now been run on maps/fill_sum and sets/fill_probe; see the Maps and Sets sections above. Despite the much larger map gap vs Lua (4.86x vs 1.56x for lists), the gap turned out not to live in the map shape choice.
  • Cell-aware bytecode. Whether to specialize OpListGet into OpListGetInt so the consumer can avoid the Cell round-trip entirely. Likely worth doing as part of the AOT rollout.