Phase 19. Workspace cache + parallel fetch + perf

| Field | Value |
|---|---|
| MEP | MEP-57 §Phases · Phase 19 |
| Status | NOT STARTED |
| Started | |
| Landed | |
| Tracking issue | |
| Tracking PR | |

Gate

TestPhase19Perf: cold resolve + fetch of the 500-package fixture finishes within the perf budget (resolve under 800 ms, fetch under 5 s on a reference machine); warm resolve under 50 ms. Bench fails the build if a regression exceeds 15%.

Pass criteria:

  1. Cold resolve budget. 500-package synthetic graph, no cache, resolves under 800ms on the reference machine (Apple M2 Pro, 32GB RAM, 1Gbit fibre).
  2. Cold fetch budget. Same fixture, blob fetch+verify+extract under 5s, dominated by parallel HTTP/2 streams to the mock registry.
  3. Warm resolve budget. With every input cached, resolve completes under 50ms (lockfile load + manifest hash check only).
  4. Workspace cache hit. A 10-member workspace where all members share @mochi/json reuses the same extracted tree under $MOCHI_HOME/store/; on-disk size grows by exactly one copy.
  5. Regression gate. CI runs mochi pkg bench resolve and rejects any PR whose result exceeds the rolling-mean baseline by more than 15%.
  6. Cache GC. mochi pkg cache gc --keep-recent=14d --max-size=10GB evicts LRU entries to meet the budget; the test asserts post-GC size and recent retention.
  7. Parallelism scaling. Cold fetch at parallelism=1, 4, 8, 16 shows the expected speedup curve (sublinear past ~8 because of TCP and CPU limits, not Mochi-internal contention).

Goal-alignment audit

Perf is the day-to-day surface. A slow resolver poisons every other ergonomic win. The bar is uv (2024) for Python and Cargo (2024) for native: anything slower at parity scale is a regression we ship to users every CI run. The user-facing goal moved: "mochi build feels instantaneous on a clean checkout, not slower than cargo build".

The 800ms cold-resolve budget for 500 packages is loosely derived from uv's published benchmarks (research note 03 §6): uv resolves PyPI's transformers (~150 deps) in ~150ms cold. At linear scale 500 nodes is ~500ms; the 800ms budget allows headroom for Mochi-specific constraints (capability/target/compiler checks).

Parallel fetch over HTTP/2 multiplexing is where most of the wall-clock win lives. Single-stream sequential fetch of 500 blobs at 10ms RTT each is 5 seconds of pure RTT. With 8 concurrent streams over one HTTP/2 connection, the RTT cost drops roughly 8x to about 625ms, at which point bandwidth dominates.
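The RTT arithmetic above can be sketched with a deliberately crude model: requests go out in waves of `streams`, and each wave costs one round trip. The function name `rttCostMs` and the wave model are illustrative assumptions, not part of the Mochi codebase; bandwidth, server time, and TCP effects are ignored.

```go
package main

import "fmt"

// rttCostMs estimates the round-trip share of a fetch in milliseconds.
// Blobs are issued in waves of `streams` concurrent requests and every
// wave costs one RTT. Hypothetical helper for illustration only.
func rttCostMs(blobs, streams, rttMs int) int {
	if blobs <= 0 || streams <= 0 {
		return 0
	}
	waves := (blobs + streams - 1) / streams // ceiling division
	return waves * rttMs
}

func main() {
	for _, streams := range []int{1, 4, 8, 16} {
		fmt.Printf("streams=%d rtt_cost=%dms\n", streams, rttCostMs(500, streams, 10))
	}
}
```

For 500 blobs at 10ms this yields 5000ms, 1250ms, 630ms, and 320ms. The 8-stream figure is slightly above 625ms because the last wave is only partly full.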

Sub-phases

| # | Scope | Status | Commit |
|---|---|---|---|
| 19.0 | Shared workspace cache at $MOCHI_HOME/store/ | NOT STARTED | |
| 19.1 | HTTP/2 multiplexed parallel blob fetch | NOT STARTED | |
| 19.2 | Solver memoisation across workspace members | NOT STARTED | |
| 19.3 | Concurrent decompression and extraction | NOT STARTED | |
| 19.4 | mochi pkg bench resolve perf harness vs golden baseline | NOT STARTED | |
| 19.5 | Regression gate (15% bound) integrated into CI | NOT STARTED | |
| 19.6 | Cache GC: LRU eviction with mochi pkg cache gc | NOT STARTED | |
| 19.7 | Profiling instrumentation (--trace flag, pprof endpoint) | NOT STARTED | |
| 19.8 | Lockfile load fast-path: mmap + skip-on-hash-match | NOT STARTED | |

Sub-phase 19.0 — Shared cache

Canonical root $MOCHI_HOME (default: ~/.cache/mochi on Linux/macOS, %LOCALAPPDATA%\mochi on Windows; full resolution order in phase 0 §conventions):

$MOCHI_HOME/
store/
blobs/<bb>/<aa>/<hex> # raw tarballs (owner: Phase 9)
extracted/<hex>/ # extracted trees (owner: Phase 9)
locks/<hex>.lock # fcntl locks (owner: Phase 9.3)
metrics/ # last-access timestamps for GC (owner: Phase 19)
index/<bucket>/<scope>/<name> # index JSONL (owner: Phase 8)

Phase 19 owns only store/metrics/ and the install/GC code paths; the storage schema itself is owned by the phases that introduce each artefact. This phase pins the canonical paths so all installer + GC + bench code agrees on disk layout.

The store is the single canonical location for every mochi invocation on the machine. Workspaces with multiple members share extracted/<hex>/ by hardlink:

// pkg/pkgstore/install.go
func InstallToWorkspace(workspaceRoot, blake3 string) error {
	storeDir := storePath(blake3)
	targetDir := filepath.Join(workspaceRoot, ".mochi/deps", blake3)
	if exists(targetDir) {
		return nil // already installed in this workspace
	}
	return hardlinkTree(storeDir, targetDir)
}

Installation uses hardlinks on POSIX and junction points on Windows. When the store and the workspace sit on different filesystems, it falls back to a copy and raises a user warning (M057_PERF_E003).
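A minimal sketch of the POSIX side of this fallback, assuming the cross-filesystem case shows up as EXDEV from os.Link. The bodies of `hardlinkTree`, `linkOrCopy`, and `copyFile` here are hypothetical illustrations, not the Phase 19 implementation; the sketch covers regular files and directories only and does not handle the Windows junction path.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"io/fs"
	"os"
	"path/filepath"
	"syscall"
)

// linkOrCopy hardlinks src to dst and falls back to a byte copy when the
// two paths are on different filesystems (EXDEV on POSIX).
func linkOrCopy(src, dst string) error {
	err := os.Link(src, dst)
	if err == nil || !errors.Is(err, syscall.EXDEV) {
		return err
	}
	return copyFile(src, dst)
}

// copyFile copies src to dst. Permission bits are not preserved here.
func copyFile(src, dst string) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()
	out, err := os.Create(dst)
	if err != nil {
		return err
	}
	if _, err := io.Copy(out, in); err != nil {
		out.Close()
		return err
	}
	return out.Close()
}

// hardlinkTree recreates the directory structure of src under dst and
// links (or copies) every file it finds.
func hardlinkTree(src, dst string) error {
	return filepath.WalkDir(src, func(path string, d fs.DirEntry, walkErr error) error {
		if walkErr != nil {
			return walkErr
		}
		rel, err := filepath.Rel(src, path)
		if err != nil {
			return err
		}
		target := filepath.Join(dst, rel)
		if d.IsDir() {
			return os.MkdirAll(target, 0o755)
		}
		return linkOrCopy(path, target)
	})
}

// runDemo builds a tiny store entry in a temp dir and installs it.
func runDemo() error {
	root, err := os.MkdirTemp("", "mochi-demo")
	if err != nil {
		return err
	}
	defer os.RemoveAll(root)
	src := filepath.Join(root, "store", "abc123")
	if err := os.MkdirAll(filepath.Join(src, "lib"), 0o755); err != nil {
		return err
	}
	if err := os.WriteFile(filepath.Join(src, "lib", "a.txt"), []byte("data"), 0o644); err != nil {
		return err
	}
	dst := filepath.Join(root, "workspace", ".mochi", "deps", "abc123")
	if err := hardlinkTree(src, dst); err != nil {
		return err
	}
	got, err := os.ReadFile(filepath.Join(dst, "lib", "a.txt"))
	if err != nil {
		return err
	}
	if string(got) != "data" {
		return fmt.Errorf("unexpected content %q", got)
	}
	return nil
}

func main() {
	if err := runDemo(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

A real installer would also want to make the install atomic (for example by linking into a temporary directory and renaming it), since a partially linked target would otherwise satisfy the exists check on the next run.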

Sub-phase 19.1 — Parallel fetch

Phase 9.1 already implements blob fetch via Phase 8's HTTP/2 client. Phase 19 wires the parallelism around it:

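// pkg/pkgstore/fetch.go
type FetchPool struct {
	Concurrency int
	Store       Store
	Verifier    Verifier
}

// FetchAll fetches every locked package, keeping at most p.Concurrency
// fetches in flight. The first failure cancels the remaining work.
func (p *FetchPool) FetchAll(ctx context.Context, lock *Lockfile) error {
	g, gctx := errgroup.WithContext(ctx)
	sem := make(chan struct{}, p.Concurrency)
	for _, pkg := range lock.Packages {
		pkg := pkg // per-iteration copy (required before Go 1.22)
		select {
		case sem <- struct{}{}: // acquire a slot
		case <-gctx.Done(): // a fetch failed or ctx was cancelled: stop scheduling
			if err := g.Wait(); err != nil {
				return err
			}
			return ctx.Err()
		}
		g.Go(func() error {
			defer func() { <-sem }() // release the slot
			return p.fetchOne(gctx, pkg)
		})
	}
	return g.Wait()
}

func (p *FetchPool) fetchOne(ctx context.Context, pkg LockedPackage) error {
	if exists(extractedPath(pkg.BLAKE3)) {
		return nil // already extracted
	}
	lock := p.Store.AcquireBlobLock(pkg.BLAKE3)
	defer lock.Release()
	if exists(extractedPath(pkg.BLAKE3)) {
		return nil // another process extracted it while we waited for the lock
	}
	rc, err := p.Store.Fetch(ctx, pkg.BLAKE3)
	if err != nil {
		return err
	}
	defer rc.Close()
	return p.Verifier.VerifyAndExtract(rc, pkg)
}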

Concurrency default = 8. Override via MOCHI_PARALLELISM or --parallelism=N flag. The semaphore limits in-flight fetches; the HTTP/2 transport multiplexes all of them over one TCP connection per host.
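One way the default and the two overrides could be combined is sketched below. The precedence (flag over environment variable over default) is an assumption, as is the helper name `resolveParallelism`; the source only states that both overrides exist. Rejecting non-positive values matters because a zero-capacity semaphore channel would block the first send forever and a negative capacity panics in make.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

const defaultParallelism = 8

// resolveParallelism picks the fetch concurrency. flagValue is the parsed
// --parallelism flag (0 when unset); envValue is the raw MOCHI_PARALLELISM
// string ("" when unset). Flag-over-env precedence is an assumption.
func resolveParallelism(flagValue int, envValue string) int {
	if flagValue > 0 {
		return flagValue
	}
	if n, err := strconv.Atoi(envValue); err == nil && n > 0 {
		return n
	}
	return defaultParallelism
}

func main() {
	n := resolveParallelism(0, os.Getenv("MOCHI_PARALLELISM"))
	fmt.Println("fetch parallelism:", n)
}
```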

Sub-phase 19.2 — Solver memoisation across workspace

In a workspace with members A, B, C all depending on @mochi/json, the solver is invoked once per member but the version search space for shared deps is identical. Memoise:

// pkg/pkgsolver/cache.go
type SolverCache struct {
	Manifests map[PackageKey]*Manifest  // pkg+ver -> manifest
	Ranges    map[string]*ResolvedRange // dep range -> resolved versions
}

func (s *Solver) ResolveWorkspace(ws *Workspace) (map[Member]Lockfile, error) {
	cache := NewSolverCache() // shared by every member of this workspace
	out := map[Member]Lockfile{}
	for _, m := range ws.Members {
		sol, err := s.SolveWithCache(m, cache)
		if err != nil {
			return nil, err
		}
		out[m] = sol
	}
	return out, nil
}

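For a 10-member workspace where 80% of the dep set is shared, memoisation bounds the cold-resolve speedup at 5x over naive per-member resolution, because only the unshared 20% is re-solved per member. Counting the first member's full solve, a simple linear cost model gives about 3.6x (10 / (1 + 9 × 0.2)).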
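The speedup estimate can be sanity-checked with a simple linear cost model, which is an assumption rather than a measured result: each member costs one unit to solve naively, the first member pays full price under memoisation, and every later member pays only for its unshared fraction. The function `memoSpeedup` is hypothetical.

```go
package main

import "fmt"

// memoSpeedup returns naive cost divided by memoised cost for a workspace
// with `members` members whose dependency sets overlap by `shared` (0..1).
// Linear cost model, for illustration only.
func memoSpeedup(members int, shared float64) float64 {
	if members <= 0 {
		return 0
	}
	naive := float64(members)
	memoised := 1 + float64(members-1)*(1-shared)
	return naive / memoised
}

func main() {
	for _, n := range []int{2, 10, 100} {
		fmt.Printf("members=%d speedup=%.2fx\n", n, memoSpeedup(n, 0.8))
	}
}
```

Under this model the speedup approaches 1 / (1 - shared) as the workspace grows, which is 5x at 80% sharing; a 10-member workspace lands near 3.6x.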

Sub-phase 19.3 — Concurrent extraction

The extraction path in Phase 9.5 is single-threaded per tarball but multiple tarballs can extract in parallel:

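// pkg/pkgblob/extract_concurrent.go
func ExtractConcurrent(blobs []Blob, dest string, parallelism int) error {
	g, gctx := errgroup.WithContext(context.Background())
	sem := make(chan struct{}, parallelism)
	for _, b := range blobs {
		b := b // per-iteration copy (required before Go 1.22)
		select {
		case sem <- struct{}{}: // acquire a slot
		case <-gctx.Done(): // an extraction failed: stop scheduling new ones
			return g.Wait()
		}
		g.Go(func() error {
			defer func() { <-sem }() // release the slot
			return extractOne(b, dest)
		})
	}
	return g.Wait()
}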

The bottleneck is small-file syscall latency (open, write, close per extracted file, plus one fsync per tarball). For 500 packages averaging 50 files each, that is 25,000 small writes; parallel I/O hides most of the latency.

fsync policy: per-tarball fsync after the last file in that tarball, not per file. Acceptable durability trade-off (a crash mid-extract leaves a partial extracted tree, which the next install retries).

Sub-phase 19.4 — mochi pkg bench resolve

// cmd/mochi/bench.go
func cmdBenchResolve(c *cli.Context) error {
	fixtures := loadFixtures(c.String("fixture-dir"))
	results := []Result{}
	for _, fix := range fixtures {
		warmups := 2
		iters := 10
		for i := 0; i < warmups; i++ {
			runResolve(fix) // warm-up runs are not timed
		}
		var times []time.Duration
		for i := 0; i < iters; i++ {
			t := time.Now()
			runResolve(fix)
			times = append(times, time.Since(t))
		}
		results = append(results, Result{
			Fixture: fix.Name,
			P50:     percentile(times, 0.50),
			P95:     percentile(times, 0.95),
			P99:     percentile(times, 0.99),
		})
	}
	printResults(results)
	if c.String("baseline") != "" {
		return compareBaseline(results, c.String("baseline"), c.Float64("threshold"))
	}
	return nil
}

Output:

fixture               p50    p95    p99    vs baseline
500-pkg-cold-resolve  742ms  810ms  832ms  +3.1% (within 15% bound)
500-pkg-cold-fetch    4.21s  4.50s  4.61s  +1.8% (within 15% bound)
500-pkg-warm-resolve  38ms   45ms   52ms   -8.3% (improved)
workspace-10          1.12s  1.20s  1.25s  +4.4% (within 15% bound)
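The harness relies on a percentile helper and a baseline comparison that the source does not show. A minimal sketch of both follows, assuming a nearest-rank percentile and a relative-delta check; the names `regressed` and `ms` are hypothetical, and the real `compareBaseline` may well differ.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// percentile returns the nearest-rank percentile (0 < p <= 1) of samples.
// The input slice is not modified.
func percentile(samples []time.Duration, p float64) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := int(math.Ceil(p * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// regressed reports whether current is slower than baseline by more than
// threshold, expressed as a fraction (0.15 means 15%).
func regressed(current, baseline time.Duration, threshold float64) bool {
	if baseline <= 0 {
		return false // no usable baseline: nothing to compare against
	}
	delta := float64(current-baseline) / float64(baseline)
	return delta > threshold
}

// ms converts millisecond counts to durations (demo convenience).
func ms(values ...int) []time.Duration {
	out := make([]time.Duration, len(values))
	for i, v := range values {
		out[i] = time.Duration(v) * time.Millisecond
	}
	return out
}

func main() {
	times := ms(40, 10, 100, 30, 20, 60, 50, 90, 80, 70)
	p50 := percentile(times, 0.50)
	fmt.Println("p50:", p50, "p95:", percentile(times, 0.95))
	fmt.Println("regressed vs 45ms baseline:", regressed(p50, 45*time.Millisecond, 0.15))
}
```

With only 10 timed iterations, nearest-rank p95 and p99 both resolve to the slowest sample, so the tail columns carry little information until the iteration count is raised.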

Sub-phase 19.5 — CI regression gate

A nightly CI job plus a PR-gated comparison run:

# .github/workflows/bench.yml
name: Package system bench

on:
  schedule:
    - cron: "0 7 * * *" # nightly UTC; regenerates baseline.json
  pull_request:
    paths:
      - 'pkg/pkgsolver/**'
      - 'pkg/pkgblob/**'
      - 'pkg/pkgstore/**'
      - 'pkg/pkgregistry/**'
      - 'bench/**'
      - '.github/workflows/bench.yml'

permissions:
  contents: read
  # The nightly job proposes the new baseline.json via a follow-up PR (see
  # the "Publish new baseline" step). PR runs never reach that step.
  pull-requests: write

concurrency:
  group: bench-${{ github.event_name }}-${{ github.head_ref || 'main' }}
  cancel-in-progress: ${{ github.event_name == 'pull_request' }}

jobs:
  bench:
    # Refuse fork PRs: the bench machine type (large runner) is paid time
    # and exposing it to untrusted code is a DoS surface.
    if: github.event_name != 'pull_request' || github.event.pull_request.head.repo.full_name == github.repository
    runs-on: ubuntu-24.04-large # 8-core runner for stable timing
    timeout-minutes: 60
    steps:
      - uses: actions/checkout@v6
        with:
          fetch-depth: 0 # bench/baseline.json history
      - uses: actions/setup-go@v6
        with:
          go-version-file: go.mod
          cache-dependency-path: go.sum
      # Pin clock, locale, CPU governor for stable measurement.
      - name: Pin bench environment
        run: |
          echo "TZ=UTC" >> "$GITHUB_ENV"
          echo "LC_ALL=C.UTF-8" >> "$GITHUB_ENV"
          echo "SOURCE_DATE_EPOCH=$(git log -1 --format=%ct)" >> "$GITHUB_ENV"
          sudo cpupower frequency-set --governor performance || true
        shell: bash
      - run: go build -trimpath -o mochi ./cmd/mochi
      - name: Run bench
        run: |
          ./mochi pkg bench resolve \
            --baseline=bench/baseline.json \
            --threshold=0.15 \
            --report=bench/report.json
      - name: Upload report
        if: always()
        uses: actions/upload-artifact@v5
        with:
          name: bench-report-${{ github.run_id }}
          path: bench/report.json
          retention-days: 90
      # Nightly only: regenerate baseline.json and open a follow-up PR
      # so a human reviews the rolling drift.
      - name: Publish new baseline
        if: github.event_name == 'schedule'
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        # Note: gh pr create needs baseline.json committed and pushed to a
        # branch first; that push requires contents: write on the token.
        run: |
          cp bench/report.json bench/baseline.json
          gh pr create --title "bench: rolling baseline ${{ github.run_id }}" \
            --body "Auto-generated from nightly bench run" \
            --label "bench-baseline" \
            --base main

The baseline.json is regenerated nightly on the main branch after the smoke tests gate. PRs run the bench against the rolling baseline. Exceeding the 15% threshold marks the PR as failing the bench gate (separate from the unit-test gate; the PR can be merged with a bench-exempt label and a written reason in the PR body).

Workflow security notes:

  • pull-requests: write is set at workflow level but only the nightly step (if: github.event_name == 'schedule') ever invokes gh pr create. PR runs never reach that step.
  • Fork PRs are refused via the if: guard on the job (head.repo.full_name == github.repository); they receive the same skipped result a path filter miss would produce.
  • Two steps touch credentials. actions/upload-artifact@v5 needs no permission beyond contents: read. The nightly "Publish new baseline" step passes GITHUB_TOKEN to gh and is the only consumer of pull-requests: write.

Sub-phase 19.6 — Cache GC

mochi pkg cache gc                                     # default: keep recent 14d, max 10GB
mochi pkg cache gc --keep-recent=30d --max-size=20GB
mochi pkg cache gc --dry-run                           # show what would be evicted
mochi pkg cache gc --prune-orphans                     # remove extracted trees with no in-store blob

// pkg/pkgstore/gc.go
func GC(opts GCOptions) (*GCReport, error) {
	entries := walkExtractedTrees()
	// Oldest first, so eviction proceeds in LRU order.
	sort.Slice(entries, func(i, j int) bool {
		return entries[i].LastAccess.Before(entries[j].LastAccess)
	})
	cutoff := time.Now().Add(-opts.KeepRecent)
	var report GCReport
	var totalSize int64
	for _, e := range entries {
		totalSize += e.Size
	}
	for _, e := range entries {
		if totalSize <= opts.MaxSize {
			break // size budget met
		}
		if opts.RespectRecent && e.LastAccess.After(cutoff) {
			continue // inside the keep-recent window
		}
		// Assumes evict returns an error; a failure maps to M057_PERF_E002.
		if err := evict(e); err != nil {
			return &report, err
		}
		totalSize -= e.Size
		report.Evicted = append(report.Evicted, e.Path)
		report.BytesFreed += e.Size
	}
	return &report, nil
}

LastAccess is derived from filesystem atime where supported; on noatime filesystems, a per-entry .lastused file is written on each cache hit.
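The marker-file half of this scheme can be sketched as follows. This is a hedged illustration, not the Phase 19 code: `touchLastUsed` and `lastAccess` are hypothetical names, the sketch uses the marker's mtime as the access time, and it leaves out the atime branch because reading atime is platform-specific in Go.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"time"
)

const markerName = ".lastused"

// touchLastUsed records a cache hit by creating the marker file if needed
// and setting its access and modification times to now.
func touchLastUsed(entryDir string, now time.Time) error {
	marker := filepath.Join(entryDir, markerName)
	f, err := os.OpenFile(marker, os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	return os.Chtimes(marker, now, now)
}

// lastAccess returns the marker's mtime, falling back to the entry
// directory's own mtime when no marker has been written yet.
func lastAccess(entryDir string) (time.Time, error) {
	info, err := os.Stat(filepath.Join(entryDir, markerName))
	if errors.Is(err, fs.ErrNotExist) {
		info, err = os.Stat(entryDir)
	}
	if err != nil {
		return time.Time{}, err
	}
	return info.ModTime(), nil
}

// runDemo touches a temp entry and reads the timestamp back.
func runDemo() error {
	dir, err := os.MkdirTemp("", "mochi-gc-demo")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)
	hit := time.Date(2025, 1, 2, 3, 4, 0, 0, time.UTC)
	if err := touchLastUsed(dir, hit); err != nil {
		return err
	}
	got, err := lastAccess(dir)
	if err != nil {
		return err
	}
	if !got.Equal(hit) {
		return fmt.Errorf("last access = %v, want %v", got, hit)
	}
	return nil
}

func main() {
	if err := runDemo(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("marker round-trip ok")
}
```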

Sub-phase 19.7 — Profiling

mochi --trace=trace.json resolve writes a Chrome trace event format file; openable in chrome://tracing or https://ui.perfetto.dev/. Spans:

  • solver.decide, solver.propagate, solver.backtrack
  • fetch.http, fetch.dual_hash, fetch.extract
  • lockfile.write, lockfile.canonical_check
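For orientation, a span in the Chrome trace event format is a JSON object with a name, a phase (`"X"` for a complete event), a start timestamp and a duration in microseconds, and process/thread ids. The emitter below is a minimal sketch under that assumption; the `tracer` type and its methods are hypothetical, and it is not safe for concurrent use without a mutex.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// traceEvent is one "complete" event (ph = "X"); ts and dur are in
// microseconds relative to the tracer's origin.
type traceEvent struct {
	Name string `json:"name"`
	Ph   string `json:"ph"`
	Ts   int64  `json:"ts"`
	Dur  int64  `json:"dur"`
	Pid  int    `json:"pid"`
	Tid  int    `json:"tid"`
}

// tracer collects spans in memory. Hypothetical type for illustration.
type tracer struct {
	origin time.Time
	events []traceEvent
}

// span records a finished span that started at start and ran for d.
func (t *tracer) span(name string, start time.Time, d time.Duration) {
	t.events = append(t.events, traceEvent{
		Name: name,
		Ph:   "X",
		Ts:   start.Sub(t.origin).Microseconds(),
		Dur:  d.Microseconds(),
		Pid:  1,
		Tid:  1,
	})
}

// marshal renders the object form of the format: {"traceEvents": [...]}.
func (t *tracer) marshal() ([]byte, error) {
	return json.Marshal(struct {
		TraceEvents []traceEvent `json:"traceEvents"`
	}{t.events})
}

// demoTrace emits one deterministic span so the output is reproducible.
func demoTrace() (string, error) {
	origin := time.Unix(0, 0)
	t := &tracer{origin: origin}
	t.span("solver.decide", origin.Add(time.Millisecond), 250*time.Microsecond)
	out, err := t.marshal()
	return string(out), err
}

func main() {
	out, err := demoTrace()
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```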

A --pprof=localhost:6060 flag serves net/http/pprof on the loopback interface for live profiling (a bare :6060 would listen on every interface). Disabled in default builds; enabled only with --profile.

Sub-phase 19.8 — Lockfile fast path

A warm mochi build parses the lockfile, hashes the current manifest (BLAKE3 of the canonical TOML), and compares the result against the manifest hash stored in the lockfile. If they are equal, no resolution is needed:

func WarmBuild(m *Manifest) (*Lockfile, error) {
	lock, err := pkglock.ParseFile("mochi.lock")
	if err != nil {
		return nil, err
	}
	expectedHash := pkglock.HashManifestForLock(m)
	if lock.ManifestHash == expectedHash {
		return lock, nil // happy path: manifest unchanged since the lock was written
	}
	return resolveAndWriteLock(m, lock)
}

Mmap-based read: the lockfile is mmapped (read-only) for parse. For a typical 100KB lockfile, parse drops from ~5ms to ~1ms.

Files changed

| File | Purpose | Owner |
|---|---|---|
| pkg/pkgstore/store.go | Shared cache | Owner |
| pkg/pkgstore/install.go | Hardlink installer | Owner |
| pkg/pkgstore/fetch.go | Parallel pool | Owner |
| pkg/pkgstore/gc.go | LRU GC | Owner |
| pkg/pkgsolver/cache.go | Cross-member memoisation | Owner |
| pkg/pkgblob/extract_concurrent.go | Parallel extract | Owner |
| pkg/pkgtrace/trace.go | Chrome trace emitter | Owner |
| cmd/mochi/bench.go | mochi pkg bench handler | Owner |
| cmd/mochi/cache_gc.go | mochi pkg cache gc handler | Owner |
| tests/pkgsystem/perf/500-pkg/* | Cold/warm budgets | Owner |
| tests/pkgsystem/perf/workspace-10/* | Member cache reuse | Owner |
| tests/pkgsystem/perf/regress/* | Bench fail injection | Owner |
| bench/baseline.json | Rolling perf baseline | Owner |
| .github/workflows/bench.yml | Nightly + PR gate | Owner |

Error code surface

| Code | Trigger |
|---|---|
| M057_PERF_E001 | Bench result exceeds baseline by threshold. |
| M057_PERF_E002 | Cache GC failed (permissions, disk error). |
| M057_PERF_E003 | Hardlink failed across filesystems; fell back to copy. |

Test set

  • TestPhase19SharedStore — same blob hardlinked into multiple workspace members.
  • TestPhase19ParallelFetch — concurrent fetches complete within budget.
  • TestPhase19SolverMemo — workspace resolves with cache hits.
  • TestPhase19ConcurrentExtract — parallel extraction within wall-clock budget.
  • TestPhase19BenchHarness — mochi pkg bench produces stable results across runs.
  • TestPhase19BenchRegress — synthetic regression fixture triggers fail.
  • TestPhase19CacheGC — LRU eviction meets size budget.
  • TestPhase19Trace — trace.json valid Chrome trace event format.
  • TestPhase19LockFastPath — manifest unchanged skips resolve.

Performance targets (reference machine)

From research note 05 §8 and research note 08 §14:

| Operation | Budget | Notes |
|---|---|---|
| Cold resolve, 500 pkgs | 800ms p95 | solver only; mock registry, all entries on disk |
| Cold fetch + extract, 500 pkgs, ~50MB | 5s p95 | over HTTP/2 to local mock |
| Warm resolve | 50ms p95 | manifest unchanged, lockfile hit |
| Workspace 10 members, shared cache | 1.2s p95 | first-run cold; second-run warm under 100ms |
| Cache GC, 10000 entries | 2s | mostly stat() syscalls |
| Lockfile mmap parse | 1ms | 100KB lockfile |

Open questions

  • Whether to ship a precomputed bench baseline per supported OS/arch or just linux/amd64; current plan: linux/amd64 baseline; other architectures track relative regression only.
  • Whether to expose a mochi resolve --dry-run --json --profile for editor consumption; deferred to LSP work.
  • Whether the cache supports remote (shared via NFS / S3) backends; deferred to v1.1.

Cross-references