Comparing changes

@1024

….72x) * checkpoint: before iteration 7 * checkpoint: before iteration 8 * drop deprecated mx.metal.* fallbacks, use mx.get/reset_peak_memory Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * checkpoint: before iteration 1 * checkpoint: before iteration 2 * checkpoint: before iteration 3 * checkpoint: before iteration 4 * checkpoint: before iteration 5 * checkpoint: before iteration 6 * checkpoint: before iteration 7 * reduce INCONCLUSIVE verdicts: WEAK_KEEP + more bench runs - Add WEAK_KEEP verdict when composite score > 1.0 but no single metric clears individual threshold - Bump warmup 3→5, bench runs 10→20 for tighter variance Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * clear experiment log for re-evaluation with WEAK_KEEP logic Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * checkpoint: before iteration 1 * exp: wired-limit-512mb [score=1.0, verdict=KEEP] * checkpoint: before iteration 2 * checkpoint: before iteration 3 * checkpoint: before iteration 4 * checkpoint: before iteration 5 * research loop results: 5 iterations + experiment log Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * checkpoint: before iteration 1 * checkpoint: before iteration 2 * checkpoint: before iteration 3 * checkpoint: before iteration 4 * checkpoint: before iteration 5 * checkpoint: before iteration 6 * checkpoint: before iteration 7 * checkpoint: before iteration 8 * checkpoint: before iteration 9 * loop: dedup gate, expanded search areas, upstream research round 2 - Add Gate 1.5: reject duplicate experiment names against experiments.jsonl - Show full experiment history (not last 5) + explicit tried-names list - Add search areas 12-20: sdpa, graph-materialization, stream-pipelining, weight-format, matmul-ordering, addmm-fusion, dtype-cast-cleanup, async-pipeline - Eliminate areas 8 (layernorm already dispatched) and 16 (compile already fuses) - Update decision.schema.json with new enum values - Add upstream_research_2_findings.md compound note - Add deep_research_prompt.md for external research tooling - Add btca-local skill Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * loop: integrate deep research findings, prioritized search areas Key discoveries from Gemini deep dive + ChatGPT recommendations: ROOT CAUSE: dilated convs in refiner force im2col fallback (9x memory inflation, excluded from MLX implicit GEMM per PR #3147). This explains the 2143MB peak memory and why cache-limit experiments regressed. - Restructure loop prompt with priority-ordered search areas - Top 3: BF16 full pipeline, unroll/reroll contiguity audit, dilated conv fix - Add search area #21: refiner-dilated-conv-fix - Mark #12 sdpa-attention as already applied - Add deep_dive_findings.md compound note with revised priority order - Add external research docs (chatgpt, gemini deep dive) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * checkpoint: before iteration 1 * checkpoint: before iteration 2 * fix: handle WEAK_KEEP verdict across loop, compound_note, summarize score_experiment.py emits WEAK_KEEP but downstream scripts rejected it. Now accepted everywhere — treated as KEEP (commit, no revert). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * checkpoint: before iteration 1 * checkpoint: before iteration 2 * exp: attn-skip-window-dim-global [score=1.002, verdict=WEAK_KEEP] * checkpoint: before iteration 3 * checkpoint: before iteration 4 * exp: qkv-split-first-global-attn [score=1.0088, verdict=WEAK_KEEP] * checkpoint: before iteration 5 * exp: qkv-split-first-windowed-attn [score=1.0054, verdict=WEAK_KEEP] * checkpoint: before iteration 6 * checkpoint: before iteration 7 * checkpoint: before iteration 8 * checkpoint: before iteration 1 * loop: add steering directive + log search_area to experiments Proposer now gets MANDATORY area targeting for top-priority areas. search_area tracked in experiments.jsonl via --decision flag. WEAK_KEEP updates best_result.json. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * checkpoint: before iteration 2 * checkpoint: before iteration 3 * Revert "checkpoint: before iteration 3" This reverts commit 0751e1a. * checkpoint: before iteration 1 * fix: sync validate_decision.py search areas with decision schema Added Phase 2/3 areas (sdpa-attention, graph-materialization, stream-pipelining, weight-format, matmul-ordering, addmm-fusion, dtype-cast-cleanup, async-pipeline, refiner-dilated-conv-fix). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * upstream: mine CorridorKey-Engine, EZ-CorridorKey, MarcelLieb batch fork New findings: - CorridorKey-Engine: tiled refiner, token routing/LTRM, 4K benchmarks - EZ-CorridorKey (426★): edge-aware tile blend weights - MarcelLieb: batched frame processing + parallel postprocess Updated: program.md (areas 22-23), loop.sh (edge-aware-blend area), validate_decision.py + decision.schema.json (new enum), summarize_experiment.py (refreshed suggestion list) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * checkpoint: before iteration 1 * checkpoint: before iteration 2 * checkpoint: before iteration 3 * checkpoint: before iteration 4 * bump loop benchmark resolution to 1024x1024 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * switch to 1024x1024: regenerate golden fixture, update test tolerances - Add --img-size flag to dump_pytorch_reference.py - Regenerate golden.npz at 1024x1024 - Update IMG_SIZE=1024 in conftest, compute backbone shapes dynamically - Relax parity tolerances for higher-res drift (~10-50x vs 512) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * add FPS reporting to all benchmark scripts Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * checkpoint: before iteration 1 * relax fidelity threshold to 1e-1 for 1024x1024 drift Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * checkpoint: before iteration 1 * experiment: metal dilated conv kernel — REVERT 3 approaches tested to bypass im2col for refiner dilated convs: - naive Metal kernel (1.87x slower), SIMD Metal kernel (2.8x slower), sub-pixel decomposition (5% latency + 9% memory regression) - im2col+GEMM leverages AMX hardware; bypassing it = always slower - closes research program items #11 and #21 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * checkpoint: before iteration 1 * checkpoint: before iteration 2 * checkpoint: before iteration 3 * checkpoint: before iteration 4 * checkpoint: before iteration 5 * fix: resolution-aware loop scoring + clean experiment slate - Pass --baseline to score_experiment.py (was using 512 baseline for 1024 runs) - Make resolution configurable via RESOLUTION env var (default 1024) - Baseline file now resolution-keyed: benchmark_baseline_${RESOLUTION}.json - Dynamic baseline numbers in proposer prompt (was hardcoded 120ms/119ms) - Clear experiment history for fresh 1024 run Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * checkpoint: before iteration 1 * checkpoint: before iteration 2 * exp: refiner-bf16-default [score=1.1333, verdict=KEEP] * checkpoint: before iteration 3 * checkpoint: before iteration 1 * exp: precomputed-unroll-reroll-perms [score=1.1376, verdict=KEEP] * checkpoint: before iteration 2 * checkpoint: before iteration 3 * exp: einsum-fused-output-proj [score=1.1413, verdict=KEEP] * checkpoint: before iteration 4 * checkpoint: before iteration 5 * checkpoint: before iteration 6 * exp: refiner-tiled-spatial-processing [score=1.1187, verdict=KEEP] * checkpoint: before iteration 7 * checkpoint: before iteration 8 * checkpoint: before iteration 9 * exp: dual-stream-decoder-dispatch [score=1.1307, verdict=KEEP] * checkpoint: before iteration 10 * exp: extended-stream-pipeline-disable-compile-forward [score=1.1099, verdict=KEEP] * checkpoint: before iteration 11 * checkpoint: before iteration 12 * exp: folded-bn-decoder-inference [score=1.1093, verdict=KEEP] * checkpoint: before iteration 13 * checkpoint: before iteration 14 * exp: split-qkv-contiguous-matmul [score=1.117, verdict=KEEP] * checkpoint: before iteration 15 * exp: pretranspose-mlp-weights-contiguous-matmul [score=1.1058, verdict=KEEP] * checkpoint: before iteration 16 * exp: conv1x1-addmm-bypass-decoder-refiner [score=1.1009, verdict=KEEP] * checkpoint: before iteration 17 * checkpoint: before iteration 18 * exp: defer-fp32-cast-coarse-sigmoid-bf16 [score=1.1132, verdict=KEEP] * checkpoint: before iteration 19 * exp: bf16-final-addition-sigmoid-defer-all-casts [score=1.1023, verdict=KEEP] * checkpoint: before iteration 20 * exp: decoder-refiner-materialization-barrier [score=1.1138, verdict=KEEP] * checkpoint: before iteration 21 * exp: remove-stage-gc-python-overhead [score=1.219, verdict=KEEP] * checkpoint: before iteration 22 * exp: safe-quantize-backbone-stages-1-3 [score=1.217, verdict=KEEP] * checkpoint: before iteration 23 * exp: tile-eval-between-tiles-peak-memory [score=1.1994, verdict=KEEP] * checkpoint: before iteration 24 * exp: split-fuse-decoder-avoid-concat [score=1.2106, verdict=KEEP] * checkpoint: before iteration 25 * exp: refiner-no-tiling-at-1024 [score=1.2523, verdict=KEEP] * checkpoint: before iteration 26 * exp: cache-limit-1536mb-buffer-reuse [score=1.2494, verdict=KEEP] * checkpoint: before iteration 27 * exp: async-eval-backbone-decoder-overlap [score=1.2523, verdict=KEEP] * checkpoint: before iteration 28 * checkpoint: before iteration 29 * checkpoint: before iteration 30 * checkpoint: before iteration 31 * checkpoint: before iteration 32 * checkpoint: before iteration 33 * checkpoint: before iteration 34 * checkpoint: before iteration 35 * checkpoint: before iteration 36 * checkpoint: before iteration 37 * checkpoint: before iteration 38 * checkpoint: before iteration 39 * checkpoint: before iteration 40 * checkpoint: before iteration 41 * checkpoint: before iteration 42 * checkpoint: before iteration 43 * checkpoint: before iteration 44 * checkpoint: before iteration 45 * checkpoint: before iteration 46 * checkpoint: before iteration 47 * checkpoint: before iteration 48 * checkpoint: before iteration 49 * checkpoint: before iteration 50 * checkpoint: before iteration 51 * checkpoint: before iteration 52 * checkpoint: before iteration 53 * checkpoint: before iteration 54 * checkpoint: before iteration 55 * checkpoint: before iteration 56 * checkpoint: before iteration 57 * checkpoint: before iteration 58 * checkpoint: before iteration 59 * checkpoint: before iteration 60 * checkpoint: before iteration 61 * checkpoint: before iteration 62 * checkpoint: before iteration 63 * checkpoint: before iteration 64 * checkpoint: before iteration 65 * checkpoint: before iteration 66 * checkpoint: before iteration 67 * checkpoint: before iteration 68 * checkpoint: before iteration 69 * checkpoint: before iteration 70 * checkpoint: before iteration 71 * checkpoint: before iteration 72 * checkpoint: before iteration 73 * checkpoint: before iteration 74 * checkpoint: before iteration 75 * checkpoint: before iteration 76 * checkpoint: before iteration 77 * checkpoint: before iteration 78 * checkpoint: before iteration 79 * checkpoint: before iteration 80 * checkpoint: before iteration 81 * checkpoint: before iteration 82 * checkpoint: before iteration 83 * checkpoint: before iteration 84 * checkpoint: before iteration 85 * checkpoint: before iteration 86 * checkpoint: before iteration 87 * checkpoint: before iteration 88 * checkpoint: before iteration 89 * checkpoint: before iteration 90 * checkpoint: before iteration 91 * checkpoint: before iteration 92 * checkpoint: before iteration 93 * checkpoint: before iteration 94 * checkpoint: before iteration 95 * checkpoint: before iteration 96 * checkpoint: before iteration 97 * checkpoint: before iteration 98 * checkpoint: before iteration 99 * checkpoint: before iteration 100 * per-resolution best result tracking, default to 1024 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * checkpoint: before iteration 1 * exp: backbone-bf16-stages123-only [score=1.2551, verdict=KEEP] * checkpoint: before iteration 2 * exp: decoder-bf16-weights-load-time [score=1.262, verdict=KEEP] * checkpoint: before iteration 3 * metal trace analysis: model at local optimum, 6 hypotheses eliminated Metal GPU trace at 512x512 identified kernel-level bottlenecks. Tested and eliminated: half-res refiner (fidelity), gather ops (already optimized), GroupNorm fusion (already fused), compile_forward (15% slower), export_function (20% slower), refiner bf16 (fp16 faster on M3 Max). Adds: metal_trace.py, bench scripts, trace findings, handoff doc for framework-level optimization frontiers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * upstream: add CorridorKey-Runtime C++ native runtime (#19) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * use mx.take for unroll/reroll gathers — bypasses fancy indexing bound checks Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * add research docs, brainstorms, and groupnorm parity test scripts Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * add systematic optimization sweep plan — 12 untried experiments across 4 tiers Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * exp29: buffer limits sweep — 58% peak memory reduction via env vars MLX_MAX_MB_PER_BUFFER=2 MLX_MAX_OPS_PER_BUFFER=2 cuts peak memory from 3319MB to 1407MB @1024 with zero latency penalty. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * exp30+31: wired_limit sweep (no benefit) + fidelity budget audit exp30: mx.set_wired_limit() sweep at 0-4096MB — higher values increase latency (+5-10%), memory (+700MB), and variance. Default (0) is optimal. exp31: bisected bf16 conversions for error contribution @1024. fg_final headroom is 8.2% (critical). backbone_bf16 is top contributor (21.3% of fg error) — first lever to pull if headroom becomes blocking. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * exp32-34: groupnorm native (failed), 1x1 conv (already done), del backbone (no effect) exp32: dropping pytorch_compatible=True causes catastrophic fidelity failure (0.97 max_abs) — native MLX GroupNorm uses different reduction axes. Required for correctness. exp33: 1x1 conv→linear already implemented via fold_bn() addmm bypass. exp34: del backbone after features — zero peak memory savings. High-water mark set during backbone attention, not weight coexistence. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * exp35+39+40: blend weights (already done), addmm MLP (regression), bf16 checkpoint (negligible) exp35: edge-aware blend weights already implemented — position tuple equivalent to EZ-CorridorKey _blend_weight boundary flags. exp39: mx.addmm in Hiera MLP = 4.3% regression (440ms vs 422ms) — compiler already fuses contiguous matmul+add. exp40: bf16 checkpoint saves only 3.4MB (decoder+refiner tiny vs 392MB backbone). Switched load path to mx.load for bf16 support. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * exp37: GEMM pad stage0 K=112→128 (regression, reverted) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * exp38: stream overlap disproven — no GPU-GPU parallelism on Apple Silicon Micro-benchmark shows MLX streams add 2-4% overhead, zero overlap. Decoder-scale 0.979x, backbone-scale 0.990x, 4-stream 0.961x. Exp 38 (interleaved pipeline) abandoned. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * exp39: fix mx.compile forward path — split __call__ into compile-safe + eager - __call__ now compile-safe (no async_eval/stream); forward_eager has eager opts - decoder falls back to original conv layers pre-fold_bn - fix test_compilation + test_model_contract missing fold_bn calls - compiled 1024: 438.7ms, exact parity, was completely broken before Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * exp41-42: GroupNorm optimization exhausted — Python variants + custom Metal kernel disproven exp41: 3 Python-level GN variants all failed (ContiguousCopyGN -0.1%, TransposedAffineGN +8.8%, TwoPassFP32GN +120%). GN = 50% of refiner time at 1024. exp42: custom Metal kernel via mx.fast.metal_kernel (simd_sum + atomics). +41% slower than nn.GroupNorm, non-deterministic stats (262K atomic adds), sumsq error=315. API lacks threadgroup shared memory — root cause. All GroupNorm optimization paths exhausted. Cost is architectural. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * video pipeline brainstorm + deep research + gitignore reference/video/ Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * video temporal optimization experiments plan (V0-V6) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * deepen video temporal plan: 8 research agents, simplified to 3 experiments Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * video pipeline V0 baseline + skip2 experiment — skip2 rejected V0: per-frame inference pipeline (infer_video.py, bench_video.py) Skip2: backbone-skip every other frame — 50% less compute but visible artifacts on skipped frames (PSNR 16-24dB vs 44-48dB), especially on fast motion. Sticking with V0 for now. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * compound: add frame-by-frame comparison data to skip2 rejection note Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * upstream research + two-tier fidelity gates + video baseline refresh - upstream research on video matting optimizations (RVM, DFF, RLT, etc) - benchmark_spec: add Tier 2 perceptual metrics (PSNR/SSIM/dtSSD) for algorithmic changes - brainstorm: updated roadmap — V1=EMA, V2=async pipeline, skip2 rejected - video baseline refresh: 476.6ms median, 1.79 FPS, 3508MB @ 1024 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * deep research: calibrate Tier 2 thresholds + partial feature reuse strategy - Tier 2 tightened: alpha PSNR >35dB, fg >33dB, SSIM >0.97, dtSSD <1.5 - key finding: run S1-S2 fresh (hint re-injection) + warp S3-S4 (stable) - PTQ4VM numbers: W8A8 safe on S3-S4, keep S1-S2/decoder/refiner at FP16 - RLT feasible via MAE-masking heritage + scatter-gather at decoder - VNGenerateOpticalFlow on ANE = zero GPU cost for flow Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * add deep research source doc: video matting inference optimization Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * handoff doc: video temporal optimization next steps Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * add Tier 2 fidelity metrics (PSNR/SSIM/dtSSD) to bench_video.py Pure numpy implementations — no new deps. Windowed SSIM (Wang 2004), dtSSD on alpha temporal derivatives vs V0 reference. Fidelity now reports both Tier 1 (max_abs) + Tier 2 (perceptual/temporal) with pass/fail against benchmark_spec thresholds. Runs for all modes including V0 baseline. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * V1: EMA temporal blending in VideoProcessor Output-space blending: out_t = α * current + (1-α) * prev on float32 before uint8 quantization. ~zero compute overhead. Configurable via ema_alpha param (None=disabled). CLI: --ema-alpha 0.7, --ema-sweep 0.6 0.7 0.8. First frame passes through unblended. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * V2: async CPU-GPU pipeline with decode overlap Overlap frame N+1 decode (CPU thread) with frame N GPU inference via mx.async_eval + ThreadPoolExecutor decode-ahead. Targets ~43ms decode gap. Refactored postprocessing into _postprocess_frame(). CLI flag: --async-decode. No quality impact — pure scheduling optimization. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * benchmark results: V1 EMA rejected, V2 async KEEP (+7% FPS) EMA blending fails fidelity at all α (0.6-0.95) on motion video. Async decode pipeline: 21.31s→19.83s wall-clock, zero quality impact. Updated handoff doc with complete results + answered open questions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * handoff doc: post V1/V2 results + experiment log (43-45) New handoff for next context window. V3 adaptive refiner tile skip is next priority. Experiment log updated to 45 entries. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: FrozenGroupNorm for correct tiled refiner inference Precompute full-image GroupNorm stats before tiling so per-tile normalization matches full-image processing. Unblocks V3 tile skip at tile_size=512 (previously failed fidelity due to per-tile stats). - FrozenGroupNorm: 3-mode drop-in (normal/collecting/frozen) - collect/freeze/unfreeze API on CNNRefinerModule - --frozen-gn flag on bench_video.py - 7 new tests, 0 regressions Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: fp32 variance in FrozenGroupNorm, keep normalization in input dtype Stats collection computes mean/var in fp32 internally (avoids float16 overflow at 2048x2048 — 33M elements per group), then casts back to input dtype for normalization to preserve activation precision. Validated: frozen-GN-512 vs frozen-GN-1024 = max_abs 0.0 (exact match). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * handoff: V4 frozen GroupNorm results — correct but unprofitable Stats pass overhead (~1300ms) exceeds tile skip savings. Frozen-512 = frozen-1024 (0.0 diff), but 22% slower than unfrozen-1024. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * log exp 47 (frozen GN rejected) + handoff for next targets V5 partial backbone reuse is top recommendation — 40-60% potential on non-keyframes by caching S3-S4 features. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * upstream research 5 + GitHub issue tracking for experiments - add Raiden129/CorridorKey_Test to btca.config.jsonc - compound note: upstream audit findings (metal_kernel shared memory, EZ integration, tile overlap, sparse skip) - CLAUDE.md: experiments tracked as GitHub issues, update issues on completion 16 issues created (#2-#17) covering loose ends, untried ideas, and upstream findings. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * handoff: GitHub project board setup + prioritized issue ordering Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * add project board URL to experiment tracking section Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: tile overlap 64→128px (2x safety margin over 65px receptive field) Closes #16 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * add deep research doc: MLX optimization for video matting Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Metal GroupNorm v2: shared-mem + multi-threadgroup, -67% vs nn.GroupNorm Two-kernel approach: stats reduction (16 chunks/group, 1 atomic each) + fully parallel normalize. Eliminates NHWC↔NCHW transposes. 1024²: 2.37ms vs 7.02ms (-66%) 2048²: 8.80ms vs 26.65ms (-67%) Incremental parity drift ~1e-7/call, +1-4% on already-failing tests. Fidelity investigation to follow. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: revert backbone BF16 + refiner BF16, all 89 tests pass Backbone BF16 stages 1-3 was the root cause of 9 parity failures (issue #8/#12). Reverting both backbone_bf16_stages123 and refiner_dtype to FP32 defaults recovers full fidelity headroom. Test tolerance updates: bit-exact assertions relaxed to 1e-4 for Metal GroupNorm atomic non-determinism (~1e-11/call, cascades to ~3e-5 through 9 GN calls in refiner). Closes #8, closes #12 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: golden_2048 reference + auto-select in compare_reference.py compare_reference.py auto-selects golden_2048.npz when --img-size >= 2048. golden_2048.npz generated from PyTorch (seed=42, CorridorKey_v1.0.pth). Baseline 2048 drift: alpha_final 5.9e-2, fg_final 6.9e-2, delta_logits 5.9e-1. Closes #11 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: backbone_size decoupled resolution — validated, rejected for default backbone_size param on GreenFormer: backbone+decoders at lower res, coarse logits upsampled to full res, refiner at full res. Quality validation on real 1920x1080 frame (all backbone sizes fail fidelity): - @448: 12% faster, alpha max_err=91/255 (edge degradation) - @384: 21% faster, alpha max_err=143/255 - @256: 38% faster, alpha max_err=192/255 Matting is edge-sensitive — refiner can't recover lost backbone detail. Code kept as opt-in (backbone_size=None = no change) for future use. Closes #4 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * log exp 48-49 + compound notes: pipeline bottleneck + V7 rejected Exp 48: Metal GN pipeline impact disproven (2% of total, noise) Exp 49: V7 decoupled backbone rejected (edge fidelity fails all ratios) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * exp 50: V5 feature reuse rejected — S2 caching catastrophic, S3-only negligible Real consecutive frames: S3-only caching 27.7/255 max err (ok) but 1.6% speedup. S2+S3 caching 247.5/255 max err — S2 features not temporally stable. Feature warping (V8) needed to make deep caching viable. Closes #2 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: frozen GN uses Metal kernel path + tile skip validation (V6) Frozen GroupNorm now uses fast Metal kernel with precomputed stats instead of slow transpose-based fallback. Perfect match vs non-tiled (0.0 error), eliminating tiling artifacts completely. Tile skip: 0-25% rate on real content, stats overhead cancels savings. Net pipeline impact ~0%. Keep frozen GN for correctness, not speed. Exp 51. Closes #3 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: set MLX buffer env vars in engine — 17% faster at production res MLX_MAX_MB_PER_BUFFER=2, MLX_MAX_OPS_PER_BUFFER=2 via setdefault(). Small buffers force frequent eval, preventing graph buildup in tiled inference. 1832→1519ms at 1920x1080 (17% faster), no memory change. Exp 52. Closes #6 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: disable int8 quantization — 11% slower at production res Int8 backbone quant: 2796ms vs 2517ms no-quant at 1920x1080 tiled. Dequant overhead outweighs bandwidth savings on Apple Silicon unified memory. Quality impact negligible (1e-7). Default now False. Exp 53. Closes #13 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * compound notes: V6 frozen GN, env var tuning, int8 revert Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * log exp 54-57: GELU fast (skip), batch frames (rejected), RLT (skip), feature EMA (rejected) All remaining Tier 3 issues closed. Board clear. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * add comprehensive optimization summary (57 experiments) For outside contributors: what worked, what didn't, and why. Categorized failures by root cause (edge sensitivity, Apple Silicon constraints, not-a-bottleneck, correct-but-unprofitable). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * add pipeline optimization handoff doc for main CorridorKey repo Methodology, fidelity tiers, benchmarking protocol, prioritized targets, proven dead ends, and quick-start checklist for picking up optimization work in the main CorridorKey pipeline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * cleanup: remove research artifacts for squash merge to main 191 → 56 tracked files. Removed: - docs/ (37 brainstorms, plans, deep research — captured in OPTIMIZATION_SUMMARY.md) - research/compound/ (30 notes — captured in OPTIMIZATION_SUMMARY.md) - research/handoff-*, experiments.jsonl, decision.schema.json - prompts/ (6 phase port guides — port is complete) - 19 research-only scripts (one-off benchmarks, sweeps, prototypes) - .claude/ hooks and skills (autoresearch lab infrastructure) - .agents/ (TDD skill) - Root junk (loop.sh, main.py, research findings MDs, config files) Kept: src/, tests/, core scripts, benchmark_spec.md, OPTIMIZATION_SUMMARY.md, HANDOFF_TO_CORRIDORKEY.md 76 tests pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * restore upstream-research skill Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: clamp frozen GN variance to prevent NaN + fix refiner_fn type - Clamp sumsq conversion to non-negative (fp32 rounding can yield negative variance with near-uniform activations) - Type refiner_fn as Callable instead of object Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

- Check mask channel per-tile before inference; skip pure BG/FG tiles - Enable mx.compile for tiled mode (tile_size is fixed per engine) - Log tile skip stats at DEBUG level Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Enable refiner_dtype=float16 (reduces register pressure, 0.84→0.56 cycle dependency penalty on Apple Silicon) - Only set MLX_MAX_OPS/MB_PER_BUFFER=2 when compile=False; with compile=True let the graph optimizer see larger graphs for better fusion - Combined: inference 2294ms → 2030ms (12% faster) Closes cmoyates/CorridorKey#26, closes cmoyates/CorridorKey#27 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

gc.collect() and mx.clear_cache() between tiles cost ~430ms/frame (21%). On 36GB unified memory, the memory headroom is sufficient without aggressive per-tile cleanup. stage_gc=False saves another ~20ms. Inference: 2030ms → 1602ms (21% faster) Best case (matte+fg fast-exr): 1702ms/frame total, ~1:01 wall clock Also tested and rejected: - backbone_bf16_stages123: 20% slower (dtype conversion overhead) - refiner sub-tile skip (384px + frozen GN): 55% slower (overhead > savings) - 1024px tiles: 20% slower (attention quadratic scaling + im2col) - 64px overlap: same speed as 128 (same tile count at 768px) - skip explicit mx.eval before np.array: no measurable difference Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Additional ~530ms/frame saved. On 36GB unified memory, letting Python/MLX manage garbage naturally is faster than forcing collection per frame. Inference: 1602ms → 1474ms, best case total: 1424ms/frame (51s for 37 frames) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* enable frozen GN by default + remove dead code (#20, #23) - refiner_frozen_gn defaults to True (perfect tiling correctness) - Removed: compile_forward, forward_eager(), quantize_backbone_stages - Removed unused safe_quantize import Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * remove accidentally committed docs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: default tile_size 512→768 — 54s vs 2:04 (2.3x faster) (#19) Tile size sweep on real pipeline (37 frames @ 1920x1080): - 512px: 2:04 (3400ms/frame, 15 tiles) - 640px: 0:54 (2184ms/frame) - 768px: 0:54 (2127ms/frame) ← optimal - 1024px: 1:14 (2689ms/frame, memory pressure) Pipeline timing breakdown at 768px: - Read: 4.7ms (0.2%) - Infer: 1429ms (67%) - Postprocess: 89.8ms (4.2%) - Write: 604.6ms (28.4%) Write I/O is now the #2 bottleneck after inference. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * use GPU-side preprocessing for tiled path (#29) Switch tiled inference from numpy preprocess() to preprocess_mlx(). ImageNet normalization + concat now runs on GPU instead of CPU. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * revert: GPU preprocessing slower for tiled path (+4s) preprocess_mlx creates extra mx.array copy overhead. numpy preprocess is faster since tiled_inference slices numpy arrays directly. 2278ms/frame vs 2127ms/frame. Reverted. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * skip whole-forward compile for tiled path (per-component sufficient) Whole-forward mx.compile on top of per-component compilation is redundant. Benchmarked: 1435ms vs 1442ms (no difference). Per-component compilation already handles fusion. Removing whole-forward avoids potential issues with frozen GN state changes. Also tested and logged: - overlap 128→64: no speedup at 768px (overlap is small fraction) - frozen GN overhead: 0ms at 768px (amortized, tile < refiner_tile_size) - refiner FP16: no difference at 768px - Per-tile model cost: 218ms. 6 tiles = 1309ms + 120ms tiling overhead Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * add next-session handoff doc Current: 0:53 (3.96x vs Torch). 22 open issues tiered by priority. Write I/O (605ms, 28%) is the top target. Deep research doc available. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: dynamic single-tile inference + issue board triage (#37) - bbox analysis on alpha hint → skip full tile grid when subject fits in one tile - _find_subject_bbox + _single_tile_inference in tiling.py - 5 new tests (bbox detection, margin clamping, single-tile output, fallback) - updated handoff doc: corrected timing (writes fully overlapped by async pipeline) - triaged 22 issues → 1 open (#46 manual profiling) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add deep research analysis (38-vector optimization survey) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: BBOX_THRESHOLD → module constant, logger.info → debug Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- tiling.py: split monolithic model(tile) into run_backbone → eval → run_decoders → eval → run_refiner → eval. Allows MLX memory pool to recycle backbone buffers before refiner im2col allocation. Benchmark: 0:53 (vs 0:54 baseline), median infer 1432ms - refiner.py: add SPD dilated conv transform (disabled by default). Tested: +17% regression due to pixel_unshuffle copy overhead and poor grouped conv GPU utilization on MLX. Kept for future re-eval. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Granular eval benchmark (3 runs after cooldown): 1499, 1471, 1495 ms/frame → median 1495ms, 0:55-0:56 Monolithic eval baseline: 1457, 1455, 1456 ms/frame → median 1456ms, 0:54 Extra CPU-GPU sync barriers from per-component eval (+40ms/frame) outweigh any memory pool recycling benefit. SPD dilated conv transform kept (disabled, use_spd=False) — tested at +17% regression due to copy overhead. Warmup method kept but uncalled. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove SPD infrastructure (pixel_unshuffle/shuffle, prepare_spd, grouped conv support) and warmup method — both tested and confirmed as regressions. Restores refiner.py and engine.py to pre-experiment state. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- fp32 decoders + refiner (bf16/fp16 caused 10x delta_logits drift at 2048) - disable refiner tiling (GroupNorm per-tile stats diverge from full-image) - add sRGB↔linear color utils (LUT-accelerated) - engine: input_is_linear support, linear-space compositing, Lanczos upscale - video: PNG sequence hints, --linear/--fp32/--max-frames flags - dump_pytorch_reference: --image/--hint for real input golden fixtures - compare_reference: fp32 model for accurate parity comparison Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: 4K inference parity — fp32 precision, color space, no tiling

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comparing changes

Open a pull request

Uh oh!

Commits on Mar 13, 2026

Commits on Mar 14, 2026

Commits on Mar 15, 2026

This comparison is taking too long to generate.

Uh oh!