
Add Metal GPU backend with MNIST training (98.2% accuracy) #64

Closed
alok wants to merge 1 commit into lecopivo:master from alok:pr/gpu-metal-clean

Conversation

Contributor

@alok alok commented Dec 18, 2025

Summary

This PR adds a Metal GPU backend for accelerated ML on Apple Silicon.

Features

  • GPU-resident buffers with type-safe CpuBuffer/GpuBuffer API
  • Optimized GEMM kernels (simdgroup tiling, double-buffered)
  • Fused ML ops: biasRelu, biasGelu, biasAdd, softmax, layerNorm
  • Flash attention kernels (causal and non-causal)
  • Conv2D/MaxPool2D/AvgPool2D for CNN inference
  • Mini-batch training with GPU buffer slicing
  • Command buffer batching for reduced dispatch overhead
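
The fused ops listed above can be sketched in NumPy to pin down their semantics. This is a minimal, illustrative reference only: the function names, shapes, and API here are assumptions for exposition, not the actual SciLean/Metal bindings.

```python
import numpy as np

# Reference semantics (NumPy sketch) for two of the fused kernels above.
# Names and shapes are illustrative, not the actual SciLean/Metal API.

def bias_relu(x, w, b):
    # One fused pass: GEMM + bias add + ReLU, avoiding intermediate buffers.
    return np.maximum(x @ w + b, 0.0)

def softmax(x, axis=-1):
    # Numerically stable row-wise softmax (subtract the row max first).
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))   # batch of 4 activations
w = rng.standard_normal((8, 3))   # weight matrix
b = np.zeros(3)                   # bias

h = bias_relu(x, w, b)            # shape (4, 3), all entries >= 0
p = softmax(h)                    # each row sums to 1
```

The point of fusing on the GPU is that `x @ w`, the bias add, and the activation happen in one kernel launch, so the intermediate matrix never round-trips through device memory.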

MNIST Results

  • 98.2% accuracy on full 60k training set
  • 10 epochs, 256-sample mini-batches
  • ~230ms per epoch (234 batches)
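
A quick back-of-envelope check of the batch count reported above (illustrative arithmetic only; whether the final partial batch is dropped or padded is not stated in the PR):

```python
# 60k samples at 256 per mini-batch, per the figures above.
train_size = 60_000   # full MNIST training set
batch_size = 256      # mini-batch size

full_batches = train_size // batch_size   # full mini-batches per epoch
leftover = train_size % batch_size        # samples beyond the last full batch

print(full_batches, leftover)  # 234 96
```

This matches the reported 234 batches per epoch, consistent with only full 256-sample batches being dispatched.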

Files

  • Metal/kmeans.metal - Metal shader kernels
  • Metal/metal_backend.mm - C++ dispatch layer
  • SciLean/FFI/Metal.lean - Lean bindings
  • examples/GpuMNIST.lean - Training example

@alok alok closed this Dec 18, 2025
@alok alok deleted the pr/gpu-metal-clean branch December 18, 2025 00:41