Train a GPT on your Mac.
Not inference. Not fine-tuning someone else's model. Training from scratch.
Two accelerators. One chip. Your Mac.
- ANE (Apple Neural Engine) — native Obj-C, private APIs, 38 TOPS
- MLX (Apple's ML framework) — native Python, bf16, GPU
Same dataset as Karpathy's original autoresearch (climbmix-400B). Same tokenizer (rustbpe, vocab=8192). Same metric (val_bpb). Your results are directly comparable to NVIDIA H100 runs.
We ran 400+ experiments on an M4 Max 128GB. Now you can run yours.
Three commands. That's it.
git clone https://github.com/ncdrone/train-my-mac.git
cd train-my-mac
bash setup.sh

Setup does everything: checks your hardware, clones the engines, installs dependencies, downloads the dataset (~500 MB), builds the native ANE binary, and runs smoke tests on both accelerators.
When it finishes, train:
bash sweep.sh # find your best config (~30 min)
bash overnight.sh # full ANE training run (~5-8 hours)
bash overnight.sh mlx # or train on MLX instead
bash gossip.sh # run both engines simultaneously (advanced)

Skip all prompts with --yes:
bash setup.sh --yes

- Apple Silicon Mac (M1 or later)
- 16 GB RAM minimum
- Xcode Command Line Tools (xcode-select --install)
- uv (auto-installed if missing)
| Tier | Memory | Examples | What You Get |
|---|---|---|---|
| Minimum | 16 GB | M1, M2, M3 base, M5 Air | MLX only. ANE light preset. |
| Okay | 24-36 GB | M2 Pro, M3 Pro, M4 Pro | Both accelerators. Standard config. |
| Recommended | 48-128 GB | M3/M4 Max, M5 Max | Full config. Fast steps. Gossip. |
| Ideal | 128+ GB | M3 Ultra (256 GB), M5 Ultra | The setup. You tell us. |
Below 16 GB: not supported.
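The tier boundaries above could be mapped with a simple threshold function. This is a sketch, not setup.sh's actual detection logic; the 16-24 GB and 36-48 GB gaps in the table are resolved downward here, which is an assumption.

```python
def detect_tier(memory_gb):
    """Map installed memory (GB) to a hardware tier per the table above.

    The in-between ranges (16-24, 36-48) are assigned to the lower tier
    here as an assumption; the real setup.sh may decide differently.
    """
    if memory_gb < 16:
        return "unsupported"
    if memory_gb < 24:
        return "minimum"
    if memory_gb < 48:
        return "okay"
    if memory_gb < 128:
        return "recommended"
    return "ideal"
```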
bash setup.sh runs six steps in order:
| Step | What | Details |
|---|---|---|
| 1 | Hardware detection | Identifies your chip, memory, tier |
| 2 | Clone engines | ANE (native Obj-C) + MLX (Python) into engines/ |
| 3 | Install dependencies | uv sync for both engines (isolated venvs) |
| 4 | Download data | Karpathy climbmix-400B, ~500 MB, tokenized |
| 5 | Build ANE binary | Compiles native training loop with Xcode CLT |
| 6 | Smoke tests | 50 steps on each engine to verify everything works |
Config is saved to my_config.txt. All subsequent scripts read from it.
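A minimal sketch of how a script might read my_config.txt, assuming it is a plain key=value file (the actual keys setup.sh writes are not documented here; "tier" and "lr" below are hypothetical examples):

```python
def read_config(path="my_config.txt"):
    """Parse a simple key=value config file into a dict (a sketch).

    Blank lines and '#' comments are skipped; values stay as strings.
    """
    config = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blanks and comments
            key, _, value = line.partition("=")
            config[key.strip()] = value.strip()
    return config
```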
Short experiments to find what works on your hardware.
bash sweep.sh # sweep ANE (default)
bash sweep.sh mlx # sweep MLX
bash sweep.sh both # sweep both (1 hour total)

Tests learning rates (1e-4, 2.5e-4, 5e-4, 1e-3) and gradient accumulation (1 vs 2). Each experiment runs 5 minutes. Best config is saved automatically.
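The sweep is a small grid search: 4 learning rates times 2 grad-accum settings, 8 five-minute experiments. A sketch of the logic, where run_experiment is a hypothetical stand-in for whatever sweep.sh actually invokes:

```python
from itertools import product

LRS = [1e-4, 2.5e-4, 5e-4, 1e-3]
GRAD_ACCUM = [1, 2]

def sweep(run_experiment):
    """Run the 4x2 grid and return the config with the lowest val_bpb."""
    results = []
    for lr, accum in product(LRS, GRAD_ACCUM):
        val_bpb = run_experiment(lr=lr, grad_accum=accum)  # ~5 min each
        results.append({"lr": lr, "grad_accum": accum, "val_bpb": val_bpb})
    return min(results, key=lambda r: r["val_bpb"])  # lower bpb wins
```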
Full training with your discovered config. Auto-launches in tmux so closing your terminal won't kill the run.
bash overnight.sh # ANE overnight (72K steps, 5-8 hours)
bash overnight.sh mlx # MLX overnight
bash overnight.sh --lr 5e-4 # override learning rate
bash overnight.sh --steps 10000 # shorter run

Reattach anytime: tmux attach -t train-my-mac
When it finishes, visualize.py generates a results graphic comparing your run to our research.
Both engines running simultaneously on the same chip. ANE on the Neural Engine, MLX on the GPU. Zero interference.
bash gossip.sh # launches both in parallel with shared gossip

Each engine writes results to a shared JSONL file. Discoveries from one inform the other. This is how we got our best results.
After the sweep: you know your best config, val_bpb under 2.5.
After overnight:
- ANE: val_bpb under 1.8, possibly under 1.7
- MLX: val_bpb under 1.4 with Muon
Our bests:
- ANE: 1.595 (M4 Max 128GB, 72K steps, 8.2 hours)
- MLX: 1.266 (M4 Max 128GB, Muon + AdamW, 259 experiments)
Getting within 0.1-0.2 of those on lesser hardware is a great result.
Remove everything setup created and start fresh.
bash clean.sh # interactive — confirms before deleting
bash clean.sh --yes # delete engines, config, and logs without prompting
bash clean.sh --all # also delete cached dataset (~1 GB in ~/.cache/autoresearch)

What gets removed:
| clean.sh | clean.sh --all |
|---|---|
| engines/ (cloned repos, venvs, builds) | everything in the left column |
| my_config.txt | ~/.cache/autoresearch/ (dataset + tokenizer) |
| results/*.log and summaries | |
The sample result PNGs in results/ are kept. Run bash setup.sh to rebuild everything.
Change one thing at a time. If you change LR and warmup simultaneously and it improves, you don't know which helped.
5-minute runs are for screening, not conclusions. They tell you what's promising. The overnight run tells you the truth.
val_bpb is the only metric that matters. Training loss can lie (overfitting). Validation bits-per-byte on held-out data is ground truth. Lower is better.
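Bits-per-byte normalizes cross-entropy by the raw byte length of the held-out text, which is what makes it comparable across tokenizers and hardware. A sketch of the conversion, assuming the loss is reported as mean nats per token:

```python
import math

def bits_per_byte(mean_loss_nats, n_tokens, n_bytes):
    """Convert mean cross-entropy (nats/token) to bits per byte.

    Total nats become total bits (divide by ln 2), then we normalize
    by the number of raw UTF-8 bytes in the validation text.
    """
    total_bits = mean_loss_nats * n_tokens / math.log(2)
    return total_bits / n_bytes
```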
Think in relative terms. Going from 2.0 to 1.8 is huge. Going from 1.60 to 1.59 might not be worth added complexity.
When things explode, that's data. Activation magnitudes above 50 mean something is wrong, but knowing exactly when a run explodes tells you where your stability boundaries are.
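A cheap guard for the |activation| > 50 heuristic above, as a sketch you could drop into a training loop's logging step:

```python
def check_activations(tensors, threshold=50.0):
    """Flag layers whose peak |activation| crosses the blow-up threshold.

    `tensors` maps layer name -> an iterable of activation values.
    Returning the offenders (rather than raising) lets you log the step
    at which training crossed the stability boundary and keep going.
    """
    offenders = {}
    for name, values in tensors.items():
        peak = max(abs(v) for v in values)
        if peak > threshold:
            offenders[name] = peak
    return offenders
```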
Simpler code wins. If you can remove code and get the same val_bpb, that's a great outcome.
Log everything. You will want to go back and compare.
Modify the training code. ANE: engines/autoresearch-ANE/native/training/train.m. MLX: engines/autoresearch-mlx/train.py. Change something, run 5 minutes, see what happens.
DO NOT RUN THIS (unless you're crazy). Let Claude do everything for you. Setup, sweep, overnight, visualize — fully autonomous. You walk away. It trains your Mac overnight. You wake up to results.
claude --dangerously-skip-permissions -p "Read autorun.md and execute everything."

Autonomous research mode. Claude modifies training code, runs experiments, keeps what works, discards what doesn't. Loops forever.
claude --dangerously-skip-permissions -p "Read program.md and start autoresearch."

Explore what we haven't tried. See engines/autoresearch-ANE/docs/ideas/roadmap_unexplored.md. ANE classifier on-chip, Muon optimizer port, bf16, kernel fusion.
Community leaderboard. (Coming soon.) Submit your results. See what every Mac can do.
Native Obj-C. Uses private AppleNeuralEngine.framework APIs.
Weights packed into IOSurface inputs. Kernels compile once at startup.
Weight updates are just memcpy. No recompilation.
- 48.8M param GPT (NL=6, SEQ=512, DIM=768)
- ~80-100ms/step on M4 Max
- Invisible to Activity Monitor
- Best: val_bpb = 1.595
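The 48.8M figure is roughly what the standard GPT parameter formula gives for NL=6, DIM=768, SEQ=512, vocab=8192. A back-of-envelope check; the native model's exact layout (biases, tied embeddings, norms) may differ slightly:

```python
def gpt_params(n_layer=6, dim=768, seq=512, vocab=8192):
    # token embeddings + learned position embeddings
    emb = vocab * dim + seq * dim
    # per block: attention (QKV + output proj) ~ 4*dim^2,
    # MLP with 4x expansion ~ 8*dim^2  ->  ~12*dim^2 per layer
    blocks = n_layer * 12 * dim * dim
    return emb + blocks

print(gpt_params() / 1e6)  # ~49M, close to the quoted 48.8M
```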
Python. Apple's native ML framework, purpose-built for Apple Silicon. Native bf16. Unified memory. Muon + AdamW optimizer.
- 15.7M param GPT (optimized architecture)
- Native bf16 (unlike MPS where it was 2.6x slower)
- Best: val_bpb = 1.266
Both engines write results to a shared JSONL file at ~/.cache/autoresearch/gossip/shared_experiments.jsonl. Each agent reads the other's experiments before planning its next one. Cross-pollination: ANE discoveries inform MLX experiments and vice versa.
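The shared file is append-only JSONL, one experiment per line, so both agents can write without coordination. A sketch of logging a result and reading another engine's best run; the exact record schema here is an assumption:

```python
import json
import os

GOSSIP = os.path.expanduser(
    "~/.cache/autoresearch/gossip/shared_experiments.jsonl")

def log_experiment(engine, config, val_bpb, path=GOSSIP):
    """Append one experiment record; append-only lets both agents write."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "a") as f:
        f.write(json.dumps({"engine": engine, "config": config,
                            "val_bpb": val_bpb}) + "\n")

def best_from(engine, path=GOSSIP):
    """Lowest-val_bpb record logged by `engine`, or None.

    An agent would call this with the *other* engine's name before
    planning its next experiment.
    """
    try:
        with open(path) as f:
            records = [json.loads(line) for line in f if line.strip()]
    except FileNotFoundError:
        return None
    mine = [r for r in records if r["engine"] == engine]
    return min(mine, key=lambda r: r["val_bpb"], default=None)
```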
We built an autonomous research loop on Apple Silicon. An AI agent modifies training code, runs a 5-minute experiment, evaluates val_bpb, keeps the change if it improved, discards if it didn't, and repeats overnight.
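The keep-if-improved loop described above can be sketched as a greedy search. The four callables are hypothetical stand-ins for the agent's actual tooling:

```python
def research_loop(propose_change, apply, revert, run_5min_experiment,
                  baseline_bpb, max_iters=100):
    """Greedy research loop: mutate, measure, keep or revert.

    Each iteration proposes one code change, runs a short experiment,
    keeps the change if val_bpb improved, and reverts it otherwise.
    """
    best = baseline_bpb
    for _ in range(max_iters):
        change = propose_change()
        apply(change)
        bpb = run_5min_experiment()
        if bpb < best:      # improvement: keep the change
            best = bpb
        else:               # regression or no gain: discard it
            revert(change)
    return best
```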
400+ experiments. Two accelerators. One M4 Max 128GB.
The gap between ANE and MLX? That's where the research is.
| Problem | Fix |
|---|---|
| Build fails | xcode-select --install |
| "No ANE device found" | Intel Macs don't have ANE. M-series only. |
| Activations explode (x > 50) | Lower LR by 2x |
| OOM / memory pressure | Use light preset or MLX only |
| Fans spinning hard | Normal. If ms/step climbs, thermal throttling — pause. |
| val_bpb plateaus | More steps. Model still improving at 40K+. |
| MLX: bf16 errors | uv sync in engines/autoresearch-mlx/ |
| Want a fresh start | bash clean.sh --all then bash setup.sh |
- Andrej Karpathy — autoresearch concept and climbmix-400B dataset
- maderix — ANE private API reverse engineering
- trevin-creator — MLX port
- Apple MLX team
The ANE engine uses Apple's private AppleNeuralEngine.framework via dlopen. This means:
- Undocumented. There is no official API reference. The interface was reverse-engineered.
- Unsupported. Apple does not support third-party use of this framework.
- May violate Apple's Terms of Service. Using private frameworks is explicitly discouraged by Apple and may breach the macOS EULA.
- Could break on any macOS update. Apple can change or remove the private API at any time without notice.
- No warranty. This software is provided as-is. No guarantees of correctness, safety, or fitness for any purpose.
- Could stress your hardware. Long training runs push the Neural Engine, GPU, and thermal system continuously for hours. Monitor your temps.
The MLX engine uses Apple's public ML framework. No private APIs. No risk beyond normal GPU compute.
You are responsible for what runs on your machine. If you are not comfortable with these risks, use bash sweep.sh mlx and bash overnight.sh mlx to run only the MLX engine.
MIT — no warranty, no liability, use at your own risk.
