44 commits
5e5e9db
Fix: WAN 2.2 I2V boundary detection, AdamW8bit OOM crash, and add gra…
Oct 22, 2025
12e2b37
Improve video training with better bucket allocation
Oct 28, 2025
a1f70bc
Fix MoE training: per-expert LR logging and param group splitting
Oct 28, 2025
a2749c5
Add progressive alpha scheduling and comprehensive metrics tracking f…
Oct 29, 2025
86d107e
Merge remote-tracking branch 'upstream/main'
Oct 29, 2025
c91628e
Update README with comprehensive fork documentation and alpha schedul…
Oct 29, 2025
61143d6
Add comprehensive beginner-friendly documentation and UI improvements
Oct 29, 2025
96b1bda
Remove sponsors section from README - this is a fork without sponsors
Oct 29, 2025
bce9866
Fix confusing expert metrics display - add current training status
Oct 29, 2025
bd45a9e
Fix UnboundLocalError: remove redundant local 'import os'
Oct 29, 2025
abbe765
Add metrics API endpoint and UI components for real-time training mon…
Oct 29, 2025
edaf27d
Fix: Always show Loss Trend Analysis section with collection progress
Oct 29, 2025
a551b65
Fix: SVG charts now display correctly - add viewBox for proper coordi…
Oct 29, 2025
1682199
Fix: Downsample metrics to 500 points and lower phase transition thre…
Oct 30, 2025
885bbd4
Add comprehensive training recommendations based on research
Oct 30, 2025
705c5d3
Fix TRAINING_RECOMMENDATIONS for motion training
Oct 30, 2025
54c059a
Fix metrics to use EMA instead of simple averages
Oct 30, 2025
20b3c12
FIX CRITICAL BUG: Training loop re-doing checkpoint step on resume
Oct 30, 2025
226d19d
Remove useless checkpoint analyzer script
Oct 30, 2025
66978dd
Fix: Export EMA metrics to JSONL for UI visualization
Oct 30, 2025
fa12a08
Fix: Optimizer state loading counting wrong number of params for MoE
Oct 30, 2025
264c162
Fix: Set current_expert_name for metrics tracking
Oct 30, 2025
aecc467
Fix alpha scheduler not loading for MoE models on resume
Oct 31, 2025
b1ea60f
feat: Add SageAttention support for Wan models
Nov 4, 2025
20d689d
Fix CRITICAL metrics regression: boundary misalignment on resume + ad…
Nov 4, 2025
8b8506c
Merge feature/sageattention-wan-support into main
Nov 4, 2025
6a7ecac
docs: Update README with SageAttention and metrics fixes
Nov 4, 2025
850db0f
docs: Update installation instructions to use PyTorch nightly
Nov 4, 2025
26e9bdb
docs: Major README overhaul - Focus on Wan 2.2 I2V optimization
Nov 4, 2025
88785a9
docs: Fix Blackwell CUDA requirements - CUDA 13.0 not 12.8
Nov 4, 2025
0cacab8
Fix: torchao quantized tensors don't support copy argument in .to()
Nov 4, 2025
3ad8bfb
Fix critical FP16 hardcoding causing low-noise training instability
Nov 4, 2025
8589967
Fix metrics UI cross-contamination in per-expert windows
Nov 4, 2025
47dff0d
Fix FP16 hardcoding in TrainSliderProcess mask processing
Nov 4, 2025
eeeeb2e
Fix LR scheduler stepping to respect gradient accumulation
Nov 4, 2025
f026f35
CRITICAL: Fix VAE dtype mismatch in Wan encode_images
Nov 5, 2025
c7c3459
CRITICAL: Revert CFG-zero to be optional (match Ostris Nov 4 update)
Nov 5, 2025
728b46d
CRITICAL: Fix multiple SageAttention bugs causing training instability
Nov 5, 2025
7c9b205
Additional SageAttention and VAE dtype refinements
Nov 5, 2025
1d9dc98
Fix rotary embedding application to match Diffusers WAN reference
Nov 5, 2025
67445b9
Add temporal_jitter parameter for video frame sampling
Nov 5, 2025
ab59f00
Document temporal_jitter feature in README
Nov 5, 2025
80ff3db
Fix VAE dtype handling for WAN 2.2 I2V training to prevent blurry sam…
Nov 5, 2025
384ce94
Fix MoE UI metrics bugs and optimizer state restoration
Nov 6, 2025
Fix alpha scheduler not loading for MoE models on resume
When resuming training for MoE models (high_noise/low_noise), the alpha
scheduler state file wasn't found: the code looked for expert-specific
scheduler files (_high_noise_alpha_scheduler.json or
_low_noise_alpha_scheduler.json), but the actual file is shared across
experts (just _alpha_scheduler.json).

This caused the alpha scheduler to reset to the foundation phase instead
of continuing from the saved phase (e.g., emphasis), resulting in
incorrect alpha values after resume.

Fix: Strip expert suffix from filename before looking for alpha scheduler.
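
A minimal standalone sketch of the resolution logic, for illustration
(the helper name and example paths are hypothetical, not part of the
codebase):

def resolve_alpha_scheduler_path(checkpoint_path: str) -> str:
    # MoE experts save checkpoints as ..._high_noise.safetensors /
    # ..._low_noise.safetensors, but both share one _alpha_scheduler.json,
    # so the expert suffix must be stripped before checking for the file.
    scheduler_file = checkpoint_path.replace('.safetensors', '_alpha_scheduler.json')
    for suffix in ('_high_noise_alpha_scheduler.json', '_low_noise_alpha_scheduler.json'):
        scheduler_file = scheduler_file.replace(suffix, '_alpha_scheduler.json')
    return scheduler_file

# Both experts resolve to the same shared state file:
# resolve_alpha_scheduler_path('out/run_high_noise.safetensors')
#   -> 'out/run_alpha_scheduler.json'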

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
AI Toolkit Contributor and claude committed Oct 31, 2025
commit aecc467366e65115fec29fecaa45be9a614d84b0
3 changes: 3 additions & 0 deletions jobs/process/BaseSDTrainProcess.py
@@ -880,6 +880,9 @@ def load_weights(self, path):
         if hasattr(self.network, 'alpha_scheduler') and self.network.alpha_scheduler is not None:
             import json
             scheduler_file = path.replace('.safetensors', '_alpha_scheduler.json')
+            # For MoE models, strip expert suffix (_high_noise, _low_noise) since scheduler is shared
+            scheduler_file = scheduler_file.replace('_high_noise_alpha_scheduler.json', '_alpha_scheduler.json')
+            scheduler_file = scheduler_file.replace('_low_noise_alpha_scheduler.json', '_alpha_scheduler.json')
             print_acc(f"[DEBUG] Looking for alpha scheduler at: {scheduler_file}")
             if os.path.exists(scheduler_file):
                 try:
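
Design note: doing the strip with two string replacements keeps the fix
minimal and is a no-op for non-MoE checkpoints, whose filenames never
contain an expert suffix. With this change, both experts of a run (e.g.,
the hypothetical out/run_high_noise.safetensors and
out/run_low_noise.safetensors) resolve to the same
out/run_alpha_scheduler.json, so resuming either expert restores the
shared phase state instead of resetting to the foundation phase.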