@tlennon-ie
use a cross-platform approach using the nul device on Windows and /dev/null on Unix

If you run your fork on Windows, the queue worker errors out because of the Unix-specific null-device path:

```
[WORKER] Starting job 0307e546-7c7e-494e-8218-d2f1d6d04cc5 on GPU(s) 0
[WORKER] Error launching process: Error: ENOENT: no such file or directory, open 'I:\dev\null'
[WORKER] at Object.openSync (node:fs:555:18)
[WORKER] at I:\AI\AI-Toolkit\ui\dist\cron\actions\startJob.js:95:42 {
[WORKER] errno: -4058,
[WORKER] code: 'ENOENT',
[WORKER] syscall: 'open',
[WORKER] path: 'I:\dev\null'
[WORKER] }
[WORKER] No more jobs in queue for GPU(s) 0, stopping queue
```
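One way to do what the comment asks: since Node v16.3, `os.devNull` resolves to the platform's null device (`\\.\nul` on Windows, `/dev/null` elsewhere), so the worker can open it without hard-coding a Unix path. A minimal sketch:

```typescript
import * as fs from "fs";
import * as os from "os";

// os.devNull is "\\.\nul" on Windows and "/dev/null" on POSIX,
// so openSync succeeds on both platforms instead of throwing ENOENT.
const sink = fs.openSync(os.devNull, "a");
fs.writeSync(sink, "output discarded\n");
fs.closeSync(sink);
```

The worker could then pass `sink` as a stdio target when spawning the job process.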

- Add _merge_with_defaults() to auto-fill missing alpha values in phase config
- If user config lacks alpha values, fill them from defaults or estimate based on rank
- Add null checks in _validate_alpha_ratios() to prevent TypeError on None division
- Add null checks in get_current_alpha() and get_current_scale() methods
- Log which alpha values were auto-filled or estimated for user visibility

Fixes: TypeError when phase alpha is None in configuration
Improves: User experience with incomplete alpha scheduler configs by auto-completing them

This ensures the alpha scheduler gracefully handles incomplete configurations while
informing users which values were auto-filled.
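The auto-fill behavior described above can be sketched as follows. This is an illustrative TypeScript sketch, not the actual `_merge_with_defaults()` (which is Python); the config shape and the rank-based estimate are assumptions, the latter following the common LoRA convention of alpha = rank.

```typescript
interface PhaseConfig {
  rank: number;
  alpha?: number | null;
}

// Fill a missing alpha from the configured default, else estimate it
// from rank, and log what was auto-filled for user visibility.
function mergeWithDefaults(cfg: PhaseConfig, defaultAlpha?: number): { rank: number; alpha: number } {
  const alpha = cfg.alpha ?? defaultAlpha ?? cfg.rank;
  if (cfg.alpha == null) {
    console.log(`[alpha-scheduler] alpha auto-filled to ${alpha}`);
  }
  return { rank: cfg.rank, alpha };
}
```

Because `alpha` is guaranteed non-null afterward, downstream ratio calculations no longer need to guard against dividing by `None`.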
relaxis pushed a commit that referenced this pull request Nov 4, 2025
…d EMA

**ROOT CAUSES:**
1. NO boundary realignment when resuming from checkpoint
   - Training always reset to boundary_index=0, steps_this_boundary=0
   - Caused incorrect expert labeling in metrics after every resume

2. Codex's attempted fix had an off-by-one error
   - Used: steps_this_boundary = effective_step % switch_boundary_every
   - Should be: steps_this_boundary = (effective_step % switch_boundary_every) + 1
   - After a step completes, steps_this_boundary has already been incremented

3. Missing EMA calculations (user's #1 requested metric)
   - UI only showed simple averages, not exponential moving averages

**EVIDENCE FROM METRICS:**
- Steps 200-400: stayed high_noise (should switch at 300) - resume at 201/301
- Steps 500-700+: stayed high_noise (should switch at 600) - resume at 701
- Timestamp gaps confirmed resumes without realignment
- Expert labels completely wrong after resume

**FIXES:**

jobs/process/BaseSDTrainProcess.py:
- Fixed off-by-one error in boundary realignment
- Added correct formula: (effective_step % switch_boundary_every) + 1
- Added debug logging for realignment state
- Comprehensive comments explaining the math
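The corrected realignment math can be illustrated with a small sketch (TypeScript for consistency with the examples here; the actual fix lives in the Python file above, and the `boundaryIndex` formula is an assumption about how experts alternate every `switch_boundary_every` steps):

```typescript
// After effectiveStep completed steps, the per-boundary counter has already
// been incremented for the last step, hence the +1 (the off-by-one fix).
function realign(effectiveStep: number, switchBoundaryEvery: number) {
  const stepsThisBoundary = (effectiveStep % switchBoundaryEvery) + 1;
  // Assumed: the active boundary advances once per switchBoundaryEvery steps.
  const boundaryIndex = Math.floor(effectiveStep / switchBoundaryEvery);
  return { boundaryIndex, stepsThisBoundary };
}
```

This matches the evidence above: resuming at effective step 300 lands in boundary 1 at step 1, rather than staying on the previous expert.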

extensions_built_in/sd_trainer/SDTrainer.py:
- Added boundary switch logging at multiples of 100 steps
- Logs old_expert → new_expert transitions for debugging

ui/src/components/JobMetrics.tsx:
- Implemented EMA calculations with proper smoothing factor
- Added per-expert EMA: highNoiseLossEMA, lowNoiseLossEMA
- Added per-expert gradient stability EMA
- Created dedicated EMA Loss display card
- Updated expert comparison cards to show both simple avg and EMA
- EMA weights recent values more heavily (α = 2/(N+1))
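An EMA with α = 2/(N+1) can be sketched like this (a hypothetical helper, not the actual JobMetrics.tsx code):

```typescript
// Exponential moving average with smoothing factor alpha = 2 / (N + 1):
// recent values are weighted more heavily than older ones.
function ema(values: number[], windowSize: number): number[] {
  const alpha = 2 / (windowSize + 1);
  const out: number[] = [];
  let prev = values[0] ?? 0; // seed with the first observation
  for (const v of values) {
    prev = alpha * v + (1 - alpha) * prev;
    out.push(prev);
  }
  return out;
}
```

A display card could then show the latest EMA (`ema(losses, N).at(-1)`) alongside the simple average for each window size.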

**TESTING:**
- Next resume will log realignment state
- Metrics will show correct expert labels
- EMA values provide better training trend indicators
- Window sizes 10/50/100 all have proper EMA calculations

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>