
Add cached dataloader, lazy imports, training improvements, and linear regression tools #18

Merged
cweniger merged 20 commits into main from feat/cached_dataloader on Feb 2, 2026

Conversation

@cweniger (Owner) commented on Jan 30, 2026

Summary

Core infrastructure

  • Cached dataloader: CachedDataLoader in raystore.py keeps training data in a local dict, periodically syncing from the Ray dataset manager. Avoids per-epoch Ray object store round-trips.
  • Lazy imports for fast CLI: Defers heavy imports (torch, ray, numpy, etc.) using __getattr__ lazy loading. Reduces falcon --help from ~3-8s to ~0.07s.
  • Startup banner: Shows falcon ▁▂▅▇█▆▃▂▁▁ vX.Y.Z on all commands.
  • In-process monitor: falcon monitor runs in-process instead of spawning a subprocess.
  • Initial sample generation moved to driver: Driver generates samples and sends them via append.remote(), consistent with the resample loop.
  • Async MonitorBridge: get_status uses await asyncio.wait_for instead of blocking ray.get.
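The MonitorBridge change follows the usual pattern for async Ray actors, where a blocking ray.get would stall the actor's event loop. A minimal sketch is below; the head-actor handle and its `status.remote()` method are illustrative stand-ins, not the real falcon API.

```python
import asyncio
import ray


@ray.remote
class MonitorBridge:
    """Illustrative async Ray actor: blocking ray.get inside an async
    actor would stall its event loop, so the status call is awaited."""

    def __init__(self, head_actor):
        self._head = head_actor  # handle to the actor that owns the status (assumed name)

    async def get_status(self, timeout: float = 5.0):
        # Awaiting the ObjectRef yields control back to the event loop;
        # wait_for bounds how long we are willing to wait for the status.
        return await asyncio.wait_for(self._head.status.remote(), timeout)
```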

Training output improvements

  • Epoch summary lines: Rich per-epoch output with steps, n_sims, train/val loss, lr, theta_std, eigvals_mean.
  • Removed debug spam: Cleaned up per-step debug logging in SNPE_gaussian._update_stats().
  • Metric logging: Added theta_std and residual_eigvals_mean to wandb/local logging after eigendecomposition updates.
  • on_epoch_end returns extra metrics: LossBasedEstimator.on_epoch_end returns dict with lr/theta_std/eigvals_mean for the summary line.
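For the summary line, the hook only needs to hand back a small dict. A sketch of the shape it might have, where the metric keys (lr, theta_std, eigvals_mean) come from this PR but the class body and attribute names are placeholders, not the actual LossBasedEstimator implementation:

```python
class LossBasedEstimator:  # sketch of just the hook, not the real class
    def on_epoch_end(self) -> dict:
        """Return extra metrics for the per-epoch summary line.
        The attribute names below are illustrative placeholders."""
        return {
            "lr": self.optimizer.param_groups[0]["lr"],
            "theta_std": float(self.theta_std.mean()),
            "eigvals_mean": float(self.residual_eigvals.mean()),
        }
```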

05_linear_regression example

  • GPU-accelerated LinearSimulator: Auto-detects CUDA for forward simulation.
  • E_fft_whiten embedding: New embedding that whitens raw input before FFT (matching standalone.py pipeline), with single linear projection.
  • standalone.py: Full standalone reference implementation with theta_std/eigvals_mean output columns for direct comparison with falcon training.
  • Updated config: Uses E_fft_whiten, lr=0.001, GPU allocation for both nodes.
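A minimal sketch of what the E_fft_whiten embedding described above might look like: diagonal whitening of the raw input, an orthonormal rFFT, and a single linear projection. The exact normalization, any mode truncation, how the whitening statistics are obtained, and the parameter names in the real class may all differ.

```python
import torch
import torch.nn as nn


class E_fft_whiten(nn.Module):
    """Sketch: whiten the raw input per bin, take an orthonormal rFFT,
    and project the (real, imag) features with a single linear layer."""

    def __init__(self, n_bins: int, n_features: int, eps: float = 1e-8):
        super().__init__()
        # Diagonal whitening statistics; how the real embedding estimates
        # or updates these is not shown in the PR, so treat as placeholders.
        self.register_buffer("mean", torch.zeros(n_bins))
        self.register_buffer("std", torch.ones(n_bins))
        self.eps = eps
        n_modes = n_bins // 2 + 1
        self.proj = nn.Linear(2 * n_modes, n_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = (x - self.mean) / (self.std + self.eps)   # whiten raw input
        f = torch.fft.rfft(x, norm="ortho")           # orthonormal FFT
        feats = torch.cat([f.real, f.imag], dim=-1)   # real-valued features
        return self.proj(feats)                       # single linear projection
```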

Test plan

  • pytest tests/ — all tests pass
  • falcon --help is fast (<0.5s)
  • Run examples/05_linear_regression end-to-end — training converges with proper epoch summaries
  • Verified standalone.py and falcon produce comparable theta_std convergence

🤖 Generated with Claude Code

cweniger and others added 8 commits January 31, 2026 00:01
- Add CachedDataLoader class that caches samples as numpy arrays locally
- Add checkout_refs/release_refs to DatasetManagerActor for ref-counted
  incremental sync (only fetches new samples, not the entire buffer; see the sketch below)
- Add cached_loader/cached_val_loader to BufferView
- Add _train_cached path to StepwiseEstimator, activated via
  cache_sync_every > 0 in TrainingLoopConfig
- Default cache_sync_every=0 preserves original DataLoader behavior
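A sketch of the ref-counted incremental sync these bullets describe. Only the names CachedDataLoader, DatasetManagerActor, checkout_refs, and release_refs come from this commit; the signatures and the start-index convention are assumptions.

```python
import ray


class CachedDataLoader:
    """Sketch: ask the DatasetManagerActor only for refs not seen yet,
    materialize them into the local cache, then release them so the
    actor can decrement its ref count."""

    def __init__(self, manager):
        self._manager = manager  # DatasetManagerActor handle
        self._cache = {}         # local sample cache
        self._seen = 0           # number of samples already fetched

    def sync(self):
        # checkout_refs/release_refs are named in this commit; the exact
        # signatures here are assumptions.
        refs = ray.get(self._manager.checkout_refs.remote(start=self._seen))
        for i, ref in enumerate(refs, start=self._seen):
            self._cache[i] = ray.get(ref)            # fetch only the new samples
        self._seen += len(refs)
        self._manager.release_refs.remote(refs)      # refs nested in a list are passed as refs
```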

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Move torch, ray, numpy, omegaconf, and all falcon.* imports out of
module level in cli.py into the functions that use them. Replace eager
imports in falcon/__init__.py and falcon/contrib/__init__.py with
lazy __getattr__ patterns so that `falcon --help` no longer loads
the full ML stack.
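The lazy loading relies on the module-level `__getattr__` hook from PEP 562. A sketch of the pattern, with an illustrative attribute-to-module mapping rather than the real falcon API surface:

```python
# falcon/__init__.py -- sketch of the lazy-import pattern (PEP 562).
# The attribute names listed here are illustrative, not the real API.
import importlib

_LAZY = {
    "StepwiseEstimator": "falcon.estimators",
    "BufferView": "falcon.raystore",
}


def __getattr__(name):
    # Called only when `name` is not already a module attribute, so
    # importing falcon stays cheap and torch/ray load on first real use.
    if name in _LAZY:
        module = importlib.import_module(_LAZY[name])
        return getattr(module, name)
    raise AttributeError(f"module 'falcon' has no attribute {name!r}")
```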

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Show "falcon ▁▂▅▇█▆▃▂▁▁ vX.Y.Z" banner on all commands including
help. Run falcon monitor directly in-process instead of spawning a
subprocess, avoiding a full second Python startup with heavy imports.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The initial sampling phase can take minutes but previously produced no output,
leaving the user without feedback. Add log messages before and after
the blocking initialize_samples call.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…level logging

Update example to match proven baseline: 20000 bins, sigma=1.0, FFT norm
embedding, Adam betas=[0.1, 0.1], gamma=0.1. Add E_fft_norm embedding class
with orthonormal FFT, mode truncation, and gated linear projection.

Add single-flag console logging for node actors: setting logging.console.level
(e.g. DEBUG, INFO) enables node console output on stdout and automatically
couples ray's log_to_driver. The console handler writes to stdout to avoid
duplicating the stderr _StreamCapture, which captures C++/crash errors.
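A sketch of how a single `logging.console.level` value might be wired up on the node side: install a stdout handler and report whether ray should also be initialized with log_to_driver=True. The helper name and return convention are assumptions, not the real implementation.

```python
import logging
import sys
from typing import Optional


def configure_node_console(level_name: Optional[str]) -> bool:
    """Sketch: install a stdout console handler for node actors when
    logging.console.level is set; return True if ray should also be
    started with log_to_driver=True. Names here are illustrative."""
    if not level_name:
        return False
    level = getattr(logging, level_name.upper(), logging.INFO)
    handler = logging.StreamHandler(sys.stdout)  # stdout, so the stderr
    handler.setLevel(level)                      # _StreamCapture stays clean
    handler.setFormatter(logging.Formatter("%(name)s %(levelname)s: %(message)s"))
    root = logging.getLogger()
    root.addHandler(handler)
    root.setLevel(level)
    return True  # couple ray log_to_driver to the same flag
```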

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
MonitorBridge is a Ray actor, so blocking ray.get calls inside it stall
the event loop. Replace with await asyncio.wait_for to properly yield.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Sample generation now happens on the driver side, consistent with the
resample loop. The actor retains only load_initial_samples for disk
loading. This avoids blocking ray.get inside the async actor and
unifies the sampling responsibility in one place.
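A sketch of the driver-side flow this describes. Only initialize_samples and append.remote() are named in the PR; the simulator call, the chunking, and the function signature are illustrative.

```python
import ray


def initialize_samples(simulator, dataset_manager, n_initial: int, chunk_size: int = 1024):
    """Sketch: generate the initial samples on the driver and push them to
    the DatasetManagerActor via append.remote(), mirroring the resample loop."""
    for start in range(0, n_initial, chunk_size):
        n = min(chunk_size, n_initial - start)
        samples = simulator.sample(n)                    # runs on the driver (assumed API)
        ray.get(dataset_manager.append.remote(samples))  # same path as resampling
```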

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The async design causes ray.get warnings from CachedDataLoader.
Document the potential refactoring path (separate training/sampling actors).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@cweniger (Owner, Author) commented on `async def _train_cached(self, buffer, cfg, keys) -> None`:

Is it really best to now have two training methods? A single method with some internal conditionals would remove the code duplication.

cweniger and others added 7 commits January 31, 2026 22:42
Remove the _train_original/_train_cached split in StepwiseEstimator, replacing it
with a single _train method that always uses CachedDataLoader. cache_sync_every=0
now means "sync every epoch" (the same data freshness as the old DataLoader path).

Remove DatasetView, BatchDatasetView, batch_collate_fn, BufferView.train_loader,
BufferView.val_loader, and related DatasetManagerActor methods that are no longer
needed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix sigma comments to match actual values (0.1, not 1.0)
- Update config: fft_norm embedding, gamma=1.0, betas=[0.5, 0.5],
  cache_sync_every=1, n_bins=20000
- Regenerate mock_data.npz with sigma=0.1
- Add standalone.py (gaussian_lr5 with n_bins=20000, fft_norm, gamma=1.0)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…lone.py

- config.yaml: gamma 1.0 -> 0.2, betas [0.5, 0.5] -> [0.9, 0.9]
- model.py: LinearSimulator sigma default 1.0 -> 0.1, design_matrix n_bins default 100 -> 20000

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Replace debug spam in GaussianPosterior with proper metric logging
  (theta_std, residual_eigvals_mean)
- Add epoch summary line with steps, n_sims, losses, lr, posterior stats
- Update on_epoch_end to return extra metrics for summary display
- Auto-detect CUDA in LinearSimulator for GPU-accelerated simulation
- Tune config: lr=0.001, resample_batch_size=2048, split GPU 0.5/0.5

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add E_fft_whiten: embedding with built-in diagonal whitening on raw
  input before FFT, matching standalone.py pipeline
- Add theta_std and eigvals_mean columns to standalone.py output for
  comparison with falcon training metrics
- Update config to use E_fft_whiten

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@cweniger changed the title from "Add cached dataloader and lazy imports for fast CLI" to "Add cached dataloader, lazy imports, training improvements, and linear regression tools" on Feb 1, 2026
cweniger and others added 5 commits February 2, 2026 13:48
CachedDataLoader now stores samples as contiguous torch tensors with
incremental sync (free-row reuse + bulk append). Adds cache_on_device
flag to optionally place the buffer on GPU. All torch.from_numpy calls
on batch data replaced with _to_tensor helper that handles both numpy
and torch inputs. Also adds simulate_chunk_size for chunked initial
sample generation.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Scale n_bins from 20k to 1M in config, data generator, and mock data.
Add simulate_chunk_size, cache_on_device flag, and increase
resample_interval to 3200.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Use np.asarray + torch.as_tensor instead of torch.from_numpy to handle
numpy scalars (e.g., float64 logprobs) that arrive during resampling.
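A sketch of a conversion helper along the lines described here: torch.from_numpy rejects numpy scalars, so routing through np.asarray + torch.as_tensor accepts numpy arrays, numpy scalars, and existing tensors alike. The dtype/device handling is an assumption, not the real _to_tensor implementation.

```python
import numpy as np
import torch


def _to_tensor(value, device=None, dtype=torch.float32):
    """Sketch: torch.from_numpy rejects numpy scalars, so go through
    np.asarray + torch.as_tensor instead."""
    if isinstance(value, torch.Tensor):
        return value.to(device=device, dtype=dtype)
    # np.asarray turns numpy scalars (and lists) into proper ndarrays;
    # torch.as_tensor avoids a copy when dtype and device already match.
    return torch.as_tensor(np.asarray(value), dtype=dtype, device=device)
```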

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
SNPE_A extends StepwiseEstimator directly and calls self._to_tensor(),
but the method was only defined on LossBasedEstimator. This caused an
AttributeError at runtime for examples using SNPE_A.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@cweniger (Owner, Author) left a comment
All good!

@cweniger merged commit ae5de2b into main on Feb 2, 2026
@cweniger deleted the feat/cached_dataloader branch on February 2, 2026 at 22:18