
Add cached dataloader, lazy imports, training improvements, and linear regression tools #18

Merged
cweniger merged 20 commits into main from feat/cached_dataloader on Feb 2, 2026

Conversation

@cweniger (Owner) commented on Jan 30, 2026

Summary

Core infrastructure

  • Cached dataloader: CachedDataLoader in raystore.py keeps training data in a local dict, periodically syncing from the Ray dataset manager. Avoids per-epoch Ray object store round-trips.
  • Lazy imports for fast CLI: Defers heavy imports (torch, ray, numpy, etc.) using __getattr__ lazy loading. Reduces falcon --help from ~3-8s to ~0.07s.
  • Startup banner: Shows falcon ▁▂▅▇█▆▃▂▁▁ vX.Y.Z on all commands.
  • In-process monitor: falcon monitor runs in-process instead of spawning a subprocess.
  • Initial sample generation moved to driver: Driver generates samples and sends them via append.remote(), consistent with the resample loop.
  • Async MonitorBridge: get_status uses await asyncio.wait_for instead of blocking ray.get.
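The MonitorBridge change follows the usual pattern for async Ray actors, where a blocking ray.get would stall the actor's event loop. A minimal sketch is below; the head-actor handle and its `status.remote()` method are illustrative stand-ins, not the real falcon API.

```python
import asyncio
import ray


@ray.remote
class MonitorBridge:
    """Illustrative async Ray actor: blocking ray.get inside an async
    actor would stall its event loop, so the status call is awaited."""

    def __init__(self, head_actor):
        self._head = head_actor  # handle to the actor that owns the status (assumed name)

    async def get_status(self, timeout: float = 5.0):
        # Awaiting the ObjectRef yields control back to the event loop;
        # wait_for bounds how long we are willing to wait for the status.
        return await asyncio.wait_for(self._head.status.remote(), timeout)
```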

Training output improvements

  • Epoch summary lines: Rich per-epoch output with steps, n_sims, train/val loss, lr, theta_std, eigvals_mean.
  • Removed debug spam: Cleaned up per-step debug logging in SNPE_gaussian._update_stats().
  • Metric logging: Added theta_std and residual_eigvals_mean to wandb/local logging after eigendecomposition updates.
  • on_epoch_end returns extra metrics: LossBasedEstimator.on_epoch_end returns dict with lr/theta_std/eigvals_mean for the summary line.
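For the summary line, the hook only needs to hand back a small dict. A sketch of the shape it might have, where the metric keys (lr, theta_std, eigvals_mean) come from this PR but the class body and attribute names are placeholders, not the actual LossBasedEstimator implementation:

```python
class LossBasedEstimator:  # sketch of just the hook, not the real class
    def on_epoch_end(self) -> dict:
        """Return extra metrics for the per-epoch summary line.
        The attribute names below are illustrative placeholders."""
        return {
            "lr": self.optimizer.param_groups[0]["lr"],
            "theta_std": float(self.theta_std.mean()),
            "eigvals_mean": float(self.residual_eigvals.mean()),
        }
```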

05_linear_regression example

  • GPU-accelerated LinearSimulator: Auto-detects CUDA for forward simulation.
  • E_fft_whiten embedding: New embedding that whitens raw input before FFT (matching standalone.py pipeline), with single linear projection.
  • standalone.py: Full standalone reference implementation with theta_std/eigvals_mean output columns for direct comparison with falcon training.
  • Updated config: Uses E_fft_whiten, lr=0.001, GPU allocation for both nodes.
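A minimal sketch of what the E_fft_whiten embedding described above might look like: diagonal whitening of the raw input, an orthonormal rFFT, and a single linear projection. The exact normalization, any mode truncation, how the whitening statistics are obtained, and the parameter names in the real class may all differ.

```python
import torch
import torch.nn as nn


class E_fft_whiten(nn.Module):
    """Sketch: whiten the raw input per bin, take an orthonormal rFFT,
    and project the (real, imag) features with a single linear layer."""

    def __init__(self, n_bins: int, n_features: int, eps: float = 1e-8):
        super().__init__()
        # Diagonal whitening statistics; how the real embedding estimates
        # or updates these is not shown in the PR, so treat as placeholders.
        self.register_buffer("mean", torch.zeros(n_bins))
        self.register_buffer("std", torch.ones(n_bins))
        self.eps = eps
        n_modes = n_bins // 2 + 1
        self.proj = nn.Linear(2 * n_modes, n_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = (x - self.mean) / (self.std + self.eps)   # whiten raw input
        f = torch.fft.rfft(x, norm="ortho")           # orthonormal FFT
        feats = torch.cat([f.real, f.imag], dim=-1)   # real-valued features
        return self.proj(feats)                       # single linear projection
```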

Test plan

  • pytest tests/ — all tests pass
  • falcon --help is fast (<0.5s)
  • Run examples/05_linear_regression end-to-end — training converges with proper epoch summaries
  • Verified standalone.py and falcon produce comparable theta_std convergence

🤖 Generated with Claude Code

cweniger and others added 8 commits January 31, 2026 00:01
- Add CachedDataLoader class that caches samples as numpy arrays locally
- Add checkout_refs/release_refs to DatasetManagerActor for ref-counted
  incremental sync (only fetches new samples, not the entire buffer; see the sketch below)
- Add cached_loader/cached_val_loader to BufferView
- Add _train_cached path to StepwiseEstimator, activated via
  cache_sync_every > 0 in TrainingLoopConfig
- Default cache_sync_every=0 preserves original DataLoader behavior
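A sketch of the ref-counted incremental sync these bullets describe. Only the names CachedDataLoader, DatasetManagerActor, checkout_refs, and release_refs come from this commit; the signatures and the start-index convention are assumptions.

```python
import ray


class CachedDataLoader:
    """Sketch: ask the DatasetManagerActor only for refs not seen yet,
    materialize them into the local cache, then release them so the
    actor can decrement its ref count."""

    def __init__(self, manager):
        self._manager = manager  # DatasetManagerActor handle
        self._cache = {}         # local sample cache
        self._seen = 0           # number of samples already fetched

    def sync(self):
        # checkout_refs/release_refs are named in this commit; the exact
        # signatures here are assumptions.
        refs = ray.get(self._manager.checkout_refs.remote(start=self._seen))
        for i, ref in enumerate(refs, start=self._seen):
            self._cache[i] = ray.get(ref)            # fetch only the new samples
        self._seen += len(refs)
        self._manager.release_refs.remote(refs)      # refs nested in a list are passed as refs
```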

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Move torch, ray, numpy, omegaconf, and all falcon.* imports out of
module level in cli.py into the functions that use them. Replace eager
imports in falcon/__init__.py and falcon/contrib/__init__.py with
lazy __getattr__ patterns so that `falcon --help` no longer loads
the full ML stack.
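The lazy loading relies on the module-level `__getattr__` hook from PEP 562. A sketch of the pattern, with an illustrative attribute-to-module mapping rather than the real falcon API surface:

```python
# falcon/__init__.py -- sketch of the lazy-import pattern (PEP 562).
# The attribute names listed here are illustrative, not the real API.
import importlib

_LAZY = {
    "StepwiseEstimator": "falcon.estimators",
    "BufferView": "falcon.raystore",
}


def __getattr__(name):
    # Called only when `name` is not already a module attribute, so
    # importing falcon stays cheap and torch/ray load on first real use.
    if name in _LAZY:
        module = importlib.import_module(_LAZY[name])
        return getattr(module, name)
    raise AttributeError(f"module 'falcon' has no attribute {name!r}")
```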

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Show "falcon ▁▂▅▇█▆▃▂▁▁ vX.Y.Z" banner on all commands including
help. Run falcon monitor directly in-process instead of spawning a
subprocess, avoiding a full second Python startup with heavy imports.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The initial sampling phase can take minutes but previously produced no output,
leaving the user without feedback. Add log messages before and after
the blocking initialize_samples call.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…level logging

Update example to match proven baseline: 20000 bins, sigma=1.0, FFT norm
embedding, Adam betas=[0.1, 0.1], gamma=0.1. Add E_fft_norm embedding class
with orthonormal FFT, mode truncation, and gated linear projection.

Add single-flag console logging for node actors: setting logging.console.level
(e.g. DEBUG, INFO) enables node console output on stdout and automatically
couples ray's log_to_driver. The console handler writes to stdout to avoid
duplicating the stderr _StreamCapture, which captures C++/crash errors.
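A sketch of how a single `logging.console.level` value might be wired up on the node side: install a stdout handler and report whether ray should also be initialized with log_to_driver=True. The helper name and return convention are assumptions, not the real implementation.

```python
import logging
import sys
from typing import Optional


def configure_node_console(level_name: Optional[str]) -> bool:
    """Sketch: install a stdout console handler for node actors when
    logging.console.level is set; return True if ray should also be
    started with log_to_driver=True. Names here are illustrative."""
    if not level_name:
        return False
    level = getattr(logging, level_name.upper(), logging.INFO)
    handler = logging.StreamHandler(sys.stdout)  # stdout, so the stderr
    handler.setLevel(level)                      # _StreamCapture stays clean
    handler.setFormatter(logging.Formatter("%(name)s %(levelname)s: %(message)s"))
    root = logging.getLogger()
    root.addHandler(handler)
    root.setLevel(level)
    return True  # couple ray log_to_driver to the same flag
```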

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
MonitorBridge is a Ray actor, so blocking ray.get calls inside it stall
the event loop. Replace with await asyncio.wait_for to properly yield.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Sample generation now happens on the driver side, consistent with the
resample loop. The actor retains only load_initial_samples for disk
loading. This avoids blocking ray.get inside the async actor and
unifies the sampling responsibility in one place.
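A sketch of the driver-side flow this describes. Only initialize_samples and append.remote() are named in the PR; the simulator call, the chunking, and the function signature are illustrative.

```python
import ray


def initialize_samples(simulator, dataset_manager, n_initial: int, chunk_size: int = 1024):
    """Sketch: generate the initial samples on the driver and push them to
    the DatasetManagerActor via append.remote(), mirroring the resample loop."""
    for start in range(0, n_initial, chunk_size):
        n = min(chunk_size, n_initial - start)
        samples = simulator.sample(n)                    # runs on the driver (assumed API)
        ray.get(dataset_manager.append.remote(samples))  # same path as resampling
```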

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The async design causes ray.get warnings from CachedDataLoader.
Document the potential refactoring path (separate training/sampling actors).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@cweniger (Owner, Author) commented on `async def _train_cached(self, buffer, cfg, keys) -> None`:

Is it really best to now have two training methods? A single method with some internal conditionals would remove the code duplication.

cweniger and others added 7 commits January 31, 2026 22:42
Remove the _train_original/_train_cached split in StepwiseEstimator, replacing it
with a single _train method that always uses CachedDataLoader. cache_sync_every=0
now means "sync every epoch" (the same data freshness as the old DataLoader path).

Remove DatasetView, BatchDatasetView, batch_collate_fn, BufferView.train_loader,
BufferView.val_loader, and related DatasetManagerActor methods that are no longer
needed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix sigma comments to match actual values (0.1, not 1.0)
- Update config: fft_norm embedding, gamma=1.0, betas=[0.5, 0.5],
  cache_sync_every=1, n_bins=20000
- Regenerate mock_data.npz with sigma=0.1
- Add standalone.py (gaussian_lr5 with n_bins=20000, fft_norm, gamma=1.0)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…lone.py

- config.yaml: gamma 1.0 -> 0.2, betas [0.5, 0.5] -> [0.9, 0.9]
- model.py: LinearSimulator sigma default 1.0 -> 0.1, design_matrix n_bins default 100 -> 20000

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Replace debug spam in GaussianPosterior with proper metric logging
  (theta_std, residual_eigvals_mean)
- Add epoch summary line with steps, n_sims, losses, lr, posterior stats
- Update on_epoch_end to return extra metrics for summary display
- Auto-detect CUDA in LinearSimulator for GPU-accelerated simulation
- Tune config: lr=0.001, resample_batch_size=2048, split GPU 0.5/0.5

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add E_fft_whiten: embedding with built-in diagonal whitening on raw
  input before FFT, matching standalone.py pipeline
- Add theta_std and eigvals_mean columns to standalone.py output for
  comparison with falcon training metrics
- Update config to use E_fft_whiten

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@cweniger changed the title from "Add cached dataloader and lazy imports for fast CLI" to "Add cached dataloader, lazy imports, training improvements, and linear regression tools" on Feb 1, 2026
cweniger and others added 5 commits February 2, 2026 13:48
CachedDataLoader now stores samples as contiguous torch tensors with
incremental sync (free-row reuse + bulk append). Adds cache_on_device
flag to optionally place the buffer on GPU. All torch.from_numpy calls
on batch data replaced with _to_tensor helper that handles both numpy
and torch inputs. Also adds simulate_chunk_size for chunked initial
sample generation.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Scale n_bins from 20k to 1M in config, data generator, and mock data.
Add simulate_chunk_size, cache_on_device flag, and increase
resample_interval to 3200.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Use np.asarray + torch.as_tensor instead of torch.from_numpy to handle
numpy scalars (e.g., float64 logprobs) that arrive during resampling.
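A sketch of a conversion helper along the lines described here: torch.from_numpy rejects numpy scalars, so routing through np.asarray + torch.as_tensor accepts numpy arrays, numpy scalars, and existing tensors alike. The dtype/device handling is an assumption, not the real _to_tensor implementation.

```python
import numpy as np
import torch


def _to_tensor(value, device=None, dtype=torch.float32):
    """Sketch: torch.from_numpy rejects numpy scalars, so go through
    np.asarray + torch.as_tensor instead."""
    if isinstance(value, torch.Tensor):
        return value.to(device=device, dtype=dtype)
    # np.asarray turns numpy scalars (and lists) into proper ndarrays;
    # torch.as_tensor avoids a copy when dtype and device already match.
    return torch.as_tensor(np.asarray(value), dtype=dtype, device=device)
```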

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
SNPE_A extends StepwiseEstimator directly and calls self._to_tensor(),
but the method was only defined on LossBasedEstimator. This caused an
AttributeError at runtime for examples using SNPE_A.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@cweniger (Owner, Author) left a comment
All good!

@cweniger merged commit ae5de2b into main on Feb 2, 2026
@cweniger deleted the feat/cached_dataloader branch on February 2, 2026 at 22:18