Add cached dataloader, lazy imports, training improvements, and linear regression tools#18
Merged
Add cached dataloader, lazy imports, training improvements, and linear regression tools#18
Conversation
- Add CachedDataLoader class that caches samples as numpy arrays locally - Add checkout_refs/release_refs to DatasetManagerActor for ref-counted incremental sync (only fetches new samples, not entire buffer) - Add cached_loader/cached_val_loader to BufferView - Add _train_cached path to StepwiseEstimator, activated via cache_sync_every > 0 in TrainingLoopConfig - Default cache_sync_every=0 preserves original DataLoader behavior Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Move torch, ray, numpy, omegaconf, and all falcon.* imports out of module level in cli.py into the functions that use them. Replace eager imports in falcon/__init__.py and falcon/contrib/__init__.py with lazy __getattr__ patterns so that `falcon --help` no longer loads the full ML stack. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Show "falcon ▁▂▅▇█▆▃▂▁▁ vX.Y.Z" banner on all commands including help. Run falcon monitor directly in-process instead of spawning a subprocess, avoiding a full second Python startup with heavy imports. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The initial sampling phase can take minutes but produced no output, leaving the user with no feedback. Add log messages before and after the blocking initialize_samples call. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…level logging Update example to match proven baseline: 20000 bins, sigma=1.0, FFT norm embedding, Adam betas=[0.1, 0.1], gamma=0.1. Add E_fft_norm embedding class with orthonormal FFT, mode truncation, and gated linear projection. Add single-flag console logging for node actors: setting logging.console.level (e.g. DEBUG, INFO) enables node console output on stdout and couples ray log_to_driver automatically. Console uses stdout to avoid duplication with stderr _StreamCapture which captures C++/crash errors. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
MonitorBridge is a Ray actor, so blocking ray.get calls inside it stall the event loop. Replace with await asyncio.wait_for to properly yield. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Sample generation now happens on the driver side, consistent with the resample loop. The actor retains only load_initial_samples for disk loading. This avoids blocking ray.get inside the async actor and unifies the sampling responsibility in one place. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The async design causes ray.get warnings from CachedDataLoader. Document the potential refactoring path (separate training/sampling actors). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
cweniger
commented
Jan 31, 2026
falcon/contrib/stepwise_estimator.py
Outdated
| if self._terminated: | ||
| break | ||
|
|
||
| async def _train_cached(self, buffer, cfg, keys) -> None: |
Owner
Author
There was a problem hiding this comment.
Is it really the best to have now two training methods. One with some conditionals internally shoudl remove code duplication.
Remove _train_original/_train_cached split in StepwiseEstimator, replacing with a single _train method that always uses CachedDataLoader. cache_sync_every=0 now means "sync every epoch" (same data freshness as old DataLoader path). Remove DatasetView, BatchDatasetView, batch_collate_fn, BufferView.train_loader, BufferView.val_loader, and related DatasetManagerActor methods that are no longer needed. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix sigma comments to match actual values (0.1, not 1.0) - Update config: fft_norm embedding, gamma=1.0, betas=[0.5, 0.5], cache_sync_every=1, n_bins=20000 - Regenerate mock_data.npz with sigma=0.1 - Add standalone.py (gaussian_lr5 with n_bins=20000, fft_norm, gamma=1.0) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…lone.py - config.yaml: gamma 1.0 -> 0.2, betas [0.5, 0.5] -> [0.9, 0.9] - model.py: LinearSimulator sigma default 1.0 -> 0.1, design_matrix n_bins default 100 -> 20000 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Replace debug spam in GaussianPosterior with proper metric logging (theta_std, residual_eigvals_mean) - Add epoch summary line with steps, n_sims, losses, lr, posterior stats - Update on_epoch_end to return extra metrics for summary display - Auto-detect CUDA in LinearSimulator for GPU-accelerated simulation - Tune config: lr=0.001, resample_batch_size=2048, split GPU 0.5/0.5 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add E_fft_whiten: embedding with built-in diagonal whitening on raw input before FFT, matching standalone.py pipeline - Add theta_std and eigvals_mean columns to standalone.py output for comparison with falcon training metrics - Update config to use E_fft_whiten Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
CachedDataLoader now stores samples as contiguous torch tensors with incremental sync (free-row reuse + bulk append). Adds cache_on_device flag to optionally place the buffer on GPU. All torch.from_numpy calls on batch data replaced with _to_tensor helper that handles both numpy and torch inputs. Also adds simulate_chunk_size for chunked initial sample generation. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Scale n_bins from 20k to 1M in config, data generator, and mock data. Add simulate_chunk_size, cache_on_device flag, and increase resample_interval to 3200. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Use np.asarray + torch.as_tensor instead of torch.from_numpy to handle numpy scalars (e.g., float64 logprobs) that arrive during resampling. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
SNPE_A extends StepwiseEstimator directly and calls self._to_tensor(), but the method was only defined on LossBasedEstimator. This caused an AttributeError at runtime for examples using SNPE_A. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Core infrastructure
CachedDataLoaderinraystore.pykeeps training data in a local dict, periodically syncing from the Ray dataset manager. Avoids per-epoch Ray object store round-trips.torch,ray,numpy, etc.) using__getattr__lazy loading. Reducesfalcon --helpfrom ~3-8s to ~0.07s.falcon ▁▂▅▇█▆▃▂▁▁ vX.Y.Zon all commands.falcon monitorruns in-process instead of spawning a subprocess.append.remote(), consistent with the resample loop.get_statususesawait asyncio.wait_forinstead of blockingray.get.Training output improvements
SNPE_gaussian._update_stats().theta_stdandresidual_eigvals_meanto wandb/local logging after eigendecomposition updates.on_epoch_endreturns extra metrics:LossBasedEstimator.on_epoch_endreturns dict with lr/theta_std/eigvals_mean for the summary line.05_linear_regression example
Test plan
pytest tests/— all tests passfalcon --helpis fast (<0.5s)examples/05_linear_regressionend-to-end — training converges with proper epoch summaries🤖 Generated with Claude Code