
Conversation

@divyanshk

This diff introduces common dataloader args that are supported by StatefulDataLoader (and torch.utils.data.DataLoader). Users should be able to set them in their config files.

I was thinking about introducing a catch-all kwargs to make it easier to specify args, but that can easily complicate things (validation checks, duplication, args already defined as named parameters in function definitions, etc.).

@meta-cla bot added the CLA Signed label on Dec 2, 2025.
@divyanshk force-pushed the divyanshk/dataloader_args branch from 6763cc0 to 990d654 on December 2, 2025 01:04.
@divyanshk marked this pull request as ready for review on December 3, 2025 17:09.
@wwwjn (Contributor) left a comment:

Thank you! I'm slightly leaning towards using kwargs instead of adding these parameters one by one, because StatefulDataLoader() supports a lot of fields and it's hard to say which of them are "common" across different use cases.

Can you explain more about "but that can easily complicate things"? We could just pass all the kwargs to StatefulDataLoader and let it check correctness. wdyt @tianyu-l
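
A toy sketch of the alternative described here, assuming the dataloader construction ultimately calls torchdata's StatefulDataLoader; the helper name and argument names are illustrative, not the actual torchtitan code:

    from torchdata.stateful_dataloader import StatefulDataLoader

    def build_dataloader(dataset, batch_size, collate_fn, user_kwargs):
        # Every config-provided kwarg is forwarded unchanged; an unsupported
        # key (e.g. a typo like "num_worker") raises a TypeError from
        # StatefulDataLoader.__init__ itself instead of being validated up front.
        return StatefulDataLoader(
            dataset,
            batch_size=batch_size,
            collate_fn=collate_fn,
            **user_kwargs,
        )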

Comment on lines 89 to 92
num_workers=num_workers,
persistent_workers=persistent_workers,
prefetch_factor=prefetch_factor,
pin_memory=pin_memory,
Contributor commented:

"I was thinking about introducing a catch-all kwargs to make it easier to specify args, but that can easily complicate things (validation checks, duplication, args already defined as named parameters in function definitions, etc.)."

These are valid concerns. For now I'm leaning towards keeping things simple by passing **kwargs around.

Does it make sense to only make these args explicit when passing them to the actual init of StatefulDataLoader, and not pass in all **kwargs from the input of ParallelAwareDataloader? The point is to avoid accidentally hitting an error inside StatefulDataLoader.
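
A rough sketch of that suggestion, assuming ParallelAwareDataloader wraps torchdata's StatefulDataLoader; parameter names beyond the four quoted above are illustrative:

    from torchdata.stateful_dataloader import StatefulDataLoader

    class ParallelAwareDataloader(StatefulDataLoader):
        def __init__(self, dataset, dp_rank, dp_world_size, batch_size,
                     collate_fn=None, **kwargs):
            self.dp_rank = dp_rank
            self.dp_world_size = dp_world_size
            # Only the known-supported options are named explicitly when
            # calling StatefulDataLoader's init; anything else left in
            # **kwargs is simply not forwarded, so it cannot trip an error
            # inside StatefulDataLoader.
            super().__init__(
                dataset,
                batch_size=batch_size,
                collate_fn=collate_fn,
                num_workers=kwargs.get("num_workers", 0),
                persistent_workers=kwargs.get("persistent_workers", False),
                prefetch_factor=kwargs.get("prefetch_factor", None),
                pin_memory=kwargs.get("pin_memory", False),
            )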

self,
dataset: IterableDataset,
dp_rank: int,
dp_world_size: int,
Contributor commented:

Could you help change this: let's keep at most one positional arg (dataset) and make the others keyword arguments.
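
One way to read this request, sketched with a bare-bones class; only the parameter names quoted above are taken from the PR, the rest is illustrative:

    from torch.utils.data import IterableDataset

    class ParallelAwareDataloader:
        # Illustrative shell only: `dataset` stays positional, and the bare
        # `*` forces every argument after it to be passed by keyword.
        def __init__(
            self,
            dataset: IterableDataset,
            *,
            dp_rank: int,
            dp_world_size: int,
            **kwargs,
        ) -> None:
            ...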

@divyanshk (Author):

Thanks @tianyu-l @wwwjn. Updated the PR with a kwargs-based approach. I initially didn't do this to avoid confusion on the user's part, since we provide batch_size and collate_fn (in mm_datasets) internally. I resolved that by making the explicit args defined internally take precedence, and added a warning for users in config.py, so that should help. Any error from wrong kwargs will be thrown in torchtitan itself and won't go down to StatefulDataLoader.
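
A minimal sketch of the precedence rule described here; the helper name and warning text are assumptions, not the actual torchtitan code:

    import warnings

    def merge_dataloader_kwargs(user_kwargs, *, batch_size, collate_fn=None):
        # Args that are set internally (e.g. batch_size from
        # training.local_batch_size, collate_fn from the dataset) take
        # precedence over anything the user put in the config-level kwargs.
        internal = {"batch_size": batch_size, "collate_fn": collate_fn}
        overridden = internal.keys() & user_kwargs.keys()
        if overridden:
            warnings.warn(
                f"Dataloader kwargs {sorted(overridden)} are set internally "
                "and will override the values given in the config."
            )
        # In a dict union the right-hand side wins, so internal values prevail.
        return {**user_kwargs, **internal}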

@tianyu-l (Contributor) left a comment:

Looks good in general.

The CPU unit test in CI didn't run. Could you double check?

Also, please add a GPU integration test; see inline comments.

- batch_size: Determined by training.local_batch_size
- collate_fn: Set by the dataset-specific collator
Example (TOML config file):
Contributor commented:

Could you add a dedicated test for the dataloader with kwargs passed through?
https://github.com/pytorch/torchtitan/blob/main/tests/integration_tests/features.py

@divyanshk (Author) replied:

Added a GPU integration test. To be able to use the CLI to pass in the kwargs, I added a tyro rule. I'm not super familiar with tyro, so please take a look.

Also, shout out to the integration test setup. Love that we could do a quick mini-GPU run as part of feature testing.

OverrideDefinitions(
    [
        [
            '--training.dataloader.kwargs \'{"num_workers": 2, "pin_memory": true, "prefetch_factor": 2}\'',
Contributor commented:

Instead of letting the CLI accept a dict, can we just do

--training.dataloader.kwargs.num_workers 2 --training.dataloader.kwargs.pin_memory true, ...

@divyanshk (Author) replied:

@tianyu-l That won't be possible because of how tyro operates. If we want an arbitrary dict to act as a catch-all kwargs, the dotted notation won't work, because those fields are not pre-defined.

Contributor replied:

hmm got it. I feel this makes the CLI a bit too hard to use.

If there's a common set of kwargs that people often use, maybe we should restrict ourselves to those and start there?

Sorry if this sounds like going back to where we started. Do you think a middle ground makes sense, where we wrap the explicit args into a kwargs dict and pass that around internally, after reading them from job_config.training.data_loader?

Happy to discuss more if you think there are better alternatives.

@divyanshk (Author) replied:

This can be done. I updated the code to pass around kwargs internally but expose explicit args through the user config.
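
A minimal sketch of this middle ground: explicit fields in the user config, packed into a kwargs dict internally before reaching StatefulDataLoader. The dataclass and field names are assumptions based on the args discussed in this PR, not the exact torchtitan definitions:

    from dataclasses import asdict, dataclass
    from typing import Optional

    @dataclass
    class DataLoaderConfig:
        # Explicit, documented knobs exposed through the TOML config / CLI.
        num_workers: int = 0
        pin_memory: bool = False
        prefetch_factor: Optional[int] = None
        persistent_workers: bool = False

    def dataloader_kwargs(cfg: DataLoaderConfig) -> dict:
        # Internally the explicit fields are wrapped back into a plain dict so
        # the dataloader construction path can keep passing **kwargs around.
        return asdict(cfg)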

@tianyu-l linked an issue on Dec 14, 2025 that may be closed by this pull request.
@divyanshk force-pushed the divyanshk/dataloader_args branch from 604ac82 to 435543d on December 17, 2025 19:28.
@tianyu-l (Contributor):

Not sure why the CPU unit test is taking forever. It doesn't happen for other PRs; could you take a look?


Labels: CLA Signed (managed by the Meta Open Source bot)

Linked issue this pull request may close: Slow Dataloader should use num_worker > 1

3 participants