@devin-ai-integration
Contributor

Add dataset generate command to CLI

This PR adds a new dataset subcommand to the vlmrun CLI with a generate operation that calls the /v1/dataset/generate endpoint. The command supports generating datasets from YouTube playlists and videos with all the parameters supported by the API.

Changes

  • Add new dataset subcommand with generate operation
  • Support all arguments from DatasetGenerationRequest model:
    • domain (required)
    • urls (required, list of URLs)
    • url_type (required, 'yt_playlist' or 'yt_video')
    • dataset_name (required)
    • dataset_format (defaults to 'json')
    • max_frames_per_video (required)
    • max_samples (required, between 10 and 100,000)
  • Add input validation for url_type and max_samples
  • Implement friendly output formatting for dataset generation results

Usage Example

vlmrun dataset generate \
  --domain "example-domain" \
  --urls "https://youtube.com/playlist?list=..." \
  --url-type "yt_playlist" \
  --dataset-name "my-dataset" \
  --dataset-format "json" \
  --max-frames-per-video 30 \
  --max-samples 1000
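
The flags in the example above map one-to-one onto the request fields listed under Changes. A minimal Python sketch of the corresponding request body follows. The `DatasetGenerationRequest` dataclass and `build_payload` helper here are hypothetical stand-ins built only from the field names in this description, not the actual vlmrun model, and the way the CLI serializes the request is an assumption.

```python
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class DatasetGenerationRequest:
    """Hypothetical stand-in mirroring the fields listed in this PR."""

    domain: str
    urls: List[str]
    url_type: str  # 'yt_playlist' or 'yt_video'
    dataset_name: str
    max_frames_per_video: int
    max_samples: int  # between 10 and 100,000
    dataset_format: str = "json"  # the only field with a default


def build_payload(request: DatasetGenerationRequest) -> dict:
    """Convert the request into a plain dict suitable for a JSON body."""
    return asdict(request)


# Same values as the usage example; --dataset-format is left at its default.
payload = build_payload(
    DatasetGenerationRequest(
        domain="example-domain",
        urls=["https://youtube.com/playlist?list=..."],
        url_type="yt_playlist",
        dataset_name="my-dataset",
        max_frames_per_video=30,
        max_samples=1000,
    )
)
```

Because `dataset_format` defaults to `json`, passing `--dataset-format "json"` explicitly, as the usage example does, would be equivalent to omitting it under this sketch.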

Testing

The command implementation includes:

  • Input validation for url_type (must be 'yt_playlist' or 'yt_video')
  • Input validation for max_samples (must be between 10 and 100,000)
  • Friendly error messages and output formatting
  • Help text for all arguments
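
The two validation rules listed above could look roughly like the sketch below. The function names and error messages are hypothetical, and treating both bounds of `max_samples` as inclusive is an assumption; the PR text only says "between 10 and 100,000".

```python
VALID_URL_TYPES = ("yt_playlist", "yt_video")
MIN_SAMPLES, MAX_SAMPLES = 10, 100_000  # assumed inclusive on both ends


def validate_url_type(url_type: str) -> str:
    """Return url_type unchanged, or raise if it is not a supported value."""
    if url_type not in VALID_URL_TYPES:
        raise ValueError(
            f"url_type must be one of {VALID_URL_TYPES}, got {url_type!r}"
        )
    return url_type


def validate_max_samples(max_samples: int) -> int:
    """Return max_samples unchanged, or raise if it is out of range."""
    if not MIN_SAMPLES <= max_samples <= MAX_SAMPLES:
        raise ValueError(
            f"max_samples must be between {MIN_SAMPLES} and {MAX_SAMPLES}, "
            f"got {max_samples}"
        )
    return max_samples
```

Returning the validated value lets a CLI wire each function in as a per-option callback, so invalid input is rejected before any request is sent.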

Link to Devin run: https://app.devin.ai/sessions/24a085b8c2a64b80ba60775b4ee5c28a

Co-Authored-By: Sudeep Pillai <[email protected]>
@devin-ai-integration
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add "(aside)" to your comment to have me ignore it.
  • Look at CI failures and help fix them

This command allows you to generate datasets from YouTube playlists or videos.
The generated dataset will be saved according to the specified format and parameters.
"""
client = get_context_client(ctx)

Use client = ctx.obj?

import typer
from vlmrun.client import Client

def get_context_client(ctx: typer.Context) -> Client:

This file is unnecessary

@devin-ai-integration
Contributor Author

I've addressed the review comments:

  • Removed the unnecessary utils.py file
  • Updated dataset.py to read the client directly from ctx.obj instead of calling get_context_client
  • Cleaned up references in __init__.py

All CI checks are passing. Please let me know if you'd like any further changes!
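
The pattern requested in review can be pictured with the sketch below. It is not the merged code: `FakeClient` and `FakeContext` are hypothetical stand-ins for the vlmrun `Client` and `typer.Context`, and it assumes the top-level CLI callback stores the client on `ctx.obj`, which this conversation does not show.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class FakeClient:
    """Hypothetical stand-in for vlmrun.client.Client."""

    base_url: str


@dataclass
class FakeContext:
    """Hypothetical stand-in for typer.Context; only the .obj slot is modeled."""

    obj: Any = None


def main_callback(ctx: FakeContext, base_url: str) -> None:
    # Assumed: the root command creates the client once and stores it on ctx.obj.
    ctx.obj = FakeClient(base_url=base_url)


def generate(ctx: FakeContext) -> FakeClient:
    # Subcommands read the shared client directly, with no wrapper helper.
    client = ctx.obj
    return client
```

Since the helper would only return `ctx.obj`, reading the attribute directly removes an extra module without changing behavior.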

@spillai spillai merged commit 0b379fc into main Jan 12, 2025
1 check passed
@spillai spillai deleted the devin/1736707855-add-dataset-command branch January 23, 2025 08:38