@devin-ai-integration
Contributor

Fix retry error handling to throw errors gracefully for VLM-CLOUD-PROD-MA

Summary

This PR enhances retry error handling across the VLM platform to provide more specific and actionable error information, particularly for the "VLM-CLOUD-PROD-MA" error scenario. The changes replace generic "unexpected-error" codes with specific error categorization and add detailed context to retry failures.

Key Changes:

  • vlm-lab: Added _categorize_error() function to identify Modal infrastructure, network, timeout, and VLM-CLOUD-PROD-MA specific errors
  • vlm-lab: Enhanced callback retry error handling with detailed error context and retry attempt tracking
  • vlmrun-python-sdk: Added retry attempt counts to error messages and specific VLM-CLOUD-PROD-MA error detection
  • Both repos: Maintained backward compatibility while providing more actionable debugging information
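The categorization described above could look roughly like the following. This is a minimal sketch, not the PR's actual `_categorize_error()`: only `modal-infrastructure-error` and `vlm-cloud-prod-ma-error` appear in this PR, while `timeout-error` and `network-error` are assumed names. The sketch also matches Modal exceptions by module name so it runs without Modal installed.

```python
import socket


def categorize_error(exc: BaseException) -> str:
    """Map an exception to a specific error code (illustrative sketch only)."""
    # String match mirrors the detection approach described in this PR.
    if "VLM-CLOUD-PROD-MA" in str(exc):
        return "vlm-cloud-prod-ma-error"
    # Assumption: treat any exception defined under the `modal` package as an
    # infrastructure error; the real code checks the Modal exception class.
    module = type(exc).__module__ or ""
    if module.split(".")[0] == "modal":
        return "modal-infrastructure-error"
    # "timeout-error" and "network-error" are hypothetical code names.
    if isinstance(exc, TimeoutError):
        return "timeout-error"
    if isinstance(exc, (ConnectionError, socket.gaierror)):
        return "network-error"
    # Fall back to the generic code the PR is trying to narrow down.
    return "unexpected-error"
```

Checking the string match first means a VLM-CLOUD-PROD-MA message wrapped in any exception type still gets its specific code.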

Review & Testing Checklist for Human

  • Verify VLM-CLOUD-PROD-MA error detection works with real scenarios - The string matching logic "VLM-CLOUD-PROD-MA" in str(exception) needs validation with actual production errors
  • Test error categorization with live Modal infrastructure failures - The _categorize_error() function categorizes modal.exception.ModalException but needs verification with real Modal errors
  • Confirm error logging/monitoring systems still work - Changes from "unexpected-error" to specific error codes could break downstream alerting or analytics
  • Test retry behavior under various failure conditions - Ensure enhanced error handling doesn't interfere with the actual retry mechanisms in production
  • Verify callback retry improvements work end-to-end - Test the _get_callback_error_details() function with real webhook callback failures

Recommended Test Plan:

  1. Trigger Modal infrastructure errors in a test environment and verify they get categorized correctly
  2. Test callback retry failures and confirm detailed error information is logged
  3. Monitor error dashboards to ensure new error codes are properly captured
  4. Simulate VLM-CLOUD-PROD-MA scenarios if possible to validate detection logic
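For step 2, a small harness along these lines may help exercise callback retry failures without a live webhook. It is a sketch under stated assumptions: `call_with_retry` and the shape of the returned dict are hypothetical and do not reproduce `call_callback_with_retry()` or `_get_callback_error_details()` from this PR.

```python
import time
from typing import Callable


def call_with_retry(send: Callable[[], None], max_attempts: int = 3,
                    delay_s: float = 0.0) -> dict:
    """Call `send` up to `max_attempts` times and record each failure."""
    errors = []
    for attempt in range(1, max_attempts + 1):
        try:
            send()
            return {"ok": True, "attempts": attempt, "errors": errors}
        except Exception as exc:
            # Keep per-attempt context so the final log entry is actionable.
            errors.append({
                "attempt": attempt,
                "type": type(exc).__name__,
                "message": str(exc),
            })
            if attempt < max_attempts:
                time.sleep(delay_s)
    return {"ok": False, "attempts": max_attempts, "errors": errors}
```

Passing a callable that raises a fixed number of times before succeeding makes it possible to assert on both the attempt count and the recorded error details.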

Diagram

%%{ init : { "theme" : "default" }}%%
graph TD
    A["vlm/infra/cloud/_modal.py"]:::major-edit --> B["handle_request_errors()"]:::major-edit
    A --> C["call_callback_with_retry()"]:::major-edit
    B --> D["_categorize_error()"]:::major-edit
    C --> E["_get_callback_error_details()"]:::major-edit
    
    F["vlmrun/client/base_requestor.py"]:::major-edit --> G["_handle_retry_error()"]:::major-edit
    
    H["Modal Infrastructure"]:::context --> D
    I["Callback URLs"]:::context --> E
    J["RetryError Exceptions"]:::context --> G
    
    D --> K["Specific Error Codes<br/>(modal-infrastructure-error,<br/>vlm-cloud-prod-ma-error, etc.)"]:::major-edit
    E --> L["Detailed Callback<br/>Error Context"]:::major-edit
    G --> M["Enhanced Retry<br/>Error Messages"]:::major-edit

    subgraph Legend
        L1[Major Edit]:::major-edit
        L2[Minor Edit]:::minor-edit  
        L3[Context/No Edit]:::context
    end

    classDef major-edit fill:#90EE90
    classDef minor-edit fill:#87CEEB
    classDef context fill:#FFFFFF

Notes

  • Environment Issues: vlm-lab tests failed because the vlm module was not set up in the test environment, so the vlm-lab changes have not been fully tested
  • String-based Detection: VLM-CLOUD-PROD-MA errors are detected via string matching since no concrete examples were found in the codebase
  • Backward Compatibility: The database schema is unchanged, but the error code values stored in it change, so anything that matches on "unexpected-error" (alerting, analytics) may need updating
  • Session Info: Requested by [email protected] - https://app.devin.ai/sessions/ada923c8d06549e4bf8241bbab8c3fb5

…-MA detection

- Add retry attempt count to error messages for better debugging
- Detect VLM-CLOUD-PROD-MA errors and provide specific error messages
- Use getattr() for safe access to attempt_number to handle test mocks
- Maintain backward compatibility while improving error context
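The `getattr()` approach in the commit message can be sketched as follows. This is an illustration, not the SDK's `_handle_retry_error()`: the function name, the message wording, and the assumption that the attempt count lives on `err.last_attempt.attempt_number` are all hypothetical.

```python
from types import SimpleNamespace


def describe_retry_error(err: Exception) -> str:
    """Build an error message that includes the retry count when available."""
    # getattr() with a default avoids AttributeError when a test double
    # (or a different exception type) lacks these attributes.
    last_attempt = getattr(err, "last_attempt", None)
    attempts = getattr(last_attempt, "attempt_number", None)
    suffix = f" after {attempts} attempts" if isinstance(attempts, int) else ""
    # Same string-based detection the PR describes.
    if "VLM-CLOUD-PROD-MA" in str(err):
        return f"VLM-CLOUD-PROD-MA error{suffix}: {err}"
    return f"Request failed{suffix}: {err}"


# Usage: attach a stand-in for the retry state to a plain exception.
err = Exception("boom")
err.last_attempt = SimpleNamespace(attempt_number=3)
print(describe_retry_error(err))
```

The `isinstance(attempts, int)` check keeps the message clean when a mock returns a non-integer placeholder for the attempt count.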

Co-Authored-By: [email protected] <[email protected]>
@devin-ai-integration
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.


@devin-ai-integration
Contributor Author

Closing due to inactivity for more than 7 days.
