fix: Add IbPortDown alert for machines with down IB ports #519
hasayesh wants to merge 1 commit into NVIDIA:main
TruffleHog Secret Scan: ✅ No secrets or credentials found. Last updated: 2026-03-11 04:27:25 UTC | Commit: eb88f65
Vulnerability Scan: 🚨 Found 74 vulnerabilities. Full details are in the Security tab. Last updated: 2026-03-11 04:27:38 UTC | Commit: eb88f65
It needs to take the SKU into account. There are hosts that intentionally have multiple ports disconnected. If all of them had PreventAllocations set, none of them would be usable anymore.
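One way to picture the SKU concern raised above is a per-SKU expectation of how many ports should be Active, so that intentionally disconnected ports do not block allocation. The sketch below is only an illustration under stated assumptions: `SkuPolicy`, its `expected_active` map, and `should_prevent_allocations` are hypothetical names and are not taken from this PR or from Carbide.

```rust
use std::collections::HashMap;

/// Hypothetical per-SKU policy: how many IB ports are expected to be Active.
/// None of these names come from the PR; they are illustrative only.
pub struct SkuPolicy {
    /// SKU identifier -> number of ports that must be Active.
    pub expected_active: HashMap<String, usize>,
}

impl SkuPolicy {
    /// Returns true when the machine has fewer Active ports than its SKU expects.
    /// Assumption: an unknown SKU falls back to requiring every port to be Active.
    pub fn should_prevent_allocations(
        &self,
        sku: &str,
        active_ports: usize,
        total_ports: usize,
    ) -> bool {
        let expected = self
            .expected_active
            .get(sku)
            .copied()
            .unwrap_or(total_ports);
        active_ports < expected
    }
}
```

With a policy like this, a SKU that ships with only four of eight ports cabled would be blocked only when fewer than four ports are Active.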
This is what I suggest: Please let me know if this is reasonable.
OK, my bad, the JSON shows the properly populated inactive. Will update as such. Thanks.
## Description
When the IB Fabric Monitor detects ports not in Active state, it now
sets a PreventAllocations health alert on the affected machine. This
prevents Carbide from attempting to allocate instances on machines with
degraded IB connectivity, avoiding SRE alerts.
- Add HealthProbeId::ib_port_down() and HealthProbeAlert::ib_port_down()
- Detect ports not in Active state during IB fabric monitoring
- Set/clear IbPortDown health alert via health report overrides
- Update existing test to expect health alert blocking
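The detect-then-set/clear flow listed above can be sketched roughly as follows. This is a minimal, self-contained illustration rather than the PR's implementation: `PortState`, `IbPort`, `Machine`, the `alerts` set standing in for health report overrides, and `reconcile_ib_port_alert` are all hypothetical. Only the `IbPortDown` alert name comes from the description.

```rust
use std::collections::BTreeSet;

/// Hypothetical port states; only `Active` is treated as healthy.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PortState {
    Down,
    Init,
    Armed,
    Active,
}

/// Hypothetical view of one IB port as seen by the fabric monitor.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct IbPort {
    pub guid: String,
    pub state: PortState,
}

/// Hypothetical machine record. `alerts` stands in for the set of
/// health alerts applied through health report overrides.
#[derive(Debug, Default)]
pub struct Machine {
    pub ports: Vec<IbPort>,
    pub alerts: BTreeSet<&'static str>,
}

/// Alert name taken from the PR description.
pub const IB_PORT_DOWN: &str = "IbPortDown";

/// Collects every port that is not in the Active state.
pub fn inactive_ports(ports: &[IbPort]) -> Vec<&IbPort> {
    ports
        .iter()
        .filter(|p| p.state != PortState::Active)
        .collect()
}

/// Sets the alert when any port is not Active and clears it otherwise.
/// Returns true when the alert is set after reconciliation.
pub fn reconcile_ib_port_alert(machine: &mut Machine) -> bool {
    let any_down = !inactive_ports(&machine.ports).is_empty();
    if any_down {
        machine.alerts.insert(IB_PORT_DOWN);
    } else {
        machine.alerts.remove(IB_PORT_DOWN);
    }
    any_down
}
```

Running the reconcile step on every monitoring pass keeps the alert idempotent: it is set while any port is not Active and removed once all ports recover.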
## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [x] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)
## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->
## Breaking Changes
- [ ] This PR contains breaking changes
<!-- If checked above, describe the breaking changes and migration steps
-->
## Testing
<!-- How was this tested? Check all that apply -->
- [x] Unit tests added/updated
- [x] Integration tests added/updated
- [x] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)
## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->
Signed-off-by: Hamid Asayesh <hasayesh@nvidia.com>