
fix: Add IbPortDown alert for machines with down IB ports #519

Open
hasayesh wants to merge 1 commit into NVIDIA:main from hasayesh:nvbug-5866723

@hasayesh
Contributor

When the IB Fabric Monitor detects ports that are not in the Active state, it now sets a PreventAllocations health alert on the affected machine. This prevents Carbide from attempting to allocate instances on machines with degraded InfiniBand (IB) connectivity, avoiding SRE alerts.

  • Add HealthProbeId::ib_port_down() and HealthProbeAlert::ib_port_down()

  • Detect ports not in Active state during IB fabric monitoring

  • Set/clear IbPortDown health alert via health report overrides

  • Update existing test to expect health alert blocking
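
A rough sketch of the detect-and-alert step described above may help reviewers follow the intent. The types here (`PortState`, `IbPort`, `HealthAlert`) and the function `ib_port_down_alert` are simplified, hypothetical stand-ins and not the actual Carbide definitions; only the `IbPortDown` name and the PreventAllocations behavior come from this PR.

```rust
// Hypothetical stand-in types; the real Carbide health-report types are not
// shown on this page and will differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PortState {
    Active,
    Init,
    Down,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct IbPort {
    pub guid: String,
    pub state: PortState,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct HealthAlert {
    pub probe_id: &'static str,
    pub message: String,
    // Assumption: models the PreventAllocations classification as a flag.
    pub prevent_allocations: bool,
}

/// Returns an alert when at least one port is not Active.
/// Returns None when every port is Active, meaning any existing alert
/// should be cleared.
pub fn ib_port_down_alert(ports: &[IbPort]) -> Option<HealthAlert> {
    // Collect the GUIDs of every port that is not in the Active state.
    let down: Vec<&str> = ports
        .iter()
        .filter(|p| p.state != PortState::Active)
        .map(|p| p.guid.as_str())
        .collect();

    if down.is_empty() {
        return None;
    }

    Some(HealthAlert {
        probe_id: "IbPortDown",
        message: format!("IB ports not in Active state: {}", down.join(", ")),
        prevent_allocations: true,
    })
}
```

In this sketch the monitor would set the override when the function returns `Some(..)` and clear it when it returns `None`, which mirrors the set/clear bullet above.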

@hasayesh hasayesh requested a review from a team as a code owner March 11, 2026 04:25
@github-actions

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🕐 Last updated: 2026-03-11 04:27:25 UTC | Commit: eb88f65

@github-actions

🛡️ Vulnerability Scan

🚨 Found 74 vulnerability(ies)
📊 vs main: 74 (no change)

Severity Breakdown:

  • 🔴 Critical/High: 74
  • 🟡 Medium: 0
  • 🔵 Low/Info: 0

🕐 Last updated: 2026-03-11 04:27:38 UTC | Commit: eb88f65

@hasayesh hasayesh requested a review from Matthias247 March 11, 2026 16:51
@Matthias247
Contributor

It needs to take the SKU into account. There are hosts that intentionally have multiple ports disconnected. If all of them had PreventAllocations set, none of them would be usable anymore.
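
To make the concern concrete, here is a minimal sketch of a SKU-aware check. `SkuIbExpectation` and `has_unexpected_down_ports` are hypothetical names, and the assumption that a SKU carries an expected count of Active IB ports is mine, not something this PR confirms.

```rust
/// Hypothetical SKU description: how many IB ports are expected to be Active.
#[derive(Debug, Clone, Copy)]
pub struct SkuIbExpectation {
    pub expected_active_ports: usize,
}

/// `port_is_active` holds one entry per physical IB port on the host.
/// Only reports a problem when fewer ports are Active than the SKU expects,
/// so intentionally disconnected ports do not block the host.
pub fn has_unexpected_down_ports(port_is_active: &[bool], sku: SkuIbExpectation) -> bool {
    let active = port_is_active.iter().filter(|&&up| up).count();
    active < sku.expected_active_ports
}
```

With this shape, a host with four physical ports and a SKU expecting two Active ports stays allocatable when two ports are down, and is flagged only when a third goes down.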

@hasayesh hasayesh requested a review from wminckler March 11, 2026 22:22
@hasayesh
Contributor Author

This is what I suggest:

1. Update the L40 machines to have 2 IBs in their SKU.
2. Then we go from most to least specific:
   - If the machine has been assigned an instance type, use that.
   - If not, use the SKU.
   - If the SKU does not exist (during early deployment), do no check.

Please let me know if this is reasonable.
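
The precedence suggested above could be sketched roughly as follows. All names (`expected_ib_ports`, `IbCheck`, `check_ib_ports`) are hypothetical, and modeling the instance-type and SKU expectations as plain optional counts is an assumption made for illustration.

```rust
/// Picks the expected number of Active IB ports, most specific source first:
/// the instance type if one is assigned, otherwise the SKU, otherwise nothing.
pub fn expected_ib_ports(instance_type: Option<usize>, sku: Option<usize>) -> Option<usize> {
    instance_type.or(sku)
}

#[derive(Debug, PartialEq, Eq)]
pub enum IbCheck {
    /// No instance type and no SKU: the check is skipped entirely.
    Skipped,
    /// At least as many ports are Active as expected.
    Healthy,
    /// Fewer ports are Active than expected.
    PortsDown { missing: usize },
}

pub fn check_ib_ports(
    active_ports: usize,
    instance_type: Option<usize>,
    sku: Option<usize>,
) -> IbCheck {
    match expected_ib_ports(instance_type, sku) {
        None => IbCheck::Skipped,
        Some(expected) if active_ports >= expected => IbCheck::Healthy,
        Some(expected) => IbCheck::PortsDown {
            missing: expected - active_ports,
        },
    }
}
```

Under these assumptions, a machine whose SKU lists 2 IBs and that has 2 Active ports yields `IbCheck::Healthy`, while a machine with neither an instance type nor a SKU yields `IbCheck::Skipped`.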

@hasayesh
Contributor Author

OK, my bad: the JSON shows the properly populated inactive. I will update accordingly. Thanks.

When the IB Fabric Monitor detects ports not in Active state, it now
sets a PreventAllocations health alert on the affected machine. This
prevents Carbide from attempting to allocate instances on machines with
degraded IB connectivity, avoiding SRE alerts.
- Add HealthProbeId::ib_port_down() and HealthProbeAlert::ib_port_down()
- Detect ports not in Active state during IB fabric monitoring
- Set/clear IbPortDown health alert via health report overrides
- Update existing test to expect health alert blocking

    ## Type of Change
    <!-- Check one that best describes this PR -->
    - [ ] **Add** - New feature or capability
    - [ ] **Change** - Changes in existing functionality
    - [x] **Fix** - Bug fixes
    - [ ] **Remove** - Removed features or deprecated functionality
    - [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

    ## Related Issues (Optional)
    <!-- If applicable, provide GitHub Issue. -->

    ## Breaking Changes
    - [ ] This PR contains breaking changes

    <!-- If checked above, describe the breaking changes and migration steps
    -->

    ## Testing
    <!-- How was this tested? Check all that apply -->
    - [x] Unit tests added/updated
    - [x] Integration tests added/updated
    - [x] Manual testing performed
    - [ ] No testing required (docs, internal refactor, etc.)

    ## Additional Notes
    <!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Hamid Asayesh <hasayesh@nvidia.com>