@thiyyakat thiyyakat commented Jul 14, 2025

What this PR does / why we need it:

This PR fixes a bug in the timeout check for MachineCreationTimeout when the machine is Pending. Previously, the check used machine.Status.CurrentStatus.LastUpdateTime.Time. Because that value changes whenever the machine's status is updated, the timer was effectively reset, which allowed a machine to stay in Pending for longer than intended. The check now uses machine.CreationTimestamp.Time to determine whether the timeout has elapsed. The PR also changes how the other timeouts are checked in reconcileMachineHealth to improve readability. Unit tests have been updated to reflect the new logic.

Which issue(s) this PR fixes:
Fixes #1009

Special notes for your reviewer:
Integration tests (IT) for mcm-provider-aws and mcm-provider-gcp passed.

The changes were tested locally by setting machineCreationTimeout to 5m and modifying mcm-provider-aws so that GetMachineStatus() and CreateMachine() return errors until machineCreationTimeout has passed. The relevant logs, including extra log lines added for testing, are shown below:

I0714 15:22:17.918733   47329 machine.go:172] reconcileClusterMachine: Start for "shoot--i749592--aws-int2-worker-ic0ah-z1-5785c-wlpnb" with phase:"", description:""
I0714 15:22:19.494489   47329 machine.go:541] Creating a VM for machine "shoot--i749592--aws-int2-worker-ic0ah-z1-5785c-wlpnb", please wait!
I0714 15:22:19.494513   47329 machine.go:542] The machine creation is triggered with timeout of 5m0s
E0714 15:22:19.494576   47329 machine.go:546] Error while creating machine shoot--i749592--aws-int2-worker-ic0ah-z1-5785c-wlpnb: machine codes error: code = [Internal] message = [TEST_LOG: Returning Internal Error for TESTING, creationtimestamp:2025-07-14 15:22:12 +0530 IST]

I0714 15:25:19.695174   47329 machine.go:172] reconcileClusterMachine: Start for "shoot--i749592--aws-int2-worker-ic0ah-z1-5785c-wlpnb" with phase:"CrashLoopBackOff", description:"Cloud provider message - machine codes error: code = [Internal] message = [TEST_LOG: Returning Internal Error for TESTING, creationtimestamp:2025-07-14 15:22:12 +0530 IST]"
I0714 15:25:21.036181   47329 machine.go:541] Creating a VM for machine "shoot--i749592--aws-int2-worker-ic0ah-z1-5785c-wlpnb", please wait!
I0714 15:25:21.036205   47329 machine.go:542] The machine creation is triggered with timeout of 5m0s
E0714 15:25:21.036260   47329 machine.go:546] Error while creating machine shoot--i749592--aws-int2-worker-ic0ah-z1-5785c-wlpnb: machine codes error: code = [Internal] message = [TEST_LOG: Returning Internal Error for TESTING, creationtimestamp:2025-07-14 15:22:12 +0530 IST]

I0714 15:28:21.036561   47329 machine.go:172] reconcileClusterMachine: Start for "shoot--i749592--aws-int2-worker-ic0ah-z1-5785c-wlpnb" with phase:"CrashLoopBackOff", description:"Cloud provider message - machine codes error: code = [Internal] message = [TEST_LOG: Returning Internal Error for TESTING, creationtimestamp:2025-07-14 15:22:12 +0530 IST]"
I0714 15:28:22.592074   47329 machine.go:541] Creating a VM for machine "shoot--i749592--aws-int2-worker-ic0ah-z1-5785c-wlpnb", please wait!
I0714 15:28:22.592097   47329 machine.go:542] The machine creation is triggered with timeout of 5m0s
I0714 15:28:24.592799   47329 machine.go:552] Created new VM for machine: "shoot--i749592--aws-int2-worker-ic0ah-z1-5785c-wlpnb" with ProviderID: "aws:///eu-west-1/i-087cd2d92f0d61b34" and backing node: "ip-10-180-2-244.eu-west-1.compute.internal"
I0714 15:28:24.795419   47329 machine.go:728] Initializing VM instance for Machine "shoot--i749592--aws-int2-worker-ic0ah-z1-5785c-wlpnb"
I0714 15:28:24.795444   47329 machine.go:103] Adding machine object to queue "shoot--i749592--aws-int2/shoot--i749592--aws-int2-worker-ic0ah-z1-5785c-wlpnb", reason: handling machine object UPDATE event
I0714 15:28:25.753322   47329 machine.go:762] VM instance "aws:///eu-west-1/i-087cd2d92f0d61b34" for machine "shoot--i749592--aws-int2-worker-ic0ah-z1-5785c-wlpnb" was initialized
I0714 15:28:25.753372   47329 machine.go:110] Adding machine object to queue "shoot--i749592--aws-int2/shoot--i749592--aws-int2-worker-ic0ah-z1-5785c-wlpnb" after 5s, reason: machine creation in process. Machine initialization (if required) is successful

I0714 15:28:25.753435   47329 machine.go:172] reconcileClusterMachine: Start for "shoot--i749592--aws-int2-worker-ic0ah-z1-5785c-wlpnb" with phase:"CrashLoopBackOff", description:"Cloud provider message - machine codes error: code = [Internal] message = [TEST_LOG: Returning Internal Error for TESTING, creationtimestamp:2025-07-14 15:22:12 +0530 IST]"
I0714 15:28:26.455634   47329 machine.go:668] TEST_LOG:changing state to pending
I0714 15:28:26.655472   47329 machine.go:682] Machine/status UPDATE for "shoot--i749592--aws-int2-worker-ic0ah-z1-5785c-wlpnb" during creation

I0714 15:28:26.655502   47329 machine.go:110] Adding machine object to queue "shoot--i749592--aws-int2/shoot--i749592--aws-int2-worker-ic0ah-z1-5785c-wlpnb" after 5s, reason: machine creation in process. Machine/Status UPDATE successful
I0714 15:28:30.754606   47329 machine.go:172] reconcileClusterMachine: Start for "shoot--i749592--aws-int2-worker-ic0ah-z1-5785c-wlpnb" with phase:"Pending", description:"Creating machine on cloud provider"
I0714 15:28:30.754914   47329 machine_util.go:1005] TEST-LOG:time elapsed since creation:6m18.754882s, time out duration: 5m0s
E0714 15:28:30.754977   47329 machine_util.go:1060] Machine shoot--i749592--aws-int2-worker-ic0ah-z1-5785c-wlpnb failed to join the cluster in 5m0s minutes.
I0714 15:28:31.341403   47329 machine.go:121] Adding machine object to termination queue "shoot--i749592--aws-int2/shoot--i749592--aws-int2-worker-ic0ah-z1-5785c-wlpnb", reason: handling terminating machine object UPDATE event
I0714 15:28:31.341487   47329 machine.go:269] reconcileClusterMachineTermination: Start for "shoot--i749592--aws-int2-worker-ic0ah-z1-5785c-wlpnb" with phase:"Failed", description:"Machine shoot--i749592--aws-int2-worker-ic0ah-z1-5785c-wlpnb failed to join the cluster in 5m0s minutes."

Release note:

Fixed the `machineCreationTimeout` check for machines in the `Pending` phase.

@thiyyakat thiyyakat requested a review from a team as a code owner July 14, 2025 11:15
@thiyyakat thiyyakat marked this pull request as draft July 14, 2025 11:15
@thiyyakat thiyyakat marked this pull request as ready for review July 15, 2025 06:19

@aaronfern aaronfern left a comment

Thanks for this PR @thiyyakat!
Some comments from me

@thiyyakat thiyyakat requested a review from aaronfern July 16, 2025 06:53

@aaronfern aaronfern left a comment

Some suggestions for the test descriptions. You don't have to use them as is; treat them as a reference.

thiyyakat and others added 2 commits July 25, 2025 10:17
Co-authored-by: Aaron Francis Fernandes <[email protected]>
Co-authored-by: Aaron Francis Fernandes <[email protected]>
Co-authored-by: Aaron Francis Fernandes <[email protected]>

@aaronfern aaronfern left a comment

Thanks for the PR!
/lgtm

@thiyyakat thiyyakat merged commit cd63506 into gardener:master Jul 25, 2025
8 checks passed
@thiyyakat thiyyakat deleted the bug/timeout branch August 4, 2025 09:19
Labels

needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) reviewed/lgtm Has approval for merging reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) size/s Size of pull request is small (see gardener-robot robot/bots/size.py) status/closed Issue is closed (either delivered or triaged)

Development

Successfully merging this pull request may close these issues.

machineCreationTimeout not always honoured for new machines

6 participants