Skip to content

Fix MySQL CDC hang when SSH tunnel dies#4067

Draft
ilidemi wants to merge 4 commits intomainfrom
customer-mysql-cdc-keepalive
Draft

Fix MySQL CDC hang when SSH tunnel dies#4067
ilidemi wants to merge 4 commits intomainfrom
customer-mysql-cdc-keepalive

Conversation

@ilidemi
Copy link
Contributor

@ilidemi ilidemi commented Mar 18, 2026

When SSH connection dies during CDC, the keepalive machinery notices but BinlogSyncer is not aware. To make it aware, close SSH connection besides just the MySQL query connection. Add tests for hangs in mystream.GetEvent() and syncer.Close() (both observed in production). Verified that without the fix both tests fail.

@codecov
Copy link

codecov bot commented Mar 18, 2026

❌ 2 Tests Failed:

Tests completed Failed Passed Skipped
1977 2 1975 234
View the top 2 failed test(s) by shortest run time
github.com/PeerDB-io/peerdb/flow/e2e::TestGenericBQ
Stack Traces | 0s run time
=== RUN   TestGenericBQ
=== PAUSE TestGenericBQ
=== CONT  TestGenericBQ
--- FAIL: TestGenericBQ (0.00s)
github.com/PeerDB-io/peerdb/flow/e2e::TestGenericBQ/Test_Inheritance_Table_With_Dynamic_Setting
Stack Traces | 33.5s run time
=== RUN   TestGenericBQ/Test_Inheritance_Table_With_Dynamic_Setting
=== PAUSE TestGenericBQ/Test_Inheritance_Table_With_Dynamic_Setting
=== CONT  TestGenericBQ/Test_Inheritance_Table_With_Dynamic_Setting
2026/03/19 05:24:28 INFO fetched schema x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} table=e2e_test_mychclg_tttlev1s.test_simple_schema_changes
    generic_test.go:1085: UNEXPECTED STATUS TIMEOUT STATUS_SNAPSHOT
    bigquery.go:86: begin tearing down postgres schema bq_ch5nrtia_20260319052428
--- FAIL: TestGenericBQ/Test_Inheritance_Table_With_Dynamic_Setting (33.46s)

To view more test analytics, go to the Test Analytics Dashboard
📋 Got 3 mins? Take this short survey to help us improve Test Analytics.

@github-actions
Copy link
Contributor

🔄 Flaky Test Detected

Analysis: The test failed due to "context deadline exceeded" when connecting to the Temporal server — a transient infrastructure/network timeout, not a code logic failure.
Confidence: 0.95

✅ Automatically retrying the workflow

View workflow run

@github-actions
Copy link
Contributor

🔄 Flaky Test Detected

Analysis: Multiple Snowflake e2e tests failed with "UNEXPECTED STATUS TIMEOUT STATUS_SETUP", indicating transient connectivity/availability issues with the external Snowflake service during flow initialization, unrelated to the partitioned table column-ordering code change.
Confidence: 0.92

✅ Automatically retrying the workflow

View workflow run

@github-actions
Copy link
Contributor

🔄 Flaky Test Detected

Analysis: The e2e test TestGenericBQ/Test_Inheritance_Table_With_Dynamic_Setting timed out during the STATUS_SNAPSHOT phase, which is a classic flaky failure caused by timing-sensitive infrastructure (Temporal/snapshot workflow) rather than a code regression.
Confidence: 0.92

✅ Automatically retrying the workflow

View workflow run

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant