KAFKA-13392 Resolve Timeout Exception triggering reassign partitions with --bootstrap-server option #21388
+118
−37
This PR resolves the TimeoutException thrown when partition reassignment is run with the --bootstrap-server option. See https://issues.apache.org/jira/browse/KAFKA-13392 for details.
Root cause
When we run a reassignment using a plan file (e.g. xxx.json), the plan may still include replicas on a broker that is down. During execution, the tool applies throttling by calling adminClient.incrementalAlterConfigs(configs). This API needs to connect to the target broker to set the broker-level throttle configs. If the broker is down, it is unreachable, so the client keeps retrying and eventually fails with a TimeoutException.
My proposed solution
Add a new parameter:
--broker-list-without-throttle
Description: Optional. Comma-separated list of broker IDs (e.g. 1,2) to exclude from broker-level throttle config updates during partition reassignment execution. When --execute and --throttle are used together, the tool normally applies throttle configs to all brokers involved in the reassignment. If any of those brokers are known to be down or unreachable, listing them in --broker-list-without-throttle makes the tool skip the throttle-setting step for them. This avoids the retries and timeouts while still throttling the remaining reachable brokers.
Value: a list of broker IDs, comma-separated
Example: 1001 or 1001,1002
If broker 1001 is known to be down, and the reassignment plan includes it, then we exclude 1001 from throttle config changes.
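The exclusion logic described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class name ThrottleExclusionSketch and the helpers parseBrokerList and brokersToThrottle are hypothetical, and the sketch only models parsing the option value and filtering the broker set, without touching the admin client.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Hypothetical sketch; names are illustrative and not taken from the PR.
class ThrottleExclusionSketch {

    // Parse a comma-separated broker ID list such as "1001,1002" into a sorted set.
    // A null or blank value is treated as "no brokers excluded".
    static Set<Integer> parseBrokerList(String value) {
        if (value == null || value.trim().isEmpty()) {
            return new TreeSet<>();
        }
        return Arrays.stream(value.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .map(Integer::parseInt)
                .collect(Collectors.toCollection(TreeSet::new));
    }

    // Return the brokers that should still receive broker-level throttle configs:
    // every broker involved in the reassignment minus the excluded ones.
    static Set<Integer> brokersToThrottle(Set<Integer> reassigningBrokers, Set<Integer> excluded) {
        Set<Integer> result = new TreeSet<>(reassigningBrokers);
        result.removeAll(excluded);
        return result;
    }

    public static void main(String[] args) {
        // Brokers referenced by the reassignment plan; 1001 is assumed to be down.
        Set<Integer> involved = new TreeSet<>(Arrays.asList(1001, 1002, 1003));
        Set<Integer> excluded = parseBrokerList("1001");
        System.out.println(brokersToThrottle(involved, excluded)); // prints [1002, 1003]
    }
}
```

Under this sketch, only the brokers in the returned set would be passed on to the throttle config update, so no request is sent to broker 1001.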
Why this is needed