@gluckzhang gluckzhang commented Mar 4, 2025

Problem

We currently run the watchtower to monitor validators on both mainnet and testnet. We have already configured the instance with a higher unhealthy threshold, to ignore bad gateway errors, with a longer RPC timeout, and with a less frequent check interval (e.g., --unhealthy-threshold 2 --ignore-http-bad-gateway --rpc-timeout 60 --interval 65), yet we still receive many "operation timed out" alerts. These errors have more to do with the availability of the RPC endpoints, so for now we would like to suppress them.

Summary of Changes

This PR adds a new optional CLI option, --ignore-rpc-timeout, that lets users suppress RPC timeout errors. It defaults to false, so merging this PR does not change the default behavior of watchtower. Users decide for themselves whether to ignore RPC timeouts.
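
For illustration only, the opt-in behavior described above could be sketched roughly as follows. This is not the watchtower's code (which is written in Rust); CheckResult, should_count_as_failure, and the is_timeout field are hypothetical names invented for this sketch, and only the flag name and its default of false come from the PR description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    # Hypothetical result of one monitoring pass; not a real watchtower type.
    ok: bool
    error: Optional[str] = None
    is_timeout: bool = False


def should_count_as_failure(result: CheckResult, ignore_rpc_timeout: bool = False) -> bool:
    """Return True if this check result should count toward an alert."""
    if result.ok:
        return False
    if result.is_timeout and ignore_rpc_timeout:
        # Timeouts are suppressed only when the user has opted in.
        return False
    return True


timeout = CheckResult(ok=False, error="operation timed out", is_timeout=True)
delinquent = CheckResult(ok=False, error="validator delinquent")

print(should_count_as_failure(timeout))                              # default: timeout is counted
print(should_count_as_failure(timeout, ignore_rpc_timeout=True))     # opted in: timeout is suppressed
print(should_count_as_failure(delinquent, ignore_rpc_timeout=True))  # other errors are still counted
```

With the flag left at its default, every failure is counted, which matches the claim that the default behavior is unchanged.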

@gluckzhang
Author

Hello again. Since changes have been merged into master, I have rebased this PR's branch. Please feel free to reject it if you think --ignore-rpc-timeout is no longer needed now that the watchtower can take multiple RPC endpoints :)

@joncinque

Hey @mircea-c 👋 can you take a look at this PR? It might not be needed after #4748, but I'll let you discuss with the author

@joncinque joncinque requested a review from mircea-c April 28, 2025 14:24
@mircea-c

Hi @gluckzhang. I'm going to reject this change, as I don't think ignoring timeouts is a solution to the flaky endpoint problem.

As mentioned by Jon, #4748 is meant to add redundancy which reduces false positives while maintaining coverage and allowing for detection of bad RPC endpoints.

Additionally, the unhealthy-threshold flag is available and can be used to achieve a result similar to ignoring RPC timeouts.

I would be much more willing to consider changes to unhealthy-threshold that make it "smarter"
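
As a rough illustration of why a threshold on consecutive failures can absorb isolated timeouts, here is a minimal sketch. It assumes the threshold counts consecutive failed checks before a notification fires. UnhealthyTracker and record are hypothetical names, and the sketch does not model how the watchtower actually de-duplicates or sends notifications.

```python
class UnhealthyTracker:
    """Hypothetical counter of consecutive failed checks (not watchtower code)."""

    def __init__(self, unhealthy_threshold: int = 1):
        if unhealthy_threshold < 1:
            raise ValueError("unhealthy_threshold must be at least 1")
        self.threshold = unhealthy_threshold
        self.consecutive_failures = 0

    def record(self, ok: bool) -> bool:
        """Record one check; return True when the threshold has been reached."""
        if ok:
            # Any successful check resets the streak.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold


tracker = UnhealthyTracker(unhealthy_threshold=2)
# One isolated failure, a recovery, then two failures in a row.
results = [tracker.record(ok) for ok in [False, True, False, False]]
print(results)
```

In this sketch a single timeout followed by a successful check never reaches the threshold, while a sustained outage still does, so coverage is kept without ignoring timeouts outright.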

@mircea-c mircea-c closed this Jun 26, 2025
@gluckzhang
Author

Hi @mircea-c, thanks for the feedback. Fully understand it and agree with the rejection. Cheers :)
