Add /healthz endpoint that returns 500 when some watch stream is stale.#165
igorpeshansky merged 7 commits into master
| return false; | ||
| } | ||
| for (auto& c : component_callbacks_) { | ||
| if (!c.second()) { |
I'm not sure how this works (forgive my naivety in C++): wouldn't UnregisterCallback remove the key entirely, so that the entry for the component would never be iterated over in the first place?
This doesn't check for the existence of the value; it calls the value (which is a std::function) and checks the result.
@bmoyles0117 Unregistering should only happen once the thread is done.
@igorpeshansky That's right.
igorpeshansky left a comment
A few preliminary comments.
| std::shared_ptr<HttpServer::connection> conn); | ||
|
|
||
| const Configuration& config_; | ||
| HealthChecker* health_checker_; |
Will you be modifying it? Why can't it be a const reference (or a const pointer if you accept nullptr as a valid value)?
| conn->set_headers(std::map<std::string, std::string>({ | ||
| {"Content-Type", "text/plain"}, | ||
| })); | ||
| conn->write("unhealthy"); |
Probably worth dumping the list of unhealthy components (and components whose callbacks returned false)...
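A minimal sketch of what such a response body could look like, assuming the checker can hand back the unhealthy component names as a std::set<std::string>. The helper name UnhealthyBody and the one-name-per-line layout are hypothetical, chosen only for illustration:

```cpp
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper: renders the names of unhealthy components as a
// plain-text body, one component per line, under a fixed header line.
std::string UnhealthyBody(const std::set<std::string>& components) {
  std::ostringstream out;
  out << "unhealthy components:\n";
  for (const auto& component : components) {
    out << component << "\n";
  }
  return out.str();
}
```

Because std::set iterates in sorted order, the output is stable across requests, which makes the endpoint easier to diff and to test.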
| config_.HealthCheckFile()).parent_path()); | ||
| } | ||
|
|
||
| void HealthChecker::RegisterCallback(const std::string& component, |
[Optional] I'd call this AddComponent or RegisterComponent instead... You're adding a pair of <component_name, callback>.
| return false; | ||
| } | ||
| for (auto& c : component_callbacks_) { | ||
| if (!c.second()) { |
[Optional] std::function objects can have a nullptr value. Should this be if (c.second && !c.second())?
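To illustrate the concern: invoking an empty std::function throws std::bad_function_call, while its boolean conversion reports whether a target is set. The sketch below uses two hypothetical helpers (CallbackReportsHealthy, InvokingThrows) and treats an empty callback as "not unhealthy", which is the semantics of the suggested condition:

```cpp
#include <functional>

// Mirrors `c.second && !c.second()`: a component is unhealthy only when
// it has a callback and that callback returns false.
bool CallbackReportsHealthy(const std::function<bool()>& callback) {
  return !callback || callback();
}

// Shows what the unguarded call does when the std::function is empty.
bool InvokingThrows(const std::function<bool()>& callback) {
  try {
    callback();
    return false;
  } catch (const std::bad_function_call&) {
    return true;
  }
}
```

Whether an empty callback should count as healthy or be rejected at registration time is a design choice; the guard only prevents the exception.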
| if (!unhealthy_components_.empty()) { | ||
| return false; | ||
| } | ||
| for (auto& c : component_callbacks_) { |
Should you not lock mutex_ while iterating? It's possible to split this into read and write locks (https://en.cppreference.com/w/cpp/thread/shared_mutex).
Ah, are you concerned about a possible deadlock? What's the right way to do this?
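One possible shape for this, sketched with a hypothetical LockedRegistry class rather than the PR's HealthChecker: hold a single std::mutex while iterating and invoking. With a non-recursive mutex, the callbacks must not call back into the registry, or they would deadlock. The upside is that Unregister cannot return while a check is still running a callback, which matters when callbacks capture locals by reference. Splitting into read and write locks would need std::shared_mutex, which is C++17.

```cpp
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical registry; every access to callbacks_ happens under mutex_.
class LockedRegistry {
 public:
  void Register(const std::string& component, std::function<bool()> callback) {
    std::lock_guard<std::mutex> lock(mutex_);
    callbacks_[component] = std::move(callback);
  }

  void Unregister(const std::string& component) {
    std::lock_guard<std::mutex> lock(mutex_);
    callbacks_.erase(component);
  }

  // Callbacks run with mutex_ held, so they must not call Register or
  // Unregister on this object.
  bool AllHealthy() const {
    std::lock_guard<std::mutex> lock(mutex_);
    for (const auto& c : callbacks_) {
      if (c.second && !c.second()) {
        return false;
      }
    }
    return true;
  }

 private:
  mutable std::mutex mutex_;
  std::map<std::string, std::function<bool()>> callbacks_;
};
```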
| health_checker_->RegisterCallback(name, [&last_received_mutex, &last_received]{ | ||
| std::lock_guard<std::mutex> last_received_lock(last_received_mutex); | ||
| return last_received > (std::chrono::high_resolution_clock::now() - | ||
| std::chrono::minutes(5)); |
Should this be a configuration option?
| std::mutex last_received_mutex; | ||
| auto last_received = std::chrono::high_resolution_clock::now(); | ||
| if (health_checker_) { | ||
| health_checker_->RegisterCallback(name, [&last_received_mutex, &last_received]{ |
Why not capture last_received by const reference?
Looks like capturing by const reference is not available in C++11.
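A C++11-compatible workaround, sketched under assumptions: if the variable reaches the lambda through a reference to const, capturing that reference with & gives the lambda read-only access. MakeFreshnessCheck and max_age are hypothetical names, and the mutex from the PR is omitted here for brevity:

```cpp
#include <chrono>
#include <functional>

using Clock = std::chrono::steady_clock;

// Builds a callback that can read, but not modify, the caller's
// last_received. The caller must keep that object alive for as long as
// the callback may run.
std::function<bool()> MakeFreshnessCheck(const Clock::time_point& last_received,
                                         Clock::duration max_age) {
  // Inside the lambda, last_received refers to a const object, so an
  // accidental assignment to it would not compile.
  return [&last_received, max_age] {
    return Clock::now() - last_received < max_age;
  };
}
```

The same effect can be had inline by declaring `const auto& view = last_received;` and capturing `&view`.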
| void MetadataApiServer::HandleHealthz( | ||
| const HttpServer::request& request, | ||
| std::shared_ptr<HttpServer::connection> conn) { | ||
| if (health_checker_->IsHealthy()) { |
Should we allow health_checker_ to be nullptr, so we can test different handlers in isolation?
| } catch (const boost::system::system_error& e) { | ||
| LOG(ERROR) << "Failed to query " << endpoint << ": " << e.what(); | ||
| if (health_checker_) { | ||
| health_checker_->UnregisterCallback(name); |
Seems like a good candidate for RAII.
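A rough sketch of the RAII idea, using hypothetical stand-ins (Registry, ScopedRegistration) rather than the PR's classes: the constructor registers, the destructor unregisters, so every exit path from the enclosing scope, including exceptions, cleans up without an explicit call.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Minimal hypothetical registry standing in for the health checker.
class Registry {
 public:
  void Register(const std::string& name, std::function<bool()> callback) {
    callbacks_[name] = std::move(callback);
  }
  void Unregister(const std::string& name) { callbacks_.erase(name); }
  std::size_t size() const { return callbacks_.size(); }

 private:
  std::map<std::string, std::function<bool()>> callbacks_;
};

// Registers on construction and unregisters on destruction. A null
// registry is accepted and makes both steps a no-op.
class ScopedRegistration {
 public:
  ScopedRegistration(Registry* registry, std::string name,
                     std::function<bool()> callback)
      : registry_(registry), name_(std::move(name)) {
    if (registry_ != nullptr) {
      registry_->Register(name_, std::move(callback));
    }
  }
  ~ScopedRegistration() {
    if (registry_ != nullptr) {
      registry_->Unregister(name_);
    }
  }
  // Non-copyable: a copy would unregister the same name twice.
  ScopedRegistration(const ScopedRegistration&) = delete;
  ScopedRegistration& operator=(const ScopedRegistration&) = delete;

 private:
  Registry* registry_;
  std::string name_;
};
```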
davidbtucker left a comment
Thanks, PTAL.
| result.insert(c.first); | ||
| } | ||
| } | ||
| return std::move(result); |
This is a local set of strings; returning it by name already moves it (or elides the copy entirely), so you don't need std::move, which would only inhibit copy elision.
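For reference, a small sketch of the plain-return form. Returning a local variable by name lets the compiler elide the copy or, failing that, move the object implicitly; writing `return std::move(result);` rules out the elision. CollectUnhealthy and its parameters are hypothetical names for illustration:

```cpp
#include <set>
#include <string>

// Merges two sets of component names and returns the result by value.
std::set<std::string> CollectUnhealthy(const std::set<std::string>& flagged,
                                       const std::set<std::string>& failing) {
  std::set<std::string> result(flagged);
  result.insert(failing.begin(), failing.end());
  return result;  // No std::move: the local is moved or elided implicitly.
}
```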
|
|
||
| // Registers a component and then unregisters when it goes out of | ||
| // scope. | ||
| class ScopedHealthCheckRegistration { |
Maybe just CheckHealth?
| } | ||
| private: | ||
| HealthChecker* health_checker_; | ||
| const std::string& component_; |
Uh, don't do this. If you store a reference to a temporary, you'll cause a crash. Just store a copy of the string.
Oops, thanks. Fixed.
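A short sketch of the fixed shape, with a hypothetical class name (ComponentName). With a `const std::string&` member, a temporary passed to the constructor is destroyed at the end of the full expression and the member is left dangling; storing a copy keeps the object self-contained:

```cpp
#include <string>

// Owns its own copy of the name, so it stays valid even when constructed
// from a temporary.
class ComponentName {
 public:
  explicit ComponentName(const std::string& name) : name_(name) {}
  const std::string& name() const { return name_; }

 private:
  std::string name_;  // A copy, not a reference to the argument.
};
```

For example, `ComponentName c(std::string("kube") + "-pods");` is safe here, whereas the reference-member version would read freed memory on the next access.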
| } | ||
| std::mutex last_received_mutex; | ||
| auto last_received = std::chrono::high_resolution_clock::now(); | ||
| auto timeout = std::chrono::seconds(config_.HealthCheckWatchTimeoutSeconds()); |
Seems like you can have a single variable called expiration that is set to std::chrono::high_resolution_clock::now() + std::chrono::seconds(config_.HealthCheckWatchTimeoutSeconds()) here and in the watcher callback, i.e.:
  std::mutex expiration_mutex;
  auto expiration = std::chrono::high_resolution_clock::now() +
      time::seconds(config_.HealthCheckWatchTimeoutSeconds());
  ...
  [&expiration_mutex, &expiration]{
    std::lock_guard<std::mutex> expiration_lock(expiration_mutex);
    return std::chrono::high_resolution_clock::now() < expiration;
  }
  ...
  [=, &expiration_mutex, &expiration](json::value raw_watch) {
    {
      std::lock_guard<std::mutex> expiration_lock(expiration_mutex);
      expiration = std::chrono::high_resolution_clock::now() +
          time::seconds(config_.HealthCheckWatchTimeoutSeconds());
    }
    WatchEventCallback(callback, name, std::move(raw_watch));
  }
Might be a good idea to factor out config_.HealthCheckWatchTimeoutSeconds() into a local variable called timeout, something like:
  const int timeout = config_.HealthCheckWatchTimeoutSeconds();
  auto expiration = std::chrono::high_resolution_clock::now() + time::seconds(timeout);
  ...
  [&expiration_mutex, &expiration]{
    std::lock_guard<std::mutex> expiration_lock(expiration_mutex);
    return std::chrono::high_resolution_clock::now() < expiration;
  }
  ...
  [=, &expiration_mutex, &expiration](json::value raw_watch) {
    {
      std::lock_guard<std::mutex> expiration_lock(expiration_mutex);
      expiration = std::chrono::high_resolution_clock::now() + time::seconds(timeout);
    }
    WatchEventCallback(callback, name, std::move(raw_watch));
  }
| LOG(INFO) << "WatchMaster(" << name << "): Contacting " << endpoint; | ||
| } | ||
| std::mutex last_received_mutex; | ||
| auto last_received = std::chrono::high_resolution_clock::now(); |
Your timeout is measured in integral seconds. You probably don't need high_resolution_clock, especially because it may be an alias for system_clock, which is not guaranteed to be monotonic and can jump when the system time is adjusted. Why not use std::chrono::steady_clock instead?
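A sketch of how the expiration logic might look on a monotonic clock, bundled into a hypothetical WatchDeadline class (the PR itself uses local variables and lambdas). std::chrono::steady_clock never goes backwards, so adjusting the wall clock cannot make a stream look stale or fresh by accident:

```cpp
#include <chrono>
#include <mutex>

static_assert(std::chrono::steady_clock::is_steady,
              "steady_clock is monotonic by definition");

// Tracks the time by which new data must arrive for a watch to count as
// healthy. All names here are hypothetical.
class WatchDeadline {
 public:
  explicit WatchDeadline(std::chrono::seconds timeout)
      : timeout_(timeout),
        expiration_(std::chrono::steady_clock::now() + timeout) {}

  // Called whenever the watch receives data; pushes the deadline out.
  void Refresh() {
    std::lock_guard<std::mutex> lock(mutex_);
    expiration_ = std::chrono::steady_clock::now() + timeout_;
  }

  // True while the deadline has not yet passed.
  bool Healthy() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return std::chrono::steady_clock::now() < expiration_;
  }

 private:
  const std::chrono::seconds timeout_;
  mutable std::mutex mutex_;
  std::chrono::steady_clock::time_point expiration_;
};
```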
|
|
||
| std::set<std::string> HealthChecker::UnhealthyComponents() const { | ||
| std::lock_guard<std::mutex> lock(mutex_); | ||
| std::set<std::string> result(unhealthy_components_); |
I thought we were dropping these?..
Done. I forgot what we said -- still use the unhealthy_components_ for IsHealthy()? (The current unit test wants that.)
Whoops, they are still needed.
| constexpr const char kDefaultInstanceZone[] = ""; | ||
| constexpr const char kDefaultHealthCheckFile[] = | ||
| "/var/run/metadata-agent/health/unhealthy"; | ||
| constexpr const int kDefaultHealthCheckWatchTimeoutSeconds = 5*60; |
Nit: WatchTimeout is too generic, can we change it to KubernetesWatch throughout?
KubernetesWatchTimeout*
Renamed to HealthCheckMaxDataAgeSeconds.
igorpeshansky left a comment
This is no longer RFC — let's rename the PR.
Some remaining minor comments.
| [=, &expiration_mutex, &expiration](json::value raw_watch) { | ||
| { | ||
| std::lock_guard<std::mutex> expiration_lock(expiration_mutex); | ||
| expiration = std::chrono::steady_clock::now() + |
[Optional] Might read better as:
  expiration =
      std::chrono::steady_clock::now() + time::seconds(timeout);
| } | ||
| const int timeout = config_.HealthCheckWatchTimeoutSeconds(); | ||
| std::mutex expiration_mutex; | ||
| auto expiration = std::chrono::steady_clock::now() + time::seconds(timeout); |
I would add a comment here explaining what this is, something like "The time by which the watcher has to receive some data to be considered healthy"...
| CheckHealth check_health( | ||
| health_checker_, name, [&expiration_mutex, &expiration]{ | ||
| std::lock_guard<std::mutex> expiration_lock(expiration_mutex); | ||
| return std::chrono::high_resolution_clock::now() < expiration; |
std::chrono::steady_clock, right?
| } | ||
| if (unhealthy_components.empty()) { | ||
| if (config_.VerboseLogging()) { | ||
| LOG(INFO) << "Healthz returning 200"; |
Let's s/Healthz/\/healthz/ here and in the next log statement.
| {"Content-Type", "text/plain"}, | ||
| })); | ||
| conn->write("unhealthy components:\n"); | ||
| for (const auto& s : unhealthy_components) { |
[Optional] This is a component, so wouldn't c (or even component) work better?