Add /healthz endpoint that returns 500 when some watch stream is stale. by davidbtucker · Pull Request #165 · Stackdriver/metadata-agent

davidbtucker · 2018-07-19T22:19:43Z

No description provided.

bmoyles0117 · 2018-07-19T22:29:11Z

+    return false;
+  }
+  for (auto& c : component_callbacks_) {
+    if (!c.second()) {


I'm not sure how this works, forgive my naiveness in C++, wouldn't UnregisterCallback remove the key entirely, meaning that the c for the component would not get iterated over in the first place?

This doesn't check for the existence of the value, this calls the value (which is a std::function) and checks the result.

@bmoyles0117 Unregistering should only happen once the thread is done.
@igorpeshansky That's right.
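
To make the point in this thread concrete, here is a minimal sketch of a name-to-callback registry. `CallbackRegistry`, `Register`, `Unregister` and `AllHealthy` are hypothetical names chosen for illustration; they are not the PR's actual `HealthChecker` interface. The loop invokes each stored `std::function` rather than testing whether the key exists, and an unregistered component simply drops out of the iteration.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for the PR's HealthChecker (illustration only).
class CallbackRegistry {
 public:
  void Register(const std::string& component, std::function<bool()> callback) {
    callbacks_[component] = std::move(callback);
  }

  // Removes the key entirely, so the loop below never sees it again.
  void Unregister(const std::string& component) {
    callbacks_.erase(component);
  }

  // Healthy only if every registered callback returns true.
  bool AllHealthy() const {
    for (const auto& c : callbacks_) {
      // c.first is the component name; c.second is the std::function.
      // c.second() invokes the callback; it is not an existence check.
      if (!c.second()) {
        return false;
      }
    }
    return true;
  }

 private:
  std::map<std::string, std::function<bool()>> callbacks_;
};
```

With this shape, a component is reported unhealthy only while it is still registered and its callback returns false.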

igorpeshansky

A few preliminary comments.

igorpeshansky · 2018-07-19T22:24:32Z

+                     std::shared_ptr<HttpServer::connection> conn);

  const Configuration& config_;
+  HealthChecker* health_checker_;


Will you be modifying it? Why can't it be a const reference (or a const pointer if you accept nullptr as a valid value)?

igorpeshansky · 2018-07-19T22:25:10Z

+    conn->set_headers(std::map<std::string, std::string>({
+      {"Content-Type", "text/plain"},
+    }));
+    conn->write("unhealthy");


Probably worth dumping the list of unhealthy components (and components whose callbacks returned false)...

igorpeshansky · 2018-07-19T22:27:18Z

      config_.HealthCheckFile()).parent_path());
 }

+void HealthChecker::RegisterCallback(const std::string& component,


[Optional] I'd call this AddComponent or RegisterComponent instead... You're adding a pair of <component_name, callback>.

igorpeshansky · 2018-07-19T22:28:47Z

+    return false;
+  }
+  for (auto& c : component_callbacks_) {
+    if (!c.second()) {


[Optional] std::function objects can have a nullptr value. Should this be if (c.second && !c.second())?
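
A small sketch of the guard suggested here. `CallbackReportsHealthy` is a hypothetical helper, and treating an empty callback as healthy is an assumption of this sketch that follows from the suggested `c.second && !c.second()` condition.

```cpp
#include <functional>

// Returns true when the callback is empty or reports healthy.
// Assumption: an empty callback counts as healthy.
bool CallbackReportsHealthy(const std::function<bool()>& callback) {
  // The bool conversion is false for a default-constructed or
  // nullptr-initialized std::function. Invoking such an object would
  // throw std::bad_function_call, so the check comes first.
  return !callback || callback();
}
```

Without the guard, a single empty entry would turn the health check into an exception instead of a boolean result.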

igorpeshansky · 2018-07-19T22:30:39Z

+  if (!unhealthy_components_.empty()) {
+    return false;
+  }
+  for (auto& c : component_callbacks_) {


Should you not lock mutex_ while iterating? It's possible to split this into read and write locks (https://en.cppreference.com/w/cpp/thread/shared_mutex).

Ah, are you concerned about a possible deadlock? What's the right way to do this?
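
One possible answer, sketched under assumptions: copy the callback map while holding the mutex, then invoke the copies after releasing it. A callback that re-enters the registry then cannot deadlock on a non-recursive `std::mutex`. `LockedRegistry` and its methods are hypothetical names, not the PR's code. Note also that `std::shared_mutex` requires C++17, while this thread mentions C++11 elsewhere.

```cpp
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical registry illustrating "snapshot under lock, call outside".
class LockedRegistry {
 public:
  void Register(const std::string& component, std::function<bool()> callback) {
    std::lock_guard<std::mutex> lock(mutex_);
    callbacks_[component] = std::move(callback);
  }

  bool AllHealthy() const {
    std::map<std::string, std::function<bool()>> snapshot;
    {
      // Hold the lock only long enough to copy the map.
      std::lock_guard<std::mutex> lock(mutex_);
      snapshot = callbacks_;
    }
    // The lock is released here, so a callback may safely call Register().
    for (const auto& c : snapshot) {
      if (c.second && !c.second()) {
        return false;
      }
    }
    return true;
  }

 private:
  mutable std::mutex mutex_;
  std::map<std::string, std::function<bool()>> callbacks_;
};
```

The trade-off is that a callback can still run shortly after its component was unregistered, so anything it captures by reference has to outlive that window.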

igorpeshansky · 2018-07-19T22:32:30Z

+    health_checker_->RegisterCallback(name, [&last_received_mutex, &last_received]{
+      std::lock_guard<std::mutex> last_received_lock(last_received_mutex);
+      return last_received > (std::chrono::high_resolution_clock::now() -
+                              std::chrono::minutes(5));


Should this be a configuration option?

igorpeshansky · 2018-07-19T22:33:02Z

+  std::mutex last_received_mutex;
+  auto last_received = std::chrono::high_resolution_clock::now();
+  if (health_checker_) {
+    health_checker_->RegisterCallback(name, [&last_received_mutex, &last_received]{


Why not capture last_received by const reference?

Looks like capturing by const reference is not available in C++11.
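
A C++11 capture list indeed has no syntax for "by const reference", but a common workaround is to capture something that is already a reference to const. The sketch below does that through a function parameter. `MakeFreshnessCheck` and the `Clock` alias are hypothetical, and the mutex from the PR is left out to keep the example short, so it is only suitable for single-threaded use as written.

```cpp
#include <chrono>
#include <functional>

using Clock = std::chrono::steady_clock;

// The caller must keep last_received alive for as long as the returned
// callback is used; only a reference is stored.
std::function<bool()> MakeFreshnessCheck(const Clock::time_point& last_received,
                                         std::chrono::seconds max_age) {
  // last_received is a reference to const here, so capturing it by
  // reference gives the lambda read-only access. Assigning to it inside
  // the lambda would not compile.
  return [&last_received, max_age] {
    return last_received > Clock::now() - max_age;
  };
}
```

The callback still observes later updates made through the original, non-const variable.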

igorpeshansky · 2018-07-19T22:42:10Z

+void MetadataApiServer::HandleHealthz(
+    const HttpServer::request& request,
+    std::shared_ptr<HttpServer::connection> conn) {
+  if (health_checker_->IsHealthy()) {


Should we allow health_checker_ to be nullptr, so we can test different handlers in isolation?

igorpeshansky · 2018-07-19T22:50:34Z

  } catch (const boost::system::system_error& e) {
    LOG(ERROR) << "Failed to query " << endpoint << ": " << e.what();
+    if (health_checker_) {
+      health_checker_->UnregisterCallback(name);


Seems like a good candidate for RAII.
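
A minimal RAII sketch of that idea: a guard object registers in its constructor and unregisters in its destructor, so every exit path, including the `catch` branch above, cleans up. `Registry` and `ScopedRegistration` are hypothetical names for illustration. The guard keeps its own copy of the component name and tolerates a null registry pointer.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical minimal registry used only to demonstrate the guard.
class Registry {
 public:
  void Register(const std::string& component, std::function<bool()> callback) {
    callbacks_[component] = std::move(callback);
  }
  void Unregister(const std::string& component) { callbacks_.erase(component); }
  bool Contains(const std::string& component) const {
    return callbacks_.count(component) > 0;
  }

 private:
  std::map<std::string, std::function<bool()>> callbacks_;
};

// Registers on construction and unregisters on destruction.
class ScopedRegistration {
 public:
  ScopedRegistration(Registry* registry, const std::string& component,
                     std::function<bool()> callback)
      : registry_(registry), component_(component) {
    if (registry_ != nullptr) {
      registry_->Register(component_, std::move(callback));
    }
  }

  ~ScopedRegistration() {
    if (registry_ != nullptr) {
      registry_->Unregister(component_);
    }
  }

  // Copying would unregister twice, so the guard is non-copyable.
  ScopedRegistration(const ScopedRegistration&) = delete;
  ScopedRegistration& operator=(const ScopedRegistration&) = delete;

 private:
  Registry* registry_;
  const std::string component_;  // Owned copy, not a reference.
};
```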

davidbtucker

Thanks, PTAL.

igorpeshansky

Some more stuff.

igorpeshansky · 2018-07-20T03:31:33Z

+      result.insert(c.first);
+    }
+  }
+  return std::move(result);


This is a set of strings; there are no movable components here — you don't need std::move.
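
A short sketch of the plain-return form. `FailingComponents` and its `status` parameter are hypothetical. As a side note on the reasoning, `std::set` is itself movable; the reason `std::move` is unnecessary is that a local variable named in a `return` statement is already eligible for copy elision or an implicit move, and wrapping it in `std::move` disables the elision.

```cpp
#include <map>
#include <set>
#include <string>

// Collects the names whose status is false.
std::set<std::string> FailingComponents(
    const std::map<std::string, bool>& status) {
  std::set<std::string> result;
  for (const auto& c : status) {
    if (!c.second) {
      result.insert(c.first);
    }
  }
  // Plain return: the compiler may construct result directly in the
  // caller's storage, and otherwise moves it. No std::move needed.
  return result;
}
```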

igorpeshansky · 2018-07-20T03:31:52Z


+// Registers a component and then unregisters when it goes out of
+// scope.
+class ScopedHealthCheckRegistration {


Maybe just CheckHealth?

igorpeshansky · 2018-07-20T03:32:14Z

+  }
+ private:
+  HealthChecker* health_checker_;
+  const std::string& component_;


Uh, don't do this. If you store a reference to a temporary, you'll cause a crash. Just store a copy of the string.

Oops, thanks. Fixed.

igorpeshansky · 2018-07-20T03:34:00Z

  }
+  std::mutex last_received_mutex;
+  auto last_received = std::chrono::high_resolution_clock::now();
+  auto timeout = std::chrono::seconds(config_.HealthCheckWatchTimeoutSeconds());


Use time::seconds here.

igorpeshansky · 2018-07-20T03:42:38Z

  }
+  std::mutex last_received_mutex;
+  auto last_received = std::chrono::high_resolution_clock::now();
+  auto timeout = std::chrono::seconds(config_.HealthCheckWatchTimeoutSeconds());


Seems like you can have a single variable called expiration that is set to std::chrono::high_resolution_clock::now() + std::chrono::seconds(config_.HealthCheckWatchTimeoutSeconds()) here and in the watcher callback, i.e.:

    std::mutex expiration_mutex;
    auto expiration = std::chrono::high_resolution_clock::now() +
        time::seconds(config_.HealthCheckWatchTimeoutSeconds());
    ...
    [&expiration_mutex, &expiration]{
      std::lock_guard<std::mutex> expiration_lock(expiration_mutex);
      return std::chrono::high_resolution_clock::now() < expiration;
    }
    ...
    [=, &expiration_mutex, &expiration](json::value raw_watch) {
      {
        std::lock_guard<std::mutex> expiration_lock(expiration_mutex);
        expiration = std::chrono::high_resolution_clock::now() +
            time::seconds(config_.HealthCheckWatchTimeoutSeconds());
      }
      WatchEventCallback(callback, name, std::move(raw_watch));
    }

Might be a good idea to factor out config_.HealthCheckWatchTimeoutSeconds() into a local variable called timeout, something like:

    const int timeout = config_.HealthCheckWatchTimeoutSeconds();
    auto expiration =
        std::chrono::high_resolution_clock::now() + time::seconds(timeout);
    ...
    [&expiration_mutex, &expiration]{
      std::lock_guard<std::mutex> expiration_lock(expiration_mutex);
      return std::chrono::high_resolution_clock::now() < expiration;
    }
    ...
    [=, &expiration_mutex, &expiration](json::value raw_watch) {
      {
        std::lock_guard<std::mutex> expiration_lock(expiration_mutex);
        expiration =
            std::chrono::high_resolution_clock::now() + time::seconds(timeout);
      }
      WatchEventCallback(callback, name, std::move(raw_watch));
    }

Done, thanks.

igorpeshansky · 2018-07-20T04:01:31Z

    LOG(INFO) << "WatchMaster(" << name << "): Contacting " << endpoint;
  }
+  std::mutex last_received_mutex;
+  auto last_received = std::chrono::high_resolution_clock::now();


Your timeout is measured in integral seconds. You probably don't need high_resolution_clock, especially because it may be an alias for system_clock, which can jump when the wall clock is adjusted. Why not use std::chrono::steady_clock instead?
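
The expiration pattern discussed in this review can be sketched with `std::chrono::steady_clock`, which is monotonic and therefore unaffected by wall-clock changes. `WatchDeadline`, `Touch` and `IsFresh` are hypothetical names; the PR itself uses local variables captured by lambdas rather than a class.

```cpp
#include <chrono>
#include <mutex>

// Tracks the time by which new data must arrive for a watch to count as
// healthy. Illustrative only; not the PR's implementation.
class WatchDeadline {
 public:
  explicit WatchDeadline(std::chrono::seconds timeout)
      : timeout_(timeout),
        expiration_(std::chrono::steady_clock::now() + timeout) {}

  // Call whenever the watch delivers data; pushes the deadline forward.
  void Touch() {
    std::lock_guard<std::mutex> lock(mutex_);
    expiration_ = std::chrono::steady_clock::now() + timeout_;
  }

  // Health callback body: true while the deadline has not passed.
  bool IsFresh() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return std::chrono::steady_clock::now() < expiration_;
  }

 private:
  const std::chrono::seconds timeout_;
  mutable std::mutex mutex_;
  std::chrono::steady_clock::time_point expiration_;
};
```

The watcher thread would call `Touch()` on every event, and the health callback would return `IsFresh()`.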

igorpeshansky · 2018-07-20T04:04:12Z


+std::set<std::string> HealthChecker::UnhealthyComponents() const {
+  std::lock_guard<std::mutex> lock(mutex_);
+  std::set<std::string> result(unhealthy_components_);


I thought we were dropping these?..

Done. I forgot what we said -- still use the unhealthy_components_ for IsHealthy()? (The current unit test wants that.)

Whoops, they are still needed.

bmoyles0117 · 2018-07-20T15:28:05Z

 constexpr const char kDefaultInstanceZone[] = "";
 constexpr const char kDefaultHealthCheckFile[] =
    "/var/run/metadata-agent/health/unhealthy";
+constexpr const int kDefaultHealthCheckWatchTimeoutSeconds = 5*60;


Nit: WatchTimeout is too generic, can we change it to KubernetesWatch throughout?

KubernetesWatchTimeout*

Renamed to HealthCheckMaxDataAgeSeconds.

igorpeshansky

This is no longer RFC — let's rename the PR.
Some remaining minor comments.

igorpeshansky · 2018-07-20T15:30:20Z

+      [=, &expiration_mutex, &expiration](json::value raw_watch) {
+        {
+          std::lock_guard<std::mutex> expiration_lock(expiration_mutex);
+          expiration = std::chrono::steady_clock::now() +


[Optional] Might read better as:

expiration = std::chrono::steady_clock::now() + time::seconds(timeout);

igorpeshansky · 2018-07-20T15:31:06Z

  }
+  const int timeout = config_.HealthCheckWatchTimeoutSeconds();
+  std::mutex expiration_mutex;
+  auto expiration = std::chrono::steady_clock::now() + time::seconds(timeout);


I would add a comment here explaining what this is, something like "The time by which the watcher has to receive some data to be considered healthy"...

igorpeshansky · 2018-07-20T15:31:18Z

+  CheckHealth check_health(
+      health_checker_, name, [&expiration_mutex, &expiration]{
+        std::lock_guard<std::mutex> expiration_lock(expiration_mutex);
+        return std::chrono::high_resolution_clock::now() < expiration;


std::chrono::steady_clock, right?

Oops, fixed.

igorpeshansky · 2018-07-20T15:39:34Z

+  }
+  if (unhealthy_components.empty()) {
+    if (config_.VerboseLogging()) {
+      LOG(INFO) << "Healthz returning 200";


Let's s/Healthz/\/healthz/ here and in the next log statement.

igorpeshansky · 2018-07-20T15:40:28Z

+      {"Content-Type", "text/plain"},
+    }));
+    conn->write("unhealthy components:\n");
+    for (const auto& s : unhealthy_components) {


[Optional] This is a component, so wouldn't c (or even component) work better?
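
For illustration, the response body for the unhealthy case could be assembled as below. The loop body is not visible in the hunk above, so the one-name-per-line layout and the helper name `UnhealthyBody` are assumptions of this sketch.

```cpp
#include <set>
#include <sstream>
#include <string>

// Builds a plain-text body listing each unhealthy component on its own line.
// Assumption: one component name per line after the header.
std::string UnhealthyBody(const std::set<std::string>& unhealthy_components) {
  std::ostringstream out;
  out << "unhealthy components:\n";
  for (const auto& component : unhealthy_components) {
    out << component << "\n";
  }
  return out.str();
}
```

Because `std::set` iterates in sorted order, the listing is stable across requests.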

igorpeshansky

LGTM

igorpeshansky

LGTM

igorpeshansky

LGTM

bmoyles0117

LGTM!

RFC: Add /healthz endpoint that returns 500 when some watch stream is…

2b40a08

… stale.

davidbtucker requested review from bmoyles0117 and igorpeshansky July 19, 2018 22:19

bmoyles0117 reviewed Jul 19, 2018

igorpeshansky suggested changes Jul 19, 2018

Responded to comments.

90d4aa0

davidbtucker commented Jul 19, 2018

igorpeshansky suggested changes Jul 20, 2018

Address more comments.

6d71980

bmoyles0117 reviewed Jul 20, 2018

igorpeshansky suggested changes Jul 20, 2018

davidbtucker added 2 commits July 20, 2018 12:18

More comments.

861dd08

Always log warning for /healthz failure.

ad5734d

davidbtucker changed the title ~~RFC: Add /healthz endpoint that returns 500 when some watch stream is stale.~~ Add /healthz endpoint that returns 500 when some watch stream is stale. Jul 20, 2018

igorpeshansky approved these changes Jul 20, 2018

Include unhealthy components in set.

a7eec9b

igorpeshansky approved these changes Jul 20, 2018

Log unhealthy components.

384a806

igorpeshansky approved these changes Jul 20, 2018

bmoyles0117 approved these changes Jul 20, 2018

igorpeshansky merged commit 5c7dde6 into master Jul 22, 2018

igorpeshansky deleted the davidbtucker-watch-healthz branch July 22, 2018 15:14
