Shut down gracefully on SIGTERM. #136

Conversation
      updater->stop();
    }
    std::cerr << "Exiting" << std::endl;
    std::exit(128 + signum);

Why is the exit code 128 + signum?

Unix tradition: http://tldp.org/LDP/abs/html/exitcodes.html.
    }

    std::mutex server_wait_mutex;
    server_wait_mutex.lock();

Is this line required given the lock guard in L84? Alternatively, should this lock be released after server.start()?

The lock guard was supposed to deadlock this thread to prevent it from going into the destructor. However, with the suggestion from @bmoyles0117, this is no longer necessary.
    std::cerr << "Stopping server" << std::endl;
    google::cleanup_state->server->stop();
    std::cerr << "Stopping updaters" << std::endl;
    for (google::MetadataUpdater* updater : google::cleanup_state->updaters) {

We need to find some way of doing this in parallel. If the first updater gets stuck while stopping, it shouldn't prevent the other updaters from getting a fair chance to shut down.

This is a good point, but I think we can just require that the implementations of these methods do not block.
    return parse_result < 0 ? 0 : parse_result;
    }

    std::mutex server_wait_mutex;

What do you think about having a cleanup_state.wait() method that can manage the lock instead?

Good idea. In fact, it should encapsulate the stopping of the server/updaters as well. Done.
    void KubernetesUpdater::StopUpdater() {
      // TODO: How do we interrupt a watch thread?
      if (config().KubernetesUseWatch()) {
    #if 0

If we're not going to use this, let's remove it for now.
    std::initializer_list<MetadataUpdater*> updaters, MetadataAgent* server)
        : updaters_(updaters), server_(server) { server_wait_mutex_.lock(); }

    void StopAll() const {

As we're renaming stop to notifyStop, we should rename this accordingly.
    void start();

    // Stops serving.
    void stop();

Let's rename this to notifyStop, or something similar, to make it clear that this simply signals threads that we're shutting down.

This specific one does stop listening on the socket, so I would keep it as Stop.
Force-pushed 32ec2ea to 86146ba.
igorpeshansky left a comment:

I got everything to shut down cleanly by eliminating all thread::join() calls from the shutdown path. PTAL.
    }
    server_wait_mutex_.unlock();
    // Give the notifications some time to propagate.
    std::this_thread::sleep_for(time::seconds(0.1));

Is this guaranteed to be enough time?

Empirically, smaller delays were also sufficient; this just needs enough time for the thread to notice the timer unlock notification and exit the loop. For poller threads, even if it doesn't, nothing bad will happen, so I hesitate to introduce a larger wait here.

It seems to me that unlocking the server_wait_mutex is essentially a no-op, so why wait at all?

Unlocking server_wait_mutex allows the destructors to start executing, which will start joining threads. More cleanup that can proceed in parallel.
    }

    MetadataApiServer::~MetadataApiServer() {
      Stop();

If we have Stop in this destructor, should we also call stop in MetadataAgent's destructor for consistency? I'm primarily concerned about the inconsistency; I'm not sure what the negative effects would be.

MetadataAgent's destructor will deallocate both the API server and the reporter, which will invoke their respective destructors.

I'm confused: why are we calling Stop from MetadataAgent if we're relying on the destructor? It seems confusing that stop gets propagated through multiple channels simultaneously.

Stop() is idempotent. It's just a notification under the covers, so it's OK to call it more than once. Calling it from the destructor guarantees that the server will also shut down cleanly when the object is deleted.
    std::cerr << "Caught SIGTERM; shutting down" << std::endl;
    google::cleanup_state->StartShutdown();
    std::cerr << "Exiting" << std::endl;
    std::exit(128 + signum);

The exit code should be 0 if we terminate successfully. If it's anything but 0, Kubernetes will think that the pod crashed.

This will only happen when a pod is killed by a health check. Do we really want to report a successful exit in that case? I seem to recall that there was a distinction in pod restart behavior between success and failure exits...

I believe we do, as this is the way we have implemented it for the logging agent (we didn't do anything to explicitly return 0; it just does when it exits cleanly).

That's a fluentd thing. Any other process will exit with exactly this exit code (i.e., 143) on SIGTERM. That's just a Unix convention. I bet heapster would exit with 143 as well.

We spoke offline; I believe we're going to go with a 0 exit status code on a healthy exit after looking further into how it could positively impact Docker.
…er before the updaters.
Also clarify the contract for StopUpdater.
- Function renames (mostly Stop->NotifyStop).
- Remove dead code.
Force-pushed 86146ba to 9d43060.
This doesn't fully work, because the agent gets stuck waiting on joining threads, but a second SIGTERM usually does the trick. It does shut down the server first, though.