Shut down gracefully on SIGTERM. #136

Conversation
      updater->stop();
    }
    std::cerr << "Exiting" << std::endl;
    std::exit(128 + signum);

Why is the exit code 128 + signum?

Unix tradition: http://tldp.org/LDP/abs/html/exitcodes.html.
    }

    std::mutex server_wait_mutex;
    server_wait_mutex.lock();

Is this line required given the lock guard in L84? Alternatively, should this lock be released after server.start()?

The lock guard was supposed to deadlock this thread to prevent it from going into the destructor. However, with the suggestion from @bmoyles0117, this is no longer necessary.
    std::cerr << "Stopping server" << std::endl;
    google::cleanup_state->server->stop();
    std::cerr << "Stopping updaters" << std::endl;
    for (google::MetadataUpdater* updater : google::cleanup_state->updaters) {

We need to find some way of doing this in parallel. If the first updater gets stuck while stopping, it shouldn't prevent the other updaters from getting a fair chance to shut down.

This is a good point, but I think we can just require that the implementations of these methods do not block.
    return parse_result < 0 ? 0 : parse_result;
    }

    std::mutex server_wait_mutex;

What do you think about having a cleanup_state.wait() method that can manage the lock instead?

Good idea. In fact, it should encapsulate the stopping of the server/updaters as well. Done.
    void KubernetesUpdater::StopUpdater() {
      // TODO: How do we interrupt a watch thread?
      if (config().KubernetesUseWatch()) {
    #if 0

If we're not going to use this, let's remove it for now.
    std::initializer_list<MetadataUpdater*> updaters, MetadataAgent* server)
        : updaters_(updaters), server_(server) { server_wait_mutex_.lock(); }

    void StopAll() const {

As we're renaming stop to notifyStop, we should rename this accordingly.
    void start();

    // Stops serving.
    void stop();

Let's rename this to notifyStop, or something similar, to make it clear that this simply signals threads that we're shutting down.

This specific one does stop listening on the socket, so I would keep it as Stop.
Force-pushed 32ec2ea to 86146ba.
igorpeshansky left a comment:

I got everything to shut down cleanly by eliminating all thread::join() calls from the shutdown path. PTAL.
    }
    server_wait_mutex_.unlock();
    // Give the notifications some time to propagate.
    std::this_thread::sleep_for(time::seconds(0.1));

Is this guaranteed to be enough time?

Empirically, smaller delays were also sufficient; this just needs enough time for the thread to notice the timer unlock notification and exit the loop. For poller threads, even if it doesn't, nothing bad will happen, so I hesitate to introduce a larger wait here.

It seems to me that unlocking the server_wait_mutex is essentially a no-op, so why wait at all?

Unlocking server_wait_mutex allows the destructors to start executing, which will start joining threads. More cleanup that can proceed in parallel.
    }

    MetadataApiServer::~MetadataApiServer() {
      Stop();

If we have Stop in this destructor, should we also call stop in MetadataAgent's destructor for consistency? I'm primarily concerned about the inconsistency; I'm not sure what the negative effects would be.

MetadataAgent's destructor will deallocate both the API server and the reporter, which will invoke their respective destructors.

I'm confused: why are we calling Stop from MetadataAgent if we're relying on the destructor? It seems confusing that stop gets propagated through multiple channels simultaneously.

Stop() is idempotent. It's just a notification under the covers, so it's OK to call it more than once. Calling it from the destructor guarantees that the server will also shut down cleanly when the object is deleted.
    std::cerr << "Caught SIGTERM; shutting down" << std::endl;
    google::cleanup_state->StartShutdown();
    std::cerr << "Exiting" << std::endl;
    std::exit(128 + signum);

The exit code should be 0 if we terminate successfully. If it's anything but 0, Kubernetes will think that the pod crashed.

This will only happen when a pod is killed by a health check. Do we really want to report a successful exit in that case? I seem to recall that there was a distinction in pod restart behavior between success and failure exits...

I believe we do, as this is the way we have implemented it for the logging agent (we didn't do anything to explicitly return 0; it just does when it exits cleanly).

That's a fluentd thing. Any other process will exit with exactly this exit code (i.e., 143) on SIGTERM. That's just a Unix convention. I bet heapster would exit with 143 as well.

We spoke offline; I believe we're going to go with a 0 exit status code on a healthy exit after looking further into how it could positively impact Docker.
…er before the updaters.
Also clarify the contract for StopUpdater.
- Function renames (mostly Stop->NotifyStop).
- Remove dead code.
Force-pushed 86146ba to 9d43060.
This doesn't fully work, because the agent gets stuck waiting on joining threads, but a second SIGTERM usually does the trick. It does shut down the server first, though.