Be more robust in handling broken chunked encoding.#58
Conversation
bmoyles0117
left a comment
There was a problem hiding this comment.
I'm not sure that this is actually solving the problem at hand. Why are we appending the number of bytes remaining to the end of the stream?
|
This is now handled differently. PTAL. |
bmoyles0117
left a comment
There was a problem hiding this comment.
I'm torn on this, being that we're unlocking the completion mutex now, it emphasizes that these threads can die, and not having reconnect logic metadata agent, or at least a health check endpoint that tells us we're not running, means that the metadata agent will go on living forever, having its threads die, but not exiting as a process. WDYT?
|
I've verified that this at least causes the watch thread to exit gracefully instead of going into an infinite loop. PTAL. |
|
I'd rather handle these issues separately. We do need thread restart logic, and we do need a healthcheck endpoint. But we should also prevent the infinite loop. WDYT? |
|
I'm ok with that, potentially as a simple follow up PR, I would suggest we make these errors INCREDIBLY obvious, as a sign to the user that "something" is likely broken, so when they look at the logs it's not something as simple as "Thread Exited", it should have a huge warning around it that the agent should likely be restarted. |
|
Ok, let's revisit after we get the basics fixed. I have a couple of follow-up PRs depending on this one, so PTAL. |
qingling128
left a comment
There was a problem hiding this comment.
Let's fix the infinite loop first since this is quite urgent. The rest could be handled in follow-up PRs.
[Optional] Do we want to rephrase the error (Invalid chunked encoding) in a way that it implies the impact?
|
@qingling128 Do you have any concrete proposals for the phrasing? If so, could you please add them in a comment on the code? |
eb3a130 to
b674482
Compare
|
Rebased off |
| << std::string(begin, end) | ||
| << "'"; | ||
| return boost::iterator_range<const char*>(begin, end); | ||
| << "'; exiting"; |
There was a problem hiding this comment.
Something like "'; Skipping the current chunk and exiting the thread. Metadata of certain monitored resource might not be up to date.";
?
Not very sure if the impact is accurate.
There was a problem hiding this comment.
Take a look at the new message. WDYT?
igorpeshansky
left a comment
There was a problem hiding this comment.
The proper fix turned out to be trickier than I thought, but I think I got it now. PTAL.
| << std::string(begin, end) | ||
| << "'"; | ||
| return boost::iterator_range<const char*>(begin, end); | ||
| << "'; exiting"; |
There was a problem hiding this comment.
Take a look at the new message. WDYT?
| LOG(ERROR) << "No more pod metadata will be collected"; | ||
| } catch (const KubernetesReader::QueryException& e) { | ||
| // Already logged. | ||
| LOG(ERROR) << "No more pod metadata will be collected"; |
There was a problem hiding this comment.
Distinguish the log of these two errors for debugging purpose? Same below.
There was a problem hiding this comment.
They would each be preceded by another log message that details the error. That should be enough to distinguish them.
There was a problem hiding this comment.
Ah, I see. Then we should be fine.
There was a problem hiding this comment.
We're not logging the e.what() here as we do above, is that because it's logged elsewhere? Can we juggle things around for consistency?
There was a problem hiding this comment.
Yes, it's because it's logged elsewhere. The code is fairly consistent about logging any time a QueryException is thrown, so we don't need to log when catching it.
I've just added one missing log statement above.
| namespace { | ||
| struct Watcher { | ||
| Watcher(std::function<void(json::value)> event_callback, | ||
| Watcher(const std::string& endpoint, |
There was a problem hiding this comment.
Good idea adding a watching name to make the error log more informative.
| } | ||
|
|
||
| try { | ||
| //#ifdef VERBOSE |
There was a problem hiding this comment.
There seem to be many commented-out lines like this. Should we clean them up?
There was a problem hiding this comment.
Not in this PR. I have a bug open to factor out the chunked encoding handler — will clean this up when I do that.
| return range; | ||
| LOG(ERROR) << name_ << " => " | ||
| << "Asked to read next chunk with no bytes remaining"; | ||
| return range; // TODO: should this throw an exception instead? |
igorpeshansky
left a comment
There was a problem hiding this comment.
Thanks, PTAL.
| } | ||
|
|
||
| try { | ||
| //#ifdef VERBOSE |
There was a problem hiding this comment.
Not in this PR. I have a bug open to factor out the chunked encoding handler — will clean this up when I do that.
| LOG(ERROR) << "No more pod metadata will be collected"; | ||
| } catch (const KubernetesReader::QueryException& e) { | ||
| // Already logged. | ||
| LOG(ERROR) << "No more pod metadata will be collected"; |
There was a problem hiding this comment.
They would each be preceded by another log message that details the error. That should be enough to distinguish them.
| return range; | ||
| LOG(ERROR) << name_ << " => " | ||
| << "Asked to read next chunk with no bytes remaining"; | ||
| return range; // TODO: should this throw an exception instead? |
| } | ||
|
|
||
| try { | ||
| //#ifdef VERBOSE |
| LOG(ERROR) << "No more pod metadata will be collected"; | ||
| } catch (const KubernetesReader::QueryException& e) { | ||
| // Already logged. | ||
| LOG(ERROR) << "No more pod metadata will be collected"; |
There was a problem hiding this comment.
Ah, I see. Then we should be fine.
| void operator()(const boost::iterator_range<const char*>& range, | ||
| const boost::system::error_code& error) { | ||
| if (!error) { | ||
| if (!exception_.empty()) { |
There was a problem hiding this comment.
Being that exception is a string, can we change the name?
There was a problem hiding this comment.
I initially had it as error_, but that ended up being too easy to confuse with the error argument. Any naming suggestions?
There was a problem hiding this comment.
Something like exception_message_ or exception_string_ just to add clarify that we can't treat it like an exception (despite it clearly being defined as a string)
There was a problem hiding this comment.
Renamed to exception_message_.
| } | ||
| std::lock_guard<std::mutex> await_completion(completion_mutex); | ||
| if (!watcher.exception().empty()) { | ||
| throw QueryException(watcher.exception()); |
There was a problem hiding this comment.
Do we really need to throw an exception here? Either way, the thread will end up exiting. Does this exception bubble up to kill the main process?
There was a problem hiding this comment.
This isn't the same thread. This exception is thrown from the thread that is waiting for the watcher thread. It will be caught by callers of WatchMaster.
| LOG(ERROR) << "No more pod metadata will be collected"; | ||
| } catch (const KubernetesReader::QueryException& e) { | ||
| // Already logged. | ||
| LOG(ERROR) << "No more pod metadata will be collected"; |
There was a problem hiding this comment.
We're not logging the e.what() here as we do above, is that because it's logged elsewhere? Can we juggle things around for consistency?
| @@ -1028,8 +1084,9 @@ void KubernetesReader::WatchNode(MetadataUpdater::UpdateCallback callback) | |||
| std::placeholders::_2, std::placeholders::_3)); | |||
| } catch (const json::Exception& e) { | |||
There was a problem hiding this comment.
Do we need to handle the WatcherException here?
There was a problem hiding this comment.
No -- it will never be thrown in this thread, and exceptions don't propagate across threads.
| WatcherException(const std::string& what) : explanation_(what) {} | ||
| const std::string& what() const { return explanation_; } | ||
| private: | ||
| std::string explanation_; |
There was a problem hiding this comment.
Why is this called explanation_ instead of what_?
There was a problem hiding this comment.
For consistency with all other exceptions. :-) If we choose to change it, we should do it in one fell swoop, but we could also keep it this way.
e.what() is fairly standard in Boost and others.
There was a problem hiding this comment.
I'm fine with e.what(), I just got confused why the local variable is called explanation_, not critical in either case so happy to just approve.
igorpeshansky
left a comment
There was a problem hiding this comment.
Addressed your feedback. PTAL.
| WatcherException(const std::string& what) : explanation_(what) {} | ||
| const std::string& what() const { return explanation_; } | ||
| private: | ||
| std::string explanation_; |
There was a problem hiding this comment.
For consistency with all other exceptions. :-) If we choose to change it, we should do it in one fell swoop, but we could also keep it this way.
e.what() is fairly standard in Boost and others.
| void operator()(const boost::iterator_range<const char*>& range, | ||
| const boost::system::error_code& error) { | ||
| if (!error) { | ||
| if (!exception_.empty()) { |
There was a problem hiding this comment.
I initially had it as error_, but that ended up being too easy to confuse with the error argument. Any naming suggestions?
| } | ||
| std::lock_guard<std::mutex> await_completion(completion_mutex); | ||
| if (!watcher.exception().empty()) { | ||
| throw QueryException(watcher.exception()); |
There was a problem hiding this comment.
This isn't the same thread. This exception is thrown from the thread that is waiting for the watcher thread. It will be caught by callers of WatchMaster.
| LOG(ERROR) << "No more pod metadata will be collected"; | ||
| } catch (const KubernetesReader::QueryException& e) { | ||
| // Already logged. | ||
| LOG(ERROR) << "No more pod metadata will be collected"; |
There was a problem hiding this comment.
Yes, it's because it's logged elsewhere. The code is fairly consistent about logging any time a QueryException is thrown, so we don't need to log when catching it.
I've just added one missing log statement above.
| @@ -1028,8 +1084,9 @@ void KubernetesReader::WatchNode(MetadataUpdater::UpdateCallback callback) | |||
| std::placeholders::_2, std::placeholders::_3)); | |||
| } catch (const json::Exception& e) { | |||
There was a problem hiding this comment.
No -- it will never be thrown in this thread, and exceptions don't propagate across threads.
igorpeshansky
left a comment
There was a problem hiding this comment.
Thanks for the reviews. Merging.
| void operator()(const boost::iterator_range<const char*>& range, | ||
| const boost::system::error_code& error) { | ||
| if (!error) { | ||
| if (!exception_.empty()) { |
There was a problem hiding this comment.
Renamed to exception_message_.
This is a preliminary PR to check whether this fixed the problem.