Skip to content
This repository was archived by the owner on Aug 19, 2019. It is now read-only.

Be more robust in handling broken chunked encoding.#58

Merged
igorpeshansky merged 7 commits intomasterfrom
igorp-watch-chunked-fix
Feb 27, 2018
Merged

Be more robust in handling broken chunked encoding.#58
igorpeshansky merged 7 commits intomasterfrom
igorp-watch-chunked-fix

Conversation

@igorpeshansky
Copy link
Copy Markdown
Contributor

This is a preliminary PR to check whether this fixed the problem.

Copy link
Copy Markdown
Contributor

@bmoyles0117 bmoyles0117 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that this is actually solving the problem at hand. Why are we appending the number of bytes remaining to the end of the stream?

@igorpeshansky
Copy link
Copy Markdown
Contributor Author

This is now handled differently. PTAL.

Copy link
Copy Markdown
Contributor

@bmoyles0117 bmoyles0117 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm torn on this, being that we're unlocking the completion mutex now, it emphasizes that these threads can die, and not having reconnect logic metadata agent, or at least a health check endpoint that tells us we're not running, means that the metadata agent will go on living forever, having its threads die, but not exiting as a process. WDYT?

@igorpeshansky
Copy link
Copy Markdown
Contributor Author

I've verified that this at least causes the watch thread to exit gracefully instead of going into an infinite loop. PTAL.

@igorpeshansky
Copy link
Copy Markdown
Contributor Author

I'd rather handle these issues separately. We do need thread restart logic, and we do need a healthcheck endpoint. But we should also prevent the infinite loop. WDYT?

@bmoyles0117
Copy link
Copy Markdown
Contributor

I'm ok with that, potentially as a simple follow up PR, I would suggest we make these errors INCREDIBLY obvious, as a sign to the user that "something" is likely broken, so when they look at the logs it's not something as simple as "Thread Exited", it should have a huge warning around it that the agent should likely be restarted.

@igorpeshansky
Copy link
Copy Markdown
Contributor Author

Ok, let's revisit after we get the basics fixed. I have a couple of follow-up PRs depending on this one, so PTAL.

Copy link
Copy Markdown

@qingling128 qingling128 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's fix the infinite loop first since this is quite urgent. The rest could be handled in follow-up PRs.

[Optional] Do we want to rephrase the error (Invalid chunked encoding) in a way that it implies the impact?

@igorpeshansky
Copy link
Copy Markdown
Contributor Author

@qingling128 Do you have any concrete proposals for the phrasing? If so, could you please add them in a comment on the code?

@igorpeshansky igorpeshansky force-pushed the igorp-watch-chunked-fix branch from eb3a130 to b674482 Compare February 26, 2018 20:06
@igorpeshansky
Copy link
Copy Markdown
Contributor Author

Rebased off master.

Comment thread src/kubernetes.cc Outdated
<< std::string(begin, end)
<< "'";
return boost::iterator_range<const char*>(begin, end);
<< "'; exiting";
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like "'; Skipping the current chunk and exiting the thread. Metadata of certain monitored resource might not be up to date.";
?

Not very sure if the impact is accurate.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take a look at the new message. WDYT?

Copy link
Copy Markdown
Contributor Author

@igorpeshansky igorpeshansky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proper fix turned out to be trickier than I thought, but I think I got it now. PTAL.

Comment thread src/kubernetes.cc Outdated
<< std::string(begin, end)
<< "'";
return boost::iterator_range<const char*>(begin, end);
<< "'; exiting";
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take a look at the new message. WDYT?

Comment thread src/kubernetes.cc
LOG(ERROR) << "No more pod metadata will be collected";
} catch (const KubernetesReader::QueryException& e) {
// Already logged.
LOG(ERROR) << "No more pod metadata will be collected";
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Distinguish the log of these two errors for debugging purpose? Same below.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They would each be preceded by another log message that details the error. That should be enough to distinguish them.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. Then we should be fine.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're not logging the e.what() here as we do above, is that because it's logged elsewhere? Can we juggle things around for consistency?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's because it's logged elsewhere. The code is fairly consistent about logging any time a QueryException is thrown, so we don't need to log when catching it.
I've just added one missing log statement above.

Comment thread src/kubernetes.cc
namespace {
struct Watcher {
Watcher(std::function<void(json::value)> event_callback,
Watcher(const std::string& endpoint,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea adding a watching name to make the error log more informative.

Comment thread src/kubernetes.cc
}

try {
//#ifdef VERBOSE
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seem to be many commented-out lines like this. Should we clean them up?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in this PR. I have a bug open to factor out the chunked encoding handler — will clean this up when I do that.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

Comment thread src/kubernetes.cc Outdated
return range;
LOG(ERROR) << name_ << " => "
<< "Asked to read next chunk with no bytes remaining";
return range; // TODO: should this throw an exception instead?
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like an exception to me.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Contributor Author

@igorpeshansky igorpeshansky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, PTAL.

Comment thread src/kubernetes.cc
}

try {
//#ifdef VERBOSE
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in this PR. I have a bug open to factor out the chunked encoding handler — will clean this up when I do that.

Comment thread src/kubernetes.cc
LOG(ERROR) << "No more pod metadata will be collected";
} catch (const KubernetesReader::QueryException& e) {
// Already logged.
LOG(ERROR) << "No more pod metadata will be collected";
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They would each be preceded by another log message that details the error. That should be enough to distinguish them.

Comment thread src/kubernetes.cc Outdated
return range;
LOG(ERROR) << name_ << " => "
<< "Asked to read next chunk with no bytes remaining";
return range; // TODO: should this throw an exception instead?
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown

@qingling128 qingling128 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Comment thread src/kubernetes.cc
}

try {
//#ifdef VERBOSE
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

Comment thread src/kubernetes.cc
LOG(ERROR) << "No more pod metadata will be collected";
} catch (const KubernetesReader::QueryException& e) {
// Already logged.
LOG(ERROR) << "No more pod metadata will be collected";
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. Then we should be fine.

Comment thread src/kubernetes.cc Outdated
void operator()(const boost::iterator_range<const char*>& range,
const boost::system::error_code& error) {
if (!error) {
if (!exception_.empty()) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Being that exception is a string, can we change the name?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially had it as error_, but that ended up being too easy to confuse with the error argument. Any naming suggestions?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like exception_message_ or exception_string_ just to add clarify that we can't treat it like an exception (despite it clearly being defined as a string)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to exception_message_.

Comment thread src/kubernetes.cc
}
std::lock_guard<std::mutex> await_completion(completion_mutex);
if (!watcher.exception().empty()) {
throw QueryException(watcher.exception());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to throw an exception here? Either way, the thread will end up exiting. Does this exception bubble up to kill the main process?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't the same thread. This exception is thrown from the thread that is waiting for the watcher thread. It will be caught by callers of WatchMaster.

Comment thread src/kubernetes.cc
LOG(ERROR) << "No more pod metadata will be collected";
} catch (const KubernetesReader::QueryException& e) {
// Already logged.
LOG(ERROR) << "No more pod metadata will be collected";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're not logging the e.what() here as we do above, is that because it's logged elsewhere? Can we juggle things around for consistency?

Comment thread src/kubernetes.cc
@@ -1028,8 +1084,9 @@ void KubernetesReader::WatchNode(MetadataUpdater::UpdateCallback callback)
std::placeholders::_2, std::placeholders::_3));
} catch (const json::Exception& e) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to handle the WatcherException here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No -- it will never be thrown in this thread, and exceptions don't propagate across threads.

Comment thread src/kubernetes.cc
WatcherException(const std::string& what) : explanation_(what) {}
const std::string& what() const { return explanation_; }
private:
std::string explanation_;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this called explanation_ instead of what_?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency with all other exceptions. :-) If we choose to change it, we should do it in one fell swoop, but we could also keep it this way.
e.what() is fairly standard in Boost and others.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with e.what(), I just got confused why the local variable is called explanation_, not critical in either case so happy to just approve.

Copy link
Copy Markdown
Contributor Author

@igorpeshansky igorpeshansky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed your feedback. PTAL.

Comment thread src/kubernetes.cc
WatcherException(const std::string& what) : explanation_(what) {}
const std::string& what() const { return explanation_; }
private:
std::string explanation_;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency with all other exceptions. :-) If we choose to change it, we should do it in one fell swoop, but we could also keep it this way.
e.what() is fairly standard in Boost and others.

Comment thread src/kubernetes.cc Outdated
void operator()(const boost::iterator_range<const char*>& range,
const boost::system::error_code& error) {
if (!error) {
if (!exception_.empty()) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially had it as error_, but that ended up being too easy to confuse with the error argument. Any naming suggestions?

Comment thread src/kubernetes.cc
}
std::lock_guard<std::mutex> await_completion(completion_mutex);
if (!watcher.exception().empty()) {
throw QueryException(watcher.exception());
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't the same thread. This exception is thrown from the thread that is waiting for the watcher thread. It will be caught by callers of WatchMaster.

Comment thread src/kubernetes.cc
LOG(ERROR) << "No more pod metadata will be collected";
} catch (const KubernetesReader::QueryException& e) {
// Already logged.
LOG(ERROR) << "No more pod metadata will be collected";
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's because it's logged elsewhere. The code is fairly consistent about logging any time a QueryException is thrown, so we don't need to log when catching it.
I've just added one missing log statement above.

Comment thread src/kubernetes.cc
@@ -1028,8 +1084,9 @@ void KubernetesReader::WatchNode(MetadataUpdater::UpdateCallback callback)
std::placeholders::_2, std::placeholders::_3));
} catch (const json::Exception& e) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No -- it will never be thrown in this thread, and exceptions don't propagate across threads.

Copy link
Copy Markdown
Contributor

@bmoyles0117 bmoyles0117 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Contributor Author

@igorpeshansky igorpeshansky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the reviews. Merging.

Comment thread src/kubernetes.cc Outdated
void operator()(const boost::iterator_range<const char*>& range,
const boost::system::error_code& error) {
if (!error) {
if (!exception_.empty()) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to exception_message_.

@igorpeshansky igorpeshansky merged commit 2bbad32 into master Feb 27, 2018
@igorpeshansky igorpeshansky deleted the igorp-watch-chunked-fix branch February 27, 2018 21:42
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants