Be more robust in handling broken chunked encoding. by igorpeshansky · Pull Request #58 · Stackdriver/metadata-agent

igorpeshansky · 2018-02-23T22:57:34Z

This is a preliminary PR to check whether this fixes the problem.

bmoyles0117

I'm not sure that this is actually solving the problem at hand. Why are we appending the number of bytes remaining to the end of the stream?

igorpeshansky · 2018-02-24T19:01:40Z

This is now handled differently. PTAL.

bmoyles0117

I'm torn on this. Now that we're unlocking the completion mutex, it emphasizes that these threads can die. Without reconnect logic in the metadata agent, or at least a health check endpoint that tells us we're not running, the metadata agent will go on living forever: its threads die, but the process never exits. WDYT?

igorpeshansky · 2018-02-25T00:10:59Z

I've verified that this at least causes the watch thread to exit gracefully instead of going into an infinite loop. PTAL.

igorpeshansky · 2018-02-25T00:14:10Z

I'd rather handle these issues separately. We do need thread restart logic, and we do need a healthcheck endpoint. But we should also prevent the infinite loop. WDYT?

bmoyles0117 · 2018-02-25T00:23:31Z

I'm OK with that. Potentially as a simple follow-up PR, I would suggest we make these errors INCREDIBLY obvious, as a sign to the user that "something" is likely broken. When they look at the logs, they shouldn't see something as simple as "Thread Exited"; there should be a huge warning around it saying that the agent should likely be restarted.

igorpeshansky · 2018-02-25T05:11:07Z

Ok, let's revisit after we get the basics fixed. I have a couple of follow-up PRs depending on this one, so PTAL.

qingling128

Let's fix the infinite loop first since this is quite urgent. The rest could be handled in follow-up PRs.

[Optional] Do we want to rephrase the error (Invalid chunked encoding) so that it conveys the impact?

igorpeshansky · 2018-02-26T18:58:17Z

@qingling128 Do you have any concrete proposals for the phrasing? If so, could you please add them in a comment on the code?

igorpeshansky · 2018-02-26T20:07:16Z

Rebased off master.

qingling128 · 2018-02-26T20:30:51Z

                 << std::string(begin, end)
-                 << "'";
-      return boost::iterator_range<const char*>(begin, end);
+                 << "'; exiting";


Something like "'; Skipping the current chunk and exiting the thread. Metadata of certain monitored resources might not be up to date."?

I'm not sure the stated impact is accurate, though.

igorpeshansky

The proper fix turned out to be trickier than I thought, but I think I got it now. PTAL.

igorpeshansky · 2018-02-26T23:55:03Z

                 << std::string(begin, end)
-                 << "'";
-      return boost::iterator_range<const char*>(begin, end);
+                 << "'; exiting";


Take a look at the new message. WDYT?

qingling128 · 2018-02-27T04:43:09Z

+    LOG(ERROR) << "No more pod metadata will be collected";
  } catch (const KubernetesReader::QueryException& e) {
-    // Already logged.
+    LOG(ERROR) << "No more pod metadata will be collected";


Should we distinguish the log messages for these two errors for debugging purposes? Same below.

qingling128 · 2018-02-27T04:46:42Z

 namespace {
 struct Watcher {
-  Watcher(std::function<void(json::value)> event_callback,
+  Watcher(const std::string& endpoint,


Good idea adding a watcher name to make the error log more informative.

qingling128 · 2018-02-27T04:47:31Z

+      }
+
+      try {
 //#ifdef VERBOSE


There seem to be many commented-out lines like this. Should we clean them up?

qingling128 · 2018-02-27T04:49:13Z

-      return range;
+      LOG(ERROR) << name_ << " => "
+                 << "Asked to read next chunk with no bytes remaining";
+      return range;  // TODO: should this throw an exception instead?


Sounds like an exception to me.

igorpeshansky

Thanks, PTAL.

igorpeshansky · 2018-02-27T05:05:46Z

+      }
+
+      try {
 //#ifdef VERBOSE


Not in this PR. I have a bug open to factor out the chunked encoding handler — will clean this up when I do that.

igorpeshansky · 2018-02-27T05:07:30Z

+    LOG(ERROR) << "No more pod metadata will be collected";
  } catch (const KubernetesReader::QueryException& e) {
-    // Already logged.
+    LOG(ERROR) << "No more pod metadata will be collected";


They would each be preceded by another log message that details the error. That should be enough to distinguish them.

igorpeshansky · 2018-02-27T16:31:26Z

-      return range;
+      LOG(ERROR) << name_ << " => "
+                 << "Asked to read next chunk with no bytes remaining";
+      return range;  // TODO: should this throw an exception instead?


qingling128

LGTM.

qingling128 · 2018-02-27T16:40:20Z

+      }
+
+      try {
 //#ifdef VERBOSE


Sounds good.

qingling128 · 2018-02-27T16:40:50Z

+    LOG(ERROR) << "No more pod metadata will be collected";
  } catch (const KubernetesReader::QueryException& e) {
-    // Already logged.
+    LOG(ERROR) << "No more pod metadata will be collected";


Ah, I see. Then we should be fine.

bmoyles0117 · 2018-02-27T18:37:07Z

  void operator()(const boost::iterator_range<const char*>& range,
                  const boost::system::error_code& error) {
    if (!error) {
+      if (!exception_.empty()) {


Being that exception is a string, can we change the name?

Something like exception_message_ or exception_string_, just to clarify that we can't treat it like an exception (despite it clearly being defined as a string).

bmoyles0117 · 2018-02-27T18:41:31Z

    }
    std::lock_guard<std::mutex> await_completion(completion_mutex);
+    if (!watcher.exception().empty()) {
+      throw QueryException(watcher.exception());


Do we really need to throw an exception here? Either way, the thread will end up exiting. Does this exception bubble up to kill the main process?

bmoyles0117 · 2018-02-27T18:42:37Z

+    LOG(ERROR) << "No more pod metadata will be collected";
  } catch (const KubernetesReader::QueryException& e) {
-    // Already logged.
+    LOG(ERROR) << "No more pod metadata will be collected";


We're not logging the e.what() here as we do above, is that because it's logged elsewhere? Can we juggle things around for consistency?

bmoyles0117 · 2018-02-27T18:43:32Z

@@ -1028,8 +1084,9 @@ void KubernetesReader::WatchNode(MetadataUpdater::UpdateCallback callback)
                          std::placeholders::_2, std::placeholders::_3));
  } catch (const json::Exception& e) {


Do we need to handle the WatcherException here?

bmoyles0117 · 2018-02-27T18:44:56Z

+    WatcherException(const std::string& what) : explanation_(what) {}
+    const std::string& what() const { return explanation_; }
+   private:
+    std::string explanation_;


Why is this called explanation_ instead of what_?

I'm fine with e.what(); I was just confused about why the member variable is called explanation_. It's not critical either way, so I'm happy to just approve.

igorpeshansky

Addressed your feedback. PTAL.

igorpeshansky · 2018-02-27T18:52:36Z

+    WatcherException(const std::string& what) : explanation_(what) {}
+    const std::string& what() const { return explanation_; }
+   private:
+    std::string explanation_;


For consistency with all other exceptions. :-) If we choose to change it, we should do it in one fell swoop, but we could also keep it this way.
e.what() is fairly standard in Boost and others.

igorpeshansky · 2018-02-27T18:54:15Z

  void operator()(const boost::iterator_range<const char*>& range,
                  const boost::system::error_code& error) {
    if (!error) {
+      if (!exception_.empty()) {


I initially had it as error_, but that ended up being too easy to confuse with the error argument. Any naming suggestions?

igorpeshansky · 2018-02-27T18:58:15Z

    }
    std::lock_guard<std::mutex> await_completion(completion_mutex);
+    if (!watcher.exception().empty()) {
+      throw QueryException(watcher.exception());


This isn't the same thread. This exception is thrown from the thread that is waiting for the watcher thread. It will be caught by callers of WatchMaster.
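
The hand-off described above can be sketched as follows. This is a minimal illustration, not the agent's code: the `Watcher` and `QueryException` here are simplified stand-ins, `Run` and `WatchAndWait` are hypothetical names, and `std::thread::join` takes the place of the completion mutex. The worker records its failure as a string (an exception escaping a `std::thread` function would call `std::terminate`), and the waiting thread turns that string into an exception its own callers can catch.

```cpp
#include <mutex>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical stand-in for the agent's QueryException.
struct QueryException : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// Simplified watcher: stores a failure message instead of letting an
// exception escape the worker thread.
class Watcher {
 public:
  void Run(bool simulate_failure) {
    try {
      if (simulate_failure) {
        throw std::runtime_error("Invalid chunked encoding");
      }
    } catch (const std::exception& e) {
      std::lock_guard<std::mutex> lock(mutex_);
      exception_message_ = e.what();
    }
  }

  std::string exception_message() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return exception_message_;
  }

 private:
  mutable std::mutex mutex_;
  std::string exception_message_;
};

// Runs in the waiting thread: waits for the worker to finish, then rethrows
// the recorded failure so that callers in this thread can catch it.
void WatchAndWait(bool simulate_failure) {
  Watcher watcher;
  std::thread worker(
      [&watcher, simulate_failure] { watcher.Run(simulate_failure); });
  worker.join();
  const std::string message = watcher.exception_message();
  if (!message.empty()) {
    throw QueryException(message);
  }
}
```

Calling `WatchAndWait(true)` throws a `QueryException` in the calling thread; `WatchAndWait(false)` returns normally.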

igorpeshansky · 2018-02-27T19:03:53Z

+    LOG(ERROR) << "No more pod metadata will be collected";
  } catch (const KubernetesReader::QueryException& e) {
-    // Already logged.
+    LOG(ERROR) << "No more pod metadata will be collected";


Yes, it's because it's logged elsewhere. The code is fairly consistent about logging any time a QueryException is thrown, so we don't need to log when catching it.
I've just added one missing log statement above.

igorpeshansky · 2018-02-27T19:04:44Z

@@ -1028,8 +1084,9 @@ void KubernetesReader::WatchNode(MetadataUpdater::UpdateCallback callback)
                          std::placeholders::_2, std::placeholders::_3));
  } catch (const json::Exception& e) {


No -- it will never be thrown in this thread, and exceptions don't propagate across threads.

bmoyles0117

LGTM

igorpeshansky

Thanks for the reviews. Merging.

igorpeshansky · 2018-02-27T21:39:37Z

  void operator()(const boost::iterator_range<const char*>& range,
                  const boost::system::error_code& error) {
    if (!error) {
+      if (!exception_.empty()) {


Renamed to exception_message_.

igorpeshansky requested review from bmoyles0117 and qingling128 February 23, 2018 22:57

bmoyles0117 reviewed Feb 23, 2018

bmoyles0117 reviewed Feb 24, 2018

qingling128 approved these changes Feb 26, 2018

igorpeshansky added 2 commits February 26, 2018 14:42

When a chunk does not end in CRLF, assume a CRLF at the end.

bb232ec

On second thought, just exit the watch on an invalid chunk.

b674482
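
The two commits above weigh two responses to a chunk that does not end in CRLF: assume the CRLF, or stop. A minimal sketch of the second choice follows. `ParseChunk` and `InvalidChunk` are hypothetical names for illustration, not the agent's actual parser, and chunk extensions and trailers are deliberately not handled.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical exception type for this sketch.
struct InvalidChunk : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// Parses one chunk of the form "<hex size>\r\n<data>\r\n" starting at `pos`.
// On success, returns the chunk data and advances `pos` past the trailing
// CRLF. On broken framing, throws InvalidChunk and leaves `pos` unchanged,
// so the caller can stop instead of looping on the same bytes.
std::string ParseChunk(const std::string& input, std::size_t& pos) {
  const std::size_t size_end = input.find("\r\n", pos);
  if (size_end == std::string::npos) {
    throw InvalidChunk("missing CRLF after chunk size");
  }
  const std::string size_field = input.substr(pos, size_end - pos);
  std::size_t size = 0;
  std::size_t parsed = 0;
  try {
    size = std::stoul(size_field, &parsed, 16);
  } catch (const std::exception&) {
    throw InvalidChunk("chunk size is not a hex number: '" + size_field + "'");
  }
  if (parsed != size_field.size()) {
    throw InvalidChunk("trailing characters in chunk size: '" + size_field + "'");
  }
  const std::size_t data_begin = size_end + 2;
  const std::size_t available = input.size() - data_begin;
  if (size > available || available - size < 2) {
    throw InvalidChunk("chunk is truncated");
  }
  if (input.compare(data_begin + size, 2, "\r\n") != 0) {
    throw InvalidChunk("chunk does not end in CRLF");
  }
  pos = data_begin + size + 2;
  return input.substr(data_begin, size);
}
```

Throwing rather than guessing at the missing CRLF means a malformed stream ends the watch with a clear error instead of feeding misaligned data to the JSON parser.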

igorpeshansky force-pushed the igorp-watch-chunked-fix branch from eb3a130 to b674482 Compare February 26, 2018 20:06

qingling128 reviewed Feb 26, 2018

igorpeshansky added 2 commits February 26, 2018 18:11

Ignore the rest of the streaming input when an error is encountered.

c94ebcd

Identify watchers by the endpoint they're watching, to distinguish logs.

d62b48d

igorpeshansky commented Feb 26, 2018

qingling128 suggested changes Feb 27, 2018

Turn a return into an exception; minor tweaks.

ce35ca4
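
A rough sketch of what turning that return into an exception could look like is shown below. The `WatcherException` class follows the shape quoted in the review (a plain class with a `what()` accessor); `ChunkReader`, `ReadNextChunk`, and the byte-count interface are hypothetical simplifications, not the agent's real signatures.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>

// Same shape as the class quoted in the review.
class WatcherException {
 public:
  explicit WatcherException(const std::string& what) : explanation_(what) {}
  const std::string& what() const { return explanation_; }

 private:
  std::string explanation_;
};

// Hypothetical reader that tracks how many bytes of the current chunk are
// still expected.
class ChunkReader {
 public:
  ChunkReader(std::string name, std::size_t remaining_bytes)
      : name_(std::move(name)), remaining_bytes_(remaining_bytes) {}

  // Consumes up to `available` bytes of the current chunk and returns the
  // number consumed. Being asked to read with nothing remaining is a logic
  // error, so it throws instead of silently returning an empty result.
  std::size_t ReadNextChunk(std::size_t available) {
    if (remaining_bytes_ == 0) {
      throw WatcherException(
          name_ + " => Asked to read next chunk with no bytes remaining");
    }
    const std::size_t consumed = std::min(available, remaining_bytes_);
    remaining_bytes_ -= consumed;
    return consumed;
  }

 private:
  std::string name_;
  std::size_t remaining_bytes_;
};
```

Prefixing the message with the watcher's name keeps the error attributable to a specific endpoint when several watchers log concurrently.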

igorpeshansky commented Feb 27, 2018

qingling128 approved these changes Feb 27, 2018

igorpeshansky requested a review from supriyagarg February 27, 2018 17:26

bmoyles0117 suggested changes Feb 27, 2018

Log before throwing an exception.

44eadf7

igorpeshansky commented Feb 27, 2018

bmoyles0117 approved these changes Feb 27, 2018

exception_ -> exception_message_.

64d335c

igorpeshansky commented Feb 27, 2018

igorpeshansky merged commit 2bbad32 into master Feb 27, 2018

igorpeshansky deleted the igorp-watch-chunked-fix branch February 27, 2018 21:42
