Skip to content

Track completed triggers until deleted from database#19546

Closed
dstandish wants to merge 4 commits into
apache:mainfrom
dstandish:trigger-multi-create-fix
Closed

Track completed triggers until deleted from database#19546
dstandish wants to merge 4 commits into
apache:mainfrom
dstandish:trigger-multi-create-fix

Conversation

@dstandish

@dstandish dstandish commented Nov 11, 2021

Copy link
Copy Markdown
Contributor

Commonly a trigger will complete and be deleted from the TriggerRunner.triggers dictionary attr before the the TriggerJob has had a chance to submit_event. So TriggerRunner thinks it needs to create the trigger [again], which it does.

One way to resolve this is to have TriggerJob mark is_submitted after successful complletion of submit_event, and then have the cleanup process wait for this. But I'm not sure if perhaps we will need to implement monitoring / timeouts if for some reason the event is never submitted; in that case we'd want to ensure that the trigger is re-run.

Should fix #18392

If you think this approach looks OK I'll proceed with tests.

@boring-cyborg boring-cyborg Bot added the area:Scheduler including HA (high availability) scheduler label Nov 11, 2021
@dstandish dstandish changed the title Wait for task creation before deleting trigger Wait for event submission before deleting trigger Nov 11, 2021
@dstandish dstandish requested a review from kaxil November 12, 2021 18:05
@dstandish dstandish marked this pull request as ready for review November 12, 2021 18:05

@andrewgodwin andrewgodwin left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should work - but not entirely sure if it will be watertight, as what if the trigger is re-assigned to the subthread from the DB after it's deleted? Maybe it would be better to keep a separate datastructure of "recently completed trigger IDs" and use that to stop launching duplicates, rather than tying it to the triggers dict entry itself?

@dstandish dstandish force-pushed the trigger-multi-create-fix branch from a8505d6 to 73f216d Compare November 13, 2021 02:40
@dstandish dstandish changed the title Wait for event submission before deleting trigger Track completed triggers until deleted from database Nov 13, 2021
self.runner.update_triggers(set(ids))

@provide_session
def purge_completed_triggers_list(self, session):

@dstandish dstandish Nov 13, 2021

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be nice if we didn't have to query the database here but i don't think it can be avoided.

i explored updating clean_unused to return the ids deleted. then we could pass those to purge_completed and purge those rows. but this won't work because clean_unused does not filter w.r.t. triggerer_id, so other triggerer instances could delete the rows belonging to this triggerer, so completed_triggers would grow over time.

on the bright side, we're querying a PK, so it shouldn't be that expensive.

alternatively we could change clean_unused to take a triggerer id and delete only the rows for that triggerer, then the above approach would probably work.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def purge_completed_triggers_list(self, session):
def purge_completed_triggers(self, session):

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the reason i put _list there iss because without it, it could signify "purge complleted triggers from the database"

but it's only purging from the instance attribribute completed_triggers. the actual deletion of triggers from the database is done in TriggerJob.clean_unused

what do you think?

details["name"],
)
self.failed_triggers.append(trigger_id)
self.completed_triggers.add(trigger_id)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears that completed_triggers will grow in an unbounded fashion with this code over time - I would suggest a datastructure that is self-limiting. My initial idea is a dict with the IDs as keys and the datetime they were inserted as values, and then just remove all keys whose values are more than 5 minutes old in the first part of purge_completed_triggers.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do you envision it growing? the items should only be in completed_triggers for basically one or two loop cycles. it goes submit_event (into completed) -> Trigger.clean_unused (purged from db) -> purge_completed_triggers_list (purged from completed)

but maybe you're saying that in the wild the code may not behave as intended, and triggers will accumulate in the DB, and therefore they'll accumulate in the completed_triggers set? but if they accumulate in the db they'll keep getting recreated and eventually that would be an issue in itself because not only would they exist as ids in a set but they would be running.

using a TTL approach as you have suggested would avoid a db query though.

LMK your thoughts.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andrewgodwin gentle nudge here. If you want to go with time-based expiration I have no objection and am happy to rework this. I do recognize that my solution adds a query but if you don't mind explaining I'm having trouble seeingthe unbound growth.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I just don't quite trust difference_update quite enough, I missed it in my first read-through for "what in here cleans out the datastructure?"

If you're happy there's no edge cases where it can leak entries and write a test to prove so, I think my objection here is void.

@kaxil kaxil self-assigned this Dec 22, 2021

@kaxil kaxil left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approach lgtm, please add tests

@dstandish dstandish marked this pull request as draft January 5, 2022 23:07
@dstandish

Copy link
Copy Markdown
Contributor Author

superceded by #20699

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:Scheduler including HA (high availability) scheduler

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TriggerEvent fires, and then defers a second time (doesn't fire a second time though).

3 participants