Skip to content

Async task fixes and new async calculation of resource size#3028

Merged
rtibbles merged 20 commits into
learningequality:hotfixesfrom
bjester:get-total-size
Mar 29, 2021
Merged

Async task fixes and new async calculation of resource size#3028
rtibbles merged 20 commits into
learningequality:hotfixesfrom
bjester:get-total-size

Conversation

@bjester
Copy link
Copy Markdown
Member

@bjester bjester commented Mar 15, 2021

Summary

Description of the change(s) you made

  • Adds helper for caching resource size calculations, using Redis hashes
  • Adds modified field to File model to use in determining if cache is stale thereby avoiding unnecessary calculations and the overhead it brings
  • Adds dedicated function to calculate and cache resource size, along with corresponding async task
    • It immediately calculates the size if the descendant count is less than 5000 nodes, otherwise returns the cached value if any, and a stale boolean
  • Updates content node size endpoint to queue an async calculation task if the size is stale, and to return that information and the task to the frontend for use in Publish modal
  • Updates Publish modal to wait for async calculation task to finish prior to moving forward
  • Refactors tasks using deprecated Celery functionality
  • Fixes application of Celery task options through config
  • Consolidates task progress tracking
  • Orders task endpoint response by created date (helpful for debugging, last task is latest)
  • Orders tasks in the Vuex map in reverse to ensure the most recent tasks are first (undocumented bug?)

Manual verification steps

Note: it may be useful to use the browser's network inspector to validate the size changes. Filter network requests by size and use the Preview tab of the request to inspect the response JSON which holds the size in bytes

Large channel scenario

  1. Open an editable channel with at least 5000 nodes (alternatively, lower this value)
  2. Add a file to the channel
  3. Click Publish
  4. If the channel is large enough, you should observe a loading spinner until the task completes
  5. Verify you can't proceed until the task completes, loader disappears, and the size appears

Deleting a file scenario

  1. Open an editable channel
  2. Click Publish
  3. Make note of the size returned from the /api/contentnode/size request
  4. Add an additional file to an existing node (add subtitles to a video, or a EPUB for an existing PDF)
  5. Allow your browser to finish syncing (wait ~30 seconds)
  6. Click Publish
  7. Verify the size returned from the /api/contentnode/size request is different than you noted originally

Moving a node scenario

  1. Open an editable channel
  2. Click Publish
  3. Make note of the size returned from the /api/contentnode/size request
  4. Delete a non-topic node (sending it to trash)
  5. Allow your browser to finish syncing (wait ~30 seconds)
  6. Click Publish
  7. Verify the size returned from the /api/contentnode/size request is different than you noted originally

Screenshots (if applicable)

Publish waiting for size Publish error (previously unseen?)
publish_waiting_size publish_error

Reviewer guidance

How can a reviewer test these changes?

  • Publish a channel
  • Synchronize a channel's imported content
  • Copy a large topic
  • Move a large topic

Are there any risky areas that deserve extra testing?

  • Publishing

References

Addresses: #1523
Addresses: #2658
Addresses: #2953

Contributor's Checklist

PR process:

  • If any Python requirements have changed, the updated requirements.txt files also included in this PR
  • Migrations are safe for a large db

Testing:

  • Code is clean and well-commented
  • Contributor has fully tested the PR manually
  • If there are any front-end changes, before/after screenshots are included
  • Critical user journeys are covered by Gherkin stories
  • Any new interactions have been added to the QA Sheet
  • Critical and brittle code paths are covered by unit tests

Reviewer's Checklist

  • Automated test coverage is satisfactory
  • PR is fully functional
  • PR has been tested for accessibility regressions
  • External dependency files were updated if necessary (yarn and pip)
  • Documentation is updated
  • Contributor is in AUTHORS.md

@bjester bjester changed the title Resource size caching and async calculation Async task fixes and new async calculation of resource size Mar 19, 2021
@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 22, 2021

Codecov Report

Merging #3028 (15a145a) into hotfixes (28587f8) will increase coverage by 5.13%.
The diff coverage is 92.87%.

Impacted file tree graph

@@             Coverage Diff              @@
##           hotfixes    #3028      +/-   ##
============================================
+ Coverage     80.80%   85.93%   +5.13%     
============================================
  Files           281      304      +23     
  Lines         12659    16403    +3744     
============================================
+ Hits          10229    14096    +3867     
+ Misses         2430     2307     -123     
Impacted Files Coverage Δ
contentcuration/contentcuration/decorators.py 56.60% <50.00%> (-37.60%) ⬇️
...ntentcuration/contentcuration/db/models/manager.py 91.05% <90.83%> (-8.95%) ⬇️
contentcuration/contentcuration/forms.py 82.35% <94.04%> (+33.89%) ⬆️
contentcuration/contentcuration/api.py 92.06% <100.00%> (+1.43%) ⬆️
contentcuration/contentcuration/apps.py 100.00% <100.00%> (+11.11%) ⬆️
contentcuration/contentcuration/celery.py 90.00% <100.00%> (-1.67%) ⬇️
...tentcuration/contentcuration/context_processors.py 100.00% <100.00%> (ø)
...ontentcuration/contentcuration/db/advisory_lock.py 100.00% <100.00%> (ø)
...tcuration/contentcuration/db/models/expressions.py 95.23% <100.00%> (-4.77%) ⬇️
...entcuration/contentcuration/db/models/functions.py 100.00% <100.00%> (ø)
... and 190 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update efaadf6...15a145a. Read the comment docs.

@bjester bjester added this to the Post Release Stabilization milestone Mar 22, 2021
@bjester bjester added the python Pull requests that update Python code label Mar 22, 2021
Comment thread contentcuration/contentcuration/tasks.py Outdated
CELERY_TASK_SERIALIZER = 'json'
CELERY_RESULT_SERIALIZER = 'json'
# This is needed for worker update_state calls to work so they can send progress info.
CELERYD_STATE_DB = '/tmp/celery_state'
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I know, we're not using the Celery daemon, so this wasn't being used.

CELERY_TASK_ALWAYS_EAGER = False
# This tells Celery to mark a task as started. Otherwise, we would have no way of tracking
# if the task is running.
CELERY_TASK_TRACK_STARTED = True
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this to the CeleryTask class, so it's closer to the task definition. We likely don't want to change this value anyway.



@task(bind=True, name="duplicate_nodes_task")
@app.task(bind=True, name="duplicate_nodes_task", track_progress=True)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Task boolean for tracking progress is now configurable at the task definition.

"export-channel": export_channel_task,
"sync-channel": sync_channel_task,
"get-node-diff": generatenodediff_task,
"calculate-resource-size": calculate_resource_size_task,
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could get rid of this in the future, if we align the task_name sent to create_async_task() with the names of the tasks we define for Celery. Right now, they're slightly different

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I always found this mapping odd, when Celery explicitly has you define a name for the task.

self.assertEqual(data["status"], "STARTED")
self.assertEqual(data["task_type"], "YOUTUBE_IMPORT")
self.assertIsNotNone(data["channel"])
self.assertEqual(data["channel"].hex, self.channel.id)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't exactly understand why I needed .hex for the value coming from the task api response

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DRF test tools take a shortcut and don't actually serialize and deserialize, instead passing the raw Python objects out as the response data. I really hate it. So because something was parsed as a UUID, it comes out as a UUID here.

try:
task = Task.objects.get(task_id=task_id)
task.status = "PENDING"
task.status = states.PENDING
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated all states to use the Celery constants.

Ensures status is updated after task completion, otherwise our signals may not fire
"""
# we assume that if the state is past starting, that this has been handled
if states.state(status) > states.state(states.STARTED):
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Celery has some nice utils for state handling.

@bjester bjester added DEV: backend TAG: dependencies TODO: needs review and removed python Pull requests that update Python code labels Mar 24, 2021
@bjester bjester marked this pull request as ready for review March 24, 2021 21:17
@bjester bjester requested a review from jredrejo March 24, 2021 21:52
Copy link
Copy Markdown
Member

@rtibbles rtibbles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions - but I don't think there's anything blocking.

source_url = models.CharField(max_length=400, blank=True, null=True)
uploaded_by = models.ForeignKey(User, related_name='files', blank=True, null=True, on_delete=models.SET_NULL)

modified = models.DateTimeField(auto_now=True, verbose_name="modified", null=True)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should generally work fine - as new files are created one at a time, and so will have the modified set by auto_now. File updating is done in bulk, so as far as I can tell this will not update auto_now (I think this may be affecting our updates of ContentNodes too - so I might need to file a separate issue for that).

The only thing this might miss is file deletion or node deletion, where node deletion will normally be a move to the trash tree. Wonder if we might have to look at node modified as well?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah good catch, I definitely overlooked the deletion case. I'm thinking that in the case of node/file deletion/move, we could set the cache modified date to the modified date of the earliest date of the associated files prior to the change, if that time is less than the existing cache modified. So if that deleted item was the most recent, and it was included in calculation previously, that date should still be prior to the cache modified. Otherwise, it would almost always be the case, so triggering an update of the cache. Do you think that can work? Am I overlooking anything in that strategy?

Re file updating: yeah that could definitely be an issue with the modified field specifically. It seemed to me that there weren't many cases where a file would be updated in a way that would cause its resource size to change. But rethinking it now, if the file is removed from the content node, surely we would need to trigger change the cache modified time.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the existing calculation didn't include assessment items.... that should be included too, correct?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a quick slack convo - I think I know what this means now!

So yes, let's update the cache modified when files are deleted, based on the deleted file's modified date, and take the min of the last modified date for all files when we move a content node.

I think we can skip assessment items for now, to avoid making this PR more unmanageable, but we should be considering them, ideally!

class TaskViewSet(ReadOnlyValuesViewset, DestroyModelMixin):
queryset = Task.objects.all()
order_by = 'created'
queryset = Task.objects.order_by(order_by)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could also just add it as the default ordering on the Task model?

)

@detail_route(methods=["get"])
@action(detail=True, methods=["get"])
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just trying to give me merge conflicts in my Django 3 PR!

size, stale = calculate_resource_size(node=node, force=False)
if stale:
task_args = dict(node_id=node.pk, channel_id=node.channel_id)
task_info = get_or_create_async_task(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it took me all afternoon to realize it - might be good to add a comment here that get_or_create_async_task is being used here to prevent a cache stampede.

bjester added 2 commits March 26, 2021 11:23
- Update resource size cache modified date when moving nodes or deleting files
- Relcoate ResourceSizeCache to avoid import issues
- Only allow size calculations for root nodes through the API
Copy link
Copy Markdown
Member

@rtibbles rtibbles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New code changes make sense to me. I'll try to take it for a spin this afternoon for manual sanity checking!

@rtibbles
Copy link
Copy Markdown
Member

Manual testing generally worked out, but in this scenario:

  1. Add new file to existing video.
  2. Check size.
  3. See size has updated.
  4. Remove new file from video.
  5. Check size.

I found the size did not update again after step 5, but then did properly update when I moved the node in question to trash.

@bjester bjester requested a review from rtibbles March 29, 2021 15:44
@bjester bjester linked an issue Mar 29, 2021 that may be closed by this pull request
Copy link
Copy Markdown
Member

@rtibbles rtibbles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

File deletion now resetting cache as expected!

@rtibbles rtibbles merged commit 3c2749b into learningequality:hotfixes Mar 29, 2021
@pcenov
Copy link
Copy Markdown
Member

pcenov commented Apr 5, 2021

@bjester today I tested this PR at https://hotfixes.studio.learningequality.org/. Here's a summary of my observations:

Large channel scenario: This scenario fails, I'm testing with a channel with 5000+ resources and I'm always getting a message about an unexpected error (seems that the size request timeouts). See the following screenshot:
2021-04-05_16-54-10

By the way I didn't see a loading spinner as described in the scenario do you mean the progress bar?

Deleting a file scenario or Moving a node scenario: I tested these with a very small channel but I'm still getting the /api/get_total_size timeout as described in #1523. See the following screenshot:
2021-04-05_16-36-42

Here's what I get when testing with another user (cenov1@melon.bg) where I also see a 400 error for api/sync:

2021-04-05_15-43-43
@radinamatic

@radinamatic
Copy link
Copy Markdown
Member

On my side I eventually managed to publish a large channel: despite occasional 524 and 504 errors in the console and the network tab, nothing was observable in the user interface, and channel (bulah-saroh) published successfully (although it did take more than an hour to finish).

MCAT - erthsrtnsrtn - Kolibri Studio - Google Chrome_007

I also did not observe the loading spinner while publishing, does it appear only if there are incomplete resources?

@bjester
Copy link
Copy Markdown
Member Author

bjester commented Apr 5, 2021

@pcenov @radinamatic Very interesting, since 5000+ resources should queue the calculation task asynchronously, I wonder if there's a larger issue going on with hotfixes right now. The loading spinner should occur in the modal before the actual publish process, so no the progress bar and loading spinner are not the same. Definitely seems as though something isn't quite right with hotfixes.

@radinamatic
Copy link
Copy Markdown
Member

Ok, now I do see it, although not sure how helpful it is: it appears only while the user is on the first step of the modal, writing the version comment/note, before the publishing process has even started. What exactly it is supposed to communicate to the user, compared to the progress bar when the publishing task is active? 🤔

@rtibbles
Copy link
Copy Markdown
Member

rtibbles commented Apr 5, 2021

It's literally just there to show that we are still loading the total size for the channel.

@radinamatic
Copy link
Copy Markdown
Member

In that case we might consider adding a string Loading total channel size... at some point in the future, alongside that spinner 🙁

@bjester
Copy link
Copy Markdown
Member Author

bjester commented Apr 5, 2021

@radinamatic Yes, sounds good. I believe @MisRob has a bunch of updates to the publish modals in the develop branch, so can follow up there. Also, the spinner was existing for loading the channel metadata, so the net UX change should still be an improvement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Sometimes I receive null progress value in a task metadata Async task completes but status remains pending Address /api/get_total_size timeouts

4 participants