Async task fixes and new async calculation of resource size by bjester · Pull Request #3028 · learningequality/studio

bjester · 2021-03-15T23:23:54Z

Summary

Description of the change(s) you made

Adds helper for caching resource size calculations, using Redis hashes
Adds modified field to File model to use in determining if cache is stale thereby avoiding unnecessary calculations and the overhead it brings
Adds dedicated function to calculate and cache resource size, along with corresponding async task
- It immediately calculates the size if the descendant count is less than 5000 nodes, otherwise returns the cached value if any, and a stale boolean
Updates content node size endpoint to queue an async calculation task if the size is stale, and to return that information and the task to the frontend for use in Publish modal
Updates Publish modal to wait for async calculation task to finish prior to moving forward
Refactors tasks using deprecated Celery functionality
Fixes application of Celery task options through config
Consolidates task progress tracking
Orders task endpoint response by created date (helpful for debugging, last task is latest)
Orders tasks in the Vuex map in reverse to ensure the most recent tasks are first (undocumented bug?)

Manual verification steps

Note: it may be useful to use the browser's network inspector to validate the size changes. Filter network requests by size and use the Preview tab of the request to inspect the response JSON which holds the size in bytes

Large channel scenario

Open an editable channel with at least 5000 nodes (alternatively, lower this value)
Add a file to the channel
Click Publish
If the channel is large enough, you should observe a loading spinner until the task completes
Verify you can't proceed until the task completes, loader disappears, and the size appears

Deleting a file scenario

Open an editable channel
Click Publish
Make note of the size returned from the /api/contentnode/size request
Add an additional file to an existing node (add subtitles to a video, or a EPUB for an existing PDF)
Allow your browser to finish syncing (wait ~30 seconds)
Click Publish
Verify the size returned from the /api/contentnode/size request is different than you noted originally

Moving a node scenario

Open an editable channel
Click Publish
Make note of the size returned from the /api/contentnode/size request
Delete a non-topic node (sending it to trash)
Allow your browser to finish syncing (wait ~30 seconds)
Click Publish
Verify the size returned from the /api/contentnode/size request is different than you noted originally

Screenshots (if applicable)

Publish waiting for size	Publish error (previously unseen?)

Reviewer guidance

How can a reviewer test these changes?

Publish a channel
Synchronize a channel's imported content
Copy a large topic
Move a large topic

Are there any risky areas that deserve extra testing?

Publishing

References

Addresses: #1523
Addresses: #2658
Addresses: #2953

Contributor's Checklist

PR process:

If any Python requirements have changed, the updated requirements.txt files also included in this PR
Migrations are safe for a large db

Testing:

Code is clean and well-commented
Contributor has fully tested the PR manually
If there are any front-end changes, before/after screenshots are included
Critical user journeys are covered by Gherkin stories
Any new interactions have been added to the QA Sheet
Critical and brittle code paths are covered by unit tests

Reviewer's Checklist

Automated test coverage is satisfactory
PR is fully functional
PR has been tested for accessibility regressions
External dependency files were updated if necessary (yarn and pip)
Documentation is updated
Contributor is in AUTHORS.md

codecov · 2021-03-22T17:22:43Z

Codecov Report

Merging #3028 (15a145a) into hotfixes (28587f8) will increase coverage by 5.13%.
The diff coverage is 92.87%.

@@             Coverage Diff              @@
##           hotfixes    #3028      +/-   ##
============================================
+ Coverage     80.80%   85.93%   +5.13%     
============================================
  Files           281      304      +23     
  Lines         12659    16403    +3744     
============================================
+ Hits          10229    14096    +3867     
+ Misses         2430     2307     -123

Impacted Files	Coverage Δ
contentcuration/contentcuration/decorators.py	`56.60% <50.00%> (-37.60%)`	⬇️
...ntentcuration/contentcuration/db/models/manager.py	`91.05% <90.83%> (-8.95%)`	⬇️
contentcuration/contentcuration/forms.py	`82.35% <94.04%> (+33.89%)`	⬆️
contentcuration/contentcuration/api.py	`92.06% <100.00%> (+1.43%)`	⬆️
contentcuration/contentcuration/apps.py	`100.00% <100.00%> (+11.11%)`	⬆️
contentcuration/contentcuration/celery.py	`90.00% <100.00%> (-1.67%)`	⬇️
...tentcuration/contentcuration/context_processors.py	`100.00% <100.00%> (ø)`
...ontentcuration/contentcuration/db/advisory_lock.py	`100.00% <100.00%> (ø)`
...tcuration/contentcuration/db/models/expressions.py	`95.23% <100.00%> (-4.77%)`	⬇️
...entcuration/contentcuration/db/models/functions.py	`100.00% <100.00%> (ø)`
... and 190 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update efaadf6...15a145a. Read the comment docs.

bjester · 2021-03-24T16:07:26Z

 CELERY_TASK_SERIALIZER = 'json'
 CELERY_RESULT_SERIALIZER = 'json'
-# This is needed for worker update_state calls to work so they can send progress info.
-CELERYD_STATE_DB = '/tmp/celery_state'


As far as I know, we're not using the Celery daemon, so this wasn't being used.

bjester · 2021-03-24T16:08:51Z

 CELERY_TASK_ALWAYS_EAGER = False
-# This tells Celery to mark a task as started. Otherwise, we would have no way of tracking
-# if the task is running.
-CELERY_TASK_TRACK_STARTED = True


Moved this to the CeleryTask class, so it's closer to the task definition. We likely don't want to change this value anyway.

bjester · 2021-03-24T16:10:40Z



-@task(bind=True, name="duplicate_nodes_task")
+@app.task(bind=True, name="duplicate_nodes_task", track_progress=True)


Task boolean for tracking progress is now configurable at the task definition.

bjester · 2021-03-24T16:13:34Z

+    "export-channel": export_channel_task,
+    "sync-channel": sync_channel_task,
+    "get-node-diff": generatenodediff_task,
+    "calculate-resource-size": calculate_resource_size_task,


We could get rid of this in the future, if we align the task_name sent to create_async_task() with the names of the tasks we define for Celery. Right now, they're slightly different

Yeah, I always found this mapping odd, when Celery explicitly has you define a name for the task.

bjester · 2021-03-24T16:14:39Z

+        self.assertEqual(data["status"], "STARTED")
+        self.assertEqual(data["task_type"], "YOUTUBE_IMPORT")
+        self.assertIsNotNone(data["channel"])
+        self.assertEqual(data["channel"].hex, self.channel.id)


I don't exactly understand why I needed .hex for the value coming from the task api response

DRF test tools take a shortcut and don't actually serialize and deserialize, instead passing the raw Python objects out as the response data. I really hate it. So because something was parsed as a UUID, it comes out as a UUID here.

bjester · 2021-03-24T16:15:07Z

    try:
        task = Task.objects.get(task_id=task_id)
-        task.status = "PENDING"
+        task.status = states.PENDING


Updated all states to use the Celery constants.

bjester · 2021-03-24T16:16:14Z

+        Ensures status is updated after task completion, otherwise our signals may not fire
+        """
+        # we assume that if the state is past starting, that this has been handled
+        if states.state(status) > states.state(states.STARTED):


Celery has some nice utils for state handling.

rtibbles

A few questions - but I don't think there's anything blocking.

rtibbles · 2021-03-26T00:15:10Z

    source_url = models.CharField(max_length=400, blank=True, null=True)
    uploaded_by = models.ForeignKey(User, related_name='files', blank=True, null=True, on_delete=models.SET_NULL)

+    modified = models.DateTimeField(auto_now=True, verbose_name="modified", null=True)


This should generally work fine - as new files are created one at a time, and so will have the modified set by auto_now. File updating is done in bulk, so as far as I can tell this will not update auto_now (I think this may be affecting our updates of ContentNodes too - so I might need to file a separate issue for that).

The only thing this might miss is file deletion or node deletion, where node deletion will normally be a move to the trash tree. Wonder if we might have to look at node modified as well?

Ah good catch, I definitely overlooked the deletion case. I'm thinking that in the case of node/file deletion/move, we could set the cache modified date to the modified date of the earliest date of the associated files prior to the change, if that time is less than the existing cache modified. So if that deleted item was the most recent, and it was included in calculation previously, that date should still be prior to the cache modified. Otherwise, it would almost always be the case, so triggering an update of the cache. Do you think that can work? Am I overlooking anything in that strategy?

Re file updating: yeah that could definitely be an issue with the modified field specifically. It seemed to me that there weren't many cases where a file would be updated in a way that would cause its resource size to change. But rethinking it now, if the file is removed from the content node, surely we would need to trigger change the cache modified time.

Also, the existing calculation didn't include assessment items.... that should be included too, correct?

After a quick slack convo - I think I know what this means now!

So yes, let's update the cache modified when files are deleted, based on the deleted file's modified date, and take the min of the last modified date for all files when we move a content node.

I think we can skip assessment items for now, to avoid making this PR more unmanageable, but we should be considering them, ideally!

rtibbles · 2021-03-26T00:17:40Z

 class TaskViewSet(ReadOnlyValuesViewset, DestroyModelMixin):
-    queryset = Task.objects.all()
+    order_by = 'created'
+    queryset = Task.objects.order_by(order_by)


Could also just add it as the default ordering on the Task model?

rtibbles · 2021-03-26T00:19:10Z

        )

-    @detail_route(methods=["get"])
+    @action(detail=True, methods=["get"])


Just trying to give me merge conflicts in my Django 3 PR!

rtibbles · 2021-03-26T00:21:52Z

+        size, stale = calculate_resource_size(node=node, force=False)
+        if stale:
+            task_args = dict(node_id=node.pk, channel_id=node.channel_id)
+            task_info = get_or_create_async_task(


Because it took me all afternoon to realize it - might be good to add a comment here that get_or_create_async_task is being used here to prevent a cache stampede.

- Update resource size cache modified date when moving nodes or deleting files - Relcoate ResourceSizeCache to avoid import issues - Only allow size calculations for root nodes through the API

rtibbles

New code changes make sense to me. I'll try to take it for a spin this afternoon for manual sanity checking!

rtibbles · 2021-03-26T23:38:26Z

Manual testing generally worked out, but in this scenario:

Add new file to existing video.
Check size.
See size has updated.
Remove new file from video.
Check size.

I found the size did not update again after step 5, but then did properly update when I moved the node in question to trash.

rtibbles

File deletion now resetting cache as expected!

pcenov · 2021-04-05T13:59:51Z

@bjester today I tested this PR at https://hotfixes.studio.learningequality.org/. Here's a summary of my observations:

Large channel scenario: This scenario fails, I'm testing with a channel with 5000+ resources and I'm always getting a message about an unexpected error (seems that the size request timeouts). See the following screenshot:

By the way I didn't see a loading spinner as described in the scenario do you mean the progress bar?

Deleting a file scenario or Moving a node scenario: I tested these with a very small channel but I'm still getting the /api/get_total_size timeout as described in #1523. See the following screenshot:

Here's what I get when testing with another user (cenov1@melon.bg) where I also see a 400 error for api/sync:

@radinamatic

radinamatic · 2021-04-05T15:08:45Z

On my side I eventually managed to publish a large channel: despite occasional 524 and 504 errors in the console and the network tab, nothing was observable in the user interface, and channel (bulah-saroh) published successfully (although it did take more than an hour to finish).

I also did not observe the loading spinner while publishing, does it appear only if there are incomplete resources?

bjester · 2021-04-05T15:11:14Z

@pcenov @radinamatic Very interesting, since 5000+ resources should queue the calculation task asynchronously, I wonder if there's a larger issue going on with hotfixes right now. The loading spinner should occur in the modal before the actual publish process, so no the progress bar and loading spinner are not the same. Definitely seems as though something isn't quite right with hotfixes.

radinamatic · 2021-04-05T15:49:35Z

Ok, now I do see it, although not sure how helpful it is: it appears only while the user is on the first step of the modal, writing the version comment/note, before the publishing process has even started. What exactly it is supposed to communicate to the user, compared to the progress bar when the publishing task is active? 🤔

rtibbles · 2021-04-05T15:58:15Z

It's literally just there to show that we are still loading the total size for the channel.

radinamatic · 2021-04-05T16:27:39Z

In that case we might consider adding a string Loading total channel size... at some point in the future, alongside that spinner 🙁

bjester · 2021-04-05T16:57:53Z

@radinamatic Yes, sounds good. I believe @MisRob has a bunch of updates to the publish modals in the develop branch, so can follow up there. Also, the spinner was existing for loading the channel metadata, so the net UX change should still be an improvement.

bjester added 11 commits March 5, 2021 13:15

Add modified field to File model, with index

36b4f19

Add resource size calculator utils to manage lookups, with caching

32e36db

Fix column name and datetime handling

fdaa781

Integrate new resource size calculator

ff7f204

Add logging and comments

e1b1151

Cleanup

5b33889

Tests for resource calculation

7c965c4

Updated task handling in viewset

1142866

Publish task handling

78b7c1b

Cleanup and fixes after updating task handling

26a5369

Don't overload __call__, breaks tests

c33b0a6

bjester changed the title ~~Resource size caching and async calculation~~ Async task fixes and new async calculation of resource size Mar 19, 2021

bjester added 3 commits March 22, 2021 09:58

Linting and test fixes

c0dc4e0

Fix merge conflicts

f6039df

Yet more linting fixes...

477b8df

Fix publish test

5d9b2c8

This was linked to issues Mar 22, 2021

Address /api/get_total_size timeouts #1523

Closed

Async task completes but status remains pending #2658

Closed

bjester added this to the Post Release Stabilization milestone Mar 22, 2021

bjester added the python Pull requests that update Python code label Mar 22, 2021

rtibbles reviewed Mar 23, 2021

View reviewed changes

Comment thread contentcuration/contentcuration/tasks.py Outdated

Remove unneccesary metadata

d74e05f

bjester commented Mar 24, 2021

View reviewed changes

bjester added DEV: backend TAG: dependencies TODO: needs review and removed python Pull requests that update Python code labels Mar 24, 2021

Update tests to accomodate for new task API and classes

cb2a0ed

bjester requested review from marcellamaki and rtibbles March 24, 2021 21:17

bjester marked this pull request as ready for review March 24, 2021 21:17

bjester requested a review from jredrejo March 24, 2021 21:52

rtibbles reviewed Mar 26, 2021

View reviewed changes

bjester added 2 commits March 26, 2021 11:23

Code review updates

6dbbd5d

- Update resource size cache modified date when moving nodes or deleting files - Relcoate ResourceSizeCache to avoid import issues - Only allow size calculations for root nodes through the API

Fix merge conflicts

e6eb127

rtibbles reviewed Mar 26, 2021

View reviewed changes

bjester modified the milestones: Post Release Stabilization, Sprint 2021-03-29 Mar 26, 2021

bjester mentioned this pull request Mar 26, 2021

[OLD] Release PR for Sprint ending 2021-03-29 #3044

Closed

Fix incorrect cache invalidation strategy

15a145a

bjester requested a review from rtibbles March 29, 2021 15:44

bjester linked an issue Mar 29, 2021 that may be closed by this pull request

Sometimes I receive null progress value in a task metadata #2953

Closed

rtibbles approved these changes Mar 29, 2021

View reviewed changes

rtibbles merged commit 3c2749b into learningequality:hotfixes Mar 29, 2021

bjester mentioned this pull request Mar 30, 2021

Release PR for Sprint ending 2021-03-29 #3062

Merged



		@task(bind=True, name="duplicate_nodes_task")
		@app.task(bind=True, name="duplicate_nodes_task", track_progress=True)

Uh oh!

Conversation

bjester commented Mar 15, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Description of the change(s) you made

Manual verification steps

Large channel scenario

Deleting a file scenario

Moving a node scenario

Screenshots (if applicable)

Reviewer guidance

How can a reviewer test these changes?

Are there any risky areas that deserve extra testing?

References

Contributor's Checklist

Reviewer's Checklist

Uh oh!

codecov Bot commented Mar 22, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rtibbles left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rtibbles left a comment

Choose a reason for hiding this comment

Uh oh!

rtibbles commented Mar 26, 2021

Uh oh!

rtibbles left a comment

Choose a reason for hiding this comment

Uh oh!

pcenov commented Apr 5, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

radinamatic commented Apr 5, 2021

Uh oh!

bjester commented Apr 5, 2021

Uh oh!

radinamatic commented Apr 5, 2021

Uh oh!

rtibbles commented Apr 5, 2021

Uh oh!

radinamatic commented Apr 5, 2021

Uh oh!

bjester commented Apr 5, 2021

bjester commented Mar 15, 2021 •

edited

Loading

codecov Bot commented Mar 22, 2021 •

edited

Loading

pcenov commented Apr 5, 2021 •

edited

Loading