Automatic course reindex after course publishing

Evgen_Dyudyunov · February 13, 2023, 3:55pm

The issue

When making changes on the Course Schedule and Details page in the CMS and saving them - the automatic reindexing task is fired, but the outdated info is stored in the index.

Prerequisites

course indexing is enabled (FEATURES['ENABLE_COURSEWARE_INDEX'] = True)
course search is enabled (FEATURES['ENABLE_COURSE_DISCOVERY'] = True)
celery is configured

Example (i.e. Steps to Reproduce)

Go to the Studio
Navigate to any course and go to the Schedule and Details section

Screenshot 2023-02-13 at 17.02.471802×1028 88.9 KB
scroll down to the Course Card Image block and upload a new image, click “Save Changes”

Screenshot 2023-02-13 at 17.05.051920×792 71.3 KB
go to <LMS>/courses/<course_id>/about and confirm your image was updated (search index is not used here)
repeat steps 3 and 4, because in some cases (e.g. workers just restarted) the issue is not reproducible on the first try. Here is my newly uploaded course image:

Screenshot 2023-02-13 at 17.34.251896×824 136 KB

and the result on the course about page:

Screenshot 2023-02-13 at 17.35.042138×1060 173 KB
now go to <LMS>/courses and locate your test course. Check the course card image (and the response data for the course_discovery/ request)

Screenshot 2023-02-13 at 17.20.151920×1311 118 KB

Actual Result

the previous (outdated) image is shown on the /courses page, but it’s up to date on all the pages that use the data from the DB, not from the index (like on the course about page)

Releases:

Nutmeg
Olive

Investigation Notes

The task looks ok, but the info it receives from the modulestore is outdated (e.g. it receives the old image path with the course_image_url(course) call) for some reason. The triggered at time in the logs is correct, the queue looks correct too.
reloading of the workers fixes the issue for some time (at least one correct update)
replacing <task>.delay with task.apply_async to apply the countdown (I’ve chosen a 5-sec countdown randomly) fixes the issue permanently:

# update_search_index.delay(course_key_str, datetime.now(UTC).isoformat())
update_search_index.apply_async((course_key_str, datetime.now(UTC).isoformat()), countdown=5)

the issue isn’t reproducible if to call the task as a regular function (without .delay or .apply_async)
Using time.sleep and trying to get the course object with the delay (I’ve tried up to 15sec) has no effect - the course image is always outdated
the issue isn’t reproduced if using debug (for the CourseDetails.update_from_json method) and walking the code step-by-step. The course_published signal is fired twice at some moment (didn’t find the reason yet)
I didn’t find any caching issues according to getting the course data from mongo for the index update

Does anyone have any clues about the issue?

chintan · February 14, 2023, 6:17am

I am not sure about the newer releases, but in older releases, that and about page were cached if user was not logged in.

Also as you mentioned the re-index was not happening properly.

What I did to work around is exactly the same thing but with cron jobs instead of a task.

Evgen_Dyudyunov · February 14, 2023, 7:30am

Hello @chintan

Thanks for the response! Yeah, I’m aware of the @cache_if_anonymous() but I had my user logged in.

Anyway, its good to know it is not my exclusive issue))

What release you were working with?

chintan · February 14, 2023, 8:00am

I did the cron in KOA and in Nutmeg, haven’t added them or tested them yet with Olive.

regis · February 14, 2023, 9:18am

I did not know about this issue, but it very much looks like a race condition is happening. The following are very strong clues:

Race conditions typically don’t occur 100% of the time.

Running the task is run with a 5s delay fixes the race condition.

The race condition seems to come from the asynchronous worker.

sleep does not fix the race condition because of the GIL. (I think?)

Pausing causes a delay, which fixes the race condition.

ghassan · February 14, 2023, 12:23pm

I think the index is probably stored in Redis or/and ES, to ensure it’s from Redis side you can try to reproduce the probelm and then flush everything in Redis.

I think it’s edx-search repo that is responsible for the content /courses page The default case in tutor.

Evgen_Dyudyunov · February 14, 2023, 1:13pm

@ghassan yes, the index is stored in the ES

I think it’s edx-search repo that is responsible for the content /courses page

This is also true ^ (if FEATURES['ENABLE_COURSE_DISCOVERY'] = True)

From what I’ve got from the logs/debugging I see that the problem is that the wrong data is sent to the ES for storing. The task responsible for that sending receives outdated course data from mongo.

I’m wondering now how I could confirm or refute @regis clues about the race condition

ghassan · February 14, 2023, 1:27pm

My 2 cent, is that the task to update to ES index is triggerd before the content is updated _i.e. there are two async tasks of which one depends on the other, so it might be that those tasks are running in parallel, instead in a chain or pipeline _ I am just speculating here

Evgen_Dyudyunov · February 14, 2023, 1:33pm

I excluded this because the reindex task is fired on the course_published signal which is sent after the actual course update in the mongo

ghassan · February 14, 2023, 1:41pm

Aha I am not really that familiar about the sequence of events. I think @dave might have something to say about this

dave · February 27, 2023, 5:41pm

It might be a race condition as @regis indicated, where workers aren’t seeing the data that’s modified by the web view. This is especially possible if it’s reading from CourseOverviews.

Another race-condition-adjacent possibility that might be worth looking into is whether or not there is clock drift between the server running the web frontend, the server running the celery workers, and possibly MongoDB (I don’t remember exactly how the modified timestamp is created). The current timestamp is getting passed to the indexing function, which implies that it’s using the timestamp to narrow down the range of content changes it has to index. If the clocks are different, the workers may be looking at the wrong time window and not see the change.

dave · February 27, 2023, 5:51pm

Another potential race condition can happen if MongoDB is configured to read from secondaries–that’s usually the best thing to do for spreading load, but there could be propagation delay that prevents the workers from seeing it. It’s possible you’re not seeing that effect when everything is run in process because Modulestore’s internal caching is preventing it from re-fetching the data.

Topic		Replies	Views
Course disappear for learner after some time? From Feature Course Site Operators	1	418	July 23, 2021
Course delete but still appear in LMS site Site Operations Help	8	712	September 17, 2022
Course Card Image does not update in `Courses` Educators	1	208	December 5, 2023
Problem refreshing content and creating course in Quince Site Operations Help how-to , lms , quince	3	211	June 25, 2024
Elastic search index (reindex all course) command error Site Operations Help	1	468	September 15, 2022