Automatic course reindex after course publishing

The issue

When making changes on the Course Schedule and Details page in the CMS and saving them - the automatic reindexing task is fired, but the outdated info is stored in the index.

Prerequisites

  • course indexing is enabled (FEATURES['ENABLE_COURSEWARE_INDEX'] = True)
  • course search is enabled (FEATURES['ENABLE_COURSE_DISCOVERY'] = True)
  • celery is configured

Example (i.e. Steps to Reproduce)

  1. Go to the Studio
  2. Navigate to any course and go to the Schedule and Details section
  3. scroll down to the Course Card Image block and upload a new image, click “Save Changes”
  4. go to <LMS>/courses/<course_id>/about and confirm your image was updated (search index is not used here)
  5. repeat steps 3 and 4, because in some cases (e.g. workers just restarted) the issue is not reproducible on the first try. Here is my newly uploaded course image:

    and the result on the course about page:
  6. now go to <LMS>/courses and locate your test course. Check the course card image (and the response data for the course_discovery/ request)

Actual Result

  • the previous (outdated) image is shown on the /courses page, but it’s up to date on all the pages that use the data from the DB, not from the index (like on the course about page)

Releases:

  • Nutmeg
  • Olive

Investigation Notes

  • The task looks ok, but the info it receives from the modulestore is outdated (e.g. it receives the old image path with the course_image_url(course) call) for some reason. The triggered at time in the logs is correct, the queue looks correct too.
  • reloading of the workers fixes the issue for some time (at least one correct update)
  • replacing <task>.delay with task.apply_async to apply the countdown (I’ve chosen a 5-sec countdown randomly) fixes the issue permanently:
# update_search_index.delay(course_key_str, datetime.now(UTC).isoformat())
update_search_index.apply_async((course_key_str, datetime.now(UTC).isoformat()), countdown=5)
  • the issue isn’t reproducible if to call the task as a regular function (without .delay or .apply_async)
  • Using time.sleep and trying to get the course object with the delay (I’ve tried up to 15sec) has no effect - the course image is always outdated
  • the issue isn’t reproduced if using debug (for the CourseDetails.update_from_json method) and walking the code step-by-step. The course_published signal is fired twice at some moment (didn’t find the reason yet)
  • I didn’t find any caching issues according to getting the course data from mongo for the index update

Does anyone have any clues about the issue?

1 Like

I am not sure about the newer releases, but in older releases, that and about page were cached if user was not logged in.

Also as you mentioned the re-index was not happening properly.

What I did to work around is exactly the same thing but with cron jobs instead of a task.

1 Like

Hello @chintan

Thanks for the response! Yeah, I’m aware of the @cache_if_anonymous() but I had my user logged in.

Anyway, its good to know it is not my exclusive issue))

What release you were working with?

I did the cron in KOA and in Nutmeg, haven’t added them or tested them yet with Olive.

1 Like

I did not know about this issue, but it very much looks like a race condition is happening. The following are very strong clues:

Race conditions typically don’t occur 100% of the time.

Running the task is run with a 5s delay fixes the race condition.

The race condition seems to come from the asynchronous worker.

sleep does not fix the race condition because of the GIL. (I think?)

Pausing causes a delay, which fixes the race condition.

1 Like

I think the index is probably stored in Redis or/and ES, to ensure it’s from Redis side you can try to reproduce the probelm and then flush everything in Redis.

I think it’s edx-search repo that is responsible for the content /courses page The default case in tutor.

@ghassan yes, the index is stored in the ES

I think it’s edx-search repo that is responsible for the content /courses page

This is also true ^ (if FEATURES['ENABLE_COURSE_DISCOVERY'] = True)

From what I’ve got from the logs/debugging I see that the problem is that the wrong data is sent to the ES for storing. The task responsible for that sending receives outdated course data from mongo.

I’m wondering now how I could confirm or refute @regis clues about the race condition

My 2 cent, is that the task to update to ES index is triggerd before the content is updated _i.e. there are two async tasks of which one depends on the other, so it might be that those tasks are running in parallel, instead in a chain or pipeline _ I am just speculating here

I excluded this because the reindex task is fired on the course_published signal which is sent after the actual course update in the mongo

Aha I am not really that familiar about the sequence of events. I think @dave might have something to say about this

1 Like

It might be a race condition as @regis indicated, where workers aren’t seeing the data that’s modified by the web view. This is especially possible if it’s reading from CourseOverviews.

Another race-condition-adjacent possibility that might be worth looking into is whether or not there is clock drift between the server running the web frontend, the server running the celery workers, and possibly MongoDB (I don’t remember exactly how the modified timestamp is created). The current timestamp is getting passed to the indexing function, which implies that it’s using the timestamp to narrow down the range of content changes it has to index. If the clocks are different, the workers may be looking at the wrong time window and not see the change.

1 Like

Another potential race condition can happen if MongoDB is configured to read from secondaries–that’s usually the best thing to do for spreading load, but there could be propagation delay that prevents the workers from seeing it. It’s possible you’re not seeing that effect when everything is run in process because Modulestore’s internal caching is preventing it from re-fetching the data.