Infinite recursion bug related to XBlock translations

Hello everyone. We’re encountering problems with translations. Sometimes the LMS goes into infinite recursion on translation.fallback when translating a message. This causes a 500 error page, which in turn goes into infinite recursion while translating the messages in the error page template.

It’s been happening very rarely for at least a couple of years, and we’ve been unable to reproduce the problem ourselves. It just happens for one request, the LMS crashes after a while, and then all is fine until next time. (Actually, the LMS seems to get stopped by uWSGI, which looks like a graceful termination, so it doesn’t get automatically restarted by Docker if the restart condition is set to on-failure.)

Here’s a stack trace that seems to indicate the problem comes from XBlock translations, but we’re not sure, since it also happens on other pages, like `/courses`:

Traceback (most recent call last):
  File "/openedx/venv/lib/python3.11/site-packages/django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/openedx/venv/lib/python3.11/site-packages/django/core/handlers/base.py", line 197, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/openedx/venv/lib/python3.11/site-packages/django/views/decorators/http.py", line 43, in inner
    return func(request, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/openedx/edx-platform/common/djangoapps/util/views.py", line 63, in inner
    response = view_func(request, *args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/openedx/venv/lib/python3.11/site-packages/django/views/decorators/clickjacking.py", line 58, in wrapper_view
    resp = view_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/openedx/venv/lib/python3.11/site-packages/django/utils/decorators.py", line 134, in _wrapper_view
    response = view_func(request, *args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/openedx/edx-platform/lms/djangoapps/courseware/views/views.py", line 1695, in render_xblock
    fragment = block.render(requested_view, context=student_view_context)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/openedx/venv/lib/python3.11/site-packages/xblock/core.py", line 818, in render
    return self.runtime.render(self, view, context)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/openedx/edx-platform/xmodule/x_module.py", line 994, in render
    return super().render(block, view_name, context=context)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/openedx/venv/lib/python3.11/site-packages/xblock/runtime.py", line 823, in render
    frag = view_fn(context)
           ^^^^^^^^^^^^^^^^
  File "/openedx/edx-platform/xmodule/vertical_block.py", line 203, in student_view
    return self._student_or_public_view(context, STUDENT_VIEW)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/openedx/edx-platform/xmodule/vertical_block.py", line 130, in _student_or_public_view
    rendered_child = child.render(view, child_block_context)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/openedx/venv/lib/python3.11/site-packages/xblock/core.py", line 818, in render
    return self.runtime.render(self, view, context)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/openedx/edx-platform/xmodule/x_module.py", line 994, in render
    return super().render(block, view_name, context=context)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/openedx/venv/lib/python3.11/site-packages/xblock/runtime.py", line 823, in render
    frag = view_fn(context)
           ^^^^^^^^^^^^^^^^
  File "/openedx/venv/lib/python3.11/site-packages/done/done.py", line 81, in student_view
    frag.add_content(resource_loader.render_django_template(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/openedx/venv/lib/python3.11/site-packages/xblock/utils/resources.py", line 48, in render_django_template
    rendered = template.render(Context(context))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/openedx/venv/lib/python3.11/site-packages/django/template/base.py", line 175, in render
    return self._render(context)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/openedx/venv/lib/python3.11/site-packages/django/template/base.py", line 167, in _render
    return self.nodelist.render(context)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/openedx/venv/lib/python3.11/site-packages/django/template/base.py", line 1005, in render
    return SafeString("".join([node.render_annotated(context) for node in self]))
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/openedx/venv/lib/python3.11/site-packages/django/template/base.py", line 1005, in <listcomp>
    return SafeString("".join([node.render_annotated(context) for node in self]))
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/openedx/venv/lib/python3.11/site-packages/django/template/base.py", line 966, in render_annotated
    return self.render(context)
           ^^^^^^^^^^^^^^^^^^^^
  File "/openedx/venv/lib/python3.11/site-packages/xblock/utils/templatetags/i18n.py", line 54, in render
    with self.merge_translation(context):
  File "/opt/pyenv/versions/3.11.8/lib/python3.11/contextlib.py", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File "/openedx/venv/lib/python3.11/site-packages/xblock/utils/templatetags/i18n.py", line 40, in merge_translation
    translation.merge(i18n_service)
  File "/openedx/venv/lib/python3.11/site-packages/django/utils/translation/trans_real.py", line 264, in merge
    self.add_fallback(other._fallback)
  File "/opt/pyenv/versions/3.11.8/lib/python3.11/gettext.py", line 279, in add_fallback
    self._fallback.add_fallback(fallback)
  File "/opt/pyenv/versions/3.11.8/lib/python3.11/gettext.py", line 279, in add_fallback
    self._fallback.add_fallback(fallback)
  File "/opt/pyenv/versions/3.11.8/lib/python3.11/gettext.py", line 279, in add_fallback
    self._fallback.add_fallback(fallback)
  [Previous line repeated 831 more times]
RecursionError: maximum recursion depth exceeded

My theory is that something at runtime causes fallback languages to be added to the translation service, causing one of:

  1. A language gets configured as its own fallback (one-element recursion loop);

  2. A language’s fallback is configured to one of its parent languages (N-element recursion loop);

  3. Fallbacks keep being added without a loop, but it eventually results in a chain of 1000 elements and triggers the recursion limit error.

Potentially, this poisons the translation service, which causes a recursion error to happen when translating another page, which doesn’t have an XBlock.
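
To make the first two hypotheses concrete, here is a minimal sketch that uses only Python's standard gettext module, not Django or the XBlock i18n service, so whether this is what actually happens in the LMS remains an assumption. It relies on the private _fallback attribute of gettext.NullTranslations, and the helper names has_fallback_cycle and recurses_forever are made up for illustration. It shows that adding the same fallback object twice makes that object its own fallback, after which any further add_fallback call recurses until the limit is hit, as in the traceback above.

```python
import gettext


def has_fallback_cycle(translation):
    """Return True if following _fallback links ever revisits an object."""
    seen = set()
    node = translation
    while node is not None:
        if id(node) in seen:
            return True
        seen.add(id(node))
        # _fallback is a private attribute of gettext.NullTranslations.
        node = getattr(node, "_fallback", None)
    return False


def recurses_forever(translation):
    """Return True if adding one more fallback hits the recursion limit.

    Note: on a healthy chain this appends a new, empty fallback.
    """
    try:
        translation.add_fallback(gettext.NullTranslations())
    except RecursionError:
        return True
    return False


primary = gettext.NullTranslations()
shared = gettext.NullTranslations()

# First call: primary has no fallback yet, so the chain is primary -> shared.
primary.add_fallback(shared)
cycle_before = has_fallback_cycle(primary)

# Second call with the same object: primary delegates to shared, which has
# no fallback of its own and therefore stores itself as its fallback.
primary.add_fallback(shared)
cycle_after = has_fallback_cycle(primary)

# Any later add_fallback call now walks the loop until Python gives up.
crashes = recurses_forever(primary)
```

A check like has_fallback_cycle could be run against a live translation object from a Django shell to see whether the chain has been corrupted, assuming the object exposes _fallback in the same way.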

Has this happened to anyone else?

We’ve been investigating this for a week, and today I finally managed to find a reproduction: when the language is set to German (de-de), loading a unit that uses the done/Completion XBlock causes the infinite recursion bug, at least in Redwood and Sumac. I reproduced it with a unit that contains only this XBlock, and it happens every single time. If I remove the XBlock, it stops (after restarting the LMS to stop the recursion, of course). It doesn’t seem to happen with French or Italian.

Disabling German in the DarkLang config is not sufficient: while it hides German from the language menu, a user who already has their cookie set to de-de will still trigger the bug. Removing German from settings.LANGUAGES prevents the bug from happening. Another way to prevent the bug is to remove the four translated strings from /openedx/venv/lib/python3.11/site-packages/done/templates/done.html.
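
As a rough sketch of the settings.LANGUAGES workaround: Django's LANGUAGES setting is a sequence of (code, name) pairs, so one entry can be filtered out. The helper name drop_language and the sample list below are illustrative assumptions, not the actual Open edX defaults.

```python
def drop_language(languages, code_to_drop):
    """Return the (code, name) pairs that are not code_to_drop or a variant of it."""
    target = code_to_drop.lower()
    return [
        (code, name)
        for code, name in languages
        # Drop both the bare code ("de") and regional variants ("de-de").
        if code.lower() != target and not code.lower().startswith(target + "-")
    ]


# Sample data for illustration only; a real deployment has a longer list.
LANGUAGES = [
    ("en", "English"),
    ("fr", "Français"),
    ("de-de", "Deutsch"),
    ("it-it", "Italiano"),
]

LANGUAGES = drop_language(LANGUAGES, "de")
```

Where such an override belongs (a settings patch, a plugin, or a fork) depends on how the platform is deployed.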

What I noticed is that out of 6 XBlocks I looked at (DoneXBlock, xblock-free-text-response, xblock-qualtrics-survey, xblock-google-drive, xblock-image-explorer, xblock-drag-and-drop-v2), DoneXBlock is the only one that doesn’t have a symlink called translations that points to conf/locale. I have no idea if it has anything to do with the matter.

Does anybody have an idea of what’s happening? At the moment, I’m thinking of replacing that XBlock with a fork that doesn’t translate its strings, but I’d be happy with a cleaner solution. Thanks in advance for your insights.

Hi @oscherler and welcome! Could you please mention what version of Open edX you’re running, and what version of the XBlocks you’re using (if you know)?

At least both Redwood with done-xblock 2.3.0 and Sumac with done-xblock 2.4.0.

Redwood:

acid-xblock                 0.3.1
crowdsourcehinter-xblock    0.7
done-xblock                 2.3.0
flow-control-xblock         2.0.1
h5p-xblock                  0.2.17
lti-consumer-xblock         9.11.0
openedx-scorm-xblock        18.0.2
recommender-xblock          2.2.0
staff_graded-xblock         2.3.0
ubcpi-xblock                1.0.0
XBlock                      4.0.1
xblock-drag-and-drop-v2     4.0.2
xblock-google-drive         0.8.1
xblock-poll                 1.15.1
xblock-utils                4.0.0

Sumac:

acid-xblock                 0.4.1
crowdsourcehinter-xblock    0.7
done-xblock                 2.4.0
flow-control-xblock         2.0.1
h5p-xblock                  0.2.17
lti-consumer-xblock         9.11.3
openedx-scorm-xblock        15.0.1
recommender-xblock          2.2.1
staff_graded-xblock         2.3.0
ubcpi-xblock                1.0.0
XBlock                      5.1.0
xblock-drag-and-drop-v2     4.0.3
xblock-google-drive         0.8.1
xblock-poll                 1.15.1
xblock-utils                4.0.0

This is something I’ve looked into as well.

2U had a tracking issue for the recursion error at https://github.com/edx/edx-arch-experiments/issues/674 although it has since been transitioned to a private issue tracker. (Nothing new since then, though.) We never got an answer but maybe some of the patterns we saw could be useful to you.

There was also a bug related to translations in the error page itself, which I fixed a while back (https://github.com/openedx/edx-platform/pull/35209). That has since been backported to Redwood, but maybe it’s worth taking a look at to see if there’s a similar bug still in there?

Thanks Tim.

@oscherler do you know if your version of Redwood has the backport Tim mentions?

Thank you, Tim. It turns out that our fork of edx-platform was from open-release/redwood.1, as we started it just a month before your fix. I have integrated the change now, so the second recursion problem doesn’t occur anymore.

There’s still the problem with the infinite recursion on translation in DoneXBlock, but it’s less severe without the infinite error page recursion. What I did was fork the XBlock and remove all translation calls. It’s just one button that says “Mark as completed”, and I doubt our users will be more annoyed by this than by the outages that it causes.

Finally, it turns out that when infinite recursion occurs, uWSGI terminates the process gracefully. I suspected it from the “goodbye to uWSGI.” message in the logs:

...
RecursionError: maximum recursion depth exceeded in comparison
SIGINT/SIGTERM received...killing workers...
gateway "uWSGI http 1" has been buried (pid: 24)
subprocess 2983 exited with code 52
worker 1 buried after 5 seconds
worker 2 buried after 5 seconds
goodbye to uWSGI.

And since our Docker services were configured with deploy.restart_policy.condition: on-failure, they weren’t restarted when this occurred.

So, to summarise: the problem still exists, but we’ve managed to work around it for now.