LMS failure: 504 Timeout

I am unable to get any response from my edx install:

❯ curl -v -so /dev/null http://10.100.2.33/
* Expire in 0 ms for 6 (transfer 0x55cf4b6cedc0)
*   Trying 10.100.2.33...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x55cf4b6cedc0)
* Connected to 10.100.2.33 (10.100.2.33) port 80 (#0)
> GET / HTTP/1.1
> Host: 10.100.2.33
> User-Agent: curl/7.64.0
> Accept: */*
> 
< HTTP/1.1 504 Gateway Time-out
< Server: nginx
< Date: Tue, 27 Oct 2020 19:07:55 GMT
< Content-Type: text/html
< Content-Length: 1522
< Connection: keep-alive
< ETag: "5e8cf106-5f2"
< 
{ [1282 bytes data]
* Connection #0 to host 10.100.2.33 left intact

Initially, I was seeing errors in the output for systemctl status supervisor:

Traceback (most recent call last):
  File "/edx/app/edxapp/edx-platform/manage.py", line 120, in <module>
    startup.run()
  File "/edx/app/edxapp/edx-platform/lms/startup.py", line 19, in run
    django.setup()
  File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/django/__init__.py", line 27, in setup
    apps.populate(settings.INSTALLED_APPS)
  File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/django/apps/registry.py", line 108, in populate
    app_config.import_models()
  File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/django/apps/config.py", line 202, in import_models
    self.models_module = import_module(models_module_name)
  File "/usr/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/edx/app/edxapp/edx-platform/lms/djangoapps/bulk_email/models.py", line 16, in <module>
    from openedx.core.djangoapps.course_groups.cohorts import get_cohort_by_name
  File "/edx/app/edxapp/edx-platform/openedx/core/djangoapps/course_groups/cohorts.py", line 9, in <module>
    from courseware import courses
  File "/edx/app/edxapp/edx-platform/lms/djangoapps/courseware/courses.py", line 25, in <module>
    from courseware.module_render import get_module
  File "/edx/app/edxapp/edx-platform/lms/djangoapps/courseware/module_render.py", line 60, in <module>
    from openedx.core.djangoapps.bookmarks.services import BookmarksService
  File "/edx/app/edxapp/edx-platform/openedx/core/djangoapps/bookmarks/services.py", line 12, in <module>
    from . import DEFAULT_FIELDS, api
  File "/edx/app/edxapp/edx-platform/openedx/core/djangoapps/bookmarks/api.py", line 11, in <module>
    from .models import Bookmark
  File "/edx/app/edxapp/edx-platform/openedx/core/djangoapps/bookmarks/models.py", line 41, in <module>
    class Bookmark(TimeStampedModel):
  File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/django/db/models/base.py", line 118, in __new__
    "INSTALLED_APPS." % (module, name)
RuntimeError: Model class openedx.core.djangoapps.bookmarks.models.Bookmark doesn't declare an explicit app_label and isn't in an application in INSTALLED_APPS.

In an attempt to “make it happy”, I introduced the following patch:

diff --git a/lms/envs/production.py b/lms/envs/production.py
index a7470fd..0e4997a 100644
--- a/lms/envs/production.py
+++ b/lms/envs/production.py
@@ -328,6 +328,7 @@ USE_I18N = ENV_TOKENS.get('USE_I18N', USE_I18N)
 # Additional installed apps
 for app in ENV_TOKENS.get('ADDL_INSTALLED_APPS', []):
     INSTALLED_APPS.append(app)
+INSTALLED_APPS.append('openedx.core.djangoapps.bookmarks')
 
 WIKI_ENABLED = ENV_TOKENS.get('WIKI_ENABLED', WIKI_ENABLED)

This prevented the above crash, but produced no change in behavior. Supervisord seems to indicate that the LMS is running, but nginx fails to connect:

2020/10/27 15:07:55 [error] 1085#1085: *47594 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.100.2.130, server: , request: "GET / HTTP/1.1", upstream: "http://127.0.0.1:8000/", host: "10.100.2.33"

and

2020/10/27 06:43:01 [error] 1085#1085: *43334 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 127.0.0.1, server: , request: "GET /xqueue/get_queuelen/?queue_name=certificates HTTP/1.1", upstream: "http://127.0.0.1:8040/xqueue/get_queuelen/?queue_name=certificates", host: "localhost:18040"

Honestly, I’m struggling to even know where to look to try to track down what might be going on.
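One place to start might be the upstream addresses in the two nginx log lines above (127.0.0.1:8000 and 127.0.0.1:8040): is anything listening there, and if so, does it ever send a byte back? The following is a minimal sketch, not a definitive diagnostic. It assumes Python 3 is available on the box, and the `probe` helper and its result strings are made up for illustration.

```python
import socket


def probe(host, port, timeout=3.0):
    """Classify how a TCP endpoint reacts to a bare HTTP request.

    Hypothetical helper. Returns one of:
    'refused', 'unreachable', 'silent', 'reset', 'closed', 'responding'.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except ConnectionRefusedError:
        # Nothing is listening on that port.
        return "refused"
    except OSError:
        # Connect timed out or the host could not be reached.
        return "unreachable"
    with sock:
        sock.settimeout(timeout)
        request = "GET / HTTP/1.0\r\nHost: {}\r\n\r\n".format(host)
        try:
            sock.sendall(request.encode("ascii"))
            data = sock.recv(1)
        except socket.timeout:
            # Connection accepted, but no reply within the timeout.
            return "silent"
        except OSError:
            return "reset"
        # An empty read means the peer closed without answering.
        return "responding" if data else "closed"
```

Calling `probe("127.0.0.1", 8000)` and `probe("127.0.0.1", 8040)` on the server would separate "no process bound to the port" (`refused`) from "a process accepts the connection but never answers" (`silent`). The second case would be consistent with nginx reporting a timeout while reading the response header, and would point toward the application workers rather than nginx.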

Any pointers would be appreciated.

-davidc

More background, in case it matters.

I’m attempting to shoehorn the native install into our own ansible framework, and get a working openedx machine.

Basically, all that means is that I set up the hostname and a variety of our other basic things for uniformity, template various configs such as my-passwords.yml and config.yml, then call the native install script and let it do its thing.

I’ve installed the ironwood release.

My makefile looks like this:

export OPENEDX_RELEASE = {{ edx_release_name }}

all: fetch

ansible-bootstrap.sh:
        wget https://raw.githubusercontent.com/edx/configuration/$(OPENEDX_RELEASE)/util/install/ansible-bootstrap.sh

generate-passwords.sh:
        wget https://raw.githubusercontent.com/edx/configuration/$(OPENEDX_RELEASE)/util/install/generate-passwords.sh

native.sh:
        wget https://raw.githubusercontent.com/edx/configuration/$(OPENEDX_RELEASE)/util/install/native.sh

fetch: ansible-bootstrap.sh generate-passwords.sh native.sh

ansible_bootstrap.done: export DEBIAN_FRONTEND = noninteractive
ansible_bootstrap.done: ansible-bootstrap.sh
        sudo --preserve-env bash ./ansible-bootstrap.sh && touch ansible_bootstrap.done

my-passwords.yml: generate-passwords.sh
        sudo bash ./generate-passwords.sh

setup.done: fetch ansible_bootstrap.done my-passwords.yml

setup: setup.done

install.done:  export DEBIAN_FRONTEND = noninteractive
install.done:
        sudo --preserve-env bash ./native.sh > install.log 2> install.err && touch install.done

install: install.done

clean:
        rm -f ansible-bootstrap.sh generate-passwords.sh native.sh setup.done my-passwords.yml ansible_bootstrap.done setup.done

All configuration at this point is basically the defaults, but built from a template in such a way that we can change things via our normal ansible scripts should we need to.

The native install completes successfully, as far as I can tell.

We have apparently made some minor modifications to edx-platform. I overwrite the installed /edx/app/edxapp/edx-platform with our modified version. This is based on the open-release/ironwood.master branch.

Oct 27 14:14:21 aws-us-live-edx00 celery[7038]: [service_variant=ecomworker][celery.worker.consumer][env:no_env][aws-us-live-edx00 7038] [consumer.py:364] - consumer: Cannot connect to amqp://celery:**@127.0.0.1:5672//: [Errno 104] Connection reset by peer.

OK, so I’m still seeing stuff related to bookmarks occasionally here:

==> var/log/supervisor/lms_high_1-stderr.log <==
Traceback (most recent call last):
  File "/edx/app/edxapp/edx-platform/manage.py", line 120, in <module>
    startup.run()
  File "/edx/app/edxapp/edx-platform/lms/startup.py", line 19, in run
    django.setup()
  File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/django/__init__.py", line 27, in setup
    apps.populate(settings.INSTALLED_APPS)
  File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/django/apps/registry.py", line 89, in populate
    "duplicates: %s" % app_config.label)
django.core.exceptions.ImproperlyConfigured: Application labels aren't unique, duplicates: bookmarks

So, what’s the deal with bookmarks? It’s definitely not something we touched, as far as I can tell (apart from the “make it work” patch I referenced originally).
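For what it is worth, "Application labels aren't unique" is the error Django raises when two INSTALLED_APPS entries resolve to the same label, which an unconditional append would cause if bookmarks is already listed under a different dotted path. Below is a rough sketch of a guarded append. The helper names are hypothetical, and the label guess is only a heuristic (the last path component, after dropping a trailing `apps.SomeConfig`); a real AppConfig can set `label` explicitly, so this is not how Django itself computes labels.

```python
from collections import Counter


def guess_label(entry):
    """Approximate the app label for an INSTALLED_APPS entry (heuristic only)."""
    parts = entry.split(".")
    # 'pkg.app.apps.SomeConfig' -> drop the class name, then the 'apps' module.
    if len(parts) > 1 and parts[-1][:1].isupper():
        parts = parts[:-1]
        if len(parts) > 1 and parts[-1] == "apps":
            parts = parts[:-1]
    return parts[-1]


def duplicate_labels(installed_apps):
    """Return the guessed labels that occur more than once, sorted."""
    counts = Counter(guess_label(entry) for entry in installed_apps)
    return sorted(label for label, n in counts.items() if n > 1)


def append_if_missing(installed_apps, entry):
    """Append entry unless an existing entry already has the same guessed label."""
    label = guess_label(entry)
    if any(guess_label(existing) == label for existing in installed_apps):
        return False
    installed_apps.append(entry)
    return True


# Illustrative list only; the AppConfig path below is an assumption,
# not something confirmed from this install.
example_apps = [
    "django.contrib.auth",
    "openedx.core.djangoapps.bookmarks.apps.BookmarksConfig",
]
added = append_if_missing(example_apps, "openedx.core.djangoapps.bookmarks")
```

With the illustrative list above, `added` is `False` and the list is left unchanged, whereas a plain `INSTALLED_APPS.append(...)` would leave two entries that both map to `bookmarks`. Running `duplicate_labels` over the real settings list (for example from a Django shell, if one can start) might show where the second entry comes from.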

I found this, which references the same error, but the suggested fix doesn’t make any difference. That package is already installed:

Hello @davidc,

Have you tried using vanilla edx-platform branch? It’s hard to determine what customizations you have in your platform.
Also, if these are just minor changes, maybe it would be worth porting them to Juniper, as the Ironwood version is no longer supported?

Sorry, I got pulled onto other projects and did not have the opportunity to respond. I’m not extremely familiar with the changes, but they do seem to me to be relatively minor. The goal is to move to Juniper, but the install/test cycle is really long, which is why I’m attempting to figure out what is going on with the system as currently configured.

I will try with the vanilla platform, though my vague memory is that it suffered from the same issue. It’s been some time since I last worked on this, though, so I’ll verify.