LMS stops processing requests

Hello

We are facing a strange issue in uamx:

  • from time to time, at random, the LMS container stops accepting requests and our platform returns a “502 Bad Gateway” error, preventing users from entering the platform.
  • checking tutor local status, all containers are up and running, but inside the LMS container the uwsgi workers do not respond to requests.
  • we need to kill the uwsgi processes inside the LMS container (tutor local exec lms bash and “kill” the uwsgi processes). uwsgi is restarted automatically by a daemon.
  • this can also be achieved with tutor local restart lms.
  • tutor local exec lms reload-uwsgi does nothing, since uwsgi is not responding to requests.
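For reference, the recovery procedure above can be sketched as follows. This is a generic illustration that uses a dummy `sleep` process as a stand-in for a stuck worker; inside the real container you would run `tutor local exec lms bash` and target the workers with `pgrep -f uwsgi` / `pkill -9 -f uwsgi`:

```shell
# Generic sketch: locate and force-kill a stuck process, then verify it is gone.
# In the real LMS container, the daemon respawns the killed uwsgi workers.
sleep 300 &                      # stand-in for a worker stuck on a request
pid=$!
kill -9 "$pid"                   # force-kill (SIGKILL), as we do with uwsgi
wait "$pid" 2>/dev/null || true  # reap the killed process
kill -0 "$pid" 2>/dev/null || echo "worker gone"
```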

We are running Tutor behind a proxy, following these instructions. The Caddy logs look like the following:

tutor_local-caddy-1  | {"level":"error","ts":1693380559.2195,"logger":"http.log.access.log0","msg":"handled request","request":{"remote_ip":"150.244.22.164","remote_port":"47169","proto":"HTTP/1.1","method":"GET","host":"uamx.uam.es","uri":"/"},"user_id":"","duration":59.821887904,"size":0,"status":502}

tutor_local-caddy-1  | {"level":"error","ts":1693380559.2194364,"logger":"http.log.error.log0","msg":"EOF","request":{"remote_ip":"150.244.22.164","remote_port":"47169","proto":"HTTP/1.1","method":"GET","host":"uamx.uam.es","uri":"/","headers":{"Accept-Language":["en-US,en;q=0.5"],"Dnt":["1"],"Sec-Fetch-User":["?1"],"User-Agent":["Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/116.0"],"Accept-Encoding":["gzip, deflate, br"],"Connection":["keep-alive"],"Upgrade-Insecure-Requests":["1"],"Accept":["text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8"],"Sec-Fetch-Mode":["navigate"],"Sec-Fetch-Site":["none"],"Cookie":[],"Sec-Fetch-Dest":["document"]}},"duration":59.821887904,"status":502,"err_id":"1w94a8mr2","err_trace":"reverseproxy.statusError (reverseproxy.go:1299)"}

We are running 16 wsgi workers for the LMS and 2 wsgi workers for the CMS. According to this Discuss topic they seem to be properly configured (though we had to grep “wsgi” instead of “processes”):

While monitoring our machine’s memory and CPU usage, neither was saturated when the issue started:

So we are now asking ourselves these questions:

  • is our system setting the workers properly?
  • could it be a problem with our Docker/Tutor installation?
  • how can we prevent this from happening?

Thank you very much in advance

Hi @Yago!

Maybe the containers are running out of memory or CPU. Please run docker stats, especially during high-usage times.

Andrés

Hi @Andres.Aulasneo , thanks for answering

Unfortunately, docker stats shows normal resource usage.

We’ve found out that the problem is with MySQL transactions. We are running a local instance of Tutor, and MySQL is quite overloaded, so sometimes it runs out of memory. If the LMS requests data from MySQL while it is overloaded, the request is never answered and an LMS worker stays busy forever. This can eventually block all the LMS workers, at which point the LMS stops processing requests.
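One mitigation worth noting here (a sketch, not our current production settings): client-side timeouts on the MySQL connection, so a stalled database raises an error instead of holding a worker forever. With Django and the mysqlclient driver this could look like the following; all names and values are illustrative:

```python
# Sketch of a Django DATABASES entry with client-side timeouts.
# The OPTIONS dict is passed through to mysqlclient's connect() call;
# values below are illustrative, not tuned recommendations.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "openedx",
        "HOST": "mysql",
        "PORT": 3306,
        "OPTIONS": {
            "connect_timeout": 10,  # seconds to establish the connection
            "read_timeout": 30,     # give up if MySQL stops answering
            "write_timeout": 30,
        },
    },
}
```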

Not every LMS request blocks the database; so far we have only pinpointed the password-change flow, but there may be more.

All in all, it seems we need to separate the database from the application logic to improve MySQL performance and prevent transactions from being blocked, so we are looking into that. Meanwhile, we are going to configure the workers to be “auto-killed” after 5 minutes of being busy.
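The “auto-kill” behaviour corresponds to uWSGI’s harakiri option. A sketch of the relevant uwsgi.ini fragment, assuming Tutor’s uwsgi configuration can be overridden (e.g. via a mounted file or a plugin patch):

```ini
[uwsgi]
master = true           ; harakiri needs the master process to monitor workers
harakiri = 300          ; kill any worker busy on a single request > 5 minutes
harakiri-verbose = true ; log details about the killed request
```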

Hopefully these settings will work around the error.

Thanks for the tip anyway

For our own education, can you give us an estimate of your daily active users? It would be super useful to help us make recommendations for scaling up an Open edX platform.

Hi @Yago,
Good that you found the cause of the problem. In a production environment it is recommended to run both MySQL and MongoDB in separate clusters. Also consider deploying Open edX on Kubernetes with Tutor. This will allow you to autoscale your installation.

Hi @regis , we are not currently measuring the number of daily active users :frowning: . Looking at the logs, we believe it may be somewhere between 50 and 200 active users per day.

Crawling through the logs with docker logs tutor_local-lms-1 2>&1 | grep "Login success" | awk '{print $1, $2}' | cut -d' ' -f1 | sort | uniq -c, we count between 30 and 150 logged-in users per day over the last week.

This may not be the best way to measure active users, so maybe you can give us a tip :slight_smile: We are planning to integrate Cairn or another analytics tool, but it could take time.
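To make the pipeline above concrete, here is a self-contained sketch run on fabricated sample log lines (the date-first line format is an assumption about the LMS log layout). Note that it counts login events per day, not distinct users:

```shell
# Fabricated sample of LMS log lines (format assumed: date field first).
printf '%s\n' \
  '2023-08-29 10:01:02 INFO Login success alice' \
  '2023-08-29 11:15:45 INFO Login success bob' \
  '2023-08-30 09:00:00 INFO Login success alice' \
  > sample.log
# Same idea as the pipeline above: count login events per day
grep "Login success" sample.log | awk '{print $1}' | sort | uniq -c
```

Counting distinct users per day would additionally require extracting the username field and deduplicating (e.g. `sort -u` on “date user” pairs before counting).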

BTW, thank you very much for your hard work on Tutor, it is an amazing tool!

Hi @Andres.Aulasneo, thanks for the tip.

We are currently working on moving the DBs to another server, and in the mid-term to Kubernetes.

Now that workers are killed after 5 minutes of being busy and we have 16 LMS workers, the LMS no longer gets blocked. But some functionality, like sending password-reset emails, is not working: uwsgi keeps waiting for MySQL to answer, and the worker stays busy until it is killed.

I don’t know if k8s or separating the DB will help with this problem, as it seems linked to user data and emails, not necessarily to platform performance. I’ve created this issue Error sending bulk emails - Site Operators / Tutor Help - Open edX discussions where I explain more on this subject.

Many thanks for all your help

Hello everyone

I have the same problem. I have an application connected to an external MySQL server. We realized that there was no space left, so we deleted unnecessary files and restarted the applications in the system (MySQL and Tutor), but the gateway problem persists. How can I solve it? Can you help me?
our website https://test-lms.allinturkiye.com/


Hi

What version of Tutor are you using, and can you show us the logs?

@sbernesto hi, on-prem, Tutor version 15.3.5. Open edX is running behind a proxy (nginx). When I use the command “tutor local launch” it cannot connect to the external MySQL in the mysql-run-job, but I can easily connect from inside the server, and I am using the right environment variables.

“tutor logs -f” logs below.

And my lms logs

[uWSGI] getting INI configuration from uwsgi.ini
[uwsgi-static] added mapping for /static => /openedx/staticfiles/
[uwsgi-static] added mapping for /media => /openedx/media/
*** Starting uWSGI 2.0.21 (64bit) on [Mon Jun 10 19:08:31 2024] ***
compiled with version: 9.4.0 on 03 April 2024 11:54:15
os: Linux-4.15.0-213-generic #224-Ubuntu SMP Mon Jun 19 13:30:12 UTC 2023
nodename: c4abfe891fee
machine: x86_64
clock source: unix
detected number of CPU cores: 8
current working directory: /openedx/edx-platform
detected binary path: /openedx/venv/bin/uwsgi
!!! no internal routing support, rebuild with pcre support !!!
*** WARNING: you are running uWSGI without its master process manager ***
your memory page size is 4096 bytes
detected max file descriptor number: 1048576
building mime-types dictionary from file /etc/mime.types...567 entry found
lock engine: pthread robust mutexes
thunder lock: enabled
uWSGI http bound on 0.0.0.0:8000 fd 4
spawned uWSGI http 1 (pid: 8)
uwsgi socket 0 bound to TCP address 127.0.0.1:42435 (port auto-assigned) fd 3
Python version: 3.8.12 (default, Apr  3 2024, 11:45:18)  [GCC 9.4.0]
Python main interpreter initialized at 0x56404ffc6b90
python threads support enabled
your server socket listen backlog is limited to 100 connections
your mercy for graceful operations on workers is 60 seconds
mapped 154032 bytes (150 KB) for 2 cores
*** Operational MODE: preforking ***
WSGI app 0 (mountpoint='') ready in 6 seconds on interpreter 0x56404ffc6b90 pid: 7 (default app)
spawned uWSGI worker 1 (pid: 7, cores: 1)
spawned uWSGI worker 2 (pid: 31, cores: 1)
[uWSGI] getting INI configuration from uwsgi.ini
[uwsgi-static] added mapping for /static => /openedx/staticfiles/
[uwsgi-static] added mapping for /media => /openedx/media/
*** Starting uWSGI 2.0.21 (64bit) on [Mon Jun 10 20:16:03 2024] ***
compiled with version: 9.4.0 on 03 April 2024 11:54:15
os: Linux-4.15.0-213-generic #224-Ubuntu SMP Mon Jun 19 13:30:12 UTC 2023
nodename: c4abfe891fee
machine: x86_64
clock source: unix
detected number of CPU cores: 8
current working directory: /openedx/edx-platform
detected binary path: /openedx/venv/bin/uwsgi
!!! no internal routing support, rebuild with pcre support !!!
*** WARNING: you are running uWSGI without its master process manager ***
your memory page size is 4096 bytes
detected max file descriptor number: 1048576
building mime-types dictionary from file /etc/mime.types...567 entry found
lock engine: pthread robust mutexes
thunder lock: enabled
uWSGI http bound on 0.0.0.0:8000 fd 4
spawned uWSGI http 1 (pid: 9)
uwsgi socket 0 bound to TCP address 127.0.0.1:44961 (port auto-assigned) fd 3
Python version: 3.8.12 (default, Apr  3 2024, 11:45:18)  [GCC 9.4.0]
Python main interpreter initialized at 0x55892cd69b90
python threads support enabled
your server socket listen backlog is limited to 100 connections
your mercy for graceful operations on workers is 60 seconds
mapped 154032 bytes (150 KB) for 2 cores
*** Operational MODE: preforking ***
WSGI app 0 (mountpoint='') ready in 17 seconds on interpreter 0x55892cd69b90 pid: 8 (default app)
spawned uWSGI worker 1 (pid: 8, cores: 1)
spawned uWSGI worker 2 (pid: 32, cores: 1)
[uWSGI] getting INI configuration from uwsgi.ini
[uwsgi-static] added mapping for /static => /openedx/staticfiles/
[uwsgi-static] added mapping for /media => /openedx/media/
*** Starting uWSGI 2.0.21 (64bit) on [Mon Jun 10 21:46:32 2024] ***
compiled with version: 9.4.0 on 03 April 2024 11:54:15
os: Linux-4.15.0-213-generic #224-Ubuntu SMP Mon Jun 19 13:30:12 UTC 2023
nodename: c4abfe891fee
machine: x86_64
clock source: unix
detected number of CPU cores: 8
current working directory: /openedx/edx-platform
detected binary path: /openedx/venv/bin/uwsgi
!!! no internal routing support, rebuild with pcre support !!!
*** WARNING: you are running uWSGI without its master process manager ***
your memory page size is 4096 bytes
detected max file descriptor number: 1048576
building mime-types dictionary from file /etc/mime.types...567 entry found
lock engine: pthread robust mutexes
thunder lock: enabled
uWSGI http bound on 0.0.0.0:8000 fd 4
spawned uWSGI http 1 (pid: 8)
uwsgi socket 0 bound to TCP address 127.0.0.1:46621 (port auto-assigned) fd 3
Python version: 3.8.12 (default, Apr  3 2024, 11:45:18)  [GCC 9.4.0]
Python main interpreter initialized at 0x56405a844b90
python threads support enabled
your server socket listen backlog is limited to 100 connections
your mercy for graceful operations on workers is 60 seconds
mapped 154032 bytes (150 KB) for 2 cores
*** Operational MODE: preforking ***
WSGI app 0 (mountpoint='') ready in 5 seconds on interpreter 0x56405a844b90 pid: 7 (default app)
spawned uWSGI worker 1 (pid: 7, cores: 1)
spawned uWSGI worker 2 (pid: 31, cores: 1)
[uWSGI] getting INI configuration from uwsgi.ini
[uwsgi-static] added mapping for /static => /openedx/staticfiles/
[uwsgi-static] added mapping for /media => /openedx/media/
*** Starting uWSGI 2.0.21 (64bit) on [Mon Jun 10 22:09:16 2024] ***
compiled with version: 9.4.0 on 03 April 2024 11:54:15
os: Linux-4.15.0-213-generic #224-Ubuntu SMP Mon Jun 19 13:30:12 UTC 2023
nodename: c4abfe891fee
machine: x86_64
clock source: unix
detected number of CPU cores: 8
current working directory: /openedx/edx-platform
detected binary path: /openedx/venv/bin/uwsgi
!!! no internal routing support, rebuild with pcre support !!!
*** WARNING: you are running uWSGI without its master process manager ***
your memory page size is 4096 bytes
detected max file descriptor number: 1048576
building mime-types dictionary from file /etc/mime.types...567 entry found
lock engine: pthread robust mutexes
thunder lock: enabled
uWSGI http bound on 0.0.0.0:8000 fd 4
spawned uWSGI http 1 (pid: 8)
uwsgi socket 0 bound to TCP address 127.0.0.1:41045 (port auto-assigned) fd 3
Python version: 3.8.12 (default, Apr  3 2024, 11:45:18)  [GCC 9.4.0]
Python main interpreter initialized at 0x5561f05efb90
python threads support enabled
your server socket listen backlog is limited to 100 connections
your mercy for graceful operations on workers is 60 seconds
mapped 154032 bytes (150 KB) for 2 cores
*** Operational MODE: preforking ***
WSGI app 0 (mountpoint='') ready in 5 seconds on interpreter 0x5561f05efb90 pid: 7 (default app)
spawned uWSGI worker 1 (pid: 7, cores: 1)
spawned uWSGI worker 2 (pid: 31, cores: 1)
[uWSGI] getting INI configuration from uwsgi.ini
[uwsgi-static] added mapping for /static => /openedx/staticfiles/
[uwsgi-static] added mapping for /media => /openedx/media/
*** Starting uWSGI 2.0.21 (64bit) on [Mon Jun 10 22:45:35 2024] ***
compiled with version: 9.4.0 on 03 April 2024 11:54:15
os: Linux-4.15.0-213-generic #224-Ubuntu SMP Mon Jun 19 13:30:12 UTC 2023
nodename: c4abfe891fee
machine: x86_64
clock source: unix
detected number of CPU cores: 8
current working directory: /openedx/edx-platform
detected binary path: /openedx/venv/bin/uwsgi
!!! no internal routing support, rebuild with pcre support !!!
*** WARNING: you are running uWSGI without its master process manager ***
your memory page size is 4096 bytes
detected max file descriptor number: 1048576
building mime-types dictionary from file /etc/mime.types...567 entry found
lock engine: pthread robust mutexes
thunder lock: enabled
uWSGI http bound on 0.0.0.0:8000 fd 4
spawned uWSGI http 1 (pid: 8)
uwsgi socket 0 bound to TCP address 127.0.0.1:40369 (port auto-assigned) fd 3
Python version: 3.8.12 (default, Apr  3 2024, 11:45:18)  [GCC 9.4.0]
Python main interpreter initialized at 0x559a7e5a3b90
python threads support enabled
your server socket listen backlog is limited to 100 connections
your mercy for graceful operations on workers is 60 seconds
mapped 154032 bytes (150 KB) for 2 cores
*** Operational MODE: preforking ***
WSGI app 0 (mountpoint='') ready in 5 seconds on interpreter 0x559a7e5a3b90 pid: 7 (default app)
spawned uWSGI worker 1 (pid: 7, cores: 1)
spawned uWSGI worker 2 (pid: 31, cores: 1)

Hi @sariaydinalp

If it helps: we separated all the DBs onto another machine (MySQL, MongoDB, Elasticsearch and Redis).

Our problems with tasks being blocked and wsgi workers staying busy for an indefinite period of time were solved by moving Redis to a native environment. Dockerized Redis was unable to manage that many tasks at once, and that was causing this issue.

One important note: we have 2 servers, one running the application and the other running the DBs. This prevents our machine from running out of memory or disk space, as Docker consumes a lot of both, and Redis and Mongo also use a lot of memory while processing data.

Hope this will help
Regards

Hi

@Yago What specifications do your servers have if you can share them with us?

Hi @sbernesto

Of course, here you are:

  • tutor/docker machine: 4 cores, 16 GB RAM, 105 GB hard disk
  • databases machine: 2 cores, 16 GB RAM, 146 GB hard disk

Regards

I’m not sure I understand. You were saying previously that the issue was with the uWSGI workers getting stuck while making a request to MySQL. But now you’re saying that the problem came from Redis. Which one is it? How did you diagnose the issue?

Hi @regis, it’s true, it’s not easy for me to understand either. I’ll try to explain the conclusions we’ve arrived at:

  • every time a worker is not able to complete a task in time, it goes into the “busy” state indefinitely
  • in our case, it was always due to a database not responding or throwing a timeout
  • the only databases that happened to go into this state were Redis and MySQL, and always in a Docker environment
  • so: whenever a database was not able to complete a transaction, it showed up as a uWSGI worker being busy forever.
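The failure mode in the first bullet can be reproduced in miniature: a blocking read with no effective timeout never returns when the server stalls, which is exactly a worker stuck “busy”. A sketch using plain sockets as a stand-in for a database client (real clients expose equivalent knobs, e.g. redis-py’s socket_timeout or mysqlclient’s read_timeout):

```python
import socket

# A listener that accepts the TCP handshake but never replies,
# simulating an overloaded database.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]

# With timeout=1 the read fails fast; with no timeout it would hang forever.
cli = socket.create_connection(("127.0.0.1", port), timeout=1)
try:
    cli.recv(1024)  # server never sends anything
    outcome = "replied"
except socket.timeout:
    outcome = "timed out, worker freed"
finally:
    cli.close()
    srv.close()
print(outcome)  # -> timed out, worker freed
```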

We moved all the databases (Redis, Mongo, Elasticsearch and MySQL) to an external machine, running them natively as systemd services. The problem then disappeared, and no wsgi workers were blocked anymore.

Hope this helps narrow down the issue.

Regards!


@Yago thanks for the help. From now on I will run all DBs on another server.

@sariaydinalp for us at least this was the best solution. Just keep in mind that if you are running on Kubernetes, you may need to add this fix: do not fail on start when there are no persistent volume claims by snglth · Pull Request #1052 · overhangio/tutor (github.com) if you set all DBs to false in config.yml.

Regards and good luck