LMS stops processing requests

Hello

We are facing a strange issue in uamx:

  • from time to time, at random, the LMS container stops accepting requests and our platform returns a “502 Bad Gateway” error, preventing users from entering the platform.
  • checking tutor local status, all containers are up and running, but inside the LMS container the uwsgi workers do not respond to requests.
  • we need to kill the uwsgi processes inside the LMS container (tutor local exec lms bash and “kill” the uwsgi processes). uwsgi is restarted automatically by a daemon.
  • this can also be achieved with tutor local restart lms.
  • tutor local exec lms reload-uwsgi does nothing, since uwsgi is not responding to requests.
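For reference, the recovery procedure above can be sketched as follows. This is a generic illustration that uses a dummy `sleep` process as a stand-in for a stuck worker; inside the real container you would run `tutor local exec lms bash` and target the workers with `pgrep -f uwsgi` / `pkill -9 -f uwsgi`:

```shell
# Generic sketch: locate and force-kill a stuck process, then verify it is gone.
# In the real LMS container, the daemon respawns the killed uwsgi workers.
sleep 300 &                      # stand-in for a worker stuck on a request
pid=$!
kill -9 "$pid"                   # force-kill (SIGKILL), as we do with uwsgi
wait "$pid" 2>/dev/null || true  # reap the killed process
kill -0 "$pid" 2>/dev/null || echo "worker gone"
```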

We are running Tutor behind a proxy, following these instructions. The Caddy logs look like the following:

tutor_local-caddy-1  | {"level":"error","ts":1693380559.2195,"logger":"http.log.access.log0","msg":"handled request","request":{"remote_ip":"150.244.22.164","remote_port":"47169","proto":"HTTP/1.1","method":"GET","host":"uamx.uam.es","uri":"/"},"user_id":"","duration":59.821887904,"size":0,"status":502}

tutor_local-caddy-1  | {"level":"error","ts":1693380559.2194364,"logger":"http.log.error.log0","msg":"EOF","request":{"remote_ip":"150.244.22.164","remote_port":"47169","proto":"HTTP/1.1","method":"GET","host":"uamx.uam.es","uri":"/","headers":{"Accept-Language":["en-US,en;q=0.5"],"Dnt":["1"],"Sec-Fetch-User":["?1"],"User-Agent":["Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/116.0"],"Accept-Encoding":["gzip, deflate, br"],"Connection":["keep-alive"],"Upgrade-Insecure-Requests":["1"],"Accept":["text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8"],"Sec-Fetch-Mode":["navigate"],"Sec-Fetch-Site":["none"],"Cookie":[],"Sec-Fetch-Dest":["document"]}},"duration":59.821887904,"status":502,"err_id":"1w94a8mr2","err_trace":"reverseproxy.statusError (reverseproxy.go:1299)"}

We are running 16 wsgi workers for the LMS and 2 wsgi workers for the CMS. According to this Discuss topic they seem to be properly configured (though we had to grep “wsgi” instead of “processes”):

While monitoring our machine’s memory and CPU usage, neither was saturated when the issue started:

So we are now asking ourselves these questions:

  • is our system setting the workers properly?
  • could it be a problem with our Docker/Tutor installation?
  • how can we prevent this from happening?

Thank you very much in advance

Hi @Yago!

Maybe the containers are running out of memory or CPU. Please run docker stats, especially during high-usage times.

Andrés

Hi @Andres.Aulasneo , thanks for answering

Unfortunately, docker stats shows normal resource usage.

We’ve found out that the problem is with MySQL transactions. We are running a local instance of Tutor, and MySQL is quite overloaded, so sometimes it runs out of memory. If the LMS requests data from MySQL while it is overloaded, the request is never answered and an LMS worker stays busy forever. This can eventually block all the LMS workers, at which point the LMS stops processing requests.
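One mitigation worth noting here (a sketch, not our current production settings): client-side timeouts on the MySQL connection, so a stalled database raises an error instead of holding a worker forever. With Django and the mysqlclient driver this could look like the following; all names and values are illustrative:

```python
# Sketch of a Django DATABASES entry with client-side timeouts.
# The OPTIONS dict is passed through to mysqlclient's connect() call;
# values below are illustrative, not tuned recommendations.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "openedx",
        "HOST": "mysql",
        "PORT": 3306,
        "OPTIONS": {
            "connect_timeout": 10,  # seconds to establish the connection
            "read_timeout": 30,     # give up if MySQL stops answering
            "write_timeout": 30,
        },
    },
}
```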

Not every LMS request blocks the database; so far we have only pinpointed the password-change flow, but there may be more.

All in all, it seems we need to separate the database from the application logic to improve MySQL performance and prevent transactions from being blocked, so we are looking into that. Meanwhile, we are going to configure the workers to be “auto-killed” after 5 minutes of being busy.
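The “auto-kill” behaviour corresponds to uWSGI’s harakiri option. A sketch of the relevant uwsgi.ini fragment, assuming Tutor’s uwsgi configuration can be overridden (e.g. via a mounted file or a plugin patch):

```ini
[uwsgi]
master = true           ; harakiri needs the master process to monitor workers
harakiri = 300          ; kill any worker busy on a single request > 5 minutes
harakiri-verbose = true ; log details about the killed request
```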

Hopefully these settings will work around the error.

Thanks for the tip anyway

For our own education, can you give us an estimate of your daily active users? It would be super useful to help us make recommendations for scaling up an Open edX platform.

Hi @Yago,
Good that you found the cause of the problem. In a production environment it is recommended to run both MySQL and MongoDB in separate clusters. Also consider deploying Open edX on Kubernetes with Tutor. This will allow you to autoscale your installation.

Hi @regis , we are not currently measuring the number of daily active users :frowning: . Looking at the logs, we believe it may be somewhere between 50 and 200 active users per day.

Crawling through the logs with docker logs tutor_local-lms-1 2>&1 | grep "Login success" | awk '{print $1, $2}' | cut -d' ' -f1 | sort | uniq -c, we count between 30 and 150 logged-in users per day over the last week.

This may not be the best way to measure active users, so maybe you can give us a tip :slight_smile: We are planning to integrate Cairn or another analytics tool, but it could take time.
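To make the pipeline above concrete, here is a self-contained sketch run on fabricated sample log lines (the date-first line format is an assumption about the LMS log layout). Note that it counts login events per day, not distinct users:

```shell
# Fabricated sample of LMS log lines (format assumed: date field first).
printf '%s\n' \
  '2023-08-29 10:01:02 INFO Login success alice' \
  '2023-08-29 11:15:45 INFO Login success bob' \
  '2023-08-30 09:00:00 INFO Login success alice' \
  > sample.log
# Same idea as the pipeline above: count login events per day
grep "Login success" sample.log | awk '{print $1}' | sort | uniq -c
```

Counting distinct users per day would additionally require extracting the username field and deduplicating (e.g. `sort -u` on “date user” pairs before counting).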

BTW, thank you very much for your hard work on Tutor, it is an amazing tool!

Hi @Andres.Aulasneo, thanks for the tip.

We are currently working on moving the DBs to another server, and in the mid-term to Kubernetes.

Now that workers are killed after 5 minutes of being busy and we have 16 LMS workers, the LMS no longer gets blocked. But some functionality, like sending password-reset emails, is not working: uwsgi keeps waiting for MySQL to answer, and the worker stays busy until it is killed.

I don’t know if k8s or separating the DB will help with this problem, as it seems linked to user data and emails, not necessarily to platform performance. I’ve created this issue Error sending bulk emails - Site Operators / Tutor Help - Open edX discussions where I explain more on this subject.

Many thanks for all your help

Hello everyone

I have the same problem. I have an application connected to an external MySQL server. We realized that there was no space left, so we deleted unnecessary files and restarted the applications in the system (MySQL and Tutor), but the gateway problem persists. How can I solve it? Can you help me?
our website https://test-lms.allinturkiye.com/


Hi

What version of Tutor are you using, and can you show us the logs?

@sbernesto hi, on-prem, Tutor version 15.3.5. Open edX is running behind a proxy (nginx). When I use the command “tutor local launch” it cannot connect to the external MySQL in the mysql-run-job, but I can easily connect from inside the server, and I am using the right environment variables.

“tutor logs -f” logs below.

And my lms logs

[uWSGI] getting INI configuration from uwsgi.ini
[uwsgi-static] added mapping for /static => /openedx/staticfiles/
[uwsgi-static] added mapping for /media => /openedx/media/
*** Starting uWSGI 2.0.21 (64bit) on [Mon Jun 10 19:08:31 2024] ***
compiled with version: 9.4.0 on 03 April 2024 11:54:15
os: Linux-4.15.0-213-generic #224-Ubuntu SMP Mon Jun 19 13:30:12 UTC 2023
nodename: c4abfe891fee
machine: x86_64
clock source: unix
detected number of CPU cores: 8
current working directory: /openedx/edx-platform
detected binary path: /openedx/venv/bin/uwsgi
!!! no internal routing support, rebuild with pcre support !!!
*** WARNING: you are running uWSGI without its master process manager ***
your memory page size is 4096 bytes
detected max file descriptor number: 1048576
building mime-types dictionary from file /etc/mime.types...567 entry found
lock engine: pthread robust mutexes
thunder lock: enabled
uWSGI http bound on 0.0.0.0:8000 fd 4
spawned uWSGI http 1 (pid: 8)
uwsgi socket 0 bound to TCP address 127.0.0.1:42435 (port auto-assigned) fd 3
Python version: 3.8.12 (default, Apr  3 2024, 11:45:18)  [GCC 9.4.0]
Python main interpreter initialized at 0x56404ffc6b90
python threads support enabled
your server socket listen backlog is limited to 100 connections
your mercy for graceful operations on workers is 60 seconds
mapped 154032 bytes (150 KB) for 2 cores
*** Operational MODE: preforking ***
WSGI app 0 (mountpoint='') ready in 6 seconds on interpreter 0x56404ffc6b90 pid: 7 (default app)
spawned uWSGI worker 1 (pid: 7, cores: 1)
spawned uWSGI worker 2 (pid: 31, cores: 1)
[uWSGI] getting INI configuration from uwsgi.ini
[uwsgi-static] added mapping for /static => /openedx/staticfiles/
[uwsgi-static] added mapping for /media => /openedx/media/
*** Starting uWSGI 2.0.21 (64bit) on [Mon Jun 10 20:16:03 2024] ***
compiled with version: 9.4.0 on 03 April 2024 11:54:15
os: Linux-4.15.0-213-generic #224-Ubuntu SMP Mon Jun 19 13:30:12 UTC 2023
nodename: c4abfe891fee
machine: x86_64
clock source: unix
detected number of CPU cores: 8
current working directory: /openedx/edx-platform
detected binary path: /openedx/venv/bin/uwsgi
!!! no internal routing support, rebuild with pcre support !!!
*** WARNING: you are running uWSGI without its master process manager ***
your memory page size is 4096 bytes
detected max file descriptor number: 1048576
building mime-types dictionary from file /etc/mime.types...567 entry found
lock engine: pthread robust mutexes
thunder lock: enabled
uWSGI http bound on 0.0.0.0:8000 fd 4
spawned uWSGI http 1 (pid: 9)
uwsgi socket 0 bound to TCP address 127.0.0.1:44961 (port auto-assigned) fd 3
Python version: 3.8.12 (default, Apr  3 2024, 11:45:18)  [GCC 9.4.0]
Python main interpreter initialized at 0x55892cd69b90
python threads support enabled
your server socket listen backlog is limited to 100 connections
your mercy for graceful operations on workers is 60 seconds
mapped 154032 bytes (150 KB) for 2 cores
*** Operational MODE: preforking ***
WSGI app 0 (mountpoint='') ready in 17 seconds on interpreter 0x55892cd69b90 pid: 8 (default app)
spawned uWSGI worker 1 (pid: 8, cores: 1)
spawned uWSGI worker 2 (pid: 32, cores: 1)
[uWSGI] getting INI configuration from uwsgi.ini
[uwsgi-static] added mapping for /static => /openedx/staticfiles/
[uwsgi-static] added mapping for /media => /openedx/media/
*** Starting uWSGI 2.0.21 (64bit) on [Mon Jun 10 21:46:32 2024] ***
compiled with version: 9.4.0 on 03 April 2024 11:54:15
os: Linux-4.15.0-213-generic #224-Ubuntu SMP Mon Jun 19 13:30:12 UTC 2023
nodename: c4abfe891fee
machine: x86_64
clock source: unix
detected number of CPU cores: 8
current working directory: /openedx/edx-platform
detected binary path: /openedx/venv/bin/uwsgi
!!! no internal routing support, rebuild with pcre support !!!
*** WARNING: you are running uWSGI without its master process manager ***
your memory page size is 4096 bytes
detected max file descriptor number: 1048576
building mime-types dictionary from file /etc/mime.types...567 entry found
lock engine: pthread robust mutexes
thunder lock: enabled
uWSGI http bound on 0.0.0.0:8000 fd 4
spawned uWSGI http 1 (pid: 8)
uwsgi socket 0 bound to TCP address 127.0.0.1:46621 (port auto-assigned) fd 3
Python version: 3.8.12 (default, Apr  3 2024, 11:45:18)  [GCC 9.4.0]
Python main interpreter initialized at 0x56405a844b90
python threads support enabled
your server socket listen backlog is limited to 100 connections
your mercy for graceful operations on workers is 60 seconds
mapped 154032 bytes (150 KB) for 2 cores
*** Operational MODE: preforking ***
WSGI app 0 (mountpoint='') ready in 5 seconds on interpreter 0x56405a844b90 pid: 7 (default app)
spawned uWSGI worker 1 (pid: 7, cores: 1)
spawned uWSGI worker 2 (pid: 31, cores: 1)
[uWSGI] getting INI configuration from uwsgi.ini
[uwsgi-static] added mapping for /static => /openedx/staticfiles/
[uwsgi-static] added mapping for /media => /openedx/media/
*** Starting uWSGI 2.0.21 (64bit) on [Mon Jun 10 22:09:16 2024] ***
compiled with version: 9.4.0 on 03 April 2024 11:54:15
os: Linux-4.15.0-213-generic #224-Ubuntu SMP Mon Jun 19 13:30:12 UTC 2023
nodename: c4abfe891fee
machine: x86_64
clock source: unix
detected number of CPU cores: 8
current working directory: /openedx/edx-platform
detected binary path: /openedx/venv/bin/uwsgi
!!! no internal routing support, rebuild with pcre support !!!
*** WARNING: you are running uWSGI without its master process manager ***
your memory page size is 4096 bytes
detected max file descriptor number: 1048576
building mime-types dictionary from file /etc/mime.types...567 entry found
lock engine: pthread robust mutexes
thunder lock: enabled
uWSGI http bound on 0.0.0.0:8000 fd 4
spawned uWSGI http 1 (pid: 8)
uwsgi socket 0 bound to TCP address 127.0.0.1:41045 (port auto-assigned) fd 3
Python version: 3.8.12 (default, Apr  3 2024, 11:45:18)  [GCC 9.4.0]
Python main interpreter initialized at 0x5561f05efb90
python threads support enabled
your server socket listen backlog is limited to 100 connections
your mercy for graceful operations on workers is 60 seconds
mapped 154032 bytes (150 KB) for 2 cores
*** Operational MODE: preforking ***
WSGI app 0 (mountpoint='') ready in 5 seconds on interpreter 0x5561f05efb90 pid: 7 (default app)
spawned uWSGI worker 1 (pid: 7, cores: 1)
spawned uWSGI worker 2 (pid: 31, cores: 1)
[uWSGI] getting INI configuration from uwsgi.ini
[uwsgi-static] added mapping for /static => /openedx/staticfiles/
[uwsgi-static] added mapping for /media => /openedx/media/
*** Starting uWSGI 2.0.21 (64bit) on [Mon Jun 10 22:45:35 2024] ***
compiled with version: 9.4.0 on 03 April 2024 11:54:15
os: Linux-4.15.0-213-generic #224-Ubuntu SMP Mon Jun 19 13:30:12 UTC 2023
nodename: c4abfe891fee
machine: x86_64
clock source: unix
detected number of CPU cores: 8
current working directory: /openedx/edx-platform
detected binary path: /openedx/venv/bin/uwsgi
!!! no internal routing support, rebuild with pcre support !!!
*** WARNING: you are running uWSGI without its master process manager ***
your memory page size is 4096 bytes
detected max file descriptor number: 1048576
building mime-types dictionary from file /etc/mime.types...567 entry found
lock engine: pthread robust mutexes
thunder lock: enabled
uWSGI http bound on 0.0.0.0:8000 fd 4
spawned uWSGI http 1 (pid: 8)
uwsgi socket 0 bound to TCP address 127.0.0.1:40369 (port auto-assigned) fd 3
Python version: 3.8.12 (default, Apr  3 2024, 11:45:18)  [GCC 9.4.0]
Python main interpreter initialized at 0x559a7e5a3b90
python threads support enabled
your server socket listen backlog is limited to 100 connections
your mercy for graceful operations on workers is 60 seconds
mapped 154032 bytes (150 KB) for 2 cores
*** Operational MODE: preforking ***
WSGI app 0 (mountpoint='') ready in 5 seconds on interpreter 0x559a7e5a3b90 pid: 7 (default app)
spawned uWSGI worker 1 (pid: 7, cores: 1)
spawned uWSGI worker 2 (pid: 31, cores: 1)

Hi @sariaydinalp

If it helps: we separated all the DBs onto another machine (MySQL, MongoDB, Elasticsearch and Redis).

Our problems with tasks being blocked and wsgi workers staying busy for an indefinite period of time were solved by moving Redis to a native environment. Dockerized Redis was unable to manage that many tasks at once, and that was causing this issue.

One important note: we have 2 servers, one running the application and the other running the DBs. This prevents our machine from running out of memory or disk space, as Docker consumes a lot of both, and Redis and Mongo also use a lot of memory while processing data.

Hope this will help
Regards

Hi

@Yago What specifications do your servers have if you can share them with us?

Hi @sbernesto

Of course, here you are:

  • tutor/docker machine: 4 cores, 16 GB RAM, 105 GB hard disk
  • databases machine: 2 cores, 16 GB RAM, 146 GB hard disk

Regards

I’m not sure I understand. You were saying previously that the issue was with the uWSGI workers getting stuck while making a request to MySQL. But now you’re saying that the problem came from Redis. Which one is it? How did you diagnose the issue?

Hi @regis, it’s true, it’s not easy for me to understand either. I’ll try to explain the conclusions we’ve arrived at:

  • every time a worker is not able to complete a task in time, it goes into the “busy” state indefinitely
  • in our case, it was always due to a database not responding or throwing a timeout
  • the only databases that happened to go into this state were Redis and MySQL, and always in a Docker environment
  • so: whenever a database was not able to complete a transaction, it showed up as a uWSGI worker being busy forever.
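The failure mode in the first bullet can be reproduced in miniature: a blocking read with no effective timeout never returns when the server stalls, which is exactly a worker stuck “busy”. A sketch using plain sockets as a stand-in for a database client (real clients expose equivalent knobs, e.g. redis-py’s socket_timeout or mysqlclient’s read_timeout):

```python
import socket

# A listener that accepts the TCP handshake but never replies,
# simulating an overloaded database.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]

# With timeout=1 the read fails fast; with no timeout it would hang forever.
cli = socket.create_connection(("127.0.0.1", port), timeout=1)
try:
    cli.recv(1024)  # server never sends anything
    outcome = "replied"
except socket.timeout:
    outcome = "timed out, worker freed"
finally:
    cli.close()
    srv.close()
print(outcome)  # -> timed out, worker freed
```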

We moved all the databases (Redis, Mongo, Elasticsearch and MySQL) to an external machine, running them natively as systemd services. The problem then disappeared, and no wsgi workers were blocked anymore.

Hope this helps narrow down the issue.

Regards!


@Yago thanks for the help. From now on I will run all DBs on another server.

@sariaydinalp for us at least this was the best solution. Just keep in mind that if you are running on Kubernetes, you may need to add this fix: do not fail on start when there are no persistent volume claims by snglth · Pull Request #1052 · overhangio/tutor (github.com) if you set all DBs to false in config.yml.

Regards and good luck