LMS stop processing requests

Hello

We are facing a strange issue in uamx:

  • from time to time, randomly, the LMS container stops accepting requests and our platform returns a “502 Bad Gateway” error, preventing users from entering our platform.
  • according to tutor local status all containers are up and running, but inside the lms container the uwsgi workers do not respond to requests.
  • we need to kill the uwsgi processes inside the LMS container (tutor local exec lms bash and then “kill” the uwsgi processes). The uwsgi workers are restarted automatically by a daemon.
  • this can also be achieved with tutor local restart lms.
  • tutor local exec lms reload-uwsgi does nothing either, since uwsgi is not responding to requests.
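For reference, the recovery we apply can be sketched as a one-liner (the container name and grep pattern are assumptions based on a default tutor local install; adjust to your deployment):

```shell
# Kill all uwsgi processes inside the LMS container; the supervising
# process respawns them, which unblocks the stuck workers.
tutor local exec lms bash -c "ps ax | grep '[u]wsgi' | awk '{print \$1}' | xargs -r kill"

# Heavier-handed alternative with the same effect:
tutor local restart lms
```

The `[u]wsgi` bracket trick keeps the grep process itself out of the match.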

We are running tutor behind a proxy, following these instructions. The Caddy logs look like the following:

tutor_local-caddy-1  | {"level":"error","ts":1693380559.2195,"logger":"http.log.access.log0","msg":"handled request","request":{"remote_ip":"150.244.22.164","remote_port":"47169","proto":"HTTP/1.1","method":"GET","host":"uamx.uam.es","uri":"/"},"user_id":"","duration":59.821887904,"size":0,"status":502}

tutor_local-caddy-1  | {"level":"error","ts":1693380559.2194364,"logger":"http.log.error.log0","msg":"EOF","request":{"remote_ip":"150.244.22.164","remote_port":"47169","proto":"HTTP/1.1","method":"GET","host":"uamx.uam.es","uri":"/","headers":{"Accept-Language":["en-US,en;q=0.5"],"Dnt":["1"],"Sec-Fetch-User":["?1"],"User-Agent":["Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/116.0"],"Accept-Encoding":["gzip, deflate, br"],"Connection":["keep-alive"],"Upgrade-Insecure-Requests":["1"],"Accept":["text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8"],"Sec-Fetch-Mode":["navigate"],"Sec-Fetch-Site":["none"],"Cookie":[],"Sec-Fetch-Dest":["document"]}},"duration":59.821887904,"status":502,"err_id":"1w94a8mr2","err_trace":"reverseproxy.statusError (reverseproxy.go:1299)"}

We are running 16 uwsgi workers for the lms and 2 uwsgi workers for the cms. According to this discuss topic they seem to be properly configured (though we need to grep “wsgi” instead of “processes”):
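For anyone wanting to double-check, the worker count can be verified from inside the container with something like this (a sketch; the exact process listing format may differ in your image):

```shell
# Count uwsgi processes in the LMS container; expect the configured
# number of workers plus one master process.
tutor local exec lms bash -c "ps ax | grep -c '[u]wsgi'"
```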

While monitoring our machine’s memory and CPU usage, neither was saturated when the issue started:

So we are now facing these questions:

  • are the workers properly configured on our system?
  • could it be a problem with our docker/tutor installation?
  • how can we prevent this from happening?

Thank you very much in advance

Hi @Yago!

Maybe the containers are running out of memory or CPU. Please run docker stats, especially during high-usage times.

Andrés

Hi @Andres.Aulasneo , thanks for answering

Unfortunately, docker stats shows normal resource usage.

We’ve found out that the problem is with MySQL transactions. We are running a local instance of tutor, and MySQL is quite overloaded, so sometimes it runs out of memory. If the LMS requests data from MySQL while it is overloaded, the request won’t be answered and an LMS worker will stay busy forever. This can eventually block all the LMS workers, and then the LMS stops processing requests.

Not every LMS request blocks the database; so far we’ve localized the problem to changing a user’s password, but there may be more cases.

All in all, it seems that separating the database from the application is needed to improve MySQL performance and thus prevent transactions from blocking, so we are looking into it. Meanwhile, we are going to adjust the worker configuration so that workers are “autokilled” after 5 minutes of being busy.
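For the record, the “autokill” we mean is uwsgi’s harakiri option; assuming your Tutor version serves the LMS through a uwsgi.ini you can override (e.g. via a plugin patch), the relevant fragment would look like:

```ini
[uwsgi]
; Recycle any worker that stays busy longer than 300 seconds,
; so a single stuck MySQL query cannot pin a worker forever.
harakiri = 300
harakiri-verbose = true
```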

Hopefully these settings will overcome the error.

Thanks for the tip anyway

For our own education, can you give us an estimate of your daily active users? It would be super useful to help us make recommendations for scaling up an Open edX platform.

Hi @Yago,
Good that you found the cause of the problem. In a production environment it is recommended to have both MySQL and MongoDB in separate clusters. Also consider deploying Open edX on Kubernetes with Tutor. This will allow you to autoscale your installation.

Hi @regis , we are not currently measuring the number of daily active users :frowning: . Looking at the logs we believe it may be somewhere between 50 and 200 active users per day.

Crawling through the logs with docker logs tutor_local-lms-1 2>&1 | grep "Login success" | awk '{print $1, $2}' | cut -d' ' -f1 | sort | uniq -c we counted from 30 to 150 logins per day last week.

This may not be the best approach to measure active users, so maybe you can give us a tip :slight_smile: We are looking forward to integrating Cairn or another analytics tool, but it could take time.
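One possible refinement, assuming each “Login success” line also carries the username as its last field (the field positions here are assumptions; adjust the awk fields to your actual log format), is to count unique (day, user) pairs instead of raw login events:

```shell
# Unique users per day rather than login events per day:
docker logs tutor_local-lms-1 2>&1 \
  | grep "Login success" \
  | awk '{print $1, $NF}' \
  | sort -u \
  | awk '{print $1}' \
  | uniq -c
```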

BTW thank you very much for your hard work with tutor, it is an amazing tool!

Hi @Andres.Aulasneo, thanks for the tip.

We are currently working on moving dbs to another server, and in the mid-term to Kubernetes.

Now that workers are killed after 5 minutes of being busy and we have 16 lms workers, the LMS no longer blocks. But some functionality, like sending the password-reset email, is still not working: uwsgi keeps waiting for MySQL to answer, and the worker stays busy until it is killed.

I don’t know whether k8s or separating the db will help with this problem, as it seems linked to users’ data and emails rather than platform performance. I’ve created the topic Error sending bulk emails - Site Operators / Tutor Help - Open edX discussions where I explain more on this subject.

Many thanks for all your help