We are currently running the Open edX Quince release in our production environment, using a single-server installation managed through Tutor (t2.xlarge EC2 instance). While the platform generally works well, we’ve encountered a significant issue during batch enrollment operations.
Our LMS is designed for institutions where self-enrollment is disabled, and all courses are invitation-only. Instructors frequently use the batch enrollment feature from the Instructor tab > Membership to enroll large groups of students (500+ per batch).
From what I can see on AWS, a t2.xlarge EC2 instance has 4 vCPUs and 16 GB of RAM.
In my experience that is too small an instance to run a responsive platform for multiple concurrent users.
But if you don't have the budget for more hardware, you can try limiting the number of Celery worker processes in the lms-worker and cms-worker services, for example by adding --concurrency=1 so that only one worker process starts per service instead of one per CPU (Celery's default behaviour).
You would need to create a Tutor plugin that adds --concurrency=1 to both the LMS_WORKER_COMMAND and the CMS_WORKER_COMMAND (the sketch at the end of this reply shows the flag in context).
The consequence is that you limit the number of asynchronous tasks your platform can execute in parallel, which leaves more resources for your online users. Batch enrollment would take longer to complete; other impacts are that password-recovery emails may go out more slowly, course progress may take longer to update after your learners complete tests/exams, and course certificates could take longer to generate.
To mitigate that, you could create another Tutor plugin that adds more lms-worker services with --concurrency=1, one for each Celery queue that edx-platform uses (see the sketch below).
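If it helps, here is a rough sketch of what such a plugin could look like. This is only a sketch: it assumes the tutor.hooks plugin API and the local-docker-compose-services patch, and the service name lms-worker-high, the queue name edx.lms.core.high, the settings module and the volume paths are just examples. Copy the real values from the stock lms-worker service in $(tutor config printroot)/env/local/docker-compose.yml for your Tutor version.

```python
# extra_lms_worker.py -- save in "$(tutor plugins printroot)", then run
# "tutor plugins enable extra_lms_worker && tutor config save".
#
# Sketch only: the service name, queue name, settings module and volume paths
# below are assumptions; copy the real values from the stock lms-worker service
# in "$(tutor config printroot)/env/local/docker-compose.yml" for your version.
from tutor import hooks

hooks.Filters.ENV_PATCHES.add_item(
    (
        "local-docker-compose-services",
        """
# Extra LMS worker pinned to a single Celery queue, limited to one process.
lms-worker-high:
  image: {{ DOCKER_IMAGE_OPENEDX }}
  environment:
    SERVICE_VARIANT: lms
    DJANGO_SETTINGS_MODULE: lms.envs.tutor.production
  command: celery --app=lms.celery worker --loglevel=info --concurrency=1 --queues=edx.lms.core.high
  restart: unless-stopped
  volumes:
    - ../apps/openedx/settings/lms:/openedx/edx-platform/lms/envs/tutor:ro
    - ../apps/openedx/settings/cms:/openedx/edx-platform/cms/envs/tutor:ro
    - ../apps/openedx/config:/openedx/config:ro
  depends_on: ["lms"]
""",
    )
)
```

Repeat the service block once per queue, each with its own --queues value. If you only want to cap the concurrency of the existing lms-worker and cms-worker services, I believe Tutor also loads a docker-compose.override.yml from the env/local directory, which lets you append --concurrency=1 to their commands without writing a plugin.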
Thank you for your suggestion. We have also scaled up resources, such as upgrading to t2.2xlarge instances. However, we’ve observed that the batch enrollment task is not being executed in the background using Celery. As a result, this task does not get routed to the lms-worker and remains unaffected by changes in the worker configuration.
This behavior means that until the batch enrollment task is completed, all other requests are blocked. Additionally, during this time, we’ve noticed minimal increases in CPU or RAM utilization.
Please let me know if you have any suggestions for it.
T instances are not optimized for CPU-heavy tasks (they are burstable instances); try the M or C families to see if it's better.
Or you can inspect the code and change it to split the batch into multiple smaller chunks, iterating through them with a delay between chunks, something like the sketch below.
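To illustrate the idea only (this is not edx-platform's actual enrollment code, just a generic sketch; enroll_one is a hypothetical stand-in for whatever performs a single enrollment in the code path behind the Instructor dashboard):

```python
import time
from typing import Callable, Iterable, List


def chunked(items: List[str], size: int) -> Iterable[List[str]]:
    """Yield successive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def enroll_in_chunks(
    identifiers: List[str],
    enroll_one: Callable[[str], None],
    chunk_size: int = 50,
    pause_seconds: float = 1.0,
) -> None:
    """Process enrollments in small batches, pausing between batches so the
    web process (or worker) gives CPU and database time back to online users."""
    for chunk in chunked(identifiers, chunk_size):
        for identifier in chunk:
            enroll_one(identifier)
        time.sleep(pause_seconds)
```

With chunk_size=50 and a one-second pause, a 500-student batch only adds roughly ten seconds of waiting, but the platform gets a breather between chunks instead of being blocked for the whole run.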