Horizontal Scaling - CMS/LMS and Workers

I’ve been reading over the Tutor documentation on how to horizontally scale out the Open edX system for the CMS and LMS applications and their corresponding workers:
Running Open edX at scale — Tutor documentation (edly.io)

Looking at the lms-worker Docker service, I noticed that the lms service is listed as a dependency.

If we were to use an application load balancer for the lms-worker, would the lms service need to run alongside this Docker container on the same machine?

How does the lms-worker communicate with the lms, and vice versa? Does it use Celery with a Redis backend for this?

We’re currently not using Kubernetes.

cc @dave

You are correct that the communication between lms and lms-worker happens via Celery, with Redis serving as the broker.

Strictly from edx-platform’s point of view (i.e. not considering Tutor), there’s no need for lms and lms-worker to run on the same server. They do need the same code and similar configuration settings, but edx.org has long run them as separate scaling groups. The lms servers render web requests to users and stick tasks onto Redis, and the lms-worker servers pick up those tasks and execute them.

You shouldn’t have to add a load balancer for lms-worker specifically; the workers will all connect to your Redis instance and take advantage of Redis’s queue-like facilities to distribute work among themselves.
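To make that concrete, here is a minimal standalone Celery sketch of the pattern. The module name, task name, and Redis URLs are made up for illustration; edx-platform defines its own tasks and settings.

```python
# celery_sketch.py -- illustrative only, not edx-platform code.
# Both the web process and the worker load the same Celery app, configured
# to use a shared Redis instance as the broker (and optional result store).
from celery import Celery

app = Celery(
    "celery_sketch",
    broker="redis://redis.example.internal:6379/0",   # shared Redis broker
    backend="redis://redis.example.internal:6379/1",  # optional result store
)

@app.task
def grade_course(course_id: str, user_id: int) -> str:
    # Runs on whichever worker pulls the task off the Redis queue.
    return f"graded {course_id} for user {user_id}"

if __name__ == "__main__":
    # In the web process, enqueuing is just a .delay() call; the web server
    # returns immediately while a worker executes the task asynchronously.
    grade_course.delay("course-v1:Demo+101+2024", 42)
```

A worker for this sketch would be started with `celery -A celery_sketch worker`; any number of workers on any machines can point at the same broker URL, and Celery distributes the queued tasks among them.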

I know much less about Tutor, but my understanding is that you’ll need to start using k8s to actually spread your load across multiple physical servers. I may be misinterpreting this, but it sounds like you might be considering an attempt to run a separate, non-k8s Tutor instance to host your lms-workers, and have it talk to your existing Tutor instance via a shared Redis. If you are not considering this, please ignore the rest of this post.

If you are considering the above, I think it will be very painful. I guess it might technically work if you spin up a separate non-k8s Tutor instance with mostly just lms-worker containers and use config to point it to a shared Redis + shared MongoDB + shared file/object store (S3?) + shared MySQL + shared Elasticsearch (i.e. everything that persists data). But I don’t know of anyone who runs it like that, and I imagine any such setup would be fragile and break in weird ways in practice, because there’s going to be a bunch of non-obvious config that will need to be just right (e.g. encryption/signing keys).
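As a rough illustration of what “everything that persists data” plus the key material would have to agree on, here is a sketch using simplified, Django-ish names. These are not the exact Tutor/edx-platform setting keys; treat every name and value below as a placeholder.

```python
# shared_config_sketch.py -- illustrative only, NOT the real Tutor/edx-platform
# settings. The point: every instance that shares work must agree on all of
# these, because they all read/write the same persistent stores and verify
# the same signed data.

# Task queue: web nodes enqueue Celery tasks, worker nodes consume them.
CELERY_BROKER_URL = "redis://redis.example.internal:6379/0"

# Relational data: users, enrollments, grades, ...
MYSQL_HOST = "mysql.example.internal"

# Course content (modulestore/contentstore).
MONGODB_HOST = "mongodb.example.internal"

# Search indexes.
ELASTICSEARCH_HOST = "elasticsearch.example.internal"

# Uploaded media served via a django-storages backend (e.g. an S3 bucket).
MEDIA_BUCKET = "openedx-media-example"

# Signing/encryption material: if these differ between instances, sessions,
# JWTs, and other signed values will fail in confusing ways.
SECRET_KEY = "<must be identical on every instance>"
JWT_SIGNING_KEY = "<must be identical on every instance>"
```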

So I think if you want to scale out LMS workers, the practical path forward is either scaling up your hardware to a much bigger instance, or starting to venture into Kubernetes.

Thanks @dave. That’s right, we were considering putting the lms-workers on a separate machine from the lms, and likewise the cms-workers on a separate EC2 instance, all on non-k8s instances. We were not considering adding a public application load balancer, but rather private EC2 instance(s) that could be used to offload background tasks for the lms-worker and cms-worker applications.

It sounds like this could be done; however, we may be missing some configuration or need to make sure the existing configuration is set correctly. We’ll definitely consider either scaling up the LMS/CMS workers or starting to venture into a k8s configuration.

This is what we were considering with our non-k8s install for our AWS infrastructure:

Persistent Data Storage

  • ElastiCache (Redis)
  • RDS Aurora (MySQL)
  • MongoDB Atlas
  • OpenSearch (Elasticsearch-compatible)

MFEs

  • CloudFront with an S3 origin

EC2, scaled vertically (one instance per service)

  • discovery
  • cms-worker
  • lms-worker

EC2, behind a load balancer with an auto-scaling group (one group per service)

  • cms
  • lms

Yeah. I would strongly urge you to either scale up your server to something bigger so you can spin up a bunch of extra lms-workers within one vanilla Tutor instance, or just go all the way to k8s. Even if you got multiple Tutor instances to work together with shared data stores, running in an unsupported mode like this will likely cause you weird problems when it comes time to upgrade.

Some larger sites have completely customized the deployment process, and some don’t use Tutor at all. But down that path lies a whole lot of extra work and long term maintenance burden that I don’t think you want to tackle if you can help it.

If you do go k8s, definitely check out Harmony :slight_smile: It should help you get a scalable cluster working more easily.

I don’t remember why lms-worker is declared as dependent on lms; that was implemented a long time ago. But I’m guessing that the lms-worker makes HTTP calls to the lms container, so both of them need to be on the same network.

@dave
I’m a little confused about where to install the cms-workers and lms-workers when we’re not running k8s. You mentioned above that edX has long run the lms-worker on a separate server from the lms.

When you talk about scaling up your server to something bigger, are you saying to vertically scale a local install of Tutor with all non-persistent services (cms, cms-worker, lms, lms-worker, …) and skip setting up a load balancer and auto-scaling?

We’re considering the following two configurations at the moment. Can you advise on where you would expect weird problems to occur?
(Note: both configurations would talk to external clustered services for persistent storage: MySQL, MongoDB, Redis, Elasticsearch.)

First Configuration
A load balancer and auto-scaling group would handle additional user requests for the cms and lms applications. Both the cms-worker and lms-worker applications would run on separate EC2 instances and talk to the external Redis cluster.

Second Configuration
Based on your suggestion, we would scale vertically, but with the cms and lms applications separated onto their own instances.

To be clear, edX/2U has its own complex deployment setup that is completely independent of Tutor.

Yes. If you currently have a Tutor instance running and want to scale up, the easiest thing is to upgrade to some ridiculous server instance with 128+ CPU cores and gobs of memory, and crank up the worker count via configuration, ignoring the load balancer and auto-scaling. You don’t get proper failover, but you get easy scaling.

But this is where I also had a misconception. I had thought that the number of Celery worker processes could be scaled up or down in the same way that the uWSGI web worker processes are configured. But it looks like the PR that @arbrandes made for that wasn’t (and won’t be) merged:

So to have actual fine-grained control today, I think you’d have to add a --concurrency param to the lms-worker config in docker-compose.yml.
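For reference, `--concurrency` maps onto Celery’s worker pool size, which can also be expressed as the `worker_concurrency` setting. A hedged sketch of the same knob in code (not Tutor’s actual configuration mechanism; the broker URL is a placeholder):

```python
# concurrency_sketch.py -- illustrative only; Tutor's stock setup would pass
# this on the worker command line in docker-compose.yml instead.
from celery import Celery

app = Celery("celery_sketch", broker="redis://redis.example.internal:6379/0")

# Number of prefork worker processes this one worker container will run.
# Equivalent to passing `--concurrency=8` to `celery ... worker`.
app.conf.worker_concurrency = 8
```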

File storage would be one concern. Not Studio Files and Uploads, which uses MongoDB, but the long tail of other kinds of file storage that use a django-storages backend for various media, like forum post images, course block transformers collection data, etc.
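To illustrate the kind of setting involved: the usual way to make that media storage shared rather than node-local is to point Django’s default file storage at an object store via django-storages. A minimal sketch, with placeholder bucket and region (Tutor/edx-platform expose their own settings for this):

```python
# storage_sketch.py -- illustrative Django settings, not the exact
# edx-platform/Tutor configuration.
# With the default FileSystemStorage, uploads land on the local disk of
# whichever node handled the request, which breaks once web processes and
# workers run on different machines. An S3 backend makes uploads shared.
DEFAULT_FILE_STORAGE = "storages.backends.s3boto3.S3Boto3Storage"
AWS_STORAGE_BUCKET_NAME = "openedx-media-example"  # placeholder bucket
AWS_S3_REGION_NAME = "us-east-1"                   # placeholder region
```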

Another concern would be if there are places where the Celery workers expect to have a locally accessible connection to the LMS that is not exposed externally. I don’t know if that’s ever the case.

Another issue would be if there are any non-obvious config keys that need coordination. I think these are all covered in the central Tutor config.yml file, but I don’t know if there’s any other auto-generated stuff elsewhere in the stack.

Upgrades/migrations might also be a problem down the line, if they assume there’s one active cluster with shared volumes that they can store things in, even temporarily.

It is highly likely that there are other things out there that I’m not aware of.


Please don’t take this the wrong way, but while I am happy to try to explain how things work to the best of my understanding, I am not going to help support you step by step into an unsupported hybrid deployment mode. Tutor is the supported mode for single server deployment. Tutor with k8s is the supported mode for larger scale deployments with autoscaling, failover, and all that goodness–though I’m not familiar with any of the details here, since most of my operational experience comes from edX and this was not how we deployed things.

@dave
You make a lot of good points about django-storages, Celery workers expecting local access to the cms or lms applications, auto-generated config keys, and upgrades.

Maybe we should start over and look at a k8s deployment, since it is the recommended approach for larger-scale deployments with autoscaling, failover, etc.

@braden @dave @regis @lpm0073
Aside from what Braden mentioned, what other documentation exists to explain how to deploy Open edX on k8s?

I found these articles: