Tech talk/demo: Deploying multiple Open edX instances onto a Kubernetes Cluster with Tutor

Hi Jhony, it is recorded. I just haven't had time to upload and link it yet; I should be able to this evening.

Hey @dbates thanks so much! Please tag me once this is ready

Jhony, bad news and good news:

The file was recorded, but only for a few minutes. It seems they ended up using Peter's account to run the session. @pdpinch did you manage to record it?

@kmccormick as a backup to the recording, do you mind posting the notes taken during the BTR meeting we held at the conference here?

@keithgg Got you covered → https://discuss.openedx.org/t/build-test-release-notes-lisbon-2022

Note-taking during that session was pretty frantic, so let me know if anything there doesn't make sense :slight_smile:


Hi all,
I would like to write down a brief summary of our experience with Kubernetes and Tutor deployments. Your comments will be appreciated!

I have created a separate post to discuss the infrastructure reference architecture.

These are some of our findings:

  • Multiple Open edX instances can be deployed in a single cluster, assigning one namespace to each Open edX instance (see the sketch after this list).
  • K8s is straightforward for stateless pods. However, for stateful parts of the application, special care is needed.
  • Databases and storage apart, lms, lms-worker, cms, cms-worker, mfe and forum are stateless, so they should work fine without much extra care.
  • Databases are complex systems.
    – We have four: MySQL, MongoDB, Redis and ElasticSearch.
    – Each of them has different and specific dynamics in terms of scaling and redundancy. An incorrect setup may lead to data loss, degraded performance, loss of synchronization, etc.
    – Setting up a database properly in K8s may need careful tuning, different for each technology.
    – Our conclusion is that it is better to rely on the cloud provider’s services for databases, and let them manage redundancy, scalability, fine tuning, updates, etc.
  • File storage can be managed by either
    – out-of-the-box MinIO, in which case files are stored in a block service (see next), or
    – the cloud provider's storage service (S3 on AWS), which is our option
  • Block storage: Stateful pods need a place to store persistent data. In K8s, these are PVs (Persistent Volumes). The actual location and driver for each PV is defined in a storage class. The default storage class uses the node's storage to allocate space for the PV. This is not the best option, as the data will not survive a node recreation, or a pod being recreated on another node. If we want scalable and resilient block storage, we need to use an NFS (network file system) PV, which can be complex to set up and is dependent on the cloud provider.
  • However, if the databases are left out of the cluster and lms, cms and plugins are stateless, there is no need to use block storage. With one exception:
  • Caddy, which deserves a separate paragraph.
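
To make the one-namespace-per-instance point concrete, here is a rough sketch of driving several instances from one machine, with a separate Tutor project root per instance. The instance names, hosts and directory layout are hypothetical; the tutor commands and the K8S_NAMESPACE / LMS_HOST settings are standard Tutor as far as I know, but verify them against your Tutor version.

```python
# Sketch only: loop over hypothetical instances, giving each its own
# Tutor project root and Kubernetes namespace. All values are placeholders.
import os
import subprocess

INSTANCES = {
    # instance name -> LMS host (hypothetical examples)
    "client-a": "lms.client-a.example.com",
    "client-b": "lms.client-b.example.com",
}

for name, lms_host in INSTANCES.items():
    env = {**os.environ, "TUTOR_ROOT": f"/opt/tutor/{name}"}  # one config dir per instance
    subprocess.run(
        ["tutor", "config", "save",
         "--set", f"K8S_NAMESPACE={name}",   # one k8s namespace per instance
         "--set", f"LMS_HOST={lms_host}"],
        env=env, check=True,
    )
    # Renders the manifests and applies them to this instance's namespace.
    # On a first deployment you would also run the init jobs (tutor k8s init).
    subprocess.run(["tutor", "k8s", "start"], env=env, check=True)
```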

The next post is about databases.


Databases

I would like to differentiate two concepts here that are usually conflated:

  • the database engine, of which we have four: MySQL, MongoDB, Redis, ElasticSearch
  • the database itself, as the logical organization of data inside each database engine. There can be more than one in each engine.

This is important because we wanted to minimize infrastructure resources (and therefore costs) by sharing as much infrastructure as possible.

Modern database engines are powerful pieces of software that can handle huge amounts of data. So the idea is to have one database engine of each type for the whole cluster, and one database for each Open edX instance in the cluster (that is, one per k8s namespace).
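
As a hedged illustration only, provisioning a new instance on the shared engine then boils down to creating one more logical database and user. The host, credentials, naming scheme and the mysql-connector-python dependency below are assumptions for the sketch, not part of our actual setup:

```python
# Sketch: create a per-instance logical database and user on a shared MySQL engine.
import mysql.connector  # pip install mysql-connector-python

def provision_instance_db(instance: str, password: str) -> None:
    db = f"openedx_{instance}"  # hypothetical naming scheme: one DB per instance
    admin = mysql.connector.connect(
        host="shared-mysql.example.internal", user="admin", password="change-me"
    )
    cur = admin.cursor()
    # Plain string building for readability; real code should escape identifiers properly.
    cur.execute(f"CREATE DATABASE IF NOT EXISTS `{db}` CHARACTER SET utf8mb4")
    cur.execute(f"CREATE USER IF NOT EXISTS '{db}'@'%' IDENTIFIED BY '{password}'")
    cur.execute(f"GRANT ALL PRIVILEGES ON `{db}`.* TO '{db}'@'%'")
    admin.close()
```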

Most of our findings are about AWS services for the database engines.

MySQL

AWS offers two options: RDS for MySQL and Aurora MySQL. Aurora should be the best option, as it is fully MySQL-compatible, has lower costs and is optimized for AWS cloud resources. It can be configured with multiple read replicas across several AZs.
We are creating one database for each Open edX installation.
Regarding backup, AWS Backup is great but works at the database engine (instance) level, not at the level of the logical databases described above. I asked the AWS folks about this, and it looks like they do not support per-database backup currently. This means that if you have multiple Open edX instances and you want to restore only one, you cannot: you have to restore them all. So, for one-by-one backups, the old dump-and-upload-to-S3 strategy is the only option.
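
A minimal sketch of that dump-and-upload strategy, assuming mysqldump is available where the job runs and using boto3 for the S3 upload; the bucket, host and credential handling are placeholders:

```python
# Sketch: per-instance logical backup of a shared RDS/Aurora MySQL engine.
import datetime
import subprocess

import boto3  # pip install boto3

def backup_database(db_name: str, host: str, user: str, password: str, bucket: str) -> None:
    stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
    dump_file = f"/tmp/{db_name}-{stamp}.sql"
    # Dump only this Open edX instance's database, not the whole engine.
    # (Passing the password on the command line is for the sketch only;
    # real code should use an option file or environment variable.)
    with open(dump_file, "w") as out:
        subprocess.run(
            ["mysqldump", "-h", host, "-u", user, f"-p{password}",
             "--single-transaction", db_name],
            stdout=out, check=True,
        )
    # Upload the dump to S3 so a single instance can be restored on its own.
    boto3.client("s3").upload_file(dump_file, bucket, f"backups/{db_name}/{stamp}.sql")
```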

MongoDB

AWS offers some options:

  • DynamoDB: Don’t even try. It’s not compatible with MongoDB.
  • DocumentDB: There is an old discussion about this. Although they say it's compatible with MongoDB, it is not 100%. In a recent talk with an AWS engineer, they confirmed that these incompatibilities are still present.
  • Atlas: It's the official MongoDB offering, so it should be 100% compatible. Too expensive for us to try.
  • MongoDB quickstart: a CloudFormation template that creates a MongoDB cluster in your VPC. It's our choice and it has worked well so far.

Open edX requires one database for the modulestore. If you are using the forum, it will require another database.

As MongoDB is close to being deprecated, I wouldn't spend much effort on it.

Redis

Redis is used for two main purposes in Open edX: as a cache and as the backend for Celery queues.
The concept of a database in Redis is odd: there are no named databases, only a numeric index from 0 to 15. As we need two per Open edX instance (one for the cache and one for Celery), we have a limit of 8 Open edX instances per Redis engine.
If anyone knows how to overcome this limitation, it would be much appreciated!
We've been using AWS ElastiCache for Redis and it works well so far.
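
To make the index arithmetic explicit, here is a tiny sketch of how the two Redis databases per instance could be assigned on a shared engine; the OPENEDX_CACHE_REDIS_DB / OPENEDX_CELERY_REDIS_DB setting names reflect my understanding of Tutor's configuration and should be verified against your Tutor version:

```python
# Sketch: assign two Redis logical databases (cache + Celery) per Open edX
# instance on a shared Redis engine with 16 numbered databases (0-15).
REDIS_DATABASES = 16  # default for a stock Redis / ElastiCache node

def redis_indexes(instances: list[str]) -> dict[str, dict[str, int]]:
    if 2 * len(instances) > REDIS_DATABASES:
        raise ValueError("Shared Redis engine is full: max 8 Open edX instances")
    return {
        name: {
            # Setting names below follow Tutor's config as I understand it; verify.
            "OPENEDX_CACHE_REDIS_DB": 2 * i,
            "OPENEDX_CELERY_REDIS_DB": 2 * i + 1,
        }
        for i, name in enumerate(instances)
    }

print(redis_indexes(["client-a", "client-b"]))
```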

ElasticSearch

We've been trying AWS OpenSearch. It is probably the most expensive service and the most difficult to scale. We had to create a new domain for each Open edX instance, which creates a whole new set of resources.
So if anybody has found a better way to scale this service, again it would be much appreciated.


Thank you @kmccormick!

To summarise:

  • OpenCraft has Grove [under heavy development]
  • Lawrence has Cookiecutter
  • EduNext has Shipyard which isn’t open source.

We all approach deploying Open edX similarly (please let me know if I’m mistaken):

  1. Terraform provisions the infrastructure
  2. Kubernetes runs the application
  3. Tutor deploys Open edX onto the cluster

Our approaches to each of these are different [especially with respect to build pipelines], but there seems to be a lot of commonality especially with Terraform so it makes sense to try and collaborate to share the maintenance burden.

What would be the best next steps here? @lpm0073 do you think it might be worthwhile to somehow integrate Grove's Terraform code into your project? There are some differences (especially for MongoDB, where we're using Atlas instead), but it seems like a good way to start.

cc: @gabor @braden


Short answer is yes.

The Cookiecutter functionality grows as a function of whatever my clients request. Since our last meeting in Lisbon, I've added more deployment logic for ecommerce, MFE, Discovery, and Credentials. I've also added some rudimentary tools for migrating native MySQL data to Tutor. Lastly, I reverted to Tutor's default MongoDB service that runs "locally" on Kubernetes as a pod, due to compatibility problems with AWS' remote service. This is also less than perfect, though; you can read more here: https://discuss.openedx.org

What’s primarily of interest to me in Grove are the Azure and GCC (or Digital Ocean??) capabilities as the Cookiecutter is AWS-only, though my clients are keenly interested in leveraging other, lower-cost cloud providers.

I’d be willing to get involved in the heavy lifting of making Grove’s Azure/GCC code work inside the Cookiecutter assuming that this would lead to others becoming involved afterwards to keep this new code maintained.

As it stands, and as I’d mentioned at the last meeting, the basic care and feeding of the AWS-specific Terraform code is more than I’d prefer. Net, the Cookiecutter is a good tool for me, for upgrading clients and for doing clean installations, and overall it does save me considerable time and I get more consistent results by using Terraform. But the maintenance is non-trivial :weary:


Just to clarify, that's DigitalOcean and AWS. I cannot see any problems with supporting other providers that have proper Kubernetes and Terraform support, like GCP, Azure or similar, though that requires some effort to define the provider-specific resources; that's way less effort than extending the Cookiecutter functionality, I believe.

Depending on the Terraform provider's "verbosity", the necessary amount of code differs a lot, but it's worth checking out the aws-provider and digitalocean-provider in the repo. (DigitalOcean is a lot less code, and most of the components are shared to reduce the complexity you mentioned, too.)


@Andres.Aulasneo We encountered some related limitations, so for now we just have Tutor deploy a separate Redis instance for each Open edX instance. Redis has very minimal memory overhead and that way it’s easier to set a per-instance cache limit.

To help establish what we agree to develop, use and maintain in common, it could be useful to describe it in an OEP? It could become the standard/supported Open edX solution for large deployments, helping to focus usage and contribution activity on this within the community? Nobody would be forced to use it, but that could help bring more providers to use it and maintain it with us?


The idea behind removing Redis from the cluster is that it is a stateful component, and the volume is by default in the node’s storage.
During failover tests, I’ve found that if you kill a node, a new node is generated, and all lost pods are recreated in the new node. However, stateful pods may fail due to node affinity if the new node is in a different AZ from the original. The final solution for this is to use NFS volumes, but this is too complicated for such a simple deployment. That’s why we preferred to remove Redis from the cluster. Maybe you have a better solution for this!
Besides Redis, the only stateful pod is Caddy. For this we found that using an emptyDir volume fixes the problem, but I'm still in doubt.

I think that we should have an official reference architecture that meets all the best practices for a full-scale production environment. As you said, each one can then choose to what extent to implement it. But at least we should lay out the best practices and desirable characteristics of such a deployment (security, resiliency, scalability, etc.) and how to achieve them.
AWS has the Well-Architected Framework that can be used as a starting point. It is quite general and can be applied to any cloud or on-premises deployment.


Ah, I see. We don’t treat Redis as a stateful component. While Redis can be configured in a manner which allows for some stateful guarantees, we do not depend on these for production use and don’t recommend anyone put anything within it they are unwilling to lose. So that explains the different approach.


@Andres.Aulasneo do you use Terraform or CloudFormation, or something else to manage your infrastructure? Since it seems like we’re going to work with @lpm0073 to see if we can adapt our Terraform repository to cover his use cases as well as ours, I’m wondering if it would be possible to also adapt it to be able to provision your reference architecture as well.

I use AWS CDK. I know nothing about Terraform, but it shouldn't be difficult to learn. I wonder if Terraform can deploy any kind of AWS resource. It would be great if we could all join our efforts in one solution that's better for all.


I have been trying to follow this discussion as it develops, but it was moving faster than I could focus on it to reply with some of our thoughts from edunext.

Both this thread and the one on cloud architecture for AWS go very deep, and the collaboration that is emerging is great. However, I sense that it is focusing precisely on the part of the issue that will be the most difficult to agree on.

Things like:

  • using the same gitlab, github, bitbucket or any other git service + CI combination
  • tenancy decisions: one install per cluster, one install per namespace, or multiple tenants per install
  • the infrastructure we will run the openedx services in (aws, digital ocean, azure, …)
  • how to sync the k8s manifest with the cluster

Also, in my view, although infrastructure provisioning can be a painful thing, it most often ends quite rapidly and you are good for a relatively long while. This is even more so once the switch to k8s is made. Provisioning the cluster is a one-time thing, but keeping up with the manifest and what is inside the cluster is what requires attention for longer. This was the initial idea we started with when we set out to develop Shipyard at edunext.

I might well be wrong on this, and the push to make a reference installation starting from the cloud components and using Terraform might be the way to go. I won't stand in the way, and we will undoubtedly collaborate on it and bring the experience that we have gathered to it. However, I'd like to comment on a different approach. We have for years already agreed on good practices and patterns that we can leverage for this:

  • open/closed principle for architecture.
  • tutor is great software and it does its job very well
  • kubernetes is a solid choice (even if some installations decide to use something else for orchestration)

Also, we all know how big and unruly configuration grew to be, and we don't want to follow the same path, with the main difference being that the code is split across a bunch of repos instead of one.

What I'm thinking here is that we could take some lessons from the governance of big and complicated repos such as edx-platform and set up a project where we share our common experience of hosting and maintaining Open edX installations, especially those of larger sizes.

This piece of code should be very much oriented towards ease of extension, but it would also have a well-defined list of maintainers for the different parts of the core. The main goal of this project would be to render files that can be used as a manifest for a k8s cluster that hosts Open edX. How you apply this manifest to the cluster is everyone's own choice: it could be with a hosted CI or with tutor k8s init. This would also be covered by the same ease-of-extension policy.

This is quite a raw idea, and if I were reading this I would be like "sounds nice, show me the code or I'd call bs". Just throwing it here to see if there is any traction. I'll get started on a prototype to prove to myself that this is actually worthwhile.

Further considerations:

  • I'm talking about a flexible renderer, and Tutor is an amazing and extensible template and file manager. It follows naturally that the piece of code we write should be a Tutor plugin (see the sketch after this list). What I'd like is for this project to lift the burden of supporting the issues that arise while operating Tutor. If this is a non-issue (which I don't think it is) and we are better served by improving support for the Tutor plugins individually, please say so.

  • this project would probably be better off living in the openedx org so that there is no space for provider egos anywhere. It should also be covered by the core committer program.
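
To make the rendering idea a bit less abstract, here is a minimal sketch of what such a plugin could look like, assuming the Tutor v1 Python plugin API (tutor.hooks); the setting, the patch name and the YAML snippet are illustrative placeholders, not a working manifest:

```python
# Sketch of a Tutor plugin that contributes a config default and a k8s override patch.
# The filter names come from the Tutor v1 plugin API as I understand it; the patch
# name and YAML content below are placeholders for illustration only.
from tutor import hooks

hooks.Filters.CONFIG_DEFAULTS.add_items([
    ("LARGE_DEPLOYMENT_LMS_REPLICAS", 3),  # hypothetical setting
])

hooks.Filters.ENV_PATCHES.add_item((
    "k8s-override",  # patch point for overriding rendered k8s resources (verify name)
    """
apiVersion: apps/v1
kind: Deployment
metadata:
  name: lms
spec:
  replicas: {{ LARGE_DEPLOYMENT_LMS_REPLICAS }}
""",
))
```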

cc @jhony_avella @MoisesGonzalezS @mgmdi


I wanted to share a general update with this audience. This afternoon I released v1.0.0 of the Cookiecutter, the first general production release for fully automated build & deploy onto AWS. I have a continued interest in porting this code base to Digital Ocean, GCC, Azure et al., as well as to other popular CI platforms.

Related: the original GitHub Actions build & deploy workflows have been refactored into a collection of reusable components that you'll find here: https://github.com/openedx-actions. These components greatly simplify working with Open edX on Kubernetes, and most are designed to work independently of the Cookiecutter itself. Today I also promoted several of these actions to v1.0.0. See the READMEs on each for any details that might bear on your decision of whether or not to incorporate these into your projects. The following actions are now available for general production use:

@Andres.Aulasneo "any kind of AWS resource" is a very broad statement; however, I can confirm that to date I've been able to find high-quality, vendor-supported Terraform components for everything that I needed for the Cookiecutter automation of building the AWS environment, which includes: VPC, EC2, EKS, ECR, RDS, IAM, S3, ElastiCache, Certificate Manager and Route53.