Open edX Cloud Reference Architecture

Hi all,
I would like to start a discussion about a reference architecture of cloud services to host a production implementation of Open edX.

Could this be the starting point for an OEP?

With Tutor, it is clear that a production deployment cannot be achieved with a local installation. Instead, we need to use Kubernetes to deploy to a cluster in the cloud.

First of all, we want the infrastructure to be:

  • resilient:
    – if a pod stops responding, a new pod should be created on the same or another node
    – if a node goes down, the lost pods should be recreated on other nodes, and a replacement node should be created
    – the nodes should be spread across multiple availability zones (AZs). If an AZ is lost, all the lost resources should be recreated in another AZ.
  • scalable:
    – if a pod's resource consumption (CPU, memory, network) exceeds a threshold, a new pod should be created and the load should be balanced between them
    – if there is no more room on the existing nodes for new pods, a new node should be launched
    – scaling should work in both directions: up (grow) and down (shrink when demand drops)
  • secure:
    – implement security measures to keep the application and data safe from potential attacks
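As a rough illustration of the scaling rule above, here is a minimal sketch of the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance (0.1 by default in Kubernetes). The function name, parameter names, and the min/max defaults are illustrative assumptions, not part of the architecture described in this post.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA replica calculation (illustrative sketch only).

    current_metric and target_metric must use the same unit, for example
    average CPU utilization in percent across the pods of a deployment.
    """
    ratio = current_metric / target_metric
    # Within the tolerance band the replica count is left unchanged,
    # which avoids flapping on small fluctuations.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds (hypothetical defaults here).
    return max(min_replicas, min(max_replicas, desired))


# Scale up: 3 pods at 90% CPU against a 60% target.
print(desired_replicas(3, 90, 60))
# Scale down: 4 pods at 20% CPU against a 60% target.
print(desired_replicas(4, 20, 60))
```

The same calculation covers both directions, which is why a single target value is enough to grow under load and shrink when demand drops. Adding nodes when pods no longer fit is a separate mechanism and is not modelled here.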

In particular for Open edX, we also want it to:

  • allow multiple instances of Open edX in the same infrastructure
  • optimize resources as much as possible in order to reduce operational costs.

We have been working on this reference architecture diagram. It was prepared for AWS, but the concepts can be extended to any cloud provider.

Contributions are welcome! If anyone wants to collaborate on this diagram, let me know and I can give you access. It was created in Lucidchart.

References:

  1. All resources are contained within a dedicated VPC, across three availability zones. There are three types of subnets: one public, for Internet access and the bastion host, and two private, one for applications and another for databases.
  2. Route53 is used to point all external domains to internal addresses.
  3. CloudFront serves static assets for higher performance.
  4. The ingress is protected with a web application firewall.
  5. Only the bastion host is exposed directly to the Internet. Administrative access to any other service is only possible from this host. To access the bastion host we use Systems Manager’s Session Manager.
  6. The core services (LMS, CMS and forum) are implemented in a Kubernetes cluster in EKS, with one namespace per instance. The nodes span all the AZs in an ASG and the ingress is controlled by a load balancer. The images are stored in ECR. All secrets are retrieved at deployment time from AWS Secrets Manager.
  7. The Redis service is implemented in AWS ElastiCache for Redis. We use one Redis DB for each namespace.
  8. Amazon OpenSearch Service (formerly Amazon Elasticsearch Service) is the search engine.
  9. The relational database is implemented using RDS Aurora for MySQL. The cluster can have replicas in each AZ. We use one database for each namespace in the same RDS cluster.
  10. MongoDB is implemented using the AWS MongoDB quickstart template in three AZs. We use one database for each namespace in the same cluster.
  11. Emails are sent by the LMS using AWS SES.
  12. Static assets are stored in an S3 bucket. We have one bucket per namespace.
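
To make the per-namespace layout in points 6, 7, 9, 10 and 12 concrete, here is a minimal sketch of how the resources for each Open edX instance might be derived from the instance name. All names here (`InstanceResources`, `allocate`, the `_openedx` suffix, the `openedx-assets` bucket prefix) are hypothetical and not taken from the diagram. The limit of 16 logical Redis databases is an assumption based on the Redis default and should be checked against your ElastiCache configuration.

```python
from dataclasses import dataclass

# Assumption: Redis exposes 16 logical databases by default.
REDIS_LOGICAL_DBS = 16


@dataclass(frozen=True)
class InstanceResources:
    namespace: str         # Kubernetes namespace, one per instance
    mysql_database: str    # database in the shared Aurora cluster
    mongodb_database: str  # database in the shared MongoDB cluster
    redis_db: int          # logical DB index in the shared Redis service
    s3_bucket: str         # dedicated bucket for static assets


def allocate(instance_names, bucket_prefix="openedx-assets"):
    """Map each instance name to its own slice of the shared services."""
    if len(set(instance_names)) != len(instance_names):
        raise ValueError("instance names must be unique")
    if len(instance_names) > REDIS_LOGICAL_DBS:
        raise ValueError("more instances than logical Redis databases")
    resources = []
    for index, name in enumerate(instance_names):
        # Hyphens are awkward in database names, so use underscores there.
        db_name = name.replace("-", "_") + "_openedx"
        resources.append(InstanceResources(
            namespace=name,
            mysql_database=db_name,
            mongodb_database=db_name,
            redis_db=index,
            s3_bucket=f"{bucket_prefix}-{name}",
        ))
    return resources


for item in allocate(["acme", "demo-school"]):
    print(item)
```

Deriving everything from the instance name keeps the shared clusters (RDS, MongoDB, Redis) common to all instances while each namespace only sees its own database, Redis DB and bucket.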

this covers most of your 12 points: https://github.com/lpm0073/cookiecutter-openedx-devops

note the following:

  • on point 4, there are actually a variety of firewall protections. the cookiecutter creates somewhere around 10 security groups and IAM roles
  • on point 10, the AWS DynamoDB (hosted MongoDB) service has compatibility problems. i had to back away from using this

there are several other things the Cookiecutter does, such as installing hastexo’s experimental backup plugin, credentials plugins, MFE plugin, and so on.

@Andres.Aulasneo This is very nice, thanks for sharing!

@gabor We should probably do a similar diagram for Grove in the READMEs in the aws-provider and digitalocean-provider folders - I think it would be helpful. Our approach is overall fairly similar, but maybe a little less sophisticated and with minor differences: for example, we use Atlas for MongoDB and let Tutor deploy Redis.

Actually we didn’t implement the firewall yet… I still have to redefine the ELB / ALB issue.

Regarding MongoDB, as you said, DynamoDB has compatibility issues, and so does DocumentDB. Instead we use a MongoDB quickstart template that has worked very well so far.