Deploying Open edX on Kubernetes Using Helm

See previous discussion

The meeting was held at 15:00 UTC on Zoom. @antoviaque can you post the recording?

Attendees

@MoisesGonzalezS @gabor @jhony_avella @Felipe @braden @lpm0073 @keithgg @antoviaque @sambapete @mtyaka

Since the previous thread was getting a bit long, we’ve decided to create a new thread to facilitate further discussion.

What was discussed

  • To deploy to Kubernetes with tutor still has some pain points:
    • It’s not possible to get rid of Caddy.
    • The built-in persistence for MongoDB/MySQL/ES and Redis is not that reliable.
    • Configuration outside of the scope of tutor is fairly difficult.
    • Building images takes a while, making changes requires rebuilding.
  • @braden is working on the prototype for Helm + tutor and is almost ready to demo.
  • In the meantime, @felipe is working on OEP-59.
  • Getting broader community support is important, as this could become an official method of deploying Open edX to Kubernetes.

Action items

  • @braden will present his prototype at the next meeting (possibly before then) to get feedback.
  • @felipe to update OEP-59 so that it’s an amendment to OEP-45 instead.
  • We are to involve the wider community (especially 2U, as they’re already deploying to k8s) to get their feedback and explore any other ideas.
  • We will have another meeting on the 6th of December at 17:30 UTC. All are welcome :slight_smile:

Meeting notes

Moises:

  • eduNext has encountered problems with tutor. Specifically with Caddy, mixed with Ingress.
  • Can’t get rid of Caddy due to the way it’s specified.
  • Persistence in the cluster for Elasticsearch, Redis and MongoDB/MySQL is not that reliable.
    • In a real installation it doesn’t fulfil the needs.
    • Default settings aren’t enough. Have to add plugins to manage configuration.
  • Plugin ecosystem of tutor has two use-cases:
    • IDAs → Ecommerce, etc.
    • Configuration, it’s difficult to manage configuration for what we need.
      • It’s not just changing the plugin, but then overriding templates to override everything.
  • Using tutor as a deploy tool isn’t working
  • The way tutor manages jobs, hooks are injected into the manifest and applied.

Lawrence

  • Agrees. For him, building the Open edX container is a problem, because it’s so big.
  • He has to rebuild the container whenever, e.g., an XBlock needs to be added.
  • It’s much slower than working natively.
  • It’s very difficult to make small changes to the build.
  • Caddy is a problem; persistence mismatch, availability zones, templates and plugins all have issues.
  • Open edX container is much too large and brittle. Could turn something minor into taking up a lot of time.
  • Helm is worthwhile.
  • Is it possible for the Open edX container to be standalone, with everything else, like XBlocks, added after the fact?

Felipe

  • Suggested that using Docker layers on top of a base Open edX image might be an option.

Lawrence

  • Lawrence is unable to reliably build and deploy unless there’s a codified workflow.
  • He’s currently using GitHub Actions, but is open to anything.
  • There needs to be a way to codify this workflow.
  • It’s currently difficult to get a reliable production build for a client.
  • Bash scripts seem to work, but he would like something better.

Braden

  • It should be possible to create a Docker layer.

Lawrence

  • The image would need to be built every time.
  • Clients are now cognizant of the time it takes to build a container and
    they are taking it into account, so requests are limited. It’s a distraction
    and “dominates the thought process”
  • Other than building images, Kubernetes and Docker are resilient and workable.

Moises

  • We could use different ways of caching builds, like BuildKit, etc.
  • We could handle it later as part of a working group.

Felipe

  • Making the build work better could be within the scope of the BTR.
  • Since it’s something everyone wants, making it available in the wider community is better.

Xavier

  • What is the status of the current Helm collaboration?
  • Can we announce to the community or is there still some work to do?

Braden

  • Will share prototype with the group tomorrow/later.

Xavier

  • Is there anything that’s been agreed upon that we will take on going forward?

Braden

  • We will definitely use Helm.

Lawrence

  • For his clients, Open edX isn’t usually the only piece of software running on their cluster.
  • Helm is suited to this task, because it’s usually used to deploy the other software in the cluster as well.

Felipe

  • There might be complaints from other operators that they need to now learn another technology.
  • Some customers are not in favour of using Helm as they might be tied to different methodologies/tech.

Braden

  • Helm is fairly widely used so we shouldn’t have that issue.

Xavier

  • Concrete steps:
    • Braden finishes prototype and gets feedback.
    • Is there anything missing to get the OEP accepted?

Braden

  • Is the OEP even needed?
  • Ideally, the next step is to build on it. We could just start publishing and collaborate on it?

Xavier

  • OEP is preferred since this project is community centric.
  • There is value in saying that this could be an official solution.

Braden

  • The OEP can be an amendment to the tutor OEP rather than standalone, as OEPs require more effort to get accepted.

Lawrence

  • This isn’t to replace tutor, but the resulting project will live downstream from tutor.
  • In the end, it will replace tutor k8s.

Felipe

  • This is a first step to build upon that.

Xavier

  • We need to consider the Social and Governance aspect.
  • What’s the shortest path to getting to the first step?
    • Amendment to tutor OEP sounds good.
    • Is that something we can do now?
    • Are there still things to figure out regarding it?

Braden

  • Agree on the social aspect.
  • Getting people to look at the code first matters more than publishing an OEP.

Lawrence

  • Willing to write code and assist there where needed.

Xavier

  • We need code, but we also need broader community support.
  • Better to progress on both ends.
  • We are all agreed, but it’s worth letting the community know that if things work out, things might be official.

Felipe

  • We can add an addendum to OEP-45, similar to tutor.

Xavier

  • We will test Braden’s code.
  • Change the format of the OEP to a modification of OEP-45.
  • Will also show that multiple members of the community are working on it.

Lawrence

  • Needs a willingness among the workgroup to standardise on a machine state to build the Open edX images.
  • To benefit from this, we need a way to incrementally build Open edX images that doesn’t take 45 minutes.

Felipe

  • Unfortunately it does get complex so we might need more than just a machine to constantly build the images.

Braden

  • We should investigate changing the tutor Dockerfile to make it more cacheable.

Felipe

  • IMHO that’s a separate project.

Xavier

  • Is there a Discovery item to move the exploration forward?

Braden

  • We should involve Regis next time to get his feedback on the discussion here.

Xavier

  • Let’s schedule the next meeting as we’ve come close to the allotted time here.
  • Everyone is happy to meet again in two weeks.

Felipe

  • We should get more folks to join that might be interested in this discussion.

Xavier

  • Start a new thread on the forum.

  • With a recap on the discussion so far.

  • Involve 2U as well. They don’t use tutor, but k8s. They could have a different approach.

  • Ned was surprised that we’re working on this.

  • 6th of December or the next one. Slightly later if possible. Proposed time, 17:30 UTC.

  • Lawrence offered help with anything code related.

  • Braden asked for help testing the prototype which was agreed to.

Meeting ended

Thanks for taking notes @keithgg. I can assist anyone who needs/wants help with the action items.

@keithgg Thank you for the recap and the notes! Here is the meeting recording video:

I have sent the meeting invite to the same group, if anyone wants to be added just ask me or any participant – or simply join the Zoom room at the time of the meeting.

@lpm0073 Thanks for offering to do some of the work! I think @braden you mentioned there were a few things on which you could use some help?

@kmccormick Our discussions might actually also be somewhat related to the initiative from https://openedx.atlassian.net/wiki/spaces/COMM/pages/3583016961/Developer+Experience+Working+Group ? Your focus there is about the development environment, but there might be similarities on which to collaborate?

CC @nedbat as this is a topic you showed interest for during the last community meetup
CC @regis as this is related to Tutor

CC @Kelly_Buchanan as we have also discussed these DevOps topics – see if the description above from Keith provides you with the info you need? I was about to copy a recap of it to https://openedx.atlassian.net/wiki/spaces/COMM/pages/3574661129/DevOps+Working+Group+Ideation but I noticed that it’s archived.

I’m sorry I couldn’t join the meeting on Tuesday, but something came up. I’ll watch the recording and will definitely join the next one.

Here are my notes on the meeting:

  • @Felipe’s summary of my position on the Helm OEP is accurate: I’d love to make it easier to integrate Tutor & Helm, but I think that we need to do it in a way that is compatible with the existing plugin ecosystem.
  • I’m excited to hear about @braden’s implementation of a load balancer with Helm that plays well with Tutor and the plugin ecosystem.
  • Issues with the current k8s implementation in Tutor:

There’s no way around Caddy

I do not understand this specific point. Caddy serves two purposes in a Tutor-based k8s deployment. It’s:

  1. an ingress, with SSL certificates and all.
  2. a web proxy that handles http requests.

Caddy does not do load balancing. Kubernetes is in charge of that. All Caddy does is: “I’m getting this http request, let’s process it a little, then forward it to http://lms:8000 (for instance)”. Kubernetes (or docker-compose) is then in charge of load-balancing between services named “lms”. I think that Caddy is doing a great job at being a web proxy (role 2). Caddy as an ingress can be disabled by (unintuitively) setting ENABLE_WEB_PROXY=false. So I’m not sure why we are pointing fingers at Caddy here.

Volume persistence is inconvenient, in particular across availability zones

Is this issue related to this other one? Improve storage model of Caddy's pods in K8s · Issue #737 · overhangio/tutor · GitHub Please comment there. I must admit that in this matter I’m limited by my own understanding of k8s volumes. Also, it’s difficult to propose a default implementation that works across k8s providers, who all offer different persistent volume types.

Again, as on many other matters, I’m really open to comments. Would Helm help resolving this issue? In my understanding it wouldn’t, but I might be wrong.

“Managing plugins is tricky”

This topic is close to my heart :slight_smile:

If I understand correctly, the argument that was being made here was that it’s more work for end users to make changes to their Open edX platform by creating a plugin than having the feature baked in upstream in Tutor core.

I have so many things to say on this topic, but I’ll try to keep it brief.

Adding a new feature to Tutor core has heavy consequences for everyone: both Tutor maintainers and Tutor users. Let’s assume that Tutor maintainers have plenty of free time and energy and they are able to competently maintain all features that are added to Tutor core. Let’s focus on Tutor users. Adding more features to Tutor core should not make it harder for these users to maintain their own platforms. But in general, adding a feature does make it more complex.

Let’s take the example of this pull request, which was eventually closed without merging: https://github.com/overhangio/tutor/pull/677 It’s a shame that we’ve lost the original commits from this PR; originally, it introduced auto-scaling to the Open edX platform on Kubernetes. I absolutely love this feature. But it required 32 new Tutor settings to be created, just for Open edX core (without ecommerce, forum, etc.). This is just too much added complexity that all Tutor end users would eventually have to deal with. So I recommended that OpenCraft create a plugin that implements these changes. I also suggested that OpenCraft maintain this plugin, as they clearly have the most expertise on the topic of Kubernetes auto-scaling.

My bottom line is this: addressing a few users’ use cases should not make life more difficult for many others. If you have a very specific use case, then there’s a good chance that you are not an individual but a company, and one with a dedicated team of engineers working on Open edX. It’s only fair that you put in the work, create a plugin and maintain it. This “philosophy” is the origin of many design decisions that have happened inside Tutor. In particular: the recent switch to the more powerful plugin V1 API, the extensive documentation of the plugin creation process, the future creation of third-party plugin indices, etc.

“The Tutor CLI is not convenient; in particular for jobs”

I think that the CLI is okay (and it will be further improved in Olive) but I agree that the current implementation of k8s (and docker-compose) jobs is clunky.

Basically, the K8sJobRunner manually patches job specs to run ad-hoc tasks in the right containers. For instance: initialisation tasks, such as database migrations, etc.

I have tried hard to improve the implementation of K8s jobs, but could not find a better implementation. In particular, this was the only way I found to avoid the duplication of job declarations for every single task. I would love to have a better way to handle jobs.
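
To make the idea of patching a job spec concrete, here is a minimal Python sketch. It is not Tutor's K8sJobRunner code: the function name `patch_job_for_task`, the job template, and the container and image names are all hypothetical. It only illustrates how one generic Job declaration can be reused for many ad-hoc tasks by overriding the command at run time, which is what avoids declaring a separate job for every task.

```python
import copy

# Hypothetical generic Job template. A real one would come from the
# deployment's own rendered manifests.
JOB_TEMPLATE = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "lms-job"},
    "spec": {
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [{"name": "lms", "image": "example/openedx:latest"}],
            }
        }
    },
}


def patch_job_for_task(template, task_name, shell_command):
    """Return a copy of the template that runs shell_command in its first container."""
    job = copy.deepcopy(template)  # never mutate the shared template
    job["metadata"]["name"] = f"{template['metadata']['name']}-{task_name}"
    container = job["spec"]["template"]["spec"]["containers"][0]
    container["command"] = ["sh", "-e", "-c"]
    container["args"] = [shell_command]
    return job


# Example: an initialisation task such as a database migration.
migrate_job = patch_job_for_task(JOB_TEMPLATE, "migrate", "./manage.py lms migrate")
print(migrate_job["metadata"]["name"])
```

The trade-off is the one described above: a single template keeps the declarations short, but the patching logic lives in code rather than in the manifests.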

The openedx container is a “gigantic monolith”

This is an actual issue, and an important one, but I do not think that it’s related to using Helm/Kubernetes or not. Still, a few words about this…

I must admit that I cringe a little when I hear that the openedx Docker image is “not optimized for building”… I rebuild the openedx Docker image many times a day, and I really need that process to be efficient, so I’ve dedicated many hours to making this process as smooth and fast as possible.

For the record, the current openedx image is already layer-based. Those layers were designed thoughtfully and if small changes trigger cache invalidation, there’s almost certainly a good reason for that. If a user believes that their changes to the Dockerfile should not trigger cache invalidation, they should implement the “openedx-dockerfile-final” patch in their plugin.
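
For readers less familiar with layer caching, the following Python sketch models why a small change can invalidate a lot of cached work. It is a simplified model, not Docker's actual algorithm (which also hashes copied file contents), and the build steps are made-up placeholders rather than the real Tutor Dockerfile. Each layer's cache key depends on every instruction before it, so a change early in the file invalidates all later layers, while a change appended at the end reuses everything.

```python
import hashlib


def layer_keys(instructions):
    """Compute one cache key per layer; each key depends on all earlier layers."""
    keys = []
    parent = ""
    for instruction in instructions:
        parent = hashlib.sha256((parent + "\n" + instruction).encode()).hexdigest()
        keys.append(parent)
    return keys


def reusable_layers(old, new):
    """Count the leading layers whose cache keys are unchanged between two builds."""
    count = 0
    for old_key, new_key in zip(layer_keys(old), layer_keys(new)):
        if old_key != new_key:
            break
        count += 1
    return count


# Hypothetical, heavily simplified build steps.
base = [
    "FROM ubuntu",
    "RUN install system packages",
    "RUN pip install requirements",
    "RUN npm install",
    "RUN build static assets",
]
late_change = base + ["RUN pip install my-xblock"]
early_change = ["FROM ubuntu", "RUN install system packages and one extra"] + base[2:]

print(reusable_layers(base, late_change))   # change appended at the end
print(reusable_layers(base, early_change))  # change near the top
print(reusable_layers([], base))            # fresh machine with an empty cache
```

In this model the appended change reuses all five existing layers, the early change reuses only the first, and a machine with no cache reuses none.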

Just the python virtualenv and the node_modules take up 1.2GB of disk space, so I find it unlikely that we are ever able to generate sub-1GB uncompressed images (note that the vanilla compressed images are already sub 1GB). I’m not saying it’s impossible, but I do not know how to improve this situation, and I’m very much open to contributions.

On that topic, I’m afraid that any further non-trivial improvements will require major upstream changes in edx-platform (but I would love to be proved wrong).

Helm as a default tool

When I first started working on deploying Open edX to Kubernetes, I seriously considered Helm. One of the reasons that I chose kubectl apply over Helm is that, in my understanding, k8s manifests could be used by Helm, but not the other way around. What I mean to say is that Helm does not have to replace the current manifest-based k8s implementation, but it can complement it.

Thus, this conversation should probably not revolve around “let’s replace tutor k8s by something else with Helm”, but instead “can we use tutor k8s to work with Helm”?

OK folks, here is the minimal prototype I mentioned that shows you can use a Helm chart to provision shared resources onto a cluster and it’s compatible with the current implementation of tutor k8s and should generally work with all Tutor plugins (including MFE) unless they require the use of an additional domain name outside of the default set (LMS_HOST, CMS_HOST, PREVIEW_LMS_HOST, MFE_HOST).

Check it out: https://github.com/open-craft/tutor-contrib-multi

I used it to deploy two Tutor instances onto a fresh cluster (note I will destroy this cluster soon though):

@lpm0073 would you be able to test it and give me feedback?

@regis:

Caddy is great for single instances (AFAIK) but does not work very well as a cluster-wide Ingress Controller for multiple instances. If you use the Tutor defaults to deploy several Open edX instances onto a cluster, you’ll get several different Caddy installations, and each one triggers its own k8s Service.LoadBalancer resource, which in turn tells the cloud provider (AWS, DigitalOcean, etc.) to spawn a corresponding Managed Load Balancer. For example, on AWS that means that EKS will provision a Classic Load Balancer per Open edX Instance (per Caddy instance), which costs $18/mo per instance. For most providers, it’s going to be much nicer and more affordable to have a single HA load balancer per cluster instead of per instance.

I am using the term “load balancer” here very casually as something that forwards traffic along to something else; it may not actually be “balancing” in any real sense.
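
To put rough numbers on the per-instance versus per-cluster difference, here is a small Python calculation. It uses the $18/mo figure quoted above and assumes, purely for illustration, that a shared cluster-wide load balancer costs the same as a single per-instance one; actual pricing varies by provider and by load balancer type.

```python
LB_MONTHLY_COST_USD = 18  # per managed load balancer, the figure quoted above


def monthly_lb_cost(instances, shared):
    """Monthly load balancer cost for a cluster hosting `instances` Open edX sites."""
    if instances == 0:
        return 0
    # Assumption: one shared load balancer costs the same as one per-instance one.
    load_balancers = 1 if shared else instances
    return load_balancers * LB_MONTHLY_COST_USD


for n in (1, 5, 20):
    print(n, monthly_lb_cost(n, shared=False), monthly_lb_cost(n, shared=True))
```

With a single instance the two approaches cost the same; the gap grows linearly with the number of instances on the cluster.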

Yes, and that’s what I did in my Helm chart demo. However, Caddy is still deployed per instance in order to route traffic among the various services (LMS, Preview, CMS, MFE), and it might be preferable in the k8s environment to just use Ingress objects to let the cluster-wide proxy/LB do that, so there is no need to run all these additional Caddy containers. That said, it would be a breaking change to Tutor and I’m not sure that it’s worth it for now. The per-instance Caddy container does not seem to use many resources.
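
As an illustration of the "cluster-wide ingress in front of a per-instance Caddy" arrangement, the Python sketch below builds a Kubernetes Ingress object for the four default hosts. It is not taken from the prototype: the service name `caddy`, port 80, the Ingress name, the namespace and the example domains are all assumptions made for the example. Only the four setting names come from the discussion above.

```python
import json

# The default host settings mentioned in the thread.
HOST_SETTINGS = ("LMS_HOST", "CMS_HOST", "PREVIEW_LMS_HOST", "MFE_HOST")


def build_ingress(namespace, config, service_name="caddy", service_port=80):
    """Build an Ingress that sends every default host to one in-namespace Service."""
    # Assumption: the per-instance proxy is exposed as a Service with this name and port.
    backend = {"service": {"name": service_name, "port": {"number": service_port}}}
    rules = [
        {
            "host": config[setting],
            "http": {"paths": [{"path": "/", "pathType": "Prefix", "backend": backend}]},
        }
        for setting in HOST_SETTINGS
    ]
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": "openedx", "namespace": namespace},
        "spec": {"rules": rules},
    }


# Hypothetical configuration for one instance.
example_config = {
    "LMS_HOST": "lms.example.com",
    "CMS_HOST": "studio.example.com",
    "PREVIEW_LMS_HOST": "preview.lms.example.com",
    "MFE_HOST": "apps.lms.example.com",
}
ingress = build_ingress("openedx-one", example_config)
print(json.dumps(ingress, indent=2))
```

Because every host points at the same backend, the Ingress stays trivial and all path-level routing remains inside the instance. A plugin that needs an extra domain outside this set would need an extra rule.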

I tend to make the same assumptions as you in this case and thought that generally the caching and build layers you’ve defined should optimize this well. However, the problem I heard is that some people are using a repeatable workflow that involves [provisioning a new VM], installing a specific version of Tutor and docker and plugins, then building the new image, then deploying it… that often means that layer caching is not available or not used. At least that’s what I heard.

Of course people can take a “base” docker image and then use a Dockerfile to extend it with some minor changes, and that is guaranteed to be very fast, but then they can’t use Tutor to manage those changes properly. And so if they want to use Tutor but also need to manage multiple deployments which are using different versions of Open edX, Tutor, Docker, etc., then they may run into problems unless they use the “repeatable workflow” above which invalidates the cache.

Again, I haven’t experienced this myself but that’s my understanding of the issue.

Yes, see proof of concept :slight_smile:

I feel like we are in agreement over most important issues. In particular, I agree with you that the Caddy-based load balancer that ships with Tutor by default does not work well with multiple instances. But that does not mean that we should ditch Caddy entirely. I’m afraid an Ingress cannot replace Caddy as a web proxy, because Caddy performs some essential request processing/filtering tasks. For reference, here is the base Caddyfile used in Tutor: tutor/Caddyfile at master · overhangio/tutor · GitHub

It contains some essential configuration, such as:

@favicon_matcher {
    path_regexp ^/favicon.ico$
}
rewrite @favicon_matcher /theming/asset/images/favicon.ico

# Limit profile image upload size
request_body /api/profile_images/*/*/upload {
    max_size 1MB
}
request_body {
    max_size 4MB
}

All this configuration would have to be ported over to the Ingress, which might be quite difficult. Given how few resources the Caddy container uses, the fact that it can be easily scaled to multiple instances and the simplification it introduces in request routing inside the cluster, I really think we should keep it around. Plus it’s not preventing you from deploying your own load balancer, as you’ve shown in your plugin. The fact that Caddy centralizes all requests makes your plugin quite simple, I think: tutor-contrib-multi/k8s-services at main · open-craft/tutor-contrib-multi · GitHub
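
To show what "porting this configuration" would involve, here is a Python sketch of the behaviour expressed by the Caddyfile excerpt above. It is an approximation for illustration only: the `*` wildcards are modelled as "any characters except `/`", and the sizes assume decimal megabytes (1MB = 1,000,000 bytes). Both are assumptions about how Caddy interprets the excerpt, and the function names are made up.

```python
import re

MB = 1_000_000  # assumption: decimal megabytes

# Same pattern as the @favicon_matcher in the excerpt.
FAVICON_RE = re.compile(r"^/favicon.ico$")
# Approximation of the path /api/profile_images/*/*/upload.
PROFILE_UPLOAD_RE = re.compile(r"^/api/profile_images/[^/]*/[^/]*/upload$")


def rewrite_path(path):
    """Mirror the favicon rewrite rule; other paths pass through unchanged."""
    if FAVICON_RE.match(path):
        return "/theming/asset/images/favicon.ico"
    return path


def max_body_size(path):
    """Return the request body limit, in bytes, that applies to a path."""
    if PROFILE_UPLOAD_RE.match(path):
        return 1 * MB  # stricter limit for profile image uploads
    return 4 * MB      # default limit for everything else


print(rewrite_path("/favicon.ico"))
print(max_body_size("/api/profile_images/v1/alice/upload"))
print(max_body_size("/courses"))
```

Any Ingress controller taking over from Caddy would need an equivalent for each of these rules, typically through controller-specific settings, which is part of why porting is not straightforward.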

On the topic of building Docker images without access to the Docker cache: I mean there is only so much that Tutor can do… If users are not leveraging the Docker cache, then yes, sure, building images will take an awfully long time.

Maybe we should discuss with @lpm0073 in which scenarios exactly is the tutor images build taking too long? Would you like to open another topic on this forum to talk about this issue Lawrence? Feel free to @me.

Makes sense, and I was already leaning that way anyways. So forget what I said about maybe ditching Caddy entirely :slight_smile: