Deploying Open edX on Kubernetes Using Helm

See previous discussion

The meeting was held at 15:00 UTC on Zoom. @antoviaque can you post the recording?

Attendees

@MoisesGonzalezS @gabor @jhony_avella @Felipe @braden @lpm0073 @keithgg @antoviaque @sambapete @mtyaka

Since the previous thread was getting a bit long, we’ve decided to create a new thread to facilitate further discussion.

What was discussed

  • Deploying to Kubernetes with Tutor still has some pain points:
    • It’s not possible to get rid of Caddy.
    • The built-in persistence for MongoDB/MySQL/ES and Redis is not that reliable.
    • Configuration outside of the scope of Tutor is fairly difficult.
    • Building images takes a while, and making changes requires a rebuild.
  • @braden is working on the prototype for Helm + tutor and is almost ready to demo.
  • In the meantime, @felipe is working on OEP-59.
  • Getting broader community support is important, as this could become an official method of deploying Open edX to Kubernetes.

Action items

  • @braden will present his prototype at the next meeting (possibly before then) to get feedback.
  • @felipe to update OEP-59 so that it’s an amendment to OEP-45 instead.
  • We are to involve the wider community (especially 2U, as they’re already deploying to k8s) to get their feedback and explore any other ideas.
  • We will have another meeting on the 6th of December at 17:30 UTC. All are welcome :slight_smile:

Meeting notes

Moises:

  • eduNEXT has encountered problems with Tutor, specifically with Caddy mixed with Ingress.
  • They can’t get rid of Caddy due to the way it’s specified.
  • Persistence in the cluster for Elasticsearch, Redis, and MongoDB/MySQL is not that reliable.
    • In a real installation it doesn’t fulfil the needs.
    • The default settings aren’t enough; plugins have to be added to manage configuration.
  • The Tutor plugin ecosystem has two use cases:
    • IDAs → Ecommerce, etc.
    • Configuration: it’s difficult to manage configuration for what we need.
  • It’s not just changing the plugin, but then overriding templates to override everything.
  • Using Tutor as a deploy tool isn’t working.
  • The way Tutor manages jobs: hooks are injected into the manifest and applied.

Lawrence

  • Agrees. For him, building the Open edX container is a problem, because it’s so big.
  • He has to rebuild the container whenever, for example, an XBlock needs to be added.
  • It’s much slower than working natively.
  • It’s very difficult to make small changes to the build.
  • Caddy is a problem; persistence mismatch, availability zones, templates, and plugins all have issues.
  • Open edX container is much too large and brittle. Could turn something minor into taking up a lot of time.
  • Helm is worthwhile.
  • Is it possible for the Open edX container to be standalone, with everything else, like XBlocks, added after the fact?

Felipe

  • Suggested using Docker layers for a base Open edX image might be an option.

Lawrence

  • Lawrence is unable to reliably build and deploy unless there’s a codified workflow.
  • He’s currently using GitHub Actions, but is open to anything.
  • There needs to be a way to codify this workflow.
  • It’s currently difficult to get a reliable production build for a client.
  • Bash scripts seem to work, but he would like something better.

Braden

  • It should be possible to create a Docker layer.

Lawrence

  • The image would need to be built every time.
  • Clients are now cognizant of the time it takes to build a container and
    they are taking it into account, so requests are limited. It’s a distraction
    and “dominates the thought process”.
  • Other than building images, Kubernetes and Docker are resilient and workable.

Moises

  • We could use different ways of caching builds, like BuildKit, etc.
  • We could handle it later as part of a working group.

Felipe

  • Making the build work better could be within the scope of the BTR (Build-Test-Release) working group.
  • Since it’s something everyone wants, making it available in the wider community is better.

Xavier

  • What is the status of the current Helm collaboration?
  • Can we announce to the community or is there still some work to do?

Braden

  • Will share prototype with the group tomorrow/later.

Xavier

  • Is there anything that’s been agreed upon that we will take on going forward?

Braden

  • We will definitely use Helm.

Lawrence

  • For his clients, Open edX isn’t usually the only piece of software running on their cluster.
  • Helm is suited to this task, because it’s usually used to deploy the other software in the cluster as well.

Felipe

  • There might be complaints from other operators that they now need to learn another technology.
  • Some customers are not in favour of using Helm, as they might be tied to different methodologies/tech.

Braden

  • Helm is fairly widely used so we shouldn’t have that issue.

Xavier

  • Concrete steps:
    • Braden finishes prototype and gets feedback.
    • Is there anything missing to get the OEP accepted?

Braden

  • Is the OEP even needed?
  • Ideally, the next step is to build on it. Could we just start publishing and collaborating on it?

Xavier

  • An OEP is preferred since this project is community-centric.
  • There is value in saying that this could be an official solution.

Braden

  • The OEP can be an amendment to the Tutor OEP rather than a standalone one, as standalone OEPs require more effort to get accepted.

Lawrence

  • This isn’t to replace Tutor; the resulting project will live downstream from Tutor.
  • In the end, it will replace tutor k8s.

Felipe

  • This is a first step to build upon that.

Xavier

  • We need to consider the social and governance aspects.
  • What’s the shortest path to getting to the first step?
    • An amendment to the Tutor OEP sounds good.
    • Is that something we can do now?
    • Are there still things to figure out regarding it?

Braden

  • Agrees on the social aspect.
  • Getting people to look at the code first matters more than publishing an OEP.

Lawrence

  • Willing to write code and assist where needed.

Xavier

  • We need code, but we also need broader community support.
  • Better to progress on both ends.
  • We are all agreed, but it’s worth letting the community know that if things work out, things might be official.

Felipe

  • We can add an addendum to OEP-45, similar to Tutor’s.

Xavier

  • We will test Braden’s code.
  • Change the format of the OEP to a modification of OEP-45.
  • Will also show that multiple members of the community are working on it.

Lawrence

  • Needs a willingness among the workgroup to standardise on a machine state to build the Open edX images.
  • To benefit from this, we need a way to incrementally build Open edX images that doesn’t take 45 minutes.

Felipe

  • Unfortunately it does get complex, so we might need more than just a machine that constantly builds the images.

Braden

  • We should investigate changing the tutor Dockerfile to make it more cacheable.

Felipe

  • IMHO that’s a separate project.

Xavier

  • Is there a Discovery item to move the exploration forward?

Braden

  • We should involve Regis next time to get his feedback on the discussion here.

Xavier

  • Let’s schedule the next meeting as we’ve come close to the allotted time here.
  • Everyone is happy to meet again in two weeks.

Felipe

  • We should get more folks to join that might be interested in this discussion.

Xavier

  • Start a new thread on the forum.

  • With a recap on the discussion so far.

  • Involve 2U as well. They don’t use Tutor, but they do deploy to k8s, so they could have a different approach.

  • Ned was surprised that we’re working on this.

  • 6th of December or the next one. Slightly later if possible. Proposed time, 17:30 UTC.

  • Lawrence offered help with anything code related.

  • Braden asked for help testing the prototype which was agreed to.

Meeting ended

7 Likes

Thanks for taking notes @keithgg. I can assist anyone who needs/wants help with the action items.

2 Likes

@keithgg Thank you for the recap and the notes! Here is the meeting recording video:

I have sent the meeting invite to the same group, if anyone wants to be added just ask me or any participant – or simply join the Zoom room at the time of the meeting.

@lpm0073 Thanks for offering to do some of the work! I think @braden you mentioned there were a few things on which you could use some help?

@kmccormick Our discussions might actually also be somewhat related to the initiative from https://openedx.atlassian.net/wiki/spaces/COMM/pages/3583016961/Developer+Experience+Working+Group ? Your focus there is about the development environment, but there might be similarities on which to collaborate?

CC @nedbat as this is a topic you showed interest for during the last community meetup
CC @regis as this is related to Tutor

CC @Kelly_Buchanan as we have also discussed these DevOps topics – see if the description above from Keith provides you with the info you need? I was about to copy a recap of it to https://openedx.atlassian.net/wiki/spaces/COMM/pages/3574661129/DevOps+Working+Group+Ideation but I noticed that it’s archived.

I’m sorry I couldn’t join the meeting on Tuesday, but something came up. I’ll watch the recording and will definitely join the next one.

Here are my notes on the meeting:

  • @Felipe’s summary of my position on the Helm OEP is accurate: I’d love to make it easier to integrate Tutor & Helm, but I think that we need to do it in a way that is compatible with the existing plugin ecosystem.
  • I’m excited to hear about @braden’s implementation of a load balancer with Helm that plays well with Tutor and the plugin ecosystem.
  • Issues with the current k8s implementation in Tutor:

There’s no way around Caddy

I do not understand this specific point. Caddy serves two purposes in a Tutor-based k8s deployment. It’s:

  1. an ingress, with SSL certificates and all.
  2. a web proxy that handles http requests.

Caddy does not do load balancing. Kubernetes is in charge of that. All Caddy does is: “I’m getting this http request, let’s process it a little, then forward it to http://lms:8000 (for instance)”. Kubernetes (or docker-compose) is then in charge of load-balancing between services named “lms”. I think that Caddy is doing a great job at being a web proxy (role 2). Caddy as an ingress can be disabled by (unintuitively) setting ENABLE_WEB_PROXY=false. So I’m not sure why we are pointing fingers at Caddy here.
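
For reference, disabling that ingress role is a one-line change in the Tutor project’s config.yml (a minimal sketch; the setting name comes from the paragraph above):

# $(tutor config printroot)/config.yml
# Keep Caddy as the internal web proxy, but stop exposing it as the ingress
ENABLE_WEB_PROXY: false

Re-rendering the environment with tutor config save then applies the change.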

Volume persistence is inconvenient, in particular across availability zones

Is this issue related to this other one: Improve storage model of Caddy's pods in K8s · Issue #737 · overhangio/tutor · GitHub? Please comment there. I must admit that in this matter I’m limited by my own understanding of k8s volumes. Also, it’s difficult to propose a default implementation that works across k8s providers, which all offer different persistent volume types.

Again, as on many other matters, I’m really open to comments. Would Helm help resolve this issue? In my understanding it wouldn’t, but I might be wrong.
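
For what it’s worth, cross-AZ volume affinity is usually tackled at the StorageClass level rather than by the deployment tool. A hedged sketch for a cluster running the AWS EBS CSI driver (the names and provisioner are assumptions about the reader’s cluster, not something Tutor ships):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openedx-gp3              # illustrative name
provisioner: ebs.csi.aws.com     # assumes the AWS EBS CSI driver is installed
volumeBindingMode: WaitForFirstConsumer  # bind only once the pod is scheduled,
                                         # so the volume lands in the pod's AZ
parameters:
  type: gp3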

“Managing plugins is tricky”

This topic is close to my heart :slight_smile:

If I understand correctly, the argument that was being made here was that it’s more work for end users to make changes to their Open edX platform by creating a plugin than having the feature baked in upstream in Tutor core.

I have so many things to say on this topic, but I’ll try to keep it brief.

Adding new features to Tutor core has heavy consequences for everyone: both Tutor maintainers and Tutor users. Let’s assume that Tutor maintainers have plenty of free time and energy and they are able to competently maintain all features that are added to Tutor core. Let’s focus on Tutor users. Adding more features to Tutor core should not make it harder for these users to maintain their own platforms. But in general, adding a feature does make it more complex.

Let’s take the example of this pull request, which was eventually closed without merging: feat: k8s horizontal pod autoscaling by gabor-boros · Pull Request #677 · overhangio/tutor · GitHub It’s a shame that we’ve lost the original commits from this PR; originally, it introduced auto-scaling to the Open edX platform on Kubernetes. I absolutely love this feature. But it required 32 new Tutor settings to be created, just for Open edX core (without ecommerce, forum, etc.). This is just too much added complexity that all Tutor end users would eventually have to deal with. So I recommended that OpenCraft create a plugin that implements these changes. I also suggested that OpenCraft maintain this plugin, as they clearly have the most expertise on the topic of Kubernetes auto-scaling.
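
For context, the kind of resource such a plugin would render per service is a standard HorizontalPodAutoscaler. The sketch below shows the general shape; it is illustrative, not the actual contents of PR #677:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: lms
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: lms          # Tutor's LMS deployment
  minReplicas: 2       # each of these knobs is one of the many settings
  maxReplicas: 10      # such a plugin has to expose, per service
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70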

My bottom line is this: addressing a few users’ use cases should not make life more difficult for many others. If you have a very specific use case, then there’s a good chance that you are not an individual but a company, and one with a dedicated team of engineers working on Open edX. It’s only fair that you put in the work, create a plugin and maintain it. This “philosophy” is the origin of many design decisions that have happened inside Tutor. In particular: the recent switch to the more powerful plugin V1 API, the extensive documentation of the plugin creation process, the future creation of third-party plugin indices, etc.

“The Tutor CLI is inconvenient; in particular for jobs”

I think that the CLI is okay (and it will be further improved in Olive) but I agree that the current implementation of k8s (and docker-compose) jobs is clunky.

Basically, the K8sJobRunner manually patches job specs to run ad-hoc tasks in the right containers. For instance: initialisation tasks, such as database migrations, etc.

I have tried hard to improve the implementation of K8s jobs, but could not find a better implementation. In particular, this was the only way I found to avoid the duplication of job declarations for every single task. I would love to have a better way to handle jobs.
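
To make the clunkiness concrete, each ad-hoc task ends up as a one-off Job along these lines (a rough sketch; the name, image tag and command are illustrative, not Tutor’s exact output):

apiVersion: batch/v1
kind: Job
metadata:
  name: lms-migrate               # one patched job per ad-hoc task
spec:
  ttlSecondsAfterFinished: 3600   # let the cluster garbage-collect finished jobs
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: lms
          image: docker.io/overhangio/openedx:15.0.0  # hypothetical tag
          command: ["python", "manage.py", "lms", "migrate"]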

The openedx container is a “gigantic monolith”

This is an actual issue, and an important one, but I do not think that it’s related to using Helm/Kubernetes or not. Still, a few words about this…

I must admit that I cringe a little when I hear that the openedx Docker image is “not optimized for building”… I rebuild the openedx Docker image many times a day, and I really need that process to be efficient, so I’ve dedicated many hours to making this process as smooth and fast as possible.

For the record, the current openedx image is already layer-based. Those layers were designed thoughtfully and if small changes trigger cache invalidation, there’s almost certainly a good reason for that. If a user believes that their changes to the Dockerfile should not trigger cache invalidation, they should implement the “openedx-dockerfile-final” patch in their plugin.
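
As an illustration, a minimal legacy-style YAML plugin carrying that patch could look like this (the patch name comes from the paragraph above; the package being installed is hypothetical):

# myplugin.yml, dropped into $(tutor plugins printroot)
name: myplugin
version: 0.1.0
patches:
  openedx-dockerfile-final: |
    # Rendered at the end of the openedx Dockerfile, after the heavy cached layers
    RUN pip install my-xblock==1.0.0

After enabling it with tutor plugins enable myplugin, only the final layers should need rebuilding.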

Just the python virtualenv and the node_modules take up 1.2GB of disk space, so I find it unlikely that we will ever be able to generate sub-1GB uncompressed images (note that the vanilla compressed images are already sub-1GB). I’m not saying it’s impossible, but I do not know how to improve this situation, and I’m very much open to contributions.

On that topic, I’m afraid that any further non-trivial improvements will require major upstream changes in edx-platform (but I would love to be proved wrong).

Helm as a default tool

When I first started working on deploying Open edX to Kubernetes, I seriously considered Helm. One of the reasons that I chose kubectl apply over Helm is that, in my understanding, k8s manifests could be used by Helm, but not the other way around. What I mean to say is that Helm does not have to replace the current manifest-based k8s implementation, but it can complement it.

Thus, this conversation should probably not revolve around “let’s replace tutor k8s by something else with Helm”, but instead “can we use tutor k8s to work with Helm”?
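
To make the complement-not-replace point concrete: a chart is little more than packaged, templated manifests plus metadata, so the manifests that tutor k8s already renders could in principle be vendored into one. A minimal, purely illustrative Chart.yaml:

apiVersion: v2
name: openedx-shared    # hypothetical chart for cluster-wide shared resources
description: Resources shared by all Tutor instances on a cluster
version: 0.1.0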

2 Likes

OK folks, here is the minimal prototype I mentioned. It shows that you can use a Helm chart to provision shared resources onto a cluster, and it’s compatible with the current implementation of tutor k8s. It should generally work with all Tutor plugins (including MFE), unless they require the use of an additional domain name outside of the default set (LMS_HOST, CMS_HOST, PREVIEW_LMS_HOST, MFE_HOST).

Check it out: https://github.com/open-craft/tutor-contrib-multi

I used it to deploy two Tutor instances onto a fresh cluster (note I will destroy this cluster soon though):

@lpm0073 would you be able to test it and give me feedback?

@regis:

Caddy is great for single instances (AFAIK) but does not work very well as a cluster-wide Ingress Controller for multiple instances. If you use the Tutor defaults to deploy several Open edX instances onto a cluster, you’ll get several different Caddy installations, and each one triggers its own k8s Service.LoadBalancer resource, which in turn tells the cloud provider (AWS, DigitalOcean, etc.) to spawn a corresponding Managed Load Balancer. For example, on AWS that means that EKS will provision a Classic Load Balancer per Open edX Instance (per Caddy instance), which costs $18/mo per instance. For most providers, it’s going to be much nicer and more affordable to have a single HA load balancer per cluster instead of per instance.

I am using the term “load balancer” here very casually as something that forwards traffic along to something else; it may not actually be “balancing” in any real sense.

Yes, and that’s what I did in my Helm chart demo. However, Caddy is still deployed per instance in order to route traffic among the various services (LMS, Preview, CMS, MFE), and it might be preferable in the k8s environment to just use Ingress objects and let the cluster-wide proxy/LB do that, so there is no need to run all these additional Caddy containers. However, that would be a breaking change to Tutor and I’m not sure that it’s worth it for now. The per-instance Caddy container seems not to use many resources.
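
For reference, the Ingress-based routing described above would look roughly like this per instance: point a host at the instance’s existing caddy Service, so that a single cluster-wide controller terminates all traffic. The host, class, annotation and port below are assumptions about the cluster, not the prototype’s actual manifests:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: openedx
  namespace: openedx-instance-a                  # one namespace per instance
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt  # assumes cert-manager is installed
spec:
  ingressClassName: nginx                        # the single cluster-wide controller
  rules:
    - host: lms.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: caddy                      # the per-instance Caddy Service
                port:
                  number: 80                     # assumed port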

I tend to make the same assumptions as you in this case, and thought that generally the caching and build layers you’ve defined should optimize this well. However, the problem I heard is that some people are using a repeatable workflow that involves [provisioning a new VM], installing a specific version of Tutor and docker and plugins, then building the new image, then deploying it… that often means that layer caching is not available or not used. At least that’s what I heard.

Of course people can take a “base” docker image and then use a Dockerfile to extend it with some minor changes, and that is guaranteed to be very quick, but then they can’t use Tutor to manage those changes properly. And so if they want to use Tutor but also need to manage multiple deployments which are using different versions of Open edX, Tutor, Docker, etc., then they may run into problems unless they use the “repeatable workflow” above, which invalidates the cache.

Again, I haven’t experienced this myself but that’s my understanding of the issue.

Yes, see proof of concept :slight_smile:

4 Likes

I feel like we are in agreement over the most important issues. In particular, I agree with you that the Caddy-based load balancer that ships with Tutor by default does not work well with multiple instances. But that does not mean that we should ditch Caddy entirely. I’m afraid an Ingress cannot replace Caddy as a web proxy, because Caddy performs some essential request processing/filtering tasks. For reference, here is the base Caddyfile used in Tutor: tutor/Caddyfile at master · overhangio/tutor · GitHub

It contains some essential configuration, such as:

# Serve favicon requests from the theming assets
@favicon_matcher {
    path_regexp ^/favicon.ico$
}
rewrite @favicon_matcher /theming/asset/images/favicon.ico

# Limit profile image upload size
request_body /api/profile_images/*/*/upload {
    max_size 1MB
}
request_body {
    max_size 4MB
}

All this configuration would have to be ported over to the Ingress, which might be quite difficult. Given how few resources the Caddy container uses, the fact that it can be easily scaled to multiple instances, and the simplification it introduces in request routing inside the cluster, I really think we should keep it around. Plus, it’s not preventing you from deploying your own load balancer, as you’ve shown in your plugin. The fact that Caddy centralizes all requests makes your plugin quite simple, I think: tutor-contrib-multi/k8s-services at main · open-craft/tutor-contrib-multi · GitHub

On the topic of building Docker images without access to the Docker cache: I mean there is only so much that Tutor can do… If users are not leveraging the Docker cache, then yes, sure, building images will take an awfully long time.

Maybe we should discuss with @lpm0073 in which scenarios exactly the tutor images build takes too long. Would you like to open another topic on this forum to talk about this issue, Lawrence? Feel free to @me.

Makes sense, and I was already leaning that way anyways. So forget what I said about maybe ditching Caddy entirely :slight_smile:

Thanks a lot for the POC chart @braden.

I downloaded it and read through the code, and conceptually everything was spot on.

When I tried to run it in a minikube cluster, I ran into issues getting the /cluster-echo-test endpoint to work. It looked as if the cluster was not exposing any ports. I’ll debug some more before the meeting on Tuesday.

The move of the OEP to the ADR is also still pending on my side. I’ll get to it today.

Yeah I had trouble with minikube so if possible it’s better to test on a “real” cluster.

1 Like

Yes, this is definitely relevant to DevExp! Let me take some time to catch up on all the work that’s happened so far.

2 Likes

Hi, sorry for the late response, last days have been a little bit hectic.

I played with the prototype a little bit, and I’m interested in what a possible approach could look like going forward.

A few questions that I have:

  • Is the role of the Tutor plugin mostly to connect a Tutor installation with the shared pool of resources, mainly setting the needed configuration?
  • What other examples of shared resources come to mind? Something like monitoring tools, maybe a global Redis?

Adding new features to Tutor core has heavy consequences for everyone: both Tutor maintainers and Tutor users. Let’s assume that Tutor maintainers have plenty of free time and energy and they are able to competently maintain all features that are added to Tutor core. Let’s focus on Tutor users. Adding more features to Tutor core should not make it harder for these users to maintain their own platforms. But in general, adding a feature does make it more complex.

I don’t think that overloading the Tutor core is something we are pursuing; rather, we are used to pushing our changes upstream to reduce code drift. I do concede that it is a little unfair to simply pass on the burden of maintenance, considering the plugin system allows us to extend Tutor and own the changes. Which circles us back to the initial discussion.

We want to find a common place to collaborate and work on those issues that affect that small percentage of very specialized Tutor users. I think having this central hub to exchange ideas and work on something together will help us align and point out more decisively which parts of Tutor don’t work for us and can’t be worked around. This is why I think the idea Braden proposes fits nicely, as it isn’t as disruptive as the initial proposal and serves as a starting point for discussion.

2 Likes

Totally unrelated to other comments in this topic (sorry about that): has anyone considered Kubernetes operators, instead of relying on Helm, for scaling, dynamic configuration, etc.? I’m not too familiar with them, so I’m really asking this out of the blue.

Tutor build times, more context: this problem only surfaces when using CI platforms like GitHub Actions, for example. The tutor build begins from scratch, as the worker node itself is ephemeral. There’s a noticeable difference in execution time when rebuilding from, say, an AWS EC2 instance, obviously thanks to Docker’s built-in caching capabilities.

Ideally, we’d build a much smaller container consisting of only our frequently-changing custom code components: the custom theme, the Open edX plugins, and the optional XBlocks. Then, somehow, the code contained in this much smaller container would be accessible to a much larger, infrequently-changing, pristine openedx container for each open-release.
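
One hedged sketch of how that could be wired together in a pod spec: an init container ships only the fast-changing code and copies it into a volume that the big openedx container mounts (every name below is hypothetical):

# Fragment of a Deployment pod spec (sketch, not working Tutor output)
initContainers:
  - name: custom-code
    image: myorg/openedx-custom:latest          # small image, rebuilt often
    command: ["cp", "-r", "/customizations/.", "/mnt/custom/"]
    volumeMounts:
      - name: custom
        mountPath: /mnt/custom
containers:
  - name: lms
    image: docker.io/overhangio/openedx:15.0.0  # big image, rebuilt rarely
    volumeMounts:
      - name: custom
        mountPath: /openedx/customizations
volumes:
  - name: custom
    emptyDir: {}

The hard part is that themes, plugins, and XBlocks copied this way would still need to be installed and importable inside the openedx container, which is why this remains an open idea rather than a working recipe.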

I think it should still be possible to store the Docker layer cache in the GitHub Actions cache to speed up the build somehow; e.g., “How to use Docker layer caching in GitHub Actions” shows how to do it for builds that run docker build directly in GitHub Actions. But some more work may be required to tell Tutor to tell Docker to use the GHA cache.

Meeting 2022-12-06

It was a good constructive meeting yesterday! Cheers to all of us for that. :slight_smile:

Video recording

Meeting notes

Were there any notes taken, besides Community K8S Helm Chart - Google Docs? We might have forgotten to assign a scribe to take notes and post a recap afterwards; I’ll add an item to the agenda to remember to do this next time.

The main takeaways were:

  • We confirmed that we are in agreement to pursue the approach from the current prototype, collaborating on it as a group within the Open edX project.
  • Since the proposal is becoming more concrete, we want to make sure everyone in the community has a chance to see this, as well as review and comment, so we’ll be opening a formal review period for the OEP update proposal - @Felipe will be posting a dedicated announcement thread about this shortly.
  • During this review period, we will keep working on the prototype to fully validate that it addresses the issues we have identified. To do so, a list of tasks has been established and split between members of the group - kudos to @Felipe @lpm0073 @jhony_avella @MoisesGonzalezS @keithgg for each taking on a task (and to @braden for all the work on the prototype)!
  • To make it clear that this is a community effort done within the Open edX project, rather than any specific provider’s own project, we are looking into moving the prototype repo to the openedx org on GitHub.

Meeting chat log

Next meeting

The next meeting will be on 2023-01-10T17:00:00Z in this Zoom room.

If you would like to be added to the meeting calendar invite, just ask me or any of the participants, or join the Zoom URL directly.

Agenda

Proposed agenda, based on the items mentioned during the last meeting:

  1. Assign scribe role, greetings & introductions as needed (5 minutes)

  2. Kubernetes, Tutor & Helm (30 minutes) - Debrief of the work from the tasks list & formal review

  3. Tutor improvements (10 minutes) - Are there issues or blockers that people face with Tutor, which wouldn’t already be addressed by the current initiative?

  4. DevOps Working Group (10 minutes) - Continue discussing scope for a DevOps working group

  5. Next steps & conclusion (5 minutes)

This is a temporary agenda - post changes/items as a reply here. If we can have 2U engineers joining, we could carve out some time to discuss what 2U is currently doing and their current needs (same thing for any other organization!).

DevExp

@kmccormick Great to hear! :smiley: If you would like to join the next meeting to talk about your work, and how we could collaborate, you’re welcome to! I could add an item in the agenda for this if you want.

This looks promising. Thanks. I’ll add this to the openedx-action build action in an experimental branch.

This is the right way to speed up the image building, @lpm0073. Note that tutor images build has the --docker-arg option: everything you add to this option (such as “--cache-from …” / “--cache-to …”) will be forwarded as-is to docker build.
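
Putting those two suggestions together, a CI step might look like this (a sketch, assuming BuildKit is the active builder and the GHA cache runtime variables are exported, e.g. via the crazy-max/ghaction-github-runtime action):

# Step in a GitHub Actions workflow (illustrative)
- name: Build the openedx image with a persistent layer cache
  run: |
    tutor images build openedx \
      --docker-arg="--cache-from=type=gha" \
      --docker-arg="--cache-to=type=gha,mode=max"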

Here’s a way to implement a locally run GitHub Action, provided by my colleague @Melsu (Sujit Kumar): GitHub - nektos/act: Run your GitHub Actions locally 🚀. This might be the best way to combine Docker’s caching capability with Cookiecutter’s automated build workflow.

I’m not sure I follow. You want to build Docker images from within act? Why would you want to do that? Why not build images directly on the host? Building Docker images from within Docker is not so easy.