Deploying Open edX on Kubernetes Using Helm

Yes, this is definitely relevant to DevExp! Let me take some time to catch up on all the work that’s happened so far.

Hi, sorry for the late response, the last few days have been a little hectic.

I played with the prototype a bit, and I’m interested in what a possible approach could be going forward.

A few questions that I have:

  • Is the role of the Tutor plugin mostly to connect a Tutor installation to the shared pool of resources, mainly by setting the needed configuration?
  • What other examples of shared resources come to mind? Something like monitoring tools, maybe a global redis?

Adding a new feature to Tutor core has heavy consequences for everyone: both
Tutor maintainers and Tutor users. Let’s assume that Tutor maintainers have
plenty of free time and energy and they are able to competently maintain all
features that are added to Tutor core. Let’s focus on Tutor users. Adding
more features to Tutor core should not make it harder for these users to
maintain their own platforms. But in general, adding a feature does make it
more complex.

I don’t think that overloading the Tutor core is something that we are pursuing; rather, we are used to pushing our changes upstream to reduce code drift. I do concede that it is a little unfair to simply pass on the burden of maintenance, considering that the plugin system allows us to extend Tutor and own the changes. Which circles us back to the initial discussion.

We want to find a common place to collaborate and work on the issues that affect that small percentage of very specialized Tutor users. I think having this central hub to exchange ideas and work on something together will help us align, and point out more decisively which parts of Tutor don’t work for us and can’t be worked around. This is why I think the idea that Braden proposes fits nicely: it isn’t as disruptive as the initial proposal, and it serves as a starting point for discussion.

Totally unrelated to other comments in this topic (sorry about that): has anyone considered Kubernetes operators instead of relying on Helm for scaling, dynamic configuration, etc? I’m not too familiar with them, so I’m really asking this out of the blue.

Tutor build times, more context: this problem only surfaces on CI platforms such as GitHub Actions. The Tutor build starts from scratch because the worker node itself is ephemeral. Rebuilding from, say, an AWS EC2 instance is noticeably faster, thanks to Docker’s built-in layer caching.

Ideally, we’d build a much smaller container consisting of only our frequently changing custom code components: the custom theme, the Open edX plugins, and the optional XBlocks. Then, somehow, the code contained in this much smaller container would be accessible to a much larger, infrequently changing, pristine openedx container for each open-release.

I think it should still be possible to store the Docker layer cache in the GitHub Actions cache to speed up the build. For example, “How to use Docker layer caching in GitHub Actions” shows how to do it for builds that run docker build directly in GitHub Actions. But some more work may be required to get Tutor to tell Docker to use the GHA cache.

Meeting 2022-12-06

It was a good constructive meeting yesterday! Cheers to all of us for that. :slight_smile:

Video recording

Meeting notes

Were any notes taken, besides Community K8S Helm Chart - Google Docs? We might have forgotten to assign a scribe to take notes and post a recap afterwards. I’ll add an item to the agenda so we remember to do this next time.

The main takeaways were:

  • We confirmed that we are in agreement to pursue the approach from the current prototype, collaborating on it as a group, within the Open edX project
  • Since the proposal is becoming more concrete, we want to make sure everyone in the community has a chance to see this, as well as review and comment, so we’ll be opening a formal review period of the OEP update proposal - @Felipe will be posting a dedicated announcement thread about this shortly.
  • During this review period, we will keep working on the prototype, to fully validate that it addresses the issues we have identified. To do so, a list of tasks has been established and split between members of the group - kudos to @Felipe @lpm0073 @jhony_avella @MoisesGonzalezS @keithgg for each taking on a task on this! (and to @braden for all the work on the prototype)
  • To make it clear that this is a community effort done within the Open edX project, rather than any specific provider’s own project, we are looking into moving the prototype repo to the openedx org on GitHub

Meeting chat log

Next meeting

The next meeting will be on 2023-01-10T17:00:00Z in this Zoom room.

If you would like to be added to the meeting calendar invite, just ask me or any of the participants, or join the Zoom URL directly.

Agenda

Proposed agenda, based on the items mentioned during the last meeting:

  1. Assign scribe role, greetings & introductions as needed (5 minutes)

  2. Kubernetes, Tutor & Helm (30 minutes) - Debrief of the work from the tasks list & formal review

  3. Tutor improvements (10 minutes) - Are there issues or blockers that people face with Tutor, which wouldn’t already be addressed by the current initiative?

  4. DevOps Working Group (10 minutes) - Continue discussing scope for a DevOps working group

  5. Next steps & conclusion (5 minutes)

This is a temporary agenda - post changes/items as a reply here. If we can have 2U engineers joining, we could carve out some time to discuss what 2U is currently doing and its current needs (same thing for any other organization!).

DevExp

@kmccormick Great to hear! :smiley: If you would like to join the next meeting to talk about your work, and how we could collaborate, you’re welcome to! I could add an item in the agenda for this if you want.

This looks promising. Thanks. I’ll add this to the openedx-action build action in an experimental branch.

This is the right way to speed up the image building @lpm0073. Note that tutor images build has the --docker-arg option: everything you add to this option (such as “--cache-from …/--cache-to …”) will be forwarded as-is to docker build.
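
As a minimal sketch of the suggestion above (not a tested CI setup), the snippet below assembles a `tutor images build` command line that forwards cache flags through `--docker-arg`. Two things are assumptions: that the build runs with a BuildKit/buildx-enabled Docker, and that the GitHub Actions cache backend (`type=gha`) is reachable from the build, which normally needs extra workflow setup. The helper name `tutor_build_command` is hypothetical.

```python
import shlex


def tutor_build_command(image, docker_args):
    """Return the argv for `tutor images build`, forwarding each item of
    docker_args to `docker build` through Tutor's --docker-arg option."""
    argv = ["tutor", "images", "build", image]
    for arg in docker_args:
        # The --docker-arg=VALUE form keeps values that start with "--"
        # from being parsed as options of tutor itself.
        argv.append(f"--docker-arg={arg}")
    return argv


# Assumption: BuildKit's GitHub Actions cache backend (type=gha) is available.
cache_args = ["--cache-from=type=gha", "--cache-to=type=gha,mode=max"]
command = tutor_build_command("openedx", cache_args)

# Print the command as it would be typed in a workflow step.
print(shlex.join(command))
```

Building the argument list in one place makes it easy to switch cache backends (for example, a registry cache) without touching the rest of the workflow.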

Here’s a way to run GitHub Actions locally, provided by my colleague @Melsu (Sujit Kumar): GitHub - nektos/act: Run your GitHub Actions locally 🚀. This might be the best way to combine Docker’s caching capability with Cookiecutter’s automated build workflow.

I’m not sure I follow. You want to build Docker images from within act? Why would you want to do that? Why not build the images directly on the host? Building Docker images from within Docker is not so easy.

Hello @regis ,

The Cookiecutter project by @lpm0073 builds the Docker images and pushes them to AWS ECR using a GitHub workflow. Every time we build an image it takes close to 40 minutes, because the build runs on GitHub runners rather than on a persistent machine, so there is no cache. To avoid this, we can run these workflows locally (from a bastion machine), which means the next run will have a cache and take much less time. This makes smaller changes easier, particularly while developing new features.

Regards.
Sujit Kumar

@regis,

  • verifying the objective: build Open edX Docker images faster, given the constraint that build and deploy workflows run from GitHub.
  • there is no host. there’s a k8s cluster where the deployed applications run, and there are ephemeral nodes at GitHub where build and deploy workflows are executed.

I’m unfamiliar with Act, but at this early stage my understanding is that it would allow GitHub Actions to run on a local host. If true, that sounds like a pretty good idea to me.

I use act routinely when I need to test GitHub CI before pushing changes upstream. Act works by running a Docker container locally. But building a Docker image inside a Docker container is not trivial. I used to build the Tutor images in a Kubernetes cluster. But that was just too difficult: very often the nodes were running out of disk space. Upgrading Kubernetes clusters was also a pain. So I moved back to building images locally: for that I use the docker:dind image, and it works great so far – though the setup is a little convoluted.

I suggest you either run a self-hosted GitHub runner, set up Docker caching, or figure out how to build Docker images remotely (with the dind images, for instance).

thanks @regis. the self-hosted runners look very promising, especially given that these can run locally on Windows and macOS. i’ll read more about these and will probably have time later this week to try to prototype something in the Cookiecutter. i’ll report back after i know more.

i’ll heed your advice on Act.

Meeting 2023-01-10

Thanks to everyone for joining! I won’t be cc’ing everyone since we had quite a large cohort today :slight_smile: . Video recording to follow.

Cliff notes:

Transcribed meeting notes

Xavier

  • Introductions
  • Catch up from Xavier on what we’ve done so far for the newcomers.
  • Are there any areas that they’re interested in collaborating on, etc?
  • Requested further recap from Felipe

Felipe

Xavier

  • The configuration repo has been deprecated, and we were all working on the same thing. Is it possible to get something nice/maintainable?
  • Last month we started the review of the current approach, with specific steps to validate that it works and is a good base to build on.

Adam

  • Joined edX 4 years ago and researched deploying to EKS clusters.
  • Took the notes app and containerised it. It’s been the only service running in k8s for 2.5 years.
  • Was deployed using Kustomize. They needed better customization especially with respect to liveness probes.
  • It’s still closed source, but it’s running about 7 new Django api endpoints. Some are public, but not the rest due to license concerns, etc.
  • Considered using an umbrella chart like tutor-contrib-multi, but migration is difficult, especially codejail.
  • They’re still considering how they’ll be able to deploy all services using k8s instead of the old configuration repo.
  • Asked who is running codejail behind Flask.

Felipe

Adam

  • 2U will consider eduNEXT’s approach to codejail
  • 2U has mostly figured out autoscaling and can contribute to the effort.
  • Currently using Nginx, but is interested in what the community is using, as it could turn out to be useful.
  • Best practices for Kubernetes come up within the org.
    • How to do liveness probes
    • Processes that have permission to write to disk
    • Etc.
  • There’s a Fairwinds article: Kubernetes Configuration Benchmark Report

Xavier

  • Are there things that we’ve worked on that 2U is interested in?

Adam

  • Codejail behind a Django API (not a Flask API) would be great, as they appreciate the consistency of it.
  • They have an internal testing environment, and it will be tested there.

Xavier

  • Meeting is too short, what about next steps?
  • A good step after this meeting might be for Adam to comment on any of the open tickets, like the codejail one.
  • Then we could discuss there and have async discussion.

Braden

  • Collaboration helps us all even if not directly using the Helm chart, because everyone benefits from the small changes.

Felipe

  • Working group for Devops
  • How do we move codejail to a common/shared roadmap?
  • We could tackle multiple projects at a time instead of limiting ourselves to a single goal.

Regis

  • Created a new initiative for the Devops working group.
  • There are already projects that are devops related (three)
  • It doesn’t make much sense to have a working group for all of these projects.
  • Instead Ed proposed a Devops working group, where all the Devops related projects will live.
  • Each project can have its own Slack channel, leadership, governance, and rules.
  • Otherwise it can be handled within the Devops working group.
  • GitHub project: openedx/wg-devops: Issue repository for the DevOps Working Group

Adam

  • How does this differ from BTR?

Regis

  • BTR is more concerned with creating code releases and as such is distinct from Devops.

Xavier

  • What’s the approach to communicating with the Devops WG?

Regis

Adam

  • How to decide whether something fits within Devops or somewhere else?
  • How would we spin up/down working groups depending on the project?

Xavier

  • We try to take care of the issues for deploying larger instances.
  • Helm was a good starting point, but there could be changes in future especially with differences between small/large providers.
  • Is 2U interested in continuing the discussion?

Adam

  • Not sure.

Xavier

  • Adam mentioned some of the current issues that 2U is also facing. Could be helpful to comment on the tickets.
  • Or on the forum?

Regis

  • Trying to push forward refined/groomed issues. Good first issues to engage folks to start working on them.
  • Can this project define important work to attract newcomers?

Adam

  • 2U is trying to figure out the best way to deploy Kubernetes, in terms of stability, etc.
  • There’s nothing on scaling yet. Expects that 2U can start there.
  • 2U can contribute, but will very likely not use the project for a while.
  • Internal helm chart is fully featured. Enterprise ready at this stage.

Daniel

  • Discussed with legal on open sourcing the helm chart.
  • Following up with them again this week.
  • Expects no opposition, legal just doing their due diligence in terms of license/contributor guidelines.
  • Should hopefully be done by the end of this month.
  • Best practices are really important.
  • Scaling/Liveness Probes/Metrics used for Scaling/Observability/Monitoring
  • All of the above are good areas for collaboration

Jeremy

  • Trying to think about how to smooth the learning curve for deploying in production.
  • We’re deploying to k8s, but not developing with k8s.
  • Hoping to find a way to have more consistency across environments.

Adam

  • Autoscaling

Braden

  • Lawrence would be the best person to speak to at the moment.

Lawrence’s mic wasn’t working.

Xavier

  • Go over issues. Discussed the Nginx + Cert manager task with Moises.
  • HPA with Jhony.

My sound dropped out here (low headphone battery), so I didn’t get the full conversation

Jhony

  • Talked a bit about Karpenter and approaches to autoscaling.
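
For readers new to the topic, the core replica calculation documented for the Kubernetes Horizontal Pod Autoscaler (HPA) can be sketched as follows. This illustrates the published formula only and is not code from the prototype: the function name and the sample numbers are made up, and the real controller also applies min/max replica bounds, handles missing metrics, and uses stabilization windows.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.1):
    """Replica count from the documented HPA formula:
    ceil(current_replicas * current_metric / target_metric).

    No change is made while the ratio stays within `tolerance` of 1.0
    (0.1 is the documented default tolerance)."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Hypothetical numbers: 4 pods averaging 90% CPU against a 60% target.
scaled_up = desired_replicas(4, 90, 60)

# Within tolerance (63% against a 60% target): the count is left alone.
unchanged = desired_replicas(4, 63, 60)

# Load halves (30% against a 60% target): the count is halved.
scaled_down = desired_replicas(4, 30, 60)
```

The HPA only changes the number of pods. Node-level autoscalers such as Karpenter address the separate question of whether the cluster has enough nodes to schedule those pods.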

Xavier

  • Any objections to holding the next meeting at the same time in 2 weeks?
  • Quickly went over the issues in the Git repo.
  • Further discussion to happen async.

@keithgg Thank you for the meeting recap! And here is the video recording (chat log):

Next meeting

I’ve sent a calendar invite for the next meeting, which will be in 2 weeks, on 2023-01-24T17:00:00Z in this Zoom room. I’ve made it recurring to simplify the planning.

My apologies for not having properly followed the agenda for the last meeting btw, but I felt it was useful to hear from the 2U participants who have just joined us. This has delayed a bit the review of the status of the work we had scheduled for yesterday, as we didn’t have enough time to discuss it, but we’ll try to cover this asynchronously until next meeting, and dedicate more time to it during the next meeting.

Here is the proposed agenda for that next meeting - don’t hesitate to reply if you would like any changes or additions:

  1. Assign scribe role, greetings & introductions as needed. (5 minutes)

  2. Kubernetes, Tutor & Helm (40 minutes) - Debrief of the work from the tasks list & formal review

  3. DevOps Working Group (10 minutes) - Continue discussing the coordination with the parent DevOps working group, and the formalization of our group (as a “big/multi instances” subgroup of devops?)

  4. Next steps & conclusion (5 minutes)

Current tasks

Since we had very little time to discuss the tasks, the follow-up is happening async on the tickets:

CC @adzuci

Hey folks, we need to figure out a new name for the project since it’s more about Open edX + Kubernetes + Multiple Instances than it is about Tutor, though it of course assumes use of Tutor for building container images.

Thoughts on any of these names? Vote for 1-2 that you like :slight_smile:

  • Baseline (openedx-k8s-baseline)
  • Catalyst (openedx-k8s-catalyst)
  • Harmony (openedx-k8s-harmony)
  • Ensemble (openedx-k8s-ensemble)
  • Common Hosting Environment for Containers on Kubernetes (Open edX CHECK)

Hi y’all! Here is the recap from the meeting we held on 2023-01-24.

This was a short meeting going over the current list of tasks in the GitHub repo, with some focus on cloud-hosted development environments.

First PRs are ready for review, with discussion continuing on the created tasks.

Meeting notes

Braden

  • Thanks to the creators of PRs. Nginx has been reviewed; only the shared Elasticsearch still needs to be checked.
  • He will check it with eduNEXT
  • Please vote on the name changes for the project on the forum.
  • We will check up on the tasks.
  • NGINX work is basically done.

Jhony

Lawrence

Felipe

Braden

  • It only works with AWS. No reason not to support it for the folks who want to use it.
  • Mentioned OEP-45 and asked what the current status is.

Felipe

  • He is currently arbiter, but is also author. How does that work?
  • There hasn’t been proper support from outside the working group yet, so there’s an assumption of approval.

Braden

  • Is there a way to say it’s provisionally approved?

Felipe

  • Let’s keep it a draft. Will change status and add comments.

Braden

Jeremy

Braden

  • Does it require fewer resources than on a local machine?

Jeremy

  • Whether we can run only some services on it, or have to have the whole thing running, is a question not yet answered.

Felipe

  • They had the issue of wanting to run other things on Tutor, but not the LMS.

Jeremy

  • In Devstack it’s possible, but wonky since LMS is the source of auth.
  • With Tutor it’s not yet ready.

Braden

  • Looks like a nice comprehensive review.

Felipe

  • Codespaces + GitHub is great.

Jeremy

  • We don’t want to be using a custom orchestration system, then using something else to deploy.
  • E.g. using Docker/Python on the dev machine, but then deploying using K8s.
  • Preferably devs should be using the same dev environment to deploy.

Braden

  • Anything else that we want to chat about?

Felipe

  • They want to present a talk at the conference about crunching the Kubernetes numbers.
  • And some sort of analysis of what they’ve found so far running large instances on Kubernetes.

Braden

  • OpenCraft can definitely collaborate and share some numbers. Sounds great.
  • Anything else?

Felipe

  • Regarding monitoring: are we interested in doing something that monitors the platform in a more useful manner?
  • Most of us are using Prometheus + Grafana.
  • This sometimes results in too much information that he doesn’t care about.
  • Wants to know Pod utilization vs heartbeat of LMS.
  • Is anyone else interested?

Lawrence raises his hand

Lawrence

  • Some guys at edX have made him aware of a product called Kubecost.
  • Had a convo with them last week, will be able to break down the cost service by service.

Felipe

  • Was thinking more of the values that he wants to monitor.
  • Database values/MySQL/Ingress values.
  • But going via cost is a worthwhile avenue.

Braden

  • We haven’t really looked at the cost, but it would be very useful.
  • Monitoring would be a really great thing to collaborate on.

Nothing else to discuss. Meeting ends.

@keithgg Thanks for the notes and the recap! :+1:

@braden Is there a recording of the meeting? I didn’t receive one from Zoom, but it looks like the calendar was the host?

@lpm0073 @keithgg Did that discussion happen? It would be useful to post an update on the ticket - it looks like some pings don’t get through to you @lpm0073 ?

I also see further down the notes that @Felipe and @braden are also interested in collaborating on this. What would be the next steps on this?

@Felipe @braden To clarify here, was the decision that we are ok to move forward with Karpenter? Could this be mentioned explicitly on the ticket at Karpenter? · Issue #7 · openedx/openedx-k8s-harmony · GitHub ? And did someone volunteer to work on it?

Btw, there are a few questions on the topic of auto-scaling in Add autoscaling · Issue #2 · openedx/openedx-k8s-harmony · GitHub that could use a few eyes maybe? Were those topics discussed during the meeting? To be able to move forward, a decision would be useful? CC @jhony_avella

@Felipe Since the formal review period is over, it would be better to merge it provisionally - this way it becomes accessible at https://open-edx-proposals.readthedocs.io/en/latest/index.html and it’s clear that it’s the way we have agreed on for now. Then when refinements are done, we can always update the OEP. It makes it easier for others to open PRs against the document, too.

Yes, both are moving forward. I’ve commented on the first one to follow-up with @regis . For the second one, we need those who want to have access to the repo to mention it on the ticket.

@jmbowman +100 to this! I have created this ticket on the repo to track the collaboration/support work on that end: Support for the Cloud-based developer environment · Issue #14 · openedx/openedx-k8s-harmony · GitHub

@Felipe @braden Good idea to discuss this at the conference! It could be an occasion to advertise our work, to attract more contributors from the rest of the community?

I have created a ticket to track this: Conference presentation about Kubernetes on large instances · Issue #15 · openedx/openedx-k8s-harmony · GitHub - I’ve also asked there about who is going to be presenting?

@antoviaque Unfortunately none of us could figure out how to do a recording. It said we didn’t have permission. I’ll have to learn what it means if the calendar is the host; I assumed it was you.

That’s what we actually decided on the call, just used the wrong word. See update from @Felipe here: OEP-45 :: ADR-002: Deploying Open edX on Kubernetes Using Helm by bradenmacdonald · Pull Request #372 · openedx/open-edx-proposals · GitHub

We looked at the ticket very briefly during the meeting. I posted an update on the ticket just now. IMHO this one is not urgent so there is no need to decide now while some of the questions of scope are still a bit fuzzy. I’m hoping we’ll get more context to make such a decision in the future after other pieces have fallen into place.

Yes we discussed it and @lpm0073 is still planning to do it.

We looked at it briefly and decided to follow up async so we have more time to consider it.

I’ll reply there.