HPA support for extra IDAs and resources in the Open edX ecosystem

As an Open edX installation hosted in Kubernetes receives increasing traffic, one of the features an operator might want to enable is Horizontal Pod Autoscaling (HPA). Generally, the deployments requiring HPA implementation are the LMS, CMS, and the corresponding Celery workers (the tutor-contrib-pod-autoscaling plugin already helps achieve this). However, other components in the Open edX ecosystem, such as Discovery or a specific layer of the Aspects service, might also require pod-autoscaling functionality, and there’s no clear path to proceed in these cases.

A few options we have right now are:

  1. Create a custom Tutor plugin that includes the HPA definitions for the required components.

Pros:

  • Centralized HPA implementation for additional components, making it straightforward to apply substantial updates to all resources.

Cons:

  • Another Tutor plugin to manage.
  • A plugin's resources end up managed outside that plugin.

  2. Make the tutor-contrib-pod-autoscaling plugin flexible enough to scale any type of deployment. There's an open PR that aims to achieve this goal.

Pros:

  • HPA for ALL services is managed from the same plugin.

Cons:

  • The plugin's configuration could become complicated to maintain.

  3. Rely on each IDA's Tutor plugin to implement the HPA feature (for instance, the codejail plugin already has it).

Pros:

  • No additional plugin is needed to add pod-autoscaling definitions.
  • Each plugin manages all its resources internally.

Cons:

  • Every plugin maintainer has to properly support this feature.

The idea is to start the discussion and hopefully define a “standard” or a common way to add this HPA feature for medium-to-large scale installations.

Your feedback is greatly appreciated, @tutor-maintainers, @jhony_avella, @MoisesGonzalezS

Thanks for starting the conversation @Henrry_Pulgarin. I think that option 3 would be the most difficult. We would have to harmonize HPA behaviour across all plugins, which would be very tricky.

Thus, options 1 and 2 are the best, IMHO. I think that the best approach would be to use the pod-autoscaling plugin as a basis that other plugins can use to provide HPA.

For instance, the pod-autoscaling plugin would provide a base configuration only for the lms/cms(-worker) containers, like it does today. But it would also provide custom hooks such that other plugin maintainers can add their services there. And there would be comprehensive docs to explain how to do that.

Looking at that PR I strongly recommend you do not go down that road. Putting all the intelligence of your plugin inside a configuration entry will make your config.yml file very difficult to maintain. Instead, I suggest you create a custom filter that you make available and document for other plugin developers.
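To make the filter idea concrete, here is a minimal, self-contained sketch. It uses a tiny stand-in `Filter` class because the real implementation would use `tutor.core.hooks.Filter`; all the names in it (`HPA_RULES`, `_add_discovery_hpa`) are hypothetical illustrations, not an existing pod-autoscaling API.

```python
# Stand-in for tutor's Filter: registered callbacks transform a value in order.
class Filter:
    def __init__(self):
        self._callbacks = []

    def add(self):
        def decorator(func):
            self._callbacks.append(func)
            return func
        return decorator

    def apply(self, value):
        for func in self._callbacks:
            value = func(value)
        return value

# The pod-autoscaling plugin would own and document this filter.
HPA_RULES = Filter()

# A third-party plugin (e.g. a Discovery plugin) hooks in to register its service.
@HPA_RULES.add()
def _add_discovery_hpa(rules):
    rules.append(("discovery", {"min_replicas": 1, "max_replicas": 5, "avg_cpu": 80}))
    return rules

# At template-rendering time, the plugin collects every registered rule
# and emits one HorizontalPodAutoscaler manifest per entry.
all_rules = HPA_RULES.apply([])
print(all_rules)
```

The key design point is that the configuration logic lives in Python, where other plugin maintainers can hook in, instead of in a sprawling `config.yml` entry.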

I have not looked too closely at this from inside the team, so I only have this thread for context. That said, I can imagine that one motivation for such a configuration-heavy PR is being able to add HPA to a service that does not have it, in cases where a contribution back to the original plugin might not be accepted, or where getting the necessary approvals becomes a drag. Is this correct? Have we seen anything like that so far, @Henrry_Pulgarin?

Now, I like the idea that HPA is something a single plugin could provide. Even better if we could have the best of both worlds: by adding the hpa plugin you get basic HPA support for all your services, with sensible enough defaults for memory and CPU, and then you can override these at different levels via different templates or the custom filter.
One level could be the hpa plugin itself, for services that are very common, and another level could be the original plugin that implements the service.

@regis if this hpa plugin offers a custom filter that every other plugin must implement, would it make sense to keep it as a plugin, or is that something that should go to the core of tutor? That is the difference between options 1 and 2 as I understand this.


I can imagine that one of the motivations for such a configuration heavy PR is to be able to add HPA to a service that does not have it and where the contribution back to the original plugin might not be accepted or it just becomes a drag to get the necessary approvals

@Felipe The motivation for the PR is to add the possibility of configuring HPA for services that don't have it; it doesn't stem from issues with contributing to the IDAs. It is an attempt to solve the problem, and from there we are wondering where the right place for this is.

@regis The idea of the filter looks fine, but, as @Felipe says, I also wonder whether this filter should live in Tutor itself.


@regis thanks for your reply. Let's take a look at an example that combines alternatives 1 and 2 and makes use of the tools we have right now. Suppose I want to add HPA to the notes service based on CPU consumption, to keep it simple. Let's say I want to keep the CPU consumption at 80% on average, and the HPA can scale up to 10 replicas (pods), with a minimum of 1 pod. I need to do a couple of things to achieve the goal:

  1. Patch the notes deployment to insert CPU requests and limits, otherwise HPA won’t work.
  2. Write the HPA resource associated with the deployment, which defines the pod-autoscaling behavior (average CPU consumption, min and max replicas, etc.).

How can I achieve that?

  1. I can use the k8s-override core patch to get the notes deployment overridden properly.
  2. I can use the pod-autoscaling-hpa patch offered by the pod-autoscaling plugin to write an extra HPA object for notes service.

Both steps translate to a simple tutor-plugin like the following:

from tutor import hooks

hooks.Filters.ENV_PATCHES.add_items([
    (
        "k8s-override",
        """
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: notes
spec:
  template:
    spec:
      containers:
        - name: notes
          resources:
            requests:
              cpu: 50m
            limits:
              cpu: 200m
"""
    ),
    (
        "pod-autoscaling-hpa",
        """
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: notes-hpa
  labels:
    app.kubernetes.io/name: notes-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: notes
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
"""
    ),
])

Using this plugin + tutor-contrib-pod-autoscaling will provide HPA for notes, and this can be extended to any other service requiring HPA. If this process is documented, would this be an appropriate approach? Notice I had to write an extra plugin to get the functionality I needed.

I'm curious about the filter approach you propose; however, I'm not sure it makes sense to write the filter outside the Tutor core. Is there any example of how to implement it in a plugin?
CC @Felipe @Henrry_Pulgarin


Now I'm wondering about the case of the Aspects plugin. This is not exactly an IDA but a group of services that work together to provide insights about Open edX platform usage. Since it is a single plugin handling multiple services, it could have special requirements for pod-autoscaling. Would it make sense for a plugin like this one to have its own HPA/pod-autoscaling implementation self-contained in the plugin's code?

@Henrry_Pulgarin @regis @Felipe

@Ian2012 @BrianMesick I’m tagging you guys since you’ve been involved in the Aspects development. Please read the context from the top to get a better understanding.

I haven’t been involved much on the deployment side of Tutor. On the surface having autoscaling as a core feature in Tutor would seem like a great thing to support, but I’m sure there’s nuance there that I don’t know about. I’d prefer to not have to rely on another plugin to manage that, and there are always a lot of factors involved in tuning which parts of an app make sense to scale vertically or horizontally, so having patches to make those things configurable on a per-deployment basis would seem to make sense.

Yes. Have a look at the "hooks" module from the tutor-mfe plugin: https://github.com/overhangio/tutor-mfe/blob/master/tutormfe/hooks.py They are documented in the tutor-mfe repository: https://github.com/overhangio/tutor-mfe

@regis I’m revisiting this topic and would like to share the PR we created to implement the filter that would allow extending the HPA/VPA features to the different IDAs.

For now, it is not a requirement to call the filter from other IDA plugins. For instance, if I want to add HPA support, I can add and activate the following Python plugin:

from tutorpod_autoscaling.hooks import AUTOSCALING_CONFIG

@AUTOSCALING_CONFIG.add()
def _add_idas_autoscaling(scaling_config):
    scaling_config["notes"] = {
        "enable_hpa": True,
        "memory_request": "50Mi",
        "cpu_request": 0.1,
        "memory_limit": "100Mi",
        "cpu_limit": 0.2,
        "min_replicas": 1,
        "max_replicas": 3,
        "avg_cpu": 100,
        "avg_memory": "",
        "enable_vpa": False,
    }

    return scaling_config

We still need a Tutor plugin to handle the HPA/VPA parameters, but it's easier now since we no longer deal with patches to add YAML. The invitation is open for you and anyone else in the community interested in this topic to take a look at the PR. If everything works fine and we get the approvals, we will release these changes in the Redwood Open edX release.


Although I see the point of this auto-scaling configuration handling, the fact that a plugin is needed to update a plugin is odd and feels wrong to me. Isn’t there a way to avoid “dependency hell”?

I had an idea how we could do that, though as it was pointed out, that could add maintenance burden to the plugin maintainers.

At this point, I feel that either the plugin maintainer or the plugin user will have to deal with this.

I would like to invite the @tutor-maintainers to this conversation, especially those supporting Tutor plugins related to Open edX IDAs:

  • A refactoring was applied to the tutor-contrib-pod-autoscaling (tcpa) plugin to facilitate the HPA integration in different Open edX services. Please check the thread above for better context.
  • The refactoring implements a new filter that defines a static dictionary with default HPA values for the most important Open edX services (LMS, CMS, and the Celery workers). This dictionary can be modified by hooking into the filter and adding/modifying/removing elements. Other Tutor plugins can hook into the filter and add their own HPA values for the corresponding IDAs. This prevents accumulating too many configuration values in the same plugin, which could make it difficult to maintain.
  • As @gabor pointed out in his comment, an additional plugin is required to modify the filter values, since they are statically defined. Changing the HPA values is a routine operation during the performance-tuning phase of a medium-to-large Open edX installation. This means that, in addition to installing tcpa, an extra plugin is required to adjust the HPA values, which leads to the dependency-chain issue caused by this filter approach.

This is a trade-off between too many settings to maintain and a Tutor plugin dependency chain that is hard for the plugin user/operator to maintain. I think the alternative @gabor proposes lets us choose between the two: either use a tcpa configuration value to override the default filter content, or have plugins hook into the filter and change the content. We'll modify the PR to enable this behavior and will improve the documentation.
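The two override levels described above can be sketched as a simple merge order: plugin-filter contributions apply first, operator config overrides last. All names here (`DEFAULT_RULES`, `resolve_rules`, the shape of the rules dictionary) are illustrative assumptions, not the actual tcpa implementation.

```python
# Hypothetical built-in defaults that the tcpa plugin would ship.
DEFAULT_RULES = {
    "lms": {"min_replicas": 1, "max_replicas": 10, "avg_cpu": 80},
    "cms": {"min_replicas": 1, "max_replicas": 5, "avg_cpu": 80},
}

def resolve_rules(defaults, config_overrides, filter_callbacks):
    """Merge plugin-filter contributions, then operator config overrides."""
    rules = {name: dict(settings) for name, settings in defaults.items()}
    for callback in filter_callbacks:                 # level 1: other plugins
        rules = callback(rules)
    for name, settings in config_overrides.items():   # level 2: operator config
        rules.setdefault(name, {}).update(settings)
    return rules

# A plugin registers the notes service via the filter, while the operator
# raises the LMS replica ceiling from a config value in config.yml.
plugin_cb = lambda r: {**r, "notes": {"min_replicas": 1, "max_replicas": 3, "avg_cpu": 100}}
final = resolve_rules(DEFAULT_RULES, {"lms": {"max_replicas": 20}}, [plugin_cb])
print(final["lms"]["max_replicas"])  # → 20
```

With this ordering, operators never need an extra plugin just to tune values, while plugin maintainers can still register their services through the filter.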

We would like as much feedback as possible so that we can hopefully release these changes in the Redwood Open edX release. I would like to ask the @tutor-maintainers maintaining IDAs like Forum, Discovery, and others: do you see any potential in hooking into this new tcpa filter from your plugins to add the HPA feature? Do you see blockers or impediments to implementing this approach?

I know @bmedx is interested in using pod-autoscaling for Aspects; I'm interested in hearing what you think. As long as a Tutor Python plugin can be created in the plugins folder, operators could easily use it. Leaving the operation/development to Tutor plugin developers may not be the right way to solve this problem, as it would lengthen the dependency chain, and operators have more context to tune the values according to their needs.

I understand your point of view, but I think that in practice this works fine. Anyone who is going to implement HPA is going to tune the heck of their k8s deployment anyway, so they will most likely already have a tutor plugin to customize their platform.

The question that you need to ask is: what API do you want to expose for your end users? There are many options. For instance:

  1. Implementing a tutor plugin
  2. tutor hpa set '{"lms": {"cpu_request": 0.5}}'
  3. Override defaults in hpa-overrides.yml
  4. export HPA_OVERRIDES='{"lms": {"cpu_request": 0.5}}'

For now, only option 1 exists. You need to decide what API you want, and then implement your plugin in a way that supports it.
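For illustration, option 4 from the list above could be as small as the following sketch. `HPA_OVERRIDES` is the hypothetical environment variable named in that list; the merge logic and default values are assumptions, not an existing implementation.

```python
import json
import os

def apply_env_overrides(defaults, env=os.environ):
    """Overlay JSON overrides from the HPA_OVERRIDES env var onto defaults."""
    overrides = json.loads(env.get("HPA_OVERRIDES", "{}"))
    merged = {name: dict(settings) for name, settings in defaults.items()}
    for service, settings in overrides.items():
        merged.setdefault(service, {}).update(settings)
    return merged

# Simulate: export HPA_OVERRIDES='{"lms": {"cpu_request": 0.5}}'
merged = apply_env_overrides(
    {"lms": {"cpu_request": 0.25}, "cms": {"cpu_request": 0.25}},
    {"HPA_OVERRIDES": '{"lms": {"cpu_request": 0.5}}'},
)
print(merged["lms"]["cpu_request"])  # → 0.5
```

Whichever API is chosen, the point stands: pick the end-user surface first, then shape the plugin internals to support it.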

I moved the original PR to the Redwood upgrade PR. @gabor would you mind taking a look at the new PR? The idea is to have the new implementation ready for Redwood as indicated here

Since Redwood was already released, we’ve merged the filter implementation in the pod autoscaling plugin. It is available in the latest plugin release. Thanks all for your comments and contributions to this effort.