HPA support for extra IDAs and resources in the Open edX ecosystem

As an Open edX installation hosted in Kubernetes receives increasing traffic, one of the features an operator might want to enable is Horizontal Pod Autoscaling (HPA). Generally, the deployments requiring HPA implementation are the LMS, CMS, and the corresponding Celery workers (the tutor-contrib-pod-autoscaling plugin already helps achieve this). However, other components in the Open edX ecosystem, such as Discovery or a specific layer of the Aspects service, might also require pod-autoscaling functionality, and there’s no clear path to proceed in these cases.

A few options we have right now are:

  1. Create a custom Tutor plugin that includes the HPA definitions for the required components.

Pros:

  • Centralized HPA implementation for additional components, allowing straightforward application of substantial updates to all resources.

Cons:

  • Another Tutor plugin to manage.
  • Plugin resources end up being managed outside the plugin that owns them.

  2. Make the tutor-contrib-pod-autoscaling plugin flexible enough to scale any type of deployment. There’s an open PR that aims to achieve this goal.

Pros:

  • HPA for ALL services is managed from the same plugin.

Cons:

  • The plugin’s configuration could become complicated to maintain.

  3. Rely on each IDA’s own Tutor plugin to implement the HPA feature (for instance, the codejail plugin already has it).

Pros:

  • No additional plugin is needed to add pod-autoscaling definitions.
  • Each plugin manages all its resources internally.

Cons:

  • Every plugin maintainer would have to properly support this feature.

The idea is to start the discussion and hopefully define a “standard”, or at least a common way, to add this HPA feature for medium-to-large-scale installations.

Your feedback is greatly appreciated, @tutor-maintainers, @jhony_avella, @MoisesGonzalezS

Thanks for starting the conversation @Henrry_Pulgarin. I think that option 3 would be the most difficult. We would have to harmonize HPA behaviour across all plugins, which would be very tricky.

Thus, options 1 and 2 are the best, IMHO. I think that the best approach would be to use the pod-autoscaling plugin as a basis that other plugins can use to provide HPA.

For instance, the pod-autoscaling plugin would provide a base configuration only for the lms/cms(-worker) containers, like it does today. But it would also provide custom hooks such that other plugin maintainers can add their services there. And there would be comprehensive docs to explain how to do that.

Looking at that PR, I strongly recommend you do not go down that road. Putting all the intelligence of your plugin inside a configuration entry will make your config.yml file very difficult to maintain. Instead, I suggest you create a custom filter that you make available and document for other plugin developers.
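For the record, declaring such a filter inside the pod-autoscaling plugin could look roughly like this (a minimal sketch using the Filter class from Tutor’s hooks API; the names are illustrative):

from tutor.core.hooks import Filter

# Filter collecting per-service autoscaling settings. Other plugins append
# their own entries to it instead of pasting raw YAML into config.yml.
AUTOSCALING_CONFIG: Filter = Filter()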

I have not looked too closely at this from inside of the team, so I only have this thread for context, but I can imagine that one of the motivations for such a configuration-heavy PR is to be able to add HPA to a service that does not have it, where the contribution back to the original plugin might not be accepted or it just becomes a drag to get the necessary approvals. Is this correct? Have we seen anything like that up to this point, @Henrry_Pulgarin?

Now, I like the idea that HPA is something a single plugin could provide. Even better if we could have the best of both worlds: by adding the HPA plugin you get basic HPA support for all your services, with defaults for memory and CPU that are sensible enough, and then you can override this at different levels via different templates or the custom filter.
One level could be the HPA plugin itself, for services that are very common, and another level could be the original plugin that implements the service.

@regis if this HPA plugin offers a custom filter that every other plugin must implement, would it make sense to keep it as a plugin, or is that something that should go into the core of Tutor? That is the difference between options 1 and 2 as I understand it.


I can imagine that one of the motivations for such a configuration-heavy PR is to be able to add HPA to a service that does not have it, where the contribution back to the original plugin might not be accepted or it just becomes a drag to get the necessary approvals

@Felipe The motivation behind the PR is to add the possibility of configuring HPA for services that don’t have it, but it doesn’t stem from issues with contributing to the IDAs. It is an attempt to solve the problem, and from there we are wondering where the right place to take this is.

@regis The idea of the filter looks fine, but as @Felipe says, I also wonder whether this filter should live in Tutor core.


@regis thanks for your reply. Let’s take a look at an example that combines alternatives 1 and 2 and makes use of the tools we have right now. Suppose I want to add HPA to the notes service based on CPU consumption, to keep it simple. Let’s say I want to keep the CPU consumption at 80% on average, and the HPA can scale up to 10 replicas (pods), with a minimum of 1 pod. I need to do a couple of things to achieve the goal:

  1. Patch the notes deployment to insert CPU requests and limits; otherwise, HPA won’t work.
  2. Write the HPA resource associated with the deployment, which defines the pod-autoscaling behavior (average CPU consumption, min and max replicas, etc.).

How can I achieve that?

  1. I can use the k8s-override core patch to get the notes deployment overridden properly.
  2. I can use the pod-autoscaling-hpa patch offered by the pod-autoscaling plugin to write an extra HPA object for notes service.

Both steps translate to a simple Tutor plugin like the following:

from tutor import hooks

hooks.Filters.ENV_PATCHES.add_items([
    # Step 1: use the "k8s-override" core patch to add CPU requests and
    # limits to the notes deployment.
    (
        "k8s-override",
        """
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: notes
spec:
  template:
    spec:
      containers:
        - name: notes
          resources:
            requests:
              cpu: 50m
            limits:
              cpu: 200m
"""
    ),
    # Step 2: use the "pod-autoscaling-hpa" patch from the pod-autoscaling
    # plugin to declare the HPA object targeting the notes deployment.
    (
        "pod-autoscaling-hpa",
        """
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: notes-hpa
  labels:
    app.kubernetes.io/name: notes-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: notes
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
"""
    ),
])

Using this plugin together with tutor-contrib-pod-autoscaling will provide HPA for notes, and the same approach can be extended to any other service requiring HPA. If this process is documented, would this be an appropriate approach? Notice I had to write an extra plugin to get the functionality I needed.

I’m curious about the filter approach you propose; however, I’m not sure it makes sense to write the filter outside the Tutor core. Is there any example of how to implement it in a plugin?
CC @Felipe @Henrry_Pulgarin


Now I’m wondering about the case of the aspects plugin. This is not exactly an IDA but a group of services that work together to get insights about Open edX platform usage. Since it is a single plugin handling multiple services, it could well have special requirements for pod-autoscaling. Would it make sense for a plugin like this one to have its own pod-autoscaling implementation self-contained in the plugin’s code?

@Henrry_Pulgarin @regis @Felipe

@Ian2012 @BrianMesick I’m tagging you guys since you’ve been involved in the Aspects development. Please read the context from the top to get a better understanding.

I haven’t been involved much on the deployment side of Tutor. On the surface, having autoscaling as a core feature in Tutor would seem like a great thing to support, but I’m sure there’s nuance there that I don’t know about. I’d prefer not to have to rely on another plugin to manage that, and there are always a lot of factors involved in tuning which parts of an app make sense to scale vertically or horizontally, so having patches to make those things configurable on a per-deployment basis would seem to make sense.

Yes. Have a look at the “hooks” module from the tutor-mfe plugin: https://github.com/overhangio/tutor-mfe/blob/master/tutormfe/hooks.py. It is documented in the tutor-mfe README: https://github.com/overhangio/tutor-mfe
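The pattern there looks roughly like this (simplified from the tutor-mfe hooks module; the MFE name, repository URL, and port below are illustrative):

# In tutormfe/hooks.py, the plugin declares a filter:
from tutor.core.hooks import Filter

MFE_APPS: Filter = Filter()

# Any other plugin can then register its own MFE through that filter:
from tutormfe.hooks import MFE_APPS

@MFE_APPS.add()
def _add_my_mfe(mfes):
    mfes["mymfe"] = {"repository": "https://github.com/myorg/mymfe", "port": 2001}
    return mfes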

@regis I’m revisiting this topic and would like to share the PR we created to implement the filter that would allow extending the HPA/VPA features to the different IDAs.

For now, it is not a requirement to call the filter from other IDAs’ plugins. For instance, if I want to add HPA support for the notes service, I can add and activate the following Python plugin:

from tutorpod_autoscaling.hooks import AUTOSCALING_CONFIG

# Register the notes service's scaling settings on the plugin's filter.
@AUTOSCALING_CONFIG.add()
def _add_idas_autoscaling(scaling_config):
    # Enable HPA (but not VPA) for the notes deployment.
    scaling_config["notes"] = {
        "enable_hpa": True,
        "memory_request": "50Mi",
        "cpu_request": 0.1,
        "memory_limit": "100Mi",
        "cpu_limit": 0.2,
        "min_replicas": 1,
        "max_replicas": 3,
        "avg_cpu": 100,
        "avg_memory": "",
        "enable_vpa": False,
    }

    return scaling_config

We still need a Tutor plugin to handle the HPA/VPA parameters, but it’s easier now since we no longer deal with patches to add YAML. The invitation is open for you and anyone else in the community interested in this topic to take a look at the PR. If everything works fine and we get the approvals, we will release these changes in the Redwood Open edX release.
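For context on the mechanics: the plugin itself gathers all the entries registered on the filter with something along these lines (a sketch of the general filter mechanism, not the exact code from the PR):

from tutorpod_autoscaling.hooks import AUTOSCALING_CONFIG

def get_autoscaling_config() -> dict:
    # Run every callback registered on the filter, starting from an empty
    # dict; the merged result can then drive the templates that render the
    # HPA/VPA manifests.
    return AUTOSCALING_CONFIG.apply({})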
