Kubernetes errors updating from Palm to Redwood

Yago · April 3, 2025, 11:45am

Hello

We are having problems updating our kubernetes cluster from OpenedEdx’s v16.1.5 to v18.2.2.

On the one hand we encountered some stoppers on the node side (screenshots on links):

Evicted pods: throughout the upgrade process time and again we received an Evicted status in pods. Although this status per se is assumable, the recreation of the pods was unstable (pods with status Error, ContainerStatusUnkonwn, pods that were up only for a few minutes, etc…)
Node condition Disk Pressure. Evicted from a pod because the node to which the pod was associated was declared with DiskPressure
low ephemeral storage. Evicted and Kill of a pod because the node ran out of resources in ephemeral-storage
untolerated taint. Unable to associate a pod to a node (FailedSchedule) because all nodes are in UntoleratedTaint, by {node.kubernetes.io/disk-pressure} and {node-role.kubernetes.io/control-plane}

On the other hand we cannot rule out that the problem is on the application side:

all these errors jumped when upgrading the application from version 16.x to version 18.x.
the upgrade involved removing all pods from version 16. x (both versions never coexisted)
while version 18.x is working in staging correctly…
…when we saw that we could not solve the problem in the established deadline we reverted production from version 18.x to version 16.x and this one worked correctly. None of the above-mentioned errors appeared.

The latter surprises me: if there is no disk space for a new application, why is there disk space for an old one? why staging yaml’s don’t work in PRO? some caching system maybe? I haven’t found any information applicable to this issue…

For me it is difficult to rule out any of both “responsibility fronts”, application or infrastructure. Any help is welcome.

Best regards and thanks in advance

maxim · April 3, 2025, 1:01pm

I’m not a k8s guru, so I could be wrong, and at the same time, based on my experience and the things you’ve posted, all the issues that you’re experiencing (e.g. pods getting evicted as well as the error message about low ephemeral storage, untolerated taint, etc) point toward the issues with the infra and lack of the resources on the nodes/disks, and not the application.

From your post, my understanding is that you’re trying to jump a release (i.e. upgrade from v16 to v18, skipping v17). If that is right, I would recommend against it, as each version has “upgrade” scripts that are ran, and if you skip a release, they might not get applied.

I think this has to do with how Taints and Tolerations work (ref). Basically, in the newer version, pods declare the minimum number of resources (e.g. disk space or RAM) to be available on the node that the pod is going to be spawned on, and if there are no nodes that satisfy that requirement, the pod will not be spawned. This kind of tracks with some of the error messages you are getting. The older version of the app might have lower requirements.

Yago · April 9, 2025, 12:46pm

Hi Maxim. Sorry for the late response.

It is strange for me also, I guess that some of the issues we are facing have to do with availale resources but it is working on a less powerful environment (staging).

I am checking the declarative files but I don’t see a higher limit request from the newer version.

We’ll keep on searching.

Thanks for your response
regards

Topic		Replies	Views
Tutor k8s severe error sometime Site Operations Help tutor	1	163	March 12, 2024
Kubernetes howtos? Site Operators	5	1039	December 3, 2020
Deploying Open edX on Kubernetes Using Helm Collaborative Proposals k8s	55	4746	July 13, 2023
IBM cloud deployment not running Site Operators	2	665	July 2, 2020
Kubernetes cms-job error Tutor Help tutor , k8s	2	42	February 19, 2025

Kubernetes errors updating from Palm to Redwood

Related topics