Kubernetes errors updating from Palm to Redwood

Hello

We are having problems upgrading our Kubernetes cluster from Open edX v16.1.5 to v18.2.2.

On the one hand, we ran into some blockers on the node side (screenshots in the links):

On the other hand we cannot rule out that the problem is on the application side:

  • all of these errors appeared when we upgraded the application from version 16.x to version 18.x.
  • the upgrade involved removing all pods from version 16.x (the two versions never coexisted)
  • meanwhile, version 18.x is working correctly in staging
  • …when we saw that we could not solve the problem before the deadline, we reverted production from version 18.x to version 16.x, which worked correctly. None of the errors mentioned above appeared.

The latter surprises me: if there is no disk space for the new application, why is there disk space for the old one? Why don’t the staging YAML files work in production (PRO)? Some caching system, maybe? I haven’t found any information applicable to this issue…

It is hard for me to rule out either of the two “responsibility fronts”, application or infrastructure. Any help is welcome.

Best regards and thanks in advance

I’m not a k8s guru, so I could be wrong. That said, based on my experience and what you’ve posted, all the issues you’re experiencing (e.g. pods getting evicted, the error message about low ephemeral storage, the untolerated taint, etc.) point toward the infrastructure and a lack of resources on the nodes/disks, not the application.
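For what it’s worth, the kubelet decides that a node is under disk pressure by comparing free space against an eviction threshold. Here is a minimal Python sketch of that check, assuming the documented default hard threshold of 10% free space on the node filesystem (`nodefs.available<10%`). Clusters can override this value, and the disk sizes below are made up for illustration:

```python
def under_disk_pressure(capacity_gib, used_gib, threshold_fraction=0.10):
    """Return True when free space drops below the eviction threshold.

    threshold_fraction=0.10 mirrors the default hard eviction signal
    nodefs.available<10%; your cluster may configure a different value.
    """
    available_gib = capacity_gib - used_gib
    return available_gib < capacity_gib * threshold_fraction


# Hypothetical 100 GiB node disk at two different usage levels.
for used in (85, 93):
    print(used, "GiB used -> disk pressure:", under_disk_pressure(100, used))
```

Once a node crosses that line, the kubelet starts evicting pods and the node is tainted so that new pods are not scheduled on it, which would match the kind of messages you describe.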

From your post, my understanding is that you’re trying to skip a release (i.e. upgrade from v16 to v18 without going through v17). If that is right, I would recommend against it, as each version has “upgrade” scripts that are run, and if you skip a release, they might not get applied.

I think this has to do with how Taints and Tolerations work (ref). Basically, in the newer version, pods declare the minimum amount of resources (e.g. disk space or RAM) that must be available on the node the pod is going to be scheduled on, and if no node satisfies that requirement, the pod will not be scheduled. This tracks with some of the error messages you are getting. The older version of the app might have lower requirements.
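To make that a bit more concrete, here is a rough Python sketch of the two checks involved: whether the pod tolerates the node’s taints, and whether its resource requests fit in what the node has free. This is a simplified model, not the real scheduler. The class names, field names and numbers are all hypothetical, and real tolerations also match on operator, value and effect, not only on the key.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str  # "NoSchedule", "PreferNoSchedule" or "NoExecute"


@dataclass
class Node:
    name: str
    free_memory_mib: int
    free_ephemeral_mib: int
    taints: list = field(default_factory=list)


@dataclass
class Pod:
    name: str
    request_memory_mib: int = 0
    request_ephemeral_mib: int = 0
    tolerated_keys: set = field(default_factory=set)  # simplification: match by key only


def blocking_reasons(pod, node):
    """List the reasons why `pod` cannot be placed on `node` (empty list = it fits)."""
    reasons = []
    for taint in node.taints:
        # PreferNoSchedule is only a soft preference, so it never blocks.
        if taint.effect in ("NoSchedule", "NoExecute") and taint.key not in pod.tolerated_keys:
            reasons.append(f"untolerated taint {taint.key}")
    if pod.request_memory_mib > node.free_memory_mib:
        reasons.append("insufficient memory")
    if pod.request_ephemeral_mib > node.free_ephemeral_mib:
        reasons.append("insufficient ephemeral-storage")
    return reasons


# A node under disk pressure carries the node.kubernetes.io/disk-pressure taint.
node = Node("worker-1", free_memory_mib=4096, free_ephemeral_mib=500,
            taints=[Taint("node.kubernetes.io/disk-pressure", "NoSchedule")])
pod = Pod("lms", request_memory_mib=2048, request_ephemeral_mib=1024)
print(blocking_reasons(pod, node))
```

If every node returns at least one reason, the pod stays in Pending, and `kubectl describe pod` shows the per-node reasons in its events.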

Hi Maxim. Sorry for the late response.

It is strange to me too. I guess some of the issues we are facing have to do with available resources, but it is working in a less powerful environment (staging).

I am checking the declarative files, but I don’t see higher resource requests or limits in the newer version.
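In case it is useful to anyone doing the same comparison, this is roughly how the check could be automated in Python. It is only a sketch: it assumes the two manifests have already been loaded into dictionaries (for example with a YAML parser), it only looks at `resources.requests` of Deployment-style objects, and it handles only the common Kubernetes quantity suffixes. The function names are my own.

```python
from fractions import Fraction

# Common Kubernetes quantity suffixes (binary, decimal and milli).
_MULTIPLIERS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "m": Fraction(1, 1000),
}


def parse_quantity(text):
    """Convert a quantity such as '512Mi' or '500m' into an exact number."""
    text = str(text).strip()
    # Try two-letter suffixes first so that 'Mi' is not mistaken for another suffix.
    for suffix in sorted(_MULTIPLIERS, key=len, reverse=True):
        if text.endswith(suffix):
            return Fraction(text[: -len(suffix)]) * _MULTIPLIERS[suffix]
    return Fraction(text)


def container_requests(workload):
    """Map container name -> {resource name: requested amount}."""
    containers = workload["spec"]["template"]["spec"]["containers"]
    return {
        c["name"]: {
            res: parse_quantity(qty)
            for res, qty in c.get("resources", {}).get("requests", {}).items()
        }
        for c in containers
    }


def increased_requests(old, new):
    """Return (container, resource, before, after) for every request that grew."""
    old_req, new_req = container_requests(old), container_requests(new)
    changes = []
    for name, resources in new_req.items():
        for res, amount in resources.items():
            before = old_req.get(name, {}).get(res, Fraction(0))
            if amount > before:
                changes.append((name, res, before, amount))
    return changes
```

An empty result only means the manifests themselves did not raise their requests. It does not account for defaults injected by the cluster or for extra disk taken by newly pulled images.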

We’ll keep on searching.

Thanks for your response
regards
