Hello
We are having problems upgrading our Open edX deployment on Kubernetes from v16.1.5 to v18.2.2.
On the one hand, we ran into several blockers on the node side (screenshots in the links):
- Evicted pods: throughout the upgrade we kept getting pods in Evicted status. While that status in itself is acceptable, recreation of the pods was unstable (pods ending in Error or ContainerStatusUnknown, pods that stayed up only for a few minutes, etc.)
- Node condition DiskPressure: pods were evicted because the node they were scheduled on was reporting DiskPressure
- Low ephemeral storage: pods were evicted and killed because the node ran out of ephemeral-storage
- Untolerated taint: pods could not be scheduled onto any node (FailedScheduling) because every node carried an untolerated taint, either {node.kubernetes.io/disk-pressure} or {node-role.kubernetes.io/control-plane}
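For context, this is the kind of check we ran to tally the evictions. The namespace and pod names below are placeholder sample output, not our real workloads; in the live cluster the input would come from `kubectl get pods -A` instead of the heredoc:

```shell
# Count Evicted pods per namespace. Against a real cluster you would pipe in:
#   kubectl get pods -A --field-selector=status.phase=Failed
# Here we feed sample output (placeholder names) so the filter is reproducible.
awk 'NR > 1 && $4 == "Evicted" { count[$1]++ }
     END { for (ns in count) print ns, count[ns] }' <<'EOF'
NAMESPACE   NAME          READY   STATUS                   RESTARTS   AGE
openedx     lms-abc12     0/1     Evicted                  0          5m
openedx     cms-def34     0/1     Evicted                  0          3m
openedx     worker-gh56   0/1     ContainerStatusUnknown   0          2m
EOF
# → openedx 2
```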
On the other hand, we cannot rule out that the problem is on the application side:
- all these errors appeared when upgrading the application from version 16.x to version 18.x
- the upgrade involved removing all version 16.x pods first (the two versions never coexisted)
- version 18.x works correctly in staging…
- …and when we saw we could not solve the problem within the established deadline, we rolled production back from version 18.x to 16.x, which worked correctly: none of the errors above reappeared.
The latter surprises me: if there is no disk space for the new version, why is there disk space for the old one? Why do the staging YAMLs not work in production? Some caching system, maybe? I haven't found any information applicable to this issue…
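One mitigation we are considering, in case it is relevant to an answer: declaring explicit ephemeral-storage requests and limits so the scheduler keeps pods off nodes without enough free disk, and so evictions become predictable. This is a hypothetical sketch only; the pod name and image tag below are placeholders, not our real manifests:

```yaml
# Hypothetical example: explicit ephemeral-storage requests/limits.
apiVersion: v1
kind: Pod
metadata:
  name: lms-example        # placeholder name
spec:
  containers:
    - name: lms
      image: overhangio/openedx:18.2.2   # assumed tag, not verified
      resources:
        requests:
          ephemeral-storage: "2Gi"   # scheduler avoids nodes with less free disk
        limits:
          ephemeral-storage: "4Gi"   # kubelet evicts the pod if it exceeds this
```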
It is difficult for me to rule out either "front of responsibility", application or infrastructure. Any help is welcome.
Best regards and thanks in advance