Proposal: Require Caddy (or Nginx) for LMS/Studio

Back during the 2024 Open edX conference, I gave a talk where I touched briefly on an idea for how we can more efficiently serve media assets at scale using the X-Accel-Redirect header.

This is covered in more detail in an ADR on the subject, but the short version is that instead of having Django send back a course media asset (like an image, transcript, or video) by streaming it directly as we do today, we would have our Django view write a special X-Accel-Redirect header that would point to the private location of the file. This private location could be a shared directory that Caddy has access to, or it could be a signed URL to an object in S3-like storage. Either way, Caddy will take over at that point and send the file from the private location to the user, instead of tying down the Python worker process for that purpose.
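To make the handoff concrete, here is a minimal sketch of what the Django side might compute. It is not the implementation from the ADR: the `/_private_assets/` prefix, the function name, and the `user_may_view` flag are all hypothetical, and a real view would copy the returned headers onto an empty `HttpResponse` rather than return a tuple.

```python
from urllib.parse import quote

# Hypothetical internal-only location that the reverse proxy would map to the
# private asset directory (or to a signed object-storage URL). Not from the ADR.
INTERNAL_PREFIX = "/_private_assets/"


def accel_redirect_headers(asset_path, content_type, user_may_view):
    """Return (status, headers) for a media request; the body is always empty."""
    # The permission check still happens in Python, before any handoff.
    if not user_may_view:
        return 403, {}

    # Drop empty and "." segments, and refuse anything that tries to climb
    # out of the private directory.
    parts = [p for p in asset_path.split("/") if p not in ("", ".")]
    if not parts or ".." in parts:
        return 404, {}

    # Percent-encode so spaces and non-ASCII characters form a valid URI.
    internal_uri = INTERNAL_PREFIX + quote("/".join(parts))

    # The proxy sees this header, serves the file itself, and the Python
    # worker is free to handle the next request.
    return 200, {"X-Accel-Redirect": internal_uri, "Content-Type": content_type}
```

The point of the sketch is that Python only decides whether the user may see the file and where it lives; no file bytes pass through the worker.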

This arrangement would allow us to much more efficiently/cheaply serve media assets at scale, possibly allowing us to work with video files directly. It would also allow us to keep more course assets private and only make them available to users based on permissions checks, e.g. you could have an answer key PDF that only unlocks after a student has completed an exam.

There is, however, a catch. If we started to rely on this behavior, we would always have to run Caddy (or Nginx) as a reverse proxy in front of our LMS and Studio web processes. That is not the case today, and doing this would introduce a new level of coupling.

An alternative to this would be to have the backend code switch between using X-Accel-Redirect and sending things using the Django process, based on configuration. We could make it so that we serve things the less efficient way by default, and only use the more efficient method when we’re running a Caddy reverse proxy anyway. But this would add complexity, increase the odds that subtle bugs and incompatibilities make it out (e.g. in partial content responses), and provide a default experience that is significantly less performant.
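As a rough illustration of why the switchable approach adds surface area, here is a sketch of the two code paths it would create. The `use_accel_redirect` flag and the `DeliveryPlan` type are hypothetical names for this example only, not existing settings or classes.

```python
from dataclasses import dataclass, field


@dataclass
class DeliveryPlan:
    """Describes which layer ends up doing the work for one media response."""
    streamed_by_python: bool
    range_requests_handled_by: str
    headers: dict = field(default_factory=dict)


def plan_delivery(internal_uri, use_accel_redirect=False):
    """Pick a delivery path based on a (hypothetical) configuration flag."""
    if use_accel_redirect:
        # Efficient path: the proxy serves the bytes and answers Range requests.
        return DeliveryPlan(
            streamed_by_python=False,
            range_requests_handled_by="reverse proxy",
            headers={"X-Accel-Redirect": internal_uri},
        )
    # Default path: the Python worker streams the file and must implement
    # partial content responses itself.
    return DeliveryPlan(
        streamed_by_python=True,
        range_requests_handled_by="application code",
    )
```

Every behavior that differs between the two branches, such as partial content handling, would need to be tested in both configurations.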

I think there’s a strong case for always deploying with Caddy as a reverse proxy in Tutor anyway:

  1. It doesn’t add a new server type to the Tutor deployment stack, since we already use Caddy for other purposes.
  2. Having the reverse proxy is better for performance and reducing overall resource consumption.
  3. We are already reliant on the Caddy layer to enforce things like upload limits.

I’m not sure what all the implications of such a move would be. I know that 2U used to run with Nginx as the web proxy, though I’m not sure if that’s still the case today. I don’t know of anyone out there who is running a reverse proxy that is not Caddy or Nginx, but I’m definitely interested if anyone is doing so. I know this runs a bit at odds with the bare-metal deployment work that’s been going on, and it’s possible that we’d want to ship a reference Caddyfile with openedx-platform itself.

Anyhow, I would love to get thoughts from you folks. Thank you!

Yep, 2U continues to use an nginx reverse proxy. We’re switching to a Kubernetes-based deployment, but keeping the nginx sidecar, which handles some rate-limiting, access control (for admin pages), static file handling (see openedx/openedx-platform issue #38648, “Various static assets served from LMS webapp rather than CDN”), and other stuff.

@MoisesGonzalezS could tell you more, but during a prolonged high traffic event, eduNEXT found:

for sites with very large amount of traffic the Caddy proxy provided by Tutor introduces a significant amount of latency (changing some ingress rules to forward directly to the LMS pod cut response times in half IIRC)

So we may wish to find a way to keep this limited to certain paths, if we’re going to pursue it.

I definitely want to understand that better. From my recollection, edX basically never had a bottleneck at the nginx layer (though that was not a k8s deployment while I was there). I’m very surprised that Caddy would add a lot of latency. @MoisesGonzalezS: Do you remember the details of that?

I couldn’t dig very deeply into that issue, but I do have some theories, and I suspect it is k8s-specific. We saw a sharp drop in latency and error rate the very moment we bypassed Caddy.

The Caddy configuration used by Tutor proxies using Docker and Kubernetes service discovery (“lms:8000”). The Ingress NGINX documentation mentions the following:

The Ingress-Nginx Controller does not use Services to route traffic to the pods. Instead it uses the Endpoints API in order to bypass kube-proxy to allow NGINX features like session affinity and custom load balancing algorithms. It also removes some overhead, such as conntrack entries for iptables DNAT.

So it might not be that Caddy adds a lot of latency by itself, but rather that the current configuration is too naive for large-scale sites running on k8s.