OEP-41: Asynchronous Server Event Messaging

I have recently done some work on a pull request for OEP-41: Asynchronous Server Event Messaging. It’s incomplete, but there is enough there that I would appreciate early feedback from any brave souls who are interested in the topic.

What is it?

This OEP describes the conventions Open edX should use for asynchronous event messaging across services. These events would be emitted when a service has taken some action on data that it owns, action that other services would be interested in knowing about. This could be a course publish, a user enrollment, an update to catalog metadata, etc.

What is it NOT?

This OEP does not cover:

  • Remote Procedure Calls, i.e. using messaging to send commands to other
    services. Events in this document are the services broadcasting, “Here is what
    I just did!” to interested parties, not “Please do this thing for me!”
  • Real-time streaming of student learning events to external platforms, as
    detailed in OEP-26, though it is possible that
    the implementation of OEP-26 might build on the infrastructure outlined in
    this document.
  • Browser-level notifications and event streaming along the lines of
    Server-Sent Events, GraphQL subscriptions, or frameworks built on top of
    WebSocket. The interactions here are strictly server-to-server interactions
    between services serving the same site. Again, it is entirely possible that
    an implementation of browser-level notifications would be built on top of the
    infrastructure outlined here.

Why now?

Open edX services already use Django signals for subscribing to events within the same service, but we have no accepted recommendation for how to do something similar across service boundaries. As the number of services in Open edX has grown, the various ad hoc synchronization and communication methods we’ve implemented over the years have caused numerous operational and performance problems (particularly around catalog data). At the same time, recent discussions around Domain Driven Design and data replication have given us a better framing for the kind of relationship we want to have between various services. There has also been a strong desire for a properly documented set of integration APIs that would allow Open edX to better interoperate with other applications in the ecosystem.
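To make the gap concrete, the in-service pattern we already rely on looks roughly like the sketch below — written in plain Python (rather than importing Django) so it runs standalone, with illustrative names like `course_published` that are assumptions for this example, not actual Open edX signals. The key limitation is visible in the code: every receiver runs synchronously in the sender's own process, so nothing about this mechanism can reach a separate service with its own database.

```python
# A minimal in-process publish/subscribe sketch, analogous to a Django
# signal: receivers connect to a named signal and run synchronously in
# the same process when it is sent.
class Signal:
    def __init__(self):
        self._receivers = []

    def connect(self, receiver):
        self._receivers.append(receiver)

    def send(self, sender, **kwargs):
        # Every receiver runs right here, in this process. There is no
        # way for a *different* service to subscribe to this signal.
        return [(r, r(sender=sender, **kwargs)) for r in self._receivers]

course_published = Signal()  # hypothetical signal name

def update_search_index(sender, course_key, **kwargs):
    # A receiver that reacts to the event within the same service.
    return f"reindexed {course_key}"

course_published.connect(update_search_index)
results = course_published.send(sender="cms", course_key="course-v1:edX+DemoX+Demo")
```

The OEP is essentially asking: what is the equivalent of `send()` when the interested receiver lives in another service entirely?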

There is an opportunity to bring all these elements together and create a solid foundation for simple, reliable, well documented, near real-time cross-service message passing. But without some planning and design work, we may end up in an even more chaotic place where each pairing of services uses different conventions for this kind of communication.

Okay, but why is this being posted right now, at 2:30AM your local time on a weeknight?

Due to recent world events, edX is seeing a surge in usage. I am glad that we can provide a useful service for our students during these times, but this recent increase in traffic has also stressed our software and caused a number of performance-related issues. Some of my colleagues and I have spent a good chunk of the past few weeks frantically tracking these issues down and squashing them, so that edx.org can continue to stay up and serve our learners. (Side note: Juniper should noticeably reduce database load, for those of you who run at scale and closely monitor that sort of thing.)

One of the worst offenders in terms of performance is the ad hoc way in which we synchronize some types of data across our services, particularly between course-discovery and edx-platform. We keep reaching for a message-queue-based solution when one of these problems arises, but abandoning the idea because we have no infrastructure or established patterns there. So instead of sending single events through a message queue, we end up just tacking on one more thing to some periodically running cron job or other, which I can then bitterly stare at in New Relic as a saw-tooth of despair cutting through my response-time and load graphs. Thousands of requests, sparking off every N minutes and naively querying everything because we don't know what's changed.

Today, I needed to fix a performance issue where, if a certain feature is enabled, certain course-level access checks in the LMS can make a blocking call to course-discovery (if you’re not familiar with how often we do access checks in edx-platform, just trust me that this is bad). How can I avoid that? Replicate the necessary data from course-discovery to edx-platform when it changes. How am I going to do that? Clearly, the pragmatic thing is to add one more thing to the list of batch data-copying tasks, because we don’t have message queuing.

At that point, I got up, went to my kitchen, and poured myself another cup of coffee. I then came back to my desk and, in a fit of rage, proceeded to work more on this OEP.

But doing message queuing infrastructure is going to be a lot of work, right?

No, and yes. The proposal as it currently stands advocates that we use a stack that is already installed pretty much everywhere because we use it for celery – namely the kombu library that celery is built on top of, and redis as the message broker. We’d be using CloudEvents as an envelope, which seems perfectly reasonable and doesn’t require us to reinvent that particular wheel.
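As a rough illustration of how little new machinery this stack implies, here is a sketch of building a CloudEvents 1.0 envelope and publishing it through kombu to a redis broker. To be clear about assumptions: the exchange name, event type, source URI, and payload below are all invented for this example; the OEP does not yet define them. Only the envelope attribute names (`specversion`, `type`, `source`, `id`, `time`, `data`) come from the CloudEvents spec.

```python
import json
import uuid
from datetime import datetime, timezone

def make_cloudevent(event_type, source, data):
    """Wrap a payload in a minimal CloudEvents 1.0 envelope."""
    return {
        "specversion": "1.0",
        "type": event_type,       # e.g. "org.openedx.course.published" (hypothetical)
        "source": source,         # URI identifying the emitting service
        "id": str(uuid.uuid4()),  # unique per event
        "time": datetime.now(timezone.utc).isoformat(),
        "datacontenttype": "application/json",
        "data": data,
    }

def publish(event, broker_url="redis://localhost:6379/0"):
    """Publish an event via kombu to a fanout exchange (names are illustrative)."""
    # Imported here so the envelope helper above stays stdlib-only.
    from kombu import Connection, Exchange

    exchange = Exchange("openedx.events", type="fanout")  # hypothetical exchange
    with Connection(broker_url) as conn:
        producer = conn.Producer(serializer="json")
        producer.publish(event, exchange=exchange, declare=[exchange])

event = make_cloudevent(
    event_type="org.openedx.course.published",    # hypothetical event type
    source="/services/studio",                    # hypothetical source URI
    data={"course_key": "course-v1:edX+DemoX+Demo"},
)
print(json.dumps(event, indent=2))
# publish(event)  # uncomment with a running redis broker
```

The point of the sketch is that the moving parts are small: a dict for the envelope, and a library (kombu) and broker (redis) that most deployments already run for celery.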

So I think that getting started is going to be easier than most people think. It’s the long term costs that are more expensive. We don’t have a lot of operational experience with using messages in this way. We don’t have testing and documentation figured out (I have ideas, but I need to prove them out, and they haven’t made it into the OEP yet). And getting these fundamentals wrong could cause a lot of pain down the road.

But I think we’ve already crossed an inflection point in cross-service communication where it makes sense to bite that bullet. Actually, I think we crossed that point years ago, and that we’ve missed a lot of opportunities and made things harder to develop because we haven’t had this fundamental capability in our stack. Developers feel this pain all the time, which is part of what led to a recent hackathon project that this OEP partly grew out of. We recently gave an architecture workshop for edX developers, and the majority of the solutions that teams proposed used a message broker like this, despite it being nowhere in our stack today.

So it’s easy to start, it won’t be easy to do well, but it’s already long overdue. There’s a quote that springs to mind:

“The best time to plant a tree was twenty years ago. The next best time is now.”

I aim to plant this tree. If you have time to look at my early cut of an OEP, I would appreciate it: https://github.com/edx/open-edx-proposals/pull/136

Thank you.