When to make a new backend service?

There’s recently been talk about when to make something a service vs. making separate Django apps that plug into the LMS process at runtime. I thought there was specific guidance on this in the wiki somewhere, but I couldn’t find it. So I’m curious to get people’s thoughts.

Someone recently framed it as a neat thought experiment: If we were building course-discovery today, would it still be a separate service? Or would we prefer to make a separate set of apps that runs in-process in the LMS?

It was originally built as a separate service for a number of reasons:

  1. Faster Deployment Cycles
    At the time that course-discovery was created, the deployment cycle for the LMS was somewhere between one and two weeks. Missing the release window was extremely painful. In contrast, a new service that didn’t carry all of edx-platform’s deployment baggage with it could run through CI/CD and have new commits be in production in under an hour.
  2. Team Autonomy
    Having a shared service meant that you always had to worry about being delayed by another team–someone causing a bug that forces a rollback, or another team asking for a short delay in the release so they could get their one critical feature over the line and not have to wait until the following week. Having a separate service meant that your team was fully in control.
  3. Operational Independence
    In theory, having separate services gives a certain amount of resiliency, since a failure of one component doesn’t take down everything else with it. In practice, this is a mixed bag, because many synchronous API dependency calls do exist between the LMS and other services, meaning that failures cascade anyway. In addition, the added data access challenges mean that some of these calls are much more expensive than they might otherwise be (e.g. n+1 queries via REST API instead of a single SQL join; see the sketch after this list).
  4. Force Separation of Logic
    If you’re in the same process, it’s much easier to throw in a quick hack and import some internal piece you’re really not supposed to. The thought here was that physical boundaries forced developers to think about these interfaces in a more rigorous way.
  5. Different Access Restrictions
    Data in different services can vary wildly in sensitivity. Course discovery data at the time was purely catalog-related and contained no user-specific data (I’m not sure if this is still the case). That’s much less sensitive than student or financial data. As a separate service, there is less of a chance that course-discovery apps will somehow be vulnerable in a way that exposes student data.
  6. Dependency Control
    If you’re going from v2 to v3 of some big dependency that everything uses, then every repo that uses said dependency and is deployed together must be upgraded at the same time (or at least made compatible with both versions). This is a facet of Team Autonomy.
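
To make the data-access point in item 3 concrete, here’s a minimal sketch of the n+1 pattern it refers to. The endpoints and field names below are hypothetical, not real LMS APIs:

```python
# Hypothetical illustration of the n+1 problem from item 3.
# Cross-service: one extra HTTP round-trip per course run to fetch its org.
import requests

BASE = "https://lms.example.com/api"  # hypothetical base URL

runs = requests.get(f"{BASE}/course_runs/", timeout=5).json()
orgs = [
    requests.get(f"{BASE}/organizations/{run['org_id']}/", timeout=5).json()
    for run in runs  # N extra network calls, each with its own latency
]

# In-process, the same data is one query:
#   SELECT course_runs.id, organizations.name
#   FROM course_runs
#   JOIN organizations ON organizations.id = course_runs.org_id;
```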

(If there are more reasons, please let me know.)

The tradeoffs have shifted since then, especially in terms of deployment speed. Services also carry a higher long term maintenance burden and significantly higher operational complexity. On the other hand, separate repos deploying into the same process have their own set of issues that we’ve run into.

I’m curious where people’s thoughts are on this. Which way would you go with course-discovery, and why?


Here is my input on this, taking course-discovery as an example.

I might be mixing up Elasticsearch with course-discovery here; nonetheless, this may be relevant info.

The thing I found a bit annoying about course-discovery is that it defaults to using Elasticsearch (ES, or similar), which is RAM-hungry: in a Tutor-based deployment it takes 2GB (heap size) by default. I think that might be overkill for a platform that only has a few courses. If we could somehow skip that and still expect the platform to work as expected, lowering the requirement by 2GB would, in the context of an AWS instance, open the possibility of roughly 50% lower AWS charges (i.e. from t2.large to t2.medium, ref:1).

Off the top of my head, if the clock were turned back to when this service was developed and I were asked my opinion:

Let’s first consider why we need an external service like ES:

When the number of courses exceeds X, and/or the total size of indexable content is larger than Y, then you are better off using an external service like ES.

However, in case the above condition is not met:

  • Treat the required feature as not needed (who needs search functionality for 10 or fewer courses?). However, going this path would imply that I can still use the platform as expected, i.e. enforcing that other apps not rely on an optional service, which is not the case today for edx-search and course-discovery.
  • Create an interface/abstraction around the external service, which would default to using an internal service, i.e. the LMS/CMS itself. In the Redis context, for example, we have Django’s default local-memory cache engine, which is supposed to be fine if you have one main instance (see the sketch below).
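
As a rough sketch of what that abstraction could look like, here is the familiar Django cache analogy alongside a hypothetical search-engine setting (the module paths below are illustrative, not the actual edx-search names):

```python
# settings.py -- swapping backends via settings, the way Django caches work.
CACHES = {
    "default": {
        # Local-memory cache: fine for a single main instance, no Redis needed.
        "BACKEND": "django.core.cache.backends.locmem.LocMemCache",
    }
}

# Hypothetical analogue for search: default to a lightweight built-in
# engine, and only switch to Elasticsearch when the catalog grows.
SEARCH_ENGINE = "search.backends.simple.DatabaseSearchEngine"  # illustrative
# SEARCH_ENGINE = "search.backends.elastic.ElasticSearchEngine"
```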

However, contradicting myself: if you choose the first option above, it would probably lead to more complex configuration/settings, e.g. what is the settings combination to use edx-search or the forum when I don’t want to use ES?

In the bigger picture, I think one way around that is probably to have the platform run in different modes/situations, each of which defaults to a different combination of settings (a sketch follows the list below):

  • Express mode (recommended when courses/learners don’t exceed X) ref:2
  • Normal mode (recommended when courses/learners exceed X)
  • Giant mode (recommended when you are running multiple instances in parallel)
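
As a very rough sketch of the idea (all names and settings here are hypothetical), each mode could just be a bundle of defaults that a deployment’s settings start from:

```python
# modes.py -- hypothetical: each mode is a preset combination of settings.
MODE_DEFAULTS = {
    "express": {"SEARCH_ENGINE": "database", "ENABLE_FORUM_SEARCH": False},
    "normal": {"SEARCH_ENGINE": "elasticsearch", "ENABLE_FORUM_SEARCH": True},
    "giant": {"SEARCH_ENGINE": "elasticsearch", "ENABLE_FORUM_SEARCH": True},
}

def apply_mode(settings_module, mode: str) -> None:
    """Copy a mode's defaults onto a settings module before local overrides."""
    for key, value in MODE_DEFAULTS[mode].items():
        setattr(settings_module, key, value)
```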

There is of course a downside to going this path, but the advantage is that whenever we want to add a new service or app, we would ask a simple question, namely what it means in each mode, instead of trying to have something that is one-size-fits-all.

ref:1 - Amazon EC2 T2 Instances – Amazon Web Services (AWS)
ref:2 - https://github.com/openedx/platform-roadmap/issues/169

“Services also carry a higher long term maintenance burden” I don’t think this is proven. Yes, you have to actually run the service. But understanding the context of the service is so much easier that it may make up for it. Today my team is dealing with a bug where upgrading some proctoring plugin’s imports breaks static asset construction for the platform. Why? We don’t know yet; we shouldn’t even be involved. We’ll spend a lot of time and effort figuring that out. If proctoring were completely separated, this waste of time could not happen. When we do figure it out, we will again waste time removing the library pin in platform and waiting for it to rebuild.

Or, a way to flip that on its head: because of the difficulties of change at scale, 2U has to split out services and build new things in separate services. We don’t really have a choice at this point.

Thanks for listing those reasons; I never really understood until now why it was made as a separate service.

Re your question, the main thing I personally think about is the flow of data.

Looking at the discovery service documentation, here is its explanation:

The distribution of edX’s data has grown over time. Any given feature on edx.org may need information from Studio, the LMS, the Ecommerce service, and/or the Drupal marketing site. Discovery is a data aggregator whose job is to collect, consolidate, and provide access to information from these services.

At one level, that seems very reasonable. And if the data flow were like this:

[diagram: data flowing one way from the source systems into Discovery]

I think it would be great.

But as I understand it, it’s really more like this in practice:

[diagram: data flowing both into and out of Discovery, with Discovery itself the source of truth for some of it]

For example, to define a program, you have to go into the Django admin of the discovery service, so it’s the original source of truth. Then the program is available via the Discovery Catalog API (which may or may not use the elasticsearch index depending on how you call it) and the LMS actually calls that API and caches it and makes it available within the LMS.
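
To illustrate the last hop in that chain, here is a minimal sketch of the fetch-and-cache pattern described above (the URL and cache key are hypothetical, and this is not the actual LMS code):

```python
# Sketch of the LMS-side pattern: fetch programs from the Discovery
# Catalog API and cache the result locally.
import requests
from django.core.cache import cache

DISCOVERY_PROGRAMS_URL = "https://discovery.example.com/api/v1/programs/"  # hypothetical

def get_programs():
    programs = cache.get("discovery.programs")
    if programs is None:
        resp = requests.get(DISCOVERY_PROGRAMS_URL, timeout=5)
        resp.raise_for_status()
        programs = resp.json()["results"]
        # An hour of potential staleness, on top of whatever lag the
        # Discovery-side Elasticsearch index already introduces.
        cache.set("discovery.programs", programs, timeout=60 * 60)
    return programs
```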

This is a complicated flow of data, with several places where things can get out of sync. It’s also a bit difficult to debug, because if you’re having some data issue, in a worst-case scenario you may have to compare the LMS modulestore, the LMS Course Metadata table, the Discovery service MySQL table, the Discovery ElasticSearch cache, and perhaps even the LMS discovery cache to figure out what’s wrong.

From a user perspective, this can also be a bit strange. If you think of “Programs” as just a way of grouping together sets of courses and selling the set at a discount, all managed by a Sales team that is unrelated to the course authoring teams, this makes sense. But if you’re an educator thinking of “Programs” as a way to structurally group courses and micro-courses to optimize learning for different target audiences, you’d expect to be able to edit programs in Studio and view program enrollment in the same manner as you view course enrollment. It doesn’t make sense that to create a new course run you use Studio, but to create a new course group (Program) you have to manually enter the URL of some entirely different website and use the admin backend to do so.

So while I feel like I understand why things are as they are today, and it’s not unreasonable, if it were me doing it today I would consider it a core responsibility to have a pluggable API for maintaining metadata about all courses, programs, mini-courses, libraries, etc., one which lives in the LMS and mostly uses foreign keys to ensure data integrity and never go out of date. But I would probably try not to include a search index as part of this core, and instead have a separate application which pulls data from the core metadata API and stores it in Elasticsearch or Typesense, to serve end-user searches only.

This is because Python doesn’t have a good way to indicate a public API vs. a private API. But there is a nice solution which can reduce the impact of this issue quite a bit. If you have any interface that makes sense to implement using plugins, you can design the plugin API so that you can’t really call any functions on the plugin directly, but only through a central plugin manager (because you won’t be able to instantiate the plugin yourself, and all of its methods are instance methods). An example of this in the platform is that you can’t call the split modulestore API directly from anywhere in the LMS or Studio code; rather, you have to do from xmodule.modulestore.django import modulestore and then use its public API, which will in turn call split. We don’t have any problems with people calling into the split code directly because, well, you can’t.
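
Here is a minimal sketch of that pattern with hypothetical names (the real platform example is modulestore() in xmodule.modulestore.django): the implementation class is module-private, all of its behavior lives in instance methods, and the only way to get an instance is through the manager function.

```python
# plugin_manager.py -- hypothetical names; mirrors the modulestore() pattern.

class _SplitBackend:
    """Private implementation. Nothing outside this module imports it."""

    def get_item(self, key):
        # All behavior is instance methods, so there is no module-level
        # function for callers to reach around the manager and call.
        ...

_instance = None

def store():
    """The single public entry point, analogous to modulestore()."""
    global _instance
    if _instance is None:
        _instance = _SplitBackend()
    return _instance
```

Callers write from plugin_manager import store and use store().get_item(...); there is simply no direct path to the internals.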

Another alternative, which is used in edx-platform, is the use of api.py files, although this is less effective.
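
For reference, the api.py convention looks roughly like this (the app and function names are hypothetical); it is less effective because nothing actually stops another app from importing myapp.models directly:

```python
# myapp/api.py -- hypothetical app; by convention, other apps import
# only from this module, never from myapp's internal modules.
from .models import Course  # internal detail, re-exported deliberately

def get_course_title(course_id: int) -> str:
    """Public, supported entry point for other apps to use."""
    return Course.objects.get(id=course_id).title
```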

This is fundamentally a design limitation of Python, which some other languages like Rust and JS do not have. It could be especially problematic in this case if, e.g., the LMS and Discovery were using different versions of the ElasticSearch driver. So I think this can be a compelling reason to create a separate service, though it also helps to find ways to keep dependencies to an absolute minimum in either case, imho.

I also think that your learning-core project will solve this, by creating a lean, mean core with few dependencies and with thoughtful boundaries around the data.


A very meaningful discussion with historical context, and I like that. The data flow to and from Discovery is confusing for sure. In my opinion, the decisions around the source of truth for marketing/catalog data were not formulated when most of the catalog information was moved to Discovery. For instance, there are remnants of About Pages in Studio (image, description, etc.), and then there are the start and end dates that are editable only in Studio. In Publisher, this is mentioned explicitly to Editors and Project Coordinators in the UI. That is one data flow. But when a course run is created by Publisher, it demands start and end dates at that point. That creates a data inconsistency issue whose resolution is attempted by a very expensive management command, refresh_course_metadata, in Discovery.
If I look at the Discovery codebase, I feel like it has been treated as a playground or testing area for catalog information. We need indexing; add ES. ES is doing its job but we need more; add Algolia but keep ES for other teams and for the Open edX community. We need a partner-editor communication flow; add Salesforce. Discovery has become the octopus that has its tentacles everywhere. It is difficult to find out why a certain decision was made. ADRs help, but they contain only a part of the context. Then if we think about deprecating/removing components, we would not know the consequences until it is live. There is always some hidden use case, be it an external tool, data warehouse, or reporting, that breaks.


The following are a variety of related questions we might ask:

  1. When should we make a new backend service?
  2. How do we ensure any new service has good boundaries?
  3. Are there special considerations for the LMS when deciding on separate services and good boundaries?
  4. If we were building course-discovery today, would it still be a separate service?

I’m unclear from your post if you are looking to answer all of these, or are focused on 3, and using 4 as a potential case-study.

This seems to imply that you think the reasons you posted aren’t as important as they once were. I may be misunderstanding what you wrote though.

I still think the reasons you documented make sense. I also like what is documented in the Architecture Manifesto (WIP). I think this manifesto presumes a micro-services architecture, but I also think we imagine/hope that we’d ultimately have the right set of services. And, if we were to determine that course-discovery (as-is) is not the right service, I don’t think that means we couldn’t get wins from a properly bounded service.

Also, I just learned about the ADR for the new Edx Exams IDA. I don’t know much about this, but because it is an attempt to shift things out of the LMS, it may also be a relevant case-study for question 3 (and 2 and 1), if that is what you are after.

Separate, but related, can we remove the “WIP” from the manifesto? Is it marked “WIP” because there is lots of debate and we are unclear if we want to commit to any of it? Or is it “WIP” because it could change over time, and we have been too afraid to remove the “WIP”? (Happy to move this to a separate topic if that makes sense.)

I guess I was looking at all of those, and hoping to use the example of course-discovery as a case study of sorts. I’m not proposing that we move course-discovery or re-implement it. But I find that conversations like “when to make something a service” can become too theoretical to be useful, and I thought having a known use case would help ground the discussion.

I think the reasons are still important, but that the gaps between the two have shifted. Faster deployment cycles were probably the dominant factor in the initial decision to implement course-discovery as its own service, at which point the comparison was 1-2 weeks vs. < 1 hour from merge to deployment. Now, it’s more like 3-4 hours vs. < 1 hour. Still a big difference, but not nearly what it was before.

Your work on monitoring and mapping views to ownership there also made a big impact on the tradeoffs we had to make in terms of accountability and routing of production issues between separate apps vs. services.

Agreed that it assumes multiple services (though I don’t really think it assumes that they’re micro-services). I think most of the principles are useful in a more app-centric world too, including things like data redundancy.

Agreed. Here’s the doc laying out the case for it as a separate service, in case anyone has thoughts. I still like the course-discovery example though, because its data dependency and operational issues are well understood at this point, while those sorts of issues may be less apparent with a new service. I definitely think it’s worth talking about exams as another example though.

And to be perfectly clear to everyone on this thread, I am not looking to re-legislate the decision to put exams into its own service. The team went through a discovery and ADR process, months have passed, and our goals for architectural autonomy under OEP-56 cannot be achieved if the teams are dragged back into the review process half a year later to defend their choices instead of doing their implementation work. But I do think that whatever the team learns will be a valuable input into other teams considering this kind of decision going forward, so if @zhancock_edx or others on that team have thoughts in the coming months on what worked and what didn’t, happy and unhappy surprises, etc.–I’d love to hear it.

Sure. I just removed it, since really nobody has been adding to it for a long while now, so “In Progress” isn’t really accurate anyhow. I’ve invited anyone who has objections to comment in the doc and we can have a separate discussion if necessary. I’m pretty sure it was just there because of inertia.

Thanks for the clarifications.

In a general sense, I still think that most of the benefits, especially those around team autonomy, remain very real for separate services. But this does rely on a good design with proper boundaries, which will always be a case-by-case question.