Attacking the Monolith by Extracting a Core

Disclaimer: This is just a thought I had recently, and is not any kind of official plan.

The edx-platform repo has a lot of stuff in it, and most folks share the belief that it should be simplified. We have various documents and talks where we outline the need for an “extensible core” model. Every so often, we manage to peel one or two bits of edx-platform out into their own repos, and sometimes new pieces of functionality are started outside of edx-platform entirely and pulled into edx-platform as a pip-installed requirement (like edx-when).

But edx-platform is enormous, DEPRs have a long cycle time, and this process has stretched on for years with no clear end in sight. We all want a sane, simplified core set of apps to extend. The question I had in my mind was: Does that subset have to be edx-platform? For instance, could we leave edx-platform as kind of a messy place, but extract out a few key pieces into an openedx-content-core that edx-platform also installs?

Some parts of the content code have tendrils everywhere, but many of them have minimal dependencies that could be shifted. In a number of places, we use the publish step to translate some obscure and complex XBlock edge cases into mostly boring Django data models that are much easier to work with and reason about.

For instance, with the learning_sequences app (intended to give you course outlines and other sequence metadata), the app itself defines an explicit set of data types in its data.py, and its api package gives a public set of methods with type annotations. But code in contentstore is responsible for extracting XBlock data from the modulestore and translating it into the simpler models in learning_sequences–catching cases like a sequence appearing in more than one section, or scanning the units of a sequence to determine whether enrollment track partition group settings should bubble up.
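To make that publish-time translation concrete, here is a minimal sketch. Every name in it (`SequenceData`, `SectionData`, `OutlineData`, `build_outline`, and the shape of `raw_sections`) is hypothetical, and it uses stdlib dataclasses; it is not the actual learning_sequences code, only an illustration of turning messy block data into simple, validated data types.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SequenceData:
    usage_key: str
    title: str


@dataclass(frozen=True)
class SectionData:
    usage_key: str
    title: str
    sequences: tuple = ()


@dataclass
class OutlineData:
    course_key: str
    sections: list = field(default_factory=list)
    errors: list = field(default_factory=list)


def build_outline(course_key, raw_sections):
    """Translate raw section/sequence dicts into a simple outline.

    raw_sections stands in for whatever the publish step reads from the
    modulestore (the dict shape is an assumption for this sketch).
    """
    outline = OutlineData(course_key=course_key)
    seen = set()
    for raw in raw_sections:
        kept = []
        for seq in raw["sequences"]:
            if seq["key"] in seen:
                # Edge case: a sequence appearing in more than one section.
                # Keep the first occurrence and record an error for the rest.
                outline.errors.append(
                    f"{seq['key']} appears in more than one section"
                )
                continue
            seen.add(seq["key"])
            kept.append(SequenceData(seq["key"], seq["title"]))
        outline.sections.append(
            SectionData(raw["key"], raw["title"], tuple(kept))
        )
    return outline
```

The point of the split is that callers of the outline only ever see the boring data types, while the edge-case handling stays in the translation step.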

So what if we could extract out a set of apps with a simplified model of the world, and leave edx-platform as a kind of messy translation layer? Say we created an openedx-content-core repo that started with one or two of these, but gradually grew to contain the whole set:

  • Course Overviews
  • Course Settings
  • User Roles (course/library student/staff/etc.)
  • Learning Sequences
  • Scheduling
  • Unit Composition (what blocks are in this unit for a given user?)
  • User Partition Groups
  • Grades?

And say we made sure that as each piece was extracted, it was given a well documented public interface with the conventions specified in OEP-49. No static assets, no UI, maybe not even any REST APIs. Just the base building blocks for some of the core pieces of our platform, so that extensions are free to interact with them via in-process APIs.

If we had such a repo, then people developing extensions could import that in the same way that edx-platform would. An extension would pin versions that it supports, and tox would hopefully find the breaking error when some API call unexpectedly goes away with the latest update. When combined with the new work happening in openedx-events and openedx-hooks, could extensions be mostly developed without having to run edx-platform as a whole?
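One way an extension could catch a disappearing API call is a small contract test that tox runs against each pinned version. The sketch below is an assumption about how that might look: `check_api_contract` is a hypothetical helper, and the module it inspects would be whatever public `api` module the new repo ends up exposing.

```python
import inspect


def check_api_contract(module, expected):
    """Return a list of problems found; an empty list means all is well.

    expected maps function names to the parameter names the extension
    passes when it calls them.
    """
    problems = []
    for name, params in expected.items():
        func = getattr(module, name, None)
        if func is None:
            problems.append(f"missing function: {name}")
            continue
        actual = inspect.signature(func).parameters
        for param in params:
            if param not in actual:
                problems.append(f"{name}() no longer accepts {param!r}")
    return problems
```

An extension's test suite would call this once with the handful of functions it depends on and assert that the result is empty, so a breaking release fails fast instead of at runtime.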


I am definitely supportive of this idea. What is a core though? What you had in that list, why would that be a single core?
As I imagine, those features you listed can arguably be their own libraries or even IDAs, right? Why can’t we just extract those one by one into their own repos, instead of a core set of functionalities?
That said, I do recognize things below are content related and may make sense to be together:

  • Learning Sequences
  • Scheduling
  • Unit Composition

The following are what constitute a course, which is a special kind of content context:

  • Course Overview
  • Course Settings
  • User Partition Groups
  • User Roles

All I am saying is, instead of a single core, we could have multiple functional sub-groups of features that together form a largely coherent repository.
This way, edx-platform is nothing but a glorified HA account service.

For grading, I say that should be its own IDA and service (not gonna be a micro one at that)

To add to your last bolded question, the problem in my opinion is how to emulate the behaviors of users (learners or course teams) interacting with these core repos. That is something I believe we could use more guidance on.

I agree on both counts:

  • taking action and bringing to fruition our thought exercises to-date on monolith and system boundaries (we are garnering organizational support on this).

  • extensible core does not imply a single core

Core versus extension

On the 2nd point, extensions and core are reference points in a distributed system. Referring to our architectural vision diagram for an extensible platform, we’re designing pluggability frameworks that allow extensions to any microservice and any microfrontend.

Taking Proctoring as a concrete example, Proctoring is a type of Exam that can extend the core Learning experience. From this vantage point, Proctoring is an extension to the core Courseware. On the other hand, from the perspective of Proctoring itself, Proctoring’s core framework supports multiple 3rd party Proctoring providers as extensions to its core API. So a single Proctoring component is both an “extension” to one reference point and “core” to another reference point.

(Perhaps this was obvious, but I wanted to clarify since I myself may have conflated DDD’s definition of “core” with the “core” we really mean when we say “extensible core platform”.)

Avoid entity-based services

As the Content theme pursues this path of exploration, I encourage you to refresh your memories of the Architecture Manifesto’s guidance on Bounded Contexts. In particular, avoid starting from the data models as a point of extraction. Instead, separate along the lines of “jobs-to-be-done” in order to avoid the anti-pattern of entity-based services.

For example, I would classify Course Overviews, Course Settings, and User Partition Groups as data models and storage mechanisms that may (or may not) come along for the ride, or may be duplicated across multiple services once edx-platform is broken apart. I do not consider them first-class named citizens among services, since they are data models and entities used to serve higher-order jobs.

If it would be helpful, we can facilitate an event-storming exercise specific to the Content theme’s domain to get a better sense of the “jobs” owned by the theme.

Yeah, that’s totally fair. I throw around the word “core” a little too casually. That’s why I sort of hand-waved it with the name “openedx-content-core”. I could see edx-platform broken apart into previously sketched bounded contexts, so that it looks more like:

  • User / Account / Roles
  • Authoring
  • Learning

The list I gave above is mostly on the Learning side of this. The reason I was focusing on extracting one such repo out was because I was fumbling around for a practical first cut that would yield an actual tangible benefit (e.g. accelerated extension development).

For some of these, I think that extracting them out to individual repos imposes more overhead than it’s worth. There’s some fixed cost that we incur on a per-repo basis: permissions, PyPI, OEP compliance tracking, CI setup, etc. That’s not worth it for some things that will never be used in isolation. Also, I feel like if we’re going to lift out a sane subset for the purposes of making extensions easier to build, then it should have a very straightforward versioning story. People shouldn’t have to worry about a dozen individual repos with separate versions that may have funny interactions with other versions.

It’s a balancing act, to be sure. Individual repos might still be spun off where things are unstable or ownership boundaries are strongly needed.

Iteration on edx-platform is slowed by a bunch of things. A byzantine mess of static asset compilation slows down the builds. Modulestore makes tests take 10X longer than they should. The list goes on. But at the same time, I don’t want to overcorrect towards full granularity either. Taking out modulestore usage, all the apps I listed above would have a test suite that runs in seconds. There’d be no static asset compilation step.

But it all comes back to how useful this would actually be to developing extensions. There needs to be a user-centric driver to prioritize what gets pulled out and what doesn’t in this model, and the lens I’m thinking of is “what removes the need to spin up edx-platform?”

I listed these things because in my mind, they’re low level building blocks that need to be pulled out in order to pull out some of the other things (e.g. you can’t have something that does unit composition without user partition groups–the implementation of mapping users to groups may be pluggable, but the data models to answer “what group is this user in for this partition?” should be brought along).

Right. I haven’t forgotten the arch manifesto dictates, particularly in terms of data duplication–things like User Partition Group configuration data and Course Settings currently canonically live in the modulestore, and I’m proposing these have their own self-contained models. I’m proposing a new facility to handle Course Settings, and it’s not just a raw data model–pulling config values for a course is a really common use case, and sometimes we’ll want that to come from the content directly, sometimes it’s a derived value from multiple settings, and sometimes we want the site operators to set overrides at a system level. CourseApps would probably fall into this category as well. I think that 90% of what Course Overviews does is better handled by something that handles settings specifically. The other major jobs of Course Overviews are to give the local view of discovery-style catalog information and to just have a table to hang foreign keys off of when it comes to ensuring your course key actually exists.
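A rough sketch of what such a settings facility could look like follows. The class name `CourseSettings`, its constructor arguments, and the precedence order (operator override first, then content, then derived values) are all assumptions made for illustration, not a settled design.

```python
class CourseSettings:
    """Resolve a course setting from several layered sources.

    Assumed precedence: system-level operator override, then the value
    published with the content, then a value derived from other settings.
    """

    def __init__(self, content_values, system_overrides=None, derived=None):
        self._content = dict(content_values)
        self._overrides = dict(system_overrides or {})
        # Maps a setting name to a function that computes it from self.
        self._derived = dict(derived or {})

    def get(self, name, default=None):
        if name in self._overrides:
            return self._overrides[name]
        if name in self._content:
            return self._content[name]
        if name in self._derived:
            return self._derived[name](self)
        return default
```

Callers ask for a setting by name and never need to know which layer answered, which is what makes this more than a raw data model.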

I think it’s a useful exercise, but I also don’t think that’s really the question I’m looking to answer right now. Or rather, it might be better as a second pass. I’ve been rambling a lot (my apologies), but I’ll try to summarize here:

Pain point I’m trying to address: edx-platform slows development of extensions.

Hypothesis/consensus: This is because edx-platform is too big, complex, slow, and poorly documented.

My interpretation of our approach to date: Gradually remove things from edx-platform and start development of new features outside of edx-platform, until what remains inside edx-platform is “core”.

Alternative suggestion: Shift the focal point of defining the core out of edx-platform itself. Instead of winnowing non-core elements from edx-platform, gradually extract the core elements from edx-platform into a new repo. Have extension developers work with that repo’s smaller subset instead. Have edx-platform import and make calls to the new thing.

Tactics: Pull out apps one by one. Make their data models independent of modulestore (e.g. shift to push data in during publish in certain places where it currently lazily loads). Combine or refactor apps as necessary (e.g. separation of settings management from catalog info in the Course Overviews app). Adapt them to fit our Python internal app and documentation conventions, but don’t radically change the interfaces if it makes adoption difficult. The new repo’s apps should have no runtime dependencies on anything in edx-platform for basic operation (though edx-platform would start defining some things as plugins for new-repo extension points, like extensions do).

Hypothetical Example: If the set of apps that I hand-waved were pulled out, we could have a management command that would just create a course locally, with no XBlock or studio dependencies, basically instantly (based on some template YAML file or something), and populate it with “lorem ipsum”-style dummy HTML for Unit rendering. You could start up a Django server that would offer all the APIs that the baseline version of the Courseware MFE would need to do its work. This would all be done from a Django project that imports this new openedx-learning-core lib, and not require edx-platform as a whole. I think that should be enough of a framing to develop CourseApps like inline discussions that don’t really care that much about all the crazy business rules of edx-platform.
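As a sketch of the core of that hypothetical command: the function below takes an already-parsed template (the dict a YAML loader would return) and fills every unit with placeholder HTML. The template shape, the function name `create_course_from_template`, and the returned structure are all invented for illustration; a real version would write to the new repo's models from inside a Django management command.

```python
LOREM = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."


def create_course_from_template(template):
    """Build an in-memory course with dummy HTML for every unit.

    Assumed template shape:
        {"key": ..., "title": ..., "sequences": [{"title": ..., "units": N}]}
    """
    course = {
        "key": template["key"],
        "title": template["title"],
        "sequences": [],
    }
    for seq_index, seq in enumerate(template["sequences"], start=1):
        units = [
            {
                "title": f"Unit {seq_index}.{unit_index}",
                "html": f"<p>{LOREM}</p>",
            }
            for unit_index in range(1, seq["units"] + 1)
        ]
        course["sequences"].append({"title": seq["title"], "units": units})
    return course
```

Nothing here touches XBlocks or Studio, which is the property that would make course creation close to instant.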

What I’m looking for: I’m trying to figure out how to get to that accelerated extension development as quickly as possible, and test my assumptions. If we do all this and people still have to spin up edx-platform for day to day iteration on their plugin, then this kind of separation only slows development by adding one more thing. That’s why I want to start from the extension developer point of view and figure out what jobs need doing, by examining what’s already being done. Once we have a mapping of what’s being used and for what purpose, I feel like having a storming session where we categorize these into buckets makes a lot of sense. If that means this is three repos instead of one, that’s fine by me.

What I’m wary of: We’ve done large scale mapping exercises in the past, and they’ve been valuable for sketching out the broad boundaries. But I want to stay tightly focused on the user needs, and I expect that our potential extension points follow some power law distribution of usefulness. Having a separate space to pull these core pieces into means that we don’t have to make the decision up front of sifting through all the things–we can go through the most important functionality needed by our developer users and elevate them one by one.


I’ve been poking at this idea a bit more and trying to incorporate @nimisha’s feedback about avoiding entity-based services. My latest stab at the list of apps looks something like:

  • navigation: Sequence and unit metadata, outlines, etc. How do you get to the particular piece of content that you need to learn from?
  • policy: Content-related settings as they apply to the site as a whole, organizations, and individual courses.
  • publishing: Centralized list of Learning Contexts (e.g. Courses, Libraries), their published versions, and the various content-related errors and warnings associated with them.
  • unit_composition: What permutation of a single unit does this user see? This would handle things like A/B tests, problem banks, etc. There should be multiple backends for what would render the composed unit.
  • utils/partitioning: Lower level support library for user partition groups.
  • utils/scheduling: Lower level support library for scheduling information (fold in much of edx-when).

@dave I like this idea. Have you had any more thoughts on this?

Am I right that the focus of these apps ^ is on “content”, and it would support both Studio and the LMS? i.e. both “authoring” and “learning”?

Where does “enrollment” fit into this picture? Because “navigation” and “unit_composition” (what units does a particular learner see) depend on policy (enrollment is required to view, or cohorts, etc.) which depend on knowing actual enrollment/cohort data, right? Or do these navigation/unit_composition/etc. apps provide a more abstract data set (and events), and it’s up to some later app in the LMS to convert that to actual per-user content outline/visibility?

With the inline discussion example, how would user profiles/auth come into play during development? Would one still need to run edx-platform to test the frontend components? (which I assume would be a frontend plugin for the courseware MFE, which consumes the LMS’s REST API?)

Do you have any examples of extensions that could be developed using only this core set of apps? I think that would really help understanding and clarifying this.

Some ideas, big and small:

  • I want to notify instructors [or a mailing list] whenever changes to a course are published
  • I want to use an adaptive learning service to customize course content for each learner
  • I want a button next to each XBlock that will show its OLX source to instructors
  • I want to use H5P as an alternative to XBlocks

What occurs to me when trying to think of examples is that the capability of extensions so developed is going to be very limited until there is a good solution to frontend pluggability too.


I think this is something that’s owned by LMS but can be pushed to via Studio. As in, they’re self-contained enough where they have their own in-process API interfaces, and the thing pushing data into it could be Studio running things in process, or potentially something else picking data off a message queue. So more “learning” than “authoring”, come to think on it–openedx-learning-core is probably a more appropriate name.

I have this whole separate rant about “enrollment” because I think that it’s an amalgamation of a number of different concepts in different systems that share the same name–and that the LMS should restrict itself to thinking about enrollments as a Student Role for a course, but that’s an entirely different thread…

Enrollments are one of a number of concepts whose logic is both surprisingly complex and deeply ingrained in edx-platform. In the past, I feel like we’ve tried to pull some of these out, only to find that it’s actually six different things with tendrils everywhere. My thought with things like this is to have plugin interfaces that go the other way from what we usually do–where the core framework logic is in apps in this new repo, and the little plugin objects are created in edx-platform (and optionally elsewhere as well).

So to use a concrete example, say we migrate the learning_sequences app in edx-platform today to become part of the navigation app in this new repo. The navigation app will then have the concept of OutlineProcessors–an object interface for different concerns that have to modify the set of things you can see or access in a course outline. The navigation app would have the logic for running OutlineProcessors, reading those values from a list defined in Django settings. The EnrollmentOutlineProcessor would be defined in edx-platform (and likely have imports to a bunch of things also in edx-platform), and then be specified in the settings file.

By doing this, we can keep some of the crazy logic and dependencies in edx-platform–it would allow us to move things more incrementally, without taking huge risks.
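The OutlineProcessor arrangement described above could be sketched roughly as follows. The interface (`hidden_sequences`), the helper names, and the toy enrollment check are all assumptions for illustration; the sketch loads classes with stdlib `importlib`, where a Django app would more likely use `django.utils.module_loading.import_string` on a list from settings.

```python
import importlib


class OutlineProcessor:
    """Interface: given a user and sequence keys, return the keys to hide."""

    def hidden_sequences(self, user, sequence_keys):
        return set()


def load_processors(dotted_paths):
    """Instantiate processors from 'package.module.ClassName' strings."""
    processors = []
    for path in dotted_paths:
        module_name, _, class_name = path.rpartition(".")
        module = importlib.import_module(module_name)
        processors.append(getattr(module, class_name)())
    return processors


def visible_sequences(user, sequence_keys, processors):
    """Run every processor and drop whatever any of them hides."""
    hidden = set()
    for processor in processors:
        hidden |= processor.hidden_sequences(user, sequence_keys)
    return [key for key in sequence_keys if key not in hidden]


class EnrollmentOutlineProcessor(OutlineProcessor):
    """Would live in edx-platform; here the user is just a dict."""

    def hidden_sequences(self, user, sequence_keys):
        # Toy rule: non-enrolled users see nothing.
        return set() if user.get("enrolled") else set(sequence_keys)
```

With an empty processor list the outline still comes back whole, which is what would let the new repo run without any edx-platform plugins installed.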

Extraction + Extension = Re-platforming?

Going back to:

If this works out the way I hope it will, what we’d end up with in this new repo is something you’d use to build an LMS on top of. The long term goal would be to have an extracted core learning platform that people can innovate on top of. The migration path would be a gradual extraction of pieces that could accelerate plugin development, providing obvious value with each incremental addition/extraction.

Open edX has so much stuff in it today, and so many assumptions about how the learning experience should be structured–course hierarchy, grading policies, enrollments, etc. As an end-user product, maybe that’s fine. But I think we need to be something more to be a platform that truly pushes things forward in terms of learning innovation.

The LabXchange use case has really affected my thinking in this. It’s a great site that uses Open edX for its core content interactions, but the user experience is almost entirely different–no courses, no enrollments, nothing at a high level that you would expect if you were familiar with edX. You folks at OpenCraft managed to pull that trick off because you have deep expertise in the platform, but what would it take to make building something like that simple and straightforward? How many of those different systems could co-exist on the same site, even?

Suppose in the new repo, Courses don’t exist as a thing. We have a LearningContext model, with some identifiers and basic metadata. But something like edx-platform wants Courses, so it makes its own model and makes it a 1:1 relation with LearningContext. Something like LabXchange defines Pathways and Clusters instead.
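In plain Python, that relationship might be sketched like this. The field names and the `titles` helper are hypothetical, and dataclasses stand in for what would be Django models in practice (where the link from `Course` to `LearningContext` would presumably be a `OneToOneField`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LearningContext:
    """Lives in the new repo: identifiers and basic metadata only."""
    key: str
    title: str


@dataclass(frozen=True)
class Course:
    """Would live in edx-platform; exactly one per LearningContext."""
    context: LearningContext
    self_paced: bool = False


@dataclass(frozen=True)
class Pathway:
    """Would live in a LabXchange-style project instead of Course."""
    context: LearningContext
    item_keys: tuple = ()


def titles(contexts):
    """Core code only needs the LearningContext, never its wrapper."""
    return [context.title for context in contexts]
```

Because core code depends only on `LearningContext`, a site could in principle host both kinds of wrapper side by side.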


I’m fuzzy on this, but my thought was that you could exercise the core logic and REST endpoints of this new repo without having to have all the plugins of edx-platform (e.g. the EnrollmentOutlineProcessor wouldn’t be installed or invoked, but the outline still renders without it). I’m not sure this will really pan out though.

Thank you so much for these use cases. I don’t know if I have really solid answers for any of them, but some quick thoughts:

I want to notify instructors [or a mailing list] whenever changes to a course are published

This would involve listening for a publish signal (probably from openedx-events), and then querying for the appropriate Roles associated with the course’s learning context.
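A very rough sketch of that flow is below. The tiny `connect`/`send` registry is only a stand-in for a real signal framework, and the event name, the `ROLES` lookup, and the receiver signature are all made up for illustration; none of it reflects the actual openedx-events API.

```python
from collections import defaultdict

_receivers = defaultdict(list)  # stand-in for a real signal framework


def connect(event_name, receiver):
    _receivers[event_name].append(receiver)


def send(event_name, **payload):
    """Call every receiver for the event and collect their results."""
    return [receiver(**payload) for receiver in _receivers[event_name]]


# Hypothetical role data: learning context key -> {role: [emails]}
ROLES = {
    "course-v1:Demo+101+2021": {
        "instructor": ["ada@example.com"],
        "staff": ["grace@example.com"],
    },
}


def notify_instructors(context_key, version, **kwargs):
    """Receiver: build one message per instructor of the published context."""
    recipients = ROLES.get(context_key, {}).get("instructor", [])
    return [
        (email, f"{context_key} was published (version {version})")
        for email in recipients
    ]


connect("content_published", notify_instructors)
```

The extension itself is just the receiver plus the role query; everything else would be provided by the platform.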

I want to use an adaptive learning service to customize course content for each learner

This should be doable by making a new type of navigation. There is some mapping between learning contexts and their navigation, with some Navigation interface defined. Probably one or two are defined internally, but others are allowed for in pluggable ways.

I want a button next to each XBlock that will show its OLX source to instructors

I think that’s something local to the XBlock unit rendering runtime, which would be outside the scope of this. Or it could be that the unit_composition piece has enough data to provide this information, in which case this would have to be surfaced as a REST API.

I want to use H5P as an alternative to XBlocks

A new Unit type in unit_composition. Again, probably a pluggable interface. There’s a generic Unit model that defines necessary metadata, but an H5P extension could flesh out its own data model.


One weird thing about all of these is what the publishing process looks like. If navigation and units are pluggable interfaces, then something’s stitching these things together, ideally in the same transaction. I’m not sure what that looks like.

What occurs to me when trying to think of examples is that the capability of extensions so developed is going to be very limited until there is a good solution to frontend pluggability too.

Yup. (FYI @djoy)


Thanks @dave, I think it’s a great direction to go in. What other type of feedback or ideation would be helpful here?

I guess the next step for me is figuring out what piece or pieces it makes sense to try to test this out on. Something that would actually be useful to put a more stable, versioned API on top of, but wouldn’t be too difficult to extract and port over. The publishing app would be a precursor because it would contain the most basic content data models. The policy bit would be relatively straightforward to implement, and navigation would be a straightforward port of learning_sequences. I can port over things in edx-platform to point to these new apps, but it’s not clear to me that anyone would use these independently.

Any feedback you and others can provide on this aspect would be greatly appreciated.

I can port over things in edx-platform to point to these new apps, but it’s not clear to me that anyone would use these independently.

Not sure if this is what you’re going for @dave, as it’s still in the edx-platform context… but there are plenty of edx-platform-plugin-esque repos like edx-enterprise, which are installed into edx-platform but also have unfortunate imports back to edx-platform features such as course_overviews. Knowing how common it is to want course structure metadata, I wouldn’t be surprised if there existed an edx-platform-plugin repo that wants or will soon want to use learning_sequences. If we were to pull some form of learning_sequences into an openedx-learning-core, then perhaps we’d be preventing someone from writing from openedx.core.djangoapps.content import learning_sequences into a non-edx-platform repository.

I know @schenedx’s team has made use of learning_sequences already to support a piece of special exams (?) functionality. Simon, do you know if your team has edx-platform plugins that would benefit from being able to properly use a tool like learning_sequences?

I’ll start with a proof-of-concept that brings over CourseOverview-like policy information. It’s a nice place to start in that it’s a useful piece with a fairly limited scope, and it doesn’t require bringing over user role information. I think the hardest thing is adding the underlying publishing infrastructure code—I have some ideas around conventions for stitching the overall content publishing process together, but they’re not fully baked.

@kmccormick I want to clarify. My team’s addition to Learning Sequence was not on special exams. We added processing of Enrollment track to the Learning Sequence API. More importantly, we added a “bubble up” feature, so that if all of the units in a sequence should be hidden because of the enrollment track, the whole sequence containing them is hidden as well.
There are no other plugins I can think of at the moment that leverage this yet. I can think of possibilities, but those are not on my team’s current radar.
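For readers unfamiliar with the “bubble up” rule, it can be expressed in a few lines. This is only a sketch: the function name, the way access is represented (a set of allowed group names per unit, or `None` for unrestricted), and the choice to leave empty sequences visible are assumptions, not the code in learning_sequences.

```python
def sequence_is_hidden(unit_group_access, user_group):
    """Return True when every unit in the sequence excludes user_group.

    unit_group_access has one entry per unit: a set of allowed group
    names, or None when the unit has no restriction.
    """
    if not unit_group_access:
        # Assumption: an empty sequence has nothing to bubble up.
        return False
    return all(
        allowed is not None and user_group not in allowed
        for allowed in unit_group_access
    )
```

A single unrestricted or accessible unit is enough to keep the sequence visible; only when every unit is off limits does the restriction move up a level.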
