Auto-suggest course content on search (Meilisearch-compatible)

@braden Your implementation caught my attention. I am considering extending it to include Meilisearch support for LMS content. If you foresee any potential issues, please let me know. Otherwise, I plan to proceed with developing a minimal viable product based on my current ideas.


@braden I have recorded a quick video to show you how I am thinking of implementing this. Please let me know how you feel about this feature.

@braden I have created draft PRs, and I am sharing them here on the thread to discuss the approach I have taken to implement this feature. Please give it a look and let me know if you think it should be implemented in any other way.

Backend: edx-search
Frontend: frontend-app-learning

@qasimgulzar Could you please post your question in a new thread, and ping me there? I’m happy to give you feedback, but I want to keep this thread focused on feedback about using Meilisearch in production and evaluation of the new Studio search feature that we’re launching with Redwood.

I’ve taken the liberty to move this sub-conversation to a new topic :slight_smile:


Perfect, thank you @regis!

@qasimgulzar Thanks for looking into this! Here is some early feedback:

  1. I would recommend putting the new code not in edx-search (which has a rather messy API) but in the new content/search Django app. It’s a lot simpler that way, and we can avoid an unnecessary abstraction layer.

  2. Your demo is using the same Studio index as the Studio Search feature, but the Studio search index has draft content. So you will need to create and maintain a separate index of published content for use in the LMS.

  3. You will have to think carefully about permissions, because the rules for what content learners can see (and search) are much more complex than the rules for what instructors can search for in Studio. I think this will actually be the most challenging piece; everything else is fairly straightforward.

  4. The “auto-complete” style dropdown you’re showing is not very useful - one cannot tell the difference between the three results in the video. You should include a preview of the content of each result too.
    [Screenshot showing several rows of results that are identical]

  5. We have purposefully held off on implementing LMS search until we get more feedback from people using Meilisearch in production. So even if you implement this now, people may ask you to wait before merging anything. But I still encourage you to develop this :slight_smile:

I can’t agree with this strongly enough. The root problem is that the inputs for whether someone is allowed to see a piece of courseware content in the LMS are both complex and opaque. You basically call has_access on every little piece of content and trust the byzantine logic in there to figure things out. Those inputs include things like:

  • Any special user roles (staff, beta tester, etc.)
  • Enrollment and enrollment track.
  • Dates on content, including inherited dates.
  • Overrides on dates/deadlines.
  • Cohort membership.
  • A/B test group, randomized problem group.
  • Subsection pre-reqs.
  • Whether something has been hidden from students.
  • Whether a deadline has passed + configuration.
  • Exam settings.

And that’s literally just stuff off the top of my head. We had a courseware content search system in the past, and it spent the vast, vast majority of its processing time running has_access checks on the returned results to see whether this user was allowed to see it. edX only ever turned the feature on for course staff (where those checks are simple)–the performance of the system was deemed too risky to turn on for students.
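To make the cost concrete, that old pattern looked roughly like the sketch below: search first, then filter, with one access check per hit. This is an illustration only, not the actual historical code; the raw-results shape is hypothetical, and has_access is edx-platform’s helper as I understand it.

```python
# Illustrative sketch only: shows why per-result access checks are expensive.
# The raw_results shape is hypothetical; has_access is edx-platform's helper.
from lms.djangoapps.courseware.access import has_access
from xmodule.modulestore.django import modulestore


def filter_search_results(user, course_key, raw_results):
    """Drop any search hit the user is not allowed to see (one check per hit)."""
    store = modulestore()
    visible = []
    for hit in raw_results:  # hypothetical: each hit carries the block's UsageKey
        block = store.get_item(hit["usage_key"])
        # has_access re-evaluates roles, enrollment track, dates, cohorts,
        # partitions, prereqs, exam settings, ... for every single block.
        if has_access(user, "load", block, course_key):
            visible.append(hit)
    return visible
```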

There are underlying systems here that could be leveraged. A/B test groups, enrollment tracks, and randomized problems from libraries all share an underlying representation in the user partitioning system. But I think the design for this would require grabbing all the user information that affects the permutations of course content that they see, translating that to a list of parameters that are captured on the content (e.g. the content groups), and passing that through as search parameters for each user for a specific time frame. And that’s a lot of careful work.
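To make that concrete, here is a minimal sketch of the “translate user state into search parameters” idea, assuming each indexed document carries a content_groups field (empty meaning visible to everyone) and that we already have the user’s partition group IDs. The field name and the group-lookup step are assumptions, and the filter handling is simplified.

```python
# Minimal sketch, not production code. Assumes a "content_groups" field on each
# document (declared as a filterable attribute) listing the partition groups
# allowed to see it; an empty list means the content is visible to everyone.
import meilisearch


def search_course_content(client: meilisearch.Client, index_name, query, user_group_ids):
    """Query Meilisearch, restricting hits to content the user's groups can see."""
    group_clauses = " OR ".join(f"content_groups = {gid}" for gid in user_group_ids)
    # Unrestricted content OR content restricted to one of the user's groups.
    filter_expr = "content_groups IS EMPTY"
    if group_clauses:
        filter_expr += f" OR {group_clauses}"
    return client.index(index_name).search(query, {"filter": filter_expr})
```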

Thanks for the excellent summary @dave.

I think it might make sense to develop the next version of course search in such a way that it covers most of the “regular” types of course content and simply excludes altogether anything with complex access rules.

That is, the search index would contain:

  • Any content that’s been published, and is past its release date, and is not: randomized, A/B tested, part of an exam, hidden from students, cohorted, hidden based on prereqs, staff-only, etc.
  • Any content that is visible only to staff or specific cohorts (these cases involving broad groups of users are easy to account for)

And anything else (e.g. exam content, randomized content, A/B tested content) would simply be “unsearchable” in the LMS. At least for the MVP version.

(Later on, one could build a per-user “supplementary index” for such content and join it into the search, perhaps only for the few users who enable a “complete search mode” for the course… but I’m not sure it’s worth the resources it would take.)
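As a rough illustration of that MVP rule, the indexer could apply a single predicate at publish time. The attribute names below are typical XBlock fields but should be treated as assumptions, and distinguishing cohort-style group_access (indexed and tagged) from randomized / A/B-test group_access (excluded) would need a lookup of the partition scheme, which is omitted here.

```python
# Hedged sketch of the MVP indexing rule; attribute names need verification.
from datetime import datetime, timezone


def is_lms_searchable(block):
    """Decide whether a published block belongs in the LMS search index at all."""
    if getattr(block, "is_time_limited", False):
        return False  # exam content: "unsearchable" in the MVP
    start = getattr(block, "start", None)
    if start and start > datetime.now(timezone.utc):
        return False  # not past its release date yet
    # Staff-only and cohorted content IS indexed, but its restriction would be
    # recorded on the document so it can be filtered at query time.
    return True
```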

As you add further search functionality, it might be useful to have some sort of UX in the CMS outline and component edit views that indicates whether content will be searchable. I think it will be hard for everyone to reason about why some things are searchable and others aren’t if they’re not already familiar with the innards of the system.


Great idea. If we went with that sort of approach, it would probably make sense to display it to learners too - e.g. some icon indicating “personalized content - this is excluded from course search”. Otherwise they would also be confused about why some content is found and other content is not.

I recommend integrating Meilisearch using search backends. I understand the goal of removing edx-search, but when adding a new search engine to the platform, we must ensure the integration doesn’t hard-code any specific search engine into the codebase. Since we’re moving away from Elasticsearch, and given the rapidly evolving nature of search engines, we might adopt another search engine in the future. Additionally, as the edX platform is used by various self-hosted clients, different users may have different preferences.

Considering these factors, I have created a PR to implement a search backend. I will also work on building a common API interface, as suggested by @braden, to avoid abstraction issues and handle content permissions. However, this will take some time. (I have sketched the rough shape of the backend at the end of this post.)

Here is the PR

I have also added django_meili to the course-discovery service to index data into Meilisearch.

Here is the PR
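For discussion, here is roughly the shape such a backend could take against edx-search’s engine abstraction. The SearchEngine method names follow that interface as I understand it, so treat the exact signatures as assumptions; the settings names and the filter-translation helper are placeholders.

```python
# Rough skeleton only, not a working implementation. Verify the base-class
# signatures against edx-search (search/search_engine_base.py); the settings
# names and the _to_meili_filter helper are placeholders.
import meilisearch
from django.conf import settings
from search.search_engine_base import SearchEngine


def _to_meili_filter(field_dictionary, filter_dictionary):
    """Placeholder translation of edx-search's dicts into a Meilisearch filter string."""
    fields = {**(field_dictionary or {}), **(filter_dictionary or {})}
    return " AND ".join(f'{name} = "{value}"' for name, value in fields.items()) or None


class MeilisearchEngine(SearchEngine):
    """edx-search backend that delegates to a Meilisearch index."""

    def __init__(self, index=None):
        super().__init__(index=index)
        self._client = meilisearch.Client(settings.MEILISEARCH_URL, settings.MEILISEARCH_API_KEY)
        self._meili_index = self._client.index(self.index_name)

    def index(self, sources, **kwargs):
        # edx-search passes a list of dict documents; Meilisearch wants a primary key.
        self._meili_index.add_documents(sources, primary_key="id")

    def remove(self, doc_ids, **kwargs):
        self._meili_index.delete_documents(doc_ids)

    def search(self, query_string=None, field_dictionary=None, filter_dictionary=None, **kwargs):
        # Translating edx-search's Elasticsearch-flavoured dictionaries into a
        # Meilisearch filter expression is where most of the complexity lives.
        filter_expr = _to_meili_filter(field_dictionary, filter_dictionary)
        opt_params = {"filter": filter_expr} if filter_expr else {}
        return self._meili_index.search(query_string or "", opt_params)
```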

Thank you so much for making those PRs! I think that a number of us have been really distracted by the conference this week, and that’s delayed replies to this. A lot of folks are going to be taking vacation time right after the conference as well, but I should be able to look over these sometime early next week.

I wanted to provide additional insights based on my review of the implementations and comments on Slack. As we aim to minimize the edx-platform codebase, it is crucial to avoid integrating external services directly into edx-platform. Removing the layer of abstraction and tightly coupling Meilisearch with edx-platform poses a significant challenge for its implementation.

Had we used edx-search consistently within other services like course-discovery, notes, and forums, implementing a new backend for Meilisearch would have been straightforward. As it stands, however, we will need to make changes to course-discovery, notes, and forums.

Furthermore, eliminating the “edx-search” layer of abstraction offers no substantial benefits. When fetching data from search engines in the edx-platform backend, we already validate/authenticate users and index data using management commands.

While the personalized access token feature is indeed attractive, removing edx-search is not directly related to this functionality. We should have a more in-depth discussion to determine whether removing edx-search is necessary or if we should leverage edx-search across other services.

If we can extend edx-search in a way that actually supports different search backends properly (i.e. not a bunch of if statements littering the other repos), then that would seem like a decent migration path for existing search functionality that already uses Elasticsearch. So in that scenario, Elasticsearch doesn’t actually go away, per se, as long as someone’s interested in maintaining that code–even if the platform default becomes Meilisearch.
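Concretely, the hook that keeps engine-specific code out of the other repos is the SEARCH_ENGINE setting plus edx-search’s engine factory; the Meilisearch dotted path below is illustrative rather than a final module name.

```python
# Django settings sketch: callers use the generic factory rather than importing
# an engine class, so swapping engines is a configuration change, not an edit
# to every repo. The Meilisearch path is illustrative.
SEARCH_ENGINE = "search.meilisearch.MeilisearchEngine"
# SEARCH_ENGINE = "search.elastic.ElasticSearchEngine"  # existing Elasticsearch backend

# At a call site (factory name as in edx-search, to be verified):
#   searcher = SearchEngine.get_search_engine(index="courseware_content")
```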

I’d worry about maintaining this over time, and how we ensure that new features don’t break under other search engines. The course-discovery PR you put up has an explicit set of classes defined when Meilisearch is being used–which is understandable since those definitions have to exist somewhere, but I’m not clear if we have the ability to adequately test and maintain the separate backends over time. How would you see that working?

I think it is related in the sense that we could decide that we only want to support backends with that sort of frontend, live-search capability (e.g. Meilisearch, Algolia, Typesense), which would significantly change how we build those features going forward. Yes, it doesn’t mean we’d have to kill edx-search, but it would affect how deep changes might go.

@qasimgulzar: Maybe it makes sense to take this conversation to the other thread evaluating Meilisearch?

This approach is definitely feasible, and I fully support it. Regarding the concern about maintaining the repository, I am willing to share this responsibility.

The question about testing different backends is indeed complex. Traditionally, we mock integrations. However, if you believe we need to implement end-to-end (E2E) tests for each backend, we can set up a sandbox environment. I am open to any suggestions on this matter.
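One hedged option that sits between pure mocks and full E2E sandboxes is a shared contract test suite run against every configured backend, with the real services started in CI. The dotted engine paths and the index/search API shapes below are placeholders to illustrate the idea, not exact edx-search signatures.

```python
# Sketch of a backend "contract test": the same behavioural assertions run
# against each engine, with real Elasticsearch/Meilisearch services in CI.
# Engine paths and result shapes are placeholders, not exact signatures.
import pytest
from django.test import override_settings
from search.search_engine_base import SearchEngine

ENGINES = [
    "search.elastic.ElasticSearchEngine",      # existing backend
    "search.meilisearch.MeilisearchEngine",    # proposed backend (path illustrative)
]


@pytest.mark.parametrize("engine_path", ENGINES)
def test_index_then_search_roundtrip(engine_path):
    """Every backend must return a document it was just asked to index."""
    with override_settings(SEARCH_ENGINE=engine_path):
        engine = SearchEngine.get_search_engine(index="contract_test_index")
        engine.index([{"id": "1", "content": {"display_name": "Binary search trees"}}])
        results = engine.search(query_string="binary")
        assert any(hit["data"]["id"] == "1" for hit in results["results"])
```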

:white_check_mark: I am totally fine with the approach of implementing Meilisearch (and/or Algolia) support for edx-search as a short-term solution for removing Elasticsearch from our stack. It’s my hope that we would only officially support Meilisearch, unless someone is really keen on maintaining the Elasticsearch backend. If people like @qasimgulzar want to take on this work and enable us to get rid of Elasticsearch before the Sumac release, so much the better.

:x: However, I am opposed to building any new functionality on edx-search, or to maintaining it for the long term. In my opinion, the amount of useful code in edx-search is so small that you might as well copy-paste it into a new repo and start from scratch, with a better design.

  1. First of all, edx-search doesn’t support the personalized access token feature. As we have seen, without this feature the courseware search puts too much load on edxapp’s Django workers, and as a result it has never been possible for edx.org to turn on courseware search. I believe this feature is critically important, because it forces us to design the search index in a way that enables instant results (directly from the search engine to the browser) and puts no load on edxapp workers. That requirement is a good forcing function that encourages better design. (A rough sketch of this token pattern is included at the end of this post.)
  2. Just look at the edx-search API - it’s terrible:
    a. course_discovery_filter_fields() - something specific to course-discovery. This shouldn’t be in a generic/abstract library.
    b. course_discovery_aggregations - same
    c. NoSearchEngineError - this class makes sense but note it has only 1 line of code
    d. perform_search - this function sounds generic but is hard-coded to search the LMS courseware index (published content).
    e. emit_api_timing_event - again hard-coded function specific to LMS courseware search
    f. course_discovery_search - something specific to course-discovery. This shouldn’t be in a generic/abstract library.
    g. And that’s the entire functional API, though api.py does also expose SearchEngine, SearchFilterGenerator, and SearchResultProcessor.
  3. edx-search is a very leaky abstraction, so you’ll often see code like this and this, which breaks if you change the backend to anything other than Elasticsearch.
  4. What API it does have is basically “do what Elasticsearch does”, so when @qasimgulzar implemented the Meilisearch backend, it required a lot of translating ES syntax into Meilisearch/Algolia syntax. If we drop support for ES and only support Meili/Algolia in the future, this just makes the code more complex than it needs to be and harder to debug. It would be better if our abstraction layer used either a more generic syntax or the Meili/Algolia syntax as the primary way of specifying filters, aggregations, etc.
  5. edx-search doesn’t have good APIs for defining an index, which is why @qasimgulzar’s PR had to hard-code courseware-specific index settings into the abstract search engine in edx-search.
  6. Several other problems with edx-search are listed here.

Fixing most of these will require changing the API so much that it will not be recognizable as “edx-search” anymore, which is why I’m calling for a replacement rather than fixing it in place.
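To illustrate point 1 above, the personalized access token pattern with Meilisearch tenant tokens looks roughly like this: a small Django view issues a short-lived, per-user token whose search rules bake in the user’s access restrictions, and the browser then queries Meilisearch directly. The settings names, the content_groups filter field, the index name, and the group-lookup helper are all placeholders.

```python
# Hedged sketch of the "personalized access token" pattern. After this view
# returns, search traffic goes straight from the browser to Meilisearch, so
# no per-query load lands on edxapp workers. Names marked below are placeholders.
from datetime import datetime, timedelta, timezone

import meilisearch
from django.conf import settings
from django.http import JsonResponse


def get_user_content_group_ids(user):
    """Placeholder: look up the user's partition group IDs (cohort, track, ...)."""
    return []


def get_search_token(request):
    """Return a restricted Meilisearch tenant token for the current user."""
    client = meilisearch.Client(settings.MEILISEARCH_URL, settings.MEILISEARCH_API_KEY)
    group_ids = get_user_content_group_ids(request.user)
    group_filter = f" OR content_groups IN {list(group_ids)}" if group_ids else ""
    token = client.generate_tenant_token(
        api_key_uid=settings.MEILISEARCH_API_KEY_UID,  # placeholder setting
        search_rules={
            "course_content": {  # index name: placeholder
                # Only unrestricted content, or content in one of the user's groups.
                "filter": "content_groups IS EMPTY" + group_filter,
            }
        },
        expires_at=datetime.now(timezone.utc) + timedelta(days=1),
    )
    return JsonResponse({"token": token})
```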


@braden , I have initialized a new abstraction package for search, aiming to keep it as generic and configurable as possible. I believe your expertise would be valuable in designing this package. Here is the repo for your reference. I look forward to your input and collaboration.

@qasimgulzar I am back from my recent vacation and will take a look when I can. In the meantime, perhaps you can post some high-level details (maybe an ADR) so others can weigh in on this approach?