Evaluating Meilisearch

Review

In the thread where we first discussed trying out Meilisearch (Is Meilisearch a viable upgrade alternative to OpenSearch?), I wrote the following:

With another followup message:

Now that Redwood is set to release on Monday, and others seem to be interested in using Meilisearch for more things, now is as good a time as any to have that discussion.

As far as I know, the big drawback that’s been identified so far (by @blarghmatey) is lack of high availability (HA) support outside of their cloud hosted service:

@braden covers some of the nuances of Meilisearch and HA support here:

As far as I’ve seen, the HA situation remains unchanged since it was last discussed in that thread.


Next Steps

@braden (and anyone else who has used Meilisearch, on platform or off): Could you please summarize your development experience with Meilisearch?

Site operators: Is anyone planning to turn this feature on when Redwood releases? Have you been testing with it already? Please comment here with your operational experiences, or let us know if you plan to run it soon and are willing to provide feedback then. If you have thoughts around high availability, please add those as well.

Thanks folks.

For more on the rationale to experiment with Meilisearch, please see the architecture decision record for it.

Hey!

I used Meilisearch a while back when developing a product catalog for a website I was working on. My implementation used Strapi as the integrating CMS and DigitalOcean to host the Meilisearch instance. Everything seemed quite straightforward: their documentation is pretty good, and their user interface for testing is quite helpful.

I really liked their typo tolerance, searchable fields and field priority options, though I was not a fan of the complex nature of their filtering calls. Overall, great choice to add to the platform and I look forward to the implementation.

I’ve been following the progress from Braden on the Meilisearch front and we have been having a few discussions in the large instances WG.

As an experiment I decided to run the indexing of the courses in one of our larger installations. We have around 40k courses, so I started testing. That instance is running on an old version of Open edX (Nutmeg), so I had to hack together a backport of Braden’s original PR.

Initially I noticed that I couldn’t start the indexing in our instance. We identified the problem, and Braden landed the fix recently.

After that I just let the job run. The whole thing takes several days to go through all 40k courses (it still hasn’t finished). So far it has indexed around 800k blocks for ~8k courses.

I just performed a few queries in the Meilisearch dashboard. My first impression is that it is really snappy: queries take about 5 ms to perform. Resource usage is at around 1,600 MB of memory with close to no CPU usage (this is an environment without traffic).

My take is that it seems better overall in comparison with ES. The footprint is minimal, and even when filling up it doesn’t seem too crazy. I deployed it in a k8s cluster with a single pod, mostly copying what Braden has in the tutor-contrib-meilisearch plugin.

I’m not that concerned about the HA problem. We don’t actually deploy that many ES clusters, even for large instances, mostly because search isn’t really on the critical path of the overall experience, so downtime there isn’t as terrible as it would be for Redis, for example (no broker for Celery).

I don’t know if we are going to enable it in Redwood, not so much because of Meilisearch but because we first want to get familiar with the new search feature itself.

As a developer, I have found Meilisearch extremely easy to work with. It has great documentation, and I really like how the usage examples with the official Python and JavaScript clients are integrated into the documentation, so you don’t have to learn them separately.

Generally it has been painless to work with, and most things I tried “just worked” out of the box. The API was relatively straightforward, and both indexing and searching were easy to implement.

What’s more, they seem to be developing it quite actively. Between the time when we first started integrating Meilisearch and the Redwood release, they released two new versions including a very nice feature that we wanted (negative keyword search).

On the frontend, we first integrated it using Instantsearch, which worked really well. But things got a little more complicated when we needed to implement filtering by [multiple] hierarchical tags, including a keyword search field to refine the list of hierarchical tags. It turns out that neither Instantsearch nor Meilisearch supports this (<HierarchicalMenu> does not allow multiple selections, and facet search supports neither a hierarchy nor keywords that occur in the middle or end of a tag value).

So we had to replace the Instantsearch widgets on the frontend with a custom UI and a custom “search manager” built using React Query. This was actually not too difficult, and I’m happy with the result, which doesn’t have much more code but allowed us to remove Instantsearch as a dependency (it turns out that React Query provides a lot of the functionality we were getting from Instantsearch).

Then I had to figure out how to use the Meilisearch APIs to achieve the functionality we need, even though it doesn’t technically support that use case. The approach we ended up with is a bit of a hack, though it should work for most cases. (Now, I found out this week that Meilisearch is getting a new feature, distinct attributes at search time; once that is implemented, I believe the “hacky” solution will actually work correctly in all cases. I’ve been corresponding with the Meilisearch team about this in the GitHub discussion “Hierarchical Facet Search” (meilisearch, Discussion #735).)
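To make the hierarchical-tag problem a bit more concrete, here is a minimal sketch of one common way to model it. It assumes each tag path is flattened into per-level fields; the `tags.level0` / `tags.level1` field names and the ` > ` separator are illustrative assumptions, not necessarily what the Studio search code does. Multiple selections are combined using Meilisearch's array filter form, where the entries of an inner list are OR'd together.

```python
from collections import defaultdict

SEP = " > "  # assumed separator between hierarchy levels


def tag_levels(tag_paths):
    """Flatten tag paths such as ["Science", "Physics"] into per-level fields.

    Every ancestor is stored too, so a document tagged "Science > Physics"
    also carries level0 == "Science" and matches a filter on the parent.
    """
    levels = defaultdict(set)
    for path in tag_paths:
        for depth in range(len(path)):
            levels[f"level{depth}"].add(SEP.join(path[: depth + 1]))
    return {name: sorted(values) for name, values in sorted(levels.items())}


def multi_select_filter(selected_paths, field="tags"):
    """Build a filter matching documents that carry ANY of the selected tags."""
    clauses = []
    for path in selected_paths:
        # Backslash-escaping of double quotes is assumed to be sufficient here.
        value = SEP.join(path).replace('"', '\\"')
        clauses.append(f'{field}.level{len(path) - 1} = "{value}"')
    # A nested list is interpreted as OR in the array filter syntax.
    return [clauses]
```

The document side would store the output of `tag_levels()` under a filterable `tags` attribute, and the result of `multi_select_filter()` would be passed as the `filter` parameter of a search request. This does not address keyword refinement of the tag list itself, which is the part that needed the workaround described above.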

The other downsides I’m aware of:

  • Users: No boolean operators for keywords (x AND y), but it does support "exact phrase" and -negative keyword search. Since we have those operators and pretty advanced filtering, I don’t think the lack of boolean keywords is a big deal.
  • Operators: The upgrade process for major versions is a bit annoying. Well, it’s easy if you don’t mind deleting and rebuilding your whole index, such as on devstack, but on production where you probably need a faster solution, it requires you to create a dump, upgrade, then import the dump manually. This is the sort of thing that could be automated by Tutor’s upgrade workflow, though, like we already do for MongoDB etc.
  • Operators: Still no true HA support, though it may come in the future.
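As an illustration of the keyword operators mentioned in the first bullet, the sketch below builds a request for Meilisearch's `POST /indexes/{index_uid}/search` endpoint using only the standard library. The helper name, the index name, and the example filter are hypothetical; negative keywords also require a Meilisearch version recent enough to include that feature.

```python
import json
import urllib.request


def build_search_request(base_url, index_uid, api_key, phrase=None,
                         include=(), exclude=(), filters=None, limit=20):
    """Build (but do not send) a Meilisearch search request."""
    q_parts = []
    if phrase:
        q_parts.append(f'"{phrase}"')                # "exact phrase" search
    q_parts.extend(include)                          # ordinary keywords
    q_parts.extend(f"-{word}" for word in exclude)   # -negative keywords
    body = {"q": " ".join(q_parts), "limit": limit}
    if filters:
        body["filter"] = filters                     # filter expression, passed through as-is
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/indexes/{index_uid}/search",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# Example (hypothetical index name and key); pass the result to
# urllib.request.urlopen() to actually run the query:
request = build_search_request(
    "http://localhost:7700", "studio_content", "search-key",
    phrase="linear algebra", include=["intro"], exclude=["draft"],
)
```

Since there is no `AND`/`OR` between keywords, anything more structured than this has to be expressed through the `filter` parameter instead.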

Out of curiosity, was it Meilisearch that was the limiting factor here, or was it the modulestore querying to collect the data to feed into Meilisearch? Several days isn’t the end of the world, but we could probably get a lot of speedup with more parallelism on celery workers if modulestore is the bottleneck–and that info might be relevant to others with large migrations to consider.

Most of the time is spent processing the data; the actual indexing isn’t sweating.

I think parallelizing would be a great way to improve times.

Yup, our current code for the initial index is a very basic single-threaded single-worker all-courses approach, and it waits until the job has completed before it makes the index available. I’m sure there’s a ton of room for improvement.

In particular, I’m thinking we should add the ability to just create (and immediately start using) a blank index, and then queue a bunch of tasks (one for each course?) to index each course. These tasks would then get executed in parallel by however many celery workers are available.
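A rough sketch of that fan-out idea follows. It uses a standard-library thread pool purely as a stand-in for Celery workers, and `index_course` is a hypothetical callable that would collect one course's blocks and push them to the (already created, initially empty) index. In edx-platform this would presumably be one Celery task per course instead.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def index_all_courses(course_keys, index_course, max_workers=4):
    """Run one indexing job per course in parallel.

    Returns (done, failed): the sorted list of course keys that were indexed,
    and a dict mapping each failed course key to the exception it raised.
    One course failing does not stop the others.
    """
    done, failed = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(index_course, key): key for key in course_keys}
        for future in as_completed(futures):
            key = futures[future]
            try:
                future.result()
                done.append(key)
            except Exception as exc:  # keep going; report failures at the end
                failed[key] = exc
    return sorted(done), failed
```

Because each course is an independent unit of work, failed courses can simply be re-queued later, and the index is searchable (if incomplete) while the remaining jobs are still running.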

It’s also worth pointing out that as far as I know, almost nobody has tried this out yet; Redwood hasn’t even technically released. So I’m hoping we’ll get more detailed feedback once more operators have upgraded to Redwood and elected to try it out.

Just wanted to mention here that there is a separate conversation to run Meilisearch for course search in the LMS: Auto-suggest course content on search (Meilisearch-compatible)

Yeah, good point. I think that means that we need to stretch out the transition to this by another release. Right now it’s off by default in Redwood and not required. I think we had hoped to have it just be on in Sumac with no toggle, but it probably makes sense to at least have the option to toggle it off in Sumac.

@braden (or anyone who’s familiar): How hard would it be to make our search implementation pluggable to use either Meilisearch or Algolia via configuration?

I ask because it seems like Meilisearch is built to the same basic outward features and architecture, and Algolia is established enough that it should alleviate concerns around HA and probably provide a better upgrade experience over time. It’s also popular enough that many organizations might already be using it as part of their stack (e.g. 2U runs it for their catalog browsing experience).

As you said, Meilisearch and Algolia intentionally have pretty similar APIs, design, and overall architecture, so it would definitely be a lot easier than, say, supporting Meilisearch and Elasticsearch.

Nevertheless, we’d have to write our own small abstraction layers on both the backend and the frontend. This is required on the backend because I’m not aware of any existing python abstraction layer that supports these two. And it would be required on the frontend for the same reason I mentioned earlier in this thread - because although there is a very nice frontend abstraction framework (Instantsearch), it doesn’t support key features we need (multi-select hierarchical filters, keyword refinement of hierarchical filters). (And for the record, I first tried extending Instantsearch to support our needs, but found the code was too abstruse; it was literally faster to write a replacement for everything we used from Instantsearch than to extend it to support this one feature.)
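For a sense of scale, a backend abstraction of this kind could be as small as the sketch below. All names here (`SearchBackend`, `InMemoryBackend`, `reindex_block`, the `studio_content` index) are hypothetical and not taken from edx-platform; the in-memory class only stands in for real Meilisearch or Algolia adapters so the shape of the interface is visible.

```python
from typing import Any, Protocol


class SearchBackend(Protocol):
    """The minimal operations the indexing code would need from any engine."""

    def upsert_documents(self, index: str, docs: list[dict[str, Any]]) -> None: ...
    def delete_documents(self, index: str, ids: list[str]) -> None: ...
    def search(self, index: str, query: str, limit: int = 20) -> list[dict[str, Any]]: ...


class InMemoryBackend:
    """Toy implementation for tests; a real adapter would wrap an engine's client."""

    def __init__(self) -> None:
        self._indexes: dict[str, dict[str, dict[str, Any]]] = {}

    def upsert_documents(self, index: str, docs: list[dict[str, Any]]) -> None:
        store = self._indexes.setdefault(index, {})
        for doc in docs:
            store[doc["id"]] = doc  # same id overwrites: add-or-replace semantics

    def delete_documents(self, index: str, ids: list[str]) -> None:
        store = self._indexes.get(index, {})
        for doc_id in ids:
            store.pop(doc_id, None)

    def search(self, index: str, query: str, limit: int = 20) -> list[dict[str, Any]]:
        # Naive case-insensitive substring match over all field values.
        needle = query.lower()
        hits = [
            doc for doc in self._indexes.get(index, {}).values()
            if any(needle in str(value).lower() for value in doc.values())
        ]
        return hits[:limit]


def reindex_block(backend: SearchBackend, block: dict[str, Any]) -> None:
    """Engine-agnostic indexing logic depends only on the protocol."""
    backend.upsert_documents("studio_content", [block])
```

The interface itself is the easy part; the cost is in everything it leaves out (index settings, filter syntax, access tokens), which is where the two engines would need engine-specific code and testing.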

And then of course, supporting two search engines means more complexity in testing, more work in resolving bugs that only affect one or the other, etc.

So: I think that would be a reasonable path forward if adopting Meilisearch is a no-go for a stakeholder like 2U. But I’d definitely prefer just supporting Meilisearch for simplicity if that’s feasible.

That’s really disappointing. I had hoped that it wouldn’t be such a big lift to support both simultaneously.

To be clear: my understanding is that 2U currently has no stance on whether they’d want to use Algolia, and does not have interest in maintaining any such plugin. I’m curious if there are any other Algolia users out there who definitely would want to use it with Open edX.

I do think that there’s a lot of value in leaving the door open for other search engines of this style (Algolia, Typesense), both as a hedge against needing to move off someday and as a way to offer the choice of a more robust (albeit commercial) alternative for folks who are more risk averse. But yeah, I can’t justify doing extra work to make sure that extension point works well if it’s costly to do so and there’s literally nobody interested in running alternatives.

So I guess I await other comments on this thread. :stuck_out_tongue:

Do you think there are things we should be doing at the moment to better contain Meilisearch on the front and back end, to make an extension point easier to build later on?

Based on your comment, it seems we need integration/abstraction at both the backend and frontend, likely because we need to index data to Meilisearch or Algolia. This makes sense, but it will require maintenance effort to ensure our custom backend connector works properly for both Algolia and Meilisearch.

I’d like to propose a few options that could alleviate some of the backend abstraction responsibilities:

  1. Meilisync: We can use the Meilisync utility to sync data from our database to Meilisearch. Since this utility is open-source and maintained by the community, we can rely on its stable version without worrying about keeping our indexers updated.
  2. Algolia MySQL Connector: Algolia offers a MySQL Connector that we can use directly from the Algolia dashboard. By using this integration, we can delegate the indexing responsibility to the Algolia connector.
  3. Typesense: Typesense supports Airbyte, a no-code tool for data syncing. Additionally, it supports Maxwell as per their documentation. These options can be used to sync data to Typesense effectively.

By leveraging these available options, we can reduce our concerns significantly and focus on maintaining an abstraction package on the frontend for instant search features.

Meilisync does not support Algolia so I don’t see how it helps us. In addition, it must be deployed as a separate application (separate docker container) and would need monitoring and configuration in turn. I think that’s too much complexity. We already have code for pushing Open edX data into the Meilisearch index; it works perfectly well, doesn’t require a separate app, has logging that uses our existing log channels, and it can easily be adapted to support Algolia.

The Algolia MySQL Connector requires that your MySQL server allow connections from the public internet, and it would require a totally different configuration than Meilisync or the current indexing code. I would prefer to keep the current indexing code with small adaptations to support Algolia, so that the core logic around pushing documents into the search index, as well as the shape of the documents themselves, is consistent for both Meilisearch and Algolia. For example, there is no Meilisearch-specific code in our index document definition. Basically the only file which requires changes to accommodate Algolia would be api.py.

Typesense is a really nice search engine but I don’t think there is any strong reason for supporting it, and it would be more work to support as it’s a bit more different from Algolia and Meilisearch. In fact, I haven’t yet heard definitively from 2U (or anyone) that they want Algolia to be an option. If nobody expresses a strong need for Algolia, we may not even need an abstraction layer (though I’m expecting 2U will want it). Does someone know who would be the right person there to ping?

Thank you for the great understanding and elaboration. In my previous comment, I mentioned that we currently have three search engines to consider. Out of these, Algolia and Meilisearch are already being considered.

As the name suggests, Meilisync is not intended for Algolia, since Algolia is closed-source and has its own integration mechanisms.

My recommendation is to avoid the unnecessary responsibility of maintenance and to refrain from integrating a search engine directly into the edx-platform.

Since we are focused on implementing an abstraction layer on the front end to support multiple search engines (Meilisearch and Algolia), adding another abstraction layer on the back end just to index data would simply add complexity. From my understanding of the available utility, we can sync the table and apply projection as shown here.

You are absolutely right and your solution is very clear. However, the problem I’m seeing is similar to the one we are facing with Elasticsearch integration in other services like course-discovery, forums, and notes.

As we want to move abstractions to the front end, we will also need to accommodate these services. We will have to implement abstractions (whether on the back end or front end) for these services if we plan to support them.

If we integrate search engines into all the codebases, it could become a potential mess the next time we want to add another search engine, as it might violate the interfacing protocol of these services.

Using the available connector for Algolia is not harmful and can be maintained seamlessly. We can set up some views and a user to expose it to out systems. To enhance security, we can expose proxy endpoints to our database instance.

Yes, it’s totally fine with me if you want to create a search abstraction library that’s separate from edx-platform, and then we can modify the search code I wrote to use it. However, it has to be a good abstraction, not one based on the design of edx-search :stuck_out_tongue:

Maybe but not necessarily. Keep in mind, discovery, forums and notes are all optional services. People can run the platform without those, and it’s not the end of the world if (for now) they have to install elasticsearch in order to use those extra services. For example, I would not bother implementing Meilisearch support in the discovery service if it’s complicated to do, because I’d rather focus on reducing the need for the discovery service in the first place.

Great, I believe we have a plan now. You have already implemented the authentication mechanism under the core package, and in my PR I have implemented a backend for Meilisearch, so we can index data and leverage the existing search logic.

I will create a new PR that contains only the Meilisearch backend, which we can use to index data into Meilisearch. We can then further work on our front end and back end abstraction layers. Please let me know if you are fine with this approach or if you would like to add anything.

Sorry, I’m not clear on what you’re proposing. Is your goal to (A) make the existing LMS courseware search work with Meilisearch instead of Elasticsearch, (B) implement a new “auto-suggest” courseware search feature using Meilisearch/edx-search, (C) implement a new abstraction layer for search backends, or (D) all of the above?

What I’d love to see (in an ideal world) is:

  1. Propose a design for a simple new search abstraction layer on the backend that supports Meilisearch and Algolia, supports proper creation/updating of indexes similar to how Django handles migrations (you can even use Django migrations as the framework for this), supports “personalized access tokens” (requests go directly from browser to search engine, for instant results as you type), etc.
  2. Once you get buy-in, implement the Meilisearch backend for (1) and convert my existing Studio search code to use the new abstraction layer on the backend.
  3. Create an edx-search backend that uses (1) so that the existing courseware search can be used with Meilisearch instead of Elasticsearch.

But if nobody is willing to take that on, we can just do:

  1. Hack edx-search/courseware search with the minimal changes needed to run Meilisearch instead of Elasticsearch.

Post Sumac (or sooner if possible):

  • Rewrite courseware search to use the new abstraction layer instead of edx-search and follow the design discussed in Auto-suggest course content on search (Meilisearch-compatible) so that it uses “personalized access tokens” (requests go directly from browser to search engine, for instant results as you type)
  • Rewrite forums/notes/etc. to use the new abstraction layer
  • Discovery service - tbd; there will be a bunch of proposals about this in the coming months