Is Meilisearch a viable upgrade alternative to OpenSearch?

dave · February 27, 2024, 8:46pm

Please note that the following is a request for comments/input. This is not a decision or definitive recommendation on Axim’s part.

DEPR-170 covers a move from Elasticsearch to OpenSearch, which was also discussed in these forums:

@regis suggested that we might use this opportunity to remove Elasticsearch (and by extension OpenSearch) altogether, pointing out their extremely high memory requirements:

Unfortunately, MySQL full text performance is lackluster, both in terms of result quality and performance when combining a field to be searched with other indexes (like a course). So we still need some search engine to fill that gap to maintain feature parity.

In a recent comment, @jmbowman stated his belief that Meilisearch would be a more promising long-term alternative:

From a future-looking perspective, I feel that Meilisearch would be a better search engine to integrate with. It’s MIT-licensed, blazing fast (implemented in Rust), much less resource-intensive than Elasticsearch, already fairly competitive with Algolia in many respects, has solid commercial support, and has pretty good Python support. There isn’t an authoritative Django package for it yet, but there are several packages and blog posts outlining how other people have used them together. It would be a gamble, but frankly it feels like it has more momentum than OpenSearch.

I’m unfortunately not likely to be able to help much with this for a while, so it’s going to be up to other people to pick a path forward. I just wanted to articulate that while OpenSearch looks at first like the easiest/safest path forward to solve the licensing problem, it’s actually harder than it looks and may not really set up Open edX for success in future search improvements.

I have not tested anything locally, but based on various blog posts, that resource usage difference is massive. This one shows Meilisearch using 1/10th the memory of Elasticsearch, this one at around 1/5th. (There’s even one claiming a 1/50th memory usage, though I suspect misconfiguration in that one.)

Elasticsearch to OpenSearch isn’t a drop-in replacement, and we will have to modify some code to use newer libraries. If we’re going through that effort anyway, should we consider a wholesale move to Meilisearch? It’s a less mature codebase, but it seems to be friendlier out of the box, it has compelling performance characteristics, and it has a much smaller memory footprint. It seems to be actively maintained and developed by an open-source friendly company. By the very unscientific measure that is GitHub stars, it also seems to have more developer excitement than OpenSearch.

The question is this: Is it worth delaying the Elasticsearch → OpenSearch migration in order to do discovery work around Meilisearch? Doing so might give us a more compelling long term search engine for the project, but it would further delay this long-running DEPR effort, and possibly threaten the timeline for new Studio course content search functionality currently scheduled for Redwood.

FYI to @feanil, @Diana_Huang, @braden, and @jristau1984 who have all recently commented on that DEPR.

ashultz · February 27, 2024, 9:16pm

We haven’t done a full rollout yet but tests of older opensearch still using the elasticsearch libraries work fine. So it seems possible to stop using elasticsearch servers before having to change the library, which would possibly ease the transition. To do the library transition we’d want to introduce another search engine and also overhaul the weird way search engines are chosen from settings to allow index by index changeover.
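As a rough illustration of what an index-by-index changeover could look like, here is a minimal sketch. The setting names (`DEFAULT_SEARCH_ENGINE`, `SEARCH_ENGINE_BY_INDEX`), the index names, and the helper are all hypothetical; this is not how edx-search is configured today.

```python
# Hypothetical settings, for illustration only (not the real edx-search config).
DEFAULT_SEARCH_ENGINE = "elasticsearch"

# Indexes listed here have been switched over; everything else uses the default.
SEARCH_ENGINE_BY_INDEX = {
    "library_content": "meilisearch",
}


def engine_for_index(index_name, overrides=None, default=DEFAULT_SEARCH_ENGINE):
    """Return the engine name configured for one index, falling back to the default."""
    if overrides is None:
        overrides = SEARCH_ENGINE_BY_INDEX
    return overrides.get(index_name, default)
```

With a lookup like this, moving one index at a time only means adding an entry to the mapping and rebuilding that index on the new engine.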

Different searches in the code are tied to ES to different degrees: course content search is completely tied to the ES format, and that tie is scattered across multiple codebases. Other more modern searches at least put all their assumptions together in one place.

Fixing courseware search would be sufficiently annoying that I think it would be better to just make a Meilisearch or whatever separate library and then redo courseware search on it than to carefully juggle the way it is split between the library, LMS, and CMS now.

See https://openedx.atlassian.net/wiki/spaces/AC/pages/3884744738/State+of+edx-search+2023 for various search stuff dug out last year.

On an operational note it only took a day to rebuild the course content index for all active edx courses using the management command, so I think we should go with the “rebuild your indices using these commands” route for any migrations rather than worry about some exotic index to index magic.

Personally I have not liked elasticsearch any time I have had to use it though it is relatively harmless here. I’m sure opensearch is about the same. But that certainly makes me inclined to like a different search solution.

regis · February 28, 2024, 4:54am

I believe it is! I’ve been looking at Meilisearch myself as a candidate replacement for Elasticsearch in Open edX, and I’m excited by this prospect. Elasticsearch is the biggest memory hog (and climate changer) in Open edX. If we were to switch to OpenSearch, I believe we would never invest the energy to migrate again to Meilisearch.

That being said, it’s not a trivial task to isolate the Elasticsearch clients in Open edX. ES is used in multiple apps, including edx-search and the forum. So it’s a complex project, but I think the reward would be well worth the effort.

braden · February 28, 2024, 10:15pm

If anyone wants to explore this in detail or hack on it with me:

I just created tutor-contrib-meilisearch, which will add Meilisearch to your Tutor Nightly devstack and configure edx-platform to connect to it.
- With only a few small documents in each index (i.e. the typical devstack case), it uses < 20 MB of memory, whereas ElasticSearch uses 1,380 MB.
This PR demonstrates indexing library content in Studio. No search nor courseware functionality tested yet. You can do basic searches using the Meilisearch UI though - see screenshots.

Glib_Glugovskiy · February 28, 2024, 11:05pm

I don’t have much to add, but I’m thrilled that this discussion is happening. I’ve used meilisearch in a Rust-based project before; although the number of use cases was limited, I had a good experience working with it. I believe that performance is definitely the strongest point of meilisearch, and it is worth considering the difference with ES and OpenSearch during the discovery.

Braden’s POC looks very promising, great work!

Additionally, different Rust solutions mature every year and continue to surprise with their performance characteristics compared to more popular choices. They also provide safety and easy integration with other technologies, including Python, for example through PyO3, by building native Python wheels. In my biased opinion, the community could benefit from including such technologies in the core offering, potentially making OeX technologies appear more prestigious.
I would be happy to join and participate in this initiative regarding Meilisearch.

dave · February 29, 2024, 9:27pm

@ashultz: Thank you for the background info! That wiki page is amazing.

@braden: That proof-of-concept looks great!

We have a couple of in-Studio content search features that will be in development very shortly–one for courses and one for the new libraries experience. I’d like to propose that we implement these using Meilisearch, to try it out on something real. Some rough thoughts for an ADR:

- We keep the Meilisearch-specific code isolated to a single module, so it’s relatively easy to swap out later if this experiment doesn’t pan out.
- All use of Meilisearch would be off by default in Redwood, giving folks until at least Sumac to plan for the additional infrastructure required.

Once we try it out and have some experience with it, we can make the decision of whether to alter the DEPR to convert to Meilisearch. We don’t want to end up in a final state where we’re running both indefinitely, so if things don’t go well with Meilisearch, we’ll convert the Studio content search functionality over to OpenSearch for consistency with the DEPR.

Does that sound reasonable to folks?

Felipe · February 29, 2024, 11:47pm

The plan to experiment with Meilisearch sounds very reasonable. The most critical point for me is

We don’t want to end up in a final state where we’re running both indefinitely

Which you cover in the case that the experiments with Meilisearch go wrong. Should it go well and be as good as it looks, what could be a rough plan for moving all search to it?

braden · March 1, 2024, 6:23pm

@dave The plan sounds great. I will continue to develop my prototype along those lines. I already expanded it to index courseware (in Studio).

dave · March 2, 2024, 5:50am

Shortly after Redwood, we should have enough information to know how we want to proceed. If Meilisearch pans out, we make a new DEPR, and make Meilisearch a baseline requirement for Sumac. We start porting over existing Python code that currently uses Elasticsearch to use Meilisearch instead in the run up to Sumac. Elasticsearch is likely still around as an option for the Teak release, but is dropped entirely after Teak is cut.

The most annoying sticking point is likely to be the forums–particularly the cs_comments_service written in Ruby. Next week, I’ll work on a long overdue ADR for re-implementing that service’s functionality in Django. The search part of that re-implementation would likely not start until after Redwood is cut, so we should have a direction by then.

blarghmatey · March 4, 2024, 4:48pm

One thing that is worth noting about Meilisearch is that it seems to only allow high availability mode in their hosted cloud service. In their comparison matrix, under deployment it shows high availability as “available with Meilisearch Cloud”.

Overall it seems like a promising product, but for anyone running a non-trivial deployment of Open edX it would force them into using their hosted product. In general that may not be a show-stopper, but I can imagine there are cases where that would prevent someone from using Open edX.

dave · March 4, 2024, 5:09pm

@blarghmatey: Thank you for the info! That’s definitely concerning.

The most recent ticket I can find about it is here:

github.com/meilisearch/meilisearch

About replicating Meilisearch

opened 01:49PM - 14 Feb 23 UTC

Kerollmops

tech discussion prototype available

This morning I had a meeting with @dureuill, and we discussed the different solutions to replicate Meilisearch we could implement and the pros and cons of each.

First, we must define what we want to achieve. Meilisearch is very fast to boot. It doesn't need to process anything before being able to serve requests, not even if a crash occurs. The reason is that it uses LMDB and not RocksDB or SQLite, which are [WAL-based embedded databases](https://en.wikipedia.org/wiki/Write-ahead_logging). The high-availability feature of the Cloud highly depends on this feature, but it doesn't apply in situations where an entire cluster is down.

These different solutions can fix the weaknesses of the current design. They are more or less complex to implement. So that you know, no single solution described below straightforwardly manages task cancellation, and therefore task cancellation will be disabled for the prototypes.

## A Single Writer Broadcasts its Task Queue to multiple Readers

This solution is the easiest to implement by far. The principle is that we broadcast the tasks received by the single writer to the other readers. There are different quirks to think about, but the current Meilisearch codebase rests unchanged.

A Meilisearch server would be allowed to receive user write requests. Every time it successfully writes it to its task queue, it broadcasts it to all the previously registered Meilisearch Readers. The Meilisearch Readers also store this task, but only if the highest task id follows the previous one. If not, it asks for the set of missing tasks to the Meilisearch Writer. The Meilisearch Writer will generate a dump or send the list of tasks depending on whether the task content files are available (we delete the task content files when a task is finished).

Pros:
- Seems easy enough to create a prototype.
- Elegant solution where we keep the engine as is.
- Highly available on the read side.

Cons:
- Highly available on the read side only.

## The Task Queue is Replicated using a Raft or Paxos Consensus Protocol

This solution looks to be the ideal one. Multiple Meilisearch servers can synchronize the task queue together. Sending the tasks to each other and committing changes at the same time. Unfortunately, it is not the easiest to build as consensus protocols are [hard to implement right](https://jepsen.io/).

Here is a list of available Rust replication libraries: [tikv/raft-rs](https://github.com/tikv/raft-rs), [datafuselabs/openraft](https://github.com/datafuselabs/openraft), [benschulz/paxakos](https://github.com/benschulz/paxakos). And here is a list of some C/C++ libraries we can probably wrap: [baidu/braft](https://github.com/baidu/braft), [canonical/raft](https://github.com/canonical/raft), [willemt/raft](https://github.com/willemt/raft), [logcabin/logcabin](https://github.com/logcabin/logcabin).

Pros:
- Highly available on reads and writes.
- The replication library manages the replication.

Cons:
- [It doesn't seem easy to create a prototype](https://github.com/Kerollmops/canonical-raft/issues/4).
- It can be hard to debug network and consistency issues.
- There is no actual production-grade replication library in Rust.

## The Task Queue is a Message Broker

A [message broker](https://en.wikipedia.org/wiki/Message_broker) is a, most of the time, distributed event store in which we can send events that every listener will read. In this solution, all Meilisearch servers were listening to the same broker and subscribed simultaneously.

When a written request is sent to any Meilisearch server, this server will store it in the broker, and all the cluster members will receive it and start processing it. When a Meilisearch server is lacking, it must start from a dump or process the tasks in the order from the start, which can take a lot of time.

I didn't take the time to think more about that, but there surely be a communication between two Meilisearch servers to ask for a dump at a moment which can complexify the solution.

Pros:
- The broker manages replication and high availability.
- Highly available on reads and writes.

Cons:
- Depending on the broker, the user must set up and monitor another program.
- We must maintain another task queue that lies in the broker queue.
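To make the first option in the quoted issue (a single writer broadcasting its task queue to readers) more concrete, here is a toy sketch of the sequencing rule it describes: a reader stores a task only if its id directly follows the last one it has, and otherwise asks the writer for what it missed. The `Writer` and `Reader` classes are hypothetical and are not Meilisearch code; the sketch also assumes the writer can always resend the missing tasks, whereas the issue notes a real writer might have to generate a dump instead.

```python
from dataclasses import dataclass, field


@dataclass
class Writer:
    """Single writer that owns the authoritative task queue (toy model)."""
    tasks: list = field(default_factory=list)    # task id == position in the list
    readers: list = field(default_factory=list)

    def register(self, reader):
        self.readers.append(reader)

    def enqueue(self, payload):
        task_id = len(self.tasks)
        self.tasks.append(payload)
        # Broadcast the new task to every previously registered reader.
        for reader in self.readers:
            reader.receive(task_id, payload, self)
        return task_id

    def tasks_from(self, first_id):
        """Return every task with id >= first_id, in order."""
        return self.tasks[first_id:]


@dataclass
class Reader:
    """Read-only replica that mirrors the writer's task queue (toy model)."""
    tasks: list = field(default_factory=list)

    def receive(self, task_id, payload, writer):
        next_expected = len(self.tasks)
        if task_id == next_expected:
            # The task directly follows the previous one: store it.
            self.tasks.append(payload)
        elif task_id > next_expected:
            # Gap detected: ask the writer for everything we are missing
            # (this includes the task that was just broadcast).
            self.tasks.extend(writer.tasks_from(next_expected))
        # task_id < next_expected means a duplicate, which is ignored.
```

A reader that registers late catches up on the first broadcast it receives, which is the "asks for the set of missing tasks" step from the issue.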

The most recent activity was in November though. They have a prototype, but it’s rough and has some big flaws, the most obvious of which is: “If the leader crashes, there is no re-election, the cluster no longer works, but the followers can still answer search requests. We are still thinking about what we could do about this.”

I believe this is the draft PR for the prototype:

github.com/meilisearch/meilisearch

Cluster

meilisearch:main ← meilisearch:cluster

opened 06:05PM - 16 Mar 23 UTC

irevoire

+1520 -157

## How does it work?

In this first implementation, we went on a leader/follower approach with a pre-selected leader that can't change. The followers only follow the order of the leader but allow read. And the leader is in charge of replicating all the writes to the followers and itself.

### Processing a task

The leader will send the tasks to process to the follower. Then, after indexing everything **but right before committing the changes on disk**, it'll wait for the state of the follower. At the same time, the followers get the batch to process from the leader and also wait before committing. Depending on the consistency rule, the leader might tell them to commit right away or later. If the consistency has been set to;

- `one`: the leader will tell everyone to commit **without waiting for any followers**.
- `two`: the leader will wait for one follower to be ready to commit before telling everyone to commit and moving on
- `quorum`: the leader will wait until more than half of the cluster is ready to commit
- `all`: the leader will wait until **all** the followers are ready to commit.

**Not implemented yet**: If a follower doesn't get the same result as the leader, it should either:

- kill itself
- don't commit but continue to accept reads (it's going to be outdated)

### Joining the cluster

When a node joins the cluster it won't be active straight away. The leader will accept the connection with the follower, but it'll wait until the current task has been processed. And in between two tasks, all the followers will « officially » join the cluster (we say they become active). To share the leader's state with the new followers, it'll create a dump and send it to the followers so they can update themselves to the current state of the cluster.

The leaders and followers **must** share the same master key. If that's not the case, the follower won't be able to join the cluster. Also: the connections between the leader and followers are encrypted with chacha20 and the master key; thus, it's recommended to have a secure autogenerated master key of at least 32 bytes.

### Synchronizing the API key

The leader forwards the API key operations to every follower, and it's updated ASAP without synchronizing anything.

## What new API pieces have been introduced:

- CLI:
  - A new `--experimental-enable-ha <EXPERIMENTAL_ENABLE_HA>` flag has been introduced. Its values are either `leader` or `follower`.
  - A new `--leader <LEADER>` flag has been introduced. It lets you specify the address of the leader, and it's mandatory if you're a follower
  - A new `--consistency <CONSISTENCY>` flag has been introduced to configure the consistency rules. Its possible values are:
    - `one` => The leader progress as fast as possible
    - `two` => The leader + one node are in sync
    - `quorum` => The majority of the cluster stays synchronized
    - `all` => The whole cluster stays in sync

## What is utterly broken/ugly currently and should be rewritten / handled correctly

- :one: The TCP connection used between the leader and the followers doesn't have the `keepalive` option enabled. Thus the connections are probably going to die often.
- :one: Add an internal interface
- :one: The tasks received while joining the cluster might be lost
- :one: Handle what happens when a follower doesn’t get the same result the leader got from processing a batch
- :one: What happens when the tick function need to re-run (cancel + MaxDatabaseSizeReached). For the potential users reading us, I think the whole cluster might get stuck for ever. If you have a « normal » Linux machine that should never happens though
- :one: Currently, the only sync we have is made on the index operation
- :two: Don’t truncate the master-key when starting the cluster -> handle the error when the master-key is wrong / we can’t connect
- :two: When we send a task (with its update file) or a dump (to let the followers join the cluster), it must be stored entirely in RAM, which definitely won't scale on a small computer
- :two: The followers are still able to receive writes (tasks or API keys), and I don't exactly know what happens in this case, but it's definitely nothing good
- :three: We spawn like 200 threads that could all be a super small async rust routine that doesn't costs anything
- :three: It doesn't work on windows
- :three: The instance-uid should be shared for the whole cluster

------

Below are tamo's notes, don't try to understand anything.

- Make the consistency configurable at the task level
- Synchronize the API key
- Synchronize the instance uid, maybe?
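The four consistency rules in the quoted PR description boil down to how many followers the leader waits for before telling everyone to commit. The function below is a small sketch of that arithmetic, not code from the PR. It assumes that `cluster_size` counts the leader, that the leader counts itself as ready for `quorum`, and that `two` degrades to zero in a one-node cluster; none of these details are confirmed by the description.

```python
def followers_to_wait_for(consistency, cluster_size):
    """How many followers must be ready before the leader tells everyone to commit.

    cluster_size counts the leader plus all followers (assumption).
    """
    if cluster_size < 1:
        raise ValueError("a cluster needs at least a leader")
    followers = cluster_size - 1

    if consistency == "one":
        # Commit without waiting for any follower.
        return 0
    if consistency == "two":
        # Wait for one follower (or none if the leader is alone: assumption).
        return min(1, followers)
    if consistency == "quorum":
        # More than half of the cluster must be ready. Assuming the leader
        # itself is one of the ready nodes, that leaves cluster_size // 2 followers.
        return cluster_size // 2
    if consistency == "all":
        return followers
    raise ValueError(f"unknown consistency rule: {consistency!r}")
```

For a five-node cluster this gives 0, 1, 2 and 4 followers for `one`, `two`, `quorum` and `all` respectively.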

I wrote a comment on their Discord channel for this topic.

braden · March 4, 2024, 6:52pm

Good point. One thing that’s important to distinguish is whether they are purposely keeping HA out of the open source product as a business strategy (as many “open source” database/search vendors do these days), or they just haven’t developed it. From what I can tell, it’s the latter - they fully intend to support this feature in the open source project and would welcome contributions to do that, but it has been repeatedly delayed / de-prioritized. (The PR that Dave linked to is from a Meilisearch employee.)

So I am optimistic that this could be resolved in the future, but it seems like nobody should count on that anytime soon.

It’s also worth noting that (if I understand correctly) the nominal “high availability” that they advertise on their cloud offering is not replication-based but instead “we ensure the high availability of your project with Kubernetes technology, redundant volumes, and regular backups. In the event of an error, a Meilisearch server takes only a few milliseconds to restart” (source). So Open edX operators may not be able to use replication in the immediate future, but can certainly use those strategies. What’s more, because Meilisearch is so lightweight, you can deploy a separate instance per index, so that (for example) your Studio courseware search doesn’t go down at the same time as your forum search.

Q: Would this be a deal-breaker?

Q: Is ElasticSearch ever on the “critical path” for learning? i.e. learner account creation, logging in, course purchasing/enrollment, viewing courseware, submitting assignments/exams/problems, posting in the forum, viewing grades, etc.

braden · March 5, 2024, 8:12pm

I have expanded my prototype so it can demonstrate full end-to-end search functionality from backend to frontend. It also includes courseware now, not just v2 libraries. It also includes tag data from the new tagging system. Pardon the ugly UI.

braden · March 18, 2024, 4:47pm

For those following this thread, I am planning to proceed with developing new Studio search functionality using Meilisearch as discussed (as an experiment - the feature will be off by default, and so Meilisearch won’t be required unless you choose to opt in and help test it out; later we will evaluate it and make a decision about what path to take for Sumac).

I’ve added an ADR to the PR and it’s ready for review/merge: Index Studio content using Meilisearch [FC-0040] by bradenmacdonald · Pull Request #34310 · openedx/edx-platform · GitHub

regis · June 12, 2024, 1:17pm

4 posts were split to a new topic: Auto-suggest course content on search (Meilisearch-compatible)

blarghmatey · August 9, 2024, 8:26pm

Recognizing that there has already been substantial investment in the adoption of Meilisearch as the de facto search backend for edx-platform, I wanted to follow up with this topic.

I had not been following this work closely, but after revisiting the conversations around high availability/redundancy/failover in Meilisearch it seems that there has still been no real progress in that direction. All of the GitHub issue and discussion threads peter out in the same manner of noting the challenge of implementing distributed consensus (e.g. Paxos, Raft) and the lack of high-quality libraries in Rust to handle them. All of the recommended methods of handling failures rely on persistent disk and restarting the process, which fundamentally fails to address high availability and shared-nothing architectures. Instead it forces you to have a distributed storage layer (e.g. NFS, GlusterFS, Ceph) or some other means of data replication to be able to handle server failures, disk corruption, etc.

I understand that the majority case of edX installations, and the primary mode of operation supported by e.g. Tutor is to have a single server or virtual machine, but for cases where someone is not operating in that fashion Meilisearch continues to pose an operational risk. Granted, the search functionality is not mission critical for the use of edx-platform, but for anyone who operates the system a failure in any element of the system can still lead to a degradation of trust in the system or the ability of the operator.

I recognize that there is no perfect answer to the challenge of search, and I do like the performance promises that Meilisearch offers. That being said, it seems that Typesense would be a more appropriate alternative? It offers similar performance benefits, a longer development history with wider adoption, and an out-of-the-box HA story (Algolia vs Elasticsearch vs Meilisearch vs Typesense Comparison). Looking at the comparison document it seems that the primary downside is the in-memory nature of the engine?

braden · August 12, 2024, 4:59pm

@blarghmatey What we’ve been discussing in other threads is to implement an abstraction layer, so that anyone who really cares about HA for search can use Algolia (or perhaps TypeSense if someone wants to implement that). Note that Elasticsearch will likely not be an option as it’s very different from these more modern search engines, and probably not worth including under the same abstraction.
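For readers wondering what such an abstraction layer might look like, here is a minimal sketch under stated assumptions. The `SearchBackend` protocol, its two methods, the `InMemoryBackend` stand-in, and the `course_content` index name are all hypothetical; this is not the interface being built for edx-platform, and a real backend would wrap a Meilisearch, Algolia, or Typesense client instead of a Python dict.

```python
from typing import Protocol


class SearchBackend(Protocol):
    """Hypothetical minimal interface that each engine wrapper would implement."""

    def index_documents(self, index: str, documents: list) -> None: ...

    def search(self, index: str, query: str) -> list: ...


class InMemoryBackend:
    """Toy stand-in for a real engine: naive substring matching, illustration only."""

    def __init__(self) -> None:
        self._indexes = {}

    def index_documents(self, index, documents):
        self._indexes.setdefault(index, []).extend(documents)

    def search(self, index, query):
        needle = query.lower()
        return [
            doc
            for doc in self._indexes.get(index, [])
            if any(needle in str(value).lower() for value in doc.values())
        ]


def search_course_content(backend: SearchBackend, query: str) -> list:
    """Calling code depends only on the protocol, never on a specific engine."""
    return backend.search("course_content", query)
```

Because callers only see the protocol, swapping engines would mean writing one new wrapper class rather than touching every search call site.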

TypeSense is nice and I’ve used it before, but one of the big drawbacks is exactly what you noted: that it can require a lot of memory because it stores the entire index in RAM.

As of yet however, nobody other than you has actually said that they need something other than Meilisearch, and nobody has volunteered to implement wrappers for other search engines. @qasimgulzar is working on the abstraction layer in general and Meilisearch in particular.

dave · August 20, 2024, 9:14pm

@braden: We haven’t had that much feedback from elsewhere in the community, but I’m inclined to believe that others will share @blarghmatey’s concerns about HA. It’s unlikely to change what goes out in the Sumac timeframe, but I think it’s worth Axim’s while to fund an investigation into the memory usage issues around Typesense.

I agree with both of you that the memory usage is the biggest potential drawback of Typesense, but I’m not sure how that will actually play out in practice. My intuition is that because Meilisearch uses memory mapped files, and because relatively few parts of the index will be “hot” at any given time, it will effectively require less memory for comparable performance. But there are some huge caveats with that:

- My intuition is purely guesswork–it could be that Meilisearch requires comparable RAM to Typesense to give acceptable performance in practice on large datasets.
- Running in clustered mode will increase indexing write latencies. By how much?
- Are there significant differences in how compactly they represent their indexes?
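For anyone unfamiliar with the mechanism behind the memory-mapped-file intuition above, here is a small generic Python illustration. It has nothing to do with Meilisearch's actual storage code; it only shows that a mapped file can be read by slice without loading the whole file up front. Which pages end up resident in RAM is decided by the operating system, so this does not measure memory usage.

```python
import mmap
import os
import tempfile


def read_slice(path, offset, length):
    """Map a file read-only and return one slice of it."""
    with open(path, "rb") as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as view:
            # Only the pages backing this slice have to be faulted in.
            return view[offset:offset + length]


def demo():
    """Write a ~2 MB file with a small 'hot' region in the middle and read just that."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"\0" * 1_000_000 + b"hot" + b"\0" * 1_000_000)
        path = tmp.name
    try:
        return read_slice(path, 1_000_000, 3)
    finally:
        os.remove(path)
```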

The two biggest things that I’m aware of are course content data storage and forums post storage (the catalog-related metadata that I know of is orders of magnitude smaller).

@blarghmatey: If someone coded a minimal Typesense integration (using the interface that @qasimgulzar is making), would you have the time/capacity to be able to run both Typesense and Meilisearch indexing against the data on your production site, so that we can get a better understanding of how memory and latency compare across the two using real data?

blarghmatey · August 22, 2024, 1:14pm

Thank you for that suggestion. I can plan to set aside some time for that testing to help move the conversation forward with a bit of concrete evidence. I agree that getting some real-world data around the operational overhead of each solution would be useful.
