Upgrading MySQL charset to utf8mb4

Nutmeg introduced a new bug in outlines where the underlying issue was our use of utf8 character set encoding for our MySQL tables.

The problem: MySQL’s utf8 encoding isn’t “real” UTF-8. It’s a three-byte encoding that is missing certain characters, including emojis. We want to upgrade to utf8mb4 which is MySQL’s real UTF-8 data. This has been an issue for a while, but is only going to bite us more over time as some of our content storage gets shifted into MySQL.

Some relevant pieces that make a migration interesting:

  • In addition to targeting utf8mb4 for the charset encoding, we probably want to switch the collation to the new MySQL 8.0.1 default of utf8mb4_0900_ai_ci. This collation is unavailable in MySQL 5.7.
  • We need to complete the migration from MySQL 5.7 to 8.0 at some point soon (Olive?), because 5.7’s end of life is Oct. 2023. If that is going to be a disruptive change anyway, we might try to bundle encoding changes at the same time.
  • In theory, edx-platform can run using either as of Nutmeg, thanks to work done by 2U. To my knowledge, nobody has actually taken the plunge and tried MySQL 8.0 yet though.
  • The migration path I’m aware of involves backup → export → import, which may not be feasible for people running large sites.
  • I don’t know if there’s any plausible way to run in a “mixed mode” to do gradual rollouts, since Django settings also have to be tweaked to transmit data using utf8mb4.
  • Many fields were created as varchar(255) because utf8’s 3-byte character encoding means that it’s the largest char field you can make that still fits under InnoDB’s 767 character index limit. Newer tables using DYNAMIC row format can go over this, but doing so may have performance implications.

In an ideal world, it would be great to have something that does a seamless, zero-downtime upgrade. One step down from that is maybe a Tutor script that can do the upgrade process with some downtime? A first step might be to at least make it so that Tutor dev on nightly runs MySQL 8.0 with the new charset and collation settings.

If you have thoughts on this, or have experience running Open edX on MySQL 8.0, I’d love to hear from you.

Thanks folks. :smile:


Just tested on the latest version of MySQL. Everything seems to be working fine… at least for the moment.

Only noticed one strange thing during the migration. The title of one of the courses “Apprendre à apprendre” became “Apprendre à apprendre”. And that’s only for the title of the course, everything else is well written.

Course outline screenshot

For the migration, one of the @regis plugins has been forked and modified. The plugin can be installed at this address: Tutor plugin to install the latest version of MySQL and adding support of utf8mb4 · GitHub

We can see that Tutor uses the latest version of MySQL with the collation utf8mb4_0900_ai_ci:

Terminal screenshots


Seems to works well on a fresh install

Course outline from fresh install

How the fresh installation was made:

  • add charset: "utf8mb4" to tutor/templates/apps/openedx/config/partials/auth.yml (DATABASES[‘default’][‘OPTIONS’][‘charset’] = ‘utf8mb4’)
tutor plugins install https://gist.githubusercontent.com/Abdess/3ed9ed1d42821d00a5cf2481870df26f/raw/tutor-mysql8utf8mb4.yml
tutor plugins enable mysql8utf8mb4
tutor config save
tutor local quickstart

I extracted the database schema from a fresh install with the charset: "utf8mb4" parameter to see the difference with and without the plugin:
openedx_schema5.sql (498.6 KB)
openedx_schema8.sql (515.6 KB)

Example of a MySQL5 (left) and MySQL8 (right) table

For the experiment I replace the remaining utf8mb3:

Replacing remaining utf8mb3
docker exec tutor_local_mysql_1 /usr/bin/mysqldump -Q -d -uroot -pXXXXXXXX --default-character-set=utf8 --skip-set-charset openedx | sed 's/utf8mb3/utf8mb4/gi' | sed 's/utf8_bin/utf8mb4_0900_ai_ci/gi' | sed 's/utf8mb4_general_ci/utf8mb4_0900_ai_ci/gi' > openedx_schema.sql
docker exec tutor_local_mysql_1 /usr/bin/mysqldump -Q --insert-ignore -t -uroot -pXXXXXXXX --default-character-set=utf8 --skip-set-charset openedx > openedx_data.sql
docker exec -it tutor_local_mysql_1 bash
mysql -uroot -pXXXXXXXX
create database openedx default charset utf8mb4 collate utf8mb4_0900_ai_ci;
cat openedx_schema.sql | docker exec -i tutor_local_mysql_1 /usr/bin/mysql -u root --password=XXXXXXXX openedx
cat openedx_data.sql | docker exec -i tutor_local_mysql_1 /usr/bin/mysql -u root --password=XXXXXXXX openedx

It is when I replace the utf8mb3 by utf8mb4 that the encoding bug appears:


But when I make a modification in the studio, the issue disappears:


Also I just noticed another bug. The LMS displays the n-1 changes.
For example:

  • If I replace “introduction” by " introduction s" then “introduction” is displayed on the LMS
  • “introduction s” → “introduction d” = “introduction s”
  • “introduction d” → “introduction” = “introduction d”
1 Like

I came across one nice written article when I first raised this in the tutor forum: How to support full Unicode in MySQL databases · Mathias Bynens
The official documentation is worth reading as well: MySQL :: MySQL 8.0 Reference Manual :: 10.9.8 Converting Between 3-Byte and 4-Byte Unicode Character Sets

If my understanding is correct, to avoid downtime, we can do the following:

  1. update the default database charset. (This is used only as a default value for future tables)
  2. update the default table charset (This is used only as a default value for future columns, at this point utf8mb4 still cannot be inserted)
  3. update the column charset (at this point utf8mb4 can be inserted via MySQL CLI but not through Django as it’s still using utf8mb3 connection?)
  4. redeploy the Django application with the new utf8mb4 connection settings
1 Like

That is a little troubling. I’d be curious to know what the exported data looks like for that entry.

Ah, maybe that’s the part I’m missing. I didn’t realize that it was possible to incrementally upgrade individual tables to utf8mb4 while the Django connection is utf8 and still have everything work normally for the 3-byte UTF-8 characters. If that’s the case, that’s fantastic, particularly for large sites that need to be extra careful about performance regressions.