Upgrading MySQL charset to utf8mb4

dave · July 26, 2022, 4:51pm

Nutmeg introduced a new bug in outlines where the underlying issue was our use of utf8 character set encoding for our MySQL tables.

The problem: MySQL’s utf8 encoding isn’t “real” UTF-8. It’s a three-byte encoding that is missing certain characters, including emojis. We want to upgrade to utf8mb4 which is MySQL’s real UTF-8 data. This has been an issue for a while, but is only going to bite us more over time as some of our content storage gets shifted into MySQL.

Some relevant pieces that make a migration interesting:

In addition to targeting utf8mb4 for the charset encoding, we probably want to switch the collation to the new MySQL 8.0.1 default of utf8mb4_0900_ai_ci. This collation is unavailable in MySQL 5.7.
We need to complete the migration from MySQL 5.7 to 8.0 at some point soon (Olive?), because 5.7’s end of life is Oct. 2023. If that is going to be a disruptive change anyway, we might try to bundle encoding changes at the same time.
In theory, edx-platform can run using either as of Nutmeg, thanks to work done by 2U. To my knowledge, nobody has actually taken the plunge and tried MySQL 8.0 yet though.
The migration path I’m aware of involves backup → export → import, which may not be feasible for people running large sites.
I don’t know if there’s any plausible way to run in a “mixed mode” to do gradual rollouts, since Django settings also have to be tweaked to transmit data using utf8mb4.
Many fields were created as varchar(255) because utf8’s 3-byte character encoding means that it’s the largest char field you can make that still fits under InnoDB’s 767 character index limit. Newer tables using DYNAMIC row format can go over this, but doing so may have performance implications.

In an ideal world, it would be great to have something that does a seamless, zero-downtime upgrade. One step down from that is maybe a Tutor script that can do the upgrade process with some downtime? A first step might be to at least make it so that Tutor dev on nightly runs MySQL 8.0 with the new charset and collation settings.

If you have thoughts on this, or have experience running Open edX on MySQL 8.0, I’d love to hear from you.

Thanks folks.

Abdess · July 26, 2022, 7:21pm

Just tested on the latest version of MySQL. Everything seems to be working fine… at least for the moment.

Only noticed one strange thing during the migration. The title of one of the courses “Apprendre à apprendre” became “Apprendre Ã apprendre”. And that’s only for the title of the course, everything else is well written.

Course outline screenshot

For the migration, one of the @regis plugins has been forked and modified. The plugin can be installed at this address: Tutor plugin to install the latest version of MySQL and adding support of utf8mb4 · GitHub

We can see that Tutor uses the latest version of MySQL with the collation utf8mb4_0900_ai_ci:

Terminal screenshots

UPDATE:

Seems to works well on a fresh install

Course outline from fresh install

How the fresh installation was made:

Summary

add charset: "utf8mb4" to tutor/templates/apps/openedx/config/partials/auth.yml (DATABASES[‘default’][‘OPTIONS’][‘charset’] = ‘utf8mb4’)

tutor plugins install https://gist.githubusercontent.com/Abdess/3ed9ed1d42821d00a5cf2481870df26f/raw/tutor-mysql8utf8mb4.yml

tutor plugins enable mysql8utf8mb4

tutor config save

tutor local quickstart

I extracted the database schema from a fresh install with the charset: "utf8mb4" parameter to see the difference with and without the plugin:
openedx_schema5.sql (498.6 KB)
openedx_schema8.sql (515.6 KB)

Example of a MySQL5 (left) and MySQL8 (right) table

For the experiment I replace the remaining utf8mb3:

Replacing remaining utf8mb3

docker exec tutor_local_mysql_1 /usr/bin/mysqldump -Q -d -uroot -pXXXXXXXX --default-character-set=utf8 --skip-set-charset openedx | sed 's/utf8mb3/utf8mb4/gi' | sed 's/utf8_bin/utf8mb4_0900_ai_ci/gi' | sed 's/utf8mb4_general_ci/utf8mb4_0900_ai_ci/gi' > openedx_schema.sql

docker exec tutor_local_mysql_1 /usr/bin/mysqldump -Q --insert-ignore -t -uroot -pXXXXXXXX --default-character-set=utf8 --skip-set-charset openedx > openedx_data.sql

docker exec -it tutor_local_mysql_1 bash

mysql -uroot -pXXXXXXXX

DROP DATABASE openedx;

create database openedx default charset utf8mb4 collate utf8mb4_0900_ai_ci;

exit

exit

cat openedx_schema.sql | docker exec -i tutor_local_mysql_1 /usr/bin/mysql -u root --password=XXXXXXXX openedx

cat openedx_data.sql | docker exec -i tutor_local_mysql_1 /usr/bin/mysql -u root --password=XXXXXXXX openedx

It is when I replace the utf8mb3 by utf8mb4 that the encoding bug appears:

Screenshot

But when I make a modification in the studio, the issue disappears:

Screenshot

Also I just noticed another bug. The LMS displays the n-1 changes.
For example:

If I replace “introduction” by " introduction s" then “introduction” is displayed on the LMS
“introduction s” → “introduction d” = “introduction s”
“introduction d” → “introduction” = “introduction d”

iamCristYe · July 27, 2022, 6:36am

I came across one nice written article when I first raised this in the tutor forum: How to support full Unicode in MySQL databases · Mathias Bynens
The official documentation is worth reading as well: MySQL :: MySQL 8.0 Reference Manual :: 10.9.8 Converting Between 3-Byte and 4-Byte Unicode Character Sets

If my understanding is correct, to avoid downtime, we can do the following:

update the default database charset. (This is used only as a default value for future tables)
update the default table charset (This is used only as a default value for future columns, at this point utf8mb4 still cannot be inserted)
update the column charset (at this point utf8mb4 can be inserted via MySQL CLI but not through Django as it’s still using utf8mb3 connection?)
redeploy the Django application with the new utf8mb4 connection settings

dave · July 27, 2022, 2:14pm

That is a little troubling. I’d be curious to know what the exported data looks like for that entry.

Ah, maybe that’s the part I’m missing. I didn’t realize that it was possible to incrementally upgrade individual tables to utf8mb4 while the Django connection is utf8 and still have everything work normally for the 3-byte UTF-8 characters. If that’s the case, that’s fantastic, particularly for large sites that need to be extra careful about performance regressions.

lpm0073 · June 16, 2023, 1:16pm

Thanks for posting this information @dave. What was the final decision with regard to Olive and MySQL 8? Is it recommended to upgrade Olive installations from MySQL 5.7 to MySQL 8?

dave · June 16, 2023, 2:11pm

I think you probably want to upgrade Olive to 8.0, though I defer to @jmbowman and @regis on that. You definitely want to be using MySQL 8.0 with Palm. MySQL 5.7 will be unsupported by the time Quince rolls around.

ABC · April 6, 2024, 8:07pm

Note for future self and others:

tutor plugins install https://gist.githubusercontent.com/Abdess/3ed9ed1d42821d00a5cf2481870df26f/raw/tutor-mysql8utf8mb4.yml
tutor plugins enable mysql8utf8mb4
tutor config save
tutor local quickstart

is the actual mechanism that causes an update from MySQL 5.7 to 8 (8.3 currently)

ABC · July 14, 2024, 2:01pm

Another note for future generations: The https://gist.githubusercontent.com/Abdess/3ed9ed1d42821d00a5cf2481870df26f/raw/tutor-mysql8utf8mb4.yml plugin has

image: docker.io/mysql:latest as the mysql to be used. That worked for me through MySQL 8.3.0. However the latest MySQL has started deprecating CLI passwords. (as encountered here

So the mysql -uroot -pXXXXXXXX command mentioned above will break if you use image: docker.io/mysql:latest as is default in that plugin.

I currently recommend using image: docker.io/mysql:8.3.0 instead, if you want to use the above instructions.

ABC · July 21, 2024, 5:30pm

@Abdess , @regis, @dave, what is the correct way to set the Django default charset to utf8mb4 in Palm?

In this thread @Abdess said to edit .local/lib/python3.x/site-packages/tutor/templates/apps/openedx/config/partials/auth.yml (which for me is python3.10).
However

I have a beta site that’s on Palm where apparently I never did that AFAICT (it still has a charset: "utf8mb3" line though, and I’m not sure if that’s default or not), but everything working for emojis
I have a production site that’s on Palm where I did modify it to charset: "utf8mb4" but it’s still not working.

Also in this thread above @Abdess says to edit tutor/templates/apps/openedx/config/partials/auth.yml, but no auth.yml file exists for me in .local/share/tutor on Palm… (But maybe it existed on Nutmeg at the time my beta was on Nutmeg and then the configuration got moved to whereever was appropriate once it upgraded to Palm on the one system, but not the other which was coming from Lilac to Palm?)

So I’m wondering if I found some other way to set it, and now I’m just not able to re-find it… Because I think the error message over on my not working thread is indicative of Django erroring out due to not understanding utf8mb4.

(I would also note that on my working beta server I don’t see charset: "utf8mb4" under [‘DATABASES’][‘default’][‘OPTIONS’] within .local/share/tutor/env/apps/openedx/config/lms.env.yml

    OPTIONS:
      init_command: "SET sql_mode='STRICT_TRANS_TABLES'

and I went so far as to remove the if RUN_MYSQL conditional in .local/lib/python3.10/site-packages/tutor/templates/apps/openedx/config/partials/auth.yml on my non-working server, to try and force it in (and indeed then it showed up in .local/share/tutor/env/apps/openedx/config/lms.env.yml, but that still didn’t fix it )

Topic		Replies	Views
Palm upgrade, running RDBMS in Tutor, MySQL vs. MariaDB, utf8 vs. utf8mb4 Site Operators palm	11	723	September 6, 2023
Anybody running MariaDB? Site Operators	3	283	May 18, 2023
Data Working Group: 2022-01-27 Meeting & MySQL Upgrade Path Proposal Data	0	420	January 27, 2022
Emoji in subsection title not supported since Nutmeg Tutor Help nutmeg	5	1082	August 6, 2024
Deprecation/Removal: translation of legacy attributes during course import/export Deprecation	1	361	April 21, 2022

Upgrading MySQL charset to utf8mb4

Related topics