How to fully support MySQL UTF-8 with RDS and Doctrine
If you’re reading this it’s probably because you’ve realised that MySQL’s
utf8
encoding isn’t really UTF-8, despite what it says on the packet. The
reason behind is because UTF-8 stores characters with anywhere between 1 and 4
bytes, depending on the character. MySQL’s bodged up form of UTF-8 can only use
1 to 3 characters. There’s a bunch more detail about this which you can read
about in the links posted at the end.
However, as it turns out MySQL released a new encoding utf8mb4
which supports
full UTF-8, however there are a bunch of settings you have to change in order
to use it. In all of my projects I use Symfony Doctrine and MySQL, in production
MySQL is running on AWS RDS, and in dev it’s a MySQL installation on a vagrant
box. I need to change settings in all three areas to utilise utf8mb4
.
If you have an existing database that you want to change to utf8mb4
check
out this post, you need to
manually change the encoding and collation settings for each table in every
schema, and then the schema itself.
Doctrine
You need to tell doctrine to create the database and tables with the utf8mb4
encoding:
# app/config/config.yml (symfony < 4)
# config/packages/doctrine.yaml symfony > 4)
doctrine:
dbal:
...
charset: utf8mb4
default_table_options:
charset: utf8mb4
collate: utf8mb4_unicode_ci
...
UPDATE: If you have multiple database connections, you’ll have to specify
both charset
and default_tables_options
for each one:
# app/config/config.yml (symfony < 4)
# config/packages/doctrine.yaml (symfony > 4)
doctrine:
dbal:
default_connection: default
default:
...
charset: utf8mb4
default_table_options:
charset: utf8mb4
collate: utf8mb4_unicode_ci
...
other:
...
charset: utf8mb4
default_table_options:
charset: utf8mb4
collate: utf8mb4_unicode_ci
...
MySQL
With a local installation (or a self managed instance) of MySQL, you need to
edit the my.cnf
file. You just add to the default /etc/my.cnf
, but I find
it easier to just create a new file /etc/mysq/conf.d/mysql_encoding.cnf
:
[client]
# If a client doesn't specify an encoding, use utf8mb4
default-character-set = utf8mb4
[mysql]
# Create tables with utf8mb4 encoding by default
default-character-set = utf8mb4
[mysqld]
# If a client specifies an encoding, ignore it and use the default
character-set-client-handshake = FALSE
# Use utf8mb4 enocding
character-set-server = utf8mb4
# Compare characters with utf8mb4
collation-server = utf8mb4_unicode_ci
I use the ansible role
geerlingguy.mysql and
include the above file in the mysql_config_include_files
array when I
provision vagrant boxes.
RDS
For RDS you need to create a new DB parameter group and add the following parameters to it:
character_set_client: utf8mb4
character_set_database: utf8mb4
character_set_results: utf8mb4
character_set_connection: utf8mb4
collation_connection: utf8mb4_unicode_ci
collation_server: utf8mb4_unicode_ci
character_set_server: utf8mb4
I like to do this with Terraform:
resource "aws_db_parameter_group" "utf8mb4" {
name = "utf8mb4"
family = "mysql5.7"
description = "enable 'real' utf8 (utf8mb4)"
parameter {
name = "character_set_client"
value = "utf8mb4"
}
parameter {
name = "character_set_database"
value = "utf8mb4"
}
parameter {
name = "character_set_connection"
value = "utf8mb4"
}
parameter {
name = "character_set_server"
value = "utf8mb4"
}
parameter {
name = "character_set_results"
value = "utf8mb4"
}
parameter {
name = "collation_server"
value = "utf8mb4_unicode_ci"
}
parameter {
name = "collation_connection"
value = "utf8mb4_unicode_ci"
}
}
resource "aws_db_instance" "primary" {
...
parameter_group_name = "utf8mb4"
...
}
Testing the encoding
Once you’ve connected to your RDS or MySQL instance, you should test the encoding:
mysql> SHOW VARIABLES WHERE Variable_name LIKE 'character%' OR Variable_name LIKE 'collation%';
+--------------------------+--------------------+
| Variable_name | Value |
+--------------------------+--------------------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8mb4 |
| character_set_system | utf8 |
| collation_connection | utf8mb4_unicode_ci |
| collation_database | utf8mb4_unicode_ci |
| collation_server | utf8mb4_unicode_ci |
+--------------------------+--------------------+
Don’t worry about the charater_set_filesystem
or character_set_system
, they
cannot be changed, but don’t break anything.
UPDATE
To check that your tables are all set correctly, you can run the follwing:
SELECT
T.TABLE_NAME,
CCSA.CHARACTER_SET_NAME,
CCSA.COLLATION_NAME
FROM
information_schema.`TABLES` T,
information_schema.`COLLATION_CHARACTER_SET_APPLICABILITY` CCSA
WHERE
CCSA.COLLATION_NAME = T.TABLE_COLLATIOn
AND T.TABLE_SCHEMA = "db_name"
;
To check that your columns set correctly, run the following:
SELECT
TABLE_NAME,
COLUMN_NAME,
CHARACTER_SET_NAME,
COLLATION_NAME
from
INFORMATION_SCHEMA.columns
where
TABLE_SCHEMA = "db_name"
;