While this guide walks through and explains the processes involved with inspecting and changing character encoding within a MariaDB database, a basic familiarity with database management and character encoding is assumed and recommended
Introduction
Many Specify databases were created some time ago, using relatively old versions of MySQL or MariaDB.
These older versions of MySQL and MariaDB commonly defaulted to character sets and collations that have limited support for some characters with diacritics, non-western languages, and general symbols.
This can result in databases that do not support characters that commonly appear in names for people and places, different characters in non-western languages, etc.
What is a Character Set and Collation?
- Character Set: Similar to character encoding, the character set defines which characters/symbols are supported and maps each supported character to its representation in bytes.
- Collation: Each character set can have one or more collations, which define how characters/symbols are compared and sorted. i.e., the collation defines the rules that state whether
a = A,a < b, etc.
Modern applications and their users, including Specify databases, should strongly prefer a character set like utf8mb4, which fully supports the Unicode standard.
Review the MariaDB docs for additional information:
Viewing Current Character Set and Collation
Connection/Client Level
The connection character set and collation determine how data is interpreted between the client and server.
Specifically, MariaDB will assume text coming from the client is encoded with the Connection/Client level character set and convert characters to their correct character set as determined by the database, table, column, etc. when needed.
Specify 7 defaults to the utf8mb3 character set for its connections, which is the default used by 4.2 Django.
Below is a query that can be used to view the current character set and collation at the connection/client level:
SELECT VARIABLE_NAME, SESSION_VALUE
FROM INFORMATION_SCHEMA.SYSTEM_VARIABLES WHERE
VARIABLE_NAME LIKE 'character_set_con%' OR
VARIABLE_NAME LIKE 'character_set_cl%' OR
VARIABLE_NAME LIKE 'character_set_re%' OR
VARIABLE_NAME LIKE 'collation_c%';
Example results:
+--------------------------+-----------------------+
| VARIABLE_NAME | SESSION_VALUE |
+--------------------------+-----------------------+
| CHARACTER_SET_CONNECTION | utf8mb4 |
| CHARACTER_SET_CLIENT | utf8mb4 |
| COLLATION_CONNECTION | utf8mb4_uca1400_ai_ci |
| CHARACTER_SET_RESULTS | utf8mb4 |
+--------------------------+-----------------------+
See the MariaDB docs on:
System Level
By default, MariaDB will use the system character set and collation.
MariaDB versions less than 11.6 will commonly default to the latin1 character set and latin1_swedish_ci collation.
MariaDB versions >= 11.6 will commonly default to the utf8mb4 character set and utf8mb4_uca1400_ai_ci collation.
Some distributions of MariaDB will use other default character sets and collations. See the MariaDB documentation:
The following query can be used to retrieve the system level character set
SELECT VARIABLE_NAME,
VARIABLE_VALUE
FROM INFORMATION_SCHEMA.GLOBAL_VARIABLES
WHERE VARIABLE_NAME LIKE 'character_set_sys%';
Example result:
+----------------------+----------------+
| VARIABLE_NAME | VARIABLE_VALUE |
+----------------------+----------------+
| CHARACTER_SET_SYSTEM | utf8mb3 |
+----------------------+----------------+
Server Level
The character set and collation can be defined at the server level.
The server-level character set and collation defines the default character set and collation used when creating new databases.
If the server-level character set and collation is unset, MariaDB will fallback to the system character set and its default collation.
Below is a query that can be used to view the server level character set and collation:
SELECT VARIABLE_NAME, SESSION_VALUE
FROM INFORMATION_SCHEMA.SYSTEM_VARIABLES WHERE
VARIABLE_NAME LIKE 'character_set_ser%' OR
VARIABLE_NAME LIKE 'collation_s%';
Example results:
+----------------------+--------------------+
| VARIABLE_NAME | SESSION_VALUE |
+----------------------+--------------------+
| CHARACTER_SET_SERVER | utf8mb4 |
| COLLATION_SERVER | utf8mb4_general_ci |
+----------------------+--------------------+
Database Level
Each database can define its own default character set and collation. If not specified within the CREATE DATABASE statement or later changed with an ALTER DATABASE statement, then the database will assume the character set and collation defined at the server level.
The database character set and collation defines the default character set and collation for tables created within the database.
The following query can be used to retrieve the character set and collation for each database in the database manager.
SELECT SCHEMA_NAME,
DEFAULT_CHARACTER_SET_NAME,
DEFAULT_COLLATION_NAME
FROM INFORMATION_SCHEMA.SCHEMATA
WHERE SCHEMA_NAME NOT IN (
'sys',
'information_schema',
'performance_schema',
'mysql',
'innodb',
'tmp'
);
Example results:
+--------------+----------------------------+------------------------+
| SCHEMA_NAME | DEFAULT_CHARACTER_SET_NAME | DEFAULT_COLLATION_NAME |
+--------------+----------------------------+------------------------+
| db3 | utf8mb4 | utf8mb4_general_ci |
| db2 | utf8mb4 | utf8mb4_uca1400_ai_ci |
| db1 | latin1 | latin1_swedish_ci |
| specify | utf8mb4 | utf8mb4_general_ci |
+--------------+----------------------------+------------------------+
Reminder that even if a database does not explicitly define a character set or collation, it will use the defaults defined at the server level.
Table Level
Each table within a database can define its own collation (and character set implicitly). Each text-based column defined on a table will use the collation of the table (unless the collation is overridden at a lower level).
If a table does not explicitly define its own character set and collation, it will assume the collation of the database.
The following query can be used to retrieve the collation (and its character set) of each database in MariaDB:
SELECT tbls.TABLE_SCHEMA,
tbls.TABLE_NAME,
cols.CHARACTER_SET_NAME,
tbls.TABLE_COLLATION
FROM INFORMATION_SCHEMA.TABLES tbls
JOIN INFORMATION_SCHEMA.COLLATIONS cols ON cols.COLLATION_NAME = tbls.TABLE_COLLATION
WHERE TABLE_SCHEMA NOT IN (
'sys',
'information_schema',
'performance_schema',
'mysql',
'innodb',
'tmp'
)
ORDER BY tbls.TABLE_SCHEMA,
tbls.TABLE_NAME;
Example results (truncated for brevity):
+--------------+----------------------------------+--------------------+--------------------+
| TABLE_SCHEMA | TABLE_NAME | CHARACTER_SET_NAME | TABLE_COLLATION |
+--------------+----------------------------------+--------------------+--------------------+
| specify | absoluteage | utf8mb4 | utf8mb4_general_ci |
| specify | absoluteageattachment | utf8mb4 | utf8mb4_general_ci |
| specify | absoluteagecitation | utf8mb4 | utf8mb4_general_ci |
| specify | accession | utf8mb3 | utf8mb3_general_ci |
| specify | attachmentdataset | utf8mb3 | utf8mb3_unicode_ci |
| specify | attachmentimageattribute | utf8mb3 | utf8mb3_general_ci |
| specify | attachmentmetadata | utf8mb3 | utf8mb3_general_ci |
+--------------+----------------------------------+--------------------+--------------------+
Column Level
Each text column (CHAR, VARCHAR, TEXT, etc.) can define its own character set and collation.
If the character set and collation is not explicitly defined on a column, it will assume the character set and collation defined at the table level.
The following query can be used to return all tables in database manager:
SELECT tbs.TABLE_SCHEMA,
tbs.TABLE_NAME,
tbs.TABLE_COLLATION,
cols.COLUMN_NAME,
cols.COLUMN_TYPE,
cols.CHARACTER_SET_NAME,
cols.COLLATION_NAME
FROM information_schema.tables tbs
JOIN information_schema.columns cols ON tbs.TABLE_NAME = cols.TABLE_NAME
AND cols.TABLE_SCHEMA = tbs.TABLE_SCHEMA
WHERE tbs.TABLE_SCHEMA NOT IN (
'sys',
'information_schema',
'performance_schema',
'mysql',
'innodb',
'tmp'
)
AND cols.CHARACTER_SET_NAME IS NOT NULL
ORDER BY tbs.TABLE_SCHEMA,
tbs.TABLE_NAME,
cols.COLUMN_NAME;
Example results (truncated for brevity):
+--------------+----------------------------------+-----------------------+--------------------------------------+---------------+--------------------+-----------------------+
| TABLE_SCHEMA | TABLE_NAME | TABLE_COLLATION | COLUMN_NAME | COLUMN_TYPE | CHARACTER_SET_NAME | COLLATION_NAME |
+--------------+----------------------------------+-----------------------+--------------------------------------+---------------+--------------------+-----------------------+
| specify | collectingtripattribute | utf8mb4_uca1400_ai_ci | Text5 | varchar(100) | utf8mb4 | utf8mb4_uca1400_ai_ci |
| specify | collectingtripattribute | utf8mb4_uca1400_ai_ci | Text6 | varchar(50) | utf8mb4 | utf8mb4_uca1400_ai_ci |
| specify | collectingtripattribute | utf8mb4_uca1400_ai_ci | Text7 | varchar(50) | utf8mb4 | utf8mb4_uca1400_ai_ci |
| specify | collectingtripattribute | utf8mb4_uca1400_ai_ci | Text8 | varchar(50) | utf8mb4 | utf8mb4_uca1400_ai_ci |
| specify | collectingtripattribute | utf8mb4_uca1400_ai_ci | Text9 | varchar(50) | utf8mb4 | utf8mb4_uca1400_ai_ci |
| specify | collectingtripauthorization | utf8mb4_uca1400_ai_ci | Remarks | mediumtext | utf8mb4 | utf8mb4_uca1400_ai_ci |
| specify | collection | utf8mb4_uca1400_ai_ci | CatalogFormatNumName | varchar(64) | utf8mb4 | utf8mb4_uca1400_ai_ci |
| specify | collection | utf8mb4_uca1400_ai_ci | Code | varchar(50) | utf8mb4 | utf8mb4_uca1400_ai_ci |
| specify | collection | utf8mb4_uca1400_ai_ci | CollectionName | varchar(50) | utf8mb4 | utf8mb4_uca1400_ai_ci |
| specify | collection | utf8mb4_uca1400_ai_ci | CollectionType | varchar(32) | utf8mb4 | utf8mb4_uca1400_ai_ci |
| specify | collection | utf8mb4_uca1400_ai_ci | DbContentVersion | varchar(32) | utf8mb4 | utf8mb4_uca1400_ai_ci |
| specify | collection | utf8mb4_uca1400_ai_ci | Description | mediumtext | utf8mb4 | utf8mb4_uca1400_ai_ci |
| specify | collection | utf8mb4_uca1400_ai_ci | DevelopmentStatus | varchar(32) | utf8mb4 | utf8mb4_uca1400_ai_ci |
+--------------+----------------------------------+-----------------------+--------------------------------------+---------------+--------------------+-----------------------+
Changing Character Set and Collation
Important: Always take a full backup before making these changes.
These operations can be destructive if misapplied or interrupted.
Table level:
Connection Level
The SET NAMES command can be used to alter the character-set_client, character_set_connection, character_set_results, and collation_connection system variables to tell MariaDB that textual data sent to the server will be with a specific character set/collation.
Below are two example queries which set the connection character set and collation:
SET NAMES utf8mb4;
SET NAMES utf8mb4 COLLATE utf8mb4_uca1400_ai_ci;
Once the connection level character set and collation is set, MariaDB will assume strings sent to the server are encoded with the defined character set.
When necessary, MariaDB will handle converting characters sent with the connection character set to the effective character set of the column.
See the MariaDB docs for more information:
Server Level
The server level character set and collation can be configured via the character_set_server and collation_server, options respectively.
For example:
-- Set the character set and collation for the current connection
SET character_set_server = 'utf8mb4';
SET collation_server = 'utf8mb4_uca1400_ai_ci';
-- Set the character set and collation globally, meaning all new connections
-- will use these values
SET GLOBAL character_set_server = 'utf8mb4';
SET GLOBAL collation_server = 'utf8mb4_uca1400_ai_ci';
See the MariaDB documentation on setting character sets at the server level:
The settings can be persisted by saving them to a MariaDB/MySQL configuration file that is read on service start/restart.
Consider the following sample of a configuration file that sets the server character set and collation:
[mysqld]
character-set-server = utf8mb4
collation-server = utf8mb4_uca1400_ai_ci
For more information about option files, read the following MariaDB documentation:
Database Level
As the Database character set and collation is used when creating new tables, changing the character set and collation at the database level does not change the character set and collation for existing tables and columns.
Existing tables and columns need to be migrated and converted separately for a complete migration
The character set and collation can be configured for an individual database via an ALTER DATABASE statement.
Below are all examples of setting the default character set and collation on a specify database:
ALTER DATABASE `specify` CHARACTER SET=utf8mb4 COLLATE=utf8mb4_uca1400_ai_ci;
-- If the collation is omitted, the default collation for the character set is
-- inferred
-- defaults to utf8mb4_uca1400_ai_ci, as that is the default collation for utf8mb4
ALTER DATABASE `specify` CHARACTER SET=utf8mb4;
-- If the character set is omitted but collation provided, MariaDB sets the
-- character set to the character set of the collation
-- Sets the character set to latin1, as that is the character set of the
-- latin1_swedish_ci collation
ALTER DATABASE `specify` COLLATE=latin1_swedish_ci;
For more information, view the MariaDB documentation on setting the character set and collation at the database level:
Additionally, review the ALTER DATABASE documentation:
If there are multiple databases that need converting on the same instance of MariaDB, the following query can be used and modified to generate the SQL queries for each:
SELECT CONCAT(
'ALTER DATABASE `',
SCHEMA_NAME,
'` CHARACTER SET=utf8mb4 COLLATE=utf8mb4_uca1400_ai_ci;'
) AS 'Alter Database Statements'
FROM INFORMATION_SCHEMA.SCHEMATA
WHERE SCHEMA_NAME NOT IN (
'sys',
'information_schema',
'performance_schema',
'mysql',
'innodb',
'tmp'
);
Example results:
+------------------------------------------------------------------------------------+
| Alter Database Statements |
+------------------------------------------------------------------------------------+
| ALTER DATABASE `db1` CHARACTER SET=utf8mb4 COLLATE=utf8mb4_uca1400_ai_ci; |
| ALTER DATABASE `db2` CHARACTER SET=utf8mb4 COLLATE=utf8mb4_uca1400_ai_ci; |
| ALTER DATABASE `specify` CHARACTER SET=utf8mb4 COLLATE=utf8mb4_uca1400_ai_ci; |
+------------------------------------------------------------------------------------+
Table Level
The character set and collation for an existing table can be changed with an ALTER TABLE statement.
The ALTER TABLE statement will convert all columns in the table that do not have an explicit character set and collation defined on the column level.
The following SQL statement converts an agent table in the specify database to the utf8mb4 character set and utf8mb4_uca1400_ai_ci collation:
ALTER TABLE `specify`.`agent` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_ai_ci;
The below SQL statement can be used and modified as desired to generate SQL statement to convert all tables in multiple databases:
SELECT CONCAT(
'ALTER TABLE `',
TABLE_SCHEMA,
'`.`',
TABLE_NAME,
'` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_ai_ci;'
) AS 'Alter Table Statements'
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA NOT IN (
'sys',
'information_schema',
'performance_schema',
'mysql',
'innodb',
'tmp'
)
ORDER BY TABLE_SCHEMA,
TABLE_NAME;
Example results (truncated for brevity):
+-------------------------------------------------------------------------------------------------------------------------------+
| Alter Table Statements |
+-------------------------------------------------------------------------------------------------------------------------------+
| ALTER TABLE `specify`.`agentattachment` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_ai_ci; |
| ALTER TABLE `specify`.`agentgeography` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_ai_ci; |
| ALTER TABLE `specify`.`agentidentifier` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_ai_ci; |
| ALTER TABLE `specify`.`agentspecialty` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_ai_ci; |
| ALTER TABLE `specify`.`agentvariant` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_ai_ci; |
| ALTER TABLE `specify`.`appraisal` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_ai_ci; |
| ALTER TABLE `specify`.`attachment` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_ai_ci; |
+-------------------------------------------------------------------------------------------------------------------------------+
Column Level
Nearly all columns in Specify utilize the table’s character set and collation, so altering the character set and collation at the table level is sufficient.
Some columns which explicitly diverge from the table’s default are expected to maintain the general sorting behavior of their current collation.
For example thespdataset.visualordercolumn by default has theutf8mb4_bincollation, which differs significantly from theutf8mb4_uca1400_ai_ciand otherutf8mb4collations
Be mindful to maintain or consider the collation’s sorting behavior when modifying the character set and collation at the column level.
The ALTER TABLE statement can be used to explicitly define a
The following SQL query modifies the agent table in the specify database to use the utf8mb4 character set and utf8mb4_uca1400_ai_ci collation:
ALTER TABLE `specify`.`agent` MODIFY `LastName` varchar(256) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_ai_ci;
For more information, view the MariaDB documentation on setting character set and collation at the column level: