)
⚈ it could even include providing alternate spellings or synonyms
canonical would also have an index, possibly FULLTEXT and/or utf8mb4_bin.
canonical may have duplicates even though exact probably does not.
Then, when the 'user' searches, you do the same folding to "canonical" before searching canonical.
Display from exact.
LOAD DATA INFILE
In LOAD DATA INFILE, the manual points out:
"The server uses the character set indicated by the character_set_database system variable
to interpret the information in the file. SET NAMES and the setting of character_set_client
do not affect interpretation of input. If the contents of the input file use a character set
that differs from the default, it is usually preferable to specify the character set
of the file by using the CHARACTER SET clause. A character set of binary specifies 'no conversion'."
Example:
LOAD DATA LOCAL INFILE '...'
INTO TABLE tbl
CHARACTER SET 'utf8mb4'
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
ESCAPED BY '\\'
IGNORE 1 ROWS;
Characters in Western Europe
Taken from Wikipedia
å, ä and ö in Finnish and Swedish
à, ç, è, é, ï, í, ò, ó, ú, ü in Catalan
å, æ and ø in Norwegian and Danish
á, é, ó, ý, è, ë, ï in Dutch
ä, ö, ü and ß in German
á, ð, í, ó, ú, ý, æ and ø in Faroese
á, ð, é, í, ó, ú, ý, þ, æ and ö in Icelandic
à, â, ç, è, é, ë, ê, ï, î, ö, ô, ù, û, ÿ, æ, œ in French
à, è, é, ì, ò, ù in Italian
á, é, í, ñ, ó, ú, ï, ü, ¡, ¿ in Spanish
à, á, â, ã, ç, é, ê, í, ó, ô, õ, ú in Portuguese (ü no longer used)
á, é, í, ó, ú in Irish
£ in British English
Turkish
The columns: utf8, unicode codepoint decimal, hex, HTML entity, 'name':
C387 0199 00C7 Ç latin capital letter c with cedilla
C396 0214 00D6 Ö latin capital letter o with diaeresis
C39C 0220 00DC Ü latin capital letter u with diaeresis
C3A7 0231 00E7 ç latin small letter c with cedilla
C3B6 0246 00F6 ö latin small letter o with diaeresis
C3BC 0252 00FC ü latin small letter u with diaeresis
C49E 0286 011E Ğ latin capital letter g with breve
C49F 0287 011F ğ latin small letter g with breve
C4B0 0304 0130 İ latin capital letter i with dot above
C4B1 0305 0131 ı latin small letter dotless i
C59E 0350 015E Ş latin capital letter s with cedilla
C59F 0351 015F ş latin small letter s with cedilla
E282A4 8356 20A4 ₤ lira sign
E282BA 8378 20BA ₺ Turkish lira sign
Don't expect latin1 to work well:
CREATE TABLE `se96265` (
`content` mediumtext NOT NULL,
FULLTEXT KEY `content` (`content`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
SET NAMES utf8;
-- Turkish word for Croatia, second char is \xC4\xB1
INSERT INTO `se96265` (`content`) VALUES ('Hırvatistan');
SELECT content, HEX(content), HEX('Hırvatistan') FROM se96265\G
*************************** 1. row ***************************
content: H?rvatistan
HEX(content): 483F72766174697374616E
HEX('Hırvatistan'): 48C4B172766174697374616E
Note how the 'C4B1' turned into '?'
To get an error:
mysql> SET SESSION sql_mode = 'STRICT_ALL_TABLES'; -- So the INSERT will cause an error:
mysql> INSERT INTO `se96265` (`content`) VALUES ('Hırvatistan');
ERROR 1366 (HY000): Incorrect string value: '\xC4\xB1rvat...' for column 'content' at row 1
(Note: This technique for getting an error message is not specific to Turkish.)
ISO/IEC 8859-9 = latin5 = Turkish
If you have Turkish text in a latin1 column/table, you should convert to latin5 in order to correctly handle
the 6 characters that are different (latin1: ÐÝÞðýþ --> latin5 ĞİŞğış). See
Stackoverflow
Arabic/Farsi
Arabic characters 2 bytes in utf8: Dxyy. For example, Arabic Letter 'HAH' is D8AD.
To get diacritic insensitive comparisons, use utf8_unicode_ci:
SELECT 'بِسْمِ' = 'بسم' COLLATE utf8_unicode_ci;
The result I got back was 1 (true), meaning they are considered equal.
With utf8_general_ci it came back with 0, meaning not equal.
(Of course, you should declare your column(s) as utf8_unicode_ci rather than using the COLLATE clause.)
Stored Farsi, got '???'
Urdu case folding
Hebrew
To search for a particular character in a string, LIKE or REGEXP will work.
To search for "a cell being entirely Hebrew" is not easy.
Essentially, you look for the utf8 encoding, then build a REGEXP to accomodate it.
For example, to see which rows have a Hebrew name (in utf8):
WHERE HEX(name) REGEXP '^(D7..)+$'
See forum discussion
This assumes (not quite perfectly) that all Hebrew encodings are D7xx.
This can be adapted to other languages if the encoding is not to complex.
Chinese, for example, is complex:
WHERE HEX(name) REGEXP '^(E.....)+$'
⚈ Chinese characters are 3 bytes (6 hex) long
⚈ This fails to handle the 4-byte Chinese characters
⚈ This includes Japanese, Cherokee, etc.
If you have the "hebrew" encoding in a latin1 column, you may need
a 3-step ALTER.
See stackoverflow
Storing Hebrew strings Ok, but gibberish when pulling data on another server
Cyrillic, Macedonian, etc
(Random notes)
Cyrillic in utf8 is encoded D0yy-D4yy and EA99yy-EA9Ayy, but perhaps only D0yy is in common use.
For Cyrillic: utf8_unicode_ci is fine for these languages: Russian, Bulgarian, Belarusian, Macedonian, Serbian, and Ukrainian. While utf8_general_ci is fine only for Russian and Bulgarian subset of Cyrillic. Extra letters used in Belarusian, Macedonian, Serbian, and Ukrainian are sorted not well.
Ќ is hex D08C in utf8 or utf8mb4. Cyrillic is completely covered by either CHARACTER SET. К is D09A.
Hmmm, this is quite interesting:
mysql> SELECT 'К' = 'Ќ' COLLATE utf8_bin AS bin,
'К' = 'Ќ' COLLATE utf8_general_ci AS general,
'К' = 'Ќ' COLLATE utf8_unicode_ci AS unicode;
+-----+---------+---------+
| bin | general | unicode |
+-----+---------+---------+
| 0 | 1 | 0 |
+-----+---------+---------+f
mysql> SELECT 'Г' = 'Ѓ' COLLATE utf8_bin AS bin,
'Г' = 'Ѓ' COLLATE utf8_general_ci AS general,
'Г' = 'Ѓ' COLLATE utf8_unicode_ci AS unicode;
+-----+---------+---------+
| bin | general | unicode |
+-----+---------+---------+
| 0 | 1 | 0 |
+-----+---------+---------+
And if it helps, here's the method from the alphabet helper. def alphabet cyrillic_alphabet = ["А", "Б", "В", "Г", "Д", "Ѓ", "Е", "Ж", "З", "Ѕ", "И", "Ј", "К", "Л", "Њ", "М", "Н", "Љ", "О", "П", "Р", "С","Т", "Ќ", "У", "Ф", "Х", "Ц", "Ч", "Џ", "Ш"] end – NewbieOnRails yesterday
These are letters in Macedonian alphabet, not on the Ubuntu Cyrillic keyboard.
⚈ D086 1030=x0406 [І] L CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
⚈ D087 1031=x0407 [Ї] L CYRILLIC CAPITAL LETTER YI
⚈ D08B 1035=x040B [Ћ] L CYRILLIC CAPITAL LETTER TSHE
⚈ D08E 1038=x040E [Ў] L CYRILLIC CAPITAL LETTER SHORT U
The bit patterns are different, so utf8_bin collates unequal. But usually whenever general is equal, so is unicode.
Cyrillic/Macedonian
Cyrillic example:
title: Добро пожаловать в Диксон,
hex: D094D0BED0B1D180D0BE20D0BFD0BED0B6D0B0D0BBD0BED0B2D0B0D182D18C20D0B220D094D0B8D0BAD181D0BED0BD,
length: 47,
char length: 25
Data entry of Cyrillic in Workbench yields "????"
Other Natural Languages
This is a list hits about natural languages. Most have not been verified.
Lithuanian ąĄ čČ ęĘ ėĖ įĮ ųŲ ūŪ are saved as ?, however, šŠ žŽ do get saved. The reason:
Column declared latin1
Vietnamese collations were added in MySQL 5.6.
em dash, dagger, and other special characters
mysql> SELECT ';' semicolon, UNHEX('cdbe') 'Greek question mark';
+-----------+---------------------+
| semicolon | Greek question mark |
+-----------+---------------------+
| ; | ; |
+-----------+---------------------+
cp1250 - Central Europe
For a Devanagari character range:
SELECT 'क़ज़फ़' REGEXP '^[क़-फ़]+$'; --> true -- But that is mere coincidence!
Stackoverflow thread
UTF8 Chinese String Comparison
REGEXP / RLIKE
Do not attempt to use most of the features of REGEXP to test for non-ascii characters; it will generally not work in utf8.
MySQL's REGEXP only works with bytes, not utf8 characters.
Greek alpha (α) is hex CEB1; omega (ω) is hex CF89
* Character sets don't work right: [αω] checks for 4 bytes: CE, B1, CF, 89, which has very little to do with checking for those two letters.
* Ranges also don't work right: [α-ω] checks for byte CE, range B1-CF, and byte 89.
* . is one byte, not one character.
* etc.
A few tricks will work:
* WHERE HEX(str) REGEXP '(..)*CEB1CF89' -- will check for 'αω' in the column; but you may as well do WHERE str LIKE '%αω%'.
* WHERE HEX(str) REGEXP '(..)*F' -- will check for whether the UTF-8 string needs utf8mb4. (But how did you get str stored in the first place??)
Python
1st or 2nd line in source code (to have literals in the code utf8-encoded):
# -*- coding: utf-8 -*-
Python code for dumping hex (etc) for string 'u':
for i, c in enumerate(u):
print i, '%04x' % ord(c), unicodedata.category(c),
print unicodedata.name(c)
Miscellany notes on coding for utf8:
⚈ db = MySQLdb.connect(host=DB_HOST, user=DB_USER, passwd=DB_PASS, db=DB_NAME, charset="utf8", use_unicode=True)
⚈ conn = MySQLdb.connect(host="localhost", user='root', password='', db='', charset='utf8')
⚈ config = 'OPTIONS': { ..., 'charset': 'utf8mb4', ... } -- in "settings.py"?
⚈ cursor.execute("SET NAMES utf8mb4;") -- not as good as using `charset'
⚈ db.set_character_set('utf8'), implies use_unicode=True
⚈ Literals should be u'...'
⚈ MySQL-python 1.2.4 fixes a bug wherein varchar(255) CHARACTER SET utf8 COLLATE utf8_bin is treated like a BLOB.
Checklist:
⚈ # -*- coding: utf-8 -*- -- (for literals in code)
⚈ charset='utf8' in connect() call -- Is that buried in bottle_mysql.Plugin? (Note: Try 'utf-8' and 'utf8')
⚈ Text encoded in utf8.
⚈ No need for encode() or decode() if you are willing to accept utf8 everywhere.
⚈ u'...' for literals
⚈ <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> near start of html page
⚈ <meta charset="utf-8" /> is a shorter version (if using only HTML5).
⚈ Content-Type: text/html; charset=UTF-8 (in HTTP response header)
⚈ header('Content-Type: text/html; charset=UTF-8'); (in PHP to get that response header)
⚈ CHARACTER SET utf8 COLLATE utf8_general_ci on column (or table) definition in MySQL.
⚈ utf8 all the way through
⚈ Use MySQL Connector/Python instead of pyodbc and MySQL Connector/ODBC
Side note: Python defaults to autocommit=off; this can be a rude surprise when your INSERTs vanish.
Showing hex: var.encode().hex()
References:
⚈ https://docs.python.org/2/howto/unicode.html#the-unicode-type
⚈ https://stackoverflow.com/questions/9154998/python-encoding-mysql
⚈ https://dev.mysql.com/doc/connector-python/en/connector-python-connectargs.html
⚈ https://stackoverflow.com/questions/34069603/python-pyodbc-weird-unicode-bytecode
How to store arabic text in mysql database using python?
The Python language environment officially only uses UCS-2 internally since version 2.0,
but the UTF-8 decoder to "Unicode" produces correct UTF-16. Since Python 2.2, "wide" builds
of Unicode are supported which use UTF-32 instead;[16] these are primarily used on Linux.
Python 3.3 no longer ever uses UTF-16, instead strings are stored in one of ASCII/Latin-1, UCS-2, or UTF-32,
depending on which code points are in the string, with a UTF-8 version also included so that
repeated conversions to UTF-8 are fast.
Python check for utf8 (versus utf8mb4)
The python package "MySQL-python" version needs to be at least 1.2.5 in order to handle utf8mb4.
import sys
reload(sys) # Reload does the trick!
sys.setdefaultencoding('UTF8')
Python 2.7 issues; improvements in Python 3
PHP
⚈ If using mysqli, do $mysqli_obj->set_charset('utf8mb4');
⚈ If using PDO do somethin like $db = new PDO('dblib:host=host;dbname=db;charset=utf8mb4', $user, $pwd);
⚈ Alternatively, execute SET NAMES utf8mb4
Any of these will say that the bytes in the client are UTF-8 encoded.
Conversion, if necessary, will occur between the client and the database if the column definition is something other than utf8mb4.
Be sure to have php-mbstring installed (see php.ini)
Random notes about PHP. (See also Best Practice, above.)
In php.ini: default_charset UTF-8
The default changed to UTF-8 in PHP 5.6.
To see what PHP has:
$hex = unpack('H*', $text);
echo implode('', $hex);
or simply echo bin2hex($text);
To get "U+" (unicode) notation, first fetch the data, then use PHP code
trim(str_replace('\u', ' U+', json_encode('كتب عربية')), '" ');
This will result in U+0643 U+062a U+0628 U+0639 U+0631 U+0628 U+064a U+0629.
PHP: Convert \unnnn to utf8:
$myr = '\u101c\u1031\u1011'; // Given this
$x = json_decode('"'.$myr.'"'); // Add quotes and decode
// $x is not string, like လေထ
If you use header(), do: header('Content-type: text/plain; charset=utf-8');
PHP: undo_magic_quotes_gpc($_POST);
BjyProfiler needs charset=utf8; it apparently overrides other PHP settings.
If using mysqli, character_set_name() (formerly spelled client_encoding()) does NOT mirror SET NAMES.
Other Computer Languages
This is a list of hints on how to establish utf8 for various computer languages. Most have not been verified.
ADO.NET
"charset=utf8"
Drupal, WordPress, and any other product that uses PHP:
Turn off mbstring.encoding_translation.
Probably everything to do with mbstring should be avoided.
AJAX
Apache servers are (were?) configured to serve files in ISO-8859-1 by default, so you need to add the following line to your .htaccess file:
AddDefaultCharset UTF-8
ASP, ASP.Net
<%Response.charset="utf-8"%>
Mathias Bynens
Connector/C
mysql_set_character_set('utf8mb4');
C#:
C# code to undo (while reading?) double-encoded strings
Similar code in SQL:
CONVERT(UNHEX(HEX(CONVERT(CONVERT(col USING utf8) USING latin1))), CHAR)
Put this in the connection string:
id=my_user;password=my_password;database=some_db123;charset=utf8;
Lao and C#
codeigniter:
application/config/config.php :
$config['charset'] = 'UTF-8';
application/config/database.php :
'char_set' => 'utf8',
'dbcollat' => 'utf8_general_ci',
httppost.setEntity(new UrlEncodedFormEntity(nameValuePairs, HTTP.UTF_8));
echo meta('Content-type', 'text/html; charset=utf-8', 'equiv');
In database.php, Set char_set to utf8mb4 and dbcollat to utf8mb4_unicode_ci or utf8_general_ci
Codeigniter 3.0 notes
ColdFusion
<cfprocessingdirective pageEncoding="utf-8">
<cffile
action="read"
file="#settings.csvfile#"
variable="autodata"
charset="utf-8">
Django
DATABASES = {
'default': {
'ENGINE': 'django.db.backends.mysql',
...
'OPTIONS': {
'charset': 'utf8mb4',
'use_unicode': True, },
},
}
my.cnf:
[mysqld]
character-set-server=utf8mb4
default-collation=utf8mb4_unicode_ci
[client]
default-character-set=utf8mb4
StackOverflow
client_encoding: 'UTF8'
Django utf8 usage
See also the Python notes, above.
Drupal
amending your settings.php database connection by adding:
$databases['default']['default'] = array(
'driver' => 'mysql',
'database' => 'databasename',
'username' => 'username',
'password' => 'password',
'host' => 'localhost',
'charset' => 'utf8mb4',
'collation' => 'utf8mb4_general_ci',
);
More discussion of utf8mb4
go-mysql
[username[:password]@][protocol[(address)]]/dbname?collation=utf8mb4_unicode_ci
https://github.com/go-sql-driver/mysql#charset
Grails
Grails and utf8mb4
Remove the "characterEncoding=UTF" part in the dataSource url.
Otherwise grails always forces MySQL to use UTF-8 instead of utf8mb4. E.g.
dataSource:
dbCreate: "update"
url: "jdbc:mysql://localhost:8889/mydbname?useUnicode=true&zeroDateTimeBehavior=convertToNull&autoReconnect=true"
Java
Java originally used UCS-2, and added UTF-16 supplementary character support in J2SE 5.0.
ucs2 versus utf16
⚈ What if you put a 5.1 slave off a 4.x master? (The German sharp-s could cause trouble.)
⚈ Replication type (SBR vs RBR)?
Add ?useUnicode=yes&characterEncoding=UTF-8 to the JDBC URL (or maybe it is =true and =utf8)
<%@ page language="java" pageEncoding="UTF-8"%>
<%@ page contentType="text/html;charset=UTF-8" %>
compileJava.options.encoding = 'UTF-8'
<form method="post" action="/your/url/" accept-charset="UTF-8">
To use 4-byte UTF8 with Connector/J configure the MySQL server with character_set_server=utf8mb4.
Connector/J will then use that setting as long as characterEncoding has not been set in the connection string.
This is equivalent to autodetection of the character set.
Jetty
jetty
Java Servlet
resource.setContentType ("text/html;charset=utf-8");
JSP
<%@ page contentType="text/html; charset=UTF-8" %>
Laravel
If you have trouble with 191, etc in Laravel, you have these choices
⚈ 255 -> 191 (but make sure you don't currently have longer addresses)
⚈ utf8mb4 -> utf8 (Caution: This disallows Emoji and some Chinese characters) See config/database.php
⚈ INDEX(email(20)) (If it is UNIQUE or PRIMARY KEY, do not pick this option)
⚈ It is possible to reconfigure (if after 5.6.3) the server and table to allow bigger indexes; this is not the 'default' until 5.7.7.
Stackoverflow
In the file config/database.php: 'mysql' => [..., 'charset' => 'utf8mb4', 'collation' => 'utf8mb4_unicode_ci', ...]
Navicat
Old versions of Navicat have an option for utf8, but not for utf8mb4. So, if you need utf8mb4,...
In the encoding options for navicat connections, pick "use mysql encoding", not "utf8".
Forum
NET
Server=myServerAddress;Database=myDataBase;Uid=myUsername;Pwd=myPassword;CharSet=utf8;
node.js
var connection = mysql.createConnection({ ... , charset : 'utf8mb4'});
stackoverflow
OrmLite
ConnectionReset=False;
Powershell ISE and Powershell Console:
$OutputEncoding = New-Object -typename System.Text.UTF8Encoding
library(RODBC)
imp <- odbcConnect("SERVER", uid = "***", pwd="***",DBMSencoding="utf8")
ODBC 3.5.1 probably won't handle utf8; either of these may be correct:
ODBC;DRIVER={MySQL ODBC 5.1 Driver};UID=...;PWD=...;PORT=3306;DATABASE=...;SERVER=127.0.0.1;CHARSET=UTF8;
strConnection="driver={MySQL ODBC 5.1 Driver}; server=...;uid=...;pwd=...;database=...;stmt=SET NAMES utf8;"
Unicode is not supported in ADO applications (including ASP) with the Connector/ODBC 3.51 driver. Please consider testing MySQL Connector/ODBC 5.1 which includes Unicode support for ASP.
Panda:
Stackoverflow on Panda
See also notes on Python.
Perl:
use utf8;
use open ':std', ':encoding(UTF-8)';
my $dbh = DBI->connect("dbi:mysql:".$dsn, $user, $password, {
PrintError => 0,
RaiseError => 1,
mysql_enable_utf8 => 1, # Switch to UTF-8 for communication and decode.
});
# or {mysql_enable_utf8mb4 => 1} if using utf8mb4
Stackoverflow - double encoding and JSON
PHP
see above
pymysql:
import pymysql
con = pymysql.connect(host='127.0.0.1', port=3306,
user='root', passwd='******',
charset="utf8mb4")
Python
see above
R / RStudio
Tool -> Global Options -> Code -> Saving and put UTF-8
rs <- dbSendQuery(con, 'set character set "utf8"')
rs <- dbSendQuery(con, 'SET NAMES utf8')
Rails:
Rails and utf8mb4
Ruby on Rails:
"\xD0\x9A\xD0\xB0\xD1\x81\xD0\xBB".force_encoding("UTF-8") => "Щасл"
Watch out for rake db:reset .
Consider, instead, rake db:drop && rake db:create && rake db:migrate .
The reset just created the tables from the already stored schema which
had the wrong charset and collation. But re-creating them from the
migrations created them with the new charset and collation.
RoR: application_controller.rb
def configure_charsets
response.headers["Content-Type"] = "text/html; charset=utf-8"
suppress(ActiveRecord::StatementInvalid) do
ActiveRecord::Base.connection.execute 'SET NAMES UTF8'
end
end
Edit your database.yml with below code line
encoding: utf8
RocksDB, MyRocks
not yet answered
ServiceStack
ConnectionReset=False;
Spring/Hibernate: (See also Best Practice, above.)
Hibernate XML:
<property name="hibernate.connection.CharSet">utf8mb4</property>
<property name="hibernate.connection.characterEncoding">utf8</property>
<property name="hibernate.connection.useUnicode">true</property>
Connection url:
db.url=jdbc:mysql://localhost:3306/db_name?useUnicode=true&character_set_server=utf8mb4
The above changes were enough for me to upgrade from utf8 to utf8mb4 charset scheme.
As a side note I would like to make one clarification that UTF-8 is the character encoding while utf8mb4 is a character set that MySQL supports.
MySQL's utf8mb4 is a superset to MySQL's utf8.
Spring/Hibernate filter:
<form accept-charset="UTF-8">
Spring/Hibernate:
<property name="url" value="jdbc:mysql://localhost:3306/miniprojetjee?useUnicode=true&connectionCollation=utf8_general_ci&characterSetResults=utf8&characterEncoding=utf-8"/>
(or maybe it is =yes)
"Spring": @RequestMapping(value = "/getRegion2", produces={"application/json; charset=UTF-8"},method = RequestMethod.GET)
Hibernate manual
Spring MVC UTF-8 Encoding
encoding-filter
org.springframework.web.filter.CharacterEncodingFilter
encoding
UTF-8
forceEncoding
true
Spring Boot
Make sure you register Spring's CharacterEncodingFilter in your web.xml
sqlalchemy:
db_url = sqlalchemy.engine.url.URL(drivername='mysql', host=foo.db_host,
database=db_schema,
query={ 'read_default_file' : foo.db_config, 'charset': 'utf8mb4' })
MySQL-Unicode in sqlalchemy
sqoop
sqoop import --connect jdbc:mysql://server.foo.com/db --table bar --direct -- --default-character-set=latin1
(Probably the latin1 should be replaced by the encoding you have.)
Tomcat
tomcat
Wildfly: Configuration -> Subsystems -> Undertow -> Servlet/JSP -> View ->
Set "User listener encoding" to true.
WordPress: in wp-config.php, define('DB_CHARSET', 'utf8mb4'); and define('DB_COLLATE', '');
WordPress & utf8mb4
vb.net / VBScript:
Dim MyConn As New MySqlConnection("Server=localhost;User Id=u;Password=x;Database=d;Charset=utf8")
<%@language="VBSCRIPT" CODEPAGE="65001" LCID=1033%>
web2py
db = DAL('mysql://username:password@localhost/test?set_encoding=utf8mb4')
web.xml
web.xml
MySQL 8.0
(Source)
Collations to add:
We are going to add following collations. The collation name contains:
a. character set name: "utf8mb4"
b. language's iso code: for example, "cs" for Czech
c. UCA version: "800"
d. accent / case insensitive: "ai_ci"
The complete list is:
Collation name language
------------------------------------------------------------
utf8mb4_cs_800_ai_ci Czech
utf8mb4_da_800_ai_ci Danish
utf8mb4_de_phonebook_800_ai_ci German (phonebook order)
utf8mb4_eo_800_ai_ci Esperanto
utf8mb4_es_800_ai_ci Spanish
utf8mb4_es_traditional_800_ai_ci Spanish (traditional)
utf8mb4_et_800_ai_ci Estonian
utf8mb4_hr_800_ai_ci Croatian
utf8mb4_hu_800_ai_ci Hungarian
utf8mb4_is_800_ai_ci Icelandic
utf8mb4_la_800_ai_ci Roman (classical Latin)
utf8mb4_lt_800_ai_ci Lithuanian
utf8mb4_lv_800_ai_ci Latvian
utf8mb4_pl_800_ai_ci Polish
utf8mb4_ro_800_ai_ci Romanian
utf8mb4_sk_800_ai_ci Slovak
utf8mb4_sl_800_ai_ci Slovenian
utf8mb4_sv_800_ai_ci Swedish
utf8mb4_tr_800_ai_ci Turkish
utf8mb4_vi_800_ai_ci Vietnamese
Random notes
UTF-8 all the way through
Scan for 8-bit codes, find and decipher double-encoding
A table declared to be latin1, and containing latin1 bytes
can be converted to utf8 via
ALTER TABLE tbl CONVERT TO CHARACTER SET utf8;
Example from forums
Using Replication to assist in charset conversion?
utf8_general_ci, which will treat Résumé the same as resume
MYSQL doesn't 'see' accents
Exact match when _ci collation is on column - even using index
WHERE keyword = CONVERT('cho' USING binary)
Re: constants in SQL:
EXPLAIN EXTENDED SELECT ...;
SHOW WARNINGS;
sometimes gives extra clues into what charset/collation issues are being
faced during a SELECT. It will probably contain 'introducers'; example: _utf8'abc'.
To ponder:
convert(convert((binary col1) using latin1) using utf8)
phpMyAdmin might be connecting as "root", thereby bypassing
init_connect = "SET NAMES utf8". This could
defeat your attempts to be fully utf8.
Possible workaround: Use a login that avoids SUPER priv.
(This suggestion is good advice for all applications -- for security reasons.)
# Find rows with any 8-bit chars:
SELECT * FROM tbl WHERE HEX(colname) RLIKE '^(..)*[8-F].';
# Remove all occurrences of one 8-bit code (hex 92, in this example):
UPDATE tbl SET colname = REPLACE(colname, UNHEX('92'), '');
Reference:
how to replace hex in varchar field
Converting latin1 hex to utf8 hex:
HEX(CONVERT(CONVERT(UNHEX('F6') USING LATIN1) USING UTF8)) ⇒ C3B6
(Note: The doc refers to version 5.5, but probably applies to all versions from 4.1 on.)
Connector/.NET possibly defaults to latin1. Consider changing 'connection string' if you need utf8.
Suggested for my.cnf by
stackoverflow
[client]
default-character-set=utf8
[mysql]
default-character-set=utf8
character-set-server = utf8
[mysqld]
collation-server = utf8_unicode_ci
init-connect='SET NAMES utf8'
character-set-server = utf8
Caution: Setting default-character-set that way for the mysql client
cannot be overridden by SET NAMES.
my.cnf: skip-character-set-client-handshake to pretend to be 4.0?
A possible double-encoding fix
$string = htmlspecialchars($string, ENT_QUOTES, "UTF-8");
my.cnf needs these to force the server and the client to use UTF8:
skip-character-set-client-handshake
collation-server=utf8_unicode_ci
character-set-server=utf8
However, "The server character set and collation are used as default values if the
database character set and collation are not specified in CREATE DATABASE statements.
They have no other purpose."
manual on charset-server
It is not clear whether skip-character-set-client-handshake is really needed.
For a description of the character_set_* settings and SET NAMES:
manual on Connection Character Sets
In 5.6, this helps get Unicode rendering for mysql commandline tool for Windows:
mysql --default-character-set=utf8
Windows Unicode Support
Also sugest (in my.ini)
[mysql]
default-character-set=utf8
<?xml version="1.0" encoding="utf-8"?>
REGEXP (RLIKE) with character class is likely to work wrong:
Bug report from 2007
Manual Warning
<form method="post" action="/your/url/" accept-charset="UTF-8">
Changing default charset
Possible values for <meta http-equiv="Content-Type" content="text/html; charset=..." />
can be found in
iana charsets
For example, use 'ISO-8859-9' for MySQL's 'latin5'.
How to fix a table that has a mixture of latin1 and utf8
Test for Chinese characters: REGEXP '^(..)*(E[2-9F]|F0A)' -- Caveat: This may not be precise.
StackOverflow discussion
In my.cnf:
collation_server = utf8mb4_unicode_520_ci
sets
character_set_database | utf8mb4
character_set_server | utf8mb4
collation_database | utf8mb4_unicode_520_ci
collation_server | utf8mb4_unicode_520_ci
but not the things set by SET NAMES, nor collation_connection.
From language name to list of characters (dwell to see unicode, not utf8)
Illustrive examples of ALTERs
5.5.3 (10/2010) Changelog: The following obsolete constructs have been removed.
The --default-character-set and --default-collation server options (use --character-set-server and --collation-server).
Then...
5.7.6 (9/2015) Changelog: The following items are deprecated and will be removed in a future MySQL release.
The global character_set_database and collation_database system variables are deprecated
and will be removed in a future version of MySQL.
Assigning a value to the session character_set_database
and collation_database system variables is deprecated and assignments produce a warning.
The session variables will become read only in a future version of MySQL
and assignments will produce an error. It will remain possible to access
the session variables to determine the database character set and collation for the default database.
LOAD DATA added syntax to include charset valid for the file to be loaded.
In 4.1, LOAD DATA utilises database charset.
Bug 10195
Check for capitalized Swedish words
shows the difficulty of using REGEXP, and provides a workaround.
When a Stored Routine is created, it has the current CHARACTER SET and COLLATION attached to it.
Beware -- this may not be what you expect later.
Use SHOW PROCEDURE name to see the setting.
"ch and ll are no longer letters since 1994" - This probably means that "spanish2" collations are not needed for current texts.
[[http://www.rae.es/consultas/exclusion-de-ch-y-ll-del-abecedario][Reference]] (in Spanish).
LIKE != = -- (a LIKE b) is almost the same as (a = b) in that collations matter.
However trailing spaces can lead to different actions.
And 2-byte letters (ae) are treated as two characters in LIKE, but perhaps one character, such as when compared to (ä) with utf8_german2_ci.
SO Discussion
Diacritics used in Europe
$str = mb_convert_encoding($str, 'UTF-8', 'HTML-ENTITIES');
Changing case
Examples of mismatch of utf8 and utf8mb4
Search for emoji
Footnote
If this solves your problem, please let me know.
My ego needs a pat on the back.
If it does not solve your problem, please contact me,
maybe I can pull as special rabbit out of my hat, just for you.
Original writing -- Sep 2010, after struggling with hundreds of questions
both in-house and from the forums.
Alas, this does not achieve the ultimate goal of "Given this symptom, do that". (See 2015)
I hope it helps most people.
Revision date: 2012-08-26
(This document continues to evolve as more problems and examples are discovered.)
Revision date: 2013-02-12
(Misc, plus added 8-bit encodings)
Revision date: 2015-March/April
Finally getting close to "If you see this, this is what happened, and here's how to fix it".
Add better summary of what errors occur and how to fix them.
Add/fix several links.
Revision date: 2016-01-25
Shuffled language info.
Revision date: 2016-09-14
Beginnings of 8.0, including utf8mb4...unicode_800_ci
Contact me by posting a question at MySQL Forums :: Character Sets, Collation, Unicode
-- Rick James
MySQL Documents by Rick James
HowTo Techniques for Optimizing Tough Tasks:
Partition Maintenance (DROP+REORG) for time series (includes list of PARTITION uses)
Big DELETEs - how to optimize -- and other chunking advice, plus a use for PARTITIONing
Chunking lengthy DELETE/UPDATE/etc.
Data Warehouse techniques:
Overview
Summary Tables
High speed ingestion
Entity-Attribute-Value -- a common, poorly performing, design pattern (EAV); plus an alternative
Find the nearest 10 pizza parlors -- efficient searching on Latitude + Longitude (another PARITION use)
Lat/Long representation choices
Pagination, not with OFFSET, LIMIT
Techniques on efficiently finding a random row (On beyond ORDER BY RAND())
GUID/UUID Performance (type 1 only)
IP Range Table Performance -- or other disjoint ranges
Rollup Unique User Counts
Alter of a Huge table -- Mostly obviated by 5.6
Latest 10 news articles -- how to optimize the schema and code for such
Build and execute a "Pivot" SELECT (showing rows as columns)
Find largest row for each group ("groupwise max")
Other Tips, Tuning, Debugging, Optimizations, etc...
Rick's RoTs (Rules of Thumb -- lots of tips)
Memory Allocation (caching, etc)
Character Set and Collation problem solver
Trouble with UTF-8
utf8 Collations
utf8mb4 Collations on 8.0
Converting from MyISAM to InnoDB -- includes differences between them
Compound INDEXes plus other insights into the mysteries of INDEXing
Cookbook for Creating Indexes
Many-to-many mapping table
wp_postmeta
MySQL Limits -- built-in hard limits
767-byte INDEX limit
Galera, tips on converting to (Percona XtraDB Cluster, MariaDB 10, or manually installed)
5.7's Query Rewrite -- perhaps 5.7's best perf gain, at least for this forum's users
Request for tuning / slowlog info
Best of MySQL Forum -- index of lots of tips, discussions, etc
Analyze MySQL Performance
Analyze VARIABLEs and GLOBAL STATUS
Analyze SlowLog
utf8mb4 collations
My slides from conferences
Percona Live 4/2017 - Rick's RoTs (Rules of Thumb) - MySQL/MariaDB
Percona Live 4/2017 - Index Cookbook - MySQL/MariaDB
Percona Live 9/2015 - PARTITIONing - MySQL/MariaDB
(older ones upon request)
Contact me via LinkedIn; be sure to include a brief teaser in the Invite request: