MySQL Charset/Collate

Table of Contents

The problems being addressed
Basic Concepts
History
Best Practice
Example, using PHP
When do bytes get translated?
Life Cycle of a Character
Disaster before INSERT
Diagnosing CHARSET issues
The manual
Viewing text
Entering accents in CMD
Example of Double-encoding
Fixing utf8 bytes in latin1 column
Fixing mismatch between CHARSET and data encoding
Fixing mix of charsets in a column
Fixing MicroSoft thingies
Fixing "double encoding"
Fix definition and data, IF correctly encoded
Testing an in-place fix
Fixing while moving data with mysqldump
Fixing while moving data with LOAD DATA
Conversion Gotchas
4-byte utf8
Functions to know about
BOM - Byte-Order-Mark
German "sharp-s"
Circumflex
Cautions
Introspection
my.cnf
Stored Procedures/Functions, Triggers
utf8 Collations
utf8 Collation Variations:
8-bit Encodings
Turkish
More references
latin1 ~= cp1252 ~= ISO 8859-1
Combining Diacriticals
Open issues
Random notes
Footnote
Brought to you by Rick James

The problems being addressed


    ⚈  Your web page shows the wrong characters where accented letters should be.
    ⚈  You are upgrading from MySQL 4.0, where characters were simply bytes.
    ⚈  You are trying to wrap your head around how MySQL handles CHARSET / COLLATION.
    ⚈  You upgraded, and now you are getting garbage.

Please read most of this document. I know it is long (after all, I wrote it!) But you really need to understand a lot of the pieces in it, in order to solve your problem.

Basic Concepts


"Character" and "byte" are different! You must understand this before continuing. A "byte" is an 8-bit thing; it is the unit of space in computers (today). A "character" is composed of one or more bytes, and represents what we think of when reading.

A byte can represent only 256 different values. There are over 11,000 Korean characters and over 40,000 Chinese characters -- no way to squeeze such a character into a single byte.
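To make the byte/character distinction concrete, here is a quick illustration. (Python is used here, and in later sketches, only to demonstrate the encodings; it is not part of MySQL.)

```python
# One "character" can occupy one or several bytes, depending on the encoding.
s = 'é'
print(len(s))                    # 1 character
print(len(s.encode('latin-1')))  # 1 byte in latin1
print(len(s.encode('utf-8')))    # 2 bytes in utf8

han = '中'                       # a Chinese character
print(len(han))                  # still 1 character
print(len(han.encode('utf-8')))  # but 3 bytes in utf8
```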

Charset vs collation. These are different things! 'Charset' ('character set'; 'encoding') refers to the bits used to represent 'characters'. 'Collation' refers to how those bits are compared for inequality (WHERE) and sorting (ORDER BY). GROUP BY and FOREIGN KEY CONSTRAINTs can also involve collation. And it even decides whether two different bit strings compare 'equal'.

History


1950's -- A character was (sometimes) represented in only 5 bits, on "paper tape"; no lowercase; had to "shift" to switch between letters and digits.
1960's -- A character was 6 bits; no lower case. "Punched cards" were used.
1970's -- 7-bit ASCII becomes common -- that limits you to English. And the 8-bit "byte" was invented and was coming into common usage (re: IBM 360).
1980's -- Beginning to see 8-bit encodings. Especially since 7-bit ASCII was wasting a bit of the omni-present "byte". This can handle Western European accented letters.
1990's -- The computer world realizes that there are other people in the world and embarks on Unicode and UTF8.

Meanwhile, MySQL is born, but has enough problems without worrying about character sets. Through version 4.0, a CHAR is just a byte. You can put any kind of bytes, representing anything, into a VARCHAR. Thus, begins the need for this discussion.

MySQL 4.1 introduced the concept of "character set" and "collation". If you had legacy data or legacy code, you probably did not notice that you were messing things up when you upgraded. Or you started with 4.1 (or later) and "latin1 / latin1_swedish_ci" and failed to notice that you were asking for trouble.

Today, it's pretty safe to simply lay down the law and say "Use utf8 for all text." If you have version 5.5.3 or later, "Use utf8mb4 for all text."

Sure, there are other character sets, and they are useful if you have a narrow focus. But, you may as well use utf8mb4.

Best Practice


Best practice is to go completely utf8mb4. I will focus on utf8 and utf8mb4, but if you choose to do otherwise, keep reading; most of this discussion can still be adapted to the charset of your choice.

A web page (if that is what you have) should begin with <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

Establish the characterset for talking to the MySQL server:
    ⚈  init_connect = 'SET NAMES utf8' in my.cnf
    ⚈  SET NAMES utf8 -- done immediately after connecting
    ⚈  Setting the character_set_% VARIABLEs individually -- but SET NAMES covers the three you need.

For PHP:
    ⚈  mysql interface: mysql_set_charset('utf8'); (assuming PHP 5.2.3 & MySQL 5.0.7) Note: "mysql_" interface is deprecated.
    ⚈  mysqli interface: mysqli_set_charset('utf8') function. See
mysqli_set_charset
    ⚈  PDO interface: set the charset attribute of the PDO dsn.
(from #29 in Devshed's PHP tips)

SET NAMES can be invoked by your language just like other non-SELECT commands: mysqli_query(), do(), execute(), etc.

I digress here for a note about GRANTs. 'root', and any other user granted ALL ON *.*, has the SUPER privilege and therefore skips init_connect, so that technique is not 'perfect'.

SET NAMES

Declare most CHAR/TEXT columns in all tables as CHARSET utf8. Some exceptions: columns with hex (MD5, GUID), ... These should be BINARY or CHAR charset ASCII.

Collation?

For true text, utf8_unicode_ci is best. It handles the complex rules of "combining characters", etc. It handles essentially all languages simultaneously, but compromises on ordering.

utf8_general_ci is the default for utf8, so you may accidentally get this. It is a little faster than utf8_unicode_ci and works ok for many situations.

utf8_bin says to just compare the bytes. CHAR/TEXT utf8 with utf8_bin validates (on INSERT) that the bytes comprise valid utf8, but does not do anything useful for comparisons other than exact (no case folding, etc) equality. BINARY/BLOB should usually be used instead of CHAR+utf8_bin; it stores the bytes without any checking.

Alas, MySQL's collations think of case folding and accent-stripping as equivalent. That is, there is no utf8 collation that folds case but keeps accents distinct. Do SHOW COLLATION LIKE '%cs'; to see the few collations that work that way.

Example, using PHP


Arabic in mysql tables

When do bytes get translated?


If you do all the "right things", your application will perform "correctly" -- almost by accident. You need to understand when and how bytes are modified, in order to fix existing problems and to build your application correctly.

Short answer: You tell MySQL how the bytes in your application are encoded; you tell MySQL what encoding to use in your tables; MySQL converts as needed when INSERTing / SELECTing.

You tell MySQL about your application bytes by using SET NAMES or the underlying variables. You tell MySQL about the table bytes in CREATE TABLE.

Actually, the charset and collation of a column has a long, not too consistent, defaulting mechanism:
    ⚈  Default for system
    ⚈  Default for a database ...
    ⚈  Default for a table
    ⚈  Settings for a column
    ⚈  Fields/literals can be converted on the fly
    ⚈  A different collation can be specified in a WHERE clause...
To see what charset & collation are in effect for a column, use SHOW CREATE TABLE.

Life Cycle of a Character


Data -> Your client application -> INSERT into MySQL table. Simple? No.

You get the data from somewhere. Is the e-acute in "Beyoncé" represented in one latin1 byte, hex E9? Or is it the two-byte utf8 C3 A9? Not knowing which you have is the first step toward having a mess in the database. More in a minute.

Meanwhile, you have declared a column in a table in a database in MySQL to be utf8. (Collation does not matter in this discussion.)

Now you INSERT INTO Tbl (name) VALUES ("Beyoncé"). And the mess continues.

Is that value being inserted encoded in latin1 or utf8? MySQL is happy to take either, and will gladly do any transcoding to store it properly in the table. But you have to tell it! The default may not be right!

Your application announces to MySQL the encoding via any of
    ⚈  The VARIABLE character_set_client
    ⚈  Executing SET NAMES
    ⚈  init-connect in my.cnf (my.ini) (Caution: root bypasses this)
I recommend you execute SET NAMES in your application, right after connecting to the database. This reminds you, in your code, that you are establishing the encoding for the connection.

Example of a mess: The data is in utf8, the connection settings are in latin1. The table is in utf8. So, MySQL converts each of C3 and A9 (which it was told was latin1) to utf8, leading to 4 bytes in the table: C3 83 C2 A9. I call this double-encoding. The problem was caused by lying (usually accidentally) to MySQL about what encoding was in the application's data.
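The mess can be reproduced outside MySQL. This Python sketch mimics what the server does when you lie about the client encoding (decode-as-latin1, then re-encode into the utf8 column):

```python
correct = 'é'.encode('utf-8')         # C3 A9 -- the bytes the client actually sent
# The server was told the bytes were latin1, so it treats each byte as a character:
mistaken = correct.decode('latin-1')  # now two characters: 'Ã©'
stored = mistaken.encode('utf-8')     # what lands in the utf8 column
print(stored.hex().upper())           # C383C2A9 -- double-encoded, 4 bytes
```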

Another example: The data was loaded in an old system in the following manner. The application got utf8 bytes and blindly stored them in a latin1 column (using the default latin1 settings for character_set% -- or it was done in 4.0, which had no character set concept). This left e-acute as 2 bytes in the table in a latin1 column. Then, you decide to 'do the right thing' and switch to utf8. Now what?

BIG5 example: You have an application that lets the user enter data into a form. This data is stored into a VARCHAR. All along the way, you use latin1. But the user connected from China with his browser set to BIG5. The latin1 column happily accepted the 2-byte Chinese characters. But your application reading the data has no clue of actual encoding. That same user, connecting in that same way, and rereading his data, will probably see the latin1 treated as BIG5, and be oblivious of the "two wrongs make a right" that happened. But display that text elsewhere, and you have a mess.

Disaster before INSERT


Most Mojibake starts even before the text is in the table. Let's dissect the steps of doing an INSERT.

Step 1 -- Get the data.

First you get the data from somewhere
    ⚈  Type it in, using a special keyboard, or keyboard tricks
    ⚈  Load it from a file
    ⚈  Create it from constants in your client programming language
    ⚈  Receive text from an html form

What do you have? You probably don't know. There is no single standard for what bytes are generated.

Step 2 -- INSERT

Let's walk through inserting a single character: e-acute. INSERT INTO tbl VALUES ('é'); or $str = ...; INSERT INTO tbl VALUES ('$str'); Well, this is really two steps.

Two likely cases for the encoding of e-acute:
    ⚈  latin1 (or cp1252): 1 byte: E9
    ⚈  utf8: 2 bytes: C3 A9

Step 2a -- Send the command to the MySQL server

The bytes that make up the INSERT statement are sent from the client. Here enters the VARIABLE character_set_client. This tells the client what the bytes mean.

What is logically sent across the wire:
                           Client byte(s)
    character_set_client:    E9    C3A9
                   latin1    é     Ã©
                     utf8          é
    ⚈  Upper-left and lower-right: The correct 'character' is sent.
    ⚈  Upper-right: Since you said the bytes are 'latin1', then C3A9 must be the two chars Ã©.
    ⚈  Lower-left: Since E9 (and what follows) is not valid utf8, the string is truncated.

Step 2b -- Do the INSERT

The 'characters' on the wire are now transliterated to the desired encoding for the table.
                        On the wire
    column's charset:    é     Ã©
               latin1    E9    C3A9
                 utf8    C3A9  C383C2A9
    ⚈  Left column: all is well.
    ⚈  Upper-right: Mojibake. Treated as latin1, C3A9 looks like Ã©
    ⚈  Lower-right: "Double encoding" -- discussed at length below.

See also:
Mojibake
Thread with data entry issues.
Graphemica

Diagnosing CHARSET issues


You have some non-English characters that are showing up strangely, perhaps as a question mark (?), perhaps as two letters when you were expecting one, such as ç displayed as Ã§.

First, let's get the lay of the land:
SHOW SESSION VARIABLES LIKE 'character_set%';
| character_set_client     | latin1
| character_set_connection | latin1
| character_set_results    | latin1
(There will be other rows; these three are the important ones.) SET NAMES sets those three; this example shows the result of SET NAMES latin1, not the recommended setting. More later.

SHOW CREATE TABLE tbl will show the charset and collation for each column. More later.

You have data in a table, but it is garbled when displaying it. The first thing to do is to decide whether the data was stored correctly. Find a small cell that is bad, and do this:
SELECT x, HEX(x), LENGTH(x), CHAR_LENGTH(x) FROM ... WHERE ... LIMIT 1
You will get 4 columns:
    ⚈  x -- probably not useful, but may help confirm that you got the cell you wanted.
    ⚈  HEX(x) -- This lets us look at the encoding; more later.
    ⚈  LENGTH is the number of _bytes_.
    ⚈  CHAR_LENGTH is the number of _characters_. (Remember the distinction, above?)

LENGTH vs CHAR_LENGTH -- Are they equal? For correctly stored utf8, they will differ whenever any non-ASCII characters are present. Looking at a single character, the bytes:characters ratio would be:
    ⚈  ASCII: 1:1
    ⚈  LATIN1: 1:1
    ⚈  English in utf8: 1:1 -- Ascii is a subset of utf8
    ⚈  Western European accented letters: 2:1 -- one character is encoded in two bytes
    ⚈  Eastern Europe: also 2:1 (mostly)
    ⚈  East Asia: mostly 3:1
    ⚈  "double encoding" 4:1 (Europe) or 6:1 (Asian)
    ⚈  With utf8mb4, a few Chinese characters will be 4:1

A typical phrase in French might have LENGTH=20 and CHAR_LENGTH=17, implying that there were 17 characters, 3 of which had accents.
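The same arithmetic is easy to check outside MySQL: Python's len() of a string corresponds to CHAR_LENGTH, and len() of its utf8 encoding to LENGTH. (The French phrase below is just an invented sample.)

```python
phrase = 'déjà révélé'                 # hypothetical sample phrase
chars = len(phrase)                    # CHAR_LENGTH: 11
nbytes = len(phrase.encode('utf-8'))   # LENGTH: 16
print(chars, nbytes, nbytes - chars)   # the 5 accented letters account for the difference
```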

latin1 is happy to cram Western European characters into 1 byte.

"Double encoding" is a term I made up for the following situation. It is a case where "two wrongs make a right".
    ⚈  Table defined utf8
    ⚈  INSERTer declared latin1 (used default instead of doing SET NAMES)
    ⚈  Data being INSERTed is actually utf8 already.
What happened:
    ⚈  A 2-byte letter (say, a grave-e) was correctly represented in utf8.
    ⚈  The INSERT statement handed the 2 bytes to MySQL, but implied the need for conversion
    ⚈  Each byte was converted to utf8
    ⚈  The table has 4 bytes.
When SELECTing, the reverse happens 4->2->1, and the user is oblivious of the bug. Looking at CHAR_LENGTH will spot it. Strange orderings may happen.
details of a double encoding
A good writeup on double-encoding

If you are moving the data using mysqldump, consider looking at the dump file via some hexdump utility. For experimenting, mysqldump lets you specify one database, one table, and constrain the rows via --where="...".

In the hexdump, you need to be able to recognize various byte patterns: In this, "yy" stands for the range 80-BF.
    ⚈  09,0A,0D -- The only control character you are likely to encounter (TAB, NL, CR)
    ⚈  20 -- space (handy to know when reading hex dump)
    ⚈  21-7E -- Ascii characters
    ⚈  C2yy -- symbols
    ⚈  C3yy -- Typical accented letters of Western Europe
    ⚈  Cxyy -- More Western Europe: Latin (C3-CA), Combining Diacritical Marks (CC-CD), Greek (CE-CF)
    ⚈  Dxyy -- utf8: Cyrillic (D0-D4), Hebrew (D6-D7), Arabic/Persian/Farsi (D8-DB), etc
    ⚈  E0yyyy -- 3-byte utf8 for various Indian character sets, etc.
    ⚈  E1yyyy -- Hangul, Cherokee, Balinese, etc.
    ⚈  E2yyyy -- Symbols, Braille, etc
    ⚈  E381yy -- Hiragana (Japanese)
    ⚈  E383yy -- Katakana (Japanese)
    ⚈  Exyyyy -- (E3-ED) Chinese, Japanese, Korean?
    ⚈  EFBxyy -- Arabic
    ⚈  F0yyyyyy -- 4-byte utf8. (See notes about utf8mb4)
Note, especially, the absence of 80-BF unless preceded by some other 8-bit code.
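The patterns above follow directly from the utf8 lead-byte rules. A small sketch of a classifier (utf8_seq_len is my own helper, not a MySQL function):

```python
def utf8_seq_len(lead):
    """How many bytes a utf8 sequence starting with this lead byte should have.
    Returns 0 for a bare continuation byte (80-BF) or an invalid lead."""
    if lead < 0x80:
        return 1                  # Ascii
    if 0xC2 <= lead <= 0xDF:
        return 2                  # Latin accents, Greek, Cyrillic, Hebrew, Arabic, ...
    if 0xE0 <= lead <= 0xEF:
        return 3                  # most of Asia
    if 0xF0 <= lead <= 0xF4:
        return 4                  # 4-byte utf8 -- utf8mb4 territory
    return 0

print(utf8_seq_len(0xC3), utf8_seq_len(0xE1), utf8_seq_len(0xF0), utf8_seq_len(0xA9))
# 2 3 4 0
```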

If the data is not utf8, you may see
    ⚈  00xx -- Lots of these pairs could mean Ascii in UCS2. (Shift-JIS is also possible)
    ⚈  FFFE (or FEFF) at start of file -- BOM for a UTF-16/UCS-2 file; EFBBBF is the utf8 BOM (these occur in files, not cells in a MySQL table)
    ⚈  00-1F -- control characters (mostly 0a, newline)
    ⚈  20 -- space (handy to know when reading hex dump)
    ⚈  21-7E -- Ascii characters
    ⚈  7F -- DEL; rare
    ⚈  80-FF -- "8-bit characters":
    ⚈  80-9F -- Possibly MicroSoft Word things (smart quotes, etc)
    ⚈  80-FF -- valid latin1 encodings, mostly European accented letters
    ⚈  A0 -- "hard space" -- mostly from MicroSoft
    ⚈  Cx not followed by yy (80-BF) cannot be utf8. This is usually a quick clue that it is not utf8.
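That last clue can be automated: well-formed utf8 is machine-checkable with any strict utf8 decoder (a sketch; looks_like_utf8 is my own name):

```python
def looks_like_utf8(b):
    """True if the byte string is well-formed utf8."""
    try:
        b.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_utf8(bytes.fromhex('C3A9')))  # True  -- C3 followed by a continuation byte
print(looks_like_utf8(bytes.fromhex('E920')))  # False -- latin1 e-acute, then a space
```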

Note: The Microsoft things really need to be fixed _before_ trying to store into a table. MicroSoft characters

In hex, "double encoding" will have a lot of "Cx yy Cx yy".

The manual


charset-connection.html
is the manual page that explains the interactions of the vital variables:
    ⚈  character_set_client -- the encoding in the client.
    ⚈  character_set_connection -- what to re-code characters into for client-to-server transfer
    ⚈  character_set_results -- what to re-code characters into for server-to-client transfer
The page also discusses collation_connection and collation_database, which have less impact.

Between that manual page and this document, I hope that you can understand things as they relate to your situation.

To see your current settings:
   SHOW VARIABLES LIKE 'character_set%';
   SHOW VARIABLES LIKE 'collation%';

Viewing text


Alas, CMD (Windows) and perhaps other terminal emulators cannot (or will not by default) display utf8 characters. This makes it difficult to debug. And it is confusing to see incorrect characters when everything else is working correctly.

Viewing text in a browser can be misleading in a different way. Browsers may "fix" things for you. For example, 'double encoded' characters may appear to be correct in the browser!

Bottom line: SELECT HEX() is about the only thing to trust.

Entering accents in CMD


As noted above, CMD in Windows is limited in its character set handling.

The following is a clumsy way to enter Western European characters. It depends on your knowing the hex or decimal values for the characters you want. Sorry, this won't handle Hebrew, Cyrillic, Chinese, etc.

Press and hold ALT, then type
    ⚈  the decimal digits of the code-page value, or
    ⚈  0, then the decimal digits of the Unicode code point
to get the extra characters in latin1. When you let go, the accented character will appear. This implies that you need SET NAMES latin1, not utf8, since you are creating latin1 bytes, not utf8 bytes, in your 'client'. For the encodings, see
    ⚈  link1, or
    ⚈  link2.

More
info on how to enter Unicode, and about "code pages".

The command "chcp" controls the "code page". chcp 65001 is supposed to provide utf8, but it seems to need a special charset installed, too. some code pages
To set the font in the console window: Right-click on the title of the window -> Properties -> Font -> pick Lucida Console

Example of Double-encoding

Desired string:  სახლი არის
Hex in table:
C3A1C692C2A1C3A1C692C290C3A1C692C2AEC3A1C692C5A1C3A1C692CB9C20C3A1C692C290C3A1C692C2A0C3A1C692CB9CC3A1C692C2A1
Correct utf8 hex:
E183A1E18390E183AEE1839AE1839820E18390E183A0E18398E183A1

Analysis:
C3A1C692C2A1 -- should be E183A1 -- Georgian SAN (U+10E1)  "ს"
C3A1C692C290
C3A1C692C2AE
C3A1C692C5A1
C3A1C692CB9C
20           -- space (same in utf8 and latin1)
C3A1C692C290
C3A1C692C2A0
C3A1C692CB9C
C3A1C692C2A1
Georgian, when properly encoded in utf8, should be 3 bytes per character:
E182xx
E183xx
E2B4xx
Taking the first "character" (C3A1C692C2A1, in hex):
    ⚈  latin1 E1 = utf8 C3A1
    ⚈  latin1 83 = utf8 C692
    ⚈  latin1 A1 = utf8 C2A1
Notice how most of the letters look very similar: C3A1C692xxyy, when they should start with E183xx.

One pass over the data to convert each two bytes (4 hex) from utf8 to latin1:
   SELECT
   HEX(CONVERT(CONVERT(UNHEX(
     'C3A1C692C2A1C3A1C692C290C3A1C692C2AEC3A1C692C5A1C3A1C692CB9C20C3A1C692C290C3A1C692C2A0C3A1C692CB9CC3A1C692C2A1'
   ) USING utf8) USING latin1))
-->
E183A1E18390E183AEE1839AE1839820E18390E183A0E18398E183A1
Now to display as utf8:
SELECT UNHEX('E183A1E18390E183AEE1839AE1839820E18390E183A0E18398E183A1')
-->
სახლი არის

(Ref: forum thread #418885.)

Another example of double-encoding: 'C398C2AF' should have been 'D8AF'
SELECT HEX(CONVERT(CONVERT(UNHEX( 'C398C2AF') USING utf8) USING latin1));
--> D8AF, which is the utf8 for Arabic DAL.
SELECT CONVERT(CONVERT(UNHEX( 'C398C2AF') USING utf8) USING latin1);
--> that character.
(Ref: forum thread #565669.)
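The two CONVERTs correspond to a decode/encode pair. Here is the Arabic DAL example replayed byte-for-byte (illustration only; in MySQL you would use the CONVERT expression above):

```python
stored = bytes.fromhex('C398C2AF')   # double-encoded bytes in the table
# CONVERT(... USING utf8): interpret the stored bytes as utf8 characters
chars = stored.decode('utf-8')       # 'Ø¯' -- two bogus characters
# CONVERT(... USING latin1): re-encode those characters as latin1 bytes
fixed = chars.encode('latin-1')      # the bytes that should have been stored
print(fixed.hex().upper())           # D8AF
print(fixed.decode('utf-8'))         # 'د' -- Arabic DAL, as intended
```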

Fixing utf8 bytes in latin1 column


CAVEAT! Test before using on Production!
# Set up bad case:
mysql> create table latin1 (c varchar(11) charset latin1);
mysql> insert into latin1 (c) values ('x'), (unhex('c2a3'));
mysql> select c, hex(c), length(c), char_length(c) from latin1;
+------+--------+-----------+----------------+
| c    | hex(c) | length(c) | char_length(c) |
+------+--------+-----------+----------------+
| x    | 78     |         1 |              1 |
| Â£   | C2A3   |         2 |              2 |
+------+--------+-----------+----------------+
# Step 1 of 2 of fix:
mysql> alter table latin1 modify c tinyblob;
mysql> select c, hex(c), length(c), char_length(c) from latin1;
+------+--------+-----------+----------------+
| c    | hex(c) | length(c) | char_length(c) |
+------+--------+-----------+----------------+
| x    | 78     |         1 |              1 |
| Â£   | C2A3   |         2 |              2 |
+------+--------+-----------+----------------+
# Step 2 of 2 of fix:
mysql> alter table latin1 modify c varchar(11) charset utf8;
mysql> select c, hex(c), length(c), char_length(c) from latin1;
+------+--------+-----------+----------------+
| c    | hex(c) | length(c) | char_length(c) |
+------+--------+-----------+----------------+
| x    | 78     |         1 |              1 |
| £    | C2A3   |         2 |              1 |
+------+--------+-----------+----------------+

That is, ALTER to BLOB (of appropriate size), then ALTER to VARCHAR (of original size) and CHARSET utf8.

This technique should work for other cases of the right encoding sitting in a 5.1 table with the wrong CHARSET declaration.

One more note: Suggest you include the COLLATE clause on the final ALTER that converts to VARCHAR.

Fixing mismatch between CHARSET and data encoding


The problem: cp850 encoding was inserted into a latin1 field.

The following will build an example, then walk through two solutions. The solutions should work for any mis-coded field where bytes for one charset had been stuffed into a column with a different charset declaration (usually latin1). This particular example stems from
Forum thread 616834

See also: cp850 encodings
latin1 and utf8 encodings
-- First, let's establish cp850 everywhere:
-- In Windows, do "chcp 850" before entering mysql.
-- Then get the mysql commandline to agree:
SET NAMES cp850;  -- Now CodePage and the following match
SHOW VARIABLES LIKE 'char%';  -- note the 3 cp850 entries

Solution A: Change CHARACTER SET.

Given: cp850 bytes are stored in a latin1 field. Goal: Leave the bytes alone, but change the column to be CHARACTER SET cp850.
-- Build sample table
CREATE TABLE f616834A (
    str VARCHAR(111) CHARACTER SET latin1
);
SHOW CREATE TABLE f616834A\G

-- Populate with accented E's and I's:
SELECT @str := HEX('ÈÉÊËÌÍÎÏ');
-- Expect        D490D2D3DED6D7D8 (cp850 encoding)
-- Don't expect  C8C9CACBCCCDCECF (latin1)
-- Don't expect  C388C389C38AC38BC38CC38DC38EC38F (utf8)
SELECT UNHEX('D490D2D3DED6D7D8');  -- expect 'ÈÉÊËÌÍÎÏ'

INSERT INTO f616834A (str)  VALUES (UNHEX('D490D2D3DED6D7D8')) ;

SELECT str, HEX(str) FROM f616834A;
-- Expect gibberish (Ô?ÒÓÞÖ×Ø) and D490D2D3DED6D7D8
Notice how it looks wrong but it is encoded correctly. This is because the server/client, in doing the SELECT, is transcoding from latin1 (which it isn't) to cp850. (Some combination of "chcp 1252" and "SET NAMES latin1" would prevent the gibberish, but we don't want to go that direction since it would only mask the problem.)

The "?" in the gibberish is not really a question mark, but rather the result of an unconvertable character.
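The gibberish can be predicted. Decoding the cp850 bytes as if they were latin1 (which is what the SELECT is effectively doing) yields:

```python
cp850_bytes = bytes.fromhex('D490D2D3DED6D7D8')  # 'ÈÉÊËÌÍÎÏ' in cp850
as_latin1 = cp850_bytes.decode('latin-1')
print(as_latin1)   # 'Ô\x90ÒÓÞÖ×Ø' -- the \x90 is the unconvertable character shown as '?'
```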

The ALTER must be done in two steps:
ALTER TABLE f616834A  MODIFY str  VARBINARY(111);
ALTER TABLE f616834A  MODIFY str  VARCHAR(111) CHARACTER SET cp850;

SELECT str, HEX(str) FROM f616834A;  -- Expect: ÈÉÊËÌÍÎÏ and D490D2D3DED6D7D8 (cp850)
The CHARACTER SET changed, but the data was not.

Solution B: Change Data Encoding

Given: cp850 bytes are stored in a latin1 field. Goal: Leave the column's CHARACTER SET alone, but change the bytes in the column.
SHOW VARIABLES LIKE 'char%';  -- cp850 and "chcp 850" must match at this point.
-- Build a test table:
CREATE TABLE f616834B (
    str VARCHAR(111) CHARACTER SET latin1
);
SHOW CREATE TABLE f616834B\G

-- Populate with accented E's and I's:
SELECT @str := HEX('ÈÉÊËÌÍÎÏ');    -- Expect D490D2D3DED6D7D8 (cp850 encoding)
SELECT UNHEX('D490D2D3DED6D7D8');  -- Expect 'ÈÉÊËÌÍÎÏ'
INSERT INTO f616834B (str)  VALUES (UNHEX('D490D2D3DED6D7D8')) ;

SELECT str, HEX(str) FROM f616834B;
-- Expect gibberish (Ô?ÒÓÞÖ×Ø) and D490D2D3DED6D7D8

UPDATE f616834B SET str = CONVERT(BINARY(str) USING cp850);
SELECT str, HEX(str) FROM f616834B;  -- Expect ÈÉÊËÌÍÎÏ and C8C9CACBCCCDCECF (latin1)
(A side note: CONVERT(UNHEX(HEX(str)) USING cp850) would also work.)
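The UPDATE's CONVERT(BINARY(str) USING cp850) is a transcode: read the raw bytes as cp850 characters, then write them out in the column's charset (latin1). The same byte math, replayed in Python just to check expectations:

```python
raw = bytes.fromhex('D490D2D3DED6D7D8')  # cp850 bytes sitting in the latin1 column
text = raw.decode('cp850')               # 'ÈÉÊËÌÍÎÏ' -- the intended characters
relatin1 = text.encode('latin-1')        # what the latin1 column should contain
print(relatin1.hex().upper())            # C8C9CACBCCCDCECF
```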

Fixing mix of charsets in a column


You let the users enter stuff in a UI form, and you stored it into a latin1 field. The stuff entered was in latin1, big5, utf8, symbols, etc.

You have a mess. Without deducing, row by row, what the intent was, you cannot fix it.

Fixing MicroSoft thingies


Perhaps best is to use the REPLACE() function.
    ⚈  
replace()

Fixing "double encoding"


CAVEAT! Test before using on Production!

*** Plan A: Use UPDATE to fix in-place:
UPDATE Tbl SET col =
    IFNULL(CONVERT(CONVERT(CONVERT(col USING latin1)
                                       USING binary)
                                       USING utf8),
                                    col )
Caution: I have only briefly tested this.

The IFNULL has the advantage of leaving intact most cells that are not double-encoded. (Maybe -- need further testing)

How does it work?
    1.  The first (innermost) CONVERT takes a 4-byte encoding and converts to 2 bytes.
    2.  Converting to binary and then to something else is a way to change the type without changing the bytes.
    3.  The final convert to utf8 prevents the assignment from further messing up the data.
    4.  The last step (IFNULL) undoes the operation for any cell where the conversion yielded NULL (invalid utf8).
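Mirroring the UPDATE expression in Python may make the steps clearer. (fix_double is my own name; the try/except plays the role of IFNULL. Note that Python's 'latin-1' is strict ISO-8859-1, while MySQL's latin1 is closer to cp1252, so a few characters behave differently.)

```python
def fix_double(s):
    """Undo one level of double encoding; leave the string alone if it
    was not double-encoded (the role IFNULL plays in the UPDATE)."""
    try:
        return s.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return s

print(fix_double('Ã©'))   # 'é' -- double encoding undone
print(fix_double('é'))    # 'é' -- already-correct value left intact
```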

*** Plan B: Use ALTER:
mysql> # Create sample:
mysql> CREATE TABLE dbl (c VARCHAR(11) CHARSET latin1);
mysql> INSERT INTO dbl (c) VALUES ('x'), (UNHEX('C382C2A4'));
mysql> # Diagnose:
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl;
+------+----------+-----------+----------------+
| c    | HEX(c)   | LENGTH(c) | CHAR_LENGTH(c) |
+------+----------+-----------+----------------+
| x    | 78       |         1 |              1 |
| Ã‚Â¤ | C382C2A4 |         4 |              4 |
+------+----------+-----------+----------------+
mysql> # Step 1 of 5:
Note:  Steps 1 and 2 are needed only if you start with latin1;
If your column is already utf8, skip to step 3.
mysql> ALTER TABLE dbl MODIFY c TINYBLOB;
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl;
+------+----------+-----------+----------------+
| c    | HEX(c)   | LENGTH(c) | CHAR_LENGTH(c) |
+------+----------+-----------+----------------+
| x    | 78       |         1 |              1 |
| Ã‚Â¤ | C382C2A4 |         4 |              4 |
+------+----------+-----------+----------------+
mysql> # Step 2 of 5:
mysql> ALTER TABLE dbl MODIFY c VARCHAR(11) CHARSET utf8;
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl;
+------+----------+-----------+----------------+
| c    | HEX(c)   | LENGTH(c) | CHAR_LENGTH(c) |
+------+----------+-----------+----------------+
| x    | 78       |         1 |              1 |
| Â¤   | C382C2A4 |         4 |              2 |
+------+----------+-----------+----------------+
mysql> # Step 3 of 5:
mysql> ALTER TABLE dbl MODIFY c VARCHAR(11) CHARSET latin1;
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl;
+------+--------+-----------+----------------+
| c    | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) |
+------+--------+-----------+----------------+
| x    | 78     |         1 |              1 |
| Â¤   | C2A4   |         2 |              2 |
+------+--------+-----------+----------------+
mysql> # Step 4 of 5:
mysql> ALTER TABLE dbl MODIFY c TINYBLOB;
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl;
+------+--------+-----------+----------------+
| c    | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) |
+------+--------+-----------+----------------+
| x    | 78     |         1 |              1 |
| Â¤   | C2A4   |         2 |              2 |
+------+--------+-----------+----------------+
mysql> # Step 5 of 5:
mysql> ALTER TABLE dbl MODIFY c VARCHAR(11) CHARSET utf8;
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl;
+------+--------+-----------+----------------+
| c    | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) |
+------+--------+-----------+----------------+
| x    | 78     |         1 |              1 |
| ¤    | C2A4   |         2 |              1 |
+------+--------+-----------+----------------+
mysql> # Check schema:
mysql> SHOW CREATE TABLE dbl \G
CREATE TABLE `dbl` (
  `c` varchar(11) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8

One more note: Suggest you include the COLLATE clause on the final ALTER that converts to VARCHAR.

*** Plan C: Using mysqldump

(This has had only limited testing.)
    1. mysqldump --default-character-set=latin1 .... my_database > my_database_latin1.sql
    2. Edit my_database_latin1.sql to say SET NAMES utf8 near the top.
    3. mysql ... < my_database_latin1.sql
Thanks to Peter Barry

Fix definition and data, IF correctly encoded


You have a latin1 table containing latin1 encodings; you want utf8. (This is not the general thrust of this document, but is included for completeness, and because of how tempting the command looks.)
mysql> # Create sample:
mysql> CREATE TABLE trycvt (c VARCHAR(11) CHARSET latin1);
mysql> INSERT INTO trycvt (c) VALUES ('x'), (UNHEX('BB'));
mysql> SHOW CREATE TABLE trycvt \G
CREATE TABLE `trycvt` (
  `c` varchar(11) CHARACTER SET latin1 DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8
mysql> # Diagnose:
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM trycvt;
+------+--------+-----------+----------------+
| c    | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) |
+------+--------+-----------+----------------+
| x    | 78     |         1 |              1 |
| »    | BB     |         1 |              1 |
+------+--------+-----------+----------------+
mysql> # Step 1 of 1
mysql> ALTER TABLE trycvt CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
mysql> # Diagnose:
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM trycvt;
+------+--------+-----------+----------------+
| c    | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) |
+------+--------+-----------+----------------+
| x    | 78     |         1 |              1 |
| »    | C2BB   |         2 |              1 |
+------+--------+-----------+----------------+
mysql> # Note schema changed:
mysql> SHOW CREATE TABLE trycvt \G
CREATE TABLE `trycvt` (
  `c` varchar(11) COLLATE utf8_unicode_ci DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

Note: TEXT will become MEDIUMTEXT, etc. This is because transcoding latin1 to utf8 usually lengthens the data. You may want to follow with an ALTER TABLE ... MODIFY c TEXT ... to put it back.
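The lengthening is easy to quantify: every non-Ascii latin1 byte becomes two (or three) utf8 bytes. For the '»' in the example above:

```python
latin1_byte = bytes.fromhex('BB')    # '»' in latin1: 1 byte
utf8_bytes = latin1_byte.decode('latin-1').encode('utf-8')
print(utf8_bytes.hex().upper())      # C2BB -- 2 bytes after the conversion
```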

Note about ALTER: Generally it is possible and faster to do all ALTERs of a table in a single ALTER;

Haidong Ji Blog

performanceblog

Testing an in-place fix


Some of the above fixes require live changes. Here is a way that might build confidence that it works:
    ⚈  find some rows with accents (etc) in them
    ⚈  create a table with the _same_ schema; copy the rows into it (INSERT ... SELECT)
    ⚈  Try one of the conversions on that table
    ⚈  Check the results -- now LENGTH and CHAR_LENGTH should differ by the appropriate amount. Fetch the rows in your application; see if they look right.

Fixing while moving data with mysqldump


You are copying the data from, say, a 4.x server to a new, say, 5.1 server. It may be easiest to convert the data as you copy. You can test the results. If necessary, you can adjust and reapply the conversion.

Overview:
      mysqldump -h old_host --no-data >schema.sql
      # now edit schema.sql to fix charset, collation, etc
      # edit in/out SET NAMES
      mysql -h new_host <schema.sql
      mysqldump -h old_host --no-create-info --opt >data.sql
      # edit in/out SET NAMES
      mysql -h new_host <data.sql

Notes (and things that may vary with your situation)
    ⚈  May need --default-character-set=utf8
    ⚈  Other issues?
    ⚈  (aside) Do not dump the database `information_schema`
    ⚈  (aside) Do not dump and reload `mysql` if you are moving between versions; there are usually schema changes that will cause trouble.

Fixing while moving data with LOAD DATA


    ⚈  SELECT INTO OUTFILE may create something useful for LOAD DATA.
    ⚈  Perhaps use mysqldump suggestions for creating schema
    ⚈  LOAD DATA ... CHARACTER SET utf8 ...; (as of 5.0.38 / 5.1.17)
    ⚈  Sorry, no experience with this approach.

Conversion Gotchas


    ⚈  DROP and re-add FULLTEXT indexes around changes.
    ⚈  DROP and re-add all(?) FOREIGN KEY CONSTRAINTs; perhaps disabling will suffice
    ⚈  Collation changes could lead to duplicate keys
    ⚈  Collation changes could lead to broken VARCHAR FOREIGN KEY CONSTRAINTs

4-byte utf8


Inserting a 4-byte utf8 character into a column with CHARACTER SET utf8 will cause the string to be truncated starting with the 4-byte character.

MySQL versions 5.1 and before do not handle utf8 characters longer than 3 bytes. Version 5.5.3 (March, 2010) has a new CHARACTER SET, utf8mb4, for utf8 encodings up to 4 bytes.
    ⚈  
Introducing utf8mb4
    ⚈  Upgrading, including sample ALTERs
    ⚈  'f0 a2 88 a2' showing up as 'd848 de22'
    ⚈  Graphemica
utf8mb4 is necessary for a small number of Chinese characters.
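To make the 3-byte limit concrete, here is a Python sketch using U+22222, the character behind the 'f0 a2 88 a2' / 'd848 de22' confusion mentioned above:

```python
# A CJK character outside the Basic Multilingual Plane:
ch = "\U00022222"
assert ch.encode("utf-8").hex() == "f0a288a2"      # 4 bytes -- needs utf8mb4 in MySQL
assert ch.encode("utf-16-be").hex() == "d848de22"  # the surrogate-pair form
assert len(ch.encode("utf-8")) == 4                # 3-byte utf8 truncates at this point
```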

The MySQL changelog for 5.5.3 says "The Unicode implementation has been extended to provide support for supplementary characters that lie outside the Basic Multilingual Plane (BMP)."

If your product is going to store characters from Hong Kong, utf8 is not a good option since many characters in HKSCS are 4-byte utf8 characters. utf8mb4 would be better.

A good article on a variety of related issues such as SET NAMES, utf8mb4, collations, converting from utf8 to utf8mb4, index lengths, character-set-client-handshake, init-connect, Java, etc., is by Mathias Bynens
A lot of the meat is in the Comments section.

Much of this blog mentions utf8. Unless otherwise specified, utf8mb4 applies, too. Occasionally, "utf8" may refer to the standard, not specifically to MySQL's 3-byte subset.

Functions to know about


Since you may be using some of these against the source (old) database, be cautious of whether the function exists.

String functions
    ⚈  CHAR_LENGTH(field) -- num chars in field
    ⚈  LENGTH(field) -- num bytes in field
    ⚈  CHAR(123 USING latin1)
    ⚈  CHARSET(expression)
    ⚈  UPPER(_binary'Abc') = 'Abc' -- a subtle surprise (other string functions, too)
    ⚈  ORD(char) = ASCII(char) = number -- for the _char_, not the _byte_
    ⚈  HEX(string) -- use to see actual encoding
    ⚈  UNHEX('AABBCC') -- does not exist before 4.1.2
Convert functions
    ⚈  CONVERT('...' USING utf8)
    ⚈  CONVERT(blob_col USING utf8) -- no "USING" before 4.1.16
    ⚈  HEX(CAST(_ucs 0x00a0 as CHAR CHARACTER SET utf8)) -- show that Unicode 00a0 is utf8 C2A0
Literal introducer
    ⚈  _utf8'...'
PHP
    ⚈  mysql: mysql_set_charset('utf8');
    ⚈  mysqli: $mysqli_obj->set_charset('utf8');
    ⚈  PDO: $db = new PDO('dblib:host=host;dbname=db;charset=UTF-8', $user, $pwd);

Note: the use of mysqli_set_charset / set_charset should obviate the need for SET NAMES.

If you change the collation of a field, ORDER BY cannot use any INDEX; hence it could be surprisingly inefficient.
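As a cross-check of CHAR_LENGTH() vs LENGTH(), the same distinction sketched in Python (not MySQL itself, but the byte math is identical):

```python
s = "héllo"
assert len(s) == 5                                       # CHAR_LENGTH(field)
assert len(s.encode("utf-8")) == 6                       # LENGTH(field) in utf8: é is 2 bytes
assert s.encode("utf-8").hex().upper() == "68C3A96C6C6F" # HEX(field) -- the actual encoding
```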

BOM - Byte-Order-Mark


Wikipedia
says

The UTF-8 representation of the BOM is the byte sequence 0xEF,0xBB,0xBF. A text editor or web browser interpreting the text as ISO-8859-1 or CP1252 will display the characters ï»¿ for this.

If, instead, you see this hex: C3AFC2BBC2BF, then you are also suffering from double encoding.

If you encounter this (presumably at the start of a file), it implies that your editor is adding it, but the reader of the file (eg, mysqldump) does not know what to do with it. Consider using a different editor.
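Both forms of the BOM described above can be sketched in Python:

```python
bom = "\ufeff".encode("utf-8")
assert bom.hex() == "efbbbf"
# Misinterpreted as cp1252/latin1 text, those three bytes read as ï»¿ :
assert bom.decode("cp1252") == "\u00ef\u00bb\u00bf"
# Re-encode that misreading as utf8 and you get the double-encoded hex:
assert bom.decode("cp1252").encode("utf-8").hex() == "c3afc2bbc2bf"
```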

Error #1064 SQL syntax at line 1 - strange characters

German "sharp-s"


Version 5.1.24 introduced a change in the collation of the German "sharp-s".
bug 27877
(Apr. 2007) Starting with 5.1.24, 's' = 'ß' for utf8_general_ci. 'ss' = 'ß' for utf8_unicode_ci, which is 'correct' for German.

A nasty effect of this collation change occurs when a user upgrades past 5.1.24. If there were sharp-s characters in PRIMARY (or UNIQUE) KEYs, the reloading of the data may cause "duplicate key" errors.

To 'fix' the incompatibility, the collation utf8_general_mysql500_ci was added in MySQL 5.1.62 / 5.5.21 / 5.6.5. See ref page

Graphemica

More on what to do.

Circumflex


There are many versions of "circumflex":
unicode: 005E  = (^) utf8: 5E = CIRCUMFLEX ACCENT = SPACING CIRCUMFLEX
unicode: 02C6  = (ˆ) utf8: CB86 = MODIFIER LETTER CIRCUMFLEX ACCENT = MODIFIER LETTER CIRCUMFLEX
unicode: 0302  =     utf8: CC82 = COMBINING CIRCUMFLEX ACCENT = NON-SPACING CIRCUMFLEX
unicode: 032D  =     utf8: CCAD = COMBINING CIRCUMFLEX ACCENT BELOW = NON-SPACING CIRCUMFLEX BELOW
unicode: 1DCD  =     utf8: E1B78D = COMBINING DOUBLE CIRCUMFLEX ABOVE
unicode: A788  = (ꞈ) utf8: EA9E88 = MODIFIER LETTER LOW CIRCUMFLEX ACCENT
unicode: FF3E  = (^) utf8: EFBCBE = FULLWIDTH CIRCUMFLEX ACCENT = FULLWIDTH SPACING CIRCUMFLEX
unicode: E005E =     utf8: F3A0819E = TAG CIRCUMFLEX ACCENT
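A few rows of the list above can be verified with Python's codecs:

```python
# (codepoint, expected utf8 hex) pairs taken from the list above
for codepoint, utf8_hex in [(0x005E, "5e"), (0x02C6, "cb86"),
                            (0x0302, "cc82"), (0xFF3E, "efbcbe")]:
    assert chr(codepoint).encode("utf-8").hex() == utf8_hex
```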
There is debate about what latin1's 0x88 (ˆ) maps to:
forum discussion
Graphemica

Cautions


If you have FOREIGN KEY CONSTRAINTs on fields you are 'fixing', you will probably mess up the constraints. It might work to disable FK constraints, fix both tables, then re-enable them.

If you have FOREIGN KEY CONSTRAINTs on fields other than the ones you are 'fixing', it is probably best to disable/reenable them.

If your table is huge, and/or you have lots of indexes, the ALTERs can be very time consuming. It may be better to disable/drop the indexes during the 'fix', then reapply them at the end. Do not drop an AUTO_INCREMENT index. In the case of InnoDB, do not drop the PRIMARY KEY.

Introspection


To see what charsets and collations exist in your version of MySQL:
    SHOW CHARSET;  -- only latin1 and utf8/utf8mb4 are likely to be interesting
    SHOW COLLATION;  -- list the possible values.
    SHOW COLLATION LIKE 'utf8%';
    SHOW COLLATION LIKE '%_cs';
    SHOW SESSION VARIABLES LIKE 'character_set%';
    SHOW GLOBAL  VARIABLES LIKE 'character_set%';
    SHOW CREATE TABLE tbl \G

Collation name suffixes
    ⚈  _ci -- case insensitive; this includes ignoring accents
    ⚈  _cs -- case sensitive (and accent sensitive)
    ⚈  _bin -- just compare the bits

my.cnf


The following lines in your my.cnf initialize various things in various contexts. (This is untested.) Note: init-connect is not executed for users with the SUPER privilege:
    [client]
    default-character-set = cp1252

    [mysqld]
    skip-character-set-client-handshake
    default-character-set = cp1252
    character-set-server = cp1252
    collation-server = latin1_general_cs
    init-connect = SET NAMES cp1252

    [mysqldump]
    default-character-set = cp1252

    [mysql]
    default-character-set = cp1252

In my.ini:
    skip-character-set-client-handshake
    collation_server = utf8_unicode_ci
    character_set_server = utf8
This may speed up connections, and avoid the need for SET NAMES.

Stored Procedures/Functions, Triggers


ALTER DATABASE dbname CHARACTER SET ... can impact already compiled Stored Procedures.
Recompile them.

There may be other charset issues. In particular, a routine seems to have a charset, but that declaration seems to have no effect.

utf8 Collations


Which to pick? Why?
    ⚈  utf8_bin -- just looks at bits; no case folding, etc
    ⚈  utf8_general_ci -- case and accent folding.
    ⚈  utf8_unicode_ci -- Perhaps best
    ⚈  utf8_..._ci -- If your application is entirely (or mostly) in one language, this may be better for you. See the chart below.
Notes on utf8_general_ci vs utf8_unicode_ci
    ⚈  utf8_general_ci is a very simple collation. It just removes all accents, then converts to upper case
    ⚈  utf8_unicode_ci supports so called expansions and ligatures, for example: German letter ß (U+00DF LETTER SHARP S) is sorted near "ss"; letter Œ (U+0152 LATIN CAPITAL LIGATURE OE) is sorted near "OE".
    ⚈  utf8_general_ci does not support expansions/ligatures; it sorts all these letters as single characters, sometimes in the wrong order.
    ⚈  utf8_unicode_ci is generally more accurate for all scripts.
    ⚈  For Cyrillic: utf8_unicode_ci is fine for these languages: Russian, Bulgarian, Belarusian, Macedonian, Serbian, and Ukrainian. utf8_general_ci is fine only for the Russian and Bulgarian subset of Cyrillic; the extra letters used in Belarusian, Macedonian, Serbian, and Ukrainian are not sorted well.
    ⚈  The disadvantage of utf8_unicode_ci is that it is a little bit slower than utf8_general_ci.
    ⚈  from
StackOverflow

utf8 Collation Variations:


Here's a comparison of the collation of the alphabet characters for the utf8 collations you might consider.
    ⚈  spanish2 puts ch after cz and ll after lz and Ñ,ñ after nz
    ⚈  utf8_general_ci treats sharp-s as 's'; the rest treat it as 'ss'.
    ⚈  Ligature (Æ, etc) collation varies.
    ⚈  Eth (Ð) is treated like a letter between D and E.
    ⚈  utf8_estonian_ci does some major shuffling (Z between S and T, etc).
    ⚈  Danish sorts 'aa' (and several other things) after z.
utf8 : utf8_general_ci                A=a=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å    Aa  ae  az
utf8 : utf8_general_mysql500_ci       A=a=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å    Aa  ae  az
utf8 : utf8_danish_ci                 A=a=ª=À=Á=Â=Ã=à=á=â=ã              ae  az
utf8 : utf8_swedish_ci                A=a=ª=À=Á=Â=Ã=à=á=â=ã          Aa  ae  az
utf8 : utf8_estonian_ci               A=a=ª=À=Á=Â=Ã=Å=à=á=â=ã=å      Aa  ae  az  Æ=æ
utf8 : utf8_unicode_ci                A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å  Aa  ae  az  Æ=æ
utf8 : utf8_polish_ci                 A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å  Aa  ae  az  Æ=æ
utf8 : utf8_spanish_ci                A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å  Aa  ae  az  Æ=æ
utf8 : utf8_spanish2_ci               A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å  Aa  ae  az  Æ=æ

utf8 : utf8_general_ci                B=b  C=c=Ç=ç      ch  cz  D=d  dz
utf8 : utf8_general_mysql500_ci       B=b  C=c=Ç=ç      ch  cz  D=d  dz
utf8 : utf8_danish_ci                 B=b  C=c=Ç=ç      ch  cz  D=d  dz  Ð=ð
utf8 : utf8_swedish_ci                B=b  C=c=Ç=ç      ch  cz  D=d  dz  Ð=ð
utf8 : utf8_estonian_ci               B=b  C=c=Ç=ç      ch  cz  D=d  dz  Ð=ð
utf8 : utf8_unicode_ci                B=b  C=c=Ç=ç      ch  cz  D=d  dz  Ð=ð
utf8 : utf8_polish_ci                 B=b  C=c=Ç=ç      ch  cz  D=d  dz  Ð=ð
utf8 : utf8_spanish_ci                B=b  C=c=Ç=ç      ch  cz  D=d  dz  Ð=ð
utf8 : utf8_spanish2_ci               B=b  C=c=Ç=ç  cz  ch      D=d  dz  Ð=ð

utf8 : utf8_general_ci                E=e=È=É=Ê=Ë=è=é=ê=ë=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_general_mysql500_ci       E=e=È=É=Ê=Ë=è=é=ê=ë=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_danish_ci                 E=e=È=É=Ê=Ë=è=é=ê=ë=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_swedish_ci                E=e=È=É=Ê=Ë=è=é=ê=ë=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_estonian_ci               E=e=È=É=Ê=Ë=è=é=ê=ë=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_unicode_ci                E=e=È=É=Ê=Ë=è=é=ê=ë=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_polish_ci                 E=e=È=É=Ê=Ë=è=é=ê=ë=ĕ=Ė=ė=Ě=ě  Ę=ę
utf8 : utf8_spanish_ci                E=e=È=É=Ê=Ë=è=é=ê=ë=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_spanish2_ci               E=e=È=É=Ê=Ë=è=é=ê=ë=ĕ=Ė=ė=Ę=ę=Ě=ě

utf8 : utf8_general_ci                F=f  fz     G=g  H=h  hz
utf8 : utf8_general_mysql500_ci       F=f  fz     G=g  H=h  hz
utf8 : utf8_danish_ci                 F=f  fz  ƒ  G=g  H=h  hz
utf8 : utf8_swedish_ci                F=f  fz  ƒ  G=g  H=h  hz
utf8 : utf8_estonian_ci               F=f  fz  ƒ  G=g  H=h  hz
utf8 : utf8_unicode_ci                F=f  fz  ƒ  G=g  H=h  hz
utf8 : utf8_polish_ci                 F=f  fz  ƒ  G=g  H=h  hz
utf8 : utf8_spanish_ci                F=f  fz  ƒ  G=g  H=h  hz
utf8 : utf8_spanish2_ci               F=f  fz  ƒ  G=g  H=h  hz

utf8 : utf8_general_ci                I=i=Ì=Í=Î=Ï=ì=í=î=ï  ij    iz
utf8 : utf8_general_mysql500_ci       I=i=Ì=Í=Î=Ï=ì=í=î=ï  ij    iz
utf8 : utf8_danish_ci                 I=i=Ì=Í=Î=Ï=ì=í=î=ï  ij=ij  iz
utf8 : utf8_swedish_ci                I=i=Ì=Í=Î=Ï=ì=í=î=ï  ij=ij  iz
utf8 : utf8_estonian_ci               I=i=Ì=Í=Î=Ï=ì=í=î=ï  ij=ij  iz
utf8 : utf8_unicode_ci                I=i=Ì=Í=Î=Ï=ì=í=î=ï  ij=ij  iz
utf8 : utf8_polish_ci                 I=i=Ì=Í=Î=Ï=ì=í=î=ï  ij=ij  iz
utf8 : utf8_spanish_ci                I=i=Ì=Í=Î=Ï=ì=í=î=ï  ij=ij  iz
utf8 : utf8_spanish2_ci               I=i=Ì=Í=Î=Ï=ì=í=î=ï  ij=ij  iz

utf8 : utf8_general_ci                J=j  K=k  L=l  lj            ll  lz       M=m  N=n=Ñ=ñ  nz
utf8 : utf8_general_mysql500_ci       J=j  K=k  L=l  lj            ll  lz       M=m  N=n=Ñ=ñ  nz
utf8 : utf8_danish_ci                 J=j  K=k  L=l  lj=LJ=Lj=lj      ll  lz  Ł=ł  M=m  N=n=Ñ=ñ  nz
utf8 : utf8_swedish_ci                J=j  K=k  L=l  lj=LJ=Lj=lj      ll  lz  Ł=ł  M=m  N=n=Ñ=ñ  nz
utf8 : utf8_estonian_ci               J=j  K=k  L=l  lj=LJ=Lj=lj      ll  lz  Ł=ł  M=m  N=n=Ñ=ñ  nz
utf8 : utf8_unicode_ci                J=j  K=k  L=l  lj=LJ=Lj=lj      ll  lz  Ł=ł  M=m  N=n=Ñ=ñ  nz
utf8 : utf8_polish_ci                 J=j  K=k  L=l  lj=LJ=Lj=lj      ll  lz  Ł=ł  M=m  N=n=Ñ=ñ  nz
utf8 : utf8_spanish_ci                J=j  K=k  L=l  lj=LJ=Lj=lj      ll  lz  Ł=ł  M=m  N=n      nz  Ñ=ñ
utf8 : utf8_spanish2_ci               J=j  K=k  L=l  lj=LJ=Lj=lj  lz  ll      Ł=ł  M=m  N=n      nz  Ñ=ñ

utf8 : utf8_general_ci                O=o=Ò=Ó=Ô=Õ=Ö=ò=ó=ô=õ=ö    oe      oz
utf8 : utf8_general_mysql500_ci       O=o=Ò=Ó=Ô=Õ=Ö=ò=ó=ô=õ=ö    oe      oz
utf8 : utf8_danish_ci                 O=o=º=Ò=Ó=Ô=Õ=ò=ó=ô=õ      oe=Œ=œ  oz
utf8 : utf8_swedish_ci                O=o=º=Ò=Ó=Ô=Õ=ò=ó=ô=õ      oe=Œ=œ  oz
utf8 : utf8_estonian_ci               O=o=º=Ò=Ó=Ô=ò=ó=ô          oe=Œ=œ  oz       Ø=ø
utf8 : utf8_unicode_ci                O=o=º=Ò=Ó=Ô=Õ=Ö=ò=ó=ô=õ=ö  oe=Œ=œ  oz       Ø=ø
utf8 : utf8_polish_ci                 O=o=º=Ò=Ô=Õ=Ö=ò=ô=õ=ö      oe=Œ=œ  oz  Ó=ó  Ø=ø
utf8 : utf8_spanish_ci                O=o=º=Ò=Ó=Ô=Õ=Ö=ò=ó=ô=õ=ö  oe=Œ=œ  oz       Ø=ø
utf8 : utf8_spanish2_ci               O=o=º=Ò=Ó=Ô=Õ=Ö=ò=ó=ô=õ=ö  oe=Œ=œ  oz       Ø=ø

utf8 : utf8_general_ci                P=p  Q=q  R=r  S=s=ß=Š=š  sh  ss    sz  T=t  TM=tm    tz
utf8 : utf8_general_mysql500_ci       P=p  Q=q  R=r  S=s=Š=š    sh  ss    sz  T=t  TM=tm    tz
utf8 : utf8_danish_ci                 P=p  Q=q  R=r  S=s=Š=š    sh  ss=ß  sz  T=t  TM=tm=™  tz
utf8 : utf8_swedish_ci                P=p  Q=q  R=r  S=s=Š=š    sh  ss=ß  sz  T=t  TM=tm=™  tz
utf8 : utf8_estonian_ci               P=p  Q=q  R=r  S=s        sh  ss=ß  sz  Š=š  Z=z  zh  zz  Ž=ž  T=t  TM=tm=™  tz
utf8 : utf8_unicode_ci                P=p  Q=q  R=r  S=s=Š=š    sh  ss=ß  sz  T=t  TM=tm=™  tz
utf8 : utf8_polish_ci                 P=p  Q=q  R=r  S=s=Š=š    sh  ss=ß  sz  T=t  TM=tm=™  tz
utf8 : utf8_spanish_ci                P=p  Q=q  R=r  S=s=Š=š    sh  ss=ß  sz  T=t  TM=tm=™  tz
utf8 : utf8_spanish2_ci               P=p  Q=q  R=r  S=s=Š=š    sh  ss=ß  sz  T=t  TM=tm=™  tz

utf8 : utf8_general_ci                U=u=Ù=Ú=Û=Ü=ù=ú=û=ü  ue  uz
utf8 : utf8_general_mysql500_ci       U=u=Ù=Ú=Û=Ü=ù=ú=û=ü  ue  uz
utf8 : utf8_danish_ci                 U=u=Ù=Ú=Û=ù=ú=û      ue  uz
utf8 : utf8_swedish_ci                U=u=Ù=Ú=Û=ù=ú=û      ue  uz
utf8 : utf8_estonian_ci               U=u=Ù=Ú=Û=ù=ú=û      ue  uz
utf8 : utf8_unicode_ci                U=u=Ù=Ú=Û=Ü=ù=ú=û=ü  ue  uz
utf8 : utf8_polish_ci                 U=u=Ù=Ú=Û=Ü=ù=ú=û=ü  ue  uz
utf8 : utf8_spanish_ci                U=u=Ù=Ú=Û=Ü=ù=ú=û=ü  ue  uz
utf8 : utf8_spanish2_ci               U=u=Ù=Ú=Û=Ü=ù=ú=û=ü  ue  uz

utf8 : utf8_general_ci                V=v  W=w
utf8 : utf8_general_mysql500_ci       V=v  W=w
utf8 : utf8_danish_ci                 V=v  W=w
utf8 : utf8_swedish_ci                V=v  W=w
utf8 : utf8_estonian_ci               V=v  W=w  Õ=õ  Ä=ä  Ö=ö  Ü=ü
utf8 : utf8_unicode_ci                V=v  W=w
utf8 : utf8_polish_ci                 V=v  W=w
utf8 : utf8_spanish_ci                V=v  W=w
utf8 : utf8_spanish2_ci               V=v  W=w

utf8 : utf8_general_ci                X=x  Y=y=Ý=ý=ÿ=Ÿ      yz  Z=z=Ž=ž  zh  zz  Æ=æ  Ð=ð  Ø=ø  Þ=þ     Ł=ł  Œ=œ  ƒ  LJ=Lj=lj
utf8 : utf8_general_mysql500_ci       X=x  Y=y=Ý=ý=ÿ=Ÿ      yz  Z=z=Ž=ž  zh  zz  Æ=æ  Ð=ð  Ø=ø  Þ=þ  ß  Ł=ł  Œ=œ  ƒ  LJ=Lj=lj
utf8 : utf8_danish_ci                 X=x  Y=y=Ü=Ý=ü=ý=ÿ=Ÿ  yz  Z=z=Ž=ž  zh  zz       Ä=Æ=ä=æ  Ö=Ø=ö=ø  Aa=Å=å  Þ=þ
utf8 : utf8_swedish_ci                X=x  Y=y=Ü=Ý=ü=ý=ÿ=Ÿ  yz  Z=z=Ž=ž  zh  zz  Å=å  Ä=Æ=ä=æ  Ö=Ø=ö=ø          Þ=þ
utf8 : utf8_estonian_ci               X=x  Y=y=Ý=ý=ÿ=Ÿ      yz                   Þ=þ
utf8 : utf8_unicode_ci                X=x  Y=y=Ý=ý=ÿ=Ÿ      yz  Z=z=Ž=ž  zh  zz  Þ=þ
utf8 : utf8_polish_ci                 X=x  Y=y=Ý=ý=ÿ=Ÿ      yz  Z=z=Ž=ž  zh  zz  Þ=þ
utf8 : utf8_spanish_ci                X=x  Y=y=Ý=ý=ÿ=Ÿ      yz  Z=z=Ž=ž  zh  zz  Þ=þ
utf8 : utf8_spanish2_ci               X=x  Y=y=Ý=ý=ÿ=Ÿ      yz  Z=z=Ž=ž  zh  zz  Þ=þ

utf8mb4_% mostly works the same as utf8.

Unfortunately, it is not easy to separate case folding from accent stripping. Here is what someone has created:
utf8_distinct_ci

8-bit Encodings


The columns are decimal, latin1 hex, utf8 hex, char, htmlentity.
 128 80 E282AC € &#x80;
 129 81   C281  &#x81;
 130 82 E2809A ‚ &#x82;
 131 83   C692 ƒ &#x83;
 132 84 E2809E „ &#x84;
 133 85 E280A6 … &#x85;
 134 86 E280A0 † &#x86;
 135 87 E280A1 ‡ &#x87;
 136 88   CB86 ˆ &#x88;
 137 89 E280B0 ‰ &#x89;
 138 8A   C5A0 Š &#x8A;
 139 8B E280B9 ‹ &#x8B;
 140 8C   C592 Π&#x8C;
 141 8D   C28D  &#x8D;
 142 8E   C5BD Ž &#x8E;
 143 8F   C28F  &#x8F;
 144 90   C290  &#x90;
 145 91 E28098 ‘ &#x91;
 146 92 E28099 ’ &#x92;
 147 93 E2809C “ &#x93;
 148 94 E2809D ” &#x94;
 149 95 E280A2 • &#x95;
 150 96 E28093 – &#x96;
 151 97 E28094 — &#x97;
 152 98   CB9C ˜ &#x98;
 153 99 E284A2 ™ &#x99;
 154 9A   C5A1 š &#x9A;
 155 9B E280BA › &#x9B;
 156 9C   C593 œ &#x9C;
 157 9D   C29D  &#x9D;
 158 9E   C5BE ž &#x9E;
 159 9F   C5B8 Ÿ &#x9F;
 160 A0   C2A0   &nbsp;
 161 A1   C2A1 ¡ &iexcl;
 162 A2   C2A2 ¢ &cent;
 163 A3   C2A3 £ &pound;
 164 A4   C2A4 ¤ &curren;
 165 A5   C2A5 ¥ &yen;
 166 A6   C2A6 ¦ &brvbar;
 167 A7   C2A7 § &sect;
 168 A8   C2A8 ¨ &uml;
 169 A9   C2A9 © &copy;
 170 AA   C2AA ª &ordf;
 171 AB   C2AB « &laquo;
 172 AC   C2AC ¬ &not;
 173 AD   C2AD ­ &shy;
 174 AE   C2AE ® &reg;
 175 AF   C2AF ¯ &macr;
 176 B0   C2B0 ° &deg;
 177 B1   C2B1 ± &plusmn;
 178 B2   C2B2 ² &sup2;
 179 B3   C2B3 ³ &sup3;
 180 B4   C2B4 ´ &acute;
 181 B5   C2B5 µ &micro;
 182 B6   C2B6 ¶ &para;
 183 B7   C2B7 · &middot;
 184 B8   C2B8 ¸ &cedil;
 185 B9   C2B9 ¹ &sup1;
 186 BA   C2BA º &ordm;
 187 BB   C2BB » &raquo;
 188 BC   C2BC ¼ &frac14;
 189 BD   C2BD ½ &frac12;
 190 BE   C2BE ¾ &frac34;
 191 BF   C2BF ¿ &iquest;
 192 C0   C380 À &Agrave;
 193 C1   C381 Á &Aacute;
 194 C2   C382 Â &Acirc;
 195 C3   C383 Ã &Atilde;
 196 C4   C384 Ä &Auml;
 197 C5   C385 Å &Aring;
 198 C6   C386 Æ &AElig;
 199 C7   C387 Ç &Ccedil;
 200 C8   C388 È &Egrave;
 201 C9   C389 É &Eacute;
 202 CA   C38A Ê &Ecirc;
 203 CB   C38B Ë &Euml;
 204 CC   C38C Ì &Igrave;
 205 CD   C38D Í &Iacute;
 206 CE   C38E Î &Icirc;
 207 CF   C38F Ï &Iuml;
 208 D0   C390 Ð &ETH;
 209 D1   C391 Ñ &Ntilde;
 210 D2   C392 Ò &Ograve;
 211 D3   C393 Ó &Oacute;
 212 D4   C394 Ô &Ocirc;
 213 D5   C395 Õ &Otilde;
 214 D6   C396 Ö &Ouml;
 215 D7   C397 × &times;
 216 D8   C398 Ø &Oslash;
 217 D9   C399 Ù &Ugrave;
 218 DA   C39A Ú &Uacute;
 219 DB   C39B Û &Ucirc;
 220 DC   C39C Ü &Uuml;
 221 DD   C39D Ý &Yacute;
 222 DE   C39E Þ &THORN;
 223 DF   C39F ß &szlig;
 224 E0   C3A0 à &agrave;
 225 E1   C3A1 á &aacute;
 226 E2   C3A2 â &acirc;
 227 E3   C3A3 ã &atilde;
 228 E4   C3A4 ä &auml;
 229 E5   C3A5 å &aring;
 230 E6   C3A6 æ &aelig;
 231 E7   C3A7 ç &ccedil;
 232 E8   C3A8 è &egrave;
 233 E9   C3A9 é &eacute;
 234 EA   C3AA ê &ecirc;
 235 EB   C3AB ë &euml;
 236 EC   C3AC ì &igrave;
 237 ED   C3AD í &iacute;
 238 EE   C3AE î &icirc;
 239 EF   C3AF ï &iuml;
 240 F0   C3B0 ð &eth;
 241 F1   C3B1 ñ &ntilde;
 242 F2   C3B2 ò &ograve;
 243 F3   C3B3 ó &oacute;
 244 F4   C3B4 ô &ocirc;
 245 F5   C3B5 õ &otilde;
 246 F6   C3B6 ö &ouml;
 247 F7   C3B7 ÷ &divide;
 248 F8   C3B8 ø &oslash;
 249 F9   C3B9 ù &ugrave;
 250 FA   C3BA ú &uacute;
 251 FB   C3BB û &ucirc;
 252 FC   C3BC ü &uuml;
 253 FD   C3BD ý &yacute;
 254 FE   C3BE þ &thorn;
 255 FF   C3BF ÿ &yuml;


The 'missing' characters in the above table:
    ⚈  129
    ⚈  141 -- reverse line feed
    ⚈  143
    ⚈  144
    ⚈  157
    ⚈  173 -- Soft Hyphen; shows only when breaking across lines
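The utf8-hex column of the table can be regenerated with a short Python sketch. It uses cp1252 for the defined positions; MySQL maps the five undefined cp1252 codes to the C1 controls, which plain latin-1 supplies:

```python
def latin1_to_utf8_hex(code: int) -> str:
    """Map a MySQL-latin1 code point (128..255) to its utf8 hex, as in the table above."""
    b = bytes([code])
    try:
        ch = b.decode("cp1252")
    except UnicodeDecodeError:
        # 0x81, 0x8D, 0x8F, 0x90, 0x9D: undefined in cp1252;
        # MySQL treats them as C1 control characters.
        ch = b.decode("latin-1")
    return ch.encode("utf-8").hex().upper()

assert latin1_to_utf8_hex(0x80) == "E282AC"   # Euro sign
assert latin1_to_utf8_hex(0x81) == "C281"     # undefined in cp1252
assert latin1_to_utf8_hex(0xBB) == "C2BB"     # »
assert latin1_to_utf8_hex(0xFF) == "C3BF"     # ÿ
```

Running it over range(128, 256) reproduces the whole utf8-hex column.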

Turkish


The columns: utf8, unicode codepoint decimal, hex, HTML entity, 'name':
C387    0199    00C7    Ç  latin capital letter c with cedilla
C396    0214    00D6    Ö  latin capital letter o with diaeresis
C39C    0220    00DC    Ü  latin capital letter u with diaeresis
C3A7    0231    00E7    ç  latin small letter c with cedilla
C3B6    0246    00F6    ö  latin small letter o with diaeresis
C3BC    0252    00FC    ü  latin small letter u with diaeresis
C49E    0286    011E    Ğ  latin capital letter g with breve
C49F    0287    011F    ğ  latin small letter g with breve
C4B0    0304    0130    İ  latin capital letter i with dot above
C4B1    0305    0131    ı  latin small letter dotless i
C59E    0350    015E    Ş  latin capital letter s with cedilla
C59F    0351    015F    ş  latin small letter s with cedilla
E282A4  8356    20A4    ₤  lira sign
E282BA  8378    20BA    ₺  Turkish lira sign


More references


Various discussions on the MySQL forums:

Migrating from latin1 to utf-8
Convert database from utf8_general_ci to uft8_unicode_ci
Latin1 to unicode character convert -- data are display with ?????
Problem with accents
help in relating tables and normailsation
Select statement with town Polish characters
connection collation
cannot change character set
joined query problem
Slow retreieve under MyIsam
UTF-8 problem
trying to move a database from mysql 4.0.18 to 5.0.77
Server just hanging - Copying to tmp table
Encoding problem...
Unicode collation, utf8 data fetch/manip problem
What storage requirement has FLOAT(24) ? Doc is ambigous
what is the step to convert a db from latin1 to UTF8
UTF8 Chinese String Comparison
Special characters and database
Storing Hebrew strings Ok, but gibberish when pulling data on another server
Require suggestion for changing character set in Master master replication setup
Character encodings issues with php, mysql, apache
load data character set question
Storing UTF-8 strings in MySQL Database
utf8 inputs store as html entities, how to retrievesubstr of utf8 inputs?
Urdu case folding
Is STR_TO_DATE function doesn't work with UCS2 charset?

Miscellany:

Converting through BLOB

Wikipedia shows 0xD5 0xF5 iso upper/lower
shows utf8 c590 / c591
details several representations for 0150
discussion about how charsets are used

MySQL 5.5 & 4-byte utf8
MSDN discussion of Unicode
unicode.org -- The definitive word, but heavy going

mysqldump issues
Example of double-encoding and mysqldump
Windows, Unicode, LOAD DATA, little-endian, etc

Charbase - A visual unicode database

latin1 ~= cp1252 ~= ISO 8859-1


latin1 is the default character set. MySQL's latin1 is the same as the Windows cp1252 character set. This means it is the same as the official ISO 8859-1 or IANA (Internet Assigned Numbers Authority) latin1, except that IANA latin1 treats the code points between 0x80 and 0x9f as “undefined,” whereas cp1252, and therefore MySQL's latin1, assign characters for those positions. For example, 0x80 is the Euro sign. For the “undefined” entries in cp1252, MySQL translates 0x81 to Unicode 0x0081, 0x8d to 0x008d, 0x8f to 0x008f, 0x90 to 0x0090, and 0x9d to 0x009d.
-- an Oracle doc

Combining Diacriticals


Some characters can be encoded as either a single character or as a letter plus a "combining" accent. For example, ä can be represented in utf8 as either c3a4 or 61cc88. The ä as a single 'character' is c3a4. The other encoding is 61, the hex code for 'a', plus cc88, the utf8 code for the 'combining diaeresis' (umlaut, double dot, etc).
Combining Diacritical Marks

If the collation is utf8_general_ci, the two encodings are treated as unequal. With utf8_unicode_ci, they are equal. Searching database for strings with Swedish characters
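A Python sketch of the two encodings of ä, and of normalizing one to the other:

```python
import unicodedata

single = "\u00e4"        # ä as one precomposed character
combined = "a\u0308"     # 'a' followed by the combining diaeresis
assert single.encode("utf-8").hex() == "c3a4"
assert combined.encode("utf-8").hex() == "61cc88"
assert single != combined                                 # bytewise unequal, as utf8_general_ci sees them
assert unicodedata.normalize("NFC", combined) == single   # canonically equivalent, as utf8_unicode_ci sees them
```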

Open issues


    ⚈  What if you put a 5.1 slave off a 4.x master? (The German sharp-s could cause trouble.)
    ⚈  Replication type (SBR vs RBR)?

Random notes


To see what PHP has:
   $hex = unpack('H*', $text);
   echo implode('', $hex), "\n";

Vietnamese collations were added in MySQL 5.6.

A table declared to be latin1, and containing latin1 bytes can be converted to utf8 via
    ALTER TABLE tbl CONVERT TO CHARACTER SET utf8;
Example from forums

Using Replication to assist in charset conversion?

C# code to undo (while reading?) double-encoded strings
Similar code in SQL:
CONVERT(UNHEX(HEX(CONVERT(CONVERT(col USING utf8) USING latin1))), CHAR)
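The same repair sketched in Python (assuming the mojibake round-tripped through latin1; for characters in the 0x80-0x9F range, cp1252 may be needed instead):

```python
def undo_double_encoding(s: str) -> str:
    # Reinterpret the visible mojibake as latin1 bytes,
    # then decode those bytes as the utf8 they originally were.
    return s.encode("latin-1").decode("utf-8")

assert undo_double_encoding("Ã©") == "é"   # é, double-encoded then repaired
assert undo_double_encoding("Â»") == "»"
```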

utf8_general_ci will treat Résumé the same as resume: MYSQL doesn't 'see' accents
Cyrillic example:
title:       Добро пожаловать в Диксон,
hex:         D094D0BED0B1D180D0BE20D0BFD0BED0B6D0B0D0BBD0BED0B2D0B0D182D18C20D0B220D094D0B8D0BAD181D0BED0BD,
length:      47,
char length: 25
EXPLAIN EXTENDED SELECT ...;
SHOW WARNINGS;
sometimes gives extra clues into what charset/collation issues are being faced during a SELECT. It will probably contain 'introducers'; example: _utf8'abc'.

To ponder: convert(convert((binary col1) using latin1) using utf8)

phpMyAdmin might be connecting as "root", thereby bypassing init_connect = "SET NAMES utf8". This could defeat your attempts to be fully utf8. Possible workaround: Use a login that avoids SUPER priv. (This suggestion is good advice for all applications -- for security reasons.)
    # Find rows with any 8-bit chars:
    SELECT * FROM tbl WHERE HEX(colname) RLIKE '^(..)*[8-F].';

    # Remove all occurrences of one 8-bit code (hex 92, in this example):
    UPDATE tbl SET colname = REPLACE(colname, UNHEX('92'), '');
Reference: how to replace hex in varchar field

Converting latin1 hex to utf8 hex: HEX(CONVERT(CONVERT(UNHEX('F6') USING LATIN1) USING UTF8)) --> C3B6

em dash, dagger, and other special characters:

Adding a Character Set or Collation
(Note: The doc refers to version 5.5, but probably applies to all versions from 4.1 on.)

Connector/.NET possibly defaults to latin1. Consider changing 'connection string' if you need utf8.

Footnote


If this solves your problem, please let me know. My ego needs a pat on the back. If it does not solve your problem, please contact me; maybe I can pull a special rabbit out of my hat, just for you.

Original writing -- Sep 2010, after struggling with hundreds of questions both in-house and from the forums. Alas, this does not achieve the ultimate goal of "Given this symptom, do that". I hope it helps most people.

Revision date: 2012-08-26 (This document continues to evolve as more problems and examples are discovered.)

Revision date: 2013-02-12 (Misc, plus added 8-bit encodings)

-- RjamesYahoo -- 21 Sep 2010


Contact me by posting a question at
MySQL Forums :: Character Sets, Collation, Unicode
-- Rick James

Rick's MySQL Documents

MySQL Documents by Rick James

Tips, Debugging, HowTos, Optimizations, etc...

Rick's RoTs (Rules of Thumb -- lots of tips)
Memory Allocation (caching, etc)
Character Set and Collation problem solver
Converting from MyISAM to InnoDB -- includes differences between them
Big DELETEs - how to optimize
Compound INDEXes plus other insights into the mysteries of INDEXing
Partition Maintenance (DROP+REORG) for time series
Entity-Attribute-Value -- a common, poorly performing, design pattern; plus an alternative
Find the nearest 10 pizza parlors (efficient searching on Latitude + Longitude)
Alter of a Huge table
Latest 10 news articles -- how to optimize the schema and code for such
Pagination, not with OFFSET, LIMIT
Data Warehouse techniques (esp., Summary Tables)
Techniques on efficiently finding a random row (On beyond ORDER BY RAND())
GUID/UUID Performance (type 1 only)
IP Range Table Performance
MySQL Limits
Galera Limitations (with Percona XtraDB Cluster / MariaDB)
Rollup Unique User Counts
Best of MySQL Forum