MySQL Charset/Collate

Table of Contents

The problems being addressed
Basic Concepts
Best Practice
Conversions and Common Errors
When do bytes get translated?
Life Cycle of a Character
Disaster before INSERT
Diagnosing CHARSET issues
The manual
Viewing text
Entering accents in CMD
Example of Double-encoding
Fixing utf8 bytes in latin1 column
Fixing mismatch between CHARSET and data encoding
Fixing mix of charsets in a column
Fixing MicroSoft thingies
Fixing "double encoding"
Fix definition and data, IF correctly encoded
Testing an in-place fix
Fixing while moving data with mysqldump
Fixing while moving data with LOAD DATA
Conversion Gotchas
4-byte utf8
Functions to know about
BOM - Byte-Order-Mark
German "sharp-s"
Stored Procedures/Functions, Triggers
utf8 Collations
utf8 Collations Examples
utf8 Collation Variations:
8-bit Encodings
More references
latin1 ~= cp1252 ~= ISO 8859-1
Combining Diacriticals
Programatic Collation
Open issues
Random notes
Brought to you by Rick James

The problems being addressed

    ⚈  Your web page the wrong characters coming out where accented letters should be.
    ⚈  You are upgrading from MySQL 4.0, where characters were simply bytes.
    ⚈  You are trying to wrap your head around how MySQL handles CHARSET / COLLATION.
    ⚈  You upgraded, and now you are getting garbage.

Please read most of this document. I know it is long (after all, I wrote it!) But you really need to understand a lot of the pieces in it, in order to solve your problem.

Basic Concepts

"Character" and "byte" are different! You must understand this before continuing. A "byte" is an a-bit thing; it is the unit of space in computers (today). A "character" is composed of one or more bytes, and represents what we think of when reading.

A byte can represent only 256 different values. There are over 11,000 Korean characters and over 40,000 Chinese characters -- no way to squeeze such a character into a single byte.

Charset vs collation. These are different things! 'Charset' ('character set'; 'encoding') refers to the bits used to represent 'characters'. 'Collation' refers to how those bits could be compare for inequality (WHERE) and sorting (ORDER BY). GROUP BY and FOREIGN KEY CONSTRAINTS can also involve collation. And it even can involve deciding whether two different bit strings compare 'equal'.


1950's -- A character was (sometimes) represented in only 5 bits, on "paper tape"; no lowercase; had to "shift" to switch between letters and digits.
1960's -- A character was 6 bits; no lower case. "Punched cards" were used.
1970's -- 7-bit ASCII becomes common -- that limits you to English. And the 8-bit "byte" was invented and was coming into common usage (re: IBM 360).
1980's -- Beginning to see 8-bit encodings. Especially since 7-bit ASCII was wasting a bit of the omni-present "byte". This can handle Western European accented letters.
1990's -- The computer world realizes that there are other people in the world and embarks on Unicode and UTF8.

Meanwhile, MySQL is born, but has enough problems without worrying about character sets. Through version 4.0, a CHAR is just a byte. You can put any kind of bytes, representing anything, into a VARCHAR. Thus, begins the need for this discussion.

MySQL 4.1 introduced the concept of "character set" and "collation". If you had legacy data or legacy code, you probably did not notice that you were messing things up when you upgraded. Or you started with 4.1 (or later) and "latin1 / latin1_swedish_ci" and failed to notice that you were asking for trouble.

Today, it's pretty safe to simply lay down the law and say "Use utf8 for all text." If you have version 5.5.3 or later, "Use utf8mb4 for all text."

Sure, there are other character sets, and they are useful if you have a narrow focus. But, you may as well use utf8mb4.

Best Practice

Best practice is to go completely utf8mb4. I will focus on utf8 and utf8mb4, but if you choose to do otherwise, keep reading; most of this discussion can still be adapted to the charset of your choice.

A web page (if that is what you have) should begin with <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

Establish the characterset for talking to the MySQL server:
    ⚈  init_connect = 'SET NAMES utf8' in my.cnf
    ⚈  SET NAMES utf8 -- done immediately after connecting
    ⚈  SET GLOBAL VARIABLE character_set_... SET NAMES does the three you need.

For PHP:
    ⚈  mysql interface: mysql_set_charset('utf8'); (assuming PHP 5.2.3 & MySQL 5.0.7) Note: "mysql_" interface is deprecated.
    ⚈  mysqli interface: mysqli_set_charset('utf8') function. See
    ⚈  PDO interface: set the charset attribute of the PDO dsn.
(from #29 in Devshed's PHP tips)

SET NAMES can be invoked by your language just like other non-SELECT commands: mysqli_query(), do(), execute(), etc.

I digress here for a note about GRANTs. 'root', and any other user from GRANT ALL ON *.*, because it has "SUPER" privilege, will skip the init_connect, so that technique is not 'perfect'.


Declare most CHAR/TEXT columns in all tables as CHARSET utf8. Some exceptions: columns with hex (MD5, GUID), ... These should be BINARY or CHAR charset ASCII.


For true text, utf8_unicode_ci is best. It handles the complex rules of "combining chararacters", etc. It handles essentially all languages simultaneously, but compromises on ordering.

utf8_general_ci is the default for utf8, so you may accidently get this. It is a little faster than utf8_unicode_ci and works ok for many situations.

utf8_bin says to just compare the bytes. CHAR/TEXT utf8 with utf8_bin validates (on INSERT) that the bytes comprise valid utf8 bytes, but does not do anything useful for comparisions other than exact (no case folding, etc) equality. BINARY/BLOB should be usually be used instead CHAR+utf8; this stores the bytes without any checking.

Alas, MySQL's collations think of case folding and accent-stripping as equivalent. That is, there is no utf8 collation for case folding, but keeping accents distinct. Do SHOW COLLATION LIKE '%cs'; to see the few collations that work that way.

Conversions and Common Errors

This is a summary of how things work, and what is likely to go wrong.

There are 3 dimensions to character set problems:
    ⚈  How the client's bytes are encoded when INSERTing, and what encoding you want when SELECTing;
    ⚈  What SET NAMES you use (or what the default is);
    ⚈  The CHARACTER SET on the column definition.

All is well if the SET NAMES agrees with the encoding of the bytes in the Client. The column's CHARACTER SET need not agree with SET NAMES; if they differ, a conversion will be performed for you. If your characters exist in both encodings, the conversion will be transparent. If a source character does not exist in the target encoding (example: when converting a Chinese character from utf8mb4 to latin1), a "?" is usually put in its place.

The main thing that can go wrong is that SET NAMES can disagree with the Client's bytes. Let's walk an acute-e (é) through the INSERT and SELECT. First, note the encodings for é:
    ⚈  Hex for latin1's 1-byte encoding: E9
    ⚈  Hex for utf8's 2-byte encoding: C3A9
(BTW, this and some other encodings are enumerated below.)

Case 1: SET NAMES latin1 (the default) is in effect and your application is thinking in utf8. It INSERTs é encoded as utf8 (hex C3A9):
    ⚈  C3A9 will become two latin1 characters: 'é' "on the wire". (C3 represents à in latin1, etc)
    ⚈  If the CHARACTER SET for the column is latin1, then those 2 characters (2 bytes) are inserted.
    ⚈  If the CHARACTER SET for the column is utf8, then they are converted to utf8: Ã ⇒ C383 and © ⇒ C289; you insert 4 latin1 characters (4 bytes: C383C289) into the table.
Continuing the case, let's do a SELECT:
    ⚈  For utf8 column, the C383C289 is converted to latin1 é (C3A9) for sending across the wire to the client.
    ⚈  For latin1 column, no conversion is performed, so, again, C389 goes across the wire.
    ⚈  The client receives the two bytes C389, thinking it is é (because of SET NAMES). However, the rest of your application is thinking 'utf8', it sees it as é.
C383C289 is an example of what I call "double encoding" because of the two 'wrong' conversions from latin1 to utf8. The resulting SELECT (at least for European text) looks exactly like what you INSERTed. However, comparisions (WHERE x>y) and sorting (ORDER BY) and, in a few cases, equality (x=y) will not work as expected.

Now, let's flip the situation: Case 2: SET NAMES utf8, but you are inserting é encoded as latin1 (E9):
    ⚈  SET NAMES is falsely asserting that E9 is a valid utf8 encoding of something, which it is not.
    ⚈  The INSERT finishes (with a WARNING that you probably ignored), but the string is truncated before the é.
    ⚈  The CHARACTER SET of the column does not matter since the string is truncated.
    ⚈  A SELECT will get only the truncated string.

Case 3: One application INSERTs and another SELECTs, but they are using different SET NAMES. All sorts of messes can occur.

Case 4: You have an old latin1 table (possibly dating back to MySQL 4.0, which did not really have any CHARACTER SET concept) and you want to modernize it to utf8 or utf8mb4. First, you need to see what is in it. (A later section discusses using HEX to help.)
    ⚈  All ASCII -- no problem.
    ⚈  All latin1 (eg, E9 for é) -- Continuing to call it latin1 is fine, unless you anticipate adding some Asian text in that column.
    ⚈  All utf8 (eg, C3A9 for é) -- Trickier. See the 2-step ALTER below.
    ⚈  A mixture of latin1 and utf8 -- Warm up your resume.

What symptoms exist for the cases?

Case 1 is virtually symptomless. SELECT usually gets what was INSERTed; comparision problems may go unnoticed.

Case 2 is often recognized by having truncated strings.

Case 3 is a mess.

Case 4 is symptomless. However, if the conversion is not done properly, you can end up with some other case.

When do bytes get translated?

If you do all the "right things", you will accidentally have a "correctly" performing application. You need to understand when and how bytes are modified, in order to fix existing problems and to build your application correctly.

Short answer: You tell MySQL how the bytes in your application are encoded; you tell MySQL what encoding to use in your tables; MySQL converts as needed when INSERTing / SELECTing.

You tell MySQL about your application bytes by using SET NAMES or the underlying variables. You tell MySQL about the table bytes in CREATE TABLE.

Actually, the charset and collation of a column has a long, not too consistent, defaulting mechanism:
    ⚈  Default for system
    ⚈  Default for a database ...
    ⚈  Default for a table
    ⚈  Settings for a column
    ⚈  Fields/literals can be converted on the fly
    ⚈  A different collation can be specified in a WHERE clause...
To see what charset & collation are in effect for a column, use SHOW CREATE TABLE.

Life Cycle of a Character

Data ➡ Your client application ➡ INSERT into MySQL table. Simple? No.

You get the data from somewhere. Is the e-acute in "Beyoncé" represented in one latin1 byte, hex E9? Or is it the two-byte utf8 C3 A9? Not knowing which you have is the first step toward having a mess in the database. More in a minute.

Meanwhile, you have declared a column in a table in a database in MySQL to be utf8. (Collation is not matter in this discussion.)

Now you INSERT INTO Tbl (name) VALUES ("Beyoncé"). And the mess continues.

What that value being inserted encoded in latin1 or utf8? MySQL is happy to take either, and will gladly do any transcoding to store it properly in the table. But you have to tell it! The default may not be right!

Your application announces to MySQL the encoding via any of
    ⚈  The VARIABLE character_set_client
    ⚈  Executing SET NAMES
    ⚈  init-connect in my.cnf (my.ini) (Caution: root bypasses this)
I recommend you Execute SET NAMES in your application, right after connecting to the database. This reminds you, in your code, that you are establising for the connection.

Example of a mess: The data is in utf8, the connection settings are in latin1. The table is in utf8. So, MySQL converts each of C3 and A9 (which it was told was latin1) to utf8, leading to 4 bytes in the table: C3 83 C2 A9. I call this double-encoding. The problem was caused by lying (usually accidentally) to MySQL about what encoding was in the application's data.

Another example: The data was loaded in an old system in the following manner. The application got utf8 bytes and blindly stored them in a latin1 column (using the default latin1 settings for character_set% -- or it was done in 4.0, which had no character set concept). This left e-acute as 2 bytes in the table in a latin1 column. Then, you decide to 'do the right thing' and switch to utf8. Now what?

BIG5 example: You have an application that lets the user enter data into a form. This data is stored into a VARCHAR. All along the way, you use latin1. But the user connected from China with his browser set to BIG5. The latin1 column happily accepted the 2-byte Chinese characters. But your application reading the data has no clue of actual encoding. That same user, connecting in that same way, and rereading his data, will probably see the latin1 treated as BIG5, and be oblivious of the "two wrongs make a right" that happened. But display that text elsewhere, and you have a mess.

Disaster before INSERT

Most Mojibake starts even before the text is in the table. Let's dissect the steps of doing an INSERT.

Step 1 -- Get the data.

First you get the data from somewhere
    ⚈  Type it in, using a special keyboard, or keyboard tricks
    ⚈  Load it from a file
    ⚈  Create it from constants in your client programming language
    ⚈  Receive text from an html form

What do you have? You probably don't know. There is no single standard for what bytes are generated.

Step 2 -- INSERT

Let's walk through inserting a single character: e-acute. INSERT INTO tbl VALUES ('é'); or $str = ...; INSERT INTO tbl VALUES ('$str'); Well, this is really two steps.

Two likely cases for the encoding of e-acute:
    ⚈  latin1 (or code pages 1280, 1282): hex E9
    ⚈  utf8: 2 hex bytes: C3A9

Step 2a -- Send the command to the MySQL server

The bytes that make up the INSERT statement are sent from the client. Here enters the VARIABLE character_set_client. This tells the client what the bytes mean.

What is logically sent across the wire:
                           Client byte(s)
    character_set_client:    E9    C3A9
                   latin1    é     Ã ©
                     utf8          é
    ⚈  Upper-left and lower-right: The correct 'character' is sent.
    ⚈  Upper-right: Since you said the bytes are 'latin1', then C3A9 must be the two chars é.
    ⚈  Lower-left: Since E9 (and what follows) is not valid utf8, the string is truncated.

Step 2b -- Do the INSERT

The 'characters' on the wire are now transliterated to the desired encoding for the table.
                        On the wire
    column's charset:    é     Ã ©
               latin1    E9    C3A9
                 utf8    C3A9  C383C2A9
    ⚈  Left column: all is well.
    ⚈  Upper-right: Mojibake. Treated as latin1, C3A9 looks like é
    ⚈  Lower-right: "Double encoding" -- discussed at length below.

See also:
Thread with data entry issues.

Diagnosing CHARSET issues

You have some non-English characters that are showing up strangely, perhaps as a question mark (?), perhaps as two letters, when you were expecting one, such as ç is displayed as ç.

First, let's get the lay of the land:
| character_set_client     | latin1
| character_set_connection | latin1
| character_set_results    | latin1
(There will be other rows; these three are the important ones.) SET NAMES sets those three; this example shows the result of SET NAMES latin1, not the recommended setting. More later.

SHOW CREATE TABLE tbl will show the charset and collation for each column. More later.

You have data in a table, but it is garbled when displaying it. The first thing to do is to decide whether the data was stored correctly. Find a small cell that is bad, and do this:
You will get 4 columns:
    ⚈  x -- probably not useful, but may help confirm that you got the cell you wanted.
    ⚈  HEX(x) -- This lets us look at the encoding; more later.
    ⚈  LENGTH is the number of _bytes_.
    ⚈  CHAR_LENGTH is the number of _characters_. (Remember the distinction, above?)

LENGTH vs CHAR_LENGTH -- Are they equal? For correctly stored utf8, they should be not equal. Let's look at the two lengths of specific types of characters: Looking at a single utf8 character, would see:
    ⚈  ASCII: 1:1
    ⚈  LATIN1: 1:1
    ⚈  English in utf8: 1:1 -- Ascii is a subset of utf8
    ⚈  Western European accented letters: 2:1 -- one character is encoded in two bytes
    ⚈  Eastern Europe: also 2:1 (mostly)
    ⚈  East Asia: mostly 3:1
    ⚈  "double encoding" 4:1 (Europe) or 6:1 (Asian)
    ⚈  With utf8mb4, a few Chinese characters will be 4:1

A typical phrase in French might have LENGTH=20 and CHAR_LENGTH=17, implying that there were 17 characters, 3 of which had accents.

latin1 is happy to cram Western European characters into 1 byte.

"Double encoding" is a term I made up for the following situation. It is a case where "two wrongs make a right".
    ⚈  Table defined utf8
    ⚈  INSERTer declared latin1 (used default instead of doing SET NAMES)
    ⚈  Data being INSERTed is actually utf8 already.
What happened:
    ⚈  A 2-byte letter (say, a grave-e) was correctly represented in utf8.
    ⚈  The INSERT statement handed the 2 bytes to MySQL, but implied the need for conversion
    ⚈  Each byte was converted to utf8
    ⚈  The table has 4 bytes.
When SELECTing, the reverse happens 4->2->1, and the user is oblivious of the bug. Looking CHAR_LENGTH will spot it. Strange orderings may happen.
details of a double encoding
A good writeup on double-encoding

If you are moving the data using mysqldump, consider looking at the dump file via some hexdump utility. For experimenting, mysqldump lets you specify one database, one table, and constrain the rows via --where="...".

In the hexdump, you need to be able to recognize various byte patterns: In this, "yy" stands for the range 80-BF.
    ⚈  09,0A,0D -- The only control character you are likely to encounter (TAB, NL, CR)
    ⚈  20 -- space (handy to know when reading hex dump)
    ⚈  21-7E -- Ascii characters
    ⚈  C2yy -- symbols
    ⚈  C3yy -- Typical accented letters of Western Europe
    ⚈  Cxyy -- More Western Europe: Latin (C3-CA), Combining Diacritical Marks (CC-CD), Greek (CE-CF)
    ⚈  Dxyy -- Cyrillic (D0-D5), Hebrew (D7), Arabic/Persian/Farsi (D8-DB), etc
    ⚈  E0yyyy -- various Indian character sets, southern Asia, etc.
    ⚈  E1yyyy -- Cherokee, Balinese, Khmer, Mongolian, etc.
    ⚈  E2yyyy -- Symbols, Braille, etc
    ⚈  E381yy -- Hiragana (Japanese)
    ⚈  E383yy -- Katakana (Japanese)
    ⚈  Exyyyy -- E3-EA: Chinese, Japanese; EAB0-ED9E: Hangul (Korean)
    ⚈  EFAxyy -- CJK Extension A
    ⚈  EFACyy -- Hebrew (obscure)
    ⚈  EFBxyy -- Arabic (obscure)
    ⚈  F0yyyyyy -- 4-byte utf8. (See notes about utf8mb4)
    ⚈  F0Axyyyy -- CJK Extension B
Note, especially, the absence of 80-BF unless preceeded by some other 8-bit code.

If the data is not utf8, you may see
    ⚈  00xx -- Lots of these pairs could mean Ascii in UCS2. (Shift-JIS is also possible)
    ⚈  FFFE (or FEFF) at start of file (BOM) -- utf8 file (not cell in MySQL table)
    ⚈  00-1F -- control characters (mostly 0a, newline)
    ⚈  20 -- space (handy to know when reading hex dump)
    ⚈  21-7E -- Ascii characters
    ⚈  7F -- DEL; rare
    ⚈  80-FF -- "8-bit characters":
    ⚈  80-9F -- Possibly MicroSoft Word things (smart quotes, etc)
    ⚈  80-FF -- valid latin1 encodings, mostly European accented letters
    ⚈  A0 -- "hard space" -- mostly from MicroSoft
    ⚈  Cx not followed by yy (80-BF) cannot be utf8. This is usually a quick clue that it is not utf8.

Note: The Microsoft things really need to be fixed _before_ trying to store into a table. MicroSoft characters

In hex, "double encoding" will have a lot of "Cx yy Cx yy".

The manual

is that manual page that explains the interactions of the vital variables:
    ⚈  character_set_client -- the encoding in the client.
    ⚈  character_set_connection -- what to re-code characters into for client-to-server transfer
    ⚈  character_set_results -- what to re-code characters into for server-to-client transfer
The page also discusses collation_connection and collation_database, which have less impact.

Between that manual page and this document, I hope that you can understand things as they relate to your situation.

To see your current settings:
   SHOW VARIABLES LIKE 'character_set%';
   SHOW VARIABLES LIKE 'collation%';

Viewing text

Alas, CMD (Windows) and perhaps other terminal emulators cannot (or will not by default) display utf8 characters. This makes it difficult to debug. And it is confusing to see incorrect characters when everything else is working correctly.

Viewing text in a browser can be misleading in a different way. Browsers may "fix" things for you. For example, 'double encoded' characters may appear to be correct in the browser!

Bottom line: SELECT HEX() is about the only thing to trust.

Entering accents in CMD

As noted above, CMD in Windows is limited in its character set handing.

The following is a clumsy way to enter Western European characters. It depends on your knowing the hex or decimal values for the characters you want. Sorry, this won't handle Hebrew, Cyrillic, Chinese, etc.

Press and hold ALT, then
    ⚈  decimal digits for the code page, or
    ⚈  0, then decimal digits of the decimal Unicode
digits to get the extra characters in latin1. When you let go, the accented character will appear. This implies that you need SET NAMES latin1, not utf8, since you are creating latin1 bytes, not utf8 bytes in your 'client'. For the encodings, see
link1, or
    ⚈  link2.

info on how to enter Unicode, and about "code pages".

The command "chcp" controls the "code page". chcp 65001 is supposed to provide utf8, but it seems to need a special charset installed, too. some code pages
To set the font in the console window: Right-click on the title of the window → Properties → Font → pick Lucida Console

Example of Double-encoding

Desired string:  სახლი არის
Hex in table:
Correct utf8 hex:

C3A1C692C2A1 -- should be E183A1 -- Georgian SAN (U+10E1)  "ს"
20           -- space (same in utf8 and latin1)
Georgian, when properly encoded in utf8, should be 3 bytes per character:
Taking the first "character" (C3A1C692C2A1, in hex):
    ⚈  latin1 E1 = utf8 C3A1
    ⚈  latin1 83 = utf8 C692
    ⚈  latin1 A1 = utf8 C2A1
Notice how most of the letters look very similar: C3A1C692xxyy, when they should start with E183xx.

One pass over the data to convert each two bytes (4 hex) from utf8 to latin1:
   ) USING utf8) USING latin1))
Now to display as utf8:
SELECT UNHEX('E183A1E18390E183AEE1839AE1839820E18390E183A0E18398E183A1')
სახლი არის

(Ref: forum thread #418885.)

Another example of double-encoding: 'C398C2AF' should have been 'D8AF'
⇒ D8AF, which is the utf8 for Arabic DAL.
⇒ that character.
(Ref: forum thread #565669.)

Fixing utf8 bytes in latin1 column

CAVEAT! Test before using on Production!
# Set up bad case:
mysql> create table latin1 (c varchar(11) charset latin1);
mysql> insert into latin1 (c) values ('x'), (unhex('c2a3'));
mysql> select c, hex(c), length(c), char_length(c) from latin1;
| c    | hex(c) | length(c) | char_length(c) |
| x    | 78     |         1 |              1 |
| A£   | C2A3   |         2 |              2 |
# Step 1 of 2 of fix:
mysql> alter table latin1 modify c tinyblob;
mysql> select c, hex(c), length(c), char_length(c) from latin1;
| c    | hex(c) | length(c) | char_length(c) |
| x    | 78     |         1 |              1 |
| A£   | C2A3   |         2 |              2 |
# Step 2 of 2 of fix:
mysql> alter table latin1 modify c varchar(11) charset utf8;
mysql> select c, hex(c), length(c), char_length(c) from latin1;
| c    | hex(c) | length(c) | char_length(c) |
| x    | 78     |         1 |              1 |
| £    | C2A3   |         2 |              1 |

That is, ALTER to BLOB (of appropriate size), then ALTER to VARCHAR (of original size) and CHARSET utf8.

This technique should work for other cases of the right encoding sitting in a 5.1 table with the wrong CHARSET declaration.

One more note: Suggest you include the COLLATE clause on the the final ALTER that converts to VARCHAR.

Fixing mismatch between CHARSET and data encoding

The problem: cp850 encoding was inserted into a latin1 field.

The following will build an example, then walk through two solutions. The solutions should work for any mis-coded field where bytes for one charset had been stuffed into a column with a different charset declaration (usually latin1). This particular example stems from
Forum thread 616834

See also: cp850 encodings
latin1 and utf8 encodings
-- First, let's establish cp850 everywhere:
-- In Windows, do "chcp 850" before entering mysql.
-- Then get the mysql commandline to agree:
SET NAMES cp850;  -- Now CodePage and the following match
SHOW VARIABLES LIKE 'char%';  -- note the 3 cp850 entries

Solution A: Change CHARACTER SET.

Given: cp850 bytes are stored in a latin1 field. Goal: Leave the bytes alone, but change the column to be CHARACTER SET cp850.
-- Build sample table
    str VARCHAR(111) CHARACTER SET latin1

-- Populate with accented E's and I's:
-- Expect        D490D2D3DED6D7D8 (cp850 encoding)
-- Don't expect  C8C9CACBCCCDCECF (latin1)
-- Don't expect  C388C389C38AC38BC38CC38DC38EC38F (utf8)
SELECT UNHEX('D490D2D3DED6D7D8');  -- expect 'ÈÉÊËÌÍÎÏ'

INSERT INTO f616834A (str)  VALUES (UNHEX('D490D2D3DED6D7D8')) ;

SELECT str, HEX(str) FROM f616834A;
-- Expect gibberish (Ô?ÒÓÞÖ×Ø) and D490D2D3DED6D7D8
Notice how it looks wrong but it is encoded correctly. This is because the server/client, in doing the SELECT, is transcoding from latin1 (which it isn't) to cp850. (Some combination of "chcp latin1" and "SET NAMES latin1" would prevent the gibberish, but we don't want to go that direction since it would only mask the problem.)

The "?" in the gibberish is not really a question mark, but rather the result of an unconvertable character.

The ALTER must be done in two steps:

SELECT str, HEX(str) FROM f616834A;  -- Expect: ÈÉÊËÌÍÎÏ and D490D2D3DED6D7D8 (cp850)
The CHARACTER SET changed, but the data was not.

Solution B: Change Data Encoding

Given: cp850 bytes are stored in a latin1 field. Goal: Leave the column's CHARACTER SET alone, but change the bytes in the column.
SHOW VARIABLES LIKE 'char%';  -- cp850 and "chcp 850" must match at this point.
-- Build a test table:
    str VARCHAR(111) CHARACTER SET latin1

-- Populate with accented E's and I's:
SELECT @str := HEX('ÈÉÊËÌÍÎÏ');    -- Expect D490D2D3DED6D7D8 (cp850 encoding)
INSERT INTO f616834B (str)  VALUES (UNHEX('D490D2D3DED6D7D8')) ;

SELECT str, HEX(str) FROM f616834B;
-- Expect gibberish (Ô?ÒÓÞÖ×Ø) and D490D2D3DED6D7D8

UPDATE f616834B SET str = CONVERT(BINARY(str) USING cp850);
SELECT str, HEX(str) FROM f616834B;  -- Expect ÈÉÊËÌÍÎÏ and C8C9CACBCCCDCECF (latin1)
(A side note: CONVERT(UNHEX(HEX(str)) USING cp850) would also work.)

Fixing mix of charsets in a column

You let the users enter stuff in a UI form, and you stored it into a latin1 field. The stuff entered was in latin1, big5, utf8, symbols, etc.

You have a mess. Without deducing, row by row, what the intent was, you cannot fix it.

Fixing MicroSoft thingies

Perhaps best is to use the REPLACE() function.

Fixing "double encoding"

CAVEAT! Test before using on Production!

*** Plan A: Use UPDATE to fix in-place:
UPDATE Tbl SET col =
                                       USING binary)
                                       USING utf8),
                                    col )
Caution: I have only briefly tested this.

The IFNULL has the advantage of leaving intact most cells that are not double-encoded. (Maybe -- need further testing)

How does it work?
    1.  The first (innermost) CONVERT takes a 4-byte encoding and converts to 2 bytes.
    2.  Converting to binary and then to something else is a way to change the type without changing the bytes.
    3.  The final convert to utf8 prevents the assignment from further messing up
    4.  The last step (IFNULL) is to undo the operation if it ended up invalid utf8??

*** Plan B: Use ALTER:
mysql> # Create sample:
mysql> CREATE TABLE dbl (c VARCHAR(11) CHARSET latin1);
mysql> INSERT INTO dbl (c) VALUES ('x'), (UNHEX('C382C2A4'));
mysql> # Diagnose:
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl;
| c    | HEX(c)   | LENGTH(c) | CHAR_LENGTH(c) |
| x    | 78       |         1 |              1 |
| A,A☼ | C382C2A4 |         4 |              4 |
mysql> # Step 1 of 5:
Note:  Steps 1 and 2 are needed only if you start with latin1;
If your column is already utf8, skip to step 3.
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl;
| c    | HEX(c)   | LENGTH(c) | CHAR_LENGTH(c) |
| x    | 78       |         1 |              1 |
| A,A☼ | C382C2A4 |         4 |              4 |
mysql> # Step 2 of 5:
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl;
| c    | HEX(c)   | LENGTH(c) | CHAR_LENGTH(c) |
| x    | 78       |         1 |              1 |
| A☼   | C382C2A4 |         4 |              2 |
mysql> # Step 3 of 5:
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl;
| c    | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) |
| x    | 78     |         1 |              1 |
| A☼   | C2A4   |         2 |              2 |
mysql> # Step 4 of 5:
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl;
| c    | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) |
| x    | 78     |         1 |              1 |
| A☼   | C2A4   |         2 |              2 |
mysql> # Step 5 of 5:
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl;
| c    | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) |
| x    | 78     |         1 |              1 |
| ☼    | C2A4   |         2 |              1 |
mysql> # Check schema:
  `c` varchar(11) DEFAULT NULL

One more note: Suggest you include the COLLATE clause on the the final ALTER that converts to VARCHAR.

*** Plan C: Using mysqldump

(This has had only limited testing.)
    1. mysqldump --default-character-set=latin1 .... my_database > my_database_latin1.sql
    2. Edit my_database_latin1.sql to set NAMES=utf8 at the top.
    3. mysql ... < mydatabase.sqlmo
Thanks to Peter Barry

Fix definition and data, IF correctly encoded

You have a latin1 table containing latin1 encodings; you want utf8. (This is not the general thrust of this document, but is included for completeness, and how tempting looking the command is.)
mysql> # Create sample:
mysql> CREATE TABLE trycvt (c VARCHAR(11) CHARSET latin1);
mysql> INSERT INTO trycvt (c) VALUES ('x'), (UNHEX('BB'));
mysql> SHOW CREATE TABLE trycvt \G
CREATE TABLE `trycvt` (
  `c` varchar(11) CHARACTER SET latin1 DEFAULT NULL
mysql> # Diagnose:
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM trycvt;
| c    | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) |
| x    | 78     |         1 |              1 |
| »    | BB     |         1 |              1 |
mysql> # Step 1 of 1
mysql> ALTER TABLE trycvt CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
mysql> # Diagnose:
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM trycvt;
| c    | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) |
| x    | 78     |         1 |              1 |
| »    | C2BB   |         2 |              1 |
mysql> # Note schema changed:
mysql> SHOW CREATE TABLE trycvt \G
CREATE TABLE `trycvt` (
  `c` varchar(11) COLLATE utf8_unicode_ci DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

Note: TEXT will become MEDIUMTEXT, etc. This is because transcoding latin1 to utf8 usually lengthens the data. You may want to follow with a ALTER TABLE ... MODIFY c TEXT ..esume. to put it back.

Note about ALTER: Generally it is possible and faster to do all ALTERs of a table in a single ALTER;

Haidong Ji Blog


Testing an in-place fix

Some of the above fixes require live changes. Here is a way that might build confidence that it works:
    ⚈  find some rows with accents (etc) in them
    ⚈  create a table with the _same_ schema; copy (INSERT SELECT) the rows into them (INSERT ... SELECT)
    ⚈  Try one of the conversions on that table
    ⚈  Check the results -- now length and char_length should be different, and should differ by the appropriate amount. Fetch the rows in you application; see if they look right.

Fixing while moving data with mysqldump

You are copying the data from, say, 4.x server to a new, say, 5.1 server. It may be easiest to convert the data as you copy. You can test the results. If necessary, you can adjust and reapply the conversion.

      mysqldump -h old_host --no-data >schema.sql
      # now edit schema.sql to fix charset, collation, etc
      # edit in/out SET NAMES
      mysql -h new_host <schema.sql
      mysqldump -h old_host --no-create-info --opt >data.sql
      # edit in/out SET NAMES
      mysql -h new_host <data.sql

Notes (and things that may vary with our situation)
    ⚈  May need --default-character-set=utf8
    ⚈  Other issues?
    ⚈  (aside) Do not dump the database `information_schema`
    ⚈  (aside) Do not dump and reload `mysql` if you are moving between versions; there are usually schema changes that will cause trouble.

Fixing while moving data with LOAD DATA

    ⚈  SELECT INTO OUTFILE may create something useful for LOAD DATA.
    ⚈  Perhaps use mysqldump suggestions for creating schema
    ⚈  LOAD DATA ... CHARACTER SET utf8 ...; (as of 5.0.38 / 5.1.17)
    ⚈  Sorry, no experience in this approach.

Conversion Gotchas

    ⚈  DROP and re-add FULLTEXT indexes around changes.
    ⚈  DROP and re-add all(?) FOREIGN KEY CONSTRANTs; perhaps disabling will suffice
    ⚈  Collation changes could lead to duplicated keys
    ⚈  Collation changes could lead to broken VARCHAR FOREIGN KEY CONSTRANTs

4-byte utf8

Inserting a 4-byte utf8 character into a column with CHARACTER SET utf8 will cause the string to be truncated starting with the 4-byte character.

MySQL Versions 5.1 and before do not handle utf8 characters longer than 4 bytes. Version 5.5.3 (March, 2010) has a new CHARACTER SET, utf8mb4, for utf8 encodings up to 4 bytes.
Introducing utf8mb4
    ⚈  Upgrading, including sample ALTERs
    ⚈  'f0 a2 88 a2' showing up as 'd848 de22'
    ⚈  Graphemica
utf8mb4 in necessary for a small number of Chinese characters.

The MySQL changelog for 5.5.3 says "The Unicode implementation has been extended to provide support for supplementary characters that lie outside the Basic Multilingual Plane (BMP)."

If your product is going to store characters from Hong Kong, utf8 is not a good option since many characters in HKSCS are 4-byte utf8 characters. utf8mb4 would be better.

Good article on a variety of related issues such as SET NAMES, utf8mb4, collations, converting from utf8 to utf8mb4, index lengths, character-set-client-handshake, init-connect, Java, etc. is by Mathias Bynens
A lot of the meat is in the Comment section.

Much of this blog mentions utf8. Unless otherwise specified, utf8mb4 applies, too. Occasionally, "utf8" may refer to the standard, not specifically to MySQL's 3-byte subset.

Functions to know about

Since you may be using some of these against the source (old) database, be cautious of whether the function exists.

String functions
    ⚈  CHAR_LENGTH(field) -- num chars in field
    ⚈  LENGTH(field) -- num bytes in field
    ⚈  CHAR(123 USING latin1)
    ⚈  CHARSET(expression)
    ⚈  UPPER(_binary'Abc') = 'Abc' -- a subtle surprise (other string functions, too)
    ⚈  ORD(char) = ASCII(char) = number -- for the _char_, not the _byte_
    ⚈  HEX(string) -- use to see actual encoding
    ⚈  UNHEX('AABBCC') -- does not exist before 4.1.2
Convert functions
    ⚈  CONVERT('...' USING utf8)
    ⚈  CONVERT(blob_col USING utf8) -- no "USING" before 4.1.16
    ⚈  HEX(CAST(_ucs 0x00a0 as CHAR CHARACTER SET utf8)) -- show that Unicode 00a0 is utf8 C2A0
Literal introducer
    ⚈  _utf8'...'
    ⚈  mysql: mysql_set_charset('utf8');
    ⚈  mysqli: $mysqli_obj->set_charset('utf8');
    ⚈  PDO: $db = new PDO('dblib:host=host;dbname=db;charset=UTF-8', $user, $pwd);

Note: the use of mysqli_set_charaset / set_charset should obviate the need for SET NAMES.

If you change the collation of a field, ORDER BY cannot use any INDEX; hence it could be surprisingly inefficient.

BOM - Byte-Order-Mark


The UTF-8 representation of the BOM is the byte sequence 0xEF,0xBB,0xBF. A text editor or web browser interpreting the text as ISO-8859-1 or CP1252 will display the characters  for this.

If, instead, you see this hex: C3AFC2BBC2BF, then you are also suffering from double encoding.

If you encounter this (presumably at the start of a file), it implies that your editor is adding this, but the the reader of the file (eg, mysqldump) is does not know what to do with it. Consider using a different editor.

Error #1064 SQL syntax at line 1 - strange characters

German "sharp-s"

Version 5.1.24 introduced a change in the collation of the German "sharp-s".
bug 27877
(Apr. 2007) Starting with 5.1.24, 's' = 'ß' for utf8_general_ci. 'ss' = 'ß' for utf8_unicode_ci, which is 'correct' for German.

A nasty effect of this collation change occurs when a user upgrades past 5.1.24. If there were sharp-s characters in PRIMARY (or UNIQUE) KEYs, the reloading of the data may cause "duplicate key" errors.

To 'fix' the incompatibility, the collation utf8_general_mysql500_ci was added in MySQL 5.1.62 / 5.5.21 / 5.6.5. See ref page


More on what to do.


There are many versions of "circumflex":
unicode: 005E  = (^) utf8: 5E = CIRCUMFLEX ACCENT = SPACING CIRCUMFLEX
unicode: E005E =     utf8: F3A0819E = TAG CIRCUMFLEX ACCENT
This debates what latin1's 0x88 (ˆ) maps to:
forum discussion


If you have FOREIGN KEY CONSTRAINTs on fields you are 'fixing', you will probably mess up the constraints. It might work to disable FK constraints, fix both tables, then re-enable them.

If you have FOREIGN KEY CONSTRAINTs on fields other than the ones you are 'fixing', it is probably best to disable/reenable them.

If your table is huge, and/or you have lots of indexes, the ALTERs can be very time consuming. It may be better to disable/drop the indexes during the 'fix', then reapply them at the end. Do not drop an AUTO_INCREMENT index. In the case of InnoDB, do not drop the PRIMARY KEY.


To see what charsets and collations exist in your version of MySQL:
    SHOW CHARSET;  -- only latin1 and utf8/utf8mb4 are likely to be interesting
    SHOW COLLATION;  -- list the possible values.
    SHOW SESSION VARIABLES LIKE 'character_set%';
    SHOW GLOBAL  VARIABLES LIKE 'character_set%';

Collation name suffixes
    ⚈  _ci -- case insensitive; this includes ignoring accents
    ⚈  _cs -- case sensitive (and accent sensitive)
    ⚈  _bin -- just compare the bits


The following lines in your my.cnf initialize various things in various contexts. (It is untested.) It may fail to be recognized for SUPER user:
    default-character-set = cp1252

    default-character-set = cp1252
    character-set-server = cp1252
    collation-server = latin1_general_cs
    init-connect = SET NAMES cp1252

    default-character-set = cp1252

    default-character-set = cp1252

in my.ini: skip-character-set-client-handshake collation_server=utf8_unicode_ci character_set_server=utf8 This may speed up connections, and avoid need for SET NAMES.

Stored Procedures/Functions, Triggers

ALTER DATABASE dbname CHARACTER SET ... can impact already compiled Stored Procedures.
Recompile them.

There may be other charset issues. In particular, a routine seems to have a charset, but that declaration seems to have no effect.

utf8 Collations

Which to pick? Why?
    ⚈  utf8_bin -- just looks at bits; no case folding, etc
    ⚈  utf8_general_ci -- case and accent folding.
    ⚈  utf8_unicode_ci -- Perhaps best
    ⚈  utf8_..._ci -- If your application is entirely (or mostly) in one language, this may be better for you. See the chart below.
dou Notes on utf8_general_ci vs utf8_unicode_ci
    ⚈  utf8_general_ci is a very simple collation. What it does - it just - removes all accents - then converts to upper case
    ⚈  utf8_unicode_ci supports so called expansions and ligatures, for example: German letter ß (U+00DF LETTER SHARP S) is sorted near "ss"; letter Œ (U+0152 LATIN CAPITAL LIGATURE OE) is sorted near "OE".
    ⚈  utf8_general_ci does not support expansions/ligatures, it sorts all these letters as single characters, and sometimes in a wrong order.
    ⚈  utf8_unicode_ci is generally more accurate for all scripts.
    ⚈  For Cyrillic: utf8_unicode_ci is fine for these languages: Russian, Bulgarian, Belarusian, Macedonian, Serbian, and Ukrainian. While utf8_general_ci is fine only for Russian and Bulgarian subset of Cyrillic. Extra letters used in Belarusian, Macedonian, Serbian, and Ukrainian are sorted not well.
    ⚈  The disadvantage of utf8_unicode_ci is that it is a little bit slower than utf8_general_ci.
    ⚈  from

utf8 Collations Examples

What pairs of things will be treated equal? What order does a collation cause? How do I override a column's collation? (Caution: Doing so usually prevents effecient use of an INDEX.) Examples below show altering an ORDER BY clause and an "=" operator.

Here is test code for the results below:
drop table charutf8;

-- Create and populate a table with a "utf8_bin" column:
CREATE TABLE `charutf8` (
  `a` varchar(9) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    ('abc'), ('def'), ('ABC'), ('ábc'), ('́abc'), ('Abç'), ('ss'), ('ß');

-- Dump of characters, and lengths:
    FROM charutf8
    ORDER BY a;
-- CC81 is "U+0301: COMBINING ACUTE ACCENT" which,
-- together with 'a' is equal to 'á' in utf8_unicode_ci

-- Show that the ordering varies with collation:
SELECT GROUP_CONCAT(a ORDER BY a) AS 'bin order' FROM charutf8;
SELECT GROUP_CONCAT(a ORDER BY a COLLATE utf8_general_ci) AS 'general_ci order' FROM charutf8;
SELECT GROUP_CONCAT(a ORDER BY a COLLATE utf8_unicode_ci) AS 'unicode_ci order' FROM charutf8;

-- Self-JOIN to see what values are the same; collation-dependent:
SELECT CHAR_LENGTH(x.a) AS clen, LENGTH(x.a), x.a, GROUP_CONCAT(y.a) AS 'Equal with utf8_bin'
    FROM charutf8 x
    JOIN charutf8 y ON x.a = y.a
    GROUP BY x.a;
SELECT x.a, GROUP_CONCAT(y.a) AS 'Equal with utf8_general_ci'
    FROM charutf8 x
    JOIN charutf8 y ON x.a = y.a COLLATE utf8_general_ci
    GROUP BY x.a;
SELECT x.a, GROUP_CONCAT(y.a) AS 'Equal with utf8_unicode_ci'
    FROM charutf8 x
    JOIN charutf8 y ON x.a = y.a COLLATE utf8_unicode_ci
    GROUP BY x.a;

Results and conclusions:
-- Dump of characters, and lengths:
| a     | CHAR_LENGTH(a) | LENGTH(a) | HEX(a)     |
| ABC   |              3 |         3 | 414243     |
| Abç   |              3 |         4 | 4162C3A7   |
| abc   |              3 |         3 | 616263     |
| def   |              3 |         3 | 646566     |
| ss    |              2 |         2 | 7373       |
| ß     |              1 |         2 | C39F       |
| ábc   |              3 |         4 | C3A16263   |
| ́abc  |              4 |         5 | CC81616263 | -- the misalignment is due to the output not handling "combining" chars correctly
8 rows in set (0.00 sec)
-- CC81 is "U+0301: COMBINING ACUTE ACCENT" which,
-- together with 'a' is equal to 'á' in utf8_unicode_ci

-- Show that the ordering varies with collation:
| bin order                         |
| ABC,Abç,abc,def,ss,ß,ábc,́abc     | -- Notice how accents follow all letters
mysql> SELECT GROUP_CONCAT(a ORDER BY a COLLATE utf8_general_ci) AS 'general_ci order' FROM charutf8;
| general_ci order                  |
| ábc,abc,Abç,ABC,def,ß,ss,́abc     | -- Regular accents mixed in; Combining guy is not'
| unicode_ci order                  |
| ́abc,ábc,abc,Abç,ABC,def,ß,ss     | -- Combining accent sorted with its mates

-- Self-JOIN to see what values are the same; collation-dependent:
| a     | Equal with utf8_bin |
| ABC   | ABC                 | -- exact match for case and accents
| Abç   | Abç                 |
| abc   | abc                 |
| def   | def                 |
| ss    | ss                  |
| ß     | ß                   | -- ss and ß are NOT equal
| ábc   | ábc                 |
| ́abc  | ́abc                |
| a     | Equal with utf8_general_ci |
| ABC   | ábc,ABC,Abç,abc            | -- case folding and accent stripping
| Abç   | ábc,ABC,Abç,abc            |
| abc   | ábc,ABC,Abç,abc            |
| def   | def                        |
| ss    | ss                         | -- ss and ß are NOT equal
| ß     | ß                          |
| ábc   | ABC,Abç,abc,ábc            |
| ́abc  | ́abc                       | -- 'combining' NOT equal
| a     | Equal with utf8_unicode_ci |
| ABC   | ábc,ABC,́abc,Abç,abc       | -- case folding and accent stripping
| Abç   | abc,ábc,ABC,́abc,Abç       |
| abc   | Abç,abc,ábc,ABC,́abc       |
| def   | def                        |
| ss    | ss,ß                       | -- ss and ß are EQUAL
| ß     | ß,ss                       |
| ábc   | ábc,ABC,́abc,Abç,abc       |
| ́abc  | abc,ábc,ABC,́abc,Abç       | -- 'combining' EQUAL

utf8 Collation Variations:

Here's a comparison of the collation of the alphabet characters for the utf8 collations you might consider.
    ⚈  spanish2 puts ch after cz and ll after lz and Ñ,ñ after nz
    ⚈  utf8_general_ci treats sharp-s as 's'; the rest treat it as 'ss'.
    ⚈  Ligature (Æ, etc) collation varies.
    ⚈  Eth (Ð) is treated like a letter between D and E.
    ⚈  utf8_estonian_ci does some major shuffling (Z between S and T, etc).
    ⚈  Danish sorts 'aa' (and several other things) after z.
utf8 : utf8_general_ci                A=a=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å    Aa  ae  az
utf8 : utf8_general_mysql500_ci       A=a=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å    Aa  ae  az
utf8 : utf8_danish_ci                 A=a=ª=À=Á=Â=Ã=à=á=â=ã              ae  az
utf8 : utf8_swedish_ci                A=a=ª=À=Á=Â=Ã=à=á=â=ã          Aa  ae  az
utf8 : utf8_estonian_ci               A=a=ª=À=Á=Â=Ã=Å=à=á=â=ã=å      Aa  ae  az  Æ=æ
utf8 : utf8_unicode_ci                A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å  Aa  ae  az  Æ=æ
utf8 : utf8_polish_ci                 A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å  Aa  ae  az  Æ=æ
utf8 : utf8_spanish_ci                A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å  Aa  ae  az  Æ=æ
utf8 : utf8_spanish2_ci               A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å  Aa  ae  az  Æ=æ

utf8 : utf8_general_ci                B=b  C=c=Ç=ç      ch  cz  D=d  dz
utf8 : utf8_general_mysql500_ci       B=b  C=c=Ç=ç      ch  cz  D=d  dz
utf8 : utf8_danish_ci                 B=b  C=c=Ç=ç      ch  cz  D=d  dz  Ð=ð
utf8 : utf8_swedish_ci                B=b  C=c=Ç=ç      ch  cz  D=d  dz  Ð=ð
utf8 : utf8_estonian_ci               B=b  C=c=Ç=ç      ch  cz  D=d  dz  Ð=ð
utf8 : utf8_unicode_ci                B=b  C=c=Ç=ç      ch  cz  D=d  dz  Ð=ð
utf8 : utf8_polish_ci                 B=b  C=c=Ç=ç      ch  cz  D=d  dz  Ð=ð
utf8 : utf8_spanish_ci                B=b  C=c=Ç=ç      ch  cz  D=d  dz  Ð=ð
utf8 : utf8_spanish2_ci               B=b  C=c=Ç=ç  cz  ch      D=d  dz  Ð=ð

utf8 : utf8_general_ci                E=e=È=É=Ê=Ë=è=é=ê=ë=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_general_mysql500_ci       E=e=È=É=Ê=Ë=è=é=ê=ë=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_danish_ci                 E=e=È=É=Ê=Ë=è=é=ê=ë=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_swedish_ci                E=e=È=É=Ê=Ë=è=é=ê=ë=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_estonian_ci               E=e=È=É=Ê=Ë=è=é=ê=ë=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_unicode_ci                E=e=È=É=Ê=Ë=è=é=ê=ë=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_polish_ci                 E=e=È=É=Ê=Ë=è=é=ê=ë=ĕ=Ė=ė=Ě=ě  Ę=ę
utf8 : utf8_spanish_ci                E=e=È=É=Ê=Ë=è=é=ê=ë=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_spanish2_ci               E=e=È=É=Ê=Ë=è=é=ê=ë=ĕ=Ė=ė=Ę=ę=Ě=ě

utf8 : utf8_general_ci                F=f  fz     G=g  H=h  hz
utf8 : utf8_general_mysql500_ci       F=f  fz     G=g  H=h  hz
utf8 : utf8_danish_ci                 F=f  fz  ƒ  G=g  H=h  hz
utf8 : utf8_swedish_ci                F=f  fz  ƒ  G=g  H=h  hz
utf8 : utf8_estonian_ci               F=f  fz  ƒ  G=g  H=h  hz
utf8 : utf8_unicode_ci                F=f  fz  ƒ  G=g  H=h  hz
utf8 : utf8_polish_ci                 F=f  fz  ƒ  G=g  H=h  hz
utf8 : utf8_spanish_ci                F=f  fz  ƒ  G=g  H=h  hz
utf8 : utf8_spanish2_ci               F=f  fz  ƒ  G=g  H=h  hz

utf8 : utf8_general_ci                I=i=Ì=Í=Î=Ï=ì=í=î=ï  ij    iz
utf8 : utf8_general_mysql500_ci       I=i=Ì=Í=Î=Ï=ì=í=î=ï  ij    iz
utf8 : utf8_danish_ci                 I=i=Ì=Í=Î=Ï=ì=í=î=ï  ij=ij  iz
utf8 : utf8_swedish_ci                I=i=Ì=Í=Î=Ï=ì=í=î=ï  ij=ij  iz
utf8 : utf8_estonian_ci               I=i=Ì=Í=Î=Ï=ì=í=î=ï  ij=ij  iz
utf8 : utf8_unicode_ci                I=i=Ì=Í=Î=Ï=ì=í=î=ï  ij=ij  iz
utf8 : utf8_polish_ci                 I=i=Ì=Í=Î=Ï=ì=í=î=ï  ij=ij  iz
utf8 : utf8_spanish_ci                I=i=Ì=Í=Î=Ï=ì=í=î=ï  ij=ij  iz
utf8 : utf8_spanish2_ci               I=i=Ì=Í=Î=Ï=ì=í=î=ï  ij=ij  iz

utf8 : utf8_general_ci                J=j  K=k  L=l  lj            ll  lz       M=m  N=n=Ñ=ñ  nz
utf8 : utf8_general_mysql500_ci       J=j  K=k  L=l  lj            ll  lz       M=m  N=n=Ñ=ñ  nz
utf8 : utf8_danish_ci                 J=j  K=k  L=l  lj=LJ=Lj=lj      ll  lz  Ł=ł  M=m  N=n=Ñ=ñ  nz
utf8 : utf8_swedish_ci                J=j  K=k  L=l  lj=LJ=Lj=lj      ll  lz  Ł=ł  M=m  N=n=Ñ=ñ  nz
utf8 : utf8_estonian_ci               J=j  K=k  L=l  lj=LJ=Lj=lj      ll  lz  Ł=ł  M=m  N=n=Ñ=ñ  nz
utf8 : utf8_unicode_ci                J=j  K=k  L=l  lj=LJ=Lj=lj      ll  lz  Ł=ł  M=m  N=n=Ñ=ñ  nz
utf8 : utf8_polish_ci                 J=j  K=k  L=l  lj=LJ=Lj=lj      ll  lz  Ł=ł  M=m  N=n=Ñ=ñ  nz
utf8 : utf8_spanish_ci                J=j  K=k  L=l  lj=LJ=Lj=lj      ll  lz  Ł=ł  M=m  N=n      nz  Ñ=ñ
utf8 : utf8_spanish2_ci               J=j  K=k  L=l  lj=LJ=Lj=lj  lz  ll      Ł=ł  M=m  N=n      nz  Ñ=ñ

utf8 : utf8_general_ci                O=o=Ò=Ó=Ô=Õ=Ö=ò=ó=ô=õ=ö    oe      oz
utf8 : utf8_general_mysql500_ci       O=o=Ò=Ó=Ô=Õ=Ö=ò=ó=ô=õ=ö    oe      oz
utf8 : utf8_danish_ci                 O=o=º=Ò=Ó=Ô=Õ=ò=ó=ô=õ      oe=Œ=œ  oz
utf8 : utf8_swedish_ci                O=o=º=Ò=Ó=Ô=Õ=ò=ó=ô=õ      oe=Œ=œ  oz
utf8 : utf8_estonian_ci               O=o=º=Ò=Ó=Ô=ò=ó=ô          oe=Œ=œ  oz       Ø=ø
utf8 : utf8_unicode_ci                O=o=º=Ò=Ó=Ô=Õ=Ö=ò=ó=ô=õ=ö  oe=Œ=œ  oz       Ø=ø
utf8 : utf8_polish_ci                 O=o=º=Ò=Ô=Õ=Ö=ò=ô=õ=ö      oe=Œ=œ  oz  Ó=ó  Ø=ø
utf8 : utf8_spanish_ci                O=o=º=Ò=Ó=Ô=Õ=Ö=ò=ó=ô=õ=ö  oe=Œ=œ  oz       Ø=ø
utf8 : utf8_spanish2_ci               O=o=º=Ò=Ó=Ô=Õ=Ö=ò=ó=ô=õ=ö  oe=Œ=œ  oz       Ø=ø

utf8 : utf8_general_ci                P=p  Q=q  R=r  S=s=ß=Š=š  sh  ss    sz  T=t  TM=tm    tz
utf8 : utf8_general_mysql500_ci       P=p  Q=q  R=r  S=s=Š=š    sh  ss    sz  T=t  TM=tm    tz
utf8 : utf8_danish_ci                 P=p  Q=q  R=r  S=s=Š=š    sh  ss=ß  sz  T=t  TM=tm=™  tz
utf8 : utf8_swedish_ci                P=p  Q=q  R=r  S=s=Š=š    sh  ss=ß  sz  T=t  TM=tm=™  tz
utf8 : utf8_estonian_ci               P=p  Q=q  R=r  S=s        sh  ss=ß  sz  Š=š  Z=z  zh  zz  Ž=ž  T=t  TM=tm=™  tz
utf8 : utf8_unicode_ci                P=p  Q=q  R=r  S=s=Š=š    sh  ss=ß  sz  T=t  TM=tm=™  tz
utf8 : utf8_polish_ci                 P=p  Q=q  R=r  S=s=Š=š    sh  ss=ß  sz  T=t  TM=tm=™  tz
utf8 : utf8_spanish_ci                P=p  Q=q  R=r  S=s=Š=š    sh  ss=ß  sz  T=t  TM=tm=™  tz
utf8 : utf8_spanish2_ci               P=p  Q=q  R=r  S=s=Š=š    sh  ss=ß  sz  T=t  TM=tm=™  tz

utf8 : utf8_general_ci                U=u=Ù=Ú=Û=Ü=ù=ú=û=ü  ue  uz
utf8 : utf8_general_mysql500_ci       U=u=Ù=Ú=Û=Ü=ù=ú=û=ü  ue  uz
utf8 : utf8_danish_ci                 U=u=Ù=Ú=Û=ù=ú=û      ue  uz
utf8 : utf8_swedish_ci                U=u=Ù=Ú=Û=ù=ú=û      ue  uz
utf8 : utf8_estonian_ci               U=u=Ù=Ú=Û=ù=ú=û      ue  uz
utf8 : utf8_unicode_ci                U=u=Ù=Ú=Û=Ü=ù=ú=û=ü  ue  uz
utf8 : utf8_polish_ci                 U=u=Ù=Ú=Û=Ü=ù=ú=û=ü  ue  uz
utf8 : utf8_spanish_ci                U=u=Ù=Ú=Û=Ü=ù=ú=û=ü  ue  uz
utf8 : utf8_spanish2_ci               U=u=Ù=Ú=Û=Ü=ù=ú=û=ü  ue  uz

utf8 : utf8_general_ci                V=v  W=w
utf8 : utf8_general_mysql500_ci       V=v  W=w
utf8 : utf8_danish_ci                 V=v  W=w
utf8 : utf8_swedish_ci                V=v  W=w
utf8 : utf8_estonian_ci               V=v  W=w  Õ=õ  Ä=ä  Ö=ö  Ü=ü
utf8 : utf8_unicode_ci                V=v  W=w
utf8 : utf8_polish_ci                 V=v  W=w
utf8 : utf8_spanish_ci                V=v  W=w
utf8 : utf8_spanish2_ci               V=v  W=w

utf8 : utf8_general_ci                X=x  Y=y=Ý=ý=ÿ=Ÿ      yz  Z=z=Ž=ž  zh  zz  Æ=æ  Ð=ð  Ø=ø  Þ=þ     Ł=ł  Œ=œ  ƒ  LJ=Lj=lj
utf8 : utf8_general_mysql500_ci       X=x  Y=y=Ý=ý=ÿ=Ÿ      yz  Z=z=Ž=ž  zh  zz  Æ=æ  Ð=ð  Ø=ø  Þ=þ  ß  Ł=ł  Œ=œ  ƒ  LJ=Lj=lj
utf8 : utf8_danish_ci                 X=x  Y=y=Ü=Ý=ü=ý=ÿ=Ÿ  yz  Z=z=Ž=ž  zh  zz       Ä=Æ=ä=æ  Ö=Ø=ö=ø  Aa=Å=å  Þ=þ
utf8 : utf8_swedish_ci                X=x  Y=y=Ü=Ý=ü=ý=ÿ=Ÿ  yz  Z=z=Ž=ž  zh  zz  Å=å  Ä=Æ=ä=æ  Ö=Ø=ö=ø          Þ=þ
utf8 : utf8_estonian_ci               X=x  Y=y=Ý=ý=ÿ=Ÿ      yz                   Þ=þ
utf8 : utf8_unicode_ci                X=x  Y=y=Ý=ý=ÿ=Ÿ      yz  Z=z=Ž=ž  zh  zz  Þ=þ
utf8 : utf8_polish_ci                 X=x  Y=y=Ý=ý=ÿ=Ÿ      yz  Z=z=Ž=ž  zh  zz  Þ=þ
utf8 : utf8_spanish_ci                X=x  Y=y=Ý=ý=ÿ=Ÿ      yz  Z=z=Ž=ž  zh  zz  Þ=þ
utf8 : utf8_spanish2_ci               X=x  Y=y=Ý=ý=ÿ=Ÿ      yz  Z=z=Ž=ž  zh  zz  Þ=þ

utf8mb4_% mostly works the same as utf8.

Unfortunately, it is not easy to separate case folding from accent stripping. Here is what someone has created:

8-bit Encodings

The columns are decimal, latin1 hex, utf8 hex, char, htmlentity.
 128 80 E282AC € &#x80;
 129 81   C281  &#x81;
 130 82 E2809A ‚ &#x82;
 131 83   C692 ƒ &#x83;
 132 84 E2809E „ &#x84;
 133 85 E280A6 … &#x85;
 134 86 E280A0 † &#x86;
 135 87 E280A1 ‡ &#x87;
 136 88   CB86 ˆ &#x88;
 137 89 E280B0 ‰ &#x89;
 138 8A   C5A0 Š &#x8A;
 139 8B E280B9 ‹ &#x8B;
 140 8C   C592 Π&#x8C;
 141 8D   C28D  &#x8D;
 142 8E   C5BD Ž &#x8E;
 143 8F   C28F  &#x8F;
 144 90   C290  &#x90;
 145 91 E28098 ‘ &#x91;
 146 92 E28099 ’ &#x92;
 147 93 E2809C “ &#x93;
 148 94 E2809D ” &#x94;
 149 95 E280A2 • &#x95;
 150 96 E28093 – &#x96;
 151 97 E28094 — &#x97;
 152 98   CB9C ˜ &#x98;
 153 99 E284A2 ™ &#x99;
 154 9A   C5A1 š &#x9A;
 155 9B E280BA › &#x9B;
 156 9C   C593 œ &#x9C;
 157 9D   C29D  &#x9D;
 158 9E   C5BE ž &#x9E;
 159 9F   C5B8 Ÿ &#x9F;
 160 A0   C2A0   &nbsp;
 161 A1   C2A1 ¡ &iexcl;
 162 A2   C2A2 ¢ &cent;
 163 A3   C2A3 £ &pound;
 164 A4   C2A4 ¤ &curren;
 165 A5   C2A5 ¥ &yen;
 166 A6   C2A6 ¦ &brvbar;
 167 A7   C2A7 § &sect;
 168 A8   C2A8 ¨ &uml;
 169 A9   C2A9 © &copy;
 170 AA   C2AA ª &ordf;
 171 AB   C2AB « &laquo;
 172 AC   C2AC ¬ &not;
 173 AD   C2AD ­ &shy;
 174 AE   C2AE ® &reg;
 175 AF   C2AF ¯ &macr;
 176 B0   C2B0 ° &deg;
 177 B1   C2B1 ± &plusmn;
 178 B2   C2B2 ² &sup2;
 179 B3   C2B3 ³ &sup3;
 180 B4   C2B4 ´ &acute;
 181 B5   C2B5 µ &micro;
 182 B6   C2B6 ¶ &para;
 183 B7   C2B7 · &middot;
 184 B8   C2B8 ¸ &cedil;
 185 B9   C2B9 ¹ &sup1;
 186 BA   C2BA º &ordm;
 187 BB   C2BB » &raquo;
 188 BC   C2BC ¼ &frac14;
 189 BD   C2BD ½ &frac12;
 190 BE   C2BE ¾ &frac34;
 191 BF   C2BF ¿ &iquest;
 192 C0   C380 À &Agrave;
 193 C1   C381 Á &Aacute;
 194 C2   C382 Â &Acirc;
 195 C3   C383 Ã &Atilde;
 196 C4   C384 Ä &Auml;
 197 C5   C385 Å &Aring;
 198 C6   C386 Æ &AElig;
 199 C7   C387 Ç &Ccedil;
 200 C8   C388 È &Egrave;
 201 C9   C389 É &Eacute;
 202 CA   C38A Ê &Ecirc;
 203 CB   C38B Ë &Euml;
 204 CC   C38C Ì &Igrave;
 205 CD   C38D Í &Iacute;
 206 CE   C38E Î &Icirc;
 207 CF   C38F Ï &Iuml;
 208 D0   C390 Ð &ETH;
 209 D1   C391 Ñ &Ntilde;
 210 D2   C392 Ò &Ograve;
 211 D3   C393 Ó &Oacute;
 212 D4   C394 Ô &Ocirc;
 213 D5   C395 Õ &Otilde;
 214 D6   C396 Ö &Ouml;
 215 D7   C397 × &times;
 216 D8   C398 Ø &Oslash;
 217 D9   C399 Ù &Ugrave;
 218 DA   C39A Ú &Uacute;
 219 DB   C39B Û &Ucirc;
 220 DC   C39C Ü &Uuml;
 221 DD   C39D Ý &Yacute;
 222 DE   C39E Þ &THORN;
 223 DF   C39F ß &szlig;
 224 E0   C3A0 à &agrave;
 225 E1   C3A1 á &aacute;
 226 E2   C3A2 â &acirc;
 227 E3   C3A3 ã &atilde;
 228 E4   C3A4 ä &auml;
 229 E5   C3A5 å &aring;
 230 E6   C3A6 æ &aelig;
 231 E7   C3A7 ç &ccedil;
 232 E8   C3A8 è &egrave;
 233 E9   C3A9 é &eacute;
 234 EA   C3AA ê &ecirc;
 235 EB   C3AB ë &euml;
 236 EC   C3AC ì &igrave;
 237 ED   C3AD í &iacute;
 238 EE   C3AE î &icirc;
 239 EF   C3AF ï &iuml;
 240 F0   C3B0 ð &eth;
 241 F1   C3B1 ñ &ntilde;
 242 F2   C3B2 ò &ograve;
 243 F3   C3B3 ó &oacute;
 244 F4   C3B4 ô &ocirc;
 245 F5   C3B5 õ &otilde;
 246 F6   C3B6 ö &ouml;
 247 F7   C3B7 ÷ &divide;
 248 F8   C3B8 ø &oslash;
 249 F9   C3B9 ù &ugrave;
 250 FA   C3BA ú &uacute;
 251 FB   C3BB û &ucirc;
 252 FC   C3BC ü &uuml;
 253 FD   C3BD ý &yacute;
 254 FE   C3BE þ &thorn;
 255 FF   C3BF ÿ &yuml;

The 'missing' characters in the above table:
    ⚈  129
    ⚈  141
reverse line feed
    ⚈  143
    ⚈  144
    ⚈  157
    ⚈  173 shows when breaking across lines Soft Hyphen


The columns: utf8, unicode codepoint decimal, hex, HTML entity, 'name':
C387    0199    00C7    Ç  latin capital letter c with cedilla
C396    0214    00D6    Ö  latin capital letter o with diaeresis
C39C    0220    00DC    Ü  latin capital letter u with diaeresis
C3A7    0231    00E7    ç  latin small letter c with cedilla
C3B6    0246    00F6    ö  latin small letter o with diaeresis
C3BC    0252    00FC    ü  latin small letter u with diaeresis
C49E    0286    011E    Ğ  latin capital letter g with breve
C49F    0287    011F    ğ  latin small letter g with breve
C4B0    0304    0130    İ  latin capital letter i with dot above
C4B1    0305    0131    ı  latin small letter dotless i
C59E    0350    015E    Ş  latin capital letter s with cedilla
C59F    0351    015F    ş  latin small letter s with cedilla
E282A4  8356    20A4    ₤  lira sign
E282BA  8378    20BA    ₺  Turkish lira sign

More references

Various discussions on the MySQL forums:

Please note that any PHP code using the mysql_* API should be replaced by the mysqli_* API; the former is deprecated.

PHP/html/mysql usage
Migrating from latin1 to utf-8
Convert database from utf8_general_ci to uft8_unicode_ci
Latin1 to unicode character convert -- data are display with ?????
Problem with accents
help in relating tables and normailsation
Select statement with town Polish characters
connection collation
cannot change character set
joined query problem
Slow retreieve under MyIsam
UTF-8 problem
trying to move a database from mysql 4.0.18 to 5.0.77
Server just hanging - Copying to tmp table
Encoding problem...
Unicode collation, utf8 data fetch/manip problem
What storage requirement has FLOAT(24) ? Doc is ambigous
what is the step to convert a db from latin1 to UTF8
UTF8 Chinese String Comparison
Special characters and database
Storing Hebrew strings Ok, but gibberish when pulling data on another server
Require suggestion for changing character set in Master master replication setup
Character encodings issues with php, mysql, apache
load data character set question
Storing UTF-8 strings in MySQL Database
utf8 inputs store as html entities, how to retrievesubstr of utf8 inputs?
Urdu case folding
Is STR_TO_DATE function doesn't work with UCS2 charset?


Converting through BLOB

Wikipedia shows 0xD5 0xF5 iso upper/lower
shows utf8 c590 / c591
details several representations for 0150
discussion about how charsets are used

MySQL 5.5 & 4-byte utf8
MSDN discussion of Unicode -- The definitive word, but heavy going

mysqldump issues
Example of double-encoding and mysqldump
Windows, Unicode, LOAD DATA, little-endian, etc

Charbase - A visual unicode database

latin1 ~= cp1252 ~= ISO 8859-1

latin1 is the default character set. MySQL's latin1 is the same as the Windows cp1252 character set. This means it is the same as the official ISO 8859-1 or IANA (Internet Assigned Numbers Authority) latin1, except that IANA latin1 treats the code points between 0x80 and 0x9f as “undefined,” whereas cp1252, and therefore MySQL's latin1, assign characters for those positions. For example, 0x80 is the Euro sign. For the “undefined” entries in cp1252, MySQL translates 0x81 to Unicode 0x0081, 0x8d to 0x008d, 0x8f to 0x008f, 0x90 to 0x0090, and 0x9d to 0x009d.
-- an Oracle doc

Combining Diacriticals

Some characters can be encoded as either a single character or as a letter and a "combining" acceent. For example, ä can be represented in utf8 as either c3a4 or 61cc88. The ä as a single 'character' is c3a4. The other encoding is 61, the hex code for 'a', plus cc88 is the utf-8 code for the 'combining diaeresis' (umlaut, double dot, etc).
Combining Diacritical Marks

If the collation is utf8_general_ci, the two encodings are treated as unequal. With utf8_unicode_ci, they are equal. Searching database for strings with Swedish characters

Programatic Collation

Suppose you have need for a collation that cannot be performed by any existing collation. Create two columns instead of one; let's call them `exact` and `canonical`. `exact` contains the correctly spelled (caps, accents, etc) value, and it has a suitable collation. `canonical` is a programmatically-altered version of `exact`. This may include
    ⚈  stripping accents
    ⚈  folding to lowercase
    ⚈  removing redundant whitespace (blanks, newlines, tabs)
    ⚈  removing punctuation
    ⚈  removing markup (such as html )
    ⚈  it could even include providing alternate spellings or synonyms

`canonical` would also have an index, possibly FULLTEXT and/or utf8mb4_bin. `canonical` may have duplicates even though `exact` probably does not.

Then, when the 'user' searches, you do the same folding to "canonical" before searching `canonical`. Display from `exact`.


In LOAD DATA INFILE, the manual points out: "The server uses the character set indicated by the character_set_database system variable to interpret the information in the file. SET NAMES and the setting of character_set_client do not affect interpretation of input. If the contents of the input file use a character set that differs from the default, it is usually preferable to specify the character set of the file by using the CHARACTER SET clause. A character set of binary specifies 'no conversion'."

Open issues

    ⚈  What if you put a 5.1 slave off a 4.x master? (The German sharp-s could cause trouble.)
    ⚈  Replication type (SBR vs RBR)?

Random notes

To see what PHP has:
   $hex = unpack('H*', $text);
   echo implode('', $hex), "

Vietnamese collations were added in MySQL 5.6.

A table declared to be latin1, and containing latin1 bytes can be converted to utf8 via
Example from forums

Using Replication to assist in charset conversion?

C# code to undo (while reading?) double-encoded strings
Similar code in SQL:

utf8_general_ci, which will treat Résumé the same as resume MYSQL doesn't 'see' accents
Cyrillic example:
title:       Добро пожаловать в Диксон,
hex:         D094D0BED0B1D180D0BE20D0BFD0BED0B6D0B0D0BBD0BED0B2D0B0D182D18C20D0B220D094D0B8D0BAD181D0BED0BD,
length:      47,
char length: 25
sometimes gives extra clues into what charset/collation issues are being faced during a SELECT. It will probably contain 'introducers'; example: _utf8'abc'.

To ponder: convert(convert((binary col1) using latin1) using utf8)

phpMyAdmin might be connecting as "root", thereby bypassing init_connect = "SET NAMES utf8". This could defeat your attempts to be fully utf8. Possible workaround: Use a login that avoids SUPER priv. (This suggestion is good advice for all applications -- for security reasons.)
    # Find rows with any 8-bit chars:
    SELECT * FROM tbl WHERE HEX(colname) RLIKE '^(..)*[8-F].';

    # Remove all occurrences of one 8-bit code (hex 92, in this example):
    UPDATE tbl SET colname = REPLACE(colname, UNHEX('92'), '');
Reference: how to replace hex in varchar field

Converting latin1 hex to utf8 hex: HEX(CONVERT(CONVERT(UNHEX('F6') USING LATIN1) USING UTF8))C3B6

em dash, dagger, and other special characters:

Adding a Character Set or Collation
(Note: The doc refers to version 5.5, but probably applies to all versions from 4.1 on.)

Connector/.NET possibly defaults to latin1. Consider changing 'connection string' if you need utf8.
mysql> SELECT ';' semicolon, UNHEX('cdbe') 'Greek question mark';
| semicolon | Greek question mark |
| ;         | ;                   |

Suggested for my.cnf by stackoverflow


collation-server = utf8_unicode_ci
init-connect='SET NAMES utf8'
character-set-server = utf8

my.cnf: skip-character-set-client-handshake to pretend to be 4.0?


If this solves your problem, please let me know. My ego needs a pat on the back. If it does not solve your problem, please contact me, maybe I can pull as special rabbit out of my hat, just for you.

Original writing -- Sep 2010, after struggling with hundreds of questions both in-house and from the forums. Alas, this does not achieve the ultimate goal of "Given this symptom, do that". I hope it helps most people.

Revision date: 2012-08-26 (This document continues to evolve as more problems and examples are discovered.)

Revision date: 2013-02-12 (Misc, plus added 8-bit encodings)

-- RjamesYahoo -- 21 Sep 2010

Contact me by posting a question at
MySQL Forums :: Character Sets, Collation, Unicode
-- Rick James

Rick's MySQL Documents

MySQL Documents by Rick James

Tips, Debugging, HowTos, Optimizations, etc...

Rick's RoTs (Rules of Thumb -- lots of tips)
Memory Allocation (caching, etc)
Character Set and Collation problem solver
Converting from MyISAM to InnoDB -- includes differences between them
Big DELETEs - how to optimize -- and other chunking advice, plus a use for PARTITIONing
Compound INDEXes plus other insights into the mysteries of INDEXing
Partition Maintenance (DROP+REORG) for time series
Entity-Attribute-Value -- a common, poorly performing, design pattern; plus an alternative
Find the nearest 10 pizza parlors -- efficient searching on Latitude + Longitude (another PARITION use)
Alter of a Huge table -- Mostly obviated by 5.6
Latest 10 news articles -- how to optimize the schema and code for such
Pagination, not with OFFSET, LIMIT
Data Warehouse techniques (esp., Summary Tables)
Techniques on efficiently finding a random row (On beyond ORDER BY RAND())
GUID/UUID Performance (type 1 only)
IP Range Table Performance -- or other disjoint ranges
MySQL Limits -- built-in hard limits
Galera Notes (Percona XtraDB Cluster, MariaDB 10, or manually installed)
Rollup Unique User Counts
5.7's Query Rewrite -- perhaps 5.7's best perf gain, at least for this forum's users
Best of MySQL Forum -- index of lots of tips, discussions, etc

View Rick James's profile on LinkedIn