MySQL Charset/Collate

CAVEAT

The problems being addressed

Basic Concepts

History

Best Practice

Conversions and Common Errors

When do bytes get translated?

Life Cycle of a Character

How Mangling Happens on INSERT

Disaster before INSERT

Diagnosing CHARSET issues

The manual

Viewing text

Fixes for various Cases

Entering accents in CMD

Example of Double-encoding

2-step ALTER

Fixing utf8 bytes in latin1 column

Fixing mismatch between CHARSET and data encoding

Fixing mix of charsets in a column

Fixing MicroSoft thingies

Fixing "double encoding"

Fix definition and data, IF correctly encoded

Testing an in-place fix

Fixing while moving data with mysqldump

Fixing while moving data with LOAD DATA

Conversion Gotchas

4-byte utf8

Functions to know about

BOM * Byte-Order-Mark

German "sharp-s"

Where does Ö sort?

Scandanavian collations

Circumflex

Quotes

When is ö not equal to ö?

"Combining" (non-spacing) characters

Cautions

Introspection

my.cnf

Stored Procedures/Functions, Triggers

JSON

utf8 Collations

Making a new collation / alternatives

utf8 Collations Examples

Changing the collation in a SELECT

ERROR 1253 (42000): COLLATION '...' is not valid for CHARACTER SET '...'

utf8 Collation Variations:

8-bit Encodings

More references

latin1 ~= cp1252 ~= ISO 8859-1

Combining Diacriticals

Programatic Collation

LOAD DATA INFILE

Characters in Western Europe

Turkish

Arabic/Farsi

Hebrew

Cyrillic, Macedonian, etc

Other Natural Languages

Other Computer Languages

MySQL 8.0

Random notes

Footnote

Brought to you by Rick James

CAVEAT

This one of my pages is not well maintained. This is a jumble of poorly-organized information; some info is quite useful, some is not. I keep it around only because there of the few good nuggets buried in it.

The problems being addressed

    ⚈  Your web page the wrong characters coming out where accented letters should be.
    ⚈  You are upgrading from MySQL 4.0, where characters were simply bytes.
    ⚈  You are trying to wrap your head around how MySQL handles CHARSET / COLLATION.
    ⚈  You upgraded, and now you are getting garbage.

Please read most of this document. I know it is long (after all, I wrote it!) But you really need to understand a lot of the pieces in it, in order to solve your problem.

The tips in this document apply to MySQL, MariaDB, and Percona. Version differences are called out where relevant.

Basic Concepts

"Character" and "byte" are different! You must understand this before continuing. A "byte" is an a-bit thing; it is the unit of space in computers (today). A "character" is composed of one or more bytes, and represents what we think of when reading.

A byte can represent only 256 different values. There are over 11,000 Korean characters and over 40,000 Chinese characters -- no way to squeeze such a character into a single byte.

Charset vs collation. These are different things! 'Charset' ('character set'; 'encoding') refers to the bits used to represent 'characters'. 'Collation' refers to how those bits could be compare for inequality (WHERE) and sorting (ORDER BY). GROUP BY and FOREIGN KEY CONSTRAINTS can also involve collation. And it even can involve deciding whether two different bit strings compare 'equal'.

History

1950's -- A character was (sometimes) represented in only 5 bits, on "paper tape"; no lowercase; had to "shift" to switch between letters and digits.
1960's -- A character was 6 bits; no lower case. "Punched cards" were used.
1970's -- 7-bit ASCII becomes common -- that limits you to English. And the 8-bit "byte" was invented and was coming into common usage (re: IBM 360).
1980's -- Beginning to see 8-bit encodings. Especially since 7-bit ASCII was wasting a bit of the omni-present "byte". This can handle Western European accented letters.
1990's -- The computer world realizes that there are other people in the world and embarks on Unicode and UTF8. ("UTF" = "Unicode Transformation Format")

Meanwhile, MySQL is born, but has enough problems without worrying about character sets. Through version 4.0, a CHAR is just a byte. You can put any kind of bytes, representing anything, into a VARCHAR. Thus, begins the need for this discussion.

MySQL 4.1 introduced the concept of "character set" and "collation". If you had legacy data or legacy code, you probably did not notice that you were messing things up when you upgraded. Or you started with 4.1 (or later) and "latin1 / latin1_swedish_ci" and failed to notice that you were asking for trouble.

Today, it's pretty safe to simply lay down the law and say "Use utf8 for all text." If you have version 5.5.3 or later, "Use utf8mb4 for all text."

Sure, there are other character sets, and they are useful if you have a narrow focus. But, you may as well use utf8mb4.

Another take on the History

Best Practice

Best practice is to go completely utf8mb4. I will focus on utf8 and utf8mb4, but if you choose to do otherwise, keep reading; most of this discussion can still be adapted to the charset of your choice.

For collation, probably the best 'overall' collation is utf8mb4_0900_ai_ci (if available) or utf8mb4_unicode_520_ci. 0900 refers to UCA 9.0. xxx_unicode_520_ci collations are based on UCA 5.2.0 weight keys: https://www.unicode.org/Public/UCA/5.2.0/allkeys.txt. Unnumbered Collations are based on the older UCA 4.0.0.

A web page (if that is what you have) should begin with <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

Establish the characterset for talking to the MySQL server:
    ⚈  init_connect = 'SET NAMES utf8' in my.cnf
    ⚈  SET NAMES utf8 -- done immediately after connecting
    ⚈  SET GLOBAL VARIABLE character_set_... SET NAMES does the three you need.
    ⚈  In some cases in your my.cnf (my.ini) file, add this under [mysql] or [client]: default-character-set = utf8

For PHP:
    ⚈  (deprecated; DO NOT USE mysql_* !) mysql interface: mysql_set_charset('utf8'); (assuming PHP 5.2.3 & MySQL 5.0.7) Note: "mysql_" interface is deprecated.
    ⚈  mysqli interface: mysqli_set_charset('utf8') function. See mysqli_set_charset
    ⚈  PDO interface: set the charset attribute of the PDO dsn or via SET NAMES utf8mb4. PDO manual

For Java (JDBC):
    ⚈  ?useUnicode=yes&characterEncoding=UTF-8 in the getConnection() call.
For Hikari (perhaps):
    ⚈  spring.jpa.properties.hibernate.connection.characterEncoding=utf-8
    ⚈  spring.jpa.properties.hibernate.connection.CharSet=utf-8
    ⚈  spring.jpa.properties.hibernate.connection.useUnicode=true
(or maybe it is =yes and =utf8)

add

spring:
  datasource:
     connectionInitSql: "SET NAMES 'utf8mb4'"

in the application.yml. connectionInitSql is used by HikariCP when it open the connection.

Adding this to your application.propertie spring.jpa.show-sql=true. It will show all the query u have made.

For Java/JSP
Something about 'filters'

SET NAMES can be invoked by your language just like other non-SELECT commands: mysqli_query(), do(), execute(), etc.

I digress here for a note about GRANTs. 'root', and any other user from GRANT ALL ON *.*, because it has "SUPER" privilege, will skip the init_connect, so that technique is not 'perfect'.

SET NAMES

Declare most CHAR/TEXT columns in all tables as CHARSET utf8. Some exceptions: columns with hex (MD5, GUID), ... These should be BINARY or CHAR charset ASCII.

Collation?

For true text, utf8_unicode_ci is best. It handles the complex rules of "combining chararacters", etc. It handles essentially all languages simultaneously, but compromises on ordering.

utf8_general_ci is the default for utf8, so you may accidently get this. It is a little faster than utf8_unicode_ci and works ok for many situations.

utf8_bin says to just compare the bytes. CHAR/TEXT utf8 with utf8_bin validates (on INSERT) that the bytes comprise valid utf8 bytes, but does not do anything useful for comparisions other than exact (no case folding, etc) equality. BINARY/BLOB should be usually be used instead CHAR+utf8; this stores the bytes without any checking.

Alas, MySQL's collations think of case folding and accent-stripping as equivalent. That is, there is no utf8 collation for case folding, but keeping accents distinct. Do SHOW COLLATION LIKE '%cs'; to see the few collations that work that way.

This document is evolving. It started with just 'utf8', but the 'standard' is becoming 'utf8mb4'. The document is inconsistent as to which it specifies. If you are running MySQL before 5.5.3, you have only 'utf8'. If you need Emoji (Emoticons) or Chinese, then you need 'utf8mb4'.

Conversions and Common Errors

This is a summary of how things work, and what is likely to go wrong.

There are 3 dimensions to character set problems:
    ⚈  How the client's bytes are encoded when INSERTing, and what encoding you want when SELECTing;
    ⚈  What SET NAMES you use (or what the default is);
    ⚈  The CHARACTER SET on the column definition.

All is well if the SET NAMES agrees with the encoding of the bytes in the Client. The column's CHARACTER SET need not agree with SET NAMES; if they differ, a conversion will be performed for you. If your characters exist in both encodings, the conversion will be transparent. If a source character does not exist in the target encoding (example: when converting a Chinese character from utf8mb4 to latin1), a "?" is usually put in its place.

The main thing that can go wrong is that SET NAMES can disagree with the Client's bytes. Let's walk an acute-e (é) through the INSERT and SELECT. First, note the encodings for é:
    ⚈  Hex for latin1's 1-byte encoding: E9
    ⚈  Hex for utf8's 2-byte encoding: C3A9
(BTW, this and some other encodings are enumerated below.)

Case 1: SET NAMES latin1 (the default) is in effect and your application is thinking in utf8. It INSERTs é encoded as utf8 (hex C3A9):
    ⚈  C3A9 will become two latin1 characters: 'Ã©' "on the wire". (C3 represents Ã in latin1, etc)
    ⚈  If the CHARACTER SET for the column is latin1, then those 2 characters (2 bytes) are inserted.
    ⚈  If the CHARACTER SET for the column is utf8, then they are converted to utf8: Ã ⇒ C383 and © ⇒ C289; you insert 4 latin1 characters (4 bytes: C383C289) into the table.
Continuing the case, let's do a SELECT:
    ⚈  For utf8 column, the C383C289 is converted to latin1 Ã© (C3A9) for sending across the wire to the client.
    ⚈  For latin1 column, no conversion is performed, so, again, C389 goes across the wire.
    ⚈  The client receives the two bytes C389, thinking it is Ã© (because of SET NAMES). However, the rest of your application is thinking 'utf8', it sees it as é.
C383C289 is an example of what I call "double encoding" because of the two 'wrong' conversions from latin1 to utf8. The resulting SELECT (at least for European text) looks exactly like what you INSERTed. However, comparisions (WHERE x>y) and sorting (ORDER BY) and, in a few cases, equality (x=y) will not work as expected.

Now, let's flip the situation: Case 2: SET NAMES utf8, but you are inserting é encoded as latin1 (E9):
    ⚈  SET NAMES is falsely asserting that E9 is a valid utf8 encoding of something, which it is not.
    ⚈  The INSERT finishes (with a WARNING that you probably ignored), but the string is truncated before the é.
    ⚈  The CHARACTER SET of the column does not matter since the string is truncated.
    ⚈  A SELECT will get only the truncated string.

Case 3: One application INSERTs and another SELECTs, but they are using different SET NAMES. All sorts of messes can occur.

Case 4: You have an old latin1 table (possibly dating back to MySQL 4.0, which did not really have any CHARACTER SET concept) and you want to modernize it to utf8 or utf8mb4. First, you need to see what is in it. (A later section discusses using HEX to help.)
    ⚈  All ASCII -- no problem.
    ⚈  All latin1 (eg, E9 for é) -- Continuing to call it latin1 is fine, unless you anticipate adding some Asian text in that column.
    ⚈  All utf8 (eg, C3A9 for é) -- Trickier. ALTER to VARBINARY/BLOB, then ALTER back to VARCHAR/TEXT. See the 2-step ALTER below.
    ⚈  A mixture of latin1 and utf8 -- Warm up your resume.

Case 5: A table is declared to be latin1 and correctly contains latin1 bytes, but you would like to change it to utf8.
    ⚈  ALTER TABLE tbl CONVERT TO CHARACTER SET utf8mb4; -- This actively changes the necessary bytes in the columns
or
    ⚈  ALTER TABLE tbl MODIFY col1 ... CHARACTER SET utf8mb4;
is similar to the above, but works only one column at a time, and needs exactly the right stuff in the MODIFY clause. Hence, it would be quite tedious.

Converting entire database

What symptoms exist for the cases?

Case 1 is virtually symptomless. SELECT usually gets what was INSERTed; comparision problems may go unnoticed.

Case 2 is often recognized by having truncated strings.

Case 3 is a mess.

Case 4 is symptomless. However, if the conversion is not done properly, you can end up with some other case.

Case 5 is symptomless, but you want to branch out in the world from your original, limited, CHARACTER SET.

When do bytes get translated?

If you do all the "right things", you will accidentally have a "correctly" performing application. You need to understand when and how bytes are modified, in order to fix existing problems and to build your application correctly.

Short answer: You tell MySQL how the bytes in your application are encoded; you tell MySQL what encoding to use in your tables; MySQL converts as needed when INSERTing / SELECTing.

You tell MySQL about your application bytes by using SET NAMES or the underlying variables. You tell MySQL about the table bytes in CREATE TABLE.

Actually, the charset and collation of a column has a long, not too consistent, defaulting mechanism:
    ⚈  Default for system
    ⚈  Default for a database ...
    ⚈  Default for a table
    ⚈  Settings for a column
    ⚈  Fields/literals can be converted on the fly
    ⚈  A different collation can be specified in a WHERE clause...
To see what charset & collation are in effect for a column, use SHOW CREATE TABLE.

Life Cycle of a Character

Data ➡ Your client application ➡ INSERT into MySQL table. Simple? No.

You get the data from somewhere. Is the e-acute in "Beyoncé" represented in one latin1 byte, hex E9? Or is it the two-byte utf8 C3 A9? Not knowing which you have is the first step toward having a mess in the database. More in a minute.

Meanwhile, you have declared a column in a table in a database in MySQL to be utf8. (Collation is not matter in this discussion.)

Now you INSERT INTO Tbl (name) VALUES ("Beyoncé"). And the mess continues.

What that value being inserted encoded in latin1 or utf8? MySQL is happy to take either, and will gladly do any transcoding to store it properly in the table. But you have to tell it! The default may not be right!

Your application announces to MySQL the encoding via any of
    ⚈  The VARIABLE character_set_client
    ⚈  Executing SET NAMES
    ⚈  init-connect in my.cnf (my.ini) (Caution: root bypasses this)
I recommend you Execute SET NAMES in your application, right after connecting to the database. This reminds you, in your code, that you are establising for the connection.

Example of a mess: The data is in utf8, the connection settings are in latin1. The table is in utf8. So, MySQL converts each of C3 and A9 (which it was told was latin1) to utf8, leading to 4 bytes in the table: C3 83 C2 A9. I call this double-encoding. The problem was caused by lying (usually accidentally) to MySQL about what encoding was in the application's data.

Another example: The data was loaded in an old system in the following manner. The application got utf8 bytes and blindly stored them in a latin1 column (using the default latin1 settings for character_set% -- or it was done in 4.0, which had no character set concept). This left e-acute as 2 bytes in the table in a latin1 column. Then, you decide to 'do the right thing' and switch to utf8. Now what?

BIG5 example: You have an application that lets the user enter data into a form. This data is stored into a VARCHAR. All along the way, you use latin1. But the user connected from China with his browser set to BIG5. The latin1 column happily accepted the 2-byte Chinese characters. But your application reading the data has no clue of actual encoding. That same user, connecting in that same way, and rereading his data, will probably see the latin1 treated as BIG5, and be oblivious of the "two wrongs make a right" that happened. But display that text elsewhere, and you have a mess.

How Mangling Happens on INSERT

When INSERTing a string, 3 things make a difference on whether the inserted value is good or bad.
    ⚈  Data in client is encoded latin1 versus utf8
    ⚈  SET NAMES (or equivalent) is latin1 versus utf8
    ⚈  The target column's CHARACTER SET is ascii vs latin1 vs utf8
The outcome is good or bad according to this ordered list of possibilities:

    Client Byte Encoding  SET NAMES    column CHARSET  Outcome
    --------------------  ----------   --------------  ---------
    7-bit (eg, English)    (any)        (any)           GOOD: bytes stored unmodified
    latin1 8-bit           utf8         (any)           BAD: String truncated
    any 8-bit byte         (any)        ascii           BAD: '????;
    utf8 (not West Europe) utf8         latin1          BAD: '????'
    utf8 (west Europe)     utf8         latin1          GOOD: latin1 has the equivalent values
    (any)                  Matches Enc  Matches Enc     GOOD: bytes stored unmodified
    (any)                  Matches Enc  different       GOOD: bytes transcoded to CHARSET
    utf8 8-bit             latin1       latin1          BAD: 'é' -> 'Ã©' and similar garbage; ; see "2-step ALTER"
    utf8 8-bit             latin1       utf8            BAD: see "double encoding"

In the cases of truncation and '?', data is lost. In the other two 'BAD' cases, it is possible to repair the table (in 2 different ways).

Mixing of utf8 and utf8mb4 (client byte encoding vs. SET NAMES vs. CHARSET) may lead to truncation. Perhaps: SET NAMES utf8 -> truncation and CHARSET utf8 -> '?'

Disaster before INSERT

Most Mojibake starts even before the text is in the table. Let's dissect the steps of doing an INSERT.

Step 1 -- Get the data.

First you get the data from somewhere
    ⚈  Type it in, using a special keyboard, or keyboard tricks
    ⚈  Load it from a file
    ⚈  Create it from constants in your client programming language
    ⚈  Receive text from an html form

What do you have? You probably don't know. There is no single standard for what bytes are generated.

Step 2 -- INSERT

Let's walk through inserting a single character: e-acute. INSERT INTO tbl VALUES ('é'); or $str = ...; INSERT INTO tbl VALUES ('$str'); Well, this is really two steps.

Two likely cases for the encoding of e-acute:
    ⚈  latin1 (or code pages 1280, 1282): hex E9
    ⚈  utf8: 2 hex bytes: C3A9

Step 2a -- Send the command to the MySQL server

The bytes that make up the INSERT statement are sent from the client. Here enters the VARIABLE character_set_client. This tells the client what the bytes mean.

What is logically sent across the wire:

                           Client byte(s)
    character_set_client:    E9    C3A9
                   latin1    é     Ã ©
                     utf8          é

    ⚈  Upper-left and lower-right: The correct 'character' is sent.
    ⚈  Upper-right: Since you said the bytes are 'latin1', then C3A9 must be the two chars Ã©.
    ⚈  Lower-left: Since E9 (and what follows) is not valid utf8, the string is truncated.

Step 2b -- Do the INSERT

The 'characters' on the wire are now transliterated to the desired encoding for the table.

                        On the wire
    column's charset:    é     Ã ©
               latin1    E9    C3A9
                 utf8    C3A9  C383C2A9

    ⚈  Left column: all is well.
    ⚈  Upper-right: Mojibake. Treated as latin1, C3A9 looks like Ã©
    ⚈  Lower-right: "Double encoding" -- discussed at length below.

See also: Mojibake
Thread with data entry issues.
Graphemica é
Another thread

Diagnosing CHARSET issues

Stackoverflow -- list of 5 cases
(More cases are identfied there; more details are here.)

You have some non-English characters that are showing up strangely, perhaps as a question mark (?), perhaps as two letters, when you were expecting one, such as ç is displayed as Ã§.

First, let's get the lay of the land:

SHOW SESSION VARIABLES LIKE 'character_set%';
| character_set_client     | latin1
| character_set_connection | latin1
| character_set_results    | latin1

(There will be other rows; these three are the important ones.) SET NAMES sets those three; this example shows the result of SET NAMES latin1, not the recommended setting. More later.

SHOW CREATE TABLE tbl will show the charset and collation for each column. More later.

You have data in a table, but it is garbled when displaying it. The first thing to do is to decide whether the data was stored correctly. Find a small cell that is bad, and do this:

SELECT x, HEX(x), LENGTH(x), CHAR_LENGTH(x) FROM ... WHERE ... LIMIT 1

You will get 4 columns:
    ⚈  x -- probably not useful, but may help confirm that you got the cell you wanted.
    ⚈  HEX(x) -- This lets us look at the encoding; more later.
    ⚈  LENGTH is the number of bytes.
    ⚈  CHAR_LENGTH is the number of characters. (Remember the distinction, above?)

LENGTH vs CHAR_LENGTH -- Are they equal? For correctly stored utf8, they should be not equal. Let's look at the two lengths of specific types of characters: Looking at a single utf8 character, would see:
    ⚈  ASCII: 1:1
    ⚈  LATIN1: 1:1
    ⚈  English in utf8: 1:1 -- Ascii is a subset of utf8
    ⚈  Western European accented letters: 2:1 -- one character is encoded in two bytes
    ⚈  Eastern Europe: also 2:1 (mostly)
    ⚈  East Asia: mostly 3:1
    ⚈  "double encoding" 4:1 (Europe) or 6:1 (Asian)
    ⚈  With utf8mb4, a few Chinese characters will be 4:1

A typical phrase in French might have LENGTH=20 and CHAR_LENGTH=17, implying that there were 17 characters, 3 of which had accents.

latin1 is happy to cram Western European characters into 1 byte.

"Double encoding" is a term I made up for the following situation. It is a case where "two wrongs make a right".
    ⚈  Table defined utf8
    ⚈  INSERTer declared latin1 (used default instead of doing SET NAMES)
    ⚈  Data being INSERTed is actually utf8 already.
What happened:
    ⚈  A 2-byte letter (say, a grave-e) was correctly represented in utf8.
    ⚈  The INSERT statement handed the 2 bytes to MySQL, but implied the need for conversion
    ⚈  Each byte was converted to utf8
    ⚈  The table has 4 bytes.
When SELECTing, the reverse happens 4->2->1, and the user is oblivious of the bug. Looking CHAR_LENGTH will spot it. Strange orderings may happen. details of a double encoding
Double Encoding 'answer'
Looks like Mojibake, but really half-Double-Encoding
Best reference

If you are moving the data using mysqldump, consider looking at the dump file via some hexdump utility. For experimenting, mysqldump lets you specify one database, one table, and constrain the rows via --where="...".

In the hexdump, you need to be able to recognize various byte patterns: In this, "yy" stands for the range 80-BF.
    ⚈  09,0A,0D -- The only control character you are likely to encounter (TAB, NL, CR)
    ⚈  20 -- space (handy to know when reading hex dump)
    ⚈  21-7E -- Ascii characters
    ⚈  C2yy -- symbols
    ⚈  C3yy -- Typical accented letters of Western Europe
    ⚈  Cxyy -- More Western Europe: Latin (C3-CA), Combining Diacritical Marks (CC-CD), Greek (CE-CF)
    ⚈  Dxyy -- Cyrillic (D0-D4), Hebrew (D6-D7), Arabic/Persian/Farsi (D8-DB), etc
    ⚈  E0yyyy -- various Indian character sets, southern Asia, etc.
    ⚈  E1yyyy -- Cherokee, Balinese, Khmer, Mongolian, Vietnamese, etc.
    ⚈  E2yyyy -- Symbols, Braille, etc
    ⚈  E381yy -- Hiragana (Japanese)
    ⚈  E383yy -- Katakana (Japanese)
    ⚈  Exyyyy -- E3-EA: Chinese, Japanese; EAB0-ED9E: Hangul (Korean)
    ⚈  EFBBBF -- BOM indicator (start of file only)
    ⚈  EFAxyy -- CJK Extension A
    ⚈  EFACyy -- Hebrew (obscure)
    ⚈  EFBxyy -- Arabic (obscure)
    ⚈  F0yyyyyy -- 4-byte utf8. (See notes about utf8mb4)
    ⚈  F0Axyyyy -- CJK Extension B
Note, especially, the absence of 80-BF unless preceeded by some other 8-bit code.

REGEXO for testing for Emoji

    WHERE HEX(col) RLIKE "^(..)*(F.|E2|EF)"

Codepage layout (Wikipedia ref)
Another more detailed list
(but aimed at Unicode instead of utf8)

If the data is not utf8, you may see
    ⚈  00xx -- Lots of these pairs could mean Ascii in UCS2. (Shift-JIS is also possible)
    ⚈  FFFE (or FEFF) at start of file (BOM) -- utf8 file (not cell in MySQL table)
    ⚈  00-1F -- control characters (mostly 0a, newline)
    ⚈  20 -- space (handy to know when reading hex dump)
    ⚈  21-7E -- Ascii characters
    ⚈  7F -- DEL; rare
    ⚈  80-FF -- "8-bit characters":
    ⚈  80-9F -- Possibly MicroSoft Word things (smart quotes, etc)
    ⚈  80-FF -- valid latin1 encodings, mostly European accented letters
    ⚈  A0 -- "hard space" -- mostly from MicroSoft
    ⚈  Cx not followed by yy (80-BF) cannot be utf8. This is usually a quick clue that it is not utf8.

Note: The Microsoft things really need to be fixed before trying to store into a table. MicroSoft characters

In hex, "double encoding" will have a lot of "Cx yy Cx yy".

The manual

charset-connection.html
is that manual page that explains the interactions of the vital variables:
    ⚈  character_set_client -- the encoding in the client.
    ⚈  character_set_connection -- what to re-code characters into for client-to-server transfer
    ⚈  character_set_results -- what to re-code characters into for server-to-client transfer
The page also discusses collation_connection and collation_database, which have less impact.

Between that manual page and this document, I hope that you can understand things as they relate to your situation.

To see your current settings:

   SHOW VARIABLES LIKE 'character_set%';
   SHOW VARIABLES LIKE 'collation%';

Viewing text

Alas, CMD (Windows) and perhaps other terminal emulators cannot (or will not by default) display utf8 characters. This makes it difficult to debug. And it is confusing to see incorrect characters when everything else is working correctly.

Viewing text in a browser can be misleading in a different way. Browsers may "fix" things for you. For example, 'double encoded' characters may appear to be correct in the browser!

Bottom line: SELECT HEX() is about the only thing to trust.

Fixes for various Cases

If you are running MySQL before 5.5.3, you have only 'utf8'. If you need Emoji or Chinese, then you need 'utf8mb4'. The advice in this section assumes you will be going to utf8mb4. If you are going only to utf8, adjust accordingly.

Table is CHARACTER SET latin1 and correctly encoded in latin1; want utf8mb4:

   ALTER TABLE tbl CONVERT TO CHARACTER SET utf8mb4;

Column is CHARACTER SET latin1 and correctly encoded in latin1; want utf8mb4 without touching other columns:

   ALTER TABLE tbl MODIFY COLUMN col ... CHARACTER SET utf8mb4;

(Be sure to keep the other specifications the same - VARCHAR, NOT NULL, etc.)

CHARACTER SET latin1, but have utf8 bytes in it; leave bytes alone while fixing charset:
First, lets assume you have this declaration for tbl.col:

    col VARCHAR(111) CHARACTER SET latin1 NOT NULL

Then to convert the column without changing the bytes:

    ALTER TABLE tbl MODIFY COLUMN col VARBINARY(111) NOT NULL;
    ALTER TABLE tbl MODIFY COLUMN col VARCHAR(111) CHARACTER SET utf8mb4 NOT NULL;

Note: If you start with TEXT, use BLOB as the intermediate definition. (This is the "2-step ALTER", as discussed elsewhere.) (Be sure to keep the other specifications the same - VARCHAR, NOT NULL, etc.)

CHARACTER SET utf8mb4 with Mojibake:

    UPDATE tbl SET col = CONVERT(BINARY(CONVERT(col USING latin1)) USING utf8mb4);

Another ref

CHARACTER SET latin1 with double-encoding: Do the 2-step ALTER, then fix the double-encoding.

Greek example

Question marks / Truncation The data is lost.

Stored Routines Stored Procedures, etc, may need to be recreated if, for example, they were built with utf8 but need to be utf8mb4. StackOverflow

See also Fixing column encoding
Example of mess caused by using wrong 'Fix'

Entering accents in CMD

If isn’t displays as isnâ€™t, you forgot SET NAMES utf8. If isn’t displays (in the mysql commandline tool in Windows) as isn?t, see the above info.

As noted above, CMD in Windows is limited in its character set handing.

The following is a clumsy way to enter Western European characters. It depends on your knowing the hex or decimal values for the characters you want. Sorry, this won't handle Hebrew, Cyrillic, Chinese, etc.

Press and hold ALT, then
    ⚈  decimal digits for the code page, or
    ⚈  0, then decimal digits of the decimal Unicode
digits to get the extra characters in latin1. When you let go, the accented character will appear. This implies that you need SET NAMES latin1, not utf8, since you are creating latin1 bytes, not utf8 bytes in your 'client'. For the encodings, see
    ⚈  shortcuts.

More
info on how to enter Unicode, and about "code pages".

The command "chcp" controls the "code page". chcp 65001 provides utf8, but it needs a special charset installed, too. some code pages
To set the font in the console window: Right-click on the title of the window → Properties → Font → pick Lucida Console

5.6 ref manual
Note: There were bugs in 5.5's use of utf8 in the commandline 'mysql'.

GNU Unifont
Windows 7

Example of Double-encoding

Desired string:  სახლი არის
Hex in table:
C3A1C692C2A1C3A1C692C290C3A1C692C2AEC3A1C692C5A1C3A1C692CB9C20C3A1C692C290C3A1C692C2A0C3A1C692CB9CC3A1C692C2A1
Correct utf8 hex:
E183A1E18390E183AEE1839AE1839820E18390E183A0E18398E183A1

Analysis:

C3A1C692C2A1 -- should be E183A1 -- Georgian SAN (U+10E1)  "ს"
C3A1C692C290
C3A1C692C2AE
C3A1C692C5A1
C3A1C692CB9C
20           -- space (same in utf8 and latin1)
C3A1C692C290
C3A1C692C2A0
C3A1C692CB9C
C3A1C692C2A1

Georgian, when properly encoded in utf8, should be 3 bytes per character:

E182xx
E183xx
E2B4xx

Taking the first "character" (C3A1C692C2A1, in hex):
    ⚈  latin1 E1 = utf8 C3A1
    ⚈  latin1 83 = utf8 C692
    ⚈  latin1 A1 = utf8 C2A1
Notice how most of the letters look very similar: C3A1C692xxyy, when they should start with E183xx.

One pass over the data to convert each two bytes (4 hex) from utf8 to latin1:

   SELECT
   HEX(CONVERT(CONVERT(UNHEX(
     'C3A1C692C2A1C3A1C692C290C3A1C692C2AEC3A1C692C5A1C3A1C692CB9C20C3A1C692C290C3A1C692C2A0C3A1C692CB9CC3A1C692C2A1'
   ) USING utf8) USING latin1))
⇒
E183A1E18390E183AEE1839AE1839820E18390E183A0E18398E183A1

Now to display as utf8:

SELECT UNHEX('E183A1E18390E183AEE1839AE1839820E18390E183A0E18398E183A1')
⇒
სახლი არის

(Ref: forum thread #418885.)

Another example of double-encoding: 'C398C2AF' should have been 'D8AF'

SELECT HEX(CONVERT(CONVERT(UNHEX( 'C398C2AF') USING utf8) USING latin1));
⇒ D8AF, which is the utf8 for Arabic DAL.
SELECT CONVERT(CONVERT(UNHEX( 'C398C2AF') USING utf8) USING latin1);
⇒ that character.

(Ref: forum thread #565669.)

Another... from StackOverflow

    SELECT hex('â˜¼STICKYâ˜¼');         --> C3A2CB9CC2BC535449434B59C3A2CB9CC2BC
    SELECT unhex(hex(convert(convert(unhex('C3A2CB9CC2BC535449434B59C3A2CB9CC2BC') using utf8) using latin1))) -->
    ☼STICKY☼                                                                                            |

A blog on doubly-encoded text.
This even tackles the case of a column eith both doubly-encoded and OK text.

2-step ALTER

The "2-step ALTER" applies when you wrote utf8 bytes with SET NAMES latin1 to a CHARACTER SET latin1 column. The symptom is often (but not always) that your 'é' is showing as 'Ã©'.

    ALTER TABLE Tbl MODIFY COLUMN col VARBINARY(...) ...;
    ALTER TABLE Tbl MODIFY COLUMN col VARCHAR(...) ... CHARACTER SET utf8 ...;

where the lengths are big enough and the other "..." have whatever else (NOT NULL, etc) was already on the column.

Similarly, TEXT -> BLOB -> TEXT.

If col is in any indexes, you might want to DROP INDEX in the first ALTER and ADD INDEX in the second. (This is for efficiency and possibly to avoid index limitations.)

Two examples follow:

Fixing utf8 bytes in latin1 column

(The "2-step ALTER", example 1)

CAVEAT! Test before using on Production!

# Set up bad case:
mysql> create table latin1 (c varchar(11) charset latin1);
mysql> insert into latin1 (c) values ('x'), (unhex('c2a3'));
mysql> select c, hex(c), length(c), char_length(c) from latin1;
+------+--------+-----------+----------------+
| c    | hex(c) | length(c) | char_length(c) |
+------+--------+-----------+----------------+
| x    | 78     |         1 |              1 |
| A£   | C2A3   |         2 |              2 |
+------+--------+-----------+----------------+
# Step 1 of 2 of fix:
mysql> alter table latin1 modify c tinyblob;
mysql> select c, hex(c), length(c), char_length(c) from latin1;
+------+--------+-----------+----------------+
| c    | hex(c) | length(c) | char_length(c) |
+------+--------+-----------+----------------+
| x    | 78     |         1 |              1 |
| A£   | C2A3   |         2 |              2 |
+------+--------+-----------+----------------+
# Step 2 of 2 of fix:
mysql> alter table latin1 modify c varchar(11) charset utf8;
mysql> select c, hex(c), length(c), char_length(c) from latin1;
+------+--------+-----------+----------------+
| c    | hex(c) | length(c) | char_length(c) |
+------+--------+-----------+----------------+
| x    | 78     |         1 |              1 |
| £    | C2A3   |         2 |              1 |
+------+--------+-----------+----------------+

That is, ALTER to BLOB (of appropriate size), then ALTER to VARCHAR (of original size) and CHARSET utf8.

This technique should work for other cases of the right encoding sitting in a 5.1 table with the wrong CHARSET declaration.

One more note: Suggest you include the COLLATE clause on the the final ALTER that converts to VARCHAR.

Fixing mismatch between CHARSET and data encoding

(The "2-step ALTER", example 2)

The problem: cp850 encoding was inserted into a latin1 field.

The following will build an example, then walk through two solutions. The solutions should work for any mis-coded field where bytes for one charset had been stuffed into a column with a different charset declaration (usually latin1). This particular example stems from Forum thread 616834

See also: latin1 and utf8 encodings

-- First, let's establish cp850 everywhere:
-- In Windows, do "chcp 850" before entering mysql.
-- Then get the mysql commandline to agree:
SET NAMES cp850;  -- Now CodePage and the following match
SHOW VARIABLES LIKE 'char%';  -- note the 3 cp850 entries

Solution A: Change CHARACTER SET.

Given: cp850 bytes are stored in a latin1 field. Goal: Leave the bytes alone, but change the column to be CHARACTER SET cp850.

-- Build sample table
CREATE TABLE f616834A (
    str VARCHAR(111) CHARACTER SET latin1
);
SHOW CREATE TABLE f616834A\G

-- Populate with accented E's and I's:
SELECT @str := HEX('ÈÉÊËÌÍÎÏ');
-- Expect        D490D2D3DED6D7D8 (cp850 encoding)
-- Don't expect  C8C9CACBCCCDCECF (latin1)
-- Don't expect  C388C389C38AC38BC38CC38DC38EC38F (utf8)
SELECT UNHEX('D490D2D3DED6D7D8');  -- expect 'ÈÉÊËÌÍÎÏ'

INSERT INTO f616834A (str)  VALUES (UNHEX('D490D2D3DED6D7D8')) ;

SELECT str, HEX(str) FROM f616834A;
-- Expect gibberish (Ô?ÒÓÞÖ×Ø) and D490D2D3DED6D7D8

Notice how it looks wrong but it is encoded correctly. This is because the server/client, in doing the SELECT, is transcoding from latin1 (which it isn't) to cp850. (Some combination of "chcp latin1" and "SET NAMES latin1" would prevent the gibberish, but we don't want to go that direction since it would only mask the problem.)

The "?" in the gibberish is not really a question mark, but rather the result of an unconvertable character.

The ALTER must be done in two steps:

ALTER TABLE f616834A  MODIFY str  VARBINARY(111);
ALTER TABLE f616834A  MODIFY str  VARCHAR(111) CHARACTER SET cp850;

SELECT str, HEX(str) FROM f616834A;  -- Expect: ÈÉÊËÌÍÎÏ and D490D2D3DED6D7D8 (cp850)

The CHARACTER SET changed, but the data was not.

Solution B: Change Data Encoding

Given: cp850 bytes are stored in a latin1 field. Goal: Leave the column's CHARACTER SET alone, but change the bytes in the column.

SHOW VARIABLES LIKE 'char%';  -- cp850 and "chcp 850" must match at this point.
-- Build a test table:
CREATE TABLE f616834B (
    str VARCHAR(111) CHARACTER SET latin1
);
SHOW CREATE TABLE f616834B\G

-- Populate with accented E's and I's:
SELECT @str := HEX('ÈÉÊËÌÍÎÏ');    -- Expect D490D2D3DED6D7D8 (cp850 encoding)
SELECT UNHEX('D490D2D3DED6D7D8');  -- Expect 'ÈÉÊËÌÍÎÏ'
INSERT INTO f616834B (str)  VALUES (UNHEX('D490D2D3DED6D7D8')) ;

SELECT str, HEX(str) FROM f616834B;
-- Expect gibberish (Ô?ÒÓÞÖ×Ø) and D490D2D3DED6D7D8

UPDATE f616834B SET str = CONVERT(BINARY(str) USING cp850);
SELECT str, HEX(str) FROM f616834B;  -- Expect ÈÉÊËÌÍÎÏ and C8C9CACBCCCDCECF (latin1)

(A side note: CONVERT(UNHEX(HEX(str)) USING cp850) would also work.)

Fixing mix of charsets in a column

You let the users enter stuff in a UI form, and you stored it into a latin1 field. The stuff entered was in latin1, big5, utf8, symbols, etc.

You have a mess. Without deducing, row by row, what the intent was, you cannot fix it.

Fixing MicroSoft thingies

Perhaps best is to use the REPLACE() function.
⚈ replace()

Fixing "double encoding"

CAVEAT! Test before using on Production!

*** Plan A: Use UPDATE to fix in-place:

UPDATE Tbl SET col =
    IFNULL(CONVERT(CONVERT(CONVERT(col USING latin1)
                                       USING binary)
                                       USING utf8),
                                    col )

Caution: I have only briefly tested this.

The IFNULL has the advantage of leaving intact most cells that are not double-encoded. (Maybe -- need further testing)

How does it work?
    1.  The first (innermost) CONVERT takes a 4-byte encoding and converts to 2 bytes.
    2.  Converting to binary and then to something else is a way to change the type without changing the bytes.
    3.  The final convert to utf8 prevents the assignment from further messing up
    4.  The last step (IFNULL) is to undo the operation if it ended up invalid utf8??

*** Plan B: Use ALTER:

mysql> # Create sample:
mysql> CREATE TABLE dbl (c VARCHAR(11) CHARSET latin1);
mysql> INSERT INTO dbl (c) VALUES ('x'), (UNHEX('C382C2A4'));
mysql> # Diagnose:
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl;
+------+----------+-----------+----------------+
| c    | HEX(c)   | LENGTH(c) | CHAR_LENGTH(c) |
+------+----------+-----------+----------------+
| x    | 78       |         1 |              1 |
| A,A☼ | C382C2A4 |         4 |              4 |
+------+----------+-----------+----------------+
mysql> # Step 1 of 5:
Note:  Steps 1 and 2 are needed only if you start with latin1;
If your column is already utf8, skip to step 3.
mysql> ALTER TABLE dbl MODIFY c TINYBLOB;
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl;
+------+----------+-----------+----------------+
| c    | HEX(c)   | LENGTH(c) | CHAR_LENGTH(c) |
+------+----------+-----------+----------------+
| x    | 78       |         1 |              1 |
| A,A☼ | C382C2A4 |         4 |              4 |
+------+----------+-----------+----------------+
mysql> # Step 2 of 5:
mysql> ALTER TABLE dbl MODIFY c VARCHAR(11) CHARSET utf8;
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl;
+------+----------+-----------+----------------+
| c    | HEX(c)   | LENGTH(c) | CHAR_LENGTH(c) |
+------+----------+-----------+----------------+
| x    | 78       |         1 |              1 |
| A☼   | C382C2A4 |         4 |              2 |
+------+----------+-----------+----------------+
mysql> # Step 3 of 5:
mysql> ALTER TABLE dbl MODIFY c VARCHAR(11) CHARSET latin1;
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl;
+------+--------+-----------+----------------+
| c    | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) |
+------+--------+-----------+----------------+
| x    | 78     |         1 |              1 |
| A☼   | C2A4   |         2 |              2 |
+------+--------+-----------+----------------+
mysql> # Step 4 of 5:
mysql> ALTER TABLE dbl MODIFY c TINYBLOB;
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl;
+------+--------+-----------+----------------+
| c    | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) |
+------+--------+-----------+----------------+
| x    | 78     |         1 |              1 |
| A☼   | C2A4   |         2 |              2 |
+------+--------+-----------+----------------+
mysql> # Step 5 of 5:
mysql> ALTER TABLE dbl MODIFY c VARCHAR(11) CHARSET utf8;
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl;
+------+--------+-----------+----------------+
| c    | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) |
+------+--------+-----------+----------------+
| x    | 78     |         1 |              1 |
| ☼    | C2A4   |         2 |              1 |
+------+--------+-----------+----------------+
mysql> # Check schema:
mysql> SHOW CREATE TABLE dbl \G
CREATE TABLE `dbl` (
  `c` varchar(11) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8

One more note: Suggest you include the COLLATE clause on the the final ALTER that converts to VARCHAR.

*** Plan C: Using mysqldump

(This has had only limited testing.)

    1. mysqldump --default-character-set=latin1 .... my_database > my_database_latin1.sql
    2. Edit my_database_latin1.sql to set NAMES=utf8 at the top.
    3. mysql ... < mydatabase.sqlmo

Thanks to Peter Barry

*** Plan D: A function

DELIMITER $$

CREATE FUNCTION maybe_utf8_decode(str text charset utf8)
RETURNS text CHARSET utf8 DETERMINISTIC
BEGIN
declare str_converted text charset utf8;
declare max_error_count int default @@max_error_count;
set @@max_error_count = 0;
set str_converted = convert(binary convert(str using latin1) using utf8);
set @@max_error_count = max_error_count;
if @@warning_count > 0 then
return str;
else
return str_converted;
end if;
END$$

DELIMITER ;

Then do this for each column in each table that is likely to have doubly-encoded text:

UPDATE tbl SET col = maybe_utf8_decode(col);

Thanks for Plan D to Joni Salonen for fixing-doubly-utf-8-encoded-text-in-mysql

Fix definition and data, IF correctly encoded

You have a latin1 table containing latin1 encodings; you want utf8. (This is not the general thrust of this document, but is included for completeness, and how tempting looking the command is.)

mysql> # Create sample:
mysql> CREATE TABLE trycvt (c VARCHAR(11) CHARSET latin1);
mysql> INSERT INTO trycvt (c) VALUES ('x'), (UNHEX('BB'));
mysql> SHOW CREATE TABLE trycvt \G
CREATE TABLE `trycvt` (
  `c` varchar(11) CHARACTER SET latin1 DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8
mysql> # Diagnose:
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM trycvt;
+------+--------+-----------+----------------+
| c    | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) |
+------+--------+-----------+----------------+
| x    | 78     |         1 |              1 |
| »    | BB     |         1 |              1 |
+------+--------+-----------+----------------+
mysql> # Step 1 of 1
mysql> ALTER TABLE trycvt CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
mysql> # Diagnose:
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM trycvt;
+------+--------+-----------+----------------+
| c    | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) |
+------+--------+-----------+----------------+
| x    | 78     |         1 |              1 |
| »    | C2BB   |         2 |              1 |
+------+--------+-----------+----------------+
mysql> # Note schema changed:
mysql> SHOW CREATE TABLE trycvt \G
CREATE TABLE `trycvt` (
  `c` varchar(11) COLLATE utf8_unicode_ci DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

Note: TEXT will become MEDIUMTEXT, etc. This is because transcoding latin1 to utf8 usually lengthens the data. You may want to follow with a ALTER TABLE ... MODIFY c TEXT ..esume. to put it back.

Note about ALTER: Generally it is possible and faster to do all ALTERs of a table in a single ALTER;

Haidong Ji Blog

performanceblog

Testing an in-place fix

Some of the above fixes require live changes. Here is a way that might build confidence that it works:
    ⚈  find some rows with accents (etc) in them
    ⚈  create a table with the same schema; copy (INSERT SELECT) the rows into them (INSERT ... SELECT)
    ⚈  Try one of the conversions on that table
    ⚈  Check the results -- now length and char_length should be different, and should differ by the appropriate amount. Fetch the rows in you application; see if they look right.

Fixing while moving data with mysqldump

You are copying the data from, say, 4.x server to a new, say, 5.1 server. It may be easiest to convert the data as you copy. You can test the results. If necessary, you can adjust and reapply the conversion.

Overview:

      mysqldump -h old_host --no-data >schema.sql
      # now edit schema.sql to fix charset, collation, etc
      # edit in/out SET NAMES
      mysql -h new_host <schema.sql
      mysqldump -h old_host --no-create-info --opt >data.sql
      # edit in/out SET NAMES
      mysql -h new_host <data.sql

Notes (and things that may vary with our situation)
    ⚈  May need --default-character-set=utf8
    ⚈  Other issues?
    ⚈  (aside) Do not dump the database information_schema
    ⚈  (aside) Do not dump and reload mysql if you are moving between versions; there are usually schema changes that will cause trouble.

Fixing while moving data with LOAD DATA

    ⚈  SELECT INTO OUTFILE may create something useful for LOAD DATA.
    ⚈  Perhaps use mysqldump suggestions for creating schema
    ⚈  LOAD DATA ... CHARACTER SET utf8 ...; (as of 5.0.38 / 5.1.17)
    ⚈  Sorry, no experience in this approach.

Conversion Gotchas

    ⚈  DROP and re-add FULLTEXT indexes around changes.
    ⚈  DROP and re-add all(?) FOREIGN KEY CONSTRANTs; perhaps disabling will suffice
    ⚈  Collation changes could lead to duplicated keys
    ⚈  Collation changes could lead to broken VARCHAR FOREIGN KEY CONSTRANTs

4-byte utf8

Inserting a 4-byte utf8 character into a column with CHARACTER SET utf8 will cause the string to be truncated starting with the 4-byte character.

MySQL Versions 5.1 and before do not handle utf8 characters longer than 4 bytes. Version 5.5.3 (March, 2010) has a new CHARACTER SET, utf8mb4, for utf8 encodings up to 4 bytes.
    ⚈  'f0 a2 88 a2' showing up as 'd848 de22'
    ⚈  Graphemica
    ⚈  Winking Face 😉
utf8mb4 in necessary for a small number of Chinese characters.

The MySQL changelog for 5.5.3 says "The Unicode implementation has been extended to provide support for supplementary characters that lie outside the Basic Multilingual Plane (BMP)."

If your product is going to store characters from Hong Kong, utf8 is not a good option since many characters in HKSCS are 4-byte utf8 characters. utf8mb4 would be better.

Good article on a variety of related issues such as SET NAMES, utf8mb4, collations, converting from utf8 to utf8mb4, index lengths, character-set-client-handshake, init-connect, Java, etc. is by Mathias Bynens
A lot of the meat is in the Comment section.

Much of this blog mentions utf8. Unless otherwise specified, utf8mb4 applies, too. Occasionally, "utf8" may refer to the standard, not specifically to MySQL's 3-byte subset.

😁 => F09F9881

VARCHAR(191) -- Indexes are limited to 767 bytes, which is sufficient to hold 255 utf8 bytes. However, utf8mb4 can take 4 bytes. So, the limit for building an index of a utf8mb4 column is 191. So, if you are picking 255 for no good reason, and if 191 will do, use 191. To check the max length in a field:

   SELECT MAX(CHAR_LENGTH(col)) FROM tbl;

How to upgrade utf8 to utf8mb4? Probably this will do:

   ALTER TABLE tbl MODIFY COLUMN col ... CHARACTER SET utf8mb4 ...;

Emoji (such as 😁😂😃😄) are all differ under the newer utf8mb4_unicode_520_ci, but equal with the older COLLATE utf8mb4_unicode_ci. Stackoverflow

If you currently have utf8 on all the columns in a table, and want to covert to utf8mb4,

   ALTER TABLE tbl_name CONVERT TO utf8mb4;

However, this may fail for index issues. If you have an indexed VARCHAR(255), that turns out to exceed the 767-byte limit. Change it to VARCHAR(191) if the column does not really need 255. If you really need more than 191, things get more complicated. StackExchange thread

Repeat the CONVERT on the other tables.

Functions to know about

Since you may be using some of these against the source (old) database, be cautious of whether the function exists.

String functions
    ⚈  CHAR_LENGTH(field) -- num chars in field
    ⚈  LENGTH(field) -- num bytes in field
    ⚈  CHAR(123 USING latin1)
    ⚈  CHARSET(expression)
    ⚈  UPPER(_binary'Abc') = 'Abc' -- a subtle surprise (other string functions, too)
    ⚈  ORD(char) = ASCII(char) = number -- for the _char_, not the byte
    ⚈  HEX(string) -- use to see actual encoding
    ⚈  UNHEX('AABBCC') -- does not exist before 4.1.2 Convert functions
    ⚈  CONVERT('...' USING utf8)
    ⚈  CONVERT(blob_col USING utf8) -- no "USING" before 4.1.16
    ⚈  HEX(CAST(_ucs 0x00a0 as CHAR CHARACTER SET utf8)) -- show that Unicode 00a0 is utf8 C2A0 Literal introducer
    ⚈  _utf8'...' PHP
    ⚈  mysql: mysql_set_charset('utf8');
    ⚈  mysqli: $mysqli_obj->set_charset('utf8');
    ⚈  PDO: $db = new PDO('mysql:host=...;dbname=...;charset=utf8mb4', $user, $pwd); or by executing SET NAMES utf8mb4.

Note: the use of mysqli_set_charset / set_charset should obviate the need for SET NAMES.

If you change the collation of a field, ORDER BY cannot use any INDEX; hence it could be surprisingly inefficient.

BOM * Byte-Order-Mark

Wikipedia
says

The UTF-8 representation of the BOM is the byte sequence 0xEF,0xBB,0xBF. A text editor or web browser interpreting the text as ISO-8859-1 or CP1252 will display the characters ï»¿ for this.

If, instead, you see this hex: C3AFC2BBC2BF, then you are also suffering from double encoding.

If you encounter this (presumably at the start of a file), it implies that your editor is adding this, but the the reader of the file (eg, mysqldump) is does not know what to do with it. Check the editor's settings, or consider using a different editor.

Error #1064 SQL syntax at line 1 * strange characters
more on BOM

If you see The ÿþ (hex FFFE) at the beginning of a 'text' file, that is the Byte-Order-Mark.

Python issue with BOM
ERROR 1366 (HY000) at line ...: Incorrect string value: '\xEF\xBB\xBFC# BOM

    open(filename, 'r', encoding='utf-8-bom')

German "sharp-s"

Version 5.1.24 introduced a change in the collation of the German "sharp-s". bug 27877
(Apr. 2007) Starting with 5.1.24, 's' = 'ß' for utf8_general_ci. 'ss' = 'ß' for utf8_unicode_ci, which is 'correct' for German.

A nasty effect of this collation change occurs when a user upgrades past 5.1.24. If there were sharp-s characters in PRIMARY (or UNIQUE) KEYs, the reloading of the data may cause "duplicate key" errors.

To 'fix' the incompatibility, the collation utf8_general_mysql500_ci was added in MySQL 5.1.62 / 5.5.21 / 5.6.5. See ref page

Graphemica

From 5.1 release notes:

New utf8_general_mysql500_ci and ucs2_general_mysql500_ci collations have been added that preserve the behavior of utf8_general_ci and ucs2_general_ci from versions of MySQL previous to 5.1.24. CHECK TABLE produces this error:

Table upgrade required. Please do "REPAIR TABLE t" or dump/reload to fix it!

Unfortunately, REPAIR TABLE could not fix the problem. The new collations permit older tables created before MySQL 5.1.24 to be upgraded to current versions of MySQL.

To convert an affected table after a binary upgrade that leaves the table files in place, alter the table to use the new collation. Suppose that the table t1 contains one or more problematic utf8 columns. To convert the table at the table level, use a statement like this:

ALTER TABLE t1 CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_mysql500_ci;

To apply the change on a column-specific basis, use a statement like this (be sure to repeat the column definition as originally specified except for the COLLATE clause):

ALTER TABLE t1 MODIFY c1 CHAR(N) CHARACTER SET utf8 COLLATE utf8_general_mysql500_ci;

To upgrade the table using a dump and reload procedure, dump the table using mysqldump, modify the CREATE TABLE statement in the dump file to use the new collation, and reload the table.

After making the appropriate changes, CREATE TABLE should report no error.

For more information, see Checking Whether Tables or Indexes Must Be Rebuilt
and Rebuilding or Repairing Tables or Indexes (Bug #43593, Bug #11752408)

Where does Ö sort?

    ⚈  utf8_estonian_ci: Between 'W' and 'X'
    ⚈  danish, icelandic, swedish: After 'Z'
    ⚈  utf8_gernan2_ci: As if the two letters 'oe'
    ⚈  hungarian and turkish: Between 'O' and 'P' (that is, after 'oz')
    ⚈  Other collations (including unicode, unicode_520, 0900): As if the letter 'O'

These apply to both utf8 and utf8mb4, MySQL 8.0 and before.

Programatically generated collation tester

Scandanavian collations

Among utf8_danish_ci, utf8_icelandic_ci, and utf8_swedish_ci (and utf8mb4 variant):

icelandic differs due to: AZ < Á < B, E < É, IZ < Í < J, UZ < Ü < V, YZ < Ý < Z. Also

danish:    zz <          Ä=Æ=ä=æ < Ö=Ø=ö=ø < Aa=Å=å < Þ=þ
icelandic: zz < Þ=þ    < Ä=Æ=ä=æ < Ö=Ø=ö=ø <    Å=å
swedish:   zz <    Å=å < Ä=Æ=ä=æ < Ö=Ø=ö=ø <          Þ=þ

(There may be more differences.) Programatically generated collation tester

Circumflex

There are many versions of "circumflex":

unicode: 005E  = (^) utf8: 5E = CIRCUMFLEX ACCENT = SPACING CIRCUMFLEX
unicode: 02C6  = (ˆ) utf8: CB86 = MODIFIER LETTER CIRCUMFLEX ACCENT = MODIFIER LETTER CIRCUMFLEX
unicode: 0302  =     utf8: CC82 = COMBINING CIRCUMFLEX ACCENT = NON-SPACING CIRCUMFLEX
unicode: 032D  =     utf8: CCAD = COMBINING CIRCUMFLEX ACCENT BELOW = NON-SPACING CIRCUMFLEX BELOW
unicode: 1DCD  =     utf8: E1B78D = COMBINING DOUBLE CIRCUMFLEX ABOVE
unicode: A788  = (ꞈ) utf8: EA9E88 = MODIFIER LETTER LOW CIRCUMFLEX ACCENT
unicode: FF3E  = (＾) utf8: EFBCBE = FULLWIDTH CIRCUMFLEX ACCENT = FULLWIDTH SPACING CIRCUMFLEX
unicode: E005E =     utf8: F3A0819E = TAG CIRCUMFLEX ACCENT

This debates what latin1's 0x88 (ˆ) maps to: forum discussion
Graphemica

Quotes

+----------+---------+------+----------------------------------------------------+
| utf8mb4  | Unicode | c    | Name                                               |
+----------+---------+------+----------------------------------------------------+
| 22       |      34 | "    | QUOTATION MARK                                     |
| C2AB     |     171 | «    | LEFT-POINTING DOUBLE ANGLE QUOTATION MARK          |
| C2BB     |     187 | »    | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK         |
| E28098   |    8216 | ‘    | LEFT SINGLE QUOTATION MARK                         | 91
| E28099   |    8217 | ’    | RIGHT SINGLE QUOTATION MARK                        | 92
| E2809A   |    8218 | ‚    | SINGLE LOW-9 QUOTATION MARK                        |
| E2809B   |    8219 | ‛    | SINGLE HIGH-REVERSED-9 QUOTATION MARK              |
| E2809C   |    8220 | “    | LEFT DOUBLE QUOTATION MARK                         | 93
| E2809D   |    8221 | ”    | RIGHT DOUBLE QUOTATION MARK                        | 94
| E2809E   |    8222 | „    | DOUBLE LOW-9 QUOTATION MARK                        |
| E2809F   |    8223 | ‟    | DOUBLE HIGH-REVERSED-9 QUOTATION MARK              |
| E280B9   |    8249 | ‹    | SINGLE LEFT-POINTING ANGLE QUOTATION MARK          |
| E280BA   |    8250 | ›    | SINGLE RIGHT-POINTING ANGLE QUOTATION MARK         |
| E28D98   |    9048 | ⍘    | APL FUNCTIONAL SYMBOL QUOTE UNDERBAR               |
| E28D9E   |    9054 | ⍞    | APL FUNCTIONAL SYMBOL QUOTE QUAD                   |
| E29D9B   |   10075 | ❛    | HEAVY SINGLE TURNED COMMA QUOTATION MARK ORNAMENT  |
| E29D9C   |   10076 | ❜    | HEAVY SINGLE COMMA QUOTATION MARK ORNAMENT         |
| E29D9D   |   10077 | ❝    | HEAVY DOUBLE TURNED COMMA QUOTATION MARK ORNAMENT  |
| E29D9E   |   10078 | ❞    | HEAVY DOUBLE COMMA QUOTATION MARK ORNAMENT         |
| E29DAE   |   10094 | ❮    | HEAVY LEFT-POINTING ANGLE QUOTATION MARK ORNAMENT  |
| E29DAF   |   10095 | ❯    | HEAVY RIGHT-POINTING ANGLE QUOTATION MARK ORNAMENT |
| E3809D   |   12317 | 〝   | REVERSED DOUBLE PRIME QUOTATION MARK               |
| E3809E   |   12318 | 〞   | DOUBLE PRIME QUOTATION MARK                        |
| E3809F   |   12319 | 〟   | LOW DOUBLE PRIME QUOTATION MARK                    |
| EA9084   |   41988 | ꐄ   | YI SYLLABLE QUOT                                   |
| EFBC82   |   65282 | ＂   | FULLWIDTH QUOTATION MARK                           |
| F3A080A2 |  917538 | ó ¢     | TAG QUOTATION MARK                                 |
+----------+---------+------+----------------------------------------------------+

⚈ 91..94 are the hex for latin1 if coming from Microsoft Word.
⚈ F3A080A2 requires utf8mb4; the rest work in both utf8 and utf8mb4 (graphic probably not rendered correctly)

When is ö not equal to ö?

First, let's discuss when 'ö' is not equal to 'o'. (I will focus on utf8/utf8mb4 here.) 'o' != 'ö' for utf8_bin collation and 7 others (danish, icelandic, swedish, estonian, turkish, hungarian, and german2); see utf8collations
For MySQL 8.0 and utf8mb4, there are 12: utf8mb4collations

How could 'ö' not be equal to 'ö'? When one is a single character (hex C3B6) and the other is a combination of a 'combining diaeressis' plus 'o' -- hex CC886F.

Further discussion: Searching with German Umlauts

"Combining" (non-spacing) characters

These can be used for constructing accented letters (etc) from the code plus the letter. But keep in mind that (at least) utf8_general_ci will not treat such 2-letter codes as equal to the apparently same 1-letter equivalent.

+----------+----------------------------------------------------+---------+
| hex_utf8 | name                                               | Unicode |
+----------+----------------------------------------------------+---------+
| CC80     | COMBINING GRAVE ACCENT                             | U+0300  |
| CC81     | COMBINING ACUTE ACCENT                             | U+0301  |
| CC82     | COMBINING CIRCUMFLEX ACCENT                        | U+0302  |
| CC83     | COMBINING TILDE                                    | U+0303  |
| CC84     | COMBINING MACRON                                   | U+0304  |
| CC85     | COMBINING OVERLINE                                 | U+0305  |
| CC86     | COMBINING BREVE                                    | U+0306  |
| CC87     | COMBINING DOT ABOVE                                | U+0307  |
| CC88     | COMBINING DIAERESIS                                | U+0308  |
| CC89     | COMBINING HOOK ABOVE                               | U+0309  |
| CC8A     | COMBINING RING ABOVE                               | U+030A  |
| CC8B     | COMBINING DOUBLE ACUTE ACCENT                      | U+030B  |
| CC8C     | COMBINING CARON                                    | U+030C  |
| CC8D     | COMBINING VERTICAL LINE ABOVE                      | U+030D  |
| CC8E     | COMBINING DOUBLE VERTICAL LINE ABOVE               | U+030E  |
| CC8F     | COMBINING DOUBLE GRAVE ACCENT                      | U+030F  |
| CC90     | COMBINING CANDRABINDU                              | U+0310  |
| CC91     | COMBINING INVERTED BREVE                           | U+0311  |
| CC92     | COMBINING TURNED COMMA ABOVE                       | U+0312  |
| CC93     | COMBINING COMMA ABOVE                              | U+0313  |
| CC94     | COMBINING REVERSED COMMA ABOVE                     | U+0314  |
| CC95     | COMBINING COMMA ABOVE RIGHT                        | U+0315  |
| CC96     | COMBINING GRAVE ACCENT BELOW                       | U+0316  |
| CC97     | COMBINING ACUTE ACCENT BELOW                       | U+0317  |
| CC98     | COMBINING LEFT TACK BELOW                          | U+0318  |
| CC99     | COMBINING RIGHT TACK BELOW                         | U+0319  |
| CC9A     | COMBINING LEFT ANGLE ABOVE                         | U+031A  |
| CC9B     | COMBINING HORN                                     | U+031B  |
| CC9C     | COMBINING LEFT HALF RING BELOW                     | U+031C  |
| CC9D     | COMBINING UP TACK BELOW                            | U+031D  |
| CC9E     | COMBINING DOWN TACK BELOW                          | U+031E  |
| CC9F     | COMBINING PLUS SIGN BELOW                          | U+031F  |
| CCA0     | COMBINING MINUS SIGN BELOW                         | U+0320  |
| CCA1     | COMBINING PALATALIZED HOOK BELOW                   | U+0321  |
| CCA2     | COMBINING RETROFLEX HOOK BELOW                     | U+0322  |
| CCA3     | COMBINING DOT BELOW                                | U+0323  |
| CCA4     | COMBINING DIAERESIS BELOW                          | U+0324  |
| CCA5     | COMBINING RING BELOW                               | U+0325  |
| CCA6     | COMBINING COMMA BELOW                              | U+0326  |
| CCA7     | COMBINING CEDILLA                                  | U+0327  |
| CCA8     | COMBINING OGONEK                                   | U+0328  |
| CCA9     | COMBINING VERTICAL LINE BELOW                      | U+0329  |
| CCAA     | COMBINING BRIDGE BELOW                             | U+032A  |
| CCAB     | COMBINING INVERTED DOUBLE ARCH BELOW               | U+032B  |
| CCAC     | COMBINING CARON BELOW                              | U+032C  |
| CCAD     | COMBINING CIRCUMFLEX ACCENT BELOW                  | U+032D  |
| CCAE     | COMBINING BREVE BELOW                              | U+032E  |
| CCAF     | COMBINING INVERTED BREVE BELOW                     | U+032F  |
| CCB0     | COMBINING TILDE BELOW                              | U+0330  |
| CCB1     | COMBINING MACRON BELOW                             | U+0331  |
| CCB2     | COMBINING LOW LINE                                 | U+0332  |
| CCB3     | COMBINING DOUBLE LOW LINE                          | U+0333  |
| CCB4     | COMBINING TILDE OVERLAY                            | U+0334  |
| CCB5     | COMBINING SHORT STROKE OVERLAY                     | U+0335  |
| CCB6     | COMBINING LONG STROKE OVERLAY                      | U+0336  |
| CCB7     | COMBINING SHORT SOLIDUS OVERLAY                    | U+0337  |
| CCB8     | COMBINING LONG SOLIDUS OVERLAY                     | U+0338  |
| CCB9     | COMBINING RIGHT HALF RING BELOW                    | U+0339  |
| CCBA     | COMBINING INVERTED BRIDGE BELOW                    | U+033A  |
| CCBB     | COMBINING SQUARE BELOW                             | U+033B  |
| CCBC     | COMBINING SEAGULL BELOW                            | U+033C  |
| CCBD     | COMBINING X ABOVE                                  | U+033D  |
| CCBE     | COMBINING VERTICAL TILDE                           | U+033E  |
| CCBF     | COMBINING DOUBLE OVERLINE                          | U+033F  |
| CD80     | COMBINING GRAVE TONE MARK                          | U+0340  |
| CD81     | COMBINING ACUTE TONE MARK                          | U+0341  |
| CD84     | COMBINING GREEK DIALYTIKA TONOS                    | U+0344  |
| CD85     | COMBINING GREEK YPOGEGRAMMENI                      | U+0345  |
| D283     | COMBINING CYRILLIC TITLO                           | U+0483  |
| D284     | COMBINING CYRILLIC PALATALIZATION                  | U+0484  |
| D285     | COMBINING CYRILLIC DASIA PNEUMATA                  | U+0485  |
| D286     | COMBINING CYRILLIC PSILI PNEUMATA                  | U+0486  |
| E28390   | COMBINING LEFT HARPOON ABOVE                       | U+20D0  |
| E28391   | COMBINING RIGHT HARPOON ABOVE                      | U+20D1  |
| E28392   | COMBINING LONG VERTICAL LINE OVERLAY               | U+20D2  |
| E28393   | COMBINING SHORT VERTICAL LINE OVERLAY              | U+20D3  |
| E28394   | COMBINING ANTICLOCKWISE ARROW ABOVE                | U+20D4  |
| E28395   | COMBINING CLOCKWISE ARROW ABOVE                    | U+20D5  |
| E28396   | COMBINING LEFT ARROW ABOVE                         | U+20D6  |
| E28397   | COMBINING RIGHT ARROW ABOVE                        | U+20D7  |
| E28398   | COMBINING RING OVERLAY                             | U+20D8  |
| E28399   | COMBINING CLOCKWISE RING OVERLAY                   | U+20D9  |
| E2839A   | COMBINING ANTICLOCKWISE RING OVERLAY               | U+20DA  |
| E2839B   | COMBINING THREE DOTS ABOVE                         | U+20DB  |
| E2839C   | COMBINING FOUR DOTS ABOVE                          | U+20DC  |
| E283A1   | COMBINING LEFT RIGHT ARROW ABOVE                   | U+20E1  |
| E38299   | COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK      | U+3099  |
| E3829A   | COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK | U+309A  |
+----------+----------------------------------------------------+---------+
88 rows in set (0.01 sec)

Cautions

If you have FOREIGN KEY CONSTRAINTs on fields you are 'fixing', you will probably mess up the constraints. It might work to disable FK constraints, fix both tables, then re-enable them.

If you have FOREIGN KEY CONSTRAINTs on fields other than the ones you are 'fixing', it is probably best to disable/reenable them.

If your table is huge, and/or you have lots of indexes, the ALTERs can be very time consuming. It may be better to disable/drop the indexes during the 'fix', then reapply them at the end. Do not drop an AUTO_INCREMENT index. In the case of InnoDB, do not drop the PRIMARY KEY.

Introspection

To see what charsets and collations exist in your version of MySQL:

    SHOW CHARSET;  -- only latin1 and utf8/utf8mb4 are likely to be interesting
    SHOW COLLATION;  -- list the possible values.
    SHOW COLLATION LIKE 'utf8%';
    SHOW COLLATION LIKE '%_cs';
    SHOW SESSION VARIABLES LIKE 'character_set%';
    SHOW GLOBAL  VARIABLES LIKE 'character_set%';
    SHOW CREATE TABLE tbl \G

Collation name suffixes
    ⚈  _as_ci -- case insensitive but accent sensitive
    ⚈  _ci -- case insensitive; this includes ignoring accents
    ⚈  _cs -- case sensitive (and accent sensitive)
    ⚈  _bin -- just compare the bits

my.cnf

The following lines in your my.cnf initialize various things in various contexts. (It is untested.) It may fail to be recognized for SUPER user:

    [client]
    default-character-set = cp1252

    [mysqld]
    skip-character-set-client-handshake
    # (deprecated in mysqld) default-character-set = cp1252
    character-set-server = cp1252
    collation-server = latin1_general_cs
    init-connect = SET NAMES cp1252
    # (init-connect is _not_ seen by SUPER users, such as "root".)

    [mysqldump]
    default-character-set = cp1252

    [mysql]
    default-character-set = cp1252

Caution: Setting default-character-set that way for the mysql client cannot be overridden by SET NAMES.

in my.ini in the [mysqld] group: skip-character-set-client-handshake collation_server=utf8_unicode_ci character_set_server=utf8 This may speed up connections, and avoid need for SET NAMES.

Stored Procedures/Functions, Triggers

ALTER DATABASE dbname CHARACTER SET ... can impact already compiled Stored Procedures. Recompile them.

There may be other charset issues. In particular, a routine seems to have a charset, but that declaration seems to have no effect.

JSON

If your JSON strings are showing things like "\uff62\u30c9\u30e9\u30b4\u30f3\u30b3\u30a4", you have "escaped unicode". (Those are unicode characters for "｢ドラゴンコ".)

If you would prefer utf8, then this should work in PHP (since version 5.4):

$t = json_encode($s, JSON_UNESCAPED_UNICODE);

Before PHP 5.4:
function json_encode_unicode($input) { return preg_replace_callback( '/\\\\u([0-9a-zA-Z]{4})/', function ($matches) { return mb_convert_encoding(pack('H*',$matches[1]),'UTF-8','UTF-16'); }, json_encode($input) ); }

utf8 Collations

Which to pick? Why?
    ⚈  utf8_bin -- just looks at bits; no case folding, etc
    ⚈  utf8_general_ci -- case and accent folding.
    ⚈  utf8_unicode_ci -- Perhaps best. Dipthongs, combining accents, etc are handled
    ⚈  utf8_..._ci -- If your application is entirely (or mostly) in one language, this may be better for you. See the chart below.

Notes on utf8_general_ci vs utf8_unicode_ci
    ⚈  utf8_general_ci is a very simple collation. What it does * it just - removes all accents - then converts to upper case
    ⚈  utf8_unicode_ci supports so called expansions and ligatures, for example: German letter ß (U+00DF LETTER SHARP S) is sorted near "ss"; letter Œ (U+0152 LATIN CAPITAL LIGATURE OE) is sorted near "OE".
    ⚈  utf8_general_ci does not support expansions/ligatures, it sorts all these letters as single characters, and sometimes in a wrong order.
    ⚈  utf8_unicode_ci is generally more accurate for all scripts.
    ⚈  For Cyrillic: utf8_unicode_ci is fine for these languages: Russian, Bulgarian, Belarusian, Macedonian, Serbian, and Ukrainian. While utf8_general_ci is fine only for Russian and Bulgarian subset of Cyrillic. Extra letters used in Belarusian, Macedonian, Serbian, and Ukrainian are sorted not well.
    ⚈  The disadvantage of utf8_unicode_ci is that it is a little bit slower than utf8_general_ci.
    ⚈  from StackOverflow

Making a new collation / alternatives

Three ways to deal with inadequacies of existing collations: Custom Collation | Regularized field | REGEXP

utf8 Collations Examples

What pairs of things will be treated equal? What order does a collation cause? How do I override a column's collation? (Caution: Doing so usually prevents effecient use of an INDEX.) Examples below show altering an ORDER BY clause and an "=" operator.

Here is test code for the results below:

drop table charutf8;

-- Create and populate a table with a "utf8_bin" column:
CREATE TABLE `charutf8` (
  `a` varchar(9) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  PRIMARY KEY (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT IGNORE INTO charutf8 (a) VALUES
    ('abc'), ('def'), ('ABC'), ('ábc'), ('́abc'), ('Abç'), ('ss'), ('ß');

-- Dump of characters, and lengths:
SELECT a, CHAR_LENGTH(a), LENGTH(a), HEX(a)
    FROM charutf8
    ORDER BY a;
-- CC81 is "U+0301: COMBINING ACUTE ACCENT" which,
-- together with 'a' is equal to 'á' in utf8_unicode_ci

Unicode canonical equivalence

-- Show that the ordering varies with collation:
SELECT GROUP_CONCAT(a ORDER BY a) AS 'bin order' FROM charutf8;
SELECT GROUP_CONCAT(a ORDER BY a COLLATE utf8_general_ci) AS 'general_ci order' FROM charutf8;
SELECT GROUP_CONCAT(a ORDER BY a COLLATE utf8_unicode_ci) AS 'unicode_ci order' FROM charutf8;

-- Self-JOIN to see what values are the same; collation-dependent:
SELECT CHAR_LENGTH(x.a) AS clen, LENGTH(x.a), x.a, GROUP_CONCAT(y.a) AS 'Equal with utf8_bin'
    FROM charutf8 x
    JOIN charutf8 y ON x.a = y.a
    GROUP BY x.a;
SELECT x.a, GROUP_CONCAT(y.a) AS 'Equal with utf8_general_ci'
    FROM charutf8 x
    JOIN charutf8 y ON x.a = y.a COLLATE utf8_general_ci
    GROUP BY x.a;
SELECT x.a, GROUP_CONCAT(y.a) AS 'Equal with utf8_unicode_ci'
    FROM charutf8 x
    JOIN charutf8 y ON x.a = y.a COLLATE utf8_unicode_ci
    GROUP BY x.a;

Results and conclusions:

-- Dump of characters, and lengths:
+-------+----------------+-----------+------------+
| a     | CHAR_LENGTH(a) | LENGTH(a) | HEX(a)     |
+-------+----------------+-----------+------------+
| ABC   |              3 |         3 | 414243     |
| Abç   |              3 |         4 | 4162C3A7   |
| abc   |              3 |         3 | 616263     |
| def   |              3 |         3 | 646566     |
| ss    |              2 |         2 | 7373       |
| ß     |              1 |         2 | C39F       |
| ábc   |              3 |         4 | C3A16263   |
| ́abc  |              4 |         5 | CC81616263 | -- the misalignment is due to the output not handling "combining" chars correctly
+-------+----------------+-----------+------------+
8 rows in set (0.00 sec)
-- CC81 is "U+0301: COMBINING ACUTE ACCENT" which,
-- together with 'a' is equal to 'á' in utf8_unicode_ci

-- Show that the ordering varies with collation:
+-----------------------------------+
| bin order                         |
+-----------------------------------+
| ABC,Abç,abc,def,ss,ß,ábc,́abc     | -- Notice how accents follow all letters
+-----------------------------------+
mysql> SELECT GROUP_CONCAT(a ORDER BY a COLLATE utf8_general_ci) AS 'general_ci order' FROM charutf8;
+-----------------------------------+
| general_ci order                  |
+-----------------------------------+
| ábc,abc,Abç,ABC,def,ß,ss,́abc     | -- Regular accents mixed in; Combining guy is not'
+-----------------------------------+
+-----------------------------------+
| unicode_ci order                  |
+-----------------------------------+
| ́abc,ábc,abc,Abç,ABC,def,ß,ss     | -- Combining accent sorted with its mates
+-----------------------------------+

-- Self-JOIN to see what values are the same; collation-dependent:
+-------+---------------------+
| a     | Equal with utf8_bin |
+-------+---------------------+
| ABC   | ABC                 | -- exact match for case and accents
| Abç   | Abç                 |
| abc   | abc                 |
| def   | def                 |
| ss    | ss                  |
| ß     | ß                   | -- ss and ß are NOT equal
| ábc   | ábc                 |
| ́abc  | ́abc                |
+-------+---------------------+
+-------+----------------------------+
| a     | Equal with utf8_general_ci |
+-------+----------------------------+
| ABC   | ábc,ABC,Abç,abc            | -- case folding and accent stripping
| Abç   | ábc,ABC,Abç,abc            |
| abc   | ábc,ABC,Abç,abc            |
| def   | def                        |
| ss    | ss                         | -- ss and ß are NOT equal
| ß     | ß                          |
| ábc   | ABC,Abç,abc,ábc            |
| ́abc  | ́abc                       | -- 'combining' NOT equal
+-------+----------------------------+
+-------+----------------------------+
| a     | Equal with utf8_unicode_ci |
+-------+----------------------------+
| ABC   | ábc,ABC,́abc,Abç,abc       | -- case folding and accent stripping
| Abç   | abc,ábc,ABC,́abc,Abç       |
| abc   | Abç,abc,ábc,ABC,́abc       |
| def   | def                        |
| ss    | ss,ß                       | -- ss and ß are EQUAL
| ß     | ß,ss                       |
| ábc   | ábc,ABC,́abc,Abç,abc       |
| ́abc  | abc,ábc,ABC,́abc,Abç       | -- 'combining' EQUAL
+-------+----------------------------+

Changing the collation in a SELECT

Suppose you have

   city VARCHAR(66) COLLATE utf8_bin

and you need to find "San José" when you do

   WHERE city = "San Jose"

This applies a different collation for the comparision:

   WHERE city COLLATE utf8_general_ci = 'San Jose'

Caveat: An INDEX(city) cannot be used when you override the collation, so the latter WHERE will be slower.

ERROR 1253 (42000): COLLATION '...' is not valid for CHARACTER SET '...'

mysql> set names latin1;
Query OK, 0 rows affected (0.00 sec)
mysql> select * from User where username = 'rené' collate utf8_general_ci;
ERROR 1253 (42000): COLLATION 'utf8_general_ci' is not valid for CHARACTER SET 'latin1'
mysql> set names utf8;
Query OK, 0 rows affected (0.00 sec)
mysql> select * from User where username = 'rené' collate utf8_general_ci;
Empty set (0.00 sec)
Altenatively, use can explicitly set the character set using a 'character set introducer':
mysql> set names latin1;
Query OK, 0 rows affected (0.00 sec)
mysql> select * from User where username = _utf8'rené' collate utf8_general_ci;
Empty set (0.00 sec)

StackOverflow reference

utf8 Collation Variations:

See also a more complete table for utf8 and utf8mb4 in MySQL 8.0.

Here's a comparison of the collation of the alphabet characters for the utf8 collations you might consider.
    ⚈  spanish2 puts ch after cz and ll after lz and Ñ,ñ after nz
    ⚈  utf8_general_ci treats sharp-s as 's'; the rest treat it as 'ss'.
    ⚈  Ligature (Æ, etc) collation varies.
    ⚈  Eth (Ð) is treated like a letter between D and E.
    ⚈  utf8_estonian_ci does some major shuffling (Z between S and T, etc).
    ⚈  Danish sorts Ä=Æ=ä=æ Ö=Ø=ö=ø Aa=Å=å Þ=þ after z.

utf8 : utf8_general_ci             A=a=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą    Aa  ae  az
utf8 : utf8_general_mysql500_ci    A=a=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą    Aa  ae  az
utf8 : utf8_danish_ci              A=a=ª=À=Á=Â=Ã=à=á=â=ã=Ā=ā=Ą=ą              ae  az
utf8 : utf8_swedish_ci             A=a=ª=À=Á=Â=Ã=à=á=â=ã=Ā=ā=Ą=ą          Aa  ae  az
utf8 : utf8_estonian_ci            A=a=ª=À=Á=Â=Ã=Å=à=á=â=ã=å=Ā=ā=Ą=ą      Aa  ae  az       Æ=æ
utf8 : utf8_lithuanian_ci          A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą  Aa  ae  az       Æ=æ
utf8 : utf8_latvian_ci             A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą  Aa  ae  az       Æ=æ
utf8 : utf8_unicode_ci             A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą  Aa  ae  az       Æ=æ
utf8 : utf8_polish_ci              A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā      Aa  ae  az  Ą=ą  Æ=æ
utf8 : utf8_spanish_ci             A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą  Aa  ae  az       Æ=æ
utf8 : utf8_spanish2_ci            A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą  Aa  ae  az       Æ=æ
utf8 : utf8_turkish_ci             A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą  Aa  ae  az       Æ=æ

utf8 : utf8_general_ci             B=b  C=c=Ç=ç=Č=č      ch  cz       D=d  dz
utf8 : utf8_general_mysql500_ci    B=b  C=c=Ç=ç=Č=č      ch  cz       D=d  dz
utf8 : utf8_danish_ci              B=b  C=c=Ç=ç=Č=č      ch  cz       D=d  dz  Ð=ð
utf8 : utf8_swedish_ci             B=b  C=c=Ç=ç=Č=č      ch  cz       D=d  dz  Ð=ð
utf8 : utf8_estonian_ci            B=b  C=c=Ç=ç=Č=č      ch  cz       D=d  dz  Ð=ð
utf8 : utf8_lithuanian_ci          B=b  C=c=ch=Ç=ç   cz          Č=č  D=d  dz  Ð=ð
utf8 : utf8_latvian_ci             B=b  C=c=Ç=ç          ch  cz  Č=č  D=d  dz  Ð=ð
utf8 : utf8_unicode_ci             B=b  C=c=Ç=ç=Č=č      ch  cz       D=d  dz  Ð=ð
utf8 : utf8_polish_ci              B=b  C=c=Ç=ç=Č=č      ch  cz       D=d  dz  Ð=ð
utf8 : utf8_spanish_ci             B=b  C=c=Ç=ç=Č=č      ch  cz       D=d  dz  Ð=ð
utf8 : utf8_spanish2_ci            B=b  C=c=Ç=ç=Č=č  cz  ch           D=d  dz  Ð=ð
utf8 : utf8_turkish_ci             B=b  C=c=Č=č          ch  cz  Ç=ç  D=d  dz  Ð=ð

utf8 : utf8_general_ci             E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_general_mysql500_ci    E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_danish_ci              E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_swedish_ci             E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_estonian_ci            E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_lithuanian_ci          E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_latvian_ci             E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_unicode_ci             E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_polish_ci              E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ě=ě  Ę=ę
utf8 : utf8_spanish_ci             E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_spanish2_ci            E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_turkish_ci             E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě

utf8 : utf8_general_ci             F=f  fz     G=g=ğ=Ģ=ģ   H=h  hz
utf8 : utf8_general_mysql500_ci    F=f  fz     G=g=ğ=Ģ=ģ   H=h  hz
utf8 : utf8_danish_ci              F=f  fz  ƒ  G=g=ğ=Ģ=ģ   H=h  hz
utf8 : utf8_swedish_ci             F=f  fz  ƒ  G=g=ğ=Ģ=ģ   H=h  hz
utf8 : utf8_estonian_ci            F=f  fz  ƒ  G=g=ğ=Ģ=ģ   H=h  hz
utf8 : utf8_lithuanian_ci          F=f  fz  ƒ  G=g=ğ=Ģ=ģ   H=h  hz
utf8 : utf8_latvian_ci             F=f  fz  ƒ  G=g=ğ  Ģ=ģ  H=h  hz
utf8 : utf8_unicode_ci             F=f  fz  ƒ  G=g=ğ=Ģ=ģ   H=h  hz
utf8 : utf8_polish_ci              F=f  fz  ƒ  G=g=ğ=Ģ=ģ   H=h  hz
utf8 : utf8_spanish_ci             F=f  fz  ƒ  G=g=ğ=Ģ=ģ   H=h  hz
utf8 : utf8_spanish2_ci            F=f  fz  ƒ  G=g=ğ=Ģ=ģ   H=h  hz
utf8 : utf8_turkish_ci             F=f  fz  ƒ  G=g=Ģ=ģ     H=h  hz

utf8 : utf8_general_ci             I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ=ı    ij    iz
utf8 : utf8_general_mysql500_ci    I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ=ı    ij    iz
utf8 : utf8_danish_ci              I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ      ij=ĳ  iz
utf8 : utf8_swedish_ci             I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ      ij=ĳ  iz
utf8 : utf8_estonian_ci            I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ      ij=ĳ  iz
utf8 : utf8_lithuanian_ci          I=Y=i=y=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ  ij=ĳ  iz=yz
utf8 : utf8_latvian_ci             I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ      ij=ĳ  iz  Y=y  yz
utf8 : utf8_unicode_ci             I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ      ij=ĳ  iz
utf8 : utf8_polish_ci              I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ      ij=ĳ  iz
utf8 : utf8_spanish_ci             I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ      ij=ĳ  iz
utf8 : utf8_spanish2_ci            I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ      ij=ĳ  iz
utf8 : utf8_turkish_ci             I=ı  i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ   ij=ĳ  iz

utf8 : utf8_general_ci             J=j  K=k=Ķ=ķ   L=l=Ļ=ļ  lj            ll  lz            M=m  N=n=Ñ=ñ=Ņ=ņ  nz
utf8 : utf8_general_mysql500_ci    J=j  K=k=Ķ=ķ   L=l=Ļ=ļ  lj            ll  lz            M=m  N=n=Ñ=ñ=Ņ=ņ  nz
utf8 : utf8_danish_ci              J=j  K=k=Ķ=ķ   L=l=Ļ=ļ  lj=Ǉ=ǈ=ǉ      ll  lz       Ł=ł  M=m  N=n=Ñ=ñ=Ņ=ņ  nz
utf8 : utf8_swedish_ci             J=j  K=k=Ķ=ķ   L=l=Ļ=ļ  lj=Ǉ=ǈ=ǉ      ll  lz       Ł=ł  M=m  N=n=Ñ=ñ=Ņ=ņ  nz
utf8 : utf8_estonian_ci            J=j  K=k=Ķ=ķ   L=l=Ļ=ļ  lj=Ǉ=ǈ=ǉ      ll  lz       Ł=ł  M=m  N=n=Ñ=ñ=Ņ=ņ  nz
utf8 : utf8_lithuanian_ci          J=j  K=k=Ķ=ķ   L=l=Ļ=ļ  lj=Ǉ=ǈ=ǉ      ll  lz       Ł=ł  M=m  N=n=Ñ=ñ=Ņ=ņ  nz
utf8 : utf8_latvian_ci             J=j  K=k  Ķ=ķ  L=l      lj=Ǉ=ǈ=ǉ      ll  lz  Ļ=ļ  Ł=ł  M=m  N=n=Ñ=ñ      nz  Ņ=ņ
utf8 : utf8_unicode_ci             J=j  K=k=Ķ=ķ   L=l=Ļ=ļ  lj=Ǉ=ǈ=ǉ      ll  lz       Ł=ł  M=m  N=n=Ñ=ñ=Ņ=ņ  nz
utf8 : utf8_polish_ci              J=j  K=k=Ķ=ķ   L=l=Ļ=ļ  lj=Ǉ=ǈ=ǉ      ll  lz       Ł=ł  M=m  N=n=Ñ=ñ=Ņ=ņ  nz
utf8 : utf8_spanish_ci             J=j  K=k=Ķ=ķ   L=l=Ļ=ļ  lj=Ǉ=ǈ=ǉ      ll  lz       Ł=ł  M=m  N=n=Ņ=ņ      nz  Ñ=ñ
utf8 : utf8_spanish2_ci            J=j  K=k=Ķ=ķ   L=l=Ļ=ļ  lj=Ǉ=ǈ=ǉ  lz  ll           Ł=ł  M=m  N=n=Ņ=ņ      nz  Ñ=ñ
utf8 : utf8_turkish_ci             J=j  K=k=Ķ=ķ   L=l=Ļ=ļ  lj=Ǉ=ǈ=ǉ      ll  lz       Ł=ł  M=m  N=n=Ñ=ñ=Ņ=ņ  nz

utf8 : utf8_general_ci             O=o=Ò=Ó=Ô=Õ=Ö=ò=ó=ô=õ=ö    oe      oz
utf8 : utf8_general_mysql500_ci    O=o=Ò=Ó=Ô=Õ=Ö=ò=ó=ô=õ=ö    oe      oz
utf8 : utf8_danish_ci              O=o=º=Ò=Ó=Ô=Õ=ò=ó=ô=õ      oe=Œ=œ  oz
utf8 : utf8_swedish_ci             O=o=º=Ò=Ó=Ô=Õ=ò=ó=ô=õ      oe=Œ=œ  oz
utf8 : utf8_estonian_ci            O=o=º=Ò=Ó=Ô=ò=ó=ô          oe=Œ=œ  oz       Ø=ø
utf8 : utf8_lithuanian_ci          O=o=º=Ò=Ó=Ô=Õ=Ö=ò=ó=ô=õ=ö  oe=Œ=œ  oz       Ø=ø
utf8 : utf8_latvian_ci             O=o=º=Ò=Ó=Ô=Õ=Ö=ò=ó=ô=õ=ö  oe=Œ=œ  oz       Ø=ø
utf8 : utf8_unicode_ci             O=o=º=Ò=Ó=Ô=Õ=Ö=ò=ó=ô=õ=ö  oe=Œ=œ  oz       Ø=ø
utf8 : utf8_polish_ci              O=o=º=Ò=Ô=Õ=Ö=ò=ô=õ=ö      oe=Œ=œ  oz  Ó=ó  Ø=ø
utf8 : utf8_spanish_ci             O=o=º=Ò=Ó=Ô=Õ=Ö=ò=ó=ô=õ=ö  oe=Œ=œ  oz       Ø=ø
utf8 : utf8_spanish2_ci            O=o=º=Ò=Ó=Ô=Õ=Ö=ò=ó=ô=õ=ö  oe=Œ=œ  oz       Ø=ø
utf8 : utf8_turkish_ci             O=o=º=Ò=Ó=Ô=Õ=ò=ó=ô=õ      oe=Œ=œ  oz  Ö=ö  Ø=ø

utf8 : utf8_general_ci             P=p  Q=q  R=r  S=s=ß=ş=Š=Š=š=š  sh  ss    sz
utf8 : utf8_general_mysql500_ci    P=p  Q=q  R=r  S=s=ş=Š=Š=š=š    sh  ss    sz
utf8 : utf8_danish_ci              P=p  Q=q  R=r  S=s=ş=Š=Š=š=š    sh  ss=ß  sz
utf8 : utf8_swedish_ci             P=p  Q=q  R=r  S=s=ş=Š=Š=š=š    sh  ss=ß  sz
utf8 : utf8_estonian_ci            P=p  Q=q  R=r  S=s=ş            sh  ss=ß  sz  Š=Š=š=š  Z=z  zh  zz  Ž=Ž=ž=ž
utf8 : utf8_lithuanian_ci          P=p  Q=q  R=r  S=s=ş            sh  ss=ß  sz  Š=Š=š=š
utf8 : utf8_latvian_ci             P=p  Q=q  R=r  S=s=ş            sh  ss=ß  sz  Š=Š=š=š
utf8 : utf8_unicode_ci             P=p  Q=q  R=r  S=s=ş=Š=Š=š=š    sh  ss=ß  sz
utf8 : utf8_polish_ci              P=p  Q=q  R=r  S=s=ş=Š=Š=š=š    sh  ss=ß  sz
utf8 : utf8_spanish_ci             P=p  Q=q  R=r  S=s=ş=Š=Š=š=š    sh  ss=ß  sz
utf8 : utf8_spanish2_ci            P=p  Q=q  R=r  S=s=ş=Š=Š=š=š    sh  ss=ß  sz
utf8 : utf8_turkish_ci             P=p  Q=q  R=r  S=s=Š=Š=š=š      sh  ss=ß  sz

utf8 : utf8_general_ci             T=t  TM=tm    tz  U=u=Ù=Ú=Û=Ü=ù=ú=û=ü=Ū=ū=Ų=ų  ue  uz
utf8 : utf8_general_mysql500_ci    T=t  TM=tm    tz  U=u=Ù=Ú=Û=Ü=ù=ú=û=ü=Ū=ū=Ų=ų  ue  uz
utf8 : utf8_danish_ci              T=t  TM=tm=™  tz  U=u=Ù=Ú=Û=ù=ú=û=Ū=ū=Ų=ų      ue  uz
utf8 : utf8_swedish_ci             T=t  TM=tm=™  tz  U=u=Ù=Ú=Û=ù=ú=û=Ū=ū=Ų=ų      ue  uz
utf8 : utf8_estonian_ci            T=t  TM=tm=™  tz  U=u=Ù=Ú=Û=ù=ú=û=Ū=ū=Ų=ų      ue  uz
utf8 : utf8_lithuanian_ci          T=t  TM=tm=™  tz  U=u=Ù=Ú=Û=Ü=ù=ú=û=ü=Ū=ū=Ų=ų  ue  uz
utf8 : utf8_latvian_ci             T=t  TM=tm=™  tz  U=u=Ù=Ú=Û=Ü=ù=ú=û=ü=Ū=ū=Ų=ų  ue  uz
utf8 : utf8_unicode_ci             T=t  TM=tm=™  tz  U=u=Ù=Ú=Û=Ü=ù=ú=û=ü=Ū=ū=Ų=ų  ue  uz
utf8 : utf8_polish_ci              T=t  TM=tm=™  tz  U=u=Ù=Ú=Û=Ü=ù=ú=û=ü=Ū=ū=Ų=ų  ue  uz
utf8 : utf8_spanish_ci             T=t  TM=tm=™  tz  U=u=Ù=Ú=Û=Ü=ù=ú=û=ü=Ū=ū=Ų=ų  ue  uz
utf8 : utf8_spanish2_ci            T=t  TM=tm=™  tz  U=u=Ù=Ú=Û=Ü=ù=ú=û=ü=Ū=ū=Ų=ų  ue  uz
utf8 : utf8_turkish_ci             T=t  TM=tm=™  tz  U=u=Ù=Ú=Û=ù=ú=û=Ū=ū=Ų=ų      ue  uz  Ü=ü

utf8 : utf8_general_ci             V=v  W=w  X=x  Y=y=Ý=ý=ÿ=Ÿ      yz
utf8 : utf8_general_mysql500_ci    V=v  W=w  X=x  Y=y=Ý=ý=ÿ=Ÿ      yz
utf8 : utf8_danish_ci              V=v  W=w  X=x  Y=y=Ü=Ý=ü=ý=ÿ=Ÿ  yz
utf8 : utf8_swedish_ci             V=v  W=w  X=x  Y=y=Ü=Ý=ü=ý=ÿ=Ÿ  yz
utf8 : utf8_estonian_ci            V=v  W=w  Õ=õ  Ä=ä  Ö=ö  Ü=ü  X=x  Y=y=Ý=ý=ÿ=Ÿ  yz  Þ=þ
utf8 : utf8_lithuanian_ci          V=v  W=w  X=x  Ý=ý=ÿ=Ÿ
utf8 : utf8_latvian_ci             V=v  W=w  X=x  Ý=ý=ÿ=Ÿ
utf8 : utf8_unicode_ci             V=v  W=w  X=x  Y=y=Ý=ý=ÿ=Ÿ      yz
utf8 : utf8_polish_ci              V=v  W=w  X=x  Y=y=Ý=ý=ÿ=Ÿ      yz
utf8 : utf8_spanish_ci             V=v  W=w  X=x  Y=y=Ý=ý=ÿ=Ÿ      yz
utf8 : utf8_spanish2_ci            V=v  W=w  X=x  Y=y=Ý=ý=ÿ=Ÿ      yz
utf8 : utf8_turkish_ci             V=v  W=w  X=x  Y=y=Ý=ý=ÿ=Ÿ      yz

utf8 : utf8_general_ci             Z=z=Ž=Ž=ž=ž  zh  zz  Æ=æ  Ð=ð  Ø=ø  Þ=þ     Ł=ł  Œ=œ  ƒ  Ǉ=ǈ=ǉ
utf8 : utf8_general_mysql500_ci    Z=z=Ž=Ž=ž=ž  zh  zz  Æ=æ  Ð=ð  Ø=ø  Þ=þ  ß  Ł=ł  Œ=œ  ƒ  Ǉ=ǈ=ǉ
utf8 : utf8_danish_ci              Z=z=Ž=Ž=ž=ž  zh  zz       Ä=Æ=ä=æ  Ö=Ø=ö=ø  Aa=Å=å  Þ=þ
utf8 : utf8_swedish_ci             Z=z=Ž=Ž=ž=ž  zh  zz  Å=å  Ä=Æ=ä=æ  Ö=Ø=ö=ø          Þ=þ
utf8 : utf8_estonian_ci
utf8 : utf8_lithuanian_ci          Z=z          zh  zz  Ž=Ž=ž=ž  Þ=þ
utf8 : utf8_latvian_ci             Z=z          zh  zz  Ž=Ž=ž=ž  Þ=þ
utf8 : utf8_unicode_ci             Z=z=Ž=Ž=ž=ž  zh  zz           Þ=þ
utf8 : utf8_polish_ci              Z=z=Ž=Ž=ž=ž  zh  zz           Þ=þ
utf8 : utf8_spanish_ci             Z=z=Ž=Ž=ž=ž  zh  zz           Þ=þ
utf8 : utf8_spanish2_ci            Z=z=Ž=Ž=ž=ž  zh  zz           Þ=þ
utf8 : utf8_turkish_ci             Z=z=Ž=Ž=ž=ž  zh  zz           Þ=þ

utf8mb4_% mostly works the same as utf8.

Unfortunately, it is not easy to separate case folding from accent stripping. Here is what someone has created: utf8_distinct_ci

8-bit Encodings

The columns are decimal, latin1 hex, utf8 hex, char, htmlentity.

 128 80 E282AC € &#x80;
 129 81   C281  &#x81;
 130 82 E2809A ‚ &#x82;
 131 83   C692 ƒ &#x83;
 132 84 E2809E „ &#x84;
 133 85 E280A6 … &#x85;
 134 86 E280A0 † &#x86;
 135 87 E280A1 ‡ &#x87;
 136 88   CB86 ˆ &#x88;
 137 89 E280B0 ‰ &#x89;
 138 8A   C5A0 Š &#x8A;
 139 8B E280B9 ‹ &#x8B;
 140 8C   C592 Œ &#x8C;
 141 8D   C28D  &#x8D;
 142 8E   C5BD Ž &#x8E;
 143 8F   C28F  &#x8F;
 144 90   C290  &#x90;
 145 91 E28098 ‘ &#x91;
 146 92 E28099 ’ &#x92;
 147 93 E2809C “ &#x93;
 148 94 E2809D ” &#x94;
 149 95 E280A2 • &#x95;
 150 96 E28093 – &#x96;
 151 97 E28094 — &#x97;
 152 98   CB9C ˜ &#x98;
 153 99 E284A2 ™ &#x99;
 154 9A   C5A1 š &#x9A;
 155 9B E280BA › &#x9B;
 156 9C   C593 œ &#x9C;
 157 9D   C29D  &#x9D;
 158 9E   C5BE ž &#x9E;
 159 9F   C5B8 Ÿ &#x9F;

 160 A0   C2A0   &nbsp;
 161 A1   C2A1 ¡ &iexcl;
 162 A2   C2A2 ¢ &cent;
 163 A3   C2A3 £ &pound;
 164 A4   C2A4 ¤ &curren;
 165 A5   C2A5 ¥ &yen;
 166 A6   C2A6 ¦ &brvbar;
 167 A7   C2A7 § &sect;
 168 A8   C2A8 ¨ &uml;
 169 A9   C2A9 © &copy;
 170 AA   C2AA ª &ordf;
 171 AB   C2AB « &laquo;
 172 AC   C2AC ¬ &not;
 173 AD   C2AD  &shy;
 174 AE   C2AE ® &reg;
 175 AF   C2AF ¯ &macr;
 176 B0   C2B0 ° &deg;
 177 B1   C2B1 ± &plusmn;
 178 B2   C2B2 ² &sup2;
 179 B3   C2B3 ³ &sup3;
 180 B4   C2B4 ´ &acute;
 181 B5   C2B5 µ &micro;
 182 B6   C2B6 ¶ &para;
 183 B7   C2B7 · &middot;
 184 B8   C2B8 ¸ &cedil;
 185 B9   C2B9 ¹ &sup1;
 186 BA   C2BA º &ordm;
 187 BB   C2BB » &raquo;
 188 BC   C2BC ¼ &frac14;
 189 BD   C2BD ½ &frac12;
 190 BE   C2BE ¾ &frac34;
 191 BF   C2BF ¿ &iquest;

 192 C0   C380 À &Agrave;
 193 C1   C381 Á &Aacute;
 194 C2   C382 Â &Acirc;
 195 C3   C383 Ã &Atilde;
 196 C4   C384 Ä &Auml;
 197 C5   C385 Å &Aring;
 198 C6   C386 Æ &AElig;
 199 C7   C387 Ç &Ccedil;
 200 C8   C388 È &Egrave;
 201 C9   C389 É &Eacute;
 202 CA   C38A Ê &Ecirc;
 203 CB   C38B Ë &Euml;
 204 CC   C38C Ì &Igrave;
 205 CD   C38D Í &Iacute;
 206 CE   C38E Î &Icirc;
 207 CF   C38F Ï &Iuml;
 208 D0   C390 Ð &ETH;
 209 D1   C391 Ñ &Ntilde;
 210 D2   C392 Ò &Ograve;
 211 D3   C393 Ó &Oacute;
 212 D4   C394 Ô &Ocirc;
 213 D5   C395 Õ &Otilde;
 214 D6   C396 Ö &Ouml;
 215 D7   C397 × &times;
 216 D8   C398 Ø &Oslash;
 217 D9   C399 Ù &Ugrave;
 218 DA   C39A Ú &Uacute;
 219 DB   C39B Û &Ucirc;
 220 DC   C39C Ü &Uuml;
 221 DD   C39D Ý &Yacute;
 222 DE   C39E Þ &THORN;
 223 DF   C39F ß &szlig;

 224 E0   C3A0 à &agrave;
 225 E1   C3A1 á &aacute;
 226 E2   C3A2 â &acirc;
 227 E3   C3A3 ã &atilde;
 228 E4   C3A4 ä &auml;
 229 E5   C3A5 å &aring;
 230 E6   C3A6 æ &aelig;
 231 E7   C3A7 ç &ccedil;
 232 E8   C3A8 è &egrave;
 233 E9   C3A9 é &eacute;
 234 EA   C3AA ê &ecirc;
 235 EB   C3AB ë &euml;
 236 EC   C3AC ì &igrave;
 237 ED   C3AD í &iacute;
 238 EE   C3AE î &icirc;
 239 EF   C3AF ï &iuml;
 240 F0   C3B0 ð &eth;
 241 F1   C3B1 ñ &ntilde;
 242 F2   C3B2 ò &ograve;
 243 F3   C3B3 ó &oacute;
 244 F4   C3B4 ô &ocirc;
 245 F5   C3B5 õ &otilde;
 246 F6   C3B6 ö &ouml;
 247 F7   C3B7 ÷ &divide;
 248 F8   C3B8 ø &oslash;
 249 F9   C3B9 ù &ugrave;
 250 FA   C3BA ú &uacute;
 251 FB   C3BB û &ucirc;
 252 FC   C3BC ü &uuml;
 253 FD   C3BD ý &yacute;
 254 FE   C3BE þ &thorn;
 255 FF   C3BF ÿ &yuml;

The 'missing' characters in the above table:
    ⚈  129
    ⚈  141 reverse line feed
    ⚈  143
    ⚈  144
    ⚈  157
    ⚈  173 shows when breaking across lines Soft Hyphen

Misinterpreting the utf8 bytes as latin1. The last column is what you may be seeing.

 128 80 E282AC € â‚¬
 129 81   C281   Â
 130 82 E2809A ‚ â€š 
 131 83   C692 ƒ Æ’
 132 84 E2809E „ â€ž
 133 85 E280A6 … â€¦
 134 86 E280A0 † â€
 135 87 E280A1 ‡ â€¡
 136 88   CB86 ˆ Ë†
 137 89 E280B0 ‰ â€°
 138 8A   C5A0 Š Å
 139 8B E280B9 ‹ â€¹
 140 8C   C592 Œ Å’
 141 8D   C28D   Â
 142 8E   C5BD Ž Å½
 143 8F   C28F   Â
 144 90   C290   Â
 145 91 E28098 ‘ â€˜
 146 92 E28099 ’ â€™
 147 93 E2809C “ â€œ
 148 94 E2809D ” â€
 149 95 E280A2 • â€¢
 150 96 E28093 – â€“
 151 97 E28094 — â€”
 152 98   CB9C ˜ Ëœ
 153 99 E284A2 ™ â„¢
 154 9A   C5A1 š Å¡
 155 9B E280BA › â€º
 156 9C   C593 œ Å“
 157 9D   C29D   Â
 158 9E   C5BE ž Å¾
 159 9F   C5B8 Ÿ Å¸

 160 A0   C2A0   Â
 161 A1   C2A1 ¡ Â¡
 162 A2   C2A2 ¢ Â¢ 
 163 A3   C2A3 £ Â£
 164 A4   C2A4 ¤ Â¤
 165 A5   C2A5 ¥ Â¥
 166 A6   C2A6 ¦ Â¦
 167 A7   C2A7 § Â§
 168 A8   C2A8 ¨ Â¨
 169 A9   C2A9 © Â©
 170 AA   C2AA ª Âª
 171 AB   C2AB « Â«
 172 AC   C2AC ¬ Â¬
 173 AD   C2AD   Â
 174 AE   C2AE ® Â®
 175 AF   C2AF ¯ Â¯
 176 B0   C2B0 ° Â°
 177 B1   C2B1 ± Â±
 178 B2   C2B2 ² Â²
 179 B3   C2B3 ³ Â³
 180 B4   C2B4 ´ Â´
 181 B5   C2B5 µ Âµ
 182 B6   C2B6 ¶ Â¶
 183 B7   C2B7 · Â·
 184 B8   C2B8 ¸ Â¸
 185 B9   C2B9 ¹ Â¹
 186 BA   C2BA º Âº
 187 BB   C2BB » Â»
 188 BC   C2BC ¼ Â¼
 189 BD   C2BD ½ Â½
 190 BE   C2BE ¾ Â¾
 191 BF   C2BF ¿ Â¿

 192 C0   C380 À Ã€
 193 C1   C381 Á Ã
 194 C2   C382 Â Ã‚ 
 195 C3   C383 Ã Ãƒ
 196 C4   C384 Ä Ã„
 197 C5   C385 Å Ã…
 198 C6   C386 Æ Ã†
 199 C7   C387 Ç Ã‡
 200 C8   C388 È Ãˆ
 201 C9   C389 É Ã‰
 202 CA   C38A Ê ÃŠ
 203 CB   C38B Ë Ã‹
 204 CC   C38C Ì ÃŒ
 205 CD   C38D Í Ã
 206 CE   C38E Î ÃŽ
 207 CF   C38F Ï Ã
 208 D0   C390 Ð Ã
 209 D1   C391 Ñ Ã‘
 210 D2   C392 Ò Ã’
 211 D3   C393 Ó Ã“
 212 D4   C394 Ô Ã”
 213 D5   C395 Õ Ã•
 214 D6   C396 Ö Ã–
 215 D7   C397 × Ã—
 216 D8   C398 Ø Ã˜
 217 D9   C399 Ù Ã™
 218 DA   C39A Ú Ãš
 219 DB   C39B Û Ã›
 220 DC   C39C Ü Ãœ
 221 DD   C39D Ý Ã
 222 DE   C39E Þ Ãž
 223 DF   C39F ß ÃŸ

 224 E0   C3A0 à Ã
 225 E1   C3A1 á Ã¡
 226 E2   C3A2 â Ã¢ 
 227 E3   C3A3 ã Ã£
 228 E4   C3A4 ä Ã¤
 229 E5   C3A5 å Ã¥
 230 E6   C3A6 æ Ã¦
 231 E7   C3A7 ç Ã§
 232 E8   C3A8 è Ã¨
 233 E9   C3A9 é Ã©
 234 EA   C3AA ê Ãª
 235 EB   C3AB ë Ã«
 236 EC   C3AC ì Ã¬
 237 ED   C3AD í Ã
 238 EE   C3AE î Ã®
 239 EF   C3AF ï Ã¯
 240 F0   C3B0 ð Ã°
 241 F1   C3B1 ñ Ã±
 242 F2   C3B2 ò Ã²
 243 F3   C3B3 ó Ã³
 244 F4   C3B4 ô Ã´
 245 F5   C3B5 õ Ãµ
 246 F6   C3B6 ö Ã¶
 247 F7   C3B7 ÷ Ã·
 248 F8   C3B8 ø Ã¸
 249 F9   C3B9 ù Ã¹
 250 FA   C3BA ú Ãº
 251 FB   C3BB û Ã»
 252 FC   C3BC ü Ã¼
 253 FD   C3BD ý Ã½
 254 FE   C3BE þ Ã¾
 255 FF   C3BF ÿ Ã¿

More references

Various discussions on the MySQL forums:

Please note that any PHP code using the mysql_* API should be replaced by the mysqli_* API; the former is deprecated.

PHP/html/mysql usage
Migrating from latin1 to utf-8
Convert database from utf8_general_ci to uft8_unicode_ci
Latin1 to unicode character convert -- data are display with ?????
Problem with accents
help in relating tables and normailsation
Select statement with town Polish characters
connection collation
cannot change character set
joined query problem
Slow retreieve under MyIsam
UTF-8 problem
trying to move a database from mysql 4.0.18 to 5.0.77
Server just hanging - Copying to tmp table
Encoding problem...
Unicode collation, utf8 data fetch/manip problem
What storage requirement has FLOAT(24) ? Doc is ambigous
what is the step to convert a db from latin1 to UTF8
Special characters and database
Require suggestion for changing character set in Master master replication setup
Character encodings issues with php, mysql, apache
load data character set question
Storing UTF-8 strings in MySQL Database
utf8 inputs store as html entities, how to retrievesubstr of utf8 inputs?
Is STR_TO_DATE function doesn't work with UCS2 charset?

Miscellany:

Converting through BLOB

Wikipedia shows 0xD5 0xF5 iso upper/lower
shows utf8 c590 / c591
details several representations for 0150
discussion about how charsets are used

MySQL 5.5 & 4-byte utf8
MSDN discussion of Unicode
unicode.org -- The definitive word, but heavy going

mysqldump issues
Example of double-encoding and mysqldump
Windows, Unicode, LOAD DATA, little-endian, etc

Charbase - A visual unicode database

Adding a custom collation
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Unicode.scarfboy
UTF8 all the way through

latin1 ~= cp1252 ~= ISO 8859-1

latin1 is the default character set. MySQL's latin1 is the same as the Windows cp1252 character set. This means it is the same as the official ISO 8859-1 or IANA (Internet Assigned Numbers Authority) latin1, except that IANA latin1 treats the code points between 0x80 and 0x9f as “undefined,” whereas cp1252, and therefore MySQL's latin1, assign characters for those positions. For example, 0x80 is the Euro sign. For the “undefined” entries in cp1252, MySQL translates 0x81 to Unicode 0x0081, 0x8d to 0x008d, 0x8f to 0x008f, 0x90 to 0x0090, and 0x9d to 0x009d. 8859 and some comparisions

Combining Diacriticals

Some characters can be encoded as either a single character or as a letter and a "combining" acceent. For example, ä can be represented in utf8 as either c3a4 or 61cc88. The ä as a single 'character' is c3a4. The other encoding is 61, the hex code for 'a', plus cc88 is the utf-8 code for the 'combining diaeresis' (umlaut, double dot, etc). Wikipedia: Combining Diacritical Marks

If the collation is utf8_general_ci, the two encodings are treated as unequal. With utf8_unicode_ci, they are equal. Searching database for strings with Swedish characters

Programatic Collation

Suppose you have need for a collation that cannot be performed by any existing collation. Create two columns instead of one; let's call them exact and canonical. exact contains the correctly spelled (caps, accents, etc) value, and it has a suitable collation. canonical is a programmatically-altered version of exact. This may include
    ⚈  stripping accents
    ⚈  folding to lowercase
    ⚈  removing redundant whitespace (blanks, newlines, tabs)
    ⚈  removing punctuation
    ⚈  removing markup (such as html )
    ⚈  it could even include providing alternate spellings or synonyms

canonical would also have an index, possibly FULLTEXT and/or utf8mb4_bin. canonical may have duplicates even though exact probably does not.

Then, when the 'user' searches, you do the same folding to "canonical" before searching canonical. Display from exact.

LOAD DATA INFILE

In LOAD DATA INFILE, the manual points out:

"The server uses the character set indicated by the character_set_database system variable to interpret the information in the file. SET NAMES and the setting of character_set_client do not affect interpretation of input. If the contents of the input file use a character set that differs from the default, it is usually preferable to specify the character set of the file by using the CHARACTER SET clause. A character set of binary specifies 'no conversion'."

Example:

   LOAD DATA LOCAL INFILE '...'
      INTO TABLE tbl
      CHARACTER SET 'utf8mb4'
      FIELDS TERMINATED BY ','
      ENCLOSED BY '"'
      ESCAPED BY '\\'
      IGNORE 1 ROWS;

Characters in Western Europe

Taken from Wikipedia

    å, ä and ö in Finnish and Swedish
    à, ç, è, é, ï, í, ò, ó, ú, ü in Catalan
    å, æ and ø in Norwegian and Danish
    á, é, ó, ý, è, ë, ï in Dutch
    ä, ö, ü and ß in German
    á, ð, í, ó, ú, ý, æ and ø in Faroese
    á, ð, é, í, ó, ú, ý, þ, æ and ö in Icelandic
    à, â, ç, è, é, ë, ê, ï, î, ö, ô, ù, û, ÿ, æ, œ in French
    à, è, é, ì, ò, ù in Italian
    á, é, í, ñ, ó, ú, ï, ü, ¡, ¿ in Spanish
    à, á, â, ã, ç, é, ê, í, ó, ô, õ, ú in Portuguese (ü no longer used)
    á, é, í, ó, ú in Irish
    £ in British English

Turkish

The columns: utf8, unicode codepoint decimal, hex, HTML entity, 'name':

C387    0199    00C7    Ç  latin capital letter c with cedilla
C396    0214    00D6    Ö  latin capital letter o with diaeresis
C39C    0220    00DC    Ü  latin capital letter u with diaeresis
C3A7    0231    00E7    ç  latin small letter c with cedilla
C3B6    0246    00F6    ö  latin small letter o with diaeresis
C3BC    0252    00FC    ü  latin small letter u with diaeresis
C49E    0286    011E    Ğ  latin capital letter g with breve
C49F    0287    011F    ğ  latin small letter g with breve
C4B0    0304    0130    İ  latin capital letter i with dot above
C4B1    0305    0131    ı  latin small letter dotless i
C59E    0350    015E    Ş  latin capital letter s with cedilla
C59F    0351    015F    ş  latin small letter s with cedilla
E282A4  8356    20A4    ₤  lira sign
E282BA  8378    20BA    ₺  Turkish lira sign

Don't expect latin1 to work well:

CREATE TABLE `se96265` (
  `content` mediumtext NOT NULL,
  FULLTEXT KEY `content` (`content`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

SET NAMES utf8;
-- Turkish word for Croatia, second char is \xC4\xB1
INSERT INTO `se96265` (`content`) VALUES ('Hırvatistan');
SELECT content, HEX(content), HEX('Hırvatistan') FROM se96265\G
*************************** 1. row ***************************
            content: H?rvatistan
       HEX(content): 483F72766174697374616E
HEX('Hırvatistan'): 48C4B172766174697374616E

Note how the 'C4B1' turned into '?'

To get an error:

mysql> SET SESSION sql_mode = 'STRICT_ALL_TABLES';  -- So the INSERT will cause an error:
mysql> INSERT INTO `se96265` (`content`) VALUES ('Hırvatistan');
ERROR 1366 (HY000): Incorrect string value: '\xC4\xB1rvat...' for column 'content' at row 1

(Note: This technique for getting an error message is not specific to Turkish.)

ISO/IEC 8859-9 = latin5 = Turkish

If you have Turkish text in a latin1 column/table, you should convert to latin5 in order to correctly handle the 6 characters that are different (latin1: ÐÝÞðýþ --> latin5 ĞİŞğış). See Stackoverflow
Dotless i and dotted I

Arabic/Farsi

Arabic characters 2 bytes in utf8: Dxyy. For example, Arabic Letter 'HAH' is D8AD.

To get diacritic insensitive comparisons, use utf8_unicode_ci:

    SELECT 'بِسْمِ' = 'بسم' COLLATE utf8_unicode_ci;

The result I got back was 1 (true), meaning they are considered equal. With utf8_general_ci it came back with 0, meaning not equal. (Of course, you should declare your column(s) as utf8_unicode_ci rather than using the COLLATE clause.)

Stored Farsi, got '???'
Urdu case folding

Three ways to deal with inadequacies of existing collations: Custom Collation | Regularized field | REGEXP

Hebrew

To search for a particular character in a string, LIKE or REGEXP will work.

To search for "a cell being entirely Hebrew" is not easy. Essentially, you look for the utf8 encoding, then build a REGEXP to accomodate it. For example, to see which rows have a Hebrew name (in utf8):

    WHERE HEX(name) REGEXP '^(D7..)+$'

See forum discussion

This assumes (not quite perfectly) that all Hebrew encodings are D7xx. This can be adapted to other languages if the encoding is not to complex. Chinese, for example, is complex:

    WHERE HEX(name) REGEXP '^(E.....)+$'

    ⚈  Chinese characters are 3 bytes (6 hex) long
    ⚈  This fails to handle the 4-byte Chinese characters
    ⚈  This includes Japanese, Cherokee, etc.

If you have the "hebrew" encoding in a latin1 column, you may need a 3-step ALTER. See stackoverflow

Storing Hebrew strings Ok, but gibberish when pulling data on another server

Cyrillic, Macedonian, etc

(Random notes)

Cyrillic in utf8 is encoded D0yy-D4yy and EA99yy-EA9Ayy, but perhaps only D0yy is in common use.

For Cyrillic: utf8_unicode_ci is fine for these languages: Russian, Bulgarian, Belarusian, Macedonian, Serbian, and Ukrainian. While utf8_general_ci is fine only for Russian and Bulgarian subset of Cyrillic. Extra letters used in Belarusian, Macedonian, Serbian, and Ukrainian are sorted not well.

Ќ is hex D08C in utf8 or utf8mb4. Cyrillic is completely covered by either CHARACTER SET. К is D09A. Hmmm, this is quite interesting:

    mysql> SELECT 'К' = 'Ќ' COLLATE utf8_bin AS bin,
                  'К' = 'Ќ' COLLATE utf8_general_ci AS general,
                  'К' = 'Ќ' COLLATE utf8_unicode_ci AS unicode;
    +-----+---------+---------+
    | bin | general | unicode |
    +-----+---------+---------+
    |   0 |       1 |       0 |
    +-----+---------+---------+f

    mysql> SELECT 'Г' = 'Ѓ' COLLATE utf8_bin AS bin,
                  'Г' = 'Ѓ' COLLATE utf8_general_ci AS general,
                  'Г' = 'Ѓ' COLLATE utf8_unicode_ci AS unicode;
    +-----+---------+---------+
    | bin | general | unicode |
    +-----+---------+---------+
    |   0 |       1 |       0 |
    +-----+---------+---------+

And if it helps, here's the method from the alphabet helper. def alphabet cyrillic_alphabet = ["А", "Б", "В", "Г", "Д", "Ѓ", "Е", "Ж", "З", "Ѕ", "И", "Ј", "К", "Л", "Њ", "М", "Н", "Љ", "О", "П", "Р", "С","Т", "Ќ", "У", "Ф", "Х", "Ц", "Ч", "Џ", "Ш"] end – NewbieOnRails yesterday

These are letters in Macedonian alphabet, not on the Ubuntu Cyrillic keyboard.
    ⚈  D086 1030=x0406 [І] L CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
    ⚈  D087 1031=x0407 [Ї] L CYRILLIC CAPITAL LETTER YI
    ⚈  D08B 1035=x040B [Ћ] L CYRILLIC CAPITAL LETTER TSHE
    ⚈  D08E 1038=x040E [Ў] L CYRILLIC CAPITAL LETTER SHORT U

The bit patterns are different, so utf8_bin collates unequal. But usually whenever general is equal, so is unicode. Cyrillic/Macedonian

Cyrillic example:
title:       Добро пожаловать в Диксон,
hex:         D094D0BED0B1D180D0BE20D0BFD0BED0B6D0B0D0BBD0BED0B2D0B0D182D18C20D0B220D094D0B8D0BAD181D0BED0BD,
length:      47,
char length: 25

Data entry of Cyrillic in Workbench yields "????"

Other Natural Languages

This is a list hits about natural languages. Most have not been verified.

Lithuanian ąĄ čČ ęĘ ėĖ įĮ ųŲ ūŪ are saved as ?, however, šŠ žŽ do get saved. The reason: Column declared latin1

Vietnamese collations were added in MySQL 5.6.

em dash, dagger, and other special characters

mysql> SELECT ';' semicolon, UNHEX('cdbe') 'Greek question mark';
+-----------+---------------------+
| semicolon | Greek question mark |
+-----------+---------------------+
| ;         | ;                   |
+-----------+---------------------+

cp1250 - Central Europe

For a Devanagari character range:

   SELECT 'क़ज़फ़' REGEXP '^[क़-फ़]+$';  --> true -- But that is mere coincidence!

Stackoverflow thread

UTF8 Chinese String Comparison

REGEXP / RLIKE

Do not attempt to use most of the features of REGEXP to test for non-ascii characters; it will generally not work in utf8. MySQL's REGEXP only works with bytes, not utf8 characters.

Greek alpha (α) is hex CEB1; omega (ω) is hex CF89

* Character sets don't work right: [αω] checks for 4 bytes: CE, B1, CF, 89, which has very little to do with checking for those two letters. * Ranges also don't work right: [α-ω] checks for byte CE, range B1-CF, and byte 89. * . is one byte, not one character. * etc.

A few tricks will work:

* WHERE HEX(str) REGEXP '(..)*CEB1CF89' -- will check for 'αω' in the column; but you may as well do WHERE str LIKE '%αω%'. * WHERE HEX(str) REGEXP '(..)*F' -- will check for whether the UTF-8 string needs utf8mb4. (But how did you get str stored in the first place??)

To discover if a column has any UTF-8 characters:

WHERE HEX(name) REGEXP '^(..)*[CDEFcdef]'

will spot any name that has at least one UTF-8 character. (That is, CHARACTER SET utf8 or utf8mb4.)

Python

1st or 2nd line in source code (to have literals in the code utf8-encoded): # -*- coding: utf-8 -*-

Python code for dumping hex (etc) for string 'u':

for i, c in enumerate(u):
    print i, '%04x' % ord(c), unicodedata.category(c),
    print unicodedata.name(c)

Miscellany notes on coding for utf8:
    ⚈  db = MySQLdb.connect(host=DB_HOST, user=DB_USER, passwd=DB_PASS, db=DB_NAME, charset="utf8", use_unicode=True)
    ⚈  conn = MySQLdb.connect(host="localhost", user='root', password='', db='', charset='utf8')
    ⚈  config = 'OPTIONS': { ..., 'charset': 'utf8mb4', ... } -- in "settings.py"?
    ⚈  cursor.execute("SET NAMES utf8mb4;") -- not as good as using `charset'
    ⚈  db.set_character_set('utf8'), implies use_unicode=True
    ⚈  Literals should be u'...'
    ⚈  MySQL-python 1.2.4 fixes a bug wherein varchar(255) CHARACTER SET utf8 COLLATE utf8_bin is treated like a BLOB.

Checklist:

    ⚈  # -*- coding: utf-8 -*- -- (for literals in code) (This tells the interpreter which encoding to use when parsing the file, it affects only files that contain non-ascii characters (literals). It does not effect the code in any other way.)
    ⚈  charset='utf8' in connect() call -- Is that buried in bottle_mysql.Plugin? (Note: Try 'utf-8' and 'utf8')
    ⚈  Text encoded in utf8.
    ⚈  No need for encode() or decode() if you are willing to accept utf8 everywhere.
    ⚈  u'...' for literals
    ⚈  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> near start of html page
    ⚈  <meta charset="utf-8" /> is a shorter version (if using only HTML5).
    ⚈  Content-Type: text/html; charset=UTF-8 (in HTTP response header)
    ⚈  header('Content-Type: text/html; charset=UTF-8'); (in PHP to get that response header)
    ⚈  CHARACTER SET utf8 COLLATE utf8_general_ci on column (or table) definition in MySQL.
    ⚈  utf8 all the way through
    ⚈  Use MySQL Connector/Python instead of pyodbc and MySQL Connector/ODBC

Side note: Python defaults to autocommit=off; this can be a rude surprise when your INSERTs vanish.

Showing hex: var.encode().hex()

References:

    ⚈  https://docs.python.org/2/howto/unicode.html#the-unicode-type
    ⚈  https://stackoverflow.com/questions/9154998/python-encoding-mysql
    ⚈  https://dev.mysql.com/doc/connector-python/en/connector-python-connectargs.html
    ⚈  https://stackoverflow.com/questions/34069603/python-pyodbc-weird-unicode-bytecode

How to store arabic text in mysql database using python?

The Python language environment officially only uses UCS-2 internally since version 2.0, but the UTF-8 decoder to "Unicode" produces correct UTF-16. Since Python 2.2, "wide" builds of Unicode are supported which use UTF-32 instead;[16] these are primarily used on Linux. Python 3.3 no longer ever uses UTF-16, instead strings are stored in one of ASCII/Latin-1, UCS-2, or UTF-32, depending on which code points are in the string, with a UTF-8 version also included so that repeated conversions to UTF-8 are fast.

Python check for utf8 (versus utf8mb4)

The python package "MySQL-python" version needs to be at least 1.2.5 in order to handle utf8mb4.

import sys
reload(sys)  # Reload does the trick!
sys.setdefaultencoding('UTF8')

pre-1.2.4 (thanks to Ashish Bhatt): Before querying into the database try enforcing this over the cursor

   cursor.execute('SET NAMES utf8mb4')
   cursor.execute("SET CHARACTER SET utf8mb4")
   cursor.execute("SET character_set_connection=utf8mb4")

Python 2.7 issues; improvements in Python 3

json.dumps(mydict, ensure_ascii=False) avoids "\u...." strings.

If you have \u00ef type text, consider this for fixing it: codecs.decode(body, 'unicode-escape')

utf-8-sig

From codecs — Codec registry and base classes

Without external information it’s impossible to reliably determine which encoding was used for encoding a string. Each charmap encoding can decode any random byte sequence. However that’s not possible with UTF-8, as UTF-8 byte sequences have a structure that doesn’t allow arbitrary byte sequences. To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written. As it’s rather improbable that any charmap encoded file starts with these byte values (which would e.g. map to


    LATIN SMALL LETTER I WITH DIAERESIS
    RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    INVERTED QUESTION MARK

in iso-8859-1), this increases the probability that a utf-8-sig encoding can be correctly guessed from the byte sequence. So here the BOM is not used to be able to determine the byte order used for generating the byte sequence, but as a signature that helps in guessing the encoding. On encoding the utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes to the file. On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided.

PHP

    ⚈  If using mysqli, do $mysqli_obj->set_charset('utf8mb4');
    ⚈  If using PDO do somethin like $db = new PDO('dblib:host=host;dbname=db;charset=utf8mb4', $user, $pwd);
    ⚈  Alternatively, execute SET NAMES utf8mb4
Any of these will say that the bytes in the client are UTF-8 encoded. Conversion, if necessary, will occur between the client and the database if the column definition is something other than utf8mb4.

Be sure to have php-mbstring installed (see php.ini)

Random notes about PHP. (See also Best Practice, above.)

In php.ini: default_charset UTF-8 The default changed to UTF-8 in PHP 5.6.

To see what PHP has:

   $hex = unpack('H*', $text);
   echo implode('', $hex);

or simply echo bin2hex($text);

To get "U+" (unicode) notation, first fetch the data, then use PHP code trim(str_replace('\u', ' U+', json_encode('كتب عربية')), '" '); This will result in U+0643 U+062a U+0628 U+0639 U+0631 U+0628 U+064a U+0629.

PHP: Convert \unnnn to utf8:

   $myr = '\u101c\u1031\u1011';     // Given this
   $x = json_decode('"'.$myr.'"');  // Add quotes and decode
   // $x is not string, like လေထ

If you use header(), do: header('Content-type: text/plain; charset=utf-8');

PHP: undo_magic_quotes_gpc($_POST);

BjyProfiler needs charset=utf8; it apparently overrides other PHP settings.

If using mysqli, character_set_name() (formerly spelled client_encoding()) does NOT mirror SET NAMES.

$doc->loadHTML(''. $content);

Other Computer Languages

This is a list of hints on how to establish utf8 for various computer languages. Most have not been verified.

ADO.NET "charset=utf8"

AJAX AJAX
encodeURIComponent() may be useful.

Apache servers are (were?) configured to serve files in ISO-8859-1 by default, so you need to add the following line to your .htaccess file:

AddDefaultCharset UTF-8

ASP, ASP.Net

<%Response.charset="utf-8"%>

Mathias Bynens

Various tips
including:

Apache Config (/etc/httpd/conf/httpd.conf) :

AddDefaultCharset UTF-8 Apache .htaccess file :

AddCharset UTF-8 .htm AddCharset UTF-8 .html AddCharset UTF-8 .php

Connector/C

mysql_set_character_set('utf8mb4');

C#: C# code to undo (while reading?) double-encoded strings
Similar code in SQL:

   CONVERT(UNHEX(HEX(CONVERT(CONVERT(col USING utf8) USING latin1))), CHAR)

Put this in the connection string:

   id=my_user;password=my_password;database=some_db123;charset=utf8;

Lao and C#

Console.OutputEncoding = System.Text.Encoding.UTF8;

Server=myServerAddress;Database=myDataBase;Uid=myUsername;Pwd=myPassword; CharSet=utf8; Note! Use lower case value utf8 and not upper case UTF8 as this will fail. Connector/Net

codeigniter:

application/config/config.php - Note the dash :

   $config['charset'] = 'UTF-8';

application/config/database.php :

$db['default'] = array(
   ...
   'char_set' => 'utf8mb4',
   'dbcollat' => 'utf8mb4_unicode_520_ci',

Note: Those are aimed at MySQL. dbcollat is used only for PHP's mysqli.

   httppost.setEntity(new UrlEncodedFormEntity(nameValuePairs, HTTP.UTF_8));

   echo meta('Content-type', 'text/html; charset=utf-8', 'equiv');

In database.php, Set char_set to utf8mb4 and dbcollat to utf8mb4_unicode_ci or utf8_general_ci Codeigniter 3.0 notes

ColdFusion

    <cfprocessingdirective pageEncoding="utf-8">

    <cffile
    action="read"
    file="#settings.csvfile#"
    variable="autodata"
    charset="utf-8">

Django

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        ...
        'OPTIONS': {
                    'charset': 'utf8mb4',
                    'use_unicode': True, },
    },
}
my.cnf:

[mysqld]
character-set-server=utf8mb4
default-collation=utf8mb4_unicode_ci

[client]
default-character-set=utf8mb4

StackOverflow

client_encoding: 'UTF8' Django utf8 usage

The Django container (we use docker) was missing the LANG env var. Setting LANG to en_US.UTF-8 and restarting the container fixed the problem.

See also the Python notes, above.

Drupal, WordPress, and any other product that uses PHP: Turn off mbstring.encoding_translation. Probably everything to do with mbstring should be avoided.

amending your settings.php database connection by adding:

   $databases['default']['default'] = array(
     'driver' => 'mysql',
     'database' => 'databasename',
     'username' => 'username',
     'password' => 'password',
     'host' => 'localhost',
     'charset' => 'utf8mb4',
     'collation' => 'utf8mb4_general_ci',
   );

More discussion of utf8mb4

go-mysql

[username[:password]@][protocol[(address)]]/dbname?collation=utf8mb4_unicode_ci https://github.com/go-sql-driver/mysql#charset

Grails Grails and utf8mb4
Remove the "characterEncoding=UTF" part in the dataSource url. Otherwise grails always forces MySQL to use UTF-8 instead of utf8mb4. E.g.

dataSource:
        dbCreate: "update"
        url: "jdbc:mysql://localhost:8889/mydbname?useUnicode=true&zeroDateTimeBehavior=convertToNull&autoReconnect=true"

Heidi

When importing, don't use "auto-detect", explicitly request "UTF-8" if appropriate. Heidi import

Java

Java originally used UCS-2, and added UTF-16 supplementary character support in J2SE 5.0. ucs2 versus utf16

⚈ What if you put a 5.1 replica off a 4.x master? (The German sharp-s could cause trouble.)
⚈ Replication type (SBR vs RBR)?

Add ?useUnicode=yes&characterEncoding=UTF-8 to the JDBC URL (or maybe it is =true and =utf8)

   <%@ page language="java" pageEncoding="UTF-8"%>
   <%@ page contentType="text/html;charset=UTF-8" %>

   compileJava.options.encoding = 'UTF-8'

   <form method="post" action="/your/url/" accept-charset="UTF-8">

To use 4-byte UTF8 with Connector/J configure the MySQL server with character_set_server=utf8mb4. Connector/J will then use that setting as long as characterEncoding has not been set in the connection string. This is equivalent to autodetection of the character set.

utf8mb4 needs special handling in Connector/J

"For Connector/J 8.0.12 and earlier: In order to use the utf8mb4 character set for the connection, the server MUST be configured with character_set_server=utf8mb4; if that is not the case, when UTF-8 is used for characterEncoding in the connection string, it will map to the MySQL character set name utf8, which is an alias for utf8mb3.'

Do not issue the query SET NAMES with Connector/J, as the driver will not detect that the character set has been changed by the query, and will continue to use the character set configured when the connection was first set up.

https://dev.mysql.com/doc/connector-j/8.0/en/connector-j-reference-charsets.html

Specifically says:

For Connector/J 8.0.12 and earlier: In order to use the utf8mb4 character set for the connection, the server MUST be configured with character_set_server=utf8mb4; if that is not the case, when UTF-8 is used for characterEncoding in the connection string, it will map to the MySQL character set name utf8, which is an alias for utf8mb3.

https://stackoverflow.com/questions/73190405/intermittent-random-double-encoded-presents-as-mojibake-emojis-when-storing-fr Jetty

jetty

Java Servlet

resource.setContentType ("text/html;charset=utf-8");

JSP

<%@ page contentType="text/html; charset=UTF-8" %>

Laravel

If you have trouble with 191, etc in Laravel, you have these choices

    ⚈  255 -> 191 (but make sure you don't currently have longer addresses)
    ⚈  utf8mb4 -> utf8 (Caution: This disallows Emoji and some Chinese characters) See config/database.php
    ⚈  INDEX(email(20)) (If it is UNIQUE or PRIMARY KEY, do not pick this option)
    ⚈  It is possible to reconfigure (if after 5.6.3) the server and table to allow bigger indexes; this is not the 'default' until 5.7.7. Stackoverflow

In the file config/database.php: 'mysql' => [..., 'charset' => 'utf8mb4', 'collation' => 'utf8mb4_unicode_ci', ...]

If you would like to log the MYSQL of the query you can use this: \DB::enableQueryLog(); before the query and this: logger(\DB::getQueryLog() ); afterwards. This will save the query to your laravel log file Navicat

Old versions of Navicat have an option for utf8, but not for utf8mb4. So, if you need utf8mb4,... In the encoding options for navicat connections, pick "use mysql encoding", not "utf8". Forum

Moodle

UTF8 in Moodle

mysql commandline tool mysql --default-characterset=utf8mb4 and (or)

[mysql]
default-character-set = utf8mb4

StackOverflow

NET

Server=myServerAddress;Database=myDataBase;Uid=myUsername;Pwd=myPassword;CharSet=utf8;

node.js var connection = mysql.createConnection({ ... , charset : 'utf8mb4'}); stackoverflow
utf8 package

OrmLite ConnectionReset=False;

Powershell ISE and Powershell Console: $OutputEncoding = New-Object -typename System.Text.UTF8Encoding

   library(RODBC)
   imp <- odbcConnect("SERVER", uid = "***", pwd="***",DBMSencoding="utf8")

ODBC 3.5.1 probably won't handle utf8; either of these may be correct:

   ODBC;DRIVER={MySQL ODBC 5.1 Driver};UID=...;PWD=...;PORT=3306;DATABASE=...;SERVER=127.0.0.1;CHARSET=UTF8;
   strConnection="driver={MySQL ODBC 5.1 Driver}; server=...;uid=...;pwd=...;database=...;stmt=SET NAMES utf8;"

Unicode is not supported in ADO applications (including ASP) with the Connector/ODBC 3.51 driver. Please consider testing MySQL Connector/ODBC 5.1 which includes Unicode support for ASP.

One source calls for "MySQL ODBC 8.0 ANSI Driver" to handle Emoji.

Panda:

Stackoverflow on Panda
See also notes on Python.

df.to_csv('file.csv',encoding='utf-8-sig')

Panda Release 0.17.1 [on Nov 13, 2015](#event-463745727) fixes an issue with non-ascii characters in column (etc) names.

Perl:

    use utf8;
    use open ':std', ':encoding(UTF-8)';
my $dbh = DBI->connect("dbi:mysql:".$dsn, $user, $password, {
   PrintError => 0,
   RaiseError => 1,
   mysql_enable_utf8 => 1,  # Switch to UTF-8 for communication and decode.
});
# or {mysql_enable_utf8mb4 => 1} if using utf8mb4

Stackoverflow - double encoding and JSON
Perl and UTF-8

A workaround for an issue in Perl 5: binmode(STDOUT, ':unix:encoding(utf8):crlf');

Perl and JSON

PHP see above

pymysql:

import pymysql
con = pymysql.connect(host='127.0.0.1', port=3306,
                      user='root', passwd='******',
                      charset="utf8mb4")

Python see above

R / RStudio

   Tool -> Global Options -> Code -> Saving and put UTF-8
   rs <- dbSendQuery(con, 'set character set "utf8"')
   rs <- dbSendQuery(con, 'SET NAMES utf8')

options(encoding = "UTF-8") at the top of my main script from which I call my package seems to fix the issue with having non-ascii characters in my package code.

read_chunk(lines = readLines("TestSpanishText.R", encoding = "UTF-8")) (also file())

Character Encoding

Rails: Rails and utf8mb4

Ruby on Rails:

   "\xD0\x9A\xD0\xB0\xD1\x81\xD0\xBB".force_encoding("UTF-8") => "Щасл"

Watch out for rake db:reset . Consider, instead, rake db:drop && rake db:create && rake db:migrate . The reset just created the tables from the already stored schema which had the wrong charset and collation. But re-creating them from the migrations created them with the new charset and collation.

RoR: application_controller.rb

  def configure_charsets
    response.headers["Content-Type"] = "text/html; charset=utf-8"
    suppress(ActiveRecord::StatementInvalid) do
      ActiveRecord::Base.connection.execute 'SET NAMES UTF8'
    end
  end

Edit your database.yml with below code line encoding: utf8

RocksDB, MyRocks not yet answered

ServiceStack ConnectionReset=False;

Spring/Hibernate: (See also Best Practice, above.)

Hibernate XML:

   <property name="hibernate.connection.CharSet">utf8mb4</property>
   <property name="hibernate.connection.characterEncoding">UTF-8</property>
   <property name="hibernate.connection.useUnicode">true</property>

Connection url:

db.url=jdbc:mysql://localhost:3306/db_name?useUnicode=true&character_set_server=utf8mb4

The above changes were enough for me to upgrade from utf8 to utf8mb4 charset scheme.

As a side note I would like to make one clarification that UTF-8 is the character encoding while utf8mb4 is a character set that MySQL supports. MySQL's utf8mb4 is a superset to MySQL's utf8.

Spring/Hibernate filter:

    <form accept-charset="UTF-8">

Spring/Hibernate: <property name="url" value="jdbc:mysql://localhost:3306/miniprojetjee?useUnicode=true&connectionCollation=utf8_general_ci&characterSetResults=utf8&characterEncoding=utf-8"/> (or maybe it is =yes)

"Spring": @RequestMapping(value = "/getRegion2", produces={"application/json; charset=UTF-8"},method = RequestMethod.GET)

Hibernate manual

Spring MVC UTF-8 Encoding

<filter>
    <filter-name>encoding-filter</filter-name>
    <filter-class>
        org.springframework.web.filter.CharacterEncodingFilter
    </filter-class>
    <init-param>
        <param-name>encoding</param-name>
        <param-value>UTF-8</param-value>
    </init-param>
    <init-param>
        <param-name>forceEncoding</param-name>
        <param-value>true</param-value>
    </init-param>
</filter>

Spring Boot
Make sure you register Spring's CharacterEncodingFilter in your web.xml

Load the lib hibernate-spatial in pom.xml, and set the config spring.jpa.properties.hibernate.dialect: org.hibernate.dialect.MySQL5InnoDBDialect in application.yml Reference

For sprint-boot, you may need N in front of strings: INSERT INTO X (col) VALUE (N'ščž');

sqlalchemy:

    db_url = sqlalchemy.engine.url.URL(drivername='mysql', host=foo.db_host,
        database=db_schema,
        query={ 'read_default_file' : foo.db_config, 'charset': 'utf8mb4' })

MySQL-Unicode in sqlalchemy

engine = create_engine('mysql://root:@localhost/testdb?charset=utf8', encoding = 'utf-8')

sqoop

sqoop import --connect jdbc:mysql://server.foo.com/db --table bar --direct -- --default-character-set=latin1 (Probably the latin1 should be replaced by the encoding you have.)

Tomcat

tomcat

Wildfly: Configuration -> Subsystems -> Undertow -> Servlet/JSP -> View -> Set "User listener encoding" to true.

WordPress: in wp-config.php, define('DB_CHARSET', 'utf8mb4'); and define('DB_COLLATE', ''); WordPress & utf8mb4

Tomcat/servlet/JSP/etc settings

vb.net / VBScript:

Dim MyConn As New MySqlConnection("Server=localhost;User Id=u;Password=x;Database=d;Charset=utf8")

<%@language="VBSCRIPT" CODEPAGE="65001" LCID=1033%>

Console.OutputEncoding = System.Text.Encoding.UTF8

web2py

   db = DAL('mysql://username:password@localhost/test?set_encoding=utf8mb4')

web.xml

web.xml

Hex 80-9F of cp1252 (Windows-1252 vs ISO-8859-15) latin1 codes

MySQL 8.0

(Source)

This is out of date

Collations to add:
We are going to add following collations. The collation name contains:
a. character set name: "utf8mb4"
b. language's iso code: for example, "cs" for Czech
c. UCA version: "800"
d. accent / case insensitive: "ai_ci"

The complete list is:
Collation name                      language
------------------------------------------------------------
utf8mb4_cs_800_ai_ci                Czech
utf8mb4_da_800_ai_ci                Danish
utf8mb4_de_phonebook_800_ai_ci      German (phonebook order)
utf8mb4_eo_800_ai_ci                Esperanto
utf8mb4_es_800_ai_ci                Spanish
utf8mb4_es_traditional_800_ai_ci    Spanish (traditional)
utf8mb4_et_800_ai_ci                Estonian
utf8mb4_hr_800_ai_ci                Croatian
utf8mb4_hu_800_ai_ci                Hungarian
utf8mb4_is_800_ai_ci                Icelandic
utf8mb4_la_800_ai_ci                Roman (classical Latin)
utf8mb4_lt_800_ai_ci                Lithuanian
utf8mb4_lv_800_ai_ci                Latvian
utf8mb4_pl_800_ai_ci                Polish
utf8mb4_ro_800_ai_ci                Romanian
utf8mb4_sk_800_ai_ci                Slovak
utf8mb4_sl_800_ai_ci                Slovenian
utf8mb4_sv_800_ai_ci                Swedish
utf8mb4_tr_800_ai_ci                Turkish
utf8mb4_vi_800_ai_ci                Vietnamese

Instead, see (as of 8.0.0) utf8mb4 collations

Random notes

UTF-8 all the way through
Scan for 8-bit codes, find and decipher double-encoding

A table declared to be latin1, and containing latin1 bytes can be converted to utf8 via

    ALTER TABLE tbl CONVERT TO CHARACTER SET utf8;

Example from forums

Using Replication to assist in charset conversion?

utf8_general_ci, which will treat Résumé the same as resume MYSQL doesn't 'see' accents

Exact match when _ci collation is on column - even using index
WHERE keyword = CONVERT('cho' USING binary)

Re: constants in SQL:

EXPLAIN EXTENDED SELECT ...;
SHOW WARNINGS;

sometimes gives extra clues into what charset/collation issues are being faced during a SELECT. It will probably contain 'introducers'; example: _utf8'abc'.

To ponder: convert(convert((binary col1) using latin1) using utf8)

phpMyAdmin might be connecting as "root", thereby bypassing init_connect = "SET NAMES utf8". This could defeat your attempts to be fully utf8. Possible workaround: Use a login that avoids SUPER priv. (This suggestion is good advice for all applications -- for security reasons.)

    # Find rows with any 8-bit chars:
    SELECT * FROM tbl WHERE HEX(colname) RLIKE '^(..)*[8-F].';

    # Remove all occurrences of one 8-bit code (hex 92, in this example):
    UPDATE tbl SET colname = REPLACE(colname, UNHEX('92'), '');

Reference: how to replace hex in varchar field

Converting latin1 hex to utf8 hex: HEX(CONVERT(CONVERT(UNHEX('F6') USING LATIN1) USING UTF8)) ⇒ C3B6

(Note: The doc refers to version 5.5, but probably applies to all versions from 4.1 on.)

Connector/.NET possibly defaults to latin1. Consider changing 'connection string' if you need utf8.

Some discussion of Japanese collation

Suggested for my.cnf by stackoverflow
stackoverflow

[client]
default-character-set=utf8

[mysql]
default-character-set=utf8
character-set-server = utf8

[mysqld]
collation-server = utf8_unicode_ci
init-connect='SET NAMES utf8'
character-set-server = utf8

Caution: Setting default-character-set that way for the mysql client cannot be overridden by SET NAMES.

my.cnf: skip-character-set-client-handshake to pretend to be 4.0?

A possible double-encoding fix

$string = htmlspecialchars($string, ENT_QUOTES, "UTF-8");

my.cnf needs these to force the server and the client to use UTF8:

skip-character-set-client-handshake
collation-server=utf8_unicode_ci
character-set-server=utf8

However, "The server character set and collation are used as default values if the database character set and collation are not specified in CREATE DATABASE statements. They have no other purpose." manual on charset-server
It is not clear whether skip-character-set-client-handshake is really needed.

For a description of the character_set_* settings and SET NAMES: manual on Connection Character Sets

In 5.6, this helps get Unicode rendering for mysql commandline tool for Windows: mysql --default-character-set=utf8 Windows Unicode Support
Also sugest (in my.ini)

   [mysql]
   default-character-set=utf8

    <?xml version="1.0" encoding="utf-8"?>

REGEXP (RLIKE) with character class is likely to work wrong: Bug report from 2007
Manual Warning

   <form method="post" action="/your/url/" accept-charset="UTF-8">

Changing default charset

Possible values for <meta http-equiv="Content-Type" content="text/html; charset=..." /> can be found in iana charsets
For example, use 'ISO-8859-9' for MySQL's 'latin5'.

How to fix a table that has a mixture of latin1 and utf8

Test for Chinese characters: REGEXP '^(..)*(E[2-9F]|F0A)' -- Caveat: This may not be precise. StackOverflow discussion

In my.cnf:

   collation_server = utf8mb4_unicode_520_ci

sets

   character_set_database   | utf8mb4
   character_set_server     | utf8mb4
   collation_database   | utf8mb4_unicode_520_ci
   collation_server     | utf8mb4_unicode_520_ci

but not the things set by SET NAMES, nor collation_connection.

From language name to list of characters (dwell to see unicode, not utf8)

Illustrive examples of ALTERs

5.5.3 (10/2010) Changelog: The following obsolete constructs have been removed. The --default-character-set and --default-collation server options (use --character-set-server and --collation-server). Then...

5.7.6 (9/2015) Changelog: The following items are deprecated and will be removed in a future MySQL release. The global character_set_database and collation_database system variables are deprecated and will be removed in a future version of MySQL. Assigning a value to the session character_set_database and collation_database system variables is deprecated and assignments produce a warning. The session variables will become read only in a future version of MySQL and assignments will produce an error. It will remain possible to access the session variables to determine the database character set and collation for the default database.

LOAD DATA added syntax to include charset valid for the file to be loaded. In 4.1, LOAD DATA utilises database charset. Bug 10195

Check for capitalized Swedish words
shows the difficulty of using REGEXP, and provides a workaround.

When a Stored Routine is created, it has the current CHARACTER SET and COLLATION attached to it. Beware -- this may not be what you expect later. Use SHOW PROCEDURE name to see the setting.

"ch and ll are no longer letters since 1994" - This probably means that "spanish2" collations are not needed for current texts. Reference
(in Spanish).

LIKE != = -- (a LIKE b) is almost the same as (a = b) in that collations matter. However trailing spaces can lead to different actions. And 2-byte letters (ae) are treated as two characters in LIKE, but perhaps one character, such as when compared to (ä) with utf8_german2_ci. SO Discussion
Diacritics used in Europe

$str = mb_convert_encoding($str, 'UTF-8', 'HTML-ENTITIES');

Changing case
Examples of mismatch of utf8 and utf8mb4
Search for emoji

iPhone Emoticons

Concerning LOAD DATA INFILE: The server uses the character set indicated by the character_set_database system variable to interpret the information in the file. SET NAMES and the setting of character_set_client do not affect interpretation of input. If the contents of the input file use a character set that differs from the default, it is usually preferable to specify the character set of the file by using the CHARACTER SET clause. A character set of binary specifies “no conversion.”

Transliterate accented and Cyrilic characters into latin
A messy case where double-encoding was visible
Stored function to decode unicode escapes

ALTER TABLE mytable DEFAULT CHARACTER SET utf8 can create double encoding

Good explanation of collation_connection

If having trouble importing utf8mb4 characters, give this a try:

USE my_example_db_name;
# Select the right charset
SET NAMES 'utf8mb4';
# Import from SQL file
SOURCE /somewhere/dump.sql;
# Disconnect from SQL server
EXIT

Unicode BMP breakdown

You can force client programs to use specific character set as follows:

[client]
default-character-set=charset_name

This is normally unnecessary. However, when character_set_system differs from character_set_server or character_set_client, and you input characters manually (as database object identifiers, column values, or both), these may be displayed incorrectly in output from the client or the output itself may be formatted incorrectly. In such cases, starting the mysql client with --default-character-set=system_character_set—that is, setting the client character set to match the system character set—should fix the problem.

Reference

Check that skip-character-set-client-handshake is not being used.

A double-encoding example

A history of "why" utf8 vs utf8mb4

[[https://stackoverflow.com/questions/910793/detect-encoding-and-make-everything-utf-8][Detect encoding]] iconv(mb_detect_encoding())

[[https://stackoverflow.com/a/64863428/1766831][UCS2 vs UTF-16]] (and Emoji)

VIEW

unicode and html entities and php and json -- confusion

History, with emphasis on Windows and UCS-2/UTF-16

What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text