How Mangling Happens on INSERT
Fixing utf8 bytes in latin1 column
Fixing mismatch between CHARSET and data encoding
Fixing mix of charsets in a column
Fix definition and data, IF correctly encoded
Fixing while moving data with mysqldump
Fixing while moving data with LOAD DATA
"Combining" (non-spacing) characters
Stored Procedures/Functions, Triggers
Making a new collation / alternatives
Changing the collation in a SELECT
ERROR 1253 (42000): COLLATION '...' is not valid for CHARACTER SET '...'
CAVEAT
This one of my pages is not well maintained. This is a jumble of poorly-organized information; some info is quite useful, some is not. I keep it around only because there of the few good nuggets buried in it.
The problems being addressed
⚈ Your web page the wrong characters coming out where accented letters should be.
⚈ You are upgrading from MySQL 4.0, where characters were simply bytes.
⚈ You are trying to wrap your head around how MySQL handles CHARSET / COLLATION.
⚈ You upgraded, and now you are getting garbage.
Please read most of this document. I know it is long (after all, I wrote it!)
But you really need to understand a lot of the pieces in it, in order
to solve your problem.
The tips in this document apply to MySQL, MariaDB, and Percona.
Version differences are called out where relevant.
Basic Concepts
"Character" and "byte" are different!
You must understand this before continuing.
A "byte" is an a-bit thing; it is the unit of space in computers (today).
A "character" is composed of one or more bytes, and represents what we think
of when reading.
A byte can represent only 256 different values.
There are over 11,000 Korean characters and over 40,000 Chinese characters --
no way to squeeze such a character into a single byte.
Charset vs collation. These are different things!
'Charset' ('character set'; 'encoding') refers to the bits used to represent 'characters'.
'Collation' refers to how those bits could be compare for inequality (WHERE)
and sorting (ORDER BY). GROUP BY and FOREIGN KEY CONSTRAINTS can also involve collation.
And it even can involve deciding whether two different bit strings compare 'equal'.
History
1950's -- A character was (sometimes) represented in only 5 bits, on "paper tape"; no lowercase; had to "shift" to switch between letters and digits.
1960's -- A character was 6 bits; no lower case. "Punched cards" were used.
1970's -- 7-bit ASCII becomes common -- that limits you to English.
And the 8-bit "byte" was invented and was coming into common usage (re: IBM 360).
1980's -- Beginning to see 8-bit encodings. Especially since 7-bit ASCII
was wasting a bit of the omni-present "byte".
This can handle Western European accented letters.
1990's -- The computer world realizes that there are other people in
the world and embarks on Unicode and UTF8. ("UTF" = "Unicode Transformation Format")
Meanwhile, MySQL is born,
but has enough problems without worrying about character sets.
Through version 4.0, a CHAR is just a byte.
You can put any kind of bytes, representing anything, into a VARCHAR.
Thus, begins the need for this discussion.
MySQL 4.1 introduced the concept of "character set" and "collation".
If you had legacy data or legacy code, you probably did not notice
that you were messing things up when you upgraded.
Or you started with 4.1 (or later) and "latin1 / latin1_swedish_ci"
and failed to notice that you were asking for trouble.
Today, it's pretty safe to simply lay down the law and say
"Use utf8 for all text."
If you have version 5.5.3 or later,
"Use utf8mb4 for all text."
Sure, there are other character sets,
and they are useful if you have a narrow focus.
But, you may as well use utf8mb4.
Another take on the History
Best Practice
Best practice is to go completely utf8mb4.
I will focus on utf8 and utf8mb4, but if you choose to do otherwise, keep reading;
most of this discussion can still be adapted to the charset of your choice.
For collation, probably the best 'overall' collation is utf8mb4_0900_ai_ci (if available) or utf8mb4_unicode_520_ci.
0900 refers to UCA 9.0. xxx_unicode_520_ci collations are based on UCA 5.2.0 weight keys: https://www.unicode.org/Public/UCA/5.2.0/allkeys.txt.
Unnumbered Collations are based on the older UCA 4.0.0.
A web page (if that is what you have) should begin with
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Establish the characterset for talking to the MySQL server:
⚈ init_connect = 'SET NAMES utf8' in my.cnf
⚈ SET NAMES utf8 -- done immediately after connecting
⚈ SET GLOBAL VARIABLE character_set_... SET NAMES does the three you need.
⚈ In some cases in your my.cnf (my.ini) file, add this under [mysql] or [client]: default-character-set = utf8
For PHP:
⚈ (deprecated; DO NOT USE mysql_* !) mysql interface: mysql_set_charset('utf8'); (assuming PHP 5.2.3 & MySQL 5.0.7) Note: "mysql_" interface is deprecated.
⚈ mysqli interface: mysqli_set_charset('utf8') function. See mysqli_set_charset
⚈ PDO interface: set the charset attribute of the PDO dsn or via SET NAMES utf8mb4.
PDO manual
For Java (JDBC):
⚈ ?useUnicode=yes&characterEncoding=UTF-8 in the getConnection() call.
For Hikari (perhaps):
⚈ spring.jpa.properties.hibernate.connection.characterEncoding=utf-8
⚈ spring.jpa.properties.hibernate.connection.CharSet=utf-8
⚈ spring.jpa.properties.hibernate.connection.useUnicode=true
(or maybe it is =yes and =utf8)
add
spring: datasource: connectionInitSql: "SET NAMES 'utf8mb4'"in the application.yml. connectionInitSql is used by HikariCP when it open the connection.
Client Byte Encoding SET NAMES column CHARSET Outcome -------------------- ---------- -------------- --------- 7-bit (eg, English) (any) (any) GOOD: bytes stored unmodified latin1 8-bit utf8 (any) BAD: String truncated any 8-bit byte (any) ascii BAD: '????; utf8 (not West Europe) utf8 latin1 BAD: '????' utf8 (west Europe) utf8 latin1 GOOD: latin1 has the equivalent values (any) Matches Enc Matches Enc GOOD: bytes stored unmodified (any) Matches Enc different GOOD: bytes transcoded to CHARSET utf8 8-bit latin1 latin1 BAD: 'é' -> 'é' and similar garbage; ; see "2-step ALTER" utf8 8-bit latin1 utf8 BAD: see "double encoding"
Client byte(s) character_set_client: E9 C3A9 latin1 é Ã © utf8 é⚈ Upper-left and lower-right: The correct 'character' is sent.
On the wire column's charset: é Ã © latin1 E9 C3A9 utf8 C3A9 C383C2A9⚈ Left column: all is well.
SHOW SESSION VARIABLES LIKE 'character_set%'; | character_set_client | latin1 | character_set_connection | latin1 | character_set_results | latin1(There will be other rows; these three are the important ones.) SET NAMES sets those three; this example shows the result of SET NAMES latin1, not the recommended setting. More later.
SELECT x, HEX(x), LENGTH(x), CHAR_LENGTH(x) FROM ... WHERE ... LIMIT 1You will get 4 columns:
WHERE HEX(col) RLIKE "^(..)*(F.|E2|EF)"
SHOW VARIABLES LIKE 'character_set%'; SHOW VARIABLES LIKE 'collation%';
ALTER TABLE tbl CONVERT TO CHARACTER SET utf8mb4;
ALTER TABLE tbl MODIFY COLUMN col ... CHARACTER SET utf8mb4;(Be sure to keep the other specifications the same - VARCHAR, NOT NULL, etc.)
col VARCHAR(111) CHARACTER SET latin1 NOT NULLThen to convert the column without changing the bytes:
ALTER TABLE tbl MODIFY COLUMN col VARBINARY(111) NOT NULL; ALTER TABLE tbl MODIFY COLUMN col VARCHAR(111) CHARACTER SET utf8mb4 NOT NULL;Note: If you start with TEXT, use BLOB as the intermediate definition. (This is the "2-step ALTER", as discussed elsewhere.) (Be sure to keep the other specifications the same - VARCHAR, NOT NULL, etc.)
UPDATE tbl SET col = CONVERT(BINARY(CONVERT(col USING latin1)) USING utf8mb4);Another ref
Desired string: სახლი არის Hex in table: C3A1C692C2A1C3A1C692C290C3A1C692C2AEC3A1C692C5A1C3A1C692CB9C20C3A1C692C290C3A1C692C2A0C3A1C692CB9CC3A1C692C2A1 Correct utf8 hex: E183A1E18390E183AEE1839AE1839820E18390E183A0E18398E183A1
C3A1C692C2A1 -- should be E183A1 -- Georgian SAN (U+10E1) "ს" C3A1C692C290 C3A1C692C2AE C3A1C692C5A1 C3A1C692CB9C 20 -- space (same in utf8 and latin1) C3A1C692C290 C3A1C692C2A0 C3A1C692CB9C C3A1C692C2A1Georgian, when properly encoded in utf8, should be 3 bytes per character:
E182xx E183xx E2B4xxTaking the first "character" (C3A1C692C2A1, in hex):
SELECT HEX(CONVERT(CONVERT(UNHEX( 'C3A1C692C2A1C3A1C692C290C3A1C692C2AEC3A1C692C5A1C3A1C692CB9C20C3A1C692C290C3A1C692C2A0C3A1C692CB9CC3A1C692C2A1' ) USING utf8) USING latin1)) ⇒ E183A1E18390E183AEE1839AE1839820E18390E183A0E18398E183A1Now to display as utf8:
SELECT UNHEX('E183A1E18390E183AEE1839AE1839820E18390E183A0E18398E183A1') ⇒ სახლი არის
SELECT HEX(CONVERT(CONVERT(UNHEX( 'C398C2AF') USING utf8) USING latin1)); ⇒ D8AF, which is the utf8 for Arabic DAL. SELECT CONVERT(CONVERT(UNHEX( 'C398C2AF') USING utf8) USING latin1); ⇒ that character.(Ref: forum thread #565669.)
SELECT hex('☼STICKY☼'); --> C3A2CB9CC2BC535449434B59C3A2CB9CC2BC SELECT unhex(hex(convert(convert(unhex('C3A2CB9CC2BC535449434B59C3A2CB9CC2BC') using utf8) using latin1))) --> ☼STICKY☼ |
ALTER TABLE Tbl MODIFY COLUMN col VARBINARY(...) ...; ALTER TABLE Tbl MODIFY COLUMN col VARCHAR(...) ... CHARACTER SET utf8 ...;where the lengths are big enough and the other "..." have whatever else (NOT NULL, etc) was already on the column.
# Set up bad case: mysql> create table latin1 (c varchar(11) charset latin1); mysql> insert into latin1 (c) values ('x'), (unhex('c2a3')); mysql> select c, hex(c), length(c), char_length(c) from latin1; +------+--------+-----------+----------------+ | c | hex(c) | length(c) | char_length(c) | +------+--------+-----------+----------------+ | x | 78 | 1 | 1 | | A£ | C2A3 | 2 | 2 | +------+--------+-----------+----------------+ # Step 1 of 2 of fix: mysql> alter table latin1 modify c tinyblob; mysql> select c, hex(c), length(c), char_length(c) from latin1; +------+--------+-----------+----------------+ | c | hex(c) | length(c) | char_length(c) | +------+--------+-----------+----------------+ | x | 78 | 1 | 1 | | A£ | C2A3 | 2 | 2 | +------+--------+-----------+----------------+ # Step 2 of 2 of fix: mysql> alter table latin1 modify c varchar(11) charset utf8; mysql> select c, hex(c), length(c), char_length(c) from latin1; +------+--------+-----------+----------------+ | c | hex(c) | length(c) | char_length(c) | +------+--------+-----------+----------------+ | x | 78 | 1 | 1 | | £ | C2A3 | 2 | 1 | +------+--------+-----------+----------------+
-- First, let's establish cp850 everywhere: -- In Windows, do "chcp 850" before entering mysql. -- Then get the mysql commandline to agree: SET NAMES cp850; -- Now CodePage and the following match SHOW VARIABLES LIKE 'char%'; -- note the 3 cp850 entries
-- Build sample table CREATE TABLE f616834A ( str VARCHAR(111) CHARACTER SET latin1 ); SHOW CREATE TABLE f616834A\G -- Populate with accented E's and I's: SELECT @str := HEX('ÈÉÊËÌÍÎÏ'); -- Expect D490D2D3DED6D7D8 (cp850 encoding) -- Don't expect C8C9CACBCCCDCECF (latin1) -- Don't expect C388C389C38AC38BC38CC38DC38EC38F (utf8) SELECT UNHEX('D490D2D3DED6D7D8'); -- expect 'ÈÉÊËÌÍÎÏ' INSERT INTO f616834A (str) VALUES (UNHEX('D490D2D3DED6D7D8')) ; SELECT str, HEX(str) FROM f616834A; -- Expect gibberish (Ô?ÒÓÞÖר) and D490D2D3DED6D7D8Notice how it looks wrong but it is encoded correctly. This is because the server/client, in doing the SELECT, is transcoding from latin1 (which it isn't) to cp850. (Some combination of "chcp latin1" and "SET NAMES latin1" would prevent the gibberish, but we don't want to go that direction since it would only mask the problem.)
ALTER TABLE f616834A MODIFY str VARBINARY(111); ALTER TABLE f616834A MODIFY str VARCHAR(111) CHARACTER SET cp850; SELECT str, HEX(str) FROM f616834A; -- Expect: ÈÉÊËÌÍÎÏ and D490D2D3DED6D7D8 (cp850)The CHARACTER SET changed, but the data was not.
SHOW VARIABLES LIKE 'char%'; -- cp850 and "chcp 850" must match at this point. -- Build a test table: CREATE TABLE f616834B ( str VARCHAR(111) CHARACTER SET latin1 ); SHOW CREATE TABLE f616834B\G -- Populate with accented E's and I's: SELECT @str := HEX('ÈÉÊËÌÍÎÏ'); -- Expect D490D2D3DED6D7D8 (cp850 encoding) SELECT UNHEX('D490D2D3DED6D7D8'); -- Expect 'ÈÉÊËÌÍÎÏ' INSERT INTO f616834B (str) VALUES (UNHEX('D490D2D3DED6D7D8')) ; SELECT str, HEX(str) FROM f616834B; -- Expect gibberish (Ô?ÒÓÞÖר) and D490D2D3DED6D7D8 UPDATE f616834B SET str = CONVERT(BINARY(str) USING cp850); SELECT str, HEX(str) FROM f616834B; -- Expect ÈÉÊËÌÍÎÏ and C8C9CACBCCCDCECF (latin1)(A side note: CONVERT(UNHEX(HEX(str)) USING cp850) would also work.)
UPDATE Tbl SET col = IFNULL(CONVERT(CONVERT(CONVERT(col USING latin1) USING binary) USING utf8), col )Caution: I have only briefly tested this.
mysql> # Create sample: mysql> CREATE TABLE dbl (c VARCHAR(11) CHARSET latin1); mysql> INSERT INTO dbl (c) VALUES ('x'), (UNHEX('C382C2A4')); mysql> # Diagnose: mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl; +------+----------+-----------+----------------+ | c | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) | +------+----------+-----------+----------------+ | x | 78 | 1 | 1 | | A,A☼ | C382C2A4 | 4 | 4 | +------+----------+-----------+----------------+ mysql> # Step 1 of 5: Note: Steps 1 and 2 are needed only if you start with latin1; If your column is already utf8, skip to step 3. mysql> ALTER TABLE dbl MODIFY c TINYBLOB; mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl; +------+----------+-----------+----------------+ | c | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) | +------+----------+-----------+----------------+ | x | 78 | 1 | 1 | | A,A☼ | C382C2A4 | 4 | 4 | +------+----------+-----------+----------------+ mysql> # Step 2 of 5: mysql> ALTER TABLE dbl MODIFY c VARCHAR(11) CHARSET utf8; mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl; +------+----------+-----------+----------------+ | c | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) | +------+----------+-----------+----------------+ | x | 78 | 1 | 1 | | A☼ | C382C2A4 | 4 | 2 | +------+----------+-----------+----------------+ mysql> # Step 3 of 5: mysql> ALTER TABLE dbl MODIFY c VARCHAR(11) CHARSET latin1; mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl; +------+--------+-----------+----------------+ | c | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) | +------+--------+-----------+----------------+ | x | 78 | 1 | 1 | | A☼ | C2A4 | 2 | 2 | +------+--------+-----------+----------------+ mysql> # Step 4 of 5: mysql> ALTER TABLE dbl MODIFY c TINYBLOB; mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl; +------+--------+-----------+----------------+ | c | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) | +------+--------+-----------+----------------+ | x | 78 | 1 | 1 | | A☼ | C2A4 | 2 | 2 | +------+--------+-----------+----------------+ mysql> # Step 5 of 5: mysql> ALTER TABLE dbl MODIFY c VARCHAR(11) CHARSET utf8; mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl; +------+--------+-----------+----------------+ | c | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) | +------+--------+-----------+----------------+ | x | 78 | 1 | 1 | | ☼ | C2A4 | 2 | 1 | +------+--------+-----------+----------------+ mysql> # Check schema: mysql> SHOW CREATE TABLE dbl \G CREATE TABLE `dbl` ( `c` varchar(11) DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=utf8
1. mysqldump --default-character-set=latin1 .... my_database > my_database_latin1.sql 2. Edit my_database_latin1.sql to set NAMES=utf8 at the top. 3. mysql ... < mydatabase.sqlmoThanks to Peter Barry
DELIMITER $$ CREATE FUNCTION maybe_utf8_decode(str text charset utf8) RETURNS text CHARSET utf8 DETERMINISTIC BEGIN declare str_converted text charset utf8; declare max_error_count int default @@max_error_count; set @@max_error_count = 0; set str_converted = convert(binary convert(str using latin1) using utf8); set @@max_error_count = max_error_count; if @@warning_count > 0 then return str; else return str_converted; end if; END$$ DELIMITER ;
UPDATE tbl SET col = maybe_utf8_decode(col);
mysql> # Create sample: mysql> CREATE TABLE trycvt (c VARCHAR(11) CHARSET latin1); mysql> INSERT INTO trycvt (c) VALUES ('x'), (UNHEX('BB')); mysql> SHOW CREATE TABLE trycvt \G CREATE TABLE `trycvt` ( `c` varchar(11) CHARACTER SET latin1 DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=utf8 mysql> # Diagnose: mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM trycvt; +------+--------+-----------+----------------+ | c | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) | +------+--------+-----------+----------------+ | x | 78 | 1 | 1 | | » | BB | 1 | 1 | +------+--------+-----------+----------------+ mysql> # Step 1 of 1 mysql> ALTER TABLE trycvt CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci; mysql> # Diagnose: mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM trycvt; +------+--------+-----------+----------------+ | c | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) | +------+--------+-----------+----------------+ | x | 78 | 1 | 1 | | » | C2BB | 2 | 1 | +------+--------+-----------+----------------+ mysql> # Note schema changed: mysql> SHOW CREATE TABLE trycvt \G CREATE TABLE `trycvt` ( `c` varchar(11) COLLATE utf8_unicode_ci DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
mysqldump -h old_host --no-data >schema.sql # now edit schema.sql to fix charset, collation, etc # edit in/out SET NAMES mysql -h new_host <schema.sql mysqldump -h old_host --no-create-info --opt >data.sql # edit in/out SET NAMES mysql -h new_host <data.sql
SELECT MAX(CHAR_LENGTH(col)) FROM tbl;
ALTER TABLE tbl MODIFY COLUMN col ... CHARACTER SET utf8mb4 ...;
ALTER TABLE tbl_name CONVERT TO utf8mb4;However, this may fail for index issues. If you have an indexed VARCHAR(255), that turns out to exceed the 767-byte limit. Change it to VARCHAR(191) if the column does not really need 255. If you really need more than 191, things get more complicated. StackExchange thread
open(filename, 'r', encoding='utf-8-bom')
danish: zz < Ä=Æ=ä=æ < Ö=Ø=ö=ø < Aa=Å=å < Þ=þ icelandic: zz < Þ=þ < Ä=Æ=ä=æ < Ö=Ø=ö=ø < Å=å swedish: zz < Å=å < Ä=Æ=ä=æ < Ö=Ø=ö=ø < Þ=þ(There may be more differences.) Programatically generated collation tester
unicode: 005E = (^) utf8: 5E = CIRCUMFLEX ACCENT = SPACING CIRCUMFLEX unicode: 02C6 = (ˆ) utf8: CB86 = MODIFIER LETTER CIRCUMFLEX ACCENT = MODIFIER LETTER CIRCUMFLEX unicode: 0302 = utf8: CC82 = COMBINING CIRCUMFLEX ACCENT = NON-SPACING CIRCUMFLEX unicode: 032D = utf8: CCAD = COMBINING CIRCUMFLEX ACCENT BELOW = NON-SPACING CIRCUMFLEX BELOW unicode: 1DCD = utf8: E1B78D = COMBINING DOUBLE CIRCUMFLEX ABOVE unicode: A788 = (ꞈ) utf8: EA9E88 = MODIFIER LETTER LOW CIRCUMFLEX ACCENT unicode: FF3E = (^) utf8: EFBCBE = FULLWIDTH CIRCUMFLEX ACCENT = FULLWIDTH SPACING CIRCUMFLEX unicode: E005E = utf8: F3A0819E = TAG CIRCUMFLEX ACCENTThis debates what latin1's 0x88 () maps to: forum discussion
+----------+---------+------+----------------------------------------------------+ | utf8mb4 | Unicode | c | Name | +----------+---------+------+----------------------------------------------------+ | 22 | 34 | " | QUOTATION MARK | | C2AB | 171 | « | LEFT-POINTING DOUBLE ANGLE QUOTATION MARK | | C2BB | 187 | » | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK | | E28098 | 8216 | ‘ | LEFT SINGLE QUOTATION MARK | 91 | E28099 | 8217 | ’ | RIGHT SINGLE QUOTATION MARK | 92 | E2809A | 8218 | ‚ | SINGLE LOW-9 QUOTATION MARK | | E2809B | 8219 | ‛ | SINGLE HIGH-REVERSED-9 QUOTATION MARK | | E2809C | 8220 | “ | LEFT DOUBLE QUOTATION MARK | 93 | E2809D | 8221 | ” | RIGHT DOUBLE QUOTATION MARK | 94 | E2809E | 8222 | „ | DOUBLE LOW-9 QUOTATION MARK | | E2809F | 8223 | ‟ | DOUBLE HIGH-REVERSED-9 QUOTATION MARK | | E280B9 | 8249 | ‹ | SINGLE LEFT-POINTING ANGLE QUOTATION MARK | | E280BA | 8250 | › | SINGLE RIGHT-POINTING ANGLE QUOTATION MARK | | E28D98 | 9048 | ⍘ | APL FUNCTIONAL SYMBOL QUOTE UNDERBAR | | E28D9E | 9054 | ⍞ | APL FUNCTIONAL SYMBOL QUOTE QUAD | | E29D9B | 10075 | ❛ | HEAVY SINGLE TURNED COMMA QUOTATION MARK ORNAMENT | | E29D9C | 10076 | ❜ | HEAVY SINGLE COMMA QUOTATION MARK ORNAMENT | | E29D9D | 10077 | ❝ | HEAVY DOUBLE TURNED COMMA QUOTATION MARK ORNAMENT | | E29D9E | 10078 | ❞ | HEAVY DOUBLE COMMA QUOTATION MARK ORNAMENT | | E29DAE | 10094 | ❮ | HEAVY LEFT-POINTING ANGLE QUOTATION MARK ORNAMENT | | E29DAF | 10095 | ❯ | HEAVY RIGHT-POINTING ANGLE QUOTATION MARK ORNAMENT | | E3809D | 12317 | 〝 | REVERSED DOUBLE PRIME QUOTATION MARK | | E3809E | 12318 | 〞 | DOUBLE PRIME QUOTATION MARK | | E3809F | 12319 | 〟 | LOW DOUBLE PRIME QUOTATION MARK | | EA9084 | 41988 | ꐄ | YI SYLLABLE QUOT | | EFBC82 | 65282 | " | FULLWIDTH QUOTATION MARK | | F3A080A2 | 917538 | ó ¢ | TAG QUOTATION MARK | +----------+---------+------+----------------------------------------------------+⚈ 91..94 are the hex for latin1 if coming from Microsoft Word.
+----------+----------------------------------------------------+---------+ | hex_utf8 | name | Unicode | +----------+----------------------------------------------------+---------+ | CC80 | COMBINING GRAVE ACCENT | U+0300 | | CC81 | COMBINING ACUTE ACCENT | U+0301 | | CC82 | COMBINING CIRCUMFLEX ACCENT | U+0302 | | CC83 | COMBINING TILDE | U+0303 | | CC84 | COMBINING MACRON | U+0304 | | CC85 | COMBINING OVERLINE | U+0305 | | CC86 | COMBINING BREVE | U+0306 | | CC87 | COMBINING DOT ABOVE | U+0307 | | CC88 | COMBINING DIAERESIS | U+0308 | | CC89 | COMBINING HOOK ABOVE | U+0309 | | CC8A | COMBINING RING ABOVE | U+030A | | CC8B | COMBINING DOUBLE ACUTE ACCENT | U+030B | | CC8C | COMBINING CARON | U+030C | | CC8D | COMBINING VERTICAL LINE ABOVE | U+030D | | CC8E | COMBINING DOUBLE VERTICAL LINE ABOVE | U+030E | | CC8F | COMBINING DOUBLE GRAVE ACCENT | U+030F | | CC90 | COMBINING CANDRABINDU | U+0310 | | CC91 | COMBINING INVERTED BREVE | U+0311 | | CC92 | COMBINING TURNED COMMA ABOVE | U+0312 | | CC93 | COMBINING COMMA ABOVE | U+0313 | | CC94 | COMBINING REVERSED COMMA ABOVE | U+0314 | | CC95 | COMBINING COMMA ABOVE RIGHT | U+0315 | | CC96 | COMBINING GRAVE ACCENT BELOW | U+0316 | | CC97 | COMBINING ACUTE ACCENT BELOW | U+0317 | | CC98 | COMBINING LEFT TACK BELOW | U+0318 | | CC99 | COMBINING RIGHT TACK BELOW | U+0319 | | CC9A | COMBINING LEFT ANGLE ABOVE | U+031A | | CC9B | COMBINING HORN | U+031B | | CC9C | COMBINING LEFT HALF RING BELOW | U+031C | | CC9D | COMBINING UP TACK BELOW | U+031D | | CC9E | COMBINING DOWN TACK BELOW | U+031E | | CC9F | COMBINING PLUS SIGN BELOW | U+031F | | CCA0 | COMBINING MINUS SIGN BELOW | U+0320 | | CCA1 | COMBINING PALATALIZED HOOK BELOW | U+0321 | | CCA2 | COMBINING RETROFLEX HOOK BELOW | U+0322 | | CCA3 | COMBINING DOT BELOW | U+0323 | | CCA4 | COMBINING DIAERESIS BELOW | U+0324 | | CCA5 | COMBINING RING BELOW | U+0325 | | CCA6 | COMBINING COMMA BELOW | U+0326 | | CCA7 | COMBINING CEDILLA | U+0327 | | CCA8 | COMBINING OGONEK | U+0328 | | CCA9 | COMBINING VERTICAL LINE BELOW | U+0329 | | CCAA | COMBINING BRIDGE BELOW | U+032A | | CCAB | COMBINING INVERTED DOUBLE ARCH BELOW | U+032B | | CCAC | COMBINING CARON BELOW | U+032C | | CCAD | COMBINING CIRCUMFLEX ACCENT BELOW | U+032D | | CCAE | COMBINING BREVE BELOW | U+032E | | CCAF | COMBINING INVERTED BREVE BELOW | U+032F | | CCB0 | COMBINING TILDE BELOW | U+0330 | | CCB1 | COMBINING MACRON BELOW | U+0331 | | CCB2 | COMBINING LOW LINE | U+0332 | | CCB3 | COMBINING DOUBLE LOW LINE | U+0333 | | CCB4 | COMBINING TILDE OVERLAY | U+0334 | | CCB5 | COMBINING SHORT STROKE OVERLAY | U+0335 | | CCB6 | COMBINING LONG STROKE OVERLAY | U+0336 | | CCB7 | COMBINING SHORT SOLIDUS OVERLAY | U+0337 | | CCB8 | COMBINING LONG SOLIDUS OVERLAY | U+0338 | | CCB9 | COMBINING RIGHT HALF RING BELOW | U+0339 | | CCBA | COMBINING INVERTED BRIDGE BELOW | U+033A | | CCBB | COMBINING SQUARE BELOW | U+033B | | CCBC | COMBINING SEAGULL BELOW | U+033C | | CCBD | COMBINING X ABOVE | U+033D | | CCBE | COMBINING VERTICAL TILDE | U+033E | | CCBF | COMBINING DOUBLE OVERLINE | U+033F | | CD80 | COMBINING GRAVE TONE MARK | U+0340 | | CD81 | COMBINING ACUTE TONE MARK | U+0341 | | CD84 | COMBINING GREEK DIALYTIKA TONOS | U+0344 | | CD85 | COMBINING GREEK YPOGEGRAMMENI | U+0345 | | D283 | COMBINING CYRILLIC TITLO | U+0483 | | D284 | COMBINING CYRILLIC PALATALIZATION | U+0484 | | D285 | COMBINING CYRILLIC DASIA PNEUMATA | U+0485 | | D286 | COMBINING CYRILLIC PSILI PNEUMATA | U+0486 | | E28390 | COMBINING LEFT HARPOON ABOVE | U+20D0 | | E28391 | COMBINING RIGHT HARPOON ABOVE | U+20D1 | | E28392 | COMBINING LONG VERTICAL LINE OVERLAY | U+20D2 | | E28393 | COMBINING SHORT VERTICAL LINE OVERLAY | U+20D3 | | E28394 | COMBINING ANTICLOCKWISE ARROW ABOVE | U+20D4 | | E28395 | COMBINING CLOCKWISE ARROW ABOVE | U+20D5 | | E28396 | COMBINING LEFT ARROW ABOVE | U+20D6 | | E28397 | COMBINING RIGHT ARROW ABOVE | U+20D7 | | E28398 | COMBINING RING OVERLAY | U+20D8 | | E28399 | COMBINING CLOCKWISE RING OVERLAY | U+20D9 | | E2839A | COMBINING ANTICLOCKWISE RING OVERLAY | U+20DA | | E2839B | COMBINING THREE DOTS ABOVE | U+20DB | | E2839C | COMBINING FOUR DOTS ABOVE | U+20DC | | E283A1 | COMBINING LEFT RIGHT ARROW ABOVE | U+20E1 | | E38299 | COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK | U+3099 | | E3829A | COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK | U+309A | +----------+----------------------------------------------------+---------+ 88 rows in set (0.01 sec)
SHOW CHARSET; -- only latin1 and utf8/utf8mb4 are likely to be interesting SHOW COLLATION; -- list the possible values. SHOW COLLATION LIKE 'utf8%'; SHOW COLLATION LIKE '%_cs'; SHOW SESSION VARIABLES LIKE 'character_set%'; SHOW GLOBAL VARIABLES LIKE 'character_set%'; SHOW CREATE TABLE tbl \G
[client] default-character-set = cp1252 [mysqld] skip-character-set-client-handshake # (deprecated in mysqld) default-character-set = cp1252 character-set-server = cp1252 collation-server = latin1_general_cs init-connect = SET NAMES cp1252 # (init-connect is _not_ seen by SUPER users, such as "root".) [mysqldump] default-character-set = cp1252 [mysql] default-character-set = cp1252Caution: Setting default-character-set that way for the mysql client cannot be overridden by SET NAMES.
$t = json_encode($s, JSON_UNESCAPED_UNICODE);
drop table charutf8; -- Create and populate a table with a "utf8_bin" column: CREATE TABLE `charutf8` ( `a` varchar(9) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL, PRIMARY KEY (`a`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8; INSERT IGNORE INTO charutf8 (a) VALUES ('abc'), ('def'), ('ABC'), ('ábc'), ('́abc'), ('Abç'), ('ss'), ('ß'); -- Dump of characters, and lengths: SELECT a, CHAR_LENGTH(a), LENGTH(a), HEX(a) FROM charutf8 ORDER BY a; -- CC81 is "U+0301: COMBINING ACUTE ACCENT" which, -- together with 'a' is equal to 'á' in utf8_unicode_ci
-- Show that the ordering varies with collation: SELECT GROUP_CONCAT(a ORDER BY a) AS 'bin order' FROM charutf8; SELECT GROUP_CONCAT(a ORDER BY a COLLATE utf8_general_ci) AS 'general_ci order' FROM charutf8; SELECT GROUP_CONCAT(a ORDER BY a COLLATE utf8_unicode_ci) AS 'unicode_ci order' FROM charutf8; -- Self-JOIN to see what values are the same; collation-dependent: SELECT CHAR_LENGTH(x.a) AS clen, LENGTH(x.a), x.a, GROUP_CONCAT(y.a) AS 'Equal with utf8_bin' FROM charutf8 x JOIN charutf8 y ON x.a = y.a GROUP BY x.a; SELECT x.a, GROUP_CONCAT(y.a) AS 'Equal with utf8_general_ci' FROM charutf8 x JOIN charutf8 y ON x.a = y.a COLLATE utf8_general_ci GROUP BY x.a; SELECT x.a, GROUP_CONCAT(y.a) AS 'Equal with utf8_unicode_ci' FROM charutf8 x JOIN charutf8 y ON x.a = y.a COLLATE utf8_unicode_ci GROUP BY x.a;
-- Dump of characters, and lengths: +-------+----------------+-----------+------------+ | a | CHAR_LENGTH(a) | LENGTH(a) | HEX(a) | +-------+----------------+-----------+------------+ | ABC | 3 | 3 | 414243 | | Abç | 3 | 4 | 4162C3A7 | | abc | 3 | 3 | 616263 | | def | 3 | 3 | 646566 | | ss | 2 | 2 | 7373 | | ß | 1 | 2 | C39F | | ábc | 3 | 4 | C3A16263 | | ́abc | 4 | 5 | CC81616263 | -- the misalignment is due to the output not handling "combining" chars correctly +-------+----------------+-----------+------------+ 8 rows in set (0.00 sec) -- CC81 is "U+0301: COMBINING ACUTE ACCENT" which, -- together with 'a' is equal to 'á' in utf8_unicode_ci -- Show that the ordering varies with collation: +-----------------------------------+ | bin order | +-----------------------------------+ | ABC,Abç,abc,def,ss,ß,ábc,́abc | -- Notice how accents follow all letters +-----------------------------------+ mysql> SELECT GROUP_CONCAT(a ORDER BY a COLLATE utf8_general_ci) AS 'general_ci order' FROM charutf8; +-----------------------------------+ | general_ci order | +-----------------------------------+ | ábc,abc,Abç,ABC,def,ß,ss,́abc | -- Regular accents mixed in; Combining guy is not' +-----------------------------------+ +-----------------------------------+ | unicode_ci order | +-----------------------------------+ | ́abc,ábc,abc,Abç,ABC,def,ß,ss | -- Combining accent sorted with its mates +-----------------------------------+ -- Self-JOIN to see what values are the same; collation-dependent: +-------+---------------------+ | a | Equal with utf8_bin | +-------+---------------------+ | ABC | ABC | -- exact match for case and accents | Abç | Abç | | abc | abc | | def | def | | ss | ss | | ß | ß | -- ss and ß are NOT equal | ábc | ábc | | ́abc | ́abc | +-------+---------------------+ +-------+----------------------------+ | a | Equal with utf8_general_ci | +-------+----------------------------+ | ABC | ábc,ABC,Abç,abc | -- case folding and accent stripping | Abç | ábc,ABC,Abç,abc | | abc | ábc,ABC,Abç,abc | | def | def | | ss | ss | -- ss and ß are NOT equal | ß | ß | | ábc | ABC,Abç,abc,ábc | | ́abc | ́abc | -- 'combining' NOT equal +-------+----------------------------+ +-------+----------------------------+ | a | Equal with utf8_unicode_ci | +-------+----------------------------+ | ABC | ábc,ABC,́abc,Abç,abc | -- case folding and accent stripping | Abç | abc,ábc,ABC,́abc,Abç | | abc | Abç,abc,ábc,ABC,́abc | | def | def | | ss | ss,ß | -- ss and ß are EQUAL | ß | ß,ss | | ábc | ábc,ABC,́abc,Abç,abc | | ́abc | abc,ábc,ABC,́abc,Abç | -- 'combining' EQUAL +-------+----------------------------+
city VARCHAR(66) COLLATE utf8_binand you need to find "San José" when you do
WHERE city = "San Jose"This applies a different collation for the comparision:
WHERE city COLLATE utf8_general_ci = 'San Jose'Caveat: An INDEX(city) cannot be used when you override the collation, so the latter WHERE will be slower.
mysql> set names latin1; Query OK, 0 rows affected (0.00 sec) mysql> select * from User where username = 'rené' collate utf8_general_ci; ERROR 1253 (42000): COLLATION 'utf8_general_ci' is not valid for CHARACTER SET 'latin1' mysql> set names utf8; Query OK, 0 rows affected (0.00 sec) mysql> select * from User where username = 'rené' collate utf8_general_ci; Empty set (0.00 sec) Altenatively, use can explicitly set the character set using a 'character set introducer': mysql> set names latin1; Query OK, 0 rows affected (0.00 sec) mysql> select * from User where username = _utf8'rené' collate utf8_general_ci; Empty set (0.00 sec)
utf8 : utf8_general_ci A=a=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą Aa ae az utf8 : utf8_general_mysql500_ci A=a=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą Aa ae az utf8 : utf8_danish_ci A=a=ª=À=Á=Â=Ã=à=á=â=ã=Ā=ā=Ą=ą ae az utf8 : utf8_swedish_ci A=a=ª=À=Á=Â=Ã=à=á=â=ã=Ā=ā=Ą=ą Aa ae az utf8 : utf8_estonian_ci A=a=ª=À=Á=Â=Ã=Å=à=á=â=ã=å=Ā=ā=Ą=ą Aa ae az Æ=æ utf8 : utf8_lithuanian_ci A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą Aa ae az Æ=æ utf8 : utf8_latvian_ci A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą Aa ae az Æ=æ utf8 : utf8_unicode_ci A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą Aa ae az Æ=æ utf8 : utf8_polish_ci A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā Aa ae az Ą=ą Æ=æ utf8 : utf8_spanish_ci A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą Aa ae az Æ=æ utf8 : utf8_spanish2_ci A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą Aa ae az Æ=æ utf8 : utf8_turkish_ci A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą Aa ae az Æ=æ utf8 : utf8_general_ci B=b C=c=Ç=ç=Č=č ch cz D=d dz utf8 : utf8_general_mysql500_ci B=b C=c=Ç=ç=Č=č ch cz D=d dz utf8 : utf8_danish_ci B=b C=c=Ç=ç=Č=č ch cz D=d dz Ð=ð utf8 : utf8_swedish_ci B=b C=c=Ç=ç=Č=č ch cz D=d dz Ð=ð utf8 : utf8_estonian_ci B=b C=c=Ç=ç=Č=č ch cz D=d dz Ð=ð utf8 : utf8_lithuanian_ci B=b C=c=ch=Ç=ç cz Č=č D=d dz Ð=ð utf8 : utf8_latvian_ci B=b C=c=Ç=ç ch cz Č=č D=d dz Ð=ð utf8 : utf8_unicode_ci B=b C=c=Ç=ç=Č=č ch cz D=d dz Ð=ð utf8 : utf8_polish_ci B=b C=c=Ç=ç=Č=č ch cz D=d dz Ð=ð utf8 : utf8_spanish_ci B=b C=c=Ç=ç=Č=č ch cz D=d dz Ð=ð utf8 : utf8_spanish2_ci B=b C=c=Ç=ç=Č=č cz ch D=d dz Ð=ð utf8 : utf8_turkish_ci B=b C=c=Č=č ch cz Ç=ç D=d dz Ð=ð utf8 : utf8_general_ci E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě utf8 : utf8_general_mysql500_ci E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě utf8 : utf8_danish_ci E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě utf8 : utf8_swedish_ci E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě utf8 : utf8_estonian_ci E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě utf8 : utf8_lithuanian_ci E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě utf8 : utf8_latvian_ci E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě utf8 : utf8_unicode_ci E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě utf8 : utf8_polish_ci E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ě=ě Ę=ę utf8 : utf8_spanish_ci E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě utf8 : utf8_spanish2_ci E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě utf8 : utf8_turkish_ci E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě utf8 : utf8_general_ci F=f fz G=g=ğ=Ģ=ģ H=h hz utf8 : utf8_general_mysql500_ci F=f fz G=g=ğ=Ģ=ģ H=h hz utf8 : utf8_danish_ci F=f fz ƒ G=g=ğ=Ģ=ģ H=h hz utf8 : utf8_swedish_ci F=f fz ƒ G=g=ğ=Ģ=ģ H=h hz utf8 : utf8_estonian_ci F=f fz ƒ G=g=ğ=Ģ=ģ H=h hz utf8 : utf8_lithuanian_ci F=f fz ƒ G=g=ğ=Ģ=ģ H=h hz utf8 : utf8_latvian_ci F=f fz ƒ G=g=ğ Ģ=ģ H=h hz utf8 : utf8_unicode_ci F=f fz ƒ G=g=ğ=Ģ=ģ H=h hz utf8 : utf8_polish_ci F=f fz ƒ G=g=ğ=Ģ=ģ H=h hz utf8 : utf8_spanish_ci F=f fz ƒ G=g=ğ=Ģ=ģ H=h hz utf8 : utf8_spanish2_ci F=f fz ƒ G=g=ğ=Ģ=ģ H=h hz utf8 : utf8_turkish_ci F=f fz ƒ G=g=Ģ=ģ H=h hz utf8 : utf8_general_ci I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ=ı ij iz utf8 : utf8_general_mysql500_ci I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ=ı ij iz utf8 : utf8_danish_ci I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ ij=ij iz utf8 : utf8_swedish_ci I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ ij=ij iz utf8 : utf8_estonian_ci I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ ij=ij iz utf8 : utf8_lithuanian_ci I=Y=i=y=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ ij=ij iz=yz utf8 : utf8_latvian_ci I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ ij=ij iz Y=y yz utf8 : utf8_unicode_ci I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ ij=ij iz utf8 : utf8_polish_ci I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ ij=ij iz utf8 : utf8_spanish_ci I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ ij=ij iz utf8 : utf8_spanish2_ci I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ ij=ij iz utf8 : utf8_turkish_ci I=ı i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ ij=ij iz utf8 : utf8_general_ci J=j K=k=Ķ=ķ L=l=Ļ=ļ lj ll lz M=m N=n=Ñ=ñ=Ņ=ņ nz utf8 : utf8_general_mysql500_ci J=j K=k=Ķ=ķ L=l=Ļ=ļ lj ll lz M=m N=n=Ñ=ñ=Ņ=ņ nz utf8 : utf8_danish_ci J=j K=k=Ķ=ķ L=l=Ļ=ļ lj=LJ=Lj=lj ll lz Ł=ł M=m N=n=Ñ=ñ=Ņ=ņ nz utf8 : utf8_swedish_ci J=j K=k=Ķ=ķ L=l=Ļ=ļ lj=LJ=Lj=lj ll lz Ł=ł M=m N=n=Ñ=ñ=Ņ=ņ nz utf8 : utf8_estonian_ci J=j K=k=Ķ=ķ L=l=Ļ=ļ lj=LJ=Lj=lj ll lz Ł=ł M=m N=n=Ñ=ñ=Ņ=ņ nz utf8 : utf8_lithuanian_ci J=j K=k=Ķ=ķ L=l=Ļ=ļ lj=LJ=Lj=lj ll lz Ł=ł M=m N=n=Ñ=ñ=Ņ=ņ nz utf8 : utf8_latvian_ci J=j K=k Ķ=ķ L=l lj=LJ=Lj=lj ll lz Ļ=ļ Ł=ł M=m N=n=Ñ=ñ nz Ņ=ņ utf8 : utf8_unicode_ci J=j K=k=Ķ=ķ L=l=Ļ=ļ lj=LJ=Lj=lj ll lz Ł=ł M=m N=n=Ñ=ñ=Ņ=ņ nz utf8 : utf8_polish_ci J=j K=k=Ķ=ķ L=l=Ļ=ļ lj=LJ=Lj=lj ll lz Ł=ł M=m N=n=Ñ=ñ=Ņ=ņ nz utf8 : utf8_spanish_ci J=j K=k=Ķ=ķ L=l=Ļ=ļ lj=LJ=Lj=lj ll lz Ł=ł M=m N=n=Ņ=ņ nz Ñ=ñ utf8 : utf8_spanish2_ci J=j K=k=Ķ=ķ L=l=Ļ=ļ lj=LJ=Lj=lj lz ll Ł=ł M=m N=n=Ņ=ņ nz Ñ=ñ utf8 : utf8_turkish_ci J=j K=k=Ķ=ķ L=l=Ļ=ļ lj=LJ=Lj=lj ll lz Ł=ł M=m N=n=Ñ=ñ=Ņ=ņ nz utf8 : utf8_general_ci O=o=Ò=Ó=Ô=Õ=Ö=ò=ó=ô=õ=ö oe oz utf8 : utf8_general_mysql500_ci O=o=Ò=Ó=Ô=Õ=Ö=ò=ó=ô=õ=ö oe oz utf8 : utf8_danish_ci O=o=º=Ò=Ó=Ô=Õ=ò=ó=ô=õ oe=Œ=œ oz utf8 : utf8_swedish_ci O=o=º=Ò=Ó=Ô=Õ=ò=ó=ô=õ oe=Œ=œ oz utf8 : utf8_estonian_ci O=o=º=Ò=Ó=Ô=ò=ó=ô oe=Œ=œ oz Ø=ø utf8 : utf8_lithuanian_ci O=o=º=Ò=Ó=Ô=Õ=Ö=ò=ó=ô=õ=ö oe=Œ=œ oz Ø=ø utf8 : utf8_latvian_ci O=o=º=Ò=Ó=Ô=Õ=Ö=ò=ó=ô=õ=ö oe=Œ=œ oz Ø=ø utf8 : utf8_unicode_ci O=o=º=Ò=Ó=Ô=Õ=Ö=ò=ó=ô=õ=ö oe=Œ=œ oz Ø=ø utf8 : utf8_polish_ci O=o=º=Ò=Ô=Õ=Ö=ò=ô=õ=ö oe=Œ=œ oz Ó=ó Ø=ø utf8 : utf8_spanish_ci O=o=º=Ò=Ó=Ô=Õ=Ö=ò=ó=ô=õ=ö oe=Œ=œ oz Ø=ø utf8 : utf8_spanish2_ci O=o=º=Ò=Ó=Ô=Õ=Ö=ò=ó=ô=õ=ö oe=Œ=œ oz Ø=ø utf8 : utf8_turkish_ci O=o=º=Ò=Ó=Ô=Õ=ò=ó=ô=õ oe=Œ=œ oz Ö=ö Ø=ø utf8 : utf8_general_ci P=p Q=q R=r S=s=ß=ş=Š=Š=š=š sh ss sz utf8 : utf8_general_mysql500_ci P=p Q=q R=r S=s=ş=Š=Š=š=š sh ss sz utf8 : utf8_danish_ci P=p Q=q R=r S=s=ş=Š=Š=š=š sh ss=ß sz utf8 : utf8_swedish_ci P=p Q=q R=r S=s=ş=Š=Š=š=š sh ss=ß sz utf8 : utf8_estonian_ci P=p Q=q R=r S=s=ş sh ss=ß sz Š=Š=š=š Z=z zh zz Ž=Ž=ž=ž utf8 : utf8_lithuanian_ci P=p Q=q R=r S=s=ş sh ss=ß sz Š=Š=š=š utf8 : utf8_latvian_ci P=p Q=q R=r S=s=ş sh ss=ß sz Š=Š=š=š utf8 : utf8_unicode_ci P=p Q=q R=r S=s=ş=Š=Š=š=š sh ss=ß sz utf8 : utf8_polish_ci P=p Q=q R=r S=s=ş=Š=Š=š=š sh ss=ß sz utf8 : utf8_spanish_ci P=p Q=q R=r S=s=ş=Š=Š=š=š sh ss=ß sz utf8 : utf8_spanish2_ci P=p Q=q R=r S=s=ş=Š=Š=š=š sh ss=ß sz utf8 : utf8_turkish_ci P=p Q=q R=r S=s=Š=Š=š=š sh ss=ß sz utf8 : utf8_general_ci T=t TM=tm tz U=u=Ù=Ú=Û=Ü=ù=ú=û=ü=Ū=ū=Ų=ų ue uz utf8 : utf8_general_mysql500_ci T=t TM=tm tz U=u=Ù=Ú=Û=Ü=ù=ú=û=ü=Ū=ū=Ų=ų ue uz utf8 : utf8_danish_ci T=t TM=tm=™ tz U=u=Ù=Ú=Û=ù=ú=û=Ū=ū=Ų=ų ue uz utf8 : utf8_swedish_ci T=t TM=tm=™ tz U=u=Ù=Ú=Û=ù=ú=û=Ū=ū=Ų=ų ue uz utf8 : utf8_estonian_ci T=t TM=tm=™ tz U=u=Ù=Ú=Û=ù=ú=û=Ū=ū=Ų=ų ue uz utf8 : utf8_lithuanian_ci T=t TM=tm=™ tz U=u=Ù=Ú=Û=Ü=ù=ú=û=ü=Ū=ū=Ų=ų ue uz utf8 : utf8_latvian_ci T=t TM=tm=™ tz U=u=Ù=Ú=Û=Ü=ù=ú=û=ü=Ū=ū=Ų=ų ue uz utf8 : utf8_unicode_ci T=t TM=tm=™ tz U=u=Ù=Ú=Û=Ü=ù=ú=û=ü=Ū=ū=Ų=ų ue uz utf8 : utf8_polish_ci T=t TM=tm=™ tz U=u=Ù=Ú=Û=Ü=ù=ú=û=ü=Ū=ū=Ų=ų ue uz utf8 : utf8_spanish_ci T=t TM=tm=™ tz U=u=Ù=Ú=Û=Ü=ù=ú=û=ü=Ū=ū=Ų=ų ue uz utf8 : utf8_spanish2_ci T=t TM=tm=™ tz U=u=Ù=Ú=Û=Ü=ù=ú=û=ü=Ū=ū=Ų=ų ue uz utf8 : utf8_turkish_ci T=t TM=tm=™ tz U=u=Ù=Ú=Û=ù=ú=û=Ū=ū=Ų=ų ue uz Ü=ü utf8 : utf8_general_ci V=v W=w X=x Y=y=Ý=ý=ÿ=Ÿ yz utf8 : utf8_general_mysql500_ci V=v W=w X=x Y=y=Ý=ý=ÿ=Ÿ yz utf8 : utf8_danish_ci V=v W=w X=x Y=y=Ü=Ý=ü=ý=ÿ=Ÿ yz utf8 : utf8_swedish_ci V=v W=w X=x Y=y=Ü=Ý=ü=ý=ÿ=Ÿ yz utf8 : utf8_estonian_ci V=v W=w Õ=õ Ä=ä Ö=ö Ü=ü X=x Y=y=Ý=ý=ÿ=Ÿ yz Þ=þ utf8 : utf8_lithuanian_ci V=v W=w X=x Ý=ý=ÿ=Ÿ utf8 : utf8_latvian_ci V=v W=w X=x Ý=ý=ÿ=Ÿ utf8 : utf8_unicode_ci V=v W=w X=x Y=y=Ý=ý=ÿ=Ÿ yz utf8 : utf8_polish_ci V=v W=w X=x Y=y=Ý=ý=ÿ=Ÿ yz utf8 : utf8_spanish_ci V=v W=w X=x Y=y=Ý=ý=ÿ=Ÿ yz utf8 : utf8_spanish2_ci V=v W=w X=x Y=y=Ý=ý=ÿ=Ÿ yz utf8 : utf8_turkish_ci V=v W=w X=x Y=y=Ý=ý=ÿ=Ÿ yz utf8 : utf8_general_ci Z=z=Ž=Ž=ž=ž zh zz Æ=æ Ð=ð Ø=ø Þ=þ Ł=ł Œ=œ ƒ LJ=Lj=lj utf8 : utf8_general_mysql500_ci Z=z=Ž=Ž=ž=ž zh zz Æ=æ Ð=ð Ø=ø Þ=þ ß Ł=ł Œ=œ ƒ LJ=Lj=lj utf8 : utf8_danish_ci Z=z=Ž=Ž=ž=ž zh zz Ä=Æ=ä=æ Ö=Ø=ö=ø Aa=Å=å Þ=þ utf8 : utf8_swedish_ci Z=z=Ž=Ž=ž=ž zh zz Å=å Ä=Æ=ä=æ Ö=Ø=ö=ø Þ=þ utf8 : utf8_estonian_ci utf8 : utf8_lithuanian_ci Z=z zh zz Ž=Ž=ž=ž Þ=þ utf8 : utf8_latvian_ci Z=z zh zz Ž=Ž=ž=ž Þ=þ utf8 : utf8_unicode_ci Z=z=Ž=Ž=ž=ž zh zz Þ=þ utf8 : utf8_polish_ci Z=z=Ž=Ž=ž=ž zh zz Þ=þ utf8 : utf8_spanish_ci Z=z=Ž=Ž=ž=ž zh zz Þ=þ utf8 : utf8_spanish2_ci Z=z=Ž=Ž=ž=ž zh zz Þ=þ utf8 : utf8_turkish_ci Z=z=Ž=Ž=ž=ž zh zz Þ=þutf8mb4_% mostly works the same as utf8.
128 80 E282AC € 129 81 C281  130 82 E2809A ‚ 131 83 C692 ƒ 132 84 E2809E „ 133 85 E280A6 … 134 86 E280A0 † 135 87 E280A1 ‡ 136 88 CB86 ˆ 137 89 E280B0 ‰ 138 8A C5A0 Š 139 8B E280B9 ‹ 140 8C C592 Œ 141 8D C28D  142 8E C5BD Ž 143 8F C28F  144 90 C290  145 91 E28098 ‘ 146 92 E28099 ’ 147 93 E2809C “ 148 94 E2809D ” 149 95 E280A2 • 150 96 E28093 – 151 97 E28094 — 152 98 CB9C ˜ 153 99 E284A2 ™ 154 9A C5A1 š 155 9B E280BA › 156 9C C593 œ 157 9D C29D  158 9E C5BE ž 159 9F C5B8 Ÿ |
160 A0 C2A0 161 A1 C2A1 ¡ ¡ 162 A2 C2A2 ¢ ¢ 163 A3 C2A3 £ £ 164 A4 C2A4 ¤ ¤ 165 A5 C2A5 ¥ ¥ 166 A6 C2A6 ¦ ¦ 167 A7 C2A7 § § 168 A8 C2A8 ¨ ¨ 169 A9 C2A9 © © 170 AA C2AA ª ª 171 AB C2AB « « 172 AC C2AC ¬ ¬ 173 AD C2AD ­ 174 AE C2AE ® ® 175 AF C2AF ¯ ¯ 176 B0 C2B0 ° ° 177 B1 C2B1 ± ± 178 B2 C2B2 ² ² 179 B3 C2B3 ³ ³ 180 B4 C2B4 ´ ´ 181 B5 C2B5 µ µ 182 B6 C2B6 ¶ ¶ 183 B7 C2B7 · · 184 B8 C2B8 ¸ ¸ 185 B9 C2B9 ¹ ¹ 186 BA C2BA º º 187 BB C2BB » » 188 BC C2BC ¼ ¼ 189 BD C2BD ½ ½ 190 BE C2BE ¾ ¾ 191 BF C2BF ¿ ¿ |
192 C0 C380 À À 193 C1 C381 Á Á 194 C2 C382 Â Â 195 C3 C383 Ã Ã 196 C4 C384 Ä Ä 197 C5 C385 Å Å 198 C6 C386 Æ Æ 199 C7 C387 Ç Ç 200 C8 C388 È È 201 C9 C389 É É 202 CA C38A Ê Ê 203 CB C38B Ë Ë 204 CC C38C Ì Ì 205 CD C38D Í Í 206 CE C38E Î Î 207 CF C38F Ï Ï 208 D0 C390 Ð Ð 209 D1 C391 Ñ Ñ 210 D2 C392 Ò Ò 211 D3 C393 Ó Ó 212 D4 C394 Ô Ô 213 D5 C395 Õ Õ 214 D6 C396 Ö Ö 215 D7 C397 × × 216 D8 C398 Ø Ø 217 D9 C399 Ù Ù 218 DA C39A Ú Ú 219 DB C39B Û Û 220 DC C39C Ü Ü 221 DD C39D Ý Ý 222 DE C39E Þ Þ 223 DF C39F ß ß |
224 E0 C3A0 à à 225 E1 C3A1 á á 226 E2 C3A2 â â 227 E3 C3A3 ã ã 228 E4 C3A4 ä ä 229 E5 C3A5 å å 230 E6 C3A6 æ æ 231 E7 C3A7 ç ç 232 E8 C3A8 è è 233 E9 C3A9 é é 234 EA C3AA ê ê 235 EB C3AB ë ë 236 EC C3AC ì ì 237 ED C3AD í í 238 EE C3AE î î 239 EF C3AF ï ï 240 F0 C3B0 ð ð 241 F1 C3B1 ñ ñ 242 F2 C3B2 ò ò 243 F3 C3B3 ó ó 244 F4 C3B4 ô ô 245 F5 C3B5 õ õ 246 F6 C3B6 ö ö 247 F7 C3B7 ÷ ÷ 248 F8 C3B8 ø ø 249 F9 C3B9 ù ù 250 FA C3BA ú ú 251 FB C3BB û û 252 FC C3BC ü ü 253 FD C3BD ý ý 254 FE C3BE þ þ 255 FF C3BF ÿ ÿ |
128 80 E282AC € 129 81 C281  130 82 E2809A ‚ 131 83 C692 ƒ 132 84 E2809E „ 133 85 E280A6 … 134 86 E280A0 †135 87 E280A1 ‡ 136 88 CB86 ˆ 137 89 E280B0 ‰ 138 8A C5A0 Š139 8B E280B9 ‹ 140 8C C592 Œ 141 8D C28D  142 8E C5BD Ž 143 8F C28F  144 90 C290  145 91 E28098 ‘ 146 92 E28099 ’ 147 93 E2809C “ 148 94 E2809D †149 95 E280A2 • 150 96 E28093 – 151 97 E28094 — 152 98 CB9C ˜ 153 99 E284A2 ™ 154 9A C5A1 š 155 9B E280BA › 156 9C C593 œ 157 9D C29D  158 9E C5BE ž 159 9F C5B8 Ÿ |
160 A0 C2A0  161 A1 C2A1 ¡ ¡ 162 A2 C2A2 ¢ ¢ 163 A3 C2A3 £ £ 164 A4 C2A4 ¤ ¤ 165 A5 C2A5 ¥ Â¥ 166 A6 C2A6 ¦ ¦ 167 A7 C2A7 § § 168 A8 C2A8 ¨ ¨ 169 A9 C2A9 © © 170 AA C2AA ª ª 171 AB C2AB « « 172 AC C2AC ¬ ¬ 173 AD C2AD  174 AE C2AE ® ® 175 AF C2AF ¯ ¯ 176 B0 C2B0 ° ° 177 B1 C2B1 ± ± 178 B2 C2B2 ² ² 179 B3 C2B3 ³ ³ 180 B4 C2B4 ´ ´ 181 B5 C2B5 µ µ 182 B6 C2B6 ¶ ¶ 183 B7 C2B7 · · 184 B8 C2B8 ¸ ¸ 185 B9 C2B9 ¹ ¹ 186 BA C2BA º º 187 BB C2BB » » 188 BC C2BC ¼ ¼ 189 BD C2BD ½ ½ 190 BE C2BE ¾ ¾ 191 BF C2BF ¿ ¿ |
192 C0 C380 À À 193 C1 C381 Á à 194 C2 C382  Â 195 C3 C383 à à 196 C4 C384 Ä Ã„ 197 C5 C385 Å Ã… 198 C6 C386 Æ Ã† 199 C7 C387 Ç Ã‡ 200 C8 C388 È Ãˆ 201 C9 C389 É Ã‰ 202 CA C38A Ê ÃŠ 203 CB C38B Ë Ã‹ 204 CC C38C Ì ÃŒ 205 CD C38D Í Ã 206 CE C38E Î ÃŽ 207 CF C38F Ï Ã 208 D0 C390 Ð Ã 209 D1 C391 Ñ Ã‘ 210 D2 C392 Ò Ã’ 211 D3 C393 Ó Ã“ 212 D4 C394 Ô Ã” 213 D5 C395 Õ Ã• 214 D6 C396 Ö Ã– 215 D7 C397 × Ã— 216 D8 C398 Ø Ã˜ 217 D9 C399 ٠Ù 218 DA C39A Ú Ãš 219 DB C39B Û Ã› 220 DC C39C Ü Ãœ 221 DD C39D Ý Ã 222 DE C39E Þ Ãž 223 DF C39F ß ÃŸ |
224 E0 C3A0 à à 225 E1 C3A1 á á 226 E2 C3A2 â â 227 E3 C3A3 ã ã 228 E4 C3A4 ä ä 229 E5 C3A5 å Ã¥ 230 E6 C3A6 æ æ 231 E7 C3A7 ç ç 232 E8 C3A8 è è 233 E9 C3A9 é é 234 EA C3AA ê ê 235 EB C3AB ë ë 236 EC C3AC ì ì 237 ED C3AD í à 238 EE C3AE î î 239 EF C3AF ï ï 240 F0 C3B0 ð ð 241 F1 C3B1 ñ ñ 242 F2 C3B2 ò ò 243 F3 C3B3 ó ó 244 F4 C3B4 ô ô 245 F5 C3B5 õ õ 246 F6 C3B6 ö ö 247 F7 C3B7 ÷ ÷ 248 F8 C3B8 ø ø 249 F9 C3B9 ù ù 250 FA C3BA ú ú 251 FB C3BB û û 252 FC C3BC ü ü 253 FD C3BD ý ý 254 FE C3BE þ þ 255 FF C3BF ÿ ÿ |
LOAD DATA LOCAL INFILE '...' INTO TABLE tbl CHARACTER SET 'utf8mb4' FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\' IGNORE 1 ROWS;
å, ä and ö in Finnish and Swedish à, ç, è, é, ï, í, ò, ó, ú, ü in Catalan å, æ and ø in Norwegian and Danish á, é, ó, ý, è, ë, ï in Dutch ä, ö, ü and ß in German á, ð, í, ó, ú, ý, æ and ø in Faroese á, ð, é, í, ó, ú, ý, þ, æ and ö in Icelandic à, â, ç, è, é, ë, ê, ï, î, ö, ô, ù, û, ÿ, æ, œ in French à, è, é, ì, ò, ù in Italian á, é, í, ñ, ó, ú, ï, ü, ¡, ¿ in Spanish à, á, â, ã, ç, é, ê, í, ó, ô, õ, ú in Portuguese (ü no longer used) á, é, í, ó, ú in Irish £ in British English
C387 0199 00C7 Ç latin capital letter c with cedilla C396 0214 00D6 Ö latin capital letter o with diaeresis C39C 0220 00DC Ü latin capital letter u with diaeresis C3A7 0231 00E7 ç latin small letter c with cedilla C3B6 0246 00F6 ö latin small letter o with diaeresis C3BC 0252 00FC ü latin small letter u with diaeresis C49E 0286 011E Ğ latin capital letter g with breve C49F 0287 011F ğ latin small letter g with breve C4B0 0304 0130 İ latin capital letter i with dot above C4B1 0305 0131 ı latin small letter dotless i C59E 0350 015E Ş latin capital letter s with cedilla C59F 0351 015F ş latin small letter s with cedilla E282A4 8356 20A4 ₤ lira sign E282BA 8378 20BA ₺ Turkish lira sign
CREATE TABLE `se96265` ( `content` mediumtext NOT NULL, FULLTEXT KEY `content` (`content`) ) ENGINE=MyISAM DEFAULT CHARSET=latin1; SET NAMES utf8; -- Turkish word for Croatia, second char is \xC4\xB1 INSERT INTO `se96265` (`content`) VALUES ('Hırvatistan'); SELECT content, HEX(content), HEX('Hırvatistan') FROM se96265\G *************************** 1. row *************************** content: H?rvatistan HEX(content): 483F72766174697374616E HEX('Hırvatistan'): 48C4B172766174697374616ENote how the 'C4B1' turned into '?'
mysql> SET SESSION sql_mode = 'STRICT_ALL_TABLES'; -- So the INSERT will cause an error: mysql> INSERT INTO `se96265` (`content`) VALUES ('Hırvatistan'); ERROR 1366 (HY000): Incorrect string value: '\xC4\xB1rvat...' for column 'content' at row 1(Note: This technique for getting an error message is not specific to Turkish.)
SELECT 'بِسْمِ' = 'بسم' COLLATE utf8_unicode_ci;The result I got back was 1 (true), meaning they are considered equal. With utf8_general_ci it came back with 0, meaning not equal. (Of course, you should declare your column(s) as utf8_unicode_ci rather than using the COLLATE clause.)
WHERE HEX(name) REGEXP '^(D7..)+$'See forum discussion
WHERE HEX(name) REGEXP '^(E.....)+$'⚈ Chinese characters are 3 bytes (6 hex) long
mysql> SELECT 'К' = 'Ќ' COLLATE utf8_bin AS bin, 'К' = 'Ќ' COLLATE utf8_general_ci AS general, 'К' = 'Ќ' COLLATE utf8_unicode_ci AS unicode; +-----+---------+---------+ | bin | general | unicode | +-----+---------+---------+ | 0 | 1 | 0 | +-----+---------+---------+f mysql> SELECT 'Г' = 'Ѓ' COLLATE utf8_bin AS bin, 'Г' = 'Ѓ' COLLATE utf8_general_ci AS general, 'Г' = 'Ѓ' COLLATE utf8_unicode_ci AS unicode; +-----+---------+---------+ | bin | general | unicode | +-----+---------+---------+ | 0 | 1 | 0 | +-----+---------+---------+
Cyrillic example: title: Добро пожаловать в Диксон, hex: D094D0BED0B1D180D0BE20D0BFD0BED0B6D0B0D0BBD0BED0B2D0B0D182D18C20D0B220D094D0B8D0BAD181D0BED0BD, length: 47, char length: 25
mysql> SELECT ';' semicolon, UNHEX('cdbe') 'Greek question mark'; +-----------+---------------------+ | semicolon | Greek question mark | +-----------+---------------------+ | ; | ; | +-----------+---------------------+
SELECT 'क़ज़फ़' REGEXP '^[क़-फ़]+$'; --> true -- But that is mere coincidence!Stackoverflow thread
for i, c in enumerate(u): print i, '%04x' % ord(c), unicodedata.category(c), print unicodedata.name(c)
import sys reload(sys) # Reload does the trick! sys.setdefaultencoding('UTF8')
cursor.execute('SET NAMES utf8mb4') cursor.execute("SET CHARACTER SET utf8mb4") cursor.execute("SET character_set_connection=utf8mb4")
LATIN SMALL LETTER I WITH DIAERESIS RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK INVERTED QUESTION MARK
$hex = unpack('H*', $text); echo implode('', $hex);or simply echo bin2hex($text);
$myr = '\u101c\u1031\u1011'; // Given this $x = json_decode('"'.$myr.'"'); // Add quotes and decode // $x is not string, like လေထ
AddDefaultCharset UTF-8
<%Response.charset="utf-8"%>
mysql_set_character_set('utf8mb4');
CONVERT(UNHEX(HEX(CONVERT(CONVERT(col USING utf8) USING latin1))), CHAR)Put this in the connection string:
id=my_user;password=my_password;database=some_db123;charset=utf8;Lao and C#
$config['charset'] = 'UTF-8';application/config/database.php :
$db['default'] = array( ... 'char_set' => 'utf8mb4', 'dbcollat' => 'utf8mb4_unicode_520_ci',Note: Those are aimed at MySQL. dbcollat is used only for PHP's mysqli.
httppost.setEntity(new UrlEncodedFormEntity(nameValuePairs, HTTP.UTF_8));
echo meta('Content-type', 'text/html; charset=utf-8', 'equiv');
<cfprocessingdirective pageEncoding="utf-8"> <cffile action="read" file="#settings.csvfile#" variable="autodata" charset="utf-8">
DATABASES = { 'default': { 'ENGINE': 'django.db.backends.mysql', ... 'OPTIONS': { 'charset': 'utf8mb4', 'use_unicode': True, }, }, } my.cnf: [mysqld] character-set-server=utf8mb4 default-collation=utf8mb4_unicode_ci [client] default-character-set=utf8mb4StackOverflow
amending your settings.php database connection by adding: $databases['default']['default'] = array( 'driver' => 'mysql', 'database' => 'databasename', 'username' => 'username', 'password' => 'password', 'host' => 'localhost', 'charset' => 'utf8mb4', 'collation' => 'utf8mb4_general_ci', );
dataSource: dbCreate: "update" url: "jdbc:mysql://localhost:8889/mydbname?useUnicode=true&zeroDateTimeBehavior=convertToNull&autoReconnect=true"
<%@ page language="java" pageEncoding="UTF-8"%> <%@ page contentType="text/html;charset=UTF-8" %> compileJava.options.encoding = 'UTF-8'
<form method="post" action="/your/url/" accept-charset="UTF-8">
resource.setContentType ("text/html;charset=utf-8");
<%@ page contentType="text/html; charset=UTF-8" %>
[mysql] default-character-set = utf8mb4StackOverflow
Server=myServerAddress;Database=myDataBase;Uid=myUsername;Pwd=myPassword;CharSet=utf8;
library(RODBC) imp <- odbcConnect("SERVER", uid = "***", pwd="***",DBMSencoding="utf8")
ODBC;DRIVER={MySQL ODBC 5.1 Driver};UID=...;PWD=...;PORT=3306;DATABASE=...;SERVER=127.0.0.1;CHARSET=UTF8; strConnection="driver={MySQL ODBC 5.1 Driver}; server=...;uid=...;pwd=...;database=...;stmt=SET NAMES utf8;"Unicode is not supported in ADO applications (including ASP) with the Connector/ODBC 3.51 driver. Please consider testing MySQL Connector/ODBC 5.1 which includes Unicode support for ASP.
use utf8; use open ':std', ':encoding(UTF-8)'; my $dbh = DBI->connect("dbi:mysql:".$dsn, $user, $password, { PrintError => 0, RaiseError => 1, mysql_enable_utf8 => 1, # Switch to UTF-8 for communication and decode. }); # or {mysql_enable_utf8mb4 => 1} if using utf8mb4Stackoverflow - double encoding and JSON
import pymysql con = pymysql.connect(host='127.0.0.1', port=3306, user='root', passwd='******', charset="utf8mb4")
Tool -> Global Options -> Code -> Saving and put UTF-8 rs <- dbSendQuery(con, 'set character set "utf8"') rs <- dbSendQuery(con, 'SET NAMES utf8')
"\xD0\x9A\xD0\xB0\xD1\x81\xD0\xBB".force_encoding("UTF-8") => "Щасл"
def configure_charsets response.headers["Content-Type"] = "text/html; charset=utf-8" suppress(ActiveRecord::StatementInvalid) do ActiveRecord::Base.connection.execute 'SET NAMES UTF8' end end
<property name="hibernate.connection.CharSet">utf8mb4</property> <property name="hibernate.connection.characterEncoding">UTF-8</property> <property name="hibernate.connection.useUnicode">true</property>Connection url:
db.url=jdbc:mysql://localhost:3306/db_name?useUnicode=true&character_set_server=utf8mb4The above changes were enough for me to upgrade from utf8 to utf8mb4 charset scheme.
<form accept-charset="UTF-8">
<filter> <filter-name>encoding-filter</filter-name> <filter-class> org.springframework.web.filter.CharacterEncodingFilter </filter-class> <init-param> <param-name>encoding</param-name> <param-value>UTF-8</param-value> </init-param> <init-param> <param-name>forceEncoding</param-name> <param-value>true</param-value> </init-param> </filter>
db_url = sqlalchemy.engine.url.URL(drivername='mysql', host=foo.db_host, database=db_schema, query={ 'read_default_file' : foo.db_config, 'charset': 'utf8mb4' })MySQL-Unicode in sqlalchemy
Dim MyConn As New MySqlConnection("Server=localhost;User Id=u;Password=x;Database=d;Charset=utf8")
<%@language="VBSCRIPT" CODEPAGE="65001" LCID=1033%>
db = DAL('mysql://username:password@localhost/test?set_encoding=utf8mb4')
Collations to add: We are going to add following collations. The collation name contains: a. character set name: "utf8mb4" b. language's iso code: for example, "cs" for Czech c. UCA version: "800" d. accent / case insensitive: "ai_ci" The complete list is: Collation name language ------------------------------------------------------------ utf8mb4_cs_800_ai_ci Czech utf8mb4_da_800_ai_ci Danish utf8mb4_de_phonebook_800_ai_ci German (phonebook order) utf8mb4_eo_800_ai_ci Esperanto utf8mb4_es_800_ai_ci Spanish utf8mb4_es_traditional_800_ai_ci Spanish (traditional) utf8mb4_et_800_ai_ci Estonian utf8mb4_hr_800_ai_ci Croatian utf8mb4_hu_800_ai_ci Hungarian utf8mb4_is_800_ai_ci Icelandic utf8mb4_la_800_ai_ci Roman (classical Latin) utf8mb4_lt_800_ai_ci Lithuanian utf8mb4_lv_800_ai_ci Latvian utf8mb4_pl_800_ai_ci Polish utf8mb4_ro_800_ai_ci Romanian utf8mb4_sk_800_ai_ci Slovak utf8mb4_sl_800_ai_ci Slovenian utf8mb4_sv_800_ai_ci Swedish utf8mb4_tr_800_ai_ci Turkish utf8mb4_vi_800_ai_ci VietnameseInstead, see (as of 8.0.0) utf8mb4 collations
ALTER TABLE tbl CONVERT TO CHARACTER SET utf8;Example from forums
EXPLAIN EXTENDED SELECT ...; SHOW WARNINGS;sometimes gives extra clues into what charset/collation issues are being faced during a SELECT. It will probably contain 'introducers'; example: _utf8'abc'.
# Find rows with any 8-bit chars: SELECT * FROM tbl WHERE HEX(colname) RLIKE '^(..)*[8-F].'; # Remove all occurrences of one 8-bit code (hex 92, in this example): UPDATE tbl SET colname = REPLACE(colname, UNHEX('92'), '');Reference: how to replace hex in varchar field
[client] default-character-set=utf8 [mysql] default-character-set=utf8 character-set-server = utf8 [mysqld] collation-server = utf8_unicode_ci init-connect='SET NAMES utf8' character-set-server = utf8
skip-character-set-client-handshake collation-server=utf8_unicode_ci character-set-server=utf8However, "The server character set and collation are used as default values if the database character set and collation are not specified in CREATE DATABASE statements. They have no other purpose." manual on charset-server
[mysql] default-character-set=utf8
<?xml version="1.0" encoding="utf-8"?>
<form method="post" action="/your/url/" accept-charset="UTF-8">
collation_server = utf8mb4_unicode_520_cisets
character_set_database | utf8mb4 character_set_server | utf8mb4 collation_database | utf8mb4_unicode_520_ci collation_server | utf8mb4_unicode_520_cibut not the things set by SET NAMES, nor collation_connection.
USE my_example_db_name; # Select the right charset SET NAMES 'utf8mb4'; # Import from SQL file SOURCE /somewhere/dump.sql; # Disconnect from SQL server EXIT
[client] default-character-set=charset_name
Partition Maintenance (DROP+REORG) for time series (includes list of PARTITION uses)
Big DELETEs - how to optimize -- and other chunking advice, plus a use for PARTITIONing
Chunking lengthy DELETE/UPDATE/etc.
Data Warehouse techniques:
Data Warehouse Overview
Summary Tables
High speed ingestion
Bulk Normalization
Schema and code design for large Sensor database
Entity-Attribute-Value (EAV) -- a common, poorly performing, design pattern; plus an alternative
Lat/Lng search to Find the nearest 10 pizza parlors
Lat/Long representation choices
Pagination, not with OFFSET, LIMIT
Techniques on efficiently finding a random row (On beyond ORDER BY RAND())
GUID/UUID Performance (type 1 only)
IP Range Table Performance -- or other disjoint ranges
Alter of a Huge table -- Mostly obviated by 5.6
Efficient List of Latest 10 news articles
Build and execute a Pivot SELECT (showing rows as columns)
(Groupwise Max): Efficiently find largest row(s) for each group
Other Tips, Tuning, Debugging, Optimizations, etc...
Rick's RoTs (Rules of Thumb -- lots of tips)
Datatypes and building a good schema
Memory Allocation (caching, etc)
Character Set and Collation problem solver
Trouble with UTF-8
If you want case folding, but accent sensitivity, please file a request at https://bugs.mysql.com .
Python tips,
PHP tips,
other language tips
utf8 Collations
utf8mb4 Collations on 8.0
Converting from MyISAM to InnoDB -- includes differences between them
Compound INDEXes plus other insights into the mysteries of INDEXing
Cookbook for Creating Indexes
Many-to-many mapping table
Handler counts
wp_postmeta
UNION+OFFSET
MySQL Limits -- built-in hard limits
767-byte INDEX limit
Galera, tips on converting to (Percona XtraDB Cluster, MariaDB 10, or manually installed)
5.7's Query Rewrite -- perhaps 5.7's best perf gain, at least for this forum's users
Analyze MySQL Performance
Analyze VARIABLEs and GLOBAL STATUS
Analyze SlowLog
My slides from conferences
MiniFest 2021 - Rick James & Daniel Black - Answering on Stack Overflow(+comments) - MariaDB Frontlines
Percona Live 4/2017 - Rick's RoTs (Rules of Thumb) - MySQL/MariaDB
Percona Live 4/2017 - Index Cookbook - MySQL/MariaDB
Percona Live 9/2015 - PARTITIONing - MySQL/MariaDB
Contact me via LinkedIn; be sure to include a brief teaser in the Invite request: