Example of Double-encoding
Desired string: სახლი არის
Hex in table:
C3A1C692C2A1C3A1C692C290C3A1C692C2AEC3A1C692C5A1C3A1C692CB9C20C3A1C692C290C3A1C692C2A0C3A1C692CB9CC3A1C692C2A1
Correct utf8 hex:
E183A1E18390E183AEE1839AE1839820E18390E183A0E18398E183A1
Analysis:
C3A1C692C2A1 -- should be E183A1 -- Georgian SAN (U+10E1) "ს"
C3A1C692C290
C3A1C692C2AE
C3A1C692C5A1
C3A1C692CB9C
20 -- space (same in utf8 and latin1)
C3A1C692C290
C3A1C692C2A0
C3A1C692CB9C
C3A1C692C2A1
Georgian, when properly encoded in utf8, should be 3 bytes per character:
E182xx
E183xx
E2B4xx
Taking the first "character" (C3A1C692C2A1, in hex):
⚈ latin1 E1 = utf8 C3A1
⚈ latin1 83 = utf8 C692
⚈ latin1 A1 = utf8 C2A1
Notice how most of the letters look very similar: C3A1C692xxyy, when they
should start with E183xx.
One pass over the data to convert each two bytes (4 hex) from utf8 to latin1:
SELECT
HEX(CONVERT(CONVERT(UNHEX(
'C3A1C692C2A1C3A1C692C290C3A1C692C2AEC3A1C692C5A1C3A1C692CB9C20C3A1C692C290C3A1C692C2A0C3A1C692CB9CC3A1C692C2A1'
) USING utf8) USING latin1))
-->
E183A1E18390E183AEE1839AE1839820E18390E183A0E18398E183A1
Now to display as utf8:
SELECT UNHEX('E183A1E18390E183AEE1839AE1839820E18390E183A0E18398E183A1')
-->
სახლი არის
(Ref: forum thread #418885.)
Another example of double-encoding:
'C398C2AF' should have been 'D8AF'
SELECT HEX(CONVERT(CONVERT(UNHEX( 'C398C2AF') USING utf8) USING latin1));
--> D8AF, which is the utf8 for Arabic DAL.
SELECT CONVERT(CONVERT(UNHEX( 'C398C2AF') USING utf8) USING latin1);
--> that character.
(Ref: forum thread #565669.)
Fixing utf8 bytes in latin1 column
CAVEAT! Test before using on Production!
# Set up bad case:
mysql> create table latin1 (c varchar(11) charset latin1);
mysql> insert into latin1 (c) values ('x'), (unhex('c2a3'));
mysql> select c, hex(c), length(c), char_length(c) from latin1;
+------+--------+-----------+----------------+
| c | hex(c) | length(c) | char_length(c) |
+------+--------+-----------+----------------+
| x | 78 | 1 | 1 |
| A£ | C2A3 | 2 | 2 |
+------+--------+-----------+----------------+
# Step 1 of 2 of fix:
mysql> alter table latin1 modify c tinyblob;
mysql> select c, hex(c), length(c), char_length(c) from latin1;
+------+--------+-----------+----------------+
| c | hex(c) | length(c) | char_length(c) |
+------+--------+-----------+----------------+
| x | 78 | 1 | 1 |
| A£ | C2A3 | 2 | 2 |
+------+--------+-----------+----------------+
# Step 2 of 2 of fix:
mysql> alter table latin1 modify c varchar(11) charset utf8;
mysql> select c, hex(c), length(c), char_length(c) from latin1;
+------+--------+-----------+----------------+
| c | hex(c) | length(c) | char_length(c) |
+------+--------+-----------+----------------+
| x | 78 | 1 | 1 |
| £ | C2A3 | 2 | 1 |
+------+--------+-----------+----------------+
That is, ALTER to BLOB (of appropriate size), then ALTER to VARCHAR
(of original size) and CHARSET utf8.
This technique should work for other cases of the right encoding
sitting in a 5.1 table with the wrong CHARSET declaration.
One more note: Suggest you include the COLLATE clause on the
the final ALTER that converts to VARCHAR.
Fixing "double encoding"
CAVEAT! Test before using on Production!
*** Plan A: Use UPDATE to fix in-place:
UPDATE Tbl SET col =
IFNULL(CONVERT(CONVERT(CONVERT(col USING latin1)
USING binary)
USING utf8),
col )
Caution: I have only briefly tested this.
The IFNULL has the advantage of leaving intact most cells that
are not double-encoded. (Maybe -- need further testing)
How does it work?
1. The first (innermost) CONVERT takes a 4-byte encoding and converts to 2 bytes.
2. Converting to binary and then to something else is a way to change the type without changing the bytes.
3. The final convert to utf8 prevents the assignment from further messing up
4. The last step (IFNULL) is to undo the operation if it ended up invalid utf8??
*** Plan B: Use ALTER:
mysql> # Create sample:
mysql> CREATE TABLE dbl (c VARCHAR(11) CHARSET latin1);
mysql> INSERT INTO dbl (c) VALUES ('x'), (UNHEX('C382C2A4'));
mysql> # Diagnose:
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl;
+------+----------+-----------+----------------+
| c | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) |
+------+----------+-----------+----------------+
| x | 78 | 1 | 1 |
| A,A☼ | C382C2A4 | 4 | 4 |
+------+----------+-----------+----------------+
mysql> # Step 1 of 5:
Note: Steps 1 and 2 are needed only if you start with latin1;
If your column is already utf8, skip to step 3.
mysql> ALTER TABLE dbl MODIFY c TINYBLOB;
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl;
+------+----------+-----------+----------------+
| c | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) |
+------+----------+-----------+----------------+
| x | 78 | 1 | 1 |
| A,A☼ | C382C2A4 | 4 | 4 |
+------+----------+-----------+----------------+
mysql> # Step 2 of 5:
mysql> ALTER TABLE dbl MODIFY c VARCHAR(11) CHARSET utf8;
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl;
+------+----------+-----------+----------------+
| c | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) |
+------+----------+-----------+----------------+
| x | 78 | 1 | 1 |
| A☼ | C382C2A4 | 4 | 2 |
+------+----------+-----------+----------------+
mysql> # Step 3 of 5:
mysql> ALTER TABLE dbl MODIFY c VARCHAR(11) CHARSET latin1;
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl;
+------+--------+-----------+----------------+
| c | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) |
+------+--------+-----------+----------------+
| x | 78 | 1 | 1 |
| A☼ | C2A4 | 2 | 2 |
+------+--------+-----------+----------------+
mysql> # Step 4 of 5:
mysql> ALTER TABLE dbl MODIFY c TINYBLOB;
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl;
+------+--------+-----------+----------------+
| c | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) |
+------+--------+-----------+----------------+
| x | 78 | 1 | 1 |
| A☼ | C2A4 | 2 | 2 |
+------+--------+-----------+----------------+
mysql> # Step 5 of 5:
mysql> ALTER TABLE dbl MODIFY c VARCHAR(11) CHARSET utf8;
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl;
+------+--------+-----------+----------------+
| c | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) |
+------+--------+-----------+----------------+
| x | 78 | 1 | 1 |
| ☼ | C2A4 | 2 | 1 |
+------+--------+-----------+----------------+
mysql> # Check schema:
mysql> SHOW CREATE TABLE dbl \G
CREATE TABLE `dbl` (
`c` varchar(11) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8
One more note: Suggest you include the COLLATE clause on the
the final ALTER that converts to VARCHAR.
*** Plan C: Using mysqldump
(This has had only limited testing.)
1. mysqldump --default-character-set=latin1 .... my_database > my_database_latin1.sql
2. Edit my_database_latin1.sql to set NAMES=utf8 at the top.
3. mysql ... < mydatabase.sql
Fix definition and data, IF correctly encoded
You have a latin1 table containing latin1 encodings;
you want utf8.
(This is not the general thrust of this document, but is
included for completeness, and how tempting looking the command is.)
mysql> # Create sample:
mysql> CREATE TABLE trycvt (c VARCHAR(11) CHARSET latin1);
mysql> INSERT INTO trycvt (c) VALUES ('x'), (UNHEX('BB'));
mysql> SHOW CREATE TABLE trycvt \G
CREATE TABLE `trycvt` (
`c` varchar(11) CHARACTER SET latin1 DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8
mysql> # Diagnose:
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM trycvt;
+------+--------+-----------+----------------+
| c | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) |
+------+--------+-----------+----------------+
| x | 78 | 1 | 1 |
| » | BB | 1 | 1 |
+------+--------+-----------+----------------+
mysql> # Step 1 of 1
mysql> ALTER TABLE trycvt CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
mysql> # Diagnose:
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM trycvt;
+------+--------+-----------+----------------+
| c | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) |
+------+--------+-----------+----------------+
| x | 78 | 1 | 1 |
| » | C2BB | 2 | 1 |
+------+--------+-----------+----------------+
mysql> # Note schema changed:
mysql> SHOW CREATE TABLE trycvt \G
CREATE TABLE `trycvt` (
`c` varchar(11) COLLATE utf8_unicode_ci DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
Note: TEXT will become MEDIUMTEXT, etc.
This is because transcoding latin1 to utf8 usually lengthens the data.
You may want to follow with a
ALTER TABLE ... MODIFY c TEXT ..esume.
to put it back.
Note about ALTER: Generally it is possible and faster to
do all ALTERs of a table in a single ALTER;
128 80 E282AC €
129 81 C281 
130 82 E2809A ‚
131 83 C692 ƒ
132 84 E2809E „
133 85 E280A6
…
134 86 E280A0 †
135 87 E280A1 ‡
136 88 CB86 ˆ
137 89 E280B0 ‰
138 8A C5A0 Š
139 8B E280B9 ‹
140 8C C592 Œ
141 8D C28D 
142 8E C5BD Ž
143 8F C28F 
|
160 A0 C2A0
161 A1 C2A1 ¡ ¡
162 A2 C2A2 ¢ ¢
163 A3 C2A3 £ £
164 A4 C2A4 ¤ ¤
165 A5 C2A5 ¥ ¥
166 A6 C2A6 ¦ ¦
167 A7 C2A7 § §
168 A8 C2A8 ¨ ¨
169 A9 C2A9 © ©
170 AA C2AA ª ª
171 AB C2AB « «
172 AC C2AC ¬ ¬
173 AD C2AD ­
174 AE C2AE ® ®
175 AF C2AF ¯ ¯
|
192 C0 C380 À À
193 C1 C381 Á Á
194 C2 C382 Â Â
195 C3 C383 Ã Ã
196 C4 C384 Ä Ä
197 C5 C385 Å Å
198 C6 C386 Æ Æ
199 C7 C387 Ç Ç
200 C8 C388 È È
201 C9 C389 É É
202 CA C38A Ê Ê
203 CB C38B Ë Ë
204 CC C38C Ì Ì
205 CD C38D Í Í
206 CE C38E Î Î
207 CF C38F Ï Ï
|
224 E0 C3A0 à à
225 E1 C3A1 á á
226 E2 C3A2 â â
227 E3 C3A3 ã ã
228 E4 C3A4 ä ä
229 E5 C3A5 å å
230 E6 C3A6 æ æ
231 E7 C3A7 ç ç
232 E8 C3A8 è è
233 E9 C3A9 é é
234 EA C3AA ê ê
235 EB C3AB ë ë
236 EC C3AC ì ì
237 ED C3AD í í
238 EE C3AE î î
239 EF C3AF ï ï
|