环境:
操作系统:中文版windowsXP
数据库:Oracle9iR2
java文件编码UTF-8,jsp页面编码UTF-8
原库字符集为AL32UTF8,国家字符集为UTF8,导出时客户端字符集也为UTF8,新创建数据库字符集为AL32UTF8
,国家字符集为AL16UTF16,导入数据时,客户端字符集为UTF8,导入数据后一切正常。
当从表A中读取一条信息a,将其nclob列内容插入表B的nclob列,同时向表B中varchar2列插入带中文内容,操作完成后
,生成一条信息b,查询表B此条信息,varchar2列内容无问题,nclob内容乱码。
首先查看A表此条a信息,nclob列的部分内容如下:
-----------------------------------------------------------------------------------------------------------
<table cellspacing="0" cellpadding="5" width="100%" align="center" border="0">
<tbody>
<tr>
<td align="center" height="48"><strong><font face="黑体"
-----------------------------------------------------------------------------------------------------------
dump得到:dump1
-----------------------------------------------------------------------------------------------------------
Typ=1 Len=2000: 0,60,0,116,0,97,0,98,0,108,0,101,0,32,0,99,
0,101,0,108,0,108,0,115,0,112,0,97,0,99,0,105,0,110,0,103,0,
61,0,34,0,48,0,34,0,32,0,99,0,101,0,108,0,108,0,112,0,97,0,
100,0,100,0,105,0,110,0,103,0,61,0,34,0,53,0,34,0,32,0,119,
0,105,0,100,0,116,0,104,0,61,0,34,0,49,0,48,0,48,0,37,0,34,
0,32,0,97,0,108,0,105,0,103,0,110,0,61,0,34,0,99,0,101,0,110,
0,116,0,101,0,114,0,34,0,32,0,98,0,111,0,114,0,100,0,101,0,
114,0,61,0,34,0,48,0,34,0,62,0,13,0,10,0,32,0,32,0,32,0,32,0,
60,0,116,0,98,0,111,0,100,0,121,0,62,0,13,0,10,0,32,0,32,0,
32,0,32,0,32,0,32,0,32,0,32,0,60,0,116,0,114,0,62,0,13,0,10,
0,32,0,32,0,32,0,32,0,32,0,32,0,32,0,32,0,32,0,32,0,32,0,32,
0,60,0,116,0,100,0,32,0,97,0,108,0,105,0,103,0,110,0,61,0,
34,0,99,0,101,0,110,0,116,0,101,0,114,0,34,0,32,0,104,0,101,
0,105,0,103,0,104,0,116,0,61,0,34,0,52,0,56,0,34,0,62,0,60,
0,115,0,116,0,114,0,111,0,110,0,103,0,62,0,60,0,102,0,111,
0,110,0,116,0,32,0,102,0,97,0,99,0,101,0,61,0,34,158,209 ==>黑的UTF-16的2个十进制数
-----------------------------------------------------------------------------------------------------------
其中"<"符号Unicode编码如下:
UTF-8:3C
UTF-16:003C
其中3C为16进制数,转换为10进制则为3*16+12=60
此外,黑体的“黑”字UTF-8编码为E9 BB 91=233 187 145,UTF-16编码为9E D1=158 209。
我们可以看到,实际内容确实是以UTF-16编码方式存储。
再查询B表的b,部分内容如下:
-----------------------------------------------------------------------------------------------------------
㱴慢汥敬汳灡捩湧㴢〢敬汰慤摩湧㴢㔢⁷楤瑨㴢〥∠慬楧渽≣敮瑥爢潲摥爽∰∾††㱴扯摹㸠†††‼瑲㸠†††††‼瑤汩杮㴢捥湴敲∠桥楧桴㴢㐸∾㱳瑲潮朾㱦潮琠晡捥㴢釤붓∠獩穥㴢㘢㻧뺎꿥뢸蓨꾾駥궦뻨꺡ꇦ鶿ㄼ⽦潮琾㰯獴牯湧㸼⽴搾††††㰯瑲㸠†††‼瑲㸠†††††‼瑤汩杮㴢捥湴敲∠桥楧桴㴢ㄵ∾駥뢈㨼甾♮扳瀻♮扳瀻♮扳瀻♮扳瀻♮
-----------------------------------------------------------------------------------------------------------
dump得到:dump2
-----------------------------------------------------------------------------------------------------------
Typ=1 Len=2000: 60,116,97,98,108,101,32,99,101,108,108,115,112,97,
99,105,110,103,61,34,48,34,32,99,101,108,108,112,97,100,100,105,110,
103,61,34,53,34,32,119,105,100,116,104,61,34,49,48,48,37,34,32,97,108,
105,103,110,61,34,99,101,110,116,101,114,34,32,98,111,114,100,101,
114,61,34,48,34,62,32,32,32,32,60,116,98,111,100,121,62,32,32,32,32,
32,32,32,32,60,116,114,62,32,32,32,32,32,32,32,32,32,32,32,32,60,116,
100,32,97,108,105,103,110,61,34,99,101,110,116,101,114,34,32,104,101,
105,103,104,116,61,34,52,56,34,62,60,115,116,114,111,110,103,62,60,
102,111,110,116,32,102,97,99,101,61,34,233,187,145 ==>黑的UTF-8编码的三个十进制数
-----------------------------------------------------------------------------------------------------------
我们仔细区分就会发现,dump2其实就是dump1将0丢掉后的内容,这里的nclob列的编码方式为UTF-16,
那么nclob列在查询显示时,第一个字符即为60,116=3C,74
,但是,这两个字节的内容在UTF-16编码表当中不存在,结果即以㱴来显示。
究其原因,其实表B的这条信息b是以UTF-8编码方式保存的。
从表A信息a内容中的首个汉字“黑”来说,我们知道,dump1中,其对应的值是158 209,而再从dump2中分析
可以确定“黑”字其实是以UTF-8的编码E9 BB 91保存的。
那么问题的症结就在于,数据是以UTF-8方式存入,以UTF-16的方式读取,自然会乱码了。
为什么数据会以UTF-8方式存入
查看保存调用的代码,注意到PreparedStatement.setString方法的API注释写道:
void java.sql.PreparedStatement.setString(int parameterIndex, String x) throws SQLException
Sets the designated parameter to the given Java String
value. The driver converts this to an SQL
VARCHAR
or LONGVARCHAR
value (depending on the argument's size relative to the driver's limits on
VARCHAR
values) when it sends it to the database.
原来当参数String X超过限定值时,其值会被驱动转换为一个SQL VARCHAR型值,因为这里数据库字符集是AL32UTF8,而国家字符集是
AL16UTF16,最终,数据因为过长而以varchar2类型即,以AL32UTF8字符集存入数据库,也就因此产生了乱码。