My Understanding of Character Sets

1           The basics of encoding

 

1.1          ASCII

In the beginning, when computers were invented, data was stored in bytes, and each character was represented by one byte with a value from 0 to 127, so ASCII defines 128 characters in total. Western users soon realized that 128 characters could not keep up with the development of computing, so extended ASCII was published, covering 128 - 255. Please see an ASCII table for details.

1.2          ISO8859

It is similar to ASCII.

It is a single-byte encoding, so it can represent at most the range 0 - 255, and it is used for Western European languages. For example, the letter 'a' is encoded as 0x61 = 97.

Clearly, the range of characters iso8859-1 can represent is narrow; it cannot represent Chinese characters. However, because it is a single-byte encoding that matches the computer's most basic storage unit, it is still widely used, and many protocols use it as the default. For example, although the two characters "中文" have no iso8859-1 encoding, take gb2312 as an example: they are the two code units "d6d0 cec4", and when treated as iso8859-1 they are split into four bytes "d6 d0 ce c4" (in fact, storage is handled byte by byte anyway). With UTF-8 they would be the six bytes "e4 b8 ad e6 96 87". Obviously, this way of representing text still relies on another encoding underneath.

The ISO 8859 family is divided into many parts to cover different Western scripts. Please see,

ISO-8859-1 (ISO8859_1): ISO 8859-1, Latin Alphabet No. 1

ISO-8859-2 (ISO8859_2): Latin Alphabet No. 2

ISO-8859-4 (ISO8859_4): Latin Alphabet No. 4

ISO-8859-5 (ISO8859_5): Latin/Cyrillic Alphabet

ISO-8859-7 (ISO8859_7): Latin/Greek Alphabet

ISO-8859-9 (ISO8859_9): Latin Alphabet No. 5

ISO-8859-13 (ISO8859_13): Latin Alphabet No. 7

ISO-8859-15 (ISO8859_15): Latin Alphabet No. 9

 

1.3          GB2312/GBK

This is the national-standard encoding for Chinese characters, made specifically to represent them. It is a variable-length encoding: English letters are identical to ASCII (it is ASCII-compatible, using one byte in the range 0 - 127), while one Chinese character uses two bytes. In fact the first byte comes from the extended ASCII range (128 - 255), and the second byte can be any value (0 - 255). For example, "中文" is represented as "d6d0 cec4". GBK can represent both Traditional and Simplified Chinese, whereas GB2312 covers only Simplified Chinese; GBK is backward compatible with GB2312.
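As a quick check of the byte layout described above, the following minimal sketch (the class name is mine, not from the text) prints the GBK bytes of "中文" and of an ASCII letter:

```java
import java.nio.charset.Charset;

public class GbkDemo {
    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        byte[] chinese = "\u4e2d\u6587".getBytes(gbk); // "中文" -> d6 d0 ce c4
        byte[] ascii = "a".getBytes(gbk);              // 61, same as ASCII
        for (byte b : chinese) {
            // the first byte of each pair falls in the extended ASCII range (>= 0x80)
            System.out.printf("%02x ", b);
        }
        System.out.println();
        System.out.printf("%02x%n", ascii[0]);
    }
}
```

Note that GBK ships with the standard JDK but is not a `StandardCharsets` constant, hence the `Charset.forName` lookup.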

1.4          Other variant charsets

Like GB2312/GBK, other countries or regions did the same: they used the extended ASCII range to create their own encodings, for example Big5 for Traditional Chinese (used in Hong Kong, China).

1.5          Unicode

To cover the characters used in all countries, each character is represented by 2 bytes; for example, "中文" is encoded as "4e2d 6587". It is compatible with neither ASCII nor ISO 8859. By default Windows cannot display such a file, but if the byte-order mark FF FE is present it displays normally. It is, however, well suited to internal processing in the computer. Without the FF FE mark, Windows cannot display it by default; Windows only recognizes UTF-8 and GB2312 (both ASCII-compatible) on its own.

This is the most unified encoding; it can represent the characters of all languages, and it is a fixed-length two-byte encoding (there are also four-byte forms), English letters included. So it is compatible with neither iso8859-1 nor any other encoding. Compared with iso8859-1, though, unicode merely prepends a 0 byte: the letter 'a' becomes "00 61", for example.

Note that a fixed-length encoding is convenient for the computer to process (GB2312/GBK is not fixed-length), and unicode can represent all characters, so many programs use unicode internally, Java among them.

Java's default unicode form is UTF-16, with a leading byte-order mark FF FE that tells tools such as Notepad how to display the file correctly. For example, the following statements produce the same result,

       String s = "\u4e2d"; // "中"

       byte[] bytesUnicode = s.getBytes("UNICODE");

       byte[] bytesUTF16 = s.getBytes("UTF-16");

The result bytes are

FF FE 2D 4E

The Unicode BOM here is FF FE.
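The role of the FF FE mark can be sketched in a few lines: the JDK's UTF-16 decoder consumes a leading BOM and picks the byte order from it, so the bytes FF FE 2D 4E decode to the single character U+4E2D ("中"). The class name below is illustrative:

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {
    public static void main(String[] args) {
        // little-endian BOM (FF FE) followed by 2D 4E
        byte[] littleEndianWithBom = {(byte) 0xFF, (byte) 0xFE, 0x2D, 0x4E};
        String s = new String(littleEndianWithBom, StandardCharsets.UTF_16);
        // the BOM is consumed; one char remains, U+4E2D
        System.out.println(s.length());
        System.out.println(Integer.toHexString(s.charAt(0)));
    }
}
```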

1.6          UTF-8

UTF-8 is a variable-length character set. For English it is compatible with ASCII, so it represents English letters with one byte; a Chinese character takes 3 bytes. For example, "中文" is encoded as "e4b8ad e69687" in UTF-8.

Since unicode is incompatible with iso8859-1 and tends to take more space (even an English letter needs two bytes in unicode), it is inconvenient for transmission and storage. Hence the utf encodings: utf is compatible with iso8859-1 and can still represent the characters of all languages, but it is variable-length, each character taking from 1 to 6 bytes. utf also carries a simple built-in validity check. Generally, an English letter takes one byte and a Chinese character takes three.

Note that although utf exists to save space, that is only relative to unicode. If the text is known to be Chinese, GB2312/GBK is undoubtedly the most compact. On the other hand, even though utf uses 3 bytes per Chinese character, a Chinese web page in utf is still smaller than in unicode, because web pages contain many English characters.
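The space trade-off above can be sketched directly (class name is mine; this assumes the GBK charset is present, which it is in a standard JDK):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SizeDemo {
    public static void main(String[] args) {
        String cjk = "\u4e2d"; // "中"
        System.out.println(cjk.getBytes(StandardCharsets.UTF_8).length);   // 3 bytes in UTF-8
        System.out.println(cjk.getBytes(Charset.forName("GBK")).length);   // 2 bytes in GBK
        System.out.println("a".getBytes(StandardCharsets.UTF_8).length);   // 1 byte for ASCII
        System.out.println("a".getBytes(StandardCharsets.UTF_16LE).length); // 2 bytes in UTF-16
    }
}
```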

The UTF-8 BOM is EF BB BF.

2           UTF-8 files come in two formats: with BOM and without BOM

 

What is a BOM? The three bytes "EF BB BF" are the BOM, short for "Byte Order Mark". In utf8 files a BOM is commonly used to mark the file as UTF-8, but the original purpose of the BOM was to indicate byte order in utf16.

A BOM before the byte stream indicates a low-byte-first (little-endian) sequence; utf8 has no byte-order issue, so a utf8 file is actually valid with or without a BOM.

 

Microsoft Notepad, Word and the like only open UTF-8 files containing a BOM correctly, whereas UltraEdit is just the opposite: it mistakes a UTF-8 file without a BOM for ASCII.

 

The UTF-8 BOM is EF BB BF. When UltraEdit loads a UTF-8 file it converts it to Utf16, where the EF BB BF above becomes FF FE (the Unicode-LE BOM). When UltraEdit does not recognize the BOM it adds another one, so there are two FF FE marks and the file is corrupted.

 

3           The Windows operating system

For each API function that takes text, Windows provides two versions: one for the platform-dependent (ANSI) charset and one for Unicode, for example SetWindowTextA/SetWindowTextW. Unicode suits internal processing because it is a fixed-length charset, so inside a program Unicode is normally used, while the platform-dependent version is used when displaying text. Windows supports converting a platform-dependent charset to Unicode and vice versa with MultiByteToWideChar/WideCharToMultiByte.

 

But applications differ in the charsets they support. Some support only the platform-dependent charset; others support Unicode, UTF-8, ISO 8859 and so on.

On a Chinese operating system, some programs can process both GB2312 and UTF-8.

 

3.1          How do some Windows applications process charsets?

3.1.1     Notepad

It can process GB2312 and UTF-8. If you create a binary file whose content is d6d0 (GBK) or e4b8ad (UTF-8), Notepad will display "中".

 

3.1.2     Cmd

It can process only GBK. Go to the system menu > Properties > Current Code Page; it is GBK. Only GBK text can be processed in cmd/mysql.

 

3.1.3     UltraEdit

It processes GB2312 by default, but advanced configuration is available under Advanced -> Configuration -> File Handling -> Unicode/UTF-8 Detection, Advanced -> Configuration -> File Handling -> Code Page Selection, Advanced -> Set Code Page/Locale, and View -> Set Code Page.

 

3.2          The code conversion tool

Word: when you save a Word file and choose plain text as the file type, you are prompted to select a charset encoding.

UltraEdit: go to File -> Conversions; several conversions are supported. Please see the screenshot.

   

 

It supports the following charset conversions; please refer to the help document in UltraEdit (focus the menu item, then press F1).

3.2.1     To Unicode

ASCII to Unicode, platform dependent charset(GB) to Unicode

UTF-8 to Unicode, UTF to Unicode

 

3.2.2     To platform dependent(GB)

Unicode to ASCII, Unicode to platform dependent(GB)

UTF-8 to ASCII, UTF-8 to platform dependent(GB)

 

3.2.3     To UTF-8

ASCII to UTF-8(Unicode Editing)

UNICODE/UTF-8 to ASCII to UTF-8(Unicode Editing)

UNICODE/ASCII/UTF-8 to UTF-8(Unicode Editing)

 

 

3.3          The architecture of charset processing. We can see that each application decides which charsets it can process, because the application converts any charset to Unicode internally, processes it, and then outputs the platform-dependent charset or Unicode to the OS.

    

[Diagram: applications (supporting Unicode, UTF-8, ISO-8859-1, GB, ...) run on top of the OS; the Windows APIs support both Unicode and the platform-dependent charset (SetWindowTextA/SetWindowTextW), and MultiByteToWideChar/WideCharToMultiByte convert between them.]

4          How Java handles characters

In a Java application, charset encodings come into play in several places; some need correct settings, others need explicit handling. Remember: internally, a Java String is always stored as Unicode.

4.1       getBytes(charset)

This is a standard Java string method. It encodes the string's characters with charset and returns them as bytes. Note that a string in Java memory is always stored as unicode; for example, "中文" is normally (i.e. when nothing has gone wrong) stored as "4e2d 6587". If charset is "gbk" it is encoded to "d6d0 cec4" and the bytes "d6 d0 ce c4" are returned; with "utf8" the result is "e4 b8 ad e6 96 87"; with "iso8859-1" the characters cannot be encoded and "3f 3f" (two question marks) is returned.

4.2       new String(bytes, charset)

This is the opposite standard Java string function: it interprets the byte array according to charset and converts it to unicode for internal storage. Referring to the getBytes examples above, both "gbk" and "utf8" recover the correct "4e2d 6587", but iso8859-1 ends up with "003f 003f" (two question marks).

Because utf8 can represent/encode every character, new String(str.getBytes("utf8"), "utf8") equals str; the conversion is fully reversible.
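The reversibility claim, and its failure under ISO-8859-1, can be sketched in a few lines (class name is mine):

```java
import java.nio.charset.StandardCharsets;

public class RoundTripDemo {
    public static void main(String[] args) {
        String s = "\u4e2d\u6587abc"; // "中文abc"

        // UTF-8 can encode every character, so the round trip is lossless
        String utf8Back = new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
        System.out.println(utf8Back.equals(s)); // true

        // ISO-8859-1 cannot encode CJK characters; they become '?' (0x3F)
        byte[] iso = s.getBytes(StandardCharsets.ISO_8859_1);
        System.out.printf("%02x %02x%n", iso[0], iso[1]); // 3f 3f
    }
}
```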

 

Please see the processing examples,

package com;

 

import java.io.UnsupportedEncodingException;

 

public class Test {

   public static String bytesToHexString(byte[] bytes) {
       if (bytes == null) {
          return "";
       }

       StringBuffer sb = new StringBuffer();
       for (int i = 0; i < bytes.length; i++) {
          // Mask to 0-255 and pad to two digits; the original
          // Integer.toHexString + substring trick throws for values below 0x10.
          sb.append(String.format("%02x", bytes[i] & 0xFF));
       }

       return sb.toString();
   }

 

   public static void main(String[] args) throws UnsupportedEncodingException {

       // A string holding the Unicode character 0x4e2d ("中"), which Windows
       // Notepad cannot display by default as raw Unicode bytes

       char cOne = 0x4e2d;

       String sTarget = new String(new char[] { cOne });

 

       // println converts the unicode in the String to the platform default
       // charset (GB) for output, so it prints (GB|D6D0), which Windows can
       // display by default

       System.out.println(sTarget);

 

       // The string to bytes

 

       // (Unicode [UTF-16; fffe is the BOM] | fffe2d4e), which Windows cannot
       // display by default

       System.out.println(bytesToHexString(sTarget.getBytes("unicode")));

 

       // (GB|D6D0), that can be shown in windows notepad

       System.out.println(bytesToHexString(sTarget.getBytes(/* GB2312 */)));

 

       // (UTF-8|e4b8ad), that can be shown in windows notepad

       System.out.println(bytesToHexString(sTarget.getBytes("UTF-8")));

 

       // ?(ISO8859|3F): no matching character exists in ISO8859-1, so it is
       // replaced by a ?

       System.out.println(bytesToHexString(sTarget.getBytes("ISO8859-1")));

 

       // Bytes to String

      

       // unicode bytes

       byte[] baUnicode = new byte[] { (byte) 0xff, (byte) 0xfe, 0x2d, 0x4e };

       String sUnicode = new String(baUnicode, "Unicode");

       // Show Chinese char

       System.out.println(sUnicode);

 

       // gb bytes

       byte[] baGb = new byte[] { (byte) 0xd6, (byte) 0xd0 };

       String sGb = new String(baGb/* , GB2312 */);

       // Show Chinese char

       System.out.println(sGb);

 

       // utf-8 bytes

       byte[] baUtf = new byte[] { (byte) 0xe4, (byte) 0xb8, (byte) 0xad };

       String sUtf = new String(baUtf, "UTF-8");

       // Show Chinese char

       System.out.println(sUtf);

 

       // the same gb bytes, this time wrongly decoded as ISO8859-1

       byte[] baIso = new byte[] { (byte) 0xd6, (byte) 0xd0 };

       String sIso = new String(baIso, "ISO8859-1");

       // d6 d0 decode as Latin characters here, not the intended Chinese

       System.out.println(sIso);

   }

}

5           If you create the MySQL database with the charset ISO8859-1, consider the following,

 

5.1          If you use the mysql command line to execute SQL, Chinese characters appear to be processed normally. But the mysql command line runs in cmd, and via the system menu > Properties > Code Page you will find it supports only GBK, so only GBK text can be displayed.

 

mysql> insert into EMP values('中文');

mysql> select * from emp;

+------+

| NAME |

+------+

| 中文 |

+------+

1 row in set (0.00 sec)

 

Why can the Chinese text survive the round trip on a Chinese-version OS?

   Because when insert into EMP values('中文'); is run, it behaves like this,

     // The mysql client does roughly the following

String s = new String("中文" /*, "GB2312" */);

byte[] abRawGb = s.getBytes("GB2312");

String sFakeIso = new String(abRawGb, "ISO8859-1");

 

// The database server stores the ISO string (in fact raw GB bytes)

              prepareStatement.setString(1, sFakeIso);

            So the fake-ISO string (in fact the raw bytes of the GB characters) is saved in mysql.

 

            While select * from EMP; is run, it behaves like this,

              // The database server returns the ISO string (in fact raw GB bytes) to the client

              String sFakeIso = new String(fakeIsoBytes /* raw GB bytes */, "ISO-8859-1");

              // The mysql client converts it back to a GB string to show the user

              byte[] abRawGb = sFakeIso.getBytes("ISO-8859-1");

              String sGb = new String(abRawGb, "GB2312");

              The client then shows sGb.
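The round trip above can be reproduced without a database. This sketch (names are mine) relies on ISO-8859-1 mapping all 256 byte values one-to-one, so a String decoded with it preserves the raw bytes intact:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FakeIsoDemo {
    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");

        String original = "\u4e2d\u6587"; // "中文"
        byte[] rawGb = original.getBytes(gbk); // d6 d0 ce c4
        // every byte has an ISO-8859-1 character, so nothing is lost here
        String fakeIso = new String(rawGb, StandardCharsets.ISO_8859_1); // what gets "stored"

        // later, on the way back out:
        byte[] rawAgain = fakeIso.getBytes(StandardCharsets.ISO_8859_1);
        String restored = new String(rawAgain, gbk);
        System.out.println(restored.equals(original)); // true
    }
}
```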

 

5.2          If you use JDBC directly like this, it will not work; "?" will be shown.

The JDBC driver tries to convert the Unicode in the String to ISO8859-1, but the Chinese characters are not defined in ISO8859-1, so "?" is saved, and when the query runs it returns "?".
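The failure can be shown without JDBC at all (class name is mine): encoding to ISO-8859-1 replaces each CJK character with '?' before anything reaches the database, so the data is already lost.

```java
import java.nio.charset.StandardCharsets;

public class QuestionMarkDemo {
    public static void main(String[] args) {
        // what the driver would send: ISO-8859-1 has no mapping for CJK
        byte[] stored = "\u4e2d\u6587".getBytes(StandardCharsets.ISO_8859_1); // 3f 3f
        String readBack = new String(stored, StandardCharsets.ISO_8859_1);
        System.out.println(readBack); // ?? -- no conversion can recover the original
    }
}
```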

  

package com;

 

import java.sql.Connection;

import java.sql.DriverManager;

import java.sql.PreparedStatement;

import java.sql.ResultSet;

 

public class TestDatabase {

 

       /**

        * @param args

        */

       public static void main(String[] args) {

 

              try {

                     Class.forName("com.mysql.jdbc.Driver").newInstance();

 

                     Connection conn = DriverManager.getConnection(

                                   "jdbc:mysql://localhost:3306/test" + "?useUnicode=true",

                                   "root", "root");

 

                     PreparedStatement ps = conn

                                    .prepareStatement("insert into EMP values('中文')");

                     ps.executeUpdate();

                     ps.close();

 

                     ps = conn.prepareStatement("select * from EMP");

                     ResultSet rs = ps.executeQuery();

                     while (rs.next()) {

                            System.out.println(rs.getString(1));

                     }

                     rs.close();

                     ps.close();

                     conn.close();

              } catch (Exception e) {

                     e.printStackTrace();

              }

       }

 

}

 

5.3          The following code runs successfully and the Chinese characters are shown, because we save the raw GB bytes into the ISO database, then load them and convert them back to a GB string. In fact we could use the raw bytes of any charset, for example UTF-8 or Unicode.

package com;

 

import java.sql.Connection;

import java.sql.DriverManager;

import java.sql.PreparedStatement;

import java.sql.ResultSet;

 

public class TestDatabase2 {

 

     public static void main(String[] args) {

            try {

                   Class.forName("com.mysql.jdbc.Driver").newInstance();

 

                   Connection conn = DriverManager.getConnection(

                                 "jdbc:mysql://localhost:3306/test", "root", "root");

 

                    String sValue = "中文"; // stored as unicode in the JVM

                  

                   byte[] abRawGb = sValue.getBytes("GB2312");

                   String sFakeIso = new String(abRawGb, "ISO8859-1");

                   PreparedStatement ps = conn

                                 .prepareStatement("insert into EMP values(?)");

                   ps.setString(1, sFakeIso);

                   ps.executeUpdate();

                   ps.close();

 

                   ps = conn.prepareStatement("select * from EMP");

                   ResultSet rs = ps.executeQuery();

                   while (rs.next()) {

                          sFakeIso = rs.getString(1);

                          abRawGb = sFakeIso.getBytes("ISO8859-1");

                          sValue = new String(abRawGb, "GB2312");

                          System.out.println(sValue);

                   }

                   rs.close();

                   ps.close();

                   conn.close();

            } catch (Exception e) {

                   e.printStackTrace();

            }

     }

}

 

6           If you create the MySQL database with the charset UTF-8, everything is fine. The JDBC driver converts the Unicode in a Java String to UTF-8 when saving, and converts UTF-8 back to a Unicode Java String when loading. The following code works correctly. But with the database in UTF-8, the mysql command line still cannot process the Chinese charset; only GBK can be processed on the command line.

package com;

 

import java.sql.Connection;

import java.sql.DriverManager;

import java.sql.PreparedStatement;

import java.sql.ResultSet;

 

public class TestDatabase {

 

     /**

      * @param args

      */

     public static void main(String[] args) {

 

            try {

                   Class.forName("com.mysql.jdbc.Driver").newInstance();

 

                   Connection conn = DriverManager.getConnection(

                                 "jdbc:mysql://localhost:3306/test",

                                 "root", "root");

                   //?useUnicode=true&characterEncoding=gb2312

 

 

 

                   PreparedStatement ps = conn

                                 .prepareStatement("insert into EMP values(?)");

                    ps.setString(1, new String("中文"));

                   ps.executeUpdate();

                   ps.close();

 

                   ps = conn.prepareStatement("select * from EMP");

                   ResultSet rs = ps.executeQuery();

                   while (rs.next()) {

                          System.out.println(rs.getString(1));

                   }

                   rs.close();

                   ps.close();

                   conn.close();

            } catch (Exception e) {

                   e.printStackTrace();

            }

     }

}

 

7           Charset encoding in web applications.

7.1          <%@pageEncoding="UTF-8"%>  (contentType="text/html; charset=UTF-8")

 

The pageEncoding attribute defines the character encoding of the JSP page, i.e. what format the JSP file itself is in; the servlet container parses the JSP file with that encoding. The default is the value given in the contentType attribute, or "ISO-8859-1" if none was specified there.

 

It should match the actual format of the JSP file itself; otherwise garbage characters will be shown. You can convert the format of a JSP file with UltraEdit, Eclipse or Word.

 

The JSP is compiled to Java source code in UTF-8 regardless of whether the file is GBK with pageEncoding="GBK" or UTF-8 with pageEncoding="UTF-8".

 

So the order of precedence for pageEncoding is as follows,

1.         The value of the pageEncoding attribute.

2.         The value of the contentType attribute.

3.         The default ISO-8859-1.

        

7.2          The javac compiler has an -encoding option specifying the format of the Java source file; if you do not specify it, the platform default charset (GBK on Chinese Windows) is used. Since the servlet container writes the generated servlet source in UTF-8, it compiles the servlet like this:

javac -encoding UTF-8 YourServlet.java. The resulting class file stores its string data as UTF-8 as well.

 

7.3          <%@ page contentType="text/html; charset= UTF-8" %>

response.setCharacterEncoding("UTF-8")

< META http-equiv="Content-Type" content="text/html; charset=UTF-8" />

 

The contentType attribute sets the MIME type and the character set of the response. The default value is text/html for JSP pages in standard syntax and text/xml for JSP documents in XML format.

 

This specifies the encoding used when the file is sent to the browser, and it should also be placed at the top of the file, e.g. <%@ page contentType="text/html; charset=UTF-8" %>. It is equivalent to response.setCharacterEncoding("UTF-8").

 

When the browser renders the page, it first uses the encoding specified in the response (the contentType in the JSP header ultimately ends up on the response); if none is specified it uses the contentType from the page's meta tag. If both the JSP output and the meta tag specify an encoding, the JSP one wins, because it is carried directly in the response.

 

It means out.print(...) in the JSP converts the Java String (unicode) to UTF-8, sends the UTF-8 bytes to the client, and tells the client that the response charset is UTF-8; the client then processes it as UTF-8 and renders the page.

 

7.4          Getting data from the client with the GET and POST methods,

7.4.1     How does the browser submit data?

- When the browser submits a form, an encoding can be specified, e.g. <form accept-charset="UTF-8">. This is normally unnecessary; the browser uses the page's own encoding (the contentType set in the JSP file).

- In the browser, go to Tools/Internet Options/Advanced and tick "Always send URLs as UTF-8".

 

7.4.2     How the web container processes the GET and POST parameters. There are three ways to avoid garbled Chinese.

 

7.4.2.1    Convert the fake ISO-8859-1 string to a UTF-8 charset string. This method solves the charset problem for both GET and POST at once.

 

Because the web container resolves the parameter as ISO-8859-1 characters (wrapping the raw UTF-8 bytes the browser sent), we can convert it to UTF-8,

 

  public static final String convertStringCharset(String target, String  oldCharset, String newCharset) throws UnsupportedEncodingException{

    byte[] abRawData = target.getBytes(oldCharset);

    return new String(abRawData, newCharset);

  }

 

  public static final String getUtf8Parameter(HttpServletRequest request, String  name) throws UnsupportedEncodingException{

    String sPara = request.getParameter(name); // ISO-8859-1 chars wrapping the raw UTF-8 bytes

    if (sPara == null || sPara.equals("")) {

      return "";

    }

    return convertStringCharset(sPara, "ISO8859-1", "UTF-8");

  }
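The helper above can be exercised without a servlet container. This sketch (names are mine) simulates the container's behavior: the browser sends UTF-8 bytes, the container mis-decodes them as ISO-8859-1, and the re-encode/re-decode pair recovers the original text:

```java
import java.nio.charset.StandardCharsets;

public class ParamFixDemo {
    public static void main(String[] args) {
        // what the browser sent on the wire (page charset is UTF-8)
        byte[] wire = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        // what the container hands to getParameter: same bytes, wrong decoding
        String fromContainer = new String(wire, StandardCharsets.ISO_8859_1);

        // the fix: recover the raw bytes, then decode them properly
        String fixed = new String(fromContainer.getBytes(StandardCharsets.ISO_8859_1),
                                  StandardCharsets.UTF_8);
        System.out.println(fixed.equals("\u4e2d\u6587")); // true
    }
}
```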

 

The following code runs successfully; it is an example.

 

<%@ page language="java"

         import = "java.io.*"

         pageEncoding="GB2312"

         contentType="text/html; charset=UTF-8"

%>

<%!

  public static final String convertStringCharset(String target, String  oldCharset, String newCharset) throws UnsupportedEncodingException{

    byte[] abRawData = target.getBytes(oldCharset);

    return new String(abRawData, newCharset);

  }

 

  public static final String getUtf8Parameter(HttpServletRequest request, String  name) throws UnsupportedEncodingException{

    String sPara = request.getParameter(name); // ISO-8859-1 chars wrapping the raw UTF-8 bytes

    if (sPara == null || sPara.equals("")) {

      return "";

    }

    return convertStringCharset(sPara, "ISO8859-1", "UTF-8");

  }

 

%>

<%

  FileOutputStream fos = new FileOutputStream("D://test.txt", false);

  OutputStreamWriter osw = new OutputStreamWriter(fos);

  BufferedWriter bw = new BufferedWriter(osw);

  try {

    String sName = getUtf8Parameter(request, "FIELD");

    if (sName != null) {

      bw.write(sName);

    }

  } finally {

    bw.close();

  }

%>

 

<html>

  <head>

  </head>

  <body>

    <form name="MAINFORM" method="post">

      <input type="text" name="FIELD" value="">

      <input type="submit" value="提交">

    </form>

  </body>

</html>

 

7.4.2.2    Set URIEncoding="UTF-8" in Tomcat's server.xml for the GET method, and set the charset on the request for the POST method before any parameter is read from the request.

The web container resolves POST data as ISO-8859-1 by default, so to process the data correctly you must use one of the following,

 

Call request.setCharacterEncoding("UTF-8") before you call request.getParameter(...). Note that setCharacterEncoding must be called first, or it has no effect. Also, request.setCharacterEncoding affects POST data only, not GET data: GET data is decoded at the very start, before the JSP gets control.

 

<%@ page language="java"

         import = "java.io.*"

         pageEncoding="GB2312"

         contentType="text/html; charset=UTF-8"

%>

 

<%

  FileOutputStream fos = new FileOutputStream("D://test.txt", false);

  OutputStreamWriter osw = new OutputStreamWriter(fos);

  BufferedWriter bw = new BufferedWriter(osw);

  try {

    request.setCharacterEncoding("UTF-8");

    String sName = request.getParameter("FIELD");

    if (sName != null) {

      bw.write(sName);

    }

  } finally {

    bw.close();

  }

%>

 

<html>

  <head>

  </head>

  <body>

    <form name="MAINFORM" method="post">

      <input type="text" name="FIELD" value="">

      <input type="submit" value="提交">

    </form>

  </body>

</html>

 

7.4.2.3    As in the previous approach, set URIEncoding="UTF-8" in Tomcat's server.xml for the GET method, but set the charset on the request in a Filter for the POST method.

A filter can be added to set the charset before the request is used to get parameter.

1)        First, implement a Filter like this,

 

package filters;

 

import java.io.IOException;

import javax.servlet.Filter;

import javax.servlet.FilterChain;

import javax.servlet.FilterConfig;

import javax.servlet.ServletException;

import javax.servlet.ServletRequest;

import javax.servlet.ServletResponse;

import javax.servlet.UnavailableException;

 

public class SetCharacterEncodingFilter implements Filter {

    protected String encoding = null;

    protected FilterConfig filterConfig = null;

    protected boolean ignore = true;

   

    public void destroy() {

        this.encoding = null;

        this.filterConfig = null;

    }

   

    public void doFilter(ServletRequest request, ServletResponse response,

                         FilterChain chain)

    throws IOException, ServletException {

        // Conditionally select and set the character encoding to be used

        if (ignore || (request.getCharacterEncoding() == null)) {

            String encoding = selectEncoding(request);

            if (encoding != null)

                request.setCharacterEncoding(encoding);

        }

          // Pass control on to the next filter

        chain.doFilter(request, response);

    }

 

    public void init(FilterConfig filterConfig) throws ServletException {

 

          this.filterConfig = filterConfig;

        this.encoding = filterConfig.getInitParameter("encoding");

        String value = filterConfig.getInitParameter("ignore");

        if (value == null)

            this.ignore = true;

        else if (value.equalsIgnoreCase("true"))

            this.ignore = true;

        else if (value.equalsIgnoreCase("yes"))

            this.ignore = true;

        else

            this.ignore = false;

    }

 

    protected String selectEncoding(ServletRequest request) {

        return (this.encoding);

    }

}

 

2)        Then configure the Filter in web.xml.

  <filter>  

    <filter-name>SetCharacterEncoding</filter-name>  

    <filter-class>filters.SetCharacterEncodingFilter</filter-class>  

    <init-param>  

      <param-name>encoding</param-name>  

      <param-value>UTF-8</param-value>  

    </init-param>  

  </filter>  

 

  <filter-mapping>  

    <filter-name>SetCharacterEncoding</filter-name>  

    <url-pattern>/*</url-pattern>  

</filter-mapping>  

 

3)        Inside the page itself, no extra processing is needed; just read the parameter data from the request.

<%@ page language="java"

         import = "java.io.*"

         pageEncoding="GB2312"

         contentType="text/html; charset=UTF-8"

%>

<%

  FileOutputStream fos = new FileOutputStream("D://test.txt", false);

  OutputStreamWriter osw = new OutputStreamWriter(fos);

  BufferedWriter bw = new BufferedWriter(osw);

  try {

    String sName = request.getParameter("FIELD");

    if (sName != null) {

      bw.write(sName);

    }

  } finally {

    bw.close();

  }

%>

 

<html>

  <head>

  </head>

  <body>

    <form name="MAINFORM" method="post">

      <input type="text" name="FIELD" value="">

      <input type="submit" value="提交">

    </form>

  </body>

</html>

 

8           What is the whole solution for the Chinese charset?

1.         Make the database store UTF-8; the JDBC driver then converts the charset automatically.

2.         Convert the JSP file to UTF-8 format with Word, Eclipse or UltraEdit.

3.         Add pageEncoding="UTF-8" in the JSP page directive.

4.         Add contentType="text/html; charset=UTF-8" in the JSP page directive.

5.         Tick the "Always send URLs as UTF-8" option in IE.

6.         Finally, apply one of the following approaches in the web container,

Ø         Write a converter that turns the ISO8859-1 parameters into UTF-8 strings. Please refer to 7.4.2.1.

Ø         In Tomcat's server.xml, add the attribute URIEncoding="UTF-8" to the Connector element.

Ø         Call request.setCharacterEncoding("UTF-8") before any parameter is read. Please refer to 7.4.2.2.

Ø         Call request.setCharacterEncoding("UTF-8") in a filter. Please refer to 7.4.2.3.

 

9           Why do some characters become question marks?

 

In today's world, companies keep growing, software companies included, so development spread across multiple countries is very common.

A program may be developed at a Western company and then handed over to a Chinese team, so problems with charset hand-over are bound to occur.

Sometimes, when files from the West are edited in the East, some characters become question marks. Why?

 

First, we should know that the default character set of Western countries is ISO8859-1, while Eastern countries (China) use a multi-byte character set (GB2312).

 

Please see readme(UTF-8).txt for encoding details at first.

 

So a file created in a Western country is usually in ISO8859-1 format. Take ISO8859-1_character.txt as an example: open it with IE or Firefox, then click View -> Character Encoding -> ISO8859-1. This is the view the Western author intended; the ISO8859-1 (French) text renders properly. It looks like,

 

USU~Mise à jour réussie!

 

Then the file is shipped to an Eastern country and opened with Notepad or UltraEdit. Notepad tries GB2312 first, then UTF-8, so it displays the text as GB2312, like this,

 

USU~Mise ?jour rssie!

 

That is to say, Notepad sees the bytes E0 20, whose first byte has its high bit set, so it tries to interpret them as a GB2312 Chinese character; no character E0 20 exists, so it shows a ? in that place. Then Notepad sees E9 75 and tries the same thing; this time it succeeds (it is a valid Chinese character), so that one displays "normally".

 

Then, when the user saves the file, the ? is written back as the byte 3F, while the "Chinese character" is written back as E9 75.

 

So the à and the following space are replaced with ?. For example, in the file ISO8859-1_character(Open in notepad or UE and Save).txt, even when IE or Firefox opens it with the character set forced to ISO8859-1, the ? remains a ?.

 

This is why question marks appear after a Chinese user updates the file on a Chinese Windows system.

 

Now that we know the cause, we can fix it. There are many ways; any one of the following is feasible.

 

1.         Use a pure US-English operating system: go to Control Panel -> Region and Language and set every option to US English.

 

2.         When the file is opened with EditPlus, it shows a prompt,

         

 

Then select Yes, choose ISO, and edit it.

 

3.         In Eclipse, right-click the file, open Properties, and select ISO-8859-1 to display and edit it.

 

4.         Open it with Word. If the file contains many ISO8859-1 characters (not always), Word asks which encoding to use; select one of the ISO variants. When saving as plain text, you again have the chance to choose the character encoding.

 

5.         Convert ISO8859-1 to UTF-8 with Java, edit the file, then convert it back to ISO8859-1. The following Java methods are available.

1)        String(byte[] bytes, String charsetName)

2)        public byte[] getBytes(String charsetName)

 

1)        public InputStreamReader(InputStream in,Charset cs)

2)        public int read(char[] cbuf, int offset, int length)

3)        java.io.OutputStreamWriter(java.io.OutputStream, java.nio.charset.Charset)

4)        public void write(char[] cbuf, int off, int len)

 

 

6.         (Read-only) Open the file with Firefox and view the page source; View -> Character Encoding can render ISO8859-1, but you cannot edit it.

 

7.         In JBuilder, to change the charset encoding of one file you must update all the files in the project.

 

8.         UltraEdit does not support ISO-8859-1 directly, but it can convert between many charset encodings: go to File -> Conversions.

 

9.         TextPad 4.7.3 supports only GB2312 and Western, and I do not think it works.

 

10.     In Windows Notepad, Format -> Font lets you pick a script, but I do not think that helps either.

 

10       A problem when using JSTL.

I read a lot online about multi-language support; here I discuss JSTL's support for it. A solution found online (later shown to be not entirely correct) goes as follows,

1. Convert messages_zh_CN.properties to UTF-8.

2. Use the JSP directive contentType="text/html;charset=UTF-8".

3. Use tags such as <fmt:message key="reload"/>.

 

After following these steps, I found a problem: the Chinese in the JSP itself (already in UTF-8 format) displayed correctly on the client, but the Chinese in messages_zh_CN.properties did not.

 

Then I found the source code of standard.jar, read it, and it seems it simply cannot support Chinese this way. The reasoning:

1. See the following code, from the compiled servlet and from MessageSupport.java (some steps omitted for brevity),

          response.setContentType("text/html;charset=UTF-8");

         ResourceBundle bundle = locCtxt.getResourceBundle();

         message = bundle.getString(key);

       pageContext.getOut().print(message);

2. That is, UTF-8 data in messages.properties is read into the program through ResourceBundle and decoded as ISO8859-1 characters; those characters are then re-encoded as UTF-8 and sent to the client, so the client sees mojibake, for example,

中国 (UTF-8) --->> 中å?½ (ISO8859-1) --->> 中å? (UTF-8)

 

The fix should be to make jstl parse the resource file with the correct encoding, UTF-8 in this example.

 

My questions were,

1. How do I tell jstl to read this resource file as UTF-8? I checked all the tags; there is no such attribute.

2. jstl's internationalization is based on ResourceBundle, and ResourceBundle is based on a Stream, so it cannot specify a character encoding either.

3. Why isn't ResourceBundle based on Reader and Writer? Then an encoding could be specified at creation time to read data in encodings other than just ISO8859-1.

4. If they really do not support Chinese natively, what is a good workaround? Must we really convert like this,

            Every value fetched from the ResourceBundle would need this conversion,

                String sValue = resourceBundle.getString("key");

                sValue = new String(sValue.getBytes("ISO8859-1"), "UTF-8");

 

            And for jstl we would need to customize a tag to add the code above, updating MessageSupport.doEndTag() as follows,

                     message = bundle.getString(key);

                     message = new String(message.getBytes("ISO8859-1"), "UTF-8");

 

Those were my doubts about JSTL's character-encoding handling; below are the solutions I arrived at later.

 

1.         First, string literals in Java source files are assumed to be in the system default encoding. For example, on an OS whose language is Chinese,

String s = "中国";

is saved in GB2312, and at compile time javac processes the characters as GB2312 by default. The encoding can be specified manually with a compiler option,

              javac -encoding GB2312

How, then, do we put characters of another language into the string? We can use unicode escapes. For example,

            String s = "\u4e2d\u56fd";

When the compiler sees the escape \u it knows a unicode code point follows, so it reads the corresponding unicode character; this is how unicode is supported.

 

2.         For a ResourceBundle the default character encoding is ISO8859-1, and this cannot be changed, because ResourceBundle takes no parameter to specify an encoding. The escape technique above applies to ResourceBundle as well, for example,

                     heading=\u4e2d\u56fd

              This way ResourceBundle knows how to read the unicode escapes, and the two characters "中国" are processed correctly.

              Jstl's internationalization is based on ResourceBundle and supports other languages the same way.
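Why escaped files work regardless of their byte encoding can be shown with java.util.Properties, which backs ResourceBundle and interprets \uXXXX escapes while reading (class name and the in-memory reader are mine, standing in for a real .properties file):

```java
import java.io.StringReader;
import java.util.Properties;

public class EscapeDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // "\\u4e2d\\u56fd" in source is the literal text \u4e2d\u56fd,
        // which Properties decodes to the two characters of "中国"
        props.load(new StringReader("heading=\\u4e2d\\u56fd"));
        String heading = props.getProperty("heading");
        System.out.println(heading.charAt(0) == '\u4e2d'); // true
        System.out.println(heading.charAt(1) == '\u56fd'); // true
    }
}
```

Since the file on disk then contains only ASCII, no tool in the pipeline can mangle it.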

 

3.         Nobody can remember the unicode code point of every character, so the Sun JDK ships a conversion tool, native2ascii.exe, which converts a file with native-encoded characters (characters which are non-Latin 1 and non-Unicode) to one with Unicode-encoded characters. Usage:

native2ascii [options] [inputfile [outputfile]]

-reverse

Perform the reverse operation: convert a file with Latin-1 and/or Unicode encoded characters to one with native-encoded characters.

-encoding encoding_name

Specify the encoding name used by the conversion procedure. The default encoding is taken from the system property file.encoding. The encoding_name string must be taken from the first column of the table of supported encodings in the Supported Encodings document.

-Joption

Pass option to the Java virtual machine, where option is one of the options described on the reference page for the java application launcher. For example, -J-Xms48m sets the startup memory to 48 megabytes.

       For example, native2ascii -encoding UTF-8 messages_zh_CN.properties new_messages_zh_CN.properties

             native2ascii -encoding GB2312 messages_zh_CN.properties new_messages_zh_CN.properties

 

4.         The final solution should use the technique above to support internationalization, as follows,

1. Convert messages_zh_CN.properties to unicode escapes with native2ascii. If the properties file was edited on an OS whose region and language are Chinese, run native2ascii -encoding GB2312 messages_zh_CN.properties new_messages_zh_CN.properties; if the properties file is encoded as UTF-8, run native2ascii -encoding UTF-8 messages_zh_CN.properties new_messages_zh_CN.properties.

heading=中国

will be converted to

       heading=\u4e2d\u56fd

 

2. Use the JSP directive contentType="text/html;charset=UTF-8".

 

3. Use tags such as <fmt:message key="reload"/>.

 


 

 

 

 

 
