UTF-8 Encoding Problems with HttpClient POST Requests



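This article explains how to fix the garbled characters that appear when Apache HttpClient sends UTF-8 data, and walks through a concrete example that solves the problem by overriding the `getRequestCharSet()` method.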

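Apache HttpClient ( http://jakarta.apache.org/commons/httpclient/ ) is a pure-Java client-side toolkit for the HTTP protocol, and its protocol support is fairly comprehensive. For more details, see the article "HttpClient入门" (an introduction to HttpClient) on the IBM website ( http://www-128.ibm.com/developerworks/cn/opensource/os-httpclient/ ).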

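Problem analysis

In practice, however, calling HttpClient in the most basic way does not send form parameters as UTF-8. The articles I found online did not get to the root of the problem, so I read through some of the commons-httpclient-3.0.1 source code. The first thing I found was the generateRequestEntity() method in PostMethod: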

/**
 * Generates a request entity from the post parameters, if present.  Calls
 * {@link EntityEnclosingMethod#generateRequestBody()} if parameters have not been set.
 *
 * @since 3.0
 */
protected RequestEntity generateRequestEntity() {
    if (!this.params.isEmpty()) {
        // Use a ByteArrayRequestEntity instead of a StringRequestEntity.
        // This is to avoid potential encoding issues.  Form url encoded strings
        // are ASCII by definition but the content type may not be.  Treating the content
        // as bytes allows us to keep the current charset without worrying about how
        // this charset will effect the encoding of the form url encoded string.
        String content = EncodingUtil.formUrlEncode(getParameters(), getRequestCharSet());
        ByteArrayRequestEntity entity = new ByteArrayRequestEntity(
            EncodingUtil.getAsciiBytes(content),
            FORM_URL_ENCODED_CONTENT_TYPE
        );
        return entity;
    } else {
        return super.generateRequestEntity();
    }
}
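To make the role of the charset argument concrete, here is a minimal sketch that uses only the JDK's `URLEncoder` (Java 10 or later) rather than HttpClient's `EncodingUtil`. The class and helper names (`FormEncodingDemo`, `encodePair`) are hypothetical, and the sketch assumes that `formUrlEncode` behaves comparably for a single pair. It shows how the same parameter is encoded under two different charsets.

```java
import java.net.URLEncoder;
import java.nio.charset.Charset;

// Hypothetical demo class; not part of commons-httpclient.
class FormEncodingDemo {

    // Form-url-encode one name=value pair using the named charset.
    static String encodePair(String name, String value, String charsetName) {
        Charset charset = Charset.forName(charsetName);
        return URLEncoder.encode(name, charset) + "=" + URLEncoder.encode(value, charset);
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is the string "中文", written with escapes so the
        // result does not depend on the source file encoding.
        String value = "\u4e2d\u6587";

        // UTF-8 keeps the characters: prints TEXT=%E4%B8%AD%E6%96%87
        System.out.println(encodePair("TEXT", value, "UTF-8"));

        // ISO-8859-1 cannot represent them, so each becomes '?': prints TEXT=%3F%3F
        System.out.println(encodePair("TEXT", value, "ISO-8859-1"));
    }
}
```

With a charset that cannot represent the characters, the information is already lost before the request leaves the client, so no server-side setting can recover it.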

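So the HTTP request parameters added as NameValuePair objects are ultimately converted into a RequestEntity, URL-encoded with the charset returned by getRequestCharSet(), and submitted to the HTTP server. Next, in EntityEnclosingMethod, the parent class of PostMethod, I found the following code: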

/**
 * Returns the request's charset.  The charset is parsed from the request entity's
 * content type, unless the content type header has been set manually.
 *
 * @see RequestEntity#getContentType()
 *
 * @since 3.0
 */
public String getRequestCharSet() {
    if (getRequestHeader("Content-Type") == null) {
        // check the content type from request entity
        // We can't call getRequestEntity() since it will probably call
        // this method.
        if (this.requestEntity != null) {
            return getContentCharSet(
                new Header("Content-Type", requestEntity.getContentType()));
        } else {
            return super.getRequestCharSet();
        }
    } else {
        return super.getRequestCharSet();
    }
}

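Solution

The two code fragments above show how HttpClient derives the request encoding (charset) from the "Content-Type" header, and how that charset is then applied when encoding the submitted content. When the form parameters are set through NameValuePair and no Content-Type header with a charset has been set, HttpClient falls back to its default charset (ISO-8859-1 in HttpClient 3.x), which cannot represent Chinese characters. Following this principle, we only need to override the getRequestCharSet() method so that it returns the name of the charset we want. That resolves the garbled text when submitting POST requests in UTF-8 or any other non-default charset.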
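The charset lookup described above can be sketched in plain Java. This is a simplified illustration, not HttpClient's actual parsing code: the class `ContentTypeCharsetDemo`, the method `charsetOf`, and the constant `DEFAULT_CHARSET` are hypothetical names, and the default of ISO-8859-1 is assumed from HttpClient 3.x's behavior.

```java
import java.util.Locale;

// Hypothetical demo class; a simplified model of the charset lookup.
class ContentTypeCharsetDemo {

    // Assumed fallback, mirroring HttpClient 3.x's default content charset.
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    // Return the charset parameter of a Content-Type value, or the default
    // when the value is null or carries no charset parameter.
    static String charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String value = p.substring("charset=".length()).trim();
                    // Strip optional surrounding double quotes.
                    if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                        value = value.substring(1, value.length() - 1);
                    }
                    if (!value.isEmpty()) {
                        return value;
                    }
                }
            }
        }
        return DEFAULT_CHARSET;
    }

    public static void main(String[] args) {
        // A plain form content type has no charset, so the default applies.
        System.out.println(charsetOf("application/x-www-form-urlencoded"));
        // An explicit charset parameter wins.
        System.out.println(charsetOf("application/x-www-form-urlencoded; charset=UTF-8"));
    }
}
```

Overriding getRequestCharSet() bypasses this lookup entirely, which is why it works regardless of what the Content-Type header contains.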

Test

First, deploy a page named test.jsp under Tomcat's ROOT web app to serve as the test page. Its main code fragment is as follows:

<%@ page contentType="text/html;charset=UTF-8"%>
<%@ page session="false" %>
<%
request.setCharacterEncoding("UTF-8");
String val = request.getParameter("TEXT");
System.out.println(">>>> The result is " + val);
%>
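The call to request.setCharacterEncoding("UTF-8") matters because the server has to decode the percent-encoded body with the same charset the client used. The following sketch imitates that decoding step with the JDK's `URLDecoder` (Java 10 or later). The class `FormDecodingDemo` and its `decode` helper are hypothetical, and a real servlet container's parameter parsing is more involved than this.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;

// Hypothetical demo class; imitates the server-side decoding step only.
class FormDecodingDemo {

    // Decode a form-url-encoded value using the named charset.
    static String decode(String encoded, String charsetName) {
        return URLDecoder.decode(encoded, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The UTF-8 percent-encoding of "中文" ("\u4e2d\u6587").
        String body = "%E4%B8%AD%E6%96%87";

        // Decoding with the matching charset restores the original two characters.
        String ok = decode(body, "UTF-8");
        System.out.println(ok.equals("\u4e2d\u6587")); // prints true

        // Decoding with ISO-8859-1 turns each of the six bytes into its own
        // character, which is the garbled text seen on the server.
        String garbled = decode(body, "ISO-8859-1");
        System.out.println(garbled.length()); // prints 6
    }
}
```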

Next, write a test class. The main code is as follows:

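import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.NameValuePair;
import org.apache.commons.httpclient.methods.PostMethod;

// The enclosing class name is illustrative; the original shows only its members.
public class Utf8PostTest {

    public static void main(String[] args) throws Exception {
        String url = "http://localhost:8080/test.jsp";
        PostMethod postMethod = new UTF8PostMethod(url);
        // Fill in the value of each form field
        NameValuePair[] data = {
            new NameValuePair("TEXT", "中文"),
        };
        // Put the form values into the request body
        postMethod.setRequestBody(data);
        // Execute the POST request
        HttpClient httpClient = new HttpClient();
        try {
            httpClient.executeMethod(postMethod);
        } finally {
            // Release the connection once the request has completed
            postMethod.releaseConnection();
        }
    }

    // Inner class for UTF-8 support
    public static class UTF8PostMethod extends PostMethod {
        public UTF8PostMethod(String url) {
            super(url);
        }

        @Override
        public String getRequestCharSet() {
            // Always encode the form parameters as UTF-8 instead of
            // deferring to super.getRequestCharSet()
            return "UTF-8";
        }
    }
}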

Run the test program, and Tomcat's console output correctly prints ">>>> The result is 中文".