An In-Depth Analysis of Garbled Text in Node.js Crawlers

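This article examines three causes of garbled web page text: mismatched charset declarations, compressed content-codings, and servers that do not follow the standard. It also describes how to handle each of them.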

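The previous article described three kinds of garbled text that could not yet be resolved. This article looks at the cause behind each of them.

1. The charset declared in the page source differs from the charset seen in the captured HTTP response. This may be deliberate, for example as an anti-crawler measure, or it may simply be a mistake made when the server was configured.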
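
A mismatch of this kind can be detected before decoding. The sketch below is only an illustration: the helper names (`charsetFromHeader`, `charsetFromMeta`, `compareCharsets`) are hypothetical and not from the original article, and it assumes the response body is already fully buffered.

```javascript
// Hypothetical helpers; the names are illustrative, not from the original article.

// Pull the charset parameter out of a Content-Type header value.
function charsetFromHeader(contentType) {
  const m = /charset\s*=\s*"?([^\s;"]+)/i.exec(contentType || '');
  return m ? m[1].toLowerCase() : null;
}

// Look for <meta charset="..."> or
// <meta http-equiv="Content-Type" content="...; charset=..."> near the top of
// the body. The bytes are read as latin1 so the ASCII markup survives whatever
// the real charset is.
function charsetFromMeta(bodyBuffer) {
  const head = bodyBuffer.subarray(0, 4096).toString('latin1');
  const m = /<meta[^>]*charset\s*=\s*["']?([^\s"'>;\/]+)/i.exec(head);
  return m ? m[1].toLowerCase() : null;
}

// Report both declarations and whether they disagree.
function compareCharsets(contentType, bodyBuffer) {
  const header = charsetFromHeader(contentType);
  const meta = charsetFromMeta(bodyBuffer);
  return { header, meta, mismatch: Boolean(header && meta && header !== meta) };
}
```

When `mismatch` is true, the crawler has to choose which declaration to trust; which one is right depends on the site, so this sketch only reports the disagreement.
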
2. Problems caused by a Content-Encoding field of gzip:

During content negotiation the client sends an Accept-Encoding field to the server. RFC 2616 defines the field as follows (excerpt):

   Examples of its use are:
   Accept-Encoding: compress, gzip
    Accept-Encoding:
    Accept-Encoding: *
    Accept-Encoding: compress;q=0.5, gzip;q=1.0
    Accept-Encoding: gzip;q=1.0, identity; q=0.5, *;q=0
   A server tests whether a content-coding is acceptable, according to an Accept-Encoding field, using these rules:
    1. If the content-coding is one of the content-codings listed in the Accept-Encoding field, then it is acceptable, unless it is accompanied by a qvalue of 0. (As defined in section 3.9, a qvalue of 0 means “not acceptable.”)
   2. The special “*” symbol in an Accept-Encoding field matches any available content-coding not explicitly listed in the header field.
   3. If multiple content-codings are acceptable, then the acceptable content-coding with the highest non-zero qvalue is preferred.
   4. The “identity” content-coding is always acceptable, unless specifically refused because the Accept-Encoding field includes “identity;q=0”, or because the field includes “*;q=0” and does not explicitly include the “identity” content-coding. If the Accept-Encoding field-value is empty, then only the “identity” encoding is acceptable.
   If an Accept-Encoding field is present in a request, and if the server cannot send a response which is acceptable according to the Accept-Encoding header, then the server SHOULD send an error response with the 406 (Not Acceptable) status code.
   If no Accept-Encoding field is present in a request, the server MAY assume that the client will accept any content coding. In this case, if “identity” is one of the available content-codings, then the server SHOULD use the “identity” content-coding, unless it has additional information that a different content-coding is meaningful to the client.
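
To make the quoted rules concrete, here is a minimal sketch of the acceptability test a conforming server would run. It covers rules 1, 2 and 4 for a single content-coding; the function names are hypothetical, and the parser is deliberately simple (it does not validate qvalue syntax).

```javascript
// Parse an Accept-Encoding value into a Map of content-coding -> qvalue.
// A coding without an explicit q parameter gets the default qvalue of 1.
function parseAcceptEncoding(value) {
  const prefs = new Map();
  for (const part of value.split(',')) {
    const item = part.trim();
    if (item === '') continue;
    const [coding, ...params] = item.split(';').map((s) => s.trim());
    let q = 1;
    for (const p of params) {
      const m = /^q\s*=\s*([0-9.]+)$/i.exec(p);
      if (m) q = parseFloat(m[1]);
    }
    prefs.set(coding.toLowerCase(), q);
  }
  return prefs;
}

// Decide whether one content-coding is acceptable.
// `headerValue` is undefined when the request has no Accept-Encoding field.
function isAcceptable(coding, headerValue) {
  if (headerValue === undefined) return true; // server MAY assume anything is accepted
  const prefs = parseAcceptEncoding(headerValue);
  const name = coding.toLowerCase();
  if (prefs.has(name)) return prefs.get(name) > 0; // rule 1: listed explicitly
  if (prefs.has('*')) return prefs.get('*') > 0;   // rule 2: wildcard
  return name === 'identity';                      // rule 4: identity by default
}
```

For example, `isAcceptable('gzip', '*;q=0')` and `isAcceptable('identity', '*;q=0')` both return `false`, so under these rules a request carrying `Accept-Encoding: *;q=0` leaves the server with no acceptable content-coding at all.
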

Accordingly, www.guoguo-app.com was requested twice with different Accept-Encoding settings, with the following results:

GET / HTTP/1.1
Accept-Charset: gbk
Host: www.guoguo-app.com
Connection: close

HTTP/1.1 200 OK
Date: Fri, 07 Apr 2017 05:16:22 GMT
Content-Type: text/html;charset=GBK
Transfer-Encoding: chunked
Connection: close
Vary: Accept-Encoding
Vary: Accept-Encoding
Content-Language: zh-CN
Server: Tengine/Aserver

GET / HTTP/1.1
Accept-Charset: gbk
Accept-Encoding: gzip
Host: www.guoguo-app.com
Connection: close

HTTP/1.1 200 OK
Date: Fri, 07 Apr 2017 05:21:53 GMT
Content-Type: text/html;charset=GBK
Transfer-Encoding: chunked
Connection: close
Vary: Accept-Encoding
Content-Language: zh-CN
Content-Encoding: gzip
Server: Tengine/Aserver

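When the client explicitly states that it accepts gzip, the server responds with gzip; without that header, the response is not compressed.

Things did not go as smoothly with andersonjiang.blog.sohu.com, as shown below: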

GET / HTTP/1.1
Accept-Encoding: gzip;q=1.0
Host: andersonjiang.blog.sohu.com
Connection: close

HTTP/1.1 200 OK
Content-Type: text/html; charset=GBK
Transfer-Encoding: chunked
Connection: close
Server: nginx
Date: Fri, 07 Apr 2017 05:36:10 GMT
Vary: Accept-Encoding
Expires: Thu, 01 Jan 1970 00:00:00 GMT
RHOST: 192.168.108.217@8162
Pragma: No-cache
Cache-Control: no-cache
Content-Language: en-US
Content-Encoding: gzip
FSS-Cache: MISS from 13998460.19372422.21936590
FSS-Proxy: Powered by 9935166.11245896.17873234

GET / HTTP/1.1
Accept-Encoding: *;q=0
Host: andersonjiang.blog.sohu.com
Connection: close

HTTP/1.1 200 OK
Content-Type: text/html; charset=GBK
Transfer-Encoding: chunked
Connection: close
Server: nginx
Date: Fri, 07 Apr 2017 05:59:08 GMT
Vary: Accept-Encoding
Expires: Thu, 01 Jan 1970 00:00:00 GMT
RHOST: 10.10.127.109@4709
Pragma: No-cache
Cache-Control: no-cache
Content-Language: en-US
Content-Encoding: gzip
FSS-Cache: MISS from 13998460.19372422.21936590
FSS-Proxy: Powered by 10131777.11639115.18069848

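The second request declares every content-coding unacceptable (Accept-Encoding: *;q=0), yet the server still returns a gzip-compressed body. This does not conform to the RFC, so the only way to get the page content in this case is to decompress the response in the crawler.

As the above shows, whether a server returns a compressed or an uncompressed page is decided entirely by the server, which does not fully follow the recommendations and requirements of the RFC.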
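
Since the server's behaviour cannot be relied on, one practical approach is to decompress based on the Content-Encoding header that actually came back, whatever the request asked for. The following is a minimal sketch using Node's built-in `zlib` module. The function name `decodeBody` is hypothetical, and the sketch assumes the whole body has been collected into a single Buffer and that `headers` has lowercase keys, as `res.headers` does in Node's `http` module.

```javascript
const zlib = require('zlib');

// Decompress a fully buffered response body according to the Content-Encoding
// the server sent, ignoring what the request's Accept-Encoding asked for.
function decodeBody(headers, bodyBuffer) {
  const encoding = String(headers['content-encoding'] || 'identity')
    .trim()
    .toLowerCase();
  switch (encoding) {
    case 'gzip':
    case 'x-gzip':
      return zlib.gunzipSync(bodyBuffer);
    case 'deflate':
      // Assumes zlib-wrapped deflate data; servers that send raw deflate
      // streams would need zlib.inflateRawSync instead.
      return zlib.inflateSync(bodyBuffer);
    case 'identity':
      return bodyBuffer;
    default:
      throw new Error('Unsupported Content-Encoding: ' + encoding);
  }
}
```

The `Transfer-Encoding: chunked` framing seen in the responses above needs no handling here, because Node's `http` module removes it before the body reaches application code.
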
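3. Garbled text caused by the page's character encoding.

During content negotiation the client also sends an Accept-Charset field to the server. RFC 2616 defines the field as follows (excerpt):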

    An example is
   Accept-Charset: iso-8859-5, unicode-1-1;q=0.8
   The special value “*”, if present in the Accept-Charset field, matches every character set (including ISO-8859-1) which is not mentioned elsewhere in the Accept-Charset field. If no “*” is present in an Accept-Charset field, then all character sets not explicitly mentioned get a quality value of 0, except for ISO-8859-1, which gets a quality value of 1 if not explicitly mentioned.
   If no Accept-Charset header is present, the default is that any character set is acceptable. If an Accept-Charset header is present, and if the server cannot send a response which is acceptable according to the Accept-Charset header, then the server SHOULD send an error response with the 406 (not acceptable) status code, though the sending of an unacceptable response is also allowed.

GET / HTTP/1.1
Accept-Charset: gbk;q=0
Accept-Encoding: *;q=0
Host: www.guoguo-app.com
Connection: close

HTTP/1.1 200 OK
Date: Fri, 07 Apr 2017 06:23:12 GMT
Content-Type: text/html;charset=GBK
Transfer-Encoding: chunked
Connection: close
Vary: Accept-Encoding
Vary: Accept-Encoding
Content-Language: zh-CN
Server: Tengine/Aserver

GET / HTTP/1.1
Accept-Charset: utf-8;q=1
Accept-Encoding: *;q=0
Host: www.guoguo-app.com
Connection: close

HTTP/1.1 200 OK
Date: Fri, 07 Apr 2017 06:25:57 GMT
Content-Type: text/html;charset=GBK
Transfer-Encoding: chunked
Connection: close
Vary: Accept-Encoding
Vary: Accept-Encoding
Content-Language: zh-CN
Server: Tengine/Aserver

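In both cases the response is GBK: the first request refuses GBK outright (gbk;q=0) and the second asks for UTF-8, yet the server's Content-Type is still charset=GBK.

As shown above, when a request carries headers such as 'Accept-Charset': 'gbk' and 'Accept-Encoding': 'gzip', the client declares which charsets and content-codings it supports, but the server does not necessarily return the requested charset or compression. It is still reasonable to send both negotiation fields in normal use: when the server does have a choice, it returns the form that was requested.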
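
Because the requested charset may be ignored, the crawler should decode the bytes according to what the response declares. Below is a minimal sketch using the built-in `TextDecoder`. The function name `decodeText` is hypothetical, and the sketch assumes a Node build with full ICU data (the default for official builds since Node 13); on other builds `new TextDecoder('gbk')` throws, and a third-party library would be needed.

```javascript
// Decode raw response bytes using the charset named in the Content-Type
// header, falling back to a caller-supplied default when none is given.
function decodeText(contentType, bodyBuffer, fallback = 'utf-8') {
  const m = /charset\s*=\s*"?([^\s;"]+)/i.exec(contentType || '');
  const label = m ? m[1] : fallback;
  // TextDecoder throws a RangeError for labels it does not support.
  return new TextDecoder(label).decode(bodyBuffer);
}

// The two characters of "中文" encoded as GBK bytes.
const gbkBytes = Buffer.from([0xd6, 0xd0, 0xce, 0xc4]);
console.log(decodeText('text/html;charset=GBK', gbkBytes)); // prints 中文
```

Decoding the same bytes as UTF-8 would produce replacement characters, which is exactly the garbled output a crawler shows when it assumes the charset it asked for instead of the one it received.
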
This is an original article by 村中少年 on CSDN. Please credit the author when reposting.
