Crawler Encoding Problems

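This article looks at the encoding problems a crawler commonly runs into when fetching web pages: character set encoding, gzip/deflate compression encoding, HTML entity encoding, and URL encoding. It also gives concrete ways to handle them correctly in a Node.js environment, so that double decoding does not corrupt the data.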


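Every time I write a crawler I run into encoding problems, so here is a summary of the common ones:

  1. Character set encoding (gbk, gb2312, win1252)
  2. gzip, deflate encoding
  3. HTML encoding
  4. URL encoding

Those are the common encoding problems. Below are the solutions in Node.js.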

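  1. Character set encoding:
var iconv = require('iconv-lite');
// chunks: array of Buffers collected from 'data' events; size: their total byte length
var data = Buffer.concat(chunks, size);
// Decode the raw bytes exactly once; decoding the resulting string again corrupts the text
var decoded_data = iconv.decode(data, 'gbk');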
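The charset in the snippet above is hard-coded. As a rough sketch of how a crawler might choose it per page, the helper below checks the Content-Type header first and then sniffs a <meta> tag in the raw bytes. The name detectCharset, the 1024-byte sniff window, and the utf-8 fallback are all assumptions made for illustration; none of them come from iconv-lite.

```javascript
// Hypothetical helper: guess the charset of a response before decoding it.
// contentTypeHeader: value of the Content-Type header (may be undefined)
// body: the raw, still undecoded response body as a Buffer
function detectCharset(contentTypeHeader, body) {
  // 1. Prefer the HTTP header, e.g. "text/html; charset=GBK"
  const fromHeader = /charset\s*=\s*["']?([\w-]+)/i.exec(contentTypeHeader || '');
  if (fromHeader) return fromHeader[1].toLowerCase();

  // 2. Otherwise look for a <meta> declaration near the top of the page.
  //    latin1 maps every byte to exactly one character, so ASCII markup
  //    survives no matter what the real charset is.
  const head = body.subarray(0, 1024).toString('latin1');
  const fromMeta = /<meta[^>]+charset\s*=\s*["']?([\w-]+)/i.exec(head);
  if (fromMeta) return fromMeta[1].toLowerCase();

  // 3. Assumed default when nothing is declared
  return 'utf-8';
}
```

The returned name could then be passed as the second argument of iconv.decode in place of the literal 'gbk'.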
  2. gzip, deflate encoding:
var zlib = require('zlib');
zlib.gunzip(buffer, function(err, decoded) {
	if (err) throw err;
	console.log(decoded.toString());
})
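gunzip only covers gzip. The sketch below picks the decompression routine from the Content-Encoding header, assuming the whole body has already been collected into one Buffer. decompressBody is a hypothetical helper name, and the synchronous zlib calls are used only to keep the example short.

```javascript
const zlib = require('zlib');

// Hypothetical helper: undo the transfer compression, returning raw bytes.
// raw: the compressed body as a Buffer
// contentEncoding: value of the Content-Encoding header (may be undefined)
function decompressBody(raw, contentEncoding) {
  const encoding = (contentEncoding || '').trim().toLowerCase();
  if (encoding === 'gzip' || encoding === 'x-gzip') {
    return zlib.gunzipSync(raw);
  }
  if (encoding === 'deflate') {
    try {
      // Standard case: zlib-wrapped deflate
      return zlib.inflateSync(raw);
    } catch (err) {
      // Some servers send raw deflate data without the zlib header
      return zlib.inflateRawSync(raw);
    }
  }
  // No (or unknown) Content-Encoding: hand the bytes back untouched
  return raw;
}
```

The result is still a Buffer, so the charset decoding step can run on it afterwards without any intermediate string conversion.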
  3. HTML encoding
    Do not call toString() directly on an element; use functions such as text() to get the text instead
element.text()
element.html()
……
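When text has to be unescaped without a parsing library, a hand-rolled decoder along the following lines may do. This is only a sketch: decodeEntities is a hypothetical name, and the table of named entities is deliberately tiny, whereas real HTML defines far more. The replacement runs in a single pass, so already-decoded output is never decoded a second time.

```javascript
// Assumed minimal table; real HTML defines many more named entities.
const NAMED_ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", nbsp: '\u00a0' };

// Hypothetical helper: decode numeric and a few named HTML entities.
function decodeEntities(text) {
  return text.replace(/&(#[xX][0-9a-fA-F]+|#\d+|[a-zA-Z]+);/g, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X';
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched instead of throwing
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    // Unknown names are left as they were
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
      ? NAMED_ENTITIES[body]
      : match;
  });
}
```

For example, the input '&amp;lt;' comes out as '&lt;' rather than '<', because the text produced by a replacement is not scanned again.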
  4. URL encoding
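The same "decode the bytes once" rule applies to URLs. decodeURIComponent assumes the percent-escaped bytes are UTF-8 and throws a URIError on, say, GBK-encoded links. One possible approach, sketched below, is to turn the escapes back into a Buffer first and only then decode with the right charset. percentDecodeToBuffer is a hypothetical helper, and it assumes every unescaped character in the input is plain ASCII.

```javascript
// Hypothetical helper: convert a percent-encoded string into its raw bytes.
// Assumes all unescaped characters are ASCII, as they should be in a URL.
function percentDecodeToBuffer(input) {
  const bytes = [];
  for (let i = 0; i < input.length; i++) {
    const hex = input.slice(i + 1, i + 3);
    if (input[i] === '%' && /^[0-9a-fA-F]{2}$/.test(hex)) {
      bytes.push(parseInt(hex, 16)); // one escaped byte
      i += 2;                        // skip the two hex digits
    } else {
      bytes.push(input.charCodeAt(i) & 0xff); // literal ASCII character
    }
  }
  return Buffer.from(bytes);
}

// '%D6%D0' is one GBK-encoded character; as UTF-8 it is invalid.
const rawBytes = percentDecodeToBuffer('%D6%D0');
// With iconv-lite available, the bytes would then be decoded once:
//   iconv.decode(rawBytes, 'gbk')
```

For URLs that are known to be UTF-8, the built-in decodeURIComponent and encodeURIComponent are enough.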

A very important reminder:

Use Buffers when decoding

Alexander Shtuchkin edited this page on 11 Jun 2014 · 5 revisions
Decoding a string is probably the most common mistake when working with legacy encoded resources. Why? Let's see.

Problem
This is wrong:

var http = require('http'),
    iconv = require('iconv-lite');

http.get("http://website.com/", function(res) {
  var body = '';
  res.on('data', function(chunk) {
    body += chunk;
  });
  res.on('end', function() {
    var decodedBody = iconv.decode(body, 'win1252');
    console.log(decodedBody);
  });
});

Before being decoded with iconv.decode function, the original resource was (unintentionally) already decoded in body += chunk via javascript type conversion. What really happens here is:

  res.on('data', function(chunkBuffer) {
    body += chunkBuffer.toString('utf8');
  });

The same conversion is done behind the scenes if you call res.setEncoding('utf8');.

Not only does double-decoding lead to wrong results, it is also nearly impossible to restore the original bytes because the utf8 conversion is lossy, so even iconv.decode(Buffer.from(body, 'utf8'), 'win1252') will not help.

Note: theoretically, if you use 'binary' encoding to first decode to strings, then feed them to decode, you get the correct results. This is a bad practice because it's slower, it's mixing concepts and 'binary' encoding is deprecated.
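The loss can be seen without any network access. The sketch below uses three bytes that spell "été" in win1252 (sample data chosen for illustration) and round-trips them through a utf8 string and through a latin1 string, latin1 being the current Node.js name for the 'binary' encoding mentioned in the note.

```javascript
// "été" encoded as win1252: 0xE9 is "é", 0x74 is "t"
const original = Buffer.from([0xe9, 0x74, 0xe9]);

// What `body += chunk` does implicitly: decode the bytes as utf8.
// 0xE9 is not valid utf8 here, so it turns into U+FFFD.
const damaged = original.toString('utf8');
const viaUtf8 = Buffer.from(damaged, 'utf8');

// latin1 maps each byte to one character, so the round trip is lossless.
const viaLatin1 = Buffer.from(original.toString('latin1'), 'latin1');

console.log(damaged.includes('\ufffd'));   // true: replacement characters
console.log(viaUtf8.equals(original));     // false: original bytes are gone
console.log(viaLatin1.equals(original));   // true
```

Once the replacement character has been substituted, nothing in the string records which byte it stood for, which is why re-encoding cannot bring the data back.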

Solution
Keep the original Buffers and pass them to iconv.decode. Use Buffer.concat() if needed.

In general, keep in mind that all javascript strings are already decoded and should not be decoded again.

http.get("http://website.com/", function(res) {
  var chunks = [];
  res.on('data', function(chunk) {
    chunks.push(chunk);
  });
  res.on('end', function() {
    var decodedBody = iconv.decode(Buffer.concat(chunks), 'win1252');
    console.log(decodedBody);
  });
});
// Or, with iconv-lite@0.4 and Node v0.10+, you can use streaming support with `collect` helper
http.get("http://website.com/", function(res) {
  res.pipe(iconv.decodeStream('win1252')).collect(function(err, decodedBody) {
    console.log(decodedBody);
  });
});

What if you know what you’re doing and just want to mute the warning?

iconv.skipDecodeWarning = true;
