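Every time I write a crawler I run into encoding problems, so here is a summary of the common ones:
- Character set encoding (gbk, gb2312, win1252)
- gzip, deflate encoding
- HTML encoding
- URL encoding
Those are the common encoding problems; below are the ways to handle each of them in Node.js:
- Character set encoding: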
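var iconv = require('iconv-lite');
// chunks: array of Buffers collected from 'data' events; size: their total byte length
var data = Buffer.concat(chunks, size);
// Decode the raw bytes exactly once; decoding an already-decoded string corrupts the text
var decoded_data = iconv.decode(data, 'gbk');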
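The 'gbk' above is hard-coded, but a crawler usually has to find out which charset a page uses. A minimal sketch of one way to do that, reading the charset parameter of the Content-Type response header; the helper name charsetFromContentType and the 'utf8' fallback are assumptions for illustration, not part of iconv-lite:

```javascript
// Hypothetical helper: extract the charset label from a Content-Type header value.
// Returns the fallback (assumed default: 'utf8') when no charset parameter is present.
function charsetFromContentType(contentType, fallback) {
  var match = /charset\s*=\s*["']?([^"';\s]+)/i.exec(contentType || '');
  return match ? match[1].toLowerCase() : (fallback || 'utf8');
}

console.log(charsetFromContentType('text/html; charset=GBK')); // gbk
console.log(charsetFromContentType('text/html'));              // utf8
```

The result can then be passed as the second argument of iconv.decode, for example iconv.decode(data, charsetFromContentType(res.headers['content-type'])). Pages that only declare their charset in a meta tag are not covered by this sketch.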
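- gzip, deflate encoding:
var zlib = require('zlib');
// buffer: the compressed response body as a Buffer
zlib.gunzip(buffer, function(err, decoded) {
  if (err) return console.error(err);
  console.log(decoded.toString());
});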
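The snippet above only covers gzip. As a rough sketch of handling both encodings, the helper below picks the zlib call from the Content-Encoding header value. The name decompressBody is hypothetical, and the synchronous zlib functions are used only to keep the example short; the asynchronous zlib.gunzip and zlib.inflate work the same way with a callback.

```javascript
var zlib = require('zlib');

// Hypothetical helper: undo the HTTP Content-Encoding of a fully collected body.
function decompressBody(buffer, contentEncoding) {
  var encoding = (contentEncoding || '').trim().toLowerCase();
  if (encoding === 'gzip' || encoding === 'x-gzip') {
    return zlib.gunzipSync(buffer);
  }
  if (encoding === 'deflate') {
    try {
      // "deflate" normally means zlib-wrapped deflate data
      return zlib.inflateSync(buffer);
    } catch (err) {
      // Assumption: some servers send raw deflate data without the zlib header
      return zlib.inflateRawSync(buffer);
    }
  }
  // No (or unknown) content encoding: return the bytes untouched
  return buffer;
}

var sample = Buffer.from('hello crawler', 'utf8');
console.log(decompressBody(zlib.gzipSync(sample), 'gzip').toString());       // hello crawler
console.log(decompressBody(zlib.deflateSync(sample), 'deflate').toString()); // hello crawler
```

The result is still a Buffer, so character set decoding (for example with iconv.decode) happens after this step, not before.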
- HTML encoding
Do not call toString() directly on an element; use functions such as text() to get the text:
element.text()
element.html()
……
- URL encoding
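For URL encoding, the built-in encodeURIComponent and decodeURIComponent always assume UTF-8, so they fail on URLs that were percent-encoded from a legacy charset such as gbk. A minimal sketch of a workaround, assuming a hypothetical helper percentDecodeToBuffer that turns the escapes back into raw bytes without guessing their charset:

```javascript
// Hypothetical helper: turn a percent-encoded string into the raw bytes it represents.
// Unescaped characters are kept as their UTF-8 bytes (plain ASCII in a well-formed URL).
function percentDecodeToBuffer(input) {
  // The capturing group keeps each %XX escape as its own array element
  var parts = input.split(/(%[0-9a-fA-F]{2})/);
  var buffers = parts.map(function (part) {
    if (/^%[0-9a-fA-F]{2}$/.test(part)) {
      return Buffer.from([parseInt(part.slice(1), 16)]);
    }
    return Buffer.from(part, 'utf8');
  });
  return Buffer.concat(buffers);
}

// UTF-8 case: the built-ins are enough ('\u4e2d\u6587' is the two-character string "中文")
console.log(encodeURIComponent('\u4e2d\u6587')); // %E4%B8%AD%E6%96%87

// gbk case: decodeURIComponent('%D6%D0%CE%C4') throws a URIError, so recover the bytes instead
console.log(percentDecodeToBuffer('%D6%D0%CE%C4').toString('hex')); // d6d0cec4
```

The returned Buffer can then be handed to iconv.decode(buffer, 'gbk') like any other legacy-encoded data.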
A very important reminder:
Use Buffers when decoding
Alexander Shtuchkin edited this page on 11 Jun 2014 · 5 revisions
Decoding a string is probably the most common mistake when working with legacy encoded resources. Why? Let's see.
Problem
This is wrong:
var http = require('http'),
    iconv = require('iconv-lite');
http.get("http://website.com/", function(res) {
  var body = '';
  res.on('data', function(chunk) {
    body += chunk;
  });
  res.on('end', function() {
    var decodedBody = iconv.decode(body, 'win1252');
    console.log(decodedBody);
  });
});
Before being decoded with iconv.decode function, the original resource was (unintentionally) already decoded in body += chunk via javascript type conversion. What really happens here is:
res.on('data', function(chunkBuffer) {
  body += chunkBuffer.toString('utf8');
});
The same conversion is done behind the scenes if you call res.setEncoding('utf8');.
Not only does double-decoding lead to wrong results, it is also nearly impossible to restore the original bytes because utf8 conversion is lossy, so even iconv.decode(new Buffer(body, 'utf8'), 'win1252') will not help.
Note: theoretically, if you use 'binary' encoding to first decode to strings, then feed them to decode, you get the correct results. This is a bad practice because it’s slower, it’s mixing concepts and 'binary' encoding is deprecated.
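To make the "lossy" claim concrete, here is a small sketch using only Node's built-in Buffer. The sample bytes are an assumption chosen for illustration: "café" encoded as win1252, where 0xE9 is "é" but is not valid UTF-8 on its own.

```javascript
// "café" in win1252: 0xE9 is "é" there, but is an incomplete sequence in UTF-8
var original = Buffer.from([0x63, 0x61, 0x66, 0xe9]);

// What `body += chunk` does implicitly: decode the bytes as UTF-8
var asUtf8String = original.toString('utf8');
// Try to get the bytes back from the string
var restored = Buffer.from(asUtf8String, 'utf8');

console.log(original.toString('hex')); // 636166e9
console.log(restored.toString('hex')); // 636166efbfbd (0xE9 became U+FFFD, the original byte is gone)

// 'latin1' ('binary' is an alias for it in current Node) maps every byte to one
// character, so this round trip keeps the bytes, as the note above describes
var viaLatin1 = Buffer.from(original.toString('latin1'), 'latin1');
console.log(viaLatin1.equals(original)); // true
```

Once the invalid byte has been replaced by U+FFFD there is no way to tell which byte it was, which is why the raw Buffers have to be kept until the real charset is applied.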
Solution
Keep the original Buffers and provide them to iconv.decode. Use Buffer.concat() if needed.
In general, keep in mind that all javascript strings are already decoded and should not be decoded again.
http.get("http://website.com/", function(res) {
  var chunks = [];
  res.on('data', function(chunk) {
    chunks.push(chunk);
  });
  res.on('end', function() {
    var decodedBody = iconv.decode(Buffer.concat(chunks), 'win1252');
    console.log(decodedBody);
  });
});
// Or, with iconv-lite@0.4 and Node v0.10+, you can use streaming support with `collect` helper
http.get("http://website.com/", function(res) {
  res.pipe(iconv.decodeStream('win1252')).collect(function(err, decodedBody) {
    console.log(decodedBody);
  });
});
What if you know what you’re doing and just want to mute the warning?
iconv.skipDecodeWarning = true;