Garbled output when fetching web pages with file_get_contents() in PHP
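1. Caused by a character-encoding mismatch
If the garbled text is caused by an encoding mismatch, the fix is simple: convert the fetched content to the encoding you need.
$content = iconv("GBK", "UTF-8", $content);
(1) $htmlsource = iconv("gb2312", "utf-8//IGNORE", $htmlsource); // //IGNORE discards characters that cannot be converted
(2) $string = mb_convert_encoding($string, 'UTF-8', 'UTF-8,GBK,GB2312,BIG5'); // detects the source encoding from the listed candidates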
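Converting unconditionally can damage content that is already UTF-8, so it can help to check before converting. The following is a minimal sketch of that idea, not a definitive implementation: the helper name to_utf8 and the default source encoding of GBK are assumptions made for illustration, and the mbstring extension is required.

```php
<?php
// Hypothetical helper: return $content as UTF-8.
// Assumption: any input that is not valid UTF-8 is encoded in $assumedSource.
function to_utf8(string $content, string $assumedSource = 'GBK'): string
{
    // Already valid UTF-8: leave it alone to avoid a double conversion.
    if (mb_check_encoding($content, 'UTF-8')) {
        return $content;
    }
    return mb_convert_encoding($content, 'UTF-8', $assumedSource);
}

// "\xD6\xD0\xCE\xC4" is the GBK byte sequence for the two characters "中文".
$utf8 = to_utf8("\xD6\xD0\xCE\xC4");
```

If the source encoding varies from site to site, pass it explicitly as the second argument instead of relying on the default.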
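2. The page is served with gzip compression
How do you fetch a gzip-compressed page, and how do you tell that a page is compressed? If the response headers contain Content-Encoding: gzip,
the page content has been compressed with gzip. You can check whether a page uses gzip compression in Firebug or in the browser's developer tools.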
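The same check can be done from PHP by scanning the response header lines. This is a rough sketch under stated assumptions: is_gzip_response is a hypothetical helper name, and the header lines are expected as an array of strings such as the one returned by get_headers($url) or found in $http_response_header after file_get_contents().

```php
<?php
// Hypothetical helper: true if any header line declares gzip content encoding.
function is_gzip_response(array $headers): bool
{
    foreach ($headers as $line) {
        // Header names are case-insensitive, hence the /i modifier.
        if (preg_match('/^Content-Encoding:\s*(.+)$/i', $line, $m)
            && stripos($m[1], 'gzip') !== false) {
            return true;
        }
    }
    return false;
}

// Sample header lines, invented for illustration.
$sample = ['HTTP/1.1 200 OK', 'Content-Type: text/html', 'Content-Encoding: gzip'];
$compressed = is_gzip_response($sample);
```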
Here are some solutions:
1. Use PHP's built-in zlib extension. The following code solves the problem:
$data = file_get_contents("compress.zlib://".$url);
2. Use curl instead of file_get_contents():
function curl_get($url, $gzip = false) {
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);  // return the body instead of printing it
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10); // give up connecting after 10 seconds
    if ($gzip) {
        curl_setopt($curl, CURLOPT_ENCODING, "gzip"); // the key line: request gzip and decode it automatically
    }
    $content = curl_exec($curl);
    curl_close($curl);
    return $content;
}
Usage:
$res = curl_get($url, true);
Original article: http://suiwnet.blog.51cto.com/2492370/1295262
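3. Use a gzip decompression function (PHP 5.4 and later already ship a built-in gzdecode(), so the userland version below is only needed on older versions and cannot be declared alongside the built-in)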
function gzdecode($data) {
$len = strlen($data);
if ($len < 18 || strcmp(substr($data,0,2),"\x1f\x8b")) {
return null; // Not GZIP format (See RFC 1952)
}
$method = ord(substr($data,2,1)); // Compression method
$flags = ord(substr($data,3,1)); // Flags
if (($flags & 31) != $flags) {
// Reserved bits are set -- NOT ALLOWED by RFC 1952
return null;
}
// NOTE: $mtime may be negative (PHP integer limitations)
$mtime = unpack("V", substr($data,4,4));
$mtime = $mtime[1];
$xfl = substr($data,8,1);
$os = substr($data,9,1);
$headerlen = 10;
$extralen = 0;
$extra = "";
if ($flags & 4) {
// 2-byte length prefixed EXTRA data in header
if ($len - $headerlen - 2 < 8) {
return false; // Invalid format
}
$extralen = unpack("v",substr($data,$headerlen,2)); // XLEN follows the 10-byte fixed header
$extralen = $extralen[1];
if ($len - $headerlen - 2 - $extralen < 8) {
return false; // Invalid format
}
$extra = substr($data,$headerlen+2,$extralen);
$headerlen += 2 + $extralen;
}
$filenamelen = 0;
$filename = "";
if ($flags & 8) {
// C-style string file NAME data in header
if ($len - $headerlen - 1 < 8) {
return false; // Invalid format
}
$filenamelen = strpos(substr($data,$headerlen),chr(0));
if ($filenamelen === false || $len - $headerlen - $filenamelen - 1 < 8) {
return false; // Invalid format
}
$filename = substr($data,$headerlen,$filenamelen);
$headerlen += $filenamelen + 1;
}
$commentlen = 0;
$comment = "";
if ($flags & 16) {
// C-style string COMMENT data in header
if ($len - $headerlen - 1 < 8) {
return false; // Invalid format
}
$commentlen = strpos(substr($data,$headerlen),chr(0));
if ($commentlen === false || $len - $headerlen - $commentlen - 1 < 8) {
return false; // Invalid header format
}
$comment = substr($data,$headerlen,$commentlen);
$headerlen += $commentlen + 1;
}
$headercrc = "";
if ($flags & 1) {
// 2-bytes (lowest order) of CRC32 on header present
if ($len - $headerlen - 2 < 8) {
return false; // Invalid format
}
$calccrc = crc32(substr($data,0,$headerlen)) & 0xffff;
$headercrc = unpack("v", substr($data,$headerlen,2));
$headercrc = $headercrc[1];
if ($headercrc != $calccrc) {
return false; // Bad header CRC
}
$headerlen += 2;
}
// GZIP FOOTER - These may be negative due to PHP's integer limitations
$datacrc = unpack("V",substr($data,-8,4));
$datacrc = $datacrc[1];
$isize = unpack("V",substr($data,-4));
$isize = $isize[1];
// Perform the decompression:
$bodylen = $len-$headerlen-8;
if ($bodylen < 1) {
// This should never happen - IMPLEMENTATION BUG!
return null;
}
$body = substr($data,$headerlen,$bodylen);
$data = "";
if ($bodylen > 0) {
switch ($method) {
case 8:
// Currently the only supported compression method:
$data = gzinflate($body);
break;
default:
// Unknown compression method
return false;
}
} else {
// I'm not sure if zero-byte body content is allowed.
// Allow it for now... Do nothing...
}
// Verify decompressed size and CRC32:
// NOTE: This may fail with large data sizes depending on how
// PHP's integer limitations affect strlen() since $isize
// may be negative for large sizes.
if ($isize != strlen($data) || crc32($data) != $datacrc) {
// Bad format! Length or CRC doesn't match!
return false;
}
return $data;
}
Reference: http://www.poluoluo.com/jzxy/201311/249305.html
Usage:
$html=file_get_contents('http://www.jb51.net/');
$html=gzdecode($html);
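Decoding a body that was never compressed fails, so it can be safer to check for the gzip signature first. The sketch below is one possible guard, assuming the zlib extension and the built-in gzdecode() of PHP 5.4 or later; maybe_gunzip is a hypothetical helper name, not part of the original code.

```php
<?php
// Hypothetical helper: decompress $body only if it looks like gzip data.
function maybe_gunzip(string $body): string
{
    // gzip streams start with the magic bytes 0x1f 0x8b (RFC 1952).
    if (strncmp($body, "\x1f\x8b", 2) !== 0) {
        return $body;
    }
    // The built-in gzdecode() returns false on corrupt input;
    // fall back to the raw body in that case.
    $decoded = @gzdecode($body);
    return $decoded === false ? $body : $decoded;
}

// Round trip with locally generated data, so no network access is needed.
$roundTrip = maybe_gunzip(gzencode('hello'));
$untouched = maybe_gunzip('plain text');
```

With a helper like this, the result of file_get_contents() can be passed through unconditionally, whether or not the server compressed the response.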
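Encoding rules: the meta tag in the HTML determines the encoding in which the page's content is submitted to the CGI program, while the encoding used for display is determined by the header the CGI program sends. If the program does not set one, the server's default charset applies. Some pages come out garbled because these encodings are not consistent: the encoding of the PHP file determines the encoding of the generated HTML, and the header determines the default encoding used when the page is viewed.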
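When fetching a page, these rules suggest looking at the Content-Type header first and falling back to the meta tag to find out which encoding to convert from. The following is a rough sketch of that lookup order; detect_charset is a hypothetical helper, and the regular expressions are simplified assumptions that will not cover every real-world page.

```php
<?php
// Hypothetical helper: find the declared charset of a fetched page.
// Checks the HTTP header lines first, then the HTML meta tag.
function detect_charset(array $headers, string $html): ?string
{
    foreach ($headers as $line) {
        if (preg_match('/^Content-Type:.*?charset=\s*"?([\w-]+)/i', $line, $m)) {
            return $m[1];
        }
    }
    // Matches both <meta charset="..."> and the http-equiv form.
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w-]+)/i', $html, $m)) {
        return $m[1];
    }
    return null; // nothing declared; the caller has to guess
}

// Sample inputs, invented for illustration.
$fromHeader = detect_charset(['Content-Type: text/html; charset=GBK'], '');
$fromMeta   = detect_charset([], '<meta http-equiv="Content-Type" content="text/html; charset=gb2312">');
```

The returned name can then be used as the source encoding for iconv() or mb_convert_encoding().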