Reading gzip-compressed and plain web pages in Python

Latest recommended article published 2022-10-09 15:59:07

Original


CC 4.0 BY-SA license

Tags:

#python #urllib2 #gzip #garbled-text #utf-8

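This article explains how to read both gzip-compressed and plain web pages in Python. When fetched content looks garbled, the cause may be not only the character encoding but also the format of the returned data. A gzip-encoded page has to be decompressed with the gzip module. The example code stresses closing the file object once the content has been processed, especially in a multithreaded crawler. Because a server may change its response format dynamically, it is advisable to check Content-Encoding first and then handle the body accordingly.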
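The check described above can be sketched roughly as follows. This is a minimal illustration, not the article's own code: it assumes Python 3's `urllib.request` in place of `urllib2`, and the function names `decode_body` and `fetch` are hypothetical. It reads the `Content-Encoding` header, decompresses only when the value is `gzip`, and uses `with` blocks so the response and the gzip file object are always closed.

```python
import gzip
import io
import urllib.request


def decode_body(raw, content_encoding):
    """Return the decompressed body if it was gzip-encoded, else return it unchanged.

    Hypothetical helper for illustration; not part of any library.
    """
    if (content_encoding or "").strip().lower() == "gzip":
        # GzipFile needs a file-like object; the with-block closes it when done.
        with gzip.GzipFile(fileobj=io.BytesIO(raw)) as gz:
            return gz.read()
    return raw


def fetch(url):
    """Fetch url and return the body bytes, decompressing gzip if the server used it."""
    headers = {
        "User-Agent": "Opera/9.25 (Windows NT 5.1; U; en)",
        # Tell the server that a gzip response is acceptable; it may still send plain data.
        "Accept-Encoding": "gzip",
    }
    request = urllib.request.Request(url, headers=headers)
    # The with-block closes the response even if reading raises.
    with urllib.request.urlopen(request) as response:
        raw = response.read()
        encoding = response.headers.get("Content-Encoding")
    return decode_body(raw, encoding)
```

Calling `fetch('http://www.baidu.com')` would return raw bytes in either case; decoding them to text (for example with `.decode('utf-8')`) is a separate step that depends on the page's declared charset.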

Typically, we fetch a web page and analyze the content it returns like this:

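#!/usr/bin/python
#coding:utf-8
# Python 2 only: urllib2 was split into urllib.request and urllib.error in Python 3.
import urllib2
# Send a browser-like User-Agent with the request.
headers = {"User-Agent": 'Opera/9.25 (Windows NT 5.1; U; en)'}
request = urllib2.Request(url='http://www.baidu.com', headers=headers)
# read() returns the raw response body as a byte string.
response = urllib2.urlopen(request).read()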

Usually, you will see the source of the returned page:

<html>
<head>
    
    <meta http-equiv="content-type" content="text/html;charset=utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=Edge">
	<meta content="always" name="referrer">
    <meta name="theme-color" content="#2932e1">
    <link rel="shortcut icon" href="/favicon.ico" type="image/x-icon" />
    <link rel="search" type="application/opensearchdescription+xml" href="/content-search.xml" title="百度搜索" /> 
    <link rel="icon" sizes="any" mask href="//www.baidu.