Python Web Crawlers and Using HTML Parsing Libraries

Article link: https://blog.youkuaiyun.com/qq_25300563/article/details/50526857

Environment: Python 3, Anaconda

The general format of a URL is scheme://host:port/path?query#fragment, where:

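.scheme: the communication protocol, such as http or ftp.

.host: the host, given as the server's Domain Name System (DNS) hostname or its IP address.

.port: the port number; optional. When omitted, the scheme's default port is used, e.g. 80 for http.

.path: the path; optional. A string of zero or more segments separated by "/", usually identifying a directory or file on the host.

.query: the query string; optional. It passes parameters to dynamic pages (built with technologies such as PHP, JSP, ASP, or ASP.NET). Multiple parameters are separated by "&".

.fragment: the fragment; a string that identifies a part within the resource.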
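The components listed above can be pulled apart with the standard library's urllib.parse module. The following is a minimal sketch; the example URL (www.example.com, port 8080, and its path and query) is made up for illustration and does not come from the original article.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical URL chosen only to exercise every component.
url = "http://www.example.com:8080/docs/index.php?id=1&lang=en#intro"

parts = urlsplit(url)

print(parts.scheme)    # communication protocol
print(parts.hostname)  # host name
print(parts.port)      # port number as an int, or None when omitted
print(parts.path)      # path on the host
print(parts.query)     # raw query string
print(parts.fragment)  # fragment

# parse_qs splits the query on "&" and maps each name to a list of values.
params = parse_qs(parts.query)
print(params)
```

When the port is left out of the URL, parts.port is None; falling back to the scheme's default (such as 80 for http) is left to the caller.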

2. Getting Started

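Use urlopen from urllib. Python 2 used urllib2 for this, but that library no longer exists in Python 3; its functionality was merged into the urllib package (as urllib.request and urllib.error).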

from urllib.request import urlopen

response = urlopen("http://www.baidu.com")  # the returned data is stored in response
print(response.read())                      # read() returns the page content as bytes

The result is the page's raw HTML, printed as a bytes object.
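Because read() returns bytes, the page usually needs decoding before it can be handled as text. Below is a rough sketch of one way to do that. The helper names decode_body and fetch_text are hypothetical (they are not part of urllib), and falling back to UTF-8 when the server declares no charset is an assumption that will not suit every site.

```python
from urllib.request import urlopen


def decode_body(raw, charset=None, fallback="utf-8"):
    """Decode response bytes; undecodable bytes become U+FFFD.

    Hypothetical helper: uses `charset` when given, else `fallback`.
    """
    return raw.decode(charset or fallback, errors="replace")


def fetch_text(url, timeout=10):
    """Hypothetical helper: download `url` and return its body as str."""
    # The with-statement closes the connection once the body is read.
    with urlopen(url, timeout=timeout) as response:
        # Charset declared in the Content-Type header, or None if absent.
        charset = response.headers.get_content_charset()
        return decode_body(response.read(), charset)
```

A call such as fetch_text("http://www.baidu.com") would then return a str instead of bytes, assuming the network is reachable.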

3. The BeautifulSoup Library

BeautifulSoup parses an HTML document into a tree structure. It has four main kinds of objects:

<1> Tag

<2> NavigableString

<3> BeautifulSoup

<4> Comment
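To make the four object kinds concrete, here is a small sketch that produces one of each. The HTML snippet is invented for illustration, and the built-in "html.parser" is used so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical snippet: one tag holding a text node and a comment.
doc = "<p class='intro'>Hello<!-- a note --></p>"

soup = BeautifulSoup(doc, "html.parser")  # BeautifulSoup: the whole document
p = soup.p                                # Tag: the <p> element
text = p.contents[0]                      # NavigableString: the text inside <p>
note = p.contents[1]                      # Comment: the HTML comment

for obj in (soup, p, text, note):
    print(type(obj).__name__)
```

Comment is a special kind of NavigableString, so code that walks text nodes may want to check for Comment first in order to skip comments.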

Let's start with Tag:

from bs4 import BeautifulSoup

html="""
<html id="spLianghui">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
<title>网易</title>
<link rel="dns-prefetch" href="http://img1.cache.netease.com" />
<base target="_blank" />
<meta name="Keywords" content="网易,邮箱,游戏,新闻,体育,娱乐,女性,亚运,论坛,短信,数码,汽车,手机,财经,科技,相册" />
<meta name="Description" content="网易是中国领先的互联网技术公司，为用户提供免费邮箱、游戏、搜索引擎服务，开设新闻、娱乐、体育等30多个内容频道，及博客、视频、论坛等互动交流，网聚人的力量。" />
<meta name="robots" content="index, follow" />
<meta name="googlebot" content="index, follow" />
<link rel="apple-touch-icon-precomposed" href="http://img1.cache.netease.com/www/logo/logo-ipad-icon.png" >
<script type="text/javascript">
    window.NTES_logger_start_time = new Date();
</script>
"""

soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid bs4's "no parser specified" warning
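As a sketch of what a Tag offers once the soup is built: a tag can be reached as an attribute of the soup, and it exposes its name, its text, and its HTML attributes. The snippet below uses a shortened copy of the document above (with closing tags added) so that it runs on its own. This is one plausible continuation, not necessarily what the author planned to show.

```python
from bs4 import BeautifulSoup

# Shortened copy of the article's sample document.
html = """
<html id="spLianghui">
<head>
<title>网易</title>
<link rel="dns-prefetch" href="http://img1.cache.netease.com" />
<meta name="robots" content="index, follow" />
</head>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

title = soup.title   # first <title> tag in the document
print(title.name)    # the tag's name
print(title.string)  # the text inside the tag

link = soup.link     # first <link> tag
print(link["href"])  # attributes are read like dictionary keys
print(link.attrs)    # all attributes of the tag as a dict

# .get() returns None instead of raising KeyError for a missing attribute.
print(soup.meta.get("content"))
print(soup.meta.get("charset"))
```

Accessing soup.title or soup.link only returns the first matching tag; collecting every match needs find_all.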


To be continued...

Python Crawlers (Part 1)