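urllib is part of Python's standard library. It includes functions for requesting data over the network, handling cookies, and even….. This book uses urllib extensively, so it is worth reading the library's Python documentation.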
Further reading: the official urllib documentation; Liao Xuefeng's tutorial
from urllib.request import urlopen
html = urlopen("http://www.pythonscraping.com/pages/page1.html")
print(html.read())
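Note that read() returns raw bytes, so the printed output shows a b'...' literal rather than text. The following is a minimal sketch of one way to get a string instead, assuming the server declares its encoding in the Content-Type header and falling back to UTF-8 otherwise. The helper name fetch_text and its parameters are hypothetical and not part of urllib.

```python
from urllib.request import urlopen


def fetch_text(url, timeout=10, default_charset="utf-8"):
    """Download url and return the response body as a str.

    fetch_text is a hypothetical helper; timeout and default_charset
    are assumptions chosen for this sketch.
    """
    # The with-block closes the connection even if read() fails.
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()  # bytes, exactly what the server sent
        # Use the charset from the Content-Type header when present.
        charset = response.headers.get_content_charset() or default_charset
    # errors="replace" keeps going if a few bytes do not decode cleanly.
    return raw.decode(charset, errors="replace")


# Example call (needs network access):
# print(fetch_text("http://www.pythonscraping.com/pages/page1.html"))
```

Decoding once at the boundary keeps the rest of a scraper working with ordinary strings.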
BeautifulSoup: it formats and organizes messy web content by locating HTML tags.
Install pip, Python's package manager, then run the following at the command prompt (cmd):
pip install beautifulsoup4
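To confirm that the install worked, one option is to ask Python whether the bs4 module can be found. This is only a sketch; is_installed is a hypothetical helper name, and it checks that the module is discoverable, not that it is a particular version.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found on this interpreter.

    is_installed is a hypothetical helper written for this sketch.
    """
    # find_spec returns None when no such top-level module exists.
    return importlib.util.find_spec(module_name) is not None


# The beautifulsoup4 package is imported under the name "bs4".
if not is_installed("bs4"):
    print("bs4 not found; try: pip install beautifulsoup4")
```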
Running BeautifulSoup
from urllib.request import urlopen
from bs4 import BeautifulSoup
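html = urlopen("http://www.pythonscraping.com/pages/page1.html")
# Name the parser explicitly so the result does not depend on which parsers are installed.
bsObj = BeautifulSoup(html.read(), "html.parser")
print(bsObj.h1)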
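Tag access works at any depth of the parsed tree, which is why the short form above finds the heading. The sketch below illustrates this on an inline HTML string so it runs without network access; SAMPLE_HTML is hypothetical markup standing in for a downloaded page, not the real contents of the URL above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<html>
  <head><title>A Useful Page</title></head>
  <body>
    <h1>An Interesting Title</h1>
    <div>Some body text.</div>
  </body>
</html>
"""

bs = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Each of these expressions reaches the same first <h1> tag.
first_h1 = bs.h1
via_body = bs.body.h1
via_html = bs.html.body.h1

# get_text() returns the text without the surrounding tags.
heading_text = first_h1.get_text()

# Asking for a tag that is not in the document yields None, not an error.
missing_tag = bs.h2

print(first_h1)
print(heading_text)
print(missing_tag)
```

Because a missing tag comes back as None, chaining a further attribute onto it (for example bs.h2.get_text()) raises AttributeError, which is the case the exception handling further down guards against.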
Handling possible exceptions:
from urllib.request import urlopen
from urllib.error import HTTPError,URLError
from bs4 import BeautifulSoup
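def getTitle(url):
    # Return the page's first <h1> tag inside <body>, or None if anything goes wrong.
    try:
        html = urlopen(url)
    except (HTTPError, URLError) as e:
        # The server returned an error status or could not be reached.
        return None
    try:
        bsObj = BeautifulSoup(html.read(), "html.parser")
        title = bsObj.body.h1
    except AttributeError as e:
        # bsObj.body is None when the page has no <body> tag.
        return None
    return title

title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)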
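The function above treats every failure the same way. When it helps to know why a request failed, the two exception types can be told apart. The sketch below is one possible arrangement, not the book's method; describe_failure and fetch are hypothetical names, and the message wording and 10-second timeout are assumptions.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def describe_failure(error):
    """Turn a urllib exception into a short message (hypothetical helper)."""
    # HTTPError is a subclass of URLError, so it must be checked first.
    if isinstance(error, HTTPError):
        return "server returned HTTP {}".format(error.code)
    if isinstance(error, URLError):
        return "could not reach server: {}".format(error.reason)
    return "unexpected error: {!r}".format(error)


def fetch(url, timeout=10):
    """Return (body_bytes, None) on success or (None, message) on failure."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read(), None
    except (HTTPError, URLError) as error:
        return None, describe_failure(error)


# Example call (needs network access):
# body, problem = fetch("http://www.pythonscraping.com/pages/page1.html")
# print(problem if body is None else len(body))
```

Returning the message alongside the result keeps the calling code free of try blocks while still preserving the reason for the failure.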
When writing a scraper, think about the overall structure of the code so that it both catches exceptions and stays easy to read.