Python Crawlers and a Delicious Soup: BeautifulSoup
Further reading:
Implementing a web crawler in Python 3 (2) – Using BeautifulSoup (1)
Implementing a web crawler in Python 3 (3) – Using BeautifulSoup (2)
Implementing a web crawler in Python 3 (4) – Using BeautifulSoup (3)
Installation
1. In PyCharm, install the package bs4
2. pip install beautifulsoup4
Going further
Install lxml as well: add the lxml package in PyCharm, or pip install lxml
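To verify that both packages are importable, a quick sanity check (a minimal sketch; it only prints whatever versions happen to be installed):
# coding:utf-8
import bs4
import lxml.etree

print(bs4.__version__)          # version of beautifulsoup4
print(lxml.etree.LXML_VERSION)  # version tuple of lxml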
The simplest usage
# coding:utf-8
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://tieba.baidu.com/')
bsObj = BeautifulSoup(html, 'lxml')  # convert the HTML response into a BeautifulSoup object
print(bsObj.title)
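If lxml is not available, the standard library's parser works with the same API; a minimal sketch, differing from the example above only in the parser argument:
# coding:utf-8
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://tieba.baidu.com/')
bsObj = BeautifulSoup(html, 'html.parser')  # built-in parser, no extra install
print(bsObj.title.get_text())               # title text without the surrounding <title> tags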
Finding tags by name and attributes
The find_all method
# coding:utf-8
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://movie.douban.com')
bsObj = BeautifulSoup(html, 'lxml')
liList = bsObj.find_all('li', {'class': 'title'})  # find tags by their name and attributes
for li in liList:
    print(li.a.get_text())  # get the text inside the <a> tag
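find_all returns a list of every matching tag; its counterpart find returns only the first match, or None if there is none. A minimal sketch reusing the soup object above (it assumes the <a> tag carries an href attribute):
firstLi = bsObj.find('li', {'class': 'title'})  # first match only, or None
if firstLi is not None:
    print(firstLi.a.get_text())
    print(firstLi.a['href'])  # attributes are read with dict-style indexing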
Using the parent node when a tag has no distinguishing attributes
# coding:utf-8
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://movie.douban.com')
bsObj = BeautifulSoup(html, 'lxml')
liList = bsObj.find_all('li', {'class': 'ui-slide-item'})
for li in liList:
    ul = li.children
    for child in ul:  # children is an iterator over the direct child nodes, so loop to inspect each one
        print(child)
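Navigation also works upward and deeper: besides .children, every tag exposes .parent and .descendants. A minimal sketch on the same page, included only to show these attributes (plain text nodes have name set to None and are skipped):
# coding:utf-8
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://movie.douban.com')
bsObj = BeautifulSoup(html, 'lxml')
li = bsObj.find('li', {'class': 'ui-slide-item'})
if li is not None:
    print(li.parent.name)          # the tag that directly contains this <li>
    for node in li.descendants:    # every node nested inside, at any depth
        if node.name is not None:  # skip plain text nodes
            print(node.name)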
Batch-downloading images with a regular expression
# coding:utf-8
import random
import re
from urllib.request import urlopen, Request, urlretrieve
from bs4 import BeautifulSoup
def get_html(url, headers):
    """
    Fetch a page that returns 403 Forbidden for the default urllib User-Agent.
    :param url: the page to request
    :param headers: a list of User-Agent strings to choose from at random
    :return: the response object
    """
    random_header = random.choice(headers)
    req = Request(url)
    req.add_header('User-Agent', random_header)
    req.add_header('Host', 'tieba.baidu.com')
    req.add_header('Referer', 'http://tieba.baidu.com/p/4792769205')
    html = urlopen(req)
    return html
url = 'http://tieba.baidu.com/p/4792769205'
# Build the headers list below with your own browser's User-Agent
my_headers = ['Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36']
html = get_html(url, my_headers)
bsObj = BeautifulSoup(html, 'lxml')
imageList = bsObj.find_all('img', {'src': re.compile(r'http://imgsrc\.baidu\.com/forum/w%3D580/sign=.+\.jpg')})
for index, image in enumerate(imageList):
    imageUrl = image['src']
    imageLocation = '/home/wangdongdong/test/' + str(index + 1) + '.jpg'  # point this at a directory that exists on your machine
    urlretrieve(imageUrl, imageLocation)
    print("Image", index + 1, "downloaded")