Python Crawlers and a Delicious Soup: BeautifulSoup
Further reading:
Implementing a web crawler in Python 3 (2) – Using BeautifulSoup (1)
Implementing a web crawler in Python 3 (3) – Using BeautifulSoup (2)
Implementing a web crawler in Python 3 (4) – Using BeautifulSoup (3)
Installation
1. In PyCharm, install the package bs4
2. pip install beautifulsoup4
Going further
Install lxml as well: add the lxml package in PyCharm, or pip install lxml
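To verify that both packages are importable, a quick sanity check (a minimal sketch; it only prints whatever versions happen to be installed):
# coding:utf-8
import bs4
import lxml.etree

print(bs4.__version__)          # version of beautifulsoup4
print(lxml.etree.LXML_VERSION)  # version tuple of lxml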
The simplest usage
# coding:utf-8
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://tieba.baidu.com/')
bsObj = BeautifulSoup(html, 'lxml')  # convert the HTML response into a BeautifulSoup object
print(bsObj.title)
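If lxml is not available, the standard library's parser works with the same API; a minimal sketch, differing from the example above only in the parser argument:
# coding:utf-8
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://tieba.baidu.com/')
bsObj = BeautifulSoup(html, 'html.parser')  # built-in parser, no extra install
print(bsObj.title.get_text())               # title text without the surrounding <title> tags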
Finding tags by name and attributes
The find_all method
# coding:utf-8
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://movie.douban.com')
bsObj = BeautifulSoup(html, 'lxml')
liList = bsObj.find_all('li', {'class': 'title'})  # find tags by their name and attributes
for li in liList:
    print(li.a.get_text())  # get the text inside the <a> tag
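find_all returns a list of every matching tag; its counterpart find returns only the first match, or None if there is none. A minimal sketch reusing the soup object above (it assumes the <a> tag carries an href attribute):
firstLi = bsObj.find('li', {'class': 'title'})  # first match only, or None
if firstLi is not None:
    print(firstLi.a.get_text())
    print(firstLi.a['href'])  # attributes are read with dict-style indexing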
Using the parent node when a tag has no distinguishing attributes
# coding:utf-8
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://movie.douban.com')
bsObj = BeautifulSoup(html, 'lxml')
liList = bsObj.find_all('li', {'class': 'ui-slide-item'})
for li in liList:
    ul = li.children
    for child in ul:  # children is an iterator over the direct child nodes, so loop to inspect each one
        print(child)
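Navigation also works upward and deeper: besides .children, every tag exposes .parent and .descendants. A minimal sketch on the same page, included only to show these attributes (plain text nodes have name set to None and are skipped):
# coding:utf-8
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://movie.douban.com')
bsObj = BeautifulSoup(html, 'lxml')
li = bsObj.find('li', {'class': 'ui-slide-item'})
if li is not None:
    print(li.parent.name)          # the tag that directly contains this <li>
    for node in li.descendants:    # every node nested inside, at any depth
        if node.name is not None:  # skip plain text nodes
            print(node.name)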
Batch-downloading images with a regular expression
# coding:utf-8
import random
import re
from urllib.request import urlopen, Request, urlretrieve
from bs4 import BeautifulSoup
def get_html(url, headers):
    """
    Fetch a page that returns 403 Forbidden for the default urllib User-Agent.
    :param url: the page to request
    :param headers: a list of User-Agent strings to choose from at random
    :return: the response object
    """
    random_header = random.choice(headers)
    req = Request(url)
    req.add_header('User-Agent', random_header)
    req.add_header('Host', 'tieba.baidu.com')
    req.add_header('Referer', 'http://tieba.baidu.com/p/4792769205')
    html = urlopen(req)
    return html
url = 'http://tieba.baidu.com/p/4792769205'
# Build the headers list below with your own browser's User-Agent
my_headers = ['Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36']
html = get_html(url, my_headers)
bsObj = BeautifulSoup(html, 'lxml')
imageList = bsObj.find_all('img', {'src': re.compile(r'http://imgsrc\.baidu\.com/forum/w%3D580/sign=.+\.jpg')})
for index, image in enumerate(imageList):
    imageUrl = image['src']
    imageLocation = '/home/wangdongdong/test/' + str(index + 1) + '.jpg'  # point this at a directory that exists on your machine
    urlretrieve(imageUrl, imageLocation)
    print("Image", index + 1, "downloaded")