Scraping hotel reviews with Python + Selenium + BeautifulSoup

Article link: https://blog.youkuaiyun.com/keaideciel/article/details/115280862


Environment and setup

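This article uses Python 3.8.5 from the Anaconda distribution.
Install selenium
Open a command prompt (cmd) and run the command below. It also installs beautifulsoup4 and lxml, which the code later in this article imports.

pip install selenium beautifulsoup4 lxml -i https://pypi.tuna.tsinghua.edu.cn/simple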

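Download the driver
Firefox needs the Firefox driver, geckodriver. Netdisk link:
Extraction code: 1234
Add the environment variable
The simplest option is to put geckodriver.exe in the same directory as python.exe, so that it is found whenever that directory is on PATH.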
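
Before launching the browser, it can help to confirm that the driver is actually reachable. The following is a minimal sketch using only the standard library; the helper name `find_driver` is hypothetical and not part of Selenium.

```python
import shutil


def find_driver(name="geckodriver"):
    """Return the full path of the driver executable if it is on PATH, else None."""
    # shutil.which searches the directories listed in PATH
    # (on Windows it also tries extensions such as .exe).
    return shutil.which(name)


path = find_driver()
if path is None:
    print("geckodriver not found on PATH")
else:
    print("geckodriver found at", path)
```

If the check reports that the driver is missing, either move geckodriver.exe into a directory that is on PATH or add its directory to PATH.
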
Other
For more on installing and configuring Selenium, see "python selenium firefox usage guide" (python selenium firefox使用详解):

Setup code

Suppose the goal is to scrape the more than 4,000 reviews of one hotel on Ctrip.
Target hotel link:
Code part 1:

from selenium import webdriver
from bs4 import BeautifulSoup
import time

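# Use the Firefox browser
browser = webdriver.Firefox()
# Open the page by URL; a browser window pops up and loads the page
browser.get('https://hotels.ctrip.com/hotels/346754.html')
# Implicit wait: element lookups such as find_element keep retrying for up to
# 10 seconds before failing. It does not wait for the whole page to finish loading.
browser.implicitly_wait(10)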

For other ways to wait, see "The three kinds of waits in Selenium" (Selenium的三种等待方式):
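
To illustrate what an explicit wait does, here is a minimal sketch of the polling idea in plain Python, so it runs without a browser. The function `wait_until` and the stand-in condition `ready` are hypothetical names for illustration; in a real script, Selenium's own `WebDriverWait(browser, timeout).until(...)` plays this role.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Raises TimeoutError if the condition is still falsy after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Stand-in condition that only succeeds on the third call,
# imitating an element that appears after a short delay.
attempts = []


def ready():
    attempts.append(1)
    return len(attempts) >= 3


print(wait_until(ready, timeout=2.0, poll=0.01))
```

Unlike a fixed `time.sleep`, this returns as soon as the condition holds and fails loudly when it never does.
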

Scraping code

Code part 2:

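from selenium.webdriver.common.by import By

# Number of review pages to scrape
pagenum = 360

for n in range(pagenum):
    # Parse the current page; the second argument selects the parser
    soup = BeautifulSoup(browser.page_source, "lxml")
    # Find every div tag whose class attribute is "comment"
    review = soup.find_all("div", attrs={'class': 'comment'})
    # List of reviews found on this page
    reviewlist = []
    # Take the first p tag inside each comment div and add it to reviewlist
    for r in review:
        rvw = r.find_all("p")
        # Skip comment divs that contain no p tag, to avoid an IndexError
        if rvw:
            reviewlist.append(rvw[0])
    # Append the scraped reviews to a.txt
    with open('a.txt', mode='a', encoding='gb18030', errors='ignore') as f:
        for rvw in reviewlist:
            f.write("****************************************************************\n")
            f.write(str(rvw))
            f.write('\n')
    # Print how many times the loop has run so far
    print(n," time\n")
    # Selenium simulates a click on the "next page" arrow to turn the review page.
    # find_element_by_xpath was removed in Selenium 4.3; find_element(By.XPATH, ...)
    # locates the same i tag with class "u-icon u-icon-arrowRight".
    browser.find_element(By.XPATH, '//i[@class="u-icon u-icon-arrowRight"]').click()
    # Implicit wait of 10 seconds (already set earlier; repeating it is harmless)
    browser.implicitly_wait(10)
    # Forced 0.1-second pause, to be safe
    time.sleep(0.1)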