A Simple Scraper for Popular Short Reviews of Mulan on Douban Movies

Latest recommended article published on 2023-03-08 12:05:32

Jaeshn

Category: Web scraping. Tags: python

Article link: https://blog.youkuaiyun.com/cyx_1103/article/details/108737843

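1 Preparation

The work has three main parts:

Scrape the web page data
Clean the data
Display the result as a word cloud

The libraries used are time, jieba, re, wordcloud, numpy, matplotlib, selenium, and PIL.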
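
Before starting, it can help to confirm that these libraries are importable in the current environment. The following is a minimal sketch using only the standard library; the names `REQUIRED` and `missing_modules` are illustrative and not part of the original post. Note that the `PIL` import name is provided by the Pillow package.

```python
import importlib.util

# Import names taken from the list above. time and re ship with Python;
# the others are third-party packages (PIL is installed as "Pillow").
REQUIRED = ["time", "re", "jieba", "wordcloud", "numpy", "matplotlib", "selenium", "PIL"]


def missing_modules(names):
    """Return the names from `names` that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All required libraries are available.")
```

Anything reported as missing can then be installed before running the scripts below.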

2 Process

I. Log in to Douban

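import time
import jieba
import re
import wordcloud
import numpy as np
import matplotlib.pyplot as plt
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from PIL import Image

username = 'enter your Douban username here'
password = 'enter your Douban password here'

# Optional: run Chrome in headless mode (no window); works on both Windows and Linux
# opt = webdriver.ChromeOptions()
# opt.add_argument('--headless')
# driver = webdriver.Chrome(options=opt)

# Google Chrome is the browser used here; this is the path to the ChromeDriver executable
chrome_driver = r'E:\Anaconda3\Lib\site-packages\selenium\webdriver\chrome\chromedriver.exe'
# Selenium 4 style. The original post targeted Selenium 3, which used
# webdriver.Chrome(executable_path=chrome_driver) instead.
driver = webdriver.Chrome(service=Service(chrome_driver))

driver.get('https://accounts.douban.com/passport/login')
# Wait for the page to finish loading
time.sleep(3)
# The login page has a phone-login tab and a password-login tab, so switch to password login
driver.find_element(By.XPATH, "/html/body/div/div[2]/div[2]/div/div[1]/ul[1]/li[2]").click()
time.sleep(1)
# Find the account field and type the username
driver.find_element(By.XPATH, '//*[@id="username"]').send_keys(username)
# Find the login button (clicked later, once both fields are filled)
ret = driver.find_element(By.XPATH, '//*[@id="account"]/div[2]/div[2]/div/div[2]/div[1]/div[4]/a')
time.sleep(1)
# Find the password field and type the password
driver.find_element(By.XPATH, '//*[@id="password"]').send_keys(password)
# Make sure both the username and the password have been entered
time.sleep(3)
# Captcha recognition (locating the captcha image) is not implemented here
# Click the login button
ret.click()
time.sleep(10)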
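
The login steps rely on fixed `time.sleep` pauses, which either waste time or fail when the page is slow. A common alternative is to poll for a condition until it holds or a deadline passes. The helper below is a minimal standard-library sketch of that idea; `wait_until` and `page_ready` are hypothetical names, not part of the original post. Selenium ships its own version of this technique as `WebDriverWait`.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value, or raises TimeoutError once the deadline is reached.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


if __name__ == "__main__":
    # Stand-in for "has the page finished loading?": becomes true on the third poll.
    polls = {"count": 0}

    def page_ready():
        polls["count"] += 1
        return polls["count"] >= 3

    wait_until(page_ready, timeout=5, interval=0.1)
    print("ready after", polls["count"], "polls")
```

With a live driver, the condition could be something like `lambda: driver.find_elements(By.ID, "username")` (with `By` imported from `selenium.webdriver.common.by`), since `find_elements` returns an empty, falsy list until the element appears.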

II. Fetch the review data and write it to a file

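commentList = []
# Compile the pattern once; re.S lets "." match newlines inside a review
pattern = re.compile(r'<span class="short">(.*?)</span>', re.S)
# start = 0, 20, ..., 500: requests 26 pages of up to 20 reviews each
for i in range(0, 501, 20):
    url = 'https://movie.douban.com/subject/26357307/comments?start='+str(i)+'&limit=20&sort=new_score&status=P'
    driver.get(url)
    source = driver.page_source
    comments = pattern.findall(source)
    commentList.extend(comments)
    # Pause between requests so the site is not hit too quickly
    time.sleep(5)

# Write one review per line
with open('shortComments.txt', 'w', encoding='utf-8') as f:
    for comment in commentList:
        f.write(comment.replace('\n', ' ') + '\n')
# quit() closes every window and ends the ChromeDriver process
driver.quit()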
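
The paging and extraction logic can be tried without a browser. The sketch below rebuilds the page URLs and runs the same regular expression on a hand-written HTML fragment; `page_urls`, `extract_comments`, and the sample markup are illustrative assumptions, and the real Douban page structure may differ or change. It also decodes HTML entities such as `&amp;`, which the page source may contain.

```python
import html
import re

BASE_URL = "https://movie.douban.com/subject/26357307/comments"
# re.S lets "." match newlines, so multi-line reviews are captured too.
SHORT_PATTERN = re.compile(r'<span class="short">(.*?)</span>', re.S)


def page_urls(total=500, page_size=20):
    """Build the comment-page URLs for offsets 0, page_size, ..., up to `total`."""
    return [
        f"{BASE_URL}?start={start}&limit={page_size}&sort=new_score&status=P"
        for start in range(0, total + 1, page_size)
    ]


def extract_comments(page_source):
    """Return the short reviews found in one page, with HTML entities decoded
    and whitespace collapsed so that each review fits on a single line."""
    comments = []
    for raw in SHORT_PATTERN.findall(page_source):
        text = html.unescape(raw)
        comments.append(" ".join(text.split()))
    return comments


if __name__ == "__main__":
    # Hand-written sample, not real Douban markup.
    sample = (
        '<span class="short">Great &amp; moving</span>'
        '<span class="short">Two\nlines</span>'
    )
    print(extract_comments(sample))
    print(len(page_urls()))
```

Separating URL construction from extraction makes each part easy to check on its own before pointing the driver at the live site.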

III. Data cleaning and word cloud

This step uses jieba for word segmentation and wordcloud to display the result.

# Read all the saved reviews
with open('shortComments.txt', 'r', encoding='utf-8') as f:
    com = f.read()
wordlist = jieba.cut(com, cut_all=False)  # accurate mode
wl = " ".join(wordlist)
# Load the mask image that gives the word cloud its shape
mask = np.array(Image.open(r'G:\Python学习\【爬虫】电影影评分析\shape\cat.png'))
# Load the stopwords, one per line
with open('shortComments_douban.txt', 'r', encoding='utf-8') as stopwords_file:
    stopwords = {line.strip() for line in stopwords_file}
# Configure the word cloud (with a few extra parameters)
wc = wordcloud.WordCloud(background_color='white',  # background color
                        mask=mask,  # mask image
                        max_words=20000,  # maximum number of words
                        stopwords=stopwords,  # words to exclude
                        font_path=r"C:\Windows\Fonts\STXINWEI.TTF",  # font with Chinese glyphs
                        max_font_size=100,  # largest font size
                        random_state=30,  # random seed, so layout and colors are reproducible
                        width=1200,  # width and height are ignored when a mask is given
                        height=800
).generate(wl)  # build the word cloud

# Display the word cloud
plt.figure(figsize=(8, 6))  # figure size in inches
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")  # hide the axes
plt.show()
wc.to_file('result.jpg')  # save as an image
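
To see which words will dominate the cloud before rendering it, the segmented tokens can be counted directly. This is a minimal sketch with `collections.Counter`; `word_frequencies` and the `min_length` cutoff are illustrative choices rather than part of the original post, and the token list is hand-written to stand in for the output of `jieba.cut`.

```python
from collections import Counter


def word_frequencies(tokens, stopwords, min_length=2):
    """Count tokens, skipping stopwords, whitespace, and very short tokens."""
    counts = Counter()
    for token in tokens:
        word = token.strip()
        if len(word) < min_length or word in stopwords:
            continue
        counts[word] += 1
    return counts


if __name__ == "__main__":
    # Hand-written tokens standing in for the output of jieba.cut.
    tokens = ["花木兰", "的", "电影", "花木兰", " ", "特效", "电影", "花木兰"]
    stopwords = {"的"}
    print(word_frequencies(tokens, stopwords).most_common(2))
```

The resulting counts could also be passed to WordCloud's `generate_from_frequencies` in place of `generate`, which gives direct control over the filtering.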