Scraping a 优快云 Blog with Python and Analyzing It with a WordCloud Word Cloud

License: CC 4.0 BY-SA

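This article shows how to scrape web page content with Python: fetching pages with the requests library, parsing the HTML with BeautifulSoup, and analyzing the scraped data with a word cloud.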

Preface

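I'm not writing about machine learning algorithms this week. Next week I'll cover the Logistic classification algorithm, which is one of the more important algorithms in machine learning and also involves an application of gradient descent.
This week let's play with something fun: web scraping.
Scraping has quietly become popular lately, and there are quite a few ways to scrape web content, for example: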

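With Python: web content is usually scraped by combining the requests library with the BeautifulSoup4 library.
With R: the rvest, magrittr, and xml2 packages; it feels much the same as scraping with Python.
With software: commercial tools such as 八爪鱼, whose only advantage is that you don't have to write a single line of code.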

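Each method has its own strengths. Since I've been using Python lately, this post does the scraping in Python. I usually try scraping with software first and only turn to Python or R for pages that are harder to scrape; personally I find R a bit nicer to use (I like its data frame type). As for how a crawler works, it reminds me of packet capture in computer networking: the crawler is essentially a packaged program that sends HTTP requests and reads back the responses you would otherwise see in a capture.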
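
To make the request side of that analogy concrete, here is a minimal sketch of the HTTP request a scraper sends. It uses the standard library's urllib.request in place of requests so that it runs without network access; the URL and the User-Agent value are hypothetical and nothing is actually sent.

```python
from urllib.request import Request

# Hypothetical target URL, used only for illustration; no request is sent.
url = "http://example.com/blog?viewmode=contents"

# Build the request object a scraper would send to the server.
req = Request(url, headers={"User-Agent": "demo-scraper/0.1"})

print(req.get_method())    # HTTP verb used when no body is attached
print(req.host)            # server the request would go to
print(req.selector)        # path and query string placed on the request line
print(req.header_items())  # headers sent along with the request
```

These are the same fields you would see for this request in a packet capture or in the browser's network panel.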

Implementation

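Before scraping, we need the URL of the page to scrape, here a 优快云 blog. You can change the URL to whatever page you want to scrape; I picked this page because it carries a lot of information and is fairly easy to scrape.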

After choosing the page, open the browser's inspector (Ctrl+Shift+C in Firefox) and look at the HTML of the content you want to scrape.

Here I want to scrape five items: article title, publication date, view count, article content, and comments.
Here is the code:

import requests
from bs4 import BeautifulSoup

# Fetch the blog's article-list page
r = requests.get('http://blog.youkuaiyun.com/qq_34739497?viewmode=contents')
r.raise_for_status()  # raise an exception on a 4xx/5xx response
r.encoding = 'utf-8'
print(r.text)  # dump the raw HTML for inspection
soup = BeautifulSoup(r.text, "html.parser")
title = soup.find_all('span', class_="link_title")  # every <span class="link_title"> element
date = soup.find_all('span', class_="link_postdate")
readnum = soup.find_all('span', class_="link_view")
Title = []
Date = []
ReadNum = []
Href = []
Content = []
Comment = []

for te in title:
    Title.append(te.get_text(strip=True))  # title text without the surrounding whitespace
    Href.append(te.find('a')['href'])  # relative link to the article, read from the <a> tag
for da in date:
    Date.append(da.text)
for rn in readnum:
    ReadNum.append(rn.text.replace(')', '(').split('(')[1])  # the number between the parentheses
for href in Href:
    # Fetch each article page and pull out its body and comments
    r = requests.get("http://blog.youkuaiyun.com" + href)
    r.raise_for_status()
    r.encoding = 'utf-8'
    soup = BeautifulSoup(r.text, "html.parser")
    content = soup.find_all('div', class_="markdown_views")
    if len(content) == 0:
        # Articles not written in Markdown keep their body in a different div
        content = soup.find_all('div', id="article_content")
    Content.append(content[0].text.split('\n'))
    comment = soup.find_all('ul', class_='panel_body itemlist')
    if len(comment) > 2:
        # Append instead of overwriting, so every article's comments are kept
        Comment.append(comment[2].text.replace("\n", " ").split("\r"))
# chr(12288) is the full-width space, used as the fill character so CJK columns line up
print("{1:{0}^20}{2:{0}^10}{3:{0}^10}".format(chr(12288), "文章名", "发布日期", "阅读数"))
for i in range(len(Title)):
    print("{1:{0}^20}{2:{0}^10}{3:{0}^10}".format(chr(12288), str(Title[i]), str(Date[i]), str(ReadNum[i])))

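Here we first use the requests library to fetch the page and get it as a string, then use BeautifulSoup to build a parse tree according to HTML and XML syntax, from which the content can be extracted precisely. For the detailed syntax, see the blog post Python爬虫入门.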
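
The parsing step can be tried without any network access. The sketch below runs BeautifulSoup on a small hand-written HTML string; the markup is hypothetical and only imitates the structure the script assumes (the class names come from the script, and the "阅读(123)" view-count format is an assumption inferred from the way the script splits on parentheses). The real page layout may differ.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup imitating the article list; not copied from the real page.
html = """
<div>
  <span class="link_title"><a href="/user/article/details/1"> First post </a></span>
  <span class="link_postdate">2017-10-01 12:00</span>
  <span class="link_view">阅读(123)</span>
</div>
<div>
  <span class="link_title"><a href="/user/article/details/2"> Second post </a></span>
  <span class="link_postdate">2017-10-08 09:30</span>
  <span class="link_view">阅读(45)</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")  # build the parse tree
title_spans = soup.find_all("span", class_="link_title")

# Text of each title span, stripped of the padding whitespace
titles = [s.get_text(strip=True) for s in title_spans]
# href attribute of the <a> tag nested inside each title span
hrefs = [s.a["href"] for s in title_spans]
dates = [s.get_text(strip=True) for s in soup.find_all("span", class_="link_postdate")]
# Digits between the parentheses, converted to int
views = [
    int(re.search(r"\((\d+)\)", s.get_text()).group(1))
    for s in soup.find_all("span", class_="link_view")
]

print(titles, hrefs, dates, views)
```

Reading the link from the tag's href attribute and the count with a regular expression avoids depending on fixed character positions, which break as soon as the markup changes slightly.
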
Running the script prints the following:

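[Image: screenshot of the printed output]
Because the posts are long, I don't print their content here. I originally wanted to export the results in Excel format but haven't figured out how yet; that still needs looking into. We can, however, present the article content and the comments as a word cloud.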
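
One possible route to the Excel output mentioned above is to write a CSV file with Python's built-in csv module, since Excel opens CSV files directly. This is a minimal sketch rather than the author's method: the sample values stand in for the scraped Title, Date and ReadNum lists, and the function name write_rows and the file name posts.csv are made up for illustration.

```python
import csv
import io

# Hypothetical sample values standing in for the scraped Title, Date and ReadNum lists.
Title = ["First post", "Second post, with a comma"]
Date = ["2017-10-01 12:00", "2017-10-08 09:30"]
ReadNum = ["123", "45"]


def write_rows(fileobj, titles, dates, views):
    """Write a header row plus one row per article to an open text file."""
    writer = csv.writer(fileobj)
    writer.writerow(["title", "date", "views"])
    writer.writerows(zip(titles, dates, views))


# Write to an in-memory buffer so the sketch runs without touching the disk.
buffer = io.StringIO()
write_rows(buffer, Title, Date, ReadNum)
csv_text = buffer.getvalue()
print(csv_text)

# For a real file that Excel can open (the file name is only an example):
# with open("posts.csv", "w", newline="", encoding="utf-8-sig") as f:
#     write_rows(f, Title, Date, ReadNum)
```

The utf-8-sig encoding writes a byte order mark at the start of the file, which helps Excel recognize the file as UTF-8 so Chinese titles display correctly. The csv module also quotes fields that contain commas.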

Visualizing the Data with a Word Cloud

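For the word cloud we need the WordCloud library. When I installed it with pip I got an error: it seemed to be missing some C++ library and asked for VS to be installed, but the error persisted even after I installed it. Here is a workaround: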

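Download the file wordcloud-1.3.2-cp36-cp36m-win_amd64.whl from http://www.lfd.uci.edu/~gohlke/pythonlibs/#wordcloud (this build targets 64-bit CPython 3.6 on Windows, so pick the file that matches your own Python),
then install it from the directory containing the file, for example with pip install wordcloud-1.3.2-cp36-cp36m-win_amd64.whl.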

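from wordcloud import WordCloud
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image  # scipy.misc.imread has been removed from recent SciPy releases
import jieba

# Save the scraped content to a text file, then read it back with the same encoding
with open('C:\\Users\\user\\Desktop\\1.txt', 'w', encoding='utf-8') as fo:
    fo.write(str(Content))
with open('C:\\Users\\user\\Desktop\\1.txt', 'r', encoding='utf-8') as fi:
    f = fi.read()
# Image whose shape is used as the mask for the word cloud
back_coloring = np.array(Image.open('C:\\Users\\user\\Desktop\\1.png'))
txt = jieba.lcut(f)  # segment the Chinese text into a list of words
text = {}
for word in txt:
    if len(word) == 1:  # skip single characters and punctuation
        continue
    text[word] = text.get(word, 0) + 1  # count how often each word occurs
# A font with Chinese glyphs is required, otherwise the words render as boxes
wordcloud = WordCloud(background_color="black", mask=back_coloring, font_path=r'C:\Windows\Fonts\simfang.ttf', width=2000, height=1600, margin=2).generate_from_frequencies(text)
plt.figure()
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
wordcloud.to_file('C:\\Users\\user\\Desktop\\2.png')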
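
The word-counting loop in the script can also be written with collections.Counter from the standard library. The sketch below is a possible alternative, not part of the original script; the token list is a hypothetical stand-in for the output of jieba.lcut, so it runs without jieba installed.

```python
from collections import Counter

# Hypothetical pre-segmented tokens standing in for the result of jieba.lcut(...).
tokens = ["爬虫", "的", "数据", "爬虫", "python", "，", "数据", "爬虫"]

# Count every token longer than one character, as the loop in the script does.
freq = Counter(t for t in tokens if len(t) > 1)

print(freq.most_common(2))  # the two most frequent words with their counts
```

Because Counter is a subclass of dict, it should be usable wherever the script passes its text dictionary, including generate_from_frequencies; most_common is also handy for checking the top words before drawing the cloud.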