A Complete Course Project

weixin_30715523

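License: CC 4.0 BY-SA

Tags: python

Original link: http://www.cnblogs.com/45hjq/p/7762553.html

This post walks through a worked example of scraping a novel website with Python: the page source is parsed with BeautifulSoup, the novel's text is downloaded, and a word cloud is generated from it, covering the full workflow from start to finish.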

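The topic I chose is scraping a novel website.

Assignment requirements

Pick a topic you are interested in.

Scrape related data from the web.

Run a text analysis and generate a word cloud.

Explain the results of the text analysis.

Write a complete blog post that includes the source code, the scraped data, and the analysis results, so that it forms a presentable piece of work.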

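1. Open "http://www.xs.la/46_46454/" in the 2345 Accelerated Browser, then right-click an empty area of the page and choose the View Source option.

The page source shows the markup behind the chapter titles, and the title and link of each entry can be read directly from it.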
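To illustrate what "reading the title and link from the source" means, here is a minimal sketch that pulls (title, link) pairs out of a chapter list. The sample markup, the `ChapterLinkParser` class, and the `listmain` container are assumptions made for illustration, and the real page may be structured differently. It uses only the standard library so it runs without installing anything; the scraper in this post does the same job with BeautifulSoup.

```python
from html.parser import HTMLParser

# Hypothetical sample of a chapter list; the real page markup may differ.
SAMPLE_HTML = """
<div class="listmain">
  <dl>
    <dd><a href="/30_30398/1.html">Chapter 1</a></dd>
    <dd><a href="/30_30398/2.html">Chapter 2</a></dd>
  </dl>
</div>
<div class="footer"><a href="/about.html">About</a></div>
"""


class ChapterLinkParser(HTMLParser):
    """Collect (title, href) pairs for <a> tags inside <div class="listmain">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0        # > 0 while we are inside the target div
        self.current_href = None  # href of the <a> currently being read
        self.chapters = []        # collected (title, href) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.div_depth > 0:
                self.div_depth += 1   # nested div inside the target
            elif "listmain" in (attrs.get("class") or "").split():
                self.div_depth = 1    # entering the target div
        elif tag == "a" and self.div_depth > 0:
            self.current_href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1
        elif tag == "a":
            self.current_href = None

    def handle_data(self, data):
        # Text that appears while an <a> is open is the chapter title.
        if self.current_href and data.strip():
            self.chapters.append((data.strip(), self.current_href))


parser = ChapterLinkParser()
parser.feed(SAMPLE_HTML)
for title, href in parser.chapters:
    print(title, href)
```

The link in the footer is skipped because it sits outside the `listmain` container, which is the same reason the scraper narrows its search to that div before collecting `<a>` tags.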

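2. Fetch the entire content of the novel from the site

Once the data has been scraped, it can be analyzed and tallied. The code is as follows: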

from bs4 import BeautifulSoup
import requests

# Collect the URL of every chapter page of the novel
def get_urls():
    req = requests.get(url=aim)
    html = req.text
    div_bf = BeautifulSoup(html, 'html.parser')
    div = div_bf.find_all('div', class_='listmain')
    a_bf = BeautifulSoup(str(div[0]), 'html.parser')
    a = a_bf.find_all('a')
    nums = len(a[15:])                                      # skip the duplicated chapters at the top
    for each in a[15:]:
        names.append(each.string)
        urls.append(main + each.get('href'))
    return nums

# Append one chapter (title plus text) to a local file
def writer(name, path, text):
    with open(path, 'a', encoding='utf-8') as f:
        f.write(name + '\n')
        f.writelines(text)
        f.write('\n\n')

# Fetch the text of one chapter
def get_contents(target):
    req = requests.get(url=target)
    html = req.text
    bf = BeautifulSoup(html, 'html.parser')
    texts = bf.find_all('div', class_='showtxt')
    # Paragraphs are separated by runs of eight non-breaking spaces
    texts = texts[0].text.replace('\xa0' * 8, '\n\n')
    return texts



main = 'http://www.biqukan.com/'
aim = 'http://www.biqukan.com/30_30398/'
names = []                                                   # chapter titles
urls = []                                                    # chapter links
nums = get_urls()                                            # number of chapters

print('Starting novel download:')
for i in range(nums):
    writer(names[i], '小说.txt', get_contents(urls[i]))
print('Novel download finished')
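Two small steps in the scraper above can be tried out without touching the network: building a chapter URL from a relative `href`, and turning the runs of non-breaking spaces into paragraph breaks. The sketch below is illustrative only. The helper names `absolute_url` and `clean_chapter_text` are hypothetical, the sample `href` assumes the site uses root-relative links, and the final `strip()` is an addition not present in the original code. `urllib.parse.urljoin` is used in place of plain string concatenation because it avoids a doubled slash when the `href` already starts with `/`.

```python
from urllib.parse import urljoin


def absolute_url(base, href):
    """Resolve a chapter href against the site root."""
    return urljoin(base, href)


def clean_chapter_text(raw):
    """Turn runs of eight non-breaking spaces into paragraph breaks."""
    # strip() is an extra step (not in the original) that drops the
    # leading blank lines left by the first run of spaces.
    return raw.replace("\xa0" * 8, "\n\n").strip()


# Hypothetical inputs for illustration only.
url = absolute_url("http://www.biqukan.com/", "/30_30398/1.html")
print(url)

sample = "\xa0" * 8 + "First paragraph." + "\xa0" * 8 + "Second paragraph."
cleaned = clean_chapter_text(sample)
print(cleaned)
```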

3. Run the text analysis and generate a word cloud.

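import jieba
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Read the downloaded novel line by line
with open('I:\\大四\\python\\大作业\\小说.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()

# Segment each line with jieba and join the words with spaces
data = ''
for line in lines:
    data += ' '.join(jieba.cut(line)) + ' '

# A font that contains Chinese glyphs is needed for the words to render
my_wordcloud = WordCloud(font_path='I:\\大四\\python\\py5\\msyh.ttf').generate(data)
plt.imshow(my_wordcloud)
plt.axis('off')
plt.show()
my_wordcloud.to_file('小说.png')                              # save the word cloud image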
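In a word cloud, the more often a word occurs, the larger it is drawn, so a plain frequency table is one way to back up an explanation of the picture. The following is a minimal sketch, not part of the original assignment code: it assumes the text has already been segmented into space-separated words (as the `' '.join(jieba.cut(...))` step does), and the `top_words` helper, the stop-word set, and the sample sentence are all made up for illustration.

```python
from collections import Counter

# Hypothetical stop words; a real analysis would use a fuller list.
STOP_WORDS = {"一个", "没有", "什么"}


def top_words(segmented_text, n=10, stop_words=STOP_WORDS):
    """Return the n most frequent words in space-separated text."""
    tokens = [
        token
        for token in segmented_text.split()
        # Drop stop words and single-character tokens, which are mostly
        # particles and punctuation in segmented Chinese text.
        if token not in stop_words and len(token) > 1
    ]
    return Counter(tokens).most_common(n)


# Made-up sample standing in for the segmented novel text.
sample = "少年 一个 江湖 少年 的 剑 江湖 少年 一个 一个 一个"
result = top_words(sample, n=2)
for word, count in result:
    print(word, count)
```

Listing the top words next to the image makes it easier to say which characters or themes dominate the novel, rather than judging by eye from the picture alone.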

Reposted from: https://www.cnblogs.com/45hjq/p/7762553.html