7 Python Web Scraping Examples with Runnable, Heavily Commented Code (Worth Bookmarking): 1. Scraping Article Info from cnblogs (博客园)

摇光~

Last modified: 2024-12-27 13:26:01

First published: 2024-11-01 09:31:53

This is an original article by the author. Do not repost it without the author's permission.

Article link: https://blog.youkuaiyun.com/qq_41877371/article/details/143416570

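This series covers seven Python scraping examples, published as seven separate posts. This first post scrapes article listings from cnblogs (博客园).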

A scraper has three main parts:
1. Fetching the data
2. Parsing the data
3. Writing the data to a file

The complete code comes first; the sections after it walk through it step by step.

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Base URL and request headers
url_root = 'https://www.cnblogs.com/sitehome/p/'
headers = {
    "Referer":"https://www.cnblogs.com/",
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"
}

# Build the page URLs
n = range(2,10)                         # pages 2 to 9 (range excludes the end value)
urls = [url_root+f'{i}' for i in n]     # one URL per listing page

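# Extract the article info from one listing page: URL, title, author, comment count, recommend (digg) count, view count
def get_single_article_info(url):

    resp = requests.get(url,headers=headers,timeout=10)     # send the request (named resp so it does not shadow the re module)
    if resp.status_code != 200:                             # the request failed
        print(f'error! status code {resp.status_code}, url: {url}')
        return []                                           # skip this page instead of parsing an error page
    soup = BeautifulSoup(resp.text,"html.parser")           # create a BeautifulSoup object to parse the HTML

    post_list = soup.find('div',id = 'post_list',class_ = 'post-list')     # container that holds the posts
    if post_list is None:                                   # the expected container is missing
        return []
    articles = post_list.find_all('article',class_ = 'post-item')          # one <article> per post

    data = []                                               # list that collects one row per article
    for article in articles:
        href,title,author,comment,recomment,views = '','','',0,0,0     # defaults, reset for every article

        infos = article.find_all('a')
        for info in infos:
            if 'post-item-title' in str(info):
                href = info.get('href','')
                title = info.get_text()
            if 'post-item-author' in str(info):
                author = info.get_text().strip()
            if 'icon_comment' in str(info):
                comment = info.get_text().strip()
            if 'icon_digg' in str(info):
                recomment = info.get_text().strip()
            if 'icon_views' in str(info):
                views = info.get_text().strip()

        data.append([href,title,author,comment,recomment,views])        # add this article's row to data

    return data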

# Loop over the URLs and collect the rows from each page
data = []
for i,url in enumerate(urls,start=1):
    print(f'Scraping page {i}, url: {url}')
    single_data = get_single_article_info(url)
    data.extend(single_data)

# Final log
print(f'Done, scraped {len(urls)} pages')

# Write to Excel
df = pd.DataFrame(data,columns=['href','title','author','comment','recomment','views'])
df.to_excel('文章信息爬取结果.xlsx',index=True)  # needs openpyxl installed; index=True adds the row index

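1. Fetching the data

Before scraping, we need to know the URL of the data we want.
The cnblogs article listing is split across pages, so we build one URL per page.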

#-------------------------------------------------------------------------------------------------------------------
# Import the required packages
import requests
from bs4 import BeautifulSoup
import pandas as pd
#-------------------------------------------------------------------------------------------------------------------
# Base URL and request headers
url_root = 'https://www.cnblogs.com/sitehome/p/'            # base URL; the page number is appended to it
headers = {                                                 # HTTP request headers
    "Referer":"https://www.cnblogs.com/",
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"
}
#-------------------------------------------------------------------------------------------------------------------
# Build the list of page URLs
n = range(2,10)                         # pages 2 to 9 (range excludes the end value, so page 10 is not included)
urls = [url_root+f'{i}' for i in n]     # one URL per listing page
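
Because range(2,10) stops before 10, the list above covers pages 2 through 9. As a minimal sketch, the URL construction can be wrapped in a small helper that takes an inclusive page range. The name build_page_urls and its parameters are hypothetical and not part of the original code.

```python
def build_page_urls(url_root, first_page, last_page):
    """Return one listing URL per page, from first_page to last_page inclusive."""
    # range() excludes its end value, so add 1 to include last_page
    return [f'{url_root}{page}' for page in range(first_page, last_page + 1)]


urls = build_page_urls('https://www.cnblogs.com/sitehome/p/', 2, 10)
print(len(urls))    # number of pages in the inclusive range 2..10
print(urls[0])      # first page URL
print(urls[-1])     # last page URL
```

Passing the last page explicitly makes it harder to drop a page by accident when the range changes.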

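2. Parsing the data

Here we define a function that returns the parsed data for one page.
It fetches the page, extracts the fields we need from the HTML, strips surrounding whitespace, and returns the result as a list of rows.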

#-------------------------------------------------------------------------------------------------------------------
# Extract the article info from one listing page: URL, title, author, comment count, recommend (digg) count, view count
def get_single_article_info(url):
#-------------------------------------------------------------------------------------------------------------------
# Fetch the whole page behind url
    resp = requests.get(url,headers=headers,timeout=10)     # send the request (named resp so it does not shadow the re module)
    if resp.status_code != 200:                             # the request failed
        print(f'error! status code {resp.status_code}, url: {url}')
        return []                                           # skip this page instead of parsing an error page
    soup = BeautifulSoup(resp.text,"html.parser")           # create a BeautifulSoup object to parse the HTML
#-------------------------------------------------------------------------------------------------------------------
# Use find to locate the post list, then find_all to collect every <article> in it
    post_list = soup.find('div',id = 'post_list',class_ = 'post-list')
    if post_list is None:                                   # the expected container is missing
        return []
    articles = post_list.find_all('article',class_ = 'post-item')

    data = []                                               # list that collects one row per article
    for article in articles:
        href,title,author,comment,recomment,views = '','','',0,0,0     # defaults, reset for every article

        infos = article.find_all('a')
        for info in infos:
            if 'post-item-title' in str(info):
                href = info.get('href','')              # article URL
                title = info.get_text()                 # article title
            if 'post-item-author' in str(info):
                author = info.get_text().strip()        # author name
            if 'icon_comment' in str(info):
                comment = info.get_text().strip()       # comment count (still a string)
            if 'icon_digg' in str(info):
                recomment = info.get_text().strip()     # recommend (digg) count (still a string)
            if 'icon_views' in str(info):
                views = info.get_text().strip()         # view count (still a string)

        data.append([href,title,author,comment,recomment,views])        # add this article's row to data

    return data
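
Matching on str(info) works, but it searches the whole tag including its text, so a title that happens to contain a string such as icon_views would be misread as a counter. One possible alternative, sketched here, looks up the title and author by CSS selector with select_one and converts the counters to integers. The HTML below is a simplified, hypothetical stand-in for one list item (the real cnblogs markup may differ), and parse_article and to_int are illustrative names, not part of the original code.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup for one list item (an assumption, not copied from the site)
SAMPLE_HTML = """
<article class="post-item">
  <a class="post-item-title" href="https://www.cnblogs.com/example/p/1.html">Example title</a>
  <a class="post-item-author" href="https://www.cnblogs.com/example"><span>example_author</span></a>
  <a class="post-meta-item"><svg><use xlink:href="#icon_digg"></use></svg><span>3</span></a>
  <a class="post-meta-item"><svg><use xlink:href="#icon_comment"></use></svg><span>1</span></a>
  <a class="post-meta-item"><svg><use xlink:href="#icon_views"></use></svg><span>120</span></a>
</article>
"""

# Maps the icon name inside a counter link to the field it fills
ICON_TO_FIELD = {'icon_comment': 'comment', 'icon_digg': 'recomment', 'icon_views': 'views'}


def to_int(text, default=0):
    """Convert a counter such as ' 120 ' to int; return default if it is not a number."""
    try:
        return int(text.strip())
    except ValueError:
        return default


def parse_article(article):
    """Build one record from an <article> tag, with defaults for missing fields."""
    title_link = article.select_one('a.post-item-title')
    author_link = article.select_one('a.post-item-author')
    record = {
        'href': title_link.get('href', '') if title_link else '',
        'title': title_link.get_text(strip=True) if title_link else '',
        'author': author_link.get_text(strip=True) if author_link else '',
        'comment': 0, 'recomment': 0, 'views': 0,
    }
    for link in article.select('a.post-meta-item'):
        icon = link.find('use')                      # the icon tag identifies the counter
        icon_markup = str(icon) if icon else ''      # look only at the icon, not the link text
        for icon_name, field in ICON_TO_FIELD.items():
            if icon_name in icon_markup:
                record[field] = to_int(link.get_text())
    return record


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
records = [parse_article(a) for a in soup.select('article.post-item')]
print(records[0])
```

Storing the counters as integers also lets Excel sort and sum those columns as numbers rather than text.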

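3. Writing the data to a file

The function above parses a single URL, so we loop over all the URLs, parse each one, and collect the rows in the data list.
The parsed rows then need to go into a file, so we write data to an Excel file.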

#-------------------------------------------------------------------------------------------------------------------
# Loop over the URLs and parse the data in each one
data = []                                           # empty list that collects all rows
for i,url in enumerate(urls,start=1):               # i counts the pages for the progress log
    print(f'Scraping page {i}, url: {url}')         # progress log: shows which URL is being scraped
    single_data = get_single_article_info(url)      # parse a single URL
    data.extend(single_data)                        # add every row from this page to data
#-------------------------------------------------------------------------------------------------------------------
# Final log
print(f'Done, scraped {len(urls)} pages')           # printed once all pages have been processed
#-------------------------------------------------------------------------------------------------------------------
# Write to Excel
df = pd.DataFrame(data,columns=['href','title','author','comment','recomment','views'])     # put the rows in a DataFrame and name the columns
df.to_excel('文章信息爬取结果.xlsx',index=True)      # write to Excel (needs openpyxl installed); index=True adds the row index
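
DataFrame.to_excel needs an Excel engine such as openpyxl to be installed before it can write .xlsx files. If that is not available, one alternative is to write the same rows with the standard-library csv module. This is a swapped-in technique rather than the original author's method, and save_rows_to_csv, the sample row, and the output filename are all hypothetical.

```python
import csv
import os
import tempfile

COLUMNS = ['href', 'title', 'author', 'comment', 'recomment', 'views']


def save_rows_to_csv(rows, path):
    """Write a header line plus one line per scraped article."""
    # utf-8-sig adds a BOM so that Excel detects UTF-8 and shows Chinese titles correctly
    with open(path, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        writer.writerow(COLUMNS)
        writer.writerows(rows)


# Made-up row in the same shape as the rows in the data list built above
sample_rows = [['https://www.cnblogs.com/example/p/1.html', 'Example title', 'example_author', 1, 3, 120]]
out_path = os.path.join(tempfile.gettempdir(), 'cnblogs_articles_demo.csv')
save_rows_to_csv(sample_rows, out_path)

# Read the file back to check what was written
with open(out_path, newline='', encoding='utf-8-sig') as f:
    saved = list(csv.reader(f))
print(saved[0])     # the header row
```

A CSV file opens in Excel as well, and csv.writer quotes titles that contain commas, so no extra escaping is needed.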