You say: Princess, please learn some web scraping!

Original article · First published 2023-12-01 09:40:22 · Last modified 2024-01-11 15:11:51

License: CC 4.0 BY-SA

Tags: #crawler #python #PythonProgramming #LearningPython

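This article shows how to get started with web scraping in Python in the big data era, covering Python environment setup, the requests and BeautifulSoup libraries, and how to parse an HTML structure to extract information. It then touches on the anti-scraping measures you may run into in real projects and recommends the BrightData platform as a powerful data processing and scraping tool.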

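In the big data era, handling data has become a key problem. How do you find the data you need in a vast ocean of numbers? Why not give web scraping a try!

In this article we start from the very basics of scraping with Python and look at how a complete beginner can get going.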

Prerequisites

Since we will be scraping with Python, we first need a local Python environment. Setting one up is simple:

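😘 Windows 11

On Windows 11, type python at the cmd prompt. If Python is not installed yet, this opens its page in the Microsoft Store, where you just click Get.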

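🐯 Other Windows versions

On other versions of Windows, download the installer from the official website and run it.

Once installation finishes, type python in cmd. If it displays the Python version, the setup is working.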

🐻‍❄️Linux

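On Debian- or Ubuntu-based Linux systems, just run the following commands as root (or prefix them with sudo):

# Refresh the package index
apt-get update
# Install Python 3.8
apt-get install python3.8
# Check the installed version
python3.8 -V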
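Whichever platform you installed on, the interpreter can also report its own version. The following is only a minimal sketch: the `version_ok` helper and the `MIN_VERSION` constant are hypothetical names introduced here, and 3.8 is assumed as the minimum simply because it matches the Linux command above.

```python
import sys

# Assumed minimum version for this tutorial (matches the python3.8 package above).
MIN_VERSION = (3, 8)


def version_ok(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the (major, minor) part of version_info is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


# Report the running interpreter's version and whether it meets the assumed minimum.
print("Python", sys.version.split()[0], "OK" if version_ok() else "too old")
```

Save it to a file and run it with the interpreter you just installed to confirm which Python your scripts will actually use.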

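Common dependencies

Python on its own does not make scraping convenient, so we rely on a few third-party packages. Here is a brief look at the most common ones:

requests

requests is a widely used HTTP library that makes it easy to send HTTP requests to a website and read the response. Installation is simple; just run the following command: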

pip install requests

Usage example:

# Import the requests package
import requests
# Send a GET request
x = requests.get('https://blog.bbskali.cn')
# Print the page content
print(x.text)
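In practice a request can fail or hang, so it may help to add a timeout and check the status code. The sketch below is one possible way to do that; `fetch_html` is a hypothetical helper name, and the 10-second timeout is an arbitrary assumption rather than a recommended value.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # Raise an HTTPError for 4xx/5xx responses instead of returning an error page.
        response.raise_for_status()
    except requests.exceptions.RequestException as exc:
        # RequestException is the base class for connection, timeout, HTTP, and URL errors.
        print(f"Request failed: {exc}")
        return None
    return response.text
```

Calling `fetch_html('https://blog.bbskali.cn')` would then return the same text as `x.text` in the example above when the request succeeds, and `None` when it does not.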

beautifulsoup4

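Like requests, beautifulsoup4 installs with a single pip command. The two complement each other: requests downloads the page, and beautifulsoup4 parses the HTML it returns.

# Install
pip install beautifulsoup4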
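To get a feel for the library before touching a real site, here is a small offline sketch. The HTML string is hand-written for illustration and is not taken from any real page; only the standard `html.parser` backend is assumed.

```python
from bs4 import BeautifulSoup

# A small hand-written HTML document, used only for illustration.
html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <p class="intro">Hello, <b>scraper</b>!</p>
    <a href="/page/1/">first</a>
    <a href="/page/2/">second</a>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# The text inside the <title> element
title = soup.title.string
# All text inside the first <p class="intro">, including nested tags
intro = soup.find("p", class_="intro").get_text()
# The href attribute of every <a> element
links = [a["href"] for a in soup.find_all("a")]

print(title, intro, links)
```

The same three calls (`find`, `find_all`, and attribute access with `[...]`) cover most of what a beginner scraper needs.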

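A quick test

Here we use the simple site Quotes to Scrape (https://quotes.toscrape.com) as an example.

The page mainly shows each quote's text, author, and tags. Let's analyze the page.

The HTML structure of a single quote on the page looks like this: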

<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
        <span>by <small class="author" itemprop="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world">
            <a class="tag" href="/tag/change/page/1/">change</a>
        </div>
    </div>

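We only need the elements and CSS classes of the key fields. As the markup above shows, each quote HTML element is identified by the quote class. It contains:

The quote text, in a <span> HTML element
The quote author, in a <small> HTML element
A list of tags in a <div> element, with each tag in its own <a> HTML element

Now let's see how to extract these with Python's Beautiful Soup. First, pass the downloaded HTML to the BeautifulSoup constructor (page is the response returned by requests.get()):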

soup = BeautifulSoup(page.text, 'html.parser')

Next, the find_all() method returns a list of all <div> HTML elements identified by the quote class.

quote_elements = soup.find_all('div', class_='quote')
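To see what find_all() returns and how each field is read from a matching element, here is a small offline sketch. It inlines a trimmed copy of the quote markup shown earlier so that no network access is needed; the second tag link is an illustrative assumption based on the keywords in the markup, and `records` is a name introduced only for this example.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the quote markup shown earlier, inlined so the sketch runs offline.
# The second <a class="tag"> link is assumed for illustration.
html = """
<div class="quote">
  <span class="text">“The world as we have created it is a process of our thinking.”</span>
  <span>by <small class="author">Albert Einstein</small></span>
  <div class="tags">
    <a class="tag" href="/tag/change/page/1/">change</a>
    <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# find_all() returns a list-like collection of every matching element.
quote_elements = soup.find_all("div", class_="quote")

records = []
for quote_element in quote_elements:
    records.append(
        {
            "text": quote_element.find("span", class_="text").text,
            "author": quote_element.find("small", class_="author").text,
            "tags": [a.text for a in quote_element.find_all("a", class_="tag")],
        }
    )

print(records)
```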

The complete code is as follows:

# Import the required libraries
import requests
from bs4 import BeautifulSoup
import csv

def scrape_page(soup, quotes):
    # Find every <div class="quote"> on the current page
    quote_elements = soup.find_all('div', class_='quote')
    # Loop over the quote elements and extract the text, author, and tags
    for quote_element in quote_elements:
        # Extract the quote text
        text = quote_element.find('span', class_='text').text
        # Extract the author
        author = quote_element.find('small', class_='author').text

        # Find the tag elements
        tag_elements = quote_element.find('div', class_='tags').find_all('a', class_='tag')
        # A quote can have several tags, so collect them in a list
        tags = []
        for tag_element in tag_elements:
            tags.append(tag_element.text)

        quotes.append(
            {
                'text': text,
                'author': author,
                'tags': ', '.join(tags)
            }
        )

# Base URL of the target site
base_url = 'https://quotes.toscrape.com'

# Set a browser User-Agent so the server treats our request like a normal browser request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}

# Download the first page with requests and store the response in page
page = requests.get(base_url, headers=headers)
# Hand the downloaded HTML to BeautifulSoup for parsing
soup = BeautifulSoup(page.text, 'html.parser')
# List that will hold all scraped quotes
quotes = []
scrape_page(soup, quotes)
# Follow the "next" link until there are no more pages
next_li_element = soup.find('li', class_='next')
while next_li_element is not None:
    next_page_relative_url = next_li_element.find('a', href=True)['href']
    page = requests.get(base_url + next_page_relative_url, headers=headers)
    soup = BeautifulSoup(page.text, 'html.parser')
    scrape_page(soup, quotes)

    next_li_element = soup.find('li', class_='next')
# Save the results to a CSV file; the with block closes the file even if writing fails
with open('quotes.csv', 'w', encoding='utf-8', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['Text', 'Author', 'Tags'])

    for quote in quotes:
        writer.writerow(quote.values())
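The script above writes `quote.values()`, which relies on every dictionary having its keys in the same order as the header row. As an alternative sketch (not the author's method), `csv.DictWriter` maps keys to columns by name. The sample records and the `write_quotes` helper are made up for illustration, and `io.StringIO` stands in for a real file so the example has no side effects.

```python
import csv
import io

# Made-up sample records in the same shape that scrape_page() appends to `quotes`.
quotes = [
    {"text": "“A sample quote.”", "author": "Jane Doe", "tags": "life, humor"},
    {"text": "“Another one.”", "author": "John Roe", "tags": "books"},
]


def write_quotes(file_obj, rows):
    """Write quote dicts to an open text file, mapping keys to columns by name."""
    writer = csv.DictWriter(file_obj, fieldnames=["text", "author", "tags"])
    writer.writeheader()
    writer.writerows(rows)


# io.StringIO behaves like an open text file but keeps everything in memory.
buffer = io.StringIO()
write_quotes(buffer, quotes)
print(buffer.getvalue())
```

With a real file, the same helper could be called inside `with open('quotes.csv', 'w', encoding='utf-8', newline='') as f:`.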