"Life Is Short, I Use Python", Part 11: Basic Web Scraping with Python

Article link: https://blog.youkuaiyun.com/cs1395293598/article/details/140646870

Python has many libraries for web scraping; the most commonly used are requests and BeautifulSoup. The steps and examples below show how to use them to scrape data.

1. Install the dependencies
First, make sure the requests and BeautifulSoup libraries are installed. If they are not, install them with the following commands:

pip install requests
pip install beautifulsoup4

2. Fetch a web page with requests
The requests library sends HTTP requests and receives the responses. The following example fetches the content of a web page:

import requests

url = 'https://example.com'
response = requests.get(url)

if response.status_code == 200:
    html_content = response.text
    print(html_content)
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
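The example above can be wrapped in a small helper that also covers cases it does not handle: a server that never answers and network-level errors. This is a minimal sketch, not a definitive implementation; the function name fetch_html and the 10-second default timeout are assumptions chosen for illustration.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails."""
    try:
        # Without a timeout, requests can wait indefinitely for a response.
        response = requests.get(url, timeout=timeout)
        # Raise requests.HTTPError for 4xx and 5xx status codes.
        response.raise_for_status()
    except requests.RequestException as exc:
        # RequestException is the base class for connection errors,
        # timeouts, invalid URLs, and HTTP errors raised above.
        print(f"Request failed: {exc}")
        return None
    return response.text
```

Calling fetch_html('https://example.com') returns the HTML on success; callers only need to check for None instead of inspecting status codes themselves.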

3. Parse HTML with BeautifulSoup
BeautifulSoup is a library for parsing HTML and XML documents. The following example parses the HTML content fetched in the previous step:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, 'html.parser')

# Find all <h1> heading tags
titles = soup.find_all('h1')
for title in titles:
    print(title.get_text())

# Find the first <div> with a specific class
specific_div = soup.find('div', {'class': 'specific-class'})
if specific_div:
    print(specific_div.get_text())
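To try the parsing calls without fetching anything, the sketch below runs them on an inline HTML string. It also uses select(), which takes a CSS selector and is often shorter than nested find() calls. The sample markup, the SAMPLE_HTML name, and the extract function are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, inlined so the example runs offline.
SAMPLE_HTML = """
<html><body>
  <h1>First heading</h1>
  <div class="specific-class"><a href="/a">Link A</a></div>
  <div class="other"><a href="/b">Link B</a></div>
</body></html>
"""


def extract(html):
    """Return (h1 texts, hrefs of links inside div.specific-class)."""
    soup = BeautifulSoup(html, "html.parser")
    headings = [h.get_text(strip=True) for h in soup.find_all("h1")]
    # CSS selector: <a> tags with an href attribute inside div.specific-class.
    links = [a["href"] for a in soup.select("div.specific-class a[href]")]
    return headings, links


headings, links = extract(SAMPLE_HTML)
print(headings)
print(links)
```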

4. A complete example
The following example puts the pieces together and scrapes titles and links from a news site:

import requests
from bs4 import BeautifulSoup

url = 'https://news.ycombinator.com/'
response = requests.get(url)

if response.status_code == 200:
    html_content = response.text
    soup = BeautifulSoup(html_content, 'html.parser')

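    # Find all story links. Hacker News markup has changed over time: the old
    # 'storylink' class no longer matches, and at the time of writing each
    # title link is the direct child of <span class="titleline">.
    stories = soup.select('span.titleline > a')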
    for story in stories:
        title = story.get_text()
        link = story['href']
        print(f"Title: {title}")
        print(f"Link: {link}\n")
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
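Scraped href values are often relative to the page they came from (on Hacker News, for instance, discussion-only posts link to paths such as item?id=...). A minimal sketch of resolving them against the page URL with the standard library follows; the absolutize helper name is an assumption.

```python
from urllib.parse import urljoin


def absolutize(base_url, hrefs):
    """Resolve each href against base_url; absolute URLs pass through unchanged."""
    return [urljoin(base_url, href) for href in hrefs]


links = absolutize(
    "https://news.ycombinator.com/",
    ["item?id=1", "https://example.com/post"],
)
print(links)
```

In the example above, this would mean printing urljoin(url, story['href']) rather than the raw attribute value.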

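5. Handle pagination
Some sites spread their data across multiple pages, so the scraper has to follow the pagination. The following example requests successive pages until one fails to load or contains no items: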

import requests
from bs4 import BeautifulSoup

base_url = 'https://example.com/page/'
page_number = 1

while True:
    url = f"{base_url}{page_number}"
    response = requests.get(url)

    if response.status_code != 200:
        break

    html_content = response.text
    soup = BeautifulSoup(html_content, 'html.parser')

    items = soup.find_all('div', {'class': 'item'})
    if not items:
        break

    for item in items:
        heading = item.find('h2')
        if heading:  # skip items without an <h2> instead of raising AttributeError
            print(heading.get_text())

    page_number += 1
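The while True loop above relies on the site eventually returning an error or an empty page. A more defensive variant caps the number of pages and pauses between requests. The sketch below separates the loop from the fetching so it can be exercised without a network; crawl_pages, fetch_items, max_pages, and delay are all hypothetical names, and fetch_items stands in for the requests and BeautifulSoup code shown earlier.

```python
import time


def crawl_pages(fetch_items, max_pages=50, delay=1.0):
    """Collect items page by page.

    fetch_items(page_number) must return a list of items; an empty list
    or None means there are no more pages.
    """
    results = []
    for page_number in range(1, max_pages + 1):
        items = fetch_items(page_number)
        if not items:
            break  # empty page or failed request: stop paginating
        results.extend(items)
        time.sleep(delay)  # pause between pages to limit server load
    return results


# Stand-in for a real site: two pages of data, then nothing.
fake_site = {1: ["a", "b"], 2: ["c"]}
print(crawl_pages(fake_site.get, delay=0))
```

The max_pages cap guards against sites that return the same non-empty page for every page number, which would otherwise keep the loop running forever.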

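6. Handle dynamic content
For dynamically generated content, such as content loaded by JavaScript, you can use the Selenium library, which drives a real browser. To install it: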

pip install selenium

The following example uses Selenium to get dynamic content:

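from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://example.com'
driver = webdriver.Chrome()  # or use the driver for another browser
try:
    driver.get(url)
    # page_source holds the DOM after the browser has run the page's JavaScript
    html_content = driver.page_source
finally:
    driver.quit()  # always close the browser, even if loading fails

soup = BeautifulSoup(html_content, 'html.parser')

# Parse the content
items = soup.find_all('div', {'class': 'item'})
for item in items:
    heading = item.find('h2')
    if heading:
        print(heading.get_text())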
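Reading page_source immediately after get() can miss content that JavaScript adds later. One common approach is an explicit wait that blocks until a known element appears. The sketch below assumes Chrome is available; the get_rendered_html name and its parameters are hypothetical, and the Selenium imports are done inside the function only so the snippet can be loaded on machines without Selenium.

```python
def get_rendered_html(url, css_selector, timeout=10):
    """Load url in Chrome and return the HTML once css_selector is present."""
    # Imported lazily so this module still loads where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Block until the element exists in the DOM; a TimeoutException
        # is raised if it has not appeared after `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        driver.quit()  # close the browser even if the wait times out
```

For the markup used in this section, a call might look like get_rendered_html('https://example.com', 'div.item'), with the result passed to BeautifulSoup as before.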

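7. Crawler etiquette
Respect the site's robots.txt file: it specifies which pages crawlers are allowed to fetch.
Space out your requests: avoid sending requests too frequently and putting load on the server.
Set a User-Agent: add a User-Agent header to the request to indicate that it comes from a browser.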

import requests

url = 'https://example.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

response = requests.get(url, headers=headers)
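The first two etiquette points can be sketched with the standard library alone: urllib.robotparser answers whether a URL may be fetched, and a small throttle enforces a minimum gap between requests. Everything named here is an assumption for illustration: the inlined robots.txt content, the my-crawler user agent, and the load_rules and Throttle helpers.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body, inlined so the sketch runs without network access.
# Against a real site you would call set_url(".../robots.txt") and read() instead.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""
USER_AGENT = "my-crawler"  # hypothetical crawler name


def load_rules(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to keep min_interval since the last call."""
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


rules = load_rules(ROBOTS_TXT)
# Use the site's Crawl-delay when it declares one, else fall back to 1 second.
delay = rules.crawl_delay(USER_AGENT) or 1.0

for url in ("https://example.com/public/page", "https://example.com/private/data"):
    verdict = "allowed" if rules.can_fetch(USER_AGENT, url) else "disallowed"
    print(url, verdict)
```

In a crawl loop, you would create one Throttle(delay), call its wait() before each request, and skip any URL for which can_fetch() returns False.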