A Collection of Python Web Scraping Case Studies

A Practical Guide to Python Web Scraping

License: CC 4.0 BY-SA

Article tags:

#python #web-scraping #programming-language #django #flask #flink #java

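Let's walk through several practical cases that show how to write web scrapers in Python. They range from simple static page scraping to more complex interaction with dynamic sites, and also cover cleaning, storing, and analyzing the scraped data.

Case 1: A simple static page scraper

Suppose we need to scrape article titles and links from a simple static news site.

Python code

We will use the requests library to fetch the page content and BeautifulSoup to parse the HTML.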

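import requests
from bs4 import BeautifulSoup

def fetch_articles(url):
    # A timeout keeps the call from hanging on an unresponsive server
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail early on 4xx/5xx responses
    soup = BeautifulSoup(response.text, 'html.parser')
    articles = soup.find_all('div', class_='article')

    for article in articles:
        heading = article.find('h2')
        anchor = article.find('a')
        # Skip blocks that lack a heading or a link instead of raising
        if heading is None or anchor is None or not anchor.has_attr('href'):
            continue
        title = heading.get_text(strip=True)
        link = anchor['href']
        print(f"Title: {title}\nLink: {link}\n")

# Scrape the example site
fetch_articles('https://example-news-site.com/articles')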
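
The `href` values read in `fetch_articles` may be relative paths such as `/articles/1`, which cannot be requested on their own. A minimal sketch of resolving them against the page URL with the standard library's `urllib.parse.urljoin` follows. The helper name `absolutize_links` and the sample paths are hypothetical and not part of the original example.

```python
from urllib.parse import urljoin

def absolutize_links(page_url, hrefs):
    """Resolve each href against page_url; absolute URLs pass through unchanged."""
    return [urljoin(page_url, href) for href in hrefs]

# Hypothetical hrefs as they might appear in the scraped HTML
page_url = 'https://example-news-site.com/articles'
hrefs = ['/articles/1', 'story-2.html', 'https://other.example.com/3']

links = absolutize_links(page_url, hrefs)
for link in links:
    print(link)
```

Inside `fetch_articles`, the same effect could be had by passing the link through `urljoin(url, link)` before printing it.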

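Case 2: A dynamic site scraper

For dynamically loaded content, such as pages that load data via Ajax, we can use the Selenium library to simulate browser behavior.

Python code

We will use Selenium to interact with JavaScript-driven pages.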

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def fetch_articles_selenium(url):
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        wait = WebDriverWait(driver, 10)

        # Wait up to 10 seconds for the article elements to appear in the DOM
        articles = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'article')))

        for article in articles:
            title = article.find_element(By.TAG_NAME, 'h2').text
            link = article.find_element(By.TAG_NAME, 'a').get_attribute('href')
            print(f"Title: {title}\nLink: {link}\n")
    finally:
        # Always close the browser, even if the wait times out
        driver.quit()

# Scrape a site with dynamically loaded content
fetch_articles_selenium('https://example-dynamic-news-site.com/articles')
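
`WebDriverWait(driver, 10).until(...)` is an explicit wait: it calls the condition repeatedly until the condition returns a truthy value or the timeout expires. The sketch below models that idea in plain Python so it can be run without a browser. The names `wait_until` and `fake_page` are hypothetical, and this is a simplified illustration, not Selenium's actual implementation.

```python
import time

def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)

# Hypothetical stand-in for a page whose articles only appear on the third check
state = {'checks': 0}

def fake_page():
    state['checks'] += 1
    return ['article-1', 'article-2'] if state['checks'] >= 3 else []

found = wait_until(fake_page, timeout=2.0, poll_interval=0.01)
print(found)
```

The key design point is that the caller waits only as long as needed, rather than sleeping for a fixed period and hoping the content has loaded.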

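Case 3: Data cleaning and storage

Once the data has been scraped, it may need to be cleaned and organized. We can use the pandas library to process it.

Python code

We will use pandas to clean the data and save it to a CSV file.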

import pandas as pd

def clean_and_store(articles):
    df = pd.DataFrame(articles, columns=['title', 'link'])
    df = df.drop_duplicates()  # drop rows where both title and link repeat
    df.to_csv('articles.csv', index=False)
    print("Data has been cleaned and stored.")

# Sample data
articles = [
    {'title': 'Example Title 1', 'link': 'http://example.com/1'},
    {'title': 'Example Title 2', 'link': 'http://example.com/2'},
    {'title': 'Example Title 1', 'link': 'http://example.com/1'},  # Duplicate entry
]

# Clean and store the data
clean_and_store(articles)
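
For small result sets, the same deduplicate-and-save step can be done without pandas. The following is a minimal sketch using only the standard library's `csv` module. It is offered as an alternative to the pandas approach above, and the function name `dedupe_and_write` and the output file name are assumptions made for illustration.

```python
import csv

def dedupe_and_write(articles, path):
    """Write unique (title, link) rows to a CSV file, keeping first occurrences."""
    seen = set()
    unique = []
    for article in articles:
        key = (article['title'], article['link'])
        if key not in seen:
            seen.add(key)
            unique.append(article)

    # newline='' lets the csv module control line endings itself
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['title', 'link'])
        writer.writeheader()
        writer.writerows(unique)
    return unique

# Same sample data as in the pandas example
sample = [
    {'title': 'Example Title 1', 'link': 'http://example.com/1'},
    {'title': 'Example Title 2', 'link': 'http://example.com/2'},
    {'title': 'Example Title 1', 'link': 'http://example.com/1'},  # Duplicate entry
]

rows = dedupe_and_write(sample, 'articles_stdlib.csv')
print(len(rows))
```

pandas remains the more convenient choice once the cleaning involves more than exact duplicates, such as filling missing values or normalizing columns.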

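Case 4: Data analysis and visualization

Finally, we can use libraries such as Matplotlib or Seaborn for data analysis and visualization.

Python code

We will use matplotlib to create a simple chart showing the number of articles in each category.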

import matplotlib.pyplot as plt

def plot_article_categories(df):
    category_counts = df['category'].value_counts()
    category_counts.plot(kind='bar')
    plt.title('Article Categories')
    plt.xlabel('Category')