Solving Data Output Problems in Python Crawler Development: Ensuring CSV Files Are Generated Correctly

Article link: https://blog.youkuaiyun.com/ip16yun/article/details/140145897

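Introduction

In the era of big data, web crawling has become an important way to collect and analyze data from the web. Many developers who write crawlers in Python nevertheless run into data output problems, most often errors when generating CSV files. This article explains how to resolve those problems and provides a complete example that uses a proxy IP and multithreading to generate CSV files efficiently and accurately.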

Main Text

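I. Common Problems

Incomplete data extraction: changes in page structure or errors in the scraping logic leave some fields or records missing.
Encoding problems: different pages use different character encodings, which can produce garbled text.
File writing problems: formatting or permission errors occur while the CSV file is being written.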
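The file writing problems listed above often come down to how the file is opened and how fields are quoted. The following is a minimal sketch, not part of the original example: the helper names `write_rows` and `read_rows` and the two-column `FIELDNAMES` list are assumptions made for illustration. It opens the file with `newline=""` so the `csv` module controls line endings, uses `utf-8-sig` so that spreadsheet programs such as Excel can recognize the file as UTF-8, and lets `csv.DictWriter` quote fields that contain commas, quotes, or line breaks.

```python
import csv

# Hypothetical column list used only for this illustration
FIELDNAMES = ["Title", "Authors"]

def write_rows(path, rows):
    # newline="" hands line-ending control to the csv module (avoids blank rows on Windows);
    # utf-8-sig prepends a BOM so that Excel detects the file as UTF-8
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES, extrasaction="ignore")
        writer.writeheader()
        for row in rows:
            # Fill missing keys with an empty string so every row has the same columns
            writer.writerow({key: row.get(key, "") for key in FIELDNAMES})

def read_rows(path):
    # Reading with utf-8-sig strips the BOM again
    with open(path, newline="", encoding="utf-8-sig") as f:
        return list(csv.DictReader(f))
```

Writing a row and reading it back with `read_rows` is a quick way to check that commas, quotes, and non-ASCII text survive the round trip.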

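II. Solutions

Use a proxy IP: avoid extraction failures caused by the crawler's IP address being blocked.
Set the User-Agent and Cookie headers: imitate a browser to raise the request success rate.
Use multithreading: fetch pages concurrently to improve throughput and reduce waiting time.
Handle encoding: convert all scraped data to one encoding to avoid garbled text.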
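As one way to approach the encoding step above, the sketch below decodes a raw response body by trying the declared charset first and then a short list of fallbacks. The function name `decode_body` and the fallback order (`utf-8`, then `gb18030`) are assumptions chosen for illustration; they are not part of the original example and may need adjusting for other sites.

```python
from typing import Optional

def decode_body(raw: bytes, declared: Optional[str] = None) -> str:
    # Try the charset declared by the server first, if there is one
    candidates = [declared] if declared else []
    # Assumed fallbacks: UTF-8, then GB18030 (a superset of GBK, common on Chinese sites)
    candidates += ["utf-8", "gb18030"]
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name or undecodable bytes: try the next candidate
            continue
    # Last resort: keep the text readable and mark undecodable bytes
    return raw.decode("utf-8", errors="replace")
```

With `requests`, the raw bytes are available as `response.content`. The library also exposes `response.apparent_encoding`, which guesses the encoding from the content and can be assigned to `response.encoding` before reading `response.text`.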

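Example

The following code shows how to combine a proxy IP with multithreading to fetch data efficiently and reliably and then write it to a CSV file correctly. The example uses the 16yun (亿牛云) crawler proxy.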

import requests
from bs4 import BeautifulSoup
import csv
import threading
import queue

# Constants
SEARCH_URL = "https://pubmed.ncbi.nlm.nih.gov/"
QUERY = "Breast Cancer"
START_DATE = "2023/06/01"
END_DATE = "2023/12/31"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Cookie": "your_cookie_here"
}
# Proxy settings: 16yun (亿牛云) crawler proxy, enhanced edition, www.16yun.cn
PROXY = {
    "http": "http://username:password@proxy.16yun.cn:12345",
    "https": "http://username:password@proxy.16yun.cn:12345"
}

# Thread lock
lock = threading.Lock()

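# Fetch the article links from the search results page
def fetch_article_links(query, start_date, end_date):
    params = {
        "term": query,
        "mindate": start_date,
        "maxdate": end_date
    }
    # A timeout keeps a stalled proxy or server from hanging the crawler indefinitely
    response = requests.get(SEARCH_URL, params=params, headers=HEADERS, proxies=PROXY, timeout=30)
    # Fail early on HTTP errors instead of parsing an error page
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    # Skip anchors without an href attribute
    article_links = [a['href'] for a in soup.find_all('a', class_='docsum-title') if a.get('href')]
    return article_links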

# Fetch the details of one article and put them on the queue
def fetch_article_details(article_link, data_queue):
    try:
        response = requests.get(article_link, headers=HEADERS, proxies=PROXY, timeout=30)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')

        # Return the stripped text of the first matching element, or "" if it is missing,
        # so that one absent field (for example, no abstract) does not drop the whole record
        def text_of(tag, class_name):
            node = soup.find(tag, class_=class_name)
            return node.text.strip() if node else ""

        authors = [a.text.strip() for a in soup.find_all('a', class_='full-name')]
        data_queue.put({
            "Title": text_of('h1', 'heading-title'),
            "Authors": ", ".join(authors),
            "Publication Date": text_of('span', 'cit'),
            "Abstract": text_of('div', 'abstract-content')
        })
    except Exception as e:
        print(f"Error fetching details for {article_link}: {e}")

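# Save the queued records to a CSV file
def save_to_csv(data_queue, filename='pubmed_breast_cancer.csv'):
    # The lock guards against concurrent writers if this function is ever called from several threads
    with lock:
        # newline='' lets the csv module control line endings (avoids blank rows on Windows)
        with open(filename, mode='w', newline='', encoding='utf-8') as file:
            writer = csv.DictWriter(file, fieldnames=["Title", "Authors", "Publication Date", "Abstract"])
            writer.writeheader()
            # Drain the queue; this is safe here because all worker threads have finished
            while not data_queue.empty():
                writer.writerow(data_queue.get())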

# Main function
def main():
    article_links = fetch_article_links(QUERY, START_DATE, END_DATE)
    base_url = "https://pubmed.ncbi.nlm.nih.gov"
    data_queue = queue.Queue()

    threads = []
    for link in article_links:
        full_link = f"{base_url}{link}"
        t = threading.Thread(target=fetch_article_details, args=(full_link, data_queue))
        t.start()
        threads.append(t)

    for t in threads:
        t.join()

    save_to_csv(data_queue)

if __name__ == "__main__":
    main()
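
The example above starts one thread per article link, which can open a large number of simultaneous connections when a search returns many results. A possible alternative is to cap concurrency with a thread pool. The sketch below is a variation rather than the original method: the function name `fetch_all`, the `fetch_one` callable (assumed to return one record as a dict, or `None`), and the default of 8 workers are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_all(links, fetch_one, max_workers=8):
    """Run fetch_one(link) for every link using at most max_workers threads."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its link so errors can be reported per link
        futures = {pool.submit(fetch_one, link): link for link in links}
        for future in as_completed(futures):
            link = futures[future]
            try:
                row = future.result()
            except Exception as exc:
                # One failed page should not abort the whole crawl
                print(f"Error fetching details for {link}: {exc}")
                continue
            if row is not None:
                results.append(row)
    return results
```

Because the results come back as a plain list, they can be passed directly to `csv.DictWriter.writerows` without a queue or a lock.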