Scraping articles with BeautifulSoup, re, and XPath and saving them to a CSV file

Latest recommended article published on 2021-07-10 17:29:46

hellenlee22

License: CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/hellenlee22/article/details/89856812

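This article shows how to use Python's BeautifulSoup, regular expressions (the re module), and XPath (through lxml) to scrape article information from a website and export the data to a CSV file. It is aimed at beginners learning the basics of web scraping.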

Without further ado, here is the code:

import csv, requests, re
from bs4 import BeautifulSoup
from lxml import etree

url = 'https://www.v2ex.com/?tab=all'
'''
# BeautifulSoup plus regular expressions
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')
articles = []
for article in soup.find_all(class_='cell item'):
    title = article.find(class_='item_title').get_text()
    category = article.find(class_='node').get_text()
    # Non-greedy match, so the capture stops at the first "><img
    author = re.findall(r'(?<=<a href="/member/).+?(?="><img)', str(article))[0]
    #print(author)
    u = article.select('.item_title > a')
    #print(u)
    # [^"]+ stops at the closing quote of href even if more attributes follow
    link = 'https://www.v2ex.com' + re.findall(r'(?<=href=")[^"]+', str(u))[0]
    articles.append([title, category, author, link])
print(articles)
'''
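
Each entry appended to `articles` above is a `[title, category, author, link]` list. A minimal sketch of writing such rows to a CSV file with the standard `csv` module could look like the following. The function name `save_articles`, the column names, and the sample rows are assumptions for illustration and are not part of the original script.

```python
import csv

# Hypothetical column names; the order mirrors the rows built in the script:
# [title, category, author, link].
FIELDNAMES = ["title", "category", "author", "link"]


def save_articles(rows, path):
    """Write a header line plus one CSV line per article row."""
    # newline="" lets the csv module control line endings (avoids blank
    # lines on Windows); utf-8-sig adds a BOM so that Excel detects the
    # encoding of non-ASCII titles.
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(FIELDNAMES)
        writer.writerows(rows)


# Made-up sample rows standing in for the scraped `articles` list.
sample_articles = [
    ["Hello, V2EX", "Python", "alice", "https://www.v2ex.com/t/1"],
    ['A "quoted" title', "Programmers", "bob", "https://www.v2ex.com/t/2"],
]
```

With the scraped list in hand, a call such as `save_articles(articles, 'articles.csv')` would produce the file. The `csv.writer` handles quoting of commas and quotation marks inside titles, which is why it is usually preferable to joining fields by hand.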

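# XPath version
response = requests.get(url).text
html = etree.HTML(response)
#print(html)

# One <div class="cell item"> per topic on the page
tag_div = html.xpath('//div[@class="box"]/div[@class="cell item"]')
#print(tag_div)

articles = []
for each in tag_div:
    # The topic title is the link text in the third table cell
    title = each.xpath('./table//tr/td[3]/span[1]/a/text()')[0]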
    h