Scraping news from a news site and saving it to Excel

License: CC 4.0 BY-SA

Original link: http://blog.51cto.com/jackyxin/2066959

Tags: #python

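This article presents a crawler for the Yidian Zixun news site (yidianzixun.com), written in Python with Selenium and BeautifulSoup. The program opens the page in a real browser, scrolls so that more items load, extracts each news item's title, source, comment count and link, and saves the data to an Excel file. Driving a real browser means the dynamically loaded content is present in the page source before it is parsed.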
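
The script below loads more items by sending a fixed number of DOWN key presses. An alternative is to scroll until the page height stops growing. The following is only a sketch of that idea, not the author's method: the helper name scroll_until_stable, its parameters, and the two JavaScript snippets are assumptions, and whether yidianzixun.com loads more items in response to this kind of scroll has not been verified. The helper relies only on the driver's execute_script method.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=30):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `driver` is any object exposing Selenium's execute_script(script) method.
    Returns the number of scrolls that were performed.
    (Hypothetical helper; it is not part of the original script.)
    """
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to fetch and render new items
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            return rounds  # this scroll loaded nothing new
        last_height = new_height
    return max_rounds  # gave up after max_rounds scrolls
```

With a real browser it would be called as scroll_until_stable(driver) after driver.get(first_url). The max_rounds cap keeps the loop bounded on feeds that never stop loading.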

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: Jacky

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import xlwt

driver = webdriver.Firefox()
driver.implicitly_wait(3)
first_url = 'http://www.yidianzixun.com/channel/c6'
driver.get(first_url)
# find_element(By.CLASS_NAME, ...) replaces find_element_by_class_name,
# which was removed in Selenium 4.3.
driver.find_element(By.CLASS_NAME, 'icon-refresh').click()
# Send DOWN key presses so the page scrolls and loads more news items.
for i in range(1, 90):
    driver.find_element(By.CLASS_NAME, 'icon-refresh').send_keys(Keys.DOWN)
# Parse the fully loaded page source.
soup = BeautifulSoup(driver.page_source, 'lxml')
print(soup)
articles = []
# Each news item is an element with exactly this class string and an href.
for article in soup.find_all(class_='item doc style-small-image style-content-middle'):
    title = article.find(class_='doc-title').get_text()
    source = article.find(class_='source').get_text()
    comment = article.find(class_='comment-count').get_text()
    link = 'http://www.yidianzixun.com' + article.get('href')
    articles.append([title, source, comment, link])
print(articles)
driver.quit()

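# Write the collected rows to an Excel sheet; row 0 holds the column headers.
wbk = xlwt.Workbook(encoding='utf-8')
sheet = wbk.add_sheet('yidianzixun')
sheet.write(0, 0, 'title')
sheet.write(0, 1, 'source')
sheet.write(0, 2, 'comment')
sheet.write(0, 3, 'link')
# Data rows start at row 1; each article is [title, source, comment, link].
for i, row in enumerate(articles, 1):
    for j, value in enumerate(row):
        sheet.write(i, j, value)
# The zixun directory must already exist; the backslash path is Windows-style.
wbk.save(r'zixun\zixun.xls')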
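
The save call writes to zixun\zixun.xls. This assumes the zixun directory already exists and that the platform uses backslashes as path separators. A minimal sketch of a more portable way to build that path follows. The helper name prepare_output_path is hypothetical and not part of the original script.

```python
import os


def prepare_output_path(directory, filename):
    """Create `directory` if it is missing and return the path of `filename` in it.

    (Hypothetical helper; it is not part of the original script.)
    """
    os.makedirs(directory, exist_ok=True)  # no error if it already exists
    return os.path.join(directory, filename)  # uses the platform's separator
```

The final line of the script would then read wbk.save(prepare_output_path('zixun', 'zixun.xls')). Note that the exist_ok argument requires Python 3.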

Reposted from: https://blog.51cto.com/jackyxin/2066959