python newspaper — can I use Python's newspaper library to scrape news articles into a list?

The author asks the Stack Overflow community how to consolidate article links scraped from the CNN RSS feed, using Python's newspaper library, from many separate lists into one single list. The goal is to collect all links and store them as one list or dictionary.


Dear Stackoverflow community!

I would like to scrape news articles from the CNN RSS feed and get the link for each scraped article. This works very well with the Python newspaper library, but unfortunately I am unable to get the output in a usable format, i.e. a list or a dictionary.

I want to add the scraped links into one SINGLE list, instead of many separated lists.

import feedparser as fp
import newspaper
from newspaper import Article

website = {"cnn": {"link": "http://edition.cnn.com/", "rss": "http://rss.cnn.com/rss/cnn_topstories.rss"}}

for source, value in website.items():
    if 'rss' in value:
        # if there is an RSS value for a company, it will be extracted into d
        d = fp.parse(value['rss'])
        for entry in d.entries:
            if hasattr(entry, 'published'):
                article = {}
                article['link'] = entry.link
                print(article['link'])

The output is as follows:

http://rss.cnn.com/~r/rss/cnn_topstories/~3/5aHaFHz2VtI/index.html

http://rss.cnn.com/~r/rss/cnn_topstories/~3/_O8rud1qEXA/joe-walsh-trump-gop-voters-sot-crn-vpx.cnn

http://rss.cnn.com/~r/rss/cnn_topstories/~3/xj-0PnZ_LwU/index.html

.......

I would like to have ONE list with all the links in it i.e:

list =[http://rss.cnn.com/~r/rss/cnn_topstories/~3/5aHaFHz2VtI/index.html , http://rss.cnn.com/~r/rss/cnn_topstories/~3/_O8rud1qEXA/joe-walsh-trump-gop-voters-sot-crn-vpx.cnn , http://rss.cnn.com/~r/rss/cnn_topstories/~3/xj-0PnZ_LwU/index.html ,... ]

I tried appending the content via a for loop as follows:

for i in article['link']:
    article_list = []
    article_list.append(i)
    print(article_list)

But then the output is like this:

['h']

['t']

['t']

['p']

[':']

['/']

['/']

['r']

['s']

...
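For reference, this happens because `article['link']` is a single string, and iterating over a string yields one character at a time; re-creating `article_list = []` inside the loop also discards every previous append. A minimal sketch with a hypothetical URL illustrates the effect:

```python
url = "http://example.com"  # hypothetical URL for illustration

chars = []
for c in url:           # iterating a string yields single characters
    chars.append(c)

print(chars[:4])        # ['h', 't', 't', 'p']
```

Appending the whole string (or looping over a list of strings) avoids this.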

Does anyone know an alternative method, how to get the content into one list?

Or alternatively a dictionary as following:

dict = {'links': [link1, link2, link3]}

Thank you VERY much in advance for your help!!

Solution

Try modifying your code like this and see if it works:

article_list = []  # create the list ONCE, before the loop

for entry in d.entries:
    if hasattr(entry, 'published'):
        article = {}
        article['link'] = entry.link
        article_list.append(article['link'])  # collect each link into the single list
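If you prefer the dictionary shape mentioned in the question, the same loop can fill a 'links' key instead. A sketch of the pattern, using hypothetical sample entries in place of a live feedparser result (plain dicts here, so membership is checked with `in` rather than `hasattr`):

```python
# hypothetical sample data shaped like feed entries
entries = [
    {"published": "Mon, 01 Jan 2024", "link": "http://example.com/a"},
    {"link": "http://example.com/b"},  # no 'published' key -> skipped
]

articles = {"links": []}
for entry in entries:
    if "published" in entry:                 # keep only dated entries
        articles["links"].append(entry["link"])

print(articles)  # {'links': ['http://example.com/a']}
```

With real feedparser entries you would keep `hasattr(entry, 'published')` and `entry.link` as in the code above.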
