myFirstCrawler

This post shows how to crawl recipes from the xiachufang.com site with Python and the lxml library, extracting the dish name, author, cook count, and ingredients. It follows the pagination links and finally saves the data to a local file.


It crawls the xiachufang recipe pages and follows the "next page" links on its own~~

It can also write everything it finds into a file~

Enough talk, here's the code:

# get the information (recipes and the chef's name included) of some delicious dishes
# and save the information in a file

from lxml import html
from time import sleep
from urllib.parse import urljoin

import os
import sys

ls = os.linesep
filename = "OutputFile.txt"
# bail out instead of silently overwriting an existing output file
if os.path.exists(filename):
    print("ERROR: '%s' already exists! Please name the file again~" % filename)
    sys.exit(1)

writeline = "get the information of some delicious dishes"+ls

x = html.parse('http://www.xiachufang.com/explore')
titles = x.xpath("//ul[@class='list']/li/div/div/p[@class='name']/a/text()")
cook = x.xpath("//ul[@class='list']/li/div/div/p[@class='author']/a/text()")
status = x.xpath("//ul[@class='list']/li/div/div/p[@class='stats green-font']/span/text()")
material = x.xpath("//ul[@class='list']/li/div/div/p[@class='ing ellipsis']/text()")
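# NOTE: these XPath queries match the explore page's markup at the time of
# writing (ul.list > li > ... > p.name / p.author / p.stats / p.ing); if the
# site changes its HTML they will silently return empty lists, so the four
# lists above should all have the same length when everything works.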
writeline += "We got %s titles with its chief name and status. Here are the top 5:" % len(titles)+ls
# keep only the first five dishes; zip stops at the shortest list,
# so a missing author/stat/ingredient entry cannot raise an IndexError
for title, chef, stat, ing in zip(titles[:5], cook, status, material):
    writeline += "   >" + title + ls
    writeline += "   >>chef:" + chef + ls
    writeline += "   >>>has been cooked:" + stat + " times" + ls
    writeline += "   >>>>ingredients:" + ing + ls
    writeline += "**********************************************************" + ls
    
# next-page crawling: follow the "next" link and keep collecting titles
# (assume that 50 titles are enough)
writeline += ls+ls+"*********************function: searching next pages***********************"+ls
next_button_xpath = "//a[@class='next']/@href"
headline_xpath = "//ul[@class='list']/li/div/div/p[@class='name']/a/text()"

newTitles = []
base_url = 'http://www.xiachufang.com/'
next_page = 'http://www.xiachufang.com/explore'
threshold = 50
while len(newTitles) < threshold and next_page:
    x = html.parse(next_page)
    headlines = x.xpath(headline_xpath)
    writeline += "Retrieved {} titles from url: {}".format(len(headlines), next_page)+ls
    
    newTitles += headlines
    next_pages = x.xpath(next_button_xpath)
    if next_pages:
        # hrefs on the page are relative (e.g. "/explore?page=2"), so join
        # them onto the base URL instead of naive string formatting,
        # which would produce a double slash
        next_page = urljoin(base_url, next_pages[0])
    else:
        writeline += "No next button found" + ls
        next_page = None
    sleep(3)  # be polite: pause between page requests
    
if len(newTitles) >= threshold:
    writeline += "the number of titles is: %s, enough information!" % len(newTitles) + ls

with open(filename, 'wb') as out:
    out.write(writeline.encode('utf-8'))

print('Done! Tada!!')
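
One caveat: `html.parse()` fetches the URL through libxml2's own plain-HTTP loader, with no custom headers or timeout. If the site redirects to HTTPS or starts rejecting bare requests, a common pattern is to fetch the page yourself and hand the bytes to lxml. Here's a minimal sketch, assuming the third-party `requests` library is installed; the URL and XPath expressions are the same ones used above:

```python
# sketch: fetch with requests, parse with lxml.html.fromstring
# assumes `pip install requests lxml`
import requests
from lxml import html
from urllib.parse import urljoin

url = 'http://www.xiachufang.com/explore'
resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
resp.raise_for_status()               # fail loudly on HTTP errors
tree = html.fromstring(resp.content)  # parse the raw response bytes

titles = tree.xpath("//ul[@class='list']/li/div/div/p[@class='name']/a/text()")
next_hrefs = tree.xpath("//a[@class='next']/@href")
if next_hrefs:
    next_page = urljoin(url, next_hrefs[0])  # handles relative hrefs safely
```

A nice side effect: `requests` follows redirects by default, so an http-to-https redirect on the site won't break the fetch.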
