myFirstCrawler

This post shows how to crawl recipes from the xiachufang.com site with Python and the lxml library, extracting the dish name, author, cook count, and ingredients. It follows the pagination links and finally saves the data to a local file.


It crawls the xiachufang recipe pages and follows the "next page" links on its own~~

It can also write everything it finds into a file~

Enough talk, here's the code:

# get the information (recipes and the chef's name included) of some delicious dishes
# and save the information in a file

from lxml import html
from time import sleep
from urllib.parse import urljoin

import os
import sys

ls = os.linesep
filename = "OutputFile.txt"
# bail out instead of silently overwriting an existing output file
if os.path.exists(filename):
    print("ERROR: '%s' already exists! Please name the file again~" % filename)
    sys.exit(1)

writeline = "get the information of some delicious dishes"+ls

x = html.parse('http://www.xiachufang.com/explore')
titles = x.xpath("//ul[@class='list']/li/div/div/p[@class='name']/a/text()")
cook = x.xpath("//ul[@class='list']/li/div/div/p[@class='author']/a/text()")
status = x.xpath("//ul[@class='list']/li/div/div/p[@class='stats green-font']/span/text()")
material = x.xpath("//ul[@class='list']/li/div/div/p[@class='ing ellipsis']/text()")
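# NOTE: these XPath queries match the explore page's markup at the time of
# writing (ul.list > li > ... > p.name / p.author / p.stats / p.ing); if the
# site changes its HTML they will silently return empty lists, so the four
# lists above should all have the same length when everything works.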
writeline += "We got %s titles with its chief name and status. Here are the top 5:" % len(titles)+ls
# keep only the first five dishes; zip stops at the shortest list,
# so a missing author/stat/ingredient entry cannot raise an IndexError
for title, chef, stat, ing in zip(titles[:5], cook, status, material):
    writeline += "   >" + title + ls
    writeline += "   >>chef:" + chef + ls
    writeline += "   >>>has been cooked:" + stat + " times" + ls
    writeline += "   >>>>ingredients:" + ing + ls
    writeline += "**********************************************************" + ls
    
# next-page crawling: follow the "next" link and keep collecting titles
# (assume that 50 titles are enough)
writeline += ls+ls+"*********************function: searching next pages***********************"+ls
next_button_xpath = "//a[@class='next']/@href"
headline_xpath = "//ul[@class='list']/li/div/div/p[@class='name']/a/text()"

newTitles = []
base_url = 'http://www.xiachufang.com/'
next_page = 'http://www.xiachufang.com/explore'
threshold = 50
while len(newTitles) < threshold and next_page:
    x = html.parse(next_page)
    headlines = x.xpath(headline_xpath)
    writeline += "Retrieved {} titles from url: {}".format(len(headlines), next_page)+ls
    
    newTitles += headlines
    next_pages = x.xpath(next_button_xpath)
    if next_pages:
        # hrefs on the page are relative (e.g. "/explore?page=2"), so join
        # them onto the base URL instead of naive string formatting,
        # which would produce a double slash
        next_page = urljoin(base_url, next_pages[0])
    else:
        writeline += "No next button found" + ls
        next_page = None
    sleep(3)  # be polite: pause between page requests
    
if len(newTitles) >= threshold:
    writeline += "the number of titles is: %s, enough information!" % len(newTitles) + ls

with open(filename, 'wb') as out:
    out.write(writeline.encode('utf-8'))

print('Done! Tada!!')
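
One caveat: `html.parse()` fetches the URL through libxml2's own plain-HTTP loader, with no custom headers or timeout. If the site redirects to HTTPS or starts rejecting bare requests, a common pattern is to fetch the page yourself and hand the bytes to lxml. Here's a minimal sketch, assuming the third-party `requests` library is installed; the URL and XPath expressions are the same ones used above:

```python
# sketch: fetch with requests, parse with lxml.html.fromstring
# assumes `pip install requests lxml`
import requests
from lxml import html
from urllib.parse import urljoin

url = 'http://www.xiachufang.com/explore'
resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
resp.raise_for_status()               # fail loudly on HTTP errors
tree = html.fromstring(resp.content)  # parse the raw response bytes

titles = tree.xpath("//ul[@class='list']/li/div/div/p[@class='name']/a/text()")
next_hrefs = tree.xpath("//a[@class='next']/@href")
if next_hrefs:
    next_page = urljoin(url, next_hrefs[0])  # handles relative hrefs safely
```

A nice side effect: `requests` follows redirects by default, so an http-to-https redirect on the site won't break the fetch.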
