Ubuntu: Running Scheduled Crawler Jobs with crontab

License: CC 4.0 BY-SA


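This article describes how to configure a scheduled job with crontab on Ubuntu to run a Python crawler. It covers a problem that appeared after moving the job over from Windows, and records the key step: giving the job the full path to Scrapy so that it can be found when run from crontab.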

Previously, on Windows, I used a Python script to launch the crawler and ran it as a scheduled task without any trouble. Here is the code, video_command.py:

#!/usr/bin/python2.7
# -*- coding: utf-8 -*-
import os
import time


def run_spider(spider_name, folder_name):
    try:
        command1 = 'scrapy crawl ' + str(spider_name)
        # Scrapy must be started from inside the project directory.
        os.chdir('/python/shixi/' + str(folder_name))
        os.system(command1)
        # Log a timestamp and the command that was run.
        print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())))
        print(command1)
        print("----------------------------->>>>>>>>>>>>>>>>>>>done<<<<<<----------------------")
    except Exception as e:
        print(e)


if __name__ == '__main__':
    run_spider('qiyi1_spider', 'QIYI_movie')

I then created a .bat file and had the built-in Windows Task Scheduler call it.

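Today I wanted to use crontab on Ubuntu to run the crawler on a schedule, so I moved the code above over to Ubuntu. From the project directory, running scrapy crawl xxxx works normally, and running video_command.py from the shell works as well. Once the job was put into crontab, however, it simply would not run. This is the entry from crontab -e: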

53 15 * * *  sh /python/shixi/crontab3/start_crawl.sh >>/python/shixi/crontab3/log.text 2>&1
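
For readers less familiar with crontab syntax, the entry above can be read field by field. The following is only an illustrative sketch: describe_cron_entry is a hypothetical helper written for this article, not part of cron or of any library, and it does nothing more than split the line into the five schedule fields and the command.

```python
# Illustrative only: split a crontab entry into its five time fields and command.
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron_entry(line):
    """Return a dict of the five schedule fields plus the command to run."""
    parts = line.split(None, 5)  # five schedule fields, the rest is the command
    if len(parts) < 6:
        raise ValueError("expected five schedule fields followed by a command")
    entry = dict(zip(CRON_FIELDS, parts[:5]))
    entry["command"] = parts[5]
    return entry


entry = describe_cron_entry(
    "53 15 * * *  sh /python/shixi/crontab3/start_crawl.sh "
    ">>/python/shixi/crontab3/log.text 2>&1"
)
# Minute 53 of hour 15, with "*" for the remaining fields: 15:53 every day.
print(entry["minute"], entry["hour"])
print(entry["command"])
```

The redirection at the end of the command appends both standard output and standard error to log.text, which is the first place to look when a cron job fails silently.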

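After some troubleshooting, I found that the root cause was in video_command.py. Running the crawler directly, or running it through the script, both worked, but once the job ran under crontab, scrapy could not be found. cron starts jobs with a minimal environment rather than the one from a login shell, and its PATH typically does not include /usr/local/bin, which is where scrapy was installed here. Writing out the full path to scrapy fixed the problem.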
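
The failure can be reproduced outside cron by searching for the executable with a restricted search path. The sketch below assumes Python 3 (for shutil.which) and assumes that cron's PATH is /usr/bin:/bin, a common default that may differ on your system. CRON_LIKE_PATH and locate are names introduced here for illustration only.

```python
import os
import shutil

# Assumption: a typical minimal PATH used by cron; check the value on your system.
CRON_LIKE_PATH = "/usr/bin:/bin"


def locate(program, search_path):
    """Return the absolute path of `program` found on `search_path`, or None."""
    return shutil.which(program, path=search_path)


# What an interactive shell would find, using the current PATH.
print("interactive:", locate("scrapy", os.environ.get("PATH", os.defpath)))
# What a job would find with the restricted PATH; None means "command not found".
print("cron-like:  ", locate("scrapy", CRON_LIKE_PATH))
```

If the first line prints a path and the second prints None, the first path is the one to write out in full in the script. Running which scrapy in an interactive shell gives the same path.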

The complete code:

#!/usr/bin/python2.7
# -*- coding: utf-8 -*-
import os
import time


def run_spider(spider_name, folder_name):
    try:
        # Use the absolute path: cron's PATH may not include /usr/local/bin.
        command1 = '/usr/local/bin/scrapy crawl ' + str(spider_name)
        # Scrapy must be started from inside the project directory.
        os.chdir('/python/shixi/' + str(folder_name))
        os.system(command1)
        # Log a timestamp and the command that was run.
        print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())))
        print(command1)
        print("----------------------------->>>>>>>>>>>>>>>>>>>done<<<<<<----------------------")
    except Exception as e:
        print(e)


if __name__ == '__main__':
    run_spider('qiyi1_spider', 'QIYI_movie')
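
One limitation of the script above is that os.system does not raise an exception when the command fails. It only returns an exit status, so the try/except block will not catch a missing scrapy. As a possible alternative, here is a minimal sketch assuming Python 3. run_command is a hypothetical helper, the exit code 127 for a command that could not be started is a convention borrowed from the shell, and the Scrapy path and project directory are the ones used in this article.

```python
import subprocess
import time


def run_command(args, cwd=None):
    """Run `args` in directory `cwd`, log the outcome, and return the exit code."""
    try:
        code = subprocess.run(args, cwd=cwd).returncode
    except OSError as exc:  # e.g. executable or working directory not found
        print("failed to start:", exc)
        code = 127
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    print(stamp, " ".join(args), "exit code:", code)
    return code


if __name__ == "__main__":
    # Paths as used in this article; adjust them to your installation.
    run_command(["/usr/local/bin/scrapy", "crawl", "qiyi1_spider"],
                cwd="/python/shixi/QIYI_movie")
```

With the exit code written to the log, a failed run shows up in log.text as a non-zero value instead of looking like a successful one.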