
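Scrapy Crawler Configuration
This article presents a sample crawler configuration built on the Python Scrapy framework. The crawler can scrape different targets, such as Baidu keyword search results and Baike pages. Command-line arguments select the spider type and set the search keyword, the maximum crawl depth, and the maximum number of items to collect.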
import argparse
import datetime
import logging
import os
import time
from urllib.parse import quote

from scrapy.cmdline import execute

from crawler import settings


now = datetime.datetime.now()
# Set up logging for this launcher script
logger = logging.getLogger('crawler.py')
logger.setLevel(logging.INFO)
rq = time.strftime('%Y%m%d%H%M', time.localtime(time.time()))
log_path = os.path.dirname(os.getcwd()) + '/Logs/'
# Create the log directory if it is missing; FileHandler does not create it
os.makedirs(log_path, exist_ok=True)
log_name = log_path + rq + '.log'
logfile = log_name
fh = logging.FileHandler(logfile, mode='w')
fh.setLevel(logging.DEBUG)
formatter = logging.Formatter("%(asctime)s - %(filename)s[line:%(lineno)d] - %(levelname)s: %(message)s")
fh.setFormatter(formatter)
logger.addHandler(fh)


def main():
    # Work out which spider was selected
    if FLAGS.type == 'BaiduKeySearch':
        # quote() would raise a TypeError on a missing key, so check it first
        if not FLAGS.key:
            logger.warning('BaiduKeySearch requires a search key (-k/--key).')
            return
        # The URL-quoted key is passed to the spider as the key_name argument
        cmd = ['scrapy', 'crawl', FLAGS.type, '-a', 'key_name={}'.format(quote(FLAGS.key))]
    elif FLAGS.type == 'baike':
        cmd = ['scrapy', 'crawl', FLAGS.type]
    else:
        logger.warning('Wrong Type!')
        return
    # If a maximum item count was requested, use Scrapy's built-in
    # CLOSESPIDER_ITEMCOUNT setting to stop the crawl at that count
    if FLAGS.max_nb_docs:
        cmd += ['-s', 'CLOSESPIDER_ITEMCOUNT={}'.format(FLAGS.max_nb_docs)]
    execute(cmd)

if __name__ == '__main__':
    # Create the command-line parser
    parser = argparse.ArgumentParser()
    # Add the command-line arguments
    parser.add_argument('-t', '--type', choices=['BaiduKeySearch', 'baike'], dest='type', help='choose one spider from the choices list')
    parser.add_argument('-k', '--key', dest='key', type=str, help='the keyword you want to search for')
    parser.add_argument('-m', '--max-deepth', dest='max_deepth', type=int, help='set the maximum depth you want to crawl')
    # The original single-dash spelling is kept; the double-dash form is added as an alias
    parser.add_argument('-max-nb-docs', '--max-nb-docs', dest='max_nb_docs', type=int, help='set the maximum number of items you want to crawl')
    parser.add_argument('-o', '--output', dest='output', type=str, default=os.path.abspath('.') + '/crawled_data/', help='set the download output path')
    parser.add_argument('-l', '--log', dest='log', type=str, default=os.path.abspath('.') + '/log/', help='set the log file directory')
    parser.add_argument('-f', '--filename', dest='filename', type=str, help='record the downloaded file names in this file')
    # Parse the arguments; unrecognized ones are collected in `unparsed`
    FLAGS, unparsed = parser.parse_known_args()
    # Push the parsed values into the project settings before the crawl starts
    settings.KEY_NAME = FLAGS.key
    settings.DEPTH_LIMIT = FLAGS.max_deepth
    settings.OUT_PUT = FLAGS.output
    settings.SHA_ONE = FLAGS.filename
    # Make sure the log directory exists before Scrapy opens the log file
    os.makedirs(FLAGS.log, exist_ok=True)
    settings.LOG_FILE = os.path.join(FLAGS.log, 'scrapy {} {} {}.log'.format(now.year, now.month, now.day))
    main()
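
To make the argument handling above concrete, here is a minimal sketch that rebuilds a subset of the parser and feeds it a hypothetical argument list. The sample keyword `open source`, the extra `--extra` flag, and the helper name `build_parser` are assumptions made for illustration; they do not come from the original script.

```python
import argparse
from urllib.parse import quote


def build_parser():
    # Subset of the options defined in the script above
    parser = argparse.ArgumentParser()
    parser.add_argument('-t', '--type', choices=['BaiduKeySearch', 'baike'], dest='type')
    parser.add_argument('-k', '--key', dest='key', type=str)
    parser.add_argument('-max-nb-docs', dest='max_nb_docs', type=int)
    return parser


# Hypothetical argument list, as it would arrive from the shell
argv = ['-t', 'BaiduKeySearch', '-k', 'open source', '-max-nb-docs', '50', '--extra']

# parse_known_args() returns unrecognized arguments instead of exiting with an error
flags, unparsed = build_parser().parse_known_args(argv)

print(flags.type, flags.key, flags.max_nb_docs)
print(unparsed)
# quote() percent-encodes the space, so the key survives being placed
# in a whitespace-separated command string
print(quote(flags.key))
```

Because `parse_known_args()` tolerates unknown flags, a mistyped option name is silently collected in `unparsed` rather than reported, so it may be worth logging that list when debugging an invocation.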


hcare@appsrv:~$ sudo docker logs 05ceb963f56b
1:C 04 Sep 2025 06:10:51.417 # WARNING Memory overcommit must be enabled! Without it, a background save or replication may fail under low memory condition. Being disabled, it can also cause failures without low memory condition, see https://github.com/jemalloc/jemalloc/issues/1328. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
1:C 04 Sep 2025 06:10:51.417 * oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 04 Sep 2025 06:10:51.417 * Redis version=7.4.2, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 04 Sep 2025 06:10:51.417 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
1:M 04 Sep 2025 06:10:51.418 * monotonic clock: POSIX clock_gettime
1:M 04 Sep 2025 06:10:51.421 * Running mode=standalone, port=6379.
1:M 04 Sep 2025 06:10:51.421 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:M 04 Sep 2025 06:10:51.421 # Warning: Could not create server TCP listening socket *:6379: bind: Address already in use
1:M 04 Sep 2025 06:10:51.421 # Failed listening on port 6379 (tcp), aborting.
(The same nine-line startup sequence repeats 11 more times as the container restarts, beginning at 06:10:51.925, 06:10:52.520, 06:10:53.344, 06:10:54.511, 06:10:56.554, 06:11:00.159, 06:11:06.954, 06:11:20.140, 06:11:46.163, 06:12:37.826, and 06:13:38.274.)

How can the problems in this log be resolved?
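### Memory overcommit not enabled

This warning refers to the kernel parameter `vm.overcommit_memory`. Change it on the host (as root) with:

```bash
sysctl vm.overcommit_memory=1
```

To keep the setting across reboots, edit `/etc/sysctl.conf` and add or modify the following line:

```plaintext
vm.overcommit_memory = 1
```

Then apply the configuration:

```bash
sysctl -p
```

### No config file specified

If there is no config file, create one on the host and mount it into the Docker container. For example:

```bash
mkdir -p /mydata/redis/conf
vim /mydata/redis/conf/redis.conf
```

Add some basic settings to `redis.conf`, such as enabling persistence:

```plaintext
appendonly yes
```

Then mount the config directory when starting the container:

```bash
docker run -d \
  --network host \
  -v /mydata/redis/conf:/usr/local/etc/redis \
  --restart unless-stopped \
  --name eft-p9-redis \
  redis:7.4.2 redis-server /usr/local/etc/redis/redis.conf
```

### TCP backlog setting cannot be enforced

Redis requests a listen backlog of 511 (its `tcp-backlog` default), but the kernel caps it at `net.core.somaxconn`, which the log reports as 128. Setting `tcp-backlog` in `redis.conf` alone therefore does not remove the warning. Either raise the kernel limit, for example with `sysctl net.core.somaxconn=511` on the host when the container uses host networking as in the command above, or lower `tcp-backlog` to 128 or less. The corresponding line in `redis.conf` is:

```plaintext
tcp-backlog 511
```

Restart the container for the change to take effect:

```bash
docker restart eft-p9-redis
```

### Port 6379 already in use

This is the error that actually aborts startup; the three items above are only warnings. Another process, most likely a second Redis instance, is already bound to port 6379. Check which process holds the port:

```bash
lsof -i :6379
```

If a process is using it, terminate that process (assuming its process ID is `PID`). If it is a managed service or another container, stopping it through its own service manager or `docker stop` is gentler than `kill -9`:

```bash
kill -9 PID
```

The `netstat` command can also show what is listening on the port:

```bash
netstat -tulnp | grep :6379
```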
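
As a complement to the `lsof` and `netstat` checks, the port test can also be scripted. The following is a minimal sketch using only the Python standard library; the helper name `port_in_use` is an assumption for illustration, and the check only reports whether a bind attempt fails with "address already in use", not which process owns the port.

```python
import errno
import socket


def port_in_use(port, host="0.0.0.0"):
    """Return True if (host, port) cannot be bound because it is already taken."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            # Same operation Redis attempts at startup: bind the listening address
            sock.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return True
            raise  # other failures (e.g. permissions) are not "port in use"
    return False


# Example: port_in_use(6379) checks the default Redis port on all interfaces
```

Running this on the Docker host before starting the container indicates whether the `bind: Address already in use` error is likely to occur again.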