Spider Storage Engine Installation Guide


Once a MySQL table reaches around 20 million rows, performance tends to drop off sharply, so a new approach should be planned before that point is reached. The Spider Storage Engine can shard (partition) MySQL tables across servers, so I investigated it.

1. Download the MySQL source code, the Spider source code, and the condition pushdown package (used to push the Spider server's query conditions down to the remote servers)
MySQL source code: [url]http://dev.mysql.com/downloads/mysql/#downloads[/url]
Spider source code: [url]https://launchpad.net/spiderformysql/+download[/url]
condition pushdown package: [url]https://edge.launchpad.net/partitionconditionpushdownformysql/+download[/url]
Place all of them under /home/peng (a quick check is sketched below).
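
Before moving on, it is worth confirming that all four files are in place under /home/peng. This is only a minimal sketch; the file names are the ones used in the later steps, so adjust them if the versions you downloaded differ:
cd /home/peng
ls -l MySQL-community-5.1.46-1.rhel5.src.rpm \
      spider-src-2.19-for-5.1.44.tgz.tar \
      spider-doc-2.19-for-5.1.44.tgz.tar \
      partition_cond_push-0.1-for-5.1.36.tgz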

2. Install Spider
Unpack the MySQL source code:
cd /home/peng
mkdir spider
mkdir /usr/src/redhat # the MySQL source RPM unpacks here
cd spider
rpm -ivh --nodeps --force /home/peng/MySQL-community-5.1.46-1.rhel5.src.rpm

After this succeeds, the MySQL source tarball appears under /usr/src/redhat/SOURCES: mysql-5.1.46.tar.gz

Unpack that tarball:
tar -xzf /usr/src/redhat/SOURCES/mysql-5.1.46.tar.gz


Likewise in the /home/peng/spider directory, unpack the Spider source code and documentation:
tar -xzf /home/peng/spider-src-2.19-for-5.1.44.tgz.tar
tar -xzf /home/peng/spider-doc-2.19-for-5.1.44.tgz.tar


Likewise in the /home/peng/spider directory, unpack the condition pushdown package (version 5.1.36 is recommended; see problem ⑤ below):
tar -xzf /home/peng/partition_cond_push-0.1-for-5.1.36.tgz


Move the Spider storage engine source into the MySQL source tree:
mv spider mysql-5.1.46/storage/


Patch MySQL and build it with the Spider storage engine:
cd mysql-5.1.46
patch -p2 < ../mysql-5.1.44.spider.diff
patch -p2 < ../mysql-5.1.36.partition_cond_push.diff
autoconf # see the autoconf installation procedure below
automake # see the automake installation procedure below

./configure --enable-thread-safe-client \
--enable-local-infile \
--with-pic --with-fast-mutexes \
--with-client-ldflags=-static \
--with-mysqld-ldflags=-static --with-zlib-dir=bundled \
--with-big-tables --with-ssl --with-readline \
--with-embedded-server --with-partition \
--with-innodb --without-ndbcluster \
--without-archive-storage-engine \
--without-blackhole-storage-engine \
--with-csv-storage-engine \
--without-example-storage-engine \
--without-federated-storage-engine \
--with-extra-charsets=complex && make


Output when the build succeeds:
[quote]g++ -DMYSQL_INSTANCE_MANAGER -DMYSQL_SERVER -I. -I../../include -I../../zlib -I../../include -I../../include -O3 -fno-implicit-templates -fno-exceptions -fno-rtti -MT mysqlmanager-guardian.o -MD -MP -MF .deps/mysqlmanager-guardian.Tpo -c -o mysqlmanager-guardian.o `test -f 'guardian.cc' || echo './'`guardian.cc
mv -f .deps/mysqlmanager-guardian.Tpo .deps/mysqlmanager-guardian.Po
g++ -DMYSQL_INSTANCE_MANAGER -DMYSQL_SERVER -I. -I../../include -I../../zlib -I../../include -I../../include -O3 -fno-implicit-templates -fno-exceptions -fno-rtti -MT mysqlmanager-parse_output.o -MD -MP -MF .deps/mysqlmanager-parse_output.Tpo -c -o mysqlmanager-parse_output.o `test -f 'parse_output.cc' || echo './'`parse_output.cc
mv -f .deps/mysqlmanager-parse_output.Tpo .deps/mysqlmanager-parse_output.Po
g++ -DMYSQL_INSTANCE_MANAGER -DMYSQL_SERVER -I. -I../../include -I../../zlib -I../../include -I../../include -O3 -fno-implicit-templates -fno-exceptions -fno-rtti -MT mysqlmanager-user_management_commands.o -MD -MP -MF .deps/mysqlmanager-user_management_commands.Tpo -c -o mysqlmanager-user_management_commands.o `test -f 'user_management_commands.cc' || echo './'`user_management_commands.cc
mv -f .deps/mysqlmanager-user_management_commands.Tpo .deps/mysqlmanager-user_management_commands.Po
g++ -DMYSQL_INSTANCE_MANAGER -DMYSQL_SERVER -I. -I../../include -I../../zlib -I../../include -I../../include -O3 -fno-implicit-templates -fno-exceptions -fno-rtti -MT mysqlmanager-angel.o -MD -MP -MF .deps/mysqlmanager-angel.Tpo -c -o mysqlmanager-angel.o `test -f 'angel.cc' || echo './'`angel.cc
mv -f .deps/mysqlmanager-angel.Tpo .deps/mysqlmanager-angel.Po
/bin/sh ../../libtool --preserve-dup-deps --tag=CXX --mode=link g++ -O3 -fno-implicit-templates -fno-exceptions -fno-rtti -rdynamic -o mysqlmanager mysqlmanager-command.o mysqlmanager-mysqlmanager.o mysqlmanager-manager.o mysqlmanager-log.o mysqlmanager-thread_registry.o mysqlmanager-listener.o mysqlmanager-protocol.o mysqlmanager-mysql_connection.o mysqlmanager-user_map.o mysqlmanager-messages.o mysqlmanager-commands.o mysqlmanager-instance.o mysqlmanager-instance_map.o mysqlmanager-instance_options.o mysqlmanager-buffer.o mysqlmanager-parse.o mysqlmanager-guardian.o mysqlmanager-parse_output.o mysqlmanager-user_management_commands.o mysqlmanager-angel.o -static liboptions.la libnet.a ../../vio/libvio.a ../../mysys/libmysys.a ../../strings/libmystrings.a ../../dbug/libdbug.a ../../extra/yassl/src/libyassl.la ../../extra/yassl/taocrypt/src/libtaocrypt.la ../../zlib/libzlt.la -lpthread -lcrypt -lnsl -lm -lpthread
libtool: link: g++ -O3 -fno-implicit-templates -fno-exceptions -fno-rtti -rdynamic -o mysqlmanager mysqlmanager-command.o mysqlmanager-mysqlmanager.o mysqlmanager-manager.o mysqlmanager-log.o mysqlmanager-thread_registry.o mysqlmanager-listener.o mysqlmanager-protocol.o mysqlmanager-mysql_connection.o mysqlmanager-user_map.o mysqlmanager-messages.o mysqlmanager-commands.o mysqlmanager-instance.o mysqlmanager-instance_map.o mysqlmanager-instance_options.o mysqlmanager-buffer.o mysqlmanager-parse.o mysqlmanager-guardian.o mysqlmanager-parse_output.o mysqlmanager-user_management_commands.o mysqlmanager-angel.o ./.libs/liboptions.a -lpthread -lpthread -lpthread -lpthread libnet.a ../../vio/libvio.a ../../mysys/libmysys.a ../../strings/libmystrings.a ../../dbug/libdbug.a ../../extra/yassl/src/.libs/libyassl.a -lpthread -lpthread -lpthread -lpthread ../../extra/yassl/taocrypt/src/.libs/libtaocrypt.a -lpthread -lpthread -lpthread -lpthread ../../zlib/.libs/libzlt.a -lpthread -lcrypt -lnsl -lm -lpthread
make[2]: Leaving directory `/var/lib/mysql/spider/mysql-5.1.46/server-tools/instance-manager'
make[1]: Leaving directory `/var/lib/mysql/spider/mysql-5.1.46/server-tools'
Making all in win
make[1]: Entering directory `/var/lib/mysql/spider/mysql-5.1.46/win'
make[1]: Nothing to be done for `all'.
make[1]: Leaving directory `/var/lib/mysql/spider/mysql-5.1.46/win'[/quote]

3. Create a binary distribution package
./scripts/make_binary_distribution

Output on success:
[quote]mysql-5.1.46-linux-i686/share/man/man1/my_print_defaults.1
mysql-5.1.46-linux-i686/share/man/man1/mysqld_safe.1
mysql-5.1.46-linux-i686/share/man/man1/mysqlshow.1
mysql-5.1.46-linux-i686/share/man/man1/mysql_config.1
mysql-5.1.46-linux-i686/share/man/man1/mysqltest.1
mysql-5.1.46-linux-i686/share/man/man1/mysqlslap.1
mysql-5.1.46-linux-i686/share/man/man8/
mysql-5.1.46-linux-i686/share/man/man8/mysqlmanager.8
mysql-5.1.46-linux-i686/share/man/man8/mysqld.8
mysql-5.1.46-linux-i686/share/aclocal/
mysql-5.1.46-linux-i686/share/aclocal/mysql.m4
mysql-5.1.46-linux-i686/share/info/
mysql-5.1.46-linux-i686/share/info/mysql.info
mysql-5.1.46-linux-i686.tar.gz created
Removing temporary directory[/quote]

4. Install the MySQL build that includes the Spider engine (MySQL Sandbox makes it possible to run several MySQL instances on one server; see its installation procedure further down)
make_sandbox $PWD/mysql-5.1.46-linux-i686.tar.gz --sandbox_directory=spider_main

Output when the installation succeeds:
[quote]unpacking /var/lib/mysql/spider/mysql-5.1.46/mysql-5.1.46-linux-i686.tar.gz
Executing low_level_make_sandbox --basedir=/var/lib/mysql/spider/mysql-5.1.46/5.1.46 \
--sandbox_directory=msb_5_1_46 \
--install_version=5.1 \
--sandbox_port=5146 \
--no_ver_after_name \
--sandbox_directory=spider_main \
--my_clause=log-error=msandbox.err
The MySQL Sandbox, version 3.0.09
(C) 2006-2010 Giuseppe Maxia
installing with the following parameters:
upper_directory = /root/sandboxes
sandbox_directory = spider_main
sandbox_port = 5146
check_port =
no_check_port =
datadir_from = script
install_version = 5.1
basedir = /var/lib/mysql/spider/mysql-5.1.46/5.1.46
tmpdir =
my_file =
operating_system_user = root
db_user = msandbox
db_password = msandbox
my_clause = log-error=msandbox.err
prompt_prefix = mysql
prompt_body = [\h] {\u} (\d) >
force =
no_ver_after_name = 1
verbose =
load_grants = 1
no_load_grants =
no_run =
no_show =
do you agree? ([Y],n) Y
loading grants
... sandbox server started
Your sandbox server was installed in $HOME/sandboxes/spider_main[/quote]

5. Configuration:
① Spider engine setup (I downloaded [url=http://datacharmer.org/downloads/spider_setup.sql]spider_setup.sql[/url] from the Spider author's blog, but it did not work; I lost a lot of time assuming the build was at fault, until I found a working setup script in a Japanese blog post.)
cd $HOME/sandboxes/spider_main
wget http://dl.iteye.com/topics/download/ce85be56-696a-36e9-9c6f-4db53dac528b
mv install_spider.zip install_spider.sql
./use < install_spider.sql
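
As a quick sanity check (a hedged sketch: install_spider.sql is expected to create the Spider system tables in the mysql schema, but the exact table names depend on the Spider version):
./use
show tables from mysql like 'spider%';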

② Check the Spider engine
./use 
select engine,support,transactions,xa from information_schema.engines;

On success, output like the following appears:

+------------+---------+--------------+------+
| engine     | support | transactions | xa   |
+------------+---------+--------------+------+
| SPIDER     | YES     | YES          | YES  |
| MRG_MYISAM | YES     | NO           | NO   |
| CSV        | YES     | NO           | NO   |
| MyISAM     | DEFAULT | NO           | NO   |
| InnoDB     | YES     | YES          | YES  |
| MEMORY     | YES     | NO           | NO   |
+------------+---------+--------------+------+
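
Once the engine shows up, a Spider table can be defined that maps onto tables held by remote MySQL servers, which is what makes the sharding mentioned at the top possible. The following is only a sketch under assumed conditions: two hypothetical backend instances backend1/backend2 (for example further sandboxes on ports 5147/5148) that already contain an identically defined table tbl_a reachable with the msandbox account; the exact CONNECTION/COMMENT syntax can vary between Spider versions:
CREATE TABLE tbl_a (
  col_a INT,
  col_b INT,
  PRIMARY KEY (col_a)
) ENGINE = SPIDER
CONNECTION ' table "tbl_a", user "msandbox", password "msandbox" '
PARTITION BY KEY (col_a) (
  PARTITION pt1 COMMENT ' host "backend1", port "5147" ',
  PARTITION pt2 COMMENT ' host "backend2", port "5148" '
);
With a definition like this, INSERTs and SELECTs against tbl_a on the Spider instance are forwarded to the backend that owns the matching partition, and with the condition pushdown patch the WHERE clause is sent along to the remote server instead of being filtered locally.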


[b]Problems encountered during the build and their solutions[/b]
① The build failed with the following error:
[quote]configure:3615: error: in `/var/lib/mysql/spider/mysql-5.1.46':
configure:3618: error: no acceptable C compiler found in $PATH[/quote]
Some research showed that the GCC packages need to be installed; see the GCC installation procedure below.

② After installing GCC, re-running the build command produced the following error:
[quote]configure:7121: /lib/cpp conftest.c
In file included from /usr/include/bits/posix1_lim.h:153,
from /usr/include/limits.h:145,
from /usr/lib/gcc/i386-redhat-linux/4.1.2/include/limits.h:122,
from /usr/lib/gcc/i386-redhat-linux/4.1.2/include/syslimits.h:7,
from /usr/lib/gcc/i386-redhat-linux/4.1.2/include/limits.h:11,
from conftest.c:11:
/usr/include/bits/local_lim.h:36:26: error: linux/limits.h: No such file or directory
configure:7128: $? = 1
configure: failed program was:
| /* confdefs.h. */
| #define PACKAGE_NAME "MySQL Server"
| #define PACKAGE_TARNAME "mysql"
| #define PACKAGE_VERSION "5.1.46"
| #define PACKAGE_STRING "MySQL Server 5.1.46"
| #define PACKAGE_BUGREPORT ""
| #define PACKAGE "mysql"
| #define VERSION "5.1.46"
| /* end confdefs.h. */
| #ifdef __STDC__
| # include <limits.h>
| #else
| # include <assert.h>
| #endif
| Syntax error
configure:7190: error: in `/var/lib/mysql/spider/mysql-5.1.46':
configure:7193: error: C preprocessor "/lib/cpp" fails sanity check
See `config.log' for more details.[/quote]
Solution:
ln -s /usr/src/linux/include/linux /usr/include/.


③ Re-running the build command then produced the following error:
[quote]configure: error: No curses/termcap library found[/quote]
Investigation showed that ncurses-devel is missing; installing it resolves the problem:
yum install ncurses-devel

④ Re-running the build command produced the following error:
[quote]In file included from /usr/include/bits/errno.h:25,
from /usr/include/errno.h:36,
from zutil.h:38,
from crc32.c:29:
/usr/include/linux/errno.h:4:23: error: asm/errno.h: No such file or directory
make[1]: *** [crc32.lo] Error 1
make[1]: Leaving directory `/var/lib/mysql/spider/mysql-5.1.46/zlib'
make: *** [all-recursive] Error 1[/quote]
Solution:
cd /usr/include
ln -s /usr/src/linux/include/asm-i386 /usr/include/asm
ln -s /usr/src/linux/include/asm-generic /usr/include/asm-generic


⑤ Re-running the build command produced the following error:
[quote]mv -f .deps/handler.Tpo .deps/handler.Po
g++ -DMYSQL_SERVER -DDEFAULT_MYSQL_HOME="\"/usr/local\"" -DMYSQL_DATADIR="\"/usr/local/var\"" -DSHAREDIR="\"/usr/local/share/mysql\"" -DPLUGINDIR="\"/usr/local/lib/mysql/plugin\"" -DHAVE_EVENT_SCHEDULER -DHAVE_CONFIG_H -I. -I../include -I../zlib -I../include -I../include -I../regex -I. -O3 -fno-implicit-templates -fno-exceptions -fno-rtti -MT ha_partition.o -MD -MP -MF .deps/ha_partition.Tpo -c -o ha_partition.o ha_partition.cc
ha_partition.cc:6575: error: no "const COND* ha_partition::cond_push(const COND*)" member function declared in class "ha_partition"
ha_partition.cc:6595: error: no "void ha_partition::cond_pop()" member function declared in class "ha_partition"
ha_partition.cc:6608: error: redefinition of "const COND* ha_partition::cond_push(const COND*)"
ha_partition.cc:6575: error: "const COND* ha_partition::cond_push(const COND*)" previously defined here
ha_partition.cc:6628: error: redefinition of "void ha_partition::cond_pop()"
ha_partition.cc:6595: error: "void ha_partition::cond_pop()" previously defined here
make[3]: *** [ha_partition.o] Error 1
make[3]: Leaving directory `/var/lib/mysql/spider/mysql-5.1.46/sql'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/var/lib/mysql/spider/mysql-5.1.46/sql'
make[1]: *** [all] Error 2
make[1]: Leaving directory `/var/lib/mysql/spider/mysql-5.1.46/sql'[/quote]
Using the 5.1.36 version of the condition pushdown package avoids this problem (apparently the newer versions of the patch do not apply cleanly to the 5.1.46 source, leaving duplicate cond_push()/cond_pop() definitions in ha_partition.cc, hence the redefinition errors).

[b]Installing the tools[/b]
Installing autoconf
Download: [url]http://download.chinaunix.net/download/0001000/653.shtml[/url]
cd /home/peng
tar -xzf autoconf-2.63.tar.gz
cd autoconf-2.63/
./configure --prefix=/usr
make
make install


Installing automake
Download: [url]http://ftp.gnu.org/gnu/automake/[/url]
cd /home/peng
tar -xzf automake-1.11.1.tar.gz
cd automake-1.11.1
./configure --prefix=/usr
make
make check (takes a very long time; not required)
make install


Installing gcc
Insert the CentOS DVD, change into its CentOS directory, and install the following packages in this order:
rpm -ivh cpp-4.1.2-42.el5.i386.rpm
rpm -ivh kernel-headers-2.6.6-1.i386.rpm
rpm -ivh glibc-2.5-24.i386.rpm
rpm -ivh glibc-headers-2.5-24.i386.rpm
rpm -ivh glibc-devel-2.5-24.i386.rpm
rpm -ivh libgomp-4.1.2-42.el5.i386.rpm
rpm -ivh gcc-4.1.2-42.el5.i386.rpm


Installing g++
Insert the CentOS DVD, change into its CentOS directory, and install the following packages in this order:
rpm -ivh libstdc++-devel-4.1.2-42.el5.i386.rpm
rpm -ivh gcc-c++-4.1.2-42.el5.i386.rpm


Installing MySQL Sandbox
Download: [url]https://launchpad.net/mysql-sandbox/+download[/url]
cd /home/peng
tar -xzf MySQL-Sandbox-3.0.09.tar.gz
cd MySQL-Sandbox-3.0.09
perl Makefile.PL
make
sudo make install
export SANDBOX_AS_ROOT=1 (when running as root this must be set to a non-zero value; non-root users can skip this line.)


Uninstalling an existing MySQL
Find the installed MySQL packages:
[quote]rpm -qa | grep MySQL
MySQL-server-community-5.1.28-0.rhel5
MySQL-client-community-5.1.28-0.rhel5[/quote]
Remove the existing MySQL packages:
[quote]rpm -e --nodeps MySQL-client-community-5.1.28-0.rhel5
rpm -e --nodeps MySQL-server-community-5.1.28-0.rhel5[/quote]

[b]References[/b]
[url]http://nippondanji.blogspot.com/2010/04/spider.html[/url]