Installing the Python crawler framework Scrapy on Linux

This post records installing the Python crawler framework Scrapy on Linux and the problems hit along the way: a 'Spider' attribute error and an lxml build failure, with fixes for both.



My install procedure:

sudo pip install scrapy

Couldn't be simpler.
But when I ran the first spider example

scrapy crawl dmoz

it failed with the following error:

AttributeError: 'module' object has no attribute 'Spider'
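This AttributeError is typical of an outdated Scrapy: the tutorial spider subclasses scrapy.Spider, an attribute that old releases of the scrapy module simply do not have, so the class statement itself blows up. For reference, a minimal sketch of the spider being run (it follows the official Scrapy tutorial; the file path tutorial/spiders/dmoz_spider.py and the parse body are the tutorial's, not something shown in this post):

    import scrapy

    class DmozSpider(scrapy.Spider):
        # "scrapy crawl dmoz" looks the spider up by this name
        name = "dmoz"
        allowed_domains = ["dmoz.org"]
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
        ]

        def parse(self, response):
            # Save each fetched page to a local .html file
            filename = response.url.split("/")[-2] + ".html"
            with open(filename, "wb") as f:
                f.write(response.body)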

The fix, from Stack Overflow:
http://stackoverflow.com/questions/30695866/attributeerror-module-object-has-no-attribute-spider

sudo pip install scrapy --upgrade
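After upgrading, it is worth confirming which version is now active; the crawl log further down shows 1.1.0 on this machine:

    $ scrapy version
    Scrapy 1.1.0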

Normally that upgrade resolves the error. But I ran into another problem: lxml would not install.

    creating build/temp.linux-x86_64-2.7/src/lxml
    x86_64-linux-gnu-gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -Isrc/lxml/includes -I/usr/include/python2.7 -c src/lxml/lxml.etree.c -o build/temp.linux-x86_64-2.7/src/lxml/lxml.etree.o -w
    In file included from src/lxml/lxml.etree.c:320:0:
    src/lxml/includes/etree_defs.h:14:31: fatal error: libxml/xmlversion.h: No such file or directory
     #include "libxml/xmlversion.h"
                                   ^
    compilation terminated.
    Compile failed: command 'x86_64-linux-gnu-gcc' failed with exit status 1
    creating tmp
    cc -I/usr/include/libxml2 -c /tmp/xmlXPathInitM_KXBh.c -o tmp/xmlXPathInitM_KXBh.o
    cc tmp/xmlXPathInitM_KXBh.o -lxml2 -o a.out
    error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

    ----------------------------------------
  Rolling back uninstall of lxml
Command "/usr/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-F1ulO4/lxml/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-OMbiRQ-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-F1ulO4/lxml/

The compiler cannot find libxml/xmlversion.h because the libxml2 (and libxslt) development headers are missing. The fix, again from Stack Overflow:
http://stackoverflow.com/questions/5178416/pip-install-lxml-error

sudo apt-get install python-dev libxml2-dev libxslt1-dev zlib1g-dev
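A quick way to confirm the headers are now discoverable (xml2-config and xslt-config are shipped by libxml2-dev and libxslt1-dev respectively; the versions printed will vary by distribution):

    xml2-config --version
    xslt-config --version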

With the dependencies installed:

sudo pip install lxml --upgrade

This time it succeeds:

beast@beast:~/Code/python/tutorial$ sudo pip install lxml --upgrade
The directory '/home/beast/.cache/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/home/beast/.cache/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
/usr/local/lib/python2.7/dist-packages/pip/_vendor/requests/packages/urllib3/util/ssl_.py:318: SNIMissingWarning: An HTTPS request has been made, but the SNI (Subject Name Indication) extension to TLS is not available on this platform. This may cause the server to present an incorrect TLS certificate, which can cause validation failures. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#snimissingwarning.
  SNIMissingWarning
/usr/local/lib/python2.7/dist-packages/pip/_vendor/requests/packages/urllib3/util/ssl_.py:122: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning
Collecting lxml
  Downloading lxml-3.6.0.tar.gz (3.7MB)
    100% |████████████████████████████████| 3.7MB 213kB/s 
Installing collected packages: lxml
  Found existing installation: lxml 3.3.3
    Uninstalling lxml-3.3.3:
      Successfully uninstalled lxml-3.3.3
  Running setup.py install for lxml ... done
Successfully installed lxml-3.6.0
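
The ownership warnings at the top are harmless; as the message itself suggests, invoking pip as sudo -H pip install lxml --upgrade keeps the cache usable. The SNI and InsecurePlatform warnings are a known limitation of Python 2.7 builds older than 2.7.9 and do not affect the install. To double-check that the new lxml imports cleanly (this should print 3.6.0 here):

    python -c "import lxml.etree; print(lxml.etree.__version__)"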

Now the first spider example runs:

beast@beast:~/Code/python/tutorial$ scrapy crawl dmoz
/usr/local/lib/python2.7/dist-packages/scrapy/settings/deprecated.py:26: ScrapyDeprecationWarning: You are using the following settings which are deprecated or obsolete (ask scrapy-users@googlegroups.com for alternatives):
    BOT_VERSION: no longer used (user agent defaults to Scrapy now)
  warnings.warn(msg, ScrapyDeprecationWarning)
2016-07-06 16:41:56 [scrapy] INFO: Scrapy 1.1.0 started (bot: tutorial)
2016-07-06 16:41:56 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'USER_AGENT': 'tutorial/1.0', 'BOT_NAME': 'tutorial'}
2016-07-06 16:41:56 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-07-06 16:41:56 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-07-06 16:41:56 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-07-06 16:41:56 [scrapy] INFO: Enabled item pipelines:
[]
2016-07-06 16:41:56 [scrapy] INFO: Spider opened
2016-07-06 16:41:56 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-07-06 16:41:56 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-07-06 16:41:58 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2016-07-06 16:41:58 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2016-07-06 16:41:58 [scrapy] INFO: Closing spider (finished)
2016-07-06 16:41:58 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 472,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 16392,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 7, 6, 8, 41, 58, 337488),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2016, 7, 6, 8, 41, 56, 777087)}
2016-07-06 16:41:58 [scrapy] INFO: Spider closed (finished)
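
Both start URLs came back with HTTP 200 (downloader/response_status_count/200: 2) and the spider closed with finish_reason 'finished', so Scrapy, lxml, and the project are working end to end.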