Installing the Python crawler framework Scrapy on Linux

This post records installing the Python crawler framework Scrapy on Linux and the problems hit along the way: a 'Spider' attribute error and an lxml build failure, with fixes for both.



My install procedure:

sudo pip install scrapy

Couldn't be simpler.
But when I ran the first spider example

scrapy crawl dmoz

it failed with the following error:

AttributeError: 'module' object has no attribute 'Spider'
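This AttributeError is typical of an outdated Scrapy: the tutorial spider subclasses scrapy.Spider, an attribute that old releases of the scrapy module simply do not have, so the class statement itself blows up. For reference, a minimal sketch of the spider being run (it follows the official Scrapy tutorial; the file path tutorial/spiders/dmoz_spider.py and the parse body are the tutorial's, not something shown in this post):

    import scrapy

    class DmozSpider(scrapy.Spider):
        # "scrapy crawl dmoz" looks the spider up by this name
        name = "dmoz"
        allowed_domains = ["dmoz.org"]
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
        ]

        def parse(self, response):
            # Save each fetched page to a local .html file
            filename = response.url.split("/")[-2] + ".html"
            with open(filename, "wb") as f:
                f.write(response.body)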

The fix, from Stack Overflow:
http://stackoverflow.com/questions/30695866/attributeerror-module-object-has-no-attribute-spider

sudo pip install scrapy --upgrade
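After upgrading, it is worth confirming which version is now active; the crawl log further down shows 1.1.0 on this machine:

    $ scrapy version
    Scrapy 1.1.0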

Normally that upgrade resolves the error. But I ran into another problem: lxml would not install.

    creating build/temp.linux-x86_64-2.7/src/lxml
    x86_64-linux-gnu-gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -Isrc/lxml/includes -I/usr/include/python2.7 -c src/lxml/lxml.etree.c -o build/temp.linux-x86_64-2.7/src/lxml/lxml.etree.o -w
    In file included from src/lxml/lxml.etree.c:320:0:
    src/lxml/includes/etree_defs.h:14:31: fatal error: libxml/xmlversion.h: No such file or directory
     #include "libxml/xmlversion.h"
                                   ^
    compilation terminated.
    Compile failed: command 'x86_64-linux-gnu-gcc' failed with exit status 1
    creating tmp
    cc -I/usr/include/libxml2 -c /tmp/xmlXPathInitM_KXBh.c -o tmp/xmlXPathInitM_KXBh.o
    cc tmp/xmlXPathInitM_KXBh.o -lxml2 -o a.out
    error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

    ----------------------------------------
  Rolling back uninstall of lxml
Command "/usr/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-F1ulO4/lxml/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-OMbiRQ-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-F1ulO4/lxml/

The compiler cannot find libxml/xmlversion.h because the libxml2 (and libxslt) development headers are missing. The fix, again from Stack Overflow:
http://stackoverflow.com/questions/5178416/pip-install-lxml-error

sudo apt-get install python-dev libxml2-dev libxslt1-dev zlib1g-dev
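A quick way to confirm the headers are now discoverable (xml2-config and xslt-config are shipped by libxml2-dev and libxslt1-dev respectively; the versions printed will vary by distribution):

    xml2-config --version
    xslt-config --version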

With the dependencies installed:

sudo pip install lxml --upgrade

This time it succeeds:

beast@beast:~/Code/python/tutorial$ sudo pip install lxml --upgrade
The directory '/home/beast/.cache/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/home/beast/.cache/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
/usr/local/lib/python2.7/dist-packages/pip/_vendor/requests/packages/urllib3/util/ssl_.py:318: SNIMissingWarning: An HTTPS request has been made, but the SNI (Subject Name Indication) extension to TLS is not available on this platform. This may cause the server to present an incorrect TLS certificate, which can cause validation failures. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#snimissingwarning.
  SNIMissingWarning
/usr/local/lib/python2.7/dist-packages/pip/_vendor/requests/packages/urllib3/util/ssl_.py:122: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning
Collecting lxml
  Downloading lxml-3.6.0.tar.gz (3.7MB)
    100% |████████████████████████████████| 3.7MB 213kB/s 
Installing collected packages: lxml
  Found existing installation: lxml 3.3.3
    Uninstalling lxml-3.3.3:
      Successfully uninstalled lxml-3.3.3
  Running setup.py install for lxml ... done
Successfully installed lxml-3.6.0
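
The ownership warnings at the top are harmless; as the message itself suggests, invoking pip as sudo -H pip install lxml --upgrade keeps the cache usable. The SNI and InsecurePlatform warnings are a known limitation of Python 2.7 builds older than 2.7.9 and do not affect the install. To double-check that the new lxml imports cleanly (this should print 3.6.0 here):

    python -c "import lxml.etree; print(lxml.etree.__version__)"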

Now the first spider example runs:

beast@beast:~/Code/python/tutorial$ scrapy crawl dmoz
/usr/local/lib/python2.7/dist-packages/scrapy/settings/deprecated.py:26: ScrapyDeprecationWarning: You are using the following settings which are deprecated or obsolete (ask scrapy-users@googlegroups.com for alternatives):
    BOT_VERSION: no longer used (user agent defaults to Scrapy now)
  warnings.warn(msg, ScrapyDeprecationWarning)
2016-07-06 16:41:56 [scrapy] INFO: Scrapy 1.1.0 started (bot: tutorial)
2016-07-06 16:41:56 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'USER_AGENT': 'tutorial/1.0', 'BOT_NAME': 'tutorial'}
2016-07-06 16:41:56 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-07-06 16:41:56 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-07-06 16:41:56 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-07-06 16:41:56 [scrapy] INFO: Enabled item pipelines:
[]
2016-07-06 16:41:56 [scrapy] INFO: Spider opened
2016-07-06 16:41:56 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-07-06 16:41:56 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-07-06 16:41:58 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2016-07-06 16:41:58 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2016-07-06 16:41:58 [scrapy] INFO: Closing spider (finished)
2016-07-06 16:41:58 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 472,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 16392,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 7, 6, 8, 41, 58, 337488),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2016, 7, 6, 8, 41, 56, 777087)}
2016-07-06 16:41:58 [scrapy] INFO: Spider closed (finished)
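
Both start URLs came back with HTTP 200 (downloader/response_status_count/200: 2) and the spider closed with finish_reason 'finished', so Scrapy, lxml, and the project are working end to end.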