Scrapy at a glance

Scrapy is a fast, high-level screen scraping and web crawling framework written in Python, used to crawl web sites and extract structured data from their pages. The extracted data can serve a wide range of useful applications, such as data mining, information processing or historical archival.

Even though Scrapy was originally designed for screen scraping (more precisely, web scraping), it can also be used to extract data through APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler.

The purpose of this document is to introduce you to the concepts behind Scrapy so you can get an idea of how it works and decide whether Scrapy is what you need.

When you're ready to start a project, you can begin with the tutorial.

Pick a website

So you need to extract some information from a website, but the website doesn't provide any API or mechanism to access that information programmatically. Scrapy can help you extract it.

Let's say we want to extract the URL, name, description and size of all torrent files added today on the Mininova site.

The list of all torrents added today can be found on this page: http://www.mininova.org/today

Define the data you want to scrape

The first thing to do is define the data we want to scrape. In Scrapy, this is done through Scrapy Items (torrent files, in this case).

This would be our Item:

from scrapy.item import Item, Field

class Torrent(Item):
    url = Field()
    name = Field()
    description = Field()
    size = Field()
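
Items behave much like Python dictionaries, except that only the declared fields can be set. A quick illustration (not part of the original text; it assumes the Torrent class defined above is in scope and reuses values from the example page):

torrent = Torrent(url='http://www.mininova.org/tor/2657665')
torrent['name'] = 'Home[2009][Eng]XviD-ovd'
print(torrent['name'])   # Home[2009][Eng]XviD-ovd
print(torrent.keys())    # only the fields that have been set so far
# Assigning to an undeclared field, e.g. torrent['author'] = '...', raises a
# KeyError, which helps catch typos in field names early.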

 

 

Write a Spider to extract the data

The next thing is to write a Spider which defines the start URL (http://www.mininova.org/today), the rules for following links, and the rules for extracting data from pages.

If we take a look at that page's content, we'll see that all torrent URLs look like http://www.mininova.org/tor/NUMBER, where NUMBER is an integer. We'll use that to construct the regular expression for the links to follow: /tor/\d+.
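
As a quick sanity check (not part of the original example), the pattern can be tried against a couple of URLs with Python's re module:

import re

torrent_link = re.compile(r'/tor/\d+')

print(bool(torrent_link.search('http://www.mininova.org/tor/2657665')))  # True
print(bool(torrent_link.search('http://www.mininova.org/today')))        # False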

We'll use XPath to select the data to extract from the web page's HTML source. Let's take one of those torrent pages (for example, http://www.mininova.org/tor/2657665):

And look at the page HTML source to construct the XPath expressions that select the data we want: the torrent name, description and size.

By looking at the page HTML source, we can see that the file name is contained inside an <h1> tag:

<h1>Home[2009][Eng]XviD-ovd</h1>

 

An XPath expression to extract the name could be:

//h1/text()

 

And the description is contained inside a <div> tag with id="description":
<h2>Description:</h2>

<div id="description">
"HOME" - a documentary film by Yann Arthus-Bertrand
<br/>
<br/>
***
<br/>
<br/>
"We are living in exceptional times. Scientists tell us that we have 10 years to change the way we live, avert the depletion of natural resources and the catastrophic evolution of the Earth's climate.

...

 

An XPath expression to select the description could be:
//div[@id='description']

 

Finally, the file size is contained in the second <p> tag inside the <div> tag with id="specifications":
<div id="specifications">

<p>
<strong>Category:</strong>
<a href="/cat/4">Movies</a> &gt; <a href="/sub/35">Documentary</a>
</p>

<p>
<strong>Total size:</strong>
699.79&nbsp;megabyte</p>

 

An XPath expression to select the file size could be:
//div[@id='specifications']/p[2]/text()[2]
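
Before hard-coding these expressions into a spider, it is handy to try them interactively. A rough sketch of such a session with the Scrapy shell (the hxs selector object is what the Scrapy release used in this article exposes in the shell; newer releases offer a response.xpath() shortcut instead):

scrapy shell http://www.mininova.org/tor/2657665

>>> hxs.select("//h1/text()").extract()                                  # torrent name
>>> hxs.select("//div[@id='description']").extract()                     # description block
>>> hxs.select("//div[@id='specifications']/p[2]/text()[2]").extract()   # file size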

 

For more information about XPath, see the XPath reference.

Finally, here's the spider code:

class MininovaSpider(CrawlSpider):

    name = 'mininova.org'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']
    rules = [Rule(SgmlLinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        x = HtmlXPathSelector(response)

        torrent = Torrent()
        torrent['url'] = response.url
        torrent['name'] = x.select("//h1/text()").extract()
        torrent['description'] = x.select("//div[@id='description']").extract()
        torrent['size'] = x.select("//div[@id='specifications']/p[2]/text()[2]").extract()
        return torrent

 

For brevity's sake, we intentionally left out the import statements. The Torrent item is the one defined above.

Run the spider to extract the data

Finally, we'll run the spider to crawl the site and produce an output file, scraped_data.json, containing the scraped data in JSON format:

scrapy crawl mininova.org -o scraped_data.json -t json

 

This uses feed exports to generate the JSON file. You can easily change the export format (XML or CSV, for example) or the storage backend (FTP or Amazon S3, for example).
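
For instance, the same crawl can be exported as CSV instead; the -t flag selects the exporter in the Scrapy release used in this article (newer releases infer the format from the -o file extension):

scrapy crawl mininova.org -o scraped_data.csv -t csv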

You can also write an item pipeline to store the items in a database very easily.
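
A minimal sketch of such a pipeline, assuming a local SQLite database and the hypothetical class and table names shown below (not part of the original article); it would be enabled by adding the class path to ITEM_PIPELINES in the project settings:

import sqlite3

class TorrentSQLitePipeline(object):
    """Stores each scraped torrent as a row in a local SQLite database."""

    def __init__(self):
        self.conn = sqlite3.connect('torrents.db')
        self.conn.execute("CREATE TABLE IF NOT EXISTS torrents "
                          "(url TEXT, name TEXT, description TEXT, size TEXT)")

    def process_item(self, item, spider):
        # Selector results are lists (see "Review scraped data" below),
        # so join them into plain strings before inserting.
        self.conn.execute("INSERT INTO torrents VALUES (?, ?, ?, ?)",
                          (item['url'],
                           ''.join(item['name']),
                           ''.join(item['description']),
                           ''.join(item['size'])))
        self.conn.commit()
        return item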

Review scraped data

If you check the scraped_data.json file after the process finishes, you'll see the scraped items there:

[{"url": "http://www.mininova.org/tor/2657665", "name": ["Home[2009][Eng]XviD-ovd"], "description": ["HOME - a documentary film by ..."], "size": ["699.69 megabyte"]},
# ... other items ...
]

 

You'll notice that all field values (except for the url, which was assigned directly) are actually lists. This is because the selectors return lists. You may want to store single values, or perform some additional parsing or cleansing of the values. That's what Item Loaders are for.
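
A rough sketch of how an Item Loader could collapse those lists into single, cleaned-up values (the module paths match the older scrapy.contrib layout used elsewhere in this article; newer releases provide scrapy.loader.ItemLoader instead, and the processors shown are just one plausible choice):

from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose, TakeFirst

class TorrentLoader(XPathItemLoader):
    # Keep a single value per field instead of a list.
    default_output_processor = TakeFirst()
    # Strip whitespace from the name, and join the description fragments.
    name_in = MapCompose(lambda v: v.strip())
    description_out = Join()

# Inside parse_torrent() the loader would be used roughly like this:
#     l = TorrentLoader(item=Torrent(), response=response)
#     l.add_value('url', response.url)
#     l.add_xpath('name', "//h1/text()")
#     l.add_xpath('description', "//div[@id='description']//text()")
#     l.add_xpath('size', "//div[@id='specifications']/p[2]/text()[2]")
#     return l.load_item()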

 

What else?

You've seen how to extract and store items from a website using Scrapy, but this is just the surface. Scrapy provides a lot of powerful features for making scraping easy and efficient, such as:

  • Built-in support for selecting and extracting data from HTML and XML sources
  • Built-in support for cleaning and sanitizing the scraped data using a collection of reusable filters (called Item Loaders) shared between all the spiders
  • Built-in support for generating feed exports in multiple formats (JSON, CSV, XML) and storing them in multiple backends (FTP, S3, local filesystem)
  • A media pipeline for automatically downloading images (or any other media) associated with the scraped items
  • Support for extending Scrapy by plugging in your own functionality using signals and a well-defined API (middlewares, extensions, and pipelines)
  • A wide range of built-in middlewares and extensions (many of them toggled through project settings; see the sketch after this list) for:
    • cookies and session handling
    • HTTP compression
    • HTTP authentication
    • HTTP cache
    • user-agent spoofing
    • robots.txt
    • crawl depth restriction
    • and more
  • Robust encoding support and auto-detection, for dealing with foreign, non-standard and broken encoding declarations
  • Support for creating spiders based on pre-defined templates, to speed up spider creation and make their code more consistent in large projects (see the genspider command for more details)
  • Extensible stats collection for multiple spider metrics, useful for monitoring the performance of your spiders and detecting when they get broken
  • An interactive shell console for trying XPaths, very useful for writing and debugging your spiders
  • A system service designed to ease the deployment and running of your spiders in production
  • A built-in web service for monitoring and controlling your bot
  • A Telnet console for hooking into a Python console running inside your Scrapy process, to introspect and debug your crawler
  • A logging facility that you can hook into for catching errors during the scraping process
  • Support for crawling based on URLs discovered through Sitemaps
  • A caching DNS resolver
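
Many of these middlewares and extensions are switched on through the project's settings.py. A small illustrative sketch (the setting names come from the Scrapy documentation; the values are arbitrary examples):

# settings.py (excerpt)
USER_AGENT = 'mybot (+http://www.example.com)'  # user-agent string sent with requests
HTTPCACHE_ENABLED = True                        # cache responses locally while developing
DEPTH_LIMIT = 3                                 # restrict how deep the crawl may go
ROBOTSTXT_OBEY = True                           # respect robots.txt rules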

What's next?

The next obvious steps are for you to download Scrapy, read the tutorial and join the community. Thanks for your interest!

Source of T:\mininova\mininova\items.py:

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/topics/items.html

from scrapy.item import Item, Field

class MininovaItem(Item):
    # define the fields for your item here like:
    # name = Field()
    url = Field()
    name = Field()
    description = Field()
    size = Field()
        

Source of T:\mininova\mininova\spiders\spider_mininova.py:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule   
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from mininova.items import MininovaItem

class MininovaSpider(CrawlSpider):

    name = 'mininova.org'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']
    #start_urls = ['http://www.mininova.org/yesterday']
    rules = [Rule(SgmlLinkExtractor(allow=['/tor/\d+']), 'parse_item')]

    # def parse_item(self, response):
        # filename = response.url.split("/")[-1] + ".html"
        # open(filename, 'wb').write(response.body)

    
    def parse_item(self, response):
        x = HtmlXPathSelector(response)
        item = MininovaItem()
        item['url'] = response.url
        #item['name'] = x.select('''//*[@id="content"]/h1''').extract()
        item['name'] = x.select("//h1/text()").extract()
        #item['description'] = x.select("//div[@id='description']").extract()
        item['description'] = x.select('''//*[@id="specifications"]/p[7]/text()''').extract() #download
        #item['size'] = x.select("//div[@id='info-left']/p[2]/text()[2]").extract()
        item['size'] = x.select('''//*[@id="specifications"]/p[3]/text()''').extract()
        return item

 

 
