Learning the Scrapy Framework

Latest recommended article published on 2024-06-22 16:33:22


License: CC 4.0 BY-SA


scrapy startproject mySpider    Create a new crawler project

scrapy crawl myspider    Run the spider

scrapy crawl myspider -o myspider.json    Run the spider and export the scraped items to a JSON file

Directory structure of the project:

mySpider
└── mySpider
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py      pipeline file
    ├── settings.py       settings file
    └── spiders           directory for spider code
        └── __init__.py

Create the spider file under the spiders directory.

This is where our own spider code lives.

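import scrapy


class aaaSpider(scrapy.Spider):
    # Name used when running the spider (scrapy crawl myspider)
    name = "myspider"
    # Domains the spider is allowed to crawl (domain names, not URLs)
    allowed_domains = ["www.xxx.cn"]
    # URLs the spider starts crawling from
    start_urls = ["http://www.xxx.cn/channel/xxxx.shtml#"]

    # Callback that handles each downloaded response; the name parse is fixed
    def parse(self, response):
        pass  # extraction logic goes here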
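
To illustrate what the allowed range means, here is a minimal stdlib-only sketch of a domain check. It is not Scrapy's actual filtering code; the helper name `is_allowed` and the sample URLs are assumptions made for illustration. It shows why entries in allowed_domains should be bare domain names rather than full URLs.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


allowed = ["xxx.cn"]  # placeholder domain, for illustration only
print(is_allowed("http://www.xxx.cn/channel/xxxx.shtml#", allowed))
print(is_allowed("http://www.example.com/", allowed))
# A full URL used as a "domain" never matches a host name
print(is_allowed("http://www.xxx.cn/", ["http://www.xxx.cn/"]))
```

Under this check, a host matches only when it equals a listed domain or ends with a dot followed by that domain, so an entry that includes a scheme or a trailing slash would match nothing.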

In settings.py:

Set the default HTTP request headers

DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

Enable the pipeline

ITEM_PIPELINES = {
    'mySpider.pipelines.MyspiderPipeline': 300,
}
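
The number attached to each pipeline is its priority: Scrapy runs pipelines in ascending order of this value, conventionally chosen in the 0-1000 range. Below is a small sketch of that ordering; the second entry, `mySpider.pipelines.DedupPipeline`, is hypothetical and added only for illustration.

```python
ITEM_PIPELINES = {
    "mySpider.pipelines.MyspiderPipeline": 300,
    "mySpider.pipelines.DedupPipeline": 100,  # hypothetical, for illustration only
}

# Lower values run first, so sorting by value gives the execution order.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for path in run_order:
    print(ITEM_PIPELINES[path], path)
```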

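pipelines.py:

Processes the items scraped from the server's responses and writes them to a file

import json


class MyspiderPipeline(object):
    def __init__(self):
        # Text mode with explicit UTF-8 so non-ASCII text is written as-is
        self.filename = open("xxxx.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Serialize each item as one JSON object per line
        jsontext = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.filename.write(jsontext)
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.filename.close()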
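
To check the pipeline's logic without running a crawl, its hooks can be called by hand in the order Scrapy would call them. The sketch below is a stand-in under stated assumptions: the class name `JsonLinesPipeline`, its `path` argument, and the sample items are made up for illustration, and plain dicts are used in place of Scrapy items.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Stand-in for MyspiderPipeline; takes the output path as an argument."""

    def __init__(self, path):
        self.file = open(path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # ensure_ascii=False keeps non-ASCII characters readable in the file
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()


# Write to a temporary directory so the example leaves no file behind in the project
path = os.path.join(tempfile.mkdtemp(), "items.json")
pipeline = JsonLinesPipeline(path)
for item in [{"title": "第一条"}, {"title": "second"}]:
    pipeline.process_item(item, spider=None)
pipeline.close_spider(spider=None)

# Read the file back, one JSON object per line
with open(path, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
print(rows)
```

Note that a file written this way holds one JSON object per line rather than a single JSON array, so it has to be read line by line.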