Crawl target: javlib, using the Scrapy framework.
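First, create the project and the spider from the command line with
scrapy startproject projectname
and, from inside the new project directory,
scrapy genspider spidername www.ja14b.com
(genspider takes the spider name followed by the domain to crawl).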
Next, define the item fields in items.py:
import scrapy


class AvmoItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    pic = scrapy.Field()
    url = scrapy.Field()
    id_ = scrapy.Field()
This is the spider file in the spiders folder,
spidername.py:
# -*- coding: utf-8 -*-
import scrapy
import os
import sys
import re
import urllib.parse
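# Add the parent of the spiders folder (the project package directory)
# to sys.path so that items.py can be imported directly.
sys.path.append(os.path.abspath(os.path.dirname(__file__) + '/' + '..'))
import items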


class JavbusSpider(scrapy.Spider):
    name = 'javlibrary'
    allowed_domains = ['www.ja14b.com']
    start_urls = ['http://www.ja14b.com/cn/vl_genre.php?g=cu']

    # Crawl the listing (index) pages