
Web Scraping in Practice
This column is a progressive series of hands-on web-scraping exercises; I hope we can all learn and improve together.
shifanfashi
Scraping in Practice 15: Scraping Lagou's Python job listings with Selenium and saving them to MySQL

import requests
import time
from lxml import etree
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.su...

Original · 2019-06-09 20:56:48 · 1328 views · 0 comments

Scraping in Practice 14: Scraping Jiangsu Province environmental monitoring projects

import requests
from bs4 import BeautifulSoup
import time
import re
import os
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",  # request header field specifying the content types the client can accept
    ...

Original · 2019-05-19 20:48:50 · 1321 views · 0 comments

Scraping in Practice 13: Scraping the Douban movie chart via its Ajax API

import requests
import json
url = "https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&start=40&limit=20"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT ...

Original · 2019-05-19 09:51:18 · 870 views · 0 comments

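The endpoint above returns plain JSON rather than HTML, so paging is just a matter of stepping the start parameter. A minimal sketch, assuming the response is a JSON list whose items carry title and score fields:

import requests

# Ajax endpoint from the article; start/limit control paging.
BASE = ("https://movie.douban.com/j/chart/top_list"
        "?type=5&interval_id=100%3A90&action=&start={start}&limit=20")
headers = {'User-Agent': 'Mozilla/5.0'}

for start in range(0, 100, 20):            # first five pages
    resp = requests.get(BASE.format(start=start), headers=headers)
    for movie in resp.json():              # endpoint returns a JSON array
        print(movie.get('title'), movie.get('score'))
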
Scraping in Practice 12: Scraping hero images from the League of Legends box helper

from urllib.request import urlretrieve
import requests
import os
def hero_imgs_download(url, header):
    req = requests.get(url = url, headers = header).json()
    hero_num = len(req['list'])
    p...

Original · 2019-05-18 21:52:41 · 847 views · 0 comments

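The preview imports urlretrieve, which does the actual saving. A minimal sketch of that download step; the image URL and directory names here are placeholders, not the article's:

import os
from urllib.request import urlretrieve

def save_image(img_url, save_dir, filename):
    """Download one image to save_dir/filename with urlretrieve."""
    os.makedirs(save_dir, exist_ok=True)
    urlretrieve(img_url, os.path.join(save_dir, filename))

# hypothetical hero-portrait URL, for illustration only
save_image('https://example.com/hero/1.jpg', 'hero_imgs', '1.jpg')
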
Scraping in Practice 11: Scraping and saving AISS images

import requests
import bs4
import urllib.request
url = "http://www.ligui.org/aiss/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/7...

Original · 2019-05-18 21:07:39 · 29799 views · 0 comments

Scraping in Practice 10: Scraping the Douban book chart and saving it to Excel

import requests
import xlwt
import xlrd
from lxml import etree
class doubanBookData(object):
    def __init__(self):
        self.f = xlwt.Workbook()  # create the workbook
        self.sheet1 = self.f.add_sheet(u...

Original · 2019-05-18 20:41:30 · 2101 views · 0 comments

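The workbook/worksheet setup the preview begins is the standard xlwt pattern. A minimal sketch of writing rows and saving; the sheet and column names are assumptions:

import xlwt

# Build a workbook, write a header row plus one sample row, then save.
book = xlwt.Workbook()
sheet = book.add_sheet('douban_books')
for col, title in enumerate(['Title', 'Rating', 'Ratings count']):
    sheet.write(0, col, title)          # write(row, column, value)
sheet.write(1, 0, u'活着')               # sample row, illustration only
sheet.write(1, 1, 9.4)
sheet.write(1, 2, 600000)
book.save('douban_books.xls')            # xlwt writes legacy .xls files
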
Scraping in Practice 9: Scraping merchant contact details from 1688

# coding:utf-8
import requests
import bs4
import time
import xlwt
import random
def get_urls(url, page):
    """Get the page URL of every shop returned by a product search."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; W...

Original · 2019-05-18 17:38:18 · 10541 views · 6 comments

Scraping in Practice 9: Scraping merchant information from 1688

# coding:utf-8
import requests
import bs4
import time
import xlwt
import random
def get_IP():
    """Fetch proxy IPs."""
    url = "http://www.xicidaili.com/nn/"
    headers = {
        'User-Agent': 'Mo...

Original · 2019-05-18 15:24:11 · 7162 views · 3 comments

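get_IP() suggests the article rotates proxies scraped from xicidaili. A minimal sketch of that rotation with requests; the proxy addresses below are placeholders:

import random
import requests

# Addresses of the kind get_IP() would scrape; placeholders only.
proxy_pool = ['122.241.72.191:808', '121.61.1.100:9999']

def fetch(url, headers):
    """Retry through random proxies until one answers."""
    for _ in range(len(proxy_pool)):
        proxy = random.choice(proxy_pool)
        try:
            return requests.get(url, headers=headers,
                                proxies={'http': 'http://' + proxy},
                                timeout=5)
        except requests.RequestException:
            continue            # dead proxy, try another
    return None
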
Scraping in Practice 8: Scraping Weibo content

Fetch the basic profile information of a high-profile (big-V) Weibo account: nickname, profile URL, avatar, following count, follower count, gender, level, and so on.

# -*- coding: utf-8 -*-
import urllib.request
import json
id = '1259110474'
proxy_addr = "122.241.72.191:808"
def use_proxy(url, proxy_addr):
    re...

Original · 2019-05-17 14:20:30 · 4130 views · 1 comment

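The preview cuts off inside use_proxy. A minimal sketch of the usual urllib proxy pattern such code follows; only id and proxy_addr come from the preview, and the mobile Weibo API URL is an assumption:

import json
import urllib.request

def use_proxy(url, proxy_addr):
    """Fetch url through an HTTP proxy and return the decoded body."""
    proxy = urllib.request.ProxyHandler({'http': proxy_addr})
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    urllib.request.install_opener(opener)
    return urllib.request.urlopen(url).read().decode('utf-8')

# The containerid-style mobile API was common in articles of this vintage;
# treat the exact URL as an assumption.
data = json.loads(use_proxy(
    'https://m.weibo.cn/api/container/getIndex?type=uid&value=1259110474',
    '122.241.72.191:808'))
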
Scraping in Practice 7: Scraping novels from Biquge

import requests
from bs4 import BeautifulSoup
import os
if __name__=='__main__':
    # Homepage of the novel to scrape; change this URL for each run,
    # and make sure the local save root exists
    target="https://www.biqubao.com/book/17570/"
    # Local root path for saving the scraped text
    ...

Original · 2019-05-15 20:19:08 · 3962 views · 0 comments

Scraping in Practice 6: Scraping Weibo profile data

from urllib.error import URLError
import urllib.request
import json
id='1259110474'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) "
           ...

Original · 2019-05-14 10:56:11 · 287 views · 0 comments

Scraping in Practice 5: Scraping Baidu Images

import requests
from urllib.error import URLError
import os
import urllib
from urllib.parse import urlencode
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, li...

Original · 2019-05-13 17:32:28 · 500 views · 0 comments

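The urlencode import hints at how the request URLs are built. A minimal sketch of assembling a paged query string; the endpoint and parameter names are assumptions for illustration:

from urllib.parse import urlencode

# Baidu's image search takes its query as GET parameters; the parameter
# set below is an assumption, not confirmed by the preview.
def build_url(keyword, page):
    params = {
        'word': keyword,     # search term
        'pn': page * 30,     # result offset
        'rn': 30,            # results per page
    }
    return 'https://image.baidu.com/search/acjson?' + urlencode(params)

print(build_url(u'风景', 0))
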
Scraping in Practice 28: Multithreaded scraping of Zhihu followee information, saved to MySQL and MongoDB

The previous article showed how to scrape Zhihu followee information; this one is the multithreaded version.

from threading import Thread
from queue import Queue
import requests
import json
# MySQL driver
import pymysql
db = pymysql.connect(host='localhost', user='root', passwor...

Original · 2019-09-03 16:48:18 · 446 views · 0 comments

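A minimal sketch of the Thread-plus-Queue division of labor such a multithreaded version relies on: worker threads pull URLs from a shared queue and exit on a sentinel. The URL list is a placeholder, not the article's Zhihu API:

from queue import Queue
from threading import Thread

import requests

task_q = Queue()

def worker():
    # Pull URLs off the queue until the None sentinel arrives.
    while True:
        url = task_q.get()
        if url is None:
            task_q.task_done()
            break
        resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
        print(url, resp.status_code)
        task_q.task_done()

urls = ['https://httpbin.org/get']          # placeholder work items
threads = [Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for u in urls:
    task_q.put(u)
for _ in threads:
    task_q.put(None)                        # one sentinel per worker
task_q.join()
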
Scraping in Practice 4: Scraping Taobao product listings

Before starting, make sure MongoDB is installed and running correctly.

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expe...

Original · 2019-05-11 16:25:08 · 2545 views · 0 comments

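The truncated import is presumably expected_conditions; together with WebDriverWait it is the standard way to wait for Taobao's dynamically rendered page. A minimal sketch; the CSS selectors are assumptions:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get('https://www.taobao.com')

# Block until the search box is present, type a query, then click search.
box = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#q')))
box.send_keys('iPad')
btn = wait.until(EC.element_to_be_clickable(
    (By.CSS_SELECTOR, '#J_TSearchForm button')))   # selector is an assumption
btn.click()
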
Scraping in Practice 3: Scraping Douban girl pictures

import urllib.request
import bs4
import urllib.error
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"}
def ge...

Original · 2019-05-08 14:41:54 · 970 views · 0 comments

Scraping in Practice 2: Scraping the 2345 movie chart

import requests
import bs4
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/70.0.3538.110 Safari/537.36"}
url =...

Original · 2019-05-07 20:18:22 · 969 views · 0 comments

Scraping in Practice 1: Scraping the Douban Top 250 movies

Scrape the titles of the top 250 movies on Douban's chart.

import urllib.request
import bs4
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/70.0.3538.110...

Original · 2019-05-07 11:27:06 · 2720 views · 2 comments

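Since this is the series opener, here is a compact sketch of the whole pattern using the same urllib + bs4 stack. Douban pages the chart 25 titles at a time via the start parameter; the span.title selector is an assumption about the page markup:

import bs4
import urllib.request

headers = {"User-Agent": "Mozilla/5.0"}

for start in range(0, 250, 25):
    req = urllib.request.Request(
        'https://movie.douban.com/top250?start=%d' % start, headers=headers)
    soup = bs4.BeautifulSoup(urllib.request.urlopen(req).read(), 'html.parser')
    for tag in soup.find_all('span', class_='title'):
        if '/' not in tag.text:          # skip the alternate-language titles
            print(tag.text)
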
Scraping in Practice 20: Scraping Anjuke rental listings

import requests
import bs4
import json
import time
import os
from lxml import etree
class spider(object):
    def __init__(self):
        self.url = "https://bj.zu.anjuke.com/?from=navigation"
        ...

Original · 2019-07-16 11:49:00 · 2718 views · 0 comments

Scraping in Practice 21: Scraping nationwide air-quality data

import requests
import bs4
import time
from lxml import etree
import os
def get_cities_url():
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,imag...

Original · 2019-08-15 16:24:43 · 2791 views · 0 comments

Scraping in Practice 19: Scraping article information from tandfonline.com

import requests
import bs4
import xlwt
import openpyxl
class aticle_title():
    def __init__(self):
        self.url = "https://www.tandfonline.com/action/doSearch?AllField=urban+design&Ppub=%5...

Original · 2019-07-14 17:30:36 · 2164 views · 0 comments

Scraping in Practice 27: Scraping the information of followees' followees

# encoding:utf-8
import requests
import json
# MySQL driver
import pymysql
db = pymysql.connect(host='localhost', user='root', password='123456', port=3306, db='mysql')  # connect to MySQL
cursor = db.curso...

Original · 2019-09-03 15:32:42 · 548 views · 0 comments

Scraping in Practice 26: Scraping Zhihu followee information and saving it to MySQL and MongoDB

# encoding:utf-8
import requests
import json
# MySQL driver
import pymysql
db = pymysql.connect(host='localhost', user='root', password='123456', port=3306, db='mysql')  # connect to MySQL
cursor = db.curso...

Original · 2019-09-03 10:57:43 · 314 views · 0 comments

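A minimal sketch of the dual save path named in the title. The connection parameters echo the preview; the table, collection, and field names are assumptions:

import pymongo
import pymysql

db = pymysql.connect(host='localhost', user='root', password='123456',
                     port=3306, db='mysql')
cursor = db.cursor()
cursor.execute('CREATE TABLE IF NOT EXISTS followees '
               '(name VARCHAR(255), headline VARCHAR(255))')

mongo = pymongo.MongoClient('localhost', 27017)
collection = mongo['zhihu']['followees']

def save(user):
    """Write one followee record to both stores."""
    cursor.execute('INSERT INTO followees (name, headline) VALUES (%s, %s)',
                   (user['name'], user['headline']))
    db.commit()
    collection.insert_one(dict(user))

save({'name': 'test', 'headline': 'hello'})
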
Scraping in Practice 25: Scraping CSDN instructor information with the Scrapy framework

items code:
name = scrapy.Field()
href = scrapy.Field()
students = scrapy.Field()
contents = scrapy.Field()

spider code:
import scrapy
from scdnedu import items
import requests
from lxml impor...

Original · 2019-08-23 14:20:12 · 746 views · 0 comments

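The four Field declarations belong inside a scrapy.Item subclass. A minimal sketch of what that items.py plausibly looks like (the class name is an assumption):

import scrapy

class CsdnEduItem(scrapy.Item):
    """Item wrapping the four fields shown in the preview."""
    name = scrapy.Field()        # instructor name
    href = scrapy.Field()        # course/profile link
    students = scrapy.Field()    # student count
    contents = scrapy.Field()    # course description
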
Scraping in Practice 24: Scraping posts from the Sunshine Petition (阳光问政) platform

import requests
import time
import bs4
from lxml import etree
import gevent
import gevent.monkey
import threading
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image...

Original · 2019-08-22 17:03:44 · 855 views · 0 comments

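The gevent.monkey import is the key: patching the standard library makes blocking socket calls cooperative, so plain requests calls can overlap. A minimal sketch; the page URLs are placeholders:

# Patch the standard library *before* importing requests so its sockets
# become cooperative.
import gevent.monkey
gevent.monkey.patch_all()

import gevent
import requests

def fetch(url):
    print(url, requests.get(url, timeout=10).status_code)

# Hypothetical paged URLs; the real spider builds them from the site's paging.
urls = ['https://httpbin.org/get?page=%d' % p for p in range(5)]
gevent.joinall([gevent.spawn(fetch, u) for u in urls])
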
Scraping in Practice 23: Multithreaded scraping of the 2345 movie chart

import requests
import bs4
import time
from threading import Thread
from queue import Queue
global my_queue
my_queue = Queue()
start_time = time.time()
print(start_time)
class MyThread1(Thread):
    ...

Original · 2019-08-22 16:33:40 · 790 views · 0 comments

Scraping in Practice 18: Simulating a Bilibili login with Selenium

import time
from io import BytesIO
from PIL import Image
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver....

Original · 2019-07-09 14:32:43 · 1615 views · 0 comments

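The BytesIO/PIL/ActionChains imports point at the classic slider-captcha routine: screenshot the puzzle, compute the gap offset by image diffing, then drag the knob in small, human-looking steps. A minimal sketch of the drag stage only; the login URL, the knob selector, and the hard-coded offsets (which the article derives from the screenshots) are assumptions:

import time
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://passport.bilibili.com/login')   # URL is an assumption

# Locate the slider knob (selector is an assumption) and drag it across
# the gap in decreasing steps so the motion looks human.
knob = driver.find_element(By.CSS_SELECTOR, '.geetest_slider_button')
ActionChains(driver).click_and_hold(knob).perform()
for step in [60, 40, 20, 10, 5]:                     # placeholder offsets
    ActionChains(driver).move_by_offset(step, 0).perform()
    time.sleep(0.05)
ActionChains(driver).release().perform()
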
Scraping in Practice 17: Multithreaded AISS app image scraper

# -*- coding: utf-8 -*-
import os
import json
import requests
import time
from multiprocessing import Process, Queue, Pool
class downloadinfo:
    def download_info(self):
        """Download the listing page (containing descriptions of all the ima...

Original · 2019-07-06 09:44:57 · 30129 views · 2 comments

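A minimal sketch of the multiprocessing side: a Pool mapping a download function over image URLs. The URLs and directory are placeholders:

import os
from multiprocessing import Pool

import requests

def download(img_url):
    """Fetch one image and write it under ./imgs (names are illustrative)."""
    os.makedirs('imgs', exist_ok=True)
    data = requests.get(img_url, timeout=10).content
    with open(os.path.join('imgs', img_url.split('/')[-1]), 'wb') as f:
        f.write(data)

if __name__ == '__main__':          # required on Windows for multiprocessing
    urls = ['https://example.com/a.jpg', 'https://example.com/b.jpg']
    with Pool(4) as pool:
        pool.map(download, urls)
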
Scraping in Practice 16: Scraping Zhihu followee information

import requests
import json
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.wait import WebD...

Original · 2019-06-18 21:39:37 · 1022 views · 0 comments

Scraping in Practice 22: Multithreaded scraping of Douban girl pictures

from threading import Thread
import urllib.request
import bs4
import urllib.error
from queue import Queue
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like ...

Original · 2019-08-20 10:09:53 · 755 views · 0 comments