
Web Scraping
Average article quality score: 92
小基基o_O
GitHub:https://github.com/AryeYellow
码云:https://gitee.com/arye
Scrapy: concise development steps
1. Environment: Win10, Anaconda, PyCharm, Scrapy 1.5.0. 2. Create the crawler project: in PyCharm's Terminal run `cd <path>`, `scrapy startproject <project name>`, `cd <project name>`, `scrapy genspider example example.com`. 3. Edit spider/example.py: change the spid... (preview truncated) Original · 2018-06-17 22:17:37 · 854 views · 0 comments
Python requests wrapper for personal use
Catalog: basics · function version · object-oriented version. Basics: the requests module. Function version: import requests, random, time; ymd = time.strftime('%Y-%m-%d'); # User-Agent pool; ua = ['Mozilla/5.0(compatible;MSIE9.0;WindowsNT6.1;Trident/5.0;',  # IE9.0 ... (preview truncated) Original · 2018-07-30 11:49:38 · 3050 views · 0 comments
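The wrapper's core trick, rotating the User-Agent per request, can be sketched with the standard library alone. This is a minimal sketch; the UA strings are illustrative values in the style of the post's list, not its exact pool:

```python
import random

# Small pool of User-Agent strings (illustrative values)
UA_POOL = [
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',  # IE 9.0
    'Opera/8.0 (Windows NT 5.1; U; en)',                                # Opera 8.0
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36',           # Chrome-like
]

def random_headers():
    """Build a headers dict with a randomly chosen User-Agent."""
    return {'User-Agent': random.choice(UA_POOL)}

headers = random_headers()
print(headers['User-Agent'])
```

Passing `random_headers()` as the `headers=` argument of `requests.get` keeps repeated requests from all advertising the same client.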
Python: scraping latitude and longitude
Scrapes Baidu Map's coordinate-picking system. from selenium import webdriver; from selenium.webdriver.common.by import By; from selenium.webdriver.support.ui import WebDriverWait; from selenium.webdriver.support import expected_conditi... (preview truncated) Original · 2018-07-22 17:34:52 · 7152 views · 0 comments
Scraping the like counts of my own CSDN blog
# Page through and collect all article links; import requests, re, math; url = 'https://me.youkuaiyun.com/yellow_python'; r = requests.get(url, headers={'User-Agent': 'Opera/8.0 (Windows NT 5.1; U; en)'}).text; articles = re.search('<li... (preview truncated) Original · 2018-08-18 21:48:19 · 823 views · 0 comments
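The post's two building blocks, computing how many pages to walk with `math.ceil` and pulling article links out with a regex, can be exercised offline. The HTML fragment and article counts below are illustrative, not CSDN's real markup:

```python
import math
import re

# Illustrative HTML fragment standing in for a CSDN article list
html = '''
<li><a href="/yellow_python/article/details/81240395">Post A</a></li>
<li><a href="/yellow_python/article/details/80715865">Post B</a></li>
'''

# Extract every article link from the list markup
links = re.findall(r'href="(/yellow_python/article/details/\d+)"', html)
print(links)

# Hypothetical: 43 articles at 20 per page -> 3 pages to crawl
pages = math.ceil(43 / 20)
print(pages)
```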
Crawler performance
Multithreading + non-blocking async. from gevent import monkey; monkey.patch_all(); import requests, gevent, time; # URLs to visit; def get_urls(): jd_url = 'https://search.jd.com/Search?keyword=%E7%88%AC%E8%99%AB&enc=utf-8&... (preview truncated) Original · 2018-09-07 10:46:51 · 723 views · 2 comments
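The post's point, that concurrent fetching beats sequential fetching, can be demonstrated without gevent or the network by replacing each request with a short sleep. This stdlib `ThreadPoolExecutor` sketch is a stand-in for the gevent version, not the post's code:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(url):
    """Stand-in for requests.get: each 'request' blocks for 0.2 s."""
    time.sleep(0.2)
    return url

urls = ['url%d' % i for i in range(5)]

# Sequential: roughly 5 * 0.2 s
start = time.time()
seq = [fake_fetch(u) for u in urls]
seq_elapsed = time.time() - start

# Concurrent: roughly 0.2 s, since the threads block in parallel
start = time.time()
with ThreadPoolExecutor(max_workers=5) as pool:
    conc = list(pool.map(fake_fetch, urls))
conc_elapsed = time.time() - start

print('sequential %.2fs, concurrent %.2fs' % (seq_elapsed, conc_elapsed))
```

gevent achieves the same overlap with cooperative greenlets instead of OS threads; for I/O-bound crawling both approaches hide the per-request wait.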
Saving scraped images and text in order
Saving images and text. from time import time, sleep; from selenium.webdriver import Chrome; from requests import get; from urllib.parse import urlsplit, urljoin; import os, re; PATH = 'DOWNLOAD/'; class Driver: """网页... (preview truncated) Original · 2019-07-18 15:03:42 · 644 views · 0 comments
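Two details this approach relies on, resolving relative image URLs with `urljoin` and numbering files so they sort in page order, can be sketched offline. The page URL and helper names below are hypothetical:

```python
from urllib.parse import urljoin, urlsplit
import os

PATH = 'DOWNLOAD/'  # output directory, as in the post

def resolve(page_url, src):
    """Turn a possibly-relative <img src> into an absolute URL."""
    return urljoin(page_url, src)

def ordered_name(index, img_url):
    """Zero-padded index: alphabetical order matches download order."""
    ext = os.path.splitext(urlsplit(img_url).path)[1] or '.jpg'
    return '%s%03d%s' % (PATH, index, ext)

page = 'https://example.com/posts/42'
print(resolve(page, '/static/a.png'))
print(ordered_name(7, 'https://example.com/static/a.png'))
```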
Scraping and executing this CSDN post's own code
Catalog: scraping [the code that scrapes the content] · the code that scrapes [the content] · content · basics. import requests, re, pandas as pd; a = '''def d(): url = 'https://blog.youkuaiyun.com/Yellow_python/article/details/81240395'; header = {... (preview truncated) Original · 2018-07-30 00:17:30 · 557 views · 0 comments
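The scrape-then-run idea reduces to extracting a code block from HTML and handing it to `exec`. Here is a self-contained sketch on an illustrative snippet (no network; the markup and the embedded code are hypothetical):

```python
import re

# Illustrative HTML with an embedded code block
html = '<article><pre class="code">answer = 6 * 7</pre></article>'

# Pull out the code text between the <pre> tags
code = re.search(r'<pre class="code">(.*?)</pre>', html, re.S).group(1)

# Execute it in an isolated namespace rather than globals()
ns = {}
exec(code, ns)
print(ns['answer'])  # the scraped code defined this name
```

Running scraped code is inherently risky; confining it to its own namespace (or a sandboxed process) limits what it can touch.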
Scraping the content of this CSDN post
Shows the same sample text rendered three ways (code block, Tab-indented, plain), using the Song-dynasty poem 如梦令: 常记溪亭日暮,沉醉不知归路,兴尽晚回舟,误入藕花深处。争渡,争渡,惊起一滩鸥鹭。... Original · 2018-07-18 23:40:56 · 536 views · 0 comments
Python scraping with XPath and bs4
Catalog: xml · XPath in Scrapy. xml: web_data = '''</head> <body... (preview truncated) Original · 2018-07-01 00:14:05 · 1077 views · 0 comments
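lxml's full XPath engine isn't in the standard library, but `xml.etree.ElementTree` understands a useful XPath subset, enough to show the selection style this post uses. The sample markup below is illustrative and well-formed:

```python
import xml.etree.ElementTree as ET

web_data = '''
<body>
  <ul>
    <li><a href="/p/1">first</a></li>
    <li><a href="/p/2">second</a></li>
  </ul>
</body>
'''

root = ET.fromstring(web_data)

# XPath-subset query: every <a> that is a child of an <li>
links = [(a.get('href'), a.text) for a in root.findall('.//li/a')]
print(links)
```

For real-world (often malformed) HTML, lxml's `etree.HTML` or BeautifulSoup is more forgiving; ElementTree requires well-formed input.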
Python 3 scraping with urllib
urllib.request.urlopen: import urllib.request; # open a page, returning an HTTPResponse object; response = urllib.request.urlopen('http://www.baidu.com'); # status code; print(response.getcode()); # URL actually fetched; print(response.geturl... (preview truncated) Original · 2018-06-25 14:43:19 · 411 views · 0 comments
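Beyond `urlopen`, `urllib.request.Request` lets you attach headers before sending. Constructing the object involves no network I/O, so the header plumbing can be checked offline; the User-Agent value is just an example:

```python
import urllib.request

req = urllib.request.Request(
    'http://www.baidu.com',
    headers={'User-Agent': 'Opera/8.0 (Windows NT 5.1; U; en)'},
)

print(req.full_url)                  # the URL the request targets
print(req.get_header('User-agent'))  # urllib capitalizes stored header names
# To actually fetch: response = urllib.request.urlopen(req)
```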
Python 3 scraping: the requests module
requests.request: import requests; # send a request, returning a Response object; response = requests.request('GET', 'http://www.baidu.com'); # status code; print(response.status_code); # cookies; print(response.cookies); # encoding (often guessed wrong and usually needs to be set manually); p... (preview truncated) Original · 2018-06-25 15:00:40 · 978 views · 0 comments
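requests can also build a request without sending it, which is handy for inspecting what would go on the wire. A small sketch, assuming the requests package is installed (the URL and params are illustrative):

```python
import requests

# Build (but don't send) a request to see the final URL and method
req = requests.Request('GET', 'http://www.baidu.com/s', params={'wd': 'spider'})
prepared = req.prepare()

print(prepared.method)  # 'GET'
print(prepared.url)     # params are encoded into the query string
# To send it: requests.Session().send(prepared)
```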
Python Selenium browser automation: typing and clicking
Simulates a Taobao search. from selenium import webdriver; from selenium.webdriver.common.by import By; from selenium.webdriver.support.ui import WebDriverWait; from selenium.webdriver.support import expected_conditions as... (preview truncated) Original · 2018-06-30 17:34:56 · 5080 views · 0 comments
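Selenium needs a live browser, but the heart of `WebDriverWait(driver, timeout).until(condition)` is plain Python: poll a condition until it returns something truthy or a timeout expires. This library-free stand-in mirrors that loop (all names here are hypothetical, not Selenium's API):

```python
import time

class TimeoutException(Exception):
    pass

def wait_until(condition, timeout=10.0, poll=0.05):
    """Poll `condition` until truthy, like WebDriverWait.until."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutException('condition not met within %.1fs' % timeout)

# Simulate an element that "appears" only after a few polls
state = {'tries': 0}
def element_present():
    state['tries'] += 1
    return 'element' if state['tries'] >= 3 else None

print(wait_until(element_present, timeout=1.0))
```

Explicit waits like this are why the post's code avoids fixed `sleep()` calls: the loop returns as soon as the element shows up instead of always paying the worst-case delay.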
Python Selenium: scraping Taobao products
from urllib import parse; from selenium import webdriver; from selenium.webdriver.common.by import By; from selenium.webdriver.support.ui import WebDriverWait; from selenium.webdriver.support import expec... (preview truncated) Original · 2018-06-30 22:56:57 · 482 views · 0 comments
Python: scraping asynchronously loaded data
Example target: http://www.runoob.com/ajax/ajax-database.html. import requests; from lxml import etree; # browser disguise; ua = 'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko'; header = {"User-Agent... (preview truncated) Original · 2018-06-28 23:24:58 · 4709 views · 0 comments
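Asynchronously loaded pages usually fetch JSON from an XHR endpoint; once you've found that endpoint in the browser's DevTools Network tab, the payload parses with the stdlib json module. The payload below is an illustrative stand-in for such a response, not the runoob demo's actual data:

```python
import json

# Illustrative JSON body, standing in for an XHR response
body = ('{"customers": ['
        '{"name": "Alfreds", "city": "Berlin"}, '
        '{"name": "Wolski", "city": "Walla"}]}')

data = json.loads(body)
names = [c['name'] for c in data['customers']]
print(names)
```

Requesting the JSON endpoint directly is usually simpler than parsing the HTML that JavaScript renders from it.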