Repost: SQL kaggle learn: WHERE AND
WHERE trip_start_timestamp BETWEEN '2017-01-01' AND '2017-07-01' AND trip_seconds > 0 AND trip_miles > 0
WHERE trip_start_timestamp > '2017-01-01' AND trip_start_timestamp <...
2019-04-04 16:49:00
229
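A minimal sketch of running those filters with the BigQuery Python client, the way the Kaggle course does; the table name is taken from the next entry below, and the client setup and column choice are assumptions:

    from google.cloud import bigquery

    client = bigquery.Client()
    query = """
        SELECT trip_start_timestamp, trip_seconds, trip_miles
        FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
        WHERE trip_start_timestamp BETWEEN '2017-01-01' AND '2017-07-01'
          AND trip_seconds > 0
          AND trip_miles > 0
        """
    df = client.query(query).to_dataframe()  # run the query and load the result into pandas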
Repost: SQL kaggle learn WITH ... AS exercise
rides_per_year_query = """
    SELECT EXTRACT(YEAR FROM trip_start_timestamp) AS year,
           COUNT(unique_key) AS num_trips
    FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
    GROUP BY year
    ORD...
2019-04-04 11:37:00
258
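Since the title is about WITH ... AS, here is a hedged sketch of the same aggregation wrapped in a CTE; the trailing ORDER BY is an assumption, because the excerpt above is cut off at ORD...

    rides_per_year_query = """
        WITH trips AS (
            SELECT EXTRACT(YEAR FROM trip_start_timestamp) AS year,
                   unique_key
            FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
        )
        SELECT year, COUNT(unique_key) AS num_trips
        FROM trips
        GROUP BY year
        ORDER BY year
        """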
Repost: SQL COUNT(1)
If you are ever unsure what to put inside a COUNT() aggregation, you can do COUNT(1) to count the rows in each group. Most people find it especially readable, because we know it's not focusing ...
2019-04-02 20:58:00
257
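A small illustration of the difference, reusing the taxi table from the entries above (the payment_type and tips columns are assumed): COUNT(1) counts every row in the group, while COUNT(column) skips rows where that column is NULL.

    count_query = """
        SELECT payment_type,
               COUNT(1)    AS num_rows,       -- all rows in the group
               COUNT(tips) AS num_with_tips   -- only rows where tips IS NOT NULL
        FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
        GROUP BY payment_type
        """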
Repost: string values in a SQL WHERE clause need quotes ' '
WHERE year >= 2010 AND year <= 2017 AND indicator_code = 'SE.XPD.TOTL.GD.ZS'
Reposted from: https://www.cnblogs.com/bamboozone/p/10644973.html
2019-04-02 20:09:00
219
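The rule being remembered here: numeric literals go bare, string literals need single quotes. A hedged sketch against the World Bank education table this indicator_code appears to come from; the exact table name is an assumption:

    query = """
        SELECT country_name, year, value
        FROM `bigquery-public-data.world_bank_intl_education.international_education`
        WHERE year >= 2010 AND year <= 2017            -- numbers: no quotes
          AND indicator_code = 'SE.XPD.TOTL.GD.ZS'     -- strings: quotes required
        """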
Repost: kaggle learn python
def has_lucky_number(nums):
    return any([num % 7 == 0 for num in nums])

def menu_is_boring(meals):
    """Given a list of meals served over some period of time, return True if the...
2019-03-24 21:47:00
330
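A quick usage sketch of the any() idiom in the first function; the test values are made up, and a generator expression works just as well as the list comprehension:

    def has_lucky_number(nums):
        # any() short-circuits on the first number divisible by 7
        return any(num % 7 == 0 for num in nums)

    print(has_lucky_number([5, 14, 3]))   # True  (14 is divisible by 7)
    print(has_lucky_number([1, 2, 3]))    # False
    print(has_lucky_number([]))           # False (any() of an empty iterable)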
Repost: pandas
df = reviews.loc[:99, ['country','variety']]  or  df = reviews.loc[[1,2,3,4], ['country','variety']]
df = reviews.loc[[0,1,10,100], ['country','province','region_1','region_2']]
The two axes cannot be swapped: the row index always comes first. iloc...
2019-03-24 19:17:00
100
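A hedged sketch of the loc/iloc point the note is making, with a tiny stand-in DataFrame: rows always come first and columns second, .loc is label-based and end-inclusive, .iloc is position-based and end-exclusive.

    import pandas as pd

    reviews = pd.DataFrame({'country': ['Italy', 'Portugal', 'US'],
                            'variety': ['White Blend', 'Red Blend', 'Pinot Gris'],
                            'points': [87, 87, 90]})

    by_label    = reviews.loc[:1, ['country', 'variety']]   # rows labelled 0..1, inclusive
    by_position = reviews.iloc[:2, [0, 1]]                   # first two rows, columns 0 and 1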
Repost: cracking the 实习僧 (shixiseng) font obfuscation
1. https://www.hitoy.org/tool/file_base64.php: convert the base64 string to a file, choosing ttf as the generated file format
2. https://fontdrop.info/: upload the ttf; hovering over a glyph shows every obfuscated character and its mapping
Reposted from: https://www.cnblogs.com/bamboozone/p/10555027.html...
2019-03-18 21:13:00
459
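A hedged local equivalent of those two web tools, assuming the obfuscated font arrives as a base64 string (font_b64 is a placeholder) and that fontTools is installed; the font's cmap maps each character code to the glyph that actually gets drawn:

    import base64
    from fontTools.ttLib import TTFont

    font_b64 = "..."  # placeholder: the base64 font string pulled from the page's CSS/JS

    with open("secret.ttf", "wb") as f:
        f.write(base64.b64decode(font_b64))

    font = TTFont("secret.ttf")
    for code, glyph_name in font.getBestCmap().items():
        print(hex(code), glyph_name)   # inspect which real character each code maps to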
Repost: cookiejar
Reference: https://www.cnblogs.com/why957/p/9297779.html. The article introduces four ways of simulating a login. yield Request() hands a new request back to the crawler to execute. For cookie handling when sending requests, meta={'cookiejar': 1} turns on cookie tracking and is written in the first Request(); on subsequent requests use meta={'cookiejar': response...
2019-03-09 11:54:00
808
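A minimal sketch of the cookiejar meta key described above (the URLs are placeholders): the first request opens jar 1, and later requests forward the same jar so the session cookies follow along.

    import scrapy

    class CookiejarSpider(scrapy.Spider):
        name = 'cookiejar_demo'

        def start_requests(self):
            yield scrapy.Request('https://example.com/login',
                                 meta={'cookiejar': 1},          # open cookie jar 1
                                 callback=self.after_login)

        def after_login(self, response):
            yield scrapy.Request('https://example.com/profile',
                                 meta={'cookiejar': response.meta['cookiejar']},  # reuse the jar
                                 callback=self.parse_profile)

        def parse_profile(self, response):
            self.logger.info(response.text[:200])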
Repost: 煎蛋 ooxx
pipeline.py
class Jiandanline(FilesPipeline):
    def get_media_requests(self, item, info):
        for file_url in item['file_urls']:
            yield scrapy.Request(file_url)
    de...
2019-03-08 20:04:00
391
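For a FilesPipeline like the one above to run at all, settings.py also has to register it and point at a storage directory; a hedged sketch, where the module path and folder are assumptions:

    # settings.py
    ITEM_PIPELINES = {
        'myproject.pipelines.Jiandanline': 300,   # assumed module path to the class above
    }
    FILES_STORE = './downloads'                   # where FilesPipeline saves the fetched files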
Repost: notes on the pitfalls in my head while learning to write a pipeline
I used FilesPipeline, but pointed the storage setting at the images path. Testing against the 煎蛋 ooxx front page, the shell returned plenty of list entries, but the actual crawl kept returning only one item. Very annoying; I tested again and again and it just would not work, and only later realised the front page had refreshed and genuinely had only one entry.... If def file_path is written badly, its files get filtered out by def item_completed as invalid; file_path only writes a path name, just a path name...
2019-03-08 15:58:00
209
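A hedged sketch of the two methods the note is about: file_path only returns a relative path string under FILES_STORE, and item_completed only sees successful downloads, so a broken file_path shows up as items with no files. The file_paths item field is an illustrative assumption:

    import os
    from scrapy.exceptions import DropItem
    from scrapy.pipelines.files import FilesPipeline

    class JiandanFilesPipeline(FilesPipeline):
        def file_path(self, request, response=None, info=None, *, item=None):
            # just a relative path name, nothing more
            return 'full/' + os.path.basename(request.url)

        def item_completed(self, results, item, info):
            # results is a list of (ok, file_info_or_failure) tuples
            paths = [x['path'] for ok, x in results if ok]
            if not paths:
                raise DropItem('no files downloaded')
            item['file_paths'] = paths
            return item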
Repost: scrapy flow diagram
Refer: https://blog.yongli1992.com/2015/02/08/python-scrapy-module/, which shows the Scrapy architecture diagram. The Scrapy Engine drives the whole run. The Scheduler schedules the URLs to be visited. The Downloader fetches responses from the network. The Spider analyses the responses, parses out the data we want from them, and is also responsible for finding the follow-up URLs to visit...
2019-03-08 13:36:00
139
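In code that division of labour stays mostly invisible: the spider only yields items and new requests, and the engine, scheduler and downloader handle everything else. A minimal sketch, using the quotes.toscrape.com sandbox as a stand-in site:

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = 'quotes'
        start_urls = ['https://quotes.toscrape.com/']

        def parse(self, response):
            # parsed data goes back to the engine as items
            for quote in response.css('div.quote'):
                yield {'text': quote.css('span.text::text').get()}
            # follow-up URLs go back to the engine and on to the scheduler
            next_page = response.css('li.next a::attr(href)').get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)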
Repost: rewriting the pipeline
Why override the method get_media_requests, and what is the difference between the two?
def get_media_requests(self, item, info):  # the original
    return [Request(x) for x in item.get(self.images_urls_field, [])]
def get_media_requests(self, ...
2019-03-08 13:30:00
164
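One common reason to override it, sketched here as an assumption rather than as what the original post did: pass the item along in meta so a custom file_path can build per-item file names (the title field is hypothetical):

    import scrapy
    from scrapy.pipelines.images import ImagesPipeline

    class NamedImagesPipeline(ImagesPipeline):
        def get_media_requests(self, item, info):
            for url in item.get(self.images_urls_field, []):
                yield scrapy.Request(url, meta={'item': item})   # carry the item along

        def file_path(self, request, response=None, info=None, *, item=None):
            title = request.meta['item'].get('title', 'unnamed')  # hypothetical item field
            return f"{title}/{request.url.split('/')[-1]}"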
Repost: super()
From https://mozillazg.com/2016/12/python-super-is-not-as-simple-as-you-thought.html  # this author is really good
In single inheritance super works the way everyone expects: it is mainly used to call methods of the parent class.
class A:
    def __init__(self):
        self.n = 2
    ...
2019-03-07 10:15:00
207
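The article's point is that with multiple inheritance super() follows the MRO rather than simply "the parent"; a small sketch in the same spirit (the classes and increments are made up):

    class A:
        def __init__(self):
            self.n = 2

    class B(A):
        def __init__(self):
            super().__init__()   # follows the MRO, not simply "call A"
            self.n += 3

    class C(A):
        def __init__(self):
            super().__init__()
            self.n += 4

    class D(B, C):
        def __init__(self):
            super().__init__()   # runs B.__init__ -> C.__init__ -> A.__init__
            self.n += 5

    print(D.__mro__)   # D, B, C, A, object
    print(D().n)       # 2 + 4 + 3 + 5 = 14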
Repost: os.path.join
os.path.join(): joining restarts at an argument that begins with "/", and every argument before it is discarded; this rule takes precedence. Given that, an argument that begins with "./" does not reset anything; it is simply appended after the previous argument.
import os
print("1:", os.path.join('aaaa', '/bbbb', 'ccccc.txt'))
print("2:...
2019-03-06 21:51:00
120
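A quick check of that rule on a POSIX system; the second and third calls are assumed variations in the spirit of the truncated example, with the results shown in the comments:

    import os

    print("1:", os.path.join('aaaa', '/bbbb', 'ccccc.txt'))    # 1: /bbbb/ccccc.txt   ('/bbbb' resets the join)
    print("2:", os.path.join('aaaa', './bbbb', 'ccccc.txt'))   # 2: aaaa/./bbbb/ccccc.txt
    print("3:", os.path.join('aaaa', 'bbbb', 'ccccc.txt'))     # 3: aaaa/bbbb/ccccc.txt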
Repost: scrapy item pipeline
item pipeline
process_item(self, item, spider)  # the method every pipeline must have; the concrete handling is written inside it, and other methods can be added as well
open_spider(self, spider)  This method is called when the spider is opened.
close...
2019-03-05 21:05:00
252
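A hedged skeleton putting those three hooks together; writing items to a JSON-lines file is just an illustrative choice:

    import json

    class JsonWriterPipeline:
        def open_spider(self, spider):
            self.file = open('items.jl', 'w', encoding='utf-8')

        def process_item(self, item, spider):
            # the one required method: validate/transform/store, then return the item
            self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
            return item

        def close_spider(self, spider):
            self.file.close()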
Repost: the process of learning to use the scrapy item pipeline
At first I did not understand it at all. From https://www.jianshu.com/p/18ec820fe706 I found a fairly complete example to borrow from, then wrote my own 煎蛋 pipeline. First, create the fields in items:
image_urls = scrapy.Field()  #
images = scrapy.Field()  # these two are required
image_paths = sc...
2019-03-05 20:16:00
106
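A hedged version of that items.py, with the field names from the excerpt; image_urls and images are the two fields ImagesPipeline expects by default:

    import scrapy

    class JiandanItem(scrapy.Item):
        image_urls = scrapy.Field()   # input: the list of image URLs for the pipeline
        images = scrapy.Field()       # output: download results filled in by the pipeline
        image_paths = scrapy.Field()  # optional extra field set by a custom item_completed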
Repost: dygod.net
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class DgSpider(CrawlSpider):
    name = 'dg'
    # a...
2019-03-03 10:08:00
3424
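A hedged sketch of how such a CrawlSpider is usually filled out; the domain comes from the title, but the link patterns and the parse logic are assumptions, not the original post's rules:

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class DgSpider(CrawlSpider):
        name = 'dg'
        allowed_domains = ['dygod.net']
        start_urls = ['https://www.dygod.net/']

        rules = (
            Rule(LinkExtractor(allow=r'index.*\.html')),                      # follow listing pages
            Rule(LinkExtractor(allow=r'/\d+\.html'), callback='parse_item'),  # parse detail pages
        )

        def parse_item(self, response):
            yield {'title': response.css('title::text').get(), 'url': response.url}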
Repost: https://scrapingclub.com/exercise/detail_sign/
def parse(self, response):
    # pattern1 = re.compile('token=(.*?);')
    # token = pattern1.findall(response.headers.getlist("set-cookie")[1].decode("utf-8"))[0]
    patt...
2019-03-02 11:21:00
188
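A hedged sketch of the idea in those commented-out lines: pull the token out of a Set-Cookie header and send it back as a cookie on the follow-up request. Which header position holds the token, and what the detail page contains, are assumptions:

    import re
    import scrapy

    class DetailSignSpider(scrapy.Spider):
        name = 'detail_sign'
        start_urls = ['https://scrapingclub.com/exercise/detail_sign/']

        def parse(self, response):
            set_cookie = response.headers.getlist('Set-Cookie')[0].decode('utf-8')
            token = re.search('token=(.*?);', set_cookie).group(1)
            yield scrapy.Request(response.url,
                                 cookies={'token': token},   # send the extracted token back
                                 callback=self.parse_detail,
                                 dont_filter=True)

        def parse_detail(self, response):
            yield {'text': response.css('div.card-body *::text').getall()}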
Repost: https://scrapingclub.com/exercise/basic_captcha/
def parse(self, response):
    # set_cookies = response.headers.getlist("set-cookie").decode("utf-8")
    pattern1 = re.compile('csrftoken=(.*?);')
    pattern2 = re.compil...
2019-03-01 16:52:00
626
Repost: https://scrapingclub.com/exercise/basic_login/
Problems I ran into: csrftoken and cfduid live in request.headers, and I kept looking for how to get the request header inside scrapy. From scrapy shell, fetch and then request.headers gives the right content, but in a scrapy project I did not know how to write it. Online I found response.request.headers, and that way of writing...
2019-03-01 11:21:00
303
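The missing piece, sketched: inside a callback the request that produced a response hangs off response.request, so its headers (including the Cookie header, when cookies are attached) are reachable like this:

    import scrapy

    class HeaderSpider(scrapy.Spider):
        name = 'basic_login_headers'
        start_urls = ['https://scrapingclub.com/exercise/basic_login/']

        def parse(self, response):
            self.logger.info(response.request.headers)                # headers of the request that was sent
            self.logger.info(response.request.headers.get('Cookie'))  # None if no cookie was attached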
Repost: Python scrapy - Login Authenication Issue
https://stackoverflow.com/questions/37841409/python-scrapy-login-authenication-issue
from scrapy.crawler import CrawlerProcess
import scrapy
from scrapy.http import Request

class FirstS...
2019-03-01 10:44:00
161
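The usual pattern behind answers like that one is FormRequest.from_response, which copies the form's hidden fields (such as csrfmiddlewaretoken) from the page itself; a hedged sketch with placeholder credentials and URL:

    import scrapy

    class LoginSpider(scrapy.Spider):
        name = 'form_login'
        start_urls = ['https://scrapingclub.com/exercise/basic_login/']

        def parse(self, response):
            yield scrapy.FormRequest.from_response(
                response,
                formdata={'username': 'user', 'password': 'pass'},   # placeholders
                callback=self.after_login,
            )

        def after_login(self, response):
            if 'logout' in response.text.lower():
                self.logger.info('logged in')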
Repost: https://scrapingclub.com/exercise/detail_cookie/
def parse(self, response):
    pattern = re.compile('token=(.*?);')
    token = pattern.findall(response.headers.get("set-cookie").decode("utf-8"))[0]
    cookie = {
        ...
2019-02-27 14:47:00
349
Repost: scrapy: get cookie from response
scrapy shell
fetch('your_url')
response.headers.getlist("Set-Cookie")
https://stackoverflow.com/questions/46543143/scrapy-get-cookies-from-response-request-headers
response.headers returns...
2019-02-27 10:04:00
527
Repost: css selectors tips
From https://saucelabs.com/resources/articles/selenium-tips-css-selectors ...
2019-02-24 18:30:00
410
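A few of the selector patterns that article covers, tried on a tiny made-up snippet with scrapy's Selector:

    from scrapy import Selector

    sel = Selector(text='<ul class="products"><li><a href="/exercise/a.jpg">A</a></li>'
                        '<li><a href="/other/b.png">B</a></li></ul>')

    sel.css('ul.products > li')                    # direct children: both <li> elements
    sel.css('a[href^="/exercise/"]::text').get()   # attribute starts-with -> 'A'
    sel.css('a[href$=".png"]::text').get()         # attribute ends-with   -> 'B'
    sel.css('li:nth-child(2) a::text').get()       # positional            -> 'B'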
Repost: a CSS selection question
<div class="col-lg-4 col-md-6 mb-4">
  <div class="card">
    <a href="/exercise/list_basic_detail/90008-E/">
      <img class="card-img-top img-fluid" src="/static/img/90008-E.jpg"...
2019-02-23 19:32:00
165
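A hedged sketch of pulling the link and the image out of that markup with scrapy selectors; the closing tags are assumed, since the excerpt is cut off:

    from scrapy import Selector

    html = '''
    <div class="col-lg-4 col-md-6 mb-4">
      <div class="card">
        <a href="/exercise/list_basic_detail/90008-E/">
          <img class="card-img-top img-fluid" src="/static/img/90008-E.jpg">
        </a>
      </div>
    </div>'''

    sel = Selector(text=html)
    print(sel.css('div.card a::attr(href)').get())   # /exercise/list_basic_detail/90008-E/
    print(sel.css('div.card img::attr(src)').get())  # /static/img/90008-E.jpg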
Repost: extracting data from js
<script language="JavaScript" type="text/javascript+gk-onload">
    SKART = (SKART) ? SKART : {};
    SKART.analytics = SKART.analytics || {};
    SKART.analytics["category"] = "tele...
2019-02-21 12:35:00
1106
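One hedged way to pull such values out of an inline script: select the <script> text and regex the assignment out. The HTML here is a stand-in, and the category value is a placeholder because the excerpt above is truncated:

    import re
    from scrapy import Selector

    html = '''<script language="JavaScript" type="text/javascript+gk-onload">
        SKART = (SKART) ? SKART : {};
        SKART.analytics = SKART.analytics || {};
        SKART.analytics["category"] = "some-category";
    </script>'''

    script = Selector(text=html).xpath('//script/text()').get()
    category = re.search(r'SKART\.analytics\["category"\]\s*=\s*"(.*?)"', script).group(1)
    print(category)   # some-category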
Repost: materials
http://interactivepython.org/runestone/static/pythonds/index.html
https://blog.michaelyin.info/scrapy-exercises-make-you-prepared-for-web-scraping-challenge/
https://scrapingclub.com/
https://...
2019-02-21 09:00:00
207
Repost: xpath, css
https://docs.scrapy.org/en/latest/intro/tutorial.html
xpath: @ selects an attribute, . selects relative to the current node, // selects at any depth
/bookstore/book[position()<3] selects the first two book elements that are children of bookstore
css: span.text::text
response.css("...
2019-02-13 20:32:00
100
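A small side-by-side of those bits of syntax, using scrapy selectors on a made-up XML stand-in:

    from scrapy import Selector

    sel = Selector(text='''
        <bookstore>
          <book category="web"><title>One</title></book>
          <book category="db"><title>Two</title></book>
          <book category="web"><title>Three</title></book>
        </bookstore>''', type='xml')

    sel.xpath('//book/@category').getall()                  # @ : attribute values -> ['web', 'db', 'web']
    sel.xpath('/bookstore/book[position()<3]')              # the first two book children of bookstore
    sel.xpath('.//title/text()').getall()                   # // and . : any depth, relative to the current node
    sel.css('book[category="web"] title::text').getall()    # css equivalent with ::text -> ['One', 'Three']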
Repost: chromedriver full screen, paging, errors
from selenium import webdriver
from selenium.common.exceptions import TimeoutException, StaleElementReferenceException
from selenium.webdriver.common.by import By
from selenium.webdriver.sup...
2019-01-29 14:51:00
248
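A hedged sketch along the lines of those imports: maximise the window, page through results, and tolerate the two imported exceptions. The URL, the item selector and the next-page link text are placeholders:

    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException, StaleElementReferenceException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()
    driver.maximize_window()                      # "full screen"
    driver.get('https://example.com/list')        # placeholder

    wait = WebDriverWait(driver, 10)
    while True:
        try:
            items = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.item')))
            print([i.text for i in items])
            wait.until(EC.element_to_be_clickable((By.LINK_TEXT, '下一页'))).click()   # "paging"
        except (TimeoutException, StaleElementReferenceException):
            break                                  # no next page, or the DOM refreshed mid-read

    driver.quit()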
Repost: the road of learning python with Pycharm
If a module is grey after import, it has never been referenced. If lxml cannot be found, reinstall it from the anaconda prompt: pip uninstall lxml, then install again. When using requests, if the regex you wrote cannot parse the page correctly, print the page first and then write the regex. pyquery's attr() returning nothing useful is because it only returns the value from the first match; see https://www.cnblogs.com/airnew/p/10056551...
2019-01-29 14:01:00
349
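On the last point, a hedged pyquery sketch: .attr() only reads the first matched element, so to read them all you iterate with .items(); the HTML is a made-up stand-in:

    from pyquery import PyQuery as pq

    doc = pq('<div><a href="/a">A</a><a href="/b">B</a></div>')

    print(doc('a').attr('href'))                        # /a  -- only the first match
    print([a.attr('href') for a in doc('a').items()])   # ['/a', '/b'] -- every match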