Preface
Zhihu may no longer be what it once was, but its user base is still very broad. My plan was to write a crawler that collects users' location, education, major, and similar information, persist it to a database, and finally build a web service that displays the results as charts.
The ECharts map part still needs work, though. Even after some debugging, the result is not quite what I hoped. Embarrassing (⊙﹏⊙)b
Project Setup
As mentioned in the preface, quite a few technologies come together here.
Here is a quick look at the project tree.
```
C:\Users\biao\Desktop\network\code\zhihu-range>tree . /f
文件夹 PATH 列表
卷序列号为 E0C6-0F15
C:\USERS\BIAO\DESKTOP\NETWORK\CODE\ZHIHU-RANGE
│  dbhelper.py
│  scheduler.py
│  spider.py
│  zhihu.db
│  __init__.py
│
├─web
│  │  service.py
│  │  __init__.py
│  │
│  ├─static
│  │      china.js
│  │      echarts.js
│  │      echarts.min.js
│  │      jquery-2.2.4.min.js
│  │
│  └─templates
│          index.html
│
└─__pycache__
        dbhelper.cpython-36.pyc
        spider.cpython-36.pyc
```
Modules
Next, let's implement each small module, one piece at a time.
Crawler
The crawler has a few points worth noting.
- The authorization request header: without it, the API returns no data.
- Request-rate control: adding a random delay between requests noticeably eases the anti-crawler restrictions (a small throttling sketch follows).
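A thin wrapper around requests is enough for the throttling. A minimal sketch, assuming nothing beyond the standard library and requests (the polite_get name and the delay bounds are my own choices):

```python
import random
import time

import requests

def polite_get(url, headers=None, min_delay=1, max_delay=5):
    """Sleep a random interval before each request to stay under the rate limit."""
    time.sleep(random.randint(min_delay, max_delay))
    return requests.get(url, headers=headers)
```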
The endpoints involved are these.
- Fetching the people who follow me:
https://www.zhihu.com/api/v4/members/zhi-ai-89-18/followers?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=0&limit=20
- Fetching the people I follow:
https://www.zhihu.com/api/v4/members/zhi-ai-89-18/followees?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=0&limit=20
- Fetching my own profile:
https://www.zhihu.com/api/v4/members/zhi-ai-89-18?include=locations%2CemploymentsXXXXXXXXXXXX
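These URLs are percent-encoded, which makes the include parameter hard to read. As a quick decoding aid (not part of the crawler itself), you can unquote it:

```python
from urllib.parse import unquote

# The include parameter from the follower/followee endpoints above.
include = 'data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics'
print(unquote(include))
# -> data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics
```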
With these endpoints clear, the crawler poses basically no problems. The full code follows.
```python
# coding: utf8
# @Author: 郭 璞
# @File: spider.py
# @Time: 2017/5/22
# @Contact: 1064319632@qq.com
# @blog: http://blog.youkuaiyun.com/marksinoberg
# @Description: Crawler that collects the region data.
import requests
import json
import re
import math


class Spider(object):
    def __init__(self):
        """
        Set up the request headers. `authorization` is mandatory;
        without it the API returns no data.
        """
        self.headers = {
            'authorization': 'Bearer Mi4wQUFEQWRCUTdBQUFBRUFMU3Y1YTRDeGNBQUFCaEFsVk5SMmsyV1FEeC11Uy03U2Zmc0pmSG8wTm55V2RSdjBSd3hn|1495413191|2fac9f462ad7607baaea9fca2a64abe72134af4a',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
            'Host': 'www.zhihu.com',
            'x-udid': 'ABAC0r-WuAuPTsVSA2wl0bXj3UZqixKgbPE=',
        }
        self.cookie = {
            'Cookie': 'q_c1=cbf69b836d4645b29f057b71be86c00e|1493896915000|1493896915000; r_cap_id="NWY3YjIzYzlmOTg0NDVhM2FmMzdjNzA1YzY5NTBlYmU=|1494146108|664527b0598db30d7734ff56ea5ac12b17cbe2d8"; cap_id="MWRhOTIzNGYzZDdjNDA3MjhiNTg1MGQ3ZDJlMjQ5NWE=|1494146108|94fc913a73ce89aeb3b60439fdcc69687baf438d"; d_c0="ABAC0r-WuAuPTsVSA2wl0bXj3UZqixKgbPE=|1494146110"; _zap=c27db1fb-911e-48bd-babe-3b6e66c3e558; _xsrf=55d8c6a475335b06ee3e848612afdd80; aliyungf_tc=AQAAAJ+R5xghJQIAlnF1b59VTAruEEc9; acw_tc=AQAAAGxlvy3TLgIAlnF1bxgpA2LSD8+W; s-q=%E6%A2%81%E5%8B%87; s-i=1; sid=p74htbkp; z_c0=Mi4wQUFEQWRCUTdBQUFBRUFMU3Y1YTRDeGNBQUFCaEFsVk5SMmsyV1FEeC11Uy03U2Zmc0pmSG8wTm55V2RSdjBSd3hn|1495413191|2fac9f462ad7607baaea9fca2a64abe72134af4a; __utma=155987696.1489589582.1495414813.1495414813.1495414813.1; __utmb=155987696.0.10.1495414813; __utmc=155987696; __utmz=155987696.1495414813.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)'
        }

    def parse_homepage(self, username):
        """
        Return (following_count, follower_count) for `username`.
        :param username:
        :return:
        """
        # Approach one: scrape the counts out of the profile page HTML.
        # homeurl = "https://www.zhihu.com/people/{}".format(username)
        # response = requests.get(url=homeurl, headers=self.headers)
        # if response.status_code == 200:
        #     followees_number = int(re.findall(re.compile('followingCount":(\d+),'), response.text)[0])
        #     followers_number = int(re.findall(re.compile('se,"followerCount":(\d+),'), response.text)[0])
        #     print("following", followees_number)
        #     print("followed by", followers_number)
        #     return (followees_number, followers_number)
        # else:
        #     print(response.status_code)
        # Approach two: ask the v4 API directly.
        tempurl = 'https://www.zhihu.com/api/v4/members/{}?include=locations%2Cemployments%2Cgender%2Ceducations%2Cbusiness%2Cvoteup_count%2Cthanked_Count%2Cfollower_count%2Cfollowing_count%2Ccover_url%2Cfollowing_topic_count%2Cfollowing_question_count%2Cfollowing_favlists_count%2Cfollowing_columns_count%2Cavatar_hue%2Canswer_count%2Carticles_count%2Cpins_count%2Cquestion_count%2Ccolumns_count%2Ccommercial_question_count%2Cfavorite_count%2Cfavorited_count%2Clogs_count%2Cmarked_answers_count%2Cmarked_answers_text%2Cmessage_thread_token%2Caccount_status%2Cis_active%2Cis_force_renamed%2Cis_bind_sina%2Csina_weibo_url%2Csina_weibo_name%2Cshow_sina_weibo%2Cis_blocking%2Cis_blocked%2Cis_following%2Cis_followed%2Cmutual_followees_count%2Cvote_to_count%2Cvote_from_count%2Cthank_to_count%2Cthank_from_count%2Cthanked_count%2Cdescription%2Chosted_live_count%2Cparticipated_live_count%2Callow_message%2Cindustry_category%2Corg_name%2Corg_homepage%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics'.format(username)
        response = requests.get(url=tempurl, headers=self.headers)
        if response.status_code == 200:
            data = json.loads(response.text)
            return (data['following_count'], data['follower_count'])
        else:
            print(response.status_code)

    def get_location_edu(self, username):
        """
        Return the location, school name, and major for `username`.
        :param username:
        :return:
        """
        tempurl = 'https://www.zhihu.com/api/v4/members/{}?include=locations%2Cemployments%2Cgender%2Ceducations%2Cbusiness%2Cvoteup_count%2Cthanked_Count%2Cfollower_count%2Cfollowing_count%2Ccover_url%2Cfollowing_topic_count%2Cfollowing_question_count%2Cfollowing_favlists_count%2Cfollowing_columns_count%2Cavatar_hue%2Canswer_count%2Carticles_count%2Cpins_count%2Cquestion_count%2Ccolumns_count%2Ccommercial_question_count%2Cfavorite_count%2Cfavorited_count%2Clogs_count%2Cmarked_answers_count%2Cmarked_answers_text%2Cmessage_thread_token%2Caccount_status%2Cis_active%2Cis_force_renamed%2Cis_bind_sina%2Csina_weibo_url%2Csina_weibo_name%2Cshow_sina_weibo%2Cis_blocking%2Cis_blocked%2Cis_following%2Cis_followed%2Cmutual_followees_count%2Cvote_to_count%2Cvote_from_count%2Cthank_to_count%2Cthank_from_count%2Cthanked_count%2Cdescription%2Chosted_live_count%2Cparticipated_live_count%2Callow_message%2Cindustry_category%2Corg_name%2Corg_homepage%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics'.format(username)
        response = requests.get(url=tempurl, headers=self.headers)
        if response.status_code == 200:
            data = json.loads(response.text)
            try:
                location = data['locations'][0]['name']
            except (KeyError, IndexError):
                location = "未填写"  # "not filled in"
            # Handle the education fields.
            try:
                school = data['educations'][0]['school']['name']
                major = data['educations'][0]['major']['name']
            except (KeyError, IndexError):
                school = "未填写"
                major = "未填写"
            return (username, location, school, major)
        else:
            print(response.status_code)

    def get_followees(self, username):
        """
        Return the list of people that `username` follows.
        :param username:
        :return:
        """
        # First get the total number of followees to work out the page range.
        homeparsed = self.parse_homepage(username=username)
        print(homeparsed)
        followees_number = homeparsed[0]
        pages = math.ceil(followees_number / 20)
        # Collect results; a set() at the end drops duplicates.
        followee_result = []
        counter = 1
        for offset in range(pages):
            tempurl = 'https://www.zhihu.com/api/v4/members/{username}/followees?offset={offset}&limit=20'.format(username=username, offset=offset * 20)
            response = requests.get(url=tempurl, headers=self.headers)
            if response.status_code == 200:
                data = json.loads(response.text)
                followees = data['data']
                for followee in followees:
                    # print(counter, ": ", followee['url_token'])
                    followee_result.append(followee['url_token'])
                    counter += 1
            else:
                print(response.status_code)
        # Return the deduplicated list of people `username` follows.
        return list(set(followee_result))

    def get_followers(self, username):
        """
        Return the list of people who follow `username`.
        :param username:
        :return:
        """
        # First get the total number of followers to work out the page range.
        homeparsed = self.parse_homepage(username=username)
        print(homeparsed)
        followers_number = homeparsed[1]
        pages = math.ceil(followers_number / 20)
        # Collect results; a set() at the end drops duplicates.
        follower_result = []
        counter = 1
        for offset in range(pages):
            tempurl = 'https://www.zhihu.com/api/v4/members/{username}/followers?offset={offset}&limit=20'.format(username=username, offset=offset * 20)
            response = requests.get(url=tempurl, headers=self.headers)
            if response.status_code == 200:
                data = json.loads(response.text)
                followers = data['data']
                for follower in followers:
                    follower_result.append(follower['url_token'])
                    counter += 1
            else:
                print(response.status_code)
        # Return the deduplicated list of `username`'s followers.
        return list(set(follower_result))


if __name__ == '__main__':
    spider = Spider()
    # spider.get_followees(username='tianshansoft')
    # spider.parse_homepage(username='zhi-ai-89-18')
    # location = spider.get_location_edu(username='zhi-ai-89-18')
    # print(location)
    # print(spider.parse_homepage(username='tianshansoft'))
    # followee_result = spider.get_followees(username='tianshansoft')
    # print(followee_result)
    # print(len(followee_result))
    followers_result = spider.get_followers(username='tianshansoft')
    print(len(followers_result))
    print(followers_result[:100])
```
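One detail worth spelling out: both list endpoints return at most 20 users per request, so the loops derive the page count with math.ceil and step the offset in twenties. For example:

```python
import math

follower_count = 41                       # e.g. a value returned by parse_homepage
pages = math.ceil(follower_count / 20)
offsets = [page * 20 for page in range(pages)]
print(pages, offsets)                     # 3 [0, 20, 40]
```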
Database
To keep the database side simple and convenient, sqlite3 will do. Since the requirements here are modest, a single table is enough.
```sql
create table user(
    id INTEGER not null primary key autoincrement,
    username varchar(36) not null,
    location varchar(255),
    school varchar(255),
    major varchar(255)
);
```
We also need a database helper class; otherwise we would keep writing the same repetitive code for no good reason.
```python
# coding: utf8
# @Author: 郭 璞
# @File: dbhelper.py
# @Time: 2017/5/22
# @Contact: 1064319632@qq.com
# @blog: http://blog.youkuaiyun.com/marksinoberg
# @Description: Utility class for the database operations.
import sqlite3


class DbConfig(object):
    DATABASE_FILE_PATH = 'zhihu.db'


class DbHelper(object):
    def __init__(self):
        self.conn = sqlite3.connect(DbConfig.DATABASE_FILE_PATH)

    def create_table(self):
        # AUTOINCREMENT is the keyword for an auto-incrementing column.
        sql = """
        create table user(
            id INTEGER not null primary key autoincrement,
            username varchar(36) not null,
            location varchar(255),
            school varchar(255),
            major varchar(255)
        );
        """
        cursor = self.conn.cursor()
        cursor.execute(sql)
        cursor.close()

    def add(self, data=()):
        # Use a parameterized query: string formatting breaks on values that
        # contain quotes and invites SQL injection. Note the column is
        # `username`, matching the table definition.
        cursor = self.conn.cursor()
        sql = "insert into user(username, location, school, major) values(?, ?, ?, ?);"
        cursor.execute(sql, (data[0], data[1], data[2], data[3]))
        self.conn.commit()
        cursor.close()

    def get_data(self):
        cursor = self.conn.cursor()
        sql = "select location, count(location) as numbers from user group by location"
        cursor.execute(sql)
        resultset = cursor.fetchall()
        print(resultset)


if __name__ == '__main__':
    dbhelper = DbHelper()
    # dbhelper.create_table()
    # data = {
    #     'username': 'zhi-ai-89-18',
    #     'location': '大连',
    #     'school': '大连理工大学',
    #     'major': '软件',
    # }
    # data = ('tianshansoft', '上海', 'weizhi', 'software')
    # dbhelper.add(data=data)
    dbhelper.get_data()
```
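For reference, typical usage of the helper looks like this (the sample row is a placeholder):

```python
helper = DbHelper()
helper.create_table()                                   # run once; raises if the table already exists
helper.add(('tianshansoft', '上海', '未填写', '未填写'))  # (username, location, school, major)
helper.get_data()                                       # prints [(location, count), ...]
```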
This is simple requirement-driven development: all I need is storing and querying data, so the helper class is written very plainly. Functionally, though, it is enough.
Finally, let's look at how Zhihu users are distributed by region. The SQL is:
```sql
select location, count(location) as numbers from user group by location ORDER BY numbers DESC
```
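Run through sqlite3 directly, that is just (a quick sketch; the output depends on what you have crawled):

```python
import sqlite3

conn = sqlite3.connect('zhihu.db')
rows = conn.execute(
    'select location, count(location) as numbers '
    'from user group by location order by numbers desc').fetchall()
for location, numbers in rows:
    print(location, numbers)
conn.close()
```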
The result is a ranking of regions by user count.
Scheduler
The scheduler is a conceptual term; its job is to glue the crawler to the persistence layer. By the six degrees of separation theory, a social network is one enormous connected graph, so a crawler can essentially never exhaust all users. The fallback is to crawl just a part. A part, though, amounts to a random sample, so it should not differ much from the whole.
Below is a brief scheduling implementation (brief, because it does no deduplication; a deduplicated variant is sketched after the code).
```python
# coding: utf8
# @Author: 郭 璞
# @File: scheduler.py
# @Time: 2017/5/22
# @Contact: 1064319632@qq.com
# @blog: http://blog.youkuaiyun.com/marksinoberg
# @Description: Scheduler that glues the modules together.
import spider
import dbhelper
import time, random

sp = spider.Spider()
entrance = 'ghostcomputing'
queue = [entrance]
container = []
LEVEL = 3
counter = 0
dbhelper = dbhelper.DbHelper()

while queue:
    if counter >= 10000:
        break
    else:
        temp = queue.pop(0)
        followees = sp.get_followees(username=temp)
        queue.extend(followees)
        counter += (len(followees) - 1)
        # Random sleep to stay under the rate limit.
        timeseed = random.randint(1, 5)
        print('Random sleep for {} seconds!'.format(timeseed))
        time.sleep(timeseed)
        # Fetch the detailed info of everyone `temp` follows.
        for index, followee in enumerate(followees):
            # container.append(sp.get_location_edu(username=followee))
            data = sp.get_location_edu(username=followee)
            dbhelper.add(data=data)
            print('{} fetched'.format(followee))
            # Random sleep every 28 users.
            if index % 28 == 0:
                timeseed = random.randint(1, 3)
                print('Random sleep for {} seconds!'.format(timeseed))
                time.sleep(timeseed)

print(container)
```
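The missing deduplication is easy to retrofit with a visited set, so no user is fetched twice. A minimal sketch of the same traversal with the bookkeeping added (only the visited set is new; everything else mirrors the loop above):

```python
import spider

sp = spider.Spider()
visited = set()
queue = ['ghostcomputing']

while queue:
    user = queue.pop(0)
    if user in visited:
        continue                      # already crawled; skip
    visited.add(user)
    for followee in sp.get_followees(username=user):
        if followee not in visited:
            queue.append(followee)    # only enqueue users we have not seen
```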
Web Service
ECharts is best used with the front end and back end separated, so serving the chart data through an API is a good choice. I once built such a back end in PHP; the same works here, and jQuery keeps the front end convenient.
This time, however, I want to try Flask, which is even lighter. One caveat: once the template engine is involved, the HTML is no longer plain HTML. The paths to JavaScript and CSS files must be handled explicitly, or they will not be resolved correctly.
```python
# The helper:
#   url_for("the endpoint for static files, usually 'static'",
#           filename="the value to appear in src, usually the file's path inside static")
#
# Say I want:            <script src="echarts.js">
# Then in the template:  <script src="{{ echarts_path }}">
# And in the back end:
echarts_path = url_for('static', filename='echarts.js')
return render_template('index.html', echarts_path=echarts_path)
```
With that understood, we can wire the scripts and styles into our own template. The service itself can stay small; a sketch follows.
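web/service.py is not shown above, so here is a minimal sketch of what it could look like. The /data route name and the {name, value} JSON shape (what an ECharts map series consumes) are my assumptions; only the url_for/render_template handling follows the note above.

```python
# coding: utf8
# Minimal sketch of web/service.py; the /data route and JSON shape are assumptions.
import json
import sqlite3

from flask import Flask, render_template, url_for

app = Flask(__name__)

@app.route('/')
def index():
    # Resolve the static asset path explicitly, as discussed above.
    echarts_path = url_for('static', filename='echarts.js')
    return render_template('index.html', echarts_path=echarts_path)

@app.route('/data')
def data():
    # zhihu.db lives one level above web/ in the project tree; adjust as needed.
    conn = sqlite3.connect('../zhihu.db')
    rows = conn.execute(
        'select location, count(location) as numbers '
        'from user group by location').fetchall()
    conn.close()
    # An ECharts map series takes [{name: ..., value: ...}, ...].
    return json.dumps([{'name': loc, 'value': num} for loc, num in rows],
                      ensure_ascii=False)

if __name__ == '__main__':
    app.run(debug=True)
```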
http://echarts.baidu.com/echarts2/doc/example/map15.html
...whereas all I have drawn so far is a plain China map.
Still to do...
Summary
In review: fetching data from the API in the crawler, working with sqlite3, and serving static resources in the web service. The remaining chart work still needs effort.