Preface
Zhihu may no longer be what it once was, but its user base is still very broad. My plan was to write a crawler that collects users' location, education, major, and similar information, persist it to a database, and finally build a web service that displays the results as charts.
The ECharts map part still needs work, though. Even after some debugging, the result is not quite what I hoped. Embarrassing (⊙﹏⊙)b
Project Setup
As mentioned in the preface, quite a few technologies come together here.
Here is a quick look at the project tree.
```
C:\Users\biao\Desktop\network\code\zhihu-range>tree . /f
文件夹 PATH 列表
卷序列号为 E0C6-0F15
C:\USERS\BIAO\DESKTOP\NETWORK\CODE\ZHIHU-RANGE
│  dbhelper.py
│  scheduler.py
│  spider.py
│  zhihu.db
│  __init__.py
│
├─web
│  │  service.py
│  │  __init__.py
│  │
│  ├─static
│  │      china.js
│  │      echarts.js
│  │      echarts.min.js
│  │      jquery-2.2.4.min.js
│  │
│  └─templates
│          index.html
│
└─__pycache__
        dbhelper.cpython-36.pyc
        spider.cpython-36.pyc
```
Modules
Next, let's implement each small module, one piece at a time.
Crawler
The crawler has a few points worth noting.
- The authorization request header: without it, the API returns no data.
- Request-rate control: adding a random delay between requests noticeably eases the anti-crawler restrictions (a small throttling sketch follows).
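A thin wrapper around requests is enough for the throttling. A minimal sketch, assuming nothing beyond the standard library and requests (the polite_get name and the delay bounds are my own choices):

```python
import random
import time

import requests

def polite_get(url, headers=None, min_delay=1, max_delay=5):
    """Sleep a random interval before each request to stay under the rate limit."""
    time.sleep(random.randint(min_delay, max_delay))
    return requests.get(url, headers=headers)
```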
The endpoints involved are these.
- Fetching the people who follow me:
https://www.zhihu.com/api/v4/members/zhi-ai-89-18/followers?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=0&limit=20
- Fetching the people I follow:
https://www.zhihu.com/api/v4/members/zhi-ai-89-18/followees?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=0&limit=20
- Fetching my own profile:
https://www.zhihu.com/api/v4/members/zhi-ai-89-18?include=locations%2CemploymentsXXXXXXXXXXXX
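These URLs are percent-encoded, which makes the include parameter hard to read. As a quick decoding aid (not part of the crawler itself), you can unquote it:

```python
from urllib.parse import unquote

# The include parameter from the follower/followee endpoints above.
include = 'data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics'
print(unquote(include))
# -> data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics
```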
With these endpoints clear, the crawler poses basically no problems. The full code follows.
```python
# coding: utf8
# @Author: 郭 璞
# @File: spider.py
# @Time: 2017/5/22
# @Contact: 1064319632@qq.com
# @blog: http://blog.youkuaiyun.com/marksinoberg
# @Description: Crawler that collects the region data.
import requests
import json
import re
import math


class Spider(object):
    def __init__(self):
        """
        Set up the request headers. `authorization` is mandatory;
        without it the API returns no data.
        """
        self.headers = {
            'authorization': 'Bearer Mi4wQUFEQWRCUTdBQUFBRUFMU3Y1YTRDeGNBQUFCaEFsVk5SMmsyV1FEeC11Uy03U2Zmc0pmSG8wTm55V2RSdjBSd3hn|1495413191|2fac9f462ad7607baaea9fca2a64abe72134af4a',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
            'Host': 'www.zhihu.com',
            'x-udid': 'ABAC0r-WuAuPTsVSA2wl0bXj3UZqixKgbPE=',
        }
        self.cookie = {
            'Cookie': 'q_c1=cbf69b836d4645b29f057b71be86c00e|1493896915000|1493896915000; r_cap_id="NWY3YjIzYzlmOTg0NDVhM2FmMzdjNzA1YzY5NTBlYmU=|1494146108|664527b0598db30d7734ff56ea5ac12b17cbe2d8"; cap_id="MWRhOTIzNGYzZDdjNDA3MjhiNTg1MGQ3ZDJlMjQ5NWE=|1494146108|94fc913a73ce89aeb3b60439fdcc69687baf438d"; d_c0="ABAC0r-WuAuPTsVSA2wl0bXj3UZqixKgbPE=|1494146110"; _zap=c27db1fb-911e-48bd-babe-3b6e66c3e558; _xsrf=55d8c6a475335b06ee3e848612afdd80; aliyungf_tc=AQAAAJ+R5xghJQIAlnF1b59VTAruEEc9; acw_tc=AQAAAGxlvy3TLgIAlnF1bxgpA2LSD8+W; s-q=%E6%A2%81%E5%8B%87; s-i=1; sid=p74htbkp; z_c0=Mi4wQUFEQWRCUTdBQUFBRUFMU3Y1YTRDeGNBQUFCaEFsVk5SMmsyV1FEeC11Uy03U2Zmc0pmSG8wTm55V2RSdjBSd3hn|1495413191|2fac9f462ad7607baaea9fca2a64abe72134af4a; __utma=155987696.1489589582.1495414813.1495414813.1495414813.1; __utmb=155987696.0.10.1495414813; __utmc=155987696; __utmz=155987696.1495414813.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)'
        }

    def parse_homepage(self, username):
        """
        Return (following_count, follower_count) for `username`.
        :param username:
        :return:
        """
        # Approach one: scrape the counts out of the profile page HTML.
        # homeurl = "https://www.zhihu.com/people/{}".format(username)
        # response = requests.get(url=homeurl, headers=self.headers)
        # if response.status_code == 200:
        #     followees_number = int(re.findall(re.compile('followingCount":(\d+),'), response.text)[0])
        #     followers_number = int(re.findall(re.compile('se,"followerCount":(\d+),'), response.text)[0])
        #     print("following", followees_number)
        #     print("followed by", followers_number)
        #     return (followees_number, followers_number)
        # else:
        #     print(response.status_code)
        # Approach two: ask the v4 API directly.
        tempurl = 'https://www.zhihu.com/api/v4/members/{}?include=locations%2Cemployments%2Cgender%2Ceducations%2Cbusiness%2Cvoteup_count%2Cthanked_Count%2Cfollower_count%2Cfollowing_count%2Ccover_url%2Cfollowing_topic_count%2Cfollowing_question_count%2Cfollowing_favlists_count%2Cfollowing_columns_count%2Cavatar_hue%2Canswer_count%2Carticles_count%2Cpins_count%2Cquestion_count%2Ccolumns_count%2Ccommercial_question_count%2Cfavorite_count%2Cfavorited_count%2Clogs_count%2Cmarked_answers_count%2Cmarked_answers_text%2Cmessage_thread_token%2Caccount_status%2Cis_active%2Cis_force_renamed%2Cis_bind_sina%2Csina_weibo_url%2Csina_weibo_name%2Cshow_sina_weibo%2Cis_blocking%2Cis_blocked%2Cis_following%2Cis_followed%2Cmutual_followees_count%2Cvote_to_count%2Cvote_from_count%2Cthank_to_count%2Cthank_from_count%2Cthanked_count%2Cdescription%2Chosted_live_count%2Cparticipated_live_count%2Callow_message%2Cindustry_category%2Corg_name%2Corg_homepage%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics'.format(username)
        response = requests.get(url=tempurl, headers=self.headers)
        if response.status_code == 200:
            data = json.loads(response.text)
            return (data['following_count'], data['follower_count'])
        else:
            print(response.status_code)

    def get_location_edu(self, username):
        """
        Return the location, school name, and major for `username`.
        :param username:
        :return:
        """
        tempurl = 'https://www.zhihu.com/api/v4/members/{}?include=locations%2Cemployments%2Cgender%2Ceducations%2Cbusiness%2Cvoteup_count%2Cthanked_Count%2Cfollower_count%2Cfollowing_count%2Ccover_url%2Cfollowing_topic_count%2Cfollowing_question_count%2Cfollowing_favlists_count%2Cfollowing_columns_count%2Cavatar_hue%2Canswer_count%2Carticles_count%2Cpins_count%2Cquestion_count%2Ccolumns_count%2Ccommercial_question_count%2Cfavorite_count%2Cfavorited_count%2Clogs_count%2Cmarked_answers_count%2Cmarked_answers_text%2Cmessage_thread_token%2Caccount_status%2Cis_active%2Cis_force_renamed%2Cis_bind_sina%2Csina_weibo_url%2Csina_weibo_name%2Cshow_sina_weibo%2Cis_blocking%2Cis_blocked%2Cis_following%2Cis_followed%2Cmutual_followees_count%2Cvote_to_count%2Cvote_from_count%2Cthank_to_count%2Cthank_from_count%2Cthanked_count%2Cdescription%2Chosted_live_count%2Cparticipated_live_count%2Callow_message%2Cindustry_category%2Corg_name%2Corg_homepage%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics'.format(username)
        response = requests.get(url=tempurl, headers=self.headers)
        if response.status_code == 200:
            data = json.loads(response.text)
            try:
                location = data['locations'][0]['name']
            except (KeyError, IndexError):
                location = "未填写"  # "not filled in"
            # Handle the education fields.
            try:
                school = data['educations'][0]['school']['name']
                major = data['educations'][0]['major']['name']
            except (KeyError, IndexError):
                school = "未填写"
                major = "未填写"
            return (username, location, school, major)
        else:
            print(response.status_code)

    def get_followees(self, username):
        """
        Return the list of people that `username` follows.
        :param username:
        :return:
        """
        # First get the total number of followees to work out the page range.
        homeparsed = self.parse_homepage(username=username)
        print(homeparsed)
        followees_number = homeparsed[0]
        pages = math.ceil(followees_number / 20)
        # Collect results; a set() at the end drops duplicates.
        followee_result = []
        counter = 1
        for offset in range(pages):
            tempurl = 'https://www.zhihu.com/api/v4/members/{username}/followees?offset={offset}&limit=20'.format(username=username, offset=offset * 20)
            response = requests.get(url=tempurl, headers=self.headers)
            if response.status_code == 200:
                data = json.loads(response.text)
                followees = data['data']
                for followee in followees:
                    # print(counter, ": ", followee['url_token'])
                    followee_result.append(followee['url_token'])
                    counter += 1
            else:
                print(response.status_code)
        # Return the deduplicated list of people `username` follows.
        return list(set(followee_result))

    def get_followers(self, username):
        """
        Return the list of people who follow `username`.
        :param username:
        :return:
        """
        # First get the total number of followers to work out the page range.
        homeparsed = self.parse_homepage(username=username)
        print(homeparsed)
        followers_number = homeparsed[1]
        pages = math.ceil(followers_number / 20)
        # Collect results; a set() at the end drops duplicates.
        follower_result = []
        counter = 1
        for offset in range(pages):
            tempurl = 'https://www.zhihu.com/api/v4/members/{username}/followers?offset={offset}&limit=20'.format(username=username, offset=offset * 20)
            response = requests.get(url=tempurl, headers=self.headers)
            if response.status_code == 200:
                data = json.loads(response.text)
                followers = data['data']
                for follower in followers:
                    follower_result.append(follower['url_token'])
                    counter += 1
            else:
                print(response.status_code)
        # Return the deduplicated list of `username`'s followers.
        return list(set(follower_result))


if __name__ == '__main__':
    spider = Spider()
    # spider.get_followees(username='tianshansoft')
    # spider.parse_homepage(username='zhi-ai-89-18')
    # location = spider.get_location_edu(username='zhi-ai-89-18')
    # print(location)
    # print(spider.parse_homepage(username='tianshansoft'))
    # followee_result = spider.get_followees(username='tianshansoft')
    # print(followee_result)
    # print(len(followee_result))
    followers_result = spider.get_followers(username='tianshansoft')
    print(len(followers_result))
    print(followers_result[:100])
```
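One detail worth spelling out: both list endpoints return at most 20 users per request, so the loops derive the page count with math.ceil and step the offset in twenties. For example:

```python
import math

follower_count = 41                       # e.g. a value returned by parse_homepage
pages = math.ceil(follower_count / 20)
offsets = [page * 20 for page in range(pages)]
print(pages, offsets)                     # 3 [0, 20, 40]
```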
Database
To keep the database side simple and convenient, sqlite3 will do. Since the requirements here are modest, a single table is enough.
```sql
create table user(
    id INTEGER not null primary key autoincrement,
    username varchar(36) not null,
    location varchar(255),
    school varchar(255),
    major varchar(255)
);
```
We also need a database helper class; otherwise we would keep writing the same repetitive code for no good reason.
```python
# coding: utf8
# @Author: 郭 璞
# @File: dbhelper.py
# @Time: 2017/5/22
# @Contact: 1064319632@qq.com
# @blog: http://blog.youkuaiyun.com/marksinoberg
# @Description: Utility class for the database operations.
import sqlite3


class DbConfig(object):
    DATABASE_FILE_PATH = 'zhihu.db'


class DbHelper(object):
    def __init__(self):
        self.conn = sqlite3.connect(DbConfig.DATABASE_FILE_PATH)

    def create_table(self):
        # AUTOINCREMENT is the keyword for an auto-incrementing column.
        sql = """
        create table user(
            id INTEGER not null primary key autoincrement,
            username varchar(36) not null,
            location varchar(255),
            school varchar(255),
            major varchar(255)
        );
        """
        cursor = self.conn.cursor()
        cursor.execute(sql)
        cursor.close()

    def add(self, data=()):
        # Use a parameterized query: string formatting breaks on values that
        # contain quotes and invites SQL injection. Note the column is
        # `username`, matching the table definition.
        cursor = self.conn.cursor()
        sql = "insert into user(username, location, school, major) values(?, ?, ?, ?);"
        cursor.execute(sql, (data[0], data[1], data[2], data[3]))
        self.conn.commit()
        cursor.close()

    def get_data(self):
        cursor = self.conn.cursor()
        sql = "select location, count(location) as numbers from user group by location"
        cursor.execute(sql)
        resultset = cursor.fetchall()
        print(resultset)


if __name__ == '__main__':
    dbhelper = DbHelper()
    # dbhelper.create_table()
    # data = {
    #     'username': 'zhi-ai-89-18',
    #     'location': '大连',
    #     'school': '大连理工大学',
    #     'major': '软件',
    # }
    # data = ('tianshansoft', '上海', 'weizhi', 'software')
    # dbhelper.add(data=data)
    dbhelper.get_data()
```
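For reference, typical usage of the helper looks like this (the sample row is a placeholder):

```python
helper = DbHelper()
helper.create_table()                                   # run once; raises if the table already exists
helper.add(('tianshansoft', '上海', '未填写', '未填写'))  # (username, location, school, major)
helper.get_data()                                       # prints [(location, count), ...]
```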
This is simple requirement-driven development: all I need is storing and querying data, so the helper class is written very plainly. Functionally, though, it is enough.
Finally, let's look at how Zhihu users are distributed by region. The SQL is:
```sql
select location, count(location) as numbers from user group by location ORDER BY numbers DESC
```
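Run through sqlite3 directly, that is just (a quick sketch; the output depends on what you have crawled):

```python
import sqlite3

conn = sqlite3.connect('zhihu.db')
rows = conn.execute(
    'select location, count(location) as numbers '
    'from user group by location order by numbers desc').fetchall()
for location, numbers in rows:
    print(location, numbers)
conn.close()
```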
The result is a ranking of regions by user count.
Scheduler
The scheduler is a conceptual term; its job is to glue the crawler to the persistence layer. By the six degrees of separation theory, a social network is one enormous connected graph, so a crawler can essentially never exhaust all users. The fallback is to crawl just a part. A part, though, amounts to a random sample, so it should not differ much from the whole.
Below is a brief scheduling implementation (brief, because it does no deduplication; a deduplicated variant is sketched after the code).
```python
# coding: utf8
# @Author: 郭 璞
# @File: scheduler.py
# @Time: 2017/5/22
# @Contact: 1064319632@qq.com
# @blog: http://blog.youkuaiyun.com/marksinoberg
# @Description: Scheduler that glues the modules together.
import spider
import dbhelper
import time, random

sp = spider.Spider()
entrance = 'ghostcomputing'
queue = [entrance]
container = []
LEVEL = 3
counter = 0
dbhelper = dbhelper.DbHelper()

while queue:
    if counter >= 10000:
        break
    else:
        temp = queue.pop(0)
        followees = sp.get_followees(username=temp)
        queue.extend(followees)
        counter += (len(followees) - 1)
        # Random sleep to stay under the rate limit.
        timeseed = random.randint(1, 5)
        print('Random sleep for {} seconds!'.format(timeseed))
        time.sleep(timeseed)
        # Fetch the detailed info of everyone `temp` follows.
        for index, followee in enumerate(followees):
            # container.append(sp.get_location_edu(username=followee))
            data = sp.get_location_edu(username=followee)
            dbhelper.add(data=data)
            print('{} fetched'.format(followee))
            # Random sleep every 28 users.
            if index % 28 == 0:
                timeseed = random.randint(1, 3)
                print('Random sleep for {} seconds!'.format(timeseed))
                time.sleep(timeseed)

print(container)
```
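The missing deduplication is easy to retrofit with a visited set, so no user is fetched twice. A minimal sketch of the same traversal with the bookkeeping added (only the visited set is new; everything else mirrors the loop above):

```python
import spider

sp = spider.Spider()
visited = set()
queue = ['ghostcomputing']

while queue:
    user = queue.pop(0)
    if user in visited:
        continue                      # already crawled; skip
    visited.add(user)
    for followee in sp.get_followees(username=user):
        if followee not in visited:
            queue.append(followee)    # only enqueue users we have not seen
```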
Web Service
ECharts is best used with the front end and back end separated, so serving the chart data through an API is a good choice. I once built such a back end in PHP; the same works here, and jQuery keeps the front end convenient.
This time, however, I want to try Flask, which is even lighter. One caveat: once the template engine is involved, the HTML is no longer plain HTML. The paths to JavaScript and CSS files must be handled explicitly, or they will not be resolved correctly.
```python
# The helper:
#   url_for("the endpoint for static files, usually 'static'",
#           filename="the value to appear in src, usually the file's path inside static")
#
# Say I want:            <script src="echarts.js">
# Then in the template:  <script src="{{ echarts_path }}">
# And in the back end:
echarts_path = url_for('static', filename='echarts.js')
return render_template('index.html', echarts_path=echarts_path)
```
With that understood, we can wire the scripts and styles into our own template. The service itself can stay small; a sketch follows.
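web/service.py is not shown above, so here is a minimal sketch of what it could look like. The /data route name and the {name, value} JSON shape (what an ECharts map series consumes) are my assumptions; only the url_for/render_template handling follows the note above.

```python
# coding: utf8
# Minimal sketch of web/service.py; the /data route and JSON shape are assumptions.
import json
import sqlite3

from flask import Flask, render_template, url_for

app = Flask(__name__)

@app.route('/')
def index():
    # Resolve the static asset path explicitly, as discussed above.
    echarts_path = url_for('static', filename='echarts.js')
    return render_template('index.html', echarts_path=echarts_path)

@app.route('/data')
def data():
    # zhihu.db lives one level above web/ in the project tree; adjust as needed.
    conn = sqlite3.connect('../zhihu.db')
    rows = conn.execute(
        'select location, count(location) as numbers '
        'from user group by location').fetchall()
    conn.close()
    # An ECharts map series takes [{name: ..., value: ...}, ...].
    return json.dumps([{'name': loc, 'value': num} for loc, num in rows],
                      ensure_ascii=False)

if __name__ == '__main__':
    app.run(debug=True)
```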
http://echarts.baidu.com/echarts2/doc/example/map15.html
...whereas all I have drawn so far is a plain China map.
Still to do...
Summary
In review: fetching data from the API in the crawler, working with sqlite3, and serving static resources in the web service. The remaining chart work still needs effort.