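Approach:
1. Manually find a few zombie users.
2. Traverse the following and fan lists of those zombie users over several levels to collect a large set of candidate users.
3. Manually label the zombie users among the candidates.
The technical difficulty lies in step 2; most of the labor is in step 3.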
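Step 2 amounts to a breadth-first traversal with a depth limit. The sketch below is only an illustration under assumptions: `collect_candidates`, `fetch_neighbors`, and the toy graph are hypothetical names and data, not part of the original code. In practice `fetch_neighbors` would return the user IDs found in one user's following and fan lists.

```python
from collections import deque


def collect_candidates(seed_ids, fetch_neighbors, max_depth=2):
    """Breadth-first traversal from the seed users, at most max_depth hops away."""
    seen = set(seed_ids)
    queue = deque((uid, 0) for uid in seed_ids)
    while queue:
        uid, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand users that sit on the depth limit
        for neighbor in fetch_neighbors(uid):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return seen


# Toy graph standing in for real following/fan lists (made-up IDs).
toy_graph = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": [],
    "d": ["e"],
}

candidates = collect_candidates(
    ["a"], lambda uid: toy_graph.get(uid, []), max_depth=2
)
print(sorted(candidates))
```

The `seen` set keeps each user from being requested twice, which matters because following and fan lists overlap heavily. The depth limit bounds the number of requests.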
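1. Manually finding zombie users
A manual search on Weibo turned up zombie users such as the following:
Their characteristics are fairly obvious: they follow many accounts, have few fans, and show almost no activity.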
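These three traits could also be used to pre-sort candidates before manual labeling. The following is a rough sketch only: the function name, the field names, and all thresholds are assumptions made for illustration and would need tuning against real data. It does not replace the manual labeling in step 3.

```python
def looks_like_zombie(follow_count, fan_count, post_count,
                      min_follows=500, max_fans=20, max_posts=5):
    """Heuristic pre-filter; the default thresholds are hypothetical."""
    return (follow_count >= min_follows
            and fan_count <= max_fans
            and post_count <= max_posts)


# Made-up profiles: (uid, follows, fans, posts)
profiles = [
    ("u1", 1200, 3, 0),     # follows many, few fans, inactive
    ("u2", 150, 4000, 820),  # ordinary active user
    ("u3", 900, 15, 40),    # follows many, but still posts
]

suspects = [uid for uid, follows, fans, posts in profiles
            if looks_like_zombie(follows, fans, posts)]
print(suspects)
```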
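2. Iteratively traversing the following and fan lists of zombie users
See the earlier article: https://blog.youkuaiyun.com/weixin_43906500/article/details/115919312
The code from that article is slightly modified and wrapped into a function library, as follows.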
2.1 Getting follows and fans
get_follow_fan.py
import re
import urllib.request

import code_weibo.config as config
import json


def get_follow_fan(o_id, num=5):
    # Request headers come from the project's own config module.
    headers = config.get_headers()
    add = urllib.request.Request(url='https://weibo.com/u/%s' % o_id, headers=headers)
    r = urllib.request.urlopen(url=add, timeout=10).read().decode('utf-8')
    # The profile page embeds the page_id used to build the list URLs.
    p_id = re.findall(r'CONFIG\[\'page_id\']=\'(\d+)\'', r)[0]
    follow_data = []
    fan_data = []
    dic_follow_fan = {}
    # Walk up to `num` pages of the following list.
    for i in range(1, num + 1):
        add = urllib.request.Request(url='https://weibo.com/p/%s/follow?page=%d' % (p_id, i), headers=headers)
        r = urllib.request.urlopen(url=add, timeout=10).read().decode('utf-8')
        follows = re.findall(r'action-type=\\"itemClick\\" action-data=\\"uid=(\d+)&fnick=(.*?)&', r)
        print("Following:")
        print(len(follows))
        if not follows:
            break  # an empty page means there are no more entries
        for follow in follows:
            dic = {}
            dic["uid"] = follow[0]
            dic["name"] = follow[1]
            follow_data.append(dic)
    # Walk up to `num` pages of the fan list.
    for i in range(1, num + 1):
        add = urllib.request.Request(url='https://weibo.com/p/%s/follow?relate=fans&page=%d' % (p_id, i), headers=headers)
        r = urllib.request.urlopen(url=add, timeout=10).read().decode('utf-8')
        fans = re.findall(r'action-type=\\"itemClick\\" action-data=\\"uid=(\d+)&