Target site: Shenzhen Social Organization Information Platform (深圳市社会组织信息平台).
URL: http://sgj.mzj.sz.gov.cn:9008/xxcx/index.jhtml
The fields to scrape are as follows:
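# Column labels are left in Chinese, as in the original code. In order they mean:
# organization name, unified credit code, founding date, legal representative, registration authority,
# address, business scope, is a charitable organization, eligible for pre-tax deduction of public-welfare donations,
# qualified for public fundraising, listed in the recommended directory of social organizations for
# government function transfer and service procurement, status, certificate validity period, registered capital,
# supervising unit, charitable-organization recognition date, on the abnormal list, on the serious-dishonesty list,
# rating grade, award date, expiry date, ID.
columns = ['社会组织名称', '统一信用代码', '成立时间', '法定代表人', '登记管理机关', '住所地址', '业务范围', '是否慈善组织', '是否取得公益性捐赠税前扣除资格', '是否具有公开募捐资格',
           '在承接政府职能转移和购买服务社会组织推荐目录中', '状态', '证书有效期', '注册资金', '业务主管单位', '慈善组织认定登记日期', '是否纳入异常名录',
           '是否纳入严重失信名录', '评估等级', '授予时间', '截止时间', '编号']  # 编号 (ID) is a prerequisite for another file later on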
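As shown in the figure, first select 基金会 (foundation) as the type, then click the query button to get the table data shown in the next figure.
Then open the XHR requests in the browser's developer tools and read the table data from the response there; it is in JSON format, which is relatively easy to scrape.
Code to scrape the information above: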
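To make the shape of that JSON concrete, here is a minimal offline sketch. The sample body is invented for illustration; only the `data` → `list` nesting and the key names are taken from the scraping code in this post. The helper name `rows_from_body` is hypothetical.

```python
import json

# Hypothetical response body. Only the data -> list nesting and the key names
# mirror what the scraper in this post reads; the values are made up.
sample_body = '''
{"data": {"list": [
  {"ID": "a1", "ORGANIZATION_NAME": "Example Foundation A", "CREDIT_CODE": "0001"},
  {"ID": "b2", "ORGANIZATION_NAME": "Example Foundation B"}
]}}
'''

def rows_from_body(body, keys):
    """Return one list per record, with None where a key is absent."""
    payload = json.loads(body)
    records = payload.get("data", {}).get("list", [])
    return [[record.get(key) for key in keys] for record in records]

rows = rows_from_body(sample_body, ["ORGANIZATION_NAME", "CREDIT_CODE", "ID"])
for row in rows:
    print(row)
```

Using `dict.get` means a record with a missing field still produces a full-length row, which keeps every row aligned with the column list.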
import requests

def bsae_info():
    list_base_info = []  # collected rows
    url_1 = 'http://218.17.83.146:9008/SOCSP_PS_SP/api/society/findList'
    # To avoid trouble, the same browser-like header is sent by every function
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}
    # Fields read from each record, in output order
    keys = ['ORGANIZATION_NAME', 'CREDIT_CODE', 'SETUP_DATE', 'LEGAL_REPRESENTATIVE', 'REGISTRATION_AUTHORITY', 'ID']
    for m in range(1, 3):  # pages 1 and 2
        data_1 = {
            'pageNum': m,
            'SOCIETY_TYPE': '4'}  # request parameters; '4' is the type value used for the foundation query
        res_1 = requests.post(url_1, data=data_1, headers=headers)
        res_1_json = res_1.json()
        items_1 = res_1_json['data']['list']
        for i in items_1:  # extract the data
            # .get() stores None for a missing key, so a record can never
            # pick up a value left over from the previous record
            list_base_info.append([i.get(k) for k in keys])  # add the row to the list
    return list_base_info
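The function above requests a fixed range of pages. As a hedged alternative, the loop below keeps requesting pages until one comes back empty. It assumes the endpoint returns an empty list past the last page, which the post does not confirm. `collect_all_pages`, `fetch_page` and `fake_fetch` are hypothetical names, and the fake data stands in for the real request so the sketch runs offline.

```python
def collect_all_pages(fetch_page, max_pages=100):
    """Call fetch_page(1), fetch_page(2), ... until a page comes back empty.

    max_pages is a safety cap so a misbehaving endpoint cannot loop forever.
    """
    records = []
    for page_num in range(1, max_pages + 1):
        page = fetch_page(page_num)
        if not page:  # assumption: an empty list marks the end of the data
            break
        records.extend(page)
    return records

# Offline stand-in for the real POST request, for demonstration only
fake_pages = {1: [{"ID": "a1"}, {"ID": "b2"}], 2: [{"ID": "c3"}]}

def fake_fetch(page_num):
    return fake_pages.get(page_num, [])

all_records = collect_all_pages(fake_fetch)
ids = [record["ID"] for record in all_records]
print(ids)
```

In real use, `fetch_page` would wrap the `requests.post` call and return the list found under `data` → `list` in the response.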
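Clicking a foundation's name on the listing page opens that foundation's more detailed data. Selecting an entry in the left-hand pane shows the corresponding JSON on the right; the fields are read from that JSON and saved in a list.
Code to scrape the information above: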
def info(ID):
    list_info = []
    url_2 = 'http://218.17.83.146:9008/SOCSP_PS_SP/api/index/findSociety'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}
    # Fields read from each detail record, in output order
    keys = ['DETAILED_RESIDENCE', 'BUSINESS_RANGE', 'CHARITABLE_ORGANIZATION', 'DEDUCTION_QUALIFICATION',
            'PUBLIC_APPLICATION', 'IS_FUNCTION_SERVE', 'STATUS', 'CERT_VALID_TIME', 'REGISTERED_CAPITAL',
            'COMPETENT_UNIT', 'SETUP_DATE', 'YCML', 'YZSX']
    for id_ in ID:
        data_2 = {
            'societyId': id_}
        res_2 = requests.post(url_2, data=data_2, headers=headers)
        res_2_json = res_2.json()
        datas = res_2_json.get('data') or {}
        # Some records are incomplete, so missing fields need handling:
        # .get() stores None instead of raising KeyError or reusing the
        # previous record's value
        list_info.append([datas.get(k) for k in keys])
    return list_info
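The post does not show how the two result lists are combined, so the following is only a sketch of one way to do it. It assumes that `info(ID)` returns one row per ID in the same order as the base rows, and that the ID should move to the end of each row, as in the `columns` list. `merge_rows`, the shortened rows and the English header are all invented for illustration.

```python
import csv
import io

def merge_rows(base_rows, detail_rows):
    """Pair each base row with its detail row and move the ID to the end."""
    merged = []
    for base, detail in zip(base_rows, detail_rows):
        *base_fields, record_id = base  # the ID is the last item of a base row
        merged.append(base_fields + list(detail) + [record_id])
    return merged

# Shortened, invented rows standing in for bsae_info() and info() output
base_rows = [["Example Foundation A", "0001", "a1"],
             ["Example Foundation B", "0002", "b2"]]
ids = [row[-1] for row in base_rows]  # the list that would be passed to info(ID)
detail_rows = [["Address A", "Scope A"], ["Address B", "Scope B"]]

merged = merge_rows(base_rows, detail_rows)

# Write to an in-memory buffer; a real script would open a file instead
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["name", "credit_code", "address", "scope", "id"])  # illustrative header
writer.writerows(merged)
csv_text = buffer.getvalue()
print(csv_text)
```

Pairing rows by position only works while both lists stay the same length and order, which is why each function should append exactly one row per record even when fields are missing.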
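Next, scrape the rating (评估等级) information. Not every foundation has it, so first find a foundation that does in order to inspect the response; as shown in the figure, the rating information appears on the right.
Scraper code: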
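Because some foundations have no rating, every foundation still needs a row of the same width so the rating columns line up with the rest of the table. The sketch below shows one way to pad missing ratings with `None`. It assumes the rating response is a list of records, and the key names `GRADE`, `AWARD_DATE` and `END_DATE` are hypothetical placeholders, not the real keys; read the real ones from the JSON shown in the developer tools.

```python
# Hypothetical key names, for illustration only; replace them with the keys
# that actually appear in the rating JSON.
RATING_KEYS = ["GRADE", "AWARD_DATE", "END_DATE"]

def rating_row(records, keys=RATING_KEYS):
    """Return the first rating record as a list, or Nones if there is none."""
    if not records:  # covers both an empty list and None
        return [None] * len(keys)
    first = records[0]
    return [first.get(key) for key in keys]

# Invented sample data standing in for two different foundations
with_rating = rating_row([{"GRADE": "4A", "AWARD_DATE": "2019-01-01", "END_DATE": "2024-01-01"}])
without_rating = rating_row([])
print(with_rating)
print(without_rating)
```

Taking only the first record is a simplification; if a foundation can hold several ratings, the selection rule would need to be decided from the real data.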
def grade(ID):
    list_grade = []
    for id_ in ID:
        url_3 = 'http://218.17.83.146:9008/SOCSP_PS_SP/api/index/findPgdjList'
        data_2 = {
            'societyId': id_}
        headers = {
            'User-A