Scraping Anjuke with Python and saving to CSV or MongoDB!

This post shows how to scrape property listings from the Anjuke site with Python, extract the key fields with jsonpath, and store the results in a CSV file or a MongoDB database.


Goal
Scrape Anjuke property listings and store the data

PyCharm
Finding the URL that actually returns Anjuke's data
1. After comparing the original page with its JSON and JS requests, the URL that really returns the data turns out to be this JSON endpoint: https://shuyang.anjuke.com/v3/ajax/map/sale/1158/prop_list/?room_num=-1&price_id=-1&area_id=-1&floor=-1&orientation=-1&is_two_years=0&is_school=0&is_metro=0&order_id=0&p=1&zoom=12&lat=34.052874_34.181985&lng=118.368882_119.251952&kw=&maxp=99&et=f7908e&ib=1&bst=pem632


Opening that URL in a browser and inspecting the data, everything comes back as hex-style escape sequences instead of readable text, an obvious anti-scraping measure.
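What looks like hex output is most likely ordinary JSON `\uXXXX` unicode escapes, which decode transparently. A minimal sketch, with a made-up sample payload:

```python
import json

# Hypothetical raw body: Chinese text escaped as \uXXXX sequences,
# which is what an "all hex" JSON response usually turns out to be.
raw = '{"region_name": "\\u6cad\\u9633"}'

# json.loads decodes the escapes automatically
data = json.loads(raw)
print(data["region_name"])  # -> 沭阳
```

requests' `r.json()` performs the same decoding, which is why the code below can read the Chinese field values directly.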


Since we want the listing details, we have to pull the data out of this response and then filter it down to the fields we care about.
Grab the request headers

Get the JSON data that r returns

import requests

url = "https://shuyang.anjuke.com/v3/ajax/map/sale/1158/prop_list/?room_num=-1&price_id=-1&area_id=-1&floor=-1&orientation=-1&is_two_years=0&is_school=0&is_metro=0&order_id=0&p=1&zoom=12&lat=34.052874_34.181985&lng=118.368882_119.251952&kw=&maxp=99&et=f7908e&ib=1&bst=pem632"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36"
}
# fetch the URL and parse the JSON response
r = requests.get(url=url, headers=headers)
jsn = r.json()
# print the parsed JSON
print(jsn)



With the JSON in hand, we can inspect it: each dict under props corresponds to one property listing.

To reach the data inside, use jsonpath: extract the following nine fields and collect them into a list.


So define a list to hold the rows and use jsonpath to pull each field.
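The nine parallel lists get stitched together row by row; with just two fields (sample values made up) the idea looks like this, using `zip` as an alternative to manual indexing:

```python
# Hypothetical parallel field lists, as jsonpath would return them
region = ["沭阳", "沭阳"]
price = ["65", "80"]

# zip pairs up the i-th element of each list; this is equivalent to
# the index-based loop used below, just without manual indexing
rows = [list(t) for t in zip(region, price)]
print(rows)  # -> [['沭阳', '65'], ['沭阳', '80']]
```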

import jsonpath
import requests

url = "https://shuyang.anjuke.com/v3/ajax/map/sale/1158/prop_list/?room_num=-1&price_id=-1&area_id=-1&floor=-1&orientation=-1&is_two_years=0&is_school=0&is_metro=0&order_id=0&p=1&zoom=12&lat=34.052874_34.181985&lng=118.368882_119.251952&kw=&maxp=99&et=f7908e&ib=1&bst=pem632"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36"
}
# fetch the URL and parse the JSON response
r = requests.get(url=url, headers=headers)
jsn = r.json()
# print the parsed JSON
# print(jsn)
rows = []  # renamed from "list" to avoid shadowing the built-in
# all region_name values
region = jsonpath.jsonpath(jsn, "$..region_name")
# all block_name values
pblock = jsonpath.jsonpath(jsn, "$..block_name")
# all rhval values
rhval1 = jsonpath.jsonpath(jsn, "$..rhval")
# all area values
area1 = jsonpath.jsonpath(jsn, "$..area")
# all price values
price1 = jsonpath.jsonpath(jsn, "$..price")
# all long_title values
title = jsonpath.jsonpath(jsn, "$..long_title")
# all comm_name values
comm = jsonpath.jsonpath(jsn, "$..comm_name")
# all house_orient_name values
house = jsonpath.jsonpath(jsn, "$..house_orient_name")
# all floor_tag values
floor = jsonpath.jsonpath(jsn, "$..floor_tag")

for i in range(len(region)):
    rows.append([region[i], pblock[i], rhval1[i], area1[i], price1[i], title[i], comm[i], house[i], floor[i]])
print(rows)

This prints the listings on the current page (page 1).


After all this effort, one page is not enough! So examine how the URL changes from page to page.


So define a page parameter p, starting at p=1, and loop through every page.
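One detail worth knowing: requests appends `params` to whatever query string the URL already carries. A prepared request shows the merge without any network call (the example.com URL is purely illustrative):

```python
import requests

# PreparedRequest builds the final URL without sending anything
req = requests.Request(
    "GET", "https://example.com/list/?a=1", params={"p": 2}
).prepare()
print(req.url)  # -> https://example.com/list/?a=1&p=2
```

This is why the base URL should not itself contain a `p=1` parameter: otherwise the server receives two `p` values in one request.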

import requests
import jsonpath

from pymongo import MongoClient

# MongoDB connection (local instance on the default port)
client = MongoClient('127.0.0.1', 27017)
collection = client['stu']['real estate']

# base URL with p=1 removed: the page number is passed via params instead,
# otherwise requests would append a second p value to the query string
url = "https://shuyang.anjuke.com/v3/ajax/map/sale/1075/prop_list/?room_num=-1&price_id=-1&area_id=-1&floor=-1&orientation=-1&is_two_years=0&is_school=0&is_metro=0&order_id=0&zoom=12&lat=34.060052_34.174817&lng=118.368882_119.251952&kw=&maxp=99&et=b4e7df&ib=1&bst=pem576"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36"
}

for p in range(1, 6):  # pages 1 to 5
    data = {"p": p}
    print(data)
    r = requests.get(url=url, headers=headers, params=data)
    jsn = r.json()
    rows = []
    region = jsonpath.jsonpath(jsn, "$..region_name")
    pblock = jsonpath.jsonpath(jsn, "$..block_name")
    rhval1 = jsonpath.jsonpath(jsn, "$..rhval")
    area1 = jsonpath.jsonpath(jsn, "$..area")
    price1 = jsonpath.jsonpath(jsn, "$..price")
    title = jsonpath.jsonpath(jsn, "$..long_title")
    comm = jsonpath.jsonpath(jsn, "$..comm_name")
    house = jsonpath.jsonpath(jsn, "$..house_orient_name")
    floor = jsonpath.jsonpath(jsn, "$..floor_tag")

    for i in range(len(region)):
        rows.append([region[i], pblock[i], rhval1[i], area1[i], price1[i], title[i], comm[i], house[i], floor[i]])
        collection.insert_one({'region': region[i], 'pblock': pblock[i], 'rhval1': rhval1[i], 'area1': area1[i], 'price1': price1[i], 'title': title[i], 'comm': comm[i], 'house': house[i], 'floor': floor[i]})

    print(rows)

Listings from page one


Listings from page two


Pages three, four, and five follow in the same way (the loop runs five times).

With the data coming out, it's time to store it. There are two options:
1. CSV file
2. MongoDB database

CSV storage, full code
Define a header row holding the column names

import requests
import jsonpath
import csv

# base URL with p=1 removed: the page number is passed via params instead
url = "https://shuyang.anjuke.com/v3/ajax/map/sale/1075/prop_list/?room_num=-1&price_id=-1&area_id=-1&floor=-1&orientation=-1&is_two_years=0&is_school=0&is_metro=0&order_id=0&zoom=12&lat=34.060052_34.174817&lng=118.368882_119.251952&kw=&maxp=99&et=b4e7df&ib=1&bst=pem576"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36"
}

# header row with the CSV column names
header = ['region', 'pblock', 'rhval1', 'area1', 'price1', 'title', 'comm', 'house', 'floor']

for p in range(1, 6):  # pages 1 to 5
    data = {"p": p}
    print(data)
    r = requests.get(url=url, headers=headers, params=data)
    jsn = r.json()
    rows = []
    region = jsonpath.jsonpath(jsn, "$..region_name")
    pblock = jsonpath.jsonpath(jsn, "$..block_name")
    rhval1 = jsonpath.jsonpath(jsn, "$..rhval")
    area1 = jsonpath.jsonpath(jsn, "$..area")
    price1 = jsonpath.jsonpath(jsn, "$..price")
    title = jsonpath.jsonpath(jsn, "$..long_title")
    comm = jsonpath.jsonpath(jsn, "$..comm_name")
    house = jsonpath.jsonpath(jsn, "$..house_orient_name")
    floor = jsonpath.jsonpath(jsn, "$..floor_tag")

    for i in range(len(region)):
        rows.append([region[i], pblock[i], rhval1[i], area1[i], price1[i], title[i], comm[i], house[i], floor[i]])
    print(rows)
    # write one CSV per page; newline='' stops the csv module from inserting
    # blank rows, and utf-8-sig keeps Excel/WPS from showing mojibake
    with open("anjuke%s.csv" % p, 'w', encoding="utf-8-sig", newline='') as f:
        f_csv = csv.writer(f)
        f_csv.writerow(header)
        f_csv.writerows(rows)


MongoDB storage, full code

import requests
import jsonpath

from pymongo import MongoClient

# MongoDB connection (local instance on the default port)
client = MongoClient('127.0.0.1', 27017)
collection = client['stu']['real estate']

# base URL with p=1 removed: the page number is passed via params instead
url = "https://shuyang.anjuke.com/v3/ajax/map/sale/1075/prop_list/?room_num=-1&price_id=-1&area_id=-1&floor=-1&orientation=-1&is_two_years=0&is_school=0&is_metro=0&order_id=0&zoom=12&lat=34.060052_34.174817&lng=118.368882_119.251952&kw=&maxp=99&et=b4e7df&ib=1&bst=pem576"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36"
}

for p in range(1, 6):  # pages 1 to 5
    data = {"p": p}
    print(data)
    r = requests.get(url=url, headers=headers, params=data)
    jsn = r.json()
    rows = []
    region = jsonpath.jsonpath(jsn, "$..region_name")
    pblock = jsonpath.jsonpath(jsn, "$..block_name")
    rhval1 = jsonpath.jsonpath(jsn, "$..rhval")
    area1 = jsonpath.jsonpath(jsn, "$..area")
    price1 = jsonpath.jsonpath(jsn, "$..price")
    title = jsonpath.jsonpath(jsn, "$..long_title")
    comm = jsonpath.jsonpath(jsn, "$..comm_name")
    house = jsonpath.jsonpath(jsn, "$..house_orient_name")
    floor = jsonpath.jsonpath(jsn, "$..floor_tag")

    for i in range(len(region)):
        rows.append([region[i], pblock[i], rhval1[i], area1[i], price1[i], title[i], comm[i], house[i], floor[i]])
        # one document per listing
        collection.insert_one({'region': region[i], 'pblock': pblock[i], 'rhval1': rhval1[i], 'area1': area1[i], 'price1': price1[i], 'title': title[i], 'comm': comm[i], 'house': house[i], 'floor': floor[i]})
