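Project practice: scraping Shanghai second-hand housing data with BeautifulSoup
The goal is to collect each listing's name, price, layout, floor area, floor, year built, contact person, address, and tags.
I. Site analysis
1: Request headers
URL: https://shanghai.anjuke.com/sale/p1/#filtersort (page 1)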
User-Agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36 Edg/87.0.664.66
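Only the page number changes from one listing page to the next, so building the request can be sketched as below. The names `BASE_URL`, `HEADERS` and `build_page_url` are illustrative and do not come from the original script. The `#filtersort` fragment is left out here because URL fragments are not sent to the server.

```python
# Template for the listing pages; only the page number varies.
BASE_URL = 'https://shanghai.anjuke.com/sale/p{page}/'

# The browser User-Agent quoted above, split across lines for readability.
HEADERS = {
    'User-Agent': (
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
        'AppleWebKit/537.36 (KHTML, like Gecko) '
        'Chrome/87.0.4280.88 Safari/537.36 Edg/87.0.664.66'
    )
}


def build_page_url(page):
    """Return the listing URL for a 1-based page number."""
    if page < 1:
        raise ValueError('page numbers start at 1')
    return BASE_URL.format(page=page)


# URLs for pages 1 to 5.
urls = [build_page_url(p) for p in range(1, 6)]
print(urls)
```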
2: Locating each element
Name: div.house-title
Price: div.pro-price
Price per square meter: span.unit-price
Address: span.comm-address
Layout: div.details-item
Floor area: div.details-item
Floor: div.details-item
Year built: div.details-item
Tags: span.item-tags.tag-others
Contact person: span.broker-name.broker-text
How to find these selectors was explained in detail in earlier posts.
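Selectors like these can be tried against a small HTML fragment before requesting the real site. The fragment below is an invented stand-in for one listing, not markup copied from Anjuke, and it uses Python's built-in `html.parser` so that lxml is not needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one listing; the real page may differ.
SAMPLE = '''
<li class="list-item">
  <div class="house-title"><a href="#">Sample listing</a></div>
  <div class="pro-price">
    <span class="unit-price">50000 yuan/m2</span>
  </div>
  <span class="comm-address">Sample Estate, Pudong</span>
  <span class="item-tags tag-others">Near metro</span>
  <span class="item-tags tag-others">South-facing</span>
  <span class="broker-name broker-text">Mr. Zhang</span>
</li>
'''

soup = BeautifulSoup(SAMPLE, 'html.parser')
item = soup.select_one('li.list-item')

name = item.select_one('div.house-title a').get_text(strip=True)
unit_price = item.select_one('span.unit-price').get_text(strip=True)
# An element carrying two classes is written with two dots in a CSS selector.
tags = [t.get_text(strip=True) for t in item.select('span.item-tags.tag-others')]
broker = item.select_one('span.broker-name.broker-text').get_text(strip=True)

print(name, unit_price, tags, broker)
```

`select_one` and `select` take CSS selectors directly, which is why the two-class entries are written with a dot between the class names rather than a space.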
II. Writing the code
import requests
from bs4 import BeautifulSoup

link_1 = 'https://shanghai.anjuke.com/sale/p'  # Anjuke second-hand listings
link_2 = '/#filtersort'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
                         ' AppleWebKit/537.36 (KHTML, like Gecko)'
                         ' Chrome/87.0.4280.88 Safari/537.36 Edg/87.0.664.66'}
f = 1  # running count of listings

for i in range(1, 6):  # pages 1 to 5
    link = link_1 + str(i) + link_2
    # Send the request
    r = requests.get(url=link, headers=headers)
    print(r.status_code)
    # Parse the page (the 'lxml' parser requires the lxml package)
    soup = BeautifulSoup(r.text, 'lxml')
    # Extract the content: one <li class="list-item"> per listing
    house_item_list = soup.find_all('li', class_='list-item')
    for eachitem in house_item_list:
        print('Listing', f)
        # Name: text of the link inside the title block
        house_name_list = eachitem.find('div', class_='house-title')
        name = house_name_list.a.text.strip()
        print(name)
        # Total price: first span inside the price block
        house_price_list = eachitem.find('div', class_='pro-price')
        price = house_price_list.span.text.strip()
        print('Price:', price)
        # Price per square meter
        house_areaPrice_list = eachitem.find('span', class_='unit-price')
        areaPrice = house_areaPrice_list.text.strip()
        print(areaPrice)
        # Address
        house_address_list = eachitem.find('span', class_='comm-address')
        address = house_address_list.text.strip()
        print('Address:', address)
        # Layout, floor area, floor and year built share one details block;
        # find() returns the first matching div.details-item
        house_detail_list = eachitem.find('div', class_='details-item')
        detail = house_detail_list.span.text.strip()
        print('Layout:', detail)
        # The odd indices skip the nodes that sit between the fields
        area = house_detail_list.contents[3].text
        print('Floor area:', area)
        floor = house_detail_list.contents[5].text
        print('Floor:', floor)
        year = house_detail_list.contents[7].text
        print(year)
        # Tags: a class_ string with a space matches that exact class attribute
        house_tag_list = eachitem.find_all('span', class_='item-tags tag-others')
        tag = [t.text for t in house_tag_list]
        print(tag)
        # Contact person
        house_broker_list = eachitem.find('span', class_='broker-name broker-text')
        broker_name = house_broker_list.text.strip()
        print('Contact person:', broker_name)
        print('\n')
        f += 1
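The indices 3, 5 and 7 used with `contents` work only because `.contents` counts every child node, including whitespace text and separator elements, not just the `span` tags. The sketch below shows this on an invented fragment (the `em` separators and the sample values are assumptions about the markup, not copied from the site) and compares it with `find_all('span')`, which skips the nodes in between.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the real details-item markup may differ.
SAMPLE = (
    '<div class="details-item"> '
    '<span>2 rooms</span><em>|</em>'
    '<span>89 m2</span><em>|</em>'
    '<span>High floor</span><em>|</em>'
    '<span>Built 2008</span>'
    '</div>'
)

details = BeautifulSoup(SAMPLE, 'html.parser').find('div', class_='details-item')

# .contents lists every child: the leading text node and <em> separators included.
for index, node in enumerate(details.contents):
    print(index, repr(str(node)))

# Picking the fields by position, as the script does.
by_index = [details.contents[i].text for i in (1, 3, 5, 7)]

# find_all('span') returns only the <span> elements, in document order.
by_tag = [span.get_text(strip=True) for span in details.find_all('span')]

print(by_index == by_tag)
```

Selecting by tag is less sensitive to whitespace changes in the page source, at the cost of assuming the fields always appear in the same order.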
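Code structure: build the URL, send the request, parse the page, extract the data (select the list of listings, then select each field within a listing, looping over all of them).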
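These stages map naturally onto small functions. The following is one possible arrangement, not the original script: `fetch_page`, `parse_listings` and `text_or_none` are illustrative names, the sample HTML is invented, and only two fields are extracted to keep it short. The helper returns `None` when an element is missing, a case the script above does not guard against.

```python
import requests
from bs4 import BeautifulSoup

HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}


def fetch_page(page):
    """Build the URL for one page, send the request, return the HTML."""
    url = f'https://shanghai.anjuke.com/sale/p{page}/'
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.text


def text_or_none(tag):
    """Return the stripped text of a tag, or None when find() matched nothing."""
    return tag.get_text(strip=True) if tag is not None else None


def parse_listings(html):
    """Parse the page, then loop over each listing and pull out its fields."""
    soup = BeautifulSoup(html, 'html.parser')
    listings = []
    for item in soup.find_all('li', class_='list-item'):
        listings.append({
            'name': text_or_none(item.find('div', class_='house-title')),
            'address': text_or_none(item.find('span', class_='comm-address')),
        })
    return listings


# Offline check on invented markup; the second listing has no address.
SAMPLE = '''
<ul>
  <li class="list-item"><div class="house-title"><a>Flat A</a></div>
      <span class="comm-address">Road 1</span></li>
  <li class="list-item"><div class="house-title"><a>Flat B</a></div></li>
</ul>
'''
listings = parse_listings(SAMPLE)
print(listings)
```

For a live run, `parse_listings(fetch_page(1))` would chain the stages for the first page.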
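Output: for each page the script prints the HTTP status code, followed by the fields of every listing on that page in turn.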