Below is a technical implementation outline for crawling car data from Dongchedi (懂车帝) with Python and building a data analysis platform on the SSM stack (Spring + Spring MVC + MyBatis). The code examples cover the crawler itself and the approach to platform integration:
1. Core logic of the Python crawler (example)
1.1 Scraping static pages with Requests + BeautifulSoup
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

def get_car_list(page=1):
    # Note: the URL and CSS selectors below are illustrative and must be
    # adapted to the actual page structure of dongchedi.com.
    url = f"https://www.dongchedi.com/car_list?page={page}"
    response = requests.get(url, headers=headers, timeout=10)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        car_items = soup.select('.car-list-item')
        data_list = []
        for item in car_items:
            car_data = {
                "brand": item.select_one('.brand-name').get_text(strip=True),
                "model": item.select_one('.model-name').get_text(strip=True),
                "price": item.select_one('.price').get_text(strip=True),
                "engine": item.select_one('.engine-info').get_text(strip=True),
                "sales": item.select_one('.sales-num').get_text(strip=True),
            }
            data_list.append(car_data)
        return data_list
    else:
        print(f"Request failed: {response.status_code}")
        return []
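A quick way to sanity-check the selectors is to call the function directly and print a few rows (a sketch; the actual fields returned depend on the real page structure):

# Quick check against the first listing page.
cars = get_car_list(page=1)
for car in cars[:5]:
    print(car["brand"], car["model"], car["price"])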
1.2 Handling dynamically loaded data (Selenium recommended)
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

def get_dynamic_content():
    options = Options()
    options.add_argument("--headless")  # headless mode
    driver = webdriver.Chrome(options=options)
    driver.get("https://www.dongchedi.com/sales_rank")
    driver.implicitly_wait(10)
    # Parse the dynamically rendered content
    # (find_elements_by_css_selector was removed in Selenium 4; use find_elements with By)
    sales_data = driver.find_elements(By.CSS_SELECTOR, '.sales-rank-item')
    for item in sales_data:
        pass  # data-parsing logic goes here
    driver.quit()
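If the ranking list is rendered asynchronously, an explicit wait on the target elements tends to be more robust than implicitly_wait. A minimal sketch that could replace the wait-and-lookup lines inside get_dynamic_content (same '.sales-rank-item' selector assumed):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 15 s for the ranking items to appear before reading them.
items = WebDriverWait(driver, 15).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.sales-rank-item'))
)
for item in items:
    print(item.text)  # raw visible text of each ranking row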
2. Data storage (MySQL example)
import pymysql

def save_to_mysql(data):
    # data: an iterable of (brand, model, price, engine, sales_month) tuples
    conn = pymysql.connect(
        host='localhost',
        user='root',
        password='123456',
        database='car_analysis',
        charset='utf8mb4'
    )
    try:
        with conn.cursor() as cursor:
            sql = '''INSERT INTO car_info
                     (brand, model, price, engine, sales_month)
                     VALUES (%s, %s, %s, %s, %s)'''
            cursor.executemany(sql, data)
        conn.commit()
    finally:
        conn.close()
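get_car_list returns a list of dicts while save_to_mysql expects row tuples in column order, so a small conversion step sits between them. A minimal sketch of that glue (crawl_and_store is a hypothetical helper; mapping the scraped "sales" field onto the sales_month column is an assumption):

def crawl_and_store(pages=5):
    rows = []
    for page in range(1, pages + 1):
        for car in get_car_list(page):
            # Column order must match the INSERT statement above;
            # "sales" is assumed to correspond to sales_month.
            rows.append((car["brand"], car["model"], car["price"],
                         car["engine"], car["sales"]))
    if rows:
        save_to_mysql(rows)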
3. SSM framework integration (key Java code)
3.1 MyBatis mapper interface
public interface CarMapper {
    @Select("SELECT * FROM car_info WHERE brand = #{brand}")
    List<Car> selectByBrand(String brand);

    @Select("SELECT brand, AVG(price) AS avg_price FROM car_info GROUP BY brand")
    List<Map<String, Object>> getBrandPriceAnalysis();
}
3.2 Spring MVC controller
@RestController
@RequestMapping("/api/car")
public class CarController {

    @Autowired
    private CarService carService;

    @GetMapping("/analysis/brand")
    public ResponseEntity<List<CarAnalysisVO>> getBrandAnalysis() {
        return ResponseEntity.ok(carService.getBrandAnalysis());
    }
}
3.3 Scheduled task that invokes the Python crawler
@Component
public class CrawlerScheduler {

    @Scheduled(cron = "0 0 2 * * ?")  // runs every day at 02:00
    public void executeCarCrawler() {
        try {
            String[] cmd = {"python3", "/path/to/crawler.py"};
            Process process = Runtime.getRuntime().exec(cmd);
            process.waitFor();
        } catch (IOException | InterruptedException e) {
            e.printStackTrace();
        }
    }
}
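One easy-to-miss detail: the script the scheduler invokes has to be directly runnable with python3. Assuming crawler.py bundles the functions from sections 1 and 2 together with the hypothetical crawl_and_store glue shown earlier, a standard entry-point guard is enough:

# At the bottom of crawler.py, so that "python3 crawler.py"
# (as run by the Spring scheduler above) actually triggers a crawl.
if __name__ == "__main__":
    crawl_and_store(pages=10)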
4. Data analysis examples
4.1 Price distribution analysis (Python)
import pandas as pd
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

# Connection settings match the pymysql example in section 2.
engine = create_engine("mysql+pymysql://root:123456@localhost/car_analysis?charset=utf8mb4")
df = pd.read_sql("SELECT * FROM car_info", con=engine)
# Strip the "万" (10,000 CNY) unit; range strings like "10.58-15.98万" need the
# cleaning step from section 5 first.
df['price'] = df['price'].str.replace('万', '', regex=False).astype(float)
plt.figure(figsize=(10, 6))
df['brand'].value_counts().plot(kind='bar')
plt.title('Number of models per brand')
plt.savefig('brand_dist.png')
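The bar chart above counts models per brand; for the price distribution itself, a histogram over the cleaned numeric price column is more direct. A sketch reusing the df loaded above:

# Histogram of prices (assumes df['price'] is already numeric, in 10,000 CNY).
plt.figure(figsize=(10, 6))
df['price'].plot(kind='hist', bins=30)
plt.xlabel('Price (10,000 CNY)')
plt.title('Price distribution of listed models')
plt.savefig('price_dist.png')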
4.2 Returning analysis results from Spring Boot
@GetMapping("/salesTrend")
public ResponseEntity<SalesTrendDTO> getSalesTrend() {
// 调用分析服务获取数据
return ResponseEntity.ok(analysisService.getSalesTrendData());
}
5. Notes and caveats
- Anti-crawling countermeasures (see the first sketch after this list):
  - Use a rotating proxy IP pool
  - Add a random delay between requests (time.sleep(random.uniform(1, 3)))
  - Rotate the User-Agent header periodically
- Data cleaning (see the price-parsing sketch after this list):
  - Parse price-range strings such as "10.58-15.98万"
  - Normalize units (e.g. 公里/小时 → km/h)
- Platform architecture overview:
  Crawler system → MySQL → SSM service → Vue front end
  (Redis provides caching for the SSM service; ECharts handles the front-end visualization)
- Legal compliance:
  - Respect the site's robots.txt rules
  - Keep the request rate moderate
  - Use the data for learning and research only
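A minimal request helper combining the three anti-crawling countermeasures above; the proxy addresses and User-Agent strings are placeholders to be replaced with real values:

import random
import time
import requests

# Placeholder pools; swap in a real proxy service and a larger UA list.
PROXIES = ["http://127.0.0.1:8001", "http://127.0.0.1:8002"]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Safari/605.1.15",
]

def polite_get(url):
    time.sleep(random.uniform(1, 3))  # random delay between requests
    proxy = random.choice(PROXIES)
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers,
                        proxies={"http": proxy, "https": proxy}, timeout=10)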
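For the price-range cleaning, one reasonable convention (an assumption, not the only option) is to take the midpoint of the range, in units of 10,000 CNY:

def parse_price(raw):
    # "10.58-15.98万" -> 13.28, "15.98万" -> 15.98; returns None if unparseable
    raw = raw.replace('万', '').strip()
    try:
        parts = [float(p) for p in raw.split('-')]
    except ValueError:
        return None
    return round(sum(parts) / len(parts), 2)

print(parse_price("10.58-15.98万"))  # 13.28
print(parse_price("15.98万"))        # 15.98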
If you need the complete project source code, let me know which modules you're interested in and I can provide a more detailed implementation.