Python Data Scraping

Latest recommended article published on 2025-04-20 10:44:53

穿月女

Category: python. Tags: python, web scraping, programming language

Article link: https://blog.youkuaiyun.com/yingyingyueyue/article/details/128247444

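The 4 steps of a web scraper

Step 0: Fetch the data. The scraper sends a request to the server at the URL we provide, and the server returns a response.

Step 1: Parse the data. The scraper parses the data returned by the server into a format we can read.

Step 2: Extract the data. The scraper then picks out the data we need from the parsed result.

Step 3: Store the data. The scraper saves the useful data so it can be used and analyzed later.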
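Step 3 can be sketched with the standard library's csv module. This is only a minimal sketch: the helper name save_rows, the column names, and the sample rows are hypothetical placeholders, not data taken from a real page.

```python
import csv
import io


def save_rows(rows, fileobj):
    """Write (title, url, ingredients) rows as CSV to an open text file object."""
    writer = csv.writer(fileobj)
    writer.writerow(["title", "url", "ingredients"])  # header row
    writer.writerows(rows)


# Hypothetical placeholder rows standing in for the extracted data.
sample_rows = [
    ("Dish A", "/recipe/1/", "egg, tomato"),
    ("Dish B", "/recipe/2/", "tofu"),
]

# An in-memory buffer keeps the sketch free of side effects.
buffer = io.StringIO()
save_rows(sample_rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead, pass a file opened with open("recipes.csv", "w", newline="", encoding="utf-8"); the file name is again just an example.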

Problems

1. Scraping returns 404

<html>
<head><title>404 Not Found</title></head>
<body>
<center><h1>404 Not Found</h1></center>
<hr><center>nginx/1.16.1</center>
</body>
</html>

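Cause: the server detects that the request comes from a script rather than a browser (typically from the request's User-Agent header), so it returns 404.

Fix: make the request look like it comes from a browser by sending a browser User-Agent. You can copy your own browser's User-Agent string from chrome://version/.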

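import requests

# Browser User-Agent string, copied from chrome://version/
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"}
res = requests.get(url, headers=header)  # url is the page being scraped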
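To see why the server can tell a script from a browser, it may help to compare the User-Agent that requests sends by default with the one set above. The sketch below builds the request without sending it, so it needs no network access; the variable names default_ua, prepared and sent_ua are only illustrative.

```python
import requests

url = 'http://www.xiachufang.com/explore/'
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"}

# User-Agent that requests uses when none is supplied ("python-requests/<version>")
default_ua = requests.utils.default_user_agent()

# Prepare the request without sending it, to inspect the outgoing headers
prepared = requests.Request("GET", url, headers=header).prepare()
sent_ua = prepared.headers["User-Agent"]

print(default_ua)
print(sent_ua)
```

After an actual requests.get call, res.status_code shows whether the server answered 200 or still returned 404.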

Example

import requests
from bs4 import BeautifulSoup

url = 'http://www.xiachufang.com/explore/'
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"}
# Step 0: fetch the data
res = requests.get(url, headers=header)
# Step 1: parse the data
soup = BeautifulSoup(res.text, 'html.parser')
# Step 2: extract the data
items = soup.find_all(class_="info pure-u")
for item in items:
    name = item.find(class_="name")
    title = name.text.strip()
    # Use a separate variable so the page url above is not overwritten
    link = name.find('a')['href']
    shicai = item.find(class_="ing ellipsis").text  # shicai: ingredients
    print(title, link, shicai)
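The href values printed by the example may be site-relative paths rather than full URLs. Assuming that is the case, a minimal sketch of turning one into an absolute link with the standard library's urljoin follows; the sample path /recipe/100000/ is a hypothetical placeholder, not a real recipe.

```python
from urllib.parse import urljoin

page_url = 'http://www.xiachufang.com/explore/'
relative_href = '/recipe/100000/'  # hypothetical href, for illustration only

# Resolve the relative path against the page it was found on
absolute_url = urljoin(page_url, relative_href)
print(absolute_url)  # http://www.xiachufang.com/recipe/100000/
```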