12 Python Web Scraping Projects (Very Detailed)

Hey everyone 👋, today I'm sharing 12 super cool Python web scraping projects with you. Whether you want to sharpen your programming skills or gather useful data, they're a great fit!

1. Scraping Douban Movie Top250

This project scrapes movie details from the Douban Top250 list, such as titles, ratings, and synopses.

import requests
from bs4 import BeautifulSoup

url = 'https://movie.douban.com/top250'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
items = soup.find_all('div', class_='item')  # each movie entry lives in a div.item
for item in items:
    title = item.find('span', class_='title').text
    rating = item.find('span', class_='rating_num').text
    print(f'Title: {title}, Rating: {rating}')
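
The snippet above only fetches the first page. The Top250 spans 10 pages, paginated via the start query parameter (0, 25, ..., 225). A minimal pagination sketch, reusing the headers defined above:

import time

for start in range(0, 250, 25):
    page_url = f'https://movie.douban.com/top250?start={start}'
    response = requests.get(page_url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    for item in soup.find_all('div', class_='item'):
        print(item.find('span', class_='title').text)
    time.sleep(1)  # pause between pages so we don't hammer the server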

2. Scraping Weather Forecasts

This project scrapes the weather forecast for a specified city.

import requests
from bs4 import BeautifulSoup

city = '北京'  # city name goes into the URL path
url = f'https://tianqi.so.com/weather/{city}'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
weather = soup.find('p', class_='nowtemp').text  # current-temperature element
print(f'Current weather in {city}: {weather}')
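
Weather pages change their markup often, and a failed request would crash the snippet above. A hardened variant, a sketch that reuses the url, headers, and city defined above and assumes the same nowtemp class:

try:
    response = requests.get(url, headers=headers, timeout=10)  # fail fast on a dead connection
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
except requests.RequestException as e:
    print(f'Request failed: {e}')
else:
    soup = BeautifulSoup(response.text, 'html.parser')
    tag = soup.find('p', class_='nowtemp')
    if tag:
        print(f'Current weather in {city}: {tag.text.strip()}')
    else:
        print('Weather element not found; the page structure may have changed')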

3. Scraping the Zhihu Hot List

This project scrapes the questions and links on Zhihu's hot list.

import requests
from bs4 import BeautifulSoup

url = 'https://www.zhihu.com/hot'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
    'Cookie': 'your_cookie_here'  # the hot list requires a logged-in cookie
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
items = soup.find_all('div', class_='HotItem-content')
for item in items:
    title = item.find('a').text
    link = 'https://www.zhihu.com' + item.find('a')['href']
    print(f'Question: {title}, Link: {link}')
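
If you plan to make several requests, a requests.Session keeps the cookie and User-Agent attached automatically instead of repeating the headers dict every time:

import requests

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
    'Cookie': 'your_cookie_here',  # paste a logged-in cookie here
})
response = session.get('https://www.zhihu.com/hot')  # headers are applied automatically
print(response.status_code)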

4. Scraping Weibo Trending Searches

This project scrapes Weibo's trending topics and their heat scores.

import requests
from bs4 import BeautifulSoup

url = 'https://s.weibo.com/top/summary'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
    'Cookie': 'your_cookie_here'  # the trending page requires a logged-in cookie
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
items = soup.find_all('tr')[1:]  # skip the table header row
for item in items:
    title = item.find('td', class_='td-02').find('a').text
    hot = item.find('td', class_='td-02').find('span').text
    print(f'Topic: {title}, Heat: {hot}')
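
Not every table row carries a real topic (the pinned entry, for instance, often has no heat value), so the chained .find(...).text calls above can raise AttributeError. A defensive variant of the loop, reusing soup from above:

for item in soup.find_all('tr')[1:]:
    cell = item.find('td', class_='td-02')
    if cell is None or cell.find('a') is None:
        continue  # skip rows that are not real topics
    title = cell.find('a').text
    span = cell.find('span')
    hot = span.text if span else 'N/A'  # pinned topics may have no heat score
    print(f'Topic: {title}, Heat: {hot}')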

5. Scraping Novel Content

This project scrapes chapter content from a novel site. (The example.com URLs from here on are placeholders; adapt the URL and selectors to your target site.)

import requests
from bs4 import BeautifulSoup

url = 'https://www.example.com/novel/chapter1'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
content = soup.find('div', class_='novel-content').text  # chapter body
print(content)
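
A single chapter is rarely the goal. Here is a sketch that follows "next chapter" links until there are none left; the next-chapter class name is hypothetical and must be adapted to the real site. It reuses headers from above:

import time
from urllib.parse import urljoin

url = 'https://www.example.com/novel/chapter1'
while url:
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    print(soup.find('div', class_='novel-content').text)
    next_link = soup.find('a', class_='next-chapter')  # hypothetical selector
    url = urljoin(url, next_link['href']) if next_link else None  # stop at the last chapter
    time.sleep(1)  # be polite between chapters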

6. Scraping Images

This project scrapes images from an image site and saves them locally.

import requests
import os

url = 'https://www.example.com/image-page'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}
response = requests.get(url, headers=headers)

# Parse out image links (naive string splitting; BeautifulSoup would be more robust)
image_urls = []
for img in response.text.split('<img src="')[1:]:
    image_url = img.split('"')[0]
    if image_url.startswith('http'):
        image_urls.append(image_url)

if not os.path.exists('images'):
    os.makedirs('images')

for i, image_url in enumerate(image_urls):
    try:
        image_response = requests.get(image_url)
        with open(f'images/image_{i}.jpg', 'wb') as f:
            f.write(image_response.content)
    except Exception as e:
        print(f'Failed to download image {image_url}: {e}')
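
Reading response.content buffers each whole image in memory, and the loop fires requests as fast as it can. A gentler download loop using streaming and a short delay, reusing image_urls and headers from above:

import time

for i, image_url in enumerate(image_urls):
    ext = os.path.splitext(image_url)[1] or '.jpg'  # keep the original extension when there is one
    try:
        with requests.get(image_url, headers=headers, stream=True, timeout=10) as r:
            r.raise_for_status()
            with open(f'images/image_{i}{ext}', 'wb') as f:
                for chunk in r.iter_content(chunk_size=8192):  # write in chunks instead of buffering the whole file
                    f.write(chunk)
    except requests.RequestException as e:
        print(f'Failed to download image {image_url}: {e}')
    time.sleep(0.5)  # throttle so the image host isn't flooded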

7. Scraping Job Listings

This project scrapes job listings from a recruitment site, including the position, company, and salary.

import requests
from bs4 import BeautifulSoup

url = 'https://www.example.com/jobs'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
items = soup.find_all('div', class_='job-item')
for item in items:
    title = item.find('h3', class_='job-title').text
    company = item.find('span', class_='company-name').text
    salary = item.find('span', class_='salary').text
    print(f'Position: {title}, Company: {company}, Salary: {salary}')
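
Printed results vanish with the terminal; for job data you usually want a file. A sketch that writes the parsed fields to CSV with the standard library, reusing items from above:

import csv

with open('jobs.csv', 'w', newline='', encoding='utf-8-sig') as f:  # utf-8-sig so Excel displays Chinese text correctly
    writer = csv.writer(f)
    writer.writerow(['title', 'company', 'salary'])
    for item in items:
        writer.writerow([
            item.find('h3', class_='job-title').text.strip(),
            item.find('span', class_='company-name').text.strip(),
            item.find('span', class_='salary').text.strip(),
        ])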

8. Scraping Stock Information

This project scrapes stock quotes from a finance site.

import requests
from bs4 import BeautifulSoup

url = 'https://www.example.com/stocks'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
items = soup.find_all('tr')[1:]  # skip the table header row
for item in items:
    cells = item.find_all('td')
    stock_name = cells[0].text
    price = cells[1].text
    change = cells[2].text
    print(f'Stock: {stock_name}, Price: {price}, Change: {change}')
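
When quotes are delivered as a plain HTML table, pandas can parse the whole thing in one call, which is often sturdier than indexing cells by hand (requires pandas plus lxml or html5lib). A sketch reusing response from above:

import pandas as pd

tables = pd.read_html(response.text)  # returns one DataFrame per <table> found in the page
if tables:
    df = tables[0]
    print(df.head())
    df.to_csv('stocks.csv', index=False)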

9. Scraping Paper Information

This project scrapes paper information from an academic site, such as the title, authors, and abstract.

import requests
from bs4 import BeautifulSoup

url = 'https://www.example.com/papers'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
items = soup.find_all('div', class_='paper-item')
for item in items:
    title = item.find('h4', class_='paper-title').text
    author = item.find('span', class_='author-name').text
    abstract = item.find('p', class_='abstract').text
    print(f'Title: {title}, Author: {author}, Abstract: {abstract}')
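
Academic listings are usually paginated. A sketch that walks pages until no results come back, reusing headers from above; the page query parameter is an assumption, not the API of any particular site:

all_titles = []
for page in range(1, 6):  # first five pages
    page_url = f'https://www.example.com/papers?page={page}'
    response = requests.get(page_url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    page_items = soup.find_all('div', class_='paper-item')
    if not page_items:
        break  # ran out of results
    for item in page_items:
        all_titles.append(item.find('h4', class_='paper-title').text.strip())
print(f'Collected {len(all_titles)} paper titles')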

10. Scraping Product Information

This project scrapes product information from an e-commerce site, such as the name, price, and sales volume.

import requests
from bs4 import BeautifulSoup

url = 'https://www.example.com/products'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
items = soup.find_all('div', class_='product-item')
for item in items:
    name = item.find('h5', class_='product-name').text
    price = item.find('span', class_='product-price').text
    sales = item.find('span', class_='product-sales').text
    print(f'Product: {name}, Price: {price}, Sales: {sales}')
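
E-commerce sites in particular spell out what may be crawled, so it is worth checking robots.txt before scraping. The standard library can do this check; shown here against the placeholder domain:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://www.example.com/robots.txt')
rp.read()  # fetch and parse the robots.txt file
if rp.can_fetch('*', 'https://www.example.com/products'):
    print('Crawling is allowed for this path')
else:
    print('robots.txt disallows crawling this path')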

11. Scraping Video Information

This project scrapes video information from a video site, such as the title, view count, and uploader.

import requests
from bs4 import BeautifulSoup

url = 'https://www.example.com/videos'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
items = soup.find_all('div', class_='video-item')
for item in items:
    title = item.find('h6', class_='video-title').text
    views = item.find('span', class_='video-views').text
    author = item.find('span', class_='video-author').text
    print(f'Title: {title}, Views: {views}, Uploader: {author}')
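
Chinese video sites usually format view counts as strings like '10.5万' or '1.2亿'. If you want to sort or aggregate, convert them to integers first; a small helper covering those two suffixes:

def parse_views(text):
    # Convert counts such as '10.5万' (x10,000) or '1.2亿' (x100,000,000) to integers
    text = text.strip()
    if text.endswith('万'):
        return int(float(text[:-1]) * 10_000)
    if text.endswith('亿'):
        return int(float(text[:-1]) * 100_000_000)
    return int(float(text))

print(parse_views('10.5万'))  # 105000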

12. Scraping News Articles

This project scrapes news articles from a news site, including the headline, publish time, and body.

import requests
from bs4 import BeautifulSoup

url = 'https://www.example.com/news'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
items = soup.find_all('div', class_='news-item')
for item in items:
    title = item.find('h2', class_='news-title').text
    pub_time = item.find('span', class_='news-time').text  # named pub_time to avoid shadowing the time module
    content = item.find('p', class_='news-content').text
    print(f'Headline: {title}, Published: {pub_time}, Content: {content}')
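
JSON preserves structure better than print statements. A sketch that collects the parsed articles and writes them to a file, reusing items from above:

import json

news_list = []
for item in items:
    news_list.append({
        'title': item.find('h2', class_='news-title').text.strip(),
        'time': item.find('span', class_='news-time').text.strip(),
        'content': item.find('p', class_='news-content').text.strip(),
    })

with open('news.json', 'w', encoding='utf-8') as f:
    json.dump(news_list, f, ensure_ascii=False, indent=2)  # ensure_ascii=False keeps Chinese characters readable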

Folks, the world of Python web scraping is rich and varied, so go get your hands dirty with these projects and build up your programming skills through practice 💪! If you hit any problems along the way, or have other fun scraping ideas, feel free to leave a comment below 👇

If this article helped you, please take two seconds to like and share it so more people can see it and steer clear of the usual pitfalls.
