BeautifulSoup basic syntax

License: CC 4.0 BY-SA

Original link: http://www.cnblogs.com/kaibindirver/p/9927297.html

Tags: #python #json

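This article uses two examples to show how to scrape web page data with Python's requests and BeautifulSoup libraries. The first scrapes course information from itest.info; the second fetches the titles and links of hot topics from v2ex.com. Both show how to locate HTML tags and extract the text and attributes you need.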

Example 1:

import requests
from bs4 import BeautifulSoup

url = 'http://www.itest.info/courses'  # URL of the page to scrape
# Fetch the page's HTML source with the requests library, then build a BeautifulSoup object using html.parser; this is the standard pattern
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
for course in soup.find_all('h4'):  # iterate over every h4 tag on the page
    print(course.text)  # print only the text of the h4 tag, e.g. 测试开发--试验班
    print(course)  # print the text together with its tag, e.g. <h4>测试开发--试验班</h4>
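
The difference between the two print calls is easier to see without a network request. Below is a minimal sketch, assuming a hand-written HTML string in place of the downloaded page; the `html` variable and the second course name are made up for illustration and are not taken from itest.info.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page
html = "<div><h4>测试开发--试验班</h4><h4>Course B</h4></div>"
soup = BeautifulSoup(html, "html.parser")

titles = []  # plain text of each h4 tag
markup = []  # each h4 tag serialized back to HTML
for course in soup.find_all("h4"):
    titles.append(course.text)
    markup.append(str(course))

print(titles)
print(markup)
```

Here `course.text` returns only the text inside the tag, while `str(course)` (which is what printing the tag shows) includes the surrounding `<h4>` markup.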

Example 2:

import requests
from bs4 import BeautifulSoup
url = 'https://www.v2ex.com/'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
# Find span tags whose CSS class is item_hot_topic_title. The argument is class_, not class: class is a Python keyword, so the trailing underscore avoids the conflict
for span in soup.find_all('span', class_='item_hot_topic_title'):
    # Text of the a tag inside the span. If the span holds many a tags, filter further with e.g. for i in span.find_all('a', href='/t/415664')
    print(span.find('a').text)
    print(span.find('a')['href'])  # the href attribute; in bs4, tag[attribute_name] returns an element's attribute
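
The same lookups can be tried offline. Below is a sketch, assuming hand-written markup that only imitates the structure described above: the topic titles and the second href are invented, and only `/t/415664` comes from the comment in the example. It also uses `Tag.get`, which is not part of the original example, to read an attribute that may be absent.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the hot-topic list
html = """
<span class="item_hot_topic_title"><a href="/t/415664">First topic</a></span>
<span class="item_hot_topic_title"><a href="/t/100001">Second topic</a></span>
<span class="other"><a href="/ignored">Not a hot topic</a></span>
"""
soup = BeautifulSoup(html, "html.parser")

topics = []
for span in soup.find_all("span", class_="item_hot_topic_title"):
    link = span.find("a")  # first a tag inside the span, or None
    if link is None:
        continue
    topics.append((link.text, link["href"]))

# Filter a tags by attribute value, as the comment in the example suggests
matches = soup.find_all("a", href="/t/415664")

# tag["name"] raises KeyError for a missing attribute; tag.get("name") returns None
missing = soup.find("a").get("title")

print(topics)
```

Checking for `None` before indexing keeps the loop from failing on a span that contains no link.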

Reposted from: https://www.cnblogs.com/kaibindirver/p/9927297.html