Python Web Scraping Basics

Latest recommended article published on 2025-04-27 20:21:48

little_miya

Category: python. Tags: python, web scraping, data mining

Article link: https://blog.youkuaiyun.com/allenhsu6/article/details/122095540

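I. Overview

Two packages do most of the work:
requests: fetches data from the web, mostly in HTML format.
beautifulsoup: parses HTML into Python data structures.

This section covers the more important attributes and methods of beautifulsoup.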

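First, create the soup object:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")  # html is a string holding the page source

The resulting soup object has type bs4.BeautifulSoup.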

We usually want a rough look at the HTML first:

print(soup.prettify())

This prints the HTML with indentation that reflects how the tags are nested.

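Tag and NavigableString objects are also important.

A Tag corresponds to an HTML tag.

A NavigableString is the text content of a tag, available through <tag>.string. Note that .string returns None when the tag has more than one child.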
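To make the difference between the two object types concrete, here is a minimal sketch. The HTML fragment is made up for illustration and is not taken from any real page.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical HTML fragment, used only for illustration.
html = "<html><body><p class='intro'>Hello, <b>world</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

b_tag = soup.b                 # first <b> tag in the document
print(isinstance(b_tag, Tag))  # a Tag object
print(b_tag.name)              # the tag name
print(b_tag.string)            # the single text child of <b>
print(isinstance(b_tag.string, NavigableString))

p_tag = soup.p
print(p_tag["class"])          # attributes are read like dictionary keys
print(p_tag.string)            # None, because <p> has two children
print(p_tag.get_text())        # all nested text joined together
```

When a tag may contain nested tags, get_text() is usually the safer way to read its text.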

find_all(name, attrs, recursive, string, limit, **kwargs)

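When a tag name is passed as name, the function returns every matching tag together with everything nested inside it.
The first argument is the most important one. If you are familiar with HTML, you can find what you want quickly. For example, a denotes a link.

find() differs from find_all() only in that it returns the first match instead of all of them, or None when nothing matches.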
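The following sketch exercises several of the find_all() parameters listed above and contrasts them with find(). The HTML snippet and its class names are assumptions made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical menu markup, used only for illustration.
html = """
<ul id="menu">
  <li class="item"><a href="/home">Home</a></li>
  <li class="item active"><a href="/docs">Docs</a></li>
  <li class="item"><a href="/about">About</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

all_links = soup.find_all("a")                           # every <a> tag
first_two = soup.find_all("a", limit=2)                  # stop after two matches
active = soup.find_all("li", attrs={"class": "active"})  # filter on an attribute
by_text = soup.find_all("a", string="Docs")              # filter on the tag's text
first_link = soup.find("a")                              # first match only
missing = soup.find("table")                             # None when nothing matches

print(len(all_links), len(first_two), len(active))
print(first_link["href"], missing)
```

Because find() can return None, it is worth checking the result before accessing attributes on it.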

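II. Examples

1. Common examples

The usual scraping pattern:

import requests
from bs4 import BeautifulSoup
data = requests.get(url).text  # url is the address of the page to scrape
soup = BeautifulSoup(data, "html.parser")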
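In practice it can help to wrap this pattern in a small function that sets a timeout and fails loudly on HTTP errors. The sketch below is one possible way to do that. The function name fetch_soup and the 10-second default are assumptions, not part of either library.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and return it as a BeautifulSoup object.

    fetch_soup is a hypothetical helper name chosen for this sketch.
    """
    response = requests.get(url, timeout=timeout)  # avoid hanging forever
    response.raise_for_status()                    # raise on 4xx/5xx responses
    return BeautifulSoup(response.text, "html.parser")
```

A call such as fetch_soup("https://en.wikipedia.org/wiki/World_population") would then return the soup for that page, provided the request succeeds.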

For example, to get every link on the page:

for link in soup.find_all('a', href=True):  # in HTML an anchor/link is represented by the tag <a>
    print(link.get('href'))

Or to get every image:

for img in soup.find_all('img'):  # in HTML an image is represented by the tag <img>
    print(img)
    print(img.get('src'))
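The href and src values returned by loops like these are often relative paths. A minimal sketch of turning them into absolute URLs with the standard library's urljoin follows. The base URL and the HTML fragment are made-up assumptions.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page address and content, used only for illustration.
base_url = "https://example.com/gallery/"
html = """
<a href="page2.html">Next</a>
<a name="top">Anchor without href</a>
<a href="https://example.org/">External</a>
<img src="/static/cat.png" alt="cat">
<img alt="image without src">
"""
soup = BeautifulSoup(html, "html.parser")

# href=True / src=True keep only tags that actually carry the attribute.
links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
images = [urljoin(base_url, img["src"]) for img in soup.find_all("img", src=True)]

print(links)
print(images)
```

Filtering with src=True mirrors the href=True filter used for links and avoids printing None for images that have no src attribute.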

Another example is parsing the data in a table.

We use this URL as the example:

url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/HTMLColorCodes.html"

# Step 1: find the table
data = requests.get(url).text
soup = BeautifulSoup(data, "html.parser")

table = soup.find('table')

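Step 2: parse the rows

for row in table.find_all('tr'):
    cols = row.find_all('td')
    if len(cols) < 4:  # skip rows without enough <td> cells
        continue
    color_name = cols[2].string  # third column: color name
    color_code = cols[3].string  # fourth column: hex code
    print("{}-->{}".format(color_name, color_code))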
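The same row-by-row approach can be tried offline on a small table. The sketch below uses a hand-written table whose column layout is assumed to resemble the example page. It collects the results in a list instead of printing them, and it uses get_text(strip=True) because .string returns None for an empty cell.

```python
from bs4 import BeautifulSoup

# Hand-written table for illustration; the column layout is an assumption.
html = """
<table>
  <tr><td>Number</td><td>Color</td><td>Color Name</td><td>Hex Code</td></tr>
  <tr><td>1</td><td></td><td>lightsalmon</td><td>#FFA07A</td></tr>
  <tr><td>2</td><td></td><td>salmon</td><td>#FA8072</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")

colors = []
for row in table.find_all("tr")[1:]:  # skip the header row
    cols = row.find_all("td")
    name = cols[2].get_text(strip=True)
    code = cols[3].get_text(strip=True)
    colors.append((name, code))

empty_cell = table.find_all("tr")[1].find_all("td")[1]
print(colors)
print(empty_cell.string)  # None: the cell has no text child
```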

2. Converting an HTML table to a pandas DataFrame

This is probably one of the more commonly needed tasks.

The page used here is:
https://en.wikipedia.org/wiki/World_population

# The earlier steps are the same, so start directly from locating the table
tables = soup.find_all('table')
for index, table in enumerate(tables):
    if '10 most densely populated countries' in str(table):
        table_index = index

# Check that the right table was found
print(tables[table_index].prettify())

Now load the rows into pandas:

import pandas as pd

rows = []
for row in tables[table_index].tbody.find_all('tr'):
    col = row.find_all('td')
    if col:  # rows made only of <th> cells yield no <td>, so skip them
        rows.append({
            "Rank": col[0].text,
            "Country": col[1].text,
            "Population": col[2].text.strip(),
            "Area": col[3].text.strip(),
            "Density": col[4].text.strip(),
        })

# DataFrame.append was removed in pandas 2.0, so build the frame from a list of dicts
population_data = pd.DataFrame(rows, columns=['Rank', 'Country', 'Population', 'Area', 'Density'])
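The values collected this way are all strings. The sketch below shows one way to clean them while building the DataFrame, run on a tiny hand-written table. The country names and numbers are placeholders, not real data.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Placeholder table; names and figures are invented for illustration.
html = """
<table>
  <tbody>
    <tr><th>Rank</th><th>Country</th><th>Density</th></tr>
    <tr><td>1</td><td> Aland </td><td>1,234</td></tr>
    <tr><td>2</td><td> Borduria </td><td>987</td></tr>
  </tbody>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")

records = []
for row in table.tbody.find_all("tr"):
    cells = row.find_all("td")
    if not cells:  # the header row has only <th> cells
        continue
    rank, country, density = (c.get_text(strip=True) for c in cells)
    records.append({
        "Rank": int(rank),
        "Country": country,
        "Density": int(density.replace(",", "")),  # drop thousands separators
    })

df = pd.DataFrame(records, columns=["Rank", "Country", "Density"])
print(df)
```

Collecting dictionaries in a list and constructing the DataFrame once is also cheaper than growing a DataFrame row by row.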

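Check the result, for example with print(population_data).

pandas also provides its own function for reading HTML tables, read_html. Recent pandas versions expect a file-like object rather than a raw HTML string, hence the StringIO wrapper. The 'bs4' flavor requires both beautifulsoup4 and html5lib to be installed.

from io import StringIO
population_data_read_html = pd.read_html(StringIO(str(tables[table_index])), flavor='bs4')[0]  # the table located above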

You can even read every table on the page directly from the URL:

dataframe_list = pd.read_html(url, flavor='bs4')

Use the match parameter to keep only the tables whose text matches a given string or regular expression:

pd.read_html(url, match="10 most densely populated countries", flavor='bs4')[0]
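The effect of match can be checked offline on a small document with two tables. The sketch below is a minimal illustration with invented data. It omits the flavor argument so pandas picks whichever parser is installed (lxml, or bs4 together with html5lib), which is an assumption about the environment.

```python
from io import StringIO

import pandas as pd

# Two invented tables; only the second one contains the word "Density".
html = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Alpha</td><td>100</td></tr>
</table>
<table>
  <tr><th>Country</th><th>Density</th></tr>
  <tr><td>Aland</td><td>1234</td></tr>
  <tr><td>Borduria</td><td>987</td></tr>
</table>
"""

all_tables = pd.read_html(StringIO(html))                 # every table in the document
matched = pd.read_html(StringIO(html), match="Density")   # only tables containing the text

print(len(all_tables), len(matched))
print(matched[0])
```

read_html always returns a list of DataFrames, even when only one table matches, which is why the examples above index the result with [0].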