Python web scraping notes

I. Overview

Two packages do most of the work:
requests: fetches data over the network, mostly HTML pages.
BeautifulSoup: parses the HTML into Python data structures.

Below are the more important attributes and methods of BeautifulSoup.

First, build the soup object:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")

This creates the soup object; its type is bs4.BeautifulSoup.

Usually we first want a rough look at what the HTML looks like:

print(soup.prettify())

This prints the HTML indented so that its nesting structure is visible.

Two other important objects are Tag and NavigableString.

A Tag corresponds to an HTML tag.

A NavigableString holds the text content of a tag and can be accessed via <tag>.string.
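For example, with a tiny made-up HTML snippet (just for illustration):

from bs4 import BeautifulSoup

html = "<p id='intro'>Hello, <b>world</b></p>"   # made-up snippet for illustration
soup = BeautifulSoup(html, "html.parser")

tag = soup.b                 # a Tag object (bs4.element.Tag)
print(tag.name)              # 'b'
print(tag.string)            # 'world' -- a NavigableString
print(soup.p['id'])          # tag attributes are accessed like a dict: 'intro'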

find_all(name, attrs, recursive, string, limit, **kwargs)

When we pass a tag name as name, the function returns every element with that tag, together with the content nested inside it.
The first parameter is the most important one. If you know HTML, you can quickly find what you need; for example, the tag a marks a link (anchor).

find() differs from find_all() only in that it returns the first match instead of all of them.
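A rough sketch of the difference (again with a made-up snippet):

from bs4 import BeautifulSoup

html = "<ul><li class='item'>one</li><li class='item'>two</li><li>three</li></ul>"
soup = BeautifulSoup(html, "html.parser")

print(soup.find_all('li', limit=2))        # at most two <li> tags
print(soup.find_all('li', class_='item'))  # keyword arguments filter on attributes
print(soup.find('li').string)              # 'one' -- only the first match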

II. Examples

1. Common examples

The usual scraping pattern:

import requests

data = requests.get(url).text    # url is the address of the page to scrape
soup = BeautifulSoup(data, "html.parser")

For example, to get all the links on the page:


for link in soup.find_all('a', href=True):   # in HTML an anchor/link is the <a> tag
    print(link.get('href'))

Or to get all the images:

for link in soup.find_all('img'):   # in HTML an image is the <img> tag
    print(link)
    print(link.get('src'))

Or to parse the data inside an HTML table.

We use this URL as the example:

url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/HTMLColorCodes.html"

# Step 1: locate the table
data = requests.get(url).text
soup = BeautifulSoup(data, "html.parser")

table = soup.find('table')

Step 2: parse the rows:

for row in table.find_all('tr'):
    cols = row.find_all('td')
    color_name = cols[2].string
    color_code = cols[3].string
    print("{}-->{}".format(color_name, color_code))

2. Converting an HTML table into a pandas DataFrame

This is probably one of the most common tasks.

The URL is:
https://en.wikipedia.org/wiki/World_population

# The earlier steps (requests + BeautifulSoup) are the same; start from finding the tables
tables = soup.find_all('table')
for index, table in enumerate(tables):
    if '10 most densely populated countries' in str(table):
        table_index = index

# Check that we found the right table
print(tables[table_index].prettify())

Now load it into a pandas DataFrame:

import pandas as pd

# Collect the rows first, then build the DataFrame once
# (DataFrame.append was removed in pandas 2.0, so appending row by row no longer works)
rows = []
for row in tables[table_index].tbody.find_all('tr'):
    col = row.find_all('td')
    if col != []:
        rows.append({
            "Rank": col[0].text,
            "Country": col[1].text,
            "Population": col[2].text.strip(),
            "Area": col[3].text.strip(),
            "Density": col[4].text.strip(),
        })

population_data = pd.DataFrame(rows, columns=['Rank', 'Country', 'Population', 'Area', 'Density'])

Check the result, for example with population_data.head().

pandas also provides a function for reading HTML directly:

read_html

population_data_read_html = pd.read_html(str(tables[5]), flavor='bs4')[0]

You can even read all the tables directly from the URL:

dataframe_list = pd.read_html(url, flavor='bs4')

Use the match parameter to pick out a specific table:

pd.read_html(url, match="10 most densely populated countries", flavor='bs4')[0]