《Using Python to Access Web Data》Week4 Programs that Surf the Web 课堂笔记-优快云博客

本文链接：https://blog.youkuaiyun.com/qq_28392015/article/details/133062202

本文介绍Python如何处理Unicode字符，使用urllib和BeautifulSoup库进行网页数据采集与解析，包括读取网页、解析HTML及网络爬虫实践。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

Coursera课程《Using Python to Access Web Data》密歇根大学

Week4 Programs that Surf the Web

12.3 Unicode Characters and Strings

Representing Simple Strings

使用ASCII码，每个字符都被一个0到256的数字表示来存在8bits的内存里。

使用ord()函数可以查询，指定字符所对应的ASCII码。

>>> print(ord('H'))
72
>>> print(ord('e'))
101
>>> print(ord('\n'))
10

Multi-Byte Characters

unicode是为了解决之前编码只考虑英文字符的问题而出现的，因为以前的ASCII码只使用一个字节来表示字符，所以它最多也只能表示256个字符，如果只使用英文是够用的，但是事实是世界要发展，所以出现各种字符，那么它就不够用了。于是就出现了许多的编码方式，例如中国就出了GBK编码。但是这是不适于国家与国家之间的交流的，于是ISO就指定了unicode这个编码标准。

unicode现在通常是使用两个字节来表示一个字符，但是其实它可以被看成一个字符集，它将所有字符都定义了一个唯一的ID，这样网络就能有一个统一的字符表，不再出现之前的需要相互转化的问题。

那么新的问题就是直接使用unicode来表示字符时，它有时候会浪费空间，在编码表中靠前的字符，例如英文字符，它前一字节就是0000，后一字节才是它真正的序号。于是，在网络传输中，由于网络带宽并没有这么的理想，大家肯定就会嫌弃unicode编码浪费带宽，所以，于1992年创建，由Ken Thompson创建了UTF传输标准，它的全名是Unicode Transformation Format，这个全名就能明白了UTF的意思。

UTF是针对unicode的一种网络传输标准，按照我们通信的人来说，它就是一种针对unicode的编码方式。它现在有UTF-7、UTF-7.5、UTF-8、UTF-16、UTF-32几种格式，当然现在最为流行的就是其中的UTF-8。它是一种变长的编码方式，这样就能减少网络传输中的数据量，所以在网络传输中，大家都用它。

Two Kinds of Strings in Python

# python 2.7.10
>>> x = '我爱你'
>>> type(x)
<type 'str'>
>>> x = u'我爱你'
>>> type(x)
<type 'unicode'>

# python 3.5.1
>>> x = '我爱你'
>>> type(x)
<class 'str'>
>>> x = u'我爱你'
>>> type(x)
<class 'str'>

Python2 Versus Python3

# Python 2.7.10
>>> x = b'abc'
>>> type(x)
<type 'str'>
>>> x = '我爱你'
>>> type(x)
<type 'str'>
>>> x = u'我爱你'
>>> type(x)
<type 'unicode'>

# Python 3.5.1
>>> x = b'abc'
>>> type(x)
<class 'bytes'>
>>> x = '我爱你'
>>> type(x)
<class 'str'>
>>> x = u'我爱你'
>>> type(x)
<class 'str'>

在python3内部，所有的字符都是unicode编码，unicode与UTF-8并不能直接通用，所有在字符出入的时候就要加上encode()与decode()。

Python Strings to Bytes

当我们与外部资源通信时，比如说一个网络socket，我们就要发送bytes。所以我们要把Python3的strings来encode。

当我们读取外部资源发来的数据时，我们就要把它decode变成strings。

while True:
    data = mysock.recv(512)
    if (len(data) < 1):
        break
    mystring = data.decode()
    print(mystring)

也就是说，在网络上大家用bytes(它包含UTF-8等)交流，而在自己的程序里使用unicode。

12.4 Retrieving Web Pages

Using urllib in Python

由于HTTP现在太普遍了，所以我们有一个库可以做所有的socket工作，以及让网页看起来像文件一样。

import urllib.request, urllib.parse, urllib.error

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in fhand:
    print(line.decode().strip())

这个urllib.request.urlopen()函数看起来就跟打开文件一样。

Reading Web Pages

import urllib.request,urllib.parse,urllib.error

fhand = urllib.request.urlopen('http://www.dr-chuck.com/page1.htm')
for line in fhand:
    print(line.decode().strip())

输出结果

<h1>The First Page</h1>
<p>If you like, you can switch to the <ahref="http://www.dr-chuck.com/page2.htm">Second Page</a>
</p>

其实读取html文件是一样的。

12.5 Parsing Web Pages

What is Web Scraping？

网络数据采集。使用搜索引擎来采集网页数据，我们叫"spidering the web"或者"web crawling"。

The Easy Way - Beautiful Soup

BeautifulSoup是一个额外的模块，可以使用pip来安装。

pip install bs4

具体的用处，官方解释如下

Beautiful Soup提供一些简单的、python式的函数用来处理导航、搜索、修改分析树等功能。它是一个工具箱，通过解析文档为用户提供需要抓取的数据，因为简单，所以不需要多少代码就可以写出一个完整的应用程序。

Beautiful Soup自动将输入文档转换为Unicode编码，输出文档转换为utf-8编码。你不需要考虑编码方式，除非文档没有指定一个编码方式，这时，Beautiful Soup就不能自动识别编码方式了。然后，你仅仅需要说明一下原始编码方式就可以了。

Beautiful Soup已成为和lxml、html6lib一样出色的python解释器，为用户灵活地提供不同的解析策略或强劲的速度。

import urllib.request,urllib.parse,urllib.error
from bs4 import BeautifulSoup

url = input('Enter -')
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href',None))

其中第6行的soup = BeautifulSoup(html, 'html.parser')代码，就是把我们的html文件处理成了一个soup对象.

第9行的tags = soup('a')是把html文件中的anchor tag读成了一个anchor tag的list。

而第11行的tag对象相当于一个字典，可以获取'href'或None的值。

再复杂的学习可以参看三夜灯的blog文章(文后参考2)

Worked Example: BeautifulSoup

urllinks.py

# To run this, you can install BeautifulSoup
# https://pypi.python.org/pypi/beautifulsoup4

# Or download the file
# http://www.py4e.com/code3/bs4.zip
# and unzip it in the same directory as this file

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))

作业：Scraping HTML Data with BeautifulSoup

# To run this, you can install BeautifulSoup
# https://pypi.python.org/pypi/beautifulsoup4

# Or download the file
# http://www.py4e.com/code3/bs4.zip
# and unzip it in the same directory as this file

from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")

# Retrieve all of the anchor tags
tags = soup('span')
count = 0
sum = 0
for tag in tags:
    # Look at the parts of a tag
   sum = sum + int(tag.contents[0])
   count += 1

print ('Count ',count)
print('Sum ', sum)

使用的是python3与bs4。

作业：Following Links in HTML Using BeautifulSoup

# To run this, you can install BeautifulSoup
# https://pypi.python.org/pypi/beautifulsoup4

# Or download the file
# http://www.py4e.com/code3/bs4.zip
# and unzip it in the same directory as this file

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
import re

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

count = int(input("Enter count:"))
position = int(input("Enter position:"))
first_name = input("Start with:")

def retrievinghtml(name):
	url = "http://py4e-data.dr-chuck.net/known_by_%s.html"%(name)
	
	html = urllib.request.urlopen(url, context=ctx).read()
	soup = BeautifulSoup(html, 'html.parser')

	# Retrieve all of the anchor tags
	tags = soup('a')
	temp_tag = tags[position-1]
	temp_site = temp_tag.get('href', None)
	temp_name = (re.findall('known_by_(\S+).html',temp_site))[0]
	print("Retrieving: "+"http://py4e-data.dr-chuck.net/known_by_%s.html"%(temp_name))
	return temp_name
	

name = []
name.append(first_name)
print("Retrieving: "+"http://py4e-data.dr-chuck.net/known_by_%s.html"%(first_name))
for i in range(count):
	temp_name = retrievinghtml(name[i])
	name.append(temp_name)
print(name)

使用的仍然是python3与bs4。

【参考】

[1]龙的博客

[2]用python的BeautifulSoup分析html