Web Scraping 03: Parsing Data

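This article introduces regular expressions in Python: the concept, typical use cases, Python's support for them, and the common functions of the re module. It then covers the basics of XPath, node relationships, XPath usage, and the lxml module. Finally, it discusses reading and writing CSV files and works through practical examples of extracting web page information and storing the data, including with BeautifulSoup4.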


Regular expressions
xpath
CSV
BeautifulSoup4

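Regular expressions

I. Introduction to regular expressions

1. Concept

A regular expression is a logical formula for operating on strings: predefined special characters, and combinations of them, form a "rule string" that expresses a filter to apply to other strings.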

2. Use cases

Form validation (e.g. mobile numbers, email addresses, ID numbers...)
Web scraping

II. Regular expression support in Python

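1. Ordinary characters

Letters, digits, Chinese characters, underscores, and punctuation without a special meaning are all "ordinary characters". An ordinary character in a regular expression matches exactly one character identical to itself.
For example, searching for the expression c in the string abcde succeeds; the matched text is c, starting at position 2 and ending at position 3. (Note: whether indices start at 0 or 1 depends on the programming language; Python counts from 0.)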
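The example above can be checked in a few lines. This is a minimal sketch that uses re.search() rather than re.match(), because c does not sit at the start of the string:

```python
import re

# search() scans the whole string; match() would return None here
# because 'c' is not at index 0.
m = re.search(r'c', 'abcde')
print(m.group())  # c
print(m.span())   # (2, 3): starts at index 2, ends before index 3
```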

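2. The match() function

Syntax: match(pattern, string, flags=0)
The first argument is the regular expression. If the match succeeds, a match object is returned; otherwise None is returned.
The second argument is the string to match against.
The third argument is a flag that controls how matching is performed, e.g. case sensitivity or multi-line matching.
Extension: flags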

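Flag	Description
re.I	Ignore case (IGNORECASE)
re.L	Locale-aware matching (LOCALE)
re.M	Multi-line mode (MULTILINE)
re.S	Make '.' match any character, including a newline; without this flag, '.' matches anything except a newline (DOTALL)
re.U	Interpret characters according to the Unicode character set; affects \w, \W, \b, \B (UNICODE)
re.X	Allow a more flexible layout so that the regular expression is easier to read (VERBOSE)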
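To make the table concrete, here is a small sketch of three of the flags. The sample text is made up for illustration; flags can be combined with the | operator:

```python
import re

text = 'Hello\nWORLD'  # hypothetical sample with an embedded newline

# re.I: case-insensitive matching
ignore_case = re.match(r'hello', text, re.I)

# Without re.S, '.' stops at the newline; with re.S it matches it
no_dotall = re.match(r'Hello.World', text, re.I)
dotall = re.match(r'Hello.World', text, re.I | re.S)

# re.M: '^' also matches right after every newline
line_starts = re.findall(r'^\w+', text, re.M)

print(ignore_case.group())  # Hello
print(no_dotall)            # None
print(dotall is not None)   # True
print(line_starts)          # ['Hello', 'WORLD']
```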

Code demo:

import re
'''
pattern ---> the regular expression: returns a match object on success, otherwise None
string ---> the data to match against
flags=0 ---> flags controlling how matching is done (matching across newlines, case sensitivity, ...)

# match() only matches at the beginning of the string, not at any other position
'''
# re.match(pattern, string, flags=0)
s = 'python and java'
pattern = 'python'  # the regular expression
result = re.match(pattern, s)
if result:
    # print(result) # returns: <re.Match object; span=(0, 6), match='python'>
    print(result.group()) # returns: python
else:
    print('没有匹配！')

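3. Metacharacters

Regular expressions use many metacharacters, which carry special meanings or functions.
Characters that cannot be typed directly, or that have a special function, are written with a preceding backslash "\".
Others not yet listed include the question mark "?", the asterisk "*", and parentheses "()". Every character that has a special meaning in a regular expression must be escaped with a backslash when it should match itself. Such escaped characters behave like ordinary characters: each matches one identical character.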
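A short sketch of escaping in practice. The sample string is invented; re.escape() is the standard-library helper that adds the backslashes automatically:

```python
import re

price = 'cost: $9.99 (approx.)'  # hypothetical sample text

# '$' and '.' are metacharacters, so they are escaped to match literally
m = re.search(r'\$\d+\.\d+', price)
print(m.group())  # $9.99

# re.escape() escapes every special character in a literal string
literal = re.escape('(approx.)')
found = re.search(literal, price)
print(found.group())  # (approx.)
```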

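4. Predefined character classes

Some notations match any one character from a predefined set. For example, the expression \d matches any single digit. Although it can match any character in the set, it matches only one character, not several.

Code demo: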

import re

'''
01: \d -- matches any one digit 0-9
'''
result1 = re.match(r'\d', '123').group()
print(result1)
'''
02: \w -- matches any one letter, digit, or underscore (0-9, a-z, A-Z, _); in Python 3 it also matches other Unicode word characters such as Chinese characters
'''
result2 = re.match(r'\w', '123').group()
print(result2)
result3 = re.match(r'\w', 'a123').group()
print(result3)
result4 = re.match(r'\w', 'A123').group()
print(result4)
'''
03: \s -- matches a space, a tab, or another whitespace character
'''
result5 = re.match(r'\s', ' ').group()
print(result5)
'''
04: \D (the complement of \d) -- matches any one non-digit character
'''
result6 = re.match(r'速度与激情\D', '速度与激情a').group()
print(result6)
'''
05: \W (the complement of \w) -- matches a character that is not a letter, digit, or underscore, e.g. a symbol
'''
result7 = re.match(r'\W', '&^%%').group()
print(result7)
'''
06: \S (the complement of \s) -- matches any character that is not whitespace
'''
result8 = re.match(r'\S', 'sdsdfasd').group()
print(result8)
result9 = re.match(r'\S', '158532').group()
print(result9)

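5. Repetition

The expressions so far, whether they match one kind of character or any one of several kinds, match only once. Sometimes a field needs to be matched repeatedly, for example the mobile number 13666666666. A beginner might write \d\d\d\d\d\d\d\d\d\d\d (note: this is not a suitable expression), which is tedious to write, tiring to read, and not necessarily accurate.
In that case, append a quantifier such as {} to the expression; it then matches repeatedly without having to be written out repeatedly. For example, [abcd][abcd] can be written as [abcd]{2}.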

import re

'''
# {n}: the expression repeats exactly n times.
'''
# Code demo
result1 = re.match(r'\d{3}', '123').group()
print(result1)
result2 = re.match(r'^1[345678]\d{9}$', '15840039263').group()
print(result2)
'''
# {m,n}: the expression repeats at least m and at most n times.
'''
# Code demo
result3 = re.match(r'\d{3,4}-\d{7,8}', '0123-1234567').group()
print(result3)
'''
# {m,}: the expression repeats at least m times.
'''
# Code demo
result4 = re.match(r'\d{3,}-\d{7,8}', '012-1234567').group()
print(result4)
'''
# ?: the expression occurs 0 or 1 times
'''
# Code demo
result9 = re.match(r'w[a-z]?', 'wa').group()
print(result9)
result10 = re.match(r'w[a-z]?', 'wae').group()
print(result10)
'''
# +: the expression occurs at least once
'''
# Code demo
result7 = re.match(r'w[a-z]+', 'weasd').group()
print(result7)
# result8 = re.match(r'w[a-z]+', 'w').group()
# print(result8) # raises AttributeError: match() returns None, which has no group()

'''
# *: the expression occurs 0 or more times, equivalent to {0,}
'''
# Code demo
result5 = re.match(r'w[a-z]*', 'wfadasd').group()
print(result5)
result6 = re.match(r'w[a-z]*', 'w').group()
print(result6)

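6. Position matching and non-greedy mode

① Position matching

Sometimes the position of a match matters: for example the start or end of the string, or a word boundary.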
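The original gives no example for position matching, so here is a minimal sketch with the three most common anchors: ^ (start), $ (end), and \b (word boundary). The sample string is made up:

```python
import re

s = 'cat concat cat'  # hypothetical sample text

at_start = re.findall(r'^cat', s)        # only a match at the very start
at_end = re.findall(r'cat$', s)          # only a match at the very end
whole_words = re.findall(r'\bcat\b', s)  # 'cat' as a whole word
anywhere = re.findall(r'cat', s)         # includes the 'cat' inside 'concat'

print(at_start)     # ['cat']
print(at_end)       # ['cat']
print(whole_words)  # ['cat', 'cat']
print(len(anywhere))  # 3
```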

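② Greedy and non-greedy mode

During repetition, a regular expression by default matches as much as possible; this is called greedy mode. For example, for the text dxxxdxxxd, the \w+ in the expression (d)(\w+)(d) matches everything between the first d and the last d, i.e. xxxdxxx. So \w+ always consumes as many characters as its rule allows. Likewise, repetitions written with ?, * and {m,n} match as much as possible.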

Code demo 01: greedy and non-greedy mode

import re

'''
# Greedy matching: returns the longest result; Python is greedy by default and always tries to match as many characters as possible
'''
s = '<div>abc</div><div>bcd</div>'
# '''
# Goal: <div>abc</div>
# '''
ptn = '<div>.*</div>'
r = re.match(ptn, s)
print(r.group()) # <div>abc</div><div>bcd</div>


'''
# Non-greedy mode: always tries to match as few characters as possible
# How to use it: append ? to the quantifier, e.g. .*?, .+? or {m,n}?
'''
s = '<div>abc</div><div>bcd</div>'
'''
Goal: <div>abc</div>
'''
ptn = '<div>.*?</div>'
r = re.match(ptn, s)
print(r.group())

Output:

C:\python\python.exe D:/PycharmProjects/Python大神班/day-06/01-非贪婪匹配.py
<div>abc</div><div>bcd</div>
<div>abc</div>

Process finished with exit code 0

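Code demo 02: regular expression exercise

import re

def fn(ptn, items):
    for x in items:
        result = re.match(ptn, x)
        if result:
            print('匹配成功！', '匹配结果是：', result.group())
        else:
            print(x, '匹配失败！')

# . : matches any character except a newline
items = ['abc1', 'ab', 'aba', 'abbcd', 'other', 'another']
ptn = 'ab.'
fn(ptn, items)

# [] : matches any one of the characters listed in the brackets
items = ['man', 'mbn', 'mcn', 'mdn', 'mon', 'nba']
ptn = 'm[abcd]n'
fn(ptn, items)

# \d : matches a digit [0-9]
items = ['py9', 'py3', 'other', 'pyxx']
ptn = r'py\d'
fn(ptn, items)

# \D : matches a non-digit character
items = ['py9', 'py3', 'other', 'pyxx']
ptn = r'py\D'
fn(ptn, items)

# \s : matches a whitespace character
items = ['hello world', 'helloxxx', 'hello,world']
ptn = r'hello\sworld'
fn(ptn, items)

# \w : matches a letter, digit, or underscore
items = ['1-age', 'a-age', '#-age', '_-age']
ptn = r'\w-age'
fn(ptn, items)

# * : matches 0 or more occurrences
items = ['hello', 'abc', 'xxx', 'h']
ptn = 'h[a-z]*'
fn(ptn, items)

# {m} : matches exactly m occurrences
items = ['hello', 'python', '^%%&%^#^$%', '123456']
ptn = r'\w{6}'
fn(ptn, items)

# {m,n} : matches at least m and at most n occurrences
# Note: the pattern below has a space inside the braces, so Python treats '{3, 5}' as
# literal text rather than a quantifier and every item fails; write r'\w{3,5}' for the
# intended quantifier. The items are also stored in a set, so their order is not fixed.
items = {'abcd', 'python', '*&%%*&*&', '_xxx211', '65465465'}
ptn = r'\w{3, 5}'
fn(ptn, items)

# $ : matches at the end of the string
items = ['123@qq.com', 'abc@yy.com', 'bcd@qq.com.cm']
ptn = r'\w+@qq.com$'
fn(ptn, items)

Output: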

C:\python\python.exe D:/PycharmProjects/Python大神班/day-06/02-正则表达式练习.py
匹配成功！ 匹配结果是： abc
ab 匹配失败！
匹配成功！ 匹配结果是： aba
匹配成功！ 匹配结果是： abb
other 匹配失败！
another 匹配失败！
匹配成功！ 匹配结果是： man
匹配成功！ 匹配结果是： mbn
匹配成功！ 匹配结果是： mcn
匹配成功！ 匹配结果是： mdn
mon 匹配失败！
nba 匹配失败！
匹配成功！ 匹配结果是： py9
匹配成功！ 匹配结果是： py3
other 匹配失败！
pyxx 匹配失败！
py9 匹配失败！
py3 匹配失败！
other 匹配失败！
匹配成功！ 匹配结果是： pyx
匹配成功！ 匹配结果是： hello world
helloxxx 匹配失败！
hello,world 匹配失败！
匹配成功！ 匹配结果是： 1-age
匹配成功！ 匹配结果是： a-age
#-age 匹配失败！
匹配成功！ 匹配结果是： _-age
匹配成功！ 匹配结果是： hello
abc 匹配失败！
xxx 匹配失败！
匹配成功！ 匹配结果是： h
hello 匹配失败！
匹配成功！ 匹配结果是： python
^%%&%^#^$% 匹配失败！
匹配成功！ 匹配结果是： 123456
python 匹配失败！
abcd 匹配失败！
*&%%*&*& 匹配失败！
_xxx211 匹配失败！
65465465 匹配失败！
匹配成功！ 匹配结果是： 123@qq.com
abc@yy.com 匹配失败！
bcd@qq.com.cm 匹配失败！

Process finished with exit code 0
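In the output above, every item of the {m, n} test fails. That is because of the space inside the braces: Python's re module then reads the braces as literal characters. A minimal sketch comparing the two spellings on the same items (kept in a list here so the order is stable):

```python
import re

items = ['abcd', 'python', '*&%%*&*&', '_xxx211', '65465465']

# With a space, '{3, 5}' is literal text, so nothing matches
with_space = [re.match(r'\w{3, 5}', x) is not None for x in items]

# Without the space, {3,5} is a quantifier: 3 to 5 word characters
without_space = []
for x in items:
    m = re.match(r'\w{3,5}', x)
    without_space.append(m.group() if m else None)

print(with_space)     # [False, False, False, False, False]
print(without_space)  # ['abcd', 'pytho', None, '_xxx2', '65465']
```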

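III. Common functions of the re module

compile(pattern, flags=0)
This function is the factory of the re module: it compiles a regular expression given as a string into a Pattern object, which makes repeated matching more efficient. The second argument, flags, is the matching mode. Once compile() has done the conversion, the pattern does not need to be converted again the next time it is used. A pattern object produced by compile() offers the same methods as the module-level re functions.

1. Flag matching modes

The available flags are the ones listed in the flag table of the match() section above.

2. search(pattern, string, flags=0)

Searches within the text and returns the first match. Its return type and usage are the same as for match(); the only difference is that the match does not have to start at the beginning of the text.

3. findall(pattern, string, flags=0)

One of the three main search functions of the re module. Unlike match() and search(), which are single-value matches that return as soon as one match is found, findall() searches the whole text and returns a list of the matched strings. The list has no group() method and no start, end, or span; it is not a match object, just a list. If nothing matches, an empty list is returned.

4. split(pattern, string, maxsplit=0, flags=0)

The split() function of the re module is similar to the split() method of strings: both split a string at given characters. Because re.split() accepts a regular expression, it is more flexible and more powerful.

5. sub(pattern, repl, string, count=0, flags=0)

sub() is similar to the replace() method of strings: it replaces the matched text with the given content, and the number of replacements can be limited.

Code demo: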

import re

# compile() -- builds a pattern object from a regular expression string and returns it
pat = re.compile(r'abc')
res = pat.match('abc123').group()
print(res)
# re.I --> case-insensitive matching
pat = re.compile(r'abc', re.I)
res = pat.match('ABC123').group()
print(res)

# search() -- searches the string and returns the first match object, or None
r = re.search(r'abc', '123abc456abc789').group()
print(r)

# findall() -- searches the whole text and returns a list of matched strings (an empty list if nothing matches); the list is not a match object and has no group(), start, end, or span
r = re.findall(r'abc', '123abc456abc789')
print(r)

# split() -- like str.split(), but the separator is a regular expression, which makes it more flexible
# split() takes a maxsplit parameter that limits the number of splits
s = '8+7*5+6/3'
r = re.findall(r'\d', s)
print(r)
r = re.split(r'[\+\*\/]', s)
print(r)
# maxsplit -- maximum number of splits
r = re.split(r'[\+\*\/]', s, maxsplit=3)
print(r)

# sub() -- like str.replace(): replaces the matched text with the given content; the number of replacements can be limited
s = 'i am wangjiaxin i am very handsome'
r = re.sub(r'i', 'I', s)
print(r)

Output:

C:\python\python.exe D:/PycharmProjects/Python大神班/day-06/03-re模块常用的方法.py
abc
ABC
abc
['abc', 'abc']
['8', '7', '5', '6', '3']
['8', '7', '5', '6', '3']
['8', '7', '5', '6/3']
I am wangjIaxIn I am very handsome

Process finished with exit code 0
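The demo replaces every i, including those inside words. As a small sketch of the count parameter mentioned above, plus a word-boundary variant that is not part of the original demo:

```python
import re

s = 'i am wangjiaxin i am very handsome'

# count=2 stops after the first two replacements
first_two = re.sub(r'i', 'I', s, count=2)

# \b restricts the replacement to the standalone word 'i'
whole_word = re.sub(r'\bi\b', 'I', s)

print(first_two)   # I am wangjIaxin i am very handsome
print(whole_word)  # I am wangjiaxin I am very handsome
```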

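IV. Groups

The re module supports groups. Grouping means picking the needed parts out of what has already been matched, a second round of filtering. Groups are created with parentheses (), and their contents are retrieved with group() and groups(), as shown earlier. Several important functions of the re module behave differently when groups are present, so they need to be treated separately.

Code demo: groups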

# Grouping: pick the needed parts out of what has already been matched (second-round filtering)
import re
s = 'apple price is $66, banana price is $6'
'''
Goal
$66
$6
'''
result = re.search(r'.+\$\d+.+\$\d+', s).group()
print(result)
result1 = re.search(r'.+(\$\d+).+(\$\d+)', s)
result2 = re.search(r'.+(\$\d+).+(\$\d+)', s)
print(result1.group(1))
print(result2.group(2))
print(result1.groups())
'''
print(result1.group(1)) -- the first group
print(result2.group(2)) -- the second group
print(result1.groups()) -- all groups, returned as a tuple
print(result1.group()) -- the entire match
'''

Output:

C:\python\python.exe D:/PycharmProjects/Python大神班/day-06/04-分组功能.py
apple price is $66, banana price is $6
$66
$6
('$66', '$6')

Process finished with exit code 0
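One place where groups change a function's behavior is findall(): with no group it returns whole matches, with one group it returns that group's text, and with several groups it returns tuples. A minimal sketch on the same sentence:

```python
import re

s = 'apple price is $66, banana price is $6'

no_group = re.findall(r'\$\d+', s)                       # whole matches
one_group = re.findall(r'\$(\d+)', s)                    # only the group
two_groups = re.findall(r'(\w+) price is \$(\d+)', s)    # tuples of groups

print(no_group)    # ['$66', '$6']
print(one_group)   # ['66', '6']
print(two_groups)  # [('apple', '66'), ('banana', '6')]
```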

Regular expression exercise

Exercise: downloading images from Baidu Tieba

Code demo: downloading images from Baidu Tieba

import os
import requests
import re
'''
Goal: download the images of a Tieba topic
'''
# Approach: find the image URLs, then save the images
'''
1. Find the image URLs; they are not in the page source
Analysis: 1. find the data API in the browser's Network panel. 2. or simulate a browser with selenium
'''
name = 1
# make sure the output directory exists before writing into it
os.makedirs('image', exist_ok=True)
# target URL
for i in range(1, 80, 39):
    # ps is the start index and pe the end index (see the URL pattern below)
    url = 'https://tieba.baidu.com/photo/g/bw/picture/list?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&alt=jview&rn=200&tid=1934517161&pn=1' + '&ps=' + str(i) + '&pe=' + str(39 + i) + '&wall_type=h&_=1612926942322'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
            }
    # send the request and get the response
    res = requests.get(url, headers=headers).text
    img_urls = re.findall('"murl":"(.*?)"', res)
    # print(img_urls)
    for img_url in img_urls:
        print(img_url)
        print('正在下载第{}张图片'.format(name))
        # request the image itself
        img_response = requests.get(img_url)
        # save the image
        with open('image/%d.jpg' %name, 'wb') as file_object:
            file_object.write(img_response.content)
        name += 1


'''
Finding the URL pattern
url1 = https://tieba.baidu.com/photo/g/bw/picture/list?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&alt=jview&rn=200&tid=1934517161&pn=1&ps=1&pe=40&info=1&_=1612926875683

url2 = https://tieba.baidu.com/photo/g/bw/picture/list?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&alt=jview&rn=200&tid=1934517161&pn=1&ps=40&pe=79&wall_type=h&_=1612926908798

url3 = https://tieba.baidu.com/photo/g/bw/picture/list?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&alt=jview&rn=200&tid=1934517161&pn=1&ps=79&pe=118&wall_type=h&_=1612926942322
Pattern: ps=1, ps=40, ps=79
Pattern: pe=40, pe=79, pe=118
The step is 39
'''

Output:

http://imgsrc.baidu.com/forum/wh%3D322%2C200/sign=40e5cae078f0f736d8ab440238679f2b/e5f5bc3eb13533fa690a6fb1a8d3fd1f40345b9b.jpg
正在下载第48张图片
http://imgsrc.baidu.com/forum/wh%3D321%2C200/sign=4463c55cd7ca7bcb7d2ecf2c8c384751/0b1bafc379310a558574226bb74543a98326108e.jpg
正在下载第49张图片
http://imgsrc.baidu.com/forum/wh%3D321%2C200/sign=7950ef3859ee3d6d22938fc871274110/251ec895d143ad4bf97eba5f82025aafa50f068e.jpg
正在下载第50张图片
http://imgsrc.baidu.com/forum/wh%3D322%2C200/sign=41a5ab7ff703918fd78435c9630f0aa5/1ae07b899e510fb32da68b08d933c895d0430c9b.jpg
正在下载第51张图片
http://imgsrc.baidu.com/forum/wh%3D321%2C200/sign=1153ab76f2d3572c66b794dfb8224f15/ae4bd0160924ab188ca91da835fae6cd7a890b9b.jpg
正在下载第52张图片
http://imgsrc.baidu.com/forum/wh%3D321%2C200/sign=90adcea728381f309e4c85aa9b30603a/28128794a4c27d1edc99c75b1bd5ad6edcc4389b.jpg
正在下载第53张图片

xpath

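I. Introduction to XPath

1. Basic concepts

XPath (XML Path Language) is a query language for XML that finds nodes in the XML tree structure. XPath navigates an XML document through its elements and attributes.
XML is a text format based on markup, and XPath makes it easy to locate elements in XML and their attribute values. lxml is a third-party Python module that can convert HTML text into an XML object and evaluate XPath expressions on that object.
Extension:
1. XML: Extensible Markup Language
2. HTML: HyperText Markup Language
3. lxml: a third-party Python library (converts HTML text into an XML object)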

2. Node relationships

xml_content = '''
<bookstore>
<book>
    <title lang='en'>Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29</price>
</book>
</bookstore>
'''

<bookstore> ----> document node
<author>J K. Rowling</author> ----> element node
lang="en" ----> attribute node

Parent: the book element is the parent of the title, author, year, and price elements
Children: title, author, year, and price are all children of the book element
Sibling: title, author, year, and price are all siblings
Ancestor: the ancestors of the title element are the book element and the bookstore element
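These relationships can be inspected with lxml. The sketch below assumes lxml is installed and uses a well-formed copy of the bookstore document; getparent(), getnext(), and iterancestors() are lxml element methods:

```python
from lxml import etree

xml_content = '''
<bookstore>
<book>
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29</price>
</book>
</bookstore>
'''

root = etree.fromstring(xml_content)  # the root element, <bookstore>
title = root.find('book/title')

parent = title.getparent().tag                        # parent of title
children = [child.tag for child in title.getparent()] # children of book
next_sibling = title.getnext().tag                    # following sibling
ancestors = [a.tag for a in title.iterancestors()]    # nearest ancestor first
lang = title.get('lang')                              # attribute value

print(parent)        # book
print(children)      # ['title', 'author', 'year', 'price']
print(next_sibling)  # author
print(ancestors)     # ['book', 'bookstore']
print(lang)          # en
```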

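II. Using XPath

1. Installing XPath tools

Chrome extension: xpath-helper
- 1. Enter chrome://extensions/ in Chrome
- 2. Install by dragging the file in; if that fails, change the file extension from crx to rar
- 3. Unpack the archive
- 4. On the extensions page, choose "Load unpacked"
Firefox extension: xpath checker
Installation guide: https://blog.youkuaiyun.com/qq_31082427/article/details/84987723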

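2. Using the XPath tool

Open the extension with ctrl + shift + x

Demo 01: //div[@class='iteminfo__line1__jobname']/span/text()
@ --> selects an attribute
[] --> predicate: finds a specific node or a node containing a specific value; the predicate is written inside square brackets.


Demo 02: //div[@class='soupager']/span[1]   # pagination
Demo 03: //div[@class='soupager']/span[position()=1]  # pagination
Demo 04: //div[@class='soupager']/span[last()]  # pagination (go straight to the last one)
Demo 05: //div[@class='soupager']/span[position()<4]  # a range of pages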
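The predicates in demos 02 to 05 can be tried offline with lxml. This is a sketch only: the markup below is a made-up stand-in for the pager of the real page, which is not reproduced here.

```python
from lxml import etree

# Hypothetical pager markup, for illustration only
html = '''
<div class="soupager">
    <span>1</span><span>2</span><span>3</span><span>4</span><span>5</span>
</div>
'''
element = etree.HTML(html)
base = '//div[@class="soupager"]/span'

first = element.xpath(base + '[1]/text()')                   # first span
also_first = element.xpath(base + '[position()=1]/text()')   # same node
last = element.xpath(base + '[last()]/text()')               # last span
first_three = element.xpath(base + '[position()<4]/text()')  # spans 1 to 3

print(first)        # ['1']
print(also_first)   # ['1']
print(last)         # ['5']
print(first_three)  # ['1', '2', '3']
```

Note that XPath positions start at 1, not 0.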


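III. Using the lxml module

1. Installation

Install with pip install lxml, or pip install lxml -i https://pypi.douban.com/simple
Usage:
- Import: from lxml import etree
- Get the response: the page source
- etree.HTML(page source) returns an Element object
- The Element object can then be navigated with XPath
  Code demo: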

from lxml import etree

wb_data = """
        <div>
            <ul>
                 <li class="item-0"><a href="link1.html">first item</a></li>
                 <li class="item-1"><a href="link2.html">second item</a></li>
                 <li class="item-inactive"><a href="link3.html">third item</a></li>
                 <li class="item-1"><a href="link4.html">fourth item</a></li>
                 <li class="item-0"><a href="link5.html">fifth item</a>
             </ul>
         </div>
        """
# etree.HTML(wb_data) converts the page source into an Element object
element = etree.HTML(wb_data)
# print(element)
'''
Goal:
get the href attribute of the <a> tag under each <li>
'''
result1 = element.xpath('//li/a/@href')
print(result1)
for i in result1:
    print(i)
'''
Goal:
get the text of the <a> tag under each <li>
'''
result2 = element.xpath('//li/a/text()')
print(result2)
for j in result2:
    print(j)

CSV

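I. What is CSV?

Python ships with a built-in csv module for this format.
Concept: CSV (Comma Separated Values, http://zh.wikipedia.org/zh-cn/逗号分隔值), also called character-separated values because the delimiter does not have to be a comma, is a common text format for storing tabular data, both numbers and text. Many programs encounter CSV files when processing data. Python's built-in csv module is dedicated to reading and writing them.

II. Using the csv module

1. Writing a CSV file

1. Create a writer object; it has two main methods: writerow() writes one row, and writerows() writes several rows.
2. DictWriter writes data supplied as dictionaries.

2. Reading a file

1. Each record read through reader() is a list; individual values are accessed by index.
2. Each record read through DictReader() is a dictionary; values are accessed by key (column name).
Code demo: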

import csv

# Writing data 01: writerow and writerows
# writerow
headers = ('name', 'age', 'height')
persons = [('我的梦', '18','178'), ('我的梦', '23','180'), ('我的梦', '25','170')]
# newline='' prevents blank lines between rows on Windows
with open('persons.csv', 'w', encoding='utf-8', newline='') as file_object:
    writer1 = csv.writer(file_object)
    # writerow() writes one row
    writer1.writerow(headers)
    for data in persons:
        writer1.writerow(data)
# writerows
headers = ('name', 'age', 'height')
persons = [('我的梦', '18','178'), ('我的梦', '23','180'), ('我的梦', '25','170')]
with open('persons1.csv', 'w', encoding='utf-8', newline='') as file_object:
    writer1 = csv.writer(file_object)
    # writerow() writes the header row
    writer1.writerow(headers)
    # writerows() writes all remaining rows at once
    writer1.writerows(persons)


# Second way of writing: DictWriter
# Note: the dictionary keys must match the header names
headers = ('name', 'age', 'height')
persons = [
    {'name': '我的梦', 'age': 18, 'height': 178},
    {'name': '125', 'age': 18, 'height': 178},
    {'name': '王佳欣', 'age': 24, 'height': 183}
        ]
with open('persons2.csv', 'w', encoding='utf-8', newline='') as file_object:
    writer = csv.DictWriter(file_object, headers)
    writer.writeheader()
    writer.writerows(persons)


# Reading data
# Option 1: csv.reader, each row is a list
with open('persons.csv', 'r', encoding='utf-8') as file_object:
    reader = csv.reader(file_object)
    for x in reader:
        print(x)
# Option 2: csv.DictReader, each row is a dictionary
with open('persons2.csv', 'r', encoding='utf-8') as file_object:
    reader = csv.DictReader(file_object)
    for y in reader:
        print(y)
        print(y['name'])

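XPath and CSV exercise

Scraping the Douban Movie Top 250

Goal: scrape each movie's title, rating, quote, and detail URL from pages 1-10 and save them to a CSV file (title, other, star, url, quote)
Approach
- 01: send a request to the target URL, get the response object, and obtain the page source
- 02: convert the page source into an Element object with etree.HTML()
- 03: navigate to the wanted data with XPath on the Element object
- 04: collect title, other, star, url, and quote into a dict, then append the dict to a list
- 05: save the data in the list to a CSV file
  Implementation 01: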

'''
URL pattern of the target pages
Page 1: https://movie.douban.com/top250?start=0&filter=
Page 2: https://movie.douban.com/top250?start=25&filter=
Page 3: https://movie.douban.com/top250?start=50&filter=
start = (page - 1) * 25
The target URL has to be requested 10 times in a loop
'''
import requests
import csv
from lxml import etree
import time

movies = []
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
        }
for i in range(0, 250, 25):
    url = 'https://movie.douban.com/top250?start=' + str(i) + '&filter='

    # 01: send a request to the target URL and get the page source
    response = requests.get(url, headers=headers).content

    # 02: convert the page source into an Element object with etree.HTML()
    element = etree.HTML(response)

    # 03: navigate to the wanted data with XPath
    result = element.xpath('//ol/li/div[@class="item"]')
    # print(len(result)) # 25 movies per page
    for html in result:
        # 01: movie title
        result_titles = html.xpath('./div[@class="info"]/div[@class="hd"]/a/span[@class="title"]/text()')
        # print(result_titles)
        # 02: alternative title
        result_others = html.xpath('./div[@class="info"]/div[@class="hd"]/a/span[@class="other"]/text()')
        # print(result_others)
        # 03: rating
        result_stars = html.xpath('./div[@class="info"]//div[@class="star"]/span[@class="rating_num"]/text()')
        # 04: quote
        result_qutes = html.xpath('./div[@class="info"]//p[@class="quote"]/span[@class="inq"]/text()')
        # 05: detail URL
        result_urls = html.xpath('./div[@class="info"]/div[@class="hd"]/a/@href')

        # 04: put title, other, star, quote, and url into a dict, then append the dict to the list
        # (each value is the list returned by xpath())
        movie = {}
        movie['title'] = result_titles
        movie['other'] = result_others
        movie['star'] = result_stars
        movie['qute'] = result_qutes
        movie['url'] = result_urls
        print(movie)
        movies.append(movie)
    # pause briefly between page requests
    time.sleep(0.5)

# 05: save the data in the list to a CSV file (once, after all pages have been fetched)
header = ['title', 'other', 'star', 'qute', 'url']
with open('TOP电影信息爬取.csv', 'w', encoding='utf-8', newline='') as file_object:
    writer = csv.DictWriter(file_object, header)
    writer.writeheader()
    writer.writerows(movies)

Implementation 02:

import requests
import time
from lxml import etree
import csv

'''
URL pattern of the target pages
URL1 = https://movie.douban.com/top250?start=0&filter=
URL2 = https://movie.douban.com/top250?start=25&filter=
URL3 = https://movie.douban.com/top250?start=50&filter=
The pattern is:
start = (page - 1) * 25
'''
# 1. Target site
doubanUrl = 'https://movie.douban.com/top250?start={}&filter='

# 2. Send the request and get the response
def getSource(url):
    headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_3_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.192 Safari/537.36'}
    time.sleep(0.5)
    response = requests.get(url, headers=headers)
    response.encoding = 'utf-8'
    return response.text

# 3. Parse the data (title, rating, quote, url)
def getEveryItem(source):
    html_element = etree.HTML(source)
    movieItemList = html_element.xpath('//div[@class="info"]')

    # an empty list that collects the movie dictionaries
    movielist = []
    for eachMovie in movieItemList:
        # a dictionary that holds the information for one movie
        movieDict = {}
        # 1. title
        title = eachMovie.xpath('./div[@class="hd"]/a/span[@class="title"][1]/text()')
        # 2. alternative title
        othertitle = eachMovie.xpath('./div[@class="hd"]/a/span[@class="other"]/text()')
        # 3. rating
        stars = eachMovie.xpath('./div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()')[0]
        # 4. quote
        quote = eachMovie.xpath('./div[@class="bd"]/p[@class="quote"]/span/text()')
        # some movies have no quote (a non-empty list is truthy, an empty list is falsy)
        if quote:
            quote = quote[0]
        else:
            quote = ''
        # 5. detail URL
        url = eachMovie.xpath('./div[@class="hd"]/a/@href')[0]


        movieDict['title'] = ''.join(title+othertitle) # main title + alternative title
        movieDict['stars'] = stars # rating
        movieDict['quote'] = quote # quote
        movieDict['url'] = url  # detail URL

        movielist.append(movieDict)
        print(movielist)
    return movielist

# 4. Save the data
def writeData(movielist):

    with open('douban250.csv','w',encoding='utf-8',newline='') as file_obj:
        writer = csv.DictWriter(file_obj,fieldnames=['title','stars','quote','url'])
        writer.writeheader()
        writer.writerows(movielist)


if __name__ == '__main__':
    movielist = []

    for i in range(10):
        # build the URL of the page
        pagelink = doubanUrl.format(i * 25)
        # pass the URL to the request function
        source = getSource(pagelink)
        # pass the response to the parsing function and add its result to the list
        movielist += getEveryItem(source) # same as movielist = movielist + getEveryItem(source): the results of each page are appended in turn
    # pass the complete list to the writing function to produce the CSV file
    writeData(movielist)

Case summary

1. String formatting

str.format()
r''
f''
s1 = '大桥局'
s2 = '大骗局'

r = f'hello {s1},{s2}'

s = 'i like {}' # str.format()
w = s.format('python')

2. Using XPath
- ① create an Element object
- ② navigate with xpath()
3. Saving to a CSV file

    with open('douban250.csv','w',encoding='utf-8',newline='') as file_obj:
        writer = csv.DictWriter(file_obj,fieldnames=['title','stars','quote','url'])
        writer.writeheader()
        writer.writerows(movielist)
    with open('douban.csv','w',encoding='utf-8',newline='') as file_obj:
        writer = csv.DictWriter(file_obj,fieldnames=['title','star','quote','url'])
        writer.writeheader()
        for each in movieList:
            writer.writerow(each)

4. Checking for empty values

# Truth testing of non-boolean values: anything non-empty counts as True; empty values such as [] () {} None 0 ... count as False
        if quote:
            quote = quote[0]
        else:
            quote = ''
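The same truthiness rule can be packed into a small helper. This is a sketch; first_or_empty is a hypothetical name, not something from the scraper above:

```python
def first_or_empty(values):
    """Return the first item of a list, or '' when the list is empty."""
    return values[0] if values else ''

# Values that Python treats as false in a condition
falsy_values = [[], (), {}, '', None, 0]
all_falsy = all(not v for v in falsy_values)

print(first_or_empty(['great movie']))  # great movie
print(repr(first_or_empty([])))         # ''
print(all_falsy)                        # True
```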

BeautifulSoup4

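I. Introduction to BeautifulSoup4

1. Basic concepts

Concept: Beautiful Soup is a library for extracting data from HTML or XML files.
Overview: every web page has a different structure, so choose the parsing technique that fits best.

2. Source code

The source code can be downloaded from GitHub.
Installation:
- pip install lxml
- pip install bs4
Learning resource: the Beautiful Soup 4 documentation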

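II. BeautifulSoup4 quick start

1. Import

import bs4 (rarely used)
from bs4 import BeautifulSoup

2. Pros and cons of the Beautiful Soup 4 parsers

1. Python standard library
- Usage: BeautifulSoup(markup, 'html.parser')
- Advantages: 1. built into Python 2. decent speed 3. tolerant of malformed documents
- Disadvantages: poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2
2. lxml HTML parser
- Usage: BeautifulSoup(markup, 'lxml')
- Advantages: 1. fast 2. tolerant of malformed documents
- Disadvantages: requires a C library
3. lxml XML parser
- Usage: 1. BeautifulSoup(markup, 'xml') 2. BeautifulSoup(markup, 'lxml-xml')
- Advantages: 1. fast 2. the only parser that supports XML
- Disadvantages: requires a C library
4. html5lib
- Usage: BeautifulSoup(markup, 'html5lib')
- Advantages: 1. the most tolerant parser 2. parses documents the way a browser does 3. produces valid HTML5
- Disadvantages: 1. slow 2. requires an external Python package

Note: the lxml parser is recommended because it is efficient.
Code demo: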

from bs4 import BeautifulSoup


html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
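# Step 1: create a soup object (soup is an instance, BeautifulSoup is the class)
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.prettify()) # pretty-printed, indented output (useful for inspecting complex structure)
print(soup.title) # find data directly by tag name
print(soup.title.name) # the name of the tag
print(soup.title.string) # The Dormouse's story; .string returns the text
print(soup.p) # lookup by tag name returns only the first match
r = soup.find_all('p') # find_all returns every match (all <p> tags)
print(r, len(r))
'''
Goal: get the href attribute (URL) of each <a> tag
'''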
links = soup.find_all('a')
for link in links:
    print(link.get('href'))

III. The four kinds of objects in BeautifulSoup4

1. Tag: a tag
2. NavigableString: a navigable string
3. BeautifulSoup: the soup object
4. Comment: a comment
Code demo:

from bs4 import BeautifulSoup
from bs4.element import NavigableString

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
# create the soup object
soup = BeautifulSoup(html_doc, 'lxml')
'''
Tag : a tag
NavigableString : a navigable string
BeautifulSoup : the soup object
Comment : a comment
'''

# 1. Tag
print(type(soup.title)) # <class 'bs4.element.Tag'>
print(type(soup.a)) # <class 'bs4.element.Tag'>

# 2. NavigableString
print(type(soup.title.string)) # <class 'bs4.element.NavigableString'>

# 3. BeautifulSoup: the soup object
print(type(soup)) # <class 'bs4.BeautifulSoup'>

# 4. Comment
html = '<a><!--大桥局早日倒闭，杭绍台梁断隧塌--></a>'
soup2 = BeautifulSoup(html, 'lxml')
print(type(soup2.string)) # <class 'bs4.element.Comment'>

Output:

/Volumes/苹果微软公共盘/PycharmProjects/venv/bin/python /Volumes/苹果微软公共盘/PycharmProjects/Python大神班/day-08/03-bs4的四大对象类型.py
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.NavigableString'>
<class 'bs4.BeautifulSoup'>
<class 'bs4.element.Comment'>

Process finished with exit code 0

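IV. Traversing the document tree

Working with a document in bs4 involves three kinds of operations: traversing, searching, and modifying.

1. contents, children, descendants

contents: returns a list of all child nodes
Code demo 01:

from bs4 import BeautifulSoup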


html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')
# contents returns a list of all child nodes
head_tag = soup.head
print(head_tag)
res = head_tag.contents
print(res)

children: returns an iterator over the child nodes
Code demo 02:

from bs4 import BeautifulSoup


html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')
head_tag = soup.head 
print(head_tag.children) # an iterator object; loop over it to get the data
for i in head_tag.children:
    print(i)

descendants: returns a generator (a special kind of iterator) that walks through all descendants of a node, at every depth.
Code demo 03:

from bs4 import BeautifulSoup


html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')
head_tag = soup.html
for i in head_tag.descendants:
    print('-'*80)
    print(i)

2. string, strings, stripped_strings

string: gets the text inside a tag
Code demo 01:

from bs4 import BeautifulSoup


html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')
title_tag = soup.title
print(title_tag.string)

strings: returns a generator that yields the text of multiple tags
Code demo 02:

from bs4 import BeautifulSoup


html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')
res = soup.html.strings
for i in res:
    print(i)

stripped_strings: works like strings, but strips the surrounding whitespace and skips whitespace-only strings
Code demo 03:

from bs4 import BeautifulSoup


html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')
res = soup.html.stripped_strings
for i in res:
    print(i)

3. parent and parents

parent: gets the direct parent node
Code demo 01:

from bs4 import BeautifulSoup


html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')
title_tag = soup.title
print(title_tag.parent)
print(soup.html.parent)

parents: gets all ancestor nodes
Code demo 02:

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')
a_tag = soup.a
for i in a_tag.parents:
    print(i)

4. Traversing sibling nodes

next_sibling: the next sibling node
Code demo 01:

from bs4 import BeautifulSoup


html_data = '<a><b>bbbb</b><c>cccc</c></a>'
soup = BeautifulSoup(html_data, 'lxml')
# print(soup.prettify())
b_tag = soup.b
print(b_tag.next_sibling)

previous_sibling: the previous sibling node
Code demo 02:

from bs4 import BeautifulSoup


html_data = '<a><b>bbbb</b><c>cccc</c></a>'
soup = BeautifulSoup(html_data, 'lxml')
c_tag = soup.c
print(c_tag.previous_sibling )

previous_siblings: all preceding sibling nodes
Code demo 03:

from bs4 import BeautifulSoup


html_data = '<a><b>bbbb</b><c>cccc</c></a>'
soup = BeautifulSoup(html_data, 'lxml')
b_tag = soup.b
res = b_tag.previous_siblings
for i in res:
    print(i)
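In demo 03 the b tag is the first child, so it has no preceding siblings and the loop prints nothing. A slightly extended sketch makes the iteration order visible; the extra d tag is added for illustration, and the built-in 'html.parser' is used so that lxml is not required:

```python
from bs4 import BeautifulSoup

# Same idea as the demos above, with a hypothetical third tag <d>
html_data = '<a><b>bbbb</b><c>cccc</c><d>dddd</d></a>'
soup = BeautifulSoup(html_data, 'html.parser')

after_b = [tag.name for tag in soup.b.next_siblings]       # walks forward
before_d = [tag.name for tag in soup.d.previous_siblings]  # walks backward
before_b = soup.b.previous_sibling                         # nothing before <b>

print(after_b)   # ['c', 'd']
print(before_d)  # ['c', 'b']
print(before_b)  # None
```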

V. The find() and find_all() methods

1. Filters

String filter
Code demo 01:

from bs4 import BeautifulSoup


html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')
# String filter
a_tag = soup.find('a')  # the argument 'a' is a string filter
print(a_tag)
a_tags = soup.find_all('a')
for i in a_tags:
    print(i)

List filter
Code demo 02:

from bs4 import BeautifulSoup


html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.find_all(['p','a']))
print(soup.find_all(['title', 'b']))  # ['title', 'b'] is a list filter

2. The find_all() method

Overview
- find_all() returns every matching tag as a list
- find() returns only the first match
- Parameters of find_all():

def find_all(self, name=None, attrs={}, recursive=True, text=None,
                 limit=None, **kwargs):

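Parameter meanings
- name: the tag name
- attrs: the tag's attributes
- recursive: whether to search recursively through all descendants
- text: the text content (recent Beautiful Soup releases call this argument string)
- limit: the maximum number of results to return
- kwargs: keyword arguments, treated as attribute filters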
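To make the parameters above concrete, here is a small sketch that exercises name, attrs, recursive, limit and a keyword argument on a tiny document. The HTML is invented for illustration and the built-in "html.parser" is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, small enough to check every result by eye.
html = """
<div id="box">
  <p class="a">one</p>
  <p class="b">two</p>
  <section><p class="a">three</p></section>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# name + limit: stop after the first two matches.
first_two = [p.get_text() for p in soup.find_all("p", limit=2)]

# attrs: filter on attribute values with a dictionary.
class_a = [p.get_text() for p in soup.find_all("p", attrs={"class": "a"})]

# recursive=False: only look at direct children of the tag being searched.
direct = [p.get_text() for p in soup.div.find_all("p", recursive=False)]

# kwargs: class_ is the keyword form, because class is reserved in Python.
class_b = [p.get_text() for p in soup.find_all("p", class_="b")]

# find() returns None when nothing matches; find_all() returns an empty list.
missing_one = soup.find("p", class_="missing")
missing_all = soup.find_all("p", class_="missing")

print(first_two, class_a, direct, class_b, missing_one, missing_all)
```

Because find() can return None, it is worth checking its result before accessing attributes on it.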
Code demo 01:

from bs4 import BeautifulSoup

html = """
<table class="tablelist" cellpadding="0" cellspacing="0">
    <tbody>
        <tr class="h">
            <td class="l" width="374">职位名称</td>
            <td>职位类别</td>
            <td>人数</td>
            <td>地点</td>
            <td>发布时间</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=33824&keywords=python&tid=87&lid=2218">22989-金融云区块链高级研发工程师（深圳）</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=29938&keywords=python&tid=87&lid=2218">22989-金融云高级后台开发</a></td>
            <td>技术类</td>
            <td>2</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=31236&keywords=python&tid=87&lid=2218">SNG16-腾讯音乐运营开发工程师（深圳）</a></td>
            <td>技术类</td>
            <td>2</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=31235&keywords=python&tid=87&lid=2218">SNG16-腾讯音乐业务运维工程师（深圳）</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=34531&keywords=python&tid=87&lid=2218">TEG03-高级研发工程师（深圳）</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=34532&keywords=python&tid=87&lid=2218">TEG03-高级图像算法研发工程师（深圳）</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=31648&keywords=python&tid=87&lid=2218">TEG11-高级AI开发工程师（深圳）</a></td>
            <td>技术类</td>
            <td>4</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=32218&keywords=python&tid=87&lid=2218">15851-后台开发工程师</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=32217&keywords=python&tid=87&lid=2218">15851-后台开发工程师</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a id="test" class="test" target='_blank' href="position_detail.php?id=34511&keywords=python&tid=87&lid=2218">SNG11-高级业务运维工程师（深圳）</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
    </tbody>
</table>
"""
soup = BeautifulSoup(html, 'lxml')
# 1. Get tr tags: soup.tr and find() return the first one, find_all() returns all of them
print(soup.tr)
print(soup.find('tr'))
print(soup.find_all('tr'))

# 2. Get the second tr tag
tr = soup.find_all('tr')[1]
print(tr)
# limit parameter: caps the number of results returned
tr = soup.find_all('tr', limit=2)[1]
print(tr)

# 3. Get every tr tag whose class is "even" (class is a Python keyword, so the argument is spelled class_)
trs = soup.find_all('tr', class_='even')
for tr in trs:
    print(tr)
    print('-'*80)

# 4. Extract every a tag with id='test' and class='test'
r = soup.find_all('a', id='test', class_='test')
for a in r:
    print(a)

# 5. Get the href attribute of every a tag
a_tag = soup.find_all('a')
for a in a_tag:
    # Option 1: subscript access
    href = a['href']
    print(href)
    # Option 2: get()
    href = a.get('href')
    print(href)

# 6. Get the title of every job posting (text only)
trs = soup.find_all('tr')[1:]  # find all tr tags and skip the first row
for tr in trs:
    tds = tr.find_all('td')  # all td tags inside this tr
    job_name = tds[0].string  # the first td holds the job title
    print(job_name)
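Step 6 above extracts only the job title. A natural next step is to keep every column of a row together. The sketch below is one possible way to do that, using a shortened, made-up table with English cell values and the built-in "html.parser"; it pairs each row with the header cells to build dictionaries.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the job table used in the demo.
html = """
<table class="tablelist">
  <tr class="h"><td>Title</td><td>Category</td><td>Openings</td><td>Location</td><td>Posted</td></tr>
  <tr class="even"><td><a href="position_detail.php?id=1">Backend developer</a></td><td>Tech</td><td>2</td><td>Shenzhen</td><td>2017-11-25</td></tr>
  <tr class="odd"><td><a href="position_detail.php?id=2">Operations engineer</a></td><td>Tech</td><td>1</td><td>Shenzhen</td><td>2017-11-24</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")

# Use the header row as dictionary keys, then zip each data row against it.
keys = [td.get_text(strip=True) for td in rows[0].find_all("td")]
jobs = []
for tr in rows[1:]:
    values = [td.get_text(strip=True) for td in tr.find_all("td")]
    job = dict(zip(keys, values))
    job["href"] = tr.find("a").get("href")  # keep the detail link as well
    jobs.append(job)

for job in jobs:
    print(job)
```

A list of dictionaries like this can be handed straight to csv.DictWriter, which is how the case study at the end stores its results.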

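VI. The select() method

Searching by class name

Use .class, where class stands for the value of the class attribute
print(soup.select('.sister'))

Searching by id

Use #id, where id stands for the value of the id attribute
print(soup.select('#link1'))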

Code demo 01:

from bs4 import BeautifulSoup

html = """
<table class="tablelist" cellpadding="0" cellspacing="0">
    <tbody>
        <tr class="h">
            <td class="l" width="374">职位名称</td>
            <td>职位类别</td>
            <td>人数</td>
            <td>地点</td>
            <td>发布时间</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=33824&keywords=python&tid=87&lid=2218">22989-金融云区块链高级研发工程师（深圳）</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=29938&keywords=python&tid=87&lid=2218">22989-金融云高级后台开发</a></td>
            <td>技术类</td>
            <td>2</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=31236&keywords=python&tid=87&lid=2218">SNG16-腾讯音乐运营开发工程师（深圳）</a></td>
            <td>技术类</td>
            <td>2</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=31235&keywords=python&tid=87&lid=2218">SNG16-腾讯音乐业务运维工程师（深圳）</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=34531&keywords=python&tid=87&lid=2218">TEG03-高级研发工程师（深圳）</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=34532&keywords=python&tid=87&lid=2218">TEG03-高级图像算法研发工程师（深圳）</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=31648&keywords=python&tid=87&lid=2218">TEG11-高级AI开发工程师（深圳）</a></td>
            <td>技术类</td>
            <td>4</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=32218&keywords=python&tid=87&lid=2218">15851-后台开发工程师</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=32217&keywords=python&tid=87&lid=2218">15851-后台开发工程师</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a id="test" class="test" target='_blank' href="position_detail.php?id=34511&keywords=python&tid=87&lid=2218">SNG11-高级业务运维工程师（深圳）</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
    </tbody>
</table>
"""
soup = BeautifulSoup(html, 'lxml')

# 1. Get all tr tags
trs = soup.select('tr')
print(trs)

# 2. Get the second tr tag
tr = soup.select('tr')[1]
print(tr)

# 3. Get every tag whose class is "even"
trs = soup.select('.even')
print(trs)
trs = soup.select('tr[class="even"]')
print(trs)

# 4. Get the href attribute of every a tag
a_tags = soup.select('a')
for a in a_tags:
    href = a['href']
    print(href)

# 5. Get the title of every job posting
trs = soup.select('tr')[1:]
for tr in trs:
    info = list(tr.stripped_strings)[0]
    print(info)

VII. Modifying the document tree

Changing a tag's name and attributes
Code demo 01:

from bs4 import BeautifulSoup


html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')

# Change the tag's name and attributes
tag_p = soup.p
print(tag_p)
tag_p['class'] = 'content'  # change an attribute
tag_p.name = 'w'  # change the name
print(tag_p)

Changing .string: assigning to the .string attribute replaces the tag's original content with the new content
Code demo 02:

from bs4 import BeautifulSoup


html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')
# Assigning to .string replaces the tag's original content with the new content
tag_p = soup.p
print(tag_p.string)
tag_p.string = 'you need python'
print(tag_p)
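It is easy to miss that "replaces the original content" includes nested tags, not just text. The short sketch below isolates the first paragraph of the demo document, parsed with the built-in "html.parser", to show that the inner b tag disappears after the assignment.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title"><b>The Dormouse\'s story</b></p>', "html.parser")
tag_p = soup.p

# Before: <p> has a single <b> child, so .string reaches through to its text.
before = tag_p.string

tag_p.string = "you need python"

# After: the <b> child is gone and only the new text remains.
print(tag_p)
print(tag_p.b)
```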

append(): adds content to a tag, much like the .append() method of a Python list
Code demo 03:

from bs4 import BeautifulSoup


html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')
# append() adds content to a tag, much like the .append() method of a Python list
tag_p = soup.p
print(tag_p)
tag_p.append('123')
print(tag_p)
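append() is not limited to strings. As a small sketch, the example below appends a string and then a newly created tag. It assumes the built-in "html.parser" and uses soup.new_tag(), a Beautiful Soup method that the demo above does not cover.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title"><b>The Dormouse\'s story</b></p>', "html.parser")
tag_p = soup.p

# Appending a plain string adds a text node after the existing children.
tag_p.append("123")

# new_tag() builds a tag that belongs to this soup; append() then attaches it.
link = soup.new_tag("a", href="http://example.com/elsie")
link.string = "Elsie"
tag_p.append(link)

print(tag_p)
print(len(tag_p.contents))
```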

decompose(): removes a tag from the tree; useful for dropping paragraphs you do not need
Code demo 04:

from bs4 import BeautifulSoup


html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')
# decompose() removes a tag from the tree; useful for dropping paragraphs you do not need
r = soup.find(class_='title')
# print(r)
r.decompose()
print(soup)
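decompose() throws the removed tag away. When the removed piece is still needed, Beautiful Soup also offers extract(), which is not used in the demo above. The sketch below contrasts the two on a made-up two-paragraph fragment, again assuming the built-in "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with one paragraph to keep aside and one to discard.
html_doc = '<body><p class="title">Heading</p><p class="story">Body text</p></body>'
soup = BeautifulSoup(html_doc, "html.parser")

# extract() detaches the tag from the tree and returns it for later use.
removed = soup.find(class_="title").extract()
print(removed.get_text())

# decompose() detaches the tag and destroys it; nothing usable comes back.
soup.find(class_="story").decompose()

print(soup)
```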

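BeautifulSoup4 case study

'''
Requirement:
Scrape the name and temperature of every city in the country (including the districts of the municipalities) and save them to a CSV file.
'''

Code: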

import requests
import csv
from bs4 import BeautifulSoup


titles = ('city', 'temp')


def parse_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36'}
    # Send the request and read the response body
    response = requests.get(url, headers=headers)
    # print(response.content.decode('utf-8'))
    text = response.content.decode('utf-8')

    # Parse the page: find the div with class="conMidtab" that wraps the whole listing
    # html5lib copes with malformed or unclosed tags
    soup = BeautifulSoup(text, 'html5lib')
    conMidtab = soup.find('div', class_='conMidtab')
    # print(conMidtab)

    # Each table holds one municipality or one province
    tables = conMidtab.find_all('table')
    # List that collects the results
    lst = []
    for table in tables:
        # Find the tr tags in the table (skipping the first two)
        trs = table.find_all('tr')[2:]
        # enumerate() yields (index, value) pairs
        for index, tr in enumerate(trs):
            # Each tr describes one city
            # Within the tr, td 0 is the city name and the second-to-last td is the temperature
            tds = tr.find_all('td')
            # Taking the city name from tds[0] works for every row except the first
            # city of each province: in that first tr the name is in tds[1] instead
            city_td = tds[0]  # the city cell
            if index == 0:  # for the first row, use the second td
                city_td = tds[1]

            info = {}

            city = list(city_td.stripped_strings)[0]  # the city name
            temp_td = tds[-2]  # the temperature cell
            temp = list(temp_td.stripped_strings)[0]

            info['city'] = city
            info['temp'] = temp
            lst.append(info)

            print('city:', city, 'temp:', temp)
    return lst


def writerDate(lst):
    with open('weather.csv', 'w', encoding='utf-8', newline='') as file_obj:
        writer = csv.DictWriter(file_obj, titles)
        writer.writeheader()
        writer.writerows(lst)


def main():
    lst = []
    # Request North China (hb) first, then the other regions
    urls = ['hb', 'db', 'hd', 'hz', 'hn', 'xb', 'xn', 'gat']
    for url in urls:
        country_url = 'http://www.weather.com.cn/textFC/' + str(url) + '.shtml'
        lst += parse_page(country_url)

    writerDate(lst)


if __name__ == '__main__':
    main()
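The index == 0 special case is easier to follow on a small offline example. The sketch below does not contact the site. It uses invented markup and assumes that the first data row of each table starts with an extra province cell, which is what the check in parse_page suggests. The built-in "html.parser" is used because this markup is well formed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating one province table; not copied from the live site.
html = """
<div class="conMidtab">
<table>
  <tr><td>header row 1</td></tr>
  <tr><td>header row 2</td></tr>
  <tr><td rowspan="2">Hebei</td><td>Shijiazhuang</td><td>Sunny</td><td>3</td><td>detail</td></tr>
  <tr><td>Baoding</td><td>Cloudy</td><td>1</td><td>detail</td></tr>
</table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("div", class_="conMidtab").find("table")

results = []
for index, tr in enumerate(table.find_all("tr")[2:]):
    tds = tr.find_all("td")
    # Only the first data row carries the province cell, so its city is one cell later.
    city_td = tds[1] if index == 0 else tds[0]
    results.append({
        "city": city_td.get_text(strip=True),
        "temp": tds[-2].get_text(strip=True),
    })

print(results)
```

Counting cells from the end (tds[-2]) keeps the temperature lookup the same for both row shapes, since the extra cell only shifts positions counted from the start.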

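Analysis:
Approach: parse the pages with bs4
1. Each page holds the data for one region, so pick one region first and then swap the region code into the URL dynamically
2. Analyze the page structure

1. Find the div with class="conMidtab" that wraps the whole listing
2. Then find the table for each municipality or province
3. Find the tr tags in each table (skipping the first two)
4. Find the td tags in each tr (td 0 is the city name and the second-to-last td is the temperature)
Summary:
1. How bs4 works: use find() or find_all() to locate tags and their content
2. How to filter out tags or data you do not need (a basic skill)
3. What enumerate() does and how to use it: it yields two values per item, the index and the value
4. How to analyze and diagnose the problem when the extracted data is wrong

city_td = tds[0]  # the city cell
if index == 0:  # for the first row, use the second td
    city_td = tds[1]

5. Handling several URL pages: put the URLs in a list and loop over it
6. When a page has missing or malformed tags, bs4 can use a more forgiving parser (html5lib)
7. Using CSV files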
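The enumerate() and CSV points from the summary can be tried without any scraping. The following sketch uses only the standard library and made-up sample rows; an in-memory buffer stands in for the weather.csv file so that nothing is written to disk.

```python
import csv
import io

titles = ("city", "temp")
rows = [
    {"city": "Beijing", "temp": "-3"},   # made-up sample values
    {"city": "Tianjin", "temp": "-1"},
]

# An in-memory buffer stands in for open('weather.csv', 'w', newline='').
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=titles)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading it back: DictReader takes the field names from the header line.
read_back = list(csv.DictReader(io.StringIO(csv_text, newline="")))

# enumerate() yields (index, value) pairs, index first.
for index, row in enumerate(read_back):
    print(index, row["city"], row["temp"])
```

Values come back from DictReader as strings, so temperatures need an explicit int() or float() conversion before any arithmetic.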
