Scraping an article with Python (mixed images, tables and text)

zpeien

Last modified 2022-08-16 21:49:41

License: CC 4.0 BY-SA

Category: python. Tags: python, Beautiful Soup

First published 2019-08-10 16:19:01

Article link: https://blog.youkuaiyun.com/qq_44920726/article/details/99072226

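Using the Beautiful Soup library

Beautiful Soup is a powerful library for parsing HTML pages. For extracting information it is usually more convenient than regular expressions: instead of working out a pattern for every piece of data you need, you call the library's search methods directly.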

import requests
from bs4 import BeautifulSoup

url = 'http://old.pep.com.cn/czsx/xszx/czsxtbjxzy/czsxdzkb/czsxdzkb7s_1_1_1_1_1/201112/t20111208_1088252.htm'
content = requests.get(url)
# Let requests guess the real encoding so the Chinese text decodes correctly
content.encoding = content.apparent_encoding
soup = BeautifulSoup(content.text, 'lxml')
title = soup.find('title').text

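find() returns the first title node in the HTML, and .text extracts the text inside it. That is all it takes to get the title of the page, which is much simpler than writing a regular expression.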
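As a small offline illustration of find() and .text, the sketch below parses a hard-coded HTML string instead of a downloaded page. The HTML snippet is made up for the example, and it uses the built-in 'html.parser' rather than 'lxml' only so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page
html = (
    "<html><head><title>Chapter 1 / Rational numbers</title></head>"
    "<body><p>Hello</p></body></html>"
)

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None when nothing matches
title_tag = soup.find("title")
title = title_tag.text if title_tag is not None else ""
print(title)
```

Checking for None before reading .text avoids an AttributeError on pages that have no title tag.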

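# All <a> tags whose target attribute is "_self"
da = soup.find_all('a', target="_self")
# The indexes below are specific to the layout of this page
file1 = da[4].text
file2 = da[1].text[:2]
file3 = da[1].text[2:]
file4 = da[5].text
print(file1)
print(file2)
print(file3)
print(file4)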

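find_all('a', target="_self") finds every a node in the HTML whose target attribute is "_self" and returns them as a list, stored here in da. You can then pick out and tidy up whichever entries you need.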
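Indexing the result list directly, as above, raises IndexError when a page has fewer links than expected. The following sketch shows the same attribute filter on a made-up HTML fragment, with a small hypothetical helper, text_at, that returns a default instead of failing.

```python
from bs4 import BeautifulSoup

# Made-up fragment: two links match target="_self", one does not
html = """
<div>
  <a href="/a" target="_self">Home</a>
  <a href="/b" target="_blank">Ads</a>
  <a href="/c" target="_self">Grade 7 Volume 1</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a", target="_self")
texts = [a.text for a in links]


def text_at(index, default=""):
    """Return the text of the link at index, or default if there is no such link."""
    return texts[index] if index < len(texts) else default


print(texts)
print(text_at(1)[:5])
print(repr(text_at(5)))
```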

title = title.replace('\n', '').replace(' ', '_').replace('/', '_')

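replace('\n', '') removes the newline characters from title (it replaces them with an empty string, not a space), and the other two calls turn spaces and slashes into underscores. This cleans up the text so that it can be used as a file name later.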
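The chained replace() calls handle only three characters. A slightly more general variant, shown here as a sketch and not as the approach used in this article, replaces every character that Windows rejects in file names in one pass. The function name safe_filename is made up for the example.

```python
import re


def safe_filename(title):
    """Return title with newlines removed and awkward characters replaced by '_'."""
    title = title.replace("\n", "").strip()
    # Backslash, slash, colon, asterisk, question mark, quote,
    # angle brackets, pipe and any whitespace become underscores
    return re.sub(r'[\\/:*?"<>|\s]', "_", title)


print(safe_filename("Unit 1/2: intro\n"))
```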

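for child in soup.children:
    # Re-parse each direct child so it can be searched on its own
    child = BeautifulSoup(str(child), 'lxml')

    # Child contains an <img>: insert the image into the Word document
    # (placeholder here; the complete code at the end handles this case)
    if child.img:
        print("image found")
    # Child contains a <tr>: insert a table into the Word document
    elif child.tr:
        rows = child.find_all('tr')
        cols = rows[0].find_all('td')
        # Create an empty table
        table = currentDocument.add_table(len(rows), len(cols))
        # Write the content into the matching cells
        for rindex, row in enumerate(rows):
            for cindex, col in enumerate(row.find_all('td')):
                try:
                    cell = table.cell(rindex, cindex)
                    cell.text = col.text
                except Exception:
                    # Skip cells that cannot be written,
                    # for example an index outside the table
                    pass
    # Plain text: write it straight into the Word document
    elif child.p:
        para = child.p.text.replace('\n', '')
        currentDocument.add_paragraph(text=para)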

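soup.children iterates over the direct child nodes of soup, so every part of the article is visited in document order.

child.img is the first img tag inside the child (None if there is none), so the test succeeds when the node contains an image. In the same way, child.tr succeeds when the node contains a table row.

child.p.text is the text of the first p node inside child.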
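The branching on img, tr and p can be tried offline. The sketch below is an alternative formulation, not the code used in this article: instead of re-parsing every child, it skips plain strings with isinstance and inspects each Tag directly. The HTML fragment and the classify helper are made up for the example, and 'html.parser' is used so that no extra parser is needed.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up article body with one text paragraph, one paragraph
# containing a picture, and one table
html = (
    '<div class="Section1">'
    "<p>Plain text</p>"
    '<p>Before <img src="./a.gif"/> after</p>'
    "<table><tr><td>1</td><td>2</td></tr></table>"
    "</div>"
)


def classify(node):
    """Return 'image', 'table', 'text' or 'other' for a Tag, in that priority order."""
    if node.name == "img" or node.find("img"):
        return "image"
    if node.name == "tr" or node.find("tr"):
        return "table"
    if node.name == "p" or node.find("p"):
        return "text"
    return "other"


section = BeautifulSoup(html, "html.parser").find("div", class_="Section1")
# .children also yields NavigableString objects; keep only real tags
kinds = [classify(child) for child in section.children if isinstance(child, Tag)]
print(kinds)
```

The order of the checks matters: a paragraph that contains a picture is also a p node, so the image test has to come first, as it does in the article's loop.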

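Navigating nodes with Beautiful Soup

This part is taken from the Beautiful Soup 4.4.0 documentation. The examples assume that soup holds the "Alice" sample document used throughout that documentation.

Child nodes

A Tag may contain several strings or other Tags; these are all children of that Tag. Beautiful Soup provides many attributes for navigating and iterating over a tag's children.

Note: string nodes in Beautiful Soup do not support these attributes, because a string has no children.

Tag names

The simplest way to navigate the document tree is to name the tag you want. To get the <head> tag, just use soup.head: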

soup.head
# <head><title>The Dormouse's story</title></head>

soup.title
# <title>The Dormouse's story</title>

This trick can be chained: you can use it again on the tag you just got. The following code gets the first <b> tag inside the <body> tag:

soup.body.b
# <b>The Dormouse's story</b>

Accessing a tag as an attribute only gives you the first tag with that name:

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

To get all the <a> tags, or anything more than the first tag with a given name, use one of the methods described in Searching the tree, such as find_all():

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

.contents and .children

A tag's .contents attribute returns the tag's children as a list:

head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# ["The Dormouse's story"]

The BeautifulSoup object itself has children. In this case the <html> tag is the child of the BeautifulSoup object:

len(soup.contents)
# 1
soup.contents[0].name
# 'html'

A string has no .contents attribute, because it cannot contain anything:

text = title_tag.contents[0]
text.contents
# AttributeError: 'NavigableString' object has no attribute 'contents'

Instead of getting the children as a list, you can iterate over them with the .children generator:

for child in title_tag.children:
    print(child)
    # The Dormouse's story

.descendants

The .contents and .children attributes only cover a tag's direct children. For example, the <head> tag has a single direct child, the <title> tag:

head_tag.contents
# [<title>The Dormouse's story</title>]

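But the <title> tag itself has a child: the string "The Dormouse's story". That string is therefore also a descendant of the <head> tag. The .descendants attribute iterates recursively over all of a tag's descendants: its direct children, the children of those children, and so on: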

for child in head_tag.descendants:
    print(child)
    # <title>The Dormouse's story</title>
    # The Dormouse's story

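In the example above, the <head> tag has only one child but two descendants: the <title> tag and the <title> tag's child. The BeautifulSoup object has only one direct child (the <html> tag), but it has many descendants: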

len(list(soup.children))
# 1
len(list(soup.descendants))
# 25
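The difference between children and descendants can be checked on a tiny document. This is a minimal sketch with a made-up one-line HTML string, parsed with 'html.parser' so that no wrapper tags are added.

```python
from bs4 import BeautifulSoup

html = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html, "html.parser")

head = soup.head
children = list(head.children)        # direct children only: the <title> tag
descendants = list(head.descendants)  # the <title> tag and the string inside it

print(len(children), len(descendants))
print(descendants[1])
```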

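Parent nodes

Continuing up the document tree: every tag and every string has a parent, the tag that contains it.

.parent

Use the .parent attribute to get an element's parent. In the "Alice" example document, the <head> tag is the parent of the <title> tag: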

title_tag = soup.title
title_tag
# <title>The Dormouse's story</title>
title_tag.parent
# <head><title>The Dormouse's story</title></head>

The title string itself has a parent: the <title> tag that contains it.

title_tag.string.parent
# <title>The Dormouse's story</title>

The parent of a top-level tag such as <html> is the BeautifulSoup object itself:

html_tag = soup.html
type(html_tag.parent)
# <class 'bs4.BeautifulSoup'>

The .parent of the BeautifulSoup object is None:

print(soup.parent)
# None

.parents

The .parents attribute iterates over all of an element's ancestors. The example below uses .parents to walk from an <a> tag up to the root of the document:

link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
for parent in link.parents:
    print(parent.name)
# p
# body
# html
# [document]
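A common use of .parents is to build the path from a node up to the root. The sketch below does this on a made-up fragment parsed with 'html.parser'; the generator stops at the BeautifulSoup object, whose name is '[document]'.

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a id='link1'>Elsie</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Names of all ancestors of the first <a> tag, innermost first
names = [parent.name for parent in soup.a.parents]
print(names)
```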

Sibling nodes

Consider a simple example:

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", 'lxml')
print(sibling_soup.prettify())
# <html>
#  <body>
#   <a>
#    <b>
#     text1
#    </b>
#    <c>
#     text2
#    </c>
#   </a>
#  </body>
# </html>

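Because the <b> tag and the <c> tag are at the same level (they are both direct children of the same tag), they are called siblings. When a document is pretty-printed, siblings appear at the same indentation level. You can also use this relationship in your code.

.next_sibling and .previous_sibling

Use .next_sibling and .previous_sibling to move between sibling nodes in the document tree: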

sibling_soup.b.next_sibling
# <c>text2</c>

sibling_soup.c.previous_sibling
# <b>text1</b>

The <b> tag has a .next_sibling but no .previous_sibling, because nothing comes before it at the same level of the tree. For the same reason, the <c> tag has a .previous_sibling but no .next_sibling:

print(sibling_soup.b.previous_sibling)
# None
print(sibling_soup.c.next_sibling)
# None

The strings "text1" and "text2" are not siblings, because they do not have the same parent:

sibling_soup.b.string
# 'text1'

print(sibling_soup.b.string.next_sibling)
# None

In a real document, the .next_sibling or .previous_sibling of a tag is usually a string containing text or whitespace. Look again at the "Alice" document:

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>

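You might expect the .next_sibling of the first <a> tag to be the second <a> tag. In fact it is a string: the comma and the newline that separate the first <a> tag from the second:

link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

link.next_sibling
# ',\n'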

The second <a> tag is the .next_sibling of that comma string:

link.next_sibling.next_sibling
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
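When only tags are of interest, calling .next_sibling twice is fragile. Beautiful Soup also offers find_next_sibling() and find_next_siblings(), which skip the strings in between. The sketch below uses a shortened, made-up version of the "Alice" paragraph.

```python
from bs4 import BeautifulSoup

html = (
    '<p><a id="link1">Elsie</a>,\n'
    '<a id="link2">Lacie</a> and\n'
    '<a id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# The immediate sibling is the string between the two tags
raw_sibling = first.next_sibling

# find_next_sibling() skips strings and returns the next matching tag
next_link = first.find_next_sibling("a")
later_ids = [a["id"] for a in first.find_next_siblings("a")]

print(repr(raw_sibling), next_link["id"], later_ids)
```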

.next_siblings and .previous_siblings

The .next_siblings and .previous_siblings attributes iterate over the siblings of the current node:

for sibling in soup.a.next_siblings:
    print(repr(sibling))
    # ',\n'
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    # ' and\n'
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    # '; and they lived at the bottom of a well.'

for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))
    # ' and\n'
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    # ',\n'
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    # 'Once upon a time there were three little sisters; and their names were\n'

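Saving the scraped content to a Word file

from docx import Document

# title, para and pic come from the scraping steps above
# Create an empty Word document
currentDocument = Document()
# Write the article title
currentDocument.add_heading(title)
# Write text as a new paragraph
currentDocument.add_paragraph(text=para)
# Create an empty paragraph
p = currentDocument.add_paragraph('')
# Add text to that paragraph
p.add_run(para)
# Add a picture at the end of the paragraph
run = p.add_run()
run.add_picture(pic)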

Complete code

from time import sleep
from os import mkdir
from os.path import isdir
import random

import requests
from bs4 import BeautifulSoup
from docx import Document

def get_agent():
    '''
    Build a fake request header:
    return a dict with a randomly chosen User-agent value.
    '''
    agents = ['Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;',
              'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv,2.0.1) Gecko/20100101 Firefox/4.0.1',
              'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
              'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
              'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)']
    fakeheader = {}
    fakeheader['User-agent'] = random.choice(agents)
    return fakeheader

# Folder for the generated Word documents; create it if it does not exist
dstDir = 'words'
if not isdir(dstDir):
    mkdir(dstDir)
    
    
for a in range(0,10):
    # Wait five seconds before each request
    sleep(5)
    url = 'http://old.pep.com.cn/czsx/xszx/czsxtbjxzy/czsxdzkb/czsxdzkb7s_1_1_1_1_1/201112/t20111208_1088252.htm'
    content = requests.get(url,headers = get_agent())
    content.encoding = content.apparent_encoding
    soup = BeautifulSoup(content.text, 'lxml')
    title = soup.find('title').text
    # Navigation links; the indexes are specific to the layout of this page
    da = soup.find_all('a', target="_self")
    file1 = da[4].text
    file2 = da[1].text[:2]
    file3 = da[1].text[2:]
    file4 = da[5].text
    print(file1)
    print(file2)
    print(file3)
    print(file4)
    # mkdir only creates the last path component, so D://教育数据包 must already exist
    file = 'D://教育数据包//' +file1
    if not isdir(file):
        mkdir(file)
    # .text ignores all HTML tags inside the node.
    # Remove newlines and replace spaces and slashes, which are awkward in file names
    title = title.replace('\n', '').replace(' ', '_').replace('/', '_')
    print(title)

    # Create an empty Word document
    currentDocument = Document()
    # Write the article title
    currentDocument.add_heading(title)
    # Find the body of the article
    soup = soup.find('div',attrs={'class': 'Section1'})
    if not soup:
        continue
    for child in soup.children:
        # Re-parse each direct child so it can be searched on its own
        child = BeautifulSoup(str(child), 'lxml')

        # Child contains an <img>: rebuild the paragraph run by run,
        # inserting each picture where it appears in the text
        if child.img:
            de = child.find('p')
            p = currentDocument.add_paragraph('')
            for part in de.children:
                part = BeautifulSoup(str(part), 'lxml')
                if part.img:
                    des = part.find("span")
                    # Full text of the span, used to recover the text between pictures
                    para1 = des.text.replace('\n', '')
                    n = 0  # position in para1 that has already been written
                    for piece in des.children:
                        piece = BeautifulSoup(str(piece), 'lxml')
                        if piece.img:
                            # Download the picture to a temporary file
                            pic = 'temp.gif'
                            # Page URL without the file name, plus src without its first character
                            ur = url[:-21]+piece.img['src'][1:]
                            data = requests.get(ur,headers=get_agent())
                            with open(pic, 'wb') as fp:
                                for chunk in data.iter_content(chunk_size=100):
                                    fp.write(chunk)
                            para2 = piece.text.replace('\n', '')
                            if para2:
                                # Write the text that precedes this piece, then stop searching
                                for i in range(n , len(para1)):
                                    if para1[i]==para2[0]:
                                        s=para1[n:i]
                                        j= len(para2)
                                        n=i+j
                                        p.add_run(s)
                                        break
                            try:
                                run = p.add_run()
                                run.add_picture(pic)
                            except Exception:
                                # Skip pictures that cannot be inserted
                                pass
                            p.add_run(para2)
                        else:
                            para3 = piece.text.replace('\n', '')
                            j= len(para3)
                            n+=j
                            p.add_run(para3)
                else:
                    para = part.text.replace('\n', '')
                    p.add_run(para)

        # Child contains a <tr>: insert a table into the Word document
        elif child.tr:
            rows = child.find_all('tr')
            cols = rows[0].find_all('td')
            # Create an empty table
            table = currentDocument.add_table(len(rows), len(cols))
            # Write the content into the matching cells
            for rindex, row in enumerate(rows):
                for cindex , col in enumerate(row.find_all('td')):
                    try:
                        cell = table.cell(rindex, cindex)
                        cell.text = col.text
                    except Exception:
                        # Skip cells that cannot be written,
                        # for example an index outside the table
                        pass

        # Plain text: write it straight into the Word document
        elif child.p:
            para = child.p.text.replace('\n', '')
            currentDocument.add_paragraph(text= para)
            # Alternative: write every <p> in the node, not only the first
            # for para in child.find_all('p'):
            #     currentDocument.add_paragraph(text=para.text)
        
    # Save the Word document for this article
    currentDocument.save(dstDir +'\\' + title +'.docx')
    # Stop after the first article; the loop always requests the same URL
    break
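The script builds each picture address with url[:-21] + src[1:], which only works while the page file name is exactly 21 characters long and src starts with a single dot. A more general option, shown here as a sketch and not as part of the original script, is urljoin from the standard library. The URLs and the helper name image_url are made up for the example.

```python
from urllib.parse import urljoin


def image_url(page_url, src):
    """Resolve an img src attribute against the URL of the page that contains it."""
    return urljoin(page_url, src)


page = "http://example.com/a/b/page.htm"
print(image_url(page, "./img/pic.gif"))
print(image_url(page, "../pic.gif"))
print(image_url(page, "/pic.gif"))
```

urljoin handles relative paths, parent directories and site-absolute paths in the same way a browser does, so no string slicing is needed.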