Introduction to the BeautifulSoup module

Latest recommended article published on 2024-01-30 22:20:17

weixin_30664051


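License: CC 4.0 BY-SA

Tags: python, web crawler

Original link: http://www.cnblogs.com/klsfct/p/9197355.html

This article introduces basic methods for web crawling with Python: sending GET and POST requests with the requests library, parsing HTML documents with BeautifulSoup, and extracting specific information with regular expressions. A practical example shows how to scrape post links that follow a particular format.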

01 Web crawler basics

  Related libraries: 1. requests, re  2. BeautifulSoup  3. hackhttp

  Use requests to send GET and POST requests and to read the status code and the response content;
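As a minimal sketch of the GET/POST step above: the example.com URLs, the `fetch` helper, and the form fields are assumptions made up for illustration and do not come from the original article. The demo at the bottom only prepares the requests, so nothing is sent over the network.

```python
import requests

# Hypothetical endpoints, used only for illustration.
GET_URL = "https://example.com/search"
POST_URL = "https://example.com/login"


def fetch(url, form=None, timeout=10):
    """Send a GET (or a POST when form data is given); return (status code, text)."""
    if form is None:
        r = requests.get(url, timeout=timeout)
    else:
        r = requests.post(url, data=form, timeout=timeout)
    return r.status_code, r.text


# Build both request types without sending them, to inspect what would go out.
get_req = requests.Request("GET", GET_URL, params={"q": "python"}).prepare()
post_req = requests.Request("POST", POST_URL, data={"user": "demo"}).prepare()

print(get_req.method, get_req.url)     # query string is appended to the URL
print(post_req.method, post_req.body)  # form data is URL-encoded into the body
```

Calling `fetch(url)` against a real page would return the status code and the decoded body, which are the two values the step above refers to.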

  Use re to match an arbitrary post
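A small sketch of matching posts with re alone. The HTML string is a made-up stand-in for a downloaded page, and the pattern assumes post links look like thread-41730-1-1.html, the format used later in this article.

```python
import re

# Hypothetical HTML standing in for a downloaded forum page.
html = '''
<a href="thread-41730-1-1.html">First post</a>
<a href="forum-12-1.html">Board index</a>
<a href="thread-41802-1-1.html">Second post</a>
'''

# Raw string so that \d and \. reach the regex engine unchanged.
pattern = re.compile(r'thread-\d+-1-1\.html')

first = pattern.search(html)       # first match object, or None if nothing matches
all_links = pattern.findall(html)  # every non-overlapping match, as strings

print(first.group(0))
print(all_links)
```

`search` is enough when any one post will do; `findall` collects all of them.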

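 Using the BeautifulSoup module: be sure to read the official documentation at http://beautifulsoup.readthedocs.io/zh_CN/latest/

  1. Parse the content: soup = BeautifulSoup(html)

  2. Navigate the data: soup.title   soup.title.string
  3. Use regular expressions with BeautifulSoup: soup.find_all(name='x', attrs={'xx': re.compile('x')})
            name is the tag name; attrs holds the tag's attributes and the values they must match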

# How do we match links such as thread-41730-1-1.html?
bbs_new = soup.find_all(name='a', attrs={'href': re.compile(r'thread-\d+-1-1\.html')})
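The three steps above (parse, navigate, filter with a regex) can be tried offline with a sketch like the following. The HTML document is invented for illustration, and the built-in "html.parser" is used here instead of lxml so that no extra package is needed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page content; a real crawler would pass in the downloaded HTML.
html = """
<html><head><title>Demo forum</title></head>
<body>
<a href="thread-41730-1-1.html">First post</a>
<a href="forum-12-1.html">Board index</a>
<a href="thread-41802-1-1.html">Second post</a>
</body></html>
"""

# 1. Parse the content.
soup = BeautifulSoup(html, "html.parser")

# 2. Navigate the data.
print(soup.title)         # the whole <title> tag
print(soup.title.string)  # only the text inside it

# 3. Keep only <a> tags whose href matches the post-link pattern.
posts = soup.find_all(name="a", attrs={"href": re.compile(r"thread-\d+-1-1\.html")})
titles = [a.string for a in posts]
print(titles)
```

The board-index link is skipped because its href does not match the pattern.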

02 A simple crawler implementation

03 Applying regular expressions

04 Multithreaded Python crawler

05 Crawler in practice

#coding=utf-8
import re

import requests
from bs4 import BeautifulSoup

# Address to crawl
url = 'https://bbs.ichunqiu.com/portal.php'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}


# Send a GET request to the url
r = requests.get(url=url, headers=headers)

print(r.status_code)
# r.content holds the raw HTML of the page (as bytes)
print(r.content)
# Hand the HTML to BeautifulSoup for parsing
soup = BeautifulSoup(r.content, 'lxml')  # the 'lxml' parser requires the lxml package to be installed
print(soup.title)
print(soup.title.string)
# Example of extracting content; a regex is the all-purpose approach
#bbs_new=soup.find_all(name='a',attrs={'target':"blank", 'class':"ui_colorG" ,'style':"color: #555555;"})

# How do we match links such as thread-41730-1-1.html?
bbs_new = soup.find_all(name='a', attrs={'href': re.compile(r'thread-\d+-1-1\.html')})

for new in bbs_new:
    print(new.string)  # without .string, the whole tag is printed
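
The script above prints post titles. Since the goal is to collect post links, here is a hedged sketch of pulling the href out of a matched tag and turning it into an absolute URL. The sample tag is invented, and it is an assumption that the site serves relative hrefs; `urljoin` leaves already-absolute links unchanged.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE = "https://bbs.ichunqiu.com/portal.php"  # the page crawled in the script above

# Hypothetical tag with nested markup, to show how .string behaves.
html = '<a href="thread-41730-1-1.html"><b>Hot</b> First post</a>'
link = BeautifulSoup(html, "html.parser").a

print(link)             # the whole tag
print(link.string)      # None, because the tag has more than one child
print(link.get_text())  # all nested text joined together
full_url = urljoin(BASE, link["href"])
print(full_url)
```

When a link wraps other tags, `get_text()` is the safer way to read its title, because `.string` only works for a tag with a single child.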