A Simple Introduction to Web Crawling with requests and bs4

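This post introduces the basic workflow of a web crawler, including fetching page content with the requests library and parsing HTML with BeautifulSoup. It explains the get and post methods of requests and how to construct request headers. It also shows how to locate and extract information from a page with BeautifulSoup, and ends with an example that scrapes a ranking of Chinese universities.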

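Web Crawler Basics (based on Song Tian's crawler course)

This post shows how to use a few basic libraries to fetch and analyze the content of HTML pages. It is organized into the following parts:

An introduction to the basic workflow of a web crawler
Basic usage of the requests library
Parsing pages with BeautifulSoup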

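1. The basic workflow of a web crawler

"The Website is the API": we should treat a web page as an interface for obtaining information, and a Python crawler lets us pull the information we need out of it.
General steps:
(1) Fetch the content of the HTML page with the requests library
(2) Parse the fetched HTML page with the BeautifulSoup library
(3) Use BeautifulSoup and regular expressions to further extract the key information we want
(4) Format the information and output it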
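
Steps (3) and (4) can be tried offline before any network code is written. The following is only a minimal sketch using the standard-library re module: the HTML constant is a fragment abridged from the demo page used later in this post, standing in for a fetched page. The names LINK_PATTERN, extract_links and format_links are hypothetical helpers introduced here for illustration, and the pattern is an assumption that fits this fragment rather than a general HTML parser.

```python
import re

# Stand-in for the HTML that step (1) would fetch; abridged from the demo
# page used later in this post so the sketch runs without network access.
HTML = (
    '<p class="course">Python is a wonderful general-purpose programming language. '
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>'
)

# Step (3): capture each link's href value and its visible text.
# Assumption: href is the first attribute and the link text contains no nested tags.
LINK_PATTERN = re.compile(r'<a\s+href="([^"]+)"[^>]*>([^<]+)</a>')


def extract_links(html):
    """Return a list of (text, url) pairs, one per matching <a> tag."""
    return [(text, url) for url, text in LINK_PATTERN.findall(html)]


def format_links(links):
    """Step (4): format each pair as one line, with the text padded to 16 columns."""
    return ["{:<16}{}".format(text, url) for text, url in links]


if __name__ == "__main__":
    for line in format_links(extract_links(HTML)):
        print(line)
```

Regular expressions are fragile against real-world HTML, which is why the workflow above runs the page through BeautifulSoup first and keeps regular expressions for the final fine-grained matching.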

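2. Basic usage of the requests library

The requests library provides several request methods; here we cover the two most important ones, get and post.
The simplest request method, get: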

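import requests
r = requests.get("http://python123.io/ws/demo.html")
print(r)
# Output: <Response [200]>  (status code 200 means the request succeeded)
r.encoding = r.apparent_encoding   # set the encoding to the one detected from the response content
r.text    # in an interactive session, this displays the HTML content of the page: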
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
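
In a real crawler the request can fail or return an error status, so the session above is often wrapped in a small helper. The following is a minimal sketch, not a definitive implementation: the function name fetch_html and the 10-second default timeout are assumptions made for illustration, while requests.get, raise_for_status, apparent_encoding and requests.RequestException are existing parts of the requests library.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page text for url, or an empty string if the request fails."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()              # raise HTTPError for 4xx/5xx status codes
        r.encoding = r.apparent_encoding  # use the encoding detected from the body
        return r.text
    except requests.RequestException:
        # Covers connection errors, timeouts, invalid URLs and bad status codes
        return ""
```

Calling fetch_html("http://python123.io/ws/demo.html") should return the same HTML as r.text above when the site is reachable. Returning an empty string keeps the calling code simple, at the cost of hiding the reason for a failure.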

The above shows