Python Web Crawler Study Notes (Targeted Crawling)

Article link: https://blog.youkuaiyun.com/anderslu/article/details/64189823

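Installing the Python runtime environment on Windows

For Python itself, I recommend a 3.x release, because Python 3 uses UTF-8 as its default source encoding. Download the installer from the official Python download page, double-click it, and follow the steps. For an IDE, I am used to JetBrains PyCharm; for everyday small programs and study, the Community edition (free) is enough, and it is available from the PyCharm download page. Again, double-click the installer and follow the steps. Packages are installed with the pip command; taking Django as an example, run pip install django at the command prompt.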
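
After running pip, it can be handy to confirm that a package is actually importable. The snippet below is a minimal sketch of one way to do that with the standard library; the helper name is_installed is my own and not part of pip or of any library used in these notes. Note that the import name can differ from the pip name (beautifulsoup4 is imported as bs4).

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found on the import path.

    Hypothetical helper for illustration; it only checks that the module
    can be located, it does not import it.
    """
    return importlib.util.find_spec(module_name) is not None


# The three third-party modules used in these notes, by import name.
for name in ("requests", "bs4", "lxml"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

Anything reported as missing can then be installed with pip install followed by the pip package name.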

Preparing for web crawling

Installing the requests library

Install it from the command line with pip: pip install requests

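# The following few lines fetch the Baidu home page and print its content:
import requests

r = requests.get("https://www.baidu.com", timeout=30)
print(r.status_code)  # 200 means the request succeeded
r.encoding = r.apparent_encoding  # use the encoding detected from the response body
print(r.text)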

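Overview of the requests library

Requests is an HTTP library written in Python, built on urllib3 and released under the Apache 2.0 open-source license. It is more convenient than urllib, saves a great deal of work, and fully meets the needs of HTTP testing. Requests is developed around the idioms of PEP 20, so it is more Pythonic than urllib. Just as importantly, it supports Python 3.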

Installing the beautifulsoup4 library

Install it from the command line with pip: pip install beautifulsoup4

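Overview of the beautifulsoup4 library

Beautiful Soup is an HTML/XML parser written in Python. It copes well with malformed markup and builds a parse tree from it, and it provides simple, commonly used operations for navigating, searching, and modifying that tree, which can save a lot of programming time. Parsing with html.parser is relatively slow; the lxml parser is recommended instead, or you could try Scrapy.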
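
To make the parse-tree idea concrete, here is a minimal sketch that runs Beautiful Soup on a small inline table instead of a live page. The markup in SAMPLE_HTML and the helper name extract_rows are made up for illustration; the column layout (rank, name, region, score) is only an assumption that mirrors the indexes used in the ranking example in these notes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup; the real ranking page is not reproduced here.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><td>1</td><td>University A</td><td>Beijing</td><td>95.9</td></tr>
    <tr><td>2</td><td>University B</td><td>Shanghai</td><td>82.6</td></tr>
  </tbody>
</table>
"""


def extract_rows(html, parser="html.parser"):
    """Return [rank, name, score] for each table row in the markup."""
    soup = BeautifulSoup(html, parser)
    tbody = soup.find("tbody")
    rows = []
    if tbody is None:  # no table body in the markup
        return rows
    for tr in tbody.children:
        # .children also yields the whitespace strings between tags; skip them.
        if isinstance(tr, Tag):
            tds = tr("td")  # calling a tag is shorthand for tr.find_all("td")
            rows.append([tds[0].string, tds[1].string, tds[3].string])
    return rows


rows = extract_rows(SAMPLE_HTML)
print(rows)
```

If lxml is installed, passing parser="lxml" switches to the faster parser without changing the rest of the code.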

Installing the lxml library

Install it from the command line with pip: pip install lxml

Overview of the lxml library


Targeted crawler examples:

Fetching the top ten entries of the Chinese university ranking

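import requests
import bs4
from bs4 import BeautifulSoup


def getHtml(url):
    """Return the decoded page text, or an empty string if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise an exception on 4xx/5xx responses
        r.encoding = r.apparent_encoding  # use the encoding detected from the body
        return r.text
    except requests.RequestException:
        return ""


def fillUnivList(ulist, html):
    """Append [rank, name, score] for each row of the ranking table."""
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find('tbody')
    if tbody is None:  # download failed or the page layout changed
        return
    for tr in tbody.children:
        # .children also yields the whitespace strings between rows; keep only tags
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')  # shorthand for tr.find_all('td')
            ulist.append([tds[0].string, tds[1].string, tds[3].string])


def printUnivList(ulist, num):
    """Print the first num entries as a centered three-column table."""
    # chr(12288) is the full-width space; it pads the Chinese names so columns line up
    tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    # Column headers: rank, university name, score
    print(tplt.format("排名", "学校名称", "分数", chr(12288)))
    for u in ulist[:num]:  # slicing avoids an IndexError when fewer rows were parsed
        print(tplt.format(u[0], u[1], u[2], chr(12288)))


def main():
    uinfo = []
    url = "http://www.zuihaodaxue.com/zuihaodaxuepaiming2016.html"
    html = getHtml(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 10)


main()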
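
The format template in printUnivList passes chr(12288) as the fill character for the name column. A Chinese character is usually rendered wider than an ASCII space, so padding with ordinary spaces lets the columns drift; the full-width space (U+3000) keeps them aligned. The snippet below is a small sketch of that technique on its own; the helper name center_cjk is mine and does not appear in the code above.

```python
FULLWIDTH_SPACE = chr(12288)  # U+3000, the CJK full-width (ideographic) space


def center_cjk(text, width):
    """Center text in a field of `width` characters, padded with full-width spaces.

    Hypothetical helper for illustration. The nested {1} and {2} fields
    supply the fill character and the width inside the format spec.
    """
    return "{0:{1}^{2}}".format(text, FULLWIDTH_SPACE, width)


header = center_cjk("学校名称", 10)  # "university name"
print(repr(header))
```

The result is always exactly width characters long, with the padding split between both sides of the text.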

Python code for scraping weather news from the web

I have already used this in a real project, and it works well.

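# Creative web crawler: scrape weather information from the internet to serve
# the company's business software, replacing the outdated approach used before
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE_URL = "http://shanxi.weather.com.cn/"


def getHtml(url):
    """Return the decoded page text, or an empty string if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise an exception on 4xx/5xx responses
        r.encoding = r.apparent_encoding  # use the encoding detected from the body
        return r.text
    except requests.RequestException:
        return ""


def fillWeatherList(ulist, html):
    """Append [headline, absolute URL] for each link in the focus-news block."""
    soup = BeautifulSoup(html, "html.parser")
    focusnews = soup.find('div', 'focusnews')  # the <div class="focusnews"> element
    if focusnews is None:  # download failed or the page layout changed
        return
    for a in focusnews.find_all('a', href=True):
        # urljoin copes with both relative and absolute href values
        ulist.append([a.get_text(strip=True), urljoin(BASE_URL, a['href'])])


def printWeatherList(ulist):
    """Print every headline and its link as a centered two-column table."""
    tplt = "{0:{2}^50}\t{1:{2}^50}"
    # Column headers: weather news, URL
    print(tplt.format("天气新闻", "网址", chr(12288)))
    for u in ulist:
        print(tplt.format(u[0], u[1], chr(12288)))


def main():
    uinfo = []
    html = getHtml(BASE_URL)
    fillWeatherList(uinfo, html)
    printWeatherList(uinfo)


main()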
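
One detail worth a closer look is how the link for each headline is built from the site root and the href attribute. Plain string concatenation only works when every href is a relative path; urllib.parse.urljoin from the standard library handles relative, root-relative, and absolute values. The sketch below shows the difference; the href values are invented for illustration and are not taken from the real page, and absolute_link is a hypothetical helper name.

```python
from urllib.parse import urljoin

BASE_URL = "http://shanxi.weather.com.cn/"


def absolute_link(href, base=BASE_URL):
    """Resolve an href found on the page against the page's base URL."""
    return urljoin(base, href)


# Hypothetical href values covering the three common cases.
examples = [
    "news/1.shtml",                       # relative path
    "/news/2.shtml",                      # root-relative path
    "http://www.weather.com.cn/a.shtml",  # already absolute
]
for href in examples:
    print(href, "->", absolute_link(href))
```

With concatenation, the third case would produce a broken address that starts with the site root followed by a second http:// prefix, while urljoin returns the absolute href unchanged.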