1. Crawlers
(1) What happens when you browse a web page

Browser (request)
-> enter a URL (for example http://www.baidu.com/index.html, file:///mnt, ftp://172.25.254.250/pub)
-> the scheme (http) identifies the protocol, and www.baidu.com identifies the domain being visited
-> a DNS server resolves the domain to an IP address
-> the server determines which page content is being requested
-> the page content is returned to the browser (response)

(2) Fetching a web page

1). Basic method

    from urllib import request
    from urllib.error import URLError

    try:
        # timeout is in seconds; with a value as small as 0.01 the request
        # will almost certainly time out and raise URLError
        response = request.urlopen('http://www.baidu.com', timeout=0.01)
        content = response.read().decode('utf-8')
        print(content)
    except URLError as e:
        print("Request timed out", e.reason)

2). Using a Request object (lets you add extra header information)

    from urllib import request
    from urllib.error import URLError

    url = 'http://www.cbrc.gov.cn/chinese/jrjg/index.html'
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'}

    try:
        # Instantiate a Request object; this lets you customize the request headers.
        req = request.Request(url, headers=headers)
        # urlopen accepts either a URL string or a Request object.
        content = request.urlopen(req).read().decode('utf-8')
        print(content)
    except URLError as e:
        print(e.reason)
    else:
        print("success")

Adding header information after the Request is created

    from urllib import request
    from urllib.error import URLError

    url = 'http://www.cbrc.gov.cn/chinese/jrjg/index.html'
    user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'

    try:
        # Instantiate a Request object, then attach the header with add_header().
        req = request.Request(url)
        req.add_header('User-Agent', user_agent)
        # urlopen accepts either a URL string or a Request object.
        content = request.urlopen(req).read().decode('utf-8')
        print(content)
    except URLError as e:
        print(e.reason)
    else:
        print("success")

(3) Anti-crawler strategies
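The Request examples in (2) always send one fixed User-Agent. One possible variation, shown here only as a minimal sketch, is to pick the User-Agent at random from a small pool for each request. The names `USER_AGENTS` and `build_request` are hypothetical and do not come from the original code; the sketch only builds the Request object and does not touch the network.

```python
import random
from urllib import request

# Hypothetical pool of ordinary browser User-Agent strings (an assumption
# for illustration; extend or replace as needed).
USER_AGENTS = [
    'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0',
    'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0',
    'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/27.0.1453.94 Safari/537.36',
]


def build_request(url, user_agents=USER_AGENTS):
    """Return a Request whose User-Agent is chosen at random from the pool."""
    user_agent = random.choice(user_agents)
    return request.Request(url, headers={'User-Agent': user_agent})


req = build_request('http://www.baidu.com')
# urllib stores header names via str.capitalize(), hence 'User-agent' here.
print(req.get_header('User-agent'))
```

The resulting object can be passed to `request.urlopen(req)` exactly as in the examples in (2). Note that looking the header up with `req.get_header('User-Agent')` would return None, because of the capitalization urllib applies to stored header names.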
1). Simulate a browser (same as above)

1. Android
Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19
Mozilla/5.0 (Linux; U; Android 4.0.4; en-gb; GT-I9300 Build/IMM76D) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30
Mozilla/5.0 (Linux; U; Android 2.2; en-gb; GT-P1000 Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1

2. Firefox
Mozilla/5.0 (Windows NT 6.2; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0
Mozilla/5.0 (Android; Mobile; rv:14.0) Gecko/14.0 Firefox/14.0

3. Google Chrome
Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36
Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.133 Mobile Safari/535.19

4. iOS
Mozilla/5.0 (iPad; CPU OS 5_0 like Mac OS X) AppleWebKit/534.46 (KHTML,