Web Crawling Techniques Explained

Original link: https://blog.youkuaiyun.com/qq_40273198/article/details/84144899

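There are two ways to fetch page content here, requests and urllib. Both return roughly what you see when you view a page's source in the browser.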

1. Using the requests library

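import requests
url = 'url'  # placeholder: replace with the address of the target page
response = requests.get(url)
response.encoding = 'utf-8'  # set the encoding before reading response.text
print(response.text)  # decoded text; response.content holds the raw bytes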
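
To see why the encoding setting matters, here is a minimal sketch that uses only plain Python 3 bytes and needs no network access. In requests, response.content holds the raw bytes and response.text holds those bytes decoded with response.encoding. The sample string and the helper name decode_body are illustrative assumptions, not part of the original post.

```python
# Sketch (Python 3): raw bytes vs. decoded text, mirroring
# response.content vs. response.text in requests.

def decode_body(raw, encoding="utf-8"):
    """Decode raw page bytes; undecodable bytes become U+FFFD."""
    return raw.decode(encoding, errors="replace")

# Hypothetical page body, encoded the way a server might send it.
raw = "<title>爬虫</title>".encode("utf-8")

good = decode_body(raw, "utf-8")    # correct encoding: readable text
bad = decode_body(raw, "latin-1")   # wrong encoding: garbled characters
```

Decoding with the wrong encoding does not fail here, it silently yields garbled text, which is why the encoding should be set before the text is read.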

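2. Using urllib and printing to the console

#coding:utf-8
# Python 2 code: in Python 3, urlopen lives in urllib.request.
import urllib

page = urllib.urlopen('URL')  # open the page ('URL' is a placeholder)
htmlcode = page.read()  # read the page source
print htmlcode  # print it to the console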
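
The listing above relies on Python 2's urllib.urlopen and the print statement. As a rough Python 3 equivalent, the sketch below uses urllib.request.urlopen. The function name read_page and the timeout value are assumptions made for illustration.

```python
# Sketch (Python 3): urllib.urlopen became urllib.request.urlopen.
from urllib.request import urlopen

def read_page(url, timeout=10):
    """Return the raw bytes of the page at url."""
    # The with block closes the connection once the body is read.
    with urlopen(url, timeout=timeout) as page:
        return page.read()

# Example call (needs network access, so it is left commented out):
# print(read_page('http://tieba.baidu.com/p/1753935195')[:200])
```

Note that read() returns bytes in Python 3, so the result has to be decoded before it is treated as text.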

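3. Using urllib and saving to a text file

#coding:utf-8
import urllib

page = urllib.urlopen('URL')  # open the page ('URL' is a placeholder)
htmlcode = page.read()  # read the page source

pageFile = open('pageCode.txt','w')  # open pageCode.txt for writing
pageFile.write(htmlcode)  # write the source to the file
pageFile.close()  # always close what you open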
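
The same step can be sketched in Python 3 with a with block, which closes the file even if the write raises. The function name save_page is an assumption. The file is opened in binary mode because read() returns bytes in Python 3.

```python
# Sketch (Python 3): fetch a page and save its raw bytes to a file.
from urllib.request import urlopen

def save_page(url, path="pageCode.txt"):
    """Write the page at url to path; return the number of bytes written."""
    with urlopen(url) as page:
        data = page.read()
    # 'wb' because data is bytes; the with block closes the file for us.
    with open(path, "wb") as page_file:
        return page_file.write(data)

# Example call (needs network access, so it is left commented out):
# save_page('http://tieba.baidu.com/p/1753935195')
```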

4. Wrapping it in a function

#coding:utf-8
import urllib

def get_html(url):
    # Fetch the page at url and return its source as a string.
    page = urllib.urlopen(url)
    html = page.read()
    return html
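
In Python 3 the body comes back as bytes, and applying a str regular expression to bytes raises a TypeError, so a Python 3 version of get_html would probably decode before returning. The sketch below is one way to do that. The default_charset parameter and the fallback to UTF-8 are assumptions, not something the original post specifies.

```python
# Sketch (Python 3): get_html that returns text rather than bytes.
from urllib.request import urlopen

def get_html(url, default_charset="utf-8"):
    """Fetch url and decode the body to str."""
    with urlopen(url) as page:
        raw = page.read()
        # Use the charset declared in the Content-Type header, if any.
        charset = page.headers.get_content_charset() or default_charset
    # Undecodable bytes are replaced instead of raising an error.
    return raw.decode(charset, errors="replace")
```

If the server declares no charset and the page is not UTF-8 (older Chinese sites often use GBK), the caller can pass a different default_charset.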

5. Printing image URLs

# coding:utf-8
import urllib
import re

def get_html(url):
    page = urllib.urlopen(url)
    html = page.read()
    return html

reg = r'src="(.+?\.jpg)" width'  # regular expression for image URLs
reg_img = re.compile(reg)  # compile once so the pattern can be reused
imglist = reg_img.findall(get_html('http://tieba.baidu.com/p/1753935195'))  # find all matches
for img in imglist:
    print img

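The regular expression for the images: reg = r'src="(.+?\.jpg)" width'

It matches a string that starts with src=", continues with one or more arbitrary characters (non-greedy), and ends with .jpg" width. The parentheses capture only the part between the quotes, which is the image URL.

The next step is to pull every substring that satisfies this pattern out of the very long string returned by get_html.

For this we use findall from Python's re module, called here as reg_img.findall(html). It returns a list of matches. Because the pattern contains one capturing group, each list item is the captured URL rather than the whole matched text.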
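
To make the matching behavior concrete, here is a small sketch that runs the same pattern against a made-up HTML fragment. The example.com URLs are hypothetical and stand in for whatever the real page contains.

```python
# Sketch: what findall returns for the article's pattern.
import re

# Hypothetical HTML fragment in the shape the pattern expects.
sample = ('<img src="http://example.com/a.jpg" width="560">'
          '<img src="http://example.com/b.jpg" width="560">'
          '<img src="http://example.com/c.png" width="560">')

reg_img = re.compile(r'src="(.+?\.jpg)" width')
imglist = reg_img.findall(sample)  # only the captured group is returned

# The same pattern with a greedy .+ runs on to the last .jpg" width,
# so both .jpg tags collapse into a single oversized match.
greedy = re.compile(r'src="(.+\.jpg)" width').findall(sample)
```

The non-greedy version yields the two .jpg URLs separately and skips the .png, while the greedy version returns a single item that spans from the first URL to the second.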

6. Downloading the images

# coding:utf-8
import urllib
import re

def get_html(url):
    page = urllib.urlopen(url)
    html = page.read()
    return html

reg = r'src="(.+?\.jpg)" width'
reg_img = re.compile(reg)
imglist = reg_img.findall(get_html('http://tieba.baidu.com/p/1753935195'))
x = 0  # counter used to name the files 0.jpg, 1.jpg, ...
for img in imglist:
    urllib.urlretrieve(img, '%s.jpg' %x)  # save the image to the current directory
    x += 1
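
In Python 3, urlretrieve moved to urllib.request. The sketch below shows the same download loop under that assumption. The function name download_images and the dest_dir parameter are introduced for illustration only.

```python
# Sketch (Python 3): save each image URL as 0.jpg, 1.jpg, ...
import os
from urllib.request import urlretrieve

def download_images(img_urls, dest_dir="."):
    """Download every URL in img_urls into dest_dir; return the saved paths."""
    saved = []
    # enumerate supplies the counter that x = 0 / x += 1 tracks by hand.
    for x, img in enumerate(img_urls):
        path = os.path.join(dest_dir, "%s.jpg" % x)
        urlretrieve(img, path)
        saved.append(path)
    return saved
```

Returning the saved paths is a design choice that makes the result easy to check or log; the original loop returns nothing.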

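7. Wrapping the image download in a function

# Requires import urllib and import re, as in the earlier listings.
def get_image(html_code):
    reg = r'src="(.+?\.jpg)" width'
    reg_img = re.compile(reg)
    img_list = reg_img.findall(html_code)  # every image URL in the page source
    x = 0
    for img in img_list:
        urllib.urlretrieve(img, '%s.jpg' % x)  # save as 0.jpg, 1.jpg, ...
        x += 1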

8. Reading the URL from user input

print u'Enter a URL:',
url = raw_input()
if not url:
    # Fall back to the sample thread when nothing is entered.
    url = 'http://tieba.baidu.com/p/1753935195'
html_code = get_html(url)
get_image(html_code)
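
The fallback logic can also be isolated in a small helper, sketched here for Python 3, where raw_input() is input() and print is a function. The name choose_url and the whitespace trimming are assumptions that go slightly beyond the original, which only checks for an empty string.

```python
# Sketch (Python 3): pick the entered URL, or a default when input is empty.
DEFAULT_URL = 'http://tieba.baidu.com/p/1753935195'

def choose_url(user_input, default=DEFAULT_URL):
    """Return the trimmed input, or default when the input is empty."""
    url = user_input.strip()
    return url if url else default

# Interactive use, left commented out so the sketch runs without a terminal:
# url = choose_url(input('Enter a URL: '))
```

Keeping the decision in a function that takes the input as an argument means it can be checked without typing anything at a prompt.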