I have watched five episodes of Sparta's web-crawler video tutorials. They build up gradually, but the series as a whole stays fairly introductory: it covers some basic web concepts (such as common character encodings) and a few quirks of big websites, so it suits beginners. The videos recommend the book HTTP: The Definitive Guide and the third-party libraries BeautifulSoup and chardet.
When installing the latest BeautifulSoup I hit an incompatibility with my Python version; the suggested workaround is to install an older BeautifulSoup release. See the links below, and the quick version check that follows them:
BeautifulSoup official documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup
Download the latest BeautifulSoup from the official site: http://www.crummy.com/software/BeautifulSoup/#Download
Installing Beautiful Soup on Windows: http://kevinkelly.blog.163.com/blog/static/21390809320133185748442/
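If the install succeeded but you are unsure which generation of BeautifulSoup you actually got, here is a minimal check (this assumes the bs4 package of BeautifulSoup 4; BeautifulSoup 3 is imported as plain BeautifulSoup instead):
<pre name="code" class="python"># quick sanity check for the installed BeautifulSoup
import bs4                      # BeautifulSoup 4 lives in the bs4 package
print bs4.__version__           # e.g. 4.3.2
from bs4 import BeautifulSoup   # an ImportError here means bs4 is missing
</pre>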
The videos use Sublime Text 3; I find it less convenient than Wing IDE because build errors are not linked to the offending line. Copy the code below and press Ctrl+B to run it. It scrapes the images in a Baidu Tieba thread whose tags look like:
“<img class="BDE_Image" src="http://imgsrc.baidu.com/forum/w%3D580/sign=269396684d4a20a4311e3ccfa0539847/0aa95edf8db1cb132cd1f269df54564e92584b15.jpg"
pic_ext="jpeg"
width="510"
height="765"
style="cursor: url(http://tb2.bdstatic.com/tb/static-pb/img/cur_zin.cur), pointer;">
”
The full code follows:
<pre name="code" class="python"># -*- coding: utf-8 -*-
"""
Version : Python 2.7.6
Editor  : Sublime Text 2
Standard library: urllib
Author  : Sparta
"""
# This episode is sent out encrypted first, initially to the friends who help build the support group.
# [Correction for a verbal slip in the video:]
# Server: Apache/2.23 (CentOS)
# means the Apache web server, version 2.23, running on a CentOS system.
"""
通过以下方式,获取最新视频教程:
百度网盘下载:http://pan.baidu.com/share/home?uk=1145352858
优酷视频: http://i.youku.com/u/id_UMTQxMTk3OTYxNg==
新浪视频: http://you.video.sina.com.cn/m/5041094315
Python新手互助QQ群: 20419428
"""
# <!DOCTYPE html>
# <html lang="zh-CN">
# <head>
# <meta charset="utf-8">
# The encoding is declared in two places: once by the server, once inside the page itself.
# Ctrl+/ comments/uncomments the selected lines
import urllib    # standard library
import chardet   # character-encoding detection: http://irayd.com/blog/python-module-chardet/
import urllib2
import random
import re
from bs4 import BeautifulSoup  # BeautifulSoup 3 used: import BeautifulSoup
# def callback(a, b, c):
#     """
#     @a: number of blocks transferred so far.
#     @b: size of one block, in bytes.
#     @c: total size of the remote file (sometimes -1).
#     """
#     download_progress = 100.0 * a * b / c
#     if download_progress > 100:
#         download_progress = 100
#     print "%0.2f%%" % download_progress,  # trailing comma keeps output on one line
#url = 'http://blog.youkuaiyun.com/happydeer'
#url = 'http://www.iplaypython.com/'
#url = 'http://www.iplaypython.com/ksdflkjlwisdfioq.html'
#url = "http://www.163.com/"
# url2 = "http://www.python.org"#单引号双引号都行
# """
# 200:正常;303:永久重定向:404:not found;403:禁止反问;
# 500:服务器忙,无响应;
# """
# local = 'C:\\Users\\_Phoenix_Luo\\Desktop\\iplaypython.html'
# html = urllib.urlopen(url2)
# #content = html.read().decode('gbk', 'ignore').encode('utf-8')
# #print content
# #print html.info()
# #print html.read()
# #print html.getcode()
# code = html.getcode()
# """
# 200:正常;303:永久重定向:404:not found;403:禁止反问;
# 500:服务器忙,无响应;
# """
# if code == 200:  # note the colon
#     # print html.read()
#     # print html.info()
#     urllib.urlretrieve(url2, local, callback)
# else:
#     404  # placeholder; a bare expression does nothing
# html.close()
# info = urllib.urlopen(url).info()  # returns the server's encoding declaration
# print info
# print info.getparam("charset")
# def automatic_detect(url):
#     """Detect a page's character encoding with chardet."""
#     content = urllib.urlopen(url).read()
#     result = chardet.detect(content)
#     encoding = result['encoding']
#     return encoding
# urls = ['http://www.iplaypython.com',
# 'http://www.baidu.com',
# 'http://www.jd.com',
# 'http://www.163.com',
# 'http://www.dangdang.com'
# ]
# for url in urls:
#     # TestData = urllib.urlopen(url).read()
#     print url, automatic_detect(url)
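# chardet also reports how sure it is alongside the guess; a small sketch
# (detect returns a dict with 'encoding' and 'confidence' keys):
# result = chardet.detect(urllib.urlopen('http://www.baidu.com').read())
# print result['encoding'], result['confidence']  # e.g. utf-8 0.99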
# my_headers = [
# "Mozilla/5.0 (Windows NT6.1;WOW64;rv:27.0)Gecko/20100101 Firefox/27.0",
# # 'Mozilla/5.0 (Windows;U;Windows NT 6.1;en-US;rv:1.9.1.6)Gecko/20091201 Firefox/3.5.6',
# # #'Mozilla/5.0 (Windows;U;Windows NT 6.1)AppleWebKit/537.36(KHTML,like Gecko) Chrome/34.0.1838.2 Safari/',
# 'Mozilla/5.0 (X11;Ubuntu; Linux i686;rv:10.0) Gecko/20100101 Firefox/10.0'
# ]
# def get_content(url, headers):
#     """Fetch url with a randomly chosen User-Agent."""
#     random_header = random.choice(headers)
#     req = urllib2.Request(url)
#     req.add_header("User-Agent", random_header)
#     req.add_header("Host", "blog.youkuaiyun.com")
#     req.add_header("Referer", "http://blog.youkuaiyun.com/")  # the video misspells this as "Refer"
#     req.add_header("GET", url)  # from the video; "GET" is the method, not a real header
#     content = urllib2.urlopen(req).read()
#     return content
# info = get_content(url, my_headers)
# html = urllib2.urlopen(url)
# print html.read()
def get_content(url):
    """Fetch a page and return its raw bytes."""
    html = urllib.urlopen(url)
    content = html.read()
    html.close()
    return content
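# If a page is not UTF-8, the chardet helper shown earlier can normalise the
# bytes first; a sketch (get_content_decoded is a hypothetical wrapper, and
# 'ignore' silently drops undecodable bytes):
# def get_content_decoded(url):
#     raw = get_content(url)
#     enc = chardet.detect(raw)['encoding'] or 'utf-8'
#     return raw.decode(enc, 'ignore').encode('utf-8')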
#def get_images(info):
#    """
#    <img class="BDE_Image" src="http://imgsrc.baidu.com/forum/w%3D580/sign=269396684d4a20a4311e3ccfa0539847/0aa95edf8db1cb132cd1f269df54564e92584b15.jpg"
#    pic_ext="jpeg"
#    width="510"
#    height="765"
#    style="cursor: url(http://tb2.bdstatic.com/tb/static-pb/img/cur_zin.cur), pointer;">
#    """
#    regex = r'class="BDE_Image" src="(.+?\.jpg)'
#    pat = re.compile(regex)
#    image_code = re.findall(pat, info)
#    i = 0
#    for image_url in image_code:
#        print image_url
#        urllib.urlretrieve(image_url, '%s.jpg' % i)  # number the files so they don't overwrite each other
#        i = i + 1
def get_images(info):
    """Save every img tag with class BDE_Image as 1.jpg, 2.jpg, ..."""
    soup = BeautifulSoup(info, 'html.parser')  # explicit parser avoids bs4's warning
    all_img = soup.find_all('img', class_="BDE_Image")
    x = 1
    for img in all_img:
        print img['src']
        image_name = '%s.jpg' % x
        urllib.urlretrieve(img['src'], image_name)
        x += 1
info = get_content('http://tieba.baidu.com/p/2772656630')
get_images(info)  # get_images returns None, so there is nothing to print
</pre>
To adapt this for your own use you will probably need some familiarity with Python regular expressions: inspect the image tags with the browser's element inspector and write a regex that matches them, so you can scrape exactly the data you want. Alternatively, hand the parsing to the third-party library BeautifulSoup, as the final version above does.
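If you prefer the regex route, here is a minimal runnable variant of get_images in that style, in the spirit of the commented-out version above (a sketch only: the BDE_Image class name and the .jpg extension come from the tag sample at the top of this post, and get_images_regex is a name I made up):
<pre name="code" class="python"># -*- coding: utf-8 -*-
import re
import urllib

def get_images_regex(info):
    """Regex counterpart of get_images: capture the src of BDE_Image tags."""
    pat = re.compile(r'class="BDE_Image" src="(.+?\.jpg)"')
    for i, image_url in enumerate(pat.findall(info), 1):
        print image_url
        urllib.urlretrieve(image_url, '%s.jpg' % i)  # save as 1.jpg, 2.jpg, ...

info = urllib.urlopen('http://tieba.baidu.com/p/2772656630').read()
get_images_regex(info)
</pre>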