I have watched five episodes of Sparta's web-crawler video tutorials. They build up gradually, but the series as a whole stays fairly introductory: it covers some basic web concepts (such as common character encodings) and a few quirks of big websites, so it suits beginners. The videos recommend the book HTTP: The Definitive Guide and the third-party libraries BeautifulSoup and chardet.
When installing the latest BeautifulSoup I hit an incompatibility with my Python version; the suggested workaround is to install an older BeautifulSoup release. See the links below, and the quick version check that follows them:
BeautifulSoup official documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup
Download the latest BeautifulSoup from the official site: http://www.crummy.com/software/BeautifulSoup/#Download
Installing Beautiful Soup on Windows: http://kevinkelly.blog.163.com/blog/static/21390809320133185748442/
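If the install succeeded but you are unsure which generation of BeautifulSoup you actually got, here is a minimal check (this assumes the bs4 package of BeautifulSoup 4; BeautifulSoup 3 is imported as plain BeautifulSoup instead):
<pre name="code" class="python"># quick sanity check for the installed BeautifulSoup
import bs4                      # BeautifulSoup 4 lives in the bs4 package
print bs4.__version__           # e.g. 4.3.2
from bs4 import BeautifulSoup   # an ImportError here means bs4 is missing
</pre>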
The videos use Sublime Text 3; I find it less convenient than Wing IDE because build errors are not linked to the offending line. Copy the code below and press Ctrl+B to run it. It scrapes the images in a Baidu Tieba thread whose tags look like:
“<img class="BDE_Image" src="http://imgsrc.baidu.com/forum/w%3D580/sign=269396684d4a20a4311e3ccfa0539847/0aa95edf8db1cb132cd1f269df54564e92584b15.jpg"
pic_ext="jpeg"
width="510"
height="765"
style="cursor: url(http://tb2.bdstatic.com/tb/static-pb/img/cur_zin.cur), pointer;">
”
The full code follows:
<pre name="code" class="python"># -*- coding: utf-8 -*-
"""
Version : Python 2.7.6
Editor  : Sublime Text 2
Standard library: urllib
Author  : Sparta
"""
# This episode is sent out encrypted first, initially to the friends who help build the support group.
# [Correction for a verbal slip in the video:]
# Server: Apache/2.23 (CentOS)
# means the Apache web server, version 2.23, running on a CentOS system.
"""
通过以下方式,获取最新视频教程:
百度网盘下载:http://pan.baidu.com/share/home?uk=1145352858
优酷视频: http://i.youku.com/u/id_UMTQxMTk3OTYxNg==
新浪视频: http://you.video.sina.com.cn/m/5041094315
Python新手互助QQ群: 20419428
"""
# <!DOCTYPE html>
# <html lang="zh-CN">
# <head>
# <meta charset="utf-8">
# The encoding is declared in two places: once by the server, once inside the page itself.
# Ctrl+/ comments/uncomments the selected lines
import urllib    # standard library
import chardet   # character-encoding detection: http://irayd.com/blog/python-module-chardet/
import urllib2
import random
import re
from bs4 import BeautifulSoup  # BeautifulSoup 3 used: import BeautifulSoup
# def callback(a, b, c):
#     """
#     @a: number of blocks transferred so far.
#     @b: size of one block, in bytes.
#     @c: total size of the remote file (sometimes -1).
#     """
#     download_progress = 100.0 * a * b / c
#     if download_progress > 100:
#         download_progress = 100
#     print "%0.2f%%" % download_progress,  # trailing comma keeps output on one line
#url = 'http://blog.youkuaiyun.com/happydeer'
#url = 'http://www.iplaypython.com/'
#url = 'http://www.iplaypython.com/ksdflkjlwisdfioq.html'
#url = "http://www.163.com/"
# url2 = "http://www.python.org"#单引号双引号都行
# """
# 200:正常;303:永久重定向:404:not found;403:禁止反问;
# 500:服务器忙,无响应;
# """
# local = 'C:\\Users\\_Phoenix_Luo\\Desktop\\iplaypython.html'
# html = urllib.urlopen(url2)
# #content = html.read().decode('gbk', 'ignore').encode('utf-8')
# #print content
# #print html.info()
# #print html.read()
# #print html.getcode()
# code = html.getcode()
# """
# 200:正常;303:永久重定向:404:not found;403:禁止反问;
# 500:服务器忙,无响应;
# """
# if code == 200:  # note the colon
#     # print html.read()
#     # print html.info()
#     urllib.urlretrieve(url2, local, callback)
# else:
#     404  # placeholder; a bare expression does nothing
# html.close()
# info = urllib.urlopen(url).info()  # returns the server's encoding declaration
# print info
# print info.getparam("charset")
# def automatic_detect(url):
#     """Detect a page's character encoding with chardet."""
#     content = urllib.urlopen(url).read()
#     result = chardet.detect(content)
#     encoding = result['encoding']
#     return encoding
# urls = ['http://www.iplaypython.com',
# 'http://www.baidu.com',
# 'http://www.jd.com',
# 'http://www.163.com',
# 'http://www.dangdang.com'
# ]
# for url in urls:
#     # TestData = urllib.urlopen(url).read()
#     print url, automatic_detect(url)
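# chardet also reports how sure it is alongside the guess; a small sketch
# (detect returns a dict with 'encoding' and 'confidence' keys):
# result = chardet.detect(urllib.urlopen('http://www.baidu.com').read())
# print result['encoding'], result['confidence']  # e.g. utf-8 0.99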
# my_headers = [
# "Mozilla/5.0 (Windows NT6.1;WOW64;rv:27.0)Gecko/20100101 Firefox/27.0",
# # 'Mozilla/5.0 (Windows;U;Windows NT 6.1;en-US;rv:1.9.1.6)Gecko/20091201 Firefox/3.5.6',
# # #'Mozilla/5.0 (Windows;U;Windows NT 6.1)AppleWebKit/537.36(KHTML,like Gecko) Chrome/34.0.1838.2 Safari/',
# 'Mozilla/5.0 (X11;Ubuntu; Linux i686;rv:10.0) Gecko/20100101 Firefox/10.0'
# ]
# def get_content(url, headers):
#     """Fetch url with a randomly chosen User-Agent."""
#     random_header = random.choice(headers)
#     req = urllib2.Request(url)
#     req.add_header("User-Agent", random_header)
#     req.add_header("Host", "blog.youkuaiyun.com")
#     req.add_header("Referer", "http://blog.youkuaiyun.com/")  # the video misspells this as "Refer"
#     req.add_header("GET", url)  # from the video; "GET" is the method, not a real header
#     content = urllib2.urlopen(req).read()
#     return content
# info = get_content(url, my_headers)
# html = urllib2.urlopen(url)
# print html.read()
def get_content(url):
    """Fetch a page and return its raw bytes."""
    html = urllib.urlopen(url)
    content = html.read()
    html.close()
    return content
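# If a page is not UTF-8, the chardet helper shown earlier can normalise the
# bytes first; a sketch (get_content_decoded is a hypothetical wrapper, and
# 'ignore' silently drops undecodable bytes):
# def get_content_decoded(url):
#     raw = get_content(url)
#     enc = chardet.detect(raw)['encoding'] or 'utf-8'
#     return raw.decode(enc, 'ignore').encode('utf-8')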
#def get_images(info):
#    """
#    <img class="BDE_Image" src="http://imgsrc.baidu.com/forum/w%3D580/sign=269396684d4a20a4311e3ccfa0539847/0aa95edf8db1cb132cd1f269df54564e92584b15.jpg"
#    pic_ext="jpeg"
#    width="510"
#    height="765"
#    style="cursor: url(http://tb2.bdstatic.com/tb/static-pb/img/cur_zin.cur), pointer;">
#    """
#    regex = r'class="BDE_Image" src="(.+?\.jpg)'
#    pat = re.compile(regex)
#    image_code = re.findall(pat, info)
#    i = 0
#    for image_url in image_code:
#        print image_url
#        urllib.urlretrieve(image_url, '%s.jpg' % i)  # number the files so they don't overwrite each other
#        i = i + 1
def get_images(info):
    """Save every img tag with class BDE_Image as 1.jpg, 2.jpg, ..."""
    soup = BeautifulSoup(info, 'html.parser')  # explicit parser avoids bs4's warning
    all_img = soup.find_all('img', class_="BDE_Image")
    x = 1
    for img in all_img:
        print img['src']
        image_name = '%s.jpg' % x
        urllib.urlretrieve(img['src'], image_name)
        x += 1
info = get_content('http://tieba.baidu.com/p/2772656630')
get_images(info)  # get_images returns None, so there is nothing to print
</pre>
To adapt this for your own use you will probably need some familiarity with Python regular expressions: inspect the image tags with the browser's element inspector and write a regex that matches them, so you can scrape exactly the data you want. Alternatively, hand the parsing to the third-party library BeautifulSoup, as the final version above does.
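If you prefer the regex route, here is a minimal runnable variant of get_images in that style, in the spirit of the commented-out version above (a sketch only: the BDE_Image class name and the .jpg extension come from the tag sample at the top of this post, and get_images_regex is a name I made up):
<pre name="code" class="python"># -*- coding: utf-8 -*-
import re
import urllib

def get_images_regex(info):
    """Regex counterpart of get_images: capture the src of BDE_Image tags."""
    pat = re.compile(r'class="BDE_Image" src="(.+?\.jpg)"')
    for i, image_url in enumerate(pat.findall(info), 1):
        print image_url
        urllib.urlretrieve(image_url, '%s.jpg' % i)  # save as 1.jpg, 2.jpg, ...

info = urllib.urlopen('http://tieba.baidu.com/p/2772656630').read()
get_images_regex(info)
</pre>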