2014-12-22

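This post describes a way to scrape images from a Douban group with Python. It parses the group's list pages to collect topic links, then downloads the images inside each topic. The code involves HTTP requests, regular-expression matching, and multithreading.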
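
The link-matching step can be tried without touching the network. What follows is a minimal sketch, assuming the same URL patterns as the script further down. The sample markup and the helper names `extract_topics` and `image_filename` are made up for illustration and are not part of the original script.

```python
import re

# Patterns taken from the script in this post.
TOPIC_RE = re.compile(r'http://www\.douban\.com/group/topic/\d+')
IMAGE_ID_RE = re.compile(r'p\d{7}')


def extract_topics(html):
    """Return topic URLs in page order, dropping repeats."""
    seen = []
    for url in TOPIC_RE.findall(html):
        if url not in seen:
            seen.append(url)
    return seen


def image_filename(imgurl):
    """Build a file name such as p1234567.jpg from an image URL, or None."""
    match = IMAGE_ID_RE.search(imgurl)
    return match.group() + '.jpg' if match else None


# Made-up sample markup, for illustration only.
SAMPLE = (
    '<a href="http://www.douban.com/group/topic/111/">first</a>'
    '<a href="http://www.douban.com/group/topic/222/">second</a>'
    '<a href="http://www.douban.com/group/topic/111/">first again</a>'
)

if __name__ == '__main__':
    print(extract_topics(SAMPLE))
    print(image_filename(
        'http://img3.douban.com/view/group_topic/large/public/p1234567.jpg'))
```

Dropping repeats matters because a list page may link to the same topic more than once, and each repeat would otherwise trigger a second download of the same images.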


1 http://www.gamefromscratch.com/

  Game From Scratch

2 http://www.ccs.neu.edu/home/matthias/HtDP2e/part_prologue.html

  Racket language

3 http://www.youkuaiyun.com/article/2014-12-17/2823183

  Sina post system

4 https://gitcafe.com/gdgeek/UnityGame

  a Unity3D game

5 http://pdos.csail.mit.edu/6.828/2011/schedule.html

  an MIT OS course

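import re
import threading
import time
import urllib.request

# Directory the images are saved into; it must already exist.
SAVE_DIR = '/home/space/hua.song/mycode/girl'


def getHtml(url):
    """Fetch a page and return it as text."""
    return urllib.request.urlopen(url).read().decode('utf-8')


def gettopic(html):
    """Return the topic URLs found on a list page, without repeats."""
    reg = r'http://www\.douban\.com/group/topic/\d+'
    topiclist = []
    for topicurl in re.findall(reg, html):
        if topicurl not in topiclist:
            topiclist.append(topicurl)
    return topiclist


def download(topic_page):
    """Save every large image in a topic page and return how many were saved."""
    # [^"\s]+ keeps each match inside a single URL.
    reg2 = r'http://img3\.douban\.com/view/group_topic/large/public/[^"\s]+\.jpg'
    imglist = re.findall(reg2, topic_page)
    print(imglist)
    count = 0
    for imgurl in imglist:
        # Image IDs look like p1234567; the ID becomes the file name.
        img_num = re.search(r'p\d{7}', imgurl)
        if img_num is None:
            continue
        urllib.request.urlretrieve(imgurl, '%s/%s.jpg' % (SAVE_DIR, img_num.group()))
        print(imgurl)
        count += 1
        time.sleep(1)  # pause between downloads
    return count


def main_loop(num, num_end):
    """Crawl the list pages whose start offset lies in [num, num_end)."""
    while num < num_end:
        try:
            html = getHtml('http://www.douban.com/group/kaopulove/discussion?start=%d' % num)
            for topicurl in gettopic(html):
                download(getHtml(topicurl))
        except OSError as err:
            # urllib's network errors derive from OSError; skip this page and go on.
            print('failed at start=%d: %s' % (num, err))
        num += 25  # the start offset advances by 25 per list page
    print('over')


if __name__ == '__main__':
    page_end = int(input('input end pagenum:'))
    threadnum = 5
    # Ceiling division, so every page is assigned to some thread.
    pages_per_thread = max(1, -(-page_end // threadnum))

    threads = []
    for first_page in range(0, page_end, pages_per_thread):
        last_page = min(first_page + pages_per_thread, page_end)
        t = threading.Thread(target=main_loop, args=(first_page * 25, last_page * 25))
        t.start()
        threads.append(t)

    # Join only after every thread has started, so they run concurrently.
    for t in threads:
        t.join()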
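
A thread pool is another way to spread the list pages over a fixed number of workers. The following is a minimal sketch using the standard library's `concurrent.futures.ThreadPoolExecutor`, which the script above does not use. `page_offsets`, `crawl_pages`, and `fake_worker` are hypothetical names, and `fake_worker` only stands in for the real fetch-and-download step so that the example runs offline. The page size of 25 is taken from the `start` offsets in the script.

```python
from concurrent.futures import ThreadPoolExecutor

PAGE_SIZE = 25  # assumption: the list page's start offset advances by 25


def page_offsets(page_end):
    """Return the start offset of each list page, first page first."""
    return [page * PAGE_SIZE for page in range(page_end)]


def crawl_pages(page_end, worker, threadnum=5):
    """Run worker(offset) for every page on a pool of threadnum threads."""
    with ThreadPoolExecutor(max_workers=threadnum) as pool:
        # map() returns results in input order even though calls overlap.
        return list(pool.map(worker, page_offsets(page_end)))


def fake_worker(offset):
    """Hypothetical stand-in for fetching and downloading one list page."""
    return 'start=%d' % offset


if __name__ == '__main__':
    print(crawl_pages(3, fake_worker))
```

With a pool, each worker picks up the next page as soon as it is free, so no thread sits idle while another works through a fixed block of slow pages.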




	

