This time we do a simple crawl to collect a batch of movie URLs, IDs, and similar information, laying the groundwork for the next project.
(Please keep the number of loop iterations under control when scraping!!!)
For setting up a proxy IP, see: https://blog.youkuaiyun.com/az9996/article/details/85094193
For setting up request headers, see: https://blog.youkuaiyun.com/az9996/article/details/85094462
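Before the full function, here is a minimal, self-contained sketch of the proxy and header setup those two posts cover. The proxy address and User-Agent string below are placeholders, not working values; in the real code they come from the helper module `request_body`:

```python
import urllib.request

# Placeholder proxy address -- substitute a working proxy of your own
proxy = {'http': 'http://127.0.0.1:8080'}

# A browser-like User-Agent so the request looks less like a script
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

# Build an opener with the proxy and install it globally,
# so every subsequent urllib request goes through it
proxies_support = urllib.request.ProxyHandler(proxy)
opener = urllib.request.build_opener(proxies_support)
urllib.request.install_opener(opener)

# Attach the headers to a request object (no network call is made here)
req = urllib.request.Request('https://movie.douban.com', headers=headers)
```

Note that `install_opener` changes the global default, which is convenient in a small script but affects every later `urlopen` call in the process.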
The code is written as a function so later projects can call it directly.
The function takes one parameter that controls how many pages to scrape.
import json
import urllib.request
from urllib import request

import file_operation   # local helper module: creates/empties the output file
import request_body     # local helper module: returns the proxy dict and headers

# number is the page_start offset to stop at; the default of 20 fetches one
# page. Each additional page of 20 results means adding 20 to number.
def get_movie_url(number=20):
    page = 0
    filename = 'movie_url.txt'
    file_operation.make_empty_file(filename)
    while page != number:
        # Install the proxy IP globally
        proxies_support = urllib.request.ProxyHandler(request_body.get_proxy())
        opener = urllib.request.build_opener(proxies_support)
        urllib.request.install_opener(opener)
        url = 'https://movie.douban.com/j/search_subjects?type=movie&' \
              'tag=%E7%83%AD%E9%97%A8&sort=recommend&page_limit=20&page_start=' + str(page)
        headers = request_body.get_header()
        req = request.Request(url, headers=headers)
        res = request.urlopen(req, timeout=5).read().decode('utf8')
        js = json.loads(res)
        urls = ''
        for subject in js['subjects']:
            urls = urls + subject['url'] + '\n'
        # Append the scraped data to the local file
        with open(filename, 'a', encoding='utf-8') as file_obj:
            file_obj.write(urls)
            file_obj.write('\n')
        page = page + 20
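The JSON-parsing step in the loop can be tried offline. The sample below imitates the shape of the endpoint's response as inferred from the code (a top-level `subjects` array whose items carry `id`, `title`, and `url`); the values themselves are made up:

```python
import json

# A trimmed sample imitating the endpoint's JSON shape (values are made up)
sample = '''{"subjects": [
    {"id": "1", "title": "Movie A", "url": "https://movie.douban.com/subject/1/"},
    {"id": "2", "title": "Movie B", "url": "https://movie.douban.com/subject/2/"}
]}'''

js = json.loads(sample)

# Collect one URL per line, mirroring what get_movie_url writes to disk
lines = [subject['url'] for subject in js['subjects']]
print('\n'.join(lines))
```

Testing the parsing logic against a canned string like this avoids hammering the real endpoint while you debug.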
This time I only write the movie URLs to a local txt file.
Next project: https://blog.youkuaiyun.com/az9996/article/details/85094604