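Python Web Scraper Example 1
Scraping the Maoyan Movies TOP100 board (http://maoyan.com/board/4)
step1 Preparation
Goal:
Scrape the title, release time, score, and poster image of every movie on the Maoyan TOP100 board.
Analysis:
Page 1 URL: https://maoyan.com/board/4, which shows the movies ranked 1-10;
Page 2 URL: https://maoyan.com/board/4?offset=10, which shows the movies ranked 11-20;
…
Fetching the full TOP100 therefore takes 10 separate requests, with the offset parameter set to 0, 10, …, 90.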
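The offset pattern described above can be turned into a list of page URLs. This is only a minimal sketch: the names BASE_URL and build_page_urls are made up for illustration, and it assumes that offset=0 returns the same content as the first page without a parameter.

```python
# Base address of the TOP100 board, taken from the analysis above.
BASE_URL = "https://maoyan.com/board/4"


def build_page_urls(total=100, page_size=10):
    """Return one URL per results page, using offsets 0, 10, ..., 90."""
    # range(0, 100, 10) yields the ten offsets listed in the analysis.
    return [f"{BASE_URL}?offset={offset}" for offset in range(0, total, page_size)]


if __name__ == "__main__":
    for url in build_page_urls():
        print(url)
```

Each URL in the list can then be passed to the page-fetching function one at a time.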
step2 Fetching the data
1. Fetch the source of the first page
import requests


def get_one_page(url):
    # Send a browser-like User-Agent header with the request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'
    }
    response = requests.get(url=url, headers=headers)
    html = response.text
    return html


def main():
    # Fetch the first page of the board and print its HTML source
    url = "https://maoyan.com/board/4"
    html = get_one_page(url)
    print(html)


if __name__ == '__main__':
    main()
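The fetch above returns whatever the server sends, including error pages. A more defensive variant might look like the sketch below. It is an assumption-laden illustration rather than part of the original code: the name get_one_page_safe, the http_get parameter, and the 10-second timeout are all hypothetical, and the HTTP function is passed in so that the block itself has no hard dependency on requests.

```python
# Same browser-like header as in the fetch function above.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36"
    )
}


def get_one_page_safe(url, http_get, timeout=10):
    """Return the page HTML, or None if the request fails or is not a 200."""
    try:
        # http_get is expected to behave like requests.get (assumption).
        response = http_get(url, headers=HEADERS, timeout=timeout)
    except Exception:
        # With requests, network errors are subclasses of
        # requests.RequestException; a broad catch keeps the sketch generic.
        return None
    if response.status_code != 200:
        return None
    return response.text
```

With requests installed, it would be called as get_one_page_safe(url, requests.get), and the caller can skip a page when None comes back.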
2. Extract the information with regular expressions
Each movie corresponds to one dd node:
<dd>
<i class="board-index board-index-1">1</i>
<a href="/films/1200486" title="我不是药神" class="image-link" data-act="boarditem-click" data-val="{movieId:1200486}">
<img src="//s3plus.meituan.net/v1/mss_e2821d7f0cfe4ac1bf9202ecf9590e67/cdn-prod/file:5788b470/image/loading_2.e3d934bf.png" alt="" class="poster-default" />
<img data-src="https://p0.meituan.net/movie/414176cfa3fea8bed9b579e9f42766b9686649.jpg@160w_220h_1e_1c" alt="我不是药神" class="board-img" />