1. Building a small crawler with XPath Helper
- First, you need to install a Chrome extension (this only works in the Chrome browser) called XPath Helper.
  Netdisk: https://pan.baidu.com/s/1phXPKllX0-BA7IDxPGRhZA password: yuuv
- As for how to install it, being the nice person I am, I've already found a guide for you: https://www.cnblogs.com/baxianhua/p/8878268.html
- Open https://theater.mtime.com/China_Beijing, press F12 to open the page source, and open XPath Helper. Example screenshot below.
I'm sure someone as smart as you will get it.
Here's the Python code:
import urllib.request
from lxml import etree

url = 'http://theater.mtime.com/China_Beijing/'
# Fetch the page and decode it
webpage = urllib.request.urlopen(url)
txt = webpage.read().decode('utf-8')
# Parse the HTML and pull out every film title with XPath
html = etree.HTML(txt)
html_data = html.xpath('//div[@class="isthebox"]/div/div/div/div/ul/li/a/@title')
for i in html_data:
    print(i)
Screenshot of a run:
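If you can't reach the live site (or its markup has changed since this was written), here's a minimal offline sketch of the same XPath query against a tiny inline snippet. The class name `isthebox` and the nested `div/ul/li/a` layout are taken straight from the expression above; the film titles are made-up placeholders:

```python
from lxml import etree

# A tiny HTML fragment mimicking the structure the XPath above targets
# (placeholder titles, not real data from the site)
snippet = """
<div class="isthebox">
  <div><div><div><div>
    <ul>
      <li><a title="Film A">Film A</a></li>
      <li><a title="Film B">Film B</a></li>
    </ul>
  </div></div></div></div>
</div>
"""

html = etree.HTML(snippet)
# Same expression as in the crawler above
titles = html.xpath('//div[@class="isthebox"]/div/div/div/div/ul/li/a/@title')
print(titles)  # -> ['Film A', 'Film B']
```

The `/@title` step at the end is what makes the query return attribute values (a list of strings) instead of element nodes.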
2. Matching with regular expressions
This approach is far more flexible, but you need to know regular expressions. The idea is to download the entire page source, then use a regex to sift out the parts you want.
I'll just paste the code directly:
import urllib.request
import re

url = 'http://theater.mtime.com/China_Beijing/'
# Fetch the page source
webpage = urllib.request.urlopen(url)
text = webpage.read().decode('utf-8')
# Match the embedded JS array that holds the film data
rs = re.findall('Id".*"}];', text)
# Split the matched text into one chunk per film record
rs1 = str(rs).split('{')
set_list = []
for i in rs1:
    # Skip leading chunks that don't contain a title
    if 'Title":"' not in i:
        continue
    # Chunk looks like: "Id":12345,"Title":"...",...
    film_id = i[5:].split(',')[0]
    film_name = i.split('Title":"')[1].split('"')[0]
    # De-duplicate before collecting
    if [film_name, film_id] not in set_list:
        set_list.append([film_name, film_id])
for film_name, film_id in set_list:
    print(film_name, film_id)
The output looks like this:
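Because the live page's embedded data may have changed, here's a minimal sketch of the same technique run against a made-up fragment (the variable name `hotplaySvList` and the id/title values are invented for illustration). It also shows that one regex with two capture groups can replace most of the string slicing above:

```python
import re

# A made-up fragment mimicking the embedded JS array the crawler targets
text = 'var hotplaySvList = [{"Id":12345,"Title":"Film A"},{"Id":67890,"Title":"Film B"}];'

# Capture each (id, title) pair directly instead of splitting by hand
pairs = re.findall(r'"Id":(\d+).*?"Title":"([^"]+)"', text)
print(pairs)  # -> [('12345', 'Film A'), ('67890', 'Film B')]
```

The non-greedy `.*?` keeps each match inside one record, so ids don't get paired with the next film's title.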
That's it!