1. Building a small crawler with XPath Helper
- First, you need to install a Chrome extension (this only works in the Chrome browser) called XPath Helper.
  Netdisk: https://pan.baidu.com/s/1phXPKllX0-BA7IDxPGRhZA password: yuuv
- As for how to install it, being the nice person I am, I've already found a guide for you: https://www.cnblogs.com/baxianhua/p/8878268.html
- Open https://theater.mtime.com/China_Beijing, press F12 to open the page source, and open XPath Helper. Example screenshot below.
I'm sure someone as smart as you will get it.
Here's the Python code:
import urllib.request
from lxml import etree

url = 'http://theater.mtime.com/China_Beijing/'
# Fetch the page and decode it
webpage = urllib.request.urlopen(url)
txt = webpage.read().decode('utf-8')
# Parse the HTML and pull out every film title with XPath
html = etree.HTML(txt)
html_data = html.xpath('//div[@class="isthebox"]/div/div/div/div/ul/li/a/@title')
for i in html_data:
    print(i)
Screenshot of a run:
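If you can't reach the live site (or its markup has changed since this was written), here's a minimal offline sketch of the same XPath query against a tiny inline snippet. The class name `isthebox` and the nested `div/ul/li/a` layout are taken straight from the expression above; the film titles are made-up placeholders:

```python
from lxml import etree

# A tiny HTML fragment mimicking the structure the XPath above targets
# (placeholder titles, not real data from the site)
snippet = """
<div class="isthebox">
  <div><div><div><div>
    <ul>
      <li><a title="Film A">Film A</a></li>
      <li><a title="Film B">Film B</a></li>
    </ul>
  </div></div></div></div>
</div>
"""

html = etree.HTML(snippet)
# Same expression as in the crawler above
titles = html.xpath('//div[@class="isthebox"]/div/div/div/div/ul/li/a/@title')
print(titles)  # -> ['Film A', 'Film B']
```

The `/@title` step at the end is what makes the query return attribute values (a list of strings) instead of element nodes.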
2. Matching with regular expressions
This approach is far more flexible, but you need to know regular expressions. The idea is to download the entire page source, then use a regex to sift out the parts you want.
I'll just paste the code directly:
import urllib.request
import re

url = 'http://theater.mtime.com/China_Beijing/'
# Fetch the page source
webpage = urllib.request.urlopen(url)
text = webpage.read().decode('utf-8')
# Match the embedded JS array that holds the film data
rs = re.findall('Id".*"}];', text)
# Split the matched text into one chunk per film record
rs1 = str(rs).split('{')
set_list = []
for i in rs1:
    # Skip leading chunks that don't contain a title
    if 'Title":"' not in i:
        continue
    # Chunk looks like: "Id":12345,"Title":"...",...
    film_id = i[5:].split(',')[0]
    film_name = i.split('Title":"')[1].split('"')[0]
    # De-duplicate before collecting
    if [film_name, film_id] not in set_list:
        set_list.append([film_name, film_id])
for film_name, film_id in set_list:
    print(film_name, film_id)
The output looks like this:
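Because the live page's embedded data may have changed, here's a minimal sketch of the same technique run against a made-up fragment (the variable name `hotplaySvList` and the id/title values are invented for illustration). It also shows that one regex with two capture groups can replace most of the string slicing above:

```python
import re

# A made-up fragment mimicking the embedded JS array the crawler targets
text = 'var hotplaySvList = [{"Id":12345,"Title":"Film A"},{"Id":67890,"Title":"Film B"}];'

# Capture each (id, title) pair directly instead of splitting by hand
pairs = re.findall(r'"Id":(\d+).*?"Title":"([^"]+)"', text)
print(pairs)  # -> [('12345', 'Film A'), ('67890', 'Film B')]
```

The non-greedy `.*?` keeps each match inside one record, so ids don't get paired with the next film's title.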
That's it!