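I have recently been learning web scraping with Python, and I needed to extract the list of all image URLs from the content returned by a server.
First, I printed the response body and found that this time the image URLs were not in HTML tags but hidden inside JavaScript. The response looked like this (what I want to extract are all URLs under linkData of the form https://t7.baidu.com/it/u=1819248061,230866778&fm=193&f=GIF):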
<script type="text/javascript" src="//img0.bdstatic.com/static/albumsdetail/pkg/albumsdetail_3da7458.js" ></script>
<script type="text/javascript">!function(){ window.logid = "8180837745298986662";
require.async(['albumsdetail:widget/ui/app/app'], function (app) {
app.setPageInfo({
word: '%E6%B8%90%E5%8F%98%E9%A3%8E%E6%A0%BC%E6%8F%92%E7%94%BB',
hasResult: '1',
albumTab: '%E8%AE%BE%E8%AE%A1%E7%B4%A0%E6%9D%90',
setId: '409',
title: '渐变风格插画',
logo: 'https:\/\/emoji.cdn.bcebos.com\/yunque\/pc_vcg.png',
coverUrl: 'https:\/\/t7.baidu.com\/it\/u=1819248061,230866778&fm=193&f=GIF',
totalNum: '314',
albumLinkRn: '30',
linkData: '[{\x22pid\x22:144520,\x22width\x22:1200,\x22height\x22:562,\x22oriwidth\x22:1200,\x22oriheight\x22:562,\x22thumbnailUrl\x22:\x22https:\\\/\\\/t7.baidu.com\\\/it\\\/u
=1819248061,230866778&fm=193&f=GIF\x22,\x22fromUrl\x22:\x22https:\\\/\\\/www.vcg.com\\\/creative\\\/1274231988\x22,\x22contSign\x22:\x221819248061,230866778\x22},{\x22pid\x22:144521,\x22wi
dth\x22:562,\x22height\x22:1000,\x22oriwidth\x22:562,\x22oriheight\x22:1000,\x22thumbnailUrl\x22:\x22https:\\\/\\\/t7.baidu.com\\\/it\\\/u=4036010509,3445021118&fm=193&f=GIF\x22,\x22fromUr
l\x22:\x22https:\\\/\\\/www.vcg.com\\\/creative\\\/1147957933\x22,\x22contSign\x22:\x224036010509,3445021118\x22},{\x22pid\x22:144522,\x22width\x22:1200,\x22height\x22:675,\x22oriwidth\x22
:1200,\x22oriheight\x22:675,\x22thumbnailUrl\x22:\x22https:\\\/\\\/t7.baidu.com\\\/it\\\/u=963301259,1982396977&fm=193&f=GIF\x22,\x22fromUrl\x22:\x22https:\\\/\\\/www.veer.com\\\/vectorgra
ph\\\/368621010?utm_source=baidu&utm_medium=imagesearch&chid=902\x22,\x22contSign\x22:\x22963301259,1982396977\x22},{\x22pid\x22:144523,\x22width\x22:1200,\x22height\x22:798,\x22oriwidth\x
22:1200,\x22oriheight\x22:798,\x22thumbnailUrl\x22:\x22https:\\\/\\\/t7.baidu.com\\\/it\\\/u=1575628574,1150213623&fm=193&f=GIF\x22,\x22fromUrl\x22:\x22https:\\\/\\\/www.vcg.com\\\/creativ
e\\\/1241121526\x22,\x22contSign\x22:\x221575628574,1150213623\x22},{\x22pid\x22:144524,\x22width\x22:1200,\x22height\x22:708,\x22oriwidth\x22:1200,\x22oriheight\x22:708,\x22thumbnailUrl\x
22:\x22https:\\\/\\\/t7.baidu.com\\\/it\\\/u=737555197,308540855&fm=193&f=GIF\x22,\x22fromUrl\x22:\x22https:\\\/\\\/www.vcg.com\\\/creative\\\/1057378806\x22,\x22contSign\x22:\x22737555197
,308540855\x22},{\x22pid\x22:144525,\x22width\x22:1200,\x22height\x22:675,\x22oriwidth\x22:1200,\x22oriheight\x22:675,\x22thumbnailUrl\x22:\x22https:\\\/\\\/t7.baidu.com\\\/it\\\/u=9167306
0,7145840&fm=193&f=GIF\x22,\x22fromUrl\x22:\x22https:\\\/\\\/www.veer.com\\\/vectorgraph\\\/372794457?utm_source=baidu&utm_medium=imagesearch&chid=902\x22,\x22contSign\x22:\x2291673060,714
5840\x22},{\x22pid\x22:144526,\x22width\x22:845,\x22height\x22:1200,\x22oriwidth\x22:845,\x22oriheight\x22:1200,\x22thumbnailUrl\x22:\x22https:\\\/\\\/t7.baidu.com\\\/it\\\/u=2291349828,41
44427007&fm=193&f=GIF\x22,\x22fromUrl\x22:\x22https:\\\/\\\/www.vcg.com\\\/creative\\\/1289012916\x22,\x22contSign\x22:\x222291349828,4144427007\x22},{\x22pid\x22:144527,\x22width\x22:1200
,\x22height\x22:675,\x22oriwidth\x22:1200,\x22oriheight\x22:675,\x22thumbnailUrl\x22:\x22https:\\\/\\\/t7.baidu.com\\\/it\\\/u=1297102096,3476971300&fm=193&f=GIF\x22,\x22fromUrl\x22:\x22ht
tps:\\\/\\\/www.vcg.com\\\/creative\\\/1061297425\x22,\x22contSign\x22:\x221297102096,3476971300\x22},{\x22pid\x22:144528,\x22width\x22:1200,\x22height\x22:1002,\x22oriwidth\x22:1200,\x22o
riheight\x22:1002,\x22thumbnailUrl\x22:\x22https:\\\/\\\/t7.baidu.com\\\/it\\\/u=852388090,130270862&fm=193&f=GIF\x22,\x22fromUrl\x22:\x22https:\\\/\\\/www.vcg.com\\\/creative\\\/105938097
4\x22,\x22contSign\x22:\x22852388090,130270862\x22},{\x22pid\x22:144529,\x22width\x22:1000,\x22height\x22:613,\x22oriwidth\x22:1000,\x22oriheight\x22:613,\x22thumbnailUrl\x22:\x22https:\\\
/\\\/t7.baidu.com\\\/it\\\/u=4283365501,347124022&fm=193&f=GIF\x22,\x22fromUrl\x22:\x22https:\\\/\\\/www.vcg.com\\\/creative\\\/1284958551\x22,\x22contSign\x22:\x224283365501,347124022\x22
},{\x22pid\x22:144530,\x22width\x22:1200,\x22height\x22:900,\x22oriwidth\x22:1200,\x22oriheight\x22:900,\x22thumbnailUrl\x22:\x22https:\\\/\\\/t7.baidu.com\\\/it\\\/u=4240641596,3235181048
&fm=193&f=GIF\x22,\x22fromUrl\x22:\x22https:\\\/\\\/www.vcg.com\\\/creative\\\/1262443433\x22,\x22contSign\x22:\x224240641596,3235181048\x22},{\x22pid\x22:144531,\x22width\x22:1200,\x22hei
ght\x22:511,\x22oriwidth\x22:1200,\x22oriheight\x22:511,\x22thumbnailUrl\x22:\x22https:\\\/\\\/t7.baidu.com\\\/it\\\/u=3655946603,4193416998&fm=193&f=GIF\x22,\x22fromUrl\x22:\x22https:\\\/
\\\/www.vcg.com\\\/creative\\\/1280896397\x22,\x22contSign\x22:\x223655946603,4193416998\x22},{\x22pid\x22:144532,\x22width\x22:562,\x22height\x22:1000,\x22oriwidth\x22:562,\x22oriheight\x
22:1000,\x22thumbnailUrl\x22:\x22https:\\\/\\\/t7.baidu.com\\\/it\\\/u=376303577,3502948048&fm=193&f=GIF\x22,\x22fromUrl\x22:\x22https:\\\/\\\/www.vcg.com\\\/creative\\\/1147957934\x22,\x2
2contSign\x22:\x22376303577,3502948048\x22},{\x22pid\x22:144533,\x22width\x22:707,\x22height\x22:1000,\x22oriwidth\x22:707,\x22oriheight\x22:1000,\x22thumbnailUrl\x22:\x22https:\\\/\\\/t7.
baidu.com\\\/it\\\/u=3078321260,3840584311&fm=193&f=GIF\x22,\x22fromUrl\x22:\x22https:\\\/\\\/www.vcg.com\\\/creative\\\/1276888890\x22,\x22contSign\x22:\x223078321260,3840584311\x22},{\x2
2pid\x22:144534,\x22width\x22:707,\x22height\x22:1000,\x22oriwidth\x22:707,\x22oriheight\x22:1000,\x22thumbnailUrl\x22:\x22https:\\\/\\\/t7.baidu.com\\\/it\\\/u=1092574912,855301095&fm=193
&f=GIF\x22,\x22fromUrl\x22:\x22https:\\\/\\\/www.vcg.com\\\/creative\\\/1276889074\x22,\x22contSign\x22:\x221092574912,855301095\x22},{\x22pid\x22:144535,\x22width\x22:1200,\x22height\x22:
674,\x22oriwidth\x22:1200,\x22oriheight\x22:674,\x22thumbnailUrl\x22:\x22https:\\\/\\\/t7.baidu.com\\\/it\\\/u=3652245443,3894439772&fm=193&f=GIF\x22,\x22fromUrl\x22:\x22https:\\\/\\\/www.
vcg.com\\\/creative\\\/1288048741\x22,\x22contSign\x22:\x223652245443,3894439772\x22},{\x22pid\x22:144536,\x22width\x22:1200,\x22height\x22:562,\x22oriwidth\x22:1200,\x22oriheight\x22:562,
\x22thumbnailUrl\x22:\x22https:\\\/\\\/t7.baidu.com\\\/it\\\/u=12235476,3874255656&fm=193&f=GIF\x22,\x22fromUrl\x22:\x22https:\\\/\\\/www.vcg.com\\\/creative\\\/1287835587\x22,\x22contSign
\x22:\x2212235476,3874255656\x22},{\x22pid\x22:144537,\x22width\x22:707,\x22height\x22:1000,\x22oriwidth\x22:707,\x22oriheight\x22:1000,\x22thumbnailUrl\x22:\x22https:\\\/\\\/t7.baidu.com\
\\/it\\\/u=1469153347,135482855&fm=193&f=GIF\x22,\x22fromUrl\x22:\x22https:\\\/\\\/www.vcg.com\\\/creative\\\/1276888907\x22,\x22contSign\x22:\x221469153347,135482855\x22},{\x22pid\x22:144
538,\x22width\x22:1000,\x22height\x22:879,\x22oriwidth\x22:1000,\x22oriheight\x22:879,\x22thumbnailUrl\x22:\x22https:\\\/\\\/t7.baidu.com\\\/it\\\/u=1595511510,3648399748&fm=193&f=GIF\x22,
\x22fromUrl\x22:\x22https:\\\/\\\/www.vcg.com\\\/creative\\\/1291580596\x22,\x22contSign\x22:\x221595511510,3648399748\x22},{\x22pid\x22:144539,\x22width\x22:1000,\x22height\x22:507,\x22or
iwidth\x22:1000,\x22oriheight\x22:507,\x22thumbnailUrl\x22:\x22https:\\\/\\\/t7.baidu.com\\\/it\\\/u=1700588201,792130339&fm=193&f=GIF\x22,\x22fromUrl\x22:\x22https:\\\/\\\/www.vcg.com\\\/
creative\\\/1177269086\x22,\x22contSign\x22:\x221700588201,792130339\x22},{\x22pid\x22:144540,\x22width\x22:1200,\x22height\x22:533,\x22oriwidth\x22:1200,\x22oriheight\x22:533,\x22thumbnai
lUrl\x22:\x22https:\\\/\\\/t7.baidu.com\\\/it\\\/u=1032479594,2383177859&fm=193&f=GIF\x22,\x22fromUrl\x22:\x22https:\\\/\\\/www.vcg.com\\\/creative\\\/1292520120\x22,\x22contSign\x22:\x221
032479594,2383177859\x22},{\x22pid\x22:144541,\x22width\x22:1000,\x22height\x22:855,\x22oriwidth\x22:1000,\x22oriheight\x22:855,\x22thumbnailUrl\x22:\x22https:\\\/\\\/t7.baidu.com\\\/it\\\
/u=2489672413,3546290349&fm=193&f=GIF\x22,\x22fromUrl\x22:\x22https:\\\/\\\/www.vcg.com\\\/creative\\\/1147957935\x22,\x22contSign\x22:\x222489672413,3546290349\x22},{\x22pid\x22:144542,\x
22width\x22:675,\x22height\x22:1200,\x22oriwidth\x22:675,\x22oriheight\x22:1200,\x22thumbnailUrl\x22:\x22https:\\\/\\\/t7.baidu.com\\\/it\\\/u=3203007717,1062852813&fm=193&f=GIF\x22,\x22fr
omUrl\x22:\x22https:\\\/\\\/www.vcg.com\\\/creative\\\/1250996827\x22,\x22contSign\x22:\x223203007717,1062852813\x22},{\x22pid\x22:144543,\x22width\x22:1200,\x22height\x22:885,\x22oriwidth
\x22:1200,\x22oriheight\x22:885,\x22thumbnailUrl\x22:\x22https:\\\/\\\/t7.baidu.com\\\/it\\\/u=810585695,3039658333&fm=193&f=GIF\x22,\x22fromUrl\x22:\x22https:\\\/\\\/www.vcg.com\\\/creati
ve\\\/1244963673\x22,\x22contSign\x22:\x22810585695,3039658333\x22},{\x22pid\x22:144544,\x22width\x22:720,\x22height\x22:1000,\x22oriwidth\x22:720,\x22oriheight\x22:1000,\x22thumbnailUrl\x
22:\x22https:\\\/\\\/t7.baidu.com\\\/it\\\/u=1485780900,1171834527&fm=193&f=GIF\x22,\x22fromUrl\x22:\x22https:\\\/\\\/www.vcg.com\\\/creative\\\/1284958196\x22,\x22contSign\x22:\x221485780
900,1171834527\x22},{\x22pid\x22:144545,\x22width\x22:1100,\x22height\x22:1100,\x22oriwidth\x22:1200,\x22oriheight\x22:1200,\x22thumbnailUrl\x22:\x22https:\\\/\\\/t7.baidu.com\\\/it\\\/u=8
39828294,1619278046&fm=193&f=GIF\x22,\x22fromUrl\x22:\x22https:\\\/\\\/www.vcg.com\\\/creative\\\/1188913672\x22,\x22contSign\x22:\x22839828294,1619278046\x22},{\x22pid\x22:144546,\x22widt
h\x22:1200,\x22height\x22:620,\x22oriwidth\x22:1200,\x22oriheight\x22:620,\x22thumbnailUrl\x22:\x22https:\\\/\\\/t7.baidu.com\\\/it\\\/u=3195384123,421318755&fm=193&f=GIF\x22,\x22fromUrl\x
22:\x22https:\\\/\\\/www.vcg.com\\\/creative\\\/1251136970\x22,\x22contSign\x22:\x223195384123,421318755\x22},{\x22pid\x22:144547,\x22width\x22:1200,\x22height\x22:562,\x22oriwidth\x22:120
0,\x22oriheight\x22:562,\x22thumbnailUrl\x22:\x22https:\\\/\\\/t7.baidu.com\\\/it\\\/u=2671101745,1413589787&fm=193&f=GIF\x22,\x22fromUrl\x22:\x22https:\\\/\\\/www.vcg.com\\\/creative\\\/1
214912793\x22,\x22contSign\x22:\x222671101745,1413589787\x22},{\x22pid\x22:144548,\x22width\x22:849,\x22height\x22:1200,\x22oriwidth\x22:849,\x22oriheight\x22:1200,\x22thumbnailUrl\x22:\x2
2https:\\\/\\\/t7.baidu.com\\\/it\\\/u=1728637936,3151165212&fm=193&f=GIF\x22,\x22fromUrl\x22:\x22https:\\\/\\\/www.vcg.com\\\/creative\\\/1263187311\x22,\x22contSign\x22:\x221728637936,31
51165212\x22},{\x22pid\x22:144549,\x22width\x22:676,\x22height\x22:1200,\x22oriwidth\x22:676,\x22oriheight\x22:1200,\x22thumbnailUrl\x22:\x22https:\\\/\\\/t7.baidu.com\\\/it\\\/u=248753646
4,3153080617&fm=193&f=GIF\x22,\x22fromUrl\x22:\x22https:\\\/\\\/www.vcg.com\\\/creative\\\/1289747446\x22,\x22contSign\x22:\x222487536464,3153080617\x22}]',
dyTabStr: ''
});
app.init();
app.run();
});
}();
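If the information to extract were in HTML tags, we could parse and extract it with XPath from the etree module. Here, however, all the URLs are embedded in the JavaScript under linkData, so XPath cannot be used. I therefore chose regular expressions. My first idea was to match every URL in the JavaScript with a single regular expression, collect the matches in a list, and then clean up each URL in that list. That quickly proved wrong. Because the string is so large, I could not get one pattern to match each URL individually; the pattern only matched one very large string with all the URLs together, so the one-step approach did not work.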
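So I changed my approach. The revised plan is:
1. Use a regular expression to extract the string that contains all the URLs; call it allurls_string.
2. Split allurls_string into a list.
3. Filter the list.
4. Split and join the filtered elements, and strip the characters that are not needed.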
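The four steps above can be sketched as one small function. This is a minimal sketch, not the original script: SAMPLE is a shortened, hypothetical stand-in for response.text with only two records, and extract_thumbnail_urls is an illustrative name.

```python
import re

# Shortened, hypothetical stand-in for response.text (two records only).
SAMPLE = (
    r"linkData: '[{\x22pid\x22:144520,\x22thumbnailUrl\x22:\x22https:\\\/\\\/t7.baidu.com\\\/it\\\/u=1819248061,230866778&fm=193&f=GIF\x22,"
    r"\x22fromUrl\x22:\x22https:\\\/\\\/www.vcg.com\\\/creative\\\/1274231988\x22,\x22contSign\x22:\x221819248061,230866778\x22},"
    r"{\x22pid\x22:144521,\x22thumbnailUrl\x22:\x22https:\\\/\\\/t7.baidu.com\\\/it\\\/u=4036010509,3445021118&fm=193&f=GIF\x22,"
    r"\x22fromUrl\x22:\x22https:\\\/\\\/www.vcg.com\\\/creative\\\/1147957933\x22,\x22contSign\x22:\x224036010509,3445021118\x22}]',"
)


def extract_thumbnail_urls(page_text):
    # Step 1: one greedy match that spans every thumbnailUrl entry.
    blobs = re.findall(r'thumbnailUrl([\S]+t7\.baidu\.com[\S]+)\\x22', page_text)
    if not blobs:
        return []

    # Step 2: split on the two key names to get candidate pieces.
    pieces = []
    for chunk in blobs[0].split('x22fromUrl'):
        pieces.extend(chunk.split('thumbnailUrl'))

    # Step 3: keep only the pieces that mention the image host.
    pieces = [p for p in pieces if 't7.baidu.com' in p]

    # Step 4: drop the runs of three backslashes, then cut the URL
    # out from between the literal \x22 wrappers.
    urls = []
    for piece in pieces:
        cleaned = piece.replace('\\\\\\', '')
        urls.extend(re.findall(r':\\x22(.*)\\x22', cleaned))
    return urls


for url in extract_thumbnail_urls(SAMPLE):
    print(url)
```

The sketch flattens the intermediate lists as it goes, which is a design choice made here for brevity; the steps below build them up one list at a time.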
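Step 1: Use a regular expression to extract allurls_string, the string that contains all the URLs
import re

# response is the response object from the earlier request; response.text is the page source
pattern = r'thumbnailUrl([\S]+t7\.baidu\.com[\S]+)\\x22'
allurls_string = re.findall(pattern, response.text)
The pattern captures everything between the text thumbnailUrl and a literal \x22. Because [\S]+ is greedy, the capture runs from the first thumbnailUrl to the last \x22, so the result is one long string. re.findall returns a list, so that string is allurls_string[0].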
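To see why this pattern yields one large string rather than individual URLs, it helps to compare it with a non-greedy variant on a small input. The sketch below is an illustration only: SAMPLE is a shortened, hypothetical stand-in for response.text, and the non-greedy pattern is an alternative assumed here, not the method used in this post.

```python
import re

# Shortened, hypothetical stand-in for response.text (two records only).
SAMPLE = (
    r"linkData: '[{\x22pid\x22:144520,\x22thumbnailUrl\x22:\x22https:\\\/\\\/t7.baidu.com\\\/it\\\/u=1819248061,230866778&fm=193&f=GIF\x22,"
    r"\x22fromUrl\x22:\x22https:\\\/\\\/www.vcg.com\\\/creative\\\/1274231988\x22,\x22contSign\x22:\x221819248061,230866778\x22},"
    r"{\x22pid\x22:144521,\x22thumbnailUrl\x22:\x22https:\\\/\\\/t7.baidu.com\\\/it\\\/u=4036010509,3445021118&fm=193&f=GIF\x22,"
    r"\x22fromUrl\x22:\x22https:\\\/\\\/www.vcg.com\\\/creative\\\/1147957933\x22,\x22contSign\x22:\x224036010509,3445021118\x22}]',"
)

# Greedy: [\S]+ consumes as much as it can, so both records end up in one match.
greedy = re.findall(r'thumbnailUrl([\S]+t7\.baidu\.com[\S]+)\\x22', SAMPLE)
print(len(greedy))

# Non-greedy: \S+? stops at the first literal \x22 after each thumbnailUrl value.
lazy = re.findall(r'thumbnailUrl\\x22:\\x22(\S+?)\\x22', SAMPLE)
print(len(lazy))
for raw_url in lazy:
    # Each value still carries the escaped slashes and needs the later cleanup.
    print(raw_url)
```

With the non-greedy form each thumbnail value is matched separately, but the backslashes before every slash are still present, so a cleanup step is needed either way.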
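Step 2: Split allurls_string. The string is split on the keyword x22fromUrl into list_a, and each piece of list_a is split on thumbnailUrl into list_b. Each list_b is then appended to a new list, new_match.
new_match = []
# Split on 'x22fromUrl' first, then split each piece on 'thumbnailUrl'
list_a = allurls_string[0].split('x22fromUrl')
for a in list_a:
    list_b = a.split('thumbnailUrl')
    new_match.append(list_b)
Looking at the elements of the two-dimensional list new_match, for example \\x22:\\x22https:\\\\\\/\\\\\\/t7.baidu.com\\\\\\/it\\\\\\/u=1819248061,230866778&fm=193&f=GIF\\x22,\\ (shown as Python prints it, with every backslash doubled), the URL information is now fairly close to what we want. But besides the URLs on the t7.baidu.com domain there are others we do not need, such as those on www.vcg.com.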
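Step 3: Filter the new_match list to keep only the URLs we need. I use a regular expression to select the elements whose domain is t7.baidu.com.
senc_match = []
for new_m in new_match:
    for new1 in new_m:
        # Keep only the pieces that contain the t7.baidu.com domain
        if re.search(r't7\.baidu\.com', new1):
            senc_match.append(new1)
After filtering, senc_match is shorter and contains only the URLs we need.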
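Step 4a: Split each element of the filtered senc_match list on the redundant run of three backslashes (written '\\\\\\' in Python source) to remove it, then join the pieces back together.
aftersplit_match = []
process_urls = []
for senc_m in senc_match:
    # '\\\\\\' is a literal run of three backslashes
    split_match = senc_m.split('\\\\\\')
    aftersplit_match.append(split_match)
# print(aftersplit_match)
for after_m in aftersplit_match:
    process_url = "".join(after_m)
    process_urls.append(process_url)
Each element of process_urls now looks like \\x22:\\x22https://t7.baidu.com/it/u=1819248061,230866778&fm=193&f=GIF\\x22,\\ (again with backslashes doubled in the printed form). All that remains is to remove the leading \\x22:\\x22 and the trailing \\x22,\\. There are several ways to do this; I chose a regular expression that captures whatever lies between the two.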
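Because the two wrappers are fixed strings, one of those other ways is plain slicing with no regular expression. The sketch below assumes every element starts with \x22:\x22 and ends with \x22,\ exactly as in the example above; strip_wrappers, PREFIX, SUFFIX and sample_item are illustrative names, not part of the original script.

```python
# Fixed wrappers around each URL; '\\' is one literal backslash.
PREFIX = r'\x22:\x22'
SUFFIX = r'\x22,' + '\\'


def strip_wrappers(item, prefix=PREFIX, suffix=SUFFIX):
    """Remove the leading and trailing wrapper if present (hypothetical helper)."""
    if item.startswith(prefix):
        item = item[len(prefix):]
    if item.endswith(suffix):
        item = item[:-len(suffix)]
    return item


sample_item = PREFIX + 'https://t7.baidu.com/it/u=1819248061,230866778&fm=193&f=GIF' + SUFFIX
print(strip_wrappers(sample_item))
```

An element that lacks either wrapper is returned unchanged, which makes unexpected input easier to spot than a regular expression that silently returns no match.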
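Step 4b: Use a regular expression to extract the URL in the middle.
new_urls = []
# Matches a colon and a literal \x22, then captures up to the last literal \x22
pattern = r':\\x22(.*)\\x22'
for process_url in process_urls:
    # re.findall returns a list, so extend keeps new_urls a flat list of URLs
    new_urls.extend(re.findall(pattern, process_url))
The extracted URLs are exactly the ones I need.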
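For comparison, a different route is to undo the JavaScript string escapes and parse linkData as JSON. This is a swapped-in technique, not the method used in this post. The sketch assumes that linkData is a single-quoted string containing only \xNN, \\, \/ and similar one-character escapes; SAMPLE_PAGE, decode_js_string and thumbnail_urls are hypothetical names.

```python
import json
import re

# Shortened, hypothetical stand-in for response.text (two records only).
SAMPLE_PAGE = (
    r"linkData: '[{\x22pid\x22:144520,\x22thumbnailUrl\x22:\x22https:\\\/\\\/t7.baidu.com\\\/it\\\/u=1819248061,230866778&fm=193&f=GIF\x22,"
    r"\x22fromUrl\x22:\x22https:\\\/\\\/www.vcg.com\\\/creative\\\/1274231988\x22,\x22contSign\x22:\x221819248061,230866778\x22},"
    r"{\x22pid\x22:144521,\x22thumbnailUrl\x22:\x22https:\\\/\\\/t7.baidu.com\\\/it\\\/u=4036010509,3445021118&fm=193&f=GIF\x22,"
    r"\x22fromUrl\x22:\x22https:\\\/\\\/www.vcg.com\\\/creative\\\/1147957933\x22,\x22contSign\x22:\x224036010509,3445021118\x22}]',"
)


def decode_js_string(js_literal):
    # Undo hex escapes and one-character backslash escapes in a single pass.
    # Assumption: no other escape kinds (newline, unicode escapes) occur.
    def replace(match):
        escaped = match.group(1)
        if len(escaped) == 3:  # 'x' followed by two hex digits
            return chr(int(escaped[1:], 16))
        return escaped  # a backslash-escaped character stands for itself

    return re.sub(r'\\(x[0-9a-fA-F]{2}|.)', replace, js_literal)


def thumbnail_urls(page_text):
    # Grab the single-quoted value that follows the linkData key.
    match = re.search(r"linkData:\s*'(.*?)'", page_text, re.S)
    if match is None:
        return []
    records = json.loads(decode_js_string(match.group(1)))
    return [rec['thumbnailUrl'] for rec in records if 'thumbnailUrl' in rec]


for url in thumbnail_urls(SAMPLE_PAGE):
    print(url)
```

Once the data is parsed, the other fields of each record (pid, fromUrl, contSign) are available as ordinary dictionary keys, and the www.vcg.com URLs need no separate filtering step.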