url match

Parsing video links with regular expressions

Latest recommended article published on 2023-07-20 10:55:31


License: CC 4.0 BY-SA


videoUrlMatch = re.search(r"Real URLs:\s*(?P<videoUrls>.*){}".format(os.linesep), parseResult, re.DOTALL | re.IGNORECASE).group('videoUrls').split()  # search for the links; the output also contains a download-url: line, so the search must start after Real URLs:
# print(re.search(r"Real URLs:\s*(?P<videoUrls>.*){}".format(os.linesep), parseResult, re.DOTALL | re.IGNORECASE).group('videoUrls'),videoUrlMatch)

videoUrlMatch = re.findall(r'''Real URLs?:\s*[\[']*(?P<videoUrls>http.*?)['\]]*\s+''', parseResult, re.IGNORECASE)
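To see what this one-line pattern returns, here is a minimal sketch run against two made-up outputs. The sample strings and the `example.com` URLs are assumptions for illustration only, loosely modelled on the plain multi-line format and the stringified-list format; they are not real you-get or ykdl output.

```python
import re

# Hypothetical parser output (not real you-get/ykdl output).
plain_style = "title: demo\nReal URLs:\nhttp://example.com/a.mp4\nhttp://example.com/b.mp4\n"
list_style = "title: demo\nReal URL:\n['http://example.com/c.m3u8']\n"

pattern = r'''Real URLs?:\s*[\[']*(?P<videoUrls>http.*?)['\]]*\s+'''

# With a single group, findall returns the text captured by that group.
plain_urls = re.findall(pattern, plain_style, re.IGNORECASE)
list_urls = re.findall(pattern, list_style, re.IGNORECASE)

print(plain_urls)  # only the first URL after the header is captured
print(list_urls)   # the surrounding [' and '] are stripped
```

Because the header `Real URLs:` appears only once, the pattern can match only once, so the second URL in the plain multi-line sample is not returned.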



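videoUrlMatch = re.findall(r'''Real\ URLs?:  # you-get prints "Real URL" for iQiyi (爱奇艺) and "Real URLs" for Sohu (搜狐); ykdl prints "Real urls". The space is escaped because re.VERBOSE ignores unescaped whitespace
                                    \s*  # match newlines and spaces
                                    [\[']*  # match the leading [' of the stringified list that you-get prints for icourses (爱课程) and iQiyi; wrapping [' in a group followed by * makes findall return unrelated results, so a character class is used instead
                                    (?P<videoUrls>http.*?)  # match the URL; non-greedy, so it captures neither extra whitespace nor the trailing '] of the stringified list
                                    ['\]]*  # match the trailing '] of the stringified list that you-get prints for icourses and iQiyi
                                    \s+''',  # every URL ends with a newline, so this is required
                                    parseResult, re.IGNORECASE | re.VERBOSE)  # search for the links; the output also contains a download-url: line, so the search must start after Real URLs: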
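One thing to watch with `re.VERBOSE`: unescaped whitespace inside the pattern is ignored, so a literal space such as the one in `Real URLs` has to be written as `\ ` (or `[ ]`). The sketch below checks this on a made-up sample; the sample string and the simplified `http\S+` group are assumptions for illustration.

```python
import re

sample = "Real URLs:\nhttp://example.com/a.mp4\n"  # hypothetical output

# Unescaped space: in verbose mode this pattern is really "RealURLs?:".
unescaped = re.compile(r'''Real URLs?:  # the space is dropped here
                           \s*(http\S+)''', re.IGNORECASE | re.VERBOSE)

# Escaped space: the literal space is kept.
escaped = re.compile(r'''Real\ URLs?:  # the escaped space is kept
                         \s*(http\S+)''', re.IGNORECASE | re.VERBOSE)

no_match = unescaped.findall(sample)
match = escaped.findall(sample)
squashed = unescaped.findall("RealURLs:\nhttp://example.com/a.mp4\n")

print(no_match)  # nothing found, because the sample contains a space
print(match)     # the URL is found
print(squashed)  # the unescaped pattern only matches the header without a space
```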

>>> re.findall(r'''Real URLs?:\s*[\[']*(?P<videoUrls>http://\S*)['\]]*''', parseResult, re.IGNORECASE|re.DOTALL)
['http://39.130.134.23:80/199/2/78/letv-uts/14/ver_00_22-1053883476-avc-2999799-aac-128000-2820200-1106779568-60506e98dba03609f162c0a1319f8365-1467722446828_mp4/ver_00_22_0_0_1_518316_0_0.ts?mltag=100&platid=1&splatid=101&playid=0&geo=CN-25-353-4&tag=letv&ch=&p1=&p2=&p3=&tss=ios&b=3139&bf=52&nlh=4096&path=&sign=letv&proxy=3702879625,1728279117,3719677991&uuid=&ntm=1491493200&keyitem=GOw_33YJAAbXYE-cnQwpfLlv_b2zAkYctFVqe5bsXQpaGNn3T1-vhw..&its=0&nkey2=8df56d0d2d11ae422e5abf18d684822f&uid=1971868906.rp&qos=3&enckit=&m3v=1&token=&vid=&liveid=&station=&app_name=&app_ver=&fcheck=0&pantm=&panuid=&pantoken=&cips=117.136.84.234&vod_live_path=&ledituid=&leditcid=&leditcip=&leditfl=&leditafl=&ajax=&lsbv=']
>>>

With multi-line URLs, this can only match the first URL after Real URL, so I later switched to a match regex that drops the Real URL prefix.
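The later regex is not shown in this post. As a rough sketch of one way to collect every URL while still skipping the download-url: line, the header can be located first and the URL pattern applied only to the text after it. The helper name `extract_video_urls`, the sample strings, and the choice of excluded characters are all assumptions made for illustration, not the approach actually used.

```python
import re

def extract_video_urls(parse_result):
    """Hypothetical helper: return every URL listed after the 'Real URL(s):' header."""
    header = re.search(r"Real URLs?:", parse_result, re.IGNORECASE)
    if header is None:
        return []  # no header, so nothing to extract
    tail = parse_result[header.end():]
    # Stop at whitespace, quotes and brackets so stringified lists work too.
    # Commas are allowed because real video URLs may contain them.
    return re.findall(r"https?://[^\s'\"\[\]]+", tail)

multi_line = (
    "download-url: http://example.com/page\n"
    "Real URLs:\n"
    "http://example.com/a.mp4\n"
    "http://example.com/b.mp4\n"
)
as_list = "Real URL:\n['http://example.com/c.ts', 'http://example.com/d.ts']\n"

multi_line_urls = extract_video_urls(multi_line)
list_urls = extract_video_urls(as_list)
print(multi_line_urls)
print(list_urls)
```

Slicing after the header keeps the download-url: link out of the result, and returning an empty list avoids the `AttributeError` that calling `.group()` on a failed `re.search` would raise.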