Learning to Write Tool Scripts

Article link: https://blog.youkuaiyun.com/jjdhshdahhdd/article/details/8173561

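This post shows how to scrape data from a web page in Python using regular expressions, a loop, and a dictionary: request a URL, read the returned HTML, match specific patterns, and update a dictionary with the results. The key steps are reading the HTTP status code and extracting a link and the ID inside it.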


# Python 2 script: urllib.urlopen and the print statement are Python 2 only.
import urllib
import re



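def get_htmls(url):
    # Fetch the page and return its body as a string.
    response = urllib.urlopen(url)
    html_str = response.read()
    # The status code and headers are read for inspection only; neither is checked.
    http_status = response.code
    header_str = str(response.info())
    #print html_str
    return html_str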



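def get_match(pattern, text):
    # Return the first match of pattern in text, or None when nothing matches.
    # text may itself be None when an earlier step found nothing.
    if text is None:
        print "none"
        return None
    match = pattern.search(text)
    if match:
        print match.group()
        return match.group()
    else:
        print "none"
        return None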

def update_dict(list_oid, oid):
    # Tally each oid as an integer: 0 on first sight, +1 for every repeat.
    if oid is None:
        return
    print oid in list_oid
    if oid in list_oid:
        print "have"
        list_oid[oid] += 1
    else:
        list_oid[oid] = 0
    
    
    

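if __name__=="__main__":

    # Compile the patterns once, outside the loop. The dots are escaped so
    # that they match a literal "." only.
    link_pattern = re.compile(r'http://xxx\.com/stats_imp\.php.*?vendor_id=')
    oid_pattern = re.compile(r'oid=\d+')
    digits_pattern = re.compile(r'\d+')

    list_oid={}
    # Request the same URL 49 times and tally the oid found in each response.
    for i in range(1,50):
        url = 'http://api.sfefefefe'
        html_temp=get_htmls(url)
        link_temp=get_match(link_pattern,html_temp)
        oid_temp=get_match(oid_pattern,link_temp)
        oid_string=get_match(digits_pattern,oid_temp)

        update_dict(list_oid,oid_string)
        print list_oid.items()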

The script mainly relies on regular expressions, a loop, and a dictionary.
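The three matching steps in the script (the link, then `oid=...`, then the digits) and the dictionary update can be condensed with a single capture group and `collections.Counter`. Below is a minimal sketch in Python 3, whereas the script above is Python 2. The sample strings and the names `OID_PATTERN`, `extract_oid`, and `count_oids` are assumptions made up for illustration. The pattern is also a simplification: it assumes `oid` sits in the query string of the link and does not require `vendor_id`.

```python
import re
from collections import Counter

# Capture the digits after "oid=" inside a stats_imp.php link.
# The dots are escaped so they only match literal dots.
OID_PATTERN = re.compile(r'http://xxx\.com/stats_imp\.php.*?oid=(\d+)')


def extract_oid(html):
    """Return the first oid found in html as a string, or None."""
    match = OID_PATTERN.search(html)
    return match.group(1) if match else None


def count_oids(pages):
    """Count how often each oid appears across an iterable of HTML strings."""
    counts = Counter()
    for html in pages:
        oid = extract_oid(html)
        if oid is not None:  # skip pages without a matching link
            counts[oid] += 1
    return counts


# Made-up sample responses standing in for the real API output.
sample_pages = [
    '<a href="http://xxx.com/stats_imp.php?oid=12&vendor_id=3">ad</a>',
    '<a href="http://xxx.com/stats_imp.php?oid=7&vendor_id=3">ad</a>',
    '<a href="http://xxx.com/stats_imp.php?oid=12&vendor_id=9">ad</a>',
    '<p>no link here</p>',
]
result = count_oids(sample_pages)
print(dict(result))  # prints {'12': 2, '7': 1}
```

Unlike the original dictionary, which starts each oid at 0, `Counter` here counts every sighting, so an oid seen once has the value 1.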

http://www.cnblogs.com/morya/archive/2011/05/12/2044904.html notes on URLs, HTTP, and related topics

http://www.cnblogs.com/wxw0813/archive/2012/09/18/2690694.html timeout issues

http://blog.sina.com.cn/s/blog_a04184c101010ksg.html
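Since the links above touch on HTTP and timeouts, here is one way the fetch step might guard against a stalled request and an unexpected status code. This is a hedged sketch in Python 3 (the script itself is Python 2), not what the linked posts describe. The name `fetch_html`, the 5 second default, and the injectable `opener` parameter are assumptions added for illustration.

```python
import socket
import urllib.error
import urllib.request


def fetch_html(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return the response body as text, or None on failure.

    opener is injectable (hypothetical design choice) so the function can be
    exercised without touching the network.
    """
    try:
        response = opener(url, timeout=timeout)
        try:
            # urlopen raises HTTPError for most 4xx/5xx responses, so this
            # check mainly filters out other non-200 statuses.
            if response.status != 200:
                return None
            return response.read().decode("utf-8", errors="replace")
        finally:
            response.close()
    except (urllib.error.URLError, socket.timeout) as exc:
        # URLError covers connection failures and HTTPError (its subclass);
        # socket.timeout covers a connect or read that exceeds the timeout.
        print("request failed: %s" % exc)
        return None
```

In a Python 3 port of the script, a call such as `fetch_html(url, timeout=3)` could stand in for `get_htmls(url)`, with the loop skipping any iteration that returns `None`.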