Web Page Content Fetching and Link Counting - CSDN Blog

Article link: https://blog.youkuaiyun.com/syh_486_007/article/details/55683955

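This article presents a simple Python program that fetches the content of a given URL and saves it to a local file. It also provides a function that counts the links in a web page. The output file gets a random name, which reduces the risk of overwriting an existing file; with only 1000 possible names, a collision is still possible.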

#coding=utf8

import os
import urllib
import random

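# Fetch the content at url and write it to a file inside folder_path
def save_url_content(url,folder_path):

    # A URL can start with only one of the two schemes, so the checks are
    # joined with "or"; with "and" every URL would be reported as an error
    if not (url.startswith('http://') or url.startswith('https://')):
        return u'error'

    if not os.path.isdir(folder_path):
        return u'folder_path not a folder'

    d = urllib.urlopen(url)
    content = d.read()
    d.close()
    print content
    # A random number in the name makes a collision unlikely, not impossible
    random_name = 'test_%s.txt' % random.randint(1,1000)
    #filepath = '%s%s' %(folder_path,random_name)
    filepath = os.path.join(folder_path,random_name)
    # Binary mode writes the bytes exactly as they were received
    file_handle = open(filepath,'wb')
    file_handle.write(content)
    file_handle.close()
    return filepath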

#print save_url_content('aa','dfsf')
#print save_url_content('http://www.baidu.com','fdsfsd')
print save_url_content('http://www.baidu.com','F:\\')
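The random name in save_url_content is drawn from only 1000 values, so two runs can still pick the same file. As a minimal sketch of one way to rule that out, the standard library's tempfile.mkstemp can create the file under a name that is guaranteed not to exist yet. This version assumes Python 3, and the function name save_content and its prefix and suffix parameters are hypothetical, not part of the original program.

```python
import os
import tempfile

def save_content(content, folder_path, prefix='test_', suffix='.txt'):
    """Write bytes to a newly created, uniquely named file and return its path.

    Hypothetical helper: takes the already fetched bytes instead of a URL.
    """
    if not os.path.isdir(folder_path):
        raise ValueError('folder_path is not a folder: %r' % folder_path)
    # mkstemp creates the file itself and never reuses an existing name
    fd, filepath = tempfile.mkstemp(prefix=prefix, suffix=suffix, dir=folder_path)
    # Wrap the open file descriptor so it is closed when the block ends
    with os.fdopen(fd, 'wb') as file_handle:
        file_handle.write(content)
    return filepath
```

Separating the download from the write also lets the file handling be exercised without a network connection.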


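# Count the links in the page at url
def get_url_list(url):

    # Accept either scheme; return early instead of fetching an invalid URL
    if not (url.startswith('http://') or url.startswith('https://')):
        return u'error'

    d = urllib.urlopen(url)
    content = d.read()
    d.close()
    print content
    # Splitting on '<a href=' yields one more piece than there are matches
    return len(content.split('<a href=')) -1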

print get_url_list("http://www.baidu.com")
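Counting occurrences of the exact text '<a href=' misses anchors written as '<A HREF=' or with another attribute before href. A hedged alternative, assuming Python 3 and an already decoded HTML string, is to let the standard library's html.parser do the tokenizing. The names LinkCounter and count_links are hypothetical and only illustrate the idea.

```python
from html.parser import HTMLParser

class LinkCounter(HTMLParser):
    """Counts <a> start tags that carry an href attribute."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling this
        if tag == 'a' and any(name == 'href' for name, _ in attrs):
            self.count += 1

def count_links(html_text):
    """Return the number of <a ... href=...> tags in html_text."""
    parser = LinkCounter()
    parser.feed(html_text)
    parser.close()
    return parser.count
```

For example, count_links can be applied to the text returned by a fetch once it has been decoded to a string.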

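# Processing every file under a directory is most naturally done with recursion; without recursion, an explicit stack is needed to keep track of the directories still to be visited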
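The note about recursion versus a stack can be sketched as two equivalent traversals. This is an illustration under assumptions, not the author's solution: it assumes Python 3 and a directory tree without symbolic-link cycles, and the function names list_files_recursive and list_files_with_stack are hypothetical.

```python
import os

def list_files_recursive(folder_path):
    """Return the paths of all files under folder_path, using recursion."""
    files = []
    for name in os.listdir(folder_path):
        path = os.path.join(folder_path, name)
        if os.path.isdir(path):
            # Descend into the subdirectory; the call stack remembers where we were
            files.extend(list_files_recursive(path))
        else:
            files.append(path)
    return files

def list_files_with_stack(folder_path):
    """Return the same files, using an explicit stack instead of recursion."""
    files = []
    stack = [folder_path]  # directories that still have to be visited
    while stack:
        current = stack.pop()
        for name in os.listdir(current):
            path = os.path.join(current, name)
            if os.path.isdir(path):
                stack.append(path)
            else:
                files.append(path)
    return files
```

Both functions find the same set of files, though possibly in a different order. The standard library's os.walk offers a ready-made traversal as well.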

Python self-study: assignment 8