Random Hopping Between External Links

This post shows how to write a crawler that hops randomly from one external link to the next. When it reaches a page with no external links, it jumps to a random internal link instead and keeps collecting external links from there; the crawler code below is the heart of the solution.


Task: write a crawler that jumps randomly between external links.

If a page has no external links, jump to a random internal link on the same site, then continue collecting external links from there.

The code is as follows:


from urllib.request import urlopen
from urllib.parse import urlparse
from bs4 import BeautifulSoup
import re
import datetime
import random

pages = set()
random.seed(datetime.datetime.now().timestamp())

# Collect every internal link on the page: hrefs that start with "/" or that
# contain the site's own scheme://domain prefix.
def getInternalLinks(bsObj, includeurl):
    includeurl = urlparse(includeurl).scheme + "://" + urlparse(includeurl).netloc
    internalLinks = []

    for link in bsObj.find_all("a", href=re.compile("^(/|.*" + re.escape(includeurl) + ")")):
        if link.attrs["href"] is not None:
            if link.attrs["href"] not in internalLinks:
                if link.attrs["href"].startswith("/"):
                    # Relative link: prepend the scheme and domain.
                    internalLinks.append(includeurl + link.attrs["href"])
                else:
                    internalLinks.append(link.attrs["href"])
    return internalLinks

# Collect every external link on the page: hrefs that start with "http" or "www"
# and do not contain the current domain.
def getExternalLinks(bsObj, excludeurl):
    externalLinks = []

    for link in bsObj.find_all("a", href=re.compile("^(http|www)((?!" + re.escape(excludeurl) + ").)*$")):
        if link.attrs["href"] is not None:
            if link.attrs["href"] not in externalLinks:
                externalLinks.append(link.attrs["href"])
    return externalLinks

# Strip the protocol and split the address into its path segments.
def splitAddress(address):
    addressParts = address.replace("http://", "").split("/")
    return addressParts

# Return a random external link found on startingPage; if the page has none,
# follow a random internal link and try again from there.
def getRandomExternalLink(startingPage):
    html = urlopen(startingPage)
    bsObj = BeautifulSoup(html, "html.parser")
    externalLinks = getExternalLinks(bsObj, urlparse(startingPage).netloc)
    if len(externalLinks) == 0:
        print("No external links, looking around the site for one")
        domain = urlparse(startingPage).scheme + "://" + urlparse(startingPage).netloc
        internalLinks = getInternalLinks(bsObj, domain)
        if len(internalLinks) == 0:
            return None
        else:
            return getRandomExternalLink(internalLinks[random.randint(0, len(internalLinks) - 1)])
    else:
        return externalLinks[random.randint(0, len(externalLinks) - 1)]

# Hop from one random external link to the next until no link can be found.
def followExternalOnly(startingSite):
    externalLink = getRandomExternalLink(startingSite)
    if externalLink is None:
        print("Over")
        return None
    else:
        print("Random external link is: " + externalLink)
        followExternalOnly(externalLink)

followExternalOnly("http://www.baidu.com")
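One practical caveat: followExternalOnly recurses once per hop, so a long walk can hit Python's recursion limit, and any page that fails to load or parse raises an exception that kills the whole crawl. The sketch below is one possible way to harden it, assuming the functions above are already defined; followExternalOnlySafe and the maxHops cap of 100 are illustrative names and values, not part of the original code.

from urllib.error import HTTPError, URLError

def followExternalOnlySafe(startingSite, maxHops=100):
    # Iterative variant of followExternalOnly: loop instead of recursing,
    # stop after maxHops pages, and stop cleanly when a fetch fails.
    current = startingSite
    for _ in range(maxHops):
        try:
            externalLink = getRandomExternalLink(current)
        except (HTTPError, URLError, ValueError) as e:
            # ValueError covers hrefs like "www.example.com" that lack a scheme.
            print("Failed to fetch " + current + ": " + str(e))
            return
        if externalLink is None:
            print("Over")
            return
        print("Random external link is: " + externalLink)
        current = externalLink

# followExternalOnlySafe("http://www.baidu.com")

Looping instead of recursing also makes it easy to add a visited-page check or a delay between requests if the walk needs to be politer to the sites it touches.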







    

