Getting hyperlinks in PHP: how do I accurately extract all the hyperlinks from a website?

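When scraping hyperlinks from a web page with the PHP Snoopy class, the results included invalid URLs containing javascript:. To filter them out, iterate over the result array and check whether each string contains javascript: (in the results below it follows the base URL that was prepended, as in http://www.alibaba.com/javascript:;), and discard those that do. The QueryList library is also recommended; it supports jQuery selector syntax and may be more convenient for this kind of task.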

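I want to get all the hyperlinks on a website, and I am using the PHP Snoopy class:

require 'Snoopy.class.php'; // location of the Snoopy class file is an assumption

$snoopy = new Snoopy;
$sourceURL = $url;
$snoopy->fetchlinks($sourceURL);
$content = $snoopy->results;

The result looks like this: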

array (size=627)
  0 => string 'http://www.alibaba.com/https://login.alibaba.com/' (length=49)
  1 => string 'http://sh.vip.alibaba.com?tracelog=nav_ma' (length=41)
  2 => string 'http://message.alibaba.com/feedback/default.htm?routeto=inbox&tracelog=nav_ma_mc' (length=80)
  3 => string 'http://www.alibaba.com//hz-favorite.alibaba.com/favorite/favorite_home.htm?tracelog=nav_ma_fav' (length=94)
  4 => string 'http://rfq.alibaba.com/form.htm?tracelog=header_myalibaba' (length=57)
  5 => string 'http://hz.sourcing.alibaba.com/rfq/request/rfq_manage_list.htm?tracelog=nav_ma_mana_rfq' (length=87)
  6 => string 'http://biz.alibaba.com/generalorders/list_orders.htm?tracelog=ma_mana_orders' (length=76)
  7 => string 'http://sh.vip.alibaba.com/product/post_product_interface.htm?tracelog=newschp_nav_madp' (length=86)
  8 => string 'http://sh.vip.alibaba.com/product/manage_products.htm?tracelog=newschp_nav_mamng' (length=80)
  9 => string 'http://hz.sourcing.alibaba.com/rfq/quotation/rfq_not_quoted_manage_list.htm?nav_ma_rec_rfqs' (length=91)
  10 => string 'http://www.alibaba.com/javascript:;' (length=35)
  11 => string 'http://www.alibaba.com/Products?tracelog=beacon_cate_140704' (length=59)
  12 => string 'http://rfq.alibaba.com/form.htm?tracelog=header_forbuyers' (length=57)
  13 => string 'http://globalexpo.alibaba.com?tracelog=beacon_expo_150820' (length=57)
  14 => string 'http://wholesale.alibaba.com?tracelog=nav_ws' (length=44)
  15 => string 'http://buyer.alibaba.com/bizid_buyer?tracelog=nav_bi' (length=52)
  16 => string 'http://tradeassurance.alibaba.com/bao/buyer_advertise.htm?tracelog=from_home_menu' (length=81)
  17 => string 'http://activities.alibaba.com/alibaba/secure-payment.php?tracelog=beacon_payment_150114' (length=87)
  18 => string 'http://ecredit.alibaba.com/ecl/buyer.htm?tracelog=beacon_credit_140704' (length=70)
  19 => string 'http://inspection.alibaba.com/?tracelog=beacon_is_140704' (length=56)
  20 => string 'http://buyer.alibaba.com/intelligence?tracelog=beacon_ti_140704' (length=63)
  21 => string 'http://buyer.alibaba.com/forum?tracelog=beacon_df_140704' (length=56)
  22 => string 'http://ask.alibaba.com/?tracelog=beacon_ta_140704' (length=49)
  23 => string 'http://www.alibaba.com/javascript:;' (length=35)
  24 => string 'http://seller.alibaba.com/memberships/index.html?tracelog=seller_channel_member_hp_header' (length=89)
  25 => string 'http://seller.alibaba.com/learningcenter?tracelog=seller_channel_lc_hp_header' (length=77)
  26 => string 'http://seller.alibaba.com/training.htm?tracelog=seller_channel_training_hp_header' (length=81)
  27 => string 'http://sourcing.alibaba.com/?tracelog=newschp_nav_narfq' (length=55)
  28 => string 'http://www.alibaba.com/javascript:;' (length=35)

How can I remove URLs like “http://www.alibaba.com/javascript:;” from the results?

Replies:

QueryList

['img','src']])->data;

// Print the result
print_r($data);

// Collect all the hyperlinks on a page
$data = QueryList::Query('http://cms.querylist.cc/google/list_1.html',['link' => ['a','href']])->data;

// Print the result
print_r($data);

http://git.oschina.net/jae/QueryList

Take a look at this library; it is more capable than Snoopy and supports jQuery selector syntax.
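
For the filtering the question actually asks about, here is a minimal sketch in plain PHP. It assumes the links are already in an array, as with $snoopy->results above; the helper name removeJavascriptLinks is hypothetical and is not part of Snoopy or QueryList. Because the base URL has been prepended to each entry, the sketch looks for javascript: anywhere in the string rather than only at the start.

```php
<?php
// Hypothetical helper (not part of Snoopy or QueryList):
// drops every entry that carries a javascript: pseudo-link.
function removeJavascriptLinks(array $links): array
{
    $kept = array_filter($links, function ($link) {
        // stripos() is case-insensitive and returns false when the needle is absent.
        return stripos($link, 'javascript:') === false;
    });

    // array_filter() preserves keys, so re-index to get 0, 1, 2, ... again.
    return array_values($kept);
}

// Sample entries copied from the output shown in the question.
$links = [
    'http://sh.vip.alibaba.com?tracelog=nav_ma',
    'http://www.alibaba.com/javascript:;',
    'http://www.alibaba.com/Products?tracelog=beacon_cate_140704',
];

$clean = removeJavascriptLinks($links);
print_r($clean);
```

Note that this substring check would also drop a legitimate URL that happens to contain the text javascript: somewhere in its path or query string; if that matters, a stricter test on the part after the host may be a better fit.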

This article was originally published on php中文网; please credit the source when reposting. Thank you for your respect!
