Extracting page links and image links with LWP

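#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;
use LWP::Simple;

# Download the page to a local file first
my $url      = "http://www.51cto.com/";
my $filename = "51cto.html";
my $status_code = getstore($url, $filename);
die "can't download $url (HTTP status $status_code)" unless is_success($status_code);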


my $stream = HTML::TokeParser->new($filename)|| die "Couldn't read HTML file $filename: $!";
my @href_array;
my @img_array;

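# Collect every link (href) and image (src) URL in a single pass
while (my $token = $stream->get_token()) {
    next unless $token->[0] eq 'S';      # only start tags carry attributes
    my ($tag, $attr) = @{$token}[1, 2];
    # Skip tags that lack the attribute, e.g. <a name="top">
    push @href_array, $attr->{'href'} if $tag eq 'a'   && defined $attr->{'href'};
    push @img_array,  $attr->{'src'}  if $tag eq 'img' && defined $attr->{'src'};
}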
print "print inner herf\n============================\n";
foreach my $href_one(@href_array){
     print $href_one."\n";
}

print "print pic herf\n============================\n";
foreach my $img_one(@img_array){
    print $img_one."\n";
}

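# Alternative: collect image links with get_tag().
# The get_token() loop above has already consumed $stream, so each pass needs a fresh parser.
my $img_stream = HTML::TokeParser->new($filename) or die "Couldn't read HTML file $filename: $!";
while (my $img_tag = $img_stream->get_tag('img')) {
    print $img_tag->[1]{'src'}, "\n" if defined $img_tag->[1]{'src'};
}

# Alternative: collect links together with their anchor text
my $link_stream = HTML::TokeParser->new($filename) or die "Couldn't read HTML file $filename: $!";
while (my $href_tag = $link_stream->get_tag('a')) {
    next unless defined $href_tag->[1]{'href'};
    # e.g. home => default.html
    print $link_stream->get_text('/a'), " => ", $href_tag->[1]{'href'}, "\n";
}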
 

 

Output:

......

http://cn.made-in-china.com/
http://www.139shop.com/
http://www.51cto.com/about/aboutus_e.html
http://www.51cto.com/about/aboutus.html
http://www.51cto.com/about/zhaopin.html
http://www.51cto.com/about/contactus.html
http://www.51cto.com/about/history.html
http://www.51cto.com/about/links.html
http://www.51cto.com/php/guestbook
http://www.51cto.com/about/map.html
http://www.hd315.gov.cn/beian/view.asp?bianhao=010202006111400015
print pic herf
============================
http://home.51cto.com/public/themes/blue/images/top_bg_xian.gif
http://images.51cto.com/images/art/top_images/nav_ico1.gif
http://home.51cto.com/public/themes/blue/images/top_bg_xian.gif
http://images.51cto.com/images/art/top_images/nav_ico1.gif
http://home.51cto.com/public/themes/blue/images/top_bg_xian.gif
http://home.51cto.com/public/themes/blue/images/top_shequ.gif
http://images.51cto.com/images/art/top_images/nav_ico1.gif
http://images.51cto.com/images/index/Images/Logo.gif
http://www.51cto.com/book/s_waidian.gif
http://www.51cto.com/book/icon_yuanchuang.gif
http://www.51cto.com/book/icon_dujia.gif
http://images.51cto.com/images/tuijian_tit_bg.gif
http://images.51cto.com/images/index/Images/wrap_bg6_logo.gif
http://img1.51cto.com/images/expert/122234011545.jpg
http://img1.51cto.com/images/expert/130622222590.51
http://img1.51cto.com/images/expert/129983960396.jpg
......
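
The script depends on the shape of the tokens that get_token() returns: a start tag arrives as an array reference of the form ['S', $tag, \%attr, \@attrseq, $text]. The following is a minimal offline sketch of that layout, not part of the original script. The sample markup and the variable names are made up for illustration, and the parser reads from a string (passed as a scalar reference) instead of a downloaded file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample markup, used here instead of a downloaded page.
my $html = <<'HTML';
<p><a href="/about.html">About</a> <img src="logo.gif" alt="logo">
<a name="top">anchor without href</a></p>
HTML

# Passing a scalar reference makes the parser read from the string.
my $parser = HTML::TokeParser->new(\$html)
    or die "Couldn't parse sample HTML";

my (@links, @images);
while (my $token = $parser->get_token) {
    # Start-tag tokens look like ['S', $tag, \%attr, \@attrseq, $text].
    my ($type, $tag, $attr) = @$token;
    next unless $type eq 'S';
    push @links,  $attr->{href} if $tag eq 'a'   && defined $attr->{href};
    push @images, $attr->{src}  if $tag eq 'img' && defined $attr->{src};
}

print "link: $_\n"  for @links;
print "image: $_\n" for @images;
```

With this sample input the sketch prints one link (/about.html) and one image (logo.gif). The second anchor has no href attribute, so it is skipped rather than pushed as an undefined value.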
