BeautifulSoup Web Scraper

weixin_30410119

License: CC 4.0 BY-SA

Tags: python, web scraping

Original link: http://www.cnblogs.com/wxc1/p/6130079.html

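This article presents a simple example that uses Python's BeautifulSoup library to scrape images from a web page. It shows how to collect image resources from a given site and how to remove duplicate links with a set.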

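1. Install BeautifulSoup

Install pip, Python's package manager, then run

$ pip3 install beautifulsoup4

Import it in the Python interpreter to check that the installation succeeded

>>> from bs4 import BeautifulSoup

If no error is raised, the import succeeded.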
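The import check can also be scripted. The following is a minimal sketch, not part of the original post: `is_installed`, `smoke_test`, and `SAMPLE_HTML` are hypothetical names introduced here for illustration. It looks the package up with the standard library's `importlib.util.find_spec` and, only if `bs4` is present, parses a tiny inline HTML string.

```python
import importlib.util

# Tiny inline document so the check needs no network access (made up for illustration).
SAMPLE_HTML = '<html><body><img src="http://example.com/a.jpg"></body></html>'


def is_installed(module_name):
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def smoke_test():
    """Return the src of the first <img> in SAMPLE_HTML, or None if bs4 is missing."""
    if not is_installed("bs4"):
        return None
    from bs4 import BeautifulSoup  # imported lazily so the script runs either way

    soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
    return soup.find("img")["src"]


if __name__ == "__main__":
    if is_installed("bs4"):
        print("bs4 found, first image src:", smoke_test())
    else:
        print("bs4 not found; try: pip3 install beautifulsoup4")
```

Passing `"html.parser"` explicitly selects Python's built-in parser, so this check does not depend on lxml or html5lib being installed.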

Simple example: scraping images from http://sc.chinaz.com/biaoqing/baozou.html

The code is as follows

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import re

# Absolute http(s) URL containing a dot followed by a common image extension.
# The group makes the scheme and the dot apply to every extension.
IMG_PATTERN = re.compile(r"https?://.*\.(?:jpg|jpeg|png|tiff|raw|bmp|gif)", re.IGNORECASE)


def getTitle(url):
    """Return the src of every matching <img> on the page, or None if it cannot be fetched."""
    try:
        html = urlopen(url)
    except (HTTPError, URLError, ValueError):
        # ValueError covers strings that urlopen cannot treat as a URL
        return None
    # Naming the parser explicitly avoids the "no parser specified" warning
    bsObj = BeautifulSoup(html, "html.parser")
    return [img["src"] for img in bsObj.find_all("img", {"src": IMG_PATTERN})]


def getHread(is_urls):
    """Print the image URLs found on each linked page and return the set of links."""
    try:
        html = urlopen(is_urls)
    except (HTTPError, URLError, ValueError):
        return None
    bsObj = BeautifulSoup(html, "html.parser")
    # The set drops duplicate links; urljoin resolves relative hrefs against the page URL
    links = {urljoin(is_urls, a["href"]) for a in bsObj.find_all("a", href=True)}
    for link in links:
        print(getTitle(link))
    return links


is_urls = "http://sc.chinaz.com/biaoqing/baozou.html"
a = getHread(is_urls)
print(a)
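One detail worth checking in the image filter is how `|` binds inside a regular expression. The sketch below uses only the standard `re` module and compares an ungrouped alternation with a grouped one. The sample URLs, the shortened extension list, and the helper name `is_image_url` are assumptions made for illustration.

```python
import re

# Ungrouped: parsed as (http://.*jpg) | (png) | (gif),
# so the bare text "png" anywhere in the string is enough to match.
loose = re.compile(r"http://.*jpg|png|gif")

# Grouped: the scheme and the literal dot apply to every extension.
strict = re.compile(r"https?://.*\.(?:jpg|jpeg|png|gif)")


def is_image_url(url, pattern=strict):
    """Return True if pattern.search finds a match anywhere in url."""
    return pattern.search(url) is not None


if __name__ == "__main__":
    for candidate in ["http://example.com/a.jpg", "/static/png-guide.html"]:
        print(candidate, is_image_url(candidate, loose), is_image_url(candidate, strict))
```

BeautifulSoup applies a compiled pattern with `search`, so an ungrouped alternation lets through any `src` that merely contains one of the later alternatives.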
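There was no specific requirement behind this; it is just a simple example. The only extra processing is a set, used to remove duplicates so the same images are not returned more than once. It runs rather slowly, and I have not had time to optimize it. I will write it up properly when I find the time.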
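One possible direction for the speed problem, which is not the author's method, is to fetch the linked pages in parallel threads, since most of the time is spent waiting on the network. The sketch below uses the standard library's `concurrent.futures`. The names `unique_in_order`, `fetch_all`, and `fake_fetch` are hypothetical, and the `fetch` argument stands in for any one-argument function such as `getTitle`.

```python
from concurrent.futures import ThreadPoolExecutor


def unique_in_order(items):
    """Drop duplicates while keeping first-seen order (a plain set does not keep order)."""
    return list(dict.fromkeys(items))


def fetch_all(urls, fetch, max_workers=8):
    """Call fetch(url) for each unique URL using a pool of worker threads.

    Returns a dict mapping each URL to whatever fetch returned for it.
    """
    unique = unique_in_order(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as its input
        results = pool.map(fetch, unique)
        return dict(zip(unique, results))


if __name__ == "__main__":
    # Stand-in for a real page fetch so the sketch runs offline (hypothetical).
    def fake_fetch(url):
        return len(url)

    pages = ["http://a.example/1", "http://a.example/2", "http://a.example/1"]
    print(fetch_all(pages, fake_fetch))
```

With the scraper above, `fetch_all(links, getTitle)` would be the corresponding call. The worker count of 8 is an arbitrary starting point, and a polite crawler would also rate-limit its requests to the target site.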