Scraping Links Spread Across Several Tags of the Same Kind

Original · Latest recommended article published 2023-05-26 00:39:55 · 4.6k views

License: CC 4.0 BY-SA

Included in the column "python爬虫学习" (Python scraping study) · 4 articles

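There is not much material online about how to scrape data that is spread across several tags of the same kind, so I tried a few approaches and managed to get it working.

Without further ado, here is the example.

Example: http://www.17k.com/list/2743300.html

The links I want to scrape sit inside <dl class="Volume">, and the page contains three of these blocks.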

Method 1

from bs4 import BeautifulSoup
from urllib.request import urlopen


def search(url):
    """Collect the href of every <a> inside the <dl> tags of the chapter list."""
    content = []
    html = urlopen(url)
    bsObj = BeautifulSoup(html, "html.parser")
    # All the <dl> blocks live inside <div class="Main List">.
    html_1 = bsObj.find("div", {"class": "Main List"}).find_all("dl")
    for line in html_1:
        # find_all returns a list, so loop over each <dl>, then over its <a> tags.
        urls = line.find_all("a")
        for thing in urls:
            c = thing.attrs["href"]
            content.append(c)
    return content


url = 'http://www.17k.com/list/2743300.html'
lines = search(url)
print(lines)

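First, I noticed that all three blocks sit under one big tag, <div class="Main List">.

Then I look for the dl tags. findAll() (spelled find_all() in current BeautifulSoup versions) returns a list, so I loop over that list and, for each dl, get the list of 'a' tags it contains.

The reason for going down to the individual tags is that I want to use attrs. If tag is a tag object, then tag.attrs["xxx"] returns the value of attribute xxx, where xxx can be any attribute on that tag: href, id, and so on.

The href value is then assigned to c and appended to the result list.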
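To look at the nested loop from Method 1 in isolation, here is a minimal sketch that runs against an inline HTML string instead of the live page. The markup in SAMPLE_HTML and the helper name collect_hrefs are made up for illustration, and the real 17k.com page may be structured differently. The sketch uses tag.get("href") rather than tag.attrs["href"], on the assumption that some 'a' tags might lack an href; get() returns None in that case instead of raising KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three <dl class="Volume"> blocks inside one container.
SAMPLE_HTML = """
<div class="Main List">
  <dl class="Volume"><dd><a href="/chapter/1.html">One</a></dd></dl>
  <dl class="Volume"><dd><a href="/chapter/2.html">Two</a><a>no link</a></dd></dl>
  <dl class="Volume"><dd><a href="/chapter/3.html">Three</a></dd></dl>
</div>
"""


def collect_hrefs(html):
    """Return the href of every <a> found inside any <dl> of the container."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", {"class": "Main List"})
    hrefs = []
    for dl in container.find_all("dl"):      # one iteration per <dl> block
        for a in dl.find_all("a"):           # every <a> below that <dl>
            href = a.get("href")             # None when the attribute is missing
            if href is not None:
                hrefs.append(href)
    return hrefs


print(collect_hrefs(SAMPLE_HTML))
```

The outer loop is what makes the "several tags of the same kind" case work: each dl is searched separately and the results are merged into one list.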

Method 2

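from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests


url = "http://www.17k.com/list/2743300.html"
# Pass the raw bytes and let BeautifulSoup work out the encoding.
html = requests.get(url).content
bsObj = BeautifulSoup(html, "html.parser")
# Select every <a> that is a direct child of a <dd> directly under class "Volume".
for line in bsObj.select(".Volume > dd > a"):
    a = line.attrs["href"]
    # urljoin handles hrefs with or without a leading slash.
    urls = urljoin(url, a)
    print(urls)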

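This version is a bit simpler. You need to know how to use the select method, which is quite convenient; I will put a link explaining it at the very bottom.

It looks tags up by their class attribute using a CSS selector. Note: a class selector starts with a . (dot), and there should be a space on each side of >, because older BeautifulSoup versions raise an error without those spaces. The > steps into a direct child tag.

That is roughly how I understand it. The selector ends at the 'a' tag, and attrs then gives the URL.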
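The selector rules described above can be tried out offline with a small sketch. The markup in SAMPLE_HTML and the function name chapter_urls are hypothetical and only serve to show what ".Volume > dd > a" does and does not match. Building the absolute URL with urljoin from the standard library is my substitution for the string formatting used in Method 2, so that an href with a leading slash does not produce a double slash.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://www.17k.com/list/2743300.html"

# Hypothetical markup, not copied from the real page.
SAMPLE_HTML = """
<dl class="Volume">
  <dt><a href="/volume/1.html">Volume title</a></dt>
  <dd><a href="/chapter/1.html">One</a></dd>
  <dd><span><a href="/nested/ignored.html">Nested</a></span></dd>
</dl>
<dl class="Other"><dd><a href="/other/9.html">Other</a></dd></dl>
"""


def chapter_urls(html, base_url):
    """Return absolute URLs for <a> tags matched by '.Volume > dd > a'."""
    soup = BeautifulSoup(html, "html.parser")
    # '.Volume' matches by class; '>' only follows direct children, so the
    # <a> under <dt> and the <a> wrapped in <span> are both skipped.
    links = soup.select(".Volume > dd > a")
    return [urljoin(base_url, a["href"]) for a in links]


print(chapter_urls(SAMPLE_HTML, BASE_URL))
```

Replacing a > with a plain space (for example ".Volume dd a") would turn that step into a descendant match, which would also pick up the link wrapped in the span.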

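I have only just started with web scraping, so some of this may be wrong. If you spot a mistake, feel free to point it out.

While I am at it, I have a question of my own.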

from bs4 import BeautifulSoup
from urllib.request import urlopen


def search(url):
    content = []
    html = urlopen(url)
    bsObj = BeautifulSoup(html, "html.parser")
    html_1 = bsObj.find("div", {"class": "Main List"}).find_all("dl")
    print(html_1)
    # for line in html_1:
    #     urls = line.find_all("a")
    #     for thing in urls:
    #         c = thing.attrs["href"]
    #         content.append(c)
    # return content


url = 'http://www.17k.com/list/2743300.html'
lines = search(url)
print(lines)