Usage and purpose of parse_url

parse_url is somewhat similar to $_SERVER: both give you pieces of a URL.

Its purpose is to break a URL string into its component parts (scheme, host, path, and so on).

By comparison, $_SERVER carries more information about the current request, and in more detail.

First, parse_url:

<?php
$Url = 'http://www.yidawang.net/index.html';
$tempu = parse_url($Url);
print_r($tempu);

$message = $tempu['host'];
echo $message;
?>

The print_r() call prints: Array ( [scheme] => http [host] => www.yidawang.net [path] => /index.html )

The echo prints: www.yidawang.net (the host name, i.e. the domain)
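
parse_url() also picks out the query string and fragment when they are present, and parse_str() can then turn the query string into an array. A minimal sketch (the URL below is just an illustration, not from the article):

<?php
$url = 'http://www.yidawang.net/list.php?page=2&cat=php#top';   // hypothetical example URL
$parts = parse_url($url);
print_r($parts);
// Array ( [scheme] => http [host] => www.yidawang.net [path] => /list.php [query] => page=2&cat=php [fragment] => top )

parse_str($parts['query'], $query);   // turn "page=2&cat=php" into an associative array
print_r($query);
// Array ( [page] => 2 [cat] => php )
?>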

And for $_SERVER:

echo "<pre>";

print_r($_SERVER);

Array
(
    [PATH] => D:\phpStudy\php55n;C:\Users\59106\Desktop\node-v8.1.3-win-x64\node-v8.1.3-win-x64;E:\yu\vue\my-project;C:\Users\59106\AppData\Local\Microsoft\WindowsApps;
    [SYSTEMROOT] => C:\WINDOWS
    [COMSPEC] => C:\WINDOWS\system32\cmd.exe
    [PATHEXT] => .COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC
    [WINDIR] => C:\WINDOWS
    [PHP_FCGI_MAX_REQUESTS] => 1000
    [PHPRC] => D:/phpStudy/php55n
    [_FCGI_SHUTDOWN_EVENT_] => 2080
    [SCRIPT_NAME] => /test10.php
    [REQUEST_URI] => /test10.php
    [QUERY_STRING] => 
    [REQUEST_METHOD] => GET
    [SERVER_PROTOCOL] => HTTP/1.1
    [GATEWAY_INTERFACE] => CGI/1.1
    [REMOTE_PORT] => 54677
    [SCRIPT_FILENAME] => E:/wamp/www/test10.php
    [SERVER_ADMIN] => www.goods.com
    [CONTEXT_DOCUMENT_ROOT] => E:/wamp/www
    [CONTEXT_PREFIX] => 
    [REQUEST_SCHEME] => http
    [DOCUMENT_ROOT] => E:/wamp/www
    [REMOTE_ADDR] => 127.0.0.1
    [SERVER_PORT] => 80
    [SERVER_ADDR] => 127.0.0.1
    [SERVER_NAME] => www.goods.com
    [SERVER_SOFTWARE] => Apache/2.4.18 (Win32) OpenSSL/1.0.2e mod_fcgid/2.3.9
    [SERVER_SIGNATURE] => 
    [SystemRoot] => C:\WINDOWS
    [HTTP_UPGRADE_INSECURE_REQUESTS] => 1
    [HTTP_CONNECTION] => close
    [HTTP_DNT] => 1
    [HTTP_ACCEPT_ENCODING] => gzip, deflate
    [HTTP_ACCEPT_LANGUAGE] => zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3
    [HTTP_ACCEPT] => text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
    [HTTP_USER_AGENT] => Mozilla/5.0 (Windows NT 10.0; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0
    [HTTP_HOST] => www.goods.com
    [FCGI_ROLE] => RESPONDER
    [PHP_SELF] => /test10.php
    [REQUEST_TIME_FLOAT] => 1499823769.6059
    [REQUEST_TIME] => 1499823769
)
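
To see how the two relate, you can rebuild the URL of the current request from $_SERVER and then hand it to parse_url(). A minimal sketch, assuming a setup like the dump above where HTTP_HOST and REQUEST_URI are populated (this is not part of the original article's code):

<?php
// Decide the scheme from HTTPS (or use $_SERVER['REQUEST_SCHEME'] where available, as in the dump above)
$scheme = (!empty($_SERVER['HTTPS']) && $_SERVER['HTTPS'] !== 'off') ? 'https' : 'http';

// Rebuild the current request URL, then split it back into parts with parse_url()
$currentUrl = $scheme . '://' . $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI'];
print_r(parse_url($currentUrl));
// With the dump above this would print something like:
// Array ( [scheme] => http [host] => www.goods.com [path] => /test10.php )
?>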

