【Chapter 5】Some UserAgent Features

【UserAgent Attributes and Methods】

The first important attribute is the agent's wait time, better known as the timeout.

Example 5-1: Changing the UserAgent's default timeout. The default is 180s; here we change it to 10s.

#!/usr/bin/perl -w
use LWP::UserAgent;

my $browser = LWP::UserAgent->new();
my $oldval = $browser->timeout();   # read the current timeout (180s by default)
$browser->timeout(10);              # set it to 10 seconds
print "Changed timeout from $oldval to 10\n";

 

The output is:

Changed timeout from 180 to 10

Other attributes can be printed the same way. Of the defaults, only a few have values: timeout is 180s, agent is libwww-perl/5.837 (the version I happen to have), and parse_head is 1.
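As a quick check, here is a minimal sketch (assuming nothing beyond LWP::UserAgent itself; the agent string you see will match whatever libwww-perl version is installed) that reads these defaults:

#!/usr/bin/perl -w
use LWP::UserAgent;

my $browser = LWP::UserAgent->new();
# Each accessor returns the current value when called without arguments.
print "timeout:    ", $browser->timeout(),    "\n";
print "agent:      ", $browser->agent(),      "\n";
print "parse_head: ", $browser->parse_head(), "\n";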

Example 5-2: Printing the requests_redirectable attribute.

#!/usr/bin/perl -w
use LWP::UserAgent;

my $browser = LWP::UserAgent->new();
my $redirectable = $browser->requests_redirectable;   # returns an array reference
print "$redirectable->[0]\n$redirectable->[1]\n";

 

The output is:

GET
HEAD

The conn_cache attribute controls the connection cache, i.e. how many connections the agent keeps alive for reuse. By default it is undef. It is set up like this:
use LWP::ConnCache;
my $cache = $browser->conn_cache(LWP::ConnCache->new());
$browser->conn_cache->total_capacity(10);

If you want to keep all connections, leave the capacity unset, or set it to undef:
$browser->conn_cache->total_capacity(undef);
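To show what the cache buys you, here is a minimal sketch; www.example.com is only a placeholder target, and the point is simply that both requests can go through the same cached connection where the server allows keep-alive:

#!/usr/bin/perl -w
use LWP::UserAgent;
use LWP::ConnCache;

my $browser = LWP::UserAgent->new();
$browser->conn_cache(LWP::ConnCache->new());
$browser->conn_cache->total_capacity(10);

# Two requests to the same host; with the connection cache in place the
# second request can reuse the connection opened by the first one.
for my $url ('http://www.example.com/', 'http://www.example.com/index.html') {
    my $response = $browser->get($url);
    print "$url => ", $response->status_line, "\n";
}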

【Other UserAgent Attributes and Methods】
1. The agent and cookie_jar attributes let you set the User-Agent string and the cookies however you like. I'll save the details for later; roughly it looks like this:
use HTTP::Cookies;
$browser->agent("Mozilla/4.76 [en] (Windows NT 5.0; U)"); # set the User-Agent string
$browser->cookie_jar(HTTP::Cookies->new(
  'file' => $some_file, 'autosave' => 1
)); # keep cookies in the file $some_file

2. protocols_forbidden and protocols_allowed specify which protocols the agent may or may not use. Set them like this:
$browser->protocols_forbidden(["ftp"]);
# or
$browser->protocols_allowed(["ftp"]);
The is_protocol_supported() method checks whether a protocol is supported: it returns true if it is, false otherwise (see the combined sketch after this list).
3. The requests_redirectable() method controls which request methods the agent follows redirects for. The default list holds only GET and HEAD, but you can add to it:
push @{$browser->requests_redirectable}, 'POST';
# add POST to the list: this tells LWP to follow a redirect automatically after a POST request
$aref = $browser->requests_redirectable(\@methods);   # or replace the whole list at once
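Putting the three items above together, here is a minimal sketch; the cookie file name cookies.txt is arbitrary, and no request is actually sent:

#!/usr/bin/perl -w
use LWP::UserAgent;
use HTTP::Cookies;

my $browser = LWP::UserAgent->new();

# Item 1: set the User-Agent string and keep cookies in a file.
$browser->agent("Mozilla/4.76 [en] (Windows NT 5.0; U)");
$browser->cookie_jar(HTTP::Cookies->new('file' => 'cookies.txt', 'autosave' => 1));

# Item 2: allow only http and https, then check what is left.
$browser->protocols_allowed(['http', 'https']);
print "ftp supported?  ", ($browser->is_protocol_supported('ftp')  ? "yes" : "no"), "\n";
print "http supported? ", ($browser->is_protocol_supported('http') ? "yes" : "no"), "\n";

# Item 3: also follow redirects after POST requests.
push @{$browser->requests_redirectable}, 'POST';
print "redirectable: ", join(', ', @{$browser->requests_redirectable}), "\n";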
