Jsoup: org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=

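This post describes one way to resolve the HTTP 403 error that a Jsoup crawler can hit on certain websites: set the User-Agent so that the crawler's request looks like an ordinary browser visit.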


 

org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=http://xxxx.com/xxx/xxx/xxx.html

 

Setting a userAgent so that the request looks like it comes from a browser fixes this:

 
Document doc = Jsoup.connect("http://xxxx.com/xxx/xxx/xxx.html").userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31").get();
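For context, here is a slightly fuller sketch of the same fix. The class name `BrowserLikeFetch`, the helper names `browserLike` and `fetch`, and the 10-second timeout are assumptions made for illustration and are not part of the original post. The Jsoup calls themselves (`userAgent`, `timeout`, `get`, `HttpStatusException.getStatusCode`, `getUrl`) are existing API.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.HttpStatusException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper class, for illustration only.
class BrowserLikeFetch {

    // User-Agent string taken from the post; a current browser UA can be substituted.
    static final String BROWSER_UA =
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 "
            + "(KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31";

    // Builds the connection with a browser-like User-Agent. Nothing is sent yet.
    static Connection browserLike(String url) {
        return Jsoup.connect(url)
                .userAgent(BROWSER_UA)
                .timeout(10_000); // assumption: 10 s timeout, in milliseconds
    }

    // Executes the GET request and parses the response body into a Document.
    static Document fetch(String url) throws IOException {
        try {
            return browserLike(url).get();
        } catch (HttpStatusException e) {
            // Still rejected (e.g. 403): report the status and URL, then rethrow.
            System.err.println("HTTP " + e.getStatusCode() + " for " + e.getUrl());
            throw e;
        }
    }
}
```

Calling `BrowserLikeFetch.fetch("http://xxxx.com/xxx/xxx/xxx.html")` would then send the request with the browser-like header. If the site still answers 403, the block is probably based on something other than the User-Agent (cookies, referrer, or IP address, for example), and this change alone will not help.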

 
